Production AI App Patterns

Token Counting and Budget Management for LLM Apps

If you ship an app that calls GPT, Claude, or any other large language model, your bill is measured in tokens, not requests. That single fact catches a lot of teams off guard. A feature works fine in testing, then a few power users paste in 30-page documents and your monthly invoice triples overnight. This guide is for backend and full-stack engineers who need token counting that actually maps to cost, plus practical budget controls that stop runaway spend before it reaches your credit card.

By the end, you will know how to count tokens accurately for each provider, estimate the dollar cost of a request before you send it, and enforce per-request and per-user budgets in production code. We will also cover the common mistakes that quietly inflate token usage, because most cost overruns come from boring bugs, not exotic ones.

What Is Token Counting (and Why It Drives Cost)?

Token counting is the process of measuring how many tokens a piece of text consumes when sent to or returned from a language model. A token is a chunk of text, roughly four characters or three-quarters of a word in English. Providers bill per million tokens, and they charge separately for input (your prompt) and output (the model’s reply), so accurate counting is the foundation of every cost estimate.

Tokens are not words, and they are not characters. The word “tokenization” might be three tokens; a snippet of minified JavaScript with lots of punctuation can use far more tokens per character than prose. Code, non-English text, and structured data all tokenize differently from plain English. Because of this, you cannot eyeball a prompt’s cost. You have to count.

The reason this matters in production is simple: input and output tokens have different prices, and output is usually three to five times more expensive. A long retrieval-augmented prompt costs you on input; a chatty model that rambles costs you on output. Controlling cost means controlling both, and that starts with knowing the numbers.

How LLM Pricing Actually Works

Every major provider prices in dollars per million tokens, split into input and output rates. The table below shows current Anthropic Claude pricing as a concrete reference; other providers follow the same input-versus-output structure with different numbers.

Model               Input ($/1M tokens)   Output ($/1M tokens)
Claude Opus 4.8     $5.00                 $25.00
Claude Sonnet 4.6   $3.00                 $15.00
Claude Haiku 4.5    $1.00                 $5.00

Notice the output multiplier. On Opus 4.8, output tokens cost five times what input tokens cost. That asymmetry shapes your optimization priorities: trimming a verbose system prompt helps, but capping a model that loves to over-explain often helps more.

Two more line items affect the real total. Cached input tokens bill at a fraction of the normal input rate, which is why prompt caching is one of the biggest levers you have. Batch processing typically runs at half price for workloads that tolerate delay. We will link to deeper guides on both later, because they change the math significantly once your volume grows.
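To make those two line items concrete, here is a rough sketch of the arithmetic. The function name, the 0.10 cache-read multiplier, and the 0.50 batch multiplier are illustrative assumptions, not quoted prices, and the sketch ignores any surcharge a provider may apply when a prefix is first written to the cache. It also assumes the batch discount applies to the whole request, which you should confirm against your provider's price sheet.

```python
# Illustrative multipliers, NOT authoritative prices: check your provider's price sheet.
CACHE_READ_MULTIPLIER = 0.10  # assumption: cached input bills at 10% of the input rate
BATCH_MULTIPLIER = 0.50       # assumption: batch jobs bill at half price


def request_cost(
    input_rate_per_m: float,
    output_rate_per_m: float,
    fresh_input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
    batch: bool = False,
) -> float:
    """Dollar cost of one request, splitting input into fresh and cached tokens."""
    input_cost = fresh_input_tokens * input_rate_per_m / 1_000_000
    cached_cost = (
        cached_input_tokens * input_rate_per_m * CACHE_READ_MULTIPLIER / 1_000_000
    )
    output_cost = output_tokens * output_rate_per_m / 1_000_000
    total = input_cost + cached_cost + output_cost
    # Assumption: the batch discount applies to the whole request.
    return total * BATCH_MULTIPLIER if batch else total


# A 5,000-token shared prefix plus 400 fresh tokens, at the Opus rates in the table above.
uncached = request_cost(5.00, 25.00, 5_400, 0, 500)
cached = request_cost(5.00, 25.00, 400, 5_000, 500)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```

Under these assumed multipliers the cached request costs less than half of the uncached one, even though output tokens still dominate the total.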

Counting Tokens Before You Send a Request

The single most important habit is to count tokens before the request goes out, not after. Counting ahead of time lets you reject oversized inputs, choose a cheaper model for small jobs, and show users a cost estimate. How you count depends on the provider, and getting this wrong is a frequent source of bad estimates.

Counting Tokens for OpenAI Models

For OpenAI models, use the tiktoken library, which runs locally and gives you exact counts for a piece of text without a network call. Pick the encoding that matches your model, then encode the text and measure the result.

import tiktoken

def count_openai_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for an OpenAI model using the matching encoding.

    Runs locally with no API call, so it's safe to call on every request.
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to the current general-purpose encoding for unknown models
        encoding = tiktoken.get_encoding("o200k_base")

    return len(encoding.encode(text))

prompt = "Summarize the quarterly report in three bullet points."
print(count_openai_tokens(prompt))  # e.g. 11

This counts the raw text. For chat completions, the message structure (roles, formatting tokens) adds a small fixed overhead per message, so a full request is slightly larger than the sum of its content. For budgeting purposes the content count is close enough; if you need exact request sizes, add a few tokens per message.
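If you do want to account for that per-message overhead, one possible sketch follows. The two overhead constants are assumed placeholder values, not documented figures for any specific model, and estimate_chat_tokens is a hypothetical helper. The counter is passed in as a function so the same code works with any tokenizer.

```python
from typing import Callable

# Assumed overheads, in tokens. The real values depend on the model and API
# version, so treat these as placeholders to calibrate against reported usage.
TOKENS_PER_MESSAGE = 4
TOKENS_FOR_REPLY_PRIMING = 3


def estimate_chat_tokens(
    messages: list[dict[str, str]],
    count_tokens: Callable[[str], int],
) -> int:
    """Estimate request size: content tokens plus a fixed overhead per message."""
    total = TOKENS_FOR_REPLY_PRIMING
    for message in messages:
        total += TOKENS_PER_MESSAGE
        total += count_tokens(message["role"])
        total += count_tokens(message["content"])
    return total


# Stand-in counter for demonstration only; pass a real tokenizer-backed
# function (such as count_openai_tokens) in production.
def whitespace_count(text: str) -> int:
    return len(text.split())


messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize the quarterly report."},
]
print(estimate_chat_tokens(messages, whitespace_count))
```

Compare the estimate against the prompt token count the API reports for a few real requests, then adjust the constants until the two agree.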

Counting Tokens for Claude Models

Here is a mistake that costs people real money: using tiktoken to count tokens for Claude. The code runs without error, but the numbers are wrong. tiktoken is OpenAI’s tokenizer, and it can undercount Claude tokens by 15 to 20 percent on prose and far more on code. Instead, use Anthropic’s dedicated count_tokens endpoint, which uses the model’s actual tokenizer.

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def count_claude_tokens(text: str, model: str = "claude-opus-4-8") -> int:
    """Count tokens for a Claude model using the official count endpoint.

    This makes a lightweight API call, so cache results for repeated text.
    """
    response = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return response.input_tokens

prompt = "Summarize the quarterly report in three bullet points."
print(count_claude_tokens(prompt))

The trade-off is that count_tokens makes a network call, unlike the local tiktoken. The call is cheap and not billed as inference, but you should still cache results for repeated content, such as a static system prompt that prefixes every request. If you are just getting started with the provider, our guide to getting started with the Claude API walks through the client setup in more detail.
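One way to cache those counts is a small in-memory wrapper keyed by model and a hash of the text. This is a minimal sketch: TokenCountCache is a hypothetical class, the cache is unbounded, and the demonstration uses a fake counter so it runs without network access.

```python
import hashlib
from typing import Callable


class TokenCountCache:
    """Memoize token counts by (model, text) so repeated content is counted once."""

    def __init__(self, counter: Callable[[str, str], int]) -> None:
        self._counter = counter  # e.g. count_claude_tokens(text, model)
        self._counts: dict[str, int] = {}

    def count(self, text: str, model: str) -> int:
        # Hash the text so long prompts do not bloat the cache keys.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = f"{model}:{digest}"
        if key not in self._counts:
            self._counts[key] = self._counter(text, model)
        return self._counts[key]


# Demonstration with a fake counter that records how often it is called.
calls: list[str] = []


def fake_counter(text: str, model: str) -> int:
    calls.append(text)
    return len(text.split())


cache = TokenCountCache(fake_counter)
system_prompt = "You are a helpful assistant for quarterly reports."
cache.count(system_prompt, "claude-opus-4-8")
cache.count(system_prompt, "claude-opus-4-8")
print(len(calls))  # the underlying counter ran only once
```

Keep in mind that adding a cached system-prompt count to a separately counted user message is an approximation, because the message structure adds its own overhead. In a long-running service you would also want to bound the cache size.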

Estimating Cost Per Request

Once you can count input tokens, you can estimate cost. Output is the wrinkle: you do not know the exact reply length in advance, so you estimate using your max_tokens ceiling as the worst case, then track actual usage after the response returns.

# Prices in dollars per token (per-million rate divided by 1,000,000)
PRICING = {
    "claude-opus-4-8": {"input": 5.00 / 1_000_000, "output": 25.00 / 1_000_000},
    "claude-sonnet-4-6": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
    "claude-haiku-4-5": {"input": 1.00 / 1_000_000, "output": 5.00 / 1_000_000},
}

def estimate_cost(model: str, input_tokens: int, max_output_tokens: int) -> float:
    """Worst-case dollar cost for a request, assuming output hits max_tokens."""
    rates = PRICING[model]
    return input_tokens * rates["input"] + max_output_tokens * rates["output"]

cost = estimate_cost("claude-opus-4-8", input_tokens=2_400, max_output_tokens=1_000)
print(f"Worst-case cost: ${cost:.4f}")  # Worst-case cost: $0.0370

The estimate above is deliberately pessimistic because it assumes the model uses every output token you allow. Most replies are shorter. After each call, read the real usage from the response (usage.output_tokens on Anthropic, usage.completion_tokens on OpenAI) and log the actual cost. Over time, the gap between your estimate and reality tells you whether your max_tokens ceiling is set sensibly.
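To turn that gap into a number, you could summarize logged output lengths against the ceiling. The sketch below is one way to do it; output_utilization and the choice of a nearest-rank 95th percentile are assumptions for illustration, and the sample data is made up.

```python
import math


def output_utilization(
    actual_output_tokens: list[int], max_tokens: int
) -> dict[str, float]:
    """Summarize how much of the max_tokens ceiling real replies use."""
    if not actual_output_tokens:
        raise ValueError("need at least one observed reply")
    ordered = sorted(actual_output_tokens)
    # Nearest-rank 95th percentile: the smallest value covering 95% of replies.
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    mean = sum(ordered) / len(ordered)
    return {
        "mean_utilization": mean / max_tokens,
        "p95_tokens": float(ordered[p95_index]),
        # Replies that reached the ceiling were probably cut off mid-answer.
        "hit_ceiling_rate": sum(t >= max_tokens for t in ordered) / len(ordered),
    }


# Made-up sample of logged output token counts.
observed = [120, 180, 210, 240, 260, 300, 310, 350, 420, 1000]
stats = output_utilization(observed, max_tokens=1_000)
print(stats)
```

A low mean utilization suggests the ceiling could come down, while a high hit-ceiling rate suggests replies are being truncated and the ceiling is too tight.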

This is also why choosing the right model matters so much. A classification task that runs fine on Haiku costs one-fifth as much per token as the same job on Opus. Route simple work to cheaper models and reserve the expensive ones for tasks that genuinely need them.
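A routing rule can be as simple as a lookup table. The sketch below is hypothetical: the task types, the tier names, and the assignment of tasks to tiers are all assumptions that you would replace with results from your own quality evaluations.

```python
# Hypothetical tiers: which task types count as "simple" is an assumption
# to validate against your own quality evaluations.
MODEL_BY_TIER = {
    "simple": "claude-haiku-4-5",
    "standard": "claude-sonnet-4-6",
    "complex": "claude-opus-4-8",
}

TASK_TIERS = {
    "classification": "simple",
    "extraction": "simple",
    "summarization": "standard",
    "multi_step_reasoning": "complex",
}


def pick_model(task_type: str) -> str:
    """Return the cheapest model assumed adequate for this task type."""
    # Unknown task types fall back to the middle tier rather than the most
    # expensive one; choose whichever default is safer for your product.
    tier = TASK_TIERS.get(task_type, "standard")
    return MODEL_BY_TIER[tier]


print(pick_model("classification"))
```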

Setting Per-Request and Per-User Budgets

Estimation tells you what something will cost. Budgets stop you from spending more than you intended. The most effective pattern is a small budget manager that checks an estimated cost against a limit before the request goes out, and rejects anything over the line.

from dataclasses import dataclass, field

class BudgetExceededError(Exception):
    """Raised when a request would push a user over their spend limit."""

@dataclass
class UserBudget:
    limit_usd: float
    spent_usd: float = 0.0

    def remaining(self) -> float:
        return max(0.0, self.limit_usd - self.spent_usd)

    def authorize(self, estimated_cost: float) -> None:
        """Reject the request if it would exceed the remaining budget."""
        if estimated_cost > self.remaining():
            raise BudgetExceededError(
                f"Request needs ${estimated_cost:.4f} but only "
                f"${self.remaining():.4f} remains."
            )

    def record(self, actual_cost: float) -> None:
        """Commit the real cost after the response returns."""
        self.spent_usd += actual_cost

# Per-user budgets keyed by user ID, loaded from your datastore.
# A plain dict is correct here: dataclasses.field() only works inside a dataclass body.
budgets: dict[str, UserBudget] = {}

The key idea is two-phase accounting. Before the call, you authorize against the worst-case estimate, which prevents a single huge request from blowing the budget. After the call, you record the actual cost, which keeps the running total honest. In a real system you would persist spent_usd to a database or Redis rather than holding it in memory, and you would reset it on a billing cycle.
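One gap in simple two-phase accounting is concurrency: two requests from the same user can both pass authorization before either one records its cost. A possible fix is to hold the worst-case estimate as a reservation while the request is in flight. The sketch below is a hypothetical variant, ReservingBudget, that does this with a lock inside a single process. Across several processes you would need an atomic operation in your database or Redis instead.

```python
import threading


class ReservingBudget:
    """Budget that holds the worst-case estimate while a request is in flight."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.reserved_usd = 0.0
        self._lock = threading.Lock()

    def reserve(self, estimated_cost: float) -> bool:
        """Atomically set aside the estimate; return False if it does not fit."""
        with self._lock:
            available = self.limit_usd - self.spent_usd - self.reserved_usd
            if estimated_cost > available:
                return False
            self.reserved_usd += estimated_cost
            return True

    def settle(self, estimated_cost: float, actual_cost: float) -> None:
        """Release the reservation and commit the real cost."""
        with self._lock:
            self.reserved_usd -= estimated_cost
            self.spent_usd += actual_cost

    def cancel(self, estimated_cost: float) -> None:
        """Release a reservation when the request failed before any spend."""
        with self._lock:
            self.reserved_usd -= estimated_cost


budget = ReservingBudget(limit_usd=0.05)
first = budget.reserve(0.037)   # fits
second = budget.reserve(0.037)  # rejected: the first reservation is still held
budget.settle(0.037, 0.015)     # first request finished cheaper than estimated
third = budget.reserve(0.03)    # fits now that the reservation was released
print(first, second, third)
```

The trade-off is that pessimistic reservations can reject requests that would have fit once the earlier ones settled, so a tight max_tokens matters even more with this design.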

Putting it together, a guarded request looks like this:

def guarded_completion(user_id: str, prompt: str, model: str = "claude-opus-4-8"):
    budget = budgets[user_id]
    input_tokens = count_claude_tokens(prompt, model)
    max_output = 1_000

    estimated = estimate_cost(model, input_tokens, max_output)
    budget.authorize(estimated)  # raises before any spend if over limit

    response = client.messages.create(
        model=model,
        max_tokens=max_output,
        messages=[{"role": "user", "content": prompt}],
    )

    # Same rate math as the estimate, but fed the real token counts
    actual = estimate_cost(model, response.usage.input_tokens,
                           response.usage.output_tokens)
    budget.record(actual)
    return response

Because authorize runs before client.messages.create, an over-budget user never triggers a paid API call. That ordering is the whole point: you fail fast and free, rather than discovering the overage on your monthly statement.

Tracking Spend Across an App

Per-user budgets handle individual limits, but you also want a bird’s-eye view of total spend, broken down by model, endpoint, and user. The simplest approach is to log a structured record after every call and aggregate later.

import logging

logger = logging.getLogger("llm.usage")

def log_usage(user_id: str, model: str, input_tokens: int,
              output_tokens: int, cost_usd: float) -> None:
    logger.info(
        "llm_call",
        extra={
            "user_id": user_id,
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": round(cost_usd, 6),
        },
    )

Pipe those logs into whatever you already use for observability, then build dashboards on top. Once your volume grows, a dedicated LLM gateway becomes worth the setup cost. Tools like LiteLLM and Portkey sit between your app and the provider, tracking tokens and enforcing spend limits centrally so you do not reimplement budgeting in every service. For a high-volume product spread across several backends, that centralization saves real engineering time.
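Before a dashboard or gateway is in place, a few lines of Python are enough to aggregate usage records. This sketch assumes each record is a dict with the same fields logged above; aggregate_spend is a hypothetical helper and the sample records are made up.

```python
from collections import defaultdict


def aggregate_spend(records: list[dict], key: str) -> dict[str, float]:
    """Sum cost_usd across usage records, grouped by the given field."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    # Sort by spend, highest first, so the biggest contributors lead.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Made-up usage records in the shape produced by log_usage.
records = [
    {"user_id": "u1", "model": "claude-opus-4-8", "cost_usd": 0.037},
    {"user_id": "u2", "model": "claude-haiku-4-5", "cost_usd": 0.002},
    {"user_id": "u1", "model": "claude-haiku-4-5", "cost_usd": 0.001},
]
print(aggregate_spend(records, "user_id"))
print(aggregate_spend(records, "model"))
```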

One detail catches teams that stream responses: when you stream tokens to the client, the usage figures arrive inside the stream itself, typically in its closing events, not in a separate response object. Some APIs include them only when asked; OpenAI’s chat completions, for example, need stream_options={"include_usage": True}. Make sure your accounting reads usage from the right place. Our guide on streaming LLM responses with SSE versus WebSockets covers how the final usage payload arrives in each transport.
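As a rough illustration, the sketch below collects usage from a sequence of already-parsed stream events. The event shapes are modeled on Anthropic's streaming format (an opening message_start event and a later message_delta event that carries a running output total), but treat them as assumptions and verify against your provider's documentation. usage_from_stream is a hypothetical helper and the events are hand-written.

```python
from typing import Iterable


def usage_from_stream(events: Iterable[dict]) -> dict[str, int]:
    """Collect token usage from a stream of already-parsed event dicts."""
    usage = {"input_tokens": 0, "output_tokens": 0}
    for event in events:
        if event.get("type") == "message_start":
            # Assumed shape: the opening event carries the input token count.
            start_usage = event.get("message", {}).get("usage", {})
            usage["input_tokens"] = start_usage.get("input_tokens", 0)
            usage["output_tokens"] = start_usage.get("output_tokens", 0)
        elif event.get("type") == "message_delta":
            # Assumed shape: output_tokens here is a running total, so
            # overwrite rather than add.
            delta_usage = event.get("usage", {})
            usage["output_tokens"] = delta_usage.get(
                "output_tokens", usage["output_tokens"]
            )
    return usage


# Hand-written events for demonstration; real events come from your SSE parser.
events = [
    {"type": "message_start",
     "message": {"usage": {"input_tokens": 25, "output_tokens": 1}}},
    {"type": "content_block_delta",
     "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "message_delta",
     "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 12}},
    {"type": "message_stop"},
]
print(usage_from_stream(events))
```

If the client disconnects mid-stream, the closing events may never arrive, so decide up front whether to record the worst-case estimate or a partial count in that case.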

Reducing Token Usage Without Hurting Quality

Counting and budgeting tell you where the money goes. The next step is spending less of it. Three techniques deliver the most savings for the least effort.

First, prompt caching lets you reuse a large static prefix, such as a long system prompt or a document, at a fraction of the input price. If every request shares the same 5,000-token instruction block, caching it can cut your input cost by up to 90 percent on repeat calls. Our deep dive on Anthropic prompt caching walks through placement and the gotchas that silently break the cache.
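For orientation, here is a minimal sketch of what a cacheable request can look like with Anthropic's Messages API, where a cache_control marker on a system block flags the end of the reusable prefix. build_cached_request is a hypothetical helper that only builds the keyword arguments, so it runs without an API key. Minimum cacheable length and cache lifetime are provider-specific details to check in the documentation.

```python
def build_cached_request(
    system_prompt: str,
    user_text: str,
    model: str = "claude-opus-4-8",
    max_tokens: int = 1_000,
) -> dict:
    """Build keyword arguments for a request whose system prompt is cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the end of the prefix the provider may cache.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Content that changes per request goes after the cached prefix.
        "messages": [{"role": "user", "content": user_text}],
    }


kwargs = build_cached_request("Long, static instructions...", "Summarize section 2.")
# With a configured client: response = client.messages.create(**kwargs)
print(kwargs["system"][0]["cache_control"])
```

The design point is ordering: anything that varies between requests belongs after the marked prefix, because a change inside the prefix means the next call cannot reuse the cache.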

Second, batching runs non-urgent work at half price. If you process documents overnight or generate embeddings in bulk, the OpenAI Batch API and its equivalents trade latency for a 50 percent discount. For workloads where a few hours of delay is acceptable, that is free money.

Third, trim what you actually send. Truncate conversation history to the last several turns, summarize old context instead of replaying it verbatim, and set a tight max_tokens so a verbose model cannot ramble on your dime. These changes are small individually but compound across millions of requests.
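Truncating history to a token budget can be sketched in a few lines. trim_history is a hypothetical helper that keeps the newest messages that fit, and it takes the token counter as a parameter so you can plug in the right one for your provider. The whitespace counter below is a stand-in for demonstration only.

```python
from typing import Callable


def trim_history(
    messages: list[dict[str, str]],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> list[dict[str, str]]:
    """Keep the most recent messages whose combined content fits max_tokens."""
    kept: list[dict[str, str]] = []
    used = 0
    # Walk backwards so the newest turns are kept first.
    for message in reversed(messages):
        cost = count_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def whitespace_count(text: str) -> int:
    # Stand-in counter for demonstration; use a real tokenizer in production.
    return len(text.split())


history = [
    {"role": "user", "content": "What were Q1 revenues?"},
    {"role": "assistant", "content": "Q1 revenue was 4.2 million dollars."},
    {"role": "user", "content": "And Q2?"},
]
print(trim_history(history, max_tokens=9, count_tokens=whitespace_count))
```

Note that the trimmed list can begin with an assistant turn, as it does here. If your provider expects the conversation to open with a user message, drop leading assistant turns or prepend a short summary of the removed context.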

When to Use Token Counting and Budgets

  • You expose an LLM feature to end users who can submit arbitrary-length input
  • Your costs scale with usage and you need predictable per-user or per-tenant limits
  • You route requests across multiple models and want to pick the cheapest one that fits
  • You need to show users a cost estimate or enforce a plan-based quota
  • You run high-volume batch jobs where small per-request savings add up fast

When NOT to Use Heavy Budget Tooling

  • You have a fixed internal tool with a handful of trusted users and predictable input
  • Your total monthly spend is small enough that engineering time costs more than overruns
  • A provider-side spend limit or hard max_tokens cap already covers your risk
  • You are prototyping and optimizing for iteration speed over cost control

Common Mistakes with Token Counting

  • Using tiktoken to count tokens for Claude or other non-OpenAI models, which produces counts that are off by 15 percent or more
  • Estimating cost from input tokens only and ignoring the more expensive output tokens
  • Forgetting that streamed responses report usage in the final event, leaving your accounting blind
  • Setting max_tokens far higher than needed, which inflates your worst-case budget reservation and lets verbose replies run up cost
  • Counting tokens after the request instead of before, so oversized inputs reach the API and bill you anyway
  • Holding spend totals only in memory, so a restart resets every user’s recorded spend to zero and silently restores their full budget

Conclusion

Token counting turns an opaque LLM bill into numbers you can predict and control. Count input tokens before every request with the right tool for each provider, estimate worst-case cost using your max_tokens ceiling, and gate spending with a budget manager that authorizes before the call and records after it. Layer on prompt caching and batching once your volume justifies the effort, and pipe usage logs into a dashboard so surprises show up early instead of on the invoice.

Start with the cheapest high-impact change: add a token count and a hard max_tokens to your busiest endpoint today, then measure the actual versus estimated cost for a week. From there, explore prompt caching to cut input cost on repeated context, and consider an LLM gateway like LiteLLM once token counting and budget management span more than one service.
