Skip to content
C Codeloom
LLMs

LLM Token Counting and Cost Control

Learn how tokens are counted, how to estimate API spend before you send a request, and concrete strategies to cut LLM bills without hurting quality.

·4 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • What a token actually is
  • How to count tokens before sending a request
  • How input, output, and cached tokens are priced
  • Where the biggest cost wins usually hide
  • How to budget and monitor LLM spend in production

Prerequisites

  • Familiar with how APIs work

What and Why

Every hosted LLM bills you by the token. A token is roughly four characters of English, but it depends entirely on the model’s tokenizer. If you do not know how many tokens your prompts and responses consume, you cannot predict your bill, you cannot reason about latency, and you cannot tell whether a “small” prompt is silently blowing past the context window.

Mental Model

Tokens are the atomic unit of both context and cost. There are three buckets you pay for:

  1. Input tokens sent in the request (system prompt + user message + tools + retrieved chunks).
  2. Output tokens generated by the model.
  3. Cached tokens (when supported) which are billed at a steep discount.

Output tokens are usually 3 to 5x more expensive than input tokens, which inverts the intuition many engineers start with. A small prompt that asks for a long answer can cost more than a huge prompt that asks for a one-word answer.

System prompt    [~200 tok]  -+
Tool definitions [~400 tok]   |
Retrieved chunks [~2000 tok]  +-> INPUT  (cheaper, possibly cached)
Chat history     [~1500 tok]  |
User message     [~50 tok]   -+

Model response   [~600 tok]  ---> OUTPUT (most expensive)
Where tokens accumulate in a request

Hands-on Example

Use tiktoken for OpenAI-family models. Anthropic, Google, and others expose their own counters; the pattern is identical.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def estimate_cost(prompt: str, expected_output_tokens: int,
                  in_price=0.15, out_price=0.60):
    # Prices are per 1M tokens; adjust to your model.
    in_tok = count_tokens(prompt)
    cost = (in_tok / 1_000_000) * in_price \
         + (expected_output_tokens / 1_000_000) * out_price
    return in_tok, cost

prompt = "Summarize the following article in three bullets...\n\n" + open("doc.txt").read()
in_tok, dollars = estimate_cost(prompt, expected_output_tokens=150)
print(f"Input tokens: {in_tok}, est cost: ${dollars:.5f}")

For chat messages, remember to add the small per-message overhead (3-4 tokens for role and separator markers). The exact value is documented per model.

Trade-offs

There is real tension between quality and spend. Tactics to cut tokens often nudge quality down.

  • Shorter system prompts save tokens on every call, but vague instructions hurt accuracy.
  • Smaller retrieval top_k is cheaper but raises the chance of missing the right chunk.
  • Smaller models are 5-20x cheaper but may need few-shot examples to match quality, which eats some of the savings.
  • Truncating chat history keeps costs flat over long conversations but the model forgets earlier turns.

The right answer is almost always “measure, then trim.” Premature aggressive trimming makes the model look dumb in ways that erode user trust faster than the bill grows.

Practical Tips

A handful of patterns consistently produce the biggest savings.

  • Cache aggressive system prompts. Providers offer prompt caching that discounts repeated prefixes by 50-90%. Put stable content (system prompt, tool schemas, long instructions) at the very start of the message so the cache key matches.
  • Cap max_tokens on output. Without a cap, a runaway model can generate 4000 tokens when you wanted 200. Set it slightly above your realistic ceiling.
  • Route by difficulty. Send easy requests (classification, simple extraction) to a cheap small model, and only escalate to a frontier model when needed.
  • Compress retrieved context. Strip boilerplate from chunks. A pre-pass that removes nav bars, footers, and HTML attributes can drop 30% of input tokens.
  • Avoid re-sending tool definitions you do not need. If a route never calls a tool, do not include its schema.
  • Stream and short-circuit. If you can detect a complete answer early (e.g. JSON closing brace), close the stream and stop paying for tokens you will discard.
  • Track tokens per request. Log usage.prompt_tokens, usage.completion_tokens, and any cached counts to your observability stack. Build a dashboard that breaks spend down by route and by user.

A simple monthly budget alert costs nothing and has saved more than one team from a surprise $40k bill.

Wrap-up

LLM economics reward engineers who treat tokens as a first-class metric. Count them before you send, cap them on the way out, cache what repeats, and route by difficulty. Once your pipeline emits usage metrics for every call, cost optimization becomes a normal performance problem rather than a quarterly finance fire drill.