LLM Token Counting and Cost Control
Learn how tokens are counted, how to estimate API spend before you send a request, and concrete strategies to cut LLM bills without hurting quality.
What you'll learn
- ✓What a token actually is
- ✓How to count tokens before sending a request
- ✓How input, output, and cached tokens are priced
- ✓Where the biggest cost wins usually hide
- ✓How to budget and monitor LLM spend in production
Prerequisites
- •Familiar with how APIs work
What and Why
Every hosted LLM bills you by the token. A token is roughly four characters of English, but it depends entirely on the model’s tokenizer. If you do not know how many tokens your prompts and responses consume, you cannot predict your bill, you cannot reason about latency, and you cannot tell whether a “small” prompt is silently blowing past the context window.
Mental Model
Tokens are the atomic unit of both context and cost. There are three buckets you pay for:
- Input tokens sent in the request (system prompt + user message + tools + retrieved chunks).
- Output tokens generated by the model.
- Cached tokens (when supported) which are billed at a steep discount.
Output tokens are usually 3 to 5x more expensive than input tokens, which inverts the intuition many engineers start with. A small prompt that asks for a long answer can cost more than a huge prompt that asks for a one-word answer.
System prompt [~200 tok] -+
Tool definitions [~400 tok] |
Retrieved chunks [~2000 tok] +-> INPUT (cheaper, possibly cached)
Chat history [~1500 tok] |
User message [~50 tok] -+
Model response [~600 tok] ---> OUTPUT (most expensive) Hands-on Example
Use tiktoken for OpenAI-family models. Anthropic, Google, and others expose their own counters; the pattern is identical.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o-mini")
def count_tokens(text: str) -> int:
return len(enc.encode(text))
def estimate_cost(prompt: str, expected_output_tokens: int,
in_price=0.15, out_price=0.60):
# Prices are per 1M tokens; adjust to your model.
in_tok = count_tokens(prompt)
cost = (in_tok / 1_000_000) * in_price \
+ (expected_output_tokens / 1_000_000) * out_price
return in_tok, cost
prompt = "Summarize the following article in three bullets...\n\n" + open("doc.txt").read()
in_tok, dollars = estimate_cost(prompt, expected_output_tokens=150)
print(f"Input tokens: {in_tok}, est cost: ${dollars:.5f}")
For chat messages, remember to add the small per-message overhead (3-4 tokens for role and separator markers). The exact value is documented per model.
Trade-offs
There is real tension between quality and spend. Tactics to cut tokens often nudge quality down.
- Shorter system prompts save tokens on every call, but vague instructions hurt accuracy.
- Smaller retrieval
top_kis cheaper but raises the chance of missing the right chunk. - Smaller models are 5-20x cheaper but may need few-shot examples to match quality, which eats some of the savings.
- Truncating chat history keeps costs flat over long conversations but the model forgets earlier turns.
The right answer is almost always “measure, then trim.” Premature aggressive trimming makes the model look dumb in ways that erode user trust faster than the bill grows.
Practical Tips
A handful of patterns consistently produce the biggest savings.
- Cache aggressive system prompts. Providers offer prompt caching that discounts repeated prefixes by 50-90%. Put stable content (system prompt, tool schemas, long instructions) at the very start of the message so the cache key matches.
- Cap
max_tokenson output. Without a cap, a runaway model can generate 4000 tokens when you wanted 200. Set it slightly above your realistic ceiling. - Route by difficulty. Send easy requests (classification, simple extraction) to a cheap small model, and only escalate to a frontier model when needed.
- Compress retrieved context. Strip boilerplate from chunks. A pre-pass that removes nav bars, footers, and HTML attributes can drop 30% of input tokens.
- Avoid re-sending tool definitions you do not need. If a route never calls a tool, do not include its schema.
- Stream and short-circuit. If you can detect a complete answer early (e.g. JSON closing brace), close the stream and stop paying for tokens you will discard.
- Track tokens per request. Log
usage.prompt_tokens,usage.completion_tokens, and any cached counts to your observability stack. Build a dashboard that breaks spend down by route and by user.
A simple monthly budget alert costs nothing and has saved more than one team from a surprise $40k bill.
Wrap-up
LLM economics reward engineers who treat tokens as a first-class metric. Count them before you send, cap them on the way out, cache what repeats, and route by difficulty. Once your pipeline emits usage metrics for every call, cost optimization becomes a normal performance problem rather than a quarterly finance fire drill.
Related articles
- LLMs LLM Cost Tracking in Production
A practical guide to attributing, monitoring, and controlling LLM spend per user, per feature, and per request without slowing down delivery.
- LLMs LLM Fine-tuning vs Prompting Trade-offs
Decide between prompt engineering, retrieval, and fine-tuning by weighing cost, latency, control, and data requirements honestly.
- LLMs LLM Prompt Caching Deep Dive
How prompt caching works in modern LLM APIs, when it saves significant cost and latency, and how to design prompts so the cache actually hits in production.
- LLMs LLM Streaming Responses Tutorial
Stream tokens from an LLM as they are generated to cut perceived latency, handle partial outputs, and build responsive chat UIs.