Skip to content
C Codeloom
LLMs

LLM Cost Tracking in Production

A practical guide to attributing, monitoring, and controlling LLM spend per user, per feature, and per request without slowing down delivery.

·4 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • How LLM billing actually works
  • How to attribute spend across users and features
  • How to surface cost as a first-class metric
  • When to cache, batch, or downscale models
  • How to set budgets that protect production

Prerequisites

  • Basic LLM API usage

What and Why

The first month an LLM feature ships, the bill is usually a curiosity. By month six it is often a top-five line item in your cloud spend, and someone in finance is asking pointed questions. By then it is too late to retrofit the observability you needed from day one. Cost tracking is not a finance problem; it is an engineering problem, and the systems that scale gracefully are the ones that treated cost as a metric from the beginning.

Mental Model

LLM cost has three components: input tokens, output tokens, and overhead (embedding calls, vector store reads, tool invocations). Input and output are priced differently, often with output an order of magnitude more expensive. Cached prompt prefixes are cheaper still, sometimes by 90 percent, but only if you structure prompts so the prefix is genuinely stable.

The right unit of accounting is the request, tagged with enough metadata to roll up by user, feature, model, and tenant. If you cannot answer “what did user X cost us this month” in a single query, you are flying blind.

Hands-on Example

Picture a SaaS product with three LLM features: a chat assistant, a document summarizer, and a code completion tool. Each call goes through a shared client.


 feature code
      |
      v
+----------------------+
| LLM client wrapper   |
|  - inject metadata   |  user_id, feature, tenant
|  - choose model      |  routing rules
|  - check budget      |  reject if over cap
+----------+-----------+
         |
         v
 +---------------+
 | provider API  |
 +-------+-------+
         |
         v
+----------------------+
| usage logger         |
|  - input tokens      |
|  - output tokens     |
|  - cache hits        |
|  - price calc        |
+----------+-----------+
         |
         v
+----------------------+
| metrics + warehouse  |
|  - per-user dash     |
|  - alerts on spikes  |
+----------------------+
A cost-aware LLM client that tags, prices, and budgets every request

Every request flows through one wrapper that knows how to tag, price, and log. The wrapper is the only place that talks to the provider. Features cannot bypass it. Dashboards roll up by any dimension you tagged. Anomalies trigger alerts before the monthly bill does.

Trade-offs

Centralizing through a wrapper adds a small layer of code that every team must use. It is easy to violate by importing the provider SDK directly. The fix is a lint rule and a code review norm, not a technical lock.

Token counting on the client side is approximate. Tokenizers differ across providers and across model versions. For accurate accounting you need the provider’s reported usage from the response, not your local estimate. Use local estimates for budget checks and reported usage for billing.

Hard budget caps protect your wallet but break user experience when hit. Soft caps with graceful degradation (switch to a cheaper model, return a partial answer, queue the request) are usually better, though more complex.

Practical Tips

Tag every request with at least: user id, tenant id, feature name, model name, and request id. Without these, attribution is impossible after the fact.

Store raw usage events in a warehouse, not just aggregated metrics. You will want to slice by dimensions you did not anticipate.

Set per-user and per-tenant budgets early, even if they are generous. A single buggy loop in production can rack up thousands of dollars in minutes. A budget check is your circuit breaker.

Watch output token length as carefully as input. Asking for “a brief summary” without a max_tokens cap is how surprise bills happen.

Use prompt caching aggressively where the provider supports it. A stable system prompt followed by variable user input is the canonical cache-friendly shape.

Review your per-feature cost per active user weekly. If a feature is not earning its cost, either fix it, downscale the model, or cut it.

Wrap-up

LLM cost is a knob, not a fact. With proper attribution you can see which features and which users are expensive, and you can make informed decisions about caching, model routing, and pricing. Build the tracking layer first, before the bill demands it. Future you will be grateful when the CFO asks for a breakdown and you have one ready.