LLM Prompt Caching Deep Dive
How prompt caching works in modern LLM APIs, when it saves significant cost and latency, and how to design prompts so the cache actually hits in production.
What you'll learn
- What prompt caching does under the hood
- How to structure prompts for cache hits
- Cost and latency math
- When caching does not pay off
- Common pitfalls
Prerequisites
- Familiar with APIs
Prompt caching is one of the highest-leverage optimizations available in modern LLM APIs. Done right, it can cut the cost of cached input tokens by up to 90 percent and shave hundreds of milliseconds off time to first token. Done wrong, it does nothing, or even costs slightly more. This post explains how it works and how to actually hit the cache.
What prompt caching really is
When you send a prompt to an LLM, the provider runs the entire input through the model’s attention layers and computes intermediate activations (the per-token key and value tensors that later tokens attend to). Most of that work is identical across requests when your prompt starts with the same prefix, like a long system message or a fixed set of tool definitions.
Prompt caching stores those intermediate activations for a short time. The next request that arrives with the same prefix skips the prefix computation and only processes the new tail. The savings show up as cheaper input tokens and faster first-token latency.
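To make the idea concrete, here is a toy simulation of a prefix cache. It is only a sketch: `ToyPrefixCache` is a hypothetical class, not any provider's implementation, it counts tokens instead of storing activations, and it ignores expiry entirely.

```python
import hashlib


class ToyPrefixCache:
    """Toy stand-in for a provider-side prefix cache (hypothetical, not a real API)."""

    def __init__(self):
        self._seen = set()  # fingerprints of prefixes that were already processed

    def process(self, prefix_tokens, tail_tokens):
        """Return how many tokens have to be computed for this request."""
        joined = "\x1f".join(prefix_tokens)
        key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if key in self._seen:
            # Cache hit: the prefix work is reused, only the tail is computed.
            return len(tail_tokens)
        # Cache miss: compute everything and remember the prefix.
        self._seen.add(key)
        return len(prefix_tokens) + len(tail_tokens)


cache = ToyPrefixCache()
prefix = ["system"] * 1000

first = cache.process(prefix, ["hello"])                  # miss: full prompt
second = cache.process(prefix, ["how", "are", "you"])     # hit: tail only
changed = cache.process(prefix + ["extra"], ["hello"])    # altered prefix: miss
print(first, second, changed)  # 1001 3 1002
```

The third call shows the property that matters most in practice: a single changed token in the prefix turns a hit into a full recomputation.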
Mental model
Picture your prompt as a stack read from top to bottom: system instructions, tools, retrieved documents, conversation history, current message. Everything above a cache breakpoint is reused. Everything below it gets recomputed on each call.
[cached prefix]
  system instructions
  tool definitions
  reference docs
---- cache breakpoint ----
[uncached suffix]
  recent messages
  current user input
Hands-on example
In the Anthropic SDK, you mark a content block with cache_control to set a breakpoint.
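from anthropic import Anthropic

# Placeholder inputs for illustration; substitute your own content.
SYSTEM_INSTRUCTIONS = "You are a helpful assistant."
LARGE_REFERENCE_DOC = "..."  # long, stable reference text worth caching
user_message = "Summarize the reference document."

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    system=[
        {"type": "text", "text": SYSTEM_INSTRUCTIONS},
        # Breakpoint: everything up to and including this block is cached.
        {"type": "text", "text": LARGE_REFERENCE_DOC,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_message}],
)
print(resp.usage)  # includes the cache write and cache read token counts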
The response usage object reports cache_creation_input_tokens and cache_read_input_tokens. The first request writes the cache, which is slightly more expensive than baseline (about 1.25 times the base input rate at the time of writing). Subsequent hits read the cache, which is much cheaper than baseline (about a tenth of the base input rate). The default ephemeral cache lasts about five minutes and is refreshed each time it is hit; longer cache windows cost more to write but persist longer.
OpenAI’s prefix caching works similarly but without an explicit breakpoint: any prompt that shares a token-aligned prefix with a recent request gets a discount automatically.
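Because automatic caching depends on how much of the prompt matches from the very first token, it helps to see how quickly a shared prefix disappears. The sketch below uses whitespace splitting as a stand-in for a real tokenizer, and `shared_prefix_len` is a hypothetical helper, not part of any SDK.

```python
def shared_prefix_len(a, b):
    """Count how many leading tokens two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Whitespace splitting is only an approximation of real tokenization.
prev = "You are a support bot . Answer briefly . User : hi".split()
curr = "You are a support bot . Answer briefly . User : where is my order".split()

stable = shared_prefix_len(prev, curr)

# Prepending anything request-specific destroys the shared prefix.
busted = shared_prefix_len(["request-id-1"] + prev, ["request-id-2"] + curr)

print(stable, busted)  # 11 0
```

The second comparison is the common failure mode: a per-request value at the top of the prompt leaves nothing for the provider to reuse.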
Trade-offs
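Caching only helps when prefixes are stable and large. A 200-token system message hardly matters, and providers typically enforce a minimum cacheable length (on the order of 1,000 tokens) below which nothing is cached at all. A 20,000-token document or a long tool list pays off immediately.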
Where the provider charges a premium for cache writes, as Anthropic does, the write costs more than a normal call. If your traffic does not repeat the same prefix within the cache window, you pay extra for nothing. Estimate hit rate before flipping it on.
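The break-even point can be estimated with a few lines of arithmetic. This is a rough sketch, not a billing calculator: the default multipliers (1.25 for writes, 0.10 for reads) and the base price are illustrative assumptions, so check them against your provider's current pricing. It also treats every miss as a fresh cache write.

```python
def cached_prefix_cost(prefix_tokens, hit_rate, base_price_per_mtok,
                       write_mult=1.25, read_mult=0.10):
    """Expected cost per request of the cached prefix, in currency units.

    hit_rate is the fraction of requests that read the cache; every other
    request is assumed to pay the write premium.
    """
    per_token = base_price_per_mtok / 1_000_000
    blended = hit_rate * read_mult + (1.0 - hit_rate) * write_mult
    return prefix_tokens * per_token * blended


def break_even_hit_rate(write_mult=1.25, read_mult=0.10):
    """Hit rate at which caching costs the same as not caching."""
    return (write_mult - 1.0) / (write_mult - read_mult)


PREFIX_TOKENS = 20_000
BASE_PRICE = 3.0  # hypothetical price per million input tokens

uncached = PREFIX_TOKENS * BASE_PRICE / 1_000_000
at_90_percent = cached_prefix_cost(PREFIX_TOKENS, 0.9, BASE_PRICE)
threshold = break_even_hit_rate()

print(f"uncached={uncached:.4f} cached@90%={at_90_percent:.4f} "
      f"break-even hit rate={threshold:.2%}")
```

Under these assumed multipliers the break-even hit rate is roughly 22 percent, so caching pays off as soon as a bit more than one request in five reuses the prefix.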
Cache breakpoints can fragment your prompt. Anthropic supports up to four breakpoints, so you can cache different prefixes for different conversation states, but you must keep the prefix ordering stable across requests.
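One way to keep breakpoints under control is to build the system blocks in a single helper that enforces the limit. The sketch below only constructs the payload and makes no API call; `build_system_blocks` and the section texts are hypothetical, and the limit of four is taken from the paragraph above.

```python
MAX_BREAKPOINTS = 4  # limit stated above for Anthropic


def build_system_blocks(sections):
    """Turn (text, cache_here) pairs into system content blocks.

    The sections must always be passed in the same order; a breakpoint is
    attached to every block whose cache_here flag is True.
    """
    blocks = []
    used = 0
    for text, cache_here in sections:
        block = {"type": "text", "text": text}
        if cache_here:
            used += 1
            if used > MAX_BREAKPOINTS:
                raise ValueError(f"more than {MAX_BREAKPOINTS} cache breakpoints")
            block["cache_control"] = {"type": "ephemeral"}
        blocks.append(block)
    return blocks


system_blocks = build_system_blocks([
    ("Global instructions shared by every request.", True),   # most stable
    ("Reference docs for the current product area.", True),   # changes rarely
    ("Notes specific to this session.", False),                # not cached
])
print(sum("cache_control" in b for b in system_blocks))  # 2
```

Ordering the sections from most stable to least stable means a change in a later section leaves the earlier cached prefixes usable.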
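Conversation history is tricky. Each new turn appends to the history, so any tokens after the last breakpoint are reprocessed on every call. To keep long conversations cached, place a breakpoint inside the history and advance it in chunks of several messages rather than on every token, and never rewrite earlier turns, since any change before a breakpoint invalidates the cached prefix.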
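A chunked history breakpoint could look something like the following. Treat it as a sketch under assumptions: `mark_history_breakpoint` and the chunk size of four are hypothetical choices, the function only rewrites the message list, and it assumes earlier turns are never edited.

```python
import copy


def mark_history_breakpoint(messages, chunk_size=4):
    """Return a copy of messages with one breakpoint at the last chunk boundary.

    The breakpoint stays on the same message until chunk_size more messages
    have accumulated, so the cached history prefix is reused across turns.
    """
    out = copy.deepcopy(messages)
    boundary = (len(out) // chunk_size) * chunk_size
    if boundary == 0:
        return out  # history is still shorter than one chunk
    msg = out[boundary - 1]
    if isinstance(msg["content"], str):
        # cache_control lives on content blocks, so wrap plain strings.
        msg["content"] = [{"type": "text", "text": msg["content"]}]
    msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return out


def marked_indexes(messages):
    """Indexes of messages that carry a breakpoint (helper for inspection)."""
    return [
        i for i, m in enumerate(messages)
        if isinstance(m["content"], list)
        and any("cache_control" in block for block in m["content"])
    ]


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(7)
]
print(marked_indexes(mark_history_breakpoint(history)))  # [3]
```

With seven messages the breakpoint sits on the fourth message; it moves to the eighth once the history reaches eight messages. Note that this consumes one of the available breakpoints.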
Practical tips
Put stable content first, with the most stable content at the top. Place tool definitions and reference docs ahead of anything that varies per request so the breakpoint covers as many tokens as possible.
Keep the prefix byte-identical. Even one extra space or a swapped order of tools breaks the cache. Centralize prefix construction in one function.
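A centralized builder might look like the sketch below. The function names and the tool schema are hypothetical; the point is that ordering and serialization are decided in one place, and a short fingerprint makes accidental drift visible in logs.

```python
import hashlib
import json


def build_prefix(system_text, tools):
    """Build the cacheable prefix deterministically.

    Tools are sorted by name and serialized with sorted keys and fixed
    separators, so the same inputs always produce the same bytes.
    """
    ordered_tools = sorted(tools, key=lambda t: t["name"])
    tools_json = json.dumps(ordered_tools, sort_keys=True,
                            separators=(",", ":"), ensure_ascii=False)
    return system_text.strip() + "\n\n" + tools_json


def prefix_fingerprint(prefix):
    """Short hash of the prefix; log it to spot unintended changes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


tools_a = [{"name": "search", "description": "Search docs"},
           {"name": "calc", "description": "Do arithmetic"}]
tools_b = list(reversed(tools_a))  # same tools, different order

fp_a = prefix_fingerprint(build_prefix("You are helpful.", tools_a))
fp_b = prefix_fingerprint(build_prefix("You are helpful.", tools_b))
fp_c = prefix_fingerprint(build_prefix("You are  helpful.", tools_a))  # extra space

print(fp_a == fp_b, fp_a == fp_c)  # True False
```

If the fingerprint changes between deploys when nobody intended to touch the prompt, that is usually the explanation for a sudden drop in cache hits.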
Measure hit rate. Log cache_read_input_tokens and cache_creation_input_tokens separately. A high share of read tokens means the optimization is paying off; mostly creation tokens means the prefix is expiring or changing between requests.
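A small aggregation over logged usage records is enough to get started. In this sketch the records are plain dicts using the field names mentioned above; `cache_stats` is a hypothetical helper, and it assumes `input_tokens` holds only the uncached remainder of each request.

```python
def cache_stats(usages):
    """Aggregate cache counters from a list of usage records (dicts)."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    uncached = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + written + uncached
    return {
        "read": read,
        "written": written,
        "uncached": uncached,
        # Share of all input tokens that were served from the cache.
        "read_share": read / total if total else 0.0,
    }


# Made-up records: one cache write followed by two cache reads.
logged = [
    {"input_tokens": 50, "cache_creation_input_tokens": 20_000,
     "cache_read_input_tokens": 0},
    {"input_tokens": 60, "cache_creation_input_tokens": 0,
     "cache_read_input_tokens": 20_000},
    {"input_tokens": 40, "cache_creation_input_tokens": 0,
     "cache_read_input_tokens": 20_000},
]

stats = cache_stats(logged)
print(stats["read"], stats["written"], f"{stats['read_share']:.1%}")
```

Tracking the read share per endpoint or per prompt template makes it easier to find which prefix is being invalidated.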
Use the longer cache duration for predictable workloads. Anthropic and others offer extended cache windows that cost more to write but persist for an hour or more. Worth it for slow-burning shared prompts like internal tools.
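The choice between windows can be driven by how often a prefix is expected to recur. The sketch below is a heuristic, not a pricing model: `choose_cache_control` is hypothetical, the 300 and 3600 second thresholds assume a five-minute default and a one-hour extended window, and the `"ttl": "1h"` field reflects my understanding of Anthropic's extended cache option, so verify it against the current API reference before relying on it.

```python
def choose_cache_control(expected_gap_seconds, short_ttl_s=300, long_ttl_s=3600):
    """Pick a cache_control value from the expected gap between requests.

    Returns None when the prefix is unlikely to be reused within either window.
    """
    if expected_gap_seconds < short_ttl_s:
        return {"type": "ephemeral"}               # default short window
    if expected_gap_seconds < long_ttl_s:
        return {"type": "ephemeral", "ttl": "1h"}  # assumed extended window
    return None                                    # caching would not pay off


for gap in (60, 900, 7200):
    print(gap, choose_cache_control(gap))
```

An internal tool queried every fifteen minutes falls into the middle case: the default window would expire between requests, while the longer one keeps the prefix warm.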
Beware tool result text. If tool outputs vary per call and sit before a cache breakpoint, they invalidate everything cached after them. Place them after the breakpoint or design tools to return stable strings.
Wrap-up
Prompt caching is a small API change with outsized impact when the workload matches. Big shared prefix, high request rate, stable formatting. Hit those three and the cost graph drops noticeably. Skip the optimization for low-traffic or highly variable prompts; the engineering work will not pay back.