
LLM Prompt Caching Deep Dive

How prompt caching works in modern LLM APIs, when it saves significant cost and latency, and how to design prompts so the cache actually hits in production.

4 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • What prompt caching does under the hood
  • How to structure prompts for cache hits
  • Cost and latency math
  • When caching does not pay off
  • Common pitfalls

Prerequisites

  • Familiarity with APIs

Prompt caching is one of the highest-leverage optimizations available in modern LLM APIs. Done right, it can cut the cost of cached input tokens by up to 90 percent and shave hundreds of milliseconds off time to first token. Done wrong, it does nothing, or even costs slightly more because cache writes carry a premium. This post explains how it works and how to actually hit the cache.

What prompt caching really is

When you send a prompt to an LLM, the provider runs the entire input through the model’s attention layers and computes intermediate activations (in practice, the per-token key and value tensors). Most of that work is identical across requests when your prompt starts with the same prefix, like a long system message or a fixed set of tool definitions.

Prompt caching stores those intermediate activations for a short time. The next request that arrives with the same prefix skips the prefix computation and only processes the new tail. The savings show up as cheaper input tokens and faster first-token latency.
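To make the prefix idea concrete, here is a toy sketch of what "same prefix" means. It is not how any provider implements caching (real systems match at coarser block boundaries and store attention state, not token lists); the token IDs and the helper name shared_prefix_len are made up for illustration.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Return how many leading tokens of `incoming` match `cached`."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break  # the first mismatch ends the reusable prefix
        n += 1
    return n


# Hypothetical token IDs: a 6-token shared prefix followed by different tails.
previous_request = [101, 7, 7, 42, 9, 13, 500, 501]
next_request = [101, 7, 7, 42, 9, 13, 900]

reused = shared_prefix_len(previous_request, next_request)
recomputed = len(next_request) - reused
print(f"reused={reused} recomputed={recomputed}")  # reused=6 recomputed=1
```

Note that a single changed token at the very start would make the reusable prefix zero tokens long, which is why stable content belongs at the front.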

Mental model

Picture your prompt as an ordered sequence, most stable content first: system instructions, tools, retrieved documents, conversation history, current message. Everything before a cache breakpoint is reused. Everything after it gets recomputed on each call.

[cached prefix]
system instructions
tool definitions
reference docs
---- cache breakpoint ----
[uncached suffix]
recent messages
current user input

Figure: Prompt structure for caching

Hands-on example

In the Anthropic SDK, you mark a content block with cache_control to set a breakpoint. Everything up to and including that block becomes the cached prefix.

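from anthropic import Anthropic

# Placeholder inputs (not part of the SDK); substitute your own content.
SYSTEM_INSTRUCTIONS = "You are a helpful assistant."
LARGE_REFERENCE_DOC = "<several thousand tokens of stable reference text>"
user_message = "Summarize the reference document."

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    system=[
        {"type": "text", "text": SYSTEM_INSTRUCTIONS},
        # The breakpoint caches everything up to and including this block.
        {"type": "text", "text": LARGE_REFERENCE_DOC,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_message}],
)
# Includes cache_creation_input_tokens and cache_read_input_tokens.
print(resp.usage)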

The response usage object reports cache_creation_input_tokens and cache_read_input_tokens. The first request writes the cache, which is billed at a premium over the base input price (about 25 percent on Anthropic's published pricing at the time of writing). Subsequent hits read the cache at a steep discount (about 10 percent of the base price). The default ephemeral cache lasts five minutes, and the timer resets on every hit; longer cache windows cost more to write but persist longer.

OpenAI’s prefix caching works similarly but without an explicit breakpoint: any prompt that shares a token-aligned prefix with a recent request gets a discount automatically.

Trade-offs

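Caching only helps when prefixes are stable and large. A 200-token system message hardly matters, and providers typically will not cache it at all: both Anthropic and OpenAI document a minimum cacheable prefix on the order of 1,024 tokens, depending on the model. A 20,000-token document or a long tool list pays off as soon as the prefix is reused.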

The cache write costs more than a normal call. If your traffic does not repeat the same prefix within the cache window, you pay extra for nothing. Estimate hit rate before flipping it on.
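The hit-rate estimate can be turned into a quick back-of-the-envelope calculation. The sketch below assumes a cache write costs 1.25x the base input price and a cache read costs 0.10x; those multipliers are illustrative assumptions, so substitute your provider's current pricing. It also simplifies by treating every miss as a full cache write.

```python
def relative_prefix_cost(hit_rate: float,
                         write_multiplier: float = 1.25,
                         read_multiplier: float = 0.10) -> float:
    """Expected cost of the prefix per request, relative to no caching (1.0).

    Assumes every miss pays the write price and every hit pays the read price.
    The default multipliers are assumptions, not guaranteed provider pricing.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * read_multiplier + (1.0 - hit_rate) * write_multiplier


def break_even_hit_rate(write_multiplier: float = 1.25,
                        read_multiplier: float = 0.10) -> float:
    """Hit rate at which caching costs exactly the same as not caching."""
    return (write_multiplier - 1.0) / (write_multiplier - read_multiplier)


for rate in (0.0, 0.5, 0.9):
    print(f"hit rate {rate:.0%}: {relative_prefix_cost(rate):.3f}x baseline")
print(f"break-even hit rate: {break_even_hit_rate():.1%}")
```

Under these assumed multipliers the break-even point is a hit rate of roughly 22 percent. With a more expensive write (for example a 2x multiplier for a longer cache window), the break-even rises to just over 50 percent, which is why the longer window only suits prefixes you are confident will be reused.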

Cache breakpoints can fragment your prompt. Anthropic supports up to four breakpoints, so you can cache different prefixes for different conversation states, but you must keep the prefix ordering stable across requests.
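As a rough sketch of how several breakpoints might be laid out, the helper below builds the request fields with one breakpoint after the tool definitions and two in the system prompt, ordered from most to least stable. The function names, the demo tool, and the placeholder strings are hypothetical; the code only constructs a dictionary and does not call the API.

```python
from typing import Any

MAX_BREAKPOINTS = 4  # limit quoted in the text above
EPHEMERAL = {"type": "ephemeral"}


def build_request(tools: list[dict[str, Any]], system_text: str,
                  reference_doc: str, user_message: str) -> dict[str, Any]:
    """Build request fields with three breakpoints, most stable content first."""
    tools = [dict(t) for t in tools]  # shallow copies; the caller's list is untouched
    if tools:
        # Breakpoint 1: marks the end of the tool definitions.
        tools[-1]["cache_control"] = dict(EPHEMERAL)
    system = [
        # Breakpoint 2: tools plus the core instructions.
        {"type": "text", "text": system_text, "cache_control": dict(EPHEMERAL)},
        # Breakpoint 3: everything above plus the reference document.
        {"type": "text", "text": reference_doc, "cache_control": dict(EPHEMERAL)},
    ]
    messages = [{"role": "user", "content": user_message}]
    return {"tools": tools, "system": system, "messages": messages}


def count_breakpoints(request: dict[str, Any]) -> int:
    """Count cache_control markers on tool and system blocks."""
    blocks = request["tools"] + request["system"]
    return sum(1 for block in blocks if "cache_control" in block)


# Hypothetical tool definition, for illustration only.
demo_tools = [{"name": "lookup_order", "description": "Fetch an order by ID.",
               "input_schema": {"type": "object", "properties": {}}}]
request = build_request(demo_tools, "You are a support agent.",
                        "<long reference doc>", "Where is my order?")
assert count_breakpoints(request) <= MAX_BREAKPOINTS
print(count_breakpoints(request))  # 3
```

The resulting fields could then be passed along with a model name and max_tokens, for example client.messages.create(model=..., max_tokens=..., **request). Layering breakpoints this way means a change to the reference document still leaves the tools and instructions cached.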

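Conversation history is tricky. Each turn appends new messages, so a breakpoint fixed on the system prompt never covers the history, and the whole conversation is reprocessed at full price on every turn. To cache it, add a second breakpoint that moves forward with the history, advancing it in chunks (for example, every few messages) rather than on every new token.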
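One way to cache history in chunks is to place a single breakpoint on the last message of the most recent complete chunk. The sketch below assumes messages in the Anthropic format (a role plus either a string or a non-empty list of content blocks) and assumes the provider can reuse the longest previously cached prefix when the breakpoint moves forward. The function name and the chunk size of 4 are arbitrary choices for illustration.

```python
from typing import Any


def mark_history_breakpoint(messages: list[dict[str, Any]],
                            chunk: int = 4) -> list[dict[str, Any]]:
    """Return a copy of `messages` with one breakpoint at the last chunk boundary.

    Existing cache_control markers on message blocks are dropped so that only
    one history breakpoint exists at a time.
    """
    out = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # A plain string is shorthand for a single text block.
            blocks = [{"type": "text", "text": content}]
        else:
            blocks = [{k: v for k, v in block.items() if k != "cache_control"}
                      for block in content]
        out.append({"role": message["role"], "content": blocks})

    boundary = (len(out) // chunk) * chunk  # messages covered by the breakpoint
    if boundary > 0:
        out[boundary - 1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return out


# Nine alternating messages: the breakpoint lands on the 8th (index 7).
history = [{"role": "user" if i % 2 == 0 else "assistant", "content": f"m{i}"}
           for i in range(9)]
marked = mark_history_breakpoint(history, chunk=4)
print([i for i, m in enumerate(marked) if "cache_control" in m["content"][-1]])  # [7]
```

With this layout the breakpoint stays put for several turns, so those turns read the same cached prefix, and a new cache write happens only when the history crosses the next chunk boundary.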

Practical tips

Put the largest stable content first. Move tool definitions and reference docs to the top of your system prompt so the breakpoint covers as many tokens as possible.

Keep the prefix byte-identical. Even one extra space or a swapped order of tools breaks the cache. Centralize prefix construction in one function.
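A minimal sketch of centralized prefix construction follows. It sorts tools by name and trims surrounding whitespace so callers cannot accidentally reorder or pad the prefix, and it computes a short fingerprint you can log to detect drift between deployments. The function names and sample tools are hypothetical, and the normalization rules are only one reasonable choice.

```python
import hashlib
import json
from typing import Any


def build_prefix(system_text: str, tools: list[dict[str, Any]],
                 reference_doc: str) -> dict[str, Any]:
    """Build the cacheable prefix deterministically, regardless of caller order."""
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    return {
        "tools": ordered_tools,
        "system": [
            {"type": "text", "text": system_text.strip()},
            {"type": "text", "text": reference_doc.strip(),
             "cache_control": {"type": "ephemeral"}},
        ],
    }


def prefix_fingerprint(prefix: dict[str, Any]) -> str:
    """Short hash of the serialized prefix; any change in content or order alters it."""
    # Keys are deliberately not sorted, so key-order drift also shows up.
    canonical = json.dumps(prefix, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical tools, passed in two different orders.
tool_a = {"name": "alpha", "description": "First tool.", "input_schema": {"type": "object"}}
tool_b = {"name": "beta", "description": "Second tool.", "input_schema": {"type": "object"}}

fp1 = prefix_fingerprint(build_prefix("You are helpful.", [tool_a, tool_b], "doc"))
fp2 = prefix_fingerprint(build_prefix("You are helpful. ", [tool_b, tool_a], "doc"))
fp3 = prefix_fingerprint(build_prefix("You are  helpful.", [tool_a, tool_b], "doc"))
print(fp1 == fp2, fp1 == fp3)  # True False
```

Logging the fingerprint alongside each request makes an unexpected drop in hit rate easy to trace: if the fingerprint changed, the prefix changed.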

Measure hit rate. Log cache_read_input_tokens and cache_creation_input_tokens separately. The ratio tells you whether the optimization is paying off.
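Here is one way the logging could be aggregated. The sketch accepts either a plain dict or an object with the usage attributes, and it assumes input_tokens counts only the uncached portion of the input, which is how the Anthropic usage object is documented; check this for your provider. CacheStats is a hypothetical helper, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Any


def _tokens(usage: Any, field: str) -> int:
    """Read a token count from a usage dict or object; missing or None counts as 0."""
    value = usage.get(field) if isinstance(usage, dict) else getattr(usage, field, None)
    return int(value or 0)


@dataclass
class CacheStats:
    read: int = 0      # tokens served from the cache
    written: int = 0   # tokens written to the cache
    uncached: int = 0  # input tokens processed at the normal price

    def record(self, usage: Any) -> None:
        """Add one response's usage counts to the running totals."""
        self.read += _tokens(usage, "cache_read_input_tokens")
        self.written += _tokens(usage, "cache_creation_input_tokens")
        self.uncached += _tokens(usage, "input_tokens")

    @property
    def hit_rate(self) -> float:
        """Share of all input tokens that were read from the cache."""
        total = self.read + self.written + self.uncached
        return self.read / total if total else 0.0


stats = CacheStats()
# Made-up usage values: the first call writes the cache, the second reads it.
stats.record({"cache_creation_input_tokens": 1000, "cache_read_input_tokens": 0,
              "input_tokens": 50})
stats.record({"cache_creation_input_tokens": 0, "cache_read_input_tokens": 1000,
              "input_tokens": 50})
print(f"hit rate: {stats.hit_rate:.1%}")
```

In production you would call stats.record(resp.usage) after each response and export the totals to your metrics system. A hit rate that stays near zero while written keeps growing is the signature of a prefix that changes on every request.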

Use the longer cache duration for predictable workloads. Anthropic and others offer extended cache windows that cost more to write but persist for an hour or more. Worth it for slow-burning shared prompts like internal tools.
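A small sketch of choosing between the two windows is shown below. It assumes the Anthropic-style cache_control object accepts a ttl field with the values "5m" and "1h"; treat that field name and those values as assumptions to verify against current documentation. The choose_ttl heuristic and its thresholds are purely illustrative.

```python
from typing import Optional


def choose_ttl(mean_gap_seconds: float) -> Optional[str]:
    """Pick a cache window from the typical gap between requests sharing a prefix.

    Heuristic only: a short gap fits the default window, a gap under an hour
    fits the extended window, and anything longer is not worth caching.
    """
    if mean_gap_seconds < 5 * 60:
        return "5m"
    if mean_gap_seconds < 60 * 60:
        return "1h"
    return None


def cache_block(text: str, ttl: Optional[str]) -> dict:
    """Build a text block, adding cache_control only when a window is chosen."""
    block = {"type": "text", "text": text}
    if ttl is not None:
        # Assumed shape: {"type": "ephemeral", "ttl": "5m" | "1h"}.
        block["cache_control"] = {"type": "ephemeral", "ttl": ttl}
    return block


# An internal tool queried roughly every 20 minutes (made-up figure).
block = cache_block("<shared internal-tool prompt>", choose_ttl(20 * 60))
print(block["cache_control"])
```

Because the longer window costs more to write, it is worth rechecking the choice against real traffic once you have hit-rate data.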

Beware tool result text. If tool outputs vary per call and sit inside the cached prefix, they bust the cache for everything that follows them. Place them after the breakpoint or design tools to return stable strings.

Wrap-up

Prompt caching is a small API change with outsized impact when the workload matches. Big shared prefix, high request rate, stable formatting. Hit those three and the cost graph drops noticeably. Skip the optimization for low-traffic or highly variable prompts; the engineering work will not pay back.