LLM Streaming Responses Tutorial
Stream tokens from an LLM as they are generated to cut perceived latency, handle partial outputs, and build responsive chat UIs.
What you'll learn
- Why streaming changes perceived latency
- How Server-Sent Events deliver tokens
- Streaming with the OpenAI Python SDK
- Handling partial JSON and tool calls
- Backpressure and cancellation in production
Prerequisites
- Familiar with how APIs work
What and Why
When you call an LLM without streaming, you wait for the entire response to be generated before any bytes arrive. For a 500-token answer that is often 4-8 seconds of silence. Streaming sends each token (or small group of tokens) as soon as the model produces it, which turns a long blank screen into a typewriter effect.
The total time to finish is the same, but the time to first token drops from seconds to a few hundred milliseconds. Users perceive a streaming response as dramatically faster, even when the totals match.
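To make the difference between the two numbers concrete, here is a minimal sketch that times any iterator of text chunks and reports time to first token separately from total time. The names `measure_stream` and `fake_stream` are hypothetical, and the simulated delay only stands in for a real model.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a stream of text chunks; return (text, ttft_seconds, total_seconds)."""
    start = time.perf_counter()
    first_at = None
    parts = []
    for chunk in chunks:
        if first_at is None:
            # The moment the user would first see output.
            first_at = time.perf_counter()
        parts.append(chunk)
    end = time.perf_counter()
    # With an empty stream there is no first token, so fall back to the total.
    ttft = (first_at if first_at is not None else end) - start
    return "".join(parts), ttft, end - start


def fake_stream(tokens, delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model: emits one token every `delay` seconds."""
    for token in tokens:
        time.sleep(delay)
        yield token


text, ttft, total = measure_stream(fake_stream(["Stream", "ing ", "works"]))
print(f"ttft={ttft:.3f}s total={total:.3f}s text={text!r}")
```

The same function can wrap a real SDK stream once each chunk has been reduced to its text, which keeps the measurement independent of the provider.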
Mental Model
A streaming response is a long-lived HTTP connection that emits a sequence of small events. Each event carries a partial chunk of the model output. The client appends each chunk to a buffer and renders as it arrives.
Model  ->  token0      token1      token2  ...  tokenN
             |           |           |
             v           v           v
         SSE chunk   SSE chunk   SSE chunk     (text/event-stream)
             |           |           |
             v           v           v
        client buffer -> UI render -> typewriter effect
Most LLM providers use Server-Sent Events (SSE) over HTTP. Each event is a data: line containing JSON. A final data: [DONE] marker (or an explicit end event) signals completion.
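The framing described above can be sketched as a small parser. This is an illustration under stated assumptions, not a full SSE client: `iter_sse_data` is a hypothetical helper, the `{"text": ...}` payload shape is invented for the demo, and fields other than `data:` are ignored.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event until a [DONE] marker."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                payload = "\n".join(data_lines)
                data_lines = []
                if payload == "[DONE]":
                    return
                yield payload
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a heartbeat
        if line.startswith("data:"):
            value = line[len("data:"):]
            # One leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


# Hypothetical wire content; the {"text": ...} shape is an assumption.
wire = [
    'data: {"text": "Hel"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"text": "lo"}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]
print("".join(json.loads(p)["text"] for p in iter_sse_data(wire)))  # prints "Hello"
```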
Hands-on Example
The OpenAI Python SDK exposes streaming with a single flag.
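from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain CAP theorem in 3 sentences."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (for example usage-only ones) carry no choices
    delta = chunk.choices[0].delta
    if delta.content:
        # Print each piece immediately, without a trailing newline.
        print(delta.content, end="", flush=True)
print()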
On the server side of a web app, you typically proxy these chunks to the browser as SSE:
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.post("/chat")
def chat(prompt: str):
    def event_stream():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:
                continue
            piece = chunk.choices[0].delta.content or ""
            if piece:
                # JSON-encode each piece so newlines in the text cannot break SSE framing.
                yield f"data: {json.dumps(piece)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
In the browser, fetch with a ReadableStream reads the chunks and appends them to the DOM. EventSource also works, but it can only issue GET requests, so it would need a GET variant of this POST route.
Trade-offs
Streaming is not free. The decisions you have to make trade UX wins for engineering complexity.
- Structured outputs are harder. If you need strict JSON, you cannot validate until the stream completes. You either render text only when valid, or use a streaming JSON parser that tolerates partial input.
- Tool calls arrive in pieces. Function-call arguments are streamed token by token. You have to accumulate fragments before you can parse and execute the tool.
- Error handling shifts. A failure mid-stream is different from a failure on a non-streamed call. You may have already rendered half an answer before the connection dies.
- Buffering layers fight you. Reverse proxies (nginx defaults, some CDNs) buffer responses. You need to disable buffering for SSE routes or your stream becomes a slow non-stream.
- Cost is identical. Streaming does not change tokens consumed. It only changes when bytes arrive.
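The tool-call point above can be sketched as a small accumulator. It assumes fragments shaped like the OpenAI chat completions tool-call deltas, reduced to plain dicts: every fragment carries an `index`, while `id` and the function name typically appear only on the first fragment of each call. `assemble_tool_calls` and the sample fragments are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def assemble_tool_calls(fragments: Iterable[dict]) -> List[dict]:
    """Merge streamed tool-call fragments into complete calls, ordered by index."""
    calls: Dict[int, dict] = {}
    for frag in fragments:
        call = calls.setdefault(frag["index"], {"id": None, "name": "", "arguments": ""})
        if frag.get("id"):
            call["id"] = frag["id"]
        function = frag.get("function") or {}
        if function.get("name"):
            call["name"] = function["name"]
        # Argument text arrives in pieces; concatenate in arrival order.
        call["arguments"] += function.get("arguments") or ""
    result = []
    for index in sorted(calls):
        call = calls[index]
        # Parse only after the stream has finished; partial JSON would fail here.
        arguments = json.loads(call["arguments"] or "{}")
        result.append({"id": call["id"], "name": call["name"], "arguments": arguments})
    return result


# Hypothetical fragments for two interleaved tool calls.
fragments = [
    {"index": 0, "id": "call_a", "function": {"name": "get_weather", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"city": '}},
    {"index": 1, "id": "call_b", "function": {"name": "get_time", "arguments": '{"tz": "U'}},
    {"index": 0, "function": {"arguments": '"Oslo"}'}},
    {"index": 1, "function": {"arguments": 'TC"}'}},
]
for call in assemble_tool_calls(fragments):
    print(call["id"], call["name"], call["arguments"])
```

Keying on `index` rather than `id` is the design choice here: the index is present on every fragment, so interleaved calls never collide.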
Practical Tips
- Set X-Accel-Buffering: no on responses passing through nginx. For Cloudflare, configure the route to stream rather than cache.
- Add a heartbeat (a comment line every 15 seconds) so intermediate proxies do not close the connection during long generations.
- Propagate client cancellations to the upstream call. When the user closes the tab, you should stop paying for tokens. With async clients, watch for the client disconnect and close the upstream stream as soon as it happens.
- For tool calls, build an accumulator keyed by each fragment's index so arguments concatenate cleanly even when interleaved across chunks. In the OpenAI streaming format the tool call id typically appears only on the first fragment, so it makes a poor key.
- Use a streaming JSON parser (such as partial-json in TS or json-stream-parser in Python) when you need to render structured fields as they appear.
- Always include a final non-streamed reconciliation step in critical flows. After the stream ends, log the full text and validate it for safety, schema, or moderation rules.
- Track time-to-first-token as a product metric, not just total latency. It is the number users actually feel.
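The heartbeat tip above can be sketched with plain asyncio. This is one possible approach, not a provider or framework API: `with_heartbeat` and `_next_or_done` are hypothetical helpers, and the source is assumed to be an async iterator that already yields complete SSE frames.

```python
import asyncio
from typing import AsyncIterator

_DONE = object()  # sentinel marking the end of the upstream stream


async def _next_or_done(iterator):
    """Return the next item, or _DONE when the iterator is exhausted."""
    try:
        return await iterator.__anext__()
    except StopAsyncIteration:
        return _DONE


async def with_heartbeat(source: AsyncIterator[str], interval: float = 15.0) -> AsyncIterator[str]:
    """Relay SSE frames from `source`, adding a comment frame after each silent `interval`."""
    iterator = source.__aiter__()
    pending = asyncio.ensure_future(_next_or_done(iterator))
    try:
        while True:
            # Wait without cancelling the read, so no upstream item is lost on timeout.
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield ": heartbeat\n\n"  # SSE comment line; clients ignore it
                continue
            item = pending.result()
            if item is _DONE:
                return
            yield item
            pending = asyncio.ensure_future(_next_or_done(iterator))
    finally:
        # Runs on normal completion and when this generator is closed early.
        pending.cancel()
```

An async version of the route's event generator could be wrapped as `with_heartbeat(event_stream())` before it is handed to the response. The sketch cancels its own pending read on exit but leaves closing the upstream model stream to the caller.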
Wrap-up
Streaming turns LLM latency from a wall of silence into a flowing conversation. The provider-side change is a single flag, but doing it well in production means thinking about proxies, cancellation, partial parsing, and tool-call assembly. Get those pieces right once and every future feature gets the snappy feel for free.