
LLM Streaming Responses Tutorial

Stream tokens from an LLM as they are generated to cut perceived latency, handle partial outputs, and build responsive chat UIs.

4 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • Why streaming changes perceived latency
  • How Server-Sent Events deliver tokens
  • Streaming with the OpenAI Python SDK
  • Handling partial JSON and tool calls
  • Backpressure and cancellation in production

Prerequisites

  • Familiar with how APIs work

What and Why

When you call an LLM without streaming, you wait for the entire response to be generated before any bytes arrive. For a 500-token answer that is often 4-8 seconds of silence. Streaming sends each token (or small group of tokens) as soon as the model produces it, which turns a long blank screen into a typewriter effect.

The total time to finish is roughly the same, but the time to first token drops from seconds to a few hundred milliseconds. Users perceive a streaming response as dramatically faster, even when the totals match.
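To make the difference between time to first token and total time concrete, here is a minimal sketch that measures both. It does not call a real model: `fake_token_stream` is a hypothetical stand-in that sleeps before each token, and `measure_stream` is an illustrative helper name, not part of any SDK.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(tokens: Iterable[str], delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: one token every `delay` seconds."""
    for token in tokens:
        time.sleep(delay)
        yield token


def measure_stream(stream: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a stream and return (full_text, time_to_first_token, total_time)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            # Record the moment the first token arrives.
            first_token_at = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    # An empty stream has no first token; fall back to the total.
    ttft = first_token_at if first_token_at is not None else total
    return "".join(parts), ttft, total


text, ttft, total = measure_stream(fake_token_stream(["CAP ", "means ", "pick ", "two."]))
print(f"ttft={ttft:.3f}s total={total:.3f}s text={text!r}")
```

With a real streaming client, the same loop can wrap the chunk iterator, so both numbers can be logged per request.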

Mental Model

A streaming response is a long-lived HTTP connection that emits a sequence of small events. Each event carries a partial chunk of the model output. The client appends each chunk to a buffer and renders as it arrives.

Model -> token0 token1 token2 ... tokenN
          |       |       |
          v       v       v
      SSE chunk SSE chunk SSE chunk  (text/event-stream)
          |       |       |
          v       v       v
     client buffer -> UI render -> typewriter effect
Streaming pipeline from model to user

Most LLM providers use Server-Sent Events (SSE) over HTTP. Each event is one or more data: lines, usually carrying JSON, terminated by a blank line. A final data: [DONE] marker (or an explicit end event) signals completion.
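The wire format described above can be read with a few lines of code. The following is a simplified sketch, not a full SSE implementation: it assumes each event has a single data: line holding JSON, skips comments and other fields, and stops at the [DONE] marker. The function name `iter_sse_data` and the `delta` key in the sample payloads are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line until `[DONE]`."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(":"):
            continue  # blank event separator or SSE comment (e.g. a heartbeat)
        if not line.startswith("data:"):
            continue  # ignore other fields such as event:, id:, retry:
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # one optional space follows the colon
        if payload == "[DONE]":
            return  # end-of-stream marker
        yield json.loads(payload)


sample = [
    'data: {"delta": "Hel"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"delta": "lo"}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]
text = "".join(event["delta"] for event in iter_sse_data(sample))
print(text)  # Hello
```

In practice an SDK or an SSE client library does this parsing for you; the sketch only shows what travels over the connection.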

Hands-on Example

The OpenAI Python SDK exposes streaming with a single flag.

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain CAP theorem in 3 sentences."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()

On the server side of a web app, you typically proxy these chunks to the browser as SSE:

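import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

@app.post("/chat")
def chat(prompt: str):  # a bare str parameter is read from the query string
    def event_stream():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:  # some chunks carry no choices (e.g. usage-only)
                continue
            piece = chunk.choices[0].delta.content
            if piece:
                # JSON-encode each piece so newlines in the text cannot break
                # SSE framing; the client decodes it with JSON.parse.
                yield f"data: {json.dumps(piece)}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")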

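In the browser, read the chunks with fetch and a ReadableStream and append them to the DOM. EventSource also works for SSE, but it only issues GET requests, so it cannot call a POST route like the one above unless you also expose a GET variant.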

Trade-offs

Streaming is not free. The decisions you have to make trade UX wins for engineering complexity.

  • Structured outputs are harder. If you need strict JSON, you cannot validate until the stream completes. You either render text only when valid, or use a streaming JSON parser that tolerates partial input.
  • Tool calls arrive in pieces. Function-call arguments are streamed token by token. You have to accumulate fragments before you can parse and execute the tool.
  • Error handling shifts. A failure mid-stream is different from a failure on a non-streamed call. You may have already rendered half an answer before the connection dies.
  • Buffering layers fight you. Reverse proxies (nginx defaults, some CDNs) buffer responses. You need to disable buffering for SSE routes or your stream becomes a slow non-stream.
  • Cost is identical. Streaming does not change tokens consumed. It only changes when bytes arrive.
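The tool-call point above is easier to see in code. Here is a minimal sketch of assembling streamed fragments into complete calls. It uses plain dicts whose fields (`index`, `id`, `function.name`, `function.arguments`) are modeled on the shape of OpenAI chat-completion stream deltas; treat that shape as an assumption and check your provider's schema. Fragments are grouped by `index` because, as far as I know, the call id is only sent with the first fragment of each call. The helper name `accumulate_tool_calls` and the sample data are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def accumulate_tool_calls(fragments: Iterable[dict]) -> List[dict]:
    """Merge streamed tool-call fragments into complete calls, grouped by index."""
    calls: Dict[int, dict] = {}
    for frag in fragments:
        slot = calls.setdefault(frag["index"], {"id": None, "name": "", "arguments": ""})
        if frag.get("id"):
            slot["id"] = frag["id"]
        function = frag.get("function") or {}
        if function.get("name"):
            slot["name"] = function["name"]
        # Argument text arrives in pieces; concatenate in arrival order.
        slot["arguments"] += function.get("arguments") or ""

    completed = []
    for index in sorted(calls):
        slot = calls[index]
        # Parse only after the stream has ended; partial JSON would fail here.
        parsed = json.loads(slot["arguments"] or "{}")
        completed.append({"id": slot["id"], "name": slot["name"], "arguments": parsed})
    return completed


# Two calls whose fragments arrive interleaved (made-up sample data).
fragments = [
    {"index": 0, "id": "call_a", "function": {"name": "get_weather", "arguments": ""}},
    {"index": 1, "id": "call_b", "function": {"name": "get_time", "arguments": '{"tz"'}},
    {"index": 0, "function": {"arguments": '{"city": '}},
    {"index": 1, "function": {"arguments": ': "UTC"}'}},
    {"index": 0, "function": {"arguments": '"Oslo"}'}},
]
tool_calls = accumulate_tool_calls(fragments)
for call in tool_calls:
    print(call["id"], call["name"], call["arguments"])
```

Parsing is deferred until the stream ends, which is the same constraint the structured-output bullet describes: a fragment on its own is rarely valid JSON.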

Practical Tips

  • Set X-Accel-Buffering: no on responses passing through nginx. For Cloudflare, configure the route to stream rather than cache.
  • Add a heartbeat (an SSE comment line, which starts with a colon, every 15 seconds) so intermediate proxies do not close the connection during long generations.
  • Propagate client cancellations to the upstream call. When the user closes the tab, you should stop paying for tokens. With an async server, check for client disconnects (for example, Starlette's await request.is_disconnected()) and close the upstream stream as soon as one is detected.
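  • For tool calls, build an accumulator keyed by each tool call's index so arguments concatenate cleanly even when interleaved across chunks. In the OpenAI stream the tool call id typically appears only on the first fragment, so record it when it arrives rather than using it as the key.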
  • Use a streaming JSON parser (such as partial-json in TS or json-stream-parser in Python) when you need to render structured fields as they appear.
  • Always include a final non-streamed reconciliation step in critical flows. After the stream ends, log the full text and validate it for safety, schema, or moderation rules.
  • Track time-to-first-token as a product metric, not just total latency. It is the number users actually feel.
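The heartbeat tip can be sketched as a wrapper around any async stream of SSE frames. This is one possible approach under stated assumptions, not a definitive implementation: `with_heartbeat` and `slow_model` are hypothetical names, and the short interval is only there so the demo finishes quickly. The wrapper waits for the next upstream frame with a timeout and emits an SSE comment frame whenever the timeout expires first.

```python
import asyncio
from typing import AsyncIterator, List

HEARTBEAT = ": keep-alive\n\n"  # SSE comment frame; clients ignore it


async def with_heartbeat(source: AsyncIterator[str], interval: float = 15.0) -> AsyncIterator[str]:
    """Yield frames from `source`, adding a heartbeat after `interval` seconds of silence."""
    iterator = source.__aiter__()
    pending = asyncio.ensure_future(iterator.__anext__())
    try:
        while True:
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield HEARTBEAT  # upstream is quiet; keep proxies from timing out
                continue
            try:
                frame = pending.result()
            except StopAsyncIteration:
                return  # upstream finished
            yield frame
            pending = asyncio.ensure_future(iterator.__anext__())
    finally:
        pending.cancel()  # stop waiting on upstream if the consumer goes away


async def slow_model() -> AsyncIterator[str]:
    """Hypothetical upstream that pauses before each token."""
    for token in ("Hello", " world"):
        await asyncio.sleep(0.2)
        yield f"data: {token}\n\n"


async def collect() -> List[str]:
    return [frame async for frame in with_heartbeat(slow_model(), interval=0.05)]


frames = asyncio.run(collect())
data_frames = [frame for frame in frames if frame.startswith("data:")]
print(data_frames)
print(HEARTBEAT in frames)
```

The `finally` clause also gives a natural place to hook cancellation: when the consumer stops iterating, the pending upstream read is cancelled instead of running on.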

Wrap-up

Streaming turns LLM latency from a wall of silence into a flowing conversation. The provider-side change is a single flag, but doing it well in production means thinking about proxies, cancellation, partial parsing, and tool-call assembly. Get those pieces right once and every future feature gets the snappy feel for free.