Streaming LLM Responses with Server-Sent Events

Intermediate 10 min read

What you'll learn

✓ Why streaming dramatically improves perceived LLM latency
✓ The SSE wire format and how it differs from WebSockets
✓ How to consume an SSE stream from JavaScript with fetch and ReadableStream
✓ How to stream from a Python SDK and forward to your frontend
✓ UX patterns for tokens, tool calls, and errors mid-stream

Prerequisites

• A working mental model of [what an LLM is](/blog/what-is-an-llm)
• Comfort with async JavaScript and HTTP fundamentals

When a model takes eight seconds to produce a 400-token answer, your user does not have to feel all eight seconds. With streaming, the wait they perceive is the time to the first token. Streaming turns a long blocking call into a continuous trickle of partial output, and it is often the single biggest UX win for an LLM-powered surface. This guide covers how streaming works on the wire, how to consume it in JavaScript and Python, and the UX patterns that make it feel natural.

Why streaming matters

Two things change when you stream:

• Time to first token (TTFT) becomes the latency the user perceives. For chat-style interfaces, this is often under 500 ms even when total generation takes several seconds.
• Cancelability is real. The user can stop generation early, saving tokens and money, and you can free server resources.

A non-streaming call is one HTTP request and one HTTP response. A streaming call is one request and a long-lived response that emits chunks as the model produces them.
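To make the TTFT versus total-time distinction concrete, here is a minimal sketch that times both for any iterable of tokens. The `fake_token_stream` helper is a hypothetical stand-in for a real provider stream, and the delays are arbitrary.

```python
import time
from typing import Iterable, Iterator, List, Optional, Tuple


def fake_token_stream(tokens: List[str], delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: one token every `delay` seconds."""
    for token in tokens:
        time.sleep(delay)
        yield token


def measure_stream(stream: Iterable[str]) -> Tuple[Optional[float], float, str]:
    """Return (time to first token, total time, full text) for a token stream."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    parts: List[str] = []
    for token in stream:
        if ttft is None:
            # The first token marks the latency the user actually perceives.
            ttft = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    return ttft, total, "".join(parts)
```

Calling `measure_stream(fake_token_stream(["Hello", " world"], 0.05))` returns a TTFT close to one delay and a total close to two, which is the gap streaming exposes to the user.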

The SSE wire format

Most LLM providers stream over Server-Sent Events (SSE), a simple text format carried on a long-lived HTTP response. The response has Content-Type: text/event-stream, stays open, and emits records like:

data: {"delta":"Hello"}

data: {"delta":" world"}

data: [DONE]

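The [DONE] line is a sentinel some providers send to mark the end of the stream; it is not part of SSE itself. The rules of the format are straightforward: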

• Each event is one or more lines prefixed with a field name (data:, event:, id:, retry:).
• Events are separated by a blank line.
• Multiple data: lines in the same event are joined with newlines.
• A line starting with : is a comment, often used as a heartbeat.
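The rules above can be sketched as a small parser. This is an illustrative simplification that works on a complete text body rather than a live stream, and the function name `parse_sse` is made up for this example.

```python
from typing import Dict, Iterator, List


def parse_sse(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict of fields per event found in a complete SSE body."""
    # SSE allows CRLF, LF, or CR line endings; normalize to LF first.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    fields: Dict[str, str] = {}
    data_lines: List[str] = []
    for line in lines:
        if line == "":
            # A blank line ends the current event; events without data are dropped.
            if data_lines:
                fields["data"] = "\n".join(data_lines)
                yield fields
            fields, data_lines = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, typically a heartbeat
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # only a single leading space is stripped
        if name == "data":
            data_lines.append(value)  # multiple data lines are joined with "\n"
        else:
            fields[name] = value
```

Given a body with a `: ping` comment, an event with two `data:` lines, and a final `data: [DONE]`, the parser skips the comment, joins the two data lines with a newline, and yields the sentinel as its own event.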

SSE is one-directional (server to client), text-based, and passes through any HTTP proxy that does not buffer the response. WebSockets are bidirectional and full-duplex, but for “stream tokens from a model,” SSE is simpler, plays nicer with CDNs, and is what most major LLM APIs use.

Consuming SSE from JavaScript

The browser ships an EventSource API, but it only supports GET and cannot set headers. For LLM calls you almost always need POST with an Authorization header, so you use fetch and parse the stream yourself.

async function streamCompletion(prompt, onToken) {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });

  if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;

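    // Normalize CRLF to LF so the blank-line split below also works
    // when the server terminates lines with \r\n
    buffer = (buffer + decoder.decode(value, { stream: true })).replace(/\r\n/g, '\n');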

    // SSE events are separated by a blank line
    let sep;
    while ((sep = buffer.indexOf('\n\n')) !== -1) {
      const raw = buffer.slice(0, sep);
      buffer = buffer.slice(sep + 2);

      for (const line of raw.split('\n')) {
        if (!line.startsWith('data:')) continue;
        const payload = line.slice(5).trim();
        if (payload === '[DONE]') return;

        try {
          const json = JSON.parse(payload);
          if (json.delta) onToken(json.delta);
        } catch {
          // ignore malformed events
        }
      }
    }
  }
}

Two details that catch people:

TextDecoder with { stream: true } is essential. UTF-8 characters can split across chunk boundaries, and without the streaming option you will see mojibake in the last token of a chunk.
Buffer until you have a full event. Network chunks do not align with SSE events. You must accumulate and split on the blank-line delimiter.
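The same two concerns apply when you consume raw bytes in Python. The sketch below uses the standard library's incremental UTF-8 decoder as the counterpart of `{ stream: true }` and buffers until a blank line arrives. The function name `iter_sse_events` is hypothetical, and LF line endings are assumed.

```python
import codecs
from typing import Iterable, Iterator


def iter_sse_events(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield raw SSE event text from byte chunks with arbitrary boundaries."""
    # The incremental decoder holds back an incomplete multi-byte sequence
    # until the remaining bytes arrive in a later chunk.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # Emit only complete events; any trailing partial event stays buffered.
        while "\n\n" in buffer:
            raw, buffer = buffer.split("\n\n", 1)
            yield raw
```

If a chunk ends in the middle of a character such as "é", calling `bytes.decode` on that chunk alone raises `UnicodeDecodeError`, while the incremental decoder simply waits for the next chunk.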

To cancel mid-stream, create an AbortController, pass its signal to fetch, and call controller.abort(). The pending reader.read() rejects with an AbortError, and the upstream provider’s request gets cancelled if your server forwards the abort.
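On the server side, forwarding the abort can be as simple as closing the upstream stream when the response generator is closed. The sketch below assumes your framework closes or cancels the generator when the client disconnects, which is worth verifying for your stack. `FakeUpstream` is a hypothetical stand-in for a provider stream that has a `close()` method.

```python
from typing import Iterable, Iterator, List


class FakeUpstream:
    """Hypothetical stand-in for a provider stream that can be closed early."""

    def __init__(self, tokens: Iterable[str]) -> None:
        self._tokens: List[str] = list(tokens)
        self.closed = False

    def __iter__(self) -> Iterator[str]:
        for token in self._tokens:
            if self.closed:
                return
            yield token

    def close(self) -> None:
        self.closed = True


def forward(upstream: FakeUpstream) -> Iterator[str]:
    """Relay tokens and always release the upstream stream when done."""
    try:
        for token in upstream:
            yield token
    finally:
        # Runs on normal completion and also when the consumer calls
        # .close() on this generator, for example after a client disconnect.
        upstream.close()
```

Calling `.close()` on the generator returned by `forward` raises `GeneratorExit` at the paused `yield`, so the `finally` block runs and the upstream stream stops producing tokens you would otherwise pay for.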

Streaming from a Python backend

Most providers’ Python SDKs expose streaming as an iterator. The shape is similar across SDKs:

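# `client` is assumed to be an already-configured provider SDK client.
# The method and event names below are illustrative and vary by SDK.
def stream_chat(prompt: str):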
    with client.chat.completions.stream(
        model="some-model",
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for event in stream:
            if event.type == "content.delta":
                yield event.delta
            elif event.type == "tool_call.delta":
                yield {"tool": event.delta}

Forwarding to an SSE endpoint with FastAPI:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import json

app = FastAPI()

def sse(event: dict) -> bytes:
    return f"data: {json.dumps(event)}\n\n".encode("utf-8")

@app.post("/api/chat")
def chat(req: dict):
    def gen():
        for delta in stream_chat(req["prompt"]):
            yield sse({"delta": delta})
        yield b"data: [DONE]\n\n"
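    return StreamingResponse(
        gen(),
        media_type="text/event-stream",
        # Discourage caching and nginx buffering of the stream
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )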

A few production-hardening notes:

• Set the X-Accel-Buffering: no response header if you sit behind nginx; otherwise nginx will buffer the response.
• Disable response compression for SSE; gzip will hold the stream until it has enough bytes.
• Emit a heartbeat comment (: ping\n\n) every 15 to 30 seconds for very slow generations, so proxies do not close the idle connection.
• Catch upstream exceptions inside the generator and emit them as a final event: error so the client can show a useful message instead of a silent truncation.
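The heartbeat and error notes above can be combined in one async wrapper. This is a sketch under stated assumptions: the upstream is an async iterator of text deltas, the name `sse_frames` is made up, and the error payload shape (`{"message": ...}`) is a hypothetical choice rather than a standard.

```python
import asyncio
import json
from typing import AsyncIterator


async def sse_frames(
    source: AsyncIterator[str], interval: float = 15.0
) -> AsyncIterator[bytes]:
    """Wrap text deltas as SSE frames, adding heartbeats and a final error event."""
    queue: asyncio.Queue = asyncio.Queue()

    async def pump() -> None:
        # Read the upstream in its own task so a heartbeat timeout
        # never interrupts the upstream iterator itself.
        try:
            async for delta in source:
                queue.put_nowait(("delta", delta))
        except Exception as exc:
            queue.put_nowait(("error", str(exc)))
        finally:
            queue.put_nowait(("done", None))

    task = asyncio.create_task(pump())
    try:
        while True:
            try:
                kind, value = await asyncio.wait_for(queue.get(), timeout=interval)
            except asyncio.TimeoutError:
                yield b": ping\n\n"  # comment line keeps idle proxies from closing
                continue
            if kind == "delta":
                yield f"data: {json.dumps({'delta': value})}\n\n".encode("utf-8")
            elif kind == "error":
                payload = json.dumps({"message": value})
                yield f"event: error\ndata: {payload}\n\n".encode("utf-8")
                return  # the error event is the last frame
            else:
                yield b"data: [DONE]\n\n"
                return
    finally:
        task.cancel()  # stop reading upstream if the client went away
```

An async generator like this can be passed to a streaming response in place of a synchronous one. In production you may prefer to send a generic error message rather than the raw exception text.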

Streaming tool calls

If you are doing tool use, the model’s tool-call arguments are themselves streamed token by token. You usually do not act on them until the call is complete, but you can start showing “Calling search_docs…” to the user as soon as the call begins. Many SDKs emit distinct event types for each phase (for example tool_call.started, tool_call.delta, tool_call.completed, though exact names vary by SDK); reflect those in your wire format so the client can render meaningful affordances.
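One way to handle this is to accumulate argument fragments per call and parse them only when the call completes. The sketch below assumes events arrive as dicts with hypothetical `type`, `id`, `name`, and `delta` keys; real SDK event shapes will differ.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def assemble_tool_calls(events: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Emit a 'started' notice early and a 'ready' call once arguments are complete."""
    pending: Dict[str, Dict[str, Any]] = {}
    for ev in events:
        kind = ev["type"]
        if kind == "tool_call.started":
            pending[ev["id"]] = {"name": ev["name"], "chunks": []}
            # Enough information to show "Calling search_docs…" right away.
            yield {"status": "started", "id": ev["id"], "name": ev["name"]}
        elif kind == "tool_call.delta":
            # Argument JSON arrives in fragments that are not valid on their own.
            pending[ev["id"]]["chunks"].append(ev["delta"])
        elif kind == "tool_call.completed":
            call = pending.pop(ev["id"])
            arguments = json.loads("".join(call["chunks"]) or "{}")
            yield {
                "status": "ready",
                "id": ev["id"],
                "name": call["name"],
                "arguments": arguments,
            }
```

Parsing only at completion avoids acting on a half-written argument object, while the early "started" record still lets the UI show progress.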

UX patterns that work

Streaming opens up patterns that are impossible with blocking calls:

• Render the cursor. Show a blinking caret at the end of the streamed text so the user knows more is coming.
• Stop button, always. A visible cancel control turns a slow response from frustrating into negotiable.
• Stream into a code block. If the model is writing code, do not wait for the closing fence; render the partial block with syntax highlighting that re-tokenizes as text arrives.
• Markdown on the fly. Use a streaming-safe markdown renderer that tolerates unclosed tags. Re-render the trailing paragraph on each token; do not re-render the entire history.
• Respect scroll position. Pin the chat to the bottom only while the user has not scrolled up. If they scrolled to read, do not yank them back.
• Token counters. A small, unobtrusive count gives power users feedback and helps you debug cost.
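As one illustration of the partial code block and streaming-safe markdown patterns, the sketch below closes an unclosed code fence before the text is handed to a renderer. The logic is language-agnostic and is shown in Python only for consistency with the backend examples. The helper name `close_open_fence` is hypothetical, and it does not handle a fence that is itself only partially received.

```python
FENCE = "`" * 3  # the markdown code-fence marker


def close_open_fence(partial: str) -> str:
    """Return `partial` with a closing fence appended if a code block is still open."""
    open_fence = False
    for line in partial.split("\n"):
        # Every fence line toggles between inside and outside a code block.
        if line.lstrip().startswith(FENCE):
            open_fence = not open_fence
    if not open_fence:
        return partial
    separator = "" if partial.endswith("\n") else "\n"
    return partial + separator + FENCE
```

Run this on the accumulated text before each render so the partial block displays as code, and keep the unmodified text as the source of truth for the next token.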

If you are scoring model quality (see evaluation basics), remember that streaming does not change correctness, only perceived latency. Your evals should still compare full responses.

Common failure modes

• Truncation without error. Usually a proxy timed out the idle connection. Add heartbeats and check your reverse proxy’s proxy_read_timeout.
• All tokens arrive at once. Something between the model and the browser is compressing or buffering. Disable gzip for text/event-stream and set X-Accel-Buffering: no.
• Garbled characters. You forgot { stream: true } on TextDecoder, or your server flushed in the middle of a UTF-8 byte sequence. Yield only at SSE event boundaries.
• The user cancels but you keep paying. Wire the AbortController all the way from the browser through your backend to the provider SDK; do not just close the socket on your side.

Wrap up

Streaming is the single highest-leverage change you can make to an LLM product. The wire protocol is plain HTTP plus a simple text format. The hard parts are buffering correctly, plumbing cancellation, and getting the UX details right: cursors, stop buttons, partial markdown. Get those right and a five-second answer feels instant.