
LLM Multi-Turn Conversation Design

How to design multi-turn LLM conversations that stay coherent, respect context limits, handle long histories, and support useful features like summarization and recall.

·4 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • How conversation state actually works
  • Managing context window limits
  • Summarization and rolling memory
  • Persistent memory vs in-context history
  • Pitfalls in long chats

Prerequisites

  • Familiar with APIs

LLM APIs are stateless. The model does not remember anything between calls. Multi-turn conversation is something your code constructs by replaying the history each turn. This post is about doing that well as the conversation grows.

What conversation state really is

Each call to a chat completion endpoint sends a list of messages: system, user, assistant, user, assistant, and so on. The model reads the whole list and produces the next assistant message. To make turn 5 feel like a continuation of turn 4, you include turns 1 through 4 in the messages array.

Your application stores the history. The model never sees a session id or a database row; it sees only what you put in the prompt.
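As a concrete illustration, the request for the third turn of a chat could be assembled as below. This is a minimal sketch with no API call; `build_messages` is a hypothetical helper name, not part of any SDK.

```python
def build_messages(history: list, user_msg: str) -> list:
    """Return the full message list for the next request.

    `history` holds every earlier user/assistant message in order;
    the new user message goes last. The input list is not modified.
    """
    return history + [{"role": "user", "content": user_msg}]


# Two completed turns (four messages) are already stored by the application.
history = [
    {"role": "user", "content": "My name is Ada."},
    {"role": "assistant", "content": "Nice to meet you, Ada."},
    {"role": "user", "content": "I prefer short answers."},
    {"role": "assistant", "content": "Understood."},
]

# Turn 3 replays turns 1 and 2, then adds the new question.
messages = build_messages(history, "What is my name?")
print(len(messages))  # 5
```

The model can answer the question only because the first message is replayed in the request.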

Mental model

Conversation state is a transcript that grows by two messages per turn. Your job is to decide which part of that transcript to include in each request, and how to compress or summarize the rest so the model still has the context it needs without overflowing the context window.

turn N:
history (turns 1..N-1) + user message N
            |
            v
      LLM call (with system prompt)
            |
            v
      assistant message N
            |
            v
      append to history (turns 1..N)
Conversation lifecycle

Hands-on example

A minimal conversation loop with windowing.

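from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
history = []  # full transcript, alternating user and assistant messages

def turn(user_msg: str) -> str:
    """Send one user message with a windowed history and record the reply."""
    history.append({"role": "user", "content": user_msg})
    windowed = history[-20:]  # keep at most the last 20 messages
    # The Messages API expects the first message to come from the user,
    # so drop a leading assistant message left over from the cut.
    while windowed and windowed[0]["role"] != "user":
        windowed = windowed[1:]
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        system="You are a helpful assistant.",
        messages=windowed,
    )
    reply = resp.content[0].text  # first content block holds the text reply
    history.append({"role": "assistant", "content": reply})
    return reply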

Windowing keeps the latest messages and drops the oldest. It is simple, but it loses early context like the user’s name or stated preferences. The fix is to combine a sliding window with a summary.

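def trim():
    """Fold older turns into a summary; call after a completed turn."""
    global history
    # total_tokens() and summarize() are helpers you supply.
    if total_tokens(history) > 6000:
        old = history[:-12]
        summary = summarize(old)
        # `messages` accepts only "user" and "assistant" roles, so the
        # summary is carried as a user message rather than a system one.
        history = [{"role": "user",
                    "content": f"Earlier in this chat: {summary}"}] + history[-12:]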

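The `trim()` function relies on two helpers that are not defined in this post. One possible shape for them is sketched below. Both are assumptions: `total_tokens` uses a rough four-characters-per-token estimate rather than a real tokenizer, and `make_summarizer` is a hypothetical factory that wraps whatever function you use to call the model.

```python
from typing import Callable


def total_tokens(messages: list) -> int:
    """Estimate tokens as characters // 4 (a rough rule of thumb)."""
    return sum(len(m["content"]) for m in messages) // 4


def make_summarizer(complete: Callable[[str], str]) -> Callable[[list], str]:
    """Return a summarize(messages) function backed by `complete`."""
    def summarize(messages: list) -> str:
        transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)
        prompt = (
            "Summarize this conversation in a few sentences. "
            "Keep names, stated preferences, and decisions.\n\n" + transcript
        )
        return complete(prompt)
    return summarize


# Usage with a stand-in for the model call.
summarize = make_summarizer(lambda prompt: f"({len(prompt)} chars summarized)")
old = [{"role": "user", "content": "x" * 400}]
print(total_tokens(old))  # 100
```

For production use, a real tokenizer or the provider's token counts will be more accurate than the character estimate.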
Persistent memory takes this further. Instead of summarizing only what was said in this session, you store key facts in a database keyed by user id and retrieve them at the start of each new conversation.
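A small sketch of that idea using SQLite from the Python standard library follows. The table layout and the function names (`open_store`, `remember`, `recall`, `memory_preamble`) are illustrative assumptions, not a prescribed schema.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a fact store with one row per user and key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memory ("
        " user_id TEXT NOT NULL,"
        " key TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (user_id, key))"
    )
    return conn


def remember(conn: sqlite3.Connection, user_id: str, key: str, value: str) -> None:
    """Insert a fact, replacing any earlier value for the same key."""
    conn.execute(
        "INSERT OR REPLACE INTO memory (user_id, key, value) VALUES (?, ?, ?)",
        (user_id, key, value),
    )
    conn.commit()


def recall(conn: sqlite3.Connection, user_id: str) -> dict:
    """Return all stored facts for one user, sorted by key."""
    rows = conn.execute(
        "SELECT key, value FROM memory WHERE user_id = ? ORDER BY key",
        (user_id,),
    ).fetchall()
    return dict(rows)


def memory_preamble(facts: dict) -> str:
    """Render facts as text to prepend to the system prompt."""
    if not facts:
        return ""
    return "Known facts about the user:\n" + "\n".join(
        f"- {k}: {v}" for k, v in facts.items()
    )


conn = open_store()
remember(conn, "u1", "name", "Ada")
remember(conn, "u1", "diet", "vegetarian")
print(memory_preamble(recall(conn, "u1")))
```

At the start of a new conversation, the rendered preamble would be added to the system prompt so the model sees the facts without any replayed history.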

Trade-offs

Full history replay is simple and accurate up to the context window. It is expensive and slow once the conversation grows past a few thousand tokens, because every request resends the whole transcript.
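To see how quickly replay cost accumulates, the sketch below counts the messages sent over a whole conversation. It assumes one user message and one assistant reply per turn and ignores the system prompt; `replay_input_messages` is a hypothetical name used only for this illustration.

```python
from typing import Optional


def replay_input_messages(turns: int, window: Optional[int] = None) -> int:
    """Total messages sent across `turns` requests.

    Request n carries 2n - 1 messages: n - 1 completed turns (two messages
    each) plus the new user message. `window` caps messages per request.
    """
    total = 0
    for n in range(1, turns + 1):
        sent = 2 * n - 1
        if window is not None:
            sent = min(sent, window)
        total += sent
    return total


print(replay_input_messages(50))             # 2500
print(replay_input_messages(50, window=20))  # 900
```

Without a cap the total grows with the square of the turn count, while a window makes it grow linearly once the cap is reached.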

Sliding window is cheap and predictable. It loses early facts. Acceptable for short interactions, awful for long-running assistants.

Rolling summary keeps the conversation feeling continuous and bounds the prompt size. It costs extra LLM calls to maintain and can drift over many turns as small summarization errors compound.

Persistent memory across sessions creates a real assistant experience but raises privacy and correctness concerns. Store too little and the assistant feels amnesic; store too much and you have built a profile you may not want.

Practical tips

Number the turns in your logs. When users report odd behavior, you need to reproduce the exact prompt at turn N, which requires per-turn snapshots.
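One way to keep per-turn snapshots is a JSON Lines log with one record per request, sketched below. The record fields and the function names `log_turn` and `load_turn` are assumptions chosen for illustration.

```python
import io
import json
import time
from typing import Iterable, Optional, TextIO


def log_turn(log: TextIO, turn: int, system: str,
             messages: list, reply: str) -> None:
    """Append one JSON line holding the exact prompt sent at this turn."""
    record = {
        "turn": turn,
        "ts": time.time(),
        "system": system,
        "messages": messages,  # serialized now, so later edits do not leak in
        "reply": reply,
    }
    log.write(json.dumps(record) + "\n")


def load_turn(lines: Iterable[str], turn: int) -> Optional[dict]:
    """Return the snapshot for `turn`, or None if it was never logged."""
    for line in lines:
        record = json.loads(line)
        if record["turn"] == turn:
            return record
    return None


# Usage with an in-memory buffer standing in for a log file.
log = io.StringIO()
log_turn(log, 1, "You are a helpful assistant.",
         [{"role": "user", "content": "hi"}], "Hello!")
snapshot = load_turn(log.getvalue().splitlines(), 1)
print(snapshot["messages"])  # [{'role': 'user', 'content': 'hi'}]
```

Logging the windowed message list rather than the full history matters here, since that is what the model actually saw.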

Cap assistant responses. Long, rambly replies eat the window repeatedly: once when they are generated and again on every later turn that replays them. A max_tokens setting and a length instruction in the system prompt help.

Use cached prefixes for the system message and any tool list. These do not change across turns, so where your provider supports prompt caching, chat workloads start to benefit from the second request onward.
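With the Anthropic Messages API, a cacheable system prompt is expressed by passing `system` as a list of content blocks and marking the block with `cache_control`. The sketch below only builds the request arguments; `cached_request` is a hypothetical helper, and the provider's minimum cacheable prefix length and pricing should be checked in its documentation.

```python
def cached_request(system_text: str, messages: list, model: str,
                   max_tokens: int = 512) -> dict:
    """Build keyword arguments for client.messages.create()."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # A list of content blocks instead of a plain string lets the
        # system prompt carry a cache marker.
        "system": [{
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": messages,
    }


kwargs = cached_request(
    "You are a helpful assistant.",
    [{"role": "user", "content": "hi"}],
    model="claude-opus-4-7",  # model name taken from the example above
)
# resp = client.messages.create(**kwargs)  # needs an Anthropic client and API key
print(kwargs["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

A one-line system prompt like this is likely too short to be cached in practice; the pattern pays off with long instructions or tool definitions.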

Summarize at natural boundaries. Triggering summarization when the user switches topics or after every N turns is more coherent than waiting until you hit the token limit and dumping half the history.
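A trigger along these lines might look like the sketch below. How a topic switch is detected is left open; `topic_changed` is assumed to come from some classifier or heuristic of your own, and the thresholds are placeholder values.

```python
def should_summarize(turns_since_summary: int, topic_changed: bool,
                     every_n: int = 10, min_turns: int = 3) -> bool:
    """Decide whether to summarize now.

    Summarize on a topic switch or after `every_n` turns, but never
    before `min_turns` have passed since the last summary.
    """
    if turns_since_summary < min_turns:
        return False
    return topic_changed or turns_since_summary >= every_n


print(should_summarize(4, topic_changed=True))    # True
print(should_summarize(4, topic_changed=False))   # False
print(should_summarize(10, topic_changed=False))  # True
```

The `min_turns` floor avoids summarizing after every message when a user jumps between topics quickly.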

Distinguish memory from history. Memory is “the user is a vegetarian.” History is “the user said hi at 9:14.” Store them separately and decide per turn which you need.
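The separation can be made explicit in the data model, as in this sketch. `ChatState` and its methods are hypothetical names; the point is only that durable facts and the timestamped transcript live in different fields and reach the prompt by different routes.

```python
from dataclasses import dataclass, field


@dataclass
class ChatState:
    memory: dict = field(default_factory=dict)   # durable facts about the user
    history: list = field(default_factory=list)  # entries with role, content, ts

    def system_prompt(self, base: str, include_memory: bool = True) -> str:
        """Memory goes into the system prompt when this turn needs it."""
        if not include_memory or not self.memory:
            return base
        facts = "; ".join(f"{k}: {v}" for k, v in self.memory.items())
        return f"{base}\n\nKnown about the user: {facts}"

    def messages(self, last_n: int) -> list:
        """History goes into the message list, without log-only fields."""
        recent = self.history[-last_n:] if last_n > 0 else []
        return [{"role": m["role"], "content": m["content"]} for m in recent]


state = ChatState()
state.memory["diet"] = "vegetarian"
state.history.append({"role": "user", "content": "hi", "ts": "09:14"})
print(state.system_prompt("You are a helpful assistant."))
print(state.messages(last_n=5))  # [{'role': 'user', 'content': 'hi'}]
```

Timestamps stay in storage for debugging but are stripped before the request, since only role and content are sent as messages.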

Wrap-up

Multi-turn chat is mostly about prompt construction, not model magic. The model gets whatever you put in front of it, and great chat experiences come from thoughtful decisions about what to keep, what to compress, and what to retrieve from outside the prompt. Build the prompt construction layer carefully and the rest falls into place.