RAG Chunk Overlap Strategies

Intermediate 9 min read

What you'll learn

✓ Why naive chunk boundaries lose answers
✓ How overlap preserves context across splits
✓ Fixed, sliding, and semantic overlap strategies
✓ How to pick an overlap size from your data
✓ How to evaluate overlap quality cheaply

Prerequisites

• Basic familiarity with RAG and embeddings

What and Why

When you split a long document into chunks, important information frequently sits right at the boundary you just cut. A definition starts in chunk one and the example is in chunk two. The user asks about the example, retrieval finds chunk two, and the LLM is missing the definition that gave it meaning.

Chunk overlap is the simple fix: each chunk repeats some text from the end of the previous chunk, so a fact that straddles a boundary still appears whole in at least one chunk. It is often one of the cheapest reliability improvements available in a RAG pipeline.

Mental Model

Think of chunks as overlapping windows sliding across the document. The stride is how far each window moves; the overlap is the difference between window size and stride. If your window is 800 tokens and your stride is 600, you have 200 tokens of overlap.

Overlap trades storage and embedding cost for recall. More overlap means more duplicated text, more vectors, more compute, and a higher chance that any given answer survives in at least one chunk fully intact.

Hands-on Example

Here is a minimal overlap chunker. It takes a list of tokens and emits windows that are size tokens long, with overlap tokens shared between neighbouring windows (size and overlap are the function's parameters).

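def chunk_with_overlap(tokens, size=800, overlap=200):
    """Split tokens into windows of `size` that share `overlap` tokens."""
    # A stride of zero or less would never advance through the document.
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        end = start + size
        chunks.append(tokens[start:end])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if end >= len(tokens):
            break
    return chunks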

Visually, the windows look like this.

document tokens: [..................................................]

window 1:        [================]
window 2:                    [================]
window 3:                                [================]
                             ^^^^^^      ^^^^^^
                             overlap     overlap

stride = size - overlap
fact at boundary appears in BOTH neighbouring windows

Overlapping windows preserve boundary context across chunks

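In LangChain the same idea is one call: RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200). Note that this splitter measures chunk_size and chunk_overlap in characters by default; to count tokens instead, build it with RecursiveCharacterTextSplitter.from_tiktoken_encoder or pass a token-counting length_function. The key knob is the ratio of overlap to size, usually somewhere between fifteen and twenty-five percent.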

Trade-offs

More overlap raises recall but costs you in three places. First, your vector database grows: chunk count scales with 1 / (1 - overlap ratio), so a twenty percent overlap means roughly twenty-five percent more vectors. Second, embedding bills go up by the same factor. Third, retrieval can return near-duplicate chunks that waste prompt space and confuse the LLM with repeated text.
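As a rough check on the storage and embedding cost, the sketch below counts how many windows a stride-based chunker emits with and without overlap. The corpus size of one million tokens is a made-up figure for illustration, and chunk_count is a hypothetical helper, not part of any library. The growth factor works out to about 1 / (1 - overlap ratio), so it is somewhat larger than the overlap ratio itself.

```python
import math


def chunk_count(n_tokens, size, overlap):
    """Number of windows a stride-based chunker emits for n_tokens tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if n_tokens <= size:
        return 1 if n_tokens > 0 else 0
    stride = size - overlap
    # One full window, plus one more for every stride needed to cover the rest.
    return 1 + math.ceil((n_tokens - size) / stride)


n = 1_000_000  # hypothetical corpus size in tokens
baseline = chunk_count(n, size=800, overlap=0)
overlapped = chunk_count(n, size=800, overlap=160)  # twenty percent overlap
print(baseline, overlapped, round(overlapped / baseline, 2))
# prints: 1250 1563 1.25
```

Running the same calculation on your own corpus size gives a quick estimate of the extra vectors before you re-embed anything.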

Less overlap saves money and keeps results diverse, but boundary answers start to drop out. Zero overlap is fine for highly structured data like FAQ rows where each chunk is already a self-contained unit.

Sliding overlap is uniform and predictable. Semantic overlap, where you keep the last full sentence or paragraph of the previous chunk, gives cleaner boundaries but makes chunk size harder to reason about. Hierarchical overlap, where small chunks point to a larger parent chunk, gives you both precision and context but adds retrieval complexity.
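To make semantic overlap concrete, here is a minimal sketch that packs whole sentences into chunks and carries the last sentence of each chunk into the next. The function name, the character budget, and the regex-based sentence splitter are all assumptions made for illustration; a real pipeline would likely use a proper sentence tokenizer. The budget is soft: a carried-over sentence plus one long new sentence may exceed it.

```python
import re


def chunk_by_sentences(text, max_chars=1000, overlap_sentences=1):
    """Greedy sentence packing; each chunk starts with the tail of the previous one."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    current = []  # sentences in the chunk being built
    fresh = 0     # how many of them are new rather than carried over
    for sentence in sentences:
        too_long = len(" ".join(current + [sentence])) > max_chars
        # Only close a chunk that holds at least one new sentence.
        if too_long and fresh > 0:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            fresh = 0
        current.append(sentence)
        fresh += 1
    if fresh > 0:
        chunks.append(" ".join(current))
    return chunks


text = "A is a thing. B uses A. C uses B. D uses C."
for chunk in chunk_by_sentences(text, max_chars=25):
    print(chunk)
```

Because the overlap is measured in sentences rather than tokens, chunk lengths vary, which is the sizing trade-off described above.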

Practical Tips

Start with twenty percent overlap and a chunk size around five hundred to one thousand tokens. That covers most prose corpora without much thinking.

Always overlap on word or sentence boundaries, never inside a word. Splitting mid-word produces embeddings that drift toward fragments rather than concepts.

For code, overlap by logical unit. Repeat the function signature or class header in the next chunk so the body still has a name attached to it.
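One way to sketch this for a single function is shown below. It assumes the signature fits on the first line and that there are no decorators; chunk_function and the "continued" marker are hypothetical names chosen for the example, and a sturdier version would parse the source rather than split on lines.

```python
def chunk_function(source, max_body_lines=20):
    """Split one function's source into chunks, repeating its signature line."""
    lines = source.strip("\n").splitlines()
    header, body = lines[0], lines[1:]
    chunks = []
    for start in range(0, len(body), max_body_lines):
        part = body[start:start + max_body_lines]
        # Continuation chunks get the signature again, plus a marker comment.
        prefix = [header] if start == 0 else [header, "    # ... (continued)"]
        chunks.append("\n".join(prefix + part))
    return chunks or [header]


source = """
def total(a, b, c):
    x = a + b
    y = x + c
    z = y * 2
    return z
"""
for chunk in chunk_function(source, max_body_lines=2):
    print(chunk)
    print("---")
```

Every chunk now begins with the def line, so a retrieved fragment of the body still carries the function's name and parameters.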

Deduplicate at retrieval time. If two retrieved chunks share more than half their content, drop the lower scoring one before passing to the LLM. This recovers prompt space lost to overlap.
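A minimal sketch of that filter follows. It measures shared content as the fraction of the smaller chunk's distinct words that also appear in the other chunk, which is an assumption made for simplicity; token spans or character offsets from the chunker would be more exact. The function name and the (score, text) input shape are hypothetical.

```python
def dedupe_chunks(scored_chunks, max_shared=0.5):
    """Drop lower-scoring chunks that mostly repeat a higher-scoring one.

    scored_chunks is a list of (score, text) pairs; the result is sorted
    with the best score first.
    """
    kept = []
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for score, text in ranked:
        words = set(text.split())
        is_duplicate = False
        for _, kept_text in kept:
            kept_words = set(kept_text.split())
            smaller = min(len(words), len(kept_words)) or 1
            # Share of the smaller chunk's vocabulary found in both chunks.
            if len(words & kept_words) / smaller > max_shared:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append((score, text))
    return kept


hits = [
    (0.9, "alpha beta gamma delta"),
    (0.8, "gamma delta epsilon zeta"),
    (0.7, "beta gamma delta epsilon"),
    (0.6, "one two three four"),
]
print([score for score, _ in dedupe_chunks(hits)])
# prints: [0.9, 0.8, 0.6]
```

The chunk scored 0.7 shares three of its four words with the top hit, so it is dropped; the chunk scored 0.8 shares exactly half and is kept.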

Measure overlap with a small evaluation set. Pick twenty queries with known correct answers, retrieve top five chunks, and check whether the answer text is fully present in any retrieved chunk. Tune overlap up until recall plateaus.
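The check described above fits in a few lines. In this sketch, retrieve is a hypothetical callable that takes a query and k and returns a list of chunk strings; swap in whatever your vector store exposes. The match is a case-insensitive substring test, which is a simplifying assumption and will miss answers that are paraphrased or reformatted.

```python
def answer_recall(eval_set, retrieve, k=5):
    """Fraction of queries whose full answer text appears in some top-k chunk."""
    if not eval_set:
        return 0.0
    hits = 0
    for query, answer in eval_set:
        chunks = retrieve(query, k)
        if any(answer.lower() in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(eval_set)


# Toy stand-in for a real retriever: it ignores the query entirely.
corpus = [
    "Overlap is window size minus stride.",
    "Stride is how far the window moves.",
]


def toy_retrieve(query, k):
    return corpus[:k]


eval_set = [
    ("what is overlap", "window size minus stride"),
    ("what is stride", "how far the window moves"),
    ("what is recall", "fraction of answers found"),
]
print(round(answer_recall(eval_set, toy_retrieve), 2))
# prints: 0.67
```

Re-run the same evaluation set after each change to overlap and compare the numbers; with only twenty queries, a single query flipping moves recall by five points, so treat small differences as noise.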

Watch out for tables and lists. A naive splitter can chop a table header off its rows. Either preserve those structures whole, or repeat the header in every chunk that contains rows.
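Repeating the header can be sketched as follows, assuming the table has already been parsed into a header line and a list of row strings. The function name, the pipe-delimited layout, and the rows_per_chunk default are illustrative choices, not a fixed format.

```python
def chunk_table(header, rows, rows_per_chunk=50):
    """Split table rows into chunks, repeating the header line in each one."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        chunks.append("\n".join([header] + group))
    return chunks


header = "name | qty"
rows = ["apples | 3", "pears | 5", "plums | 2"]
for chunk in chunk_table(header, rows, rows_per_chunk=2):
    print(chunk)
    print("---")
```

Each chunk is readable on its own because the column names travel with the rows, at the cost of one duplicated line per chunk.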

Wrap-up

Chunk overlap is the boundary insurance of a RAG system. It costs a little storage and embedding budget, and in return it stops your retriever from quietly losing answers at chunk edges. Start with a sensible default, measure recall, and adjust by ratio rather than absolute tokens so that the strategy scales when you change chunk size.