RAG Chunking Strategies That Actually Work

Intermediate 10 min read

What you'll learn

✓ Why chunking dominates RAG quality more than the model choice
✓ How fixed-size, semantic, and structural chunkers differ in practice
✓ How to choose chunk size and overlap with concrete heuristics
✓ What metadata to attach to every chunk and why it pays off
✓ How to evaluate whether your chunking is actually working

Prerequisites

• A grasp of [embeddings](/blog/rag-embeddings-explained)
• Familiarity with [vector databases](/blog/rag-vector-databases-overview)

If your RAG system feels mediocre, the cause is almost always chunking, not the embedding model and not the LLM. Chunking decides what the retriever can possibly find and what the model gets to read. This guide walks through the three families of chunking strategies, the parameters that actually matter, and how to tell if your choice is working.

Why chunking is the bottleneck

A vector search returns the top-k chunks most similar to a query. If the right answer is split across two chunks, neither chunk may score high enough on its own, and the retriever misses it. If a chunk is too large, it contains so much off-topic text that its embedding becomes a blurred average that few queries land close to. If a chunk is too small, it lacks the context the model needs to answer.

Good chunking gives each chunk a single tight topic, enough surrounding context to be self-contained, and metadata that lets the retriever filter intelligently. The three big families of strategies trade off how they get there.

Fixed-size chunking

The simplest approach: slide a window over the text, emit fixed-length chunks with some overlap.

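def fixed_chunks(text: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    # A step of zero or less would make range() fail or never advance
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    step = size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():
            chunks.append(chunk)
        # Stop once the window reaches the end, to avoid a redundant tail chunk
        if start + size >= len(text):
            break
    return chunks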

Two refinements make this respectable in production:

• Count tokens, not characters. Embedding models price and clip by tokens. A fast approximation is len(text) / 4 for English, but a real tokenizer is better.
• Snap to sentence boundaries. Slide to the nearest period or newline within ±100 characters of the cut point. This eliminates the worst failure mode: a chunk that starts or ends mid-sentence and embeds poorly.
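
The sentence-snapping refinement can be sketched as follows. This is a minimal illustration, not a production splitter: the names `snap_cut` and `snapped_chunks` are made up for this example, sizes are still in characters, and the boundary rule (a newline, or ".", "!" or "?" followed by whitespace) is a simplifying assumption that will misfire on abbreviations.

```python
def snap_cut(text: str, cut: int, window: int = 100) -> int:
    """Move `cut` to just after the nearest sentence end within +/- `window` chars.

    Returns `cut` unchanged when no boundary is found inside the window.
    """
    lo, hi = max(0, cut - window), min(len(text), cut + window)
    best = None
    for i in range(lo, hi):
        # Boundary: a newline, or ".", "!", "?" followed by whitespace or end of text.
        is_end = text[i] == "\n" or (
            text[i] in ".!?" and (i + 1 == len(text) or text[i + 1].isspace())
        )
        if is_end and (best is None or abs(i + 1 - cut) < abs(best - cut)):
            best = i + 1
    return cut if best is None else best


def snapped_chunks(text: str, size: int = 800, overlap: int = 120,
                   window: int = 100) -> list[str]:
    """Fixed-size chunking whose cut points are snapped to sentence boundaries."""
    if not (0 <= overlap < size and 0 <= window < size):
        raise ValueError("overlap and window must be non-negative and smaller than size")
    chunks, start = [], 0
    while start < len(text):
        end = min(len(text), start + size)
        if end < len(text):
            end = snap_cut(text, end, window)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Begin the next chunk roughly `overlap` chars back, also on a boundary.
        next_start = snap_cut(text, max(end - overlap, 0), window)
        # Never move backwards or skip text: fall back to `end` if the snap misbehaves.
        start = next_start if start < next_start <= end else end
    return chunks
```

Because both ends are snapped, actual chunk lengths drift by up to `window` characters around `size`, which is usually an acceptable trade for chunks that start and end on whole sentences.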

Fixed-size chunking is the right baseline. It is fast, predictable, and surprisingly hard to beat on prose-heavy corpora.

Semantic chunking

Semantic chunking groups consecutive sentences while they remain on-topic and starts a new chunk when the topic shifts. The standard recipe:

1. Split the document into sentences.
2. Embed each sentence.
3. Compute the cosine distance between consecutive sentence embeddings.
4. Find the largest distances, for example those above a chosen percentile. These mark the points where the topic visibly shifts.
5. Cut at those boundaries.

import numpy as np

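def semantic_chunks(sentences: list[str], embed, threshold_pct: int = 90) -> list[str]:
    # Fewer than two sentences means there are no gaps to measure
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    # `embed` is any callable that maps a list of strings to a list of vectors
    embs = np.array(embed(sentences), dtype=float)
    # Cosine distance between consecutive sentences (guarding against zero-length vectors)
    norms = np.linalg.norm(embs[:-1], axis=1) * np.linalg.norm(embs[1:], axis=1)
    dists = 1 - (embs[:-1] * embs[1:]).sum(axis=1) / np.maximum(norms, 1e-12)
    # Cut wherever the distance exceeds the chosen percentile of all distances
    threshold = np.percentile(dists, threshold_pct)
    cuts = [0] + [i + 1 for i, d in enumerate(dists) if d > threshold] + [len(sentences)]
    return [" ".join(sentences[a:b]) for a, b in zip(cuts, cuts[1:])]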

The percentile threshold is a tuning knob. A higher percentile means fewer, larger chunks; a lower one means more, smaller chunks. Start at 90 and adjust based on eval results.
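
To see the knob in action without an embedding model, the sketch below applies the same cut rule to a hand-written list of distances. The distances are invented for illustration, and `percentile` is a small pure-Python helper written to follow the linear-interpolation definition that NumPy uses by default.

```python
def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    rank = (len(ordered) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def cut_points(dists: list[float], pct: float) -> list[int]:
    """Sentence indexes at which a new chunk starts, for a given percentile."""
    threshold = percentile(dists, pct)
    return [i + 1 for i, d in enumerate(dists) if d > threshold]


# Made-up distances between 11 consecutive sentences: three clear topic shifts
# (0.55, 0.40, 0.62) in a background of small values.
dists = [0.10, 0.12, 0.55, 0.11, 0.09, 0.40, 0.13, 0.10, 0.62, 0.12]

for pct in (50, 75, 90):
    cuts = cut_points(dists, pct)
    print(f"p{pct}: cuts at {cuts}, {len(cuts) + 1} chunks")
```

On this toy input the number of chunks falls as the percentile rises. A low percentile also starts cutting on background noise such as the 0.13 gap, which is the usual sign that the threshold is too permissive.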

Semantic chunking shines when documents wander across topics in ways that headings do not capture — meeting transcripts, customer interviews, freeform articles. It is more expensive than fixed-size (you embed every sentence at index time) and adds a dependency on the embedding model at preprocessing time.

Structural chunking

If your documents have structure — Markdown, HTML, code, PDFs with headings — use it. Headings are nature’s chunk boundaries.

A simple Markdown chunker:

import re

def markdown_chunks(text: str, max_chars: int = 1200) -> list[dict]:
    sections, current = [], []
    heading_stack = []  # (level, title) pairs for the headings above the current line

    def flush():
        body = "\n".join(current).strip()
        if body:  # skip sections that contain only blank lines
            sections.append({"heading_path": [t for _, t in heading_stack], "text": body})
        current.clear()

    # Limitation: "#" lines inside fenced code blocks are also treated as headings
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            # Pop same-level and deeper headings so skipped levels leave no stale entries
            while heading_stack and heading_stack[-1][0] >= level:
                heading_stack.pop()
            heading_stack.append((level, m.group(2).strip()))
        else:
            current.append(line)
    flush()

    # Split oversized sections, keep small ones intact
    result = []
    for s in sections:
        if len(s["text"]) <= max_chars:
            result.append(s)
        else:
            for piece in fixed_chunks(s["text"], size=max_chars, overlap=150):
                result.append({"heading_path": s["heading_path"], "text": piece})
    return result

Notice the heading path — every chunk carries the chain of headings above it. That metadata becomes powerful: you can boost search results that match the heading, filter by section, and prepend the heading path to the chunk text when passing it to the model so it has natural context.
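
Prepending the heading path is a one-liner in practice. The sketch below assumes chunks shaped like the dictionaries that `markdown_chunks` returns; the function name `embedding_text` and the " > " separator are arbitrary choices for this example.

```python
def embedding_text(chunk: dict, separator: str = " > ") -> str:
    """Return the text to embed: heading breadcrumb first, then the chunk body."""
    headings = [h.strip() for h in chunk.get("heading_path", []) if h.strip()]
    body = chunk["text"].strip()
    if not headings:
        return body  # content before the first heading has no path to prepend
    return f"{separator.join(headings)}\n\n{body}"


chunk = {
    "heading_path": ["Deploys", "Rollback"],
    "text": "Run the previous release tag.\n",
}
print(embedding_text(chunk))
```

Keeping the raw `text` and the embedded text separate lets you show users the original passage while still embedding the version that carries its context.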

For code, your structural boundaries are functions, classes, and top-level blocks. A tree-sitter parser gives you these reliably. Never chunk source code by character count alone; you will split functions in half and the embeddings will be useless.
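
As a Python-only stand-in for a tree-sitter parser, the standard library's `ast` module can illustrate the same idea: one chunk per top-level function or class. This is a sketch under that substitution, not a tree-sitter example, and `python_code_chunks` is a name invented here. It skips imports and other top-level statements, which a real indexer would probably want to keep as their own chunk.

```python
import ast


def python_code_chunks(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above the `def`/`class` line, so start from the earliest one.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": first,
                "end_line": node.end_lineno,
                "text": "\n".join(lines[first - 1:node.end_lineno]),
            })
    return chunks


source = (
    "import os\n"
    "\n"
    "@decorator\n"
    "def f(x):\n"
    "    return x + 1\n"
    "\n"
    "class C:\n"
    "    def m(self):\n"
    "        pass\n"
)
for c in python_code_chunks(source):
    print(c["kind"], c["name"], c["start_line"], c["end_line"])
```

The line numbers double as metadata: storing `start_line` and `end_line` with each chunk lets the application link a retrieved function back to its exact location in the file.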

Overlap, the underrated parameter

Overlap is the number of tokens (or characters) shared between consecutive chunks. Why it matters: if the answer to a question straddles a chunk boundary, the only chance retrieval has of catching it is for at least one chunk to contain the whole answer, and overlap is what places the text around each boundary in both neighbors.

Heuristics that hold up:

• Prose: 10-15% overlap. For 800-character chunks, 80-120 characters.
• Code: 0-5% overlap. Functions are self-contained, and duplicate code in two chunks confuses retrieval.
• Conversational logs: 15-25% overlap, because adjacent turns reference each other heavily.

More overlap is not always better. Too much and you embed duplicate content, inflate your index, and dilute results because the same passage shows up multiple times in the top-k.
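
The arithmetic behind both the heuristics and the cost of overlap fits in a few lines. The table and function names below are hypothetical, and the inflation figure is a rough estimate for long documents: a window of `size` that advances by `size - overlap` stores about `size / (size - overlap)` times the source text.

```python
# Overlap ranges from the heuristics above, as whole percentages of chunk size.
OVERLAP_PCT = {
    "prose": (10, 15),
    "code": (0, 5),
    "conversation": (15, 25),
}


def overlap_range(size: int, kind: str) -> tuple[int, int]:
    """Suggested (min, max) overlap, in the same unit as `size`."""
    lo, hi = OVERLAP_PCT[kind]
    return round(size * lo / 100), round(size * hi / 100)


def index_inflation(size: int, overlap: int) -> float:
    """Approximate ratio of indexed text to source text for a long document."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    return size / (size - overlap)


print(overlap_range(800, "prose"))
print(round(index_inflation(800, 120), 2))
print(round(index_inflation(800, 400), 2))
```

At 15% overlap the index grows by roughly a sixth, and at 50% it doubles. That doubling is the point where duplicate passages start crowding the top-k.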

Metadata that earns its keep

A chunk in your vector store should be a record, not a string. Attach:

• source_id and source_url
• title and heading_path
• created_at, updated_at
• doc_type (article, FAQ, runbook, ticket, code)
• token_count
• content_hash so you can detect unchanged chunks during reindexing
• lang for multilingual corpora

Metadata enables filtered retrieval: vector search ranks chunks by similarity, and metadata filters restrict which chunks are eligible. “Top-5 chunks from doc_type=runbook updated in the last 90 days” is a far more precise query than pure vector search, and it is typically a short filter expression in most vector databases (see the [vector databases overview](/blog/rag-vector-databases-overview)).
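
Here is a minimal sketch of such a record, with an in-memory filter standing in for a vector database's filter syntax (which differs by product and is not shown). `ChunkRecord`, `is_unchanged`, and `filter_records` are illustrative names, `created_at` is omitted for brevity, and the token count uses the rough four-characters-per-token estimate rather than a real tokenizer.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ChunkRecord:
    text: str
    source_id: str
    source_url: str
    title: str
    doc_type: str  # e.g. "article", "faq", "runbook", "ticket", "code"
    updated_at: datetime
    heading_path: list[str] = field(default_factory=list)
    lang: str = "en"
    content_hash: str = field(init=False)
    token_count: int = field(init=False)

    def __post_init__(self) -> None:
        # Derived fields: a stable hash of the text and an approximate token count.
        self.content_hash = hashlib.sha256(self.text.encode("utf-8")).hexdigest()
        self.token_count = max(1, len(self.text) // 4)


def is_unchanged(record: ChunkRecord, previous_hash: Optional[str]) -> bool:
    """True when the stored hash matches, so re-embedding can be skipped."""
    return previous_hash == record.content_hash


def filter_records(records, doc_type: str, max_age_days: int,
                   now: Optional[datetime] = None) -> list:
    """Keep records of one doc_type updated within the last `max_age_days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in records if r.doc_type == doc_type and r.updated_at >= cutoff]
```

During reindexing, comparing `content_hash` against the previously stored value means only chunks whose text actually changed need a new embedding call.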

Evaluating retrieval quality

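You cannot tune chunking without measuring. Build a small eval set first: 50-200 questions with the known-correct chunk IDs they should retrieve. You can bootstrap this by having a model read each chunk and propose questions it could answer. Because chunk IDs change whenever you re-chunk, anchor each gold label to something stable, such as the source document and the answer passage, and map it to chunk IDs for each run.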

Measure two things:

• Recall@k. For each question, is the correct chunk in the top-k results? Recall@5 is the workhorse metric.
• MRR (mean reciprocal rank). If the correct chunk is at position r, you score 1/r. Averaged across the set, this captures both whether you found the answer and how well you ranked it.

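def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose gold chunk appears in the top-k results."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for q, gold_id in eval_set:
        results = retrieve(q, k=k)
        if any(r.id == gold_id for r in results):
            hits += 1
    return hits / len(eval_set)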
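
MRR can be computed in the same shape. This sketch makes the same assumptions as `recall_at_k`: `eval_set` is a list of (question, gold chunk ID) pairs, `retrieve(q, k=...)` returns ranked results with an `.id` attribute, and a gold chunk missing from the top-k scores zero.

```python
def mean_reciprocal_rank(eval_set, retrieve, k=10):
    """Average of 1/rank of the gold chunk; a miss within the top-k scores 0."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    total = 0.0
    for question, gold_id in eval_set:
        results = retrieve(question, k=k)
        for rank, result in enumerate(results, start=1):
            if result.id == gold_id:
                total += 1.0 / rank
                break  # only the first occurrence of the gold chunk counts
    return total / len(eval_set)
```

Using a larger `k` here than for recall is deliberate: MRR rewards rank, so it helps to see where the gold chunk lands even when it falls outside the top 5.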

This is the same shape of measurement that LLM evaluation basics lays out for generation. Retrieval evals are cheaper to run and faster to iterate on than full end-to-end evals, so they should be the inner loop of your chunking work.

When you compare strategies, change one variable at a time. Fixed-size at 600 versus 1200 characters. Semantic at p90 versus p85. Structural with and without the heading path prepended. Tabulate Recall@5 and MRR; pick the winner; move on.
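
A comparison like this is easy to automate. The sketch below is one possible harness, with hypothetical names throughout: `strategies` maps a label to a retrieve function built on that chunking variant, and `metrics` maps a column name to any function with the signature `metric(eval_set, retrieve)`, such as a Recall@5 or MRR implementation.

```python
def compare_strategies(strategies: dict, eval_set, metrics: dict) -> list[dict]:
    """Score every strategy on every metric; rows are sorted by the first metric."""
    if not metrics:
        raise ValueError("need at least one metric")
    rows = []
    for name, retrieve in strategies.items():
        row = {"strategy": name}
        for metric_name, metric in metrics.items():
            row[metric_name] = metric(eval_set, retrieve)
        rows.append(row)
    first_metric = next(iter(metrics))
    return sorted(rows, key=lambda r: r[first_metric], reverse=True)


def format_table(rows: list[dict]) -> str:
    """Render rows as tab-separated text, floats to three decimals."""
    if not rows:
        return ""
    headers = list(rows[0])
    lines = ["\t".join(headers)]
    for row in rows:
        cells = [f"{row[h]:.3f}" if isinstance(row[h], float) else str(row[h])
                 for h in headers]
        lines.append("\t".join(cells))
    return "\n".join(lines)
```

Each entry in `strategies` should differ from its neighbor by a single setting, so that a change in the table can be attributed to that setting.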

A pragmatic default

If you do not yet have an eval harness, this default works well for documentation-style corpora: structural chunking on headings, with oversized sections split by fixed-size chunks of about 1000 tokens with 150 tokens of overlap, snapped to sentence boundaries. Prepend the heading path to each chunk’s embedded text. Store full metadata. You can graduate to semantic chunking once you can measure whether it actually helps.

Wrap up

Chunking is unglamorous and decisive. Fixed-size gives you a fast baseline, semantic adapts to messy prose, structural exploits documents that already have shape. The parameters that move the needle are chunk size, overlap, and metadata — and the only honest way to choose between strategies is a Recall@k harness on your own data. Get this layer right and the rest of the RAG stack starts behaving like the magic it was sold as.