RAG Chunking Strategies Explained

Intermediate 10 min read

What you'll learn

✓ Why chunking is the most under-appreciated lever in RAG
✓ Trade-offs of fixed-size, sentence, and semantic chunking
✓ How overlap fixes boundary problems
✓ When to chunk on structure (headings, code blocks)
✓ How to evaluate a chunking strategy

Prerequisites

• Familiarity with how APIs work

What and Why

Retrieval-Augmented Generation (RAG) gives an LLM access to your documents by splitting them into pieces, embedding each piece, and retrieving the pieces most similar to a given query. The splitting step is called chunking, and it quietly determines whether the whole system works.

Chunks that are too small lose context. Chunks that are too large dilute the embedding signal. Chunks split at the wrong boundary slice answers in half. Many “RAG isn’t working” complaints trace back to chunking, not the LLM.

Mental Model

Each chunk gets one embedding vector. That vector has to represent the chunk well enough for similarity search to find it when a related query comes in. The chunk also has to be small enough to fit in the prompt with several other chunks, but large enough to contain a useful answer.

document
 |
 v
chunker -> [chunk1] [chunk2] [chunk3] ...
              |        |        |
              v        v        v
          embedder  embedder  embedder
              |        |        |
              v        v        v
            vector  vector   vector  --> vector DB
                                          |
query -> embed -> similarity search -> top-k chunks -> LLM prompt

Chunking sits at the start of the RAG pipeline
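To make the pipeline concrete, here is a minimal sketch of the "one vector per chunk, then similarity search" idea. The toy_embed function is a stand-in that counts words; it is an assumption for illustration, not a real embedding model, and the sample chunks are invented.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in embedder: lowercase word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=2):
    """Embed every chunk once, then rank chunks by similarity to the query."""
    index = [(chunk, toy_embed(chunk)) for chunk in chunks]
    q = toy_embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Invented sample chunks for illustration.
chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Shipping takes 3 to 5 business days within the EU.",
    "Passwords must be at least 12 characters long.",
]
print(top_k("When are refunds issued?", chunks, k=1))
```

A real system would swap toy_embed for an embedding model and the sorted list for a vector database, but the shape of the pipeline stays the same.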

Hands-on Example

Three chunking strategies in a few lines each, plus a short description of a fourth (semantic chunking).

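import re

# Assumes a UTF-8 text file named doc.txt in the working directory.
with open("doc.txt", encoding="utf-8") as f:
    text = f.read()

# 1. Fixed-size chunks with character overlap
def fixed_chunks(text, size=800, overlap=100):
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    out = []
    for i in range(0, len(text), step):
        out.append(text[i:i + size])
        if i + size >= len(text):
            break  # text is fully covered; skip a redundant tail chunk
    return out

# 2. Sentence-based chunks
def sentence_chunks(text, max_chars=800):
    # Split after ., ! or ? followed by whitespace (naive: "e.g. " also splits).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, buf = [], ""
    for s in sentences:
        # Flush only a non-empty buffer, so an oversized first sentence
        # does not produce an empty chunk.
        if buf and len(buf) + 1 + len(s) > max_chars:
            chunks.append(buf)
            buf = s
        else:
            buf = f"{buf} {s}" if buf else s
    if buf:
        chunks.append(buf)
    return chunks

# 3. Structural chunks (markdown headings)
def heading_chunks(md):
    # The capture group keeps each heading line in the split result,
    # so parts alternates: preamble, heading, body, heading, body, ...
    parts = re.split(r"(^#{1,3} .+$)", md, flags=re.MULTILINE)
    chunks = []
    if parts[0].strip():
        chunks.append(parts[0].strip())  # text before the first heading
    for i in range(1, len(parts), 2):
        heading, body = parts[i], parts[i + 1]
        chunks.append(f"{heading}\n{body}".strip())
    return chunks

# 4. Semantic chunks: merge sentences that are similar to neighbors,
# split where embeddings diverge sharply.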
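The fourth strategy is only described in comments above. A minimal sketch of the idea follows, under loud assumptions: toy_embed is a word-count placeholder rather than a real embedding model, and the threshold and max_chars values are illustrative, not tuned.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Placeholder embedder: word counts. Swap in a real embedding model in practice.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(text, embed=toy_embed, threshold=0.2, max_chars=800):
    """Merge neighboring sentences while they stay similar; split on a sharp drop."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        similar = cosine(embed(prev), embed(nxt)) >= threshold
        fits = len(" ".join(current)) + 1 + len(nxt) <= max_chars
        if similar and fits:
            current.append(nxt)
        else:
            chunks.append(" ".join(current))
            current = [nxt]
    chunks.append(" ".join(current))
    return chunks

sample = (
    "Cats sleep most of the day. Cats also sleep at night. "
    "Python lists are mutable. Python tuples are not mutable."
)
for chunk in semantic_chunks(sample):
    print(repr(chunk))
```

With the toy embedder, the split lands between the cat sentences and the Python sentences because they share no words. A real embedding model would detect softer topic shifts, at the cost of one embedding call per sentence.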

For most corpora, sentence chunking with a max character cap and small overlap is a strong default and embarrassingly simple to implement.
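The sentence chunker above has no overlap of its own. One possible way to add it, sketched here as an assumption rather than a standard recipe, is to carry the last sentence or two of each chunk into the next one. The function name and the overlap_sentences parameter are hypothetical.

```python
import re

def sentence_chunks_with_overlap(text, max_chars=800, overlap_sentences=1):
    """Sentence chunks where each new chunk starts with the tail of the previous one.

    max_chars is a soft cap: a chunk can exceed it when the carried-over
    sentences plus the next sentence are already longer than the cap.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for s in sentences:
        candidate = " ".join(current + [s])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last N sentences forward (guard against [-0:], which
            # would copy the whole list).
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(s)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in sentence_chunks_with_overlap(demo, max_chars=32):
    print(chunk)
```

Overlapping by whole sentences keeps every chunk boundary on a sentence boundary, which character-based overlap does not guarantee.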

Trade-offs

Each strategy has a personality and a failure mode.

Fixed-size is fastest and works on any text. It happily cuts sentences in half, so always pair it with overlap.
Sentence-based respects natural boundaries and rarely fragments thoughts. It is slightly slower to compute than fixed-size splitting.
Structural (headings, sections, code blocks) is best for technical docs where structure carries meaning. It offers little for unstructured prose, which has no headings to split on.
Semantic uses embeddings to detect topic shifts. It has the highest potential quality, but also the highest complexity, and it is the hardest to debug.
Recursive (LangChain-style) tries multiple separators in order: paragraph, sentence, word. A good general-purpose default.
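The recursive idea from the list above can be sketched in plain Python. This is a hand-written illustration of the approach, not LangChain's implementation; the separator list and the function name are assumptions.

```python
def recursive_chunks(text, max_chars=800, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, buf = [], ""
        for piece in text.split(sep):
            candidate = buf + sep + piece if buf else piece
            if len(candidate) <= max_chars:
                buf = candidate  # keep packing pieces into the current chunk
                continue
            if buf:
                chunks.append(buf)
                buf = ""
            if len(piece) <= max_chars:
                buf = piece
            else:
                # Piece is too long on its own: retry with finer separators.
                chunks.extend(recursive_chunks(piece, max_chars, separators[i + 1:]))
        if buf:
            chunks.append(buf)
        return chunks
    # No separator applies: fall back to a hard character cut.
    return [text[j:j + max_chars] for j in range(0, len(text), max_chars)]

doc = "alpha beta gamma\n\ndelta epsilon zeta eta theta"
print(recursive_chunks(doc, max_chars=20))
```

The first paragraph fits and stays whole, while the second is too long and gets split on spaces. Note that the separator at a chunk boundary is dropped, so a chunk that ends where ". " was split loses its trailing period.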

Chunk size has its own trade-off. Smaller chunks (200-400 tokens) give precise retrieval but force the LLM to stitch many pieces together. Larger chunks (800-1500 tokens) carry more context but dilute the embedding, making relevant chunks harder to find.

Overlap (50-150 characters, or 10-15% of chunk size) cushions boundary mistakes. If an answer straddles a chunk boundary and is no longer than the overlap, at least one of the two chunks contains it in full. Answers longer than the overlap can still be split.
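A small worked example shows the effect. The text and the answer string below are invented so that the answer sits exactly on a chunk boundary; fixed_chunks here is a compact fixed-size splitter written for this demonstration.

```python
def fixed_chunks(text, size, overlap):
    """Fixed-size character windows that advance by size - overlap."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    return [text[i:i + size] for i in range(0, len(text), step)]

answer = "refund in 14 days"                 # 17 characters
text = "x" * 95 + answer + "y" * 88          # answer occupies positions 95-111

no_overlap = fixed_chunks(text, size=100, overlap=0)
with_overlap = fixed_chunks(text, size=100, overlap=20)

print(any(answer in c for c in no_overlap))    # False: cut at position 100
print(any(answer in c for c in with_overlap))  # True: window 80-179 holds it whole
```

The 20-character overlap is enough here because the answer is 17 characters long. An answer longer than the overlap could still be split across windows.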

Practical Tips

Start with sentence chunking and ~500 tokens. It is a strong baseline that often holds its own against more exotic strategies.
Always add overlap. The cost is small and it mitigates a major failure mode.
Preserve metadata on each chunk: document title, section heading, URL, date. The LLM uses this for citations and you use it for filtering.
Add headers back to body chunks. A chunk containing only “…this is supported by the third clause” without its heading is useless. Prepend the section title.
Keep code blocks intact. Splitting code mid-function destroys it. Detect fenced blocks and treat them as atomic units.
Tune chunk size empirically. Build a small eval set of question-answer pairs and measure recall at top-k for several sizes. Pick the sweet spot.
Re-chunk when you change models. A new embedding model may prefer a different chunk size. Do not assume settings transfer.
For tables, convert rows to short natural-language sentences before embedding. Raw tabular text embeds poorly.
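Two of the tips above, preserving metadata and adding headers back to body chunks, can be combined in one small structure. This is a sketch: the Chunk class, the make_chunk helper, the "title > heading" prefix format, and the example values are all hypothetical choices, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str                                     # what gets embedded and shown to the LLM
    metadata: dict = field(default_factory=dict)  # used for filtering and citations

def make_chunk(body, doc_title, heading, url):
    """Prepend heading context to the body and keep the source details alongside."""
    text = f"{doc_title} > {heading}\n{body.strip()}"
    return Chunk(text=text, metadata={"title": doc_title, "heading": heading, "url": url})

# Invented example values.
chunk = make_chunk(
    body="Refunds are issued within 14 days.",
    doc_title="Returns Policy",
    heading="Refund timing",
    url="https://example.com/returns",
)
print(chunk.text)
print(chunk.metadata["url"])
```

Putting the heading into the embedded text helps retrieval, while the metadata dictionary stays out of the embedding and is available for filters and citations.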

The biggest mistake teams make is treating chunking as a settled detail. It is a hyperparameter, and like other hyperparameters it should be tuned with measurement, not guessed once and forgotten.
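Measuring can start very small. The sketch below computes a hit rate at top-k: the share of questions whose expected answer string appears in any of the k retrieved chunks. The keyword retriever is a toy stand-in for a real embedding search, and the chunks and eval set are invented.

```python
import re

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def make_retriever(chunks):
    """Toy retriever: rank chunks by how many words they share with the question."""
    def retrieve(question, k):
        q = words(question)
        ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
        return ranked[:k]
    return retrieve

def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected answer appears in the top-k chunks."""
    if not eval_set:
        return 0.0
    hits = 0
    for question, expected in eval_set:
        retrieved = retrieve(question, k)
        if any(expected.lower() in chunk.lower() for chunk in retrieved):
            hits += 1
    return hits / len(eval_set)

chunks = [
    "Refunds are issued within 14 days.",
    "Shipping takes 3 to 5 business days.",
    "Support is available by email only.",
]
eval_set = [
    ("When are refunds issued?", "14 days"),
    ("How long does shipping take?", "3 to 5 business days"),
    ("Is phone support available?", "phone"),  # not answerable from these chunks
]
print(recall_at_k(eval_set, make_retriever(chunks), k=1))
```

To compare chunking settings, re-chunk the corpus with each candidate size, rebuild the retriever, and run the same eval set against each one.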

Wrap-up

Chunking is the unglamorous first step in RAG that decides how well everything downstream works. Start with sentence-aware chunks around 500 tokens with overlap, preserve structure and metadata, and treat the strategy as something you measure and adjust. A well-chunked corpus makes a mediocre embedding model look great. A poorly chunked one makes a great model look broken.