RAG Chunking Strategies Explained
Compare fixed-size, sentence, semantic, and structural chunking for retrieval augmented generation and pick the right one for your corpus.
What you'll learn
- ✓ Why chunking is the most under-appreciated lever in RAG
- ✓ Trade-offs of fixed-size, sentence, and semantic chunking
- ✓ How overlap fixes boundary problems
- ✓ When to chunk on structure (headings, code blocks)
- ✓ How to evaluate a chunking strategy
Prerequisites
- Familiar with how APIs work
What and Why
Retrieval Augmented Generation (RAG) gives an LLM access to your documents by splitting them into pieces, embedding each piece, and retrieving the most similar pieces for a given query. The split step is called chunking, and it quietly determines whether your whole system works or not.
Chunks too small lose context. Chunks too large dilute the embedding signal. Chunks split at the wrong boundary slice answers in half. Most “RAG isn’t working” complaints trace back to chunking, not the LLM.
Mental Model
Each chunk gets one embedding vector. That vector has to represent the chunk well enough for similarity search to find it when a related query comes in. The chunk also has to be small enough to fit in the prompt with several other chunks, but large enough to contain a useful answer.
document
   |
   v
chunker -> [chunk1] [chunk2] [chunk3] ...
              |        |        |
              v        v        v
           embedder embedder embedder
              |        |        |
              v        v        v
            vector   vector   vector --> vector DB
                                             |
                   query -> embed -> similarity search -> top-k chunks -> LLM prompt
Hands-on Example
Three of the four strategies fit in a few lines each; the fourth, semantic chunking, is outlined in a comment.
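import re

with open("doc.txt", encoding="utf-8") as f:
    text = f.read()

# 1. Fixed-size chunks with character overlap
def fixed_chunks(text, size=800, overlap=100):
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    out = []
    for i in range(0, len(text), size - overlap):
        out.append(text[i:i + size])
        if i + size >= len(text):
            break  # this window already reached the end of the text
    return out

# 2. Sentence-based chunks
def sentence_chunks(text, max_chars=800):
    # Split after ., ! or ? followed by whitespace (a rough heuristic).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, buf = [], ""
    for s in sentences:
        # Flush the buffer when adding s would exceed the cap.
        # The "buf and" guard avoids emitting an empty first chunk.
        if buf and len(buf) + len(s) > max_chars:
            chunks.append(buf.strip())
            buf = s
        else:
            buf += " " + s
    if buf.strip():
        chunks.append(buf.strip())
    return chunks

# 3. Structural chunks (markdown headings)
def heading_chunks(md):
    # The capturing group keeps the headings in the result:
    # [preamble, heading1, body1, heading2, body2, ...]
    parts = re.split(r"(^#{1,3} .+$)", md, flags=re.MULTILINE)
    # Keep any text before the first heading instead of dropping it.
    chunks = [parts[0].strip()] if parts[0].strip() else []
    for i in range(1, len(parts), 2):
        heading, body = parts[i], parts[i + 1]
        chunks.append(f"{heading}\n{body}".strip())
    return chunks

# 4. Semantic chunks: merge sentences that are similar to neighbors,
#    split where embeddings diverge sharply.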
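Strategy 4 is only outlined in the comment above. One possible way to fill it in is sketched below, assuming an `embed` callable that maps a sentence to a vector. The `toy_embed` bag-of-words function, the `cosine` helper, and the `threshold` value are hypothetical stand-ins for illustration, not a real embedding model or a tuned setting.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    # Hypothetical stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(text, embed=toy_embed, threshold=0.2):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for s in sentences[1:]:
        vec = embed(s)
        if cosine(prev_vec, vec) < threshold:
            # Similarity to the previous sentence dropped: start a new chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(s)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```

With a real model you would swap `toy_embed` for a call to your embedding API and pick `threshold` by measurement, since the right cut-off depends on the model and the corpus.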
For most corpora, sentence chunking with a max character cap and small overlap is a strong default and embarrassingly simple to implement.
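The `sentence_chunks` function above caps chunk size but adds no overlap. One way to add it, as a sketch: carry the last sentence or two of each chunk into the next one. The name `sentence_chunks_overlap` and the `overlap_sentences` parameter are illustrative assumptions, and `max_chars` is treated as a soft cap because the carried-over sentences can push a chunk slightly past it.

```python
import re

def sentence_chunks_overlap(text, max_chars=800, overlap_sentences=1):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, buf = [], []
    for s in sentences:
        candidate = " ".join(buf + [s])
        if buf and len(candidate) > max_chars:
            chunks.append(" ".join(buf))
            # Seed the next chunk with the tail of this one (the overlap).
            buf = buf[-overlap_sentences:] if overlap_sentences > 0 else []
        buf.append(s)
    if buf:
        chunks.append(" ".join(buf))
    return chunks

print(sentence_chunks_overlap("A one. B two. C three. D four.", max_chars=14))
# ['A one. B two.', 'B two. C three.', 'C three. D four.']
```

Overlapping by whole sentences rather than characters keeps every chunk starting and ending on a sentence boundary.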
Trade-offs
Each strategy has a personality and a failure mode.
- Fixed-size is fastest and works on any text. It happily cuts sentences in half, so always pair it with overlap.
- Sentence-based respects natural boundaries and rarely fragments thoughts. Slightly slower to compute.
- Structural (headings, sections, code blocks) is best for technical docs where structure carries meaning. Useless for unstructured prose.
- Semantic uses embeddings to detect topic shifts. Often the highest quality, but also the highest complexity and the hardest to debug.
- Recursive (LangChain-style) tries multiple separators in order: paragraph, sentence, word. A good general-purpose default.
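The recursive idea can be sketched in a few lines. This is a simplified illustration, not LangChain's actual implementation: the function name `recursive_chunks`, the separator list, and the greedy merge step are assumptions chosen for brevity.

```python
def recursive_chunks(text, max_chars=800, separators=("\n\n", ". ", " ")):
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard character split.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, buf = [], ""
    for part in text.split(sep):
        part = part.strip()
        if not part:
            continue
        if len(part) > max_chars:
            # Flush what we have, then split the oversized part more finely.
            if buf:
                chunks.append(buf)
                buf = ""
            chunks.extend(recursive_chunks(part, max_chars, finer))
        elif buf and len(buf) + len(sep) + len(part) <= max_chars:
            # Merge small neighbours back together up to the cap.
            buf = buf + sep + part
        else:
            if buf:
                chunks.append(buf)
            buf = part
    if buf:
        chunks.append(buf)
    return chunks
```

Note that splitting on ". " drops the period at a chunk's end when two sentences land in different chunks; a production splitter would keep the separator attached.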
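Chunk size has its own trade-off. Smaller chunks (200-400 tokens) give precise retrieval but force the LLM to stitch many pieces together. Larger chunks (800-1500 tokens) carry more context but dilute the embedding, making relevant chunks harder to find. Note that the example code above caps chunks in characters, not tokens; as a rough rule of thumb, one token is about four characters of English text, though this varies by tokenizer.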
Overlap (50-150 chars or 10-15% of chunk size) cushions boundary mistakes. If the answer straddles two chunks, overlap means at least one of them contains the full answer, provided the answer is no longer than the overlap itself.
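A quick way to see the effect, as a sketch: chunk a string with and without overlap and check whether a phrase that crosses a boundary survives intact in any single chunk. A compact `fixed_chunks` is repeated here so the snippet runs on its own; `found_whole`, the sample text, and the sizes are made up for illustration.

```python
def fixed_chunks(text, size, overlap):
    step = size - overlap  # assumes 0 <= overlap < size
    return [text[i:i + size] for i in range(0, len(text), step)]

def found_whole(phrase, chunks):
    # True if at least one chunk contains the entire phrase.
    return any(phrase in c for c in chunks)

text = "The warranty lasts two years. Refunds take five days."
phrase = "two years"  # crosses the chunk boundary at character 24

no_overlap = fixed_chunks(text, size=24, overlap=0)
with_overlap = fixed_chunks(text, size=24, overlap=12)

print(found_whole(phrase, no_overlap))    # False
print(found_whole(phrase, with_overlap))  # True
```

The phrase is 9 characters long and the overlap is 12, so some window is guaranteed to contain it whole. A phrase longer than the overlap would not have that guarantee.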
Practical Tips
- Start with sentence chunking and ~500 tokens. It is a strong baseline that beats most exotic strategies on most corpora.
- Always add overlap. The cost is tiny and it eliminates a major failure mode.
- Preserve metadata on each chunk: document title, section heading, URL, date. The LLM uses this for citations and you use it for filtering.
- Add headers back to body chunks. A chunk containing only “…this is supported by the third clause” without its heading is useless. Prepend the section title.
- Keep code blocks intact. Splitting code mid-function destroys it. Detect fenced blocks and treat them as atomic units.
- Tune chunk size empirically. Build a small eval set of question-answer pairs and measure recall at top-k for several sizes. Pick the sweet spot.
- Re-chunk when you change models. A new embedding model may prefer a different chunk size. Do not assume settings transfer.
- For tables, convert rows to short natural-language sentences before embedding. Raw tabular text embeds poorly.
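Several of these tips can be combined in one pass. The sketch below splits Markdown into blocks while keeping fenced code intact, tracks the current heading, prepends it to each chunk, and attaches metadata. The `Chunk` dataclass, its field names, and the `markdown_chunks` function are hypothetical names for illustration, and the fence detection only handles simple triple-backtick fences that start at the beginning of a line.

```python
from dataclasses import dataclass, field

FENCE = "`" * 3  # Markdown code fence marker

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def markdown_chunks(md, title, url):
    chunks, heading, block, in_code = [], "", [], False

    def flush():
        body = "\n".join(block).strip()
        if body:
            # Prepend the section heading so the chunk stands on its own.
            text = f"{heading}\n{body}" if heading else body
            chunks.append(Chunk(text, {"title": title, "url": url, "section": heading}))
        block.clear()

    for line in md.splitlines():
        if line.startswith(FENCE):
            if not in_code:
                flush()             # close the prose block before the code
                block.append(line)
                in_code = True
            else:
                block.append(line)
                in_code = False
                flush()             # emit the whole fenced block as one chunk
        elif in_code:
            block.append(line)      # never split inside a fence
        elif line.startswith("#"):
            flush()
            heading = line.lstrip("#").strip()
        elif not line.strip():
            flush()                 # a blank line ends a prose block
        else:
            block.append(line)
    flush()
    return chunks
```

Each paragraph and each fenced block becomes its own chunk here; in practice you would merge small neighbouring chunks up to your size cap and add fields such as a date to the metadata.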
The biggest mistake teams make is treating chunking as a settled detail. It is a hyperparameter, and like other hyperparameters it should be tuned with measurement, not guessed once and forgotten.
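The measurement itself can be small. Below is a minimal sketch of recall at top-k over an eval set of (question, answer snippet) pairs. The keyword-overlap `retrieve` function is only a stand-in for a real embedding search, the sample document and questions are invented, and counting a hit when the answer snippet appears verbatim in a retrieved chunk is a simplifying assumption.

```python
import re

def words(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieve(question, chunks, k):
    # Stand-in retriever: rank chunks by words shared with the question.
    q = words(question)
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:k]

def recall_at_k(eval_set, chunks, k=3):
    # Fraction of questions whose answer text appears in a top-k chunk.
    hits = sum(
        any(answer in c for c in retrieve(question, chunks, k))
        for question, answer in eval_set
    )
    return hits / len(eval_set)

def chunk(text, size):
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = ("The warranty lasts two years. Refunds take five days. "
       "Shipping is free over fifty dollars.")
eval_set = [
    ("How long does the warranty last?", "two years"),
    ("How long do refunds take?", "five days"),
]
# Compare several chunk sizes on the same eval set.
for size in (20, 40, 90):
    print(size, recall_at_k(eval_set, chunk(doc, size), k=1))
```

Run the same loop over your own chunkers and sizes, keep the eval set fixed, and pick the setting with the best recall for the k you actually send to the LLM.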
Wrap-up
Chunking is the unglamorous first step in RAG that decides how well everything downstream works. Start with sentence-aware chunks around 500 tokens with overlap, preserve structure and metadata, and treat the strategy as something you measure and adjust. A well-chunked corpus makes a mediocre embedding model look great. A poorly chunked one makes a great model look broken.