RAG Retrieval Strategies

Beginner 11 min read

What you'll learn

✓ How chunk size and overlap affect recall
✓ Why hybrid search beats pure vector search
✓ How rerankers improve precision
✓ When to rewrite queries before retrieval
✓ How to evaluate retrieval without an LLM

Prerequisites

• Basic Python familiarity

A RAG system is only as good as its retriever. You can use the best LLM in the world, but if the right chunk is not in the top-k context, the model will hallucinate or refuse. Most RAG quality work is retrieval quality work. The strategies below are the ones that consistently move evaluation numbers.

Chunking is not a detail

Documents are split into chunks before embedding. Chunk size controls a tradeoff: small chunks are precise but lose context; large chunks carry context but dilute the embedding signal. A reasonable starting range is 300 to 800 tokens, with 50 to 100 tokens of overlap so important spans are not split across boundaries.

Use semantically aware splitters when possible: by headings for markdown, by sections for docs, by function for code. A “recursive character splitter” that respects newlines and punctuation usually beats a fixed-window splitter. Store enough metadata with each chunk (source, section, page) to let you reconstruct context and filter at query time.

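from langchain_text_splitters import RecursiveCharacterTextSplitter

# Note: chunk_size and chunk_overlap count characters by default (length_function=len),
# not tokens. RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) can measure in
# tokens instead (it requires the tiktoken package).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=80,
    # Try heading breaks first, then paragraphs, lines, sentences, and finally words.
    separators=["\n## ", "\n\n", "\n", ". ", " "],
)

# Read the source document and close the file handle when done.
with open("docs.md", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())
print(len(chunks), "chunks")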

Hybrid search: vectors plus keywords

Pure vector search misses cases where the query contains rare terms, codes, or product names that embeddings smooth over. Pure keyword search (BM25) misses paraphrases. Combining them recovers both kinds of recall. Run both, then merge with Reciprocal Rank Fusion (RRF) or a learned weight.

Many modern vector databases support hybrid search natively. If yours does not, you can run BM25 in something like OpenSearch or rank_bm25 and merge results in your application. The gain over pure vector search tends to be largest on corpora with many identifiers, error codes, or product names, so measure it on your own labeled queries.
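A minimal sketch of Reciprocal Rank Fusion in plain Python, assuming each retriever returns an ordered list of document IDs. The function name, the sample IDs, and the constant k=60 (a commonly used default) are illustrative choices, not something prescribed above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every document it contains.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector search and a BM25 search.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between the two retrievers, which is why it is a convenient first choice before trying a learned weight.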

Rerankers turn recall into precision

Have the retriever return more candidates than you need, say 50. A cross-encoder reranker then scores each candidate against the query, and you keep the top 5 to 10 for the LLM. Cross-encoders are slower per pair than bi-encoders, but you only run them on a small candidate set, so total latency stays reasonable.

Cohere’s rerank API and the bge-reranker family are common picks. Adding a reranker on top of hybrid search is often the single biggest quality win in a RAG pipeline.

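from sentence_transformers import CrossEncoder

# Load a cross-encoder that scores each (query, passage) pair jointly.
reranker = CrossEncoder("BAAI/bge-reranker-base")
query = "How do I rotate API keys?"
candidates = ["...", "...", "..."]  # top-50 from retriever
pairs = [[query, c] for c in candidates]
scores = reranker.predict(pairs)  # one relevance score per pair, higher is better

# Sort by score only, so ties never fall back to comparing the passage text.
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top = [c for _, c in ranked[:5]]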

Query rewriting and expansion

User queries are often short, vague, or full of pronouns. Before retrieval, an LLM can rewrite the query into a more searchable form. For multi-turn chats this is essential: “and what about the second one?” makes no sense without history.

Two patterns are useful. First, condense the conversation and the latest user message into a standalone question. Second, generate two or three paraphrases of the query, retrieve for each, and merge the results. The cost is one cheap LLM call per query, plus one retrieval per paraphrase in the second pattern; the recall improvement is usually worth it.
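The second pattern can be sketched as follows. This is an illustration under stated assumptions: `paraphrase` and `retrieve` are hypothetical callables standing in for your LLM call and your retriever, and the round-robin merge is one simple choice among several (rank fusion would also work).

```python
from typing import Callable, List


def multi_query_retrieve(
    query: str,
    paraphrase: Callable[[str], List[str]],  # hypothetical: LLM call returning paraphrases
    retrieve: Callable[[str], List[str]],    # hypothetical: retriever returning chunk IDs
    limit: int = 10,
) -> List[str]:
    """Retrieve for the original query and each paraphrase, then merge without duplicates."""
    variants = [query] + paraphrase(query)
    result_lists = [retrieve(variant) for variant in variants]

    merged, seen = [], set()
    longest = max((len(results) for results in result_lists), default=0)
    # Interleave round-robin so every variant contributes its best hits first.
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged[:limit]


# Stubs for illustration only; replace with a real LLM call and a real retriever.
def fake_paraphrase(query):
    return ["steps to rotate an API key", "replace expired API credentials"]


FAKE_INDEX = {
    "How do I rotate API keys?": ["c1", "c2"],
    "steps to rotate an API key": ["c2", "c3"],
    "replace expired API credentials": ["c4"],
}


def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])


merged = multi_query_retrieve("How do I rotate API keys?", fake_paraphrase, fake_retrieve)
print(merged)
```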

A more advanced variant is HyDE (Hypothetical Document Embeddings): ask the model to draft a hypothetical answer to the query, embed that draft, and retrieve against it. The draft is often closer in embedding space to the real answer than the question is.
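A toy illustration of the HyDE idea, with loud assumptions: `embed` is a bag-of-words counter rather than a real embedding model, `draft_answer` returns a hard-coded string where a real system would call an LLM, and the two documents are invented for the example. It only shows the mechanism of searching with the draft instead of the question.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def draft_answer(query):
    """Hypothetical stand-in for an LLM call that drafts a plausible answer."""
    return "To rotate a key, open settings, generate a new key and revoke the old key."


# Invented mini-corpus: d1 holds the real answer, d2 merely shares words with the question.
docs = {
    "d1": "Open settings, generate a new key, then revoke the old key.",
    "d2": "How do I contact support about API billing questions?",
}


def nearest(text):
    """Return the ID of the document most similar to the given text."""
    vec = embed(text)
    return max(docs, key=lambda doc_id: cosine(vec, embed(docs[doc_id])))


query = "How do I rotate API keys?"
direct = nearest(query)              # search with the question itself
hyde = nearest(draft_answer(query))  # search with the hypothetical answer
print(direct, hyde)
```

In this toy setup the question matches the document that shares question words with it, while the drafted answer matches the document that actually contains the procedure. Real embeddings are far less literal, so treat this as a picture of the mechanism rather than evidence of the size of the effect.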

Metadata filtering

When your corpus has structure, filter on metadata before running the vector search. If the user is asking about pricing in Germany, restrict the search to documents tagged topic=pricing and region=DE. The candidate set is smaller, the noise drops, and latency often improves.

Make sure your vector database supports pre-filtering, not just post-filtering. Pre-filtering reduces the search space; post-filtering retrieves first and discards, which can leave you with too few results.
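The difference can be seen on a tiny in-memory corpus. This is a sketch, not how any particular vector database implements filtering: the vectors, IDs, and helper names are made up for illustration, and the brute-force cosine ranking stands in for a real index.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invented corpus: the US documents are closest to the query vector used below.
corpus = [
    {"id": "p1", "vec": [0.9, 0.1], "meta": {"topic": "pricing", "region": "US"}},
    {"id": "p2", "vec": [0.8, 0.2], "meta": {"topic": "pricing", "region": "US"}},
    {"id": "p3", "vec": [0.6, 0.4], "meta": {"topic": "pricing", "region": "DE"}},
    {"id": "p4", "vec": [0.5, 0.5], "meta": {"topic": "pricing", "region": "DE"}},
]


def matches(doc, filters):
    return all(doc["meta"].get(key) == value for key, value in filters.items())


def top_k(docs, query_vec, k):
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return ranked[:k]


def pre_filter_search(query_vec, filters, k):
    # Filter first, then rank only the documents that satisfy the filter.
    allowed = [d for d in corpus if matches(d, filters)]
    return [d["id"] for d in top_k(allowed, query_vec, k)]


def post_filter_search(query_vec, filters, k):
    # Rank everything, keep the top k, then discard whatever fails the filter.
    return [d["id"] for d in top_k(corpus, query_vec, k) if matches(d, filters)]


query_vec = [1.0, 0.0]
filters = {"topic": "pricing", "region": "DE"}
pre = pre_filter_search(query_vec, filters, k=2)
post = post_filter_search(query_vec, filters, k=2)
print(pre, post)
```

Here post-filtering returns nothing because both top-2 hits are US documents, while pre-filtering returns the two German ones.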

Knowing when to retrieve

Not every query benefits from retrieval. “Hello” does not need a vector search. A small router, either a classifier or a cheap LLM call, can decide whether to retrieve. Skipping retrieval when it is not needed saves cost and avoids polluting the context with irrelevant chunks that confuse the model.
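As a rough sketch of where such a router sits, the function below uses a hand-written small-talk list. That rule is only a stand-in, chosen so the example runs without a model; the phrase list and the name `needs_retrieval` are assumptions, and a production router would more likely be a trained classifier or a cheap LLM call behind the same interface.

```python
import re

# Illustrative list only; a real router would learn this from data.
SMALL_TALK = {"hello", "hi", "hey", "thanks", "thank you", "bye"}


def needs_retrieval(message: str) -> bool:
    """Return False for empty input or plain small talk, True otherwise."""
    # Lowercase and strip punctuation so "Hello!" and "hello" are treated alike.
    normalized = re.sub(r"[^\w\s]", "", message.lower()).strip()
    if not normalized:
        return False
    return normalized not in SMALL_TALK


for message in ["Hello!", "How do I rotate API keys?"]:
    route = "retrieve" if needs_retrieval(message) else "answer directly"
    print(f"{message!r} -> {route}")
```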

For agent systems, retrieval becomes a tool the model chooses to call. That works well as long as you give the model a clear description of when to use it and good defaults for top-k.

Evaluating retrieval honestly

Build a small labeled set: queries paired with the chunk IDs that should appear in the top-k. Then compute recall@k and mean reciprocal rank (MRR). These metrics do not need an LLM, run fast, and let you iterate on chunking and retrieval changes in minutes.
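Both metrics fit in a few lines of plain Python. The sketch below assumes each labeled example stores the retriever's ranked chunk IDs next to the IDs that should have been found; the field names and sample data are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Invented labeled set: retriever output paired with the expected chunk IDs.
labeled = [
    {"retrieved": ["c3", "c1", "c9"], "relevant": ["c1"]},
    {"retrieved": ["c4", "c5", "c6"], "relevant": ["c6", "c7"]},
]

mean_recall = sum(recall_at_k(ex["retrieved"], ex["relevant"], k=3) for ex in labeled) / len(labeled)
mrr = sum(reciprocal_rank(ex["retrieved"], ex["relevant"]) for ex in labeled) / len(labeled)
print(f"recall@3={mean_recall:.2f}  MRR={mrr:.3f}")
```

Recall@k tells you whether the right chunks made it into the context at all; MRR tells you how near the top the first correct one landed, which matters when you pass only a few chunks to the LLM.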

For end-to-end RAG evaluation, an LLM judge can score whether the final answer is grounded in the retrieved chunks and whether it actually answers the question. Tools like Ragas and TruLens automate this. Track groundedness and answer relevance separately, because a confident wrong answer is worse than a polite “I don’t know”.

A solid default pipeline

Chunk with a recursive splitter at 500 to 800 tokens with overlap. Embed with a strong model and store with rich metadata. At query time, rewrite the query if the conversation has history, apply metadata filters, run hybrid search to get 50 candidates, rerank to the top 8, and pass those to the LLM with citations. Evaluate recall@k and MRR weekly on a labeled set. That stack handles most RAG use cases well before you need anything fancier.