RAG Hybrid Search: BM25 + Vectors
Combine lexical BM25 with dense vector search to recover the queries each method misses on its own and boost RAG retrieval quality.
What you'll learn
- Why pure vector search misses keyword queries
- How BM25 ranks documents using term statistics
- Score fusion with reciprocal rank fusion (RRF)
- When hybrid helps and when it adds noise
- Tuning weights between lexical and semantic
Prerequisites
- Familiar with how APIs work
What and Why
Dense vector search is brilliant at “what does this mean?” queries. It struggles with exact-match queries: product SKUs, error codes, function names, rare acronyms, and any token the embedding model has not seen often. BM25, a classic lexical scorer, handles these effortlessly because it matches words directly.
Hybrid search runs both retrievers, fuses the rankings, and returns a list that combines the strengths of each. It is one of the highest-ROI upgrades you can make to a RAG pipeline.
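To make the lexical side concrete, here is a minimal sketch of how BM25 turns term statistics into a score. It uses a common Lucene-style IDF; rank_bm25 and other libraries may use a slightly different variant, so treat the exact numbers as illustrative. The function name `bm25_scores` and the toy corpus are made up for this example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query (Lucene-style BM25 sketch)."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for d in tokenized_docs for term in set(d))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # no exact token match, no contribution
            # Rare terms get a higher inverse document frequency.
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # k1 controls how quickly repeated occurrences saturate.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus (hypothetical): only the second document contains the exact code.
docs = [
    "the service failed to start after the upgrade",
    "error e1234 means the license file is missing",
    "restart the service to apply the new license",
]
tokenized = [d.split() for d in docs]
scores = bm25_scores("what does e1234 mean".split(), tokenized)
```

Only the document containing the literal token `e1234` receives a non-zero score. Note that `mean` does not match `means`: BM25 compares tokens exactly, which is why tokenization and stemming choices matter.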
Mental Model
Each retriever is a specialist. BM25 is a literalist: it loves queries that share exact tokens with the document. Dense vectors are a semanticist: they love queries that share meaning even when the words differ.
                     query
                       |
       +---------------+--------------+
       v               v              v
  BM25 search    vector search  (optional filter)
       |               |
       v               v
    top-100         top-100
         \           /
          \         /
score fusion (RRF or weighted)
               |
               v
         top-k results
               |
               v
          LLM prompt

You do not pick one. You combine them and let fusion balance the strengths.
Hands-on Example
A minimal hybrid retriever using rank_bm25 for lexical and any embedding model for dense.
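from rank_bm25 import BM25Okapi
import numpy as np

# Assumed to exist already: `docs` (a list of strings) and `embed_model`
# (any encoder whose .encode(list_of_str) returns an (N, d) float array).

# Lexical index
tokenized = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized)

# Dense index (precomputed), L2-normalized so a dot product is cosine similarity
embeddings = np.asarray(embed_model.encode(docs), dtype=np.float32)  # shape (N, d)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(query, k=10, n_candidates=100, rrf_k=60):
    # BM25 scores; a score of 0 means no query token matched, so drop those
    q_tokens = query.lower().split()
    bm_scores = bm25.get_scores(q_tokens)
    bm_top = [i for i in np.argsort(bm_scores)[::-1][:n_candidates] if bm_scores[i] > 0]

    # Vector scores
    q_emb = np.asarray(embed_model.encode([query])[0], dtype=np.float32)
    q_emb /= np.linalg.norm(q_emb)
    vec_scores = embeddings @ q_emb
    vec_top = np.argsort(vec_scores)[::-1][:n_candidates]

    # Reciprocal Rank Fusion, with ranks counted from 1
    fused = {}
    for rank, idx in enumerate(bm_top, start=1):
        fused[int(idx)] = fused.get(int(idx), 0.0) + 1 / (rrf_k + rank)
    for rank, idx in enumerate(vec_top, start=1):
        fused[int(idx)] = fused.get(int(idx), 0.0) + 1 / (rrf_k + rank)

    # (doc_index, fused_score) pairs, best first
    return sorted(fused.items(), key=lambda x: -x[1])[:k]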
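RRF is dead simple and has one tuning knob (a smoothing constant, with 60 as the standard value). Because it uses only each document's rank in each list, not the raw BM25 or cosine scores, it works without normalizing across two very different score scales.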
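The fusion step can also be pulled out into a standalone function, which makes the arithmetic easy to check by hand. This is a sketch, not a library API: `reciprocal_rank_fusion` is a hypothetical helper name, document ids are plain strings, and ranks are counted from 1.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; each list is ordered best-first."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

lexical = ["a", "b", "c"]   # hypothetical BM25 ranking
dense = ["c", "a", "d"]     # hypothetical vector ranking
fused = reciprocal_rank_fusion([lexical, dense])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `a` scores 1/61 + 1/62 and `c` scores 1/63 + 1/61, so both outrank `b` and `d`, which appear in only one list. Documents that both retrievers agree on rise to the top even when neither ranks them first.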
Trade-offs
Hybrid is not a free lunch.
- Latency adds up. You run two retrievers. Run in parallel, the added latency is roughly that of the slower retriever plus the fusion step, but every query still costs two index lookups.
- Storage grows. You maintain both a lexical index and a vector index.
- Tuning has more knobs. Top-k from each side, fusion weights, and possibly a reranker on top.
- BM25 is sensitive to tokenization. Stopwords, stemming, and language-specific tokenizers matter for short queries.
But the upside is real. On enterprise corpora, hybrid search can improve top-10 recall by roughly 5-15 percentage points over either method alone, though the exact gain depends on the corpus and the query mix. The improvement is largest on queries containing exact identifiers (codes, names, versions) that vector search routinely fumbles.
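Top-k recall is straightforward to compute once you have labeled queries. The sketch below assumes each evaluation query comes with a list of retrieved ids and a list of known-relevant ids; the helper names and the sample data are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # nothing to find; treated as 0 in this sketch
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)

def mean_recall_at_k(runs, k=10):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

eval_runs = [
    (["d3", "d7", "d1"], ["d1", "d9"]),  # finds 1 of 2 relevant docs
    (["d2", "d4", "d5"], ["d2"]),        # finds 1 of 1 relevant docs
]
mean_recall = mean_recall_at_k(eval_runs, k=3)
```

Running the same evaluation set through BM25 alone, vectors alone, and the fused retriever gives three numbers you can compare directly.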
Practical Tips
- Start with RRF before weighted fusion. It avoids the normalization headache and is competitive in most benchmarks.
- Pull top-50 to top-100 from each side, fuse, then return top-10. Fusing too few candidates throws away the second retriever’s contribution.
- Add a reranker on top of fusion for the best quality. A cross-encoder reranker on the top-30 fused results often beats any fusion tuning.
- Tune BM25 parameters (k1, b). Defaults (k1=1.5, b=0.75) are reasonable but worth a sweep on your data.
- Lowercase and strip punctuation before BM25. Tokenization mismatch is the most common reason BM25 underperforms.
- Apply the same metadata filters to both indexes. A query that filters by date or owner should narrow both sides before retrieval, not be applied post-fusion.
- Measure on real queries. Public benchmark wins do not always transfer. Build an eval set from your own logs.
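For comparison with RRF, here is a sketch of the weighted-fusion alternative mentioned in the first tip. It assumes min-max normalization per retriever and a single weight `alpha` for the lexical side; both are illustrative choices rather than a standard, and the function names and sample scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(lexical_scores, dense_scores, alpha=0.5):
    """alpha weights the lexical side; (1 - alpha) weights the dense side."""
    lex = min_max(lexical_scores)
    den = min_max(dense_scores)
    fused = {
        doc_id: alpha * lex.get(doc_id, 0.0) + (1 - alpha) * den.get(doc_id, 0.0)
        for doc_id in set(lex) | set(den)
    }
    # Best score first; ties broken by id so the output is deterministic.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

bm25_raw = {"a": 12.0, "b": 3.0, "c": 0.5}       # unbounded BM25 scores
cosine_raw = {"c": 0.82, "a": 0.80, "d": 0.40}   # cosine similarities

balanced = [d for d, _ in weighted_fusion(bm25_raw, cosine_raw, alpha=0.5)]
lexical_heavy = [d for d, _ in weighted_fusion(bm25_raw, cosine_raw, alpha=0.9)]
```

Raising `alpha` from 0.5 to 0.9 moves `b`, a lexical-only hit, above `c`. Min-max scaling is sensitive to outlier scores, which is part of the normalization headache that RRF sidesteps.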
A surprising side benefit: BM25 acts as a sanity check on your embeddings. If BM25 consistently dominates your fused results, your embedding model is probably not strong on your domain and is worth re-evaluating.
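One way to put a number on "dominates" is to check where each fused result came from. The sketch below is one possible definition, not an established metric: it reports the share of fused results found only in the lexical candidates, only in the dense candidates, or in both. The function name `source_share` is hypothetical.

```python
def source_share(fused_ids, lexical_ids, dense_ids):
    """Share of fused results contributed by each retriever's candidate list."""
    lexical, dense = set(lexical_ids), set(dense_ids)
    counts = {"lexical_only": 0, "dense_only": 0, "both": 0}
    for doc_id in fused_ids:
        in_lexical, in_dense = doc_id in lexical, doc_id in dense
        if in_lexical and in_dense:
            counts["both"] += 1
        elif in_lexical:
            counts["lexical_only"] += 1
        elif in_dense:
            counts["dense_only"] += 1
    total = len(fused_ids) or 1  # avoid dividing by zero on empty results
    return {name: n / total for name, n in counts.items()}

share = source_share(
    fused_ids=["a", "c", "b", "d"],
    lexical_ids=["a", "b", "c"],
    dense_ids=["c", "a", "d"],
)
```

Averaged over a sample of real queries, a persistently high `lexical_only` share would be the signal described above.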
Wrap-up
Hybrid search is one of the highest-value upgrades for most RAG systems. Vector retrieval handles meaning, BM25 handles literals, and reciprocal rank fusion glues them together in a few lines of code. Add a reranker on top once hybrid is stable. Two indexes, one fusion step, and noticeably better answers.