RAG Hybrid Search: BM25 + Vectors

Intermediate 10 min read

What you'll learn

✓ Why pure vector search misses keyword queries
✓ How BM25 ranks documents using term statistics
✓ Score fusion with reciprocal rank fusion (RRF)
✓ When hybrid helps and when it adds noise
✓ Tuning weights between lexical and semantic

Prerequisites

• Familiar with how APIs work

What and Why

Dense vector search is brilliant at “what does this mean?” queries. It struggles with exact-match queries: product SKUs, error codes, function names, rare acronyms, and any token the embedding model has not seen often. BM25, a classic lexical scorer, handles these effortlessly because it matches words directly.

Hybrid search runs both retrievers, fuses the rankings, and returns a list that combines the strengths of each. It is one of the highest-ROI upgrades you can make to a RAG pipeline.

Mental Model

Each retriever is a specialist. BM25 is a literalist: it rewards documents that share exact tokens with the query. Dense vector search is a semanticist: it rewards documents that share meaning with the query even when the words differ.

                       query
                         |
     +-------------------+-------------------+
     v                   v                   v
BM25 search        vector search     (optional filter)
     |                   |
     v                   v
  top-100             top-100
       \               /
         \           /
score fusion (RRF or weighted)
               |
               v
         top-k results
               |
               v
          LLM prompt

Hybrid retrieval with score fusion

You do not pick one. You combine them and let fusion balance the strengths.
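
To make the literalist half concrete, here is a minimal sketch of BM25 scoring written from scratch. It assumes whitespace-tokenized documents and uses one common IDF variant (with +1 inside the logarithm so it stays non-negative). Library implementations such as rank_bm25 differ in these details, so absolute scores will not match. The names bm25_scores and toy_docs are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: how many documents contain each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # no exact token match, no contribution
            # Rare terms get a higher IDF (assumed variant, see lead-in)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # k1 controls term-frequency saturation, b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus
toy_docs = [
    "error e4012 raised when the payment gateway times out".split(),
    "how to handle timeouts in the billing service".split(),
    "release notes for version 2.3".split(),
]
print(bm25_scores(["e4012"], toy_docs))
```

Only the first document contains the token e4012, so it is the only one with a non-zero score. A dense retriever has no such guarantee for a rare identifier, which is the gap hybrid search closes.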

Hands-on Example

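A minimal hybrid retriever using rank_bm25 for lexical and any embedding model for dense. The snippet assumes docs is a list of strings and embed_model is an object whose encode method returns a NumPy float array, as sentence-transformers models do.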

from rank_bm25 import BM25Okapi
import numpy as np

# Lexical index
tokenized = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized)

# Dense index (precomputed)
embeddings = embed_model.encode(docs)  # shape (N, d)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(query, k=10):
    # BM25 scores
    q_tokens = query.lower().split()
    bm_scores = bm25.get_scores(q_tokens)
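    bm_top = np.argsort(bm_scores)[::-1][:100]
    # Drop documents with no lexical match so they earn no rank credit
    bm_top = [i for i in bm_top if bm_scores[i] > 0]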

    # Vector scores
    q_emb = embed_model.encode([query])[0]
    q_emb /= np.linalg.norm(q_emb)
    vec_scores = embeddings @ q_emb
    vec_top = np.argsort(vec_scores)[::-1][:100]

    # Reciprocal Rank Fusion: ranks are 1-based, 60 is the smoothing constant
    fused = {}
    for rank, idx in enumerate(bm_top, start=1):
        fused[idx] = fused.get(idx, 0) + 1 / (60 + rank)
    for rank, idx in enumerate(vec_top, start=1):
        fused[idx] = fused.get(idx, 0) + 1 / (60 + rank)

    return sorted(fused.items(), key=lambda x: -x[1])[:k]

RRF is dead simple and has one tuning knob (the smoothing constant, conventionally 60). Because it uses only each document's rank, not its raw score, it works without normalizing across two very different score scales.
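
If you want to lean toward one retriever, a per-retriever weight is a small extension of RRF. The sketch below is one way to do it, not part of the standard formulation: weighted_rrf, its weights parameter, and the document ids are all hypothetical.

```python
def weighted_rrf(rankings, weights=None, k=60):
    """Fuse ranked lists of document ids; rank 1 is the best position."""
    if weights is None:
        weights = [1.0] * len(rankings)  # plain RRF
    fused = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical ranked lists from each retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_d", "doc_c", "doc_a"]

equal = weighted_rrf([bm25_ranking, vector_ranking])
lexical_heavy = weighted_rrf([bm25_ranking, vector_ranking], weights=[2.0, 1.0])

print([doc_id for doc_id, _ in equal])          # ['doc_a', 'doc_c', 'doc_d', 'doc_b']
print([doc_id for doc_id, _ in lexical_heavy])  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents found by both retrievers stay on top in either setting. Doubling the lexical weight only reorders the tail, lifting the BM25-only doc_b above the vector-only doc_d.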

Trade-offs

Hybrid is not a free lunch.

Latency adds up. You run two retrievers. With careful infra they run in parallel, but each adds an index hit.
Storage doubles. You maintain both a lexical index and a vector index.
Tuning has more knobs. Top-k from each side, fusion weights, and possibly a reranker on top.
BM25 is sensitive to tokenization. Stopwords, stemming, and language-specific tokenizers matter for short queries.

But the upside is real. On typical enterprise corpora, hybrid search often improves top-10 recall by roughly 5-15 percentage points over either method alone, though the exact gain depends on the corpus and the query mix. The improvement is largest on queries containing exact identifiers (codes, names, versions) that vector search routinely fumbles.
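
To check whether a gain like that shows up on your own data, you can compare recall@k for each retriever on a labeled query set. The sketch below assumes you already have relevance labels. The function names, queries, and document ids are invented for illustration and are not measurements.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(run, qrels, k=10):
    """Average recall@k over all labeled queries.

    run:   {query: ranked list of doc ids}
    qrels: {query: set of relevant doc ids}
    """
    values = [recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()]
    return sum(values) / len(values)

# Hypothetical labels and retriever outputs (illustrative only)
qrels = {"error E4012": {"d1", "d7"}, "reset password": {"d3"}}
vector_run = {"error E4012": ["d2", "d7", "d9"], "reset password": ["d3", "d4"]}
hybrid_run = {"error E4012": ["d1", "d7", "d2"], "reset password": ["d3", "d5"]}

print(mean_recall_at_k(vector_run, qrels, k=3))  # 0.75
print(mean_recall_at_k(hybrid_run, qrels, k=3))  # 1.0
```

In this toy data the vector run misses d1 for the error-code query, which is the kind of miss the paragraph above describes.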

Practical Tips

Start with RRF before weighted fusion. It avoids the normalization headache and is competitive in most benchmarks.
Pull top-50 to top-100 from each side, fuse, then return top-10. Fusing too few candidates throws away the second retriever’s contribution.
Add a reranker on top of fusion for the best quality. A cross-encoder reranker on the top-30 fused results often beats any fusion tuning.
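Tune BM25 parameters (k1, b). The rank_bm25 defaults (k1=1.5, b=0.75) are reasonable but worth a sweep on your data. k1 controls how quickly repeated terms stop adding score, and b controls how strongly long documents are penalized.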
Lowercase and strip punctuation before BM25. Tokenization mismatch is the most common reason BM25 underperforms.
Apply metadata filters in both indexes. A query that filters by date or owner should narrow both sides before fusion, not be applied only after fusion.
Measure on real queries. Public benchmark wins do not always transfer. Build an eval set from your own logs.
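
As one illustration of the tokenization tip, here is a small tokenizer sketch that lowercases text and drops punctuation while keeping identifiers such as version strings and error codes in one piece. The regular expression and the tokenize name are assumptions for this example. Whether identifiers should stay whole or be split depends on your corpus.

```python
import re

# Alphanumeric runs, optionally joined by '.', '_' or '-' (keeps 'v2.3.1' whole)
_TOKEN_RE = re.compile(r"[a-z0-9]+(?:[._-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return its tokens without surrounding punctuation."""
    return _TOKEN_RE.findall(text.lower())

print(tokenize("Error: ERR-4012 in parse_config() (v2.3.1)!"))
# ['error', 'err-4012', 'in', 'parse_config', 'v2.3.1']
```

Use the same function for documents at index time and for queries at search time. A mismatch between the two is exactly the tokenization problem described above.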

A surprising side benefit: BM25 acts as a sanity check on your embeddings. If BM25 consistently dominates your fused results, your embedding model is probably weak on your domain and is worth re-evaluating.
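
One rough way to watch for this is to count where each fused result came from. The helper below is a hypothetical diagnostic, not a standard metric: it treats a fused result that appears only in the BM25 candidate list as evidence of lexical dominance.

```python
def source_breakdown(fused_ids, bm25_ids, vector_ids, k=10):
    """Count how many of the top-k fused results came from each retriever."""
    bm25_set, vector_set = set(bm25_ids), set(vector_ids)
    counts = {"bm25_only": 0, "vector_only": 0, "both": 0}
    for doc_id in fused_ids[:k]:
        in_bm25 = doc_id in bm25_set
        in_vector = doc_id in vector_set
        if in_bm25 and in_vector:
            counts["both"] += 1
        elif in_bm25:
            counts["bm25_only"] += 1
        elif in_vector:
            counts["vector_only"] += 1
    return counts

# Hypothetical candidate lists for a single query
breakdown = source_breakdown(
    fused_ids=["d1", "d2", "d3", "d4"],
    bm25_ids=["d1", "d2", "d3"],
    vector_ids=["d3", "d4"],
)
print(breakdown)  # {'bm25_only': 2, 'vector_only': 1, 'both': 1}
```

Aggregated over many real queries, a persistently high bm25_only share is a hint, not proof, that the dense side is contributing little.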

Wrap-up

Hybrid search is one of the highest-value upgrades for most RAG systems. Vector retrieval handles meaning, BM25 handles literals, and reciprocal rank fusion glues them together in a few lines of code. Add a reranker on top once hybrid is stable. Two indexes, one fusion step, and noticeably better answers.