RAG Hybrid Search: BM25 + Vectors

Combine lexical BM25 with dense vector search to recover the queries each method misses on its own and boost RAG retrieval quality.

4 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • Why pure vector search misses keyword queries
  • How BM25 ranks documents using term statistics
  • Score fusion with reciprocal rank fusion (RRF)
  • When hybrid helps and when it adds noise
  • Tuning weights between lexical and semantic

Prerequisites

  • Familiar with how APIs work

What and Why

Dense vector search is brilliant at “what does this mean?” queries. It struggles with exact-match queries: product SKUs, error codes, function names, rare acronyms, and any token the embedding model has not seen often. BM25, a classic lexical scorer, handles these effortlessly because it matches words directly.
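To make "matches words directly" concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is an illustration rather than the exact formula any particular library uses: it assumes a common non-negative IDF variant, whitespace tokenization, and a made-up three-document corpus, and implementations differ in these details.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a plain Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # no exact token match, no contribution
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b normalizes for document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical corpus: only the first document contains the error code.
docs = [
    "restart the service when error e1234 appears in the log",
    "the application may fail to start after an upgrade",
    "check the log for details about the failure",
]
tokenized = [d.split() for d in docs]
scores = bm25_scores("error e1234".split(), tokenized)
```

Only the first document receives a non-zero score, because it is the only one that contains the query tokens. A document that describes the same failure in different words scores zero, which is the gap dense retrieval fills.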

Hybrid search runs both retrievers, fuses the rankings, and returns a list that combines the strengths of each. It is one of the highest-ROI upgrades you can make to a RAG pipeline.

Mental Model

Each retriever is a specialist. BM25 is a literalist: it loves queries that share exact tokens with the document. Dense vectors are a semanticist: they love queries that share meaning even when the words differ.

                  query
                    |
    +---------------+---------------+
    v               v               v
BM25 search   vector search   (optional filter)
    |               |
    v               v
 top-100         top-100
     \             /
      \           /
   score fusion (RRF or weighted)
            |
            v
      top-k results
            |
            v
       LLM prompt

Figure: Hybrid retrieval with score fusion

You do not pick one. You combine them and let fusion balance the strengths.

Hands-on Example

A minimal hybrid retriever that uses rank_bm25 for lexical scoring and any embedding model for dense retrieval.

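from rank_bm25 import BM25Okapi
import numpy as np

# Assumed inputs (not defined in this snippet):
#   docs        - list[str], the corpus
#   embed_model - any encoder whose .encode(list[str]) returns an (N, d) array,
#                 e.g. a sentence-transformers model

# Lexical index
tokenized = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized)

# Dense index (precomputed), L2-normalized so a dot product is cosine similarity
embeddings = np.asarray(embed_model.encode(docs), dtype=np.float32)  # shape (N, d)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(query, k=10, n_candidates=100, rrf_k=60):
    # BM25 scores, best first
    q_tokens = query.lower().split()
    bm_scores = bm25.get_scores(q_tokens)
    bm_top = np.argsort(bm_scores)[::-1][:n_candidates]
    # Drop documents that share no token with the query (score 0);
    # otherwise they would earn RRF credit in arbitrary order
    bm_top = [i for i in bm_top if bm_scores[i] > 0]

    # Vector scores (cosine similarity), best first
    q_emb = np.asarray(embed_model.encode([query])[0], dtype=np.float32)
    q_emb /= np.linalg.norm(q_emb)
    vec_scores = embeddings @ q_emb
    vec_top = np.argsort(vec_scores)[::-1][:n_candidates]

    # Reciprocal Rank Fusion; ranks are 1-based, as in the usual formulation
    fused = {}
    for ranking in (bm_top, vec_top):
        for rank, idx in enumerate(ranking, start=1):
            fused[int(idx)] = fused.get(int(idx), 0.0) + 1.0 / (rrf_k + rank)

    # List of (doc_index, fused_score) pairs, best first
    return sorted(fused.items(), key=lambda x: -x[1])[:k]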

RRF is dead simple and has one tuning knob (the constant added to each rank, conventionally 60). Because it uses only ranks, it works without normalizing across two very different score scales.
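To see the arithmetic on its own, here is a small standalone sketch of reciprocal rank fusion over lists of document IDs. Each document's fused score is the sum, over the rankings that contain it, of 1 / (k + rank), with ranks counted from 1. The function name `rrf` and the document IDs are illustrative only.

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    fused = {}
    for ranking in rankings:
        # Ranks start at 1, so the best hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings from the two retrievers, best hit first.
bm25_ranking = ["d3", "d1", "d7"]
vector_ranking = ["d1", "d5", "d3"]
fused = rrf([bm25_ranking, vector_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here d1 (ranks 2 and 1) edges out d3 (ranks 1 and 3), and both beat d5 and d7, which appear in only one list. Documents that both retrievers agree on rise to the top without any score from either side being compared directly.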

Trade-offs

Hybrid is not a free lunch.

  • Latency adds up. You run two retrievers. With careful infra they run in parallel, but each adds an index hit.
  • Storage doubles. You maintain both a lexical index and a vector index.
  • Tuning has more knobs. Top-k from each side, fusion weights, and possibly a reranker on top.
  • BM25 is sensitive to tokenization. Stopwords, stemming, and language-specific tokenizers matter for short queries.
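The tokenization point is easiest to see on identifiers. The sketch below is one possible tokenizer, not a recommendation for every corpus: the regular expression is an assumption chosen to keep codes and version strings intact while dropping surrounding punctuation. Whatever tokenizer you choose, use the same one for documents and queries.

```python
import re

# Keep letters/digits together with inner '.', '_' or '-' so that identifiers
# such as "ERR-4012" or "v2.1.0" survive as single tokens.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[._-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and extract tokens, dropping surrounding punctuation."""
    return TOKEN_RE.findall(text.lower())

text = "Fix ERR-4012 in v2.1.0, please!"
naive = text.lower().split()   # trailing punctuation stays glued to tokens
cleaned = tokenize(text)       # identifiers kept whole, punctuation dropped
```

With plain whitespace splitting, the version token becomes "v2.1.0," (with the comma), so a query for "v2.1.0" would not match it. The trade-off of this pattern is that hyphenated phrases also stay as single tokens, so it is worth checking against the kinds of queries you actually receive.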

But the upside is real. On typical enterprise corpora, hybrid search can improve top-10 recall by roughly 5-15 percentage points over either method alone, though the exact gain depends on your corpus and query mix. The improvement is largest on queries containing exact identifiers (codes, names, versions) that vector search routinely fumbles.
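To check a recall claim like this on your own data, you need a labeled set of queries and a recall@k function. The following is a minimal sketch; the query IDs, document IDs, and retriever outputs are invented purely to show the calculation and are not measurements.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(results_by_query, relevant_by_query, k=10):
    """Average recall@k over every query in a labeled evaluation set."""
    recalls = [
        recall_at_k(results_by_query[q], relevant_by_query[q], k)
        for q in relevant_by_query
    ]
    return sum(recalls) / len(recalls)

# Hypothetical labels and retriever outputs for two queries.
relevant = {"q1": ["d1", "d2"], "q2": ["d9"]}
vector_only = {"q1": ["d1", "d4", "d5"], "q2": ["d3", "d4", "d6"]}
hybrid = {"q1": ["d1", "d2", "d4"], "q2": ["d9", "d3", "d4"]}

vector_recall = mean_recall_at_k(vector_only, relevant, k=3)
hybrid_recall = mean_recall_at_k(hybrid, relevant, k=3)
```

Running the same function over BM25-only, vector-only, and fused results on the same query set gives a like-for-like comparison of the three configurations.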

Practical Tips

  • Start with RRF before weighted fusion. It avoids the normalization headache and is competitive in most benchmarks.
  • Pull top-50 to top-100 from each side, fuse, then return top-10. Fusing too few candidates throws away the second retriever’s contribution.
  • Add a reranker on top of fusion for the best quality. A cross-encoder reranker on the top-30 fused results often beats any fusion tuning.
  • Tune BM25 parameters (k1, b). The rank_bm25 defaults (k1=1.5, b=0.75) are reasonable but worth a sweep on your data. Other engines may ship different defaults.
  • Lowercase and strip punctuation before BM25, and tokenize queries exactly the way you tokenized documents. Tokenization mismatch is a common reason BM25 underperforms.
  • Apply the same metadata filters to both indexes. A query that filters by date or owner should narrow both sides before fusion, not be applied post-fusion.
  • Measure on real queries. Public benchmark wins do not always transfer. Build an eval set from your own logs.
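For comparison with RRF, here is a sketch of the weighted fusion mentioned in the first tip. It assumes min-max normalization of each retriever's candidate scores and a single weight `alpha`; both are illustrative choices, and the scores shown are made up.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # No spread means no ranking signal from this retriever.
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(lexical_scores, dense_scores, alpha=0.5):
    """Blend normalized scores; alpha=1.0 is pure BM25, alpha=0.0 is pure vector."""
    lex_norm = min_max(lexical_scores)
    dense_norm = min_max(dense_scores)
    fused = {}
    for doc_id in set(lex_norm) | set(dense_norm):
        # A document missing from one candidate list contributes 0 from that side.
        fused[doc_id] = (alpha * lex_norm.get(doc_id, 0.0)
                         + (1 - alpha) * dense_norm.get(doc_id, 0.0))
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores: BM25 is unbounded, cosine similarity is at most 1.
lexical = {"d1": 12.4, "d2": 3.1, "d3": 0.5}
dense = {"d2": 0.82, "d3": 0.80, "d4": 0.41}

balanced = [doc_id for doc_id, _ in weighted_fusion(lexical, dense, alpha=0.5)]
lexical_heavy = [doc_id for doc_id, _ in weighted_fusion(lexical, dense, alpha=0.9)]
```

With `alpha=0.5` the document found by both retrievers (d2) comes first, while `alpha=0.9` promotes the strongest BM25 hit (d1). The normalization step is the headache RRF avoids: min-max is sensitive to outliers and assigns the weakest candidate in each list a score of 0, the same as a document that was never retrieved.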

A surprising side benefit: BM25 acts as a sanity check on your embeddings. If results found only by BM25 consistently dominate your fused top-k, your embedding model is probably not strong on your domain and is worth re-evaluating.
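One rough way to quantify "dominates" is to count, for each fused result, which candidate list it came from. The helper below is a sketch under that assumption; the function name and the IDs are hypothetical, and what share counts as a warning sign is something to calibrate on your own queries.

```python
def retriever_share(fused_top, bm25_top, vector_top):
    """Classify each fused result by which candidate list(s) it came from."""
    bm25_set, vector_set = set(bm25_top), set(vector_top)
    counts = {"bm25_only": 0, "vector_only": 0, "both": 0}
    for doc_id in fused_top:
        in_bm25, in_vec = doc_id in bm25_set, doc_id in vector_set
        if in_bm25 and in_vec:
            counts["both"] += 1
        elif in_bm25:
            counts["bm25_only"] += 1
        elif in_vec:
            counts["vector_only"] += 1
    # Guard against an empty result list to avoid dividing by zero.
    total = len(fused_top) or 1
    return {name: n / total for name, n in counts.items()}

# Hypothetical fused top-4 and the candidate lists it was built from.
shares = retriever_share(
    fused_top=["d1", "d2", "d3", "d9"],
    bm25_top=["d1", "d2", "d3"],
    vector_top=["d3", "d9"],
)
```

Averaged over many logged queries, a persistently high `bm25_only` share suggests the dense side is contributing little and the embedding model deserves a closer look.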

Wrap-up

Hybrid search is one of the highest-value upgrades for most RAG systems. Vector retrieval handles meaning, BM25 handles literals, and reciprocal rank fusion glues them together in a few lines of code. Add a reranker on top once hybrid is stable. Two indexes, one fusion step, and noticeably better answers.