RAG Reranking Models Overview

Intermediate, 10 min read

What you'll learn

✓ Why retrieval and reranking are different problems
✓ How cross-encoders compute query-document relevance
✓ Where ColBERT-style late interaction fits
✓ LLM-based rerankers and their cost profile
✓ How to integrate a reranker in a RAG pipeline

Prerequisites

• Familiarity with how APIs work

What and Why

Vector retrieval is fast but coarse. It encodes the query and each document independently and compares the resulting vectors, so document vectors can be precomputed but the model never sees the query and the document side by side. A reranker takes the opposite approach: it reads the query and one candidate document together and produces a fine-grained relevance score.

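The pattern is two-stage: a cheap retriever fetches the top 50-100 candidates, then a more expensive reranker reorders them so the most relevant ones rise to the top. The combination usually beats either component alone: the retriever by itself is imprecise at the top of the list, and the reranker by itself is too slow to run over the whole corpus.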
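The two-stage pattern can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: `two_stage_search`, `word_overlap`, and `phrase_match` are hypothetical names, and the two toy scorers merely stand in for a real retriever (BM25 or a bi-encoder) and a real reranker (a cross-encoder).

```python
from typing import Callable, Sequence


def two_stage_search(
    query: str,
    corpus: Sequence[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    n_candidates: int = 100,
    top_k: int = 10,
) -> list[str]:
    """Shortlist with a cheap scorer, then reorder the shortlist with a precise one."""
    # Stage 1: score every document cheaply and keep the best n_candidates.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: run the expensive scorer only on the shortlist.
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_k]


# Toy stand-ins (assumptions for illustration only).
def word_overlap(query: str, doc: str) -> float:
    """Cheap scorer: number of query words that also appear in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """'Precise' scorer: 1.0 only if the whole query appears as a phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0


corpus = [
    "password reset for your router admin panel",
    "reset your password from the account settings page",
    "how to reset a router to factory settings",
    "billing and invoices overview",
]
results = two_stage_search(
    "reset your password", corpus, word_overlap, phrase_match,
    n_candidates=3, top_k=2,
)
```

In this toy corpus the cheap scorer cannot separate the first two documents (both share three words with the query), while the second stage promotes the one that actually contains the phrase.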

Mental Model

Retrieval is like a librarian giving you a shelf of plausibly relevant books. Reranking is like an editor reading the first paragraph of each book and reordering them by how well they answer your question.

query
|
v
fast retriever (bi-encoder / BM25) -> top 100 candidates
|
v
reranker (cross-encoder) sees [query, doc] together
|
v
top 10 reordered by precise relevance
|
v
LLM prompt

Two-stage retrieval with a reranker

The reranker is slower per pair but only processes ~100 pairs, not millions of documents. The economics work out.
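A back-of-the-envelope calculation makes the point concrete. The figures below are assumptions for illustration: 5 ms per pair is the low end of the GPU range quoted later in this article, pairs are scored one at a time (batching would lower the totals), and the corpus size of one million documents is made up.

```python
def rerank_latency_ms(n_pairs: int, ms_per_pair: float) -> float:
    """Sequential estimate: each (query, doc) pair costs one forward pass."""
    return n_pairs * ms_per_pair


MS_PER_PAIR = 5.0  # assumed per-pair cost on a GPU

# Reranking only the retriever's shortlist.
shortlist_ms = rerank_latency_ms(100, MS_PER_PAIR)

# Scoring every document in a (hypothetical) one-million-document corpus.
full_corpus_ms = rerank_latency_ms(1_000_000, MS_PER_PAIR)
full_corpus_hours = full_corpus_ms / 1000 / 3600

print(f"shortlist: {shortlist_ms:.0f} ms per query")
print(f"full corpus: {full_corpus_hours:.2f} hours per query")
```

Under these assumptions the shortlist costs half a second per query, while scoring the whole corpus would take more than an hour per query.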

Hands-on Example

A cross-encoder reranker with sentence-transformers.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

# Assume `query` is the user's question and `candidates` is the top-100
# from your retriever, each with a `.text` attribute
pairs = [[query, doc.text] for doc in candidates]
scores = reranker.predict(pairs)

# Attach scores and re-sort
for c, s in zip(candidates, scores):
    c.rerank_score = float(s)

reranked = sorted(candidates, key=lambda c: c.rerank_score, reverse=True)[:10]

For a hosted reranker (Cohere, Voyage, Jina), the API is similar:

import cohere
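# Assumes the API key is available to the client (for example via the CO_API_KEY environment variable)
co = cohere.Client()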
resp = co.rerank(model="rerank-english-v3.0",
                 query=query,
                 documents=[c.text for c in candidates],
                 top_n=10)
top10 = [candidates[r.index] for r in resp.results]

The interface barely changes; the quality and latency profiles change a lot.
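Because both variants boil down to "rank these documents against this query", they can sit behind one small interface so that swapping rerankers does not touch the rest of the pipeline. The sketch below is an assumption, not part of either library: `Reranker`, `PairScoreReranker`, and `fake_scores` are hypothetical names, and `fake_scores` only stands in for a real batch scorer such as `CrossEncoder.predict`.

```python
from typing import Callable, Protocol, Sequence


class Reranker(Protocol):
    def rerank(self, query: str, docs: Sequence[str], top_n: int) -> list[int]:
        """Return indices into `docs`, best match first."""
        ...


class PairScoreReranker:
    """Adapts any function that scores a batch of [query, doc] pairs."""

    def __init__(self, score_pairs: Callable[[list[list[str]]], Sequence[float]]):
        self._score_pairs = score_pairs

    def rerank(self, query: str, docs: Sequence[str], top_n: int) -> list[int]:
        # One batched call for all pairs, then sort indices by descending score.
        scores = self._score_pairs([[query, d] for d in docs])
        order = sorted(range(len(docs)), key=lambda i: float(scores[i]), reverse=True)
        return order[:top_n]


def fake_scores(pairs: list[list[str]]) -> list[float]:
    """Toy scorer: counts query words that appear in the document."""
    return [
        float(sum(w in d.lower().split() for w in q.lower().split()))
        for q, d in pairs
    ]


docs = ["cats purr", "dogs bark loudly", "loud dogs bark at cats"]
reranker: Reranker = PairScoreReranker(fake_scores)
top = reranker.rerank("dogs bark", docs, top_n=2)
```

A hosted reranker could implement the same `rerank` method by returning the `index` of each result, as in the Cohere snippet above.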

Reranker Families

There are three common families. Each makes a different trade.

• Cross-encoders concatenate the query and the document and run a single transformer pass over the pair. Highest quality among the three, with latency around 5-20 ms per pair on a GPU.
• Late-interaction models (ColBERT) keep per-token embeddings for queries and documents and compute MaxSim: each query token is matched to its most similar document token, and those per-token maxima are summed. Because document embeddings are precomputed, they are faster than cross-encoders on long candidate lists, at the cost of larger storage.
• LLM rerankers prompt an LLM with a question such as “how relevant is this document to this query, 0-10?” and use the returned score. Very strong on nuanced relevance, but slow and expensive.
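The MaxSim scoring used by late-interaction models is compact enough to write out. The following is a minimal NumPy sketch with made-up two-dimensional token embeddings; `maxsim` is a hypothetical helper, and a real ColBERT-style model would produce the per-token embeddings itself.

```python
import numpy as np


def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction score between one query and one document.

    Both inputs have shape (n_tokens, dim). For each query token, take the
    cosine similarity to its best-matching document token, then sum.
    """
    # L2-normalize rows so that dot products are cosine similarities.
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())


# Toy embeddings (assumptions): two query tokens pointing along each axis.
query_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # covers both query tokens
doc_b = np.array([[1.0, 0.0], [1.0, 0.0]])              # covers only the first

score_a = maxsim(query_tokens, doc_a)
score_b = maxsim(query_tokens, doc_b)
```

Here `doc_a` scores higher because every query token finds a close match in it, whereas the second query token has no counterpart in `doc_b`.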

Trade-offs

A reranker is not always worth it.

• Adds latency. Even a fast cross-encoder scoring 100 pairs adds roughly 200-500 ms per query.
• Adds cost. GPU inference (self-hosted) or per-call billing (hosted).
• Diminishing returns at low candidate counts. Reranking 10 results changes little because there is little to reorder; reranking 100 can change the top results substantially.
• Domain mismatch hurts. A general-purpose reranker can underperform BM25 on highly technical text it has never seen.

The win is in top-k precision. On a typical RAG eval, adding a reranker after hybrid search can raise the share of queries with the right document in the top 3 from about 70% to 85-90%. That difference shows up directly in answer quality.
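The "right doc is in the top k" figure is a hit rate, and it is simple to compute on your own labeled queries. The sketch below uses invented document ids and relevance labels purely as toy data; `hit_rate_at_k` is a hypothetical helper name.

```python
def hit_rate_at_k(rankings: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc id in the top k."""
    if not rankings or len(rankings) != len(relevant):
        raise ValueError("need one non-empty ranking and one label set per query")
    hits = sum(1 for ranked, rel in zip(rankings, relevant) if rel & set(ranked[:k]))
    return hits / len(rankings)


# Toy data (assumptions): four queries, three candidates each.
relevant = [{"d1"}, {"d4"}, {"d5"}, {"d9"}]
before = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"], ["d1", "d2", "d3"]]
after = [["d1", "d3", "d7"], ["d4", "d2", "d9"], ["d5", "d6", "d8"], ["d1", "d2", "d3"]]

top1_before = hit_rate_at_k(before, relevant, k=1)
top1_after = hit_rate_at_k(after, relevant, k=1)
top3_before = hit_rate_at_k(before, relevant, k=3)
top3_after = hit_rate_at_k(after, relevant, k=3)
```

In this toy data the top-1 hit rate improves after reordering while the top-3 hit rate does not move: a reranker can only reorder what the retriever returned, so the metric is informative only when k is smaller than the candidate count.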

Practical Tips

• Start with a hosted reranker. Cohere, Voyage, or Jina rerankers are excellent baselines that require almost no infrastructure work.
• Send 50-100 candidates to the reranker. Fewer gives the reranker too little to work with; more adds cost and latency for little gain.
• Cache reranker scores when possible. Repeated queries on the same candidate set should not re-pay.
• Reranking after hybrid search is the sweet spot. BM25 + vector + reranker outperforms any single retriever and any pair of retrievers in most benchmarks.
• Truncate document text for the reranker. Many cross-encoders have small context windows (often around 512 tokens). Send a passage, not the whole document.
• Evaluate on your own queries. Public benchmarks rank models well overall, but your domain may invert that order.
• Batch the predictions. Calling the reranker 100 times sequentially is much slower than calling it once with 100 pairs.
• Track top-k recall before and after reranking. This tells you whether the reranker is pulling the right docs to the top, or just shuffling already-good results.
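The caching and batching tips combine naturally: look up what is already cached, then score only the missing pairs in a single batched call. The sketch below is one possible shape, with assumptions flagged: `CachedScorer` and `fake_score_pairs` are hypothetical names, the cache is an in-memory dict keyed by document text (a real system would more likely key on a document id plus the reranker model name), and the toy scorer stands in for something like `CrossEncoder.predict`.

```python
from typing import Callable, Sequence


class CachedScorer:
    """Caches (query, doc) scores and sends only cache misses to the scorer."""

    def __init__(self, score_pairs: Callable[[list[list[str]]], Sequence[float]]):
        self._score_pairs = score_pairs
        self._cache: dict[tuple[str, str], float] = {}

    def score(self, query: str, docs: Sequence[str]) -> list[float]:
        # De-duplicate while keeping order, then keep only uncached docs.
        missing = [d for d in dict.fromkeys(docs) if (query, d) not in self._cache]
        if missing:
            # One batched call for all misses instead of one call per pair.
            new_scores = self._score_pairs([[query, d] for d in missing])
            for d, s in zip(missing, new_scores):
                self._cache[(query, d)] = float(s)
        return [self._cache[(query, d)] for d in docs]


seen_batches: list[int] = []  # batch sizes the toy scorer was asked to handle


def fake_score_pairs(pairs: list[list[str]]) -> list[float]:
    """Toy scorer (assumption): score is just the document length."""
    seen_batches.append(len(pairs))
    return [float(len(d)) for _, d in pairs]


scorer = CachedScorer(fake_score_pairs)
first = scorer.score("q", ["a", "bb"])          # two misses, one batch of 2
second = scorer.score("q", ["a", "bb"])         # fully cached, no scorer call
third = scorer.score("q", ["a", "bb", "ccc"])   # one miss, one batch of 1
```

The scorer is invoked twice across the three requests, once with two pairs and once with a single new pair.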

A useful pattern: run the reranker only when retrieval scores are ambiguous. If the top retrieval result is far ahead of the second, skip reranking and save the cost.
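One way to express that gating rule is a score-gap check. This is a sketch under assumptions: `should_rerank` is a hypothetical helper, the scores are assumed to be on a comparable scale (for example cosine similarities), and the `min_gap` value of 0.15 is arbitrary and would need tuning on your own data.

```python
from typing import Sequence


def should_rerank(retrieval_scores: Sequence[float], min_gap: float = 0.15) -> bool:
    """Rerank only when the top retrieval result is not clearly ahead."""
    if len(retrieval_scores) < 2:
        return False  # nothing to reorder
    top, runner_up = sorted(retrieval_scores, reverse=True)[:2]
    return (top - runner_up) < min_gap


clear_winner = should_rerank([0.91, 0.52, 0.50])  # gap 0.39: skip the reranker
ambiguous = should_rerank([0.71, 0.69, 0.40])     # gap 0.02: run the reranker
```

The check only looks at the gap between the first and second result; whether that is a good proxy for ambiguity is something to verify against your own evaluation set.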

Wrap-up

Rerankers are one of the cheapest ways to get a noticeable lift in top-k retrieval quality. Pair a fast retriever with a cross-encoder or hosted reranker, send 50-100 candidates, and the LLM receives much more relevant context. The added latency and cost are usually worth it, and the gain compounds because better context produces shorter, more reliable LLM answers.