
RAG Reranking Models Overview

Add a reranker on top of vector retrieval to dramatically improve top-k quality with cross-encoders, late interaction, and LLM rerankers.

4 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • Why retrieval and reranking are different problems
  • How cross-encoders compute query-document relevance
  • Where ColBERT-style late interaction fits
  • LLM-based rerankers and their cost profile
  • How to integrate a reranker in a RAG pipeline

Prerequisites

  • Familiarity with how APIs work

What and Why

Vector retrieval is fast but coarse. It encodes the query and each document independently and compares the resulting vectors, so the two texts never interact inside the model. A reranker takes the opposite approach: it reads the query and one candidate document together and produces a fine-grained relevance score.
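The independent-encoding step can be sketched in a few lines. This is a minimal illustration, not a real retriever: the vectors below are hand-written stand-ins for what an embedding model would produce, and the names `doc_vectors` and `query_vector` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, computed ahead of time in a real system.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.2, 0.8, 0.1],
    "doc_c": [0.7, 0.3, 0.2],
}
query_vector = [1.0, 0.2, 0.0]  # assumed output of the same encoder

# At query time only the query is encoded; each comparison is a cheap
# vector operation, which is why this stage scales to large corpora.
ranked = sorted(
    doc_vectors,
    key=lambda doc_id: cosine(query_vector, doc_vectors[doc_id]),
    reverse=True,
)
print(ranked)
```

Because the document vectors never see the query, the score cannot capture fine-grained interactions between the two texts. That gap is what the reranker fills.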

The pattern is two-stage: a cheap retriever fetches the top 50-100 candidates, then a more expensive reranker reorders them to surface the truly relevant ones. The combined system is much better than either component alone.

Mental Model

Retrieval is like a librarian giving you a shelf of plausibly relevant books. Reranking is like an editor reading the first paragraph of each book and reordering them by how well they answer your question.

query
|
v
fast retriever (bi-encoder / BM25) -> top 100 candidates
|
v
reranker (cross-encoder) sees [query, doc] together
|
v
top 10 reordered by precise relevance
|
v
LLM prompt
Figure: Two-stage retrieval with a reranker

The reranker is slower per pair but only processes ~100 pairs, not millions of documents. The economics work out.
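A back-of-the-envelope calculation makes the economics concrete. The sketch below assumes 5 ms per pair (the low end of the GPU range this article quotes for cross-encoders), sequential scoring, and a hypothetical corpus of one million documents; batching usually lowers the real figure.

```python
def rerank_latency_ms(num_pairs, ms_per_pair):
    """Rough sequential estimate: one model pass per (query, doc) pair."""
    return num_pairs * ms_per_pair

per_pair_ms = 5          # assumption: low end of the per-pair range
candidates = 100         # what the retriever hands to the reranker
corpus_size = 1_000_000  # assumption: hypothetical corpus size

two_stage_ms = rerank_latency_ms(candidates, per_pair_ms)
brute_force_ms = rerank_latency_ms(corpus_size, per_pair_ms)

print(f"rerank 100 candidates: {two_stage_ms} ms")
print(f"rerank whole corpus:   {brute_force_ms / 60_000:.0f} minutes")
```

Under these assumptions, reranking the shortlist costs about half a second, while scoring the full corpus with the same model would take over an hour per query.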

Hands-on Example

A cross-encoder reranker with sentence-transformers.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

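# Assume `query` is a string and `candidates` is the top-100 list from your
# retriever, where each item exposes its passage as `.text`
pairs = [[query, doc.text] for doc in candidates]
# One batched call scores every (query, passage) pair; higher means more relevant
scores = reranker.predict(pairs)

# Attach scores and re-sort
for c, s in zip(candidates, scores):
    c.rerank_score = float(s)

# Keep the 10 highest-scoring candidates
reranked = sorted(candidates, key=lambda c: c.rerank_score, reverse=True)[:10]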

For a hosted reranker (Cohere, Voyage, Jina), the API is similar:

import cohere
# Pass api_key="..." explicitly if the key is not configured in your environment
co = cohere.Client()
resp = co.rerank(model="rerank-english-v3.0",
                 query=query,
                 documents=[c.text for c in candidates],
                 top_n=10)
top10 = [candidates[r.index] for r in resp.results]

The interface barely changes; the quality and latency profiles change a lot.

Reranker Families

There are three common families. Each makes a different trade.

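  • Cross-encoders concatenate query and document and run a single transformer pass. Highest quality per unit of cost among the three, with latency of roughly 5-20 ms per pair on a GPU, depending on model size and passage length.
  • Late-interaction models (ColBERT) keep per-token embeddings for queries and documents and compute MaxSim: each query token is matched against its most similar document token, and the maxima are summed. Because document token embeddings are precomputed, this is faster than a cross-encoder on long candidate lists, at the cost of larger storage.
  • LLM rerankers prompt an LLM with a question such as “how relevant is this document to that query, 0-10?” and use the returned score. Very strong on nuanced relevance, but slow and expensive.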
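Of the three families, late interaction is the least intuitive, so here is a minimal sketch of the MaxSim score. It uses plain dot products over toy two-dimensional token vectors; the values are made up for illustration and are not model output, and real implementations operate on normalized embeddings in batched tensor form.

```python
def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token vector, take its best
    dot-product match among the document token vectors, then sum the maxima."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical per-token embeddings (one vector per token).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_on_topic = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
doc_off_topic = [[0.9, 0.1], [0.8, 0.2]]

print(maxsim_score(query_tokens, doc_on_topic))
print(maxsim_score(query_tokens, doc_off_topic))
```

The on-topic document scores higher because both query tokens find a close match, while the off-topic document only matches the first one. The document token vectors can be stored in advance, which is where the extra storage goes.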

Trade-offs

A reranker is not always worth it.

  • Adds latency. Even a fast cross-encoder scoring 100 pairs in batches typically adds 200-500 ms.
  • Adds cost. GPU inference (self-hosted) or per-call billing (hosted).
  • Diminishing returns at low candidate counts. Reranking 10 results changes little. Reranking 100 changes a lot.
  • Domain mismatch hurts. A general-purpose reranker can underperform BM25 on highly technical text it has never seen.

The win is in top-k precision. As a rough illustration, adding a reranker after hybrid search can lift top-3 relevance on a typical RAG eval from “the right doc is there 70% of the time” to “85-90% of the time.” Your own numbers will vary with domain and model, but a difference of that size shows up directly in answer quality.
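The "right doc is there" figure is a hit rate at k, which is easy to compute once you have labeled queries. The sketch below assumes one known-relevant document id per query; the ids and rankings are invented toy data, not measurements.

```python
def hit_rate_at_k(ranked_ids_per_query, relevant_id_per_query, k=3):
    """Fraction of queries whose known-relevant document appears in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)

# Toy labeled eval set: one relevant doc id per query (hypothetical data).
relevant = ["d1", "d7", "d4", "d9"]
before_rerank = [
    ["d3", "d5", "d2", "d1"],
    ["d7", "d2", "d8", "d6"],
    ["d4", "d1", "d2", "d3"],
    ["d2", "d3", "d6", "d9"],
]
after_rerank = [
    ["d1", "d3", "d5", "d2"],
    ["d7", "d2", "d8", "d6"],
    ["d4", "d1", "d2", "d3"],
    ["d2", "d9", "d3", "d6"],
]

print(hit_rate_at_k(before_rerank, relevant, k=3))
print(hit_rate_at_k(after_rerank, relevant, k=3))
```

Run the same function on retrieval output and on reranked output for the same queries, and the difference is the reranker's contribution.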

Practical Tips

  • Start with a hosted reranker. Cohere, Voyage, or Jina rerankers are excellent baselines that require almost no infrastructure work.
  • Send 50-100 candidates to the reranker. Fewer candidates waste the reranker’s edge; more add cost and latency for little gain.
  • Cache reranker scores when possible. Repeated queries on the same candidate set should not be scored, or billed, twice.
  • Reranking after hybrid search is the sweet spot. BM25 + vector + reranker tends to outperform any single retriever and any pair of retrievers in most benchmarks.
  • Truncate document text for the reranker. Many cross-encoders accept only a few hundred tokens (often 512) for the query and passage combined. Send a passage, not the whole document.
  • Evaluate on your own queries. Public benchmarks rank models well overall, but your domain may invert that order.
  • Batch the predictions. Calling the reranker 100 times sequentially is much slower than once with 100 pairs.
  • Track top-k recall before and after reranking. This tells you whether the reranker is pulling the right docs to the top, or just shuffling already-good results.
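The caching and batching tips combine naturally. The sketch below is one possible in-memory design, not a library feature: `RerankCache` is a hypothetical helper, and `score_fn` is assumed to be any callable that takes a list of [query, text] pairs and returns one score per pair (for example, the `predict` method of a cross-encoder).

```python
import hashlib

class RerankCache:
    """In-memory cache of reranker scores keyed by (query, document text)."""

    def __init__(self, score_fn):
        self._score_fn = score_fn  # list of [query, text] pairs -> list of scores
        self._scores = {}
        self.misses = 0

    @staticmethod
    def _key(query, text):
        # Hash the pair so long passages do not bloat the dictionary keys.
        return hashlib.sha256(f"{query}\x00{text}".encode("utf-8")).hexdigest()

    def score(self, query, texts):
        keys = [self._key(query, t) for t in texts]
        # Collect uncached texts (deduplicated) and score them in ONE batch.
        missing = {}
        for key, text in zip(keys, texts):
            if key not in self._scores and key not in missing:
                missing[key] = text
        if missing:
            new_scores = self._score_fn([[query, t] for t in missing.values()])
            for key, s in zip(missing, new_scores):
                self._scores[key] = float(s)
            self.misses += len(missing)
        return [self._scores[key] for key in keys]
```

A production version would likely add eviction and a shared store; this sketch only shows the keying and the single batched call for cache misses.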

A useful pattern: run the reranker only when retrieval scores are ambiguous. If the top retrieval result is far ahead of the second, skip reranking and save the cost.
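That gate can be a single comparison. In this sketch `should_rerank` is a hypothetical helper and the 0.15 margin is an arbitrary placeholder; the right threshold depends on your retriever's score distribution and should be tuned on your own queries.

```python
def should_rerank(retrieval_scores, min_margin=0.15):
    """Return True when the top retrieval score does not clearly beat the
    runner-up. `retrieval_scores` must be sorted in descending order."""
    if len(retrieval_scores) < 2:
        return False  # zero or one candidate: nothing to reorder
    return (retrieval_scores[0] - retrieval_scores[1]) < min_margin

print(should_rerank([0.92, 0.55, 0.50]))  # clear winner
print(should_rerank([0.71, 0.69, 0.66]))  # ambiguous top results
```

The first call returns False because the leader is far ahead, so the reranker can be skipped. The second returns True because the top scores are close and the extra pass is more likely to change the order.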

Wrap-up

Rerankers are one of the cheapest ways to get a noticeable lift in top-k retrieval quality. Pair a fast retriever with a cross-encoder or hosted reranker, send 50-100 candidates, and the LLM receives far more relevant context. The added latency and cost are usually worth it, and the gain compounds because better context produces shorter, more reliable LLM answers.