RAG Evaluation Metrics Tutorial
Measure RAG quality with recall@k, MRR, context precision, faithfulness, and answer relevancy so you can iterate on data, not vibes.
What you'll learn
- Why RAG needs both retrieval and generation metrics
- Retrieval metrics: recall@k, MRR, nDCG
- Generation metrics: faithfulness, answer relevancy
- Building a labeled eval set without burning weeks
- Running an automated eval loop with LLM-as-judge
Prerequisites
- Familiar with how APIs work
What and Why
A RAG system has two failure modes that get blamed on each other. Either the retriever brought the wrong chunks, or the LLM ignored the right chunks. Without measurement, every regression turns into a guessing game between the two halves.
Good RAG evaluation separates the components, scores them independently, and rolls them up into a system-level metric. Then you can change one piece, watch the right metric move, and ship with confidence.
Mental Model
Split the pipeline at the boundary between retrieval and generation. Score each half on metrics that match what it controls.
query -> retriever -> top-k chunks -> LLM -> answer
            |              |                   |
            v              v                   v
   retrieval metrics  context metrics   answer metrics
   (recall@k, MRR)    (precision)       (faithfulness,
                                         relevancy)
When a system regresses, you look at the layer-level scores first to localize the problem.
Retrieval Metrics
These need a labeled set of (query, relevant_doc_id) pairs.
- Recall@k: of the documents that are truly relevant, how many appear in the top-k? Most important single retrieval metric.
- Precision@k: of the top-k returned, how many are relevant? Useful when the LLM has limited context budget.
- MRR (Mean Reciprocal Rank): emphasizes whether the first relevant result is near the top. Great for “single right answer” queries.
- nDCG@k: rewards relevant results higher in the ranking, with graded relevance. The gold standard for ranked retrieval.
Quick implementations:
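def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant doc IDs that appear in the top-k results.
    return len(set(retrieved[:k]) & set(relevant)) / max(1, len(relevant))

def mrr(retrieved, relevant):
    # Reciprocal rank for a single query; average over all queries to get MRR.
    for i, doc in enumerate(retrieved, 1):
        if doc in relevant:
            return 1 / i
    return 0.0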
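The list above also names precision@k and nDCG@k. A minimal sketch of both, in the same style as the functions above, might look like this. The `gains` dict (document ID mapped to a relevance grade) is an assumed input format, not something the article prescribes, and the linear-gain form of DCG is used here; some implementations use 2^grade - 1 as the gain instead.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant doc IDs.

    Assumes retrieved IDs are unique. Divides by k even when fewer than
    k results came back, so missing results count against the score.
    """
    if k <= 0:
        return 0.0
    relevant = set(relevant)
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def ndcg_at_k(retrieved, gains, k):
    """nDCG@k with graded relevance.

    gains: dict of doc_id -> relevance grade (assumed format);
    IDs missing from the dict are treated as grade 0.
    """
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], 1))
    # Ideal ordering: the highest grades first, truncated to k.
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, 1))
    return dcg / idcg if idcg > 0 else 0.0
```

A ranking that puts the highest-graded documents first scores 1.0; pushing a relevant document down the list lowers the score even if recall@k is unchanged.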
Aim for recall@5 above 0.85 on your eval set before you spend effort tuning generation. If retrieval is broken, no prompt fix will save you.
Generation Metrics
Generation metrics measure what the LLM does with retrieved context. These often use an LLM-as-judge approach.
- Faithfulness: are the claims in the answer supported by the retrieved context? Catches hallucination.
- Answer relevancy: does the answer actually address the question? Catches drift.
- Context precision: among the retrieved chunks, what fraction are actually relevant to the question? Tells you if you are retrieving noise.
- Context recall: do the retrieved chunks contain everything needed to answer? Tells you if you need more or better chunks.
A simple LLM-judge prompt for faithfulness:
JUDGE = """Given:
- ANSWER: {answer}
- CONTEXT: {context}
Score from 0 (entirely unsupported) to 1 (every claim supported).
Return only a number."""

def faithfulness(answer, context, judge_llm):
    # judge_llm is any callable that takes a prompt string and returns the reply text.
    score_text = judge_llm(JUDGE.format(answer=answer, context=context))
    return float(score_text.strip())
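One caveat with the function above: `float(score_text.strip())` raises `ValueError` as soon as the judge adds any words around the number. A more forgiving parser could be sketched as follows. The name `parse_score` and the choice to clamp into [0, 1] are assumptions made for illustration, not part of any library.

```python
import re

# Matches the first integer or decimal in the reply, e.g. "1", "0.8", ".75".
_NUMBER = re.compile(r"-?(?:\d+\.?\d*|\.\d+)")

def parse_score(score_text, default=None):
    """Extract the first number from a judge reply and clamp it to [0, 1].

    Returns `default` when the reply contains no number, so the caller
    can decide whether to retry or drop the example.
    """
    match = _NUMBER.search(score_text)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group())))
```

Clamping hides replies on a different scale (a judge answering "7 out of 10" would be read as 1.0), so it is worth logging the raw reply alongside the parsed value.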
Libraries like ragas and deepeval ship batteries-included implementations of these metrics. Use them once you understand the underlying ideas.
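Before reaching for a library, the same hand-rolled judge pattern can be extended to answer relevancy. The sketch below is illustrative only: the prompt wording and the names `RELEVANCY_JUDGE` and `answer_relevancy` are assumptions, and `judge_llm` is again taken to be any callable that maps a prompt string to reply text.

```python
RELEVANCY_JUDGE = """Given:
- QUESTION: {question}
- ANSWER: {answer}
Score from 0 (does not address the question) to 1 (fully addresses it).
Return only a number."""

def answer_relevancy(question, answer, judge_llm):
    """Ask the judge how directly the answer addresses the question."""
    reply = judge_llm(RELEVANCY_JUDGE.format(question=question, answer=answer))
    # Assumes the judge obeys "Return only a number".
    return float(reply.strip())
```

Note that this prompt never shows the retrieved context to the judge: relevancy is about the question and the answer, while faithfulness is about the answer and the context.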
Building an Eval Set
A 50-200 example labeled set is usually enough to drive iteration.
- Sample real queries from logs. Synthetic queries do not match user behavior.
- For each query, label the correct document(s) by hand or with a strong LLM and review.
- Write a reference answer when possible. Even rough ones help compare versions.
- Group queries by intent. Factoid, multi-hop, summarization, comparison. You will find your pipeline excels at some and stumbles at others.
The first 50 examples are the most painful and the most valuable. Each one you label is a future regression test.
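One possible shape for a labeled example, following the steps above, is sketched below. The field names and the `EvalExample` class are hypothetical; any structure that keeps the query, the relevant document IDs, an optional reference answer, and the intent together would serve.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List

@dataclass
class EvalExample:
    query: str                    # real user query sampled from logs
    relevant_doc_ids: List[str]   # hand-labeled or LLM-labeled and reviewed
    intent: str = "factoid"       # e.g. factoid, multi-hop, summarization, comparison
    reference_answer: str = ""    # optional; rough answers still help

def group_by_intent(examples):
    """Bucket examples by intent so each group can be scored separately."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex.intent].append(ex)
    return dict(groups)
```

Storing the intent on each example from the start makes per-intent scoring a simple grouping step later.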
Trade-offs
- LLM-as-judge is noisy. Scores fluctuate run-to-run. Average across 2-3 calls or use a stronger judge.
- Synthetic eval data inflates scores. Models tend to handle synthetic phrasings more easily than real ones.
- Faithfulness is hard for multi-hop. When an answer combines information across chunks, judges sometimes mark supported claims as unsupported.
- Higher recall@k is not always better. Beyond a point, extra context becomes noise that lowers answer quality.
- Eval cost can rival inference cost. Run the full eval on a schedule and whenever the prompt or retriever changes, not on every commit.
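The averaging suggested for noisy judges can be sketched in a few lines. The helper name `averaged_score` and its zero-argument `score_fn` parameter are assumptions for illustration; the caller wraps whatever judge call is being repeated.

```python
from statistics import mean, pstdev

def averaged_score(score_fn, n_calls=3):
    """Call a zero-argument scoring function n_calls times.

    Returns (mean, spread), where spread is the population standard
    deviation of the individual scores. A large spread flags examples
    on which the judge disagrees with itself.
    """
    if n_calls < 1:
        raise ValueError("n_calls must be at least 1")
    scores = [score_fn() for _ in range(n_calls)]
    return mean(scores), pstdev(scores)
```

With the faithfulness judge from earlier, a call might look like `averaged_score(lambda: faithfulness(answer, context, judge_llm))`. Each extra call multiplies judge cost, which feeds directly into the eval-cost trade-off above.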
Practical Tips
- Start with recall@5 and faithfulness. These two metrics catch the majority of regressions.
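- Lock the eval set. Do not change examples between the runs you are comparing, or the scores stop being comparable. When you add examples, version the set and re-run the baseline on the new version.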
- Track per-segment scores. A change that improves the overall average can quietly destroy a critical query subset.
- Use a stronger judge model than the model under test. Otherwise the judge inherits the same blind spots.
- Include negative cases. Queries that no chunk in the corpus can answer are essential for catching false positives.
- Run eval on every prompt or retrieval change. A 200-example eval costs a few dollars and prevents shipping a broken pipeline.
- Hand-review a sample of failures. Metrics tell you what regressed; manual review tells you why.
A bonus pattern: store every production query and answer with the retrieved chunks. Periodically add new failures to the eval set so it grows with the system.
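Several of the tips above (a fixed eval set, per-segment scores, re-running on each change) fit into one small loop. The following is a sketch under stated assumptions rather than a reference implementation: examples are plain dicts with `query`, `relevant`, and `intent` keys, `retrieve` is a hypothetical callable returning ranked document IDs, and the 0.02 tolerance is an arbitrary placeholder.

```python
from collections import defaultdict

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / max(1, len(relevant))

def run_retrieval_eval(examples, retrieve, k=5):
    """Mean recall@k per intent, plus an "overall" entry.

    Assumes no intent is itself named "overall".
    """
    per_segment = defaultdict(list)
    for ex in examples:
        score = recall_at_k(retrieve(ex["query"]), ex["relevant"], k)
        per_segment[ex["intent"]].append(score)
        per_segment["overall"].append(score)
    return {name: sum(s) / len(s) for name, s in per_segment.items()}

def regressions(baseline, candidate, tolerance=0.02):
    """Segments whose score dropped by more than `tolerance` vs. the baseline.

    Returns {segment: (baseline_score, candidate_score)}.
    """
    return {name: (baseline[name], candidate.get(name, 0.0))
            for name in baseline
            if baseline[name] - candidate.get(name, 0.0) > tolerance}
```

Comparing segment by segment is what surfaces the case where the overall average improves while one query type quietly gets worse. The same loop could carry faithfulness scores by adding a judge call per example.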
Wrap-up
You cannot improve what you do not measure, and RAG has two layers that need separate scorecards. Build a small labeled set, track recall and faithfulness, and run the eval on every meaningful change. Once iteration is grounded in metrics, RAG quality climbs steadily instead of bouncing around with the latest prompt rewrite.