RAG Evaluation Metrics Tutorial

Intermediate 11 min read

What you'll learn

✓ Why RAG needs both retrieval and generation metrics
✓ Retrieval metrics: recall@k, MRR, nDCG
✓ Generation metrics: faithfulness, answer relevancy
✓ Building a labeled eval set without burning weeks
✓ Running an automated eval loop with LLM-as-judge

Prerequisites

• Familiar with how APIs work

What and Why

A RAG system has two failure modes that get blamed on each other. Either the retriever brought the wrong chunks, or the LLM ignored the right chunks. Without measurement, every regression turns into a guessing game between the two halves.

Good RAG evaluation separates the components, scores them independently, and rolls them up into a system-level metric. Then you can change one piece, watch the right metric move, and ship with confidence.

Mental Model

Split the pipeline at the boundary between retrieval and generation. Score each half on metrics that match what it controls.

query -> retriever -> top-k chunks -> LLM -> answer
            |                  |              |
            v                  v              v
      retrieval metrics  context metrics  answer metrics
      (recall@k, MRR)    (precision)     (faithfulness,
                                          relevancy)

Two layers of evaluation in a RAG pipeline

When a system regresses, you look at the layer-level scores first to localize the problem.

Retrieval Metrics

These need a labeled set of (query, relevant_doc_id) pairs.

Recall@k: of the documents that are truly relevant, what fraction appears in the top-k? Usually the single most important retrieval metric.
Precision@k: of the top-k returned, what fraction is relevant? Useful when the LLM has a limited context budget.
MRR (Mean Reciprocal Rank): the average over queries of 1/rank of the first relevant result. It rewards putting a relevant result near the top, which makes it a good fit for “single right answer” queries.
nDCG@k: rewards relevant results higher in the ranking and supports graded relevance. A standard choice for evaluating ranked retrieval.

Quick implementations:

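def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant doc ids that appear in the top-k results.
    return len(set(retrieved[:k]) & set(relevant)) / max(1, len(relevant))

def mrr(retrieved, relevant):
    # Reciprocal rank for a single query; average it over all queries to get MRR.
    for i, doc in enumerate(retrieved, 1):
        if doc in relevant:
            return 1 / i
    return 0.0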
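
The other two metrics from the list can be sketched in the same style. This is a minimal version under stated assumptions: `precision_at_k` divides by k even when fewer than k results come back, and `ndcg_at_k` uses linear gains with a log2 discount (some variants use 2^gain - 1 instead). The `gains_by_doc` mapping from doc id to graded relevance is a hypothetical input format, not something the eval set above prescribes.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled with relevant doc ids."""
    if k <= 0:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def dcg(gains):
    """Discounted cumulative gain for gains listed in rank order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, 1))

def ndcg_at_k(retrieved, gains_by_doc, k):
    """nDCG@k with graded relevance; docs missing from gains_by_doc count as 0."""
    actual = [gains_by_doc.get(doc, 0) for doc in retrieved[:k]]
    ideal = sorted(gains_by_doc.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that places the most relevant documents first scores 1.0; swapping a highly relevant document below a weaker one lowers the score.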

Aim for recall@5 above 0.85 on your eval set before you spend effort tuning generation. If retrieval is broken, no prompt fix will save you.
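
To check a retriever against that bar, recall@k has to be averaged over the whole eval set. The sketch below assumes each example is a dict with `query` and `relevant` keys and that `retrieve` is any callable returning a ranked list of doc ids; both are illustrative choices, and `FAKE_INDEX` is toy data standing in for a real retriever.

```python
def recall_at_k(retrieved, relevant, k):
    # Same definition as the quick implementation above.
    return len(set(retrieved[:k]) & set(relevant)) / max(1, len(relevant))

def mean_recall_at_k(examples, retrieve, k=5):
    """Average recall@k over examples shaped like {"query": ..., "relevant": [...]}."""
    if not examples:
        return 0.0
    scores = [recall_at_k(retrieve(ex["query"]), ex["relevant"], k) for ex in examples]
    return sum(scores) / len(scores)

# Toy stand-in for a real retriever: a fixed lookup table (hypothetical data).
FAKE_INDEX = {
    "reset password": ["doc1", "doc7", "doc3"],
    "refund policy": ["doc9", "doc2"],
}

def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])

examples = [
    {"query": "reset password", "relevant": ["doc1"]},
    {"query": "refund policy", "relevant": ["doc2", "doc4"]},
]

score = mean_recall_at_k(examples, fake_retrieve, k=5)
meets_bar = score >= 0.85  # the target suggested in the text
print(f"recall@5 = {score:.2f}, meets bar: {meets_bar}")
```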

Generation Metrics

Generation metrics measure what the LLM does with retrieved context. These often use an LLM-as-judge approach.

Faithfulness: are the claims in the answer supported by the retrieved context? Catches hallucination.
Answer relevancy: does the answer actually address the question? Catches drift.
Context precision: among the retrieved chunks, what fraction is actually relevant to the question? Tells you if you are retrieving noise.
Context recall: do the retrieved chunks contain everything needed to answer? Tells you if you need more or better chunks.

A simple LLM-judge prompt for faithfulness:

JUDGE = """Given:
- ANSWER: {answer}
- CONTEXT: {context}

Score from 0 (entirely unsupported) to 1 (every claim supported).
Return only a number."""

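def faithfulness(answer, context, judge_llm):
    # judge_llm: any callable that takes a prompt string and returns the reply text.
    score_text = judge_llm(JUDGE.format(answer=answer, context=context))
    # Clamp to [0, 1] in case the judge returns an out-of-range number.
    return min(1.0, max(0.0, float(score_text.strip())))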
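
Answer relevancy can be judged the same way, with the question in place of the context. The prompt wording and the `parse_score` helper below are assumptions for illustration, not the prompts that any particular library uses. The helper tolerates replies such as "Score: 0.8", which a bare `float()` call would reject.

```python
import re

RELEVANCY_JUDGE = """Given:
- QUESTION: {question}
- ANSWER: {answer}

Score from 0 (does not address the question) to 1 (fully addresses it).
Return only a number."""

def parse_score(text):
    """Pull the first number out of the judge reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", text)
    if match is None:
        raise ValueError(f"no score found in judge reply: {text!r}")
    return min(1.0, max(0.0, float(match.group())))

def answer_relevancy(question, answer, judge_llm):
    # judge_llm is assumed to be any callable mapping a prompt string to a reply string.
    reply = judge_llm(RELEVANCY_JUDGE.format(question=question, answer=answer))
    return parse_score(reply)
```

Raising on an unparseable reply, rather than returning 0, keeps judge failures from being silently counted as bad answers.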

Libraries like ragas and deepeval ship batteries-included implementations of these metrics. Use them once you understand the underlying ideas.

Building an Eval Set

A 50-200 example labeled set is usually enough to drive iteration.

Sample real queries from logs. Synthetic queries often do not match how users actually phrase things.
For each query, label the correct document(s) by hand or with a strong LLM and review.
Write a reference answer when possible. Even rough ones help compare versions.
Group queries by intent. Factoid, multi-hop, summarization, comparison. You will find your pipeline excels at some and stumbles at others.
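
One way to store such a set is one JSON object per line, with the intent and optional reference answer kept next to the labels. The field names below (`relevant_doc_ids`, `intent`, `reference_answer`) and the JSONL layout are assumptions for this sketch; any format that keeps these pieces together would work.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class EvalExample:
    query: str
    relevant_doc_ids: list
    intent: str                 # e.g. "factoid", "multi-hop", "summarization", "comparison"
    reference_answer: str = ""  # optional; rough answers are fine

def dump_jsonl(examples):
    """Serialize examples to JSONL text, one example per line."""
    return "\n".join(json.dumps(asdict(ex)) for ex in examples)

def load_jsonl(text):
    """Parse JSONL text back into EvalExample objects, skipping blank lines."""
    return [EvalExample(**json.loads(line)) for line in text.splitlines() if line.strip()]

# Hypothetical example entries.
eval_set = [
    EvalExample("How do I reset my password?", ["doc1"], "factoid", "Use the reset link."),
    EvalExample("Compare the Basic and Pro plans", ["doc4", "doc5"], "comparison"),
]
round_tripped = load_jsonl(dump_jsonl(eval_set))
```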

The first 50 examples are the most painful and the most valuable. Each one you label is a future regression test.

Trade-offs

LLM-as-judge is noisy. Scores fluctuate run-to-run. Average across 2-3 calls or use a stronger judge.
Synthetic eval data inflates scores. Models tend to handle synthetic phrasings more easily than real ones.
Faithfulness is hard for multi-hop. When an answer combines information across chunks, judges sometimes mark supported claims as unsupported.
Higher recall@k is not always better. Beyond a point, extra context becomes noise that lowers answer quality.
Eval cost can rival inference cost. Run the full eval on a schedule and on changes that touch prompts or retrieval, not on every commit.
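
The averaging suggested for noisy judges can be sketched as a small wrapper. It assumes the metric is exposed as a zero-argument callable (for example a lambda around a judge call); the default of three calls follows the 2-3 range mentioned above, and the spread is reported so unstable examples stand out.

```python
from statistics import mean, pstdev

def averaged_score(score_fn, n_calls=3):
    """Call a zero-argument scoring function n_calls times; return (mean, spread)."""
    scores = [score_fn() for _ in range(n_calls)]
    return mean(scores), pstdev(scores)

# Toy stand-in for a noisy judge: replays fixed scores (hypothetical values).
replies = iter([0.6, 0.8, 1.0])
avg, spread = averaged_score(lambda: next(replies), n_calls=3)
```

A large spread on a single example is itself a signal: it usually means the judge prompt is ambiguous for that case.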

Practical Tips

Start with recall@5 and faithfulness. These two metrics catch the majority of regressions.
Lock the eval set. Do not change examples between runs or you cannot compare versions.
Track per-segment scores. A change that improves the overall average can quietly destroy a critical query subset.
Use a stronger judge model than the model under test. Otherwise the judge inherits the same blind spots.
Include negative examples. Queries that no document in the corpus can answer are essential for catching false positives, where the pipeline returns chunks and an answer anyway.
Run eval on every prompt or retrieval change. A 200-example eval costs a few dollars and prevents shipping a broken pipeline.
Hand-review a sample of failures. Metrics tell you what regressed; manual review tells you why.
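
The per-segment tip can be sketched as a comparison between two runs. The input shape (an iterable of `(segment, score)` pairs) and the 0.02 tolerance are assumptions chosen for illustration; in the toy data the overall average goes up while the multi-hop segment drops.

```python
from collections import defaultdict

def per_segment_mean(rows):
    """rows: iterable of (segment, score) pairs. Returns {segment: mean score}."""
    buckets = defaultdict(list)
    for segment, score in rows:
        buckets[segment].append(score)
    return {seg: sum(vals) / len(vals) for seg, vals in buckets.items()}

def regressed_segments(baseline, candidate, tolerance=0.02):
    """Segments whose mean dropped by more than tolerance versus the baseline."""
    return sorted(
        seg for seg, base in baseline.items()
        if seg in candidate and base - candidate[seg] > tolerance
    )

# Hypothetical scores from two runs over the same locked eval set.
baseline_rows = [("factoid", 0.7), ("factoid", 0.7), ("factoid", 0.7), ("multi-hop", 0.8)]
candidate_rows = [("factoid", 0.9), ("factoid", 0.9), ("factoid", 0.9), ("multi-hop", 0.5)]

regressed = regressed_segments(
    per_segment_mean(baseline_rows), per_segment_mean(candidate_rows)
)
```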

A bonus pattern: store every production query and answer with the retrieved chunks. Periodically add new failures to the eval set so it grows with the system.

Wrap-up

You cannot improve what you do not measure, and RAG has two layers that need separate scorecards. Build a small labeled set, track recall and faithfulness, and run the eval on every meaningful change. Once iteration is grounded in metrics, RAG quality climbs steadily instead of bouncing around with the latest prompt rewrite.