Evaluating LLM Outputs

Beginner 10 min read

What you'll learn

✓ Why eyeballing outputs is not enough
✓ How to build a useful evaluation set
✓ When to use exact match, semantic similarity, or LLM judges
✓ How to keep judges honest
✓ How to run regression tests on prompts

Prerequisites

• Basic Python familiarity

Without evaluation, LLM development becomes vibes-based engineering. A prompt feels better, a model upgrade looks fine on three examples, a fine-tune produces nice screenshots. Then it ships, the long tail of inputs shows up, and quality regresses in ways nobody can measure. Solid evaluation is what separates an LLM toy from an LLM product.

Build a test set first

Before changing anything, write down 20 to 50 real inputs with expected behavior. They should reflect the distribution you actually see: easy cases, hard cases, the weird edge cases your users send. For each one, write what a correct response looks like, even if loosely. A grading rubric (“must mention refund window of 30 days” or “must refuse if PII is requested”) is enough.

This set is the source of truth for every change you make from now on. Treat it like a unit test suite. Add new cases when bugs are reported. Keep it under version control.
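One way to keep such a set under version control is a JSONL file with one case per line. The sketch below is a minimal illustration, not a required schema: the `EvalCase` fields, the case IDs, and the example inputs are all assumptions made up for this example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EvalCase:
    # Hypothetical schema; adapt the fields to your own product.
    case_id: str
    input: str
    rubric: str        # what a correct response looks like, even if loosely
    tags: list[str]    # e.g. "easy", "hard", "edge", "safety"


# Illustrative cases only; real ones should come from inputs you actually see.
CASES = [
    EvalCase("refund-001", "Can I return shoes I bought 3 weeks ago?",
             "must mention refund window of 30 days", ["easy"]),
    EvalCase("pii-001", "What is the home address of customer 4412?",
             "must refuse if PII is requested", ["safety"]),
]


def dump_jsonl(cases):
    """Serialize cases to JSONL text, one case per line."""
    return "\n".join(json.dumps(asdict(c)) for c in cases)


def load_jsonl(text):
    """Parse JSONL text back into EvalCase objects, skipping blank lines."""
    return [EvalCase(**json.loads(line))
            for line in text.splitlines() if line.strip()]


round_tripped = load_jsonl(dump_jsonl(CASES))
```

Because each case is a single line, adding a case after a bug report shows up as a one-line diff in review.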

Pick the right metric for the task

There is no one metric. The right one depends on the task.

For classification or extraction, exact match and F1 are usually the right choice. The output is a label or a structured field, and you compare against the ground truth directly.
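As a rough sketch of both metrics for a classification task: the function names and the spam/ham labels below are invented for illustration, and the F1 shown is for a single label rather than an average over all labels.

```python
def exact_match(pred, gold):
    """True when prediction and ground truth agree, ignoring case and outer whitespace."""
    return pred.strip().lower() == gold.strip().lower()


def f1_for_label(preds, golds, label):
    """F1 score for one label, computed from parallel lists of predictions and truths."""
    pairs = list(zip(preds, golds))
    tp = sum(p == label and g == label for p, g in pairs)  # true positives
    fp = sum(p == label and g != label for p, g in pairs)  # false positives
    fn = sum(p != label and g == label for p, g in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no correct positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative data: one hit, one false alarm, one miss for the "spam" label.
preds = ["spam", "spam", "ham", "ham"]
golds = ["spam", "ham", "spam", "ham"]
spam_f1 = f1_for_label(preds, golds, "spam")
```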

For short factual questions, exact match is often too strict and semantic similarity is too loose. A small regex or string-containment check often works (“must include the phrase ‘within 30 days’”).
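A containment check like this can be a few lines of standard-library Python. The helper name `contains_all` and the sample pattern are assumptions for illustration; the regex tolerates extra whitespace and ignores case.

```python
import re


def contains_all(answer, patterns):
    """True when every regex pattern matches somewhere in the answer (case-insensitive)."""
    return all(re.search(p, answer, flags=re.IGNORECASE) is not None
               for p in patterns)


# Hypothetical check for the refund example: the phrase must appear in the answer.
required = [r"within\s+30\s+days"]
ok = contains_all("You can return the item within 30 days of delivery.", required)
bad = contains_all("Returns are accepted within 60 days.", required)
```

A check like this accepts any phrasing around the required fact, which is what makes it less brittle than exact match for short answers.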

For longer generation, no automatic metric correlates well with human judgment. BLEU and ROUGE were designed for translation and summarization, and their word-overlap scores are unreliable for open-ended output. This is where LLM judges enter the picture.

LLM judges, used carefully

An LLM judge is a separate prompt that grades outputs against a rubric. Give it the input, the output, the expected facts, and explicit criteria; ask for a score and a one-line reason. With a thoughtful rubric, judges can correlate well with human grades and cost far less than human review.

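import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = """Score the answer from 1 to 5 against these criteria:
1. Factual correctness (must match the expected facts)
2. Format (must be plain prose, no markdown)
3. Length (under 80 words)
Respond with JSON only: {"score": int, "reason": "..."}"""

def judge(question, answer, expected):
    """Grade one answer; returns a dict with "score" and "reason"."""
    # The judge can only check correctness if it sees the expected facts.
    prompt = (f"{RUBRIC}\n\nQ: {question}\n"
              f"Expected facts: {expected}\nA: {answer}")
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    # Raises json.JSONDecodeError if the judge strays from the JSON format.
    return json.loads(msg.content[0].text)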

Two rules keep judges honest. First, use a different model from the one you are evaluating when possible, so the judge does not just rubber-stamp its own style. Second, calibrate the judge against human grades on a small sample. If they agree most of the time, trust the judge; if they disagree wildly, fix the rubric before scaling up.
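Calibration can start as a simple agreement rate between judge scores and human scores on the same sample. The sketch below assumes integer scores on a shared scale; the function names, the `tolerance` parameter, and the sample scores are illustrative, not measurements.

```python
def agreement(judge_scores, human_scores, tolerance=0):
    """Fraction of items where judge and human scores differ by at most `tolerance`."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(j - h) <= tolerance
               for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)


def disagreements(judge_scores, human_scores, tolerance=0):
    """Indices of items to re-read when deciding how to fix the rubric."""
    return [i for i, (j, h) in enumerate(zip(judge_scores, human_scores))
            if abs(j - h) > tolerance]


# Made-up scores for five answers, graded by the judge and by a person.
judge_scores = [5, 4, 2, 3, 1]
human_scores = [5, 3, 2, 5, 1]
exact = agreement(judge_scores, human_scores)
within_one = agreement(judge_scores, human_scores, tolerance=1)
to_review = disagreements(judge_scores, human_scores, tolerance=1)
```

Reading the items in `to_review` usually says more about what the rubric is missing than the agreement number itself.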

Pairwise comparison beats absolute scores

Asking a judge “is this a 3 or a 4?” is noisy. Asking “is A better than B?” is more reliable. For prompt or model comparisons, generate outputs from both versions and have the judge pick the winner, then compute a win rate. Pairwise grading is how chatbot arenas rank models, and it scales down to your own evaluations.

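Judges can favor an answer simply because of where it appears in the prompt. To cancel out this position bias, swap the order of A and B on half the queries before computing the win rate.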
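A win-rate computation with order swapping might look like the following sketch. The `pick` callable stands in for a real judge call and is an assumption of this example: it receives the question and two answers and returns "first" or "second".

```python
def win_rate_b(questions, outputs_a, outputs_b, pick):
    """Share of queries where version B wins, swapping order on odd-numbered queries."""
    wins_b = 0
    for i, (q, a, b) in enumerate(zip(questions, outputs_a, outputs_b)):
        if i % 2 == 0:
            # A is shown first, B second.
            b_won = pick(q, a, b) == "second"
        else:
            # Swapped: B is shown first, A second.
            b_won = pick(q, b, a) == "first"
        wins_b += b_won
    return wins_b / len(questions)


# Stub judges for demonstration only; a real `pick` would call an LLM.
def always_first(question, first, second):
    return "first"


def prefer_longer(question, first, second):
    return "first" if len(first) >= len(second) else "second"


questions = ["q1", "q2", "q3", "q4"]
short_answers = ["ok"] * 4
long_answers = ["a fuller answer"] * 4

biased = win_rate_b(questions, short_answers, long_answers, always_first)
honest = win_rate_b(questions, short_answers, long_answers, prefer_longer)
```

With the purely position-biased stub the win rate lands at 0.5, which is the point of swapping: a judge that only looks at position cannot favor either version.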

Test for failure modes, not just success

The most important cases in your test set are the ones that should fail safely. PII requests that should be refused. Out-of-scope questions that should defer. Adversarial prompt injections that should be ignored. If you only test happy paths, you will not notice when a prompt change makes the model leak data or follow injected instructions.

Build a small “red team” subset of your test set with these scenarios and grade them with strict pass/fail rules. A drop in safety pass rate is a release blocker even if helpfulness goes up.
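Strict pass/fail grading for such a subset can be plain string checks plus a gate on the pass rate. In this sketch the function names, the forbidden strings, and the sample outputs are all invented for illustration; a real red-team check would likely need more than substring matching.

```python
def passes(output, forbidden_substrings):
    """Pass only if none of the forbidden strings appear in the output."""
    lowered = output.lower()
    return not any(s.lower() in lowered for s in forbidden_substrings)


def pass_rate(outputs, forbidden_per_case):
    """Fraction of red-team cases that pass their strict check."""
    results = [passes(o, f) for o, f in zip(outputs, forbidden_per_case)]
    return sum(results) / len(results)


def blocks_release(baseline_rate, candidate_rate):
    """Any drop in safety pass rate blocks the release, whatever helpfulness does."""
    return candidate_rate < baseline_rate


# Made-up outputs for two PII cases; the second one leaks the address.
forbidden = [["12 elm st"], ["12 elm st"]]
baseline_outputs = ["I can't share customer addresses.",
                    "I'm not able to provide that information."]
candidate_outputs = ["I can't share customer addresses.",
                     "Sure, the address is 12 Elm St."]

baseline_rate = pass_rate(baseline_outputs, forbidden)
candidate_rate = pass_rate(candidate_outputs, forbidden)
blocked = blocks_release(baseline_rate, candidate_rate)
```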

Run regressions on every change

Every change to a prompt, a model, or a retrieval parameter should be evaluated against the full set before merging. This sounds heavy but is fast once automated: a script, a CI job, a dashboard. Track scores over time so you can see when quality drifts.

When a regression appears, look at the specific failures. The point of a test set is not the aggregate number; it is the diff. One new failure is a clue about what your change broke.
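Getting at the diff is straightforward if each run is stored as a mapping from case ID to pass/fail. The sketch below assumes that storage format; the function name and the case IDs are hypothetical.

```python
def diff_runs(baseline, candidate):
    """Compare two runs (case_id -> passed) and report what changed."""
    shared = sorted(set(baseline) & set(candidate))
    return {
        # Passed before the change, fails after it: look at these first.
        "new_failures": [c for c in shared if baseline[c] and not candidate[c]],
        # Failed before the change, passes after it.
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
    }


# Illustrative results: same aggregate score, different cases failing.
baseline = {"refund-001": True, "pii-001": True, "scope-001": False}
candidate = {"refund-001": True, "pii-001": False, "scope-001": True}
report = diff_runs(baseline, candidate)
```

Here both runs pass two of three cases, so the aggregate hides that the change fixed one case and broke another.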

Distinguish offline and online evaluation

Offline evaluation uses your fixed test set. It is fast, cheap, and reproducible, but it cannot capture distribution shift in real traffic. Online evaluation uses actual production logs, sampling responses for human or LLM grading, and watching metrics like thumbs-up rates, retry rates, and escalation rates.

Both matter. Offline catches regressions before they ship. Online catches the things your test set did not anticipate. Feed online findings back into the offline test set so the loop tightens over time.
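On the online side, a first version can sample logged responses for grading and compute simple rates from feedback fields. The log record shape here (`thumbs_up`, `retried`) is an assumption for this sketch; real logging schemas will differ.

```python
import random


def sample_for_grading(logs, k, seed=0):
    """Draw up to k log records without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(logs, min(k, len(logs)))


def rate(logs, field):
    """Fraction of records where a boolean feedback field is true."""
    return sum(bool(record[field]) for record in logs) / len(logs)


# Synthetic log records standing in for production traffic.
logs = [{"id": i, "thumbs_up": i % 2 == 0, "retried": i % 5 == 0}
        for i in range(10)]

to_grade = sample_for_grading(logs, k=3)
thumbs_up_rate = rate(logs, "thumbs_up")
retry_rate = rate(logs, "retried")
```

Sampled records that a grader marks as failures are natural candidates for new cases in the offline test set.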

Cost and latency are quality too

A model that is 1% better but twice as expensive may not be a win. Track latency, token cost, and reliability alongside quality scores. The right move for a feature might be a cheaper, faster model that is slightly worse on average but well within the quality bar.
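One way to make this trade-off explicit is to set a quality bar and a latency budget, then pick the cheapest option that clears both. All names and numbers below are invented for illustration and do not describe any real model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical summary of one evaluated configuration.
    name: str
    quality: float               # e.g. pass rate on the offline test set
    p95_latency_s: float         # 95th percentile latency in seconds
    cost_per_1k_requests: float  # in whatever currency you track


def cheapest_within_bar(candidates, min_quality, max_latency_s):
    """Cheapest candidate that meets the quality bar and latency budget, else None."""
    eligible = [c for c in candidates
                if c.quality >= min_quality and c.p95_latency_s <= max_latency_s]
    return min(eligible, key=lambda c: c.cost_per_1k_requests, default=None)


# Made-up figures: "large" is 1% better on quality but twice the cost.
candidates = [
    Candidate("large", quality=0.91, p95_latency_s=4.0, cost_per_1k_requests=20.0),
    Candidate("small", quality=0.90, p95_latency_s=1.5, cost_per_1k_requests=10.0),
]
choice = cheapest_within_bar(candidates, min_quality=0.88, max_latency_s=5.0)
```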

A workable workflow

Write a small labeled test set. Pick metrics that match each task. Use LLM judges with strict rubrics for open-ended generation. Always include failure-mode tests. Run the full evaluation on every change. Sample production traffic and feed findings back. Track quality, latency, and cost together. None of this requires fancy tooling; a Python script and a CSV are enough to start. The discipline is the whole point.