LLM Evaluation: Measuring What Actually Matters

Intermediate 11 min read

What you'll learn

✓ Why "it looks fine" is not enough once you ship
✓ How to build a golden dataset that pays back forever
✓ When to use exact-match scoring vs LLM-as-judge
✓ A/B comparing prompts and models without fooling yourself
✓ Regression suites and production observability

Prerequisites

• You have a working LLM feature you can call from code
• Basic familiarity with prompts (see Prompt Engineering Basics)

Every team building with LLMs eventually hits the same wall. You tweak a prompt, the demo case improves, you ship — and three other cases silently regress. Vibes-based testing does not scale past one engineer and one demo.

Evals are how you escape. They are unit tests for non-deterministic systems: an honest, repeatable way to ask “is this change better, worse, or the same?”

Why evals beat vibes

Three structural problems with eyeballing:

Drift. A change that fixes one prompt often breaks an adjacent one you forgot about.
Selection bias. You test the cases you remember, which are the cases the model already handles.
Comparison is hard. “Is GPT-4 better than Claude here?” deserves a number, not a feeling.

A small eval suite — even 30 hand-curated cases — turns these into measurable problems. You stop guessing whether a change helped.

The golden dataset

The unit of LLM evaluation is the example: an input, optionally an expected output, and optionally metadata about what kind of case it is.

[
  {
    "id": "refund-001",
    "input": "I want my money back, the product never arrived.",
    "expected_intent": "refund_request",
    "tags": ["intent-classification", "happy-path"]
  },
  {
    "id": "refund-002",
    "input": "Where's my order at?",
    "expected_intent": "order_status",
    "tags": ["intent-classification", "easy-confuse"]
  }
]

Tips that pay off later:

Start small. Twenty to fifty cases beats a thousand you never look at.
Curate, do not generate. Real user inputs from logs are gold. Synthetic data is filler.
Cover the failure modes. Every bug you ship should land in the eval set so it never reappears.
Tag aggressively. Tags let you slice scores: overall accuracy might be 85% while “edge cases” sit at 40%.
Version it. Eval data is code. Put it in git.

A golden dataset is the single highest-leverage artefact in an LLM project. Build it once, use it forever.
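Because the dataset lives in git like code, it can be checked like code. The following is a minimal sketch, assuming the JSON shape shown above; the helper name `validate_dataset` and the choice of required fields are illustrative assumptions, not a standard.

```python
import json
from collections import Counter

# Assumed minimum fields; "expected_*" fields stay optional, as in the text.
REQUIRED_FIELDS = {"id", "input", "tags"}

def validate_dataset(cases: list[dict]) -> Counter:
    """Raise ValueError on malformed cases; return how many cases carry each tag."""
    seen_ids = set()
    tag_counts = Counter()
    for case in cases:
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            raise ValueError(f"case {case.get('id', '?')} is missing {sorted(missing)}")
        if case["id"] in seen_ids:
            raise ValueError(f"duplicate id: {case['id']}")
        seen_ids.add(case["id"])
        tag_counts.update(case["tags"])
    return tag_counts

# The two example cases from above, inlined so the sketch runs on its own.
raw = """[
  {"id": "refund-001", "input": "I want my money back, the product never arrived.",
   "expected_intent": "refund_request", "tags": ["intent-classification", "happy-path"]},
  {"id": "refund-002", "input": "Where's my order at?",
   "expected_intent": "order_status", "tags": ["intent-classification", "easy-confuse"]}
]"""
dataset = json.loads(raw)
tag_counts = validate_dataset(dataset)
print(tag_counts)
```

The tag counts are worth a glance on every change: a tag with only one or two cases cannot support a meaningful per-tag score.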

Exact-match scoring

When the output is structured — a label, a JSON field, a yes/no — you can compare directly.

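# Exact-match scoring for an intent classifier.
# Assumes `dataset` is the list of cases from the golden dataset and
# `classify_intent` is your own function that calls the model.
def score_intent(predicted: str, expected: str) -> int:
    # Normalize both sides so stray whitespace or casing never costs a point
    return 1 if predicted.strip().lower() == expected.strip().lower() else 0

results = []
for case in dataset:
    out = classify_intent(case["input"])
    results.append(score_intent(out, case["expected_intent"]))

# Guard against an empty dataset to avoid dividing by zero
accuracy = sum(results) / len(results) if results else 0.0
print(f"Accuracy: {accuracy:.2%}")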

Exact-match is cheap, fast, and unambiguous. Use it wherever you can. Many features that feel “open ended” actually have a structured core — extract that core, score it, leave the prose alone.

Related scoring families that are still cheap:

Substring / regex match — “did the answer mention the order ID?”
JSON schema validation — “did it return well-formed JSON?”
Numeric tolerance — “is the number within 1% of expected?”
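Each of the three families above fits in a few lines of standard-library Python. This is a sketch under stated assumptions: the function names are made up for illustration, and the `ORD-` order-ID pattern in the usage lines is hypothetical.

```python
import json
import re

def score_regex(output: str, pattern: str) -> int:
    """1 if the pattern appears anywhere in the output, else 0."""
    return 1 if re.search(pattern, output) else 0

def score_json_keys(output: str, required_keys: set[str]) -> int:
    """1 if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0
    return 1 if isinstance(parsed, dict) and required_keys <= set(parsed) else 0

def score_numeric(output: str, expected: float, rel_tol: float = 0.01) -> int:
    """1 if the output parses as a number within rel_tol (default 1%) of expected."""
    try:
        value = float(output.strip())
    except ValueError:
        return 0
    return 1 if abs(value - expected) <= rel_tol * abs(expected) else 0

# Usage with made-up model outputs
print(score_regex("Your order ORD-12345 has shipped.", r"ORD-\d+"))  # 1
print(score_json_keys('{"intent": "refund_request"}', {"intent"}))   # 1
print(score_numeric("100.5", expected=100.0))                        # 1
print(score_numeric("102", expected=100.0))                          # 0
```

Note that `score_json_keys` only checks for a parseable object with the right keys; full schema validation would need a dedicated library.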

LLM-as-judge

For truly open-ended outputs — summaries, explanations, helpful tone — exact match fails. The compromise is LLM-as-judge: a separate LLM call grades the output against a rubric.

# A judge prompt: keep the rubric tight.
# Fill it with JUDGE_PROMPT.format(user_msg=..., reply=...); the doubled
# braces keep the JSON example literal instead of being read as fields.
JUDGE_PROMPT = """You are grading a customer-support reply.

Rubric:
- Correct: addresses the user's actual question (yes/no)
- Polite: friendly tone, no blame (yes/no)
- Concise: under 4 sentences (yes/no)

Return JSON: {{"correct": bool, "polite": bool, "concise": bool, "notes": str}}

USER MESSAGE:
{user_msg}

ASSISTANT REPLY:
{reply}
"""

This works, with caveats:

Judges have biases. They prefer longer, more confident answers. Calibrate by spot-checking.
Use a strong model as judge. A cheaper judge produces noisier scores.
Keep rubrics tight. Three concrete criteria beat one vague “is this good?”.
Sanity-check against humans. Score 50 cases yourself, then confirm the judge agrees with your labels on most of them.

A useful rule: if a smart human cannot grade the case from the rubric, the judge cannot either.
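The human sanity check can itself be made mechanical. The sketch below assumes the judge replies with the JSON shape requested in the rubric; `parse_verdict` and `agreement_rate` are hypothetical helper names, and the labels are made-up examples rather than real results.

```python
import json
from typing import Optional

CRITERIA = ("correct", "polite", "concise")

def parse_verdict(raw: str) -> Optional[dict]:
    """Parse the judge's JSON reply; return None if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if not all(isinstance(data.get(c), bool) for c in CRITERIA):
        return None
    return {c: data[c] for c in CRITERIA}

def agreement_rate(judge: list[dict], human: list[dict]) -> dict:
    """Fraction of cases where judge and human give the same label, per criterion."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return {
        c: sum(j[c] == h[c] for j, h in zip(judge, human)) / len(judge)
        for c in CRITERIA
    }

# Made-up judge replies and hand labels for the same two cases
judge_labels = [
    parse_verdict('{"correct": true, "polite": true, "concise": false, "notes": ""}'),
    parse_verdict('{"correct": false, "polite": true, "concise": true, "notes": "missed the question"}'),
]
human_labels = [
    {"correct": True, "polite": True, "concise": False},
    {"correct": True, "polite": True, "concise": True},
]
agreement = agreement_rate(judge_labels, human_labels)
print(agreement)  # {'correct': 0.5, 'polite': 1.0, 'concise': 1.0}
```

Reporting agreement per criterion rather than as one number shows which part of the rubric the judge struggles with. Malformed replies come back as `None` and should be counted separately, not silently dropped.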

A/B comparing prompts and models

Once you have a dataset and a scoring function, comparisons become mechanical.

# Compare two prompts on the same dataset.
# Assumes `run_with_prompt` and `score` are your own helpers.
from statistics import mean

scores_a, scores_b = [], []
for case in dataset:
    out_a = run_with_prompt(PROMPT_A, case["input"])
    out_b = run_with_prompt(PROMPT_B, case["input"])
    scores_a.append(score(out_a, case))
    scores_b.append(score(out_b, case))

# Report the overall scores; break them down by tag before deciding
print(f"A: {mean(scores_a):.2%}  B: {mean(scores_b):.2%}")

Things to watch:

Statistical noise. With 30 examples, one flipped case moves the score by about 3.3 points, so a gap of a case or two is probably noise. Aim for clear gaps or larger datasets before declaring victory.
Per-tag breakdowns. Overall scores hide regressions. Check for results like “B is +5% overall but -15% on refund cases” before shipping.
Cost and latency. A prompt that scores 1% higher but costs 4x is rarely worth it.
Same data, same seed. Compare on identical inputs; otherwise you are measuring the dataset, not the system.
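Two of the checks above, per-tag deltas and statistical noise, can be sketched in plain Python. This assumes binary (0/1) scores paired by case; the noise check is an exact two-sided sign test on the cases where A and B disagree, swapped in here as one simple option and not the only valid one. The function names and example data are illustrative.

```python
from collections import defaultdict
from math import comb

def per_tag_delta(cases, scores_a, scores_b):
    """Mean score of B minus mean score of A, for each tag."""
    totals = defaultdict(lambda: [0, 0, 0])  # tag -> [sum_a, sum_b, count]
    for case, a, b in zip(cases, scores_a, scores_b):
        for tag in case["tags"]:
            totals[tag][0] += a
            totals[tag][1] += b
            totals[tag][2] += 1
    return {tag: (sb - sa) / n for tag, (sa, sb, n) in totals.items()}

def sign_test_p_value(scores_a, scores_b):
    """Two-sided exact sign test over the cases where A and B disagree."""
    b_wins = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    a_wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # identical results: no evidence either way
    k = min(a_wins, b_wins)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Made-up example: B wins overall (75% vs 50%) but regresses on refunds
cases = [
    {"id": "c1", "tags": ["refund"]},
    {"id": "c2", "tags": ["refund"]},
    {"id": "c3", "tags": ["order-status"]},
    {"id": "c4", "tags": ["order-status"]},
]
scores_a = [1, 1, 0, 0]
scores_b = [1, 0, 1, 1]

deltas = per_tag_delta(cases, scores_a, scores_b)
p_value = sign_test_p_value(scores_a, scores_b)
print(deltas)   # {'refund': -0.5, 'order-status': 1.0}
print(p_value)  # 1.0
```

With only three disagreements, the p-value of 1.0 says the overall gap is indistinguishable from coin flips, while the per-tag view exposes the refund regression that the headline number hides.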

Try it. Pick one LLM feature in your codebase. Write down 20 inputs you have actually seen from users. For each, write what the ideal output looks like. Run your current prompt against them and score by hand. That hour will tell you more about your system than a month of intuition.

Regression suites

A regression suite is a golden dataset plus a CI job. Every PR runs the suite; a meaningful score drop blocks the merge.

What “meaningful” means depends on noise. A common pattern:

Run the suite three times, take the median score
Compare against the score on main
Block if any tag drops more than 5 percentage points

This catches the classic “prompt tweak fixed one case, broke five others” pattern before it ships. Pair with a manual review of the diff: which cases changed answers, and was the new answer actually better?
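The three-step pattern above might look like this as a gate function. It is a sketch only: the per-tag scores are made-up numbers, the function names are hypothetical, and a real CI job would load the scores for main from wherever the baseline is stored.

```python
from statistics import median

THRESHOLD = 0.05  # 5 percentage points, as in the pattern above

def median_per_tag(runs: list[dict]) -> dict:
    """Median score per tag across repeated runs of the suite."""
    return {tag: median(run[tag] for run in runs) for tag in runs[0]}

def failing_tags(main: dict, candidate: dict, threshold: float = THRESHOLD) -> list:
    """Tags whose score dropped by more than `threshold` relative to main."""
    return sorted(
        tag for tag, base in main.items()
        if base - candidate.get(tag, 0.0) > threshold
    )

# Illustrative per-tag accuracy: the baseline on main, then three PR runs
main_scores = {"happy-path": 0.90, "easy-confuse": 0.70}
pr_runs = [
    {"happy-path": 0.92, "easy-confuse": 0.60},
    {"happy-path": 0.90, "easy-confuse": 0.65},
    {"happy-path": 0.88, "easy-confuse": 0.55},
]

candidate = median_per_tag(pr_runs)
failing = failing_tags(main_scores, candidate)
print(failing)  # ['easy-confuse']
# A CI job would exit non-zero here when `failing` is non-empty.
```

A tag missing from the candidate run is treated as a score of 0.0, so deleting cases cannot make the gate pass by accident.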

Production observability

Pre-deploy evals catch known cases. Production catches the cases you did not think of.

What to log per LLM call:

Inputs, outputs, model, prompt version
Latency, token counts, cost
Tool calls made (if any) — see Tool Use and Function Calling
A user-facing signal — thumbs up/down, retry, abandonment

Signals you actually look at:

Cost per session over time
Thumbs-down rate by feature and by prompt version
Latency p95 — slow LLM calls kill engagement
Sampling for review. Pull 1% of conversations daily and read them. There is no substitute.

The thumbs-down conversations are your next round of eval cases. The loop is: production → review → add to dataset → fix → ship → measure. This is the boring, durable way LLM products improve.
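As a rough illustration of the logging and signals listed above, here is a sketch of a per-call record plus two of the aggregates. The `LLMCallRecord` fields are a subset chosen for brevity, and the names, the `model-a` placeholder, and the sample values are all assumptions rather than a prescribed schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMCallRecord:
    prompt_version: str
    model: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    thumbs_up: Optional[bool] = None  # None when the user gave no feedback

def thumbs_down_rate(records):
    """Share of rated calls that got a thumbs-down, per prompt version."""
    counts = defaultdict(lambda: [0, 0])  # version -> [downs, rated]
    for r in records:
        if r.thumbs_up is None:
            continue  # unrated calls do not count either way
        counts[r.prompt_version][0] += 0 if r.thumbs_up else 1
        counts[r.prompt_version][1] += 1
    return {v: downs / rated for v, (downs, rated) in counts.items()}

def p95_latency(records):
    """Nearest-rank 95th percentile of latency in milliseconds."""
    latencies = sorted(r.latency_ms for r in records)
    if not latencies:
        raise ValueError("no records")
    rank = (95 * len(latencies) + 99) // 100  # ceil(0.95 * n) in integer math
    return latencies[rank - 1]

# Made-up records for two prompt versions
records = [
    LLMCallRecord("v1", "model-a", 420.0, 310, 85, thumbs_up=True),
    LLMCallRecord("v1", "model-a", 510.0, 290, 90, thumbs_up=False),
    LLMCallRecord("v2", "model-a", 380.0, 305, 70, thumbs_up=True),
    LLMCallRecord("v2", "model-a", 2900.0, 330, 400),
]
rates = thumbs_down_rate(records)
p95 = p95_latency(records)
print(rates)  # {'v1': 0.5, 'v2': 0.0}
print(p95)    # 2900.0
```

Keying the rate on prompt version is what lets a production regression be traced back to a specific change.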

A small workflow that scales

1. Curate 30 examples from real user logs
2. Run current system, score, record baseline
3. Make one change — prompt, model, retrieval tweak
4. Re-run, compare per-tag, decide ship/no-ship
5. After ship, monitor production signals
6. Add new failures to the dataset, go to 3

This is unglamorous. It is also how the teams shipping reliable LLM products work.

Reflection. When was the last time you compared two prompts with numbers, not impressions? If the answer is “never,” your shipping velocity is being capped by something a Python script and a CSV can fix.

Recap

Vibes do not scale. Evals turn LLM changes into measurable decisions.
Golden datasets of 30–50 curated cases pay back forever.
Exact-match scoring whenever you can. LLM-as-judge when you must.
A/B compare on the same inputs; check per-tag deltas.
Regression suites in CI catch silent breakage.
Observability + review turn production into your next eval set.

Next steps

Evaluation tells you what is working. Embeddings power most of the retrieval those evals depend on — that is the next foundation worth understanding.

→ Next: Text Embeddings: The Foundation of Semantic Search

Questions or feedback? Email codeloomdevv@gmail.com.