Prompt Engineering: Evaluation Loops

Intermediate 10 min read

What you'll learn

✓ Why eval loops beat one-shot prompt tweaking
✓ How to build a small but useful eval dataset
✓ Choosing between human, code, and LLM-as-judge graders
✓ Wiring evals into a tight iteration loop
✓ Catching regressions before they ship

Prerequisites

• Basic LLM API usage
• Comfort with Python scripts

If you change a prompt and only test it on the one example that motivated the change, you are tuning by vibes. Evaluation loops are how you replace vibes with evidence. They take a little setup and they pay back every time you touch the prompt afterward.

What and why

An evaluation loop is a repeatable process with three parts: a fixed dataset of inputs, a way to score each output, and a runner that produces a number you can compare across prompt versions. The output is a metric, not a vibe check.

The reason this matters is regression. Prompt changes often improve one case while quietly breaking three others. Without an eval, you only find out when a user complains. With an eval, you see the breakage before you ship.

Mental model

Treat the prompt like a piece of software and the eval like a test suite. Tests are not exhaustive proof; they are tripwires for the failure modes you have decided to care about. Your eval should cover the obvious happy path plus the cases that have burned you before.

A grader can be code (regex, schema validation, exact match), a human (slow, expensive, gold standard), or another LLM (cheap, scalable, biased). Most production setups use a mix: code for the cheap checks, LLM-as-judge for the subjective ones, and humans for spot-audits.
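
To make the "code" category concrete, here is a minimal sketch of the three checks named above: exact match, regex, and a lightweight schema check. The function names are illustrative assumptions, and the schema check only verifies that required keys are present; a real setup might use a full schema validator instead.

```python
import json
import re

def exact_match(output: str, expected: str) -> bool:
    """Pass when the output equals the expected string, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()

def matches_pattern(output: str, pattern: str) -> bool:
    """Pass when the whole output matches a regular expression."""
    return re.fullmatch(pattern, output.strip()) is not None

def has_required_keys(output: str, required: set[str]) -> bool:
    """Pass when the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required.issubset(parsed.keys())
```

Each grader returns a plain boolean, so results from different checks can be averaged into pass rates without extra bookkeeping.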

Hands-on example

Start with twenty examples. Not two thousand. Just enough to see signal.

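import json

# load_jsonl, run_prompt, summarize, and prompt_v3 are your own helpers and prompt text.
dataset = load_jsonl("evals/customer_intent.jsonl")

def grade(example, output):
    # output is the raw model text, so parse it before reading any fields
    try:
        parsed = json.loads(output)
    except (TypeError, json.JSONDecodeError):
        parsed = None
    format_ok = isinstance(parsed, dict)
    return {
        "format_ok": format_ok,
        "label_match": format_ok and parsed.get("intent") == example["expected_intent"],
    }

results = []
for ex in dataset:
    out = run_prompt(prompt_v3, ex["input"])
    results.append(grade(ex, out))

print(summarize(results))  # accuracy, format failures, per-tag breakdown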
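
The `summarize` helper is not defined above. One possible shape for it, assuming each result is the dict returned by `grade`, is sketched below. It covers accuracy and format failures only; the per-tag breakdown is left out because it would need each result to carry its example's tags.

```python
def summarize(results):
    """Aggregate per-example grades into run-level metrics."""
    total = len(results)
    if total == 0:
        return {"n": 0, "accuracy": 0.0, "format_failures": 0}
    correct = sum(1 for r in results if r["label_match"])
    format_failures = sum(1 for r in results if not r["format_ok"])
    return {
        "n": total,
        "accuracy": correct / total,
        "format_failures": format_failures,
    }

# Hand-written grades standing in for a real run.
example_results = [
    {"format_ok": True, "label_match": True},
    {"format_ok": True, "label_match": False},
    {"format_ok": False, "label_match": False},
    {"format_ok": True, "label_match": True},
]
print(summarize(example_results))
# {'n': 4, 'accuracy': 0.5, 'format_failures': 1}
```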

The loop itself looks like this.

prompt vN
    |
    v
[run on dataset] -- outputs --> [grader] --> metrics
                                                |
                                                v
                                  [compare with previous baseline]
                                                |
                                           regression?
                                                |
                                  yes -- block / investigate
                                  no  -- promote vN as new baseline

Prompt evaluation loop with grader and regression check

The key step is the comparison with the previous baseline. Without it, you have a benchmark, not a loop. Save metrics and a sample of outputs for every run so you can diff them later.
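
The comparison step can be sketched as follows. This is a minimal version under stated assumptions: every metric is "higher is better", the `tolerance` of 0.02 is an arbitrary placeholder for run-to-run noise, and the function names and metric names are hypothetical.

```python
import json
from pathlib import Path

def find_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return metrics that dropped by more than `tolerance` relative to the baseline.

    Assumes every metric is "higher is better" (accuracy, format pass rate, ...).
    Metrics missing from `current` are skipped rather than flagged.
    """
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is not None and old - new > tolerance:
            regressions[name] = {"baseline": old, "current": new}
    return regressions

def save_run(path: Path, version: str, metrics: dict, sample_outputs: list) -> None:
    """Write metrics and a sample of outputs to disk so runs can be diffed later."""
    record = {"version": version, "metrics": metrics, "sample_outputs": sample_outputs}
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")

baseline = {"accuracy": 0.85, "format_pass_rate": 1.0}
current = {"accuracy": 0.90, "format_pass_rate": 0.90}
regressions = find_regressions(baseline, current)
print(regressions or "no regressions: promote as new baseline")
```

In this made-up run accuracy improved, but the format pass rate fell by more than the tolerance, so the new version would be blocked even though the headline number went up.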

Trade-offs

LLM-as-judge graders scale well and handle subjective dimensions like tone or helpfulness, but they have their own biases. They tend to prefer longer answers and to favor outputs from their own model family. Anchor them with rubrics, and on a small subset have humans grade the same outputs so you can measure how often the judge agrees with them.
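
One way to check the judge against humans is to have both label the same small subset, then compute raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes categorical labels such as "pass" and "fail"; the label lists are invented for illustration.

```python
from collections import Counter

def agreement_rate(judge: list, human: list) -> float:
    """Fraction of examples where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge: list, human: list) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge) | set(human)
    )
    if expected == 1.0:
        return 1.0  # both graders used one identical label throughout
    return (observed - expected) / (1 - expected)

judge_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]
print(agreement_rate(judge_labels, human_labels))  # 0.75
print(cohens_kappa(judge_labels, human_labels))    # 0.5
```

Raw agreement alone can look high when one label dominates, which is why the chance-corrected number is worth printing next to it.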

Code graders are cheap and deterministic, but they only work when correctness can be checked mechanically. They are perfect for format and schema checks and terrible for nuanced quality.

Big datasets feel rigorous but slow the loop and discourage iteration. A twenty-example smoke set you actually run beats a two-thousand-example set you run once a quarter.

Practical tips

Keep the eval dataset in version control next to the prompt. When the prompt changes, reviewers see both diffs in one place.

Add new examples whenever a bug ships. The set should grow with each lesson learned. Tag examples by failure type so you can track which categories regress.
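
Tagging can be as simple as a list of strings on each dataset record. The sketch below assumes a hypothetical `tags` field and invented tag names, and computes accuracy per tag so that a drop in one category stays visible even when overall accuracy is flat.

```python
from collections import defaultdict

def accuracy_by_tag(examples: list, results: list) -> dict:
    """Compute label accuracy per tag; an example with several tags counts toward each."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for example, result in zip(examples, results):
        for tag in example.get("tags", ["untagged"]):
            totals[tag] += 1
            correct[tag] += int(result["label_match"])
    return {tag: correct[tag] / totals[tag] for tag in sorted(totals)}

# Invented records; in practice these come from the JSONL dataset and the grader.
examples = [
    {"input": "I want my money back", "expected_intent": "refund", "tags": ["happy_path"]},
    {"input": "refnd pls!!", "expected_intent": "refund", "tags": ["typos"]},
    {"input": "Cancel, or maybe refund?", "expected_intent": "cancel",
     "tags": ["ambiguous", "multi_intent"]},
    {"input": "cancl my acount", "expected_intent": "cancel", "tags": ["typos"]},
]
results = [
    {"label_match": True},
    {"label_match": False},
    {"label_match": True},
    {"label_match": True},
]
print(accuracy_by_tag(examples, results))
# {'ambiguous': 1.0, 'happy_path': 1.0, 'multi_intent': 1.0, 'typos': 0.5}
```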

Run evals on pull requests. A GitHub Action that posts a metrics summary as a PR comment turns the loop into a habit instead of a chore.
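
The comment body itself is easy to generate. Below is a rough sketch of a formatter that turns baseline and current metrics into a Markdown table; the function name and metric names are assumptions, and the posting step (workflow file, token, API call) is deliberately left out because it depends on your CI setup.

```python
def render_pr_comment(baseline: dict, current: dict) -> str:
    """Render a Markdown table comparing current metrics to the baseline."""
    lines = [
        "### Prompt eval results",
        "",
        "| Metric | Baseline | Current | Delta |",
        "| --- | --- | --- | --- |",
    ]
    for name in sorted(current):
        new = current[name]
        old = baseline.get(name)
        if old is None:
            # Metric is new in this run, so there is nothing to compare against.
            lines.append(f"| {name} | n/a | {new:.3f} | n/a |")
        else:
            lines.append(f"| {name} | {old:.3f} | {new:.3f} | {new - old:+.3f} |")
    return "\n".join(lines)

print(render_pr_comment({"accuracy": 0.85}, {"accuracy": 0.9, "format_pass_rate": 1.0}))
```

Showing the signed delta next to each metric lets a reviewer spot a regression without opening the full run logs.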

Cache model outputs by input hash during development. You will run the same dataset dozens of times in a single session; uncached runs make iteration painful and expensive.
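
A minimal disk cache might look like the sketch below. The class name and the `call_model` callback are hypothetical. One design choice goes beyond the tip above: the key hashes the prompt text together with the input, so editing the prompt does not serve stale outputs. If you also vary the model or sampling settings, those would need to be part of the key as well.

```python
import hashlib
import json
from pathlib import Path

class OutputCache:
    """Disk cache for model outputs, keyed by a hash of prompt and input."""

    def __init__(self, directory: str):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, prompt: str, user_input: str) -> Path:
        # Hash the prompt too, so editing the prompt invalidates old entries.
        payload = json.dumps({"prompt": prompt, "input": user_input}, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get_or_call(self, prompt: str, user_input: str, call_model) -> str:
        """Return the cached output, calling `call_model(prompt, user_input)` on a miss."""
        path = self._path(prompt, user_input)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))["output"]
        output = call_model(prompt, user_input)
        path.write_text(json.dumps({"output": output}), encoding="utf-8")
        return output
```

In the eval runner, a call such as `cache.get_or_call(prompt_v3, ex["input"], run_prompt)` would replace the direct `run_prompt` call. Cached outputs are frozen samples, so clear the cache directory when you want fresh generations.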

For LLM-as-judge, write the rubric as if you were briefing a contractor. Be explicit about what counts as a fail. Vague rubrics give noisy scores.
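
As an illustration, here is what an explicit rubric could look like for a hypothetical customer-support reply grader. The fail conditions and the `VERDICT:` output format are invented for this sketch, not a recommended policy. Replace them with the failure modes you actually care about.

```python
import re

JUDGE_RUBRIC = """You are grading a customer-support reply.

Mark the reply FAIL if any of the following is true:
1. It does not address the customer's stated problem.
2. It promises a refund, credit, or timeline.
3. It is rude, sarcastic, or blames the customer.

Otherwise mark it PASS. Length alone is never a reason to pass or fail.

Customer message:
{customer_message}

Reply to grade:
{reply}

Answer with one line of reasoning, then a final line that is exactly
VERDICT: PASS or VERDICT: FAIL."""

def build_judge_prompt(customer_message: str, reply: str) -> str:
    """Fill the rubric template with the example being graded."""
    return JUDGE_RUBRIC.format(customer_message=customer_message, reply=reply)

def parse_verdict(judge_output: str):
    """Return True for PASS, False for FAIL, None if the judge ignored the format."""
    matches = re.findall(r"^VERDICT:[ \t]*(PASS|FAIL)[ \t]*$", judge_output, re.MULTILINE)
    if not matches:
        return None
    return matches[-1] == "PASS"  # use the last verdict line if several appear
```

Returning `None` for unparseable judge output keeps format failures of the judge separate from genuine FAIL verdicts, so they can be counted and inspected on their own.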

Wrap-up

Evaluation loops are the difference between prompt engineering as craft and prompt engineering as guessing. The smallest useful version is a JSONL file, a grader function, and a script. Build that, run it on every change, and your prompts will get measurably better instead of just feeling better.