LLM Output Parsing and Validation

Intermediate 10 min read

What you'll learn

✓ Why parsing LLM outputs is hard
✓ JSON mode and structured outputs
✓ Schema validation patterns
✓ Repair and retry loops
✓ How to detect partial successes

Prerequisites

• Familiar with APIs

Models that produce free text are wonderful for chat and terrible for downstream code. The moment you need to extract a field, store a row, or branch on a value, you need structured output you can trust. This post lays out the techniques that make that reliable.

What output parsing really is

Output parsing turns a model’s text response into a data structure your code can use. The naive approach is to ask for JSON in the prompt and call json.loads. That works most of the time and fails in unpleasant ways when it does not: trailing commentary, missing keys, wrong types, unescaped quotes.
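To make those failure modes concrete, here is a small sketch that feeds four replies to json.loads. The sample strings are invented for illustration, not captured from a real model.

```python
import json

# Hypothetical raw replies, one per failure mode named above.
samples = {
    "trailing commentary": '{"title": "Login bug", "priority": 2}\nLet me know if you need more!',
    "missing key": '{"title": "Login bug"}',
    "wrong type": '{"title": "Login bug", "priority": "high"}',
    "unescaped quotes": '{"title": "User says "it broke"", "priority": 2}',
}

def naive_parse(raw: str) -> str:
    """Describe what json.loads does with the raw text."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return f"JSONDecodeError: {e.msg}"
    # json.loads only checks syntax; it knows nothing about expected keys or types.
    return f"parsed, priority={data.get('priority')!r}"

results = {name: naive_parse(raw) for name, raw in samples.items()}
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

Two of the four samples raise an exception, which is at least loud. The missing key and the wrong type parse without complaint, which is why a separate validation step is needed.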

Reliable parsing has three layers. Generation control nudges the model toward valid output. Validation checks that what came back matches the schema. Repair and retry handle the remaining errors.

Mental model

Think of each LLM call as an API client that occasionally returns malformed responses. Your job is to put a thin parsing layer in front that catches the rough edges before the rest of the system sees them.

prompt + schema
 |
 v
LLM call -> raw text
 |
 v
parse + validate
 |  fails
 v
repair (retry with error)
 |
 v
final typed object

Parsing pipeline around an LLM call

Hands-on example

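A robust pattern using Pydantic and the Anthropic Python SDK: validate the reply against a schema and, on failure, retry with the error message included in the prompt.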

import json

from anthropic import Anthropic
from pydantic import BaseModel, ValidationError

class Ticket(BaseModel):
    title: str
    priority: int
    tags: list[str]

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# Serialize the schema as real JSON; interpolating the dict directly would
# put a Python repr (single quotes) into the prompt.
schema = json.dumps(Ticket.model_json_schema())

def extract(user_text: str, attempts: int = 2) -> Ticket:
    err = None
    for _ in range(attempts):
        prompt = f"Return only JSON matching this schema:\n{schema}\nText: {user_text}"
        if err:
            # Feed the previous failure back so the model can correct it.
            prompt += f"\nFix this error: {err}"
        resp = client.messages.create(
            model="claude-opus-4-7", max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        try:
            # Parses and validates in one step. Malformed JSON, non-object JSON,
            # and schema mismatches all raise ValidationError.
            return Ticket.model_validate_json(resp.content[0].text)
        except ValidationError as e:
            err = str(e)
    raise RuntimeError(f"failed after {attempts}: {err}")

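Better yet, use the provider’s structured outputs feature. OpenAI’s response_format with type json_schema constrains decoding to the schema when strict mode is on, and Anthropic’s tool-use trick (call a tool whose input schema is your data shape) steers the model strongly toward it. Both make invalid output far less likely, though not impossible: a response cut off at the token limit, or a refusal, can still break the shape.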
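A minimal sketch of the tool-use approach with the Anthropic Messages API follows. The tool name record_ticket and the hand-written schema are assumptions made for illustration. The client is passed in as a parameter so the function can be exercised with a stub.

```python
# JSON Schema for the ticket shape, hand-written so the sketch has no
# dependencies. With Pydantic it could come from Ticket.model_json_schema().
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "integer"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "priority", "tags"],
}

def extract_via_tool(client, user_text: str, model: str) -> dict:
    """Ask the model to 'call' a tool whose input schema is the data shape."""
    resp = client.messages.create(
        model=model,
        max_tokens=512,
        tools=[{
            "name": "record_ticket",  # hypothetical tool name
            "description": "Record a support ticket extracted from the text.",
            "input_schema": TICKET_SCHEMA,
        }],
        # Force the model to use this tool instead of answering in prose.
        tool_choice={"type": "tool", "name": "record_ticket"},
        messages=[{"role": "user", "content": user_text}],
    )
    for block in resp.content:
        if block.type == "tool_use" and block.name == "record_ticket":
            return block.input  # already a dict, no json.loads needed
    raise RuntimeError("model did not call record_ticket")
```

In real use you would pass an Anthropic() client and a model name, then hand the returned dict to your schema library for validation, since the tool input is not guaranteed to match the schema in every case.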

Libraries like Instructor wrap this pattern around major providers. You declare a Pydantic model and get typed results back, with automatic retries on validation errors.

Trade-offs

Asking for JSON in a free-text prompt is the easiest approach but the least reliable. Provider-supplied structured output mode is more reliable but ties you to that vendor’s feature set.

Strict schemas catch errors early but reject borderline-valid responses. A field declared as int that gets the string “3” might be salvageable; deciding whether to fail or coerce is a design choice.
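The fail-or-coerce choice maps directly onto Pydantic's lax and strict validation modes. The sketch below assumes Pydantic v2 and uses a trimmed-down Ticket model for brevity.

```python
from pydantic import BaseModel, ValidationError

class Ticket(BaseModel):
    title: str
    priority: int

payload = {"title": "Login bug", "priority": "3"}  # priority arrived as a string

# Default (lax) mode: Pydantic coerces the numeric string "3" to the int 3.
lenient = Ticket.model_validate(payload)

# Strict mode: the same payload is rejected because "3" is not an int.
try:
    Ticket.model_validate(payload, strict=True)
    strict_rejected = False
except ValidationError:
    strict_rejected = True

print(lenient.priority, strict_rejected)
```

Lax mode saves a retry here. Strict mode is the safer choice when a silent coercion could hide a real model error.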

Repair loops cost extra calls. Each retry adds at least one more full call for that request, so a single retry roughly doubles its cost. Set a tight retry budget and log every failure for offline analysis.
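A rough way to size the retry budget is to estimate the average number of calls per request. The sketch below assumes every attempt fails independently with the same probability, which real failures rarely do, and the 5% failure rate is an illustrative number rather than a measurement.

```python
def expected_calls(failure_rate: float, max_attempts: int) -> float:
    """Average LLM calls per request under independent, equal failure odds."""
    # Attempt k (0-based) only happens if all k earlier attempts failed.
    return sum(failure_rate ** k for k in range(max_attempts))

def give_up_rate(failure_rate: float, max_attempts: int) -> float:
    """Fraction of requests that exhaust the retry budget."""
    return failure_rate ** max_attempts

for attempts in (1, 2, 3):
    calls = expected_calls(0.05, attempts)
    lost = give_up_rate(0.05, attempts)
    print(f"attempts={attempts} avg_calls={calls:.4f} give_up={lost:.6f}")
```

Under these assumptions a second attempt raises the average cost by 5% and cuts unrecovered failures from 5% to 0.25%. A third attempt adds little cost but also little benefit, which is one argument for keeping the budget small.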

Using tool use as a parsing trick gives you schema-guided output at no extra cost on providers that support it. The downside is that the result is harder to stream: the tool input arrives as partial JSON, and you cannot validate it until it is complete.

Practical tips

Always validate, never trust raw output. Even with structured output mode, build a Pydantic or Zod model and run it. Future model changes can shift behavior subtly.

Log the raw response alongside the parsed object. When something looks off downstream, you want the original string available without having to reproduce the call.
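One possible shape for this, using only the standard library. The logger name and the request_id parameter are hypothetical; use whatever correlation ID your system already has.

```python
import json
import logging
from typing import Any

logger = logging.getLogger("llm.parsing")  # hypothetical logger name

def parse_and_log(raw: str, request_id: str) -> Any:
    """Parse raw model text, logging the original string either way."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as e:
        # Keep the raw text in the failure log so it can be inspected later.
        logger.warning("parse failed id=%s error=%s raw=%r", request_id, e, raw)
        return None
    logger.info("parse ok id=%s parsed=%s raw=%r", request_id, json.dumps(parsed), raw)
    return parsed
```

If responses can contain personal data, the raw string may need redaction or a shorter retention period than the rest of your logs.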

Surface specific error messages on retry. Telling the model “your JSON had a trailing comma after the tags array” is much more effective than “invalid JSON, try again.”

Prefer enums over free text for categorical fields. If priority must be one of low, medium, high, encode that in the schema. The model gets fewer chances to invent values.
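With Pydantic v2 this can be expressed with typing.Literal. The sketch below swaps the integer priority from the earlier example for a three-value string field, purely to illustrate the tip.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

class Ticket(BaseModel):
    title: str
    # Only these three values validate; anything else is rejected.
    priority: Literal["low", "medium", "high"]

# The allowed values show up as an "enum" in the JSON Schema sent to the model.
priority_schema = Ticket.model_json_schema()["properties"]["priority"]

try:
    Ticket.model_validate({"title": "Login bug", "priority": "urgent"})
    invented_value_rejected = False
except ValidationError:
    invented_value_rejected = True

print(priority_schema["enum"], invented_value_rejected)
```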

Separate parsing from interpretation. The parser turns text into a typed object. The next layer decides what to do with that object. Mixing them makes both harder to test.
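A small sketch of that split, using a dataclass instead of Pydantic to keep it dependency-free. The routing threshold and queue names are hypothetical business rules.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Ticket:
    title: str
    priority: int

def parse_ticket(raw: str) -> Ticket:
    """Parsing layer: text in, typed object out. No business rules here."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    title, priority = data.get("title"), data.get("priority")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(title, str) or isinstance(priority, bool) or not isinstance(priority, int):
        raise ValueError("title must be a string and priority an integer")
    return Ticket(title=title, priority=priority)

def route_ticket(ticket: Ticket) -> str:
    """Interpretation layer: decides what to do with an already-valid Ticket."""
    return "pager" if ticket.priority >= 3 else "backlog"
```

Because route_ticket takes a Ticket rather than a string, it can be tested with hand-built objects and no model call. parse_ticket can be tested with raw strings and no knowledge of routing.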

Wrap-up

Reliable LLM output parsing is a layered job: constrain generation, validate hard, repair softly. Use provider-supplied structured output features where you can, validate with a real schema library, and keep retries cheap and bounded. Once this layer is solid, downstream code can pretend it is calling a normal API.