Structured Outputs with LLMs

Intermediate · 10 min read

What you'll learn

✓ Why free-text LLM outputs are unreliable for code
✓ How JSON mode and tool use enforce structure
✓ How to validate with schemas
✓ Patterns for retries and partial parses
✓ Pitfalls and cost considerations

Prerequisites

• You have called an LLM API once

If you have ever asked an LLM to “respond in JSON” and watched it return JSON wrapped in “Here is the response:” plus a stray trailing comma, you already know why structured outputs exist. Free text is not a data format. The fix is to constrain the model to a schema, route the response through tool use, or both. This post is about the patterns that actually hold up.

The problem in one paragraph

LLMs sample tokens from a probability distribution. Even when they “know” the format, a small amount of probability mass can drift into prose, code fences, or near-JSON. Your downstream parser does not care that 99 percent of the response was valid; one bad token breaks the run. Structured outputs are about removing that one percent.
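
To make the failure mode concrete, here is a small sketch using only Python's standard json module. The sample strings are invented for illustration; they mimic the kinds of near-JSON a model can emit when nothing constrains it.

```python
import json

# Hypothetical model outputs; only the last one is strict JSON.
samples = [
    "Sure! { name: 'Ada', age: 36 }",                    # prose prefix, unquoted keys
    'Here is the response: {"name": "Ada", "age": 36}',  # valid JSON behind a preamble
    '{"name": "Ada", "age": 36,}',                       # stray trailing comma
    '{"name": "Ada", "age": 36}',                        # valid
]

def parses(text: str) -> bool:
    """Return True if text is valid JSON as-is, with no cleanup."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

results = [parses(s) for s in samples]
print(results)  # [False, False, False, True]
```

Three of the four responses look fine to a human reader and still fail the parser, which is the gap structured outputs are meant to close.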

Mental model

Free text:
prompt --> model --> "Sure! { name: 'Ada', age: 36 }" -> parser fails

Tool use / JSON schema:
prompt + schema --> model constrained at decode time
                      --> {"name": "Ada", "age": 36} -> parser ok

Free text vs structured outputs

Depending on the API, the provider either constrains decoding (tokens that would break the schema get zero probability) or validates the output before returning it.
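
The decode-time half of that sentence can be sketched as a toy. The vocabulary, scores, and the function name constrained_probs below are all made up for illustration; real providers work over full tokenizer vocabularies and track a grammar state, and their internals are not documented here. The idea is only that disallowed tokens are masked before sampling.

```python
import math

def constrained_probs(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Softmax over logits after masking every token that is not in `allowed`."""
    if not allowed & logits.keys():
        raise ValueError("no allowed token is in the vocabulary")
    # Masked tokens get a score of -inf, which becomes probability 0 after softmax.
    masked = {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}
    peak = max(masked.values())
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Toy next-token scores: unconstrained, "Sure" would be the most likely first token.
logits = {"Sure": 2.0, "{": 1.5, "Here": 1.0, "[": 0.2}

# A JSON object has to start with "{", so the grammar allows only that token.
probs = constrained_probs(logits, allowed={"{"})
print(probs)  # {'Sure': 0.0, '{': 1.0, 'Here': 0.0, '[': 0.0}
```

The model's preference for a chatty opening is still there in the raw scores; the mask simply makes it impossible to act on.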

Hands-on: tool use as a typed return

The most reliable pattern across providers is tool use. You define a “tool” whose only job is to receive the answer, and the model invokes it instead of writing prose.

from anthropic import Anthropic

client = Anthropic()

extract = {
    "name": "save_person",
    "description": "Save the extracted person to the database.",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age":  {"type": "integer", "minimum": 0},
            "email": {"type": "string", "format": "email"},
        },
        "required": ["name", "email"],
    },
}

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    tools=[extract],
    tool_choice={"type": "tool", "name": "save_person"},
    messages=[{"role": "user", "content": "Extract: Ada Lovelace, ada@example.com, 36"}],
)
person = next(b for b in resp.content if b.type == "tool_use").input
print(person)  # {'name': 'Ada Lovelace', 'age': 36, 'email': 'ada@example.com'}

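tool_choice forces the model to call this specific tool. There is no prose to parse and no markdown to strip. The arguments are generated against the schema, but unless the provider offers a strict mode and you enable it, treat conformance as very likely rather than guaranteed.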
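
One detail worth handling: a bare next(...) over the content blocks raises StopIteration if no tool_use block is present, for example when a response is cut off early. The helper below is a hypothetical sketch, not part of any SDK. It assumes only that content blocks expose type, name, and input attributes, and it uses SimpleNamespace stand-ins so it runs without an API key.

```python
from types import SimpleNamespace

def tool_input(content, tool_name: str) -> dict:
    """Return the input of the first tool_use block for tool_name, or raise LookupError."""
    for block in content:
        if getattr(block, "type", None) == "tool_use" and getattr(block, "name", None) == tool_name:
            return block.input
    raise LookupError(f"no tool_use block for {tool_name!r} in response")

# Stand-in objects shaped like response content blocks, for illustration only.
fake_content = [
    SimpleNamespace(type="text", text="Saving the record."),
    SimpleNamespace(
        type="tool_use",
        name="save_person",
        input={"name": "Ada Lovelace", "email": "ada@example.com", "age": 36},
    ),
]

person = tool_input(fake_content, "save_person")
print(person["email"])  # ada@example.com
```

Raising a named error with the tool name in the message makes a missing tool call much easier to spot in logs than an anonymous StopIteration.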

Hands-on: JSON mode and JSON schema

Many providers also offer a “respond as JSON” mode, optionally with a JSON schema. With OpenAI:

from openai import OpenAI
client = OpenAI()

schema = {
  "type": "object",
  "properties": {
    "summary": {"type": "string"},
    "tags": {"type": "array", "items": {"type": "string"}, "maxItems": 5},
  },
  "required": ["summary", "tags"],
  "additionalProperties": False,
}

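# The Responses API takes the schema under text.format; Chat Completions
# uses response_format={"type": "json_schema", "json_schema": {...}} instead.
resp = client.responses.create(
  model="gpt-x",  # substitute a real model name
  input="Summarize: ...",
  text={"format": {"type": "json_schema", "name": "Summary", "schema": schema, "strict": True}},
)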

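In strict mode the decoder is constrained to produce JSON that matches the schema. The output (resp.output_text) is a string you can pass to json.loads with confidence, provided the response was not cut off by the token limit or replaced by a refusal.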

Validate even when the API says it is valid

Provider-side schemas check JSON shape but not your business rules. Always run the result through your own validator (Pydantic, Zod, JSON Schema) before using it.

from pydantic import BaseModel, EmailStr, Field

class Person(BaseModel):
    name: str
    age: int | None = Field(default=None, ge=0, le=150)  # optional, but bounded when present
    email: EmailStr  # needs the optional email-validator package

validated = Person.model_validate(person)  # raises ValidationError if any rule fails
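
Some rules cannot live in a schema at all because they depend on application state. The sketch below uses only the standard library; business_rule_errors and the existing_emails set are hypothetical stand-ins for whatever checks and database lookups your application actually has.

```python
def business_rule_errors(person: dict, existing_emails: set[str]) -> list[str]:
    """Application-level checks that go beyond JSON shape; returns readable problems."""
    errors = []
    age = person.get("age")
    if age is not None and age > 150:
        errors.append(f"age {age} is implausible")
    # Uniqueness depends on stored data, so no provider-side schema can check it.
    if person["email"].lower() in existing_emails:
        errors.append(f"email {person['email']} is already registered")
    return errors

existing = {"ada@example.com"}  # hypothetical stand-in for a database lookup
candidate = {"name": "Ada Lovelace", "age": 36, "email": "Ada@Example.com"}

problems = business_rule_errors(candidate, existing)
print(problems)  # ['email Ada@Example.com is already registered']
```

The candidate here is perfectly schema-valid, which is exactly why this layer needs to exist separately.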

Retries and partial parses

When the model's output still fails validation, send the validation error back as a new turn and ask for a fix. This loop succeeds far more often than re-prompting blindly.

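from pydantic import ValidationError

def extract_person(messages, max_attempts=3):
    """Call the model, feeding validation errors back until the output validates."""
    messages = list(messages)  # copy so the caller's history is not mutated
    for _ in range(max_attempts):
        result = call_model(messages)  # your own wrapper around the API call
        try:
            return Person.model_validate(result)
        except ValidationError as e:
            # Give the model the exact error so it can correct itself.
            messages.append({"role": "user", "content": f"Validation failed: {e}. Return only valid JSON."})
    raise RuntimeError("model could not produce valid output")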

Streaming partial JSON is also tractable: parse incrementally with a streaming JSON parser to render parts of the result as they arrive. Useful for UX, not for correctness.
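
As a rough illustration of incremental parsing, the sketch below closes whatever strings, arrays, and objects are still open in a prefix and tries to parse the result. parse_partial is a hypothetical helper written for this post, not a library function; dedicated streaming parsers handle many more edge cases.

```python
import json

def parse_partial(prefix: str):
    """Best-effort parse of a JSON prefix; returns None if it cannot be completed yet."""
    closers = []        # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    candidate = prefix + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # e.g. a dangling comma or a key with no value yet

first = parse_partial('{"summary": "Struct')
second = parse_partial('{"summary": "Done", "tags": ["llm"')
third = parse_partial('{"summary": "Done", ')
print(first)   # {'summary': 'Struct'}
print(second)  # {'summary': 'Done', 'tags': ['llm']}
print(third)   # None
```

A caller would keep the last non-None snapshot on screen and replace it as chunks arrive. Note that a snapshot can hold truncated values (a half-finished string, or 3 where the final number is 36), so validate only the complete response.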

Common pitfalls

Asking for JSON in the prompt and not the API. Prompts shift behavior, but only API-level constraints guarantee it.
Schemas with anyOf or open enums. Models drift on ambiguous schemas; keep types narrow and required fields explicit.
Letting the model choose the schema. If you offer ten optional fields, expect inconsistent results across calls. Fewer fields, more required.
Using JSON mode for nested rich text. JSON is a transport; markdown belongs inside string fields, not as nested structures.
Skipping validation because “the API says strict.” Strict checks JSON Schema only. Email format, foreign keys, business rules: still your job.
Forgetting cost. Tool use and JSON-mode responses count tokens like anything else; verbose schemas inflate prompt size.

Practical tips

Prefer tool use when you have a single target shape and want strong constraints. Prefer JSON schema mode when you want a typed response with no tool call ceremony.
Keep schemas under ~30 fields per call. Big schemas degrade quality; split into stages.
Add additionalProperties: false. Otherwise models invent fields and you silently swallow them.
Log the raw response alongside the parsed one. When parsing fails you will want to see exactly what the model produced.
Use temperature=0 for extraction tasks. Consistency matters more than creativity here, though even at zero most providers do not promise identical outputs across calls.
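
Several of these pitfalls and tips can be checked mechanically before a schema ever reaches the model. The lint_schema function below is a hypothetical, deliberately small heuristic, not an exhaustive JSON Schema analysis. It flags anyOf, objects that do not set additionalProperties to false, and optional fields.

```python
def lint_schema(schema: dict, path: str = "$") -> list[str]:
    """Flag schema features that tend to cause drift (heuristic, not exhaustive)."""
    warnings = []
    if "anyOf" in schema:
        warnings.append(f"{path}: anyOf makes the target shape ambiguous")
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            warnings.append(f"{path}: additionalProperties is not false")
        optional = sorted(set(props) - set(schema.get("required", [])))
        if optional:
            warnings.append(f"{path}: optional fields {optional}")
        for name, sub in props.items():
            warnings.extend(lint_schema(sub, f"{path}.{name}"))
    if schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        warnings.extend(lint_schema(schema["items"], f"{path}[]"))
    return warnings

# The person schema from the tool-use example earlier in this post.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "email": {"type": "string", "format": "email"},
    },
    "required": ["name", "email"],
}

for warning in lint_schema(person_schema):
    print(warning)
# $: additionalProperties is not false
# $: optional fields ['age']
```

An optional field is not always wrong, so treat the output as a prompt to decide deliberately rather than as a list of errors.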

Wrap-up

Treat the LLM as a function with a typed return: declare the type, constrain at decode time, validate after, and retry with the error message when validation fails. Tool use and JSON schemas turn LLM calls from prose generators into actual functions, which is the only way most production pipelines should be calling them.