Prompt Engineering Techniques That Work

Beginner 10 min read

What you'll learn

✓ How to write clear, testable instructions
✓ When few-shot examples beat zero-shot
✓ How to enforce structured output
✓ Chain-of-thought and when not to use it
✓ How to evaluate prompt changes systematically

Prerequisites

• Basic Python familiarity

Prompt engineering sounds vague, but the underlying job is concrete: write instructions that produce reliable outputs for the inputs you actually see. The techniques below are the ones that survive contact with real applications. None of them are magic; together they give you a workflow that beats trial and error.

Start with a clear contract

Treat every prompt like an API contract. State the goal in one sentence. Describe the input. Describe the expected output exactly, including format and edge cases. List what the model must not do. If a human reading the prompt cannot describe the right output for a tricky input, the model is unlikely to produce it either.

A useful structure for system prompts: role, task, constraints, output format, then any examples. Keep it short. Long system prompts hurt latency, cost, and often quality, because the model has more to weigh.
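As a minimal sketch of that structure, the helper below assembles a system prompt in the order role, task, constraints, output format, examples. The function name `build_system_prompt` and the sample wording are illustrative assumptions, not part of any SDK.

```python
def build_system_prompt(role, task, constraints, output_format, examples=()):
    """Assemble a system prompt in a fixed order (hypothetical helper):
    role, task, constraints, output format, then optional examples."""
    sections = [
        role.strip(),
        f"Task: {task.strip()}",
        "Constraints:\n" + "\n".join(f"- {c}" for c in constraints),
        f"Output format: {output_format.strip()}",
    ]
    if examples:
        sections.append("Examples:\n" + "\n\n".join(examples))
    # Blank lines between sections keep the prompt easy to read and diff.
    return "\n\n".join(sections)


system_prompt = build_system_prompt(
    role="You triage support tickets for a SaaS product.",
    task="Assign each ticket a category and a priority.",
    constraints=[
        "Use only the listed categories.",
        'If the ticket is empty or unreadable, use category "other" and priority "low".',
    ],
    output_format='JSON with keys "category", "priority", and "summary".',
)
print(system_prompt)
```

Keeping each section as a separate argument makes it easier to change one part of the contract and see exactly what changed between prompt versions.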

Use structured output

Free-form prose is hard to parse and tends to drift in format from one call to the next. When you can, ask for JSON with a fixed schema, or use a structured output feature when the SDK supports it. With the Anthropic, OpenAI, and Google SDKs, you can pass a JSON schema and the model will conform to it most of the time. For models without that feature, give a strict format and a worked example.

import json
from anthropic import Anthropic

client = Anthropic()

prompt = """
You triage support tickets. Return JSON with keys:
- category: one of ["billing", "technical", "account", "other"]
- priority: one of ["low", "medium", "high"]
- summary: a single short sentence

Ticket: I cannot log in and my trial expires tomorrow.
JSON:
"""

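# Send the prompt as a single user message.
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=256,
    messages=[{"role": "user", "content": prompt}],
)
# The first content block holds the reply text. json.loads raises
# json.JSONDecodeError if the model wrapped the JSON in extra prose.
data = json.loads(resp.content[0].text)
print(data)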

If the model occasionally adds prose around the JSON, post-process by extracting the first {...} block, or use the SDK’s structured-output mode.
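One way to do that post-processing, sketched with only the standard library: scan for each `{` and let `json.JSONDecoder.raw_decode` try to parse an object starting there. The helper names, the `ALLOWED` table, and the sample reply are assumptions made for illustration; the allowed values mirror the ticket-triage prompt above.

```python
import json

# Allowed values, mirroring the triage prompt (assumed for this sketch).
ALLOWED = {
    "category": {"billing", "technical", "account", "other"},
    "priority": {"low", "medium", "high"},
}


def extract_first_json_object(text):
    """Return the first parseable JSON object found in text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


def validate_triage(data):
    """Check that the parsed object respects the contract in the prompt."""
    for key, allowed in ALLOWED.items():
        if data.get(key) not in allowed:
            raise ValueError(f"unexpected {key}: {data.get(key)!r}")
    if not isinstance(data.get("summary"), str):
        raise ValueError("summary must be a string")
    return data


# A made-up reply with prose around the JSON.
reply = (
    "Sure, here is the triage:\n"
    '{"category": "technical", "priority": "high", '
    '"summary": "User cannot log in before the trial ends."}\n'
    "Let me know if you need anything else."
)
data = validate_triage(extract_first_json_object(reply))
print(data["category"], data["priority"])
```

Validating after parsing matters as much as the extraction: syntactically valid JSON can still carry a category the rest of your code does not expect.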

Few-shot examples beat description

For anything with a specific style or format, two or three concrete examples usually beat a much longer description. Pick examples that span the variety of inputs you care about, including at least one edge case. Keep them short; long examples push your real input out of the model’s working attention.

A common mistake is using examples that all look alike. The model will then assume every input matches that mold and fail on anything that does not. Diversity in examples teaches the boundary.
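A small sketch of how few-shot examples might be laid out as alternating user and assistant turns, using the role/content message shape from the earlier snippet. The helper name and the example tickets are hypothetical; note that the set deliberately includes one edge case.

```python
def build_few_shot_messages(examples, new_input):
    """Turn (input, expected_output) pairs into alternating chat turns,
    followed by the real input as the final user message."""
    messages = []
    for ticket, label in examples:
        messages.append({"role": "user", "content": ticket})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": new_input})
    return messages


# Diverse examples: two ordinary tickets and one edge case (unreadable input).
examples = [
    ("I was charged twice this month.", "billing"),
    ("The export button returns a 500 error.", "technical"),
    ("asdf ???", "other"),
]
messages = build_few_shot_messages(examples, "How do I change my email address?")
for m in messages:
    print(m["role"], "|", m["content"])
```

The resulting list can be passed as the `messages` argument of a chat-style API call; whether to place examples in the message history or inside the system prompt is a design choice worth testing on your own eval set.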

Chain-of-thought, used sparingly

Asking the model to think step by step before answering helps for genuine reasoning tasks: math, multi-step logic, code with tricky control flow. It does not help (and sometimes hurts) for simple classification or pure recall. For modern models with built-in reasoning, you often do not need to ask for chain-of-thought explicitly; the model already does it internally.

When you do use chain-of-thought, separate the reasoning from the final answer so you can extract just the answer for downstream use. A common pattern is to ask for reasoning inside a tag, then a final answer inside another tag.
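A minimal sketch of the tag pattern, assuming the prompt asked for `<reasoning>` and `<answer>` tags. Both tag names and the sample reply are illustrative choices, not a convention any SDK enforces.

```python
import re


def split_reasoning_and_answer(reply):
    """Return (reasoning, answer) from a reply that uses the assumed tags."""
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", reply, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", reply, re.DOTALL)
    if answer is None:
        # Fail loudly rather than pass free-form text downstream.
        raise ValueError("reply has no <answer> tag")
    reasoning_text = reasoning.group(1).strip() if reasoning else ""
    return reasoning_text, answer.group(1).strip()


# A made-up model reply for demonstration.
reply = """<reasoning>
The trial ends tomorrow and the user is locked out, so this is urgent.
</reasoning>
<answer>high</answer>"""

reasoning, answer = split_reasoning_and_answer(reply)
print(answer)
```

Downstream code only ever sees `answer`; the reasoning can be logged for debugging without affecting parsing.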

Role and persona, carefully

A short role hint (“You are a senior backend engineer”) can nudge tone and vocabulary in helpful directions. Long elaborate personas usually do not improve answers, and they can drag the model into roleplay that ignores instructions. Use roles when style matters; skip them when only correctness matters.

Negative instructions are weak

Telling the model “do not do X” is less reliable than describing the right behavior. “Do not include code” is weaker than “Reply with a single short paragraph in plain prose.” Whenever you find yourself writing a negative instruction, try to flip it into a positive description of what you do want.

Decompose hard tasks

If a single prompt has to do retrieval, reasoning, formatting, and validation, it tends to do all four badly. Split the work. One call extracts, another classifies, another formats. Each step has a tight contract you can test independently. The total token cost might be higher, but quality and debuggability are usually much better.
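The split can be sketched as a small pipeline in which each step takes a `call_model` function, so every step can be tested on its own. All names here are hypothetical, and `fake_model` is a stand-in for a real API call so the sketch runs offline.

```python
CATEGORIES = {"billing", "technical", "account", "other"}


def extract_issue(ticket, call_model):
    """Step 1: reduce the ticket to its core problem."""
    prompt = f"State the customer's core problem in one sentence.\n\nTicket: {ticket}"
    return call_model(prompt).strip()


def classify_issue(issue, call_model):
    """Step 2: map the problem to one category, falling back to 'other'."""
    prompt = (
        "Classify the problem as billing, technical, account, or other. "
        f"Reply with the single word only.\n\nProblem: {issue}"
    )
    label = call_model(prompt).strip().lower()
    return label if label in CATEGORIES else "other"


def format_result(issue, category):
    """Step 3: plain code, no model call needed for formatting."""
    return {"category": category, "summary": issue}


def triage(ticket, call_model):
    issue = extract_issue(ticket, call_model)
    category = classify_issue(issue, call_model)
    return format_result(issue, category)


def fake_model(prompt):
    # Canned replies standing in for a real model (assumption for this demo).
    if prompt.startswith("State the customer's"):
        return "The customer cannot log in."
    return "Technical"


result = triage("I cannot log in and my trial expires tomorrow.", fake_model)
print(result)
```

Because the model is injected as a function, each step can be evaluated with its own small test set, and deterministic work such as formatting stays in ordinary code.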

Manage context like memory

Long context windows are not free. Models tend to weight the beginning and end of the context more heavily, a tendency sometimes called “lost in the middle”. Put the most important instructions and the most important data near the top or bottom. For RAG, prefer fewer high-quality chunks over many noisy ones.
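One possible way to apply this when assembling a retrieval prompt: keep only the highest-scoring chunks, put the instructions first, and restate them after the question. The function name, the `(score, text)` chunk shape, and the cutoff of three chunks are all assumptions for illustration, not a recommended setting.

```python
def build_context_prompt(instructions, scored_chunks, question, max_chunks=3):
    """Place instructions at the top and bottom, with the best chunks between.

    scored_chunks is a list of (relevance_score, text) pairs from a
    retriever (hypothetical shape).
    """
    # Keep fewer, higher-quality chunks instead of everything retrieved.
    top = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)[:max_chunks]
    documents = "\n\n".join(
        f"[Document {i}]\n{text}" for i, (_score, text) in enumerate(top, start=1)
    )
    return "\n\n".join(
        [
            instructions,
            documents,
            f"Question: {question}",
            # Restating the key instruction near the end keeps it out of the middle.
            f"Reminder: {instructions}",
        ]
    )


chunks = [
    (0.2, "Office opening hours are 9 to 5."),
    (0.9, "Refunds are issued within 14 days of purchase."),
    (0.7, "Invoices are sent on the first of each month."),
]
context_prompt = build_context_prompt(
    "Answer using only the documents provided.",
    chunks,
    "How long do refunds take?",
    max_chunks=2,
)
print(context_prompt)
```

Whether repeating the instruction at the end helps is something to confirm on your own eval set rather than assume.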

Evaluate every change

The most important habit is evaluation. Build a small set of representative inputs with expected outputs or grading rubrics. Whenever you change a prompt, run the eval and compare. Without this, you will polish one example into perfection while regressing on three others you never noticed.

cases = [
    {"input": "Cannot log in", "expects": "technical"},
    {"input": "Refund my last invoice", "expects": "billing"},
]

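def classify(text):
    # Placeholder: call your prompt, parse the reply, and return the category.
    # This stub always answers "technical", so it scores 1/2 on the cases above.
    return "technical"

# Count the cases where the predicted category matches the expected one.
correct = sum(1 for c in cases if classify(c["input"]) == c["expects"])
print(f"accuracy: {correct}/{len(cases)}")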
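To compare two prompt versions rather than score just one, a slightly larger harness can return the failures alongside the accuracy. This is a sketch under stated assumptions: `classify_v1` and `classify_v2` are keyword stubs standing in for two real prompt variants, and the three cases are made up.

```python
def run_eval(classify_fn, cases):
    """Run every case and return (accuracy, list of failures)."""
    if not cases:
        raise ValueError("eval set is empty")
    failures = []
    for case in cases:
        got = classify_fn(case["input"])
        if got != case["expects"]:
            failures.append(
                {"input": case["input"], "expected": case["expects"], "got": got}
            )
    accuracy = (len(cases) - len(failures)) / len(cases)
    return accuracy, failures


eval_cases = [
    {"input": "Cannot log in", "expects": "technical"},
    {"input": "Refund my last invoice", "expects": "billing"},
    {"input": "Change my email address", "expects": "account"},
]


def classify_v1(text):
    # Stand-in for prompt version 1: always answers "technical".
    return "technical"


def classify_v2(text):
    # Stand-in for prompt version 2: recognizes billing keywords only.
    lowered = text.lower()
    if "refund" in lowered or "invoice" in lowered:
        return "billing"
    return "technical"


for name, fn in [("v1", classify_v1), ("v2", classify_v2)]:
    accuracy, failures = run_eval(fn, eval_cases)
    print(f"{name}: accuracy {accuracy:.2f}")
    for f in failures:
        print(f"  FAIL {f['input']!r}: expected {f['expected']}, got {f['got']}")
```

Printing the failing inputs, not just the score, is what makes the next prompt change targeted: you see which case regressed and which one was fixed.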

A workflow that works

Write a tight system prompt with role, task, constraints, and format. Add two or three diverse few-shot examples. Use structured output. Build a 20-case evaluation set. Iterate by reading the failures and adjusting the prompt or examples that caused them. When you cannot make a single prompt reliable, decompose. When you cannot make it cheap enough, consider fine-tuning. Everything else is style.