Prompt Engineering: System Prompts Guide
Write system prompts that steer model behavior reliably: role, format, constraints, refusals, and evaluation patterns that actually work.
What you'll learn
- ✓What the system prompt does that user messages cannot
- ✓The five sections of a reliable system prompt
- ✓How to specify output format strictly
- ✓Handling refusals and out-of-scope queries
- ✓Versioning and evaluating system prompts
Prerequisites
- •Familiar with how APIs work
What and Why
The system prompt is the first message in a chat completion and the place to set rules that should apply to every turn. The model treats it with higher priority than user messages. A well-written system prompt eliminates whole classes of bugs without any code: tone drift, format breakage, off-topic answers, and accidental policy violations.
A bad system prompt fails quietly. It looks polite (“you are a helpful assistant”) but tells the model nothing useful, so the model defaults to its training-time average. A good system prompt is specific, prioritized, and short enough to read in one sitting.
Mental Model
Think of the system prompt as the model’s job description. It tells the model who it is, what it does, what it must never do, and what its output must look like.
[ROLE] who the assistant is and what it specializes in
[INPUTS] what the user will provide
[STEPS] how to think about each request
[OUTPUT] exact format and shape required
[CONSTRAINTS] what to refuse, what to escalate, what not to assume Sections do not need labels in production. The point is that you can mentally check each section is present and unambiguous.
Hands-on Example
A system prompt for a customer support triage assistant:
You are a triage assistant for Acme's customer support team.
You receive raw customer messages and classify them so they can be routed
to the right queue.
For each message:
1. Read the message in full.
2. Identify the primary issue.
3. Decide if there are any sensitive flags (legal threats, harm, payment dispute over $500).
4. Choose exactly one category from the allowed list.
Return JSON only, in this exact shape:
{
"category": "billing" | "shipping" | "technical" | "account" | "other",
"urgency": "low" | "normal" | "high",
"sensitive_flag": true | false,
"summary": "<one sentence, <= 25 words>"
}
Rules:
- Never include any text outside the JSON object.
- If the message is unclear, choose "other" and summarize what you saw.
- Do not make up customer names or order numbers.
- If the message contains harmful intent toward a person, set sensitive_flag to true.
Compared to “You are a helpful classifier,” this prompt removes ambiguity from every step. The model knows the categories, the format, the failure mode for unclear inputs, and the safety rule.
Trade-offs
System prompts are not a magic wand.
- Long system prompts cost tokens on every call. Each instruction must earn its keep.
- Conflicting instructions confuse models. “Be concise” and “explain in detail” cannot both win. Choose.
- The model can still be talked out of system rules. Prompt injection from user input is a real risk for high-stakes flows.
- Behavior varies across model versions. A prompt finely tuned to GPT-4o may behave differently on a smaller or newer model.
Long does not equal effective. Many production prompts hit the sweet spot at 300-600 tokens. Beyond that, you are often re-stating instructions the model already follows or contradicting yourself.
Patterns That Work
A handful of reusable patterns:
- Hard format with a schema. “Return JSON only, in this exact shape: …” is more reliable than “respond in JSON.”
- Refusal rules with examples. “If asked to do X, respond with exactly: ’…’” is enforceable; “refuse unsafe requests” is not.
- Step-by-step instructions. Numbered steps reduce skipped reasoning, especially on smaller models.
- Explicit failure modes. Tell the model what to do when inputs are bad. Otherwise it invents.
- Constraints near the top. Important rules should appear early. Models pay more attention to the start.
- Single source of truth for tone. “Respond in plain English. No bullet points unless asked.” Pick once, stick with it.
For sensitive use cases, restate the most critical safety rules near the bottom of the system prompt, just before user content begins. Models tend to weight the most recent instruction heavily.
Practical Tips
- Version your system prompts. Treat them like code: store in source control, tag versions, and diff between updates.
- Pair every prompt with an eval set. Without it, you have no idea if the new version is better or worse.
- Cache the system prompt. Providers offer prompt caching that discounts repeated prefixes. Keep your system prompt stable across calls to maximize hits.
- Avoid long lists of negatives. “Do not do X, Y, Z” works less well than “Do A.” Models follow positive instructions more reliably.
- Test for prompt injection. Include a user message that says “ignore previous instructions” in your eval. The system prompt should hold.
- Trim ruthlessly. After a prompt works, delete any line you cannot justify with a failing case.
- Document why each rule exists. Comments in the source where the prompt is stored save future-you from re-deriving the same constraint.
A useful exercise is to read your system prompt aloud as if you were the new hire receiving it. If it confuses a human, it will confuse the model.
Wrap-up
The system prompt is the cheapest, highest-leverage piece of behavior tuning you have. Keep it specific, prioritized, and short. Pin output format with a schema, spell out refusal rules, and treat it like code with versioning and an eval. Most “the model is unreliable” complaints turn out to be system prompts that never told the model what reliable looked like.
Related articles
- Prompt Engineering Prompt Engineering Anti-Patterns: Mistakes That Quietly Hurt Quality
A field guide to the most common prompt engineering anti-patterns, why they degrade LLM output quality, and concrete refactors that fix each one.
- Prompt Engineering Prompt Engineering: Chain of Thought
Use chain-of-thought prompting to unlock multi-step reasoning, with zero-shot, few-shot, and structured variants for production use.
- Prompt Engineering Prompt Engineering: Evaluation Loops
How to build evaluation loops for prompts so you can iterate with evidence instead of vibes. Covers datasets, graders, regressions, and how to make eval cheap enough to run often.
- Prompt Engineering Prompt Engineering: Few-shot vs Zero-shot
Decide between zero-shot and few-shot prompting by weighing example quality, cost, and how strictly you need to control output format.