Prompt Engineering: Chain of Thought
Use chain-of-thought prompting to unlock multi-step reasoning, with zero-shot, few-shot, and structured variants for production use.
What you'll learn
- What chain-of-thought actually changes about model output
- Zero-shot CoT with a single phrase
- Few-shot CoT with worked examples
- Structured CoT that is friendlier to production parsing
- When CoT hurts more than it helps
Prerequisites
- Familiarity with how APIs work
What and Why
Chain-of-thought (CoT) prompting asks the model to write out intermediate reasoning before producing a final answer. Models that go directly from question to answer often skip steps and arrive at wrong conclusions on multi-step problems. Writing the steps out gives the model more “thinking budget” and improves accuracy on math, logic, planning, and complex extraction tasks.
The technique is one of the highest-leverage prompt patterns you can learn because it is cheap to try and frequently delivers large gains on tasks that matter.
Mental Model
When a model writes intermediate steps, each step conditions the next. The model effectively turns one hard inference into many easy ones.
    direct:  Q -> A                               (one big leap, often wrong)
    CoT:     Q -> step1 -> step2 -> step3 -> A
                    |        |        |
                    v        v        v
             each step is conditioned on the prior steps,
             so the model can carry intermediate state.
This is also why CoT works better on larger models. Small models cannot reliably maintain the intermediate state, so writing more steps gives them more rope to hang themselves.
Hands-on Example
The simplest zero-shot CoT is a single line.
prompt = """Q: A bakery sells muffins for $2 each and cookies for $1.
On Monday they sold 30 muffins and 80 cookies. On Tuesday they sold half as many
muffins and double the cookies. What were the total earnings over the two days?
Let's think step by step.
A:"""
That one phrase (“Let’s think step by step.”) often improves accuracy on arithmetic and multi-step reasoning tasks, particularly with larger models.
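The bakery problem has a checkable answer, which makes it a handy spot check for a model's reply. The sketch below wraps any question in the same zero-shot format and computes the ground truth by hand. The helper names `with_zero_shot_cot` and `bakery_total` are hypothetical, introduced here only for illustration.

```python
COT_TRIGGER = "Let's think step by step."


def with_zero_shot_cot(question: str) -> str:
    """Wrap a question in the zero-shot CoT layout: Q, trigger phrase, A."""
    return f"Q: {question.strip()}\n{COT_TRIGGER}\nA:"


def bakery_total() -> int:
    """Ground-truth earnings for the bakery problem, in dollars."""
    muffin_price, cookie_price = 2, 1
    monday = 30 * muffin_price + 80 * cookie_price                # 60 + 80 = 140
    tuesday = (30 // 2) * muffin_price + (80 * 2) * cookie_price  # 30 + 160 = 190
    return monday + tuesday


print(with_zero_shot_cot("What is 17 * 3?"))
print(bakery_total())  # 330
```

Comparing the model's final number against `bakery_total()` is a quick way to see whether the trigger phrase helps on the model you are using.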
Few-shot CoT shows the model the reasoning style you want.
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans, each with 3 balls. How many now?
A: Roger started with 5. He bought 2 cans of 3 = 6. Total = 5 + 6 = 11.
Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many now?
A: Started with 23. Used 20 -> 3 left. Bought 6 -> 3 + 6 = 9.
Q: A bakery sells muffins for $2 and cookies for $1. Monday: 30 muffins, 80 cookies. Tuesday: half the muffins, double the cookies. Total earnings?
A:"""
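When the worked examples live in data rather than in a hand-typed string, a small builder keeps the format consistent. This is a minimal sketch; `build_few_shot_cot` is a hypothetical helper, and the blank line between examples is an assumption about layout rather than a requirement.

```python
from typing import List, Tuple


def build_few_shot_cot(examples: List[Tuple[str, str]], question: str) -> str:
    """Join worked (question, reasoning) pairs, then append the open question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")  # left open for the model to complete
    return "\n\n".join(blocks)


examples = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans, each with 3 balls. How many now?",
        "Roger started with 5. He bought 2 cans of 3 = 6. Total = 5 + 6 = 11.",
    ),
    (
        "The cafeteria had 23 apples. They used 20 and bought 6 more. How many now?",
        "Started with 23. Used 20 -> 3 left. Bought 6 -> 3 + 6 = 9.",
    ),
]

prompt = build_few_shot_cot(
    examples,
    "A bakery sells muffins for $2 and cookies for $1. Monday: 30 muffins, "
    "80 cookies. Tuesday: half the muffins, double the cookies. Total earnings?",
)
print(prompt)
```

Keeping examples as tuples also makes it easy to swap them per task without touching the prompt layout.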
For production, a structured CoT keeps reasoning separate from the parseable answer.
prompt = """Solve the problem. Output JSON with:
- "reasoning": step-by-step working
- "answer": the final numeric answer only
Problem: ..."""
Now your downstream parser can extract "answer" cleanly while you keep "reasoning" for logging and debugging.
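On the parsing side, a sketch of what the downstream code might look like, assuming the model returns a bare JSON object with the two keys requested above. `parse_structured_cot` is a hypothetical helper, and the sample reply is hand-written, not real model output.

```python
import json
from typing import Tuple


def parse_structured_cot(raw: str) -> Tuple[float, str]:
    """Return (answer, reasoning) from a JSON reply; raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or "answer" not in data:
        raise ValueError("reply is missing the 'answer' key")
    try:
        answer = float(data["answer"])  # accepts 330, 330.0, or "330"
    except (TypeError, ValueError) as exc:
        raise ValueError(f"answer is not numeric: {data['answer']!r}") from exc
    reasoning = str(data.get("reasoning", ""))
    return answer, reasoning


# Hand-written sample reply standing in for model output.
sample = '{"reasoning": "Monday: 60 + 80 = 140. Tuesday: 30 + 160 = 190.", "answer": 330}'
answer, reasoning = parse_structured_cot(sample)
print(answer)     # 330.0
print(reasoning)  # goes to logs, not to the end user
```

Raising on malformed replies, rather than guessing, lets the caller decide whether to retry or fall back.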
Trade-offs
CoT is not free, and it is not always helpful.
- Latency and cost grow. The model generates more tokens, so each call takes longer and costs more.
- Sometimes hurts on simple tasks. For pure pattern-matching or classification, asking for reasoning can introduce noise. A confident one-token answer is fine.
- Reasoning can be confidently wrong. A neat-looking chain of thought is not a guarantee of a correct answer. It just biases the model toward better answers on hard problems.
- Modern “reasoning models” do CoT internally. Models with built-in chain-of-thought handle most of this for you; you may not need to instruct it explicitly.
The trade is straightforward: spend more tokens, get more accuracy on hard tasks. On easy tasks, you spend more tokens for no gain.
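To make the token trade concrete, here is a back-of-the-envelope calculation. Every number in it is an assumption chosen for illustration (call volume, tokens per reply, and price per million output tokens); substitute your own provider's pricing and measured token counts.

```python
def output_cost(calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Total output-token cost in USD for a batch of calls."""
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000


CALLS = 100_000         # assumed volume
PRICE = 10.0            # assumed USD per million output tokens
DIRECT_TOKENS = 5       # assumed length of a direct answer
COT_TOKENS = 150        # assumed length of reasoning plus answer

direct = output_cost(CALLS, DIRECT_TOKENS, PRICE)
cot = output_cost(CALLS, COT_TOKENS, PRICE)
print(f"direct: ${direct:.2f}, CoT: ${cot:.2f}, ratio: {cot / direct:.0f}x")
```

Under these assumed numbers the CoT variant costs 30 times more in output tokens, which is only worth paying if the accuracy gain on your task is real.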
Variants Worth Knowing
- Self-consistency: sample several CoT chains at high temperature, pick the most common final answer. Strong on math but expensive.
- Tree of thoughts: explore multiple reasoning branches and prune. Powerful for planning, complex to implement.
- Plan-then-solve: ask for a plan first, then execute the plan. Useful for multi-step tool use.
- Scratchpad: a dedicated section in the prompt where the model can write working notes before answering.
For most production cases, structured zero-shot or few-shot CoT captures most of the value of the fancier variants at a fraction of the complexity.
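Of these variants, self-consistency is the simplest to sketch, because the aggregation step is a majority vote over final answers. The example below assumes you have already sampled several chains and extracted each final answer as a string. `majority_answer` is a hypothetical helper, and the sampled answers are hand-written stand-ins.

```python
from collections import Counter
from typing import List, Tuple


def majority_answer(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of chains that agreed on it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    # On a tie, Counter.most_common keeps the answer that appeared first.
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Hand-written final answers standing in for five sampled chains.
sampled = ["330", "330", "310", "330 ", "190"]
winner, agreement = majority_answer(sampled)
print(winner, agreement)  # 330 0.6
```

The agreement share doubles as a rough confidence signal: a low value suggests the chains disagree and the answer deserves a second look.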
Practical Tips
- Add CoT first, optimize later. It is one phrase. Try it before you change anything else.
- Pin reasoning into a separate field. JSON with "reasoning" and "answer" keys keeps your parser clean.
- Use few-shot CoT when style matters. If the reasoning needs to follow a specific format (e.g. step numbers, intermediate variable names), show examples.
- Strip reasoning before returning to users. Long internal thoughts are not usually what end users want to read.
- Log reasoning for debugging. When an answer is wrong, the chain shows you where the model went off the rails.
- Skip CoT for classification. A direct label is cleaner and faster.
- Lower temperature for single-chain CoT. Reasoning is brittle to randomness, so temperature 0 or 0.2 usually beats 0.7. Self-consistency is the exception, because it relies on varied samples.
- Watch for token cost. A CoT that triples your output tokens triples your output cost.
A bonus pattern for very long problems: ask the model to break the problem into sub-problems, solve each, then assemble. This works well for code generation and long-form planning.
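The break-down, solve, assemble loop can be sketched as three kinds of model call. The model call itself is passed in as a plain function, so the sketch makes no assumption about any particular client library. `solve_by_decomposition` and `fake_ask` are hypothetical names, the prompt wording is illustrative, and `fake_ask` returns canned text in place of a real model.

```python
from typing import Callable, List


def solve_by_decomposition(problem: str, ask: Callable[[str], str]) -> str:
    """Ask for sub-problems, solve each in order, then combine the results."""
    plan = ask(f"List the sub-problems needed to solve this, one per line:\n{problem}")
    # Drop blank lines and leading bullet characters from the plan.
    sub_problems = [ln.strip("-• ").strip() for ln in plan.splitlines() if ln.strip()]

    partials: List[str] = []
    for sub in sub_problems:
        so_far = "\n".join(partials)
        partials.append(
            ask(f"Problem: {problem}\nSolved so far:\n{so_far}\nNow solve: {sub}")
        )

    joined = "\n".join(partials)
    return ask(
        f"Problem: {problem}\nPartial results:\n{joined}\n"
        "Combine these into a final answer."
    )


def fake_ask(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text."""
    if prompt.startswith("List the sub-problems"):
        return "- Monday earnings\n- Tuesday earnings"
    if "Combine these" in prompt:
        return "330"
    return "partial result"


print(solve_by_decomposition("bakery problem", fake_ask))  # 330
```

Each sub-problem prompt carries the earlier partial results, so later steps can build on earlier ones, at the cost of one model call per sub-problem plus two more.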
Wrap-up
Chain-of-thought is the cheapest accuracy improvement in the prompt-engineering toolbox. One line of instruction turns single-shot guessing into deliberate reasoning. Use structured output to keep production parsing clean, sample several chains when stakes are high, and skip it on tasks where reasoning would only add noise. For anything multi-step, default to CoT first and ship the gain.