Prompt Engineering: Tree of Thought for Deliberate Reasoning
Explore Tree of Thought prompting, which lets LLMs branch, evaluate, and backtrack through reasoning steps to solve problems chain-of-thought cannot.
What you'll learn
- How Tree of Thought differs from chain-of-thought
- The four stages: decompose, generate, evaluate, search
- A hands-on planning puzzle walk-through
- When the technique earns its complexity
- Practical tips for keeping search costs reasonable
Prerequisites
- Familiarity with LLMs or Python
What and Why
Tree of Thought, or ToT, is a prompting framework that treats reasoning as a search problem. Instead of forcing the model to commit to a single linear chain of thought, you let it expand multiple partial thoughts at each step, score them, and continue exploring the most promising branches. The model can backtrack when a branch dead-ends.
This matters because chain-of-thought is fundamentally greedy. Once the model writes a sentence, it conditions on it. If step three was wrong, the rest of the chain inherits that error. ToT fixes this by separating “generate candidates” from “decide which to keep”, giving the LLM something closer to the deliberate, exploratory reasoning humans use for planning, games, and proofs.
Mental Model
Picture chess. You do not pick the first move that comes to mind. You consider three or four candidates, briefly imagine the opponent’s response to each, and only then commit. ToT scripts that loop. The LLM proposes K next-thoughts, a scorer (often the same LLM with a different prompt) rates each on a scale, and a search algorithm such as breadth-first or beam search keeps the top B branches alive.
A “thought” can be a sentence, an equation, a sub-goal, or a candidate answer. The unit depends on the task.
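The loop described above can be sketched as a small beam search. This is a minimal illustration, not a reference implementation: `tot_beam_search`, `propose`, `score`, and `is_goal` are hypothetical names, and in a real system `propose` and `score` would wrap LLM calls. Here they are replaced by toy functions so the sketch runs on its own.

```python
from typing import Callable, List, Optional, TypeVar

S = TypeVar("S")  # a partial solution ("thought" state); its shape is task-specific


def tot_beam_search(
    root: S,
    propose: Callable[[S], List[S]],  # hypothetical: would wrap the generator LLM call
    score: Callable[[S], float],      # hypothetical: would wrap the evaluator LLM call
    is_goal: Callable[[S], bool],
    beam_width: int = 3,              # B: branches kept alive per level
    max_depth: int = 3,
) -> Optional[S]:
    """Breadth-first search that keeps only the best `beam_width` states per level."""
    frontier = [root]
    for _ in range(max_depth):
        # Generate: expand every surviving state into its candidate next thoughts.
        candidates = [child for state in frontier for child in propose(state)]
        for child in candidates:
            if is_goal(child):
                return child
        # Evaluate and search: rank the candidates and keep the top B.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
        if not frontier:
            return None
    return None


# Toy stand-ins (assumptions for illustration): reach 10 from 1 using +1 or *2.
result = tot_beam_search(
    root=1,
    propose=lambda n: [n + 1, n * 2],
    score=lambda n: -abs(10 - n),
    is_goal=lambda n: n == 10,
    beam_width=2,
    max_depth=5,
)
print(result)  # 10
```

Because several branches stay in the frontier at once, a branch that looked good early can be dropped later in favor of a sibling, which is how this style of search recovers from a bad step.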
Hands-on Example
Take the Game of 24: given four numbers, combine them with +, -, *, / to reach 24. With the numbers 4, 9, 10, and 13, a chain-of-thought model often guesses wrongly because there are many dead ends, even though a solution exists: (10 - 4) * (13 - 9) = 24.
ToT structure:
- Decompose into three steps, each combining two numbers.
- At each step generate five candidate operations.
- Prompt the model to rate each candidate as “sure / maybe / impossible” of reaching 24.
- Keep the top three branches and continue.
                [4,9,10,13]
             /       |        \
      13-9=4      10-4=6      4+9=13
     [4,10,4]    [6,9,13]   [13,10,13]
       maybe       sure     impossible
                     |
               ┌─────┴─────┐
            13-9=4       9-6=3
             [4,6]       [3,13]
             sure      impossible
               |
            4*6=24
          ─► 24 found
The “impossible” branch is pruned immediately. The “sure” branches are expanded first, so the model spends compute where it can pay off, not on doomed prefixes.
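The walk-through above can be reproduced in code. The sketch below is an illustration under loud assumptions: the generator is replaced by exhaustive enumeration of pairwise operations, and the evaluator is replaced by a brute-force look-ahead that only ever answers “sure” or “impossible”. A real ToT setup would make LLM calls at both points and would also see “maybe” verdicts. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from typing import List, Optional, Tuple

Numbers = Tuple[Fraction, ...]
State = Tuple[Numbers, Tuple[str, ...]]  # (remaining numbers, steps taken so far)


def propose(state: State) -> List[State]:
    """Stand-in for the generator: combine every pair with every operation."""
    nums, steps = state
    children = []
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        options = [(a + b, f"{a}+{b}"), (a * b, f"{a}*{b}"),
                   (a - b, f"{a}-{b}"), (b - a, f"{b}-{a}")]
        if b != 0:
            options.append((a / b, f"{a}/{b}"))
        if a != 0:
            options.append((b / a, f"{b}/{a}"))
        for value, expr in options:
            remaining = tuple(sorted(rest + [value]))
            children.append((remaining, steps + (f"{expr}={value}",)))
    return children


def reachable(nums: Numbers, target: int = 24) -> bool:
    """Exhaustive look-ahead: can these numbers still be combined into target?"""
    if len(nums) == 1:
        return nums[0] == target
    return any(reachable(child, target) for child, _ in propose((nums, ())))


def evaluate(state: State) -> str:
    """Stand-in for the evaluator prompt (assumption: an LLM would answer here)."""
    return "sure" if reachable(state[0]) else "impossible"


def solve(numbers: List[int], beam_width: int = 3) -> Optional[Tuple[str, ...]]:
    frontier: List[State] = [(tuple(Fraction(n) for n in sorted(numbers)), ())]
    for _ in range(len(numbers) - 1):  # three steps for four numbers
        children = [c for s in frontier for c in propose(s)]
        # Prune "impossible" branches, then keep the first `beam_width` survivors.
        frontier = [c for c in children if evaluate(c) == "sure"][:beam_width]
    return frontier[0][1] if frontier else None


print(solve([4, 9, 10, 13]))
```

Exact `Fraction` arithmetic avoids floating-point surprises with division. The brute-force evaluator is only feasible because this puzzle is tiny; the point of an LLM evaluator is to approximate this judgment where exhaustive look-ahead is not affordable.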
Trade-offs
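ToT is powerful but expensive. Each node in the tree costs at least one LLM call to generate and another to evaluate. With 5 nodes per level and a depth of 3, that is 15 nodes and about 30 calls per problem, and the count grows further if every kept state proposes several candidates that each need scoring. For real-time chat this latency is usually unacceptable.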
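The figure of 30 calls corresponds to 5 nodes on each of 3 levels at two calls per node. A rough budget calculator makes it easy to compare that with the hands-on configuration (five candidates per state, keep three). The accounting in `beam_search_calls` is an assumption: one generation call per kept state returns all of its candidates, and each candidate gets its own evaluation call. Both function names are hypothetical.

```python
from typing import Tuple


def per_node_calls(nodes_per_level: int, depth: int, calls_per_node: int = 2) -> int:
    """Simple accounting: every node costs one generate and one evaluate call."""
    return nodes_per_level * depth * calls_per_node


def beam_search_calls(k: int, beam_width: int, depth: int) -> Tuple[int, int]:
    """Return (generate_calls, evaluate_calls) for a beam search.

    Assumes one generation call per kept state yields k candidates,
    and one evaluation call is made per candidate.
    """
    frontier = 1  # the root state
    generate_calls = 0
    evaluate_calls = 0
    for _ in range(depth):
        candidates = frontier * k
        generate_calls += frontier
        evaluate_calls += candidates
        frontier = min(beam_width, candidates)
    return generate_calls, evaluate_calls


print(per_node_calls(5, 3))                     # 30
print(beam_search_calls(k=5, beam_width=3, depth=3))  # (7, 35)
```

Under these assumptions the evaluator runs far more often than the generator, which is worth keeping in mind when choosing models for each role.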
It also requires orchestration code outside the prompt itself. You cannot do real ToT in a single prompt; you need a loop, a scorer, and a search policy. Frameworks help, but you own the complexity.
Finally, ToT shines on problems with clear intermediate states and verifiable progress: puzzles, planning, code synthesis, theorem-style steps. For open-ended writing, the evaluator cannot reliably tell which branch is better, and the search collapses to expensive sampling.
Practical Tips
- Pick the right unit of thought. Too fine and the tree explodes; too coarse and you lose the benefit of branching.
- Use a small, cheap model as the evaluator and a stronger model as the generator. The evaluator runs more often.
- Cap depth and breadth aggressively. Start with B=3, D=3, and grow only when needed.
- Always include a “give up” signal in the evaluator so dead branches die fast.
- Log the tree. When ToT fails, the tree explains why and tells you whether to tune the generator or the evaluator.
- Combine with self-consistency at the leaves for an extra accuracy bump.
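As one way to apply the “give up” and capping tips, the sketch below turns free-text evaluator replies into scores and drops dead branches before ranking. The verdict words come from the hands-on example; the numeric scores, the choice to read the last verdict word in a reply, and the fallback to “maybe” for unparseable replies are all assumptions, and `parse_verdict` and `prune` are hypothetical helpers.

```python
import re
from typing import List, Tuple

# Assumed mapping from verdict word to score; "impossible" is the give-up signal.
VERDICT_SCORES = {"sure": 1.0, "maybe": 0.5, "impossible": 0.0}


def parse_verdict(reply: str) -> str:
    """Take the last verdict word in the reply, since models often reason first.

    Replies with no recognizable verdict fall back to "maybe" (a design choice).
    Note that phrases like "not sure" would be misread, so the evaluator prompt
    should ask for a single verdict word on the final line.
    """
    found = re.findall(r"\b(sure|maybe|impossible)\b", reply.lower())
    return found[-1] if found else "maybe"


def prune(candidates: List[Tuple[str, str]], beam_width: int = 3) -> List[str]:
    """Drop "impossible" branches, then keep the best `beam_width` of the rest.

    `candidates` is a list of (thought, evaluator_reply) pairs.
    """
    scored = [(VERDICT_SCORES[parse_verdict(reply)], thought)
              for thought, reply in candidates]
    alive = [(s, t) for s, t in scored if s > 0.0]
    alive.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep order
    return [thought for _, thought in alive[:beam_width]]


kept = prune([
    ("a", "impossible"),
    ("b", "Looks plausible: maybe"),
    ("c", "sure"),
    ("d", "maybe"),
    ("e", "sure"),
])
print(kept)  # ['c', 'e', 'b']
```

Logging the `(thought, reply, verdict)` triples at this point also gives you the tree record that the logging tip recommends.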
Wrap-up
Tree of Thought turns the LLM into a deliberate problem solver instead of a one-shot guesser. It costs more calls and more glue code, but for puzzles, planning, and multi-step reasoning it unlocks problems chain-of-thought simply cannot crack. Reach for it when the task has clear states and clear progress, and you can afford the search.