Prompt Engineering: Tree of Thought for Deliberate Reasoning

Intermediate 8 min read

What you'll learn

✓ How Tree of Thought differs from chain-of-thought
✓ The four stages: decompose, generate, evaluate, search
✓ A hands-on planning puzzle walk-through
✓ When the technique earns its complexity
✓ Practical tips for keeping search costs reasonable

Prerequisites

• Familiar with LLMs or Python

What and Why

Tree of Thought, or ToT, is a prompting framework that treats reasoning as a search problem. Instead of forcing the model to commit to a single linear chain of thought, you let it expand multiple partial thoughts at each step, score them, and continue exploring the most promising branches. The model can backtrack when a branch dead-ends.

This matters because chain-of-thought is fundamentally greedy. Once the model writes a sentence, it conditions on it. If step three was wrong, the rest of the chain inherits that error. ToT fixes this by separating “generate candidates” from “decide which to keep”, giving the LLM something closer to the deliberate, exploratory reasoning humans use for planning, games, and proofs.

Mental Model

Picture chess. You do not pick the first move that comes to mind. You consider three or four candidates, briefly imagine the opponent’s response to each, and only then commit. ToT scripts that loop. The LLM proposes K next-thoughts, a scorer (often the same LLM with a different prompt) rates each on a scale, and a search algorithm such as breadth-first or beam search keeps the top B branches alive.

A “thought” can be a sentence, an equation, a sub-goal, or a candidate answer. The unit depends on the task.
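The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `propose`, `score`, and `is_solution` are hypothetical callables standing in for the generator prompt, the evaluator prompt, and a task-specific check, and the toy task (spelling a target string) only exists to make the example runnable without an LLM.

```python
from typing import Callable, List, Optional


def tree_of_thought(
    root: str,
    propose: Callable[[str], List[str]],  # stand-in for the generator LLM call
    score: Callable[[str], float],        # stand-in for the evaluator LLM call
    is_solution: Callable[[str], bool],   # task-specific success check
    beam_width: int = 3,                  # B: branches kept alive per level
    max_depth: int = 3,
) -> Optional[str]:
    """Beam search over partial thoughts; returns a solution or None."""
    frontier = [root]
    for _ in range(max_depth):
        # Expand every surviving branch into its candidate next thoughts.
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if is_solution(candidate):
                return candidate
        if not candidates:
            return None
        # Keep only the top B candidates; the rest are pruned.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return None


# Toy task (assumption for illustration): spell TARGET one letter at a time.
TARGET = "tot"


def propose(state: str) -> List[str]:
    return [state + ch for ch in "ot"]


def score(state: str) -> float:
    return sum(a == b for a, b in zip(state, TARGET))


def is_solution(state: str) -> bool:
    return state == TARGET


result = tree_of_thought("", propose, score, is_solution, beam_width=2)
print(result)
```

Because the beam keeps more than one branch, a candidate that looked second-best at one level can still win at the next, which is the behavior a single linear chain cannot reproduce.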

Hands-on Example

Take the Game of 24: given four numbers, combine them with +, -, *, / to reach 24, using each number exactly once. With the numbers 4, 9, 10, 13, chain-of-thought prompting often fails because there are many dead ends and an early wrong combination cannot be undone.

ToT structure:

Decompose into three steps, each combining two numbers.
At each step generate five candidate operations.
Prompt the model to rate each candidate’s chance of reaching 24 as “sure”, “maybe”, or “impossible”.
Keep the top three branches and continue.


            [4,9,10,13]
            /    |     \
       13-9=4  10-4=6  4+9=13
       [4,10,4] [6,9,13] [13,10,13]
        maybe    sure    impossible
                  |
            ┌─────┴─────┐
          13-9=4      9-6=3
           [4,6]      [3,13]
           sure     impossible
            |
          4*6=24
            |
      ─► 24 found

Tree of Thought explores branches and prunes low-value paths before committing to a final answer.

Branches rated “impossible” are pruned immediately, and the “sure” branches are expanded first. The model spends compute where it can pay off, not on doomed prefixes.
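The walk-through above can be sketched as runnable Python, with one loud assumption: both LLM roles are replaced by deterministic stand-ins so the example runs offline. `propose` enumerates every pairwise combination instead of sampling five candidates, and `rate` uses exact look-ahead, so it only ever answers “sure” or “impossible”. A real LLM evaluator would be heuristic and would also return “maybe”. All function names here are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

TARGET = 24
RANK = {"sure": 2, "maybe": 1, "impossible": 0}


def propose(nums):
    """Generator stand-in: every state reachable by combining two numbers."""
    children = set()
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        results = {a + b, a - b, b - a, a * b}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        children.update(tuple(sorted(rest + [r])) for r in results)
    return sorted(children)  # sorted for deterministic ordering


def solvable(nums):
    """Exhaustive check used by the stand-in evaluator."""
    if len(nums) == 1:
        return nums[0] == TARGET
    return any(solvable(child) for child in propose(nums))


def rate(nums):
    """Evaluator stand-in: exact look-ahead, so it never answers 'maybe'."""
    return "sure" if solvable(nums) else "impossible"


def solve(numbers, beam_width=3):
    """Beam search over states; returns the path of states or None."""
    start = tuple(sorted(Fraction(n) for n in numbers))
    frontier = [[start]]  # each entry is a path of states
    while frontier and len(frontier[0][-1]) > 1:
        scored = []
        for path in frontier:
            for child in propose(path[-1]):
                label = rate(child)
                if label != "impossible":  # prune dead branches immediately
                    scored.append((RANK[label], path + [child]))
        scored.sort(key=lambda item: item[0], reverse=True)
        frontier = [path for _, path in scored[:beam_width]]
    return frontier[0] if frontier else None


if __name__ == "__main__":
    for state in solve([4, 9, 10, 13]):
        print([str(n) for n in state])
```

Exact fractions avoid floating-point surprises with division. Swapping `propose` and `rate` for prompt-backed functions leaves `solve` unchanged, which is the point of separating generation from evaluation.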

Trade-offs

ToT is powerful but expensive. Each node in the tree costs at least one LLM call to generate and another to evaluate. Keeping 5 nodes per level over a depth of 3 already means 15 nodes and roughly 30 calls per problem. For real-time chat this latency is usually unacceptable.
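The arithmetic behind that estimate is easy to check. The sketch below assumes “breadth” means the number of nodes kept per level and that each node costs two calls (one to generate, one to evaluate). It also shows, for contrast, how many nodes an unpruned tree with the same branching factor would contain. Both helper names are hypothetical.

```python
def estimate_llm_calls(nodes_per_level: int, depth: int, calls_per_node: int = 2) -> int:
    """Calls needed when a fixed number of nodes survives at every level."""
    return nodes_per_level * depth * calls_per_node


def unpruned_nodes(branching: int, depth: int) -> int:
    """Nodes in a full tree where every node is expanded and nothing is pruned."""
    return sum(branching ** level for level in range(1, depth + 1))


print(estimate_llm_calls(5, 3))      # 5 nodes x 3 levels x 2 calls = 30
print(unpruned_nodes(5, 3))          # 5 + 25 + 125 = 155 nodes
print(unpruned_nodes(5, 3) * 2)      # 310 calls without pruning
```

The gap between 30 and 310 is why pruning and a tight beam matter more than any single prompt tweak.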

It also requires orchestration code outside the prompt itself. You cannot do real ToT in a single prompt; you need a loop, a scorer, and a search policy. Frameworks help, but you own the complexity.

Finally, ToT shines on problems with clear intermediate states and verifiable progress: puzzles, planning, code synthesis, theorem-style steps. For open-ended writing, the evaluator cannot reliably tell which branch is better, and the search collapses to expensive sampling.

Practical Tips

Pick the right unit of thought. Too fine and the tree explodes; too coarse and you lose the benefit of branching.
Use a small, cheap model as the evaluator and a stronger model as the generator. The evaluator runs more often.
Cap depth and breadth aggressively. Start with a beam width of B=3 and a depth of D=3, and grow only when needed.
Always include a “give up” signal in the evaluator so dead branches die fast.
Log the tree. When ToT fails, the tree explains why and tells you whether to tune the generator or the evaluator.
Combine with self-consistency at the leaves for an extra accuracy bump.
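For the logging tip, one lightweight option is to record each thought and its rating in a small tree structure and print it as indented text. This is only a sketch: the `Node` class and `render` helper are hypothetical names, and the ratings are hard-coded here, whereas in practice they would come from the evaluator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    thought: str
    rating: str = "unrated"
    children: List["Node"] = field(default_factory=list)


def render(node: Node, indent: int = 0) -> List[str]:
    """Return the tree as indented lines, one per thought."""
    lines = [f"{'  ' * indent}{node.thought} [{node.rating}]"]
    for child in node.children:
        lines.extend(render(child, indent + 1))
    return lines


# Hand-built log of part of the Game of 24 search, for illustration only.
root = Node("4 9 10 13", "start", [
    Node("10-4=6 -> 6 9 13", "sure", [
        Node("13-9=4 -> 4 6", "sure"),
        Node("9-6=3 -> 3 13", "impossible"),
    ]),
    Node("4+9=13 -> 13 10 13", "impossible"),
])

print("\n".join(render(root)))
```

Reading such a log makes it clear whether a failure came from the generator never proposing the right step or from the evaluator pruning it.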

Wrap-up

Tree of Thought turns the LLM into a deliberate problem solver instead of a one-shot guesser. It costs more calls and more glue code, but for puzzles, planning, and multi-step reasoning it can solve problems on which chain-of-thought often fails. Reach for it when the task has clear states and clear progress, and you can afford the search.