LLM Temperature and Top-p Explained
Understand how temperature and top-p sampling shape the creativity, determinism, and quality of large language model outputs.
What you'll learn
- What temperature does to the token distribution
- How top-p (nucleus) sampling differs
- When to lower or raise either knob
- Combining temperature with top-p safely
- Practical defaults for common tasks
Prerequisites
- Familiarity with how APIs work
What and Why
Every time an LLM produces a token, it actually emits a probability distribution over its entire vocabulary. The job of the decoding strategy is to pick one token from that distribution. Temperature and top-p are the two most common knobs that control how adventurous that pick is.
If you have ever asked the same prompt twice and gotten wildly different answers, or seen a model produce repetitive, robotic text, sampling parameters are usually the cause.
Mental Model
Think of the model’s output as a histogram of candidate next tokens. Sampling parameters reshape that histogram before drawing from it.
- Temperature divides the logits by a constant T before softmax. Low temperature (T < 1) sharpens the distribution toward the most likely token; high temperature (T > 1) flattens it.
- Top-p (nucleus sampling) truncates the distribution. It keeps the smallest set of most likely tokens whose cumulative probability is at least p, then renormalizes.
Temperature changes how peaked the distribution is. Top-p changes how many candidates you consider at all.
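The two reshaping steps described above can be sketched in plain Python. This is a minimal illustration, not any provider's actual decoder: the function names `softmax_with_temperature`, `top_p_filter`, and `sample_token` are made up for this example, and the sketch assumes a temperature greater than zero.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply softmax. Assumes temperature > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize the survivors
    return filtered


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Reshape the distribution, then draw one token index from it."""
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Calling `top_p_filter([0.5, 0.3, 0.2], 0.7)` keeps the first two tokens, because 0.5 alone falls short of 0.7 but 0.5 + 0.3 clears it. The third token can no longer be sampled at all.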
logits -> divide by T -> softmax -> probability distribution
                                              |
                                              v
                                     sort tokens by prob
                                              |
                                              v
                               keep tokens until cum prob >= p
                                              |
                                              v
                                      sample one token
Hands-on Example
Here is a minimal Python example using the OpenAI SDK pattern that most providers mirror.
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()


def generate(prompt, temperature=0.7, top_p=1.0):
    """Send a single-turn prompt and return the reply text."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    return resp.choices[0].message.content


# Near-deterministic factual answer
print(generate("Capital of Japan?", temperature=0.0))

# Creative brainstorm
print(generate("Three startup names for a dog yoga app", temperature=1.1, top_p=0.9))
A useful experiment is to call the same prompt ten times at temperature 0.0, 0.7, and 1.3 and inspect the variance in outputs. At 0.0 you will usually see identical text. At 1.3 you may see hallucinations or off-topic detours.
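One way to run that experiment is a small harness that counts distinct outputs per temperature. The helper name `output_variety` is hypothetical. It accepts any function with the same signature as `generate` above, so it can be tried with a stub before spending API calls.

```python
from collections import Counter


def output_variety(generate_fn, prompt, temperatures=(0.0, 0.7, 1.3), runs=10):
    """Call generate_fn `runs` times per temperature and summarize the spread."""
    report = {}
    for t in temperatures:
        outputs = Counter(generate_fn(prompt, temperature=t) for _ in range(runs))
        report[t] = {
            "distinct": len(outputs),                  # number of different texts seen
            "most_common": outputs.most_common(1)[0],  # (text, count) pair
        }
    return report
```

With the article's function, `output_variety(generate, "Capital of Japan?")` makes 30 API calls and returns one summary per temperature. A `distinct` count that climbs with temperature is the variance the experiment is meant to expose.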
Trade-offs
There is no single “best” setting. The right value depends on the task.
| Task | Temperature | Top-p | Why |
|---|---|---|---|
| Classification | 0.0 | 1.0 | Want consistent labels |
| Extraction / JSON | 0.0 - 0.2 | 1.0 | Structure must not drift |
| Chat assistant | 0.5 - 0.8 | 0.9 | Helpful but not robotic |
| Creative writing | 0.9 - 1.2 | 0.9 | Variety matters more than precision |
| Code generation | 0.0 - 0.3 | 1.0 | Syntax errors compound quickly |
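If you want these defaults in code, the table can be transcribed into a small lookup. This is only a sketch: the names `SAMPLING_PRESETS` and `sampling_params` and the task keys are invented for illustration, and the values are starting points copied from the table rather than tuned recommendations.

```python
# Hypothetical presets transcribed from the table above.
# Temperature is stored as a (low, high) range.
SAMPLING_PRESETS = {
    "classification": {"temperature": (0.0, 0.0), "top_p": 1.0},
    "extraction": {"temperature": (0.0, 0.2), "top_p": 1.0},
    "chat": {"temperature": (0.5, 0.8), "top_p": 0.9},
    "creative": {"temperature": (0.9, 1.2), "top_p": 0.9},
    "code": {"temperature": (0.0, 0.3), "top_p": 1.0},
}


def sampling_params(task, position=0.5):
    """Pick a temperature inside the task's range.

    position=0.0 selects the low end of the range, 1.0 the high end.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    preset = SAMPLING_PRESETS[task]
    low, high = preset["temperature"]
    temperature = round(low + (high - low) * position, 2)
    return {"temperature": temperature, "top_p": preset["top_p"]}
```

The returned dictionary can be unpacked straight into a call such as `generate(prompt, **sampling_params("chat"))`.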
A few rules of thumb worth internalizing:
- Do not turn both knobs at once. Pick one as your primary creativity dial and leave the other at a neutral value (top_p=1.0 or temperature=1.0).
- High temperature is not “smarter.” It just samples from a wider tail, which often surfaces less likely (and less correct) tokens.
- Temperature 0 is not perfectly deterministic on most hosted APIs because of floating-point non-determinism, batching, and load balancing across hardware.
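To make the "wider tail" point concrete, the sketch below measures how much probability sits outside the single most likely token as temperature rises. The logits are made-up numbers and `tail_mass` is a hypothetical helper, so treat the printed values as an illustration of the trend rather than a property of any real model.

```python
import math


def tail_mass(logits, temperature):
    """Probability mass on everything except the most likely token."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # the top token maps to exp(0) = 1
    return 1.0 - max(exps) / sum(exps)


logits = [4.0, 2.0, 1.0, 0.0]  # made-up scores for four candidate tokens
for t in (0.5, 1.0, 1.5):
    print(f"T={t}: {tail_mass(logits, t):.3f} of the mass is off the top token")
```

The share of mass in the tail grows with temperature, which is why a higher setting picks less likely tokens more often without making the model any better informed.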
Practical Tips
- Start at temperature=0 when debugging a prompt. It removes one source of variance so you can tell whether your prompt or your sampling is the problem.
- For RAG pipelines, keep temperature low. You want the model to follow your retrieved context, not improvise.
- For brainstorming, prefer top-p around 0.9 with temperature around 1.0. This keeps the tail trimmed but lets the model wander.
- Log the parameters you used alongside the outputs. Future-you will thank present-you when an eval regresses.
- If outputs feel “too safe” or formulaic, try raising temperature before rewriting the prompt. Sometimes the prompt is fine and the sampling is too tight.
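The logging tip needs very little code. A minimal sketch, assuming a JSON Lines file is an acceptable log format; the function name `log_generation` and the record fields are illustrative choices, not a standard.

```python
import json
import time


def log_generation(path, prompt, output, **params):
    """Append one JSON line holding the prompt, sampling params, and output."""
    record = {
        "timestamp": time.time(),
        "prompt": prompt,
        "params": params,  # e.g. temperature=0.7, top_p=1.0
        "output": output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

A call such as `log_generation("runs.jsonl", prompt, text, temperature=0.2, top_p=1.0)` adds one line per generation, which makes it easy to filter by parameter value later when an eval regresses.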
Two related parameters, frequency_penalty and presence_penalty, penalize tokens that have already appeared. They are not substitutes for temperature; they are useful when you see literal repetition in long generations.
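To see why these penalties act on repetition rather than on overall randomness, here is a rough sketch of the additive adjustment that OpenAI's API documentation describes. The function name `apply_repetition_penalties` is hypothetical, and other providers may implement penalties differently.

```python
from collections import Counter


def apply_repetition_penalties(logits, generated_token_ids,
                               frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appeared in the output.

    frequency_penalty scales with how often a token has appeared;
    presence_penalty is a flat deduction applied once a token has appeared at all.
    """
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted
```

Tokens that have not been generated yet keep their original logits, so the shape of the rest of the distribution is left alone. Temperature, by contrast, rescales every logit.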
Wrap-up
Temperature and top-p are the smallest, cheapest experiments you can run on a prompt. Before you rewrite an instruction or fine-tune a model, sweep these two values and look at the distribution of outputs. You will often find that the right setting unlocks the behavior you wanted without any prompt change at all. Pick one dial, leave the other neutral, and tune deliberately.