LLM Temperature and Top-p Explained

Intermediate 9 min read

What you'll learn

✓ What temperature does to the token distribution
✓ How top-p (nucleus) sampling differs
✓ When to lower or raise either knob
✓ Combining temperature with top-p safely
✓ Practical defaults for common tasks

Prerequisites

• Familiar with how APIs work

What and Why

Every time an LLM produces a token, it actually emits a probability distribution over its entire vocabulary. The job of the decoding strategy is to pick one token from that distribution. Temperature and top-p are the two most common knobs that control how adventurous that pick is.

If you have ever asked the same prompt twice and gotten wildly different answers, or seen a model produce repetitive, robotic text, sampling parameters are often the cause.

Mental Model

Think of the model’s output as a histogram of candidate next tokens. Sampling parameters reshape that histogram before drawing from it.

Temperature scales the logits before softmax. Values below 1.0 sharpen the distribution toward the most likely token, values above 1.0 flatten it, and 1.0 leaves it unchanged.
Top-p (nucleus sampling) truncates the distribution. It keeps the smallest set of most likely tokens whose cumulative probability is at least p, discards the rest, and renormalizes what remains so it sums to 1.

Temperature changes how peaked the distribution is. Top-p changes how many candidates you consider at all.
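
To make the temperature half of this concrete, here is a minimal sketch in plain Python. The function name softmax_with_temperature and the example logits are illustrative assumptions, not any provider's API. Treating a temperature of 0 as "always pick the top token" is a common convention, assumed here because dividing by zero is undefined.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing them by temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy decoding,
        # so all probability goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it should show the top token's share shrinking as the temperature rises, which is the "flattening" described above.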

logits -> divide by T -> softmax -> probability distribution
                              |
                              v
                     sort tokens by prob
                              |
                              v
                keep tokens until cum prob >= p
                              |
                              v
                        sample one token

Temperature reshapes, top-p truncates
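
The truncation and sampling steps in the diagram can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how any particular inference engine implements it: the names top_p_filter and sample_token are hypothetical, and the input is assumed to be a list of probabilities that already sums to 1.

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, then renormalize. Returns {token_index: probability}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # the nucleus is complete; drop the remaining tail
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_token(probs, p=0.9, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical distribution over four tokens
print(top_p_filter(probs, p=0.9))  # the least likely token is cut
print(sample_token(probs, p=0.9))
```

With p=0.9 the first three tokens already cover 0.95 of the mass, so the fourth can never be sampled, no matter how the draw goes.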

Hands-on Example

Here is a minimal Python example using the OpenAI SDK. Many other providers expose the same two parameters under similar names.

from openai import OpenAI

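client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def generate(prompt, temperature=0.7, top_p=1.0):
    """Return one completion for `prompt` using the given sampling parameters."""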
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    return resp.choices[0].message.content

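# Near-deterministic factual answer (temperature 0 approximates greedy decoding)
print(generate("Capital of Japan?", temperature=0.0))

# Creative brainstorm: trim the tail with top_p and leave temperature neutral
print(generate("Three startup names for a dog yoga app", temperature=1.0, top_p=0.9))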

A useful experiment is to call the same prompt ten times at temperature 0.0, 0.7, and 1.3 and inspect the variance in outputs. At 0.0 you will usually see identical text. At 1.3 you may see hallucinations or off-topic detours.
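
One way to run that experiment is sketched below. The helper distinct_outputs is a hypothetical name, and it takes the generating function as an argument so you can pass in the generate function from the example above. The offline_stub function is only a stand-in so the sketch runs without an API key; it does not model real sampling behavior.

```python
import random
from collections import Counter

def distinct_outputs(generate_fn, prompt, temperatures=(0.0, 0.7, 1.3), runs=10):
    """Call generate_fn `runs` times per temperature and count each output."""
    summary = {}
    for t in temperatures:
        summary[t] = Counter(generate_fn(prompt, temperature=t) for _ in range(runs))
    return summary

def offline_stub(prompt, temperature=0.7):
    """Hypothetical stand-in for an API call so the sketch runs offline."""
    if temperature == 0:
        return "Tokyo"
    return random.choice(["Tokyo", "Tokyo.", "The capital is Tokyo."])

for t, counts in distinct_outputs(offline_stub, "Capital of Japan?").items():
    print(f"temperature={t}: {len(counts)} distinct outputs in {sum(counts.values())} runs")
```

Swapping offline_stub for a real generating function makes this 30 API calls with the default arguments, so lower runs if cost matters.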

Trade-offs

There is no single “best” setting. The right value depends on the task.

Task	Temperature	Top-p	Why
Classification	0.0	1.0	Want consistent labels
Extraction / JSON	0.0 - 0.2	1.0	Structure must not drift
Chat assistant	0.5 - 0.8	0.9	Helpful but not robotic
Creative writing	0.9 - 1.2	0.9	Variety matters more than precision
Code generation	0.0 - 0.3	1.0	Syntax errors compound quickly
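
If you want these defaults in code, a small lookup table is usually enough. The sketch below is one possible encoding: the task keys and the helper name sampling_params are assumptions, and where the table gives a range the value is simply one pick from inside that range.

```python
# Starting points taken from the table above; single values are picks
# from within the listed ranges, not tuned recommendations.
SAMPLING_PRESETS = {
    "classification": {"temperature": 0.0, "top_p": 1.0},
    "extraction": {"temperature": 0.1, "top_p": 1.0},
    "chat": {"temperature": 0.7, "top_p": 0.9},
    "creative_writing": {"temperature": 1.0, "top_p": 0.9},
    "code_generation": {"temperature": 0.2, "top_p": 1.0},
}

def sampling_params(task):
    """Return a copy of the preset for `task` so callers can tweak it safely."""
    try:
        return dict(SAMPLING_PRESETS[task])
    except KeyError:
        known = ", ".join(sorted(SAMPLING_PRESETS))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None

print(sampling_params("code_generation"))
```

The returned dictionary can be unpacked into a call such as generate(prompt, **sampling_params("chat")).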

A few rules of thumb worth internalizing:

Do not turn both knobs at once. Pick one as your primary creativity dial and leave the other at a neutral value (top_p=1.0 or temperature=1.0).
High temperature is not “smarter.” It just samples from a wider tail, which often surfaces less likely (and less correct) tokens.
Temperature 0 is not perfectly deterministic on most hosted APIs because of floating-point non-determinism, batching, and load balancing across hardware.

Practical Tips

Start at temperature=0 when debugging a prompt. It removes one source of variance so you can tell whether your prompt or your sampling is the problem.
For RAG pipelines, keep temperature low. You want the model to follow your retrieved context, not improvise.
For brainstorming, prefer top-p around 0.9 with temperature around 1.0. This keeps the tail trimmed but lets the model wander.
Log the parameters you used alongside the outputs. Future-you will thank present-you when an eval regresses.
If outputs feel “too safe” or formulaic, try raising temperature before rewriting the prompt. Sometimes the prompt is fine and the sampling is too tight.
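
The logging tip can be as simple as appending one JSON line per generation. The sketch below is a minimal version using only the standard library. The function name log_generation, the field names, and the JSON Lines layout are all assumptions, so adapt them to whatever your eval tooling expects.

```python
import json
import time

def log_generation(path, prompt, output, *, model, temperature, top_p):
    """Append one JSON line recording the output and the settings that produced it."""
    record = {
        "timestamp": time.time(),
        "model": model,
        "temperature": temperature,
        "top_p": top_p,
        "prompt": prompt,
        "output": output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

A call such as log_generation("runs.jsonl", prompt, text, model="gpt-4o-mini", temperature=0.0, top_p=1.0) adds one line to the file, which makes it easy to filter by setting when an eval regresses.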

There are also two related parameters, frequency_penalty and presence_penalty, which penalize tokens that have already appeared. These are not substitutes for temperature; they are useful when you see literal repetition in long generations.
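
To see how the two penalties differ, the sketch below applies them to a list of logits before sampling. It assumes the additive form commonly described for hosted APIs: the frequency penalty grows with each repeat, while the presence penalty is a flat, one-time deduction. The function name apply_penalties is hypothetical, and real providers may implement the details differently.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appeared in the output so far."""
    counts = Counter(generated_ids)
    adjusted = list(logits)  # leave the caller's list untouched
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty  # scales with repeat count
        adjusted[token_id] -= presence_penalty           # flat, applied once per seen token
    return adjusted

logits = [2.0, 1.0, 0.5]   # hypothetical scores for token ids 0, 1, 2
history = [0, 0, 1]        # token 0 appeared twice, token 1 once
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.2))
```

Token 0 loses the most because it was repeated, token 1 loses less, and token 2 is unchanged because it has not appeared yet.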

Wrap-up

Adjusting temperature and top-p is one of the smallest, cheapest experiments you can run on a prompt. Before you rewrite an instruction or fine-tune a model, sweep these values one at a time and look at the distribution of outputs. You will often find that the right setting unlocks the behavior you wanted without any prompt change at all. Pick one dial, leave the other neutral, and tune deliberately.