Prompt Engineering: Few-shot vs Zero-shot

Beginner 9 min read

What you'll learn

✓What zero-shot and few-shot prompting really mean
✓How examples steer formatting more than knowledge
✓Where few-shot shines and where it backfires
✓Picking and ordering examples for best results
✓When to graduate from few-shot to fine-tuning

Prerequisites

•Familiar with how APIs work

What and Why

Zero-shot prompting asks the model to perform a task without showing examples. Few-shot prompting includes a handful of solved examples right inside the prompt before the actual question. Modern models are good enough that zero-shot works for many tasks, but few-shot is still the fastest way to pin down a tricky output format or a specific style.

Choosing between them is mostly about a single question: does the model already know the task, or does it need a demonstration to lock in the format you want?

Mental Model

The prompt is a contract. Zero-shot describes the task in words. Few-shot also shows you what good output looks like. Models are very good at pattern-matching off shown examples, often better than they are at following instructions alone.

zero-shot:
[task instruction]
[user input]
-> output

few-shot:
[task instruction]
[example 1 input -> example 1 output]
[example 2 input -> example 2 output]
[example 3 input -> example 3 output]
[user input]
-> output (mimics the example style)

Zero-shot vs few-shot prompt shape

A useful intuition: instructions tell the model what to do; examples tell it what done looks like. The latter is often more effective because it leaves less room for interpretation.

Hands-on Example

A zero-shot sentiment classifier:

prompt = """Classify the sentiment of the review as positive, negative, or neutral.
Respond with one word only.

Review: The food was cold and the service was rude.
Label:"""

A few-shot version:

prompt = """Classify the sentiment of the review.
Respond with one word: positive, negative, or neutral.

Review: Loved every bite, will be back.
Label: positive

Review: It was fine. Nothing special.
Label: neutral

Review: Overpriced for the portion size.
Label: negative

Review: The food was cold and the service was rude.
Label:"""

For a modern model on a simple sentiment task, both work. The few-shot version is more reliable when:

The model occasionally invents new labels (“mixed”, “okay”) in zero-shot mode.
The format must be exactly one word.
The domain has unusual conventions (legal, medical, finance).

Trade-offs

Each technique has a sweet spot.

Zero-shot wins when the task is common, the instructions are clear, and you want minimum tokens.
Few-shot wins for format control, niche domains, and tasks where the right answer depends on conventions the model cannot guess.
Few-shot is more expensive. Each example is input tokens you pay for on every call. Three examples can easily add 500 tokens.
Examples can mislead. Three examples that all flag “long review” as positive can teach the model that length implies positivity. Pick examples that span the distribution.
More examples is not always better. Returns diminish quickly past 5-8 examples, and very long prompts hurt latency.

A practical rule: if zero-shot reaches 90% of your target quality, stay zero-shot. Spend the prompt budget on better instructions instead. If zero-shot lands at 70% and you cannot fix it with instructions, reach for few-shot.

Picking Good Examples

The quality of few-shot is dominated by the quality of the examples.

Cover the edge cases. Include short and long inputs, ambiguous cases, and the rare classes.
Match the test distribution. Examples in formal English will not help on slangy chat messages.
Order matters. Models pay more attention to the last few examples. Put the most representative one closest to the query.
Vary the answers. Three positive examples in a row bias the model toward positive. Mix the label distribution.
Avoid trick examples. A weird example teaches the model weird patterns. Stick to clean canonical cases.

For tasks with many possible categories, dynamic few-shot helps: retrieve the most semantically similar labeled examples for the current query and inject them. This is essentially RAG for prompting.

Practical Tips

Start zero-shot. Try a strong instruction first. Most tasks need nothing more.
Add an output schema before examples. A schema can deliver most of the format control benefit without paying for examples.
Use 3-5 examples as a default. More than that rarely pays back the token cost.
Make every example a teaching moment. If two examples teach the same lesson, drop one.
Test on held-out queries. Without an eval set, you cannot tell whether examples are helping or just changing outputs randomly.
Switch to dynamic few-shot at scale. Indexing labeled examples and retrieving the closest matches per query beats a single fixed set.
Graduate to fine-tuning when you have hundreds of consistent examples. At that point baking them into weights is cheaper per call than re-sending them on every request.

Wrap-up

Zero-shot is the default for capable models on common tasks. Few-shot is the next step up when format control or domain conventions are not negotiable. Pick examples that cover edge cases, vary the labels, and order the most relevant one last. When the pattern stabilizes and the example list grows, that is your signal to consider fine-tuning instead.