LLM Fine-tuning vs Prompting Trade-offs

Intermediate · 10 min read

What you'll learn

✓ What fine-tuning actually changes in a model
✓ When prompting is strictly better than fine-tuning
✓ When fine-tuning pays off
✓ How RAG fits between the two
✓ A practical decision checklist

Prerequisites

• Familiarity with how APIs work

What and Why

When a stock LLM is not quite right, you have three big levers: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. They are not interchangeable. Picking the wrong lever can waste weeks of engineering and tens of thousands of dollars.

The goal of this article is to make the choice clean. You will leave with a checklist that tells you when to reach for which technique.

Mental Model

Each lever modifies a different part of the model’s behavior.

Prompting changes what the model sees on a single call. It is the fastest, cheapest experiment.
RAG changes what the model sees by injecting relevant external knowledge at runtime.
Fine-tuning changes the model itself by training on examples until weights adjust toward your data.

Prompting:    [system + user prompt] -> base model -> output
RAG:          [query] -> retrieve docs -> [prompt + docs] -> base model -> output
Fine-tuning:  base model + training data -> tuned model -> [prompt] -> output

Three ways to steer an LLM

A rough principle: prompting and RAG change inputs, fine-tuning changes weights. Knowledge is best stored in inputs (especially RAG, because it stays fresh). Style, format, and behavior are best stored in weights when prompting alone cannot get them stable.
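To make "prompting and RAG change inputs" concrete, here is a minimal sketch of how the two input-side levers assemble the messages sent to the same base model. Everything in it is an assumption for illustration: the `DOCS` list stands in for a real document store, and the word-overlap `retrieve` function is a toy replacement for a proper vector search.

```python
# Hypothetical in-memory knowledge base; a real system would use a vector store.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Acme's support line is open 9am to 5pm on weekdays.",
    "Orders over $50 ship free within the US.",
]


def prompt_only(question: str) -> list[dict]:
    """Prompting: the model sees only the instructions and the question."""
    return [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": question},
    ]


def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank docs by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def rag(question: str, docs: list[str]) -> list[dict]:
    """RAG: same model, but retrieved text is injected into the prompt at runtime."""
    context = "\n".join(retrieve(question, docs))
    messages = prompt_only(question)
    messages[0]["content"] += "\nAnswer using only this context:\n" + context
    return messages
```

In both cases the model itself is untouched; the only difference is what ends up in the message list. Fine-tuning is the opposite: the message list can stay short because the behavior lives in the weights.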

Hands-on Example

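Fine-tuning data is a small JSONL file: one JSON object per line, each holding a chat-style exchange that ends with the assistant reply you want the model to learn. A useful fine-tune typically needs at least a few hundred high-quality examples.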

{"messages":[{"role":"system","content":"You are a SQL writer."},{"role":"user","content":"Top 5 customers by revenue last quarter"},{"role":"assistant","content":"SELECT customer_id, SUM(amount) AS revenue FROM orders WHERE order_date >= '2025-04-01' AND order_date < '2025-07-01' GROUP BY customer_id ORDER BY revenue DESC LIMIT 5;"}]}
{"messages":[{"role":"system","content":"You are a SQL writer."},{"role":"user","content":"Average order value per region in 2025"},{"role":"assistant","content":"SELECT region, AVG(amount) AS aov FROM orders WHERE EXTRACT(YEAR FROM order_date)=2025 GROUP BY region;"}]}
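Malformed lines are a common reason a training job is rejected, so it can be worth checking the file locally before uploading. The sketch below is one way to do that for the simple chat format shown above. The rules it enforces (known roles, non-empty string content, final message from the assistant) are assumptions that fit these examples, not the provider's official validation, and `validate_example` and `validate_file` are hypothetical helper names.

```python
import json

# Roles used by the simple chat format above (assumption: no tool messages).
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(line: str) -> list[str]:
    """Return the problems found in one JSONL line; an empty list means OK."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")

    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant reply to learn from")
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; an empty dict means the file passed."""
    report = {}
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # ignore blank lines
            problems = validate_example(line)
            if problems:
                report[lineno] = problems
    return report
```

Calling `validate_file("train.jsonl")` and getting back an empty dict is a reasonable precondition for the upload step.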

Upload the file and start a job:

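from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file; the context manager closes the handle afterwards.
with open("train.jsonl", "rb") as fh:
    f = client.files.create(file=fh, purpose="fine-tune")

# Start the fine-tuning job against a pinned base-model snapshot.
job = client.fine_tuning.jobs.create(training_file=f.id, model="gpt-4o-mini-2024-07-18")
print(job.id)  # keep this ID to check on the job later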

After training, you call the tuned model by name like any other model. The interface does not change; only the weights do.
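As a rough sketch of that last step, the helpers below poll the job until it finishes and then send a normal chat request to the tuned model. `wait_for_tuned_model` and `ask` are hypothetical names introduced here; `client` is assumed to be the `OpenAI()` client from the upload step, and the 30-second polling interval is an arbitrary choice.

```python
import time


def wait_for_tuned_model(client, job_id: str, poll_seconds: float = 30.0) -> str:
    """Poll a fine-tuning job until it ends and return the tuned model's name."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"fine-tuning job {job_id} ended with status {job.status}")
        time.sleep(poll_seconds)  # still validating, queued, or running


def ask(client, model: str, question: str) -> str:
    """Same chat call as for a base model; only the model name differs."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a SQL writer."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

Typical usage would be `tuned = wait_for_tuned_model(client, job.id)` followed by `ask(client, tuned, "Top 5 customers by revenue last quarter")`. Passing the base model's name to `ask` instead gives a like-for-like comparison.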

Trade-offs

Criterion	Prompting	RAG	Fine-tuning
Iteration speed	Minutes	Hours	Days
Cost to build	Near zero	Moderate (infra)	High (data + compute)
Per-call cost	High (long prompt)	Moderate	Often lower (short prompt)
Stays fresh	Yes	Yes	No (frozen at train time)
Best for	Style, format, simple tasks	Up-to-date knowledge	Stable behaviors, structured output

Important nuance:

Fine-tuning is bad at teaching new facts. Teaching a model that “Acme’s CEO is Pat” by fine-tuning will work less reliably than putting that sentence in the prompt or retrieving it. Models tend to learn patterns better than facts.
Fine-tuning is great at teaching a tone. Customer support voice, code style, and structured output formats are exactly where weights help.
A shorter prompt can pay for the tune. If fine-tuning lets you cut a 3000-token prompt down to 200 tokens, the per-call savings can recover the training cost within weeks, provided call volume is high enough and the tuned model's per-token price does not erase the difference.
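Whether the shorter prompt actually pays for the tune is simple arithmetic, sketched below. All prices and the training cost are placeholders, not real provider pricing, and the calculation counts only input tokens, on the assumption that output length stays the same after tuning.

```python
def cost_per_call(prompt_tokens: int, price_per_million_tokens: float) -> float:
    """Input-token cost of a single call, in the same currency as the price."""
    return prompt_tokens * price_per_million_tokens / 1_000_000


def breakeven_calls(
    training_cost: float,
    base_prompt_tokens: int,
    tuned_prompt_tokens: int,
    base_price: float,
    tuned_price: float,
) -> float:
    """Calls needed before per-call savings equal the one-off training cost.

    Prices are per million input tokens. Returns infinity when the tuned
    model is not cheaper per call, i.e. the tune never pays for itself.
    """
    saving = cost_per_call(base_prompt_tokens, base_price) - cost_per_call(
        tuned_prompt_tokens, tuned_price
    )
    if saving <= 0:
        return float("inf")
    return training_cost / saving


# Placeholder numbers: a 100-unit training run, 3000 -> 200 prompt tokens,
# and a tuned model that costs twice as much per token as the base model.
calls = breakeven_calls(100.0, 3000, 200, base_price=1.0, tuned_price=2.0)
days = calls / 5000  # assuming 5000 calls per day
```

With these placeholder values the break-even point is roughly 38,500 calls, or about eight days at 5000 calls per day. At 50 calls per day the same tune would take over two years to pay back, which is why call volume belongs in the decision.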

Practical Tips

A decision checklist that has held up in production:

Try the strongest prompt you can write first. Most “needs fine-tuning” claims evaporate once the model is given a careful system prompt and three few-shot examples.
If the failure is “doesn’t know this fact,” reach for RAG, not fine-tuning.
If the failure is “knows it but writes it wrong,” try output schemas or strict JSON mode before fine-tuning.
If you have stable, repetitive structured outputs and 500+ high-quality examples, fine-tune.
Never fine-tune until you have an eval set. Without a deterministic eval, you cannot tell if the tune helped or hurt.
Mix small + tuned with big + prompted. A common pattern is to fine-tune a small cheap model for the common case and route hard cases to a frontier model.
Refresh tunes on a schedule. Data drifts and base models improve, so a fine-tune from a year ago may well underperform a careful prompt on a current model.
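The eval set mentioned in the checklist does not need to be elaborate. Below is a minimal sketch of a deterministic exact-match eval that scores a baseline and a candidate on the same held-out examples. The `EVAL_SET` contents are made up for illustration, and each "model" is assumed to be any callable that maps a prompt string to an output string, for example a thin wrapper around an API call.

```python
from typing import Callable

# Hypothetical held-out examples: (prompt, expected output) pairs that were
# NOT part of the training file.
EVAL_SET = [
    ("Count all orders", "SELECT COUNT(*) FROM orders;"),
    ("List distinct regions", "SELECT DISTINCT region FROM orders;"),
]


def normalize(text: str) -> str:
    """Collapse whitespace so trivial formatting differences don't count as failures."""
    return " ".join(text.split())


def exact_match_score(model: Callable[[str], str], eval_set) -> float:
    """Fraction of eval prompts whose output matches the expected text exactly."""
    hits = sum(
        normalize(model(prompt)) == normalize(expected)
        for prompt, expected in eval_set
    )
    return hits / len(eval_set)


def compare(baseline, candidate, eval_set) -> dict:
    """Score both models on the same eval set and report whether the candidate won."""
    base = exact_match_score(baseline, eval_set)
    cand = exact_match_score(candidate, eval_set)
    return {"baseline": base, "candidate": cand, "improved": cand > base}
```

Exact match is a strict metric and will undercount outputs that are correct but worded differently. For SQL, executing both queries against a test database and comparing the results is a common alternative. The point is only that the same fixed examples and the same scoring rule are applied before and after the tune.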

A surprising win: many teams find that good prompt engineering plus light RAG covers 80% of cases that initially looked like fine-tuning candidates.
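The first five checklist items can be read as an ordered series of questions, sketched below as a small function. The `Situation` fields and the returned labels are hypothetical names chosen for this illustration, and the final fallback (keep iterating) is an assumption about what to do when no rule fires. The 500-example threshold comes straight from the checklist.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    tried_strong_prompt: bool           # careful system prompt + few-shot examples
    failure_is_missing_knowledge: bool  # "doesn't know this fact"
    failure_is_output_format: bool      # "knows it but writes it wrong"
    tried_output_schema: bool           # output schemas or strict JSON mode
    stable_repetitive_task: bool        # stable, repetitive structured outputs
    num_quality_examples: int           # high-quality training examples on hand
    has_eval_set: bool                  # deterministic eval available


def next_lever(s: Situation) -> str:
    """Walk the checklist in order and return the next thing to try."""
    if not s.tried_strong_prompt:
        return "prompting"
    if s.failure_is_missing_knowledge:
        return "rag"
    if s.failure_is_output_format and not s.tried_output_schema:
        return "output schema or JSON mode"
    if s.stable_repetitive_task and s.num_quality_examples >= 500:
        # Never fine-tune without a way to measure the result.
        return "fine-tuning" if s.has_eval_set else "build an eval set first"
    # Assumption: nothing above applies, so stay with the cheaper levers.
    return "keep iterating on prompt and RAG"
```

The ordering matters more than the exact conditions: cheaper levers are checked first, and fine-tuning is reachable only after they have been tried.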

Wrap-up

Fine-tuning is a powerful tool but rarely the first one to reach for. Start with prompts, layer on RAG for knowledge, and reserve fine-tuning for stable behaviors backed by an eval suite. The cheapest lever you can pull is almost always the right one to pull first.