Fine-Tuning LLMs: When and How
A practical guide to deciding whether to fine-tune an LLM, choosing between full fine-tuning, LoRA, and QLoRA, and avoiding the common pitfalls that waste a budget.
What you'll learn
- When to fine-tune versus prompt or use RAG
- The differences between full FT, LoRA, and QLoRA
- How to prepare a clean instruction dataset
- A minimal LoRA training loop with PEFT
- How to evaluate before shipping
Prerequisites
- Basic Python familiarity
Fine-tuning gets pitched as the answer to every LLM problem. Most of the time it is not. Before you spend GPU hours, it is worth understanding what fine-tuning actually changes, when cheaper options work better, and what a sensible workflow looks like when you decide to do it.
What fine-tuning really does
A pretrained model has learned general patterns of language from trillions of tokens. Fine-tuning continues training on a smaller, more focused dataset so the model leans toward your style, format, or domain. It does not give the model new facts in any reliable way; for facts, retrieval-augmented generation (RAG) is the right tool. Fine-tuning is best for behavior: tone, structure, classification labels, function-call patterns, refusal style.
If you ask “should I fine-tune for our internal documentation?” the honest answer is usually no. Build RAG first. Fine-tune if you need consistent formatting that prompts cannot achieve, or if your usage volume is so high that a smaller fine-tuned model would be cheaper than calling a large frontier model.
Three approaches
Full fine-tuning updates every parameter of the model. It is the most expressive option and the most expensive. For a 7B model trained with AdamW you need tens of gigabytes of GPU memory just for the optimizer state (two fp32 values per parameter comes to about 56 GB), and you end up with a full-sized checkpoint to host. Use it only when you have a serious dataset and budget.
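As a rough back-of-the-envelope check on that memory claim, the sketch below assumes bf16 weights and gradients and an AdamW optimizer that keeps two fp32 values per parameter. The function name and byte sizes are illustrative assumptions, not measurements; activations and any fp32 master weights are ignored, so real usage is higher.

```python
def full_ft_memory_gb(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Estimate GPU memory (GB) for weights, gradients, and optimizer state.

    Assumes bf16 weights and gradients (2 bytes each) and AdamW keeping
    two fp32 moments per parameter (8 bytes). Activations are not included.
    """
    gb = 1e9
    return {
        "weights": n_params * weight_bytes / gb,
        "gradients": n_params * grad_bytes / gb,
        "optimizer": n_params * optim_bytes / gb,
    }


est = full_ft_memory_gb(7e9)
print(est)                # weights 14 GB, gradients 14 GB, optimizer 56 GB
print(sum(est.values()))  # 84 GB before activations
```

Under these assumptions the optimizer state alone is four times the size of the bf16 weights, which is why the adapter-based methods below are so much cheaper.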
LoRA (Low-Rank Adaptation) freezes the base model and trains small low-rank matrices that sit alongside the original weights. You typically update less than 1% of the parameters, fit on a single consumer GPU, and produce adapter files of tens to a few hundred megabytes, depending on the rank and which layers you target. Multiple LoRA adapters can share one base model in memory, which is great for serving many fine-tunes cheaply.
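To see why the trainable fraction is so small, here is a minimal parameter count. It assumes a hypothetical 7B-class model with 32 layers and square 4096-wide q_proj and v_proj matrices; real architectures differ (some use narrower value projections), so treat the numbers as illustrative only.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical shape, not read from any real checkpoint.
n_layers, hidden, r = 32, 4096, 16
per_matrix = lora_param_count(hidden, hidden, r)
trainable = n_layers * 2 * per_matrix  # two adapted matrices per layer
base_params = 7e9

print(trainable)                                  # 8388608
print(f"{100 * trainable / base_params:.2f}%")    # 0.12%
print(trainable * 2 / 1e6, "MB at 2 bytes each")  # about 16.8 MB
```

Raising the rank or adapting more projection matrices scales this count proportionally, which is where the larger adapter files come from.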
QLoRA combines LoRA with 4-bit quantization of the base model. It lets you fine-tune 13B or even 70B models on a single 24 GB or 48 GB GPU. The training is slightly slower per step, but the memory savings are huge. For most teams, QLoRA is the default starting point.
Data is the whole game
Models will copy whatever style your data has, including the mistakes. A thousand high-quality examples almost always beat a hundred thousand noisy ones. Focus on three things: format consistency, coverage of edge cases, and clean separation of input and output. For instruction tuning, every example should look like the kind of message-response pair you will see in production.
A simple JSONL schema works well:
import json

# One training example per line: a system prompt, a user message,
# and the exact assistant output you want the model to produce.
examples = [
    {"messages": [
        {"role": "system", "content": "You triage support tickets."},
        {"role": "user", "content": "My invoice shows the wrong tax rate."},
        {"role": "assistant", "content": "Category: billing\nPriority: medium"},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
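Because format consistency matters so much, it can help to lint the file before training. The following is a minimal sketch, assuming the messages schema shown above; the function names and the specific checks are illustrative choices, not part of any library.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(ex):
    """Return a list of problems found in one chat-format example."""
    if not isinstance(ex, dict):
        return ["example is not a JSON object"]
    msgs = ex.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, m in enumerate(msgs):
        if not isinstance(m, dict):
            problems.append(f"message {i}: not an object")
            continue
        if m.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {m.get('role')!r}")
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")
    if isinstance(msgs[-1], dict) and msgs[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


def validate_jsonl_lines(lines):
    """Map 1-based line number to its problems, for lines that fail."""
    report = {}
    for n, line in enumerate(lines, start=1):
        try:
            ex = json.loads(line)
        except json.JSONDecodeError as e:
            report[n] = [f"invalid JSON: {e.msg}"]
            continue
        issues = validate_example(ex)
        if issues:
            report[n] = issues
    return report


sample = [
    '{"messages": [{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Category: other"}]}',
    '{"messages": [{"role": "user", "content": ""}]}',
]
print(validate_jsonl_lines(sample))
```

In the sample, the first line passes and the second is flagged twice: once for empty content and once for not ending with an assistant turn.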
Hold out at least 10% as a validation set and keep a separate held-out test set you never look at during tuning.
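One simple way to do that split is a single seeded shuffle followed by slicing. This is a sketch using only the standard library; the function name and the 10% test fraction are assumptions for illustration.

```python
import random


def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train/val/test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_frac))
    n_test = max(1, int(len(items) * test_frac))
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

The fixed seed keeps the split reproducible across runs. If your data contains near-duplicates, deduplicate before splitting so the same example does not leak into both train and test.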
A minimal LoRA training loop
The Hugging Face transformers and peft libraries make this approachable. The snippet below is the skeleton; in real projects you would add gradient accumulation, evaluation callbacks, and proper logging.
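import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.pad_token or tok.eos_token  # batching needs a pad token
# Load the base model in 4-bit (QLoRA style); requires the bitsandbytes package.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb)
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def tokenize(ex):
    # Render the chat messages to one string, then tokenize. This needs a tokenizer
    # with a chat template; if the base checkpoint lacks one, use the Instruct variant.
    text = tok.apply_chat_template(ex["messages"], tokenize=False)
    return tok(text, truncation=True, max_length=1024)

ds = load_dataset("json", data_files="train.jsonl")["train"]
ds = ds.map(tokenize, remove_columns=ds.column_names)
args = TrainingArguments(
    output_dir="./out", per_device_train_batch_size=4,
    num_train_epochs=3, learning_rate=2e-4, bf16=True, logging_steps=10,
)
collator = DataCollatorForLanguageModeling(tok, mlm=False)  # pads batches, builds labels from input_ids
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
model.save_pretrained("./adapter")  # writes only the LoRA adapter weights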
Hyperparameters that actually matter
Learning rate dominates everything. For LoRA, values around 1e-4 to 3e-4 are normal; full fine-tuning wants something an order of magnitude smaller. Train for a small number of epochs, usually one to three, and watch validation loss. If training loss keeps falling while validation loss rises or generations get repetitive, you are overfitting and should stop.
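One way to make "watch validation loss" concrete is a patience-based stopping rule. The sketch below is a standalone illustration with made-up loss values; in a real run you would more likely use the EarlyStoppingCallback that ships with transformers rather than this hypothetical helper.

```python
def best_stopping_epoch(val_losses, patience=1, min_delta=0.0):
    """Return the 1-based epoch with the best validation loss, ending the
    scan once the loss fails to improve for `patience` epochs in a row."""
    best_epoch, best_loss, bad = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_epoch, best_loss, bad = epoch, loss, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_epoch


# Hypothetical per-epoch validation losses, not from a real run.
print(best_stopping_epoch([1.92, 1.61, 1.58, 1.66, 1.71]))  # 3
```

Here the loss bottoms out at epoch 3 and rises afterward, so the checkpoint from epoch 3 is the one to keep.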
Rank r sets LoRA capacity, and lora_alpha scales the adapter update (the effective scale is lora_alpha / r). Start with r=8 or r=16 and lora_alpha = 2*r. Increase the rank only if validation loss plateaus too early.
Evaluate like you mean it
Loss numbers are deceptive. Build a small evaluation harness with real prompts from your domain and grade outputs on the criteria you care about: format compliance, factual accuracy, tone, refusal behavior. For classification tasks, compute accuracy or F1. For generation, use an LLM judge with a fixed rubric, or ask humans to rate side-by-side against the base model.
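For a classification-style task, a small harness can be little more than a format check plus standard metrics. This sketch assumes the "Category / Priority" output format from the ticket-triage example and a hypothetical set of priority levels (low, medium, high); the outputs and gold labels are invented for illustration.

```python
import re

# Assumed output format: "Category: <label>\nPriority: <level>"
PATTERN = re.compile(r"^Category: (\w+)\nPriority: (low|medium|high)$")


def parse_output(text):
    """Return (category, priority), or None if the output breaks the format."""
    m = PATTERN.match(text.strip())
    return m.groups() if m else None


def evaluate(outputs, gold_categories):
    """Compute format compliance, accuracy, and macro-F1 over categories."""
    parsed = [parse_output(o) for o in outputs]
    preds = [p[0] if p else None for p in parsed]
    pairs = list(zip(preds, gold_categories))
    n = len(outputs)
    f1s = []
    for label in sorted(set(gold_categories)):
        tp = sum(p == label and g == label for p, g in pairs)
        fp = sum(p == label and g != label for p, g in pairs)
        fn = sum(p != label and g == label for p, g in pairs)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return {
        "format_compliance": sum(p is not None for p in parsed) / n,
        "accuracy": sum(p == g for p, g in pairs) / n,
        "macro_f1": sum(f1s) / len(f1s),
    }


outputs = [
    "Category: billing\nPriority: medium",
    "Category: billing\nPriority: high",
    "billing, medium priority",  # wrong format, counted as a miss
    "Category: account\nPriority: low",
]
gold = ["billing", "account", "billing", "account"]
result = evaluate(outputs, gold)
print(result)
```

Counting malformed outputs as wrong predictions is a deliberate choice here: a fine-tune that knows the right label but breaks the format still fails in production. Run the same harness on the base model so the two sets of numbers are directly comparable.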
Compare against three baselines: the base model with a good prompt, the base model with few-shot examples, and a RAG version. If your fine-tune is not clearly better than all three, do not ship it.
A reasonable workflow
Start with a strong prompt. If that is not enough, add few-shot examples. If you need facts, add RAG. Only then consider fine-tuning, and start with QLoRA on a small high-quality dataset. Measure before and after on a real evaluation set. Ship when the win is clear, and keep the adapter files versioned so you can roll back. Fine-tuning is a tool, not a destination.