Fine-Tuning LLMs: When and How

Beginner 11 min read

What you'll learn

✓ When to fine-tune versus prompt or use RAG
✓ The differences between full fine-tuning, LoRA, and QLoRA
✓ How to prepare a clean instruction dataset
✓ A minimal LoRA training loop with PEFT
✓ How to evaluate before shipping

Prerequisites

• Basic Python familiarity

Fine-tuning gets pitched as the answer to every LLM problem. Most of the time it is not. Before you spend GPU hours, it is worth understanding what fine-tuning actually changes, when cheaper options work better, and what a sensible workflow looks like when you decide to do it.

What fine-tuning really does

A pretrained model has learned general patterns of language from trillions of tokens. Fine-tuning continues training on a smaller, more focused dataset so the model leans toward your style, format, or domain. It does not give the model new facts in any reliable way; for facts, retrieval-augmented generation (RAG) is the right tool. Fine-tuning is best for behavior: tone, structure, classification labels, function-call patterns, refusal style.

If you ask “should I fine-tune for our internal documentation?” the honest answer is usually no. Build RAG first. Fine-tune if you need consistent formatting that prompts cannot achieve, or if your usage volume is so high that a smaller fine-tuned model would be cheaper than calling a large frontier model.

Three approaches

Full fine-tuning updates every parameter of the model. It is the most expressive option and the most expensive. For a 7B model trained with Adam, the optimizer state alone takes tens of gigabytes of GPU memory, on top of the weights, gradients, and activations, and you end up with a full-sized checkpoint to host. Use it only when you have a serious dataset and budget.
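
As a rough illustration of where that memory goes, the sketch below adds up weights, gradients, and optimizer state for a 7B-parameter model. The byte counts assume one common mixed-precision setup (16-bit weights and gradients, plus an fp32 master copy and two fp32 Adam moments); the function name and defaults are illustrative, and activation memory is ignored.

```python
def full_ft_memory_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Rough GPU memory (GB) for full fine-tuning with Adam, ignoring activations.

    optimizer_bytes=12 is an assumption: an fp32 master copy of the weights
    (4 bytes) plus two fp32 Adam moments (4 + 4 bytes) per parameter.
    """
    gb = 1e9
    return {
        "weights": n_params * weight_bytes / gb,
        "gradients": n_params * grad_bytes / gb,
        "optimizer": n_params * optimizer_bytes / gb,
    }

breakdown = full_ft_memory_gb(7e9)
total_gb = sum(breakdown.values())
print(breakdown, f"total ~ {total_gb:.0f} GB")
```

Under these assumptions the optimizer state is the largest single term, which is why methods that shrink the set of trainable parameters save so much memory.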

LoRA (Low-Rank Adaptation) freezes the base model and trains small low-rank matrices that sit alongside the original weights. You typically update less than 1% of the parameters, can often train on a single GPU, and produce adapter files of tens to a few hundred megabytes, depending on rank and how many layers you adapt. Multiple LoRA adapters can share one base model in memory, which is great for serving many fine-tunes cheaply.
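
To make the "less than 1%" figure concrete, here is a small parameter-count sketch. The model shape is hypothetical (32 layers, hidden size 4096, two square projections adapted per layer) and is only meant to resemble a 7B-class model; real architectures differ in the exact layer shapes.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    The frozen weight W is untouched; LoRA learns B (d_out x r) and
    A (r x d_in) and applies W + (alpha / r) * B @ A.
    """
    return r * (d_in + d_out)

# Hypothetical 7B-class shape, not taken from any specific model card.
hidden, layers, r = 4096, 32, 16
trainable = layers * 2 * lora_params(hidden, hidden, r)
fraction = trainable / 7e9

print(f"{trainable:,} trainable parameters ({fraction:.2%} of 7B)")
```

Adapting more modules or raising r grows the count linearly, which is the main lever on adapter file size.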

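QLoRA combines LoRA with 4-bit quantization of the base model. It lets you fine-tune models of around 13B on a single 24 GB GPU and, with short sequences and small batches, models in the 65B to 70B range on a single 48 GB GPU. Training is somewhat slower per step because the quantized weights are dequantized on the fly, but the memory savings are huge. For most teams, QLoRA is the default starting point.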
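
The memory saving is mostly arithmetic on the frozen weights. The sketch below compares 16-bit and 4-bit storage for the base model only; it is an approximation that ignores quantization constants, adapter weights, activations, and optimizer state, and the helper name is illustrative.

```python
def weight_memory_gb(n_params, bits):
    """Memory (GB) to hold the base model weights alone at a given bit width."""
    return n_params * bits / 8 / 1e9

sizes = {"13B": 13e9, "70B": 70e9}
for name, n_params in sizes.items():
    fp16 = weight_memory_gb(n_params, 16)
    four_bit = weight_memory_gb(n_params, 4)
    print(f"{name}: 16-bit ~ {fp16:.1f} GB, 4-bit ~ {four_bit:.1f} GB")
```

The 4-bit figures leave headroom on a 24 GB or 48 GB card for everything the estimate leaves out, which is what makes single-GPU training plausible.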

Data is the whole game

Models will copy whatever style your data has, including the mistakes. A thousand high-quality examples almost always beat a hundred thousand noisy ones. Focus on three things: format consistency, coverage of edge cases, and clean separation of input and output. For instruction tuning, every example should look like the kind of message-response pair you will see in production.

A simple JSONL schema works well:

import json

examples = [
    {"messages": [
        {"role": "system", "content": "You triage support tickets."},
        {"role": "user", "content": "My invoice shows the wrong tax rate."},
        {"role": "assistant", "content": "Category: billing\nPriority: medium"},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
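
Before training, it can help to check every line of the file against that schema. The validator below is a minimal sketch: the allowed roles and the rule that each example ends with an assistant turn are assumptions drawn from the example above, not a fixed standard.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(ex):
    """Return a list of problems found in one chat-format example (empty if clean)."""
    messages = ex.get("messages") if isinstance(ex, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, m in enumerate(messages):
        if not isinstance(m, dict):
            problems.append(f"message {i}: not an object")
            continue
        if m.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {m.get('role')!r}")
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            problems.append(f"message {i}: empty content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems

def validate_jsonl(lines):
    """Map 1-based line numbers to their problems for an iterable of JSONL lines."""
    report = {}
    for n, line in enumerate(lines, start=1):
        try:
            problems = validate_example(json.loads(line))
        except json.JSONDecodeError as err:
            problems = [f"invalid JSON: {err.msg}"]
        if problems:
            report[n] = problems
    return report

sample = [
    '{"messages": [{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Category: other\\nPriority: low"}]}',
    '{"messages": [{"role": "user", "content": ""}]}',
]
report = validate_jsonl(sample)
print(report)
```

In practice you would pass an open file handle for train.jsonl instead of the in-memory sample.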

Hold out at least 10% as a validation set and keep a separate held-out test set you never look at during tuning.
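
One way to make the split reproducible is to shuffle once with a fixed seed and slice. In the sketch below the 10% test fraction and the seed are assumptions chosen for illustration; only the 10% validation floor comes from the advice above.

```python
import random

def split_examples(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_frac))
    n_val = max(1, int(len(items) * val_frac))
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# Integers stand in for real examples here.
train, val, test = split_examples(range(1000))
print(len(train), len(val), len(test))
```

Write the test slice to its own file right away and leave it alone until the final evaluation.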

A minimal LoRA training loop

The Hugging Face transformers, peft, and datasets libraries make this approachable; loading the base model in 4-bit also requires bitsandbytes and a CUDA GPU. The snippet below is the skeleton; in real projects you would add gradient accumulation, evaluation callbacks, and proper logging.

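import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForSeq2Seq, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

model_id = "meta-llama/Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # padding is needed to batch variable-length examples

# Load the frozen base model in 4-bit, then prepare it for k-bit training
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def tokenize(ex):
    # Flatten each conversation to plain text; use tok.apply_chat_template
    # instead if your tokenizer defines a chat template
    text = "\n".join(f"{m['role']}: {m['content']}" for m in ex["messages"])
    enc = tok(text + tok.eos_token, truncation=True, max_length=1024)
    enc["labels"] = enc["input_ids"].copy()  # causal LM: labels mirror the inputs
    return enc

ds = load_dataset("json", data_files="train.jsonl")["train"]
ds = ds.map(tokenize, remove_columns=ds.column_names)

args = TrainingArguments(
    output_dir="./out", per_device_train_batch_size=4,
    num_train_epochs=3, learning_rate=2e-4, bf16=True, logging_steps=10,
)
# Pads input_ids with the pad token and labels with -100 so padding is ignored in the loss
collator = DataCollatorForSeq2Seq(tok, padding=True)
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
model.save_pretrained("./adapter")  # writes only the LoRA adapter weights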

Hyperparameters that actually matter

Learning rate has the largest effect. For LoRA, values around 1e-4 to 3e-4 are normal; full fine-tuning wants something an order of magnitude smaller. Train for a small number of epochs, usually one to three, and watch validation loss. If training loss keeps falling while validation loss flattens or rises, or generations turn repetitive, you are overfitting and should stop.
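
The stopping rule can be written down as a simple patience check on validation loss. This is a framework-free sketch with illustrative names and defaults; transformers also ships an EarlyStoppingCallback that plays a similar role inside Trainer.

```python
def should_stop(val_losses, patience=2, min_delta=0.0):
    """True when validation loss has not improved for `patience` evaluations.

    "Improved" means the best of the last `patience` losses is lower than the
    best earlier loss by more than `min_delta`.
    """
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# Hypothetical loss curves, one value per evaluation.
still_improving = [1.00, 0.80, 0.70, 0.65]
overfitting = [1.00, 0.80, 0.70, 0.72, 0.75]
print(should_stop(still_improving), should_stop(overfitting))
```

Keep the checkpoint from the best evaluation, not the last one, when the check fires.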

Rank r sets the capacity of the adapter, and lora_alpha scales its update (the effective scale is lora_alpha / r). Start with r=8 or r=16 and lora_alpha = 2*r. Increase r only if validation loss plateaus too early.

Evaluate like you mean it

Loss numbers are deceptive. Build a small evaluation harness with real prompts from your domain and grade outputs on the criteria you care about: format compliance, factual accuracy, tone, refusal behavior. For classification tasks, compute accuracy or F1. For generation, use an LLM judge with a fixed rubric, or ask humans to rate side-by-side against the base model.
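
For a classification-style task, such a harness can be very small. The sketch below scores format compliance, accuracy, and macro F1 for the ticket-triage format used earlier; the regular expression, the allowed priority values, and the sample outputs are all assumptions made for illustration.

```python
import re

# Assumed output contract: "Category: <label>" then "Priority: <low|medium|high>".
FORMAT = re.compile(r"Category: (\w+)\nPriority: (low|medium|high)")

def parse_category(output):
    """Return the predicted category, or None if the output breaks the format."""
    match = FORMAT.fullmatch(output.strip())
    return match.group(1) if match else None

def macro_f1(preds, golds):
    """Unweighted mean of per-label F1 over the labels present in golds."""
    scores = []
    for label in sorted(set(golds)):
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def evaluate(outputs, golds):
    """Score raw model outputs against gold category labels."""
    preds = [parse_category(o) for o in outputs]
    n = len(golds)
    return {
        "format_compliance": sum(p is not None for p in preds) / n,
        "accuracy": sum(p == g for p, g in zip(preds, golds)) / n,
        "macro_f1": macro_f1(preds, golds),
    }

# Hypothetical model outputs: one correct, one wrong label, one off-format.
outputs = [
    "Category: billing\nPriority: medium",
    "Category: billing\nPriority: high",
    "Sure! This looks like a billing issue.",
]
golds = ["billing", "account", "billing"]
report = evaluate(outputs, golds)
print(report)
```

Off-format outputs count as wrong predictions here, so format failures show up in accuracy as well as in the compliance rate.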

Compare against three baselines: the base model with a good prompt, the base model with few-shot examples, and a RAG version. If your fine-tune is not clearly better than all three, do not ship it.
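
That ship rule can be encoded as a one-line gate over evaluation scores. In the sketch below the margin and all scores are placeholder values, not measurements; pick a margin that reflects the noise in your own evaluation set.

```python
def clearly_better(candidate, baselines, margin=0.02):
    """True only if the candidate beats every baseline score by at least `margin`."""
    return all(candidate - score >= margin for score in baselines.values())

# Placeholder scores on the same evaluation set (higher is better).
baselines = {"prompt_only": 0.81, "few_shot": 0.84, "rag": 0.79}
print(clearly_better(0.90, baselines), clearly_better(0.85, baselines))
```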

A reasonable workflow

Start with a strong prompt. If that is not enough, add few-shot examples. If you need facts, add RAG. Only then consider fine-tuning, and start with QLoRA on a small high-quality dataset. Measure before and after on a real evaluation set. Ship when the win is clear, and keep the adapter files versioned so you can roll back. Fine-tuning is a tool, not a destination.