RLHF: Reinforcement Learning from Human Feedback

Intermediate 11 min read

What you'll learn

✓ Why pretrained LLMs need alignment
✓ The three-stage RLHF pipeline
✓ How reward models are trained from preferences
✓ The role of PPO and the KL penalty
✓ Trade-offs and modern alternatives

Prerequisites

• Familiarity with LLM training basics

Reinforcement learning from human feedback, or RLHF, is the technique that turned raw language models into the assistants people actually use. It is not magic and it is not a single algorithm. It is a pipeline. This post walks through how that pipeline works.

What and Why

A pretrained language model is trained to predict the next token on a large text corpus. Given a prompt, it continues in whatever style the corpus suggests, which is often unhelpful, verbose, or unsafe. Pretraining optimizes for likelihood, not helpfulness.

RLHF reshapes the model’s behavior using human preferences as the supervision signal. Instead of asking annotators to write ideal responses (expensive and inconsistent), you ask them to compare pairs of model outputs and pick the better one (cheaper and usually more consistent). Those comparisons train a reward model that the language model is then optimized against.

The goal is alignment with what humans actually want, captured implicitly through preference judgments.

Mental Model

RLHF has three stages. First, supervised fine-tuning on a small set of high-quality demonstrations teaches the model the format and tone of good responses. Second, a reward model is trained on human preference pairs to predict which of two completions a human would prefer. Third, the language model is fine-tuned with reinforcement learning to maximize that reward, with a KL penalty that keeps it close to the supervised baseline.

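The reward model is the workhorse. It compresses thousands of noisy preference labels into a single scalar score that can be computed for any prompt and response. The RL stage is where the policy actually changes: it treats that score as a reward to maximize rather than backpropagating through the reward model.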
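
In symbols, stages two and three can be written roughly as follows. The notation is an assumption introduced here for illustration: r_φ is the reward model, π_θ the policy being trained, π_SFT the supervised baseline, y_w and y_l the preferred and rejected completions for prompt x, σ the logistic function, and β the KL coefficient.

```latex
% Stage 2: reward model loss on a preference pair (y_w preferred over y_l)
\mathcal{L}_{\mathrm{RM}}(\phi)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]

% Stage 3: maximize reward while staying close to the SFT model
\max_{\theta}\;
  \mathbb{E}_{x}\Big[
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big]
    - \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \big)
  \Big]
```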

Hands-on Example

A simplified reward model training step looks like this:

import torch
import torch.nn.functional as F

def reward_loss(rm, prompt, chosen, rejected):
    r_chosen = rm(prompt, chosen)
    r_rejected = rm(prompt, rejected)
    # Bradley-Terry: maximize log sigmoid of the margin
    return -F.logsigmoid(r_chosen - r_rejected).mean()
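
To get a feel for the loss above, it can help to plug in a few numbers. The sketch below is plain Python on scalar rewards rather than tensors; the helper names `log_sigmoid` and `pairwise_loss` are made up for this illustration and are not part of any library.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) for a single float."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood for one preference pair."""
    return -log_sigmoid(r_chosen - r_rejected)

# Loss as a function of the reward margin (chosen minus rejected).
for margin in (-2.0, 0.0, 2.0):
    p_chosen = math.exp(log_sigmoid(margin))
    print(f"margin={margin:+.1f}  p(chosen)={p_chosen:.3f}  loss={pairwise_loss(margin, 0.0):.3f}")
```

A zero margin gives a loss of ln 2 (the reward model is guessing), a positive margin drives the loss toward zero, and a negative margin is penalized roughly linearly.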

Once the reward model is trained, the policy is updated with Proximal Policy Optimization (PPO):

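def ppo_step(policy, ref_policy, rm, prompt, beta=0.1):
    # beta is the KL coefficient; the 0.1 default is illustrative only
    response = policy.generate(prompt)
    # Scalar score from the trained reward model
    reward = rm(prompt, response)
    # compute_kl is a helper (not shown) that estimates KL(policy || ref_policy) on this response
    kl = compute_kl(policy, ref_policy, prompt, response)
    # Penalized objective: high reward, but close to the reference model
    objective = reward - beta * kl
    # PPO uses clipped policy gradient on this objective
    ...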

The beta * kl term is essential. Without it, the policy can drift into degenerate text that exploits the reward model.
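
The two pieces the pseudocode leaves abstract, the KL term and the clipped update, can be sketched on plain numbers. This is a minimal illustration, not a full PPO implementation: it assumes per-token log-probabilities of the sampled response are already available from both models, uses the common single-sample KL estimate (sum of log-probability differences), and all function names are hypothetical.

```python
import math

def sequence_kl_estimate(policy_logprobs, ref_logprobs):
    """Single-sample KL estimate: sum over tokens of log pi(y_t) - log pi_ref(y_t)."""
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

def penalized_reward(reward, policy_logprobs, ref_logprobs, beta):
    """Reward model score minus the KL penalty, as in `reward - beta * kl`."""
    return reward - beta * sequence_kl_estimate(policy_logprobs, ref_logprobs)

def clipped_surrogate(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """PPO clipped objective for one action (to be maximized)."""
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

policy_lp = [-1.0, -0.5, -2.0]   # log-probs of the sampled tokens under the policy
ref_lp = [-1.2, -0.9, -2.1]      # log-probs of the same tokens under the reference
print(penalized_reward(1.5, policy_lp, ref_lp, beta=0.1))   # about 1.43
print(clipped_surrogate(-0.5, -1.0, advantage=1.0))          # ratio is capped at 1.2
```

The clip means a single update cannot take credit for moving the probability ratio far beyond 1 ± clip_eps, which complements the KL penalty: one limits the step size per update, the other limits total drift from the reference.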

stage 1: supervised fine-tuning
 demonstrations -> base LLM -> SFT model

stage 2: reward model training
 prompts -> SFT model -> pairs of completions
 humans rank pairs
 pairs -> reward model (BT loss)

stage 3: RL optimization (PPO)
 prompt
   |
   v
 policy ----generate----> response
   |                         |
   v                         v
 KL vs SFT          reward model score
   \                       /
    +-- objective: reward - beta * KL --+
                     |
                     v
                PPO update on policy

Three-stage RLHF pipeline

The SFT model serves a double role: it is the starting point for the policy and the reference distribution for the KL penalty.

Trade-offs

RLHF is expensive. You need a high-quality preference dataset, infrastructure to run the reward model and policy together, and careful tuning of PPO hyperparameters. A small mistake in the reward model is amplified by the RL loop into clearly bad behavior.

Reward hacking is the central failure mode. The policy finds completions that the reward model loves but humans hate. Repetitive flattery, hedged refusals, or stylistic tricks are common symptoms. The KL penalty mitigates this but does not eliminate it.

Newer methods like Direct Preference Optimization (DPO) skip the explicit reward model and the PPO loop. DPO rewrites the KL-regularized RL objective as a classification-style loss computed directly on preference pairs, so the policy itself acts as an implicit reward model. It is simpler, tends to be more stable, and often matches the quality of PPO-based RLHF, which is why many recent open models use it. Constitutional AI and RLAIF use AI-generated preferences instead of human labels to scale faster.
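
For comparison with the reward model loss earlier, here is a minimal sketch of the DPO loss for a single preference pair. It assumes the summed log-probabilities of each full response under the policy and under a frozen reference model have already been computed; the function and argument names are hypothetical, and the beta value is illustrative.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) for a single float."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one pair, from sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_shift = policy_chosen_lp - ref_chosen_lp
    rejected_shift = policy_rejected_lp - ref_rejected_lp
    # Same Bradley-Terry form as the reward model loss, with the
    # scaled log-ratio playing the role of the reward.
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))

# Policy identical to the reference: no preference expressed yet.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has moved toward the chosen response and away from the rejected one.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The first call returns ln 2, the same "coin flip" value as an untrained reward model; the second is lower because the policy has shifted probability toward the chosen response relative to the reference.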

Practical Tips

Invest in preference data quality. Inter-annotator agreement often matters more than volume. A clean 30k dataset can beat a noisy 300k one.
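
One way to put a number on inter-annotator agreement is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below assumes two annotators labeled the same pairs with "A" or "B" (which completion they preferred); the data is invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
annotator_2 = ["A", "B", "B", "B", "A", "B", "A", "B"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.3f}")
```

Here the annotators agree on 6 of 8 pairs (75%), but kappa is only about 0.53 once chance agreement is removed. Tracking a statistic like this per annotator or per prompt category is one way to find where labeling guidelines are unclear.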

Train the reward model on a diverse prompt distribution that matches what users will actually send. A reward model trained only on academic questions will not align a general assistant well.

Monitor the KL divergence during PPO. If it spikes, the policy is drifting fast and likely reward hacking. Lower the learning rate or raise the KL coefficient.
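
Raising the KL coefficient can also be automated. The sketch below shows one commonly used proportional scheme: nudge beta up when the observed KL is above a target and down when it is below, with the error clipped so a single noisy batch cannot swing it much. The target, gain, and clip values are illustrative assumptions, not recommendations.

```python
def update_kl_coefficient(beta, observed_kl, target_kl, gain=0.1, max_error=0.2):
    """Return an adjusted KL coefficient given the latest observed KL."""
    # Relative error: positive when the policy has drifted past the target.
    error = observed_kl / target_kl - 1.0
    # Clip so that one outlier batch only changes beta by a bounded factor.
    error = max(-max_error, min(max_error, error))
    return beta * (1.0 + gain * error)

beta = 0.1
for step, observed_kl in enumerate([6.0, 9.0, 15.0, 4.0]):
    beta = update_kl_coefficient(beta, observed_kl, target_kl=6.0)
    print(f"step={step}  kl={observed_kl:5.1f}  beta={beta:.5f}")
```

With these settings beta stays put when KL is on target, grows by at most 2% per step when KL overshoots, and shrinks by at most 2% when KL is well under target.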

Hold out a set of evaluation prompts with diverse styles and run them through the policy at every checkpoint. Automated metrics miss a lot, so reading the outputs yourself is the most reliable way to catch subtle alignment regressions.

Consider DPO before full RLHF for a first pass. It is simpler to implement, easier to debug, and often good enough.

Wrap-up

RLHF is the bridge between language modeling and useful assistants. The three-stage pipeline is conceptually simple, but each stage has its own pitfalls and tuning surface. Reward hacking, distribution shift, and preference data quality are where projects live or die. Whether you use classic PPO or a newer method like DPO, the core idea is the same: turn human judgments into a signal a model can learn from, then walk carefully toward it.