SFT taught it how. Preference optimization teaches it what.

Think first
After SFT, your model will dutifully write a 10,000-word essay, or confidently defend a flat-earth theory, or answer "How do I pick a lock?" with detailed instructions. The capability is there, but the judgment is not. How do you teach judgment without writing a rulebook?

You show the model many pairs of "better response, worse response" and teach it to prefer the better one. Not "here is exactly what to say" (SFT) but "out of these two, which would a human pick?". This is preference optimization, and the modern recipe is DPO, with PPO-based RLHF as its older cousin and GRPO with verifiable rewards as its newer cousin.

Preference Data: The Triplet

Preference data is different from SFT data. Each example is a triplet:

(prompt, chosen_response, rejected_response)

Three sources:

  1. Human annotation. Hire humans to compare two outputs and pick the better. Accurate but expensive and slow.
  2. AI feedback (RLAIF). Use a strong model (GPT-4, Claude) as the judge. Cheaper, scalable, risks judge bias.
  3. Verifiable rewards. For math/code, skip human judgment entirely: did the answer match? Did the unit tests pass? DeepSeek-R1 used this extensively.
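Concretely, one preference example is just three strings. A sketch of what a single record might look like (field names follow the common `prompt`/`chosen`/`rejected` convention used by libraries like TRL; the content is illustrative):

```python
# One preference triplet, e.g. a single line of a JSONL dataset
example = {
    "prompt": "How do I pick a lock?",
    "chosen": "I can't help with bypassing locks you don't own. If you're "
              "locked out of your own property, a licensed locksmith can help.",
    "rejected": "Step 1: insert a tension wrench into the keyway...",
}

assert set(example) == {"prompt", "chosen", "rejected"}
```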

The Original Recipe: RLHF with PPO

The 2022-2023 standard. Three stages:

  1. SFT: start with an instruction-tuned model.
  2. Train a reward model to predict human preferences using the Bradley-Terry loss.
  3. PPO: policy generates responses, reward model scores them, PPO updates the policy to maximize reward -- with a KL penalty toward the SFT model to prevent drift.
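Stage 2 is essentially pairwise logistic regression. A minimal sketch of the Bradley-Terry loss in plain Python, where `r_chosen` and `r_rejected` stand in for the reward model's scalar scores (illustrative, not the production objective):

```python
import math

def bradley_terry_nll(r_chosen, r_rejected):
    # Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    # The reward model is trained to minimize this negative log-likelihood.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reward model that scores the chosen response higher gets low loss:
print(bradley_terry_nll(2.0, -1.0))
# One that prefers the rejected response gets high loss:
print(bradley_terry_nll(-1.0, 2.0))
```

When the two scores tie, the loss is exactly log(2): the model is indifferent, and the gradient pushes the scores apart.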
Interactive: RLHF Pipeline Stepper

Click through the classic PPO-based RLHF pipeline.

PPO is powerful but notoriously painful: four models in memory (policy, reference, reward, value), high variance, tricky hyperparameters, easy to diverge. This is why DPO became popular.

DPO: Skip the Reward Model

Rafailov et al. (2023) proved a clever mathematical equivalence: you don't need a separate reward model at all. The preference signal can be extracted directly from the policy itself, by comparing its current probability ratios against a frozen reference model.

For each (prompt, chosen, rejected) triplet:

  1. Compute log-prob of chosen under the trained model and the reference model. Call the difference r_chosen (implicit reward for chosen).
  2. Compute the same for rejected. Call it r_rejected.
  3. Apply the Bradley-Terry model: loss = -log sigmoid(beta * (r_chosen - r_rejected)).
  4. Minimize this loss. Gradient pushes the trained model to increase log-prob of chosen and decrease log-prob of rejected.
# DPO loss (simplified)
pi_chosen    = policy.logp(chosen)
ref_chosen   = reference.logp(chosen)
pi_rejected  = policy.logp(rejected)
ref_rejected = reference.logp(rejected)

r_chosen   = pi_chosen   - ref_chosen      # implicit reward
r_rejected = pi_rejected - ref_rejected

loss = -log(sigmoid(beta * (r_chosen - r_rejected)))

The beta parameter controls how far the model can drift from the reference. High beta (0.5) = conservative, stays close. Low beta (0.01) = aggressive, allows more change. Default ~0.1.
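A runnable version of the pseudocode above, with made-up summed log-probs for one triplet to show how beta scales the margin (pure Python for clarity; real implementations batch this over tensors):

```python
import math

def dpo_loss(pi_chosen, ref_chosen, pi_rejected, ref_rejected, beta=0.1):
    # Implicit rewards: log-prob ratios of the policy vs the frozen reference
    r_chosen = pi_chosen - ref_chosen
    r_rejected = pi_rejected - ref_rejected
    margin = beta * (r_chosen - r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Hypothetical log-probs: the policy already favors the chosen response
# slightly more than the reference does, so the margin is positive.
for beta in (0.01, 0.1, 0.5):
    loss = dpo_loss(-12.0, -13.0, -10.0, -9.5, beta=beta)
    print(f"beta={beta}: loss={loss:.4f}")
```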

Why this works

The math shows that under the Bradley-Terry assumption, the optimal PPO solution with a KL penalty can be written in a closed form that depends only on the ratio between the policy and a fixed reference. That closed form becomes the DPO loss. You get the same theoretical guarantees as PPO-RLHF with dramatically simpler training.
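In symbols: the KL-regularized RLHF objective has the closed-form optimum below, which can be inverted to express the reward as a log-ratio (Z(x) is a normalizer that cancels when two responses to the same prompt are compared, which is why only the ratio survives in the DPO loss):

```latex
\pi^*(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)\,
  \exp\!\bigl(r(x, y)/\beta\bigr)
\quad\Longleftrightarrow\quad
r(x, y) \;=\; \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
  + \beta \log Z(x)
```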

DPO vs PPO Comparison

Interactive: DPO vs PPO

Click each row to see how the two methods differ.

Preference Comparison Demo

Interactive: Preference Comparison

Pick the better response for each prompt. See how aggregated preferences become training signal.

Beyond DPO: SimPO, KTO, ORPO, IPO

DPO inspired a family of variants, each addressing one of its limitations:

  - SimPO: drops the reference model entirely and length-normalizes the implicit reward, cutting memory and reducing length bias.
  - KTO: works with unpaired thumbs-up/thumbs-down labels instead of pairwise comparisons, which are cheaper to collect.
  - ORPO: folds preference optimization into the SFT stage via an odds-ratio penalty, so there is no separate alignment phase and no reference model.
  - IPO: replaces the log-sigmoid with a bounded squared objective to reduce DPO's tendency to overfit the preference pairs.

GRPO and the Rise of Verifiable Rewards

DeepSeek-R1 (January 2025) marked a shift. GRPO (Group Relative Policy Optimization, from the DeepSeekMath paper) eliminates the value network that PPO needs. For each prompt:

  1. Generate multiple responses (4-16) from the current policy
  2. Score each with a reward function
  3. Compute the group's mean reward as a baseline
  4. Advantage = response_reward - group_mean
  5. Update the policy with a clipped surrogate objective and KL regularization
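Steps 3-4 above fit in a few lines. A minimal sketch of the group-relative advantage (the DeepSeekMath paper also normalizes by the group's standard deviation, included here; illustrative only):

```python
def group_advantages(rewards, eps=1e-8):
    # Baseline = mean reward of the group sampled for one prompt.
    # Dividing by the std gives scale-free advantages; no value network needed.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 4 responses to one math prompt, scored 1.0 if the final answer matched:
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct responses get positive advantage, incorrect ones negative, and the advantages sum to zero across the group.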

The real insight was verifiable rewards: for math, check if the final answer matches. For code, run the unit tests. For format, check whether required tags were used. These deterministic rules can't be hacked the way neural reward models can.

R1-Zero showed that starting from just a base model (no SFT!), GRPO with verifiable rewards produced emergent chain-of-thought reasoning. The model learned to "think step by step" as an instrumental strategy for getting more correct answers.

# Verifiable reward for math
import re

def extract_boxed(response):
    # Pull the final answer out of \boxed{...} in the response
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    return m.group(1).strip() if m else None

def accuracy_reward(response, ground_truth):
    answer = extract_boxed(response)
    return 1.0 if answer == ground_truth else 0.0

# Format reward: encourages <think>...</think> structure
def format_reward(response):
    has_think_tags = "<think>" in response and "</think>" in response
    return 0.1 if has_think_tags else 0.0

The Alignment Tax

Preference optimization is not free. Typical costs: over-refusal of benign requests, flattened creativity and reduced response diversity, and degradation of capabilities unrelated to the preference data.

Mitigations: careful beta tuning, mixing SFT data into preference training, iterative online DPO (generate new pairs from the current policy), targeted rather than blanket safety training.

Evaluation hierarchy

RewardBench tests reward models directly. MT-Bench has GPT-4 score outputs on a 1-10 scale. AlpacaEval measures win rate against GPT-4 Turbo with length control. Arena-Hard uses hard prompts drawn from Chatbot Arena. The gold standard is still LMArena -- blind A/B tests with real users, with 5M+ votes collected.

Check Your Understanding

1. What does DPO eliminate compared to classic PPO-RLHF?
Correct: The separate reward model -- the preference signal is extracted directly from the policy via the Bradley-Terry closed form
2. What does the beta parameter in DPO control?
Correct: How much the trained model is allowed to drift from the reference model
3. Why do verifiable rewards (used by GRPO) resist reward hacking?
Correct: They are deterministic rules (answer match, unit test pass) with no gradient for the model to exploit
4. What is the Bradley-Terry model and why is it used?
Correct: A statistical model where P(A beats B) = sigmoid(r(A) - r(B)), used to turn pairwise preferences into a reward signal
5. What is the alignment tax?
Correct: The cost in capabilities (over-refusal, flattened creativity, capability degradation) that comes from heavy preference optimization

Teach It Back

Explain to a friend: how preference optimization differs from SFT, what DPO eliminates compared to PPO-based RLHF, and why DeepSeek-R1 moved toward GRPO with verifiable rewards. Cover the Bradley-Terry intuition, the role of the reference model and beta, and what the alignment tax costs.

An AI tutor will compare your explanation against the course material.


Flashcards (click to flip)

SFT vs preference optimization?
Click to reveal
SFT teaches how to respond by imitating a target response (cross-entropy on a fixed answer). Preference optimization teaches what to say by comparing chosen vs rejected responses. SFT = capability; preference opt = alignment.
Three RLHF stages (PPO)?
Click to reveal
1. Start from SFT model. 2. Train reward model on human preference pairs via Bradley-Terry. 3. PPO: policy generates, reward model scores, policy updated to increase reward with KL penalty toward SFT model to prevent drift and reward hacking.
DPO in one sentence?
Click to reveal
Directly minimize -log sigmoid(beta * (log p(chosen)/p_ref(chosen) - log p(rejected)/p_ref(rejected))). Extracts the preference signal from the policy itself, no separate reward model needed.
What does the beta in DPO do?
Click to reveal
Controls how far the trained model can drift from the reference model. High beta (0.5) = conservative, stays close. Low beta (0.01) = aggressive. Default ~0.1. Corresponds to the KL coefficient in the equivalent PPO formulation.
Why is DPO sensitive to learning rate?
Click to reveal
DPO typically uses 1e-7 to 1e-6, much smaller than SFT. Too high a learning rate causes the implicit reward signal to overshoot, making the model collapse to repetitive or degenerate outputs.
What are verifiable rewards?
Click to reveal
Deterministic reward functions that check correctness directly (math answer match, unit test pass, format match). Used by DeepSeek R1 with GRPO. Cannot be hacked like neural reward models because there is no gradient signal to exploit.