Why is a brilliant base model practically useless?

Think first
You've just spent 5 million dollars training a model on 500 billion tokens. It has memorized a huge slice of human knowledge. You ask it "What is the capital of France?" and it replies: "What is the capital of Germany? What is the capital of Italy? What is the capital of..." What is going wrong?

We trained the model on one objective: continue the text in a way that matches the training distribution. On the web, a line starting with "What is the capital of France?" is almost always part of a quiz, a geography worksheet, or a list. So the most likely continuation is more quiz questions, not an answer.

The base model is a brilliant scholar with zero social awareness. It can recall every encyclopedia entry it has ever seen, but it has no concept of being helpful. It does not know that your text is supposed to be an instruction rather than a prefix to continue.

Key Insight

Turning a base model into ChatGPT requires teaching it an entirely new skill that pretraining never touched: recognize which text is meant as an instruction, and produce the response that a helpful human would give. This transformation happens in two phases: Supervised Fine-Tuning (SFT) teaches obedience, and RLHF teaches discernment.

Base Model vs Assistant: A Side-by-Side

Interactive: Base vs Instruction-Tuned Response

Pick a prompt to see how the same architecture produces wildly different outputs depending on whether it has been instruction tuned.

Instruction Tuning (SFT): Teaching Obedience

The solution sounds obvious in hindsight: show the model many examples of instruction-following. Fine-tune on a dataset of (instruction, ideal response) pairs, using the same next-token prediction objective but on carefully curated data.

Three sources of instruction data:

  1. Human-written examples. Annotators write a prompt and an ideal answer by hand. High quality, expensive, slow.
  2. Dataset conversion. Rewrite existing academic datasets (translation, summarization, classification) into instruction format. Lots of data, can feel dry.
  3. Self-instruction. Use a powerful model to generate thousands more variations from a handful of seed examples. Scalable, risks model laziness or bias.
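Whatever the source, the pairs end up formatted as plain training text. A minimal sketch, assuming a simple "### Instruction / ### Response" template (illustrative only; each lab uses its own chat template):

```python
# Sketch: turning (instruction, response) pairs into SFT training text.
# The template markers below are made up for illustration.

def format_example(instruction: str, response: str) -> str:
    """Wrap a pair in a template so the model learns to recognize
    where the instruction ends and the answer begins."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
]
sft_corpus = [format_example(i, r) for i, r in pairs]
```

The model is then trained with the usual next-token objective on this corpus; the template itself is what teaches it that text after the instruction marker should be an answer, not more questions.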

What actually changes during SFT

Instruction tuning does not teach new facts. It teaches the model to use what it already knows in a new way. It is like teaching a fluent speaker how to be a teacher -- the language is already there; now they are learning how to explain and answer.

# Scale comparison
Pretraining:  ~500B tokens, LR ~1e-4
SFT:          ~100M tokens, LR ~1e-5   (5000x less data, 10x lower LR)

Analyses of how parameters move during fine-tuning show that the largest shifts happen in the later layers (which shape how knowledge is surfaced), while earlier layers (which hold basic language knowledge) barely move.

Misconception alert

"Instruction tuning teaches the model new information." It doesn't. If the base model had no idea who a historical figure is, SFT will not magically inject that knowledge. It only rewires how existing knowledge is accessed. New facts come from continued pretraining, not SFT.

The Limits of Instruction Tuning

After SFT the model will follow almost any instruction, literally. That's the problem. It will comply with harmful requests as readily as benign ones, answer confidently when it should admit uncertainty, and never push back on a flawed premise.

SFT teaches obedience, not discernment. The model does not yet understand when to refuse, when to push back, or when to admit uncertainty. To get that, we need a different kind of feedback: not "here is exactly what to say" but "out of these two responses, which is better?"

RLHF: The Three-Stage Pipeline

Reinforcement Learning from Human Feedback teaches the model what humans prefer, not just what they literally asked for.

Interactive: RLHF Pipeline Stepper

Click through the three stages. Each builds on the previous.

Stage 1: Collecting Human Preferences

Generate multiple responses to the same prompt (typically 2-4). A human judge ranks them from best to worst. Humans are ranking for helpfulness, safety, and tone all at once -- they don't isolate one dimension.

Prompt: "Can you help me with my homework?"
  Response A: [Does the homework completely]
  Response B: [Refuses, says "cheating is bad"]
  Response C: [Explains the concept, guides the student to the answer]

Human ranking: C > A > B

Thousands to millions of these comparisons form the preference dataset. Modern labs supplement human annotation with AI judges (RLAIF) and with automatic graders for verifiable tasks (did the code compile? did the math answer match?).
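A ranking like C > A > B expands into pairwise (winner, loser) comparisons, which is the format the reward model consumes. A minimal sketch (the helper name is ours):

```python
from itertools import combinations

def ranking_to_pairs(ranked: list[str]) -> list[tuple[str, str]]:
    """Expand a best-to-worst ranking into (winner, loser) pairs:
    every response beats every response ranked below it."""
    return [(winner, loser) for winner, loser in combinations(ranked, 2)]

# The homework example, ranked C > A > B:
pairs = ranking_to_pairs(["C", "A", "B"])
# yields (C, A), (C, B), (A, B)
```

Note that one ranking of n responses yields n(n-1)/2 comparisons, which is part of why ranking is a more data-efficient annotation format than scoring responses one at a time.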

Stage 2: Training a Reward Model

We can't have humans grade every response the model ever produces during training -- it would take years. So we train a separate neural network, the reward model, to mimic human judgment.

The reward model is typically smaller than the policy (e.g., 6B for a 175B policy) because it only needs to judge, not generate.

Mathematical detail

The Bradley-Terry model is used: P(A beats B) = sigmoid(r(A) - r(B)). The reward model is trained by maximizing the log-likelihood of the observed rankings. Reward is only defined up to an additive constant -- only differences matter.
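The training signal can be written in a few lines. A sketch of the per-pair Bradley-Terry loss, using illustrative scalar rewards (a real reward model produces these from the response text):

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of P(chosen beats rejected)
    = -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Only the difference matters: shifting both rewards by a constant
# leaves the loss unchanged, so reward has no absolute scale.
```

Minimizing this loss pushes the score of the human-preferred response above the rejected one; the invariance to additive constants is exactly the "reward is only defined up to a constant" point above.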

Stage 3: Optimization with PPO (and a KL Penalty)

Now the language model generates responses, the reward model grades them, and we use reinforcement learning to push the LM toward high-reward outputs. The algorithm of choice: Proximal Policy Optimization (PPO).

Two safety mechanisms prevent the model from going off the rails:

  1. PPO clipping. "Proximal" means "close by". PPO limits how much the model's output probabilities can shift in a single update. It is a speed limiter that forces many small steps instead of a few huge leaps.
  2. KL penalty toward the reference model. We measure the KL divergence between the current policy and a frozen copy of the SFT model. If the current model drifts too far, we pay a penalty. This prevents the model from losing its original language abilities by overfitting to the reward model.
Total score = Reward(response) - beta * KL(policy || ref_policy)
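Both mechanisms fit in a few lines. A sketch under simplifying assumptions: the KL term is estimated as the sum of per-token log-prob differences between policy and frozen reference, and the function names and beta value are ours:

```python
def shaped_reward(reward: float,
                  policy_logprobs: list[float],
                  ref_logprobs: list[float],
                  beta: float = 0.1) -> float:
    """Total score = reward - beta * KL(policy || ref), with KL
    estimated from per-token log-probs of the sampled response."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

def ppo_clip_ratio(ratio: float, eps: float = 0.2) -> float:
    """PPO's speed limiter: the new/old policy probability ratio is
    clipped to [1 - eps, 1 + eps] in the surrogate objective."""
    return max(1.0 - eps, min(1.0 + eps, ratio))

# A policy that exactly matches the reference pays no KL penalty;
# one that assigns its tokens higher probability than the reference
# (positive log-prob gap) gets its reward reduced.
```

The two knobs, beta and eps, control the same trade-off at different granularities: eps limits how far any single update can move, while beta limits how far the policy can drift in total.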

Without the KL term, the model quickly learns to exploit the reward model -- a phenomenon called reward hacking. It might find that every response starting with "I understand your concern and..." gets a higher score, then emit that phrase constantly.

Real-world failure

Early RLHF experiments at OpenAI produced models that compulsively hedged and over-apologized, because hedging phrases happened to score well on the reward model. The fix was not to train a better reward model; it was to tune the KL penalty more carefully and iterate.

The HHH Criteria: Helpful, Honest, Harmless

Through RLHF, the model learns to balance three often-conflicting goals, known as the HHH criteria.

Interactive: HHH Trade-off Explorer

Pick an example prompt and see how each dimension pulls on the ideal response.

When ChatGPT refuses a request, that behavior is not hardcoded. It emerged because human raters consistently preferred a polite refusal over a dangerous compliance during RLHF training. The model learned refusal as the optimal response strategy for that cluster of prompts.

Check Your Understanding

1. Why can't a base model follow instructions well?
Correct: It was only trained to predict next tokens in web text, so instructions look like prefixes to continue
2. What does instruction tuning (SFT) actually change about the model?
Correct: It reshapes how the model surfaces existing knowledge, mostly in later layers
3. Why do we train a reward model instead of using human scores directly during RLHF?
Correct: The reward model is automated and fast, letting us grade millions of responses
4. What does the KL penalty in PPO prevent?
Correct: The model drifting too far from the SFT model and losing language ability / exploiting the reward model
5. What are the HHH criteria that RLHF tries to balance?
Correct: Helpful, Honest, Harmless

Teach It Back

Explain to a friend: Why is a freshly pretrained LLM unhelpful, and how do instruction tuning and RLHF turn it into an assistant? Cover the three RLHF stages, the role of the KL penalty, and what changes (and what doesn't) in the model's weights.

An AI tutor will compare your explanation against the course material.


Flashcards (click to flip)

What is the base-model failure mode?
Click to reveal
It was trained to continue web text. An instruction looks like a prefix to continue, so it tends to produce more instructions or unrelated continuations instead of answering.
What does SFT (instruction tuning) teach the model?
Click to reveal
It teaches the model to treat input as an instruction and produce a helpful response. It does not teach new facts -- it reshapes access to knowledge that is already in the weights, mostly by updating later layers.
What are the three RLHF stages?
Click to reveal
1. Collect human preference rankings of model outputs. 2. Train a reward model to predict those rankings. 3. Use PPO to update the policy to maximize the reward model's score, with a KL penalty toward the SFT model to prevent drift.
What is reward hacking and how do we prevent it?
Click to reveal
The model finds patterns that exploit the reward model (e.g., always starting with "I understand your concern"). Prevented by the KL penalty toward the reference model and by PPO's clipping, which forces small updates.
What are the HHH criteria?
Click to reveal
Helpful (addresses user intent), Honest (does not fabricate facts, admits uncertainty), Harmless (refuses dangerous requests). RLHF teaches the model to balance all three simultaneously.
Why is SFT learning rate lower than pretraining?
Click to reveal
About 10x lower (1e-5 vs 1e-4). We want to gently adjust the model's behavior without disrupting the vast world knowledge stored in the weights during pretraining.