Meta's LIMA paper (2023) fine-tuned a base model on only 1,000 examples — and beat models trained on millions. If that's true, what does it imply about where a language model's actual knowledge lives, and what SFT is really teaching?
The LIMA paper found that a 65B LLaMA base model, fine-tuned on just 1,000 carefully curated prompts and responses, matched or beat models trained on millions of lower-quality examples. This is the Superficial Alignment Hypothesis:
LIMA's claim
Nearly all of the model's knowledge and capabilities are learned during pretraining. SFT is mostly teaching the model the format and style of responding — which subset of its distribution to sample from when a user message arrives.
The implication is drastic: quality >> quantity for SFT data. A handful of careful, diverse, high-quality examples can unlock an enormous amount of pretrained behavior. This is why modern open-source projects (Tulu 3, LLaMA 3 SFT) spend more effort on decontamination, filtering, and verification than on raw data collection.
Analogy: SFT is like teaching a polyglot polymath to write a business email. They already know the vocabulary, grammar, facts, and social conventions; you only need to show them 20 examples of the specific register and greeting, and they've got it.
Concept 2 of 8
Instruction Masking: Where the Loss Actually Goes
Here's the single most important mechanical detail of SFT. Pretraining computes its loss on every token. SFT does not.
During SFT, we set the label of every instruction/prompt token to -100, a sentinel that tells PyTorch's cross-entropy loss "skip this position". Gradients only flow from the assistant's response tokens. This is called instruction masking or loss masking.
Interactive: What the Loss Sees
Green tokens contribute to the loss. Red-dashed tokens are masked out (label = -100). Toggle to see why this matters.
With masking ON (default in DataCollatorForCompletionOnlyLM), 100% of the gradient signal teaches the model how to respond, not how to regurgitate the prompt.
The math: for a sequence whose response-token positions form the set R,
L_SFT = -(1/|R|) · Σ_{i ∈ R} log P(tok_i | tok_<i)
Without masking, gradient capacity would be wasted on "memorize the user's phrasing" — which is not useful and can actively hurt, since it teaches the model to parrot prompts.
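A minimal PyTorch sketch of the mechanic: `F.cross_entropy` skips any position whose label equals `ignore_index` (-100), so masking the prompt is purely a matter of how you build the labels tensor. The sequence length and vocabulary size here are toy values.

```python
import torch
import torch.nn.functional as F

# Toy setup: vocabulary of 10 tokens, sequence of 6 positions where the
# first 3 are prompt tokens (masked) and the last 3 are response tokens.
logits = torch.randn(6, 10)                     # [seq_len, vocab]
labels = torch.tensor([-100, -100, -100, 4, 7, 2])

# cross_entropy ignores positions labeled -100, so only the 3 response
# positions contribute to the loss (and to its mean).
masked_loss = F.cross_entropy(logits, labels, ignore_index=-100)

# Equivalent manual computation over response positions only:
resp = labels != -100
manual = F.cross_entropy(logits[resp], labels[resp])
assert torch.allclose(masked_loss, manual)
```

Note that the mean is taken over the |R| = 3 response positions only, which is exactly the 1/|R| normalization in the formula above.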
Why not pretraining style?
In pretraining, every token is useful signal: it's all raw text. In SFT, the prompt is input you already have; only the response is behavior you want to learn. Masking focuses 100% of the tiny learning-rate budget (~1e-5) on the thing that matters.
Concept 3 of 8
Data Formats: Alpaca and ShareGPT
Two formats dominate the open-source ecosystem:
Alpaca (single-turn)
{
"instruction": "Give three tips for staying healthy.",
"input": "",
"output": "1. Eat a balanced diet rich in fruits and vegetables.\n2. Exercise regularly...\n3. Get 7-9 hours of sleep."
}
Three fields: instruction, input (optional context, e.g., a paragraph to summarize), output. Simple, but can't represent conversations.
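In practice these records are rendered into a single training string. A sketch using the prompt template published with the Stanford Alpaca repo (the field names match the example above):

```python
def alpaca_to_prompt(example: dict) -> str:
    """Render an Alpaca-format record into one training string,
    following the template from the Stanford Alpaca release."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

record = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet...",
}
print(alpaca_to_prompt(record))
```

With instruction masking, everything up to and including "### Response:\n" would get label -100; only the output tokens carry loss.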
ShareGPT (multi-turn)
[
{"from": "human", "value": "Explain the difference between DPO and PPO."},
{"from": "gpt", "value": "<thinking>The user wants a technical comparison...</thinking> DPO eliminates the reward model by..."},
{"from": "human", "value": "Which is more stable in practice?"},
{"from": "gpt", "value": "DPO is significantly more stable because..."}
]
Can represent arbitrary conversation length, and (critically) allows <thinking>-style chain-of-thought blocks that teach the model to reason before answering.
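A hedged sketch of how a trainer might consume this format: flatten the turns into (text, train_on) segments, where only "gpt" turns receive loss. The role tags here are illustrative, not any real model's chat template.

```python
def sharegpt_to_segments(turns):
    """Flatten ShareGPT turns; mark only assistant ('gpt') turns as
    loss-bearing. Prompt-side segments would get label -100."""
    segments = []
    for turn in turns:
        text = f"<|{turn['from']}|>\n{turn['value']}\n"
        segments.append((text, turn["from"] == "gpt"))
    return segments

turns = [
    {"from": "human", "value": "Explain the difference between DPO and PPO."},
    {"from": "gpt", "value": "DPO eliminates the reward model by..."},
]
for text, train_on in sharegpt_to_segments(turns):
    print(train_on, repr(text[:40]))
```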
Interactive: Instruction-Response Explorer
Pick an example above.
Concept 4 of 8
Data Quality: The Four Sources
Where do high-quality instructions come from? Four primary sources, each with tradeoffs:
1. Human-written
Gold standard quality. Very expensive. Examples: OpenAssistant, No Robots, Dolly-15K. Typically used as a small, diverse seed set.
2. Synthetic (distilled from a stronger model)
Cheap and scalable. You prompt GPT-4 or Claude and use their outputs as training data. Methods: Self-Instruct, Evol-Instruct, Alpaca. Limited by the teacher model's own ceiling.
3. Reformatted NLP datasets
Take existing labeled datasets (SQuAD, GLUE, etc.) and rewrite them as instructions: "Answer the question…". Cheap and high-accuracy on narrow tasks. FLAN / T0 style.
4. User interaction logs
Real-world prompt distribution. Requires consent, deduplication, PII scrubbing. WildChat, ShareGPT. The closest match to production traffic.
Evol-Instruct: making prompts harder on purpose
Xu et al. (2023) pointed out a weakness of vanilla Self-Instruct: synthetic instructions cluster around the easy end of the difficulty spectrum. Evol-Instruct iteratively rewrites prompts to add constraints, deepen reasoning, or increase specificity:
Original: "Write a function to add two numbers."
Evolved: "Write a Python function that adds two numbers but rejects
non-numeric input and handles integer overflow for very
large values, with type hints and a doctest."
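One evolution step is just a meta-prompt sent to a generator model. A sketch of the "add constraints" direction; the template text here is illustrative, not the actual WizardLM prompt:

```python
# Hypothetical Evol-Instruct-style meta-prompt (illustrative wording only).
EVOLVE_TEMPLATE = (
    "Rewrite the following programming instruction to make it harder. "
    "Add one concrete constraint (input validation, edge cases, or "
    "performance requirements) without changing the core task.\n\n"
    "Instruction: {instruction}\n\n"
    "Rewritten instruction:"
)

def build_evolve_prompt(instruction: str) -> str:
    return EVOLVE_TEMPLATE.format(instruction=instruction)

prompt = build_evolve_prompt("Write a function to add two numbers.")
print(prompt)
```

The generator's completion becomes the new, harder instruction; repeating the step several times compounds the difficulty.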
Rejection sampling fine-tuning (RFT)
Llama 3.1 uses this extensively: for each prompt, generate N responses from the current best model, keep only the ones a reward model or verifier says are good, train on those. For math (Yuan et al. 2023): sample solutions, keep only those whose final answer matches ground truth. The model bootstraps its way to better data.
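The filtering step can be sketched in a few lines. Here `toy_generate`, `extract_answer`, and the exact answer-matching rule are all stand-ins for the real model and verifier:

```python
import re
import random

def extract_answer(solution: str):
    """Toy verifier: pull the final integer from 'the answer is N'."""
    m = re.search(r"answer is (-?\d+)", solution)
    return m.group(1) if m else None

def rft_filter(prompt, ground_truth, generate, n=8):
    """Sample n solutions; keep only those whose answer matches."""
    kept = []
    for _ in range(n):
        sol = generate(prompt)
        if extract_answer(sol) == ground_truth:
            kept.append((prompt, sol))
    return kept

# Stand-in for the current best model: right about half the time.
def toy_generate(prompt):
    return f"... the answer is {random.choice([4, 4, 5, 6])}"

data = rft_filter("What is 2 + 2?", "4", toy_generate, n=20)
```

Everything that survives the filter is correct by construction, so the next SFT round trains on a strictly cleaner distribution than the model that generated it.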
Concept 5 of 8
Data Mixing: The Llama 3 Recipe
No real SFT run uses a single source. Llama 3.1's published data mix:
Interactive: Data Quality Explorer
Drag the sliders to build your own SFT mixture. Notice how skills rise and fall.
Predicted skill profile
Decontamination
Tulu 3 found that 11.3% of NuminaMath-TIR — a popular math dataset — overlapped with MATH evaluation problems via 8-gram matching. If you train on it naively, your benchmark numbers are inflated by pure memorization. Always run n-gram decontamination against every benchmark before training.
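A minimal sketch of 8-gram decontamination over whitespace tokens. Real pipelines normalize text and hash the n-grams for scale; this version skips both.

```python
def ngrams(text: str, n: int = 8):
    """All n-grams of whitespace-split, lowercased tokens."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_examples, benchmark_texts, n: int = 8):
    """Drop any training example sharing an n-gram with a benchmark."""
    bench = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    return [ex for ex in train_examples if not (ngrams(ex, n) & bench)]

bench = ["If x plus y equals ten and x minus y equals two what is x"]
train = [
    # Leaked: shares 8-grams with the benchmark problem above.
    "If x plus y equals ten and x minus y equals two what is x times y",
    # Clean: no 8-gram overlap.
    "Compute the derivative of sine of x squared with respect to x please",
]
clean = decontaminate(train, bench)   # drops the leaked first example
```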
Concept 6 of 8
Training Dynamics: Why SFT Uses Tiny Learning Rates
SFT hyperparameters look very different from pretraining:
Pretraining: lr ≈ 1e-4 to 3e-4, batch: millions of tokens, epochs: 1
SFT: lr ≈ 5e-6 to 2e-5, batch: 128–256 seqs, epochs: 1–3
Llama 3.1 SFT: lr = 1e-5, ~8,500 steps, cosine decay
Why 20× lower learning rate? Two reasons:
Catastrophic forgetting. The base model holds trillions of tokens' worth of knowledge in its weights. A large learning rate would overwrite it. With a small one, we gently nudge the distribution toward "respond in instruction style" without erasing underlying capability.
Small datasets overfit fast. SFT datasets are tiny (tens of thousands to a few million examples). After 2-3 epochs the loss on training instructions becomes near-zero — but held-out quality starts dropping. This is why 1–3 epochs is standard.
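The Llama 3.1 schedule quoted above (lr = 1e-5, cosine decay over ~8,500 steps) can be sketched in pure Python; warmup is omitted for brevity:

```python
import math

BASE_LR, TOTAL_STEPS = 1e-5, 8500   # Llama 3.1 SFT numbers quoted above

def cosine_lr(step: int) -> float:
    """Cosine decay from BASE_LR down to 0 over TOTAL_STEPS."""
    return BASE_LR * 0.5 * (1 + math.cos(math.pi * step / TOTAL_STEPS))

lrs = [cosine_lr(s) for s in range(TOTAL_STEPS + 1)]
# Starts at exactly 1e-5, decreases monotonically, ends at 0.
```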
Recent finding
A 2024 comprehensive study found that larger batch sizes combined with lower learning rates consistently beat small-batch / higher-LR setups on downstream SFT quality. This mirrors a broader trend in post-training: reduce variance, take small steps.
Concept 7 of 8
The Modern Iterative Pipeline
Llama 3.1 doesn't just run SFT once. It runs six rounds. Each round looks like this:
1
Collect prompts (human + synthetic)
2
Generate N responses using current best model (rejection sampling)
3
Filter with reward model, execution feedback, or answer verification
4
Mix datasets, decontaminate against all benchmarks
5
Run SFT on the filtered, decontaminated mixture
6
Use the new SFT model to generate better data → round N+1
This is a positive feedback loop. Each round produces a better model, which generates better synthetic data, which trains an even better model. The compounding is why modern open-source models have caught up to proprietary ones on many benchmarks.
Evaluation suites in use
IFEval — instruction following ("respond in exactly three sentences ending with a question"). Deterministic checker.
MMLU / ARC — factual knowledge and reasoning.
GSM8K / MATH — math word problems.
HumanEval / MBPP — code generation with unit tests.
MT-Bench, AlpacaEval — LLM-as-judge open-ended chat.
Chatbot Arena — human head-to-head (gold standard).
Concept 8 of 8
Check Your Understanding
1. What is instruction masking and why does it matter?
Exactly. PyTorch's cross-entropy loss ignores positions labeled -100. All gradient signal concentrates on response tokens — which is the behavior we want to learn.
2. The Superficial Alignment Hypothesis claims that…
Correct. LIMA demonstrated the point with just 1,000 examples. Quality >> quantity, because SFT is unlocking existing capability rather than teaching new knowledge.
3. Why does SFT use a learning rate ~20× lower than pretraining?
Right. Large LR would overwrite pretrained weights; small datasets would overfit in a single epoch. 1e-5 is the modern sweet spot.
4. What does "rejection sampling fine-tuning" mean?
Correct. For math, "best" can mean "final answer matches ground truth"; for code, "unit tests pass"; for open-ended chat, "highest reward model score". This is how Llama 3.1 bootstraps high-quality data.
5. Why is decontamination crucial?
Exactly. 8-gram matching against every eval benchmark is now standard. Failing to do this is how papers accidentally publish inflated numbers.
Teach It Back
Explain to a colleague: What exactly changes between pretraining and SFT (objective, learning rate, data size, loss masking)? Why does the Superficial Alignment Hypothesis justify using only ~1,000 examples?
An AI tutor will grade your explanation.
Flashcards
Instruction masking — one sentence
Click to reveal
Set cross-entropy labels for prompt tokens to -100 so loss only flows from assistant-response tokens. Implemented by DataCollatorForCompletionOnlyLM in HuggingFace trl.
SFT loss formula
Click to reveal
L_SFT = -(1/|R|) · Σ_{i ∈ R} log P(tok_i | tok_<i), where R is the set of response-token positions. Identical to pretraining's cross-entropy, but summed only over response positions.
LIMA / Superficial Alignment Hypothesis
Click to reveal
Meta 2023: 1,000 carefully curated examples fine-tuned LLaMA-65B to compete with models trained on millions. Claim: knowledge comes from pretraining; SFT only teaches response format and style.
Self-Instruct vs Evol-Instruct
Click to reveal
Self-Instruct bootstraps new prompts from a seed set (used by Alpaca) but clusters around easy difficulty. Evol-Instruct rewrites existing prompts to add constraints, deepen reasoning, increase specificity — used by WizardLM.
Rejection Sampling Fine-Tuning (RFT)
Click to reveal
For each prompt: sample N responses, filter with reward model / verifier / unit tests, keep only the best. Train on those. Used heavily in Llama 3.1's six iterative SFT rounds.
Module 19 Complete
SFT teaches the model how to respond. Next up: teaching it what to say — DPO, RLHF, and the rise of verifiable rewards.