Meta's LIMA paper (2023) fine-tuned a base model on only 1,000 examples — and beat models trained on millions. If that's true, what does it imply about where a language model's actual knowledge lives, and what SFT is really teaching?
The LIMA paper found that a 65B LLaMA base model, fine-tuned on just 1,000 carefully curated prompts and responses, matched or beat models trained on millions of lower-quality examples. This is the Superficial Alignment Hypothesis:
LIMA's claim
Nearly all of the model's knowledge and capabilities are learned during pretraining. SFT is mostly teaching the model the format and style of responding — which subset of its distribution to sample from when a user message arrives.
The implication is drastic: quality >> quantity for SFT data. A handful of careful, diverse, high-quality examples can unlock an enormous amount of pretrained behavior. This is why modern open-source projects (Tulu 3, LLaMA 3 SFT) spend more effort on decontamination, filtering, and verification than on raw data collection.
Analogy: SFT is like teaching a polyglot polymath to write a business email. They already know the vocabulary, grammar, facts, and social conventions; you only need to show them 20 examples of the specific register and greeting, and they've got it.
Concept 2 of 8
Instruction Masking: Where the Loss Actually Goes
Here's the single most important mechanical detail of SFT. Pretraining computes its loss on every token. SFT does not.
During SFT, we set the label of every instruction/prompt token to -100, a sentinel that tells PyTorch's cross-entropy loss "skip this position". Gradients only flow from the assistant's response tokens. This is called instruction masking or loss masking.
Interactive: What the Loss Sees
Green tokens contribute to the loss. Red-dashed tokens are masked out (label = -100). Toggle to see why this matters.
With masking ON (default in DataCollatorForCompletionOnlyLM), 100% of the gradient signal teaches the model how to respond, not how to regurgitate the prompt.
The math: for a sequence whose response-token positions form the set R,
L_SFT = -(1/|R|) · Σ_{i ∈ R} log P(tok_i | tok_<i)
Without masking, gradient capacity would be wasted on "memorize the user's phrasing" — which is not useful and can actively hurt, since it teaches the model to parrot prompts.
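A minimal PyTorch sketch of the mechanic: `F.cross_entropy` skips any position whose label equals `ignore_index` (-100), so masking the prompt is purely a matter of how you build the labels tensor. The sequence length and vocabulary size here are toy values.

```python
import torch
import torch.nn.functional as F

# Toy setup: vocabulary of 10 tokens, sequence of 6 positions where the
# first 3 are prompt tokens (masked) and the last 3 are response tokens.
logits = torch.randn(6, 10)                     # [seq_len, vocab]
labels = torch.tensor([-100, -100, -100, 4, 7, 2])

# cross_entropy ignores positions labeled -100, so only the 3 response
# positions contribute to the loss (and to its mean).
masked_loss = F.cross_entropy(logits, labels, ignore_index=-100)

# Equivalent manual computation over response positions only:
resp = labels != -100
manual = F.cross_entropy(logits[resp], labels[resp])
assert torch.allclose(masked_loss, manual)
```

Note that the mean is taken over the |R| = 3 response positions only, which is exactly the 1/|R| normalization in the formula above.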
Why not pretraining style?
In pretraining, every token is useful signal: it's all raw text. In SFT, the prompt is input you already have; only the response is behavior you want to learn. Masking focuses 100% of the tiny learning-rate budget (~1e-5) on the thing that matters.
Concept 3 of 8
Data Formats: Alpaca and ShareGPT
Two formats dominate the open-source ecosystem:
Alpaca (single-turn)
{
"instruction": "Give three tips for staying healthy.",
"input": "",
"output": "1. Eat a balanced diet rich in fruits and vegetables.\n2. Exercise regularly...\n3. Get 7-9 hours of sleep."
}
Three fields: instruction, input (optional context, e.g., a paragraph to summarize), output. Simple, but can't represent conversations.
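In practice these records are rendered into a single training string. A sketch using the prompt template published with the Stanford Alpaca repo (the field names match the example above):

```python
def alpaca_to_prompt(example: dict) -> str:
    """Render an Alpaca-format record into one training string,
    following the template from the Stanford Alpaca release."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

record = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet...",
}
print(alpaca_to_prompt(record))
```

With instruction masking, everything up to and including "### Response:\n" would get label -100; only the output tokens carry loss.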
ShareGPT (multi-turn)
[
{"from": "human", "value": "Explain the difference between DPO and PPO."},
{"from": "gpt", "value": "<thinking>The user wants a technical comparison...</thinking> DPO eliminates the reward model by..."},
{"from": "human", "value": "Which is more stable in practice?"},
{"from": "gpt", "value": "DPO is significantly more stable because..."}
]
Can represent arbitrary conversation length, and (critically) allows <thinking>-style chain-of-thought blocks that teach the model to reason before answering.
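A hedged sketch of how a trainer might consume this format: flatten the turns into (text, train_on) segments, where only "gpt" turns receive loss. The role tags here are illustrative, not any real model's chat template.

```python
def sharegpt_to_segments(turns):
    """Flatten ShareGPT turns; mark only assistant ('gpt') turns as
    loss-bearing. Prompt-side segments would get label -100."""
    segments = []
    for turn in turns:
        text = f"<|{turn['from']}|>\n{turn['value']}\n"
        segments.append((text, turn["from"] == "gpt"))
    return segments

turns = [
    {"from": "human", "value": "Explain the difference between DPO and PPO."},
    {"from": "gpt", "value": "DPO eliminates the reward model by..."},
]
for text, train_on in sharegpt_to_segments(turns):
    print(train_on, repr(text[:40]))
```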
Interactive: Instruction-Response Explorer
Pick an example above.
Concept 4 of 8
Data Quality: The Four Sources
Where do high-quality instructions come from? Four primary sources, each with tradeoffs:
1. Human-written
Gold standard quality. Very expensive. Examples: OpenAssistant, No Robots, Dolly-15K. Typically used as a small, diverse seed set.
2. Synthetic (distilled from a stronger model)
Cheap and scalable. You prompt GPT-4 or Claude and use their outputs as training data. Methods: Self-Instruct, Evol-Instruct, Alpaca. Limited by the teacher model's own ceiling.
3. Reformatted NLP datasets
Take existing labeled datasets (SQuAD, GLUE, etc.) and rewrite them as instructions: "Answer the question…". Cheap and high-accuracy on narrow tasks. FLAN / T0 style.
4. User interaction logs
Real-world prompt distribution. Requires consent, deduplication, PII scrubbing. WildChat, ShareGPT. The closest match to production traffic.
Evol-Instruct: making prompts harder on purpose
Xu et al. (2023) pointed out a weakness of vanilla Self-Instruct: synthetic instructions cluster around the easy end of the difficulty spectrum. Evol-Instruct iteratively rewrites prompts to add constraints, deepen reasoning, or increase specificity:
Original: "Write a function to add two numbers."
Evolved: "Write a Python function that adds two numbers but rejects
non-numeric input and handles integer overflow for very
large values, with type hints and a doctest."
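One evolution step is just a meta-prompt sent to a generator model. A sketch of the "add constraints" direction; the template text here is illustrative, not the actual WizardLM prompt:

```python
# Hypothetical Evol-Instruct-style meta-prompt (illustrative wording only).
EVOLVE_TEMPLATE = (
    "Rewrite the following programming instruction to make it harder. "
    "Add one concrete constraint (input validation, edge cases, or "
    "performance requirements) without changing the core task.\n\n"
    "Instruction: {instruction}\n\n"
    "Rewritten instruction:"
)

def build_evolve_prompt(instruction: str) -> str:
    return EVOLVE_TEMPLATE.format(instruction=instruction)

prompt = build_evolve_prompt("Write a function to add two numbers.")
print(prompt)
```

The generator's completion becomes the new, harder instruction; repeating the step several times compounds the difficulty.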
Rejection sampling fine-tuning (RFT)
Llama 3.1 uses this extensively: for each prompt, generate N responses from the current best model, keep only the ones a reward model or verifier says are good, train on those. For math (Yuan et al. 2023): sample solutions, keep only those whose final answer matches ground truth. The model bootstraps its way to better data.
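The filtering step can be sketched in a few lines. Here `toy_generate`, `extract_answer`, and the exact answer-matching rule are all stand-ins for the real model and verifier:

```python
import re
import random

def extract_answer(solution: str):
    """Toy verifier: pull the final integer from 'the answer is N'."""
    m = re.search(r"answer is (-?\d+)", solution)
    return m.group(1) if m else None

def rft_filter(prompt, ground_truth, generate, n=8):
    """Sample n solutions; keep only those whose answer matches."""
    kept = []
    for _ in range(n):
        sol = generate(prompt)
        if extract_answer(sol) == ground_truth:
            kept.append((prompt, sol))
    return kept

# Stand-in for the current best model: right about half the time.
def toy_generate(prompt):
    return f"... the answer is {random.choice([4, 4, 5, 6])}"

data = rft_filter("What is 2 + 2?", "4", toy_generate, n=20)
```

Everything that survives the filter is correct by construction, so the next SFT round trains on a strictly cleaner distribution than the model that generated it.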
Concept 5 of 8
Data Mixing: The Llama 3 Recipe
No real SFT run uses a single source. Llama 3.1's published data mix:
Interactive: Data Quality Explorer
Drag the sliders to build your own SFT mixture. Notice how skills rise and fall.
Predicted skill profile
Decontamination
Tulu 3 found that 11.3% of NuminaMath-TIR — a popular math dataset — overlapped with MATH evaluation problems via 8-gram matching. If you train on it naively, your benchmark numbers are inflated by pure memorization. Always run n-gram decontamination against every benchmark before training.
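A minimal sketch of 8-gram decontamination over whitespace tokens. Real pipelines normalize text and hash the n-grams for scale; this version skips both.

```python
def ngrams(text: str, n: int = 8):
    """All n-grams of whitespace-split, lowercased tokens."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_examples, benchmark_texts, n: int = 8):
    """Drop any training example sharing an n-gram with a benchmark."""
    bench = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    return [ex for ex in train_examples if not (ngrams(ex, n) & bench)]

bench = ["If x plus y equals ten and x minus y equals two what is x"]
train = [
    # Leaked: shares 8-grams with the benchmark problem above.
    "If x plus y equals ten and x minus y equals two what is x times y",
    # Clean: no 8-gram overlap.
    "Compute the derivative of sine of x squared with respect to x please",
]
clean = decontaminate(train, bench)   # drops the leaked first example
```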
Concept 6 of 8
Training Dynamics: Why SFT Uses Tiny Learning Rates
SFT hyperparameters look very different from pretraining:
Pretraining: lr ≈ 1e-4 to 3e-4, batch: millions of tokens, epochs: 1
SFT: lr ≈ 5e-6 to 2e-5, batch: 128–256 seqs, epochs: 1–3
Llama 3.1 SFT: lr = 1e-5, ~8,500 steps, cosine decay
Why 20× lower learning rate? Two reasons:
Catastrophic forgetting. The base model holds trillions of tokens' worth of knowledge in its weights. A large learning rate would overwrite it. With a small one, we gently nudge the distribution toward "respond in instruction style" without erasing underlying capability.
Small datasets overfit fast. SFT datasets are tiny (tens of thousands to a few million examples). After 2-3 epochs the loss on training instructions becomes near-zero — but held-out quality starts dropping. This is why 1–3 epochs is standard.
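The Llama 3.1 schedule quoted above (lr = 1e-5, cosine decay over ~8,500 steps) can be sketched in pure Python; warmup is omitted for brevity:

```python
import math

BASE_LR, TOTAL_STEPS = 1e-5, 8500   # Llama 3.1 SFT numbers quoted above

def cosine_lr(step: int) -> float:
    """Cosine decay from BASE_LR down to 0 over TOTAL_STEPS."""
    return BASE_LR * 0.5 * (1 + math.cos(math.pi * step / TOTAL_STEPS))

lrs = [cosine_lr(s) for s in range(TOTAL_STEPS + 1)]
# Starts at exactly 1e-5, decreases monotonically, ends at 0.
```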
Recent finding
A 2024 comprehensive study found that larger batch sizes combined with lower learning rates consistently beat small-batch / higher-LR setups on downstream SFT quality. This mirrors a broader trend in post-training: reduce variance, take small steps.
Concept 7 of 8
The Modern Iterative Pipeline
Llama 3.1 doesn't just run SFT once. It runs six rounds. Each round looks like this:
1
Collect prompts (human + synthetic)
2
Generate N responses using current best model (rejection sampling)
3
Filter with reward model, execution feedback, or answer verification
4
Mix datasets, decontaminate against all benchmarks
5
Run SFT on the filtered, decontaminated mixture
6
Use the new SFT model to generate better data → round N+1
This is a positive feedback loop. Each round produces a better model, which generates better synthetic data, which trains an even better model. The compounding is why modern open-source models have caught up to proprietary ones on many benchmarks.
Evaluation suites in use
IFEval — instruction following ("respond in exactly three sentences ending with a question"). Deterministic checker.
MMLU / ARC — factual knowledge and reasoning.
GSM8K / MATH — math word problems.
HumanEval / MBPP — code generation with unit tests.
MT-Bench, AlpacaEval — LLM-as-judge open-ended chat.
Chatbot Arena — human head-to-head (gold standard).
Concept 8 of 8
Check Your Understanding
1. What is instruction masking and why does it matter?
Exactly. PyTorch's cross-entropy loss ignores positions labeled -100. All gradient signal concentrates on response tokens — which is the behavior we want to learn.
2. The Superficial Alignment Hypothesis claims that…
Correct. LIMA demonstrated the point with just 1,000 examples. Quality >> quantity, because SFT is unlocking existing capability rather than teaching new knowledge.
3. Why does SFT use a learning rate ~20× lower than pretraining?
Right. Large LR would overwrite pretrained weights; small datasets would overfit in a single epoch. 1e-5 is the modern sweet spot.
4. What does "rejection sampling fine-tuning" mean?
Correct. For math, "best" can mean "final answer matches ground truth"; for code, "unit tests pass"; for open-ended chat, "highest reward model score". This is how Llama 3.1 bootstraps high-quality data.
5. Why is decontamination crucial?
Exactly. 8-gram matching against every eval benchmark is now standard. Failing to do this is how papers accidentally publish inflated numbers.
Teach It Back
Explain to a colleague: What exactly changes between pretraining and SFT (objective, learning rate, data size, loss masking)? Why does the Superficial Alignment Hypothesis justify using only ~1,000 examples?
An AI tutor will grade your explanation.
Flashcards
Instruction masking — one sentence
Click to reveal
Set cross-entropy labels for prompt tokens to -100 so loss only flows from assistant-response tokens. Implemented by DataCollatorForCompletionOnlyLM in HuggingFace trl.
SFT loss formula
Click to reveal
L_SFT = -(1/|R|) · Σ_{i ∈ R} log P(tok_i | tok_<i), where R is the set of response-token positions. Identical to pretraining's cross-entropy, but summed only over response positions.
LIMA / Superficial Alignment Hypothesis
Click to reveal
Meta 2023: 1,000 carefully curated examples fine-tuned LLaMA-65B to compete with models trained on millions. Claim: knowledge comes from pretraining; SFT only teaches response format and style.
Self-Instruct vs Evol-Instruct
Click to reveal
Self-Instruct bootstraps new prompts from a seed set (used by Alpaca) but clusters around easy difficulty. Evol-Instruct rewrites existing prompts to add constraints, deepen reasoning, increase specificity — used by WizardLM.
Rejection Sampling Fine-Tuning (RFT)
Click to reveal
For each prompt: sample N responses, filter with reward model / verifier / unit tests, keep only the best. Train on those. Used heavily in Llama 3.1's six iterative SFT rounds.
Module 19 Complete
SFT teaches the model how to respond. Next up: teaching it what to say — DPO, RLHF, and the rise of verifiable rewards.