How do you know training is working, before it's over?

Think first
You kick off a 54-day training run that will cost 10 million dollars. Day 3, you want to know: is this going to work? What would you measure, and how often?

You track four signals: (1) intrinsic metrics (loss curves, perplexity), (2) zero-shot and few-shot benchmarks, (3) contamination checks, and (4) stopping criteria. None alone tells the full story. You need all four.

Perplexity: The Oldest Metric

Perplexity (introduced by IBM in 1977 for speech recognition) is defined as:

Perplexity = exp(mean cross-entropy loss)

If average loss = 2.0, perplexity = e^2 ~ 7.39. Interpretation: the model is roughly choosing between ~7.39 equally likely tokens at each step. Lower perplexity = less surprised = better language modeling.
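The definition above is a one-liner in code. A minimal sketch, computing perplexity from a list of per-token cross-entropy losses (the function name and example values are illustrative):

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy loss)."""
    return math.exp(sum(token_losses) / len(token_losses))

# A model with average loss 2.0 is "choosing" among ~7.39 tokens per step.
print(round(perplexity([2.0, 2.0, 2.0]), 2))  # 7.39
```

A perfectly confident model (loss 0 on every token) gets perplexity 1.0, the floor of the metric.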

For decades this was THE metric. It correlates with downstream performance as scale increases. But at frontier scale it breaks down: the average is dominated by frequent, trivially predictable tokens, and small perplexity differences stop tracking downstream task performance.

Use perplexity for: detecting training anomalies, controlled comparisons at equal scale, early training monitoring. Do NOT use it for: cross-dataset comparisons, predicting downstream task performance, fine-grained ranking of similar models.

Loss Curves: Reading Training Health

Interactive: Loss Curve Viewer

Pick a scenario to see what the loss curve looks like and what it means.

Expected healthy pattern with cosine schedule: a sharp drop during warmup (roughly the first 5% of steps) as the model learns syntax and frequent patterns, a long slow decrease through the middle ~80% as it learns nuance, and tiny improvements in the final stretch as the learning rate decays. Spikes signal instability; an early plateau can signal saturation.

Always plot in log scale to see the later-training details. Loss-curve collapse (the Scaling with Collapse paper, 2025) shows that for compute-optimal training, normalized curves across scales overlap -- a powerful tool for prediction.
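Spikes in the loss curve can be flagged automatically rather than eyeballed. A minimal sketch using a rolling-median rule (the window size and threshold here are arbitrary choices, not a standard recipe):

```python
def detect_spikes(losses, window=5, threshold=1.5):
    """Flag steps where loss jumps above `threshold` times the median of
    the previous `window` values -- a crude instability alarm that would
    trigger a rollback to a recent checkpoint."""
    spikes = []
    for i in range(window, len(losses)):
        recent = sorted(losses[i - window:i])
        median = recent[window // 2]
        if losses[i] > threshold * median:
            spikes.append(i)
    return spikes

history = [4.0, 3.5, 3.2, 3.0, 2.9, 2.8, 6.1, 2.7]  # spike at step 6
print(detect_spikes(history))  # [6]
```

In a real run this would feed an alerting system; the point is that "spike" can be a precise, automated condition, not a judgment call made after the fact.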

Zero-Shot and Few-Shot Benchmarks

Loss tells you the model is learning something. Benchmarks tell you what it can do. Standard suites include HellaSwag (commonsense), ARC (science QA), WinoGrande (pronoun coreference), TruthfulQA (factuality), MMLU (broad knowledge across 57 subjects), GSM8K and MATH (math), and HumanEval/MBPP (code).

Few-shot evaluation gives the model a few examples in the prompt (in-context learning, no weight updates). Shot counts are standardized at 0, 1, 5, or 10 for comparability. Chain-of-thought prompting ("Let's think step by step") can unlock capabilities that zero-shot hides.
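Assembling a few-shot prompt is purely string construction: k worked examples, then the query. A minimal sketch (the Q:/A: template and the arithmetic examples are illustrative, not a standard format):

```python
def build_few_shot_prompt(examples, query, k=5):
    """Assemble a k-shot prompt: k worked examples followed by the query.
    No weight updates happen -- the 'learning' is all in context."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    blocks.append(f"Q: {query}\nA:")  # model completes from here
    return "\n\n".join(blocks)

examples = [("2+2?", "4"), ("3+5?", "8")]
print(build_few_shot_prompt(examples, "7+6?", k=2))
```

Setting k=0 reduces this to zero-shot evaluation, which is why shot counts (0, 1, 5, 10) are just a parameter of the same harness.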

Interactive: Benchmark Comparison Tool

Compare model scores on common benchmarks (scores are illustrative).

Evaluation frequency: every few thousand steps early, every 10K+ mid-run, and more frequently again at the end to decide stopping. Expensive benchmarks (HumanEval requires code execution, MMLU is large) are run periodically while cheap proxies run continuously.
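That tiered cadence is easy to encode. A minimal sketch of the scheduling logic (the intervals and the 90% "near the end" cutoff are illustrative assumptions, not values any lab publishes):

```python
def should_eval(step, total_steps, cheap_every=1000, expensive_every=10000):
    """Tiered evaluation cadence: cheap proxies run often, expensive
    benchmarks run periodically, and everything runs more often near the
    end of the run to inform the stopping decision."""
    near_end = step > 0.9 * total_steps
    run_cheap = step % cheap_every == 0
    run_expensive = step % expensive_every == 0 or (near_end and run_cheap)
    return run_cheap, run_expensive
```

For example, at step 95,000 of 100,000 both tiers fire even though 95,000 is not a multiple of 10,000, because the run is in its endgame.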

Benchmark Contamination

If benchmark test data leaks into training data, the score becomes meaningless. Contamination happens three ways:

  1. Web crawls picking up benchmark questions/answers
  2. Curated datasets that quietly incorporate benchmark items
  3. Data augmentation that inadvertently reproduces test instances

A 2024 study found that only a minority of benchmark instances were free of contamination. NuminaMath had 11.3% overlap with MATH, and LLMs' Codeforces scores dropped sharply on problems published after their training cutoffs.

Detection: verbatim and near-verbatim n-gram overlap between the training corpus and benchmark items, and temporal analysis -- comparing scores on problems released before versus after the training cutoff, as in the Codeforces case above.
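A minimal sketch of an n-gram overlap check, one common detection technique (the 8-gram window and whitespace tokenization are simplifying assumptions; production pipelines use proper tokenizers and approximate matching):

```python
def ngrams(text, n=8):
    """All contiguous n-grams of a whitespace-tokenized string."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(train_doc, test_item, n=8):
    """Fraction of the test item's n-grams found verbatim in a training
    document. A high score suggests the item leaked into training data."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_doc, n)) / len(test_grams)
```

A score of 1.0 means every n-gram of the benchmark item appears in the training document; thresholds for flagging partial overlap are a policy choice.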

Mitigation: aggressive deduplication before training, LiveBench-style held-out benchmarks refreshed periodically, temporal cutoffs, private evaluation sets.

Harsh truth

Contamination at scale is inevitable. The real question is: how much, and does it affect the decisions you're about to make?

When to Stop Training

Chinchilla optimizes for training cost, not deployment cost. In practice, frontier labs overtrain (see Module 12) and look at multiple stopping signals: the loss curve flattening even in log scale, target benchmarks saturating, and the cost of each further improvement rising sharply.

Teams set target benchmarks before training starts, track cost-per-improvement during training, and reserve compute for an annealing phase. Opportunity cost matters: could the remaining compute be better spent on a fresh run with improved data?
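The cost-per-improvement signal can be made concrete as a marginal-gain rule. A minimal sketch, assuming a benchmark score series and cumulative compute costs recorded at each evaluation (the threshold is a hypothetical pre-run policy value, not a published number):

```python
def should_stop(scores, compute_costs, min_gain_per_unit=1e-4):
    """Stop when the marginal benchmark gain per unit of compute falls
    below a threshold agreed before the run started."""
    if len(scores) < 2:
        return False  # not enough history to estimate the trend
    gain = scores[-1] - scores[-2]
    cost = compute_costs[-1] - compute_costs[-2]
    return gain / cost < min_gain_per_unit
```

Real decisions weigh several such signals at once, plus the opportunity cost of a fresh run; this only shows how one of them becomes an explicit, pre-committed rule.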

Check Your Understanding

1. What is perplexity in one sentence?
Correct: exp(mean cross-entropy loss) -- the effective number of equally likely tokens the model is choosing between
2. Why does perplexity break down at frontier scale?
Correct: It stops correlating with downstream performance and is dominated by trivial common tokens
3. What does a sudden spike in the loss curve usually indicate?
Correct: Instability from a bad batch, learning rate issue, or data corruption -- typically mitigated by rolling back to a recent checkpoint
4. What is benchmark contamination and why does it matter?
Correct: Test set data leaking into training data, making benchmark scores reflect memorization instead of capability
5. Why not stop at the Chinchilla-optimal point?
Correct: Inference cost often dominates; overtraining a smaller model yields a cheaper-to-serve model even if training costs more

Teach It Back

Explain to a friend: What signals do you track during LLM pretraining, why perplexity alone is no longer enough, how contamination affects benchmark scores, and how labs decide when to stop or switch to an annealing phase.

An AI tutor will compare your explanation against the course material.


Flashcards (click to flip)

Perplexity = ?
exp(mean cross-entropy loss). Interpretation: effective number of equally likely choices at each step. Lower is better. Dominated by frequent tokens; breaks down at frontier scale.
Healthy loss curve shape with cosine schedule?
First ~5% (warmup): sharp drop, learning syntax. Middle ~80%: slow decrease, learning nuance. Final ~15%: tiny improvements. Plot in log scale. Spikes indicate instability; plateaus may indicate saturation.
Standard zero-shot benchmarks for LLMs?
HellaSwag (commonsense), ARC (science QA), WinoGrande (pronoun/coref), TruthfulQA (factuality), MMLU (broad knowledge across 57 subjects), GSM8K/MATH (math), HumanEval/MBPP (code).
Why is benchmark contamination so hard to avoid?
Web crawls naturally include benchmark questions; curated datasets sometimes incorporate them; synthetic data augmentation can reproduce them. A 2024 survey found most popular benchmarks have non-trivial overlap with common training corpora.
When do labs stop training?
Multiple signals: loss curve flattens in log scale, target benchmarks saturate, cost per improvement rises sharply, opportunity cost favors a new run. Chinchilla-optimal is rarely the actual stopping point.
Perplexity: when is it useful vs useless?
Useful: anomaly detection, controlled comparisons at equal scale, early training health. Useless: cross-dataset comparisons, predicting downstream task performance, fine-grained ranking of comparable models.