How do you know training is working, before it's over?

Think first
You kick off a 54-day training run that will cost 10 million dollars. Day 3, you want to know: is this going to work? What would you measure, and how often?

You track four signals: (1) intrinsic metrics (loss curves, perplexity), (2) zero-shot and few-shot benchmarks, (3) contamination checks, and (4) stopping criteria. None alone tells the full story. You need all four.

Perplexity: The Oldest Metric

Perplexity (introduced by IBM in 1977 for speech recognition) is defined as:

Perplexity = exp(mean cross-entropy loss)

If average loss = 2.0, perplexity = e^2 ~ 7.39. Interpretation: the model is roughly choosing between ~7.39 equally likely tokens at each step. Lower perplexity = less surprised = better language modeling.
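The definition above is a one-liner in code. A minimal sketch, computing perplexity from a list of per-token cross-entropy losses (the function name and example values are illustrative):

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy loss)."""
    return math.exp(sum(token_losses) / len(token_losses))

# A model with average loss 2.0 is "choosing" among ~7.39 tokens per step.
print(round(perplexity([2.0, 2.0, 2.0]), 2))  # 7.39
```

A perfectly confident model (loss 0 on every token) gets perplexity 1.0, the floor of the metric.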

For decades this was THE metric. It correlates with downstream performance as scale increases. But at frontier scale it breaks down: the average is dominated by frequent, trivially predictable tokens, and small perplexity differences stop tracking downstream task performance.

Use perplexity for: detecting training anomalies, controlled comparisons at equal scale, early training monitoring. Do NOT use it for: cross-dataset comparisons, predicting downstream task performance, fine-grained ranking of similar models.

Loss Curves: Reading Training Health

Interactive: Loss Curve Viewer

Pick a scenario to see what the loss curve looks like and what it means.

Expected healthy pattern with cosine schedule: a sharp drop during warmup (roughly the first 5% of steps) as the model learns syntax and frequent patterns, a long slow decrease through the middle ~80% as it learns nuance, and tiny improvements in the final stretch as the learning rate decays. Spikes signal instability; an early plateau can signal saturation.

Always plot in log scale to see the later-training details. Loss-curve collapse (the Scaling with Collapse paper, 2025) shows that for compute-optimal training, normalized curves across scales overlap -- a powerful tool for prediction.
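Spikes in the loss curve can be flagged automatically rather than eyeballed. A minimal sketch using a rolling-median rule (the window size and threshold here are arbitrary choices, not a standard recipe):

```python
def detect_spikes(losses, window=5, threshold=1.5):
    """Flag steps where loss jumps above `threshold` times the median of
    the previous `window` values -- a crude instability alarm that would
    trigger a rollback to a recent checkpoint."""
    spikes = []
    for i in range(window, len(losses)):
        recent = sorted(losses[i - window:i])
        median = recent[window // 2]
        if losses[i] > threshold * median:
            spikes.append(i)
    return spikes

history = [4.0, 3.5, 3.2, 3.0, 2.9, 2.8, 6.1, 2.7]  # spike at step 6
print(detect_spikes(history))  # [6]
```

In a real run this would feed an alerting system; the point is that "spike" can be a precise, automated condition, not a judgment call made after the fact.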

Zero-Shot and Few-Shot Benchmarks

Loss tells you the model is learning something. Benchmarks tell you what it can do. Standard suites include HellaSwag (commonsense), ARC (science QA), WinoGrande (pronoun coreference), TruthfulQA (factuality), MMLU (broad knowledge across 57 subjects), GSM8K and MATH (math), and HumanEval/MBPP (code).

Few-shot evaluation gives the model a few examples in the prompt (in-context learning, no weight updates). Shot counts are standardized at 0, 1, 5, or 10 for comparability. Chain-of-thought prompting ("Let's think step by step") can unlock capabilities that zero-shot hides.
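Assembling a few-shot prompt is purely string construction: k worked examples, then the query. A minimal sketch (the Q:/A: template and the arithmetic examples are illustrative, not a standard format):

```python
def build_few_shot_prompt(examples, query, k=5):
    """Assemble a k-shot prompt: k worked examples followed by the query.
    No weight updates happen -- the 'learning' is all in context."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    blocks.append(f"Q: {query}\nA:")  # model completes from here
    return "\n\n".join(blocks)

examples = [("2+2?", "4"), ("3+5?", "8")]
print(build_few_shot_prompt(examples, "7+6?", k=2))
```

Setting k=0 reduces this to zero-shot evaluation, which is why shot counts (0, 1, 5, 10) are just a parameter of the same harness.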

Interactive: Benchmark Comparison Tool

Compare model scores on common benchmarks (scores are illustrative).

Evaluation frequency: every few thousand steps early, every 10K+ mid-run, and more frequently again at the end to decide stopping. Expensive benchmarks (HumanEval requires code execution, MMLU is large) are run periodically while cheap proxies run continuously.
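That tiered cadence is easy to encode. A minimal sketch of the scheduling logic (the intervals and the 90% "near the end" cutoff are illustrative assumptions, not values any lab publishes):

```python
def should_eval(step, total_steps, cheap_every=1000, expensive_every=10000):
    """Tiered evaluation cadence: cheap proxies run often, expensive
    benchmarks run periodically, and everything runs more often near the
    end of the run to inform the stopping decision."""
    near_end = step > 0.9 * total_steps
    run_cheap = step % cheap_every == 0
    run_expensive = step % expensive_every == 0 or (near_end and run_cheap)
    return run_cheap, run_expensive
```

For example, at step 95,000 of 100,000 both tiers fire even though 95,000 is not a multiple of 10,000, because the run is in its endgame.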

Benchmark Contamination

If benchmark test data leaks into training data, the score becomes meaningless. Contamination happens three ways:

  1. Web crawls picking up benchmark questions/answers
  2. Curated datasets that quietly incorporate benchmark items
  3. Data augmentation that inadvertently reproduces test instances

A 2024 study found that only a minority of benchmark instances were free of contamination. NuminaMath had 11.3% overlap with MATH, and LLMs' Codeforces scores dropped sharply on problems published after their training cutoffs.

Detection: verbatim and near-verbatim n-gram overlap between the training corpus and benchmark items, and temporal analysis -- comparing scores on problems released before versus after the training cutoff, as in the Codeforces case above.
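A minimal sketch of an n-gram overlap check, one common detection technique (the 8-gram window and whitespace tokenization are simplifying assumptions; production pipelines use proper tokenizers and approximate matching):

```python
def ngrams(text, n=8):
    """All contiguous n-grams of a whitespace-tokenized string."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(train_doc, test_item, n=8):
    """Fraction of the test item's n-grams found verbatim in a training
    document. A high score suggests the item leaked into training data."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_doc, n)) / len(test_grams)
```

A score of 1.0 means every n-gram of the benchmark item appears in the training document; thresholds for flagging partial overlap are a policy choice.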

Mitigation: aggressive deduplication before training, LiveBench-style held-out benchmarks refreshed periodically, temporal cutoffs, private evaluation sets.

Harsh truth

Contamination at scale is inevitable. The real question is: how much, and does it affect the decisions you're about to make?

When to Stop Training

Chinchilla optimizes for training cost, not deployment cost. In practice, frontier labs overtrain (see Module 12) and look at multiple stopping signals: the loss curve flattening even in log scale, target benchmarks saturating, and the cost of each further improvement rising sharply.

Teams set target benchmarks before training starts, track cost-per-improvement during training, and reserve compute for an annealing phase. Opportunity cost matters: could the remaining compute be better spent on a fresh run with improved data?
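The cost-per-improvement signal can be made concrete as a marginal-gain rule. A minimal sketch, assuming a benchmark score series and cumulative compute costs recorded at each evaluation (the threshold is a hypothetical pre-run policy value, not a published number):

```python
def should_stop(scores, compute_costs, min_gain_per_unit=1e-4):
    """Stop when the marginal benchmark gain per unit of compute falls
    below a threshold agreed before the run started."""
    if len(scores) < 2:
        return False  # not enough history to estimate the trend
    gain = scores[-1] - scores[-2]
    cost = compute_costs[-1] - compute_costs[-2]
    return gain / cost < min_gain_per_unit
```

Real decisions weigh several such signals at once, plus the opportunity cost of a fresh run; this only shows how one of them becomes an explicit, pre-committed rule.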

Check Your Understanding

1. What is perplexity in one sentence?
Correct: exp(mean cross-entropy loss) -- the effective number of equally likely tokens the model is choosing between
2. Why does perplexity break down at frontier scale?
Correct: It stops correlating with downstream performance and is dominated by trivial common tokens
3. What does a sudden spike in the loss curve usually indicate?
Correct: Instability from a bad batch, learning rate issue, or data corruption -- typically mitigated by rolling back to a recent checkpoint
4. What is benchmark contamination and why does it matter?
Correct: Test set data leaking into training data, making benchmark scores reflect memorization instead of capability
5. Why not stop at the Chinchilla-optimal point?
Correct: Inference cost often dominates; overtraining a smaller model yields a cheaper-to-serve model even if training costs more

Teach It Back

Explain to a friend: What signals do you track during LLM pretraining, why perplexity alone is no longer enough, how contamination affects benchmark scores, and how labs decide when to stop or switch to an annealing phase.

An AI tutor will compare your explanation against the course material.


Flashcards (click to flip)

Perplexity = ?
exp(mean cross-entropy loss). Interpretation: effective number of equally likely choices at each step. Lower is better. Dominated by frequent tokens; breaks down at frontier scale.
Healthy loss curve shape with cosine schedule?
First ~5% (warmup): sharp drop, learning syntax. Middle ~80%: slow decrease, learning nuance. Final ~15%: tiny improvements. Plot in log scale. Spikes indicate instability; plateaus may indicate saturation.
Standard zero-shot benchmarks for LLMs?
HellaSwag (commonsense), ARC (science QA), WinoGrande (pronoun/coref), TruthfulQA (factuality), MMLU (broad knowledge across 57 subjects), GSM8K/MATH (math), HumanEval/MBPP (code).
Why is benchmark contamination so hard to avoid?
Web crawls naturally include benchmark questions; curated datasets sometimes incorporate them; synthetic data augmentation can reproduce them. A 2024 survey found most popular benchmarks have non-trivial overlap with common training corpora.
When do labs stop training?
Multiple signals: loss curve flattens in log scale, target benchmarks saturate, cost per improvement rises sharply, opportunity cost favors a new run. Chinchilla-optimal is rarely the actual stopping point.
Perplexity: when is it useful vs useless?
Useful: anomaly detection, controlled comparisons at equal scale, early training health. Useless: cross-dataset comparisons, predicting downstream task performance, fine-grained ranking of comparable models.