Beyond random shuffling: can you train smarter, not longer?

Think first
Children learn counting before calculus. Should LLMs be taught simple text first and complex text later? Your intuition says yes. The empirical answer at frontier scale says... mostly no. Why?

Curriculum learning is intuitive but hard to make work at frontier scale. Random shuffling with high-quality filtering beats most curriculum schemes in practice. This module covers three advanced techniques that do work: long-context extension, continual pretraining, and annealing.

Curriculum Learning: What Worked, What Didn't

The idea: structure the order in which training data is presented, rather than shuffling it randomly. Three families of curriculum schemes were tried.

Consensus by 2024: aggressive curriculum is not worth it at frontier scale. Kim & Lee (2024) tested curricula on Mistral-7B and Gemma-7B and found no significant gains. What does help: starting with slightly easier data for the first 10-20% of training (a warmup curriculum), then switching to the full mixture. This adds stability without the complexity of a full curriculum.

What actually works at frontier scale

1. Quality-filter upfront.
2. Tune domain mixing ratios.
3. Upsample high-quality sources throughout.

These three practices became the standard replacement for explicit curriculum.
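Upsampling high-quality sources amounts to weighted sampling over corpora. A minimal sketch, with source names and weights that are purely illustrative (not any lab's real mixture):

```python
import random

# Illustrative mixing weights: higher weight means the source is upsampled
# relative to its raw size. These numbers are made up for demonstration.
SOURCE_WEIGHTS = {
    "web_filtered": 1.0,   # quality-filtered web text, baseline
    "wikipedia":    3.0,   # upsampled ~3x throughout training
    "books":        2.5,
}

def sample_source(rng):
    """Pick the source of the next training document per the mixing ratios."""
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
# wikipedia (weight 3.0) is drawn roughly 3x as often as web_filtered (1.0)
```

In a real pipeline the weights are tuned empirically against downstream benchmarks, and sampling happens at the shard or document level rather than per call.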

Long-Context Extension: The Positional Encoding Problem

Training at 128K context from scratch is dramatically more expensive than at 8K: self-attention cost grows quadratically with sequence length, so a 16x longer sequence costs roughly 256x the attention compute (16x per token). The standard approach: train at moderate context (8K or 16K), then extend.
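A back-of-envelope check of that cost claim, counting attention cost in arbitrary units:

```python
# Self-attention FLOPs scale with seq_len**2, so a 16x longer context costs
# 256x the attention compute per sequence -- i.e. 16x per token processed.
def attention_cost(seq_len: int) -> int:
    return seq_len ** 2  # arbitrary units; constants dropped

short_ctx, long_ctx = 8 * 1024, 128 * 1024
per_seq = attention_cost(long_ctx) // attention_cost(short_ctx)  # 256
per_token = per_seq // (long_ctx // short_ctx)                   # 16
assert per_seq == 256 and per_token == 16
```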

The obstacle is positional encodings: they were only trained on the positions seen during pretraining. Naive extrapolation produces garbage because the model is fed position values (for RoPE, rotation angles) far outside the distribution it ever saw.

Positional Interpolation (PI) and YaRN

Positional Interpolation (Chen et al. 2023): instead of extrapolating past the trained positions, compress the target sequence into the trained range by scaling the position indices. For RoPE, this means slowing down the rotations by a scale factor equal to target_length / trained_length. Brief fine-tuning at the new length recovers quality. The original LLaMA models were extended from 2K to 32K context this way.
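A minimal numpy sketch of PI applied to RoPE angles (the dimension count and base are typical defaults, not tied to a specific model): dividing positions by the scale factor maps out-of-range positions back inside the trained range.

```python
import numpy as np

def rope_angles(position, dim=64, base=10000.0, scale=1.0):
    """Rotation angles for one position across RoPE dimension pairs.
    With scale > 1, positions are compressed (Positional Interpolation)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # per-pair frequency
    return (position / scale) * freqs

trained_len, target_len = 2048, 32768
scale = target_len / trained_len  # 16.0

# Position 30000 is far outside the trained range, but under PI its angles
# equal those of position 1875 -- well inside what the model has seen.
assert np.allclose(rope_angles(30000, scale=scale), rope_angles(1875))
```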

YaRN (Peng et al. 2023) improves on PI. Insight: RoPE encodes position through multiple frequency components. High-frequency components distinguish nearby tokens (do NOT compress). Low-frequency components encode long-range structure (can tolerate compression). YaRN splits RoPE dimensions into three groups with different scaling factors. Requires fewer fine-tuning tokens than PI.
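A hedged sketch of YaRN's per-dimension idea (function and parameter names are ours, not the paper's, and this omits YaRN's attention-temperature term): each dimension pair's compression depends on how many full rotations it completes within the trained context.

```python
import numpy as np

def yarn_adjusted_freqs(dim=64, base=10000.0, trained_len=8192,
                        scale=8.0, low=1.0, high=32.0):
    """Per-dimension RoPE frequencies after YaRN-style interpolation."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    rotations = trained_len * freqs / (2 * np.pi)  # rotations in trained ctx
    # ramp: 1 -> keep full resolution (high freq), 0 -> full interpolation
    keep = np.clip((rotations - low) / (high - low), 0.0, 1.0)
    per_dim_scale = keep + (1 - keep) * scale
    return freqs / per_dim_scale

adj = yarn_adjusted_freqs()
orig = 10000.0 ** (-np.arange(0, 64, 2) / 64)
assert np.isclose(adj[0], orig[0])           # highest frequency: untouched
assert np.isclose(adj[-1], orig[-1] / 8.0)   # lowest frequency: fully scaled
```

The `low`/`high` rotation thresholds control where the blend happens; the values above are illustrative.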

Limits of context extension

Extended models are slower at inference (still quadratic) and are often less effective at long-context reasoning than models trained on long context from scratch. Sweet spot: train at 8K-16K, extend to 128K with YaRN or PI plus continued training.

Continual Pretraining and Catastrophic Forgetting

Take an existing pretrained model and continue training it on domain-specific data. This differs from fine-tuning: continual pretraining keeps the same next-token objective on raw text, just with new data, rather than training on a narrow downstream task or instruction format.

Interactive: Curriculum / Continual Pretraining Visualizer

Slide through the training timeline to see how different strategies order data.

Catastrophic forgetting is the main risk. If you train a general LLaMA on pure medical text, its general knowledge degrades. Solutions: mix general data back into the stream (e.g., a 70/30 domain/general split), use a lower learning rate, and monitor general benchmarks throughout training.
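A sketch of the general-data replay mitigation, using a 70/30 domain/general split (the ratio is illustrative, not a universal constant):

```python
import random

def mixed_batch(domain_docs, general_docs, batch_size=8,
                domain_frac=0.7, rng=random):
    """Build a batch of ~70% domain text and ~30% general replay data."""
    n_domain = round(batch_size * domain_frac)
    batch = [rng.choice(domain_docs) for _ in range(n_domain)]
    batch += [rng.choice(general_docs) for _ in range(batch_size - n_domain)]
    rng.shuffle(batch)
    return batch

batch = mixed_batch(["med_1", "med_2"], ["gen_1", "gen_2"],
                    rng=random.Random(0))
# 6 of the 8 slots come from the medical corpus, 2 from the general corpus
```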

Examples: Code Llama = LLaMA 2 + billions of code tokens. Medical models (BioMedLM, Med-PaLM 2) = general models + medical corpora.

Train From Scratch vs Continual Pretraining

Use continual pretraining when the domain is still fundamentally natural language: medicine, law, code, finance. You preserve general reasoning and language abilities while adding domain knowledge.

Train from scratch only when you need a different tokenizer, the data distribution differs fundamentally from natural language, or the general model's priors are a liability.

Annealing: The Late-Stage Refinement Phase

At the end of training, switch to a carefully curated high-quality dataset (textbooks, scientific papers, verified content) and drop the learning rate dramatically (e.g., 3e-5 -> 1e-6, a 30x reduction). This is annealing.
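A sketch of such a schedule using the numbers from the text (a flat main phase is a simplification; the 2% annealing window is one illustrative choice):

```python
def learning_rate(step, total_steps, main_lr=3e-5, final_lr=1e-6,
                  anneal_frac=0.02):
    """Main-phase LR, then linear decay over the final 2% of steps."""
    anneal_start = round(total_steps * (1 - anneal_frac))
    if step < anneal_start:
        return main_lr
    t = (step - anneal_start) / (total_steps - anneal_start)  # 0 -> 1
    return main_lr + t * (final_lr - main_lr)

steps = 100_000
assert learning_rate(0, steps) == 3e-5           # main phase
assert abs(learning_rate(steps, steps) - 1e-6) < 1e-9  # fully annealed
```

The data switch matters as much as the LR drop: the annealing window is when the curated high-quality set replaces the main mixture.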

Interactive: Synthetic Data Generation Flow

Click each step in the synthetic math generation pipeline.

Risks: annealing data selection is critical. The wrong data can hurt. Over-annealing can cause the model to overfit the small high-quality set. Phi models and DeepSeek use synthetic math problems generated by strong models as annealing fuel -- proven to boost benchmark scores but with ongoing concerns about distribution narrowness.

Check Your Understanding

1. Why is aggressive curriculum learning not commonly used at frontier scale?
Correct: Empirically it does not produce consistent gains; random shuffling with quality filtering works as well
2. What problem do PI and YaRN solve?
Correct: Extending context length beyond what was trained, without catastrophic quality loss
3. What is catastrophic forgetting in continual pretraining?
Correct: Loss of general capabilities when training on narrow domain data
4. What is annealing in LLM training?
Correct: A final phase on high-quality curated data with dramatically reduced learning rate
5. When should you train from scratch vs use continual pretraining?
Correct: Continual pretraining for natural-language domains (medicine, law, code); from scratch only when tokenizer / context / data distribution differ fundamentally

Teach It Back

Explain to a friend: Why doesn't curriculum learning work well at frontier scale, how do techniques like positional interpolation and YaRN extend context length after training, what is continual pretraining vs fine-tuning, and why do labs add an annealing phase at the end?

An AI tutor will compare your explanation against the course material.


Flashcards (click to flip)

Why is curriculum learning mostly unused at frontier scale?
Click to reveal
Empirical tests (Kim & Lee 2024 on Mistral/Gemma, 200+ runs at 0.5-1B scale) show no significant gains from ordered data. Random shuffling + quality filter + domain mixing matches or beats curriculum schemes while being simpler.
How does Positional Interpolation extend context?
Click to reveal
Compress new positions into the trained range. For RoPE: slow down the rotations by target_len/trained_len. Brief fine-tuning recovers quality. Took the original LLaMA from 2K to 32K context with modest compute.
What does YaRN improve over PI?
Click to reveal
YaRN separates RoPE frequency components into groups. High-frequency (local) components keep their rotation; low-frequency (long-range) components get more compression. Fewer fine-tuning tokens, better quality retention.
What is continual pretraining?
Click to reveal
Continue training a pretrained model on domain-specific data using the same next-token objective. Risk: catastrophic forgetting of general skills. Fix: mix general data (70/30), lower LR, monitor general benchmarks.
Continual pretraining vs training from scratch?
Click to reveal
Continual: domain is still natural language (medicine, law, code). From scratch: need different tokenizer, fundamentally different distribution, or general priors are a liability.
What is annealing?
Click to reveal
Final phase of pretraining (1-5% of budget): switch to curated high-quality data (textbooks, verified sources), drop LR by 10-30x. Boosts math/reasoning benchmarks for smaller models; diminishing returns at 400B+.