Beyond random shuffling: can you train smarter, not longer?

Think first
Children learn counting before calculus. Should LLMs be taught simple text first and complex text later? Your intuition says yes. The empirical answer at frontier scale says... mostly no. Why?

Curriculum learning is intuitive but hard to make work at frontier scale. Random shuffling with high-quality filtering beats most curriculum schemes in practice. This module covers three advanced techniques that do work: long-context extension, continual pretraining, and annealing.

Curriculum Learning: What Worked, What Didn't

The idea: structure the order in which training data is presented, rather than shuffling it randomly. Three families of curriculum schemes were tried.

Consensus by 2024: aggressive curriculum is not worth it at frontier scale. Kim & Lee (2024) tested curricula on Mistral-7B and Gemma-7B and found no significant gains. What does help: starting with slightly easier data for the first 10-20% of training (a warmup curriculum), then switching to the full mixture. This adds stability without the complexity of a full curriculum.

What actually works at frontier scale

1. Quality-filter upfront.
2. Tune domain mixing ratios.
3. Upsample high-quality sources throughout.

These three practices became the standard replacement for explicit curriculum.
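Upsampling high-quality sources amounts to weighted sampling over corpora. A minimal sketch, with source names and weights that are purely illustrative (not any lab's real mixture):

```python
import random

# Illustrative mixing weights: higher weight means the source is upsampled
# relative to its raw size. These numbers are made up for demonstration.
SOURCE_WEIGHTS = {
    "web_filtered": 1.0,   # quality-filtered web text, baseline
    "wikipedia":    3.0,   # upsampled ~3x throughout training
    "books":        2.5,
}

def sample_source(rng):
    """Pick the source of the next training document per the mixing ratios."""
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
# wikipedia (weight 3.0) is drawn roughly 3x as often as web_filtered (1.0)
```

In a real pipeline the weights are tuned empirically against downstream benchmarks, and sampling happens at the shard or document level rather than per call.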

Long-Context Extension: The Positional Encoding Problem

Training at 128K context from scratch is dramatically more expensive than at 8K: self-attention cost grows quadratically with sequence length, so a 16x longer sequence costs roughly 256x the attention compute (16x per token). The standard approach: train at moderate context (8K or 16K), then extend.
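A back-of-envelope check of that cost claim, counting attention cost in arbitrary units:

```python
# Self-attention FLOPs scale with seq_len**2, so a 16x longer context costs
# 256x the attention compute per sequence -- i.e. 16x per token processed.
def attention_cost(seq_len: int) -> int:
    return seq_len ** 2  # arbitrary units; constants dropped

short_ctx, long_ctx = 8 * 1024, 128 * 1024
per_seq = attention_cost(long_ctx) // attention_cost(short_ctx)  # 256
per_token = per_seq // (long_ctx // short_ctx)                   # 16
assert per_seq == 256 and per_token == 16
```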

The obstacle is positional encodings: they were only trained on the positions seen during pretraining. Naive extrapolation produces garbage because the model is fed position values (for RoPE, rotation angles) far outside the distribution it ever saw.

Positional Interpolation (PI) and YaRN

Positional Interpolation (Chen et al. 2023): instead of extrapolating past the trained positions, compress the target sequence into the trained range by scaling the position indices. For RoPE, this means slowing down the rotations by a scale factor equal to target_length / trained_length. Brief fine-tuning at the new length recovers quality. The original LLaMA models were extended from 2K to 32K context this way.
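A minimal numpy sketch of PI applied to RoPE angles (the dimension count and base are typical defaults, not tied to a specific model): dividing positions by the scale factor maps out-of-range positions back inside the trained range.

```python
import numpy as np

def rope_angles(position, dim=64, base=10000.0, scale=1.0):
    """Rotation angles for one position across RoPE dimension pairs.
    With scale > 1, positions are compressed (Positional Interpolation)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # per-pair frequency
    return (position / scale) * freqs

trained_len, target_len = 2048, 32768
scale = target_len / trained_len  # 16.0

# Position 30000 is far outside the trained range, but under PI its angles
# equal those of position 1875 -- well inside what the model has seen.
assert np.allclose(rope_angles(30000, scale=scale), rope_angles(1875))
```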

YaRN (Peng et al. 2023) improves on PI. Insight: RoPE encodes position through multiple frequency components. High-frequency components distinguish nearby tokens (do NOT compress). Low-frequency components encode long-range structure (can tolerate compression). YaRN splits RoPE dimensions into three groups with different scaling factors. Requires fewer fine-tuning tokens than PI.
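A hedged sketch of YaRN's per-dimension idea (function and parameter names are ours, not the paper's, and this omits YaRN's attention-temperature term): each dimension pair's compression depends on how many full rotations it completes within the trained context.

```python
import numpy as np

def yarn_adjusted_freqs(dim=64, base=10000.0, trained_len=8192,
                        scale=8.0, low=1.0, high=32.0):
    """Per-dimension RoPE frequencies after YaRN-style interpolation."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    rotations = trained_len * freqs / (2 * np.pi)  # rotations in trained ctx
    # ramp: 1 -> keep full resolution (high freq), 0 -> full interpolation
    keep = np.clip((rotations - low) / (high - low), 0.0, 1.0)
    per_dim_scale = keep + (1 - keep) * scale
    return freqs / per_dim_scale

adj = yarn_adjusted_freqs()
orig = 10000.0 ** (-np.arange(0, 64, 2) / 64)
assert np.isclose(adj[0], orig[0])           # highest frequency: untouched
assert np.isclose(adj[-1], orig[-1] / 8.0)   # lowest frequency: fully scaled
```

The `low`/`high` rotation thresholds control where the blend happens; the values above are illustrative.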

Limits of context extension

Extended models are slower at inference (still quadratic) and are often less effective at long-context reasoning than models trained on long context from scratch. Sweet spot: train at 8K-16K, extend to 128K with YaRN or PI plus continued training.

Continual Pretraining and Catastrophic Forgetting

Take an existing pretrained model and continue training it on domain-specific data. This differs from fine-tuning: continual pretraining keeps the same next-token objective on raw text, just with new data, rather than training on a narrow downstream task or instruction format.

Interactive: Curriculum / Continual Pretraining Visualizer

Slide through the training timeline to see how different strategies order data.

Catastrophic forgetting is the main risk. If you train a general LLaMA on pure medical text, its general knowledge degrades. Solutions: mix general data back into the stream (e.g., a 70/30 domain/general split), use a lower learning rate, and monitor general benchmarks throughout training.
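A sketch of the general-data replay mitigation, using a 70/30 domain/general split (the ratio is illustrative, not a universal constant):

```python
import random

def mixed_batch(domain_docs, general_docs, batch_size=8,
                domain_frac=0.7, rng=random):
    """Build a batch of ~70% domain text and ~30% general replay data."""
    n_domain = round(batch_size * domain_frac)
    batch = [rng.choice(domain_docs) for _ in range(n_domain)]
    batch += [rng.choice(general_docs) for _ in range(batch_size - n_domain)]
    rng.shuffle(batch)
    return batch

batch = mixed_batch(["med_1", "med_2"], ["gen_1", "gen_2"],
                    rng=random.Random(0))
# 6 of the 8 slots come from the medical corpus, 2 from the general corpus
```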

Examples: Code Llama = LLaMA 2 + billions of code tokens. Medical models (BioMedLM, Med-PaLM 2) = general models + medical corpora.

Train From Scratch vs Continual Pretraining

Use continual pretraining when the domain is still fundamentally natural language: medicine, law, code, finance. You preserve general reasoning and language abilities while adding domain knowledge.

Train from scratch only when you need a different tokenizer, the data distribution differs fundamentally from natural language, or the general model's priors are a liability.

Annealing: The Late-Stage Refinement Phase

At the end of training, switch to a carefully curated high-quality dataset (textbooks, scientific papers, verified content) and drop the learning rate dramatically (e.g., 3e-5 -> 1e-6, a 30x reduction). This is annealing.
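A sketch of such a schedule using the numbers from the text (a flat main phase is a simplification; the 2% annealing window is one illustrative choice):

```python
def learning_rate(step, total_steps, main_lr=3e-5, final_lr=1e-6,
                  anneal_frac=0.02):
    """Main-phase LR, then linear decay over the final 2% of steps."""
    anneal_start = round(total_steps * (1 - anneal_frac))
    if step < anneal_start:
        return main_lr
    t = (step - anneal_start) / (total_steps - anneal_start)  # 0 -> 1
    return main_lr + t * (final_lr - main_lr)

steps = 100_000
assert learning_rate(0, steps) == 3e-5           # main phase
assert abs(learning_rate(steps, steps) - 1e-6) < 1e-9  # fully annealed
```

The data switch matters as much as the LR drop: the annealing window is when the curated high-quality set replaces the main mixture.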

Interactive: Synthetic Data Generation Flow

Click each step in the synthetic math generation pipeline.

Risks: annealing data selection is critical. The wrong data can hurt. Over-annealing can cause the model to overfit the small high-quality set. Phi models and DeepSeek use synthetic math problems generated by strong models as annealing fuel -- proven to boost benchmark scores but with ongoing concerns about distribution narrowness.

Check Your Understanding

1. Why is aggressive curriculum learning not commonly used at frontier scale?
Correct: Empirically it does not produce consistent gains; random shuffling with quality filtering works as well
2. What problem do PI and YaRN solve?
Correct: Extending context length beyond what was trained, without catastrophic quality loss
3. What is catastrophic forgetting in continual pretraining?
Correct: Loss of general capabilities when training on narrow domain data
4. What is annealing in LLM training?
Correct: A final phase on high-quality curated data with dramatically reduced learning rate
5. When should you train from scratch vs use continual pretraining?
Correct: Continual pretraining for natural-language domains (medicine, law, code); from scratch only when tokenizer / context / data distribution differ fundamentally

Teach It Back

Explain to a friend: Why doesn't curriculum learning work well at frontier scale, how do techniques like positional interpolation and YaRN extend context length after training, what is continual pretraining vs fine-tuning, and why do labs add an annealing phase at the end?

An AI tutor will compare your explanation against the course material.


Flashcards (click to flip)

Why is curriculum learning mostly unused at frontier scale?
Click to reveal
Empirical tests (Kim & Lee 2024 on Mistral/Gemma, 200+ runs at 0.5-1B scale) show no significant gains from ordered data. Random shuffling + quality filter + domain mixing matches or beats curriculum schemes while being simpler.
How does Positional Interpolation extend context?
Click to reveal
Compress new positions into the trained range. For RoPE: slow down the rotations by target_len/trained_len. Brief fine-tuning recovers quality. Took the original LLaMA from 2K to 32K context with modest compute.
What does YaRN improve over PI?
Click to reveal
YaRN separates RoPE frequency components into groups. High-frequency (local) components keep their rotation; low-frequency (long-range) components get more compression. Fewer fine-tuning tokens, better quality retention.
What is continual pretraining?
Click to reveal
Continue training a pretrained model on domain-specific data using the same next-token objective. Risk: catastrophic forgetting of general skills. Fix: mix general data (70/30), lower LR, monitor general benchmarks.
Continual pretraining vs training from scratch?
Click to reveal
Continual: domain is still natural language (medicine, law, code). From scratch: need different tokenizer, fundamentally different distribution, or general priors are a liability.
What is annealing?
Click to reveal
Final phase of pretraining (1-5% of budget): switch to curated high-quality data (textbooks, verified sources), drop LR by 10-30x. Boosts math/reasoning benchmarks for smaller models; diminishing returns at 400B+.