LLaMA 3: The most documented frontier training run in history

Meta's LLaMA 3 technical report is 92 pages and documents nearly every decision of a 16,384-GPU training run that produced the 8B, 70B, and 405B models. This module uses it as a capstone to revisit every pretraining concept from Modules 10-16.

Why this case study?

Most frontier model training is proprietary. Meta's report is the clearest public window into what actually happens at scale. Every choice you see here was made with billions of dollars and reputations on the line.

The Design Philosophy: Data, Scale, Simplicity

Meta identified three levers:

  1. Data quality and diversity would drive the biggest gains, more than architectural innovation.
  2. Scale beyond previous LLaMA models (405B vs LLaMA 2's 70B).
  3. Managing complexity -- consistently pick simpler, proven approaches over novel ones.

Biggest conservative choice: stick with a dense transformer rather than Mixture-of-Experts (MoE). DeepSeek V3 later achieved SOTA with ~10x less compute using MoE. But MoE adds routing complexity, load balancing issues, and harder-to-debug training dynamics. On 16,000 GPUs, stability matters more than efficiency.

Scaling Law Decisions and Overtraining by Design

Meta used a two-step scaling law approach:

  1. Predict loss from compute using small-scale experiments (standard Chinchilla).
  2. Map predicted loss to benchmark accuracy via a sigmoid -- because benchmark scores follow an S-curve, not a linear curve.

This predicted that a 402B model trained on 16.5T tokens would be compute-optimal for their budget. But they deliberately overtrained the 8B model -- it saw ~15T tokens (~1,875 tokens per parameter, nearly 100x Chinchilla). Reason: inference cost matters, and a well-trained small model is the sweet spot for deployment.
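The two-step approach can be sketched in a few lines. Everything below is illustrative: the small-run compute/loss pairs, the sigmoid parameters, and the ~6ND FLOPs estimate are stand-ins, not Meta's fitted values.

```python
import numpy as np

# Step 1: fit a power law loss(C) = a * C^(-b) to small-scale runs.
# These compute/loss pairs are illustrative, not Meta's measurements.
compute = np.array([1e19, 1e20, 1e21, 1e22])   # training FLOPs
loss = np.array([2.9, 2.5, 2.2, 2.0])          # validation loss

# A power law is a straight line in log-log space: log L = log a - b log C
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)

def predict_loss(C):
    return 10 ** (intercept + slope * np.log10(C))

# Step 2: map predicted loss to benchmark accuracy with a sigmoid,
# because benchmark scores trace an S-curve in loss, not a straight line.
# In practice the sigmoid parameters are fit to (loss, accuracy) pairs.
def sigmoid_accuracy(L, k=8.0, mid=2.1, floor=0.25, ceil=0.95):
    return floor + (ceil - floor) / (1 + np.exp(k * (L - mid)))

flops_405b = 6 * 405e9 * 15e12        # ~6ND estimate: 405B params, 15T tokens
pred_loss = predict_loss(flops_405b)  # extrapolated far beyond the fit range
pred_acc = sigmoid_accuracy(pred_loss)
```

The extrapolation spans several orders of magnitude beyond the fitted points, which is exactly why Meta validated the method against earlier LLaMA generations.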

Key numbers for the run:

  - 405B parameters in the largest model
  - 15T tokens in the main pretraining phase
  - 16,384 H100 GPUs
  - 54 days in the observed training window
  - 466 unplanned interruptions
  - 400 TFLOPs/GPU sustained throughput

Data Engineering: The Bootstrap Approach

Meta used LLaMA 2 to help build LLaMA 3's training data: LLaMA 2 powered the quality classifiers that filtered the web corpus, creating a bootstrap loop of better models -> better data -> even better models. The final mix was approximately 50% general knowledge, 25% mathematical and reasoning data, 17% code, and 8% multilingual text.

The mix was not static. Meta adjusted during training: more math later for reasoning, more recent web data to push knowledge cutoff forward, downsampled weaker subsets identified by quality metrics.
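A dynamic mix like this amounts to weighted source sampling whose weights shift mid-run. A minimal sketch, with illustrative weights (not Meta's actual proportions):

```python
import random

# Weighted source sampling with a mid-training adjustment. The weights
# are illustrative, not Meta's actual proportions.
mix = {"web": 0.6, "code": 0.2, "math": 0.1, "multilingual": 0.1}

def sample_source(mix, rng=random):
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Later in training: shift weight from web to math to boost reasoning,
# keeping the total normalized.
mix["math"] += 0.05
mix["web"] -= 0.05
```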

Meta's key finding

Filtering was more valuable than collecting. Their quality classifiers produced the largest gains of any single intervention -- more than architectural changes, more than optimizer tweaks, more than learning rate tuning.
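The shape of the pipeline is simple even if the classifiers are not. Below is a minimal sketch of classifier-based filtering; the scorer is a toy heuristic standing in for the trained quality classifiers Meta built on LLaMA 2.

```python
# Sketch of classifier-based filtering. The scorer is a toy heuristic
# standing in for the trained quality classifiers Meta built on LLaMA 2.
def quality_score(doc: str) -> float:
    words = doc.split()
    if not words:
        return 0.0
    unique_ratio = len(set(words)) / len(words)   # penalize repetition
    length_score = min(len(words) / 100, 1.0)     # penalize short stubs
    return 0.5 * unique_ratio + 0.5 * length_score

def filter_corpus(docs, threshold=0.4):
    # Keep only documents scoring above the quality cutoff.
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "spam spam spam spam spam",
    "A well-written article with varied vocabulary and enough length "
    "to look like real prose. " * 5,
]
kept = filter_corpus(docs)   # only the second document survives
```

The real system replaces `quality_score` with a learned model, but the aggressive keep/drop decision at the end is the part that delivered the gains.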

Architecture and Parallelism

The 405B model is a dense transformer with 126 layers, grouped-query attention, and a 128K-token vocabulary.

4D parallelism on the 405B model:

TP=8 within each node x PP=16 across nodes = 128 GPUs to hold one copy of the model; FSDP provides the data-parallel dimension across replicas, and CP=16 (context parallelism) was enabled only during the long-context phase. Sustained throughput of 400 TFLOPs/GPU is ~40% of the H100's theoretical BF16 peak -- excellent for distributed training at this scale.
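The arithmetic behind these numbers is worth checking. A quick sketch (the parallelism degrees come from the report; the H100 peak figure is an approximation):

```python
# Sanity-check the parallelism arithmetic for the 405B run. The degrees
# come from the report; the H100 peak figure is an approximation.
TP, PP, CP = 8, 16, 16         # tensor, pipeline, context parallel degrees
total_gpus = 16_384

gpus_per_replica = TP * PP                   # GPUs holding one model copy
dp_degree = total_gpus // gpus_per_replica   # FSDP replicas during pretraining

layers_per_stage = 126 / PP    # 126 transformer layers over 16 pipeline stages

peak_tflops = 989              # rough H100 BF16 dense peak
mfu = 400 / peak_tflops        # sustained 400 TFLOPs/GPU -> ~40% utilization
```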

The Training Recipe: Three Phases

Phase 1: main pretraining at 8K context on ~15T tokens. Phase 2: gradual extension of the context window from 8K to 128K across six stages. Phase 3: annealing on high-quality data over the final 40M tokens with a sharply reduced learning rate, followed by checkpoint averaging.
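The six-stage context extension from 8K to 128K can be sketched with geometrically spaced stage lengths. Geometric spacing is an illustrative assumption here; the report specifies six stages but this is not its published per-stage schedule.

```python
# Phase 2 grows context from 8K to 128K in six stages. Geometric spacing
# below is an illustrative assumption, not Meta's published schedule.
start, end, stages = 8_192, 131_072, 6
ratio = (end / start) ** (1 / stages)   # 16x total growth ~= 1.59x per stage

schedule = [round(start * ratio ** i) for i in range(1, stages + 1)]
# schedule[-1] lands on 131,072 (128K)
```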

Evaluation, Fault Tolerance, Small-Model-First

Evaluation innovation: Meta used scaling laws not just to predict loss but to predict benchmark performance, extrapolated over four orders of magnitude of compute. Their predictions slightly underestimated final performance, which was a pleasant surprise.

Fault tolerance: 466 interruptions in 54 days, ~78% hardware-related. Meta achieved 90%+ effective training time through automated failure detection, redundant checkpoint storage (local SSD + network filesystem + cloud), and silent data corruption detection.
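A back-of-envelope check shows these numbers are mutually consistent. The per-interruption restart cost below is an assumed figure, not from the report:

```python
# Back-of-envelope: are 466 interruptions compatible with 90%+ effective
# training time? The 15-minute restart cost is an assumption, not reported.
days, interruptions = 54, 466
restart_minutes = 15                       # detect failure + reload checkpoint

lost_hours = interruptions * restart_minutes / 60
effective = 1 - lost_hours / (days * 24)   # fraction of wall clock training
```

At roughly 8-9 interruptions per day, keeping recovery down to minutes rather than hours is exactly what makes 90%+ effective time achievable.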

Small-model-first pattern: scaling law experiments, data mix experiments, and annealing experiments all ran on 8B models before committing to 405B scale. A bad decision on 8B costs thousands. The same decision on 405B costs millions.

Secondary use of annealing: Meta discovered they could use annealing runs to quickly evaluate new datasets. Take a 50% trained 8B model, anneal on 70% default + 30% new data, compare benchmark changes. Not as rigorous as full scaling experiments but much cheaper for early screening.
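The screening loop can be sketched as follows; `anneal_fn` and `eval_fn` are hypothetical stand-ins for a real training and evaluation harness.

```python
# Sketch of annealing-based dataset screening. `anneal_fn` and `eval_fn`
# are hypothetical stand-ins for a real training/evaluation harness.
def screen_dataset(base_ckpt, default_data, candidate_data,
                   anneal_fn, eval_fn):
    # Anneal the half-trained 8B checkpoint on the default mix alone...
    baseline = eval_fn(anneal_fn(base_ckpt, [(default_data, 1.0)]))
    # ...and on 70% default + 30% candidate, then compare benchmarks.
    mixed = eval_fn(anneal_fn(base_ckpt, [(default_data, 0.7),
                                          (candidate_data, 0.3)]))
    return {bench: mixed[bench] - baseline[bench] for bench in baseline}
```

A positive delta on the benchmarks you care about flags the candidate dataset for a fuller scaling experiment; a flat or negative delta lets you discard it cheaply.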

Check Your Understanding

1. Why did LLaMA 3 stick with a dense transformer instead of MoE?
Correct: On 16K GPUs, stability and debuggability mattered more than the efficiency gains of MoE
2. What was LLaMA 3 8B's token-to-parameter ratio?
Correct: ~1,875 (nearly 100x Chinchilla)
3. What was Meta's biggest-impact data intervention?
Correct: Building aggressive quality classifiers for web data
4. What was the 4D parallelism config for 405B?
Correct: TP=8, PP=16, CP=16, FSDP
5. Why run experiments on 8B before 405B?
Correct: A bad decision on 8B costs thousands; on 405B it costs millions. Small-model-first is insurance.

Teach It Back

Explain to a friend: What made LLaMA 3 a successful frontier training run? Cover the design philosophy (data + scale + simplicity), the three-phase training recipe, the 4D parallelism setup, how Meta handled 466 interruptions, and why small-model-first experimentation is critical.

An AI tutor will compare your explanation against the course material.


Flashcards (click to flip)

LLaMA 3 three-phase training?
Phase 1: main pretraining at 8K context on 15T tokens. Phase 2: gradual long-context extension from 8K to 128K in six stages. Phase 3: annealing (40M tokens) on high-quality data with sharply reduced LR, then checkpoint averaging.
Why overtrain the 8B model?
At ~1,875 tokens/param (vs Chinchilla's ~20), the 8B is heavily overtrained. But an 8B is far cheaper to serve than a 70B, so spending extra training compute to make the small model stronger is a winning trade for deployment.
Meta's 4D parallelism config for 405B?
TP=8 within each NVSwitch node; PP=16 across nodes splitting 126 layers into 16 stages; FSDP data-parallel on top; CP=16 enabled only during long-context phase. 128 GPUs hold one copy of the model.
Meta's key data finding?
Filtering beats collecting. Quality classifiers built on LLaMA 2 produced the single largest gains of any intervention. Better filter -> better data -> better model -> better filter (bootstrap).
How did Meta survive 466 interruptions in 54 days?
Automated failure detection + aggressive checkpointing to redundant storage + silent data corruption detection. 90%+ effective training time maintained despite ~8-9 interruptions per day (~78% hardware-related).
Annealing results on LLaMA 3?
Huge gains on the 8B model for GSM8K and MATH. Barely any movement on the 405B model. Hypothesis: larger models already capture these gains during main pretraining, so annealing has diminishing returns at frontier scale.