GPT-3 has 175 billion parameters. In mixed precision, that is 350 GB just for the weights. The H100 GPU has 80 GB of memory. Before you even start thinking about training, how do you fit the model?
You don't fit it. You split it across many GPUs and have them coordinate. This is parallelism, and there are four main flavors. Modern frontier training runs combine all of them, a setup called "4D parallelism".
Concept 2 of 9
The Memory Budget
For a 175B model training in mixed precision with AdamW:
Interactive: Memory Breakdown Calculator
Adjust the model size to see how memory requirements scale.
350 GB
Weights (BF16)
350 GB
Gradients
2100 GB
Optim. states (FP32)
~60 GB
Activations
~2860 GB
Total
~36
H100s (80GB) minimum
Rule of thumb: mixed precision + AdamW needs ~16 bytes per parameter (2 for BF16 weights + 2 for gradients + 12 for FP32 optimizer state: master weights, momentum, and variance), plus activations on top. This is the reason parallelism is mandatory for any model above ~5B parameters.
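The calculator's arithmetic can be reproduced in a few lines. A minimal sketch — the fixed 60 GB activation estimate is taken from the figures above; real activation memory depends on batch size, sequence length, and activation checkpointing:

```python
def training_memory_gb(params_billion: float, activations_gb: float = 60.0) -> dict:
    """Rough memory budget for mixed-precision AdamW training.

    Bytes per parameter: 2 (BF16 weights) + 2 (BF16 grads)
    + 12 (FP32 master weights, momentum, variance) = 16,
    plus an activation estimate on top.
    """
    p = params_billion  # billions of params -> bytes/1e9 = GB
    budget = {
        "weights_bf16": 2 * p,
        "gradients": 2 * p,
        "optimizer_fp32": 12 * p,
        "activations": activations_gb,
    }
    budget["total"] = sum(budget.values())
    budget["min_h100s"] = -(-budget["total"] // 80)  # ceil-divide by 80 GB
    return budget

b = training_memory_gb(175)
print(b["total"])      # 2860.0 GB
print(b["min_h100s"])  # 36.0
```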
Concept 3 of 9
Data Parallelism (DP)
Replicate the full model on every GPU. Each GPU processes a different batch. After the backward pass, all GPUs exchange gradients via all-reduce so they end up with the same averaged gradient; each then applies the same weight update.
Pros: simplest, near-linear scaling in throughput until comm becomes the bottleneck.
Cons: every GPU still stores the full model. Does not solve the "fits on GPU" problem by itself.
ZeRO (DeepSpeed, Microsoft) and FSDP (PyTorch) are enhanced forms of data parallelism that shard the optimizer states (ZeRO stage 1), gradients (stage 2), and even the parameters themselves (stage 3 / FSDP) across DP ranks. Each GPU materializes only the slice it owns at any given time; communication goes up, but memory per GPU goes way down.
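The core of plain data parallelism is the gradient exchange. A toy version, with plain Python lists standing in for tensors and a simple sum-and-divide standing in for the NCCL all-reduce:

```python
# Toy data parallelism: 4 "GPUs" each compute gradients on their own
# micro-batch, then an all-reduce averages them so every replica
# applies the identical weight update.
def all_reduce_mean(grads_per_rank):
    n = len(grads_per_rank)
    dim = len(grads_per_rank[0])
    summed = [sum(g[i] for g in grads_per_rank) for i in range(dim)]
    return [s / n for s in summed]

local_grads = [
    [1.0, 2.0],  # rank 0's gradient
    [3.0, 4.0],  # rank 1
    [5.0, 6.0],  # rank 2
    [7.0, 8.0],  # rank 3
]
avg = all_reduce_mean(local_grads)
print(avg)  # [4.0, 5.0] -- identical on every rank after the all-reduce
```

ZeRO/FSDP keeps this same averaging but stores only 1/N of the optimizer state (and optionally gradients and parameters) on each of the N ranks.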
Concept 4 of 9
Tensor Parallelism (TP)
Introduced by Megatron-LM (NVIDIA, 2019). Individual layer weight matrices are split across GPUs. For the MLP (which expands the hidden size, e.g. 768 -> 3072, then contracts 3072 -> 768), the first matrix is split column-wise across 4 GPUs (each holds 3072/4 = 768 columns) and the second row-wise. Each GPU's local matmuls produce a partial output, and a single all-reduce at the end of the MLP sums the partials into the full result.
Attention parallelizes even more naturally: if you have 96 heads and 8 GPUs, give each GPU 12 heads. No communication needed inside a head; only one all-reduce at the end of the block.
Pros: large memory savings, scales the usable model size.
Cons: heavy communication every single layer. Requires fast interconnect (NVLink ~900 GB/s within a node). Does not extend well across nodes.
Concept 5 of 9
Pipeline Parallelism (PP)
Split the model by layer depth, not width. GPU 0 holds layers 1-12, GPU 1 holds layers 13-24, etc. Data flows through in a pipeline: GPU 0 processes layer 1, hands off activations, then starts the next micro-batch.
The problem is pipeline bubbles: at the start and end of each batch, some GPUs sit idle while the pipeline fills or drains. GPipe runs the full forward pass before any backward (simple, but a big bubble). PipeDream interleaves forward and backward passes (the 1F1B schedule) to shrink the bubble.
Pros: each GPU only stores the layers it owns; lower communication per step than TP. Cons: pipeline bubbles waste compute at batch boundaries unless the batch is split into many micro-batches.
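The bubble cost has a simple closed form for a GPipe-style schedule: with p pipeline stages and m micro-batches, each stage is idle for p-1 slots out of m+p-1, so the bubble fraction is (p-1)/(m+p-1). A quick sketch:

```python
# Pipeline bubble fraction for a GPipe-style schedule:
# p stages, m micro-batches -> bubble = (p - 1) / (m + p - 1).
def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_fraction(4, 4))   # ~0.43 -- with few micro-batches, ~43% idle
print(bubble_fraction(4, 32))  # ~0.09 -- more micro-batches shrink the bubble
```

This is why pipeline parallelism is always paired with micro-batching: pushing m well above p is what makes the bubble tolerable.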
Concept 7 of 9
Sequence, Context, and 4D Parallelism
Sequence parallelism shards activations along the sequence dimension. With 8 GPUs and an 8192-token sequence, each GPU holds activations for 1024 tokens. Addresses regions that tensor parallelism leaves replicated (LayerNorm, Dropout).
Context parallelism (Ring Attention) extends this to million-token contexts. Each GPU holds a chunk of the sequence and passes keys/values around in a ring, so no GPU ever materializes the full attention matrix.
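A stripped-down sketch of the ring pattern. Note this shows only the communication schedule; real Ring Attention merges the partial attention outputs with an online softmax, which is omitted here in favor of raw dot-product accumulation:

```python
# Ring-style context parallelism, reduced to the communication pattern:
# each of 4 "GPUs" owns one chunk of queries and one chunk of keys/values.
# KV chunks rotate around the ring; after n_ranks steps every rank has
# combined its queries with the FULL sequence without ever holding it.
n_ranks = 4
q  = [1.0, 2.0, 3.0, 4.0]      # one query scalar per rank (toy stand-in)
kv = [10.0, 20.0, 30.0, 40.0]  # one key/value chunk per rank

acc = [0.0] * n_ranks
held = list(kv)  # each rank starts with its own KV chunk
for _ in range(n_ranks):
    for r in range(n_ranks):
        acc[r] += q[r] * held[r]  # local partial "attention" work
    held = [held[(r + 1) % n_ranks] for r in range(n_ranks)]  # ring pass

# Every rank has now seen all four KV chunks:
print(acc)  # [100.0, 200.0, 300.0, 400.0] == q_r * sum(kv)
```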
Real frontier runs combine multiple strategies. LLaMA 3 405B used:
Tensor Parallelism TP=8 within each node (NVLink bandwidth)
Pipeline Parallelism PP=16 across nodes
Data Parallelism via FSDP across everything
Context Parallelism CP=16 for long-context training
Total: 16,384 H100 GPUs orchestrated by a global job scheduler. Sustained throughput of 400 TFLOPs per GPU is ~40% of the H100's theoretical BF16 peak, which is excellent for distributed training.
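As a sanity check on that utilization number (assuming the H100's dense BF16 peak of roughly 989 TFLOPs, the figure NVIDIA quotes without sparsity):

```python
# Model FLOPs utilization (MFU) for the run described above.
peak_tflops = 989       # assumed H100 dense BF16 peak
sustained_tflops = 400  # per-GPU throughput reported for the run
mfu = sustained_tflops / peak_tflops
print(f"{mfu:.0%}")     # ~40%
```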
Concept 8 of 9
Fault Tolerance and Real-World Ops
Training frontier models is a firefight. LLaMA 3's 54-day training window saw 466 unexpected interruptions (~8-9 per day). Roughly 78% were hardware-related: bad GPUs, network blips, memory errors.
Survival strategies:
Aggressive checkpointing: every 500 steps during the unstable first ~1000 steps, then every 2000-5000 steps once training stabilizes. Keep the last 3-5 checkpoints in case the most recent is corrupted.
Silent data corruption detection. Sometimes a GPU produces wrong results without throwing an error. Canary training steps, gradient norm monitoring, and periodic cross-GPU verification catch these.
Redundant storage. LLaMA 3 wrote checkpoints to local SSD, network filesystem, and cloud simultaneously.
A 70B model checkpoint is ~1.1 TB. Writing one has to be sharded and streamed; otherwise, just saving would dominate training time.
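Where the ~1.1 TB figure and the need for sharded writes come from, as back-of-envelope arithmetic. The ~16 bytes/param checkpoint size (full training state, as in the memory budget), the 2 GB/s SSD bandwidth, and the 64-writer shard count are illustrative assumptions:

```python
params = 70e9
ckpt_bytes = 16 * params  # full training state -> ~1.12e12 bytes ≈ 1.1 TB
write_bw = 2e9            # assume ~2 GB/s to a single local SSD

serial_minutes = ckpt_bytes / write_bw / 60
print(round(serial_minutes, 1))  # ~9.3 minutes if one process writes it all

shard_seconds = ckpt_bytes / write_bw / 64
print(round(shard_seconds, 1))   # under 10 s sharded across 64 parallel writers
```

At a checkpoint every 2000 steps, a 9-minute serial write is a visible tax; sharded and streamed, it nearly disappears.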
Andrej Karpathy's recipe
Reduce scope first. Debug on 1 GPU, then 8, then 64. Verify your model can overfit a tiny (32-example) batch before you scale up. Checkpoint aggressively early. Make failures reproducible with fixed random seeds. Systematic discipline beats firefighting.
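The "overfit a tiny batch" check works even in miniature. Here a one-parameter model is fit to four examples by plain gradient descent; if the loss does not collapse toward zero on something this small, the training loop has a bug:

```python
# Sanity check: a working training loop must drive training loss to ~0
# on a tiny dataset it can trivially memorize.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # true mapping: y = 2x
w, lr = 0.0, 0.01

for _ in range(500):
    # gradient of mean squared error for the model y_hat = w * x
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(w, loss)  # w -> ~2.0, loss -> ~0
```

The same discipline scales up: fixed seeds, one GPU, 32 examples, loss pinned to zero — only then add hardware.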
Concept 9 of 9
Check Your Understanding
1. Roughly how many bytes per parameter does mixed-precision + AdamW training need?
2. What does ZeRO / FSDP improve over plain data parallelism?
Correct: Memory: it shards optimizer states, gradients, and optionally parameters across DP ranks
3. Why does tensor parallelism require a fast interconnect like NVLink?
Correct: It performs an all-reduce on activations inside every layer, which would saturate slow interconnects
4. What is a pipeline bubble?
Correct: Idle time at the start and end of each batch while the pipeline fills or drains
5. How did LLaMA 3 405B handle 466 interruptions in 54 days?
Correct: Automated failure detection, aggressive checkpointing to redundant storage, and silent corruption detection
Teach It Back
Explain to a friend: Why can't you train a 175B model on one GPU, and how do data, tensor, pipeline, and context parallelism combine to make it possible? Include what ZeRO/FSDP does, why pipeline bubbles matter, and how frontier labs survive hundreds of hardware failures during a single training run.
An AI tutor will compare your explanation against the course material.
Flashcards (click to flip)
Memory math for LLM training?
Mixed precision + AdamW needs ~16 bytes per parameter: 2 (BF16 weights) + 2 (BF16 grads) + 12 (FP32 optimizer state: master weights, momentum, variance), plus activations. A 175B model needs ~2.9 TB total -> many GPUs required.
Data Parallelism in one sentence?
Every GPU holds the full model, processes a different batch, exchanges gradients via all-reduce after backward. Simple and scalable, but does not reduce per-GPU memory.
What is ZeRO / FSDP?
Sharded data parallelism. Stage 1 shards optimizer states, stage 2 adds gradients, stage 3 (FSDP) adds parameters themselves. Each GPU only materializes its slice; extra communication buys massive memory savings.
Tensor Parallelism vs Pipeline Parallelism?
TP splits individual layer matrices across GPUs (width-wise); needs fast interconnect because of per-layer all-reduce. PP splits layers (depth-wise); lower comm but introduces pipeline bubbles at batch boundaries.
Why combine all parallelism strategies?
Each addresses a different bottleneck. TP fits big layers; PP fits many layers; DP scales throughput; CP/SP handles long sequences. Frontier runs use 4D parallelism (e.g., LLaMA 3: TP=8, PP=16, CP=16, FSDP).
How is training kept alive through hardware failures?
Aggressive checkpointing to redundant storage, automated restart + latest-valid-checkpoint load, silent data corruption detection via canaries and gradient monitoring. LLaMA 3 maintained 90%+ effective training despite 466 interruptions in 54 days.