How does a brain of billions of numbers actually learn?
Think first
A model has 175 billion numbers (parameters). Every time it predicts the wrong word, we need to nudge these numbers so it predicts better next time. But how do we know which of the 175 billion numbers to change, and by how much?
Everything we've seen so far -- attention heads, embeddings, layers -- serves one single objective: predict the next word. There is no separate goal for grammar, reasoning, or knowledge. All of it emerges from this one task.
To predict "mat" after "The cat sat on the", the model must understand grammar (a noun comes after "the"), semantics (cats sit on things), and common sense (mats are things cats sit on). Grammar, meaning, and world knowledge are byproducts of prediction.
Key Insight
Learning happens through a three-step loop: (1) measure how wrong the prediction was, (2) compute how each parameter contributed to that wrongness, (3) nudge each parameter in the direction that would have made the prediction slightly better. Repeat trillions of times.
Concept 2 of 10
Step 1: Measuring Error with Cross-Entropy Loss
The model outputs a probability distribution over all ~50,000 possible next tokens. Cross-entropy loss measures how much probability mass landed on the correct token.
Loss = -log(P[correct_token])
-log(0.90) = 0.11 good prediction, low loss
-log(0.50) = 0.69 meh
-log(0.08) = 2.53 bad prediction, high loss
-log(0.001) = 6.91 terrible, very high loss
Why the logarithm? Because it converts products of probabilities (for whole sentences) into sums, and because it punishes confident wrong answers far more than unsure ones. A model that's 99.9% sure of the wrong answer deserves a much stronger correction than one that's 51% sure.
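The loss values above can be reproduced in a few lines of Python (a minimal sketch using only the standard library):

```python
import math

def cross_entropy_loss(p_correct: float) -> float:
    """Loss = -log(P[correct token]); explodes as the probability approaches zero."""
    return -math.log(p_correct)

for p in (0.90, 0.50, 0.08, 0.001):
    print(f"P = {p:<6} loss = {cross_entropy_loss(p):.2f}")
```

Note how the penalty grows non-linearly: dropping from 0.50 to 0.08 costs far more loss than dropping from 0.90 to 0.50.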
Interactive: Cross-Entropy Loss Calculator
Drag the slider to change the probability the model assigned to the correct token. See how the loss explodes as the probability approaches zero.
Common misconception
"Loss is the number of wrong predictions." No. Loss is continuous. A model that puts 49% probability on the correct answer is doing better than one that puts 10%, even though both technically picked the wrong top-1 token. Training pushes that 49 toward 51, then 60, then 90.
Concept 3 of 10
Step 2: Backpropagation -- The Blame Game
Now that we have a single number (loss) describing how wrong we were, we need to figure out which of the billions of parameters deserve the blame. This is backpropagation.
Starting from the loss and working backwards through the network, we compute a gradient for each parameter -- a number telling us "if you nudged this parameter up by a tiny bit, the loss would change by X". The gradient is a directional report card: which way to change, and how strongly.
Output layer: "Why did I prefer 'painting' over 'sunset'?"
Layer 47: "Which features made me lean toward painting?"
Layer 10: "How did my attention patterns contribute to those features?"
Embedding layer: "Were the initial word vectors even pointing in useful directions?"
Every component gets a gradient. The chain rule from calculus combines them by multiplication: the local slope at each layer composes into a global gradient for every parameter.
Mathematical depth
For a parameter W somewhere deep in the network, the gradient is dL/dW = dL/dy * dy/dh * dh/dz * dz/dW, where each term is the local derivative computed at that layer. Modern frameworks (PyTorch, JAX) build this computation graph automatically.
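The chain of local derivatives can be made concrete with a toy two-parameter scalar "network" in plain Python; the names (w, v) and the tanh activation are illustrative, not taken from any real model:

```python
import math

def forward_and_grad(w, v, x, t):
    """Forward pass z = w*x, h = tanh(z), y = v*h, L = (y - t)^2,
    then backward pass via the chain rule: dL/dw = dL/dy * dy/dh * dh/dz * dz/dw."""
    z = w * x
    h = math.tanh(z)
    y = v * h
    L = (y - t) ** 2
    # Each factor below is the local derivative computed at its layer.
    dL_dy = 2 * (y - t)      # derivative of (y - t)^2
    dy_dh = v                # derivative of v * h w.r.t. h
    dh_dz = 1 - h ** 2       # derivative of tanh
    dz_dw = x                # derivative of w * x w.r.t. w
    dL_dw = dL_dy * dy_dh * dh_dz * dz_dw
    return L, dL_dw

loss, grad = forward_and_grad(w=0.5, v=1.2, x=2.0, t=1.0)
```

PyTorch and JAX do exactly this multiplication for you, automatically, across billions of parameters.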
Concept 4 of 10
Step 3: Gradient Descent -- Taking Tiny Steps
Once we know the gradient for each parameter, we nudge it in the opposite direction (down the slope of the loss):
W_new = W_old - learning_rate * gradient
The learning rate (LR) controls how big each step is. Typical values for LLM pretraining: 1e-4 to 6e-4. This sounds tiny -- and it is -- but that's the point. If you made each parameter change dramatically on every update, the model would thrash around and forget everything it already learned. Small, consistent nudges accumulate into sophisticated behavior over trillions of updates.
Gradient descent is like walking downhill in fog. You can't see the valley, but you can feel the slope under your feet at each step. Take a small step downhill, re-sense the slope, step again. Steps that are too big bounce you across the valley; steps that are too small take forever.
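The update rule is easy to sketch on a toy one-parameter loss L(w) = (w - 3)^2, whose gradient is 2(w - 3):

```python
def gradient_descent(w=0.0, learning_rate=0.1, steps=100):
    """Repeatedly apply W_new = W_old - learning_rate * gradient."""
    for _ in range(steps):
        gradient = 2 * (w - 3)          # slope of (w - 3)^2 at the current w
        w = w - learning_rate * gradient
    return w

w_final = gradient_descent()            # converges toward the minimum at w = 3
```

Try `learning_rate=1.1`: each step overshoots farther than the last, which is exactly the "thrashing" failure mode described above.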
Concept 5 of 10
Batch Learning: Averaging Many Examples
We don't update weights after every single example. We process a batch of sequences in parallel (32, 64, 512, or far more), average the gradients across all of them, and then apply one update.
Three reasons:
GPUs love parallelism. A single example barely uses the hardware; a batch of 512 saturates it.
Averaged gradients are less noisy. One weird example might pull the model in a bad direction. Averaged with 511 others, the signal survives and the noise cancels.
Balanced updates. A batch containing multiple uses of "rise" (sun rises, bread rises, prices rise) produces a single update that respects all the meanings.
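The noise-averaging argument can be simulated directly. The Gaussian noise model below is a simplifying assumption (real gradient noise is not Gaussian), but the 1/sqrt(batch_size) shrinkage it demonstrates is the general behavior:

```python
import random

random.seed(0)

def noisy_gradient(true_grad=1.0, noise=2.0):
    """One per-example gradient estimate: the true gradient plus sampling noise."""
    return true_grad + random.gauss(0, noise)

def batch_gradient(batch_size):
    """Average per-example gradients; noise shrinks roughly as 1/sqrt(batch_size)."""
    return sum(noisy_gradient() for _ in range(batch_size)) / batch_size

def spread(batch_size, trials=200):
    """Standard deviation of batch-gradient estimates around their mean."""
    estimates = [batch_gradient(batch_size) for _ in range(trials)]
    mean = sum(estimates) / trials
    return (sum((e - mean) ** 2 for e in estimates) / trials) ** 0.5

small, large = spread(1), spread(256)   # single examples jump; batches are smooth
```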
Interactive: Batch Size vs Gradient Noise
Click the button to sample random gradient estimates at different batch sizes. Notice how single-example gradients are jumpy while large batches are smooth.
True gradient shown as dashed line. Samples show per-batch estimates.
There's a limit: beyond the critical batch size, doubling the batch doesn't keep cutting noise in half -- you're better off using those FLOPs for more steps instead. For LLMs, batches of 2-4 million tokens are standard.
Concept 6 of 10
Learning Rate Schedules: Warmup + Cosine Decay
A single fixed learning rate is rarely optimal. Modern LLMs use a two-phase schedule:
Linear warmup (first ~2-5% of steps): LR starts near zero and ramps up to its peak. Reason: early parameters are random; big steps in random directions would destabilize training.
Cosine decay (rest of training): LR follows half a cosine curve down to a small minimum. Reason: late in training, the model is near a good solution; small, careful refinements are better than large disruptive jumps.
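The two-phase schedule fits in a single function. The peak of 3e-4, 2% warmup, and decay to 10% of peak follow the typical recipe quoted above, but the exact numbers are illustrative:

```python
import math

def lr_schedule(step, total_steps, peak_lr=3e-4, warmup_frac=0.02, min_frac=0.10):
    """Linear warmup to peak_lr, then cosine decay down to min_frac * peak_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear ramp from ~0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # glides from 1 down to 0
    return peak_lr * (min_frac + (1 - min_frac) * cosine)
```

Plotting `lr_schedule(step, 10_000)` for step 0..10,000 reproduces the shape described in the interactive below: a short ramp, then a long smooth descent.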
Interactive: Learning Rate Schedule Plotter
X axis: training step. Y axis: learning rate. Standard LLM recipe: 2% warmup, cosine decay to 10% of peak.
Concept 7 of 10
Teacher Forcing and Parallel Learning
During training, the model never uses its own predictions as the next input. Even if it predicts "hat" instead of "mat", we show it the correct answer and move on. This is called teacher forcing.
This turns a sequence of dependent predictions into a single parallel computation. "The cat sat on the mat" yields 5 training signals (each of the first five positions predicts the token that follows it) from a single forward pass, thanks to causal masking letting each position attend only to its predecessors.
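Teacher forcing is really just how the training data is laid out: inputs are the ground-truth tokens, and targets are the same sequence shifted left by one. A minimal sketch:

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]

inputs  = tokens[:-1]   # what each position sees as its own token
targets = tokens[1:]    # what each position must predict next

pairs = list(zip(inputs, targets))
# Position i is trained to predict targets[i] given tokens[:i+1];
# causal masking hides everything after position i, so all positions
# can be scored in one forward pass. The model's own (possibly wrong)
# predictions never enter this pipeline during training.
```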
Misconception alert
"Teacher forcing makes the model cheat during generation." It doesn't affect generation at all -- teacher forcing is only used during training. At inference time the model does feed its own outputs back, which is why generation drift (error compounding) can happen. This is called exposure bias and is one of the motivations behind RLHF.
Concept 8 of 10
Scale, Checkpoints, and the Evolution of a Model
GPT-3's training corpus contained roughly 500 billion tokens (of which about 300 billion were actually processed during training). Reading 500 billion tokens nonstop at 250 words per minute would take about 2,850 years. Training cost: ~3,640 petaflop/s-days of compute. Every few thousand steps, the full model state is saved as a checkpoint so training can resume after crashes.
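Checkpointing, stripped to its essence, is "periodically persist everything needed to resume." A hypothetical sketch (real frameworks also save optimizer and RNG state, and use efficient binary formats rather than JSON):

```python
import json, os, tempfile

def save_checkpoint(state, path):
    """Persist the full training state so a crashed run can resume."""
    with open(path, "w") as f:
        json.dump(state, f)

def load_checkpoint(path):
    """Restore a previously saved training state."""
    with open(path) as f:
        return json.load(f)

# Illustrative state: step counter, (tiny) weights, current learning rate.
state = {"step": 4000, "weights": [0.1, -0.3], "lr": 3e-4}
path = os.path.join(tempfile.gettempdir(), "ckpt_step4000.json")
save_checkpoint(state, path)
resumed = load_checkpoint(path)   # training would continue from step 4000
```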
By the final checkpoint (100% trained, ~500B tokens), the model shows nuanced understanding, cultural associations, and rare-word competence.
Concept 9 of 10
Validation Loss: Detecting Overfitting and Underfitting
We split data into training (used to compute gradients) and validation (held out, used only to measure progress). Three patterns tell us what's happening:
Healthy learning: train and validation loss both decrease together.
Overfitting: train loss keeps falling but validation loss flattens or rises -- the model is memorizing training quirks instead of learning generalizable patterns.
Underfitting: both stay high -- model too small, LR too low, or training too short.
Overfitting is relatively uncommon in modern LLM pretraining because the dataset is so vast and training is so short (usually <1 epoch over the data). It becomes much more relevant during fine-tuning on smaller datasets.
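The three patterns can be captured in a toy diagnostic over logged (train, val) loss curves. The rule and tolerance below are illustrative assumptions, not a production early-stopping criterion:

```python
def diagnose(train_losses, val_losses, tol=0.01):
    """Classify a training run from its loss histories (oldest to newest)."""
    # Overfitting: train loss at its best, but val loss has risen off its best.
    if val_losses[-1] > min(val_losses) + tol and train_losses[-1] <= min(train_losses):
        return "overfitting"
    # Healthy: both curves lower than where they started.
    if train_losses[-1] < train_losses[0] and val_losses[-1] < val_losses[0]:
        return "healthy"
    return "underfitting or stalled"

healthy  = diagnose([2.0, 1.5, 1.2, 1.0], [2.1, 1.7, 1.5, 1.4])
drifting = diagnose([2.0, 1.5, 1.0, 0.6], [2.1, 1.8, 1.8, 2.0])
```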
Real-world application
When Anthropic, Meta, and OpenAI train frontier models, they monitor validation loss every few thousand steps along with downstream benchmark scores. If loss plateaus but benchmarks keep climbing, training continues. If both plateau, it's time to stop or switch to an annealing phase.
Concept 10 of 10
Check Your Understanding
1. What does cross-entropy loss actually measure?
Correct: The negative log of the probability the model assigned to the correct token
2. Why do we need warmup at the start of training?
Correct: Parameters are random; large early steps in random directions destabilize training
3. What is teacher forcing?
Correct: During training, feeding the correct previous token instead of the model's own prediction
4. Why use large batches instead of updating after every example?
Correct: GPUs process batches in parallel, and averaged gradients are less noisy
5. What signals overfitting during training?
Correct: Training loss falls while validation loss flattens or rises
Teach It Back
Explain to a friend: How does a language model actually learn from its mistakes? Walk through cross-entropy loss, backpropagation, and gradient descent, and explain why we use learning rate warmup and large batches.
An AI tutor will compare your explanation against the course material.
Flashcards (click to flip)
What is cross-entropy loss?
Click to reveal
Loss = -log(P[correct token]). Low probability on the right answer gives high loss. It penalizes confident wrong answers far more than unsure ones, and turns products of per-token probabilities into summable log-values.
What does backpropagation compute?
Click to reveal
For each of billions of parameters, it computes the gradient: how much the loss would change if that parameter were nudged up slightly. It uses the chain rule to propagate local derivatives from the loss backward through every layer.
Why does gradient descent use a tiny learning rate?
Click to reveal
Large jumps overshoot and destroy learned patterns. Small consistent steps (e.g., 3e-4) let billions of parameters co-adapt. Nudge everything slightly in the direction that reduces loss, repeat trillions of times.
Why warmup + cosine decay?
Click to reveal
Warmup prevents divergence when random initial weights meet large gradients. Cosine decay lets the model make big moves early (when there is lots to learn) and tiny refinements late (when the model is near a good solution).
What is teacher forcing?
Click to reveal
During training, the model is always fed the correct previous token as input, even if its own prediction was wrong. This decouples positions so all predictions in a sequence can be computed in parallel via causal masking.
Training vs validation loss: what do they tell us?
Click to reveal
Both falling = healthy learning. Train falls, val rises = overfitting (memorizing). Both stuck high = underfitting (model too small / LR too low / training too short).