Three ways to turn raw text into a supervised signal

Think first
Raw text has no labels. How do you turn "The cat sat on the mat" into a supervised learning problem without anyone writing annotations? Name as many strategies as you can.

Three fundamental self-supervised objectives dominate modern pre-training. They all share the same trick: hide part of the input and ask the model to predict it from the rest. What differs is what is hidden.

- Causal LM (CLM): predict token n+1 from tokens 1..n. Used by GPT, LLaMA, Claude.
- Masked LM (MLM): hide ~15% of tokens and predict them bidirectionally. Used by BERT.
- Span corruption: hide whole spans of consecutive tokens, replace each span with a sentinel token, and have the model generate the missing spans. Used by T5.

CLM vs MLM vs Span Corruption — pick an objective

Interactive objective picker — click to see how each objective labels the sentence "The cat sat on the mat":

- Causal LM: "The cat sat on the [?]" — left-to-right; the target is the next token.
- Masked LM: "The cat [MASK] on [MASK] mat" — bidirectional; ~15% of tokens hidden.
- Span corruption: "The cat <X> on the <Y>" — spans replaced by sentinels; the decoder regenerates them.
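As a minimal sketch of how each objective manufactures labels from unlabeled text — assuming whitespace tokenization, hand-picked mask positions and spans rather than random sampling, and T5-style `<extra_id_n>` sentinels:

```python
tokens = "The cat sat on the mat".split()

# Causal LM: at every position, the target is simply the next token.
clm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["The", "cat"], "sat")

# Masked LM: hide some positions (chosen by hand here, ~15% at random
# in BERT) and predict them from the full bidirectional context.
masked_positions = {2, 4}
mlm_input = [t if i not in masked_positions else "[MASK]"
             for i, t in enumerate(tokens)]
mlm_targets = {i: tokens[i] for i in masked_positions}

# Span corruption: replace whole spans with sentinels; the decoder
# regenerates each span right after its sentinel.
spans = [(2, 3), (5, 6)]          # [start, end) spans to corrupt
corrupted, target, last = [], [], 0
for n, (s, e) in enumerate(spans):
    corrupted += tokens[last:s] + [f"<extra_id_{n}>"]
    target += [f"<extra_id_{n}>"] + tokens[s:e]
    last = e
corrupted += tokens[last:]
# corrupted == ["The", "cat", "<extra_id_0>", "on", "the", "<extra_id_1>"]
# target    == ["<extra_id_0>", "sat", "<extra_id_1>", "mat"]
```

All three produce (input, target) pairs from the raw sentence alone — no human annotation, only a rule for what to hide.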
Why CLM won

MLM produces strong representations for classification but cannot naturally generate text — it only fills in blanks. Span corruption is clever but adds pipeline complexity. CLM naturally supports generation (predict, append, repeat) AND enables in-context few-shot learning, which turned out to be the killer app after GPT-3.

RoPE: positions as rotations

The original Transformer added a learned or sinusoidal position vector to each token embedding. Rotary Position Embeddings (RoPE) instead rotate each embedding by an angle proportional to its position. Each pair of dimensions acts like a clock hand ticking at a fixed frequency.

Why it's powerful: when two tokens compute their attention dot product, the result depends only on their relative rotation (their position difference). Relative positions fall out of the math for free.

RoPE Clock — drag to change the position and watch different frequencies rotate at different speeds: dims 0-1 rotate fast, dims 32-33 at a medium rate, and dims 126-127 slowly (shown at position 0).
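The relative-position property can be checked numerically. A minimal NumPy sketch, assuming a single head's query/key vectors and the standard base θ = 10000: rotating each pair of dimensions by a position-proportional angle makes the attention dot product depend only on the position offset.

```python
import numpy as np

def rope(x, pos, theta=10000.0):
    """Rotate each consecutive pair of dims of x by pos * freq_i (RoPE)."""
    d = x.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)  # one frequency per dim pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same offset (4), very different absolute positions — same dot product:
a = rope(q, pos=3) @ rope(k, pos=7)
b = rope(q, pos=100) @ rope(k, pos=104)
assert np.isclose(a, b)
```

Each pair of dimensions is a 2-D rotation, and a rotation by angle m followed by an inner product with a rotation by angle n collapses to a function of n - m — the relative position.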
Different frequencies, different roles

Fast-rotating dimensions discriminate nearby tokens (local word order). Slow-rotating dimensions provide a coarse signal across long ranges (document-level structure). This multi-scale property is why RoPE extrapolates to longer contexts (with tricks like YaRN) better than learned absolute positions.

RMSNorm vs LayerNorm

LayerNorm computes mean AND variance across features then normalizes. RMSNorm skips the mean-centering step and just divides by the root mean square. That's it. One pass over the data instead of two.

LayerNorm(x) = gamma * (x - mean(x)) / sqrt(var(x) + eps) + beta
RMSNorm(x)   = gamma * x / sqrt(mean(x^2) + eps)
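The two formulas translate directly to NumPy. A sketch with scalar gamma/beta for brevity (real implementations use per-feature learned vectors):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Two statistics across the feature dim: mean and variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma=1.0, eps=1e-5):
    # One statistic: the root mean square. No mean-centering, no beta.
    rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
    return gamma * x / rms

# On an input that happens to be zero-mean, the two coincide,
# since variance equals mean-of-squares when the mean is zero:
x = np.array([1.0, 2.0, 3.0, 4.0])
xc = x - x.mean()
assert np.allclose(layer_norm(xc), rms_norm(xc), atol=1e-4)
```

On non-centered inputs they differ — RMSNorm simply bets that the centering step was never buying much, which the quality results bear out.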
Relative cost per normalization pass (lower = faster): LayerNorm 100%, RMSNorm ~65% — measured speedups of 7-64% in practice, with no quality loss.

At GPT-4 scale, saving 20% on normalization compute is the equivalent of hundreds of free GPUs across a multi-month training run. Every major modern model — LLaMA, Mistral, Gemma, Qwen — uses RMSNorm. Pre-Norm (normalize before each sublayer) is also universal now; it creates a "gradient highway" that makes 100+ layer networks trainable.

SwiGLU: gated activations

The original Transformer used ReLU in the feed-forward network. GPT-2/3 used GELU. Modern models use SwiGLU — a gated variant with two parallel projections multiplied element-wise:

SwiGLU(x) = (x W_content) ⊙ SiLU(x W_gate)    (⊙ = element-wise product)

Think of it as: one path computes what information could flow; another path computes how much of it to let through. The element-wise product is selective, per-dimension gating.
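A toy NumPy sketch of the full gated FFN. The names `w_gate`, `w_content`, `w_out` are illustrative, not a library API:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU / Swish: x * sigmoid(x)

def swiglu_ffn(x, w_gate, w_content, w_out):
    # Gate path decides HOW MUCH flows; content path decides WHAT flows.
    # The element-wise product applies a per-dimension soft gate.
    return (silu(x @ w_gate) * (x @ w_content)) @ w_out

rng = np.random.default_rng(0)
d, h = 8, 16                       # toy model width and FFN hidden width
x = rng.standard_normal(d)
w_gate = rng.standard_normal((d, h))
w_content = rng.standard_normal((d, h))
w_out = rng.standard_normal((h, d))
y = swiglu_ffn(x, w_gate, w_content, w_out)
assert y.shape == (d,)
```

Note the three weight matrices where a classic ReLU/GELU FFN has two — that is where the extra FLOPs in the next section come from.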

More FLOPs, better quality-per-parameter

SwiGLU adds ~50% FLOPs to the FFN (three weight matrices instead of two). Llama 2 found it worth the cost: better loss at the same parameter count.
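One practical consequence of the third matrix, sketched as arithmetic — the 2/3 hidden-dim scaling below is the convention LLaMA popularized to hold parameter count roughly constant; exact rounding varies by model:

```python
d = 4096                          # model width (a 7B-class model, say)

# Classic FFN: two matrices of shape (d, 4d) and (4d, d).
classic_hidden = 4 * d
classic_params = 2 * d * classic_hidden

# SwiGLU FFN: three matrices, so scale the hidden dim by 2/3 to
# keep a comparable parameter budget (LLaMA 7B rounds up to 11008).
swiglu_hidden = int(2 / 3 * 4 * d)
swiglu_params = 3 * d * swiglu_hidden

assert swiglu_params <= classic_params
print(classic_hidden, swiglu_hidden)  # 16384 10922
```

So "more FLOPs per parameter" is accurate per matrix, but the budget-matched comparison is what Llama 2's ablation actually won: same parameters, better loss.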

Check your understanding

1. What's the essential difference between CLM and MLM?
Directionality is the core difference — and it's why MLM can't naturally generate text.
2. Why does RoPE encode relative positions for free?
Rotation is a linear operation and dot products between rotated vectors collapse to a function of the angular difference, which is exactly the relative position.
3. What does RMSNorm skip compared to LayerNorm?
Skipping the mean means one statistic instead of two — one pass over the data, no mean-subtract step.
4. Why use SwiGLU over GELU or ReLU?
The gating path decides how much of each dimension should flow — it's the key quality win despite the extra FLOPs.
5. FlashAttention changed attention from O(N^2) memory to O(N). How?
FlashAttention is a systems win: operating in fast SRAM and recomputing attention during backward is cheaper than loading the full matrix from slow HBM.

Solidify your understanding

Teach It Back

Explain: What is causal LM, how is it different from MLM and span corruption, and name three architectural refinements modern LLMs use (RoPE, RMSNorm, SwiGLU, FlashAttention) — explaining what each one buys you.

An AI tutor will compare your explanation against the course material and give specific feedback.


Flashcards (click to flip)

Write the CLM loss in plain language.
For each position t, maximize log P(token_t | token_1..token_{t-1}). Averaged over all positions and sequences, then negated: that's the cross-entropy loss. Also known as next-token prediction.
Why 15% masking in MLM?
Too few masks = slow learning (few supervised signals per sentence). Too many = not enough context to predict from. BERT also splits: 80% real [MASK], 10% random word, 10% unchanged, to avoid train/inference mismatch on the [MASK] token.
In one sentence: what is Pre-Norm vs Post-Norm?
Post-Norm (original Transformer) normalizes AFTER the residual add — unstable in deep nets. Pre-Norm normalizes BEFORE each sublayer — stable "gradient highway" that makes 100+ layer models trainable.
Why can RoPE extrapolate to longer contexts?
Positions are rotations at various frequencies. The model learned to attend based on rotation differences, which are smooth functions. Interpolation tricks (PI, YaRN) scale the rotation frequencies down so "unseen" positions map back into the trained angular range.
FlashAttention: counter-intuitive trick in the backward pass?
Instead of storing the attention matrix for backward, FlashAttention recomputes it. Recomputing in fast SRAM is cheaper than loading the huge matrix from slow HBM memory. Total wall-clock goes down even though FLOPs go up.
The compound effect at GPT-3 scale?
RoPE + RMSNorm + SwiGLU + FlashAttention together turn a 6-month training run into ~3 months, extend 2K context to 100K+, improve quality, and increase training stability — without changing the core next-token prediction objective.

Module 11 Complete

You now understand CLM, the position/norm/activation stack, and why FlashAttention unlocked the long-context era. Next up: Scaling Laws and Optimization — how to budget compute between model size and tokens, and the AdamW + warmup + cosine recipe.
