Why stack attention 96 times?

Think first
One attention layer lets "sat" look at "cat". But what about "The cat that chased the mouse sat on the mat" - "sat" needs to know the subject is "cat", that cat did the chasing, and that what it chased was a mouse. Can one round of attention capture this chain? Or do you need multiple rounds?

One attention pass does exactly one hop of information sharing. "sat" can read directly from any other token in one layer - but the information sitting in those tokens is still their original embeddings. It hasn't been enriched yet.

To reason about "the cat that chased the mouse", "sat" needs information that itself has been composed from multiple other tokens. That requires a second hop: first "chased" absorbs "mouse", then "cat" absorbs "chased" (which now carries "mouse" info), then "sat" absorbs "cat" (which now carries "chased-mouse" info). Three hops = three layers.
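The three-hop chain above can be sketched with toy numbers. This is an illustration, not a trained model: the one-hot token vectors and hand-picked attention weights are invented for the example.

```python
import numpy as np

# Toy illustration: each token starts as a one-hot "identity" vector,
# and we hand-pick which token reads from which at each hop.
tokens = ["cat", "chased", "mouse", "sat"]
X = np.eye(4)  # row i = embedding of token i

def hop(X, A):
    # One attention pass: each token adds a weighted mix of the
    # *current* vectors to its own (the residual keeps its own info).
    return X + A @ X

# Hop 1: "chased" (row 1) reads from "mouse" (col 2)
A1 = np.zeros((4, 4)); A1[1, 2] = 1.0
# Hop 2: "cat" (row 0) reads from "chased" (col 1)
A2 = np.zeros((4, 4)); A2[0, 1] = 1.0
# Hop 3: "sat" (row 3) reads from "cat" (col 0)
A3 = np.zeros((4, 4)); A3[3, 0] = 1.0

out = hop(hop(hop(X, A1), A2), A3)
# After three hops, "sat" carries weight in the "mouse" dimension --
# information that could only arrive through intermediaries.
print(out[3])  # nonzero at index 2 ("mouse")
```

Run just the third hop alone and "sat" absorbs only the raw "cat" embedding, with no "mouse" component: the chain genuinely needs all three rounds.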

Key Insight

Each transformer layer is one round of "gossip". In layer 1, each token hears from its direct neighbors. In layer 2, it hears what those neighbors just heard. By layer 96, information from any token can have been composed, re-composed, and specialized many times over.

What different layers actually learn

Probing studies (BERT Rediscovers the Classical NLP Pipeline, etc.) reveal a stunning pattern: deep models spontaneously organize themselves into stages that resemble a classical NLP pipeline - without ever being told to.

Layers 1-4 - Surface & Syntax: part-of-speech, phrase boundaries, local grammar
Layers 5-8 - Semantics & Roles: subject/verb/object, coreference, semantic roles
Layers 9-12 - Task-specific: multi-hop reasoning, prediction preparation

"quick" and "brown" get tagged as modifiers early. The pronoun "she" gets linked to its antecedent in middle layers. By late layers, the model is composing abstractions like "the mathematician who solved the theorem" as a single reasoning unit.

Watch "bank" disambiguate through layers

This visualization simulates how the representation of "bank" shifts from generic (a mix of financial, river, and airplane meanings) to firmly financial as it absorbs context from "money" and "deposited" layer by layer.

Not selection - reshaping

The model does NOT "look up" the financial sense from a dictionary of senses. It continuously reshapes a single vector, pushing it toward one region of meaning-space as context piles up. The representation is born generic and matured through residual updates.

Residual connections: the highway

Deep networks have a dark history. Stack 50+ layers naively and gradients shrink during backprop (they get multiplied by small numbers many times over), while the original input signal gets washed out after a few transformations. Training stalls. This capped how deep networks could usefully go for years - until ResNet.

The fix is almost embarrassingly simple:

output = input + layer_transform(input)

Instead of replacing the input with the layer's output, you add the layer's output on top. This means the original signal is always preserved, gradients have a direct identity path backward, and each layer only needs to learn a small additive refinement rather than a total rewrite.
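A scalar caricature of the gradient story, assuming (purely for illustration) that each layer's local derivative is a small constant w:

```python
# Toy "layers" that are just scalar multiplications. A plain layer
# computes w*x; a residual layer computes x + w*x.
w, depth = 0.5, 50

grad_plain = w ** depth        # chain rule: a product of w's -> vanishes
grad_resid = (1 + w) ** depth  # a product of (1 + w)'s -> never below 1

print(f"plain:    {grad_plain:.2e}")   # ~8.9e-16, effectively zero
print(f"residual: {grad_resid:.2e}")

# Worst case: a layer contributes nothing (w = 0). The plain network's
# gradient dies entirely; the residual path still passes gradient 1.
```

Even when every layer's contribution is tiny, the residual product stays at or above 1, so the error signal reaches the bottom of a 96-layer stack intact.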


Layer normalization

With 96 layers of additions, vector values can grow wild. Layer norm stabilizes them at each step:

  1. Compute mean and standard deviation across the vector's dimensions.
  2. Subtract mean, divide by std - now mean=0, std=1.
  3. Scale by a learnable gamma, shift by a learnable beta.

Example: [2.0, 0.5, -1.0, 1.5] normalizes to [1.09, -0.22, -1.53, 0.65]. The shape is preserved but the magnitudes are controlled.

Modern transformers use Pre-Norm: normalize before attention/FFN, not after. This makes very deep models trainable in practice.
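The three steps above, as a minimal NumPy sketch (the small eps in the denominator is a standard numerical-safety addition, not part of the math):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # 1. mean and std across the vector's dimensions
    mean = x.mean()
    std = x.std()
    # 2. subtract mean, divide by std -> mean=0, std=1
    # 3. scale by learnable gamma, shift by learnable beta
    return gamma * (x - mean) / (std + eps) + beta

x = np.array([2.0, 0.5, -1.0, 1.5])
print(np.round(layer_norm(x), 2))  # [ 1.09 -0.22 -1.53  0.65]
```

In a real model, gamma and beta are learned vectors (one entry per dimension) rather than the scalar defaults used here.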

The Feed-Forward Network: expand and compress

After attention decides what info to mix in, the FFN decides how to transform it. Each token passes through the FFN independently (no cross-token interaction here).

The magic is the 4x expansion: project from d=768 up to 4d=3072, apply GELU nonlinearity, then project back down to 768. That temporary expansion gives the model "thinking space" to compute non-linear transformations. Roughly 2/3 of all parameters in GPT-3 live in FFN layers - the FFN is where most of the knowledge is stored.

Attention = "Which other tokens should I read from?"
FFN = "Now that I've read them, what should I do with that information?"

GELU vs ReLU: ReLU chops off negatives sharply (max(0, x)). GELU softens the cutoff - it weights x by the probability that a standard Gaussian falls below it. Smoother gradients, works better at scale, adopted by GPT-2 and widely used since.
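The expand-activate-compress shape can be sketched directly. The weight initialization here is arbitrary (chosen just to make the example run); only the dimensions and the GELU follow the text:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # Expand d -> 4d, apply the nonlinearity, compress 4d -> d.
    # Applied to each token independently: no cross-token mixing here.
    return gelu(x @ W1 + b1) @ W2 + b2

d = 768
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.02, (d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(0, 0.02, (4 * d, d)), np.zeros(d)

x = rng.normal(size=(1, d))          # one token's vector
print(ffn(x, W1, b1, W2, b2).shape)  # (1, 768) -- same shape out as in
```

Note that the 3072-wide hidden activation exists only inside the function: the block's interface stays at 768 dimensions, which is what lets 96 of them stack cleanly.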

The complete Transformer block

Each of GPT-3's 96 blocks looks exactly like this:

x = x + attention(layer_norm(x))
x = x + ffn(layer_norm(x))

Two sub-blocks, each wrapped in a pre-norm + residual. That's it. Repeat 96 times, stick a final layer norm and a linear projection to vocab size at the end, and you have GPT-3.
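The two-line block above, wired up as runnable code. The attention and FFN sub-layers are passed in as placeholder callables, since the point here is only the pre-norm + residual plumbing:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector across its feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def transformer_block(x, attention, ffn):
    x = x + attention(layer_norm(x))  # sub-block 1: pre-norm + residual
    x = x + ffn(layer_norm(x))        # sub-block 2: pre-norm + residual
    return x

# Stand-in sub-layers that contribute nothing, to show the wiring:
# the residual paths alone carry the input through unchanged.
x = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, d = 8
out = transformer_block(x, attention=lambda h: h * 0.0, ffn=lambda h: h * 0.0)
print(np.allclose(out, x))  # True
```

Stacking the real thing is just a loop: `for _ in range(96): x = transformer_block(x, attn, ffn)` (with each block holding its own learned weights).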

The final token's representation at the top of the stack is a 12,288-dimensional vector encoding subject, action, tense, style, likely continuation, and a great deal more. A final W_out matrix projects it to a roughly 50,000-entry score vector over the vocabulary, and softmax turns those scores into probabilities. "mat" wins because it satisfies grammar, semantics, and context simultaneously.
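The final projection and softmax look like this. GPT-3's real sizes are d_model = 12,288 and a ~50,000-token vocabulary; both are shrunk here (and the weights are random) so the sketch runs lightly:

```python
import numpy as np

d_model, vocab = 128, 500  # shrunk stand-ins for 12288 and ~50000
rng = np.random.default_rng(0)

h = rng.normal(size=d_model)                   # top-of-stack vector, last token
W_out = rng.normal(0, 0.02, (d_model, vocab))  # final projection matrix

logits = h @ W_out                     # one raw score per vocabulary entry
probs = np.exp(logits - logits.max())  # subtract max for numerical stability
probs /= probs.sum()                   # softmax -> probabilities

print(probs.shape)  # (500,): a proper distribution over the vocabulary
```

Whichever vocabulary entry gets the highest probability is the model's top prediction for the next token.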

Check your understanding

1. Why do Transformers stack many layers instead of making one giant attention layer?
Exactly. Layer 1 sees raw embeddings. Layer 2 sees outputs composed from layer 1 - so information can now flow through intermediaries. Deep reasoning = many hops.
2. What is the purpose of the residual connection x = x + layer(x)?
Right. Residuals give gradients a direct identity path backward, preventing vanishing, and let layers learn small additive updates rather than total rewrites.
3. Why does the FFN expand to 4x dimensions before compressing back?
Correct. The 4x hidden dimension plus GELU gives the FFN much more representational capacity per layer - most knowledge storage happens here.
4. Probing studies found that early layers (1-4) in BERT tend to capture...
Yes - deep models self-organize into something like a classical NLP pipeline: surface in early layers, semantics in the middle, task abstractions at the top.
5. Why does the model use GELU instead of ReLU?
Right. ReLU has a hard corner at zero which kills gradients for negative inputs. GELU softens this and empirically trains better in deep networks.

Teach It Back

Explain to a friend: why Transformers stack many layers, how residual connections work, what role layer normalization and the FFN play, and how the representation of an ambiguous word like "bank" evolves through the stack.


Flashcards

Why stack many attention layers?
Each layer is one hop of information sharing. Multi-hop reasoning ("X did Y to Z, therefore W") requires several sequential passes, each building on outputs from the previous one.
What is a residual connection?
x = x + layer(x). Instead of replacing the input with the layer's output, add the output on top. Preserves original signal, prevents vanishing gradients, lets each layer learn a small additive refinement.
What does the FFN do in a Transformer block?
Attention decides what info to pull from other tokens; FFN transforms that info. It expands to 4x dimensions, applies GELU, compresses back. ~2/3 of GPT-3's parameters live here - it's where most knowledge is stored.
What does layer normalization do?
Normalizes each vector to mean=0, std=1, then applies learnable scale (gamma) and shift (beta). Keeps activations stable across many layers. Modern models use Pre-Norm (normalize BEFORE the sub-layer).
What do different layers learn?
Early layers (1-4): part-of-speech, phrase boundaries, local grammar. Middle layers (5-8): subject/object roles, coreference, semantics. Late layers (9-12): multi-hop reasoning, task-specific abstractions, prediction preparation.
What is the full Pre-Norm Transformer block?
x = x + Attention(LayerNorm(x)); x = x + FFN(LayerNorm(x)). Two sub-blocks, each with pre-norm and residual. Repeated 12x in the smallest GPT-2, 96x in the largest GPT-3.

Module 6 Complete

You now understand the full Transformer block. Next: how we actually train these things - loss functions, gradients, batch sizes, and learning rate schedules.
