The word "bank" has one embedding vector. But it means completely different things in: "I deposited money at the bank", "The river bank was muddy", and "The plane had to bank left". How could surrounding words help the model figure out which meaning to use?
We are building a model that reads text and predicts the next word. So far we have tokens, embedding vectors (encoding meaning), and positional information. But there's a critical problem.
Our embedding table has just one entry for "bank" -- always the same vector. This single static vector can't capture financial institutions, river edges, and airplane maneuvers simultaneously.
Attention allows surrounding words to pass information to refine meaning. When "bank" appears near "money" and "deposited", attention mixes in financial context. Near "river" and "muddy", different context is mixed in.
Key Insight
Attention doesn't just help with ambiguous words. It works for all words. "sat" can incorporate information from "cat", helping the model know something animal-like is doing the sitting. Every word's representation gets enriched by its neighbors.
Attention=A cocktail party. Each word "listens" to every other word, but pays more attention to the most relevant speakers. "bank" tunes in to "money" and tunes out "the".
Concept 2 of 10
Query, Key, and Value: The Three Roles
Every word plays three roles simultaneously in attention. Think of it like a library:
Query (Q)
"What information do I need?"
Like a search query you type into a library catalog. The word "sat" asks: "who or what is doing the sitting?"
Key (K)
"What information can I provide?"
Like book titles on the shelf. Each word advertises its content. "cat" advertises: "I'm an animal, a subject".
Value (V)
"Here is my actual information"
Like the book's contents. Once you've matched query to key, the value is what actually gets transferred.
Each role is created by multiplying the embedding by a different learned weight matrix:
"sat" embedding: [0.3, -0.3, 0.7, 0.15]
"sat" x Wq = query: [0.12, 0.13] ("what do I need?")
"sat" x Wk = key: [0.09, 0.58] ("what can I offer?")
"sat" x Wv = value: [0.13, 0.45] ("here is my info")
The same weight matrices are applied to every word. Queries and keys are intentionally smaller than embeddings (GPT-3: 12,288 embedding dims -> 128 dims per query head). This compression forces each head to focus on a narrow slice of the embedding's information.
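The projection step above can be sketched in a few lines of NumPy. The "sat" embedding comes from the example; the weight matrices are random placeholder values, not real trained weights:

```python
import numpy as np

# 4-dim embedding for "sat" (values from the example above)
sat = np.array([0.3, -0.3, 0.7, 0.15])

# Learned weight matrices project the embedding down to 2 dims.
# These are illustrative random values, not trained weights.
rng = np.random.default_rng(0)
Wq = rng.normal(scale=0.5, size=(4, 2))
Wk = rng.normal(scale=0.5, size=(4, 2))
Wv = rng.normal(scale=0.5, size=(4, 2))

q = sat @ Wq  # "what do I need?"
k = sat @ Wk  # "what can I offer?"
v = sat @ Wv  # "here is my info"

print(q.shape, k.shape, v.shape)  # each is a 2-dim vector
```

The same three matrices would be applied to every token in the sentence, so the projection is one matrix multiply per role.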
Q, K, V=Google search. Your search query (Q) is matched against page titles/tags (K). When there's a match, you read the page content (V). Attention is like searching every word against every other word.
Concept 3 of 10
The Matching Step: Dot Product Scores
How does the model decide which words are relevant to each other? It calculates the dot product between each query-key pair. Higher positive number = strong relevance. Near zero = little relevance.
Interactive: Query/Key/Value Calculator
Select a word to see its Q, K, V vectors and how it matches with other words. Click "sat" to see how it attends to "cat".
In practice, all queries are stacked into one matrix, all keys into another (transposed), and one matrix multiplication computes all pairs simultaneously. This is why GPUs are so important -- they excel at matrix multiplication.
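The all-pairs matching can be sketched as one matrix multiplication. The query and key vectors here are made-up numbers in the spirit of the earlier example:

```python
import numpy as np

# Three tokens, each with a 2-dim query and a 2-dim key (illustrative numbers).
Q = np.array([[0.12, 0.13],
              [0.40, 0.10],
              [0.05, 0.58]])
K = np.array([[0.09, 0.58],
              [0.30, 0.20],
              [0.11, 0.45]])

# One matrix multiply computes every query-key dot product at once:
# scores[i, j] = how well token i's query matches token j's key.
scores = Q @ K.T
print(scores.shape)  # a 3x3 grid of match scores
```

This single `Q @ K.T` is the operation GPUs parallelize so well: all n^2 pairwise scores in one call.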
Key Insight
The dot product measures how well a query "matches" a key. It's like measuring how similar two directions are. Parallel vectors = high score. Perpendicular = zero. Opposite = negative.
Concept 4 of 10
From Scores to Probabilities: Scale and Softmax
Think first
Raw dot product scores can be any number (like -10, 45, 100). We need probabilities that sum to 1. How would you convert arbitrary numbers into a probability distribution? And why might very large numbers be a problem?
Step 1: Scale the scores
Divide by sqrt(dk) to prevent huge numbers from dominating. GPT-3 with 128-dim keys: a raw score of 45.2 becomes 45.2 / sqrt(128) ≈ 4.0. This keeps gradients healthy during training.
Step 2: Apply softmax
Softmax converts arbitrary numbers into a probability distribution: exponentiate each score, sum all exponentials, divide each by the sum. The result is always positive and sums to 1.
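Both steps fit in a few lines of NumPy. The raw scores below are arbitrary illustration values; subtracting the max before exponentiating is a standard trick to avoid overflow and doesn't change the result:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability (same output, no overflow).
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_k = 128                                  # key dimension, as in GPT-3
raw_scores = np.array([45.2, 12.0, 3.0])   # arbitrary example scores

scaled = raw_scores / np.sqrt(d_k)         # step 1: 45.2 / sqrt(128) ≈ 4.0
weights = softmax(scaled)                  # step 2: probabilities

print(weights.sum())  # sums to 1 -- a valid probability distribution
```

Every weight is strictly positive, so even "irrelevant" words get a sliver of attention; softmax just makes that sliver tiny.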
Interactive: Softmax Visualizer
Adjust the raw scores to see how softmax converts them into probabilities. Notice how it amplifies differences -- a small lead becomes dominant.
Key Insight
Softmax is the attention "sharpener". Without it, every word would pay roughly equal attention to everything. With it, the model can focus -- choosing to attend 84% to "cat" and only 8% to "The".
Concept 5 of 10
Attention Scores: The Full Picture
Let's see what attention looks like for an entire sentence. Each row shows how much one word attends to every other word.
Interactive: Attention Score Heatmap
Hover over cells to see exact attention weights. Each row sums to 100%. Click a sentence to switch examples.
Notice how "bank" in the financial sentence attends strongly to "deposited" and "money", while "bank" in the river sentence attends to "river" and "muddy". Same word, different attention patterns, different contextual meaning.
Concept 6 of 10
Causal Masking: No Peeking at the Future
Causal language models (like GPT) predict the next word. During training, the model sees the whole sentence but must predict each word using only the words before it. If it could see future words, it would just copy the answer!
Before applying softmax, we set all future position scores to -infinity. Softmax converts -infinity to exactly zero -- those positions get zero attention.
Interactive: Causal Mask
Click "Apply Mask" to see how the triangular mask blocks future positions. Only the lower triangle survives.
After masking:
Position 0 ("The") sees only: The
Position 1 ("cat") sees: The, cat
Position 2 ("sat") sees: The, cat, sat
Position 3 ("on") sees: The, cat, sat, on
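The masking steps above can be sketched directly: build a triangular mask, set the future positions to -infinity, and let softmax zero them out. The score values are random placeholders:

```python
import numpy as np

n = 4  # tokens: The, cat, sat, on
scores = np.random.default_rng(0).normal(size=(n, n))  # placeholder scores

# Upper triangle (strictly above the diagonal) = future positions.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf  # block the future before softmax

# Row-wise softmax: exp(-inf) = 0, so future positions get exactly 0 attention.
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(weights[0])  # → [1. 0. 0. 0.] -- position 0 sees only itself
```

Only the lower triangle of `weights` is nonzero, matching the "who sees whom" list above.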
The quadratic bottleneck
This attention matrix grows quadratically. 10 tokens = 100 interactions. 1,000 tokens = 1,000,000 interactions. 100,000 tokens = 10 billion interactions. This is the Transformer's main computational bottleneck and why context length is so expensive to increase.
Concept 7 of 10
The Weighted Sum: Mixing Information
Now we combine attention probabilities with value vectors. Each word's new representation is a weighted mix of all the value vectors it can see.
"sat" attends to:
33.3% from "The" value vector
34.0% from "cat" value vector
32.6% from "sat" value vector
Weighted sum = Context Vector [0.39, 0.34]
The critical step: we don't replace the old embedding. We add the context vector to it. This is called a Residual Connection:
Original Embedding + Attention Output = Final Representation
"sat" now contains its dictionary meaning plus context that it's an action performed by a cat.
Key Insight
Residual connections are crucial. Without them, stacking many layers would cause the original word meaning to be lost. By adding rather than replacing, each layer contributes a refinement while preserving the foundation.
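The weighted sum and residual step can be sketched as follows. The attention weights are the percentages from the example above; the value vectors and the 2-dim "sat" embedding are made-up numbers, chosen so the weighted sum lands near the [0.39, 0.34] context vector shown earlier:

```python
import numpy as np

# Attention weights for "sat": 33.3% The, 34.0% cat, 32.6% sat.
attn = np.array([0.333, 0.340, 0.326])

# Illustrative 2-dim value vectors for The, cat, sat.
V = np.array([[0.20, 0.10],
              [0.60, 0.50],
              [0.37, 0.42]])

context = attn @ V  # weighted sum ≈ [0.39, 0.34]

# Residual connection: ADD the context to the original embedding,
# don't replace it. (Hypothetical 2-dim embedding for "sat".)
sat_embedding = np.array([0.10, 0.25])
new_sat = sat_embedding + context
```

Because the update is additive, "sat" keeps its dictionary meaning and gains the contextual refinement on top.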
Concept 8 of 10
The Complete Formula and Multi-Head Attention
Attention(Q, K, V) = softmax(QK^T / sqrt(dk)) * V
QK^T = match scores | / sqrt(dk) = scale | softmax = probabilities | * V = extract info
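The whole formula fits in one small function. This is a minimal single-head sketch without causal masking, using random placeholder inputs:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- single head, no causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # match scores, scaled
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)        # probabilities per row
    return weights @ V                                 # weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))  # 5 tokens, 8 dims
out = attention(Q, K, V)
print(out.shape)  # one updated 8-dim vector per token
```

Each output row is a convex combination of the value vectors, which is exactly the "mixing information" step from the previous concept.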
Multi-Head Attention (MHA)
One head captures one type of relationship. Multiple heads run in parallel, each learning to focus on different things:
All head outputs are concatenated, then multiplied by an Output Projection Matrix (Wo) to synthesize different perspectives into a single coherent update:
GPT-3: (96 heads x 128 dims) -> projected back to 12,288 embedding dims.
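The concatenate-and-project step can be sketched at toy scale. Here 4 heads of 16 dims each stand in for GPT-3's 96 heads of 128 dims; the per-head outputs and Wo are random placeholders:

```python
import numpy as np

n_tokens, d_model, n_heads = 5, 64, 4
d_head = d_model // n_heads  # 16 dims per head (GPT-3: 12288 / 96 = 128)

rng = np.random.default_rng(0)
# Pretend each head has already produced its context vectors.
head_outputs = [rng.normal(size=(n_tokens, d_head)) for _ in range(n_heads)]

# Concatenate along the feature axis, then project with Wo
# to blend the heads' perspectives into one coherent update.
concat = np.concatenate(head_outputs, axis=-1)  # (5, 64)
Wo = rng.normal(size=(d_model, d_model))        # output projection matrix
update = concat @ Wo                            # (5, 64) -- back to model dims

print(update.shape)
```

Note that concatenation alone just stacks the heads side by side; it's the Wo projection that lets information from different heads interact.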
Multi-Head=A panel of 96 experts, each examining the sentence from a different angle. One checks grammar, another checks meaning, another checks style. Their findings are combined into a single comprehensive report.
5. Context vector from blending all attended value vectors
6. Repeat for all heads in parallel -- each head captures different relationships
7. Concatenate, project, add to original (residual)
Result: "sat" now knows it's an action performed by a cat!
And this is just one layer. Deep understanding requires iteration -- stacking 12, 24, or 96 attention layers allows the model to reason about context that relies on other context. Layer 1 might learn grammar. Layer 50 might learn meaning. Layer 96 might learn reasoning.
Concept 10 of 10
Check Your Understanding
1. What problem does attention solve?
Correct! Without attention, "bank" always has the same representation. Attention lets surrounding words (like "money" or "river") modify its meaning dynamically.
2. What are the roles of Query, Key, and Value in attention?
Right! Query is the search question, Key is what each word advertises about itself, and Value is the actual information that gets transferred when there's a match.
3. Why is causal masking necessary?
Exactly! During training, the model sees the full sentence but must predict each token using only previous tokens. Masking future positions (setting them to -infinity before softmax) enforces this constraint.
4. Why does GPT-3 use 96 attention heads instead of just one?
Correct! One head captures one type of relationship. Multiple heads running in parallel can simultaneously track grammar, meaning, pronouns, style, and many other linguistic features.
5. What is the main computational bottleneck of attention?
Right! Every token must attend to every other token: n^2 interactions. 1000 tokens = 1M interactions. This quadratic scaling is why increasing context length is so expensive.
Teach It Back
Explain to a friend: What is attention in a Transformer, how does the Q/K/V mechanism work, and why does "bank" get different representations in "river bank" vs "money bank"?
An AI tutor will compare your explanation against the course material and give specific feedback.
Flashcards (click to flip)
What is the attention formula?
Click to reveal
Attention(Q, K, V) = softmax(QK^T / sqrt(dk)) * V. QK^T computes match scores, dividing by sqrt(dk) scales them, softmax converts to probabilities, multiplying by V extracts the information.
What do Query, Key, and Value represent?
Click to reveal
Query: "What information do I need?" Key: "What information can I provide?" Value: "Here is my actual information." Each is created by multiplying the embedding by a different learned weight matrix (Wq, Wk, Wv).
Why do we divide by sqrt(dk) before softmax?
Click to reveal
Without scaling, large dot products push softmax into regions with tiny gradients (near 0 or 1), making learning very slow. Dividing by sqrt(dk) keeps values in a moderate range where softmax gradients are healthy.
What is causal masking?
Click to reveal
Setting attention scores for future positions to -infinity before softmax, so they become exactly zero. This creates a triangular mask: position 0 sees only itself, position 1 sees positions 0-1, position 2 sees 0-2, etc. Prevents "cheating" during training.
Why multi-head attention instead of single-head?
Click to reveal
Each head learns to focus on different relationship types (grammar, meaning, pronoun reference, etc.). Multiple heads running in parallel capture richer patterns than a single head could. GPT-3 uses 96 heads, each with its own Q, K, V matrices.
What is a residual connection in attention?
Click to reveal
Instead of replacing the original embedding with the attention output, we add the attention output to the original. This preserves the word's core meaning while adding contextual refinements. Without residual connections, deep networks lose information.
Module 4 Complete
You now understand the core mechanism that makes Transformers powerful. Next up: Feed-Forward Networks and Layer Stacking -- how the model processes attention outputs and builds deeper understanding through repetition.
Synthesis question: If attention lets words share information, and we stack 96 layers of attention, what kind of "information sharing" might happen in layer 1 vs. layer 50 vs. layer 96?