The word "bank" has one embedding vector. But it means completely different things in: "I deposited money at the bank", "The river bank was muddy", and "The plane had to bank left". How could surrounding words help the model figure out which meaning to use?
We are building a model that reads text and predicts the next word. So far we have tokens, embedding vectors (encoding meaning), and positional information. But there's a critical problem.
Our embedding table has just one entry for "bank" -- always the same vector. This single static vector can't capture financial institutions, river edges, and airplane maneuvers simultaneously.
Attention allows surrounding words to pass information to refine meaning. When "bank" appears near "money" and "deposited", attention mixes in financial context. Near "river" and "muddy", different context is mixed in.
Key Insight
Attention doesn't just help with ambiguous words. It works for all words. "sat" can incorporate information from "cat", helping the model know something animal-like is doing the sitting. Every word's representation gets enriched by its neighbors.
Attention=A cocktail party. Each word "listens" to every other word, but pays more attention to the most relevant speakers. "bank" tunes in to "money" and tunes out "the".
Concept 2 of 10
Query, Key, and Value: The Three Roles
Every word plays three roles simultaneously in attention. Think of it like a library:
Query (Q)
"What information do I need?"
Like a search query you type into a library catalog. The word "sat" asks: "who or what is doing the sitting?"
Key (K)
"What information can I provide?"
Like book titles on the shelf. Each word advertises its content. "cat" advertises: "I'm an animal, a subject".
Value (V)
"Here is my actual information"
Like the book's contents. Once you've matched query to key, the value is what actually gets transferred.
Each role is created by multiplying the embedding by a different learned weight matrix:
"sat" embedding: [0.3, -0.3, 0.7, 0.15]
"sat" x Wq = query: [0.12, 0.13] ("what do I need?")
"sat" x Wk = key: [0.09, 0.58] ("what can I offer?")
"sat" x Wv = value: [0.13, 0.45] ("here is my info")
The same weight matrices are applied to every word. Queries and keys are intentionally smaller than embeddings (GPT-3: 12,288 embedding dims -> 128 dims per query head). This compression forces each head to focus on a narrow slice of the embedding's information.
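The projection step above can be sketched in a few lines of NumPy. The "sat" embedding comes from the example; the weight matrices are random placeholder values, not real trained weights:

```python
import numpy as np

# 4-dim embedding for "sat" (values from the example above)
sat = np.array([0.3, -0.3, 0.7, 0.15])

# Learned weight matrices project the embedding down to 2 dims.
# These are illustrative random values, not trained weights.
rng = np.random.default_rng(0)
Wq = rng.normal(scale=0.5, size=(4, 2))
Wk = rng.normal(scale=0.5, size=(4, 2))
Wv = rng.normal(scale=0.5, size=(4, 2))

q = sat @ Wq  # "what do I need?"
k = sat @ Wk  # "what can I offer?"
v = sat @ Wv  # "here is my info"

print(q.shape, k.shape, v.shape)  # each is a 2-dim vector
```

The same three matrices would be applied to every token in the sentence, so the projection is one matrix multiply per role.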
Q, K, V=Google search. Your search query (Q) is matched against page titles/tags (K). When there's a match, you read the page content (V). Attention is like searching every word against every other word.
Concept 3 of 10
The Matching Step: Dot Product Scores
How does the model decide which words are relevant to each other? It calculates the dot product between each query-key pair. Higher positive number = strong relevance. Near zero = little relevance.
Interactive: Query/Key/Value Calculator
Select a word to see its Q, K, V vectors and how it matches with other words. Click "sat" to see how it attends to "cat".
In practice, all queries are stacked into one matrix, all keys into another (transposed), and one matrix multiplication computes all pairs simultaneously. This is why GPUs are so important -- they excel at matrix multiplication.
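The all-pairs matching can be sketched as one matrix multiplication. The query and key vectors here are made-up numbers in the spirit of the earlier example:

```python
import numpy as np

# Three tokens, each with a 2-dim query and a 2-dim key (illustrative numbers).
Q = np.array([[0.12, 0.13],
              [0.40, 0.10],
              [0.05, 0.58]])
K = np.array([[0.09, 0.58],
              [0.30, 0.20],
              [0.11, 0.45]])

# One matrix multiply computes every query-key dot product at once:
# scores[i, j] = how well token i's query matches token j's key.
scores = Q @ K.T
print(scores.shape)  # a 3x3 grid of match scores
```

This single `Q @ K.T` is the operation GPUs parallelize so well: all n^2 pairwise scores in one call.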
Key Insight
The dot product measures how well a query "matches" a key. It's like measuring how similar two directions are. Parallel vectors = high score. Perpendicular = zero. Opposite = negative.
Concept 4 of 10
From Scores to Probabilities: Scale and Softmax
Think first
Raw dot product scores can be any number (like -10, 45, 100). We need probabilities that sum to 1. How would you convert arbitrary numbers into a probability distribution? And why might very large numbers be a problem?
Step 1: Scale the scores
Divide by sqrt(dk) to prevent huge numbers from dominating. GPT-3 with 128-dim keys: a raw score of 45.2 becomes 45.2 / sqrt(128) ≈ 4.0. This keeps gradients healthy during training.
Step 2: Apply softmax
Softmax converts arbitrary numbers into a probability distribution: exponentiate each score, sum all exponentials, divide each by the sum. The result is always positive and sums to 1.
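Both steps fit in a few lines of NumPy. The raw scores below are arbitrary illustration values; subtracting the max before exponentiating is a standard trick to avoid overflow and doesn't change the result:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability (same output, no overflow).
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_k = 128                                  # key dimension, as in GPT-3
raw_scores = np.array([45.2, 12.0, 3.0])   # arbitrary example scores

scaled = raw_scores / np.sqrt(d_k)         # step 1: 45.2 / sqrt(128) ≈ 4.0
weights = softmax(scaled)                  # step 2: probabilities

print(weights.sum())  # sums to 1 -- a valid probability distribution
```

Every weight is strictly positive, so even "irrelevant" words get a sliver of attention; softmax just makes that sliver tiny.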
Interactive: Softmax Visualizer
Adjust the raw scores to see how softmax converts them into probabilities. Notice how it amplifies differences -- a small lead becomes dominant.
Key Insight
Softmax is the attention "sharpener". Without it, every word would pay roughly equal attention to everything. With it, the model can focus -- choosing to attend 84% to "cat" and only 8% to "The".
Concept 5 of 10
Attention Scores: The Full Picture
Let's see what attention looks like for an entire sentence. Each row shows how much one word attends to every other word.
Interactive: Attention Score Heatmap
Hover over cells to see exact attention weights. Each row sums to 100%. Click a sentence to switch examples.
Notice how "bank" in the financial sentence attends strongly to "deposited" and "money", while "bank" in the river sentence attends to "river" and "muddy". Same word, different attention patterns, different contextual meaning.
Concept 6 of 10
Causal Masking: No Peeking at the Future
Causal language models (like GPT) predict the next word. During training, the model sees the whole sentence but must predict each word using only the words before it. If it could see future words, it would just copy the answer!
Before applying softmax, we set all future position scores to -infinity. Softmax converts -infinity to exactly zero -- those positions get zero attention.
Interactive: Causal Mask
Click "Apply Mask" to see how the triangular mask blocks future positions. Only the lower triangle survives.
After masking:
Position 0 ("The") sees only: The
Position 1 ("cat") sees: The, cat
Position 2 ("sat") sees: The, cat, sat
Position 3 ("on") sees: The, cat, sat, on
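The masking steps above can be sketched directly: build a triangular mask, set the future positions to -infinity, and let softmax zero them out. The score values are random placeholders:

```python
import numpy as np

n = 4  # tokens: The, cat, sat, on
scores = np.random.default_rng(0).normal(size=(n, n))  # placeholder scores

# Upper triangle (strictly above the diagonal) = future positions.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf  # block the future before softmax

# Row-wise softmax: exp(-inf) = 0, so future positions get exactly 0 attention.
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(weights[0])  # → [1. 0. 0. 0.] -- position 0 sees only itself
```

Only the lower triangle of `weights` is nonzero, matching the "who sees whom" list above.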
The quadratic bottleneck
This attention matrix grows quadratically. 10 tokens = 100 interactions. 1,000 tokens = 1,000,000 interactions. 100,000 tokens = 10 billion interactions. This is the Transformer's main computational bottleneck and why context length is so expensive to increase.
Concept 7 of 10
The Weighted Sum: Mixing Information
Now we combine attention probabilities with value vectors. Each word's new representation is a weighted mix of all the value vectors it can see.
"sat" attends to:
33.3% from "The" value vector
34.0% from "cat" value vector
32.6% from "sat" value vector
Weighted sum = Context Vector [0.39, 0.34]
The critical step: we don't replace the old embedding. We add the context vector to it. This is called a Residual Connection:
Original Embedding + Attention Output = Final Representation
"sat" now contains its dictionary meaning plus context that it's an action performed by a cat.
Key Insight
Residual connections are crucial. Without them, stacking many layers would cause the original word meaning to be lost. By adding rather than replacing, each layer contributes a refinement while preserving the foundation.
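The weighted sum and residual step can be sketched as follows. The attention weights are the percentages from the example above; the value vectors and the 2-dim "sat" embedding are made-up numbers, chosen so the weighted sum lands near the [0.39, 0.34] context vector shown earlier:

```python
import numpy as np

# Attention weights for "sat": 33.3% The, 34.0% cat, 32.6% sat.
attn = np.array([0.333, 0.340, 0.326])

# Illustrative 2-dim value vectors for The, cat, sat.
V = np.array([[0.20, 0.10],
              [0.60, 0.50],
              [0.37, 0.42]])

context = attn @ V  # weighted sum ≈ [0.39, 0.34]

# Residual connection: ADD the context to the original embedding,
# don't replace it. (Hypothetical 2-dim embedding for "sat".)
sat_embedding = np.array([0.10, 0.25])
new_sat = sat_embedding + context
```

Because the update is additive, "sat" keeps its dictionary meaning and gains the contextual refinement on top.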
Concept 8 of 10
The Complete Formula and Multi-Head Attention
Attention(Q, K, V) = softmax(QK^T / sqrt(dk)) * V
QK^T = match scores | / sqrt(dk) = scale | softmax = probabilities | * V = extract info
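The whole formula fits in one small function. This is a minimal single-head sketch without causal masking, using random placeholder inputs:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- single head, no causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # match scores, scaled
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)        # probabilities per row
    return weights @ V                                 # weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))  # 5 tokens, 8 dims
out = attention(Q, K, V)
print(out.shape)  # one updated 8-dim vector per token
```

Each output row is a convex combination of the value vectors, which is exactly the "mixing information" step from the previous concept.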
Multi-Head Attention (MHA)
One head captures one type of relationship. Multiple heads run in parallel, each learning to focus on different things:
All head outputs are concatenated, then multiplied by an Output Projection Matrix (Wo) to synthesize different perspectives into a single coherent update:
GPT-3: (96 heads x 128 dims) -> projected back to 12,288 embedding dims.
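The concatenate-and-project step can be sketched at toy scale. Here 4 heads of 16 dims each stand in for GPT-3's 96 heads of 128 dims; the per-head outputs and Wo are random placeholders:

```python
import numpy as np

n_tokens, d_model, n_heads = 5, 64, 4
d_head = d_model // n_heads  # 16 dims per head (GPT-3: 12288 / 96 = 128)

rng = np.random.default_rng(0)
# Pretend each head has already produced its context vectors.
head_outputs = [rng.normal(size=(n_tokens, d_head)) for _ in range(n_heads)]

# Concatenate along the feature axis, then project with Wo
# to blend the heads' perspectives into one coherent update.
concat = np.concatenate(head_outputs, axis=-1)  # (5, 64)
Wo = rng.normal(size=(d_model, d_model))        # output projection matrix
update = concat @ Wo                            # (5, 64) -- back to model dims

print(update.shape)
```

Note that concatenation alone just stacks the heads side by side; it's the Wo projection that lets information from different heads interact.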
Multi-Head=A panel of 96 experts, each examining the sentence from a different angle. One checks grammar, another checks meaning, another checks style. Their findings are combined into a single comprehensive report.
5. Context vector from blending all attended value vectors
6. Repeat for all heads in parallel -- each head captures different relationships
7. Concatenate, project, add to original (residual)
Result: "sat" now knows it's an action performed by a cat!
And this is just one layer. Deep understanding requires iteration -- stacking 12, 24, or 96 attention layers allows the model to reason about context that relies on other context. Layer 1 might learn grammar. Layer 50 might learn meaning. Layer 96 might learn reasoning.
Concept 10 of 10
Check Your Understanding
1. What problem does attention solve?
Correct! Without attention, "bank" always has the same representation. Attention lets surrounding words (like "money" or "river") modify its meaning dynamically.
2. What are the roles of Query, Key, and Value in attention?
Right! Query is the search question, Key is what each word advertises about itself, and Value is the actual information that gets transferred when there's a match.
3. Why is causal masking necessary?
Exactly! During training, the model sees the full sentence but must predict each token using only previous tokens. Masking future positions (setting them to -infinity before softmax) enforces this constraint.
4. Why does GPT-3 use 96 attention heads instead of just one?
Correct! One head captures one type of relationship. Multiple heads running in parallel can simultaneously track grammar, meaning, pronouns, style, and many other linguistic features.
5. What is the main computational bottleneck of attention?
Right! Every token must attend to every other token: n^2 interactions. 1000 tokens = 1M interactions. This quadratic scaling is why increasing context length is so expensive.
Teach It Back
Explain to a friend: What is attention in a Transformer, how does the Q/K/V mechanism work, and why does "bank" get different representations in "river bank" vs "money bank"?
An AI tutor will compare your explanation against the course material and give specific feedback.
Flashcards (click to flip)
What is the attention formula?
Click to reveal
Attention(Q, K, V) = softmax(QK^T / sqrt(dk)) * V. QK^T computes match scores, dividing by sqrt(dk) scales them, softmax converts to probabilities, multiplying by V extracts the information.
What do Query, Key, and Value represent?
Click to reveal
Query: "What information do I need?" Key: "What information can I provide?" Value: "Here is my actual information." Each is created by multiplying the embedding by a different learned weight matrix (Wq, Wk, Wv).
Why do we divide by sqrt(dk) before softmax?
Click to reveal
Without scaling, large dot products push softmax into regions with tiny gradients (near 0 or 1), making learning very slow. Dividing by sqrt(dk) keeps values in a moderate range where softmax gradients are healthy.
What is causal masking?
Click to reveal
Setting attention scores for future positions to -infinity before softmax, so they become exactly zero. This creates a triangular mask: position 0 sees only itself, position 1 sees positions 0-1, position 2 sees 0-2, etc. Prevents "cheating" during training.
Why multi-head attention instead of single-head?
Click to reveal
Each head learns to focus on different relationship types (grammar, meaning, pronoun reference, etc.). Multiple heads running in parallel capture richer patterns than a single head could. GPT-3 uses 96 heads, each with its own Q, K, V matrices.
What is a residual connection in attention?
Click to reveal
Instead of replacing the original embedding with the attention output, we add the attention output to the original. This preserves the word's core meaning while adding contextual refinements. Without residual connections, deep networks lose information.
Module 4 Complete
You now understand the core mechanism that makes Transformers powerful. Next up: Feed-Forward Networks and Layer Stacking -- how the model processes attention outputs and builds deeper understanding through repetition.
Synthesis question: If attention lets words share information, and we stack 96 layers of attention, what kind of "information sharing" might happen in layer 1 vs. layer 50 vs. layer 96?