How does a model know that "cat" and "kitten" are related?

Think first
We have token IDs like [976, 9059, 382] for "The cat is". But the number 9059 is just an arbitrary integer. It tells the model nothing about cats. How could we represent words so that similar words (like "cat" and "kitten") look similar mathematically?

We have successfully turned our text into token IDs like [976, 9059, 382] (representing "The cat is..."). However, we still have a problem: these numbers mean nothing by themselves. To a computer, 9059 is just an arbitrary integer. It isn't inherently "animal-like" or "furry."

The model needs a representation that captures three things:

Similarity: Similar words (like "cat" and "kitten") should look similar mathematically.
Relationships: Logical connections should be preserved (like "King" is to "Queen" as "Man" is to "Woman").
Density: A lot of meaning should fit into a small space.

Key Insight

Embeddings turn arbitrary token IDs into rich vectors of numbers where position encodes meaning. Words with similar meanings end up near each other in this high-dimensional space.

One-Hot Encoding: The Naive Approach

The simplest approach to turning a token ID into something a neural network can use is one-hot encoding. Say our vocabulary is just 5 words: ["The", "cat", "is", "playing", "."]. To represent "The" (ID 0), we create a vector of length 5 with a 1 in the 0th position and 0 everywhere else.

Interactive: One-Hot Encoding

Click a word to see its one-hot vector. Notice how all vectors are equally far apart.


This works fine for toy examples, but real-world vocabularies have 50,000+ tokens. "cat" would be a vector with 49,999 zeros and a single one. This sparse representation has two massive problems:

Why one-hot encoding fails

1. Incredibly wasteful of memory. Each token needs a 50,000-dimensional vector that's almost entirely zeros.

2. Creates meaningless distances. The distance between "cat" and "kitten" is identical to the distance between "cat" and "airplane". Every word is equally different from every other word. The model can't learn that some words are more related than others.
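Both failure modes are easy to demonstrate. The sketch below (assuming NumPy and a toy four-word vocabulary) shows that every pair of distinct one-hot vectors sits at exactly the same distance, so "cat" is no closer to "kitten" than to "airplane":

```python
import numpy as np

# A toy vocabulary -- real vocabularies have 50,000+ tokens.
vocab = ["The", "cat", "kitten", "airplane"]

def one_hot(token_id, vocab_size):
    """Vector of zeros with a single 1 at the token's position."""
    vec = np.zeros(vocab_size)
    vec[token_id] = 1.0
    return vec

cat = one_hot(vocab.index("cat"), len(vocab))
kitten = one_hot(vocab.index("kitten"), len(vocab))
airplane = one_hot(vocab.index("airplane"), len(vocab))

# Any two distinct one-hot vectors differ in exactly two positions,
# so every pair is exactly sqrt(2) apart -- distance carries no meaning.
print(np.linalg.norm(cat - kitten))    # 1.414...
print(np.linalg.norm(cat - airplane))  # 1.414...
```

Scale the vocabulary up to 50,000 and each vector also wastes 49,999 zeros per token, which is the memory problem from point 1.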

Dense Embeddings: Packing Meaning into Numbers

Instead of a massive vector of zeros, we compress the meaning into a smaller vector of fixed size. The original Transformer used 512 dimensions. GPT-3 uses 12,288.

Every position in this vector is a decimal number (like 0.2, -0.89, or 0.001) that represents a specific "feature" of the word.

One-Hot (50,000 size): "cat" -> [0,0,0,...,1,...,0,0,0] (Mostly empty)
Dense (512 size): "cat" -> [0.2, -0.5, 0.8, ..., 0.3] (Packed with information)
Think of 512 dimensions like describing someone's personality: you need multiple axes, such as introversion, openness, kindness, and humor. Similarly, to capture "cat", one dimension might encode "living vs. object", another "size", another "furriness".

Key Insight

Dense embeddings are the bridge from meaningless IDs to meaningful representations. A 512-number vector can encode incredibly nuanced meaning -- from "is this a living thing?" to "is this related to royalty?"

Visualizing the Embedding Space

When we say the embedding size is 512, our "space" has 512 axes instead of just two. Each token is a point with a specific coordinate on every axis. We can't draw 512 dimensions, but we can see the principle in 2D.

Interactive: 2D Embedding Space

Hover over the dots to see each word's coordinates. Notice how similar concepts cluster together. The axes represent simplified features: "Living vs. Mechanical" (x) and "Large vs. Small" (y).


Cat and Dog are physically close in embedding space -- the model mathematically understands they are related concepts. Airplane is far away in a different region. This "spatial proximity" is the entire basis of how the model captures meaning.

Cat: [0.95, 0.35] (High Living, Small)
Dog: [0.97, 0.55] (High Living, Medium)
Horse: [0.90, 0.80] (High Living, Large)
Car: [0.08, 0.65] (Low Living, Medium)
Airplane: [0.05, 0.90] (Low Living, Large)
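With the 2D coordinates above, "spatial proximity" is just ordinary geometric distance. This short sketch (using only the standard library) computes it for the points in the table:

```python
import math

# The 2D coordinates from the table above: (living-ness, size).
coords = {
    "cat":      (0.95, 0.35),
    "dog":      (0.97, 0.55),
    "airplane": (0.05, 0.90),
}

def dist(a, b):
    """Euclidean distance between two words in the toy 2D space."""
    return math.hypot(coords[a][0] - coords[b][0],
                      coords[a][1] - coords[b][1])

print(dist("cat", "dog"))       # ~0.20 -- close: related concepts
print(dist("cat", "airplane"))  # ~1.05 -- far: unrelated concepts
```

Real models do exactly this, just with 512+ axes instead of two.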

The Embedding Matrix: A Giant Lookup Table

Think first
We have 50,257 tokens and each needs a 512-dimensional vector. What's the simplest way to store and retrieve these vectors? How many total numbers do we need?

The Embedding Layer is effectively a giant spreadsheet:

Rows: One for each token in the vocabulary (50,257 tokens).
Columns: The embedding dimension (512 or more).
Shape: (vocab_size, embedding_dimension).

When the model receives token ID 3797 ("cat"), it simply goes to Row 3797 of the matrix, copies the list of numbers in that row, and that list is the embedding vector. That's it -- a lookup.

Interactive: Embedding Lookup Table

Click a token to look up its embedding vector. The highlighted row is the vector that gets sent to the next layer.


In GPT-3, the embedding matrix has 50,257 rows x 12,288 columns = roughly 617 million parameters for the embedding layer alone!

Embedding Matrix = A massive dictionary where each entry isn't a definition in words, but a list of 12,288 numbers that encode the meaning mathematically.
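The parameter count is simple arithmetic, worth checking once yourself:

```python
# GPT-3's embedding matrix: one row per vocabulary token,
# 12,288 learned numbers in each row.
vocab_size, embed_dim = 50_257, 12_288
params = vocab_size * embed_dim
print(params)  # 617558016 -- roughly 617 million parameters
```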

How Embeddings "Learn" Meaning

We don't hand-code these vectors. The embedding matrix is learned during training, updated alongside every other parameter in the network.

At the start of training, the matrix is random noise. "Cat" might be close to "Table" or "Sky." As the model trains on billions of tokens, a remarkable thing happens:

When it sees "The cat sat on the mat" and "The dog sat on the mat", "cat" and "dog" appear in the same context. Predicting the next word after either one requires similar information. So the training process pushes their vectors closer together.

Meanwhile, "The airplane flew in the sky" appears in completely different contexts. So "airplane" gets pushed to a different region of the space.

After training:
"cat" -> [0.2, -0.5, 0.8, ...]
"dog" -> [0.3, -0.4, 0.7, ...] (similar!)
"airplane" -> [0.9, 0.1, -0.2, ...] (different!)

Semantic Arithmetic: The King-Queen Analogy

Relationships become so precise you can do semantic arithmetic:

Interactive: Word Vector Arithmetic
King - Man + Woman = Queen

The "direction" from Man to King (adding "royalty") is the same direction from Woman to Queen. Try other analogies:

Key Insight

Nobody programs these relationships. They emerge from training on billions of sentences. The model discovers that "royalty" is a consistent direction in embedding space simply because royal words appear in similar contexts.
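The arithmetic can be sketched with idealized toy vectors. Here I invent a 2D space where one axis is "royalty" and the other "maleness"; real learned embeddings spread these directions across hundreds of dimensions and the equation only holds approximately:

```python
import numpy as np

# Idealized toy embeddings: dim 0 = "royalty", dim 1 = "maleness".
king  = np.array([1.0, 1.0])
man   = np.array([0.0, 1.0])
woman = np.array([0.0, 0.0])
queen = np.array([1.0, 0.0])

# Subtracting man removes "maleness"; adding woman contributes nothing
# male -- what remains is "royalty" without "maleness": queen.
result = king - man + woman
print(result)  # [1. 0.] -- exactly queen in this toy space
```

In real embedding spaces the result is only *near* "queen", and the match is found by picking the vocabulary word with the highest cosine similarity to the result.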

The Missing Piece: Position

Embedding vectors contain no information about position. Consider these two sentences:

"The cat ate the mouse."
Cat eats mouse
"The mouse ate the cat."
Mouse eats cat (!)

These two sentences look identical to the model -- same tokens, same embeddings. The model knows what the words mean, but not where they are.

To fix this, we add Positional Encodings:

Final Input = Word Embedding + Position Embedding

This "tags" the word "cat" with information that says "I am the second word in the sentence." The position embedding is another learned vector, added to the word embedding before entering the attention layers.

What goes wrong without position information

Without positional encodings, the model treats "dog bites man" and "man bites dog" identically. It becomes a bag of words -- knowing what's present but not the order. Word order is essential to meaning in most languages.

Where This Fits

Concept Map: Where We Are
Text -> Tokenize -> Token IDs -> Embeddings -> Attention -> Layers -> Prediction

Purple = covered so far. Gray = coming in next modules.

We now have a complete path from raw text to meaningful number representations. Each token is now a dense vector that captures its meaning. But there's still a problem: each word's embedding is static. The word "bank" has the same vector whether we're talking about a river bank or a financial bank. To solve this, we need Attention -- the mechanism that lets words share context with each other.

Check Your Understanding

1. Why is one-hot encoding impractical for LLMs?
Correct! One-hot vectors are extremely sparse (mostly zeros) and, worse, every word is equidistant from every other word. "Cat" is just as far from "kitten" as from "airplane". Dense embeddings solve both problems.
2. What does the embedding matrix actually do?
Right! The embedding matrix is simply a lookup table. Token ID 3797 -> go to row 3797, copy those 512 (or 12,288) numbers. The magic is that these numbers were learned during training to capture meaning.
3. Why does vector("king") - vector("man") + vector("woman") approximately equal vector("queen")?
Exactly! No one programs these relationships. The model learns from context: king and queen appear in similar royal contexts, man and woman in similar gendered contexts. The "royalty" concept becomes a geometric direction in the space.
4. What information is missing from word embeddings alone?
Correct! Embeddings encode meaning but not position. "The cat ate the mouse" and "The mouse ate the cat" would have the same set of embeddings. Positional encodings are added to fix this.

Teach It Back

Explain to a friend: What are embeddings, why do we need them instead of just using token IDs, and how does the model learn them? Bonus: explain the king-queen analogy.

An AI tutor will compare your explanation against the course material and give specific feedback.


Flashcards (click to flip)

What is the difference between one-hot encoding and dense embeddings?
One-hot: a vector of size vocab_size with a single 1 and all zeros. Sparse and meaningless distances. Dense: a much smaller vector (512-12,288 dims) of learned decimal numbers that encode semantic meaning. Similar words have similar vectors.
What is the shape of GPT-3's embedding matrix?
(50,257 x 12,288) -- 50,257 tokens in the vocabulary, each with a 12,288-dimensional vector. That's roughly 617 million parameters just for the embedding layer.
How do embeddings "learn" that cat and dog are related?
During training, "cat" and "dog" appear in similar contexts ("The ___ sat on the mat"). To predict the next word, both need similar information, so gradient descent pushes their vectors closer together. Words in different contexts drift apart.
What is the "king - man + woman = queen" analogy about?
It shows that embeddings capture relationships as geometric directions. The direction from "man" to "king" encodes "royalty." Applying the same direction to "woman" lands near "queen." This emerges from training, not programming.
Why do we need positional encodings?
Word embeddings contain no position information. "The cat ate the mouse" and "The mouse ate the cat" would look identical. Positional encodings are added to tell the model where each word appears in the sequence: Final Input = Word Embedding + Position Embedding.

Module 3 Complete

Next up: Attention -- the mechanism that lets "bank" mean different things depending on whether "river" or "money" is nearby.

Synthesis question: If every word has a fixed embedding vector, how can the model understand that "bank" means different things in "river bank" vs. "bank account"?
