Why does "dog bites man" confuse a transformer?

Think first
Transformers process every token in parallel - there's no "read left to right" like an RNN. But embed("dog") + embed("bites") + embed("man") gives the same sum as embed("man") + embed("bites") + embed("dog"). How could we inject order back into the model without breaking parallelism?

Addition is commutative: a + b = b + a. If the only thing the model sees is a bag of embeddings, "dog bites man" and "man bites dog" are literally identical inputs. Legal defense: none.

RNNs handled this naturally because they read one word at a time - order was baked into the temporal unfolding. Transformers ditch recurrence for speed, so they need a different trick: give each position its own fingerprint vector and add it to the embedding.

Key Insight

The embedding says what the word is. The positional encoding says where it is. Adding them yields a vector that encodes both - and later attention layers can learn to separate the two signals when needed.

Three obvious ideas that fail

Attempt 1: Just add the position index

Position 1 -> embedding + 1. Position 100 -> embedding + 100. Fails: embedding values are tiny (~0.05, -0.2). Adding 100 obliterates the word information - the positional signal drowns the meaning.

Attempt 2: Normalize to [0,1]

Divide the index by the sentence length. Fails: the step size depends on sentence length. Position 5 in a 10-word sentence and position 50 in a 100-word sentence both map to 0.5. The meaning of "position 0.5" is unstable across sentences.

Attempt 3: Binary encoding

Position 5 -> [1,0,1,0,...]. Fails: neighboring positions look totally different. Position 7 = 0111, Position 8 = 1000 - every bit flipped. The model has to learn that those two "very different" vectors are actually adjacent. Jagged, non-smooth, hard to learn.
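
All three failure modes can be sketched in a few lines (the embedding values here are illustrative, not from any real model):

```python
import math

# Attempt 1: the raw index swamps the embedding.
emb = [0.05, -0.2, 0.11]               # typical small embedding values
pos100 = [v + 100 for v in emb]        # position 100 added directly
# The word signal is now a tiny perturbation on the position signal.

# Attempt 2: normalizing by sentence length makes positions collide.
assert 5 / 10 == 50 / 100              # "position 0.5" is ambiguous

# Attempt 3: binary encoding is jagged at power-of-two boundaries.
def bits(n, width=4):
    return [(n >> i) & 1 for i in range(width)]

flips = sum(a != b for a, b in zip(bits(7), bits(8)))
assert flips == 4                      # adjacent positions, every bit flips
```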

Design Constraints

We want a positional encoding that is: (1) small enough not to overpower embeddings, (2) unique for every position, (3) smooth so nearby positions get nearby vectors, (4) unbounded in length - works for any sentence size, and (5) structured so the model can learn "k steps later" as a simple operation.

The sinusoidal trick

The Transformer paper's answer: treat position as a signal frequency. Each position gets a unique fingerprint built from sines and cosines at many different frequencies - like a bar code made of waves.

Split the d=512 dimensions into 256 pairs. Each pair uses one sin and one cos, and each pair has its own frequency:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
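
The formula translates directly into code (a minimal pure-Python sketch, no libraries):

```python
import math

def positional_encoding(pos, d=512):
    """Sinusoidal PE for one position: d values in alternating sin/cos pairs."""
    pe = [0.0] * d
    for i in range(d // 2):
        freq = 1.0 / (10000 ** (2 * i / d))   # denominator 10000^(2i/d)
        pe[2 * i]     = math.sin(pos * freq)  # dimension 2i
        pe[2 * i + 1] = math.cos(pos * freq)  # dimension 2i+1
    return pe

pe0 = positional_encoding(0)
assert pe0[0] == 0.0 and pe0[1] == 1.0        # sin(0)=0, cos(0)=1 in every pair
```

Every value stays in [-1, 1], so the encoding never overpowers the embedding - constraint (1) satisfied by construction.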

Why pairs of sin and cos? Because sin(0) = sin(pi) = 0 - by itself sin can't distinguish those two positions. But (sin(0), cos(0)) = (0, 1) and (sin(pi), cos(pi)) = (0, -1) are clearly different. Sine and cosine together trace a unique point on the unit circle for every position.

Why many frequencies? Early pairs (small denominator) oscillate fast - they encode fine-grained position. Late pairs (huge denominator ~9646) change extremely slowly - they encode broad "am I near the beginning or the end?" information.

Sinusoidal PE = a clock with 256 hands. The fast second-hand tells you exactly where you are within its short cycle; the slow hour-hand tells you roughly where in the day you are. Together they encode any time uniquely.
Pair i | Denominator 10000^(2i/512) | Cycle length
-------|----------------------------|-------------------
0      | 1                          | ~6 positions
1      | 1.036                      | ~6.5 positions
50     | 6.04                       | ~38 positions
255    | ~9646                      | ~60,000 positions
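
The table entries can be reproduced in a few lines (a quick check, assuming d = 512; the cycle length of a sinusoid with denominator D is 2*pi*D positions):

```python
import math

d = 512
for i in (0, 1, 50, 255):
    denom = 10000 ** (2 * i / d)
    cycle = 2 * math.pi * denom       # one full sin/cos period, in positions
    print(f"pair {i:3d}: denominator {denom:9.3f}, cycle ~{cycle:,.0f} positions")
```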

See the waves at different frequencies

Move the slider to change which pair i you're looking at. Notice how pair 0 oscillates many times across the sentence, while pair 255 barely wiggles.

Sinusoidal wave visualizer

Calculate a position's full vector

Pick a position and see the first 8 dimensions of its positional encoding computed step by step. Notice how neighboring positions produce nearly-identical early values (smooth) but still-unique overall vectors.

Position vector calculator (d=512)
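
That smoothness-plus-uniqueness property can be checked numerically (a pure-Python sketch; `cos_sim` is a helper defined here, not part of the formula):

```python
import math

def positional_encoding(pos, d=512):
    """Sinusoidal PE, same formula as above."""
    v = []
    for i in range(d // 2):
        f = 1.0 / 10000 ** (2 * i / d)
        v += [math.sin(pos * f), math.cos(pos * f)]
    return v

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# Neighbours are nearly parallel (smooth), yet never identical (unique),
# and similarity falls off as positions move apart.
assert cos_sim(positional_encoding(10), positional_encoding(11)) > 0.95
assert cos_sim(positional_encoding(10), positional_encoding(200)) < \
       cos_sim(positional_encoding(10), positional_encoding(11))
assert positional_encoding(10) != positional_encoding(11)
```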

"Dog bites man" vs "Man bites dog"

Without positional encoding, these are literally identical to the model. Toggle PE on/off and watch the two inputs become distinguishable.
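
The same toggle can be sketched with toy 8-dimensional "embeddings" (the values 0.25/0.5/0.75 are made up for illustration; they are exact binary floats so the sums compare exactly):

```python
import math

def pe(pos, d=8):
    v = []
    for i in range(d // 2):
        f = 1.0 / 10000 ** (2 * i / d)
        v += [math.sin(pos * f), math.cos(pos * f)]
    return v

emb = {"dog": [0.25] * 8, "bites": [0.5] * 8, "man": [0.75] * 8}

def encode(sentence, use_pe):
    out = []
    for pos, word in enumerate(sentence):
        vec = emb[word]
        if use_pe:
            vec = [e + p for e, p in zip(vec, pe(pos))]
        out.append(vec)
    return out

s1, s2 = ["dog", "bites", "man"], ["man", "bites", "dog"]

# Without PE, a bag-of-vectors sum cannot tell the sentences apart...
sum1 = [sum(tok[k] for tok in encode(s1, False)) for k in range(8)]
sum2 = [sum(tok[k] for tok in encode(s2, False)) for k in range(8)]
assert sum1 == sum2

# ...with PE, each token vector now carries its position.
assert encode(s1, True) != encode(s2, True)
```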

Order disambiguation demo

Relative position math

A beautiful property: PE(pos + k) can be written as a linear transformation of PE(pos) - and the matrix depends only on k, not on pos. This means attention can learn "look at the token 3 positions back" as a simple matrix multiplication - exactly what we want for grammar and local structure.

What modern LLMs actually use

GPT-2 and GPT-3 don't use the sinusoidal formula - they use learned positional embeddings: a trainable lookup table of shape (max_seq_length, d_model). Simpler, often slightly better for the trained length, but has a hard edge: cannot extrapolate beyond max_seq_length.
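
The hard edge is easy to see in a toy sketch (the sizes and random initialization here are illustrative, not GPT-2's actual table):

```python
import random

max_seq_length, d_model = 1024, 8    # toy sizes for illustration

# A learned PE is just a trainable (max_seq_length, d_model) lookup table.
random.seed(0)
pe_table = [[random.gauss(0, 0.02) for _ in range(d_model)]
            for _ in range(max_seq_length)]

vec = pe_table[1023]                 # last trained position: fine
try:
    pe_table[1024]                   # one past the table: no vector exists
except IndexError:
    print("cannot extrapolate beyond max_seq_length")
```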

Frontier models (Llama, Mistral, PaLM, Qwen) use RoPE - Rotary Positional Embeddings. RoPE brings sinusoidal math back but applies it inside attention: it rotates the Q and K vectors by an angle proportional to their position. The dot product q . k then naturally depends on the relative position of the two tokens. RoPE extrapolates better and is the current default.
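
A minimal RoPE sketch (toy 4-dimensional q and k with made-up values; real implementations apply this rotation inside every attention head):

```python
import math

def rotate(vec, pos, base=10000):
    """RoPE: rotate each (even, odd) pair of vec by pos * that pair's frequency."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [0.3, -0.1, 0.7, 0.2], [0.5, 0.4, -0.2, 0.1]

# The attention score depends only on the relative offset, not absolute positions:
s1 = dot(rotate(q, 10), rotate(k, 7))     # positions 10 and 7  -> offset 3
s2 = dot(rotate(q, 103), rotate(k, 100))  # positions 103 and 100 -> offset 3
assert abs(s1 - s2) < 1e-9
```

Rotating both q and k means their dot product only sees the difference of the two rotation angles - relative distance falls out for free, with no extra parameters.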

Sinusoidal PE = Stamp a position barcode on each word before attention sees it.
Learned PE = Look up a trained vector per position slot - simpler, but can't extrapolate.
RoPE = Rotate Q and K vectors by angle proportional to position - relative distance falls out of the dot product for free.

Check your understanding

1. Why would adding the raw position index (0, 1, 2, ..., 100) to each embedding destroy the model?
Right. Embeddings live in a small range like [-1, 1]. A positional value of 100 dominates completely and the word identity gets drowned out.
2. Why does sinusoidal encoding pair sin and cos instead of using sin alone?
Correct. (sin, cos) together trace a unique angle on the unit circle; sin by itself would alias positions.
3. What role does the huge denominator 10000^(2i/d) play?
Exactly. Small i -> small denominator -> fast oscillation. Large i -> huge denominator -> nearly constant across the sequence. Multiple frequencies = multi-scale position info.
4. What is the main limitation of GPT-2's learned positional embeddings?
Right. Learned PE is a lookup table of fixed size. Sinusoidal and RoPE can be evaluated at any position; learned PE cannot.
5. Why is the property "PE(pos + k) is a linear transformation of PE(pos)" useful?
Yes. This linearity is why sinusoidal PE helps the model learn grammar and local structure - the operation "shift by k" is built into the geometry.

Teach It Back

Explain to a friend: why Transformers need positional encoding, why the obvious ideas (raw index, [0,1] scaling, binary) fail, and how sinusoidal encoding uses multiple frequencies of sin and cos to satisfy all the design requirements.

An AI tutor will compare your explanation against the course material.


Flashcards (click to flip)

Why do Transformers need positional encoding?
Transformers process all tokens in parallel and use addition to combine embeddings. Addition is commutative, so "dog bites man" and "man bites dog" would be identical sums. PE injects order information back in.
What is the sinusoidal PE formula?
PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). Each pair of dimensions uses a different frequency controlled by the denominator.
Why pair sin and cos instead of using just sin?
Sin alone aliases: sin(0) = sin(pi) = 0. Pairing with cos gives (sin, cos) which traces a unique point on the unit circle for every angle, so positions never collide.
What do fast vs slow frequencies encode?
Fast frequencies (low i, small denominator) capture fine position - "exactly where am I?". Slow frequencies (high i, huge denominator) capture coarse position - "beginning, middle, or end?". Like the second-hand vs hour-hand of a clock.
Why is PE(pos+k) = M * PE(pos) important?
It means "shift by k positions" is a fixed linear transformation. Attention can learn simple relative-position operations (e.g. "look two tokens back") as a matrix multiply, which supports grammar and local structure.
Learned PE vs sinusoidal vs RoPE?
Sinusoidal (original paper): fixed formula, extrapolates to any length. Learned (GPT-2/3): trainable lookup table, simpler but capped at max trained length. RoPE (Llama, Mistral): rotates Q/K vectors by position angle, so the dot product naturally depends on relative distance; extrapolates well and is the modern default.

Module 5 Complete

Words now carry meaning (embeddings) + location (positional encoding) + context (attention). Next up: stacking many layers and what each layer actually learns.

← Previous Course Home Next →