Why "Hello world" must become numbers

Think first
When you type "Hello world" into ChatGPT, you see letters. But the model sees a stream of numbers. Why can't a neural network work with text directly? And how should we break "Hello world" into numbers?

Before a transformer can learn anything, text must become numbers. Where you see letters, the model sees a stream of numbers it can multiply, add, and analyze.

Computers cannot do math with words. They cannot multiply "cat" or subtract "sat." They need numbers. The process of converting text into numbers is called tokenization, and it determines the model's vocabulary and flexibility.

Key Insight

Tokenization is the very first step in the LLM pipeline. By the end of this module, you'll understand why "The cat sat" becomes [464, 3797, 2495] and how this enables language generation.

In this module we cover three things:

Chunking: How text is broken into meaningful units.
Encoding: How those chunks become ID numbers.
Significance: Why this step determines the model's vocabulary and flexibility.

Tokenization = Chopping ingredients before cooking. The size and shape of the pieces (characters, words, sub-words) affect the entire dish.

Characters, Bytes, and Unicode

Early systems used ASCII, a table that assigned numbers to 128 basic characters. In ASCII, the phrase "The cat" is represented as numbers like 84, 104, 101, 32, 99, 97, 116.

While this allowed computers to store text, it presented a major limitation for language modeling: these numbers do not capture meaning. To the computer, the sequence 99, 97, 116 is just three separate values, not the unified concept of "cat."

Furthermore, ASCII covers only the basic Latin letters, digits, and punctuation used in English. Modern systems use Unicode, which assigns a unique number (called a code point) to characters from virtually every writing system, as well as symbols such as emojis. UTF-8 is the most common way to encode these code points as bytes for storage and transmission.
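
The mapping from characters to code points and bytes can be checked directly. The following is a minimal sketch using only Python built-ins (`ord` and `str.encode`); the sample strings are chosen here for illustration.

```python
# Code points: one integer per character (matches the ASCII values above).
text = "The cat"
code_points = [ord(ch) for ch in text]
print(code_points)  # [84, 104, 101, 32, 99, 97, 116]

# UTF-8 turns each code point into one to four bytes.
# ASCII characters take one byte; an emoji such as U+1F600 takes four.
emoji = "\U0001F600"
utf8_bytes = list(emoji.encode("utf-8"))
print(hex(ord(emoji)), utf8_bytes)  # 0x1f600 [240, 159, 152, 128]
```

Note that none of these numbers says anything about what the text means; they only identify characters.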

Each character maps to a number, but 99,97,116 means nothing about "cat" to the model.

What goes wrong with character-level numbers

If we just fed raw ASCII numbers into a model, it would need to learn that 99,97,116 together mean "cat" from scratch. The sequence "cat" would be 3 separate inputs with no inherent connection. The model must learn spelling before meaning -- an enormous waste of capacity.

Finding Meaningful Chunks

Simply turning characters into numbers lets us store text, but it does not help the model understand text. The way we choose to "chunk" the text dramatically affects how well the model learns.

Think first
Consider generating the word "understanding". If you had to choose: break it into individual characters (u,n,d,e,r...), keep it as one whole word, or split it into meaningful parts (under+stand+ing) -- which would you pick and why?

Consider generating the word "understanding":

Character-level: The model must generate 13 separate steps (u, n, d, e...). Tiny vocabulary, but incredibly long sequences.
Word-level: The model generates it in one step, if the word is in its vocabulary. Fast, but what about rare words?
Subword-level: The model might generate "under" + "stand" + "ing" (3 steps). A sweet spot.
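
The three strategies can be compared on the same word. This is a rough sketch: the character and word splits are computed, but the subword split is written by hand for illustration, since a real tokenizer learns its splits from data.

```python
import re

word = "understanding"

# Character-level: one token per character.
char_tokens = list(word)

# Word-level: split on whitespace and punctuation; the word stays whole.
word_tokens = re.findall(r"\w+|[^\w\s]", word)

# Subword-level: a hand-written split for illustration only.
subword_tokens = ["under", "stand", "ing"]

for name, tokens in [
    ("character", char_tokens),
    ("word", word_tokens),
    ("subword", subword_tokens),
]:
    print(f"{name:9} {len(tokens):2} tokens  {tokens}")
```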

Strategy  | Vocab Size | Sequence Length | Handles Rare Words?
----------|------------|-----------------|------------------------
Character | ~100       | Very Long       | Yes (always)
Word      | 171,000+   | Short           | No (OOV problem)
Subword   | ~50,000    | Medium          | Yes (splits into parts)

Key Insight

The choice of chunking determines generation speed, vocabulary size, and the model's ability to handle new words. Subword tokenization is the sweet spot used by most modern LLMs.

Word-Level Tokenization and the OOV Problem

The traditional approach: split a sentence by spaces and punctuation, and every distinct word becomes a vocabulary entry. Each unique word is assigned a unique, permanent integer ID:

"the" -> 976
"cat" -> 9059
"." -> 13

These mappings are fixed. Once you assign "cat" to 9059, it never changes.

When the model generates a response, it outputs numbers, and we need to turn those numbers back into readable text. This is called detokenization (or decoding). If the model outputs [976, 9059], the detokenizer looks up the dictionary and prints "the cat".
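
A word-level tokenizer and its detokenizer amount to two dictionary lookups. The sketch below assumes a toy three-entry vocabulary built from the illustrative IDs above; the helper names `tokenize`, `encode`, and `decode` are made up for this example.

```python
import re

# Toy vocabulary using the illustrative IDs from the text.
vocab = {"the": 976, "cat": 9059, ".": 13}
id_to_token = {i: t for t, i in vocab.items()}

def tokenize(text):
    # Split into words and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(text):
    return [vocab[token] for token in tokenize(text)]

def decode(ids):
    # Naive detokenizer: joins with spaces, so punctuation spacing is lossy.
    return " ".join(id_to_token[i] for i in ids)

ids = encode("the cat.")
print(ids)                  # [976, 9059, 13]
print(decode([976, 9059]))  # the cat
```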

English has over 171,000 words in common use. Including every variation ("run", "runs", "ran", "running") can push the count into the millions, and a vocabulary that large is too computationally expensive. So we usually limit the vocabulary size (e.g., the 50,000 most common words). But this creates a major issue:

The "Out of Vocabulary" Problem

When the model encounters a word not in its vocabulary, it gets replaced with a special <|unk|> (Unknown) token. Imagine asking a model about "pneumonoultramicroscopicsilicovolcanoconiosis" and it sees only <|unk|>. The meaning is completely lost. If your text is full of <|unk|> tokens, the model is effectively blind.
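
The information loss is easy to see in code. This is a small sketch with a hypothetical six-entry vocabulary in which ID 0 is reserved for `<|unk|>`; the IDs are invented for the example.

```python
UNK = "<|unk|>"

# Hypothetical tiny vocabulary; ID 0 is reserved for the unknown token.
vocab = {UNK: 0, "what": 1, "is": 2, "a": 3, "cat": 4, "?": 5}

def encode(words):
    # Any word missing from the vocabulary collapses to the same <|unk|> ID.
    return [vocab.get(w, vocab[UNK]) for w in words]

known = encode(["what", "is", "a", "cat", "?"])
rare = encode(["what", "is", "pneumonoultramicroscopicsilicovolcanoconiosis", "?"])
other = encode(["what", "is", "xylophone", "?"])

print(known)          # [1, 2, 3, 4, 5]
print(rare)           # [1, 2, 0, 5]
print(rare == other)  # True: two different questions look identical to the model
```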

OOV Problem = Like a phrasebook for travel. If the phrase you need isn't in the book, you can only say "I don't know this word" -- no matter how important the word is.

Special Tokens: Traffic Signs for the Model

The model needs help navigating the text. We add Special Tokens to act as traffic signs:

<|bos|> (Beginning of Sequence): Marks the start of a sequence. Tells the model: "A new text begins here."
<|eos|> (End of Sequence): Marks the end of a sequence. When the model emits it, generation stops.
<|pad|> (Padding): Models process sentences in batches. Padding fills short sentences to match the longest one so computation works with neat rectangles.
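
How the three tokens combine in a batch can be sketched as follows. The function name `build_batch` is hypothetical, and the 0/1 mask is an addition not discussed in the text: it is a common way to tell the model which positions are padding.

```python
BOS, EOS, PAD = "<|bos|>", "<|eos|>", "<|pad|>"

def build_batch(sentences):
    # Wrap each tokenized sentence in start/end markers.
    wrapped = [[BOS, *tokens, EOS] for tokens in sentences]
    longest = max(len(seq) for seq in wrapped)
    # Right-pad shorter sequences so every row has the same length.
    padded = [seq + [PAD] * (longest - len(seq)) for seq in wrapped]
    # Mask (assumption): 1 for real tokens, 0 for padding.
    mask = [[0 if tok == PAD else 1 for tok in seq] for seq in padded]
    return padded, mask

batch, mask = build_batch([["the", "cat", "sat"], ["hello"]])
for row, m in zip(batch, mask):
    print(row, m)
```

Both rows come out with five entries, so the batch forms the "neat rectangle" that matrix operations need.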

ChatML Format

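Models like GPT-4 use ChatML to structure multi-turn conversations. Each message is wrapped in delimiter tokens that mark its role and its boundaries. The exact delimiters vary between models; the variant shown here separates role and content with <|im_sep|>: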

<|im_start|>system<|im_sep|>You are a helpful assistant.<|im_end|>
<|im_start|>user<|im_sep|>What is tokenization?<|im_end|>
<|im_start|>assistant<|im_sep|>

Pattern: <|im_start|>role<|im_sep|>content goes here<|im_end|>
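
The pattern can be applied mechanically to a list of messages. The helper below is a hypothetical sketch (`to_chatml` and its `add_generation_prompt` parameter are names invented here), and it builds a plain string; a real tokenizer would map each special token to a single reserved ID.

```python
def to_chatml(messages, add_generation_prompt=True):
    # Follows the pattern: <|im_start|>role<|im_sep|>content<|im_end|>
    lines = [
        f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn and leave it unfinished for the model to complete.
        lines.append("<|im_start|>assistant<|im_sep|>")
    return "\n".join(lines)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is tokenization?"},
])
print(prompt)
```

The final line is left open on purpose: the model continues from there and ends its turn by producing <|im_end|>.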

Key Insight

Special tokens are invisible to you but essential for the model. They tell the model where conversations begin, end, and who is speaking. Without them, the model can't distinguish your question from its own answer.

Byte Pair Encoding (BPE): The Key Algorithm

We want the brevity of word tokenization together with the flexibility of character tokenization, so that no word ever collapses into <|unk|>. The answer is Byte Pair Encoding (BPE), originally a data compression technique (Philip Gage, 1994), adapted for NLP by Rico Sennrich et al. in 2016.

Think first
If you started with individual characters like [l, o, w, e, r] and could repeatedly merge the most common pair into a single token, which pairs would you merge first in the text "lower lowest"?

How BPE works:

1. Start with individual characters.
2. Count which pairs of adjacent tokens appear most frequently.
3. Merge the most frequent pair into a new token.
4. Repeat until reaching the target vocabulary size (~50,000 tokens).
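
The four steps above can be sketched in a few dozen lines. This is a simplified illustration, not a production tokenizer: it splits the corpus on spaces, ignores end-of-word markers, and breaks ties by taking the first pair encountered. The function names are made up for this example.

```python
from collections import Counter

def get_pair_counts(words):
    # words maps a tuple of symbols to how often that word occurs.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    # Step 1: start with individual characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)        # Step 2: count adjacent pairs.
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # ties: first pair encountered wins
        merges.append(best)
        words = merge_pair(words, best)       # Step 3: merge the top pair.
    return merges, words                      # Step 4: loop until the budget is used.

merges, words = train_bpe("lower lowest", num_merges=3)
print(merges)       # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(list(words))  # [('lowe', 'r'), ('lowe', 's', 't')]
```

On "lower lowest" the shared prefix is merged first, because the pairs l+o, o+w, and w+e each occur twice while every other pair occurs once.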

Standard BPE Weakness

Standard BPE relies on a fixed list of base characters from training data. Rare symbols or new emojis may still cause <|unk|> tokens.

Byte-Level BPE

Byte-level BPE, used by GPT-2 and GPT-3, starts not with characters but with a base vocabulary of the 256 possible byte values. Since computers store everything as bytes (numbers 0-255), any text, code, or symbol can be represented. This removes the need for an <|unk|> token.
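
The claim that 256 base entries cover any input can be checked directly. The sketch below shows only the base vocabulary, before any merges are learned; the helper names are hypothetical, and real implementations add further steps on top of this.

```python
def to_byte_tokens(text):
    # Every string has a UTF-8 encoding, and every byte is a value in 0..255,
    # so a base vocabulary of 256 entries covers any input.
    return list(text.encode("utf-8"))

def from_byte_tokens(byte_ids):
    return bytes(byte_ids).decode("utf-8")

samples = ["cat", "naïve", "猫", "\U0001F408"]  # the last one is the cat emoji
for s in samples:
    ids = to_byte_tokens(s)
    assert all(0 <= b <= 255 for b in ids)
    assert from_byte_tokens(ids) == s  # lossless round trip
    print(repr(s), ids)
```

Rare characters simply cost more tokens: the emoji takes four byte tokens where "cat" takes three for the whole word.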

Key Insight

Byte-level BPE is the breakthrough: by starting from raw bytes, no input is ever "unknown". In the worst case, unknown sequences are broken down into raw bytes. This is why GPT can handle any language, emoji, or code snippet.

Other Approaches

Algorithm     | Used By      | Key Difference
--------------|--------------|--------------------------------------------------------------
BPE           | GPT-2, GPT-3 | Merges by frequency (most popular pair)
WordPiece     | BERT         | Merges by probability (best fit for prediction)
SentencePiece | Llama, T5    | Treats text as raw stream including spaces. Language-agnostic.

SentencePiece is notable because it treats text as a raw stream of characters, including spaces, rather than assuming words are already separated. This makes it language-agnostic -- it works for English as well as for languages such as Chinese or Japanese that do not put spaces between words. When using the Unigram algorithm, SentencePiece works in reverse: it starts with a large vocabulary and prunes the tokens that contribute least.
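
The difference between the BPE and WordPiece rows of the table can be illustrated with a toy comparison. The counts below are invented, and the WordPiece score used here (pair count divided by the product of the two symbols' counts) is one commonly described formulation, taken as an assumption rather than a definitive specification.

```python
from collections import Counter

# Hypothetical counts for illustration only.
symbol_counts = Counter({"t": 1000, "h": 800, "q": 20, "u": 25})
pair_counts = Counter({("t", "h"): 300, ("q", "u"): 19})

def bpe_choice(pairs):
    # BPE: pick the most frequent pair.
    return max(pairs, key=pairs.get)

def wordpiece_score(pair):
    # Likelihood-style score: pair frequency normalized by its parts' frequencies.
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

def wordpiece_choice(pairs):
    return max(pairs, key=wordpiece_score)

print(bpe_choice(pair_counts))        # ('t', 'h')
print(wordpiece_choice(pair_counts))  # ('q', 'u')
```

With these numbers, BPE prefers "th" because it is frequent, while the normalized score prefers "qu" because "q" is almost always followed by "u".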

The Complete Pipeline

Let's trace "The cat is playing." through GPT's tokenizer:

1. Original text: "The cat is playing."
2. Tokenization (gpt-4o): The | cat | is | playing | .
3. Token to ID mapping: 976 | 9059 | 382 | 8252 | 13
4. Generation & Detokenization: Reverse the process: IDs -> tokens -> text
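
The round trip can be sketched with a lookup table built from the trace above. This is a toy stand-in, not the actual gpt-4o tokenizer: it assumes the text is already split into the five tokens shown and ignores how real tokenizers store leading spaces.

```python
# Token strings and IDs copied from the trace above.
tokens = ["The", "cat", "is", "playing", "."]
ids = [976, 9059, 382, 8252, 13]

token_to_id = dict(zip(tokens, ids))
id_to_token = dict(zip(ids, tokens))

def encode(token_list):
    return [token_to_id[t] for t in token_list]

def decode(id_list):
    words = [id_to_token[i] for i in id_list]
    text = " ".join(words)
    # Re-attach punctuation that the naive join separated.
    return text.replace(" .", ".")

encoded = encode(tokens)
print(encoded)          # [976, 9059, 382, 8252, 13]
print(decode(encoded))  # The cat is playing.
```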

In GPT-3, the vocabulary contained 50,257 tokens.

Interactive: Live Tokenizer Demo

Type anything and see how different tokenization strategies break it down. Try words like "ChatGPT", "pneumonia", emojis, or code snippets.

Try these experiments:

- Type "ChatGPT" -- it becomes ["Chat", "GPT"]. Common compound words split at natural boundaries.
- Type "Understanding" -- it becomes ["Under", "standing"]. Subword splits tend to follow meaningful parts of a word.
- Type "asdfjkl" -- nonsense gets split into very small pieces, down to individual characters. The fallback works!
- Switch between modes to see how character-level creates many more tokens for the same text.

Vocab size tradeoff = Like a dictionary. Too few entries = you can't express many ideas. Too many entries = the book is enormous and expensive to search. ~50,000 tokens is the sweet spot GPT-2 and GPT-3 settled on; some newer models use larger vocabularies.

What's Missing?

The number 3797 ("cat" in this module's opening example) is just an arbitrary ID, and a different tokenizer assigns a different one (9059 in the gpt-4o trace above). It doesn't capture the essence of a cat. The number 3797 is no more "cat-like" than 42 or 99999. There is nothing in the number that tells the model cats are furry, have four legs, or are related to dogs.

To fix this, we need Embeddings -- a way to turn those arbitrary IDs into rich vectors that capture meaning. That's the next module.

Concept Map: Where We Are
Text -> Tokenize -> Token IDs -> Embeddings -> Attention -> Layers -> Prediction

Covered in this module: Text -> Tokenize -> Token IDs. Coming in the next modules: Embeddings, Attention, Layers, and Prediction.

Check Your Understanding

1. Why do LLMs need tokenization?
Answer: Neural networks are mathematical systems -- they can only multiply, add, and compare numbers. Tokenization converts text into numerical IDs the model can process.
2. What is the main advantage of subword tokenization over word-level tokenization?
Answer: Subword tokenization can represent words it has never seen. If it hasn't seen "pneumonoultramicro...", it splits the word into smaller pieces like "pn", "eum", "ono", etc. This largely removes the OOV problem, and byte-level BPE removes it entirely.
3. How does Byte-level BPE differ from standard BPE?
Answer: Standard BPE starts with characters from training data -- rare symbols might be missing. Byte-level BPE starts with all 256 byte values, so any input can be represented. This is what GPT-2/3 use.
4. What does the <|pad|> special token do?
Answer: Models process batches of sentences in parallel. Since sentences have different lengths, shorter ones get padded with <|pad|> tokens to create uniform-length rectangles for efficient matrix operations.

Teach It Back

Explain to a friend who knows nothing about AI: What is tokenization, why is it needed, and why do modern LLMs use subword tokenization instead of whole words or individual characters? Use your own words.

Flashcards

What is tokenization?
The process of converting text into numbers (token IDs) that a neural network can process. Text is chunked into tokens and each token is mapped to a unique integer ID.

What is the OOV (Out of Vocabulary) problem?
When using word-level tokenization with a limited vocabulary, words not in the vocabulary get replaced with a special <|unk|> token, losing all meaning. This is a critical flaw of word-level approaches.

How does BPE (Byte Pair Encoding) work?
Start with individual characters. Count the most frequent adjacent pair. Merge it into a new token. Repeat until reaching the target vocabulary size (~50,000). This builds tokens bottom-up from characters to common subwords and words.

Why does Byte-level BPE eliminate the <|unk|> token?
Because it starts with all 256 possible byte values as its base vocabulary. Since all data is stored as bytes, any text, emoji, or symbol can always be represented -- at worst, by its raw byte values.

What are the three main special tokens?
<|bos|>: Beginning of sequence (start of text). <|eos|>: End of sequence (stop generating). <|pad|>: Padding (fill short sentences to uniform length for batch processing).

How does WordPiece differ from BPE?
BPE merges pairs based on frequency (most popular pair). WordPiece merges pairs based on probability (which merge best improves the language model's predictions). WordPiece is used by BERT.

Module 2 Complete

Next up: The Embedding Layer -- how token ID 3797 becomes a rich vector that captures the meaning of "cat".

Synthesis question to think about: If two words always appear in similar contexts during training, should they have similar or different token IDs? Does the ID number itself matter?
