How can multiplication create poetry?

Think first
ChatGPT writes poems, translates languages, and explains quantum physics. But inside, it's just multiplying and adding numbers. No rules about grammar. No database of facts. How is that possible?

Type anything into ChatGPT right now. Ask it to write a poem about pizza, and within seconds, you receive something creative, coherent and completely original. This feels like magic, doesn't it? But here's what actually happened in those brief moments: millions of numbers danced through mathematical operations, each one carefully orchestrated by something called a Transformer.

ChatGPT was never given a dictionary entry for pizza or a recipe for how to write poetry. No one hard-coded those rules, and it isn't searching a database of poems. Instead, there are just numbers being multiplied and added together, billions of times.

The Central Question

How can multiplication and addition create poetry? That's the question this module will answer. By the end, you'll understand the full journey from raw text to generated language -- and why it all comes down to predicting one word at a time.

Large Language Models (LLMs) are essentially deep learning applied to text. The "Large" refers to their scale: billions of parameters trained on massive amounts of text. Despite the apparent magic, they don't truly understand language the way humans do. They do something simpler but incredibly powerful: they predict the next word in a sequence.

Key Insight

LLMs do one thing: predict the next word. That's it. But to predict well, they must implicitly learn grammar, facts, reasoning, and style. All the "intelligence" emerges from this single, simple objective applied at massive scale.

LLM = The world's best autocomplete. Your phone keyboard predicts "morning" after "good" -- an LLM does the same thing, but with billions of parameters and trillions of training examples.

From Rules to Learning

Before we continue, let's take a step back to understand how we actually got here. Artificial intelligence (AI) usually refers to any system that mimics human intelligence. What sets it apart from standard software is its ability to capture and replicate human heuristics -- mental shortcuts and rules of thumb -- something impossible for static, rule-based systems.

The Spam Detection Problem

Let's take the classic spam detection problem as an example. To catch spam, programmers used to write explicit rules: "If an email contains the word lottery, mark it as spam."

But what happens if the malicious sender changes the format? What do you do when they write l o t t e r y, or luttery? You could write a new rule for every variation, but there are infinite ways a sender can try to trick you. You would need to write infinite rules in response.

And that's just for the word "lottery." What happens when they switch to the "Nigerian Prince" scam? Or what if a real lottery notifies a winner? You wouldn't want to block that. Attempting to capture every possibility with manual rules is impossible. It simply doesn't scale.

Machine Learning Changed the Approach

Instead of writing rules, we show a model thousands of examples. We feed it spam emails and legitimate ones, and it learns to recognize the patterns itself. But there was still a problem. In early Machine Learning, humans still had to decide what features mattered. Someone had to format the data and tell the computer: "Look for specific keywords," or "Look at the sender's address." If the data was too complex -- like an image or a nuanced paragraph of text -- the model would struggle because humans couldn't easily define the rules for it to learn.

What's even worse is that the model was also limited by its capacity. If you asked a simple model to recognize a lottery scam, a Nigerian Prince scam, and a phishing link all at once, it might fail to separate them effectively. It's like being asked to build a perfect square using only three sticks, which is impossible. Sometimes, the tool just isn't complex enough for the task.

The Deep Learning Breakthrough

This is where Deep Learning comes in. Deep Learning relies on the same fundamental mechanism as machine learning, but with a unique advantage: it uses incredibly large, multi-layered neural networks. Deep learning has existed since the 1980s, but it didn't work well initially due to a lack of computing power (specifically, powerful GPUs).

But in 2012, that changed. A deep learning system called AlexNet won the ImageNet competition -- a competition to classify a huge database of images -- by a massive margin. It succeeded because the creators didn't have to manually tell the model what a "cat" or a "car" looked like. The model learned the key features all by itself.

When researchers looked under the hood, they found something remarkable:

AlexNet's Layers -- What the Model Discovered on Its Own

First layers detected simple edges. The next layers combined edges into shapes (like circles or corners). Deeper layers combined those shapes to recognize entire object silhouettes (like a wheel or an ear). Each layer built on the previous one. No human told AlexNet to look for "edges" or "circles." It discovered these hierarchical features on its own, simply by stacking layers of processing.

This is where Deep Learning truly shines: it is complex enough to find correlations humans might not even know exist.

Interactive: The Evolution of AI

Click a stage to explore it.

What goes wrong without learning

In 2004, a major email provider relied on hand-written spam rules. Spammers figured out the rules and started writing "l.o" instead of "lottery", "fr33" instead of "free". Every new trick required a new rule. The team was writing 50+ rules per week and still losing. The system was replaced by a machine learning classifier that learned patterns automatically.

Neural Networks: Learning by Adjusting Knobs

All deep learning models, including LLMs, are built on neural networks. And here's the unifying idea: neural networks are function approximators. Every task can be framed as learning some function f(x) = y:

Task | Input x | Output y
Spam detection | email text | spam or not
Image classification | pixels | "cat" or "dog"
Movie recommendation | your preferences | like or dislike
Next word prediction | previous words | next word

We don't know the "true" function that perfectly maps inputs to outputs. The neural network approximates it by learning from examples. It takes numbers in, applies learned operations, and outputs numbers. But humans communicate in words, not numbers -- so we need to convert data to numbers first.

Let's build intuition with a simple example: a movie preference predictor. Imagine predicting if you'll enjoy a movie based on two factors: action (0 to 1) and comedy (0 to 1).

Movie | Action | Comedy | Liked?
Movie 1 | 0.1 | 0.9 | Yes (comedy fan)
Movie 2 | 0.8 | 0.2 | No (too much action)
Movie 3 | 0.3 | 0.7 | Yes (comedy with some action)

The Components

Neurons: Containers that hold numbers. Our input layer has 2 neurons (action score, comedy score). Our output layer has 2 neurons (probability of "like", probability of "dislike").

Parameters: The adjustable numbers inside the network. Think of them as millions of tiny knobs that get tuned during training. These include:

- Weights: Numbers that multiply each input (like importance scores).
- Biases: Numbers added at the end (like baseline adjustments).

Weights = Knobs on a mixing board. Each one controls how much a particular input influences the output. Training = finding the right knob positions.

Forward Pass: The Calculation

Let's trace through one example. Movie 1 has action = 0.1 and comedy = 0.9.

Step 1: Random initialization. The network starts with random weights and biases because it hasn't learned anything yet:

Weight for action -> like: 0.5
Weight for comedy -> like: -0.3
Bias for like: 0.1
Weight for action -> dislike: -0.2
Weight for comedy -> dislike: 0.6
Bias for dislike: 0.2

Step 2: Calculate the "like" neuron.

like = (action * weight) + (comedy * weight) + bias
like = (0.1 * 0.5) + (0.9 * -0.3) + 0.1
like = 0.05 + (-0.27) + 0.1
like = -0.12

Step 3: Calculate the "dislike" neuron.

dislike = (action * weight) + (comedy * weight) + bias
dislike = (0.1 * -0.2) + (0.9 * 0.6) + 0.2
dislike = -0.02 + 0.54 + 0.2
dislike = 0.72

Each output is a linear transformation of the inputs: multiply each input by its weight, sum the products, and add the bias.

Step 4: Convert to probabilities. We use a function called Softmax to turn these numbers into probabilities (numbers between 0 and 1 that sum to 1):

probability_like = 0.30 (30%)
probability_dislike = 0.70 (70%)

The network predicts we won't like Movie 1 (70% chance of dislike). But we actually did like it! The network is wrong. What we just calculated is called a forward pass.

Step 5: Measuring the Error. The network was wrong, but we need to measure exactly how wrong. This measurement is called the loss.

We wanted: "like" = 1.0 (100%), "dislike" = 0.0 (0%)
We got:    "like" = 0.30,      "dislike" = 0.70

Loss = (1.0 - 0.30)^2 + (0.0 - 0.70)^2
Loss = (0.70)^2 + (0.70)^2
Loss = 0.49 + 0.49 = 0.98

The bigger the loss, the more wrong we were. Our goal is to make this number as small as possible.
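Steps 2 through 5 can be reproduced in a few lines of Python (a minimal sketch for intuition, not how real frameworks implement it):

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Movie 1: action = 0.1, comedy = 0.9, with the randomly
# initialized weights and biases from Step 1
action, comedy = 0.1, 0.9
like = (action * 0.5) + (comedy * -0.3) + 0.1      # -0.12
dislike = (action * -0.2) + (comedy * 0.6) + 0.2   # 0.72

p_like, p_dislike = softmax([like, dislike])

# Squared-error loss against the true answer ("like" = 1.0)
loss = (1.0 - p_like) ** 2 + (0.0 - p_dislike) ** 2
print(round(p_like, 2), round(p_dislike, 2), round(loss, 2))
```

Softmax exponentiates each score and divides by the total, so larger scores get larger shares of the probability.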

Step 6: Adjusting the Parameters. Now comes the clever part. We need to adjust our weights and biases to reduce the loss. But which way should we adjust them? We calculate something called Gradients. A gradient tells us: "If I increase this weight by a tiny amount, how much does the loss change?"

For example: Current comedy -> like weight: -0.3. The gradient calculation tells us increasing this weight decreases the loss. So we increase it slightly: -0.3 + 0.01 = -0.29.

This process is called Backpropagation because we work backward from the loss (output) back to the weights (input).
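The question a gradient answers -- "if I nudge this weight, does the loss go up or down?" -- can be checked numerically (a sketch of the gradient's meaning; real backpropagation computes the same quantity analytically, in one backward sweep):

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(w_comedy_like):
    """Squared-error loss for Movie 1, as a function of one weight."""
    action, comedy = 0.1, 0.9
    like = (action * 0.5) + (comedy * w_comedy_like) + 0.1
    dislike = (action * -0.2) + (comedy * 0.6) + 0.2
    p_like, p_dislike = softmax([like, dislike])
    return (1.0 - p_like) ** 2 + (0.0 - p_dislike) ** 2

# Nudge the comedy -> like weight up by a tiny amount and
# see how the loss responds
eps = 1e-5
gradient = (loss(-0.3 + eps) - loss(-0.3)) / eps
print(gradient)  # negative: increasing this weight decreases the loss
```

A negative gradient tells us to increase the weight, exactly as the example above concluded.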

Now try it yourself -- adjust the sliders below and watch the forward pass happen in real time:

Interactive: Forward Pass Calculator

You're predicting if you'll like a movie. Adjust the action and comedy scores and watch the neural network calculate in real-time.

Key Insight

The network doesn't "understand" comedy or action. It just learned that when input 2 (comedy) is high, output "like". Pure math, zero understanding. But the math works.

The Training Loop: How Networks Learn

Think first
The network above predicted "Dislike" for a comedy movie, but the correct answer was "Like". How would you fix the weights? Which knobs would you turn, and in which direction?

The training loop repeats 5 steps, millions of times:


1. Forward pass: Input goes through the network, produces a prediction.

2. Calculate loss: Compare the prediction to the correct answer. Loss = (expected - actual)^2. The bigger the number, the more wrong we were.

3. Backpropagation: Calculate gradients for each parameter. For each weight, answer: "if I nudge this weight up by a tiny amount, does the loss go up or down?"

4. Update weights: Adjust each parameter slightly to reduce loss.

5. Repeat: Do this for every training example.

One complete pass through all training data is called an Epoch. After many epochs, the weights converge to values that reflect what the network has learned. Here's what our movie predictor's weights might look like after training:

Weight for action -> like:  -2.1 (strong negative)
Weight for comedy -> like:  3.2 (strong positive)
Weight for action -> dislike: 2.8 (strong positive)
Weight for comedy -> dislike: -2.9 (strong negative)

Look at what happened! The network has learned that you love comedy and dislike action. It didn't read your mind -- it discovered this preference through pure mathematics. The comedy -> like weight became strongly positive (3.2), while the action -> like weight became strongly negative (-2.1). The network learned your preferences without any explicit rules -- just by repeatedly calculating, measuring error, and nudging weights.
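The whole loop, run on the three movies from earlier, fits in a short script. This is a toy sketch: gradients are estimated numerically by nudging each weight, where backpropagation would compute them analytically and far more efficiently.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# (action, comedy) -> 1 if liked, 0 if not
data = [((0.1, 0.9), 1), ((0.8, 0.2), 0), ((0.3, 0.7), 1)]

# [w_action_like, w_comedy_like, b_like, w_action_dislike, w_comedy_dislike, b_dislike]
params = [0.5, -0.3, 0.1, -0.2, 0.6, 0.2]

def total_loss(p):
    loss = 0.0
    for (action, comedy), liked in data:
        like = action * p[0] + comedy * p[1] + p[2]
        dislike = action * p[3] + comedy * p[4] + p[5]
        p_like, p_dislike = softmax([like, dislike])
        loss += (liked - p_like) ** 2 + ((1 - liked) - p_dislike) ** 2
    return loss

# The training loop: forward pass, loss, gradients, update, repeat
lr, eps = 0.5, 1e-5
for epoch in range(2000):
    base = total_loss(params)
    grads = []
    for i in range(len(params)):
        nudged = params[:]
        nudged[i] += eps                      # nudge one knob slightly
        grads.append((total_loss(nudged) - base) / eps)
    params = [p - lr * g for p, g in zip(params, grads)]

print("action->like:", round(params[0], 2), "comedy->like:", round(params[1], 2))
```

After training, the comedy -> like weight ends up larger than the action -> like weight, mirroring the learned preferences described above (the exact values depend on the learning rate and epoch count).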

Training = Learning to throw darts blindfolded. Someone tells you "too far left" or "too high" after each throw. After thousands of throws, you hit the bullseye consistently -- without ever seeing the board.

What goes wrong without enough training

Early in training, the network might learn only that "words exist." Further along, basic grammar. Much further, common sense like "cats sit on mats." GPT-3 saw roughly 500 billion tokens during training. If you stopped at 5 billion, it would write grammatically correct nonsense with no understanding of the world.

From Movies to Language

What if instead of predicting movie preferences, we predict the next letter in a sequence? The same neural network approach works. Let's take the phrase: "The cat sat o_"

First, we need to convert this into numbers. In modern systems, we use vectors (lists of numbers), but for this simple example, let's imagine we assign a unique ID to every character:

Position 1: "T" -> ID = 20
Position 2: "h" -> ID = 8
Position 3: "e" -> ID = 5
Position 4: " " -> ID = 0
Position 5: "c" -> ID = 3
...and so on

So we have 13 input neurons (one for each character position). Our output layer has 27 neurons: one for each letter A-Z, plus one for space.

After training on millions of English sentences, the network learns patterns. When it sees "...at o", the weights result in the 'n' neuron getting the highest value.

Generating Multiple Characters

Here's the key trick: we feed predictions back into the network, sliding our window forward:

1. "The cat sat o" -> predict 'n'
2. "he cat sat on" -> predict ' ' (space)
3. "e cat sat on " -> predict 't'
4. " cat sat on t" -> predict 'h'
5. ...and so on, generating "The cat sat on the mat"
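This feed-back loop can be demonstrated with a trivial stand-in for the trained network: instead of learned weights, just count which character followed each 13-character window in some training text (a toy sketch of the generation loop, not of the network itself):

```python
from collections import Counter, defaultdict

text = "the cat sat on the mat. a cat sat on the mat."
k = 13  # context length: how many characters the "model" sees at once

# "Train": count which character follows each 13-character window
counts = defaultdict(Counter)
for i in range(len(text) - k):
    counts[text[i:i + k]][text[i + k]] += 1

def predict(context):
    """Most likely next character for a 13-character context."""
    return counts[context].most_common(1)[0][0]

# Generation loop: predict a character, append it, slide the window forward
generated = "the cat sat o"
for _ in range(10):
    generated += predict(generated[-k:])
print(generated)  # the cat sat on the mat.
```

A real LLM replaces the lookup table with a neural network and characters with tokens, but the outer loop -- predict, append, slide -- is exactly the same.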

This window is called the context length: how much text the model can "see" at once. Modern models use tokens (words or word pieces) instead of characters, and have massive context windows -- GPT-4 can see up to 128,000 tokens at once.

Now let's see the same idea at the word level, which is closer to how real LLMs work:

Interactive: Next-Word Prediction

Watch the model predict one word at a time. Click "Next Step" to advance.

The cat sat on the ___
"mat" 30% | "floor" 25% | "chair" 20% | "couch" 15% | "bed" 10%
Key Insight

The model doesn't "know" English. It learned that after "sat on the", the token "mat" appeared very often in training data. It's pattern matching at a cosmic scale -- and it works.

Tokens: Breaking Text into Pieces

Models don't read words. They read tokens -- pieces of text that might be a word, part of a word, or punctuation. This is the first step in the pipeline.

Interactive: Tokenizer

Type anything and see how a model breaks it into tokens.


Try typing "ChatGPT" -- it becomes ["Chat", "GPT"]. Or "Understanding" -- ["Under", "standing"]. Common words stay whole, rare words get split.

Tokens = LEGO bricks. Common words are single bricks. Rare words are built from smaller bricks snapped together. The model's vocabulary is its LEGO set.
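The splitting idea can be sketched with a greedy longest-match tokenizer. Real tokenizers learn their vocabulary from data (byte-pair encoding); this toy vocabulary is invented for illustration:

```python
# Toy vocabulary -- real models learn ~50,000+ entries from data;
# these are hand-picked just to show the splitting behavior
vocab = ["Chat", "GPT", "Under", "standing", "stand", "ing",
         "C", "G", "P", "T", "U", "a", "h", "t", "n", "d", "e", "r", "s", "i", "g"]

def tokenize(text):
    """Greedy longest-match: always take the longest vocab entry that fits."""
    tokens = []
    while text:
        match = max((v for v in vocab if text.startswith(v)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

print(tokenize("ChatGPT"))        # ['Chat', 'GPT']
print(tokenize("Understanding"))  # ['Under', 'standing']
```

Because "standing" is in the vocabulary, the tokenizer prefers it over the shorter "stand" + "ing": common chunks stay whole, and anything else falls back to smaller pieces.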

Building Large Language Models

Now that we understand neural networks, training, and next-word prediction, let's see how these pieces come together to build something like ChatGPT. There are two main phases.

Phase 1: Pre-training

Pre-training is where the model learns language from scratch. We feed it enormous amounts of text from the internet -- books, Wikipedia, news articles, forums, and websites. We don't need humans to label this data. The text itself provides the labels through self-supervised learning.

Every piece of text can become millions of training examples:

Input: "The cat sat on the" -> Label: "mat"
Input: "The cat sat on"     -> Label: "the"
Input: "The cat sat"        -> Label: "on"
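Turning raw text into these training pairs is purely mechanical (a word-level sketch; real pipelines slide over tokens instead of words):

```python
text = "The cat sat on the mat"
words = text.split()

# Every prefix of the text becomes an input; the word that follows is its label
pairs = [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]
for inp, label in pairs:
    print(f"Input: {inp!r} -> Label: {label!r}")
```

A six-word sentence yields five training examples; no human labeling is needed, which is why this is called self-supervised learning.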

After pre-training, we get a Base Model. GPT is one such model:

Generative -- It generates text.
Pre-trained -- It learned from massive unlabeled text.
Transformer -- The architecture it uses.

The Base Model Problem

This base model is smart but unruly. If you ask it "What is the capital of France?", it might answer "And what is the capital of Germany?" because it thinks it is generating a quiz. It learned to predict text continuations -- not to follow instructions or be helpful. It's a brilliant but unfocused student.

Phase 2: Fine-tuning

Base models know a lot but don't know how to be helpful. Fine-tuning teaches them to be useful assistants.

Instruction Fine-tuning: We train on thousands of instruction-response pairs. "Summarize this article" -> [good summary]. "Translate to French" -> [correct translation]. The model learns to follow instructions rather than just continuing text.

Classification Fine-tuning: We can also train the model to categorize text -- sentiment analysis, topic classification, etc.
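Instruction data is commonly stored as simple instruction-response records, for example as JSON lines. The field names below are illustrative, not a standard; each project picks its own schema:

```python
import json

# Hypothetical instruction-response pairs for fine-tuning
examples = [
    {"instruction": "Summarize this article",
     "input": "<article text>",
     "response": "<good summary>"},
    {"instruction": "Translate to French",
     "input": "Hello",
     "response": "Bonjour"},
]

# One JSON object per line ("JSONL"), a common format for training data
for ex in examples:
    print(json.dumps(ex))
```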

The Transformer Architecture

Most modern LLMs rely on the transformer architecture. The original transformer, introduced in 2017, had two parts:

Encoder: Takes input text and converts it into numerical representations that capture meaning and context. The "understanding" part.

Decoder: Takes those representations and generates output text. The "speaking" part.

Modern models typically use only one of these parts, depending on their training approach:

Masked Language Modeling (MLM) -- "Fill-in-the-blank" training. Uses only the encoder. BERT uses this approach:

Input: "The [MASK] sat on the mat" -> Predict: "cat"

Causal Language Modeling (CLM) -- Predicts the next word. Uses only the decoder. GPT uses this approach:

Input: "The cat sat" -> Predict: "on"

Key Insight

ChatGPT, Claude, and most modern AI assistants use Causal Language Modeling. They're decoder-only models. Before transformers (2017), language models processed text sequentially -- one word at a time. Transformers process all words simultaneously, with masking so each word only attends to previous words. This parallelism is what made training at massive scale possible.
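The masking mentioned above can be pictured as a lower-triangular matrix: position i may attend only to positions up to and including i (a schematic illustration, independent of any particular implementation):

```python
tokens = ["The", "cat", "sat", "on"]
n = len(tokens)

# mask[i][j] == 1 means token i may attend to token j
mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for tok, row in zip(tokens, mask):
    print(f"{tok:>4} attends to {row}")
```

"The" sees only itself, while "on" sees everything before it. Because every row can be computed at once, all positions train in parallel -- yet no token ever peeks at its own future.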

The Complete Pipeline

Here's the complete journey from your question to ChatGPT's answer:

The Transformer Pipeline

1. Tokenize: "The cat sat" -> ["The", " cat", " sat"]
2. Embed: Each token becomes a vector of 512-12,288 numbers capturing meaning.
3. Attention: Words share context: "sat" learns that "cat" is doing the sitting.
4. Transform (x96 layers): Each layer builds deeper understanding. Layer 1: grammar. Layer 50: meaning. Layer 96: reasoning.
5. Predict: Output probabilities for every word in the vocabulary. Pick one. Add it to the input. Repeat.

These vectors flow through two main types of processing blocks: Attention Blocks (where words share context with each other) and MLP Blocks (which refine understanding independently). Together, these make up each transformer layer.

The base model is good at autocomplete. Fine-tuning on conversation examples creates the assistant experience that we know as ChatGPT, Claude, and others.

Concept Map: Where We Are
Text -> Tokens -> Embeddings -> Attention -> Layers -> Prediction -> Text

Purple = covered in this module. Gray = coming in next modules.

Check Your Understanding

1. What is the core training objective of GPT-style models?
Answer: Next-token prediction, nothing more. Grammar, facts, and reasoning all emerge as side effects of getting better at this single objective.

2. What happens during backpropagation?
Answer: The error is traced backward through the network, computing each weight's contribution to it (the gradient); each weight is then nudged to reduce the error.

3. Why is a "base model" (pre-trained only) not the same as ChatGPT?
Answer: A base model is a powerful text predictor but acts like a brilliant, unfocused student. It needs fine-tuning on instruction-response pairs to become a helpful assistant.

4. What is a "token"?
Answer: The basic unit of text the model works with. "ChatGPT" becomes ["Chat", "GPT"]. Common words stay whole; rare words get split into sub-words.

Teach It Back

Explain to a friend who knows nothing about AI: How does ChatGPT generate a response when you ask it a question? Use your own words. Don't worry about being perfect -- the goal is to test what stuck.

An AI tutor will compare your explanation against the course material and give specific feedback.


Flashcards (click to flip)

What do neural networks fundamentally do?
Click to reveal
They approximate unknown functions by learning from examples. They take numbers in, apply learned weights and biases, and output numbers. They are function approximators: f(x) = y.
What is the "loss" in training?
Click to reveal
A number measuring how wrong the model's prediction was. Higher loss = more wrong. Training minimizes this number. Common formula: (expected - actual)^2.
What does GPT stand for, and what does each letter mean?
Click to reveal
Generative (generates text), Pre-trained (learned from massive unlabeled text), Transformer (the architecture it uses).
What is "backpropagation"?
Click to reveal
The algorithm that computes how much each weight contributed to the prediction error, working backward from the output to the input. It answers: "which knobs should I turn, and in which direction?"
Why can't a base model (pre-trained only) be used as a chatbot?
Click to reveal
A base model only learned to predict the next word. It might continue "What is the capital of France?" with "What is the capital of Germany?" because it thinks it's generating a quiz. It needs fine-tuning on instruction-response pairs to become helpful.
What are the two main phases of building an LLM?
Click to reveal
1. Pre-training: Self-supervised learning on trillions of tokens. Learns language, facts, reasoning. Produces a "base model."
2. Fine-tuning: Train on instruction-response pairs (+ RLHF) to become a helpful assistant.

Module 1 Complete

Next up: Tokenization -- how text becomes numbers, and why "ChatGPT" splits into ["Chat", "GPT"].

Synthesis question to think about: If the model only predicts the next token, how does it "know" to answer questions instead of just continuing text?

Course Home Next: Tokenization ->