Before we continue, let's take a step back to understand how we actually got here. Artificial intelligence (AI) usually refers to any system that mimics human intelligence. What sets it apart from standard software is its ability to capture and replicate human heuristics -- mental shortcuts and rules of thumb -- in a way that static, rule-based systems cannot.
Let's take the classic spam detection problem as an example. To catch spam, programmers used to write explicit rules: "If an email contains the word lottery, mark it as spam."
But what happens if the malicious sender changes the format? What do you do when they write l o t t e r y, or luttery? You could write a new rule for every variation, but there are infinite ways a sender can try to trick you. You would need to write infinite rules in response.
And that's just for the word "lottery." What happens when they switch to the "Nigerian Prince" scam? Or what if a real lottery notifies a winner? You wouldn't want to block that. Attempting to capture every possibility with manual rules is impossible. It simply doesn't scale.
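The brittleness described above is easy to see in code. Here's a minimal sketch of a rule-based filter -- the word list and helper function are invented for illustration, not taken from any real system:

```python
# A hand-written rule: "if the email contains a blocked word, it's spam."
# BLOCKED_WORDS is a hypothetical rule list for this sketch.
BLOCKED_WORDS = {"lottery", "free", "winner"}

def is_spam(email: str) -> bool:
    words = [w.strip("!.,") for w in email.lower().split()]
    return any(word in BLOCKED_WORDS for word in words)

print(is_spam("You won the lottery!"))        # True -- caught
print(is_spam("You won the l o t t e r y!"))  # False -- slips through
print(is_spam("You won the luttery!"))        # False -- slips through
```

Every evasion demands a new rule, and the rule set never stops growing.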
Instead of writing rules, we show a model thousands of examples. We feed it spam emails and legitimate ones, and it learns to recognize the patterns itself. But there was a catch: in early machine learning, humans still had to decide which features mattered. Someone had to format the data and tell the computer: "Look for specific keywords," or "Look at the sender's address." If the data was too complex -- like an image or a nuanced paragraph of text -- the model would struggle, because humans couldn't easily define useful features for it to learn from.
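That manual feature-engineering step might look like this -- the specific features here (keyword flags, a crude sender check) are invented for illustration:

```python
# Early ML: a human decides which features matter, then hands the
# resulting numbers to a simple classifier. The model never sees the
# raw email -- only these hand-picked numbers.
def extract_features(email_text: str, sender: str) -> list[float]:
    text = email_text.lower()
    return [
        1.0 if "lottery" in text else 0.0,            # keyword flag
        1.0 if "winner" in text else 0.0,             # keyword flag
        1.0 if not sender.endswith(".com") else 0.0,  # crude sender check
    ]

print(extract_features("You are a winner!", "scam@prizes.xyz"))  # [0.0, 1.0, 1.0]
```

The quality of the predictions is capped by the quality of the features a human thought to define.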
Worse still, the model was limited by its capacity. If you asked a simple model to recognize a lottery scam, a Nigerian Prince scam, and a phishing link all at once, it might fail to separate them effectively. It's like being asked to build a perfect square using only three sticks: the tool simply isn't complex enough for the task.
This is where Deep Learning comes in. Deep Learning relies on the same fundamental mechanism as machine learning, but with a unique advantage: it uses incredibly large, multi-layered neural networks. Deep learning has existed since the 1980s, but it didn't work well initially, largely due to a lack of computing power (specifically, the powerful GPUs that later made large-scale training practical).
But in 2012, that changed. A deep learning system called AlexNet won the ImageNet competition -- a competition to classify a huge database of images -- by a massive margin. It succeeded because the creators didn't have to manually tell the model what a "cat" or a "car" looked like. The model learned the key features all by itself.
When researchers looked under the hood, they found something remarkable:
- The first layers detected simple edges.
- The next layers combined edges into shapes (like circles or corners).
- Deeper layers combined those shapes to recognize entire object silhouettes (like a wheel or an ear).

Each layer built on the previous one. No human told AlexNet to look for "edges" or "circles." It discovered these hierarchical features on its own, through many layers of processing.
This is where Deep Learning truly shines: it is complex enough to find correlations humans might not even know exist.
Click a stage to explore it.
In 2004, a major email provider relied on hand-written spam rules. Spammers figured out the rules and started writing "l.o" instead of "lottery", "fr33" instead of "free". Every new trick required a new rule. The team was writing 50+ rules per week and still losing. The system was replaced by a machine learning classifier that learned patterns automatically.
All deep learning models, including LLMs, are built on neural networks. And here's the unifying idea: neural networks are function approximators. Every task can be framed as learning some function f(x) = y:
| Task | Input x | Output y |
|---|---|---|
| Spam detection | email text | spam or not |
| Image classification | pixels | "cat" or "dog" |
| Movie recommendation | your preferences | like or dislike |
| Next word prediction | previous words | next word |
We don't know the "true" function that perfectly maps inputs to outputs. The neural network approximates it by learning from examples. It takes numbers in, applies learned operations, and outputs numbers. But humans communicate in words, not numbers -- so we need to convert data to numbers first.
Let's build intuition with a simple example: a movie preference predictor. Imagine predicting if you'll enjoy a movie based on two factors: action (0 to 1) and comedy (0 to 1).
| Movie | Action | Comedy | Liked? |
|---|---|---|---|
| Movie 1 | 0.1 | 0.9 | Yes (comedy fan) |
| Movie 2 | 0.8 | 0.2 | No (too much action) |
| Movie 3 | 0.3 | 0.7 | Yes (comedy with some action) |
Neurons: Containers that hold numbers. Our input layer has 2 neurons (action score, comedy score). Our output layer has 2 neurons (probability of "like", probability of "dislike").
Parameters: The adjustable numbers inside the network. Think of them as millions of tiny knobs that get tuned during training. These include:
- Weights: Numbers that multiply each input (like importance scores).
- Biases: Numbers added at the end (like baseline adjustments).
Let's trace through one example. Movie 1 has action = 0.1 and comedy = 0.9.
Step 1: Random initialization. The network starts with random weights and biases because it hasn't learned anything yet:
Step 2: Calculate the "like" neuron.
Step 3: Calculate the "dislike" neuron.
Each output neuron computes a weighted sum of its inputs plus a bias -- a linear transformation of the inputs by the weights and biases.
Step 4: Convert to probabilities. We use a function called Softmax to turn these numbers into probabilities (numbers between 0 and 1 that sum to 1):
The network predicts we won't like Movie 1 (76% chance of dislike). But we actually did like it! The network is wrong. What we just calculated is called a forward pass.
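Putting Steps 1 through 4 together, here's the whole forward pass in a few lines of Python. The initial weights and biases below are made-up "random initialization" values, so the exact probabilities differ from the walkthrough:

```python
import math

# Forward pass for the 2-input, 2-output movie network.
action, comedy = 0.1, 0.9          # Movie 1's inputs

w_like = [0.5, -0.3]               # assumed weights into the "like" neuron
w_dislike = [0.4, 0.6]             # assumed weights into the "dislike" neuron
b_like, b_dislike = 0.1, -0.2      # assumed biases

# Linear step: weighted sum of inputs plus bias, for each neuron.
z_like = w_like[0] * action + w_like[1] * comedy + b_like
z_dislike = w_dislike[0] * action + w_dislike[1] * comedy + b_dislike

# Softmax: exponentiate, then normalize so the outputs sum to 1.
exp_like, exp_dislike = math.exp(z_like), math.exp(z_dislike)
total = exp_like + exp_dislike
p_like, p_dislike = exp_like / total, exp_dislike / total

print(f"P(like) = {p_like:.2f}, P(dislike) = {p_dislike:.2f}")
# P(like) = 0.38, P(dislike) = 0.62
```

With these particular numbers the network also (wrongly) leans toward "dislike" for a movie we liked.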
Step 5: Measuring the Error. The network was wrong, but we need to measure exactly how wrong. This measurement is called the loss.
The bigger the loss, the more wrong we were. Our goal is to make this number as small as possible.
Step 6: Adjusting the Parameters. Now comes the clever part. We need to adjust our weights and biases to reduce the loss. But which way should we adjust them? We calculate something called Gradients. A gradient tells us: "If I increase this weight by a tiny amount, how much does the loss change?"
For example: Current comedy -> like weight: -0.3. The gradient calculation tells us increasing this weight decreases the loss. So we increase it slightly: -0.3 + 0.01 = -0.29.
This process is called Backpropagation because we work backward from the loss (output) back to the weights (input).
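The gradient's question -- "if I increase this weight by a tiny amount, how much does the loss change?" -- can be answered numerically. The sketch below uses finite differences rather than true backpropagation (which computes all gradients analytically in one backward sweep), and all the weight values are assumed:

```python
import math

action, comedy = 0.1, 0.9  # Movie 1, which we actually liked

def loss(w_comedy_like: float) -> float:
    # Forward pass with only the comedy -> like weight left variable;
    # the other weights and biases are fixed assumptions.
    z_like = 0.5 * action + w_comedy_like * comedy + 0.1
    z_dislike = 0.4 * action + 0.6 * comedy - 0.2
    p_like = math.exp(z_like) / (math.exp(z_like) + math.exp(z_dislike))
    return -math.log(p_like)   # cross-entropy loss for the true label "like"

w = -0.3                       # current comedy -> like weight
eps = 1e-5
gradient = (loss(w + eps) - loss(w)) / eps   # nudge, measure the change
print(f"gradient = {gradient:.3f}")          # negative: increasing w helps

w = w - 0.1 * gradient         # step against the gradient (learning rate 0.1)
print(f"updated weight: {w:.3f}, new loss: {loss(w):.4f} (was {loss(-0.3):.4f})")
```

The gradient comes out negative, so increasing the weight decreases the loss -- exactly the adjustment described in Step 6.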
Now try it yourself -- adjust the sliders below and watch the forward pass happen in real time:
You're predicting if you'll like a movie. Adjust the action and comedy scores and watch the neural network calculate in real-time.
The network doesn't "understand" comedy or action. It just learned that when input 2 (comedy) is high, it should output "like". Pure math, zero understanding. But the math works.
What if instead of predicting movie preferences, we predict the next letter in a sequence? The same neural network approach works. Let's take the phrase: "The cat sat o_"
First, we need to convert this into numbers. In modern systems, we use vectors (lists of numbers), but for this simple example, let's imagine we assign a unique ID to every character:
So we have 13 input neurons (one for each character position). Our output layer has 27 neurons: one for each letter A-Z, plus one for space.
After training on millions of English sentences, the network learns patterns. When it sees "...at o", the weights result in the 'n' neuron getting the highest value.
Here's the key trick: we feed predictions back into the network, sliding our window forward:
This window is called the context length: how much text the model can "see" at once. Modern models use tokens (words or word pieces) instead of characters, and have massive context windows -- GPT-4 can see up to 128,000 tokens at once.
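The feed-the-prediction-back-in loop can be sketched like this. The "model" here is just a hypothetical lookup table standing in for a trained network, with a 13-character context window:

```python
# Autoregressive generation: predict a character, append it, slide the
# context window forward, repeat. TOY_MODEL is an invented lookup table.
TOY_MODEL = {
    "The cat sat o": "n",
    "he cat sat on": " ",
    "e cat sat on ": "t",
    " cat sat on t": "h",
    "cat sat on th": "e",
}

def predict_next(context: str) -> str:
    return TOY_MODEL.get(context, "?")

CONTEXT_LENGTH = 13
text = "The cat sat o"
for _ in range(5):
    next_char = predict_next(text[-CONTEXT_LENGTH:])  # model only sees the window
    text += next_char

print(text)   # "The cat sat on the"
```

Notice that by the final steps, the original "The" has already slid out of the window -- the model generates from what it can currently see.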
Now let's see the same idea at the word level, which is closer to how real LLMs work:
Watch the model predict one word at a time. Click "Next Step" to advance.
The model doesn't "know" English. It learned that after "sat on the", the token "mat" appeared very often in training data. It's pattern matching at a cosmic scale -- and it works.
Models don't read words. They read tokens -- pieces of text that might be a word, part of a word, or punctuation. This is the first step in the pipeline.
Type anything and see how a model breaks it into tokens.
Try typing "ChatGPT" -- it becomes ["Chat", "GPT"]. Or "Understanding" -- ["Under", "standing"]. Common words stay whole, rare words get split.
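Real tokenizers are learned from data (byte-pair encoding and its relatives), but a greedy longest-match over a tiny invented vocabulary captures the flavor of how "ChatGPT" splits while common words stay whole:

```python
# A toy tokenizer: repeatedly take the longest vocabulary entry that
# prefixes the remaining text. VOCAB is invented for illustration.
VOCAB = {"Chat", "GPT", "Under", "standing", "the", "cat"}

def tokenize(text: str) -> list[str]:
    tokens = []
    while text:
        # Longest matching vocab entry, else fall back to one character.
        match = max((v for v in VOCAB if text.startswith(v)),
                    key=len, default=text[0])
        tokens.append(match)
        text = text[len(match):]
    return tokens

print(tokenize("ChatGPT"))        # ['Chat', 'GPT']
print(tokenize("Understanding"))  # ['Under', 'standing']
```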
Now that we understand neural networks, training, and next-word prediction, let's see how these pieces come together to build something like ChatGPT. There are two main phases.
Pre-training is where the model learns language from scratch. We feed it enormous amounts of text from the internet -- books, Wikipedia, news articles, forums, and websites. We don't need humans to label this data. The text itself provides the labels through self-supervised learning.
Every piece of text can become millions of training examples:
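Here's how raw text labels itself -- a sketch of turning a single sentence into next-word training pairs, no human annotation required:

```python
# Self-supervised labeling: slide over the text and use the next word
# as the label for everything that came before it.
words = "the cat sat on the mat".split()

examples = []
for i in range(1, len(words)):
    context = words[:i]          # input x: everything so far
    target = words[i]            # output y: the very next word
    examples.append((context, target))

for context, target in examples:
    print(" ".join(context), "->", target)
# the -> cat
# the cat -> sat
# ... and so on, up to "the cat sat on the -> mat"
```

One six-word sentence yields five examples; a trillion-word corpus yields roughly a trillion.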
After pre-training, we get a Base Model. GPT is one such model:
This base model is smart but unruly. If you ask it "What is the capital of France?", it might answer "And what is the capital of Germany?" because it thinks it is generating a quiz. It learned to predict text continuations -- not to follow instructions or be helpful. It's a brilliant but unfocused student.
Base models know a lot but don't know how to be helpful. Fine-tuning teaches them to be useful assistants.
Instruction Fine-tuning: We train on thousands of instruction-response pairs. "Summarize this article" -> [good summary]. "Translate to French" -> [correct translation]. The model learns to follow instructions rather than just continuing text.
Classification Fine-tuning: We can also train the model to categorize text -- sentiment analysis, topic classification, etc.
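To make instruction fine-tuning concrete, here is what such a dataset might look like -- the format below is hypothetical; real datasets and prompt templates vary by lab:

```python
# Instruction-response pairs: the model is trained to produce the
# response given the rendered instruction. Format is illustrative only.
instruction_examples = [
    {"instruction": "Translate to French", "input": "Hello",
     "response": "Bonjour"},
    {"instruction": "Summarize this article", "input": "<article text>",
     "response": "<good summary>"},
]

for ex in instruction_examples:
    # Each pair is rendered into one training string.
    print(f"### Instruction: {ex['instruction']}\n### Response: {ex['response']}\n")
```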
Most modern LLMs rely on the transformer architecture. The original transformer, introduced in 2017, had two parts: an encoder, which reads and builds a representation of the input, and a decoder, which generates the output.
Modern models typically use only one of these parts, depending on their training approach:
Masked Language Modeling (MLM) -- "Fill-in-the-blank" training. Uses only the encoder. BERT uses this approach:
Causal Language Modeling (CLM) -- Predicts the next word. Uses only the decoder. GPT uses this approach:
ChatGPT, Claude, and most modern AI assistants use Causal Language Modeling; they're decoder-only models. Before transformers (2017), language models processed text sequentially, one word at a time. Transformers process all words simultaneously, using a causal mask so that each word still only attends to the words before it. This parallelism is what made training at massive scale possible.
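That causal mask is just a small triangular matrix: row `i` marks which positions token `i` is allowed to attend to. A dependency-free sketch for four tokens:

```python
# Causal attention mask: position i may attend only to positions <= i.
n = 4
mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

During training, all four next-token predictions are computed in parallel, yet the mask guarantees no token can "peek" at its own future.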
Here's the complete journey from your question to ChatGPT's answer:
These vectors flow through two main types of processing blocks: Attention Blocks (where words share context with each other) and MLP Blocks (which refine understanding independently). Together, these make up each transformer layer.
The base model is good at autocomplete. Fine-tuning on conversation examples creates the assistant experience that we know as ChatGPT, Claude, and others.
Purple = covered in this module. Gray = coming in next modules.
Explain to a friend who knows nothing about AI: How does ChatGPT generate a response when you ask it a question? Use your own words. Don't worry about being perfect -- the goal is to test what stuck.
An AI tutor will compare your explanation against the course material and give specific feedback.
Next up: Tokenization -- how text becomes numbers, and why "ChatGPT" splits into ["Chat", "GPT"].
Synthesis question to think about: If the model only predicts the next token, how does it "know" to answer questions instead of just continuing text?