Where does a model's knowledge actually come from?

Think first
ChatGPT can discuss quantum mechanics, write Python, translate French, and summarize legal texts. Nobody sat down and taught it any of these skills explicitly. So where did all that knowledge come from, and what was the single objective used to get it there?

Pre-training is the stage where a model "learns everything." It is large-scale self-supervised learning on trillions of tokens of text. Post-training (instruction tuning, RLHF) comes later and shapes raw potential into a helpful assistant, but the raw knowledge, grammar, reasoning patterns, and facts are all absorbed during pre-training.

The objective is deceptively simple: predict the next token. Given the first n tokens of a sequence, predict token n+1. Do this on a sizable fraction of all text humans have written, and the model is forced to learn grammar, style, factual knowledge, arithmetic, logical patterns, and world models as a byproduct of minimizing prediction loss.
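As a toy sketch of that objective: the model's loss on one step is just the cross-entropy of its predicted distribution against the true next token. The tiny vocabulary and probabilities below are invented for illustration, not from any real model:

```python
import numpy as np

# Hypothetical 5-word vocabulary for this sketch.
vocab = ["the", "cat", "sat", "on", "mat"]

def next_token_loss(probs: np.ndarray, target_id: int) -> float:
    """Cross-entropy loss for a single next-token prediction:
    -log(probability the model assigned to the true next token)."""
    return -float(np.log(probs[target_id]))

# Suppose the model sees "the cat sat on the" and assigns these
# probabilities to each candidate next token; the true token is "mat".
probs = np.array([0.05, 0.05, 0.05, 0.05, 0.80])
loss = next_token_loss(probs, target_id=vocab.index("mat"))
print(round(loss, 3))  # -log(0.8) ≈ 0.223
```

Training minimizes the average of this loss over every position in every sequence; everything else the model "knows" falls out of pushing that number down.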

Key Insight

Pre-training determines the ceiling of a model. No amount of post-training can add knowledge that was not in the pre-training corpus. Post-training just rearranges and shapes the knowledge already there.

A brief history: how we got here

Pre-training did not appear fully formed. It evolved over roughly a decade through a chain of discoveries.

[Interactive timeline of pre-training milestones]
Pattern

No single milestone changed the architecture much. What changed was scale, data quality, and the realization that one self-supervised objective could replace dozens of task-specific supervised ones.

The scale story: 17 million to 1.8 trillion parameters

The most dramatic change in pre-training has been raw scale. Compare how models grew across key dimensions (parameters and training tokens):

[Chart: parameter counts (log scale, relative to GPT-3) and training tokens (log scale) per model generation]
Scaling laws

In 2020, OpenAI showed that loss falls as a smooth power law in parameters, data, and compute. In 2022, DeepMind's Chinchilla refined this: parameters and data should be scaled in equal proportion, which meant GPT-3 was badly under-trained for its size. A 70B model trained on 1.4T tokens (Chinchilla) beat a 280B model trained on 300B tokens (Gopher).
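The Chinchilla allocation can be sketched with the commonly cited approximations C ≈ 6·N·D (training FLOPs) and D ≈ 20·N (tokens per parameter). These are rough rules of thumb, not the paper's exact fitted coefficients:

```python
import math

def chinchilla_optimal(compute_flops: float):
    """Rough compute-optimal split: with C = 6*N*D and D = 20*N,
    C = 120*N^2, so N = sqrt(C/120) and D = 20*N.
    Returns (params, tokens)."""
    params = math.sqrt(compute_flops / 120)
    tokens = 20 * params
    return params, tokens

# At roughly Chinchilla's own budget (~5.76e23 FLOPs) this recovers
# approximately 70B parameters and 1.4T tokens.
n, d = chinchilla_optimal(5.76e23)
print(f"{n:.2e} params, {d:.2e} tokens")
```

Plugging in GPT-3's budget the same way shows why a 175B model trained on only 300B tokens sits far from this optimum.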

The cost reality

GPT-3 cost an estimated $5M to train. GPT-4 is estimated at $100M+. Llama 3 405B used 16,384 H100 GPUs for ~54 days. This scale pushed frontier research out of universities and into a handful of well-funded labs.

The modern pre-training recipe

By 2024 the recipe converged across almost every frontier lab:

Architecture: Decoder-only Transformer with RoPE positional embeddings, RMSNorm, SwiGLU activations, and FlashAttention for memory efficiency.
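Two of these components are small enough to sketch directly. A minimal numpy version of RMSNorm and the SwiGLU gate (shapes and weights here are illustrative, not taken from any real model):

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: rescale by the root-mean-square of the activations.
    Unlike LayerNorm it skips mean subtraction, one fewer statistic."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x: np.ndarray, w_gate: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    """SwiGLU feed-forward gate: silu(x @ W_gate) * (x @ W_up)."""
    gate = x @ w_gate
    silu = gate / (1 + np.exp(-gate))  # SiLU (swish) activation
    return silu * (x @ w_up)

x = np.random.randn(4, 8)              # (tokens, hidden)
y = rms_norm(x, np.ones(8))            # normalized, same shape
z = swiglu(y, np.random.randn(8, 16), np.random.randn(8, 16))
print(y.shape, z.shape)
```

In a real Transformer block these sit around the attention and feed-forward sublayers; RoPE and FlashAttention live inside the attention computation itself and are omitted here.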

Data: 1-15 trillion tokens. Obsession with quality over quantity. Filtered web + books + papers + code. Code improves reasoning even for non-code tasks. Synthetic data for math.

Optimization: AdamW (beta1=0.9, beta2=0.95, weight_decay=0.1), learning-rate warmup followed by cosine decay, gradient clipping at norm 1.0, BF16 precision, and batch sizes of 2-4M tokens.
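The warmup-then-cosine schedule fits in a few lines. The specific learning rates and step counts below are illustrative placeholders, not any lab's actual settings:

```python
import math

def lr_schedule(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
                warmup: int = 2000, total: int = 100_000) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / (total - warmup)   # 0 -> 1 after warmup
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_schedule(0))        # tiny, still warming up
print(lr_schedule(2_000))    # peak learning rate
print(lr_schedule(100_000))  # decayed to the floor
```

The warmup avoids unstable updates while Adam's moment estimates are still noisy; the cosine tail lets the model settle into a lower-loss region late in training.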

Training: Thousands of H100s/TPUs running for months. Engineers babysit the run, roll back to checkpoints on loss spikes, and adjust the data mix on the fly (more math and code near the end).

Quality beats quantity

Phi-1 (1.3B params, 7B tokens of "textbook-quality" data) hit 50.6% on HumanEval, beating models 10x its size. FineWeb-Edu matched MMLU scores with 10x fewer tokens. The era of "just scrape more" is over.

Check your understanding

1. What is the single training objective behind most modern LLMs like GPT-3/4?
Causal language modeling (next-token prediction) is what GPT-family decoder-only models optimize. BERT used MLM (masked) instead.
2. What did Chinchilla (2022) change about how we allocate compute?
Chinchilla found 10x compute means ~3.16x parameters AND ~3.16x tokens, not 5.5x params and 1.8x tokens like Kaplan (2020).
3. GPT-2 (2019) was architecturally almost identical to GPT-1. What was the big lesson?
GPT-2 was 13x larger than GPT-1 with the same architecture and showed surprising zero-shot abilities, seeding the "scale is the thing" hypothesis.
4. Why did decoder-only architectures beat encoder-decoder by 2023?
GPT-3 showed few-shot in-context learning was a game changer, and decoder-only is simpler to train, scale, and deploy for that paradigm.
5. Phi-1 (1.3B params) beat 10x larger models on HumanEval. What does this illustrate?
Data quality can substitute for scale when the domain is narrow and the data is carefully curated.

Solidify your understanding

Teach It Back

Explain to a friend: What is pre-training, what objective is used, and why did scale and data quality turn out to matter more than architectural cleverness?

An AI tutor will compare your explanation against the course material and give specific feedback.


Flashcards

What does Word2Vec learn and what's its main limitation?
Word2Vec (2013) learns static word embeddings via CBOW or Skip-gram. Captures analogies like king-queen = man-woman. Limitation: one vector per word regardless of context, so "bank" gets one meaning.
Contrast BERT and GPT in one sentence each.
BERT: encoder-only, bidirectional, trained with masked LM + next-sentence prediction, strong at understanding but cannot generate naturally. GPT: decoder-only, left-to-right, trained with causal next-token prediction, naturally generates text.
Kaplan (2020) vs Chinchilla (2022) scaling laws?
Kaplan: 10x compute = 5.5x params, 1.8x tokens (led to big undertrained GPT-3). Chinchilla: 10x compute = ~3.16x params AND ~3.16x tokens. Chinchilla-optimal ≈ 20 tokens per parameter.
Why did GPT-3 matter historically?
175B parameters, 100x GPT-2, demonstrated in-context few-shot learning. Made "scale alone is enough" believable, shifted the field from fine-tuning to prompting, and made frontier AI only feasible for well-funded labs.
Name the modern architectural refinements over vanilla GPT-2.
RoPE (rotary positional embeddings), RMSNorm (faster than LayerNorm), SwiGLU activations, FlashAttention for memory-efficient attention. Together they enable longer context and faster training at no quality cost.
Why is pre-training considered the "ceiling" of a model?
Pre-training is where raw knowledge is absorbed from trillions of tokens. Post-training (SFT, RLHF) only reshapes existing knowledge — it cannot add facts, grammar, or reasoning patterns that were absent from the pre-training corpus.

Module 10 Complete

You now have the big-picture map of pre-training. Next up: Training Objectives and Architectural Details — the concrete math of causal LM, MLM, span corruption, RoPE, RMSNorm, and SwiGLU.

Synthesis question: If next-token prediction is "just" a simple loss, why do we need trillions of tokens to learn it well?
