Everything so far has been theory. In this module we translate every abstraction from the past 8 modules into running PyTorch code. By the end you will have a working GPT-2 style model that reads text, learns from it, and generates its own stories.
What we will build: GPT-2 (124M), the smallest model that is recognizably modern. Its architecture is nearly identical to that of today's frontier models -- same attention, same MLP, same block structure, same autoregressive objective. The differences are mostly scale, data, and post-training. Master this and you understand the bones of every LLM.
Before writing a single custom line, we load the official pretrained GPT-2 from HuggingFace to see what "finished product" looks like:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30)
# -> "Hello, I'm a language model, so you can't just use the same data..."
The default is the 124M parameter version. The family also includes gpt2-medium (355M), gpt2-large (774M), and gpt2-xl (1.5B). We will build the 124M version from scratch and then verify our architecture is correct by loading OpenAI's official weights into our custom class.
We use the TinyStories dataset (Eldan and Li, 2023). Each entry is a short story using only vocabulary a 4-year-old would understand. Perfect for tiny models because they can actually learn coherent English from it without thousands of GPUs.
from datasets import load_dataset
import tiktoken, torch
from torch.utils.data import Dataset, DataLoader
encoder = tiktoken.get_encoding("gpt2") # GPT-2 BPE tokenizer, 50257 tokens
ds = load_dataset("roneneldan/TinyStories")
class TinyStoriesDataset(Dataset):
    def __init__(self, split, encoder, context_length=128):
        self.tokens = []
        for row in ds[split].select(range(1000)):  # small slice to keep the demo fast
            self.tokens.extend(encoder.encode(row['text']))
            self.tokens.append(encoder.eot_token)  # end-of-text marker
        self.tokens = torch.tensor(self.tokens, dtype=torch.long)
        self.context_length = context_length

    def __len__(self):
        # subtract 1 so the shifted targets for the last chunk stay in range
        return (len(self.tokens) - 1) // self.context_length

    def __getitem__(self, idx):
        start = idx * self.context_length
        x = self.tokens[start : start + self.context_length]
        y = self.tokens[start + 1 : start + self.context_length + 1]
        return x, y
The key trick is the shifted targets: for every input position, the label is the very next token. A single 128-token chunk therefore yields 128 next-token training signals in one forward pass, because causal masking lets every position make its prediction without seeing the future.
Treating every story as an independent sample would waste most of each batch on padding. Concatenating and chunking keeps the GPUs maximally busy. The end-of-text token tells the model "start over".
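A toy sketch of the shifted-target slicing, using made-up token ids so the offsets are easy to see:

```python
import torch

tokens = torch.tensor([5, 6, 7, 8, 9, 10, 11, 12, 13])  # toy concatenated stream
context_length = 4

# chunk 0, exactly as __getitem__ slices it
x = tokens[0:context_length]        # inputs:  tensor([5, 6, 7, 8])
y = tokens[1:context_length + 1]    # targets: tensor([6, 7, 8, 9])
# position i is trained to predict tokens[i + 1], so one chunk of
# 4 tokens produces 4 next-token training signals in a single pass
```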
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 128    # max sequence length
    vocab_size: int = 50257  # size of BPE vocabulary
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding dimension
These exact numbers define the 124M parameter GPT-2. Want gpt2-medium? Set n_layer=24, n_head=16, n_embd=1024. Want gpt2-xl? n_layer=48, n_head=25, n_embd=1600. The architecture stays identical.
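As a sketch, the family variants are just different field values on the same dataclass; `dataclasses.replace` makes that explicit (the config values are from the text above, and note that head_dim = n_embd / n_head stays 64 across the whole family):

```python
from dataclasses import dataclass, replace

@dataclass
class GPTConfig:
    block_size: int = 128
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

small  = GPTConfig()                                          # gpt2 (124M)
medium = replace(small, n_layer=24, n_head=16, n_embd=1024)   # gpt2-medium
xl     = replace(small, n_layer=48, n_head=25, n_embd=1600)   # gpt2-xl

# every variant keeps 64-dimensional attention heads
assert small.n_embd // small.n_head == 64
assert medium.n_embd // medium.n_head == 64
assert xl.n_embd // xl.n_head == 64
```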
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual + attention
        x = x + self.mlp(self.ln_2(x))   # residual + MLP
        return x
Two critical design choices:
- Pre-norm: LayerNorm is applied before attention and before the MLP, not after them as in the original Transformer. This keeps the residual stream clean and makes deep stacks far easier to train.
- Residual connections: each sublayer adds its output to its input, so gradients can flow straight from the loss back to the embeddings, and each block only has to learn a small refinement.
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # one big matrix produces Q, K, V in a single shot
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # causal mask: lower triangular matrix of ones
        self.register_buffer("bias", torch.tril(
            torch.ones(config.block_size, config.block_size)
        ).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()    # batch, time, channels
        qkv = self.c_attn(x)  # (B, T, 3C)
        q, k, v = qkv.split(self.n_embd, dim=2)  # three (B, T, C)
        # reshape to (B, n_head, T, head_dim)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # (B, nh, T, T) attention scores
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v  # (B, nh, T, head_dim)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
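The masking step is worth seeing in isolation. A minimal sketch (a 4-position sequence and random scores, purely for illustration):

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)               # raw attention scores, like q @ k^T
mask = torch.tril(torch.ones(T, T))      # lower-triangular causal mask

masked = scores.masked_fill(mask == 0, float('-inf'))
att = F.softmax(masked, dim=-1)

# future positions get exactly zero weight, and each row still sums to 1
assert torch.all(att[0, 1:] == 0)
assert torch.allclose(att.sum(dim=-1), torch.ones(T))
```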
Every line maps to something from Module 4:
Every line maps to something from Module 4:
- c_attn is one big linear layer producing Q, K, V all at once (more efficient than three separate ones)
- The view(...).transpose(...) dance splits the 768 embedding dimensions into 12 heads of 64 each
- q @ k.transpose(-2, -1) is the all-pairs dot product matrix QK^T
- 1/sqrt(head_dim) keeps softmax gradients healthy
- masked_fill sets future positions to -inf, which become exactly 0 after softmax -- this is causal masking
- att @ v is the weighted sum that produces the new contextualized vectors
- c_proj projects the concatenated head outputs back to the residual stream width

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)   # 768 -> 3072
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd) # 3072 -> 768

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
The MLP is surprisingly simple: expand 4x, apply a non-linearity (GELU), contract back. Two thirds of a transformer's parameters live in MLPs. They are where the model stores most of its "knowledge" -- individual neurons in the expanded space often correspond to interpretable concepts (recent interpretability work has shown that specific MLP neurons fire on things like "code in Python", "negative sentiment", or "reference to a specific country").
Without the non-linearity (GELU), stacking linear layers would collapse to one big linear transformation. GELU is the trick that makes depth actually buy you anything. GPT-2 specifically uses the tanh approximation of GELU for speed.
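You can verify the collapse directly: two stacked bias-free Linear layers compute exactly the same function as a single Linear whose weight is the product of the two, while inserting a GELU breaks that equivalence (shapes here are illustrative, not the model's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
a = nn.Linear(8, 16, bias=False)   # "expand"
b = nn.Linear(16, 8, bias=False)   # "contract"

# Without a non-linearity, b(a(x)) is one linear map with weight W_b @ W_a
combined = nn.Linear(8, 8, bias=False)
with torch.no_grad():
    combined.weight.copy_(b.weight @ a.weight)

x = torch.randn(4, 8)
assert torch.allclose(b(a(x)), combined(x), atol=1e-5)

# With GELU in the middle, the composition is no longer a single matrix
y = b(F.gelu(a(x), approximate='tanh'))
```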
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),  # token emb
            wpe = nn.Embedding(config.block_size, config.n_embd),  # pos emb
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.transformer.wte.weight  # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        tok_emb = self.transformer.wte(idx)  # (B, T, n_embd)
        pos_emb = self.transformer.wpe(pos)  # (T, n_embd)
        x = tok_emb + pos_emb                # broadcasts over the batch
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)             # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.config.block_size:]  # crop to context
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature      # only the last position matters
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('inf')
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
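Here is the temperature and top-k filtering from generate run on made-up logits, so you can watch it reshape the distribution before sampling:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])  # toy next-token logits
temperature, top_k = 0.8, 3

logits = logits / temperature                 # temperature < 1 sharpens
v, _ = torch.topk(logits, top_k)
logits[logits < v[:, [-1]]] = -float('inf')   # drop everything outside top-k
probs = F.softmax(logits, dim=-1)

# the two filtered tokens now have exactly zero probability
assert torch.all(probs[0, 3:] == 0)
idx_next = torch.multinomial(probs, num_samples=1)  # sample one token id
```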
Weight tying (self.lm_head.weight = self.transformer.wte.weight) shares one matrix between the input embedding and the output projection. It saves 50257 x 768 ≈ 38.6M parameters (roughly a third of the model) and consistently improves results.
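A quick sketch of what the tying line does, using the embedding and head shapes from the config above in isolation:

```python
import torch.nn as nn

emb = nn.Embedding(50257, 768)            # input token embedding (wte)
head = nn.Linear(768, 50257, bias=False)  # output projection to logits (lm_head)
head.weight = emb.weight                  # tie: one shared (50257, 768) matrix

assert head.weight is emb.weight          # same tensor object, not a copy
# parameters saved by sharing: 50257 * 768, about 38.6M
```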
Training is standard PyTorch: AdamW with weight decay 0.1, learning rate 3e-4 with cosine decay, gradient clipping to norm 1.0, batch size as large as memory allows. After a few hours on a single consumer GPU, your model will generate recognizable TinyStories sentences like "The little girl was very happy and she said thank you."
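A minimal training-loop sketch with those settings. The tiny stand-in model and random batches are placeholders so the sketch runs standalone; in practice you would use GPT(GPTConfig()) and (x, y) batches from the TinyStoriesDataset loader:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the GPT model above, just so this sketch executes
vocab_size, n_embd, context = 256, 32, 16
model = nn.Sequential(nn.Embedding(vocab_size, n_embd),
                      nn.Linear(n_embd, vocab_size))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
max_steps = 20
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_steps)

for step in range(max_steps):
    x = torch.randint(0, vocab_size, (8, context))  # fake batch of token ids
    y = torch.randint(0, vocab_size, (8, context))  # fake shifted targets
    logits = model(x)                               # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip to norm 1.0
    optimizer.step()
    scheduler.step()                                # cosine LR decay
```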
Walk through the GPT-2 architecture from input tokens to output logits. Explain what each layer does (embedding, positional embedding, transformer blocks, layer norm, lm head) and how causal attention + residual connections + the MLP combine to make next-token prediction work.
An AI tutor will compare your explanation against the course material.