Let's build a GPT from scratch.

Everything so far has been theory. In this module we translate every abstraction from the past 8 modules into running PyTorch code. By the end you will have a working GPT-2 style model that reads text, learns from it, and generates its own stories.

Why GPT-2?

GPT-2 (124M) is the smallest model that is recognizably modern. Its architecture uses the same ingredients as today's frontier models -- same attention, same MLP, same block structure, same autoregressive objective. The differences between GPT-2 and frontier models are mostly scale, data, and post-training. Master this and you understand the bones of every LLM.

Step 0: Setup and Baseline

Before writing a single custom line, we load the official pretrained GPT-2 from HuggingFace to see what "finished product" looks like:

from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30)
# -> "Hello, I'm a language model, so you can't just use the same data..."

The default is the 124M-parameter version. The family also includes gpt2-medium (355M), gpt2-large (774M), and gpt2-xl (1.5B). We will build the 124M version from scratch and then verify our architecture is correct by loading OpenAI's official weights into our custom class.

Step 1: The Data Pipeline

We use the TinyStories dataset (Eldan and Li, 2023). Each entry is a short story using only vocabulary a 4-year-old would understand. Perfect for tiny models because they can actually learn coherent English from it without thousands of GPUs.

from datasets import load_dataset
import tiktoken, torch
from torch.utils.data import Dataset, DataLoader

encoder = tiktoken.get_encoding("gpt2")  # GPT-2 BPE tokenizer, 50257 tokens
ds = load_dataset("roneneldan/TinyStories")

class TinyStoriesDataset(Dataset):
    def __init__(self, split, encoder, context_length=128):
        self.tokens = []
        for row in ds[split].select(range(1000)):
            self.tokens.extend(encoder.encode(row['text']))
            self.tokens.append(encoder.eot_token)  # end-of-text marker
        self.tokens = torch.tensor(self.tokens, dtype=torch.long)
        self.context_length = context_length

    def __len__(self):
        # subtract 1 so the target slice (start+1 : start+cl+1) never runs off the end
        return (len(self.tokens) - 1) // self.context_length

    def __getitem__(self, idx):
        start = idx * self.context_length
        x = self.tokens[start : start + self.context_length]
        y = self.tokens[start + 1 : start + self.context_length + 1]
        return x, y

The key trick is the shifted targets: for every input position, the label is the very next token. One 128-token chunk therefore yields 128 training signals in a single forward pass, with causal masking letting us compute them all in parallel.
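The shift is easier to see on a toy example. Here the token ids are made up, but the slicing logic is exactly the one in __getitem__ above:

```python
# Toy illustration of shifted inputs/targets (pure Python, made-up token ids).
tokens = [10, 11, 12, 13, 14, 15, 16, 17, 18]
context_length = 4

chunks = len(tokens) // context_length  # how many full windows fit
pairs = []
for idx in range(chunks):
    start = idx * context_length
    x = tokens[start : start + context_length]       # input window
    y = tokens[start + 1 : start + context_length + 1]  # same window, shifted by one
    pairs.append((x, y))

# For every position i, y[i] is the token that follows x[i]:
print(pairs[0])  # ([10, 11, 12, 13], [11, 12, 13, 14])
```

Each of the 4 positions in a window gets its own next-token label, which is where the "one forward pass, many training signals" efficiency comes from.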

Why concatenate all stories?

Treating every story as an independent sample would waste most of each batch on padding. Concatenating and chunking keeps the GPUs maximally busy. The end-of-text token tells the model "start over".

Step 2: The GPT Architecture at a Glance

Interactive: GPT-2 Architecture Visualizer

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 128     # max sequence length (1024 in the real GPT-2; 128 for our demo)
    vocab_size: int = 50257   # size of BPE vocabulary
    n_layer:    int = 12      # number of transformer blocks
    n_head:     int = 12      # attention heads per block
    n_embd:     int = 768     # embedding dimension

These exact numbers define the 124M parameter GPT-2. Want gpt2-medium? Set n_layer=24, n_head=16, n_embd=1024. Want gpt2-xl? n_layer=48, n_head=25, n_embd=1600. The architecture stays identical.
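You can sanity-check the "124M" label directly from these numbers. The count below uses the real block_size of 1024 and assumes the output head shares weights with the token embedding (the weight tying we set up in Step 6), so lm_head contributes no new parameters:

```python
# Parameter count of GPT-2 (124M) from the config numbers alone.
V, S, L, d = 50257, 1024, 12, 768   # vocab, block_size, n_layer, n_embd

wte = V * d                                   # token embedding
wpe = S * d                                   # positional embedding
ln = 2 * d                                    # LayerNorm: scale + shift
attn = (d * 3 * d + 3 * d) + (d * d + d)      # c_attn + c_proj (weights + biases)
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # c_fc + c_proj (weights + biases)
block = ln + attn + ln + mlp                  # one transformer block

total = wte + wpe + L * block + ln            # trailing ln is the final ln_f
print(f"{total:,}")  # 124,439,808
```

The MLP and attention terms also show where the parameters live: the two MLP matrices dominate each block.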

Step 3: The Transformer Block

import math
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp  = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # residual + attention
        x = x + self.mlp(self.ln_2(x))    # residual + MLP
        return x

Two critical design choices:

  1. Pre-norm, not post-norm. LayerNorm is applied before each sublayer, not after. This change from the original Transformer paper stabilizes training in very deep networks -- gradients flow cleanly through the residual stream.
  2. Two residual connections per block. Each sublayer adds its output to its input rather than replacing it. The residual stream is the spine that carries information from embedding to final prediction, with each block contributing a refinement.
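A toy numeric analogy for point 2 (just arithmetic, not a model; the "sublayer" g is made up): when each layer only contributes a small refinement, the residual path preserves the signal, while stacking the same weak layers without residuals collapses it.

```python
# Residual vs. no-residual stacking with a deliberately weak sublayer.
def g(x):
    return 0.1 * x   # a made-up sublayer that contributes a small refinement

x_resid, x_plain = 1.0, 1.0
for _ in range(12):                  # 12 "blocks", like GPT-2
    x_resid = x_resid + g(x_resid)   # residual: x + g(x)
    x_plain = g(x_plain)             # no residual: g(g(g(...)))

print(x_resid)  # grows gently, ~1.1 ** 12 = ~3.14
print(x_plain)  # collapses toward zero, ~1e-12
```

The same reasoning applies to gradients: differentiating x + g(x) gives 1 + g'(x), so the identity path keeps gradients flowing no matter how weak each sublayer's contribution is.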

Step 4: Causal Self-Attention in Code

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # one big matrix produces Q, K, V in a single shot
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # causal mask: lower triangular matrix of ones
        self.register_buffer("bias", torch.tril(
            torch.ones(config.block_size, config.block_size)
        ).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()                         # batch, time, channels
        qkv = self.c_attn(x)                       # (B, T, 3C)
        q, k, v = qkv.split(self.n_embd, dim=2)    # three (B, T, C)

        # reshape to (B, n_head, T, head_dim)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        # (B, nh, T, T) attention scores
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)

        y = att @ v                                # (B, nh, T, head_dim)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

Every line maps to something from Module 4: c_attn produces Q, K, and V; the q @ k.transpose product scaled by 1/sqrt(head_dim) gives the attention scores; masked_fill applies the causal mask; and softmax followed by att @ v is the weighted sum of values.
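The mask-then-softmax step is worth checking concretely. Here is a tiny T=3 version in plain Python (the score values are made up):

```python
# Minimal numeric check of masked_fill + softmax for a 3-token sequence.
import math

scores = [[0.5, 0.9, 0.2],
          [0.1, 0.7, 0.4],
          [0.3, 0.2, 0.8]]   # made-up attention scores; rows = query positions

T = 3
for i in range(T):
    for j in range(T):
        if j > i:                    # strictly above the diagonal = future
            scores[i][j] = float('-inf')

def softmax(row):
    m = max(row)                     # subtract max for numerical stability
    exps = [math.exp(s - m) for s in row]
    z = sum(exps)
    return [e / z for e in exps]

att = [softmax(row) for row in scores]
print(att[0])  # [1.0, 0.0, 0.0] -- position 0 can only attend to itself
```

exp(-inf) is exactly 0, so masked (future) positions receive zero attention weight while each row still sums to 1.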

Step 5: The MLP (Feed-Forward Network)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4 * config.n_embd)   # 768 -> 3072
        self.gelu   = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)   # 3072 -> 768

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

The MLP is surprisingly simple: expand 4x, apply a non-linearity (GELU), contract back. Roughly two thirds of each transformer block's parameters live in its MLP. MLPs are where the model stores most of its "knowledge" -- recent interpretability work has found individual neurons in the expanded space that fire on interpretable concepts like Python code, negative sentiment, or references to a specific country.

Mathematical note

Without the non-linearity (GELU), stacking linear layers would collapse to one big linear transformation. GELU is the trick that makes depth actually buy you anything. GPT-2 specifically uses the tanh approximation of GELU for speed.
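The collapse is easy to verify by hand. A tiny 2-dimensional demo with made-up weight matrices (plain Python, no torch):

```python
# Two stacked linear maps are just one linear map.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply(W, v):
    return [W[0][0] * v[0] + W[0][1] * v[1],
            W[1][0] * v[0] + W[1][1] * v[1]]

W1 = [[1.0, 2.0], [0.0, 1.0]]   # made-up "c_fc"
W2 = [[3.0, 0.0], [1.0, 1.0]]   # made-up "c_proj"
v = [0.5, 2.0]

two_layers = apply(W2, apply(W1, v))   # stack without a non-linearity
one_layer = apply(matmul(W2, W1), v)   # a single equivalent linear layer
print(two_layers == one_layer)  # True -- the extra layer bought nothing
```

Inserting GELU (or any non-linearity) between W1 and W2 breaks this equivalence, which is exactly what makes depth useful.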

Step 6: Putting It All Together + Generation

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),   # token emb
            wpe = nn.Embedding(config.block_size, config.n_embd),   # pos emb
            h   = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.transformer.wte.weight   # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        tok_emb = self.transformer.wte(idx)   # (B, T, n_embd)
        pos_emb = self.transformer.wpe(pos)   # (T, n_embd)
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)              # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.config.block_size:]   # crop to context
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('inf')
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
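The sampling arithmetic inside generate() -- divide by temperature, optionally keep only the top-k logits, then softmax -- can be checked in isolation with plain Python. The logit values below are made up:

```python
# The temperature + top-k sampling math from generate(), without tensors.
import math

def sample_probs(logits, temperature=1.0, top_k=None):
    logits = [l / temperature for l in logits]   # sharpen or flatten
    if top_k is not None:
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= kth else float('-inf') for l in logits]
    m = max(logits)                              # stable softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

probs = sample_probs([2.0, 1.0, 0.5, -1.0], temperature=0.5, top_k=2)
print(probs)  # only the two largest logits keep nonzero probability
```

Lower temperature stretches the gaps between logits before the softmax, concentrating probability on the top choices; top-k then zeroes out everything outside the k best.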

Weight tying (self.lm_head.weight = self.transformer.wte.weight) shares parameters between the input embedding and the output projection. Both map between the 50257-token vocabulary and 768-dim vectors, so tying saves 50257 × 768 ≈ 38.6M parameters and consistently improves results.

Training is standard PyTorch: AdamW with weight decay 0.1, learning rate 3e-4 with cosine decay, gradient clipping to norm 1.0, batch size as large as memory allows. After a few hours on a single consumer GPU, your model will generate recognizable TinyStories sentences like "The little girl was very happy and she said thank you."
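One concrete piece of that recipe is the cosine-decay learning-rate schedule, sketched below in plain Python. The linear warmup phase and the specific step counts are common illustrative choices, not something the recipe above prescribes:

```python
# Sketch of a warmup + cosine-decay learning-rate schedule.
import math

def get_lr(step, max_lr=3e-4, min_lr=3e-5, warmup=100, max_steps=5000):
    if step < warmup:                         # linear warmup from ~0 to max_lr
        return max_lr * (step + 1) / warmup
    if step >= max_steps:                     # hold at the floor after decay
        return min_lr
    ratio = (step - warmup) / (max_steps - warmup)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)

print(get_lr(0), get_lr(100), get_lr(5000))
```

In a training loop you would call this every step and set the optimizer's learning rate (e.g. via param_groups) before optimizer.step().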

Check Your Understanding

1. Why is the causal mask a lower-triangular matrix?
Correct: Position i may only attend to positions 0..i, so everything strictly above the diagonal (future) is masked to -inf
2. What does weight tying mean in GPT-2?
Correct: Sharing parameters between the token embedding and the output projection layer
3. Why does the MLP expand to 4x the embedding dimension?
Correct: To give the model more "workspace" to process each token before compressing back; a common hyperparameter, not a hard rule
4. What is pre-norm vs post-norm, and why does GPT-2 use pre-norm?
Correct: Pre-norm applies LayerNorm before each sublayer and stabilizes training in deep networks
5. In the generate() loop, why is the logits tensor indexed with [:, -1, :]?
Correct: Because the model outputs a prediction at every position, but we only care about the one at the final position when sampling the next token

Teach It Back

Walk through the GPT-2 architecture from input tokens to output logits. Explain what each layer does (embedding, positional embedding, transformer blocks, layer norm, lm head) and how causal attention + residual connections + the MLP combine to make next-token prediction work.

An AI tutor will compare your explanation against the course material.


Flashcards (click to flip)

What are the GPT-2 (124M) hyperparameters?
n_layer=12, n_head=12, n_embd=768, block_size=1024 (we use 128 for the demo), vocab_size=50257. Scaling up to medium/large/xl just increases n_layer and n_embd proportionally; the architecture is identical.
What does a Block contain?
LayerNorm -> CausalSelfAttention -> residual add; LayerNorm -> MLP -> residual add. Pre-norm (LN before the sublayer) is used for training stability.
How is Q, K, V produced efficiently?
One linear layer c_attn produces a (B, T, 3*n_embd) tensor, then split into three (B, T, n_embd) tensors. Reshape to (B, n_head, T, head_dim) by splitting embedding dim across heads.
How is the causal mask implemented?
register_buffer stores a lower-triangular matrix of 1s. In forward, att.masked_fill(mask == 0, -inf). Softmax converts -inf to 0, so future positions get zero attention.
What is weight tying?
The input embedding matrix (wte) and the output language-model head share the same weights. Both project between tokens and n_embd-dim vectors, so sharing saves parameters and improves quality.
What does the generation loop look like?
Forward pass on current tokens, take logits at the last position, divide by temperature, optionally top-k filter, softmax to probabilities, sample one token, append to sequence, repeat until max length or EOS.