You've just finished pretraining. The model has read trillions of tokens and can complete any text with scary accuracy. You type "What is the capital of France?" and it responds: "What is the largest city in Germany? What is the population of Italy?". What happened — and what would you have to change to make it actually answer?
A base model (like gemma-3-1b-pt) is a pure statistical engine. It has compressed trillions of tokens of human text into weights that predict "what comes next". On the public web, questions often appear in lists of questions (FAQs, exams, trivia pages). So "What is the capital of France?" is statistically continued by more questions.
The base model doesn't refuse to answer. It isn't even trying. It doesn't understand the concept of "answer" at all. It only understands "continue this text plausibly".
Post-training is the process of transforming this pattern machine into a useful assistant — teaching it that user messages are things to respond to, that turns have boundaries, that some answers are preferred over others, and that some requests should be refused.
Key insight
Pre-training is about quantity — trillions of tokens, huge compute, broad knowledge. Post-training is about quality — thousands of carefully curated examples that shape behavior. The knowledge is already there. Post-training is unlocking it.
Base model=A brilliant savant who has memorized every book in the library but has never had a conversation. They'll speak, but not to you: they'll just keep reading the book aloud from wherever you start them.
Concept 2 of 8
The Completion Trap: Base vs Aligned
Let's see the difference concretely. Below, try different prompts and compare how a base model and an instruction-tuned model respond. Notice that the base model isn't "broken" — it's doing exactly what it was trained to do: continue text.
Interactive: Base vs Aligned Model
Base model (gemma-3-1b-pt)
Select a prompt above.
Aligned model (gemma-3-1b-it)
Select a prompt above.
Same architecture, same pretraining data. The only difference is post-training.
Why this matters
The base model's outputs aren't wrong — they're statistically plausible continuations of what appears on the web. Post-training shifts the distribution so that "user asks, assistant answers" becomes the dominant pattern.
Concept 3 of 8
Chat Templates: Teaching Turn Boundaries
Before we even touch weights, we can partially bridge the gap with chat templates. These use special tokens — unique IDs in the vocabulary that never appear in normal text — as structural signals.
<bos><start_of_turn>user
Hello, how are you?<end_of_turn>
<start_of_turn>model
I'm doing great. How can I help?<end_of_turn>
Tokens like <end_of_turn> serve as stop signs. During post-training, the model learns: "when I emit <end_of_turn>, I'm done speaking — the other side gets a turn." The application code watches for this token and stops sampling.
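The "stop sign" mechanism can be sketched in a few lines of plain Python. This is a toy sampling loop (the real one works on token ids, not strings, and `sample_until_stop` and the stub model are hypothetical names):

```python
END_OF_TURN = "<end_of_turn>"  # string-level stand-in for the special token id

def sample_until_stop(next_token_fn, max_tokens=50):
    """Toy sampling loop: keep asking the model for the next token
    until it emits the end-of-turn marker (or we hit a length cap)."""
    out = []
    for _ in range(max_tokens):
        tok = next_token_fn()
        if tok == END_OF_TURN:
            break  # the model signalled it is done speaking
        out.append(tok)
    return out

# Stub "model" that answers and then stops itself.
reply = iter(["I'm", "doing", "great.", END_OF_TURN, "ignored"])
print(sample_until_stop(lambda: next(reply)))  # → ["I'm", 'doing', 'great.']
```

Note that the tokens after `<end_of_turn>` are never emitted: the application code, not the model, enforces the turn boundary once the special token appears.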
Every model family uses different special tokens (ChatML, Llama, Gemma, Mistral — all different). HuggingFace hides this behind tokenizer.apply_chat_template():
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help?"},
]

# tokenize=False returns the formatted string so you can inspect the
# special tokens; by default you would get token ids back
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
Special tokens=Stage directions in a play script. "[enter stage left]" isn't something the actor says — it's a structural cue invisible to the audience but essential for the performance.
Concept 4 of 8
The Post-Training Pipeline
Modern post-training has two main stages, sometimes followed by tool-use and safety training:
Interactive: Alignment Pipeline Explorer
Click a stage to see what it does, what data it uses, and what changes in the model.
Stage 1 (SFT) makes the model capable — it learns to respond at all. Stage 2 (Preference Optimization) makes it aligned — it learns which responses are better. A model that only received SFT is helpful but might happily explain how to synthesize VX nerve gas. Preference optimization tells it "no, that one's worse".
Concept 5 of 8
Full Fine-Tuning: The Naive Approach
The most straightforward way to do SFT: just keep training. Take all the model's parameters and run backpropagation on instruction data. Mechanically identical to pretraining — same forward pass, same cross-entropy loss, same AdamW optimizer. Only the data is different.
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-pt")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-pt")
dataset = load_dataset("timdettmers/openassistant-guanaco")

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # older TRL versions call this `tokenizer=`
    train_dataset=dataset["train"],
    args=SFTConfig(output_dir="./results", max_seq_length=512),
)
trainer.train()
The memory wall
A 1B-parameter model in FP16 needs ~2 GB just for weights, plus gradients (2 GB), plus AdamW optimizer state (8 GB for m and v moments in FP32), plus activations — about 14 GB of VRAM total. A 70B model would need ~1,000 GB. Fine-tuning frontier models naively is impossible on consumer hardware.
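The memory arithmetic above is worth making explicit. A minimal sketch (the function name is ours, and activations are excluded because they depend on batch size and sequence length):

```python
def full_ft_vram_bytes(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Rough VRAM for full fine-tuning with AdamW, excluding activations:
    FP16 weights (2 B/param) + FP16 gradients (2 B/param)
    + FP32 AdamW m and v moments (4 B + 4 B = 8 B/param)."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes)

print(full_ft_vram_bytes(1e9) / 1e9)   # 12.0 GB before activations
print(full_ft_vram_bytes(70e9) / 1e9)  # 840.0 GB before activations
```

Add a couple of gigabytes of activations and the 1B model lands at roughly 14 GB; the 70B model lands near 1,000 GB.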
Concept 6 of 8
LoRA: The Sticky-Note Trick
Insight from Hu et al. 2021: the weight changes needed during fine-tuning have low intrinsic rank. You don't need to modify every dimension of every weight matrix — the useful update lives in a small subspace.
So instead of learning the full delta matrix ΔW (the same shape as W), we decompose it:
W_new = W_frozen + ΔW
ΔW ≈ A × B where A: (d × r), B: (r × d), r ≪ d
Concretely, for one 4096 × 4096 attention projection:
Interactive: LoRA Parameter Calculator
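The parameter count behind that calculator is simple enough to verify by hand (the function name is ours, for illustration):

```python
def lora_trainable_params(d, r):
    # A is (d x r) and B is (r x d), so LoRA trains 2 * d * r parameters
    return 2 * d * r

d, r = 4096, 16
full = d * d                         # full fine-tuning touches every entry
lora = lora_trainable_params(d, r)

print(lora)                # 131072
print(full)                # 16777216
print(f"{lora / full:.2%}")  # 0.78%
```

For this one projection, LoRA with r=16 trains about 0.8% of the parameters that full fine-tuning would.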
At inference time, LoRA adapters can be merged back into the base weights (W_merged = W + A·B), so there's zero runtime overhead. You can also keep them separate and hot-swap different adapters for different personas or domains.
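The merge claim is easy to check numerically. A small NumPy sketch (dimensions shrunk for speed; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(d, r)) * 0.01   # LoRA down-projection
B = rng.normal(size=(r, d)) * 0.01   # LoRA up-projection
x = rng.normal(size=(d,))

# Separate adapter: the frozen path plus two small matmuls
y_adapter = x @ W + (x @ A) @ B

# Merged: fold A·B into W once, then inference costs exactly one matmul
W_merged = W + A @ B
y_merged = x @ W_merged

print(np.allclose(y_adapter, y_merged))  # True
```

Because matrix multiplication distributes over addition, the merged weight produces the same outputs with zero extra runtime cost, while keeping the adapter separate preserves hot-swapping.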
QLoRA: going further
QLoRA stores the frozen base model in 4-bit NF4 quantization (a non-uniform quantization optimized for normally-distributed weights) while training the LoRA adapters in higher precision. 16-bit → 4-bit = 4× compression. A 70B model goes from ~140 GB to ~35 GB of weights, which fits on a single 48 GB or 80 GB GPU. This is how people fine-tune 70B models on one professional card, and 7B models on a free Colab T4.
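The compression arithmetic is just bits per parameter (the helper name is ours):

```python
def weight_storage_bytes(n_params, bits_per_param):
    # storage for the frozen base weights at a given precision
    return n_params * bits_per_param / 8

print(weight_storage_bytes(70e9, 16) / 1e9)  # 140.0 GB in FP16
print(weight_storage_bytes(70e9, 4) / 1e9)   # 35.0 GB in 4-bit NF4
```

Only the tiny LoRA adapters need gradients and optimizer state, so the 4-bit base dominates the memory bill.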
Concept 7 of 8
The Alignment Problem
Why is all this machinery necessary? Because the pretraining objective — "predict the next token on the internet" — is not the same thing as "be helpful to a specific human asking a specific question". This gap is called the alignment problem.
Pretraining objective
maximize log P(next token | previous tokens) averaged over all web text
What we actually want
produce outputs that a specific human would rate as helpful, honest, and harmless — a target that cannot be written as a differentiable loss
Since we can't directly optimize "be helpful", post-training uses a clever two-step dodge:
SFT: show the model examples of helpful behavior (curated responses). It learns to imitate them.
Preference optimization: show the model pairs of responses and tell it which one humans prefer. It learns to produce the preferred style even on prompts it has never seen.
This works because both steps use the one thing we can compute: the log-probability the model assigns to any given text. SFT increases the log-prob of good responses. Preference optimization increases the log-prob of chosen responses relative to rejected ones.
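The preference step can be made concrete with the DPO objective, which is exactly a function of log-probabilities. A minimal sketch in plain Python (the numbers are made up; real log-probs come from summing token log-probs over each response):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO preference loss: push the policy's log-prob of the chosen
    response up relative to the rejected one, measured against a frozen
    reference model so the policy can't drift arbitrarily far."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy prefers the chosen response more than the reference does: small loss
print(dpo_loss(-5.0, -9.0, ref_chosen=-6.0, ref_rejected=-8.0))
# Policy prefers the rejected response: larger loss
print(dpo_loss(-9.0, -5.0, ref_chosen=-6.0, ref_rejected=-8.0))
```

Notice that nothing here requires a differentiable definition of "helpful": the loss only needs the log-probabilities the model already computes.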
Alignment=Teaching a brilliant intern who has read every book but has never had a job. You can't directly give them "good judgment" — but you can show them examples of good work, then correct them when they make choices you disagree with. Over time, they internalize the style.
Concept 8 of 8
Check Your Understanding
1. Why does a base model respond to "What is the capital of France?" with more questions instead of "Paris"?
Correct. The base model knows the fact — but its objective is "predict the next token", not "answer the question". Post-training reshapes the distribution so Q→A becomes dominant.
2. For a 4096×4096 attention projection matrix, how many trainable parameters does LoRA use with rank r=16 compared to full fine-tuning?
Right. LoRA decomposes ΔW into A (d×r) + B (r×d). Total params = 2·d·r = 2·4096·16 = 131,072. Training ~0.8% of full parameters typically recovers most of the quality.
3. What is the point of special tokens like <end_of_turn> in a chat template?
Correct. Special tokens have unique IDs that never appear in normal text, so the model can learn them as turn boundaries without confusion with content.
4. Why is QLoRA able to fine-tune a 70B model on a single GPU while full FT cannot?
Right. 70B × 16-bit = ~140 GB. 70B × 4-bit = ~35 GB. Since only the tiny LoRA adapters need gradients/optimizer state, the dominant cost becomes just loading the 4-bit base.
5. What is the "alignment problem" in one sentence?
Exactly. "Predict the next token" and "be genuinely helpful" are different objectives; post-training is the set of techniques that bridges them without being able to directly write "helpfulness" as a differentiable loss.
Teach It Back
Explain in your own words: Why can't we skip post-training? What does SFT vs preference optimization each contribute, and why does LoRA exist?
An AI tutor will compare your explanation against the course material.
Score
out of 10
Feedback
Flashcards (click to flip)
Base model vs instruct model — in one sentence?
Click to reveal
A base model is a next-token predictor trained on raw text; an instruct model has been post-trained (SFT + preference optimization) so that user messages become things to respond to rather than text to continue.
What are the two main stages of post-training?
Click to reveal
SFT (supervised fine-tuning on curated instruction-response pairs) to teach capability, then preference optimization (DPO/RLHF) on (prompt, chosen, rejected) triples to teach which responses humans prefer.
LoRA: what, why, and the parameter math?
Click to reveal
Low-Rank Adaptation. Decompose ΔW ≈ A·B with A:(d×r), B:(r×d), r≪d. For d=4096, r=16: 2·4096·16 = 131K trainable params vs 16M for full. Merges at inference for zero overhead.
QLoRA's trick?
Click to reveal
Freeze the base model in 4-bit NF4 quantization; dequantize blocks on-the-fly during forward/backward; train only the higher-precision LoRA adapters. Makes 70B fine-tuning feasible on a single 48-80 GB GPU.
Why do chat templates exist?
Click to reveal
To give the model unambiguous turn boundaries via special tokens (like <end_of_turn>) that never appear in normal text. The model learns to emit these as stop signals; the sampling loop watches for them.
Pre-training vs post-training in a phrase?
Click to reveal
Pre-training = quantity (trillions of tokens, broad knowledge). Post-training = quality (thousands of curated examples that shape behavior). The knowledge is already inside; post-training unlocks it.
Module 18 Complete
You understand the why of post-training. Next: how SFT actually works — instruction masking, data formats, LIMA, and the modern SFT pipeline.