Tulu 3: The most transparent post-training recipe ever released
Ai2's Tulu 3 (November 2024) did for post-training what LLaMA did for open base models: it is a fully documented, fully open, end-to-end post-training pipeline. Every dataset, every training script, every intermediate checkpoint, and every evaluation is released. It builds on LLaMA 3.1 base models and matched GPT-4o-mini on most benchmarks.
OLMo 3 (November 2025) takes it further: a fully open pipeline that also includes the base model (trained on Dolma 3, 9.3T tokens). Together they are the clearest public window into modern post-training, built on three principles:
Transparency. Release everything -- not just weights, but all training data, code, and intermediate checkpoints.
Systematic experimentation. Controlled ablation studies instead of intuition-driven tweaks.
Verifiable improvement. Rigorous evaluation with development/unseen benchmark splits to prevent implicit overfitting.
For years, open post-training lagged behind proprietary methods. Tulu 3 closed that gap -- not with a secret technique but with clean, systematic execution.
Concept 3 of 8
The Four-Stage Pipeline
The pipeline has four stages: (1) data curation, (2) supervised fine-tuning (SFT), (3) preference optimization (DPO), and (4) reinforcement learning with verifiable rewards (RLVR). The sections below walk through what happens in each stage and what it produces.
Concept 4 of 8
Stage 1: Data Curation (939,344 prompts)
Public datasets (57%): WildChat, OpenAssistant, No Robots, FLAN v2, OpenMathInstruct, NuminaMath, Evol-CodeAlpaca, Aya, SciRIFF, TableGPT, Daring-Anteater.
Synthetic data (43%): generated using a persona-driven methodology. ~150K math word problems across difficulty levels, programming challenges across languages, instruction-following tasks with verifiable constraints.
Response generation: for prompts without high-quality responses, they used GPT-4o or Claude 3.5 Sonnet to generate new ones, then filtered out empty responses, model self-references, and weak outputs.
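A minimal sketch of that filtering step. The self-reference patterns and the length threshold here are illustrative assumptions, not the actual filter used in Tulu 3:

```python
import re

# Hypothetical self-reference phrases; the real filter list is more extensive.
SELF_REFERENCE_PATTERNS = [
    r"\bas an ai( language)? model\b",
    r"\bi am (chatgpt|gpt-4o|claude)\b",
]

def keep_response(response: str, min_chars: int = 20) -> bool:
    """Return True if a generated response survives the quality filters."""
    text = response.strip()
    if not text:                # drop empty responses
        return False
    if len(text) < min_chars:   # drop trivially short ("weak") outputs
        return False
    lowered = text.lower()
    if any(re.search(p, lowered) for p in SELF_REFERENCE_PATTERNS):
        return False            # drop model self-references
    return True
```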
Decontamination: 8-gram matching against evaluation benchmarks. Critical finding: 11.3% of NuminaMath-TIR overlapped with MATH evaluation problems. Any matching examples were removed. Without this step, benchmark scores would have been inflated.
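A minimal sketch of 8-gram decontamination, assuming simple whitespace tokenization (real pipelines also normalize punctuation and casing more aggressively):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_prompts: list, eval_texts: list, n: int = 8) -> list:
    """Drop any training prompt that shares an n-gram with an eval benchmark."""
    eval_grams = set().union(*(ngrams(t, n) for t in eval_texts))
    return [p for p in train_prompts if not (ngrams(p, n) & eval_grams)]
```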
Key lesson
Popular public datasets can silently include benchmark content. Decontamination is not optional at frontier quality. Tulu 3 found major overlaps and fixed them. Most closed models probably did not.
Concept 5 of 8
Stages 2-3: SFT and Preference Optimization
SFT: Iterative Data Mixing
The innovation: iterative mixing, not one-shot training. Process:
Compare current model against SOTA on each target skill
Create specialized models (math-only, code-only) to establish upper bounds
Merge successful mixtures, add/remove datasets to fix lagging skills
Remove evaluation overlaps and reduce oversized datasets
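A toy sketch of one mixing iteration. The dataset names, skill scores, targets, and the simple upweighting rule are all illustrative stand-ins for the paper's manual, ablation-driven process:

```python
def adjust_mixture(weights: dict, skill_scores: dict, skill_targets: dict,
                   dataset_for_skill: dict, step: float = 0.1) -> dict:
    """Upweight datasets tied to skills lagging their specialized upper bound."""
    new = dict(weights)
    for skill, score in skill_scores.items():
        if score < skill_targets[skill]:           # skill lags its upper bound
            new[dataset_for_skill[skill]] += step  # add more of its dataset
    total = sum(new.values())
    return {k: v / total for k, v in new.items()}  # renormalize mixture ratios
```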
Ablation findings: synthetic data boosts math and code; removing any single capability dataset degrades that skill without helping others; data mixing ratios matter more than total volume; response quality matters more than prompt quantity.
DPO: On-Policy Preference Data
~300K prompts with preference pairs (5x larger than UltraFeedback's 60K). Process:
Collect diverse prompts
Generate multiple responses per prompt using multiple models, including the Tulu 3 SFT model itself
Score with reward models
Select chosen/rejected pairs
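The four steps above can be sketched as follows; `score_fn` is a stand-in for a real reward model, here just any callable mapping a (prompt, response) pair to a scalar:

```python
def build_preference_pair(prompt: str, responses: list, score_fn) -> dict:
    """Score candidate responses and pick the best/worst as chosen/rejected."""
    ranked = sorted(responses, key=lambda r: score_fn(prompt, r), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```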
On-policy vs off-policy: including responses from the Tulu 3 SFT model (on-policy) improved performance compared to only external model responses (off-policy). The model learns better from its own mistakes.
Algorithm selection: DPO, PPO, and SimPO all performed comparably with proper tuning. Algorithm choice matters less than data quality. Length normalization is critical to prevent rewarding verbosity.
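A toy illustration of why length normalization matters (the log-prob values are invented): summed log-probabilities let response length drive the comparison, while a SimPO-style per-token average ranks on per-token quality alone:

```python
def avg_logprob(token_logprobs: list) -> float:
    """SimPO-style length normalization: mean log-prob per token."""
    return sum(token_logprobs) / len(token_logprobs)

# Toy values: a short response vs a longer one with better per-token fluency.
short = [-1.0, -1.0]     # sum = -2.0, average = -1.0
longer = [-0.5] * 10     # sum = -5.0, average = -0.5

assert sum(short) > sum(longer)                    # sums are dominated by length
assert avg_logprob(longer) > avg_logprob(short)    # averages rank by quality
```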
Concept 6 of 8
Stage 4: RLVR (Reinforcement Learning with Verifiable Rewards)
Replace the neural reward model with a verification function for tasks with objectively correct answers:
# Sketch of the verification function. The helper arguments (ground_truth,
# run_tests, required_pattern) are illustrative; in practice they come from
# each dataset's verification metadata.
import re

def compute_rlvr_reward(prompt, response, task_type,
                        ground_truth=None, run_tests=None, required_pattern=None):
    if task_type == "math":
        # Extract the \boxed{...} answer and compare it to the ground truth
        match = re.search(r"\\boxed\{([^{}]*)\}", response)
        return 1.0 if match and match.group(1).strip() == ground_truth else 0.0
    elif task_type == "code":
        # run_tests executes the unit tests in a sandbox, returning pass/fail flags
        results = run_tests(response)
        return sum(results) / len(results)  # fraction of tests passed
    elif task_type == "format":
        # Check that the response follows the required structure
        return 0.1 if re.search(required_pattern, response) else 0.0
    return 0.0
Benefits: no reward model needed, no reward hacking possible, no human annotation bias.
Training domains: math (exact answer verification) and code (unit test execution).
Infrastructure for the 405B model: 256 GPUs total, 16 for inference (vLLM with tensor parallelism) and 240 for gradient updates. The split matters because RL needs both generation and training to run concurrently.
Results: RLVR provides targeted improvements (math, code) without degrading general capabilities. Key finding: RLVR improves more at larger scales -- the 405B model showed larger gains than 8B or 70B. As base models improve, returns to RL-based post-training may increase.
Concept 7 of 8
Safety, Evaluation, and OLMo 3
Safety throughout the pipeline, not as an afterthought. Three key datasets: harmful request examples, refusal demonstrations, and contrastive benign examples (from WildJailbreak and CoCoNot).
Key finding: safety is orthogonal to capability. Removing safety data affects safety metrics but NOT general benchmarks. This means there is no fundamental trade-off between being safe and being capable -- safety must be explicitly trained but can be added without hurting performance.
Evaluation innovation: dev vs unseen splits. Development benchmarks are used during training to tune data mixtures. Unseen benchmarks are held out entirely and only reported at the end. Prevents implicit benchmark overfitting.
OLMES (Open Language Model Evaluation System) standardizes prompting formats, shot counts, decoding parameters, and evaluation templates. Makes results reproducible across teams.
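One way to picture what OLMES pins down is a frozen per-benchmark config. The field names and default values here are illustrative assumptions, not OLMES's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    """Illustrative OLMES-style standardized evaluation settings."""
    benchmark: str
    num_shots: int             # fixed shot count per benchmark
    prompt_template: str       # fixed prompting format
    temperature: float = 0.0   # greedy decoding for reproducibility
    max_new_tokens: int = 512
```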
OLMo 3: Fully Open Alternative
OLMo 3 goes further than Tulu 3 by opening the base model too:
Pretraining on Dolma 3 (a 9.3T-token pool, with ~5.9T tokens used in the final pretraining run)
Midtraining on Dolmino (100B tokens, math/science/code)
Long context extension on Longmino (50B tokens)
Post-training with Dolci mixes (SFT + DPO + RLVR)
Three variants: OLMo 3-Instruct (standard), OLMo 3-Think (chain-of-thought reasoning; 32B hits 96.2% on MATH), OLMo 3-RL Zero (RL directly on the base model, no SFT, releasing checkpoints for math, code, instruction following, and combined variants).
Efficiency improvements: 8x throughput for SFT, 4x for RL via in-flight weight updates and continuous batching.
Concept 8 of 8
Lessons and Check Your Understanding
Data curation matters most: systematic prompt collection and decontamination are critical.
On-policy preference data beats off-policy: the model learns better from its own generations.
Algorithm choice matters less than data quality: DPO, PPO, SimPO all comparable with proper tuning.
RLVR is the new frontier: verifiable rewards provide targeted improvements without capability degradation.
Safety is orthogonal to capability: can be added without hurting performance.
Decontamination is essential: 11.3% overlap in popular datasets silently inflates benchmarks.
Evaluation must be rigorous: dev/unseen splits prevent implicit overfitting.
Full transparency enables science: releasing everything lets the community verify and build upon results.
1. What are the four stages of the Tulu 3 pipeline?
2. What contamination did Tulu 3 find in a popular public dataset?
Correct: 11.3% of NuminaMath-TIR overlapped with MATH evaluation problems
3. What did Tulu 3 find about on-policy vs off-policy preference data?
Correct: On-policy (generations from the current SFT model) improved final performance -- the model learns better from its own mistakes
4. What makes RLVR resistant to reward hacking?
Correct: The reward is deterministic and rule-based (answer match, unit test pass), not a learned neural network
5. What does "safety is orthogonal to capability" mean in the Tulu 3 context?
Correct: Adding safety training affects safety metrics but does not significantly impact general capability benchmarks -- no fundamental trade-off
Teach It Back
Explain to a friend: What are the four stages of the Tulu 3 post-training pipeline, why does decontamination matter, what does on-policy preference data buy you, what is RLVR, and what surprising finding did Tulu 3 make about safety and capability?
Flashcards
Tulu 3 four-stage pipeline?
1. Data curation (939K prompts, 57% public + 43% synthetic, aggressive decontamination). 2. SFT with iterative data mixing. 3. DPO on ~300K preference pairs, on-policy data included. 4. RLVR on math and code with verifiable rewards.
Tulu 3's contamination finding?
11.3% of NuminaMath-TIR overlapped with MATH benchmark problems via 8-gram matching. This was silently inflating evaluation scores. Required removal before training to get honest numbers.
On-policy vs off-policy preference data?
On-policy: preference pairs generated by the current SFT model itself. Off-policy: only pairs from external models. Tulu 3 found on-policy consistently improved over off-policy alone. The model learns better from its own failure modes.
What is RLVR?
Reinforcement Learning with Verifiable Rewards. Replaces neural reward model with deterministic rules: answer match for math, unit tests for code, format checks. No reward hacking possible. Targeted gains without general capability degradation.
Safety is orthogonal to capability -- what does this mean?
Tulu 3 ablation showed that removing safety data affected safety metrics but did NOT significantly change general benchmarks. No fundamental trade-off: you can add safety training without paying a capability tax.
OLMo 3 three variants?
OLMo 3-Instruct (standard SFT+DPO+RLVR), OLMo 3-Think (extended chain-of-thought; 32B hits 96.2% on MATH), OLMo 3-RL Zero (RL directly on base model with no SFT, producing math/code/instruction-following checkpoints).