You download 100 TB of Common Crawl web data. Before training, you need to decide what to keep and what to throw away. What are the top three problems with raw web text that could poison your model if you don't fix them?
Raw web text is mostly garbage. The three biggest problems: (1) boilerplate HTML, SEO spam, and auto-generated pages; (2) duplicate and near-duplicate copies of the same content; (3) non-target languages and harmful content. Letting these through produces worse models than training on the best 10% of the data after heavy filtering. Modern labs routinely reject 80-95% of raw crawl data.
Key insight
Meta's LLaMA 3 team described data quality classifiers as their single biggest lever. They reported larger gains from better filtering than from any architectural change.
Concept 2 of 10
The Data Landscape
Major sources that modern LLMs draw from:
Common Crawl -- monthly free crawl of the web, raw HTML. Unfiltered it is ~petabytes; filtered pipelines yield ~trillions of tokens.
WebText / WebText2 (OpenAI) -- curated from outbound links with at least 3 Reddit karma. WebText underpinned GPT-2; WebText2 fed into GPT-3's mix.
The Pile (EleutherAI) -- 825 GB of diverse sources: arXiv, GitHub, Books3, Wikipedia, and more.
RedPajama / SlimPajama -- open reproductions of LLaMA's data mix.
RefinedWeb (Falcon) and FineWeb (HuggingFace) -- showed that heavily filtered Common Crawl alone can rival curated mixes.
Concept 3 of 10
The Curation Pipeline
From raw crawl to training-ready text, documents pass through a standard sequence:
Text extraction: strip HTML, navigation, and boilerplate.
Language identification: keep only target languages.
Quality filtering: heuristics plus a model-based classifier.
Content filtering: remove toxic content, PII, and benchmark contamination.
Deduplication: remove exact and near-duplicate documents.
Each stage typically discards 30-70% of the input.
Concept 4 of 10
Deduplication: MinHash + LSH
Exact duplicates are easy -- hash each document. But there are often hundreds of near-duplicates: the same article republished with minor edits, a Wikipedia page scraped from 50 mirrors, a blog post with a different footer. These near-duplicates must be removed or the model memorizes them.
MinHash gives each document a fingerprint of ~128 hash minima over its n-gram shingles. Two documents that share many shingles share many minima, so Jaccard similarity is well approximated by the fraction of matching MinHash values. LSH (Locality-Sensitive Hashing) buckets the signatures so only probably-similar documents are compared, reducing the O(N^2) all-pairs comparison to near-linear time.
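The whole scheme fits in a few functions. A minimal stdlib sketch: 128 salted MD5 hashes stand in for independent hash functions, and the 3-word shingle size and 32x4 band/row split are illustrative choices, not a production recipe.

```python
import hashlib
from collections import defaultdict

def shingles(text, n=3):
    """Break a document into word n-gram shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(doc_shingles, num_hashes=128):
    """Keep the minimum of each 'hash function'; salting one stable
    hash with an index simulates num_hashes independent functions."""
    return [
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles)
        for i in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching minima approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_buckets(signatures, bands=32, rows=4):
    """Documents landing in the same bucket (all rows of some band
    equal) become candidate pairs; everything else is never compared."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    return buckets
```

Near-duplicates agree on most minima, so they almost surely collide in at least one band; unrelated documents rarely do, which is what makes the pairwise comparison step cheap.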
Interactive: MinHash Deduplication Demo
Adjust the Jaccard threshold to see which pairs of documents get flagged as duplicates.
Why this matters
The C4 dataset (used for T5) had 13% of documents near-duplicated. After deduplication, models trained on the same number of tokens generalized better. GPT-NeoX found deduplication worth several percentage points on downstream benchmarks.
Concept 5 of 10
Quality Filtering: Heuristics and Classifiers
Traditional quality heuristics (from Gopher, CC-Net, MassiveWeb):
Document length between 50 and 100,000 words
Mean word length 3 to 10 characters
At least 80% lines ending in terminal punctuation
Fraction of repeated lines < 30%
Symbol-to-word ratio < 10%
Contains at least 2 of the top 50 English stop-words
These catch most spam but miss subtler problems. Modern pipelines add a model-based classifier: train a small model (fastText or BERT-base) to distinguish "high quality" text (Wikipedia, books, reference articles) from "low quality" text (random Common Crawl documents). At inference time, keep only documents scoring above a threshold.
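The heuristics above translate almost directly into code. A sketch with the thresholds from the list and a small illustrative stop-word set (real Gopher/CC-Net rules use larger lists and carefully tuned constants):

```python
import re

# Illustrative subset of common English stop-words
STOPWORDS = {"the", "be", "to", "of", "and", "a", "in", "that", "have",
             "it", "for", "not", "on", "with", "as", "you", "do", "this"}

def passes_heuristics(doc: str) -> bool:
    """Apply the rule-based quality filters; True means 'keep'."""
    words = doc.split()
    lines = [l for l in doc.splitlines() if l.strip()]
    if not words or not lines:
        return False
    # Document length between 50 and 100,000 words
    if not 50 <= len(words) <= 100_000:
        return False
    # Mean word length 3 to 10 characters
    if not 3 <= sum(len(w) for w in words) / len(words) <= 10:
        return False
    # At least 80% of lines end in terminal punctuation
    terminal = sum(l.rstrip().endswith((".", "!", "?", '"')) for l in lines)
    if terminal / len(lines) < 0.8:
        return False
    # Fraction of repeated lines < 30%
    if 1 - len(set(lines)) / len(lines) >= 0.3:
        return False
    # Symbol-to-word ratio < 10%
    if len(re.findall(r"[#@{}<>|\\^~]", doc)) / len(words) >= 0.10:
        return False
    # Contains at least 2 distinct common stop-words
    seen = {w.lower().strip(".,!?") for w in words}
    return len(STOPWORDS & seen) >= 2
```

Each rule is cheap enough to run on billions of documents, which is why heuristics come before the (more expensive) model-based classifier.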
Interactive: Quality Filter Explorer
Click a sample document to see which heuristics pass or fail.
Concept 6 of 10
Data Mixing: The Recipe Matters More Than the Volume
You have a pile of web data, a pile of code, a pile of scientific papers, and a pile of books. What ratio should you use in the training mix? This decision has an outsized effect on downstream performance.
Code oversampling: code is ~2% of the web but ~10-20% of modern training mixes. Code improves reasoning benchmarks even for non-code tasks.
Upsampling high quality: Wikipedia, arXiv, books get repeated multiple epochs within a single training run.
DoReMi (Xie et al. 2023) -- automatic mixture optimization via distributionally robust optimization. Learns which domains the model is weakest on and boosts them.
Dynamic mixing: LLaMA 3 adjusted the mixture during training -- more math and reasoning content late in training.
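In its simplest static form, mixing is just weighted sampling of the next document's domain. A toy sketch; the weights echo the approximate LLaMA 3-style proportions quoted in this lesson (~50% general, 25% math/reasoning, 17% code), and the "multilingual" remainder is an illustrative assumption:

```python
import random

# Illustrative mixture weights, not published ratios
MIX = {"general_web": 0.50, "math_reasoning": 0.25,
       "code": 0.17, "multilingual": 0.08}

def sample_domain(rng):
    """Pick the domain of the next training document by inverse CDF."""
    r = rng.random()
    cum = 0.0
    for domain, weight in MIX.items():
        cum += weight
        if r < cum:
            return domain
    return domain  # guard against floating-point rounding

rng = random.Random(0)
counts = {d: 0 for d in MIX}
for _ in range(100_000):
    counts[sample_domain(rng)] += 1
# counts now tracks the realized mix, close to the target weights
```

Approaches like DoReMi replace the fixed MIX table with learned weights, and dynamic mixing changes the table over the course of training.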
Interactive: Data Mixing Slider
Adjust mix proportions to see how they compare to LLaMA 3's mix.
Concept 7 of 10
Multilingual Data: Temperature Sampling
If you sample multilingual data in proportion to its availability, English dwarfs everything. The model becomes great at English and poor at French, Hindi, Swahili. The fix is temperature sampling: raise the per-language probability to a power T < 1, then renormalize. T = 1 is natural sampling (English dominates). T = 0.3 is a common value that boosts low-resource languages without hurting high-resource ones too much.
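The rule is one line of arithmetic. A sketch with hypothetical per-language counts:

```python
def temperature_weights(counts, T=0.3):
    """p_lang is proportional to count_lang ** T, renormalized.
    T = 1 reproduces natural sampling; T < 1 flattens the mix."""
    powered = {lang: count ** T for lang, count in counts.items()}
    total = sum(powered.values())
    return {lang: p / total for lang, p in powered.items()}

# Hypothetical document counts per language
counts = {"en": 1_000_000, "fr": 100_000, "sw": 1_000}

natural = temperature_weights(counts, T=1.0)    # en dominates: ~0.91
flattened = temperature_weights(counts, T=0.3)  # en shrinks to ~0.61
```

Note the ordering is preserved: English still gets the largest share at T = 0.3, but Swahili's share rises by an order of magnitude, which is exactly the "boost without inversion" behavior described above.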
Concept 8 of 10
Contemporary Data Challenges
Data scarcity: high-quality public text on the web is finite, and frontier models are already consuming most of it. Estimates suggest labs will run out of fresh high-quality text around 2026-2028.
Synthetic data: use a strong model to generate new training examples (math problems, code, dialogues). Used heavily by Phi models, DeepSeek, and Tulu 3. Risks: distribution collapse, factual errors propagating.
Multi-epoch training: with scarcity, labs are re-reading the same data multiple times. Recent work shows 4-8 epochs are fine for high-quality subsets without overfitting.
Benchmark contamination: test sets leak into training data via web crawls. A 2024 study found ~11% of NuminaMath overlapped with MATH benchmark problems.
Copyright and attribution: legal landscape is unsettled. Labs now maintain detailed source tracking.
Real-world cost
For a frontier training run, data engineering consumes 50-80% of the team's calendar time. Architecture work is often just tweaking known-good recipes.
Concept 9 of 10
Quality vs Quantity: The Modern Consensus
For years the mantra was "more data is always better". As models scaled past 1T tokens, this broke down. Findings:
Effective tokens are what matter, not raw tokens. A filtered high-quality trillion beats a raw 5T.
Perplexity filtering (keep only documents the model finds "normal") improves downstream benchmarks.
Model-based quality scoring (trained on good/bad examples) outperforms every heuristic.
You can afford to throw away 80%+ of raw data and still come out ahead.
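Production perplexity filtering uses a small n-gram LM (CC-Net, for instance, scores documents with a KenLM model trained on Wikipedia). As a toy stand-in, here is a smoothed unigram model over a tiny reference corpus; the corpus, smoothing constant, and example documents are all illustrative:

```python
import math
from collections import Counter

def unigram_perplexity(doc, ref_counts, ref_total, alpha=1.0):
    """Perplexity of doc under an add-alpha smoothed unigram model
    built from a reference corpus of 'normal' text."""
    vocab = len(ref_counts) + 1  # +1 for the unseen-word bucket
    words = doc.lower().split()
    logp = sum(
        math.log((ref_counts.get(w, 0) + alpha) / (ref_total + alpha * vocab))
        for w in words
    )
    return math.exp(-logp / max(len(words), 1))

# Tiny reference corpus standing in for Wikipedia
ref = "the cat sat on the mat the dog sat on the rug".split()
ref_counts = Counter(ref)
ref_total = len(ref)

# Keep documents whose perplexity falls below a chosen threshold;
# gibberish and spam score far higher than in-distribution text.
```

The threshold is a tuning knob: too aggressive and you discard unusual-but-valuable text (code, poetry, other languages), which is one reason perplexity filtering is combined with, rather than substituted for, the classifier-based scoring above.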
Concept 10 of 10
Check Your Understanding
1. Why is deduplication so important?
Correct: Near-duplicates are memorized by the model, hurting generalization and wasting training compute
2. What is MinHash used for?
Correct: Approximating Jaccard similarity between documents so near-duplicates can be found in near-linear time
3. Why oversample code in a general-purpose LLM training mix?
Correct: Code improves reasoning and structure even on non-code tasks
4. What problem does temperature sampling solve for multilingual data?
Correct: It balances languages so low-resource languages are not drowned out by English
5. What is benchmark contamination?
Correct: Test set examples leaking into training data, making benchmark scores meaningless
Teach It Back
Explain to a friend: Walk through the modern LLM data pipeline from raw web crawl to training-ready tokens. Cover extraction, language ID, quality filtering, deduplication (including MinHash/LSH), data mixing, and why quality matters more than quantity today.
An AI tutor will compare your explanation against the course material.
Flashcards (click to flip)
What is the typical curation pipeline?
Click to reveal
1. Text extraction (strip HTML/boilerplate). 2. Language ID. 3. Quality filtering (heuristics + model classifier). 4. Content filtering (toxic/PII/contamination). 5. Deduplication (exact + near via MinHash/LSH). Each stage drops 30-70% of input.
How does MinHash/LSH deduplication work?
Click to reveal
MinHash produces a 128-hash fingerprint per document from its n-gram shingles. Jaccard similarity is approximately the fraction of matching MinHash values. LSH buckets signatures so only similar pairs get compared, reducing O(N^2) to near-linear.
Why mix domains instead of using pure web data?
Click to reveal
Different domains teach different skills. Code boosts reasoning; math content boosts quantitative benchmarks; books improve long-form coherence. Empirical mixtures like LLaMA 3's ~50% general, 25% math/reasoning, 17% code outperform pure web.
What is temperature sampling for multilingual data?
Click to reveal
p_lang = count_lang^T / sum_l(count_l^T). T < 1 boosts low-resource languages; T = 0.3 is common. Prevents English from drowning out smaller languages in training.
What are the contemporary data challenges?
Click to reveal
Scarcity (running out of high-quality public text), synthetic data risks, multi-epoch training, benchmark contamination (~11% overlap in NuminaMath), copyright/attribution, and the need for detailed source tracking.
Quality vs quantity: what wins today?
Click to reveal
Quality. Heavily filtered smaller datasets beat raw larger ones. FineWeb and RefinedWeb showed filtered Common Crawl alone rivals curated mixes. Labs routinely discard 80-95% of raw data.