Why is training data 80% of the work?

Think first
You download 100 TB of Common Crawl web data. Before training, you need to decide what to keep and what to throw away. What are the top three problems with raw web text that could poison your model if you don't fix them?

Raw web text is mostly garbage: boilerplate HTML, duplicate copies of the same article, SEO spam, auto-generated pages, non-target languages, and harmful content. Letting it all through produces a worse model than training on the 10% that survives heavy filtering. Modern labs routinely reject 80-95% of raw crawl data.

Key insight

Meta's LLaMA 3 team described data quality classifiers as their single biggest lever. They reported larger gains from better filtering than from any architectural change.

The Data Landscape

Major sources that modern LLMs draw from include web crawls (e.g. Common Crawl), code repositories, books, scientific papers, and multilingual text.

Interactive: Data Pipeline Visualization

Watch a stream of documents pass through each filter stage. Click to advance.

The Curation Pipeline

  1. Text extraction: strip HTML, boilerplate, nav bars, ads. Tools like trafilatura or custom extractors.
  2. Language identification: detect and filter out unintended languages with fastText or cld3.
  3. Quality filtering: heuristics (sentence length, punctuation ratio, stop-word density, repetition) followed by model-based classifiers trained on (good, bad) examples.
  4. Content filtering: remove toxic and NSFW content, PII (personally identifiable information), and benchmark-contaminated documents.
  5. Deduplication: remove exact and near-duplicate documents.

Each stage typically discards 30-70% of the input.
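The five stages above can be sketched as a chain of filters over a stream of documents. This is a toy, single-machine sketch: `extract_text`, `is_english`, and `passes_quality` are crude stand-ins for trafilatura, a fastText/cld3 classifier, and a trained quality model, and the dedup step here handles exact duplicates only (near-duplicates come next).

```python
import hashlib
import re

def extract_text(doc):
    # crude stand-in for trafilatura-style HTML extraction
    return re.sub(r"<[^>]+>", " ", doc).strip()

def is_english(text):
    # toy stand-in for a fastText/cld3 language classifier
    stopwords = {"the", "and", "of", "to", "a", "in", "is"}
    words = text.lower().split()
    return bool(words) and sum(w in stopwords for w in words) / len(words) > 0.05

def passes_quality(text):
    # toy heuristics: minimum length and low repetition
    words = text.split()
    return len(words) >= 5 and len(set(words)) / len(words) > 0.3

def curate(raw_docs):
    seen = set()  # exact dedup by content hash
    for doc in raw_docs:
        text = extract_text(doc)
        if not is_english(text) or not passes_quality(text):
            continue
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield text
```

Real pipelines run these stages as distributed jobs over billions of documents, but the control flow is the same: each stage sees only what survived the previous one.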

Deduplication: MinHash + LSH

Exact duplicates are easy -- hash each document. But there are often hundreds of near-duplicates: the same article republished with minor edits, a Wikipedia page scraped from 50 mirrors, a blog post with a different footer. These near-duplicates must be removed or the model memorizes them.

MinHash gives each document a fingerprint of ~128 hash minima over its n-gram shingles. Two documents that share many shingles share many minima, so Jaccard similarity is well approximated by the fraction of matching MinHash values. LSH (locality-sensitive hashing) buckets the signatures so only probably-similar documents get compared, turning O(N^2) all-pairs comparison into roughly O(N).
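A minimal from-scratch sketch of the idea, using word 3-gram shingles and a per-index salted hash in place of true random permutations (libraries like datasketch do this properly and at scale):

```python
import hashlib

def shingles(text, n=3):
    # word n-gram shingles; character shingles are also common
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(doc_shingles, num_perm=128):
    # salt one hash function per "permutation" index; keep each minimum
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}|{s}".encode(), digest_size=8).digest(),
                "big")
            for s in doc_shingles))
    return sig

def est_jaccard(sig_a, sig_b):
    # fraction of matching minima approximates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_bands(sig, bands=32):
    # split the signature into bands; docs sharing any band become candidates
    rows = len(sig) // bands
    return [tuple(sig[b * rows:(b + 1) * rows]) for b in range(bands)]
```

Two documents are compared in full only if at least one of their bands collides, which is what collapses the all-pairs comparison to near-linear work.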

Interactive: MinHash Deduplication Demo

Adjust the Jaccard threshold to see which pairs of documents get flagged as duplicates.

Why this matters

The C4 dataset (used for T5) had 13% of documents near-duplicated. After deduplication, models trained on the same number of tokens generalized better. GPT-NeoX found deduplication worth several percentage points on downstream benchmarks.

Quality Filtering: Heuristics and Classifiers

Traditional quality heuristics (from Gopher, CC-Net, MassiveWeb) include document length bounds, mean word length, symbol-to-word ratio, stop-word presence, and repetition rate.

These catch most spam but miss subtler problems. Modern pipelines add a model-based classifier: train a small model (fastText or BERT-base) to distinguish "high quality" documents (Wikipedia, books, reference articles) from "low quality" ones (random Common Crawl pages). At inference time, keep only documents scoring above a threshold.
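As one illustration, a few Gopher-style heuristic checks might look like this (the thresholds below are illustrative, not the published Gopher values):

```python
def gopher_style_checks(text):
    # returns (passed, reason); thresholds are illustrative only
    words = text.split()
    if not 50 <= len(words) <= 100_000:
        return False, "doc length"
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:
        return False, "mean word length"
    lines = text.splitlines()
    if sum(l.rstrip().endswith("...") for l in lines) / len(lines) > 0.3:
        return False, "ellipsis lines"
    if sum(c in "#{}[]<>|" for c in text) / len(text) > 0.1:
        return False, "symbol ratio"
    return True, "ok"
```

Returning the failing rule's name, as here, makes it easy to audit what each heuristic is discarding before committing to a threshold.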

Interactive: Quality Filter Explorer

Click a sample document to see which heuristics pass or fail.

Data Mixing: The Recipe Matters More Than the Volume

You have a pile of web data, a pile of code, a pile of scientific papers, and a pile of books. What ratio should you use in the training mix? This decision has an outsized effect on downstream performance.
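One way to realize a chosen recipe is weighted sampling over per-domain corpora. A minimal sketch, using a LLaMA 3-style split (~50% general, 25% math/reasoning, 17% code) purely as example weights:

```python
import random

def sample_mix(corpora, weights, n, seed=0):
    # corpora: name -> list of documents; weights: name -> relative proportion
    rng = random.Random(seed)
    names = list(weights)
    w = [weights[k] for k in names]
    batch = []
    for _ in range(n):
        src = rng.choices(names, weights=w)[0]  # pick a domain by weight
        batch.append((src, rng.choice(corpora[src])))
    return batch
```

In practice the mix is applied at the token level during training-shard construction, not per document, but the proportional-sampling logic is the same.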

Interactive: Data Mixing Slider

Adjust mix proportions to see how they compare to LLaMA 3's mix.

Multilingual Data: Temperature Sampling

If you sample multilingual data in proportion to its availability, English dwarfs everything: the model becomes great at English and poor at French, Hindi, and Swahili. The fix is temperature sampling: raise each language's sampling probability to a power T < 1, then renormalize. T = 1 is natural sampling (English dominates); T = 0.3 is a common value that boosts low-resource languages without hurting high-resource ones too much.

p_lang = (count_lang / total)^T / sum_l (count_l / total)^T
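The formula translates directly to code:

```python
def temperature_sample_probs(counts, T=0.3):
    # counts: language -> token count; returns per-language sampling probabilities
    total = sum(counts.values())
    scaled = {lang: (c / total) ** T for lang, c in counts.items()}
    z = sum(scaled.values())  # renormalize so probabilities sum to 1
    return {lang: s / z for lang, s in scaled.items()}
```

With T = 1 the probabilities match the raw counts; lowering T flattens the distribution toward uniform, so rare languages gain probability mass at the expense of dominant ones.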

Contemporary Challenges

Data work now contends with scarcity (high-quality public text is running out), the risks of training on synthetic data, multi-epoch training on limited corpora, benchmark contamination, copyright and attribution disputes, and the need for detailed source tracking.

Real-world cost

For a frontier training run, data engineering consumes 50-80% of the team's calendar time. Architecture work is often just tweaking known-good recipes.

Quality vs Quantity: The Modern Consensus

For years the mantra was "more data is always better". As models scaled past 1T tokens, this broke down. The key findings: heavily filtered smaller datasets beat raw larger ones, and datasets like FineWeb and RefinedWeb showed that aggressively filtered Common Crawl alone can rival hand-curated mixes.

Check Your Understanding

1. Why is deduplication so important?
Correct: Near-duplicates are memorized by the model, hurting generalization and wasting training compute
2. What is MinHash used for?
Correct: Approximating Jaccard similarity between documents so near-duplicates can be found in near-linear time
3. Why oversample code in a general-purpose LLM training mix?
Correct: Code improves reasoning and structure even on non-code tasks
4. What problem does temperature sampling solve for multilingual data?
Correct: It balances languages so low-resource languages are not drowned out by English
5. What is benchmark contamination?
Correct: Test set examples leaking into training data, making benchmark scores meaningless

Teach It Back

Explain to a friend: Walk through the modern LLM data pipeline from raw web crawl to training-ready tokens. Cover extraction, language ID, quality filtering, deduplication (including MinHash/LSH), data mixing, and why quality matters more than quantity today.

An AI tutor will compare your explanation against the course material.


Flashcards (click to flip)

What is the typical curation pipeline?
Click to reveal
1. Text extraction (strip HTML/boilerplate). 2. Language ID. 3. Quality filtering (heuristics + model classifier). 4. Content filtering (toxic/PII/contamination). 5. Deduplication (exact + near via MinHash/LSH). Each stage drops 30-70% of input.
How does MinHash/LSH deduplication work?
Click to reveal
MinHash produces a 128-hash fingerprint per document from its n-gram shingles. Jaccard similarity approx fraction of matching MinHash values. LSH buckets signatures so only similar pairs get compared, reducing O(N^2) to near-linear.
Why mix domains instead of using pure web data?
Click to reveal
Different domains teach different skills. Code boosts reasoning; math content boosts quantitative benchmarks; books improve long-form coherence. Empirical mixtures like LLaMA 3's ~50% general, 25% math/reasoning, 17% code outperform pure web.
What is temperature sampling for multilingual data?
Click to reveal
p_lang = (count_lang / total)^T / sum_l (count_l / total)^T. T < 1 boosts low-resource languages; T = 0.3 is common. Prevents English from drowning out smaller languages in training.
What are the contemporary data challenges?
Click to reveal
Scarcity (running out of high-quality public text), synthetic data risks, multi-epoch training, benchmark contamination (~11% overlap in NuminaMath), copyright/attribution, and the need for detailed source tracking.
Quality vs quantity: what wins today?
Click to reveal
Quality. Heavily filtered smaller datasets beat raw larger ones. FineWeb and RefinedWeb showed filtered Common Crawl alone rivals curated mixes. Labs routinely discard 80-95% of raw data.