You download 100 TB of Common Crawl web data. Before training, you need to decide what to keep and what to throw away. What are the top three problems with raw web text that could poison your model if you don't fix them?
Raw web text is mostly garbage. The three biggest problems: (1) boilerplate HTML, SEO spam, and auto-generated pages; (2) duplicate and near-duplicate copies of the same content; (3) non-target languages and harmful content. Letting these through produces worse models than training on the best 10% of the data after heavy filtering. Modern labs routinely reject 80-95% of raw crawl data.
Key insight
Meta's LLaMA 3 team described data quality classifiers as their single biggest lever. They reported larger gains from better filtering than from any architectural change.
Concept 2 of 10
The Data Landscape
Major sources that modern LLMs draw from:
Common Crawl -- monthly free crawl of the web, raw HTML. Unfiltered it is ~petabytes; filtered pipelines yield ~trillions of tokens.
WebText / WebText2 (OpenAI) -- curated from outbound links with at least 3 Reddit karma. WebText underpinned GPT-2; WebText2 fed into GPT-3's mix.
The Pile (EleutherAI) -- 825 GB of diverse sources: arXiv, GitHub, Books3, Wikipedia, and more.
RedPajama / SlimPajama -- open reproductions of LLaMA's data mix.
RefinedWeb (Falcon) and FineWeb (HuggingFace) -- showed that heavily filtered Common Crawl alone can rival curated mixes.
Concept 3 of 10
The Curation Pipeline
From raw crawl to training-ready text, documents pass through a standard sequence:
Text extraction: strip HTML, navigation, and boilerplate.
Language identification: keep only target languages.
Quality filtering: heuristics plus a model-based classifier.
Content filtering: remove toxic content, PII, and benchmark contamination.
Deduplication: remove exact and near-duplicate documents.
Each stage typically discards 30-70% of the input.
Concept 4 of 10
Deduplication: MinHash + LSH
Exact duplicates are easy -- hash each document. But there are often hundreds of near-duplicates: the same article republished with minor edits, a Wikipedia page scraped from 50 mirrors, a blog post with a different footer. These near-duplicates must be removed or the model memorizes them.
MinHash gives each document a fingerprint of ~128 hash minima over its n-gram shingles. Two documents that share many shingles share many minima, so Jaccard similarity is well approximated by the fraction of matching MinHash values. LSH (Locality-Sensitive Hashing) buckets the signatures so only probably-similar documents are compared, reducing the O(N^2) all-pairs comparison to near-linear time.
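The whole scheme fits in a few functions. A minimal stdlib sketch: 128 salted MD5 hashes stand in for independent hash functions, and the 3-word shingle size and 32x4 band/row split are illustrative choices, not a production recipe.

```python
import hashlib
from collections import defaultdict

def shingles(text, n=3):
    """Break a document into word n-gram shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(doc_shingles, num_hashes=128):
    """Keep the minimum of each 'hash function'; salting one stable
    hash with an index simulates num_hashes independent functions."""
    return [
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles)
        for i in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching minima approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_buckets(signatures, bands=32, rows=4):
    """Documents landing in the same bucket (all rows of some band
    equal) become candidate pairs; everything else is never compared."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    return buckets
```

Near-duplicates agree on most minima, so they almost surely collide in at least one band; unrelated documents rarely do, which is what makes the pairwise comparison step cheap.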
Interactive: MinHash Deduplication Demo
Adjust the Jaccard threshold to see which pairs of documents get flagged as duplicates.
Why this matters
The C4 dataset (used for T5) had 13% of documents near-duplicated. After deduplication, models trained on the same number of tokens generalized better. GPT-NeoX found deduplication worth several percentage points on downstream benchmarks.
Concept 5 of 10
Quality Filtering: Heuristics and Classifiers
Traditional quality heuristics (from Gopher, CC-Net, MassiveWeb):
Document length between 50 and 100,000 words
Mean word length 3 to 10 characters
At least 80% lines ending in terminal punctuation
Fraction of repeated lines < 30%
Symbol-to-word ratio < 10%
Contains at least 2 of the top 50 English stop-words
These catch most spam but miss subtler problems. Modern pipelines add a model-based classifier: train a small model (fastText or BERT-base) to distinguish "high quality" text (Wikipedia, books, reference articles) from "low quality" text (random Common Crawl documents). At inference time, keep only documents scoring above a threshold.
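The heuristics above translate almost directly into code. A sketch with the thresholds from the list and a small illustrative stop-word set (real Gopher/CC-Net rules use larger lists and carefully tuned constants):

```python
import re

# Illustrative subset of common English stop-words
STOPWORDS = {"the", "be", "to", "of", "and", "a", "in", "that", "have",
             "it", "for", "not", "on", "with", "as", "you", "do", "this"}

def passes_heuristics(doc: str) -> bool:
    """Apply the rule-based quality filters; True means 'keep'."""
    words = doc.split()
    lines = [l for l in doc.splitlines() if l.strip()]
    if not words or not lines:
        return False
    # Document length between 50 and 100,000 words
    if not 50 <= len(words) <= 100_000:
        return False
    # Mean word length 3 to 10 characters
    if not 3 <= sum(len(w) for w in words) / len(words) <= 10:
        return False
    # At least 80% of lines end in terminal punctuation
    terminal = sum(l.rstrip().endswith((".", "!", "?", '"')) for l in lines)
    if terminal / len(lines) < 0.8:
        return False
    # Fraction of repeated lines < 30%
    if 1 - len(set(lines)) / len(lines) >= 0.3:
        return False
    # Symbol-to-word ratio < 10%
    if len(re.findall(r"[#@{}<>|\\^~]", doc)) / len(words) >= 0.10:
        return False
    # Contains at least 2 distinct common stop-words
    seen = {w.lower().strip(".,!?") for w in words}
    return len(STOPWORDS & seen) >= 2
```

Each rule is cheap enough to run on billions of documents, which is why heuristics come before the (more expensive) model-based classifier.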
Interactive: Quality Filter Explorer
Click a sample document to see which heuristics pass or fail.
Concept 6 of 10
Data Mixing: The Recipe Matters More Than the Volume
You have a pile of web data, a pile of code, a pile of scientific papers, and a pile of books. What ratio should you use in the training mix? This decision has an outsized effect on downstream performance.
Code oversampling: code is ~2% of the web but ~10-20% of modern training mixes. Code improves reasoning benchmarks even for non-code tasks.
Upsampling high quality: Wikipedia, arXiv, books get repeated multiple epochs within a single training run.
DoReMi (Xie et al. 2023) -- automatic mixture optimization via distributionally robust optimization. Learns which domains the model is weakest on and boosts them.
Dynamic mixing: LLaMA 3 adjusted the mixture during training -- more math and reasoning content late in training.
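In its simplest static form, mixing is just weighted sampling of the next document's domain. A toy sketch; the weights echo the approximate LLaMA 3-style proportions quoted in this lesson (~50% general, 25% math/reasoning, 17% code), and the "multilingual" remainder is an illustrative assumption:

```python
import random

# Illustrative mixture weights, not published ratios
MIX = {"general_web": 0.50, "math_reasoning": 0.25,
       "code": 0.17, "multilingual": 0.08}

def sample_domain(rng):
    """Pick the domain of the next training document by inverse CDF."""
    r = rng.random()
    cum = 0.0
    for domain, weight in MIX.items():
        cum += weight
        if r < cum:
            return domain
    return domain  # guard against floating-point rounding

rng = random.Random(0)
counts = {d: 0 for d in MIX}
for _ in range(100_000):
    counts[sample_domain(rng)] += 1
# counts now tracks the realized mix, close to the target weights
```

Approaches like DoReMi replace the fixed MIX table with learned weights, and dynamic mixing changes the table over the course of training.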
Interactive: Data Mixing Slider
Adjust mix proportions to see how they compare to LLaMA 3's mix.
Concept 7 of 10
Multilingual Data: Temperature Sampling
If you sample multilingual data in proportion to its availability, English dwarfs everything. The model becomes great at English and poor at French, Hindi, Swahili. The fix is temperature sampling: raise the per-language probability to a power T < 1, then renormalize. T = 1 is natural sampling (English dominates). T = 0.3 is a common value that boosts low-resource languages without hurting high-resource ones too much.
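The rule is one line of arithmetic. A sketch with hypothetical per-language counts:

```python
def temperature_weights(counts, T=0.3):
    """p_lang is proportional to count_lang ** T, renormalized.
    T = 1 reproduces natural sampling; T < 1 flattens the mix."""
    powered = {lang: count ** T for lang, count in counts.items()}
    total = sum(powered.values())
    return {lang: p / total for lang, p in powered.items()}

# Hypothetical document counts per language
counts = {"en": 1_000_000, "fr": 100_000, "sw": 1_000}

natural = temperature_weights(counts, T=1.0)    # en dominates: ~0.91
flattened = temperature_weights(counts, T=0.3)  # en shrinks to ~0.61
```

Note the ordering is preserved: English still gets the largest share at T = 0.3, but Swahili's share rises by an order of magnitude, which is exactly the "boost without inversion" behavior described above.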
Concept 8 of 10
Contemporary Data Challenges
Data scarcity: high-quality public text on the web is finite, and frontier models are already consuming most of it. Estimates suggest labs will run out of fresh high-quality text around 2026-2028.
Synthetic data: use a strong model to generate new training examples (math problems, code, dialogues). Used heavily by Phi models, DeepSeek, and Tulu 3. Risks: distribution collapse, factual errors propagating.
Multi-epoch training: with scarcity, labs are re-reading the same data multiple times. Recent work shows 4-8 epochs are fine for high-quality subsets without overfitting.
Benchmark contamination: test sets leak into training data via web crawls. A 2024 study found ~11% of NuminaMath overlapped with MATH benchmark problems.
Copyright and attribution: legal landscape is unsettled. Labs now maintain detailed source tracking.
Real-world cost
For a frontier training run, data engineering consumes 50-80% of the team's calendar time. Architecture work is often just tweaking known-good recipes.
Concept 9 of 10
Quality vs Quantity: The Modern Consensus
For years the mantra was "more data is always better". As models scaled past 1T tokens, this broke down. Findings:
Effective tokens are what matter, not raw tokens. A filtered high-quality trillion beats a raw 5T.
Perplexity filtering (keep only documents the model finds "normal") improves downstream benchmarks.
Model-based quality scoring (trained on good/bad examples) outperforms every heuristic.
You can afford to throw away 80%+ of raw data and still come out ahead.
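Production perplexity filtering uses a small n-gram LM (CC-Net, for instance, scores documents with a KenLM model trained on Wikipedia). As a toy stand-in, here is a smoothed unigram model over a tiny reference corpus; the corpus, smoothing constant, and example documents are all illustrative:

```python
import math
from collections import Counter

def unigram_perplexity(doc, ref_counts, ref_total, alpha=1.0):
    """Perplexity of doc under an add-alpha smoothed unigram model
    built from a reference corpus of 'normal' text."""
    vocab = len(ref_counts) + 1  # +1 for the unseen-word bucket
    words = doc.lower().split()
    logp = sum(
        math.log((ref_counts.get(w, 0) + alpha) / (ref_total + alpha * vocab))
        for w in words
    )
    return math.exp(-logp / max(len(words), 1))

# Tiny reference corpus standing in for Wikipedia
ref = "the cat sat on the mat the dog sat on the rug".split()
ref_counts = Counter(ref)
ref_total = len(ref)

# Keep documents whose perplexity falls below a chosen threshold;
# gibberish and spam score far higher than in-distribution text.
```

The threshold is a tuning knob: too aggressive and you discard unusual-but-valuable text (code, poetry, other languages), which is one reason perplexity filtering is combined with, rather than substituted for, the classifier-based scoring above.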
Concept 10 of 10
Check Your Understanding
1. Why is deduplication so important?
Correct: Near-duplicates are memorized by the model, hurting generalization and wasting training compute
2. What is MinHash used for?
Correct: Approximating Jaccard similarity between documents so near-duplicates can be found in near-linear time
3. Why oversample code in a general-purpose LLM training mix?
Correct: Code improves reasoning and structure even on non-code tasks
4. What problem does temperature sampling solve for multilingual data?
Correct: It balances languages so low-resource languages are not drowned out by English
5. What is benchmark contamination?
Correct: Test set examples leaking into training data, making benchmark scores meaningless
Teach It Back
Explain to a friend: Walk through the modern LLM data pipeline from raw web crawl to training-ready tokens. Cover extraction, language ID, quality filtering, deduplication (including MinHash/LSH), data mixing, and why quality matters more than quantity today.
An AI tutor will compare your explanation against the course material.
Flashcards (click to flip)
What is the typical curation pipeline?
Click to reveal
1. Text extraction (strip HTML/boilerplate). 2. Language ID. 3. Quality filtering (heuristics + model classifier). 4. Content filtering (toxic/PII/contamination). 5. Deduplication (exact + near via MinHash/LSH). Each stage drops 30-70% of input.
How does MinHash/LSH deduplication work?
Click to reveal
MinHash produces a 128-hash fingerprint per document from its n-gram shingles. Jaccard similarity is approximately the fraction of matching MinHash values. LSH buckets signatures so only similar pairs get compared, reducing O(N^2) to near-linear.
Why mix domains instead of using pure web data?
Click to reveal
Different domains teach different skills. Code boosts reasoning; math content boosts quantitative benchmarks; books improve long-form coherence. Empirical mixtures like LLaMA 3's ~50% general, 25% math/reasoning, 17% code outperform pure web.
What is temperature sampling for multilingual data?
Click to reveal
p_lang = count_lang^T / sum_l(count_l^T). T < 1 boosts low-resource languages; T = 0.3 is common. Prevents English from drowning out smaller languages in training.
What are the contemporary data challenges?
Click to reveal
Scarcity (running out of high-quality public text), synthetic data risks, multi-epoch training, benchmark contamination (~11% overlap in NuminaMath), copyright/attribution, and the need for detailed source tracking.
Quality vs quantity: what wins today?
Click to reveal
Quality. Heavily filtered smaller datasets beat raw larger ones. FineWeb and RefinedWeb showed filtered Common Crawl alone rivals curated mixes. Labs routinely discard 80-95% of raw data.