Pre-training data
Introduction
This note closely follows the corresponding slides from the course website.
Three objectives of pre-training data. (i) Quantity – Bigger is better; no labels needed, only raw text. For example, Llama 3 uses 15T tokens and Llama 4 uses 30T tokens, whereas Wikipedia has just ~3B tokens. (ii) Diversity – Broad enough to reflect nearly all possible NLP tasks. (iii) High quality – Text quality is crucial.
Pre-training history.
    
▷ 
    2018. English Wikipedia (2.5B tokens) and books (1B). ELMo is pretrained on the 1B Word Benchmark, GPT-1 on BooksCorpus (0.8B), and BERT on BooksCorpus plus Wikipedia (2.5B).
    
▷ 
    2019. GPT-2 uses WebText, built by scraping web pages that are outbound links from Reddit posts with at least 3 karma, extracting text from HTML, deduplicating, and heuristic cleaning.
        (i) Common Crawl: a publicly available web archive, updated monthly (~20TB/month), with plain-text versions that strip markup and other non-text content from the scraped HTML files. The majority of the text is not even natural language.
        (ii) GPT-2 uses a bottom-up approach: start from a curated set of URLs (the Reddit outbound links above) and collect their contents.
        (iii) The T5 paper and C4 (the Colossal Clean Crawled Corpus) use a top-down approach with lots of heuristic filters (e.g., remove any page containing a curly bracket or the word 'JavaScript', remove non-English pages, etc.) → this filters out ~90% of the data, leaving about 175B tokens.
    
▷ 
    2020. GPT-3 trains a logistic regression classifier to distinguish high-quality documents from low-quality Common Crawl. Positive data: WebText, Wikipedia, and book corpora; negative data: unfiltered Common Crawl. However, low-scoring documents are not all discarded; instead, documents are sampled stochastically based on the estimated quality score to preserve diversity (a minimal sketch of this step follows the list). After all the filtering, the result is roughly 400B tokens.
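A minimal sketch of this classifier-plus-stochastic-sampling step, assuming a hashed bag-of-words featurizer from scikit-learn and the Pareto-style acceptance rule reported in the GPT-3 paper; the value of α and every other detail here are illustrative assumptions, not OpenAI's exact pipeline.

```python
# Sketch of GPT-3-style quality filtering: train a logistic regression
# classifier on known-good vs. unfiltered web text, then keep documents
# stochastically based on their predicted quality score.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_classifier(positive_docs, negative_docs):
    """Positive: WebText/Wikipedia/books; negative: unfiltered Common Crawl."""
    vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
    X = vectorizer.transform(positive_docs + negative_docs)
    y = np.array([1] * len(positive_docs) + [0] * len(negative_docs))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return vectorizer, clf

def keep_document(doc, vectorizer, clf, alpha=9.0, rng=None):
    """Stochastic acceptance: higher-scoring docs are kept more often, but
    some low-scoring docs survive, preserving diversity (alpha is assumed)."""
    rng = rng or np.random.default_rng()
    score = clf.predict_proba(vectorizer.transform([doc]))[0, 1]
    return rng.pareto(alpha) > 1.0 - score
```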
What other public training datasets exist today? C4 (2019), the Pile (2020, the first public dataset comparable to GPT-3's training data), RedPajama (1.2T tokens), etc. The Pile (Gao et al., 2020): mostly academic data, less internet data. Llama 1 (Touvron et al., 2023): Common Crawl and C4 make up 82%. Dolma (Soldaini et al., 2024): Common Crawl, GitHub, Reddit, Semantic Scholar, Project Gutenberg, etc.
Today's open data and models: (i) Open source: C4 (leading to T5), the Pile (leading to Pythia), RefinedWeb (leading to Falcon LLM), Dolma (leading to OLMo 1/2), etc. (ii) Open weights with data info: Llama 1, etc. (iii) Open weights with little data info: Llama 2/3, Mistral, Mixtral, Qwen, DeepSeek, etc.
Overview of pre-training data curation: Acquisition → Linearization → Filtering → Mixing (rebalancing) → Experiments and evaluation. ▷ Acquisition: Common Crawl + additional sources obtained by developers. (i) Big crawlers have coverage issues: some data (e.g., PDFs) may not be available on Common Crawl, and for high-quality data (e.g., Wikipedia) it is worth duplicating it by including it directly. (ii) Popular choices for additional sources: Wikipedia, Project Gutenberg, arXiv (in LaTeX) and PubMed, GitHub, BigQuery, OpenMathText, StackExchange, Reddit. ▷ Linearization: converting HTML or PDF to plain text. This is not a trivial step and can have a huge impact on the model: it matters. ▷ Filtering: more in the DCLM paper. ▷ Mixing: sampling from the acquired data (a minimal sketch follows this paragraph). ▷ Experiments and evaluation. (i) This is challenging: pipeline steps aren't independent, and pre-training runs are expensive. (ii) Rules of thumb: fix the scale (work backwards from your compute budget, then maximize quality); fix the evaluation set (choose benchmarks suitable for pre-training: 'how to get early, stable signals' rather than 'most challenging' – the opposite of the fine-tuning level). (iii) Invest in the smallest experimental design that generalizes: scaling laws; annealing (Llama 3 explains this well).
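A minimal sketch of the mixing step, assuming each source has already been curated and we simply sample documents according to hand-chosen mixture weights; the source names and weights below are purely illustrative.

```python
# Sketch of mixing: sample documents from curated sources according to
# mixture weights (sources and weights here are illustrative only).
import random

def mix_sources(sources, weights, num_docs, seed=0):
    """sources: dict name -> list of documents; weights: dict name -> float."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[n] for n in names]
    mixed = []
    for _ in range(num_docs):
        name = rng.choices(names, weights=probs, k=1)[0]
        mixed.append(rng.choice(sources[name]))  # samples with replacement
    return mixed

if __name__ == "__main__":
    toy = {"common_crawl": ["cc doc"] * 10, "wikipedia": ["wiki doc"] * 10}
    print(mix_sources(toy, {"common_crawl": 0.8, "wikipedia": 0.2}, num_docs=5))
```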
DCLM & FineWeb
DCLM and FineWeb. Both use Common Crawl as (almost) the only source and dive deep into linearization, filtering, and experiments. Note that DCLM-Baseline was proposed as a baseline for the DCLM competition the authors held. ▷ Motivation: supporting open research on training data. ▷ Goal: fix everything other than the training data (e.g., fix the compute budget for scale, the architecture, etc.). ▷ Setting: DCLM fixes (model size, data size) pairs, where the data size is 1x or 2x of the amount that is optimal according to Chinchilla scaling laws; FineWeb always uses a ~1B-parameter model. ▷ Evaluation: aim for meaningful signals at small scale (small variance, predictable trends, above the random baseline even at small scale); DCLM uses MMLU 5-shot accuracy, CORE (average of 22 tasks, e.g., HellaSwag and ARC, with high signal-to-noise ratio), and ESSENTIAL (average of 55 tasks).
Linearization. Both papers are among the first to ablate linearization. Common Crawl data is available in two main formats: WET (plain-text-only version) and WARC (raw data from the crawl). Both papers show that their own linearization from WARC is better than using WET.
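A minimal sketch of linearization from WARC, assuming the `warcio` and `trafilatura` packages (trafilatura is the extractor FineWeb reports using; DCLM reports using resiliparse). This is an illustration, not either paper's exact pipeline.

```python
# Sketch: read a Common Crawl WARC file and linearize its HTML to plain text.
# Assumes `pip install warcio trafilatura`; not the exact DCLM/FineWeb code.
import trafilatura
from warcio.archiveiterator import ArchiveIterator

def linearize_warc(path):
    """Yield (url, plain_text) for each HTML response record in a .warc(.gz) file."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)  # strips boilerplate, keeps main content
            if text:
                yield url, text
```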
Heuristic filtering. DCLM used RefinedWeb's filters. FineWeb also used RefinedWeb's filters, then ablated individual filters from C4 (length filter, punctuation filter, etc.).
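A minimal sketch of a few C4-style heuristic filters of the kind being ablated here; the thresholds and the exact rule set are illustrative assumptions, not the real C4/RefinedWeb/FineWeb configuration.

```python
# Sketch of C4-style heuristic filtering. Rules and thresholds are
# illustrative, not the exact C4/RefinedWeb/FineWeb lists.
TERMINAL_PUNCT = (".", "!", "?", '"')
BLOCKLIST = ("javascript", "lorem ipsum", "{")

def keep_page(text, min_words_per_line=5, min_lines=3):
    lowered = text.lower()
    if any(token in lowered for token in BLOCKLIST):
        return False  # page-level removal rule
    # Keep only lines that look like prose: long enough, ending in punctuation.
    prose_lines = [
        line for line in text.splitlines()
        if len(line.split()) >= min_words_per_line
        and line.rstrip().endswith(TERMINAL_PUNCT)
    ]
    return len(prose_lines) >= min_lines

good = ("This is a real sentence with enough words.\n"
        "Here is another complete sentence right here.\n"
        "And this page has a third proper sentence too.")
bad = "click here {menu} javascript required"
assert keep_page(good) and not keep_page(bad)
```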
Deduplication. Usually fuzzy deduplication rather than exact deduplication (e.g., remove docs with ≥50% overlap in 13-grams); a minimal sketch follows this paragraph. Deduplication is not trivial: DCLM has 8 pages of ablations and FineWeb has 7, and their ablations mostly don't overlap. Some factors in deduplication are the following. ▷ Hyperparameters: the overlap fraction p and the n-gram size n. ▷ Data structure: MinHash vs. suffix array vs. near-duplicate Bloom filtering (BFF). ▷ Level: paragraph? doc vs. doc? doc vs. corpus? ▷ Sharded deduplication vs. global deduplication: DCLM used 'sharded deduplication' and ablated these choices, concluding that MinHash + suffix array + exact ≈ BFF, so just use BFF; FineWeb used MinHash and more, found that more deduplication does not always improve results, and thus uses per-dump (independent) MinHash.
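A minimal sketch of fuzzy deduplication by 13-gram overlap at the doc-vs-corpus level, using a naive in-memory comparison; real pipelines replace this with MinHash/LSH, suffix arrays, or Bloom filters to scale, and the 50% threshold simply mirrors the example above.

```python
# Sketch of fuzzy deduplication: drop a document if >= 50% of its 13-grams
# already appeared in previously kept documents. Naive version; real pipelines
# use MinHash/LSH, suffix arrays, or Bloom filters (BFF) to scale.
def ngrams(text, n=13):
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def fuzzy_dedup(docs, n=13, threshold=0.5):
    seen = set()   # all n-grams from documents kept so far
    kept = []
    for doc in docs:
        grams = ngrams(doc, n)
        overlap = len(grams & seen) / max(len(grams), 1)
        if overlap < threshold:
            kept.append(doc)
            seen |= grams
    return kept
```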
Model-based filtering. This is where the two papers differ the most. ▷ DCLM 'trains a model' to serve as a quality filter, comparing options such as PageRank, AskLLM, perplexity filtering (one of the standards in pre-training), and fastText (a bigram classifier trained on positive and negative examples); a minimal fastText-style sketch follows this paragraph. ▷ FineWeb uses synthetic data to develop classifiers for identifying 'educational' content: first, use Llama-3-70B-Instruct to annotate the quality of documents based on their educational content; then, use the Llama-3-70B annotations to train a small classifier. Their prompting example is very useful and worth following.
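A minimal sketch of a fastText-style quality classifier, assuming the `fasttext` Python package; the labels, file names, and hyperparameters are illustrative, not DCLM's exact setup (DCLM's winning filter reportedly uses instruction-style data such as OpenHermes and ELI5 answers as positives against random web negatives).

```python
# Sketch of a fastText-style quality classifier (assumes `pip install fasttext`).
# Labels, file paths, and hyperparameters are illustrative, not DCLM's exact setup.
import fasttext

def write_training_file(path, positives, negatives):
    """fastText expects one '__label__X text' example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in positives:
            f.write("__label__hq " + doc.replace("\n", " ") + "\n")
        for doc in negatives:
            f.write("__label__lq " + doc.replace("\n", " ") + "\n")

def train_and_score(train_path, docs):
    # wordNgrams=2 makes it a bigram classifier, as described above.
    model = fasttext.train_supervised(input=train_path, wordNgrams=2, epoch=5)
    scores = []
    for doc in docs:
        labels, probs = model.predict(doc.replace("\n", " "))
        p_hq = probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]
        scores.append(p_hq)
    return scores  # e.g., keep only the top-scoring fraction of documents
```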
Did they change the field? ▷ The DCLM competition itself was not popular, because of the compute barrier for academic labs and because DCLM-Baseline was too strong. Yet the findings were widely adopted, including by leading industry labs. ▷ FineWeb's findings were also widely adopted, including by leading industry labs. Experts later found that DCLM surpassed FineWeb; one hypothesis is linearization differences. Still, FineWeb's approach of training a classifier on Llama annotations proved impactful. ▷ cf. subsequent work of DCLM and FineWeb: OLMo 2 (Walsh et al., 2025).
Some questions and plausible answers. ▷ Q. If model-based filtering is ultimately applied, why do heuristic filtering at all? A. Simpler things can work (they are efficient), and they also reduce computation time. ▷ Q. Both papers did heuristic filtering → deduplication → model-based filtering. Is this a coincidence? A. This is basically the ordering by computation cost. ▷ Q. DCLM shows that its AskLLM baseline underperforms a simple fastText classifier, while FineWeb's AskLLM-like approach is highly effective. What's the difference? A. FineWeb defines 'high-quality' data in a much more detailed way. ▷ Q. If we have an infinite compute budget, do we still want to do filtering? A. This is an open-ended question. ▷ Q. What does a frontier LLM's pre-training data composition look like? A. It is not open to the public, but we can speculate.
More readings
Language models are few-shot learners. (OpenAI, 2020; the GPT-3 paper) The paper shows that scaling up LMs improves task-agnostic few-shot performance: it trains GPT-3, an LM with 175B parameters (10x more than any previous non-sparse LM), and tests its performance in few-shot settings. ▷ The paper finds that GPT-3 can generate news articles that are hard to distinguish from articles written by humans. ▷ While zero-shot performance improves somewhat linearly with model size, few-shot performance increases more rapidly. ▷ The paper uses a "natural language prompt" (e.g., "Answer:") for some tasks. Interestingly, their Figure 2.2 visualizes the total compute used during training for BERT, GPT-3, etc., and the paper also has a section about energy usage. GPT-3 175B consumed several thousand petaflop/s-days, while GPT-2 1.5B used tens of petaflop/s-days. However, the paper emphasizes that running the trained model does not cost that much.
Deduplicating training data makes language models better. (Google Research Brain Team, 2022) The paper develops two tools for deduplicating training datasets (a toy sketch of (a) follows this paragraph): (a) using a suffix array, remove duplicate substrings from the dataset if they occur verbatim in more than one example; (b) using MinHash, an efficient algorithm for estimating the $n$-gram similarity between all pairs of examples in a corpus, remove entire examples from the dataset if they have high $n$-gram overlap with any other example.
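A toy sketch of the suffix-array idea in (a), assuming a naive construction over a small concatenated corpus and a character-based length threshold (the paper works at the token level with a far more scalable implementation); it reports substrings that occur verbatim in more than one example.

```python
# Toy sketch of suffix-array-based duplicate-substring detection.
# Naive O(n^2 log n) construction; the paper's implementation is scalable
# and uses a 50-token (not 50-character) threshold.
def find_duplicate_substrings(docs, min_len=50):
    sep = "\x00"
    corpus = sep.join(docs)
    # Track which document owns each character, to report only cross-doc repeats.
    owners = []
    for i, doc in enumerate(docs):
        owners.extend([i] * len(doc))
        owners.append(-1)           # separator position
    owners = owners[:len(corpus)]
    suffixes = sorted(range(len(corpus)), key=lambda i: corpus[i:])
    dups = set()
    for a, b in zip(suffixes, suffixes[1:]):
        # Longest common prefix of adjacent suffixes in sorted order.
        lcp = 0
        while (a + lcp < len(corpus) and b + lcp < len(corpus)
               and corpus[a + lcp] == corpus[b + lcp]
               and corpus[a + lcp] != sep):
            lcp += 1
        if lcp >= min_len and owners[a] != owners[b]:
            dups.add(corpus[a:a + lcp])
    return dups
```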
Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset. (NVIDIA, 2025) Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved significant benchmark gains via aggressive model-based filtering, but at the cost of removing ~90% of the data. This limits their suitability for long token-horizon training, such as the 15T tokens used for Llama 3.1. The paper shows how to achieve better trade-offs between accuracy and data quantity through a combination of classifier ensembling (taking the max score among three classifiers), synthetic data rephrasing, and reduced reliance on heuristic filters.
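A minimal sketch of the max-score ensembling idea, assuming three hypothetical quality classifiers that each return a score in [0, 1]; the bucketing step is likewise an assumption for illustration.

```python
# Sketch of classifier ensembling by max score. The classifiers are
# hypothetical callables returning a quality score in [0, 1] per document.
def ensemble_score(doc, classifiers):
    """Take the max so a document valued highly by any one classifier is
    retained, trading some precision for much better recall (more tokens)."""
    return max(clf(doc) for clf in classifiers)

def quality_bucket(score, num_buckets=5):
    """Map a score to a coarse quality tier (e.g., for tiered sampling)."""
    return min(int(score * num_buckets), num_buckets - 1)
```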
DataComp-LM. (2025) A testbed for controlled dataset experiments with the goal of improving LMs. It highlights the importance of dataset design for training language models and offers a starting point for further research on data curation. ▷ The DCLM workflow is the following: (a) choose a scale (tokens or parameters); (b) filter or mix data in the pool to create a dataset; (c) train an LM on the curated dataset; (d) evaluate on pre-selected downstream tasks. ▷ DCLM-Pool is an unfiltered web-text corpus, and DCLM-Baseline is the 1.4% of DCLM-Pool that the authors curate using heuristic cleaning, deduplication, and model-based quality filtering. In the mixing track, low-performing data improves when mixed with other data, yet DCLM-Baseline does not benefit from mixing.
FineWeb. (HuggingFace, 2024) FineWeb is a new large-scale (15T tokens) dataset for LLM pretraining. The authors also introduce FineWeb-Edu, a subset of FineWeb filtered for educational value, along with benchmarks. ▷ FineWeb recipe: (a) text extraction from Common Crawl; (b) base filtering following RefinedWeb; (c) deduplication; (d) additional heuristic filters, developed from high-level statistics of high- vs. low-quality datasets and the Wasserstein distance between the two distributions. ▷ FineWeb-Edu is based on an emerging approach for filtering LLM training datasets: using synthetic data to develop classifiers that identify educational content (a minimal sketch follows this paragraph). This technique was notably used for Llama 3 and Phi-3.
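A minimal sketch of a FineWeb-Edu-style pipeline: score a sample of documents with a strong LLM for educational value, then train a small classifier on those annotations. The embedding model, regressor, and threshold here are assumptions for illustration (FineWeb-Edu reportedly uses Llama-3-70B-Instruct scores on a 0–5 scale, an embedding-based classifier, and keeps documents scoring at least 3).

```python
# Sketch: train a small "educational value" classifier on LLM annotations.
# The embedding model and regressor are illustrative choices, not the exact
# FineWeb-Edu recipe. Assumes `pip install sentence-transformers scikit-learn`.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

def train_edu_classifier(annotated):
    """annotated: list of (document_text, llm_score_0_to_5) pairs, where the
    scores come from prompting a strong LLM (e.g., Llama-3-70B-Instruct)."""
    texts, scores = zip(*annotated)
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model
    X = embedder.encode(list(texts))
    head = Ridge().fit(X, scores)                        # cheap regression head
    return embedder, head

def filter_by_edu_score(docs, embedder, head, threshold=3.0):
    """Keep documents whose predicted educational score is >= threshold."""
    preds = head.predict(embedder.encode(docs))
    return [d for d, s in zip(docs, preds) if s >= threshold]
```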
Caveat. This is (part of) a scribe note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley) taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. The note is published with the instructor's permission to post publicly. The content has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I'd appreciate a correction. For any comments, here is my email address.