Synthetic pre-training data
Reading
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. This is the WRAP (Web Rephrase Augmented Pre-training) paper, one of the two CommonCrawl-based sources discussed in the DCLM paper in the Pre-training lecture. WRAP uses an off-the-shelf instruction-tuned model prompted to paraphrase web documents in specific styles, such as 'like Wikipedia' or 'in question-answer format', and jointly pre-trains LLMs on the real data and its synthetic rephrases. WRAP tackles three ambiguities in data curation: what data to pre-train on, how to pre-train with limited data, and how to pre-train compute-efficiently. Real and synthetic data are sampled at a 1:1 ratio. The perplexity of the pre-trained model is evaluated on the validation sets of multiple out-of-distribution (OOD) datasets.
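To make the recipe concrete, here is a minimal sketch of WRAP-style rephrase-and-mix data construction. The prompt wording, style names, and the `generate_fn` callable are my own placeholders rather than the paper's exact templates; only the idea of style-conditioned rephrasing and the 1:1 real-to-synthetic mix comes from the paper.

```python
import random

# WRAP-style rephrasing sketch. `generate_fn` stands in for any off-the-shelf
# instruction-tuned model call (e.g. via vLLM, transformers, or an API).
# The prompt templates below are illustrative, not the paper's.
STYLES = {
    "wikipedia": "Rewrite the following passage in the style of a Wikipedia article, "
                 "keeping all factual content:\n\n{doc}",
    "qa": "Convert the following passage into a question-and-answer format, "
          "covering its main facts:\n\n{doc}",
}

def rephrase(doc: str, style: str, generate_fn) -> str:
    """Produce one synthetic rephrase of a real web document."""
    prompt = STYLES[style].format(doc=doc)
    return generate_fn(prompt)

def build_training_stream(real_docs, generate_fn, seed=0):
    """Mix real documents and their synthetic rephrases at roughly a 1:1 ratio."""
    rng = random.Random(seed)
    mixed = []
    for doc in real_docs:
        mixed.append(doc)                                   # real document
        style = rng.choice(list(STYLES))                    # pick a rephrase style
        mixed.append(rephrase(doc, style, generate_fn))     # synthetic rephrase
    rng.shuffle(mixed)
    return mixed
```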
Lecture
These notes closely follow the guest-lecture slides from the course website.
Data-centric ML. Holding the model fixed, we can instead optimize over the data we train on. In the CNN era, people focused on designing new models; in the era of scaling laws, the focus has come back to data. And, importantly, we know about 'garbage in, garbage out'.
Data is weird. ▷ For example, in the RedPajama dataset, there are roughly 700k documents sharing a certain identical format. ▷ Some documents are identical except for a few spaces before '\n' (i.e., the line break); yet including or excluding them in the training set leads to significantly different loss behavior. ▷ Loss spikes at AI2 (discovered by Dirk Groeneveld from AI2): there is a subreddit called 'MicrowaveGang' whose posts basically repeat 'mmmm...' many, many times, and these documents caused spikes during training. → One-line summary of today: data is weird, therefore we need to look at it more closely!
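As an illustration of "looking at the data more closely", here is a toy sketch (not AI2's or RedPajama's actual tooling) that flags two of the oddities above: documents that are identical up to whitespace, and documents dominated by a single repeated token like 'mmmm'. The threshold is an arbitrary choice of mine.

```python
import hashlib
from collections import Counter

def norm_hash(doc: str) -> str:
    """Hash after collapsing whitespace, so documents that differ only by
    spaces before a newline collide."""
    normalized = " ".join(doc.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def repetition_score(doc: str) -> float:
    """Fraction of the document taken up by its single most frequent token.
    'mmmm mmmm mmmm ...'-style documents score close to 1.0."""
    tokens = doc.split()
    if not tokens:
        return 0.0
    return Counter(tokens).most_common(1)[0][1] / len(tokens)

def flag_weird(docs, rep_threshold=0.5):
    """Return indices of whitespace-only near-duplicates and highly repetitive docs."""
    seen, near_dups, repetitive = set(), [], []
    for i, doc in enumerate(docs):
        h = norm_hash(doc)
        if h in seen:
            near_dups.append(i)
        seen.add(h)
        if repetition_score(doc) > rep_threshold:
            repetitive.append(i)
    return near_dups, repetitive
```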
Can we do better than the power law? We want a different (steeper) slope for the power law, which is linear in log-log space, e.g., the paper 'Beyond neural scaling laws: beating power law scaling via data pruning'.
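For reference, a quick sketch of what "the slope of the power law" means operationally: fit loss versus dataset size in log-log space and read off the exponent. The numbers below are made up purely to show usage; they are not from any paper.

```python
import numpy as np

def fit_power_law(dataset_sizes, losses):
    """Return (a, b) for L(D) ≈ a * D^(-b) via a log-log linear fit."""
    log_d = np.log(np.asarray(dataset_sizes, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(log_d, log_l, deg=1)   # slope is -b in log-log space
    return float(np.exp(intercept)), float(-slope)

# Made-up (dataset size, loss) pairs, just to show usage:
a, b = fit_power_law([1e8, 1e9, 1e10], [3.2, 2.9, 2.65])
print(f"L(D) ≈ {a:.2f} * D^(-{b:.3f})")
```

Better data curation aims to increase the exponent b (or break the power law entirely, as the data-pruning paper argues).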
Four pillars of data curation: geometric curation, quality filtering, synthetic data, and data mixing. Today we focus on synthetic data.
a. Geometric curation. Three methods unlocked by geometric curation: (perceptual and semantic) deduplication; mining (i.e., finding related data); and balancing (i.e., good data should not over-represent certain domains). A sketch of these operations in embedding space is given below.
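A minimal sketch of two of these operations, assuming document embeddings from any off-the-shelf encoder. The greedy dedup loop and the thresholds are illustrative choices of mine, not a production recipe.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def semantic_dedup(embs: np.ndarray, threshold: float = 0.95):
    """Greedily keep documents whose embedding is not too close to an
    already-kept one (semantic deduplication). embs has shape (n, d).
    O(n^2) on purpose, for clarity only."""
    kept = []
    for i in range(len(embs)):
        if all(cosine_sim(embs[i][None], embs[j][None])[0, 0] < threshold for j in kept):
            kept.append(i)
    return kept

def mine_related(embs: np.ndarray, query_embs: np.ndarray, k: int = 100):
    """Mining: rank the pool by maximum similarity to a small target/query set."""
    sims = cosine_sim(embs, query_embs).max(axis=1)
    return np.argsort(-sims)[:k]
```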
b. Quality filtering. Heuristics and model-based filtering. Example: CLIP data consists of image-caption pairs from the web, and the goal was to learn the best visual representation for object classification. Removing the class of images whose caption simply repeats the text appearing in the image, they found that removing one bad example is roughly as valuable as adding three good examples. A generic sketch of combining heuristic and model-based filters is given below.
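The sketch below shows the general pattern for text data; the specific rules, the `quality_model` callable, and the threshold are placeholders of mine, not the CLIP-data pipeline.

```python
def heuristic_ok(doc: str) -> bool:
    """Cheap rule-based checks: drop very short documents and documents
    dominated by non-alphabetic characters. Thresholds are illustrative."""
    if len(doc.split()) < 20:
        return False
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in doc) / max(len(doc), 1)
    return alpha_ratio > 0.8

def filter_corpus(docs, quality_model, threshold=0.5):
    """Keep documents that pass the heuristics and score above a threshold
    under a learned quality scorer (e.g. a small classifier returning a
    probability in [0, 1])."""
    return [d for d in docs if heuristic_ok(d) and quality_model(d) >= threshold]
```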
c. Synthetic data. → The rise of synthetic data began with generating small datasets for fine-tuning; its usefulness for pre-training was discovered later. → A paper from Microsoft that uses only synthetic data generated by a larger LLM: 'TinyStories: How Small Can Language Models Be and Still Speak Coherent English?'. This motivated the Phi family of LMs from Microsoft. → There are two paradigms: generator-driven (e.g., generated by GPT-3) and source-rephrasing, which is the main focus of this lecture.
WRAP paper. → Source-rephrasing. Different rephrase styles. Including real data matters. Analyzing data leakage via cosine similarity (see the sketch below). → Limitations: the cost of rephrasing, the diversity of synthetic generations, the soundness of generations, and whether the generator can be smaller than the LLM being trained.
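A hedged sketch of the kind of leakage check mentioned above: embed the synthetic rephrases and the evaluation documents with any text encoder and flag pairs with very high cosine similarity. The function names and the threshold are mine, not WRAP's.

```python
import numpy as np

def max_similarity_to_eval(synth_embs: np.ndarray, eval_embs: np.ndarray) -> np.ndarray:
    """For each synthetic document, its highest cosine similarity to any eval document."""
    s = synth_embs / np.linalg.norm(synth_embs, axis=1, keepdims=True)
    e = eval_embs / np.linalg.norm(eval_embs, axis=1, keepdims=True)
    return (s @ e.T).max(axis=1)

def flag_possible_leakage(synth_embs, eval_embs, threshold=0.95):
    """Return indices of synthetic docs suspiciously close to the eval set,
    plus the similarity scores for inspection."""
    sims = max_similarity_to_eval(synth_embs, eval_embs)
    return np.nonzero(sims >= threshold)[0], sims
```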
Nemotron-CC. NVIDIA's rephrasing pipeline and dataset; they also provide the prompt templates. This is one of the best open synthetic datasets.
BeyondWeb. Better than Nemotron-CC in some respects. Not every way of generating synthetic data will make your model better; the quality of the data you are rephrasing also matters. Synthetic data generation can also be used for target-distribution matching.
Kimi K2. Long-context rephrasing: (i) split the document into multiple chunks; (ii) rewrite each chunk conditioned on the rewritten output of the previous chunks, the current chunk, and the original text as a whole, so the rephrase stays coherent across the document. A rough sketch is given below.
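A rough sketch of this chunk-wise rephrasing loop as I understand it from the lecture; the chunking granularity, the prompt wording, and the `generate_fn` callable are my own placeholders, and the actual Kimi K2 recipe may differ.

```python
def chunk(text: str, chunk_size: int = 2000):
    """Split a long document into fixed-size character chunks (illustrative only)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def rephrase_long_document(doc: str, generate_fn, chunk_size: int = 2000) -> str:
    """Rewrite a long document chunk by chunk, conditioning each rewrite on
    the text rewritten so far so the result stays globally coherent."""
    rewritten_so_far = ""
    for piece in chunk(doc, chunk_size):
        prompt = (
            "You are rewriting a long document chunk by chunk.\n"
            f"Rewritten text so far:\n{rewritten_so_far}\n\n"
            f"Next original chunk:\n{piece}\n\n"
            "Continue the rewrite so it flows naturally from the text so far."
        )
        rewritten_so_far += generate_fn(prompt)
    return rewritten_so_far
```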
Caveat. This is (part of) a scribe note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley) taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. The note is published with the instructor's permission to post publicly. The content of the note has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I'd appreciate a correction. For any comments, here is my email address.