Synthetic data and distillation
Lecture
Why does synthetic data provide gains? A lineage of work: Alpaca (2023) → Textbooks Are All You Need (2023) → The False Promise of Imitating Proprietary LLMs (2023) → Rephrasing the Web (2024) → Tulu (2024), etc. Distillation: the student model is trained on synthetic data generated by the teacher model.
Alpaca: A Strong, Replicable Instruction-Following Model. Back then, OpenAI had instruction-following models such as InstructGPT and ChatGPT, but academia and open-source models did not, because RLHF is hard and expensive: scalable, cheap methods for instruction tuning were needed. That is where Alpaca came in: use ChatGPT to generate synthetic data that trains LLaMA to follow instructions. ▷ Stage 1: human-written seed tasks. ▷ Stage 2: fine-tune LLaMA on data generated by ChatGPT from those seeds. ▷ With 52k synthetic instruction-following examples, costing less than $600, Alpaca behaves similarly to OpenAI's model. A follow-up, Vicuna, instead trains on real user interactions with ChatGPT. The order of performance (in some evaluations): LLaMA, Alpaca, Vicuna, Bard, ChatGPT.
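The two-stage pipeline above can be sketched as follows. This is a minimal illustration, not Alpaca's actual code: `query_teacher` is a hypothetical stand-in for a call to the teacher model's API, and the prompt wording is invented for illustration, not Alpaca's real template.

```python
# Minimal sketch of Alpaca-style synthetic data generation (illustrative only).
# `query_teacher` stands in for a real API call to the teacher model.

def query_teacher(prompt: str) -> str:
    # Placeholder: a real pipeline would call the teacher model's API here.
    return f"Synthetic response to: {prompt}"

def expand_seed_tasks(seed_tasks: list[str]) -> list[dict]:
    """Stages 1+2: turn human-written seed tasks into (instruction, output) pairs."""
    examples = []
    for task in seed_tasks:
        # Ask the teacher for a new instruction in the style of the seed task.
        instruction = query_teacher(f"Write a new task similar to: {task}")
        # Ask the teacher to answer the instruction it just produced.
        output = query_teacher(instruction)
        examples.append({"instruction": instruction, "output": output})
    return examples

seeds = ["Summarize the following article.", "Translate this sentence to French."]
data = expand_seed_tasks(seeds)
print(len(data))  # one synthetic example per seed here; Alpaca scaled this to 52k
```

The resulting (instruction, output) pairs are then used for supervised fine-tuning of the student model.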
Textbooks Are All You Need. ▷ For coding. Started from 35B tokens of web data, filtered for educational value down to 6B tokens. Generated 1B tokens of synthetic textbook data and 180M tokens of exercise data with GPT-3.5. ▷ Pretrain on the filtered web data plus the textbook data: Phi-1-base. Then fine-tune on the exercise data: Phi-1.
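The educational-value filtering step can be sketched as below. The real Phi-1 pipeline used a learned classifier; the keyword-based scorer here is only a toy stand-in to show the shape of the step, and the keyword list and threshold are invented for illustration.

```python
# Toy sketch of filtering web documents by "educational value".
# The actual Phi-1 work trained a classifier; this heuristic is a stand-in.

EDU_KEYWORDS = {"function", "example", "explain", "definition", "exercise"}

def educational_score(doc: str) -> float:
    """Fraction of words that look 'educational' (toy proxy for a classifier score)."""
    words = doc.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,:;!?") in EDU_KEYWORDS)
    return hits / len(words)

def filter_corpus(docs: list[str], threshold: float = 0.05) -> list[str]:
    # Keep only documents scoring above the threshold (35B -> 6B tokens in the paper).
    return [d for d in docs if educational_score(d) >= threshold]

corpus = [
    "Example: this function returns the square of x. Exercise: explain why.",
    "click here buy now limited offer!!!",
]
kept = filter_corpus(corpus)
print(len(kept))  # the educational snippet passes; the spam does not
```

The kept documents, together with the synthetic textbook data, then form the pretraining corpus.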
The False Promise of Imitating Proprietary LLMs. ▷ Task-specific imitation: 6,000 ChatGPT-generated Wikipedia entity facts. ▷ Broad-skill imitation: ShareGPT dialogues, HC3 prompts and responses, Discord ChatGPT-bot dialogues. Experimented with GPT-2 1.5B and LLaMA 7B and 13B. ▷ Good at: learning specific tasks, imitating style or persona, serving as an alternative to expensive annotation for fine-tuning. ▷ Bad at: acquiring broad-coverage behavior, solving challenging tasks (e.g., factuality, coding, problem solving), learning new knowledge.
Caveat. This is (part of) a scribe note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley) taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. It is published with the instructor's permission to post publicly, but its content has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I'd appreciate a correction. For any comments, here is my email address.