Data-Centric LLM
Scribing note: Data-Centric LLM (CS 294-288)
I am not an expert on all the topics here, yet I hope these notes help others who, like me, struggled to find a starting point for engaging with frontier LLM research as a non-insider of the field.
I was fortunate to audit Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley) taught by Professor Sewon Min. From the course website:
Advances in large language models (LLMs) have been driven by the growing availability of large and diverse datasets. But where do these datasets come from? How are they being used? How can we leverage them in better, more creative ways? What challenges or issues do they present, and how might we address them? In this seminar, we will explore these questions as part of a broader effort to rethink how data is used in the development of LLMs: what data we use, how we use it, why it works, and what problems it brings.
I strongly recommend visiting the course homepage if you are interested. You will find a great reading list that spans foundational papers to recent work on data-centric perspectives for LLMs.
Below is my scribing note from the course. As scribing happened in real time, and each class covered multiple papers, the note is not perfectly complete. I have tried to keep the key ideas and results of the papers, along with my takeaways from the class. The note is text-only, with no images or plots (and thus a little dense), combining the structure of a classic LaTeX note with the accessibility of HTML. For details, please refer to the original papers, which can be found on the course website.
cf. Class structure: each class was mainly student-led, with three different roles: (i) main presenters, who covered the main readings; (ii) critics, who raised questions and suggested potential improvements; (iii) advocates, who responded to the critics and highlighted the impact and contributions of the work.
Each topic heading links to the corresponding note.
Pre-training data
Synthetic pre-training data
Scaling laws
Post-training
Synthetic data and distillation
Evaluation
Reasoning
Mixture of experts
Memory
Creativity and model collapse
Caveat. This is a (part of a) scribing note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley) taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. The note is published with the instructor’s permission to post publicly. The content of the note has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I’d appreciate a correction. For any comments, here is my email address.