Scaling laws
Reading
Scaling Laws for Neural Language Models (OpenAI, 2020) The first (or at least the pioneering) paper on scaling laws.
Training Compute-Optimal Large Language Models (DeepMind, 2022) The Chinchilla scaling law paper.
▷ Investigates the optimal model size and number of training tokens for a transformer LM under a given compute budget, i.e., given a fixed FLOPs budget, how should one trade off model size against the number of training tokens? The finding: model size and the number of training tokens should be scaled equally (see the sketch after this entry).
▷ For other hyperparameters, such as the learning rate schedule, batch size, optimizer, and width-depth ratio, the authors rely on existing work.
▷ Difference from OpenAI's 2020 paper: that paper used a fixed number of training tokens and a fixed learning rate schedule for all models, and the models in this paper are much larger than the ones in that paper.
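A back-of-the-envelope sketch of the "scale equally" conclusion, not the paper's exact fits: assuming the common approximation C ≈ 6ND for training FLOPs and the oft-quoted Chinchilla rule of thumb D ≈ 20N, the compute-optimal allocation follows by substitution.

```python
def chinchilla_optimal(flops_C):
    """Rough compute-optimal allocation of parameters N and tokens D.

    Assumes C ~= 6 * N * D (standard FLOPs approximation) and the
    rule-of-thumb ratio D ~= 20 * N; substituting gives C = 120 * N**2,
    so both N and D scale as sqrt(C), i.e., equally.
    """
    N = (flops_C / 120) ** 0.5  # parameters
    D = 20 * N                  # training tokens
    return N, D

# Sanity check against Chinchilla's budget of ~5.76e23 FLOPs:
N, D = chinchilla_optimal(5.76e23)
print(f"N ~ {N:.2e} params, D ~ {D:.2e} tokens")
# -> roughly 7e10 params and 1.4e12 tokens, matching Chinchilla (70B, 1.4T)
```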
Language models scale reliably with over-training and on downstream tasks (2024) Chinchilla optimality focuses on the compute-optimal training regime, where model and dataset size are set to yield the minimum loss for a given compute budget. However, this setting ignores inference costs.
▷ Since larger models are more expensive at inference, it is now common practice to over-train smaller models. Another mismatch is that most scaling laws quantify model performance by next-token-prediction loss rather than by benchmark performance. Thus, the authors run experiments on both scaling in the over-trained regime and benchmark performance prediction.
▷ The token multiplier is defined as M = D/N, where D is the number of training tokens and N is the number of model parameters. Over-training is defined as M > M*, where M* is a fixed compute-optimal token multiplier (see the sketch below).
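A minimal illustration of the token multiplier, assuming the rough Chinchilla-optimal value M* ≈ 20 (the exact M* is a modeling choice, not taken from this paper's fits):

```python
def token_multiplier(tokens_D, params_N):
    """Token multiplier M = D / N."""
    return tokens_D / params_N

M_STAR = 20  # rough compute-optimal multiplier (assumption)

# Example: LLaMA-2 7B was trained on ~2T tokens.
M = token_multiplier(2e12, 7e9)
print(f"M = {M:.0f}, over-trained: {M > M_STAR}")  # M ~ 286 >> 20
```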
Emergent Abilities of Large Language Models (2022) Scaling laws have been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that the authors call emergent abilities: abilities that cannot be extrapolated from the performance of smaller models, notably in few-shot prompting, multi-step reasoning, instruction following, and model calibration.
Lecture
Omitted.
Caveat. This is (part of) a scribing note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley) taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. It is published with the instructor's permission to post publicly, and its content has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I'd appreciate a correction. For any comments, here is my email address.