Creativity and model collapse
Lecture
Motivation. ▷ Dead internet theory: the theory itself may not be true, but AI-generated content is indeed extremely prevalent at this point, perhaps more so than human-generated content. ▷ What does this mean for LLMs? Some related papers (e.g., AI models collapse when trained on recursively generated data; The Curse of Recursion: Training on Generated Data Makes Models Forget). ▷ When we recursively generate answers, we can observe model collapse. The Curse of Recursion paper attributes this to finite sampling: common events get overestimated while rare events get underestimated (a toy simulation of this effect is sketched below). ▷ What about vision models? Retraining with no synthetic data keeps performance roughly the same, while retraining on synthetic data degrades quality. ▷ How do we measure such collapse, or creativity, in text, with respect to changes in the quality and diversity of the text?
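A minimal sketch of the finite-sampling argument, assuming a toy Zipf-like vocabulary distribution (my own illustration, not code from either paper): each generation, we draw a finite sample from the current distribution and "retrain" by re-estimating probabilities from the empirical counts. Rare tokens that miss the sample get probability zero and can never return.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sample_size, generations = 50, 200, 10

# Long-tailed "true" distribution over a toy vocabulary (Zipf-like).
p = np.arange(1, vocab_size + 1, dtype=float) ** -1.5
p /= p.sum()

for g in range(generations):
    # "Generate" a finite corpus from the current model ...
    sample = rng.choice(vocab_size, size=sample_size, p=p)
    # ... then "retrain" by re-estimating the distribution from counts.
    counts = np.bincount(sample, minlength=vocab_size)
    p = counts / counts.sum()
    print(f"gen {g}: zero-probability tokens = {(p == 0).sum()}")
```

The count of zero-probability tokens only grows across generations: the overrepresented head absorbs the mass that the vanished tail used to hold.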
The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text (2023). ▷ Key idea: how can we measure diversity? ▷ Tasks: news summarization, scientific abstract generation, story generation. ▷ Metric 1: Lexical diversity -- what is the variety of words in the text? Hypothesis: a collapsed LM will have a smaller vocabulary. How to quantify this: type-token ratio (TTR), distinct-n; BLEU (a sketch of these metrics follows below). ▷ Metric 2: Semantic diversity -- get embeddings from Sentence-BERT. ▷ Metric 3: Syntactic diversity -- look at sentence structure by building graph representations with nodes (words) and edges (dependency relations). ▷ Recursive training: the paper recursively retrains the model six times, each time using data generated by the model from the previous round. ▷ Results: syntactic and lexical diversity decreased the most, while semantic diversity stayed fairly stable; high-entropy tasks suffer more. ▷ Ablation study: what if we use less than 100% synthetic data? The paper considers two other scenarios: (i) filter the synthetic data, discarding the worst 20% with an acceptability filter based on RoBERTa; (ii) at each retraining round, mix human and synthetic data in a 1:4 ratio. Their results suggest neither scenario differs much from the original setup.
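A minimal sketch of the lexical-diversity metrics, assuming whitespace tokenization (my reconstruction for illustration, not the paper's implementation). Both ratios shrink when a model keeps reusing the same words and phrases:

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Number of unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def distinct_n(tokens: list[str], n: int) -> float:
    """Number of unique n-grams divided by total n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(type_token_ratio(tokens))  # 8 unique / 13 total ≈ 0.615
print(distinct_n(tokens, 2))     # 10 unique bigrams / 12 total ≈ 0.833
```

Semantic diversity would instead embed each generated text (e.g., with Sentence-BERT) and measure the spread of the embeddings, for example via average pairwise cosine distance.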
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data (2024). ▷ Key question: is the fear of model collapse warranted? Prior works largely assume that synthetic data completely replaces human data; what if we instead accumulate synthetic data on top of the human data? This paper chooses the latter and experiments with multiple model families (see the toy simulation below). ▷ For example, a VAE on images shows model collapse under replacement, while under accumulation the outputs stay close to the real data. ▷ For example, linear models: while replacement leads to model collapse, accumulation does not fail dramatically (it is worse, but the error stays bounded). ▷ How does temperature affect the curves? Under total replacement with synthetic data, a low temperature (a more deterministic model) makes collapse happen faster, while a high temperature (a more 'creative' model) makes it happen more slowly. Under accumulation, collapse does not happen. ▷ Is the paper's evidence reliable? Note that in the previous paper, perplexity (i.e., cross-entropy) improved over the iterations even as the fine-grained metrics got worse.
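A minimal sketch of replace vs. accumulate, assuming the "model" is just a 1-D Gaussian fit to its own samples (my own toy version, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, generations = 100, 200
data_replace = rng.normal(0.0, 1.0, n)   # the original "human" data
data_accum = data_replace.copy()

for _ in range(generations):
    # Replacement: fit on the latest data only, then discard it.
    mu, sigma = data_replace.mean(), data_replace.std()
    data_replace = rng.normal(mu, sigma, n)

    # Accumulation: fit on everything so far, append new synthetic data.
    mu, sigma = data_accum.mean(), data_accum.std()
    data_accum = np.concatenate([data_accum, rng.normal(mu, sigma, n)])

print("replace    std:", data_replace.std())  # drifts toward 0 (collapse)
print("accumulate std:", data_accum.std())    # stays near 1 (bounded error)
```

Under replacement the fitted variance shrinks by roughly a factor of (n-1)/n in expectation each generation and random-walks toward zero; under accumulation the ever-growing pool of real-plus-synthetic data anchors the estimate.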
Some interesting questions and critiques ▷ Q: Why does model collapse generate broken words as well, not only repeated words? ▷ Q: Why do self-bootstrapping methods not suffer from model collapse? Or do they? ▷ Regarding paper 1, maybe scaling can help: the paper only uses OPT-350M (from 2022), and it is not clear whether larger models or reasoning models would show the same issue. ▷ Regarding paper 1, maybe better prompting leads to stronger diversity. ▷ Regarding paper 2, maybe the model itself is the issue: Lu et al. (2025) report, using a creativity index, that creativity gets worse with RLHF; so perhaps the model, or we humans ourselves, are the issue. Does the rise of AI-generated content and synthetic data impact humans as well?
Caveat. This is (part of) a scribing note from Data-Centric Large Language Models (CS 294-288; Fall 2025; UC Berkeley) taught by Professor Sewon Min. The note may contain typos, inaccurate information, or inadvertent errors; any mistakes are mine. The note does not represent the views of the instructor, guest speakers, or any other participants of the course. It is published with the instructor's permission to post publicly, but its content has not been reviewed or endorsed by the instructor, guest speakers, or any other participants of the course. If you spot an error, I'd appreciate a correction. For any comments, here is my email address.