Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
Accumulating Data Prevents Model Collapse
Abstract
Reviews and Discussion
Post rebuttal:
I increased the rating from 6 to 7 because the clarification from the authors addressed my concerns to a certain extent.
Initial review:
This work investigates the problem of model collapse, which usually happens when a model is trained continually on its own output data. Unlike previous research, in which only newly generated data is used, this study assumes that both real data and synthetic data accumulate as models evolve.
The authors conduct an empirical analysis in a simple setting with a linear model, followed by extended experiments on Transformers, diffusion models, and VAEs, with training data in multiple modalities selected accordingly.
Both theoretical and empirical analyses support the conclusion that simply adding real data in multiple rounds of training can solve the model collapse problem.
Reasons to Accept
- The problem of model collapse is quite important in many fields, as synthetic data is becoming a considerable part of the training datasets for foundation models.
- This study provides a clear analysis of this problem, with experiments on widely used model architectures and multiple modalities, including language, images, and proteins.
- Surprisingly, all experiments support the authors' hypothesis that simply adding the original real data to the newly generated data can solve the problem of model collapse. This insight is straightforward, clear, and easy to implement, and thus potentially impactful for the community.
Reasons to Reject
- A minor concern: is it possible to accumulate the real data as well? In nature the real data will be generated by humans over time. The experiments will provide more insights.
- A major concern: the characterization of the previous study [1] in this paper seems inaccurate. The "synthetic augmentation loop" (Section 2 in [1]), defined as training a model with a fixed set of real data plus synthetic data from the previous iteration, seems to be exactly the same setting as the one in this paper. Hence, the novelty and contribution of this paper depend heavily on the difference from the previous study. A clarification (as well as a detailed related work section) from the authors would be appreciated.
I would increase the rating if I have misunderstood the above issues.
[1] Alemohammad et al. Self-Consuming Generative Models Go MAD. ICLR 2024
Questions to the Authors
Stylistic error: broken reference in Section 2.3
Dear reviewer TmgS,
Thank you for your review.
is it possible to accumulate the real data as well? In nature the real data will be generated by humans over time. The experiments will provide more insights.
We agree this is an interesting question and a more realistic modeling assumption. For the current paper, we intentionally wanted to present a very pessimistic scenario of accumulating data, to show that even under pessimistic conditions, simply keeping previous data avoids model collapse.
In practical terms, accumulating real data does present a challenge for experiments, as you can’t create new real data in an experiment, so one would have to start out with just a fraction of the dataset in the initial iteration.
A clarification (as well as a detailed section of related work) from the authors will be appreciated.
We will update the manuscript with an in-depth 2-page Related Work section.
To your specific question, our setting is different from the synthetic augmentation loop in Alemohammad et al! In their work, they primarily consider the case where a constant amount of synthetic data is replaced in each iteration, in addition to a constant amount of real data. That is, their setting is closer to what we consider the “replacement” setting, and they do conclude that this does not avoid model collapse. They do briefly look at accumulating data as well, but only in one part of one experiment, and they conclude that this merely slows model collapse down.
We will include a new appendix section closely examining Figure 7 of Alemohammad et al. We believe that a closer examination suggests that the test error plateaus (and this would likely have been more apparent in subsequent iterations), consistent with our results and in contrast with their conclusion that “fixed real training data only delays the inevitable degradation of the quality or diversity of the generative models over generations."
Dear reviewer TmgS, Thank you so much for your prompt response to our rebuttal. You state we've addressed your concerns “to a certain extent”. May we ask what concerns remain? We would of course like to address them in full.
The update plan for this manuscript looks good to me. 7 is a clear accept rating. I didn't give a higher rating because the updated contents are not available at this moment, but I still support accepting the current version.
Dear reviewer, thank you for following up. We understand, and indeed, we had hoped to be able to upload a revised manuscript during the discussion period, but COLM does not allow this (despite earlier information in the CfP that they would). However, we have just posted the new prior works section as a top-level comment here on OpenReview, including the discussion of differences to Alemohammad et al, should you wish to have a look.
This paper investigates the phenomenon of model collapse, where model performance progressively degrades when models are trained on their own generated outputs over multiple iterations. The key contribution of this paper is showing that accumulating the generated synthetic data with the original real data across iterations, instead of replacing data, can prevent model collapse. The authors provide both theoretical and empirical analyses.
Reasons to Accept
- The paper investigates a critical and timely issue in the field of AI, i.e., model collapse in generative models, and it mainly focuses on data accumulation.
- The paper provides a theoretical foundation for understanding model collapse when data is replaced across iterations and how accumulating data can prevent this phenomenon.
- The authors have conducted extensive experiments across different types of models and data modalities to show the effectiveness of data accumulation.
Reasons to Reject
- Exploration on larger models is desired. This paper discussed some models (linear models, transformers, diffusion models, and VAEs), but it does not extend its exploration to large models, such as those at billion-parameter scale, which are mainstream nowadays.
- The paper mainly relies on linear models for its theoretical analysis. While useful, this simplification might not fully capture the complexities of real-world scenarios.
- A discussion of related work is missing.
- The paper lacks discussion of, and comparison with, other popular methods in this area.
Questions to the Authors
Please refer to the negative comments. No further questions.
We thank Reviewer 7uQd for their review and interesting discussion. Our responses are below:
The exploration on larger model is desired [...] such as those with billion-parameter scales
We agree larger models would be better, but billion-parameter models are computationally unaffordable for us. We pre-train from scratch, repeatedly, on growing data. Five model-fitting iterations are equivalent to 20 full pre-training runs; the experiments in the paper together amount to well into the hundreds of full pre-training runs. To evaluate 7B-parameter models on a 2T-token starting dataset, we would need to train for 40T tokens for a single 5-iteration experiment – more than twice as many as Llama 3! And this is not even taking into account generating the synthetic tokens. We cannot afford this.
Because of this, we carefully chose the TinyStories dataset, as it has been designed specifically to allow for meaningful experimentation with smaller models.
That said, if you have a specific suggestion of a model size and dataset that are both achievable with reasonable compute budgets, and would provide insight that our current experiments do not, we would be very happy to discuss this!
The paper mainly relies on linear models for its theoretical analysis [...] this simplification might not fully capture the complexities of real-world scenarios.
We agree that linear models may fail to capture real-world scenarios, which is precisely why we complemented the theory with three medium-to-large-scale realistic experiments on deep generative models across 3 data modalities.
Our main results are these extensive experiments. We will swap the order of experiments & theory to emphasize this.
This paper lacks of discussion and comparison with other popular methods in this area
Could the reviewer please clarify what “other popular methods in the area” means? Our work is scientifically studying a particular system, i.e., sequences of generative models trained on outputs from their predecessors.
The discussion about the related work is missing.
We have written a 2 page Related Work and will add this to the Appendix.
Thanks for your response.
My point is that since the model is relatively small, the generation quality is not good. If you then use the generated low-quality data as training data, how can we expect the model to learn good performance from it? But things are different when adopting LLMs, since the quality of the generated samples is good; it would then be interesting to see how performance changes under your setting. I understand that the training cost is unaffordable for billion-parameter LLMs. But for images, there are lots of diffusion models (e.g., Stable Diffusion 1.5) that have good generation quality, and it wouldn't require so many resources. However, the image experiments are conducted on small VAEs, which cannot adequately convince me, either. There are lots of observations showing that some rules/laws stop holding when the scale grows.
And for the related work, it would be better if you could post it here to let us know the details, rather than just saying you have written a 2-page related work section.
Dear reviewer,
On model size, we would like to make several points:
- For the LLM experiments, we have specifically chosen a dataset that was designed to be fit well by 10-30M parameter models. The generations by our models of up to 150M parameter size fit this data well.
- For image models, we did consider experiments using diffusion models, but do not believe that this would be feasible. Let us be clear that what we do is pretrain these models from scratch in every iteration, not finetune them. Pretraining SD 1.0 took 150,000 hours of A100 time, or a market cost of 600k USD, for a single pretraining run. [source: https://x.com/emostaque/status/1563870674111832066 ] Even on smaller datasets, we cannot realistically run 20+ such runs. We’d also like to note the COLM review guidelines, which ask that reviewers “take into account that most researchers do not have access to large-scale compute”, and that “Naturally, this runs the risk that some small-scale results will not hold when studied later on at a large scale. But some results will, and they will not make it unless we, the program committee, make a bet on them.”
- We’d also like to say that we’ve focused primarily on the LLM experiments, as this is the focus of COLM. While we include other modalities to show that the effect isn’t strictly limited to language, we would like to ask you to take into account the focus of the conference when evaluating these experiments.
- Finally, we do look at generation quality as a confounding variable in the LLM experiments in appendix C, paragraph “Model quality after first model-fitting iteration” and find that this does not affect our conclusions.
Regarding related works, we fully agree. We had hoped to be able to upload a new revision of our paper with such changes, but COLM does not allow this (despite earlier information in the CfP that they would). We have just posted the new prior works section as a top-level comment here on OpenReview, and thank you for your understanding.
Thanks for your clarification. I'd like to raise my score to 6.
Dear Reviewer 7uQd, thank you again for your review and comments. We wanted to follow up to check if we’ve addressed your questions and concerns? If not, we’d be keen on discussing further in the remaining time.
This paper investigates the phenomenon of model collapse, where training generative models on their own outputs leads to progressive degradation in performance. The authors compare two settings: 1) replacing the training data with new synthetic data at each model-fitting iteration, and 2) accumulating synthetic data with real data at each iteration. They prove theoretically for linear models that accumulating data bounds the test error independent of the number of iterations, while replacing data causes test error to grow linearly with iterations. The authors then empirically demonstrate consistent results for deep generative models including causal transformers on text, diffusion models on molecules, and variational autoencoders on images. They find that across architectures and datasets, accumulating data prevents the model collapse that occurs when replacing data.
At first glance, the paper is clearly written, with theoretical results derived mathematically and then validated numerically. Upon further reflection, the theoretical formulation, along with the experimental setup guided by the theory, seems to be a bit distant from real-world settings, especially language modeling. In fact, I think how it should be formulated is inevitably deeply tied to the nature of each modality (e.g., what the ground truth of the modality is).
Reasons to Accept
- The paper formulates the timely and critical issue of model collapse in generative models, providing both theoretical and empirical insights.
- The theoretical analysis for linear models is rigorous and clearly presented. The key result that accumulating data bounds test error is novel.
- The authors conducted extensive experiments on transformer, diffusion, and VAE models on text, molecule, and image datasets. This breadth makes the results widely interesting and applicable.
Reasons to Reject
- Test loss on real data might not be the best metric to compare, and thus does not faithfully characterize the real-world scenario: The primary metric used to assess model performance in this framework is the test loss on real data. However, in the context of language model pre-training, reducing test loss might not always correlate with overall model effectiveness. There is growing evidence suggesting that language models are capable of generating synthetic data of superior quality compared to real data [1], which can be more beneficial for certain downstream tasks. This phenomenon challenges the assumption that minimizing test loss on real data accurately reflects model quality. In this case, I think the model collapse in the "replacing data" setting should not happen, because maybe loss on synthetic data is a better metric to compare. Their specific setup (small language models and small data) simply hasn't reached the inflection point, which makes the current results a bit misleading and not generalizable. I would suggest the authors measure the loss on a held-out dataset, or work in a more frontier synthetic data setting, where the generated data could be of higher quality.
- Lack of discussion on data quality: The paper lacks a discussion on the quality of synthetic data across different domains. Synthetic data quality can possibly alter our perceptions of what constitutes ground truth in each modality. For language modeling, synthetic data might represent "ground truth" better as the data quality increases, since in practice we train LMs to perform cognitive reasoning tasks, not simply language itself. This argument might still hold for image generation, but may not for molecule modeling. It is essential to engage with the leading edge of synthetic data in each domain to advance our understanding.
- Impact of data size in comparisons: The study attempts to compare the effects of accumulating versus replacing data, but it does not sufficiently account for the influence of data size on test loss. In practical scenarios, test loss is often dependent on the amount of data (e.g., LM) due to factors such as learning rate schedules and optimization algorithms (e.g., Adam). This oversight suggests that we need a more nuanced analysis that considers how increasing data size impacts learning outcomes beyond the simplistic linear model framework employed.
[1] Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Questions to the Authors
- Optimal Use of Synthetic Data: Figure 4 suggests that incorporating synthetic data can reduce test loss on real data, yet it raises an important question about the diminishing returns of synthetic data. Specifically, it would be insightful to determine the point at which further iterations of adding synthetic data no longer enhance model performance (if model performance can be properly measured). This inflection point is observed earlier in domains like molecule generation and image generation compared to language modeling.
- Typo: reference in 2.3 is missing: Code in App. ??
Thank you for your in-depth review and the interesting and on-point discussions you raise.
Lack of discuss on data quality [...] It is essential to engage with the leading edge of synthetic data in each domain to advance our understanding.
There is growing evidence suggesting that language models are capable of generating synthetic data of superior quality compared to real data [1], which can be more beneficial for certain downstream tasks.
We agree that research on using synthetic data to improve models or accelerate learning is an exciting new topic.
However, we see this as a step beyond our paper. Our paper is not asking the best way to use synthetic data. Our paper is asking whether future generative models will become useless in a pessimistic future where synthetic data is dumped without restraint on the internet. Our conclusion is no. Our point is not that random synthetic data is good for models, but that such data is not especially harmful so long as data accumulates. We believe this is an important and timely point to make, considering model collapse has recently been picked up by mainstream media as a catastrophic problem that threatens the future of generative AI.
We will add to our Discussion that identifying how best to use synthetic data is an exciting research frontier, and will cite your recommended paper “Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling”.
Impact of data size in comparisons
Appendix C includes this ablation (“Controlling for dataset size”) and our updated manuscript will have more experiments. For hyperparameters, is there any specific experiment you would like to see?
Test loss on real data might not be the best metric to compare
Test loss on real data is an extremely standard metric for evaluating models. While today’s language models indeed have significant post-processing applied, such methods rely on the base models having low pretraining loss. Applying DPO or some other preference optimization method to a high-perplexity LM will not work.
I would suggest the author measure the loss on a held-out dataset
To be clear, we do measure the loss on a held-out split of the real data.
maybe loss on synthetic data is a better metric to compare
Measuring the model’s loss using model-generated synthetic data opens the door to undesirable minima. For instance, if the model always outputs the same token, its cross entropy loss on the generated data will be 0, but its outputs are useless.
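To make this concrete, here is a toy numerical illustration (not from the paper; the vocabulary size, distributions, and sample counts are arbitrary) of how a degenerate model can achieve near-zero loss on its own generations while being useless on real data:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 8

# "Real" data: tokens drawn from a non-degenerate reference distribution.
real_tokens = rng.choice(vocab_size, size=10_000)

# Degenerate "model": puts essentially all probability mass on token 0.
eps = 1e-9
model_probs = np.full(vocab_size, eps)
model_probs[0] = 1.0 - eps * (vocab_size - 1)

# Synthetic data generated by that model: (almost) always token 0.
synthetic_tokens = rng.choice(vocab_size, size=10_000, p=model_probs)

def cross_entropy(tokens, probs):
    # Average negative log-probability the model assigns to the observed tokens.
    return -np.mean(np.log(probs[tokens]))

print("loss on own synthetic data:", cross_entropy(synthetic_tokens, model_probs))  # ~0
print("loss on held-out real data:", cross_entropy(real_tokens, model_probs))       # ~18 nats
```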
Dear Reviewer Vd4L, thank you again for your review and comments. We wanted to follow up to check if we’ve addressed your questions and concerns? If not, we’d be keen on discussing further in the remaining time.
Thank you so much for your response! I agree with you that working in a setting where the generated data's quality is inferior to real data is still valid and interesting to show. However, I would encourage the authors to discuss the following points in their revisions:
- Would the observed phenomenon still hold if the quality of synthetic data exceeds that of real data (which is of mixed quality)? This is particularly relevant for GPT-generated texts, where it has been shown that training on synthetic data can be more beneficial than training on real data. I believe this discussion is crucial, as we have already observed this empirically.
- Why do different patterns exist between language, molecular modeling, and image synthesis during accumulative training? How do these modalities differ in nature?
- Regarding dataset size, my point was that the comparison between replacement and accumulation differs in the number of optimization steps due to varying dataset sizes. Controlling the dataset size to be the same would be more helpful in identifying how the distribution of data composition affects model performance. I believe the conclusion will largely hold for the current setting though.
I have raised my score to 6, and I am happy to have further discussions.
Dear reviewer,
Thank you for your response, and the interesting points you raise.
- “Would the observed phenomenon still hold if the quality of synthetic data exceeds that of real data” - this is an interesting question, and raises a much bigger point: What does data “quality” mean? In practice, we typically care about downstream tasks. On the other hand, for theory, and in all prior works on model collapse, models are evaluated purely on fitting the input distribution well (no matter its quality). Our paper is in large part motivated by this prior literature on model collapse. This has gotten a fair bit of attention, both within academia as well as in more mainstream media with sometimes alarming headlines (e.g., Scientific American, “AI-Generated Data Can Poison Future AI Models”, Forbes, “Generative AI And The Risk Of Inbreeding”, and many others). We think it is important to provide a counterpoint to these. Within the framework of the existing literature, it would be difficult to define what high-quality data means, or how to evaluate models trained on such data. But more importantly, even if we were able to meaningfully define this, it would risk lessening the impact of our work, as it would open us to criticism of unfair comparison. I.e., if we only showed that high quality data avoids model collapse, that wouldn’t necessarily mean that model collapse couldn’t still happen if unfiltered data contaminated the internet. We show that even under such pessimistic assumptions, data accumulation alone is enough to largely avoid the problem.
- Put differently: Think of our paper as a worst-case analysis, and a significant improvement on the prior lower bound, rather than a statement about the best-case.
- All that said, we would be happy to add a discussion of these points to the paper.
- On different modalities: We are not sure. The main difference is in the molecule experiments, and we admit none of us have sufficient domain knowledge to have a clear intuition of the data. (All three modalities do show the same effect with regard to accumulation vs replacement of data, to be clear.)
- On dataset size: We have in the meantime run an experiment where we keep the dataset size constant in the accumulation regime (i.e., we accumulate, but subsample a fixed-size training set at each model-fitting iteration; see the schematic sketch of the data protocols below). This does somewhat worse than accumulation, but significantly better than replacing. In particular, it looks like the test error remains bounded by a constant (or possibly increases at a very small rate). This is of course somewhat tangential to our purposes, as part of our point is that models are being trained on increasing dataset sizes over time. The only way we can think of to control purely for the number of optimizer steps would be to increase the batch size at each model-fitting iteration in the replacement regime. This would of course necessitate a very small batch size in the first iteration, and would be limited to just a few iterations, but we would be happy to try to run this experiment if you think it would add to the paper.
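To make the three data regimes we compare unambiguous, here is a schematic pseudocode sketch (the helper names `fit_model` and `generate` and the size arguments are placeholders, not our actual training code):

```python
import random

def run(protocol, real_data, n_iterations, n_synth_per_iter, fit_model, generate):
    """Schematic of the data-handling protocols only; model fitting and generation are abstracted away."""
    pool = list(real_data)              # iteration 1 always trains on real data
    fixed_size = len(real_data)         # used only by the 'accumulate-subsample' variant
    for _ in range(n_iterations):
        if protocol == "accumulate-subsample":
            train_set = random.sample(pool, min(fixed_size, len(pool)))
        else:                           # 'replace' and 'accumulate'
            train_set = pool
        model = fit_model(train_set)
        synthetic = generate(model, n_synth_per_iter)
        if protocol == "replace":
            pool = synthetic            # discard all earlier data
        else:
            pool = pool + synthetic     # keep real data plus all earlier synthetic data
    return model
```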
Thank you very much for your detailed responses! I now fully understand the research angle of the work and believe it is valid. I would appreciate it if you could include all these discussions in your next revision. Given the discussion, I have decided to raise the score to 7.
The paper focuses on the study of model collapse in deep generative models. It analyzes how the performance of these models deteriorates when they are trained on their own outputs. Two specific scenarios are examined: one where the data is replaced in each model iteration, and the other where data is accumulated over iterations. The authors offer a theoretical analysis of this in the context of linear regression. The paper finds that data accumulation is effective in preventing model collapse, in contrast to data replacement, which leads to increased test error. Afterwards, the authors conduct empirical experiments on various models, such as language models on texts, diffusion models on small molecules, and variational autoencoders on images, to verify the practical applicability of their hypothesis.
Reasons to Accept
This paper offers a solid contribution to the understanding of model collapse in generative models, which is critical for the continuous improvement of model reliability and performance. The comparative analysis of data replacement vs accumulation presents a clear and well-supported argument for the benefits of the latter strategy (data accumulation). The empirical evidence covering multiple deep generative models and datasets validates the conclusions, demonstrating that this approach has practical applicability.
Reasons to Reject
I feel like the major weakness is the limited scope of the theoretical analysis to linear regression. This means that the conclusions may not immediately generalize to more complex models. To better connect the theoretical analysis with applications in modern deep generative models, the authors could expand the theoretical framework to include non-linear models. For example, given that the authors try to validate their hypothesis on language models over discrete sequence data, which can be seen as a sequence of classification problems, the authors could consider extending their theoretical conclusions to logistic regression.
Also, why data accumulation prevents model collapse could be further explored and articulated; further theoretical insights or explanatory hypotheses would enrich the reader's understanding of the underlying mechanisms. Plus, the authors could validate the practical applicability of their claims in larger model settings, for instance on a 3B-parameter Llama.
Questions to the Authors
see weaknesses
Dear Reviewer Djpz,
Thank you for your review and useful comments. Overall, we want to stress that the main message of our paper is to show that accumulating vs replacing data leads to a stark difference in model collapse (or lack thereof), a point overlooked in prior literature. We believe, as you also state, that we thoroughly demonstrate this. To your specific points:
I feel like the major weakness is the limited scope of theoretical analysis to linear regression [...] the conclusions may not immediately generalize to more complex models.
We consider the main contribution of our paper to be the experiments, which are more extensive and diverse than prior work in the model collapse literature. The theory was intended as the simplest analytically-tractable example to explain mathematically why accumulating data makes a difference. In our revised manuscript, we swap the order of experiments and theory to emphasize the experiments.
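To illustrate what the linear theory is meant to convey, here is a minimal, self-contained simulation sketch of the toy setting (our own illustrative code with arbitrary dimensions, sample sizes, and noise level; it is not the paper's experimental code and does not reproduce its exact constants):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, n_iters = 20, 100, 1.0, 50          # arbitrary toy values
w_true = rng.normal(size=d)

X_test = rng.normal(size=(10_000, d))
y_test = X_test @ w_true + sigma * rng.normal(size=10_000)

def fit(X, y):
    """Ordinary least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

for protocol in ("replace", "accumulate"):
    # Iteration 1 trains on real data; later iterations train on model-generated labels.
    X0 = rng.normal(size=(n, d))
    y0 = X0 @ w_true + sigma * rng.normal(size=n)
    batches = [(X0, y0)]
    for _ in range(n_iters):
        X_all = np.vstack([X for X, _ in batches])
        y_all = np.concatenate([y for _, y in batches])
        w_hat = fit(X_all, y_all)
        # The next iteration's data is labeled by the current fitted model (plus fresh noise).
        X_new = rng.normal(size=(n, d))
        y_new = X_new @ w_hat + sigma * rng.normal(size=n)
        if protocol == "accumulate":
            batches.append((X_new, y_new))       # keep real data plus all earlier synthetic data
        else:
            batches = [(X_new, y_new)]           # discard everything but the newest synthetic data
    test_mse = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"{protocol}: test MSE after {n_iters} model-fitting iterations = {test_mse:.2f}")
```

Running a sketch like this shows the replace regime's test error climbing with the iteration count while the accumulate regime's error stays close to the noise floor, which is the qualitative contrast the theory formalizes.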
the authors could expand the theoretical framework to include non-linear models.
Our results hold for nonlinear kernel regressors; we would be happy to state such results in the appendix. However, deriving results for other nonlinear models like logistic regression is non-trivial, as we know of no closed-form solution. In general, theory in this direction is very hard. We also feel that this would detract from the manuscript’s main message, which we believe is an important and highly timely one, and one for which we provide ample evidence.
Plus, authors can validate the practical applicability of their claims on larger model settings,
We want to clarify that we pre-train these models from scratch, at each iteration, with growing dataset sizes. Thus, to evaluate a sequence of models with just 5 model-fitting iterations, we need to train for the equivalent of 20 full pre-training runs (1+2+3+4+5 for accumulate, 1+1+1+1+1 for replace). The experiments in the paper altogether constitute hundreds of full pre-training runs. Doing this on billion-parameter models and web-scale data would require compute matching OpenAI, Google or Meta, which sadly we do not have.
For this reason, we carefully chose the TinyStories dataset, which was specifically designed to allow insightful experiments with smaller models. In particular, it was designed to be well-fit by 10-30M parameter models, and we evaluate models up to an order of magnitude larger than that.
Dear Reviewer Djpz, thank you again for your review and comments. We wanted to follow up to check if we’ve addressed your questions and concerns? If not, we’d be keen on discussing further in the remaining time.
Dear Reviewer, as the discussion period is coming to a close soon, we wanted to follow up once more to check if we have addressed your concerns? We would be keen to fully utilize the remaining time to discuss further if any questions or concerns remain.
Disclaimer: COLM limits rebuttals to one 2500 character response per reviewer. Given that multiple reviewers have asked for additional related work, we are posting our 2 page Related Work section here. Please don't penalize us or desk-reject us for responding to the reviewers' request.
A Summarization and Discussion of Prior and Related Work
Prior Empirical Work A growing body of recent work has investigated the phenomenon of iteratively training models on data generated by previous models, e.g., Hataya et al. (2023); Martínez et al. (2023a); Shumailov et al. (2023); Alemohammad et al. (2023); Martínez et al. (2023b); Bertrand et al. (2023); Briesch et al. (2023); Dohmatob et al. (2024a;b) and (in a different context) Taori & Hashimoto (2023). Hataya et al. (2023) and Martínez et al. (2023b) conducted experiments replacing real training data with generated data at each iteration, assuming that the dataset size remains fixed over time. They found that this iterative retraining procedure can lead to model degradation if the proportion of synthetic data becomes too high. Similarly, Shumailov et al. (2023) ran experiments with Gaussian mixture models, VAEs, and language models in which the total number of samples per iteration was held constant, and the samples always originated with the previous model rather than aggregating over time. Building on this work, Alemohammad et al. (2023) considered three treatments of data: fully replacing real data with synthetic data, augmenting a fixed real dataset with additional synthetic data, and mixing new real data with synthetic data at each iteration. In almost all of their experiments, they drew a fixed-size dataset from the most recent model at each iteration, without accumulating data. Bertrand et al. (2023) also assumed that dataset size and mixing proportions are constant over time in their theoretical stability analysis and empirical validation.
Prior Theoretical Work Over the last few years, there has been significant research effort contributing to our theoretical understanding of model behavior when synthetic data are integrated into training. The most closely related works to ours are Dohmatob et al. (2024a) and Dohmatob et al. (2024b); of course, the inspiration for the linear regression model studied in this paper directly comes from Dohmatob et al. (2024a). Dohmatob et al. (2024a) performs an in-depth analysis of high-dimensional linear and ridge regression when the training data used per iteration are generated from the previous iteration’s fitted model. They are able to conclude that the test error grows linearly with the iteration count in their setup, as well as derive more interesting and more nuanced results using random matrix theory. They also discuss how to mitigate model collapse through optimal regularization, both when the training data are noise-free and when they are noisy versions of the previous model’s synthetic outputs. A related noise-free setup was studied by Mobahi et al. (2020) in the case of self-distillation. Although Mobahi et al. (2020) considers a more general setup with ridge regression as a special case, they use noiseless predictions from the previous model as the training data for the next model, and show that eventually, the predictions shrink to zero. Through this, they highlight that self-distillation induces regularization in the function space, which initially is beneficial for reducing over-fitting, but eventually over-regularization causes underfitting and hence performance decay. Dohmatob et al. (2024b) go beyond the linear model to study model collapse – they study the tails of LLM outputs vs. real data and provide scaling laws that clearly identify regimes of model degradation when synthetic data misses tails present in real data. They identify an interesting phase transition in the test error scaling law depending on the real dataset size in comparison to (a functional of) the chopped-off tail, and conclude that enough real data is able to mitigate model collapse. All these works consider the scenario where the amount of training data available per iteration is fixed (and does not grow with the iteration count), and it is certainly possible that with a larger amount of synthetic data (from prediction by the previous model), several of these scalings would improve significantly. For example, in Equation (12) of Dohmatob et al. (2024b), one obtains the linear scaling (with iteration count) of test error simply because the amount of synthetic data generated per iteration is the same. If one generated synthetic data with size proportional to the iteration count, then at iteration $n$, the scaling would, instead of $n$, be like $\sum_{i=1}^{n} 1/i \approx \log n$ for large $n$. When one does not increase the dataset size, Dohmatob et al. (2024b) points out that increasing the proportion of real data would help one to avoid model collapse altogether.
(continued from Part 1 above)
However, even if one did increase the amount of synthetic data with iteration count, Theorem 3.2 coupled with Corollary 3.3 in Dohmatob et al. (2024b) would tell us that the amount of real data was all that mattered – if the amount of real data is large, we overcome model collapse. If one only had synthetic data (and no real data), no matter how large, it would be impossible to regain the original real-data scaling laws. The scenario we study is highly inspired by these pioneering works, but still, in our view, different. We consider the case when we keep augmenting synthetic data (generated by the previous model trained on all the previous data so far) as iterations progress, much akin to how – in our view – the internet evolves. We observe that we can avoid model collapse in this setting. The analysis of previous models in our case is more involved, since the data used for training at iteration n is not homogeneous – different models from the past impart different statistical aspects to different parts of the training data. We also note a related augmentation model studied by Jain et al. (2024) – they perform risk minimization augmenting real data with synthetic data available from a potentially different independent source. One of their messages is that augmentation of (even) pure noise can be looked upon as adding a ridge penalty and hence, in certain cases, can improve test error. Their setup, however, is different from ours, since the synthetic data in their setup is not obtained by a learning algorithm employed on the real data, and the process is not iterative. However, morally, each iteration of ours involves risk minimization on data statistically composed of an equal mixture of data generated from the previous models, and hence each iteration of ours can be mapped to the general framework developed in Jain et al. (2024), although the dependencies among the various models trained in our setup introduce theoretical complications that do not seem to be too easily addressed by the theory developed in Jain et al. (2024). Shortly after v1 of our manuscript was uploaded to ArXiv, two other manuscripts appeared, dealing with the theoretical aspects in a setting similar to ours. Theorem 1 of Marchi et al. (2024) obtains the same square summability scaling of the variance as us. Seddik et al. (2024) studies collapse in language models in both purely synthetic and partly synthetic regimes and obtains deviation bounds as model iterations progress.
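To summarize the scaling contrast discussed above in one place (schematically, with dimension-dependent constants suppressed; this is our paraphrase of the cited results, not a verbatim restatement of any theorem): under replacement each model-fitting iteration contributes a variance term of roughly constant size, whereas under accumulation the contribution of iteration $i$ shrinks like $1/i^2$ and is therefore summable:

```latex
% Schematic scaling only; constants and dimension factors omitted.
\begin{aligned}
\text{replace:}\quad
  \mathbb{E}\big[\text{test error at iteration } n\big]
  &\;\asymp\; \sigma^{2} \sum_{i=1}^{n} 1 \;=\; \sigma^{2} n
  \;\longrightarrow\; \infty, \\[4pt]
\text{accumulate:}\quad
  \mathbb{E}\big[\text{test error at iteration } n\big]
  &\;\asymp\; \sigma^{2} \sum_{i=1}^{n} \frac{1}{i^{2}}
  \;\le\; \frac{\pi^{2}}{6}\,\sigma^{2}
  \quad\text{(bounded in } n\text{)}.
\end{aligned}
```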
Considering Accumulating Data The two papers we found that partially considered accumulating data are Martínez et al. (2023a) and Alemohammad et al. (2023). Alemohammad et al. (2023) did so in one-half of one experiment: StyleGAN2 trained on FlickrFaces 128×128 (App. Fig. 8). The authors concluded that accumulating data does not avoid model collapse, but merely slows it down. However, we believe that a closer examination of their results (App. Fig. 8) reveals that accumulating data causes the test error to plateau to a relatively low error with increasing numbers of model-fitting iterations. This result would support our conclusion that accumulating data avoids model collapse and does not merely delay it. The results from Martínez et al. (2023a) are harder to evaluate; model collapse only seems to occur when the amount of synthetic data added per model-fitting iteration is 2× the total amount of accumulated data, and the subsequent work by the authors switched from accumulating data to replacing data (Martínez et al., 2023b). We think understanding under what conditions and why these discrepancies exist is an interesting future direction.
Avoiding Model Collapse Several papers present methods for avoiding or slowing model collapse. Bertrand et al. (2023) shows in the replacing-data setting that model collapse will not occur if the initial generative models approximate the data distribution well enough and the proportion of real data is sufficiently large with respect to the synthetic data. Dohmatob et al. (2024b) similarly demonstrates that in the replacing-data setting, carefully selecting real data to mix with synthetic data can avoid model collapse. Other solutions may also be possible in various models and under various assumptions. To our knowledge, no paper has claimed an “optimal” strategy to avoid model collapse, and neither has ours.
(continued from Part 2 above)
Note: In our revised manuscript, we display Figure 7 from Alemohammad et al. (2023) here. However, we do not know how to embed an image on OpenReview.
Figure 8: Clarification of Data Accumulation in Alemohammad et al. (2023). Figure 7 from Alemohammad et al. (2023) (above) shows that linearly accumulating data (“Synthetic augmentation loop”) causes poor behavior to plateau with the number of model-fitting iterations. Alemohammad et al. (2023) write, “Our experiments [...] support our main conclusion [that] fixed real training data only delays the inevitable degradation of the quality or diversity of the generative models over generations.” We believe that our evidence and their evidence are more consistent with the conclusion that accumulating data avoids model collapse and does not merely delay it.
I think this is a great paper. The problem, model collapse, is interesting. It provides interesting empirical work with interesting theoretical work on a toy case. I disagree with reviewer Djpz that the theoretical analysis is too limited -- it's hard to analyze beyond the linear case. I thought the authors did a tremendous job in engaging with the reviewer feedback. Apologies for the shorter meta-review -- it is simply due to the fact that the reviewers did a great job on this one.