How to Synthesize Text Data without Model Collapse?
Abstract
Reviews and Discussion
This paper introduces a novel approach to generating semi-synthetic text data to address the issue of model collapse when training with synthetic data. The method is supported by a solid theoretical framework under a simplified linear model setting. Extensive experiments validate the effectiveness of the approach, showing improvements in training stability.
Questions for Authors
If the issue with synthetic data is coverage and the lack of long tails, could we generate large amounts of data using an LLM and preserve only the long-tail samples? Will this method be better than token editing, given that token editing still requires human data?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
I didn't check the proofs closely.
Experimental Design and Analysis
I checked the experiments; they look sound to me.
Supplementary Material
No, I did not.
Relation to Prior Literature
To my knowledge, the key contribution, the token-editing method, is novel.
Essential References Not Discussed
None
Other Strengths and Weaknesses
The proposed token editing method to avoid model collapse is interesting. It is simple and useful. The experiments demonstrate the effectiveness of the method.
However, the proposed method generates semi-synthetic data rather than purely synthetic data. In this sense, it does not address model collapse when training with purely synthetic data, which is the more important problem: the motivation for using synthetic data is the situation where we have run out of human data, yet semi-synthetic data still requires human data, making this more of a data-augmentation method than a data-synthesis method.
Other Comments or Suggestions
The title is "How to Synthesize Text Data without Model Collapse?", but the solution the authors provide is to use semi-synthetic data instead, which seems to deviate from what the title suggests.
We are grateful for your positive feedback and insightful comments. Below, we give detailed responses to your questions.
[Q1] : The solution the authors provide is to use semi-synthetic data instead, which seems to deviate from what the title suggests.
We will revise the wording and provide further clarification to avoid misunderstandings. An example is provided below:
Based on prior work, model collapse is related to multiple factors, including self-generation processes, data quality, and others. Our proposed method is inspired by the statistical analysis of synthetic data in Sec. 3; specifically, we focus on the model collapse phenomenon caused by data quality, and we therefore try to improve data quality in order to prevent model collapse. In other words, we do not address model collapse directly; rather, we prevent it indirectly by improving data quality.
These statements will be included in the Introduction and Related Work sections to better clarify our method.
[Q2] : If the issue with synthetic data is coverage and the lack of long tails, could we generate large amounts of data using an LLM and preserve only the long-tail samples? Will this method be better than token editing, given that token editing still requires human data?
1. Could we generate large amounts of data using an LLM and preserve only the long-tail samples?
Generating long-tail samples is currently difficult for language models. The reason lies in the sampling strategies of LLMs [1]. Current LLMs adopt top-p, top-k, or other sampling strategies for better generation quality, but these strategies truncate the output distribution. As data synthesis scales up, this drawback eventually manifests as a scaling-law cutoff on synthetic data [1]. Human corpus data, by contrast, follows a Zipf distribution [2]. The truncated output distribution means that LLMs almost never sample from the long tail; in other words, it is currently difficult to elicit long-tail samples from LLMs that are as diverse as human data.
On the other hand, if we force the language model to generate long-tail samples, the output may contain both noisy and high-information samples, which are like two sides of a coin, both distributed in the long tail of the data [3]. This necessitates further filtering for the high-information samples. Unfortunately, such samples are challenging to identify automatically in practice and may require extensive human annotation [4].
[1] Dohmatob, E., Feng, Y., Yang, P., et al. A Tale of Tails: Model Collapse as a Change of Scaling Laws. arXiv:2402.07043, 2024.
[2] Zipf, G. K. The Psycho-Biology of Language: An Introduction to Dynamic Philology. Routledge, 2013.
[3] Swayamdipta, S., Schwartz, R., Lourie, N., et al. Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics. arXiv:2009.10795, 2020.
[4] Lin, Z., Gou, Z., Gong, Y., et al. Rho-1: Not All Tokens Are What You Need. arXiv:2404.07965, 2024.
2. Will this method be better than token editing, given that token editing still requires human data?
Yes, we agree with you on this idea. If we had sufficiently powerful LLMs, we could generate a large amount of long-tail data and quickly filter out the irrelevant ones, which would significantly boost the performance of current LLMs. However, these conditions are difficult to meet at present, and much work remains to be done.
Thanks! I will keep my score. I suggest the authors reconsider the title in future versions.
Thank you very much for your quick and constructive feedback! Based on your suggestion, we have revised the title. Please consider the following updated title:
- To Edit and Not to Synthesize: Combating Model Collapse with Semi-Synthetic Data
We remain open to making further revisions based on your valuable feedback. Looking forward to your continued comments.
The paper is twofold: the first part focuses on the effects of mixing real and synthetic data and what the authors call non-iterative model collapse. The second part proposes ToEdit, a method to adjust synthetic text data by resampling those tokens that have a high probability of being generated. The authors argue that this method can prevent the negative effects of synthetic text data.
Update after rebuttal
After reading the other reviews and the corresponding rebuttals, I think the authors addressed most of my (and the other reviewers') concerns. Therefore, I think this paper can be accepted, and I will raise my score. I still suggest that the authors spend some more space explaining their method to make it easier for the reader.
Questions for Authors
- Q1: Does ToEdit change the properties of synthetic data in a desired way as specified in the first part of the paper? Could the authors provide similar statistics as in section 3 for the edited data?
- Q2: Does ToEdit help in an iterative process as well? If I understand correctly, the theoretical proof says so, but there is no experimental evidence for this in the paper.
- Q3: Can ToEdit help with already strongly collapsed data or is a minimum quality of the data necessary?
Claims and Evidence
The claims of the paper are supported by sound experiments and theoretical analysis.
Methods and Evaluation Criteria
The methods to validate their ToEdit approach seem fitting and convincing.
Theoretical Claims
The authors provide theoretical evidence for their ToEdit method. However, I did not check the extended proof in the Appendix for correctness.
Experimental Design and Analysis
The experimental design seems fitting. Nevertheless, I have some further questions (see below).
Supplementary Material
I did read through most of the Appendix (except the extended proof), and it answered a lot of questions I had when reading the main text (especially Section G). I would suggest that the authors at least mention those sections at the appropriate points in the main text to make it obvious for the reader where to find those answers.
Relation to Prior Literature
The first part about non-iterative model collapse is nothing other than a single iteration of model collapse as it happens in the real world. The authors claim that non-iterative model collapse is different because the data is not generated by the same model, as in the related work. I would argue that this is just an experimental choice made for tractability in other papers.
The main contribution to literature is the ToEdit approach to help mitigate model collapse for textual data.
Essential References Not Discussed
There are several other studies on model collapse both theoretical and empirical. I suggest that the authors extend their related work section to give proper credit.
For example, but not limited to:
Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Humayun, A. I., Babaei, H., LeJeune, D., ... & Baraniuk, R. Self-Consuming Generative Models Go MAD. ICLR 2024.
Bertrand, Q., Bose, J., Duplessis, A., Jiralerspong, M., & Gidel, G. On the Stability of Iterative Retraining of Generative Models on their own Data. ICLR 2024.
Briesch, M., Sobania, D., & Rothlauf, F. Large Language Models Suffer from Their Own Output: An Analysis of the Self-Consuming Training Loop. arXiv:2311.16822, 2023.
Martínez, G., Watson, L., Reviriego, P., Hernández, J. A., Juarez, M., & Sarkar, R. Towards Understanding the Interplay of Generative Artificial Intelligence and the Internet. International Workshop on Epistemic Uncertainty in Artificial Intelligence, pp. 59-73. Springer, 2023.
Other Strengths and Weaknesses
The ToEdit method is not trivial to understand on a first read of the paper. Maybe the authors could use some more space to better illustrate their method (as it is the core contribution) so the reader can more easily understand the edit operations (a small example might help here).
Other Comments or Suggestions
None
We are grateful for your enthusiastic feedback. Below, we give detailed responses to your questions.
[Q1] : References Not Discussed
We will include a discussion in the revised version as follows:
[1] demonstrates that without enough fresh real images, future generative models will gradually decline. [2] develops a rigorous framework to demonstrate the importance of real data in maintaining the stability of iterative training. [3] illustrates that real data in the iterative training process can slow the decline of LLMs, but cannot fully prevent it. [4] shows that the quality and diversity of generated images degrade over time.
[1] Self-Consuming Generative Models Go MAD.
[2] On the Stability of Iterative Retraining of Generative Models on their own Data.
[3] Large language models suffer from their own output: An analysis of the self-consuming training loop.
[4] Towards Understanding the Interplay of Generative Artificial Intelligence and the Internet.
[Q2] : Maybe a small example to illustrate their method.
We add an input-output example below. The complete workflow is also provided in Lines.
Case 1: code reasoning sample in Magicoder-Evol-Instruct-110K:
| Before (source) | After (edited) | Change |
|---|---|---|
| Construct a function using PHP language that applies lexical analysis on a provided text string to analyze the individual, non-repeated words elements present. | Construct a function using PHP language that applies lexical analysis on a provided text string to quantify unique words. | "analyze" → "quantify" |
| Test with provided string, $str = 'Greetings, Planet Earth!'. | Test with provided string, $str = 'Greetings, Planet Earth!'. | No changes. |
| Implements wordCount to remove punctuation, convert text to lowercase, split into words, and count unique words. | Implements wordCount to remove punctuation, convert text to lowercase, split into words, and calculate unique words. | "count" → "calculate" |
| Returns {'greetings': 1, 'planet': 1, 'earth': 1}. | Returns {'greetings': 1, 'planet': 1, 'earth': 1}. | No changes. |
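For concreteness, the editing step can also be sketched in a few lines of Python. This is a simplified illustration: the prior model ("gpt2") and the threshold value below are placeholders, not the exact configuration from Sec. 5.1.

```python
# Simplified sketch of token-level editing: tokens that the prior LM assigns
# probability above a threshold ("easy" tokens) are resampled, while all
# other human-written tokens are kept verbatim. The model and threshold are
# placeholders, not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder prior LM
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_edit(text: str, threshold: float = 0.99) -> str:
    ids = tok(text, return_tensors="pt").input_ids       # shape (1, seq_len)
    with torch.no_grad():
        probs = lm(ids).logits.softmax(dim=-1)           # next-token distributions
    edited = ids.clone()
    for i in range(1, ids.shape[1]):                     # token 0 has no prefix
        p = probs[0, i - 1]                              # distribution over token i
        if p[ids[0, i]] > threshold:                     # "easy": model is confident
            edited[0, i] = torch.multinomial(p, 1).item()  # resample this token
    return tok.decode(edited[0])
```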
[Q3] : Does ToEdit change the properties of data in a desired way?
Yes, we present the statistical analysis of the edited data below. As shown in the two supplemented tables below, ToEdit meets our expectations by preserving the original long-tail distribution. Additionally, Table 14 (Line 1100) illustrates that the number of tokens above the threshold gradually decreases as the iterations progress.
- KL Divergence Between Distributions (gen_0, gen_1, gen_2)
| Distribution Comparison | gen_0 | gen_1 | gen_2 |
|---|---|---|---|
| gen_0 | 0 | 5.56e-6 | 1.34e-5 |
| gen_1 | 5.56e-6 | 0 | 9.61e-6 |
| gen_2 | 1.34e-5 | 9.61e-6 | 0 |
- Sample Distribution Across PPL Intervals (gen_0, gen_1, gen_2)
| PPL Interval | gen_0 | gen_1 | gen_2 |
|---|---|---|---|
| 6.04~25.43 | 49087 | 49101 | 49132 |
| 25.43~44.83 | 42548 | 42532 | 42510 |
| 44.83~64.23 | 5147 | 5149 | 5148 |
| 64.23~83.62 | 1993 | 1993 | 1984 |
| 83.62~103.02 | 762 | 763 | 763 |
| 103.02~122.42 | 339 | 338 | 340 |
| 122.42~141.81 | 98 | 98 | 95 |
| 141.81~161.21 | 18 | 18 | 19 |
| 161.21~180.6 | 4 | 4 | 4 |
| 180.6~200.0 | 4 | 4 | 4 |
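For reference, the mechanics of such a KL comparison can be reproduced with a short snippet. The bin counts below are copied from the PPL table above; this illustrates the computation only, and the reported KL values may be computed over finer-grained distributions (e.g., token frequencies), so exact numbers can differ.

```python
# Minimal sketch: KL divergence between two generations' binned distributions.
# Counts are the PPL bin counts from the table above; the reported values may
# use finer-grained distributions, so exact numbers can differ.
import numpy as np
from scipy.stats import entropy

gen_0 = np.array([49087, 42548, 5147, 1993, 762, 339, 98, 18, 4, 4], dtype=float)
gen_1 = np.array([49101, 42532, 5149, 1993, 763, 338, 98, 18, 4, 4], dtype=float)

# entropy(p, q) with two arguments computes KL(p || q); inputs are normalized.
kl = entropy(gen_0 / gen_0.sum(), gen_1 / gen_1.sum())
print(f"KL(gen_0 || gen_1) = {kl:.2e}")  # tiny value: generations barely differ
```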
[Q4] : Does ToEdit help in an iterative process?
Yes, the following supplemented table shows effectiveness in an iterative process, with slight improvements in performance across generations.
- Performance in an iterative process on Instruction tuning data.
| | PIQA | BoolQ | HS | SIQA | WG | Avg |
|---|---|---|---|---|---|---|
| Gen 0 | 79.87 | 81.28 | 59.72 | 49.69 | 74.51 | 69.01 |
| Gen 1 | 80.25 | 81.16 | 59.74 | 50.56 | 74.59 | 69.26 |
| Gen 2 | 80.14 | 82.69 | 59.82 | 50.51 | 73.80 | 69.39 |
[Q5] : Can ToEdit help with already strongly collapsed data or is a minimum quality of the data necessary?
This is a very interesting and insightful question. The ToEdit algorithm was initially designed to preserve the long-tail distribution during the data generation process, thereby avoiding model collapse. For already collapsed data, the variance is typically very small, and enhancing diversity is crucial; in that case, we can adjust the threshold to inject more randomness into the collapsed data. However, this is a theoretical scenario, and in practice, data conditions are highly complex, so many more challenges would need to be addressed.
This paper investigates the issue of model collapse, which happens when training models on synthetic data causes performance degradation or, in some scenarios, complete model breakdown. The authors discuss the negative correlation between the proportion of synthetic data and model performance, even without iterative training.
To understand this decline, the authors perform statistical analyses revealing that synthetic data suffers from distribution narrowing and over-concentration of n-gram features. As a solution, instead of focusing on synthetic data generation, they propose token-level editing on human-produced data to generate semi-synthetic data. They utilize a pretrained LLM to find the 'easy' data points, then edit them for better final training performance.
Questions for Authors
1- How does the editing mechanism ensure the quality of the semi-synthetic data?
2- How are algorithm metrics selected?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
I've checked the theoretical results in the main paper and appendix, and they seem correct.
Experimental Design and Analysis
Yes; this is discussed further in the weakness and question sections.
Supplementary Material
Yes, all.
Relation to Prior Literature
The underlying problem, model collapse, is critical since real data has become scarcer, and models do need high-quality synthetic data in their training. Some of the paper's findings regarding the characteristics of synthetic data are especially important in guiding better synthetic data generation methods. However, the proposed method does not do much about solving model collapse; instead, the authors focus on improving and enhancing the quality of real-data training via editing.
Essential References Not Discussed
Papers showing the model collapse phenomena:
[1] Bertrand, Quentin, et al. "On the stability of iterative retraining of generative models on their own data." ICLR 2024.
[2] Ferbach, Damien, et al. "Self-consuming generative models with curated data provably optimize human preferences." NeurIPS 2024.
[3] Kazdan, Joshua, et al. "Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World." ICML 2024.
Since this paper shows a method to alleviate model collapse, it should also mention some of the known methods that have successfully generated useful synthetic data. [some of them are listed below]
[1] Wang, Yizhong, et al. "Self-instruct: Aligning language models with self-generated instructions." ACL 2023.
[2] Ulmer, Dennis, et al. "Bootstrapping llm-based task-oriented dialogue agents via self-talk." arXiv preprint arXiv:2401.05033 (2024).
[3] Gulcehre, Caglar, et al. "Reinforced self-training (rest) for language modeling." arXiv preprint arXiv:2308.08998 (2023).
[4] Singh, Avi, et al. "Beyond human data: Scaling self-training for problem-solving with language models." TMLR 2024
etc.
Other Strengths and Weaknesses
Strength:
1- The statistical findings about the differences between synthetic and real data are interesting.
2- The problem description and observations are well-written
3- The paper presents the importance of the problem in various settings and experiments.
Weaknesses:
1- The first finding is already established information; multiple prior works have observed it before.
2- The proposed method is not a direct solution for model collapse but focuses on improving the quality of real data.
3- Some of the improvements in Tables 2 and 3 are minor. The authors should elaborate further on the results, as they currently show that the proposed method does not help in several tasks.
4- Some of the experiment settings are missing (for example, the training parameters for pretraining and fine-tuning).
5- The information about details of ToEdit experiments is missing. How many iterations are required to generate the semi-synthetic data? What are the computational costs?
Other Comments or Suggestions
Please check out the weakness, question, and citation sections.
We appreciate your valuable feedback. In the following, we will address your concerns accordingly.
[Q1] : References Not Discussed
We will include a discussion on all the provided references as follows:
[1] develops a rigorous framework to demonstrate the importance of real data in maintaining the stability of iterative training. [2] theoretically demonstrates that the impact of data curation can be formalized as an implicit preference optimization mechanism. [3] reveals the detailed training dynamics of model collapse under three different training workflows. Of course, there are also some remarkable studies that successfully used synthetic data. [4] proposes the Self-Instruct data generation framework, enhancing instruction-following capabilities. [5] employs the self-talk method to generate high-quality data. ReST [6] uses a policy model to generate datasets and then employs offline RL to fine-tune LLMs on generated datasets. [7] demonstrates that self-training with binary feedback filtering can reduce reliance on real data.
[1] On the stability of iterative retraining of generative models on their own data.
[2] Self-consuming generative models with curated data provably optimize human preferences.
[3] Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World.
[4] Self-instruct: Aligning language models with self-generated instructions.
[5] Ulmer, Dennis, et al. "Bootstrapping llm-based task-oriented dialogue agents via self-talk.
[6] Gulcehre, Caglar, et al. "Reinforced self-training (rest) for language modeling.
[7] Beyond human data: Scaling self-training for problem-solving with language models.
[Q2] : The first finding is already established information.
We agree with you that the potential for synthetic data to cause model collapse has been demonstrated in prior notable works [8]. However, our statistical experiments contribute a more fine-grained analysis of textual data features, specifically examining the underlying reasons behind the failure of synthetic data, e.g., the overconcentration of n-grams.
[8] Position: Model Collapse Does Not Mean What You Think
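To make "overconcentration of n-grams" concrete, here is a minimal sketch of the kind of statistic involved (a hypothetical helper, not the paper's analysis code): the share of all bigram occurrences captured by the top-k most frequent bigrams, which tends to be markedly higher for synthetic corpora than for human text.

```python
# Hypothetical helper (not the paper's analysis code): measure how
# concentrated a corpus's bigram distribution is via the share of all
# bigram occurrences captured by the k most frequent bigrams.
from collections import Counter

def bigram_concentration(tokens: list[str], top_k: int = 100) -> float:
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(top_k)) / total

# Usage: tokenize two corpora the same way and compare the scores;
# a higher score indicates a more concentrated (less diverse) corpus.
sample = "the cat sat on the mat and the dog sat on the rug".split()
print(f"top-100 bigram share: {bigram_concentration(sample):.2f}")
```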
[Q3] : The authors focus on improving the quality of the real-data training via editing.
Yes, we agree with you on this point. Based on insights from prior works [1], model collapse is associated with multiple factors, including self-generated processes, data quality, and others. Our proposed method is inspired by the statistical findings on synthetic data (as stated in Sec. 3). Therefore, we attempt to prevent model collapse by improving data quality.
Furthermore, we will clarify in the paper that we are not directly addressing model collapse, but rather indirectly preventing it by improving data quality. Additionally, in our experiments, the data we edited is not entirely real data. We also conduct experiments on synthetic data, such as (1) the datasets used in continual pretraining (e.g., Biomed, Finance) [9], and (2) OSS-Instruct-75K and Evol-Instruct-110K, which also contain samples synthesized by ChatGPT.
Please refer to Appendix G, Q7 for more discussion.
[9] Instruction Pre-Training: Language Models are Supervised Multitask Learners (Cheng et al., EMNLP 2024)
[Q4] : The authors should elaborate further on the results.
As shown in Table 3, our method improves performance for both OLMo-1B and Llama-3-8B on the tasks where it succeeds. In Biomedicine, OLMo-1B's average score improves from 38.83 to 40.89, and Llama-3-8B's from 56.04 to 56.48. Similar improvements are seen in the Finance and Math domains. Additionally, Table 2 shows the effectiveness of our approach in general pre-training, with OLMo-1B's average performance rising from 32.75 to 33.11. In Table 4 (SFT), ToEdit improves FLAN v2 from 70.18 to 70.65 and boosts task performance on Natural Instructions. However, on more challenging tasks, such as Math, the improvements are limited or negligible. This indicates that data modification alone has limited impact on harder reasoning tasks.
Please refer to Appendix C for detailed discussion.
[Q5] :
- Some of the experiment settings are missing.
- The information about the details of the ToEdit experiments is missing.
Please refer to:
- Appendix F (Line 412) for detailed experimental settings, including pre-training and fine-tuning.
- Section 5.1 (Line 381) for the ToEdit experiment settings and cost.
Additionally, we add iterative experiments on instruction tuning data in Rebuttal to Reviewer Qb5k.
[Q6] : How does the editing mechanism ensure the quality of the semi-synthetic data?
We conduct preliminary ablation experiments on the threshold to control the editing process and ensure the quality of the output data. As shown in Table 5, we choose the best parameters for the main experiments.
[Q7] : How are algorithm metrics selected?
We select our evaluation metrics based on established practices. Please refer to Appendix F.2.
The issue addressed by this paper, using synthetic data during pretraining, is a very important and timely one. Going forward, pretraining will use a higher, and eventually dominant, proportion of synthetic data. The main findings are in Sec. 3.2: the three failure modes of Cosmopedia, evaluated using perplexity as the main metric. These failure modes teach us how to inspect synthetic data to find out what's wrong, and they inspire ways to adjust the synthetic data generation pipeline. For this reviewer, the token-level editing method is interesting, but not as important as the identification of the failure modes; it is better to address the problem at the source, not after the synthetic data are generated.
Questions for Authors
No questions
Claims and Evidence
I think the problem with "synthetic data" is with one particular synthetic dataset (Cosmopedia) only. This paper can give the false impression that all synthetic data suffer from the same issues.
Methods and Evaluation Criteria
Yes, they make sense.
Theoretical Claims
I apologize that I did not have time to check the equations. The ideas are sound and practicable, so I didn't check.
Experimental Design and Analysis
They are sound
Supplementary Material
No, I did not
Relation to Prior Literature
No comment
Essential References Not Discussed
I don't know of any
Other Strengths and Weaknesses
No
Other Comments or Suggestions
No
We sincerely thank you for your critical feedback and valuable suggestions. Below, we will strive to address your concerns and refine our paper accordingly.
[Q1] : I think the problem with "synthetic data" is with a particular synthetic data (Cosmopedia) only. This paper can lead to a false impression that all synthetic data suffer from the same issues.
We would like to clarify that our analysis is not meant to imply that all synthetic datasets suffer from the same issues. The analysis and theoretical validation are grounded in previous research on model collapse [1,2,3,4], a potential risk associated with training using synthetic datasets. Specifically, we use Cosmopedia as a representative example to highlight these concerns.
To avoid any potential misunderstanding, we will include the following explanation and clarification in the revision:
- Cosmopedia is currently the largest open-source synthetic dataset, and it comes with detailed statistical information. We use Cosmopedia as a representative example to illustrate failure modes when using synthetic data. We agree that the issues identified may vary depending on the specific dataset or generation pipeline used; however, the results obtained on Cosmopedia should still have a certain level of representativeness.
- While our analysis primarily focuses on Cosmopedia, we do not intend to imply that all synthetic datasets suffer from identical problems. Rather, we aim to highlight general principles and cautionary lessons about synthetic data generation and evaluation.
- The tremendous success of synthetic data does not conflict with its potential issues, such as model collapse. We provide several examples of the success of synthetic data in the Introduction and Related Work: Lines 33-36 and 799-812 list numerous high-quality synthetic datasets, such as UltraChat, UltraMedical, and so on. Furthermore, the well-known Phi-1/2 series models are trained largely on synthetic data.
[1] AI models collapse when trained on recursively generated data. Nature 631, 755–759 (2024).
[2] A tale of tails: model collapse as a change of scaling laws. ICML'24
[3] Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World. arXiv, 2024.
[4] Position: Model Collapse Does Not Mean What You Think. 2025.
This paper introduces a novel approach to text data synthesis that effectively addresses model collapse issues. The proposed token-level editing method demonstrates technical innovation and promising empirical results. For the camera-ready version, the authors should expand their experimentation across different dataset sizes and provide more comprehensive ablation studies to better demonstrate the method's advantages.