PaperHub
Rating: 6.7 / 10 (Poster · 3 reviewers; min 6, max 8, std. dev. 0.9)
Individual ratings: 6, 6, 8
Confidence: 3.0 · Correctness: 3.0 · Contribution: 2.7 · Presentation: 2.7
ICLR 2025

A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-03-02
TL;DR

The paper introduces recursive stability to tackle Self-Consuming Training Loops, offering generalization bounds for generative models like transformers.

Abstract

Keywords
Generative Models · Synthetic Data · Transformer · Generalization Error · Learning Theory

Reviews and Discussion

Review (Rating: 6)

The authors attempt to present a theoretical framework to describe the generalization error of models when they are trained on synthetic data. They are motivated by the observation that multiple workflows presented in the literature on model collapse / self-consuming loops arrive at different conclusions. They introduce the notion of recursive stability, inspired by past notions of algorithmic stability like uniform stability, using which they quantify the propagation of error as model-fitting iterations progress. Finally, using these tools, they perform an analysis of in-context learning in transformers, obtaining generalization error estimates that describe the impact of real data in the model-fitting loop.

Strengths

Through their notion of recursive stability, the authors have considered a much more general class of algorithms / models than what the prior literature has described. They have been able to go beyond explicit model calculations or estimation strategies. This notion allows them to decompose the generalization error and explicitly attribute each term to a specific statistical reason. Interestingly, the authors are also able to identify, through these upper bounds on the generalization gap, the effect of proportion of real data in the STL.

Weaknesses

The authors have tried to be too general - this has significantly affected the readability of the paper. There are no examples of models provided. What models are recursively stable? How do we know the TV bounds will be of the same order? The paper really needs some examples for clarity. As of now, it is too hard to parse. Understandably it is a math paper, but a lot of what is written needs to be appropriately motivated as the premise addressed by the authors is an important empirical problem (the problem of model collapse) which the authors are tackling theoretically. I would recommend the authors to rethink their presentation for greater impact.

Further, to me the bounds derived are a rather straightforward application of "working with upper bounds" when the correct assumptions are in place. Thus, I feel the technical novelty is limited, and I am not sure if we are learning something substantially new from these upper bounds beyond what we already knew about STLs.

Finally, the literature review is a little too crude. The authors should connect their results with those obtained by Gerstgrasser et al. (2024), Dohmatob et al. (2024), Alemohammad et al. (2023), etc., because we know how specific workflows function already, and it is important to clarify that the present authors are not contradicting any existing result.

Questions

  1. In line 195, does $S_0'$ depend on $j$ or not? (The previous sentence suggests it does.) If it does, then the provided formula doesn't make sense.

  2. With regards to equation 1, let's consider $\alpha = 0$, i.e., no real data is used. What is the behavior of the error gap? It seems to me that only the last term will matter, and using the fact that $\lim_{\alpha\to 0} (1 - (1-\alpha)^i)/\alpha = i$, this predicts a linear rate of increase of the generalization gap. Is this correct?

  3. Following the point I make in 2 above, it seems to me that the discussion in lines 257-260 needs to change, because when $\alpha$ is too small, yes, $\alpha^{-1}$ gets large, but it's multiplied by $1-(1-\alpha)^i$, which is small too. They balance each other out exactly, giving $i$.

  4. I don't understand the final conclusion of Theorem 4. There are too many terms - which one matters the most? Gerstgrasser et al. (2024), Marchi et al. (2024) and Seddik et al. (2024) have shown that the gap is like $\sum_i 1/i^2$, which is kind of similar to the second term in the upper bound, but it has $n^{-1}$ in front, which makes me think that this term is a higher order term. The presented bound makes me think that only the first term $\mathcal{O}(n^{-1/4})$ matters, but then the rate $\log(i)$ (essentially) seemingly contradicts past findings. I would urge the authors to actually connect this part with the more recent literature on accumulating data for avoiding model collapse.

Comment

Q3: The literature review is a little too crude. The authors should connect their results with those obtained by Gerstgrasser et al. (2024), Dohmatob et al. (2024), Alemohammad et al. (2023), etc. because we know how specific workflows function already, and it is important to clarify that the present authors are not contradicting any existing result.


A: Thank you for your valuable feedback. Based on your suggestion, we have expanded the literature review by including detailed comparisons with recent works in the remarks following Theorem 1 and Theorem 4. These additions highlight how our results align with or extend previous findings, ensuring that our work is properly situated within the context of existing research. Below is a summary of the changes:


1. Additions to the Remark Following Theorem 1

``Dohmatob et al. (2024) examined a linear regression setting, focusing solely on statistical approximation error without addressing the functional approximation error described in Shumailov (2024). They did not consider incorporating real data to prevent collapse and demonstrated a linear dependency of degradation on the generation number in the case of fully synthetic data. Similarly, Alemohammad et al. (2023) and Shumailov (2024) provided theoretical insights using simple Gaussian models without incorporating real data, proving that the variance diverges linearly with the generation number. Seddik et al. (2024) explored a linear softmax classifier and, while also neglecting functional approximation error, demonstrated that adding real data can mitigate model collapse. Marchi et al. (2024) used asymptotic analysis to study parameter variance, assuming an infinite number of training generations and considering scenarios where the generative model is controlled via a ``temperature'' parameter. They proved that parameter variance is bounded under these conditions.

In contrast, our work addresses a much more complex and realistic scenario by introducing the novel concept of recursive stability and providing the first generalization analysis for STLs. Our analysis accounts for statistical approximation error, functional approximation error, and optimization error during the training of generative models. Unlike the settings explored in prior theoretical works, such as linear regression (Dohmatob et al. 2024, Gerstgrasser et al. 2024), Gaussian models (Alemohammad et al. 2023, Shumailov 2024), or asymptotic assumptions (Marchi et al. 2024), our framework accommodates more complex generative model architectures, such as transformers. Specifically, we reveal how both model architecture and the ratio of real to synthetic data influence the success of STLs. For example, in Theorem 3, we demonstrate how our general generalization bound applies to transformer-based generative models, providing a theoretical framework that aligns with practical and more sophisticated use cases.

Additionally, while Marchi et al. (2024) assumed an infinite number of training generations for their asymptotic analysis, we consider finite generations, which is more practical since most experimental setups limit generations to fewer than 10 (as noted in Shumailov 2024). Moreover, our results confirm that when $\alpha = 0$ (i.e., no real data is used), the last term in our bound, representing the Cumulative Distribution Shift ($d_{\text{TV}}(n) M (1 - (1 - \alpha)^i) \alpha^{-1}$), grows linearly. This finding aligns with the theoretical results of Dohmatob et al. (2024), Alemohammad et al. (2023), Shumailov (2024), and Fu et al. (2024). Furthermore, we show that introducing even a constant proportion of real data significantly mitigates model collapse, aligning with experimental findings by Alemohammad et al. (2023) and Bertrand et al. (2024).''


2. Additions to the Remark Following Theorem 4

``Gerstgrasser et al. (2024) also explored the use of accumulating data to prevent model collapse. They considered a linear regression setting without accounting for the dynamic process of training generative models, focusing solely on statistical approximation error. They demonstrated that under the assumption of fixed synthetic data quality matching the original real data, statistical approximation error can be controlled.

By contrast, our work addresses a much more complex and realistic scenario, incorporating the dynamic behavior of transformer-based generative models, learning algorithms, and both statistical and functional approximation errors. Additionally, we allow for dynamic regulation of synthetic data size via a $\lambda$ coefficient, enabling us to identify the optimal synthetic dataset size for avoiding model collapse in these more challenging settings.''

Comment

Q6: I don't understand the final conclusion of Theorem 4. There are too many terms - which one matters the most? Gerstgrasser et al. (2024), Marchi et al. (2024), and Seddik et al. (2024) have shown that the gap is like $\sum_i \frac{1}{i^2}$, which is kind of similar to the second term in the upper bound, but it has $n^{-1}$ in front, which makes me think that this term is a higher order term. The presented bound makes me think that only the first term $\mathcal{O}(n^{-1/4})$ matters, but then the rate $\log(i)$ seemingly contradicts past findings. I would urge the authors to connect this part with the more recent literature on accumulating data for avoiding model collapse.


A: Thank you for your thoughtful comments. Below, we address your concerns regarding the terms in Theorem 4, their connection to related works, and the changes we made to incorporate this feedback.


1. Clarification on the Terms in Theorem 4

In Theorem 4, the generalization bound is expressed as:

$$|\text{Generalization}| \lesssim n^{-1/4}\log((1+i\lambda)n) + n^{-1}\frac{\rho^2}{(1+i\lambda)^2}\log((1+i\lambda)n)\, i!\, \widetilde{B}_W^{(i+1)L} + n^{-1/2}\frac{Mi}{1+i\lambda}.$$
  • Cumulative Distribution Shift Term:
    The first term, $\mathcal{O}(n^{-1/4}\log((1+i\lambda)n))$, reflects the Cumulative Distribution Shift. This arises because generative models cannot perfectly fit the training distribution, leading to distributional drift with each generation. The magnitude of this term heavily depends on the model's capacity to approximate the data distribution. In this theorem, where the generative model is a transformer and the learning algorithm is SGD, the Cumulative Distribution Shift term dominates (a numeric sketch follows this list). Importantly, this term grows with the number of generations $i$ at a rate of $\mathcal{O}(\log(i))$. This growth does not contradict past findings but rather aligns with the intuition that each generation introduces incremental drift, which accumulates over time, as also stated by [2] Fu et al. (2024). Similar behavior has been observed in the $\Delta\text{Bias}$ terms discussed in [5] Dohmatob et al. (2024), where they were also shown to grow with $i$.

  • Generalization Error on Mixed Distributions:
    The second and third terms capture the Generalization Error on Mixed Distributions, which depends on the stability of the learning algorithm and the recursive stability of the generative model. These terms are influenced by both the choice of learning algorithm and the architecture of the generative model. For example, if a less stable learning algorithm is used, these terms could dominate the generalization bound.
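To give a concrete sense of the relative magnitudes, the small numeric illustration below evaluates the three terms of the bound for a few generations $i$. All constants ($n$, $\lambda$, $\rho$, $\widetilde{B}_W$, $L$, $M$) are hypothetical placeholders chosen for illustration only, not values from the paper.

```python
import math

# Hypothetical placeholder constants (not taken from the paper).
n, lam, rho, B_W, L, M = 10**6, 1.0, 1.0, 1.05, 6, 1.0

def theorem4_terms(i):
    """Evaluate the three terms of the Theorem 4 bound (up to universal constants)."""
    shift = n**-0.25 * math.log((1 + i * lam) * n)                   # cumulative distribution shift
    mixed_a = (rho**2 / (n * (1 + i * lam)**2)
               * math.log((1 + i * lam) * n)
               * math.factorial(i) * B_W**((i + 1) * L))             # mixed-distribution term (stability part)
    mixed_b = n**-0.5 * M * i / (1 + i * lam)                        # mixed-distribution term (remainder)
    return shift, mixed_a, mixed_b

for i in range(1, 7):
    s, a, b = theorem4_terms(i)
    print(f"i={i}: shift={s:.2e}, mixed_a={a:.2e}, mixed_b={b:.2e}")
```

With these placeholder values the first (distribution-shift) term dominates for small $i$, while the factorial and $\widetilde{B}_W^{(i+1)L}$ factors in the second term only take over at larger $i$ or for larger $\widetilde{B}_W$ and $L$.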


2. Connection to Related Literature

We appreciate the references to [7] Gerstgrasser et al. (2024), [8] Marchi et al. (2024), and [6] Seddik et al. (2024), which you mentioned in relation to terms involving $\sum_i \frac{1}{i^2}$. We note the following distinctions:

  • [7] Gerstgrasser et al. (2024):
    Gerstgrasser et al. analyzed a linear regression setting without considering the dynamic process of training generative models. Their analysis focuses on statistical approximation error, assuming that synthetic data generated in the $i$-th generation occupies a proportion of $1/i$ in the training set. Under this assumption and using squared error, they derive a term involving $\sum_i \frac{1}{i^2}$. In contrast, our work accounts for both statistical approximation error and functional approximation error, as introduced by Shumailov (2024), as well as optimization error arising in transformer training. Furthermore, our second term includes a component resembling $\frac{1}{(1+i\lambda)^2}$, which arises because, in the $i$-th generation, the synthetic data constitutes $1/(1+i\lambda)$ of the training set, and the stability of the learning algorithm is of the order $1/((1+i\lambda)n)$. The product of these factors naturally results in the $\frac{1}{(1+i\lambda)^2}$ term, which is distinct from their result.

  • [8] Marchi et al. (2024):
    The $\sum_i \frac{1}{i^2}$ term in Marchi et al. arises under a completely different context. They assume the dataset size grows with $i$ (i.e., $i$ represents dataset size, not generation number as in our work or [7] Gerstgrasser et al.). Their analysis focuses on asymptotic parameter variance under the assumption of infinite generations. By contrast, our results do not rely on asymptotic analysis and apply to finite numbers of generations while considering a broader set of error sources.

  • [6] Seddik et al. (2024):
    We could not identify terms resembling $\sum_i \frac{1}{i^2}$ in their work. Their focus appears unrelated to the setting of our Theorem 4.

Overall, while the $\sum_i \frac{1}{i^2}$ terms in [7] Gerstgrasser et al. and [8] Marchi et al. appear superficially similar to the $\frac{1}{(1+i\lambda)^2}$ term in our bound, they are derived under different assumptions and address fundamentally different problems.

Comment

Q4: In line 195, does $S_0'$ depend on $j$ or not? (The previous sentence suggests it does.) If it does, then the provided formula doesn't make sense.


A: Thank you for pointing out the potential ambiguity in our definition on line 195. To clarify, $S_0'$ does not depend on $j$. Our intent was to state that for all datasets $S_0$ and $S_0'$ such that $S_0$ and $S_0'$ differ by a single example, we use the recursive stability parameter to quantify the difference in outputs produced by the generative model when $S_0$ and $S_0'$ are provided as the initial training datasets.

To address this ambiguity, we have revised the definition on line 195 as follows:

Definition 2. (Recursive Stability)
Let $S_0$ represent the original real dataset, and $S_0'$ denote a dataset differing from $S_0$ by a single example. A generative model $\mathcal{G}$ is said to be recursively $\gamma_n^{i,\alpha}$-stable with respect to the distance measure $d$ after the $i$-th generation of STLs, where the ratio of real to synthetic data is set to $\alpha$, if the following condition holds:

$$\forall S_0, S_0' \in \mathbb{Z}^n, \quad d\left(\mathcal{G}^{(i)}(S_0), \mathcal{G}^{(i)}(S_0')\right) \leq \gamma_n^{i,\alpha}.$$

This revised definition explicitly removes any ambiguity regarding $j$ and ensures that the recursive stability parameter is clearly defined. We appreciate your attention to this detail and have updated the manuscript accordingly.


Q5: With regards to equation 1, let's consider $\alpha = 0$, i.e., no real data is used. What is the behavior of the error gap? It seems to me that only the last term will matter and using the fact that $\lim_{\alpha \to 0} \frac{1 - (1 - \alpha)^i}{\alpha} = i$, this predicts a linear rate of increase of the generalization gap. Is this correct? Following the point I make above, it seems to me that the discussion in lines 257-260 needs to change, because when $\alpha$ is too small, yes $\alpha^{-1}$ gets large but it's multiplied by $1 - (1 - \alpha)^i$, which is small too. They balance each other out exactly, giving $i$.


A: Thank you for your detailed observation and insightful question. Your understanding is absolutely correct!

In the context of Equation 1 (our general generalization bound), when we consider $\alpha = 0$ (i.e., no real data is used), it indeed holds that $\lim_{\alpha \to 0} \frac{1 - (1 - \alpha)^i}{\alpha} = i$. As a result, the last term, representing the Cumulative Distribution Shift,

$$d_{\text{TV}}(n)\, M \left(1 - (1 - \alpha)^i\right) \alpha^{-1},$$

grows linearly with $i$, thereby becoming the dominant term in our general generalization bound.

This finding aligns closely with prior theoretical results:

  1. [5] Dohmatob et al. 2024: Considers a linear regression setting and shows a similar linear dependency of degradation on the generation number.
  2. [11] Shumailov et al. 2024: Demonstrates using a one-dimensional Gaussian model that the variance diverges linearly when no real data is incorporated.
  3. [2] Fu et al. 2024: Proves that for diffusion models under fully synthetic scenarios, the distributional shift error accumulates linearly with the number of generations.

These results further validate the correctness and applicability of our theoretical bounds.
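As a quick numeric sanity check of this $\alpha$-dependence (an illustration added here, not a result from the paper), the factor $(1-(1-\alpha)^i)/\alpha$ indeed approaches $i$ as $\alpha \to 0$, while for a constant $\alpha$ it stays capped near $1/\alpha$:

```python
def shift_factor(alpha, i):
    """(1 - (1 - alpha)^i) / alpha: the alpha-dependent factor of the
    cumulative distribution shift term; it tends to i as alpha -> 0."""
    return (1 - (1 - alpha) ** i) / alpha

for i in (1, 5, 10, 50):
    row = [round(shift_factor(a, i), 3) for a in (1e-6, 1e-3, 0.1, 0.5)]
    print(i, row)  # the alpha = 1e-6 column is ~i; the alpha = 0.1 column stays below 1/alpha = 10
```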

In response to your comment, we have revised the discussion in lines 257-260 to better reflect this understanding. Specifically, we have clarified:

``When $\alpha \to 0$, we observe that $\frac{1 - (1 - \alpha)^i}{\alpha} \to i$, leading to a linear accumulation of errors due to the Distribution Shift, making it increasingly challenging to control the overall error. This observation aligns with the theoretical results reported in Shumailov et al. 2024, Dohmatob et al. 2024, and Fu et al. 2024.''

We appreciate your valuable feedback, which has helped us refine the presentation of our results. Thank you again!

Comment

Q2: Further, to me the bounds derived are a rather straightforward application of ``working with upper bounds'' when the correct assumptions are in place. Thus, I feel the technical novelty is limited, and I am not sure if we are learning something substantially new from these upper bounds beyond what we already knew about STLs.


A: Thank you for your feedback and for raising concerns about the technical novelty of our work. We would like to highlight how our contributions go beyond previous theoretical studies and address substantially more complex and realistic scenarios.

1. From Traditional Models to Generative Models

As you have kindly noted, “the authors have considered a much more general class of algorithms/models than what the prior literature has described." Unlike prior theoretical results that focus on linear regression (e.g., [5] Dohmatob et al. 2024, [7] Gerstgrasser et al. 2024) and Gaussian models (e.g., [9] Alemohammad et al. 2023, [11] Shumailov 2024), our theoretical findings are applicable to a broader class of generative models, such as the transformer-based architectures commonly used in large models. This general setting results in significant technical challenges, including the analysis of nonconvex optimization for nonlinear attention models under recursive training, which drives us to propose a novel analytical tool—recursive stability.

2. From i.i.d. Data to Non-i.i.d. Data

To simplify the mathematical analysis of models and algorithms, many theoretical analyses rely on the i.i.d. data assumption, such as [12] Bousquet et al. In our manuscript, we tackle this challenge by leveraging additional conditional independence properties. This enables our theoretical results to handle mixed data distributions in recursive training.

3. From Statistical Error to the Integration of Statistical, Functional, and Optimization Errors

Previous works primarily focus on statistical approximation error while ignoring the functional approximation error arising from generative model training and the optimization error associated with learning algorithms (e.g., [5] Dohmatob et al. 2024, [7] Gerstgrasser et al. 2024, [6] Seddik et al. 2024). In contrast, our work introduces the intriguing notion of recursive stability and performs a detailed analysis of the dynamic interactions of complex generative models and learning algorithms in STLs. By doing so, we elucidate how statistical approximation error, functional approximation error, and optimization error interact and accumulate during the training process.

4. From Asymptotic Analysis to Finite Analysis

Unlike [8] Marchi et al. 2024, which relies on the assumption of an infinite number of training generations for their asymptotic analysis, we analyze finite generations, which is more practical and relevant. Most experimental setups in the literature limit generations to fewer than 10 (e.g., [11] Shumailov 2024), making our results more applicable to real-world scenarios.


Technical Innovations

Recursive Stability:
We introduce the intriguing notion of recursive stability to quantify the differences in a complex generative model’s outputs after STLs when small perturbations are applied to the initial real dataset. Notably, this concept is significantly more intricate than algorithm stability due to the inherent challenges posed by the recursive structure of STLs.

By addressing these challenges, our results go beyond merely "working with upper bounds" and provide new insights into the interplay between real and synthetic data, the architecture of generative models, and the behavior of learning algorithms under recursive training schemes.


We believe our work introduces substantial technical novelty by combining recursive stability with comprehensive error analysis in a realistic and challenging setting, offering significant advancements over existing theoretical studies. Thank you again for your feedback, which allowed us to better articulate the contributions and significance of our work.

Comment

3. Clarity through Examples

To address the reviewer's concern, we apply our framework to a tractable example—a GMM in a binary classification task. We adopt the setup from prior works [5,6] and consider a binary classification task where $Y = \{-1, 1\}$. Given a vector $\mu \in \mathbb{R}^d$ with $\|\mu\|_2 = 1$ and noise variance $\sigma^2 > 0$, the data distribution is specified as follows: $y \sim \text{uniform}\{-1, 1\}$ and $x \mid y \sim \mathcal{N}(y\mu, \sigma^2 I_d)$. We define the conditional generative model using parameters $\mu_y$ and $\sigma_k^2$, where $y \in \{-1, 1\}$ and $k \in [d]$. For $n$ data points, let $n_y$ represent the number of samples in class $y$. The means $\hat{\mu}_y$ of the Gaussian mixture model are estimated as

$$\frac{\sum_{y_i = y} x_i}{n_y},$$

while the variances $\hat{\sigma}_k^2$ are calculated as

$$\sum_y \frac{n_y}{n} \frac{\sum_{y_i = y} (x_{ik} - \hat{\mu}_{yk})^2}{n_y - 1}.$$

Then we can generate new samples from the distribution: $y \sim \text{uniform}\{-1, 1\}$ and $x \mid y \sim \mathcal{N}(\hat{\mu}_y, \Sigma)$.

Additionally, the learning algorithm functions as a linear classifier, parameterized by $\theta \in \mathbb{R}^d$, with predictions given by:

$$\hat{y} = \text{sign}(\theta^\top \mathbf{x}).$$

The loss function is defined as:

$$\ell(\theta, (x, y)) = \frac{1}{2\sigma^2} (x - y\theta)^\top (x - y\theta).$$

Thus, the output is:

$$\hat{\theta} = \frac{1}{m} \sum_{i=1}^m y_i x_i.$$

In this setting, we demonstrate recursive stability for the Gaussian mixture model as follows:


Theorem (Recursive Stability): Let $S_0, S_0'$ denote two initial real datasets differing by a single example. Let $n$ represent the sample size of the mixed dataset $\tilde{S}_j$, where $\tilde{S}_j = \alpha S_0 + (1 - \alpha) S_j$ for $1 \leq j \leq i$. Choose $m = \mathcal{O}(\sqrt{n})$. Consider the previously described sampling and learning steps, where real data samples are drawn from the Gaussian Mixture Model distribution $\mathcal{D}$, and the synthetic data for the $i$-th generation is generated from the learned Gaussian Mixture distribution of the $i$-th generation. Then with probability at least $1-\delta$, we have:

$$\gamma_n^{i,\alpha} \lesssim n^{-1/2} \alpha^{-1}\left(1-(1-\alpha)^i\right)\log(nd/\delta),$$

where the measure for the recursive stability parameter is taken as the KL divergence.


As $\alpha$ approaches 0, indicating that no real data is incorporated during each generation of training, we observe:

$$\gamma_n^i \lesssim i\, n^{-1/2} \log \frac{nd}{\delta},$$

which suggests a linear accumulation of errors. This finding aligns closely with the theoretical insights presented in [3,7], where a Gaussian model trained without real data demonstrated a linear divergence in variance. Thus, this underscores the validity of our theoretical results, confirming that the derived bound is meaningful and not vacuous.

Moreover, by leveraging the generalization error bound established in Theorem 1, we derive the following:


Theorem (Generalization Error Bound): Consider the Gaussian Mixture Model in the setting outlined above. Let $n$ represent the sample size of the mixed dataset $\tilde{S}_j$, where $\tilde{S}_j = \alpha S_0 + (1 - \alpha) S_j$ for $1 \leq j \leq i$. Suppose the loss function is defined above. Let $\mathcal{A}(\tilde{S}_i)$ denote the output of applying the linear classifier described above to the mixed dataset $\tilde{S}_i$. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, the following holds:

$$|\text{Generalization}| \lesssim n^{-1/2}(d+\log(n/\delta))\log n \log(1/\delta) + n^{-1/4}\left(1-(1-\alpha)^i\right)\alpha^{-1}(d+\log(n/\delta))\sqrt{d\log(nd/\delta)}.$$

We observe that when $\alpha$ is set to a constant (e.g., $\alpha = 0.1$), the generalization error can be effectively controlled, preventing model collapse. This result aligns with the experimental findings in [2] for Gaussian models.

The above discussions have been carefully added to our paper.
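For completeness, the following minimal simulation sketch runs the GMM self-consuming loop described above (illustrative only: the dimension, sample sizes, number of generations, and the values $\alpha \in \{0, 0.1\}$ are placeholders, and the classifier is fit on all mixed data rather than the $m = \mathcal{O}(\sqrt{n})$ subsample used in the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, gens = 5, 500, 1.0, 8
mu = rng.normal(size=d); mu /= np.linalg.norm(mu)       # ground-truth class mean direction, ||mu|| = 1

def sample_real(m):
    y = rng.choice([-1, 1], size=m)
    return y[:, None] * mu + sigma * rng.normal(size=(m, d)), y

def fit_gmm(x, y):
    """Estimate per-class means and a pooled diagonal variance."""
    mus = {c: x[y == c].mean(axis=0) for c in (-1, 1)}
    var = np.mean([x[y == c].var(axis=0, ddof=1) for c in (-1, 1)], axis=0)
    return mus, var

def sample_synth(mus, var, m):
    y = rng.choice([-1, 1], size=m)
    x = np.stack([mus[c] for c in y]) + np.sqrt(var) * rng.normal(size=(m, d))
    return x, y

def test_error(theta, m_test=20000):
    x, y = sample_real(m_test)
    return np.mean(np.sign(x @ theta) != y)

x_real, y_real = sample_real(n)                         # initial real dataset S_0
for alpha in (0.0, 0.1):
    x_tr, y_tr = x_real.copy(), y_real.copy()
    for i in range(1, gens + 1):
        mus, var = fit_gmm(x_tr, y_tr)                  # fit the generative model
        x_syn, y_syn = sample_synth(mus, var, n)        # generate next-generation synthetic data
        k = int(alpha * n)                              # mix an alpha fraction of real data back in
        x_tr = np.vstack([x_real[:k], x_syn[: n - k]])
        y_tr = np.concatenate([y_real[:k], y_syn[: n - k]])
        theta = np.mean(y_tr[:, None] * x_tr, axis=0)   # linear classifier: theta = (1/m) sum_i y_i x_i
        print(f"alpha={alpha}, generation {i}: test error = {test_error(theta):.3f}")
```

Tracking the downstream test error across generations for $\alpha = 0$ versus a constant $\alpha$ gives a direct empirical handle on the behavior predicted by the bounds above.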

Comment

Thank you for your thorough review and constructive comments. All your concerns have been carefully addressed as below. The paper has been thoroughly revised, with the revised sections highlighted in blue for clarity. We sincerely hope our responses fully address your questions.


Q1: There are no examples of models provided. What models are recursively stable? How do we know the TV bounds will be of the same order? The paper really needs some examples for clarity.


A: We appreciate the reviewer’s insightful feedback. Below, we address the concerns raised regarding examples of recursively stable models, the TV distance bounds, and the need for clarity through examples:

1. What models are recursively stable?

Recursive stability, as introduced in our paper, measures the stability of generative models within STLs in response to perturbations in the initial data. This property depends on both the model architecture and the ratio $\alpha$ of real to synthetic data within STLs.

  1. Model Architectures:
    Generative models satisfying recursive stability must maintain distributional fidelity even under perturbations in the training dataset. This depends on the model's capacity to approximate distributions effectively. In addition to rigorously proving recursive stability for transformers in in-context learning within this paper, other examples include GANs ([1] Farnia 2021) and Gaussian Mixture Models (GMMs). In the final part of this response, we provide a detailed example demonstrating how a GMM satisfies recursive stability.

  2. Effect of $\alpha$:
    In addition to the impact of model architecture, the ratio $\alpha$ in STLs plays a critical role. The specific value of $\alpha$ required to control recursive stability depends on the stability capacity of the chosen generative model. As shown in Theorem 2, when $\alpha \to 0$, the recursive stability of transformers increases rapidly. However, when $\alpha$ is set to $1 - \tilde{B}_W^{-L}$, recursive stability can be effectively controlled. In addition, we demonstrate in the subsequent GMM example that a constant value of $\alpha$ (e.g., 0.1) is sufficient to control recursive stability.

2. Why are TV bounds assumed to be of the same order?

The assumption that TV bounds will be of the same order is a common and well-established hypothesis, as also adopted in ([2] Fu et al. 2024). In generalization theory, the TV distance bound for generative models with the same architecture is primarily determined by the size of the training dataset $n$ ([3] Li 2023, [4] Zhang 2023). Following the settings of prior works ([2] Fu et al. 2024, [5] Dohmatob et al. 2024, [6] Seddik et al. 2024), where the training dataset size $n$ is assumed to remain consistent across generations, it is reasonable to extend this assumption and infer that the TV bounds will likewise remain of the same order.

Comment

3. Revised Discussion in the Paper

To connect our work more explicitly to recent literature on accumulating data for avoiding model collapse, we have added the following discussion to the paper:

``Gerstgrasser et al. (2024) also explored the use of accumulating data to prevent model collapse. They considered a linear regression setting without accounting for the dynamic process of training generative models, focusing solely on statistical approximation error. They demonstrated that under the assumption of fixed synthetic data quality matching the original real data, statistical approximation error can be controlled.

By contrast, our work addresses a much more complex and realistic scenario, incorporating the dynamic behavior of transformer-based generative models, learning algorithms, and both statistical and functional approximation errors. Additionally, we allow for dynamic regulation of synthetic data size via a $\lambda$ coefficient, enabling us to identify the optimal synthetic dataset size for avoiding model collapse in these more challenging settings.''

We hope this revision addresses your concerns and provides a clearer connection to existing literature while emphasizing the unique contributions of our work. Thank you for your valuable feedback!

Reference:

[1] Farnia F, Ozdaglar A. Train simultaneously, generalize better: Stability of gradient-based minimax learners[C]//International Conference on Machine Learning, 2021.

[2] Fu S, Zhang S, Wang Y, et al. Towards Theoretical Understandings of Self-Consuming Generative Models[C]//Forty-first International Conference on Machine Learning.

[3] Li P, Li Z, Zhang H, et al. On the generalization properties of diffusion models[J]. Advances in Neural Information Processing Systems, 2023.

[4] Zhang Y, Zhang F, Yang Z, et al. What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization[J], 2023.

[5] Dohmatob E, Feng Y, Kempe J. Model collapse demystified: The case of regression[J]. arXiv, 2024.

[6] Seddik M E A, Chen S W, Hayou S, et al. How bad is training on synthetic data? A statistical analysis of language model collapse[J]. arXiv, 2024.

[7] Gerstgrasser M, Schaeffer R, Dey A, et al. Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data[J]. arXiv, 2024.

[8] Marchi M, Soatto S, Chaudhari P, et al. Heat Death of Generative Models in Closed-Loop Learning[J]. arXiv, 2024.

[9] Alemohammad S, Casco-Rodriguez J, Luzi L, et al. Self-Consuming Generative Models Go MAD[C]//The Twelfth International Conference on Learning Representations.

[10] Zheng C, Wu G, Li C. Toward understanding generative data augmentation[J]. Advances in Neural Information Processing Systems, 2023.

[11] Shumailov I, Shumaylov Z, Zhao Y, et al. AI models collapse when trained on recursively generated data[J]. Nature, 2024.

[12] Bousquet O, Klochkov Y, Zhivotovskiy N. Sharper bounds for uniformly stable algorithms[C]//Conference on Learning Theory, PMLR, 2020: 610-626.

Comment

I thank the authors for a detailed response. Based on this, I have raised the score to 6.

Comment

Thank you for your support! We are glad that we have addressed your concerns.

Review (Rating: 6)

This paper theoretically studies Self-consuming Training Loops (STLs), where generative models generate their own training data. The key concept that the paper introduces is recursive stability, which the authors prove to be vital for generalization and for avoiding model collapse. After the general result on how recursive stability matters, the authors move on to provide an upper bound on the recursive stability of transformers. The theoretical results suggest a trade-off of synthetic data augmentation.

Strengths

  • The paper introduces a concept called recursive stability that addresses the complexity and difficulty of self-consuming training loops. It can be used to derive an upper bound on the generalization error of learning algorithms trained on the self-generated (or a mixture of self-generated and real) dataset after any number of iterations of the self-consuming loop.

  • The application to transformers is nice to have as it bridges to the most popular architecture in practice.

Weaknesses

  • The reviewer is concerned about the validity of this recursive stability, which is supposed to be the core contribution of this paper. Details follow. The recursive stability $\gamma_n^i$ is a rather strong assumption, and it hides a lot of things underneath. For example, it is uniform not only in terms of the data point modification, but also uniform w.r.t. any randomness in the STL process. It also depends on how the STL is performed, e.g., the ratio $\alpha$ of the data mixture. The paper derives an upper bound on the recursive stability $\gamma_n^i$ for transformers in Theorem 2, but the bound has an exponential factor $e^{B_W L}$, where $B_W$ is the bound on the norm of weights and $L$ is the depth. So it appears that $\alpha$ is required to be very close to $1$, i.e., almost only using real data, for the guarantee on model collapse to stand.
  • Following up on the above concern, it is not really satisfying to see the paper jump directly to the application to transformer in-context learning, for which it is notoriously difficult to make the theory accurate. Why not apply recursive stability and Theorem 1 to simple cases first, e.g., Gaussian models, where everything can be made accurate, to examine whether the proposed recursive stability is indeed a valid property to look at? This means examining the LHS and RHS of Theorem 1 accurately for simple cases and showing that the bound is not vacuous.

----------------after rebuttal--------------------

The authors have well addressed my major concerns, so the score is raised.

Questions

see the questions in the above section.

Comment

References:

[1] Bousquet O, Elisseeff A. Stability and generalization[J]. The Journal of Machine Learning Research, 2002, 2: 499-526.

[2] Briesch M, Sobania D, Rothlauf F. Large language models suffer from their own output: An analysis of the self-consuming training loop[J]. arXiv, 2023.

[3] Alemohammad S, Casco-Rodriguez J, Luzi L, et al. Self-Consuming Generative Models Go MAD[C]//The Twelfth International Conference on Learning Representations.

[4] Chen B, Li X, Liang Y, et al. Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent[J]. arXiv, 2024.

[5] He H, Yan H, Tan V Y F. Information-theoretic characterization of the generalization error for iterative semi-supervised learning[J]. Journal of Machine Learning Research, 2022.

[6] Zheng C, Wu G, Li C. Toward understanding generative data augmentation[J]. Advances in Neural Information Processing Systems, 2023.

[7] Shumailov I, Shumaylov Z, Zhao Y, et al. AI models collapse when trained on recursively generated data[J]. Nature, 2024.

Comment

Q2: It is not really satisfying to see the paper jump directly to the application to transformer in-context learning, for which it is notoriously difficult to make the theory accurate. Why not apply recursive stability and Theorem 1 to simple cases first, e.g., Gaussian models, where everything can be made accurate, to examine whether the proposed recursive stability is indeed a valid property to look at? This means examining the LHS and RHS of Theorem 1 accurately for simple cases and showing that the bound is not vacuous.


A: Thank you for your insightful suggestion. We fully agree on the value of validating our theoretical framework with simpler models to ensure its applicability and robustness. Following your guidance, we have applied our recursive stability metric and the general generalization bound (Theorem 1) to a Gaussian mixture model.

We adopt the setup from prior works [5,6] and consider a binary classification task where $Y = \{-1, 1\}$. Given a vector $\mu \in \mathbb{R}^d$ with $\|\mu\|_2 = 1$ and noise variance $\sigma^2 > 0$, the data distribution is specified as follows: $y \sim \text{uniform}\{-1, 1\}$ and $x \mid y \sim \mathcal{N}(y\mu, \sigma^2 I_d)$. We define the conditional generative model using parameters $\mu_y$ and $\sigma_k^2$, where $y \in \{-1, 1\}$ and $k \in [d]$. For $n$ data points, let $n_y$ represent the number of samples in class $y$. The means $\hat{\mu}_y$ of the Gaussian mixture model are estimated as

$$\frac{\sum_{y_i = y} x_i}{n_y},$$

while the variances $\hat{\sigma}_k^2$ are calculated as

$$\sum_y \frac{n_y}{n} \frac{\sum_{y_i = y} (x_{ik} - \hat{\mu}_{yk})^2}{n_y - 1}.$$

Then we can generate new samples from the distribution: $y \sim \text{uniform}\{-1, 1\}$ and $x \mid y \sim \mathcal{N}(\hat{\mu}_y, \Sigma)$.

Additionally, the learning algorithm functions as a linear classifier, parameterized by $\theta \in \mathbb{R}^d$, with predictions given by:

$$\hat{y} = \text{sign}(\theta^\top \mathbf{x}).$$

The loss function is defined as:

$$\ell(\theta, (x, y)) = \frac{1}{2\sigma^2} (x - y\theta)^\top (x - y\theta).$$

Thus, the output is:

$$\hat{\theta} = \frac{1}{m} \sum_{i=1}^m y_i x_i.$$

In this setting, we demonstrate recursive stability for the Gaussian mixture model as follows:


Theorem (Recursive Stability): Let $S_0, S_0'$ denote two initial real datasets differing by a single example. Let $n$ represent the sample size of the mixed dataset $\tilde{S}_j$, where $\tilde{S}_j = \alpha S_0 + (1 - \alpha) S_j$ for $1 \leq j \leq i$. Choose $m = \mathcal{O}(\sqrt{n})$. Consider the previously described sampling and learning steps, where real data samples are drawn from the Gaussian Mixture Model distribution $\mathcal{D}$, and the synthetic data for the $i$-th generation is generated from the learned Gaussian Mixture distribution of the $i$-th generation. Then with probability at least $1-\delta$, we have:

$$\gamma_n^{i,\alpha} \lesssim n^{-1/2} \alpha^{-1}\left(1-(1-\alpha)^i\right)\log(nd/\delta),$$

where the measure for the recursive stability parameter is taken as the KL divergence.


As $\alpha$ approaches 0, indicating that no real data is incorporated during each generation of training, we observe:

$$\gamma_n^i \lesssim i\, n^{-1/2} \log \frac{nd}{\delta},$$

which suggests a linear accumulation of errors. This finding aligns closely with the theoretical insights presented in [3,7], where a Gaussian model trained without real data demonstrated a linear divergence in variance. Thus, this underscores the validity of our theoretical results, confirming that the derived bound is meaningful and not vacuous.
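For additional intuition, the recursive-stability quantity can also be probed numerically by running the loop from $S_0$ and from a neighboring $S_0'$ and tracking the divergence between the two fitted models. The sketch below does this for a simplified one-dimensional, single-Gaussian variant of the setup above (a single perturbation rather than the worst case over all $S_0, S_0'$; the sample size, $\alpha$, and number of generations are placeholders):

```python
import numpy as np

n, gens, alpha = 1000, 6, 0.1

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for one-dimensional Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def run_stl(s0, seed):
    """Run the self-consuming loop from initial real data s0; return (mean, var) per generation."""
    rng = np.random.default_rng(seed)                   # shared seed couples the sampling noise across runs
    params, x = [], s0.copy()
    for _ in range(gens):
        m, v = x.mean(), x.var(ddof=1)                  # fit the generative model
        synth = m + np.sqrt(v) * rng.normal(size=n)     # sample synthetic data from it
        k = int(alpha * n)                              # mix an alpha fraction of real data back in
        x = np.concatenate([s0[:k], synth[: n - k]])
        params.append((m, v))
    return params

data_rng = np.random.default_rng(1)
s0 = data_rng.normal(size=n)                            # initial real dataset S_0
s0_prime = s0.copy(); s0_prime[0] += 3.0                # S_0' differs from S_0 in a single example

for i, ((m, v), (mp, vp)) in enumerate(zip(run_stl(s0, 2), run_stl(s0_prime, 2)), start=1):
    print(f"generation {i}: KL between the two fitted models = {kl_gauss(m, v, mp, vp):.2e}")
```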

Moreover, by leveraging the generalization error bound established in Theorem 1, we derive the following:


Theorem (Generalization Error Bound): Consider the Gaussian Mixture Model in the setting outlined above. Let $n$ represent the sample size of the mixed dataset $\tilde{S}_j$, where $\tilde{S}_j = \alpha S_0 + (1 - \alpha) S_j$ for $1 \leq j \leq i$. Suppose the loss function is defined above. Let $\mathcal{A}(\tilde{S}_i)$ denote the output of applying the linear classifier described above to the mixed dataset $\tilde{S}_i$. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, the following holds:

$$|\text{Generalization}| \lesssim n^{-1/2}(d+\log(n/\delta))\log n \log(1/\delta) + n^{-1/4}\left(1-(1-\alpha)^i\right)\alpha^{-1}(d+\log(n/\delta))\sqrt{d\log(nd/\delta)}.$$

We observe that when $\alpha$ is set to a constant (e.g., $\alpha = 0.1$), the generalization error can be effectively controlled, preventing model collapse. This result aligns with the experimental findings in [2] for Gaussian models.

The above discussions have been carefully added to our paper to strengthen the validation of our theoretical framework and highlight its applicability to different generative model settings. Thank you for your valuable feedback.

Comment

Thank you for your thorough review and constructive comments. All your concerns have been carefully addressed as below. The paper has been thoroughly revised, with the revised sections highlighted in blue for clarity. We sincerely hope our responses fully address your questions.


Q1: The recursive stability $\gamma_n^i$ is a rather strong assumption, and it hides a lot of things underneath. For example, it is uniform not only in terms of the data point modification but also uniform w.r.t. any randomness in the STL process. It also depends on how the STL is performed, e.g., the ratio $\alpha$ of the data mixture. The paper derives an upper bound on the recursive stability $\gamma_n^i$ for transformers in Theorem 2, but the bound has an exponential factor $e^{B_W L}$, where $B_W$ is the bound on the norm of weights and $L$ is the depth. So it appears that $\alpha$ is required to be very close to 1, i.e., almost only using real data, for the guarantee on model collapse to stand.


A: Thank you for your thoughtful feedback. We fully understand your concerns, particularly regarding two key points: (1) whether recursive stability constitutes an overly strict assumption, and (2) whether the derived bounds imply that $\alpha$ must be very close to 1. Below, we address each point in detail.

1. Recursive Stability as a Metric, Not a Strong Assumption

We respectfully clarify that recursive stability is introduced as an evaluation metric rather than a strong assumption.

Similar to how algorithmic stability is a foundational concept in generalization theory for quantifying a learning algorithm's sensitivity to data perturbations [1], recursive stability serves as a critical metric for evaluating the sensitivity of generative models in STLs to perturbations in the initial data. A wide range of generative models exhibit recursive stability at varying levels. Beyond the transformer architecture analyzed in this paper, we also provide an upper bound for the recursive stability of Gaussian Mixture Models (GMMs) in the subsequent discussion. This metric offers valuable insights into how model architecture and the real-synthetic data ratio $\alpha$ influence STL performance, enabling the identification of scenarios where recursive stability grows rapidly (leading to model collapse) and those where stability is maintained.

Additionally, following the reviewer's suggestion, we have revised our definition of recursive stability to explicitly express the role of the ratio $\alpha$:

Definition 2. (Recursive Stability)
Let $S_0$ represent the original real dataset, and $S_0'$ denote a dataset differing from $S_0$ by a single example. A generative model $\mathcal{G}$ is said to be recursively $\gamma_n^{i,\alpha}$-stable with respect to the distance measure $d$ after the $i$-th generation of STLs, where the ratio of real to synthetic data is set to $\alpha$, if the following condition holds:

$$\forall S_0, S_0' \in \mathbb{Z}^n, \quad d\left(\mathcal{G}^{(i)}(S_0), \mathcal{G}^{(i)}(S_0')\right) \leq \gamma_n^{i,\alpha}.$$

2. Ratio $\alpha$ Does Not Need to Be Close to 1

We clarify that the requirements for the ratio $\alpha$ vary across different generative model architectures. For instance, as demonstrated in our response to the next question, when $\alpha = 0.1$, the recursive stability of the GMM can still be effectively controlled, thereby preventing model collapse. This observation aligns with the experimental results reported in [3].

For transformers, the depth $L$ is typically small in practical settings. For example, studies on LLM performance in STLs, such as [2], often employ models with $L = 6$. Furthermore, techniques like layer normalization effectively constrain the norm of weights $B_W$ to values close to 1, ensuring numerical stability during training. Therefore, in practical scenarios, the combination of small $L$ and well-controlled $B_W$ ensures that the recursive stability bound does not necessitate $\alpha$ being excessively close to 1.

We also acknowledge that addressing the exponential dependency on $L$ in convergence analyses for in-context learning is indeed an important research direction. Existing works, such as [4], have started tackling this challenge. However, this issue is beyond the primary scope of our paper. Our core contribution lies in presenting the first theoretical generalization analysis that demonstrates how model architecture and the real-synthetic data ratio $\alpha$ influence the performance and success of STLs.

We sincerely thank the reviewer for their detailed observations. In the final version of the paper, we will incorporate a discussion on recursive stability's role as a metric, the practical implications of $L$ and $B_W$, and its adaptability to different models and settings.

Comment

I appreciate the authors’ additional efforts to strengthen their theoretical argument. Since both of my major concerns are well addressed, I have raised my score.

Comment

Thank you for your thoughtful feedback. Your suggestions have been invaluable in enhancing the quality of our paper, and we’re delighted to know that your concerns have been fully resolved.

Review (Rating: 8)

This paper theoretically examines the properties of self-training loops in training generative models. This is an area of great interest and relatively limited theoretical analysis in current research. This paper aims to prove generalization bounds for self-consuming training loops. As they note, this can be difficult due to the distribution shift across training iterations, as well as the non-i.i.d. nature of synthetic-real mixed training datasets. To address the challenge, this paper proposes a novel theoretical notion of recursive algorithmic stability, which allows for controlling the propagation of error across training generations. Leveraging this new theoretical concept, this paper proves a generalization bound. They show that this generalization bound resembles existing qualitative and empirical trends observed in the use of synthetic data. For example, the necessary role of including an appropriate proportion of real data. In addition to their general, model-independent results, this paper also studies the relevant problem of in-context learning in transformers. They provide a bound on the recursive stability of transformers on this task and demonstrate that a constant fraction of real data is sufficient to maintain stability. They additionally remark on the conflicting role played by self-consuming data: it increases the distribution shift in the training loop while decreasing the generalization error component of each step.

Strengths

Overall, I am very supportive of this paper. Self-consuming data loops are currently an important topic in practical foundation model settings --both intentionally from the point of view of synthetic data and unintentionally from the point of view of leakage of model-generated content on the internet. While some existing works have theoretically demonstrated that self-generated data can lead to model collapse, as a practical matter self-generated data can often be leveraged in harmless or beneficial ways -- albeit without theoretical justification. This paper thus makes a significant contribution to placing the role of model-generated data on a more solid theoretical footing. They convincingly isolate some central technical challenges in understanding self-consuming training loops and propose an elegant and intuitive concept of recursive algorithmic stability to confront these challenges. This enables them to make a key contribution of presenting the first generalization bound on self-consuming training loops. The analysis of the transformer architecture is additionally interesting and instructive, establishing that their proposed recursive stability criterion holds naturally on realistic architectures and tasks. Throughout the paper, they demonstrate that their theory formalizes and quantifies many previous intuitions or folklore on properties of self-consuming training loops (such as the necessity of maintaining a proportion of real data -- or the benefits of synthetic data primarily arising in low real-data regimes).

The paper is extremely clearly written. The proof sketch was insightful, yet accessible. Moreover, the authors interleave their technical and theoretical content with real-world motivation and implications. I found the paper quite enjoyable and educational to read.

Weaknesses

I don't have any serious concerns about this paper. One potential weakness is that the paper primarily focuses on the theoretical angle and mostly relates their findings to empirical observations in prior work. It could also be interesting if the authors were to add some of their own numerical simulations justifying their findings. Although it is understandable to exclude this, it would be interesting to have some numerical backing in the paper itself. The construction of a simulated setting or dataset would also help in understanding the relationships between the setting studied in this work and real-world setups.

The authors primarily use their theory to explain or justify existing empirical or conceptual observations about self-consuming training loops. However, I feel it would be quite exciting if authors could propose and test (even in simulated settings) some novel predictions made by their theory. Perhaps, this might also be a question of merely highlighting the currently novel predictions made by theory a bit more explicitly.

Questions

  1. Perhaps I missed something, but what is the notation $\sum_{z_i \in S_{i,\alpha}}$ referring to in line 358?
  2. To clarify, in the in-context learning setup, the "synthetic data" inputs are not generated by a model but sampled according to the ground-truth distribution, correct?
  3. Follow-up from (2), I wonder if you could comment on the fact that in many "real-world" synthetic data settings both the "input" and "output" components are sampled from the model. For example, when generating instruction/chat tuning datasets, I believe that both instructions and responses are synthetically generated often. How do you believe that would change your analysis? Seemingly, it would make recursive stability worse due to the additional error in the input distribution.
Comment

Q4: I wonder if you could comment on the fact that in many "real-world" synthetic data settings both the "input" and "output" components are sampled from the model. For example, when generating instruction/chat tuning datasets, I believe that both instructions and responses are synthetically generated often. How do you believe that would change your analysis? Seemingly, it would make recursive stability worse due to the additional error in the input distribution.


A: We sincerely thank the reviewer for this insightful question. While our paper focuses on transformers in in-context learning under STLs, assuming query inputs are drawn from the ground-truth distribution, our theoretical framework extends naturally to Self-Generated Instruction tuning, where both inputs and outputs are synthetically generated.

Our recursive stability analysis does not require query inputs to come from the ground-truth distribution but rather from any known distribution, including synthetic distributions generated by prior model iterations. Following [2], in the Self-Generated Instruction tuning setting, the initial dataset $S_0$ contains ground-truth (human-written) instruction-input-output examples. For the next-generation dataset $S_1$, the generation process involves three steps:

  1. Instruction Generation: The model generates new instructions based on real in-context examples.
  2. Input Generation: The generated instructions are mixed with ground-truth instructions to form in-context examples for generating inputs.
  3. Output Generation: The generated inputs are mixed with ground-truth inputs to form in-context examples for generating outputs.

This process produces a synthetic instruction-input-output dataset for the next generation.

This recursive setup significantly increases the challenge of maintaining stability, as it involves three levels of recursive training per generation, leading to faster error accumulation. For example, in Theorem 2, the generation number $i$ in the recursive stability bound:

$$\gamma_n^{i,\alpha} \lesssim (1-\alpha)^i\, \frac{\widetilde{B}_W^{(i+1)L}}{2n+1},$$

scales to $3i$, yielding:

$$\gamma_n^{i,\alpha} \lesssim (1-\alpha)^{3i}\, \frac{\widetilde{B}_W^{(3i+3)L}}{2n+1}.$$

This escalation makes controlling recursive stability more difficult, particularly when real data is scarce, as the stability parameter decays at a faster exponential rate. However, by setting $\alpha$ to $1 - \widetilde{B}_W^{-L}$, recursive stability can still be effectively managed.
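To see numerically why this choice of $\alpha$ helps, note that with $\alpha = 1 - \widetilde{B}_W^{-L}$ the factor $(1-\alpha)^{si}$ cancels the growth of $\widetilde{B}_W^{s(i+1)L}$, leaving a bound that no longer grows with $i$ (with $s = 1$ for the standard loop and $s = 3$ for the self-instruct-style variant). A small sketch with hypothetical values of $\widetilde{B}_W$, $L$, and $n$:

```python
B_W, L, n = 1.1, 6, 10**5                  # hypothetical norm bound, depth, and sample size
alpha = 1 - B_W ** (-L)                    # the mixing ratio suggested above

def gamma_bound(i, alpha, s=1):
    """Recursive-stability bound (up to constants): (1 - alpha)^{s i} * B_W^{s (i + 1) L} / (2 n + 1);
    s = 1 for the standard STL bound, s = 3 for the self-instruct-style variant."""
    return (1 - alpha) ** (s * i) * B_W ** (s * (i + 1) * L) / (2 * n + 1)

for i in (1, 2, 5, 10):
    print(f"i={i}: standard={gamma_bound(i, alpha):.3e}, self-instruct={gamma_bound(i, alpha, s=3):.3e}")
```

Both columns stay constant in $i$ (at roughly $\widetilde{B}_W^{L}/(2n+1)$ and $\widetilde{B}_W^{3L}/(2n+1)$), whereas any smaller $\alpha$ would let the bound grow with $i$.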

Reference:

[1] Dong Q, Li L, Dai D, et al. A survey on in-context learning[J]. arXiv, 2022.

[2] Wang Y, Kordi Y, Mishra S, et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions[C]//ACL, 2023.

Comment

I thank the authors for their response. I will maintain my score but would like to say that my questions have been addressed well.

Comment

Thank you for your support! We are pleased to know that your concerns have been effectively addressed.

Comment

Thank you for your constructive comments and kind support! All your concerns have been carefully addressed as below. The paper has been thoroughly revised, with the revised sections highlighted in blue for clarity. We sincerely hope our responses fully address your questions.


Q1: It could also be interesting if the authors were to add some of their own numerical simulations justifying their findings. Although it is understandable to exclude this, it would be interesting to have some numerical backing in the paper itself.


A: We thank the reviewer for the suggestion to include numerical simulations to support our theoretical findings. Following this recommendation, we conducted additional experiments during the discussion period to validate our results. Specifically, we trained transformer models to in-context learn linear functions within STLs.

In these experiments, we considered the class of linear functions:

$$\mathcal{F} = \{ f \mid f(x) = w^\top x,\ w \in \mathbb{R}^d \},$$

in $d=5$ dimensions. We sampled $x_1, \ldots, x_k, x_{\text{query}}$, and $w$ independently from the isotropic Gaussian distribution $\mathcal{N}(0, I_d)$. For each $x_i$, we computed $y_i = w^\top x_i$ and constructed the prompt as:

$$P = (x_1, y_1, x_2, y_2, \ldots, x_k, y_k, x_{\text{query}}).$$
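A minimal sketch of this data and prompt construction (the GPT-2 model and its training loop are omitted, and packing the scalar $y_i$ into a $d$-dimensional token is just one common convention, not necessarily the exact one used in our code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 40                                     # input dimension and number of in-context examples

def sample_prompt():
    """Sample a task w and build the prompt (x_1, y_1, ..., x_k, y_k, x_query) with y_i = w^T x_i."""
    w = rng.normal(size=d)
    xs = rng.normal(size=(k + 1, d))             # x_1, ..., x_k plus x_query
    ys = xs @ w
    tokens = []
    for x, y in zip(xs[:-1], ys[:-1]):
        tokens.append(x)
        tokens.append(np.r_[y, np.zeros(d - 1)]) # embed the scalar y_i as a d-dimensional token
    tokens.append(xs[-1])                        # the query input; the model should predict w^T x_query
    return np.stack(tokens), ys[-1]

prompt, target = sample_prompt()
print(prompt.shape, target)                      # (2k + 1, d) token matrix and the regression target
```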

We employed a 12-layer, 8-head GPT-2 model with a hidden size of 256, trained on an $\mathbb{R}^5$ linear regression task with 40 in-context examples. Two cases were considered:

  • Mixed Case: Fresh data and generated data were mixed in a 0.5 ratio.
  • Full Synthetic Case: No fresh data was used.

The results of these experiments are summarized below:

| # Loop | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| Full Synthetic | 0.3817 | 1.4975 | 1.5396 | 2.0836 |
| Mixed | 0.3817 | 0.4208 | 0.4391 | 0.4503 |

As observed, the error accumulates progressively with more self-consuming loops, particularly in the full synthetic case, where the error grows rapidly. In contrast, maintaining a constant-sized proportion of real data effectively reduces the loss, which is consistent with our theoretical findings.

Furthermore, we commit to conducting additional and more comprehensive experiments to strengthen our results in the final version of the paper.


Q2: Perhaps I missed something but what is the notation $\sum_{z_i \in S_{i,\alpha}}$ referring to in line 358?


A: We sincerely thank the reviewer for their detailed and careful question regarding the notation $\sum_{z_i \in S_{0,\alpha}}$ and $\sum_{z_i \in S_{i,1-\alpha}}$ on line 358.

To clarify, this notation is explained in detail in the appendix (line 822). Specifically, $S_{0,\alpha}$ refers to a subset of the real dataset $S_0$, where $S_{0,\alpha} \subseteq S_0$ and its size is defined as $\alpha \times |S_0|$. In this context, $S_{0,\alpha}$ represents a proportion $\alpha$ of the $n$ data points in $S_0$, leading to a total of $n \times \alpha$ data points. Similarly, $S_{i,1-\alpha}$ denotes a subset of the synthetic dataset $S_i$, where $S_{i,1-\alpha} \subseteq S_i$ and its size is $(1 - \alpha) \times |S_i|$. This subset, $S_{i,1-\alpha}$, contains a proportion $1 - \alpha$ of the $n$ data points in $S_i$, resulting in $n \times (1 - \alpha)$ data points.
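Concretely, a tiny illustrative sketch of how such subsets could be formed and summed over (placeholder data and a placeholder loss, purely to fix the notation):

```python
import numpy as np

n, alpha = 10, 0.4
S0 = np.arange(n, dtype=float)            # stand-ins for the n real data points in S_0
Si = 100.0 + np.arange(n)                 # stand-ins for the n synthetic data points in S_i

S0_alpha = S0[: int(alpha * n)]           # S_{0, alpha}: an alpha fraction of S_0 (n * alpha points)
Si_rest = Si[: n - int(alpha * n)]        # S_{i, 1 - alpha}: a (1 - alpha) fraction of S_i

loss = lambda z: z ** 2                   # placeholder per-example loss
mixed_sum = loss(S0_alpha).sum() + loss(Si_rest).sum()   # sums over the two subsets, as in the notation above
print(len(S0_alpha), len(Si_rest), mixed_sum / n)
```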

We greatly appreciate your observation and have revised the paper to explicitly include this explanation on line 358 to enhance clarity for all readers.


Q3: To clarify, in the in-context learning setup, the "synthetic data" inputs are not generated by a model but sampled according to the ground-truth distribution, correct?


A: We thank the reviewer for their question and for seeking clarification. Your understanding is entirely correct. In our work, we follow the standard in-context learning setup as outlined in [1]. Specifically, this involves predicting the label for a given query input. The "synthetic data" in this context refers to query inputs that are known and sampled from the ground-truth distribution. However, their corresponding output labels are generated by the model.

We hope this addresses your concern and are happy to provide further clarifications if needed.

Comment

We report a potential technical limitation on the OpenReview platform that some readers have observed. Specifically, mathematical formulas in uploaded official comments may not render correctly in certain situations.

Our investigation shows that this issue can typically be resolved by opening the page using the Google Chrome browser or refreshing the page a few times.

We apologize for any inconvenience this may cause and appreciate your understanding. Please feel free to reach out if you encounter further difficulties.

Comment

Dear Reviewers,

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript. Your expert opinions and constructive feedback are invaluable in helping us improve the quality of our work.

We are writing to kindly remind you that the rebuttal discussion period will conclude in less than three days. We would be most grateful if you could provide any further comments or suggestions at your earliest convenience. It is important for us to confirm whether our responses have adequately addressed your concerns, and your additional input would greatly contribute to enhancing our manuscript.

We look forward to your valuable feedback and a productive discussion. Thank you once again for your time and thoughtful consideration.

Best regards,

The Authors.

AC Meta-Review

This work provides theoretical studies for the setting of Self-consuming Training Loops (STLs), where generative models increasingly generate their own data for further training. As the reviewers reached a consensus that the problem setting is interesting and challenging, the contribution is sound, and the paper is well written, it is recommended for acceptance to the conference for our community. Given the vibrant and fruitful discussion, incorporating the additional comments and discussions (some of which have already been added) would be beneficial for the manuscript.

Additional Comments from the Reviewer Discussion

The reviewers' concerns are well addressed.

Final Decision

Accept (Poster)