Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training
Implicit dynamical regularization during training gives diffusion models a generalization window that widens with the training set size, so stopping within this window prevents memorization.
Abstract
Reviews and Discussion
This work examines the role of early stopping in preventing memorization in diffusion models. The authors observe that when the number of training samples $n$ is increased, the so-called window of generalization, i.e. the range of stopping times before generalization turns into memorization, scales with $n$. They verify this in a toy setting with CelebA data, and verify this scaling theoretically for random Gaussian data in the random feature model for the score function, deriving results in the high-dimensional limit using the replica method based on previous work.
Strengths and Weaknesses
Strengths
- The exposition of the problem is well-motivated and the manuscript is clearly written.
- The metrics selected for memorization and generalization in diffusion models are intuitive and easy to compute and interpret.
- The theoretical results are communicated clearly and extend existing results.
Weaknesses
- As mentioned by the authors, larger parameter regimes for the models would be desirable, and they propose this as future work. However, it seems important to the empirical verification of the phase diagram in Fig. 3 to verify these regimes as well. It would be useful to understand if there are deeper reasons why these experiments were not included.
- It is not entirely clear how much of this analysis is specific to the score function in diffusion models - or if the phenomenon of first generalizing then memorizing is something that is more generally observed in random feature models.
- In the context of the score function, getting the score correct at different diffusion times is weighted the same, but errors could accumulate when sampling new items. Yet, in the experiments with CelebA, the final generated samples are compared. It is unclear to me to what extent these measurements are comparable.
- It is unclear whether the theory that holds for GD is indicative of adaptive optimizers. For the empirical results on CelebA this could be tested for at least a smaller fraction of hyperparameter values to obtain an indication.
Questions
- Is the phenomenon you observe unique to training diffusion models? Or does it also occur when you simply want to train a model to memorize given input data? It would be great to explicitly comment on this in the manuscript.
- Fig. 3 right: Generally, this plot is rather confusing, as the visualization suggests that one should match the two lines to the architectural and dynamical regularization in the theoretical intro Figure 1. But it seems that since you do not have n large / p small enough, the architectural bias curve does not show up. Can you clarify, and edit the figure so that it does not suggest a direct correspondence between theory and experiment lines? (Or correct my understanding, if this is the intended case!)
- Paragraph L.189: It generally seems that experiments on larger p/n would help support your results; you merely state your expectation and defer it to future work. Yet it seems that this would underline a crucial point about architectural regularization you make in the introduction, and that it would not be a substantially new direction of study.
- Optionally, it could be nice to make the memorization scores more tangible by showing images of the two closest samples used in the metric for a random subset of training samples - along with the values of the scores - similar to the FID in the Appendix.
- Can you clarify whether the assumptions you make in L.239 are mainly technical or due to fundamental limitations? (Similarly for the frozen first layer weights?)
- Is the code available only on request, or will it be published?
Some small comments
- L.45 seems like the first time where you introduce the training time - for clarity it could be useful to distinguish it from the time of the diffusion process and to specify whether you mean training budget in an online setting, or epochs of the optimization process.
- L.41 implicit bias — it would help to spell out what this bias is biasing towards (generalizing instead of memorizing solutions?)
- Personally I find the term “dynamical regularization” slightly misleading, as my first thought was that the regularization would be dynamical, i.e. adaptive. On second thought I assumed it had to do with the diffusion time dimension. Both interpretations were wrong. It could be helpful to say very early in the introduction that early stopping is an example, which both indicates that you mean training dynamics and rules out the diffusion itself.
- L.152 it could be useful to mention explicitly that the FID is used to measure generalization.
- Do you have an understanding of how sensitive your results are to your choice of $k = 1/3$ in Eq. (6)?
- Fig. 3 right: Can you add markers to the phase diagram, so that it is clear what the measured datapoints were? Which width W is used here? All of them?
- L.227: Clear from context but not easy to parse: do you rescale time or the loss?
- In the caption of Fig. 6, could you clarify whether the lines are theoretical or empirical?
- L.229: it might be worth clarifying that the score is not a function which depends on time, but that the time is a property of the parameters, i.e. writing the time dependence explicitly in the notation.
- Can you clarify in Figure 5 whether we are in Regime I or II, and which parts of the spectrum can be identified with the two densities of Thm. 3.2? Or, if it is a misunderstanding, why this distribution is not related to the forms in Thm. 3.2?
Limitations
They are openly addressed by the authors in a separate section and discussed in the weaknesses of this review.
Final Justification
All questions I had were addressed, and with the extra page I believe that the authors will be able to incorporate them in the given space.
Formatting Issues
None.
We first want to thank you for the very positive report as well as for your various questions, which helped us improve our work. You will find below answers to the several points raised in your review.
Main comments
About larger parameter regimes for Fig. 3
We have conducted additional experiments in the meantime. In particular, we added an intermediary point and found that for CelebA resized to 32x32 the range of parameters shows convergence on the training timescales we focus on (2M steps at maximum), and therefore does not require much larger training times. We plan to update Fig. 3 and the discussion accordingly in the final version of the manuscript. Exploring the diagram for much larger models on this dataset would demand significantly more data and much more training time, which goes beyond our computational capacities without, we believe, strengthening our core results (see also the answer below).
About Fig. 3 (right) and the architectural regularization
We agree that although the layouts of Fig. 1 and Fig. 3 are similar, they plot related but distinct diagnostics, which may be confusing. Fig. 1 is a schematic diagram drawn in the asymptotic regime, illustrating the theoretical findings. On the other hand, Fig. 3 reports empirical measurements and identifies the smallest training time at which memorization appears. Building upon your question, we were able to identify, for our range of parameters and training time (2M steps), a convergence as the dataset size increases, therefore drawing the architectural regularization line hinted at by Fig. 1. We thank you for this suggestion and we will definitely add this line to Fig. 3 to reinforce the parallel with Fig. 1 and validate the observations on realistic trained models and datasets.
About generalization to adaptive optimizers
In the numerical experiments on CelebA we use SGD with fixed momentum. To demonstrate that the observed phenomenon is not specific to SGD, we included in Appendix B.3 numerical experiments on a toy Gaussian Mixture dataset using Adam, which exhibit the same behaviour. Since the submission, we also conducted additional experiments on CelebA at fixed model size, and on random features, both trained with Adam, showing that $\tau_{\mathrm{mem}}$ also scales linearly with $n$ in these cases. We will be pleased to incorporate these new results in the appendix of the camera-ready version of the manuscript, accompanied by a remark in the conclusion.
Is the phenomenon unique to diffusion models?
Thank you for raising this important question on the generality of our results. We believe that our findings extend to other generative models. For instance, the same analysis can be done transparently for stochastic interpolants [1] and for Flow Matching [2]. Indeed, the learning of the score function (or velocity field) can be rephrased as a regression task (the denoising score-matching loss in the case of diffusion models). In the context of supervised learning, [3] observed a similar timescale for overfitting with an overparametrized two-layer neural network. More specifically, they show that if the network size is kept fixed, then this timescale grows linearly with the number of training samples, which is similar to our result. We will mention this point in the final version of the paper.
Making memorization scores more tangible
We show in Fig. 7 some generated images along with their nearest neighbor in the training set. We will make use of the additional page of the camera-ready version to include a similar (but better-looking) figure in the main text showing a random batch of generated samples, making the memorization fractions more tangible (e.g. in Fig. 2).
About the assumptions of L.239
We thank you for asking for this clarification. It is indeed important to assess the role of these assumptions and simplifications, and we will do so in the camera-ready version if the paper is accepted. There are two assumptions made in the analytical part: (i) the activation function $\sigma$ admits an expansion in Hermite polynomials, and (ii) $\sigma$ is odd. Both are mainly technical and are not a limitation of the approach. Assumption (i) is equivalent to $\sigma$ belonging to the space of square-integrable functions with respect to the Gaussian measure, and is verified by all the activation functions used in practice ($\tanh$, ReLU, ...). Regarding (ii), it is the same as in [4] and greatly simplifies computations. Moreover, the target score function we focus on also has this symmetry, so it is natural to consider such families of activation functions. It can be relaxed by requiring only that $\sigma$ has zero mean under the Gaussian measure, which simply amounts to a constant shift of the activation function.
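For reference, a compact way to state these two assumptions (our notation; $He_k$ denotes the probabilist's Hermite polynomials and $z \sim \mathcal{N}(0,1)$):

```latex
\sigma(z) \;=\; \sum_{k \ge 0} \frac{\mu_k}{k!}\, He_k(z),
\qquad
\mu_k \;=\; \mathbb{E}_{z \sim \mathcal{N}(0,1)}\!\big[\sigma(z)\, He_k(z)\big].
% (i)  amounts to \sum_k \mu_k^2 / k! < \infty, i.e. square-integrability of
%      \sigma with respect to the Gaussian measure;
% (ii) oddness means \mu_k = 0 for all even k; the relaxed condition only
%      requires \mu_0 = \mathbb{E}[\sigma(z)] = 0, absorbed by a constant shift.
```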
Focusing on the case where the first layer of weights is frozen is a limitation. It allows us to study the memorization-generalization phenomenon in a well-defined and solvable setting, the celebrated Random Features model [5]. While simplistic, it has already served as a theoretical framework to study several behaviors of deep neural networks, such as the double descent phenomenon [6,7]. We want to stress that our approach can be extended to more general settings. In fact, the Neural Tangent Kernel scaling limit of deep neural networks can be mapped to a specific Random Features model [8,9]. Therefore, one can extend the approach we developed in this work to study deep neural network models of the score in the NTK limit - the only difference being that the random matrix problem to solve (Theorem 3.1) is more involved but conceptually identical. After the completion of this work we started working on this project, and the associated results will be presented in a forthcoming publication.
We will clarify the nature and importance of each assumption in the final version of the paper.
Availability of the code
Thank you for raising this point. We will make all code used for training our models and reproducing the main figures of the paper publicly available in a non-anonymized GitHub repository upon publication.
Smaller comments
On the definition of the training time (L.45)
The quantity introduced at L.45 is the training time, measured in the paper as the number of gradient updates. We will make this clearer in the final version of the paper.
On the implicit bias (L.41)
We will make this sentence clearer and spell out what the bias is biasing towards, to avoid any confusion.
On the term “dynamical regularization”
We understand that the chosen wording might be misleading at first sight, and we are committed to making it clearer at the very beginning of the next version of the paper.
On the FID as a measure of generalization (L.152)
Thank you; we now specify that the evolution of the FID is used to measure generalization (and in particular $\tau_{\mathrm{gen}}$) in the numerical experiments.
On the sensitivity to the choice of k = 1/3 in Eq. (6)
The choice $k = 1/3$ comes from previous numerical studies [10, 11], where it was tuned to fit the visual appreciation of memorization on image datasets. We made sure that varying this value does not affect the scaling behaviour of $\tau_{\mathrm{mem}}$.
On markers in the phase diagram of Fig. 3 (right)
We reworked the diagram in Fig. 3 (right) as you suggested, adding markers that clearly indicate the measured points. We also completed it with two additional parameter values. All available widths W are used.
On the rescaling at L.227
In Section 3, we rescale the loss of Eq. (4) by the dimension of the data to have a well-defined large-dimensional limit. We agree that the formulation of the sentence is ambiguous and we will clarify it in the camera-ready version of the paper.
On the caption of Fig. 6
The lines in Fig. 6 are empirical, obtained using a PyTorch implementation of full-batch gradient descent. Details of the numerical setup can be found in Appendix D. We will make this explicit in the figure caption by indicating that the curves are empirical.
On the time dependence of the score (L.229)
In full generality, the score function is a function of the input $x$ and the diffusion time $t$. In the analytical part of the article, we fix the diffusion time and study the learning of the score at this specific time. In other words, we train a new model for the score at each diffusion time $t$. Usually, practitioners use a single model for the whole generative process, which takes the time $t$ as an input. However, in our setting, the learned parameters depend on the diffusion time, so the score function depends on time through its parameters $\theta(t)$. This setting was also used in some previous works [4,12] and greatly simplifies the analysis. We thank you for raising this point, and we will make the dependence of the parameters on time (and therefore of the score function) more explicit in the camera-ready version of the article to avoid any confusion.
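In symbols, the distinction is between the practitioner's single time-conditioned network and the per-time models analyzed here (notation ours):

```latex
\underbrace{s_{\theta}(x, t)}_{\text{one network, } t \text{ as an input}}
\qquad \text{versus} \qquad
\underbrace{s_{\theta(t)}(x)}_{\text{one set of parameters trained per diffusion time } t}
```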
On the regime of Figure 5
Figure 5 is computed with parameters corresponding to Regime I (the overparametrized regime) of Thm. 3.2, which is the regime of interest of the article. The first bulk in the inset corresponds to the first density in the theorem, and the second bulk to the second density. We thank the reviewer for pointing out the lack of precision in the legend of Figure 5, and we will revise it accordingly in the final version of the paper.
[1] Albergo, Boffi, Vanden-Eijnden, arXiv:2303.08797, 2023.
[2] Lipman, Chen, Ben-Hamu, et al., ICLR, 2023.
[3] Montanari, Urbani, arXiv:2502.21269, 2025.
[4] Georges, Veiga, Macris, arXiv:2502.00336, 2025.
[5] Rahimi, Recht, NeurIPS, 2007.
[6] Mei, Montanari, Communications on Pure and Applied Mathematics, 2019.
[7] d'Ascoli, Refinetti, Biroli, et al., ICML, 2020.
[8] Jacot, Gabriel, Hongler, NeurIPS, 2018.
[9] Chizat, Oyallon, Bach, NeurIPS, 2019.
[10] Yoon, Choi, Kwon, et al., ICML, 2023.
[11] Gu, Du, Pang, et al., Transactions on Machine Learning Research, 2025.
[12] Cui, Krzakala, Vanden-Eijnden, et al., ICLR, 2025.
Thank you for carefully addressing the points I raised, which will improve the updated version of the work. I have updated my score.
Thank you for updating your score. We believe that your suggested points will definitely strengthen our manuscript, and we appreciate the care and thoroughness of your review.
This work investigates how diffusion models transition from generalization to memorization. The authors identify two training timescales: an early time $\tau_{\mathrm{gen}}$ at which models begin generating new samples, and a later time $\tau_{\mathrm{mem}}$ at which the model starts to memorize training data. The authors claim that the desired generalization behavior of diffusion models occurs when the training time lies between $\tau_{\mathrm{gen}}$ and $\tau_{\mathrm{mem}}$. The authors provide experimental evidence and theoretical analysis in the setting of a random feature network.
Strengths and Weaknesses
Strengths
- The perspective that the model achieves generalization and then eventually memorization as the training progresses is quite compelling. To the best of my knowledge, this is a novel perspective and the experimental evidence seems sound.
- The characterization that $\tau_{\mathrm{gen}}$ is roughly independent of $n$ while $\tau_{\mathrm{mem}}$ is proportional to $n$ is a non-obvious and interesting observation.
- The theoretical analysis with random feature network is convincing.
Weakness
- Although I really appreciate the conceptual insight that this work offers, I believe that all diffusion models in practice never reach, or operate within, the architectural regularization regime. Therefore, the present theory does not inform us as to what to do differently.
Questions
Is there a way the findings and the insights of this paper may guide us to change something with the current diffusion model practice?
Limitations
Yes.
Final Justification
The main finding of this work is novel and interesting. With the solid experimental and theoretical evidence, the paper provides clear value.
Formatting Issues
None
We thank you for the very supportive report.
I believe that all diffusion models in practice never reach or are within the architectural regularization regime. Is there a way the findings and the insights of this paper may guide us to change something with the current diffusion model practice?
We agree that models trained on massive and richly diverse datasets operate above the architectural regularization threshold, and are therefore not directly concerned by the presented theoretical findings. However, even some models industrially trained on massive datasets like LAION were found to exhibit complete or partial memorization [1, 2]. There are also domains where training data are scarce - for instance many physical science applications like cosmology or climate science - hence often falling in a low-data, high-capacity regime. In all these cases, our work offers some concrete guidelines and possible fixes (early stopping and/or controlling the network capacity) that could help to robustly train diffusion models to avoid memorization without needing more domain-specific inductive biases. We would be happy to discuss these aspects in the conclusion of the camera-ready version of the paper.
[1] Carlini, Hayes, Nasr, et al., Extracting Training Data from Diffusion Models, USENIX, 2023.
[2] Somepalli, Singla, Goldblum, et al., Understanding and Mitigating Copying in Diffusion Models, NeurIPS, 2023.
Thank you for the response.
In the rebuttal, the authors claim "concrete guidelines and possible fixes (early-stopping and/or controlling the network capacity) that could help to robustly train the diffusion model to avoid memorization", but I believe this qualitative insight was known prior to this paper.
In any case, I am happy with the paper, so I maintain my score of acceptance.
This article investigates two important timescales for score-based diffusion models: an early generalization time $\tau_{\mathrm{gen}}$, after which high-quality generation occurs, and a later memorization time $\tau_{\mathrm{mem}}$, after which the generative ability of the diffusion model becomes weakened or limited as memorization begins to dominate the dynamics. The transition regime between these two timescales, related to the so-called dynamical regularization effect, is shown to scale linearly with the training set size $n$, i.e., $\tau_{\mathrm{mem}} \propto n$. Furthermore, an architectural regularization phase is identified and linked to the expressivity of the underlying neural network.
Strengths and Weaknesses
Strengths: The article is clearly written and well organized. It provides a valuable framework for studying one of the most important aspects of diffusion models. The theoretical analysis leverages connections to random matrix theory and spectral analysis, offering a powerful mathematical tool for researchers in the field.
Weakness: The diffusion model considered in the article is linear, with constant drift and diffusion coefficients (see equation (2)), and the analysis focuses only on one-layer neural network learning (see equation (8)). Although this setup serves well to demonstrate the phenomenon as a toy model, it is limited in capturing nonlinear dynamical effects and in analyzing deep neural networks. Admittedly, extending the analysis to such settings would be a highly challenging task.
Questions
Question: Is there an explicit estimate for the two timescales $\tau_{\mathrm{gen}}$ and $\tau_{\mathrm{mem}}$? How do these timescales depend on the spectrum of the feature correlation matrix as described in Theorems 3.1 and 3.2?
Limitations
Yes.
Final Justification
The paper is well written and makes strong theoretical contributions. The authors provided clear and satisfactory responses to reviewer concerns.
Formatting Issues
None.
We first want to thank you for the very positive report concerning the soundness and contributions of our work. You will find below answers to the points raised in your review.
Constant drift and diffusion coefficients in the diffusion model and one-layer neural network for the analytical part.
We are not completely sure we fully understood the comment, so we will present an extended response and apologize in advance if part of it is not what you referred to.
The exposition we make of diffusion models in the introduction is indeed kept minimalist, but adding time dependence to the drift and/or diffusion terms of Eq. 2 in fact amounts to a time reparameterization of the diffusion process. This is therefore not a limitation. All of our experiments in fact use the standard DDPM formalism with
$$\mathrm{d}x_t \;=\; -\tfrac{1}{2}\,\beta(t)\,x_t\,\mathrm{d}t \;+\; \sqrt{\beta(t)}\,\mathrm{d}W_t,$$
where $\beta(t)$ is a predefined noise schedule. In Appendix A1, we show that the two formulations are indeed equivalent under a proper reparameterization of the time. Hence, the theoretical results also hold for non-constant drift and diffusion coefficients.
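For completeness, here is a short sketch of the standard change of time variable behind this equivalence (the notation $\tilde{t}$ is ours and may differ from the conventions of Appendix A1):

```latex
% Define the reparameterized time
\tilde{t}(t) \;=\; \tfrac{1}{2}\int_0^t \beta(u)\,\mathrm{d}u .
% Then d\tilde{t} = \tfrac{1}{2}\beta(t)\,dt and, in distribution,
% dW_t = \sqrt{2/\beta(t)}\,dW_{\tilde{t}}, so that
\mathrm{d}x_t = -\tfrac{1}{2}\beta(t)\,x_t\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}W_t
\quad\Longleftrightarrow\quad
\mathrm{d}x_{\tilde{t}} = -\,x_{\tilde{t}}\,\mathrm{d}\tilde{t} + \sqrt{2}\,\mathrm{d}W_{\tilde{t}},
% i.e. an Ornstein--Uhlenbeck process with constant drift and diffusion
% coefficients, as in Eq. (2).
```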
We do consider a linear score in the theoretical analysis. The motivation is to study the memorization-generalization phenomenon in a well-defined and solvable setting. As you stated, extending the analysis to more general settings would be a highly challenging task. We do hope that our work will trigger new research along these lines. Finally, it is true that we consider a one-layer network to model the score, but our approach can be extended to more general settings. In fact, the "Neural Tangent Kernel" scaling limit of deep neural networks can be mapped to a specific Random Features model [1,2]. Therefore, one can extend the approach we developed in this work to study deep neural network models of the score in the NTK limit -- the only difference being that the random matrix problem to solve (Theorem 3.1) is more involved but conceptually identical. After the completion of this work we started working on this project. Results will be presented in a forthcoming publication.
We will clarify all these aspects in the final version of the paper to highlight the broad applicability of our results.
Is there an explicit estimate for the two timescales $\tau_{\mathrm{gen}}$ and $\tau_{\mathrm{mem}}$? How do these timescales depend on the spectrum of the feature correlation matrix as described in Theorems 3.1 and 3.2?
As stated in Line 247 of the main text and proved in Proposition C.1 of the SM, the timescales of the training dynamics are given by the inverse eigenvalues of the feature correlation matrix defined in Eq. (12). In Theorem 3.2, we show that the spectrum of this matrix exhibits two well-separated bulks with normalized densities. This structure allows us to define two characteristic timescales, given respectively by the inverse of the typical eigenvalues of the large-eigenvalue bulk (a fast timescale) and of the small-eigenvalue bulk (a slow timescale).
Furthermore, Proposition C.2 of the SM shows that on the fast timescale, both the training and test losses decrease without overfitting, whereas on the longer timescale, the generalization loss begins to increase. This behavior supports the identification of $\tau_{\mathrm{gen}}$ with the fast timescale and of $\tau_{\mathrm{mem}}$ with the slow one.
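For intuition, a minimal sketch of why inverse eigenvalues set the timescales, written for a generic gradient flow on a quadratic loss (the notation $M$, $\lambda_k$, $\theta^{\star}$ is ours, not the paper's):

```latex
% Gradient flow on a quadratic loss with Hessian-like matrix M:
\dot{\theta}(\tau) \;=\; -\,M\,\big(\theta(\tau) - \theta^{\star}\big)
\qquad\Longrightarrow\qquad
\theta_k(\tau) - \theta^{\star}_k \;\propto\; e^{-\lambda_k \tau},
% where \theta_k is the component of \theta along the k-th eigenvector of M,
% with eigenvalue \lambda_k. Each mode relaxes on a timescale 1/\lambda_k:
% the large-eigenvalue bulk equilibrates first (generalization), while the
% small-eigenvalue bulk relaxes much later (memorization).
```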
[1] Jacot, Gabriel, Hongler, Neural Tangent Kernel: Convergence and Generalization in Neural Networks, NeurIPS, 2018.
[2] Chizat, Oyallon, Bach, On Lazy Training in Differentiable Programming, NeurIPS, 2019.
I would like to thank the authors for addressing my questions. I will maintain my current score (Accept).
Thank you for your answer and for maintaining your recommendation about the acceptance of our manuscript. We are grateful for your feedback and pleased that our rebuttal addressed your questions.
This paper investigates the generalization ability of diffusion models, and identifies two crucial time points: an early time $\tau_{\mathrm{gen}}$ when models begin to generate high-quality samples (generation time), and a later time $\tau_{\mathrm{mem}}$ when memorization emerges (memorization time). The generation time is a constant, while the memorization time scales with the dataset size. The findings reveal a form of implicit dynamical regularization and are backed up by both theoretical and empirical investigations.
Strengths and Weaknesses
The paper is well-written in general and easy to follow. This work contributes to the theoretical understanding of the generalization performance of diffusion models. I didn't go through the proofs, but the result is sound. The theoretical investigation is based on a random feature simplification. Could the authors elaborate on how this is closely related to more practical models or to what is used in the experiments?
More discussion on how such findings could help design better training strategies for diffusion models would enhance the significance of this work.
Questions
It seems the results are restricted to SGD. Is there a similar phenomenon for other gradient-based training algorithms with momentum terms?
Are the results applicable to conditional diffusion models?
Are the results applicable to datasets with long-tail distributions? How does the identified implicit dynamical regularization effect affect learning from samples in the long tail?
Limitations
Yes.
Final Justification
Most of my concerns are addressed; I therefore increase my score and recommend acceptance.
Formatting Issues
None.
We thank you for your remarks and questions, which we believe will help improve the quality of the paper. Please find below the answers to your questions.
Discussions about the practical impact of our work
Several recent works show that even industrial models [1, 2] trained on millions of data points exhibit partial or complete memorization of some training samples. While our work proposes a theoretical understanding of the memorization phenomenon, we believe it also provides interesting guidelines for practitioners on ways of fixing this issue: either by reducing the total number of parameters in the network as long as it does not harm generation quality, or by stopping the training earlier. Our work also shows that some intuitive and simple metrics can be used to monitor the amount of memorization during training (memorized fraction, or train vs test loss at fixed diffusion times). We will be happy to discuss more how our results could better guide practice in the final version of the paper.
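As an illustration of this kind of monitoring, a minimal sketch of an early-stopping loop; `train_step` and `denoising_loss` are hypothetical helpers standing in for one optimizer update and for the denoising score-matching loss at a fixed diffusion time, not functions from the paper's codebase:

```python
def train_with_memorization_guard(model, loader, train_batch, test_batch,
                                  max_steps, check_every=1000, gap_tol=0.05):
    """Stop training once the train/test denoising losses diverge at a fixed
    diffusion time t0, a practical proxy for entering the memorization phase."""
    for step in range(max_steps):
        train_step(model, loader)  # hypothetical: one SGD/Adam update
        if step % check_every == 0:
            # hypothetical: denoising score-matching loss at fixed time t0
            gap = (denoising_loss(model, test_batch, t0=0.5)
                   - denoising_loss(model, train_batch, t0=0.5))
            if gap > gap_tol:      # overfitting gap opens: tau_mem is near
                return step        # stop inside the generalization window
    return max_steps
```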
Generalization to other optimization algorithms with momentum terms
In all the experiments on CelebA reported in Sect. II, we in fact use stochastic gradient descent (SGD) with fixed momentum. To demonstrate that the observed phenomenon is not specific to SGD, we included in Appendix B.3 numerical experiments on a toy Gaussian Mixture dataset using Adam, which exhibit the same behaviour. Moreover, since the submission, we also conducted additional experiments on CelebA at fixed model size, and on random features, both trained with Adam, showing that $\tau_{\mathrm{mem}}$ also scales linearly with $n$ in these cases. We will be pleased to incorporate these new results in the appendix of the camera-ready version of the manuscript, accompanied by a remark in the conclusion.
Are the results applicable to conditional diffusion models?
Thank you for raising this interesting point. We know memorization is still observed in models trained conditionally, as for instance shown in [2, 3, 4]. Importantly, our results do not rely on the model being unconditional: any conditioning variable (class, text embeddings or other information) usually enters the model as an extra input at the training level. In consequence, our observations are expected to hold when training conditional scores; in particular, for each conditioning, we do expect $\tau_{\mathrm{mem}}$ to increase with $n$. It could happen that $\tau_{\mathrm{gen}}$ and $\tau_{\mathrm{mem}}$ depend on the class. In this case, before entering a full memorization (generalization) phase, one could be in a regime in which some classes are in the generalization (memorization) phase and others are not.
To validate our results on conditional diffusion models, we trained DDPMs with classifier-free guidance [5] on a mixture of two Gaussians (same data and model as in the appendix) and measured the associated memorization fraction during training for several dataset sizes $n$. We show that, even under guidance, $\tau_{\mathrm{mem}}$ scales linearly with $n$. We will add this result to the appendix and explicitly mention it in the main text of the camera-ready version of the paper, as we think it is a valuable addition showing the generality of our results.
Are the results applicable to datasets with long-tail distributions?
Thank you for this question. Since the submission of this work, we have extended our analysis to compute the spectrum of the feature correlation matrix for Gaussian data with an arbitrary covariance matrix, including the cases where the density of eigenvalues of the covariance has power-law tails. We will add this computation in the final version of the paper. The case corresponding to non-Gaussian data, whose distribution has power-law tails, is more challenging. We have done preliminary numerical experiments on the RF model applied to datasets with long-tail distributions as in [6]. Our findings confirm that the increase of $\tau_{\mathrm{mem}}$ with $n$ is present also in this case.
A thorough understanding of how heavy-tailed data distributions affect the implicit regularization dynamics would require a more detailed analysis, which goes beyond the current scope of the paper. However, we agree with you that this is an interesting direction that we would like to pursue in future work.
[1] Carlini, Hayes, Nasr, et al., Extracting Training Data from Diffusion Models, USENIX, 2023.
[2] Somepalli, Singla, Goldblum, et al., Understanding and Mitigating Copying in Diffusion Models, NeurIPS, 2023.
[3] Wen, Liu, Chen, et al., Detecting, explaining and mitigating memorization in Diffusion Models, ICLR, 2024.
[4] Chen, Yu, Xu, Towards Memorization-Free Diffusion Models, arXiv:2404.00922.
[5] Ho, Salimans, Classifier-Free Diffusion Guidance, NeurIPS, 2021.
[6] Adomaityte, Defilippis, Loureiro, et al., High-dimensional robust regression under heavy-tailed data: asymptotics and universality, Journal of Statistical Mechanics: Theory and Experiment, 2024.
Thanks for the detailed reply. Most of my concerns are addressed, and I increased my score accordingly (from weak accept to accept).
Thank you for informing us about the score increase; we truly appreciate it. We are confident that the modifications you suggested will further enhance the impact of our work.
This paper makes a significant and cohesive contribution to the theoretical and empirical understanding of generalization and memorization in diffusion models, which is well acknowledged by the reviewers (in both the pre-rebuttal and post-rebuttal phases). Empirically, through rigorous and well-designed experiments on the CelebA dataset, the authors clearly demonstrate the emergence of two distinct training timescales: an early timescale $\tau_{\mathrm{gen}}$ for achieving high-quality generation, which remains constant with the dataset size $n$, and a later memorization timescale $\tau_{\mathrm{mem}}$, which scales linearly with $n$. This creates an expanding window for effective generalization, a finding that is robustly validated across varying model capacities and supported by comprehensive metrics including FID, memorization fraction, and train/test loss dynamics. The empirical findings will benefit further study of the memorization effect in diffusion models.
Theoretically, the authors provide a complementary and equally rigorous analysis using a tractable random features model, deriving the spectral properties of the feature correlation matrix in the high-dimensional limit. By leveraging tools from random matrix theory, they formally link the eigenvalue distribution to the separation of timescales, confirming the linear scaling of $\tau_{\mathrm{mem}}$ with $n$ and elucidating the role of the model and data complexity ratios. The synergy between clear empirical demonstrations and a solid theoretical foundation offers profound insights into the mechanisms of implicit dynamical regularization.
Overall, the paper is exceptionally well-written and easy to follow, the rebuttal thoroughly addressed all reviewer concerns, and the findings are of fundamental importance to the machine learning community. Thus, I would recommend an oral presentation.