PaperHub
Average rating: 5.8 / 10 — Rejected (4 reviewers; min 5, max 6, std 0.4)
Individual ratings: 6, 5, 6, 6
Confidence: 2.8 · Correctness: 2.5 · Contribution: 2.5 · Presentation: 2.5
ICLR 2025

The Inductive Bias of Minimum-Norm Shallow Diffusion Models That Perfectly Fit the Data

Submitted: 2024-09-25 · Updated: 2025-02-05

Abstract

While diffusion models can generate high-quality images through the probability flow process, the theoretical understanding of this process is incomplete. A key open question is determining when the probability flow converges to the training samples used for denoiser training and when it converges to more general points on the data manifold. To address this, we analyze the probability flow of shallow ReLU neural network denoisers which interpolate the training data and have a minimal $\ell^2$ norm of the weights. For intuition, we also examine a simpler dynamics which we call the score flow, and demonstrate that, in the case of orthogonal datasets, the score flow and probability flow follow similar trajectories. Both flows converge to a training point or a sum of training points. However, due to early stopping induced by the scheduler, the probability flow can also converge to a general point on the data manifold. This result aligns with empirical observations that diffusion models tend to memorize individual training examples and reproduce them during testing. Moreover, diffusion models can combine memorized foreground and background objects, indicating they can learn a "semantic sum" of training points. We generalize these results from the orthogonal dataset case to scenarios where the clean data points lie on an obtuse simplex. Simulations further confirm that the probability flow converges to one of the following: a training point, a sum of training points, or a point on the data manifold.
Keywords
Probability flow · Score flow · Diffusion models · Denoising · Neural networks

Reviews and Discussion

Review (Rating: 6)

This work considers particular forms of robust score-based denoisers which map points in a ball around a training point to the training point itself. Based on the closed-form derivations of Zeno et al. (2023) for 'min-cost' robust shallow denoisers, they derive the score dynamics for three main experiments. The score is presented in the variable-noise (variable robustness) case as well as the fixed-noise (fixed robustness) case, which have similar generalization dynamics for small noise and finite-time reverse diffusion.

Strengths

  • The background is clearly presented and easy to follow. Contributions are accurate and clearly stated.
  • The experiments follow the presented theory well, with illustrative figures. The experiments also exhibit interesting phenomena, such as virtual quadruplets being stable even in the case of inexact denoising, as suggested by theory.
  • The theory seems correct (mostly computation of piecewise functions), but I have not checked line-by-line.

Weaknesses

  • The nature of the work gives limited future work directions, as it is mainly deriving properties of specific closed-form solutions. I do not see how to generalize this work to e.g. minimum RKHS norm solutions, NTK, weight regularization etc., as the form of robustness considered is quite nonstandard.
  • It is unclear how the simple observation of Prop. 1 contributes to the work, as well as the comment afterwards: "training points have high probability". This does not follow from Prop. 1 at all, as being a stable stationary point does not imply high probability.
  • The experiments seem to introduce "diffusion sampling". Is this probability flow or something else? If it is, then this should be made clear. If not, diffusion sampling needs to be defined and probability flow should show up somewhere in the experiments.

Questions

  • Typo in the reference on p.5? There is no Theorem 4 in Zeno et al. (2023).
  • Eq (17, 39): Should there be a minus sign in the second ReLU?
  • Instances of 'rays' in Sec. 4.2 may be better off as 'chords', as the required condition requires that the initial point $y_0$ is almost in-between $x_i$ and 0. Rays typically refer to infinitely long half-lines of the form $\{\lambda x \mid \lambda > 0\}$.
  • "The mathematical connection between the score function and the denoiser" is sometimes referred to as "Tweedie's identity", maybe the authors would consider putting this in.
Comment

C: The nature of the work gives limited future work directions, as it is mainly deriving properties of specific closed-form solutions. I do not see how to generalize this work to e.g. minimum RKHS norm solutions, NTK, weight regularization etc., as the form of robustness considered is quite nonstandard.

R: We would like to clarify the following relation between the min-cost solution and the directions suggested by the reviewer:

(1) Weight regularization: The min-cost solution can be defined as the global minimizer of the loss with an infinitesimally small regularization on the weights (but not on the biases and skip connection). Therefore, our results already match those of training with such weight regularization (see the general comment on this, and the new results in Appendix D).

(2) RKHS norm, NTK: As proven in [A], the min-cost problem is not equivalent to minimizing an RKHS norm (it is closer to an $L_1$-type norm). Thus, the min-cost solution cannot be an NTK solution. Therefore, it is impossible to generalize our results in this direction.

However, we see many other directions to generalize our work, such as understanding the effect of the scheduler, variance preserving probability flow (in contrast to the variance-exploding version we examine here), and the number of noise samples per training point.

[A] Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm infinite width ReLU nets: The multivariate case.

C: It is unclear how the simple observation of Prop. 1 contributes to the work, as well as the comment afterwards: "training points have high probability". This does not follow from Prop. 1 at all, as being a stable stationary point does not imply high probability.

R: We thank the reviewer for raising this important point. Proposition 1 demonstrates that training points are stable stationary points of the score flow, which represents the sampling process from one of the modes of the (multi-modal) distribution of $y$. This means we expect the score flow to converge to the training points; it does not necessarily imply a high probability in cases where additional stationary points exist. However, in the specific case where the score function is differentiable and the training points are the only stationary points, it follows that the score flow will converge to the training points with probability 1, since the probability of reaching unstable stationary points is zero for almost every initialization (from standard properties of gradient flow). The connection to the rest of the work lies in our observation that in the probability flow the training samples are also stable stationary points. We have clarified this in the revised version (lines 206-207).
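For reference, a minimal formalization of the stability notion invoked here (our paraphrase, assuming the score is differentiable at $x_i$; not a quote from the paper):

```latex
\[
  \dot{y}(t) = s\bigl(y(t)\bigr), \qquad
  s(x_i) = 0 \ \ \text{(stationary)}, \qquad
  \operatorname{Re}\,\lambda_k\bigl(\nabla s(x_i)\bigr) < 0 \ \ \forall k \ \ \text{(stable)} .
\]
```

Under these conditions, trajectories initialized sufficiently close to $x_i$ converge to $x_i$; as the response notes, this by itself says nothing about $p(x_i)$ being large.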

C: The experiments seem to introduce "diffusion sampling". Is this probability flow or something else? If it is, then this should be made clear. If not, diffusion sampling needs to be defined and probability flow should show up somewhere in the experiments.

R: Yes, in “diffusion sampling” we meant the discrete probability flow. We have fixed this in the revised version.

C: Other typos and suggestions (p.5; Eqs 17,19; Sec 4.2; Tweedie’s identity)

R: We thank the reviewer for the suggestions and fixes. We have revised the paper accordingly.

Comment

I thank the authors for their revisions and responses to my comments. I maintain my score.

Review (Rating: 5)

This paper studies the convergence of the probability flow and the score flow equipped with a two-layer neural network as a denoiser. In particular, the neural network interpolates the training data with minimum representation cost. Then, for three different types of training data, namely orthogonal datasets, obtuse-angle datasets, and equilateral triangle datasets, the authors characterize the convergence of the probability flow and the score flow to either the actual training data points or certain sums of the training data points. The theoretical analysis is supported by numerical experiments.

Strengths

  1. The problem studied in the paper is interesting, and the analysis seems original in its approach to studying the convergence of the probability flow and the score flow, and provides interesting results on the convergence of these flows for synthetic settings.
  2. By leveraging the explicit form of two-layer neural networks that interpolate the training data with minimum representation cost, the authors characterize the convergence of the probability flow and the score flow theoretically and in quite quantitative terms.

Weaknesses

  1. Some notions in the paper lack formal definitions. For example, what are the hyperbox defined by the data and its boundary? Also, what's the definition of normalized score flow mentioned in the second paragraph in Section 4.2?
  2. Some assumptions are made without discussion about their meaning or necessity: 1) quantities like $\rho_0(\epsilon)$, $T_0(\epsilon,\rho)$, $T_1(\rho)$ in Theorem 2, and $\tau(y_T,\rho_T)$ in Theorem 3 lack interpretation. Similar for other quantities in the other theorems; 2) also, the assumptions on $\{u_i\}_{i=1}^n$, and the use of the Taylor's approximation in the small-noise level regime require more justification and discussion.
  3. The statements of the main theorems are not very clear. Specifically, the meaning of "converge to" is unclear. For example, in Theorem 3, by "converge to this vertex", does it mean that $y_0$ is exactly the vertex?
  4. The three types of training data seem artificial and it is not clear how they are related to real-world data. Related to this, the theoretical analyses seem to heavily rely on the explicit solution for the min-norm interpolator provided in Equation (17), and the results do not extend beyond this. It would also be helpful if the authors could discuss the technical challenges and novelty of the current paper compared to existing works.

问题

  1. Under Equation (11), for the approximation $\nabla \log p(y_t,\sigma_t)\approx s(y)$, the right-hand side does not depend on $t$? Also, in Equation (12), since $g_r$ takes different values at different $r$, does this suggest that we need different neural networks for different $r$? Further related to this, for the results on the probability flow, which ODE is being analyzed?
  2. In Equation (16), on the right-hand side of the second equality, it should also be $\inf_{\theta: h=h_\theta}$?
Comment

C: “what are the hyperbox defined by the data and its boundary?”

R: A hyperbox is an $n$-dimensional analog of a rectangle. For example, if we have 3 orthogonal training points $x_1$, $x_2$, and $x_3$, the vertices of the hyperbox are $x_1$, $x_2$, $x_3$, $x_1+x_2$, $x_1+x_3$, $x_2+x_3$, and $x_1+x_2+x_3$.
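To make the construction concrete, here is a minimal sketch (illustrative only; the choice of points and variable names are ours) that enumerates these vertices as sums of non-empty subsets of the orthogonal training points:

```python
from itertools import combinations

import numpy as np

# Three orthogonal training points in R^3 (an illustrative choice).
x = [np.array([1.0, 0.0, 0.0]),
     np.array([0.0, 1.0, 0.0]),
     np.array([0.0, 0.0, 1.0])]

# Vertices of the hyperbox listed above: all sums of non-empty subsets of the points.
vertices = [sum(subset) for k in range(1, len(x) + 1)
            for subset in combinations(x, k)]

for v in vertices:
    print(v)  # x1, x2, x3, x1+x2, x1+x3, x2+x3, x1+x2+x3
```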

C: “what's the definition of normalized score flow mentioned in the second paragraph in Section 4.2?”

R: The normalized score is the score function, rescaled for visual clarity (only in Figure 1). Specifically, we multiplied the score by the log of its norm and divided by its norm. We moved the definition to the main text in the revised version (lines 286-287).
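As we read this description, the rescaling (used only for the quiver plot in Figure 1) amounts to the following; the function name is ours:

```python
import numpy as np

def normalize_score_for_plot(s, eps=1e-12):
    """Keep the direction of the score vector s, but replace its magnitude by
    log(||s||), i.e., multiply by log(||s||) and divide by ||s|| (plotting only)."""
    norm = np.linalg.norm(s) + eps
    return s * np.log(norm) / norm
```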

C: “quantities like $\rho_0(\epsilon)$, $T_0(\epsilon,\rho)$, $T_1(\rho)$ in Theorem 2, and $\tau(y_T,\rho_T)$ in Theorem 3 lack interpretation. Similar for other quantities in the other theorems; 2)..”

R: To keep the theorem statements simple, we do not use the full analytical expressions, which can be found in Appendices B.2 and B.3.

C: “..the assumptions on $\{u_i\}_{i=1}^n$, and the use of the Taylor's approximation in the small-noise level regime require more justification and discussion.”

R: We added Figure 7 in Appendix E, a 2D plot of the trajectory of the exact score function and the trajectory of the approximated score function. As can be seen, the two trajectories are practically identical for a small noise level.

C: “the meaning of "converge to" is unclear. For example, in Theorem 3, by "converge to this vertex", does it mean that $y_0$ is exactly the vertex?”

R: In this paper, we analyze the solutions of the probability flow ODE and the score flow ODE. The probability flow ODE represents the reverse process in diffusion models. For this case, we study the solution over the time interval $[0,T]$, where the sampled point is $y_0$ at $t=0$. In contrast, for the score flow ODE, we analyze the solution for $t>0$, focusing on the behavior as $t \to \infty$. In the probability flow ODE the process arrives at the point exactly. Since this is a special case of convergence, and to maintain consistent terminology, we used the term "converge to a point" in both cases.
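In symbols, our paraphrase of the two dynamics (written in the standard variance-exploding form; the exact parameterization of the paper's Eq. (11) may differ):

```latex
\begin{align*}
  &\text{probability flow ODE (solved backward over } [0,T],\ \text{sample } y_0 \text{ at } t=0):
    \quad \frac{dy_t}{dt} = -\dot{\sigma}_t\,\sigma_t\,\nabla_y \log p(y_t,\sigma_t), \\
  &\text{score flow ODE (run forward in } t,\ \text{behavior as } t\to\infty):
    \quad \frac{dy_t}{dt} = s(y_t) \approx \nabla_y \log p(y_t,\sigma).
\end{align*}
```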

C: “The three types of training data seem artificial and it is not clear how they are related to real-world data.”

R: Orthogonal datasets are commonly used in the literature as an approximation for random data with an isotropic distribution in high-dimensional spaces, e.g., [A, B, C] below.

[A] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.

[B] Etienne Boursier, Loucas Pillaud-Vivien, and Nicolas Flammarion. Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs.

[C] Konstantin Donhauser, Mingqi Wu, and Fanny Yang. How rotational invariance of common kernels prevents generalization in high dimensions.

Comment

C: “The theoretical analyses seem to heavily rely on the explicit solution for the min-norm interpolator provided in Equation (17), and the results do not extend beyond this. It would also be helpful if the authors could discuss the technical challenges and novelty of the current paper compared to existing works.”

R: As explained in the introduction, the current justifications for the convergence of score-based diffusion algorithms rely on inexact assumptions, such as the positive semi-definiteness of the Jacobian matrix (or the differentiability of the score, which breaks for ReLU-based nets). By leveraging closed-form solutions, we aim to provide more rigorous insights into the sampling process. To the best of our knowledge, our paper is the first to analytically analyze the probability flow ODE in a setting where the score function is approximated using a neural network denoiser (as done in practice). To do so, we first derive an ODE equivalent to the probability flow ODE that holds for any noise scheduler, which is novel. Next, using the closed-form solutions of the denoiser from [D], we analytically solve the ODE and analyze the sampling process, which is also novel. The main technical challenge is pinpointing which parts of the relevant dynamics can be analytically solved in a piecewise-linear switching system of ODEs (which can be challenging to analyze in general).

[D] Chen Zeno, Greg Ongie, Yaniv Blumenfeld, Nir Weinberger, and Daniel Soudry. How do minimum-norm shallow denoisers look in function space?

C: “Under Equation (11), for the approximation $\nabla \log p(y_t,\sigma_t)\approx s(y)$, the right-hand side does not depend on $t$?”

R: Indeed. We fixed this typo in the revised version.

C: “in Equation (12), since $g_r$ takes different values at different $r$, does this suggest that we need different neural networks for different $r$?”

R: Yes, as mentioned in Section 5, we have a set of $S$ different neural networks, one for each $r$. In practice, some diffusion models use this approach, while others incorporate weight sharing between the timesteps.
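A schematic sketch of the two design choices mentioned (names and sizes are illustrative, not the paper's code):

```python
import torch.nn as nn

d, width, S = 3, 64, 150   # data dimension, hidden width, number of noise levels r

def make_denoiser():
    # Schematic two-layer ReLU denoiser; the paper's denoisers additionally include
    # a skip connection and are the minimum-norm interpolators of the training data.
    return nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, d))

# Option 1 (as in Section 5): S separate denoisers, one per noise level r.
denoisers = [make_denoiser() for _ in range(S)]

# Option 2 (used by some practical diffusion models): a single network that receives
# the noise level as an extra input, i.e., weights shared across timesteps.
```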

C: “for the results on the probability flow, which ODE is being analyzed?”

R: Throughout the entire paper, for the probability flow we use the variance-exploding ODE, defined in Eq. (11).

C: In Equation (16), on the right-hand side of the second equality, it should also be $\inf_{\theta: h=h_\theta}$?

R: Did the reviewer mean Eq (6)? If so, this is indeed a typo. We fixed it in the revised version, thanks.

Comment

I thank the authors for the detailed response. I have adjusted the score accordingly.

Comment

We thank the reviewer for the time, and for adjusting the score. Does the reviewer have any remaining concerns? If so, we will be happy to know them, so we can address them and improve the paper.

Review (Rating: 6)

Diffusion models approximate data distributions by estimating the score of the probability density. In practice, however, rather than learning the exact empirical score, these models often learn a smoothed version of it. This paper examines when data is memorised in diffusion models and the inductive biases that emerge when minimal norm solutions of the neural network weights are considered. The findings suggest that, in such cases, empirical data points act as attractive sinks for the score function, meaning that generated samples will tend to recover the nearest empirical data point to their initialisation. Several simplified settings are explored to analyse these effects in detail.

Strengths

This paper is well-written and effectively demonstrates that, in simplified cases, diffusion models can recover the data distribution. A particularly interesting result is that, even in these simplified settings, a diffusion model may recover the sum of the data points, which already suggests some degree of generalisation beyond the empirical data distribution. This naturally leads to an exploration of the convex hull of the data distribution, which the authors address in Section 4. I consider this a novel and valuable contribution to the literature, and I appreciate the examples provided in Section 4.2.

The simulated data experiments provided are enlightening and well motivated by the theory. There is a clear correlation between the derived theorems and the computational experiments shown, revealing that for minimal norm denoisers, data on the hypercube of the empirical dataset may be recovered.

Weaknesses

The main weakness of this work is that the theorems rely on strong assumptions to analyse the simplified cases considered. However, this is generally well addressed in the limitations section of the paper. One concern I have about the scope of the work is that it operates in the "low noise" regime without explicitly defining what this entails. For time $t$ very close to zero, where the diffused density is simply a Gaussian mixture of the data distribution, I don't believe this setting is particularly interesting. However, for $t$ large enough that meaningful generalisation occurs, the results become more significant. I believe the authors investigate this in their hypercube examples, but it would be helpful for them to clarify what they mean by "low noise" to ensure they are not referring to the earlier, less informative case.

I would also encourage the authors to further explore the effect of early stopping in their simulated examples, potentially through an ablation study on the early stopping parameter and its impact on generalisation over the hypercube. While I agree that early stopping likely induces some smoothing of the empirical data distribution, it would be interesting to determine to what extent it increases the likelihood of generated samples residing on the hypercube of the empirical data. The authors' current observation of this generalisation is intriguing, but further analysis could provide more valuable insights.

Questions

Could the authors provide an ablation study, perhaps in 2D, to demonstrate how different early stopping times influence the generalisation of the learned dataset compared to the empirical dataset? My understanding is that, before reaching an attractive sink (an empirical data point), the generated data traverses the hypercube formed by the empirical data. In this case, varying early stopping times should lead to different learned densities. Could the authors confirm whether my interpretation is correct? Including such a study in the paper would help to illustrate the implications of the theorems presented.

Additionally, could the authors comment on what they might expect if the minimal norm assumption were lifted? At present, the intuition behind why minimal norm would lead to different results from training without this constraint is unclear to me. Including a simple comparison between minimal norm training and a neural network trained without this constraint in a simplified setting might help clarify this choice and assist readers in better understanding its impact on the findings.

Comment

C: "One concern I have about the scope of the work is that it operates in the "low noise" regime without explicitly defining what this entails"

R: We define the "low noise regime" as defined in [A]: a scenario in which clusters of noisy samples remain well-separated around each clean point. This setting includes non-negligible levels of noise. For example, in the BSD68 denoising benchmark, a widely used benchmark in denoising, a noise level of $\sigma = 0.1$ is considered to fall within the low noise regime, as discussed in [A]. Furthermore, recent findings indicate that the "useful" aspects of diffusion dynamics predominantly occur below a critical noise threshold [B]. More specifically, the simulations in Section 5 include 150 denoisers, and only in the (last) 50 timesteps (those with the smallest noise level) do the noise balls around each clean data point not intersect; nevertheless, the simulation confirmed our theoretical findings. We have clarified this in the revised version (see footnote 1, lines 160-161).

[A] Chen Zeno, Greg Ongie, Yaniv Blumenfeld, Nir Weinberger, and Daniel Soudry. How do minimum-norm shallow denoisers look in function space?

[B] Gabriel Raya and Luca Ambrogioni. Spontaneous symmetry breaking in generative diffusion models.

C: "I would also encourage the authors to further explore the effect of early stopping."

R: Please note that we do not use early stopping in our simulation. By “early stopping” we refer to the inherent effect of the scheduler on the probability flow sampling process.

C: “Additionally, could the authors comment on what they might expect if the minimal norm assumption were lifted?”

R: We thank the reviewer for this important question. We added in Appendix D additional simulations where we train the neural denoisers using a standard training procedure with the Adam optimizer, with or without weight decay regularization. Specifically, for weight decay we tried $\lambda = 0.25, 0.5, 1$. All values yielded similar results, so Figure 5 includes only the case $\lambda = 0.25$. As can be seen, with WD=0 the process converges only to the training points or boundary points of the hyperbox, whereas for WD>0 it converges to virtual points as well, which aligns with the results obtained with the Augmented Lagrangian method (which we used to obtain the min-norm denoiser).
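A self-contained sketch of the kind of training run described here (all hyperparameters and names are illustrative; note that PyTorch's `weight_decay` also penalizes the biases, whereas the min-cost definition above regularizes only the weights):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ShallowDenoiser(nn.Module):
    """Two-layer ReLU denoiser with an identity skip connection (schematic)."""
    def __init__(self, d, width):
        super().__init__()
        self.hidden = nn.Linear(d, width)
        self.out = nn.Linear(width, d)

    def forward(self, y):
        return y + self.out(torch.relu(self.hidden(y)))

X = torch.eye(3)          # three orthogonal training points (illustrative)
sigma = 0.1               # noise level handled by this particular denoiser
model = ShallowDenoiser(d=3, width=64)
# weight_decay > 0 corresponds to the WD > 0 case discussed above; set it to 0 for WD = 0.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.25)

for step in range(5000):
    clean = X[torch.randint(0, 3, (128,))]
    noisy = clean + sigma * torch.randn_like(clean)
    loss = ((model(noisy) - clean) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```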

Comment

As the discussion period ends in less than two days, could the reviewer kindly let us know if there are any remaining concerns we should address? We have responded to the comments and questions raised by the reviewer and would greatly appreciate the opportunity to address any additional feedback.

Review (Rating: 6)

For three data settings in which Zeno et al. (2023) presented closed form solutions for shallow skip connected ReLU denoisers, the authors provide theoretical and empirical results on the convergence points for 1) probability flow 2) score flow, which is a simplification of the probability flow with constant step size.

The authors first provide theoretical evidence showing that for orthogonal datasets (i.e., all samples orthogonal to each other), probability flow and score flow follow a similar trajectory starting from the same initialization point. Following that, the authors provide theoretical evidence that for all three dataset settings the score flow converges to the training points or sums of training points, whereas the probability flow can also converge to the boundary of a hyperbox, the vertices of which are sums of the training data points, i.e., virtual points.

Strengths

The topic is very relevant to duplication and generalization in diffusion models and their inductive biases. The paper is well organized and the contributions are novel.

Weaknesses

While the results present in the paper have an important connection with memorization/duplication in diffusion models, the presented work does not take the number of samples into account. Prior work related to memorization in practical diffusion models has often shown that i) with oversampling in the training dataset, duplication increases [https://arxiv.org/pdf/2305.20086]; ii) with an increased number of samples in the dataset, duplication decreases [https://arxiv.org/pdf/2212.03860].

What is the connection between the number of samples in a dataset and the stable points? How would the distribution of convergence points in Figure 3b change with the number of training samples? We do see more convergence near training points compared to virtual points; would that be affected by a higher number of training samples? Do we expect to see more boundary points instead if the number of training samples is increased?

Questions

Line 430: What is meant by "The decrease in percentages is due to small deviations in the ReLU boundaries of the trained denoiser compared to the theoretical optimal denoiser."

Line 442: Why is 0.2 set as the $\ell_\infty$ threshold? Could the authors show how continuously changing this value affects the distribution of data points?

Line 470: "specifically in our example for the first 100 denoisers, the noise level is not small compared to the norm" would the authors be able to elaborate this part for clarity? Why is the noise level small for only the first 100 denoisers? Why are they not expected to have stable virtual points?

Comment

C: “While the results present in the paper have an important connection with memorization/duplication in diffusion models, the presented work does not take the number of samples into account.”

R: We thank the reviewer for pointing that out. Based on the reviewer's suggestion, we added in Appendix C simulations to analyze the effect of the number of training samples on the diffusion sampling process. Specifically, we repeat the experiment from Section 5 while changing the number of training points $N$. Figure 4 shows that, as we increase the number of training points, the probability flow converges less to training points, and more to boundaries and virtual points, i.e., the generalization increases. Regarding the effect of oversampling duplications, previous papers observed that diffusion models tend to overfit more to duplicate training points than to other training points. However, note that we study the regime in which the model anyway perfectly fits all the training points. In practice, if the model does not perfectly fit all training points, duplicating some training points would cause the network to fit them better, at the expense of the other training points. This suggests our analysis still holds, but only for the well-fitted training points and their associated virtual points. Mirroring the previous experiments' results, this decreases the generalization and will cause more convergence to the duplicated training points.

C: "Line 430: What is meant by The decrease in percentages is due to small deviations in the ReLU boundaries of the trained denoiser compared to the theoretical optimal denoiser."

R: In practical training, the ReLU boundaries have some small deviations from the closed-form solution. If there are small deviations in the ReLU boundary, the neural network's output $h(x_i+x_j)$ will no longer equal $x_i+x_j$, which is a necessary condition for a fixed point. The impact of small deviations in the ReLU boundary is analyzed in Theorem 4.

C: "Line 442: Why is 0.2 set as the l_inf threshold? Could the authors show how continuously changing this value affects the distribution of data points"

R: We show the effect of changing the $\ell_\infty$ threshold in Appendix E, as well as for an $L_2$ threshold. Qualitatively, this does not drastically affect the results.

C: "Line 470: specifically in our example for the first 100 denoisers, the noise level is not small compared to the norm would the authors be able to elaborate this part for clarity? Why is the noise level small for only the first 100 denoisers? Why are they not expected to have stable virtual points?"

R: In this work, we analyze the variance-exploding version of the probability flow. There, theoretically, the reverse process begins with Gaussian noise of infinite variance. In practical sampling (as shown in Section 5), we initiate the sampling process with a large, but finite, variance for the Gaussian noise. In the analytical derivation (Section 4), our results hold starting from the point in the reverse process where the noise balls around each clean data point no longer intersect, which is what we refer to as the "low-noise regime". The simulations in Section 5 include 150 denoisers, and only in the (last) 50 timesteps (those with the smallest noise level) do the noise balls around each clean data point not intersect. Therefore, stable virtual points are not expected to form for the initial 100 denoisers. We have clarified this in the revised version, thanks (lines 466-473).

Comment

As the discussion period ends in less than two days, could the reviewer kindly let us know if there are any remaining concerns we should address? We have responded to the comments and questions raised by the reviewer and would greatly appreciate the opportunity to address any additional feedback.

Comment

I would like to thank the authors for their response and providing clarity. I also thank the authors for their additional experiments regarding sample size.

I would like to retain my score since I have reservations about how much the insights from the paper can be translated to practical diffusion models, since the paper studies only the setting where the training data is perfectly fit, i.e. when training data is memorized. The theoretical results do explain some aspects of generalization or compositionality in diffusion models. But it's not clear to me how the findings help improve the practice of diffusion model training or help us theoretically understand more practical diffusion models better.

What new theoretical findings can be derived using the theoretical results provided in the paper?

How can the insights from this paper benefit practical diffusion model training or architectures?

Comment

We thank the reviewer for sharing this very constructive feedback.

First, just to make sure we are on the same page: although our theoretical model assumes a perfect fit to the training points (as is common in many theoretical papers), our results also hold in our experiments in which the data is not perfectly fitted (using Adam with WD; see Appendix D and the general comment on this). Our theoretical analysis examines an ideal case where the optimization is exact, as is common in theoretical works.

Second, regarding the utility of our work. In this paper, we build a novel theoretical framework to analyze diffusion models. This framework shows promise by matching several empirical observations, which were not originally baked into the model (such as the level of memorization as a function of the number of training samples, as suggested by the reviewer). Building on this framework, we can envision many exciting future directions that may benefit practical diffusion models:

  • Finding a quantitative prediction on the minimum number of training samples required to transition from a “memorization” regime to a “generalization” regime.

  • We (and others [A]) noticed that many of the high-noise steps in the diffusion models can be discarded. Our analysis may be able to approximate the time step from which we can start the diffusion process.

  • Examining the effect of different schedulers and various training methods (e.g. variance preserving vs. variance exploding).

[A] Gabriel Raya and Luca Ambrogioni. Spontaneous symmetry breaking in generative diffusion models

Comment

We thank the reviewers for their helpful feedback and remarks, and their interest in our paper. We have addressed all the comments and revised the paper accordingly, as detailed below. The rebuttal is presented in a comment-response (C/R) format. Based on the reviewers' feedback, we conducted the following additional simulations, which were added to the paper in Appendices C and D:

Impact of the Number of Training Points:

We repeated the simulations from Section 5 with varying numbers of training points. We observed that decreasing the number of training points increases the chances of the probability flow sampling a specific training point, and decreases the chances of it sampling virtual points and points on the hyperbox boundaries. This phenomenon, of increased memorization with fewer training points, was observed in previous works (e.g. [A], [B]) and aligns with our findings.

Effect of the min-norm assumption:

We ran additional simulations, similar to those in Section 5, in which the neural network denoiser was trained with a standard protocol using the Adam optimizer, with or without weight decay (WD), relaxing the assumption of minimum-norm solutions. We observed that without WD, the process converges only to the training points or boundary points on the hyperbox, whereas with WD the process converges to virtual points as well. This aligns with our reported findings and supports the necessity of the min-norm assumption.

[A] Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Diffusion art or digital forgery?

[B] Zahra Kadkhodaie, Florentin Guth, Eero P Simoncelli, and Stéphane Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representations.

AC Meta-Review

The paper investigates the nature of the points to which probability flows converge in the context of diffusion models, using two-layer ReLU neural networks as denoisers. The study focuses on specific choices of simple data distributions where some degree of analytical tractability is achievable. The reviewers raised several important concerns. Issues related to the presentation of the general framework were addressed to a reasonable extent during discussions with the authors. However, a key concern recurring in the reviews pertained to the scope and generalizability of the results, given the set of assumptions employed (model and data), which were argued to be overly restrictive. This issue remained despite the discussions, with the reviewers maintaining their initial lukewarm overall evaluation. The authors are encouraged to incorporate the important feedback given by the knowledgeable reviewers.

Additional Comments on Reviewer Discussion

See above.


Final Decision

Reject