PaperHub
Rating: 5.7/10 · Poster · 3 reviewers (min 4, max 7, std 1.2)
Individual ratings: 6, 4, 7
Confidence: 4.3 · Correctness: 2.7 · Contribution: 2.3 · Presentation: 3.0
NeurIPS 2024

Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-12-25

Keywords

diffusion models, inductive bias, generalization, memorization

Reviews & Discussion

Review (Rating: 6)

This paper compares diffusion models trained on natural image datasets with their Gaussian approximations. It evaluates the quality of this approximation in both the memorization and generalization regimes by studying the influence of training set size, model capacity, and training time.

Strengths

This paper makes several novel and interesting empirical observations:

  • according to the authors' linearity measure, denoisers are surprisingly well-approximated by linear functions (but more on this below),
  • reducing model capacity or training time can bring diffusion models into the generalization regime even with very small training sets, though at the expense of image quality.

It is also clearly written, and studies the very important problem of characterizing the inductive biases of diffusion models/network denoisers which allow them to generalize.

Weaknesses

I have two main issues with this paper:

  • First, a large part of its content (most of section 3) is rather obvious for people with a signal/image processing background. Indeed, Theorem 1 is well known, usually under the name of "Wiener filter" (which is simply a consequence of linear regression): see, e.g., Theorem 11.3 from A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way, Stephane Mallat, Elsevier/Academic Press, 2009. It immediately implies that differences between the "linear" and "Gaussian" denoisers can only come from suboptimality of the network denoiser (or optimization errors of the linear denoiser), and are thus irrelevant for this study. A second observation that the authors did not make is that any linear denoiser leads to Gaussian generated images, as is evident from equation (4), since $x_{i+1}$ is then a linear function of $x_i$ and $x_0$ is Gaussian (see the sketch after this list). Further, the Gaussian denoiser produces samples from the Gaussian distribution with the same mean and covariance as the training set, which is thus a very simple (but crude) model for which samples can easily be produced without any diffusion procedure.
  • The main claim of the paper then boils down to whether Gaussian distributions are a good approximation of natural image distributions (and in particular, a better approximation than the empirical distribution of the training set). These two examples are interesting to contrast, as they lie on the two extremes of the quality-diversity tradeoff (Gaussian distributions maximize entropy but have very low quality, while a sum of delta functions at the training images has perfect quality but essentially no diversity). However, Gaussian models are extremely crude and have been extensively studied in the past, and we therefore have few things left to learn from them. It is not surprising that good diffusion models are capable of capturing the covariance of their training data, but this is well understood. As the authors acknowledge in the discussion, what we do not understand is the rest (higher-order moments captured with non-linear denoisers). The visual similarity between faces and Gaussian images with matched first and second moments results from the fact that the faces are centered, so that they are well-approximated with a PCA, also known as "eigenfaces": Low-dimensional procedure for the characterization of human faces. L. Sirovich; M. Kirby (1987), Journal of the Optical Society of America A. 4 (3): 519–524. This similarity breaks down for more complex datasets such as LSUN-churches which are more translation-invariant, leading to stationary Gaussian samples which look like textures. I therefore disagree with the main claim that "the image contents generated with such Gaussian denoisers [...] closely resemble those produced by well-trained diffusion models." For the same reason, the inductive biases of network denoisers cannot be reduced to the Gaussian structure only, as they are capable of learning much more complex structures despite the curse of dimensionality.
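To make the first observation concrete, here is a minimal sketch (our own illustration with a toy linear denoiser and an Euler probability-flow update in the spirit of equation (4); the dimensions, schedule, and names are all assumptions) showing that iterating any linear denoiser keeps the samples exactly Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # toy "image" dimension
W = rng.normal(size=(d, d)) / np.sqrt(d)  # arbitrary linear denoiser weights
b = rng.normal(size=d)

def D(x, sigma):
    # an arbitrary linear denoiser D(x) = x W^T + b (sigma-independent for brevity)
    return x @ W.T + b

sigmas = np.geomspace(80.0, 0.002, 40)       # EDM-style noise schedule
x = sigmas[0] * rng.normal(size=(10000, d))  # x_0 is Gaussian

for s, s_next in zip(sigmas[:-1], sigmas[1:]):
    # Euler step on the probability-flow ODE: x_{i+1} is linear in x_i,
    # so by induction every iterate (hence the final sample) stays Gaussian
    x = x + (s_next - s) * (x - D(x, s)) / s

proj = x @ rng.normal(size=d)  # any 1-D projection of a Gaussian is Gaussian
print("skewness:", ((proj - proj.mean()) ** 3).mean() / proj.std() ** 3)
print("excess kurtosis:", ((proj - proj.mean()) ** 4).mean() / proj.std() ** 4 - 3)
```

Both statistics come out near zero, consistent with the generated distribution remaining Gaussian no matter which linear denoiser is plugged in.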

Specific points:

  • The sentence at lines 164-166 is wrong, as shown in Figure 1: the denoiser is linear for very small $\sigma$ (close to identity) and very large $\sigma$ (close to its Gaussian approximation as noted in [19]), but less linear in the middle range.
  • Root mean square error in equation (8) should have the square root outside the expected value.
  • The RMS plots in Figures 2, 6, 7, and 8 could be improved in several ways. First, they are difficult to interpret due to the lack of a reference point. I suggest normalizing them by the expected RMS norm of the network denoiser, so as to have values in $[0,1]$. Second, I suggest using a continuous colormap when a single parameter is being varied (such as the size of the training set or number of epochs), in order to facilitate visual analysis. Third, why are the numerical values for these varied parameters chosen arbitrarily and irregularly? It would be easier to see trends with regularly (e.g., linearly or logarithmically) spaced parameters, chosen at round-ish numbers.
  • The authors should mention in section 3.3 that the Gaussian denoisers compared in Figure 6 use the empirical covariances of the smaller training sets; this is implicit and confused me for a while.
  • Two observations relative to memorization and generalization made by the authors are straightforward to me. First, any generative model which perfectly optimizes its training loss will only reproduce its training set, so it is clear that trained network denoisers are suboptimal for their training loss, and in a specific way that enables them to generalize "correctly". Second, a Gaussian model will strongly generalize in the sense of [15] when the two training sets have similar first and second-order statistics, which then happens quite rapidly (and can be studied precisely with random matrix theory analyses).

Minor:

  • Typo line 286: "diffsuion"
  • Typo caption of Figure 9: "Gausisan" (twice)
  • Typo line 316: "covaraince"
  • Equation (9) and the first equation of Appendix C are missing $c_{\rm out}(\sigma(t))$
  • line 504: footnote -> subscript

Questions

  • Given the points above, the surprisingly high cosine similarities reported in Figure 1 look suspicious to me. If denoisers were truly linear, then they would generate Gaussian images, and we know that they learn much better models than this. As noted by the authors in appendix A, the linearity measure $\mathrm{LS}(t)$ (sketched in code below) evaluates the denoiser out-of-distribution, since linear combinations of natural images are not natural images (they are superpositions of two unrelated images). Doesn't this mean that these cosine similarities are not really meaningful and deceiving? Another factor that could influence these high values is that they are computed for the denoiser function $\mathcal{D}_\theta(x)$, as opposed to the denoiser residual $x - \mathcal{D}_\theta(x)$ (or equivalently the score). As noted in Appendix A, the denoisers are very close to the identity function at low noise and are therefore expected to be close to linear. It would be more interesting to evaluate the linearity of the score, which I suspect is much smaller (I predict it would only decrease as the noise level decreases).
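For concreteness, a minimal sketch of the kind of linearity measure under discussion (an assumed form; the paper's exact $\mathrm{LS}(t)$ may average or weight differently):

```python
import numpy as np

def linearity_score(D, sigma, x1, x2, a=2**-0.5, b=2**-0.5):
    """Cosine similarity between D(a*x1 + b*x2) and a*D(x1) + b*D(x2)."""
    lhs = D(a * x1 + b * x2, sigma).ravel()
    rhs = (a * D(x1, sigma) + b * D(x2, sigma)).ravel()
    return lhs @ rhs / (np.linalg.norm(lhs) * np.linalg.norm(rhs))

# sanity check: an exactly linear (bias-free) denoiser scores 1.0
rng = np.random.default_rng(0)
D_toy = lambda x, sigma: 0.5 * x
x1, x2 = rng.normal(size=64), rng.normal(size=64)
print(linearity_score(D_toy, 1.0, x1, x2))  # -> 1.0
```

Measuring the same quantity on the residual $x - \mathcal{D}_\theta(x)$ instead of $\mathcal{D}_\theta(x)$ gives the score-linearity variant asked about here.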

Limitations

As stated above, the main limitation of this paper is that it studies the Gaussian case which is well-understood. As a result, a large fraction of its results are straightforward (and some even long known), and it misses the point which resides entirely in the non-Gaussian/linear structure learned by these models. I encourage the authors to study the literature in denoising (e.g., Chapter 11 of the aforementioned book by Mallat) and image modeling (for a brief introductory review, see, e.g., Simoncelli, Eero P. Statistical modeling of photographic images. Chapter 4.7 of Handbook of Video and Image Processing 9 (2005).)

However, these topics are unfortunately not well-known by the broader machine learning community. I thus think that this paper could provide a service to the community by reviewing what is known about the Gaussian case (after major changes, e.g., to remove the linear denoiser and only mention the Gaussian denoiser). It also makes several novel and interesting observations that are valuable to the community (see strengths).

In the current state of the paper, I recommend rejection, and encourage the authors to resubmit their work after another iteration.

Author Response

Q1. First, most of section 3 is rather obvious for people with a signal/image processing background; e.g., Theorem 1 is well known.

We sincerely appreciate the reviewer for pointing us to the Wiener filter. We agree that Theorem 1 is well studied (we will add citations on the Wiener filter to acknowledge this), but we believe many findings of our work are not obvious as the reviewer suggested. Please see A.2. in the global response.
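For reference, the Wiener-filter statement in question (a standard result, written in our notation): for $x \sim \mathcal{N}(\mu, \Sigma)$ observed as $y = x + \sigma n$ with $n \sim \mathcal{N}(0, I)$, the MMSE denoiser is linear in $y$:

$$D_{\mathcal{N}(\mu,\Sigma)}(y,\sigma) = \mathbb{E}[x \mid y] = \mu + \Sigma\,(\Sigma + \sigma^2 I)^{-1}(y - \mu).$$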

Q2: A second observation that the authors did not make is that any linear denoiser leads to Gaussian generated images. Further, the Gaussian denoiser produces samples from the Gaussian distribution with the same mean and covariance as the training set, which is thus a very simple model for which samples can easily be produced without any diffusion procedure.

Firstly, in this work we are interested in the mapping from the noise space $z$ to the image space $x$. An arbitrary linear denoiser does not produce images similar to those of the actual diffusion models, and is therefore of no interest here. Furthermore, we cannot compare the diffusion mapping with the Gaussian mapping if we directly sample from the Gaussian distribution without the reverse diffusion procedure.

Q3. It is not surprising that good diffusion models are capable of capturing the covariance of their training data

Again, this point is also not well understood. First of all, we not only show that diffusion models capture the covariance of their training data; more importantly, we show that the obtained diffusion denoisers share similar function mappings with the Gaussian denoisers. The first observation does not explicitly imply the second, since even the optimal denoisers under the multi-delta distribution assumption capture the covariances; after all, they are able to reproduce the whole training dataset. On the contrary, it is not well understood how gradient descent on deep networks arrives at a denoiser that is close to the Gaussian denoisers. Understanding this requires rigorous analysis of the gradient dynamics of the deep network, and to the best of our knowledge, no such analysis exists in the current literature. Furthermore, in the current literature on diffusion models, this problem is circumvented by either assuming an infinite number of training data or directly restricting the architectures of the diffusion models.

Q4. The similarity breaks down for more complex datasets. The inductive biases of network denoisers cannot be reduced to the Gaussian structure only, as they are capable of learning much more complex structures despite the curse of dimensionality.

Our results in Figure 15 in the appendices and Figure 2 (b) of our newly uploaded PDF demonstrate that this similarity also persists in more complex datasets (e.g., the LSUN-Churches, AFHQ, and Cifar-10 datasets). This similarity indicates that the first- and second-order statistics of the finite training dataset are heavily utilized by diffusion models; otherwise, we should not be able to observe the similarity. However, we do agree that the generation power cannot be reduced to the Gaussian structure only, though it indeed plays a critical role.

Q5. The surprisingly high cosine similarities reported in Figure 1 look suspicious to me.

We conducted experiments to measure the linearity of the score functions as the reviewer suggested. The results are shown in Figure 3 of our newly uploaded PDF, where we see that measuring the linearity of the score is not very meaningful. This is because the noise magnitude can be much higher compared to the denoising output (in the range of $[-1,1]$); therefore subtracting the denoising output from the noisy image does not change the noisy image significantly except in the low-noise-variance regime. For this reason, we always see a high linearity of the scores for most of the noise variances. In the figure, denoised_img_1 and denoised_img_2 correspond to $D(x_1|\sigma_t)$ and $D(x_2|\sigma_t)$, denoised_img_1+denoised_img_2 corresponds to $\frac{1}{\sqrt{2}} D(x_1|\sigma_t)+\frac{1}{\sqrt{2}}D(x_2|\sigma_t)$, and denoised_image_additive corresponds to $D(\frac{1}{\sqrt{2}}x_1|\sigma_t)+D(\frac{1}{\sqrt{2}}x_2|\sigma_t)$. $x_1$ and $x_2$ are two randomly sampled noisy images.

Q6. Two observations relative to memorization and generalization made by the authors are straightforward to me...

First of all, this point is not well perceived in the current literature. Current theorems normally assume that one can directly sample from the ground-truth training data distribution instead of having just a finite number of training data. In that case, exactly minimizing the denoising score matching loss indeed results in the ground-truth score function. However, less work focuses on the finite-training-data setting, in which minimizing the denoising score matching loss results in overfitting. As you mentioned, the trained network is suboptimal in a specific way that enables it to generalize correctly. Our results indicate that this specific way is to be close to the Gaussian denoiser, and why this happens is not well understood.

Secondly, our discussion on strong generalizability focuses on the actual diffusion model rather than the Gaussian model. Yes, a Gaussian model will strongly generalize in the sense of [15] when the two training sets have similar first- and second-order statistics. We use this fact to explain why diffusion models exhibit strong generalizability. Since the diffusion models learn similar function mappings as their corresponding Gaussian denoisers in the generalization regime, strong generalization of the Gaussian models leads to strong generalization of the actual diffusion models. This is supported by the fact that we can direct the model towards strong generalization by either early stopping or decreasing the model scale. Remember that these two actions prompt the emergence of the Gaussian structure, which further highlights the necessity of the Gaussian structure for diffusion models' generalizability.

Comment

I thank the authors for their detailed response.

Q1: I agree that in the memorization regime, the linear and Gaussian denoisers studied in this work become different. Thank you for making this point.

Q5: Thank you for the additional experiments, which are interesting and puzzling. I do not understand why the scores appear linear except for very small noise levels. Is the noise variance compared to images with values in [0, 255] or [0, 1]? If the latter, then I agree that linearity of the score for noise larger than 1 is not meaningful, just like linearity of the denoising function for noise smaller than 0.1 is not meaningful. But the interesting range is then a variance smaller than 1, which is hidden in the choice of axis limits here. This would indicate that there is non-negligible non-linearity in the score, which is the object that appears in the reverse SDE/ODE.

Other questions: I agree that capturing the Gaussian/linear structure of the score is the first-order necessary condition for generalization, and it goes a relatively long way in producing images that are correlated with the images generated by a generalizing network. However, I maintain that pointing this out does not teach us a lot about how diffusion models generalize. When I said that this was well understood, I meant that very simple and classical approaches lead to generative models that capture the linear/Gaussian structure and generalize in the sense of [15]. The mystery lies in how non-Gaussian/linear structure can still be estimated from limited samples despite the curse of dimensionality (by diffusion models or other approaches).

Comment

1. I do not understand why the scores appear linear except for very small noise levels... This would indicate that there is non-negligible non-linearity in the score, which is the object that appears in the reverse SDE/ODE.

Yes, the linearity of the score for noise larger than 1 is not meaningful since the pixel range of our images lies in $[-1,1]$. In fact, we've already implicitly measured the linearity of the diffusion denoisers, as shown in Figure 2 (left). Notice that the difference between the actual diffusion denoisers and the corresponding linear denoisers is the largest for $\sigma$ in the range $[0.4, 10]$. This aligns with our original measure of linearity shown in Figure 1, where we see that the most nonlinear part lies in the range $[0.4, 10]$ as well. Therefore, we believe our original linearity measure is good enough. Let's then go back to the reviewer's original question: "Given the points above, the surprisingly high cosine similarities reported in Figure 1 look suspicious to me. If denoisers were truly linear, then they would generate Gaussian images, and we know that they learn much better models than this." Please notice that we never claim the denoisers are exactly linear. However, we do see the trend that the denoisers become increasingly linear as they transition from memorization to generalization. In fact, as we've shown in Figure 2 (a) of our newly uploaded PDF, the denoising outputs of the actual diffusion denoisers are highly similar to those from the Gaussian denoisers. Because of this similarity, the linear models obtained through distillation match the Gaussian models. Please let us know if you have further doubts about the emergence of linearity observed in our paper.

2. I agree that capturing the Gaussian/linear structure of the score is the first-order necessary condition for generalization, and it goes a relatively long way in producing images that are correlated with the images generated by a generalizing network. However, I maintain that pointing this out does not teach us a lot about how diffusion models generalize. When I said that this was well understood, I meant that very simple and classical approaches lead to generative models that capture the linear/Gaussian structure and generalize in the sense of [15]. The mystery lies in how non-Gaussian/linear structure can still be estimated from limited samples despite the curse of dimensionality (by diffusion models or other approaches).

Could the reviewer provide us with some references on "very simple and classical approaches that lead to generative models which capture the linear/Gaussian structure and generalize in the sense of [15]"? We are sincerely eager to learn more about this. Furthermore, we want to emphasize that our findings don't just show that diffusion models capture the first- and second-order statistics; more importantly, the diffusion denoisers are quite close to the Gaussian denoisers. Capturing first- and second-order statistics does not imply the latter: even in the case of memorization, the model still captures the first- and second-order statistics, but the diffusion denoisers in that case are not close to Gaussian.

Furthermore, other generative models such as VAEs and GANs can also be interpreted as denoisers, since they generate images by mapping the noise space to the image space. Moreover, they also capture the first- and second-order statistics of the training dataset, since they have strong generation power. However, their function mappings don't share any similarity with the Gaussian denoisers. For this reason, we believe our results are intriguing and meaningful, since we demonstrate that the function mapping of the diffusion models shares high similarity with the Gaussian models. To be more specific, our results demonstrate that the best linear approximation of the nonlinear diffusion models is nearly identical to the Gaussian models.

Please share your thoughts about this; we are open to any criticism, in the hope that we can make the paper better.

Comment

Thank you for your answers.

  1. I found statements like "generalizing denoisers are mostly linear" deceptive. While this is true in an MSE sense (as measured by the linearity metric), I wanted to emphasize that there is a lot we do not understand that is hidden by the "mostly" (in particular for the noise range [0.4, 10], which is critical in practice for sample quality). However, as you pointed out, one contribution of the paper is that it evidences that memorizing denoisers are less linear than their generalizing counterparts. This is definitely a novel and interesting observation. I suggest putting more emphasis on this latter phrasing of the results rather than the former.

  2. Gaussian models of images go back at least to the 40s with Kolmogorov on turbulence and the 60s with Julesz on texture. But these references are now part of the general folklore. They are a maximum entropy model of images conditioned on the first and second-order moments, which leads to modeling the image distribution as a Gaussian distribution with mean and covariance given by the empirical statistics of the training set (for stationary image distributions, it is sufficient to estimate the global spatial mean and the power spectrum rather than the full mean and covariance). This gives a generative model (just sample from this Gaussian distribution) that naturally reproduces the mean and covariance of the training set. Now if one splits a sufficiently large training set in two, the resulting generative models will be close to identical as soon as the two halves have the same first and second order moments (the number of samples should be at least of the order of the image size or its square, depending on whether the distribution is assumed stationary or not). This can be checked by sampling one $z \sim \mathcal{N}(0, \mathrm{Id})$ and comparing $\mu_1 + \Sigma_1^{1/2} z$ with $\mu_2 + \Sigma_2^{1/2} z$ to reproduce the setting of [15] (a code sketch of this check is given below).
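A minimal sketch of that check (our own illustration on synthetic stand-in data; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(20000, 64))       # stand-in for flattened images
half1, half2 = data[:10000], data[10000:]

def gaussian_sampler(X):
    """Max-entropy Gaussian model of X: z -> mu + Sigma^{1/2} z."""
    mu = X.mean(axis=0)
    w, V = np.linalg.eigh(np.cov(X, rowvar=False))
    sqrt_cov = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return lambda z: mu + sqrt_cov @ z

g1, g2 = gaussian_sampler(half1), gaussian_sampler(half2)
z = rng.normal(size=64)                   # one shared latent, as in [15]
x1, x2 = g1(z), g2(z)
print(np.linalg.norm(x1 - x2) / np.linalg.norm(x1))  # small: near-identical samples
```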

As you pointed out, memorizing diffusion models also reproduce the first and second-order statistics of their training set. As they generalize, they move away from this strategy, and it is interesting that they become closer (but not equal) to the Gaussian/linear denoisers. In a way, this is natural: as the entropy (diversity) of the generative model increases while still capturing the first and second-order moments of the data, it becomes more similar to the maximum-entropy Gaussian model.

I thank the authors for pointing out some of the more subtle points of their work. I hope that this discussion may help the authors make these points clearer in the paper. As a result of this discussion, I have decided to increase my score. This paper makes interesting contributions to our understanding of generalization in diffusion models and should be accepted.

Comment

We sincerely appreciate the insightful comments and suggestions provided by the reviewer as they have been extremely helpful in improving the submitted paper. We will revise our paper based on our discussion.

Sincerely,

Authors

Review (Rating: 4)

The paper investigates the generalization properties of diffusion models by examining the learned score functions, which are denoisers trained on various noise levels. It shows that nonlinear diffusion denoisers exhibit linearity when the model can generalize, leading to the idea of distilling these nonlinear mappings into linear models. The findings suggest that these linear denoisers closely align with optimal denoisers for a multivariate Gaussian distribution, indicating an inductive bias towards capturing Gaussian structures in the training data. This bias becomes more pronounced with smaller model sizes and might provide insights into the strong generalizability seen in real-world diffusion models.

Strengths

  1. The paper is well-motivated and written.
  2. The Gaussian structure hypothesis for generalization in diffusion models is interesting.

Weaknesses

  • Although the paper's focus is on generalization in diffusion models, it lacks any quantitative measure of generalization within the experiments presented.
  • Similarly, no quantitative measure of memorization is reported.
  • The work primarily investigates the linearity of the learned score functions across various noise levels. However, the connection between these experiments and the necessity of linearity for generalization in diffusion models is unclear and not well substantiated.
  • The term "inductive bias" is frequently used throughout the paper, but it is never clearly defined. It remains ambiguous whether this bias pertains to the model architecture, the parameterization of the forward diffusion process, or the denoising score matching loss.
  • The paper lacks detailed information on the experimental setup, including training procedures and the hyperparameters of the architectures used.

Questions

  1. Can the authors explain how from their set of experiments we can conclude that the emergence of Gaussian structure leads to strong generalizability?
  2. The paper shows how the linear (distilled) model resembles the optimal denoiser for a multivariate Gaussian distribution. Would the same phenomena be observed if we would linearize the trained model instead of distilling it? Have the authors conducted any experiments in this direction?
  3. Could the authors provide a more precise definition of inductive bias in their work? Specifically, is it a property of the architecture, the input-output mapping parameterized by the network, the forward diffusion process, or the denoising score matching loss?
  4. The authors refer to edm-ve, edm-vp, and edm-adm as different "model architectures". Could they clarify this terminology? My understanding is that all these models use a U-net based architecture to parameterize the score function, and the difference lies in the parameterization of the forward process SDE.
  5. Do VE and VP stand for variance exploding and variance preserving? How do they differ from ADM?
  6. Can the authors provide experiments that measure generalization compared to memorization in their setup?
  7. In Figure 1, the authors show that when the model is memorizing (yellow dashed curve), the score function is not approximately linear (at least for some noise levels). However, there is no evidence reported that the yellow curve corresponds to memorization compared to the other curves. Could the authors provide empirical evidence to support this?
  8. Can the authors provide details of their experimental setup, including training and architectures?
  9. Could the authors better discuss the comparison with [1] in more detail? They claim, "We propose an alternative explanation: the networks capture certain common structural features inherent across the non-overlapping datasets". How is this claim in contrast with the hypothesis that due to the inductive bias of the neural network, the score and therefore the density are learned? I fail to understand how learning "common structural features" differs from learning the density. Could the authors elaborate on this?

Limitations

  • The description of the experimental setup lacks essential details, making it difficult to replicate the results.
  • The experiments exclusively focus on the RMSE as the performance metric across various noise scales. Alternative performance measures are neither reported nor discussed.
Author Response

Q1. Although the paper's focus is on generalization in diffusion models, it lacks any quantitative measure of generalization within the experiments presented.

A1: Memorization and generalization can be clearly observed in our experimental results by comparing the generated images with their nearest neighbors (NN) in the training dataset (see Figures 6 to 9). To measure memorization and generalization quantitatively, we can empirically define the generalization score as follows:

$$\text{GL Score} := \frac{1}{k}\sum_{i=1}^k\frac{\|x_i-\text{NN}_Y(x_i)\|_2}{\|x_i\|_2}$$

where $x_1, x_2, \dots, x_k$ are images sampled from the diffusion model, and $Y := [y_1, y_2, \dots, y_N]$ denotes the training dataset. We used this metric to assess generalization versus memorization, with the results presented in Figure 1 of our global response. As shown in Figure 1 (d) to (g), diffusion models start to exhibit memorization behavior when the number of training images becomes smaller than 8750 (GL Score around 0.5). Therefore we choose 0.6 as the threshold to distinguish between generalization and memorization. As shown in Figure 1 (a) and (b), the diffusion denoisers exhibit increasing linearity as the diffusion models shift from memorization to generalization.
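A minimal sketch of this metric (our own illustration; names and shapes are hypothetical):

```python
import numpy as np

def gl_score(samples: np.ndarray, train: np.ndarray) -> float:
    """samples: (k, d) generated images; train: (N, d) training images."""
    total = 0.0
    for x in samples:
        dists = np.linalg.norm(train - x, axis=1)  # ||x_i - y_j||_2 for all j
        total += dists.min() / np.linalg.norm(x)   # ||x_i - NN_Y(x_i)|| / ||x_i||
    return total / len(samples)
```

Scores near 0 indicate memorization (each sample sits on a training image); we use 0.6 as the generalization threshold as described above.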

Q2. The work primarily investigates the linearity of the learned score functions across various noise levels. However, the connection between these experiments and the necessity of linearity for generalization in diffusion models is unclear and not well substantiated.

A2. Please refer to A.3. of our global response.

Q3. The term "inductive bias" is frequently used throughout the paper, but it is never clearly defined.

A3: Here we give a precise definition of the inductive bias in our work: "When the model size is relatively small compared to the training dataset size, training diffusion models with the score matching loss (Equation (3)) results in diffusion denoisers that behave similarly to the linear Gaussian denoisers (though with a certain amount of difference, especially in the intermediate noise variance region). Furthermore, even when the model is overparameterized, such similarity emerges in the early training epochs." Our finding is consistent across various architectures including VE, VP, and ADM (see Figure 13).

Though in the paper we mainly test the EDM configuration, which specifies a specially designed forward process and denoiser parameterization (Equation (9)), we expect the inductive bias to manifest for other forward processes and parameterizations as well, since recent work [18] shows that diffusion models trained with different forward processes and parameterizations generate almost identical images starting from the same random seed, which suggests that even in those cases the diffusion denoisers share similar function mappings.

Q4. The paper lacks detailed information on the experimental setup, including training procedures and the hyperparameters of the architectures used.

A4: In the revision, we will include more experimental details (e.g., the model architectures, hyperparameters, training procedures) in our paper for reproducibility of our results.

Q5. Can the authors explain how from their set of experiments we can conclude that the emergence of Gaussian structure leads to strong generalizability?

A5: Firstly, based upon our studies in Sections 3 and 4, we find a strong correlation between generalization and Gaussian structures. Furthermore, in Section 5 we show that decreasing the model scale and early stopping are able to prompt strong generalization. Remember that these two actions prompt the emergence of the Gaussian structure.

Q6. Could the authors better discuss the comparison with [15] in more detail? I fail to understand how learning "common structural features" differs from learning the density.

A6: Our work offers complementary explanations based on the common Gaussian structures of non-overlapping datasets and the inductive bias towards these structures. In contrast, [15] hypothesized that the models learn the same underlying distribution. Notice that learning the ground-truth distribution is only a sufficient condition, not a necessity, for strong generalization. Just because two models produce the same images does not mean they learn the underlying distribution; it only implies that certain common structures of the two datasets are captured, and such common structures might not be optimal. Our experiments show that such a common structure is highly related to the Gaussian structure.

Q7. The paper shows how the linear (distilled) model resembles the optimal denoiser for a multivariate Gaussian distribution. Would the same phenomena be observed if we would linearize the trained model instead of distilling it?

A7: Yes, our experimental results show that directly training a linear diffusion model with the denoising score matching loss results in the Gaussian denoisers. However, this is not the main point of the paper, since what is interesting is that diffusion models behave closely to the Gaussian denoisers even without this explicit linear constraint.

Q8. The authors refer to edm-ve, edm-vp, and edm-adm as different "model architectures". Could they clarify this terminology?

A8: The EDM paper [4] proposes a novel diffusion configuration (a specially designed forward process, time schedule, and network parameterization). This configuration can be adapted to various network architectures. EDM-VE [3], EDM-VP [2], and EDM-ADM [23] are all trained with the EDM configuration but with different network architectures. Here VE stands for the architecture proposed in [3], VP for the architecture proposed in [2], and ADM for the architecture proposed in [23]. We refer the reviewer to the EDM paper [4] for more detail.

Comment

Dear Reviewer 6yNJ,

We have tried our best to address your concerns in our response. Since the deadline for the discussion period is approaching, we'd like to know if you have further questions so that we can do our best to respond further. Please feel free to raise any questions. Thanks for your insightful feedback.

Due to the space limitation of the rebuttal, we could not add the experimental setup of our paper. Here, we include it below for your reference:

Here we provide a more detailed description of our experiment setup:

  • Section 3: we train linear models to distill the actual diffusion models (including EDM-VE, EDM-VP, EDM-ADM). The actual diffusion models are trained on the FFHQ dataset (70000 images in total) for around 1000 epochs. The details of the training of the linear models are in Appendix B. We use the default hyperparameters (including learning rate and detailed network parameters) provided in the EDM code base.

  • Section 4: we study the impact of (i) dataset size, (ii) model scale, and (iii) training time on diffusion models' generalizability. For Figure 6, we train the same models on datasets of various sizes. The datasets are randomly sampled from the FFHQ dataset. The diffusion model is trained with the EDM-VE configuration. All models are trained for around 2 days to ensure convergence. For Figure 7, we fix the dataset size at 1094. We still use the same EDM-VE training configuration but vary the model scales [4,8,16,32,64,128,128]. For Figure 8, we train a diffusion model with the EDM-VE (scale 128) configuration on the FFHQ dataset with 1094 images. We early-stop the model at various epochs specified in the paper.

  • Section 5: we study strong generalizability. We randomly split the FFHQ dataset into non-overlapping datasets of sizes 35000 and 1094. All models are trained with EDM-VE configurations.

In the revision, we will make sure to include those experimental details (e.g., the model architectures, hyperparameters, training procedures) in our paper for the reproducibility of our results. We will make our code public upon publication to ensure reproducibility.

Best,

Authors

Comment

Dear Reviewer 6yNJ,

We have worked diligently to address your concerns.

As the rebuttal period draws to a close, please feel free to reach out if you have any last-minute questions or need further clarification.

Best regards,

The Authors

Comment

Thank you for the provided clarifications and additional experiments. However, I still have the following concerns:

Inductive Bias Argument: Is it accurate to say that learning the Gaussian structure in this context corresponds to learning the first and second moments of the data distribution? If so, is the primary conclusion that models succeed in generalizing when they learn the mean and covariance, as opposed to merely memorizing and acting like a dictionary? If this is the correct interpretation, could the authors elaborate on why this finding is surprising? Additionally, why is learning the first two moments considered a bias, given that it doesn’t preclude learning higher moments?

Use of RMSE: Several empirical results in the paper (Figures 2, 6, 7, and 8) rely on RMSE measurements, which are not normalized. My concern is how these figures objectively support the claims, considering that differences in output scales of the denoisers are not accounted for. Could the authors explain their choice of RMSE and discuss why it does not affect the validity of their conclusions?

Novelty in Theorem 1: Could the authors clarify the novelty of Theorem 1? There seems to be a lack of discussion that relates this theorem to previous works in denoising and Bayes estimator.

Gaussian Structure and Strong Generalizability: The key achievement of diffusion models is not just avoiding memorization but their ability to generate high-quality images. I believe this goes beyond what the "inductive bias hypothesis" can explain. Specifically, I disagree with the claim that "strong generalizability can be achieved with considerably small training datasets," as stated in contrast to [15], which asserts that strong generalizability requires larger datasets (more than 10^5 images). A more accurate claim might be that the covariance of the data can be learned from smaller datasets, nevertheless generating high-quality images and therefore strong generalizability, as defined in [15], still necessitates larger datasets because it is not possible with only learning the first two moments. Could the authors clarify this point?

Model Linearization: Regarding my previous question on linearizing trained models, I was referring to performing a first-order Taylor expansion on the trained models, as discussed in [1].

[1] Ortiz-Jiménez, Guillermo, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. "What can linearized neural networks actually say about generalization?".

Comment

Thanks for your reply; we are happy to further address your concerns.

Q1. Inductive bias

As we've stated in our last response (please see A.3), a more precise interpretation of the inductive bias in this paper should be:

When the model size is relatively small compared to the training dataset size, training diffusion models with the score matching loss (Equation (3)) results in diffusion denoisers that behave similarly to the linear Gaussian denoisers (though with a certain amount of difference, especially in the intermediate noise variance region). Furthermore, even when the model is overparameterized, such similarity emerges in the early training epochs.

Please notice that our findings don't just show that the diffusion models capture the first- and second-order statistics; more importantly, when diffusion models transition from memorization to generalization, the corresponding diffusion denoisers progressively get closer to the Gaussian denoisers. We want to emphasize that capturing first- and second-order statistics does not imply the latter. Consider diffusion models in the memorization regime: they also capture the first- and second-order statistics of the training dataset, since they can perfectly reproduce the training data; however, in this case the diffusion denoisers are not close to the Gaussian denoisers.

Q2. Why is this finding surprising and considered a bias?

Remember that we are training diffusion models on a finite number of training data. The optimal solution to the training objective (Equation 3) in this scenario has the form of Equation 8, which has no generalizability. On the contrary, our findings show that in practice, when the model capacity is relatively small compared to the training dataset size, training diffusion models with the score matching objective (Equation 3) leads to diffusion denoisers that share high similarity with the Gaussian denoisers. Furthermore, we also show that even when the model capacity is sufficiently large, such similarity emerges in the early training epochs. We consider this behavior a bias since diffusion models in the generalization regime are biased towards learning denoisers similar to the Gaussian models. Our findings are interesting since they are not well understood in the current literature.

We encourage the reviewer to take a look at our discussion with Reviewer N1bC, who asked questions similar to Q1 and Q2. We hope the discussion there can help the reviewer understand our paper better.

Q3. The use of RMSE.

We use RMSE because it is able to accurately characterize the difference between the Gaussian denoisers and the actual diffusion denoisers. To address your concerns about normalization, we have conducted experiments using normalized MSE (written out below), where the trends in Figures 2, 6, 7, and 8 remain the same and do not change our conclusions. This is because the pixel values of the outputs of the diffusion denoisers for different time steps consistently lie in the range $[-1,1]$.

In the final version, we will definitely include these results and clarification as suggested.
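Concretely, the normalization follows the reviewer's suggestion (our notation): the RMSE between the two denoisers is divided by the RMS norm of the network denoiser output,

$$\text{NRMSE}(\sigma) = \frac{\sqrt{\mathbb{E}\,\|D_\theta(x,\sigma) - D_{\mathcal{G}}(x,\sigma)\|_2^2}}{\sqrt{\mathbb{E}\,\|D_\theta(x,\sigma)\|_2^2}},$$

where $D_\theta$ is the trained diffusion denoiser and $D_{\mathcal{G}}$ the Gaussian denoiser.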

Q4. Novelty in Theorem 1.

Although we agree that Theorem 1 is well studied (we will add citations as also suggested by Reviewer N1bC), we state the result mainly to establish the connection between the linear denoisers and the Gaussian denoisers, which is not our main contribution. Our main novelty and contributions instead amount to:

(i) Establishing the inductive bias that diffusion models exhibit emerging linearity as they transition from memorization to generalization, with the corresponding diffusion denoisers becoming progressively closer to the Gaussian denoisers (Section 3).

(ii) Showing that such inductive bias is governed by the relative model capacity compared to the dataset size. Furthermore, in the overparameterized setting, such inductive bias emerges in early training epochs.

(iii) Showing that generalization of diffusion models can happen with a small training dataset size.

Comment

Q5. Gaussian Structure and Strong Generalizability

First of all, we agree with the reviewer that generating high-quality images is important. However, we believe that our experimental results, which show that diffusion models trained on non-overlapping datasets generate nearly identical images (Figure 9 (c)), also constitute generalization. As we've discussed in our last response (please see Q6), strong generalization happens when the diffusion models capture certain common information between the two non-overlapping datasets. Such common information might be the underlying ground-truth image distribution, or it might be something else; we don't know yet from the current literature. However, please notice that the generated images of diffusion models trained on 1094 (Figure 9 (c)) and 35000 (Figure 9 (a), bottom) images are highly similar. This high similarity indicates that much of the information in the large dataset (beyond first- and second-order statistics), which is essential for generalization, is already present in the smaller dataset. Our experiments in Section 5 are meaningful since we demonstrate that we can exploit this important structural information in a small dataset by either using a small model or applying early stopping. As we've demonstrated in Section 4, applying these two actions prompts the diffusion models to learn similar function mappings as the corresponding Gaussian models, which indicates that the Gaussian inductive bias is important for the emergence of strong generalizability.

From a function-mapping perspective, in the strong generalization regime, the high similarity between the generated images of diffusion models trained on 1094 and 35000 images indicates that their diffusion denoisers share high similarity. But what are these function mappings? According to our experiments in Section 4, the function mappings of these models are also close to the Gaussian denoisers. This again implies that the Gaussian inductive bias plays an important role in the strong generalization of diffusion models. Notice that the Gaussian denoisers generate images that share similar structures with those generated by actual diffusion models in the strong generalization regime.

However, we agree with the reviewer that a larger dataset contains more information, which is essential for generating images of higher quality. The differences among the diffusion denoisers (1094, 35000) and the Gaussian denoisers play an important role in generating finer image structures. We really appreciate the reviewer pointing this out and will emphasize this in our revision. Nevertheless, our main message in Section 5 is to show that we can prompt generalization on small datasets by steering the diffusion denoisers closer to the Gaussian denoisers.

Q6. Model Linearization.

The reference provided by the reviewer performs a Taylor expansion, which can only approximate a function locally. For this reason, it does not serve our purpose, since in this work we aim to find a linear model that approximates the nonlinear deep network globally, i.e., we want to approximate the input-output mapping of the deep network well for inputs that are widely spread. We will cite the paper and carefully discuss it.

Comment

Please let us know if you have any more questions.

Sincerely,

Authors

Comment

Thank you for the last clarifications; they helped me better understand your work. I have updated my score accordingly. However, I remain unconvinced that the hidden Gaussian structure is the key to understanding generalization in diffusion models. Specifically, I believe that the similarity observed in the case of human faces may be more attributable to the fact that this particular dataset is well approximated by the first and second moments alone. Additionally, I think that the middle range of noise levels, where the diffusion process deviates further from linearity, might play a significant role in generalization, which warrants further investigation.

Review (Rating: 7)

The work analyzes the behavior of a diffusion-based generative model from the perspective of "Gaussian structures". In particular, it checks the linearity of scores at single time steps and compares the scores against a Gaussian model. The analysis shows that the Gaussian structure plays a main role in image generation.

Strengths

The work has compared diffusion-based generative models against a Gaussian model. It has discovered convincing evidence that linear score plays an important role in image generation.

It has done extensive experiments with different configurations. By controlling irrelevant variables, the study highlights factors that affect generalization. In particular, it discovers that the linear score mimicking a Gaussian model plays an important role.

Weaknesses

The study is only limited to a few generative models and a face dataset. The observation might be different on different datasets. The face dataset has a more obvious Gaussian distribution. For example, the mean of all faces is a face, and correlations between pixels are relatively consistent. Therefore, a Gaussian distribution is somewhat reasonable for such a dataset. It is unknown whether a Gaussian distribution is still a reasonable choice for a set of diverse images. In such a case, does the diffusion model still need to mimic the Gaussian model?

The current analysis still could not explain all observations. For example, the behavior in noise levels between 20 and 80 is not well explained in Figure 6. These observations might be worth more in-depth discussion.

Questions

Have you considered using standard Gaussian distribution as the convergent distribution of the diffusion process?

Limitations

The scope of the study could be wider.

Author Response

Q1. The study is only limited to a few generative models and a face dataset. The observation might be different on different datasets. The face dataset has a more obvious Gaussian distribution. For example, the mean of all faces is a face, and correlations between pixels are relatively consistent. Therefore, a Gaussian distribution is somewhat reasonable for such a dataset. It is unknown whether a Gaussian distribution is still a reasonable choice for a set of diverse images. In such a case, does the diffusion model still need to mimic the Gaussian model?

A1: Although the Gaussian structure is most evident on face images, the same linear phenomenon persists in other datasets as well. As shown in Figure 15 in the appendix, for more complicated datasets such as AFHQ and LSUN-Churches, there is still a clear similarity between the generated images from the diffusion models and those from the Gaussian model. Furthermore, we conducted extra experiments on Cifar-10 during the rebuttal, with results in Figure 2 (b) of our newly uploaded PDF. We can observe that the generated images for the Cifar-10 dataset exhibit similarity with the Gaussian models.

Q2. The current analysis still could not explain all observations. For example, the behavior in noise levels between 20 and 80 is not well explained in Figure 6. These observations might be worth more in-depth discussion.

A2: For the noise variance region between 20 and 80, the behavior of the learned diffusion denoisers is relatively stable. From Figure 6, we see that the difference between the diffusion denoisers and the Gaussian denoisers in the high-noise region does not change as much as in the intermediate noise variance region when varying the dataset size. Furthermore, from Figure 7 we see that the difference between the diffusion denoisers and the Gaussian denoisers seems to decrease consistently as the model scale increases. These observations indicate that with sufficient training, the diffusion denoisers for the high noise variance region converge to the Gaussian denoisers without overfitting. This might be attributed to the fact that for the high noise variance region, the Gaussian denoisers are approximately the global minimizer of the denoising score matching objective under the finite training dataset assumption (see Equation (5)); this is proved in [19]. Therefore, we should expect that with a large enough model capacity, deep networks converge to such a minimizer. Since it is a global minimizer, we observe less overfitting. A worked equation illustrating this high-noise limit is given below.
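For intuition, a standard computation (our notation, not taken from the paper): under the empirical (finite) training distribution over images $\{y_i\}_{i=1}^N$, the optimal MMSE denoiser is a softmax-weighted average of the training images, and as $\sigma$ grows the weights flatten, so it approaches the training mean, matching the large-$\sigma$ limit of the Gaussian denoiser:

$$D^*(x,\sigma) = \frac{\sum_{i=1}^{N} y_i \exp\left(-\|x - y_i\|_2^2 / 2\sigma^2\right)}{\sum_{i=1}^{N} \exp\left(-\|x - y_i\|_2^2 / 2\sigma^2\right)} \;\longrightarrow\; \frac{1}{N}\sum_{i=1}^{N} y_i \quad (\sigma \to \infty).$$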

We will include this discussion in our revised paper and make those points clear. Nevertheless, this region doesn’t influence the final generated images as much as the intermediate noise regions, which is why we mainly focus on the discussion of the intermediate noise region.

Q3. Have you considered using standard Gaussian distribution as the convergent distribution of the diffusion process?

A3: We start the sampling process from a zero-mean Gaussian distribution with std = 80. Since the images lie in the range $[-1,1]$, the noise magnitude is much larger than the signal, which means the convergent distribution of the forward process is roughly a zero-mean isotropic Gaussian with std equal to 80.

We thank the reviewer for the insightful comments; please let us know if you have further questions.

Comment

I have checked other reviews and your responses. I have no more questions for now.

Comment

Thanks for taking the time to go through our work and provide insightful comments!

Author Response

To all Reviewers:

We thank all the reviewers for their insightful and constructive comments. Most of the reviewers find our work well written (6yNJ, VCtP, N1bC), well-motivated (6yNJ, N1bC), interesting and novel (6yNJ, N1bC, VCtP) with convincing evidence (VCtP).

We summarize our main findings as follows:

Inductive bias towards Gaussian structures. Here, through linear distillation, we demonstrated that when the model generalizes, practically learned diffusion models exhibit an inductive bias toward Gaussian/linear structures without any explicit linear constraints. In contrast, the linear/Gaussian structures are less prominent in the memorization regime. Furthermore, we study how training dataset size, model capacity, and training time affect this inductive bias. These insights are neither obvious nor well understood in previous works.

In this response, we address the common concerns and questions raised by most reviewers, and we will include the revisions in our manuscript. Unless otherwise specified, all reference numbers refer to those in the bibliography of the NeurIPS submission.

Q1. Does the phenomenon appear on more complex datasets?

A1. Although the Gaussian structure is most evident on face images, the same linear phenomenon persists in other datasets as well. As shown in Figure 15 in the appendix, for more complicated datasets such as AFHQ, LSUN-Churches, and Cifar-10 (Figure 2 (b) of our newly uploaded PDF), there is still a clear similarity between the generated images from the diffusion models and those from the Gaussian model. This similarity implies that though natural image distributions are far from Gaussian, the Gaussian structure of the finite datasets plays an important role in the generation ability of diffusion models in the generalization regime.

Q2. The main limitation of this paper is that it studies the Gaussian case which is well-understood/most of section 3 is rather obvious for people with a signal/image processing background.

A2. Our work is not studying the Gaussian case itself, but the relationship between generalization and the Gaussian structures. We demonstrate that in the generalization regime, diffusion denoisers share high similarity with the Gaussian denoisers. This result suggests that the Gaussian structure (first- and second-order statistics) of the finite training dataset plays a vital role in the generation process.

More specifically, notice that in Section 3 our linear models are not trained using the score matching loss; rather, they are trained to regress the input and output pairs of the actual diffusion networks (we made this clear in lines 150-155 of the main text and also in Appendix B; a sketch of this distillation procedure is given below). For this reason, the high similarity between the linear model and the Gaussian model is not trivial; it requires the deep network to be regularized. In this case, the similarity can be attributed to the fact that for diffusion models in the generalization regime, the corresponding diffusion denoisers share similar function mappings with the Gaussian denoisers (though with a certain level of difference, especially in the intermediate region). We refer the reviewer to Figure 2 (a) of our newly uploaded PDF, where we directly visualize the denoising outputs of the Gaussian model and the linear model. Notice that when the diffusion model generalizes (model scale = 4 or 8), the denoising outputs of the Gaussian model and the diffusion model are quite similar. This similarity in denoising outputs holds even if you directly input the noise to the denoisers. However, for diffusion models that memorize (scale = 64 or 128), the similarity between the Gaussian denoisers and the actual diffusion models breaks. In this case, as shown in Figure 1 (c) of our newly uploaded PDF, the linear model trained with linear distillation can no longer approximate the diffusion models as well as in the case of generalization, and the similarity between the linear model and the Gaussian model breaks as well. The reason behind the similarity between the diffusion denoisers in the generalization regime and the Gaussian denoisers is not well understood.
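For concreteness, a minimal sketch of one such linear distillation step (our own illustration; the way inputs are sampled and all names are simplifying assumptions, not the paper's code):

```python
import numpy as np

def distill_linear(D_theta, sigma, n=4096, d=64, seed=0):
    """Fit D_lin(x) = W x + b to the network denoiser at one noise level."""
    rng = np.random.default_rng(seed)
    X = sigma * rng.normal(size=(n, d))           # noisy inputs at level sigma
    Y = np.stack([D_theta(x, sigma) for x in X])  # network denoiser outputs
    Xa = np.hstack([X, np.ones((n, 1))])          # augment with a bias column
    theta, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    W, b = theta[:-1].T, theta[-1]
    return W, b
```

The fitted $(W, b)$ can then be compared against the closed-form Gaussian (Wiener) denoiser built from the training set's empirical mean and covariance.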

Q3. The work primarily investigates the linearity of the learned score functions across various noise levels. However, the connection between these experiments and the necessity of linearity for generalization in diffusion models is unclear and not well substantiated.

A3. Our work has shown strong connections between the necessity of linearity and generalization in the following aspects.

In Section 3, we demonstrate that in the generalization regime, diffusion models produce images that are highly similar to those generated by the distilled linear denoisers (as well as the Gaussian denoisers) when sampling starts from the same random noise. This means that in the generalization regime, the learned diffusion denoisers share similar function mappings with the Gaussian denoisers. In contrast, this similarity between the linear model and the Gaussian model breaks when the diffusion model memorizes, meaning that the Gaussian/linear structure is ubiquitous in the generalization regime. This shows the necessity of linearity for generalization.

In Section 4, we demonstrate that as the diffusion models transition from memorization to generalization, the similarity between the diffusion denoisers and the Gaussian denoisers increases as well. This strong correlation implies the necessity of the Gaussian structure for the diffusion model's generalizability.

In Section 5, as shown in Figure 9 of our paper, we can bring memorized models into strong generalization by either decreasing the model capacity or applying early stopping. Recall from Section 4 that these two actions essentially prompt the diffusion denoisers to get closer to the Gaussian denoisers. This once again highlights the necessity of the Gaussian structure for the generalization of diffusion models.

Final Decision

The work analyzes the behavior of a diffusion-based generative model from the perspective of hidden Gaussian structures. The work makes a novel finding, which would be useful in understanding diffusion generative models. There is still room for further explaining this finding.