On the Relation Between Linear Diffusion and Power Iteration
We analyze linear diffusion models and show that generation acts as a correlation machine, aligning noise with the data's principal components. We connect the process to power iteration, explaining the early emergence of low-frequency content and the convergence behavior of deep denoisers.
Abstract
Reviews and Discussion
The paper explores the relationship between the power iteration method and diffusion models with a linear denoiser, which can be modeled as a PCA projection, enabling analysis via the spiked covariance model. The authors demonstrate that, in the reverse process, components with high eigenvalues emerge earlier, and the leading eigenvectors align with the data distribution.
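The spiked covariance setting invoked here is easy to simulate: plant a few strong directions in isotropic noise and check that the leading empirical eigenvectors align with them. A minimal NumPy sketch of this sanity check (dimensions, spike strengths, and variable names are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Spiked covariance model: a few strong planted directions plus isotropic noise.
n, d = 2000, 100
spikes = np.array([10.0, 5.0, 2.0])               # planted spike strengths (illustrative)
U = np.linalg.qr(rng.standard_normal((d, 3)))[0]  # orthonormal planted directions

# Samples with covariance U diag(spikes) U^T + I.
X = (rng.standard_normal((n, 3)) * np.sqrt(spikes)) @ U.T + rng.standard_normal((n, d))

# The leading empirical eigenvectors align with the planted directions.
eigvecs = np.linalg.eigh(X.T @ X / n)[1][:, ::-1]  # reorder to descending eigenvalues
for k in range(3):
    print(abs(eigvecs[:, k] @ U[:, k]))            # close to 1 for strong spikes
```

Shrinking the spike strengths toward the noise level degrades this alignment, which is the finite-sample effect (Nadler, 2008) that the paper's analysis builds on.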
Strengths
The paper provides a novel perspective on diffusion models for understanding the generation process by drawing parallels with power iteration. The work opens up new research directions for improving the interpretability and performance of generative models by highlighting the eigenvector alignment properties in diffusion models. The paper is well-written with most claims justified theoretically and backed by empirical evidence.
Weaknesses
The paper does a good job of conveying the idea and intuition through the simulations; however, the proof of the main result (Theorem 4.3) could be refined. In particular, I cannot fully understand the proof in the high-noise regime. A more explicit explanation of the following statement would enhance the rigor: how exactly does Assumption 4.2 guarantee that the operator is diagonal just enough not to spoil the diagonality of the next partial operator? Can the authors justify this mathematically?
Questions
- Most experiments for the linear case are done on MNIST. Could the authors clarify to what extent the assumptions made in the spiked covariance model apply to real-world datasets? For example, in cases where the low-dimensional subspace assumption does not hold.
- Moreover, since power iteration is such a major theme of the paper, a brief discussion of the power iteration method in the introduction would enhance readability.
- One of the terms is never defined. Moreover, a consistent notation keeping time in the superscript and the index in the subscript (or vice versa) for both vectors and matrices would help avoid confusion.
Thank you for your review. Following the general remark, we have added an explicit mathematical explanation of our statement in the high-noise regime to justify the convergence of the total projection operator.
Answering your specific questions:
- The low-dimensional subspace prior is very common for natural images and has been considered in many previous works, both in the context of diffusion models (which we already cite in the main text) and more generally in computer vision; we added references to further justify this model. Of course, more realistic datasets carry more information and span a larger subspace, which may itself be a collection of smaller subspaces.
- We added a short paragraph on the power iteration method to the introduction of the main text, with further details in an appendix for ease of reading.
- We added a clarification of the notation used throughout the paper, together with the previously missing definition.
I thank the authors for addressing my concerns. I will keep my positive score.
This paper explores the diffusion reverse sampling process by analyzing a linear diffusion model. The main result is that the reverse sampling process leads to samples that concentrate on the first eigenvector of the covariance matrix of the underlying data.
Strengths
The paper has a clear structure and is easy to read. The topic is also interesting, since studying the evolution of denoisers along the diffusion trajectory is important yet underexplored in the literature.
Weaknesses
I have doubts about both the theoretical and practical aspects of this work. The theoretical results do not seem correct (correct me if I am wrong), and the connection to practice is weak. I list my questions below:
Major questions:
(i) Can the authors explain how they conclude Theorem 4.3? The authors claim that the final projection operator is a diagonal matrix with a spectrum concentrated around the first eigenvalue (lines 409-412), and they support this claim qualitatively in Figure 5. However, I cannot find a closed-form expression for this projection matrix in the stated limit. From (24) I would assume
$
\mathbb{E}\mathcal{P}_{\tau}=U_0\, c_0^{\tau}\,\text{diag}\big[1, (\tfrac{c_1}{c_0})^{\tau},\dots\big]\, U_0^T,
$
, which does not really hold, since equation (20) only holds for small t. If we assume this expression holds, we have
$
\mathbb{E}\, x_g x_g^T=\Big(U_0\, c_0^{2\tau}\,\text{diag}\big[1, (\tfrac{c_1}{c_0})^{\tau},\dots\big]\Big)\,\mathbb{E}\big[\epsilon_T\epsilon_T^T\big]\,\Big(\text{diag}\big[1, (\tfrac{c_1}{c_0})^{\tau},\dots\big]\, U_0^T\Big).
$
By the assumption on the diffusion stepsize and the total number of steps (lines 360-361), the leading factor vanishes as the number of steps grows. Then how does this term equal the expression in equation (17)? Shouldn't all eigenvectors go to zero in this limit?
(ii) In Section 5, the authors claim that the correlation of the low indices (lower frequencies) withstands higher noise levels. However, Figure 7 shows that many of the dark curves remain flat over a wide range of time steps and only drop when t approaches 0, which is very different from the linear case (Figure 3), where the sine angles evolve gradually. Therefore, I disagree that the decaying behavior of the nonlinear model is similar to that of the linear case.
Minor questions: (i) Figure 7 is very difficult to read, since all the useful information is concentrated near t = 0. I suggest the authors redraw the plot using a log scale.
(ii) The paper could benefit from putting more details on the theoretical results in an appendix. I am not sure whether the authors are allowed to add one.
Questions
I'd like to summarize my final assessment of this paper here:
- First of all, the authors argue that linear diffusion models generate samples that concentrate on the first eigenvector of the covariance matrix. However, the diffusion models considered are not trained with the standard denoising score matching loss; rather, they are trained with a sequential denoising loss that is very different from that of the standard diffusion model. Besides the training loss, the reverse diffusion process considered in this paper is also not the standard reverse diffusion process. Due to these differences between the model considered in the paper and the standard diffusion model (which are not clearly explained in the paper), the conclusions of this paper can be misleading to the general public. The theorem states that linear diffusion models can only generate images that concentrate on the first eigenvector, thus limiting diversity. However, as I have shown in my comment, the optimal linear diffusion models trained with the standard denoising score matching loss are equivalent to the score function of noisy Gaussian distributions. Therefore, standard reverse sampling using the optimal linear diffusion models is equivalent to sampling from a Gaussian distribution, which can generate samples with great diversity. I urge the authors to carefully address this discrepancy between standard diffusion and the model they consider, to avoid misleading readers.
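The equivalence the reviewer invokes, that the optimal linear denoiser under the standard denoising loss on Gaussian data coincides with the score of the noisy Gaussian marginal, can be checked directly via Tweedie's formula. A self-contained sketch (the covariance and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Gaussian data x ~ N(0, Sigma); noisy observation y = x + sigma * eps.
d, sigma = 5, 0.7
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d  # an arbitrary positive-definite data covariance

# MMSE linear denoiser (Wiener filter): D(y) = Sigma (Sigma + sigma^2 I)^{-1} y.
wiener = Sigma @ np.linalg.inv(Sigma + sigma**2 * np.eye(d))

# Tweedie: D(y) = y + sigma^2 * score(y), with the score of N(0, Sigma + sigma^2 I).
tweedie = np.eye(d) - sigma**2 * np.linalg.inv(Sigma + sigma**2 * np.eye(d))

print(np.allclose(wiener, tweedie))  # the two linear maps coincide
```

Since the denoiser equals the Gaussian score, running the standard reverse process with it amounts to sampling from a Gaussian, which is the diversity argument made above.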
- Second, in Section 5 of the paper, the authors investigate the Jacobian and state that the linear diffusion model is similar to the nonlinear diffusion model. However, while there is indeed a certain similarity between the linear case and the nonlinear case, a significant amount of dissimilarity also exists but is ignored by the authors without any discussion. The implication of this section is also unclear. I expect the authors to discuss the properties of the final samples generated by nonlinear diffusion models with some mathematical formulation. Please note that for a linear model, the Jacobian fully explains the network behavior, since the Jacobian is the linear weight itself. For nonlinear models, however, the Jacobian only captures how the network's output changes with respect to the input in a local region. To investigate the properties of the final samples, we need to study the network output itself.
- Third, some assumptions made in this paper are not valid. For example, the authors cannot make the noise-schedule assumption in line 354, under which more sampling steps result in a smaller amount of noise at the end stage of the diffusion forward process. However, the noise level at the end of the forward diffusion should be sufficiently large so that the noisy distribution at the final time step is approximately a zero-mean Gaussian. If this is violated, the reverse sampling process starting from random Gaussian noise cannot recover the clean distribution. Though the authors argue in the comments that they use a valid schedule in the numerical experiments, I believe they should derive their theorem from a valid schedule as well. Unfortunately, since the theorem derived in the paper explicitly depends on this schedule, the terminal noise level will be extremely small, directly contradicting the basic assumption of diffusion models; the theorem is therefore less meaningful.
In conclusion, I suggest rejection of the paper, and I believe a major revision should be made before the next submission.
Thank you for your review. The main point of our proof is that as long as the elements on the diagonal are ordered, the total product converges to the first eigenvector. We believe this point answers both of your major questions. We elaborate next:
(i) It is true that all the eigenvectors go to zero, but since we are interested in the direction of the final vector (and not its norm), the final sample is aligned with the first eigenvector, where the ratio between the largest and second-largest eigenvalues determines the rate of convergence. This is exactly the same as the convergence process in the classic power iteration method, which we added in an appendix for ease of reading. If we also want to consider the norm of the final vector, the first eigenvector will always be the dominant part, no matter how small the total norm is. In addition, Figure 5 explicitly shows the convergence to the first eigenvector (and specifically not to zero). Following your comment, and to make this clearer, we added more details to the proof in the main text, including a closed-form expression that covers the large-noise regime. We hope it is easier to follow now.
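The direction-versus-norm distinction made in (i) can be verified numerically: iterating an unnormalized linear map whose spectrum lies below 1 drives the norm to zero, yet the direction still converges to the leading eigenvector at a rate set by the eigenvalue ratio, exactly as in power iteration. A minimal sketch (the matrix and eigenvalues are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric matrix with ordered eigenvalues below 1, so every iterate shrinks.
eigvals = np.array([0.9, 0.5, 0.3, 0.1])
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag(eigvals) @ Q.T
leading = Q[:, 0]

x = rng.standard_normal(4)
for _ in range(100):
    x = A @ x  # no renormalization: the norm decays geometrically

# The norm is tiny, but the direction aligns with the leading eigenvector.
alignment = abs(x @ leading) / np.linalg.norm(x)
print(np.linalg.norm(x))  # on the order of 0.9**100: essentially zero
print(alignment)          # very close to 1
```

The alignment error shrinks like (0.5/0.9)^t in this toy example, mirroring the (c_1/c_0)^tau ratio discussed in the thread above.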
(ii) In accordance with the main idea behind our proof, the resemblance to the linear case lies in the fact that the rate of decay depends on the index. It makes sense that the overall rates are faster in the nonlinear model than in the linear case, since the nonlinear model handles more complicated datasets. Following your comment, we have added this discussion to the main text and modified the scale of Figure 7, which also addresses your minor question.
Note that we have updated the manuscript to address all the concerns. Changes are marked in blue. Please let us know if there are any remaining issues and we will be happy to answer them. If not, we would appreciate it if you would kindly consider updating the score.
Thanks for your reply. However, my questions are not well addressed.
(i) First, I have doubts about whether it is rigorous to express the projection operator in the form given in lines 394-399. As mentioned by the authors themselves (lines 338-339), this nice form depends on the small-angle assumption. Can the authors justify deriving the equation in lines 394-399 from this assumption when it no longer holds? This is a very important question, since the whole argument that the linear diffusion model generates samples that align only with the first eigenvector explicitly depends on this closed-form expression.
(ii) Furthermore, under this closed-form expression, since all the diagonal entries are within the range [-1, 1] (they are cosines of angles), the norm of the final generated images will diminish to 0 as the number of sampling steps increases. This is very inconsistent with what happens in practice, even under the linear diffusion assumption: increasing the number of sampling steps improves generation quality; it does not push the samples to zero norm. Therefore, this serves as further evidence that the assumptions are not valid, and consequently the resulting theory does not accurately align with practice (even for linear diffusion). This inconsistency can be clearly observed in Figure 6, where the authors perform numerical experiments based on the empirical covariance. According to the theory, as T keeps increasing, the generated images should have an increasingly smaller norm and at the same time become more and more concentrated on the first eigenvector. However, in Figure 6 we observe that the norms of the final generations do not show a decreasing trend. More importantly, the generated images do not concentrate only on the first eigenvector; instead, for large T, the generated samples have significant energy on the first 4 eigenvectors, not just the first. I doubt whether further increasing T will prompt the generation to concentrate on the first eigenvector, since in the current figure the final generations seem to become stable after T = 110.
(iii) I wonder why you don't theoretically study the case where the population covariance is given, i.e., the setting of Figure 6, where Assumptions 4.2 and 4.3 are not needed and the reverse sampling trajectory has a simple closed-form solution due to linearity. It is worth studying how linear diffusion models behave in this scenario. You could then train a linear diffusion model to see the differences from the case where the population covariance is given.
(iv) Lastly, the empirical experiments in Figure 7 are not convincing enough. For Section 5, if I understand correctly, the authors want to show that the convergence for lower indices (darker colors) is faster. However, the current experiments do not validate this sufficiently. As shown in Figure 7, (i) the angles of some darker curves go to zero more slowly than those of the brighter curves, and (ii) many of the very bright (yellow) curves overlap with the very dark curves, showing similar convergence rates. Therefore, the figure shows phenomena that contradict the main statement of the paper.
Thank you for the rapid response. Yet, we believe there is still some misunderstanding of our claim, and we are more than happy to clarify further. We answer all the raised questions below:
(i) The diagonal matrix in lines 394-399 does not represent the operator you mention, but rather the partial product. Notice that the remaining factor is not diagonal; in fact, we do not assume anything about its entries. This equation deals exactly with the case where the small-angle approximation breaks (this is not the nice form we discuss in lines 338-339). Moreover, we want to re-emphasize an important implication of Assumption 4.2 that might have been unclear: the times at which the approximation ceases to hold are also ordered by index, so we are guaranteed that the non-diagonal part is concentrated at the larger indices. This makes sense considering Assumption 4.1 (as the change in angle is ordered), and it is also backed by the numerical experiments in Figure 5.
(ii) In this work we analyze a denoising chain. Denoisers naturally decrease the norm of their output in order to reduce the energy contributed by the contaminating noise. To overcome this, the output of the denoiser can be rescaled, as is done in the diffusion process (where the signal is rescaled) and in the power method (in order to retain unit norm). As we mention in the paper (lines 125-126), we overlook the scaling of the final product and focus on the angular alignment of the denoiser basis, as this is the point of contact with the spiked covariance model.
Given the above, notice that there is no contradiction between our mathematical model and our experiments. Specifically, please note that the expression depicted in Figure 6 is normalized, as defined in Eq. 26. Hence, it does not contradict our theory, and it does not mean that the unscaled final output retains its original norm, as we present the normalized version of the output there. The goal of this figure is to show the effect of the parametrization on the final distribution. As we mention in the revised version of the paper, we leave the derivation of an optimal parametrization for the convergence of the linear model to future work, as it is outside the scope of this paper.
(iii) This is an interesting suggestion, which differs significantly from our perspective. It is closely related to the derivation of the optimal parametrization for the linear case discussed above. While very interesting, we believe it should be studied separately in future work.
(iv) The purpose of Section 5 is to show that the alignment of lower indices (darker colors) decays at a slower rate, i.e., their curves have a lower slope (as in the linear model, depicted in Figure 2). It is not intended to describe the convergence of the generation process. Unlike the linear model with the simple MNIST dataset, Figure 7 presents the Jacobians of a real diffusion model applied to CelebA. Thus, it is very reasonable for it to be noisier than Figure 2. Nevertheless, the generally higher slopes of the brighter colors are clearly apparent, with no contradiction to our theory and main message. On the contrary, these plots support our claims, as they show trends similar to the linear case (although noisier, which is quite expected).
We will update the manuscript to include, clarify and emphasize all these points. Thank you again for your quick response. Please let us know if there is anything else that requires further clarification.
Thanks for your response, but my questions are not addressed.
(i) I understand that the matrix is not diagonal; I abused the notation a little. But my main concern is why you can write its diagonal entries in that form when the angle is not small. Please explain this.
(ii) I understand that Figure 6 is normalized, and this figure contradicts your theorem in the following ways:
- First, according to the theory, the generated samples should concentrate on the first eigenvector as T increases. However, Figure 6 shows that there is significant energy on the first 4 eigenvectors, not just the first.
- Second, from the plot I doubt whether further increasing T will push the samples to concentrate further on the first eigenvector, since it is clear that after T = 110 the result is no longer changing, having reached a stable solution.
(iii) Since Figure 6 currently does not align with Theorem 4.1, I suggest directly studying what happens in Figure 6, which should be much easier, since you can derive the closed-form expression for the denoising chain without any extra assumption (just combine equations (10), (11), and (13)). Does this closed-form denoising chain align with your Theorem 4.1?
It is confusing to me why you make Assumptions 4.1 and 4.2 if you believe Figure 6 aligns with your theorem (I don't think they align well). Why don't you directly study the properties of the denoising chain by considering the empirical covariance?
(iv) I have no misunderstanding of Section 5 and Figure 7. What I mean by "darker curves should converge faster" is that, at a given time step, the sine angle of a darker curve should be smaller than that of a brighter curve, i.e., the low-index eigenvectors should converge earlier.
Again, Figure 7 contradicts your main statement in the following ways:
- First, many of the very dark curves (low-index eigenvectors) have a higher slope than the less dark curves, contradicting the statement that a lower index should have a lower slope.
- Second, many of the very dark curves overlap with the very bright curves (many dark curves seem to be hidden behind the bright ones), again contradicting the statement that a lower index should have a lower slope.
Overall, there is a huge gap between Section 5 and the previous sections.
Thanks for looking into this. We hope to fully address your concerns.
(i) Note that the matrix is diagonal only in its top-left block, per our assumption that the angle is still small for the leading indices; for the remaining indices the angle is large, and there we do not claim diagonality. Thus, at the top left of the matrix the small-angle approximation still holds and we can write the diagonal entries in the stated form. For the rest, the matrix is not diagonal, and our large-angle analysis covers this part.
(ii) Figure 6 is produced using the MNIST dataset. It is a real dataset, and we do not control its spectrum, which affects the convergence properties of the denoising chain. We did not try to optimize the noise schedule for the specific data we use, but rather chose a convenient formulation for brevity of the proof. Thus, the convergence requires many iterations. When adding this plot, we assumed it was enough to show that the remaining eigenvalues diminish as the iterations increase; to make it clearer, we updated the noise schedule and increased the number of iterations so that all eigenvalues except the first diminish even further. As we mention in lines 353-354, many other schedules sustain our theory. Thus, in order to show that all eigenvalues except the first indeed go to zero, we added an additional curve to Fig. 6 with a noise schedule better suited to the dataset. This configuration conforms with our theory and provides faster convergence.
(iii) As far as we understand your suggestion, the empirical closed form combining equations (10), (11), and (13) is depicted in Figure 5. It aligns very closely with the final statement of Theorem 4.1 and also illustrates the different stages of the proof. The small impurities in the final product in Figure 5 are due to the fact that the denoising did not fully converge; they can be reduced using better-suited parameters, as discussed above. Assumption 4.1 is the heart of this work: it is the connecting point to the theory of the spiked covariance model and the work of Nadler (2008). We were motivated to explore the implications of this theory for diffusion models via a linear denoising chain. Surely, there may be other approaches to studying a linear denoising chain, which may be explored in future work.
(iv) We do not deny the gap between Section 5 and the other simulations, though we do not see it as a huge gap, considering the difference between the linear model and the state-of-the-art model at stake. The complexity of the real model and datasets results in very noisy simulations. Indeed, as you mention, there is some local mis-ordering among the dark indices, and some overlap between bright and darker curves (the choice of how many Jacobian eigenvectors to show affects this overlap). However, despite the noise, we believe the overall trend is clear, which is the purpose of these results.
Thank you again for discussing the paper with us and giving us the opportunity to improve it. We really appreciate your comments, which shed light on points that were not clear and that, with your help, we believe are now clearer. Please let us know if anything else requires further clarification.
Thanks for your reply, however, my questions remain unresolved.
(i) First, I maintain my doubts about the assumptions. Second, I don't understand the meaning of the following statement: "Our takeaway from Section 5 is that (high complexity) nonlinear diffusion models consists of many low scale linearized environments, and can navigate a diverse set of linearized regions during the generation process (as illustrated in Figure 7)". The statement is quite abstract. Please make your definitions more concrete:
- Please clearly define the meaning of "local scale linearized environment" and "navigate".
- Please elaborate on how the empirical results lead to your conclusion. Right now, the reasoning is not adequate.
(ii) Third, could you please explain why the two covariance operators are not simultaneously diagonalizable? They can still be expressed in the same basis, right?
##########################################################
(iii) Next, I want to discuss the denoising chain further:
- You mentioned in lines 128-129 that "we choose to present the standard diffusion models...". The standard way of training diffusion models is to learn the score of the intermediate noisy distributions, which is exactly equivalent to the denoising problem I formulated in my last comment, not the sequential denoising problem you wrote. Furthermore, in lines 151-154 you wrote that you learn a denoiser that projects the noisy samples onto the "target" distribution, which also suggests you are learning a denoiser that predicts the clean image from the noisy image. Could you please clarify this? If you are considering a denoiser that predicts the next noise scale from a higher one, then it is not the standard diffusion model, making the paper of less interest. In Ho et al. (2020), the quantity in equation (6) represents an estimate of the clean signal, not of the next noise scale. You should write your denoising objective explicitly in the paper so that readers can see that it is not the standard diffusion model.
- My derivations in my previous comment demonstrate that linear diffusion models with the standard reverse sampling process generate samples that follow a Gaussian distribution, which contradicts your theorem. You mentioned that you still observe concentration on the first eigenvector when analyzing the denoising chain. This inconsistency implies that your denoising chain cannot recover the clean distribution from the scores of the intermediate distributions. Therefore, from my point of view, I don't find it meaningful to study this denoising chain unless it can recover the clean data distribution. Moreover, I am confused about why the authors do not study the standard diffusion model with the standard reverse sampling process; I don't see any significant difficulty in doing so.
- Please modify lines 500-505, since standard linear diffusion models are equivalent to sampling from a Gaussian distribution, which can generate diverse samples without injected noise. Alternatively, you may keep the current conclusion but explicitly indicate the differences between standard diffusion and the special variant you are studying (I am not sure whether it can be called a "diffusion model"). Otherwise, the conclusion can be misleading.
- Another related issue is that you cannot make the schedule assumption in line 354. Under this schedule, more sampling steps result in a smaller amount of noise at the end stage of the diffusion forward process. However, the noise level at the end of the forward diffusion should be sufficiently large so that the noisy distribution at the final time step is approximately a zero-mean Gaussian. If this is violated, the reverse sampling process starting from random Gaussian noise cannot recover the clean distribution.
##################################################################
After seeing the authors' last comment, I have more questions and have become more confused about this work, particularly about the differences between this work and standard diffusion models. I will maintain my score for now. I suggest the authors make a major revision to the paper before submitting it again.
Thank you for your detailed feedback and for engaging with us during the rebuttal period. We appreciate your thoughtful comments and the opportunity to respond to them in detail.
Before addressing your specific concerns, we would first like to emphasize the key contributions of the paper, which may have been missed amid the details:
a. Analyzing diffusion models in the linear case (now covering two denoising processes, following your important feedback).
b. Demonstrating that the analyzed behavior can be spotted in the final stages of the diffusion process, which both reaffirms the importance of our theory and opens avenues for future work investigating the Jacobian's behavior in both the final and initial stages.
c. Providing insights into the sample complexity of the diffusion process and into the impact of the different components of the signal learned at various stages of diffusion. Specifically, we show that the leading eigenvector is learned early, as is typical in diffusion models, where low frequencies are learned first.
We believe the above are important contributions to the community and that the points you raised do not diminish them. We will make all of this clearer in the final version.
Next, we respond to each of the points raised in your last comment, to make clear that there are no technical gaps in our paper.
(i) Our experiments in Section 5 show that diffusion models used in practice exhibit similarity to the linear case in their final steps, i.e., when they are in a very close neighborhood of the final generated sample. This is what we mean by "local scale linearized environment". Note, though, that this "local environment" is different per image and is aligned to its own linearized region; specifically, it depends on the Jacobian of the network for that image. This is why we say the nonlinear model can "navigate" between environments, unlike the linear model, where the denoiser operation does not depend on the final image. Indeed, there is also a gap between the final stages of the nonlinear case and the linear case. Yet, we find this gap relatively small, and we believe it stems from the fact that the Jacobian of the network changes between iterations. Because it changes slowly, there is a similarity to the linear behavior that we study. To us, the similarity (in the final iterations, not throughout all noise levels!) is clear, but we understand it may be perceived differently by different readers. Still, we believe it is important to show these results and emphasize the difference, as exploring this gap and our hypothesis about the Jacobian (which is partly supported by https://openreview.net/forum?id=h8GeqOxtd4) is an interesting question for future work. In the revision, we will add the above explanation, which we hope satisfies the reviewer, and emphasize the existing gap between the linear and nonlinear cases and the need for further analysis (we can no longer update the submission now).
(ii) In population, they are diagonalizable in the same basis, as you mention. However, for a finite sample, the noise perturbs the eigenvectors, as shown in Nadler (2008).
(iii) Following your previous comment, we now analyze both denoising chains. If we could upload a revised version of the paper, we would have moved the discussion of the two scenarios to the opening of the main text, to clarify this point and improve the paper. Unfortunately, we are unable to make these changes at this stage, but we will incorporate this clarification in the final version. Note, though, that there is no contradiction with our message about the composition of the generated sample under stochastic sampling (lines 408-436) in either of the studied cases (see the discussion in the appendix about their similarity). Regarding the specific noise schedule we chose to implement, it generates high enough noise in practice, as can be seen in our simulations. Indeed, many other schedules suffice for the proof, including ones that resemble the noise levels used in practice, but we chose this one for simplicity.
Once again, we greatly appreciate your careful evaluation of our work and the valuable feedback you provided. We believe that, with your feedback, the paper is now better and clearer. We hope the reviewer will reconsider their score and recognize the valuable contribution of the paper, given all the clarifications and additions we have made.
This paper studies the generation process of diffusion models. Focusing on linear models where the diffusion process is similar to performing noisy PCA, the authors show that the generation process of the linear diffusion model is similar to the power iteration method in that it converges to the leading eigenvector of the underlying data distribution.
Strengths
The authors aim to provide an analysis of the diffusion process, an important question for diffusion models. This is achieved using a simple model.
Overall, the writing is easy to follow.
Weaknesses
Providing theoretical studies for complex problems in deep learning and deep generative models is often very challenging. Thus, it is common practice to study a simpler problem that can shed light on the underlying mechanisms of more complex ones. While this work falls into this category, I found that the results for the linear diffusion model diverge significantly from the phenomena observed in real cases.
(1) Theorem 4.3 shows that the diffusion process only converges to the dominant eigenvector (the one corresponding to the largest eigenvalue). Thus, it gives a poor estimate of any true distribution with dimension larger than 1. While eq. (25) also provides a formulation when noise is injected in the intermediate steps, (i) it does not provide an estimation guarantee for the entire spectrum, and (ii) many recent diffusion samplers, such as DDIM and DPM-Solver, do not require the injection of noise in the intermediate steps.
(2) The results in Section 5 (e.g., Figure 7) on real cases only demonstrate the convergence to the entire spectrum, not the dominant eigenvector. This also highlights the gap between the analysis in Theorem 4.3 and the phenomenon observed in real cases.
(3) Initially, I thought Section 5 was intended to support the argument that the diffusion process converges to the "true distribution" as in Theorem 4.3. However, this is not the case, as Figure 7 plots the convergence of the network Jacobian evaluated at . Thus, even at , it depends on the learned network and does not represent the true distribution of the data; the convergence probably holds as long as the network is Lipschitz. In other words, the convergence does not indicate the quality of the diffusion process. For example, a network that outputs all zeros would also exhibit fast convergence. I may be misinterpreting the results; if so, I'd appreciate clarification.
Questions
(1) what are and in eq. (26)? (2) where right after eq. (26) should be . (3) line 487: what is in ? is it ?
Thank you for your review. We will first clarify the meaning of the results in Section 5 (per your points (2) and (3)), and then address the gap compared to nonlinear deterministic samplers (point (1)).
The results in Section 5 (e.g., Figure 7) do not try to describe the distribution of the generated data - each plot is calculated for a single generated example. They are intended to show that in close proximity to generated samples, the denoiser basis, i.e., the linearization of the nonlinear network, resembles the behavior of the linear denoising chain (i.e., the results in Figure 2). At t=0, the network Jacobian is aligned to the generated sample, as was shown in previous work (Kadkhodaie et al., 2024). As t grows larger, the angle w.r.t. the basis at t=0 grows, at a rate proportional to the index, as in the linear denoising chain. Thus, this section is not intended to show the convergence of a real network to the entire spectrum, but to show the resemblance of its local structure to the linear case we analyze. We have added this clarification to the main text.
Now, for the gap between the linear and nonlinear deterministic samplers (point (1)). This analysis focuses on the local behavior of the nonlinear denoiser at the end of the generation process, demonstrating its similarity to a linear denoising chain. Each plot represents a single generation path, not the overall distribution of generated outputs. While linear diffusion models are easy to analyze, they may struggle to generate complex datasets. Nonlinear models, on the other hand, can navigate a diverse set of linearized regions during the generation process (as illustrated in Figure 7). This allows them to generate diverse data even without added noise, unlike linear models, which ultimately converge in mean to a single point (Theorem 3.4) and therefore require noise injection for diverse outputs. This contrasts with some deterministic nonlinear samplers (e.g., DDIM, DPM-Solver) that do not rely on added noise. We have added this discussion to the main text. We chose to show how the final generated distribution depends on the choice of parameters, where one can control the mean of the generated spectrum (this might be a feature for some applications, such as segmentation via diffusion). Indeed, it might be interesting to derive the optimal parametrization for the convergence of the linear model - we leave this for future work, as it is not the main focus of the paper.
Answers to Minor Claims/Questions:
(1) what are and in eq. (26)?
is a generated sample, out of examples. We added this definition to the main text.
(2) should right after eq. (26) be ?
That is correct, thank you. We fixed it in the main text.
(3) line 487: what is in ? is it ?
J stands for Jacobian, as opposed to in the linear model. We follow the common practice of using the Jacobian of a nonlinear model as its local linearization. We have clarified this in the main text.
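To illustrate this common practice, here is a minimal sketch (with a hypothetical toy map f, not the actual denoiser network): we estimate the Jacobian by finite differences and check that it acts as a local linearization around a given point.

```python
import numpy as np

# Hypothetical toy nonlinear "denoiser": a single tanh layer,
# used only to illustrate local linearization via the Jacobian.
W = np.array([[0.9, 0.1],
              [0.1, 0.5]])

def f(x):
    return np.tanh(W @ x)

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian J of f at x, so that f(x + h) ≈ f(x) + J h."""
    d = x.size
    J = np.zeros((d, d))
    fx = f(x)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (f(x + e) - fx) / eps
    return J

x0 = np.array([0.3, -0.2])
J = jacobian(f, x0)

# Near x0, the Jacobian behaves like a linear denoiser:
h = 1e-3 * np.ones(2)
lin_err = np.linalg.norm(f(x0 + h) - (f(x0) + J @ h))
print(lin_err < 1e-5)                 # True: the linearization is accurate nearby
```

The error of the linear approximation is second order in the perturbation, which is why the local structure of a (smooth) nonlinear denoiser can be compared to a linear denoising chain.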
Note that we have updated the manuscript to address all the concerns. Changes are marked in blue. Please let us know if there are any remaining issues and we would be happy to answer. If not, we would appreciate it if you would kindly consider updating the score.
Thank you for the detailed responses and revision. All the questions about notation have been addressed, but my concern about the weakness remains. In particular, one of my main concerns is that Theorem 4.3 (the main result) shows that in expectation the sample converges to the dominant PCA direction, which has a large gap with practice, where the diffusion model can generate diverse samples (e.g., the results in Section 5 (e.g., Figure 7) on real cases). The response highlights that "the results in section 5 (e.g. Figure 7) do not try to describe the distribution of the generated data - each plot is calculated per a single generated example." However, this result still demonstrates my point that the diffusion model can generate diverse samples.
Very minor comments after further reading: 1. line 299, for ; 2. line 312, Assuming 4.2, 4.1 (missing Assumption)
Thank you for your response.
Our takeaway from Section 5 is that (high-complexity) nonlinear diffusion models consist of many local-scale linearized environments, which allow the diversity of the generated output. Locally, these environments resemble the linear model, but the overall nonlinear model is free to navigate between these localities and generate diverse data. Of course, we do not claim that all diffusion models converge to the leading data eigenvector, not even in mean. Unlike the linear model, the nonlinear denoiser depends on the generated image (as we mention in line 481), so the convergence of the mean to the leading Jacobian eigenvector does not imply reduced diversity. It might be interesting (and challenging) to inspect the statistics of trajectories that converge to the same linearized environment in a nonlinear diffusion model. However, this significantly differs from the scope of this paper and might be studied in future work. Please let us know if anything else remains unclear in the paper.
This paper presents a theoretical study of diffusion models. The main result is the establishment of a connection between power iteration and linear diffusion. It is claimed that this result provides insight into the theoretical understanding of the generation process of diffusion models. The major weakness pointed out is that the linear diffusion setting considered in this work is too restricted and the techniques developed cannot be applied to more practical settings. There are also some concerns about the proofs that are not fully addressed.
Additional Comments on Reviewer Discussion
The reviewers raise some questions about the results (e.g., the assumptions, Theorem 4.3, Section 5). The authors reply by modifying the paper and adding further clarifications in the responses. The discussion between the authors and reviewers lasts for multiple rounds. The reviewers are not convinced.
Reject