PaperHub
Overall rating: 6.4/10 · Spotlight · 5 reviewers
Ratings: 7, 7, 6, 6, 6 (min 6, max 7, std 0.5)
Confidence: 2.4
Correctness: 2.8 · Contribution: 3.0 · Presentation: 2.8
NeurIPS 2024

Reverse Transition Kernel: A Flexible Framework to Accelerate Diffusion Inference

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2025-01-14

Abstract

To generate data from trained diffusion models, most inference algorithms, such as DDPM, DDIM, and other variants, rely on discretizing the reverse SDEs or their equivalent ODEs. In this paper, we view such approaches as decomposing the entire denoising diffusion process into several segments, each corresponding to a reverse transition kernel (RTK) sampling subproblem. Specifically, DDPM uses a Gaussian approximation for the RTK, resulting in low per-subproblem complexity but requiring a large number of segments (i.e., subproblems), which is conjectured to be inefficient. To address this, we develop a general RTK framework that enables a more balanced subproblem decomposition, resulting in $\tilde O(1)$ subproblems, each with strongly log-concave targets. We then propose leveraging two fast sampling algorithms, the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD), for solving these strongly log-concave subproblems. This gives rise to the RTK-MALA and RTK-ULD algorithms for diffusion inference. In theory, we further develop the convergence guarantees for RTK-MALA and RTK-ULD in total variation (TV) distance: RTK-ULD can achieve $\epsilon$ target error within $\tilde{\mathcal O}(d^{1/2}\epsilon^{-1})$ under mild conditions, and RTK-MALA enjoys a $\mathcal{O}(d^{2}\log(d/\epsilon))$ convergence rate under slightly stricter conditions. These theoretical results surpass the state-of-the-art convergence rates for diffusion inference and are well supported by numerical experiments.
Keywords
diffusion inference · reverse transition kernel

Reviews & Discussion

Review
Rating: 7

Diffusion models rely on the reverse SDE (or ODE) for sample generation, but the discrete-time simulation of the reverse SDE, when regarded as solving a sequence of subproblems, is potentially less efficient than using better MCMC samplers for each subproblem. This paper suggests that, given a fixed computational budget, employing MALA and ULD to simulate samples from a de-noised distribution allows for higher efficiency. The paper then analyzes the theoretical properties of the two RTK samplers and showcases a more efficient use of computational expenses compared to SDE-based samplers.

Strengths

  • The paper provides a compelling argument for accelerating the inference of the reverse SDE, which seems more interesting than previous attempts to solve the same problem using specialized SDE solvers.
  • The paper also presents a valid argument for taking bigger step sizes in order to take advantage of Lipschitz continuity.
  • The paper can help bridge diffusion modeling and MCMC, which can lead to interesting future work on further improving the sampling of de-noised distributions.

Weaknesses

I like this paper overall -- the main weaknesses are mostly about how it is written, as there are small mistakes or confusing statements.

  • The paper's usage of the word "Gaussian process" (e.g., in L. 125) seems confusing. I would assume Gaussian process mostly refers to infinite-dimensional Gaussian distribution in the space of functions, but finite discretization of the reverse SDEs simply yields a sequence of Gaussian conditional distributions.
  • The paper suggests numerical experiments in the abstract, but only includes the experiments in the appendix.
  • Should the paper clarify the use of the $\tilde{\mathcal{O}}$ notation? It does not seem like a standard notation and is already used in the abstract.
  • Minor: the paper uses "closed" as opposed to "close", e.g., in Algorithm 1 and Line 172, 176.

Questions

  • The use of an underdamped dynamical system in diffusion modeling is not entirely a new idea: For instance, Dockhorn et al. suggest using a critically dampened Langevin diffusion as the forward process. While a critically dampened LD encodes a different evolution of marginal distributions, solving one discretization step of the backward SDE would be quite similar to the use of ULD in this paper. What do the authors think about the pros and cons of underdamping the forward process directly vs. your approach of only modifying the inference of the backward SDE?
  • Another advantage of SDE or RTK samplers, compared to deterministic simulation using probability flow ODEs, is the addition of stochasticity that ameliorates possible imperfections in the score estimation. I think it is reasonable to assume that the adoption of RTK as a generator might further improve this issue.
  • The experiment figures in the appendix often show the failures of DDPM, but I'm not sure why there exists a dramatic difference between DDPM and others in figure 1 and 2. I had previously assumed that DDPM should be able to sample from the same data distribution, even though it might be computationally more expensive compared to the methods proposed. Can the authors provide some clarification about why the generated samples are so different (I would assume it's either inaccurate estimates of the scores, or the step size being too large)?

[1] Dockhorn T, Vahdat A, Kreis K. Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. In: International Conference on Learning Representations, 2021.

Limitations

  • The paper suggests in the final section several limitations of the current paper. I agree with the characterization of limitations.
Author Response

Thanks for your positive comments and helpful suggestions. We respond to your concerns point by point below.

W1: (1). Writing issues. (2). The paper suggests numerical experiments in the abstract, but only includes the experiments in the appendix. (3). Should the paper clarify the use of the $\tilde{O}$ notation? Minor: the paper uses "closed" as opposed to "close".

AW1: (1). We will revise them in the next version. (2). We will move the experiments to the main body of the paper. (3). We clarified the notation $\tilde{O}$ in the notation paragraph of Section 2, i.e., Preliminaries. We will highlight it in the next version. Minor: We will revise them in the next version.

Q1: The use of an underdamped dynamical system in diffusion modeling is not entirely a new idea: For instance, Dockhorn et al. suggest using a critically dampened Langevin diffusion as the forward process. While a critically dampened LD encodes a different evolution of marginal distributions, solving one discretization step of the backward SDE would be quite similar to the use of ULD in this paper. What do the authors think about the pros and cons of underdamping the forward process directly vs. your approach of only modifying the inference of the backward SDE?

AQ1: Although there are some similarities in the use of ULD to accelerate the inference process of diffusion models, our work is orthogonal to [Dockhorn et al.], and the motivations of the two papers are different. Specifically, [Dockhorn et al.] change the way the probability flow is modeled, while our work aims to utilize well-trained scores to accelerate the inference process. Although we use a different Markov chain than the reverse SDE to generate data, we do not require an additional training process as [Dockhorn et al.] do. Under these conditions, we think the two approaches are better combined, e.g., by applying RTK to a CLD-based probability flow, than compared.

Q2: Another advantage of SDE or RTK samplers, when compared the deterministic simulation using probability flow ODEs, is the addition of stochasticity that ameliorates possible imperfections in the score estimation. I think it is reasonable to assume that the adoption of RTK as a generator might further improve this issue.

AQ2: This is a good observation. Actually, in order to achieve the same KL/TV convergence as SDE-based methods in theory, existing ODE-based methods require either an SDE corrector (DPOM and DPUM [Chen et al.]) or some strict assumptions (a small estimation error of the energy Hessian [Li et al.]), as shown in our Table 1. To the best of our knowledge, whether a pure ODE-based generator can achieve the same theoretical results as an SDE-based generator under mild assumptions is still an open question. We know that (1) stochasticity is the only difference between the implementations of ODE- and SDE-based methods, and (2) the contraction of KL divergence highly depends on the diffusion term from a Fokker-Planck equation perspective. Therefore, it can be inferred that stochasticity could ameliorate possible imperfections in the score estimation.

Q3: The experiment figures in the appendix often show the failures of DDPM, but I'm not sure why there exists a dramatic difference between DDPM and others in figure 1 and 2. I had previously assumed that DDPM should be able to sample from the same data distribution, even though it might be computationally more expensive compared to the methods proposed. Can the authors provide some clarification about why the generated samples are so different (I would assume it's either inaccurate estimates of the scores, or the step size being too large)?

AQ3: We would like to clarify that, to remove the influence of the score estimation error on convergence, we conduct the experiments on synthetic data with exact scores, which can be calculated explicitly since the target distribution is a Gaussian mixture. Besides, we tune the hyper-parameters only under small-NFE settings, which may require a relatively large step size. Such a large step size may magnify the differences between the generated samples. In the latest supplementary materials for the rebuttal, we added some new experiments. In particular, we compare RTK-based methods and DDPM on the real-world dataset MNIST, which demonstrates that RTK-based methods are better than DDPM even with score estimation errors in the same setting.

Comment

Thank you very much for your rebuttal. I already recommend acceptance therefore I am keeping my current assessment.

Review
Rating: 7

This paper presents a framework for accelerating the inference process in denoising diffusion probabilistic models (DDPMs) by optimizing the balance between the number and complexity of sampling subproblems.

Strengths

  • The proposed Reverse Transition Kernel (RTK) framework allows for a more efficient decomposition of the denoising diffusion process, enabling the use of advanced sampling algorithms like the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD).
  • By incorporating MALA and ULD, the paper leverages well-established acceleration techniques. This integration not only improves the convergence rates but also provides robust theoretical guarantees.
  • The authors provide theoretical convergence guarantees for these algorithms that surpass current state-of-the-art rates for diffusion inference, supported by numerical experiments.

Weaknesses

  • While the paper focuses on diffusion models, it raises the question of whether the RTK framework can be applied to other generative models or machine learning tasks.
  • The paper lacks sufficient numerical evaluation to thoroughly demonstrate the effectiveness of the proposed method. Additionally, there is a lack of detailed comparison to other related approaches, making it difficult to fully assess the advantages and limitations of the RTK framework in practical settings.

Questions

See above.

Limitations

N/A

Author Response

Thanks for your positive comments and helpful suggestions. We respond to your concerns point by point below.

Q1: While the paper focuses on diffusion models, it raises the question of whether the RTK framework can be applied to other generative models or machine learning tasks.

A1: A good question! Note that the main idea of the RTK technique is to replace the discretization of an SDE/ODE with a sampling subproblem. Therefore, this framework can also be applied to some of the latest generative models, especially flow matching [Lipman et al.]. Instead of defining the forward process as an SDE as in diffusion models, flow matching defines the forward process as a distribution conditioned on the initial data distribution. Actually, for their optimal transport implementation, the forward process can equivalently be converted to the following SDE: $dx_t^\to = \frac{x_t^\to}{t-1}\,dt + \sqrt{\frac{2t}{1-t}}\,dB_t$. According to the exponential integrator, we can first calculate the forward transition kernel and then deduce the reverse transition kernel. By decomposing the reverse process of flow matching into a series of strongly log-concave sampling subproblems and choosing proper inner samplers, we believe it is possible to accelerate flow matching inference.
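As a quick illustration (ours, not code from the paper), the following sketch simulates this forward SDE with Euler-Maruyama; the function name and defaults are hypothetical, and `t_end < 1` avoids the singularity of the diffusion coefficient at $t = 1$.

```python
import numpy as np

def simulate_fm_forward(x0, n_steps=1000, t_end=0.99, seed=0):
    """Euler-Maruyama simulation of the OT flow-matching forward SDE above,
    dx_t = x_t/(t-1) dt + sqrt(2t/(1-t)) dB_t, started at a data point x0.
    Sketch only; t_end < 1 avoids the singularity at t = 1."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    ts = np.linspace(0.0, t_end, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t
        drift = x / (t - 1.0)                      # x_t / (t - 1)
        diffusion = np.sqrt(2.0 * t / (1.0 - t))   # sqrt(2t / (1 - t))
        x = x + drift * dt + diffusion * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# e.g. a 2-D data point diffusing toward the flow-matching prior
print(simulate_fm_forward(np.array([1.0, -2.0])))
```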

Q2: The paper lacks sufficient numerical evaluation to thoroughly demonstrate the effectiveness of the proposed method. Additionally, there is a lack of detailed comparison to other related approaches, making it difficult to fully assess the advantages and limitations of the RTK framework in practical settings.

A2: In this paper, we mainly focus on the theory (the RTK framework) for the convergence of the diffusion inference process, and highlight how to design algorithms (RTK-ULD and RTK-MALA) that achieve a faster convergence rate in theory. In the meantime, we have also provided experimental results for Gaussian mixture targets in Appendix A, which justify the effectiveness and efficiency of the RTK-based methods.

We also added some new experiments in our latest supplementary material for the rebuttal. These experiments are conducted on two additional Mixtures of Gaussians (MoG) and the MNIST dataset. Besides, we add an additional related approach, i.e., [Song et al.], for comparison. All experimental results validate that the RTK-based inference process outperforms DDPM in different settings. More details about the supplementary experiments can be found in the common response.

We will add this additional context to our paper in the next version.

[Lipman et al.] Flow Matching for Generative Modeling.

[Song et al.] Generative Modeling by Estimating Gradients of the Data Distribution.

Comment

The authors addressed my comments, and I've raised my score accordingly.

Comment

Thank you so much for the response.

If you have further concerns, please feel free to contact us. We're always ready to discuss them further and work with you to improve our paper.

Review
Rating: 6

The manuscript is on accelerating or improving the inference in diffusion models. This denoising process corresponds to the discretization of an ODE or SDE. In this work, the authors view this process as multiple reverse transition kernel sampling subproblems. They introduce a general framework for the reverse transition kernel, and leverage two sampling algorithms to solve it. Theoretical results on the convergence of both sampling methods are presented.

Strengths

The method seems novel and founded on a strong theoretical analysis. The authors find the analytic solution of the RTK for specific SDEs which enables practical algorithms.

The overall presentation of the manuscript is good, and the theoretical analysis improves on existing ones.

Weaknesses

The manuscript lacks experiments, especially comparisons with existing inference methods. Theoretically, these methods seem to improve on existing ones, but it would be great to show this in practice as well, e.g., by comparing FIDs of a pretrained model for different inference methods.

The discretization of an ODE or SDE is replaced with $K$ subproblems consisting of sampling from the reverse transition probabilities. Within each of these subsampling problems, we need to simulate a chain of size $S$, requiring at least $S$ calls of the score network. My understanding is that you would need to call the score network at least $K\times S$ times. I wonder if the authors could show experiments comparing the number of function evaluations (or computation time) for different methods and their scores (e.g., FID of generated images). I believe some of this analysis on synthetic datasets is presented in the appendix. It would be great to improve these experiments and move them to the main body of the text; it would help convey that for the same NFE the accuracy is greater, so the proposed method indeed accelerates inference without decreasing performance.

Questions

Minor comments

  • potential typo on line 118 "To implement SDE."
  • wrong punctuation at the end of eq. 5.
  • Alg. 3 "of the previous iteration $x_0$": shouldn't it be $x_{k-1}$?

Limitations

The authors address the limitations of their work in the manuscript.

Author Response

Thanks for your comments. We respond to your questions below.

W1: The manuscript lacks experiments, especially comparisons with existing inference methods. Theoretically, these methods seem to improve on existing ones, but it would be great to show this in practice as well, e.g., by comparing FIDs of a pretrained model for different inference methods.

We have conducted additional experiments on both synthetic and real-world data. Additionally, we have included comparisons with an existing method and evaluated a new numerical metric. The detailed results of these experiments are included in the supplementary material of the rebuttal.

For synthetic data, we conducted additional experiments on spiral-shaped and chessboard-shaped Mixture of Gaussians (MoG), as shown in Figures 1 and 2. Our RTK-based methods achieved significantly better marginal accuracy with a small number of function evaluations (NFE). Comparing Annealed Langevin Dynamics (ALD) [Song et al.] with our RTK-based methods, we found that ALD has much lower marginal accuracy, particularly when NFE is small. Furthermore, using the Wasserstein Distance metric (Figure 3), we found that our RTK-based methods outperformed DDPM and ALD, especially with a small NFE.

For real-world data, we conducted experiments on the MNIST dataset, as shown in Figure 4. In Figure 4(a), we show that, with the same NFE, the RTK-based methods achieve better FID scores than DDPM, especially when NFE is small. Figures 4(b-d) further illustrate that the RTK-based methods generate images of higher quality than DDPM when NFE is small. We hope these supplementary experiments will address your concerns.

We will include these additional results and try more real-world datasets in the revision.

W2: The discretization of an ODE or SDE is replaced with $K$ subproblems consisting of sampling from the reverse transition probabilities. Within each of these subsampling problems, we need to simulate a chain of size $S$, requiring at least $S$ calls of the score network. My understanding is that you would need to call the score network at least $K\times S$ times. I wonder if the authors could show experiments comparing the number of function evaluations (or computation time) for different methods and their scores (e.g., FID of generated images). I believe some of this analysis on synthetic datasets is presented in the appendix. It would be great to improve these experiments and move them to the main body of the text; it would help convey that for the same NFE the accuracy is greater, so the proposed method indeed accelerates inference without decreasing performance.

Actually, the gradient complexities (i.e., the numbers of function evaluations) analyzed in Theorem 4.1 and Theorem 4.4 are indeed calculated as $K\cdot S$. For Theorem 4.1, we provide the detailed choice of $K$ and $S$ at Lines 945-947. For Theorem 4.4, we provide $K$ and $S$ at Lines 301-302 and Lines 304-305 in the main body of the paper.

We also provided the number of function evaluations (NFE) in our experiments for the Gaussian mixture targets. In the top-right panel of Figure 1, we compare the marginal accuracy [Mangoubi et al.] of different inference algorithms, with the x-axis explicitly denoting NFE. Therefore, we think the comparison is fair, and we will move our experiments to the main text in the revision.

Furthermore, we also provide FID-vs-NFE comparisons for both synthetic and real-world data, which can be found in Figure 3 (calculated by the Wasserstein-2 distance) and Figure 4 of our latest supplementary material for the rebuttal. Both the FID-NFE graph and the case study illustrate that the RTK-based methods generate images of higher quality than DDPM when NFE is small. We hope these additional experiments address your concerns.

Minor comments:

A: We will revise these typos in our next version.

[Mangoubi et al.] Dimensionally Tight Bounds for Second-Order Hamiltonian Monte Carlo

Comment

Thank you for the clarifications, especially on the complexity and NFE. I also appreciate the new experiments.

Comment

Thank you so much for acknowledging our clarification and new experiments.

If you have further concerns, please feel free to contact us. We're always ready to discuss them further and work with you to improve our paper.

Review
Rating: 6

This paper takes a fresh perspective on simulating the backward SDE in diffusion modelling, proposing to replace the usual Euler-Maruyama discretisation (i.e. based on Gaussian approximations) with a more accurate approximation (based on Metropolis--Hastings and related techniques). The approach is explored in theoretical detail, and it is argued that in principle fewer time steps are required to simulate from the backward SDE.

Strengths

  • Casting the solution of the backward SDE as a sequence of "simple" sampling problems is an interesting approach that opens the door to new methods and it is conceivable that these could deliver practical benefit.

  • The theoretical analysis appears to be extremely detailed, taking into account both the usual SDE discretisation errors that occur when approximating the backward SDE and the sampling errors that are incurred using Metropolis--Hastings and related techniques.

Weaknesses

  • The principal justification for this strategy is theoretical, and it is not yet clear whether it is practically beneficial.

  • Some of the English language was confusing at times, for example referring to "Gaussian process" when I believe "Gaussian approximation" (or similar) was intended.

Questions

  • Is solving the backward SDE really a computational bottleneck in diffusion modelling? I had imagined that learning the score function was actually the main bottleneck.

  • Do the author(s) results allow for plug-and-play in terms of the sampling method used? For instance, if I design a new sampling method for use in this context (e.g. some new version of Metropolis--Hastings) what conditions would my method need to satisfy in order to inherit the theoretical guarantees that have been established? A clear statement could help other researchers to engage with these results.

Limitations

The main limitation is the lack of empirical support, which the author(s) acknowledge in the conclusion section.

Author Response

Thanks for your positive comments and helpful suggestions. We respond to your concerns point by point below.

W1: The principal justification for this strategy is theoretical, and it is not yet clear whether it is practically beneficial.

AW1: In this paper, we mainly focus on the theory (the RTK framework) for the convergence of the diffusion inference process and highlight how to design algorithms (RTK-ULD and RTK-MALA) that achieve a faster convergence rate in theory. In the meantime, we have also provided some experimental results for the Gaussian mixture targets in Appendix A, which justify the effectiveness and efficiency of the RTK-based methods. Besides, we added some new experiments in our latest supplementary material for the rebuttal. These experiments are conducted on two additional Mixtures of Gaussians (MoG) and the MNIST dataset, and they validate that the RTK-based inference process outperforms DDPM in different settings. More details about the supplementary experiments can be found in the common response. We will add them to our paper in the next version.

W2: Some of the English language was confusing at times, for example referring to "Gaussian process" when I believe "Gaussian approximation" (or similar) was intended.

AW2: We will revise them in the next version.

Q1: Is solving the backward SDE really a computational bottleneck in diffusion modelling? I had imagined that learning the score function was actually the main bottleneck.

AQ1: In summary, learning the score function and solving the reverse SDE are separate subproblems of diffusion modeling. Specifically, solving the reverse SDE is the bottleneck of the inference process. For practitioners who would like to generate data from pre-trained models, the efficiency of generation depends only on the inference process, i.e., solving the SDE with well-trained scores. On the other hand, if one would like to develop a diffusion model for a new dataset, learning the score in the training process will be the main focus.

Q2: Do the author(s) results allow for plug-and-play in terms of the sampling method used? For instance, if I design a new sampling method for use in this context (e.g. some new version of Metropolis--Hastings) what conditions would my method need to satisfy in order to inherit the theoretical guarantees that have been established? A clear statement could help other researchers to engage with these results.

AQ2: That is a good question. Our RTK framework definitely allows for plug-and-play in terms of the sampling method. This is one of the core advantages of the RTK framework.

As for the conditions a new sampler needs to satisfy, we require it to converge to any strongly log-concave target distribution with arbitrary error tolerance in KL divergence or TV distance. To guarantee the final convergence, we first set the step size $\eta$ and the number of outer iterations $K$ as in Corollaries C.3 and C.5, i.e., $\eta = \frac{1}{2}\log \frac{2L+1}{2L}, \quad K = 4L\log \frac{(1+L^2)d+\|\nabla f_*(0)\|^2}{\epsilon^2},$ which is independent of the choice of inner sampler. Then, if the inner sampler can guarantee convergence in TV distance, we require it to converge to the target distribution with an error tolerance of at most $\mathrm{TV}\left(\hat{p}_{(k+1)\eta|k\eta}(\cdot|\hat{x}),\, p^\gets_{(k+1)\eta|k\eta}(\cdot|\hat{x})\right) \le \frac{\epsilon}{K} \le \frac{\epsilon}{4L}\left(\log \frac{(1+L^2)d+\|\nabla f_*(0)\|^2}{\epsilon^2}\right)^{-1},$ as shown in Corollary C.3. A similar result can be found in Corollary C.5 if the inner sampler achieves convergence in KL divergence.
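To illustrate the plug-and-play structure, here is a minimal sketch (our illustration; `score_fn` and `inner_sampler` are hypothetical callables) of the outer loop with $\eta$ and $K$ set as above and the per-subproblem TV tolerance handed to an arbitrary inner sampler.

```python
import numpy as np

def rtk_inference(score_fn, inner_sampler, x_T, L, eps, d):
    """Sketch of the plug-and-play RTK outer loop. Any inner_sampler that
    samples its strongly log-concave RTK target to TV error <= eps/K
    inherits the guarantees; eta and K follow the choices quoted above.
    score_fn(x, t) ~ grad log p_t(x); all names here are hypothetical."""
    eta = 0.5 * np.log((2.0 * L + 1.0) / (2.0 * L))
    # ||grad f_*(0)||^2 proxy, using grad f_* = -(score of the data density)
    grad_f0_sq = float(np.sum(score_fn(np.zeros(d), 0.0) ** 2))
    K = int(np.ceil(4.0 * L * np.log(((1.0 + L**2) * d + grad_f0_sq) / eps**2)))
    tol = eps / K  # per-subproblem TV tolerance
    x = x_T
    for k in range(K):
        # draw from the RTK target p^<-_{(k+1)eta | k eta}( . | x) up to tol
        x = inner_sampler(x, k, eta, score_fn, tol)
    return x
```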

Comment

Thank you for responding to my questions.

Comment

Thank you so much for the response.

If you have further concerns, please feel free to contact us. We're always ready to discuss them further and work with you to improve our paper.

Review
Rating: 6

This paper discusses an acceleration method for the inference of diffusion models. It considers the denoising diffusion process as a sequence of reverse transition kernel (RTK) sampling subproblems. The paper then pairs the RTK subproblems with the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD) for solving strongly log-concave subproblems.

Strengths

This paper presents a new modification of diffusion models, where it views DDPM as approximately solving RTK sampling subproblems using a small step size. Based on this, the authors propose two efficient algorithms for diffusion inference and demonstrate the acceleration theoretically and empirically through a mixed Gaussian example.

Weaknesses

The choice of step size is important, as there is a trade-off between the log-concavity of the RTK subproblems and the number of subproblems: either too small or too large a step size results in high computational complexity. The convergence results in Section 4 are built upon a particular choice of step size, which depends on the Lipschitz constant associated with the target data. The proposed method would gain clarity and generality, and benefit algorithm design, if a concrete computational complexity in terms of the step size could be analyzed. See the questions below for more details.

Questions

  1. A certain step size choice is used in this paper (for example after Lemma 3.1 and Theorem 4.1), and the choices are made to ensure the strong log-concavity of the target distribution. How is this derived? Is this a unique choice in some sense? There is a paragraph after Algorithm 2 commenting on the step size choice; I'm wondering if this result can be made more concrete and precise.

  2. What is a practical way of choosing the step size for general target data like real-world image datasets?

  3. What are the challenges in complexity analysis compared to the cited results [8, 9, 21] in Table 1? As DDPM can be seen as solving RTK sampling subproblems using a small step size, can you show the method discussed in this paper yields a more general technique in terms of complexity analysis?

Limitations

The limitations have been discussed in the paper.

Author Response

Thank you very much for your careful review. We answer your concerns one by one below.

W1: The convergence results in section 4 are built upon a particular choice in the step size...

AW1: Sorry for the confusion. The particular step size choice in Section 4 is to simplify our theory. Any step size smaller than such a choice will be feasible, making the target distribution of the sampling subproblem strongly log-concave.

For a general step size $\eta$ in RTK-ULD, the gradient complexity should be

$$\tilde{O}\left( \underbrace{C_{\mathrm{LSI}}(\eta)^{3/2} \left(L+\frac{e^{-2\eta}}{1-e^{-2\eta}}\right)^{3/2} d^{1/2}\epsilon^{-1}\eta^{-1/2}}_{S} \cdot \underbrace{\eta^{-1}\log \frac{(1+L^2)d+\|\nabla f_*(0)\|^2}{\epsilon^2}}_{K} \right),$$ which can be obtained with proof techniques similar to those of Theorem 4.4.

Note that the LSI constants of the subproblems depend on $\eta$. If the step size is large ($\Theta(1)$-level), the constant $C_{\mathrm{LSI}}$ blows up, growing exponentially with respect to $L\eta$, i.e., $C_{\mathrm{LSI}} = \exp(O(L\eta))$ (Proposition 1 of [Ma et al.]). If the step size is smaller than $\ln(1+1/2L)$, the target distribution of the inner loops is log-concave, and we have $C_{\mathrm{LSI}}(\eta) = \left(-L + \frac{e^{-2\eta}}{1-e^{-2\eta}}\right)^{-1}.$ Omitting the log terms, the gradient complexity becomes $$\tilde{O}\left(\underbrace{\left(\frac{L+\frac{e^{-2\eta}}{1-e^{-2\eta}}}{-L+\frac{e^{-2\eta}}{1-e^{-2\eta}}}\right)^{3/2} d^{1/2}\epsilon^{-1}\eta^{-1/2}}_{S} \cdot \underbrace{\eta^{-1}}_{K}\right).$$ Therefore, plugging in our setting $\eta=\frac{1}{2}\ln(1+1/2L)$, we obtain the complexity shown in Theorem 4.4.

When $\eta$ is extremely small ($\Theta(\epsilon)$-level), we can use a one-step update for the inner sampler. Under these conditions, we have $S=1$, and the complexity depends only on $K$. We defer the explanation to Question 3. In the next version, we will add this theorem to demonstrate the gradient complexity for general cases.
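For intuition, the following small script (our illustration; the constants are arbitrary) evaluates the general bound above with log factors dropped, showing how the $S$ and $K$ factors trade off as $\eta$ grows toward the $\frac{1}{2}\ln(1+1/2L)$ threshold.

```python
import numpy as np

def grad_complexity(eta, L, d, eps):
    """Evaluate the general RTK-ULD bound above with log factors dropped.
    Valid when e^{-2 eta}/(1 - e^{-2 eta}) > L, so that the C_LSI formula
    for the log-concave regime applies (sketch; constants ignored)."""
    r = np.exp(-2.0 * eta) / (1.0 - np.exp(-2.0 * eta))
    assert r > L, "eta too large for the log-concave C_LSI formula"
    S = ((L + r) / (r - L)) ** 1.5 * np.sqrt(d) / (eps * np.sqrt(eta))
    K = 1.0 / eta
    return S * K

L, d, eps = 2.0, 100, 0.1
for eta in (0.01, 0.05, 0.5 * np.log(1.0 + 1.0 / (2.0 * L))):
    print(f"eta={eta:.4f}  S*K ~ {grad_complexity(eta, L, d, eps):.3e}")
```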

Q1: How the step size is derived? Is this a unique choice in some sense?

AQ1: Sorry for the confusion. The step size is derived from Lines 188-189. Specifically, we require the Hessian matrix of the RTK to be $L$-positive-definite: $-\nabla^2 \log p_{(K-k-1)\eta|(K-k)\eta}(x) = \nabla^2 f_{(K-k-1)\eta}(x) + \frac{e^{-2\eta}}{1-e^{-2\eta}}\cdot I \succeq -L\cdot I + \frac{e^{-2\eta}}{1-e^{-2\eta}}\cdot I \succeq L\cdot I,$ where the first inequality follows from Assumption 1 ($L$-smoothness of $\log p_t$). From this inequality, we have $\eta \le \frac{1}{2}\ln(1+1/2L)$. Thus, we provide an upper bound on the step size rather than a unique choice. Some results shown in the paragraph after Algorithm 2 are also derived from the Hessian matrix. We will make this clear in the next version.

Q2: What is a practical way of choosing the step size for general target data like real-world image datasets?

AQ2: Since the smoothness of the data distribution is usually unknown, we tune the step sizes with grid search. Practically, we first set the NFE budget as $M$. Then, we tune the outer iteration number $K$. For any $K$, the outer step size is set as $\eta = 1.0/K$, and the inner iteration number is $S = M/K$. Finally, we tune the inner step size $\tau$.
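A minimal sketch of this tuning loop (our illustration; `run_rtk` and `eval_quality` are hypothetical callables standing for an RTK sampler and a quality metric such as marginal accuracy):

```python
import itertools
import numpy as np

def tune_rtk(run_rtk, eval_quality, M=100):
    """Grid search matching the procedure above (sketch). Fix the NFE
    budget M; for each outer iteration count K, set eta = 1.0 / K and
    S = M // K inner steps, then sweep the inner step size tau."""
    best_cfg, best_q = None, -np.inf
    for K, tau in itertools.product((5, 10, 20, 25), (1e-3, 1e-2, 1e-1)):
        eta, S = 1.0 / K, M // K
        samples = run_rtk(K=K, S=S, eta=eta, tau=tau)
        q = eval_quality(samples)  # e.g. marginal accuracy (higher is better)
        if q > best_q:
            best_cfg, best_q = {"K": K, "S": S, "eta": eta, "tau": tau}, q
    return best_cfg
```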

Q3: What are the challenges in complexity analysis compared to the cited results [8, 9, 21] in Table 1? ... can you show the method discussed in this paper yields a more general technique in terms of complexity analysis?

In summary, compared with [8, 9, 21] in Table 1, the technical challenges of this paper are as follows:

  • For the outer complexity analysis, how can we break away from the paradigm (used in [8, 9, 21]) of using the Girsanov theorem to analyze the discretized reverse process? As a solution, we calculate the closed form of the reverse transition kernel (RTK), which casts inference as a series of sampling subproblems.
  • For the inner complexity analysis, how can we derive the gradient complexity when the sampling algorithms can only access an approximate drift (the score estimate $s_\theta$) or energy differences, rather than the exact ones ($\nabla\ln p_t$)?
  • For the total complexity analysis, how can we control the error propagation across a series of sampling subproblems whose target distributions are RTKs? As a solution, we introduce chain rules for $f$-divergences.

Regarding generality, DDPM is a special case of our analysis when $\eta$ is chosen at an $O(\epsilon)$-level. When we choose DDPM to generate data, the ideal and practical reverse transition kernels are, by Lemma 3.1 and Lines 123-132 in Section 2, $$p^\gets(x|x') \propto \exp\left(-f_{k\eta}(x) - \frac{\|e^\eta x' - x\|^2}{2(e^{2\eta}-1)}\right), \qquad \hat{p}(x|x') \propto \exp\left(-\frac{\|e^\eta x' - 2(1-e^{\eta})s_{\theta,(k+1)\eta}(x') - x\|^2}{2(e^{2\eta}-1)}\right).$$ Suppose the step size is small, e.g., $e^{2\eta} - 1 = \epsilon$; then $-\nabla_x^2 \log p^\gets(x|x') = \nabla_x^2 f_{k\eta}(x) + \epsilon^{-1}\cdot I \succeq (2\epsilon)^{-1}\cdot I.$ This means the RTK is a $1/(2\epsilon)$-strongly log-concave distribution. With the LSI, we have $$\mathbb{E}_{x'}\left[\mathrm{KL}\left(\hat{p}(\cdot|x')\,\|\,p^\gets(\cdot|x')\right)\right] \le 2\epsilon\int \hat{p}(x,x')\left\|\nabla\log\frac{\hat{p}(x|x')}{p^\gets(x|x')}\right\|^2 dx \approx 2\epsilon\int \hat{p}(x,x')\left\|\nabla\ln p_{k\eta}(x) - s_{\theta,(k+1)\eta}(x')\right\|^2 dx,$$ where the right-hand side is the same as the discretization error controlled in [Chen et al.], which can be further relaxed by Eq. (5.6) of [Chen et al.]. Thus, the method discussed here yields a more general technique for complexity analysis.
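For concreteness, here is a sketch (ours, not the paper's code) of the ideal RTK energy and gradient from the first display above, as an inner MALA step would query them; `f` and `grad_f` are assumed callables for $f_{k\eta}$ and its gradient (in practice $\nabla f_{k\eta} = -\nabla\ln p_{k\eta}$, approximated by the negative score).

```python
import numpy as np

def rtk_energy_and_grad(x, x_prev, eta, f, grad_f):
    """Ideal RTK target from the display above, up to an additive constant:
    -log p^<-(x | x') = f_{k eta}(x) + ||e^eta x' - x||^2 / (2 (e^{2 eta} - 1)).
    Sketch only: MALA needs the energy for its accept/reject test and the
    gradient for its Langevin proposal."""
    c = np.exp(2.0 * eta) - 1.0
    resid = np.exp(eta) * x_prev - x
    energy = f(x) + resid @ resid / (2.0 * c)
    grad = grad_f(x) + (x - np.exp(eta) * x_prev) / c  # d/dx of the quadratic
    return energy, grad
```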

[Ma et al.] Sampling Can Be Faster Than Optimization.

[Chen et al.] Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions.

Comment

The authors have addressed my concerns. I have raised my score accordingly.

Comment

Thank you so much for the response.

If you have further concerns, please feel free to contact us. We're always ready to discuss them further and work with you to improve our paper.

Author Response

Thanks for your valuable suggestions. To supplement numerical experiments and demonstrate the benefits of our RTK-based methods in practice, we have conducted the following new experiments.

Synthetic Data Experiments


We performed additional experiments on various Mixture of Gaussians (MoG) settings as described in the manuscript. Figures 1 and 2 in our latest rebuttal submission illustrate these new experiments. Specifically:

  • Figure 1: Shows experiments with a spiral-shaped MoG as the ground truth distribution.
  • Figure 2: Shows experiments with a chessboard-shaped MoG as the ground truth distribution.
  • Figure 3: Shows a Fréchet distance comparison among RTK-based methods and other baselines when the target distribution is set to the two settings above.

In these figures, we observed that our RTK-based methods achieve significantly better marginal accuracy when the number of function evaluations (NFE) is small.

Additionally, we compared Annealed Langevin Dynamics (ALD) [Song et al.] with our RTK-based methods and found that ALD has a lower marginal accuracy, particularly when NFE is small. We hope these experimental results can address the concerns about “comparison to other related approaches” of Reviewer Y93M.

Furthermore, in Figure 3, we evaluated the methods using the Wasserstein distance metric, which plays the role of the Fréchet Inception Distance on synthetic data. Our results indicate that our RTK-based methods achieve a lower Wasserstein distance than DDPM and ALD, especially when NFE is small. We hope these experimental results address the concerns about the "comparison of FID" of Reviewer kQ8M.

Real-World Data Experiments


For real-world data, we conducted experiments on the MNIST dataset, as shown in Figure 4. We first trained a score model following the typical variance-preserving noise schedule and then compared different sampling methods using the Fréchet Inception Distance (FID) evaluation criterion:

  • Figure 4(a): Demonstrates that the RTK-based methods achieve better FID scores than DDPM, particularly when NFE is small.
  • Figures 4(b-d): Illustrate that the RTK-based methods generate images of higher quality than DDPM when NFE is small.

We hope these supplementary experiments address your concerns and provide further evidence of the effectiveness of our RTK-based methods.

Final Decision

This paper notes that MCMC methods can be used to produce consistent simulations from SDEs, and then develops this idea in detail for the specific SDE that must be solved to simulate the reverse process in denoising diffusion models. The paper is theoretical, and the main result is an analysis of how to select the discrete times at which the SDE is discretised in order that the sampling problem of moving from one state to the next becomes convex, and therefore very amenable to MCMC. Sufficient conditions are presented for the MALA and ULD MCMC algorithms, but the general framework shows how other MCMC methods could also be considered.

As with all large-scale machine learning, one wonders whether the errors in simulating SDEs for large-scale denoising diffusion models provide some form of useful implicit regularisation, but small experiments in this paper suggest this may not be the case (meaning that more accurate solution of the SDE could be useful). One also wonders whether the use of more complicated MCMC algorithms in place of simple Euler-Maruyama is worth the additional effort, but small experiments suggest the number of score function evaluations is not greatly increased when the proposed method is used. A caveat with these experiments is that the baseline implementation of denoising diffusion models seems sub-optimal, as the authors admit in their response, and this may have exaggerated the advantages of the proposed method.

Overall this paper is well-written (modulo numerous grammatical errors, e.g. "closed" instead of "close" in Algorithm 1), with substantial theory and weakly convincing empirical evidence. I believe the framework will be of interest to a wide section of the NeurIPS community, encompassing both diffusion modellers and MCMC algorithm developers. I recommend that the paper is accepted.