6.3

/10

Poster4 位审稿人

最低4最高7标准差1.3

3.8

置信度

正确性2.8

贡献度3.0

表达2.3

NeurIPS 2024

Divide-and-Conquer Posterior Sampling for Denoising Diffusion priors

Yazid Janati,Badr MOUFAD,Alain Oliviero Durmus,Eric Moulines,Jimmy Olsson

OpenReview PDF

提交: 2024-05-14更新: 2024-11-06

摘要

关键词

Denoising Diffusion modelsposterior samplingMonte Carlo methods

评审与讨论

审稿意见

评分: 7置信度: 32024-07-01

The topic of using a pre-trained diffusion model as a prior for a Bayesian posterior sampling problem is considered. This formulation is compelling (particularly for images) as it combines the powerful generative capacities of large-scale diffusion models with the coherent inference under certainty provided by Bayes’ theorem.

The paper refines the diffusion posterior sampler of Chung et al by replacing a point mass approximation with a variational Gaussian approximation to the conditional transition distributions.

Chung et al: Diffusion posterior sampling for general noisy inverse problems, ICLR, 2023

优点

The paper tackles an interesting problem and is mathematically compelling as the point mass approximation is likely to be crude in practice.

The experiments are practical and varied - investigating various likelihoods (linear+Gaussian, Poisson, JPEG decoding). And the results for the introduced sampler are promising.

缺点

The paper is (perhaps understandably quite dense), in particular, I think some more clarity on the use of twisted potentials and how this differs (or to what extent) from the Chung et al approach would aid the reader.

The resulting algorithm is somewhat complex and it is not clear how sensitive the algorithm is to the choice of twisting potential and hyperparameters, in particular the number of internal Langevin (M) and VI gradient steps (K).

Additionally, as noted by the authors, the choice of twisting potentials considered is bespoke in that they have to be specified for each problem. A more general-purpose posterior sampling pipeline that fully separates likelihood and data specification from (automated) posterior sampling would be highly desirable. Although understandably as a future direction in the considered work.

Chung et al: Diffusion posterior sampling for general noisy inverse problems, ICLR, 2023

问题

Would MAP estimation (over posterior sampling) also be feasible/performative in this setting? And perhaps computationally easier?

It's not clear to me how exactly the proposed DCPS algorithm is "divide-and-conquer". Over which axis does it divide? And which computational bottleneck does this speed-up? How does this compare to other computational bottlenecks? (I.e. is it dominating)

局限性

The authors have addressed key limitations of the work. I see no potential for negative societal impact.

作者回复

2024-08-06

Dear Reviewer GK7d,

We thank you for your time, your feedback and suggestions. Please find our response to your questions below. In the PDF attached to the main rebuttal you can also find an update of Table 2 in the main paper with 4 more competitors suggested by the other Reviewers. These new results further demonstrate the strength of our method in comparison with the state-of-the-art.

More precision on the difference with Chung et al.: The potentials, and the subsequent approximations we introduce in this paper, differ in many aspects from the approximation developed in Chung et al. While [1] derives a new Gaussian approximation of $p(y|x_t)$ , we take a different approach by introducing likelihood functions at various timesteps $(k \_\ell) \_{\ell=0}^L$ in the diffusion process. This transforms the initial sampling problem into sampling from intermediate posteriors, $\pi \_{k_\ell}$ , defined by these new likelihoods. DCPS then iteratively samples from $\pi \_{k_\ell}$ using approximate samples from the previous posterior $\pi_{k_{\ell+1}}$ . Essentially, we reduce the initial problem to solving $L$ simpler posterior sampling problems , which only require approximating $p_{k_\ell|t}(\cdot|x_t)$ for $t \in [k_{\ell}, k_{\ell+1}]$ (hence the divide-and-conquer terminology), rather than the difficult-to-estimate $p_{0|t}(\cdot|x_t)$ on which DPS but also many recent works in the literature have focused on deriving accurate estimates. When unrolling the backward transitions involved in our algorithm, it can be seen that these are fundamentally different from those of DPS. We will add a side by side comparison in the revised version of the manuscript.

Choice of twisting potentials: In the paper, we proposed choices of potentials for three important problems, all following the same principle: each potential is derived by annealing the initial potential and applying it to a rescaled input, i.e., $g_k(x_k) = g_0 \left( \frac{x_k}{\beta_k} \right)^{\gamma_k}$ , where $\beta_k$ and $\gamma_k$ are tunable parameters. Specifically, we used $\beta_k = \sqrt{\alpha_k}$ and $\gamma_k = \alpha_k$ for linear inverse problems and JPEG dequantization. For the Poisson problem, we used $\gamma_k = \sqrt{\alpha_k}$ and the normal approximation of the Poisson distribution for $g_0$ . We believe that this principle can serve as a general recipe for constructing potentials, leaving the tuning of the parameters to the users. Future works include deriving informed choices of hyperparameters in the general case. We have added a discussion of this matter in the revision of the manuscript.

Sensitivity to hyperparameters: Please note that in the Gaussian mixture experiment we examine the impact of the number of Langevin steps, which can be seen to significantly improve the performance when using a large number of steps. Similarly, we have found experimentally that increasing the number of gradient steps improves the performance.

MAP estimation: We believe that the principles we presented in this paper can be extended to MAP estimation; one could also design a sequence of intermediate posteriors well-suited for MAP estimation such that the terminal distribution is the posterior of interest, and sequentially solve these problems. The MAP at the previous iteration should be a good initialization for the optimization in the next step. An initial guess would be to start off with the sequence of distributions used in our paper.

2024-08-12

Thanks to the authors for their thoughtful rebuttal.

I think the revision would be improved by more clearly discussing that the divide-and-conquer strategy arises from solving $L$ simpler posterior problems rather than the straight-to-time-zero conditional distribution approximated by DPS.

Overall, I think this is a nice paper tackling an interesting and modern problem (using a large pre-trained model as a prior) with very strong benchmarks. I have increased my score to a 7 (previously 6).

2024-08-13

Thank you for your helpful feedback and suggestions. We will clarify the intuition in the revision of the paper. We’re glad you found our work valuable, and we appreciate the improved score. Your input has been very helpful in improving the paper.

审稿意见

评分: 7置信度: 42024-07-05

This work proposes a new posterior sampling scheme for unconditional diffusion models. The sampling algorithm, named divide-and-conquer posterior sampling (DCPS), combines Langevin Monte Carlo and Gaussian variational inference to operate the transitions between noise levels along the reverse diffusion process. The authors apply their sampling scheme to several inverse problems and compare their results with several previous works. DCPS seems to improve upon these works, both quantitatively and qualitatively.

优点

The manuscript is well written for the most part.
The subject of posterior sampling algorithms in diffusion models is interesting and relevant for the ML and scientific communities.
The proposed algorithm seems sound, albeit too complex in my opinion (see weaknesses).
The experiments are well described and the results are clearly presented.
The code is provided, although it is not documented and instructions to reproduce experiments are minimal.

缺点

The proposed method, although presented quite differently, is very similar to the predictor-corrector approach proposed by Rozet et al. [1], where the predictor is a classic DDPM/DDIM step which approximates the transition from one noise level to the next, while the corrector is a number of Langevin Monte Carlo (LMC) iterations. The similarity is striking, especially when noticing that the DDPM step is a Gaussian approximation of the transition. The authors do not discuss nor mention Rozet et al. [1]. The authors should highlight the differences between their posterior sampling algorithm and the one of Rozet et al. [1] and compare their results.
Even if it were correct, the link between diffusion posterior sampling and Feynman-Kac (FK) models in Section 2 is not relevant and confusing. Unlike the authors state (line 110), the overwhelming majority of diffusion/score-based posterior sampling literature uses either a Markov chain or a stochastic differential equation (SDE) representation. Furthermore, this link with FK models is not even leveraged as part of the method, which only relies on Markov chain properties.
As mentioned by the authors, intermediate potentials should be designed for each problem, which could be difficult for some inverse problems. In addition the convolution of these potentials with the bridge kernel (Eq. (3.6)) should be (cheaply) tractable and differentiable.
There are two aspects to the proposed algorithm: the LMC iterations and the intermediate potentials/posteriors. It is hard to assess the contribution of each aspect in the results. My hypothesis is that most of the quality gains are due to the LMC iterations. My rationale is that the potentials $g^l_m(x_m) = \mathbb{E} [ g_{k_l}(x_{k_l}) \mid x_m ]$ are actually user-defined potentials, just like $g_m(x_m) = g_0(x_m)$ or $g_m(x_m) = \mathbb{E} [ g_0(x_0) \mid x_m ]$ . As long as they converge to $g_0$ when $m \to 0$ , the algorithm should work.

Comments

I am not sure to understand the reference to "divide-and-conquer" from dynamic programming.
At line 80, $t$ is undefined.
At line 143, Finzi et al. [2] are the first to propose to use the Jacobian of the denoiser to estimate the covariance. Boys et al. [3] later proposed an approximation using the diagonal of the Jacobian.
In Algorithm 1, I think there is a typo in "for $j = \ell + 1 - 1$ to $\ell$ do".
Table 5 should be Figure 5.
At line 232, "resort to a optimizing" -> "resort to optimizing".
At line 258, $K$ is not defined in the main text.
At line 334, if I understand correctly DCPS is not limited to linear inverse problems.

[1] Rozet et al. "Score-based Data Assimilation". In Advances in Neural Information Processing Systems. 2023.

[2] Finzi et al. "User-defined Event Sampling and Uncertainty Quantification in Diffusion Models for Physical Dynamical Systems". In Proceedings of the 40th International Conference on Machine Learning. 2023.

[3] Boys et al. "Tweedie Moment Projected Diffusions for Inverse Problems". 2024.

问题

See weaknesses.

局限性

See weaknesses.

作者回复

2024-08-06

Dear Reviewer af3x,

We would like to thank you for the time you took to review our manuscript and for your feedback and suggestions. Please find below our response to what we believe is the most crucial point in your review. Please consider reading the main rebuttal containing comparisons with new competitors and comments on the choice of potentials.

Similarity with [1]: We humbly disagree with the claim that our algorithm is similar to SDA [1]. To clarify this misunderstanding, let us first give a brief description of SDA. The authors propose an approximation of $p(y|x_t)$ ( $g_t ^{*} (x_t)$ with our notation). The resulting density approximation is $\hat{p}(y|x_t)=N(y;A \hat{x}\_{0|t}^\theta(x_t), \sigma^2 \_y + \frac{\gamma (1 - \alpha_t)}{\alpha_t} AA^T )$ . It should be noted that it is similar to the one used in PGDM [3], but with a slightly different choice of variance. After sampling $X_{t-1}$ given $X_t$ by leveraging the approximate conditional score, the authors perform few corrections steps using Langevin Diffusion targeting the marginal distribution $\propto \hat{p}(y|x_{t-1}) p_{t-1}(x_{t-1})$ . As the authors acknowledge, this PC step is not new and already appears in the seminal paper of Song et al. 2021

We now proceed to a step-by-step comparison with [1].

Intermediate posteriors (Please see also the second point in the main rebuttal): While [1] focuses on deriving a new approximation of $p(y|x_t)$ using a Gaussian approximation of $p_{0|t}(\cdot x_t)$ , we take a different approach and introduce likelihood functions $g_{k_\ell}$ at different timesteps $(k_\ell) \_{\ell = 0} ^L$ of the diffusion process. By introducing these likelihood functions, we transform the initial problem of sampling from the posterior of interest to that of sampling from the intermediate posteriors, denoted by $\pi_{k_\ell}$ in our paper. DCPS is then an iterative procedure where at each step we aim to sample from $\pi_{k \_\ell}$ using approximate samples from the previous posterior $\pi \_{k \_{\ell+1}}$ . In short, we reduce the initial problem to that of solving $L$ simpler posterior sampling problems. These problems are simpler since, in contrast with the initial problem which required deriving an approximation of the difficult to estimate distribution $p_{0|t}(\cdot | x_t)$ , now we only require approximating the simpler distributions $p_{k_\ell | t}(\cdot | x_t)$ for $t \in [k_{\ell}: k_{\ell+1}]$ .
Backward transition approximation: The conditional score in [1] is used within DDIM whereas we posit a Gaussian approximation with learnable mean and diagonal covariance matrix and perform KL minimization on the conditional target distribution that immediately precedes Eq. 3.5. This is step 2 (see line 185). This step does not target the same distribution as [1] and relies on a different approximation.
Langevin steps: In [1], $C$ -Langevin steps are performed directly after each DDIM step. On the other hand, we resort to Langevin steps only to move from one block to the next, i.e., to obtain an approximate sample of $\pi^\ell _{k _{\ell+1}}$ that initializes the next block; see Eq. 3.4. This is step 1, (see line 183). We thus use significantly less LMC steps than SDA and the performance gains are not simply due to them, as opposed to [1].

Our algorithm DCPS and the algorithm in [1] are thus entirely different. We have implemented SDA and compared it with ours on linear inverse problem imaging experiments using the authors' implementation. We used two Langevin correction steps, $\tau = 0.1$ for the Langevin step size, and a diagonal approximation of $\gamma = 0.1$ after tuning. While SDA performs well on FFHQ, we couldn't find optimal parameters for ImageNet, resulting in degraded performance. DCPS outperforms it in both average and median LPIPS scores. Please see the PDF in the main rebuttal for detailed results and comparisons with other prominent algorithms.

Choice of potentials: Please see the second point of the main rebuttal. The reviewer is right in that the choice of potentials is arbitrary and any sequence should perform well as long as it converges to the inverse problem likelihood function. However, in practice, a mischosen potential will require a very large number of Langevin steps or may even result in significant instabilities. We illustrate the latter point. Take as example the choice of potential $g_m(x_m) = g_0(x_m)$ that you propose. Assume that $x \in R^2$ and $g_0(x) = N(y; x_1, \sigma^2 \_y)$ . Assume that $x \in R^2$ and $g_0(x) = N(y; x_1, \sigma^2 \_y)$ . We also assume that the prior distribution is for example a mixture with well separated modes far from 0 and that we observe $y >> 0$ , the first coordinate of either one of the modes. During the early steps of the diffusion, when $\sigma_y \approx 0$ , samples from the intermediate posteriors have first coordinates centered around $y$ , while the second coordinate remains nearly Gaussian. This leads to instabilities in subsequent steps. When moving to the next step, i.e., sampling $X_{k-1}$ given $X_k$ , the denoiser network is evaluated at $X_k$ . However, at the $k$ -th noise step, the denoising network has been trained to process samples that are almost white noise, which is not the case for our current sample $X_k$ , which has a large first component. As a result, the network returns an incoherent value, and since this happens at every step of the diffusion, the errors accumulate, leading to significant instabilities.

As for the second example $g_m(x_m) = \mathbb{E}[g_0(X_0) | X_m = x_m]$ , this is a valid argument and we believe that it can work well in practice. However, it should be noted that using this method requires a significant amount of memory and further increases the runtime. Namely, differentiating the loss in Eq. A.8 with respect to both parameters requires differentiating the composition of two Denoisers, which is very computationally intensive.

2024-08-08

Thank you for your response and taking the time to implement [1].

I do not agree with your interpretation of SDA [1]. Even though $\hat{p}(y \mid x_t)$ is an approximation of $p(y \mid x_t)$ , it is also a likelihood function and therefore defines a series of intermediate posteriors $\hat{p}(x_t \mid y)$ . The DDIM step (predictor) is an approximation of the transition from $\hat{p}(x_t \mid y)$ to $\hat{p}(x_{t - \Delta} \mid y)$ and the $C$ Langevin steps (corrector) "ensure" a sample from $\hat{p}(x_{t - \Delta} \mid y)$ . In this light, DCPS and SDA's sampling algorithm are quite similar, especially regarding the Langevin steps which are used to "correct" the transition errors. These similarities should be discussed in the main text. There are also major differences, obviously. Notably,

The likelihood function $g^l_m(x_m)$ is chosen as $\mathbb{E}[g_{k_l}(x_{k_l}) \mid x_m]$ instead of a (very) rough approximation to $\mathbb{E}[g_0(x_0) \mid x_m]$ .
The "predictor" step spans several diffusion steps and consists in a sequence of variational Gaussian approximations, which is indeed quite different from a DDIM step.

Concerning the choice of potentials, I very much agree that bad potentials could lead to unstable sampling. However, I still believe that most of the quality gains of DCPS are due to the Langevin steps and not to the potentials $g^l_m(x)$ . My intuition is that potentials $g_m(x)$ defined as in Eq. (3.9), (3.10) or (3.11) but with $m$ instead of $k_l$ would work nearly as well.

2024-08-09

Thank you for taking the time to respond to our rebuttal.

Regarding your first point, we agree that SDA also defines a sequence of posterior distributions that are sampled sequentially. However, except for this similarity, our work differs significantly from [SDA]: First, regarding the number of elements of this sequence (the number $L$ ), second and more importantly, regarding how these distributions are defined and the way we sample from them. In the revised version of the paper we will make sure to highlight these two distinctions and discuss SDA. Furthermore, let us still argue that even the way we use Langevin is different.

More precisely, as you have said, SDA proceeds by first drawing a sample from an approximate transition of $\hat{p}(x_t | y)$ to $\hat{p}(x_{t-\Delta} | y)$ , then proceeds to correct this sample by running Langevin steps targeting the same distribution $\hat{p}(x_{t-\Delta} | y)$ . On the other hand, in DCPS, when we sample from a transition within a block, we never correct the sample using Langevin steps. Langevin steps happen only at the end of the block and their purpose is the following; first, right before the end of the block we draw an approximate sample from the transition that goes from $\pi^\ell \_{k_\ell + 1}$ to $\pi^\ell \_{k_\ell} = \pi \_{k_\ell}$ using our variational approach. Afterwards, we run Langevin steps but their purpose is not to correct to ensure that it is distributed according to $\pi \_{k_\ell}$ . Their purpose is to ensure that the sample is distributed according to the next distribution $\pi^{\ell - 1} \_{k_\ell}$ , which is the initial distribution of the next block, as per eq. 3.4.

As such, the Langevin steps in our algorithm are not used as correction steps but are rather used to transition to a different posterior (that has the same prior).

In our imaging experiments we only use a small number of Langevin steps overall (5 * L in total) so as to have a lower computational cost than the main competitors. We have found that even using no Langevin steps in between the blocks still performs well. Increasing the number of steps further improves the performance. In the table below you can find the LPIPS results on FFHQ with the Half mask and 0.05 observation std. DCPS-M refers to our algorithm with $M$ Langevin steps in between the blocks (the other parameters are similar to those used in our main experiments). We also compare with SDA with no Langevin steps. As is show, DCPS is still competitive when no Langevin steps are used. We list some other competitors from the table in the PDF for comparison.

Method	DCPS-0	DCPS-5	DCPS-50	DCPS-500	SDA-0	SDA-2	DDNM	FPS	MCGDIFF
LPIPS	0.22	0.20	0.20	0.19	0.28	0.22	0.22	0.28	0.36

As stated in our rebuttal, we firmly believe that the main gains stem essentially from our choice of potentials and the KL minimization step. The Langevin steps help further improve the performance.

For your second point:

“My intuition is that potentials $g_m(x)$ defined as in Eq. (3.9), (3.10) or (3.11) but with $m$ instead of $k_\ell$ would work nearly as well.”

We have already experimented with the setting you suggest and found that it doesn’t compare favorably with DCPS.

Finally, we would like to highlight that our procedure introduces several novel aspects, and our empirical results demonstrate that it performs well against nine competitors (as shown in the PDF in the main rebuttal), including some well-established methods. Given these findings, we are genuinely surprised by your rating of our contribution (1/4) and the resulting recommendation for weak rejection. We would greatly appreciate if you could provide additional clarification regarding your decision.

2024-08-09

Thank you for your answer.

As such, the Langevin steps in our algorithm are not used as correction steps but are rather used to transition to a different posterior (that has the same prior).

I see. That is a subtlety I did not catch.

As is show, DCPS is still competitive when no Langevin steps are used.

Thank you for running this additional experiment. This indeed supports your claims. My intuition failed me it seems.

Given these findings, we are genuinely surprised by your rating of our contribution (1/4) and the resulting recommendation for weak rejection. We would greatly appreciate if you could provide additional clarification regarding your decision.

My rating was mainly influenced by my concern of the similarities between DCPS and [1]. But thanks to your clarifications, I now see that they have less in common than I had initially taught. The additional results in the attached PDF also strengthen your claims of improved sampling. A remaining concern would be the relative complexity of the method, but this is quite minor. Based on my current assessment, I now argue for acceptance and will therefore raise my score to 7 (previously 4).

2024-08-13

Thank you very much for taking the time to review our manuscript and especially for your engagement during the rebuttal period. We believe that your comments and suggestions have significantly improved the quality of our paper.

审稿意见

评分: 4置信度: 52024-07-11

This paper aims to solve the inverse problem within the Bayesian framework. The authors propose DCPS, which approximately samples from the posterior. DCPS creates a sequence of intermediate posterior distributions to approach the target posterior step by step. The authors showcase the effectiveness of DCPS on inverse problems like image restoration and trajectory prediction.

优点

Good empirical results on Gaussian mixture problems.
DCPS is computationally efficient.

缺点

No justification for the design of the intermediate potentials. They are approximations of $g^*_k$ defined in line 122. But why are they good approximations? The manuscript will benefit from a theoretical analysis of the approximation error.
In general, it is unclear how to design intermediate potential for nonlinear forward models. DCPS seems limited to linear problems.
Only LPIPS is reported in the image restoration tasks. The experiment section would benefit from adding evaluation metrics such as PSNR and SSIM.
DCPS generates pseudo observations in a similar way to the prior work FPS [1]. I believe this is an important baseline to discuss and compare.
Overall, the paper is hard to parse, with many long sentences, redundant phrases, and inconsistent notation usage. Just list some of the issues here. I believe a major revision is needed to make this paper more accessible.
1. Typos in Algorithm 1: " $X^l_{l+1}\leftarrow X_l+1$ " and"for $j=l+1-1$ to $l$ "
2. $l$ is used in subscript and superscript together, which is very confusing. For example, $X^l_{l+1}$ in Algorithm 1, $\pi^l_{k_l}$ in Eq. (3.4).
3. Line 66: remove the comma before "and illustrate"
4. Line 64: "describe theoretically" -> "theoretically describe"
5. Line 70: "generate efficiently" -> "efficiently generate"
6. Line 158: a comma is missing before "we find"
7. The authors use $N(x;\mu, \Sigma)$ to represent the Gaussian density. I suggest the conventional notation $\mathcal{N}$ from the literature for better consistency with the field.
8. Calling $g_k$ as the potential is confusing when it's actually a likelihood function.
9. The use of upper case $X$ and lower case $x$ is confusing throughout the paper. For example, in line 85, it's $X_k=x_k$ .
10. Redundant phrases and long sentences should be revised for clarity. For example, line 62-71, line 96-99, line 151-153, line 160-163.
11. Line 102: "where $\mathcal{Z}=\int g_0(x_0)p_0(dx_0)$ " -> "where $\mathcal{Z}=\int g_0(x_0)p_0(x_0)dx_0$ "
12. Incorrect inline citations throughout the paper.

[1]: Dou, Zehao, and Yang Song. "Diffusion posterior sampling for linear inverse problem solving: A filtering perspective." _The Twelfth International Conference on Learning Representations. 2024.

问题

the unadjusted Langevin is a biased sampler for any finite time step. Why not include the Metroplis-Hasting correction?
Why is L much smaller than n? How does the value of L affect the final performance?
In the image restoration experiments, the authors claim to use only 5 Langevin steps. It is very unlikely that the algorithm is sampling from the right distribution within 5 Langevin steps. Langevin Monte Carlo is known for slow convergence for high-dimensional problems, which takes hundreds of steps even on simple low-dimensional problems. How can the proposed DCPS work with only 5 Langevin steps?

局限性

Problem-specific intermediate potentials must be designed, limiting DCPS's application from a wider range of inverse problems.
While the goal of this paper is posterior sampling, the proposed DCPS is a biased sampler that does not converge to the true posterior.

作者回复

2024-08-06

Dear Reviewer xni9,

We would like to thank you for the time you took to review our manuscript and for your feedback and suggestions. Please find below our response to what we believe are the most crucial points.

On the approximation of $g_k ^{*}$ and design of the intermediate potentials: We would like to emphasize that in our methodology the functions $g_k$ are completely user-defined. This is exactly the main point of our paper; we propose a principled method that deals with the intractability of the true potentials and allows the use of problem-informed, user-defined potentials. Therefore, it is not particularly relevant to question whether the functions $g_k$ are precise approximations of the functions $g_k^*$ . As already emphasized in the paper, our method does not require any knowledge of $g^* _k$ .

In the paper we have proposed choices of potentials for three important problems and their design follows the same principle; please see the second point in the main rebuttal for more details. We do not claim that these potentials are good approximations of $g^*_k$ as this is not a requirement of our method; nevertheless, we demonstrate that although they may not be accurate approximations, our method still demonstrates strong performance; as evidenced by our experiments. As an illustration, you can find in the PDF of the main rebuttal a plot of the gradient field of our log potentials compared with those of the true potentials $g^* _k$ (which are computable in a closed form in the Gaussian mixture case). Notably, these are very different and yet DCPS achieves a strong performance, as you stated in the strengths.

Finally, many recent published papers in top conferences [1, 2, 3, 4, 5, 6, 7] (including the FPS paper you suggest we should compare with) has focused only on linear inverse problems, which continues to be a highly relevant and challenging problem class, and as far as we are concerned there is still no gold-standard algorithm that is widely accepted for its robust performance. Besides linear problems, our study also covers two other significant inverse problems for which we also design potentials $g_k$ and empirically assess and illustrate their benefits. Therefore, while it would undoubtedly be ideal to have a parameter-free general solution, up to our knowledge only no existing works have been able to tackle this degree of generality while showing a performance comparable to more “specialized” schemes.

Comparison to FPS: We agree with the reviewer that both DCPS and FPS rely on “pseudo-observations,” as acknowledged in the paper (see the discussion in Section A.4 of the appendix where we already describe FPS). For the sake of completeness, we have added a comparison with FPS on imaging experiments but also with three other competitors suggested by the other reviewers, see the main rebuttal. Clearly, DCPS outperforms FPS by a significant margin. While SMC-based algorithms like FPS and MCGDiff have strong theoretical guarantees, their performance scales poorly with increasing dimensions, requiring exponentially more memory. In contrast, DCPS performs well across both low and high dimensions without incurring such high memory costs.

LPIPS only: We want to emphasize a key point. Our goal is to develop a sampling algorithm that efficiently generates diverse samples over the entire posterior distribution. For the multimodal examples in our paper, high-quality samples should be varied and differ significantly from the true image. A pixel-by-pixel comparison with the true image is not meaningful for posterior sampling, as it doesn't reflect sampler performance. Since the true image is near the maximum a posteriori (MAP) estimator, good pixel-by-pixel performance might indicate the algorithm generates non-diversified draws from dominant modes, failing to explore the full posterior support.

Why no Metropolis-Hastings correction? The author is right in thinking about the use of a MH correction. However, note that we do not actually have access to (unnormalized versions of) $\pi^\ell \_{k_{\ell+1}}$ since we cannot evaluate $p_k$ . Therefore we cannot use MH type algorithms. LMC is of course biased when run for a finite number of steps, but as we show with the Gaussian mixture example, it can still perform decently, very close to that of the asymptotically exact algorithm MCGDiff.

Typos: We would like to thank you for drawing our attention to the notation problems present in Algorithm 1 and we apologize for these. Please find a corrected version of our algorithm in the PDF attached to the main rebuttal. Thank you also for pointing out other typos and suggesting corrections. Note that throughout the paper we use upper case for random variables and lower case for their realizations. With this in mind, saying that $p_{0|k}(\cdot | x_k)$ is the conditional density of $X_0$ given $X_k = x_k$ is not confusing.

bias of our algorithm: To our knowledge, all methods for sampling from posterior distributions with probabilistic diffusion priors are biased with finite iterations. This includes algorithms based on importance sampling, like SMC, which are also biased with limited particles. Regarding LMC, similarly to SMC using an infinite number of particles, if we were using decreasing step sizes converging to $0$ , it is also asymptotically unbiased by [8, 9]. DCPS can be made asymptotically unbiased by running LMC targeting the posterior with the final sample from Algorithm 1. The key issue is whether our method’s samples are adequate initializations. While we have shown this empirically, a quantitative analysis of our algorithm’s accuracy remains for future work. Therefore, based on the current existing literature, pointing out that our method is biased (without any further indications) as a limitation seems unjustified in our opinion. We humbly ask the reviewer to elaborate more on their comment.

2024-08-14

Thank the authors for the response. I appreciate the clarification on the intermediate potentials, FPS, and MH correction. I've raised the score to 4. However, there are a few major questions or concerns that are ignored or not directly responded to.

DCPS seems limited to linear problems except for JPEG. The generic design principle of intermediate potential in the general rebuttal makes sense for the linear forward model, but it is hard to justify its use in nonlinear problems.
Question 3 is not directly responded to. The concern is that the algorithm with only 5 LMC steps is uncontrolledly biased. It behaves more like an optimization procedure rather than a sampling algorithm.
Relying on a single metric is not a good practice, as one may easily tune the algorithm to overfit it. The point of using multiple metrics is to avoid overfitting to a single metric. Of course, no metric is perfect. A good LPIPS value does not reflect sampler performance either. Saying PSNR or SSIM is not a perfect metric does not justify the problem of using LPIPS only for the experiments. Also, PSNR and SSIM are good for SR and denoising problems, which are heavily studied in this paper.

2024-08-07

[1] Wang, Y., Yu, J. and Zhang, J. Zero-shot image restoration using denoising diffusion null-space model. International Conference on Learning Representations (ICLR) 2023.

[2] Kawar, B., Elad, M., Ermon, S. and Song, J., 2022. Denoising diffusion restoration models. Advances in Neural Information Processing Systems.

[3] Zhang, G., Ji, J., Zhang, Y., Yu, M., Jaakkola, T. and Chang, S., 2023, July. Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models. In International Conference on Machine Learning (ICML) 2023

[4] Song, J., Vahdat, A., Mardani, M. and Kautz, J., 2023, May. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations.

[5] Dou, Z. and Song, Y., 2024. Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In The Twelfth International Conference on Learning Representations.

[6] Cardoso, G., Le Corff, S. and Moulines, E., 2023. Monte Carlo guided Denoising Diffusion models for Bayesian linear inverse problems. In The Twelfth International Conference on Learning Representations.

[7] Trippe, B.L., Yim, J., Tischer, D., Baker, D., Broderick, T., Barzilay, R. and Jaakkola, T.S., Diffusion Probabilistic Modeling of Protein Backbones in 3D for the motif-scaffolding problem. In The Eleventh International Conference on Learning Representations.

[8] Durmus, Alain, and Eric Moulines. "Nonasymptotic convergence analysis for the unadjusted Langevin algorithm." (2017): Annals of Applied Probability, 1551-1587.

[9] Lamberton, Damien, and Gilles Pages. “Recursive computation of the invariant distribution of a diffusion.” (2002): Bernoulli, 367-405.

审稿意见

评分: 7置信度: 32024-07-13

This paper presents a sampling algorithm (DCPS) that leverages diffusion models as priors to solve linear inverse problems. DCPS defines and recursively solves a series of intermediate inverse problems that converge to the original inverse problem. In each iteration, a sample is drawn from the next posterior distribution following a combination of Langevin dynamics and an inhomogeneous Markov chain. DCPS shows nice results in synthetic inverse problems and also image restoration tasks.

优点

DCPS is well-motivated by intuition to mitigate the approximation error of the conditional distribution, which is supported by solid theoretical evidence in Proposition 3.1.
The experimental results on synthetic data validate the claim of accurate posterior sampling.

缺点

The experiments on images are not convincing enough. DCPS performs slightly better than DPS and DDRM, while recent works like DDNM[1], DiffPIR[2], and ReSample[3] have claimed much better posterior sampling quality on image restoration tasks but are not compared as baselines.

[1] Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. ICLR, 2023.

[2] Yuanzhi Zhu, Kai Zhang, Jingyun Liang, Jiezhang Cao, Bihan Wen, Radu Timofte, and Luc Van Gool. Denoising diffusion models for plug-and-play image restoration. CVPR, 2023.

[3] Bowen Song, Soo Min Kwon, Zecheng Zhang, Xinyu Hu, Qing Qu, Liyue Shen. Solving Inverse Problems with Latent Diffusion Models via Hard Data Consistency. ICLR, 2024.

问题

It is noted for the Gaussian mixture experiments that DCPS does not beat MCGDiff in terms of accurate posterior sampling. However, it outperforms MCGDiff by a large margin in image restoration tasks. What are the pros and cons of DCPS compared to MCGDiff? And what makes synthetic inverse problems and image restoration tasks so different?
Would it be possible to apply DCPS to more realistic inverse problems in other domains? What are the possible limitations of this work when adapted to more complicated inverse problems?

局限性

The authors address the limitations and provide possible future directions in their paper.

作者回复

2024-08-06

Dear Reviewer vN6C,

We would like to thank you for your time, your feedback and suggestions. We answer your questions below.

Further experiments: We have followed your suggestion and proceeded to compare our algorithm with DDNM [1] and DiffPIR [2] on the imaging experiment. We did not compare with ReSample [3] since it is specialized for latent diffusion, which is not the focus of the present paper and is left for future work. We have also added a discussion of these papers to the manuscript. Following the suggestions of Reviewer xni9 and af3x we have also added FPS [4] and SDA [3] to our benchmark. The results can be found in the PDF in the main rebuttal. Both algorithms you suggested show strong performance, with DDNM coming second in our benchmark, and DiffPIR scoring third in terms of median performance and fourth in terms of average. Our algorithm, DCPS, still ranks first. Finally, we would like to emphasize that it ranks first or second consistently on 14 of the 16 tasks we have considered (across low and large observation standard deviation), which is not the case for any other algorithm under consideration.

Comparison with MCGDiff: Please note that SMC-based methods like MCGDiff heavily suffer from the curse of dimensionality. More specifically, these methods iteratively evolve a set of $N$ samples approximating the final posterior. The number of samples required to obtain a good approximation to the posterior distribution should be an increasing function of the dimension of the problem and may even grow exponentially with this quantity (see [5]), requiring a prohibitive amount of memory. Typically, for a number of samples $N$ which has been chosen too small relative to the dimension, all samples typically collapse into a single one, so that the final posterior distribution is approximated by basically a Dirac mass.

In our high-dimensional experiments, we used 32 particles for MCGDiff, which is far from sufficient, but this was the maximal number we were able to tackle due to memory constraints. The returned approximation is therefore relatively poor. For the Gaussian mixture experiment, the dimension is low, and this is exactly the setting where MCGDiff performs best and is thus better than DCPS. Nonetheless, note that if the computational budget of DCPS is increased, it can still perform better than MCGDiff, as we show in the $10$ -dimensional example. To summarize, the comparison with MCGDiff provides a convincing argument in favor of DCPS; in low dimensions it comes very close to the performance of SMC methods, and, crucially, it performs much better for high-dimensional problems.

Other applications: As an example of other possible applications, we are currently applying DCPS to zero-shot conditional sampling for traffic simulation. In this setting, one has access to a diffusion model for the joint distribution of $N$ interacting agents and the aim is to be able to sample joint agent trajectories that satisfy user-specified constraints. Two notable examples are the simulation of joint trajectories that do not collide or trajectories that satisfy certain speed limits, see Table I in [6]. Both cases define specific potentials $g_0: \mathbb{R}^{N \times T \times d} \to \mathbb{R}$ where $N$ is the number of agents, $T$ is the length of the trajectory and $d$ is the number of trajectory features considered. DCPS applies successfully by considering potentials built following the same recipe as in the paper; please see the second point in the main rebuttal for more details. We have found that this general principle works well for various inverse problems and as a future work we aim to understand how to select these hyperparameters in a principled manner.

[3] Rozet et al. "Score-based Data Assimilation". In Advances in Neural Information Processing Systems. 2023.

[4] Dou, Zehao, and Yang Song. "Diffusion posterior sampling for linear inverse problem solving: A filtering perspective." _The Twelfth International Conference on Learning Representations. 2024.

[5] Bickel, Peter, Bo Li, and Thomas Bengtsson. "Sharp failure rates for the bootstrap particle filter in high dimensions." In Pushing the limits of contemporary statistics: Contributions in honor of Jayanta K. Ghosh, vol. 3, pp. 318-330. Institute of Mathematical Statistics, 2008.

[6] Zhong, Ziyuan, Davis Rempe, Danfei Xu, Yuxiao Chen, Sushant Veer, Tong Che, Baishakhi Ray, and Marco Pavone. "Guided conditional diffusion for controllable traffic simulation." In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 3560-3566. IEEE, 2023.

2024-08-11

Thanks for your detailed response to my questions. The new experimental results have resolved my previous concerns. I appreciate the discussion and clarification of the pros and cons of DCPS compared to MCGDiff. The potential applications of DCPS to realistic inverse problems are also intriguing. Overall, my opinion on this work has not changed so I will keep my rating the same (Accept).

2024-08-13

Thank you for reviewing our additional experimental results and for your feedback, which has been instrumental in improving the paper.

作者回复

2024-08-07

Firstly, we would like to sincerely thank the reviewers for taking the time to review our paper and for providing constructive feedback. Here below we address the two recurring comments across your reviews.

More competitors: Following the suggestions of Reviewer vN6C, xni9 and af3x we have added 4 more competitors; DiffPIR, DDNM, SDA and FPS. We now compare DCPS with 9 algorithms from the literature on linear inverse problems and we are happy to report that it ranks first both in average and median LPIPS. The results can be found in the attached PDF.

On the choice of potentials: The methodology we present in this paper relies on the use of different potentials that define intermediate posterior distributions and their purpose is to guide the samples towards the desired posterior distribution. This principle is widely used in the sampling literature; when aiming to sample from a given distribution $\pi$ , practitioners define a sequence $(\pi \_t)\_{t = 0} ^T$ which starts with a simple base distribution $\pi_0$ that is generally straightforward to sample, and then progressively move to the desired posterior $\pi_T = \pi$ . The design of these intermediate distributions is made on the basis that two consecutive distributions $\pi_t$ and $\pi_{t+1}$ should be close to each other in terms of some divergence or distance. Popular examples include tempering; see for example [1], or annealing; see for example [2].

In this paper we have followed a similar principle, adapted to the sampling of a posterior distribution with prior given by a pre-trained diffusion model. While the corresponding diffusion provides a natural path for sampling from the posterior, the grad log of $g^* _k$ (defined before eq. 2.5 in the paper) is in most cases intractable, rendering this path difficult to use in practice. Many works in the literature have focused on approximating it and in this paper we take a different approach and consider more general paths defined through different potential functions, as is usually done in the sampling literature. Hence, we no longer need to find an accurate approximation of $g^* _k$ and we are free to design more convenient potentials. In the paper we consider the sequence defined in eq. 3.3 and we have proposed three examples of potentials for three important problems; linear inverse problems, JPEG dequantization, and inverse problems with Poisson noise. The design of these potentials follows a single principle, which is to use an annealing of the initial potential $g_0$ applied to a rescaled input, i.e. set $g_k(x_k) = g_0 (x_k / \beta_k)^{\gamma_k}$ where $\beta_k$ and $\gamma_k$ are tunable parameters. In the examples considered in the paper we used $\beta_k = \sqrt{\alpha_k}$ for the rescaling parameter and $\gamma_k = \alpha_k$ for linear inverse problems and JPEG dequantization (with an additional approximation), while for the Poisson problem we used $\gamma_k = \sqrt{\alpha_k}$ and the normal approximation of the Poisson distribution for $g_0$ . This generic principle is also clearly inspired by tempering schemes used in the sampling literature in which many practitioners use $\pi_k(x) \propto p(y|x)^{\gamma_k} p(x)$ where $p$ is the prior and $\gamma_k$ is a sequence that tends to 1 as $k$ increases. Here in our context, with a very specific prior distribution, we add a rescaling factor that we set to $1 / \sqrt{\alpha_k}$ since during the forward process the noised states are obtained by first scaling the initial sample by $\sqrt{\alpha_k}$ and then adding noise. The annealing coefficient $\gamma_k$ is on the other hand a tunable parameter of which we leave the study for future work. A second important aspect of our work is the way we sample from this path of distributions. Practitioners in the sampling literature use either variants of the Metropolis-Hastings algorithms or sequential Monte Carlo (SMC) methods to sequentially sample from the intermediate posteriors. In our context we cannot the Metropolis-Hastings correction since we do not have access to the densities of the distribution within the path and secondly, we do not use SMC due to the curse of dimensionality. Instead, our sampling scheme relies on the structure of the prior as well as variational inference and Langevin Monte Carlo.

[1] Syed, Saifuddin, Vittorio Romaniello, Trevor Campbell, and Alexandre Bouchard-Côté. "Parallel tempering on optimized paths." In International Conference on Machine Learning, pp. 10033-10042. PMLR, 2021.

[2] Neal, Radford M. "Annealed importance sampling." Statistics and computing 11 (2001): 125-139.

最终决定Accept (poster)

2024-09-25

The paper presents a posterior sampling method that combines Langevine MC and VI and uses diffusion processes. Four reviewers provided feedback, with three of the reviewers scoring an unambiguous Accept and one reviewer raising the score during feedback to a less firm Borderline Reject. The reviewers thought the paper was well written and the idea was novel and relevant to NeurIPS. Some concern was raised about how convincing the experiments were, but the general sense was that the authors demonstrated that the idea is indeed practically useful. Therefore, the paper would be suitable for acceptance as a poster.