PaperHub
Score: 7.2 / 10
Poster · 4 reviewers
Min 3 · Max 4 · Std 0.4
Individual scores: 4, 4, 3, 4
ICML 2025

A Mixture-Based Framework for Guiding Diffusion Models

OpenReview · PDF
Submitted: 2025-01-21 · Updated: 2025-07-24

Abstract

Keywords
Diffusion Models, Guidance, Inverse Problems, Monte Carlo methods

Reviews and Discussion

Review
Rating: 4

This paper explores solving linear and nonlinear inverse problems—sampling from $p(\mathbf{x}_0|\mathbf{y})$—using pre-trained unconditional diffusion models in a Bayesian framework. To approximate the posterior, the authors iteratively sample from intermediate distributions $p(\mathbf{x}_t|\mathbf{y})$, where the prior is given by the unconditional score network, but the likelihood is intractable. Their key contribution is a Gibbs sampling-based method to approximate and sample from $p(\mathbf{x}_t|\mathbf{y})$. At each denoising step, they perform $R$ Gibbs iterations, each consisting of: (1) $G$ gradient steps to fit a variational distribution, (2) sampling from the unconditional diffusion model using $M$ DDPM steps, and (3) a closed-form sampling step from the noising process.

The method is evaluated on linear and nonlinear inverse problems in both pixel and latent space, as well as on a linear audio source separation task. It performs well on pixel-space tasks (though it sometimes underperforms competitors) and generally surpasses benchmarks in latent-space tasks.
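For concreteness, the nested structure summarized above could be sketched roughly as follows. This is only an illustrative skeleton under assumed placeholder components (the denoiser, the likelihood, the choice of $s$, and the variational objective are stand-ins), not the authors' algorithm or code:

```python
# Schematic sketch of the sampler structure described above. Everything here is a
# placeholder: the denoiser, the likelihood, the choice of s, and the variational
# objective are stand-ins, not the paper's actual algorithm or implementation.
import torch

def denoiser(x, t):
    # placeholder for the pre-trained network estimating E[x_0 | x_t]
    return x

def ddpm_steps(x_s, s, M):
    # placeholder for M unconditional DDPM steps taking x_s down to a clean x_0
    x = x_s
    for _ in range(M):
        x = denoiser(x, s)
    return x

def log_likelihood(y, x0):
    # placeholder for log g(y | x_0), e.g. a Gaussian likelihood around A(x_0)
    return -0.5 * ((y - x0) ** 2).sum()

def guided_sampler(y, timesteps, alphas_bar, R=1, G=5, M=20, lr=1e-2):
    x_t = torch.randn_like(y)
    x0 = x_t
    for t in timesteps:                                   # outer loop over noise levels
        s = max(t // 2, 1)                                # placeholder choice of s < t
        for _ in range(R):                                # R Gibbs iterations per level
            # (1) G gradient steps fitting a diagonal Gaussian variational distribution
            mu = denoiser(x_t, t).clone().requires_grad_(True)
            log_std = torch.zeros_like(mu, requires_grad=True)
            opt = torch.optim.Adam([mu, log_std], lr=lr)
            for _ in range(G):
                eps = torch.randn_like(mu)
                x_s = mu + log_std.exp() * eps            # reparameterised sample of x_s
                loss = -log_likelihood(y, denoiser(x_s, s))  # (+ prior/KL terms in practice)
                opt.zero_grad(); loss.backward(); opt.step()
            x_s = (mu + log_std.exp() * torch.randn_like(mu)).detach()
            # (2) sample from the unconditional model with M DDPM steps: x_s -> x_0
            x0 = ddpm_steps(x_s, s, M)
            # (3) closed-form renoising step from the forward process: x_0 -> x_t
            a_t = alphas_bar[t]
            x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * torch.randn_like(x0)
    return x0
```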

Questions for Authors

Besides the already mentioned questions from the previous sections (mainly from the Weaknesses section), I have a few more questions:

  • Q1: Remark A.1. from Appendix A.3 - I find it surprising that computing the MC estimate for the squared norm in (17) gives better results than computing it analytically. Any insight as to why this might be the case?
  • Q2: Why don't you provide comparisons to MGPS?
  • Q3: This might be hard to estimate, but do you think a variational approximation with a non-diagonal covariance matrix would give significant improvements in the method? Or do you think that would be too complicated for relatively small improvements?
  • Q4: Do you have any intuition of how accurate the Gaussian approximation for $\hat{\pi}^{\mathbf{y}}_{s|0, t}$ is for (possibly nonlinear) cases with non-Gaussian noise? I realise this might actually depend on how close $s$ is to $t$ and $0$, but just curious to see if you have any intuition about this.
  • Q5: For the Gibbs sampler procedure (Algorithm 1) - could you do the steps in another order? I agree that the one you chose seems to be the most natural one, but just wondering whether this is a possibility and how you think it might affect the results. In the limit, it should converge to the same distribution, right?

Update after rebuttal

I am satisfied with how the rebuttal has addressed my concerns. In particular, I found the addition of the runtime and memory comparisons with competitors valuable additions to the paper. As a consequence, I increased my score to 4 (Accept).

Claims and Evidence

The paper claims that the proposed method achieves performance on inverse problems that is either comparable to or better than related approaches. While the study includes a broad and relevant set of baselines, a few improvements could strengthen the validity of these claims:

  • Report runtime: Performance metrics alone are insufficient without the corresponding runtime. Including runtime (and memory requirements, if relevant) would provide a more complete comparison.
  • Include standard deviations / confidence intervals: Reporting only the mean values of LPIPS, PSNR, and SSIM does not fully convey the statistical significance of the results. Adding standard deviations / confidence intervals would help assess statistical significance.
  • Add FID as a metric: This is commonly used in inverse problems solved with pre-trained diffusion models, so I would have expected to see it in the image tasks.

Finally, the authors claim that a strength of the method is the possibility to adjust the number of Gibbs sampling steps $R$, besides the number of gradient steps $G$ from the variational approximation, to enhance performance. However, allocating more compute to Gibbs sampling rather than gradient steps appears beneficial only in the phase retrieval task, while for source separation, increasing $G$ seems to be the better strategy. In the other image-based experiments (aside from phase retrieval), only results for $R=1$ are reported, likely because this provided the best tradeoff between performance and runtime. While this is not necessarily an issue, the paper should more clearly specify in which tasks increasing Gibbs sampling steps leads to improvements and when prioritising gradient steps is the preferable approach.

Methods and Evaluation Criteria

The benchmark datasets do make sense, and I appreciate the fact that the authors provide experiments using both pixel- as well as latent-space models, and on both linear and nonlinear inverse problems. The audio task is also a good addition.

However, I would also suggest evaluating FID to provide a more complete assessment of generative quality. Additionally, as mentioned earlier, reporting standard deviations or confidence intervals alongside the mean metrics would help assess the statistical significance of the results.

For a more comprehensive performance analysis across diverse inverse problems, it would have been useful to include at least one example with a non-Gaussian likelihood and perhaps a setting with higher noise. However, I appreciate that computational constraints may have limited the number of experiments that could be conducted.

Theoretical Claims

Yes, I did go through the proofs and, as far as I am concerned, they are correct. However, I do believe that introducing $s$ as a state in the Gibbs sampling procedure is better motivated theoretically, but I agree that it is computationally infeasible.

The only equations I disagree with are:

  • Equation (14), where the integration in the denominator should be over $\mathrm{d}\mathbf{x}_{s', t'}$
  • The one in the appendix (line 800 in A.5), though I believe it is likely a typo.

Experimental Design and Analysis

The authors provide details on the sources of the baseline implementations, which generally seem reasonable. The tasks considered are fairly standard and have been explored in previous works, including those the authors compare against.

I do wonder, however, whether better hyperparameter choices for the baselines could lead to improved performance. While tuning baselines extensively may be beyond the scope of the paper, I was surprised by the underperformance of PGDM vs. DPS. In the linear setting, I would have expected PGDM to perform comparably to or better than DPS. If this was not observed, it might have been because of suboptimal hyperparameter settings for PGDM.

The paper also makes several heuristic choices, which I feel are decently motivated, but might benefit from some extra investigation. Some of these are mentioned in the paper, but they include: the weight sequence, the number of DDPM steps $M$, and the interaction between $R$ and $G$. However, there are so many hyperparameters that I appreciate it is hard to gain a comprehensive understanding of how each of these affects performance, and whether this is task-dependent or not.

Supplementary Material

Yes, I reviewed Appendix A (Methodology Details) and Appendix B.4 (Implementation of the competitors).

I also looked over the code implementation from the zip file of MGDM, DPS, PGDM, and the hyperparameters used for these methods.

Relation to Prior Literature

The paper contains an extensive overview of the relevant literature, highlighting the most relevant works that tackle Bayesian inverse modelling leveraging pre-trained diffusion models. The alternative likelihood approaches are well captured, the authors mention the relevant SMC approaches, and also compare to the most closely related methods based on Gibbs sampling. The detailed comparisons in Appendix A.5. are particularly useful in clarifying the distinctions between MGDM and the closely related DAPS and MGPS methods.

Missing Essential References

Not that I am aware of.

Other Strengths and Weaknesses

Strengths

  • S1: Able to handle pixel- and latent-space diffusion out of the box.
  • S2: Able to tune performance by tweaking two different hyperparameters: $R$ and $G$, although also see weakness W3.
  • S3: Clear comparison to related methods.
  • S4: Strong empirical performance in the majority of cases, although this should be analysed alongside runtime metrics and also include standard deviations besides mean metrics.

Weaknesses

  • W1: One of the main weaknesses is the lack of computational time analysis when comparing to the baselines. In my perspective, this is crucial to holistically compare different related posterior sampling methods, especially when some metrics are so similar.
  • W2: Another weakness from the evaluation side is, as mentioned above, the lack of standard deviations / confidence intervals in the results.
  • W3: Although the authors stress that the ability to increase performance through increasing $R$ is a strength of the algorithm, this doesn’t seem to be the best strategy in all cases (i.e. directing compute to gradient steps is more lucrative in source separation). In general, having too many hyperparameters can become overwhelming, especially if they are task- and domain-dependent. I am not convinced that the paper currently contains enough settings for the authors to make some clear recommendations.
  • W4: There is also limited exploration of the effect of $M$, the number of DDPM steps. Why did the authors go with $M=20$ and have other choices been explored too? Is it clear that having a fixed $M$ value is the optimal choice, rather than potentially having it depend somehow on $s$?
  • W5: All tasks consider either no noise or Gaussian noise with fixed $\sigma_y=0.05$. This does not make it clear whether the method would perform well under non-Gaussian likelihoods.

Other Comments or Suggestions

  • Line 201 right: “Treating s as fixed”
  • Line 315 right - you only compare to seven competitors.
  • The reference to Wu et al. [2024] [1] is repeated twice (2024a and 2024b). Same with Zhang et al. [2024] [2]
  • Line 303 right: “repeatedly” instead of “repeatidly”
  • Perhaps when mentioning the scaling from works such as DPS (L122 right) it is also worth highlighting that the scaling factors are heuristic, rather than very well theoretically underpinned.
  • Appendix A.2, Line 635, Equation (14) - the integration in the denominator should be over $\mathrm{d}\mathbf{x}_{s', t'}$, I think
  • Typo on line 800 - I think it should be $\pi^{\mathbf{y}}_{0|t+1}(\mathbf{x}_0|\mathbf{x}_{t+1})$
  • Typo on line 956: “their” instead of “there” variables
  • Line 1009: What do you mean by you “exposed” the coupling parameter $\rho$?
  • Typo on line 1036: “fine-grained” instead of “in-grained” + “that are more coherent”

[1] Wu, Z., Sun, Y., Chen, Y., Zhang, B., Yue, Y., and Bouman, K. L. Principled probabilistic imaging using diffusion models as plug-and-play priors. arXiv preprint arXiv:2405.18782, 2024.
[2] Zhang, B., Chu, W., Berner, J., Meng, C., Anandkumar, A., and Song, Y. Improving diffusion inverse problem solving with decoupled noise annealing. arXiv preprint arXiv:2407.01521, 2024.

Author Response

Thank you for your thorough review of our paper. We address your main weak points/questions below. The additional tables we discuss below can be found here: https://anonymous.4open.science/r/rebuttal-F9B0/rebuttal_tables.pdf

[...] at least one example with a non-Gaussian likelihood [..].

In the initial version, we prioritized extensive comparisons against multiple methods across two modalities. Following your suggestion, we have conducted preliminary experiments with Poisson noise, comparing our method against DPS due to their similar experimental setup. For MGDM, we directly employed the Poisson likelihood without resorting to the Gaussian approximation and maintained the original hyperparameters. In contrast, we used the Gaussian approximation for DPS as outlined in Eqn. (19)[arXiv version], since the Poisson likelihood proved challenging to implement effectively and the original DPS paper lacks guidelines for this scenario. The results are detailed in Table 6, clearly showing that MGDM outperforms DPS across all considered metrics. Additionally, to address potential concerns regarding MGDM’s performance under higher noise conditions, we have included an extra benchmark with a noise standard deviation of 0.3, presented in Table 5.

Report runtime [...] Add FID as a metric

We have implemented all of your recommendations. Please see Fig. 1 and Tables 1, 2, and 3 in the attached document. We refer to our response to R. E9Pa for comments on our runtime and that of the competitors. Our method consistently achieves competitive FID scores across all three evaluated models.

better hyperparameter choices

PGDM has no official implementation, and directly implementing Algorithm 1 yielded suboptimal results. To ensure a fair comparison, we instead utilized the authors' official RedDiff implementation, where the guidance term is scaled by $\alpha_{t-1}\alpha_t$ rather than solely by $\alpha_t$ (lines 70 and 73 in https://github.com/NVlabs/RED-diff/blob/master/algos/pgdm.py, with grad_term_weight=1). This modification improved PGDM's performance across most tasks, except for JPEG2, where the original formulation was superior (line 71 in our pgdm.py). Supporting this observation, Boys et al. (2023, Table 2) and Rozet et al. (2024, Table 4) also reported that PGDM generally underperforms DPS. For a fair comparison, we carefully tuned PGDM at 1000 steps to match compute budgets, despite Rozet et al.'s suggestion that fewer steps (around 100) might yield better results. Even under these conditions, PGDM at best matches DPS performance, reinforcing our conclusion that MGDM consistently outperforms both methods. We stress that we promote the reproduction of all benchmarks (as detailed in Appendix B.4 and our openly accessible codebase), to ease verification by the community.

[...] number of Gibbs sampling steps [...]

In Fig. 3 of the paper, we demonstrate that performance consistently improves as the number of Gibbs steps increases for the audio separation task, surpassing all training-free posterior sampling methods. Furthermore, increasing the number of gradient steps allows our approach to exceed Demucs' performance. Across all experiments, we've consistently observed benefits from increasing Gibbs steps, whereas additional gradient steps yield diminishing returns beyond a certain threshold. Specifically, we selected $R=1$ for the image experiments as it provides the optimal balance between computational efficiency and competitive performance relative to other baselines.

effect of $M$, the number of DDPM steps [...]

We selected $M=20$ as it offered a trade-off between computational cost and image quality. While increasing $M$ enhances image quality, it does not significantly improve posterior exploration, in contrast to Gibbs and gradient steps. However, in response to your recommendation, we conducted additional experiments to investigate how varying $M$ and the number of diffusion steps affect performance; results can be found in Figure 2. Furthermore, regarding the choice of the weight sequence, we already provide in Appendix B.1 an empirical analysis of the strategy of sampling $s$ close to $0$ at every step.

  • MC estimate of the KL: We also found this result surprising, given the performance gains observed. Specifically, we discovered that estimating the KL divergence using the same noise as for the likelihood expectation effectively reduces the variance of the gradient estimator, leading to improved results (see the illustrative sketch after this list).
  • Comparison to MGPS We have prioritized comparing well-established methods and recent contenders. Nevertheless, we commit to comparing with MGPS in the revised version of the paper.
  • non-diagonal covariance We have already tested adding a low-rank perturbation to the diagonal covariance matrix but found that it doesn’t significantly improve the result. The most significant improvement comes from using a diagonal instead of a scalar matrix.
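To illustrate the common-noise point from the first bullet above, here is a toy, self-contained sketch (an editorial construction under assumed definitions, not the paper's code) comparing the variance of a reparameterised gradient estimator when the single-sample KL term reuses the likelihood noise versus drawing fresh noise:

```python
# Toy comparison of two MC gradient estimators of an ELBO-style objective
# E_q[log p(y|x)] - KL(q || N(0, I)): (a) fresh noise for the KL sample vs.
# (b) reusing the likelihood noise (common random numbers).
import torch

def elbo_grad(mu, log_std, y, shared_noise):
    eps_lik = torch.randn_like(mu)
    x = mu + log_std.exp() * eps_lik
    log_lik = -0.5 * ((y - x) ** 2).sum()                  # toy Gaussian likelihood term
    eps_kl = eps_lik if shared_noise else torch.randn_like(mu)
    x_kl = mu + log_std.exp() * eps_kl
    # single-sample MC estimate of KL(q || N(0, I)) with q = N(mu, exp(log_std)^2)
    kl = (0.5 * x_kl ** 2 - 0.5 * ((x_kl - mu) / log_std.exp()) ** 2 - log_std).sum()
    loss = -(log_lik - kl)
    return torch.autograd.grad(loss, (mu, log_std))

torch.manual_seed(0)
mu = torch.zeros(4, requires_grad=True)
log_std = torch.zeros(4, requires_grad=True)
y = torch.ones(4)
for shared in (False, True):
    grads = torch.stack([elbo_grad(mu, log_std, y, shared)[0] for _ in range(1000)])
    print("shared noise:", shared, "grad variance wrt mu:", grads.var(dim=0).mean().item())
```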
Reviewer Comment

Thank you for your thorough and well-structured rebuttal. The additional results and explanations, particularly the runtime and memory comparisons with competitors, are valuable contributions to the paper. I strongly encourage you to incorporate some of the discussion points from your response to reviewer E9Pa into the final manuscript.

Thank you for performing the additional experiments with the non-Gaussian likelihood and higher noise level. "the Poisson likelihood proved challenging to implement effectively and the original DPS paper lacks guidelines for this scenario"---This is indeed in line with my experience and the experience of other works that attempted to apply DPS for Poisson likelihoods.

"[...] number of Gibbs sampling steps [...]" This is the only point I would like further clarification on. The way I interpret Figure 3 (left) for the audio separation task is that:

  1. Increasing the number of Gibbs sampling steps $R$ does indeed generally lead to better performance (although it seems to plateau after $R>4$). In some cases, $R=4$ slightly outperforms $R=6$, though the difference may not be statistically significant.
  2. However, the last column uses $R=1$ Gibbs sampling steps and a number of gradient steps $G$ that makes the runtime the same as $R=6$. The last column is the one that gives the best results overall. Doesn't this mean that for a fixed compute budget (equivalent to $R=6$) the best strategy is to only use $1$ Gibbs sampling step and use the rest for gradient steps?

Overall, I am satisfied with how the rebuttal has addressed my concerns and would be willing to increase my score to 4 (Accept). However, I think the system does not currently allow for score adjustments.

Author Comment

Thank you! We are glad that our rebuttal has addressed your concerns.

Regarding your comment on the number of Gibbs steps, we agree with your interpretation. While increasing the number of gradient steps can sometimes lead to the best performance, this strategy is not consistently optimal across tasks. In contrast, increasing the number of Gibbs steps tends to yield more reliable improvements and is therefore our recommended default for practitioners, particularly when tuning is limited. We will clarify this guidance in the final version of the paper and include additional examples to illustrate this recommendation more concretely.

We would also like to thank you for your decision to increase your score to Accept. It is now possible to modify the score by clicking the edit button on your original review.

Review
Rating: 4

The paper presents a novel training-free guidance method that allows one to sample from g(y|x_0)p(x_0), where p(x_0) is a pre-trained diffusion model distribution and g(y|x_0) is a likelihood function on the clean data. To do this, they come up with a novel approach to approximate the conditioned noisy distributions p(x_t|y) given the unconditional model, and on top of that, a new method to calculate the scores from these density approximations. The core of the method is that when sampling at diffusion noise level x_t, there is an inner loop that samples some s<t, and does Gibbs sampling from p(x_0, x_s, x_t)g(y|D(x_s)). This process defines a specific distribution over x_t, and the outer loop sampling process consists of moving through this sequence of distributions, until we hit p(x_0)g(y|x_0) at the end. To be more precise, the target distribution at each level is a mixture of the p(x_0, x_s, x_t)g(y|D(x_s)) distributions, with different probabilities for different s, and this is where the paper derives its name. The method achieves strong performance on a variety of linear and nonlinear inverse problem tasks on image data and multi-source audio separation. The method is also applicable for use with latent diffusion models. The authors also find that the Gibbs sampling procedure provides new ways to improve performance by applying more inference-time compute.

Questions for Authors

The method is quite complex, and it is a priori somewhat unclear why would we use this, instead of, e.g., the methods mentioned in Appendix A.5. Why do we introduce the intermediate s-timestep in the first place? Why do we use a mixture of timesteps s instead of a fixed s schedule?

Do you have an intuition on why sampling s close to 0 is a source of instabilities?

Claims and Evidence

I think most of the claims are supported by evidence.

Methods and Evaluation Criteria

The evaluation criteria make sense for the application at hand, and the method is sensible as well. The method does, however, have some complexity, and the motivation for the particular choices in the method is not entirely clear to me.

Theoretical Claims

I didn’t go through the details of the method very closely in the Appendix (although the main new calculation-heavy part seems to be Appendix A.2). I think I have understood the mathematical definition of the model, and it is sensible to me.

Experimental Design and Analysis

The main experiments on imaging inverse problems are standard benchmarks in the field, and are sound applications to focus on. One issue is that I did not find a comparison of runtime or neural function evaluation count for the different methods. As pointed out in the paper, they are able to improve results with more inference-time compute, but this also holds for the other methods (at least by increasing the amount of diffusion steps for methods like DPS, PGDM, DDNM and DiffPIR, and by increasing the amount of optimization steps in methods like Reddiff). It is useful to be able to push the results further than previous methods in absolute terms, but ideally we would also have an analysis of the different methods at different levels of compute to clearly distinguish the regime in which the method provides improvements over prior work. In case an extensive evaluation of all of the competing methods at different compute requirements is difficult, I think that it would at least be important to include the compute requirements for producing the results. Based on Table 5 in the Appendix and my understanding of the method, in pixel-space FFHQ, it seems that the method is using on the order of >2700 forward passes and >700 backward passes of the denoiser?
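One rough way to account for the denoiser evaluations is the following back-of-the-envelope sketch. The decomposition of the loop and the concrete numbers are assumptions made here for illustration only (the actual hyperparameters are in the paper's Table 5):

```python
def nfe_estimate(K, R, G, M):
    """Rough NFE accounting, assuming K outer diffusion steps, each running R Gibbs
    sweeps of G gradient steps (one forward + one backward denoiser pass each) plus
    M unconditional DDPM forward passes."""
    forward = K * R * (G + M)
    backward = K * R * G
    return forward, backward

# Hypothetical values, chosen only so the totals land in the ballpark quoted above:
print(nfe_estimate(K=100, R=1, G=7, M=20))   # -> (2700, 700)
```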

Another part I found a bit lacking was the discussion on the hyperparameters K, R, M and G (regular diffusion steps, Gibbs repetitions, denoising steps in the Gibbs inner loop for p(x_0|x_s), and gradient steps for approximating q(x_s|x_0,x_t)p(y|g(D(x_s)))). Would it be possible to provide a bit more thorough evaluation of scaling the inference-time compute along these different axes? This would be especially interesting considering that the authors found increasing R in the imaging inverse problem case to be more useful than increasing G, and vice versa for the audio source separation case.

Supplementary Material

I briefly glanced through appendices A.1, A.2, A.3, A.4., A.5., and looked at the additional results in Table 6 and the hyperparameter choices in Table 5.

Relation to Prior Literature

The paper continues exploring new methods for applying inference-time conditions on diffusion models without changing the denoiser network. Many works in the area have focused on directly approximating the modifications needed to the diffusion model score function, e.g., through Tweedie’s formula (e.g., [1], [2]), allowing the use of standard diffusion samplers. Other works, including this one, remove the requirement of using standard diffusion sampling processes, and instead redefine the sequence of marginal distributions, using MCMC-like methods to move to lower noise levels. This paper is in the latter category, and, to the best of my knowledge, proposes a novel method in this area. The central idea of introducing a triplet of timesteps (0,s,t) to update x_t to match with the constraint, using approximations at the s stage, is new, although related to some earlier work (as detailed in Appendix A.5 and the related works section “Gibbs sampling approaches”).

[1] Chung, Hyungjin, et al. "Diffusion Posterior Sampling for General Noisy Inverse Problems." The Eleventh International Conference on Learning Representations. [2] Song, Jiaming, et al. "Pseudoinverse-guided diffusion models for inverse problems." International Conference on Learning Representations. 2023.

Missing Essential References

I am not aware of key contributions in this domain that are missing, although I was not very familiar with the DAPS and MGPS algorithms cited as the closest related work.

Other Strengths and Weaknesses

I think that the idea of enlarging the design space of conditional samplers by considering triplets (x_0,x_s,x_t), performing approximations in strategic locations of the algorithm, and defining custom Gibbs samplers, appears sensible and is original. The quantitative results against baselines are good as well, and the method appears promising.

A negative (aside from the experiments mentioned before) is that the model description was somewhat difficult to read, and the paper would benefit from improving the writing. For instance, having an overview and stating the key approximations up front before going to the weeds of the algorithm description could be useful. A more clear explanation of the method and the problems it tackles in the introduction would be useful as well.

On the actual algorithm side, the motivation for the particular design choices was left a bit lacking (see questions). The details of how the s distribution is chosen is also not particularly elegant, which is not a major issue since there is some analysis on it. Due to the concerns raised, I am starting out with a weak reject, but I am open to adjust after the rebuttal.

Other Comments or Suggestions

This is not a key concern, but it is mentioned in the Related Works that “Finzi et al., Stevens et al., Boys et al. use that the covariance of p(x_0|x_t) is proportional to the Jacobian of the denoiser… To mitigate this, these works and subsequent ones assume that the Jacobian of the denoiser is constant with respect to x_t.” I don’t think this is exactly true: at least Finzi et al. and Boys et al. do use the network Jacobian, and as such the covariance is not constant w.r.t. x_t. This is also true in subsequent work that does not use the network Jacobian for the covariance approximation ([1] and [2]).

[1] Peng et al. "Improving Diffusion Models for Inverse Problems Using Optimal Posterior Covariance." ICML 2024.

[2] Rissanen et al. "Free Hunch: Denoiser Covariance Estimation for Diffusion Models Without Extra Costs." ICLR 2025.

Author Response

Thank you for your thorough review. We address the main weak points/questions below. The additional tables and figures mentioned below can be found here: https://anonymous.4open.science/r/rebuttal-F9B0/rebuttal_tables.pdf

comparison of runtime[...]

Although initially not mentioned (now corrected), we used the maximum diffusion steps (1000) for DPS, PGDM, DDNM, DiffPIR, and PSLD. For methods like RedDiff, DAPS, ReSample, and PnP-DM, we tuned the compute time by increasing Langevin/denoising/optimization steps until performance plateaued, ensuring optimal competitor performance. Hence, we are confident that we have pushed each competitor to their optimal performance limit. Please see Figure 1 for a figure summarizing the runtime and memory costs of each method. This figure will also be included in the revised version of our paper. Note that our method has similar memory demands as DPS and PGDM on pixel space, while in latent space its memory usage aligns with the other methods. Notably, in the latent-diffusion setting—which is particularly relevant given the prevalence of latent-space models—our method outperforms all others in speed while maintaining consistently strong performance across all benchmarks. In pixel space, our method is slower, but this comes with the benefit of consistent and strong performance. Please see our comment to Reviewer E9Pa regarding the runtime of the competitors.

evaluation of scaling the inference-time [..]

We thank the reviewer for the valuable suggestion. We conducted additional experiments analyzing how performance evolves with the hyperparameters $K$, $R$, $M$, $G$—representing regular diffusion steps, Gibbs repetitions, denoising steps in Gibbs for $p(x_0|x_s)$, and gradient steps for approximating $q(x_s|x_0,x_t)\,p(y|g(D(x_s)))$, respectively. Specifically, we extended Figure 3 to explore increased compute along different axes for the phase retrieval task (see Figure 2). We concluded that scaling diffusion, denoising, or gradient steps has less impact for this task than increasing Gibbs steps.

model description [...]

Thank you for your suggestion. We acknowledge that some parts of our presentation could be clearer. To improve accessibility, we've reorganized Section 3, which now begins with a concise conceptual overview before detailing the algorithm step-by-step. We have also added the main intuition behind the latent variables $x_0$, $x_s$, and $x_t$ and their iterative evolution through denoising, noising, and variational updates.

Why do we introduce the intermediate s-timestep [...]

Our method introduces two latent variables $x_s, x_0$, enabling updates without relying on unrealistic approximations of the denoising distributions. This contrasts with approaches like DAPS, which approximate the posterior distribution $x_0 \mid x_t$ using a Gaussian distribution parameterized by a tunable covariance. Furthermore, our method leverages likelihood approximations derived from earlier diffusion steps $s < t$, a perspective that initially motivated our algorithm. Additionally, framing our approach through the lens of Gibbs sampling provides a novel dimension for performance enhancement. These three aspects—accurate latent variable integration, leveraging historical likelihood approximations, and exploiting Gibbs sampling insights—are currently underexplored and could significantly benefit the research community.

“Why is sampling s close to 0 a source of instabilities?” + “Why do we use a mixture of timesteps s instead of a fixed s schedule?”

As we explain in Appendix B.1, sampling $s$ close to 0 leads to an accurate likelihood approximation, which is particularly important for latent diffusion. This allows us to fit the observation quickly; however, the unobserved part of the state evolves very slowly, leading to a poor reconstruction of the unobserved part of the state; see Figure 4. This occurs because the prior $q_{s|0, t}(\cdot \mid x_0, x_t)$ is concentrated around $x_0$ when $s$ is close to 0. Conversely, when $s$ is far from 0, the likelihood approximation is less accurate but the prior becomes less constrained. Using a uniform mixture balances these opposing behaviors effectively, with minimal hyperparameters.

[...] Boys et al. do use the network Jacobian, and as such the covariance is not constant w.r.t. $x_t$

In Boys et al., the likelihood approximation (Eqn. (11), arXiv version) explicitly assumes that the covariance—and consequently the Jacobian of the denoiser—is constant w.r.t. $x_t$. Without this assumption, differentiating Eqn. (10) would not lead directly to Eqn. (11). Indeed, Boys et al. explicitly mention: ‘For this reason, we treat the matrix $C_{0|t}$ as constant w.r.t. $x_t$ when computing the gradient.’ Similarly, [2] employs the same assumption (see Eqn. (23), OpenReview version, and the corresponding commentary).
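A minimal sketch of this "covariance held constant" point, using placeholder functions rather than Boys et al.'s actual code: the likelihood gradient is taken through the denoiser mean, while the covariance estimate is detached so that its own dependence on $x_t$ is ignored.

```python
# Placeholder denoiser/covariance; only the detach() on the covariance matters here.
import torch

def denoiser_mean(x_t):
    # placeholder for the posterior mean E[x_0 | x_t]
    return torch.tanh(x_t)

def covariance_diag(x_t):
    # placeholder for a diagonal estimate of Cov(x_0 | x_t)
    return 0.1 * torch.ones_like(x_t)

def guidance_grad(x_t, y, sigma_y=0.05):
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser_mean(x_t)
    C = covariance_diag(x_t).detach()                 # treated as constant w.r.t. x_t
    log_lik = -0.5 * ((y - x0_hat) ** 2 / (sigma_y ** 2 + C)).sum()
    (grad,) = torch.autograd.grad(log_lik, x_t)       # gradient flows only through x0_hat
    return grad

print(guidance_grad(torch.randn(4), torch.zeros(4)).shape)   # torch.Size([4])
```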

Again, we thank the reviewer for the valuable feedback. Please let us know if you have any further questions.

Review
Rating: 3

To resolve the error in the approximated likelihood gradient of diffusion posterior sampling (DPS) and related works, the paper defines a novel posterior density $p(x_t|y)$ as a mixture of the normalized $\hat p_s(x_t|y) = \hat p_s(y|x_t)\,p(x_t)$, where $\hat p_s(y|x_t) = \int \hat p(y|x_s)\,p_{s|t}(x_s|x_t)\,\mathrm{d}x_s$, $\hat p(y|x_s) = p(y \mid \mathbb{E}[x_0|x_s])$, and $0 < s \leq t-1$. To sample from this intractable density, the paper also leverages Gibbs sampling with a Gaussian variational approximation.
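Spelled out as a display (with the mixture weights $w_s$ used as shorthand here; they are not notation from the paper), the target density described above reads:

```latex
\pi_t^{\,y}(x_t) \;=\; \sum_{s=1}^{t-1} w_s \,\hat p_s(x_t \mid y),
\qquad
\hat p_s(x_t \mid y) \;\propto\; p(x_t)\int p\bigl(y \mid \mathbb{E}[x_0 \mid x_s]\bigr)\, p_{s\mid t}(x_s \mid x_t)\,\mathrm{d}x_s .
```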

Update after rebuttal

Regarding the first drawback in the evaluation: (2) Heavy reliance on a single metric (LPIPS) — the authors emphasize the inadequacy of pixel-wise metrics (PSNR and SSIM) for evaluating the specific use case of inverse problems. While the reviewer agrees that pixel-wise metrics can favor blurry or overly smooth outputs, these metrics remain important for assessing low-frequency information, such as color accuracy. Consequently, inverse problem solvers [1,2] typically report both pixel-wise and perceptual metrics, rather than relying on a single type. From this perspective, the reviewer believes that the evaluation in the paper places excessive emphasis on perceptual metrics.

For the other points, the major concerns have been resolved convincingly.

In summary, the reviewer thanks the authors for their efforts in the rebuttal and maintains the score of 3.

[1] Direct Diffusion Bridge using Data Consistency for Inverse Problems, NeurIPS 2023

[2] Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing, CVPR 2025

给作者的问题

  • Could authors provide an analysis on efficiency of the proposed method compared with baselines? For example. runtime and memory cost.
  • Could authors evaluate the FID for provided experiments?

Claims and Evidence

The main claim is that the proposed mixture approximation of $p_t(x|y)$ is better than using a likelihood approximation of $p_t(y|x)$, as proposed by prior works, in two respects: it is more principled and it adapts to the computational budget. The paper provides evidence via extensive experiments.

Methods and Evaluation Criteria

The paper evaluates the method on widely used benchmark datasets: FFHQ and ImageNet. Furthermore, the authors provide an extensive range of linear and non-linear problems. Drawbacks of the evaluation are that (1) the FID metric is missing and (2) the analysis depends heavily on a single metric, LPIPS.

Theoretical Claims

The reviewer has checked mathematical details given in the appendix.

Experimental Design and Analysis

The reviewer checked the validity of the experimental design. It follows prior works on diffusion-based inverse problem solvers. However, the paper does not provide an analysis of its efficiency, even though it introduces Gibbs sampling with variational optimization.

Supplementary Material

The reviewer has checked details on algorithm, related works, experimental details and additional results.

Relation to Prior Literature

Inverse problems are applicable to a wide range of scientific problems. Thus, improving performance on inverse problems also has further impact on the scientific literature.

Missing Essential References

The paper includes essential citations and related works.

Other Strengths and Weaknesses

Strengths

  • The paper is written clearly and its motivation is clear.
  • The performance of the proposed method is promising.

Weaknesses

  • An analysis of efficiency is missing.
  • The metrics reported in the main paper focus only on a single perceptual metric, LPIPS.

Other Comments or Suggestions

No minor comments.

Author Response

Thank you for your feedback. We appreciate your acknowledgment of the paper's clarity and the promising nature of our method. Below, we directly address your key points and questions, supported by additional results. The supplementary tables and figures mentioned can be accessed here: https://anonymous.4open.science/r/rebuttal-F9B0/rebuttal_tables.pdf.

Drawbacks in evaluation: 1) Missing evaluation metric FID, and 2) Heavy reliance on a single metric (LPIPS).

First, we refer the reviewer to Appendix B.6, where we show that LPIPS is particularly suitable for our tasks, where reconstructions naturally deviate significantly from the reference images. On the other hand, pixel-based metrics (such as PSNR or SSIM) are inadequate for capturing meaningful perceptual differences in such scenarios, potentially leading to misleading conclusions. For instance, Figure 8 in the original manuscript clearly illustrates this limitation. Although reconstructions from methods like DAPS and DiffPIR appear overly smooth and lack coherence, they nevertheless achieve superior PSNR and SSIM scores, highlighting the inadequacy of these pixel-wise metrics for evaluating our specific use case.

Nevertheless, we acknowledge the importance of evaluating our method with diverse metrics, and therefore we have included FID scores in the revised evaluation. Please refer to Tables 1, 2, and 3 in the provided PDF link for detailed results. These additional results confirm that our method remains highly competitive in terms of FID scores across all three models evaluated.

Could authors provide an analysis on efficiency of the proposed method compared with baselines? For example, runtime and memory cost.

Please see Figure 1 in the linked PDF, which summarizes both runtime and memory consumption across our method and the relevant baselines. This figure will be integrated into the revised manuscript.

Our method has memory requirements similar to DPS and PGDM in pixel space and aligns closely with other methods in latent space. Importantly, in latent diffusion—a highly relevant scenario given the prevalence of latent-space models—our method is notably faster than all competitors while consistently achieving strong performance across benchmarks. Conversely, when operating directly in pixel space, our method exhibits somewhat slower runtimes compared to some alternatives. However, this increase in computational overhead is consistently balanced by improved and stable reconstruction quality across all considered tasks. Thus, we position our method as offering a beneficial trade-off, especially in scenarios where quality and consistency of results are paramount.

We highlight several important points regarding the competitors’ runtime:

  • For latent diffusion, DAPS and PnP-DM perform a significant number of Langevin steps using the gradient of the likelihood. Since the latter involves a vector-Jacobian product of the decoder, the runtime increases significantly (see the toy sketch after these bullets). More generally, when the likelihood function is expensive to evaluate, DAPS and PnP-DM are expected to be much slower than DPS, PGDM and MGDM.

  • For PnP-DM on FFHQ, we have implemented the likelihood step exactly on the linear tasks. On ImageNet however, using Langevin steps provided better results and this explains the significant increase in runtime.
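As a toy illustration of the first bullet above (an assumed, simplified setup, not code from any of the compared methods), each likelihood-gradient step in latent space involves a vector-Jacobian product through the decoder, i.e. an extra decoder backward pass per Langevin step:

```python
# Toy latent-space likelihood gradient: each call performs one decoder forward pass and
# one backward pass (the vector-Jacobian product), which is what drives up the runtime.
import torch

decoder = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 256)
)

def likelihood_grad(z, y, sigma=0.05):
    z = z.detach().requires_grad_(True)
    x = decoder(z)                                       # decoder forward pass
    log_lik = -0.5 * ((y - x) ** 2).sum() / sigma ** 2   # Gaussian likelihood on decoded x
    (grad_z,) = torch.autograd.grad(log_lik, z)          # VJP through the decoder (backward pass)
    return grad_z

z, y = torch.randn(1, 16), torch.randn(1, 256)
print(likelihood_grad(z, y).shape)   # one such VJP is needed at every Langevin step
```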

Thank you very much for your feedback. Please let us know if you have any further questions. We kindly refer you to our responses to Reviewers awaZ and SCZR, where we’ve provided additional experiments and metrics that further support our claims.

Review
Rating: 4

The paper introduces the Mixture-Guided Diffusion Model (MGDM) algorithm to improve likelihood approximation in diffusion models using Gibbs sampling. The approach constructs a mixture approximation of the intermediate posterior distributions to address the lack of closed-form likelihood scores. A data augmentation scheme uses Gibbs updates. MGDM is adaptable to computational resources by adjusting the number of Gibbs iterations, with a higher number of iterations yielding improved performance. The method shows improved results across different image-restoration tasks and musical source separation. This manuscript is well written!

Questions for Authors

Discussed above.

Claims and Evidence

Please see the summary section.

Methods and Evaluation Criteria

The methods, evaluation datasets, and criteria are sufficiently represented. The manuscript closely follows previously published works.

Theoretical Claims

A Gibbs sampler with a data augmentation scheme is used so that the mixture posterior is the stationary distribution; its conditionals are used to obtain samples from the posterior distribution. The proof is provided in Appendix A.

Experimental Design and Analysis

The experimental designs are appropriate for the defined problem, and are similar to previously published results.

Supplementary Material

The experimental details presented in the supplementary material were helpful to get better insights.

Relation to Prior Literature

The use of denoising diffusion models for Bayesian inverse problems is a well-studied problem. The authors have done an excellent job in comparing with SotA approaches in this domain.

Missing Essential References

Not that I am aware of.

Other Strengths and Weaknesses

A few topics that the authors could discuss:

  1. Gibbs sampling can have convergence issues, even if you run more iterations. How do you address it?
  2. Is there diversity in generated samples?
  3. How to address the scaling issues of the algorithm?

Other Comments or Suggestions

None.

Author Response

We thank the reviewer for their constructive comments and helpful suggestions. Below, we directly address the points raised:

1. Gibbs sampling can have convergence issues, even if you run more iterations. How do you address it?

We acknowledge that Gibbs sampling can indeed exhibit convergence challenges, especially in high-dimensional or strongly correlated distributions. In fact, in Appendix B.1 we show that for one specific choice of weight sequence, convergence issues arise. More specifically, when the weight sequence is designed to sample the index $s$ close to $0$ at all iterations, MGDM mixes very slowly due to the high correlation between $x_0$ and $x_s$, since $s \approx 0$. This motivates our approach of using intermediate uniform mixture posteriors, which bypasses these correlation issues and reduces dependence on the precise convergence of each Gibbs iteration, thereby enhancing overall robustness.

2. Is there diversity in generated samples?

Our empirical results confirm that MGDM generates diverse samples. This is qualitatively evident in the visual samples provided in our supplementary materials; see, for example, Figure 2 and Figures 7–12.

3. How to address the scaling issues of the algorithm?

We interpret the reviewer’s question as referring to memory scaling. Our algorithm’s memory footprint is comparable to DPS and PGDM, but higher than methods like DiffPIR and DDNM. This increased memory usage enables superior reconstruction quality, particularly for inpainting and outpainting (Figures 5, 6, and 8 of the original manuscript). Achieving similar quality with lower memory requirements remains an open research question. Importantly, on latent diffusion, our memory footprint is the same as all the other methods.

Thank you very much for your feedback. We kindly refer you to our responses to Reviewers awaZ and SCZR, where we provide additional experiments and metrics that further support our claims.

Final Decision

This paper received acceptance recommendations from all reviewers in the final recommendations. It presents a novel contribution by introducing the Mixture-Guided Diffusion Model, which aims to enhance likelihood approximation in diffusion models through the application of Gibbs sampling. While the reviewers initially raised concerns regarding the clarity of the method's explanation and the lack of an efficiency analysis, the authors' rebuttal successfully addressed many of these issues. This ultimately led to a consensus among the reviewers in favor of acceptance. As a result, the Area Chair has decided to accept this paper.