Improving Diffusion Models for Inverse Problems Using Optimal Posterior Covariance
Abstract
Reviews and Discussion
The paper is structured in two parts. In the first part, a framework is proposed that unifies prior diffusion-model-based inverse problem solvers (excluding the DDRM family). It is shown that even the approaches that did not explicitly aim for a Gaussian approximation of the posterior implicitly make this assumption, and that the only difference lies in the computation of the conditional posterior mean. In the second part, several methods for computing the optimal diagonal covariance of the variational Gaussian posterior are proposed. Numerical experiments validate that deriving the gradient step sizes from these optimal covariances improves upon heuristically chosen step sizes.
Strengths
- Overall, the paper is well-written, clear, and concise, with solid experiments.
- This is the first work showing that [3,4] can also be interpreted as approximating the posterior with a Gaussian, similar to [1,2]. This connection is non-trivial, and it is useful that one can understand [1-4] in a unified framework.
- Theorem 1 is useful, as many previous works leverage a pre-trained model that can estimate the reverse diagonal covariance but simply discard it during inference. It is good that one can leverage this unused information to acquire the optimal posterior covariance through variational inference. Moreover, the proof obtained from variational calculus is clean.
- Previous approaches [1,2,4] have hyper-parameters that are hard to decipher and choose. The proposed method partially alleviates this issue. (It is only partially alleviated, since the method still resorts to the pre-configured hyperparameters of [2,4] at high noise levels.)
- On the practical side, this is the first time I have seen a diffusion-model-based inverse problem solver built on a second-order solver. All previous works that I am aware of use DDIM/DDPM sampling.
References
[1] Chung, Hyungjin, et al. "Diffusion posterior sampling for general noisy inverse problems." ICLR 2023.
[2] Song, Jiaming, et al. "Pseudoinverse-guided diffusion models for inverse problems." ICLR 2023.
[3] Wang, Yinhuai, Jiwen Yu, and Jian Zhang. "Zero-shot image restoration using denoising diffusion null-space model." ICLR 2023.
[4] Zhu, Yuanzhi, et al. "Denoising Diffusion Models for Plug-and-Play Image Restoration." CVPRW 2023.
Weaknesses
- The construction of the theory mostly follows [2] (except that the proof is simpler), and hence the contribution could be limited.
- Proposition 1 was also proposed in [1], derived through the DPS approximation. It is worth citing and discussing this.
- For readers' convenience, it is worth highlighting that Type II circumvents the gradient computation that requires expensive backpropagation, since a closed-form solution for the conditional posterior mean exists; this is the key difference between the two types.
- It is unclear what the authors mean by the first sentence of the paragraph "Posterior variances prediction" in Section 5.1. By ground-truth square errors, do they mean that they computed the pixel-wise variance from multiple posterior means obtained through the denoiser? Clarification is needed. If so, should this really be called ground truth?
- There is a discrepancy between theory and practice. For one, the authors use a stochastic sampler for Type II, which would not be sampling from (4). How bad are the samples if one simply uses Type II with a deterministic sampler? Is there any reason why a deterministic sampler would under-perform in this case?
- How bad are the samples if one uses the predicted optimal posterior covariance for all the timesteps?
References
[1] Ravula, Sriram, et al. "Optimizing Sampling Patterns for Compressed Sensing MRI with Diffusion Generative Models." arXiv preprint arXiv:2306.03284 (2023).
[2] Bao, Fan, et al. "Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models." ICLR 2022.
Questions
- For estimating the optimal posterior variance with (24), how long does this computation take? Is it a bottleneck for inference?
- The authors state that they only use the proposed variance at low noise levels. Out of 50 steps, how many steps does this correspond to?
- DPS is known to excel with 1000 NFE, ΠGDM with 100 NFE, etc. Is there any specific reason why the authors chose 50 NFE?
- (p. 12, A.1, first sentence) "here we proof" --> "here we prove"
- Is this notation defined somewhere in the paper?
Q3: Why do we choose 50 sampling steps?
We implement our techniques on top of the open-source diffusion codebase k-diffusion (https://github.com/crowsonkb/k-diffusion), where 50 sampling steps is the default choice. To further demonstrate the effectiveness of our method, we conducted additional experiments with 100 and 1000 sampling steps using Euler's first-order (deterministic) sampler, following the conventional choices of DPS and ΠGDM, as shown in the following tables.
Inpainting:
| Method (100 steps) | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| DPS | 33.51 | 0.9111 | 0.1000 | 30.03 |
| ΠGDM | 31.09 | 0.8807 | 0.1474 | 53.66 |
| Analytic (Ours) | 33.79 | 0.9281 | 0.0921 | 32.85 |
| Convert (Ours) | 33.72 | 0.9294 | 0.0869 | 29.78 |
| Method (1000 steps) | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| DPS | 32.96 | 0.9166 | 0.0909 | 31.20 |
| ΠGDM | 31.16 | 0.8818 | 0.1419 | 50.57 |
| Analytic (Ours) | 33.70 | 0.9268 | 0.0897 | 31.92 |
| Convert (Ours) | 33.60 | 0.9281 | 0.0839 | 29.34 |
Gaussian deblurring:
| Method (100 steps) | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| DPS | 26.79 | 0.7379 | 0.2553 | 83.36 |
| ΠGDM | 27.90 | 0.8013 | 0.1898 | 60.41 |
| Analytic (Ours) | 27.99 | 0.8053 | 0.1829 | 59.19 |
| Convert (Ours) | 27.96 | 0.8040 | 0.1818 | 58.25 |
| Method (1000 steps) | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| DPS | 27.15 | 0.7685 | 0.2011 | 61.53 |
| ΠGDM | 27.70 | 0.7973 | 0.1904 | 60.75 |
| Analytic (Ours) | 27.80 | 0.8012 | 0.1827 | 59.57 |
| Convert (Ours) | 27.85 | 0.8024 | 0.1837 | 58.37 |
Motion deblurring:
| Method (100 steps) | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| DPS | 22.08 | 0.6086 | 0.3597 | 122.45 |
| ΠGDM | 26.92 | 0.7681 | 0.2141 | 64.69 |
| Analytic (Ours) | 26.96 | 0.7704 | 0.2124 | 64.18 |
| Convert (Ours) | 26.98 | 0.7705 | 0.2103 | 64.12 |
| Method (1000 steps) | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| DPS | 25.94 | 0.7331 | 0.2305 | 74.09 |
| ΠGDM | 26.85 | 0.7637 | 0.2136 | 64.61 |
| Analytic (Ours) | 26.85 | 0.7637 | 0.2110 | 62.35 |
| Convert (Ours) | 26.90 | 0.7668 | 0.2103 | 61.78 |
Super resolution:
| Method (100 steps) | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| DPS | 27.56 | 0.7903 | 0.1883 | 57.79 |
| ΠGDM | 27.70 | 0.7970 | 0.1986 | 63.50 |
| Analytic (Ours) | 27.78 | 0.7995 | 0.1952 | 65.31 |
| Convert (Ours) | 27.78 | 0.8001 | 0.1963 | 65.61 |
| Method (1000 steps) | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| DPS | 26.59 | 0.7665 | 0.2068 | 70.49 |
| ΠGDM | 27.60 | 0.7940 | 0.1985 | 63.15 |
| Analytic (Ours) | 27.65 | 0.7979 | 0.1954 | 64.08 |
| Convert (Ours) | 27.67 | 0.7968 | 0.1957 | 62.37 |
Q4 & Q5: Grammar issues
We thank the reviewer for the suggestions. We have fixed these issues in our revised version.
References
[1] Song, Y., et al. (2020). "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456.
[2] Karras, T., et al. (2022). "Elucidating the design space of diffusion-based generative models." arXiv preprint arXiv:2206.00364.
[3] Lu, Cheng, et al. "Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps." Advances in Neural Information Processing Systems 35 (2022): 5775-5787.
[4] Bao, F., et al. (2022). "Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models." arXiv preprint arXiv:2201.06503.
[5] Bao, Fan, et al. "Estimating the optimal covariance with imperfect mean in diffusion probabilistic models." arXiv preprint arXiv:2206.07309 (2022).
I appreciate the authors' thorough response. As pointed out in my initial review, I think this paper is a good contribution to the field. Therefore, I have raised my score.
W6: How bad are the samples when using optimal variances for all steps
The experimental results for Type I guidance using Analytic-type posterior variances for all sampling steps (referred to as Analytic (all steps)) are shown in the following tables and have been included in Appendix E of our revised version. We do not conduct experiments for Convert-type posterior variances, since we cannot obtain accurate posterior variance predictions using Equation 22 in the high-noise-level region (see 5.1 Validation of theoretical results, posterior variances prediction).
Inpainting:
| Method | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| Analytic (all steps) | 33.97 | 0.9242 | 0.0832 | 25.77 |
| Analytic | 33.76 | 0.9272 | 0.0845 | 28.83 |
Gaussian deblurring:
| Method | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| Analytic (all steps) | 26.11 | 0.6889 | 0.2995 | 99.59 |
| Analytic | 27.71 | 0.7926 | 0.1850 | 53.09 |
Motion deblurring:
| Method | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| Analytic (all steps) | 21.77 | 0.5795 | 0.3917 | 149.90 |
| Analytic | 26.73 | 0.7579 | 0.2183 | 64.77 |
Super resolution:
| Method | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| Analytic (all steps) | 27.31 | 0.7763 | 0.2016 | 56.90 |
| Analytic | 27.58 | 0.7878 | 0.1968 | 59.83 |
As can be seen, while using optimal variances at all steps performs well on tasks such as inpainting and super-resolution, there is a significant performance decrease on the deblurring tasks. Example samples are given in the file "debluring_with_analytic_variances_for_all_steps.png" in the Supplementary Material. Our solution, which uses the optimal variance only in regions with low noise levels, demonstrates robust performance across all tasks, including the new colorization task brought up by Reviewer rzGL.
Q1: Computational cost of equation 24
Similar to [4], a small number of Monte Carlo samples is typically adequate for obtaining an accurate estimate. With a pre-trained unconditional model and the training dataset, we compute the variances offline and deploy them alongside the pre-trained unconditional model. As a result, the online inference cost of our method is the same as that of previous methods. In our paper, we initially calculate the variances using a randomly selected subset of the training set and reuse them for all our experiments. This process takes a couple of hours on a single 1080Ti GPU.
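For concreteness, a minimal sketch of what such an offline Monte Carlo pre-computation could look like (all names are hypothetical; `denoiser` stands in for the pre-trained posterior-mean predictor and `sigmas` for the noise schedule; this is an illustration, not the authors' exact implementation):

```python
import torch

@torch.no_grad()
def precompute_posterior_variances(denoiser, images, sigmas, n_mc=64):
    """Estimate a per-noise-level posterior variance by Monte Carlo:
    the average squared denoising error over clean samples and noise draws."""
    variances = []
    for sigma in sigmas:
        errs = []
        for _ in range(n_mc):
            x0 = images[torch.randint(len(images), (1,))]  # random clean image
            xt = x0 + sigma * torch.randn_like(x0)         # inject Gaussian noise
            x0_hat = denoiser(xt, sigma)                   # posterior-mean prediction
            errs.append(((x0 - x0_hat) ** 2).mean())       # pixel-averaged squared error
        variances.append(torch.stack(errs).mean())
    return torch.stack(variances)  # one scalar variance per noise level
```

Once computed, this small lookup table is shipped with the unconditional model, so nothing extra is evaluated at inference time.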
Q2: Number of sampling steps with optimal posterior variances
Since we sample more densely in regions with low noise levels, as suggested by [2], it corresponds to 12 out of 50 steps. We have added this information in footnote 7 of our revised version.
W1: Limited contribution
We thank the reviewer for the comments on our paper. We would like to emphasize that the main contribution of this paper is a novel framework that unifies existing diffusion-based zero-shot methods for inverse problems. This framework addresses all the cases, including training from scratch and pre-trained models with and without reverse covariances. As recognized by the reviewer, converting reverse covariances is useful for many existing models, and this idea has long been neglected. It is non-trivial to reveal that the optimality theorem of DDPM [8,9] can be used to transform reverse variance predictions into posterior variance predictions, even though the construction of Theorem 1 mostly follows [4]. We have also claimed this point as our contribution in the paper.
W2 & W3: Add citation & highlight the difference between Type I and II
We thank the reviewer for the constructive suggestion. In our revised version, the citation and discussion have been included in the last paragraph of Section 3.1, and the difference has been highlighted in the first paragraph of Section 3.2.
W4: Confusion about the validation of posterior variance prediction
The computation of the ground-truth square error involves the following steps:
- We sample a clean image $x_0$ from the training set.
- We inject Gaussian noise into $x_0$ to generate a noisy image $x_t$.
- We employ the pretrained diffusion model to denoise $x_t$, resulting in the posterior mean prediction $\hat{x}_0$.
- We calculate the pixel-wise square errors $e^2 = (x_0 - \hat{x}_0)^2$.
We call $e^2$ the ground-truth square errors because it is the target (ground truth) that we want the posterior variance prediction to approximate. We compare the predicted posterior variances with $e^2$ because the posterior variance is the MMSE estimator of $e^2$ given $x_t$ (assuming $\hat{x}_0 = \mathbb{E}[x_0 \mid x_t]$), as suggested by equation 17, implying that a good posterior variance prediction should be a reliable predictor of $e^2$. Therefore, we may use the comparison between the two as a sanity check to gauge the effectiveness of equation 22 in practice. As shown in Figure 2, $e^2$ is indeed accurately predicted by the variances computed using equation 22 in regions with low noise levels. However, we encounter numerical issues using equation 22 in regions with high noise levels, and we attribute this to the fact that equation 22 approaches a 0/0 limit in these regions.
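As a concrete illustration, a minimal sketch of this sanity check (the name `model` is hypothetical and stands for a network returning both a mean and a variance prediction; not the authors' exact code):

```python
import torch

@torch.no_grad()
def variance_sanity_check(model, x0, sigma):
    """Compare predicted posterior variances against the ground-truth
    squared errors e^2 = (x0 - x0_hat)^2. A good prediction should track
    e^2 well at low noise levels (cf. Figure 2)."""
    xt = x0 + sigma * torch.randn_like(x0)   # inject Gaussian noise
    x0_hat, var_pred = model(xt, sigma)      # posterior mean and variance heads
    e2 = (x0 - x0_hat) ** 2                  # pixel-wise ground-truth squared errors
    return (var_pred - e2).abs().mean()      # average prediction gap
```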
W5: Sampling using SDE for Type II
As noted in footnote 1, replacing the posterior mean $\mathbb{E}[x_0 \mid x_t]$ with the conditional posterior mean $\mathbb{E}[x_0 \mid x_t, y]$ (e.g., going from equation 2 to equation 4 for the ODE) to achieve conditional sampling is not limited to diffusion models formalized as ODEs. The extension to DDPM, DDIM, or SDE-formalized diffusion models can be proved similarly by leveraging the conditional Tweedie's formula (Lemma 2). Therefore, there is no discrepancy between theory and practice. We chose ODEs (equations 2 to 4) because of their comparatively simpler mathematical form compared to SDEs and discrete diffusion models like DDPM or DDIM.
However, due to discretization in practice, the characteristics of samples based on ODEs and SDEs differ. When discretizing SDEs, a large step size (a small number of steps) often causes non-convergence [3]. Hence, ODEs are favored over SDEs for designing fast samplers [2,3]. However, SDEs excel at correcting errors made at earlier steps [2], which ODEs lack. This may explain why Type II performs better with a stochastic sampler than with a deterministic one: the injected noise can correct errors caused not only by discretization but also by approximating the true conditional posterior mean.
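To make the distinction concrete, here is a hedged sketch of the two kinds of update in a variance-exploding (k-diffusion-style) parameterization; `denoiser` is a stand-in for the (guided) posterior-mean predictor, and the stochastic step is written in its maximally stochastic, ancestral-style form for clarity (practical samplers interpolate between the two):

```python
import torch

def euler_ode_step(x, sigma, sigma_next, denoiser):
    """Deterministic probability-flow (ODE) Euler step: errors made at one
    step propagate unchanged to the next."""
    d = (x - denoiser(x, sigma)) / sigma        # current drift direction
    return x + (sigma_next - sigma) * d

def stochastic_step(x, sigma, sigma_next, denoiser):
    """Maximally stochastic (ancestral-style) step: jump to the denoised
    estimate and re-inject fresh noise at level sigma_next, which partially
    overwrites errors accumulated in x."""
    x0_hat = denoiser(x, sigma)
    return x0_hat + sigma_next * torch.randn_like(x)
```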
For example samples of Type II using a deterministic sampler, please refer to "comparison_between_deterministic_and_stochastic_sampler_for_type_II.png" in the Supplementary Material, which is generated by running DiffPIR with deterministic and stochastic samplers. For more results of Type II using a deterministic sampler, please run
bash quick_start/rebuttal/eval_guidance_II_ode.sh
using our source code provided in the Supplementary Material.
This paper provides a unified view of existing linear inverse problem samplers based on diffusion models. Motivated by this view, the authors propose an improved approach that calculates the optimal covariance in the Gaussian approximation. Experimental results show that the proposed approach outperforms previous methods in most cases.
Strengths
- The unified view of various existing methods to solve linear inverse problems with diffusion models is interesting and useful.
- The proposed improved approach (both convert and analytic) by optimal covariance estimation performs well.
Weaknesses
- There is a lack of analysis or experimental results on the inference speed of the proposed method.
- As mentioned by the authors, the proposed approach performs badly if it is implemented purely with the optimal variance, which is weird and unreasonable, and even contradicts the starting point of this paper. From a practical implementation side, how should we decide where to switch to the optimal variance for a specific problem?
- How does the proposed approach perform on linear tasks like colorization and denoising, compared with other methods like DPS and other variants?
Questions
Please see the above
Q1: Analysis on inference speed
The inference speed of our method depends on whether Analytic or Convert is chosen.
Analytic: Our method does not introduce any additional computational cost for online inference compared to previous methods, since the only extra cost arises from the offline pre-calculation of the variances using equation (24). For the offline pre-calculation, similar to [3], a small number of Monte Carlo samples is adequate for obtaining an accurate estimate from a pre-trained unconditional model and the training dataset. In the paper, we pre-compute the variances using a randomly selected subset of the training set and reuse them for all the experiments. This process takes a couple of hours on a single 1080Ti GPU. Once pre-computed, the variances are deployed along with the pre-trained unconditional model. As a result, the online inference cost of our method is the same as that of previous methods.
Convert: The conjugate gradient (CG) method could introduce additional computational cost (see the sketch after the timing tables below). However, since CG converges rapidly and is only employed during the last few sampling steps, its additional cost is negligible compared to the overall inference time. We have included this discussion in Appendix E.3 of our revised version.
The inference speed results on a 1080Ti GPU are shown in the following tables.
Inference time (sec) for Type I using 50 steps of Heun's 2nd-order sampler
| DPS | ΠGDM | Analytic | Convert |
|---|---|---|---|
| 15.07 | 15.07 | 15.07 | 15.56 |
Inference time (sec) for Type II using 50 steps of Heun's 2nd-order sampler
| DiffPIR | Analytic | Convert |
|---|---|---|
| 6.28 | 6.28 | 6.66 |
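Below is a generic sketch of the kind of matrix-free conjugate-gradient solve referred to above; `apply_M` is a stand-in for the symmetric positive-definite system arising in the guidance (the exact system depends on the task and guidance type, so this is illustrative only):

```python
import torch

def cg_solve(apply_M, b, n_iter=10, tol=1e-6):
    """Solve M x = b for symmetric positive-definite M, given only
    matrix-vector products with M (no explicit matrix inversion)."""
    x = torch.zeros_like(b)
    r = b - apply_M(x)            # initial residual (= b since x = 0)
    p = r.clone()
    rs = (r * r).sum()
    for _ in range(n_iter):
        Mp = apply_M(p)
        alpha = rs / (p * Mp).sum()
        x = x + alpha * p
        r = r - alpha * Mp
        rs_new = (r * r).sum()
        if rs_new.sqrt() < tol:   # converged
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

For instance, `apply_M` could implement $v \mapsto r_t^2 A A^\top v + \sigma_y^2 v$ so that `cg_solve` evaluates $(r_t^2 A A^\top + \sigma_y^2 I)^{-1}(y - A\hat{x}_0)$ using only matrix-vector products with the measurement operator $A$; this is an assumed example system, not necessarily the paper's exact one.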
Q2: Concern about only using optimal variances in the last few sampling steps
The Gaussian distribution effectively approximates the posterior only in the limit of infinitesimal noise level [1,2]. As the noise level increases, the posterior distribution can evolve into a non-Gaussian multimodal distribution [2]. We also observed this phenomenon in our initial experiments (refer to 4.2 Quantitative results, Paragraph 2). To address this, we heuristically employ our techniques only in the last few sampling steps, where the noise level is sufficiently low for the Gaussian approximation to be effective. To decide when to switch to the optimal variance, we empirically observe that switching at a fixed low noise-level threshold is a very robust choice. We use this heuristic across all inverse problems and sampler setups and observe consistent performance improvements over baselines.
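In code, the heuristic amounts to a simple switch on the noise level (a sketch with hypothetical names; `tau` denotes the empirically chosen switching threshold):

```python
def effective_variance(sigma_t, optimal_var, handcrafted_var, tau):
    """Use the estimated optimal posterior variance only where the noise
    level is low enough for the Gaussian approximation to hold; otherwise
    fall back to the baseline's handcrafted variance."""
    if sigma_t < tau:
        return optimal_var(sigma_t)     # last few (low-noise) steps
    return handcrafted_var(sigma_t)     # early (high-noise) steps
```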
Q3: Performance on colorization and denoising
We have performed additional experiments on colorization and demonstrate that our method achieves significant performance improvements over baselines such as DPS, ΠGDM, and DiffPIR. Denoising is a special case of noisy inpainting in which all elements of the mask are one, and we have already demonstrated the superiority of our method in inpainting. We believe that the current experimental results on image inpainting, deblurring, super-resolution, and colorization are sufficient to validate the effectiveness of the proposed method. Below, we elaborate on the evaluations for colorization.
The colorization experiments are performed on the FFHQ dataset. We provide quantitative results in terms of PSNR, SSIM, LPIPS, and FID to compare the proposed method with Type I guidance (i.e., DPS and ΠGDM) and Type II guidance (e.g., DiffPIR). Here, PSNR measures how close the reconstructed image is to the ground truth in the sense of Euclidean distance, while SSIM, LPIPS, and FID are subjective metrics that reflect perceptual quality. As shown in the following tables, our method yields evident performance gains over baselines on the colorization task, especially in terms of the subjective metrics. The proposed method significantly outperforms Type I guidance (i.e., DPS and ΠGDM) in all metrics. Regarding Type II guidance like DiffPIR, we obtain evident gains in LPIPS, FID, and SSIM, and comparable PSNR. Note that, considering that multiple solutions can be consistent with the measurement, PSNR may not be a desirable metric for the colorization task. To further validate our method, we provide qualitative results by visualizing the colorized images in "colorization_type_I.png" and "colorization_type_II_lambda=1.png" in the Supplementary Material. Figure "colorization_type_I.png" shows that our methods significantly outperform baselines in Type I. Figure "colorization_type_II_lambda=1.png" demonstrates that although DiffPIR is slightly better than ours in PSNR, its images contain more artifacts than ours.
Quantitative results (PSNR, SSIM, LPIPS, FID) on Colorization for Type I guidance:
| Method | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
| DPS | 20.40 | 0.7804 | 0.2537 | 85.46 |
| ΠGDM | 21.11 | 0.8406 | 0.2440 | 82.38 |
| Analytic (Ours) | 21.24 | 0.8645 | 0.2100 | 76.87 |
| Convert (Ours) | 21.47 | 0.8662 | 0.2089 | 76.12 |
Quantitative results (SSIM) on Colorization for Type II guidance under different values of λ:
| Method \ λ | | | | | |
|---|---|---|---|---|---|
| DiffPIR | 0.7333 | 0.7397 | 0.7848 | 0.7791 | 0.6824 |
| Analytic (Ours) | 0.7967 | 0.7945 | 0.7979 | 0.7896 | 0.7546 |
| Convert (Ours) | 0.7927 | 0.7947 | 0.7973 | 0.7924 | 0.7604 |
Quantitative results (LPIPS) on Colorization for Type II guidance under different values of λ:
| Method \ λ | | | | | |
|---|---|---|---|---|---|
| DiffPIR | 0.3634 | 0.3529 | 0.3040 | 0.2765 | 0.3526 |
| Analytic (Ours) | 0.2671 | 0.2634 | 0.2617 | 0.2738 | 0.3150 |
| Convert (Ours) | 0.2661 | 0.2655 | 0.2594 | 0.2677 | 0.2992 |
Quantitative results (FID) on Colorization for Type II guidance under different values of λ:
| Method \ λ | | | | | |
|---|---|---|---|---|---|
| DiffPIR | 98.42 | 97.90 | 95.36 | 85.00 | 99.10 |
| Analytic (Ours) | 86.23 | 85.64 | 85.56 | 85.77 | 96.96 |
| Convert (Ours) | 87.04 | 88.31 | 83.48 | 86.26 | 92.40 |
Quantitative results (PSNR) on Colorization for Type II guidance under different values of λ:
| Method \ λ | | | | | |
|---|---|---|---|---|---|
| DiffPIR | 21.26 | 21.07 | 21.16 | 20.57 | 19.77 |
| Analytic (Ours) | 20.72 | 20.97 | 21.07 | 20.93 | 20.27 |
| Convert (Ours) | 20.56 | 20.91 | 20.83 | 20.79 | 20.77 |
References
[1] Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International conference on machine learning. PMLR, 2015.
[2] Xiao, Zhisheng, Karsten Kreis, and Arash Vahdat. "Tackling the generative learning trilemma with denoising diffusion gans." arXiv preprint arXiv:2112.07804 (2021).
[3] Bao, F., et al. (2022). "Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models." arXiv preprint arXiv:2201.06503.
The paper establishes a unified framework for zero-shot inverse problem solvers based on pre-trained diffusion models. In particular, it unifies two different approaches to approximating the conditional posterior (one via likelihood estimation and the other via proximal optimization) into an isotropic Gaussian approximation, which is a novel interpretation. Based on the unified framework, the paper proposes to optimize the covariance of the approximated posterior via maximum likelihood estimation, which demonstrates effectiveness on various inverse problems.
Strengths
- The proposed interpretation is novel. It seems to be a good try to connect different approaches.
- The paper is clearly written and easy to follow. Specifically, I like how the paper provides three possible scenarios for estimating the posterior distribution, from the case without a pre-trained model to the case with a pre-trained model that only predicts the posterior mean.
- Underlying theories are well-aligned with prior works.
Weaknesses
- The approximation of the posterior by a Gaussian distribution may limit the performance and interpretability of the diffusion process. In particular, it inevitably leads to interpreting the diffusion model as sampling through a trajectory between two Gaussian distributions.
- Using a diagonal covariance can easily lose significant information in images, so the performance of the proposed method may be limited.
- The performance of the proposed method seems limited (i.e., comparable to baselines on multiple tasks).
- Section 4.3 is the most useful case, which previous methods depend on. However, the proposed method only optimizes the posterior variance, which was hand-crafted in previous methods; this limits the novelty of the proposed method.
Questions
Soundness
- I would like to ask whether approximating the conditional posterior distribution using the standard normal distribution is practical. Specifically, I believe this is equivalent to assuming that the posterior distribution follows a Gaussian distribution, which inevitably implies that the diffusion model samples through a trajectory between two normal distributions. This seems inadequate for describing image generation. Can the authors provide their opinions on this matter? This is a crucial point for my review because the proposed method relies heavily on this aspect (Section 4).
- I agree with the statement that "letting all the elements of covariance be learnable is computationally demanding" and believe that restricting it to a diagonal posterior covariance could be one solution. However, this raises questions about the significance of that covariance. It may lead to the neglect of crucial information, casting doubt on whether the proposed method can achieve "optimal" variances. Additionally, the diagonal covariance assumption implies that each pixel is independent, which may not hold true for images. Note that this assumption is also made for the posterior distribution of the clean image $x_0$ given the noisy image $x_t$. These issues raise concerns about the soundness of the proposed method.
Experiments
- Looking at Figure 3, we can observe that the performance of the proposed method is nearly identical to that of DiffPIR at a particular value of λ. This might suggest that the effectiveness of the covariance optimization is marginal, regardless of how the covariance is computed (i.e., Analytic or Convert). Could the authors provide further explanation on this point? Specifically, I wonder whether this lack of improvement in performance can be attributed either to a non-realistic Gaussian posterior approximation or to an excessively simplified covariance structure. If not, have any experimental results been obtained to support the idea that the lack of performance improvement is not due to the Gaussian approximation or the diagonal covariance?
- Image quality metrics such as PSNR and FID are missing in Figure 3. Do they show the same robustness and performance tendency compared to the baseline?
- I noticed that the paper also presents results for the 'complete version' of prior works in Appendix E. I understand that the authors moved these results to the appendix for the sake of comparison within the proposed unified framework. However, the 'complete version' of ΠGDM and DPS involves controlling the weight of the likelihood guidance, which is orthogonal to the posterior approximation and independent of the proposed unified framework. Hence, the 'complete version' should have been reported in the main paper, as it demonstrates the effectiveness of (adaptive) likelihood guidance in solving inverse problems without the need to optimize the covariance of the approximated posterior distribution. From this perspective, looking at Figure 4 in Appendix E, the performance of DPS and the proposed method are comparable with proper guidance. This raises the question about the effectiveness of the proposed method once again.
Minor
- In the last sentence of Section 5.1, why does equation 22 become 0/0 when the noise level is high? According to equation 22, it should be 0 rather than 0/0. If I have overlooked anything, I would appreciate it being pointed out.
Q5: Moving results for "complete version of prior works" to the main paper
We thank the reviewer for the constructive suggestion. We put these results into the Appendix due to the page limit. We will try to highlight these results in our final version. Regarding the performance concerns, please refer to our reply to "Lack of performance improvement".
Q6: Why is equation 22 a 0/0 limit?
When noise levels are high (or, equivalently, when $t$ is large), the posterior variances (the LHS of equation 22) cannot be zero due to the considerable uncertainty of the clean image given $x_t$. Therefore, if the numerator in equation 22 approaches zero, the denominator must also approach zero. This can be seen by directly examining the numerator and denominator: in the case of DDPM, both tend to zero for large $t$, so equation 22 approaches a 0/0 limit.
References
[1] Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International conference on machine learning. PMLR, 2015.
[2] Xiao, Zhisheng, Karsten Kreis, and Arash Vahdat. "Tackling the generative learning trilemma with denoising diffusion gans." arXiv preprint arXiv:2112.07804 (2021).
[3] Boys, Benjamin, et al. "Tweedie Moment Projected Diffusions For Inverse Problems." arXiv preprint arXiv:2310.06721 (2023).
[4] Song, Jiaming, et al. "Pseudoinverse-guided diffusion models for inverse problems." International Conference on Learning Representations. 2023.
[5] Zhu, Y., et al. (2023). "Denoising Diffusion Models for Plug-and-Play Image Restoration." arXiv preprint arXiv:2305.08995.
[6] Chung, Hyungjin, et al. "Diffusion posterior sampling for general noisy inverse problems." arXiv preprint arXiv:2209.14687 (2022).
[7] Wang, Yinhuai, Jiwen Yu, and Jian Zhang. "Zero-shot image restoration using denoising diffusion null-space model." arXiv preprint arXiv:2212.00490 (2022).
[8] Bao, F., et al. (2022). "Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models." arXiv preprint arXiv:2201.06503.
[9] Bao, Fan, et al. "Estimating the optimal covariance with imperfect mean in diffusion probabilistic models." arXiv preprint arXiv:2206.07309 (2022).
I appreciate the authors for their response.
Gaussian approximation
I agree with author's response and my concern on it is resolved.
Diagonal constraint
As I mentioned in the review comment, I agree that there is a computational challenge in computing the covariance. However, it is hard to agree that the diagonal covariance is the best way to balance computational cost and performance. For example, one could approximate the full covariance via a low-rank decomposition or a block matrix, which would be more reasonable than a diagonal covariance. Note that the diagonal covariance implies independent pixels within a single image. Have the authors tried approximations other than the diagonal covariance?
Experimental results
I really appreciate the authors for their effort in finding the numerical issues in the implementation, which is hard to do in a short duration. However, in the revised version, the proposed method does not outperform Type II guidance methods, especially DiffPIR.
In Figures 3 and 4 in our revised version, we clearly find that our proposed method achieves better LPIPS performance than baselines under almost all hyperparameters
I agree that the proposed method is robust. However, for the deblurring tasks, DiffPIR shows performance comparable (or better; I cannot say precisely, as numerical metrics are not reported) to the proposed method. Also, in Figures 8 and 10 in the revised paper, DiffPIR achieves better PSNR and SSIM than the proposed method.
Our method optimizes the posterior variance for better performance. For one particular λ, DiffPIR coincidentally finds a value similar to our estimation and yields similar performance
If this is true, why should we optimize the covariance instead of just using DiffPIR with that λ?
Overall, I appreciate the response from the authors, but my major concerns remain. Therefore, I will keep my score at 5.
W3 & Q2 & Q3 & Q4 & Q5: Lack of performance improvement and missing metrics
We sincerely apologize for any confusion caused by the incorrect experimental results. These errors were the consequence of a numerical issue in the implementation of our closed-form solutions for deblurring. We have presented the corrected results for the deblurring tasks in the revised version of our paper. To provide a more thorough evaluation, we have included additional image quality metrics, such as PSNR, FID, and SSIM, in Appendix E of the revised version. To facilitate the reproduction of our results, we have provided our source code and quick-start bash scripts in the Supplementary Materials (src). We are grateful for your understanding and patience.
In Figures 3 and 4 in our revised version, we clearly find that our proposed method achieves better LPIPS performance than baselines under almost all hyperparameters. We have also provided additional results on PSNR, SSIM, and FID in Appendix E, which show the same robustness and performance tendency as the baselines.
We can also see that DiffPIR is inferior to our method for most values of λ due to its handcrafted design. In equation (10) of the revised version, the posterior variance is related to λ and affects the performance as λ varies. Our method optimizes the posterior variance for better performance. For one particular λ, DiffPIR coincidentally finds a value similar to our estimation and yields similar performance (see "diffpir_var_vs_analytic_var.png" in the Supplementary Material for details).
W4: Limited novelty
Our paper proposes a unified framework for diffusion-based zero-shot methods. As claimed in the paper, our novelty is three-fold.
First, we provide an insight into interpreting existing methods from the viewpoint of approximating the conditional posterior mean for the reverse process, and we reveal that they are equivalent to making isotropic Gaussian approximations with different handcrafted isotropic posterior covariances. This finding enables us to transform the handcrafted designs in existing methods (classified as Type I and Type II guidance) into a general solution that optimizes the posterior covariance via maximum likelihood estimation. This is especially non-trivial for Type II guidance, in which the posterior variances could not previously be optimized due to the implicit assumption of a Gaussian distribution.
Second, we develop three approaches to address all the cases for realizing diffusion-based zero-shot methods: training from scratch, and pre-trained models with and without reverse covariances, respectively. Remarkably, both cases of pre-trained models are useful to existing methods, as elaborated below.
Pretrained models with reverse covariance. In Section 4.2, we are the first to reveal that the optimality theorem of DDPM [8,9] can be used to transform reverse variance predictions into posterior variance predictions. This result allows one to leverage previous works' pre-trained models that estimate the reverse covariance to predict the posterior variance for inverse problems and enhance performance without incurring additional costs. In fact, the unconditional reverse covariance is ignored by many existing methods for inverse problems, despite being readily available.
Pretrained models without reverse covariance. In Section 4.3, optimizing the posterior variance is non-trivial for pre-trained models without reverse covariances. In fact, existing methods [4,5,6,7] all aim at finding an optimal variance, but they resort to handcrafted designs. In contrast, we theoretically derive the optimal solution to the problem and empirically achieve the MMSE estimate via Monte Carlo samples.
Third, as highlighted in the revised version, we offer an efficient way of calculating the guidance in noisy inverse problems. ΠGDM does not provide an efficient method for implementing guidance in noisy inverse problems, as it requires computing a high-dimensional matrix inversion for the vector-Jacobian product (equation 91). As a result, ΠGDM did not evaluate performance on noisy deblurring and noisy super-resolution (it conducts noisy inpainting experiments, for which the guidance calculation is trivial). In contrast, we propose closed-form solutions to Type I guidance in Appendix B.1 (for inpainting, deblurring, super-resolution, and colorization) that can be evaluated efficiently. In cases where the closed-form solutions to the guidance are complex to derive, we further present numerical solutions in Appendix C based on the conjugate gradient (CG) method for Type I and II guidance, thereby enhancing the practicality of our approach.
W1 & Q1: Justification of Gaussian approximation
As discussed in Section 3 of the paper, recent zero-shot methods, including DPS [6], ΠGDM [4], DiffPIR [5], and DDNM [7], all make a Gaussian approximation with isotropic covariance to the true unconditional posterior. They explicitly or implicitly leverage HANDCRAFTED designs of isotropic posterior covariances and have achieved empirical success. Song et al. (2023) justify that the score of the Gaussian approximation is close to that of the true posterior under the assumption that the denoiser behaves like a pseudo-linear filter (see [4], Appendix A.6).
Our approximation is more accurate than previous methods since we optimize the posterior variance (rather than using a handcrafted design) without incurring additional computational costs, and it has demonstrated significant empirical improvements in solving various inverse problems. It should be noted that we do not use the standard normal distribution to approximate the conditional posterior distribution. In fact, when the likelihood (given by equation 1) and the approximated posterior are both Gaussian, the approximated conditional posterior is also Gaussian (but not standard).
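For reference, this follows from the standard Gaussian-conjugacy identity (stated here in generic notation, which may differ from the paper's): combining a Gaussian approximation $q(x_0 \mid x_t) = \mathcal{N}(x_0; \mu_t, \Sigma_t)$ with a linear-Gaussian measurement $y = Ax_0 + n$, $n \sim \mathcal{N}(0, \sigma_y^2 I)$, gives a Gaussian conditional posterior

$$
q(x_0 \mid x_t, y) = \mathcal{N}(x_0;\, m,\, S), \qquad
S = \Big(\Sigma_t^{-1} + \tfrac{1}{\sigma_y^2} A^\top A\Big)^{-1}, \quad
m = S\Big(\Sigma_t^{-1}\mu_t + \tfrac{1}{\sigma_y^2} A^\top y\Big),
$$

which is Gaussian but in general far from a standard normal.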
Using more complex distributions to model the posterior accurately may improve performance, but it would significantly increase computational costs. A Gaussian approximation of the posterior is a reasonable choice for balancing computational cost against the accuracy of the conditional posterior mean approximation. It is known that the Gaussian distribution is an effective approximation of the posterior in the limit of infinitesimal noise level [1,2]. As the noise level increases, the posterior distribution can evolve into a non-Gaussian multimodal distribution [2]. We also observed this phenomenon in our initial experiments (refer to 4.2 Quantitative results, Paragraph 2). To circumvent this problem, we employ our techniques only in the last few sampling steps, where the noise level is sufficiently low for the Gaussian approximation to be effective.
W2: Justification of diagonal covariance constraint
In fact, there are currently no tractable methods to implement a full covariance matrix for high-dimensional signals like images, even though a more expressive posterior covariance could potentially enhance performance. We identify two main challenges: (1) the computational cost of predicting the full posterior covariance, and (2) the computational cost of implementing guidance based on the full posterior covariance. The concurrent work [3] (also submitted to ICLR 2024) proposed to optimize the posterior covariance by computing the Hessian matrix of the log-density (i.e., the derivative of the score function) to obtain a full covariance prediction from a pre-trained unconditional diffusion model. However, this method is only evaluated on low-dimensional synthetic data and still uses a diagonal covariance for high-dimensional signals like images (see Appendix E.1 in [3]). Taking RGB images with a resolution of 256×256 as an example, this approach is cost-ineffective, as computing the Hessian requires one backpropagation pass per signal dimension (in our case, 256×256×3 passes). This process is significantly slower than our approach, since we introduce no additional computational cost to compute the posterior covariance. Since our paper focuses on real-world inverse problems, choosing a diagonal covariance is a reasonable balance between computational cost and performance.
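To illustrate the first challenge, here is a hypothetical sketch of extracting just one row of that Hessian with automatic differentiation (the function names are ours, not those of [3]); recovering the full covariance this way needs one such backward pass per pixel:

```python
import torch

def hessian_row(score_fn, x, i):
    """One row of the Hessian of log p(x) (the Jacobian of the score) via a
    single vector-Jacobian product. A full d x d covariance requires
    d = x.numel() backward passes, e.g. 256*256*3 for an RGB image."""
    x = x.detach().requires_grad_(True)
    s = score_fn(x)                    # score: grad_x log p(x), same shape as x
    e = torch.zeros_like(s)
    e.view(-1)[i] = 1.0                # unit vector selecting row i
    (row,) = torch.autograd.grad(s, x, grad_outputs=e)
    return row.detach().view(-1)
```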
In addition, we focus on zero-shot solutions to inverse problems using available checkpoints of unconditional diffusion models, since training them from scratch is costly. As mentioned in the response to W1 & Q1, existing works [4,5,6,7] make the Gaussian approximation with diagonal covariance for the true unconditional posterior. Our methods are built upon these milestone works, and the improvement comes from carefully designed training-free methods for obtaining optimal diagonal covariances.
Thank you for your thoughtful response! Regarding the diagonal covariance, we must emphasize that our paper aims to propose a unified framework for inverse problems that also accommodates existing diffusion-based zero-shot methods. To this end, it is hard to directly leverage the widely adopted checkpoints of unconditional diffusion models used by existing methods without a diagonal covariance. The diagonal constraint has another significant advantage: it enables our method to be seamlessly integrated with existing methods. This plug-and-play nature makes our approach a reasonable next step after recent works.
The paper establishes a unified framework for zero-shot inverse problem solvers based on a pre-trained diffusion model. Based on the unified framework, the paper proposes to optimize the covariance for the approximated posterior via maximum likelihood estimation, which demonstrates effectiveness on various inverse problems.
The reviewers are concerned about the soundness of the proposed diagonal covariance and the marginal improvements compared with existing methods. Please consider incorporating the reviewers' comments in future submissions.
Why not a higher score
The reviewers were concerned about the soundness of the approach and the marginal improvements.
Why not a lower score
n/a
Reject