PaperHub
NeurIPS 2024 · Poster · 4 reviewers
Ratings: 4, 5, 6, 6 (average 5.3/10; min 4, max 6, std 0.8)
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.3

DMPlug: A Plug-in Method for Solving Inverse Problems with Diffusion Models

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Inverse problems, diffusion models

Reviews and Discussion

Review (Rating: 4)

This paper is devoted to the use of diffusion models to solve inverse problems, arguing for treating the reverse process of a diffusion model as a function and proposing a new plug-in approach called DMPlug. DMPlug addresses the issues of manifold feasibility and measurement feasibility in a principled manner, and shows great potential for robustness to unknown types and levels of noise.

Strengths

  1. The article's review of existing related techniques is very comprehensive, which makes it friendly to readers unfamiliar with the field.

  2. The article describes the proposed method clearly and in detail, making it easy to understand.

  3. The qualitative results on the CelebA and FFHQ datasets presented in the figures of the experimental section look very effective and better than those of the baseline methods.

  4. This paper provides a comprehensive exploration of the robustness of the proposed method under various types of noise.

Weaknesses

  1. Contribution issues: The proposed method is very similar to the existing works [1,2] in its approach to conditional generation by optimizing the noisy latent variables of a diffusion model. The main difference in this paper (and why it works is also very surprising to me) is the empirical evidence that good results can be achieved with only 3 reverse steps. It is well known that 3 reverse steps usually fail to generate meaningful images with diffusion models. Although the authors do not provide a theoretical basis, they should at least provide an intuitive insight into why it works.

  2. Experimental issues: The experimental section states that their metrics are calculated primarily on 100 images, which is too few; much of the existing work uses 1000-image tests. In addition, the paper selects CelebA and FFHQ, which have somewhat similar patterns despite being two datasets.

[1] D-Flow: Differentiating through Flows for Controlled Generation

[2] End-to-End Diffusion Latent Optimization Improves Classifier Guidance

Questions

  1. Although the authors point out that using the 3-step DDIM can significantly mitigate computational cost, the memory required to store the computational graph of the 3-step DDIM is, as far as I know, still huge for larger diffusion models. Can the authors report a comparison of time consumption and GPU memory usage against other methods, such as DPS?

  2. Can the authors provide some intuition or explanation as to why 3 steps work?

  3. Can the authors show some qualitative results of DMPlug in more difficult cases, such as inpainting with box masks or on the ImageNet dataset?

  4. Can the proposed 3-step DMPlug method be used for other controllable generation tasks beyond inverse problems, such as classifier/CLIP guidance in [2]?

[2] End-to-End Diffusion Latent Optimization Improves Classifier Guidance

If the authors can address my concerns, I am open to changing my score during the discussion phase.

Limitations

The authors briefly discuss the limitations of their approach.

Author Response

We thank reviewer uJfQ for the detailed review and insightful comments!

RE Weakness 1:

We appreciate this insightful comment! Indeed, the two papers share high-level ideas similar to ours, and we will definitely add and discuss them in our revision. However, we still want to highlight several crucial differences between their methods and ours.

  • [1] computes the gradient using the adjoint method [3] and the gradient checkpointing trick [4] to mitigate the memory issue, but the two techniques further slow down the optimization process. In contrast, our method, with only a few steps, can achieve SOTA results.
  • [1] only considers linear IPs, whereas our paper covers three nonlinear IPs and achieves much better results than the current SOTA methods (typically 3-6 dB in terms of PSNR).
  • Our paper observes the early-learning-then-overfitting and spectral-bias phenomena when solving IPs. Exploiting these two phenomena, our method achieves robustness to unknown types and levels of noise.
  • [2] also uses the gradient checkpointing trick [4] to mitigate the memory issue, slowing down the whole process.
  • Although IPs can be considered conditional generation problems, the measurement condition is much stronger than classifier/CLIP guidance. Please refer to GR1 for more information about IPs versus conditional generation with classifier/CLIP guidance.

[1] D-Flow: Differentiating through Flows for Controlled Generation

[2] End-to-End Diffusion Latent Optimization Improves Classifier Guidance

[3] Scalable gradients for stochastic differential equations

[4] https://pytorch.org/docs/stable/checkpoint.html

RE Weakness 2:

We mainly follow the experimental setting of the recent work ReSample [5] (ICLR'24 Spotlight), which also uses 100 images for each task and dataset, but we totally understand this concern. Hence, we have added more experimental results during the rebuttal period, as you and other reviewers suggested (please check the General Response). Besides the two datasets, we also try a more complex dataset, LSUN-bedroom, following [5]. We also add more qualitative results for ImageNet, as you suggested, in GR5.

[5] Solving inverse problems with latent diffusion models via hard data consistency

RE Question 1:

This is indeed a valid question! Please refer to GR2.

RE Question 2:

We appreciate this insightful suggestion! Please refer to GR1.

RE Question 3:

Sure, please refer to GR5.

RE Question 4:

We appreciate this very interesting comment! Please refer to GR1.

Comment

We thank reviewer uJfQ for the detailed review and insightful comments!

During the rebuttal and discussion period, we incorporated six additional strong baseline methods in GR3, GR7, and GR9, as suggested by you and reviewer yL1C. These methods include FPS, DiffPIR, DDNM, PSLD, RED-diff, and ΠGDM. We hope these results can somewhat address your concerns on this point.

Comment

We thank reviewer uJfQ for the detailed review and insightful comments!

For the box inpainting task, please refer to GR8 for the blurriness analysis, and both qualitative and quantitative comparisons between our method and several strong baseline methods.

Review (Rating: 5)

This paper presents a plug-and-play method, DMPlug, for solving inverse problems with diffusion models. DMPlug uses a pre-trained diffusion model as a deterministic function that generates images from latent seeds and solves the MAP problem by optimizing the seeds. Experiments show that their method beats current SOTA methods across different inverse problems and is robust against unknown perturbations.

Strengths

  • The proposed algorithm DMPlug is concise and novel.
  • The authors observe an intriguing early-learning-then-overfitting property of their algorithm. They integrate an early-stopping strategy in DMPlug to enhance the robustness of DMPlug against unknown data corruption, which is illustrative and useful.
  • The experimental results shown in the paper look promising. The empirical robustness analysis is also interesting and informative.

Weaknesses

  • The paper is missing many baseline methods for plug-and-play inverse problem solving with diffusion models, such as DDNM [1], DiffPIR [2], RED-Diff [3], and FPS [4]. These methods are claimed to perform much better than DPS. The experimental results are less convincing without comparisons with these recent methods.
  • Moreover, the major recent baseline considered in the paper, ReSample, uses pre-trained latent diffusion models, different from the pre-trained model used in this paper, so a direct comparison does not make sense. This weakness is partly mitigated by the ablation study in Table 5.

[1] Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. ICLR, 2023.

[2] Yuanzhi Zhu, Kai Zhang, Jingyun Liang, Jiezhang Cao, Bihan Wen, Radu Timofte, and Luc Van Gool. Denoising diffusion models for plug-and-play image restoration. CVPR, 2023.

[3] Morteza Mardani, Jiaming Song, Jan Kautz, and Arash Vahdat. A variational perspective on solving inverse problems with diffusion models. ICLR, 2024.

[4] Zehao Dou, Yang Song. Diffusion Posterior Sampling for Linear Inverse Problem Solving: A Filtering Perspective. ICLR, 2024.

Questions

  • DMPlug seems more space/time-consuming than existing methods since it requires gradient backpropagation through the diffusion model multiple times per step, while ReSample does not backpropagate through the diffusion model. Depending on the actual number of iterations used for DMPlug, the time cost for DMPlug might be prohibitively high. I am curious to see some discussion and comparison of DMPlug with existing methods in terms of time cost.

Limitations

The authors address the limitations and provide possible future directions in their paper.

Author Response

We thank reviewer y52g for the detailed review and insightful comments!

RE Weakness 1:

Please refer to GR3.

RE Weakness 2:

We totally understand the concern that using different DMs may raise fairness questions, though it is not clear which method the potential unfairness favors. In fact, even the ReSample paper [1] compares their LDM-based method with DM-based methods in most of their experiments. We are glad that our ablation study in Table 5 partially mitigated this concern, so we have now added more LDM experiments. Please refer to GR4 for results of our method with LDMs on inpainting and nonlinear deblurring tasks.

[1] Solving inverse problems with latent diffusion models via hard data consistency

RE Question 1:

This is indeed a valid question! Please refer to GR2.

Comment

Thank you for answering my questions and providing new baseline results during the short rebuttal period.

I have some further questions about the experiments. Judging from the tables in the general response, DMPlug performs well on the inpainting task with a random mask. However, the qualitative result for box inpainting shown in the pdf does look blurry for both masked and unmasked regions, which is not the case for many baselines such as DPS. Could the author explain more about the possible reasons for the blurry reconstructed images? Why have the quantitative results for box inpainting not been reported?

I also appreciate the authors' response on the time cost of DMPlug. However, the results do not completely resolve my previous concern, as the time complexity of DMPlug is indeed several times higher than that of most baselines. I would encourage the authors to discuss this limitation more carefully in their paper.

Overall, my opinion on this work has not changed, so I will keep my original rating for now.

Comment

We thank reviewer y52g for the further discussion and insightful comments! We will follow your suggestions to revise the limitation part of this paper.

  • We will update GR8 for the box inpainting task soon.
  • Please refer to GR6 for the updated time and memory consumption of our method.

Comment

We thank reviewer y52g for the detailed review and insightful comments!

During the rebuttal and discussion period, we incorporated six additional strong baseline methods in GR3, GR7, and GR9, as suggested by reviewer uJfQ and reviewer yL1C. These methods include FPS, DiffPIR, DDNM, PSLD, RED-diff, and ΠGDM. We hope these results can address your concerns on this point, if any.

Comment

We thank reviewer y52g for the further discussion and insightful comments!

For the box inpainting task, please refer to GR8 for the blurriness analysis, and both qualitative and quantitative comparisons between our method and several strong baseline methods.

Review (Rating: 6)

This paper proposes an optimization-based method that optimizes the initial noise for data consistency. The method is evaluated on a variety of linear and nonlinear inverse problems and achieves SOTA results. The method also shows robustness to unknown noise levels via a technique called ES-WMV.
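
ES-WMV (early stopping via windowed moving variance, from the deep-image-prior literature) is not spelled out in this thread; the following is a minimal sketch of the idea, a paraphrase rather than the authors' implementation, with `window` and `patience` as illustrative hyperparameters and `images` an assumed iterator over intermediate reconstructions:

```python
import torch
from collections import deque

def es_wmv(images, window=50, patience=200):
    """Sketch of ES-WMV: track the variance of the last `window` intermediate
    reconstructions, keep the iterate at which that variance is smallest, and
    stop once the variance has not improved for `patience` iterations."""
    buf = deque(maxlen=window)
    best_wmv, best_step, best_img = float("inf"), 0, None
    for step, x in enumerate(images):
        buf.append(x.detach())
        if len(buf) < window:
            continue
        # windowed moving variance: pixelwise variance over the buffer, averaged
        wmv = torch.stack(tuple(buf)).var(dim=0).mean().item()
        if wmv < best_wmv:
            best_wmv, best_step, best_img = wmv, step, x.detach()
        elif step - best_step > patience:
            break  # variance has plateaued: overfitting to noise likely started
    return best_img
```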

Strengths

  1. The method looks novel and straightforward to implement.
  2. The paper is well written and easy to understand.
  3. Diffusion inverse solvers struggle with nonlinear problems, since nonlinearity precludes the use of SVD or null-space decomposition. This paper improves the SOTA on nonlinear inverse problems by a significant margin.
  4. The early-stopping/deep-image-prior phenomenon reported in this paper is interesting.

Weaknesses

  1. The R(·) function looks quite expensive since it involves many reverse passes. Also, the optimization may take up to 5,000 iterations. Does that require substantial additional memory and computing time? I cannot find the memory and time costs in this manuscript except in Fig. 5.
  2. The results on LDMs are limited; it would be more desirable to include LDM results on more nonlinear and linear problems besides super-resolution.
  3. The backbone model should be stated more clearly in the paper: what is the pretrained model? How is it trained? How many NFEs are used after optimizing the initial noise?
  4. The authors could provide more intuition on why 3 DDIM reverse steps are sufficient, since it is challenging to generate high-quality images in 3 steps, and the measurement function A(·) is applied to E[x_0 | x_t], which may not reflect the ground-truth signal (see the update sketched below). I think deblurring or super-resolution works because 3 DDIM reverse steps give blurry images that are quite consistent with the measurement, but have the authors experimented with more challenging forward models like the CT operator (Radon transform) or phase retrieval?
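
For reference, the deterministic (η = 0) DDIM update in question, written in standard notation with cumulative schedule $\bar{\alpha}_t$ (the textbook form, not quoted from the paper under review):

$$
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \approx \mathbb{E}[x_0 \mid x_t],
\qquad
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(x_t, t),
$$

so the 3-step R(·) is a composition of three such updates applied to the seed.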

Questions

See weaknesses.

  1. memory and computing time
  2. clearer description of the pretrained model
  3. experimenting on more challenging forward models like CT operator or phase retrieval
  4. more LDM results

Limitations

Yes, limitations are adequately addressed.

Author Response

We thank reviewer wUnV for the detailed review and insightful comments!

RE Weakness 1 & Question 1:

Please refer to GR2.

RE Weakness 2 & Question 4:

Please refer to GR4.

RE Weakness 3 & Question 2:

This is a very good suggestion! We use the standard DDIM/DDPM and LDM models. The pre-trained DDIM models for CelebA and FFHQ are from this link; the pre-trained DDIM model for LSUN-bedroom is from this link; the pre-trained LDM model for CelebA is from this link. We refer readers to the original papers for the training details. After optimizing the initial noise, we apply the same number of NFEs in R(·) as in the optimization stage, i.e., 3 in most of the experiments in this paper. We will add an appendix with these details in the revision.

RE Weakness 4 & Question 3:

Thanks for this great suggestion!

Please refer to GR1 for our discussion of the seemingly fundamental difference between inverse problems and conditional image generation.

In our paper, we have tried three challenging nonlinear IPs, including nonlinear deblurring, BID, and BID with turbulence, but we still want to try the tasks you suggested. For CT reconstruction, we follow the settings in [1], since they provide a pre-trained DM for CT images. However, we are still waiting for data access approval from TCIA at this link. In phase retrieval (PR), the goal is to recover a 2D or 3D image from the oversampled diffraction pattern (squared magnitude) of its Fourier transform. Although we can definitely test our method on PR, we have decided not to do so here, because the setting in the DPS paper [2] and follow-up work may be physically wrong: Fourier PR never involves color images (it involves only 2D or 3D single-channel images) [3,4,5,6]. In fact, their setting leads to a more difficult, yet unrealistic, problem; we leave this study as future work.

[1] Improving diffusion models for inverse problems using manifold constraints

[2] Diffusion posterior sampling for general noisy inverse problems

[3] The numerics of phase retrieval

[4] SiSPRNet: end-to-end learning for single-shot phase retrieval

[5] Three-dimensional imaging of strain in a single ZnO nanorod

[6] What is Wrong with End-to-End Learning for Phase Retrieval?

Comment

I appreciate the authors' effort in adding new experiments in such a short rebuttal period. The results are convincing, but I have some questions about the 3-step DDIM inverse. It looks like, in the rebuttal pdf, CGwCG cannot generate meaningful content in 3 steps. I am guessing that the DDIM sampler may lead to this issue; would other solvers like DPM++ be better? Also, the box-inpainting images in the rebuttal pdf look slightly blurry compared to the original images. Could the authors explain this phenomenon?

Comment

Thank you for your prompt response and valuable questions! We agree that using faster samplers might support CGwCG with fewer reverse steps. In this study, we retained the default sampler for CGwCG to (1) clearly illustrate the differences between IPs and CGwCG, and (2) address Question 4 raised by Reviewer uJfQ. For the blurriness observed in the reconstructions in the box inpainting task, we repeated the experiments and saved the intermediate results at https://anonymous.4open.science/r/2024_NIPS_rebuttal-AAC3/box_process.png. Currently, we present the reconstructions with the highest PSNR. Our findings indicate that the PSNR peak for the box inpainting task appears earlier than for other tasks, as illustrated in Figure 6(2) of the manuscript. Combined with the spectral bias phenomenon reported in Figure 7, reconstructions that stop at early stages may contain insufficient high-frequency information, leading to blurriness. If we continue the optimization for more iterations, more high-frequency information is recovered, as shown at the same link.

Comment

The deep-image-prior phenomenon that appears in the link is also interesting. One suggestion from me would be to present the result with the best LPIPS score, since high PSNR may favor blurriness. However, I still feel that the 3-step DDIM needs a better explanation or empirical support from more experiments on challenging tasks, especially since the box-inpainting reconstructions look slightly blurry and differ from the original measurement. This casts doubt on whether the solution can maintain data fidelity to the measurement. Nevertheless, the current results on super-resolution and deblurring are impressive. I believe the contribution is sufficient for an empirical method paper, so I keep my score at 6: weak accept.

Comment

We thank reviewer wUnV for the further discussion and insightful comments!

For the box inpainting task, please refer to GR8 for the blurriness analysis, and both qualitative and quantitative comparisons between our method and several strong baseline methods.

Comment

I think the authors have addressed most of my concerns, and I will keep my score as weak accept.

Review (Rating: 6)

The authors propose a new framework, called DMPlug, to solve inverse problems with pre-trained diffusion models. Most prior works on this topic propose approximations for the conditional score function. The authors propose an alternative approach that is closely related to techniques for solving inverse problems with GANs. Namely, the authors consider the latent space that emerges in diffusion models by projecting images through the deterministic sampler, and perform the optimization in this latent space.

There are two challenges with this approach: 1) one needs to backpropagate through the whole sampling chain to perform this optimization, so this approach can be really computationally intensive, and 2) for every update of the latent, one needs to run the whole sampler, so this approach can become really slow. The authors mention that they can circumvent these two issues by only running 3 steps of the sampler and that this suffices for their considered inverse problems.
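
A minimal sketch of the latent-seed optimization described above, assembled for illustration rather than taken from the authors' code (`model` is any pretrained noise predictor, `timesteps` a short schedule such as three steps, `alpha_bars` the cumulative noise schedule, and `A` the known forward operator):

```python
import torch

def ddim_reverse(z, model, timesteps, alpha_bars):
    """Deterministic DDIM sampler R(z): map a latent seed z to an image.
    timesteps: decreasing step indices, e.g. [T, 2*T//3, T//3]."""
    x = z
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else x.new_tensor(1.0)
        eps = model(x, t)                                     # eps_theta(x_t, t)
        x0_hat = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()  # approx E[x_0 | x_t]
        x = ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps
    return x

def dmplug(y, A, model, timesteps, alpha_bars, iters=2000, lr=1e-2):
    """Optimize the seed so that the generated image matches the measurement y."""
    z = torch.randn(1, 3, 256, 256, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = (A(ddim_reverse(z, model, timesteps, alpha_bars)) - y).pow(2).mean()
        loss.backward()   # backprop through the whole (short) sampling chain
        opt.step()
    return ddim_reverse(z, model, timesteps, alpha_bars).detach()
```

With only a few reverse steps, the computational graph stays small enough that both challenges above become manageable, which is the tradeoff the authors exploit.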

Strengths

  1. The topic of this research paper is interesting. There is a growing literature on approaches for solving inverse problems with diffusion models. This paper proposes a fresh idea in this space.
  2. The experimental results are strong, providing evidence that this is a promising approach.
  3. Since the approach here is very similar to solving inverse problems with GANs, there is a variety of ideas from this space that can be potentially leveraged to further improve the results of this approach.

Weaknesses

  1. The proposed method is a maximum-likelihood method and does not offer posterior samples. For certain inverse problems, being able to sample from the posterior is really important, for uncertainty quantification, diversity, etc.
  2. The presentation of the paper could be improved. For example, one idea would be to directly optimize in the space of clean images, since diffusion models also offer the score there. This idea wouldn't work because the estimation of the score at t = 0 is usually poor. The authors do not explain this properly in their manuscript.
  3. The authors also mention that for conditional tasks, 3 steps are enough. However, this should truly depend on the level of corruption. In the limit of extreme corruption, it can't be true that 3 sampling steps are enough, since we know for a fact that diffusion models need more steps to achieve good generation results.
  4. From Figure 6(1), it looks like the MSE is minimized at around 1,000 iterations of the proposed algorithm. Given that each iteration requires 3 sampling steps, this is equivalent to 3,000 sampling steps. Most methods for solving inverse problems with diffusion models work with far fewer sampling steps.
  5. The authors could have used better baselines in their experimental evaluation, including ΠGDM, RED-diff, and PSLD.

Questions

  1. It seems to me that one of the main weaknesses of this work is the propagation over the sampling chain. Since all that is needed is the solution of the ODE, is it possible to use this method with Consistency Diffusion Models?
  2. There are many techniques to solve inverse problems with GANs. One of these techniques is Intermediate Layer Optimization, where the optimization happens in some intermediate latent space. This idea seems directly relevant here and could lead to improvements in performance and cost. Namely, the authors could optimize in the latent space that corresponds to some time other than t = T. Could the authors ablate this?
  3. I would like to see the number of required steps needed in the sampling chain as the difficulty of the diffusion problem increases. Namely, the authors could fix an inverse problem, let's say random inpainting, and ablate the number of required sampling steps for good performance as the corruption probability increases.

Limitations

The authors have addressed the limitations of their work.

Author Response

We thank reviewer yL1C for the detailed review and insightful comments!

RE Weakness 1:

We agree that allowing posterior sampling for uncertainty quantification and other purposes would be great. This is, unfortunately, not what the MAP framework can offer. We follow most current DM-based methods for inverse problems, which do not provide posterior sampling or uncertainty quantification. We will acknowledge this as a limitation of the current work, and leave it for future research.

RE Weakness 2:

We will definitely revise the manuscript thoroughly after the rebuttal to improve the presentation!

Optimizing directly in the space of clean images reduces things to the typical MAP formulation without using the pretrained DM prior. Our consideration is not about whether the score functions are estimated accurately; ours is more geometric: we view the whole reverse process as a learned function that characterizes the image manifold (see the schematic objective below). Then the question is really how many steps we need in the pretrained DM so that approximate DM priors are incorporated. We have tried to address this choice in Sec 3.1, as well as in GR1 above.
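
Schematically, in notation added here for this thread rather than quoted from the paper (R(·) denotes the few-step deterministic reverse process, ρ an optional regularizer with weight λ):

$$
\min_{z}\ \tfrac{1}{2}\,\big\|\, y - A\big(R(z)\big) \,\big\|_2^2 \;+\; \lambda\,\rho\big(R(z)\big),
$$

so the pretrained prior enters only through the reparameterization x = R(z), rather than through an explicit score term.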

RE Weakness 3 & Question 3:

We totally agree with the intuition, and our choice of 3 steps is a hyperparameter that works reasonably well in typical settings, but in no sense optimal.

We conducted the ablation study as you suggested. We test 40 cases from the CelebA dataset for random inpainting and try different numbers of reverse steps as the mask ratio increases. The table below shows that our method with 7 steps performs slightly better than with 3 or 11 steps, as you expected. We will add this ablation study in our future version and recommend that users ablate the number of steps when facing challenging IPs.

| Inpainting (CelebA), PSNR | Mask ratio 90% | Mask ratio 94% | Mask ratio 98% |
| --- | --- | --- | --- |
| 3 steps | 25.96 | 25.68 | 24.87 |
| 7 steps | 26.48 | 26.36 | 25.83 |
| 11 steps | 25.84 | 25.67 | 25.31 |

RE Weakness 4:

Please refer to GR2.

RE Weakness 5:

Please refer to GR3.

RE Question 1:

This is a very good idea! We noticed a recent paper [1] (after our submission) that successfully combines optimizing the input of diffusion models and consistency models. We will probably also incorporate this in our future version.

[1] Inverse Problems with Diffusion Models: A MAP Estimation Perspective

RE Question 2:

This is an interesting idea and definitely worth exploring! Using our 3-step method for super-resolution on 100 cases from CelebA, we experiment with optimizing the latent space at t = T, t = ⅔T, and t = ⅓T, respectively. The results below show that optimizing the latent space at t < T can lead to unsatisfactory performance, which shows that it is essential to optimize the latent space at t = T to take full advantage of the pre-trained prior.

| SR (CelebA) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| t = T | 31.25 | 0.878 | 0.067 |
| t = ⅔T | 26.35 | 0.587 | 0.237 |
| t = ⅓T | 11.39 | 0.096 | 1.162 |

Comment

I read the rebuttal of the authors and I want to thank them for their efforts.

After reading the rebuttal and the comments from other Reviewers, some of my concerns remain: the method does not offer posterior samples, is very slow, and the number of reverse steps should be higher than 3 for challenging inverse problems (which makes sense).

I also found more limitations: i) the method seems to be working best with DMs and not LDMs and ii) by looking closely at the box-inpainting images, it seems that some of the results are more blurry than I expected.

The authors also did not compare with any of the methods that I proposed in my review (ΠGDM, RED-diff, and PSLD). I think that a comparison with these baselines would strengthen the paper.

Finally, regarding Question 2, I am curious how the authors initialized the latent when they started the optimization at the intermediate noise level. I was proposing to do the optimization in an iterative way, starting from the solution obtained from the current implementation of the authors' algorithm (similar to ILO). Intuitively, optimizing in intermediate spaces should not hurt performance, given that the algorithm is properly initialized.

For the reasons mentioned above, I am inclined to reduce my current rating. Since I appreciate the hard work of the authors during the rebuttal, I will wait for the author's reply and I will follow closely the discussion with the other Reviewers before finalizing my decision.

Comment

We thank reviewer yL1C for the further discussion and insightful comments!

RE "After reading the rebuttal ... higher than 3 for challenging inverse problems (which makes sense).":

  • Again, regarding the capability to offer posterior sampling and uncertainty quantification, we acknowledge it as a common limitation of our method and most competing methods. Extending our method might allow us to develop such capabilities, but it is outside the scope of the current manuscript.

  • Regarding the slowness, please refer to our further discussion in GR6.

RE "I also found more limitations ... more blurry than I expected.":

  • For i), we acknowledge the slight performance difference between our method with DMs vs with LDMs, but we humbly disagree that this should be considered a limitation: (1) comparing the performance of our method (similarly, other competing methods) across different DM backbones directly may not make sense, as there can be intrinsic differences in priors captured by the various backbones. The purpose of the ablation studies in Table 5 is to showcase the flexibility of our method across different DMs; (2) while our method with LDM lags behind our method with DMs by approximately 1dB in terms of PSNR, it still delivers comparable or superior performance to other baseline methods, as demonstrated in GR4.

  • For ii), we will update GR8 for the box inpainting task soon.

RE "The authors also did not compare ... would strengthen the paper.":

We thank the suggestion of the reviewer. Please refer to GR7.

RE "Finally, regarding Question 2 ... given that the algorithm is properly initialized.":

Thank you for suggesting this interesting idea! We agree that the iterative algorithm is very likely to further improve the performance. We are reviewing the details of the ILO paper and trying to figure this out by the end of the rebuttal period. We will share the results as soon as they are available.

Comment

We thank reviewer yL1C for the valuable suggestion to include more baseline methods. In addition to PSLD (GR7), we have tried our best to obtain the results of RED-diff and ΠGDM for super-resolution before the discussion deadline. Please refer to GR9 for more details.

Comment

I would like to thank the authors for their additional results. I increased my score. That said, I would like to see the experiments we discussed in the camera-ready version of this work.

Comment

We greatly appreciate the acknowledgment from reviewer yL1C. We will be sure to include all the additional experiments in our future version.

Comment

We thank reviewer yL1C for the further discussion and insightful comments!

For the box inpainting task, please refer to GR8 for the blurriness analysis, and both qualitative and quantitative comparisons between our method and several strong baseline methods.

Author Response

We thank all the reviewers for their thoughtful and constructive comments about our manuscript!

GR1: Difference between solving inverse problems and conditional generation with classifier guidance (CGwCG)

As suggested by Reviewer uJfQ and hinted at by other reviewers, we ablate the number of reverse steps for [1], i.e., CGwCG, in the attached pdf file. It is clear that 3-step CGwCG cannot generate meaningful objects.

This suggests that CGwCG can be substantially more difficult than typical inverse problems. We suspect the reason is that in typical inverse problems, the measurement y provides much stronger “guidance”/information than the label/text guidance in conditional generation: the former, together with additional priors (say from pre-trained DMs), typically enables reliable estimation of the ground truth, whereas the latter can have numerous, or even infinite, solutions.

[1] End-to-End Diffusion Latent Optimization Improves Classifier Guidance

GR2: Time and memory consumption of our method

In response to the valuable suggestions from several reviewers, we provide below a comparison of our method with other competitors in terms of time consumption and memory usage (for ours, we report the time to reach the PSNR peak). We acknowledge that our method is currently often slower than competing methods. But we also want to highlight its superior performance for solving IPs, particularly nonlinear IPs (typically a 3-6 dB boost in PSNR). Users have the flexibility to choose the right balance between recovery quality and computational cost based on their own priorities and constraints. We will add this point to the limitations and leave accelerating our method for future work.

| Method | Time (s) | Memory (GB) |
| --- | --- | --- |
| ADMM-PnP | 6 | 0.42 |
| DMPS | 42 | 5.10 |
| DDRM | 30 | 4.99 |
| MCG | 43 | 2.80 |
| DPS | 43 | 2.79 |
| DDNM | 14 | 5.11 |
| FPS | 62 | 20.63 |
| DiffPIR | 4 | 1.44 |
| ReSample | 367 | 4.87 |
| Ours | 635 | 6.59 |
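
The measurement protocol is not specified in this thread; a typical way to collect such numbers in PyTorch is sketched below, with `run_method` a hypothetical callable that performs one full reconstruction:

```python
import time
import torch

def profile(run_method, device="cuda"):
    """Wall-clock time and peak GPU memory for one reconstruction."""
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)      # exclude previously queued async work
    t0 = time.perf_counter()
    out = run_method()
    torch.cuda.synchronize(device)      # wait for all kernels to finish
    elapsed_s = time.perf_counter() - t0
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return out, elapsed_s, peak_gb
```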

GR3: Comparison with more SOTA methods

Although we have compared our method with the very recent work ReSample [2] (ICLR'24 Spotlight) in most of the experiments, we would still like to compare with more SOTA methods, as the reviewers suggested. We have tried our best to implement three methods in the short rebuttal period. We run super-resolution (SR) and inpainting for FPS, DiffPIR, and DDNM on 100 cases from CelebA, since they only consider linear IPs in their papers. Our method and DiffPIR lead the other methods for SR, and our method leads all competitors for inpainting.

[2] Solving inverse problems with latent diffusion models via hard data consistency

| SR (CelebA) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| FPS | 29.12 | 0.858 | 0.149 |
| DiffPIR | 31.55 | 0.857 | 0.203 |
| DDNM | 29.21 | 0.836 | 0.193 |
| Ours | 31.25 | 0.878 | 0.067 |

| Inpainting (CelebA) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| FPS | 32.06 | 0.924 | 0.064 |
| DiffPIR | 31.22 | 0.866 | 0.219 |
| DDNM | 27.89 | 0.799 | 0.224 |
| Ours | 34.03 | 0.936 | 0.039 |

GR4: More comparison with methods based on latent-diffusion models (LDMs)

We thank Reviewer wUnV for the suggestion to include more LDM results. We have provided additional results of our method using LDM models, as shown below. These results exhibit a similar trend to the SR results in Table 5, indicating that our method performs slightly better with DMs compared to LDMs. Regardless of whether DMs or LDMs are used, our method consistently achieves comparable or superior results to SOTA methods for these tasks.

| Inpainting (CelebA) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| Best competitor | 32.24 | 0.924 | 0.039 |
| Ours (DM) | 34.03 | 0.936 | 0.039 |
| Ours (LDM) | 33.10 | 0.923 | 0.048 |

| Nonlinear deblur (CelebA) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| Best competitor | 28.52 | 0.839 | 0.104 |
| Ours (DM) | 31.61 | 0.882 | 0.073 |
| Ours (LDM) | 30.64 | 0.861 | 0.108 |

GR5: More complex settings for inpainting and super-resolution

We follow the valuable suggestion from Reviewer uJfQ to add more qualitative results for more complex settings. Please check the visual results of box inpainting on CelebA and super-resolution on ImageNet in the attached pdf file.

Comment

Dear Reviewers,

I hope this message finds you well. I would like to take this opportunity to express my gratitude for the valuable feedback and insights you have provided during the review process. Your expertise and dedication are essential to maintaining the high standards of our community.

The authors have devoted considerable time and effort to addressing the concerns and suggestions you raised in your reviews. They have carefully crafted a detailed rebuttal, aiming to clarify any misunderstandings and incorporate your feedback into their manuscript.

I kindly ask that you take a thorough and thoughtful look at the authors' responses. Your assessment of their rebuttal is crucial in determining whether they have satisfactorily addressed the issues and concerns highlighted in your initial review.

Thank you once again for your hard work and commitment to advancing the quality of our scholarly community. Your contributions are greatly appreciated.

Best regards,

AC

Comment

We sincerely thank all the dedicated ACs and reviewers for your time and great efforts. We would like to take this opportunity to re-emphasize our contributions.

  • (Novelty) Recently, interleaving the diffusion process with projection or gradient updates has become the dominant framework for solving inverse problems using pretrained diffusion models, with hundreds of papers published in the past few years. In this context, we introduce a completely novel approach, DMPlug, which directly integrates pretrained diffusion models into the Maximum A Posteriori (MAP) framework. We appreciate the recognition of our novelty by Reviewer yL1C, Reviewer wUnV, and Reviewer y52g.

  • (Significant improvements in experiments) Our method outperforms most SOTA methods across a range of linear and nonlinear inverse problems, including super-resolution, inpainting, nonlinear deblurring, blind image deblurring (BID), and BID with turbulence. The performance improvements are particularly significant for complex nonlinear inverse problems, with gains typically ranging from 3 to 6 dB in PSNR. We appreciate the recognition of the improvements by all four reviewers.

  • (Robustness to unknown noise). To the best of our knowledge, this is the first paper that can address unknown measurement noise with pretrained DMs for solving IPs. Most current popular methods assume that the measurement noise is of a certain type, like Gaussian, and that the noise level is known. We appreciate the recognition of this point by Reviewer wUnV, Reviewer y52g, and Reviewer uJfQ.

  • (Interesting phenomena). We demonstrate that our method achieves superior performance with just 3 DDIM reverse steps, whereas conditional generation using a classifier fails to produce reasonable objects in only 3 steps. We further compared and discussed this during the rebuttal period; it may suggest significant differences between conditional generation and IPs. We are delighted that most reviewers have expressed interest in this. We also observe the early-learning-then-overfitting and spectral-bias phenomena when using our method to solve IPs.

In general, we believe that this work has the potential to open up several new research directions for our community. Thanks again for all the ACs and reviewers!

Comment

We sincerely appreciate the time and efforts of the AC and all the reviewers. Below, we provide additional evidence and arguments to further address several common concerns raised during the discussion so far.

GR6: Time and memory consumption of our method (Continuing GR2)

  • We fully understand the concern regarding our method's running time. We want to reiterate its superior reconstruction quality, particularly for nonlinear IPs, where we observe a typical 3-6 dB boost in PSNR. We are not claiming superiority in all aspects (quality, speed, and memory); none of the competing methods is a clear winner when all three are considered. But when people solve practical problems with different priorities and constraints, say a priority on high quality or a run-time constraint, it is valuable to have such tradeoffs and a whole spectrum of algorithmic choices.

  • Also, our unified optimization formulation is not limited to using the ADAM optimizer, as demonstrated in our ablation study in Table 6 of the manuscript. We have further tested our method with the L-BFGS optimizer during the discussion period to evaluate computational costs. The results indicate that our method with L-BFGS is significantly faster than with ADAM, while maintaining similar memory usage. While the L-BFGS version is still somewhat slow, it is already faster than ReSample [1] (ICLR’24 Spotlight).

[1] Song, B., Kwon, S.M., Zhang, Z., Hu, X., Qu, Q. and Shen, L., 2023. Solving inverse problems with latent diffusion models via hard data consistency. arXiv preprint arXiv:2307.08123.

| Method | Time (s) | Memory (GB) |
| --- | --- | --- |
| ADMM-PnP | 6 | 0.42 |
| DMPS | 42 | 5.10 |
| DDRM | 30 | 4.99 |
| MCG | 43 | 2.80 |
| DPS | 43 | 2.79 |
| DDNM | 14 | 5.11 |
| FPS | 62 | 20.63 |
| DiffPIR | 4 | 1.44 |
| ReSample | 367 | 4.87 |
| Ours (ADAM) | 635 | 6.59 |
| Ours (L-BFGS) | 255 | 6.74 |

(The wall-clock time of each method is measured while running on a single NVIDIA A100 GPU.)
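
For reference, a sketch of the optimizer swap, reusing the hypothetical `ddim_reverse`, `A`, `y`, `model`, `timesteps`, and `alpha_bars` names from the sketch earlier in this thread (illustrative, not the authors' code). Unlike ADAM, `torch.optim.LBFGS` re-evaluates the objective several times per step, so the loss must be wrapped in a closure:

```python
import torch

z = torch.randn(1, 3, 256, 256, requires_grad=True)
opt = torch.optim.LBFGS([z], lr=1.0, max_iter=20, history_size=10)

def closure():
    # L-BFGS calls this repeatedly per step to evaluate loss and gradients
    opt.zero_grad()
    loss = (A(ddim_reverse(z, model, timesteps, alpha_bars)) - y).pow(2).mean()
    loss.backward()
    return loss

for _ in range(50):        # far fewer outer iterations than with ADAM
    opt.step(closure)
```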

GR7: Comparison with more SOTA methods (Continuing GR3)

We appreciate Reviewer yL1C's suggestion regarding several competing methods. In addition to the three strong competitors proposed by Reviewer y52g during the rebuttal period (FPS, DiffPIR, and DDNM), we have also included an additional baseline method, PSLD [2], during the discussion period. Our method continues to show comparable or superior performance to the tested competing methods. Furthermore, we are actively working to run ΠGDM and RED-diff before the discussion deadline and will update the results as soon as they become available.

[2] Solving linear inverse problems provably via posterior sampling with latent diffusion models

| SR (CelebA) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| FPS | 29.12 | 0.858 | 0.149 |
| DiffPIR | 31.55 | 0.857 | 0.203 |
| DDNM | 29.21 | 0.836 | 0.193 |
| PSLD | 26.45 | 0.682 | 0.243 |
| Ours | 31.25 | 0.878 | 0.067 |

| Inpainting (CelebA) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| FPS | 32.06 | 0.924 | 0.064 |
| DiffPIR | 31.22 | 0.866 | 0.219 |
| DDNM | 27.89 | 0.799 | 0.224 |
| PSLD | 27.65 | 0.785 | 0.213 |
| Ours | 34.03 | 0.936 | 0.039 |

GR8: Box inpainting (Continuing GR5)

We are currently working on this part. We will share the response here once it is finalized.

Comment

GR8: Box inpainting (Continuing GR5)

Following several reviewers’ suggestions made during the rebuttal, we include further results and analysis on the box inpainting task below.

Why do the reconstructions in GR5 appear slightly blurry inside the box mask?

This seems to be the result of the interaction between the spectral bias of our learning process and the PSNR metric we use to select the reported result, as explained below.

We provide the intermediate reconstruction of our method at https://anonymous.4open.science/r/2024_NIPS_rebuttal-AAC3/box_process.png. After the PSNR peak, we can see an obvious visual improvement in image details. So, if we visually pick the best result, we can get much sharper ones.

The spectral bias of learning here means that our method picks up the low-frequency components first, and then gradually the high-frequency ones. Visually, it picks up smooth blobs first, and then gradually more details, as can be seen from https://anonymous.4open.science/r/2024_NIPS_rebuttal-AAC3/box_process.png. However, the PSNR metric that we use to select the reported results is known to be insensitive to slight smoothing/blurring, which typically causes sharp changes in perceived visual quality. For example, as [3] points out, "...a well-known example is that blurring causes large perceptual but small L2 change...". So the result selected based on the PSNR metric may not be the one with the sharpest detail inside the box mask; in fact, due to the spectral bias, the sharpest one tends to appear after the PSNR peak.

Would performing the selection based on other visual metrics work? Inspired by Reviewer wUnV's suggestion, we experimented with the Learned Perceptual Image Patch Similarity (LPIPS) metric instead of PSNR. Unfortunately, the reconstruction with the lowest LPIPS value still did not align with the visually optimal result. So this question really becomes which visual metric aligns well with human visual perception, which is an active research area in itself.
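
A sketch of the metric-based selection being discussed (the `lpips` package is the reference LPIPS implementation; `trajectory`, a list of saved intermediate reconstructions in [0, 1], and the ground-truth image `gt` are hypothetical names):

```python
import torch
import lpips  # pip install lpips

def psnr(a, b):
    # assumes images in [0, 1]
    return 10 * torch.log10(1.0 / (a - b).pow(2).mean())

lpips_fn = lpips.LPIPS(net="alex")  # LPIPS expects inputs scaled to [-1, 1]

best_by_psnr = max(trajectory, key=lambda x: psnr(x, gt).item())
best_by_lpips = min(trajectory, key=lambda x: lpips_fn(2 * x - 1, 2 * gt - 1).item())
```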

Why is only the box-mask inpainting task affected? What about other tasks? In the manuscript, we have reported results for most other tasks based on the PSNR metric, and their visual quality is typically satisfactory. We believe the gap lies in how strongly the measurement consistency constrains the problem: for other tasks, blurring the target image locally anywhere would substantially affect the measurement consistency; for box-mask inpainting, the target image can be blurred arbitrarily inside the box mask without affecting the measurement consistency. So the strong emphasis of our method on measurement consistency avoids the trouble discussed above for the other tasks we experiment with, but not for box-mask inpainting.

[3] The unreasonable effectiveness of deep features as a perceptual metric.

Thorough comparison of our method against several strong baseline methods.

Overall, considering both quantitative and qualitative results, our method matches DPS in performance and surpasses other competing methods. Although our reconstructions can sometimes appear slightly blurry, they maintain the highest pixel-level fidelity with the highest PSNR, and they appear photorealistic without artifacts.

| Box inpainting (CelebA) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| FPS | 24.94 | 0.882 | 0.085 |
| PSLD | 23.94 | 0.725 | 0.182 |
| ReSample | 21.72 | 0.817 | 0.122 |
| DPS | 23.84 | 0.865 | 0.075 |
| DDNM | 14.74 | 0.543 | 0.559 |
| DiffPIR | 23.60 | 0.846 | 0.162 |
| Ours (3 steps) | 25.04 | 0.836 | 0.133 |
| Ours (7 steps) | 25.06 | 0.837 | 0.135 |

(For box inpainting, we maintain a box size of 128 × 128 pixels, consistent with the DPS and ReSample papers. In alignment with other experiments in our manuscript and the ReSample paper, we also introduce additional Gaussian noise with 𝜎 = 0.01.)

Comment

We thank reviewer yL1C for the valuable suggestion to include more baseline methods. In addition to PSLD (GR7), we have tried our best to obtain the results of RED-diff and ΠGDM for super-resolution before the discussion deadline. The results indicate that our method consistently demonstrates comparable or superior performance to the tested competing methods.

| SR (CelebA) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| FPS | 29.12 | 0.858 | 0.149 |
| DiffPIR | 31.55 | 0.857 | 0.203 |
| DDNM | 29.21 | 0.836 | 0.193 |
| PSLD | 26.45 | 0.682 | 0.243 |
| RED-diff | 30.08 | 0.873 | 0.216 |
| ΠGDM | 30.01 | 0.876 | 0.164 |
| Ours | 31.25 | 0.878 | 0.067 |

Final Decision

This paper introduces an optimization-based approach to refine the initial noise for data consistency. The method is tested on a range of both linear and non-linear inverse problems, achieving state-of-the-art (SOTA) results.

Most reviewers find the problem important and the work timely. The idea is interesting and novel, and the demonstrated results are strong, evaluated on both linear and nonlinear inverse problems. For the camera-ready version, we suggest that the authors incorporate the reviewers' feedback.