A Modular Conditional Diffusion Framework for Image Reconstruction
Abstract
Reviews and Discussion
This paper proposes a modular conditional diffusion framework for image reconstruction. Specifically, a small module is trained and combined with pretrained IR networks and DPMs. Experiments show that this method is effective.
Strengths
- According to the quantitative and qualitative results, the method shows strong performance on several tasks.
- The method can be combined with task-specific pre-trained models, making it easy to extend to new tasks.
- The module requiring additional training is small, so the method can be trained efficiently.
Weaknesses
- The figures in this paper are of poor quality. The authors should revise the figures to make the formulas clear and professional.
- Overall I think the modular framework is of limited insight. The design is clean but not so informative for image restoration and some other related domains.
- If I understand correctly, this method is agnostic to the specific image restoration task. Though the authors present several tasks in this paper, it would be good if more qualitative results on various tasks were given.
- I think the arguments in lines 45-49 are not proper. As the authors discuss the conditional space, sometimes we may not need too much time for DPM training, especially when we have the base model and can perform efficient fine-tuning. Also, the case that is discussed (1M images and 120 GPU-days for LDM) does not precisely align with the conditional context. These values correspond to the original LDM, but today we usually use SD, which is trained on much larger datasets and takes more time to train. If we want to say that a single condition-space-tailored DPM may require expensive resources, it may be better to include cases of fine-tuning SD or another domain-specific DPM.
- For some baseline methods, it is unclear how they are implemented and compared. For example, LDM is selected as the baseline which gives qualitative and quantitative results, but I am not sure how it is used and which checkpoint is selected. The authors can clearly indicate how baselines are implemented.
Questions
Considering the overall contribution and the weaknesses, I choose "Borderline reject". I am willing to adjust my score for my potential misunderstanding.
Limitations
The limitations are discussed.
Q18 The figures in this paper are of poor quality. The authors should revise the figures to make the formulas clear and professional.
In our submitted manuscript we made sure that both Figures 1 and 2 are vector images and that the formulas inside these figures follow the same LaTeX style as we use in the text. Following the reviewer's comment, we have checked their visual appearance in different PDF viewers and found no such issues. We include a PDF file with raster screenshots of both figures as an attachment to our general response and kindly ask the reviewer to compare their visual appearance with those in our manuscript. For the camera-ready version we plan to expand the horizontal dimension of Figure 1 to match the full page width, which we hope will enhance its appearance.
Q19 Overall I think the modular framework is of limited insight. The design is clean but not so informative for image restoration and some other related domains.
We respectfully disagree with this particular assessment of the reviewer. Previous DPM-based works [40, 60, 78] have attempted to directly estimate the conditional expectation (or residual noise) and have achieved outstanding performance at the cost of generalizability to other tasks. In contrast, we demonstrate that E[x_0|x_t, y] in Eq. (5) can be expressed as a fusion of two separate expectations: E[x_0|x_t] and E[x_0|y]. This approach allows us to learn a single unconditional generative denoising function to estimate E[x_0|x_t] and apply it to different image reconstruction problems with minimal computational overhead, by training only our fusion module on a small dataset. Further, we have verified the validity of our approach on three different and challenging blind image reconstruction tasks by performing comparisons with competitive sota methods on several public benchmarks.
Q20 If I understand correctly, this method is agnostic to the specific image restoration task. Though the authors present several tasks in this paper, it would be good if more qualitative results on various tasks were given.
Our method is not fully agnostic to the image restoration task at hand, since applying it to a new unseen task requires training a small Fusion network (0.7M params). Due to page limitations, and in order to reserve enough space to adequately explain the key ideas of our work, we followed the common practice of several other existing works ([78],[59],[60],[40],[21]) and conducted experiments and comparisons on three challenging tasks from different modalities, while we plan to consider additional reconstruction tasks as future work.
Q21 I think the arguments in lines 45-49 are not proper. As the authors discuss the conditional space, sometimes we may not need too much time for DPM training, especially when we have the base model and can perform efficient fine-tuning. Also, the case that is discussed (1M images and 120 GPU-days for LDM) does not precisely align with the conditional context. These values correspond to the original LDM, but today we usually use SD, which is trained on much larger datasets and takes more time to train. If we want to say that a single condition-space-tailored DPM may require expensive resources, it may be better to include cases of fine-tuning SD or another domain-specific DPM.
This is a valid point raised by the reviewer, and in the revised version we plan to include a discussion about existing methods that fine-tune DPMs. One such example is the T2I-Adapter [A], which requires 12 GPU-days on an NVIDIA V100 32GB GPU and utilizes between 164K and 500K images, depending on the task. Another relevant example is ControlNet [B], which demands between 4 and 25 GPU-days on an NVIDIA A100 80GB GPU (approximately three times faster than the NVIDIA V100) and employs training datasets ranging from 25K to 3M images. Furthermore, the aforementioned adapters increase the number of trainable parameters by at least 77M and by up to half of the parameter size of Stable Diffusion.
[A] Mou, Chong, et al. "T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 5. 2024.
[B] Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
Q22 For some baseline methods, it is unclear how they are implemented and compared. For example, LDM is selected as the baseline which gives qualitative and quantitative results, but I am not sure how it is used and which checkpoint is selected. The authors can clearly indicate how baselines are implemented.
The LDM and its checkpoint that we have used for comparison is the original model that was trained for the SISR task: https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip.
Following the reviewer's suggestion, we have included an additional section in the Appendix with the description and links to the code for all baselines that we report in the manuscript.
Thank you for your response. The rebuttal has addressed my concerns, and I have accordingly raised my rating to 5.
Regarding Q18, my concern was that the figure appears to be of poor aesthetic quality and less professional. It would be beneficial for the framework figure to be clearer and more informative.
We are very pleased that our response addressed all the reviewer's comments, and we thank the reviewer for raising the score. Regarding Q18, we now understand the reviewer's concern, which is similar to the one raised by reviewer hKXQ. Following the reviewer's suggestion, apart from changing its color scheme, we will also re-design the figure content to enhance its overall informativeness. If all other concerns are addressed, we would be grateful if the reviewer would also consider increasing the confidence score.
This paper proposes a new approach to improving the efficiency and applicability of Diffusion Probabilistic Models (DPMs) for various image restoration (IR) tasks. A new modular diffusion probabilistic image restoration (DP-IR) framework combines pre-trained state-of-the-art IR networks with generative DPMs. This framework requires only a small additional module (0.7M parameters) to be trained for specific IR tasks, making it more practical and less computationally expensive. This framework is evaluated on four benchmarks, covering tasks such as burst JDD-SR, dynamic scene deblurring, and super-resolution.
Strengths
- Originality: Modular framework and accelerated sampling strategy.
- Quality: Robust methodology and comprehensive experimental validation.
- Clarity: Clear presentation.
- Significance: By reducing the computational requirements and enhancing the generalizability of IR methods, the framework could facilitate the adoption of advanced image restoration techniques in real-world scenarios where computational resources are limited.
Weaknesses
Computational cost analysis: A more detailed analysis of the computational cost, including memory usage and inference time, would be helpful.
Experimental validation: While the paper demonstrates strong performance on specific benchmarks, the experiments could benefit from additional datasets to further validate the generalizability of the proposed framework.
Questions
How sensitive is your framework to the choice of hyperparameters?
Limitations
Elaborate on future work directions that could address these limitations. For instance, proposing specific research avenues to enhance the generalizability of the framework, or developing more adaptive hyperparameter tuning methods.
Q14 Computational cost analysis: A more detailed analysis of the computational cost, including memory usage and inference time, would be helpful.
While we agree with the reviewer that the inclusion of such information has practical value, we would also like to highlight existing issues that prevent us from doing so:
- Both memory usage and inference time are highly dependent on the efficiency of a particular method's implementation and on the hardware used. Moreover, depending on the implementation, memory usage can be traded for faster inference and vice-versa.
- All our diffusion-based competitors in the dynamic scene deblurring problem and one competitor in SISR (InDI) do not offer publicly available implementations, making such direct comparisons impossible.
To avoid speculation on these subjects, we have refrained from performing such comparisons. For our computational cost analysis we rely solely on the number of Neural Function Evaluations (NFEs) to explicitly compare the inference compute, and on the number of parameters to implicitly compare the training data and training compute. For the NFE comparison we assume that all competing diffusion-based methods have similar computational complexity in terms of FLOPs per single NFE; under this assumption, comparing the inference cost reduces to comparing the number of NFEs.
Nevertheless, we agree with the reviewer that this is not the best approach for a computational cost analysis, and for this reason we additionally report below the exact FLOPs per 720p input for the competing methods in the dynamic scene deblurring problem.
| Method | TFLOPs (equation) | TFLOPs Total |
|---|---|---|
| DvSR | 1.2N + 4.8 | 604.8 |
| icDPM | 4.8N + 5.2 | 2405.2 |
| InDI | 4.8N | 48.0 |
| Ours | 4.3N + 1.9 | 23.4 |
The proper way to interpret these equations is as follows: TFLOPs Total = aN + b, where a is the TFLOPs complexity of a single backbone pass within the diffusion process, N is the total number of neural function evaluations (NFEs) per sampling process, and b is the complexity of the sub-modules that have to be run once per image (e.g. the Image Restoration network in our method, the pre-processing net for icDPM and DvSR). Based on these results, we observe that the computational cost of our method is significantly lower compared to our diffusion-based competitors.
We have included this information regarding the FLOPs in the revised version of the manuscript and hope that it enhances the computational cost analysis.
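For readers who wish to reproduce the arithmetic behind the table above, we provide a minimal sketch below. The per-pass costs a and one-off costs b come directly from the equations in the table, while the NFE counts N are inferred from the reported totals; the snippet is purely illustrative and is not part of our actual evaluation code.

```python
# Minimal sketch reproducing the TFLOPs arithmetic of the table above.
# Each entry is (a, b, N): a = TFLOPs per backbone pass, b = one-off TFLOPs
# (e.g. IR / pre-processing network), N = NFEs per sampling run (inferred
# from the reported totals).
methods = {
    "DvSR":  (1.2, 4.8, 500),
    "icDPM": (4.8, 5.2, 500),
    "InDI":  (4.8, 0.0, 10),
    "Ours":  (4.3, 1.9, 5),
}

for name, (a, b, n) in methods.items():
    total = a * n + b  # TFLOPs Total = a*N + b
    print(f"{name:6s}: {total:6.1f} TFLOPs per 720p input")
```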
Q15 Experimental validation: While the paper demonstrates strong performance on specific benchmarks, the experiments could benefit from additional datasets to further validate the generalizability of the proposed framework.
We validated our method using widely recognized and publicly available datasets to ensure a fair and reproducible assessment of our approach. Utilizing these commonly used datasets allows for more straightforward benchmarking and comparison with other methods. However, for other less commonly used or proprietary datasets, it becomes difficult to make direct comparisons. This is primarily due to the fact that many state-of-the-art (SOTA) approaches do not release their implementations publicly. Consequently, without access to these implementations, reproducing their results or performing a head-to-head comparison on different datasets poses significant challenges. Therefore, our evaluation is limited to datasets for which there is widely available and accessible benchmark data.
Q16 How sensitive is your framework to the choice of hyperparameters?
The main hyperparameter of our framework is the time threshold τ, and we refer the reviewer to Table 7 and Table 10 of our manuscript for a detailed ablation study of its influence on the reconstruction quality. All the other parameters of our framework, specifically the total number of diffusion steps and the diffusion coefficients, were taken as-is from the prior literature on diffusion models. We should note that no other works on image reconstruction via reverse diffusion perform a sensitivity study with respect to the values of this sequence, so in our work we also refrained from doing so and leave this topic as a possible future research direction.
However, we have investigated the sensitivity of our approach to the choice of the specific modules, and we refer the reviewer to Table 5 in the main text and section F in the Appendix for details of this study.
Q17 Elaborate on future work directions that could address these limitations. For instance, proposing specific research avenues to enhance the generalizability of the framework, or developing more adaptive hyperparameter tuning methods.
Please find our comment in section "Limitations and further perspectives" of our general response.
This manuscript proposed a modular conditional diffusion model for image reconstruction, consisting of three components: a pre-trained image restoration network, a denoising network, and a fusion network.
Strengths
The model reduces computational load by minimizing the number of network modules that need to be trained and by utilizing lightweight networks. The fusion network is the only component that needs to be trained for each specific IR task. The model also achieves significant acceleration in the sampling process without any loss of reconstruction quality.
Weaknesses
The analysis of trade-offs between model size, speed, and performance may be insufficient. There is insufficient explanation regarding the reasons for selecting specific modules.
Questions
- The standard and rationale for selecting the baseline models have not been detailed. For example, why were specific image restoration networks, denoising networks, and fusion networks chosen? Were these choices based on certain performance metrics, relevant literature, or experimental results? Providing this explanation will help readers understand the rationale behind the model design.
- Were other acceleration techniques used during the sampling process? If so, please provide specific technical details and experimental results, comparing performance differences before and after applying these acceleration techniques.
- It is suggested that the authors revisit the formula in line 115 to ensure that each part is clear and understandable. If there are any symbols or variables that are not clearly defined, please provide explicit definitions. Consider adding explanatory text before and after the formula to help readers understand its meaning and application.
- It is advisable for the authors to verify if the PSNR values in Table 5 are consistent with those in other tables. If all tables should indicate the PSNR Target values as ∞, ensure consistency across all tables to avoid confusion for readers during their review.
- It is recommended that the authors redesign the color scheme of Figure 1 to enhance its aesthetic appeal and improve clarity for better understanding.
Limitations
The authors do not point out the limitations of the work and do not offer further perspectives. I hope it will be improved.
Q8 The standard and rationale for selecting the baseline models have not been detailed. For example, why were specific image restoration networks, denoising networks, and fusion networks chosen? Were these choices based on certain performance metrics, relevant literature, or experimental results? Providing this explanation will help readers understand the rationale behind the model design.
Please refer to the section "Network architectures choice" in general response. Thank you!
Q9 Were other acceleration techniques used during the sampling process? If so, please provide specific technical details and experimental results, comparing performance differences before and after applying these acceleration techniques
As we mention in lines 244-246 and 270-273 of the manuscript, for the SISR task we further employ the DDIM acceleration technique during the sampling process. Following the reviewer's suggestion, we will include an additional section in the appendix with a short technical description of this technique and refer to it in the main text of the manuscript. During our experiments, we did not observe a significant quantitative/qualitative difference after applying this acceleration technique. To be more specific, we provide below a comparison of the reconstruction performance with and without utilizing DDIM, which we will also include in the newly added section in the appendix.
| Method | PSNR | SSIM | LPIPS | TOPIQ | NFE |
|---|---|---|---|---|---|
| Ours (with DDIM) | 28.12 | 0.793 | 0.140 | | 51 |
| Ours (w/o DDIM) | | | | 0.007 | 251 |
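For completeness, we also sketch below the standard deterministic DDIM update (η = 0); this follows the usual convention from the DDIM literature and is not copied verbatim from our manuscript's notation.

```latex
% Sketch of the standard deterministic DDIM update (\eta = 0); the notation
% follows the usual convention and may differ from that of our manuscript.
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}},
\qquad
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(x_t, t).
```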
Q10 It is suggested that the authors revisit the formula in line 115 to ensure that each part is clear and understandable. If there are any symbols or variables that are not clearly defined, please provide explicit definitions. Consider adding explanatory text before and after the formula to help readers understand its meaning and application.
We thank the reviewer for this suggestion. Indeed, in the original version of the manuscript we missed explaining the purpose of one of the parameters in this formula. In the revised version we clarify that it is the noise-scheduling parameter of the forward process. We will also re-write the formula to make it more compact and clear.
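For reference, and under the assumption that the formula in line 115 follows the usual DDPM convention (the exact notation in our manuscript may differ slightly), the forward process governed by this noise-scheduling parameter, commonly denoted β_t, can be sketched as:

```latex
% Sketch of the standard DDPM forward process under the usual convention;
% beta_t is the noise-scheduling parameter referred to in our answer.
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\bigr),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\bigr),
\quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).
```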
Q11 It is advisable for the authors to verify if the PSNR values in Table 5 are consistent with those in other tables. If all tables should indicate the PSNR Target values as ∞, ensure consistency across all tables to avoid confusion for readers during their review.
We thank the reviewer for raising this point. Indeed, in Table 5 there was a typo where we used the word "inf" instead of the "∞" sign, which has now been corrected.
Q12 It is recommended that the authors redesign the color scheme of Figure 1 to enhance its aesthetic appeal and improve clarity for better understanding.
Following the reviewer's suggestion, we have redesigned Figure 1 and changed its horizontal dimension to match the text width. We believe that these changes will improve its aesthetic appeal.
Q13 The authors do not point out the limitations of the work and do not offer further perspectives. I hope it will be improved.
Please see section "Limitations and further perspectives" in our general response.
This paper proposes a modular diffusion probabilistic IR framework to combine the performance benefits of existing pre-trained state-of-the-art IR networks and generative DPMs with a light-weight fusion network. Experimental results on burst JDD-SR, dynamic scene deblurring, and super-resolution demonstrate its superior performance and better perceptual quality.
Strengths
- The proposed framework is based on the existing IR network and equipped with a denoising module, which is suitable for various reconstruction problems without retraining and reduces the burden of computing resources and training data.
- An accelerated sampling algorithm is proposed to further reduce the computational burden during inference.
- Experimental results on burst JDD-SR, dynamic scene deblurring, and super-resolution demonstrate its superior performance and better perceptual quality, highlighting DP-IR’s versatility.
Weaknesses
- The contribution of accelerated sampling seems trivial. As the author said, references [13],[52] already proposed a similar idea.
- Although the author criticizes the task-specific nature of existing solutions, DP-IR still needs to train the fusion network for each specific task, which makes it less versatile.
- Absence of necessary ablation experiments: a) the effectiveness of Eq. 10's sampling strategy; b) inference without the image restoration network, i.e., using y directly as input.
Questions
- Why utilize a smaller version of MIRNet as the denoiser, rather than other architectures used for diffusion?
- Why not utilize an open-source diffusion model, like Stable Diffusion?
- In Eq. 10, the author gets x_{\tau} through posterior sampling from x_{T} and E[x_{0}|y], why not directly noising E[x_{0}|y] to x_{\tau}? The idea that “Lemma 3.2 makes the final reconstruction quality unaffected” is confusing, because E[x_{0}|y] itself is not the true x_{0}. I think an ablation study is needed to verify your idea.
- Although an ablation of different combinations of the image restoration network and the denoising network is given, an ablation of only utilizing the image restoration network or only the denoising network is also needed.
- Minor error: a) “in section 3.4” in line 228 may be “in Table 1”; b) The Table 1 should be table 2?
Limitations
yes.
Q6 The idea that “Lemma 3.2 makes the final reconstruction quality unaffected” is confusing, because E[x_{0}|y] itself is not the true x_{0}. I think an ablation study is needed to verify your idea.
We thank the reviewer for raising a valid point. What we meant with our statement "which as a consequence of Lemma 3.2 does not compromise the final reconstruction quality" in lines 235-236 is that, if we approximate x_{0} with E[x_{0}|y], then the reconstruction result is going to be the same whether we utilize the multi-step reverse diffusion process or the one-step process that we describe in Lemma 3.2. Our statement was not related to the equivalence of the diffusion process between the utilization of the true signal x_{0} and E[x_{0}|y]. Based on the reviewer's comment, we now understand that the current form of this statement can lead to confusion and we will reformulate it in the revised version of the manuscript.
Q7 Although an ablation of different combinations of the image restoration network and the denoising network is given, an ablation of only utilizing the image restoration network or only the denoising network is also needed.
Regarding the utilization of only the image restoration networks, we report such comparisons and results in Tables 2-4. Specifically, in our framework we employed BSRT-small for the JDD-SR task, FFTformer for dynamic scene deblurring, and SwinIR for SR. In Tables 2-4 we report the restoration performance of all these networks when used in a standalone way. From these results we can observe that our framework leads to a noticeable improvement in terms of perceptual quality, which is the primary metric that DPM methods aim to improve.
Regarding the utilization of only the denoising module, we are afraid that such comparisons would not lead to any meaningful conclusions, since any proper solution of the restoration tasks under study has to be consistent with the measurements y. Given the generative capabilities of existing Denoising Probabilistic Model (DPM) modules, omitting the condition y could lead to unpredictable behavior and result in an image that does not align with the original signal x_{0}. If we instead consider a conditional DPM method, then we essentially end up with the methods that we compare against in Tables 3-4, which are specifically designed for one particular task.
Minor error: a) “in section 3.4” in line 228 may be “in Table 1”; b) The Table 1 should be table 2?
Indeed, we have fixed the wrong references in the revised version: "Table 1" -> "Figure 1" and "section 3.4" -> "figure 2" in line 228. Thank you for pointing out this issue.
Q1 The contribution of accelerated sampling seems trivial. As the author said, references [13],[52] already proposed a similar idea.
We agree with the reviewer that our acceleration strategy indeed bears certain similarities to the strategies presented in the referred papers. However, as we also highlight in lines 233-244 of the manuscript, our approach is more general than those proposed in [13],[52]. Below we discuss in more detail both the conceptual and technical differences, and show empirically that our approach leads to better reconstruction results.
Conceptual difference
Using our notation, both [13] and [52] propose to start the reverse process from a timestep τ and a noisy version of the initial estimate of x_{0}, which we denote by x_{τ}. The main conceptual difference of our approach is that in these cases x_{τ} is obtained using the forward diffusion process, while in our case we end up at x_{τ} using the reverse process. The initial motivation for our proposed approach is also different. In particular, while we motivate our procedure from a probabilistic viewpoint and propose to approximate the conditional score function as a composition of three functions, the authors in [13] base their strategy on the contraction property of reverse SDEs, while the authors in [52] use the re-projection of unrealistic images onto the manifold of natural images in the noisy latent space.
Technical difference
Given that in our work we consider the standard DDPM realization of the diffusion process (VP-SDE), we will explain the existing differences under this scenario. The authors of [13] and [52] propose to parameterize x_{τ} as
x_{τ} = \sqrt{\bar{\alpha}_{τ}} E[x_{0}|y] + \sqrt{1-\bar{\alpha}_{τ}} ε, where ε ~ N(0, I),
i.e., the initial estimate E[x_{0}|y] is noised directly with the forward process. In contrast, in our case by using Eq. (10) we adopt the following parameterization: x_{τ} is obtained by starting from x_{T} ~ N(0, I) and iteratively sampling from the posterior q(x_{t-1} | x_{t}, x_{0} = E[x_{0}|y]) for t = T, ..., τ+1.
As we have already highlighted in lines 233-244 of the manuscript, our parameterization is more general, and it is possible to show by induction that under certain conditions it leads to the exact same x_{τ} as in [13] and [52].
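To make the distinction concrete, we sketch both initialization strategies below. The helper names (forward_noise_init, reverse_posterior_init) are hypothetical, the code assumes a standard DDPM posterior with cumulative schedule alpha_bar, and it reflects the description above rather than the exact implementation of Eq. (10). Here x0_hat stands for the initial estimate E[x_{0}|y] produced by the IR network.

```python
import numpy as np

def forward_noise_init(x0_hat, alpha_bar, tau, rng):
    """[13],[52]-style start: noise the initial estimate E[x_0|y] directly to timestep tau."""
    eps = rng.standard_normal(x0_hat.shape)
    return np.sqrt(alpha_bar[tau]) * x0_hat + np.sqrt(1.0 - alpha_bar[tau]) * eps

def reverse_posterior_init(x0_hat, alpha_bar, betas, tau, rng):
    """Sketch of our strategy: start from x_T ~ N(0, I) and iteratively sample
    the DDPM posterior q(x_{t-1} | x_t, x_0 = E[x_0|y]) down to timestep tau."""
    T = len(betas) - 1
    x = rng.standard_normal(x0_hat.shape)  # x_T
    for t in range(T, tau, -1):
        ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
        alpha_t, beta_t = 1.0 - betas[t], betas[t]
        # Standard DDPM posterior mean and variance with x_0 replaced by E[x_0|y].
        mean = (np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)) * x0_hat \
             + (np.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)) * x
        var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
        x = mean + np.sqrt(var) * rng.standard_normal(x.shape)
    return x  # x_tau
```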
Finally, to experimentally demonstrate that our approach exhibits certain benefits compared to the ones described in [13] and [52], we conducted additional comparisons for the SISR problem between the different sampling strategies. From these results it is clear that our proposed strategy works better in practice and leads to superior results both in terms of fidelity and perceptual quality.
| Acceleration Strategy | PSNR | SSIM | LPIPS | TOPIQ | NFE |
|---|---|---|---|---|---|
| Ours | | | | | 51 |
| [13],[52] | 28.05 | 0.783 | 0.142 | 0.016 | 51 |
Q2 Although the author criticizes the task-specific nature of existing solutions, DP-IR still needs to train the fusion network for each specific task, which makes it less versatile.
Please see section "Versatility of proposed framework" in our general response.
Q3 Absence of necessary ablation experiments: a) the effectiveness of Eq. 10's sampling strategy; b) inference without the image restoration network, i.e., using y directly as input.
a) Regarding the effectiveness of our sampling strategy as described in Eq. (10), we refer the reviewer to our answer to his first comment Q1.
b) It is important to note that the condition y can represent the measurement signal from different imaging modalities. As a result, y typically lies in a different domain than that of the target signal x_{0}. Therefore, the direct fusion of the output of the denoising module and the condition y is not always feasible, but requires processing y to ensure that both signals lie in a common domain. An illustrative example of such a case is the burst JDD-SR (Joint Demosaicing, Denoising, and Super-Resolution) task, where the measurement signal consists of several image frames, each lying in the domain of raw (mosaicked) and low-resolution images. Had we chosen not to process y with an image restoration network, then we would have to deploy a fusion module that pre-processes y before performing the actual fusion. In this case, for every different inverse problem we would have to carefully design a specific architecture for the fusion model. Such a strategy would be less versatile than our current approach and would not take advantage of existing pre-trained sota restoration networks.
Q4 Why utilize a smaller version of MIRNet as the denoiser, rather than other architectures used for diffusion? Why not utilize an open-source diffusion model, like Stable Diffusion?
Please see section "Network architectures choice" in our general response.
Q5 In Eq. 10, the author gets x_{\tau} through posterior sampling from x_{T} and E[x_{0}|y], why not directly noising E[x_{0}|y] to x_{\tau}?
We refer the reviewer to our answer to Q1 where we motivate our choice for doing so.
For the rest of the responses please see our official comment below
General response
We would like to thank all reviewers for the insightful questions and valuable suggestions. In this response we would like to address common questions that were asked by more than one reviewer.
Limitations and further perspectives
In the manuscript (lines 356-363) we have briefly described one limitation of our framework, related to the optimal choice of the time threshold τ, which depends on the particular reconstruction task at hand and needs to be selected experimentally. Another limitation, which we have added in the revised version of the manuscript, is that the performance of our method is bounded by the performance of the utilized backbone networks (Denoising and IR modules). Therefore, for a novel image restoration task where a pre-trained IR network does not exist, our framework is not applicable. Finally, for imaging modalities (e.g. medical imaging) for which a score-matching network (denoising module) has not been trained, it is important to either fine-tune an existing denoising module or completely re-train one using appropriate image data.
Network architectures choice
Denoising and IRNet modules. The reason for utilizing a smaller version of MIRNet as the Denoising module is that we wanted to approximately match the number of parameters and the computational complexity of the networks used in our framework with those of the alternative methods under study, in order to ensure a fair evaluation and comparison. This strategy has allowed us to perform direct performance comparisons under similar conditions. Regarding the utilization of Stable Diffusion, reviewer WWsb makes a valid point, and according to our ablation studies (see Table 5) we observe that utilizing more powerful modules can be expected to improve the overall performance. The main reason for not using Stable Diffusion is that, given the network size constraints explained above, we experimentally found that the selected network architecture provided the best results.
Fusion Module. We have experimented with several basic fusion architectures, but we did not conduct an exhaustive search for the best-performing architecture. Our proposed fusion module serves as a proof of concept for the validity of our overall framework and the performance improvements that it can achieve. A more in-depth investigation of appropriate fusion architectures is a very interesting future research direction.
Versatility of proposed framework
Our strategy cannot work out of the box for every inverse problem, but requires tuning a relatively small 0.7M-parameter fusion network for each particular problem. Nevertheless, we would like to emphasize that the main focus of our work is blind image restoration problems. To the best of our knowledge, all diffusion-based approaches proposed in the literature to deal with such tasks require the training of far larger conditional backbone networks (~10-100M params). This turns out to be significantly more challenging both in terms of necessary training data and computational resources. To showcase this, we provide an indicative example below. If we were to adopt the existing diffusion-based SISR baselines and train them for a completely different restoration problem, following the original authors' training strategies, the computational and data requirements would be significantly higher than those of our method.
| Method | Params required | Data required |
|---|---|---|
| Ours | 1x | 1x |
| SRDiff | 34x | 4x |
| LDM | 240x | 1000x |
| InDI | 89x | 1x |
| IDM | 167x | 1x |
Based on these data, we can safely state that our strategy provides a reasonable trade-off between the required training complexity and the competitive performance of our method on a variety of blind inverse problems.
Screenshots of raster figures of our framework
Below, we have attached a PDF with raster versions of both figures so that their visual quality can be compared.
Dear Reviewers,
The authors have carefully prepared their rebuttal, trying to address the concerns you have raised. Please check the rebuttals and join the discussion about the paper.
Regards,
Dear Reviewers,
We would like to kindly remind you that the end of the discussion period is approaching soon. We would be grateful if you could acknowledge whether our rebuttal has adequately addressed your comments or if you need any further clarifications or explanations. If you have any additional concerns regarding our work, please feel free to raise them so that we have the opportunity to respond in a timely manner.
Sincerely,
The Authors
The authors propose a modular diffusion model that can be used for image restoration. The main focus is on making the use of a diffusion model feasible by reducing the computational cost. To this end, a modular structure is used.
The reviewers' initial main concerns were the experimental validation, ablation studies, and design choices. During the rebuttal, the authors clearly explained their choices and, through small additional experiments, were able to convince all the reviewers. The final ratings were 3BA and 1WA. While the novelty is not extremely high, I think this would be a valuable contribution.