PaperHub
Rating: 5.3/10 · Poster · 4 reviewers (min 4, max 6, std 0.8)
Scores: 6, 4, 5, 6
Confidence: 4.0
Correctness: 2.5 · Contribution: 2.8 · Presentation: 2.5
NeurIPS 2024

Masked Pre-training Enables Universal Zero-shot Denoiser

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

An efficient yet novel approach for zero-shot denoising

Abstract

In this work, we observe that a model trained on vast general images via a masking strategy is naturally embedded with their distribution knowledge, and thus spontaneously attains the underlying potential for strong image denoising. Based on this observation, we propose a novel zero-shot denoising paradigm, i.e., Masked Pre-train then Iterative fill (MPI). MPI first trains a model via masking and then employs the pre-trained weights for high-quality zero-shot image denoising on a single noisy image. Concretely, MPI comprises two key procedures: 1) *Masked pre-training* involves training a model to reconstruct massive natural images under random masking to obtain generalizable representations, gathering the potential for valid zero-shot denoising on images with varying noise degradation and even of distinct image types. 2) *Iterative filling* exploits the pre-trained knowledge for effective zero-shot denoising. It iteratively optimizes the image by leveraging the pre-trained weights, focusing on alternate reconstruction of different image parts, and gradually assembles a fully denoised image within a limited number of iterations. Comprehensive experiments across various noisy scenarios underscore the notable advances of MPI over previous approaches, with a marked reduction in inference time.
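For intuition, here is a minimal PyTorch sketch of the two stages as described in the abstract. All names, the L1 loss, the masking ratio, and the EMA schedule are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def random_pixel_mask(shape, p, device):
    # 1 = kept pixel, 0 = masked pixel; p is the masking ratio.
    return (torch.rand(shape, device=device) > p).float()

def pretrain_step(model, clean_batch, opt, p=0.3):
    # Stage 1: learn to reconstruct randomly masked natural images.
    mask = random_pixel_mask(clean_batch.shape, p, clean_batch.device)
    loss = F.l1_loss(model(clean_batch * mask), clean_batch)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def iterative_fill(model, noisy, steps=1000, p=0.3, beta=0.99, lr=1e-4):
    # Stage 2: zero-shot denoising of a single noisy image. Each step
    # reconstructs a different random subset of pixels; only masked-pixel
    # predictions feed the exponential-moving-average estimate.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ema = noisy.clone()
    for _ in range(steps):
        mask = random_pixel_mask(noisy.shape, p, noisy.device)
        pred = model(noisy * mask)
        loss = F.l1_loss(pred * (1 - mask), noisy * (1 - mask))
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            upd = beta * ema + (1 - beta) * pred.detach()
            ema = torch.where(mask.bool(), ema, upd)  # update masked pixels only
    return ema
```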
Keywords
image restoration, image denoising, self-supervised learning

Reviews and Discussion

Review (Rating: 6)

The paper proposes a novel zero-shot image denoising method named Masked Pre-train then Iterative fill (MPI). This method leverages a pre-trained model on vast natural images using a masking strategy to learn generalized image distributions, enabling effective denoising without prior knowledge of the specific noise type.

Strengths

  • Novel and sound idea, with clear benefits.
  • Good generalization performance for unseen noise types.
  • Thorough analysis of their method and results.

Weaknesses

  • Regarding real-noise removal, it is well known that the spatial correlation of real noise causes pixel-wise masking-based methods like N2V or N2S to fail at denoising, making them inappropriate baselines. Several self-supervised denoising methods have been designed to remove real-world noise (e.g., AP-BSN [1], MM-BSN [2]). While these are not zero-shot methods, zero-shot modifications of these blind-spot networks would serve as better comparison methods, especially since the authors have already modified N2N and N2S into zero-shot versions.

  • The understanding and discussion of the masking ratio can be improved. Perhaps employing a random masking ratio within a certain range (e.g., $p \in [0.3, 0.6]$) would be more effective and could yield an optimal point applicable to both synthetic and real noise situations. And why does the masking ratio differ for different noise types? What is the main factor in choosing the best masking ratio?

Also, if this masking ratio is a crucial hyperparameter that changes depending on the image and noise types, it could weaken the paper's key contribution regarding its generalization ability for any noise type.

References

[1] Lee, Wooseok, Sanghyun Son, and Kyoung Mu Lee. "AP-BSN: Self-supervised denoising for real-world images via asymmetric PD and blind-spot network." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[2] Zhang, Dan, et al. "MM-BSN: Self-supervised image denoising for real-world with multi-mask based on blind-spot network." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Questions

Please check weaknesses.

  1. What is the main reason that a higher masking ratio $p$ is required for real-world noise?

Minor points and typos

  • I cannot see any difference between the images in Figure 2, even for the noisy image. The same goes for Figure 6; it would be better to include enlarged views.
  • Missing space in line 89: i.e.Masked -> i.e. Masked
  • Typo in line 248? "forward pass 2.3"?
  • Maybe a mistake? In the caption of Table 4, what does "defaults in gray" mean?

Limitations

The authors adequately addressed the limitations.

Author Response

**Q1: More comparison with zero-shot modifications of blind-spot methods**

**A1:** Thank you for your thorough consideration. We have adapted AP-BSN, MM-BSN, and PUCA (as shown in the overall rebuttal to all reviewers), and the results are listed in *Table 1* of the provided PDF. Some implementation details are given below:

We followed the original settings of these methods. In each iteration, we cropped 8 same-size patches from the noisy image to form a training batch. Every 10 iterations, we performed inference on the full image to obtain a denoised image. These denoised images were then combined using the same ensemble strategy as our method to ensure fairness.
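A schematic of this adaptation protocol in PyTorch; the `self_supervised_loss` interface, optimizer, learning rate, and patch size are assumptions for illustration, not the actual rebuttal code:

```python
import random
import torch

def zero_shot_adapt(bsn, noisy, iters=1000, n_patches=8, patch=128, beta=0.99):
    # bsn: a blind-spot network (AP-BSN / MM-BSN / PUCA style) exposing its own
    # self-supervised loss; `noisy` is a single (1, C, H, W) noisy image.
    opt = torch.optim.Adam(bsn.parameters(), lr=1e-4)
    _, _, H, W = noisy.shape
    ema = None
    for it in range(1, iters + 1):
        # Batch of 8 random same-size crops from the single noisy image.
        crops = []
        for _ in range(n_patches):
            y = random.randint(0, H - patch)
            x = random.randint(0, W - patch)
            crops.append(noisy[:, :, y:y + patch, x:x + patch])
        loss = bsn.self_supervised_loss(torch.cat(crops))  # assumed interface
        opt.zero_grad(); loss.backward(); opt.step()
        if it % 10 == 0:  # full-image inference every 10 iterations
            with torch.no_grad():
                out = bsn(noisy)
            ema = out if ema is None else beta * ema + (1 - beta) * out
    return ema
```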

These methods can effectively denoise spatially correlated noisy images. However, using only one noisy image for training can lead to over-fitting and produce artifacts, as shown in *Fig. 5* of the PDF.

**Q2: More discussion about the masking ratio**

**A2:** Thank you for your suggestion; it is very insightful. We are still exploring this issue. However, we are concerned that using a large masking ratio during inference might degrade synthetic-noise removal performance. Here is a summary of the impact of different masking ratios $p$ on denoising:

Small $p$: risk of learning noise patterns from real noisy images.

Large $p$: loss of detail in the denoised image.

The difference in masking ratios for various noise types is mainly because real noise has strong spatial correlation, necessitating a larger masking ratio to avoid learning the noise distribution (see more analysis in **A3**).

Choosing the masking ratio is crucial. When noise is spatially uncorrelated (e.g., Gaussian, Poisson, S&P noise), a consistent $p = 0.3$ works well across all cases. However, the spatial correlation of real noise complicates this issue. Many blind-spot networks (e.g., AP-BSN) designed for self-supervised denoising also aim to address this problem.
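As a concrete illustration of the two masking-ratio policies under discussion (a fixed $p$ vs. the reviewer's suggested randomized $p$), a minimal sketch; the image and ratios are placeholders:

```python
import random
import torch

def make_mask(img, p):
    # Pixel-wise Bernoulli mask: each pixel is dropped independently with prob. p.
    return (torch.rand_like(img) > p).float()

img = torch.rand(1, 3, 64, 64)  # placeholder noisy image

# Fixed ratio: works well when the noise is spatially uncorrelated.
mask_fixed = make_mask(img, p=0.3)

# Randomized ratio per iteration, as the reviewer suggests, seeking a single
# policy that covers both synthetic and real noise.
mask_random = make_mask(img, p=random.uniform(0.3, 0.6))
```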

**Q3: Main reason for different $p$**

**A3:** This primarily depends on the spatial correlation of the noise in the image. Synthetic noise is spatially uncorrelated: noise values at neighboring positions are independent. In contrast, real noise, after passing through a series of ISP processes, has a much more complex distribution, resembling blurred spots rather than independent points (refer to *Fig. 2* in the PDF). For synthetic noise, choosing a small masking ratio helps quickly recover more details in the image. However, for real noise with a small masking ratio, the model can fit the noise distribution from neighboring pixel values. In this case, a larger masking ratio can mitigate the impact of noise. One paper [Ref1] also discusses using different dropout ratios for different types of images.
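This distinction can be checked numerically by correlating each noise-residual pixel with its horizontal neighbor. A short NumPy sketch (synthetic example only; real-noise residuals would come from a dataset such as SIDD, and the smoothing here merely mimics ISP-induced correlation):

```python
import numpy as np

def neighbor_corr(noise):
    # Pearson correlation between each pixel and its right-hand neighbor.
    a = noise[:, :-1].ravel()
    b = noise[:, 1:].ravel()
    return np.corrcoef(a, b)[0, 1]

iid = np.random.randn(256, 256)          # i.i.d. synthetic noise: corr ~ 0
print(neighbor_corr(iid))

# Local smoothing imitates the "blurred spots" of real noise: corr >> 0.
blurred = (iid + np.roll(iid, 1, axis=1) + np.roll(iid, 1, axis=0)) / 3
print(neighbor_corr(blurred))
```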

**Q4: Minor points and typos**

**A4:** Thanks very much for your corrections.

  1. Regarding Fig. 2 and Fig. 6 in the main text, we have provided enlarged views, as seen in *Fig. 1 and Fig. 4* in the PDF.

  2. Yes, we have addressed the spacing issue on line 89.

  3. "2.3" on line 248 refers to Section 2.3, which we have clarified in the main text.

  4. Due to formatting constraints, we shortened the caption. "Defaults in gray" means that the gray background in the table indicates the default settings used in our work for comparison with other methods.

**Ref:**

Ref1: Self2Self+: Single-Image Denoising with Self-Supervised Learning and Image Quality Assessment Loss, arXiv'23

Comment

Thank you for the detailed response and the additional comparisons. I hope the points raised during this review are well reflected in the final version.

Comment

Thank you very much for replying so quickly. We will carefully revise the article, and all modifications made in the rebuttal will be reflected in the next version of our paper. If you have any further questions, please let us know.

Review (Rating: 4)

This paper proposes a method that can handle image denoising regardless of noise type and intensity. To achieve this goal, the proposed method includes two crucial steps: first, the model is pre-trained on a large number of images (with masking); second, the pre-trained model is fine-tuned on the given noisy image to perform denoising. The authors validate the effectiveness of their method on images corrupted by different noises.

Strengths

  1. The paper is well written and easy to follow.
  2. The authors have conducted experiments on different types of images, including natural images and medical images.
  3. In addition to synthetic noise, the authors also conducted experiments on real-world noise.

Weaknesses

  1. The idea sounds very trivial and the novelty is limited. It is similar to MetaDIP [Ref1] and DGP [Ref2].

  2. The idea of "ensemble for total T steps" was also proposed by DIP. DIP has shown that averaging the outputs can improve performance. So the question here is: when the authors compared performance with DIP, did they apply similar "average smoothing" to DIP's outputs? If not, the comparisons may not be fair.

  3. For comparisons, the authors may consider other DIP models such as Ref3.

  4. How does the proposed method compare with other SOTA models, such as diffusion models (Ref4)?

Ref1: Zhang, Kevin, et al. "MetaDIP: Accelerating deep image prior with meta learning." arXiv preprint arXiv:2209.08452 (2022).

Ref2: Pan, Xingang, et al. "Exploiting deep generative prior for versatile image restoration and manipulation." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.11 (2021): 7474-7489.

Ref3: Jo, Yeonsik, Se Young Chun, and Jonghyun Choi. "Rethinking deep image prior for denoising." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Ref4: Wang, Yinhuai, Jiwen Yu, and Jian Zhang. "Zero-shot image restoration using denoising diffusion null-space model." arXiv preprint arXiv:2212.00490 (2022).

Questions

I have listed my questions in the [Weaknesses] section.

Limitations

I have listed the limitations in the [Weaknesses] section.

Author Response

**Q1: Similarities and differences compared to MetaDIP and DGP**

**A1:** Thank you for pointing this out. Our method shares some similarities with MetaDIP and DGP. MetaDIP learns denoising by obtaining initial weights beneficial for downstream tasks, while our method uses masked training on natural images to enhance downstream denoising. However, MetaDIP and DGP seem to require known degradation models, which may not be applicable to some noise types. In contrast, our method learns to recover denoised images from masked noisy ones, without relying on models tailored to specific degradation types. This eliminates the need for degradation modeling, making it adaptable to unknown noise types, real noise, and various image types.

**Q2: Comparison fairness issues caused by the ensemble**

**A2:** Thanks for your attention. We acknowledge that the ensemble is an existing technique. Our focus is on demonstrating the advantages of masked pre-training priors.

To ensure a fair comparison, we explained this in the experimental section of the main text. We used the EMA ensemble for DIP and FasterDIP, as detailed in *lines 170-173* of the main text. Results for other comparison methods (N2V and N2S) with the EMA ensemble are shown in *Supplementary Materials G.2 (lines 600-608, Tables 12 and 13)*. Our "faster" version is not the best in some settings, but the 1000-step version consistently leads.

We chose the EMA version for other methods because our experiments showed that EMA yields the best results compared to other ensemble methods (like averaging), as detailed in *Table 5* of the main text.

**Q3: Comparison with other DIP methods**

**A3:** We considered the work you mentioned (DIP-SURE) when selecting comparison methods. They designed denoising solutions for Gaussian and Poisson noise to achieve high-quality denoising. However, their method requires the noise variance as additional input, which could lead to unfair comparisons. Additionally, their method seems to be specific to certain types of noise, and their official code only supports Gaussian and Poisson denoising. This is why we did not compare with their method initially.

We included results for DIP-SURE using the EMA ensemble, reporting both peak and final performance. For a fair comparison, our method should be compared against the final performance, as the peak PSNR is not known without ground truth. Please refer to *Table 1* in the PDF for a comparison between our method and DIP-SURE.

For real datasets (SIDD, PolyU, and FMD), we computed the variance from the difference between the noisy and clean images to obtain denoised results. Their approach cannot remove real-world noise well, and some artifacts exist (refer to *Fig. 5* in the provided PDF).

**Q4: Comparison with other diffusion methods**

**A4:** Existing SOTA diffusion models, like DDNM, recover degraded images by decomposing into the null-space and adjusting the noise scheduler for denoising. Although diffusion models are Gaussian denoisers and can handle Gaussian noise well with proper adjustments (see *Table 1* in the PDF), they also require the noise variance to be known in advance. Additionally, when set up for denoising, the diffusion model relies solely on the noise scheduler, effectively becoming a Gaussian denoiser. This limitation may prevent the model from fully removing more complex real-world noise (see *Fig. 5* in the PDF).

Comment

Thanks to the authors for the detailed rebuttal. My previous concerns have been partially addressed. I think MetaDIP and DGP may generalize to other noise types, which should at least be explored/experimented/compared. Also, I think there are other diffusion models that can handle noises other than Gaussian noise, which again should be compared in the experiments.

Comment

Since there are fewer than 12 hours before the discussion period ends, the time may not be enough.

**DGP:**

We could not find the code of MetaDIP during the rebuttal period, and we believe DGP and other diffusion methods share the same issues in denoising, so we did not compare with MetaDIP. Here we provide DGP results in the table below:

| Method | DGP |
| --- | --- |
| CSet + Gauss $\sigma$=25 | 28.72/0.746 |
| SIDD val | 23.21/0.452 |
| PolyU | 32.18/0.890 |
| FMD | 22.19/0.308 |

(Due to the time limitation, we chose 100 random patches from SIDD val for comparison.)

DGP, like the DDNM and DDPG models we compared, relies on a generative prior. Although DGP uses a GAN loss to make the model robust to degradation, its loss includes degrading the generated images with known degradation operators or attackers, during which unknown noise degradation can also interfere. This results in a gap between synthetic and real degradation, and it seems to perform poorly on images that differ significantly from natural images (such as FMD).

Using attackers to add perturbations can theoretically remove various types of degradation, but it may lead to longer inference time and increased training difficulty. However, DGP's exploration of adversarial defense seems limited to the jigsaw-puzzle task, and the code they provide does not include adversarial-defense examples. The remaining time is not enough for us to explore using adversarial defense for real-world denoising. Nevertheless, we believe the denoising results under synthetic noise (Gaussian $\sigma$=25) are somewhat indicative, since the degradation perturbation in this case is consistent with that in the real image (it may lead to overly smooth denoising results or GAN-induced artifacts).

In addition, the generative ability of GANs is slightly weaker than that of diffusion models, especially for unconditional image restoration.

**Diffusion models:**

Existing diffusion work mostly focuses on other image restoration tasks (such as super-resolution and deblurring) and may suffer from long inference times. We have not yet found a training-free diffusion-based method that adapts well to unknown real-world noisy images.

However, we think it is possible to use diffusion for training-free real-world denoising; it may require additional design, and it could be a good research direction.

Comment

Dear reviewer,

thanks for your review. Please look at the rebuttal and give the authors some feedback on whether they could address your concerns.

Best regards, Your AC.

Review (Rating: 5)

This paper proposes a zero-shot image denoising method called Masked Pre-train then Iterative fill (MPI). The key idea is to pre-train a model on natural images using masked image modeling, then apply this pre-trained model to denoise new images in a zero-shot manner through an iterative optimization process. The authors demonstrate that their approach outperforms existing zero-shot denoising methods across a variety of noise types and datasets, while also being more computationally efficient.

Strengths

  1. Masked modelling and iterative processing in image denoising are interesting.
  2. The fact that the above idea works makes it even more interesting.
  3. Somewhat good results.

Weaknesses

  1. I have to say that the writing of this paper is problematic. The method is not complicated, and is even a little simple. However, the presentation does not bring out its core method. Some descriptions are irrelevant, while important analysis, such as the core of the method, is missing.
  2. This paper uses the idea of pre-training but follows a zero-shot setting. A natural question is: what does the model learn from pre-training, and how is it applied to zero-shot denoising? I understand that a theoretical answer is difficult, but I think at least an abstract explanation is needed to verify and demonstrate the essence of the method.
  3. Assume the model learns the distribution of images from pre-training. But according to the experiments, the trained model can be used for unnatural images. If the model learns denoising from pre-training, it can generalize to other noises to a certain extent, unless it learns a common property of some noise. The key question is: what is this property?
  4. The iterative approach is interesting, and I think the authors may have been largely inspired by diffusion. But a simple assumption is not enough without clearer theoretical motivation to support this approach.
  5. The choice of denoising is also questionable. In fact, I think the method in this paper may be helpful for any image restoration task that meets the conditions. Why discuss only denoising?

Questions

Is it possible to combine mask prediction and iterative filling and actually use the statistical properties of noise to denoise? That is, make predictions under different masks each time, integrate the multiple predictions through iterative filling, and use something like the mean across multiple predictions to achieve denoising? This concerns the working nature of this method. I think it is very necessary to show experimentally what the model has learned and why it denoises.

Limitations

yes

Author Response

**Q1: Lack of the essence of our method**

**A1:** Thank you for pointing this out. The core of our method lies in the inherent denoising capability of a model pre-trained with masked natural images. This motivation is demonstrated in the main text. The pixel-level random masking we employ can be viewed as a form of noise, disrupting the image structure, and the model is trained to restore these structures. Because it is trained on a large set of natural images, the limited model parameters are unable to precisely fit all image distributions; instead, the model tends to prioritize learning the main features of the images while discarding noise that varies across images, thus acquiring a certain denoising capability. Similar masked training strategies are used in many self-supervised denoising tasks [Ref4, Ref5, Ref6 in the author rebuttal], where models are trained on a large set of noisy images to learn the corresponding clean images. We further extend this to a zero-shot version, combining it with a large amount of easily accessible natural images to enhance the performance of zero-shot methods.

**Q2: What the model learns from pre-training and how it is applied to zero-shot denoising**

**A2:** We address your concerns from three aspects:

*What does the model learn from pre-training?*

The model learns to restore randomly masked image content from a large set of natural images. This restoration process is somewhat noise-resistant. In short, a model pre-trained on a large dataset tends to recover denoised image content, functioning as a kind of denoising autoencoder (see *Fig. 1* in the PDF for motivation).

*How is it applied to zero-shot denoising?*

The pre-trained weights are iteratively optimized on a noisy image through random masking, and the predicted pixels are integrated using an exponential moving average, resulting in a denoised image. The learned denoising representation provides better initial weights and helps avoid over-fitting to the noisy image.
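In equation form (our notation, not the paper's): with $y$ the noisy image, $m_t$ the random binary mask at step $t$ ($m_t[i]=1$ for kept pixels), $f_{\theta_t}$ the network at step $t$, and $\beta$ the EMA decay, the running estimate $\hat{x}_t$ is updated only at masked pixels:

$$
\hat{x}_t[i] =
\begin{cases}
\beta\,\hat{x}_{t-1}[i] + (1-\beta)\,f_{\theta_t}(m_t \odot y)[i], & m_t[i] = 0,\\
\hat{x}_{t-1}[i], & m_t[i] = 1.
\end{cases}
$$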

*Analysis of why pre-trained features help zero-shot denoising.*

We analyzed the impact of pre-trained features on zero-shot denoising at the hidden-layer level (see *Fig. 3* in the PDF). Features extracted with pre-trained weights differ significantly from those extracted from scratch (Baseline, i.e., the usual zero-shot denoising approach). The pre-trained model restores the complete image, with more distinct features between layers, whereas the baseline model's features are less differentiated between layers and tend to restore only the masked parts, leading to local optima.

**Q3: Why a model that does not learn denoising can denoise and generalize to unnatural images**

**A3:** As per **A1**, we believe that the model pre-trained with masked images acquires a certain robustness to noise, enabling it to perform denoising. Regarding its applicability to non-natural images, the use of pixel-level random masking allows the model to restore masked content based on the unmasked areas. Since the unmasked pixels are evenly distributed spatially, the model tends to focus more on local, low-level features such as texture and color rather than high-level semantic features. These low-level features share commonalities across all types of images, allowing the model to be applied to significantly different image types, such as medical and natural images.

**Q4: Motivation of our approach**

**A4:** Thank you for pointing this out. Our motivation comes from blind-spot networks in self-supervised denoising, which learn to reconstruct noisy images corrupted by blind spots (closely resembling pixel-wise random masking) in order to denoise on noisy datasets without ground truth. This is a widely explored field, and we further extend it by using natural images to obtain a better zero-shot denoising algorithm.

**Q5: Why only discuss denoising**

**A5:** Thank you for your suggestion. Our motivation stems from findings in the self-supervised denoising field, so this paper focuses solely on denoising. As per **A1**, the trained model tends to acquire smooth representations beneficial for denoising. Since our current model uses a small number of parameters for efficient denoising, it might struggle to generate sufficient detail in other tasks such as deblurring or super-resolution. However, we believe that with further improvements, this method has great potential for other types of degradation tasks.

**Q6: Answer to Questions**

**A6:**

  1. During the inference phase, we generate predictions using random masks and iterative filling, with the final denoised image coming from the exponential moving average of multiple predictions for each pixel. The only difference is that we use weights pre-trained on natural images with random masking as the initial weights, which speeds up inference and reduces the risk of over-fitting.

  2. Regarding what the model has learned and why it denoises, we believe that masked pre-training on large datasets causes the model to focus on the main features of images (relatively lower-frequency, easier-to-recover features), which gives the model robustness to noise and enables it to perform denoising.

  3. We have analyzed the differences in features between pre-trained and from-scratch models during inference to demonstrate the advantages of pre-training over starting from scratch.

Comment

Dear reviewer,

thanks for your review. Please look at the rebuttal and give the authors some feedback on whether they could address your concerns.

Best regards, Your AC.

Comment

I have read the authors' response, as well as the comments and discussions with other reviewers. The authors have partially addressed my concerns. However, I still think the presentation of the paper is lacking at this stage. I will improve my score. However, since I cannot see the revised paper, I cannot judge whether the final presentation meets the requirements of NeurIPS.

Comment

Thank you for your valuable suggestions. They have indeed helped improve the quality of the paper. We will carefully revise the manuscript and incorporate the theoretical analysis from the response into the main text to make the paper clearer and more insightful.

This article explores the practical role of masked pre-training in denoising. Perhaps masking-based generative pre-training can go further by providing clearer theoretical grounding and demonstrating its effectiveness in more low-level downstream tasks.

Review (Rating: 6)

The paper introduces a novel zero-shot image denoising paradigm called Masked Pre-train then Iterative fill (MPI). The key contributions are:

MPI first pre-trains a model on a large dataset of natural images using a pixel-wise masking strategy. This allows the model to learn the underlying distribution and representations of natural images. During the zero-shot inference stage on a single noisy image, MPI optimizes the pre-trained model to predict the masked regions, and only uses the predictions of the masked regions to assemble the final denoised output. This approach leverages the generic knowledge from pre-training, preventing overfitting during the inference stage and enabling high-quality denoising with a marked reduction in inference time compared to prior zero-shot methods. The paper demonstrates MPI's superior performance across various noise types and its ability to generalize to medical images, which are distinct from the natural images used in pre-training.

Strengths

Originality: The paper presents a novel zero-shot denoising paradigm that is significantly different from prior approaches. The key innovation is the use of masked pre-training on natural images to build a generic and robust model for zero-shot denoising. This is an original idea that departs from previous zero-shot methods that relied on carefully designed priors or network architectures. Leveraging masked pre-training for this task is a novel and creative approach.

Quality: The technical details and experimental evaluations in the paper are of high quality. The authors provide a clear and thorough explanation of the MPI framework, including the masked pre-training and iterative optimization steps. The experimental setup is comprehensive, analyzing performance on various noise types, real-world datasets, and even medical images. The results demonstrate significant improvements over prior zero-shot methods, validating the effectiveness of the proposed approach.

Clarity: The paper is well-written and easy to follow. The introduction provides a clear motivation and background for the problem. The method section explains the MPI framework in a structured and logical manner. The experimental section is organized in a way that allows the reader to easily understand the different evaluations and findings. Overall, the clarity of exposition is a strength of this paper.

Significance: The problem of zero-shot image denoising is an important and challenging task in computer vision. Prior methods have limitations in terms of generalization and computational efficiency. The MPI approach presented in this paper addresses these limitations in a novel way. If successful, this could lead to significant practical impact by enabling high-quality denoising with minimal user intervention or computational overhead. The ability to generalize to diverse noise types and even medical images further enhances the significance and potential impact of this work.

Weaknesses

Comparison to other zero-shot methods: The paper primarily compares MPI to a few selected zero-shot denoising methods. However, it would strengthen the work to include a more comprehensive comparison to a wider range of zero-shot techniques, including recent advances. This would help contextualize the performance gains of MPI and highlight its specific advantages over the state-of-the-art.

[1] Xie, Y., Yuan, M., Dong, B., et al. "Unsupervised image denoising with score function." Advances in Neural Information Processing Systems 36 (2024).

[2] Jang, H., Park, J., Jung, D., et al. "PUCA: Patch-unshuffle and channel attention for enhanced self-supervised image denoising." Advances in Neural Information Processing Systems 36 (2024).

[3] Garber, T., and Tirer, T. "Image restoration by denoising diffusion models with iteratively preconditioned guidance." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 25245-25254.

Questions

no

Limitations

See weaknesses.

Author Response

Thank you for your suggestions. We studied the three works you provided carefully. The first is a score-based denoising algorithm, the second is an improvement on blind-spot networks (PUCA), and the third is a diffusion-based image restoration method (DDPG). The first two appear to be dataset-based unsupervised denoising approaches, which learn to recover denoised images from a dataset containing only noisy images.

For the first paper, we could not find their code on GitHub. Reproducing this paper within a week is challenging, so we did not compare our method with theirs.

For the second paper, we provide results for its modified zero-shot version PUCA* and the original PUCA in the table below. The zero-shot method uses an ensemble approach to ensure a fair comparison.

For the third paper, we provide results (DDPG) in the table below. This diffusion method requires the noise variance as known information. We use the variance calculated from the difference between the noisy and clean images to ensure the best performance for this method (though this might risk ground truth leakage).

We have also compared several other self-supervised and diffusion methods. If interested, please see *Table 1* in the PDF. For the visual results on SIDD, see *Fig. 5* in the PDF.

| Method | PUCA* | PUCA | DDPG |
| --- | --- | --- | --- |
| CSet + Gauss $\sigma$=10 | - | - | 32.43/0.826 |
| CSet + Gauss $\sigma$=25 | 24.74/0.640 | - | 27.07/0.606 |
| CSet + Gauss $\sigma$=50 | - | - | 15.95/0.183 |
| SIDD validation | 33.52/0.816 | 37.49/0.880 | 29.84/0.612 |
| PolyU | 33.31/0.927 | - | 35.79/0.887 |
| FMD | 30.22/0.808 | - | 30.41/0.735 |
| Avg. infer. time (s) | 450.0 | - | 24.3 |

**Analysis:** Blind-spot methods like PUCA handle strongly spatially correlated noise well in real-world scenarios. However, the zero-shot version may risk over-fitting due to a lack of sufficient training data.

Diffusion methods like DDPG can handle various types of degradation and are inherently robust to noise due to their training on Gaussian noise. However, they struggle to generalize to real-world noise scenarios.

Comment

I have read through the rebuttal, and the authors have addressed all of my questions.

Comment

Thank you very much for your prompt reply. If you have any further questions, please feel free to ask.

Author Response

We sincerely appreciate the time and efforts of all the reviewers, as well as their valuable suggestions provided during the review process. We are encouraged by the reviewers' recognition of our work and acknowledge that there are still many weaknesses in our current work. We carefully considered and responded to every question from the reviewer to the best of our ability.

Here we list some common weaknesses and corresponding brief responses, more details can be found in responses to each reviewer.

**1. More comparison methods**, including additional zero-shot methods, such as DIP-based (DIP-SURE [Ref1]) and diffusion-based methods (DDNM [Ref2], DDPG [Ref3]), and zero-shot modifications of unsupervised methods (AP-BSN [Ref4], MM-BSN [Ref5], PUCA [Ref6]); refer to *Table 1* in the provided PDF. The implementation details of each method can be found in the rebuttal to each reviewer.

**2. Analysis of the priors learned from pre-training** (refer to *Fig. 3* in the provided PDF). In the main text, we explained that masked pre-training not only provides a better starting point for the model, but also helps prevent over-fitting to some extent (see *Section 4.1*, Pre-trained weights, in the main text), offering a more stable zero-shot denoising process. Here we analyze the representations learned by the pre-trained model and try to explain why this helps avoid over-fitting. In Fig. 3, we used CKA analysis [Ref7] to show that the image features extracted by the pre-trained model differ significantly from those extracted by the model trained from scratch ("Baseline"). Due to insufficient data, the baseline model tends to acquire similar representations across layers and focuses more on recovering the masked parts of the image. This can lead to local optima and early over-fitting to noise. In contrast, the pre-trained model learns from a larger variety of images and provides a better starting point. It focuses more on recovering the entire image rather than just the masked parts, which is smoother and reduces the risk of over-fitting.
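For reference, linear CKA as defined in [Ref7] can be computed as below; this is the standard formulation, not necessarily the authors' exact analysis code:

```python
import numpy as np

def linear_cka(X, Y):
    # X, Y: (n_samples, n_features) activation matrices from two layers/models.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Linear CKA (Kornblith et al., 2019):
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den
```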

**3. Restatement of motivation and focus of this work.** This paper focuses on leveraging a large amount of easily accessible natural images to achieve better zero-shot denoising performance. To keep the method simple and efficient, and to highlight the importance of pre-training, we do not extensively explore more complex network mechanisms, since they are not the focus of our paper. By relying on minimal dependencies for denoising, our approach can be further improved when combined with existing techniques like random sub-sampling (see *Section 4.2* in the main text for discussion).

**Ref:**

Ref1: Rethinking deep image prior for denoising, ICCV’21

Ref2: Zero-shot image restoration using denoising diffusion null-space model, ICLR’23

Ref3: Image restoration by denoising diffusion models with iteratively preconditioned guidance, CVPR’24

Ref4: AP-BSN: Self-supervised denoising for real-world images via asymmetric PD and blind-spot network, CVPR'22

Ref5: MM-BSN: Self-supervised image denoising for real-world with multi-mask based on blind-spot network, CVPR'23

Ref6: PUCA: Patch-unshuffle and channel attention for enhanced self-supervised image denoising, NeurIPS'24

Ref7: Similarity of Neural Network Representations Revisited, arXiv'19

Final Decision

Scores: 6, 4, 5, 6

A nice paper; however, judging from the reviews it sits just at the border between reject and accept.

The reviewers raised concerns about the evaluations, in particular the need to compare against more recent methods. These concerns were addressed by the rebuttal, where more comparisons were added. One reviewer found the novelty limited due to similarities to MetaDIP and DGP. The rebuttal addressed this and explained the differences.

Overall, I am leaning toward accepting the paper, since the reviewers are mainly positive about it and it presents solid work. Thus: accept.