PaperHub

DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation

NeurIPS 2024 · Poster · 4 reviewers
Overall rating: 4.8/10 (min 4, max 5, std 0.4) · Individual ratings: 5, 5, 5, 4
Confidence: 4.5 · Correctness: 2.5 · Contribution: 2.8 · Presentation: 2.8
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

A high-capacity image restoration model trained on a large-scale, privacy-safe, high-quality image dataset

Abstract

Keywords
Image restoration, Dataset Curation, Diffusion transformer

Reviews and Discussion

Review (Rating: 5)

This paper introduces a dual strategy for real-world image restoration: GenIR and DreamClear. GenIR, a novel data curation pipeline, addresses dataset limitations by creating a large-scale dataset of one million high-quality 2K images. DreamClear, a Diffusion Transformer-based model, utilizes generative priors from text-to-image diffusion models and multi-modal large language models for photorealistic restorations. It also incorporates the Mixture of Adaptive Modulator (MoAM) to handle various real-world degradations. Extensive experiments confirm DreamClear's superior performance, demonstrating the effectiveness of this approach.

Strengths

1. The writing and structure of the paper are clear and well-organized, making it easy to follow the authors' arguments and methodologies.

2. The authors' introduction of a privacy-safe dataset curation pipeline is significant. This is especially crucial in the era of large models, where data privacy and security are paramount concerns.

3. The experimental results convincingly demonstrate the potential of the proposed method, highlighting its effectiveness and applicability in real-world image restoration scenarios.

Weaknesses

1. The quality assessment of the newly constructed dataset is lacking, both quantitatively and qualitatively. Given the goal of real-world image restoration, I am concerned about whether the current pipeline effectively advances this objective. The authors need to include a discussion on this matter.

2. In the ablation studies, the authors should provide a detailed discussion on the interaction between the dual branches. The motivation for this design choice requires further explanation.

3. Real-world degradations are complex and intertwined. The authors should compare their approach with SUPIR using more challenging examples to better demonstrate the robustness and applicability of their method.

Questions

See the above weaknesses part.

Limitations

The authors have discussed the relevant limitations.

Author Response

We sincerely appreciate the reviewer for the valuable and positive comments on our work. We address the questions and clarify the issues accordingly as described below.

Q1: Quality assessment of the dataset

[Reply] Thank you for this valuable advice. We calculate the FID value between different datasets and DIV2K. The FID results are as follows:

| Dataset | Flickr2K | DIV8K | LSDIR | OST | Ours (Generated) |
|---|---|---|---|---|---|
| FID vs. DIV2K | 69.83 | 72.28 | 60.22 | 110.29 | 72.88 |

The results indicate that our generated dataset achieves FID values comparable to those of commonly used image restoration datasets. This suggests that our dataset maintains a relatively high image quality. Additionally, we provide some example images from our dataset and LSDIR in the attached PDF file (Figure III).
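For reference, FID numbers like those above can be reproduced with an off-the-shelf tool. Below is a minimal sketch assuming the clean-fid package; the folder paths are placeholders for the image directories of each dataset.

```python
# Minimal FID computation sketch (assumes the clean-fid package: pip install clean-fid).
# Folder paths are placeholders; each folder contains the images of one dataset.
from cleanfid import fid

datasets = {
    "Flickr2K": "data/flickr2k",
    "DIV8K": "data/div8k",
    "LSDIR": "data/lsdir",
    "OST": "data/ost",
    "Ours (Generated)": "data/genir_generated",
}

for name, folder in datasets.items():
    # FID against DIV2K; lower means the image distribution is closer to DIV2K.
    score = fid.compute_fid(folder, "data/div2k")
    print(f"{name}: FID vs. DIV2K = {score:.2f}")
```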

We also conduct ablation studies to explore the influence of different training datasets. We retrain DreamClear using different datasets, with the results provided below (evaluated on DIV2K-Val):

| Trainset | LPIPS ↓ | DISTS ↓ | FID ↓ | MANIQA ↑ | MUSIQ ↑ | CLIPIQA ↑ |
|---|---|---|---|---|---|---|
| Existing IR + Generated | 0.3391 | 0.1791 | 23.24 | 0.4327 | 68.93 | 0.6978 |
| Generated | 0.3601 | 0.1847 | 27.84 | 0.4295 | 69.31 | 0.6823 |
| Existing IR | 0.3493 | 0.1873 | 24.10 | 0.4154 | 63.24 | 0.6435 |

These results show that the generated dataset achieves perceptual metrics (LPIPS, DISTS) similar to those obtained with existing IR datasets and a significant advantage in no-reference metrics (MANIQA, MUSIQ, CLIPIQA). We speculate that the gap in FID mainly stems from the matched distribution between the training and validation splits of DIV2K. The results indicate that our generated data can effectively improve the quality of restored images compared to existing IR datasets, and that combining the two achieves the best performance. We also provide real-world visual comparisons in the attached PDF file, which demonstrate the effectiveness of our generated data in enhancing visual quality. We'll add these ablations in the final version.

Q2: Discussion on the dual branch design

[Reply] Thanks for this insightful suggestion. Table 3 presents the quantitative results, while Figure 7 illustrates the qualitative outcomes of the ablation experiments. Some discussions are illustrated in Appendix A.3. Below, we provide a detailed discussion on the dual-branch design:

Reference Branch: The incorporation of the reference branch allows the model to focus less on degradation removal and more on enhancing image details through generative priors, ultimately producing more photorealistic images. As shown in Table 3, when the reference branch is removed, all metrics on the low-level and high-level benchmarks deteriorate significantly. Figure 7 also shows that the reference branch can significantly enhance the quality of the image. This indicates that the reference branch plays an important role in DreamClear for improving both image quality and semantic information.

Interaction Modules: The proposed Mixture-of-Adaptive-Modulator (MoAM) acts as an interaction module between the LQ branch and the reference branch, aiming to enhance the model's robustness to intricate real-world degradations by explicitly guiding it through the introduction of token-wise degradation priors. It obtains the degradation map through cross-attention between the LQ features and reference features, guiding the dynamic fusion of expert knowledge to modulate LQ features. To validate its effectiveness, we replace MoAM with different interaction modules:

  • AM: As shown in Table 3, when replacing MoAM with AM, all metrics undergo a substantial decrease. Similarly, Figure 7 demonstrates that this replacement leads to more blurred textures in both the bear and the snow. These findings underscore the importance of the degradation prior guidance in MoAM for steering restoration experts.
  • Cross-Attention / Zero-Linear: In addition to AM, we also tried other feature interaction modules, including cross-attention and zero-linear. Table 3 shows that their performance is inferior to AM across all metrics. Figure 7 shows that when using zero-linear, the bear's texture becomes blurred and there are semantic errors on its back. When using cross-attention, many artifacts appear, and there are also semantic errors on the bear's nose. Therefore, using AM as the interaction module is more suitable for the IR task to achieve better restoration results.

We'll add the detailed discussion in the final version.
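To make the interaction above concrete, here is a minimal PyTorch sketch of a MoAM-style block: cross-attention between LQ and reference tokens yields a token-wise degradation prior, a router converts it into per-token expert weights, and the experts adaptively modulate the LQ features. The module names, dimensions, and exact fusion rule are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of a Mixture-of-Adaptive-Modulators-style block (assumptions, not official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveModulator(nn.Module):
    """One restoration 'expert': predicts a scale/shift from the reference tokens."""
    def __init__(self, dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, lq_tokens, ref_tokens):
        scale, shift = self.to_scale_shift(ref_tokens).chunk(2, dim=-1)
        return lq_tokens * (1 + scale) + shift

class MoAMBlock(nn.Module):
    def __init__(self, dim, num_experts=4, num_heads=8):
        super().__init__()
        # Cross-attention between LQ and reference tokens yields a degradation prior.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.router = nn.Linear(dim, num_experts)  # token-wise expert weights
        self.experts = nn.ModuleList([AdaptiveModulator(dim) for _ in range(num_experts)])

    def forward(self, lq_tokens, ref_tokens):
        # Token-wise degradation map: how each LQ token deviates from its clean reference.
        attn_out, _ = self.cross_attn(lq_tokens, ref_tokens, ref_tokens)
        degradation_map = lq_tokens - attn_out
        # Router turns the degradation prior into per-token mixture weights over experts.
        weights = F.softmax(self.router(degradation_map), dim=-1)        # (B, N, E)
        expert_outs = torch.stack(
            [expert(lq_tokens, ref_tokens) for expert in self.experts], dim=-1
        )                                                                # (B, N, D, E)
        return torch.einsum("bnde,bne->bnd", expert_outs, weights)

# Toy usage: batch of 2, 64 tokens, 128-dim features.
block = MoAMBlock(dim=128)
out = block(torch.randn(2, 64, 128), torch.randn(2, 64, 128))  # modulated LQ tokens, (2, 64, 128)
```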

Q3: More real-world comparisons

[Reply] Thanks for this valuable suggestion. First, we reanalyze the data from the user study. Specifically, we focus on the 33 real-world samples (without GTs) among the original 100 images. The results are as follows:

| Method | SinSR | StableSR | DiffBIR | SeeSR | SUPIR | DreamClear |
|---|---|---|---|---|---|---|
| Vote Percentage | 8.1% | 5.5% | 7.6% | 11.2% | 23.1% | 44.4% |
| Top-1 Ratio | 3.0% | 0.0% | 0.0% | 0.0% | 15.2% | 81.8% |
| Top-2 Ratio | 9.1% | 3.0% | 9.1% | 9.1% | 69.7% | 100.0% |

It shows the superiority of DreamClear in restoring real-world images. Besides, we also provide more real-world comparisons with SUPIR in the attached PDF file (Figure II). It demonstrates that our method can achieve clearer details and fewer artifacts when dealing with real-world cases. We'll add more real-world comparisons in our final paper.

Comment

Thank you for the authors' response. I maintain my positive score.

Comment

Thank you for your acknowledgment of our work and responses. We appreciate your constructive feedback that has helped refine our research. Please feel free to reach out if you have further queries or need additional clarification on our work.

Review (Rating: 5)

The paper introduces a dual strategy to tackle the challenges of image restoration (IR) datasets and the development of high-capacity models for image restoration. This strategy comprises: GenIR: An innovative data curation pipeline designed to bypass the laborious data crawling process, providing a privacy-safe solution for IR dataset construction. The authors generated a large-scale dataset of one million high-quality images. DreamClear: A Diffusion Transformer (DiT)-based image restoration model utilizing generative priors from text-to-image (T2I) diffusion models and the perceptual capabilities of multi-modal large language models (MLLMs) to achieve photorealistic restorations.

Strengths

The authors made a significant contribution to creating a large and robust image restoration dataset while providing a privacy-safe solution.

The authors demonstrated detailed evaluation with recent SOTAs.

The restoration model shows good performance across different evaluation metrics.

The model is robust enough to handle various degradations such as deblurring and denoising.

Weaknesses

For the created data, the paper claims that privacy standards were maintained to ensure no personal information was embedded in the generated images. However, the authors fail to provide clear details on how this was achieved. Hence, it is difficult to establish the robustness and effectiveness of the privacy-preserving measures.

Given that the SR model, DreamClear, is an integration of various restoration models and is also built on PixArt, a detailed comparative analysis of the computational complexity between the proposed framework and other existing methods would be beneficial.

While DreamClear exhibits good performance in perceptual quality, its performance on traditional metrics like PSNR and SSIM is not as strong.

Some grammatical errors are observed on line 96, and incomplete statements on lines 109 and 110 should be corrected.

Questions

Given that a million images were generated, how did the authors verify that none of the images had personal information?

Limitations

Yes

Author Response

We sincerely appreciate the reviewer for the valuable and positive comments on our work. We address the questions and clarify the issues accordingly as described below.

Q1: About privacy preservation

[Reply] Thanks for this valuable advice. To minimize the risk of generating images that contain private information, we have implemented constraints from two aspects:

  • Text Prompt Filtering: As illustrated in the paper (Line 160-162), we leverage Gemini to generate millions of text prompts for the T2I model. Instead of directly using these text prompts, we utilize Gemini to review them and filter out any prompts that contain private information. We set the prompt for Gemini as "You are an AI language assistant, and you are analyzing a series of text prompts. Your task is to identify whether these text prompts contain any inappropriate content such as personal privacy violations or NSFW material. Delete any inappropriate text prompts and return the remaining ones in their original format." All text prompts are filtered through Gemini, and ultimately we retain one million text prompts.
  • Generated Image Filtering: As shown in Figure 2, we utilize Gemini as a powerful MLLM to ascertain whether the generated images exhibit blatant semantic errors or personal private content. We set the prompt for Gemini as "You are an AI visual assistant, and you are analyzing a single image. Your task is to check the image for any anomalies, irregularities, or content that does not align with common sense or normal expectations. Additionally, identify any inappropriate content such as personal privacy violations or NSFW material. If the image does not contain any of the aforementioned issues, it has passed the inspection. Please determine whether this image has passed the inspection (answer yes/no) and provide your reasoning." Samples that do not pass the inspection will be eliminated.
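To illustrate how such MLLM-based image filtering can be scripted, below is a minimal sketch assuming the google-generativeai Python SDK; the model name, file paths, and the simple yes/no parsing are our assumptions, not the exact pipeline used in the paper.

```python
# Illustrative MLLM-based image filtering sketch (assumes the google-generativeai SDK;
# model name, paths, and the yes/no parsing are placeholders, not the paper's exact setup).
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

INSPECTION_PROMPT = (
    "You are an AI visual assistant, and you are analyzing a single image. "
    "Check the image for anomalies, irregularities, personal privacy violations, or NSFW "
    "material. Answer yes if the image passes the inspection, otherwise no, then give your reasoning."
)

def passes_inspection(image_path: str) -> bool:
    response = model.generate_content([INSPECTION_PROMPT, Image.open(image_path)])
    # Keep only images whose verdict starts with "yes".
    return response.text.strip().lower().startswith("yes")

kept = [p for p in ["img_000001.png", "img_000002.png"] if passes_inspection(p)]
```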

Compared to directly crawling data from the web, the proposed GenIR pipeline significantly lowers the risk of infringing on individuals' privacy. We believe that the GenIR pipeline presents a promising solution for addressing security and ethical concerns in contemporary research involving large models. We'll add the privacy preservation details in our final paper.

We acknowledge that ensuring robust privacy preservation poses significant challenges. In this paper, we conduct an exploration of privacy preservation methods with the assistance of advanced MLLMs. Given that privacy preservation is not the primary focus of our paper, we anticipate exploring this significant area with the community in the future.

Regarding the generation of one million images, we employ the aforementioned methods with the help of MLLMs to minimize the inclusion of personal information as much as possible. We acknowledge that achieving 100% avoidance remains a significant challenge. However, before publicly releasing the dataset, we will screen out all images containing faces and compare them with publicly available face datasets (e.g., CelebA, FFHQ) to remove any images that contain personal information.

Q2: About computational complexity

[Reply] Thanks for this valuable suggestion. Please refer to the global response.

Q3: About PSNR and SSIM

[Reply] Thank you for highlighting this point. For real-world image restoration, when the degradation is severe, it is challenging to pursue highly accurate reconstruction; instead, the focus is more on achieving better visual quality [1,2]. Table 1 shows that both SUPIR and DreamClear perform poorly in terms of PSNR and SSIM. However, despite lacking an advantage in full-reference metrics like PSNR and SSIM, SUPIR and DreamClear can produce excellent visual effects.

As mentioned in the paper (Line 244-245), many recent works [1,3,4] also observe this phenomenon and suggest that it is necessary to reconsider the reference values of existing metrics and propose more effective methods to evaluate advanced image restoration methods. Therefore, we conducted a user study in the paper to more comprehensively measure the capabilities of different IR models. We believe that as the field of image quality assessment (IQA) evolves, more suitable metrics will emerge to adequately measure the performance of advanced IR models. We will include further discussion on this topic in our final paper.

Q4: About writing errors

[Reply] Thanks for pointing this out. We'll carefully check the whole paper and correct these errors in the final version.

References

[1] Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. In CVPR, 2024.

[2] SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution. In CVPR, 2024.

[3] Depicting beyond scores: Advancing image quality assessment through multi-modal language models. In ECCV, 2024.

[4] Descriptive Image Quality Assessment in the Wild. arXiv preprint arXiv:2405.18842, 2024.

Comment

Thank you for the detailed rebuttal. I appreciate the effort the authors have made to address the concerns I raised. However, the computational complexity of this approach continues to be a significant challenge, particularly when considering the need to train the model on new datasets. The fact that training took 7 days on 32 A100 GPUs raises concerns about its practicality in other real-world scenarios, especially in environments with limited resources. Therefore, I am maintaining my initial rating.

Comment

Thank you for acknowledging our work and providing thoughtful feedback on our rebuttal.

In the era of "large models," many fields are progressing towards the development of foundational models. Creating these foundational models requires substantial data and computational resources. In this paper, we approach from both data and model perspectives, aiming to expand the capability limits of image restoration models. This exploration is vital for understanding the strengths and limitations of large-model-based approaches within the domain of image restoration. We are optimistic that advancements in model distillation and quantization will significantly enhance our approach by reducing model size while maintaining effectiveness.

We greatly appreciate your time and insightful feedback throughout the discussion period. Please feel free to reach out if you have further queries or need additional clarification on our work.

Review (Rating: 5)

The authors propose GenIR, a privacy-safe automated pipeline that generates a large-scale dataset of one million high-quality images for training image restoration (IR) models. Additionally, they introduce DreamClear, an IR model that seamlessly integrates degradation priors into diffusion-based IR models. DreamClear features the novel Mixture of Adaptive Modulator (MoAM), which adapts to diverse real-world degradations. Comprehensive experiments demonstrate its outstanding performance in managing complex real-world situations, marking a substantial progression in IR technology.

Strengths

The authors propose an automated data curation pipeline for image restoration, and extensive experiments across both low-level and high-level benchmarks demonstrate DreamClear's state-of-the-art performance in handling intricate real-world scenarios.

Weaknesses

  1. The overall impact of the generated dataset is not that convincing. Why is the training also performed on DIV2K, Flickr2K, LSDIR and Flickr8K?
  2. The architectural contribution is limited: the mixture-of-experts concept, modified here into a mixture of adaptive modulators, and text-to-image diffusion models have already been widely explored in the field of image restoration.
  3. The first ablation study does not highlight the impact of the proposed dataset.
  4. It would be interesting to see the results without dual-based prompt learning, if time permits.

Questions

  1. Do Eq. 2 and 3 hold only for x_lq, or do they also hold for x_ref?

Limitations

Yes, the authors addressed the limitations.

Author Response

We sincerely appreciate the reviewer for the valuable and positive comments on our work. We address the questions and clarify the issues accordingly as described below.

Q1: About the generated dataset

[Reply] Thanks for your valuable suggestion. The primary goal of the proposed GenIR is to expand the dataset as much as possible to train a more robust model. As shown in Figure 6, the performance of the generated dataset of equivalent size is inferior to that of the real dataset. Therefore, by combining existing IR datasets with our generated dataset, we aim to achieve optimal model performance.

We also conduct ablation studies to explore the influence of different training datasets. We retrain DreamClear using different datasets, with the results provided below (evaluated on DIV2K-Val):

| Trainset | LPIPS ↓ | DISTS ↓ | FID ↓ | MANIQA ↑ | MUSIQ ↑ | CLIPIQA ↑ |
|---|---|---|---|---|---|---|
| Existing IR + Generated | 0.3391 | 0.1791 | 23.24 | 0.4327 | 68.93 | 0.6978 |
| Generated | 0.3601 | 0.1847 | 27.84 | 0.4295 | 69.31 | 0.6823 |
| Existing IR | 0.3493 | 0.1873 | 24.10 | 0.4154 | 63.24 | 0.6435 |

These results show that the generated dataset achieves perceptual metrics (LPIPS, DISTS) similar to those obtained with existing IR datasets and a significant advantage in no-reference metrics (MANIQA, MUSIQ, CLIPIQA). We speculate that the gap in FID mainly stems from the matched distribution between the training and validation splits of DIV2K. The results indicate that our generated data can effectively improve the quality of restored images compared to existing IR datasets, and that combining the two achieves the best performance. We also provide some visual comparisons in the attached PDF file (Figure IV), which demonstrate the effectiveness of our generated data in enhancing visual quality. We'll add these ablations in the final version.

Q2: About our contribution

[Reply] Thanks. In this paper, our technical contributions are mainly divided into two aspects:

GenIR. We introduce the GenIR pipeline, which provides a privacy-safe and cost-effective method to generate data for image restoration. Ultimately, we obtain a dataset containing one million high-quality images and verify the effectiveness of the generated data for real-world image restoration.

DreamClear. Recent works based on text-to-image diffusion models indeed demonstrate their superior performance in real-world image restoration. Compared to existing works, the architecture of DreamClear mainly differs in the following three aspects:

  • We propose a dual-branch structure, which incorporates the reference image, allowing the model to focus less on degradation removal and more on enhancing image details through generative priors, ultimately producing more photorealistic images.
  • We propose a novel Mixture of Adaptive Modulator (MoAM) to enhance our model's robustness to intricate real-world degradations by explicitly guiding it through the introduction of token-wise degradation priors. Unlike the MoE structure commonly used in LLMs, MoAM dynamically guides different restoration experts through degradation maps to achieve more robust image restoration.
  • To the best of our knowledge, our work is the first to explore the performance of the diffusion transformer (DiT) architecture in image restoration.

We will add the discussion on how our work differs from related works in our final paper. We also hope to get more comments and suggestions from you, such as related works built upon MoE structures in low-level vision, to further improve our paper.

Q3: About the first ablation

[Reply] Thanks. The main purpose of the first ablation is to explore whether expanding the scale of the generated data can improve the effects of real-world image restoration. To examine the impact of the proposed dataset, we conduct ablations using different training datasets. Please refer to the Reply of Q1.

Q4: About dual-based prompt learning

[Reply] Thanks for your valuable advice. We remove the dual-based prompt learning strategy and retrain GenIR. We provide qualitative comparisons of the GenIR-generated images for this ablation study. The results are provided in the attached PDF file (Figure V). They show that the dual-based prompt learning strategy effectively enhances image texture details, making the generated images more suitable for image restoration training. We'll add more ablation results in the final version.

Q5: About Eq. (2) and Eq. (3)

[Reply] Thanks. As illustrated in the paper (Line 206-208), $x_{ref}$ is directly fed into AM as the conditional information, while $x_{lq}$ is fed into the mixture of degradation-aware experts structure. Figure 3(c) also depicts this process. Therefore, Eq. (2) and Eq. (3) hold only for $x_{lq}$, not for $x_{ref}$. We'll give a clear illustration of this process in the final version.

Comment

Thank you for the detailed response.

I highly appreciate the effort made by the authors to address the weaknesses and questions asked.

I still have a concern regarding Q2: can you please elaborate in more detail on how the MoAM module helps achieve robust image restoration, and how the inclusion of a reference image helps the model focus less on degradation removal?

Considering the general response given by the authors, where they mention the training complexity, it is very hard to see the practical implications of the proposed method in real-world applications, so I will change my rating to borderline reject.

Comment

Thanks a lot for providing thoughtful feedback on our rebuttal! We address your remaining concerns as described below.

1. About Reference Branch

[Reply] As outlined in the paper (Line 187), we employ a lightweight image restoration network (i.e., SwinIR trained with L1 loss) to generate preliminary restored images as reference images. While these images may lack fine details, they are largely free from blur, noise, and JPEG artifacts. Consequently, guided by the reference image, our model can more readily identify degraded content in the low-quality (LQ) image, enabling it to concentrate more effectively on enhancing image details.

We also conduct ablations to examine the role of the reference branch. As shown in Table 3, when the reference branch is removed, all metrics on the low-level and high-level benchmarks deteriorate significantly. Figure 7 also shows that the reference branch can significantly enhance the quality of the image. This indicates that the reference branch plays an important role in DreamClear for improving both image quality and semantic information.

2. About MoAM

[Reply] The proposed MoAM acts as an interaction module between the LQ branch and the reference branch, aiming to enhance the model's robustness to intricate real-world degradations by explicitly guiding it through the introduction of token-wise degradation priors. It obtains the degradation map through cross-attention between the LQ features and reference features, guiding the dynamic fusion of expert knowledge to modulate LQ features. To validate its effectiveness, we replace MoAM with different interaction modules:

  • AM: As shown in Table 3, when replacing MoAM with AM, all metrics undergo a substantial decrease. Similarly, Figure 7 demonstrates that this replacement leads to more blurred textures in both the bear and the snow. These findings underscore the importance of the degradation prior guidance in MoAM for steering restoration experts.
  • Cross-Attention / Zero-Linear: In addition to AM, we also tried other feature interaction modules, including cross-attention and zero-linear. Table 3 shows that their performance is inferior to AM across all metrics. Figure 7 shows that when using zero-linear, the bear's texture becomes blurred and there are semantic errors on its back. When using cross-attention, many artifacts appear, and there are also semantic errors on the bear's nose. Therefore, using AM as the interaction module is more suitable for the IR task to achieve better restoration results.
Comment

Thank you for your detailed response. I would advise the authors to add this content to the main paper to make things clearer to the audience. But isn't there a better alternative than first deploying a lightweight SwinIR to generate reference images? Is it necessary to provide a reference image?

Comment

3. About Real-World Applications

[Reply] We additionally provide the training computational cost (GPU days) compared with the recent SOTA SUPIR published in CVPR:

| Methods | Params | Training Time | GPU Days | Inference Time |
|---|---|---|---|---|
| SUPIR | 3865.6M | ≈ 10 days using 64 A6000 | > 291 A100 | 16.36s |
| DreamClear | 2283.7M | ≈ 7 days using 32 A100 | 224 A100 | 15.84s |

Due to the lack of accurate speed comparisons between the A6000 GPU (48GB) and the A100 GPU (80GB), we treat the A6000 GPU as equivalent to the V100 GPU (32GB) for calculations (though the A6000's computational power is clearly superior to the V100). Following the previously published works [1,2], we convert the V100 days to A100 days by assuming a 2.2× speedup of A100 compared to V100. The results show that our DreamClear method requires at least 70 fewer A100 GPU days compared to SUPIR.

We acknowledge that DreamClear still requires relatively large training costs in the field of low-level vision. However, as we move further into the "large model" era, various fields are advancing towards the development of foundational models, which inherently require vast amounts of data and computational resources. In this paper, we approach from both data and model perspectives, aiming to expand the capability limits of image restoration models. This exploration is vital for understanding the strengths and limitations of large-model-based approaches within the domain of image restoration. We are optimistic that advancements in model distillation and quantization will significantly enhance our approach by reducing model size while maintaining effectiveness.

Additionally, our generated dataset offers significant value for real-world applications. As discussed in the reply of Q1, with the same training time for DreamClear (5 days using 32 A100 GPUs), the model trained on our generated dataset achieves better quantitative and qualitative results compared to those trained on existing IR datasets. Besides, as shown in Figure 6, when training lightweight, deployable models like SwinIR, using our generated dataset also results in significant performance improvements compared with using existing IR datasets. The results are as follows (evaluated on LSDIR-Val):

| Trainset | LPIPS ↓ | DISTS ↓ | FID ↓ | MANIQA ↑ | MUSIQ ↑ | CLIPIQA ↑ |
|---|---|---|---|---|---|---|
| DIV2K + Flickr2K | 0.4578 | 0.2435 | 51.29 | 0.3691 | 63.12 | 0.5647 |
| Ours Generated (20K images) | 0.3873 | 0.1951 | 42.13 | 0.4502 | 68.83 | 0.6469 |

This further underscores the value of the dataset we constructed in this paper for real-world image restoration applications.


We hope that our response addresses your concerns sincerely. Looking forward to further communication with you!

References

[1] High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.

[2] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. In ICLR, 2024.

Comment

Thanks for your insightful discussion! We'll add all these discussions in our final paper.

The primary motivation for introducing the reference image is to provide guidance using degradation-invariant image features, thereby helping the model achieve more realistic restorations. Therefore, providing the reference image is essential for the proposed DreamClear. As mentioned in previous discussions, both quantitative and qualitative ablation studies verify its effectiveness.

For an alternative approach, one possible idea may be to develop a degradation-invariant feature extraction network (e.g., finetune the CLIP image encoder with supervised loss from pairs of LQ/HQ images) that directly extracts clean features to achieve a similar effect. In future work, we intend to explore these potential alternatives to further enhance model performance.
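As a rough illustration of that alternative, the sketch below finetunes a CLIP image encoder so that features extracted from LQ images match a frozen encoder's features on the paired HQ images; it assumes the Hugging Face transformers library, and the model name, loss, and training loop are illustrative assumptions rather than a validated design.

```python
# Illustrative sketch: finetune a CLIP image encoder so LQ features match frozen HQ features
# (assumes the transformers library; model name, loss, and data handling are placeholders).
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModel

device = "cuda" if torch.cuda.is_available() else "cpu"
student = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").to(device)  # trainable, sees LQ
teacher = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").to(device)  # frozen, sees HQ
teacher.requires_grad_(False)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def training_step(lq_pixels, hq_pixels):
    """lq_pixels / hq_pixels: preprocessed tensors of shape (B, 3, 224, 224)."""
    with torch.no_grad():
        target = teacher(pixel_values=hq_pixels).last_hidden_state   # clean (degradation-free) features
    pred = student(pixel_values=lq_pixels).last_hidden_state         # features from the degraded input
    loss = F.mse_loss(pred, target)                                   # pull LQ features toward HQ features
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```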

Thank you for taking the time to provide your feedback. We hope that we have addressed all of your concerns. If so, we kindly ask if you could reconsider our score. Should you have any further suggestions or require additional improvements, please do not hesitate to let us know.

Comment

Thank you for your efforts.

Though I am still worried about the training complexity of the proposed model, considering the efforts of the authors I will upgrade my rating to borderline accept.

Comment

Thank you for your discussion! We appreciate your constructive feedback that has helped refine our research. We'll carefully revise our final paper. Your positive rating means a lot to us.

Review (Rating: 4)

The paper introduces "DreamClear," a high-capacity image restoration model, and "GenIR," an innovative data curation pipeline for image restoration (IR) tasks. DreamClear leverages Diffusion Transformer (DiT) models and generative priors from text-to-image (T2I) diffusion models, combined with multi-modal large language models (MLLMs), to achieve photorealistic restorations. GenIR is a dual-prompt learning pipeline that generates a large-scale dataset of high-quality images while ensuring privacy and copyright compliance. The combination of these strategies enhances the model's ability to handle diverse real-world degradations.

Strengths

  1. GenIR provides a novel, privacy-safe method to create large-scale datasets, overcoming the limitations of existing datasets that are often small and not comprehensive.
  2. DreamClear integrates generative priors from T2I diffusion models and employs a Mixture of Adaptive Modulator (MoAM) to handle diverse degradation scenarios effectively.
  3. The model achieves high perceptual quality in restored images, compared with other methods in various benchmarks.

Weaknesses

  1. Complexity in Implementation: The dual-branch structure and the integration of multiple experts for degradation handling add complexity to the model.
  2. Some technical details and more experiments should be provided.

Questions

  1. Some technical details are not clear. For example, what kind of loss do you use in this paper? While the paper outlines the steps of the GenIR pipeline, it lacks specific details on parameter settings, such as the hyperparameters used in the generative models, data augmentation methods during training, and strategies for generating negative samples. There is insufficient detail on the criteria and standards used to evaluate and select the generated dataset.

  2. One potential issue with the DreamClear approach is the risk of generating artificial textures during the image restoration process. Taking Figure 4 as an example, although the result is clearer, it may produce artifacts because the beak of this bird does not look like this.

  3. In Figure 6, as data size increases, performance improves. Is there further improvement for more datasets, e.g., increasing the number of training images to 200000?

  4. It would be better to compare the efficiency of different methods, e.g., model size, training/inference time, FLOPs.

  5. Some necessary related work should be discussed. There are many diffusion-based image restoration methods, e.g., DDRM, DDNM, DeqIR, etc. It would be better to cite and discuss them.

Limitations

Please refer to the details above.

Author Response

We sincerely appreciate the reviewer for the valuable and positive comments on our work. We address the questions and clarify the issues accordingly as described below.

Q1: About Model Implementation

[Reply] Thanks. We explain the motivation behind our model design as follows:

  • Dual Branch: Our dual-branch structure incorporates the reference image, allowing the model to focus less on degradation removal and more on enhancing image details through generative priors, ultimately producing more photorealistic images. Considering potential detail loss in the reference image, we use both the LQ image and the reference image to jointly guide the diffusion model in order to obtain a realistic image that remains faithful to the original.
  • Multiple Experts: The proposed Mixture-of-Adaptive-Modulator (MoAM) aims to enhance the model's robustness to intricate real-world degradations by explicitly guiding it through the introduction of token-wise degradation priors. Specifically, we feed the token-wise degradation map into a router network to dynamically select different restoration experts for effective feature fusion. MoAM dynamically fuses expert knowledge, leveraging degradation priors to tackle complex degradations.

Table 3 and Figure 7 provide quantitative and qualitative ablations, respectively, to demonstrate the effectiveness of our model design. Besides, we also provide the efficiency comparison with SOTAs (Please refer to the global response). The efficiency of DreamClear is comparable to SeeSR and SUPIR. Therefore, we believe that our design is effective and not redundant.

Q2: About technical details

[Reply] Thanks for pointing this out. We provide more training details in the global response.

To generate negative samples, we adopt SDEdit [1] as the image editing technique. Specifically, we use text prompts such as "cartoon, painting, blur, messy, low quality, deformation, low resolution, over-smooth, dirty" to edit the positive samples. We set the strength in SDEdit to 0.6, resulting in the corresponding negative samples. During the fine-tuning phase, the number of negative samples is controlled to be one-tenth of the number of positive samples.

To evaluate & select appropriate samples for our dataset, we first employ a quality classifier to screen the quality of the generated samples. Then we utilize Gemini as a powerful MLLM to ascertain whether the images exhibit blatant semantic errors or inappropriate content. The implementation details are as follows:

  • We use a convolutional neural network (CNN) as a binary classifier, trained with an equal number of positive and negative samples. Specifically, the negative samples are obtained through SDEdit with the strength set to 0.25. Samples with a classification probability of less than 0.8 are filtered out.
  • The prompt for Gemini is set to "You are an AI visual assistant, and you are analyzing a single image. Your task is to check the image for any anomalies, irregularities, or content that does not align with common sense or normal expectations. Additionally, identify any inappropriate content such as personal privacy violations or NSFW material. If the image does not contain any of the aforementioned issues, it has passed the inspection. Please determine whether this image has passed the inspection (answer yes/no) and provide your reasoning." Samples that do not pass the inspection will be eliminated.
  • Only the samples that have passed these two rounds of screening will be retained.
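A minimal sketch of this screening setup, assuming the diffusers img2img pipeline for SDEdit-style editing and a toy CNN classifier; the model name and classifier architecture are illustrative assumptions, while the strength values and the 0.8 threshold follow the settings stated above.

```python
# Illustrative sketch of SDEdit-style negative-sample generation and quality screening
# (assumes the diffusers library; model name and the tiny CNN are placeholders).
import torch
import torch.nn as nn
from diffusers import StableDiffusionImg2ImgPipeline

NEGATIVE_EDIT_PROMPT = ("cartoon, painting, blur, messy, low quality, deformation, "
                        "low resolution, over-smooth, dirty")

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def make_negative(positive_image, strength):
    # strength = 0.6 for fine-tuning negatives, 0.25 for classifier-training negatives.
    return pipe(prompt=NEGATIVE_EDIT_PROMPT, image=positive_image, strength=strength).images[0]

# Toy binary quality classifier (placeholder architecture).
quality_classifier = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
)

def keep_sample(image_tensor):
    """image_tensor: (1, 3, H, W); keep samples whose predicted quality probability >= 0.8."""
    prob = torch.sigmoid(quality_classifier(image_tensor)).item()
    return prob >= 0.8
```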

We will add all these technical details in the final version.

Q3: About artificial textures

[Reply] Thank you for your valuable feedback. Real-world image restoration is an ill-posed problem, meaning there isn't a single definitive solution. Our diffusion-based approach aims to generate visually pleasing restored images in most cases. However, when the degradation of the input image is very severe, it becomes challenging to ensure that certain details in the restored image are faithful to the input.

Regarding the bird (gyrfalcon) in Figure 4, the benchmark does not include a corresponding GT image. Therefore, we sourced photos of gyrfalcons from the internet to compare the results of different methods. These comparison results are provided in the attached PDF file (Figure I). Despite some differences between our restored image and the real gyrfalcon, our approach significantly surpasses SeeSR and SUPIR in terms of clarity and semantic preservation.

Besides, we also provide more real-world comparisons with SUPIR in the attached PDF file (Figure II). It demonstrates that our method can achieve clearer details and fewer artifacts when dealing with real-world cases. We'll add more real-world comparisons in our final paper.

Q4: About the increasing data size

[Reply] Thanks for the advice. After increasing the data size, the results are as follows:

| Image Number | LPIPS ↓ | DISTS ↓ | FID ↓ | MANIQA ↑ | MUSIQ ↑ | CLIPIQA ↑ |
|---|---|---|---|---|---|---|
| 100,000 | 0.3902 | 0.1982 | 43.27 | 0.4413 | 68.27 | 0.6382 |
| 200,000 | 0.3873 | 0.1951 | 42.13 | 0.4502 | 68.83 | 0.6469 |

After increasing the data size, the performance of the model improves. We'll add more results using different data sizes in our final paper.

Q5: About the efficiency comparison

[Reply] Thanks for this valuable suggestion. Please refer to the global response.

Q6: About the related work

[Reply] Thanks for your reminder. DDRM, DDNM, and DeqIR are all excellent training-free methods that handle various image restoration tasks by improving the sampling process of pre-trained diffusion models. We'll cite all these related diffusion-based works and discuss them in our final paper.

References

[1] Sdedit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.

Comment

Dear Reviewer CpjF,

Thank you very much for recognizing our work: "GenIR provides a novel method" and "DreamClear achieves high perceptual quality".

We greatly appreciate the time and effort you dedicated to reviewing our paper. As the deadline for the discussion period approaches and we have yet to receive your feedback, we kindly request that you share any remaining concerns. Please let us know if we can provide any additional information or clarification.

Thank you once again for your contributions to the development of our paper.

Authors of Submission 8300

Author Response

We sincerely appreciate all reviewers for their valuable and positive comments on our work. This is a global response to reviewers' questions.

1. Efficiency Comparison

We provide the efficiency comparison results as follows:

| Methods | Params | Training Time | Inference Time |
|---|---|---|---|
| Real-ESRGAN | 16.7M | N/A | 0.08s |
| ResShift | 173.9M | N/A | 1.11s |
| SinSR | 173.9M | N/A | 0.27s |
| StableSR | 1409.1M | ≈ 7 days using 8 V100 | 12.36s |
| DiffBIR | 1716.7M | N/A | 3.45s |
| SeeSR | 2283.7M | N/A | 4.65s |
| SUPIR | 3865.6M | ≈ 10 days using 64 A6000 | 16.36s |
| DreamClear | 2283.7M | ≈ 7 days using 32 A100 | 15.84s |

The inference speed is tested on a single NVIDIA A100 GPU, generating 512×512 images from 128×128 inputs. StableSR, DiffBIR, SeeSR, SUPIR, and DreamClear are all built upon pre-trained T2I models, resulting in a larger number of parameters.

When compared to two recent SOTAs, SeeSR and SUPIR, DreamClear has a similar number of parameters to SeeSR but approximately 1600M fewer parameters than SUPIR. Due to their reliance on MLLMs, both DreamClear and SUPIR exhibit slower inference speeds than other methods, with DreamClear being about 0.5 seconds faster than SUPIR.

Additionally, SUPIR and DreamClear require more training time and computational resources than other methods due to their use of larger training datasets. Nonetheless, they achieve superior visual results compared to other methods. We'll add the efficiency comparison results in the final version.

2. More Training Details

For training both GenIR and DreamClear, we use the latent diffusion loss [1], which can be formulated as

$$\mathcal{L}_{\text{LDM}}=\mathbb{E}_{z_0,c,t,\epsilon}\left[\left\|\epsilon-\epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,z_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,\,c,\,t\right)\right\|_2^2\right],$$

where $\epsilon \sim \mathcal{N}(0,\mathbf{I})$ is the ground-truth noise map at time step $t$, $c$ represents the conditional information, and $\bar{\alpha}_t$ is the diffusion coefficient.
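For clarity, here is a minimal PyTorch sketch of one training step with this loss; the denoiser interface and the precomputed noise schedule are illustrative assumptions.

```python
# Illustrative latent-diffusion training-step sketch (assumptions: denoiser(z_t, c, t) predicts
# the noise, and alpha_bar is a precomputed cumulative-product noise schedule of length T).
import torch
import torch.nn.functional as F

def ldm_loss(denoiser, z0, cond, alpha_bar):
    """z0: clean latents (B, C, H, W); cond: conditioning; alpha_bar: tensor of shape (T,)."""
    B = z0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=z0.device)   # random time steps
    eps = torch.randn_like(z0)                                          # ground-truth noise
    a = alpha_bar[t].view(B, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps                          # forward diffusion
    eps_pred = denoiser(z_t, cond, t)                                   # epsilon prediction
    return F.mse_loss(eps_pred, eps)                                    # || eps - eps_theta ||_2^2
```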

We provide detailed training hyperparameters as follows:

| Configuration | GenIR | DreamClear |
|---|---|---|
| Optimizer | Adafactor | AdamW |
| Optimizer hyper-parameters | $eps_1=10^{-30}$, $eps_2=0.001$, $decay=-0.8$ | $\beta_1=0.9$, $\beta_2=0.999$, $eps=10^{-10}$ |
| Peak learning rate | $4\times10^{-7}$ | $5\times10^{-5}$ |
| Learning rate schedule | constant | constant |
| Warmup steps | 2000 | 0 |
| Gradient clip | 1.0 | 1.0 |
| Global batch size | 256 | 256 |
| Numerical precision | bfloat16 | fp16 |
| Computational resources | 16 A100 (80GB) | 32 A100 (80GB) |
| Training time | ≈ 5 days | ≈ 7 days |
| Data augmentation | random crop, flip | random crop |

3. More Figures

We provide some figures for rebuttal in the attached PDF file.

References

[1] High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.

Comment

Dear AC and Reviewers,

Thank you for taking the time to review our paper. We are delighted to see that the reviewers recognized the strengths of our work. We are particularly grateful that Reviewer CpjF highlighted the novelty of our method, Reviewer FDew and Reviewer bAzb acknowledged the significant contribution it makes, and Reviewer A2Rj commended its state-of-the-art performance.

However, the reviewers also raised some concerns. Reviewer CpjF sought more technical details about GenIR, while Reviewer A2Rj was interested in the overall impact of the generated dataset. We have provided a detailed rebuttal addressing these concerns and kindly ask the reviewers to review our responses when convenient.

Thank you once again for your valuable feedback.

Sincerely, Authors of Submission 8300

Comment

Dear AC and Reviewers,

Firstly, we wish to express our profound gratitude for your time, effort, and insightful feedback on our manuscript. We deeply appreciate the recognition our work has received: Reviewer CpjF emphasized the novelty of our method, Reviewer FDew and Reviewer bAzb acknowledged its significant contribution, and Reviewer A2Rj praised its state-of-the-art performance.

Throughout the review process, several concerns were raised, primarily related to technical details, efficiency comparisons, and the necessity for additional ablations and comparisons. We have addressed most of these concerns during the rebuttal and discussion stages.

A lingering concern pertains to the real-world applicability of our model, given its relatively high training cost. To address this, we highlight three points. First, we've compared our method to the recent SOTA method SUPIR [CVPR 2024]. DreamClear is less resource-intensive and needs at least 70 fewer A100 GPU days than SUPIR. Second, we emphasize the importance of our dataset and model considering the trend towards large models in fields like NLP and multimodal understanding. These foundational models inherently need vast data and computational resources. In our paper, we tackle this issue from data and model viewpoints, aiming to push the boundaries of image restoration models. Lastly, our generated dataset is valuable for real-world applications. As shown in Figure 6 and our response to Reviewer A2Rj, our high-quality dataset, when compared to existing public datasets, can significantly improve the performance of lightweight models like SwinIR in real-world image restoration.

We believe that the dataset and real-world image restoration model proposed in our paper are of significant value to the field of low-level vision and can effectively push forward the progress of the image restoration field in the era of large models.

We sincerely appreciate the time and efforts invested by all reviewers and AC in evaluating our work.

Best regards,

Authors of Submission 8300

Final Decision

The paper proposes a large-scale synthetic dataset for image restoration and a model that achieves SOTA results using this dataset. Although the overall method appears somewhat over-engineered, all reviewers agreed that the dataset is valuable, especially in the era of large foundational models, and three of them propose acceptance. The fourth reviewer did not respond after the rebuttal, but it seems that the rebuttal addressed most of the reviewer's concerns. A common issue that remains is the computational efficiency of the method.