ReFIR: Grounding Large Restoration Models with Retrieval Augmentation
This paper proposes a training-free framework to alleviate the hallucination of existing diffusion-based restoration models, which can produce unfaithful content or textures.
Abstract
Reviews and Discussion
This work introduces a novel training-free paradigm that uses retrieval augmentation to expand the knowledge boundaries of existing large restoration models (LRMs) by incorporating reference images as external knowledge to facilitate the restoration of high-fidelity details. The authors propose a nearest neighbor lookup to retrieve reference images, followed by cross-image injection to fuse external knowledge into the LRM. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method.
Strengths
- Although diffusion-based restoration models can produce realistic results, their inherent randomness often leads to outputs that are not faithful to the original scene. This work considers an approach by introducing additional reference images to mitigate this challenge.
- The idea of this work is interesting. Utilizing RAG techniques, which have been widely applied in NLP, for low-level computer vision tasks is promising. The authors provide a detailed quantitative and qualitative analysis of the working mechanism of existing LRMs, based on which they propose specific designs to inject external knowledge into LRMs.
- The authors propose specific solutions to address challenges such as domain preference issues and spatial misalignment when directly using high-quality reference images. The framework appears to be generic and requires no training, making the proposed method easily applicable to various existing LRMs.
Weaknesses
- In Table 1, why does the performance improvement vary when the proposed method is applied to different LRMs? For example, there is a 0.38 dB PSNR improvement for SeeSR but only a 0.03 dB improvement for SUPIR. Additionally, why do the restored images produced by different methods still look different when using the proposed ReFIR?
- It seems the proposed method requires additional reference images as input, which may increase inference costs and time.
Questions
In the spatial adaptive alignment module, the similarity matrix has size HW×HW, which requires computing per-pixel similarity. Does this introduce significant inference costs? Will the authors open-source the related details and code?
Limitations
Please see the above weaknesses.
[Q1: Variation of the outputs from different LRMs]
In Table 1, why does the performance improvement vary when the proposed method is applied to different LRMs? For example, there is a 0.38 dB PSNR improvement for SeeSR but only a 0.03 dB improvement for SUPIR. Additionally, why do the restored images produced by different methods still look different when using the proposed ReFIR?
We have also noticed this phenomenon, and we think it can be well explained by Fig.8 in the main paper. As shown in Fig.8, for the latent at the t-th time step, there are two forces pulling it in different directions to produce the latent at the (t−1)-th time step. One force comes from the internal knowledge in the frozen weights of the LRM, and the other comes from the external knowledge in the retrieved reference image. Therefore, although the external knowledge given to different LRMs is the same (the reference image is the same), their internal knowledge differs (different pre-trained parameters), which eventually results in different LRMs producing different restoration results.
[Q2: Inference costs from additional reference image]
It seems the proposed method requires additional reference images as input, which may increase inference costs and time.
To ensure efficient implementation, we use batch acceleration to process the LR and Ref in parallel, whose cost is roughly similar to increasing the batch size from 1 to 2. Moreover, although using the reference image inevitably incurs additional computational cost compared to directly inputting the LR for inference, the additional reference can mitigate the hallucination of LRMs, resulting in significant performance gains. We provide a comprehensive comparison of performance and efficiency as follows.
| setup | NIQE | MUSIQ | CLIPIQA | #param | GPU memory | Inference time |
|---|---|---|---|---|---|---|
| SeeSR-NoRef | 4.7432 | 55.54 | 0.6575 | 2.04B | 24.4G | 76.5s |
| SeeSR-ReFIR | 4.4566 | 57.13 | 0.6732 | 2.04B | 40.9G | 170.7s |
[Q3: Efficiency of the adaptive alignment module]
In the spatial adaptive alignment module, considering the size of the similarity matrix sim as HW×HW, which requires computing per-pixel similarity, does this introduce significant inference costs?
We apologize for the confusion. In fact, the computation of the similarity matrix is quite efficient. Since we find that the pixel-wise similarity between LR and Ref does not change much across different layers, we compute one similarity matrix per UNet block and share it across all layers in that block, resulting in only 12 calculations per timestep. We will make this clearer in the revision.
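The sharing scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cosine-similarity form, the `BlockSimilarityCache` class, and the toy feature sizes are all assumptions introduced here.

```python
import numpy as np

def cosine_similarity_matrix(lr_feat, ref_feat, eps=1e-8):
    """Pixel-wise cosine similarity between two flattened feature maps.

    lr_feat, ref_feat: arrays of shape (HW, C). Returns an (HW, HW) matrix.
    """
    lr_n = lr_feat / (np.linalg.norm(lr_feat, axis=1, keepdims=True) + eps)
    ref_n = ref_feat / (np.linalg.norm(ref_feat, axis=1, keepdims=True) + eps)
    return lr_n @ ref_n.T

class BlockSimilarityCache:
    """Hypothetical helper: compute the matrix once per (timestep, block)
    and reuse it for every layer inside that block."""

    def __init__(self):
        self._cache = {}
        self.num_computations = 0

    def get(self, timestep, block_id, lr_feat, ref_feat):
        key = (timestep, block_id)
        if key not in self._cache:
            self._cache[key] = cosine_similarity_matrix(lr_feat, ref_feat)
            self.num_computations += 1
        return self._cache[key]
```

With 12 blocks and any number of layers per block, `num_computations` grows by 12 per timestep, which matches the count quoted in the response. The memory of each matrix still scales as HW×HW, so the saving is in how often it is computed, not in its size.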
[Q4: Code open-source]
Will the authors open-source the related details and codes?
Yes! We will release all the code when the paper is accepted.
The paper introduces ReFIR, a novel method designed to enhance the capabilities of Large Restoration Models by incorporating external knowledge through the retrieval of high-quality, content-relevant images. The main contributions are: (1) it gives both quantitative and qualitative results on how existing LRMs work; (2) it proposes a novel solution, which is training-free and generic, to alleviate the hallucination problem of LRMs.
Strengths
- The idea of using external data representation instead of the model parameters seems interesting, which can be used in parallel with other model-oriented approaches.
- The experiments are solid and the performance is noteworthy. The authors apply the proposed technique to various LRMs and compare with the state of the art under settings with different difficulty levels. In Table 1 and Table 2, the ReFIR method achieves significant gains on both fidelity and perceptual metrics, demonstrating its superiority.
- The framework can be potentially used for multiple Diffusion-based restoration models without additional training, making it a cost-effective solution.
- The paper is well organized and easy to follow.
Weaknesses
- Difference from other methods. The proposed method seems related to some works in image editing, such as MasaCtrl and Prompt-to-Prompt, which also use training-free techniques to modify the behavior of the diffusion model. A detailed explanation of the differences between the proposed ReFIR method and these methods would be beneficial.
- Dependency on the Quality of Retrieved Images. From Table 5, it seems the performance of ReFIR heavily relies on the relevance and quality of the retrieved images. If the retrieval system performs sub-optimally, it could adversely affect the restoration outcomes.
- Complexity in Implementation. The proposed cross-image injection and spatial adaptive gating mechanisms introduce additional operations. Moreover, the input also contains extra reference images, which might challenge efficient implementation in practice.
Questions
- How does the retrieval system scale with increasingly large HQ retrieval datasets, and what are the computational implications of scaling?
- How might different retrieval algorithms (beyond nearest neighbor lookups) impact the performance of ReFIR?
Limitations
Further exploration of the scalability, efficiency, and robustness of the proposed method, as mentioned in the Weaknesses, would be helpful for broader application.
[Q1: Difference from other works]
Difference from other methods. The proposed method seems related to some works in image editing, such as MasaCtrl and Prompt-to-Prompt, which also use training-free techniques to modify the behavior of the diffusion model. A detailed explanation of the differences between the proposed ReFIR method and these methods would be beneficial.
Although both ReFIR and previous image editing works are training-free, we point out that they differ in the following ways.
- The specific techniques are different. Although both modify the attention layer, our ReFIR further introduces the Spatial Adaptive Gating to solve the spatial misalignment between LR and Ref, and proposes strategies to allow the injection of multiple reference images. In addition, we also propose the novel Distribution Alignment to alleviate the domain gap between two diffusion process chains.
- The goals are different. Previous image editing work modifies the attention layer for controllable generation, while our ReFIR aims to inject external knowledge to alleviate the hallucination problem of existing LRMs.
- Finally, in addition to incorporating reference images through training-free attention modification, another contribution of this work is the proposal of a feasible retrieval system to facilitate the acquisition of external knowledge from retrieved databases.
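To make the attention-level modification concrete, here is a minimal sketch of one way reference features could be injected into an attention layer: the LR queries attend over the concatenation of LR and Ref keys/values. The concatenation form and the function names are assumptions for illustration; the exact formulation (including gating and distribution alignment) is in the paper and is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_image_attention(q_lr, k_lr, v_lr, k_ref, v_ref):
    """LR queries attend to both LR and Ref tokens (assumed variant).

    q_lr, k_lr, v_lr: (N_lr, d) tokens from the LR chain.
    k_ref, v_ref:     (N_ref, d) tokens from the Ref chain, same layer.
    Returns an (N_lr, d) output for the LR chain.
    """
    d = q_lr.shape[-1]
    k = np.concatenate([k_lr, k_ref], axis=0)  # (N_lr + N_ref, d)
    v = np.concatenate([v_lr, v_ref], axis=0)
    attn = softmax(q_lr @ k.T / np.sqrt(d), axis=-1)
    return attn @ v
```

Because no weights are added or changed, a modification of this kind is training-free; only the set of tokens visible to the LR queries changes.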
[Q2: Dependency on the Quality of Retrieved Images]
From Table 5, it seems the performance of ReFIR heavily relies on the relevance and quality of the retrieved images. If the retrieval system performs sub-optimally, it could adversely affect the restoration outcomes.
To make our ReFIR more robust when the relevance and quality of the retrieved images are poor, we further explore techniques including a fallback strategy and an adaptive filtering policy. Please see the global author rebuttal for more details.
Using these techniques can further improve the robustness of ReFIR when the retrieval system performs sub-optimally. We will explore more techniques for improvements in the future.
[Q3: Complexity in implementation]
The proposed cross-image injection and spatial adaptive gating mechanisms introduce additional operation. Moreover, the input image also contains extra reference images, which might challenge efficient implementation in practice.
Actually, since the modification of the attention layer only happens in the decoder during the last 20 timesteps, which accounts for only 12% of all the layers, the cross-image injection and spatial adaptive gating cause an almost negligible increase in computational cost. We provide the specific cost comparison as follows.
| setup | input | #param | GPU memory | Inference time |
|---|---|---|---|---|
| w/o cross-image injection and spatial adaptive gating | batchsize=2 | 3.87B | 51.1G | 312.2s |
| w/ cross-image injection and spatial adaptive gating | 1xLR + 1xRef | 3.87B | 51.4G | 322.8s |
Since the LR chain only receives the features of the corresponding layer from the Ref chain, this property allows batch acceleration for efficient implementation. Specifically, the input batch size is increased from 1 (LR only in the original NoRef LRM) to 2 (LR+Ref in our ReFIR). Therefore, the batch-parallel computation of both LR and Ref makes the method efficient in practice.
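The batching idea can be illustrated with a toy example: the LR and Ref latents are stacked along a batch dimension and passed through the same frozen layer in one call. The `shared_layer` function is a hypothetical stand-in for an LRM layer, and the shapes are arbitrary.

```python
import numpy as np

def shared_layer(x, weight):
    """Stand-in for one frozen LRM layer; the same weight is applied
    to every element of the batch."""
    return np.tanh(x @ weight)

rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 8))       # frozen, shared by both chains
lr_latent = rng.standard_normal((16, 8))   # tokens of the LR chain
ref_latent = rng.standard_normal((16, 8))  # tokens of the Ref chain

# Batch size goes from 1 (LR only) to 2 (LR + Ref), one forward call.
batch = np.stack([lr_latent, ref_latent], axis=0)  # (2, 16, 8)
out = shared_layer(batch, weight)
lr_out, ref_out = out[0], out[1]
```

Each chain's output is identical to what it would get if processed alone, which is why the cost is comparable to running the original model with a batch size of 2.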
[Q4: Scaling up with larger retrieval database]
How does the retrieval system scale with increasingly large HQ retrieval datasets, and what are the computational implications of scaling?
As suggested, we use ImageNet as the larger database for scaling up the retrieval system. The results are as follows.
| setup | NIQE | MUSIQ | CLIPIQA |
|---|---|---|---|
| ImageNet as Retrieval | 4.4233 | 57.14 | 0.6770 |
| baseline | 4.4566 | 57.13 | 0.6732 |
It can be seen that increasing the dataset size can improve the relevance of the retrieved image and thus the performance. Furthermore, since the similarity between the LR image and the retrieval dataset is computed in parallel, we find that scaling up the retrieval dataset brings almost negligible computational cost.
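As a rough sketch of why the lookup scales well, the database embeddings can be normalized once offline, after which retrieval is a single matrix-vector product followed by an argmax. The function names are hypothetical, and the embeddings here are plain arrays standing in for the output of a semantic feature extractor.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def build_index(db_embeddings):
    """Offline step: normalize all database embeddings once."""
    return l2_normalize(np.asarray(db_embeddings, dtype=np.float64))

def retrieve_nearest(query_embedding, index):
    """Online step: cosine similarity to every entry in one product."""
    q = l2_normalize(np.asarray(query_embedding, dtype=np.float64))
    sims = index @ q                      # shape (num_images,)
    best = int(np.argmax(sims))
    return best, float(sims[best])

index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
best_id, best_sim = retrieve_nearest([0.0, 2.0], index)
```

For very large databases an approximate nearest-neighbor index would be a natural substitute for the dense product, though that is not something the rebuttal describes.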
[Q5: Explore different retrieval algorithms]
How might different retrieval algorithms (beyond nearest neighbor lookups) impact the performance of ReFIR?
As suggested, we use the TopK retrieval algorithm to explore the impact of different retrieval algorithms. Specifically, we select the three images with the highest similarity as reference inputs, and use the multi-reference injection technique to handle multiple reference images. The experimental results are shown below.
| setup | NIQE | MUSIQ | CLIPIQA |
|---|---|---|---|
| Top3-reference | 4.4217 | 57.25 | 0.6749 |
| baseline | 4.4566 | 57.13 | 0.6732 |
It can be seen that using the TopK retrieval algorithm can bring more relevance to the reference image, thus improving the performance.
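A Top-K lookup differs from the nearest-neighbor lookup only in the selection step. The sketch below assumes cosine similarity over embedding vectors; `retrieve_top_k` is a hypothetical name, and `k` is assumed to be at least 1.

```python
import numpy as np

def retrieve_top_k(query_embedding, db_embeddings, k=3):
    """Return indices and similarities of the k most similar entries,
    sorted from most to least similar."""
    db = np.asarray(db_embeddings, dtype=np.float64)
    q = np.asarray(query_embedding, dtype=np.float64)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    sims = db @ q
    k = min(k, len(sims))
    # argpartition finds the k largest without sorting the whole array.
    top = np.argpartition(-sims, k - 1)[:k]
    top = top[np.argsort(-sims[top])]
    return top.tolist(), sims[top].tolist()

ids, scores = retrieve_top_k(
    [1.0, 0.1],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]],
    k=3,
)
```

The returned references would then be passed together to the multi-reference injection mentioned in the response.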
This paper presents a plug-and-play approach to enhance the quality of Diffusion-based super-resolution models by leveraging reference images. The authors exploit CLIP to effectively filter high-definition images with similar semantics from a pre-trained dataset. These reference images are then employed to replace the original diffusion model's features within the attention operation, leading to improved super-resolution results.
Strengths
- The paper makes a valuable contribution by demonstrating how introducing external knowledge via CLIP filtering can effectively address the hallucination problem in Diffusion-based super-resolution models.
- The paper presents promising quantitative and qualitative results that support the effectiveness of the proposed method.
Weaknesses
- Figures 5 and 6 would benefit from including the reference images selected by the model alongside the processed results. This would allow for a clearer evaluation of the effectiveness of the reference image selection process. Consider revising the figures to include the reference images or provide additional visualizations that demonstrate the impact of the reference information on the final output.
- Tables 1 and 2 highlight a significant performance boost from ReFIR on the RefSR dataset, which diminishes in real-world scenarios. This raises questions about ReFIR's ability to handle diverse real-world data. Can the authors provide additional evidence or analysis to demonstrate ReFIR's effectiveness in retrieving relevant reference images for real-world scenarios, even when the similarity might not be as high as in the RefSR dataset? Or, could we generate reference images automatically just like CoSeR? The comparison and more detailed discussion is necessary.
- The appendix appears to contain crucial information about the reference image retrieval stage, such as the selection criteria for the feature extractor. This information should be moved to the main body of the paper for better clarity and transparency.
- The paper should discuss the computational cost and time consumption associated with the "Source Reference Chain" process. While the performance improvement is valuable, quantifying the trade-off in terms of computational resources is essential.
Questions
- The paper does not explicitly address the potential issue of similar features being missed due to image chunking in pre-trained SD models like SeeSR and SUPIR. Consider discussing whether the model retrieves features from the corresponding chunk of the reference image or employs a different strategy to handle chunked inputs.
- It would be helpful to clarify whether the features used for matching in the reference image retrieval process are pre-computed or computed online.
- The paper explores the concept of internal knowledge (learned by the LRM) and external knowledge (introduced through reference images). While directly training the LRM on the reference image dataset seems like an alternative approach, it's important to consider the potential drawbacks, such as overfitting to the specific reference data and reduced generalizability. Discussing these trade-offs would provide a more nuanced understanding of the benefits of using reference images as external knowledge.
Limitations
They have.
[Q1: Reference image presentation]
Thanks for your advice. We will add the retrieved images to Fig.5 and Fig.6 in the revision to improve the presentation quality.
[Q2: Improving ReFIR's ability in real-world scenarios]
Tables 1 and 2 highlight a significant performance ... detailed discussion is necessary.
To improve the ability of ReFIR in real-world applications, we explore several techniques to improve its robustness. The related analysis and experiments are shown in the global author rebuttal.
Equipped with the fallback mechanisms (not using a reference, or generating the Ref as in CoSeR) as well as the adaptive filtering strategies (relevance/quality/task-based filtering), our ReFIR can alleviate the dilemma when highly correlated and high-quality images are scarce or unavailable, enhancing its ability in real-world scenarios.
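The decision logic behind such fallback and filtering strategies might look like the following sketch. The thresholds, score ranges, and the `select_reference_mode` function are hypothetical placeholders, not values or names taken from the rebuttal.

```python
def select_reference_mode(similarity, quality,
                          min_similarity=0.5, min_quality=0.5,
                          can_generate=False):
    """Decide how to obtain the reference for one LR input.

    similarity: relevance score of the best retrieved image (assumed in [0, 1]).
    quality:    quality score of that image (assumed in [0, 1]).
    Returns "retrieved", "generated", or "no_reference".
    """
    # Adaptive filtering: accept the retrieved image only if it is
    # both relevant enough and of high enough quality.
    if similarity >= min_similarity and quality >= min_quality:
        return "retrieved"
    # Fallback: generate a reference if a generator is available,
    # otherwise run the original model without a reference.
    return "generated" if can_generate else "no_reference"
```

For example, a retrieved image with high relevance but low quality would be rejected, and the pipeline would fall back to the unmodified LRM unless a reference generator is available.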
[Q3: Moving the retrieval stage details to main body]
Due to the 9-page limit for submissions this year (1 page less than usual), we placed the image retrieval details in the appendix, and we apologize for this. We will add more details about the image retrieval stage to the main body in the revision.
[Q4: Computation and time cost from Source Reference Chain]
The paper should discuss the computational cost and time consumption associated with the "Source Reference Chain" process. While the performance improvement is valuable, quantifying the trade-off in terms of computational resources is essential.
Since the "Target Restoration Chain" only receives features of the same time step and layer from the "Source Reference Chain", this property allows batch acceleration for efficient implementation. Specifically, the two chains share one set of LRM weights, and we simply increase the batch size from 1 (LR only in the original NoRef LRM) to 2 (LR+Ref in our ReFIR).
Therefore, the computational cost and time consumption of ReFIR are roughly similar to those of the original LRM with the batch size set to 2, with a slight overhead from the additional cross-image injection. The comparison of model complexity before and after incorporating our ReFIR is given in Table 7 of the Appendix, and we also give this result here:
TableA. Comparison on computation and time cost. The input resolution is , and is evaluated on one 80G A100 GPU.
| Method | input | #param | GPU memory | Inference time |
|---|---|---|---|---|
| SeeSR | batchsize=2 | 2.04B | 40.2G | 160.2s |
| SUPIR | batchsize=2 | 3.87B | 51.1G | 312.2s |
| SeeSR+ReFIR | 1xLR + 1xRef | 2.04B | 40.9G | 170.7s |
| SUPIR+ReFIR | 1xLR + 1xRef | 3.87B | 51.4G | 322.8s |
[Q5: Discussion on patchified images]
Consider discussing whether the model retrieves features from the corresponding chunk of the reference image or employs a different strategy to handle chunked inputs.
Patchification is usually used when the input resolution is large. Although it is intuitive to use the corresponding patch from the large reference image as the reference branch's input, this can incur significant performance degradation when the LR and Ref are spatially misaligned.
Here, we introduce another solution that simply uses existing techniques to deal with this problem. Specifically, we employ the multi-reference injection, shown in Appendix A, in which we use all the patches from the reference image to guide the restoration of a given LR input patch. In this way, the LR image can maintain a global receptive field over the Ref. In the experiment, we choose the number of patches as 4 for a 2048x2048 input from the RealPhoto60 dataset, resulting in 4 small patches of size 512x512. The results are shown below.
| setup | GPU cost | NIQE | MUSIQ | CLIPIQA |
|---|---|---|---|---|
| MultiRef-for-patchified images | 29.8G | 4.6543 | 56.20 | 0.6598 |
| baseline | 40.6G | 4.4566 | 57.13 | 0.6732 |
To some extent, introducing the multi-reference injection can alleviate the performance drop while maintaining a modest memory footprint. However, similar to existing SD-based restoration works, the inherent tradeoff between a whole-image receptive field and low GPU cost still exists. Our framework may benefit from future acceleration works.
[Q6: Computation mode for retrieval]
It would be helpful to clarify whether the features used for matching in the reference image retrieval process are pre-computed or computed online.
Since the retrieval database is agnostic to the specific LR inputs, we pre-compute the retrieval vectors for fast inference. We will make this clearer in the revision.
[Q7: Converting reference images into internal knowledge]
Because backpropagation through an LRM requires huge GPU resources, e.g. 64 Nvidia A6000 GPUs for training SUPIR, we are not able to practically convert reference images into internal knowledge through full fine-tuning.
However, we point out that the idea of injecting downstream knowledge through fine-tuning has been explored in the image generation community, such as in DreamBooth. During the subject-driven adaptation of DreamBooth, its authors also encountered the diffusion model overfitting to specific subjects, and they proposed the prior preservation loss to solve this problem.
As for the image restoration, it is difficult to get calibration datasets for prior preservation since the content of LR images can be various, unlike the similar topic in image generation. Moreover, in real-world applications, it is also difficult to obtain scene-specific data for training in advance, and even if we get it, it requires additional training time. In contrast, our ReFIR framework does not need the scene-specific data in advance and can adapt to specific scenes in a training-free manner.
I thank the authors for their efforts in the rebuttal. It addresses several of my concerns.
Thank you very much for your positive feedback. We are delighted that our responses have addressed your concerns. We will further improve our work in revision based on the reviewers' comments and discussions.
The paper titled "ReFIR: Grounding Large Restoration Models with Retrieval Augmentation" introduces a novel framework called Retrieval-augmented Framework for Image Restoration (ReFIR). This method addresses a significant issue in diffusion-based Large Restoration Models (LRMs) — the tendency to generate hallucinatory outputs when faced with heavily degraded images, akin to issues faced by large language models. ReFIR mitigates this by incorporating external high-quality, content-relevant images retrieved via a nearest neighbor lookup in the semantic embedding space. These images serve as a source of external knowledge, enabling the restoration model to produce more accurate and faithful details. The framework modifies the self-attention layers of existing LRMs to integrate textures from the retrieved images, significantly enhancing the model's ability to restore images without any additional training required. Extensive experiments demonstrate that ReFIR achieves realistic and high-fidelity restoration results, confirming its effectiveness across various existing LRMs. This training-free, adaptable approach significantly expands the knowledge boundary of LRMs, offering a promising solution to their inherent limitations.
Strengths
- The paper is well-organized and articulates complex concepts with clarity, making it accessible to both experts and those new to the field. The significance of ReFIR lies in its potential to profoundly impact practical applications involving image restoration, such as in digital forensics, media restoration, and medical imaging, where accuracy and fidelity are paramount. This approach offers a scalable and adaptable solution that can be applied across various existing models without the need for retraining, highlighting its utility in improving the practical deployment of deep learning models for image restoration.
- The ReFIR framework introduces a unique method of integrating retrieval-augmented techniques into image restoration models, addressing the hallucination dilemma often faced by large restoration models. By borrowing the concept of retrieval-augmented generation from natural language processing and adapting it for image processing, ReFIR creatively leverages external image databases to enhance the detail and fidelity of restored images, significantly expanding the utility of existing large-scale models.
- The implementation of ReFIR demonstrates robust quality through extensive testing and innovative adaptations. The framework modifies existing large restoration models by integrating high-quality, content-relevant images during the restoration process. This method has been shown to significantly enhance the performance of these models, as validated by comprehensive experiments that not only improve quantitative metrics like PSNR and SSIM but also yield visually more accurate restorations.
Weaknesses
- The effectiveness of ReFIR is significantly dependent on the quality and relevance of the images retrieved from external databases. This reliance could limit the framework's effectiveness in scenarios where highly relevant and high-quality reference images are scarce or unavailable. Addressing this, the authors could explore mechanisms to assess and ensure the relevance and quality of retrieved images dynamically during the restoration process or develop fallback strategies when suitable images are not found.
- While the paper mentions that the framework does not require additional training, it does not thoroughly address the potential increases in computational overhead and latency introduced by the retrieval process and the modification of self-attention layers. This could be particularly challenging when deploying in real-time or resource-constrained environments. Future iterations of the framework could benefit from a detailed analysis of computational costs and potential optimizations to streamline the retrieval and integration processes.
- The experimental setup primarily demonstrates the effectiveness of ReFIR under controlled conditions with specific types of image degradations. To fully validate the robustness and generalizability of the framework, additional testing across a broader spectrum of real-world scenarios and more diverse degradation types would be beneficial. This would help in understanding the limitations and operational range of ReFIR in practical applications, ensuring it can effectively handle unexpected or uncommon degradation patterns.
Questions
- Given that the paper primarily demonstrates ReFIR's performance on specific types of image degradations, can the authors clarify how well the framework performs across a broader spectrum of degradation scenarios, including less common or more complex types? Further elaboration on its performance in varied real-world conditions could significantly clarify the framework’s versatility and robustness.
- Can the authors provide more details on the computational efficiency of the retrieval process integrated within the ReFIR framework? Specifically, what are the impacts on processing time and computational resources when applying ReFIR to large-scale restoration models, and are there strategies in place to optimize this process in real-time applications?
- In cases where the retrieval process fails to find sufficiently relevant or high-quality images, what strategies does ReFIR employ to ensure the quality of the restoration output? A detailed explanation of fallback mechanisms or alternative approaches when ideal reference images are not available would be helpful in assessing the framework's reliability and functionality in less-than-ideal conditions.
Limitations
Since ReFIR relies on external datasets for retrieving reference images, there is a potential for bias if those datasets are not diverse or inclusive of various demographic groups and scenarios. This could inadvertently lead to biases in restored images, particularly in sensitive applications. The authors should discuss the measures taken to ensure the diversity of the datasets and address potential biases in the image retrieval and restoration process.
[Q1: Reliance on the image quality and relevance]
The effectiveness of ReFIR is significantly dependent on the quality and relevance of the images retrieved from external databases. This reliance could limit the framework's effectiveness in scenarios where highly relevant and high-quality reference images are scarce or unavailable. Addressing this, the authors could explore mechanisms to assess and ensure the relevance and quality of retrieved images dynamically during the restoration process or develop fallback strategies when suitable images are not found.
We understand your concerns, and we have thus conducted a thorough experiment presented in the global author rebuttal part.
With these newly developed fallback and filtering strategies, our ReFIR is further improved under real-world retrieval settings, reducing its dependence on the retrieval system.
[Q2: Computational overhead from retrieval and attention modification]
While the paper mentions that the framework does not require additional training, it does not thoroughly address the potential increases in computational overhead and latency introduced by the retrieval process and the modification of self-attention layers. This could be particularly challenging when deploying in real-time or resource-constrained environments. Future iterations of the framework could benefit from a detailed analysis of computational costs and potential optimizations to streamline the retrieval and integration processes.
In order to reduce the computational overhead of the retrieval process, we pre-compute the feature vectors of all images in the retrieval database before inference. Furthermore, the cosine similarity between the LR image vector and all retrieval vectors is computed in parallel. These strategies result in an almost negligible computational overhead (less than 3% of inference time).
The modification of self-attention layers only happens in the last 20 timesteps in the decoder layers, i.e., only 12% of attention layers are modified while the rest are kept intact.
This analysis is also supported in practice: we find that these two processes together take up <5% of inference time, with most of the computational cost coming from the original LRM. Future LRM acceleration (e.g. pruning, quantization) will benefit our ReFIR, and we will explore more efficient implementations in the future.
[Q3: Testing on broader degradations]
The experimental setup primarily demonstrates the effectiveness of ReFIR under controlled conditions with specific types of image degradations. To fully validate the robustness and generalizability of the framework, additional testing across a broader spectrum of real-world scenarios and more diverse degradation types would be beneficial. This would help in understanding the limitations and operational range of ReFIR in practical applications, ensuring it can effectively handle unexpected or uncommon degradation patterns.
As suggested, we test the robustness of our ReFIR framework on two additional challenging real-world mixed degradations, i.e., low-resolution4+noise=50, as well as low-resolution+JPEG. The results are as follows.
TableA low-resolution4+noise=30 on the WR-SR dataset
| Metric | SeeSR | SeeSR+ReFIR |
|---|---|---|
| PSNR | 23.06 | 23.16 |
| SSIM | 0.6273 | 0.6298 |
| LPIPS | 0.2710 | 0.2698 |
| NIQE | 3.8580 | 3.7266 |
| FID | 44.10 | 42.78 |
TableB low-resolution+JPEG on the WR-SR dataset
| Metric | SeeSR | SeeSR+ReFIR |
|---|---|---|
| PSNR | 23.84 | 23.91 |
| SSIM | 0.6679 | 0.6686 |
| LPIPS | 0.2291 | 0.2287 |
| NIQE | 3.8723 | 3.8148 |
| FID | 36.74 | 35.56 |
It can be seen that our ReFIR framework maintains its effectiveness on real-world hybrid degradations, demonstrating its robustness and generalizability.
[Q4: Possible bias from retrieval database]
We agree with you. Currently, we only use publicly available, high-quality academic data as the retrieval database. In the future, we will pay attention to content diversity and potential bias in constructing larger scale retrieved data, e.g., we could use VLMs or LLMs for automatic filtering followed by human review.
Global Author Rebuttal
[1. Remarks by authors]
We would like to express our sincere gratitude to all the reviewers for taking their time reviewing our work and providing fruitful reviews that have definitely improved the paper. We are encouraged that they find our method
- "offers a scalable and adaptable solution that can be applied across various existing models without the need for retraining" (Reviewer NDC8, Reviewer UnRT, Reviewer qg3y)
- "addressing the hallucination dilemma often faced by large restoration models" (Reviewer NDC8)
- "creatively leverages external image databases to enhance the detail and fidelity of restored image" (Reviewer NDC8)
- "not only improves quantitative metrics like PSNR and SSIM but also yields visually more accurate restorations" (Reviewer NDC8, Reviewer SJc6, Reviewer UnRT)
- "The experiments are solid" (Reviewer UnRT)
During this rebuttal period, we have tried our best and made a detailed response to address the concerns raised by reviewers. If you have any further questions, we will actively discuss with you during the author-reviewer discussion period.
[2. Further exploration of retrieval in the wild]
We have noticed that Reviewer NDC8, Reviewer SJc6, and Reviewer UnRT raise concerns about the performance of our ReFIR when highly relevant and high-quality reference images are scarce or unavailable. Here, we first discuss the mechanism that already exists in ReFIR for mitigating the low-correlation problem. After that, we explore several new techniques to further improve the model's ability in this extreme situation.
2.1 Spatial Adaptive Gating for relevance filtering
In Sec. 4.2, we introduce the Spatial Adaptive Gating (SAG) to resolve the spatial misalignment between LR images and Ref images. Since the mask in Eq. 2 contains pixel-wise similarities between LR and Ref, we point out that this similarity-aware mask can filter out weakly correlated pixels from the reference image, thus improving robustness in the wild. The visualization of this similarity mask is shown in Fig. 13 of the Appendix.
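To illustrate how a similarity-aware mask can suppress weakly correlated reference content, the following is a minimal sketch only. It assumes the mask is the best cosine similarity between each LR feature and any reference feature, clipped at zero, and that it blends a self-attention output with a cross-image output. This is not the exact form of Eq. 2 in the paper, and the names `similarity_mask` and `gated_fuse` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_mask(lr_feats, ref_feats):
    """Assumed mask: for each LR position, the highest cosine similarity
    to any reference position, with negative values clipped to 0."""
    return [max(0.0, max(cosine(q, k) for k in ref_feats)) for q in lr_feats]


def gated_fuse(self_out, cross_out, mask):
    """Blend per position: a low mask value keeps the LR-only output,
    a high mask value lets the reference-derived output through."""
    return [
        [(1.0 - m) * s + m * c for s, c in zip(s_vec, c_vec)]
        for s_vec, c_vec, m in zip(self_out, cross_out, mask)
    ]


# Toy example: the first LR position matches the reference, the second does not.
lr_feats = [[1.0, 0.0], [0.0, 1.0]]
ref_feats = [[1.0, 0.0]]
mask = similarity_mask(lr_feats, ref_feats)
fused = gated_fuse([[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]], mask)
```

In this toy case the unmatched position receives a mask value of zero, so no reference content is injected there, which is the filtering behavior described above.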
2.2 Exploration for Further Improvements
Furthermore, as suggested by reviewers, we develop several new techniques for further improvement.
We first develop fallback strategies to handle the situation where reference images are not available. This includes two possible choices:
- Since our method does not modify the parameters of LRMs, we can directly use the original inference pipeline of the LRM without using reference images. We denote this as origin_lrm.
- We use the BLIP model to caption the LR image to obtain a text prompt, which is then fed into the StableDiffusion2.0 model to generate semantically similar, high-quality images as the reference. We denote this as gen_ref.
We then develop the adaptive filtering policy to assess and ensure the relevance and quality of retrieved images. This includes three alternatives:
- We set the cosine similarity threshold, and use the retrieved image as the reference only if the similarity is greater than the threshold, otherwise use the image provided by the above fallback strategy. We denote this as , and set as 0.6 through ablation.
- We use the non-reference image quality assessment metric CLIPIQA score as a threshold, and use the retrieved image only when its quality is greater than the threshold, otherwise the fallback strategy is used. We denote it as , and set =0.6246 obtained from the mean CLIPIQA of the retrieval database.
- Task-oriented adaptive filtering. We respectively use the retrieved image and the fallback strategy to generate the result. And then we select the one with a larger task score as the final result. We denote it as
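The three filtering policies can be sketched as simple selection rules. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Candidate` container, the function names, and the `restore` and `score` callables are hypothetical, and `None` is used here to stand for the origin_lrm fallback (no reference). Only the two threshold values are taken from the text above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

SIM_THRESHOLD = 0.6         # cosine similarity threshold stated in the rebuttal
QUALITY_THRESHOLD = 0.6246  # mean CLIPIQA of the retrieval database, per the rebuttal


@dataclass
class Candidate:
    """Hypothetical container for a retrieved reference image and its scores."""
    image: Any
    similarity: float  # cosine similarity to the LR query
    quality: float     # no-reference quality score, e.g. CLIPIQA


def select_by_similarity(retrieved: Candidate, fallback: Optional[Any],
                         threshold: float = SIM_THRESHOLD) -> Optional[Any]:
    """Use the retrieved image only if it is similar enough, else the fallback."""
    return retrieved.image if retrieved.similarity > threshold else fallback


def select_by_quality(retrieved: Candidate, fallback: Optional[Any],
                      threshold: float = QUALITY_THRESHOLD) -> Optional[Any]:
    """Use the retrieved image only if its quality is high enough, else the fallback."""
    return retrieved.image if retrieved.quality > threshold else fallback


def select_by_task_score(restore: Callable[[Any, Optional[Any]], Any],
                         score: Callable[[Any], float],
                         lr_image: Any, retrieved_image: Any,
                         fallback: Optional[Any]) -> Any:
    """Restore twice (retrieved reference vs. fallback) and keep the output
    with the larger task score. This doubles the restoration cost."""
    out_retrieved = restore(lr_image, retrieved_image)
    out_fallback = restore(lr_image, fallback)
    return out_retrieved if score(out_retrieved) >= score(out_fallback) else out_fallback


# Toy usage with stand-in values instead of real images and models.
cand = Candidate(image="ref", similarity=0.7, quality=0.5)
chosen_sim = select_by_similarity(cand, None)       # similarity passes the threshold
chosen_quality = select_by_quality(cand, "gen")     # quality fails, fallback is used
```

The first two rules act on the reference before restoration, so they add almost no cost, whereas the task-oriented rule acts on the outputs and therefore needs two restoration passes per image.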
Due to limited rebuttal time, we use the SeeSR large restoration model as a representative and evaluate on the real-world degradation dataset RealPhoto60.
We first give the results in which all LR images adopt the fallback strategies in Table A.
Table A: Results of all images using fallback strategies
| setup | NIQE | MUSIQ | CLIPIQA |
|---|---|---|---|
| origin_lrm | 4.7432 | 55.54 | 0.6575 |
| gen_ref | 4.6923 | 55.98 | 0.6602 |
| ReFIR | 4.4986 | 57.01 | 0.6759 |
It can be seen that using SD2.0-generated images as the fallback reference brings a slight improvement over using no reference. However, it is still inferior to the results using the retrieval database, which we argue is due to the knowledge overlap between the generated images and the LRMs.
Then we combine the two fallback strategies and three filtering policies to obtain a more robust RAG system for real-world scenarios, as shown in Table B.
Table B: Experimental results of combining different fallback and adaptive filtering strategies
| Metrics | origin_lrm+ | origin_lrm+ | origin_lrm+ | gen_ref+ | gen_ref+ | gen_ref+ | ReFIR-baseline |
|---|---|---|---|---|---|---|---|
| NIQE | 4.4982 | 4.4943 | 4.3891 | 4.4978 | 4.4923 | 4.3464 | 4.4986 |
| MUSIQ | 56.78 | 57.12 | 57.77 | 56.95 | 57.11 | 57.68 | 57.01 |
| CLIPIQA | 0.6730 | 0.6745 | 0.6898 | 0.6743 | 0.6770 | 0.6942 | 0.6759 |
It can be seen that using the SD-generated image as the reference is better than not using a reference image, consistent with the results in Table A. In addition, the task-oriented filtering strategy achieves a significant performance improvement because it operates on the output end, but it comes with a longer inference time. Using the SD-generated image as the fallback together with quality-based filtering acts as a competitive alternative to our previous baseline.
The final ratings of this paper are Weak Accept, Weak Accept, Borderline Accept, and Borderline Accept. The paper introduces a training-free method to enhance Large Restoration Models (LRMs) by integrating high-quality, content-relevant images retrieved from external databases. This approach addresses the hallucination problem in diffusion-based LRMs by incorporating textures from retrieved images into the self-attention layers, which significantly improves restoration accuracy without additional training. Extensive experiments confirm ReFIR’s effectiveness across various LRMs, making it a promising solution for practical applications. However, the method’s reliance on the quality of retrieved images and potential computational overhead are noted as limitations, with further exploration needed into scalability, efficiency, and robustness. The authors’ rebuttal has addressed the majority of concerns from reviewers, resulting in an increase in rating.