Parameter Efficient Adaptation for Image Restoration with Heterogeneous Mixture-of-Experts
This paper proposes the first parameter-efficient adaptation framework for image restoration, which tunes only 0.6% of the parameters to adapt pre-trained restoration models to various tasks.
Abstract
Reviews and Discussion
This paper introduces the mixture-of-experts approach to the image restoration community, enabling rapid adaptation of pre-trained models to various image restoration tasks. The proposed AdaptIR framework comprises three parallel branches: local interaction, channel gating, and frequency affine modules, which together extract heterogeneous representations. AdaptIR consistently achieves performance improvements on both single-degradation and hybrid-degradation tasks across two baseline models.
Strengths
- This paper introduces the parameter-efficient transfer learning paradigm to low-level vision tasks, designing the AdaptIR module, which inserts a few trainable parameters into frozen pre-trained restoration backbones.
- The proposed AdaptIR adapts the pre-trained model with heterogeneous representations across tasks by applying three parallel branches, excelling in local spatial, global spatial, and channel representations.
- Consistent improvements across various image restoration tasks demonstrate the effectiveness and robustness of AdaptIR.
Weaknesses
- The proposed AdaptIR combines three experts in local spatial, global spatial, and channel representations and adaptively weights them for adaptation to downstream tasks. The authors should analyze the distribution of feature response intensity of these three branches across various tasks. This analysis is crucial for evaluating the adaptability and flexibility of the proposed approach on varying tasks.
- Experiments demonstrate adaptability on image restoration tasks consistent with the pre-training stage. The authors are encouraged to conduct generalization experiments on new tasks and on new degradation levels within the same task to further validate their approach.
- The authors claim that the fine-tuning step on downstream tasks requires 500 epochs of optimization. This is excessive for a fine-tuning strategy. If so, can the proposed method still be considered efficient transfer learning?
- AdaptIR requires heavy fine-tuning on the downstream task. However, prompt-based approaches [1,2,3,4] use various task-specific prompts for task generalization without a fine-tuning stage. Please discuss the advantages of AdaptIR compared to these prompt-based approaches.
[1] ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration. arXiv, 2023.
[2] PromptIR: Prompting for All-in-One Blind Image Restoration. arXiv, 2023.
[3] PromptRestorer: A Prompting Image Restoration Method with Degradation Perception. NeurIPS, 2023.
[4] Unifying Image Processing as Visual Prompting Question Answering. ICML, 2024.
Questions
Please see Weaknesses.
Limitations
The authors discuss the limitations and broader impact in the Appendix.
[Q1: Analysis of feature response intensity]
The proposed AdaptIR combines three experts in local spatial, global spatial, and channel representations and adaptively weights them for adaptation to downstream tasks. The authors should analyze the distribution of feature response intensity of these three branches across various tasks. This analysis is crucial for evaluating the adaptability and flexibility of the proposed approach on varying tasks.
As suggested, we provide the distribution of feature response intensity of the three branches across various tasks, including SR, heavy deraining, light deraining, low-light image enhancement, and two hybrid degradations. Since OpenReview does not support posting figures, they are given in Fig. C of the attached rebuttal PDF.
These figures indicate that our AdaptIR can adjust to different degradation types by enhancing or suppressing the outputs of different branches. Specifically, for the heavy and light deraining tasks, AdaptIR adaptively learns to enhance low-frequency global features, i.e., the frequency affine module, which is responsible for global spatial modeling, produces large values. This property ensures the removal of high-frequency rain streaks as well as the preservation of the global structure of the image. For SR tasks, AdaptIR adaptively enhances the restoration of local texture details by learning large output values in the local spatial modules. For the hybrid degradation task, AdaptIR shows it can distinguish between different hybrid degradations, i.e., the three branches exhibit different patterns under the two types of hybrid degradation.
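The low- vs. high-frequency response analysis above can be illustrated with a small spectral-energy probe. This is a toy sketch, not the authors' code: the function name `band_energy` and the `cutoff` fraction are illustrative choices.

```python
import numpy as np

def band_energy(feat, cutoff=0.25):
    """Split a 2-D feature map's spectral energy into low/high bands.

    `cutoff` is the fraction of the centered spectrum radius treated
    as "low frequency" (an illustrative choice, not a paper setting).
    """
    spec = np.fft.fftshift(np.fft.fft2(feat))
    h, w = feat.shape
    yy, xx = np.mgrid[:h, :w]
    cy, cx = h // 2, w // 2
    radius = np.hypot(yy - cy, xx - cx)
    low_mask = radius <= cutoff * min(h, w) / 2
    energy = np.abs(spec) ** 2
    return energy[low_mask].sum(), energy[~low_mask].sum()

# A smooth (low-frequency) map should put most energy in the low band,
# the kind of output one would expect from a global/low-pass branch.
smooth = np.outer(np.hanning(32), np.hanning(32))
low, high = band_energy(smooth)
assert low > high
```

Applying such a probe to each branch's output per task would yield the kind of per-branch response distribution discussed above.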
In short, each branch of AdaptIR captures discriminative features under different degradations, indicating that our approach is degradation-aware. This ability guarantees robustness on single degradations and superior performance under hybrid degradations.
[Q2: Generalization experiments on new tasks]
Experiments demonstrate adaptability on image restoration tasks consistent with the pre-training stage. The authors are encouraged to conduct generalization experiments on new tasks and on new degradation levels within the same task to further validate their approach.
Thanks for your advice. In fact, the original pre-trained models are trained only on SR, denoising, and light deraining. Therefore, the other tasks in this paper, including hybrid degradations, heavy deraining, and low-light enhancement, already demonstrate to some extent the robustness of our method on unseen degradation levels or types.
Moreover, as suggested, we provide an additional experiment on the unseen real-world denoising task to further demonstrate the generalization of AdaptIR.
Table A: Real-world denoising on the SIDD dataset.
| Methods | #param | PSNR |
|---|---|---|
| AdaptFormer | 677K | 39.03 |
| LoRA | 995K | 38.97 |
| Adapter | 691K | 39.00 |
| FacT | 537K | 39.02 |
| MoE | 667K | 39.05 |
| Ours | 697K | 39.10 |
It can be seen that our AdaptIR maintains its superiority when transferring to real-world degradation, demonstrating the robustness of our method.
[Q3: Reclaim of the Efficiency]
The authors claim that the fine-tuning step on downstream tasks requires 500 epochs of optimization. This is excessive for a fine-tuning strategy. If so, can the proposed method still be considered efficient transfer learning?
We apologize for the confusion. Existing restoration works usually adopt a dataset enlargement strategy that repeats the training data to increase the number of samples per epoch. Under a fair per-sample comparison, the standard training paradigm therefore requires 10x or even 100x more epochs than ours. This analysis is also supported by actual training time: the standard paradigm takes a long time to converge, e.g., more than 2 days on 8×3090 GPUs to train the SR model and up to one week on 8×3090 GPUs for denoising, in addition to the large cost of training all-in-one models reported in Table 2. In contrast, our approach requires less than 8 hours on a single 3080Ti GPU to adapt the model to unseen degradation levels or even types.
[Q4: Discussion with prompt-based methods]
AdaptIR requires heavy fine-tuning on the downstream task. However, prompt-based approaches use various task-specific prompts for task generalization without a fine-tuning stage. Please discuss the advantages of AdaptIR compared to these prompt-based approaches.
To the best of our knowledge, only PromptGIP shows zero-shot ability when facing unseen degradations. ProRes, PromptIR, and PromptRestorer can only handle degradations seen during training, which means they still need additional fine-tuning for task generalization.
The advantages of AdaptIR over these prompt-based approaches are twofold. In terms of efficiency, PromptIR, ProRes, and PromptRestorer all need full fine-tuning to adapt to new tasks, e.g., PromptIR needs 7 days on 8×3090 GPUs for full fine-tuning, while AdaptIR needs only 8 hours on a single 3090 GPU. In terms of performance, since these methods must learn multiple degradations within one model, they inevitably suffer from negative transfer, which impairs performance. A thorough comparison follows.
| Methods | Type | Fast adaptation to unseen tasks | Adaptation cost | PSNR on denoising | PSNR on deraining |
|---|---|---|---|---|---|
| PromptGIP | Prompt-based | Yes | zero-shot | 26.22 | 25.46 |
| ProRes | Prompt-based | No | 8x3090GPUs | Not open-source | Not open-source |
| PromptIR | Prompt-based | No | 7-days 8x3090GPUs | 29.39 | 37.04 |
| PromptRestorer | Prompt-based | No | 8x3090GPUs | Not open-source | Not open-source |
| Ours | PETL-based | Yes | 8h on 1×3090 | 29.70 | 37.81 |
Thanks to the authors for their detailed responses. After reviewing the other reviews and the replies provided, the authors have addressed some of my concerns about the feature response distribution, generalization, and fine-tuning costs. However, I suggest the authors add a section discussing existing prompt-based algorithms in the revised version. I have raised my final rating to borderline accept.
Thank you very much for your positive feedback. We are delighted that our responses have addressed your concerns.
We will further revise our work based on the reviewers' comments and the discussion phase, and will add a section discussing existing prompt-based algorithms in the revision as suggested.
We promise to open-source all code and checkpoints for reproducibility.
The paper presents a novel approach to image restoration, which leverages a heterogeneous Mixture-of-Experts (MoE) architecture. The proposed method aims to address the limitations of existing PETL (Parameter-Efficient Transfer Learning) techniques for image restoration. The key contributions of the paper are: 1) a heterogeneous MoE framework that combines multiple specialized sub-models to enable more robust and effective image representation learning for restoration tasks; 2) a detailed design of the MoE architecture, including the individual module components and their synergistic collaboration to achieve heterogeneous image modeling.
Strengths
The key strengths of the paper are: 1) A heterogeneous MoE framework that combines multiple specialized sub-models to enable more robust and effective image representation learning for restoration tasks. 2) A detailed design of the MoE architecture, including the individual module components and their synergistic collaboration to achieve the heterogeneous image modeling.
Weaknesses
1. Lacks a comprehensive comparison with other SOTA all-in-one methods such as [1] and [2].
[1] Ingredient-Oriented Multi-Degradation Learning for Image Restoration. CVPR, 2023. [2] Towards Efficient and Scalable All-in-One Image Restoration. arXiv, 2023.
2. In terms of Fig. 9 and Fig. 10, the visual differences among Adapter, LoRA, and Ours are not obvious.
3. It is expected to present the testing time rather than the training time to evaluate effectiveness.
Questions
See the Weaknesses.
Limitations
yes
[Q1-Comparison with other SOTA all-in-one methods]
Lacks comprehensive comparison with other SOTA all-in-one methods like IDR (CVPR 2023) and DyNet (arXiv).
Thanks for your kind advice; the suggested comparisons are as follows.
Table A: Effectiveness and efficiency comparison on light deraining.
| Methods | dataset | #param | Training time | GPU memory | PSNR | SSIM |
|---|---|---|---|---|---|---|
| AirNet | Rain100L | 8.75M | ~48h | ~11G | 34.90 | 0.977 |
| PromptIR | Rain100L | 97M | ~84h | ~128G | 37.04 | 0.979 |
| IDR | Rain100L | 15M | ~72h | ~23G | 37.64 | 0.979 |
| Dynet | Rain100L | 16M | ~72h | ~24G | 37.80 | 0.981 |
| Ours | Rain100L | 697K | ~8h | ~8G | 37.81 | 0.981 |
Table B: Effectiveness and efficiency comparison on denoising.
| Methods | dataset | #param | Training time | GPU memory | PSNR | SSIM |
|---|---|---|---|---|---|---|
| AirNet | Urban100 | 8.75M | ~48h | ~11G | 28.88 | 0.871 |
| PromptIR | Urban100 | 97M | ~84h | ~128G | 29.39 | 0.881 |
| IDR | Urban100 | 15M | ~72h | ~23G | 29.38 | 0.878 |
| Dynet | Urban100 | 16M | ~72h | ~24G | 29.52 | 0.881 |
| Ours | Urban100 | 697K | ~8h | ~8G | 29.70 | 0.881 |
From the comparison with recent SOTA all-in-one methods, it can be seen that our PETL-based paradigm achieves better performance than the existing all-in-one paradigm while costing less training time, GPU memory, and storage. These experiments demonstrate that developing a PETL-based framework for image restoration is promising.
[Q2-Visual comparison with other baselines]
In terms of Fig. 9 and Fig. 10, the visual comparison is not obvious among Adapter, LoRA and Ours.
We are sorry for the confusing presentation. Because the selected image region is not cropped tightly enough, a straightforward visual comparison may be difficult; you may zoom in on the results in Fig. 9 and Fig. 10. For example, in the second figure of Fig. 10, previous methods produce blurred edges of the character 'ink', while ours obtains sharp character edges with less noise. The quantitative PSNR comparison also supports this observation. We will revise the figure in the revision.
[Q3-Testing time for effectiveness comparison]
It is expected to present the testing time rather than the training time to evaluate effectiveness.
As suggested, we compare the testing time of the existing all-in-one methods AirNet, PromptIR, IDR, and DyNet with the proposed AdaptIR. The experiments are conducted on 3090 GPUs, and we report the testing time of different methods on the Urban100 dataset.
| Methods | #param | PSNR on denoising (dB) | Testing time (s/img) |
|---|---|---|---|
| AirNet | 8.75M | 28.88 | 0.53 |
| PromptIR | 97M | 29.39 | 1.33 |
| IDR | 15M | 29.38 | 0.86 |
| Dynet | 16M | 29.52 | 0.76 |
| Ours | 697K | 29.70 | 0.72 |
From the above results, it can be seen that our AdaptIR achieves the best performance with moderate testing latency. Note that the latency of the PETL-based paradigm mainly comes from the pre-trained model, e.g., 99.2% of AdaptIR's latency comes from the pre-trained model. Therefore, inference speed can potentially be improved further with smaller future pre-trained models.
This work introduces AdaptIR, a novel heterogeneous Mixture-of-Experts (MoE) structure, to adapt pre-trained restoration models to various downstream tasks. The proposed method achieves performance comparable to full fine-tuning while training only 0.6% of the parameters within 8 hours, demonstrating high efficiency. Extensive experiments, including on hybrid degradation and various single degradations, validate the effectiveness of the proposed method compared to current PETL methods.
Strengths
- Introducing PETL into image restoration is both interesting and promising, potentially serving as a competitive alternative to existing all-in-one image restoration solutions. Furthermore, Table 2 demonstrates the superiority of the proposed PETL paradigm over the all-in-one approach in terms of performance (PSNR, SSIM) and efficiency (parameters, time, GPU memory).
- The motivation and the specific techniques are well communicated. The authors identify that directly applying current PETL methods like LoRA to IR can result in unstable performance, and then attribute this problem to homogeneous frequency representations. Based on this, they propose a heterogeneous MoE structure, with each branch capturing orthogonal bases.
- The experiments are solid: the authors verify the proposed method across multiple single-degradation tasks (image super-resolution, denoising, low-light enhancement, deraining) and demonstrate strong performance on hybrid degradation tasks.
- The proposed method achieves state-of-the-art performance, showing significant improvements over previous PETL methods.
- The paper is well-written and easy to follow.
Weaknesses
- Both the proposed AdaptIR method and LoRA use low-rank matrices for efficiency, clarifying the differences between these two methods would be beneficial.
- The proposed method has not been tested on real-world degradation scenarios.
Questions
- What is the computational cost of the proposed method compared with other parameter-efficient methods?
- The reported results are mainly conducted on the synthetic degradation, how does AdaptIR perform on real-world degradation scenarios?
Limitations
See the Weaknesses above.
[Q1: Differences on low-rank strategy]
Both the proposed AdaptIR method and LoRA use low-rank matrices for efficiency, clarifying the differences between these two methods would be beneficial.
In fact, the low-rank strategy is common practice not just in LoRA but across existing PETL methods, e.g., Adapter, FacT, and AdaptFormer. The main difference lies in how the low-dimensional features are processed: our method uses a multi-branch structure to adapt the MLP in the transformer block, instead of the single-branch LoRA placed in self-attention.
[Q2: Performance on real-world degradation]
The reported results are mainly conducted on the synthetic degradation, how does AdaptIR perform on real-world degradation scenarios?
We conduct experiments on real-world denoising tasks with SIDD datasets. The results are as follows:
| Methods | LoRA | Adapter | AdaptFormer | FacT | MoE | Ours |
|---|---|---|---|---|---|---|
| #param | 995K | 691K | 677K | 537K | 667K | 697K |
| PSNR | 38.97 | 39.00 | 39.03 | 39.02 | 39.05 | 39.10 |
It can be seen that our AdaptIR maintains its superiority when transferring to real-world degradation, demonstrating the robustness of our method.
[Q3: Computational cost]
What is the computational cost of the proposed method compared with other parameter-efficient methods?
Since AdaptIR trains only about 700K parameters, it costs only about 8 GB of GPU memory with 8 hours of training. We control the parameter counts of the different methods to be roughly similar and compare their performance; the experiments demonstrate the state-of-the-art performance of our method.
My concerns are well addressed! I have no further questions or suggestions!
The paper proposes a Parameter Efficient Transfer Learning (PETL) method for image restoration, which utilizes local, global, and channel-related modules and adaptively combines them to obtain heterogeneous representation for different degradations. Experiments are conducted on multiple degradations and the results demonstrate the effectiveness compared to other PETL methods.
优点
- Exploring the use of PETL to enhance the image restoration performance is worthwhile.
- The experiments in the paper are quite comprehensive.
缺点
- A more detailed and precise definition of Heterogeneous Representation is needed. The differences in Figure 2 are results but not the underlying causes of the problem. Moreover, why are the three proposed modules in the paper considered to constitute Heterogeneous Representation?
- The local interaction module has the same structure as LoRA; one could say it is essentially LoRA. The other two modules are also commonly used in image restoration. Most importantly, the paper does not clearly explain why these three modules are combined.
- What is the rank of LoRA in Table 1? More details about the MoE structure should be presented in the paper.
- Why is only the single-task setting performance reported in Table 2? How do you determine which type of degradation appears in the image in the all-in-one setting?
- There are color marking errors in Table 1.
问题
Refer to Weaknesses.
局限性
None
[Q1-1: Precise definition of the Heterogeneous Representation]
A more detailed and precise definition of Heterogeneous Representation is needed.
The heterogeneous representation in this paper refers to learning discriminative features across different degradation types. The term 'representation' here is instantiated as the Fourier curves in Fig. 1 of the main paper.
Previous approaches tend to produce similar representations across various degradations. It is common knowledge that restoring different degradations requires different representations, e.g., SR needs a high-pass filtering network while denoising needs a low-pass one. As a result, when the representation needed by the current degradation matches the specific representation of an existing PETL method, that method works; when it does not, performance becomes unstable.
To demonstrate the generality of this problem (unstable performance and homogeneous representations under different degradations), we provide more evidence in Fig. A and Fig. B of the rebuttal PDF.
[Q1-2: The causal logic of our research line]
The differences in Figure 2 are results but not the underlying causes of the problem.
(Fig. 2 in the paper shows the technical pipeline, so we assume you mean the right part of Fig. 1.)
Motivated by the success of PETL, we extensively evaluated these methods on image restoration. Surprisingly, we found that they do not perform robustly on single-degradation tasks and all suffer performance drops on hybrid degradation.
This interesting observation motivated us to find the cause. We found that existing methods tend to learn similar representations across different degradation types, and we speculated that this may be the reason for the above problem.
To validate this speculation, we designed a novel MoE to learn heterogeneous representations, i.e., to learn different features for different degradations.
Finally, experiments show that learning different representations across degradation types indeed yields robust performance on single degradations and favorable performance on hybrid degradation, validating our speculation.
[Q1-3: Three modules for Heterogeneous Representation]
Moreover, why are the three proposed modules in the paper considered to constitute Heterogeneous Representation?
The three modules are specifically designed to learn local-spatial, global-spatial, and channel interactions, respectively, and they indeed work as expected, as verified in Sec. 3.2. Features modeled from these different perspectives constitute the heterogeneous representation, which is also recognized by Reviewers 62kc and EhxW.
[Q2-1: Differences between local interaction module (LIM) and LoRA]
The local interaction module has the same structure as LoRA; one could say it is essentially LoRA.
The technique is different. Notably, the original LoRA only decomposes the linear projection weight of shape (d_out, d_in), while our proposed LIM focuses on the convolution weight of shape (C_out, C_in, k, k), where handling the additional kernel dimensions is non-trivial.
The placement is different. The original LoRA is inserted in a serial manner, i.e., features go through the frozen linear weights and then through LoRA. By contrast, our AdaptIR is designed in a parallel fashion, which preserves the knowledge of the pre-trained model and is experimentally verified in Tab. 6.
The goal is different. LoRA is used for weight transformation, while the low-rank design in our LIM is used to capture local spatial interactions as part of the heterogeneous representation.
[Q2-2: Combination of three modules]
The other two modules are also commonly used in image restoration. Most importantly, the paper does not clearly explain why these three modules are combined.
We would like to clarify that our goal is to encourage heterogeneous representation learning, rather than to propose complex modules. Therefore, we adopt commonly used modules to build a simple baseline.
The outputs of the three modules are complementary to each other and are combined to learn different representations across different degradation types.
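As a rough illustration of such a parallel combination, the sketch below sums three toy operators standing in for the local, channel, and frequency branches. These are fixed placeholder operators and hand-picked weights, not the actual learned AdaptIR modules:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))  # H x W x C toy feature map

def local_branch(x):      # local spatial: 3x3 mean filtering per channel
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
    return sum(pad[i:i+8, j:j+8] for i in range(3) for j in range(3)) / 9.0

def channel_branch(x):    # channel gating: sigmoid of per-channel means
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=(0, 1))))
    return x * gate

def freq_branch(x):       # frequency affine: scale + shift in Fourier domain
    spec = np.fft.fft2(x, axes=(0, 1))
    return np.fft.ifft2(1.2 * spec + 0.1, axes=(0, 1)).real

# Combined output: weighted sum of the parallel branches
# (weights here are arbitrary; in AdaptIR they would be learned).
w = np.array([0.5, 0.3, 0.2])
y = w[0]*local_branch(x) + w[1]*channel_branch(x) + w[2]*freq_branch(x)
assert y.shape == x.shape
```

The point of the sketch is structural: each branch sees the same input, models a different interaction, and the outputs are fused additively, so enhancing or suppressing one branch changes which representation dominates.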
[Q3: Details about LoRA and MoE]
What is the rank of LoRA in Table 1? More details about the MoE structure should be presented in the paper.
The rank of LoRA is 32. The MoE baseline is a multi-branch bottleneck structure, implemented following [A][B][C]. We introduce this baseline to show that although MoE also uses a multi-branch structure, it remains sub-optimal when it does not learn heterogeneous representations.
[A] Zhu et al. Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs. NeurIPS, 2022.
[B] Riquelme et al. Scaling Vision with Sparse Mixture of Experts. NeurIPS, 2021.
[C] Kudugunta et al. Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference. arXiv preprint.
[Q4: Comparison with all-in-one methods]
Why is only the single-task setting performance reported in Table 2? How do you determine which type of degradation appears in the image in the all-in-one setting?
Actually, existing all-in-one methods, e.g., AirNet, PromptIR, IDR, and DyNet, also report single-task performance in their papers, where they use the same model structure but train only on single-degradation data; this single-task training performs better than all-in-one training for them.
We reported the single-task performance of the all-in-one methods in Tab. 2 for a fair comparison, since this performance is the upper bound of these methods. Notably, the performance of the all-in-one methods in Tab. 2 is the same as in their original papers, which use more training data than ours. We will include all-in-one-setting performance in the revision.
[Q5: Some typos]
We will revise them in revision. Thanks!
Thanks for the authors' response. After reading it, I still have some concerns, and some of the statements in the response are inconsistent with the facts.
- In the Abstract and Introduction, the authors claim that the method can "obtain heterogeneous representation for different degradations", while the experiments are all conducted under single-task settings, especially Table 2, which cannot validate the authors' motivation. If the comparison is made under the single-task setting, the authors should compare with methods like MPRNet and Restormer. However, the single-task setting is not the core experiment in this paper; instead, the experiments under the multi-task setting are crucial for validating the method's effectiveness. Therefore, the experimental setup in Table 2 is flawed, and it has nothing to do with the points mentioned in the authors' response, such as "also reported single-task performance in their papers," "fair comparison," or "upper bound." This is the main reason why I gave the negative rating and asked question 4.
- The form of LoRA is consistent with LIM when processing images, and this can be referenced in the implementation of LoRA in the diffusers library. Additionally, LoRA has a parallel relationship with the frozen weights, which is a fundamental concept, rather than the serial manner described by the authors. This makes me feel that LIM cannot be considered a contribution.
We thank the reviewer for the insightful comments, and we are happy to discuss the two related concerns.
@Q1: Multi-task performance of AdaptIR
We first summarize the current multi-task restoration paradigms. Let N denote the number of downstream tasks; existing paradigms fall into the following three categories:
- "N for N": training N task-specific models for N downstream tasks, such as Restormer and MPRNet.
- "1 for N": training 1 all-in-one model for N downstream tasks, such as PromptIR and AirNet.
- "(1+N) for N": using 1 set of task-shared pre-trained weights and N task-specific lightweight modules.
The main paper focuses on the third paradigm, since storing N task-specific AdaptIR modules is acceptable. This "(1+N) for N" setting requires us to evaluate performance on single-degradation tasks. Heterogeneous representation in this case is reflected in learning N different AdaptIR modules for N tasks using the same model structure.
However, we understand the reviewer's concern that the above cannot adequately show whether AdaptIR learns heterogeneous representations to handle different degradations with one model. In the main paper, we mainly verify this through Table 1, where hybrid degradation (applying multiple degradations to one image) is used; handling hybrid degradation can in part be interpreted as using one AdaptIR to handle multiple degradations.
Furthermore, as suggested, we explore the performance of AdaptIR in the "1 for N" paradigm, where one AdaptIR model handles images with N different single degradations. The experiments are as follows.
Following PromptIR and AirNet, we include light rain streak removal and denoising with noise levels 25 and 50, using BSD400 and WED as the denoising training datasets and RainTrainL as the light rain streak removal dataset. We use Rain100L for deraining testing and Urban100 for denoising testing, and train AdaptIR for 250 epochs.
| Method | Light Rain Streak Removal | Denoising (σ=25) | Denoising (σ=50) | Trainable param | GPU memory | Training time |
|---|---|---|---|---|---|---|
| AirNet | 34.90/0.967 | 31.90/0.914 | 28.68/0.861 | 8.75M | ~11G | ~48h |
| PromptIR | 36.37/0.972 | 32.09/0.919 | 28.99/0.871 | 97M | ~128G | ~84h |
| Ours | 41.27/0.9886 | 32.64/0.9263 | 29.16/0.8750 | 697K | ~8G | ~10h |
Although both our method and previous all-in-one methods suffer a performance drop under the multi-task setting compared to their single-task counterparts, the performance advantage of AdaptIR in the original Table 2 is preserved when migrating to the "1 for N" setup. This result demonstrates that AdaptIR can learn heterogeneous representations for multi-degradation restoration using one model.
We will improve Table 2 according to the above experiments in the revision, and release all code and checkpoints for reproducibility.
Thank you again for providing the opportunity to improve our work through suggested experiments!
@Q2: The difference between LoRA and LIM
We have checked the diffusers implementation of LoRA. Since we focus on decomposing conv weights, we assume you mean LoRACompatibleConv in diffusers. Actually, there is a slight difference between the two.
In LoRACompatibleConv in diffusers, the LoRA layer is implemented as an nn.Sequential containing self.w_up and self.w_down, both of which are nn.Conv2d; this means that the rank cannot be smaller than , where equality holds when the output dimension of self.w_down is 1. In contrast, our LIM treats the up and down weights as a whole: we first obtain the conv weights through a low-rank matmul and then reshape them to the conv weight shape to perform convolution. This allows the rank to be very flexible.
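The LIM factorization described above (low-rank matmul first, reshape into a conv kernel after) can be sketched as follows. The sizes and variable names are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

c_out, c_in, k, rank = 64, 64, 3, 4   # illustrative sizes

# Treat the full conv weight, flattened to (c_out, c_in*k*k), as one
# matrix and factor it with rank r: matmul first, reshape after.
U = np.random.randn(c_out, rank) * 0.01
V = np.random.randn(rank, c_in * k * k) * 0.01
W = (U @ V).reshape(c_out, c_in, k, k)  # reshaped into a conv kernel

# The rank of the flattened weight is bounded by r, independent of k,
# which is why this scheme leaves the rank choice fully flexible.
flat = W.reshape(c_out, -1)
assert np.linalg.matrix_rank(flat) <= rank
```

Because the factorization acts on the flattened weight rather than on two stacked Conv2d layers, any rank from 1 upward is admissible here.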
In the original rebuttal, "features will go through the frozen linear weights then LoRA" refers to a process similar to the diffusers code: "original_outputs + (scale * self.lora_layer(hidden_states))". We are sorry for the confusing expression. The parallel placement of AdaptIR means we place the AdaptIR module parallel to the frozen MLP of the transformer.
Regarding the contribution of LIM, we would ask the reviewer to also consider the other two modules (FAM, CGM). From a holistic perspective, the complementary relationship between LIM and the other two facilitates the heterogeneous representation.
Thanks to the authors for their effort in the response and additional experiments. I still want to know how your method trains and tests on multiple types of degradation data in a multi-task setting. Is it similar to how it's done in a single-task setting? For LoRA and LIM, I will consider the novelty of the other two modules. Please revise the paper to clarify these two parts. I will raise the score.
Thanks for your response to help us improve this work.
For training in the multi-task setting, we combine and randomly shuffle the training data of the single tasks, i.e., the multi-task training data is a combination of multiple single-task training sets. We also adjust the dataset enlargement ratio of each task so that the numbers of training samples from different task types are roughly similar, keeping the #param and model structure intact. Since the training data per epoch increased, we reduced the training epochs from 500 to 250 as mentioned in the author response above, so the training time is roughly the same as in the single-task setting. For testing, since one model has been trained to solve multiple tasks, we evaluate it on the different single-task datasets respectively, similar to AirNet and PromptIR.
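The mixing procedure described above could be sketched as follows; the function, dataset sizes, and rounding rule for the enlargement ratio are illustrative assumptions, not the authors' actual code:

```python
import random

def build_multitask_epoch(task_datasets, seed=0):
    """Repeat each task's data by an enlargement ratio so per-task
    sample counts are roughly balanced, then shuffle them together."""
    target = max(len(d) for d in task_datasets.values())
    epoch = []
    for task, samples in task_datasets.items():
        ratio = max(1, round(target / len(samples)))  # enlargement ratio
        epoch.extend((task, s) for s in samples * ratio)
    random.Random(seed).shuffle(epoch)
    return epoch

# Placeholder datasets standing in for image pairs.
datasets = {
    "derain": list(range(200)),   # e.g. RainTrainL pairs (placeholder)
    "denoise": list(range(600)),  # e.g. BSD400 + WED pairs (placeholder)
}
epoch = build_multitask_epoch(datasets)
counts = {t: sum(1 for task, _ in epoch if task == t) for t in datasets}
# Both tasks contribute roughly equally to each shuffled epoch.
```

With these placeholder sizes, the deraining set is repeated 3x so both tasks contribute 600 samples per epoch, matching the "roughly similar" balancing described above.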
Regarding the technical similarities between LIM and LoRA, as well as the novelty of the other two modules, we will provide a more detailed discussion and clarification in the revision.
Thanks again for your efforts and these helpful discussions!
[Global Author Rebuttal]
We would like to express our sincere gratitude to all the reviewers for taking the time to review our work and providing fruitful comments that have definitely improved the paper. We are encouraged that reviewers find:
- "exploring PETL for image restoration is meaningful" (Reviewer Y1LM, Reviewer EhxW)
- finding a new problem that "directly applying current PETL methods like LoRA to IR can result in unstable performance" due to the homogeneous representation (Reviewer EhxW)
- introducing "a heterogeneous MoE framework to enable more robust and effective image representation learning for restoration tasks" (Reviewer 62kc)
- "consistent improvements across various image restoration tasks demonstrate the effectiveness and robustness of AdaptIR" (Reviewer jwPV, Reviewer EhxW)
- "the experiments are quite comprehensive" (Reviewer Y1LM, Reviewer EhxW)
During the rebuttal period, we have tried our best to provide point-to-point responses to address the concerns raised by the reviewers. We also use additional figures, which can be found in the attached rebuttal PDF. If you have any further questions, we will be happy to discuss them during the author-reviewer discussion period.
Dear Reviewers,
This is a reminder that the Reviewer-Author discussion period is nearing its end. We encourage you to participate in the discussion and provide your valuable feedback to the authors in case you haven't already. Your comments are greatly appreciated and will contribute to the quality of the submission.
AC
This paper received positive scores of (5, 5, 5, 7). Reviewers generally found the exploration of PETL for restoration interesting and acknowledged the effectiveness of the proposed approach. After reviewing the paper, reviews, and rebuttal, the AC finds no reason to overturn the consensus to accept the paper.
However, the AC would like to address the presentation of training time for AdaptIR in Table 2 and the rebuttal. While the model itself may be fine-tuned in eight hours, this does not account for the pre-training time required for IPT and EDT. To avoid misrepresenting the overall time investment, the authors are strongly advised to clearly separate and report pre-training and fine-tuning times. The authors are also advised to incorporate all reviewer feedback into the camera-ready version.