PaperHub
Rating: 6.0/10 (Poster; 4 reviewers; min 6, max 6, std 0.0)
Individual ratings: 6, 6, 6, 6
Confidence: 4.0
Correctness: 3.0, Contribution: 2.8, Presentation: 2.8
NeurIPS 2024

Sharing Key Semantics in Transformer Makes Efficient Image Restoration

Submitted: 2024-05-02, Updated: 2025-01-07
TL;DR

We propose SemanIR, a Transformer-based approach for image restoration that optimizes attention computation by focusing on semantically relevant regions, achieving linear complexity and state-of-the-art results across multiple tasks.

Abstract

Keywords
Low-level Vision, Image Restoration, Vision Transformer

Reviews and Discussion

Review
Rating: 6

This paper proposes a dictionary-based image restoration method that leverages the most relevant information to recover images with low computational costs. Specifically, the method constructs a key-semantic dictionary that stores the top-k semantically related regions for each patch and performs attention only among these related regions. Additionally, the dictionary is created only once at the beginning of each transformer stage to reduce the computational burden.
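To make the mechanism concrete for readers, the following is a minimal PyTorch sketch of the idea summarized above, not the authors' released code: build the top-k neighbor indices once, then attend only over those neighbors. The function names and the single-head, unbatched shapes are illustrative assumptions.

```python
import torch

def build_key_semantic_dict(x, top_k):
    """x: (N, C) token features. Returns (N, top_k) indices of the most similar tokens."""
    sim = x @ x.t()                            # (N, N) dot-product similarity
    return sim.topk(top_k, dim=-1).indices     # store indices only, not the scores

def key_semantic_attention(q, k, v, neighbor_idx):
    """q, k, v: (N, C); neighbor_idx: (N, top_k). Attention restricted to neighbors."""
    k_sel = k[neighbor_idx]                    # (N, top_k, C) gathered keys
    v_sel = v[neighbor_idx]                    # (N, top_k, C) gathered values
    attn = torch.einsum('nc,nkc->nk', q, k_sel) / q.shape[-1] ** 0.5
    return torch.einsum('nk,nkc->nc', attn.softmax(dim=-1), v_sel)
```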

Strengths

  1. This paper presents an efficient and effective strategy for utilizing the most relevant patches for image restoration.
  2. Extensive experiments demonstrate the effectiveness of the proposed SemanIR method across several image restoration tasks.

Weaknesses

  1. The novelty of this paper is somewhat limited. Similar to KiT [1], the proposed method SemanIR utilizes the most relevant information for restoration by performing attention only among the related regions. The key difference between KiT and SemanIR is that SemanIR calculates the semantic relation only once and reuses the information in subsequent layers.
  2. Fig. 2 needs further improvement to increase its readability. The selected $\hat{K},\hat{V}$ are not shown in the calculation process of (d). Additionally, the dimensions of $\hat{K},\hat{V}$ are unclear in the statement. Suppose the dimension of $\hat{K}$ is $k\times c$; how could the dimension of $D^{att}_{K}=\text{Softmax}_K(Q\hat{K}^T/\sqrt{d})$ be $hw\times hw$?
  3. The intention behind constructing the random top-k strategy is unclear. Compared to the fixed top-k strategy, what are the benefits of a random $k$? It is apparent that the fixed top-k strategy would have inferior performance since $k=[64,128,192,256,384]$ is different from the training setting $k=512$. As shown in Fig. 3, $k=256$ seems to be a suitable value for the proposed method. What is the need to increase $k$ to 512?
  4. In the IR in AWC task, some related works [2,3] are not compared.

[1] KNN Local Attention for Image Restoration. CVPR'22

[2] All-in-One Image Restoration for Unknown Corruption. CVPR'22

[3] PromptIR: Prompting for All-in-One Blind Image Restoration. NIPS'23

Questions

See Weaknesses.

Limitations

The limitations and broader impact have been discussed in the paper.

Author Response

Response to Reviewer YUtD:

Q1: Comparison to KiT[1]

A: Please refer to our answer to the 1st question in the shared "Author Rebuttal".

[1] KNN Local Attention for Image Restoration. CVPR'22

Q2: Fig.2 needs to improve regarding the dimension

A: Thanks for this very important suggestion. We have improved Fig. 2(d) and made it much clearer to follow. The revised figure is shown in Fig. 5 of the rebuttal PDF file. We also demonstrate the difference in the attention calculation between training (Torch-Mask is used to exclude possible negative connections below the threshold) and inference (a Triton kernel is designed and used to greatly reduce the computation cost).
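As a rough illustration of the two attention paths mentioned above, here is a plain-PyTorch sketch (the real inference path uses a custom Triton kernel, which is not reproduced here); `neighbor_idx` is assumed to hold the top-k indices from the key-semantic dictionary, and shapes are simplified to a single unbatched head.

```python
import torch

def train_attention_masked(q, k, v, neighbor_idx):
    """Training path: compute full (N, N) logits, then mask non-neighbor entries."""
    n = q.shape[0]
    logits = (q @ k.t()) / q.shape[-1] ** 0.5
    keep = torch.zeros(n, n, dtype=torch.bool, device=q.device)
    keep.scatter_(1, neighbor_idx, True)             # keep only top-k neighbors
    logits = logits.masked_fill(~keep, float('-inf'))
    return logits.softmax(dim=-1) @ v

def infer_attention_gathered(q, k, v, neighbor_idx):
    """Inference path: gather only the k selected keys/values per query."""
    k_sel, v_sel = k[neighbor_idx], v[neighbor_idx]  # (N, top_k, C)
    logits = torch.einsum('nc,nkc->nk', q, k_sel) / q.shape[-1] ** 0.5
    return torch.einsum('nk,nkc->nc', logits.softmax(dim=-1), v_sel)
```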

Q3: Top-k selection discussion?

A: Please see the answer to the 2nd question in the shared "Author Rebuttal".

Q4: Comparison to [2] and [3]?

A: We compare the deraining results of AirNet[2], PromptIR[3], and our method under the single deraining degradation setting. The results are reported in Tab. 7. They show that:

  • Our SemanIR outperforms AirNet by a large margin (i.e., 2.93 dB) on PSNR. The reason why the parameters of AirNet are only 8.93M is that it is built on CNNs.
  • Our SemanIR outperforms PromptIR by 0.79dB on PSNR with 27% fewer parameters even though both SemanIR and PromptIR are built with transformers.

Table 7 The single-degradation deraining comparison on the Rain100L dataset.

| Method | AirNet [2] | PromptIR [3] | SemanIR (Ours) |
|---|---|---|---|
| PSNR | 34.90 | 37.04 | 37.83 |
| Params. | 8.93M | 35.59M | 25.85M |

[2] All-in-One Image Restoration for Unknown Corruption. CVPR'22

[3] PromptIR: Prompting for All-in-One Blind Image Restoration. NIPS'23

Comment

Dear reviewer,

Thank you for the comments on our paper.

We have submitted the response to your comments and a PDF file. Please let us know if you have additional questions so that we can address them during the discussion period. We hope that you will consider raising the score.

Thank you

Comment

Thank you for the careful explanation. I have read the response and considered the feedback from other reviewers. However, the comparison between KiT and SemanIR does not highlight the novelty of SemanIR. The key difference you mentioned, i.e., calculating the semantic relation only once for efficiency, aligns with the comments in the weaknesses. Additionally, the comparisons with AirNet and PromptIR were not conducted on the datasets in Table 5 of the paper, which affects the reliability of the results. Therefore, I have decided to keep my score.

Comment

The comparison to AirNet[2] and PromptIR[3]

Answer: We appreciate the reviewer's suggestion to validate the generalization ability of the proposed SemanIR. Since [2] and [3] did not provide results under our IR in adverse weather conditions (AWC) setting, we first compare SemanIR with [2] using the results of [2] reported in Tab. 1 of [4] ([4] used the same training setting as ours, while only the test setting is slightly different for deraining, i.e., no fog is added for deraining). The results shown in Tab. 8 indicate that SemanIR achieves clear improvements (i.e., a 2.93 dB average improvement) on all three adverse weather conditions.

Table 8: IR in AWC under [4]'s test setting.

| Methods | Rain | Snow | Raindrop | Average |
|---|---|---|---|---|
| AirNet [2] | 23.12 | 27.92 | 28.23 | 26.42 |
| SemanIR (Ours) | 26.48 | 30.76 | 30.82 | 29.35 |

In addition, instead of training TWO methods ([2] and [3]) under our AWC setting, we trained SemanIR under the 3-degradation all-in-one setting (i.e., the same training/testing datasets, for 120 epochs) from [2] and [3] during the rebuttal period. The results shown in Table 9 indicate that the proposed SemanIR achieves better all-in-one IR ability than [2] and [3] across all three kinds of IR tasks, with 0.77 dB and 0.94 dB PSNR improvements on dehazing and deraining, respectively. This is also consistent with the results in Tab. 8.

Table 9: The All-in-One setting IR results.

| Method | Dehazing (SOTS) | Deraining (Rain100L) | Denoising (BSD, $\sigma$=15) | Denoising (BSD, $\sigma$=25) | Denoising (BSD, $\sigma$=50) | Average |
|---|---|---|---|---|---|---|
| AirNet [2] | 27.94 | 34.90 | 33.92 | 31.26 | 28.00 | 31.20 |
| PromptIR [3] | 30.58 | 36.37 | 33.98 | 31.31 | 28.06 | 32.06 |
| SemanIR (Ours) | 31.35 | 37.31 | 34.01 | 31.35 | 28.08 | 32.42 |

Since there are three kinds of degradations in both the AWC setting (i.e., rain+fog, snow, and raindrop) and the All-in-One setting (i.e., haze, rain, and noise), we believe these experiments also show the generalization ability of our method and provide a more reliable comparison. Additionally, we have started these experiments and will include the results of training AirNet[2] and PromptIR[3] under our AWC setting in our revised manuscript.

[2] All-in-One Image Restoration for Unknown Corruption. CVPR'22

[3] PromptIR: Prompting for All-in-One Blind Image Restoration. NIPS'23

[4] Language-driven All-in-one Adverse Weather Removal. CVPR'24

We hope this more reliable comparison may meet your expectations.

With warm regards and many thanks,

Comment

Thank you for the further clarifications and additional experiments. My concerns have been fully resolved, and I have raised my score accordingly.

Comment

Dear Reviewer_YUtD,

Thank you very much for your thoughtful comments and for updating your score. We are pleased to hear that your concerns have been resolved. We will include the points discussed and incorporate the necessary changes into our revised manuscript.

With warm regards,

Comment

Dear Reviewer_YUtD,

We sincerely appreciate your feedback and the positive score of our manuscript, and we are committed to further enhancing the quality of our manuscript based on your suggestions.

In the following, first, we aim to address the concerns regarding the novelty of SemanIR in greater detail. Subsequently, we will provide a more reliable and thorough comparison of SemanIR with the approaches presented in [2] and [3].

Concerns regarding the novelty of SemanIR:

Answer: Although both KiT and SemanIR use KNN search, they are fundamentally different in the following core aspects:

  • Design logic: KiT tries to extend the attention field from a local patch to $k$ patches, which follows a local-to-global logic. On the other hand, SemanIR aims to efficiently sort out the most similar tokens for each token over the global range, which in essence is a global method.
  • Patch-wise vs. token-wise similarity: Due to the different design logic, KiT computes the similarity between ($r \times r$) patches, and attention is then conducted between tokens in the $k$ ($r \times r$) similar patches. By contrast, SemanIR directly computes the similarity between tokens, and attention is performed directly between the similar tokens.
  • Implementation and efficiency: KiT performs the KNN search in each transformer layer and, for the sake of efficiency, uses locality-sensitive hashing. On the other hand, SemanIR computes the KNN directly for each token, and a single KNN result is shared across all transformer layers in the same stage to improve computational efficiency.

These differences go beyond the key-semantic dictionary-sharing strategy alone. Additionally, we would like to further highlight the following contributions of the proposed SemanIR:

  • Attention Calculation: Our approach utilizes the Torch-Mask function during training and the Triton kernel operator during inference. This trade-off ensures accurate backpropagation with the semantically most relevant patches during training and reduces the shape of $K$ and $V$ from $HW \times C$ to $k \times c$ during inference. This optimization significantly reduces inference time, as shown in Table 2, making it practical for deployment.

  • Top-k Training Strategy: As detailed in our response to the second question of the "Author Rebuttal" and Appendix D, we decouple the use of $k$ between training and inference. This allows for a flexible, randomly sampled $k$ during training and a fixed $k$ during inference, enabling effective use of large GPU memory during training while ensuring efficient performance with limited GPU resources during inference. This strategy is applicable to large-scale IR models.

  • Extensive Experiments: We have validated our method on 6 diverse IR tasks, achieving state-of-the-art performance on most. The visual comparisons in the appendix further demonstrate the effectiveness of SemanIR across various degradation types.

These aspects, in addition to the dictionary-sharing concept, collectively enhance the originality and novelty of SemanIR. We believe these clarifications strengthen the presentation of SemanIR's contributions. Thank you for the opportunity to address these points, and we hope this clarification meets your expectations.

Review
Rating: 6

This paper proposes SemanIR, a novel Transformer architecture for image restoration. The paper proposes a new attention mechanism for better efficiency and efficacy, based on the observation that, within a degraded image, patches semantically close to the target patch provide the major information for its restoration. The authors build a key-semantic dictionary, which stores the top-k closest patches, for each patch. This dictionary is shared across different transformer layers. This way, they successfully constrain the cost of the attention operation while preserving its meaningfulness and a global receptive field. Their approach reaches state-of-the-art performance with great efficiency on various image restoration tasks and benchmark datasets.

Strengths

  1. The idea of using only the top-k semantically close patches (tokens) sounds novel and smart, adaptively saving computational cost depending on the user's choice, with minimal negative effect on performance.

  2. The proposed method seems to be generally applicable, with potential to be adopted in tasks other than image restoration.

  3. The strong experimental results also confirm its effectiveness.

Weaknesses

  1. Overall, the experiments have been conducted extensively, on a variety of restoration tasks / benchmarks. Yet, I think there could have been more experiments on the proposed method itself, rather than comparisons to existing works. See the Questions section (#1, #2, #3) below.

Additional comments:

  • I think the notation D is overused. In Eq. 1, it denotes the softmax output, and it then denotes the dictionary a few lines later (i.e., Eq. 2). Although the subscripts differ slightly, I found it confusing to read, as D_{i,j} and D(i,j) denote two completely different things.

  • For future studies, open-sourcing the implementation is recommended.

Questions

  1. Experiments on top-k selection

The experiment with fixed top-k in Figure 3 seems to be conducted with k=512 during training, and k is changed only during inference. What happens when a different value of k is used during training, matching the k value to be used at inference? E.g., k=64 in both training and inference, or k=256 in both training and inference. I wonder about this because, if k is fixed to 64 in both training and inference without a notable performance drop, why would one need the random top-k approach, when one could simply use a small k value in training and opt to use a larger value at inference? In addition, I wonder how the performance would differ when the k value at inference time is set to a value unused during training, such as 16 or 32.

  2. Visualization of attention maps / key-semantic dictionary?

As semantically close patches are used only in the attention process, I wonder how the visualized attention maps would look like.

  • Do the attention maps have a smooth attention across the top-k patches, or does it still rely heavily on a few nearest neighbors?
  • Could the authors provide an example visualization of top-k patch matches in the construction process? Does the visualized attention map and top-k match look as intended, similar to the concept visualization at Fig.1(e)?
  3. Non-sharing of key-semantic dictionary?

The key-semantic dictionary is shared across multiple layers for efficiency. How does it change when the dictionary is constructed after every layer? Ignoring the computational cost, does it lead to a consistent improvement in performance?

  4. Complexity of key-semantic dictionary

To construct a key-semantic dictionary, each of the N = H x W tokens initially has to compute similarities against N tokens. Thus, I would guess the complexity of the dictionary construction process to be O(N^2), which would still be quite burdensome. However, according to the Appendix, the complexity is O(HWC). Can the authors give further explanation on this?

  • typo: the K in the second row of Eq.6 seems to be small k, not a capital K.
  5. Why store the similarity, instead of the index?

In the construction of the key-semantic dictionary, it seems like the dictionary stores the similarity values of the top-k patches (Eq. 2). Is there a reason for storing the similarity values instead of storing only the indices of the top-k patches?

  6. NAFNet [11] baseline?

NAFNet is a very strong baseline well-known in image restoration literature. Is there any reason there is no performance comparison to it?

  7. How is the positional information of tokens (patches) handled?

Does the proposed attention mechanism consider positional information (either absolute or relative), or are only the local features used? I wonder how the positional information is handled in both the key-semantic dictionary construction and the attention layers. Is it simply neglected?

  8. Windowed-attention?

The proposed method uses an efficient attention mechanism while maintaining a global receptive field. But according to the code in the supplementary material, it seems like SemanIR uses a window-based approach from Swin Transformers. Doesn't this limit the global receptive field and make it local, contradicting the mentioned benefits of the proposed attention mechanism? I believe the proposed method has the potential for generalized application, even on a vanilla Transformer architecture. Have the authors tried removing windowed attention and applying the method to a vanilla Transformer architecture?

Limitations

The limitations and potential impacts have been discussed appropriately.

Author Response

Response to Reviewer aTgi:

Q1: Top-k selection discussion?

A: See the answer to the 2nd question in the shared "Author Rebuttal".

Q2: Visualization:

A: We set a query region in the input and provide a detailed comparison of the attention-based activation maps together with the input query region in Fig. 3 of the rebuttal PDF file. Fig. 3(a) shows the input query region. Fig. 3(b) displays the activation map generated using the standard attention mechanism. Fig. 3(c-f) illustrate activation maps using our key-semantic dictionary with different top-k values ([8, 16, 64, 256]) during inference. To conclude:

  • Normal Attention: The activation map from normal attention (Fig. 3(b)) shows connections to regions that may not be semantically related to the query region. This indicates that normal attention can consider irrelevant regions.
  • SemanIR with Key-Semantic Dictionary: With the proposed SemanIR, using a smaller top-k value (e.g., top-k = 8 or 16) results in activation maps that connect only a limited number of neighboring patches with the closest semantic similarity. This is consistent with our intended concept, as demonstrated in Fig.1(e) of the original manuscript.
  • Effect of Top-K Value: Increasing the top-k value allows for connections to more semantically related regions. However, when the top-k value is set too high (e.g., top-k = 256 as shown in Fig. 3(f)), the activation map may include some semantically unrelated regions. This aligns with the findings and the results depicted in Fig. 3 of our manuscript, where increasing the top-k value beyond a certain point (e.g., from 396 to 512) does not further improve PSNR.

Q3: Non-sharing of key-semantic dictionary?

A: We discuss this issue from the following perspectives:

  • Potential Benefits: Constructing a key-semantic dictionary for each layer could indeed enhance performance, as each layer would benefit from a dictionary specifically tailored to its unique context. This approach might allow each layer to utilize a more precisely matched dictionary, potentially improving the semantic relevance and accuracy of the attention process.
  • Experimental Constraints: For the sake of training efficiency, the Torch-Mask strategy is used. For a window-size of 32 with key-semantic dictionary sharing, this already leads to an explosion of memory. Creating layer-wise key semantics would further increase the memory footprint significantly.
  • Empirical Evidence: As an alternative, we have visualized the attention maps for each layer within the same stage in Fig. 4 of the rebuttal pdf file. It shows: (i) The attention maps exhibit only slight variations across layers, indicating that the semantic focus remains largely consistent. (ii) The activation regions in these maps are very similar, which supports the effectiveness of our approach of sharing the key-semantic dictionary across layers.
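For context, a schematic of the per-stage sharing pattern discussed in this thread (an assumed wrapper, not the released code): the dictionary is built once from the stage input and passed unchanged to every layer of that stage, reusing `build_key_semantic_dict` from the earlier sketch.

```python
import torch.nn as nn

class SemanticStage(nn.Module):
    """Hypothetical stage wrapper: build the key-semantic dictionary once per
    stage and reuse it in every attention layer of that stage."""
    def __init__(self, layers, top_k):
        super().__init__()
        self.layers = nn.ModuleList(layers)   # each layer is assumed to accept (x, neighbor_idx)
        self.top_k = top_k

    def forward(self, x):
        neighbor_idx = build_key_semantic_dict(x, self.top_k)  # see earlier sketch
        for layer in self.layers:
            x = layer(x, neighbor_idx)        # the same dictionary is shared by all layers
        return x
```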

Q4: Complexity of key-semantic dictionary:

A: We acknowledge the mistake and appreciate the opportunity to clarify.

  • Correction of Complexity: The correct complexity for the similarity calculation process is indeed $\mathcal{O}((HW)^{2}C)$, rather than $\mathcal{O}(HWC)$. We apologize for this error.
  • Revised Eq. 5:

$$\mathcal{O}\big(6 \times [4HWC^{2} + 2kHWC] + (HW)^{2}C\big)$$

  • Revised Eq. 6:

$$\mathcal{O}\big(6 \times [4HWC^{2} + 2M^{2}HWC] - \big(6 \times [4HWC^{2} + 2kHWC] + (HW)^{2}C\big)\big) = \mathcal{O}\big((12M^{2} - 12k - HW)HWC\big)$$

  • Example Calculation: Let’s consider a common setting as we indicate in our Appendix ($M = 7$, patch size = 16, $H = W = 64$, $k = 512$). We have:

$$\mathcal{O}\big((12M^{2} - 12k - HW)HWC\big) = \mathcal{O}\big(12 \times (7 \times 7) \times (16 \times 16) - 12 \times 512 - 64 \times 64\big) = \mathcal{O}(150528 - 6144 - 4096) \gg 0$$
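For reference, a one-line check of the arithmetic in the substitution above (the variable names simply mirror the stated setting):

```python
M, patch, H, W, k = 7, 16, 64, 64, 512
print(12 * M**2 * patch**2 - 12 * k - H * W)  # 150528 - 6144 - 4096 = 140288, i.e. > 0
```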

Despite the errors, the conclusions about the complexity remain valid. We appreciate your understanding and the opportunity to correct these details.

Q5: Why store the similarity, not index?

A: In Eq 2 of our manuscript, we calculate the similarity values to construct the key-semantic dictionary but store only the indices. We have clarified this in the revised manuscript to prevent any misunderstandings.

Q6: Comparison to NAFNet[1]

A: We have included a performance comparison with [1] in Tab. 6:

Table 6: The deblurring comparison on GoPro.

| Method | PSNR | SSIM |
|---|---|---|
| NAFNet [1] | 32.85 | 0.960 |
| SemanIR (Ours) | 33.44 | 0.964 |

It shows that SemanIR outperforms the strong baseline [1] in both PSNR and SSIM on the single-image motion deblurring task. We have included this important comparison in our revised manuscript.

[1] Simple Baselines for Image Restoration. ECCV'22

Q7: Positional Embedding (PE)?

A:

  • For $\mathcal{D}_{K}$, the PE is not used, but it is interesting to explore.
  • For attention, we adopted the relative PE. The full PE for all locations is first indexed from the relative positional encoding. During training, the positional encoding and the corresponding values in the attention map for dissimilar tokens are masked by being set to infinity.
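A small sketch of how such biased and masked attention logits might be formed (an illustration under assumptions, not the exact implementation: `rel_bias_table` and `rel_index` stand for a Swin-style relative position bias table and its precomputed index, and a large negative value plays the role of the mask described above).

```python
import torch

def biased_masked_logits(q, k, rel_bias_table, rel_index, neighbor_idx):
    """Add a relative position bias to the attention logits, then mask out
    non-neighbor entries before the softmax."""
    n = q.shape[0]
    logits = (q @ k.t()) / q.shape[-1] ** 0.5     # (N, N) raw logits
    logits = logits + rel_bias_table[rel_index]   # indexed relative PE, (N, N)
    keep = torch.zeros(n, n, dtype=torch.bool, device=q.device)
    keep.scatter_(1, neighbor_idx, True)          # True for top-k neighbors
    return logits.masked_fill(~keep, float('-inf'))
```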

Q8: Windowed-attention?

A:

  • For all experiments, we set the window size to 32, which contains 1024 tokens. This is a lot of tokens representing non-local connectivity, and this size is already larger than the global range for tasks like classification.
  • The window-based calculation is chosen for its efficiency and reduced memory usage, as the masking strategy results in high memory consumption during training, making it costly for IR with vanilla ViT.
  • Applying the proposed SemanIR on vanilla ViT architecture would be a very interesting direction with proper design, which we would like to try in our future work for a more generalized exploration.
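For context, a standard Swin-style window-partition helper of the kind implied above (the exact shapes and usage are assumptions): a (B, H, W, C) feature map is split into non-overlapping 32×32 windows of 1024 tokens, within which the key-semantic dictionary and attention can then be computed.

```python
import torch

def window_partition(x, window_size=32):
    """Split a (B, H, W, C) feature map into non-overlapping windows of
    window_size x window_size tokens, returning (num_windows*B, ws*ws, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
```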
Comment

Thank you for the clarifications. I am satisfied with the rebuttal and leave my decision to acceptance.

Comment

Dear Reviewer_aTgi,

Thank you for your positive feedback and recommendation for the acceptance of our submission. We are grateful for your constructive comments and are committed to incorporating all the significant points discussed during the rebuttal into our revised manuscript.

Once again, we sincerely appreciate your support and encouragement throughout the review process.

With warm regards,

Review
Rating: 6

The paper proposes an efficiency-first modification to the self-attention mechanism in ViTs. The main premise of the work is to construct a key-semantic dictionary which relates each key to its semantically-relevant patches, and then share the dictionary across Transformer blocks in the same stage for computational efficiency.

Strengths

S1. The proposed method is fairly straightforward, and offers a decent alternative to dense self-attention for image restoration tasks.

S2. The proposition to share a dictionary, once constructed, across Transformer blocks allows for benefits in terms of FLOPs (G), and parameters.

S3. Extensive experiments are conducted on several image restoration tasks.

Weaknesses

W1. The proposed Key-Semantic dictionary computes the dot product to measure semantic similarity on windowed patches. While diagrammatically it is easier to illustrate, at feature level, due to mixing, etc., these windows might not necessarily contain information specific to one region. Have the authors considered window-size ablations?

W2. There are several methods that focus on improving the efficiency of self-attention in the context of image restoration. However, there is no comparison with different attention methods discussing how the proposed method compares in either distortion metrics or computational performance [1], [2], [3].

W3. The main computational efficiency is observed due to sharing the key information across Transformer layers. While maintaining the key-semantic dictionary is interesting, the idea of sharing information across Transformer layers has been explored previously [4].

[1] CAMixerSR: Only Details Need More “Attention”

[2] Skip-Attention: Improving Vision Transformers by Paying Less Attention

[3] Learning A Sparse Transformer Network for Effective Image Deraining

[4] You Only Need Less Attention at Each Stage in Vision Transformers

Questions

Q1. For the columnar architecture style, have the authors considered sharing the dictionary across different stages? Specifically in later stages, degraded information has mostly been recovered, so sharing across Transformer stages might be reasonable.

Limitations

The paper addresses the limitations and the societal impact.

Author Response

Response to Reviewer hRwb:

Q1: Have the authors considered window-size ablations?

A: The windows indeed contain mixed information from different semantic parts. Yet, it is precisely this semantic distinction that motivates us to develop a selection mechanism for semantic information using KNN. We have conducted ablation studies on the window size (on both gray and color image denoising with $\sigma=25$). The results are summarized in Tab. 4 and Tab. 5. As the window size increases, the semantically relevant information available to each token increases, leading to a PSNR gain on different IR tasks.

Table 4: Gray DN on Set12, BSD68, and Urban100.

| Window size | Set12 | BSD68 | Urban100 |
|---|---|---|---|
| 8 | 31.01 | 29.49 | 31.33 |
| 16 | 31.06 | 29.51 | 31.45 |
| 32 | 31.17 | 29.50 | 31.88 |
  • For gray DN (Tab. 4), performance improves with larger window sizes, showing increased PSNR values across datasets.

Table 5: Color DN on Mcmaster, Cbsd68, Kodak24, and Urban100.

| Window size | McMaster | CBSD68 | Kodak24 | Urban100 |
|---|---|---|---|---|
| 8 | 33.20 | 31.72 | 32.84 | 32.89 |
| 16 | 33.24 | 31.74 | 32.86 | 32.95 |
| 32 | 33.38 | 31.75 | 32.97 | 33.27 |
  • Similarly, for color DN (Tab. 5), larger window sizes yield better results. For instance, in the McMaster dataset, the PSNR increases from 33.20 dB with a window size of 8 to 33.38 dB with a window size of 32.

These findings indicate that larger window sizes enhance performance by capturing more contextual information, which will be discussed in the revised manuscript.

Q2: Discussing how the proposed SemanIR compares to [1][2][3]

A:

  • SemanIR vs. [1]: According to Tab. 9 in [1], [1] achieves 32.51 dB, 28.82 dB, and 27.72 dB PSNR on Set5, Set14, and BSD100 datasets, respectively. In comparison, SemanIR reaches 33.08 dB, 29.34 dB, and 27.98 dB PSNR on the same datasets. Both methods are designed for low-parameter settings. A detailed parameter comparison will be included in the revised manuscript. Note that [1] is not specifically optimized for other IR tasks.

  • SemanIR vs. [2]: Unlike SemanIR, [2] reuses earlier attention in subsequent layers, suggesting potential complementary benefits. [2] achieves an 11.3% FLOP reduction with performance comparable to the baseline, while SemanIR reduces FLOPs by 12.3% and improves PSNR by 0.39dB. We acknowledge the fairness of these comparisons and will conduct a similar evaluation in our IR settings, expecting similar results.

  • SemanIR vs. [3]: Please see the answer to the 1st question in the shared "Author Rebuttal", where we show that our SemanIR is consistently better than DRSformer on both deblurring and deraining.

[1] CAMixerSR. CVPR'24

[2] Skip-Attention. ICLR'24

[3] DRSFormer. CVPR'23

Q3: While maintaining the key-semantic dictionary is interesting, the idea of sharing information across Transformer layers has been explored previously [4].

Answer: Thank you for your feedback. We would like to address the points raised as follows:

  • Timing of Related Work: We note that [4] was made available on arXiv on June 1st, 2024, which is after our manuscript submission to NeurIPS. Therefore, our work was not influenced by this recent publication. Nevertheless, we will include the following discussions in the revised manuscript.

  • Focus of Our Work: While [4] focuses on reducing computational costs in Vision Transformers (ViTs), our proposed SemanIR method is distinct in its primary application. SemanIR is designed specifically for image restoration, rather than the broader focus of LaViT on computational efficiency within ViTs. Although both works address computational efficiency, the objectives and methodologies are different.

  • Difference in Computational Efficiency Approaches: LaViT [4] reduces computational costs by storing attention scores from a few initial layers and reusing them in subsequent layers. However, this approach does not change the computation cost of attention itself; it merely reuses previously computed scores. In contrast, SemanIR introduces a novel approach for reducing computation during both training and inference. During training, SemanIR leverages a pre-constructed semantic dictionary to exclude irrelevant information from other semantically unrelated patches, thus enhancing restoration quality. During inference, our implementation with Triton kernels optimizes attention operations, directly reducing computational costs.

[4] You Only Need Less Attention at Each Stage in Vision Transformers. CVPR'24

Q4: Sharing the dictionary across different stages?

Answer: Thank you for your valuable suggestion regarding dictionary sharing across different stages of our columnar architecture. We appreciate the insight and acknowledge that this is a compelling idea worth exploring further. To address your suggestion, we conducted an analysis to visualize the self-similarity of features at the beginning of each stage in our model, which consists of a total of six stages. The visualization results are shown in Fig.1 of the rebuttal pdf file. Based on our observations:

  • Stage-wise Semantic Similarity: We noted that there are still significant differences in the semantic similarity maps across various stages. This suggests that sharing the dictionary from the early stages to the later stages could potentially lead to a performance drop due to the divergence in feature representations.

  • Adjacency-based Sharing: Despite the variability across stages, we observed that adjacent stages exhibit similar semantic similarities. This indicates that it might be feasible to share the dictionary every two or three stages. Such an approach could reduce computational costs while maintaining performance.

Given these findings, we recognize the potential benefits of exploring dictionary sharing. We plan to investigate this perspective in more detail in our future work to assess its impact on performance and efficiency.

Comment

Dear reviewer,

Thank you for the comments on our paper.

We have submitted the response to your comments and a PDF file. Please let us know if you have additional questions so that we can address them during the discussion period. We hope that you will consider raising the score.

Thank you

Comment

I have gone through the authors' response, both to my questions and to other reviewers'. I thank the authors for responding to the comments and for providing the dictionary-sharing visualizations in the attached PDF. Further, the authors have provided comparisons with other methods focusing on computational efficiency with respect to the attention mechanism, including the runtime analysis against DRSFormer. The proposed method, SemanIR, either scores higher on the tasks, or is faster, or both. I am satisfied with the response and do not have any follow-up questions. Therefore, I would like to raise my score to accept.

Comment

Dear Reviewer_hRwb,

Thank you for your positive feedback and for raising your score to accept our manuscript. We appreciate your acknowledgment of our revisions and are glad that the additional analyses and visualizations addressed your concerns. Your support throughout the review process has been invaluable.

Best regards and many thanks,

Review
Rating: 6

Unlike traditional transformers where the multi-head self-attention layer calculates the correlation between one patch and all patches, the method proposed by the authors computes the correlation among the top k semantically similar patches, allowing image restoration with lower computational cost.

Additionally, by generating the key semantic dictionary only once at the beginning and sharing it across all transformer layers, the computational burden is significantly reduced.

Strengths

The authors logically explain the proposed method. They provide detailed information on the KNN algorithm and Key-Semantic Dictionary Construction and clearly illustrate how it is utilized through text and figures.

They also demonstrate the performance of the proposed method through numerous experiments. They conducted various experiments on hyperparameters and quantitatively measured the efficiency of the proposed method by assessing FLOPS, the number of parameters, and runtime.

Weaknesses

The explanation of the difference between the KNN matching method used in KiT and DRSformer and the KNN method proposed by the authors is lacking.

The authors mention that their method differs from previous token merging or pruning methods, but there are no comparative results to show which method reduces computational cost more effectively.

In Figure 2(d) (Key-Semantic Attention), "topk" should be corrected to "top-k".

Questions

What is the specific difference between the KNN matching method used in KiT, DRSformer, and the method proposed by the authors?

How does the performance of the proposed method compare to previous token merging or pruning methods? If previous methods yield better results, why did the authors choose to use the KNN method?

Limitations

The authors clearly outline the limitations of their research, informing the readers.

Author Response

Response to Reviewer xhpu:

Q1: What's the difference between SemanIR and other methods like KiT or DRSformer?

A: Please refer to our 1st answer in the shared "Author Rebuttal".

Q2: What is the difference between SemanIR and token merging and pruning methods?

A: The key focus of our method significantly diverges from previous token merging or pruning techniques, aiming not only for efficiency but, more crucially, for effective performance in regression-based dense prediction image restoration tasks. While improving efficiency aligns with the goals of token merging methods, the proposed SemanIR is specifically tailored for image restoration within a regression-based paradigm, as mentioned in Line 92 of our original manuscript. For image restoration, every token encodes information about a local patch. Merging or pruning tokens will lead to a loss of information in corresponding patches, which is not preferred in image restoration [4]. Our method’s suitability for image restoration can be attributed to the following points:

  • Utilization of Semantic Information: Unlike token merging or pruning methods [1, 2, 3], which reduce the number of tokens by combining or removing redundant ones with higher semantic similarity, SemanIR leverages these semantic-close tokens to enhance restoration. By ensuring that each token can benefit from others, our approach improves the overall quality of restoration.

  • Preservation of Detail: Token merging and pruning methods may reduce computational costs but can lead to the loss of crucial detailed information, such as texture, color, and local structures. In contrast, SemanIR maintains detailed information by focusing on semantic relevance rather than merely reducing token counts.

  • Effective KNN Strategy: In SemanIR, for each degraded pixel, we select its top-k most semantically close neighbors to contribute to the restoration of that pixel. This KNN strategy not only excludes the negative contributions from semantically unrelated neighbors but also enhances efficiency, distinguishing it from traditional approaches.

To conclude, although token merging and pruning techniques, such as those in [1, 2, 3], are effective for reducing computational complexity and are often suited for classification tasks [3], they are less appropriate for dense prediction tasks like IR. SemanIR’s approach addresses the specific needs of image restoration tasks more effectively by balancing efficiency with the preservation of detailed image information.

[1] Token merging: Your ViT but faster. ICLR'23

[2] DynamicViT: Efficient vision transformers with dynamic token sparsification. NeurIPS'21

[3] A-ViT: Adaptive tokens for efficient vision transformer. CVPR'22

[4] Skip-Attention: Improving Vision Transformers by Paying Less Attention. ICLR'24

Q3: Notation correction:

A: Thank you for pointing this out. We have revised Fig. 2(d) in the PDF file (Fig. 5) to include additional details, making the figure easier to follow and understand.

Comment

Dear reviewer,

Thank you for the comments on our paper.

We have submitted the response to your comments and a PDF file. Please let us know if you have additional questions so that we can address them during the discussion period.

Thank you

Comment

Thank you for the thorough responses. Your explanations clarified the distinctions between SemanIR and other methods, particularly in terms of its suitability for regression-based dense prediction tasks. Also, I appreciate the clear comparison with token merging and pruning methods. Based on these improvements and the strong technical arguments provided, I am inclined to increase my score.

Comment

Dear Reviewer_xhpu,

Thank you for your positive feedback and for revising your score to accept our manuscript. We are pleased that our explanations and clarifications have addressed your concerns. Your supportive assessment is greatly appreciated, and we remain committed to further improving the quality of our manuscript in line with your valuable suggestions.

Best regards,

Author Response

Dear All,

We appreciate the dedicated efforts each of you has invested in evaluating our work and providing invaluable suggestions and positive feedback (i.e., logically explained, a decent alternative to dense self-attention, numerous and varied experiments, and strong experimental results). We have taken care to address each question (Q) with a detailed answer (A), ensuring comprehensive coverage of the concerns. All figures mentioned are provided in the attached PDF file. Below are the shared responses to all reviewers:

Q1: The difference between SemanIR and KiT[1] or DRSFormer[2]:

A: Besides the brief explanation in our manuscript (Line 111), we offer a more detailed introduction below to emphasize the key differences:

SemanIR vs. [1]:

  • For k-NN matching, KiT performs KNN matching at each transformer layer, while SemanIR calculates a self-similarity only once at the beginning of each transformer stage and constructs the key-semantic dictionary $\mathcal{D}_{K}$ for sharing.
  • In terms of attention calculation within each transformer layer, SemanIR leverages the key-semantic dictionary so that only k of the $HW$ elements in $K$ and $V$ contribute to self-attention, with the rest excluded from the attention calculation. Most importantly for SemanIR in IR, the k selected elements are kept the same at each transformer layer within the same stage with the help of the key-semantic dictionary, which avoids the heavy KNN search within each layer as in KiT, thereby enhancing efficiency.
  • Regarding experimental results, SemanIR also includes deblurring and deraining results. For deraining, our method was trained and tested on the same datasets as KiT. The results shown in Tab. 1 indicate that SemanIR outperforms KiT in both deblurring and deraining tasks.

Table 1: The comparison between KiT and the proposed SemanIR.

| Method | PSNR (Deblur: GoPro) | PSNR (Deblur: HIDE) | PSNR (Derain: 5 test sets) |
|---|---|---|---|
| KiT [1] | 32.70 | 30.98 | 32.81 |
| SemanIR (Ours) | 33.44 | 31.05 | 32.98 |

SemanIR vs. [2]:

  • As illustrated in Fig. 2 of the DRSFormer paper, its top-k sparse attention first computes the self-attention between all tokens (i.e., the computation cost of $QK^{\top}$ is not reduced) before performing the top-k and scatter operations, which means each token is still affected by all other tokens, even those that are semantically unrelated. In contrast, SemanIR first selects the top-k elements in $K$ and $V$ to obtain $\hat{K}$ and $\hat{V}$, and then computes $Q\hat{K}^{\top}$ instead of $QK^{\top}$. This directly eliminates unnecessary contributions from semantically unrelated patches.
  • After the attention calculation, DRSFormer applies (mask, top-k, scatter) operations at each transformer layer. This increases the computation cost, while the proposed SemanIR does not need to perform top-k matching at each layer. This leads to a significant efficiency improvement for SemanIR compared to DRSFormer (consistent with the results shown below in Tab. 2).
  • Regarding results, we used the same training datasets as DRSFormer and evaluated SemanIR on the Rain200H test set. The results shown in Tab. 2 indicate that while DRSFormer shows slightly higher performance on deraining, SemanIR is more efficient, with 23%, 44%, and 89% reductions in parameters, FLOPs, and runtime, respectively.

Table 2: The comparison between DRSFormer and SemanIR (the efficiency is evaluated on one image with $H=W=256$).

| Method | PSNR (Derain: Rain200H) | Params. | FLOPs | Runtime |
|---|---|---|---|---|
| DRSFormer [2] | 32.17 | 33.7 M | 242.9 G | 2200 ms |
| SemanIR (Ours) | 32.01 | 25.85 M | 135.26 G | 240 ms |

As shown by the results in the tables above and in the manuscript, the proposed SemanIR differs significantly from both KiT and DRSFormer, resulting in substantial enhancements in efficiency and performance for image restoration tasks, with notable improvements in runtime, parameters, and computational complexity, while maintaining competitive results in deblurring and deraining.

[1] KNN Local Attention for Image Restoration. CVPR'22

[2] Learning A Sparse Transformer Network for Effective Image Deraining. CVPR'23

Q2: Top-k selection

A: We examined the setting in which the top-k value used during training matches the top-k value used during inference (i.e., fixed matching top-k), for JPEG CAR on the BSD500 dataset. The results in Fig. 2 indicate:

  • Training with fixed matching top-k yields performance comparable to or slightly better than the random top-k approach when the top-k value is relatively small. When k=512 for both training and inference, performance is comparable to random top-k.
  • However, using a fixed top-k value for training requires training multiple models for different k values (e.g., 6 models for k values ranging from 64 to 512). In contrast, the random top-k strategy offers more flexibility and requires training only a single model, making it more user-friendly and less resource-intensive.

When the top-k value during inference is set to a value not used during training, Tab. 3 shows:

  • Performance Degradation: When using a top-k value that was not used during training, there is a notable drop in PSNR compared to other settings. This suggests that the model performance is sensitive to the specific top-k values used during training.
  • Comparison of Unseen top-k Values: Among unseen top-k values, larger k values during inference tend to achieve better PSNR. This observation aligns with the findings from our ablation studies on window sizes.

Table 3: Inference with top-k=16, 32.

| | top-k=16 | top-k=32 | Train: Random top-k (Average) | Train: Fixed top-k=512 (Average) |
|---|---|---|---|---|
| PSNR | 30.16 | 30.23 | 30.62 | 30.54 |
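A tiny sketch of what the random top-k training strategy could look like inside a training loop (the sampling range and granularity are assumptions; `build_key_semantic_dict` refers to the earlier sketch):

```python
import random

def sample_top_k(k_min=64, k_max=512, step=64):
    """Draw a different k each training iteration so that a single trained model
    can later be run with any fixed (typically smaller) k at inference time."""
    return random.choice(range(k_min, k_max + 1, step))

# e.g., inside the training loop:
# k = sample_top_k()
# neighbor_idx = build_key_semantic_dict(x, top_k=k)
```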

Q3: Notation, Typo, and Open Source:

A: We corrected the usage of notation and addressed the typos as suggested throughout the manuscript. We have included the code in the supplementary materials, and the full training pipeline will also be released.

Final Decision

This paper proposes a Transformer architecture for image restoration, built around an attention mechanism designed for better efficiency and efficacy. The proposed method achieves state-of-the-art performance with great efficiency on various image restoration benchmarks. After the rebuttal, all the reviewers agreed to accept this paper.