On the Surprising Effectiveness of Attention Transfer for Vision Transformers
Reviews and Discussion
In this work, the authors demonstrate that a large part of the benefit of pre-training in ViT models comes not from the pre-trained features but from the pre-trained knowledge of attention maps. Specifically, the authors propose an alternative to fine-tuning called Attention Transfer, and they use this method to transfer attention maps from pre-trained ViTs to teach from-scratch ViTs. This method achieves results comparable to those of fine-tuning. The authors present two variations of Attention Transfer: the first directly copies the attention maps from the teacher network, while the second teaches the student network to make its own attention maps with a distillation objective derived from the teacher's attention maps. The authors demonstrate impressive and surprising results on ImageNet-1k and present further analysis of this approach in a variety of configurations. While the proposed Attention Transfer method does have some clear limitations, specifically for domain gaps, overall the work has very interesting implications for ViT pretraining and finetuning.
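For readers unfamiliar with the setup, here is a minimal sketch of the two variants described above (function names, tensor layouts, and the particular distillation loss are illustrative assumptions, not the authors' implementation):

```python
import torch

def attention_map(x, w_q, w_k, num_heads):
    # Per-head attention probabilities, shape [B, heads, L, L]; the map depends only
    # on the number of tokens L, not on the model's embedding dimension.
    B, L, D = x.shape
    split = lambda t: t.view(B, L, num_heads, -1).transpose(1, 2)
    q, k = split(x @ w_q), split(x @ w_k)
    return (q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5).softmax(dim=-1)

def attention_copy_block(x_student, attn_teacher, w_v, w_proj, num_heads):
    # Attention Copy: the student keeps no Q/K of its own; it combines its own values V
    # according to the frozen teacher's attention pattern.
    B, L, D = x_student.shape
    v = (x_student @ w_v).view(B, L, num_heads, -1).transpose(1, 2)
    out = (attn_teacher @ v).transpose(1, 2).reshape(B, L, D)
    return out @ w_proj

def attention_distill_loss(attn_student, attn_teacher):
    # Attention Distillation: the student computes its own maps and is penalized for
    # deviating from the teacher's (a cross-entropy over the maps is one possible choice).
    return -(attn_teacher * attn_student.clamp_min(1e-9).log()).sum(dim=-1).mean()
```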
Strengths
The main result of the work, which is to demonstrate the key importance of pre-trained attention maps over pre-trained features in ViT finetuning/transfer learning, is very interesting, and has important implications for the use of ViTs. While I think it is unlikely that Attention Transfer (in its current form) will replace standard finetuning (see notes in the following section), I think the implications of the results and analysis are still very important.
The potential to use Attention Transfer with ensembles is also interesting, and the authors show that it has the potential to boost the performance of self-supervised ViTs further.
This work also helps to explain some of the properties of MAE. In particular, prior works have found that MAE lends itself well to fine-tuning, but not as well to direct linear probes. This work shows a possible explanation: that the strength of MAE is not in its pretrained features but instead in its pretrained attention maps.
The work is clearly presented and has wide coverage of many model and training configurations in the main work and appendix. They also include additional analysis on the impact of partial transfer, and assess if the student is or is not re-learning the same features as the teacher. Overall, their analysis is quite comprehensive.
Weaknesses
While the proposed Attention Transfer method has very interesting implications for ViTs, I’m not sure if it will make its way into practical use, for either the Attention Copy or Attention Distillation variants. The main issue is that the proposed method does not always match the performance of the regular fine-tuning approach, particularly in cases with a domain gap between the pretraining data and the downstream task. In addition, Attention Transfer is more expensive to train than standard Fine Tuning, as acknowledged in Section A.2.
For some of the analysis results, it would be very helpful to see how Attention Distillation performs as compared to Attention Copy. In particular, in any analysis where there is a domain gap (Table 2 for example), it would be interesting to see if Attention Distillation performs better, as it is suggested that Attention Distillation allows more flexibility in the student and thus may perform better in such cases.
On a less important note, I find that the visualizations in Figures 7, 10, and 11 are somewhat difficult to see due to the combination of the heat map colors and the background images. I would suggest revising these figures to try and make them clearer.
Questions
In the Attention Copy setting, are the unnecessary Q and K layers removed from the student network?
There is an important conclusion in the work that seems somewhat under-acknowledged. In lines 210-217 and in Table 1, it is found that transferring the teacher’s Q is more effective than transferring the attention map. Why was more attention and testing not performed on Q transfer?
Limitations
The authors discuss the limitations of Attention Transfer and present results for it with many configurations. Overall, they are quite transparent about acknowledging the situations where Attention Transfer underperforms full Fine Tuning.
I do not see any risks for negative societal impact.
We’re glad that you found our results interesting and our analysis comprehensive. We address your questions below:
I’m not sure if it will make its way into practical use, for either the Attention Copy or Attention Distillation variants.
We agree that the attention transfer methods are not currently ready for practical use. One of our main objectives was to use attention transfer to understand the role of features vs attention in pre-trained models. But beyond scientific utility, attention transfer may have some advantages over fine-tuning in the future:
- An attention map (of size L×L, where L is the sequence length) does not depend on the model's dimension, which means the map from one model can be directly transferred to a different-sized model.
- Attention transfer gets rid of layer-wise learning rate decay. This is a crucial hyper-parameter used almost everywhere when tuning pre-trained vision models (not just ViTs). The fundamental prior here is that early layers should change less than later layers. But such a prior can be a restriction for next-generation models, and getting rid of it opens up new opportunities.
- Sharing weights can incur security risks (e.g., white-box attacks). In such settings, we need an effective way to transfer knowledge from the pre-trained model, and attention transfer offers such a possibility.
We will add more detail on avenues for future research in the paper, and we hope other researchers find more practical ways to use attention transfer!
For some of the analysis results, it would be very helpful to see how Attention Distillation performs as compared to Attention Copy.
In our main rebuttal above, we have updated tables so that we have both Attention Copy and Attention Distillation. In general, Attention Distillation performs better than Attention Copy, which follows our existing findings.
Revising attention map visualizations
We have modified the figures to be easier to see. We have included samples in the 1 page PDF. Let us know if you have any further suggestions!
In the Attention Copy setting, are the unnecessary Q and K layers removed from the student network?
Yes, we remove them from the student network.
Why was more attention and testing not performed on Q transfer?
Thank you for pointing this out! See [a] in the main rebuttal above.
I thank the authors for their responses and discussion. I agree that the authors should be careful about phrasing to avoid a possible "overclaim" for Attention Transfer vs Transfer Learning. I also support the revised visualizations. Overall I think this work has very interesting implications about transfer learning in ViTs and I maintain my original rating in support of accepting this work.
The paper introduces attention transfer as an alternative to fine-tuning in Vision Transformers (ViT), separating intra-token and inter-token operations to enhance feature extraction and combination. Attention transfer comprises attention copy and attention distillation. Attention distillation matches the performance of fine-tuning on ImageNet-1K while learning different representations, which facilitates feature ensembling. The method is verified across various model scales and datasets.
Strengths
- The paper first points out the sufficiency of attention maps from pre-training and provides extensive analyses of attention transfer.
- The proposed attention distillation is verified across various model scales and datasets.
Weaknesses
- Compared with full fine-tuning, the proposed attention transfer method introduces additional computation cost from the forward pass of the teacher model. Therefore, to make the article more comprehensive, the authors should consider including a baseline of distillation on features.
- Transfer tasks (e.g., Tables 2 and 3) should contain the results of training from scratch.
- There is an important result in L169, attention copy from a fine-tuned model (85.6), which does not appear in any table or figure. There should be an extra table or figure containing this result alongside other similar settings (e.g., attention copy from the pre-trained-only model) for better comparison.
Questions
In Table 1, transferring Q is better than transferring Q,K (which is equivalent to transferring the attention map); does this mean distillation on features would achieve better performance? If so, is attention really important and sufficient for transfer tasks?
Limitations
Yes.
Thank you for the feedback on our work! Below, we respond to your questions and comments:
In Table 1, transferring Q is better than transferring Q,K (which is equivalent to transferring the attention map); does this mean distillation on features would achieve better performance?
See the main rebuttal above for a detailed discussion of the Q-copying result. For feature distillation, we followed your suggestion and tried distilling the residual stream features from a pre-trained MAE ViT-L. In our preliminary results, we obtain a downstream accuracy of 81.3 on ImageNet-1k. This is significantly lower than the 85.7 that can be achieved through fine-tuning or attention distillation. This makes sense: the features learned during self-supervised pre-training are not directly well-suited for classification, so trying to match them can hurt performance. CKA analysis of the features (Fig. 5) supports this hypothesis – the fine-tuned MAE does well by significantly changing the features in the latter half of the network. Overall, transferring attention appears to do much better than distilling the features.
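A minimal sketch of what such a feature-distillation baseline could look like (an illustrative assumption about the objective, not necessarily the exact recipe behind the 81.3 result):

```python
import torch
import torch.nn.functional as F

def feature_distill_loss(student_feats, teacher_feats):
    # student_feats / teacher_feats: lists of [B, L, D] residual-stream activations taken
    # after corresponding blocks; the frozen teacher's activations serve as fixed targets.
    losses = [F.mse_loss(fs, ft.detach()) for fs, ft in zip(student_feats, teacher_feats)]
    return torch.stack(losses).mean()
```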
Transfer tasks (e.g., Table 2 and 3) should contain the results of training from scratch.
In our main rebuttal above, we have updated Table 2-3 with the results of training from scratch.
There should be an extra table or figure containing this result alongside other similar settings (e.g., attention copy from the pre-trained-only model) for better comparison.
Thanks for the suggestion! We will add a new subsection and table that shows the effect of the teacher (pre-trained vs fine-tuned) on attention copy and distillation. We hope this will be more organized for readers.
Their rebuttal has addressed most of my concerns. I find this paper interesting and believe it will benefit the community. As a result, I have increased my score to weak accept.
However, I am still not convinced that the attention map is the primary factor for the high performance, rather than the Q features. More qualitative or quantitative evidence would support the claims presented in the paper.
We're glad that we have addressed most of your concerns! Below, we clarify the result on copying Q:
I am still not convinced that the attention map is the primary factor for the high performance, rather than the Q features.
- We emphasize that the only way that copying Q affects the student model is by making it easier to learn a useful attention map (a rough sketch follows this list). Thus, the strong performance of copying Q should support our surprising findings on the importance of the pre-trained attention maps.
- During the rebuttal period, we experimented with distilling only the Q activations from the teacher to the student. We found that this achieves 85.0 when distilling from a pre-trained MAE ViT-L, which is worse than copying Q. We suspect that this is because the student must first learn to match Q and then learn the K that creates a good attention map. Only then does the Q-distillation attention map provide reasonable guidance to learn good features.
- Q distillation does not match the 85.7 that Attention Distillation achieves. This supports our hypothesis that copying Q does well because it allows the model to slightly modify the pre-trained attention maps to better match the downstream task. Indeed, Attention Copy does well when the teacher maps are well-suited for the downstream task: copying from a fine-tuned MAE ViT-L achieves the same 85.6 accuracy (L169).
- In our original response to Reviewer p7aT, we reported the result of distilling the features from a pre-trained MAE model. This achieved an accuracy of 81.3, significantly lower than what Attention Transfer achieves, which further corroborates our story that features are not as important as previously thought.
- We will add these new results and discussion on copying Q to the paper, around L217.
- Finally, Figure 7, 10, and 11 visualize the attention maps learned by attention distillation and qualitatively show that it matches the teacher’s attention maps well for the layers that are distilled. This links the pre-trained attention maps to the high downstream task performance.
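A rough sketch of the Q-copying setup referenced in the first point above (an illustrative reading that assumes the teacher's per-layer query activations are injected; the names and shapes are not the authors' code):

```python
import torch

def q_copy_attention(x_student, q_teacher, w_k, w_v, num_heads):
    # q_teacher: [B, heads, L, head_dim] query activations from the frozen teacher's forward
    # pass on the same image. K and V come from the student, so the teacher's only influence
    # is through the attention map softmax(Q_t K_s^T / sqrt(d)).
    B, L, D = x_student.shape
    split = lambda t: t.view(B, L, num_heads, -1).transpose(1, 2)
    k, v = split(x_student @ w_k), split(x_student @ w_v)
    attn = (q_teacher @ k.transpose(-2, -1) / q_teacher.shape[-1] ** 0.5).softmax(dim=-1)
    return (attn @ v).transpose(1, 2).reshape(B, L, D)
```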
Overall, we believe that our findings on the sufficiency of attention maps are already quite surprising and useful for the research community. We have run extensive experiments and ablations and hope that our paper provides valuable insights that potentially motivate the next generation of pre-training and fine-tuning algorithms.
This paper investigates how transferring attention patterns from a pre-trained ViT to a student affects the student's downstream performance. By applying the attention copy strategy, the paper shows that when the pre-training dataset and downstream dataset are the same, the trained student can achieve performance superior to a student trained from scratch. Moreover, it largely recovers the performance of a pretrained-and-finetuned model. Further, the authors propose an attention transfer (distillation) scheme. With the attention distillation scheme, the student can achieve performance comparable to or even better than the pretrained-and-finetuned model. The authors provide extensive experiments to verify the effectiveness of attention transfer and show when attention transfer works.
Strengths
- The paper is well-written and easy to follow.
- The findings are meaningful and interesting to an extent.
- Extensive experiments are conducted.
Weaknesses
- The findings are somewhat similar to previous works which apply attention transfer in ConvNets. The only difference is that in ViTs, attention can be explicitly represented, which eases the operation of attention transfer.
- The authors seem to over-claim, e.g., "offer a potential alternative to the decade-long practice of fine-tuning pre-trained vision models". The empirical results in the paper cannot sufficiently support this claim. For example, we only see comparable results in one setting, where the model is pretrained on ImageNet (unsupervised/self-supervised pretraining) and finetuned on ImageNet. For other settings, like out-of-distribution tasks, detection tasks, etc., attention transfer does not work as well as fine-tuning.
- The experiments which study the effect of transferring a subset of Q, K, V do not seem to support the main claim of this paper. The results show that transferring Q is the best of all, so why choose to transfer Q and K (i.e., the attention map)? This implies that transferring the attention pattern may not be the key to the superior downstream performance, although transferring attention patterns sounds reasonable and interpretable.
- For Tables 2-6, it is strange to see only partial results in one table. It would be better to show the results for "training from scratch", "fine-tuned", "copy", and "distill".
Questions
Please see the weaknesses section.
Limitations
The authors adequately addressed the limitations.
We’re glad that you found our work well-written, with extensive experiments and interesting findings. Below, we respond to your questions and comments:
The findings are somewhat similar to previous works which apply attention transfer in ConvNets. The only difference is that in ViTs, attention can be explicitly represented, which eases the operation of attention transfer.
Previous works on ConvNets have mainly looked at transferring different properties of the features, like spatial feature magnitudes. They have also mainly been conducted in the knowledge distillation paradigm, where a task-specific (not pre-trained) downstream teacher is distilled into a smaller student. Our work differs from prior work in a few ways. First, as you point out, ViTs allow us to explicitly decouple the attention patterns from the features that they combine. Second, we extensively investigate the pre-training/fine-tuning paradigm, and we surprisingly find that the attention maps, by themselves, are often sufficient to achieve the same downstream performance. This result is completely new and calls into question the “feature learning” story that typically motivates pre-training in vision.
The authors seem to over-claim something, e.g., "offer a potential alternative to the decade-long practice of fine-tuning pre-trained vision models"
You’re right – we intended to say that attention transfer achieves surprisingly good results, and we didn’t mean to imply that attention transfer was already sufficiently developed to be a practical alternative to fine-tuning. We tried to make this clear in L327-330 in the conclusion. We will clarify our wording in the introduction and write “with further research, attention transfer could be a potential alternative to fine-tuning pre-trained vision models.”
The results show that transferring Q is the best of all, so why choose to transfer Q and K (i.e., the attention map)? This implies that transferring the attention pattern may not be the key to the superior downstream performance, although transferring attention patterns sounds reasonable and interpretable.
This is a great question! See [a] in the main rebuttal above.
For Tables 2-6, it is strange to see only partial results in one table. It would be better to show the results for "training from scratch", "fine-tuned", "copy", and "distill".
For some of our tables, we had run either “copy” or “distill” due to compute limitations. During the rebuttal period so far, we have been running more experiments to ensure that Tables 2-6 are comprehensive – see [b] in the main rebuttal above.
Thanks to the authors for the detailed response. However, I still feel concerned about the novelty and the conclusions of this work. The effectiveness of transferring Q implies, to an extent, that attention may not be the key to the surprising results of attention transfer.
The results are interesting and the story is generally good. But it is still not convincing to me that attention is the underlying key to the performance though it is intuitive. The authors show transferring attention works but transferring Q also works. Is it possible attention is not the key factor? Further evidence should be provided beyond performance numbers, to show the relationship between attention transfer and the performance.
Based on the above reasons, I still think the current version is not acceptable.
The authors show transferring attention works but transferring Q also works. Is it possible attention is not the key factor?
- We emphasize that the only way that copying Q affects the student model is by making it easier to learn a useful attention map. Thus, the strong performance of copying Q should support our surprising findings on the importance of the pre-trained attention maps.
- During the rebuttal period, we experimented with distilling only the Q activations from the teacher to the student. We found that this achieves 85.0 when distilling from a pre-trained MAE ViT-L, which is worse than copying Q. We suspect that this is because the student must first learn to match Q and then learn the K that creates a good attention map. Only then does the Q-distillation attention map provide reasonable guidance to learn good features.
- Q distillation does not match the 85.7 that Attention Distillation achieves. This supports our hypothesis that copying Q does well because it allows the model to slightly modify the pre-trained attention maps to better match the downstream task. Indeed, Attention Copy does well when the teacher maps are well-suited for the downstream task: copying from a fine-tuned MAE ViT-L achieves the same 85.6 accuracy (L169).
- In our original response to Reviewer p7aT, we reported the result of distilling the features from a pre-trained MAE model. This achieved an accuracy of 81.3, significantly lower than what Attention Transfer achieves, which further corroborates our story that features are not as important as previously thought.
- We will add these new results and discussion on copying Q to the paper, around L217.
- Finally, Figure 7, 10, and 11 visualize the attention maps learned by attention distillation and qualitatively show that it matches the teacher’s attention maps well for the layers that are distilled. This links the pre-trained attention maps to the high downstream task performance.
Overall, we believe that our findings on the sufficiency of attention maps are already quite surprising and useful for the research community. We have run extensive experiments and ablations and hope that our paper provides valuable insights that potentially motivate the next generation of pre-training and fine-tuning algorithms.
The authors propose a novel perspective on the utility of pretraining vision transformers by demonstrating that the actual features and representations learned during pre-training are not crucial. Instead, they find that simply re-using the self-attention from pre-training (specifically, the way information flows between tokens) is sufficient for models to learn high-quality features from scratch and achieve performance comparable to pre-trained models.
To support their claim, the authors introduce methods called attention copy and attention distillation, which transfer the attention from a pre-trained teacher ViT to a student ViT by either copying or distilling the attention maps. This approach allows the student model to learn its own features while still benefiting from the pre-trained attention patterns. The authors also highlight several drawbacks of previous works that heavily rely on pre-trained features and representations. They point out that the conventional fine-tuning approach may not be as effective under distribution-shift settings, where the pre-trained features might not generalize well. In contrast, their attention transfer method provides a more robust alternative that maintains high performance even when the distribution shifts.
Through systematic experiments, the authors examine various aspects of their findings, particularly focusing on the sufficiency of attention maps. They provide evidence that attention patterns alone can guide the learning process effectively, thus questioning the necessity of the entire pretraining paradigm.
Strengths
The following are some strengths of this work
- The problem is well motivated and has also been discussed in previous works. Not specifically the distillation approach, but reusing attention maps has been of interest not only in the vision community but has also been widely studied in NLP.
- I found that the paper is beautifully written, it basically shows the efforts that the authors took in trying to explain each aspect of their work. They keep the language simple and easy to understand, with concise explanations together with proper visualizations and plots where necessary.
- The analysis of different components, such as transferring attention from different layers, different heads, the CKA analysis, etc., is very well thought out, and it was a joy to read through the findings. So I thank the authors for this and encourage them to do this kind of analysis in all their future works as well.
- The experiments on different tasks such as image classification, model robustness etc. show the effectiveness of the approach.
Weaknesses
The following are some queries:
- In Figure 5, it is surprising to see that attn-copy has the least correlation with the fine-tuned model compared to attn-distill. In attn-distill, from layer 20-24 the correlation increases much more than for the pretrained model, which is not the case with attn-copy. Do the authors have any intuition on why this is the case?
- In general, with the CKA computation, I think it would be more interesting to understand the correlation across the features at different layers of the model (a minimal CKA sketch is given after this list). I would refer the authors to the work in [51], where the authors show a correlation plot of features across every layer before and after their method is applied. This would help understand how attn-copy and attn-distill affect the representations learned by the model. The authors can take a look at Figure 2 and Figure 4 in [2*] for reference.
- Continuing from above, it would also be interesting to see this correlation across different heads after the model is pretrained with attn-copy and attn-distill. In [1*], the author shows that there exists high correlation across attention heads in ViTs, so it would be interesting to see if attn-copy or attn-distill mitigates this.
[1*] https://github.com/sayakpaul/probing-vits?tab=readme-ov-file#visualizing-mean-attention-distances
[2*] Zhou et al., Refiner: Refining Self-attention for Vision Transformers, arxiv 2021
- Do the authors have the same observation for attn-copy and attn-distill when using a ViT-L pretrained in a supervised setting? Also, I'm curious to know if the same observation holds for methods pretrained using self-distillation approaches such as DINO, BYOL, iBOT or distillation approaches like SimSiam, etc.
- I think I might be missing something, but the visualization of attention from the [CLS] token in Figure 7 seems to show that attn-distill has worse attention than attn-copy at the deeper and intermediate layers. The localization of the object also seems to be poor. Can the authors please comment on this?
- I would also like to see the comparison with different state-of-the-art methods that use ViT-L as their backbone in tasks like classification, object detection and robustness.
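For concreteness, a minimal linear-CKA sketch of the layer-wise correlation mentioned in the CKA point above (illustrative only; not taken from the paper or the cited works):

```python
import torch

def linear_cka(x, y):
    # x: [N, D1], y: [N, D2] -- e.g. pooled token features of one layer over N images.
    # Returns ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after mean-centering.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    return (y.T @ x).norm() ** 2 / ((x.T @ x).norm() * (y.T @ y).norm())
```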
Questions
There have been works, such as [51], that illustrate the performance of copying attention across intermediate layers of the network. However, I would urge the authors to also include the following works for completeness, which show that copying attention works well in NLP:
[3*] Xiao et al., Sharing attention weights for fast transformer, IJCAI 2019
[4*] Wang et al., Evolving attention with residual convolutions, ICML 2021
[5*] Ying et al., Lazyformer: Self attention with lazy update, arXiv 2021
Limitations
Yes, the authors discuss the limitations.
We’re really glad that you liked the motivation, analysis, and writing of our paper! We respond to your comments below:
Fig. 5: In attn-distill from layer 20-24, the correlation increases much more than the pretrained model, which is not the case with attn-copy. Do the authors have any intuition on why this is the case?
Our main hypothesis is that this is because we transferred all 24 layers for Attention Copy but only distilled 18 layers for Attention Distillation. This means that Attention Distillation is more flexible in how it combines the features for the last 6 layers, so it can find a strategy similar to the fine-tuned MAE. In contrast, Attention Copy is constrained to be more similar to the pre-trained MAE.
Do the authors have the same observation for attn-copy and attn-distill when using a ViT-L pretrained in a supervised setting? Also, I'm curious to know if the same observation holds for methods pretrained using self-distillation approaches such as DINO
We are running this right now! We hoped to have the results by now but had an issue with our cluster. We will post an update below with the results as soon as possible.
the visualization of attention from the [CLS] token in Figure 7 seems to show that attn-distill has worse attention than attn-copy at the deeper and intermediate layers. The localization of the object also seems to be poor.
Attention Distillation only really deviates from the pre-trained attention maps at the later layers of the network. This makes sense, since we don’t distill the last 6 layers’ attention maps. Furthermore, the “noisy” attention pattern can sometimes be quite useful in Vision Transformers, since they often tend to store information in low-entropy regions of the background (see “ViTs Need Registers” [6*]). The fine-tuned MAE shows similar “noisy” patterns in its later layers, which we can see after fixing a small issue with our attention map visualizations: since we follow the standard practice of using global average pooling, which averages the representations at all spatial locations, the CLS token representation is not used after the 24th layer. This means that its attention map has no signal to improve. We fix this by now showing the pattern after the 23rd layer instead. The examples in the 1 page PDF show that the fine-tuned MAE also exhibits the same patterns in the later layers as Attention Distillation.
I would also like to see the comparison with different state-of-the-art methods that use ViT-L as their backbone in tasks like classification, object detection and robustness.
To the best of our knowledge, fine-tuned MAE is the SOTA ViT-L model without using extra data on ImageNet classification and the OOD robustness benchmarks. For object detection, the SOTA ViT-B-based model using ImageNet-1k is ViTDet [32] with 56.0 and 48.0. Note that the detection results cannot be directly compared with ours due to further architectural modifications within ViTDet on top of the ViT backbone.
I would urge the authors to also include the following works for completeness, which have shown that copying attention works well in the domain of NLP
Thank you for suggesting these NLP papers – we will definitely add them in our related works section!
[6*] Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2023). Vision transformers need registers. arXiv preprint arXiv:2309.16588.
I thank the authors for the responses. I think they have answered most of my queries and am satisfied with them. I'm looking forward to the results of the DINO experiment. I keep my initial rating.
Thanks
Here are the results of fine-tuning, Attention Copy, and Attention Distillation from a DINO ViT-B/16 pre-trained model. Note that our DINO fine-tuning recipe achieves an accuracy of 83.2, higher than the 82.8 that has been achieved in previous papers. Overall, we find similar results as we originally reported in Table 5 with MoCo (another representation learning method based on self-similarity), as well as FLIP (a vision-language contrastive method).
|  | tune | copy | distill |
|---|---|---|---|
| DINO | 83.2 | 82.2 | 82.8 |
We thank the reviewers for their time, effort, and feedback on our paper. To recap, reviewers appreciated various strengths of our work:
- Significance: “well motivated” (7AbG), “very interesting implications for ViT pretraining and finetuning” (9FPV), “potential to boost the performance of self-supervised ViTs further” (9FPV), “helps to explain some of the properties of MAE” (9FPV), “findings is meaningful and interesting” (qaog)
- Experiments: “extensive experiments” (qaog), “joy to read through the findings” (7AbG), “their analysis is quite comprehensive” (9FPV), “extensive analyses” (p7aT)
- Clarity: “beautifully written” (7AbG), “well-written and easy to follow” (qaog), “clearly presented” (9FPV)
Next, we provide general comments for some shared topics of discussion. We will address individual concerns and questions by responding to each review.
[a] High performance from copying Q (Table 1) [qaog, p7aT, 9FPV]
We do observe higher performance (85.6) from copying the self-attention queries Q than from Attention Copy (85.1), but lower performance than Attention Distillation (85.7).
First, we note that copying $Q$ does not transfer any features – the student network only “sees” the result of using $Q$ to compute the attention map $\mathrm{softmax}(QK^\top/\sqrt{d})$. Thus, copying $Q$ solely provides structure to the student attention maps. This is consistent with our story that the attention patterns are sufficient to guide networks to high downstream performance.
Our hypothesis is that copying $Q$ does well because it gives the model flexibility to change the attention patterns to be more suitable for the downstream task. Directly copying the entire attention map is too inflexible, which is why attention distillation works better, as it can deviate from the teacher’s attention maps.
Overall, we mainly focus on transferring the entire attention map since it’s a clean way to split the network’s computation (it contains all inter-token communication). To be thorough, we are now training models with $Q$-distillation, which encourages the student self-attention queries to be close to those of the teacher. We have had some cluster problems but should have the results within a few days.
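One plausible form of such a $Q$-distillation objective (an assumption for illustration; the exact loss used in these runs is not specified here) is a per-layer penalty on the student queries,

$$\mathcal{L}_{Q\text{-distill}} = \sum_{l \in \mathcal{S}} \big\lVert Q^{(l)}_{\text{student}} - Q^{(l)}_{\text{teacher}} \big\rVert_2^2,$$

where $\mathcal{S}$ is the set of distilled layers.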
[b] Updated Tables 2-6 [qaog, p7aT, 9FPV]
We update Tables 2-6 so that they all contain these models: trained from scratch, MAE fine-tuned, Attention Copy, Attention Distillation. We will update the main paper with these results as well.
Table 2:
| Pre-training data | tune | copy | distill | scratch |
|---|---|---|---|---|
| ImageNet-1k | 85.7 | 85.1 | 85.7 | 83.0 |
| COCO | 85.2 | 83.1 | 84.6 | 83.0 |
Table 3:
| evaluation data | tune | copy | distill | scratch |
|---|---|---|---|---|
| iNat 2018 | 79.9 | 71.8 | 74.1 | 64.3 |
| iNat 2019 | 83.8 | 77.9 | 80.0 | 66.2 |
Table 4:
| out-of-distribution evaluation | tune | copy | distill | scratch |
|---|---|---|---|---|
| ImageNet-A | 56.5 | 48.9 | 54.3 | 32.0 |
| ImageNet-R | 59.6 | 57.5 | 56.8 | 51.9 |
| ImageNet-S | 45.2 | 43.1 | 42.9 | 38.0 |
| ImageNet-V2 | 76.4 | 75.5 | 75.9 | 72.4 |
Table 5:
| pre-training method | tune | copy | distill |
|---|---|---|---|
| MAE | 85.7 | 85.1 | 85.7 |
| MoCo-v3 | 84.0 | 82.5 | 83.3 |
| FLIP | 87.4 | 86.6 | 86.1 |
| none | 83.0 | 72.7 | 76.3 |
Table 6:
| model | scratch | tune | copy | distill |
|---|---|---|---|---|
| ViT-B | 82.5 | 83.6 | 82.0 | 83.4 |
| ViT-L | 83.0 | 85.7 | 85.1 | 85.7 |
| ViT-H | 83.0 | 86.9 | 86.1 | 86.3 |
This paper proposes a transformer attention distillation scheme to transfer knowledge to a student model trained from scratch, as an alternative to fine-tuning the teacher model itself.
The paper is found to be well written and easy to follow, the idea interesting and well motivated, the analysis well thought out, and the experiments extensive, showing the effectiveness of the approach.
There are a number of weaknesses identified:
- it is not clear if it is the attention or the query features that is the primary factor
- the approach is overclaimed as an alternative of fine-tuning, especially given its increased cost
- attention distillation is similar to previous work
The authors addressed most concerns successfully in the rebuttal. Only one review remains negative on the basis of point 1 above, but the AC finds this point adequately addressed in "Author Rebuttal by Authors". The other three reviews are positive.
It is recommended to accept the paper as a poster. However, some additional feedback from the AC follows on points 2 and 3 above, which the authors are recommended to follow in revising the paper:
On point 2, Section A.2 shows memory and time per iteration of the proposed approach, which is higher than fine-tuning. Total epochs are given for the proposed approach in Tables 12 and 13, but not for fine-tuning. It is necessary to know the total cost of the proposed approach vs. fine-tuning for all experiments and tasks. Since fine-tuning is cheaper, the authors specify some alternative advantages of the proposed approach in their response to Reviewer 9FPV. These should be used to update the motivation of the paper. The third advantage ("Sharing weights can incur security risks") is not clear, because to compute teacher attention maps one still needs to share weights. It is also suggested to remove "Who Needs Features?" from the title, since the authors admit that the proposed approach will not be a direct replacement for fine-tuning.
On point 3, the authors should consider the following existing work for example:
- Wang et al. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. NeurIPS 2020.
- Li et al. Neural attention distillation: Erasing backdoor triggers from deep neural networks. ICLR 2021.
- Wang et al. Attention distillation: self-supervised vision transformer students need more guidance. BMVC 2022.
- Dhar et al. Learning without memorizing. CVPR 2019.
It is clear that attention distillation is not new as a method and has been applied to both ConvNets and Transformers for different tasks and purposes. This leaves as the main contribution the use of attention distillation as an alternative to fine-tuning. The paper should thus be repositioned, including a paragraph on attention distillation in the related work. In addition, these papers include more variants, e.g. applying distillation per head or averaging attention maps first. It is recommended to add such variants to the ablations.
One final point on the quality of attention maps. It is discussed in “ViTs Need Registers” (which is mentioned by the authors in the discussion) that the quality of attention maps in transformers is low in general. Registers are a way to improve it. Other examples are [51], which argues that quality is better in the intermediate layers, and SimPool (ICCV 2023), which discards the CLS token. It is recommended to consider such improved attention maps in additional experiments.