Attention Temperature Matters in ViT-Based Cross-Domain Few-Shot Learning
We find a phenomenon in ViT-based CDFSL: multiplying ViT's attention map by a small temperature (even 0) can consistently improve performance. We delve into this phenomenon to provide an interpretation and propose an effective method for CDFSL.
Abstract
Reviews and Discussion
This paper deals with ViT-based cross-domain few-shot learning. It first presents an observation regarding ViT-based models: in cross-domain learning, the attention module in ViT seems to hurt the model's performance in the target domain. Based on this observed phenomenon, the paper proposes fixes that introduce an additional temperature to reduce the effect of the attention map (essentially making the attention map uniform).
Strengths
This paper identifies an interesting phenomenon of ViT-based models in the cross-domain setting.
Weaknesses
- The whole paper is based on the phenomenon (that the attention module in ViT hurts cross-domain performance) analyzed in Sec 2. However, I do not think you can draw such conclusions convincingly from the analysis in Sec 2.
Sec 2 uses the simplest possible method to train the model parameters, i.e., the model (including a backbone and a classification head) is learned on the source data, then the backbone is directly used in the target domain with prototype-based classification. This is the simplest baseline method of model training. There are many more sophisticated model-training methods (e.g., MAML, ProtoNet, etc.) designed for few-shot learning. It is not clear whether the phenomenon observed in this paper is due to this simplistic model-learning method (i.e., perhaps the phenomenon would not appear with a more advanced learning method).
- This paper deals with few-shot cross-domain learning. But it is not clear whether the observed phenomenon is due to "few-shot" or due to "cross-domain". The analysis in Sec 2 only considers the "cross-domain" aspect and does not provide anything about the "few-shot" aspect. It is not entirely clear whether this phenomenon is due to the domain shift between the source and target domains, or due to the semantic difference between the classes observed in the source and target domains. I would like to see a more rigorous analysis that separates the two confounding factors.
- It is possible that the phenomenon is merely an overfitting issue, i.e., with the learnable attention module (query, key matrices) in ViT, the model simply overfits to the source domain easily. If that is the case, the observed phenomenon may not be that interesting (i.e., it is just overfitting), and the proposed solution simply reduces the overfitting by regularization.
- This paper proposes an observation about some phenomenon, but it does not provide a convincing explanation of why this phenomenon happens or matters. This paper simply says that the attention module (learned on source data) does not generalize to the target domain. But I think this is too generic -- it is well-known that if you train a model on the source domain, the model usually does not generalize to a target domain (this is the main motivation for research in domain adaptation). It seems the conclusion of this paper is just a specific manifestation of this well-known fact.
- In the previous papers used for comparison (e.g., [10,49]), results are reported on more datasets. What is the reason for omitting those additional datasets in the experiments? This makes me wonder whether the results are cherry-picked.
Questions
- Please explain how you can be sure that the phenomenon in the paper is not just overfitting, and is instead something more interesting about ViT.
- Please explain why not all datasets used in [10,49] are included in the experiments.
- What exactly is the role of "few-shot" in this paper? It seems most of the paper discusses the domain-shift issue.
Limitations
None
We truly appreciate your valuable comments. In the following, we respond to the concerns.
W1. Implementing with other baseline methods.
We implemented our method with ProtoNet. Our method continues to be effective with this learning method and improves the performance on the target domain.
| Method | Crop. | Euro. | ISIC. | Ches. | Ave. |
|---|---|---|---|---|---|
| ProtoNet | 93.59 | 86.92 | 46.15 | 25.68 | 63.09 |
| ProtoNet + Ours | 95.07 | 89.46 | 48.64 | 27.14 | 65.08 |
W2. "cross-domain" vs. "few-shot", and "domain shift" vs. "semantic difference"
"domain shift" vs. "semantic difference":
We would like to point out that the miniImageNet performance is evaluated on the novel classes of this dataset, while the model is trained on the base classes of miniImageNet. Since there is no overlap between the base and novel classes, a semantic difference exists between these two sets of classes. As the model shows correct attention on these source-domain novel classes, this phenomenon is mainly due to the domain shift rather than the semantic difference.
To verify this, we construct datasets with source-domain semantic and target-domain style, by swapping the amplitude of the source-domain image with the amplitude from target-domain images and maintaining the source-domain phase (following WaveSAN). Then, we measure the performance of the baseline method and ours.
| 5w5s | src semantic + src Style | src + Crop. style | src + Euro. style | src + ISIC style | src + Ches. style | Avg. target style |
|---|---|---|---|---|---|---|
| BL | 97.94 | 79.10 | 64.24 | 71.70 | 60.27 | 68.83 |
| Ours | 97.56 | 81.40 | 70.92 | 76.11 | 62.90 | 72.83 |
From this table, we can see
(1) By swapping the style from the source domain to the target domains, the performance consistently decreases, verifying the rationale of the constructed pseudo-target domains.
(2) With the source-domain (src) style, applying our method slightly decreases the performance. Meanwhile, with the target-domain (Crop., Euro., ISIC, Ches.) styles, our method significantly improves the performance.
Since the semantics are preserved in the constructed pseudo-target domains, we can conclude that the domain shift is the cause of the phenomenon in our paper.
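For readers who wish to reproduce this construction, below is a minimal PyTorch-style sketch of the amplitude-phase swap described above. This is our illustrative code, not the authors' (or WaveSAN's) implementation; the function name and the absence of any low-frequency masking are our own assumptions.

```python
import torch

def make_pseudo_target(src_img: torch.Tensor, tgt_img: torch.Tensor) -> torch.Tensor:
    """Combine source-domain phase (semantics) with target-domain amplitude (style).
    Both inputs are float tensors of shape (C, H, W) with the same spatial size."""
    src_fft = torch.fft.fft2(src_img)
    tgt_fft = torch.fft.fft2(tgt_img)
    src_phase = torch.angle(src_fft)        # keep the source-domain phase
    tgt_amplitude = torch.abs(tgt_fft)      # borrow the target-domain amplitude
    mixed = tgt_amplitude * torch.exp(1j * src_phase)
    return torch.fft.ifft2(mixed).real      # pseudo-target image: src semantics + tgt style
```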
"Cross-domain" vs. "few-shot":
Since the ineffective attention is observed by directly extracting features from target-domain images, without integrating any specific "few-shot" adaptation method, this phenomenon is mainly due to the "cross-domain" aspect.
Moreover, by applying our method, we also contribute to the "few-shot" aspect. In the paper, we conclude that the query-key parameters tend to overfit the source domain, which makes the model discriminative but less transferable. For the "cross-domain" part, we resist the learning of these parameters during the source-domain stage to avoid overfitting. Conversely, for the "few-shot" part, we encourage their learning during the target-domain finetuning to better fit the target datasets. To further verify the contribution to the "few-shot" part, we compare finetuning the query-key parameters against finetuning the non-query-key ones.
| 5-way 5-shot | Crop. | Euro. | ISIC. | Ches. | Avg. |
|---|---|---|---|---|---|
| FT non-QK | 95.91 | 90.27 | 54.05 | 27.66 | 66.97 |
| FT QK | 96.01 | 90.36 | 54.30 | 28.31 | 67.25 |
We can see the advantage of finetuning (FT) the query-key (QK) parameters, even though the query-key parameters are far fewer than the non-query-key ones. In all, our contribution covers both the "cross-domain" and "few-shot" parts.
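For clarity, the sketch below shows one way the two finetuning groups in the table above could be formed. It is only an assumption about the parameter naming: a timm-style ViT, for example, fuses Q, K, and V into a single `qkv` projection, in which case the query/key weights would have to be sliced out rather than matched by name.

```python
def split_qk_parameters(vit, qk_keywords=("to_q", "to_k")):
    """Split a ViT's parameters into query-key (QK) and non-QK groups,
    so that only one group is finetuned on the target-domain support set."""
    qk_params, other_params = [], []
    for name, param in vit.named_parameters():
        group = qk_params if any(key in name for key in qk_keywords) else other_params
        group.append(param)
    return qk_params, other_params

# e.g. "FT QK":     optimizer = torch.optim.AdamW(qk_params, lr=1e-4)
# e.g. "FT non-QK": optimizer = torch.optim.AdamW(other_params, lr=1e-4)
```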
W3. Overfitting
We would like to point out that although overfitting is a common phenomenon in machine learning, our contributions are:
- We are the first to unveil that the query-key parameters are more prone to overfitting than other parameters, especially under large domain gaps.
- We are the first to find that temperature adjustment and abandonment can effectively handle this overfitting problem in ViT.
To the best of our knowledge, we have not found papers with similar contributions. Could you please provide specific papers for a better comparison?
W4. Why this phenomenon happens and matters
Please refer to the global rebuttal Q1; we have conducted a theoretical analysis.
W5. Comparison of more datasets
We would like to point out that many works (e.g., MEM-FS [40], TIP'23) only conduct experiments on the four datasets we experimented with in the paper. Here, we also report our performance on all 8 datasets as in [10, 49]. Please refer to the global rebuttal Q2. As can be seen, our model still achieves state-of-the-art average performance on all 8 datasets.
Other questions
Q1: Please refer to W3.
Q2: Please refer to W5.
Q3: Please refer to W2.
Thanks to the authors for providing these additional results. They helped address some of my concerns.
Regarding the "overfitting" issue, I do not see how the theoretical insight is relevant. It is well-known that adding a regularization will make the learning loss smooth, so the theoretical insight is just a reflection of this fact. So I think this paper basically just adds a regularization to address a standard overfitting issue.
Thanks for your response!
We would like to point out that the "overfitting" issue is reflected in the increased eigenvalues of $W_Q$ and $W_K$, as proved in (Chen, 2019). Our contributions to the theoretical analysis are:
(1) pointing out that query-key parameters show a higher tendency to overfit,
(2) linking the increased eigenvalue to domain robustness by the sharpness of loss landscapes, and
(3) verifying our method essentially handles the domain gap problem based on our theoretical and empirical analysis.
As for the mitigation of overfitting, our method is novel in its design and analysis, to the best of our knowledge, and achieves top performance. Indeed, our method can be understood as a kind of regularization, but "regularization" is a very general topic just like "deep learning" (could you please provide any specific papers for better comparison?). Therefore, our work is still insightful and novel to the current research.
Thanks again for your comments. If you still have further comments, please feel free to tell us!
Thank you very much for reading our response! May I know if our response has addressed your questions? Do you have any new questions after reading our response? Please feel free to let us know if you have any questions. We are very much looking forward to having the opportunity to discuss this with you.
Hi Reviewer XXJ8:
Sorry to bother you again. Since there are only a few hours left in the discussion phase, we wish to know whether your concerns have been addressed or whether you have any additional concerns. Looking forward to your reply.
Thanks to the authors' efforts in answering my questions. I am raising my rating to 5.
This paper investigates the application of Vision Transformer (ViT) to Cross-Domain Few-Shot Learning (CDFSL). It thoroughly analyzes, through experiments, how attention affects CDFSL performance and identifies a method to enhance ViT's transferability across domains by adjusting the attention mechanism through temperature scaling. Despite reducing the attention maps to uniform distributions, this adjustment effectively improves ViT's performance on target-domain datasets, mitigating issues with the query-key mechanism under large domain gaps. The proposed approach focuses on limiting the learning of query-key parameters while promoting that of non-query-key parameters, resulting in consistent outperformance across multiple CDFSL datasets compared to current state-of-the-art methods.
Strengths
- The analysis of how attention affects the CDFSL results is comprehensive.
- The proposed method obtains the SOTA performance.
Weaknesses
- Can uniform attention be understood as an operation like randomly initializing attention?
- Is there any deep theoretical explanation for the conclusion that "Compared with the query-key attention, the non-query-key components in ViT tend to be more transferable but less discriminative than the query-key components"?
- "which inspires us to improve the generalization of ViT by encouraging the learning of non-query-key parts and resisting the learning of query-key ones." in lines 174-175. CDFSL requires both transferability and discriminability. Why resist the learning of the query-key parts? If their learning is resisted, can the authors still guarantee excellent performance in the source domain? Or does this paper not focus on source-domain performance even during the training phase, but only on transferability?
- The method is not novel enough.
- An alternative would be to randomly downgrade the query-key attention to a uniform map after the source-domain training and adjust it during target-domain inference. What is the difference between this alternative and the proposed method? What are the advantages of the proposed method?
- There are too few comparison methods, and it is also necessary to compare with some methods that use CNN as the backbone to show the superiority of the proposed method in ViT.
Questions
Please answer and address the above-mentioned problems.
Limitations
In the paper, the authors mention that "We discuss the limitations of the work in the appendix". However, I could not find any limitation discussion in the appendix. This work does not contain any negative social impact.
We truly appreciate your valuable comments. In the following, we respond to the concerns.
W1. Random attention initialization
Since random attention initialization tends to produce a uniform attention map, our attention abandonment method can be viewed as producing a randomly initialized attention. However, the difference from re-initialization is that we do not abandon the trained parameters and retrain them. Instead, we simply "skip" the query-key attention to resist the learning of its parameters, while the trained parameters themselves are kept. To the best of our knowledge, we are the first to propose this design.
W2. Theoretical explanation
Please refer to the global rebuttal Q1; we have conducted a theoretical analysis.
W3. Source-domain performance
In Section 2, we conclude that the query-key attention tends to be discriminative but less transferable, while the non-query-key parts tend to be transferable but less discriminative. Since the backbone network is pretrained on the ImageNet dataset, the query-key parts are already discriminative enough. Therefore, it is not necessary to further train them on the source dataset (miniImageNet, a subset of ImageNet) to make them more discriminative. Instead, it is more important to prevent the query-key parts from becoming too discriminative (i.e., less transferable); therefore, we resist their learning.
On the contrary, since the non-query-key parts are transferable but less discriminative, we need to encourage their learning by abandoning the query-key attention, as in the proposed method. Indeed, for the CDFSL task, the target-domain performance is more important, but we do not sacrifice the source-domain performance. To verify this, we measure the 5-way 5-shot accuracy on the source dataset for both the baseline method and ours. The accuracy is 97.94 for the baseline method and 97.56 for ours, which is only a marginal decrease.
W4. Novelty
We would like to point out that our novelties are:
(1) We are the first to unveil the importance of the attention temperature in ViT-based CDFSL methods and interpret it as a remedy for the poorly transferred query-key attention.
(2) We are the first to encourage the learning of non-query-key parameters by abandoning the query-key attention through a random binary temperature, as recognized by Reviewer G6Ww and D8qR.
Could you please provide papers with similar contributions, so that we can address your concerns more effectively?
W5. Difference of source and target domain operation, and the advantage of our method.
During the source-domain training, we randomly multiply a temperature of 0 or 1 to the attention (before softmax) in each forward pass, so that the query-key attention map randomly switches between a uniform map and the original attention map. If the attention map is downgraded to a uniform map, the query-key attention will be abandoned, so that the query-key parameters will not be trained, thereby resisting the learning of these parameters and encouraging the learning of others.
During the target-domain stage, we set the temperature for each attention map to a fixed value (0.3), because our model is not sensitive to the temperature choice, as validated in Appendix Fig.8.
As validated in experiments, our method effectively improves target-domain performance with comparable source-domain performance, which is simple but effective, as recognized by Reviewer G6Ww and D8qR.
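To make the two stages concrete, here is a minimal PyTorch-style sketch of an attention block with the binary source-domain temperature and the fixed target-domain temperature. The class name, the `p_abandon` probability, and the per-forward sampling granularity are our own assumptions for illustration, not the authors' released code.

```python
import math
import torch
import torch.nn as nn

class TemperatureScaledAttention(nn.Module):
    """Multi-head self-attention whose pre-softmax logits are scaled by a temperature tau.
    tau = 1 keeps the original query-key attention; tau = 0 degrades it to a uniform map."""

    def __init__(self, dim, num_heads=12, p_abandon=0.5, tau_eval=0.3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.p_abandon = p_abandon   # probability of sampling tau = 0 during source training
        self.tau_eval = tau_eval     # fixed small temperature for target-domain inference

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        logits = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if self.training:
            # source-domain attention abandonment: tau is randomly 0 or 1 per forward pass
            tau = 0.0 if torch.rand(()).item() < self.p_abandon else 1.0
        else:
            # target-domain attention adjustment: a fixed small temperature (e.g. 0.3)
            tau = self.tau_eval
        attn = (logits * tau).softmax(dim=-1)   # tau = 0  ->  uniform attention map
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

In this sketch, when tau is 0 the gradient through the query-key projections vanishes for that pass (the value and output projections still receive gradients), which is one way "resisting the learning of the query-key parameters" can be realized.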
W6. More comparisons with CNN-based methods.
We list more comparisons with state-of-the-art works as follows, where we can see our method achieves the best performance. Please refer to the global rebuttal PDF Tab.1 and Tab.2.
Thank you. For Question 5, I mean that, as the authors mentioned, "Compared with the query-key attention, the non-query-key components in ViT tend to be more transferable but less discriminative than the query-key components", i.e., the query-key attention is less transferable and more discriminative. What happens if we cold-start the query-key attention during target-domain inference?
Thanks for your response! We report the results of the cold-start finetuning that you suggested as follows.
| Method | Crop. | Euro. | ISIC. | Ches. | Ave. |
|---|---|---|---|---|---|
| Baseline | 94.24 | 88.62 | 45.72 | 25.66 | 63.53 |
| Baseline + cold-start finetuning | 84.45 | 80.12 | 42.49 | 23.66 | 57.68 |
| Baseline + finetuning | 94.93 | 90.41 | 48.94 | 25.96 | 65.06 |
| Ours | 95.53 | 90.13 | 53.09 | 27.72 | 66.62 |
| Ours + cold-start finetuning | 88.40 | 84.29 | 46.65 | 24.19 | 60.88 |
| Ours + finetuning | 96.66 | 90.82 | 54.91 | 28.03 | 67.61 |
As can be seen, the performance of the cold-start strategy is still lower than the regular finetuning method. This is because by re-initializing the query-key parameters randomly, the knowledge in these parameters transferred from the ImageNet pretraining is totally abandoned. In the paper (Tab.4, last row), we have verified that such transferred knowledge is still useful in the target domain, although it is not as good as that in the source domain. Therefore, totally abandoning such knowledge and learning it on the target domain cannot lead to a higher target-domain performance.
In contrast, our method can effectively resist the overfitting of the query-key parameters in the source-domain stage, and take advantage of the ImageNet pretraining of these parameters, thereby achieving higher performance.
If you have any other questions, please feel free to let us know! Thanks!
Thank you. From the table, the results of "+ cold-start finetuning" are much lower than the baseline and "+ finetuning". In the target domain, the query-key attention can fully utilize its discriminative ability, but the degree of performance degradation seems to suggest that the knowledge carried by the query-key attention is also highly transferable. Does this imply that the motivation behind the proposed method (i.e., compared with the query-key attention, the non-query-key components in ViT tend to be more transferable but less discriminative) is not as meaningful?
Additionally, if that is the case, why would it be better to avoid learning the query-key attention during the miniImageNet source-domain training phase? Overfitting doesn't seem that convincing. Aside from the amount of data, are there significant differences in transferable knowledge between ImageNet and miniImageNet? Typically, after pretraining on ImageNet, training on miniImageNet and then fine-tuning on the target data should yield better results than without miniImageNet or ImageNet pretraining, especially for the CDFSL problem.
Thanks! Note that the target-domain finetuning utilizes only scarce data; therefore, the lower performance of the cold start is mainly due to training the query-key parameters from scratch, not a violation of our motivation. In fact, the result of re-initializing the query-key parameters without any target-domain finetuning is reported in Fig.1b, where the performance is higher than the baseline.
For the regular finetuning, although the transferred baseline query-key parameters are not perfect, they at least provide a starting point that is better than nothing for the target-domain few-shot finetuning. Therefore, the regular finetuning achieves better performance, and the low cold-start performance does not contradict our motivation.
For the source-domain training, our method aims to not only reduce further overfitting of the query-key parameters, but also improve the discriminability of those non-query-key ones, which helps the model generate attentions more based on the non-query-key parameters. Since the non-query-key parameters are generally more transferable than those query-key ones, our model achieves better transferability against domain gaps (verified in Fig.6a).
By the way, the ideal target-domain finetuning should begin with parameters that contain only the domain-irrelevant information and no domain-specific information. Although re-initializing query-key parameters can produce the uniform attention, this operation is not equivalent to generating better query-key parameters that have only the domain-irrelevant information and throwing away all domain-specific ones. Instead, it also throws away all domain-irrelevant information. Actually, it is difficult to directly generate better parameters (i.e., less domain-specific information but more domain-irrelevant information) to replace current query-key parameters. This is also why our method, which softly drives the model to be transferable, is needed.
Thanks again for your response! If you have any questions, please feel free to let us know!
Thank you very much for reading our response! May I know if our response has addressed your questions? Do you have any new questions after reading our response? Please feel free to let us know if you have any questions. We are very much looking forward to having the opportunity to discuss this with you.
Hi Reviewer ujLR:
Sorry to bother you again. Since there are only a few hours left in the discussion phase, we wish to know whether your concerns have been addressed or whether you have any additional concerns. Looking forward to your reply.
This paper investigates the effectiveness of the attention mechanism in Vision Transformer (ViT) for solving cross-domain few-shot learning tasks. It finds that the traditional query-key attention operation is more on the side of discriminability than transferability in their trade-off balance, thus leading to downgraded performance for the target domain when there are large domain gaps. Based on a series of related analyses, this paper proposes an attention abandonment operation for source domain training and an attention adjustment operation for target domain finetuning.
Strengths
- This paper is well organized, easy to follow and free of typos.
- Starting from an observed phenomenon relating target-domain classification accuracy to attention temperature, this paper conducts comprehensive quantitative analyses of the effectiveness of different attention strategies in terms of the trade-off between discriminability and transferability. The results fully support the authors' claim that the non-query-key components in ViT tend to be more transferable but less discriminative than the query-key parts.
- All the operations newly designed in this paper (Source-Domain Attention Abandonment and Target-Domain Attention Adjustment) are supported by the above analyses and are thus technically sound.
- This paper may provide some insights for developing novel attention operations for ViT based cross-domain few-shot learning methods.
- Extensive experiments on four datasets demonstrate the effectiveness of the proposed method and also showcase its superiority over other state-of-the-art approaches.
Weaknesses
I personally like this paper; there are just some small weaknesses:
- In Figures 2(a), 3(a), and 5(a), the CLS token attention values located at the top-left corner are too small. Moreover, the authors have not provided vector diagrams, so I also cannot see these values clearly by zooming in. Could the authors enlarge the CLS boxes in these figures? In addition, it may be better to provide a color bar for these heat maps to clearly show the range of their values.
- Some references are not correctly formatted; the conference or journal names as well as page numbers are missing. For example, [4], [10], [13], [14], [15], [23], [29], [30], [47], [48], [49], etc.
Questions
- In the section 4.2 implementation details, the authors propose to “set the attention of the CLS token to 0 for blocks whose ID is greater than 4” during the target-domain evaluation phase. Is this operation performed before or after the softmax function? Is the purpose of conducting this operation to make the image tokens unaffected by the CLS token? Could the authors explain the mechanism and impact of this operation?
- In the top plot of Figure 5(b), I have noticed that the CLS token values in the 2nd-10th blocks are near zero on all the datasets. This seems strange because the entire line does not fluctuate within this interval. Does this mean that the CLS token and the image patch tokens are orthogonal to each other?
Limitations
The authors have discussed the limitations.
Thank you for your appreciation of our work!
W1. Fig. 2a and Fig. 5a
We have added a color bar to the attention maps and enlarged the class token box for clearer observation. Please refer to the global rebuttal PDF Fig.1 and Fig.2.
W2. Formatted references
We have checked the references and completed the conference or journal names as well as page numbers.
[4] Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, and Xiaolong Wang. Meta-baseline: Exploring simple meta-learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9062–9071, October 2021.
[10] Yuqian Fu, Yu Xie, Yanwei Fu, and Yu-Gang Jiang. Styleadv: Meta style adversarial training for cross domain few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24575–24584, June 2023.
[13] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
[14] Shell Xu Hu, Da Li, Jan Stühmer, Minyoung Kim, and Timothy M. Hospedales. Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9068–9077, June 2022.
[15] Yanxu Hu and Andy J. Ma. Adversarial feature augmentation for cross-domain few-shot classification. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 20–37, Cham, 2022. Springer Nature Switzerland.
[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[24] Hanwen Liang, Qiong Zhang, Peng Dai, and Juwei Lu. Boosting the generalization capability in cross domain few-shot learning via noise-enhanced supervised autoencoder. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9424–9434, October 2021.
[30] Cheng Perng Phoo and Bharath Hariharan. Self-training for few-shot transfer across extreme task differences. In International Conference on Learning Representations, 2021.
[47] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Image BERT pre-training with online tokenizer. In International Conference on Learning Representations, 2022.
[48] Xiang Zhou, Yuan Zeng, and Yi Gong. Learning to scale temperature in masked self-attention for image inpainting. arXiv preprint arXiv:2302.06130, 2023.
[49] Yixiong Zou, Yicong Liu, Yiman Hu, Yuhua Li, and Ruixuan Li. Flatten long-range loss landscapes for cross-domain few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23575–23584, June 2024.
Q1. Set CLS token to 0
This operation is conducted after the softmax operation. In Fig.5b, we validate that the attention value of the CLS token tends to be larger on the target datasets than on the source dataset in the first few blocks; therefore, we manually suppress the influence of the CLS token in the target-domain attention map.
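For clarity, a minimal sketch of this rectification is given below. The assumption that the CLS token sits at index 0 of the key dimension, and the block-index threshold of 4, follow the description above, while the function name and tensor layout are ours.

```python
import torch

def suppress_cls_attention(attn: torch.Tensor, block_idx: int, threshold: int = 4) -> torch.Tensor:
    """Post-softmax rectification: for blocks with index greater than `threshold`,
    zero the attention paid to the CLS token (assumed to be key index 0).
    attn: attention map of shape (batch, heads, num_tokens, num_tokens), already softmaxed."""
    if block_idx <= threshold:
        return attn
    attn = attn.clone()
    attn[..., 0] = 0.0   # remove the CLS token's influence on every query token
    return attn
```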
Q2. Near 0 attention on the CLS token
Intuitively, this is because the model focuses more on the image tokens, so the attention on the CLS token significantly decreases. Theoretically, we can observe that Fig.5b and Fig.3b are similar, with both the CLS token attention (or value) near 0. This is because, according to our theoretical analysis (please refer to Q1 in the global response), our model decreases the eigenvalues of the query-key parameters (i.e., $W_Q$ and $W_K$), therefore preventing the query-key attention from amplifying the perturbation on the representation [49]. As a result, the influence of the query-key attention on the attention's input feature decreases, so the output attention becomes more similar to the activation map shown in Fig.3b. Since Fig.3b shows near-0 activation on the CLS token, the attention map of our model shows similar behavior.
Thank you for your reply!
For weakness 1, it would be better for authors to provide high-resolution vector diagrams, so that the readers can zoom in for details.
For question 1, the authors "manually resist the influence of the CLS token in the target-domain attention map." Does this operation significantly influence the final performance? If so, it seems to cause an unfair comparison, since such an operation is not the main contribution of this paper and has not been applied to the other competing ViT-based methods.
Thanks for your suggestions!
For the images, we promise to include high-resolution images in the final version.
For the manual rectification of the CLS token, the influence is only marginal; we report the results with and without this operation as follows.
| 5-shot | CropDiseases | EuroSAT | ISIC | ChesX | Avg. |
|---|---|---|---|---|---|
| w/ CLS operation | 95.53 | 90.13 | 53.09 | 27.72 | 66.62 |
| w/o CLS operation | 95.42 | 90.01 | 52.81 | 27.70 | 66.49 |
The reason we include this operation is to be consistent with our finding that the model tends to show high attention to the CLS token. However, since our model already generates better attention maps, the manual rectification is not strongly needed (L230-231, verified in Fig.8 of the appendix), so this operation only marginally influences the performance.
Regarding the fair-comparison concern, since the observation of the excessive attention on the CLS token is also our contribution, the manual rectification of the CLS token's attention value can also be understood as a part of our method, although it only marginally influences the performance. Therefore, the comparison with current works is still fair.
Thanks again for your response! We promise to carefully polish our paper for the final version. If you have any questions, please feel free to ask us!
Thank you for your reply! My concerns have been addressed, so I maintain my rating score of "accept".
This paper studies the effectiveness of Vision Transformer (ViT) for cross-domain few-shot learning (CDFSL). In particular, the authors find that by simply multiplying the attention in ViT blocks by a temperature, the target-domain performance consistently increases, even though the attention map is downgraded to a uniform map. The authors investigate this phenomenon through several experiments and propose a simple and efficient solution for boosting ViT's transferability by resisting the learning of query-key parameters and encouraging that of non-query-key ones. Extensive experiments demonstrate the effectiveness of the proposed method for CDFSL.
Strengths
- The paper studies an important problem in few-shot learning, CDFSL, and investigates the effectiveness of ViT for this problem, which is of high practical importance.
- The paper conducts a detailed analysis of attention temperature for addressing the target-domain shift through attention-map visualizations and quantitative analysis.
- The proposed solution is simple and effective for addressing the domain-gap issue in CDFSL with ViT.
Weaknesses
- The paper mostly focuses on empirical analysis while lacking theoretical insights into why query-key features tend to be discriminative but less transferable.
- The proposed method needs to retrain the ViT in Source-Domain Attention Abandonment, which can be costly.
- In Target-Domain Attention Adjustment, a pre-defined hyper-parameter is needed, which can be difficult to tune in CDFSL.
Questions
- What is meant by "non-query-key structures"?
- In Source-Domain Attention Abandonment, would the training strategy degrade the model performance on source data?
- Further explanation is needed for Equation (6) to clarify how it can improve learning better attention for CDFSL.
- The authors primarily consider the 5-way 5-shot scenario. What about increasing the number of samples in each class? For instance, in a 5-way 20-shot problem, would the proposed strategy still be effective?
Limitations
N/A
We truly appreciate your valuable comments. In the following, we respond to the concerns.
W1. Theoretical insights.
Please refer to the global rebuttal Q1; we have conducted a theoretical analysis.
W2. Retrain ViT on the source domain
We would like to point out that following current works [10,49], the size of the source-domain dataset (miniImageNet) is not large since this dataset is only a subset of ImageNet and contains only 64 classes with 600 samples in each class. The training is conducted on a single RTX3090 GPU for around 5 hours, which is affordable.
Moreover, we also decrease the number of samples in each class to verify how the source dataset size influences the performance.
| Sample Number | - | 100 | 200 | 300 | 400 | 500 | 600 |
|---|---|---|---|---|---|---|---|
| 5-way 5-shot accuracy | 63.53 | 65.29 | 65.69 | 65.86 | 65.95 | 66.00 | 66.00 |
We can see the performance is not sensitive to the source dataset size, as we can achieve comparable performance even if the dataset size is halved to 300 samples in each class. In all, the retraining of the ViT on the source dataset is affordable.
W3. Pre-defined hyper-parameter of Attention Adjustment
The hyper-parameter in the target-domain attention adjustment is the temperature, which is simply set to a fixed value (0.3) for all datasets. As shown in Fig.1b, the average performance plateaus when the temperature gets smaller. We also report the performance of our model w.r.t. the temperature as follows.
| Temperature | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 5-way 1-shot accuracy | 53.51 | 53.52 | 53.65 | 53.70 | 53.98 | 54.05 | 54.09 | 54.12 | 54.11 | 54.10 |
As can be seen, the performance plateaus when the temperature gets smaller than 0.5. We also evaluated the temperature sensitivity in the appendix Fig.8. Therefore our model is not sensitive to the specific choice of the target-domain temperature, i.e., this hyper-parameter is not difficult to tune.
Q1. Non-query-key structures
The non-query-key structure refers to the ViT backbone network without the query-key parameters, i.e., replacing the query-key attention with the identity attention, the cosine attention, or the average attention, as shown in Tab.2. We promise to refine this description.
Q2. Degrade the source-domain performance
The source-domain performance is decreased from 97.94 to 97.56 by applying our method, which is only a marginal decrease and is affordable.
Q3. Explanation of Eq. 6
Eq. 6 means to randomly multiply a temperature of 0 or 1 to the query-key attention in each block, and the probability of 0 is p. If the temperature is 0, then the attention is downgraded into average attention in Tab.2. If the temperature is 1, the original query-key attention is maintained.
In Section 2, we conclude that the query-key attention mechanism makes the model discriminative but less transferable, while the non-query-key ViT structures (the ViT network without the query-key parameters) tend to make the model transferable but less discriminative. Therefore, we randomly abandon the query-key attention to encourage the non-query-key attention (i.e., the average attention) during the source-domain training. As the source-domain training encourages the trained parameters to be discriminative but less transferable, this operation helps the non-query-key ViT structures become more discriminative and resists the learning of the query-key parameters to prevent them from becoming less transferable.
In all, the query-key part is more discriminative but less transferable, so we resist its training to prevent it from becoming less transferable, while the non-query-key part is more transferable but less discriminative, so we encourage its training to improve its discriminability. As a result, the overall attention on the target domain is improved, since the shortcomings of each part are overcome.
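As a small numerical illustration of why a temperature of 0 yields the average (uniform) attention, consider the toy snippet below; the logits are made up for illustration and the printed values are approximate.

```python
import torch

logits = torch.tensor([[2.0, -1.0, 0.5, 0.0]])          # toy pre-softmax logits for one query token
for tau in (1.0, 0.3, 0.0):
    print(tau, (logits * tau).softmax(dim=-1))
# tau = 1.0 -> ~[0.710, 0.035, 0.158, 0.096]  original, peaked query-key attention
# tau = 0.3 -> ~[0.386, 0.157, 0.246, 0.212]  softened attention
# tau = 0.0 ->  [0.250, 0.250, 0.250, 0.250]  uniform (average) attention
```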
Q4. Increase the number of shots
We report our performance with a larger number of shots below.
| 10-shot | CropDiseases | EuroSAT | ISIC | ChesX | Avg. |
|---|---|---|---|---|---|
| Baseline | 96.33 | 90.59 | 51.35 | 28.42 | 66.67 |
| Ours | 96.85 | 91.42 | 58.52 | 30.19 | 69.18 |
| 20-shot | CropDiseases | EuroSAT | ISIC | ChesX | Avg. |
|---|---|---|---|---|---|
| Baseline | 97.15 | 91.76 | 55.56 | 30.73 | 68.80 |
| Ours | 97.59 | 92.63 | 62.38 | 32.85 | 71.36 |
We can see our proposed strategy is still effective in these situations.
The rebuttal has addressed my concerns.
Thanks for your appreciation of our work! We will continue to polish our work in the final version!
We thank all the reviewers for their valuable input.
Q1. Theoretical insights
1. Our method reduces the sharpness of the model's loss landscape.
Theoretically, we analyze our findings through the sharpness of the loss landscape (Foret 2021). That is, each value of the model weights is viewed as a point in the weight space and corresponds to a loss value. The points and their loss values construct a loss landscape, where the source-domain-trained model is viewed as a minimum in the landscape. The sharper the minimum is, the more vulnerable the model will be against domain gaps [49]. Specifically, the sharpness is measured as
$$\max_{\|\epsilon\|_2 \le \rho} L_S(w + \epsilon) - L_S(w),$$
where $w$ refers to the model weights and $\epsilon$ refers to the perturbation with radius $\rho$. With this criterion, the generalization can be bounded as follows (Foret 2021).
Theorem 1. For any $\rho > 0$, $\delta \in (0, 1)$, and any distribution $\mathcal{D}$, with probability $1 - \delta$ over the choice of the training set $S \sim \mathcal{D}$,
$$L_{\mathcal{D}}(w) \le \max_{\|\epsilon\|_2 \le \rho} L_S(w + \epsilon) + h\left(\|w\|_2^2 / \rho^2\right),$$
where $h$ is a strictly increasing function.
Following this criterion, we measure the sharpness given perturbations on different model weights.
| Perturbed Weights | All | Query Key |
|---|---|---|
| Sharpness of Baseline | 7.1483 | 7.0679 |
| Sharpness of Ours | 5.9915 | 6.4024 |
We can see our method indeed reduces the sharpness of the loss landscapes, indicating better robustness against domain gaps [49]. Notably, the sharpness of the query-key parameters is also decreased, indicating we can effectively resist their overfitting to the source domain by the proposed method.
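For concreteness, the sketch below shows one way such a sharpness value could be estimated, using a single normalized gradient-ascent step to approximate the inner maximization of Foret et al. (2021). The exact protocol behind the numbers above is not specified here, so the function, its arguments, and the single-step approximation are our own assumptions.

```python
import torch

def estimate_sharpness(model, loss_fn, x, y, params, rho=0.05):
    """Approximate max_{||eps||_2 <= rho} L(w + eps) - L(w) over the chosen
    parameter subset `params` (e.g. all weights, or only the query-key weights)
    with one SAM-style normalized ascent step. Parameters must require gradients."""
    loss = loss_fn(model(x), y)
    base = loss.item()
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    grads = [g if g is not None else torch.zeros_like(p) for g, p in zip(grads, params)]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                                  # perturb the weights
        perturbed = loss_fn(model(x), y).item()
        for p, e in zip(params, eps):
            p.sub_(e)                                  # restore the original weights
    return perturbed - base
```

Passing all parameters or only the query-key weights as `params` would correspond to the two columns of the table above.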
2. Why the query-key mechanism increases the sharpness
The query-key attention is calculated as $\mathrm{softmax}\big((xW_Q)(xW_K)^\top / \sqrt{d}\big)$. Compared with other calculations in ViT, only the query-key attention involves a term that is quadratic in the representation $x$, i.e., $x W_Q W_K^\top x^\top$. Therefore, any perturbation added to the representations [49] is likely to be amplified by the multiplication of $W_Q$ and $W_K$. According to current studies (Chen, 2019), a well-trained model tends to increase the eigenvalues of its weights if overfitting happens. Since these eigenvalues influence this term quadratically, through both $W_Q$ and $W_K$, the query-key attention is more likely to amplify perturbations added to the representations and parameters. As the perturbation is finally forwarded to the classification loss, the query-key attention increases the sharpness of the model, thereby making the model more vulnerable to domain shifts.
To solve this problem, our method resists the learning of the query-key attention, preventing $W_Q$ and $W_K$ from learning large eigenvalues, therefore reducing the sharpness and benefiting the transfer. To verify the decreased eigenvalues, we measure the product of eigenvalues as follows.
| | DINO Pretraining | Baseline Training | Ours |
|---|---|---|---|
| Average eigenvalue product of $W_Q$ and $W_K$ | 14.54 | 14.56 | 14.52 |
We can see the baseline training increases the eigenvalue product, while our method decreases it, verifying our theoretical analysis.
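The statistic in the table is not fully specified here; assuming it refers to the product of the largest eigenvalues (singular values) of $W_Q$ and $W_K$, averaged over blocks, a minimal sketch could look like the following (the function name, input format, and this reading of the statistic are assumptions).

```python
import torch

def avg_qk_eigenvalue_product(qk_weight_pairs):
    """Average over blocks of the product of the top singular values of (W_Q, W_K);
    one plausible reading of the 'eigenvalue product' statistic above."""
    products = [
        torch.linalg.matrix_norm(w_q, ord=2) * torch.linalg.matrix_norm(w_k, ord=2)
        for w_q, w_k in qk_weight_pairs
    ]
    return torch.stack(products).mean()
```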
References
Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021.
Chen, Xinyang, et al. "Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation." International conference on machine learning. PMLR, 2019.
Q2. Comparison of more datasets
We would like to point out that many works (e.g., MEM-FS [40], TIP'23) only conduct experiments on the four datasets we experimented with in the paper. Here, we also report our performance on all 8 datasets as in [10, 49].
| 5shot | CUB | Cars | Plac. | Plan. | Crop. | Euro. | ISIC | Ches. | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| StyAdv [10] | 95.82 | 61.73 | 88.33 | 75.55 | 94.85 | 88.57 | 47.73 | 26.97 | 72.45 |
| FLoR [49] | 96.18 | 61.75 | 89.23 | 72.80 | 95.28 | 90.41 | 49.52 | 26.71 | 72.74 |
| Ours | 96.28 | 64.26 | 89.25 | 73.24 | 95.53 | 90.13 | 53.09 | 27.72 | 73.69 |
| 1shot | CUB | Cars | Plac. | Plan. | Crop. | Euro. | ISIC | Ches. | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| StyAdv [10] | 84.01 | 40.48 | 72.64 | 55.52 | 81.22 | 72.15 | 33.05 | 22.92 | 57.75 |
| FLoR [49] | 84.60 | 40.71 | 73.85 | 51.93 | 81.81 | 72.39 | 34.20 | 22.78 | 57.79 |
| Ours | 85.48 | 43.45 | 74.49 | 52.58 | 84.02 | 74.35 | 34.92 | 23.19 | 59.06 |
As can be seen, our model still achieves state-of-the-art average performance on all 8 datasets.
Finally, we appreciate the inspiring comments again and will thoroughly revise the paper accordingly. We hope our explanations have answered your questions.
This paper explores the effectiveness of the attention mechanism in Vision Transformer (ViT) for cross-domain few-shot learning. Specifically, the authors discovered that by simply multiplying a small temperature to the attention in ViT blocks, the performance in the target domain consistently improves, even though the attention map is downgraded to a uniform map. Based on a series of related analyses, the paper proposes an attention abandonment operation for source domain training and an attention adjustment operation for target domain fine-tuning. Extensive experiments are conducted to demonstrate the effectiveness of the proposed method for cross-domain few-shot learning.
The paper received four reviews, two positive (D8qR and G6Ww) and two borderline (ujLR and XXJ8). The reviewers acknowledged that the paper finds an interesting phenomenon in ViT-based models, proposes a simple and effective solution for cross-domain few-shot learning, conducts comprehensive quantitative analyses, provides some insights for developing novel attention operations in ViT-based architectures, and achieves state-of-the-art performance. However, the paper is primarily empirical and lacks theoretical insights. After the rebuttal, one of the two borderline reviewers was convinced and upgraded their score from borderline reject to borderline accept.
Based on these considerations, the AC recommends acceptance and strongly suggests that the authors fully consider the reviewers' comments in the final version, including the additional analyses and experiments added during the rebuttal period, particularly the theoretical analysis and comparison of results on more datasets.