PaperHub
Overall rating: 6.3 / 10
Poster · 4 reviewers
Ratings: 7, 5, 6, 7 (min 5, max 7, std 0.8)
Confidence: 4.5
Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.3
NeurIPS 2024

A Closer Look at the CLS Token for Cross-Domain Few-Shot Learning

OpenReview · PDF
Submitted: 2024-05-08 · Updated: 2024-12-23
TL;DR

We find the CLS token naturally absorbs domain information, and propose to decouple domain information from the CLS token and adapt it for cross-domain few-shot learning.

Abstract

Keywords
Cross-Domain Few-Shot Learning · CLS Token · Vision Transformer

Reviews and Discussion

Review (Rating: 7)

In this paper, the authors found a new phenomenon that the CLS token used in Vision Transformer (ViT) absorbs the domain information in Cross-Domain Few-Shot Learning (CDFSL). On the basis of the findings, they proposed a novel CDFSL method that updates only the CLS token during the target training. A comprehensive analysis provided verification of the findings, and comparative experiments confirmed the validity of the proposed method.

Strengths

  • The paper is well organized and it is easy to follow the contents.
  • The findings that the CLS token absorbs the domain information in CDFSL sound new and worthwhile, and they will facilitate many future studies.
  • The findings on the CLS token are supported by a comprehensive analysis, so I feel this paper is highly reliable.
  • The paper has sufficient technical novelty: it proposes a novel CDFSL method based on the findings and confirms the method's validity.

Weaknesses

  • There seemed to be a lack of validation using models other than DINO.
    • Although the results of the validation using iBOT are reported in the Appendix, the improvement of the proposed method is not significant. I think this result alone does not sufficiently support that the findings can be broadly applicable to the CLS token in Vision Transformer in general.
    • I consider that they should conduct an analysis on CLIP-pretrained Vision Transformer, if possible. Since CLIP is one of the most representative pre-trained computer vision models today, I believe it is crucial to verify whether the findings of this study are applicable.
  • The performance difference when compared to existing methods does not appear to be very large. In some benchmarks, the difference is less than 1%.
    • In particular, the ablation study results also appeared to show little improvement in ChestX. Is there any possible explanation for this result?

Questions

  • The findings of this paper are very similar to the paper "Vision Transformers Need Registers" [6]. Is it possible to apply this method directly to CDFSL for comparison?

Limitations

The limitation is well stated in the manuscript, and there does not seem to be anything additional mentioned.

Author Response

Thank you for your appreciation of our work!

W1. Verification on other pre-trained models

Due to time limitations, we did not fully tune the model in our appendix submission. Here we report the fully-tuned performance of the iBOT and ViT-B models.

| iBOT | Crop. | Euro. | ISIC | Ches. | Avg. |
| --- | --- | --- | --- | --- | --- |
| BL | 81.17 | 72.71 | 31.44 | 22.56 | 51.97 |
| Ours | 82.47 | 73.83 | 32.87 | 22.88 | 53.01 |

| ViT-B | Crop. | Euro. | ISIC | Ches. | Avg. |
| --- | --- | --- | --- | --- | --- |
| BL | 82.97 | 72.06 | 34.19 | 22.60 | 52.95 |
| Ours | 83.43 | 74.42 | 35.78 | 22.98 | 54.15 |

​ As can be seen, the improvements are clearer than in the appendix, which further verifies the generalizability of our method.

​ We also train our model with the CLIP pretraining.

| CLIP | Crop. | Euro. | ISIC | Ches. | Avg. |
| --- | --- | --- | --- | --- | --- |
| BL | 77.33 | 63.30 | 29.84 | 21.11 | 47.89 |
| Ours | 78.55 | 64.88 | 30.95 | 22.25 | 49.16 |

W2. Why ChestX performance is lower

We would like to point out that all current works show low performance on the ChestX dataset, on which we already achieve the top performance. This dataset is difficult in two respects: (1) its domain gap to the source domain is among the largest of all target datasets, as validated in Fig.2a, where its domain similarity is low; (2) the semantic shift is much larger, as ChestX is a fine-grained classification task requiring expert knowledge [14]. That is, even an untrained human can hardly distinguish different chest diseases in an X-ray image, which means much less prior knowledge can be transferred from the source dataset [14]. As a result, in current works [10, 49], the performance improvement on ChestX is always less than 1%.

However, given the difficulty of ChestX, we still achieve the best performance, which demonstrates the effectiveness of our analysis and method.

Q1. Registers

​ Notably, our paper differs from registers [6] in:

(1) [6] finds that ViT needs registers because they help reduce artifacts in the feature map, whereas we find that the CLS token, although its location is similar to that of registers, naturally absorbs domain information, and we further interpret the reason behind it, which is not discovered in [6];

​ (2) We further propose a method for the CDFSL task to take advantage of the CLS token's characteristics, which is not included in [6].

Indeed, the training and location of registers are similar to those of the CLS token, so they could show similar behavior. To validate this hypothesis, we add registers [6] to the backbone network and report the performance as well as the domain similarity.

| Method | 5-way 5-shot Accuracy | Domain Similarity |
| --- | --- | --- |
| Baseline | 63.89 | 0.076 |
| Train w/ Register + Test w/ Register | 63.17 | 0.047 |
| Train w/ Register + Test wo/ Register | 64.73 | 0.101 |
| Ours | 66.10 | 0.655 |

As can be seen, such a phenomenon also exists for registers, which verifies our analysis and interpretation. However, the resulting improvement is much smaller than ours, which verifies our contribution.
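For readers who want to see concretely what "adding registers to the backbone network" involves, below is a minimal PyTorch sketch of prepending learnable register tokens to a timm-style ViT; the wrapper, attribute names, and `num_registers` value are our illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Illustrative wrapper that prepends learnable register tokens next to the CLS token.

    `backbone` is assumed to expose `patch_embed`, `cls_token`, `pos_embed`,
    `blocks`, and `norm` as in timm-style ViTs; this is a sketch, not the paper's code.
    """
    def __init__(self, backbone, embed_dim=384, num_registers=4):
        super().__init__()
        self.backbone = backbone
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        nn.init.trunc_normal_(self.registers, std=0.02)

    def forward(self, x):
        B = x.shape[0]
        tokens = self.backbone.patch_embed(x)                    # (B, N, D)
        cls = self.backbone.cls_token.expand(B, -1, -1)          # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.backbone.pos_embed
        regs = self.registers.expand(B, -1, -1)                  # (B, R, D), no positional embedding
        tokens = torch.cat([tokens[:, :1], regs, tokens[:, 1:]], dim=1)
        for blk in self.backbone.blocks:
            tokens = blk(tokens)
        tokens = self.backbone.norm(tokens)
        # Registers are dropped at readout; only the refined CLS token is returned.
        return tokens[:, 0]
```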

Comment

Thanks for the authors' response! All my concerns have been addressed by the rebuttal comments. This increased my confidence in my rating of this paper.

Review (Rating: 5)

This paper explores an intriguing phenomenon in Cross-Domain Few-Shot Learning (CDFSL) using Vision Transformers (ViT). The authors observe that randomly initializing the CLS token, instead of using source-domain pre-trained parameters, consistently improves target-domain performance. They attribute this to the CLS token naturally absorbing domain-specific information due to ViT's inherent structure, which manifests as low-frequency components in the Fourier space of images. To address this, the authors propose a novel method that decouples domain information in the CLS token during source-domain training and adapts the CLS token for efficient few-shot learning in the target domain. This approach aims to enhance the transferability and generalization of ViT in CDFSL tasks. The effectiveness of the proposed method is validated through extensive experiments on four benchmarks, demonstrating state-of-the-art performance.

Strengths

  1. This paper describes and analyzes the impact of CLS on CDFSL performance in detail. This is an aspect that is easily overlooked. The authors analyze the impact of CLS and propose solutions.
  2. The method achieved SOTA performance.

Weaknesses

  1. In line 39-40 “During the target-domain learning, this method finetunes the CLS token to efficiently absorb domain information, handling the few-shot learning problem.”, the few-shot learning (FSL) problem cannot be solved through “efficiently absorb domain information”. Normally we solve FSL by addressing the overfitting due to the limited labeled target data.
  2. How did the authors obtain the results of Tab. 1? Because the performance increase is under 1%, it would be better to show how the results are obtained, e.g., by following what the existing methods do and averaging over 600 results.
  3. Providing theoretical support for this paper will make the paper more credible and sufficient. Such as in line 84-88 “we directly fix the CLS token as random initialization for both the source and target domains (Tab. 1 a.2). We can see that by abandoning the learning of the CLS token, the performance is also improved from the baseline method, but is slightly lower than training but not loading it (Tab. 1 a.3). This means such information in the CLS token could be beneficial for the source-domain training.”, why a.3 is better than a.2 in Tab.1? It will be better if authors provide the corresponding theoretical analysis.
  4. In line 91-92 "Intuitively, since not loading the CLS token improves performances only under cross-domain scenarios, it is natural to doubt the CLS token's poisonous information as the domain information.", how did the authors obtain this conclusion (only under cross-domain scenarios)? The authors should compare the FSL performance changes between cross-domain and in-domain scenarios.
  5. In line 98-99 “The larger the CKA similarity is, the smaller the domain distance will be, and it means the model contains less domain information.”, please explain why “it means the model contains less domain information”?
  6. In line 100-101 "Not loading the CLS token can significantly increase the CKA similarity, indicating the CLS token contains domain information while other structures tend to capture domain-irrelevant information.", how do the authors obtain the conclusion that "other structures tend to capture domain-irrelevant information"?
  7. In Fig.3 (b), is the similarity computed between source- and target-domain data?
  8. The CLS token is fixed as the random initialization in the source-domain stage and fine-tuned in the target-domain stage. This means the CLS token does not work in the source-domain stage, so why introduce this token in this stage? Why not only introduce and train the domain tokens in the source-domain stage, and introduce the CLS token in the target-domain stage? Introducing the CLS token in the source domain seems unnecessary.
  9. In line 191-193 "Interestingly, we find treating each class as a pseudo domain could achieve the best performance", it means that every pseudo domain corresponds to a class. How, then, do the authors explain that the domain tokens learn domain-specific rather than class-specific information?
  10. The comparison method is not sufficient. Please compare with more existing SOTAs.
  11. There is no explanation of Tab.4 (f). What does it mean?
  12. The research on related work is incomplete. In recent years, there have been many CDFSL papers, such as "Deep Learning for Cross-Domain Few-Shot Visual Recognition: A Survey", "Free-lunch for cross-domain few-shot learning: Style-aware episodic training with robust contrastive learning", "Meta-fdmixup: Cross-domain few-shot learning guided by labeled target data", and "Enhancing Information Maximization with Distance-Aware Contrastive Learning for Source-Free Cross-Domain Few-Shot Learning", etc.

Questions

  1. Poor paper writing. Some descriptions in the article are not analyzed or cited. Please see the weakness raised above for details.
  2. The comparison method is not sufficient.
  3. The research on related work is incomplete.
  4. Please answer the above questions.

Limitations

In the paper authors mention that “We discuss the limitations of the work in the appendix”. However, there’s no limitation discussion in the appendix. This work does not contain any negative social impact.

Author Response

We truly appreciate your valuable comments. In the following, we respond to the concerns.

W1. Handling few-shot learning by absorbing domain information

​ We would like to point out that for the cross-domain few-shot learning (CDFSL) problem, one of the most important challenges is the domain gap between the source and the target domain. Therefore, an important task for the target-domain few-shot finetuning is to effectively adapt to the target domain. This is also a challenge due to the scarce training data, as domain information cannot be fully represented by training samples, i.e., the model could overfit to the target-domain training data instead of learning the target domain information, which is well handled by our analysis and method.

W2. The confidence interval of Tab.1

​ We also follow current works (e.g., [42]) to evaluate each model on thousands of episodes, as shown in Tab.4 and Tab.5 where the confidence intervals are included. Due to space limitation, the confidence intervals are abbreviated in Tab.1, and here we supplement them as follows.

Please see Tab.8 in the PDF for results.

W3. Why a.3 is better than a.2 in Tab.1

Please see Q2 in the global response.

W4. Comparison with in-domain performance to show domain information

​ We would like to point out that the in-domain performance is already included in Fig.1b. Specifically, loading the CLS token leads to the 5-way 5-shot accuracy of 92.8%, while not loading it gives a lower accuracy of 90.6%, which is different from that of target domains. That is the reason why we say "We think the CLS token contains the domain information".

W5. Why "the larger the domain similarity, the less the domain information"

Following current works (e.g., [26]), the domain similarity is measured by comparing the distance between two batches of images, where each batch is sampled from a single domain. We follow [7] to take CKA as the similarity function. Domain information makes the model overfit to the given domain. Suppose a model completely overfits to one domain; given images from other domains, the extracted features could be close to random noise, so the domain similarity would drop toward 0. Therefore, following current works (e.g., [26]), we hold that a larger domain similarity indicates less domain information.
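For concreteness, here is a minimal sketch of the linear-CKA similarity referred to above, computed between two equally sized batches of features (one per domain); the function and tensor names are illustrative, not the paper's code.

```python
import torch

def linear_cka(feat_a, feat_b):
    """Linear CKA between two feature matrices of shape (n, d); the two batches
    must have the same size n. Higher values = more similar representations,
    i.e. a smaller domain gap in the sense used above."""
    a = feat_a - feat_a.mean(dim=0, keepdim=True)   # center each feature dimension
    b = feat_b - feat_b.mean(dim=0, keepdim=True)
    cross = (a.T @ b).norm(p="fro") ** 2
    return (cross / ((a.T @ a).norm(p="fro") * (b.T @ b).norm(p="fro"))).item()

# Illustrative usage: one batch of refined CLS-token features per domain.
src_feats = torch.randn(256, 384)   # features of a source-domain batch
tgt_feats = torch.randn(256, 384)   # features of a target-domain batch
print(linear_cka(src_feats, tgt_feats))
```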

W6. Why do other structures tend to capture domain-irrelevant information?

​ Based on our experiments in Fig.2a, we can see that by randomizing the CLS token, the domain similarity increases significantly, which means other structures cannot easily extract domain-specific features without the help of the CLS token. This result indicates other structures tend to be more domain-agnostic than the CLS token.

W7. Similarity in Fig.3b

​ The similarity is calculated as the cosine similarity between the CLS token and the input patch tokens of the first block, which is the same as Fig.2b (L108).
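As an illustration of this computation, a short sketch of the per-patch cosine-similarity map follows; the shapes assume ViT-S/16 on 224×224 inputs, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def cls_patch_similarity_map(cls_token, patch_tokens, grid_hw=(14, 14)):
    """Cosine similarity between one CLS token (D,) and the patch tokens (N, D),
    reshaped to the patch grid so it can be shown as a heatmap."""
    sim = F.cosine_similarity(patch_tokens, cls_token.unsqueeze(0), dim=-1)  # (N,)
    return sim.view(*grid_hw)                                                # (H, W)

# Illustrative shapes for a 224x224 input with 16x16 patches (ViT-S/16).
cls = torch.randn(384)
patches = torch.randn(14 * 14, 384)
heatmap = cls_patch_similarity_map(cls, patches)
```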

W8. Why introduce CLS token during the source-domain stage

​ The CLS token fixed as random initialization can be viewed as a placeholder for the downstream few-shot finetuning.

​ During the source-domain stage, the domain tokens are added to the fixed CLS token to be fed into ViT. Therefore, the domain token is encouraged to learn domain information, while other structures and parameters are encouraged to learn domain-irrelevant information. Since the CLS token is fixed as random initialization, it would also be domain-irrelevant, which is therefore suitable for the downstream target-domain finetuning.

During the target-domain stage, the domain token is then abandoned, and the remaining structures and parameters (i.e., the random CLS token and the other parameters) are ideally domain-irrelevant, as validated in Fig.2a. Then, we unfreeze the CLS token and finetune it to learn the target-domain information. Therefore, the domain-irrelevant CLS token becomes specific to the target domain, and we also validate the effectiveness of this finetuning in Tab.5.
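To make the two stages easier to follow, here is a schematic PyTorch sketch of the scheme described above; the class and method names are our own illustration, not the actual implementation.

```python
import torch
import torch.nn as nn

class DecoupledCLS(nn.Module):
    """Schematic of the two stages described above (names are illustrative).

    Source stage: the CLS token stays frozen at its random initialization and a
    (pseudo-)domain token is added to it, so domain information flows into the
    domain tokens while the rest of the ViT stays domain-irrelevant.
    Target stage: the domain tokens are discarded and only the CLS token is
    finetuned on the few-shot support set.
    """
    def __init__(self, embed_dim=384, num_domains=64):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.cls_token.requires_grad_(False)              # frozen during source training
        self.domain_tokens = nn.Parameter(torch.zeros(num_domains, embed_dim))

    def source_cls(self, domain_ids):
        # CLS + domain token is fed to the ViT together with the patch tokens.
        return self.cls_token + self.domain_tokens[domain_ids].unsqueeze(1)   # (B, 1, D)

    def begin_target_finetuning(self):
        # Drop the domain tokens and unfreeze only the CLS token.
        self.domain_tokens.requires_grad_(False)
        self.cls_token.requires_grad_(True)

    def target_cls(self, batch_size):
        return self.cls_token.expand(batch_size, -1, -1)                      # (B, 1, D)
```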

W9. "Domain-specific" vs. "class-specific"

Please see Q1 in the global response.

W10. Comparison with more SoTAs

​ We list more comparisons with state-of-the-art works as follows, where we can see our method achieves the best performance.

Please see Tab.1 and 2 in the PDF for results.

W11. Tab.4f

Tab. 4f means adding the domain tokens as appended tokens (prompts) to the ViT's first block, instead of adding them to the CLS token. As can be seen, the performance is much lower, verifying that our design is better.

W12. Related work

Please see Q3 in the global response.

Other questions

Q1: Please refer to the above responses.

Q2: Please refer to W10.

Q3: Please refer to W12.

Q4. Please refer to the above responses.

Comment

Thank you. The authors have addressed most of my concerns. However, the writing of the paper needs improvement. For example:

I understand the authors' explanation about Question 1. However, the expression "During the target-domain learning, this method finetunes the CLS token to efficiently absorb domain information, handling the few-shot learning problem" is not rigorous. It would be better to update this expression in the new version.

For Question 5, I understand what the authors want to express; however, it seems it should be "the larger the domain similarity, the less the domain-specific information".

Comment

Thanks again for your suggestion. We will continue to polish our work in the final version!

Review (Rating: 6)

Based on the observation that pretrained ViT models perform better on cross-domain few-shot tasks when the cls-token is re-initialized, the authors hypothesize that this is due to the absorption of domain-specific information and consequently propose a modified training and inference scheme to combat this.

Strengths

Originality & Significance:

  • Motivation of the method based on a clear observation and backed up by intuition as well as corresponding analysis
  • Very interesting observation that is then leveraged to design a novel approach that improves results across various datasets

Quality:

  • Good range of experiments presented that mostly support the underlying intuition and argument of the paper
  • Experiments conducted with a good selection of alternate variants to gauge performance improvements of the main contribution across multiple datasets

Clarity:

  • Contributions and underlying motivation clearly stated in intro
  • The paper is mostly well written and easy to read and follow

Weaknesses

TL;DR: I do like the underlying idea, but have a range of questions and concerns that I'd like the authors to clarify & address.

  • Inconsistency in terms of stated pretraining on the source domain – raising some questions in terms of generalizability of results – please see question section below.
  • Unclear reasoning behind training setup – see below.
  • Unstated assumptions around prototypical method that make it hard to follow some of the technical setup and conclusions. See question section below.
  • Many of the presented results would benefit from background info/analysis, see below.
  • Some concerns around result interpretation and baseline comparison, see below.
  • Some minor inconsistencies in wording.

Questions

Main concerns, questions & potential improvements:

[Q1] Statements in l.27 as well as Equation 1 define a fully supervised setup, which is arguably very common. However, the authors then use a ViT pretrained using DINO – which is a self-supervised method and creates quite a different representation space, which can have a significant impact on downstream performance (e.g., see recent in-domain FSL works like [A]). Even though this is then followed by further supervised training (which in itself could be seen as fine-tuning), this raises some questions!

  • Have the authors investigated their approach with a fully-supervised pretraining? If so, what is the result there? Do your observations still hold?
    → (In addition to consistency with the motivation/pre-lim, this could add important additional insights!)
  • I’d also be curious how well the DINO pretrained method as well as a supervised ImageNet pretrained one would perform across Table 1, i.e. without the miniImageNet training.

([A] Hiller et al., Rethinking Generalization in Few-Shot Classification; NeurIPS 2022)

[Q2] Training a backbone that is already trained on ImageNet further on miniImageNet seems very odd, given that miniImageNet is essentially a smaller subset;

  • What is the reasoning behind this?
    → If supervised information shall be included, why not simply use a fully-supervised backbone (see before) or fine-tune using the entire ImageNet?
    → I would understand this scheme if meta-finetuning or another paradigm was used, but it is not, as far as I can see from the paper?

[Q3] The authors evaluate ‘prototype methods’ in Table 1 (& l.75), but do not state how their prototype is formed. Note that various methods exist, including using the refined cls-token, averaging the patch tokens, etc.
→ Some specification/description here would help the reader.

[Q4] I’d appreciate some more insights on the experiments presented in Fig 2(a) and Tab 1 around fixing the cls token to random init.

  • How is the performance of this on the actual source training task, i.e. does the model still train well? (what’s the gap if there is any);
  • Why do the authors think that this setup actually performs that well? Would you imagine the domain information is simply not learnt at all, or rather absorbed into the other tokens?
  • Does the model `fit’ to work with the specific random init? And what happens if you then change/re-initialize this cls-token at the downstream task – does it fall back to the ‘not load cls’ performance?
    → All these insights would provide the reader with a much better basis for interpretation of your findings!

[Q5] What does the similarity in terms of retrieval (i.e., the similarity map shown in Fig 2(b)) look like when using a randomly initialized cls token on the downstream datasets? Better/worse/the same as using the one(s) trained without your method?

[Q6] Fig 3 and the analyses are not necessarily convincing in my opinion. The blurry regions might cover a much larger distribution, which could be significantly easier to match to than a very high-frequency one, couldn't it?

[Q7] The best choice turns out to be one cls token for each class, which is obviously valid – it somewhat questions the `domain' information though, as it seems to mainly be a class identifier then.
While the efficacy on the cross-domain tasks is still valid, the interpretation and therefore generalization to other source datasets could differ significantly. E.g., would I need 1000 tokens if I were to train on ImageNet? (implications would be important to state for potential follow-up work)


Additional comments:
Potential misunderstanding of the word `doubt’:

  • l92. seems confusing, as you effectively state that the reader should doubt (i.e. question/not believe) that the cls token contains domain information – but this is pretty much the main motivation of the paper?
  • Same in caption of Figure 2 (b), and some other places

Limitations

Although the authors state in the checklist that limitations are addressed in the appendix, there is only one sentence that recognizes the limitation in terms of the used datasets; I’d like to see the authors properly discuss some potential limiting factors/considerations in terms of their actual algorithmic and architectural choice (e.g. pretraining influence; still a good choice even if domain gap is small(er)?; and others, if they are aware of any)



Post rebuttal update:

Most of my concerns have been addressed by the authors -- see rebuttal and comments;

I have increased my rating accordingly to weak accept to reflect the new insights & clarifications.

Author Response

We truly appreciate your valuable comments. In the following, we respond to the concerns.

Q1. DINO pretraining

​ The training paradigm of DINO pretraining follows current cross-domain few-shot learning (CDFSL) works [13,42].

​ 1. Why this setting?

​ Current works [A] have shown that unsupervised pretraining could show better generalization than supervised pretraining. Therefore, an unsupervised pretraining on ImageNet could help the model generalization, especially under large domain gaps. To verify it, we compare unsupervised pretraining and supervised pretraining below, and we can see the unsupervised ones show better target-domain performance.

​ 2. What if we tune on miniImageNet the model with full supervision on ImageNet?

​ For the ImageNet fully-supervised setting, both the training on ImageNet and miniImageNet use the supervised classification loss. However, since all images and labels in miniImageNet are covered by ImageNet, the tuning on miniImageNet will be difficult. Therefore, unsupervised pretraining will be more suitable.

​ To verify our method also fits other pretraining paradigms, we conduct experiments on other pretraining methods as follows.

Please see Tab.6 in the PDF for results.

​ Here, the CLIP pretraining can be viewed as a fully supervised pretraining. Since the pretraining of CLIP is not fully overlapped by miniImageNet, it is more suitable for validating the full supervision setting. As can be seen, our method and analysis still hold in this setting, including the ImageNet supervised pretraining and DINO without miniImageNet training.

Q2. First training on ImageNet then training on miniImageNet

Please refer to Q1 for why the model is first trained on ImageNet and then tuned on miniImageNet. We would like to point out that some works [14] have also shown that our baseline method has advantages in cross-domain transfer, which is why we chose this simple method as our baseline. To verify that our model also suits meta-learning-based baselines, we conduct experiments based on ProtoNet [4].

Please see Tab.4 in the PDF for results.

​ We can see that our model also improves this kind of baseline method.

Q3. Prototypes

​ We would like to point out that the prototype is defined in L61-62, Eq.2. Briefly speaking, the ViT features extracted from samples in each class are averaged as the prototype for each class.
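For completeness, a minimal sketch of this prototype computation (averaging the refined CLS-token features of the support samples per class); the shapes and names are illustrative.

```python
import torch

def class_prototypes(features, labels, num_classes):
    """Average the refined CLS-token features of the support samples per class.
    features: (num_support, D); labels: (num_support,) with values in [0, num_classes)."""
    protos = torch.zeros(num_classes, features.shape[1])
    for c in range(num_classes):
        protos[c] = features[labels == c].mean(dim=0)
    return protos

# 5-way 5-shot example: queries are then classified by their distance to the
# nearest prototype (e.g. cosine or Euclidean).
support_feats = torch.randn(25, 384)
support_labels = torch.arange(5).repeat_interleave(5)
prototypes = class_prototypes(support_feats, support_labels, num_classes=5)
```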

Q4. Fig.2a and Tab.1 around fixing the CLS token to random initialization

Please see Q2 in the global response for the explanation of a.1, a.2, and a.3.

​ If we re-initialize the CLS token during the source-domain training, the improvements also exist.

Please see Tab.3 in the PDF for results.

​ However, re-initializing the CLS token can be viewed as adding noise to the domain token, therefore harming the absorbed domain information, which then affects the domain-irrelevant information learned by other structures in ViT. As a result, the Re-Init performance is slightly lower.

​ Indeed, domain tokens are encouraged to be orthogonal to the fixed CLS token, which would drive the model to view the fixed token as a domain-agnostic token. But note that the random token is already agnostic enough to every domain even without training (as shown in Q5), therefore our training would not essentially drive the model to be more agnostic to that CLS token, i.e., our model is not bound to the specific value of the CLS token.

Q5. Similarity map in Fig.2b with randomly initialized CLS token

Please see Fig.1 in the PDF for results.

We use the same color bar as in Fig.2b. We can see the similarity is much lower, but a coarse contour of objects can still be observed in both the source and target domains, indicating good transferability in detecting object contours, because no domain information is in the random token. However, the contour detected by the random token is much worse than that detected by the CLS token shown in Fig.2a and Fig.5a, indicating that although the random CLS token can already roughly detect object contours, the learning of the CLS token strengthens this characteristic of attending to low-frequency components.

Q6. Fig.3 Analysis of low-frequency images

​ To verify it is the CLS token that tends to capture low-frequency components in images, we use random tokens to calculate the similarity map of low-frequency images, and find random tokens do not show the same results as the CLS token.

​ Specifically, we measure the activation ratio of the CLS token and the random token.

Please see Tab.5 in the PDF for results.

We take the top 10% or 20% values as examples. We can see the random tokens show a tendency to decrease the activation ratio, while the CLS token shows a tendency to increase the ratio, indicating that it is the CLS token that tends to be similar to the features of low-frequency images.

Q7. One CLS token for each class?

​ Please see Q1 in the global response.

Additional comments

Here, "doubt" was intended to mean "think/suspect", which is a misuse of the word. We promise to carefully revise the wording in the paper.

Limitations

​ The limitation of the CLS token analysis is that it gets less effective when the domain gap is smaller. The datasets used in the paper show large domain gaps with the source domain, which makes it very challenging for knowledge transfer [14]. Therefore, the domain information absorbed in the source domain is harmful to the target domain. However, when the domain gap is smaller, the information captured by the CLS token could be partly transferred to the target domain, and our analysis and method may only get smaller improvements in this situation.

Comment

I'd like to thank the authors for their responses and effort put into the rebuttal, I really appreciate it!

Most of my questions and concerns have been rectified; however, having re-read some parts of the paper, I do have to agree with reviewer aYYr that parts of the paper's writing would benefit from being improved (wording and grammar);

As stated in my initial review, I do like the underlying idea and think the authors provide valuable insights to the community. I hope the authors take the feedback into account and include clarifying material and insights (esp. global response Q2) into the revised manuscript.

I have increased my rating accordingly to weak accept to reflect the clarifications.


Re Q3: Prototype
Small clarification for completeness: You state that the prototypes have been defined in lines 61/62 -- My confusion stems from Fig. 1 (a): do you denote each patch embedding as a feature, or ONLY the refined cls token?
-> Hence the question what your "prototype" is -- is it the average across all refined cls-tokens of samples that belong to one class, or is the information of the patch embeddings also included? (both are common across different ViT-based works)
(-> Might be worth using a different colour to highlight the refined cls-token as well, as it's been pretty much invisible on my screen.)

Comment

Thanks for your appreciation of our work! We promise to carefully polish our paper in the final version.

Re Re Q3: Prototype

Only the refined CLS token is denoted as the feature, following the current works that utilize ViT. Therefore, the prototype is calculated as the average across all refined CLS-tokens of samples that belong to one class. We promise to use another color to highlight the output CLS token.

Thanks again for your valuable suggestions! If you have further questions, please feel free to tell us.

Review (Rating: 7)

This paper presents a novel approach to Cross-Domain Few-Shot Learning (CDFSL) by investigating the role of the CLS token in Vision Transformers (ViT) for knowledge transfer under great domain gaps. The authors identify an intriguing phenomenon where not loading the CLS token parameters improves target-domain performance. They delve into this by proposing a method to decouple domain information from the CLS token during source-domain training and adapt it for efficient few-shot learning in the target domain. The paper is well-structured, presenting a clear problem statement, methodology, and extensive experiments across four benchmarks.

Strengths

The paper addresses a significant problem in CDFSL, providing a new perspective on the role of the CLS token in ViTs. The methodology is innovative, with a clear rationale behind decoupling domain information from the CLS token. The experiments are comprehensive, covering multiple datasets and ablation studies that validate the effectiveness of the proposed approach. The paper is well-written, with a clear presentation of the problem, methodology, and results.

Weaknesses

The generalizability of this approach to other tasks beyond the ones tested is not fully discussed. In my understanding, other tasks involving domain gaps and finetuning would benefit from this method. Could the authors elaborate on how this approach could benefit related tasks, such as domain adaptation or domain generalization? This paper would benefit from a discussion on the computational efficiency of the proposed method compared to existing approaches.

Questions

The authors explain that the reason the CLS token absorbs domain information is that it lies in the input layer and does not change according to input images. This insight is reasonable and interesting. As far as I know, the register token ("Vision Transformers Need Registers", ICLR 2024) serves a similar role to the CLS token. Therefore, in my understanding, if this explanation holds, the register token should show a similar phenomenon to the CLS token. Could the authors conduct experiments to verify this?

Limitations

Could this approach be applied to other tasks such as domain generalization or domain adaptation? How much computational cost is added if the regular CLS token is replaced by the proposed domain tokens?

Author Response

We truly appreciate your valuable comments. In the following, we respond to the concerns.

1. How could this method benefit other tasks

Our method could also benefit other cross-domain tasks. To verify this, we conduct experiments on the domain generalization task with 4 datasets (Sketch, Cartoon, Art Painting, Photo). The datasets share seven object categories (dog, elephant, giraffe, guitar, house, horse, and person) and contain 9,991 images in total.

​ Due to time limitations, we implement the code based on our original setting, i.e., viewing miniImageNet as the source domain, and taking the 5-way 5-shot classification. The classifier on target domains is obtained by the linear probing method.

| Method | Sketch | Cartoon | Art Painting | Photo | Avg. |
| --- | --- | --- | --- | --- | --- |
| Baseline | 55.68 ± 0.31 | 70.09 ± 0.27 | 79.83 ± 0.19 | 97.29 ± 0.08 | 75.72 |
| Ours | 60.88 ± 0.31 | 71.33 ± 0.27 | 83.23 ± 0.18 | 97.72 ± 0.07 | 78.29 |

​ As can be seen, our method also benefits the domain generalization task, which further verifies the effectiveness of our method.

Dataset information: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. IEEE International Conference on Computer Vision (ICCV), pages 5543–5551, 2017.
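The "linear probing" classifier mentioned above can be sketched as a standard logistic-regression probe on frozen ViT features; the episode shapes below are illustrative assumptions, not the exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_episode(support_feats, support_labels, query_feats):
    """Fit a linear classifier on frozen ViT features of the 5-way 5-shot support
    set and predict the query samples (a standard linear probe)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(support_feats, support_labels)
    return clf.predict(query_feats)

# Illustrative episode: 5 classes x 5 shots for support, 75 query samples.
support_feats = np.random.randn(25, 384)
support_labels = np.repeat(np.arange(5), 5)
query_feats = np.random.randn(75, 384)
preds = linear_probe_episode(support_feats, support_labels, query_feats)
```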

2. Computational cost

The only additional computational cost introduced by our method is the domain tokens. As there are 64 classes in miniImageNet, 64 × 384 = 24,576 (≈24.6k) parameters are added in our experiments, which is not a heavy burden since ViT-Small contains around 22M parameters (i.e., only about 0.11% additional parameters).
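The arithmetic behind the 0.11% figure, as a quick check (the class count and embedding width come from the text above; the ViT-Small parameter count is approximate):

```python
num_domain_tokens = 64        # one token per miniImageNet source class
embed_dim = 384               # ViT-Small token width
extra_params = num_domain_tokens * embed_dim          # 24,576 ≈ 24.6k
vit_small_params = 22_000_000                         # ~22M, approximate
print(extra_params, extra_params / vit_small_params)  # 24576  ~0.0011 (≈ 0.11%)
```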

3. Registers

Indeed, the training and location of registers are similar to those of the CLS token, so they could show similar behavior. To validate this hypothesis, we add registers [6] to the backbone network and report the performance as well as the domain similarity.

| Method | 5-way 5-shot Accuracy | Domain Similarity |
| --- | --- | --- |
| Baseline | 63.89 | 0.076 |
| Train w/ Register + Test w/ Register | 63.17 | 0.047 |
| Train w/ Register + Test wo/ Register | 64.73 | 0.101 |

​ As can be seen, such a phenomenon also exists in the registers, which verifies our analysis and interpretation.

​ Notably, our paper differs from registers[6] in:

(1) [6] finds that ViT needs registers because they help reduce artifacts in the feature map, whereas we find that the CLS token, although its location is similar to that of registers, naturally absorbs domain information, and we further interpret the reason behind it, which is not discovered in [6];

​ (2) We further propose a method for the CDFSL task to take advantage of the CLS token's characteristics, which is not included in [6].

Comment

The rebuttal from the authors addressed most of my questions. The findings that the CLS token absorbs domain information are interesting to me. The interpretations are rational, and the design of domain tokens is reasonable. The experimental results are convincing to me. I think this paper would inspire many other works in the related research, so I would like to increase the score to "Accept".

Comment

Thanks again for your appreciation of our work! We will continue to polish our work in the final version!

Author Response (Global)

We thank all the reviewers for their valuable input.

Q1. One CLS token for each class?

Since the source dataset (miniImageNet) is a general classification dataset, the differences between classes are larger (e.g., than in fine-grained datasets, where the domain information is clearer). Therefore, for miniImageNet, it is reasonable to view each class as a domain.

To further disentangle "domain-specific" from "class-specific", we manually construct new source domains based on miniImageNet. Specifically, we take the amplitude (by Fourier transformation) of target-domain images as the style information, and use the phase (by Fourier transformation) of the original source-domain images as the content information, thereby constructing 4 new domains with the original 64 source-domain classes. Then, we train our model on a new dataset containing the 4 constructed datasets and the original source dataset, and ablate different choices of domain tokens.
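A minimal sketch of this amplitude/phase swap, using a per-channel 2D FFT; this is our illustration of the construction, not the exact implementation.

```python
import torch

def swap_amplitude(source_img, target_img):
    """Build an image with the phase (content) of `source_img` and the amplitude
    (style) of `target_img`, via a per-channel 2D FFT. Inputs: (C, H, W) floats."""
    src_fft = torch.fft.fft2(source_img)
    tgt_fft = torch.fft.fft2(target_img)
    amplitude = tgt_fft.abs()        # style taken from the target domain
    phase = torch.angle(src_fft)     # content kept from the source image
    mixed = amplitude * torch.exp(1j * phase)
    return torch.fft.ifft2(mixed).real

# Illustrative usage on random tensors standing in for real images.
new_domain_img = swap_amplitude(torch.rand(3, 224, 224), torch.rand(3, 224, 224))
```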

Please see Tab.7 in the PDF for results.

As can be seen, when larger domain gaps are introduced, viewing each class as a domain is no longer the best choice. Instead, assigning a domain token to each domain achieves the best performance, which validates the rationale of using domain tokens to absorb domain information.

​ For the ImageNet training, since we can obtain the hierarchical structure of all classes (e.g., superclasses like animals, plants, ships, etc.) by class names, we do not need to assign 1k domain tokens for each class. Instead, we only need to assign each superclass with a domain token, which is affordable.

Q2. Tab.1 a.1, a.2, and a.3

For the 5-shot task, the source-domain accuracy of the baseline method (a.1) is 97.94, for the "fix as random initialization" (a.2) is 97.50, and for the "not loading CLS" (a.3) is 96.64.

​ Tab.1 a.2 means to fix the CLS token as random initialization during both the training and testing.

​ For the target-domain performance, since the fixed CLS token does not contain or learn any information, the capability of the CLS token is abandoned. Therefore, other structures in ViT need to take the place of the CLS token to learn the information that should originally be captured by the CLS token. However, since other structures in ViT are not as capable of learning domain information as the CLS token, the domain information is not effectively captured. Since the domain information is harmful to the target domain, a.2 is better than a.1. However, since the domain information is beneficial for the source domain, a.3 is better than a.2 in the source-domain training. Given that a.3 also randomizes the CLS token during the target-domain stage like a.2, a.3's final performance is therefore better than a.2.

​ For the source-domain performance, a.2 forces other structures in ViT to absorb source-domain information, which is not as capable of learning source-domain information as the CLS token, therefore its source-domain accuracy is lower than a.1. For a.3, as it also absorbs domain information by the CLS token, other structures in ViT tend to absorb the domain-irrelevant information. Therefore, by randomizing the CLS token in a.3, the remaining structures have less source-domain information, thereby showing the lowest source-domain performance.

Q3. Related work

We provide an extended related work of CDFSL as follows.

Cross-Domain Few-Shot Learning (CDFSL) [14, 20, 26, 30, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54] focuses on training a model on the source domain that can generalize well to the target domain with limited examples. Current methods can be grouped into two types: meta-learning-based approaches [12, 14, 17, 33, 44, 45, 46] and transfer-learning-based ones [4, 14, 19, 40, 42, 47, 48, 49, 50, 51, 52, 53, 54]. Meta-learning-based approaches aim at learning task-agnostic knowledge to learn new tasks efficiently [14], differing in how they learn the parameters of the initial model on the base-class data. MAML [45] aims at learning an initial parameter set that can quickly adapt to new tasks, while FWT [44] uses a feature-wise transformation to learn representations with improved generalization ability. An alternative is transfer-learning-based approaches, which reuse the model trained on the base-class data in a standard supervised learning way [46]. Among these approaches, LRP [47] uses explanation results to guide the learning process. STARTUP [48] and Meta-FDMixup [49] mainly aim at defining relaxed settings for CDFSL. Wave-SAN [51] tackles CDFSL by spanning the distribution of source styles, and SET-RCL [50] simulates the style distributions of unknown target domains. IM-DCL [52] treats the entire feature set as positive and negative sets to learn the query set without accessing the source domain. However, these works are mostly restricted to the CNN architecture. Recently, some works have focused on the transformer structure to solve CDFSL tasks, but these efforts have not fully exploited the potential of the ViT structure and the importance of the CLS token for CDFSL.

Extended References:

[44] Cross-domain few-shot classification via learned feature-wise transformation.

[45] Rapid learning or feature reuse? Towards understanding the effectiveness of MAML.

[46] Deep learning for cross-domain few-shot visual recognition: A survey.

[47] Explanation-guided training for cross-domain few-shot classification.

[48] Self-training for few-shot transfer across extreme task differences.

[49] Meta-FDMixup: Cross-domain few-shot learning guided by labeled target data.

[50] Free-lunch for cross-domain few-shot learning: Style-aware episodic training with robust contrastive learning.

[51] Wave-SAN: Wavelet-based style augmentation network for cross-domain few-shot learning.

[52] Enhancing information maximization with distance-aware contrastive learning for source-free cross-domain few-shot learning.

Final Decision

This paper examines the hypothesis that the CLS token of the Vision Transformer absorbs domain information and proposes a new cross-domain few-shot learning (CDFSL) method utilizing the CLS token based on this insight. No prior research has thoroughly investigated and discussed the role of CLS in domain adaptation, making these findings a significant contribution to the field. Most reviewers acknowledged the strong motivation and focus of the study, the detailed analysis, and the clarity of the writing. The effectiveness of the proposed CDFSL method across multiple datasets was also highly appreciated. The authors' rebuttal effectively addressed the reviewers' concerns, resulting in an improved rating from two of the reviewers. Two reviewers noted that the writing of the paper needs improvement, and their comments are expected to be reflected in the final manuscript.