PaperHub
Overall rating: 5.3/10 · Poster · 4 reviewers (scores 3, 3, 8, 7; min 3, max 8, std 2.3)
Confidence: 3.5 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.0
NeurIPS 2024

Frustratingly Easy Test-Time Adaptation of Vision-Language Models

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Vision-Language Models · Test-Time Adaptation · Robustness · Model Calibration

Reviews and Discussion

Official Review
Rating: 3

This paper presents a TTA strategy called ZERO, in which the authors carefully and thoroughly explain the entire work, from motivation to method and experiments, including numerous appendices. However, the observations mentioned are not novel enough, and the proposed method is not sufficiently flexible.

Strengths

  1. This paper features a clear analysis, allowing for a quick understanding of the problem the authors aim to solve.

  2. The highlighted presentation of the experimental section greatly aids reviewers in rapidly analyzing the experimental results.

Weaknesses

  1. The observations mentioned are not novel enough, as many studies have reported similar observations [1].

  2. The proposed method lacks flexibility and guarantees.

  3. The experimental results are insufficient, with not only a limited number of comparison methods but also incomplete results and a lack of comparison method results for different model architectures.

  4. Although the writing is clear, it uses many uncommon words in academic papers, causing difficulties in understanding.

Questions

  1. The authors' motivation is not novel, as the phenomenon of over-/under-confidence has already been revealed in many studies [1], so it is not a surprising discovery. Could you please clarify your novelty?

  2. The authors' method of setting the temperature to 0 seems to be a very straightforward and naive strategy. In fact, in some studies, the temperature has been made learnable instead of being forcibly set to 0[2].

  3. The authors' pseudocode is quite unusual, as it is the first time I have seen code used instead of pseudocode. Is there any insurmountable reason for this choice?

  4. The authors have too few comparison methods. In fact, there are already many TTA strategies [3]. Do these methods also have the same problem?

  5. Why are only a portion of the experimental results shown? I cannot find the missing results for different model architectures anywhere, including the appendices.

  6. Why is the temperature set to 0? This will cause the post-softmax distribution to be extremely sharp.

  7. Have the authors considered that label smoothing and sharpness-aware optimization strategies may solve this problem?

  8. Why do the authors use some unusual terms like "prevalent class"? What does it mean? I cannot even find similar descriptions in papers on arXiv or Google Scholar.

[1] Singla, Sumedha, et al. "Augmentation by counterfactual explanation-fixing an overconfident classifier." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.

[2] Wang, Xiao, et al. "Be confident! towards trustworthy graph neural networks via confidence calibration." Advances in Neural Information Processing Systems 34 (2021): 23768-23779.

[3] Liang, Jian, Ran He, and Tieniu Tan. "A comprehensive survey on test-time adaptation under distribution shifts." arXiv preprint arXiv:2303.15361 (2023).

Limitations

N/A

Author Response

W1-Q1: Novelty and relationship to over-under-confidence. To clarify our novelties: (1) we provide theoretical tools to understand the pitfalls of Marginal Entropy Minimization; (2) the proposed baseline only requires manual tweaking of a single parameter, a single forward pass of the vision encoder, and no optimization.

We politely disagree with the statement “the motivation is not novel, as the phenomenon of over-under-confidence has been already revealed”. Here are the reasons:

  1. We know that over- and under-confidence are widely known phenomena in the fields of uncertainty estimation and model calibration. Hence, we do not claim their discovery in any passage of the manuscript.
  2. Our core motivation is not over-/under-confidence, but: (1) demonstrating that the $\arg\max$ of the marginal probability distribution is largely invariant to MEM (Sec 2.2), and (2) establishing that the marginal probability distribution can be regarded as a lower bound for the error of the standard inference protocol (Sec 2.3). These findings are not readily available in the existing literature; they are key contributions and novelties of our work. As also acknowledged by Wz3n, we believe these are "great motivations for bringing insights into Test-Time Augmentations".

Q2: Setting the temperature to 0 is "straightforward and naive". We agree that Zero is indeed simple, but we see this as a strength rather than a weakness, as pointed out by Wz3n, 7eTE, and a5fT. We firmly believe that simplicity should be rewarded, since it allows more practitioners to use our work.

Q6: Why is the temperature set to 0? Fig. 1(b) suggests that incorrectly classified views tend to suffer from overconfidence. When marginalizing, a highly confident but wrong prediction may greatly influence the resulting marginal distribution, leading to an overall incorrect prediction. Adapting the temperature addresses this by removing the dependency on confidence. Please, see the response to a5fT (Q3) for further clarifications.

cont. - learnable temperature. We are aware of works treating the temperature as a learnable parameter. Other than the provided reference, this is also done in [7], a pioneer of modern research on calibration, as well as in CLIP. Our goal, however, largely differs from learning parameters: we aim to show that a surprisingly strong TTA baseline emerges with no optimization at all.

W2: Method lacks flexibility and guarantees. On flexibility: our method is fairly simple and can be easily integrated into a variety of VLMs (as also Wz3n points out). Its only prerequisite is a softmax-based classification pipeline. On guarantees: as also acknowledged by 7eTE, our work features a theoretical and empirical analysis (Sec. 2.2 to 3.1) that tries to justify its design. This is missing even in some influential work in this field, e.g., [52, 37, 32]. We are happy to expand our answer in case the reviewer points to specific flexibility/guarantees issues we did not cover.

Q3: Unusual pseudocode. We provided the pseudocode in a code-like fashion following some influential examples from the literature, including CLIP (see [a-h]). While traditional pseudocode may hide how functions are implemented, code-like snippets provide an intuitive way to grasp how simple and easy to adopt the proposed methodology is.
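For readers of this thread, a minimal sketch of what such a code-like snippet can look like, reconstructed only from the description given in this discussion (this is not the snippet from the paper; `image_encoder`, `text_weights`, and the retained fraction `gamma` are placeholder names):

```python
import torch

def zero_tta_sketch(views, image_encoder, text_weights, gamma=0.1):
    """Illustrative sketch: keep the lowest-entropy augmented views and
    marginalize their zero-temperature (one-hot) predictions."""
    with torch.no_grad():
        feats = image_encoder(views)                      # [N, D] image features
        feats = feats / feats.norm(dim=-1, keepdim=True)
        logits = feats @ text_weights.t()                 # [N, C] class logits
        probs = logits.softmax(dim=-1)
        # keep the gamma fraction of views with the lowest predictive entropy
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        n_keep = max(1, int(gamma * views.shape[0]))
        keep = entropy.topk(n_keep, largest=False).indices
        # temperature -> 0: each retained view contributes a one-hot vote
        votes = torch.nn.functional.one_hot(
            logits[keep].argmax(dim=-1), num_classes=logits.shape[-1]
        ).float()
        p_bar = votes.mean(dim=0)                         # marginal distribution
        return p_bar.argmax()
```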

Q4: Too few comparisons, many TTA methods exist [3]. Do they have the same problem? We strove to compare with the state-of-the-art among MEM-based methods (TPT, PromptAlign) as well as the latest state-of-the-art on TTA for VLMs “Reinforcement Learning from CLIP Feedback” [53], published at ICLR 2024. This year ICLR ended 4 days before the abstract submission deadline for NeurIPS. We believe this is a good effort to keep up with the pace of AI research. Our work also features an analysis of Marginal Entropy Minimization, hence, to answer the question, our theoretical insights extend to strategies relying on this objective, including those listed in the provided survey paper. Other than TPT and PromptAlign, some examples are [18,26,41,52].

We are happy to enrich our manuscript in case the reviewer points to specific comparisons that may be relevant to this work.

Q5: Missing results. Our focus is on VLMs, hence we followed the established experimental protocol of the field of TTA with VLMs by evaluating different CLIP variants (ViT-B-16 and MaPLe), following [37,32]. Additionally, we experimented with a newer combination of CLIP-ViT-B-16 and CLIP-ViT-L-14, for which [37,32] did not present results. We did not omit any results and believe that our evaluation is fair.

Q7: Label smoothing. In our context, applying label smoothing as in [i] would not bring benefits since it would not affect the $\arg\max$ of the marginal probability distribution. Thus, it would incur the same issues highlighted in the example of the response to a5fT (Q3).

$$\bar{p}^{LS} = \frac{1}{N}\sum_i p_i^{LS} = \frac{1}{N}\sum_i \left[(1-\alpha)\, p_i + \frac{\alpha}{C}\right] = (1-\alpha)\,\frac{1}{N}\sum_i p_i + \frac{\alpha}{C}$$
$$\bar{p}^{LS} = (1-\alpha)\,\bar{p} + \frac{\alpha}{C} \implies \arg\max \bar{p}^{LS} = \arg\max \bar{p}$$
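A quick numeric sanity check of this invariance (an illustrative sketch with arbitrary numbers, not code from the paper):

```python
import numpy as np

p = np.array([[0.9, 0.1, 0.0],   # per-view probabilities (arbitrary example)
              [0.3, 0.6, 0.1]])
alpha, C = 0.1, 3
p_ls = (1 - alpha) * p + alpha / C                  # label-smoothed probabilities
assert p.mean(0).argmax() == p_ls.mean(0).argmax()  # same prediction either way
```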

cont. SAM. A core contribution of our work is to show that Marginal Entropy, a very popular objective for TTA, has some pitfalls that can be circumvented with a simple and optimization-free approach. While we agree that sharpness-aware strategies can synergize with TTA, proposing an optimization-based alternative is out of the scope of our work.

Q8: Unusual terms like "prevalent class". We thank the reviewer for pinpointing this potential misunderstanding! To clarify, we will replace “prevalent class” with “most probable class”. This refers to the class with the highest probability within a distribution. We notice the reviewer mentioned "many uncommon words." Could the reviewer provide examples? We are willing to clarify and update the manuscript accordingly.

Comment

Dear reviewer, we are happy to see that our rebuttal addressed most of the concerns raised in the initial review. We answer the remaining points below.

On Voting (shared with a5fT). Yes, Zero bridges the gap between “Test-Time Adaptation” and “Test-Time Augmentations” by showing that adapting a single parameter is an almost exact approximation of the discrete action of voting. This is a positive fact because it ensures that the theoretical insights of Sec. 2.3 (another key contribution and novelty of the work) are met without relying on possibly missing model calibration (lines 224-233 of the manuscript). We do not hide the simplicity of the approach in any part of the manuscript, title included. This is why we always refer to Zero as a baseline: we deem it a simple and effective TTA approach that can be taken as a reference for future works in this field.

There are plenty of examples in which simple baselines are used to drive an entire field, see, e.g., [a-d]. Their presence is immensely valuable for the respective communities, and we believe that Zero, supported by its theoretical motivations, falls within this category.

Experimental results. We want to clarify this crucial point: in all experiments, all methods are tested on the same datasets and with the same backbones. As correctly pointed out by reviewer 7eTE, we have grouped experiments because it is fairer to do so, since “different approaches consider different backbones in the original papers” (line 245).

We strongly disagree with the statement “unfortunately, the other two reviewers did not notice such an obvious issue. Different experiments use different datasets and architectures, which clearly violates the most basic requirements of academic papers”. The understanding of reviewer 7eTE about our experimental design is correct. Complementing 7eTE's comment:

  1. PromptAlign cannot exist without a MaPLe initialization. Their method starts from MaPLe, which prepends learned visual tokens also on the image branch. They compute layer-wise statistics of such visual tokens offline and use them as an alignment target during TTA. These visual tokens would not be available in any way with a “standard” CLIP model, so without MaPLe there would be no PromptAlign. For this reason, PromptAlign cannot be reported when adapting other baselines, and a MaPLe initialization is inevitably needed to fairly compare TTA methods. Moreover, we also reported MaPLe + TPT in this group.

  2. RLCF always needs a student model and a reward model. In Table 1 of their paper, they compare with TPT writing “CLIP-ViT-B-16”, even though RLCF consists of CLIP-ViT-B-16 rewarded by CLIP-ViT-L-14. This is definitely a non-negligible advantage. When comparing to RLCF, we make the comparison fairer by using their exact same pair of models in all tables. To clarify the experimental results, we have labeled TPT and RLCF as “CLIP-ViT-B-16” and “CLIP-ViT-B-16 + CLIP-ViT-L-14”, respectively.

We hope that this clarifies the remaining concerns. We are keen to provide further clarifications otherwise.

References:
[a] Romera-Paredes, Bernardino, and Philip Torr. "An embarrassingly simple approach to zero-shot learning." ICML 2015.
[b] Sun, Baochen, Jiashi Feng, and Kate Saenko. "Return of frustratingly easy domain adaptation." AAAI 2016.
[c] Sun, Mingjie, et al. "A simple and effective pruning approach for large language models." ICLR 2024.
[d] Gulrajani, Ishaan, and David Lopez-Paz. "In search of lost domain generalization." ICLR 2021.

Comment

I still can't understand how Zero bridges the gap between “Test-Time Adaptation” and “Test-Time Augmentations”. It seems like it is just filtering out high-uncertainty views during the voting of Test-Time Augmentations, which is a tiny and straightforward trick. Did I miss anything?

Comment

Q: The proposed strategy is a tiny and straightforward trick. Did I miss anything?

We strongly agree that our strategy is a simple trick, and this is something we are proud of, as it entails easy reproducibility and adoption. Papers presenting simple tricks that outperform established state-of-the-art methods often lead to rethinking the field, which is what we aim to do with our work. Thus, we believe simplicity should be rewarded, not penalized.

However, we believe that there is indeed a crucial part missing from this discussion: all the detailed theoretical analysis of the manuscript. Two of our core contributions are theoretical (as discussed in lines 53-57):

  1. To theoretically show when MEM does not change the prediction of the marginal probability distribution, and to empirically verify that MEM has largely no effect on it;
  2. To theoretically and empirically demonstrate when the error rate of the latter is a lower bound to the base error of a VLM in the setup of TTA, and how this is linked to overconfidence.

Our approach, Zero, originates from these observations. Note that ours is a simple method for TTA that does not require any training, being 10x faster and 13x more memory efficient than TPT.

We also emphasize that the theory behind our strategy, however, is far from trivial. Does the reviewer consider the theoretical insights of the paper as a valuable contribution?

To conclude, it seems that many issues of this discussion, especially the one related to how the experimental section is organized, are solved. Could the reviewer please confirm this?

Comment

Thanks to the authors for their response, which confirms that I did not misunderstand this paper.

Regarding the theory:

  1. "We theoretically demonstrate..." This is not a novel proposal, as there have been papers on calibration [1].

  2. The theoretical assumptions are difficult to establish. "CLIP is well-calibrated (ECE < 0.1 for all datasets), strongly supporting the theory." It seems that the authors have a misunderstanding about calibration, as ECE < 0.1 does not prove calibration. The most famous paper on ECE [2] highlights the poor calibration problem by showing results with ECE = 0.0412 as uncalibrated (CIFAR-10 ResNet 110).

  3. Even with a low ECE, the authors' assumptions may still not be met. ECE is a necessary requirement for calibration, but it does not fulfill the authors' assumptions. For example, with two views, one right and one wrong, both with 50% confidence, ECE = 0. In the authors' experiments, it is considered fully calibrated, but the authors' theory will fail.

Regarding the method: "We strongly agree that our strategy is a simple trick, and this is something we are proud of." This premise assumes that no one has done similar work before. For example, in [3], weighting the results of different views' inferences based on uncertainty is more reasonable compared to the voting approach in this work (since the authors believe the model is calibrated, confidence should contribute as continuous weights rather than as votes).


My attitude is already clear, and I firmly believe that there is room for improvement in this paper. I have no other misunderstandings with the authors. I respect the opinions of reviewer 7eTE, but I must maintain an objective evaluation.


Thank you for the authors' rebuttal and the response from reviewer 7eTE.


[1] Zhang, Qingyang, et al. "Provable dynamic fusion for low-quality multimodal data." ICML 2023.

[2] Guo, Chuan, et al. "On calibration of modern neural networks." International conference on machine learning. PMLR, 2017.

[3] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in Neural Information Processing Systems 30 (2017).

Comment

Dear reviewer, thank you for your response.

Unfortunately, we feel like there are still many misunderstandings. Before proceeding, please let us know if the previous point about comparison groups in our experimental suite is solved. It appears to us that it has been, but we would really appreciate it if you could please confirm this since we have not received a reply yet. As per the remaining points, we provide answers in the following.


Answer to point 1:

The existence of papers on model calibration does not undermine our theoretical insights for Test-Time Adaptation of Vision-Language Models. Our work uses the notion of model calibration to show that, if the VLM is calibrated, the marginal probability distribution computed over a set of independent examples with the same underlying label can be regarded as a lower bound for the base error of the model. We are not claiming that we introduce the concept of model calibration, nor anything similar.

To ground this discussion, let us consider the provided reference [1], which has different findings, tasks, and framework. The focus, therein, is model ensembles. Specifically, their theory describes ensembling different unimodal classifiers into a multimodal fusion output. This is exactly the opposite of what we do in Section 2.3. Our work inverts the classical paradigm of model ensembling and shows that the marginal probability distribution of only one model is sufficient to establish theoretical guarantees if such a model is calibrated. Then, we proceeded to show that this has a non-negligible impact on Test-Time Adaptation.

Additionally, the invariance to MEM has no link to calibration. Section 2.2 is dedicated to something entirely different: showing that the argmax of the marginal probability distribution is largely invariant to the most popular TTA framework, i.e., Marginal Entropy Minimization. There is no link between this contribution and calibration, and we do not see any similarities with the provided references.


Answer to point 2:

No, we do not have a misunderstanding about calibration, and we never claim that ECE < 0.1 proves it. In the sentence quoted in your response, we say “well calibrated”, we do not say “perfectly calibrated” or “provably calibrated”. We acknowledge that some misunderstanding may arise from this, and we will remove the text within the parentheses in line 190.

The ECE (Expected Calibration Error) on CIFAR-10 for a ResNet-110 cannot be directly compared to the ECE of zero-shot CLIP-ViT-B-16 on ImageNet variants. What constitutes "good" or "bad" calibration varies with the dataset, the complexity of the classification task, the model architecture, and the user's requirements [a]. Indeed, we also point out that the word “uncalibrated” in [2] does not refer to bad calibration, but to Neural Networks “before calibration” (see Table captions in [2]).

There are many differences between the suggested comparison and our setup (e.g., 10 classes vs. 10x or 100x more, a ResNet-110 fine-tuned on CIFAR vs. CLIP-ViT-B-16 transferred zero-shot). Once again, let us ground this discussion in the provided references. We only share with [2] results on ImageNet-1k, where Table 1 shows the ECE of fine-tuned DenseNet-151 and ResNet-152. When these are not subject to calibration techniques, their ECEs computed with 15 bins are 6.28% and 5.48%. For CLIP-ViT-B-16, this is as low as 1.88% (please note that the 1.92% ECE in Fig. 1(b) uses 20 bins, as highlighted in Appendix 3). After the calibration technique introduced by [2], the ECE goes to 1.99% and 1.86% for DenseNet-151 and ResNet-152, respectively. These values after calibration are almost identical to the ECE of zero-shot CLIP-ViT-B-16 on the same dataset.
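For concreteness, a minimal sketch of the standard binned ECE estimator under discussion (illustrative only; `n_bins` plays the role of the 15 vs. 20 bins mentioned above):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: weighted average of |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# The two-view case raised earlier in this thread (one right, one wrong, 50% confidence):
print(expected_calibration_error([0.5, 0.5], [1, 0]))  # 0.0
```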

[a] Nixon, Jeremy, et al. "Measuring Calibration in Deep Learning." CVPR-W. 2019.

Comment

Answer to point 3:

Quoting from the comment: "ECE is a necessary requirement for calibration, but it does not fulfill the authors' assumptions. For example, with two views, one right and one wrong, both with 50% confidence, ECE = 0. In the authors' experiments, it is considered fully calibrated, but the authors' theory will fail." This statement contains some misunderstandings.

First, we do not consider the model “fully calibrated” in any of our experiments. Please refer to the Answers to points 2 and 4 for more details.

Second, our theory would not fail in the provided case. Section 2.3 provides theoretical justification for the simplified scenario of two-class classification. This is an established way of dealing with error bounds (see, e.g., the provided reference), to avoid the intractable combinatorial explosion that would arise otherwise. In the context of our theory, the provided example would entail an error rate of 50%, i.e., random guessing. In line 152 of the manuscript, we clarify that our theory does not apply to random guessing:

"From the Condorcet Jury Theorem [35], we know that Eq. (7) is a monotonically decreasing function if the error ϵ is better than random guessing, which is likely to be the case for VLMs pretrained on a massive amount of web data such as CLIP. Hence, we conclude that the error of pˉ\bar{p} is a realistic lower bound for the base model error ϵ over a set of independent data points sharing the same label".

Sure, theoretical frameworks often rely on simplifying assumptions. Hence, it is always important to present both theoretical insights and experimental verification in more realistic settings. In our work, we have done so for every theoretical insight, including the one in this discussion. The paragraph “Does this lower bound empirically realize?” of Section 2.3 aims exactly at pairing the theoretical discussion with evidence. Note that here “we do not reorganize predictions in a one-versus-all scheme” (lines 161-162), thus showing that, on average, our theoretical insight is also confirmed in multiclass settings.


Answer to point 4 (on the method):

To the best of our knowledge, the paradigm we propose has been ignored in the context of TTA with Vision-Language Models. If this is not the case, we would appreciate if the reviewer could provide a reference.

Additionally, the statement "the authors believe the model is calibrated" is wrong. We do not believe that the model is calibrated on augmented views, but exactly the opposite.

The entire Section 3.1, "Augmentations undermine the reliability of $\bar{p}$", is about the fact that CLIP calibration gets considerably worse on augmented views. For example, on ImageNet-1k, the ECE (20 bins) goes from 1.93% to 13.24% (Figure 1(b) of the manuscript). We also show that identical trends emerge for all Natural Distribution Shifts datasets (Figure 3, Appendix), and with different CLIP models (PDF attachment of the general response).

What this means is that hand-crafted augmented views largely interfere and make confidence information unreliable. Consequently, "weighting the results of different views' inferences based on uncertainty" is not "more reasonable"; it would rather emphasize the problem. Supported by the consistent observation that the error of the model does not largely increase on augmented inputs, this problem can be circumvented by discarding confidence information. This is what Zero does.

Comment

Why does this response contradict the descriptions in the paper and previous responses? In the paper, the assumption that f is calibrated is highlighted (L141), and zero-shot CLIP is described as well-calibrated (ECE < 0.1 for all datasets), strongly supporting the theory of Section 2.3 (L190). However, now the response refers to it as uncalibrated.

Additionally, my example is not about random guessing, but rather emphasizes that accuracy equals confidence. We can consider an example with four samples: right/right/right/wrong, with a confidence of 75%.

Comment

Thank you for your rebuttal, but I still cannot grasp the novelty of this work. I spent another 4 hours re-reading the paper, and I still hold the same understanding:

For a sample x, it is first augmented into multiple views, X; the inference is performed, and the ones with low entropy are selected; the logits are divided by 0 and then summed, producing p; the prediction is obtained by argmax(p).

If this is the case, how is it different from voting on the augmented views?

Regarding the experimental results: Reviewer a5fT also pointed this out, but unfortunately, the other two reviewers did not notice such an obvious issue. Different experiments use different datasets and architectures, which clearly violates the most basic requirements of academic papers. The authors' reasons do not explain why the experimental settings vary in such a combinatorial manner. (The main reason for raising these concerns is that many of the improvements reported in the paper can be considered tiny.)

As for the experimental setup, I merely have concerns. Please provide a brief response to clarify the difference between the paper's approach and my understanding.

Comment

Reviewer Wx8e, can you please clarify for me what you mean by "different experiments use different datasets and architectures"? You claim that as a reviewer I "did not notice such an obvious issue", so I have gone over the author's experiments again.

Are you referring to how in Table 1, different approaches use a different backbone architecture? If so, you'll see that the authors grouped methods by backbone, and have a version of their approach that uses each backbone. Thus it is valid to compare their method to each other method within each backbone grouping. This does not violate any principles of academic papers.

Or are you referring to how Table 2 uses different datasets than in Table 1? It is incredibly common to use different datasets when evaluating performance on different tasks. Table 1 corresponds to natural distribution shifts, while table 2 deals with fine-grained classification. This likewise does not violate any principles of academic papers.

Please let me know if I have misunderstood your concern here. Otherwise, my understanding of the paper remains unchanged and I will still argue for this paper's acceptance.

Comment

What I am referring to is the irrationality of grouping by architecture, since there are no clues indicating that different methods are only suitable for different architectures. So when the difference in experimental results is small, it is necessary to report all the results. This artificial grouping is not elegant, as TPT has also reported results under other architectures.

My main concern is the highlighted italicized part about novelty. Did I misunderstand this part?

Comment

I don't aim to derail the authors' conversation with reviewer Wx8e, but I am going to respond to Wx8e's response to me.

The choice of grouping the models is not irrational. The authors already explain why this is done in the paper:

"As different approaches consider different backbones in the original papers, we construct different comparison groups to ensure fair comparisons with all TTA baselines" [lines 245-246]

If TPT can be included in the other groups, sure - it wouldn't hurt to include it. But its absence is not a major flaw. It is not always reasonable to expect authors to reengineer existing approaches in order to pair them with new architectures. But yes, if the results are updated to include TPT in all groups, then that's a positive. Still, I really fail to see its absence in some group/dataset combinations as anything worse than neutral.

"My main concern is the highlighted italicized part about novelty. Did I misunderstand this part?"

Yes, I feel like you have. The method itself is not positioned as a complicated or novel approach. The primary takeaway of the paper, as I see it, is that existing MEM approaches have a major deficiency (shown analytically in this paper), and that a very simple baseline method performs as well as or better than them. This is important for the community to know, as we should not be using an overcomplicated and inefficient approach when a simple method beats it. It is also important for researchers to reevaluate MEM --- or pivot to a new avenue for TTA --- instead of continuing research on MEM as-is without knowing that it's no better than this baseline. So in short, I think you are right that this method is simple, but you misunderstand the paper in thinking that complexity is the goal or value proposition here.

Official Review
Rating: 3

This work studies the test-time adaptation (TTA) of Vision-Language Models (VLMs), where the goal is to adapt trained VLMs to unseen datasets/distributions. To this end, this work first revisits the commonly used Marginal Entropy Minimization (MEM) by showing its effect on the marginal probability distribution $\bar{p}$. Then, the relationship between $\bar{p}$ and $p$ is shown. Based on this, a simple TTA method is proposed, which uses "zero" temperature when calculating the probability with softmax. The experimental results on several benchmarks demonstrate that the proposed method is effective and can bring improvements over several baselines.

Strengths

  • The proposed method (ZERO) is simple and straightforward. The experimental results are good and several baselines are included and discussed, which helps understand the effect of ZERO.

  • Section 2.2 (how does MEM affect the marginal probability distribution) is interesting. It gives some insights that the prediction is invariant to entropy minimization

  • Section 2.3 gives a reliability perspective on the marginal probability distribution ($\bar{p}$). Showing that the error of $\bar{p}$ is a lower bound to the base error of the model is interesting. Giving empirical results is also helpful.

Weaknesses

  • The major concern is the invariance to entropy minimization (Section 2.2). I noticed that this work provides a discussion on this, but I would appreciate it if the authors could give some cases/examples where this assumption does not hold.

  • While I think the analysis of reliability (Section 2.3) is interesting, the claim may be too strong. Specifically, in Line 192, "Poor calibration is always caused by overconfidence", I do not think "always" is appropriate here. I understand the empirical results give such an observation. However, this claim needs to consider various datasets/models. For example, [a] shows that CLIP models trained on different datasets (LAION and WiT) exhibit distinct behaviors in terms of calibration/reliability. That said, I would suggest the authors soften the claim and make it more moderate. Also, I noticed the authors discuss this potential limitation, which I appreciate. Moreover, it would be better if the authors could show the results of Figure 1 (b) using CLIP models trained on LAION and WIT (OpenAI-provided CLIPs). This would make the empirical observation more convincing.

    [a] A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)

  • I noticed TPT [37] and PromptAlign [32] evaluate models on the setting of Cross-Datasets Generalization. Why not report results on such a setting?

---- Post Rebuttal ---

I am not fully convinced by the response regarding the benchmark comparison, particularly since ZERO exhibits a significant performance drop on EuroSAT, which has not been clearly reported following the standard protocol. Other methods do not exhibit such a drop. An apples-to-apples comparison is fundamental to understanding the effectiveness of the proposed approach. While I acknowledge the merits of simplicity and analysis, I remain skeptical about the experimental evaluations.

Considering the current two evaluations and the lack of guarantee that “the predictions of the model, on average, are accurate,” I am not convinced of ZERO’s generalization to other tasks, backbones, or datasets.

Questions

  • Please discuss the cases where invariance to entropy minimization (Section 2.2) may not hold
  • Please make the claim in Line 192, "Poor calibration is always caused by overconfidence", more moderate and provide more experiments to support this observation
  • Please correct my understanding of zero temperature: a (non-negative) temperature will not change the class predicted by the softmax (the predicted class corresponds to the maximum prediction). I am not sure how zero temperature impacts the predicted class. Please clarify how it improves the performance.
  • [Small suggestion] TPT [37] and PromptAlign [32] use the protocol where only a single test sample is given. This work follows this protocol. I would suggest this work highlight it, as the current presentation is not very clear to me

Limitations

This work provides a discussion of potential limitations, including the augmentation-induced overconfidence might not hold on future VLMs, the invariance to entropy minimization may not hold on all models and datasets, and the independence among augmented views, the computational cost. I appreciate this work mentions these limitations.

Author Response

W1 - Q1: Invariance to Entropy Minimization. Invariance to MEM is strongly related to the uncertainty of the marginal probability distribution pre-TTA. The lower the initial entropy, the lower the impact of MEM on the $\arg\max$.

Theory: this is related to the proof of Proposition 2.1. To guarantee invariance, we express the post-MEM embeddings as a function of the pre-MEM embeddings through a Taylor expansion, which implies that the variation is small. If the initial entropy is high, the gradients from MEM (and, thus, the variation between pre- and post-MEM embeddings) can be larger than what a Taylor expansion can accurately approximate, so Prop. 2.1 cannot be guaranteed.
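As a purely generic illustration of this argument (the notation here is ours, not that of Proposition 2.1): a single gradient step of size $\eta$ on the MEM loss perturbs the embeddings only to first order,

```latex
% Generic first-order sketch (illustrative notation, not that of Prop. 2.1)
\[
\theta' = \theta - \eta \,\nabla_\theta \mathcal{L}_{\mathrm{MEM}}, \qquad
z(\theta') \;\approx\; z(\theta) \;-\; \eta\, J_z(\theta)\,\nabla_\theta \mathcal{L}_{\mathrm{MEM}},
\]
```

so the $\arg\max$ of $\bar{p}$ is preserved whenever the induced change in $\bar{p}$ is smaller than the gap between its two largest entries; when the initial entropy is high, the MEM gradients grow and this first-order picture can break down, as discussed above.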

Empirical evidence: To visualize the aforementioned relationship, we compute pre- and post-MEM marginal probability distributions. We sort the pre-MEM distributions in order of descending entropy and quantize them into 10 bins. Bins shall be interpreted as follows:

  1. the leftmost bin contains the top 10% of samples with the highest entropy;
  2. the second bin contains samples outside the top-10% percentile but within the top-20%, and so on;
  3. the rightmost contains the bottom 10% of samples with the lowest entropy.

For each bin we compute the invariance ratio, measuring how often the $\arg\max$ of the pre-MEM $\bar{p}$ does not change after MEM (the higher the better). Finally, we display a histogram with this data in Figure 2 of the PDF attachment.

A trend appears: as the entropy decreases (left to right), invariance holds more often. Hence, to answer: the most likely cases where invariance to MEM does not hold are those of high uncertainty in the marginal probability distribution. However, this may still be rare: even within the top 10% of most uncertain samples, invariance holds more than 82% of the time (leftmost bin).
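A minimal sketch of this bookkeeping, assuming the pre- and post-MEM marginal distributions have already been collected as arrays (all names are placeholders):

```python
import numpy as np

def invariance_ratio_per_bin(p_pre, p_post, n_bins=10):
    """For entropy-sorted bins of pre-MEM marginals, measure how often the
    argmax is unchanged after MEM (higher = more invariant)."""
    p_pre = np.asarray(p_pre)    # [num_samples, num_classes] pre-MEM marginals
    p_post = np.asarray(p_post)  # [num_samples, num_classes] post-MEM marginals
    entropy = -(p_pre * np.log(np.clip(p_pre, 1e-12, None))).sum(axis=1)
    order = np.argsort(-entropy)               # descending entropy: left = most uncertain
    same = p_pre.argmax(axis=1) == p_post.argmax(axis=1)
    return [same[idx].mean() for idx in np.array_split(order, n_bins)]
```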

The reported experiment was conducted on the validation set of ImageNet-1k with OpenAI’s CLIP-ViT-B-16, but identical trends were observed for all Natural Distribution Shifts datasets. We plan to report these results in the revised appendix, together with this discussion and a few examples. We are thankful to the reviewer for this comment.

W2 - Q2: Overconfidence and "always". We agree. Following the suggestion, we report the same experiment with LAION-pretrained CLIP models, using the code of [j]. In the PDF attachment, the reviewer can find results for LAION-2B and LAION-400M.

Notably, these CLIP variants comply with the already observed patterns, i.e.: the ECE increases with augmented views, with overconfidence being the leading cause. We propose to include these results in a dedicated Appendix.

We further propose to tone down ll. 192-197 (shared with 7eTE):

  1. line 192, paragraph header: “Poor calibration is frequently linked to overconfidence.”
  2. lines 195-196: “Notably, in the scope of our experiments, overconfidence is the primal factor leading to an increase of the ECE.”
  3. line 197: “In Appendix B, we also experiment across all datasets for Natural Distribution Shifts and different CLIP models pretrained on LAION. Importantly, this phenomenon further persists within this extended experimental suite.”

We thank the reviewer for this comment.

W3: Cross-datasets generalization. These results are already in the manuscript, in the 2nd comparison group of Tab. 2 (MaPLe). We point out that the term “cross-datasets generalization” carries slightly different meanings in the recent literature on TTA, and confusion may arise from this. We recap them here:

  1. This experiment was first designed in the TPT paper to compare supervised prompt learning (e.g. CoOp [55]) vs instance-specific prompt learning. In our case there is no learning, so this would not apply.
  2. The PromptAlign paper extends this setting to see how prompt learning methods, trained on a source dataset like ImageNet, can benefit from instance-specific TTA when evaluating “cross-datasets”.

This second experiment is already in Table 2: MaPLe prompts are learned on ImageNet and the model is adapted by Zero “cross-datasets”. We referred to these as the "Fine-grained classification" experiments.

However, we notice that we did not explicitly mention that our MaPLe initialization comes from ImageNet. We sincerely apologize. As a remedy, we propose the following:

  1. Before l. 252: “MaPLe prompts are learned on ImageNet, following [32]”;
  2. In l. 267, we will insert: “When adapting MaPLe, we stick to the ImageNet-learned prompts and evaluate it cross-datasets as in [32].”

Q3: zero temperature. Consider this example in a 3-way problem. Let $x_1$, $x_2$, and $x_3$ be three views and $p_1$, $p_2$, and $p_3$ be their probabilities with the default temperature. Let $y=2$ be the correct label.
$p_1 = [0.9, 0.1, 0.0]$ - incorrect
$p_2 = [0.3, 0.6, 0.1]$ - correct
$p_3 = [0.25, 0.6, 0.15]$ - correct

This simple example is designed to meet the observations of the manuscript:

  1. the error rate of the model is low, but
  2. the model may suffer from overconfidence ($p_1$).

The resulting marginal probability distribution $\bar{p} = [0.483, 0.433, 0.083]$ would be wrong because of the influence of a wrong overconfident example.

Setting the temperature to zero before marginalizing, the probabilities would (approximately) become:
$p_1 = [1, 0, 0]$
$p_2 = [0, 1, 0]$
$p_3 = [0, 1, 0]$

with a resulting $\bar{p} = [0.33, 0.67, 0]$, which would be correct and avoid the influence of overconfidence.

The takeaway is that due to the effect of data augmentations, confidence information can be misleading, but we can still rely on the fact that the predictions of the model, on average, are accurate. Setting the temperature to zero discards confidence information, but retains the $\arg\max$ (i.e., the prediction).
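For completeness, a few lines reproducing the arithmetic of this example (illustrative only):

```python
import numpy as np

p = np.array([[0.9, 0.1, 0.0],     # overconfident and wrong
              [0.3, 0.6, 0.1],     # correct
              [0.25, 0.6, 0.15]])  # correct

p_bar = p.mean(axis=0)                             # [0.483, 0.433, 0.083] -> class 0 (wrong)
p_zero = np.eye(3)[p.argmax(axis=1)].mean(axis=0)  # one-hot votes: [0.33, 0.67, 0.0] -> class 1 (correct)
print(p_bar.argmax(), p_zero.argmax())             # 0 1
```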

Q4: single-test point. To clarify, we will include the following before l. 252: “Similarly to [37, 32, 53], we always work with a single test point at a time.”

Comment

Dear Authors,

Thank you for your response, which has addressed the initial concerns. I tend to maintain my score of “accept.” Please ensure the revision incorporates all the responses.

Best,

Reviewer a5fT

Comment

Dear Authors,

Please clarify two major points to help me understand the effectiveness of ZERO.

  • Cross-dataset setting. Thanks for the response. Why not include Caltech in Tab. 2 as TPT [37] and PromptAlign [32] do? Also, Table 1 of RLCF [53] reports the comparison with TPT on natural distribution shifts, using CLIP-ViT-B-16; why not include it in Table 1? MaPLe (Table 5 in [14]) reports results on ImageNet; why not report them in Table 1?

  • Zero temperature. Following the suggested case, let $p_1 = [0.1, 0.9, 0.0]$ - correct, $p_2 = [0.6, 0.3, 0.1]$ - incorrect, and $p_3 = [0.6, 0.25, 0.15]$ - incorrect; then $\bar{p} = [0.43, 0.48, 0.08]$. After using zero temperature, $\bar{p}_{zero} = [0.67, 0.33, 0.00]$. This gives the wrong class prediction.

I understand there are two points here: 1) 'confidence information can be misleading', and 2) 'the predictions of the model, on average, are accurate'. How do you make sure the second point persists? Moreover, using zero temperature makes the aggregation of prediction scores (a kind of) majority vote over the predicted classes.

Kind regards,

Reviewer a5fT

Comment

Dear reviewer,

We thank you for the feedback. We are happy to see that our responses clarified most of your concerns and that you would tend to accept our work. We provide answers to the remaining points below.

Cross-datasets setting

  1. Results on Caltech-101 are already reported in Table 2. The column acronym is “CAL” (lines 264-267 describe the acronyms).
  2. RLCF [53] always uses two networks: a student model “CLIP-ViT-B-16” and a teacher model “CLIP-ViT-L-14”, even if they only write “CLIP-ViT-B-16” in their tables. Our experiment in Table 1 is analogous to theirs. We deemed it fairer to clearly report this fact in the tables, as using a CLIP-ViT-L-14 is a non-negligible advantage. We wrote “CLIP-ViT-B-16 + CLIP-ViT-L-14” to provide further clarifications w.r.t. their table.
  3. We did not consider MaPLe on ImageNet-1k following PromptAlign [32] (note that PromptAlign and MaPLe share the same authors). Please recall that we aim to compare TTA methods and MaPLe is not one of them, but a supervised prompt-learning approach. However, it can be used as a baseline to adapt, which is what PromptAlign [32] does. PromptAlign always uses a MaPLe initialization and computes offline statistics on ImageNet-1k, which are used as an alignment target during TTA. For this reason, TTA methods cannot be fairly compared on this dataset, since PromptAlign has an unfair advantage. The authors of PromptAlign [32] did not report results on ImageNet for this reason, see Table 1 therein. Nevertheless, for reference: Zero-Shot MaPLe reaches 70.72%, adapting with Zero boosts to 72.51% (+1.79% improvement).

Zero temperature. We agree that, in the provided example, using $\bar{p}$ would be correct. However, the provided example does not comply with the motivating observations of the manuscript. Figure 1(b), as well as Figure 3 (appendix), show that the error of the model on augmented views is low. This does not apply to the provided example.

Q: “How to make sure that the predictions of the model on augmented views, on average, are accurate?” “Making sure” is quite challenging, but filtering views aims exactly at this. High entropy is a common trait of OOD data, which highly correlates with inaccurate predictions. Please notice that our manuscript and our response already provide comprehensive empirical verification that this happens very often: i.e., the error ($1-\text{accuracy}$) remains comparable on augmented views if filtering is applied (see the white text boxes in Figure 1(b) and Figure 3 of the manuscript, as well as Figures 1(a) and 1(b) of the general response).

On voting. Yes, Zero bridges the gap between “Test-Time Adaptation” and “Test-Time Augmentations” by showing that adapting a single parameter is an almost exact approximation of the discrete action of voting. This is a positive fact because it ensures that the theoretical insights of Sec. 2.3 are met without relying on possibly missing model calibration (lines 224-233 of the manuscript). This is also why we always refer to Zero as a baseline throughout the manuscript: we deem it a simple and effective TTA approach that can be taken as a reference for future works in this field.

We hope that these responses are comprehensive and answer the remaining concerns. For further clarifications, please do not hesitate to proceed with the discussion. Thank you.

Comment

Dear Authors,

  • In the cross-dataset setting, I noticed that PromptAlign reports results on 10 datasets, while this submission reports on 9 datasets. Please clarify the reason.

  • It would be helpful if the results in Table 1 could be reorganized. Please consider clearly illustrating the backbone and the compared methods to reduce potential confusion.

  • Based on the response, it seems the primary contribution lies in the theoretical analysis presented in Section 2.3 rather than the method itself.

Best,

Reviewer a5fT

Comment

10 datasets. We also used 10 datasets; one is reported in the Appendix. As we describe in footnote 2, we find that the extremely OOD domain of satellite imagery (i.e., EuroSAT [10]) leads to consistent failures of all TTA methods (i.e., all methods consistently perform worse than a simple zero-shot baseline). Hence, we believe it is important to understand why this is the case, other than solely reporting numbers. To this aim, we have dedicated an entire section to the study of this dataset in Appendix E. We also reported qualitative examples on this topic. The key takeaway from our analysis is the following: the extreme OOD domain of satellite imagery is the only dataset, out of 15, where the error of the model on augmented views is in no way comparable to the error of the model on source images. This analysis suggests that this domain is a controversial benchmark for TTA, since crafting augmentations for satellite imagery requires an ad-hoc treatment, which, in turn, implies that some form of prior knowledge about the test benchmark is available. In contrast, it is of paramount importance to avoid any test data information in TTA, to ensure the core principles of the field are not violated. We are open to also reporting these results in the main body of the manuscript, but, if the reviewer agrees, we would keep our in-depth analysis in the Appendix.

Re-organization of Table 1. In Table 1, we already indicate all architectures at the top of each group, but we see how this can cause confusion for the comparison with RLCF. We can revise the last part accordingly, indicating the teacher-student setup of the latter, while also digging deeper into the design of RLCF already summarized from lines 242-244. We may also indicate the backbones in a column, rather than rows. We are open to including any suggestion in this regard.

Theoretical analysis is the main contribution. Yes, especially in light of the fact that the method itself originates from the theoretical insights. We highlight that such theoretical findings are rarely included in TTA methods for VLMs (e.g., TPT, PromptAlign, and RLCF do not present any). This is an added value for the community, as it provides tools to address a problem beyond mere quantitative results. To recap, our technical contributions are threefold:

  1. To theoretically show when MEM does not change the prediction of the marginal probability distribution, and to empirically verify that MEM has largely no effect on it;
  2. To theoretically and empirically demonstrate when the error rate of the latter is a lower bound to the base error of a VLM in the setup of TTA, and how this is linked to overconfidence.
  3. Motivated by these insights, we propose Zero, a simple method for TTA that does not require any training (being 10x faster and 13x more memory efficient than TPT).
Comment

I am not fully convinced by the response regarding the benchmark comparison, particularly since ZERO exhibits a significant performance drop on EuroSAT, which has not been clearly reported following the standard protocol. Other methods do not exhibit such a drop. An apples-to-apples comparison is fundamental to understanding the effectiveness of the proposed approach. While I acknowledge the merits of simplicity and analysis, I remain skeptical about the experimental evaluations.

Considering the current two evaluations and the lack of guarantee that “the predictions of the model, on average, are accurate,” I am not convinced of ZERO’s generalization to other tasks, backbones, or datasets.

Comment

We thank the reviewer for engaging in the discussion. We are sorry to hear there are remaining points on the experimental evaluation. We hope that we have answered the concerns on the datasets (i.e., we perform all experiments on the standard benchmark datasets) and on the fairness of the comparison (i.e., how we separate each group and how we performed experiments under the same architecture and protocols for each of them). Below we answer the remaining points.

On the effectiveness of Zero. It is true that Zero exhibits drops on EuroSAT and we discuss this also in the Limitations (lines 1009-1011). The purpose of Appendix E was to provide a thorough and comprehensive discussion about this, rather than hiding it (given that all models do not outperform zero-shot CLIP or CLIP-Ensemble). We apologize if this caused confusion and we will include these results in the main paper, together with a summary of the discussion.

However, we would also like to highlight that:

  1. Other methods set specific hyperparameters for each setting. For example, TPT uses different data augmentations between the Natural Distribution Shifts benchmark and the Fine-grained suite, while PromptAlign also sets different learning rates for different datasets in Fine-grained classification.
  2. Zero largely surpasses other methods on ImageNet-A (Tab. 1). For example, in the first group, the best variant of Zero+Ensemble surpasses TPT by +9.29%, while in the second group, Zero surpasses PromptAlign by +5.28%. This gap is very significant and shows that, while there are indeed failure cases, there are also important success cases.

On the generalization of the results. We focused our comparison on the datasets commonly used in the literature, i.e., natural distribution shifts and fine-grained ones used in TPT and PromptAlign, with three different backbones. This makes for a total of 44 different comparisons, out of which 25 times Zero outperforms the respective competitors (56.8% of the time). With Zero+Ensemble, these become 28 (63.6% of the time). Overall we believe these results confirm that Zero is a simple baseline that should be considered when testing TTA: if other methods train on augmented views, they should at least outperform Zero most of the time.

However, in an effort to further enrich our experimental evaluation, we report here results for both benchmarks of the manuscript with a CLIP model suggested in the initial review: CLIP-ViT-B-16 pretrained on LAION2B (see next comment). We report the first comparison group with TPT, as: (1) for MaPLe, we should train new prompts for this model, which are not officially available, and (2) RLCF requires tons of computational resources (16h 40mins for one run on ImageNet-1k).

Comment

Table: Natural Distribution Shifts with LAION2B CLIP.

| Method | IN-1k | IN-A | IN-v2 | IN-R | IN-Sketch | Average |
|---|---|---|---|---|---|---|
| Zero-Shot | 69.27 | 37.08 | 61.27 | 78.83 | 54.85 | 60.26 |
| Ensemble | 70.43 | 38.32 | 62.28 | 80.41 | 55.54 | 61.40 |
| TPT | 70.61 | 41.94 | 62.96 | 80.40 | 55.48 | 62.28 |
| Zero | 70.87 | 50.21 | 63.41 | 80.49 | 55.09 | 64.01 |
| Zero + Ensemble | 71.51 | 50.43 | 63.82 | 82.28 | 55.73 | 64.75 |

Table: Fine-grained Classification with LAION2B CLIP.

| Method | FLWR | DTD | PETS | CARS | UCF | CAL | FOOD | SUN | AIR | ESAT | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-Shot | 69.71 | 54.43 | 89.37 | 89.94 | 64.02 | 95.82 | 81.38 | 70.60 | 26.04 | 47.05 | 68.84 |
| Ensemble | 68.70 | 54.55 | 87.76 | 89.98 | 67.64 | 96.51 | 81.64 | 70.62 | 25.68 | 49.64 | 69.27 |
| TPT | 69.47 | 54.53 | 89.00 | 90.72 | 66.68 | 96.16 | 81.76 | 71.34 | 26.73 | 48.81 | 69.52 |
| Zero | 70.17 | 55.42 | 89.13 | 91.76 | 67.01 | 96.08 | 81.67 | 70.02 | 28.32 | 44.96 | 69.45 |
| Zero + Ensemble | 67.16 | 55.85 | 87.10 | 91.68 | 68.44 | 96.55 | 81.78 | 69.89 | 28.09 | 47.47 | 69.40 |

These results are consistent with the manuscript and show that, in both benchmarks, Zero outperforms TPT for most datasets: 4 out of 5 for Natural Distribution Shifts, 6 out of 10 for Fine-grained Classification. We also observe consistent patterns in the best and worst cases highlighted above (ImageNet-A and EuroSAT). We will report these results in the updated manuscript.

We believe that these results make our submission stronger, and we are open to including other suggestions that may provide further evidence.

On the score. We are sad to see a decrease in the score for our manuscript, as we believe we provided thorough answers to address the raised concerns. If there are points that we failed to address, please let us know and we will do our best to address them. Thank you.

Comment

Dear Reviewer a5fT,

As the end of the authors-reviewers discussion period approaches, we would be grateful if you could acknowledge our last comments and the newly supplied experiments therein. Any last-minute feedback would be profoundly appreciated.

Thank you and best regards,
Authors

Official Review
Rating: 8

This work shows that Marginal Entropy Minimization (MEM), a leading class of methods for Test Time Adaptation (TTA) which involves minimizing the entropy of the predictive distribution marginalized over different views of the input, regularly results in the same argmax (and thus the same final class prediction) as the argmax of this marginal distribution. This means that the entropy minimization step, which requires not insignificant computational overhead, often does not provide any benefit.

They then show that a very simple approach involving just setting the temperature of each predictive distribution to 0 prior to marginalization is a very strong baseline that outperforms MEM while being much more computationally efficient.

Strengths

  • The authors discover a critical flaw in MEM approaches
  • They propose a simple but very effective baseline that outperforms state of the art methods
  • They provide theoretical analysis of their approach

I like this paper quite a bit. It exposes a critical flaw in a popular class of TTA methods, and provides a strong, theoretically-backed alternative method that is very simple to implement.

Weaknesses

  • In Section 2.3, in the "Revisiting model assembling for TTA" paragraph, do different views of the input really count as independent samples? It seems like they would be highly dependent on each other.
  • I would tone down some of the stronger claims just a bit. For instance, the paragraph header "Poor calibration is always caused by overconfidence." might be overclaiming, since from my understanding this is an empirical observation on a sample of datasets, not an analytical fact.

Questions

  • In Section 2.3, in the "Revisiting model assembling for TTA" paragraph, do different views of the input really count as independent samples? It seems like they would be highly dependent on each other.

Limitations

The limitations section is sufficient. Though I think ideally it would be part of the main paper, rather than in the Appendix.

Author Response

W1 - Q1: Independence among views. We agree and we thank the reviewer for this remark, which allows us to enrich our manuscript further.

The theoretical framework of Section 2.3 models an ideal scenario, where independence holds among different inputs. To clarify, this means that the model's error on view $\mathbf{x}_i$ should not be correlated with the error on another view $\mathbf{x}_j$, which allows writing the compound error with a binomial distribution.
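Concretely, with a per-view error rate $\epsilon$ and $N$ independent views in the two-class setting, the compound (majority-vote) error takes the usual binomial form (a generic restatement; the notation of Eq. (7) in the manuscript may differ):

```latex
\[
\Pr\big[\text{majority of } N \text{ views is wrong}\big]
  \;=\; \sum_{k > N/2} \binom{N}{k}\, \epsilon^{k}\,(1-\epsilon)^{N-k},
\]
```

which, for odd $N$, decreases as $N$ grows whenever $\epsilon < 1/2$, in line with the Condorcet Jury Theorem mentioned elsewhere in this discussion.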

In practice, achieving perfect independence is challenging, if not impossible. Hence, a suitable approximation strategy to mitigate this issue is to promote diversity. In classical ensembling theory, a well-established approach is to train different models on different subsets of the available data. Similarly, our augmentation scheme of random cropping aligns with this approach by presenting the model with different portions of the image each time.

Moreover, ideally, the augmentation pipeline should not change the underlying label of the original input and guarantee that the model's error rate on augmented views remains comparable to the error rate on the original inputs belonging to the same category ($\epsilon(y)$ in the caption of Figure 1(a) of the manuscript). In practice, this entails that augmentations should not disrupt the visual appearance of the image, and, consequently, some views may result in a slight or moderate correlation, because some "parts" of the source image will overlap among them. An analogy with classical literature can be drawn also in this case. Specifically, when not enough data are available, overlaps among the training sets of different models are required to ensure convergence. As a consequence, models producing slightly or moderately correlated predictions are more likely to emerge.

While we have tried to highlight this potential limitation in the dedicated section, we believe a tailored discussion may be helpful for the readers. Hence, we propose to replace lines 139-140 with a pointer to an in-depth appendix on this subject, where we will include this discussion.

W2: Overconfidence and "always". We agree with this remark. As we acknowledge in the Limitations section, this is an empirical observation stemming from the combinations of models and datasets that we tested, and it may not extend to the space of all existing VLMs and datasets (or those that will arise in the future).

As suggested by Reviewer a5fT, Figures 1(a) and 1(b) of the PDF attachment show additional experiments with LAION-pretrained CLIP models, which confirm our initial observations. We hope that these will strengthen the analytical section, and plan to include these results in the Appendix.

Finally, we propose detailed changes to the manuscript to tone down some passages from lines 192 to 197 (shared with reviewer a5fT):

  1. line 192, paragraph header: “Poor calibration is frequently linked to overconfidence.”
  2. lines 195-196: “Notably, in the scope of our experiments, overconfidence is the primary factor leading to an increase in ECE.”
  3. line 197: “In Appendix B, we also experiment across all datasets for Natural Distribution Shifts and different CLIP models pretrained on LAION. Importantly, this phenomenon further persists within this extended experimental suite.”

Thank you for this suggestion.

About Limitations. We are glad that our efforts in highlighting the limitations of our work were appreciated! We agree that, ideally, a Limitations section should be part of the main body of the paper. This year, NeurIPS grants an extra page for the final revision of the manuscript upon acceptance; if our paper is accepted, we will use it to follow this suggestion. Thank you.

Comment

Thank you for your response. These proposed changes seem great to me, and address the few concerns I had.

I've read over the other reviews and the author responses, and I strongly believe that this work should be accepted to NeurIPS.

Comment

Thank you for your positive feedback on our work and for the suggestions!

We are happy to hear that the responses and proposed changes addressed the remaining concerns. We will implement the latter as we are sure they will further improve the quality of the work.

We would also like to thank you profoundly for your efforts in engaging with reviewer Wx8e.

Review
7

This work carefully reviews the popular test-time adaptation (TTA) method, MEM, and finds that MEM largely has no effect on $\arg\max(p)$. Based on this understanding, the work further introduces a clean method called Zero, which shows decent performance on the TTA task.

Strengths

  1. Great motivation for bringing insights into Test-Time Augmentations.
  2. This is a simple method that can be easily integrated into existing models and algorithms. The performance on natural distribution shifts is strong, and the memory cost is impressively small.

Weaknesses

Although the performance of ZERO is good on natural distribution shifts, it is not as effective in fine-grained classification. It would be helpful to have more insight into why the performance on fine-grained classification is lacking.

Questions

Please refer to the weakness.

Limitations

Please refer to the weakness.

Author Response

W1 - Performance on Fine-grained datasets vs Natural Distribution Shifts.
We thank the reviewer for raising this interesting point, allowing us to further investigate our method.

A possible explanation may be linked to Sections 2.3 and 3.1 of the manuscript. Specifically, Zero improves over the zero-shot baseline if the error rate does not largely increase with augmented views. As Fig.1(b) of the manuscript displays, this is the case for all Natural Distribution Shifts datasets. For Fine-grained classification, we discuss here three distinct datasets for which Zero exhibits different behaviors:

  1. Flowers102. Zero does not improve over the baseline here.
  2. Caltech101. Zero marginally improves here.
  3. SUN397. Zero largely improves here.

To understand these different behaviors, we repeat the same experiment of Section 3.1 for the entire Fine-grained suite and report here the results for the aforementioned datasets, formatted as follows: [zero-shot accuracy, augmented-version accuracy, error gap].

By error gap, we refer to: error gap = zero-shot accuracy − augmented-version accuracy.

We additionally include the accuracy of Zero.

Results [%]

  1. Flowers102 = [67.44, 66.19, 1.25] | Zero = 67.07 (−0.37 w.r.t. zero-shot baseline)
  2. Caltech101 = [93.35, 92.62, 0.73] | Zero = 93.51 (+0.16 w.r.t. zero-shot baseline)
  3. SUN397 = [62.59, 62.97, −0.38] | Zero = 64.49 (+1.90 w.r.t. zero-shot baseline)

For completeness, all results are reported in Table 1 of the PDF attachment of the general response.

Overall, there is a strong correlation between the error gap and the improvement provided by Zero, with Spearman’s coefficient being −0.95 across all datasets. This result shows that the correlation is negative, i.e., the lower the error gap, the larger the improvement. This pattern is also consistent with the experiments on EuroSAT reported in the Appendix of the manuscript (App.E, Tab.5).
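
As a small illustration of how this correlation can be computed (a sketch using only the three example datasets listed above; the full per-dataset numbers are in Table 1 of the PDF attachment, and the −0.95 value refers to the whole suite):

```python
from scipy.stats import spearmanr

# Per-dataset statistics from the examples above (full suite in the PDF attachment).
# error_gap = zero-shot accuracy - augmented-version accuracy
error_gap   = {"Flowers102": 1.25, "Caltech101": 0.73, "SUN397": -0.38}
# improvement = Zero accuracy - zero-shot accuracy
improvement = {"Flowers102": -0.37, "Caltech101": 0.16, "SUN397": 1.90}

datasets = list(error_gap)
rho, _ = spearmanr([error_gap[d] for d in datasets],
                   [improvement[d] for d in datasets])
print(rho)  # negative: the lower the error gap, the larger the improvement
            # (the coefficient reported across *all* datasets is about -0.95)
```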

Hence, the reviewer’s question is directly linked to this follow-up question: Why do augmentations increase or decrease the error gap with different datasets? While this may be a case-by-case matter, we pinpoint two possible reasons:

  1. The semantic space of the ImageNet variants of the Natural Distribution Shifts benchmark comprises many common categories, which may have appeared frequently during CLIP’s pretraining. Hence, it seems reasonable that CLIP is robust w.r.t. augmented views of images belonging to these categories. In the Fine-grained classification suite, datasets such as SUN397 and Caltech101 also contain common object categories, which is consistent with the results shown above. Other datasets, such as Flowers102 and Oxford-Pets, span much less frequent concepts, and CLIP is less robust w.r.t. their augmented views.

  2. Other than the semantic classification space, the visual appearance of images also plays an important role. For example, datasets such as FGVC-Aircraft and Stanford Cars still contain rare concepts, but Zero largely improves over the baseline nonetheless (main paper, Tab.2, first comparison group). Our augmentation setup is simple and only contains random resized crops and random horizontal flips, which can constitute a “zoom-in” on a random portion of the image (a minimal sketch of such a pipeline is shown right after this list). For some benchmarks, this is useful as it may trigger CLIP’s capabilities to recognize small details, such as logos, or even to read text, such as the car brand or the airline name. In contrast, for Flowers102, these crops may lead to the loss of precious visual features, such as the stem.
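
For reference, a minimal sketch of such an augmentation pipeline (written with the standard torchvision API; the crop scale, image size, and number of views here are illustrative assumptions, not the exact values used in the paper):

```python
import torchvision.transforms as T

# Simple view-generation pipeline: random resized crops ("zoom-ins") plus
# horizontal flips; crop scale and view count are assumptions for this sketch.
augment = T.Compose([
    T.RandomResizedCrop(size=224, scale=(0.5, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

def make_views(image, n_views=64):
    """Return a list of augmented views of a PIL image."""
    return [augment(image) for _ in range(n_views)]
```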

To conclude: in our work we did not search for the best data augmentations but rather stuck to an established setting, using the same augmentation setup for all datasets. Nevertheless, the performance of Zero is linked to the impact that data augmentations have on how the model perceives images, and we believe this is an interesting research direction to pursue.

We also think that this may be a useful discussion, and plan to include it in the Appendix of the revised manuscript.

Author Response

General Comment

We sincerely thank all reviewers for the time and effort devoted to reviewing our manuscript.

Above all, we profoundly appreciate that the simplicity of the proposed baseline has been praised almost unanimously among reviewers (Wz3n, a5fT, 7eTE).

On theoretical results, we are glad that our “clear” theoretical analysis (Wx8e) has been pointed out as a strength that “discovers a critical flaw in a popular class of TTA methods” (7eTE) while providing two “interesting insights”, supported by “helpful empirical results” (a5fT), which constitute “great motivations for bringing insights into Test-Time Augmentations” (Wz3n).

On experimental results, we are happy that our strategy was described as a “simple but very effective baseline that outperforms state-of-the-art methods” (7eTE), all with a “memory cost impressively small” (Wz3n). We are also glad to read that the presentation was appreciated: “the highlighted presentation of the experimental section greatly aids reviewers in rapidly analyzing the experimental results” (Wx8e), “several baselines are included and discussed, which helps understand the effect of ZERO” (a5fT).

Finally, we appreciate that our efforts in highlighting the limitations of our work were acknowledged (a5fT, 7eTE).

Rebuttal Content

Concerning doubts and weaknesses, we provide detailed responses to each reviewer. Summarizing:

  1. We follow the suggestion of reviewer a5fT and conduct additional experiments with LAION-pretrained CLIP models about overconfidence, complementing Section 3.1 of our submission. These results align with the initial observations of the manuscript;

  2. Although not requested, we support our answers with additional experimental verification where applicable. This applies to (1) the analysis of invariance to entropy minimization (a5fT) and (2) explaining why Fine-grained datasets see smaller improvements than Natural Distribution Shifts datasets (Wz3n);

  3. We propose explicit ad-hoc rephrasings of the manuscript whenever possible. We hope these help clarify any misunderstandings;

When applicable, figures portraying the outcome of additional experiments are provided in the PDF attachment. We hope that our responses are comprehensive, clear, and satisfactory to the reviewers; otherwise, we look forward to engaging in a fruitful discussion.

Response Formatting

All responses are organized into questions and weaknesses. For example, Q1 refers to the first question, while W1 to the first weakness.

Responses may contain references. When these are numbered (e.g., [32]) they refer to the references of the manuscript. When these are lettered (e.g., [a]) they refer to the list below.

References for the rebuttal
[a] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML 2021.
[b] Caron, Mathilde, et al. "Unsupervised learning of visual features by contrasting cluster assignments." NeurIPS 2020.
[c] Caron, Mathilde, et al. "Emerging properties in self-supervised vision transformers." ICCV 2021.
[d] Zhang, Hongyi, et al. "mixup: Beyond Empirical Risk Minimization." ICLR 2018.
[e] Cubuk, Ekin D., et al. "Randaugment: Practical automated data augmentation with a reduced search space." CVPR-W 2020.
[f] Yu, Jiahui, et al. "CoCa: Contrastive Captioners are Image-Text Foundation Models." TMLR.
[g] Sun, Mingjie, et al. "A Simple and Effective Pruning Approach for Large Language Models." ICLR 2024.
[h] Tolstikhin, Ilya O., et al. "Mlp-mixer: An all-mlp architecture for vision." NeurIPS 2021.
[i] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." CVPR 2016.
[j] Cherti, Mehdi, et al. "Reproducible scaling laws for contrastive language-image learning." CVPR 2023.

Final Decision

I would like to thank the reviewers for spending a lot of time reading the paper carefully and providing valuable suggestions. I also thank the authors for providing detailed responses to the reviewers’ questions and addressing many of the reviewers’ concerns. After the rebuttal and discussion, there is still disagreement among the reviewers. The AC read through the manuscript, all reviews, the discussion, and the rebuttal. One reviewer is concerned about the performance report on EuroSAT and another reviewer is concerned about the novelty of the paper. Thanks again for the time spent on reading and discussing the paper! The AC agrees that the method is simple yet provides outstanding performance with careful analysis. The AC believes we should promote simple-and-effective methods while avoiding making methods unnecessarily complex. The AC also agrees that the authors should report the results and analysis of performance on EuroSAT in the main paper. The authors are highly encouraged to improve the paper quality according to the reviewers' feedback in the camera-ready version. The AC decided to accept this submission.