PaperHub

NeurIPS 2024 · Poster · 3 reviewers
Rating: 5.3/10 (scores: 5, 6, 5; min 5, max 6, std 0.5)
Confidence: 3.3 · Correctness: 2.7 · Contribution: 3.0 · Presentation: 2.7

Protected Test-Time Adaptation via Online Entropy Matching: A Betting Approach

OpenReview · PDF

Submitted: 2024-05-12 · Updated: 2025-01-10
TL;DR

A novel self-training approach for adapting ML models to test-time distribution shifts by monitoring the model's output and aligning it with the source domain's statistics.

Abstract

Keywords

Test Time Domain Adaptation, Online Learning, Testing by Betting, Martingale, Distribution Shift Detection

Reviews and Discussion

Official Review

Rating: 5

This paper proposes a novel test-time adaptation method based on martingales and online learning. It detects whether testing samples need to be adapted based on the sequential entropy values. Then, if a sample needs to be adapted, a pseudo-entropy value is computed for the adaptation. Overall, the idea of this paper is reasonable and interesting.

Strengths

  1. The authors replace entropy minimization with entropy matching, which is interesting. Under this main idea, online drift detection and online model adaptation are naturally proposed and make sense.
  2. This paper is well written and easy to follow.

Weaknesses

  1. The experiments are relatively weak as the authors only conduct experiments on the ImageNet-C dataset, ignoring the CIFAR10-C and CIFAR100-C datasets. It would be better to present the results on datasets with a small number of classes.
  2. The proposed method estimates the pseudo-entropy at testing time. However, I wonder whether this can be done when the label distribution [1, 2] also shifts, because the shifted label distribution also affects the sequential entropy values. This should be discussed in detail, along with the related papers.
  3. The “Protected” in the title of this paper and name of this method should be carefully considered, as the overall method does not seem to explicitly ensure performance safety.

[1] NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation. NeurIPS 2022

[2] ODS: Test-Time Adaptation in the Presence of Open-World Data Shift. ICML 2023

Questions

Please refer to the Weaknesses section.

  1. The experiments show that the proposed method can achieve a lower ECE value. Further discussion of how these results come about and of their practical benefit would be helpful.

Limitations

The authors have discussed the limitations at the end of this paper.

Author Response

We thank the reviewer for the positive feedback and valuable suggestions. We are pleased that the reviewer found that “online drift detection and online model adaptation are naturally proposed and make sense.” We appreciate the positive feedback regarding the clarity of our writing. Thank you!

The experiments are relatively weak as the authors only conduct experiments on the ImageNet-C dataset, ignoring the CIFAR10-C and CIFAR100-C datasets.

Kindly refer to the global response to all reviewers.

The proposed method estimates the pseudo-entropy at testing time. However, I wonder whether this can be done when the label distribution [1, 2] also shifts, because the shifted label distribution also affects the sequential entropy values.

Thank you for raising this important issue. Extending our proposal to the label shift setting remains an open question for us. The idea in [1] may serve as a promising starting point. Specifically, we found the idea of prediction-balanced reservoir sampling appealing, as it can be used to approximately simulate an i.i.d. data stream from a non-i.i.d. stream in a class-balanced manner. This can potentially reduce the sensitivity of the martingale process to label shifts. In turn, we anticipate that under label shift and in the absence of covariate shift, our adaptation would be minimal; and in the presence of the latter, the adaptation would be more substantial, as desired.
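To make the reservoir idea concrete, below is a minimal, hypothetical sketch of prediction-balanced reservoir sampling in the spirit of NOTE [1]; the class and method names are our own illustration, not NOTE's actual implementation:

```python
import random
from collections import defaultdict

class PredictionBalancedReservoir:
    """Sketch: keep a memory of test samples that is approximately
    balanced across *predicted* classes, so that batches drawn from it
    look closer to i.i.d. even when the incoming stream is temporally
    correlated or label-shifted."""

    def __init__(self, capacity, num_classes):
        self.per_class_cap = max(1, capacity // num_classes)
        self.memory = defaultdict(list)  # predicted class -> stored samples
        self.seen = defaultdict(int)     # per-class arrival counts

    def add(self, x, pred_class):
        self.seen[pred_class] += 1
        bucket = self.memory[pred_class]
        if len(bucket) < self.per_class_cap:
            bucket.append(x)
        else:
            # classic reservoir step, applied within the predicted-class bucket
            j = random.randrange(self.seen[pred_class])
            if j < self.per_class_cap:
                bucket[j] = x

    def batch(self):
        # a class-balanced batch to feed the monitoring / adaptation step
        return [x for bucket in self.memory.values() for x in bucket]
```

Feeding the martingale with entropies computed on such a balanced memory, rather than on the raw stream, is what we would expect to reduce its sensitivity to label shift.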

Another possible approach would be to work with a weighted source CDF rather than the vanilla source CDF, where the weights should correspond to the likelihood ratio $P_t(Y)/P_s(Y)$. The use of such a weighted CDF was suggested in the conformal prediction literature to adjust for label shift between the source holdout data and test points [3], making the test loss “look exchangeable” with the source losses. Certainly, the challenge in this context is to approximate the likelihood ratio $P_t(Y)/P_s(Y)$ in a reasonable manner. This challenge becomes more pronounced when faced with both a covariate and label shift. It may be the case that reference [2], pointed out by the reviewer, could be a valuable starting point for us to explore this path.
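As a rough illustration of this direction (our own sketch, assuming an estimate of the likelihood ratio $P_t(Y)/P_s(Y)$ is given, which is precisely the hard part; all names are hypothetical):

```python
import numpy as np

def weighted_source_cdf(source_entropies, source_labels, ratio):
    """Sketch: reweight each holdout source entropy by an estimated label
    likelihood ratio P_t(Y)/P_s(Y) (`ratio` maps label -> weight), so the
    resulting CDF reflects the shifted target label distribution, in the
    spirit of the weighted CDFs of [3]."""
    z = np.asarray(source_entropies, dtype=float)
    w = np.array([ratio[y] for y in source_labels], dtype=float)
    w /= w.sum()
    order = np.argsort(z)
    z_sorted, cum_w = z[order], np.cumsum(w[order])

    def F_s_weighted(z_test):
        # weighted empirical CDF evaluated at a single entropy value
        idx = np.searchsorted(z_sorted, z_test, side="right")
        return cum_w[idx - 1] if idx > 0 else 0.0

    return F_s_weighted
```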

We will include a discussion on this matter in the revised version of the paper.

[1] T. Gong, et al., "NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation," NeurIPS, 2022.

[2] Z. Zhou, et al., "ODS: Test-Time Adaptation in the Presence of Open-World Data Shift," ICML, 2023.

[3] A. Podkopaev and A. Ramdas, "Distribution-free uncertainty quantification for classification under label shift," Uncertainty in Artificial Intelligence, 2021.

The “Protected” in the title of this paper and name of this method should be carefully considered.

We are now seriously considering removing the word "protected" from the title. We initially used this word for three main reasons. First, our monitoring tool rigorously alerts for distribution shifts, and the ability to raise such a warning is crucial for communicating with users that the model is encountering new environments. Second, our approach has been shown to have no harmful effect when the test data follows the same distribution as the source domain, which is a significant concern in test-time adaptation. Third, we wanted to acknowledge the protected regression method [1] that inspired our proposal.

[1] Vladimir Vovk, “Protected probabilistic regression,” technical report, 2021.

The experiments show that the proposed method can achieve a lower ECE value.

Recall that the ECE is presented when applying self-training on in-distribution test data (Figure 3, left panel). This experiment demonstrates that in this in-distribution scenario, our method maintains the ECE of the source model—we are not improving the ECE of the source model. Importantly, this stands in contrast to entropy minimization methods that, by design, drive the model to make over-confident predictions, as reflected by the increased ECE value in Figure 3. This emphasizes that when the test samples do not shift, we preserve the calibration property of the model and avoid making over-confident predictions, which is desired in practice. We will clarify the exposition surrounding Figure 3 in the text accordingly.
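For completeness, the ECE values in Figure 3 follow the standard binned definition; a minimal sketch of that standard recipe (not code from the paper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: the average gap between mean confidence and empirical
    accuracy within each confidence bin, weighted by bin frequency."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece
```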

Comment

Thank you for your response. I would like to keep my positive score for this paper.

Comment

Thank you for your engagement and for your positive feedback! We sincerely appreciate your thoughtful comments, which have helped improve our work.

Official Review

Rating: 6

This paper introduces a novel method for test-time domain adaptation using online self-training. It combines a statistical framework for detecting distribution shifts with an online adaptation mechanism to dynamically update the classifier's parameters. The approach, grounded in concepts from betting martingales and optimal transport, aligns test entropy values with those of the source domain, outperforming traditional entropy minimization methods. Experimental results demonstrate improved accuracy and calibration under distribution shifts.

Strengths

  1. It is interesting to see betting and martingale appear in test-time adaptation, especially for modeling CDF for better shifted test sample prediction.
  2. The paper is overall easy to follow, with rich experiments and visualizations. The algorithms clearly explain how the framework works.
  3. The experiments on two TTA settings (single domain and continual TTA) confirm its effectiveness.

Weaknesses

  1. It is not very intuitive to use betting in TTA. Although it might work for modeling the CDF, a martingale is not naturally suited to modeling the entropy CDF in TTA.
  2. It is not very suitable to use the term "domain adaptation" in this context. Domain adaptation typically allows multiple epochs for adaptation, even in source-free settings, and uses target training samples while evaluating on target testing samples. In TTA, the same set of test data is used for adaptation and testing.
  3. In Figure 1, entropy minimization shows peaks while entropy matching shows valleys when facing data in the tail for both classes, indicating that both are good indicators of class boundaries regardless of the changed data distribution. However, as described in lines 176-181, it seems the optimization will follow either the black line (entropy minimization) or the red line (entropy matching), making the meaning of this figure a bit vague.
  4. The model requires calculating the source CDF, which requires extra time. Additionally, if there are source privacy concerns, it may not be possible to perform such calculations if the source model is only made available.
  5. The experiments do not include an efficiency study, which is one of the motivations for doing TTA.
  6. The paper does not include commonly used TTA datasets such as CIFAR10-C, CIFAR100-C, or any domain adaptation datasets such as OfficeHome, DomainNet, etc.
  7. There is no sensitivity study.
  8. This paper does not compare with state-of-the-art TTA baselines such as ROID [1].

[1] Marsden, R. A., Döbler, M., & Yang, B. (2024). Universal test-time adaptation through weight ensembling, diversity weighting, and prior correction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2555-2565).

Questions

  1. Can you explain how betting and martingales specifically contribute to the modeling of entropy CDF and why you believe this is a suitable approach for TTA?

  2. In Figure 1, the comparison between entropy minimization and entropy matching is somewhat unclear. Can you elaborate on the intended interpretation of this figure and how it supports your claims about optimization following either the black or red lines?

  3. The requirement to calculate the source CDF adds extra computational overhead and potential privacy concerns. How do you propose mitigating these issues, especially in scenarios where source model access is restricted or where computational resources are limited?


Limitations

  1. The paper uses the term "domain adaptation," which traditionally implies multiple epochs of adaptation and separate target training and testing samples. In the context of test-time adaptation (TTA), this terminology may cause confusion. A clearer distinction between these methodologies is suggested to avoid misinterpretation.

  2. The requirement to calculate the source CDF introduces additional computational overhead, which has not been thoroughly discussed. In practical scenarios, especially where computational resources are limited or where source model access is restricted due to privacy concerns, this could pose significant challenges. An analysis of the method's computational efficiency and potential solutions to mitigate these issues would be beneficial.

  3. The experiments conducted do not include widely recognized TTA datasets such as CIFAR10-C, CIFAR100-C, or domain adaptation datasets like OfficeHome and DomainNet. Including these datasets in the evaluation would provide a more comprehensive understanding of the method's generalizability and robustness.

  4. The paper does not compare its results with state-of-the-art TTA baselines.

Author Response

We thank the reviewer for the positive feedback and the constructive review. We are pleased that the reviewer appreciates the novelty and clarity of our writing. We very much value the reviewer’s comment that “the experiments on two TTA settings (single domain and continual TTA) confirm its effectiveness.” Thank you!

Why is betting martingale a suitable approach for TTA?

A betting martingale is a powerful tool to monitor distribution shifts in an online manner. We designed a monitoring tool that tests whether the distribution of the classifier’s entropy values drifts at test time. This naturally leads us to ask: if a martingale can detect that the model “behaves” differently at test time, why not use this knowledge to correct the model’s behavior?

Our work shows how the evidence for distribution drift encapsulated in the betting martingale can be used to adapt the model at test time. The idea is to use the martingale to transform the test entropies to “look like” the source entropies—essentially matching the distribution of the source and the self-trained model entropy distributions. This, in turn, builds invariance to distribution drifts, which is a key principle to improve model robustness.
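To illustrate the mechanics of testing by betting (a simplified sketch with a fixed mixture of linear bets, not the exact betting construction used in the paper): under no shift, $u = F_s(Z_t)$ is approximately uniform, every bet has expected payoff one, and the wealth stays flat; under drift, the wealth grows and flags the shift.

```python
import numpy as np

def mixture_betting_martingale(u_stream, lambdas=(-1.5, -0.5, 0.5, 1.5)):
    """Each expert bets with g_lam(u) = 1 + lam * (u - 0.5), which has
    mean 1 when u ~ Uniform(0, 1), so the averaged wealth process is a
    nonnegative martingale under the null of no distribution shift."""
    lambdas = np.asarray(lambdas, dtype=float)
    wealth = np.ones_like(lambdas)
    mixture = []
    for u in u_stream:
        wealth *= 1.0 + lambdas * (u - 0.5)  # payoff of each bet
        mixture.append(wealth.mean())        # mixture martingale value
    return mixture  # alert once this exceeds a threshold, e.g., 1 / alpha
```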

In the interest of space, for a more technical reply, we kindly refer the reviewer to the response we provided to Reviewer Utu5 (“Clarification on the entropy matching procedure”).

The use of the term "domain adaptation".

Your point is well taken! If given the opportunity, we will fix this issue in the revised paper and use “test-time adaptation” instead. We will also remove the word “Adaptation” from the title of the paper. Thank you for your constructive comment.

Can you elaborate on the intended interpretation of Figure 1?

Each curve in Figure 1 presents a different risk (black for entropy minimization, red for entropy matching) as a function of the weight $w$ of the classifier $f_w$. Since the optimization procedure aims to minimize a given risk function by changing the value of the weight $w$, the curves help illustrate the optimal value the procedure should reach.

By minimizing the entropy risk, the optimization ends with a self-trained classifier $f_w$ that achieves the smallest value of the black curve. This results in a trivial classifier that always predicts $+1$ (or always $-1$), regardless of the value of $X$.

By minimizing the entropy matching risk, the optimization ends with $f_w$ that achieves the smallest value of the red curve. This results in a classifier that separates the two classes as much as possible. Indeed, our online method obtained a self-trained classifier whose accuracy (nearly) matches the accuracy of the Bayes optimal classifier both under an in-distribution setting (top panel) and under an out-of-distribution setting (bottom panel).

The model requires calculating the source CDF, which requires extra time.

Recall that this is a CDF of 1-dimensional variables (the source entropies), which is computed only once and offline. The complexity of computing this CDF is dominated by the evaluation of the pre-trained model on a relatively small set of unlabeled holdout samples from the source domain. At test time, we only need to compute the value of the pre-computed CDF at a single point (the test point's entropy value), which amounts to accessing a small, pre-computed 1-dimensional array.
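A minimal sketch of these two phases (our illustration; function names are hypothetical):

```python
import numpy as np

# Offline, once: sort the entropies of the source model on the unlabeled
# holdout source data -- this small 1-D array is all that is stored.
def precompute_source_cdf(source_entropies):
    return np.sort(np.asarray(source_entropies, dtype=float))

# Test time: evaluating the empirical CDF at one entropy value is a
# single O(log n) binary-search lookup into the pre-computed array.
def eval_source_cdf(sorted_entropies, z_test):
    rank = np.searchsorted(sorted_entropies, z_test, side="right")
    return rank / len(sorted_entropies)  # u = F_s(z_test)
```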

What if computational resources are limited?

Following the global response to all reviewers, the new experiments show that the runtime of our method is comparable to TENT and EATA and even lower than that of SAR. Moreover, our monitoring tool can be used to decide whether the model should be updated or not at test time (that can further reduce runtime), as the martingale process detects distribution shifts. Notably, our monitoring tool can be applied in a “black-box” manner as it only requires access to the output of the softmax layer.
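For instance, the per-sample quantity the monitor consumes can be computed from the softmax output alone (illustrative sketch):

```python
import numpy as np

def prediction_entropy(softmax_probs, eps=1e-12):
    """The monitor only needs the entropy of the softmax output, so it
    works in a black-box fashion: no gradients or weights are required."""
    p = np.clip(np.asarray(softmax_probs, dtype=float), eps, 1.0)
    return float(-(p * np.log(p)).sum())
```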

Additionally, it may not be possible to perform such calculations if the source model is only made available.

Indeed, we require access to a pre-trained source model and a pre-computed source CDF. Importantly, given the two, we do not require any additional access to samples from the source data at test time. In that respect, our work does not differ significantly from EATA, which also assumes access to holdout source examples.

Limited evaluation; there is no sensitivity study.

Kindly refer to the global response to all reviewers.

This paper does not compare with state-of-the-art TTA baselines such as ROID

Our goal is to highlight why we believe it is important to transition from entropy minimization to online entropy matching. Therefore, we deliberately compared our approach to strong baseline methods that are based on entropy minimization.

The ROID method builds on a weighted version of the soft likelihood ratio loss as a self-supervised loss. This loss departs from the line of baseline entropy-minimization methods we focus on, but it opens an interesting future direction. Broadly speaking, our paper offers an online mechanism for matching the source and target distributions of any given self-supervised loss. In the context of ROID, it would be illuminating to explore how our matching paradigm performs in combination with the weighted soft likelihood ratio loss instead of the entropy loss. We will include this idea as future work in the text!

Moreover, ROID highlights that self-training can fail to improve or can even deteriorate performance. This motivated the authors of ROID to include several complex components within the test-time training scheme. Naturally, a SOTA method would include various components to enhance performance, and this set of ideas can also be valuable for our method. However, given that we introduce a new concept to the ever-growing test-time adaptation literature, we believe such explorations go beyond the scope of the current paper, as they would divert attention away from our central message.

Comment

Thank you for your detailed reply and additional experiments. I've also read the reviews and comments from other reviewers. I will increase the score.

Comment

Thank you for your engagement and for raising your score! We sincerely appreciate your thoughtful comments, which have helped improve our work.

Official Review

Rating: 5

The paper addresses the problem of adapting a classifier to a new domain at test-time. It proposes a framework that first detects a distribution shift and, based on the detection results, adapts the classifier. The distribution shift detector employs a sequential test evaluating if the distribution of the test entropy deviates from the distribution of the source entropy. In the adaptation step, the normalization parameters of the model are updated via an entropy matching objective. Experiments are conducted on ImageNet-C.

Strengths

  • The idea of using test martingales for test-time adaptation is novel to my knowledge and seems like a very natural application and interesting research direction.
  • The paper claims it only adapts when the distribution actually shifts, which is a desirable property of a TTA method, leveraging the trade-off between agile adaptation and keeping the source knowledge.
  • The paper is very well written overall. The reader is provided with intuitive explanations of the testing-by-betting framework - a very technical and not easy-to-explain theory - which makes the paper pleasant and insightful to read.
  • The experiments on ImageNet-C show encouraging results.

Weaknesses

In short: The paper proposes a very interesting approach, but it seems to me more work is required to round out the paper. In particular, the paper requires more empirical validation and clarifications.

Some of the main weaknesses I see are:

  • The experimental results are quite limited. By showing results on only one dataset against three baselines, it is a bit unclear how the method performs across different datasets compared to existing methods. Comparing on other standard TTA benchmarks (such as CIFAR-10-C, CIFAR-100-C for corruptions, or Office-Home for domain adaptation) could help determine in which settings the method provides most gains and also its limitations.
  • The paper would benefit from an illustration visualizing the entropy matching procedure. In particular, it would be helpful to illustrate how $u_j$, $\tilde{u}_j$, $Z_j^t$, and $\tilde{Z}_j^t$ connect via the functions $F_s$ and $Q$.
  • More space and explanation could be dedicated to the actual entropy matching procedure. How can we match the two entropy distributions given $u_j$? This seems currently concentrated in lines 265-271 (see questions below).

Questions

  • I’d appreciate some more clarification regarding section 3.4 (adaptation mechanism). Could you please elaborate on the role and interpretation of $Q$? I understood from lines 260-264 that $Q$ can be thought of as the distribution that approximates the unknown target’s entropy CDF, which makes sense to me given equations (3) and (6). However, in line 266 $Q$ seems to be used as a function to transform $u_j$ to $\tilde{u}_j$ by $\tilde{u}_j = Q(u_j)$. Could you explain the link between $Q$ being the target’s entropy CDF as well as a transformation?

  • I think the entropy matching could potentially be a good alternative to the dominant paradigm of entropy minimization, particularly since the latter has been shown to collapse eventually [1]. How does entropy matching perform on long-range adaptation? My understanding from the experiments is that only 1000 samples per corruption are used and not even the entire ImageNet-C dataset.

[1] Press, Ori, et al. "The entropy enigma: Success and failure of entropy minimization." arXiv preprint arXiv:2405.05012 (2024).

Limitations

The limitations are rather vague, and it seems currently unclear which limitations the method encounters.

Author Response

We thank the reviewer for the valuable feedback and suggestions. We are glad that the reviewer found our approach to be novel and an interesting research direction. It is gratifying to see that the reviewer thinks that “the entropy matching could potentially be a good alternative to the dominant paradigm of entropy minimization.” Additionally, we appreciate the positive feedback regarding the clarity of our writing and that the reviewer found that the experiments on ImageNet-C show encouraging results. Thank you!

The experimental results are quite limited

Kindly refer to the global response to all reviewers.

An illustration visualizing the entropy-matching procedure

Thank you for this great suggestion, we will include such an illustration in the revised paper. We also discuss the role of each component below.

Clarification on the entropy matching procedure. What is the role and interpretation of $Q$? How can we match the entropy distributions?

Thank you for raising this question, as it touches on one of the more nuanced aspects of our work. Indeed, as the reviewer mentioned, one can interpret $Q$ as the distribution approximating the unknown CDF of the target entropy $Z_t$. We understand that this might be confusing as $Q$ is a function of $u$. However, recall that $u$ is a function of $Z_t$, as $u := F_s(Z_t)$. Observe also that $Z_t = F_s^{-1}(u)$.

To better clarify the role of $Q$, consider a case where the betting is optimal in the sense of Proposition 2. The right-hand side of Eq. (8) gives the explicit form of the ideal $Q$, namely $\tilde{u} = Q(u) = F_t(F_s^{-1}(u))$. In turn, $Q(u) = F_t(Z_t)$. In practice, the test entropy CDF $F_t$ is unknown, yet the above relation highlights why the $Q$ we formulate via the betting function can be intuitively viewed as the distribution approximating the unknown target entropy CDF. This is due to the fact that any valid betting martingale is a likelihood ratio process, aligning with Eq. (8).

As for the matching property, observe that in the ideal case, the pseudo-entropy value is $\tilde{Z}_t = F_s^{-1}(\tilde{u}) = F_s^{-1}(Q(u)) = F_s^{-1}(F_t(Z_t))$. This argument reveals the tight relation between our adaptation scheme and optimal transport: in the ideal case, the pseudo-entropy $\tilde{Z}_t$ is obtained by applying the optimal transport map from the target entropy distribution to the source entropy distribution. Our experiment in Figure 6 in the appendix demonstrates that we indeed achieve distribution matching via the online-estimated $Q(u)$ in practice, although we do not have access to $F_t$ that varies over time.
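For intuition only, a toy sketch of this ideal-case transport map with empirical CDFs (in the actual online method $F_t$ is never available and $Q$ is estimated via the betting martingale):

```python
import numpy as np

def ideal_pseudo_entropy(z_target, source_entropies, target_entropies):
    """Ideal case: Z_tilde = F_s^{-1}(F_t(Z)), i.e., the source quantile
    at the target CDF level -- the 1-D optimal transport map from the
    target entropy distribution to the source one."""
    src = np.asarray(source_entropies, dtype=float)
    tgt = np.sort(np.asarray(target_entropies, dtype=float))
    u_tilde = np.searchsorted(tgt, z_target, side="right") / len(tgt)  # F_t(Z)
    return float(np.quantile(src, u_tilde))  # F_s^{-1}(F_t(Z))
```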

We will include this discussion and clarification around Proposition 2 in the text.

How does entropy matching perform on long-range adaptation? My understanding from the experiments is that only 1000 samples per corruption are used and not even the entire ImageNet-C dataset.

The experiments we conducted also involved long-range adaptation on a test set of size $\approx 30{,}000$ for ImageNet; $\approx 15{,}000$ for CIFAR; and for OfficeHome we used the entire test data. Focusing on ImageNet-C, we highlight that in the continual setup, we also used $2000$ samples per corruption (Figure 2, bottom right panel), resulting in a test set of $2000 \cdot 15 = 30{,}000$ samples in total. In the single corruption experiments, we use $37{,}500$ test samples for each corruption. Notably, we had to reserve a subset of the test set to implement both EATA and our method, as the two need access to unlabeled holdout data from the source domain.

The limitations are rather vague, and it seems currently unclear which limitations the method encounters

The key limitations are the following:

  1. To implement our method, we assume access to the source CDF, evaluated on holdout unlabeled samples from the source domain.
  2. The choice of hyper-parameters, in particular the learning rate, can be challenging as it depends on the data and model used. This is akin to related TTA methods.
  3. The lack of theory that reveals when entropy matching is guaranteed to improve performance. This issue is one of our future research directions.

We will update accordingly the discussion on the limitations in the text. We thank the reviewer for this point.

Comment

Thanks for the rebuttal. I read all the responses and appreciate the additional clarifications.

CIFAR-10C and CIFAR-100C experiments / Office Home

I appreciate the additional results, particularly including a domain adaptation dataset that provides more diversity in the distribution shifts tested.

Clarification on the entropy matching procedure. What is the role and interpretation of Q? How can we match the entropy distributions?

Thanks for the clarifications.

How does entropy matching perform on long-range adaptation? My understanding from the experiments is that only 1000 samples per corruption are used and not even the entire ImageNet-C dataset.

Thank you for detailing the lengths of the adaptation streams. I'm still wondering why the entire test set is not included in the experiments. For example, on CIFAR-10/100-C, the test set contains 10,000 samples per corruption type, and constructing a stream with 15 corruptions results in 150,000 test samples (instead of ~15,000 as used in Figure 1, rebuttal). To my knowledge, using all test samples is the standard evaluation setting (see e.g. [1]). Could you explain why you are diverging from the standard evaluation setting and subsampling? I understand you need a subset from the source domain for EATA and POEM, but this can be small and taken from the source data, right?

Related to the above question, I am not sure if I am entirely convinced by the experimental evaluation.

  1. If the focus of the paper is to propose a SOTA method, the evaluation against three baselines is too limited in my opinion. This concern of mine has not been addressed.
  2. If instead the focus of the paper is to propose an alternative to entropy minimisation, I think three standard entropy minimisation methods as baselines seem sufficient. However, in this case, I am not entirely convinced of entropy matching being a robust alternative to entropy minimisation. In particular, I am not sure which of the cons of entropy minimisation entropy matching addresses. The proposed approach seems to address the overconfidence issue on the source data, which the paper nicely shows does not occur with entropy matching. I think this is a promising result. However, I see one of the major limitations of entropy minimisation as that of model collapse after a long range of adaptation, and it is unclear if this alternative addresses that important limitation, since the evaluated test streams seem to be even shorter than those in standard settings.
Comment

We thank the reviewer for their comments and for acknowledging our previous response.

I'm still wondering why the entire test set is not included in the experiments

The test sets of ImageNet-C, CIFAR10-C, and CIFAR100-C consist of 15 different corrupted versions of the original test set of each dataset—these 15 corrupted versions represent different variations of the same “clean” test images. Therefore, we found it more natural to form an out-of-distribution test set that contains a single instance of a specific image, rather than using all 15 versions of the same original image. Additionally, we used shorter adaptation streams to demonstrate our approach’s ability to achieve faster adaptation compared to baseline methods, as shown in Figure 2 (bottom right). We apologize for any confusion and hope this explanation clarifies our initial choice.

To address the reviewer's concern, we have now conducted experiments using the entire test set of CIFAR10-C and CIFAR100-C, which includes all 15 corrupted versions of each test image. The results are described hereafter.

I understand you need a subset from the source domain for EATA and POEM, but this can be small and taken from the source data, right?

Yes, the holdout set can be small and should include unlabeled source samples. Since we use an off-the-shelf pre-trained model, we selected these holdout samples from the test set of the original dataset, representing the source domain. To ensure a fair out-of-distribution test set, we made sure our method (and EATA) does not have knowledge of the clean versions of the corrupted images in the test set. This is why we removed the holdout images from the corrupted test data, as the corrupted images are merely variations of the clean ones.

Long-range adaptation experiments on CIFAR-10C and CIFAR100-C:

In the following experiment, we reserved 2,500 images to form a holdout set for EATA and our method. We followed the same experimental protocol described in the global response to all reviewers and ran each adaptation method on a test set containing 112,500 samples (15 versions of 7,500 images). The results are summarized in two tables, presented in a separate comment below. These tables show that our proposed method is competitive with the baseline methods in terms of adaptation accuracy. Notably, the runtime of our method is twice as fast as SAR and comparable to EATA and TENT. Importantly, we do not use a model-reset mechanism (as done by SAR) or anti-forgetting loss (as done by EATA), highlighting the stability of our approach. By contrast, we found that TENT is highly sensitive to the choice of learning rate. We sincerely thank the reviewer for raising this point.

I am not entirely convinced of entropy matching being a robust alternative to entropy minimisation

Beyond the theoretical aspects of our work and the monitoring capabilities we introduce, our experiments demonstrate several practical advantages over entropy minimization:

  1. The proposed method maintains the performance of the source model while avoiding overconfident predictions under in-distribution settings, a crucial advantage over entropy minimization methods.
  2. Short-term adaptation: Our approach achieves faster adaptation than entropy minimization methods, e.g., as shown in Figure 2 (bottom right). This is attributed to our betting scheme, which quickly reacts to distribution shifts. While the reviewer emphasizes the problem of long-range adaptation, it is important to recognize the critical role of adaptation speed as well. Rapid adaptation is especially crucial given the various strategies proposed for stabilizing long-range adaptation, such as resetting the self-trained model to its original state when specific heuristic conditions are met (as employed in SAR) or incorporating an anti-forgetting component into the entropy loss (as used in EATA).
  3. Long-term adaptation: In extended test periods, our new experiments with 112,500 test examples show comparable adaptation performance to strong baseline methods, demonstrating the robustness of our method. Moreover, if stable long-range adaptation is the main concern, we could integrate model-resetting or anti-forgetting mechanisms. Notably, our monitoring tool can detect when unfamiliar corrupted data arrives, allowing for rigorous decisions on model resetting, for example, to prevent aggressive adaptation from a diverged state. This capability highlights another unique and practical advantage of our method.

Once again, we apologize for any confusion and hope this discussion resolves the reviewer’s concerns. Please let us know if there are any questions, comments, or concerns left.

Comment

Long-range adaptation: detailed results

CIFAR-10C Accuracy Table

| Method | Shot Noise | Motion Blur | Snow | Pixelate | Gaussian Noise | Defocus Blur | Brightness | Fog | Zoom Blur | Frost | Glass Blur | Impulse Noise | Contrast | Jpeg Compression | Elastic Transform | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No adapt | 17.81 ± 0.05 | 11.71 ± 0.07 | 29.11 ± 0.09 | 13.67 ± 0.04 | 15.93 ± 0.07 | 12.54 ± 0.07 | 42.06 ± 0.12 | 11.14 ± 0.06 | 11.81 ± 0.08 | 17.33 ± 0.08 | 18.94 ± 0.10 | 16.24 ± 0.10 | 13.80 ± 0.09 | 19.96 ± 0.06 | 16.11 ± 0.07 | 17.87 ± 0.02 |
| TENT | 50.09 ± 0.25 | 67.02 ± 0.15 | 62.30 ± 0.31 | 61.34 ± 0.14 | 53.96 ± 0.18 | 69.00 ± 0.16 | 72.44 ± 0.21 | 67.72 ± 0.16 | 68.55 ± 0.22 | 63.76 ± 0.30 | 51.18 ± 0.20 | 51.36 ± 0.25 | 66.67 ± 0.26 | 58.88 ± 0.18 | 60.67 ± 0.22 | 61.91 ± 0.06 |
| EATA | 48.95 ± 0.15 | 66.65 ± 0.20 | 60.75 ± 0.28 | 58.59 ± 0.30 | 47.62 ± 0.13 | 67.95 ± 0.19 | 70.94 ± 0.24 | 65.91 ± 0.20 | 65.99 ± 0.21 | 58.99 ± 0.16 | 45.69 ± 0.20 | 42.78 ± 0.22 | 67.25 ± 0.18 | 52.48 ± 0.21 | 55.97 ± 0.10 | 58.29 ± 0.06 |
| SAR | 50.02 ± 0.15 | 67.10 ± 0.17 | 61.86 ± 0.20 | 60.84 ± 0.16 | 52.28 ± 0.11 | 68.73 ± 0.18 | 72.48 ± 0.21 | 67.52 ± 0.18 | 68.23 ± 0.19 | 63.21 ± 0.15 | 50.21 ± 0.14 | 49.85 ± 0.28 | 67.81 ± 0.26 | 57.75 ± 0.19 | 60.15 ± 0.18 | 61.22 ± 0.06 |
| POEM (ours) | 51.80 ± 0.10 | 67.69 ± 0.16 | 63.68 ± 0.20 | 63.33 ± 0.17 | 56.60 ± 0.23 | 69.06 ± 0.19 | 72.69 ± 0.17 | 67.82 ± 0.22 | 69.25 ± 0.21 | 64.72 ± 0.16 | 52.01 ± 0.21 | 52.29 ± 0.29 | 64.07 ± 0.32 | 58.09 ± 0.27 | 59.07 ± 0.23 | 62.12 ± 0.06 |

CIFAR-100C Accuracy Table

| Method | Shot Noise | Motion Blur | Snow | Pixelate | Gaussian Noise | Defocus Blur | Brightness | Fog | Zoom Blur | Frost | Glass Blur | Impulse Noise | Contrast | Jpeg Compression | Elastic Transform | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No adapt | 4.81 ± 0.03 | 5.44 ± 0.05 | 13.14 ± 0.07 | 6.97 ± 0.05 | 4.18 ± 0.04 | 4.36 ± 0.04 | 25.99 ± 0.08 | 5.91 ± 0.04 | 5.01 ± 0.04 | 7.50 ± 0.04 | 4.01 ± 0.02 | 3.52 ± 0.04 | 1.42 ± 0.02 | 10.64 ± 0.06 | 7.22 ± 0.05 | 7.33 ± 0.01 |
| TENT | 14.34 ± 0.13 | 29.09 ± 0.15 | 24.18 ± 0.19 | 24.69 ± 0.16 | 16.29 ± 0.19 | 30.63 ± 0.23 | 34.24 ± 0.20 | 27.05 ± 0.18 | 30.63 ± 0.24 | 25.72 ± 0.25 | 18.99 ± 0.27 | 15.13 ± 0.18 | 26.17 ± 0.33 | 19.51 ± 0.24 | 22.22 ± 0.19 | 23.97 ± 0.05 |
| EATA | 14.05 ± 0.11 | 28.82 ± 0.18 | 23.50 ± 0.12 | 22.89 ± 0.12 | 13.98 ± 0.14 | 29.08 ± 0.14 | 33.06 ± 0.21 | 25.85 ± 0.16 | 29.14 ± 0.14 | 23.38 ± 0.23 | 16.34 ± 0.15 | 11.47 ± 0.09 | 26.55 ± 0.18 | 16.97 ± 0.13 | 21.76 ± 0.14 | 22.42 ± 0.05 |
| SAR | 14.29 ± 0.11 | 29.08 ± 0.19 | 24.09 ± 0.16 | 24.47 ± 0.17 | 16.28 ± 0.13 | 30.19 ± 0.18 | 33.74 ± 0.15 | 26.79 ± 0.15 | 30.19 ± 0.24 | 25.17 ± 0.17 | 18.41 ± 0.24 | 14.94 ± 0.13 | 25.25 ± 0.37 | 19.31 ± 0.17 | 21.89 ± 0.15 | 23.61 ± 0.05 |
| POEM (ours) | 13.98 ± 0.11 | 28.79 ± 0.20 | 23.62 ± 0.16 | 23.35 ± 0.19 | 14.55 ± 0.15 | 29.91 ± 0.19 | 33.35 ± 0.14 | 26.48 ± 0.13 | 29.65 ± 0.16 | 24.40 ± 0.16 | 17.35 ± 0.23 | 12.77 ± 0.14 | 28.03 ± 0.22 | 18.35 ± 0.18 | 23.18 ± 0.14 | 23.19 ± 0.05 |
Comment

Thanks for the additional clarifications and experiments. I have raised my score. I'd encourage the authors to add the above clarifications on the experiments and the long-range adaptation results to the revised version.

Comment

Thank you for your engagement and for raising your score! We will certainly include additional clarifications on the experiments, the new experiments from the global response, and the long-range adaptation results in the revised version of the paper. We sincerely appreciate your thoughtful comments, which have helped improve our work.

Author Response (Global Response to All Reviewers)

We appreciate the reviewers' engagement with our paper and their valuable comments and suggestions. We will integrate their feedback into the revised paper and have conducted a new set of experiments, detailed below.

The reviewers acknowledged that the paper is well-written and introduces a novel approach for test-time adaptation.

The main criticism raised was that the experiments, though conducted on ImageNet-C, are quite limited. To address this, we have conducted additional experiments on CIFAR10-C, CIFAR100-C, and Office-Home datasets. In short, the new experiments show that our approach is competitive with strong baseline test-time adaptation methods that are based on entropy minimization (TENT, SAR, and EATA). This conclusion aligns with the ImageNet-C experiments presented in the paper.

CIFAR-10C and CIFAR-100C experiments

We focus on the continual setting where the corruption type is changing over time. Using a pre-trained ResNet32 model, we applied online self-training with a batch size of 4 (due to batch-normalization layers). Each corruption type had $1024$ samples, resulting in a test set of approximately $15 \cdot 1024 \approx 15{,}000$ samples. To ensure a fair comparison, we tuned the learning rate for each method using a pre-specified grid; see the sensitivity study below. Notably, we did not change the hyper-parameters of the monitoring tool and used the same values as those employed in our ImageNet-C experiments.

Results are summarized in Figure 1 (attached PDF):

  • Our method is competitive and often outperforms baseline methods in terms of accuracy.
  • Runtime comparisons (relative to the no-adapt model) are also presented, demonstrating that our method's complexity is similar to TENT and EATA, and lower than SAR.
  • A sensitivity study for the learning rate parameter reveals our method’s robustness to this choice, particularly when compared to SAR and TENT.

Office-Home experiments

We focused on adaptation from the “Real World” domain to the “Art”, “Clipart”, and “Product” domains. A continual setting was deemed less natural here. We fine-tuned the last layer of a ResNet50 model with group-norm layers (pre-trained on ImageNet1K) on Office Home's real-world images, reserving 20% as a holdout set for our method and EATA. Test-time adaptation was applied to the entire test data of each target domain. Learning rates were tuned for fair comparison, similar to the CIFAR experiments. For consistency, we kept the same hyper-parameters for the monitoring tool as those used in our ImageNet-C experiments. As such, the same hyper-parameters for the monitoring tool are used across all experiments, regardless of the model or dataset.

Results are summarized in Figure 2 (attached PDF):

  • Overall, all the methods demonstrated modest accuracy gains compared to the ‘no-adapt’ case. Our proposed method slightly outperformed TENT and EATA in terms of accuracy, while achieving results comparable to SAR.
  • In terms of computational efficiency, our method's runtime was on par with TENT and EATA, and notably faster than SAR.
  • Regarding sensitivity to the choice of the learning rate, our approach displayed superior robustness compared to TENT and SAR, and a similar robustness to that of EATA.

We provide individual replies to specific comments from each reviewer.

Final Decision

This paper presents a test-time adaptation (TTA) method that detects distribution shifts at test time and adapts the model to such shifts.

The paper received generally positive reviews, with scores of 6, 5, 5.

In the original reviews, the reviewers expressed some concerns and requested additional experiments. The authors reported additional experiments in the rebuttal (e.g., long-range adaptation experiments) and said they will include them in the revised version.

The paper has both a theoretically sound TTA method and reasonably strong experimental results. Given the generally positive reviews and my own assessment from reading the paper, I think the paper makes good contributions to the area of TTA. Therefore, I recommend the paper for acceptance.