PaperHub
Score: 7.2/10
Poster · 4 reviewers
Ratings: 4, 4, 4, 3 (min 3, max 4, std 0.4)
ICML 2025

Beyond Entropy: Region Confidence Proxy for Wild Test-Time Adaptation

Submitted: 2025-01-22 · Updated: 2025-07-24

Keywords: Test-Time Adaptation, Transfer Learning

Reviews and Discussion

Review
Rating: 4

This paper introduces ReCAP, a novel TTA method based on local inconsistency of predictions. Based on the finding that the local inconsistency increases and adaptation becomes difficult under wild distribution shifts, the region confidence is proposed as an alternative to entropy, a common objective in TTA. Its finite-sample approximation is also derived to overcome the computational intractability of the original region confidence. Experimental results show that ReCAP had higher accuracy on corrupted test data under wild settings (online, mixed shifts, and imbalanced labels).

Update after rebuttal

I appreciate the authors' rebuttal and additional experiments. My concerns have been addressed. I have updated my score to 4.

Questions for Authors

N/A

Claims and Evidence

Yes.

Methods and Evaluation Criteria

The motivation for using the proposed approximation of the region confidence in Eqs. (6) and (7) is unclear. One can use a simple Monte Carlo approximation. Providing some evidence that the proposed approximation is more efficient than Monte Carlo would be convincing.

Theoretical Claims

I have checked the derivation of the region confidence.

Experimental Design and Analyses

  • Experiments on the continual TTA settings used in recent TTA studies (e.g., EATA) would strengthen the case for ReCAP in wild TTA settings.
  • How was the sampling number $N$ from the region set? Examining the sensitivity to $N$ would be helpful.
  • Comparing ReCAP with a simple Monte Carlo approximation of the original region confidence in Eq. (2) would make the proposed method more convincing.
  • An ablation on the sample weighting and selection in Eq. (9) would be helpful.

Supplementary Material

I have checked the proofs and additional results.

Relation to Existing Literature

The region confidence expands the commonly used sample-wise entropy. It can improve existing entropy-based TTA methods.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

Thank you for taking the time to review our paper and providing valuable feedback. We would like to answer your questions below.

Q1: Experimenting on continual TTA settings performed in recent TTA studies (e.g., EATA) would strengthen the efficacy of ReCAP in wild TTA settings.

A1: Thank you for your constructive suggestion to evaluate CTTA settings. We agree that such experiments would further strengthen the efficacy of our ReCAP. We conduct extensive experiments on CTTA for both classification and segmentation tasks.

For classification, CTTA setup (Tab. 3 in response to Reviewer 6Q1i) and additional PTTA (CTTA + label shift) setup (Tab. 3 in response to Reviewer eZDc) demonstrate that ReCAP consistently outperforms prior methods. For segmentation, results in Tab. 1 (this response) further confirm that ReCAP maintains robust adaptation performance in continual scenarios. These results underscore the broad applicability of ReCAP across continual and wild TTA settings.

Table 1: Semantic segmentation results (mIoU) on the Cityscapes-to-ACDC CTTA setup based on the Segformer-B5 architecture.

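| Condition | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source | 69.1 | 40.3 | 59.7 | 57.8 | 69.1 | 40.3 | 59.7 | 57.8 | 69.1 | 40.3 | 59.7 | 57.8 | 69.1 | 40.3 | 59.7 | 57.8 | 56.7 |
| TENT | 69 | 40.2 | 60.1 | 57.3 | 66.5 | 36.3 | 58.7 | 54 | 64.2 | 32.8 | 55.3 | 50.9 | 61.8 | 29.8 | 51.9 | 47.8 | 52.3 |
| EATA | 69.1 | 40.5 | 59.8 | 58.1 | 69.3 | 41.8 | 60.1 | 58.6 | 68.8 | 42.5 | 59.4 | 57.9 | 67.9 | 42.8 | 57.7 | 56.3 | 57.0 |
| CoTTA | 70.9 | 41.1 | 62.4 | 59.7 | 70.8 | 40.6 | 62.7 | 59.7 | 70.8 | 40.5 | 62.6 | 59.7 | 70.8 | 40.5 | 62.7 | 59.7 | 58.4 |
| SAR | 62.2 | 37.7 | 55.5 | 53.0 | 64.6 | 39.3 | 56.8 | 53.9 | 65.7 | 39.0 | 58.1 | 55.0 | 66.1 | 38.0 | 59.1 | 55.3 | 53.7 |
| Ours | 72.7 | 43.8 | 63.9 | 61.1 | 71.9 | 42.2 | 64.1 | 60.1 | 71.0 | 40.5 | 63.5 | 58.8 | 70.3 | 39.3 | 62.8 | 57.2 | 59.0 |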

Q2: Sensitivity to the sampling number $N$.

A2: We appreciate your question and apologize for any confusion. Our method does not need any sampling, thanks to the finite-to-infinite approximation in Propositions 4.3 and 4.4. Therefore, our method is entirely unaffected by the value of $N$. We will clarify this in the revised version.

Q3: Comparing ReCAP with a simple Monte Carlo approximation of the original region confidence in Eq. (2) would make the proposed method more convincing.

A3: We sincerely appreciate your valuable suggestion. We conduct a comprehensive comparison with the Monte Carlo (MC) approximation using different sampling numbers. As shown in Tab. 2, while MC provides a direct estimate of region confidence, its accuracy is highly sensitive to the number of samples, leading to increased variance and a computational cost that scales linearly with the sample size.

In contrast, ReCAP achieves significantly higher accuracy with lower variance during adaptation, demonstrating its superior stability and efficiency. These results further reinforce the motivation behind our finite-to-infinite approximation. We will incorporate this comparison into the revised version.

Table 2: Comparison between the MC approximation and our finite-to-infinite approximation under 3 independent runs.

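| | ReCAP | MC (4) | MC (16) | MC (64) | MC (128) |
|---|---|---|---|---|---|
| Sampling number | NA | 4 | 16 | 64 | 128 |
| Average accuracy | 42.2 | 31.7 | 34.8 | 38.7 | 40.7 |
| Standard deviation | 0.1 | 9.8 | 3.7 | 1.4 | 0.3 |
| Running time (s) | 116 | 125 | 163 | 278 | 454 |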
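
For intuition on the trade-off in Tab. 2, a Monte Carlo estimate of a region-level objective can be sketched as below. This is a minimal illustration, not the authors' implementation: the isotropic Gaussian region, the linear classifier head (`weight`, `head_bias`), the `radius` parameter, and the use of mean entropy and mean KL divergence as stand-ins for the bias and variance terms are all assumptions, since Eq. (2) is not reproduced in this thread. What it shows is that each extra sample costs one more forward pass through the head, so the cost grows linearly with the sample count.

```python
import numpy as np

EPS = 1e-12  # numerical floor for logarithms


def softmax(logits):
    """Row-wise softmax that also works on a single 1-D logit vector."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def entropy(probs):
    """Shannon entropy along the last axis."""
    return -(probs * np.log(np.clip(probs, EPS, 1.0))).sum(axis=-1)


def mc_region_confidence(feature, weight, head_bias, radius=1.0,
                         n_samples=16, rng=None):
    """Monte Carlo estimate over a Gaussian neighborhood of one feature.

    Assumed setup (hypothetical, for illustration only):
      - region: isotropic Gaussian with standard deviation `radius`
      - bias term: mean entropy of the neighbors' predictions
      - variance term: mean KL(center || neighbor)
    """
    rng = np.random.default_rng() if rng is None else rng
    center = softmax(feature @ weight + head_bias)

    # One forward pass through the head per sampled neighbor.
    noise = rng.normal(scale=radius, size=(n_samples, feature.shape[0]))
    probs = softmax((feature + noise) @ weight + head_bias)

    bias_term = float(entropy(probs).mean())
    log_ratio = np.log(np.clip(center, EPS, 1.0)) - np.log(np.clip(probs, EPS, 1.0))
    variance_term = float((center * log_ratio).sum(axis=-1).mean())
    return bias_term, variance_term
```

With few samples the two estimates fluctuate from run to run, which is the kind of variance the MC columns of Tab. 2 report; a closed-form proxy avoids both the sampling noise and the per-sample forward passes.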

Q4: Ablation on the sample weighting and selection in Eq. (9) would be helpful.

A4: We appreciate the reviewer's suggestion and have conducted an additional ablation study to analyze the impact of different sample selection and weighting strategies. As shown in Tab. 3, the results lead to the following key observations:

  1. Our region-based confidence optimization consistently enhances performance, surpassing the previous SOTA even when combined with the simplest entropy-based selection and weighting strategies.
  2. The combination of our proposed selection and weighting achieves the best overall accuracy, further validating the effectiveness of our design.

Table 3: Ablation study on selection and weighting strategies.

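Columns: w/o selection | Entropy selection | Our selection | w/o weighting | Entropy weighting | Our weighting | ReCAP Accuracy

Accuracy by row (the per-row configuration marks did not survive extraction): 36.0, 38.9, 43.2, 44.9, 45.7, 47.2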
Reviewer Comment

I appreciate the authors' rebuttal and additional experiments. My concerns have been addressed. I will update my score to 4.

Author Comment

We are glad to know that our response has addressed your questions.

We sincerely thank you for your thoughtful and constructive feedback. Through further discussions and experiments, we were able to more clearly communicate the contributions of our work.

Again, we would like to thank you for appreciating our work and recognizing our contributions!

Best,

The Authors

Review
Rating: 4

This paper introduces a new Test-Time Adaptation Method (TTA) to combat domain shifts appearing at test time in extreme scenarios. In particular, it proposes ReCAP, a method that optimizes two terms: a bias term resembling a regional entropy around a given test data, and a variance term to enhance the consistency of the prediction of the model under neighboring features. Experiments are carried out on standard TTA benchmarks yielding consistent performance gain.

Questions for Authors

Please refer to the previous sections

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

I skimmed through the proofs in the Appendices and they seem correct

Experimental Design and Analyses

I checked the experimental sections in the paper and they all seem relevant, consistent with earlier work, and providing supportive results.

Supplementary Material

I checked Appendix A, B, C, and F.

Relation to Existing Literature

This paper does a good job in linking their main contributions to earlier works. They further show experimentally how they can combine their proposed method with previous state-of-art showing further performance gain.

Essential References Not Discussed

I think the paper did a good job relating itself to other related works.

Other Strengths and Weaknesses

While I am generally very positive about this paper, the following experiments I think are missing and would strengthen the paper.

  1. Ablating $\mathcal{L}_0$: I checked the ablation experiments and did not find one ablating the impact of $\mathcal{L}_0$. Further, when the proposed method is combined with SAR, is the data point selection mechanism of SAR employed or the proposed one?

  2. In the efficiency comparison in Table 4: The proposed ReCAP computes more backward passes than EATA, however it is still more efficient in runtime. This seems a bit contradictory and deserves more discussion along with comparison against the more efficient variant of EATA (i.e. ETA). It is also important, given the efficiency of ReCAP, to show the performance gain under computational budgeted evaluation [A].

  3. One extra [Optional] experiment is to extend the evaluation to the Practical TTA setting [B] which is closely related to the wild TTA setting.

[A] Evaluation of test-time adaptation under computational time constraints, ICML 2024

[B] Robust test-time adaptation in dynamic scenarios, CVPR 2023

Other Comments or Suggestions

Please refer to the "Other Strengths and Weaknesses" Section.

Author Response

We deeply appreciate your positive comments and constructive suggestions on improving our paper. We will address your questions below.

Q1: I checked the ablation experiments and did not find the one ablating the impact of $\mathcal{L}_0$.

A1: Due to space constraints, we provide the ablation study on $\mathcal{L}_0$ in Appendix C.1. As shown in Appendix Fig. 7, ReCAP consistently enhances performance across a broad range of $\mathcal{L}_0$ values, demonstrating its robustness to different selection boundaries. This result confirms that ReCAP does not rely on precise tuning of $\mathcal{L}_0$ and remains effective across varying settings. To improve accessibility, we will incorporate this ablation study into the main paper.
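
To make the role of a selection boundary concrete, the following sketch filters and reweights a batch by prediction entropy, in the style of EATA's sample-adaptive weight exp(E_0 - E(x)) applied only to samples with E(x) < E_0. It is an illustration under assumptions: ReCAP's Eq. (9) is not reproduced in this thread and may score samples with something other than entropy, and the function name and `threshold` argument are hypothetical.

```python
import numpy as np


def entropy_select_and_weight(probs, threshold):
    """Entropy-based selection and weighting for a batch of predictions.

    probs:     array of shape (batch, classes), rows sum to 1
    threshold: selection boundary; samples at or above it are dropped
    Returns (mask, weights) where weights are 0 for dropped samples.
    """
    ent = -(probs * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=1)
    mask = ent < threshold
    # Lower-entropy (more confident) samples receive larger weights.
    weights = np.where(mask, np.exp(threshold - ent), 0.0)
    return mask, weights


# Example: a confident and a uniform prediction over 3 classes,
# with the boundary set to 0.4 * ln(C) as an assumed default.
probs = np.array([[0.98, 0.01, 0.01],
                  [1 / 3, 1 / 3, 1 / 3]])
mask, weights = entropy_select_and_weight(probs, threshold=0.4 * np.log(3))
```

A robustness claim like the one in A1 amounts to saying that accuracy changes little as `threshold` moves across a wide range.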

Q2: When ReCAP is combined with SAR, is the data selection mechanism of SAR employed or the proposed one?

A2: When integrating ReCAP with SAR, we replace the original entropy selection with our proposed strategy, allowing for a direct evaluation of ReCAP's effectiveness. Likewise, when combining ReCAP with DeYO, we follow the same replacement strategy. We will clarify this in the revised version to eliminate any ambiguity.

Q3: Efficiency comparison with EATA and ETA.

A3: Thank you for raising this point. While EATA performs fewer backward passes on test samples, it requires additional computation for Fisher regularization on extra source samples, resulting in a higher runtime compared to ReCAP.

For comparison with ETA, we provide additional evaluations in Tab. 1. Although ETA offers a slight runtime improvement, it struggles to adapt to dynamic shifts in wild TTA scenarios. In contrast, ReCAP effectively balances efficiency and performance, achieving superior accuracy with marginal additional computation cost.

Table 1: Running time for 50,000 images and accuracy on ImageNet-C under label shifts using ResNet.

| Method | Time (s) | Accuracy (%) |
|---|---|---|
| Tent | 110 | 22.8 |
| ETA | 112 | 26.2 |
| EATA | 118 | 31.7 |
| ReCAP | 116 | 47.2 |

Q4: It is important, given the efficiency of ReCAP, to show the performance gain under computational budgeted evaluation.

A4: Thank you for your valuable suggestion. We agree that this evaluation is essential and realistic for assessing TTA methods. As shown in Tab. 2, ReCAP benefits from the computational efficiency of the upper-bound proxy, resulting in minimal performance degradation while achieving more significant gains under strict time constraints. This demonstrates ReCAP's ability to provide efficient adaptation under time limitations, making it well-suited for real-world deployments with computational budgets.

Table 2: Error rate on ImageNet-C under computational time constraints.

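| Method | Realistic | Gaus. | Shot | Impu. | Defo. | Glas. | Moti. | Zoom | Snow | Fros. | Fog | Brig. | Cont. | Elas. | Pixe. | Jpeg | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EATA | ✗ | 65.5 | 62.4 | 63.5 | 66.6 | 67.2 | 52.0 | 47.3 | 48.2 | 54.1 | 39.9 | 32.1 | 55.0 | 42.3 | 39.2 | 44.8 | 52.0 |
| EATA | ✓ | 69.3 | 67.1 | 69.2 | 71.1 | 71.7 | 57.5 | 49.9 | 51.9 | 57.4 | 42.4 | 32.6 | 60.7 | 45.1 | 41.4 | 47.4 | 55.6 (+3.6) |
| SAR | ✗ | 69.5 | 69.7 | 69.0 | 71.2 | 71.7 | 58.1 | 50.5 | 52.9 | 57.9 | 42.7 | 32.7 | 62.9 | 45.5 | 41.6 | 47.8 | 56.2 |
| SAR | ✓ | 79.4 | 78.5 | 78.1 | 79.9 | 79.3 | 67.5 | 56.1 | 60.5 | 63.1 | 47.4 | 34.0 | 75.3 | 51.7 | 46.6 | 53.8 | 63.4 (+7.2) |
| DeYO | ✗ | 64.1 | 61.4 | 62.1 | 66.0 | 66.2 | 51.7 | 47.4 | 47.5 | 54.0 | 39.8 | 31.9 | 54.0 | 41.9 | 38.7 | 44.3 | 51.4 |
| DeYO | ✓ | 69.7 | 67.6 | 68.2 | 73.2 | 72.2 | 59.0 | 50.8 | 52.8 | 58.1 | 42.7 | 32.5 | 62.9 | 45.5 | 41.5 | 48.1 | 56.3 (+4.9) |
| Ours | ✗ | 64.1 | 60.4 | 62.1 | 67.0 | 67.2 | 50.6 | 47.2 | 45.8 | 51.7 | 38.2 | 32.2 | 53.5 | 41.8 | 38.4 | 43.9 | 50.9 |
| Ours | ✓ | 68.2 | 65.2 | 67.1 | 70.7 | 71.0 | 55.7 | 49.8 | 50.0 | 53.8 | 40.6 | 32.7 | 52.9 | 44.9 | 40.6 | 46.7 | 54.0 (+3.1) |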

Q5: One extra [Optional] experiment is to extend the evaluation to the Practical TTA setting which is closely related to the wild TTA setting.

A5: Thank you for this insightful suggestion. We agree that the PTTA setting (Continual + Label Shifts) is closely related to the wild TTA setting. As shown in Tab. 3, despite ReCAP not incorporating any additional design specifically for continual adaptation, it still outperforms entropy-based methods and specific-design RoTTA in PTTA setup. This result further validates the effectiveness and robustness of ReCAP across diverse test-time conditions.

Table 3: Accuracy on ImageNet-C under the PTTA setup, evaluated on ResNet50.

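| Method | Gaus. | Shot | Impu. | Defo. | Glas. | Moti. | Zoom | Snow | Fros. | Fog | Brig. | Cont. | Elas. | Pixe. | Jpeg | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source | 17.9 | 19.9 | 17.9 | 19.7 | 11.3 | 21.3 | 24.9 | 40.4 | 47.4 | 33.6 | 69.2 | 36.3 | 18.7 | 28.4 | 52.2 | 30.6 |
| Tent | 13.7 | 0.9 | 0.2 | 3.0 | 0.4 | 0.3 | 0.4 | 0.6 | 0.2 | 0.2 | 1.7 | 0.4 | 0.1 | 0.2 | 1.2 | 1.6 |
| SAR | 32.0 | 14.0 | 17.7 | 16.7 | 12.6 | 1.1 | 16.5 | 44.5 | 42.4 | 11.1 | 7.7 | 46.6 | 8.6 | 0.6 | 38.6 | 20.7 |
| DeYO | 40.7 | 44.1 | 41.3 | 17.7 | 22.1 | 41.3 | 16.5 | 41.2 | 50.5 | 30.9 | 73.2 | 51.4 | 42.4 | 56.5 | 58.2 | 41.9 |
| RoTTA | 40.2 | 41.2 | 40.8 | 20.7 | 20.3 | 40.2 | 33.2 | 45.2 | 51.1 | 52.1 | 70.2 | 50.1 | 40.1 | 52.1 | 57.1 | 43.6 |
| ReCAP | 42.2 | 44.3 | 42.1 | 18.9 | 23.9 | 42.1 | 28.7 | 44.7 | 51.6 | 52.5 | 71.2 | 52.2 | 41.5 | 57.9 | 58.3 | 44.8 |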
Reviewer Comment

I would like to thank the authors for their efforts in replying to my comments. My questions were adequately answered. Thus, I am raising my score from weak accept to Accept.

Author Comment

We are glad to know that our response has addressed your questions.

We sincerely appreciate your insightful and constructive feedback. Your comments have guided us to refine our work and better articulate the significance of our contributions.

Once again, thank you for your thoughtful evaluation and recognition of our work!

Best regards,

The Authors

Review
Rating: 4

This paper proposes a new method, ReCAP, a novel approach to addressing the main limitation of TTA in entropy minimization. The key idea of this work is that EM heavily relies on local consistency, and when this consistency is disrupted, model performance degrades. To resolve this issue, instead of optimizing the confidence of individual samples, ReCAP optimizes region-based confidence using bias and variance terms through Region Confidence Optimization. Furthermore, to enable low-cost computation and accuracy, the method employs approximation theories (e.g., Finite-to-Infinite Approximation). When applied to low-data settings (batch size = 1), the proposed method demonstrated a +3.5% improvement in performance.

Questions for Authors

  • Does the variance + bias function serve the same role as the traditionally used mutual information?
  • How does ReCAP handle rapid domain shifts where local consistency may be entirely lost?
  • Could ReCAP be extended to structured prediction tasks like segmentation?
  • How sensitive is ReCAP to the choice of when adapting to new domains?

Claims and Evidence

The paper experimentally demonstrates that entropy minimization leads to performance degradation when local consistency is disrupted. The results show that even when entropy values are similar, prediction differences can be significant in domain-shift environments. Furthermore, the proposed RCO method improves the stability of TTA, and ReCAP outperforms traditional entropy-based methods such as Tent and MEMO, proving to be particularly effective in domain shift scenarios. The study also validates that the Bias Term and Variance Term play a crucial role in maintaining prediction consistency through mathematical formulations and empirical analysis. Additionally, the paper demonstrates that ReCAP achieves higher performance than Tent with only a 5% increase in computational cost.

Methods and Evaluation Criteria

The study employs widely used datasets in TTA research, including ImageNet-C, ImageNet-R, and VisDA-2021, to evaluate performance. Comparisons are made with state-of-the-art TTA techniques such as Tent, MEMO, DDA, EATA, SAR, and DeYO. The evaluation focuses on improvements in accuracy under domain shifts, robustness in low-data scenarios, and performance across mixed-domain tests. The inclusion of an ablation study analyzing key hyperparameters such as region size and bias-variance tradeoff strengthens the validity of the evaluation framework, aligning well with the study’s research objectives.

Theoretical Claims

The paper theoretically supports its approach by introducing an optimization framework leveraging Bias and Variance Terms to balance confidence estimation and prediction consistency. Additionally, it proposes a Finite-to-Infinite Approximation method to reduce computational cost while effectively approximating regional confidence. The mathematical derivations appear valid, and the experimental results substantiate the proposed theoretical foundation.

Experimental Design and Analyses

The experimental design appears relatively reliable, demonstrating that ReCAP maintains high performance even in low-data settings and remains stable across various corruption types and domain shift scenarios. The study also presents t-SNE visualizations, confirming that ReCAP enhances class separability. Overall, the experiments are appropriately designed to support the paper’s claims.

Supplementary Material

The supplementary material includes the theoretical proof used in SOC and an advanced study on limited batch sizes and imbalanced label shifts. These materials further emphasize the validity of their proposed ReCAP method and serve as valuable supporting evidence.

Relation to Existing Literature

This work builds upon prior entropy minimization-based TTA research, such as Tent, MEMO, and EATA, extending the optimization approach from sample-level to region-level confidence estimation. Additionally, it is relevant to domain adaptation research, distinguishing itself by focusing on maintaining local consistency as a key factor in adaptation performance.

Essential References Not Discussed

This paper leverages the appendix to cite all relevant studies comprehensively.

Other Strengths and Weaknesses

Strengths

  • While entropy minimization has been used in TTA, its accuracy gains have been limited. This paper provides a meaningful finding by identifying its limitations and proposing an effective solution.
  • The paper effectively explains why local consistency is critical in TTA and thoroughly discusses the limitations of existing methods, making a strong case for the necessity of ReCAP.
  • Instead of optimizing confidence at the sample level, the paper introduces region-based confidence optimization, which is a more robust and reliable strategy for TTA.
  • The paper rigorously evaluates ReCAP across various datasets and settings, including different domain shifts, data scarcity scenarios, and mixed-domain testing. This strengthens the credibility of the proposed method and demonstrates its robustness in real-world applications.
  • Unlike computationally intensive methods like DDA, ReCAP maintains a lightweight adaptation process while still improving accuracy.
  • ReCAP not only outperforms baseline methods but also enhances other approaches such as SAR and DeYO, demonstrating its adaptability and versatility.

Weaknesses

  • While the method is effective for classification, it is unclear how well it would generalize to more complex tasks like object detection, segmentation, or NLP.
  • The paper assumes that the finite-to-infinite approximation holds consistently, but in scenarios where domain shifts occur rapidly, there is a possibility that this assumption might not always hold. Investigating its robustness in highly dynamic environments could provide further insights.
  • While the paper discusses applying features to reduce computational cost, providing a quantitative comparison of the actual reduction in computation would strengthen the analysis.

Other Comments or Suggestions

  • It would be helpful if the paper clarified what value of 𝜏 was fixed when conducting experiments on the effect of 𝜆 in Section 6.1.
  • The t-SNE plots effectively illustrate the improvements in feature space adaptation, making it easier to understand the impact of ReCAP on prediction consistency and clustering quality.
Author Response

We appreciate your detailed review and positive feedback on our contributions, including meaningful findings, novel region-based confidence optimization, and comprehensive evaluation. Building on your comments, we provide additional explanations and experiments to further demonstrate ReCAP's effectiveness and efficiency.

Q1: Generalization ability in more complex tasks.

A1: Thank you for raising this important point. While our current experiments focus on classification, the core idea of region-based confidence optimization is inherently versatile. Additional evaluations on segmentation (Tab. 1 in response to Reviewer ynrW) and object detection (Tab. 1 in this response) show consistent improvements of ReCAP over entropy-based methods, indicating that ReCAP can be integrated into diverse model architectures and effectively extended to various complex tasks.

Table 1: Comparisons of detection performance on KITTI-C benchmark in [1] with MonoFlex, regarding AP.

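| Method | Gauss. | Shot | Impul. | Defoc. | Glass | Motion | Snow | Frost | Fog | Brit. | Contr. | Pixel | Sat. | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source | 4.2 | 7.5 | 5.6 | 2.6 | 3.8 | 10.9 | 15.6 | 10.5 | 7.5 | 24.8 | 7.1 | 29.1 | 31.9 | 12.4 |
| TENT | 16.0 | 25.1 | 23.8 | 21.7 | 11.6 | 27.1 | 26.9 | 26.9 | 30.5 | 35.8 | 33.7 | 41.1 | 35.2 | 27.3 |
| EATA | 16.8 | 25.9 | 24.7 | 22.1 | 13.6 | 27.5 | 27.7 | 27.4 | 30.7 | 35.6 | 33.9 | 41.0 | 35.6 | 27.9 |
| DeYO | 19.2 | 26.1 | 24.7 | 23.2 | 15.6 | 28.5 | 28.5 | 29.3 | 30.8 | 35.1 | 34.2 | 40.8 | 36.2 | 28.6 |
| MonoTTA (latest SOTA) | 21.3 | 28.2 | 26.2 | 25.8 | 19.4 | 31.8 | 29.3 | 30.2 | 32.1 | 36.1 | 36.5 | 41.2 | 37.4 | 30.4 |
| ReCAP | 21.3 | 29.3 | 26.3 | 26.7 | 20.1 | 31.1 | 32.2 | 32.6 | 31.7 | 36.7 | 36.1 | 41.3 | 37.5 | 31.0 |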

Q2: How does ReCAP handle highly dynamic shifts where local consistency may be entirely lost?

A2: Thank you for your insightful question. Our finite-to-infinite approximation is derived without assuming any consistency condition, ensuring its applicability even when consistency is entirely lost. Based on this foundation, ReCAP employs region-confidence optimization to enhance local consistency, which is crucial for robust adaptation.

Moreover, we evaluate ReCAP in a highly dynamic setting where the data stream undergoes rapid transitions across different domains, including style, corruption, and label shifts (see Appendix B.1). The results demonstrate that ReCAP exhibits strong robustness and achieves SOTA performance, validating its capability to address highly dynamic scenes.

Q3: Quantitative comparison of computational cost reduction.

A3: ReCAP reduces the computational cost via feature-level region modeling, eliminating the overhead of image-level region modeling and augmentation. Furthermore, its finite-to-infinite approximation serves as an efficient proxy, removing the need for costly sampling. As shown in Tab. 2, these designs achieve significant runtime reduction. We will incorporate this quantitative comparison into the revised version to enhance clarity on the efficiency of ReCAP.

Table 2: Running time on 50,000 images.

| Region Type | Time (s) |
|---|---|
| Image-level region (16 augmentation) | 1798 |
| Feature-level region (w/o proxy, 16 sampling) | 163 |
| Feature-level region (w/ proxy) | 116 |

Q4: Clarification on ablation study settings.

A4: In our analysis of the effect of λ in Section 6.1, we fixed τ at 1.2, which aligns with the default value used across all experiments. We will explicitly state it in the revised version.

Q5: Variance + Bias vs. Mutual Information.

A5: Our variance + bias function serves a fundamentally different role from mutual information (MI) in several key aspects:

  1. Different Objects: MI is defined between two random variables, measuring their shared information, whereas our variance + bias function is computed over a local region surrounding a single sample x, capturing localized prediction stability.
  2. Different Purposes: MI primarily quantifies mutual dependence, while our function serves as prediction confidence and consistency measure within a local feature region, making it more aligned with adaptation objectives.
  3. Different Optimization Effects: MI encourages statistical association but does not address prediction probability discrepancies. In contrast, our function directly optimizes both prediction uncertainty (bias) and local discrepancy (variance), enhancing robustness under domain shifts.

Q6: Sensitivity of ReCAP.

A6: We have extensively evaluated ReCAP across diverse datasets (ImageNet-C, VisDA, ImageNet-R), tasks (classification, segmentation, detection), TTA settings (wild, mild, and continual), and hyperparameter configurations. Our results consistently show its robustness and reliability across these various conditions.

Additionally, we notice that the question might miss a word (e.g., choice of L_0 when adapting). If we have misunderstood your concern, please clarify, and we would be happy to provide further insights.

References
[1] Lin, Hongbin, et al. "Monotta: Fully test-time adaptation for monocular object detection." European Conference on Computer Vision, 2024.

Reviewer Comment

Thank you for the response, especially for including additional experimental results and for the detailed explanation of how your method differs from mutual information. The clarification regarding the finite-to-infinite approximation also helped me better understand your formulation.

Also, to follow up on my final question (Q6): I had originally meant to refer to τ (tau), which I mistakenly left out; apologies for the confusion. The ablation study (Fig. 4b) shows stable performance within a reasonable τ range, supporting the method's robustness. However, the performance drop beyond τ = 2.5 raises the question of how sensitive the method is in real-world scenarios where the optimal τ may not be known in advance. It would be helpful to better understand how much τ influences performance in practice.


Update following "Reply Rebuttal Comment by Authors":

Thank you for your thoughtful responses to my final questions. The additional experiments across diverse domains, which consistently show strong performance around similar τ values, were convincing. I recognize the strength of your work and have decided to raise my score to a 4.

Author Comment

Thank you very much for your constructive and positive feedback on our response. Following your suggestions, our additional experiments and explanations have further strengthened this work, particularly in terms of its broader applicability and high efficiency. We are also grateful for your clarification on Q6 and hope to address your question below:

Q7: The ablation study (Fig. 4b) shows stable performance within a reasonable τ range, supporting the method's robustness. However, the performance drop beyond τ = 2.5 raises the question of how sensitive the method is in real-world scenarios where the optimal τ may not be known in advance.

A7: Thank you for recognizing the robustness of our method. To further clarify the practical stability of τ selection, we provide additional discussion and a practical example below:

  1. Default Value as a Reliable Choice: Across all experiments in our manuscript, we consistently use a fixed τ=1.2, which delivers SOTA results across various datasets and TTA scenarios. This value serves as a reliable choice, and we recommend its use in cases where a validation set is unavailable.

  2. Stable Optimal Range: Hyperparameter tuning on a small validation set (10% Gaussian-type data from ImageNet-C) across 3 settings and 2 model architectures consistently selects 1.2 ($\pm 0.2$). Further validation reveals that values within [0.6, 1.6] maintain strong performance (Tab. 3), confirming a stable optimal region for reliable adaptation.

  3. Real-World Practicality: To assess τ sensitivity in real-world scenarios, we examine its impact on a detection task. The default τ=1.2 achieves 31.0 AP on KITTI-C, outperforming prior SOTA (Tab. 1). Additionally, grid search over [0.6, 1.6] on the validation set selects τ=1.3, improving AP to 31.2. This supports our default setting as a strong baseline and our tuning range as a practical search space.

Thank you again for raising this critical point. Given the empirical evidence and actionable guidance provided, we believe that τ selection is both stable and practical for researchers and practitioners, ensuring reliable performance without excessive sensitivity concerns.

Table 3: Additional ablation study on τ under 3 settings and 2 models. Results that surpass the prior SOTA are in bold.

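| Setting | τ=0.6 | τ=0.8 | τ=1.0 | τ=1.2 | τ=1.4 | τ=1.6 |
|---|---|---|---|---|---|---|
| Batch Size=1 (ResNet) | 46.6 | 47.3 | 47.5 | 47.6 | 47.2 | 46.3 |
| Batch Size=1 (ViT-Base) | 64.2 | 64.9 | 65.4 | 65.6 | 65.5 | 65.1 |
| Mixed Domain=1 (ResNet) | 40.0 | 41.2 | 42.0 | 42.1 | 42.1 | 42.0 |
| Mixed Domain=1 (ViT-Base) | 58.6 | 59.5 | 59.5 | 59.6 | 59.5 | 59.5 |
| Label Shift=1 (ResNet) | 45.5 | 46.6 | 47.1 | 47.2 | 46.4 | 45.3 |
| Label Shift=1 (ViT-Base) | 61.5 | 62.1 | 62.6 | 63.0 | 62.6 | 62.2 |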

Again, we would like to thank you for appreciating our work and recognizing our contributions!

Best regards, Authors

Review
Rating: 3

This paper proposes a region-modification-based mechanism, called "Region Confidence Adaptive Proxy" (ReCAP), to address the problem of wild test-time adaptation (WTTA). It further develops a finite-to-infinite asymptotic approximation, which gives a tractable upper bound on the otherwise intractable region confidence. Experimental results show improved performance of ReCAP compared to other approaches.

Questions for Authors

  1. How is the hyperparameter L_0 in equation 9 tuned for the experiments?
  2. Is the accuracy in Figure 4 measured on the test set itself, on which the final performance is reported? Whether there is a validation split for tuning?
  3. In real-world applications, isn't the setting, such as batch size = 1, too contrived, though it is challenging? Can we not accumulate more examples before updating, effectively increasing the batch size?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

The theoretical claims in the main paper have been checked. The details of the proofs in Appendix have not been thoroughly verified.

Experimental Design and Analyses

The experiments follow the existing WTTA line of work.

Supplementary Material

No code is provided in the supplementary. The appendix has some additional results and proofs that have been reviewed to some extent.

Relation to Existing Literature

The problem of WTTA is challenging. However, with the advent of more realistic continual test-time adaptation [1] approaches, the real world applicability of WTTA seems limited compared to recent progress.

References:

  1. Wang, Qin, et al. "Continual test-time domain adaptation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

Essential References Not Discussed

NA

Other Strengths and Weaknesses

Strengths

  • Proposed a tractable upper bound to the intractable region confidence.
  • The theoretical results are interesting.
  • Most of the experimental results show improvements.

Weaknesses

  • Some of the empirical gains are marginal; for example, in Table 2 with ViT-Base, the gain from DeYO to ReCAP is ≤ 0.5.
  • Limited real-world applicability of WTTA, with batch size = 1 setting.

Other Comments or Suggestions

  1. Line 235-236: It should be Eq. 5 in place of Eq. 10.
  2. The recent focus in the area of test-time adaptation has shifted towards continual test-time adaptation (CTTA), so experiments in the CTTA setting will enhance the contribution of this paper.
Author Response

Thank you for carefully reviewing our paper and offering a positive assessment. We appreciate your recognition of the contributions made by our work, particularly the idea of the tractable bound on the intractable region confidence and the theoretical results.

Q1: Some of the Empirical Gains are marginal. For example, VitBase in mixed testing domain.

A1: Thank you for your feedback. In highly competitive TTA scenarios, performance gains tend to approach saturation in some cases. However, larger domain gaps, such as more severe corruption or mixed style shifts, still present significant challenges in terms of adaptation efficiency and robustness for TTA methods.

To assess the empirical gains under more severe shifts, we increase the severity level from 4 & 5 to 6 & 7 (see Tab. 1), where our method achieves significant improvements of +5.7 and +5.2 over DeYO. Furthermore, under complex style shifts, ReCAP achieves average gains of +2.6 on ImageNet-R and +1.7 on VisDA (Appendix B.1). Overall, our method consistently outperforms prior methods across 3 datasets, 3 wild settings, and 2 base models, achieving gains of >+1.5 in the majority of scenarios.

Table 1: Comparisons on ImageNet-C (severity level 6, 7) using VitBase under Mixed Testing Domain.

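| Method | Level 6 | Level 7 |
|--------|---------|---------|
| Source | 18.87 | 12.80 |
| TENT | 2.45 | 0.99 |
| EATA | 30.58 | 16.32 |
| SAR | 32.11 | 17.74 |
| DeYO | 29.54 | 16.08 |
| ReCAP | 35.27 | 21.32 |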
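
As an illustrative check (not part of the original response), the gains quoted above can be recomputed from the DeYO and ReCAP rows of Tab. 1. The dictionary layout and the helper name `gain` are assumptions made for this sketch; the numbers are copied from the table.

```python
# Accuracy (%) copied from Tab. 1 (VitBase, mixed testing domain).
table1 = {
    "DeYO": {"level6": 29.54, "level7": 16.08},
    "ReCAP": {"level6": 35.27, "level7": 21.32},
}


def gain(method: str, baseline: str, level: str) -> float:
    """Absolute accuracy gain of `method` over `baseline`, in percentage points."""
    return round(table1[method][level] - table1[baseline][level], 2)


gains = {level: gain("ReCAP", "DeYO", level) for level in ("level6", "level7")}
print(gains)  # differences of 5.73 and 5.24 points, i.e. the quoted +5.7 and +5.2
```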

Q2: Limited applicability of bs=1 setting. Can we not accumulate more examples to increase the batch size?

A2: We appreciate your comment. While bs=1 may seem contrived, some real-world applications (e.g., edge computing) face hardware constraints that necessitate the use of small mini-batches. Following your comment, we evaluate the effect of accumulating examples with varying sizes (see Tab. 2). Even with accumulation, small batch sizes remain a crucial bottleneck: Tent only exceeds the no-adapt accuracy at bs=64. This underscores the importance of developing robust TTA solutions tailored to such restrictive conditions.

Table 2: Accuracy of Tent on ImageNet-C across different accumulated batch sizes, evaluated on ResNet50.

| | no-adapt | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 |
|---|---|---|---|---|---|---|
| Tent (%) | 30.6 | 21.5 | 23.5 | 25.9 | 28.6 | 33.9 |
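
A minimal sketch of what "accumulating examples" could look like in a streaming setting is given below. It is an assumption about the mechanism, not the authors' implementation: the function name `accumulate_batches` is hypothetical, and each yielded batch is taken to drive one adaptation step.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def accumulate_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of single test samples into batches of `batch_size`.

    A trailing partial batch is still yielded so that no sample is dropped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    buffer: List[T] = []
    for sample in stream:
        buffer.append(sample)
        if len(buffer) == batch_size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer


# Toy stream of 10 sample indices, accumulated with a batch size of 4.
batches = list(accumulate_batches(range(10), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

In this sketch a sample waits in the buffer until the batch fills, so accumulation trades prediction latency for a larger effective batch.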

Q3: Line 235-236: It should be Eq. 5 in place of Eq. 10.

A3: Thank you for pointing out this typo, and we will correct it.

Q4: Additional experiments in the CTTA setting will enhance the contribution of this paper.

A4: Thank you for your constructive suggestion. Following your advice, we evaluate our method in CTTA scenarios for classification (Tab. 3 in this response) and semantic segmentation (Tab. 1 in the response to Reviewer ynrW). Although ReCAP is not designed for the CTTA setup, it shows competitive performance and outperforms several strong baselines. These additional evaluations further highlight the broad applicability of our method across mild, continual, and wild settings.

Table 3: Error rate (%) in CTTA scenario (CIFAR100C) [1], evaluated on ResNeXt-29.

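| Method | Gaus. | Shot | Impu. | Defo. | Glas. | Moti. | Zoom | Snow | Fros. | Fog | Brig. | Cont. | Elas. | Pixe. | Jpeg | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source | 73.0 | 68.0 | 39.4 | 29.3 | 54.1 | 30.8 | 28.8 | 39.5 | 45.8 | 50.3 | 29.5 | 55.1 | 37.2 | 74.7 | 41.2 | 46.4 |
| TENT | 37.2 | 35.8 | 41.7 | 37.9 | 51.2 | 48.3 | 48.5 | 58.4 | 63.7 | 71.1 | 70.4 | 82.3 | 88.0 | 88.5 | 90.4 | 60.9 |
| CoTTA | 40.1 | 37.7 | 39.7 | 26.9 | 38.0 | 27.9 | 26.4 | 32.8 | 31.8 | 40.3 | 24.7 | 26.9 | 32.5 | 28.3 | 33.5 | 32.5 |
| SAR | 39.7 | 34.3 | 36.5 | 26.4 | 37.4 | 28.6 | 26.1 | 32.7 | 31.4 | 36.6 | 26.1 | 29.6 | 33.0 | 29.8 | 38.1 | 32.4 |
| EcoTTA | 39.1 | 35.7 | 37.5 | 26.2 | 37.7 | 28.3 | 26.3 | 32.2 | 31.0 | 36.9 | 25.9 | 27.4 | 32.7 | 28.4 | 34.7 | 32.0 |
| DeYO | 39.0 | 34.1 | 36.3 | 26.7 | 37.2 | 28.4 | 26.2 | 32.4 | 31.6 | 36.2 | 25.5 | 26.8 | 32.2 | 30.1 | 38.3 | 32.1 |
| ReCAP | 38.8 | 33.5 | 36.5 | 26.5 | 37.9 | 28.2 | 26.4 | 31.1 | 29.6 | 34.0 | 25.8 | 27.7 | 32.0 | 28.2 | 38.1 | 31.4 |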

Q5: How is the hyperparameter L_0 in equation 9 tuned for the experiments?

A5: For hyperparameter L_0, we perform a grid search over a range of values on a small validation set, which comprises 10% of the Gaussian-type data from ImageNet-C. Additionally, we conduct a sensitivity analysis (see Appendix C.1) to confirm the robustness of our method to variations in L_0. Further details on this tuning process will be included in the revised version.
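
The tuning procedure described in A5 can be sketched roughly as follows. This is a generic grid search under stated assumptions, not the authors' code: `grid_search` and `toy_validation_accuracy` are hypothetical names, the candidate grid is made up, and the toy function merely stands in for "adapt with a given L_0 on the validation split and return accuracy".

```python
from typing import Callable, Iterable, Optional, Tuple


def grid_search(candidates: Iterable[float],
                evaluate: Callable[[float], float]) -> Tuple[float, float]:
    """Return (best_value, best_accuracy) over the candidate grid."""
    best_value: Optional[float] = None
    best_acc = float("-inf")
    for value in candidates:
        acc = evaluate(value)
        if acc > best_acc:  # keep the first candidate that reaches the maximum
            best_value, best_acc = value, acc
    if best_value is None:
        raise ValueError("candidates must not be empty")
    return best_value, best_acc


def toy_validation_accuracy(l0: float) -> float:
    """Placeholder for running adaptation on the validation split (assumption)."""
    return 60.0 - 100.0 * (l0 - 0.4) ** 2


# Made-up grid; the actual search range is not stated in the response.
best_l0, best_acc = grid_search([0.2, 0.3, 0.4, 0.5], toy_validation_accuracy)
print(best_l0, best_acc)
```

Keeping the search on a held-out split, as A5 describes, is what separates tuning from the test-set sensitivity analysis.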

Q6: Is the accuracy in Figure 4 measured on the test set itself, on which the final performance is reported? Is there a validation split for tuning?

A6: For the sensitivity analysis in Figure 4, we measure accuracy on the entire test set to validate the robustness of our method. For hyperparameter tuning, we use a small validation split (the same set used for L_0). The hyperparameters selected through this process are then validated and shown to be robust in Figure 4. We will provide a clearer explanation of this procedure in the revised version to avoid any confusion.

References
[1] Wang, Qin, et al. "Continual test-time domain adaptation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

Reviewer Comment

Thanks to the authors for their response.

I do not have any further queries or comments.

Author Comment

We sincerely express our gratitude for your valuable feedback. Thanks to additional discussions and experiments, we were able to effectively convey the contributions of our work.

Again, we would like to thank you for appreciating our work and recognizing our contributions!

Best,

The Authors

Final Decision

Test-time adaptation aims to reduce generalization error by updating on shifted data. This work analyzes the "mild" (single shift) and "wild" (multiple shifts and small batches) settings for test-time adaptation and finds that local inconsistency is a challenge. The proposed method, ReCAP (Region Confidence with Adaptive Proxy), identifies more local regions in which to optimize instead of simply minimizing entropy or some thresholding of entropy as in prior work. Experiments show improvement on the common benchmark of ImageNet-C, in settings comparable to prior work, including batch size one and imbalanced classes, and in comparison with recent and strong methods. Furthermore, ReCAP can be combined with base methods such as SAR and DeYO, and shows further improvement. This is a clear accept given the contributions and the reviewer consensus for acceptance (4/4 agree with scores of 4, 4, 4, 3). The authors provided a rebuttal and all reviewers acknowledged it: eZDc, 7x3y, and ynrW raised their scores from 3 to 4, and 6Q1i maintained 3. The consensus is therefore clear, and the area chair agrees with acceptance.