PaperHub
Rating: 6.1/10 · Poster · 4 reviewers
Scores: 3, 4, 3, 3 (min 3, max 4, std 0.4)
ICML 2025

Test-Time Adaptation with Binary Feedback

Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose a novel test-time adaptation setting that uses a small amount of binary feedback (correct/incorrect). Our dual-path optimization algorithm, BiTTA, combines guided adaptation on uncertain samples with self-adaptation on confident predictions.

Abstract

Keywords
test-time adaptation, domain adaptation, deep learning, machine learning

Reviews and Discussion

Review (Reviewer R67W)
Rating: 3

The paper introduces binary feedback-guided test-time adaptation (BiTTA), a novel TTA framework designed to adapt deep learning models to domain shifts at test time using binary feedback from human annotators. The authors address limitations in prior active TTA methods, which suffer from high annotation costs, especially in complex multi-class settings. BiTTA mitigates this challenge by incorporating recent reinforcement learning concepts and agreement-based self-adaptation. It introduces Binary Feedback-guided Adaptation (BFA) for selecting uncertain samples and Agreement-Based Self-Adaptation (ABA) for refining confident ones, effectively enhancing model performance on challenging test distributions. Experiments across multiple datasets demonstrate that BiTTA outperforms existing TTA methods.

Questions for Authors

N/A

Claims and Evidence

The authors point out the limitations of prior active TTA methods, which suffer from high annotation costs, especially in complex multi-class settings. BiTTA mitigates this challenge by incorporating recent reinforcement learning concepts and agreement-based self-adaptation. Experimental results show that BiTTA successfully utilizes binary annotations to enhance model performance at test time, achieving results competitive with methods employing full annotations and validating the central claim.

Methods and Evaluation Criteria

BiTTA mitigates this challenge by incorporating recent reinforcement learning concepts and agreement-based self-adaptation. It introduces Binary Feedback-guided Adaptation (BFA) for selecting uncertain samples and Agreement-Based Self-Adaptation (ABA) for refining confident ones, which is reasonable. The authors assessed BiTTA across a range of benchmarks and settings: CIFAR10-C, CIFAR100-C, Tiny ImageNet-C, and continual and imbalanced test-time adaptation scenarios.

Theoretical Claims

This paper doesn't include many proofs or theoretical claims; the work prioritizes experimental validation.

Experimental Design and Analysis

I have checked the experimental design and think it is reasonable, covering a range of benchmarks and settings: CIFAR10-C, CIFAR100-C, Tiny ImageNet-C, and continual and imbalanced test-time adaptation scenarios.

Supplementary Material

After examining the supplementary materials, I believe open-sourcing the code would be a valuable contribution and greatly benefit the community.

Relation to Existing Literature

The authors point out the limitations of prior active TTA methods (e.g., SimATTA), which suffer from high annotation costs, especially in complex multi-class settings. BiTTA mitigates this challenge by incorporating recent reinforcement learning concepts and agreement-based self-adaptation.

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths:

  1. Reduced Labeling Costs: BiTTA minimizes labeling demands by using binary feedback from human annotators instead of requiring full-class labels. This significantly reduces the annotation burden, making it more feasible for real-world scenarios than current ATTA methods.

  2. Dual-Path Optimization with Reinforcement Learning: Binary Feedback-guided Adaptation (BFA) handles uncertain samples and Agreement-Based Self-Adaptation (ABA) handles confident samples. Introducing reinforcement learning via binary feedback optimization is interesting and novel in TTA.

  3. Strong Experimental Results: BiTTA consistently outperforms competing TTA methods and even surpasses the ATTA method with full-class labels in certain settings.

Weaknesses:

  1. Theoretical Analysis: The paper's empirical validation is strong, but a theoretical analysis, particularly comparing BiTTA to full-label methods, would significantly enhance its contribution and provide deeper insights.

  2. Scalability and Robustness: While ImageNet-C experiments demonstrate scalability, detailed per-corruption results in Table 1(d) would be beneficial. Furthermore, the paper should have more discussions on how BiTTA addresses robustness in high model error scenarios.

  3. Frequency of Human Intervention: The requirement of annotating 3 samples per 64-sample batch implies a high annotation burden and continuous human availability, since every batch of samples needs intervention. Exploring strategies to reduce the frequency of human intervention would improve the system's practicality.

Other Comments or Suggestions

N/A

Author Response

We sincerely thank reviewer R67W for the comprehensive review and for highlighting both the strengths and potential areas of improvement in our paper. We appreciate your recognition of our reduced labeling costs and dual-path optimization framework.

Open-sourcing. We truly agree with your suggestion. We have already included a source code zip file in the supplementary materials of the original submission, and we will open-source the code and include the corresponding repository link in the final version.

Theoretical Analysis. We thank the reviewer for highlighting the value of a theoretical comparison to full-label adaptation methods. Our theoretical analysis between binary feedback and full-label feedback from an information-theoretic perspective is in Appendix D.1, which shows that binary feedback provides log(num_classes) times less information than full-class labeling per sample, offering lightweight labeling costs. Our empirical results also suggest that binary feedback provides enough information to drive effective adaptation under domain shift. This is further supported by a comparison with full-class labeling under equal labeling cost, where BiTTA shows superior performance over the full-label active TTA baseline (Figure 5). These insights imply that BiTTA retains strong adaptation capabilities despite more limited supervision, supporting its practicality in real-world scenarios where annotation time is constrained.
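As a back-of-envelope version of that comparison (our sketch, not the appendix derivation; it assumes a uniform prior over the C classes):

```latex
% Bits per annotated sample, assuming a uniform prior over C classes.
% A full-class label resolves one of C outcomes; binary feedback resolves
% at most one of two, so the ratio is on the order of log2(C).
I_{\text{full}} = \log_2 C \ \text{bits}, \qquad
I_{\text{binary}} = -p\log_2 p - (1-p)\log_2(1-p) \le 1 \ \text{bit},
\qquad
\frac{I_{\text{full}}}{I_{\text{binary}}} \ge \log_2 C
\;\;(\approx 6.6 \ \text{for} \ C = 100).
```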

Scalability and Robustness. Please note that we did include detailed per-corruption results for ImageNet-C in Table 6 in Appendix B. We will move them to the main paper in our final draft. In summary, BiTTA achieves 36.59% on ImageNet-C, outperforming all baselines by 12.0%p on average, demonstrating its scalability to larger datasets.

Regarding high model error scenarios, our experiments on challenging datasets (Tiny-ImageNet-C with source accuracy 21.48%) show that BiTTA still provides substantial improvements (+19.37%p), highlighting its robustness to high-error settings. This is because our agreement-based adaptation (ABA) can leverage samples with reliable predictions, complementing the many "incorrect"-labeled samples from binary feedback and leading to stable adaptation.

Frequency of Human Intervention. We did include additional experiments where we reduced the frequency of human intervention (e.g., labeling only 1 out of 4 consecutive batches) in Figure 11 in Appendix B. Results show that BiTTA maintains strong performance even with significantly reduced intervention frequency (outperforming the baseline by 9%p on average), suggesting that practical implementations could use less frequent feedback while maintaining the performance benefits. During the rebuttal, we further examined the impact of delayed feedback, where binary feedback arrives after a few batches (see Rebuttal for QyDK: Delays or inability to obtain timely feedback). The results showed that delayed feedback achieves the same level of adaptation performance, showcasing the practicality of our system.

We appreciate your constructive feedback and will carefully revise our manuscript accordingly. Given the strengths highlighted in our submission, we hope our rebuttal has addressed your concerns. Please let us know if you have any further questions.

Review (Reviewer SHhn)
Rating: 4

Common test-time adaptation methods focus on sample selection through softmax probabilities and further minimize an uncertainty-based loss on the target data. Different from this, the paper proposes to use binary feedback at test time to guide adaptation. In contrast to the existing overall process of test-time adaptation, they use binary feedback instead of full-label access, thereby avoiding unreliable adaptation on uncertain samples while applying agreement-based self-adaptation on confident predictions. Experiments with baselines adapted to their setting reveal the effectiveness of their method. The ablation studies indeed provide additional insights into their method.

Update after rebuttal

The authors have addressed my concerns about the MC-Dropout, softmax probabilities, and its rationale. Furthermore, the inclusion of comparisons to ensemble techniques significantly enhances the value of the paper, as it was previously incomplete without these comparisons.

I recommend the authors include the text from the rebuttal in the paper, such as (i) MC-dropout vs. softmax probabilities and sample selection, (ii) implementation details of MC-Dropout, including the text from the follow-up query, (iii) Table R4 and Table R5 (both from the rebuttal), and (iv) justification of the number of MC-dropout iterations. Therefore, in response, I have increased my score.

Questions for Authors

– Dropout [1], especially with multiple forward passes, acts as an ensemble consisting of multiple networks within the encoder. Therefore, I would like to see comparisons to works that also utilize ensembles for adaptation. For instance, TAST [2] uses ensemble networks for test-time adaptation.

– Using confidence with argsort to obtain samples (Equation 5) may well face the same issue as common test-time adaptation methods. Since the setting is unsupervised, it is not certain to what extent the predictions can be useful.

References: [1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning (Section 7.12). Cambridge, MA: MIT Press. [2] Jang, Minguk, Sae-Young Chung, and Hye Won Chung. "Test-time adaptation via self-training with nearest neighbor information." ICLR 2023.

Claims and Evidence

– The Fig. 1 caption says that test-time adaptation algorithms fail under severe distribution shifts due to the fundamental risk of adapting to unlabeled samples. While the first half of the statement is reflected in several works, the reasoning for the second half is not given. In fact, it is not at all relevant to state it here.

– The claim that MC-dropout offers a robust uncertainty estimate while softmax probabilities exhibit overconfidence needs further evidence. It can be agreed that softmax probabilities are sometimes too confident about wrong predictions and vice versa. However, I am curious to know how MC-dropout would mitigate this. Does over-sampling help?

Methods and Evaluation Criteria

Using foundation models to provide all the predictions for the binary feedback is erroneous in nature. Perhaps it is hard to find an oracle solution for all kinds of distribution shifts.

Theoretical Claims

Yes; the paper doesn't include many theory sections.

Experimental Design and Analysis

– With the increase in classes (CIFAR-10 to CIFAR-100) in Table 1, why does the performance of baselines such as EATA decrease drastically?

– What was the experimental setting for the impact of labeling errors? How was the noise in labeling induced? I would urge the authors to include more baselines in this experiment.

– How many forward passes were used for the MC dropout? I would like to see detailed implementation details for the usage of dropout.

Supplementary Material

All, yes. I find it interesting to use predictions over augmented samples via methods such as MEMO! I am curious why the performance with MEMO decreases drastically. I find the accuracy at batch size 1 to be impressive! Also, the inclusion of Figure 9, which compares the sample selection strategy with random and confidence-based selection, is essential and well presented!

Relation to Existing Literature

The overall contribution of the problem setup is novel. The proposed methodology uses techniques that are not novel in themselves but are novel to the problem setting.

Missing Essential References

No. The authors have discussed the related work appropriately. Moreover, the paper's method is novel for the setting of test-time adaptation.

Other Strengths and Weaknesses

Strengths:

– The overall idea of using reinforcement learning for test-time adaptation sounds interesting.

– I find the idea of using a few samples with ground truth binary feedback innovative!

– I don't find the computational overhead induced by Monte Carlo sampling to be a significant issue. This can be fixed with efficient sampling optimization techniques.

Weaknesses:

– I urge the authors to include more recent works in the tables, including DeYO [1] and OWTTT [2].

References: [1] Lee, Jonghyun, et al. "Entropy is not enough for test-time adaptation: From the perspective of disentangled factors." arXiv preprint arXiv:2403.07366 (2024). [2] Li, Yushu, et al. "On the robustness of open-world test-time training: Self-training with dynamic prototype expansion." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Other Comments or Suggestions

Dynamically adopting → Dynamically adapting

The PACS dataset is commonly used for domain generalization, not domain adaptation.

The authors cite too many results in the introduction and methodology. While it may provide evidence, it's not reader-friendly. Instead, the key message should be stated with hyperlinks.

Author Response

We thank reviewer SHhn for the detailed review and thoughtful questions that helped us improve our work. We appreciate your recognition of our novel problem setup and methodology.

Clarification regarding Fig. 1 caption. Traditional TTA methods indeed struggle with severe distribution shifts, which has been reported in the literature. In terms of the risk of adapting to incorrect predictions from unlabeled samples, we will tone down the claim and support it further with theoretical insights from ATTA (Gui et al., 2024), which show that labeled test instances can enhance overall performance compared with unlabeled adaptation. We will update the caption to clarify this.

MC-dropout vs. softmax probabilities and sample selection (Q2). MC-dropout offers more robust uncertainty estimates than standard softmax probabilities because (1) it captures epistemic uncertainty (model uncertainty) through multiple stochastic forward passes (Gal and Ghahramani, 2016) and (2) softmax probabilities are known to be overconfident on out-of-distribution samples (Lee et al., 2024b).

During the rebuttal, we compared the expected calibration error (ECE) between MC-dropout and original softmax probabilities. We found that MC-dropout confidence (avg. ECE 0.062) shows 33% lower ECE than the original softmax confidence (avg. ECE 0.100) across all corruptions, showing that MC-dropout provides a more robust uncertainty estimate.
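For reference, a minimal sketch of the standard binned ECE metric used in this comparison (our illustration; function and variable names are not from the paper):

```python
import torch

def expected_calibration_error(conf: torch.Tensor, correct: torch.Tensor,
                               n_bins: int = 10) -> float:
    """Binned ECE: bin-mass-weighted |accuracy - mean confidence| gap.

    conf:    (N,) predicted-class confidences in [0, 1]
    correct: (N,) bool tensor, whether each prediction was right
    """
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)   # standard half-open binning
        if in_bin.any():
            acc = correct[in_bin].float().mean()
            avg_conf = conf[in_bin].mean()
            ece = ece + in_bin.float().mean() * (acc - avg_conf).abs()
    return ece.item()
```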

Also, in Figure 9 in Appendix B, we reported that our MC-dropout-based uncertain sample selection outperforms the original confidence-based sample selection, demonstrating the robust uncertainty estimation of MC-dropout.

Details of MC-dropout. We used N=4 iterations of dropout inferences to implement MC-dropout. We will open-source the code and integrate details of MC-dropout in the final manuscript.
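A minimal sketch of what N-pass MC-dropout inference could look like in PyTorch (our illustration, not the authors' released code; gradients are disabled because this sketch covers only uncertainty estimation, and the BN-statistics handling discussed later in this thread is omitted):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_probs(model: nn.Module, x: torch.Tensor, n_iters: int = 4) -> torch.Tensor:
    """Average softmax outputs over n_iters stochastic forward passes."""
    was_training = model.training
    model.eval()                              # keep BN statistics frozen
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                         # enable only the dropout layers
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(n_iters)]
    ).mean(dim=0)                             # Monte Carlo estimate of p(y|x)
    model.train(was_training)                 # restore the original mode
    return probs
```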

Usage of foundational models for feedback. We agree that current state-of-the-art foundational models (e.g., GPT-4o) are erroneous for active labeling (please note that we used foundational models for generating full-class active labels in Figure 5). This necessitates lightweight labeling from an oracle, such as our proposed binary feedback. Please let us know if you need further clarification.

Questions about EATA performance and noise induction. EATA's performance decreases on CIFAR-100-C compared to CIFAR-10-C. This phenomenon has also been observed in previous literature (Lee et al., 2024b). This explains the vulnerability of unlabeled adaptation; if an early adaptation stage leads to an erroneous direction of adaptation, simple entropy adaptation methods such as EATA and TENT can lead to model failures where most of the predictions become incorrect. Since model failures lead to over-confidence in incorrect samples, simple confidence/entropy thresholding in EATA cannot filter out wrong predictions.

Experiment setting for the impact of labeling errors. For evaluating the impact of labeling errors (Figure 6), we randomly flipped binary feedback labels (correct↔incorrect) at specified rates (10%, 20%, 30%) to simulate annotation errors.
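A sketch of how such label flips could be simulated (hypothetical helper name; the paper's exact protocol may differ in details):

```python
import torch

def corrupt_binary_feedback(feedback: torch.Tensor, error_rate: float) -> torch.Tensor:
    """Flip correct<->incorrect feedback bits at the given rate.

    feedback: (B,) bool tensor, True = 'prediction marked correct'.
    """
    flip = torch.rand_like(feedback, dtype=torch.float) < error_rate
    return feedback ^ flip   # XOR flips exactly the selected entries
```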

Additional results on DeYO, OWTTT, and TAST (Q1). Thank you for suggesting these recent works. During the rebuttal, we experimented with DeYO and TAST in Table R4 and OWTTT in Table R5. We observed that our BiTTA still outperforms the latest baselines.

Table R4. Average accuracy (%) comparisons. Notation * indicates the modified algorithm to utilize binary-feedback samples.

Method | CIFAR10-C | CIFAR100-C | TinyIN-C
DeYO* | 84.41 | 61.30 | 40.67
TAST-BN* | 75.53 | 29.13 | 17.05
BiTTA | 87.20 | 62.49 | 40.85

Table R5. Average accuracy (%) comparisons under the OWTTT pre-trained model. Notation * indicates the modified algorithm to utilize binary-feedback samples.

Method | CIFAR10-C | CIFAR100-C
OWTTT | 54.63 | 29.10
OWTTT* | 31.24 | 3.39
BiTTA | 89.89 | 64.06

In addition, during the rebuttal, we experimented with replacing MC-dropout with an ensemble structure from TAST. The ensemble method showed 1.7%p lower accuracy than the original MC-dropout-based algorithm. This emphasizes that while our dual-path RL optimization framework is flexible enough to incorporate ensemble methods (e.g., TAST), our proposed MC-dropout policy estimation shows the best performance.

Rebuttal on Other Comments or Suggestions. Thank you for pointing out typos and suggesting improvements. We will update the final manuscript.

We will revise our final manuscript to reflect the rebuttal. Given the strengths highlighted in our submission, we hope our rebuttal has addressed your concerns. Please let us know if you have any further questions.

Reviewer Comment

Thank you for your efforts.

I have a follow-up question regarding the dropout and iterations (N=4). From Table R2, it is evident that the accuracy doesn't vary significantly (~1-2%) with the number of iterations. Consequently, (i) how do the authors justify the usage of MC-dropout for their use case compared to N=1? Why is it even needed?

(ii) It's still not clear about the implementation details; could the authors provide detailed implementational details regarding MC-dropout usage? Do the authors switch to model.train() during evaluation at test-time? What specific parameters (alpha, beta, running mean, vars) are updated at test time?

Author Comment

Justifying the number of MC-dropout iterations. We thank the reviewer for the insightful question.

First, we clarify that N=1 still uses MC-dropout, which is essential for both policy and uncertainty estimation in BiTTA. Even a single stochastic forward pass introduces the necessary randomness to support BiTTA’s two core components: (1) Binary feedback-guided adaptation (BFA) enables the identification of uncertain samples for feedback. (2) Agreement-based self-adaptation (ABA) allows for measuring prediction agreement between deterministic and stochastic outputs to identify confident samples. Also, applying dropout during adaptation provided parameter-wise robustness compared to removing MC-dropout.

If MC-dropout is entirely removed, i.e., replaced with deterministic softmax outputs, the adaptation performance degrades significantly, with a 2.56%p accuracy drop, showing the necessity of MC-dropout (see also Rebuttal for wnVe, Q7: Dropout Pre-training).

Although the accuracy improvement might seem marginal, we found that using N > 1 consistently resulted in higher accuracy than N=1. This improvement is primarily attributed to better uncertainty calibration: specifically, we observed that N=1 leads to approximately 2× higher expected calibration error (ECE, 0.064 vs. 0.142). Therefore, choosing N > 1 provides a reliable and robust uncertainty estimate, leading to better performance.

We found that small values such as N=4 offer a reasonable trade-off between accuracy and latency. That said, we acknowledge that the optimal choice of N may depend on the target scenario: for instance, applications that prioritize latency may prefer a smaller N, while those requiring higher reliability might benefit from a slightly larger N.

Implementation details regarding MC-dropout. Thank you for asking for clarification. We set model.train() during test-time adaptation to enable gradient calculation. However, to avoid multiple BN statistics updates from MC-dropout inference, we only update BN statistics once per adaptation step with momentum 0.3 (without dropout). We then fix the BN statistics during MC-dropout inference by (1) forcing BN momentum to zero, (2) running MC-dropout inference in train() mode, and (3) restoring the original BN momentum. Finally, we backpropagate with policy gradients on all model weights via REINFORCE. We will include these implementation details in our final manuscript.
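A rough PyTorch reconstruction of these steps (our reading of the description above, assuming a `model` with nn.Dropout layers and a test batch `x` are in scope; helper names are ours, not the released code):

```python
import torch
import torch.nn as nn

def set_bn_momentum(model: nn.Module, momentum: float) -> None:
    # momentum = 0.0 makes PyTorch BN leave running_mean/var unchanged
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.momentum = momentum

def set_dropout(model: nn.Module, enabled: bool) -> None:
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train(enabled)

model.train()                      # train() mode for gradient calculation
set_bn_momentum(model, 0.3)
set_dropout(model, False)
with torch.no_grad():
    model(x)                       # single pass: update BN statistics once

set_bn_momentum(model, 0.0)        # freeze BN stats during MC-dropout passes
set_dropout(model, True)
mc_logits = torch.stack([model(x) for _ in range(4)])  # keep grads for REINFORCE
set_bn_momentum(model, 0.3)        # restore the original BN momentum
# ...policy-gradient (REINFORCE) update on all model weights follows
```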

We hope this response clarifies your concerns. Please let us know if you have any further questions or concerns. If our response sufficiently addresses your comment, we would greatly appreciate your consideration in increasing the score.

Review (Reviewer QyDK)
Rating: 3

The paper introduces BiTTA, a novel test-time adaptation (TTA) framework that leverages binary feedback (correct/incorrect) from annotators to address domain shifts. The key contribution is a dual-path optimization strategy combining reinforcement learning (RL)-guided adaptation on uncertain samples (BFA) and agreement-based self-adaptation on confident predictions (ABA).

Questions for Authors

None.

Claims and Evidence

The central claims are well-supported:

BiTTA outperforms SOTA methods: Table 1 shows BiTTA’s accuracy improvements over TTA baselines.

Dynamic sample selection improves adaptation: Figure 4 demonstrates ABA’s effectiveness in selecting confident samples via prediction agreement.

Methods and Evaluation Criteria

The dual-path RL framework is well-motivated, combining BFA (uncertain samples) and ABA (confident samples). MC dropout for uncertainty estimation aligns with recent TTA literature.

Theoretical Claims

There is no theoretical claim in the paper.

Experimental Design and Analysis

The experimental design is sound.

Supplementary Material

Yes, the supplementary material provides additional experiments that reinforce the main findings.

Relation to Existing Literature

The work connects to: active TTA (SimATTA), RL with human feedback (RLHF), and uncertainty estimation (MC dropout).

Missing Essential References

No critical omissions detected in cited literature.

Other Strengths and Weaknesses

None.

Other Comments or Suggestions

My main concern is the practicality of the proposed RL-guided framework in real-world scenarios. The proposed RL-guided framework requires real-time binary feedback from annotators. However, in practical deployment scenarios, there might be delays or inability to obtain timely feedback (e.g., annotator unavailability). How would this affect BiTTA’s performance? Could the authors design experiments to simulate delayed/noisy feedback and evaluate robustness?

Author Response

We sincerely thank reviewer QyDK for the positive feedback on our work and for recognizing our paper's dual-path optimization strategy and its contributions.

Delays or inability to obtain timely feedback. During the rebuttal, we conducted an additional experiment where active adaptation algorithms (BiTTA and SimATTA) receive the labeled samples in delayed batches (e.g., labeled instances arrive 1/2/3 batches later). The results in Table R3 suggest that delayed feedback has a negligible impact on the performance of TTA with binary feedback, further enhancing the practicality of the feedback framework. We will include the result in the final manuscript.

Table R3. Accuracy (%) comparisons under delayed feedback in CIFAR10-C. Zero delay is equivalent to the original experiment setting. Notation * indicates the modified algorithm to utilize binary-feedback samples.

Method | Delay 0 | Delay 1 | Delay 2 | Delay 3
SimATTA* | 81.09 | 81.20 | 81.07 | 81.11
BiTTA | 87.20 | 87.14 | 87.22 | 87.28

Also, we did include experiments where annotators skipped labeling for several batches (e.g., labeling only 1 out of 4 consecutive batches; Figure 11 in Appendix B). The results show that, compared with the baseline, BiTTA maintains stable performance even with such intermittent feedback, outperforming SimATTA (the active TTA baseline) by 9.22%p.

Noisy/erroneous feedback scenarios. We did include BiTTA under noisy/erroneous feedback scenarios in Figure 6 in Section 4, where BiTTA maintains robust performance even with up to 30% incorrect binary feedback, consistently outperforming SimATTA by a significant 7.81%p.

We will include these additional results in the final paper to address practical deployment concerns. We believe our TTA-with-binary-feedback framework and the proposed BiTTA algorithm are practical in real-world scenarios with delayed or noisy feedback. Given the strengths highlighted in our submission, we hope our rebuttal has addressed your concerns. Please let us know if you have any further questions.

Review (Reviewer wnVe)
Rating: 3

This paper explores a new setting of test-time adaptation, in which the authors introduce binary human feedback for test-time learning. The authors introduce MC-dropout for sample confidence estimation and then devise a unified test-time RL framework to exploit both Human Feedback Rewards (for unconfident samples) and Self-Agreement Rewards (for confident samples). Experiments demonstrate the effectiveness and superiority of the proposed method. My detailed comments are as follows.

Questions for Authors

Q1 Relation to Prior Work: The agreement-based self-adaptation strategy, which maximizes the consistency between MC-dropout predictions and the original predictions, has been explored in prior work [Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting]. A more detailed discussion of the connections and differences with this related work would be beneficial.

Additionally, what are the performance implications or advantages of using RL-based training instead of directly minimizing prediction inconsistency via backpropagation?

Q2 Ablation Study: Is there a clear ablation study evaluating the proposed method without Binary Feedback?

Q3 Combination with Existing TTA Losses: How does the method perform when Binary Feedback is applied to unconfident samples, while conventional TTA losses (e.g., TENT, EATA, DEYO) are used for confident samples?

Q4 Evaluation on Larger Datasets: Could the authors provide direct experiments on ImageNet-C in the main paper? Results on CIFAR-10, CIFAR-100, and Tiny-ImageNet are not sufficiently convincing for me. Moreover, in Tables 1-3, reporting results for the original baselines (including DEYO) without Binary Feedback would help to better understand the effectiveness of introducing Binary Feedback.

Q5 MC-Dropout Configuration: For MC-Dropout, do the authors apply only a single dropout iteration to calculate confidence and agreement loss? Could the number of dropout samples impact the performance of the method?

Q6 How are the baseline algorithms modified to incorporate Binary Feedback?

Q7 Are all the adopted models pre-trained with Dropout? If not, I am concerned that the confidence score C(x) from MC-Dropout predictions may be very low. Would this impact the performance of the proposed method?

Q8 Could the authors provide a runtime memory comparison?

Q9 Why does the proposed method outperform fully labeled TTA? A more in-depth explanation of this phenomenon would be valuable.

Claims and Evidence

Yes. The claims are well-supported.

Methods and Evaluation Criteria

Yes. The overall designs are reasonable.

Theoretical Claims

No theoretical claims were provided.

Experimental Design and Analysis

Please refer to my Questions section.

Supplementary Material

Yes, I have checked the supplementary material.

Relation to Existing Literature

The problem studied is a fundamental challenge in machine learning with good potential for practical applications.

Missing Essential References

Please refer to my Questions section.

Other Strengths and Weaknesses

++Pros:

The setting of binary feedback-guided test-time adaptation is novel and interesting. The devised overall learning framework is technically sound, simple, and effective.

--Cons:

Many implementation details of the proposed method and modifications to baseline algorithms are difficult to find in the manuscript. Could the authors clarify these further?

The computational complexity is higher than that of conventional TTA methods like TENT.

Other Comments or Suggestions

No minor suggestions.

Author Response

We sincerely thank reviewer wnVe for the thoughtful evaluation of our work and recognition of our novel problem setting and technically sound framework.

Clarification on Implementation Details. While we provided the source code and additional implementation details in Appendix D, we will review them again to include all details.

Computational Complexity, Q8: Runtime Memory Comparison. Thank you for the suggestions. During the rebuttal, we analyzed the memory consumption in Table R1. While MC-dropout increases computational load, the overhead remains manageable (as also acknowledged by Reviewer SHhn). Efficient techniques like MECTA (Hong et al., 2023) or gradient accumulation reduce runtime memory by up to 60% while maintaining accuracy.

Table R1. Average GPU memory consumption (MB) in CIFAR10-C. MECTA and gradient accumulation (GA) are applied to BiTTA.

Method | Avg. Memory (MB)
Src | 2081
Valid | 2696
BN-Stats | 3246
TENT* | 3239
EATA* | 3244
SAR* | 2966
CoTTA* | 3038
RoTTA* | 3229
SoTTA* | 2824
SimATTA* | 8304
BiTTA | 6724
BiTTA+MECTA | 2841
BiTTA+GA | –

Q1(1): Relation to Prior Work. Prior work penalizes disagreement samples to lower their predictive confidence; we discard these instead, given their inherently unstable yet moderate accuracies (Figure 4(b)).

Q1(2): Advantages of using RL-based training. In this rebuttal, we experimented with directly minimizing prediction disagreement via backpropagation, which dropped accuracy by 1.08%p. This highlights the benefit of our RL-based formulation, which naturally leverages prediction agreement as a reward. By doing so, the model adaptively reinforces learning from confident samples while effectively disregarding unstable samples with disagreement, achieving stable and effective adaptation.
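In generic REINFORCE terms, the agreement reward enters as in the standard score-function estimator below (our paraphrase with an indicator reward; the paper's exact reward and any baseline terms may differ):

```latex
\nabla_\theta\, \mathbb{E}_{\hat{y}\sim\pi_\theta(\cdot\mid x)}\bigl[r(\hat{y})\bigr]
 = \mathbb{E}_{\hat{y}\sim\pi_\theta(\cdot\mid x)}\bigl[r(\hat{y})\,
   \nabla_\theta \log \pi_\theta(\hat{y}\mid x)\bigr],
\qquad
r(\hat{y}) = \mathbb{1}\!\left[\hat{y} = \hat{y}_{\mathrm{det}}\right].
```

Under an indicator reward, disagreeing samples contribute no positive learning signal and are effectively ignored, rather than being explicitly penalized as in a disagreement-minimization loss.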

Q2: Ablation Study without Binary Feedback. We did investigate this in "Synergistic effect of adaptation strategies" (Section 4, line 391). Adaptation without binary feedback achieved lower accuracy (82.64%) compared to BiTTA (87.20%), highlighting the importance of binary feedback.

Q3: Combination with Existing TTA Losses. Thanks for the suggestion. During the rebuttal, we experimented by replacing the ABA loss with entropy minimization. Our RL framework (87.20%) outperformed the modified one (85.98%) by balancing the gradients in a unified framework.

Q4: Evaluation on Larger Datasets. We did conduct experiments on ImageNet-C in Appendix B (Table 6). BiTTA (36.59%) outperforms all baselines, including TENT (0.93%) and SimATTA (17.5%). Due to page limits, we initially placed them in the appendix but will move them to the main manuscript. Also, in this rebuttal, we ran additional baselines (e.g., DeYO), finding BiTTA consistently superior when introducing binary feedback (Rebuttal for SHhn: Additional results on DeYO, OWTTT, and TAST).

Q5: MC-Dropout Configuration. We used 4 dropout iterations (N=4) for policy estimation. During the rebuttal, we conducted an ablation study, which showed the robustness of BiTTA to the number of dropout iterations when N > 1 (Table R2).

Table R2. Average accuracy (%) comparisons with varying MC-dropout inferences (N) in CIFAR10-C.

N | 1 | 2 | 3 | 4 | 5 | 6
Avg. | 86.93 | 87.19 | 87.18 | 87.20 | 87.29 | 87.25

Q6: Modification to Baseline Algorithms. Baseline algorithms are modified to incorporate binary feedback by adding a cross-entropy loss on correct samples and complementary cross-entropy loss on incorrect samples. Details are in Appendix D\text{\color{blue}Appendix D}.
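A sketch of what this modification could look like (hypothetical helper; see Appendix D of the paper for the actual formulation):

```python
import torch
import torch.nn.functional as F

def binary_feedback_loss(logits: torch.Tensor,
                         pred: torch.Tensor,
                         correct: torch.Tensor) -> torch.Tensor:
    """CE on 'correct' samples, complementary CE on 'incorrect' ones.

    logits:  (B, C) outputs on the feedback samples
    pred:    (B,)   int64 class predicted when feedback was requested
    correct: (B,)   bool, True if the oracle marked the prediction correct
    """
    log_p = F.log_softmax(logits, dim=-1)
    p_pred = log_p.gather(1, pred[:, None]).squeeze(1).exp()
    ce = -torch.log(p_pred.clamp_min(1e-8))                 # pull toward pred
    comp_ce = -torch.log((1.0 - p_pred).clamp_min(1e-8))    # push away from pred
    return torch.where(correct, ce, comp_ce).mean()
```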

Q7: Dropout Pre-training. Models were not dropout pre-trained; we injected dropout at test time. During the rebuttal, we conducted a calibration analysis showing that MC-dropout has a lower calibration error than the original softmax (Rebuttal for SHhn: MC-dropout vs. softmax probabilities and uncertain sample selection). Also, during the rebuttal, we experimented with replacing MC-dropout with the softmax, which resulted in a 2.56%p drop, further demonstrating the importance of MC-dropout.

Q9: Comparison with Fully Labeled TTA. BiTTA outperforms fully labeled TTA (SimATTA) primarily because our approach effectively combines both binary feedback and unlabeled data, whereas SimATTA is highly dependent on source-like samples and prone to overfitting without unlabeled data.

We will incorporate this rebuttal in the final manuscript. We hope our rebuttal has addressed your concerns. Please let us know if you have any further questions.

Reviewer Comment

Thanks for the authors' response. My main concerns have been addressed, and I would like to keep my original score.

Final Decision

This work incorporates simple human feedback into test-time adaptation in the form of binary correct/incorrect supervision. This feedback addresses failures of purely unsupervised TTA when shifts are too extreme. While the proposed method requires more information, in the form of this correct/incorrect binary feedback, it does improve adaptation results, so there is sufficient return from this requirement. Experiments evaluate the proposed BiTTA on common benchmarks like CIFAR-10-C, CIFAR-100-C, and PACS, but ImageNet-C is notably missing from the main paper (with Tiny-ImageNet-C taking its place). Accuracy improves by 10+ points on CIFAR-100-C even in a comparison that augments baseline TTA methods to also incorporate feedback. Reviewers raised concerns about (1) the requirement of immediate feedback (which is indeed a practical issue for a test-time adaptation method) [QyDK], (2) insufficient implementation detail for the method and augmented baselines [wnVe], (3) missing baselines for recent methods [SHhn], and (4) concerns about lack of theory, scalability, and practicality given the rate of feedback [R67W]. However, the author rebuttal and additional results satisfied these concerns: SHhn commented and raised their score to 4, R67W made an official comment (not a rebuttal comment) to confirm maintaining 3, and all reviewers acknowledged the rebuttal. There is consensus for acceptance with scores of 4, 3, 3, 3, and the area chair agrees with acceptance.

The area chair acknowledges the confidential comment by the authors and confirms the decision incorporates all of the submission content, details, and threads.