Persistent Test-time Adaptation in Recurring Testing Scenarios
We conduct a theoretical analysis on a simple model, and introduce a benchmark and a baseline approach to address the gradual performance degradation of continual test-time adaptation methods.
Abstract
Reviews and Discussion
The paper proposes a novel method called Persistent Test-time Adaptation (PeTTA), aimed at addressing the gradual performance degradation of models used for long-term test-time adaptation (TTA). Traditional TTA methods adapt to continuously changing environments but fail to account for the cumulative errors and performance drops that occur when these environments recur over time. The study demonstrates, through the simulation of a simple Gaussian Mixture Model classifier, how TTA methods can gradually fail in such recurring environments. PeTTA balances adaptation against model collapse by monitoring the model's tendency to collapse and adjusting the adaptation strategy accordingly, significantly improving the model's stability and performance in long-term testing.
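For concreteness, here is a minimal sketch of the recurring TTA protocol described above (an illustration, not the authors' released code; `load_domain` and the domain names are hypothetical):

```python
from typing import Callable, Iterable, Iterator, Tuple

def recurring_stream(domains: Iterable[str],
                     load_domain: Callable[[str], Iterable],
                     recurrences: int = 20) -> Iterator[Tuple[int, str, object]]:
    """Replay a fixed sequence of corruption domains `recurrences` times.

    `load_domain` is a hypothetical loader yielding test batches for one
    corruption type (e.g., 'gaussian_noise' batches from CIFAR-10-C).
    """
    names = list(domains)
    for visit in range(recurrences):
        for name in names:
            for batch in load_domain(name):
                yield visit, name, batch

# The adapting model is never reset between visits, so any error the
# self-training objective accumulates in visit k carries over to visit k+1:
#   for visit, name, batch in recurring_stream(CORRUPTIONS, load_domain):
#       model.adapt(batch)
```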
Strengths
(1) The study introduces a new test scenario—recurring TTA—which realistically simulates conditions that might be encountered in real-world applications. This setup is more practical than traditional static test scenarios and helps reveal potential long-term issues with TTA methods. (2) By using simulations with Gaussian Mixture Model classifiers, the study thoroughly analyzes the reasons for performance degradation in TTA methods and proposes a theoretically supported solution. PeTTA's stable performance across multiple benchmarks validates its effectiveness.
Weaknesses
(1) Although the recurring test scenario has theoretical significance, in practical applications, environmental changes involve more than just lighting conditions; factors such as shooting angles and weather conditions also impact the captured data. Therefore, a simple Gaussian Mixture Model classifier may not fully simulate the variations in real complex scenarios. (2) While virtual data experiments provide theoretical support, whether their results fully apply to real scenarios requires further validation. Using real images for experiments in theoretical analysis would better validate the practical applicability of the PeTTA method.
Questions
Please see the weaknesses.
Limitations
Yes.
- We agree with the reviewer that real-world scenarios are significantly more complicated. Despite its simplicity, our GMMC (Gaussian Mixture Model classifier) analysis can empirically demonstrate the behavior of a collapsing TTA model on the real-world CIFAR-10-C dataset (see the similarity between Fig. 3(a) and Fig. 4(a)). To the best of our knowledge, this is the first attempt to establish a theoretical foundation for studying the collapse of TTA in the simplest case, promoting future research on the theoretical aspects of collapsing TTA models.
- Yes, a theoretical study on real images would certainly validate practical applicability. However, the challenge of modeling real images lies in the difficulty of theoretically analyzing the effect of added noise as it propagates through a highly complex machine-learning model. The idea behind the GMMC is to simplify this complex process so that the update rule at each step is rigorously defined. Nevertheless, the insight gained from this simple study facilitated the development of PeTTA, which has demonstrated its ability on several real-world continual TTA benchmarks. A toy sketch of this style of simulation follows.
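As a hedged illustration of this kind of simulation (a toy construction with assumed parameters, not the paper's exact analysis), consider a two-component Gaussian classifier self-trained on a temporally correlated stream; once one class prior shrinks, that class stops receiving pseudo-labels and the bias reinforces itself:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([-1.0, 1.0])       # per-class means, initialized correctly
prior = np.array([0.5, 0.5])     # running class-prior estimate
sigma, a = 1.0, 0.02             # noise level and EMA rate (illustrative)

for t in range(30_000):
    y = (t // 500) % 2                          # long single-class bursts (non-iid)
    x = rng.normal((-1.0, 1.0)[y], sigma)
    score = np.log(prior + 1e-12) - (x - mu) ** 2 / (2 * sigma ** 2)
    k = int(np.argmax(score))                   # pseudo-label via MAP rule
    mu[k] = (1 - a) * mu[k] + a * x             # self-training mean update
    prior = (1 - a) * prior + a * np.eye(2)[k]  # self-training prior update

print(prior)  # with these settings the priors drift far from [0.5, 0.5]:
              # one component absorbs nearly all pseudo-labels -- a collapse
```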
Thank you for your response to the weaknesses I pointed out. However, I believe the rebuttal doesn't fully address the concerns regarding the practical applicability of the proposed method in real-world scenarios.
While the theoretical foundation provided by the model is appreciated, the core of my concern lies in the practical validation and applicability of your method under more complex and varied real-world conditions. Simply demonstrating empirical behavior on a dataset like CIFAR-10-C, while useful, doesn't fully capture the diversity and complexity of real-world environmental changes, such as varying shooting angles, weather conditions, and other factors beyond lighting.
- We thank the reviewer for the comment on our rebuttal. In this study, as in the broader continual TTA literature, evaluation on CIFAR-10/100-C or ImageNet-C is the standard benchmarking approach.
- These datasets are designed to simulate up to 15 conditions that typically appear in the real world (e.g., snow, fog, and brightness reflecting weather; motion blur and zoom blur due to camera/hand motion; JPEG compression and image noise due to capturing conditions or image sensor quality), not just the lighting factors you mentioned. Please see [1] for more information. We evaluated our method at the most severe corruption level.
- Besides image corruptions, we also evaluated on the DomainNet126 dataset with 4 domains: clipart, painting, real, and sketch.
- Again, the key focus of this paper is NOT evaluating the realism of the evaluation protocol or closing the gap between laboratory experiments and real-world deployment. Rather, we point out the risk of model collapse, even in the simplest setting adopted by all previous works, with a small extension (a longer time horizon) of the current continual TTA evaluation protocol.
- We would be glad to extend our experiments following your suggestion, but we are currently not aware of any publicly available dataset matching the criteria you mentioned, especially given the time constraints of the rebuttal period. We believe the current evaluations are sufficient to convey our key message. We agree that a more realistic evaluation is necessary, and we will keep exploring this in the future.
In summary, we acknowledge that a gap does exist between real-world deployment and evaluation on synthetic data, which is common practice in the test-time adaptation community. However, we have provided follow-up comments to justify that this setup is sufficient to convey the core message of this study. We hope this perspective will be considered in the ongoing discussion and evaluation of our work.
[1] Hendrycks et al., Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. ICLR'19.
This paper investigates the risks behind long-term test-time adaptation. To this end, the authors simulate a long-term test data stream, called Recurring Test-Time Adaptation, by repeating a single-period continual TTA setting 20 times, and propose Persistent TTA (PeTTA). Within the proposed algorithm, the authors compute the Mahalanobis distance between the feature distributions at the current time slot and before test-time adaptation to weight the regularization term and the model update momentum. Moreover, the authors leverage an Anchor Loss in the probability space to further avoid over-adaptation. To validate the proposed method, the authors conduct experiments on several TTA benchmark datasets, e.g., CIFAR10-C and ImageNet-C. The results demonstrate the effectiveness of the proposed method, which provides a stable adaptation process in long-term test-time adaptation scenarios.
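A hedged sketch of that divergence-sensing step (a reading of the summary above; the exact scaling rules and the names `lambda0`/`alpha0` are assumptions, not the paper's equations):

```python
import torch

def sense_and_scale(feats: torch.Tensor,
                    mu_run: torch.Tensor,
                    mu_src: torch.Tensor,
                    cov_src_inv: torch.Tensor,
                    lambda0: float = 1.0,
                    alpha0: float = 1e-3,
                    momentum: float = 0.05):
    """Update the running test-feature mean, sense divergence from the
    source statistics, and scale the adaptation coefficients accordingly."""
    mu_run = (1 - momentum) * mu_run + momentum * feats.mean(dim=0)
    d = mu_run - mu_src
    maha = torch.sqrt(d @ cov_src_inv @ d)  # Mahalanobis-style divergence score
    gamma = torch.exp(-maha).item()         # in (0, 1]; shrinks as features drift
    lambda_t = lambda0 / max(gamma, 1e-6)   # pull harder toward the source model
    alpha_t = alpha0 * gamma                # and update more cautiously when drifting
    return mu_run, lambda_t, alpha_t
```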
Strengths
- This paper investigates a realistic and valuable TTA setting, in which the test data stream is long enough that most existing TTA methods based on self-training/self-supervised objectives produce poor results.
- This paper is well written and interesting.
- The authors attempted to theorize the causes of long-term TTA failures.
Weaknesses
- The novelty is incremental. Methodologically, the Anchor Loss proposed in this paper is similar to that used in [A] (compare Eq. 8 in the manuscript with Eq. 5 in [A]), only replacing the L2 distance between two probabilities with cross-entropy. In the ablation study, I observe that most of the improvement on the CIFAR100, DN, and IN-C datasets is provided by this Anchor Loss module. Based on this, I suggest the authors provide experimental comparisons between these two TTA methods and more discussion of the differences between them.
- The danger of self-training-based TTA is well known [B], although there is no denying that the theoretical analysis in the manuscript may be correct. Many TTA methods have been proposed to alleviate this issue; for example, the Balanced BatchNorm proposed in [A] alleviates the impact of biased classes on batch normalization, and the class diversity weighting used in [C] avoids the accumulation of prediction bias during test time. Unfortunately, these are neither discussed nor compared in the experiments of the manuscript.
[A] Towards Real-World Test-Time Adaptation: Tri-Net Self-Training with Balanced Normalization, AAAI 2024.
[B] On Pitfalls of Test-time Adaptation, ICLR 2023.
[C] Universal Test-time Adaptation through Weight Ensembling, Diversity Weighting, and Prior Correction, WACV 2024.
Questions
See the weaknesses section above. In addition, how do feature-alignment-based TTA methods, e.g., [D], perform in the recurring TTA scenario?
[D] Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering Regularized Self-Training. TPAMI 2024.
Limitations
The technical novelty is limited, and there is a lack of discussion of and comparison with some competing methods that should be compared.
Comments on the Weaknesses Part:
- We respectfully disagree with the comment about novelty, since this study is not only about the anchor loss. Indeed, we acknowledged that the anchor loss is not a new idea or our novel contribution on line 751 (Appendix E5), and we will include a proper citation to the anchor network in [A] for completeness. The anchor loss alone is insufficient to avoid model collapse, as noted in our response to Reviewer rAs4. Rather, the main contributions of this paper are the theoretical analysis and the sensing of divergence for an adaptive model update in PeTTA. We also acknowledge that model collapse has been observed in previous studies as an inspiration for improving performance, but to the best of our knowledge, there has been no theoretical investigation of this phenomenon. The suggestion to compare PeTTA and [A] is interesting; we conducted additional experiments and provide the discussion in the following comment.
- We sincerely appreciate the reviewer's suggestions of recent methods such as ROID (WACV'24) [C] and TRIBE (AAAI'24) [A]. We have benchmarked their performance in Tables III-V. Even though TRIBE, a SOTA model, provides stronger adaptability, outperforming PeTTA and the baseline RoTTA in the first several recurrences, the risk of model collapse is still present in TRIBE when the observation period is increased, as demonstrated on CIFAR-10-C (Fig. I(b)). This result underscores the importance of our proposed recurring TTA setting for extended testing-stream evaluation. We also experimented with the class diversity weighting used in ROID [C]; unfortunately, it cannot handle the temporally correlated testing stream in the recurring/practical TTA setting as PeTTA does, and it tends to collapse at the beginning. This is consistent with the finding of ROID's authors in Tab. 4 (ablation study) of [C]. ROID with class diversity weighting cannot handle the practical TTA scenario and falls behind PeTTA on all benchmarks.
Since PeTTA uses RoTTA as its baseline approach, TRIBE, a stronger baseline, would be interesting to build on: combining the strong adaptability of TRIBE with the collapse prevention of PeTTA could yield an even stronger method in future work. Overall, the further evaluations of PeTTA against the most recent approaches in this rebuttal highlight (1) the value of recurring TTA in spotting the lifelong performance degradation of continual TTA and (2) the novel design of the model-divergence sensing and adaptive update in PeTTA, inspired by a theoretical analysis, to address the collapse phenomenon of TTA models. The revised paper will include a discussion of these methods and an experimental comparison with them, as outlined in the rebuttal PDF.
Comments on the Questions Part:
We appreciate the reviewer's suggestion regarding the feature-alignment-based TTA in [D]. While it shares several design similarities, that work does not directly study the performance degradation of continual TTA methods as PeTTA does. Nevertheless, evaluating this approach under our recurring TTA setting is interesting and straightforward given the simplicity of our setting. We will mention [D] in the revised paper and leave the evaluation for future work, given the limited length of the rebuttal period and the fact that [D] only officially appeared in the August 2024 issue of TPAMI.
Thanks for the additional TRIBE and ROID experiments, especially the TRIBE experiments, which allayed my concerns about the anchor loss module. The discussion of similar existing work is necessary, and I suggest that the authors include these discussions and experiments in their camera-ready revision. At this point, all my concerns have been addressed, and overall this is a good paper exploring the robustness of long-term TTA methods. I would like to raise my score to Weak Accept to suggest acceptance.
We thank reviewer MNYK for the positive feedback on our rebuttal. We are excited that your concerns regarding the anchor loss, the novelty of PeTTA, and the additional comparisons with TRIBE and ROID have been successfully addressed. Notably, the discussion here emphasizes the risk of model collapse, even with the latest state-of-the-art continual TTA methods, and highlights the significance of PeTTA. We will incorporate these points into our revised paper.
The paper provides theoretical and empirical analyses of error accumulation and model collapse in continual TTA scenarios. From the analyses, the authors identify the risk of keeping key hyperparameters constant (as in RoTTA) and of periodically resetting model parameters (as in RDumb). They propose Persistent Test-time Adaptation (PeTTA), with a regularization coefficient and update rate that adapt to the sensed distribution shift, together with an anchor loss. The proposed method shows lower and more stable error over the continuous distribution shifts.
Strengths
- The paper considers a practical scenario of TTA, continuous TTA.
- The theoretical analysis explains an interesting outcome of error accumulation in TTA, called model collapse: a collapsed model is prone to misclassifying samples from many classes into a few classes. This is also verified in an experiment.
- Based on the findings on model collapse, the authors propose a mechanism for detecting model divergence and adaptively selecting the regularization coefficient and update rate, which is well justified and verified.
- The authors provide an extensive set of experiments demonstrating their findings and the superiority of the proposed method compared to the state of the art: RoTTA and RDumb.
Weaknesses
- The paper focuses solely on recurring TTA scenarios, potentially overlooking other types of domain shifts such as non-cyclic domain shifts and label distribution shifts. The proposed method appears too specific to recurring TTA scenarios and may be prone to failures, particularly under label distribution shifts.
- The sensitivity of hyperparameter choices, particularly the initial learning rate and the regularization coefficient, is not studied. Additionally, there is insufficient justification for the hyperparameter choices of the other algorithms. This raises concerns about the empirical study's claim of the proposed method's superiority.
- The proposed method requires access to the source dataset, as indicated in Equation 6.
Questions
The following questions include the concerns raised in the weaknesses.
- Can PeTTA work under scenarios of (i) non-cyclic domain shift and (ii) label distribution shift?
- Can you provide further results on a time horizon longer than 20 recurrences? I want to check whether the error accumulation is fully resolved by PeTTA.
- Is the performance of PeTTA sensitive to the choice of these two hyperparameters? If so, how can we select them in practice?
- How did you choose the hyperparameters of PeTTA for the experimental results? Were the hyperparameters of the other algorithms tuned in the same way as PeTTA's?
- Why is there no balancing hyperparameter in front of the anchor loss?
- Is PeTTA runnable without access to the source dataset?
Limitations
- In the current manuscript, the limitations of this work are not described in detail. It would be great if you could provide a more specific description of directions for further improvement.
Comments on the Weaknesses Part:
- We would like to emphasize that recurring TTA serves as a diagnostic tool for catching the lifelong performance degradation of continual TTA; even in this simplest case, several SOTA continual TTA algorithms fail to preserve their performance. This raises awareness in the community when evaluating such methods (see Appendix D.2 for further discussion). Extending our recurring TTA to include more challenging scenarios and complex types of shift is necessary, but should be addressed in future work.
- We have provided additional justifications and discussions on the choices of hyper-parameters for consideration. See the comments below.
- See the comments below.
Comments on the Questions Part:
- Yes, (i) our method can handle non-cyclic domain shift. PeTTA does not make any assumption about, or exploit, this property of the testing stream. As an example, it achieves good performance on CCC [1], where the corruptions are algorithmically generated and non-cyclic, and two or more corruption types can occur simultaneously (see Appendix F.4). Regarding (ii), in all experiments the label distribution is temporally correlated (non-iid), following [2, 3]. The class distribution within each data batch can be highly imbalanced, with some classes dominating the others. The robustness of PeTTA to label distribution shifts is demonstrated to this extent.
- Absolutely: PeTTA is evaluated with 40 recurrences in Tab. II and Fig. I(a), and the experimental results confirm its persistence. Additionally, the performance of PeTTA over an extended time horizon is presented in Table 13 (Appendix F4). In this case, the model adapts to over 5.1 million images, significantly more than in the default 20 recurrences.
- In PeTTA, the first hyperparameter is the initial learning rate for adaptation. We do not tune it, and its choice is universal across all datasets, following previous works/compared methods (e.g., RoTTA, CoTTA).
Since the second hyperparameter, the initial regularization coefficient, is more specific to PeTTA, we included a sensitivity analysis with different choices on CIFAR-10/100-C and ImageNet-C in Table VI. Overall, the choice is not extremely sensitive: while one value is best on most datasets, neighboring choices produce roughly similar performance. Selecting it is intuitive: a larger value prevents the model from collapsing more strongly, but also limits its adaptability as a trade-off.
In operation, this is only an initial value that is adaptively scaled by the model-divergence sensing mechanism in PeTTA, so it does not require careful tuning. More generally, this hyper-parameter can be tuned like those of other TTA approaches, via an additional validation set, or with an accuracy prediction algorithm [4] when labeled data is not available.
- Except for the recurring testing condition, each recurrence follows the standard continual TTA protocol established in previous studies. Hence, for all compared methods, we use the best hyperparameters provided by their authors. The performance after the first visit was manually verified to ensure reproducibility of the original work. Notably, the primary aim of this work is to determine how long these approaches can sustain their initial performance.
- Since the purpose of the anchor loss is to guide the adaptation under drastic domain shift (lines 684-691, Appendix E.2), we empirically found that it is unnecessary to introduce an additional hyper-parameter; instead, we let the adaptive regularization term take the leading role in collapse prevention.
- No: PeTTA requires sampling from the source dataset. Nevertheless, Appendix E.4 demonstrates that only a small number of samples is needed to reliably estimate the empirical mean and covariance matrix, as in the sketch below. PeTTA relies solely on the typical assumptions used by the methods it is compared against (e.g., EATA, RMT). Please see Appendix E.4 for our discussion of the feasibility of accessing the source dataset.
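A minimal sketch of this pre-deployment step (an assumption of the procedure, not the paper's exact code; `feature_extractor` and `source_loader` are hypothetical):

```python
import torch

@torch.no_grad()
def source_stats(feature_extractor, source_loader, max_samples: int = 2000):
    """Estimate the source feature mean and covariance from a small subset,
    once, before test-time adaptation starts."""
    feats, n = [], 0
    for x, _ in source_loader:
        feats.append(feature_extractor(x))
        n += feats[-1].shape[0]
        if n >= max_samples:
            break
    f = torch.cat(feats)[:max_samples]
    mu = f.mean(dim=0)
    centered = f - mu
    cov = centered.T @ centered / (f.shape[0] - 1)  # unbiased estimate
    return mu, cov
```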
[1] Robust Test-Time Adaptation in Dynamic Scenarios, CVPR'23.
[2] RDumb: A simple approach that questions our progress in continual test-time adaptation, NeurIPS'23.
[3] Gong et al., NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation, NeurIPS'22.
[4] Lee et al., AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation, CVPR’24.
Comments on the Limitations Part: We thank the reviewer for the suggestion. Appendix E elaborates in detail on some of the limitations mentioned in Section 6. Directions for further improvement will be included in the revised paper.
Thanks for the detailed responses and additional experiments, which address most of my major concerns. I would raise my score (5->6).
We appreciate Reviewer oiWU for the feedback on our rebuttal. It's great to know that most of your major concerns about PeTTA's evaluation on extended testing scenarios beyond our proposed recurring TTA, as well as hyper-parameter selection, have been addressed.
Dear reviewer oiWU,
The authors have posted their rebuttal. Could you please aim to reply to the authors by August 12th, end of day, at the latest, in case there are any further comments the authors would like to add/clarify?
Thanks again for your efforts in reviewing!
-AC
The authors propose a practical TTA scenario called recurring TTA and, within this scenario, present persistent TTA (PeTTA), a method that achieves the best performance as measured on various benchmarks.
Strengths
- The proposed recurring TTA scenario reflects the challenging and practical situation well.
- PeTTA is a simple yet effective solution. Especially, PeTTA excels in collapse prevention.
Weaknesses
- I think Corollary 1 is not a rigorous condition for model collapse. While it guarantees that the distance decreases, the distance could still converge to a value other than the collapsed state, because the contraction is not always strict. Simply saying 'the model collapse happens when this condition holds for a sufficiently long period' is not proper.
- I think the role of the anchor loss is similar to that of the regularization term, yet the anchor loss is not modulated by the adaptive coefficient. However, in Table 3, adding the anchor loss leads to significant performance improvement on several benchmarks. The authors should explain the difference between the anchor loss and the regularization term, and why they chose the current design.
Questions
- The authors said EATA is the baseline. However, I couldn't find any results for EATA.
- It seems that the images on the left and right in Figure 4(c) are switched.
- How about applying only the anchor loss, without the regularization term, in Table 3?
Limitations
Yes
Comments on the Weaknesses Part:
- In Lemma 1, we mathematically showed that under Assumption 1 the distance is non-increasing. Furthermore, the convergence of this distance and the collapsing behavior, with parameters selected following Corollary 1, are both empirically validated through a numerical simulation in Sec. 5.1. Since the rate of collapse depends on various factors, both data-dependent and algorithm-dependent, Corollary 1 holds as a generic condition for model collapse. Nevertheless, we will further explore specific conditions/settings that can make this statement more rigorous in future work.
- The motivation and reasoning behind our design choices for the anchor loss and the regularization term are detailed in lines 684-691 of Appendix E.2 (a hedged sketch of such an anchor loss appears below). While the anchor loss is beneficial on many benchmarks, it is not sufficient on its own to achieve PeTTA's performance, as shown in Table I of the rebuttal PDF. In response to reviewer MNYK, we expanded our evaluation to include TRIBE (AAAI'24), a more recent robust TTA algorithm that also utilizes a concept similar to the anchor loss. Despite demonstrating better adaptability, that method is still prone to model collapse over a longer time horizon, necessitating strategies beyond a simple anchor loss.
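For reference, a minimal sketch of an anchor loss of the kind discussed in this thread (the exact form is an assumption: cross-entropy against a frozen anchor model's predictions, per the reviewer's description above, with no extra balancing weight):

```python
import torch

def anchor_loss(student_logits: torch.Tensor,
                anchor_logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the frozen anchor model's predictive
    distribution and the adapting model's prediction."""
    anchor_prob = anchor_logits.detach().softmax(dim=1)  # anchor is not updated
    return -(anchor_prob * student_logits.log_softmax(dim=1)).sum(dim=1).mean()
```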
Comments on the Questions Part:
- All tables show the performance of MECTA with the EATA backbone (line 242). MECTA is an advanced version of EATA, with all components preserved and the batch-normalization blocks replaced with MECTA blocks, showing higher efficiency. For completeness, experiments with a standalone EATA adapter are also included in Tables III-V of the rebuttal PDF. In short, EATA still suffers from performance degradation in the recurring TTA setting, just like MECTA and the other methods.
- Yes, thank you for catching that. The figure will be updated in the revised paper.
- In Table I, we provide an additional ablation study in which a baseline model is trained with and without the anchor loss (no regularization). The results show that, while it has initial benefits on some benchmarks, trivially applying the anchor loss alone is incapable of eliminating the lifelong performance degradation in continual TTA.
Thanks for the response. My concerns are addressed and I would raise my score.
We appreciate reviewer rAs4 for the feedback. We're pleased that your concerns about our theoretical analysis (Corollary 1) and the role of the anchor loss have been resolved after the rebuttal.
We thank the reviewers for their insightful comments and valuable feedback on our work. During the rebuttal period, we have conducted extensive additional experiments: benchmarking the performance of EATA and the most recent continual TTA methods (ROID (WACV'24) and TRIBE (AAAI'24)). The persistence of PeTTA is further justified over a recurring TTA with 40 visits (twice as long as previous experiments). Furthermore, we provide a discussion of the role of the anchor loss and a sensitivity analysis on the choice of PeTTA's regularization hyper-parameter. Lastly, we offer a point-by-point response to each reviewer's comments below.
The authors present a method for avoiding model collapse in continuous test-time adaptation. The approach is validated on a variety of small and larger-scale datasets from CIFAR to ImageNet scale, and backed up by theoretical investigation. The reviewers found the approach "simple yet effective", and positively note the clarity and relevance of the paper. Following the author rebuttal, all reviewers suggest at least borderline acceptance.
I agree with the reviewers and recommend acceptance, on the condition of including the additional results provided during the rebuttal into the manuscript. I would also recommend the authors to revisit whether some of the appendix tables containing results on larger scale datasets (like CCC), which stress the practical applicability of the method on real-world datasets, could be moved into the main paper. Likewise, the authors should consider how to integrate the most important results from their rebuttal document into the paper, vs. defaulting to moving all new results to the supplement. Again, I would aim to stress in particular larger scale datasets in the main paper.