CONTRAST: Continual Multi-source Adaptation to Dynamic Distributions
We propose the first work to consider dynamically evolving multi-source adaptation at test time.
Abstract
Reviews and Discussion
This paper introduces a novel continual multi-source adaptation method to tackle a new Test-Time Adaptation (TTA) task involving dynamic distributions. The method integrates multiple source models to adapt continuously to the evolving test data distribution. It efficiently computes the optimal combination weights for merging the source models and identifies which source model parameters require updating. Additionally, the authors present a thorough theoretical analysis of the optimization convergence and test risk bound, supported by extensive experiments demonstrating the effectiveness of the proposed method.
Strengths
- The authors propose a novel continual multi-source adaptation method to address the TTA task with dynamic distributions, which is a challenging and practical problem.
- The theoretical analysis of the optimization convergence and test risk bound is thorough and well-supported.
Weaknesses
- The proposed method is not adequately supported by the experimental results. The authors claim that the method can adapt to dynamic test data distributions, e.g., sunshine interspersed with rain, modeled as a linear combination of source distributions. However, the authors did not construct a corresponding test dataset reflecting these conditions, but instead ran the experiments in a continual manner.
- As a plug-in method, the proposed approach lacks experiments based on recent TTA methods. The experiments only included three methods: Tent, CoTTA, and EATA. The authors should incorporate more recent TTA methods designed for realistic scenarios, such as RoTTA [1], TRIBE [2], and ROID [3].
- [1] Longhui Yuan, Binhui Xie, and Shuang Li. Robust test-time adaptation in dynamic scenarios. In CVPR, 2023.
- [2] Yongyi Su, Xun Xu, and Kui Jia. Towards real-world test-time adaptation: Tri-net self-training with balanced normalization. In AAAI, 2024.
- [3] Robert A. Marsden, Mario Döbler, and Bin Yang. Universal test-time adaptation through weight ensembling, diversity weighting, and prior correction. In WACV, 2024.
Questions
I have some questions and suggestions for the authors:
- I suggest that the authors provide a toy experiment by constructing a dynamic test data distribution and conducting experiments using ground truth weights to validate the effectiveness of the proposed method.
- According to the current test settings, the authors train four source models (vanilla, fog, snow, and frost) and test them in a continual manner. For the results in the fog, snow, and frost domains, it is expected that directly using the corresponding source model would yield the best results, i.e., X-Best would be the top performer. However, this is not the case in the experimental results. Could the authors provide an explanation?
Limitations
Yes
W1: To demonstrate the linear combination of source distributions, we devise the following experiment. We linearly blend the same images from the test sets of the Snow and Fog domains of CIFAR100-C using two different sets of weights. We then use CONTRAST to predict on the blended test set and average the combination weights obtained over all test batches. The results are presented below:
Table: Images from the Snow and Fog domains are blended using the ground truth (GT) weights. The CONTRAST row displays the combination weights predicted by our method, which closely align with the GT weights. We also highlight the best single-source model along with its accuracy. Additionally, we report the multi-source accuracy obtained using our method, which significantly outperforms the best single-source results.
| | Snow | Fog | Best Model | Single-source Model Acc. | Multi-source Acc. |
|---|---|---|---|---|---|
| GT | 0.9 | 0.1 | Snow | 68.5 | - |
| CONTRAST | 0.88 | 0.12 | - | - | 70.9 |
| GT | 0.1 | 0.9 | Fog | 70.3 | - |
| CONTRAST | 0.13 | 0.87 | - | - | 72.0 |
It can be observed that the predicted combination weights across the test set are very close to the ground truth values. Additionally, the multi-source accuracy achieved is superior to the best single-source adaptation performance.
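For concreteness, here is a minimal sketch of the blending protocol (tensor and variable names are illustrative, not from our released code):

```python
import torch

# Ground-truth mixing weights for the two corruption domains.
w_snow, w_fog = 0.9, 0.1

def blend(img_snow: torch.Tensor, img_fog: torch.Tensor) -> torch.Tensor:
    """Pixel-wise linear combination of the same image under two corruptions."""
    return w_snow * img_snow + w_fog * img_fog

# For each test batch, CONTRAST outputs combination weights over the source
# models; averaging these over all batches yields the CONTRAST rows in the
# table, which should sit close to the ground-truth (w_snow, w_fog).
```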
W2: Our main goal in this paper is to demonstrate how to combine multiple source models during test time using any general single-source method as our update strategy. Since our method is general enough to integrate any of the single-source methods, we can also incorporate the recent methods mentioned by the reviewer into our framework. Using RoTTA as the update method in the CIFAR100→CIFAR100-C experiment, with the same experimental setup given in the paper, we get around a 2% reduction in error rate over the best single-source baseline. We will include results with the other methods in the camera-ready version.
Q1: Please refer to W1.
Q2: This is a very reasonable question and we provide an explanation for the results. The goal of any multi-source method is to ensure that the performance is always equal to or better than X-Best [18]. For example, fog domain data might be most highly correlated with the fog model, but it still has some correlation, however small, with other models as well. In such a scenario, the other models also contribute positively to the performance on the target data.
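Schematically, the multi-source prediction can be viewed as a convex combination of the source models' outputs (a sketch of the idea, not our exact implementation; the weights are learned per test batch):

```python
import torch
import torch.nn as nn

def combined_logits(models: list[nn.Module], weights: torch.Tensor,
                    x: torch.Tensor) -> torch.Tensor:
    # Convex combination of source-model outputs: the weights sum to one,
    # so a source with a small but nonzero weight still contributes, which
    # is how the combined model can outperform X-Best.
    return sum(w * m(x) for w, m in zip(weights, models))
```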
Thank you for your detailed response. I appreciate the effort you’ve made to address my concerns. I will increase my score accordingly.
Thank you for your support of our work.
This paper introduces a new task: continual multi-source test-time adaptation to dynamically evolving distributions. A framework is proposed consisting of two key steps: (1) learning the combination weights and (2) identifying the most correlated source model to update. To speed up the optimization, the paper further designs a good initialization strategy and selects the optimal step size. Extensive experiments are conducted to validate the effectiveness of the proposed model.
Strengths
- The writing quality is good. The paper is well-structured. Theoretical insights are also provided.
- SOTA performance. The proposed framework can be integrated with various single-source TTA models and improves their performance.
- Ablations. Ablation experiments are provided to evaluate the impact of the proposed modules.
Weaknesses
- Some parts of the writing are unclear. a) In Line 182, clarify what the t-th test batch refers to and how it is calculated. Why not compute the distance for each source model independently? b) In Line 270, unclear descriptions cause confusion for the reader. c) Algorithm 1 in the Appendix contains some typos.
- The proposed framework largely follows strategies from existing multi-source domain adaptation methods. For example, the approach to learning combination weights and the use of weighted pseudo-labeling strategies are similar to those outlined in [18].
- The differences between the challenges of the new task and those of multi-source DA and CTTA are not clearly described. The proposed task appears similar to methods addressing multiple source model adaptation without accessing the source data. The CTTA setting typically deals with catastrophic forgetting in single-source adaptation. However, multi-source adaptation inherently mitigates forgetting, as noted in Line 110. It remains unclear how the challenges of the multi-source CTTA task differ from those in multi-source DA and CTTA.
Questions
Please refer to the weaknesses above.
Limitations
The limitation is included in the paper.
W1: (a) There are two indices, i and t, where i is the index of the source model and t is the index of the test batch. The t-th test batch refers to the batch of data streamed at time step t. Thus, the distance indexed by (i, t) represents the distance of the t-th test batch from the i-th source model, and these distances are calculated independently for each source model. We will clarify this further in the camera-ready version. (b) Here, we wanted to explain that we update each model with a single-source TTA method ('X') for a test batch. The model that achieves the best performance on this batch is referred to as 'X-Best,' and the model that performs the worst is referred to as 'X-Worst.' We will improve this explanation to make it more reader-friendly. (c) Thanks for pointing these out. We will fix the typos.
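To illustrate point (a), here is a minimal sketch of the independent per-source distance computation, assuming a mean-feature distance purely for illustration (the actual distance in the paper may be defined differently, e.g., on normalization statistics):

```python
import torch

def batch_to_source_distances(batch_feats: torch.Tensor,
                              source_stats: list[torch.Tensor]) -> torch.Tensor:
    """Distance of the t-th test batch from each source model i, computed
    independently per source. `source_stats[i]` is a hypothetical stored
    feature statistic (e.g., a mean) for the i-th source model."""
    mu_t = batch_feats.mean(dim=0)
    return torch.stack([torch.norm(mu_t - mu_i) for mu_i in source_stats])
```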
W2: There is a significant difference between the method in [18] and CONTRAST. The method in [18] operates in a setting where all of the target data is available during the adaptation phase and does not depend on the combination weight initialization. In contrast, our method operates under a test-time scenario where data streams in a batch-by-batch manner, requiring us to learn combination weights using only the few samples in the current batch. Our contribution lies in properly determining these combination weights through an optimization framework, with guarantees on its convergence.
W3: Multi-source domain adaptation (DA) methods operate with multiple sources and require all target data beforehand during the adaptation phase. In contrast, CTTA (Continual Test-Time Adaptation) methods work with a single source and are designed for online streaming data. These methods fall into two categories: (i) those that mitigate forgetting and (ii) those that do not. Our setup combines aspects of both approaches, offering advantages from both worlds: (a) it performs better than the best source model, similar to multi-source DA methods, and (b) it operates effectively in streaming data scenarios. Additionally, our method can be forget-free if we use class (i) CTTA methods for model updates. Even if we use class (ii) methods, which are computationally lightweight, our approach significantly slows down the forgetting process over the long run (Fig. 3). We will clarify these points further in the camera-ready version.
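A minimal sketch of the selective update described above (`tta_step` is an illustrative stand-in for any single-source TTA update such as Tent, CoTTA, or EATA):

```python
import torch

def update_most_correlated(models, weights: torch.Tensor, batch, tta_step):
    # Update only the source model most correlated with the current test
    # batch, i.e., the one with the largest combination weight; leaving the
    # remaining models untouched is what slows forgetting in the long run.
    k = int(torch.argmax(weights))
    tta_step(models[k], batch)
    return models
```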
Thanks for your response. I maintain the positive score.
Thank you for your support of our work. If there are any further questions, please let us know.
The work introduces a new framework called CONTRAST, designed for dynamically combining multiple pre-trained source models during testing to adapt to changing target data distributions. For each test batch, CONTRAST learns the optimal combination weights of the source models, ensuring that the test error of the combined model does not exceed that of the best single source model. CONTRAST updates the parameters of the source models most relevant to the current test data, avoiding the forgetting of irrelevant model parameters, thereby maintaining good performance in long-term dynamic distribution adaptation. Theoretical analysis shows that CONTRAST optimizes the combination weights by balancing the distribution shift of the source models and the quality of pseudo-labels, minimizing the risk on the target distribution. Experimental results demonstrate that CONTRAST outperforms single-source model adaptation methods in the dynamic distribution setup.
Strengths
- The method of the work is clear, intuitive, and easy to implement.
- The writing is clear, making it easy to understand and read.
Weaknesses
- Using multiple source models and performing ensembling is a feasible and reasonable way to improve test-time adaptation. However, since previous techniques only utilized a single source model, this ensemble approach introduces an unfair comparison in the experiments.
- In the current task setting, dynamic test-time adaptation requires sequential model updates. However, this may not be a reasonable design or baseline. After the first update, we have two models: the original model and the model updated on Task 1. In practical scenarios, when a new task arrives, we would not simply update the model that was updated on Task 1. Instead, we would at least try to update both the Task 1 model and the original model and then ensemble them, or even discard the Task 1 model and start updating from the original model again.
Questions
- It is recommended that the authors add two baselines: one using the original TTA method with multiple source models, which can be achieved by ensembling multiple TTA-updated models; the other using the TTA method but updating independently from the original model for each task.
- It is suggested that the authors discuss the computational complexity of the learning rate selection method. The experiments in the appendix indicate that the effective learning rate range is quite large, and the difference between the proposed method and the optimal fixed learning rate is only 0.4. Therefore, to evaluate the necessity of the proposed method, it is important to consider the additional computational overhead it introduces.
Limitations
The limitations are discussed in the corresponding section.
W1: Since there are no prior works on dynamic multi-source adaptation in test time, we do not have a direct baseline for comparison. Therefore, we followed the protocol of the first multi-source Unsupervised Domain Adaptation (UDA) method ([18]), where we compared our approach with the best source model and also with any additional baseline we could create using single-source TTA methods. Additionally, we compared our method with existing multi-source UDA methods. Given that this is the first work on multi-source adaptation in test time, these comparisons represent the most reasonable baselines we could establish.
W2: This approach could work well when the task boundaries are known. However, in our setup, we have unlabeled streaming data with no information about the task boundaries. Therefore, this method cannot be directly used in our setup (see below for more details).
Q1: For the first baseline, we already have a comparison in Tables 8 and 9 ("All Model Update") in the Supplementary, where we update all the models using single-source TTA methods and then ensemble them with proper weights learned by CONTRAST (a naive ensemble of these updated models would perform worse; this also ensures fairness in the comparison). In these tables, we can clearly see that updating only the most correlated model outperforms updating all models.
For the second baseline, we would need information about the task boundaries in order to reset the models at the proper time, which we do not assume to have in our setup. This is a strong assumption and is almost always impractical. If we do make this assumption, the error rate when resetting the model is about 3% lower than CONTRAST. This is not surprising given the underlying strong assumption of knowing ground-truth task boundaries. If we instead estimate the task boundaries, performance will fall and will be similar to or worse than CONTRAST, depending on the estimation error. We will elaborate on these scenarios more thoroughly in the camera-ready version.
Q2: We calculate the Hessian for only N scalar parameters, with N representing the number of source models. Typically, in common application domains, addressing distribution shifts requires only a small number of source models, making the computational overhead negligible.
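A minimal sketch of the cost involved, using a placeholder objective (in our method the objective comes from the combination-weight optimization):

```python
import torch
from torch.autograd.functional import hessian

N = 4  # number of source models, e.g., four in the CIFAR experiments

def objective(w: torch.Tensor) -> torch.Tensor:
    # Placeholder for the batch objective as a function of the N combination
    # weights; the Hessian is only N x N regardless of network size, so the
    # overhead is negligible for small N.
    return (w ** 2).sum()

H = hessian(objective, torch.rand(N))  # shape (N, N)
```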
The manuscript addresses continual learning in the context of adaptation to multiple data distributions. The method employs a model ensemble for unsupervised domain adaptation to dynamically evolving distributions.
The weights denoting the contribution of each model are calculated through optimization. The bounds for the optimal solution are provided.
The combination weights for the model ensemble are calculated and updated during continual learning. Only the model that correlates the most with the new data is adapted.
Strengths
- The manuscript addresses continual learning in the context of adaptation to multiple data distributions.
- At test time, multiple source models are adapted such that the method optimally blends the sources.
- The theoretical foundations for the proposed methodology are provided
- Extensive experimental results are provided in the manuscript and in the supplemental materials
Weaknesses
- Many references have missing bibliographic information. For example reference [36] should be marked as being presented at MIDL 2023 and published in PMLR. Reference [3] was presented at the International Conference on Learning Representations (ICLR), 2021. Also, pages are missing in many references, including [1,2,4,5,6,7] among many others.
Questions
What happens if many models are correlated with the new data to be learnt?
Limitations
Broader Impact and Limitations are discussed in Section 6 of the manuscript.
W1: Thank you for pointing these out. We will fix these references in the camera-ready version.
Q1: When multiple source models are highly correlated and have nearly equal weights, updating all the models is an option. However, while updating all models might be effective in the short term, it can lead to an increased rate of catastrophic forgetting over time during continual adaptation. This presents a tradeoff between optimizing performance in the current batch and maintaining overall performance in the long run. Please refer to Section F.2 in the Supplementary for details.
This paper proposes CONTRAST, a novel method for continual adaptation to dynamic streaming data using multiple source models, without requiring access to source data. CONTRAST combines these models to adapt to test data that arrive in small batches. It features two main innovations: calculating optimal combination weights for continuous adaptation and updating only the source model most correlated with the test data to prevent forgetting. Theoretical insights about the performance are provided. Experiments show that CONTRAST performs as well as the best source model with hindsight and maintains robust performance as the test data distribution changes over time. As stated by the reviewers, this paper addresses a novel, challenging, and practical problem (C1aF). The writing is clear, easy to understand, and well-structured (f99A, z72Z). The method is clear, intuitive, and easy to implement (f99A). Additionally, the paper provides thorough and well-supported theoretical foundations and bounds (YLKD, z72Z, C1aF). It also presents extensive results with numerous ablations and SOTA performance (YLKD, z72Z).
Four expert reviewers with backgrounds in domain adaptation, test-time adaptation, and model editing largely side with acceptance (YLKD: 7, f99A: 6, z72Z: 5), with one vote for borderline rejection (C1aF: 5). The authors provide a rebuttal, globally and individually, but unfortunately only one reviewer responds. The AC has accordingly paid close attention to the points in each review and rebuttal to determine whether potential reasons for rejection have been satisfactorily resolved.
- f99A was principally concerned with the fairness of comparing multi-source adaptation against single-source adaptation and the construction of proper baselines from existing methods. The authors have provided sufficient explanation and results in the supplement to justify their choice of ensembled single-source TTA methods, but they are encouraged to prioritize how this is explained in the main text for better clarity, because this reviewer's doubt indicates insufficient explanation.
- z72Z was concerned with clarity, the contribution relative to multi-source DA [18], and the need to articulate the specifics of multi-source CTTA vs. multi-source DA and single-source CTTA. The rebuttal clarifies these points, and the AC encourages the authors to incorporate this into the work. Given that the success of offline DA does not imply success in online TTA, this submission makes a contribution.
- C1aF was concerned about (1) inadequate experimental design to test dynamic distributions of mixed shifts and (2) lack of recent and state-of-the-art TTA methods for more realistic settings like RoTTA, TRIBE, and ROID. The authors provide explanations and a rebuttal experiment on RoTTA that resulted in a positive re-evaluation and raised score.
Given this accounting of the rebuttal, and the raised score by C1aF, the AC sides with acceptance.
Note: The AC suggests the authors consider the following concurrent and prior work for discussion, but makes clear that these have no bearing on the decision, and are provided purely to better connect findings across the community.
- BeCoTTA https://arxiv.org/abs/2402.08712 is concurrent work on merging or blending models for adaptation published at ICML 2024.
- Seasoning Soups https://arxiv.org/abs/2302.10164 merges models in a few-shot supervised setting without gradient optimization by selecting weighted averages of multiple models.
- Both methods prevent forgetting by routing and merging, while the latter never updates the original models themselves.