3.7

/10

Rejected6 位审稿人

最低3最高5标准差0.9

4.5

置信度

ICLR 2024

CoSDA: Continual Source-Free Domain Adaptation

Haozhe Feng,Zhaorui Yang,Hesun Chen,Tianyu Pang,Chao Du,Minfeng Zhu,Wei Chen,Shuicheng YAN

OpenReview PDF

提交: 2023-09-23更新: 2024-02-11

TL;DR

The study introduces CoSDA, a continual source-free domain adaptation approach using a dual-speed teacher-student model with consistency learning, which outperforms existing methods and mitigates catastrophic forgetting.

摘要

关键词

Source-Free Domain AdaptationContinual LearningCatastrophic Forgetting

评审与讨论

审稿意见

评分: 3置信度: 52023-10-18

This study introduces a novel context in the realm of continual SFDA, a specific case within unsupervised domain adaptation. It focuses on the sequential adaptation of a robustly trained source model to multiple unlabeled target domains. The authors pinpoint the issue of catastrophic forgetting prevalent in current domain adaptation techniques and skillfully reconfigure existing baselines to function in this innovative setting. Subsequently, they present a teacher-student consistency learning approach designed to attenuate the effects of forgetting, thereby facilitating efficient adaptation across multiple targets sequentially. The empirical evidence provided substantially corroborates the assertion that this methodology not only enhances performance but also significantly curtails issues related to catastrophic forgetting.

优点

It is happy to see that this manuscript reimplements previous SFDA approaches within a unified framework and conducts a realistic evaluation of these methods under the continual SFDA settings.
For the most part, the writing is clear and easy to understand.

缺点

I'm primarily concerned about the relevance of the continual SFDA setting. The manuscript restricts its experiments to synthetic tests on established UDA benchmarks, raising questions about the practical value of continual SFDA in real-world applications. In actual practice, it seems feasible to merge all target domains into a single large one and adapt the source model accordingly. Additionally, identifying or defining the source and target domains in real-world applications is already a challenging task, complicating the applicability of this approach.
Another concern pertains to the originality of CoSDA, as it appears to amalgamate various existing strategies, including the teacher-student model, mixup, and information maximization. While the use of the exponential moving average method is a common and sensible strategy to prevent overfitting, the rationale behind employing mixup and information maximization to tackle this issue isn't clear or intuitive.
Lastly, the presentation of results in the tables is somewhat overwhelming and perplexing. The abundance of statistics, compounded by the use of multiple colors, makes it difficult to interpret the data and grasp the essential outcomes. Simplifying these tables for clarity and ease of understanding would be highly beneficial.

Some typos in this manuscript:

"by by consolidating data"
"In this section, We ..."

问题

What are the practical applications for continual SFDA in the real world?
In real-world scenarios, how can we gather multiple domains that exhibit distribution shifts?
Considering the era of large-scale models, what is the actual importance of continual SFDA? Specifically, if the source model is an extensive visual foundation model, is there a real need to sequentially adapt this model across various target domains?

评论- Response To Reviewer CBo9

2023-11-20

Thanks for your time and valuable reviews, here are our responses:

[W1] There setting of continual SFDA is meaningful for the following reasons:

Ease of massive data annotation. Note that in our setting, the label of data is only accessible in the first domain(i.e. source domain). The need for annotation on target domains is eliminated, which saves great effort.
Prevention of potential performance degradation. Due to the domain shift, haphazardly combining different datasets may result in subpar performance.
Spatio-temporal separation. During each adaptation process, we have only unlabeled data from current domain. Neither data nor weights from previous domains are stored. Therefore, each adaptation process can be performed at different time and location, which we refer to as spatio-temporal separation. With spatio-temporal separation, we can first deploy our model, and adapts it to unseen domains during pretraining upon encountering new data.

[W2] The novelty of our method lies in dual-speed optimization strategy. Specifically, the student model is updated every batch, while the teacher model is updated every epoch. To the best of our knowledge, we are the first to propose such strategy. The rationale for choosing Mixup as augmentation and integrating information maximization is discussed in section 3.1 , appendix A.1 and A.2.

[W3] Thanks for your advice. We will try to improve the presentation of results. Also thanks for pointing out typos.

[Q1] One scenario is autonomous driving, where the system may continually encounter different conditions and situations, which can be viewed as different domains. In such scenario, performing well in all encountered domains is of great significance.

[Q2] Multiple domains can come from diverse natural contiditions, such as different weather.

[Q3] To the best of our knowledge, foundation vision models are not as prevalent as LLM in NLP. There is not a clear way of describing all CV tasks in a natural way as natural language does in NLP. Furthermore, as visual data is inherently far smaller than language data in amount, foundation models probably not perform well in all downsteam, highly specific tasks. The need of adaptation may long stand even if with the existence of foundation models, as the environment is constantly evolving and new domains are ceaselessly appearing.

2023-11-21

Thanks for your timely reply. As for the reply of W1, traditional unsupervised domain adaptation also eliminates the need for the labels of the target domain, and no evidence proves that haphazardly combining different datasets may result in subpar performance.

In my opinion, it is not a technical contribution that the student model is updated every batch while the teacher model is updated every epoch.

Based on the response and the original comments, I hold my primary score.

审稿意见

评分: 5置信度: 42023-10-20

The submission investigate continual source-free domain adaptation task, where a source pretrained model is continually adapted to a sequence of unlabeled target domains, without the access to all old domains (source and old target domains). The submission proposes several general methods to address this challenging task, which could be easily combined with the existing source-free domain adaptation methods.

优点

The investigated continual source-free domain adaptation task has more practical value, as the adapted model is expected to keep good performance on all old domains after adapting to a new target domain. Also in the proposed method, domain ID is not needed which makes the method more readily deployed in the real world application.
The proposed method is relatively simple, which only contains a mixup consistency loss with teacher-student architecture, a mutual information maximization loss, along with a BN statistics updating trick. Thus, the method is quite general, and could be easily combined with the existing source-free domain adaptation method, which is proved in the experimental section.
The experimental sections are detailed, which cover several benchmarks, and reproduce lots of existing method for the continual source-free domain adaptation setting.

缺点

Although the studied new setting is of high practical value, and the experiments are abundant, the major concern is that the proposed method(s) is not new/novel, the proposed modules are quite popular in the related areas.

Teacher-student architecture where teacher model is the EMA between old teacher model and the current student model, this technique is popular in almost every transfer learning topic.
Usage and discussion of mutual information maximization. As the paper mentioned, MI is proved to be very efficient in unsupervised clustering task [1], which is a similar topic to source-free domain adaptation. Also, AaD also discuss MI and relate it with several other different methods.
BN updating trick. Actually one paper in continual learning gives a thorough investigation about how BN influence the continual learning performance, with a short conclusion that the running statistics heavily biased towards the current task, which may influence the performance on the old task. In turn, BN statistics are also important for the current task, as mentioned in GSFDA that simply forwarding with the test data once before adaptation could improve the performance. The author could add some discussions with the above mentioned methods, as in the proposed way the teacher model has more information about the old domain while the student model focus on the statistics on the current domains.

reference

[1] Learning Discrete Representations via Information Maximizing Self-Augmented Training. ICML 2017

[2] Continual Normalization: Rethinking Batch Normalization for Online Continual Learning. ICLR 2022

问题

Overall, the paper is sound and address a new but practical new task, as well as providing detailed experimental analysis. However, I think the proposed method is somehow incremental, as mentioned in the weakness part, and I do not really get some new and interesting insights from this submission.

评论- Response To Reviewer TdNX

2023-11-20

Thanks for your time and valuable reviews, here are our responses:

[W1] The novelty of our method lies in dual-speed optimization strategy. Specifically, the student model is updated every batch, while the teacher model is updated every epoch. To the best of our knowledge, we are the first to propose such strategy.

[W2] MI is surely effective. We also provide analysis both empirically(in ablation study) and theoretically(in appendix A.2)

[W3] CoSDA aims at achieving continual SFDA. With multiple target domains, simply using BN statistics from the target domain may results in severe catastrophic forgetting. Furthermore, the moving average mean and variance are not the mean and variance of the whole dataset. They can be viewed as weighted mean, instead. As the adaptation process goes on, the mean and variance should be more and more precise, so later batches should have higher weight.

[Q] Please refer to W1.

2023-11-23

Thanks for the reply. My point is that all those three modules I listed in the weakness are already well investigated in the area or the related areas, the current submission seems more like a combination of them applying to a new but similar task. Thus I keep my score.

审稿意见

评分: 3置信度: 52023-10-29

The paper studies the well-established problem of source-free unsupervised domain adaptation (SFDA) in a continual learning setting. While existing approaches on SFDA focus exclusively on target domain performance, the authors propose a new method, CoSDA, that not only enhances target domain adaptation but also preserves the source domain performance. The core idea revolves around using a student-teacher framework, where the teacher model is updated via EMA (exponential moving average) of the student model weights to prevent forgetting. The mixup augmentation is used to drive the learning process via consistency regularization between the pair of networks, and a mutual information maximization loss is also added to obtain better pseudo-labels. Experiments are shown on four classification benchmarks: DomainNet, OfficeHome, Office31, and VisDA. Beyond single target adaptation, results are also shown on multi-target adaptation (where the domains appear sequentially) to highlight the continual learning capabilities of the framework. When compared to existing works on SFDA and Test-Time Adaptation (TTA), CoSDA achieves both higher accuracy and good robustness against forgetting.

优点

The problem statement is quite relevant for practical applications of domain adaptation methods. Most of the literature on domain adaptation focuses on target domain performance without any concern for source-domain performance. On the contrary, this paper tries to remedy this overlooked aspect in domain adaptation. This is quite an important problem since data can come from any domain during inference.

缺点

The proposed CoSDA approach lacks novelty and bears a significant similarity to CoTTA [1]. CoTTA tackles the closely related problem of test-time adaptation (TTA) and utilizes the exact same concept of a student-teacher framework trained using consistency regularization. CoSDA simply replaces the general augmentation set in CoTTA with mixup. The authors do compare with CoTTA and mention that "... CoTTA (Wang et al., 2022) ensures knowledge preservation by stochastically preserving a subset of the source model’s parameters during each update" but fail to mention this other crucial aspect of this paper which is directly related to their method. The inclusion of the mutual information regularization loss for better pseudo-labels is also borrowed directly from a previous SFDA approach SHOT [2]. Finally, the claimed student-teacher EMA framework is very common is the SFDA literature, [3, 4], none of which are mentioned in the paper.
A primary goal of this work is to prevent catastrophic forgetting of previously seen domains, however, there is no discussion of existing continual learning approaches and what distinguishes this work from these papers.
The experiments on sequential target domains (Sec 4.3) is limited, despite being a prominent claim of this paper: (a) the max number of domains is only 4, while TTA methods experiment with up to 15, and (b) no experiments on effects of a domain being repeated.

[1] Wang, Qin, et al. "Continual test-time domain adaptation." CVPR 2022.
[2] Liang, Jian, et al. "Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation." ICML 2020.
[3] Kim, Donghyun, et al. "A unified framework for domain adaptive pose estimation." ECCV 2022.
[4] Ohkawa, Takehiko, et al. "Domain adaptive hand keypoint and pixel localization in the wild." ECCV 2022.

问题

Why was mixup chosen as the preferred augmentation? Are there experiments with other augmentations? The authors mention that mixup can be applied to other domains (NLP, Audio), but no evidence is provided to prove the efficacy of CoSDA on other domains, let alone more challenging tasks on images, such as semantic segmentation.
How does the performance vary with the number of unlabeled samples per domain?
CoTTA utilizes a stochastic restore of the source weights since it tackles the more challenging TTA setup (performing adaptation with very few images) - was this removed in the experiments? Furthermore, how well does CoSDA perform on the TTA benchmarks?

评论- Response To Reviewer GtVN

2023-11-20

Thanks for your time and valuable reviews, here are our responses:

[W1] CoSDA is clearly different from CoTTA in three aspects:

The setting of continual test-time adaptation is not about catastrophic forgetting. It computes the average of results over multiple itrations. Whereas in our continual SFDA setting data from each domain is seen only once.
The dual-speed optimization strategy is the module designed for continual SFDA. While in CoTTA, both teacher model and student model are updated after each epoch.
CoTTA needs to inference $k$ times(typically $k=32$ ), while CoSDA only needs $k=1$ . We discuss knowledge distillation-based methods in related works section, which is more relative than EMA.

[W2] We do discuss Continual DA methods in the last part of related works section.

[W3] The max number of domain is not 4, but 6. DomainNet has 6 domains, and the results are shown in Figure 2. The domains in the studied datasets have distinct covariate shift, while in continual test-time adaptation setting, all domains are augmented with 15 types of corruptions from the same images, therefore sharing great similarity. Furthermore, we implement CoSDA in continual test-time adaptation setting, and find CoSDA is superior even in this setting. The table below shows the classification error on CIFAR-10-to-CIFAR-10C:

Method	Mean
Source	43.5
BN Adapt	20.4
Pseudo-label	19.8
TENT	18.6
CoTTA	16.2
CoSDA	15.1

[Q1] The rationale for choosing Mixup as augmentation is discussed in Section 3.1 and Appendix A.1. We have tried CutMix and the result is not as well.

[Q2] By default, all current SFDA works discussed in the paper train on the whole unlabeled dataset, so we follow the setting.

[Q3] The stochastic restore part is not removed, which can varified from our source code(in supplementary materials). We implement CoSDA in continual test-time adaptation setting, and find CoSDA is superior even in this setting. The table below shows the classification error on CIFAR-10-to-CIFAR-10C:

Method	Mean
Source	43.5
BN Adapt	20.4
Pseudo-label	19.8
TENT	18.6
CoTTA	16.2
CoSDA	15.1

2023-11-23

Thank you for your response.

Relationship to CoTTA: You mention that your setting is not about forgetting, however, the writing in the paper hints at the complete opposite. Here's a line from the caption in Figure 1: "The pipeline of the proposed CoSDA method, utilizing a dual-speed optimized teacher-student model pair to adapt to new domains while avoiding forgetting." Furthermore, the reason CoTTA is updated in each epoch is because a single batch constitutes an epoch in TTA. I fail to find any distinct differences from the EMA approach and the so called "dual-speed optimization".
Related works: I clearly mention continual learning in general, not continual DA.
Mixup: I understand Mixup gives good results. As I mentioned, the "why" is not clearly defined. Section 3.1 lists "why we chose Mixup", not "why Mixup works". A specific data augmentation might not work for other datasets.

Based on the response and the original comments, I will stick to my original score.

审稿意见

评分: 5置信度: 52023-11-01

This paper proposes a new method for continual source-free domain adaptation (CoSDA), which transfers knowledge from a source-domain trained model to multiple target domains without accessing the source data. CoSDA uses a dual-speed optimized teacher-student model pair and consistency learning to mitigate forgetting and improve adaptation. CoSDA also incorporates mutual information loss to enhance robustness to hard domains. The paper evaluates CoSDA on four benchmarks and shows that it outperforms state-of-the-art methods in both single-target and multi-target sequential adaptation scenarios.

优点

The proposed method is simple yet efficient.
The experimental results are extensive, including a comparison of various baselines on different datasets. The results prove the effectiveness of the proposed method.
Theoretical analysis and proofs are provided as necessary.

缺点

The technique contribution is somewhat limited. The proposed method can be easily derived from existing methods. For example, the idea of the teacher-student framework[1] and the mix-up strategy[2] for continual source-free domain adaptation is nothing new.
There is a need for deeper experiments. Could the authors test their proposed method on a variety of adaptation sequences, especially those that are more complex, instead of relying solely on one fixed combination sequence? In real-world scenarios, sequential target data might span a longer duration and encompass a wider range of distributions. For instance, the authors could establish longer sequences to evaluate the method for mitigating forgetting.
In continual source-free domain adaptation. A primary challenge lies in managing the tradeoff between adapting to target domain and preventing the forgetting of previous domains. However, the proposed methods give limited attention to this challenge. The tradeoff is achieved through the EMA momentum m within a teacher-student learning framework in this paper. Thus, a hyper-parameter experiment of m should at least be conducted.

[1] Wang, Qin, et al. "Continual test-time domain adaptation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. [2] Taufique, Abu Md Niamul, Chowdhury Sadman Jahan, and Andreas Savakis. "ConDA: Continual unsupervised domain adaptation." arXiv preprint arXiv:2103.11056 (2021).

问题

see above

评论- Response To Reviewer qtge

2023-11-20

Thanks for your time and valuable reviews, here are our responses:

[W1] The technique contribution is somewhat limited. The proposed method can be easily derived from existing methods. For example, the idea of the teacher-student framework[1] and the mix-up strategy[2] for continual source-free domain adaptation is nothing new.

[W2] We implement CoSDA in continual test-time adaptation setting, and find CoSDA is superior even in this setting. The table below shows the classification error on CIFAR-10-to-CIFAR-10C:

Method	Mean
Source	43.5
BN Adapt	20.4
Pseudo-label	19.8
TENT	18.6
CoTTA	16.2
CoSDA	15.1

[W3] The dual-speed optimization strategy is the module designed for continual SFDA. The student model captures short-term features and update to adapt to target domains, while the teacher model filters out long-term domain-invariant features. The EMA momentum does not have much impact, so we fix the parameter over all experiments. Specifically, we follow the settings in MoCo and BYOL and increase the momemtum from 0.9 to 0.99 using a cosine schedule, as stated at the end of Section 3. We will carry out experiments to demonstrate that afterwards. Thanks for your advice!

审稿意见

评分: 3置信度: 42023-11-05

The paper focuses on mitigating catastrophic forgetting in the context of source-free domain adaptation. The authors take several steps to address this challenge, including re-implementing existing methods and introducing a novel approach called CoSDA, which leverages a teacher-student model to achieve continuous adaptation. Specifically, the authors introduce a consistency loss based on KL-divergence to transfer knowledge from the teacher network to the student network. They also employ a KL-divergence-based regularization loss to stabilize training with Mixup augmentations. Additionally, the authors present two distinct optimization strategies for updating the teacher and student networks.

优点

The motivation for using Mixup for the data augmentation is clear and well-described.
Extensive experiments and ablation studies are performed.

缺点

Majors

I have concerns regarding the formulation of consistency loss which is the main contribution of the work. Equation 1 is confusing. From my understanding, h_{\psi}(\tilde{x}) should refer to the output logits from the student network. If so, minimizing the divergence between two distributions from different mathematical spaces does not make any sense. To be specific, \tilde{p} is a softmax probability vector with each element in the range [0,1], while h_{\psi}(\tilde{x}) has elements ranging from negative infinity to positive infinity.
Please add an ablation study to demonstrate the potential issue of the proposed consistency loss collapsing. My suspicion is that this collapse occurs due to the divergence between two distinct mathematical spaces.
In batch normalization, the mean and variance are computed based on the activations within each mini-batch, which is a subset of the entire dataset. The statistics are calculated separately for each mini-batch as the model processes the data during training. This is what allows batch normalization to adapt to the statistics of the current batch and helps in stabilizing and accelerating training. However, the authors mentioned that their mean and variance are calculated based on the whole dataset for batch normalization, which does not make sense. As per my understanding, the moving average mean and variance are computed by accumulating the mean and variance from each batch using an exponential moving average formula (\alpha*\mu_{moving} + (1-\alpha)*\mu_{i}). The moving average mean and variance are not the mean and variance of the whole dataset. The lack of clarity on batch normalization in the paper raises concerns about its overall quality.
I may have a misunderstanding regarding the evaluation settings; however, based on my interpretation of the paper, it appears that the proposed method does not demonstrate significant improvements over the baseline approaches in terms of target domain classification accuracies. I kindly request the authors to provide a more detailed explanation or consider revising the experiment section for improved clarity on this matter.

Minors

The proposed work appears to be closely aligned with federated learning. I am curious about the authors' motivation for framing this work as a sequence of source-free domain adaptation. To me, the workflow seems to have stronger connections to federated learning rather than source-free domain adaptation. Thus, the literature on federated learning should be given.
typos: 3.1 ”to consist with” -> “to be consistent with”

问题

In the second paragraph of the introduction, the authors mentioned that “SFDA also allows for spatio-temporal separation of the adaptation process since the model training on source domain is independent of the knowledge transfer on target domain”. What does the spatio-temporal separation refer to?
In Equation 1, is h_{\psi}(\tilde{x}) referring to the logits or the softmax probabilities? If it represents the logits, it raises a question regarding the use of KL-divergence between two distributions: \tilde{p}, which is a softmax probability vector with each element in the range [0,1], and h_{\psi}(\tilde{x}), which has elements ranging from negative infinity to positive infinity.
I do not quite understand the in-sequence evaluation settings. From the result tables, the authors listed, most baselines perform better without the proposed methods. Then, how could the readers evaluate the effectiveness of the proposed methods?

伦理问题详情

No concerns were raised.

评论- Response To Review XnHS

2023-11-20

Thanks for your time and valuable reviews, here are our responses:

[W1] Both $\tilde{p}$ and $h_{\psi}(\tilde{x})$ are in the range [0, 1]. The hard-label from teacher is computed as $p := softmax(h_{\theta}(X)/\tau)$ , while $h_{\psi}(\tilde{x})$ denotes logits normalized by softmax function.

[W2] As stated above, there is no distinct divergence in mathematical spaces. Also, ablation study is already provided as results (-) MI. You can find collapse in Table1 between columns "CoSDA" and "CoSDA (-) MI".

[W3] Yes, the moving average mean and variance are not the mean and variance of the whole dataset. They can be viewed as weighted mean, instead. As the adaptation process goes on, the mean and variance should be more and more precise, so later batches should have higher weight.

[W4] CoSDA outperforms all outher method either on adaptation accuracy(shown in green) or accuracy drop(shown in red), achieving best trade-off. For example, as reviewer nYkW metioned, comparing with EdgeMix in table1, in terms of the accuracy drop, our method is clearly at least 3x smaller than EdgeMix. Combined with other modules, our method lowers about 2x accuracy drop on source domain with neglectable decrease in adaptation accuracy.

[Minors] The core idea of our paper is that CoSDA achieves "continual SFDA", while in federated setting, the predictions from multiple clients are gathered and averaged without the concept of "continual". Thanks for pointing out our typos.

[Q1] During each adaptation process, we have only unlabeled data from current domain. Neither data nor weights from previous domains are stored. Therefore, each adaptation process can be performed at different time and location, which we refer to as spatio-temporal separation.

[Q2] $h_{\psi}(\tilde{x})$ refers to softmax probabilitie.

[Q3] CoSDA outperforms all outher method either on adaptation accuracy(shown in green) or accuracy drop(shown in red), achieving best trade-off. Ideally, adaptation accuracy should be as high as possible, while accuracy drop as small as possible.

评论- Reply to Author's Response

2023-11-22

Thank you for the response.

However, most of my concerns are not addressed properly. I will keep my score.

[W1 and. W2] then why you used $h$ for different mathmatical spaces? Clearly $h_{\theta}$ and $h_{phi}$ point to different mathematical spaces based on the author's response, so why no clarification is made in the manuscript? It is a standard for academic writing to use consistent notations for the variables from the same mathematical space.

[W3] The authors acknowledged my concern but did not make any revisions to their manuscript or give any plan to address my concern.

[W4] My concern is the clarity of the table presentation. Readers find it hard to gather the key information (mentioned by the authors) from the current manuscript. Again, the authors did not propose any revision to address this.

[Minor] Then, the literature review should be given to federated learning.

[Q1] The authors addressed this question very well.

[Q2] Again, the same notation $h$ should not be used to mention variables from different mathematical spaces.

[Q3] Again, my concern is the clarity of the tables, not the performance lift. The lack of clarity in the experiment section makes readers feel hard to find the technical contributions of the proposed work.

审稿意见

评分: 3置信度: 42023-11-06

The paper proposes a simple continual SFDA framework, where a teacher-student framework is applied. The teacher model provides the hard labeling for the student model, and the student model will be penalized with the divergence from the teacher hard label, as well as a mutual information maximization regularization. There are extensive experiments compared to other SFDA methods and the baselines.

优点

The paper is well organized with sufficient background introduction.
The paper provides illustrative figures and diagrams for readers to better understand their proposal.
The paper conducts extensive experiments on DomainNet, OfficeHome and VisDA, which are the main DA datasets, and show fair performance on those SFDA tasks.

缺点

The proposed method lacks novelty in terms of the framework design. For example, the teacher-student architecture has been proposed from the self-supervised learning framework, e.g., MoCo. The mix-up is another already invented technique to augment both the input space and the label space. The consistent loss has been utilized in those above mentioned methods already.

Meanwhile, the mutual information maximization is a widely applied regularization technique during representation learning. Combining all, I think the novelty of the design is hard to justify.

The paper claims continual SFDA, where from the method design, there is no specific module is designed to deal with the model catastrophic forgetting issue, except accepting the teacher model’s hard label to measure the KL divergence from the student prediction on the mixed up sample.

The teacher model is leveraging exponential averaging, but the student model is not in any manner distilled from the teacher model. The model weights forgetting is not addressed in any technical design.

Across the compared methods, the proposed CoSDA does not show advantageous results over the other methods. For example, in Table 1, CosDA is not compellingly better than EdgeMix, where actually many of the DA protocols EdgeMix shows better results.

In Table 2, CosDA even combined with some other modules, such as NRC or AaD, is not better than AaD. This suggests that the proposed way does not achieve as advantageous performance as the state-of-the-art methods.

问题

Please refer to the weakness session for more detail.

评论- Response To Reviewer nYkW

2023-11-20

Thanks for your time and valuable reviews, here are our responses:

[W2] The dual-speed optimization strategy is the module designed for continual SFDA. The student model captures short-term features and update to adapt to target domains, while the teacher model filters out long-term domain-invariant features.

[W3] CoSDA outperforms all outher method either on adaptation accuracy(shown in green) or accuracy drop(shown in red), achieving best trade-off. In your example, in terms of the accuracy drop, our method is clearly at least 3x smaller than EdgeMix. Combined with other modules, our method lowers about 2x accuracy drop on source domain with neglectable decrease in adaptation accuracy.

AC 元评审

2023-12-08

(a) Summary of Scientific Claims and Findings:

The paper introduces CoSDA, a method aimed at continual source-free domain adaptation (SFDA) to mitigate catastrophic forgetting in domain adaptation scenarios. CoSDA employs a teacher-student framework to transfer knowledge from a source-trained model to multiple target domains without direct access to source data. This is achieved through mixup augmentation, mutual information maximization, and consistency learning via a teacher-student model.

Reviewers acknowledge the paper's attempt to address the critical issue of catastrophic forgetting and its implications in domain adaptation. However, there are concerns raised about the novelty of CoSDA's approach, with comparisons drawn to existing techniques such as teacher-student architectures and mixup strategies. Additionally, questions arise about the method's effectiveness in managing the adaptation to new domains while retaining knowledge from previous ones.

(b) Strengths of the Paper:

Clear Motivation: The paper effectively explains the rationale behind employing mixup for data augmentation in the context of domain adaptation.
Detailed Experiments: Extensive experiments across various benchmarks, along with ablation studies, demonstrate the method's performance and efficacy in the SFDA setting.
Theoretical Foundation: The paper provides a solid theoretical foundation, outlining the principles and algorithms used in the proposed CoSDA approach.

Novelty Concerns: Several reviewers express doubts about the novelty of CoSDA, noting similarities to existing techniques such as teacher-student architectures and mixup strategies.
Tradeoff Management: There's a lack of emphasis on managing the tradeoff between adapting to new target domains and preserving knowledge from previous ones, a crucial aspect in continual SFDA.
Methodological Clarity: Certain methodological aspects, like the formulation of the consistency loss function, lack clarity and raise concerns among reviewers.
Real-World Applicability: Reviewers question the practical relevance of continual SFDA in real-world scenarios, especially concerning the identification and merging of multiple target domains.
Comparative Analysis: The paper's comparison against existing methods might need more depth and clarity, especially in delineating the advantages of CoSDA over these established techniques.

Overall, the paper contributes to the discussion on continual SFDA and addresses the issue of catastrophic forgetting in domain adaptation. However, addressing concerns about method novelty, practical relevance, and the tradeoff between adaptation and retention of knowledge from previous domains could enhance the paper's impact and clarity.

为何不给更高分

There are different viewpoints on the novelty and effectiveness of the proposed method, CoSDA. Some reviewers highlight concerns about the novelty of the approach, the lack of significant improvements over existing methods, and issues related to the formulation of certain components within the proposed framework.

The reviewers' feedback suggests a need for clearer explanations regarding the novelty of the method, especially in comparison to existing techniques, and perhaps a more detailed exploration of the experimental results to better highlight the advantages of CoSDA over other methods in various settings. There are also suggestions for improvements in the paper's clarity, such as simplifying tables for better comprehension and addressing typos or inconsistencies in presentation.

为何不给更低分

N/A

最终决定Reject

2024-01-16

Reject