DA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities
Reviews and Discussion
This paper introduces DA-Bench, a framework for evaluating unsupervised domain adaptation (DA) methods across diverse modalities, such as images, text, biomedical, and tabular data. It emphasizes fair and realistic hyperparameter selection using nested cross-validation and unsupervised model selection metrics.
Strengths
DA-Bench provides a relatively comprehensive evaluation of existing shallow algorithms and offers practical insights for real-life applications. It is open-source, reproducible, and easily extendable with new DA methods, datasets, and model selection criteria.
Weaknesses
The paper has the following shortcomings. First, the methods tested are outdated, lacking work from the past three years. Second, the correlation between different modalities is weak, casting doubt on whether transferability can be achieved. Lastly, the paper does not propose any new theories.
Questions
None
We thank Reviewer rwX7 for their thoughtful feedback. We are happy to read that the reviewer found our benchmark comprehensive, with practical insights for real-life applications, and we appreciate their acknowledging that our work is open-source, reproducible, and easily extendable.
We address the reviewer's comments point by point below.
- Outdated methods: We thank the reviewer for their feedback and refer them to the detailed general response about deep DA methods, which have been complemented in this response with 4 new methods, including one from NeurIPS 2023.
- Weak correlation: We believe that the weak correlation between modalities is a major result of this paper, highlighting the need for the ML community to go beyond Computer Vision applications and investigate other, more challenging modalities. DA has clearly not been solved outside of CV, and this benchmark aims to demonstrate that. There is for the moment no silver bullet in DA, and we believe it is an important contribution to clearly show this through an open, diverse and reproducible benchmark.
- No new theory: By design, our work provides a benchmark, which is an empirical assessment of the field of DA. This is in line with the "Benchmark and datasets" topic from the ICLR call for papers. We do not see which new theories could be provided in this scope. Note that we added more theoretical insights to analyse the observed performance and to better link it to the literature in Section 4.1 of the revised paper (see the first answer to Reviewer PVS6).
We sincerely thank the reviewer for their time and effort in evaluating our work. As the discussion deadline approaches, we would greatly appreciate any feedback on our rebuttal, especially if there are remaining concerns that have not yet been addressed.
This paper introduces DA-Bench, a framework for evaluating Unsupervised Domain Adaptation methods across diverse modalities beyond computer vision. Through realistic assessments and nested cross-validation, it comprehensively evaluates shallow algorithms like reweighting and mapping. The study emphasizes the importance of realistic validation, offering insights into model selection approaches.
Strengths
- The paper is well-written and easily comprehensible.
- The proposed benchmark is comprehensive and straightforward.
- The motivation for validating the parameters of the methods, such as unsupervised scorers that do not rely on target labels, is commendable.
- Extensive experiments across multiple datasets are adequate.
Weaknesses
- Figure 1 is far from self-contained. Could you explain the detailed relationships and distinctions among these four shifts according to the Figure? Also, in this figure, it appears to represent a binary classification task. Do these four shifts generally exist in other tasks as well?
- This paper evaluated three outdated deep DA methods. Could you provide the latest deep DA methods for evaluation, such as [1] and [2]?
- In Table 3, accuracy scores for deep methods are only presented for selected real-life datasets. Could you provide the scores for simulated datasets and conduct the relevant comparisons?
- In line 217, this paper claims that the supervised scorer cannot be used in practice. Does this imply that the scorer is meaningless in practice? In other words, how can we evaluate DA methods in practice when target labels are completely absent?
[1] Quantitatively Measuring and Contrastively Exploring Heterogeneity for Domain Generalization. In SIGKDD, pp. 2189–2200, 2023. [2] Domain Adaptation for Time Series under Feature and Label Shifts. In ICML, pp. 12746–12774, 2023.
Questions
Please address the weaknesses above.
We thank the reviewer qzy2 for their positive feedback and for acknowledging our work's clarity, our comprehensive benchmark and our extensive experiments. We address the reviewer's concerns point by point below.
1. Figure 1 Clarity:
We thank the reviewer for their valuable feedback. Figure 1 illustrates the different data shifts and assumptions typically studied in the DA literature. The focus is on a binary classification task for visualization purposes but the shifts and assumptions are not limited to it (see [3] for a formalization of the shifts). We have updated the caption of Figure 1 in the revised paper to make it clearer. If the figure is still unclear, we encourage the reviewer to let us know.
[3] Quiñonero-Candela et al., Dataset Shift in Machine Learning. MIT Press, 2009.
2. Lack of recent deep methods:
We refer the reviewer to the general comment on this point. Regarding the works mentioned:
- [1] is a multi-source DA paper, which is not applicable to our pairwise DA benchmark approach. Note that in [1], the strongest and relatively close competitor is DeepCoral, which is present in our current benchmark.
- [2] is a task-specific DA approach for time series based on a frequency representation. Our benchmark includes only one time-series dataset (BCI), for which frequency representations have limited performance. This is why we instead used spatial covariance representations and Riemannian geometry (a minimal sketch of this representation is given after the references below), which actually leads to the best performance, with Linear OT outperforming deep DA (possibly also due to the relatively small training datasets).
[1] Quantitatively Measuring and Contrastively Exploring Heterogeneity for Domain Generalization. In SIGKDD, pp. 2189–2200, 2023.
[2] Domain Adaptation for Time Series under Feature and Label Shifts. In ICML, pp. 12746–12774, 2023.
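For reference, here is a minimal NumPy/SciPy sketch of the spatial-covariance and tangent-space representation mentioned above. It is an illustrative simplification (the reference point is the arithmetic mean of the covariances rather than the Riemannian mean), and the function name is ours; libraries such as pyriemann implement the full pipeline.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def tangent_space_features(trials):
    """Map EEG trials of shape (n_trials, n_channels, n_times) to tangent-space vectors."""
    covs = np.array([np.cov(X) for X in trials])   # per-trial spatial covariances
    C_ref = covs.mean(axis=0)                      # reference point (arithmetic mean here)
    inv_sqrt = fractional_matrix_power(C_ref, -0.5)
    feats = []
    for C in covs:
        S = logm(inv_sqrt @ C @ inv_sqrt)          # log-map of C at the reference point
        feats.append(np.real(S[np.triu_indices_from(S)]))  # vectorize the upper triangle
    return np.array(feats)
```

Shallow DA methods such as Linear OT can then be applied to these vector features across source and target subjects.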
3. Score on simulated datasets:
We thank the reviewer for this interesting question. We believe that deep learning methods are unlikely to perform well on the simulated datasets (except for the subspace shift), since they are unstructured and low-dimensional (only 2 features), but providing such results would be interesting. Given the time constraints, we focused on adding new deep DA methods for the rebuttal, but we will run the deep DA methods with a nonlinear MLP on the simulated datasets for the final version.
4. Using supervised scorers:
In realistic applications of unsupervised DA, we do not have access to target labels. This is why unsupervised scorers are necessary. While our benchmark includes target labels to measure generalization performance, the unsupervised scorers, which are the ones that can be used in practice, do not rely on these labels. One of the main contributions of this benchmark is its study of unsupervised scorers. This provides practitioners with a sense of how much they can trust these scorers and their methods.
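As an illustration, here is a minimal sketch of an importance-weighted validation score of this kind: a domain classifier estimates density ratios, which reweight the accuracy on held-out labeled source data, so no target labels are needed. The function and variable names are ours and do not reflect the benchmark's actual API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weighted_score(estimator, Xs_val, ys_val, Xt):
    """Unsupervised IW score: source validation accuracy reweighted by
    estimated density ratios w(x) ~ p_target(x) / p_source(x)."""
    # Domain classifier: label 0 = source, 1 = target.
    X_dom = np.vstack([Xs_val, Xt])
    y_dom = np.concatenate([np.zeros(len(Xs_val)), np.ones(len(Xt))])
    dom_clf = LogisticRegression(max_iter=1000).fit(X_dom, y_dom)

    # Density-ratio estimate from the domain-classifier probabilities.
    p_target = dom_clf.predict_proba(Xs_val)[:, 1]
    weights = p_target / np.clip(1.0 - p_target, 1e-6, None)
    weights *= len(Xs_val) / len(Xt)  # n_s / n_t factor recovers the density ratio (no effect on the normalized average)

    # Weighted accuracy computed on labeled *source* validation data only.
    correct = (estimator.predict(Xs_val) == ys_val).astype(float)
    return np.average(correct, weights=weights)
```

Hyperparameters can then be selected inside the nested cross-validation loop by maximizing this score, mirroring how the unsupervised scorers are used in the benchmark.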
We sincerely thank the reviewer for their time and effort in evaluating our work. As the discussion deadline approaches, we would greatly appreciate any feedback on our rebuttal, especially if there are remaining concerns that have not yet been addressed.
This paper introduces DA-Bench, a benchmark for evaluating unsupervised domain adaptation (DA) methods across diverse data modalities with realistic validation techniques. Unlike existing benchmarks, DA-Bench uses unsupervised model selection scores and nested cross-validation to better reflect real-world settings where labeled target data is unavailable. Key contributions include:
- A comprehensive benchmark framework spanning multiple data types (e.g., computer vision, NLP, tabular, biosignals) with 20 shallow and 3 deep DA methods.
- Realistic evaluation protocols that avoid overestimating performance by selecting hyperparameters without target labels, using methods like Importance Weighted (IW) and Circular Validation (CircV) scores (circular validation is sketched after this list).
- Practical guidance and open-source code for DA model selection, offering insights on method and scorer effectiveness under different shift types.
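For illustration, a minimal sketch of the circular-validation idea is given below. It assumes a generic adapt_and_fit(X_source, y_source, X_target) routine returning a fitted classifier; this name and signature are hypothetical and not the benchmark's actual interface.

```python
from sklearn.metrics import accuracy_score

def circular_validation_score(adapt_and_fit, Xs, ys, Xt):
    """Circular validation: adapt source -> target, pseudo-label the target,
    adapt back target -> source, and score against the true source labels."""
    # Forward pass: adapt from labeled source to unlabeled target, then pseudo-label it.
    forward_clf = adapt_and_fit(Xs, ys, Xt)
    yt_pseudo = forward_clf.predict(Xt)

    # Backward pass: treat the pseudo-labeled target as the new source
    # and adapt back towards the original source domain.
    backward_clf = adapt_and_fit(Xt, yt_pseudo, Xs)

    # Only source labels are used, so the score remains unsupervised w.r.t. the target.
    return accuracy_score(ys, backward_clf.predict(Xs))
```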
DA-Bench represents a valuable resource for researchers and practitioners by setting a realistic standard for evaluating DA methods and by supporting the field’s advancement through rigorous, reproducible benchmarking.
Strengths
- The paper is well-organized, with clear explanations of DA methods, evaluation metrics, and data shifts. Effective use of tables and visuals enhances clarity and enables easy interpretation of results.
- DA-Bench is a benchmark that addresses a gap in UDA across multiple data modalities, such as NLP, tabular, and biosignals; multiple DA methods such as 20 shallow and 3 deep DA methods; and multiple model selection strategies.
- The paper is open-sourced and provides a useful resource for researchers and practitioners to conduct rigorous, reproducible benchmarking.
Weaknesses
- Lack of Clarity in Figure 1
- Interpretability Issues: Figure 1 is difficult to interpret, particularly in terms of what each shift type represents. For instance, it’s unclear what each color indicates; if colors represent different classes, why are there multiple clusters for each class?
- Subspace Shift Classification: Subspace Shift is more accurately described as a methodology for handling other types of shifts—primarily Covariate or Conditional Shifts—by finding a common subspace where distributions align. It may be misleading to categorize Subspace Shift as a standalone shift type.
- Suggestion: A clearer, more detailed illustration is needed to make Figure 1 more intuitive and to accurately represent the nature of each shift type.
- Limited Analysis of Dataset and Shift Impact
- Shift Types in Existing Benchmarks: While the benchmark covers method performance across various datasets, it lacks an analysis of the types of shifts present in commonly used benchmarks. Understanding the specific types of shifts (e.g., Covariate, Target, Conditional) in datasets like Office-31, Office-Home, and MNIST/USPS would provide valuable insights.
- Impact on Method Performance: It would also be helpful to examine how specific shift types influence the performance of different DA methods. This analysis would allow researchers and practitioners to make more informed choices when selecting methods for datasets with specific types of shifts.
- Limited Evaluation of Deep DA Methods
- Missing Methods: The benchmark currently omits certain popular deep domain adaptation techniques, such as pseudo-labeling methods, which are widely used and impactful. For example, pseudo-labeling approaches like MIC (Masked Image Consistency).
- Lack of Clarity on Scorer Variability within the Same Method
- Table 2 Confusion: In Table 2, different DA methods with varying model selection techniques are compared, which makes interpretation challenging. A clearer comparison would involve analyzing the same DA method across different model selection approaches to understand the variability introduced by model selection. This would make it easier to interpret the effectiveness of each scorer within a consistent method framework.
Overall, the paper ambitiously aims to create a comprehensive benchmark that covers various domain adaptation (DA) methods and model selection strategies across multiple data modalities—a valuable and useful contribution. However, since the majority of recent unsupervised domain adaptation (UDA) research and datasets are predominantly in the computer vision (CV) domain and deep learning-based (to the best of my knowledge), it’s challenging to derive insights from the current conclusions regarding the performance of recent advanced deep methods and classic DA techniques specifically on popular CV benchmarks (e.g., Office-31, Office-Home, DomainNet, VisDA). While the paper successfully establishes a versatile and valuable toolbox, it lacks constructive insights for practitioners due to its focus on cross-modality comparisons using only the best-performing model selection strategy for each method.
Questions
Please refer to the weakness.
We thank the reviewer for their detailed and encouraging feedback. We appreciate that the reviewer found our open-source, reproducible benchmark a valuable and useful contribution to UDA for researchers and practitioners. Below, we address the reviewer's concerns point by point.
1. Lack of Clarity in Figure 1:
We thank the reviewer for their comment. Figure 1 is provided to give an illustration of the theoretical shifts tested with the simulated datasets. The color of the samples indeed represents the class, and we selected an example where one of the classes has multiple modes (what you refer to as clusters) because this occurs often in ML (for instance, a "dog" class contains very different groups of dogs). We have modified the caption in the revised paper to describe Figure 1 more precisely. The subspace shift example cannot be cast as a conditional shift because each class is moved differently, but the discriminant subspace assumption holds exactly. We are open to suggestions from the reviewer to make it better, since illustrating this assumption in 2D turned out to be challenging (maybe in 3D with an additional noise feature?).
2. Subspace Shift Classification:
We would like to clarify that the term "subspace shift" refers to a specific type of domain shift, characterized by the assumption that an alignment of the source and target conditional and marginal distributions exists within a specific subspace. Based on the helpful feedback from the reviewer, we realized that the term "subspace shift" could be misunderstood as referring to the methodology used to address this type of shift. To prevent such confusion, we have explicitly clarified that it denotes a shift occurring in the orthogonal complement of the invariant subspace, and we now illustrate this subspace in the figure.
3. Limited Analysis of Dataset and Shift Impact:
In real-world applications, the traditional theoretical shifts are often not representative of the data complexity. The shifts are usually a combination of several of them, and characterizing which shift is present in a real dataset is not easy. This is a wide-open fundamental question in unsupervised DA and out of scope for our benchmark. Yet, we agree that this is an interesting question. To address the reviewer's concern, we have added a paragraph discussing the shifts and how well they can be compensated for in Sections 4.1 and 4.2.
4. Impact on Method Performance:
We added in Section 4.1 of the revised paper a discussion on the matter as well as important takeaways for DA researchers and practitioners. We note that, as explained above, such an analysis is more complex for real data because there is no unique shift type clearly present in the data but rather a mix of several shifts.
5. Missing Methods:
We agree with the reviewer that pseudo-labeling methods are important. We included DeepJDOT, which can be seen as a pseudo-labeling approach since it uses the predictions of a classifier inside the loss. The suggested MIC approach is restricted to CV applications, and we believe it would not fit the scope of our benchmark, as described in the general answer. Note that we included more recent methods like SPA [1].
6. Lack of Clarity on Scorer Variability within the Same Method:
We thank the reviewer for the suggestion. We added in the revised paper a visualization of the average rank of the scorers in Figure 4 in Appendix D.4, and a spider plot comparing different scorers that reports the average rank for each method (scorer performance depends on the method) in Figure 5.
We sincerely thank the reviewer for their time and effort in evaluating our work. As the discussion deadline approaches, we would greatly appreciate any feedback on our rebuttal, especially if there are remaining concerns that have not yet been addressed.
This paper proposes DA-Bench, a comprehensive benchmark for evaluating unsupervised domain adaptation (DA) methods across diverse modalities such as computer vision, natural language processing, tabular data, and biosignals. The benchmark includes a set of simulated and real-world datasets with different types of distribution shifts, a wide range of shallow DA methods designed to handle these shifts, and an evaluation of deep DA methods. A nested cross-validation procedure with multiple unsupervised scorers is used for realistic hyperparameter selection. The results show that few shallow DA methods consistently perform well, and model selection scorers significantly influence their effectiveness. Deep DA methods, while promising, require extensive hyperparameter tuning and often fail to outperform shallow methods in low-data regimes.
Strengths
- Comprehensive benchmark: The benchmark covers a wide range of DA methods and datasets with diverse modalities and distribution shifts, providing a more realistic evaluation of DA techniques.
- Realistic model selection: The use of nested cross-validation with unsupervised scorers for hyperparameter selection addresses the challenge of lacking target labels in practical DA scenarios.
Weaknesses
- The paper mainly focuses on empirical results and benchmarking, without providing deep theoretical insights into why certain methods perform better or worse under different domain adaptation scenarios. Incorporating more theoretical analysis could help in understanding the underlying principles of domain adaptation and guide the development of new methods.
- While the paper includes a variety of domain adaptation methods in its benchmarks, some of the baseline methods are relatively outdated or not state-of-the-art. Including more recent and advanced baseline methods would provide a more accurate assessment of the performance of new domain adaptation techniques.
- The paper uses accuracy as the primary evaluation metric for the benchmarks. While accuracy is a common metric for classification tasks, it may not be sufficient to fully capture the performance of domain adaptation methods in all scenarios. Considering additional evaluation metrics such as ECE, Brier Score, F1-score, or even domain-specific metrics could provide a more comprehensive evaluation.
Questions
See weaknesses.
We thank Reviewer PVS6 for their valuable comments. We are happy to hear that the reviewer found our benchmark comprehensive and that our realistic model selection addressed challenges in the DA literature. We address the reviewer's concerns point by point below.
1. Lack of theoretical insights:
We thank the reviewer for pointing out that the theoretical insights were not clear enough. We have added a more detailed discussion of the benchmark's results in Section 4.1. The main points are as follows:
- On simulated data with known shifts, DA methods tend to show a significant gain on the shift they were designed for. For instance, mapping methods perform well under conditional shift but struggle under target shift, which is a known issue for this kind of method (Redko et al., 2019).
- The performance of reweighting methods is often close to the performance of the Train Src baseline on real datasets. This can be explained by the fact that the supports of the source and target distributions differ significantly; in this case, reweighting methods are known to underperform (Segovia-Martín et al., 2023).
- The performance of mapping methods is dataset-dependent, potentially due to the number of classes and the presence of target shift.
- Strongly regularized transformations are the best on average across all modalities: the four methods LinOT, CORAL, JPCA, and SA are within the top 5 in terms of average ranking. This highlights that regularized adaptation methods are robust across datasets and modalities, offering effective alignment with minimal risk of negative DA, making them reliable and safe default options.
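To make the last point concrete, below is a minimal NumPy sketch of a CORAL-style linear alignment (whiten the source features, then re-color them with the target covariance). This is an illustrative textbook version with simple ridge regularization and mean re-centering, not the benchmark's implementation, and the function name is ours.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def coral_align(Xs, Xt, reg=1e-3):
    """Align source features to the target second-order statistics (CORAL-style)."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + reg * np.eye(d)   # regularized source covariance
    Ct = np.cov(Xt, rowvar=False) + reg * np.eye(d)   # regularized target covariance
    whiten = fractional_matrix_power(Cs, -0.5)        # remove source correlations
    recolor = fractional_matrix_power(Ct, 0.5)        # impose target correlations
    return (Xs - Xs.mean(axis=0)) @ whiten @ recolor + Xt.mean(axis=0)
```

A classifier trained on the aligned source features can then be applied directly to the target data, which is what makes such regularized, closed-form transformations safe default options.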
2. Missing recent deep DA methods:
We invite the reviewer to refer to the general comment regarding the missing deep DA methods.
3. Diverse metrics:
We thank the reviewer for their suggestion. We fully agree with this statement, and this is why, in the current code design, we already log various metrics. This was not visible in the submitted manuscript; therefore, we have added a table similar to Table 2 but constructed with the F1-score, as Table 22 in the appendix. We find that the conclusions remain the same as those obtained with the accuracy score, showing the robustness of our benchmark.
We sincerely thank the reviewer for their time and effort in evaluating our work. As the discussion deadline approaches, we would greatly appreciate any feedback on our rebuttal, especially if there are remaining concerns that have not yet been addressed.
We thank the reviewer for providing these extra references and we will consider incorporating them in our benchmark in the future.
During the rebuttal, we added SPA [1], a more recent method (2023), to our benchmark to show that our methodology allows evaluating advances in UDA. Note that adding the proposed methods, which are highly tailored to CV tasks, to our multimodal benchmark is not straightforward: it requires significant effort and methodological adaptation to ensure they are fairly evaluated.
[1] Xiao et al. SPA: A Graph Spectral Alignment Perspective for Domain Adaptation. NeurIPS 2023
General comment
We thank all the reviewers for thoroughly and carefully reading our paper. Their remarks helped improve its quality. We are deeply grateful to them for acknowledging the clarity and quality of our comprehensive study that addresses a gap in the UDA literature (Reviewers fk3w, PVS6, qzy2, rwX7), noting that we provide an open-source, reproducible, and easily extendable benchmark with extensive experiments (Reviewers fk3w, qzy2, rwX7).
We first provide some clarifications on the scope of our paper and the recent deep methods below, then list our revisions to the paper and provide detailed answers to each reviewer.
Scope of the paper and recent deep methods:
Recent deep DA papers mostly focus on multi-source DA or test-time DA, which do not fit into our scope and evaluation process designed for pairwise DA. Moreover, the DA community is centered mostly around multi-domain Computer Vision problems, which has made recent classical DA methods, as studied in our benchmark, less common. With our benchmark, we offer a larger view of DA, its potential use cases and its current limitations, by broadening the application spectrum. In our opinion, our benchmark is timely as it gives an overview of the field as it stabilizes and highlights open research directions towards UDA methods that work outside CV.
With this goal in mind, we propose a reproducible benchmark over different modalities and datasets, and include pairwise DA methods that can work on all modalities. The methods suggested by the reviewers are often task-specific and cannot be extended easily to diverse modalities. However, we agree with the reviewers that our deep benchmark should be extended with more recent methods. This is why we added 4 recent deep methods to the benchmark [1, 2, 3, 4]. Besides, it should be noted that our benchmark is designed so that new methods and new datasets can be added easily, as described in the appendix (a schematic illustration is given after the references below). Note that this benchmark will also be maintained and updated in the future and will evolve with contributions from the research community, ensuring its continued relevance and utility. We believe this flexibility and long-term vision enhance the value of our work for the broader community.
[1] Xiao et al., SPA: A Graph Spectral Alignment Perspective for Domain Adaptation. In NeurIPS, 2023.
[2] Jin et al., Minimum Class Confusion for Versatile Domain Adaptation. In ECCV, 2020.
[3] Zhang et al., Bridging Theory and Algorithm for Domain Adaptation. In ICML, 2019.
[4] Long et al., Learning Transferable Features with Deep Adaptation Networks. In ICML, 2015.
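As a schematic illustration of what adding a method involves, below is a toy scikit-learn-style adapter. The class, method signatures, and the toy method itself are hypothetical and do not reflect the benchmark's actual interface, which is described in the appendix.

```python
import numpy as np
from sklearn.base import BaseEstimator

class MeanShiftAdapter(BaseEstimator):
    """Toy DA method: translate source features so their mean matches the target mean."""

    def fit(self, X_source, y_source=None, X_target=None):
        # Adaptation parameters are estimated from unlabeled target data only.
        self.shift_ = X_target.mean(axis=0) - X_source.mean(axis=0)
        return self

    def transform(self, X):
        return X + self.shift_

# A new method plugged into the benchmark would expose a similar fit/transform
# (or fit/predict) interface together with a hyperparameter grid for the nested CV loop.
```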
Updates on the PDF in blue:
We list below the updates we made to the revised version of the paper (in blue).
- Added 4 new deep DA methods. (fk3w, PVS6, qzy2, rwX7)
- Added a discussion about the impact of shifts on real datasets in Sections 4.1 and 4.2. (fk3w, PVS6)
- Clarified the caption of Figure 1 and showed the invariant subspace. (fk3w, qzy2)
- Clarified the "subspace shift" terminology. (fk3w)
- Added an F1-score table in Appendix D.4. (PVS6)
- Added a new visualization for scorer comparison in Appendix D.5. (fk3w)
We hope to have adequately answered the reviewers' concerns and we remain open to continuing this constructive discussion.
Dear Reviewers,
The authors have responded - please see if they have addressed your concerns and engage in further discussion to clarify any remaining issues.
Thanks! AC
Dear Reviewers,
If you have not already done so, please engage in discussion with the authors to clarify any remaining issues as the discussion period is coming to a close in less than a day (2nd Dec AoE for reviewer responses).
Thanks for your service to ICLR 2025.
Best, AC
This paper introduces a benchmarking study for unsupervised domain adaptation (UDA) algorithms that evaluates methods on a range of datasets spanning multiple modalities. Specific care is taken to ensure that model selection/hyperparameter tuning is done in a realistic and unsupervised fashion in accordance to the UDA setting. An extensive evaluation is done across a range of 20 shallow and 7 deep learning based UDA methods, and the source code will be released.
Key strengths of the study include a standardized benchmarking pipeline that will be open-sourced, especially the focus on ensuring unsupervised model selection, as well as the evaluation on datasets from multiple modalities. Key weaknesses include a lack of recent UDA methods and insight resulting from the benchmarking.
Overall, the paper is borderline and the AC tends to agree with the sentiment expressed by fk3w in the concluding paragraph - while this is an ambitious attempt at a benchmark which is a useful contribution, it is unclear what insight it delivers for practitioners or in driving the field forward, given the lack of comparison to state-of-the-art methods for the respective domains or substantive theoretical insight. The AC does not agree with the characterization in the authors' rebuttal "General Comment" that "the DA community is centered mostly around multi-domain Computer Vision problems" - there are many domain-specific DA methods being developed in various communities (e.g. NLP and time-series). In some sense it is because of the difficulty of developing a universal DA method that many domain-specific approaches have been introduced; from this perspective, without including such methods in the comparison, the practical value of the benchmark is diminished. On the other hand, from the perspective of advancing DA methodology there is a lack of theoretical insight (e.g. analysis of shift types and impact on methods) as identified by multiple reviewers (fk3w, PVS6, rxw7). Thus, while the AC acknowledges the value of the benchmarking framework, the paper lacks compelling insights and falls short of the bar for ICLR.
Additional Comments on Reviewer Discussion
To summarize the main points of discussion, all reviewers commented on the lack of recent methods in the comparison, and many commented on the lack of insight given the results as well as the lack of clarity in Figure 1. In response, the authors included 4 additional deep DA methods in their evaluation, added a discussion about the impact of shifts on methods, and revised the caption of Figure 1. To some extent, these revisions address the concerns of the reviewers, but there is nonetheless a lack of a comprehensive evaluation of state-of-the-art deep DA methods or a detailed analysis of the shifts in real-world datasets and how they correlate to method performance. These remaining concerns push the AC towards rejection of the paper as discussed above.
Reject