Source-Free Unsupervised Domain Adaptation with Hypothesis Consolidation of Prediction Rationale
Abstract
Reviews and Discussion
This paper presents a framework designed to address source-free domain adaptation (SFDA) problems. To introduce supervision in the absence of source data during target adaptation, the authors add a stage called 'hypothesis consolidation', which aims to provide more informative pseudo-labels for the subsequent semi-supervised learning objective.
The key idea behind this work is to first perform a pre-adaptation using nearest-neighbor geometry, use the pre-adapted model to generate the top-k hypotheses per instance, and analyze the rationale behind these hypotheses to identify the most likely 'correct' class using a technique from Weakly Supervised Object Localization (WSOL). Following the hypothesis consolidation stage, where the most likely class is determined for each instance, the authors use it to generate pseudo-labels for the subsequent semi-supervised learning.
Filtering reliable hypotheses via GradCAM scores, comparing an instance's individual class logits against the dataset-level average class score (the common rationale), and then treating instances with reliable hypotheses as the labeled set and those with unreliable hypotheses as the unlabeled set for semi-supervised learning is very interesting. However, my primary concern lies in the computational intensity of these two steps, particularly since the optimization is split into separate stages per iteration, and the performance lift is limited relative to the additional computation required.
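To make the selection logic concrete, here is a minimal numpy sketch of the reliable/unreliable split as I understand it from the paper; the function name, the use of flattened CAM vectors, and the `keep_frac` threshold are my own assumptions, not the authors' implementation.

```python
import numpy as np

def split_by_rationale(cams, top1, num_classes, keep_frac=0.5):
    """Split instances into a 'reliable' (pseudo-labeled) set and an
    'unreliable' (unlabeled) set by comparing each instance's flattened
    CAM rationale against the dataset-level mean rationale of its
    predicted class.

    cams: (N, D) flattened, L2-normalized CAM vectors, one per instance
    top1: (N,) predicted class index per instance
    """
    # Common rationale per class: mean CAM over instances predicted as that class
    common = np.zeros((num_classes, cams.shape[1]))
    for c in range(num_classes):
        mask = top1 == c
        if mask.any():
            common[c] = cams[mask].mean(axis=0)

    # L2 distance of each instance's rationale to its class's common rationale
    dist = np.linalg.norm(cams - common[top1], axis=1)

    # Keep the closest keep_frac fraction as the reliable (pseudo-labeled) set
    thresh = np.quantile(dist, keep_frac)
    reliable = dist <= thresh
    return reliable, ~reliable
```

The two boolean masks then define the labeled/unlabeled partition fed to the semi-supervised stage.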
In summary, the application of techniques like GradCAM from WSOL to address domain adaptation problems, particularly in the context of source-free domain adaptation, is interesting. However, the current manuscript's writing style poses challenges for readers to fully comprehend the proposed method. As highlighted in the Weaknesses section of my review, more details need to be included to facilitate the implementation of the proposed method for practitioners and enable reviewers to thoroughly understand and evaluate it. I recommend that the authors rewrite the paper and submit it to upcoming venues. Personally, I find the proposed method promising and believe it has the potential to make a significant impact on the Source-Free Domain Adaptation (SFDA) community. However, the current version of the paper limits its potential impact.
Strengths
The paper is well-written, presenting a clear flow for the proposed work. The use of GradCAM from weakly supervised object localization to tackle source-free domain adaptation is intriguing. Treating instances with reliable hypotheses (based on the L2 score computed from GradCAM) as the labeled set and those with unreliable hypotheses as the unlabeled set for semi-supervised learning is novel.
Weaknesses
Major points:
- The paper's motivation regarding the localization of classification predictions, particularly in the context of weakly supervised object localization and its application to source-free domain adaptation, lacks clarity. Furthermore, the use of CAM activation scores is problematic, as these scores can be significantly influenced by object scale and location. Specifically, when employing the L2 (Euclidean) distance to measure the proximity of a sample's class hypothesis to a common class rationale, the potential impact of domain divergence on object scale and location is not effectively addressed. Consequently, the L2 distance might not accurately assess the correctness of a hypothesis: samples of the same class can yield a very large L2 distance purely due to variations in object scale and location (or misaligned classes can yield a small one). To strengthen the motivation, it is crucial to explain in detail why localization scores are beneficial for improving classification performance, especially in the context of cross-domain classification.
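The scale/location sensitivity described in this point is easy to demonstrate: two synthetic activation maps for the same class, identical except for the object's position, are already nearly maximally far apart in L2. This is a toy numpy illustration of the reviewer's concern, not the paper's code; the blob parameters are arbitrary.

```python
import numpy as np

def blob(h, w, cy, cx, sigma):
    """A 2D Gaussian 'activation map' centered at (cy, cx), L2-normalized."""
    ys, xs = np.mgrid[0:h, 0:w]
    m = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return m / np.linalg.norm(m)

# Same class, same blob shape, different object location on a 14x14 map:
a = blob(14, 14, 4, 4, sigma=2.0)
b = blob(14, 14, 9, 9, sigma=2.0)

# With little overlap, the distance approaches sqrt(2), the maximum
# possible for two non-negative unit-norm maps.
print(np.linalg.norm(a - b))
```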
- The authors have not provided any information regarding the reproducibility of their work. No code is available in either the Supplementary Materials or an anonymous GitHub repository. Given that their work involves intricate details that need to be comprehended alongside the code implementation (such as (a) a comprehensive explanation of the weak and strong augmentations used for semi-supervised learning during adaptation, especially considering the authors' claim of utilizing FixMatch for semi-supervised learning; (b) a mere citation to another existing work may not offer a clear understanding of the exact strategy employed in the proposed work), the absence of code prevents me from assessing the validity of the contributions claimed by the authors. Additionally, ICLR this year recommends that authors include a statement regarding reproducibility (https://iclr.cc/Conferences/2024/AuthorGuide). However, I could not find such a statement in the submitted work.
- My main concern is the computational complexity of the proposed method. The authors mention that the objectives are not jointly optimized but are divided into three stages for each iteration. More significantly, at each stage the bank needs to be updated for smooth prediction and semi-supervised learning, which further increases the computational complexity. This implies that, compared with other Source-Free Domain Adaptation (SFDA) methods such as [1], the computational complexity is at least two times higher. I would appreciate an ablation study that analyzes the computational complexity and compares it with other SFDA methods. If adopting the proposed model requires twice the computational time for only a limited performance lift (Office-Home 0.1%, VisDA 0.8%), it raises concerns about the practical value of the work for practitioners. More importantly, the authors highlight in ablation study 4.4 that when the pre-adaptation step is replaced with the existing method (to be honest, I believe the pre-adaptation strategy the authors use is the same as that of [1]), the performance gains were minimal: less than 0.4% on VisDA-2017. This is noteworthy considering the substantial increase in computational requirements. To illustrate, computing the rationale for each instance using GradCAM entails H×W operations per hypothesis. Assuming k = 5 for the top-k hypotheses, the total for identifying the rationales alone becomes N×5×H×W. Moreover, computing the common rationale for each class is computation-intensive, as all samples in the target domain are involved in this calculation. The additional computational load of semi-supervised learning is also considerable. In my assessment, a 0.4% performance increase does not justify the roughly fivefold increase in computation.
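The back-of-envelope count in this point can be made explicit. The numbers below are illustrative assumptions (a VisDA-2017-scale target set, a ResNet-style 7×7 CAM resolution, k = 5), not figures from the paper.

```python
def rationale_ops(n_samples: int, k: int, h: int, w: int) -> int:
    """Map entries computed per pass over the target set: one h*w
    GradCAM map per hypothesis, k hypotheses per instance."""
    return n_samples * k * h * w

# Assumed setup: ~55k target images (VisDA-2017 scale), 7x7 CAMs, k = 5.
print(rationale_ops(55_000, 5, 7, 7))  # N * k * H * W map entries per pass
```

Because the common rationale depends on the current model parameters, this full pass would have to be repeated whenever the rationales are refreshed, which is the source of the overhead the review questions.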
- In Equation (1), the use of two different metrics to measure sample-to-sample differences raises questions. Specifically, the authors employ KL-divergence to quantify the difference between queries and their neighbors, while using cosine similarity to measure the difference between queries and their "z-samples." It would help readers if the authors provided a detailed explanation or rationale for using different metrics in these two cases.
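For readers weighing this point, the behavioral difference between the two metrics is easy to see in code. This is a generic illustration of KL-divergence versus cosine similarity, not a reconstruction of the paper's Eq. (1).

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two probability vectors (asymmetric)."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def cos(a, b):
    """Cosine similarity between two vectors (symmetric, scale-invariant)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

# KL compares probability outputs and is asymmetric in its arguments;
# cosine compares directions (e.g. of features) and ignores magnitude.
print(kl(p, q), kl(q, p), cos(p, q))
```

One plausible reading is that KL is natural for comparing prediction distributions while cosine is natural for comparing feature embeddings, but the paper should state its reasoning explicitly.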
- The related work section requires major revision. The current content on UDA is not pertinent to the proposed work. The revised related work should cover both weakly supervised object localization and semi-supervised learning, as these are the two topics most closely related to the proposed work.
Minor points:
- The abstract and introduction written by the authors bear a striking resemblance to [1] (especially the first two paragraphs of the introduction). I recommend a comprehensive revision of both sections to ensure that they distinctly emphasize the contributions of the proposed work. Additionally, it would be beneficial to explicitly highlight the advantages and distinctions of the proposed approach in comparison to existing methods.
- Regarding Figure 1, I have concerns about the interpretation. On the left side, where correctly predicted samples are depicted, the heat map appears to focus on the keyboard even when the label is laptop. It is unclear why the authors consider this GradCAM output suitable for serving as the rationale of correctly predicted samples. Furthermore, the figure lacks details to facilitate understanding. It is unclear whether the map presented is an attention map, a heat map, or another map proposed in GradCAM. It is not advisable for the authors to assume that readers understand the intricacies of GradCAM without providing sufficient details.
- In Appendix B, the authors conducted t-SNE visualization using a very small dataset, DN-126. I am skeptical that performing t-SNE visualization on such a small dataset will effectively illustrate or validate any meaningful insights. I recommend providing t-SNE visualization on a larger-scale domain adaptation dataset, such as VisDA-2017, for more robust validation and meaningful interpretation.
- Certain variables are ambiguously defined. For instance, it is unclear on which layer's output the bank is built. The meaning of "d" in the bank (memory) formulation is also unclear to me: does it refer to the ResNet output (2048), the output of the first classifier layer (256), or the prediction output of the classifier (number of classes)? This lack of clarity is particularly concerning given that the authors have not provided their code with the submission, leaving reviewers with no means to gain a detailed understanding of the proposed method.
- After Equation (1), the authors mention that "p" denotes the posterior of the model. It is crucial to use statistical terms accurately, and in this context, understanding the prior distribution associated with the posterior is essential. Therefore, the authors should elaborate on the prior distribution linked to the posterior "p." Providing clarity on the prior distribution would enhance the understanding of the statistical context and the underlying assumptions of the model.
- The terms "z-nearest neighbor" and "z-samples" are not commonly used in the machine learning community, and their meaning is unclear to me. I recommend providing an explanation or definition before using these terms to ensure clarity for readers.
- The authors should provide a detailed explanation of the source pretraining methodology in their method section. Notably, there are SFDA works, such as [2], that emphasize source pretraining for improved domain alignment. Moreover, considering that not all readers may be acquainted with Source-Free Domain Adaptation (SFDA), it is imperative to elucidate how the authors pre-trained the model on the source domain. During my initial review of the paper, I mistakenly believed the authors performed one stage of target adaptation before source pretraining, raising concerns about potential information leakage. However, upon closer inspection, I realized that the authors omitted the source pretraining stage in the paper, commencing directly with the target adaptation stage.
[1] Shiqi Yang, Yaxing Wang, Joost Van De Weijer, Luis Herranz, and Shangling Jui. Generalized source-free domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8978–8987, 2021b.
[2] Jiahua Dong, Zhen Fang, Anjin Liu, Gan Sun, and Tongliang Liu. Confident anchor-induced multi-source free domain adaptation. In Advances in Neural Information Processing Systems, volume 34, pp. 2848–2860, 2021.
Questions
- It appears that the common rationale is computed for each class across all samples in the target domain. Given that the common rationale is determined by the current model parameters (meaning it will be re-calculated at each iteration), I am curious about the computational requirements of this step. From my understanding, computing the common rationale for each class can be resource-intensive, particularly when dealing with a large target domain and a considerable number of classes.
- The decision not to use a consistent distance metric for neighbors and z-samples raises questions. It would be beneficial for readers if the authors explained the reasoning behind this choice. Clarifying why different distance metrics are employed in these two cases would enhance understanding and transparency in the methodology.
Details of Ethics Concerns
None
In this paper, the authors investigate source-free UDA (SFUDA) with multiple hypotheses. They exploit multiple hypotheses to obtain an optimal label set and transform SFUDA into a semi-supervised learning task.
Strengths
- To the best of my knowledge, this paper is the first attempt to use multiple source hypotheses in SFUDA and to transform SFUDA into a semi-supervised learning task. It provides a novel view and framework for SFUDA.
Weaknesses
- The motivation for the multiple-hypotheses setting is not clear. These hypotheses are pre-trained on the same domain instead of multiple domains as in previous works. Is this setting common in practice? I think multiple high-quality hypotheses on one specific domain are not easy to obtain, and repeatedly training these hypotheses seems wasteful. Overall, the motivation should be further clarified, and providing several practical cases would be better.
- Assuming the above issue can be explained, how should the best hyper-parameter be chosen for another task? It seems to be obtained from the DomainNet-126 dataset, but this hyper-parameter may not be suitable for other datasets. The same issue exists for the other hyper-parameter choices.
- Eq. (4) seems to be wrong: in the latter term, one operand is an integer while the other is a vector.
- The performance of the proposed method is not state-of-the-art on some sub-tasks.
Questions
- According to Table 5, more hypotheses degrade the method's performance. Intuitively, more hypotheses should lead to better performance, because more candidates have a greater chance of containing the true pseudo-label. Can you explain this counter-intuitive phenomenon?
This paper focuses on source-free domain adaptation and proposes to consider multiple prediction hypotheses for each sample and investigate the rationale behind each hypothesis. By consolidating these hypothesis rationales, it identifies the most likely correct hypotheses, which are then used as a pseudo-labeled set to support a semi-supervised learning procedure for model adaptation. To achieve the optimal performance, it proposes a three-step adaptation process: model pre-adaptation, hypothesis consolidation, and semi-supervised learning. Extensive experiments demonstrate that the proposed method achieves good source-free domain adaptation performance.
Strengths
- a. It proposes to consider multiple prediction hypotheses for each sample and investigates the rationale behind each hypothesis.
- b. It generates pseudo-labels by consolidating hypothesis rationales and proposes a three-step adaptation process.
- c. Experimental results show that the proposed method achieves promising performance in domain adaptation.
Weaknesses
- The proposed pseudo-label consolidation method seems similar to HCL (Huang et al., 2021), which also aggregates predictions from multiple models/hypotheses to regularize or generate the final pseudo-labels. Please discuss the differences, advantages, and disadvantages of HCL and the proposed method.
- The paper "Temporal ensembling for semi-supervised learning" also aggregates predictions from multiple models/hypotheses for better self-training. Please discuss the differences, advantages, and disadvantages of this method and the proposed method.
- In Section 4.3, the paper discusses the impact of k, the number of prediction hypotheses per instance. It would be better to provide some insights and analysis to illustrate why a large k leads to degraded performance.
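For context on the comparison requested above, temporal ensembling maintains an exponential moving average of each sample's past predictions and uses the bias-corrected average as its soft target. A minimal numpy sketch (function and variable names are my own):

```python
import numpy as np

def temporal_ensemble_update(Z, preds, t, alpha=0.6):
    """One epoch of the temporal-ensembling target update.

    Z:     (N, C) running EMA of per-sample predictions (initialized to zeros)
    preds: (N, C) this epoch's softmax predictions
    t:     1-based epoch index, used for startup bias correction
    Returns the updated EMA and the bias-corrected soft targets.
    """
    Z = alpha * Z + (1 - alpha) * preds
    targets = Z / (1 - alpha ** t)  # correct the zero-initialization bias
    return Z, targets
```

The key contrast with the paper under review is that temporal ensembling aggregates predictions of one model across training time, whereas the submission consolidates multiple hypotheses of the current model at a single point in time.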
- Does the computational overhead increase with k, the number of prediction hypotheses per instance? If so, please provide analysis and discussion.
Questions
see Weaknesses
This paper proposes an approach that considers multiple prediction hypotheses for each sample to select reliable pseudo-labels, and casts the problem as a semi-supervised learning problem.
Strengths
- This method is intuitive and easy to follow.
Weaknesses
- The overall idea, which selects reliable pseudo-labels and converts the problem into a semi-supervised learning paradigm, is not novel; see, e.g., "Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer."
...
- To validate the effectiveness of the proposed reliable-set selection strategy, it should be compared with more selection strategies from other literature.
- How did you design the pre-adaptation loss, and why does it work in this problem? Have you tried any other loss functions?
Questions
see weakness
The authors propose to address source-free UDA mainly by combining two techniques: i) hypothesis consolidation, which is the major technical contribution, and ii) semi-supervised training. The goal of hypothesis consolidation is to obtain reliable pseudo-labels for semi-supervised training, and the rationale is that a top category prediction should be considered unreliable if the rationale, i.e., the attended region that drives the prediction, is not typically or commonly observed for this category. The experimental results are extremely convincing.
Strengths
- The paper is easy to follow.
- The idea of selecting reliable pseudo-labels for semi-supervised training through hypothesis consolidation is novel and quite reasonable.
- The performance improvements are decent. The ablation studies are convincing to justify the technical contributions.
Weaknesses
- Some parts are unclear.
a. Eq. 2 is a little unclear to me. You may want to describe it in more detail so that readers without knowledge of GradCAM can follow it more easily.
b. "We aim to select the most reliable hypothesis rather than correcting hypotheses." needs more clarification.
- Some additional results are expected. In particular, it is highly expected to show the accuracy of pseudo-labels before/after the hypothesis consolidation.
- It is highly expected to explain why multiple applications of HCPR lead to decreased performance. It would be confusing if recursive HCPR and an increased K lead to decreased performance while HCPR itself is helpful.
- The rationales of hyper-parameter settings are mostly unclear.
a. The K for the top hypotheses is important, and more discussion of its influence is expected, because the success of hypothesis consolidation assumes that the top-K predictions perform much better than the top-1. It is also expected to report the top-k accuracy to directly justify that consolidating the top-k is reasonable.
b. More importantly, it is unexplained why the performance decreases as k increases.
c. Also, why is z = 3 selected?
Questions
see above
Details of Ethics Concerns
None