Source-Free Target Domain Confidence Calibration
We present a confidence calibration method for a source-free domain adaptation setup
Abstract
Reviews and Discussion
The author considers the source-free calibration problem. Due to the absence of labeled data, traditional calibration methods cannot be used. The author therefore addresses the problem by leveraging pseudo labels generated from the source model's predictions to estimate the true, unobserved accuracy. Finally, the author verifies the effectiveness of the method on various datasets.
Strengths
- The calibration problem is important.
- The proposed method is effective.
Weaknesses
- The paper is hard to read. The author should split the chapters to make them easier to read, for example, the third section should be split appropriately.
- I am a little confused about this setting, i.e., calibration via the unlabelled data from the target domain. In what scenarios would this setting be used? TransCal [a] realizes calibration via the labeled source domain data. The author may discuss the differences and applications between the two settings.
- The pseudo label has been explored thoroughly in semi-supervised learning. Simply transferring it to the calibration lacks novelty.
- The author optimizes the temperature via the adaECE loss. For a fair comparison, the author should report other evaluation metrics such as SCE.
- line 309. "we followed the evaluation protocol described in the TransCal Paper, which involves splitting each target domain into 80% for training and 20% for validation". It seems that TransCal split the training set and validation set on the source domain.
minor: I think the author should draw a main figure to describe the basic setting of the calibration.
[a] Transferable Calibration with Lower Bias and Variance in Domain Adaptation
Questions
See Weaknesses.
Thank you for your comments and feedback.
Q1. The paper is hard to read. The author should split the chapters to make them easier to read, for example, the third section should be split appropriately.
A. In the revised version we'll rearrange the paper structure to improve readability.
Q2. I am a little confused about this setting, i.e., calibration via the unlabelled data from the target domain. In what scenarios would this setting be used? TransCal realizes calibration via the labeled source domain data. The author may discuss the differences and applications between the two settings.
A. The distinction between SFCC and TransCal lies in the fact that TransCal utilizes labeled source data for model calibration. In contrast, our approach tackles a more challenging calibration scenario where source domain data is entirely inaccessible, often due to privacy concerns. As a result, TransCal cannot be applied in this context.
Q3. The pseudo label has been explored thoroughly in semi-supervised learning. Simply transferring it to the calibration lacks novelty.
A. As Fig. 4a shows, pseudo labels (PLs) are usually very noisy (roughly 20% error), so semi-supervised methods cannot simply treat them as true labels and have to filter out samples, usually based on prediction confidence. The main observation of our study is that, in spite of the high noise level of EPLs, we can use them directly for calibration, because the network accuracy estimated from the pseudo labels is very close to the true accuracy (as shown in Fig. 4b). See also our joint answer to all the reviewers.
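For concreteness, the following minimal sketch (illustrative only, not our exact implementation) shows the comparison behind Fig. 4b. It assumes NumPy arrays of confidences, predicted labels, and labels (either true labels or EPLs), and uses equal-width bins for simplicity, whereas the paper uses adaptive bins.

```python
import numpy as np

def binwise_accuracy(confidences, predictions, labels, n_bins=15):
    """Per-bin accuracy of a model; `labels` may be ground-truth labels or
    (enhanced) pseudo labels. The observation is that the two estimates match closely."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    accs = np.full(n_bins, np.nan)
    for i in range(n_bins):
        in_bin = (confidences > edges[i]) & (confidences <= edges[i + 1])
        if in_bin.any():
            accs[i] = np.mean(predictions[in_bin] == labels[in_bin])
    return accs

# Hypothetical usage, mirroring Fig. 4b:
# acc_true   = binwise_accuracy(conf, preds, true_labels)
# acc_pseudo = binwise_accuracy(conf, preds, enhanced_pseudo_labels)
# mean_gap   = np.nanmean(np.abs(acc_true - acc_pseudo))   # about 4% on average in our runs
```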
Q4. The author optimizes the temperature via the adaECE loss. For a fair comparison, the author should report other evaluation metrics such as SCE.
A. Thank you for the feedback. In the updated version, we include additional evaluation metrics such as NLL, Brier score and SCE (see A.2). The results show that SFCC outperforms other calibration methods on the new evaluation metrics.
Q5. line 309. "we followed the evaluation protocol described in the TransCal Paper, which involves splitting each target domain into 80% for training and 20% for validation". It seems that TransCal split the training set and validation set on the source domain.
A. Thank you for pointing this out; it is indeed a misleading explanation, and we will revise it in the updated version. What we intended to convey is that we applied the same data split on the target domain as was done on the source data in the TransCal paper.
As suggested, in A.1 we added a scheme that describes the components and the flow of our method.
Thanks for your feedback. For weakness 2: my concern is that we do not need much data to achieve calibration (only a single parameter has to be fitted). In practice, we can label some data manually and use it for calibration. For example, for each class we can simply label one to three samples, which I think will not add much cost. In comparison, semi-supervised learning assumes that we have a lot of unlabelled data, which is why researchers explore pseudo-label methods to avoid the cost of labeling. Therefore, is it necessary to explore source-free calibration?
For weakness 3: the observation is interesting. However, the solution is still a pseudo-label-based method (one that does not filter out samples via confidence). I do not think this degenerate pseudo-label-based method (directly using the pseudo labels for calibration without considering whether they are right or not) is novel enough or contributes much to the community.
Thank you for your feedback. We believe that exploring unsupervised source-free calibration is crucial, as labeling even a small number of examples can be prohibitively expensive in some cases (e.g., X-ray images). In our paper, we provided examples of articles on UDA/SFDA calibration to highlight the growing interest of the research community in these areas.
It’s worth noting that if some labeled examples are available, the adaptation process becomes much simpler. In addition, while our method optimizes only one parameter, the limited information carried by each image means that a substantial number of examples is required to identify the appropriate calibration temperature. In Section A.7, we demonstrated that more than three labeled examples per class are needed to outperform SFCC. For instance, on the DomainNet40 dataset with the SFDA method DCPL, source domain C and target domain R, approximately 120 labeled examples per class are required!
Thanks for your feedback. Part of my concerns have been solved, so I raised my score from 3 to 5. The reason for not improving the score further is the concern about the technique's novelty.
The author proposes a temperature-scaling-based calibration method for SFDA that leverages pseudo labels.
Strengths
The idea of calibrating the confidence of a model is interesting.
Weaknesses
- Why do you think the noisy labels would have the same effect as the correct labels in calibration? Is there any empirical observation or theoretical guarantee? I doubt this motivation.
- In Eq. 4, what is the difference between the pseudo label and the predicted label?
- Using the pseudo labels in place of the true labels is not suitable, as there could be a big gap between them (you do not even know how reliable the pseudo labels are). The accuracy estimated from the pseudo labels would not be equal to the true accuracy in this case. Also, the step from the left side to the right side of Eq. 6 should not hold.
- The result in Fig. 4b may be due to the fact that you are using a strong pre-trained network, so it should not be taken as a general empirical observation.
- Are the bins divided manually, with a fixed number of bins?
Questions
See Weaknesses.
Thank you for your comments and feedback.
Q1. Why do you think the noisy label would have the same effect as the correct label in calibration? Is there any empirical observation or theoretical guarantee? I doubt this motivation.
A. Fig. 4b shows exactly this: for all the source-target pairs we checked and for all confidence bins, the accuracy estimated from the pseudo-labels is very close to the true accuracy (see also the joint answer to all reviewers).
Q2. In Eq. 4, what is the difference between the Pseudo label and the predicted label?
A. A pseudo label is obtained by applying the source model to a sample from the target domain. The Enhanced Pseudo Label is generated by combining a robust pre-trained feature extractor with the pseudo label. The predicted label represents the model's prediction after adapting it to the target domain.
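To make the distinction concrete, the sketch below illustrates the general idea of centroid-based pseudo-label refinement with a strong pre-trained feature extractor. This is only an illustration (the exact GEPL procedure is given in Algorithm 1 of the paper), and the function and variable names are hypothetical.

```python
import numpy as np

def enhanced_pseudo_labels(features, source_probs, n_iters=2):
    """Illustrative centroid-based refinement of pseudo labels (not the exact GEPL algorithm).

    features:     (N, D) L2-normalized features from a strong pre-trained extractor
    source_probs: (N, C) softmax outputs of the source model on the target samples
    Returns refined pseudo labels of shape (N,).
    """
    n_classes = source_probs.shape[1]
    weights = source_probs                        # soft weights seed the initial centroids
    labels = source_probs.argmax(axis=1)
    for _ in range(n_iters):
        # Class centroids in the pre-trained feature space
        centroids = (weights.T @ features) / (weights.sum(axis=0)[:, None] + 1e-8)
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-8
        # Reassign every sample to its nearest centroid (cosine similarity)
        labels = (features @ centroids.T).argmax(axis=1)
        weights = np.eye(n_classes)[labels]       # hard assignments for the next round
    return labels
```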
Q3. Using the pseudo labels in place of the true labels is not suitable, as there could be a big gap between them (you do not even know how reliable the pseudo labels are). The accuracy estimated from the pseudo labels would not be equal to the true accuracy in this case. Also, the step from the left side to the right side of Eq. 6 should not hold.
A. Indeed, even enhanced pseudo-labels can be very noisy (see Fig. 4a). Our main novel observation is that the EPL label noise follows a specific pattern such that, despite the high noise level, estimating the bin-wise accuracy of the adapted model on the target domain using either the true labels or the pseudo-labels yields very similar results. Figure 4b provides empirical evidence for this observation.
Q4. The reason in Fig. 4b may be the fact that you are using a strong pre-trained network, which should not be concluded as a common empirical observation.
A. We do use a strong pre-trained network to extract more accurate pseudo labels. However, as Fig. 4a shows, these pseudo labels are still very noisy. The main observation of this study, shown in Fig. 4b, is that in spite of the high noise level of EPLs, we can use them directly in the calibration process.
Q5. Are the bins divided manually with a fixed number?
A. Following previous works, we used a fixed number of bins (15). The AdaECE bin boundaries are set so that each bin contains the same number of points.
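For clarity, here is a minimal sketch of this equal-mass binning (illustrative only; it assumes NumPy arrays of confidences and a 0/1 array marking correct predictions):

```python
import numpy as np

def ada_ece(confidences, correct, n_bins=15):
    """Adaptive ECE: bin boundaries are chosen so that every bin holds roughly
    the same number of samples, rather than spanning equal confidence widths."""
    order = np.argsort(confidences)
    conf_sorted, correct_sorted = confidences[order], correct[order]
    err = 0.0
    for conf_bin, ok_bin in zip(np.array_split(conf_sorted, n_bins),
                                np.array_split(correct_sorted, n_bins)):
        if len(conf_bin):
            err += (len(conf_bin) / len(confidences)) * abs(ok_bin.mean() - conf_bin.mean())
    return err
```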
This paper studies the model calibration problem in the context of source-free domain adaptation (SFDA). The authors claim to propose the first source-free model calibration approach by using only pseudo labels of target-domain data. Specifically, a method called Source Free Confidence Calibration (SFCC) is designed as the solution. SFCC consists of two steps: first using a clustering-based strategy to refine the pseudo labels, and then applying temperature scaling on the refined pseudo-labeled target data. Experiments on three SFDA datasets demonstrate that SFCC outperforms existing calibration approaches.
Strengths
(+) Both the source-free domain adaptation and the model calibration are significant for improving the robustness and generalization of models in real-world scenarios with complex data distributions. Therefore, it is meaningful to study the problem of source-free model calibration.
(+) The background presentation is comprehensive enough to introduce the investigated problem.
(+) It is good to see that code implementation is provided, which is helpful.
Weaknesses
(-) The technical novelty is not clear. The GEPL method in Algorithm 1 has been widely used in source-free domain adaptation (SFDA) since the pioneering SFDA work SHOT [1] used the iterative version of GEPL to improve the pseudo-label quality. In addition, Algorithm 2 only applies the widely used model calibration method Temperature Scaling [2] to pseudo-labeled target-domain data. Therefore, it seems the technical novelty of this paper is very limited due to the use of many existing techniques without proposing a new one.
(-) The presentation is hard to understand, especially the methodology introduced in Lines 191-208. It is confusing and weak to claim that two versions of A_{i, 1} are equal by definition. By which definition? The presentation from Line 198 to Line 215 is based only on assumptions, without any theoretical or generalized empirical guarantee as support. The same holds for the proposed SFCC method: although the experimental results are impressive, it is unknown why SFCC does well and how it generalizes.
(-) More experiments and ablations are required. First, only three SFDA methods are not enough. Since the claim of this submission is a calibration method for SFDA, this would be sufficient only if the selected three SFDA methods (SHOT, AaD, and DCPL) fully represented existing SFDA approaches; however, is that the case? Second, it is required to do ablations on GEPL with other pseudo-label improvement techniques, such as the common thresholding-based method. Third, other calibration error metrics mentioned in TransCal [3], beyond ECE, should be reported, because ECE can sometimes be misleading. In addition, what is the accuracy of all SFDA models?
References
[1] Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. ICML 2020.
[2] On calibration of modern neural networks. ICML 2017.
[3] Transferable calibration with lower bias and variance in domain adaptation. NeurIPS 2020.
Questions
Please refer to Weaknesses.
Thank you for your comments and feedback.
Q1. The technical novelty is not clear.
A. See the joint answer to all reviewers.
Q2. The presentation is hard to understand, especially the methodology introduced in Lines 191-208. It is confusing and weak to claim that two versions of A_{i, 1} are equal by definition. By which definition? The presentation from Line 198 to Line 215 is based only on assumptions, without any theoretical or generalized empirical guarantee as support.
A. The definitions of the two versions of A_{i, 1} are identical, since they both count the number of points in the bin that satisfy the same condition. The claims in lines 198-215 are empirically justified across hundreds of source-target pairs in Fig. 4b.
Q3.1: Only three SFDA methods are not enough. Since the claim of this submission is a calibration method for SFDA, this would be sufficient only if the selected three SFDA methods (SHOT, AaD, and DCPL) fully represented existing SFDA approaches; however, is that the case?
A: We selected three SFDA methods, SHOT, AaD and DCPL, which represent three different approaches with respect to our generated pseudo-labels and are therefore relevant to test with our calibration method. The first, DCPL, uses the same pseudo-labels (EPL) without changing them during training. The second, SHOT, uses pseudo-labels that rely only on the source model and not on a strong pre-trained network (and that are also adapted during training). The third, AaD, does not use pseudo labels at all during target adaptation training.
Q3.2: It is required to do ablations on GEPL with other pseudo-label improvement techniques, such as the common thresholding-based method. Also, other calibration error metrics mentioned in TransCal [3], beyond ECE, should be reported, because ECE can sometimes be misleading.
A: Thank you for your feedback. In the appendix of the revised version (A.2), we have included additional evaluation metrics, such as NLL and the Brier score, which were used in the TransCal paper. Regarding your question about improving pseudo labels with other techniques: we experimented with different feature extractors and present the results from the best-performing one. If we understand correctly, common thresholding-based methods remove examples the model is uncertain about. We chose not to pursue this approach because we believe that excluding uncertain examples would be detrimental to the calibration process, as they are essential for it.
Q3.3: In addition, what is the accuracy of all SFDA models?
A. In section A.3 we have added tables that present the accuracy of all SFDA models.
Although I appreciate the efforts in covering new results of other metrics and reporting accuracy results of all SFDA models, I still have concerns about the technical novelty and presentation. Notably, other reviewers also show such concerns. Therefore, I would like to keep my initial rating of this submission. Thank you.
This paper addresses the calibration of model confidence in source-free domain adaptation (SFDA) scenarios, where only unlabeled target data is accessible. The authors introduce Source-Free Confidence Calibration (SFCC), a method that combines a pre-trained feature extractor with a deep clustering approach to improve pseudo-label accuracy and uses temperature scaling to achieve calibration in SFDA. Experiments on benchmarks like VisDA, DomainNet, and Office-Home suggest that SFCC performs comparably to, or better than, some existing methods that rely on source data.
Strengths
- The paper is structured clearly, with step-by-step explanations that make it easy to follow. For example, the authors use experimental observations to explain why temperature scaling is suitable for SFDA problems even without clean target labels.
- The study focuses on a real-world challenge in SFDA, where source data may be inaccessible due to privacy or storage issues, and calibration is particularly relevant.
- Extensive testing across different datasets and adaptation methods validates the effectiveness of the proposed method.
Weaknesses
- Limited Novelty. The novelty of this work is marginal. Calibration approaches for SFDA, particularly temperature scaling, are already covered in previous research [1]. A more extensive literature review and in-depth comparisons with related methods are needed to clarify the contribution and differentiate this work.
- External Feature Extractor. The use of an external feature extractor for pseudo-labeling in SFDA is not particularly innovative, as similar approaches are seen in prior work [2,3]. Additional discussion on the practicality and computational efficiency of this technique would strengthen the paper.
- Experimental Results. The experimental analysis could be more comprehensive and further discussed. Please refer to the Questions.
- (Minor) Theoretical Insight. The use of noisy pseudo-labels to estimate bin-wise accuracies is based on empirical results without theoretical insight, which limits the rigor of the approach.
[1] Dapeng Hu, Jian Liang, Xinchao Wang, Chuan-Sheng Foo. PseudoCal: A Source-Free Approach to Unsupervised Uncertainty Calibration in Domain Adaptation.
[2] Idit Diamant, Idan Achituve, Arnon Netzer. De-Confusing Pseudo-Labels in Source-Free Domain Adaptation. ECCV 2024.
[3] Wenyu Zhang, Li Shen, Chuan-Sheng Foo. Rethinking the Role of Pre-Trained Networks in Source-Free Domain Adaptation. ICCV 2023.
Questions
Most of my concerns are centered around the experiments and methodology:
- Will the proposed enhanced pseudo-labelling (EPL) method also benefit the SFDA performance, compared to the baseline methods, SHOT and DCPL?
- For calibration, how was the bin number decided? Would variations in bin number affect the calibration outcomes?
- In Figures 1 and 2, are the shown results recorded at the end of the adaptation process? How do these indicators evolve during adaptation, and what accuracy levels are associated with the displayed calibration levels?
- For temperature scaling, is the optimal temperature calculated on the entire dataset or per mini-batch?
- Clarifications Needed:
  - In lines 504-510, the argument about classifying difficulty lacks clarity: how does this paragraph validate that claim?
  - Does Figure 4b account for outliers, and could they impact the interpretability of the results?
  - Figure 3b appears to support Equation (6), but the connection of Eq. (6) to Figure 4b is unclear to me. Please correct me if I've misunderstood anything.
Thank you for your comments and feedback.
Regarding the novelty, please refer to our response to all reviewers.
We have included in A.5 a comparison with PseudoCal on the VisDA data and show that our method works better. In the final version we will include PseudoCal results for all the tasks.
Q1. Will the proposed EPL method also benefit the SFDA performance, compared to the baseline methods, SHOT and DCPL?
A. The EPL method can indeed contribute to SFDA performance. EPL was introduced in the Co-learning paper (Zhang et al., 2023) and was also applied in DCPL to extract better pseudo-labels. Both Co-learning and DCPL showed the contribution of EPL to performance compared to SFDA baselines such as SHOT and AaD.
Q2. For calibration, how was the bin number decided? Would variations in bin number affect the calibration outcomes?
A. We selected the number of bins based on those introduced in the compared methods: CPCS, TransCal, and UTDC. Moreover, we experimented with different bin counts and included an example of the results in A.4.
Q3. Do Figs. 1 and 2 show results recorded at the end of the adaptation process?
A. Yes, in all analyses and results related to the adapted model, we used the model obtained at the end of the adaptation process.
Q4. For temperature scaling, is the optimal temperature calculated on the entire dataset or per mini-batch?
A. The optimal temperature was calculated on the entire validation set of the target domain by minimizing the adaECE or ECE.
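For concreteness, a minimal sketch of this step (illustrative only: the function names are hypothetical, it assumes NumPy arrays of validation logits and enhanced pseudo labels, and it uses a simple grid search over plain ECE, whereas the paper minimizes the adaECE or ECE with its own optimization):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ece(confidences, correct, n_bins=15):
    """Standard expected calibration error with equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            err += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return err

def fit_temperature(logits, pseudo_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Grid-search the temperature minimizing the calibration error on the whole
    target-domain validation set, with pseudo labels standing in for the missing ground truth."""
    preds = logits.argmax(axis=1)                 # argmax is unchanged by temperature scaling
    correct = (preds == pseudo_labels).astype(float)
    errors = [ece(softmax(logits / T).max(axis=1), correct) for T in grid]
    return grid[int(np.argmin(errors))]
```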
Q5. Clarifications Needed:
Q6.1: In lines 504-510, the argument about classifying difficulty lacks clarity—how does this paragraph validate that claim?
A. We aim to demonstrate that examples with incorrect pseudo-labels are inherently difficult to classify. To quantify the difficulty of an example, we propose measuring the difference between the distance to the nearest center and the distance to the second-nearest center. Intuitively, easily classifiable examples tend to have one dominant prediction, meaning the nearest center is significantly closer than all others. We confirmed that, in cases where pseudo-labels were incorrect, there were typically at least two comparably plausible predictions, unlike examples with correct pseudo-labels.
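As a concrete illustration of this margin measure (hypothetical names; features and class centroids are assumed to be NumPy arrays in the pre-trained feature space):

```python
import numpy as np

def classification_margin(features, centroids):
    """Per-sample margin: distance to the second-nearest class centroid minus the
    distance to the nearest one. Small margins indicate hard-to-classify examples
    with at least two comparably plausible predictions."""
    # Pairwise Euclidean distances, shape (N, C)
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    two_nearest = np.sort(dists, axis=1)[:, :2]
    return two_nearest[:, 1] - two_nearest[:, 0]
```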
Q6.2: Does Figure 4b account for outliers, and could they impact the interpretability of the results?
A: We are not sure we fully understand the question. In Fig. 4b, there were several outlier source-target pairs for which the accuracy estimation error was relatively large, but in the vast majority of cases the accuracy estimated using EPL was very close to the true accuracy.
Q6.3: Figure 3b appears to support Equation (6), but the connection of Eq. (6) to Figure 4b is unclear for me. Please correct me if I've misunderstood anything.
A. Figure 4b shows that although EPL is very noisy (Fig. 4a), the difference between computing the adapted network's accuracy using the correct labels and estimating it using EPL is very small. This is the main empirical justification of our method, presented in Eq. 6.
Thanks for the authors’ responses to my questions.
Regarding my question Q5.2, my concern is that, while the average bin-wise accuracy estimation error across all S-T pairs is relatively low (approximately 4% for the EPL method), there are still several outliers with larger errors ranging from 15% to 25%. As for Q5.3, based on my understanding, Eq. (6) describes that, for data points with incorrect pseudo-labels within the same bin (i.e., at the same predictive confidence level), the number of points whose self-prediction is correct should be roughly equal to the number whose self-prediction matches their incorrect pseudo-label. However, I find that Fig. 4(b) does not convincingly support this claim (which is made in lines 204-208 of the latest updated manuscript).
Additionally, my primary concerns about the technical novelty and the use of external pre-trained models and additional knowledge have not been fully addressed (or mentioned).
Lastly, a minor suggestion for the authors is to mark the changes made to the manuscript in a different color. This would be helpful for reviewers to easily track the revisions made during the rebuttal process.
The novelty of our approach: In the case of semi-supervised and SFDA network training, we cannot treat PLs as true labels since they are very noisy, and we need to address the label noise problem explicitly. Our main novel observation is that the EPL label noise follows a specific pattern such that, despite the high noise level, estimating the bin-wise accuracy of the adapted model on the target domain using either the true labels or the pseudo-labels yields very similar results. Figure 4b provides empirical evidence for this observation, and Fig. 3 explains the reason for this behavior. In our experiments, the average noise level for EPL was 20% (Fig. 4a), while the average absolute difference between the true accuracy and the EPL-based accuracy estimate is only 4% (Fig. 4b). We added another analysis (Figure 7) which shows that the average (non-absolute) difference between the true accuracy and the EPL-based accuracy estimate is zero. This makes EPL extremely useful for source-free calibration.
Summary: This paper investigates confidence calibration in source-free domain adaptation (SFDA). The authors propose Source-Free Confidence Calibration (SFCC), which leverages Enhanced Pseudo Labels (EPL) generated using a clustering approach and applies temperature scaling for calibration. Experiments conducted on several benchmark datasets demonstrate that the SFCC achieves comparable or better performance than existing calibration methods.
Decision: The paper addresses a meaningful problem in SFDA but does not meet the standards for acceptance for the following reasons. 1. The main concern is that the method lacks novelty, as it primarily integrates pseudo-labels and temperature scaling without substantial innovation (Ye8f, 7yUd, 5d3k). 2. Its reliance on pre-trained feature extractors limits generalizability and introduces biases (Ye8f, E9nN). 3. The experimental validation is insufficient, with missing ablations and limited metrics (7yUd, 5d3k). 4. Lack of theoretical insight (Ye8f, 7yUd, E9nN), though I believe this is a minor issue, as it is not necessary for all SFDA papers.
These limitations outweigh the contributions, leading to the decision to reject. During the reviewer-AC discussion period, the reviewers unanimously agreed with this decision.
Additional Comments from Reviewer Discussion
During the discussion phase, reviewers raised concerns about the technical novelty, the reliance on pre-trained models, and insufficient experimental validation. The authors provided clarifications and additional results, including new evaluation metrics such as SCE and NLL, and detailed explanations of the methodology. However, these efforts were insufficient to address the fundamental weaknesses. The reviewers maintained their concerns about the limited innovation, the presentation, and the use of external pre-trained models. Despite the rebuttal, the reviewers unanimously recommended rejection, citing unresolved issues that undermine the paper's overall impact.
Reject