PaperHub
NeurIPS 2024 (Poster), 4 reviewers
Overall rating: 6.3 / 10 (individual ratings 7, 6, 6, 6; min 6, max 7, std 0.4)
Confidence: 3.8 | Correctness: 3.0 | Contribution: 3.0 | Presentation: 3.5

MaNo: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts

Submitted: 2024-05-11, updated: 2024-11-06
TL;DR

Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts

Abstract

Leveraging the model’s outputs, specifically the logits, is a common approach to estimating the test accuracy of a pre-trained neural network on out-of-distribution (OOD) samples without requiring access to the corresponding ground-truth labels. Despite their ease of implementation and computational efficiency, current logit-based methods are vulnerable to overconfidence issues, leading to prediction bias, especially under the natural shift. In this work, we first study the relationship between logits and generalization performance from the view of the low-density separation assumption. Our findings motivate our proposed method MaNo that (1) applies a data-dependent normalization on the logits to reduce prediction bias, and (2) takes the $L_p$ norm of the matrix of normalized logits as the estimation score. Our theoretical analysis highlights the connection between the provided score and the model's uncertainty. We conduct an extensive empirical study on common unsupervised accuracy estimation benchmarks and demonstrate that MaNo achieves state-of-the-art performance across various architectures in the presence of synthetic, natural, or subpopulation shifts. The code is available at https://github.com/Renchunzi-Xie/MaNo.
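For readers who prefer pseudocode, a minimal sketch of the scoring pipeline described in the abstract follows. This is our own illustrative reconstruction, not the authors' reference implementation: the function name, the rescaling by matrix size, and the plain softmax normalization are assumptions, whereas the paper's actual data-dependent normalization is given by its Eq. (6).

```python
# Illustrative sketch only: normalize the test logits, then aggregate the
# normalized logit matrix with an entrywise L_p norm. The softmax used here is
# a placeholder for the paper's data-dependent SoftTrun normalization.
import torch
import torch.nn.functional as F

def mano_style_score(logits: torch.Tensor, p: int = 4) -> float:
    """logits: (n_samples, n_classes) raw outputs of the pre-trained model on unlabeled test data."""
    probs = F.softmax(logits, dim=1)               # placeholder normalization
    n, k = probs.shape
    # Entrywise L_p norm, rescaled by the matrix size (an assumption here) so
    # that scores are comparable across test sets of different sizes.
    return (probs.pow(p).sum() / (n * k)).pow(1.0 / p).item()
```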
Keywords
Unsupervised Learning, Distribution Shifts, Unsupervised Accuracy Estimation, Generalization, Deep Learning

Reviews and Discussion

Review (Rating: 7)

The paper addresses the challenge of estimating the test accuracy of pre-trained neural networks on out-of-distribution (OOD) samples without access to ground-truth labels. Current logit-based methods often suffer from overconfidence, leading to prediction bias. The authors propose a new method called MANO, which applies data-dependent normalization on the logits and uses the Lp norm of the normalized logits matrix as the estimation score. This approach aims to reduce prediction bias and improve accuracy estimation. Theoretical and empirical analyses demonstrate that MANO outperforms existing methods across various types of distribution shifts and architectures.

Strengths

  1. The paper provides a solid theoretical analysis connecting logits to model uncertainty and generalization performance, supported by the low-density separation assumption.
  2. The paper is well-written and logically structured, making it easy to follow the authors' arguments and methodology.
  3. Extensive empirical studies on multiple benchmarks show that MANO outperforms existing state-of-the-art methods in different distribution shift scenarios.

Weaknesses

  1. How can you ensure the universality of the Low-Density Separation (LDS) assumption in practical applications? Are there specific scenarios or datasets where this assumption might not hold?
  2. What are the specific mathematical advantages of the SoftTrun normalization strategy compared to traditional softmax? To what extent do these advantages rely on particular data distributions?
  3. In Table 1 and Table 2, the performance improvements over existing methods appear to be relatively minor in many instances. Is it possible to calculate other uncertainty measures to assess the significance of these results more accurately?
  4. In most of the experimental results, the outcomes are close to the maximum value of the evaluation metrics. Does this imply that the research problem has already been solved?

Questions

See weakness.

Limitations

N/A.

Author Response

We thank the reviewer for the positive comments as well as the insightful suggestions. We will address the reviewer's concerns below. Please let us know if any issues remain; we would be happy to continue this discussion to address them.

1. The universality of the Low-Density Separation (LDS) assumption in practical applications.

This is a very interesting and important question, so our answer will be a bit detailed. We apologize for the length. If the reviewer wants to know more details on the LDS assumption, please refer to Chapter 1 of Olivier Chapelle et al.’s book [e].

When dealing with classification problems, we always implicitly rely on the smoothness assumption that two data points close to each other are from the same class with high probability. Originally, the LDS assumption was proposed as an attempt to apply this reasoning to unlabeled data in order to train semi-supervised models, and it could be considered as a natural hypothesis we put forward on the problem at hand. In our case, we pose this assumption with respect to the pre-trained model in order to be able to analyze its behavior on unlabeled data. Hence, the LDS assumption does not hold when the pre-trained model makes lots of mistakes on examples that are far away from the decision boundary, which, for example, can happen in the presence of unreasonable distribution shifts that exclude deploying the model on the target domain. In the paragraph Assumptions on the prediction bias of Section 3.1, we elaborated further on the assumptions that we should make with respect to the pre-trained model and amount of distribution shift for the safe application of accuracy estimation methods (because this reasoning applies, to some extent, to all our competitors).

2. The specific mathematical advantages of the SoftTrun normalization strategy compared to traditional softmax. The softmax can be seen as a composition of the exponential and the scaling $\phi \colon u \mapsto u / \lVert u \rVert_1$. We study the nice mathematical properties of $\phi$ in Appendix D.6 and design SoftTrun such that it preserves those properties. As shown in Equation (6), SoftTrun recovers the softmax when the model is well-calibrated, and when it is not, it uses a Taylor approximation of the exponential (a 2nd-order polynomial) that reduces the influence of the prediction bias $\mathbf{\epsilon}$, as shown in Equation (4) and Figure 2(b).
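For concreteness, here is a hedged sketch of the two normalization branches just described. It is an approximation of Eq. (6), not the paper's exact definition; the switching value phi_d and the threshold eta stand in for $\Phi(D_{test})$ and $\eta=5$.

```python
# Sketch (assumed, simplified form of Eq. (6)): both branches end with the
# scaling phi(u) = u / ||u||_1, applied either to exp(logits) (softmax) or to
# a 2nd-order Taylor approximation of the exponential.
import torch
import torch.nn.functional as F

def taylor_normalize(logits: torch.Tensor) -> torch.Tensor:
    # exp(z) ~ 1 + z + z^2 / 2; the quadratic damps the additive prediction bias.
    v = 1.0 + logits + 0.5 * logits ** 2           # always positive for real logits
    return v / v.sum(dim=1, keepdim=True)          # phi: u -> u / ||u||_1

def soft_trun(logits: torch.Tensor, phi_d: float, eta: float = 5.0) -> torch.Tensor:
    # Well-calibrated regime (criterion above the threshold): fall back to softmax;
    # otherwise use the bias-robust Taylor branch.
    return F.softmax(logits, dim=1) if phi_d > eta else taylor_normalize(logits)
```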

3. To what extent do the advantages of SoftTrun rely on particular data distributions? The advantages of SoftTrun do not rely on particular data distributions, since we do not make any assumptions on the prediction bias $\mathbf{\epsilon}$ or the criterion $\Phi(D_{test})$. It autonomously distinguishes calibration scenarios on data from different distribution shifts.

4. Enhancing the performance in Table 1 and Table 2 by selecting other uncertainty measures.

We thank the reviewer for raising this question. We see the following two reasons why the performance improvements in Table 1 and Table 2 seem minor.

  • [The estimation performance under synthetic and subpopulation shifts is close to optimal.] In our experimental protocol, we followed other papers in this domain, but it is true that the performance on synthetic and subpopulation shift datasets is almost perfect, so the improvement with respect to the state of the art is not very large. Thus, the community may need more challenging datasets (e.g., natural shifts in Table 3) for these types of domain shifts.
  • [The selection of metrics.] We use commonly-used metrics, including $R^2$ and $\rho$, to measure the linear relationship between the designed scores and the true test accuracy, and, from our experience, these metrics may not provide a clear insight into the degree of improvement. Nevertheless, thanks to the suggestion of Reviewer A5Zk, we will add to the revised manuscript the results evaluated by the mean absolute error (a short sketch of how these metrics are computed follows Table G). In Table G below, we can see that with this metric the superiority of MaNo on CIFAR-10 is more evident.

Table G: Mean absolute error on CIFAR-10 with ResNet-18.

Dataset  | ConfScore | Entropy | ATC   | COT   | Nuclear | MaNo
CIFAR-10 | 3.371     | 3.188   | 2.677 | 1.381 | 1.358   | 0.427
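The sketch below shows how the three evaluation quantities discussed above ($R^2$ of a linear fit, Spearman $\rho$, and the mean absolute error) could be computed from per-dataset scores and accuracies; the numbers and variable names are illustrative placeholders, not values from the paper.

```python
# Illustrative metric computation for an accuracy estimator evaluated over a
# collection of test sets (all numbers below are made up for the example).
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import r2_score

scores   = np.array([0.48, 0.55, 0.62, 0.71])   # estimation score per test set
true_acc = np.array([69.3, 74.5, 81.2, 88.0])   # ground-truth accuracy (%)
pred_acc = np.array([70.5, 76.0, 80.1, 86.9])   # accuracy predicted from the scores (%)

linear_fit = np.poly1d(np.polyfit(scores, true_acc, 1))
r2  = r2_score(true_acc, linear_fit(scores))     # quality of the linear relationship
rho = spearmanr(scores, true_acc)[0]             # rank correlation
mae = np.mean(np.abs(pred_acc - true_acc))       # error directly in accuracy points
print(f"R^2={r2:.3f}  rho={rho:.3f}  MAE={mae:.3f}")
```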

5. Does this imply that the research problem has already been solved? We believe that there still exist many open questions in this field. Here are some of them:

  • [Room for improvement under the natural shift] The numerical results in Table 3 show that current test accuracy estimation under the natural shift needs to be further improved, which is a more complex and practically meaningful question.
  • [Generalization to LLMs / other data modalities.] Unsupervised accuracy estimation work currently focuses mostly on computer vision applications, but it would be important to also explore other data modalities and network architectures. In particular, a promising research direction would be to analyze the possibility of unsupervised performance estimation for large language models.
  • [The underlying working mechanism of generalization is still unclear.] One of the essential goals of this field is to understand when and why a model is able to generalize to unseen domains. However, the generalization capability of models remains poorly understood.

[e] Chapelle, O., Schölkopf, B., & Zien, A. (2006). Semi-Supervised Learning. MIT Press.

Comment

Thanks to all the authors for your rebuttal!

I think the authors responded well to all my comments; I would like to raise my rating.

Comment

Dear Reviewer Uw9o,

Thank you for your valuable suggestions and constructive comments! We are happy to hear that your concerns have been resolved.

Kind Regards,

Authors

Review (Rating: 6)

This paper presents MANO, a straightforward and efficient training-free approach for estimating test accuracy in an unsupervised manner, leveraging the Matrix Norm of neural network predictions on test data. The method is inspired by the low-density separation assumption, which posits that optimal decision boundaries should reside in low-density regions. An extensive empirical study on standard unsupervised accuracy estimation benchmarks reveals that MANO consistently achieves state-of-the-art performance across diverse architectures, even in the presence of synthetic, natural, or subpopulation shifts.

Strengths

  1. It demonstrates that logits can effectively indicate generalization performance by reflecting distances to decision boundaries, in alignment with the low-density separation assumption.

  2. It introduces MANO, a training-free method for estimating test accuracy by computing the $L_p$ norm of the logits matrix, which quantifies global distances to decision boundaries. MANO employs a novel normalization technique that balances information completeness and error accumulation, and is resilient to various calibration scenarios. Additionally, it reveals a connection to the model’s uncertainty.

  3. It conducts a comprehensive empirical evaluation, encompassing 12 benchmarks across diverse distribution shifts, to showcase MANO’s superiority over 11 baseline methods. The results consistently show that MANO outperforms state-of-the-art baselines, even under challenging natural shifts.

Weaknesses

  1. It does not address how to translate the proposed MANO into a practical estimated accuracy, which is crucial for real-world applications. Additionally, it should report the performance of the methods using the absolute estimation error metric, defined as the absolute difference between the estimated accuracy and the actual accuracy, for unsupervised accuracy estimation.

  2. The justification for selecting $\eta=5$ is not robust. An ablation study on the impact of varying $\eta$ should be conducted to strengthen the argument.

  3. It lacks a discussion of some key related works on unsupervised accuracy estimation, such as [1][2].

[1] Chen, Jiefeng, et al. "Detecting errors and estimating accuracy on unlabeled data with self-training ensembles." Advances in Neural Information Processing Systems 34 (2021): 14980-14992.

[2] Chuang, Ching-Yao, Antonio Torralba, and Stefanie Jegelka. "Estimating Generalization under Distribution Shifts via Domain-Invariant Representations." International Conference on Machine Learning. PMLR, 2020.

Questions

  1. Could the authors also report the performance of the approaches under the absolute estimation error metric?

  2. Could the authors perform an ablation study on the effect of the hyper-parameter $\eta$?

  3. Could the authors add the missing related works on unsupervised accuracy estimation?

Limitations

The authors have adequately addressed the limitations and potential negative societal impact of their work.

Author Response

We thank the reviewer for their positive comments and very valuable suggestions to improve the quality of our paper. We address the reviewer's concerns below. Please let us know if any issues remain.

1. How to use MaNo in practice? This work demonstrates the strong correlation between ground-truth OOD accuracy and the designed score, which can be particularly useful for model deployment applications. In the following, we provide two examples.

  • [Finding difficult (underperforming) test sets.] In cases such as retraining on underperforming datasets or annotating hard datasets, we only need to know the rank of the datasets by accuracy. Therefore, we can calculate the proposed score for each dataset directly and fulfill the task based on this score's ranking.
  • [Deployment risk estimation.] When deploying the model into production, it is important to estimate its safety. If the cost of obtaining test labels is prohibitive, our method can help estimate the model's accuracy on the product's test data. A practitioner can additionally look at the variability of the score on multiple test sets. When multiple datasets are not available, we can alternatively construct adequate synthetic datasets via various visual transformations.

We thank the reviewer for raising this question; we will update our paper accordingly.

2. Numerical results by using the absolute estimation error metric.

Thank you very much for introducing this metric to us, which will enhance the experimental quality significantly. In Table E below, we provide partial numerical results of our experiments using the absolute estimation error. We will include all the results using this metric in our final version.

Table E: Mean absolute error on CIFAR-10 and Office-Home with ResNet-18.

Dataset     | ConfScore | Entropy | ATC   | COT   | Nuclear | MaNo
CIFAR-10    | 3.371     | 3.188   | 2.677 | 1.381 | 1.358   | 0.427
Office-Home | 4.212     | 4.568   | 6.523 | 3.330 | 3.886   | 2.230

3. Ablation study on $\eta$. We thank the reviewer for this suggestion. We provide the ablation study in Table F below, where we observed that $\eta=5$ can effectively distinguish the calibration scenarios.

Table F: Performance on CIFAR-10, Office-31, and PACS with ResNet-18 for varying values of $\eta$. The metric used in this table is the coefficient of determination $R^2$.

Dataset     | $\eta=0$ | $\eta=1$ | $\eta=3$ | $\eta=5$ | $\eta=7$ | $\eta=9$
CIFAR-10    | 0.995    | 0.995    | 0.995    | 0.995    | 0.995    | 0.995
Office-Home | 0.926    | 0.926    | 0.926    | 0.926    | 0.777    | 0.777
PACS        | 0.541    | 0.541    | 0.541    | 0.827    | 0.827    | 0.827
Average     | 0.820    | 0.820    | 0.820    | 0.916    | 0.866    | 0.866

It is worth noting that we provide a theoretical analysis to demonstrate the general reason why we choose $\eta=5$ in Section D.3. It indicates that $\eta=5$ ensures that $\Phi(D_{test})$ deviates from its mean by the variance with probability smaller than 0.05.
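For intuition, one plausible reading of this bound (an assumption on our part, not necessarily the exact derivation in Section D.3) is a Chebyshev-type argument in which the deviation is measured in standard deviations $\sigma$:

```latex
% Assumed Chebyshev-type reading of the eta = 5 choice (illustrative only):
\[
  \mathbb{P}\big( \lvert \Phi(D_{\mathrm{test}}) - \mathbb{E}[\Phi(D_{\mathrm{test}})] \rvert \ge \eta\,\sigma \big)
  \;\le\; \frac{1}{\eta^{2}},
  \qquad \eta = 5 \;\Longrightarrow\; \tfrac{1}{25} = 0.04 < 0.05 .
\]
```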

4. Discussion about the two related works. We thank the reviewer for suggesting these two significant works. Both approaches utilize the disagreement between a pre-trained model and a target-adapted model to predict accuracy on each test set. In contrast, our MANO leverages the model’s logits for accuracy estimation without any training on target datasets. We will include these works and clarify the differences in our revision.

We thank the reviewer for their questions which helped us improve our work. We remain open to further discussion in case some issues remain unaddressed.

Comment

Thank you for the detailed rebuttal! You've addressed most of my concerns, and as a result, I will be increasing my scores.

I do have one remaining question: Could you clarify how the proposed MANO is translated into the estimated accuracy?

Comment

Dear Reviewer A5Zk,

Thank you for your positive feedback. We are glad our rebuttal has addressed your concerns.

To clarify how the MANO score is transformed into estimated accuracy, our MANO score exhibits a strong linear correlation with classification accuracy. Consistent with existing methods (Deng et al., 2021; Peng et al., 2024), we fit a linear regressor on held-out (or synthetic) datasets, using the MANO score as input and the predicted accuracy as output. Once trained, this regressor can be applied to unseen datasets to convert MANO scores into estimated accuracy. We will include this clarification in our revision.
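A minimal sketch of this score-to-accuracy conversion (our own illustration with placeholder numbers; the cited works describe the actual protocol) could look as follows:

```python
# Fit a linear regressor on calibration sets where both the MaNo score and the
# true accuracy are known, then apply it to the score of an unseen test set.
import numpy as np
from sklearn.linear_model import LinearRegression

calib_scores = np.array([[0.48], [0.55], [0.62], [0.71]])   # MaNo scores (held-out / synthetic sets)
calib_acc    = np.array([69.3, 74.5, 81.2, 88.0])           # known accuracies (%)

regressor = LinearRegression().fit(calib_scores, calib_acc)

new_score = np.array([[0.58]])                               # score of an unlabeled target set
estimated_acc = regressor.predict(new_score)[0]
print(f"estimated accuracy: {estimated_acc:.1f}%")
```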

Thank you again for your constructive suggestions.

Sincerely,

The Authors

Comment

Thank you for the clarification! I will raise my scores!

Comment

Dear Reviewer A5Zk,

Thank you for your updated score and your helpful suggestions! We are glad that the clarifications were helpful. Thank you again!

Sincerely,

The Authors

Review (Rating: 6)

This paper proposes the OOD accuracy estimation method, named MANO, by leveraging the positive correlations between features to decision boundary distance and generalization performance. Along with Softtrunc for preventing error accumulation in overconfidence scenarios, the proposed method outperforms existing estimation baselines over various benchmark datasets.

Strengths

  • This paper provides the theoretical insights that motivate the simple but effective OOD accuracy estimation method without supervision.
  • Analysis on an extensive set of distribution shift benchmarks, as well as ablation studies, demonstrates that the proposed method shows promising results under various distribution shifts.

Weaknesses

  1. The method requires hyperparameters, such as $p$ in the aggregation step and the criterion for determining the model's calibration in SoftTrun, which requires OOD labels for an optimal hyperparameter search.
  2. As mentioned in Section D.2, the criterion $\Phi(D_{test})$ being higher than $\eta$ could still include cases of high errors and high confidence (overconfidence), i.e., the first scenario can have a lower bound of $\Phi(D_{test})$ as high as in the third scenario. Therefore, a more thorough analysis or ablation studies on how sensitive MANO is under such potential failures in correctly determining calibration is required. In addition, miscalibration also includes underconfidence, where models have underconfident predictions compared to their actual accuracy.
  3. Important estimation baseline is omitted, Agreement-on-the-Line, Baek et al. (2022).

Baek et al., Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift, NeurIPS 2022

Questions

  • Averaged confidence can be considered a particular case of MANO with softmax, i.e., using $p$ as infinity. Still, in Table 4, even using SoftTrun, AC is not competitive against MANO. So, what makes MANO, which considers feature-to-boundary distances for all classes, superior to AC, which corresponds to considering only the most confident (probable) class?

Limitations

See weaknesses and questions.

Author Response

We thank the reviewer for their valuable comments, which help us further improve the paper. We hope our answers below precisely address the reviewer's concerns. Please let us know if any issues remain.

1. Hyperparameter tuning and OOD labels.

We thank the reviewer for this comment. In order to avoid any confusion, we would like to clarify several important aspects of the problem setup:

  • [No access to OOD labels.] We consider the unsupervised setting, so we do not have any validation set. This is eventually one of the main challenges of the framework as we are not able to tune hyperparameters in contrast to the supervised approaches.
  • [Source-free.] We have constrained ourselves to a setup where only the pre-trained model is available without direct access to source data.

Given these constraints, we have come up with the proposed method, where we have made the following model choices.

  • [Fixed hyperparameters across all distribution shifts.] We have set all the hyperparameters to fixed values ($p=4$ and $\eta=5$) across all the datasets. The empirical results show the superiority of MaNo without searching for the optimal hyperparameter values. We have also performed a sensitivity study (please see Figure 7(a) for $p$ and the response to Reviewer A5Zk for $\eta$).
  • [Necessity of introducing $\eta$.] The proposed normalization SoftTrun reveals an additional level of complexity of the problem that was overlooked by previous methods. Table 4 shows that SoftTrun improves ConfScore and Nuclear, which justifies the importance of normalization; this, however, comes at the cost of introducing the hyperparameter $\eta$.

When there is access to the training data, a possible way to choose the hyperparameters is to generate multiple synthetic datasets from the training set via various visual transformations and use them to select suitable values (a small sketch is given below).
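For illustration only (the specific transformations below are our assumption of what "various visual transformations" could look like, not a prescription from the paper), such synthetic pseudo-OOD validation sets could be built with standard torchvision transforms:

```python
# Assumed example of building pseudo-OOD sets from source training images so
# that candidate values of p and eta can be compared without OOD labels.
import torchvision.transforms as T

synthetic_variants = {
    "blur":   T.Compose([T.GaussianBlur(kernel_size=5), T.ToTensor()]),
    "jitter": T.Compose([T.ColorJitter(brightness=0.5, contrast=0.5), T.ToTensor()]),
    "gray":   T.Compose([T.Grayscale(num_output_channels=3), T.ToTensor()]),
}
# Each transform is applied to the source training images to create one
# synthetic shifted dataset; hyperparameters that behave consistently across
# these variants are then kept for deployment.
```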

We will revise the manuscript to better clarify our problem setup and the motivation behind the model choices we have made.

2. MaNo and SoftTrun in the case of erroneous overconfidence and underconfidence.

Thanks for this insightful comment and we reply to it in two parts.

  • [High errors and high confidence.] This remark concerns the actual assumptions we must make in our setup (Section 3.1). One may notice that overconfidence combined with high errors implies that the predicted probabilities are misaligned with the true ones in terms of class ranking, which is a big issue that may go beyond calibration. Considering the unsupervised setting and the lack of access to the training phase, the logits are the only source of information we have, so we find it reasonable to assume that the model's mistakes are mostly in low-confidence regions.

  • [Underconfidence.] We totally agree. That is why we do not impose any assumption on the prediction bias $\mathbf{\epsilon}$, so it can have positive/negative entries covering overconfidence and underconfidence. In practice, it has been shown that deep neural networks with softmax are overconfident [a], so the situation of underconfidence is unlikely to happen. As we did not observe this phenomenon in our experiments, it would be a good direction for future work to find benchmarks where underconfidence issues are present. We will include the above discussion in the revision.

3. Relation to Agreement-on-the-Line (ALine-D). We thank the reviewer for suggesting this interesting work. We would like to clarify the differences between ALine-D and MaNo.

  • [Different settings.] ALine-D operates under a model-centric setting, aiming to accurately estimate OOD accuracy across many different models. In contrast, MaNo is designed for a data-centric setting, focusing on estimating a single model’s OOD accuracy on various datasets.
  • [Different assumptions.] ALine-D assumes that (1) a set of diverse pre-trained models is available during evaluation; (2) agreement-on-the-line phenomenon holds consistently. In practice, accessing a diverse set of models may be infeasible, and assumption (2) does not always hold [b, c]. On the other hand, MaNo and other data-centric methods do not face these limitations.

Based on the above, ALine-D cannot be directly compared with MaNo, but we provide two indirect ways to do that.

  • In our experimental results, MaNo outperforms AgreeScore, which is based on a similar idea to ALine-D.
  • In Table D, we compare the results provided in [d], Section 5.2 with ours on CIFAR-10C to further illustrate the efficiency of MaNo even without the two assumptions.

Table D: MaNo vs. Agreement-on-the-Line (ALine-D) on CIFAR-10C with ResNet-18.

Metric | ALine-D | MaNo
$R^2$  | 0.995   | 0.995
$\rho$ | 0.974   | 0.997

4. What makes MaNo superior to AC? Thanks for raising this question. First, we would like to clarify a possible misunderstanding between the $L_\infty$ norm and AC.

  • [AC is not a specific case of $p=\infty$.] When we consider the $L_\infty$ norm, it means that we compute the maximum value over the whole prediction matrix, while AC extracts the maximum value of each row in that matrix.

Then, to answer the question:

  • [AC ignores the distances to subconfident decision boundaries.] Our analysis in Section 3.1 shows that the distance to each class boundary is important, so by considering only the confidence of the most probable class, AC may lose information.
  • [AC considers each sample separately.] MaNo puts greater emphasis on high-margin terms globally as $p$ increases, while AC weights the high-margin value of each sample equally (a small numerical sketch is given after this list).
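To make the distinction concrete, here is a small numerical sketch; the variable names and numbers are our own illustrative choices, not taken from the paper.

```python
# Entrywise L_inf over the whole matrix vs. AC (mean of per-row maxima) vs. an
# entrywise L_p aggregation that uses every entry of the normalized logits.
import torch

probs = torch.tensor([[0.90, 0.05, 0.05],
                      [0.40, 0.35, 0.25]])   # normalized logits for two samples

linf_whole_matrix = probs.max()                    # single global maximum: 0.90
ac_score = probs.max(dim=1).values.mean()          # average confidence: (0.90 + 0.40) / 2
p = 4
lp_score = probs.pow(p).mean() ** (1.0 / p)        # aggregates all entries, not only row maxima
print(linf_whole_matrix.item(), ac_score.item(), lp_score.item())
```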

[a] Leveraging ensemble diversity for robust self-training in the presence of sample selection bias.

[b] ID and OOD performance are sometimes inversely correlated on real-world datasets.

[c] Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization.

[d] Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift.

Comment

I greatly appreciate the authors' detailed rebuttal to my concerns, and most of them are fully addressed.

However, I still believe the over/under-confidence assumption this paper relies on could easily be violated in circumstances where models show overconfident predictions on OOD data, e.g., after adaptation using confidence maximization (i.e., entropy minimization [1-3]). Therefore, a discussion of this potential limitation should be included in the paper.

Under this condition, I increase my rating to 6.


[1] Wang et al., Tent: Fully test-time adaptation by entropy minimization, ICLR 2021
[2] Chen et al., Contrastive test-time adaptation, CVPR 2022
[3] Rusak et al., If your data distribution shifts, use self-learning, TMLR 2022

Comment

Dear Reviewer 5Gdr,

Thank you for the updated score and the valuable review. We are glad that our clarifications were helpful. As suggested, we will include the discussion about this potential limitation in the paper with the corresponding references (entropy minimization [1-3]). We believe the review helped us improve the paper, and we thank you again for your constructive suggestions.

Sincerely,

The Authors

Review (Rating: 6)

The paper presents MANO, a method for unsupervised accuracy estimation under distribution shifts. The method addresses the challenge of estimating model performance on out-of-distribution (OOD) samples without access to ground-truth labels. First, the authors investigate the correlation between logits and test accuracy, and propose a novel approach that involves normalizing the logits and using the $L_p$ norm of the normalized logits matrix as an estimation score, motivated by the low-density separation assumption. Then, the authors find that the commonly used softmax normalization tends to harm the performance estimation when facing overconfidence issues on particular datasets. The authors demonstrate that MANO achieves state-of-the-art performance across various benchmarks and architectures.

Strengths

  1. This paper introduces a novel method of using the norm of normalized logits for accuracy estimation, providing a fresh perspective on the accuracy estimation problem under distribution shifts in unsupervised settings. This innovative approach builds on the low-density separation assumption, which is theoretically sound and practically relevant.

  2. The authors consider the common overconfidence issue faced by deep models and propose a new normalization method based on the softmax operator. The illustrated experiments show that the new SoftTrun method brings improvements in some cases.

Weaknesses

  1. Despite the authors designing a method based on the low-density assumption from a theoretical perspective, this approach appears rather trivial and is quite similar to entropy-based methods like the variant of ATC. Moreover, it measures the overall output smoothness of the test set, making the connection with uncertainty somewhat trivial.

  2. The proposed SoftTrun method, which is an improvement over softmax, shows performance enhancements in some cases (as shown in Figure 2(b)), but can also degrade performance in others. For new tasks, SoftTrun may not always outperform the traditional softmax method. Additionally, the experimental validation for this aspect is insufficient, as Tables 1 and 2 lack comparative results with the softmax-based MANO baseline method.

Questions

Please refer to the weaknesses.

Limitations

N/A.

Author Response

We thank the reviewer for the positive support and constructive comments on this work! Please find our responses below.

1.1. Relation to entropy-based methods such as ATC.

We thank the reviewer for this comment. The proposed MaNo and the ATC indeed share similarities as they both belong to the class of logit-based accuracy estimation methods. However, there are several important aspects that distinguish the two methods, not only in terms of methodology but also effectiveness.

  • [ATC calculates an entropy that MaNo never considers.] ATC utilizes the logits to calculate the entropy and measures how many test samples have a confidence larger than a threshold, which is a hyperparameter that needs to be chosen. They propose to search for it on a labeled source validation set, which might not be a good idea when the softmax tends to be overconfident and the prediction bias is noticeable, as we discussed in Section 4.1. In contrast, MaNo directly calculates the $L_p$ norm of the normalized logit matrix, thereby avoiding the introduction of a confidence threshold and leading to a simpler and more efficient approach.
  • [Stability.] Based purely on the experimental results, we can see that the entropy-based score function does not seem stable in performance and is worse than the proposed MaNo (regardless of whether the softmax or SoftTrun normalization scheme is used).
  • [New normalization scheme.] To the best of our knowledge, the question of calibration and logit normalization had not been addressed before in the unsupervised accuracy estimation domain, so the proposed SoftTrun normalization scheme makes our contribution conceptually different from other approaches, including ATC.

1.2. Simplicity of the approach.
We would like to open a debate with respect to this point, as we believe that the simplicity of our approach is one of its main strengths given its superiority over 11 competitors across 3 different shifts and 12 datasets, with a very noticeable difference in the natural shift setting. In addition, we provide the following arguments.

  • Our approach is source-free and training-free (in contrast to ATC and ProjNorm, resp.), which makes it relevant for model deployment applications.
  • As the reviewer noticed, our method has emerged from the theoretical perspective of the low-density separation assumption, which makes it intuitive and easy to analyze.
  • The versatility of the approach allows us to painlessly test our method on different neural network architectures. Therefore, in order to strengthen this point, we have conducted additional experiments using two other commonly used vision models: ViT and Swin. In addition, we added a new evaluation metric (mean absolute error) based on the suggestion of Reviewer A5Zk. We will include all the experimental results in the revision, meanwhile displaying some of the results in the two tables below. Our approach has superior performance on these two architectures as well, which validates our versatility claim. Our experimental results also include an ablation baseline, Softmax MaNo, which uses the softmax normalization function instead of the proposed SoftTrun.

Table A: $R^2$ on CIFAR-10C and Office-Home with ViT.

ViT         | Dispersion | Nuclear | COT   | Softmax MaNo | MaNo (Ours)
CIFAR-10    | 0.945      | 0.963   | 0.950 | 0.984        | 0.984
Office-Home | 0.216      | 0.531   | 0.732 | 0.805        | 0.805

Table B: Mean absolute error on CIFAR-10.1, ImageNet-S, and ImageNet-A with Swin.

Swin        | Dispersion | Nuclear | COT   | Softmax MaNo | MaNo (Ours)
CIFAR-10.1  | 10.758     | 5.074   | 4.669 | 1.042        | 1.042
ImageNet-S  | 7.352      | 4.054   | 5.379 | 3.336        | 2.091
ImageNet-A  | 24.533     | 7.521   | 4.597 | 1.273        | 1.001

2.1. Clarifications regarding SoftTrun. We apologize for the possible misunderstanding regarding Figure 2(b) and the design of SoftTrun. Please find our comments below.

  • [Typo in Figure 2(b).] We made a typo: the bar named "SoftTrun (first 2-order)" should instead be called "Taylor approximation (2nd order)". This bar refers to the case where we apply only the Taylor normalization function (as if we always chose the 1st scenario in Eq. (6)), while "MaNo w/ Softmax" refers to applying the softmax only (as if we always chose the 2nd scenario in Eq. (6)).
  • [Purpose of Figure 2.] The goal of this figure is to motivate the proposed SoftTrun defined by Eq. (6), where we automatically choose one of the normalization functions depending on the calibration scenario. For well-calibrated scenarios (Office-Home), we should use the softmax normalization for information completeness, while for poorly calibrated ones (PACS), the Taylor normalization is preferred in order to be more robust to prediction errors.
  • [SoftTrun.] Thus, the proposed SoftTrun selects one of these two normalization schemes depending on the uncertainty value, and the final experimental results are displayed in Tables 1, 2, and 3.

We will update the labeling of Figure 2(b) and the writing of Section 4.2 to clear out any potential misunderstanding.

2.2. Results of softmax-based MaNo in Tables 1 and 2. Thanks for suggesting this idea. We did not display this baseline for the synthetic and subpopulation shifts because we found the prediction bias to be quite low on these problems, which leads to $\Phi(D_{test})$ being much greater than the proposed fixed $\eta=5$, so SoftTrun always chooses the softmax normalization for these problems. The key difference between the SoftTrun and softmax-only normalization schemes appears when natural shift is considered. In the table below, please find the experimental results comparing the two schemes on 4 natural shift benchmarks, confirming the superiority of SoftTrun.

Table C: Softmax MaNo vs. the proposed MaNo, measured by $R^2$ under natural shifts with ResNet-18.

Method        | PACS  | Office-Home | DomainNet | RR1
Softmax MaNo  | 0.541 | 0.929       | 0.894     | 0.971
SoftTrun MaNo | 0.827 | 0.926       | 0.902     | 0.983
Author Response

We thank the reviewers for their time and valuable suggestions. We are deeply grateful to them for acknowledging the novelty and quality of our study (Reviewer 6N1W, A5Zk, Uw9o) while noting its effectiveness and superiority on large-scale experiments (Reviewers 6N1W, 5Gdr, A5Zk, Uw9o). We are also encouraged to know that all reviewers found that our proposed method is built on solid theoretical support (Reviewer 6N1W, 5Gdr, A5Zk, Uw9o).

We provided all the additional experiments requested by the reviewers and remain open to continuing this constructive discussion for the length of the rebuttal period.

We believe that the paper strongly benefited from the reviews and hope that this, together with the multiple additional experiments provided in the individual answers, will allow the reviewers to reconsider their evaluation of our work if they think we addressed their concerns.

Comment

As the discussion period is ending, we thank the reviewers for taking the time to review our paper and discuss it with us. We believe their valuable suggestions strongly benefitted our paper.

We are glad that our clarifications and the additional experiments requested addressed the reviewers' concerns and are grateful to them for recognizing the core strengths of our work:

  • Novel and innovative
  • Theoretically well-motivated
  • SOTA efficient approach
  • Extensive comparisons

We thank them again for their insightful feedback.

Sincerely,

The Authors

Comment

Dear Reviewers,

The deadline for the reviewer-author discussion period is approaching. If you haven't done so already, please review the rebuttal and provide your response at your earliest convenience.

Best wishes, AC

Final Decision

All the reviewers recommended acceptance. The paper proposes a novel method for accuracy estimation based on the matrix norm of logits, based on a solid theoretical foundation. The paper is clearly written and presents an extensive set of experiments with convincing results.