Long-Tailed Out-of-Distribution Detection via Normalized Outlier Distribution Adaptation
Abstract
Reviews and Discussion
The paper proposes a novel approach, namely normalized outlier distribution adaptation (AdaptOD), to tackle the distribution shift problem in long-tailed OOD detection, where ID classes are heavily imbalanced and the true OOD samples exhibit a probability distribution over the head and tail ID classes that differs substantially from that of the outlier samples. One of its key components, dynamic outlier distribution adaptation (DODA), effectively adapts a vanilla outlier distribution learned from the outlier samples to the true OOD distribution by utilizing the OOD knowledge in the predicted OOD samples during inference. Further, to obtain a more reliable set of predicted OOD samples on long-tailed ID data, a novel dual-normalized energy (DNE) loss is introduced in AdaptOD, which leverages class- and sample-wise normalized energy to enforce more balanced prediction energy on imbalanced ID samples.
Strengths
- The paper is well written and easy to understand.
- The studied problem is very important.
- The results seem to outperform the state-of-the-art.
Weaknesses
- It would be comprehensive if the accuracy were reported on ID data.
- I am curious why there is no weight coefficient before the DNE loss.
- It would be better to ablate the number of outlier samples used during adaptation.
Questions
see above
Limitations
n/a
Thank you for your appreciation of the novelty of our approach, its empirical justification, the problem setting, and the paper's clarity, as well as your invaluable feedback on further enhancing the work. Please find our one-by-one responses to your concerns below:
W1: It would be comprehensive if the accuracy is reported on ID data
We have reported the ID accuracy of all methods on CIFAR10-LT and CIFAR100-LT in Tab. 2 and on ImageNet-LT in Tab. 4. Notably, our AdaptOD achieves not only consistently substantial improvements over current SOTA methods on the OOD detection metrics but also consistently better ID classification accuracy on all three widely used ID classification benchmarks.
W2: I am curious why there is not a weight coefficient before DNE loss
This is mainly because the DNE loss works stably with the cross-entropy loss in Eq. 11. We provide ablation results with a coefficient hyperparameter regularizing the DNE loss in Tab. G below, from which it is clear that the DNE loss works well with its importance hyperparameter set in [0.5, 2.0].
Table G: Averaged AUC results of OOD detection on CIFAR10/100-LT for AdaptOD with different importance weights of the DNE loss
| Weight | 0.5 | 1.0 | 1.5 | 2.0 |
|---|---|---|---|---|
| CIFAR10-LT | 94.51 | 94.69 | 94.65 | 94.56 |
| CIFAR100-LT | 81.85 | 81.93 | 81.90 | 81.86 |
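To make the weighting scheme swept in Tab. G concrete, here is a minimal sketch of the weighted training objective. The helper names (`energy`, `total_loss`) and the λ-weighted form are our illustration, not the paper's exact Eq. 11.

```python
import numpy as np

def energy(logits):
    # free-energy score of a batch of logits: E(x) = -log sum_c exp(f_c(x));
    # DNE normalizes such energies before computing its loss terms
    return -np.log(np.exp(logits).sum(axis=-1))

def total_loss(ce_loss, dne_loss, lam=1.0):
    # hypothetical weighted objective: L = L_CE + lam * L_DNE;
    # Tab. G sweeps lam over [0.5, 2.0], with lam = 1.0 as the default
    return ce_loss + lam * dne_loss
```

With `lam=1.0` this reduces to the unweighted sum used in the paper.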
W3: It is better to ablate on the number of outlier data used during adaptation
If we understand your question correctly, the results in Fig. 3 should be what you are looking for: we report the performance of our proposed DODA and two existing TTA methods with an increasing number of OOD samples. It shows that our AdaptOD can utilize the test-time OOD data to adapt to the true OOD distribution much faster and substantially more effectively.
This paper addresses the issue of true OOD samples having distribution shifts in scenarios of Long-tailed Recognition (LTR) with heavy imbalance. To cope with this, the paper proposes normalized outlier distribution adaptation (AdaptOD) with two key components: Dynamic Outlier Distribution Adaptation (DODA) and dual-normalized energy (DNE) loss. DODA dynamically adapts the outlier distribution during inference to align better with the true OOD distribution, while DNE helps achieve balanced prediction energy across imbalanced ID classes. This method was tested on various benchmark datasets, including both low- and high-resolution datasets.
Strengths
- The proposed method, which reduces the OOD distribution shift at both the training and inference stages, is novel.
- The paper includes a detailed ablation study that demonstrates the generality of the proposed methodology.
- The paper's class-wise and sample-wise normalization techniques improve the balance in prediction energy for imbalanced ID samples, enhancing overall performance compared to existing methods.
- This framework shows good experimental results compared to several baselines.
Weaknesses
- The overview of Figure 2 is difficult to understand. Simplifying and highlighting only the necessary information would be beneficial. There are too many redundant details.
- While recent methodologies improve long-tailed OOD detection performance using synthetic outliers without real outliers, this approach still relies on auxiliary outliers. This dependency can limit the practical applicability of the method.
- The method's effectiveness is questionable when handling distribution shifts between training auxiliary outliers and actual real-world outliers, potentially compromising its performance if the shift is large.
Questions
N/A
Limitations
see weakness
Thank you for your constructive feedback and appreciation of the novelty of our work and its empirical justification. We provide our response to your concerns one by one in the following.
W1: The overview of Figure 2 is difficult to understand. Simplifying and highlighting only the necessary information would be beneficial.
Thanks for your advice. We have provided a revised Figure 2 in the attached PDF and simplified its overview as follows.
- We will rewrite the DODA part as: each test sample is assigned a global energy-based OOD score to adapt the outlier distribution. DODA then uses the adapted outlier distribution to calibrate the global energy score, obtaining the calibrated global energy score as the OOD score.
- We will also rewrite the DNE part as: in each iteration, DNE first applies batch energy normalization to the logit output to obtain the normalized energy, and then uses this energy to optimize a dual energy loss function at both the class and sample levels.
W2: While recent methodologies improve long-tailed OOD detection performance using synthetic outliers without real outliers, this approach still relies on auxiliary outliers. This dependency can limit the practical applicability of the method.
As far as we know, existing SOTA long-tailed OOD detection methods all take the outlier exposure (OE) approach, relying on the availability of outlier data; non-OE-based methods still cannot work well on long-tailed ID data due to the heavy class imbalance. However, all these existing OE-based methods are challenged by the gap between the distributions of the outlier data and the true OOD data. The proposed AdaptOD effectively tackles this common, crucial challenge with the novel DODA and DNE modules. We agree that there might be application scenarios where no auxiliary outlier data is available. We will explore extending our method to this setting by training the initial outlier distribution on synthetic outlier data, or even randomly initializing the outlier distribution. Specifically, we attempt a random Gaussian initialization, in which each dimension is sampled from an isotropic Gaussian distribution, to generate the auxiliary outliers used to train the OOD models, as shown in Tab. E and Tab. F below.
Table E: Results on CIFAR10-LT using randomly initialized Gaussian samples as auxiliary outliers
| Method | AUC | AP-in | AP-out | FPR |
|---|---|---|---|---|
| EnergyOE | 84.65 | 85.27 | 82.66 | 52.70 |
| COCL | 90.53 | 89.44 | 88.92 | 37.89 |
| AdaptOD | 93.06 | 92.75 | 92.87 | 29.31 |
Table F: Results on CIFAR100-LT using randomly initialized Gaussian samples as auxiliary outliers
| Method | AUC | AP-in | AP-out | FPR |
|---|---|---|---|---|
| EnergyOE | 71.85 | 72.37 | 68.61 | 80.27 |
| COCL | 76.07 | 76.74 | 69.58 | 78.62 |
| AdaptOD | 79.21 | 81.25 | 74.89 | 70.48 |
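For reference, the random Gaussian initialization used for Tab. E and Tab. F can be sketched as follows; the function name and the image shape are illustrative assumptions.

```python
import numpy as np

def gaussian_outliers(n, shape=(3, 32, 32), sigma=1.0, seed=0):
    # draw n synthetic outlier "images" whose every dimension is sampled
    # i.i.d. from an isotropic Gaussian, replacing real auxiliary outliers
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma, size=(n, *shape)).astype(np.float32)
```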
W3: The method's effectiveness is questionable when handling distribution shifts between training auxiliary outliers and actual real-world outliers, potentially compromising its performance if the shift is large.
To mitigate the influence of a large distribution shift between the training auxiliary outliers and actual real-world outliers, DODA utilizes the statistics of the training ID data to guarantee effective optimization of the outlier distribution in Eq. 3 during the testing phase. This alleviates the reliance on the prior from the training auxiliary outlier data during DODA's adaptation, thereby reducing the adverse effects of auxiliary outlier data that is highly different from the true OOD data. As a result, our AdaptOD achieves SOTA performance over six large, diverse OOD test datasets (including both near-OOD and far-OOD datasets), as well as three synthetic OOD datasets, on the CIFAR10/100-LT ID benchmarks.
This paper addresses the challenge of OOD detection in LTR scenarios, where the distribution of classes is heavily imbalanced. The key issue highlighted is the absence of true OOD samples during training, which hampers the effectiveness of OOD detectors, especially the significant distribution shift between outlier samples and true OOD samples in LTR scenarios. The authors propose a novel approach called AdaptOD to tackle this problem by reducing the distribution gap between the outlier samples used for training and the true OOD samples encountered at inference time.
AdaptOD introduces two main components: DODA performs test-time adaptation to dynamically adjust the outlier distribution to better align with the true OOD distribution, using the OOD knowledge from predicted OOD samples during inference; DNE is designed to balance the prediction energy for imbalanced ID samples during training, which helps learn a more effective vanilla outlier distribution that is crucial for DODA's adaptation process.
The authors demonstrate the effectiveness of AdaptOD through extensive experiments on CIFAR10/100-LT and ImageNet-LT. The results indicate its superiority in handling the long-tailed OOD detection problem.
Strengths
- Exploring OOD detection in the context of long-tailed recognition has practical value, and the discovery of strong distribution shift between outlier samples and true OOD samples in LTR scenarios, as shown in the Fig. 1a, is meaningful.
- The motivation is clear. AdaptOD uses DNE to learn a better vanilla outlier distribution first, and then performs DODA to dynamically adapt the outlier distribution to the true OOD distribution.
- The method is elegant. DODA only calibrates the outlier distribution, effectively eliminating retraining or memory overheads and well approximating the upper-bound performance. DNE effectively adapts the standard energy loss to LTR scenarios with batch energy normalization, and further eliminates the dependence of the energy bar on imbalance factors and training datasets.
- This paper is well-written, and the empirical evaluation is comprehensive.
Weaknesses
- The goal of the DNE component is to "enforce more balanced prediction energy for imbalanced ID samples" (line 88), which is similar to BERL [1]. The authors should further clarify their differences.
- AdaptOD performs better on the near-OOD dataset CIFAR, which previous methods could not achieve. It would be better to further discuss the reason.
[1] Choi, H., Jeong, H., Choi, J.Y.: Balanced energy regularization loss for out-of-distribution detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15691–15700 (2023)
Questions
- DNE is designed for a more effective DODA; what is the most important factor in applying TTA methods to OOD detection in LTR during the training stage?
- The update form of the outlier distribution in Eq. 4 is sample-wise; how does it compare to batch-wise updates?
Limitations
The limitations have been discussed in Appendix E.
Thank you for appreciating our contribution and for your thoughtful and detailed feedback. Please find our responses to your concerns below:
W1: The goal of the DNE component is to "enforce more balanced prediction energy for imbalanced ID samples" as line 88, which is similar to BERL
BERL only balances the prediction energy at the sample level, while our DNE balances the energy not only at the sample level but also at the class level, with the support of the novel batch energy normalization. This difference allows AdaptOD to achieve a much better energy balance and thus better OOD detection and ID classification accuracy, as shown in Tabs. 2 and 3. It also provides stable energy margins across different datasets, eliminating the need for manual tuning of these margins as in BERL and helping achieve better adaptation during DODA.
W2: AdaptOD performs better in the near-OOD dataset CIFAR, which cannot be achieved by previous methods. It would be better to discuss the reason in further detail
The near-OOD dataset CIFAR is similar to the training ID data, so previous methods often confuse the two. However, DODA can utilize the test-time OOD data during inference to effectively adapt the outlier distribution and subsequently calibrate the global energy score. As a result, it helps AdaptOD discriminate near-OOD samples from ID samples better than previous methods. This is supported by the results in Fig. 1 in the attached PDF in the general response, where the performance of three TTA methods, including ours, increases significantly with an increasing number of OOD samples from CIFAR involved in the adaptation.
Q1: DNE is designed for more effective DODA, what's the most important factor in applying TTA methods on OOD detection in LTR during the training stage?
Energy regularization is a mainstream method to improve OOD performance, and in long-tailed OOD detection, it is used to alleviate the bias of the global energy toward head samples in existing studies such as BERL. But we observe that there is a class-level bias toward head classes, in addition to the sample-level bias. Thus, one major contribution of DNE is to enable more balanced energy predictions over head and tail samples thanks to its sample- and class-level energy debiasing. Another important contribution lies in its batch energy normalization, which addresses a largely ignored issue in long-tailed OOD detection: the careful manual tuning of energy margins required to work well on different ID/OOD datasets. DNE provides stable energy margins for different long-tailed ID/OOD datasets, alleviating the aforementioned biases toward head classes without relying on data-specific manual margin tuning.
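As a rough illustration of the batch energy normalization idea, standardizing energies within a mini-batch removes the dependence of the energy scale on the dataset and imbalance factor. This standardization form is our assumption for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def batch_energy_normalization(energies):
    # hypothetical sketch: standardize energies within a mini-batch so the
    # resulting energy margins are comparable across ID/OOD datasets
    return (energies - energies.mean()) / (energies.std() + 1e-8)
```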
Q2: The updated form of the outlier distribution in Eq.4 is sample-wise, how does it compare to batch-wise updates?
In DODA we utilize the statistics of the training ID data to guarantee the optimization of the outlier distribution in Eq. 3 during the test phase, and the pseudo label assigned to each unlabeled test sample depends only on these statistics. Therefore, using each predicted OOD sample to adapt the outlier distribution once is sufficient for our method to work well. Batch-wise updates can make the optimization of the outlier distribution more stable at the beginning of DODA, but yield only minor differences in detection effectiveness, as shown in Tab. C and Tab. D below. Sample-wise dynamic updates as in Eq. 4 are more desirable than batch-wise approaches in the pursuit of instant OOD detection, so we focused on the former in this work.
Table C: Result of sample-wise and batch-wise update on CIFAR10-LT with AdaptOD.
| Method | AUC | AP-in | AP-out | FPR |
|---|---|---|---|---|
| Sample-wise | 94.69 | 93.89 | 94.12 | 27.26 |
| Batch-wise | 94.74 | 93.95 | 94.18 | 27.22 |
Table D: Result of sample-wise and batch-wise update on CIFAR100-LT with AdaptOD.
| Method | AUC | AP-in | AP-out | FPR |
|---|---|---|---|---|
| Sample-wise | 81.93 | 83.09 | 77.83 | 67.37 |
| Batch-wise | 82.06 | 83.14 | 77.88 | 67.32 |
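The sample-wise vs. batch-wise trade-off discussed above can be sketched as follows. The moving-average form and the function names are our assumptions for illustration, not the paper's exact Eq. 4.

```python
import numpy as np

def sample_wise_update(dist, probs, momentum=0.9):
    # blend the running outlier distribution with one predicted OOD
    # sample's class-probability vector, enabling instant OOD detection
    return momentum * dist + (1.0 - momentum) * probs

def batch_wise_update(dist, batch_probs, momentum=0.9):
    # batch-wise variant: average the batch first, then blend once;
    # more stable early on, at the cost of delayed per-sample updates
    return momentum * dist + (1.0 - momentum) * batch_probs.mean(axis=0)
```

Both variants preserve a normalized distribution since they are convex combinations of probability vectors.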
Thanks for your thorough responses, which have addressed my concerns. After reading your rebuttal and the other reviews, I determine to maintain my rating.
Dear Reviewer t7zA,
We're very pleased to know that our response has addressed your concerns. Thank you very much for the affirmative and positive comments on our work.
This paper introduces normalized outlier distribution adaptation (AdaptOD), which adapts the outlier distribution to the true OOD distribution in both the training and inference stages. AdaptOD includes two components: one is dynamic outlier distribution adaptation (DODA), which adapts the outlier distribution to the true OOD distribution; the other is the dual-normalized energy (DNE) loss, which uses class-level and sample-level normalized energy to enforce more balanced prediction energy for imbalanced ID samples. The experiments are conducted on three long-tailed OOD benchmarks.
Strengths
- The introduction of DODA and DNE is novel and addresses the key issue of distribution shift in OOD detection, especially in LTR scenarios.
- The combination of DODA and DNE demonstrates significant improvements over existing methods, both individually and collectively, as evidenced by the ablation studies.
- The paper is exemplarily structured and articulated. It offers a lucid and comprehensive elucidation of the AdaptOD methodology, its constituent elements, and the experimental framework, rendering it accessible to readers.
Weaknesses
The motivation and mechanism underlying dynamic outlier distribution adaptation (DODA) are unclear and lack a theoretical foundation. (1) My primary concern is the setting. The proposed AdaptOD, with its introduction of test-time adaptation (TTA), seems unfair: in previous work on OOD detection, the model's update process could not access the true OOD data and adopted outliers as an alternative, while in this paper the true OOD data is used to update the outlier distribution in the DODA module. As described in Tab. 5, without the DODA module, the AUROC is 72.04 on ImageNet-LT, and this increase is relatively small compared to SOTA. That indicates the main improvement depends on the knowledge leakage of true OOD samples during TTA.
(2) I am confused about the motivation of DODA under this particular OOD setting, which adapts the outlier distribution to the true OOD distribution. Suppose we could access the true OOD distribution in the test phase and update the outlier distribution with the predicted OOD samples: why don't we use the true OOD distribution to calibrate the scores directly? DODA provides access to the true OOD samples and thus offers the perfect solution for auxiliary OOD data, namely directly using the predicted OOD samples as auxiliary OOD data. Moreover, as shown in Fig. 1, the curves of the adapted outliers almost coincide with the true OOD. The authors should give more interpretation.
(3) I am confused about the mechanism of DODA. It is uncertain whether DODA would adapt the outlier distribution to the true OOD distribution: when the initialized model detects OOD samples at the beginning, ID samples may be wrongly predicted as OOD, which would drive the optimization of the outlier distribution toward the ID data distribution rather than the OOD distribution. The direction of the update of the outlier distribution is uncertain, and it is unclear how DODA guarantees the direction toward the true OOD distribution. The authors should give more theoretical analysis and support.
(4) Is adapting the outlier distribution to the true OOD distribution with specific OOD samples meaningful for OOD detection? As OOD samples can be drawn from any unknown distribution, the benchmarks are merely built to evaluate a model's ability to detect the unknown. The specific adaptation of the outlier distribution would improve performance on the corresponding benchmarks; however, it cannot generalize to other benchmarks and would be useless for OOD samples from an unknown distribution.
Questions
- The experimental comparisons in Tab. 1 seem unfair. COCL and EnergyOE cannot access the true OOD data, while the proposed method AdaptOD updates the global energy score with true OOD data in the test phase.
- For Tab. 2 and Tab. 3, previous methods combined with the DODA module, obeying the OOD setting, should be considered.
Limitations
The authors have discussed the limitations of this paper, and there is no negative societal impact.
Thank you for recognizing the novelty of our method and its improvement over existing methods. On the other hand, there might be some misunderstanding regarding the setting, which we will clarify in the following responses and further improve our writing.
W1: concern for the setting
- Test-time adaptation (TTA) for OOD detection
TTA is a widely used technique that utilizes test data (without knowing its class labels) to help models improve performance and adapt to real-world scenarios. Its application to OOD detection is relatively less explored, but it can help quickly adapt to new OOD scenarios and improve OOD performance by utilizing real OOD samples. This has been demonstrated in recent studies, AUTO and AdaOOD, and the more recently published RTL at CVPR 2024 [Ref1]. These justify the importance of designing TTA methods to enhance OOD detection. However, please note that the TTA module is assumed NOT to have access to the ground truth of the test data in all these methods, including ours, meaning that it handles unlabeled test data. The ground truth of the test data is used in detection performance evaluation only. Thus, there is no data (label) leakage issue.
- Unfair comparison to methods that could not access the true OOD data
It is true that our method and other TTA-driven methods have the advantage of accessing samples drawn from the true OOD distribution. To ensure a fair, comprehensive comparison, in Tab. 2 we compare our full method AdaptOD with four DODA-enabled SOTA non-TTA methods. Furthermore, in Tab. 3 we compare our DODA with the two existing TTA methods for OOD detection, each respectively combined with different current OOD detection methods. Additionally, we further compare our AdaptOD with RTL [Ref1] in the general response. All of these experiments demonstrate the superiority of AdaptOD in long-tailed OOD detection.
- Small AUC improvement on ImageNet-LT compared to SOTA without the DODA module
Both of our modules, DODA and DNE, make major contributions to the overall detection performance of AdaptOD. We agree that the DNE module leads to a relatively small AUC improvement on ImageNet-LT, but it consistently improves not only all OOD detection metrics but also ID classification accuracy across all three benchmarks. Additionally, DNE provides stable energy margins on different datasets, eliminating the need for manual tuning of these margins.
[Ref1] Fan, Ke, et al. "Test-Time Linear Out-of-Distribution Detection." Proceedings of CVPR, 2024.
W2: the motivation of DODA
- Why don't we use the true OOD distribution to calibrate the scores?
There may be a misunderstanding of the setting. Although we can access the test data that contains the true OOD data, we do not know the true OOD distribution since we cannot access the test labels. Therefore, DODA first assigns pseudo labels to the test data, and then uses the predicted OOD samples (which may involve false predictions) to gradually estimate the true OOD distribution.
- The curves of the adapted outliers almost coincided with the true OOD
This occurs because of the highly effective outlier distribution adaptation in our method. It is also justified by the results in Tab. 5, where AdaptOD has only a small performance gap to the Oracle model that utilizes the ground-truth labels of the OOD test data to update the outlier distribution. This indicates that AdaptOD can well approximate the true OOD distribution using the predicted labels of the OOD samples.
W3: the mechanism of DODA. Wrong predictions may lead to bad optimization of the outlier distribution
Yes, it is true that wrongly predicted OOD samples could lead to a bad outlier distribution during TTA, as shown in Fig. 3. Therefore, our DODA designs a Z-score-based method in Eq. 3 to implement an OOD filter based on the training ID data, where the threshold (corresponding to a 99% confidence interval) is set so that a predicted OOD sample is utilized as a true OOD only with very high confidence. There may still be a few wrongly predicted OOD samples, but their influence is limited in DODA, as they are often similar to true OOD data; moreover, the optimization of the outlier distribution is dominated by the correctly predicted OOD samples, which form the large majority. As a result, our AdaptOD can quickly adapt the outlier distribution and perform very stably thereafter (see Fig. 3).
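The Z-score-based filter described above can be sketched roughly as follows. The threshold value, the direction of the energy comparison, and the function names are our assumptions for illustration, not the paper's exact Eq. 3.

```python
import numpy as np

def fit_id_energy_stats(id_energies):
    # statistics of energy scores computed once on the training ID data
    return float(id_energies.mean()), float(id_energies.std())

def is_confident_ood(energy, mu, sigma, z=2.58):
    # treat a predicted OOD sample as a true OOD only when its energy
    # deviates from the ID statistics beyond z (z = 2.58 ~ 99% confidence)
    return (energy - mu) / sigma > z
```

Only samples passing this filter would then be used to update the outlier distribution.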
W4: the meaning of adapting the outlier distribution to true OOD distribution with specified OOD samples
Yes, OOD samples can be drawn from any unknown distribution in different deployment stages and/or applications. This is exactly the main motivation of our work: we aim to continuously adapt the outlier distribution by utilizing the dynamic, unknown OOD samples in the target application scenarios to tackle this challenge. Our comprehensive results with different OOD data across three ID data benchmarks show that AdaptOD generalizes well regardless of differences in the ID or OOD datasets.
Q1: The experimental comparisons for Tab. 1 seem unfair
COCL and EnergyOE are previous SOTA methods, so we directly compare them in Tab. 1 to show the significant improvement of AdaptOD as a whole. For a fairer comparison, in Tab. 2 and Tab. 3 we report the average performance across the six OOD datasets of Tab. 1 for COCL and EnergyOE combined with DODA and other TTA methods, so that COCL and EnergyOE can also access the test-time true OOD data.
Q2: for Tab. 2 and Tab. 3, the previous methods with DODA modules obeying the OOD setting should be considered
DODA is a TTA method that can be combined with previous OOD methods to utilize unlabeled test data for tackling the distribution shift problem. For a fair comparison, we have compared AdaptOD with previous OOD detection methods enabled by DODA in Tab. 2, and we have also compared DODA with other TTA methods in Tab. 3, as well as with the recent RTL in the tables above.
Dear Reviewer KD4x,
As far as we understand, you might have some major misunderstanding of our setting and method. In our rebuttal, we have provided detailed clarifications to address these issues. Since the author-reviewer discussion is coming to an end soon, please kindly advise whether our response has addressed your concerns. We're more than happy to address any further concerns you might have. Thank you very much for helping enhance our paper!
We sincerely appreciate the reviewers' time and invaluable feedback. We are encouraged that there is a unanimous consensus among all reviewers highlighting our method's effectiveness through extensive experiments. The reviewers also appreciate the importance of the addressed problem (KD4x, t7zA, udMg) and the novelty of our method (KD4x, t7zA, eeVQ). Additionally, we are pleased that the reviewers found the paper clear and easy to understand (KD4x, t7zA, udMg).
We respond to each reviewer's comments in detail below. We will incorporate the reviewers' suggestions into the manuscript revisions, which we believe will significantly enhance the paper's quality. We also include an additional PDF containing a revised Figure 2 and visualization results on the near-OOD dataset CIFAR.
To further demonstrate the improvement of our AdaptOD, we conduct extra experiments comparing it with RTL [Ref1]. We present the new results in Tab. A and Tab. B.
Table A: Result of AdaptOD and RTL on CIFAR10-LT.
| Method | AUC | AP-in | AP-out | FPR |
|---|---|---|---|---|
| AdaptOD | 94.69 | 93.89 | 94.12 | 27.26 |
| RTL | 92.69 | 92.73 | 92.34 | 30.68 |
Table B: Result of AdaptOD and RTL on CIFAR100-LT.
| Method | AUC | AP-in | AP-out | FPR |
|---|---|---|---|---|
| AdaptOD | 81.93 | 83.09 | 77.83 | 67.37 |
| RTL | 79.58 | 81.06 | 75.04 | 72.34 |
References
- [Ref1] Fan, Ke, et al. "Test-Time Linear Out-of-Distribution Detection." Proceedings of CVPR, 2024.
This paper offers a novel approach to handling the significant challenge of distribution shifts in out-of-distribution (OOD) detection, especially under long-tailed recognition scenarios. The introduction of Dynamic Outlier Distribution Adaptation (DODA) and Dual-Normalized Energy Loss (DNE) is a commendable attempt to refine OOD detection mechanisms, supported by substantial experimental validation showing improved performance over existing methods.
However, the reviews and discussions reveal a mix of appreciation for the novelty and effectiveness of the proposed methods and concerns regarding certain aspects like the theoretical foundation of DODA and the practical applicability of the approach given its dependency on auxiliary outliers. The authors have responded thoroughly, attempting to clarify misunderstandings and expand on the methodology. Given the overall positive impact of the method demonstrated through rigorous experimentation and its potential to advance OOD detection practices, I recommend accepting this paper with necessary revisions.