PaperHub
Rating: 6.5/10
Poster · 4 reviewers
Lowest: 6 · Highest: 8 · Standard deviation: 0.9
Individual ratings: 6, 6, 8, 6
Confidence: 3.3
ICLR 2024

How Does Unlabeled Data Provably Help Out-of-Distribution Detection?

Submitted: 2023-09-19 · Updated: 2024-04-21

Keywords
out-of-distribution detection, learnability

Reviews and Discussion

Official Review
Rating: 6

This study addresses wild Out-Of-Distribution (OOD) detection, which benefits from the availability of additional unlabeled data to enhance OOD data identification. The paper proposes using the top singular value as a criterion to differentiate between In-Distribution (ID) and OOD samples. The authors, grounded in novel theoretical insights, posit that ID samples should exhibit larger top singular values compared to OOD samples. Both experimental results and theoretical analyses corroborate the effectiveness of the proposed method, Separate And Learn (SAL). Overall, this work stands out for its solid foundation.

Strengths

  • Using gradients to make distinctions is a common and intuitive approach that makes sense to me. However, the method of distinguishing based on the top singular value is both intriguing and non-obvious. This innovative discovery holds significant importance for OOD detection.

  • The availability of extra unlabeled data undoubtedly enables wild OOD detection to outperform traditional OOD detection, which does not utilize unlabeled data. However, what I find astonishing is that the performance of wild OOD detection surpasses even that of outlier exposure, showcasing its remarkable effectiveness.

  • The method is supported by a theoretical examination of its generalization bounds, ensuring a solid foundation for reliable ML.

Weaknesses

  • One of my major concerns with the paper lies in the discrepancy assumption stated in Theorem 1. To validate this, additional experiments with diverse datasets are imperative to ascertain whether the assumption is consistently met in practical scenarios. While I acknowledge that assumptions are essential for theoretical foundations, and the experiments in Appendix F provide a good start, they are insufficient for a comprehensive evaluation. I recommend extending the empirical analysis to additional datasets (e.g., CIFAR-10 as ID and another data as OOD), with the goal of thoroughly investigating the generality of the assumption and its ability to underpin SAL robustly.

  • In the majority of the experiments conducted, the value of pi is set to 0.1. To ensure a comprehensive evaluation, it is crucial to conduct additional experiments with varied values of pi.

Questions

Please address the issues outlined in the weaknesses. This work is solid and good. So addressing all of my concerns comprehensively will lead me to reconsider and potentially raise my score.

Comment

We are deeply encouraged that you recognize our method to be novel, significant, and solid in both the algorithm and theory and with remarkable empirical results.

Your summary and comments are insightful and spot-on :)

A1. Verification of the assumptions on additional datasets

Thank you for the suggestion! As suggested, we verified the assumption of distribution discrepancy using CIFAR-10 as ID and five other OOD datasets, i.e., SVHN, PLACES365, LSUN-C, TEXTURES, and LSUN-R. The result is shown as follows and in Appendix Section W, and we can conclude that $\zeta$ can indeed satisfy the regulatory condition in Theorem 2, i.e., $\zeta > 1.011\sqrt{\pi}$.

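SVHN

| $\pi$ | 0.05 | 0.1 | 0.2 | 0.5 | 0.7 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|
| $\zeta$ | 0.26 | 0.37 | 0.49 | 0.71 | 0.97 | 1.24 | 1.36 |
| $1.011\sqrt{\pi}$ | 0.23 | 0.32 | 0.45 | 0.71 | 0.84 | 0.96 | 1.0 |

PLACES365

| $\pi$ | 0.05 | 0.1 | 0.2 | 0.5 | 0.7 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|
| $\zeta$ | 0.28 | 0.33 | 0.53 | 0.77 | 0.85 | 0.98 | 1.04 |
| $1.011\sqrt{\pi}$ | 0.23 | 0.32 | 0.45 | 0.71 | 0.84 | 0.96 | 1.0 |

LSUN-C

| $\pi$ | 0.05 | 0.1 | 0.2 | 0.5 | 0.7 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|
| $\zeta$ | 0.29 | 0.34 | 0.47 | 0.72 | 0.87 | 1.09 | 1.20 |
| $1.011\sqrt{\pi}$ | 0.23 | 0.32 | 0.45 | 0.71 | 0.84 | 0.96 | 1.0 |

TEXTURES

| $\pi$ | 0.05 | 0.1 | 0.2 | 0.5 | 0.7 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|
| $\zeta$ | 0.28 | 0.33 | 0.46 | 0.74 | 0.85 | 0.96 | 1.05 |
| $1.011\sqrt{\pi}$ | 0.23 | 0.32 | 0.45 | 0.71 | 0.84 | 0.96 | 1.0 |

LSUN-R

| $\pi$ | 0.05 | 0.1 | 0.2 | 0.5 | 0.7 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|
| $\zeta$ | 0.28 | 0.35 | 0.47 | 0.73 | 0.87 | 1.10 | 1.22 |
| $1.011\sqrt{\pi}$ | 0.23 | 0.32 | 0.45 | 0.71 | 0.84 | 0.96 | 1.0 |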
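
The condition $\zeta > 1.011\sqrt{\pi}$ can be spot-checked from the reported numbers. The following is a minimal sketch, not the authors' verification code: the function names `threshold` and `check` are hypothetical, and the $\zeta$ values are simply copied from the SVHN table above.

```python
import math

# Mixing ratios and reported discrepancy values, copied from the SVHN table.
PI_VALUES = [0.05, 0.1, 0.2, 0.5, 0.7, 0.9, 1.0]
ZETA_SVHN = [0.26, 0.37, 0.49, 0.71, 0.97, 1.24, 1.36]


def threshold(pi: float) -> float:
    """Right-hand side of the condition zeta > 1.011 * sqrt(pi)."""
    return 1.011 * math.sqrt(pi)


def check(pi_values, zeta_values):
    """Return (pi, zeta, threshold, margin) for each column of a table."""
    rows = []
    for pi, zeta in zip(pi_values, zeta_values):
        t = threshold(pi)
        rows.append((pi, zeta, t, zeta - t))
    return rows


if __name__ == "__main__":
    for pi, zeta, t, margin in check(PI_VALUES, ZETA_SVHN):
        print(f"pi={pi:<4} zeta={zeta:.2f} threshold={t:.4f} margin={margin:+.4f}")
```

Because the tables report two decimals, a column in which $\zeta$ and the threshold print identically (for example $\pi = 0.5$ for SVHN) cannot be resolved from the rounded values alone; the unrounded $\zeta$ would be needed.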

Comment

A2. Additional experiment results with varying $\pi$

We absolutely agree with this concern. We would like to point the reviewer to the Appendix Table 6 of the original submission, where we report the OOD detection result and the filtering error on SVHN with different mixing ratios $\pi$. The result aligns well with our observation of the bounds presented in Section 4.1 of the main paper.

During rebuttal, we provide additional results on more OOD datasets with varying $\pi$, i.e., 0.05, 0.2, 0.5, 0.9, and contrast with the baselines, which are added to Appendix Section T, and also shown below (CIFAR-100 as the in-distribution dataset). We found that the advantage of SAL still holds.

| $\pi$ | Method | SVHN FPR95 | SVHN AUROC | Places365 FPR95 | Places365 AUROC | LSUN-C FPR95 | LSUN-C AUROC | LSUN-R FPR95 | LSUN-R AUROC | Textures FPR95 | Textures AUROC | ID ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.05 | OE | 2.78 | 98.84 | 63.63 | 80.22 | 6.73 | 98.37 | 2.06 | 99.19 | 32.86 | 90.88 | 71.98 |
| 0.05 | Energy w/ OE | 2.02 | 99.17 | 56.18 | 83.33 | 4.32 | 98.42 | 3.96 | 99.29 | 40.41 | 89.80 | 73.45 |
| 0.05 | WOODS | 0.26 | 99.89 | 32.71 | 90.01 | 0.64 | 99.77 | 0.79 | 99.10 | 12.26 | 94.48 | 74.15 |
| 0.05 | SAL (Ours) | 0.17 | 99.90 | 6.21 | 96.87 | 0.94 | 99.79 | 0.84 | 99.37 | 5.77 | 97.12 | 73.99 |
| 0.2 | OE | 2.59 | 98.90 | 55.68 | 84.36 | 4.91 | 99.02 | 1.97 | 99.37 | 25.62 | 93.65 | 73.72 |
| 0.2 | Energy w/ OE | 1.79 | 99.25 | 47.28 | 86.78 | 4.18 | 99.00 | 3.15 | 99.35 | 36.80 | 91.48 | 73.91 |
| 0.2 | WOODS | 0.22 | 99.82 | 29.78 | 91.28 | 0.52 | 99.79 | 0.89 | 99.56 | 10.06 | 95.23 | 73.49 |
| 0.2 | SAL (Ours) | 0.08 | 99.92 | 2.80 | 99.31 | 0.05 | 99.94 | 0.02 | 99.97 | 5.71 | 98.71 | 73.86 |
| 0.5 | OE | 2.86 | 99.05 | 40.21 | 88.75 | 4.13 | 99.05 | 1.25 | 99.38 | 22.86 | 94.63 | 73.38 |
| 0.5 | Energy w/ OE | 2.71 | 99.34 | 34.82 | 90.05 | 3.27 | 99.18 | 2.54 | 99.23 | 30.16 | 94.76 | 72.76 |
| 0.5 | WOODS | 0.17 | 99.80 | 21.87 | 93.73 | 0.48 | 99.61 | 1.24 | 99.54 | 9.95 | 95.97 | 73.91 |
| 0.5 | SAL (Ours) | 0.02 | 99.98 | 1.27 | 99.62 | 0.04 | 99.96 | 0.01 | 99.99 | 5.64 | 99.16 | 73.77 |
| 0.9 | OE | 0.84 | 99.36 | 19.78 | 96.29 | 1.64 | 99.57 | 0.51 | 99.75 | 12.74 | 94.95 | 72.02 |
| 0.9 | Energy w/ OE | 0.97 | 99.64 | 17.52 | 96.53 | 1.36 | 99.73 | 0.94 | 99.59 | 14.01 | 95.73 | 73.62 |
| 0.9 | WOODS | 0.05 | 99.98 | 11.34 | 95.83 | 0.07 | 99.99 | 0.03 | 99.99 | 6.72 | 98.73 | 73.86 |
| 0.9 | SAL (Ours) | 0.03 | 99.99 | 2.79 | 99.89 | 0.05 | 99.99 | 0.01 | 99.99 | 5.88 | 99.53 | 74.01 |

Official Review
Rating: 6

Leveraging unlabeled data has shown potential in enhancing the safety and reliability of machine learning models for out-of-distribution (OOD) detection, despite the challenges posed by the heterogeneity of both in-distribution (ID) and OOD data. This paper introduces SAL (Separate And Learn), a novel learning framework that addresses the existing gap in understanding how unlabeled data aids OOD detection by providing strong theoretical guarantees and empirical effectiveness. SAL operates by isolating potential outliers from the unlabeled data, training an OOD classifier with these outliers and labeled ID data, and achieving state-of-the-art performance on standard benchmarks, thereby validating the theoretical framework and its components.

Strengths

  1. The manuscript presents some theoretical analyses as well as a number of intriguing illustrations.
  2. The experimental results look promising, with comparisons made against numerous baseline methods.

Weaknesses

  1. The manuscript lacks crucial baselines, such as [1] and [2], which are essential for a comprehensive evaluation, and fails to provide an analysis or comparison with them.
  2. The essential topic of the article is weakly supervised out-of-distribution detection, although it is described from different perspectives.

[1] Zhou, Zhi, et al. "Step: Out-of-distribution detection in the presence of limited in-distribution labeled data." Advances in Neural Information Processing Systems 34 (2021): 29168-29180.

[2] He, Rundong, et al. "Topological structure learning for weakly-supervised out-of-distribution detection." arXiv preprint arXiv:2209.07837 (2022).

Questions

  1. Near OOD Scenario: It is unclear how effective the method presented in this article would be in a near OOD scenario, such as treating the first 50 classes of CIFAR-100 as ID and the last 50 classes as OOD. This specific situation could pose challenges since near OOD samples may share similarities with the ID data, potentially affecting the method's performance.
  2. Generalization Performance Across Different Backbones: The article does not provide information on how well the method generalizes when different backbone architectures are used. The performance stability of the method when transitioning between various model architectures is a critical aspect to consider for its widespread applicability.
  3. Generalization Performance on unseen OOD data: There is no discussion on the method's effectiveness when the test OOD data and the OOD data in the unlabeled set are not identically distributed. Understanding how the method handles such disparities is crucial for evaluating its robustness in real-world scenarios.

Comment

Dear Reviewer dSEG,

Thank you for taking the time to read our response and increasing your score! We are glad to hear that the response solved your concern. We will make sure the discussion and results are added in the final version!

Thanks,

Authors

Comment

We thank you for recognizing our method to be novel with promising results. We thank the reviewer for the thorough comments and suggestions, which we address below:

A1. Baselines

As suggested, we have additionally compared with the two related works (TSL [1] and STEP [2]). To ensure a fair comparison, we strictly follow the experimental setting in TSL [1], and rerun SAL under the identical setup. The comparison on CIFAR-100 is shown as follows. Accordingly, we have also added discussion and proper citations of the mentioned papers in the revised paper (See related work section and Appendix Section U). We thank you for pointing them out!

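| Method | LSUN-C FPR95 | LSUN-C AUROC | LSUN-R FPR95 | LSUN-R AUROC |
|---|---|---|---|---|
| STEP | 0.00 | 99.99 | 9.81 | 97.87 |
| TSL | 0.00 | 100.00 | 1.76 | 99.57 |
| SAL (Ours) | 0.00 | 99.99 | 0.58 | 99.95 |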

A2. Discussion on weakly supervised OOD detection

We agree that weakly supervised OOD detection is indeed similar to the problem setting of SAL. We have already updated the related work section of our paper and included more discussions/citations on weakly supervised OOD detection. Thank you for pointing this out!

A3. Near-OOD Scenario

We are glad you bring that up. We have already evaluated the near-OOD detection in Appendix J. Specifically, we use the CIFAR-10 as the in-distribution data and the CIFAR-100 as the OOD data in the wild. During test time, we use the test set of CIFAR-100 as the OOD for evaluation. With a mixing ratio $\pi$ of 0.1, SAL achieves an FPR95 of 24.51% and AUROC of 95.55% compared to 38.92% (FPR95) and 93.27% (AUROC) of WOODS.
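
For readers less familiar with the two metrics quoted throughout this thread, the sketch below shows one common way to compute them. It is an illustration under stated assumptions, not the evaluation code used for the paper: it assumes ID is the positive class and that a higher score means "more in-distribution", and the function names are hypothetical.

```python
import numpy as np


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID when about 95% of ID samples are accepted.

    Assumes a higher score means "more in-distribution".
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Threshold at the 5th percentile of ID scores, so roughly 95% of ID lies above it.
    thresh = np.percentile(id_scores, 5)
    return float(np.mean(ood_scores >= thresh))


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores above a random OOD sample.

    Ties count as one half (the Mann-Whitney U formulation of AUROC).
    Builds an (n_id, n_ood) comparison matrix, so it is meant for small inputs.
    """
    id_col = np.asarray(id_scores, dtype=float)[:, None]
    ood_row = np.asarray(ood_scores, dtype=float)[None, :]
    wins = (id_col > ood_row).mean()
    ties = (id_col == ood_row).mean()
    return float(wins + 0.5 * ties)
```

Lower FPR95 and higher AUROC are better, which is how the tables in this discussion should be read.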

In addition, we follow the suggested data setting by the reviewer, i.e., the first 50 classes of CIFAR-100 as ID and the last 50 classes as OOD. The comparison with the most competitive baseline is reported as follows. We have also added the new results to Appendix Section J in the revised manuscript.

CIFAR-50

| Method | FPR95 | AUROC | ID ACC |
|---|---|---|---|
| WOODS | 41.28 | 89.74 | 74.17 |
| SAL (Ours) | 29.71 | 93.13 | 73.86 |
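
The class-based near-OOD split described above can be sketched as follows. This is an assumption-laden illustration: it takes "first 50 classes" to mean label indices 0 to 49, the helper name `split_near_ood` is hypothetical, and synthetic labels stand in for CIFAR-100, which is not loaded here.

```python
import numpy as np


def split_near_ood(labels, num_id_classes=50):
    """Return boolean masks (is_id, is_ood) for a class-based near-OOD split.

    Classes 0..num_id_classes-1 are treated as ID, the remaining classes as OOD.
    """
    labels = np.asarray(labels)
    is_id = labels < num_id_classes
    return is_id, ~is_id


# Stand-in labels: 100 classes with 5 samples each.
labels = np.repeat(np.arange(100), 5)
is_id, is_ood = split_near_ood(labels)
```

The two masks are disjoint and cover every sample, so the ID classifier never sees a label from the held-out half.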

Comment

A4. Generalization Performance Across Different Backbones

As suggested, we have additionally tried ResNet-18 and ResNet-34 as the network architectures, which are among the most widely used in the OOD detection literature. The comparison with the baselines on CIFAR-100 is shown in the following tables and Appendix Section V, where SAL outperforms all the baselines across different architectures. These additional results support the effectiveness of our approach.

ResNet-18

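| Method | SVHN FPR95 | SVHN AUROC | Places365 FPR95 | Places365 AUROC | LSUN-C FPR95 | LSUN-C AUROC | LSUN-R FPR95 | LSUN-R AUROC | Textures FPR95 | Textures AUROC | Average FPR95 | Average AUROC | ID ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MSP | 81.32 | 77.74 | 83.06 | 74.47 | 70.11 | 83.51 | 82.46 | 75.73 | 85.11 | 73.36 | 80.41 | 76.96 | 78.67 |
| ODIN | 40.94 | 93.29 | 87.71 | 71.46 | 28.72 | 94.51 | 79.61 | 82.13 | 83.63 | 72.37 | 64.12 | 82.75 | 78.67 |
| Mahalanobis | 22.44 | 95.67 | 92.66 | 61.39 | 68.90 | 86.30 | 23.07 | 94.20 | 62.39 | 79.39 | 53.89 | 83.39 | 78.67 |
| Energy | 81.74 | 84.56 | 82.23 | 76.68 | 34.78 | 93.93 | 73.57 | 82.99 | 85.87 | 74.94 | 71.64 | 82.62 | 78.67 |
| KNN | 83.62 | 72.76 | 82.09 | 80.03 | 65.96 | 84.82 | 71.05 | 81.24 | 76.88 | 77.90 | 75.92 | 79.35 | 78.67 |
| ReAct | 70.81 | 88.24 | 81.33 | 76.49 | 39.99 | 92.51 | 54.47 | 89.56 | 59.15 | 87.96 | 61.15 | 86.95 | 78.67 |
| DICE | 54.65 | 88.84 | 79.58 | 77.26 | 0.93 | 99.74 | 49.40 | 91.04 | 65.04 | 76.42 | 49.92 | 86.66 | 78.67 |
| CSI | 49.98 | 89.57 | 82.87 | 75.64 | 76.39 | 80.38 | 74.21 | 83.34 | 58.23 | 81.04 | 68.33 | 81.99 | 74.23 |
| KNN+ | 43.21 | 90.21 | 84.62 | 74.21 | 50.12 | 82.48 | 76.92 | 80.81 | 63.21 | 84.91 | 63.61 | 82.52 | 77.03 |
| OE | 3.29 | 97.93 | 62.90 | 80.23 | 7.07 | 95.93 | 4.06 | 97.98 | 33.27 | 90.03 | 22.12 | 92.42 | 74.89 |
| Energy (w/ OE) | 3.12 | 94.27 | 59.38 | 82.19 | 9.12 | 91.23 | 7.28 | 95.39 | 43.92 | 90.11 | 24.56 | 90.64 | 77.92 |
| WOODS | 3.92 | 96.92 | 33.92 | 86.29 | 5.19 | 94.23 | 2.95 | 96.23 | 11.95 | 94.65 | 11.59 | 93.66 | 77.54 |
| SAL (Ours) | 2.29 | 97.96 | 6.29 | 96.66 | 3.92 | 97.81 | 4.87 | 97.10 | 8.28 | 95.95 | 5.13 | 97.10 | 77.71 |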

ResNet-34

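| Method | SVHN FPR95 | SVHN AUROC | Places365 FPR95 | Places365 AUROC | LSUN-C FPR95 | LSUN-C AUROC | LSUN-R FPR95 | LSUN-R AUROC | Textures FPR95 | Textures AUROC | Average FPR95 | Average AUROC | ID ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MSP | 78.89 | 79.80 | 84.38 | 74.21 | 83.47 | 75.28 | 84.61 | 74.51 | 86.51 | 72.53 | 83.12 | 75.27 | 79.04 |
| ODIN | 70.16 | 84.88 | 82.16 | 75.19 | 76.36 | 80.10 | 79.54 | 79.16 | 85.28 | 75.23 | 78.70 | 79.11 | 79.04 |
| Mahalanobis | 87.09 | 80.62 | 84.63 | 73.89 | 84.15 | 79.43 | 83.18 | 78.83 | 61.72 | 84.87 | 80.15 | 79.53 | 79.04 |
| Energy | 66.91 | 85.25 | 81.41 | 76.37 | 59.77 | 86.69 | 66.52 | 84.49 | 79.01 | 79.96 | 70.72 | 82.55 | 79.04 |
| KNN | 81.12 | 73.65 | 79.62 | 78.21 | 63.29 | 85.56 | 73.92 | 79.77 | 73.29 | 80.35 | 74.25 | 79.51 | 79.04 |
| ReAct | 82.85 | 70.12 | 81.75 | 76.25 | 80.70 | 83.03 | 67.40 | 83.28 | 74.60 | 81.61 | 77.46 | 78.86 | 79.04 |
| DICE | 83.55 | 72.49 | 85.05 | 75.92 | 94.05 | 73.59 | 75.20 | 80.90 | 79.80 | 77.83 | 83.53 | 76.15 | 79.04 |
| CSI | 44.53 | 92.65 | 79.08 | 76.27 | 75.58 | 83.78 | 76.62 | 84.98 | 61.61 | 86.47 | 67.48 | 84.83 | 77.89 |
| KNN+ | 39.23 | 92.78 | 80.74 | 77.58 | 48.99 | 89.30 | 74.99 | 82.69 | 57.15 | 88.35 | 60.22 | 86.14 | 78.32 |
| OE | 2.11 | 98.23 | 60.12 | 83.22 | 6.08 | 96.34 | 3.94 | 98.13 | 30.00 | 92.27 | 20.45 | 93.64 | 75.72 |
| Energy (w/ OE) | 1.94 | 95.03 | 68.84 | 85.94 | 7.66 | 92.04 | 6.86 | 97.63 | 40.82 | 93.07 | 25.22 | 92.74 | 78.75 |
| WOODS | 2.08 | 97.33 | 25.37 | 88.93 | 4.26 | 97.74 | 1.05 | 97.30 | 8.85 | 96.86 | 8.32 | 95.63 | 78.97 |
| SAL (Ours) | 0.98 | 99.94 | 2.98 | 99.08 | 0.07 | 99.94 | 0.03 | 99.96 | 4.01 | 98.83 | 1.61 | 99.55 | 78.01 |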

Comment

A5. Generalization Performance on unseen OOD data

Another great point! In the original submission, we have included the result where the OOD data in the wild is different from the test OOD data (please see Appendix Section I and Table 8). Specifically, we use 300K RANDOM IMAGES as the wild OOD dataset and SVHN, PLACES365, LSUN-C, LSUN-RESIZE, and TEXTURES as the test OOD data. We observe that SAL can perform competitively on unseen OOD datasets as well, compared to the most relevant baseline WOODS.

In addition, following [1], we use CIFAR-100 as ID, the TINc/TINr dataset as the wild OOD dataset, and TINr/TINc as the test OOD. The comparison with baselines is shown below and in Appendix Section I, where the strong performance of SAL still holds.

| Method | TINr FPR95 | TINr AUROC | TINc FPR95 | TINc AUROC |
|---|---|---|---|---|
| STEP | 72.31 | 74.59 | 48.68 | 91.14 |
| TSL | 57.52 | 82.29 | 29.48 | 94.62 |
| SAL (Ours) | 43.11 | 89.17 | 19.30 | 96.29 |

[1] He, Rundong, et al. "Topological structure learning for weakly-supervised out-of-distribution detection." arXiv preprint arXiv:2209.07837 (2022).

[2] Zhou, Zhi, et al. "Step: Out-of-distribution detection in the presence of limited in-distribution labeled data." Advances in Neural Information Processing Systems 34 (2021): 29168-29180.

Comment

Thank you for the authors' response, which resolves most of my questions. I will raise my score. Please also add your response to the final version.

Official Review
Rating: 8

This paper introduces a novel setting for OOD detection, termed "wild OOD detection," building upon the foundation established by the preceding work "Training OOD Detectors in their Natural Habitats." A novel methodology, denoted as SAL, is presented, encapsulating a two-stage process comprising filtering and classification components. Empirical evaluations have demonstrated that SAL achieves SOTA performance, boasting substantial improvements over existing methods. Additionally, theoretical underpinnings are provided to bolster the credibility and effectiveness of SAL.

Strengths

  1. A theory has been established to investigate aspects of separability and learnability. This contribution is both novel and significant.

  2. Experimental evaluations conducted on standard benchmarks demonstrate that SAL achieves SOTA performance.

  3. A novel method grounded in theory has been developed to advance safe machine learning practices. Theory serves as a crucial driver in this endeavour. I am very happy to see the novel work on provable OOD detection.

Weaknesses

  1. Could you provide explanations or conduct experiments to elucidate the factors contributing to the decreased ID accuracy depicted in Table 1?

  2. Why is pi set to 0.1 in most experiments? Could you conduct additional experiments to investigate whether pi remains robust across a range of values? Furthermore, does the performance of pi align with theoretical predictions?

  3. It appears that the top singular vector is crucial for SAL. Have you conducted any experiments to demonstrate the performance when considering the top 2, top 3, ..., top k singular vectors?

  4. What would occur if you were to use the gradient norm in place of the top singular vector? Could you elucidate the rationale behind opting for the top singular vector instead of the norm?

Questions

Please refer to the Weaknesses.

Comment

We are glad to see that the reviewer finds our work significant and novel from various perspectives. We thank the reviewer for the thorough comments and suggestions. We are happy to clarify as follows:

A1. ID accuracy

Great observation! As explained in paragraph 2 of Section 5.2 in the original submission, the slight discrepancy arises because our method observes only 25,000 labeled ID samples, whereas baseline methods (without using wild data) use the entire CIFAR training set of 50,000 samples. We have highlighted this in bold in the revision.

A2. Additional experiment results with varying $\pi$

Thank you for your suggestion! In our main experiment, we default $\pi$ to be 0.1, which strictly follows the original setting in WOODS [1]. This reflects the practical scenario that the majority of test data may remain ID. Compared to larger $\pi$, our setting with $\pi=0.1$ is also more challenging due to limited information of OOD data.
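
To make the role of the mixing ratio concrete, the sketch below draws an unlabeled "wild" set in which each sample is OOD with probability $\pi$ and ID otherwise. The mixture form $(1-\pi)\,P_{\text{in}} + \pi\,P_{\text{out}}$ is an assumption based on the term "mixing ratio" used in this thread; the function name `sample_wild` and its arguments are hypothetical, and this is not the authors' data pipeline.

```python
import numpy as np


def sample_wild(id_pool, ood_pool, n, pi, seed=0):
    """Draw n unlabeled samples; each is OOD with probability pi, else ID.

    id_pool and ood_pool are 2-D arrays of shape (num_samples, num_features).
    Returns the samples and a hidden mask (True = OOD) that a real learner
    would not get to see.
    """
    rng = np.random.default_rng(seed)
    is_ood = rng.random(n) < pi
    id_idx = rng.integers(0, len(id_pool), size=n)
    ood_idx = rng.integers(0, len(ood_pool), size=n)
    # Pick the OOD row where the mask is True and the ID row elsewhere.
    wild = np.where(is_ood[:, None], ood_pool[ood_idx], id_pool[id_idx])
    return wild, is_ood
```

With `pi=0.1`, roughly one wild sample in ten is an outlier, which illustrates why a small $\pi$ gives the filter fewer outliers to work with.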

We would like to point the reviewer to the Appendix Table 6 of the original submission, where we report the OOD detection result and the filtering error on SVHN with different mixing ratios $\pi$. The result aligns well with our observation of the bounds presented in Section 4.1 of the main paper.

During rebuttal, we provide additional results on more OOD datasets with varying $\pi$, i.e., 0.05, 0.2, 0.5, 0.9, and contrast with the baselines, which are added to Appendix Section T, and also shown below (CIFAR-100 as the in-distribution dataset). We found that the advantage of SAL still holds.

| $\pi$ | Method | SVHN FPR95 | SVHN AUROC | Places365 FPR95 | Places365 AUROC | LSUN-C FPR95 | LSUN-C AUROC | LSUN-R FPR95 | LSUN-R AUROC | Textures FPR95 | Textures AUROC | ID ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.05 | OE | 2.78 | 98.84 | 63.63 | 80.22 | 6.73 | 98.37 | 2.06 | 99.19 | 32.86 | 90.88 | 71.98 |
| 0.05 | Energy w/ OE | 2.02 | 99.17 | 56.18 | 83.33 | 4.32 | 98.42 | 3.96 | 99.29 | 40.41 | 89.80 | 73.45 |
| 0.05 | WOODS | 0.26 | 99.89 | 32.71 | 90.01 | 0.64 | 99.77 | 0.79 | 99.10 | 12.26 | 94.48 | 74.15 |
| 0.05 | SAL (Ours) | 0.17 | 99.90 | 6.21 | 96.87 | 0.94 | 99.79 | 0.84 | 99.37 | 5.77 | 97.12 | 73.99 |
| 0.2 | OE | 2.59 | 98.90 | 55.68 | 84.36 | 4.91 | 99.02 | 1.97 | 99.37 | 25.62 | 93.65 | 73.72 |
| 0.2 | Energy w/ OE | 1.79 | 99.25 | 47.28 | 86.78 | 4.18 | 99.00 | 3.15 | 99.35 | 36.80 | 91.48 | 73.91 |
| 0.2 | WOODS | 0.22 | 99.82 | 29.78 | 91.28 | 0.52 | 99.79 | 0.89 | 99.56 | 10.06 | 95.23 | 73.49 |
| 0.2 | SAL (Ours) | 0.08 | 99.92 | 2.80 | 99.31 | 0.05 | 99.94 | 0.02 | 99.97 | 5.71 | 98.71 | 73.86 |
| 0.5 | OE | 2.86 | 99.05 | 40.21 | 88.75 | 4.13 | 99.05 | 1.25 | 99.38 | 22.86 | 94.63 | 73.38 |
| 0.5 | Energy w/ OE | 2.71 | 99.34 | 34.82 | 90.05 | 3.27 | 99.18 | 2.54 | 99.23 | 30.16 | 94.76 | 72.76 |
| 0.5 | WOODS | 0.17 | 99.80 | 21.87 | 93.73 | 0.48 | 99.61 | 1.24 | 99.54 | 9.95 | 95.97 | 73.91 |
| 0.5 | SAL (Ours) | 0.02 | 99.98 | 1.27 | 99.62 | 0.04 | 99.96 | 0.01 | 99.99 | 5.64 | 99.16 | 73.77 |
| 0.9 | OE | 0.84 | 99.36 | 19.78 | 96.29 | 1.64 | 99.57 | 0.51 | 99.75 | 12.74 | 94.95 | 72.02 |
| 0.9 | Energy w/ OE | 0.97 | 99.64 | 17.52 | 96.53 | 1.36 | 99.73 | 0.94 | 99.59 | 14.01 | 95.73 | 73.62 |
| 0.9 | WOODS | 0.05 | 99.98 | 11.34 | 95.83 | 0.07 | 99.99 | 0.03 | 99.99 | 6.72 | 98.73 | 73.86 |
| 0.9 | SAL (Ours) | 0.03 | 99.99 | 2.79 | 99.89 | 0.05 | 99.99 | 0.01 | 99.99 | 5.88 | 99.53 | 74.01 |

[1] Julian Katz-Samuels et al. Training ood detectors in their natural habitats. In International Conference on Machine Learning, 2022.

A3. Additional experiments on multiple principal components

Another great point! In our original submission, we reported results using multiple principal components in Appendix Section K. We observed that using the top 1 singular vector for projection achieves the best performance.

A4. Discussion on using gradient norm

We have already evaluated the GradNorm score as suggested in Table 2, where we replace the filtering score in SAL with the GradNorm score and then train the OOD classifier. The result underperforms SAL, showcasing the effectiveness of our filtering score.

We have also extensively discussed the design rationale of the filtering scores of SAL in Section 3.1, saying that the scores in SAL for ID and OOD data are shown to be provably well-separated (Remark 1) and thus ensure a low filtering error, while the norm of the gradient is not. Both the theoretical result and empirical verification can demonstrate the advantage of SAL compared with GradNorm.
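
The contrast between a projection-based filtering score and a plain gradient norm can be illustrated with a short sketch. This is a hedged reading of the description in this thread, not the authors' implementation: it assumes the score is the squared projection of each wild-sample gradient, centered by the average ID gradient, onto the top-k right singular vectors. The function names are hypothetical, and the L2-norm baseline is a stand-in for "the norm of the gradient", not the GradNorm method itself.

```python
import numpy as np


def projection_scores(wild_grads, id_grads, k=1):
    """Squared projection of centered wild gradients onto the top-k singular vectors.

    wild_grads: (n, d) per-sample gradients of the unlabeled (wild) data.
    id_grads:   (m, d) per-sample gradients of the labeled ID data.
    """
    reference = id_grads.mean(axis=0)   # average ID gradient
    centered = wild_grads - reference   # deviation from the reference
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return ((centered @ vt[:k].T) ** 2).sum(axis=1)


def gradient_norm_scores(wild_grads):
    """Baseline for contrast: plain L2 norm of each per-sample gradient."""
    return np.linalg.norm(wild_grads, axis=1)
```

Setting `k=1` corresponds to the top-1 setting that the authors report as performing best; larger `k` corresponds to the multiple-component variants mentioned in A3. Samples whose score exceeds a chosen threshold would be treated as candidate outliers.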

Comment

Thank you for your reply. The clarifications and additional experimental results solved all of my concerns. Therefore, I have raised my rating of this work.

Comment

Dear Reviewer XJRS,

Thank you for taking the time to read our response and increasing your score! We are glad to hear that the response solved your concern. We will make sure the discussion and results are added in the final version!

Thanks,

Authors

Official Review
Rating: 6

This paper presents a novel framework for Out-Of-Distribution (OOD) detection, named SAL, which aims to improve machine learning models through regularization using unlabeled data. SAL comprises two main components: (1) filtering, which distinguishes potential outliers from the general dataset, and (2) classification, which uses the identified candidate outliers to train an OOD classifier. The paper includes pertinent theoretical proofs to substantiate the proposed method, and experimental results are provided to demonstrate its effectiveness.

Strengths

  1. SAL's methodology is structured around two distinct phases, screening and classification, which can be independently optimized, offering enhanced flexibility.
  2. Utilizing a Large Volume of Unlabeled Data: SAL effectively leverages substantial amounts of unlabeled data to extract valuable information, thereby bolstering its detection capabilities.
  3. Theoretical Support: Beyond its impressive empirical performance, SAL is underpinned by robust theoretical foundations.

Weaknesses

  1. In scenarios where the actual OOD data markedly diverges from the outliers present in the unlabeled dataset, there arises a question regarding the preservation of SAL's performance.
  2. The efficacy of SAL is significantly influenced by the quality of the unlabeled data employed, indicating a substantial dependence on data integrity.

Questions

  1. Could you design experiments to further confirm and answer the issues raised in the weaknesses?
  2. In Table 1, how can the degraded ID accuracy (ID ACC) of SAL be accounted for on datasets or scenarios where other methods perform well?

Comment

We thank the reviewer for the thorough comments and suggestions. We are encouraged that you recognize our method to be novel and effective, and with robust theoretical analysis. We address your questions below:

A1. Different outlier dataset from the actual test OOD

We are glad you bring that up! In the original submission, we have included the result where the OOD data in the wild is different from the test OOD data (please see Appendix Section I and Table 8). Specifically, we use 300K RANDOM IMAGES from outlier exposure [1] to create the wild OOD training dataset. We evaluate on SVHN, PLACES365, LSUN-C, LSUN-RESIZE, and TEXTURES as the test unseen OOD data. We observe that SAL can perform competitively on unseen OOD datasets as well, compared to the most relevant baseline WOODS.

[1] Hendrycks et al. Deep anomaly detection with outlier exposure. In Proceedings of the International Conference on Learning Representations, 2019.

A2. Quality of the unlabeled data

To address your concern, we have designed the following experiment where the quality of the unlabeled data deteriorates. The results of SAL and competitive baselines are shown in the table below and have also been added to the Appendix Section S.

Specifically, we corrupt the outlier data in the wild with additive Gaussian noise [2]. As such, the filtered candidate outliers will have a much lower quality compared to the outliers in SAL. We use the CIFAR-10 as the in-distribution dataset and keep other configurations the same.
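
The corruption step can be sketched as below. This is an illustration only: the noise level `sigma`, the assumption that pixel values lie in [0, 1], and the function name `add_gaussian_noise` are all hypothetical, since the thread does not state the settings that were used.

```python
import numpy as np


def add_gaussian_noise(images, sigma=0.1, seed=0):
    """Add zero-mean Gaussian noise to images in [0, 1] and clip back to range.

    sigma is a hypothetical noise level, not a value taken from the paper.
    """
    rng = np.random.default_rng(seed)
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)
```

In this setup only the outlier portion of the wild data would be passed through the function, leaving the ID portion untouched.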

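| Method | SVHN FPR95 | SVHN AUROC | Places365 FPR95 | Places365 AUROC | LSUN-C FPR95 | LSUN-C AUROC | LSUN-R FPR95 | LSUN-R AUROC | Textures FPR95 | Textures AUROC | Average FPR95 | Average AUROC | ID ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OE | 23.11 | 86.61 | 32.01 | 86.27 | 22.98 | 82.75 | 19.53 | 87.43 | 25.68 | 84.46 | 24.66 | 85.50 | 93.81 |
| Energy w/ OE | 26.76 | 85.91 | 26.09 | 87.48 | 22.32 | 82.26 | 22.69 | 85.77 | 27.49 | 82.18 | 25.07 | 84.72 | 92.38 |
| WOODS | 18.33 | 89.83 | 23.45 | 90.04 | 19.70 | 84.27 | 17.79 | 90.82 | 22.37 | 84.83 | 20.33 | 87.95 | 94.00 |
| SAL (Ours) | 15.23 | 91.22 | 18.23 | 93.51 | 14.62 | 89.04 | 13.93 | 91.82 | 18.58 | 92.42 | 16.12 | 91.60 | 93.91 |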

[2] Hendrycks et al. Benchmarking neural network robustness to common corruptions and surface variations. In Proceedings of the International Conference on Learning Representations, 2019.

A3. ID accuracy

Great observation! As explained in paragraph 2 of Section 5.2 in the original submission, the slight discrepancy arises because our method observes only 25,000 labeled ID samples, whereas baseline methods (without using wild data) use the entire CIFAR training set of 50,000 samples. We have highlighted this in bold in the revision.

Comment

We thank all the reviewers for their time and valuable comments. We are encouraged to see that ALL reviewers find our paper novel and significant (BtG8, XJRS, dSEG, ZN1X). Reviewers also recognize our work providing a solid and crucial foundation for reliable ML (XJRS, ZN1X), and the results are promising, impressive, achieving state-of-the-art (BtG8, XJRS, dSEG, ZN1X).

As recognized by multiple reviewers, the significance of our work can be summarized as follows:

  • Our work offers a new algorithmic framework to effectively exploit the unlabeled wild data for OOD detection. This algorithm has broad utility since unlabeled data is ubiquitous in many ML applications, but the principled way of utilizing them for OOD detection is currently lacking in the field.
  • Moreover, we provide new theories from the lens of separability and learnability, to formally justify the two components in our algorithm.
  • Empirically, we show that SAL can be broadly applicable to modern neural networks, and establish state-of-the-art performance on common OOD detection tasks, reinforcing our theoretical insights.

We respond to each reviewer's comments in detail below. We also revised the manuscript according to the reviewers' suggestions (blue text), and we believe this makes our paper stronger.

AC Meta-Review

The paper introduces a novel framework for Out-Of-Distribution (OOD) detection, named SAL, with the goal of enhancing machine learning models through regularization using unlabeled data. SAL consists of two main components: Filtering, which distinguishes potential outliers from the general dataset, and Classification, utilizing identified outliers to train an OOD classifier. This novel approach is positioned in the context of "wild OOD detection," extending the prior work on "Training OOD Detectors in their Natural Habitats." Empirical evaluations show that SAL achieves state-of-the-art performance, demonstrating substantial improvements over existing methods. The paper supports the proposed methodology with theoretical proofs, reinforcing the credibility and effectiveness of SAL. The reviewers are unanimously positive about the paper's quality and contribution.

Why Not a Higher Score

The paper's impact is slightly limited by the assumptions required for the established theorems and by incomplete positioning against, and comparison with, some existing relevant literature (which is formulated in different ways).

Why Not a Lower Score

The above aside, the presented results are solid and justify the proposed method's advantage.

Final Decision

Accept (poster)