PaperHub
Overall: 7.8/10
Poster · 4 reviewers
Ratings: 5, 5, 5, 4 (min 4, max 5, std 0.4)
Confidence: 4.3
Novelty: 2.5 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

Spurious-Aware Prototype Refinement for Reliable Out-of-Distribution Detection

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We introduce a novel, post-hoc prototypical OOD detection method that counteracts unknown spurious correlations by refining class prototypes, achieving state-of-the-art performance for reliable, trustworthy AI.

Abstract

Keywords
Out-of-Distribution Detection · Spurious Correlations · Prototype Refinement · Robustness · Post-hoc Methods · Trustworthy AI

Reviews and Discussion

Review (Rating: 5)

This work tackles the problem of OOD detection for OOD samples that share spurious features with the ID samples but differ in semantic labels. Specifically, samples belonging to the same class are grouped into 1) a group of correctly classified samples and 2) multiple groups of misclassified samples, depending on the classes into which they are misclassified. Prototypes are then calculated for all groups. Subsequently, all samples are reassigned to groups based on their proximity to the corresponding prototypes. Finally, the refined prototypes are computed as the means of the updated group members. During inference, the OOD score is calculated from the distance to the nearest group prototype.
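To make the summarized pipeline concrete, below is a minimal NumPy sketch of the three stages and the scoring rule as described in this summary; the function names, the array-based interface, and the single refinement pass are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def sprod_prototypes(feats, labels, preds, n_refine=1):
    """Sketch of the SPROD stages on penultimate-layer features.

    feats: (N, D) training features; labels/preds: (N,) class ids.
    Stage 1 alone would use a single mean prototype per class.
    """
    prototypes = []
    for c in np.unique(labels):
        in_c = labels == c
        # Stage 2: split class c into the correctly classified group and
        # one group per class its samples are misclassified into.
        groups = [feats[in_c & (preds == labels)]]
        for c_wrong in np.unique(preds[in_c & (preds != labels)]):
            groups.append(feats[in_c & (preds == c_wrong)])
        protos_c = np.stack([g.mean(0) for g in groups if len(g)])
        # Stage 3: reassign every class-c sample to its nearest group
        # prototype, then recompute the group means.
        for _ in range(n_refine):
            d = np.linalg.norm(feats[in_c, None] - protos_c[None], axis=-1)
            assign = d.argmin(1)
            protos_c = np.stack([feats[in_c][assign == k].mean(0)
                                 for k in range(len(protos_c)) if (assign == k).any()])
        prototypes.append(protos_c)
    return np.vstack(prototypes)

def ood_score(x, prototypes):
    """Negative distance to the nearest group prototype (lower = more OOD)."""
    return -np.linalg.norm(prototypes - x, axis=1).min()
```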

Strengths and Weaknesses

Strengths:

  • The paper is well-written, and the proposed Spurious-Aware Prototype Refinement for Reliable Out-of-Distribution Detection (SPROD) is well-motivated.
  • The experiments on the spurious OOD (SP-OOD) benchmark are comprehensive, and the performance gains are significant.

Weaknesses:

  • The results on non-spurious OOD (NSP-OOD) are missing. I understand that this work mainly focuses on SP-OOD; however, I wonder whether the proposed SPROD is also effective on NSP-OOD, or at least maintains performance on NSP-OOD. This would significantly improve the contribution of this work. Therefore, results on NSP-OOD should be presented. I will raise the score once this concern is addressed.
  • To demonstrate the generalization of the proposed method, evaluations across different models such as ViT and BiT should be presented.
  • Some baseline methods are missing, such as GEN [1], NN-Guide [2], NECO [3], and FDBD [4].
  • What are the (synthetic) datasets and the model utilized to produce Figure 2?

[1] GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection, CVPR, 2023.

[2] Nearest Neighbor Guidance for Out-of-Distribution Detection, ICCV, 2023.

[3] NECO: NEural Collapse Based Out-of-distribution detection, ICLR, 2024.

[4] Fast Decision Boundary based Out-of-Distribution Detector, ICML, 2024.

Questions

See weaknesses.

Limitations

yes

Final Justification

Thanks for conducting the experiments. All my concerns are addressed.

Formatting Issues

N/A

Author Response

Reviewer: The results on non-spurious OOD (NSP-OOD) are missing. I understand that this work mainly focuses on SP-OOD; however, I wonder whether the proposed SPROD is also effective on NSP-OOD, or at least maintains performance on NSP-OOD. This would significantly improve the contribution of this work. Therefore, results on NSP-OOD should be presented. I will raise the score once this concern is addressed.

Response:

We sincerely thank the reviewer for pointing this out. We agree that performance on standard non-spurious OOD (NSP-OOD) datasets is an important aspect, as it helps demonstrate the broader applicability and robustness of our method.

We would like to clarify that we have already included a dedicated section titled “Performance on Non-Spurious OOD Datasets” (Line 200 in the supplementary material). As shown in Table 6, SPROD consistently achieves near-perfect performance on all NSP datasets. However, the ID datasets in these experiments are the five datasets we used in the main paper (which have some degree of spurious correlation). We focused on these datasets since they align with our core problem setup and we already had extensive experiments based on them.

In light of the reviewer’s helpful suggestion, we conducted additional experiments on three widely-used Standard OOD benchmarks: CIFAR-10, CIFAR-100, and ImageNet-1k, following the OpenOOD benchmark setup. Based on Table 1 of the OpenOOD paper, the in-distribution and corresponding OOD datasets are:

  • CIFAR-10:
      ◦ Near-OOD: CIFAR-100, TinyImageNet
      ◦ Far-OOD: MNIST, SVHN, Textures, Places365
  • CIFAR-100:
      ◦ Near-OOD: CIFAR-10
      ◦ Far-OOD: MNIST, SVHN, Textures, Places365
  • ImageNet-1k:
      ◦ Near-OOD: SSB-Hard, NINCO
      ◦ Far-OOD: iNaturalist, Textures, OpenImage-O

To provide a meaningful comparison, we include results from the top-performing methods per dataset as reported in Table 2 of the OpenOOD paper:

  • CIFAR-10: KNN, VIM
  • CIFAR-100: MLS, RMDS
  • ImageNet-1k: ASH

Below we present the OOD detection performance of SPROD alongside these methods:

| Method | CIFAR-10 Near | CIFAR-10 Far | CIFAR-100 Near | CIFAR-100 Far | ImageNet Near | ImageNet Far | Avg (%) |
|--------|---------------|--------------|----------------|---------------|---------------|--------------|---------|
| RMDS   | 89.80 | 92.20 | 80.15 | 82.92 | 76.99 | 86.38 | 84.74 |
| MLS    | 87.52 | 91.10 | 81.05 | 79.67 | 76.46 | 89.57 | 84.23 |
| VIM    | 88.68 | 93.48 | 74.98 | 81.70 | 72.08 | 92.68 | 83.27 |
| KNN    | 90.64 | 92.96 | 80.18 | 82.40 | 71.10 | 90.18 | 84.58 |
| ASH    | 75.12 | 78.49 | 78.20 | 80.58 | 78.17 | 95.74 | 81.05 |
| SPROD  | 89.04 | 91.78 | 81.80 | 79.93 | 75.06 | 95.29 | 85.15 |

These results indicate that SPROD performs on par with or better than existing top methods in both near- and far-OOD settings on standard benchmarks. Specifically, SPROD shows consistent performance across all datasets without any unusual degradation. We hope this addresses the reviewer’s concern and helps clarify the general applicability of our method beyond spurious OOD scenarios.

 


Reviewer: To demonstrate the generalization of the proposed method, evaluations across different models such as ViT and BiT should be presented.

Response:

Thank you for the suggestion. We have evaluated SPROD across a wide range of backbone architectures, including both convolutional (e.g., ResNet-18/34/50/101, BiT-R50x1) and transformer-based models (e.g., ViT-S, DeiT-B, DINOv2, Swin-B, ConvNeXt-B), as detailed in the Backbone Experiments section (Supplementary, line 237). Tables 8–31 report extensive results across four OOD metrics and five datasets. Specifically, Tables 8–11 present aggregated results, showing that SPROD consistently ranks among the top performers across all backbones, further demonstrating the method’s reliability.

 


Reviewer: Some baseline methods are missing, such as GEN [1], NN-Guide [2], NECO [3], and FDBD [4].

Response:

We thank the reviewer for pointing out the absence of several recent post-hoc OOD detection methods, such as GEN, NNGuide, NECO, and fDBD. We agree that incorporating these methods strengthens the comprehensiveness and fairness of our evaluation.

To address this concern, we have conducted additional experiments on several state-of-the-art post-hoc techniques published between 2023 and 2025. The evaluations were performed using the same pretrained ResNet-50 backbone as in our main experiments (Table 2), ensuring a consistent comparison setting. The updated results are shown below:

| Method   | WB | CA | UC | AMC | SpI | Avg |
|----------|----|----|----|-----|-----|-----|
| NNGuide  | 70.6 ±2.9 | 49.8 ±4.2 | 43.6 ±2.1 | 79.4 ±0.0 | 85.1 ±0.8 | 65.7 |
| Relation | 80.7 ±0.2 | 60.4 ±2.5 | 96.0 ±0.5 | 74.5 ±0.3 | 81.8 ±0.7 | 78.7 |
| ASH      | 78.5 ±3.2 | 47.3 ±2.8 | 39.6 ±1.7 | 78.0 ±0.2 | 86.6 ±0.7 | 66.0 |
| GEN      | 62.3 ±0.6 | 46.0 ±1.4 | 38.5 ±0.3 | 80.2 ±0.0 | 80.8 ±0.4 | 61.6 |
| NECO     | 53.5 ±1.6 | 39.5 ±3.2 | 35.1 ±1.5 | 80.2 ±0.1 | 67.2 ±0.3 | 55.1 |
| SCALE    | 89.0 ±2.9 | 44.9 ±3.2 | 54.4 ±2.1 | 78.4 ±0.4 | 86.2 ±0.5 | 70.6 |
| fDBD     | 71.1 ±0.5 | 51.3 ±1.3 | 47.4 ±0.2 | 79.9 ±0.0 | 84.2 ±0.3 | 66.8 |
| NCI      | 84.0 ±0.1 | 46.4 ±2.4 | 54.8 ±0.8 | 78.5 ±0.1 | 84.9 ±0.2 | 69.7 |
| SPROD    | 98.8 ±0.0 | 61.6 ±0.9 | 97.4 ±0.0 | 82.1 ±0.5 | 85.3 ±0.0 | 85.0 |

(WB = Waterbirds, CA = CelebA, UC = UrbanCars, AMC = Animals MetaCoCo, SpI = Spurious ImageNet; values are AUROC, %.)

As evident from the results, SPROD is consistently among the best-performing methods across all datasets. We will include extended results in the final version and revise the manuscript to compare against these additional baselines.

  • [1] NNGuide: Nearest Neighbor Guidance for Out-of-Distribution Detection, ICCV 2023
  • [2] RELATION: Neural Relation Graph, NeurIPS 2023
  • [3] ASH: Activation Shaping for OOD Detection, ICLR 2023
  • [4] GEN: Pushing the Limits of Softmax-Based OOD Detection, CVPR 2023
  • [5] NECO: Neural Collapse Based OOD Detection, ICLR 2024
  • [6] SCALE: Scaling for Training Time and Post-hoc OOD Detection, ICLR 2024
  • [7] fDBD: Fast Decision Boundary-Based OOD Detector, ICML 2024
  • [8] NCI: Detecting OOD through Neural Collapse, CVPR 2025

 


Reviewer: What are the (synthetic) datasets and the model utilized to produce Figure 2?

Response:

Thank you for the question. For Figure 2, we use a synthetic dataset where class-conditional distributions are symmetric Gaussians in 2D. In the SP-OOD setting, each class contains two subgroups (majority and minority), each modeled as a Gaussian with different means and proportions to simulate spurious correlations. We will clarify these details in the final version.
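For illustration, a hedged sketch of how such a synthetic setup could be generated (the means, the 0.3 standard deviation, and the 90/10 majority/minority split are assumed here; the exact parameters behind Figure 2 are not stated in this response):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_class(core, sp_major, sp_minor, n=1000, r_major=0.9, std=0.3):
    """One ID class as two symmetric 2D Gaussians: axis 0 is the core
    (class) feature, axis 1 the spurious one. The majority subgroup carries
    the class-aligned spurious value, the minority the opposite one."""
    n_maj = int(r_major * n)
    maj = rng.normal([core, sp_major], std, (n_maj, 2))
    mino = rng.normal([core, sp_minor], std, (n - n_maj, 2))
    return np.vstack([maj, mino])

class0 = make_class(core=-1.0, sp_major=-1.0, sp_minor=+1.0)
class1 = make_class(core=+1.0, sp_major=+1.0, sp_minor=-1.0)
# Spurious OOD: shares class 0's spurious (axis-1) value but carries no
# class-discriminative core (axis-0) signal.
sp_ood = rng.normal([0.0, -1.0], 0.3, (500, 2))
```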

Comment

Thanks for conducting the experiments. All my concerns are addressed.

Comment

Thank you very much for your engagement and positive feedback!

Review (Rating: 5)

This paper addresses spurious OOD and non-spurious OOD, where the former contains spurious contexts combined with a lack of core features, and the latter lacks both, in an OOD setup. The authors propose SPROD, a method which seeks robust prototypes of ID classes in a three-stage manner: seeking initial prototypes, discovering minority prototypes, and refining prototypes.

Strengths and Weaknesses

Strengths

  1. The paper brings up a less discussed topic, spurious OOD, which aims to reduce the misleading correlation between spurious features and the object itself.
  2. Proper experiments were conducted to support the arguments in the paper.
  3. The authors have conducted an extensive amount of experiments.

Weaknesses

  1. The authors do not compare with recent baseline methods.
  2. This work should conduct experiments on general OOD datasets, for the reasons stated below.

Questions

  1. The compared baselines in Table 2 are outdated. The latest is from 2022, while the submission date was mid-2025. Comparison with recent methods published since 2023 must be done. In particular, there is a GitHub repository with numerous recent methods at https://github.com/Jingkang50/OpenOOD; methods such as CombOOD, SCALE, AdaSCALE, NNGuide, ASH, GEN, and RELATION must be evaluated and added to Table 2. These comparisons should be done on general OOD test datasets (CIFAR10, CIFAR100, ImageNet) as well.
  2. OOD detection fundamentally deals with “unknownness”, without any prior knowledge of the OOD samples that show up at test time. However, this work only conducts experiments on OOD test datasets with a strong prior, where the datasets are explicitly designed around “spurious” features. Testing only on datasets with such a strong assumption hinders the practicality and the fundamental purpose of researching OOD detection. If SPROD works well only in the spurious OOD setup, I doubt the practicality of the method, as OOD samples are not labeled as “spurious” when they are tested. Therefore, experiments on CIFAR10, CIFAR100, and ImageNet must be conducted to prove robustness and show that SPROD is not limited to spurious OOD datasets but is also robust on general OOD datasets.
  3. In Table 2, SPROD's FPR95 result is only 4th among the compared methods on the most complex dataset, which makes it hard to claim it is the SOTA method.
  4. Figure 4 doesn't convey how strongly the spurious correlation rate correlates with the performance drop, as the results aren't presented numerically. As a reader, it doesn't look like the best way to present the correlation.
  5. The authors argue that generative scoring is better than discriminative scoring and present proper experimental configurations. However, the results show no difference as datasets get more complex. Showing better performance only on simple datasets doesn't indicate it is a better scoring method. Therefore, I doubt this statement.
  6. Table 1 in the Supplementary should be in the main manuscript, as it compares with related methods.

Overall, my biggest concerns are 1. and 2., whose experiments should be done. Any change in my review will highly depend on the results of these two questions.

Limitations

yes

Final Justification

I increase my rating for the following reasons:

  1. I acknowledge the novelty of the work in terms of suggesting a new dataset and benchmark regarding spurious correlation, and also the effectiveness of the method on the SP-OOD benchmark.

  2. The primary concern regarding general OOD performance is important, as it implies general robustness on general OOD samples. However, SP-OOD has a different goal: detecting worst cases which might cause critical issues or consequences. Despite that, the general performance is fair enough.

Formatting Issues

no issues

Author Response

Reviewer: Authors do not compare with recent baseline methods…

We appreciate the reviewer for pointing this out and fully agree that benchmarking against more recent methods is important for a comprehensive evaluation.

In our main experiments, we included 11 prominent post-hoc OOD detection methods, selected to represent a broad spectrum of prior approaches and to highlight their limitations under the spurious-OOD setting. We acknowledge, however, that several recent state-of-the-art methods from top-tier conferences have been proposed since 2023, many of which are available in the OpenOOD repository, as the reviewer correctly points out.

Using the same protocol and pretrained ResNet-50 backbone as in Table 2 of our main paper, the results are summarized below:

| Method   | WB | CA | UC | AMC | SpI | Avg |
|----------|----|----|----|-----|-----|-----|
| NNGuide  | 70.6 | 49.8 | 43.6 | 79.4 | 85.1 | 65.7 |
| Relation | 80.7 | 60.4 | 96.0 | 74.5 | 81.8 | 78.7 |
| ASH      | 78.5 | 47.3 | 39.6 | 78.0 | 86.6 | 66.0 |
| GEN      | 62.3 | 46.0 | 38.5 | 80.2 | 80.8 | 61.6 |
| NECO     | 53.5 | 39.5 | 35.1 | 80.2 | 67.2 | 55.1 |
| SCALE    | 89.0 | 44.9 | 54.4 | 78.4 | 86.2 | 70.6 |
| fDBD     | 71.1 | 51.3 | 47.4 | 79.9 | 84.2 | 66.8 |
| NCI      | 84.0 | 46.4 | 54.8 | 78.5 | 84.9 | 69.7 |
| SPROD    | 98.8 | 61.6 | 97.4 | 82.1 | 85.3 | 85.0 |

As seen, SPROD is consistently among the top-performing methods across datasets.

  • [1] NNGuide, ICCV 2023
  • [2] Relation, NeurIPS 2023
  • [3] ASH, ICLR 2023
  • [4] GEN, CVPR 2023
  • [5] NECO, ICLR 2024
  • [6] SCALE, ICLR 2024
  • [7] fDBD, ICML 2024
  • [8] NCI, CVPR 2025

 


Reviewer: This work should conduct experiments on general OOD datasets where the reasons are stated below…

We thank the reviewer for raising this important point. We agree that evaluating performance on widely used benchmarks provides a good indication of the general applicability of OOD detection methods.

While our primary focus is on the SP-OOD problem setup, we would like to clarify that SPROD is not limited to spurious OOD settings. In fact, our method does not make any assumption about the presence of spurious correlations. If unknown spurious correlations exist in the training data, SPROD aims to discover them for more robust OOD detection. This is further supported by our ablation study using datasets with only 50% spurious correlation, where SPROD still demonstrates strong performance, indicating its robustness across varying levels of spuriousness.

Following the reviewer’s concern, we have conducted additional experiments using CIFAR-10, CIFAR-100, and ImageNet-1k as in-distribution datasets. These experiments follow the standardized OpenOOD benchmark setup, where OOD datasets are selected according to Table 2 of the OpenOOD paper.

We report SPROD's performance alongside top-performing baseline methods, using values directly taken from the OpenOOD paper for fair comparison. The results are summarized in the table below:

| Method | CIFAR-10 Near | CIFAR-10 Far | CIFAR-100 Near | CIFAR-100 Far | ImageNet Near | ImageNet Far | Avg (%) |
|--------|---------------|--------------|----------------|---------------|---------------|--------------|---------|
| RMDS   | 89.80 | 92.20 | 80.15 | 82.92 | 76.99 | 86.38 | 84.74 |
| MLS    | 87.52 | 91.10 | 81.05 | 79.67 | 76.46 | 89.57 | 84.23 |
| VIM    | 88.68 | 93.48 | 74.98 | 81.70 | 72.08 | 92.68 | 83.27 |
| KNN    | 90.64 | 92.96 | 80.18 | 82.40 | 71.10 | 90.18 | 84.58 |
| ASH    | 75.12 | 78.49 | 78.20 | 80.58 | 78.17 | 95.74 | 81.05 |
| SPROD  | 89.04 | 91.78 | 81.80 | 79.93 | 75.06 | 95.29 | 85.15 |

These results show that SPROD remains competitive with the best-performing methods and achieves the highest average performance across all settings.

 


Reviewer: In Table 2, SPROD's FPR95 result is only 4th among the compared methods on the most complex dataset, which makes it hard to claim it is the SOTA method.

We fully acknowledge that our method does not outperform all others across every scenario. However, a core strength of SPROD lies in its consistency across a wide range of datasets and model backbones. While some methods may excel on specific benchmarks, they often suffer from performance drops elsewhere. In contrast, SPROD consistently ranks among the top-performing methods across various settings. This is more comprehensively illustrated in Tables 8–11 of the supplementary material, which highlights SPROD’s stability and reliability.

One central goal of our paper is to broaden the experimental scope of Spurious-OOD detection. Prior work often focused on one or two datasets, limiting generalizability. We introduced the Animals MetaCoCo dataset to encourage evaluation under more diverse and realistic conditions. While this dataset enriches the benchmark space, it is not fully aligned with our method's assumptions: SPROD relies on misclassification signals to detect unknown spurious features. As shown in Table 5 (supplementary), the attribute imbalance in Animals MetaCoCo is highly entangled and lacks clear spurious correlations, unlike datasets such as Waterbirds or UrbanCars.

The SpuriousImageNet dataset poses additional challenges. As noted in line 192 of the supplementary material, many spurious correlations here fall under the "Class Extension" type (shortcut features that appear in one class but are not shared across others), thus not increasing misclassification risk. Since SPROD detects unknown spurious features via their impact on misclassification, it is inherently limited in such settings. We believe future methods with partial group supervision may be better suited for this dataset and further advance trustworthy AI. Nonetheless, we included SpuriousImageNet to promote more diverse evaluations, even though it is less compatible with our group-supervision-free framework.

 


Reviewer: Figure 4 doesn't convey how strongly the spurious correlation rate correlates with the performance drop, as the results aren't presented numerically. As a reader, it doesn't look like the best way to present the correlation.

The figure combines multiple dimensions of comparison into a single visualization (different baselines, two model backbones, two types of correlation metrics, and two training settings) and may have become too complex to clearly convey the correlation between spurious correlation rate and performance drop. To address this, we will revise the presentation by including a dedicated figure or table that focuses specifically on the correlation, with numerical values to improve clarity and interpretability for the reader. While our original intent was to offer a compact representation for breadth, we agree that a focused presentation would better serve the reader.

 


Reviewer: The authors argue that generative scoring is better than discriminative scoring and present proper experimental configurations. However, the results show no difference as datasets get more complex. Showing better performance only on simple datasets doesn't indicate it is a better scoring method. Therefore, I doubt this statement.

We agree that on some datasets, even for a fixed method such as SPROD, the difference between discriminative and generative scoring is negligible. Indeed, we do not claim that the generative approach will always outperform the discriminative one. As indicated in line 132, softmax-based (discriminative) scoring fails in certain cases.

A trivial but illustrative case is presented in Figure 2 (a), which shows a scenario where OOD samples lie far from all class distributions. In such cases, discriminative scoring, which relies only on the distance to the decision boundary, may incorrectly identify far-away OOD samples as in-distribution simply because they lie farther from the decision boundary than some ID samples.

In contrast, as shown in Figure 2 (b), generative scoring (based on distributional distance) can easily detect these far-OOD samples. Furthermore, in the SP-OOD setting, we highlight an additional risk introduced by biased decision boundaries. A recent study also emphasizes that discriminative classifiers are particularly sensitive to spurious shortcuts [1].
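A hedged toy illustration of this contrast (a hypothetical 2D two-class linear head; all numbers are invented to show the effect and are not from the paper):

```python
import numpy as np

w, b = np.array([1.0, 0.0]), 0.0              # decision boundary: x = 0
protos = np.array([[-1.0, 0.0], [1.0, 0.0]])  # the two class prototypes

def msp(x):
    """Discriminative score: max softmax over logits -(w.x+b) and +(w.x+b)."""
    logits = np.array([-(w @ x + b), w @ x + b])
    e = np.exp(logits - logits.max())
    return (e / e.sum()).max()

def proto_score(x):
    """Generative-style score: negative distance to the nearest prototype."""
    return -np.linalg.norm(protos - x, axis=1).min()

far_ood = np.array([8.0, 0.0])  # far from both classes, deep on class 1's side
id_pt = np.array([0.9, 0.0])    # a typical class-1 ID point

print(msp(far_ood), msp(id_pt))                  # ~1.000 vs ~0.858: softmax ranks
                                                 # the far-OOD point as MORE confident
print(proto_score(far_ood), proto_score(id_pt))  # -7.0 vs -0.1: distance flags it
```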

Finally, we would like to point out that this discussion primarily serves as a motivational context for our proposed method and is not a core axis of the paper. We believe this direction deserves a dedicated study to fully explore its theoretical and experimental implications.

[1] Generative Classifiers Avoid Shortcut Solutions, ICLR 2025

 


Reviewer: Table 1 in Supplementary should be in the main manuscript as it is comparing with related methods.

We agree that Table 1 in the Supplementary provides a valuable comparison in the multi-modal domain. The main reason we did not focus on CLIP-based experiments is explained in both Section A.1 of the Supplementary and in the main paper (line 251). For example, the zero-shot setup of CLIP often bypasses spurious correlations in the training data, which limits its usefulness for studying how methods respond to spurious attributes during learning, which is a primary concern of our study.

That said, given the increasing attention on CLIP-based settings in recent research, we will include a small comparison table in the main paper for the final version.

Comment

Thank you very much for your effort for preparing the responses.

However, the authors lack results comparing with recent baseline methods in the large-scale ImageNet-1K experiments, where the most recent baseline is from ICLR 2023. On the OpenOOD leaderboard, NCI, ASH, SCALE, AdaSCALE, and CombOOD outperform the authors' reported results on ImageNet-1K with ResNet50.

Additionally, I suggest the authors compare results using the RegNet_Y_16GF backbone in order to compare with methods such as AdaSCALE, RMDS++, and NNGuide, which are on the OpenOOD leaderboard; this would also show the impact of changing to a more robust backbone.

Comment

Thank you very much for your follow-up.

For the conventional OOD setting, we used the latest version of the OpenOOD paper (arXiv, 16 Dec 2024) to evaluate SPROD under a controlled and standardized protocol to ensure fair comparison with published methods. We reported our first effort in this setup because the results were promising enough to demonstrate consistency and broader applicability.

We appreciate you pointing us to the live OpenOOD leaderboard, which we had not seen before. It appears promising, but we also noticed several issues:

  • According to its GitHub instructions (“add your results and open a PR”), it seems the results are self-reported by contributors, without unified control or verification.
  • We observe inconsistencies in several entries. For instance, CombOOD reports AUROC = 95.22 / 90.24 (near/far) for ImageNet, whereas its paper reports 81.38 / 98.44, with far-OOD performing better as expected, raising concerns about reproducibility.
  • The leaderboard results are not from a unified experimental setup. Backbones and training strategies vary, making direct comparison difficult.

To address your suggestion, we ran SPROD on RegNet_Y_16GF and obtained:

  • AUROC=81.07 (near)
  • AUROC=97.53 (far)

These are notably higher (~+6% gain in near-OOD) than our previously reported numbers using the standard protocol in the OpenOOD paper setup. Interestingly, this new performance of SPROD (while not designed for the conventional setting) exceeds the leaderboard-reported results of NCI (78.69 / 95.53), a method specifically developed for conventional OOD and published at CVPR 2025.

 


Finally, while SPROD achieves state-of-the-art performance in various experiments, we want to clarify that outperforming all prior methods, even across all SP-OOD benchmarks, is not the primary goal of this paper. As stated in the summary of contributions (end of the Introduction), our work offers multiple contributions (including the introduction of a new dataset, broader SP-OOD benchmarking, and a method that makes no assumption about the availability of validation data or group annotations), which we believe are valuable to the community.

Our new extended experiments on standard OOD detection benchmarks also suggest the reliability and generality of SPROD, even though this setting is beyond the main scope of the paper.

We hope this clarification further addresses your concerns and provides a clearer picture of our contributions.

Comment

Thank you very much for the effort preparing for the questions.

The authors claim that the novelty of the work lies in the novel dataset, benchmarking, and method that tackle SP-OOD. However, I doubt how this can be used for OOD, in the sense that OOD deals with “unknownness” without any prior knowledge regarding the test samples, while the dataset and benchmarking give a strong prior, namely spurious correlations. The main reason I asked for general OOD benchmarking was to verify whether solving spurious correlation improves general OOD performance or not. However, referring to the benchmarks in the first response, NNGuide shows poor performance on the SP-OOD benchmark while outperforming on the general OOD benchmark with the RegNet_Y_16GF backbone. Therefore, it is hard to say the two are directly correlated.

In short, my questions are:

  1. In what cases can the SP-OOD dataset and benchmark be used? Is there any practical case where spurious correlations are given in OOD detection?
  2. How can these resources be used to solve general OOD, where the concept of OOD assumes nothing is known?
  3. What is the correlation between general OOD and SP-OOD? The results with the RegNet_Y_16GF backbone on the ImageNet1k benchmark fall behind several methods such as AdaSCALE, RMDS++, and NNGuide, while SPROD surpasses NNGuide on SP-OOD.
Comment

Thank you for your continued engagement and helpful feedback. We now cover your concerns from several aspects:

  1. OpenOOD Leaderboard Reliability

As noted earlier, the OpenOOD leaderboard contains several credibility issues and cannot serve as a reliable basis for academic comparison. While we initially highlighted CombOOD for its major discrepancy (over 10% AUROC drop from leaderboard to paper), we also found inconsistencies in other methods. For example, the leaderboard reports inflated values for AdaSCALE; it shows 86.26 AUROC (near-OOD, ResNet-50), which is significantly higher than the value reported in its original paper.

  2. Inconsistent Settings across Experiments

The OpenOOD leaderboard lists multiple entries per method, varying across backbones and training pipelines. In our experience, different methods benefit differently from architectural changes. To illustrate this, we ran SPROD on ImageNet-1k using ConvNeXt-B, Swin-B, and ViT-B-16 backbones. SPROD achieves near-OOD AUROC of 85.07, 84.13, and 80.28, respectively. For comparison, AdaSCALE reports 74.58 and 73.23 for Swin-B and ViT-B/16, respectively (no reported value for ConvNeXt).

  3. General OOD Performance

Our ImageNet-1k results confirm that SPROD performs on par with the top-performing general OOD methods. Even considering the OpenOOD leaderboard (despite its reproducibility issues), only a few methods reach 85% near-OOD AUROC, while most recent and well-known approaches fall below this threshold. This performance is especially notable, as SPROD makes no claim of outperforming general OOD detection directly.

  4. On the Notion of “Unknownness” and Prior Assumptions

As noted in our first response, SPROD tries to identify and mitigate inherent biases in datasets when such biases exist. When spurious correlations are absent, SPROD still performs robustly. Therefore, it does not rely on prior knowledge of the OOD setting and, moreover, does not even assume access to OOD validation data (a common but often overlooked limitation in the field).

We specifically chose to study SP-OOD in our experiments to evaluate method robustness in worst-case scenarios, often missed by conventional benchmarks. Most existing methods are over-tuned to these general benchmarks and may exhibit vulnerability in the presence of inherent biases and distribution shifts. The SP-OOD datasets better reveal previously overlooked failure modes where OOD detectors are vulnerable due to reliance on shortcut features. Importantly, this experimental design does not provide access to extra information, as all tested methods (including SPROD) operate under identical conditions without additional inputs.

  5. Spurious Correlation

Hundreds of recent studies have focused on identifying such correlations in common datasets and developing methods that ensure reliable performance on under-represented subgroups. These methods often prioritize worst-case scenarios rather than average performance and evaluate on specific datasets like Waterbirds and CelebA.

This concern is also highly relevant in OOD detection and is motivated by previous works. Recent studies show that large-scale datasets like ImageNet contain extensive hidden spurious correlations [1]. For example, features like "bird feeder", "eucalyptus", or "label" can serve as shortcuts for predicting classes like "Hummingbird", "Koala", or "hard disc" respectively [1]. In such cases, an OOD detector might consistently misclassify all OOD samples containing "bird feeder" as ID class "Hummingbird", despite achieving high overall performance across 1000 classes.

Focusing only on averaged metrics can hide these severe failure cases. SP-OOD benchmarks are designed to make such vulnerabilities visible and measurable.

[1] "Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet", ICCV 2023

 
We hope these clarifications help address your concerns.

Public Comment

More clarifications:

86.26 AUROC is for the ResNet-50 model regularized with ISH training (see SCALE [1] for more details). Please see the "Training" column, which explicitly mentions "ISH".

Please see the 10th row (at the time of writing), which reports results on the original ResNet-50 model in the leaderboard.

Note the inclusion of the ImageNet-O dataset in near-OOD in the paper, which is not present in the OpenOOD v1.5 leaderboard.

[1] Scaling for Training Time and Post-hoc Out-of-distribution Detection Enhancement

Comment

Thank you very much for addressing the concern related to the importance of spurious correlation in common datasets; I acknowledge the importance of spurious correlations in datasets as well as SPROD's robustness to them. However, a few concerns remain.

“Most existing methods are over-tuned to these general benchmarks and may exhibit vulnerability in the presence of inherent biases and distribution shifts.”

To verify this statement, I recommend conducting OpenOOD full-spectrum experiments, where the distribution of ID samples is shifted, with the RegNet-Y-16GF backbone for comparison.

Baseline comparison

  • Despite the fact that general OOD methods overlook the spurious correlations in datasets, they still perform better than SPROD, as shown in the comparison below. In this case, the practicality of SPROD is limited to datasets with high spurious correlation, and it won't be usable in general settings, which is my concern with the suggested method. For example, assuming a situation where users need an OOD model, which model would the users adopt? Even if SPROD solves the spurious correlation problem in ImageNet1k and handles classes with spurious correlations well, if it falls behind on the other classes without spurious correlation in OOD detection, its practicality is doubtful, which is why the general OOD performance is critical.

  • The authors state that SPROD reaches 85% on the OpenOOD near-OOD benchmark with the ConvNeXt-B backbone and that most methods fall behind; however, the comparison with those baselines isn't done with the same backbone. Therefore, it would be fair to compare recent methods using the same ConvNeXt-B backbone to verify this statement.

Below are the organized near-OOD AUROC results of SPROD and SOTA baseline methods on the OpenOOD benchmark.

| Method | ResNet50 | RegNet-Y-16GF | ConvNeXt-B |
|--------|----------|---------------|------------|
| SPROD | 75.06 | 81.07 | 85.07 |
| AdaScale (paper reported) | 78.98 | 89.18 | x |
| NNGuide | x | 84.87 | x |
| RMDS++ | 78.54 | 85.16 | x |
| MDS++ | x | 86.29 | x |
  • AdaScale: The authors insist that the OpenOOD reporting is not credible because the values reported in the leaderboard and in the manuscripts differ. However, even with the results in Table 1 of the AdaScale paper, the ResNet50 and RegNet-Y-16GF results show higher performance than SPROD. It is also shown that their transformer-based baselines specifically show low performance while the CNN-based methods perform well. SPROD shows an advantage with transformer-based backbones but falls behind for ResNet50 and RegNet-Y-16GF, and its reported ConvNeXt-B result, despite ConvNeXt-B having higher ID accuracy on ImageNet1k than RegNet-Y-16GF, falls behind AdaScale with RegNet-Y-16GF.
  • Compared with NNGuide, an ICCV 2023 paper, the performance with RegNet-Y-16GF underperforms.
  • Compared with MDS++, the performance with RegNet-Y-16GF underperforms.
  • Compared with RMDS++, SPROD underperforms for both ResNet50 and RegNet-Y-16GF.

Furthermore, if the authors insist that the OpenOOD reporting is not credible, then the only way forward is for the authors to implement the methods shown above on the OpenOOD benchmark and verify the results.

Comment

We thank the reviewer for their continued engagement.

We would like to restate that our paper primarily focuses on the SP-OOD setting. Regarding the reviewer's valid concern to “prove the robustness and show that SPROD is not only limited to spurious OOD dataset, but also robust on general OOD datasets”, we conducted standard experiments on general OOD datasets. Our initial results show that SPROD performs comparably to the well-known methods reported in the OpenOOD paper, with an average performance of 85.15, which is higher than the top-performing baselines in that benchmark. We believe this is sufficient to demonstrate that SPROD is also robust in the general OOD setting. Nevertheless, we never claimed that SPROD is the best-performing method across all settings. Our principal claim is robustness, meaning that SPROD avoids performance drops across diverse experiments, rather than achieving state of the art on every benchmark.

In the context of robustness research, particularly for spurious correlation, it is common to accept a slight reduction in general benchmark performance in favor of higher worst-group performance. It is thus uncommon in the field to expect a work showing robustness on a specific dataset like Waterbirds to also outperform on the ImageNet benchmark. That said, achieving robustness to spurious correlations while preserving high overall performance would be highly desirable.

We acknowledge the reviewer's concern regarding recent methods and evaluated most of the mentioned approaches (reimplementing some, such as NECO, in the OpenOOD format). However, testing all of them within the limited time was infeasible. Furthermore, some works, such as AdaScale, were not peer-reviewed, and Mahalanobis++ (which presents MDS++ and RMDS++) was concurrent ICML 2025 work released after the NeurIPS deadline.

Finally, note that each experiment on the ImageNet benchmark requires several hours, and performing quick experiments in this setup is prone to under-reporting due to limited trials. Therefore, running experiments on many method and backbone combinations is not feasible. Moreover, our reported values for different backbones were only intended to show the effect of backbone choice. We are aware that comparing methods with different backbones and training strategies is not objective, whereas the aforementioned live leaderboard does so.

Comment

Thank you very much for explaining and emphasizing the tradeoff between SP-OOD and general OOD performance. The explanation of the tradeoff helped me understand the nature of spurious correlation detection, with which I am not familiar, and how its objective differs from general OOD, with which I am more familiar. My remaining concerns regarding general OOD performance are too deep and out of the scope of this paper; therefore, I have no remaining concerns and will increase my rating.

Comment

We are glad our clarifications helped address the questions and clear up any misconceptions.

Public Comment

Clarifications regarding AdaSCALE's reporting

AdaSCALE for OpenOOD Leaderboard Setup

| Near-OOD | Far-OOD |
|----------|---------|
| SSB-Hard, NINCO | OpenImage-O, Textures, iNaturalist |

Additional Datasets for Evaluation in the Paper

| Near-OOD | Far-OOD |
|----------|---------|
| SSB-Hard, NINCO, ImageNet-O | OpenImage-O, Textures, iNaturalist, Places |

Note the inclusion of the ImageNet-O and Places OOD datasets.

The setups for the above two are different, hence the difference in the results. The results are not inflated; please check the OOD datasets that are evaluated in the paper.

Note: I just noticed this.

Review (Rating: 5)

The paper presents a prototype-based OOD detection approach explicitly designed to address the challenge posed by unknown spurious correlations. The proposed method, SPROD, is a post-hoc technique that refines class prototypes to mitigate potential bias. For each test sample, the distance to the nearest prototype is used as the score for OOD detection. Comprehensive experiments on spurious correlation OOD detection benchmarks demonstrate superior performance compared to several standard OOD detection methods.

Strengths and Weaknesses

Strengths:

  • The paper is well-written and easy to follow. It clearly demonstrates the problem and will serve as a valuable resource for the OOD detection community.

  • The proposed method is straightforward and, in that sense, simple: it involves refining prototypes (similar to K-means clustering) and using the distance to the nearest prototype as the score function.

  • The presented experiments highlight the strength of the proposed approach, showing consistent improvements over 11 post-hoc baselines. The paper also includes other useful analyses, such as investigations into the scoring mechanism.

Weaknesses:

  • My main concern is the lack of discussion or comparison with prior prototype-based OOD detection work. In recent years, several studies have explored prototype-based frameworks, and in this context, the proposed method does not appear significantly novel. The main contribution seems to lie in its focus on spurious OOD detection, rather than in introducing a fundamentally new technique. Some relevant examples include:
  1. Finding Dino: A Plug-and-Play Framework for Zero-Shot Detection of Out-of-Distribution Objects Using Prototypes
  2. Prototype-Based Optimal Transport for Out-of-Distribution Detection
  3. Learning with Mixture of Prototypes for Out-of-Distribution Detection
  • While the empirical results demonstrate the effectiveness of the method, it is not entirely clear how or why the group prototype refinement specifically helps in addressing spurious OOD detection. A deeper theoretical or conceptual justification would strengthen the contribution.

Questions

Suggestions: It is recommended that the authors explicitly define the score function used in SPROD, rather than merely stating it.

Limitations

Yes.

Final Justification

The authors have addressed the concerns regarding the discussion of prior works and provided a clear justification of the group prototype refinement.

Formatting Issues

No

Author Response

Discussion of prior work:

Reviewer: My main concern is the lack of discussion or comparison with prior prototype-based OOD detection work. In recent years, several studies have explored prototype-based frameworks, and in this context, the proposed method does not appear significantly novel. The main contribution seems to lie in its focus on spurious OOD detection, rather than in introducing a fundamentally new technique.

We appreciate the reviewer’s comment and agree that prototype-based approaches form a relevant and valuable category in OOD detection research.

We would like to clarify several points regarding both novelty and comparison:

  • In our main paper (line 171), we explicitly highlight the strength of the pure prototypical method (SPROD-1), which we believe has been overlooked in prior post-hoc OOD literature. To the best of our knowledge, no previous work presents this vanilla prototypical approach as a standalone post-hoc baseline for OOD detection.

  • We build upon this strong baseline by introducing two further stages that aim to reduce the effect of spurious biases through prototype splitting and refinement, resulting in a novel three-stage post-hoc framework.

Regarding the third cited work, Learning with Mixture of Prototypes (PALM), we were indeed aware of it. However, PALM involves a heavy training procedure with a specifically designed objective and therefore is not compatible with our post-hoc setting. That said, we found its ideas inspiring and included a dedicated section in the supplementary material titled "Analysis of Mixture of Prototypes" (line 265). In this section, we introduce a KMeans-based variant of SPROD that mirrors PALM’s core idea of intra-class prototype diversity, but implemented in a fully post-hoc manner without any retraining. Results in Tables 32–33 show that this KMeans variant achieves competitive performance, especially on datasets where spurious correlations are less prominent (e.g., Spurious ImageNet, where the main bias is class-extension and does not induce inter-class confusion).

Finding DINO (PROWL) also leverages prototype ideas, but it is designed for a different domain (pixel-level OOD detection in segmentation), making it less comparable to our classification-based setting. While POT is another recent work using prototypes, it is not peer-reviewed, and we could not find a working public implementation.

 


Theoretical or Conceptual justification:

Reviewer: While the empirical results demonstrate the effectiveness of the method, it is not entirely clear how or why the group prototype refinement specifically helps in addressing spurious OOD detection. A deeper theoretical or conceptual justification would strengthen the contribution.

SPROD employs a two-step prototype refinement strategy to approximate group-specific prototypes within each class. First, it defines prototypes for correctly classified and misclassified training samples, motivated by the observation that minority-group instances are more prone to misclassification [1]. However, some correctly classified minority samples may still be assigned to biased prototypes. To address this, SPROD reassigns all training samples to their nearest prototype and recalculates the prototypes, reducing the majority prototype's influence (see Figure 3).

By applying these two steps, SPROD aims to obtain nearly pure group-specific prototypes. We now provide a theoretical overview to illustrate the systematic bias introduced by standard prototypical methods under strong spurious correlations, motivating the refinement of group-specific prototypes to eliminate this bias.

  • [1] Just Train Twice, ICML 2021

Feature Decomposition

Without loss of generality, we assume a binary classification problem with classes $c \in \{0,1\}$. We consider a general decomposition of embeddings $z_i$ into three semantically distinct components: spurious, core, and irrelevant features. We model $z_i$ as a linear combination of three functionally distinct components that span the embedding space:

$$z_i = \sum_{j=1}^{n_{u_c^{sp}}} \alpha_{i,j}^{sp} u_{c,j}^{sp} + \sum_{j=1}^{n_{u_c^{core}}} \beta_{i,j}^{core} u_{c,j}^{core} + \sum_{j=1}^{n_{u^{irr}}} \gamma_{i,j}^{irr} u_{j}^{irr},$$

where the vector sets $\{u_{c,j}^{sp}\}$, $\{u_{c,j}^{core}\}$, and $\{u_{j}^{irr}\}$ form orthonormal bases for the spurious, core, and irrelevant subspaces. Each coefficient set $\{\alpha_{i,j}^{sp}\}$, $\{\beta_{i,j}^{core}\}$, and $\{\gamma_{i,j}^{irr}\}$ is specific to instance $i$. This decomposition can be expressed compactly in matrix form:

$$z_i = U_c^{sp} \alpha_i^{sp} + U_c^{core} \beta_i^{core} + U^{irr} \gamma_i^{irr},$$

where each $U^{(\cdot)} \in \mathbb{R}^{d \times n_u^{(\cdot)}}$ is a matrix of orthonormal basis vectors. We assume the coefficient vectors for each instance $z_i$ are drawn from class-conditional distributions: $\alpha_i^{sp} \sim P_c^{sp}$ and $\beta_i^{core} \sim P_c^{core}$, while the irrelevant component is shared across classes, i.e., $\gamma_i^{irr} \sim P^{irr}$.

Group Definitions

Within each class $c$, majority group samples $z_i \in S_c^{maj}$ satisfy $\alpha_i^{sp} \sim P_c^{sp}$ and $\beta_i^{core} \sim P_c^{core}$. Minority group samples $z_i \in S_c^{min}$ satisfy $\alpha_i^{sp} \sim P_{1-c}^{sp}$ and $\beta_i^{core} \sim P_c^{core}$.

The ratio $r_c = \frac{|S_c^{maj}|}{|S_c^{maj}| + |S_c^{min}|}$ quantifies the class-conditional spurious correlation strength.

OOD Sample Definition

We define an OOD sample as:

$$z_{OOD} = U_c^{sp} \alpha_{OOD}^{sp} + U^{irr} \gamma_{OOD}^{irr} + U^{ext} \delta_{OOD}^{ext},$$

where $\alpha_{OOD}^{sp} \sim P_c^{sp}$, $\gamma_{OOD}^{irr} \sim P^{irr}$, and $\delta_{OOD}^{ext}$ and $U^{ext}$ capture external factors not observed during training.

Hard OOD samples may also include core components drawn from a shifted distribution

$$\beta_{OOD}^{core} \sim Q^{core} \neq P_c^{core}.$$

For simplicity, we focus on near-OOD samples with no core or external components, i.e., $\beta_{OOD}^{core} = 0$ and $\delta_{OOD}^{ext} = 0$.

For spurious-OOD groups, $\alpha_{OOD}^{sp}$ is non-zero; in challenging cases, it follows the same distribution as the class spurious component, $\alpha_{OOD}^{sp} \sim P_c^{sp}$.

Prototype Calculation

We define the expected coefficient vectors for each subspace:

$$\mu_c^{core} = \mathbb{E}_{\beta \sim P_c^{core}}[\beta], \qquad \mu_c^{sp} = \mathbb{E}_{\alpha \sim P_c^{sp}}[\alpha], \qquad \mu^{irr} = \mathbb{E}_{\gamma \sim P^{irr}}[\gamma].$$

Using these, we define the majority and minority group prototypes for class $c \in \{0,1\}$ as:

$$p_c^{maj} = U_c^{sp} \mu_c^{sp} + U_c^{core} \mu_c^{core} + U^{irr} \mu^{irr}, \qquad p_c^{min} = U_c^{sp} \mu_{1-c}^{sp} + U_c^{core} \mu_c^{core} + U^{irr} \mu^{irr}.$$

The overall class prototype is a convex combination of subgroup prototypes:

$$p_c = r_c\, p_c^{maj} + (1 - r_c)\, p_c^{min}.$$

Bias in Prototype Distances Under Strong Spurious Correlation

Consider spurious-dominated regimes where $r_c \approx 1$; then we have

$$p_c \approx p_c^{maj}.$$

Let $z_{OOD} \in S_c^{OOD}$ be an OOD sample with spurious alignment to class $c$. The score is defined as the negative Euclidean distance to the nearest prototype. The expected squared distance to the biased class prototype is:

$$\mathbb{E}[\|z_{OOD} - p_c\|^2] = \|\mu_c^{core}\|^2 + \mathrm{Tr}(\Sigma_c^{sp}) + \mathrm{Tr}(\Sigma^{irr}).$$

Here, the first term corresponds to core bias, the second to spurious variance, and the third to irrelevant variance. Now consider an in-distribution sample from the minority group. Its expected squared distance to the majority prototype (biased class prototype) is:

$$\mathbb{E}[\|z_{min} - p_c^{maj}\|^2] = \|\mu_{1-c}^{sp} - \mu_c^{sp}\|^2 + \mathrm{Tr}(\Sigma_c^{core}) + \mathrm{Tr}(\Sigma^{irr}).$$

In this case, the first term represents spurious bias, followed by core variance and irrelevant variance, respectively. Although $z_{min}$ is an in-distribution sample, its distance to the biased prototype includes a spurious bias term, while the OOD sample differs only in the core direction. Depending on the relative magnitudes of the core and spurious bias terms, this highlights the potential for erroneous OOD detection.

For example, in the Waterbirds dataset, the background (water or land) determines the spurious means $\mu_c^{sp}$ and $\mu_{1-c}^{sp}$. The difference $\|\mu_c^{sp} - \mu_{1-c}^{sp}\|$ may exceed the core bias in OOD distances, leading to incorrect OOD detection.

Why SPROD Mitigates Distance Bias

If prototypes can be accurately estimated to match each group's distribution, then each in-distribution sample $z_i \in S_c^g$ can be compared to its corresponding prototype $\hat{p}(z_i) = p_c^g$, where $g \in \{maj, min\}$. Because the prototype shares the same spurious basis alignment, the spurious bias is eliminated. The expected squared distance becomes:

$$\mathbb{E}[\|z_i - \hat{p}(z_i)\|^2] = \mathrm{Tr}(\Sigma_c^{core}) + \mathrm{Tr}(\Sigma^{irr}),$$

reflecting variance only, with no bias terms. In contrast, the standard prototype-based approach contains a spurious bias for minority samples:

$$\mathbb{E}[\|z_{min} - p_c^{maj}\|^2] = \|\mu_{1-c}^{sp} - \mu_c^{sp}\|^2 + \mathrm{Tr}(\Sigma_c^{core}) + \mathrm{Tr}(\Sigma^{irr}).$$

For a spurious-OOD sample, we also observe a core bias component:

$$\mathbb{E}[\|z_{OOD} - \hat{p}\|^2] = \|\mu_c^{core}\|^2 + \mathrm{Tr}(\Sigma_c^{core}) + \mathrm{Tr}(\Sigma^{irr}).$$

Comparing this to the expected squared distance of in-distribution samples using group-specific prototypes, we also conclude:

$$\mathbb{E}[\|z_{OOD} - \hat{p}\|^2] > \mathbb{E}[\|z_i - \hat{p}(z_i)\|^2].$$
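To sanity-check this ordering numerically, here is a small Monte Carlo sketch of the decomposition above (the subspace dimensions, means, and isotropic standard deviation of 0.5 are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, std = 100_000, 0.5
d_sp, d_core, d_irr = 4, 4, 8
mu_sp = {0: np.ones(d_sp), 1: -np.ones(d_sp)}  # spurious means of classes 0 / 1
mu_core = np.ones(d_core)                      # core mean of class 0

blk = lambda mu, d: rng.normal(mu, std, (n, d))

# ID minority of class 0: class-1 spurious part, class-0 core part.
z_min = np.hstack([blk(mu_sp[1], d_sp), blk(mu_core, d_core), blk(0.0, d_irr)])
# Spurious OOD aligned with class 0: class-0 spurious part, zero core part.
z_ood = np.hstack([blk(mu_sp[0], d_sp), np.zeros((n, d_core)), blk(0.0, d_irr)])

p_maj = np.hstack([mu_sp[0], mu_core, np.zeros(d_irr)])  # biased class-0 prototype
p_min = np.hstack([mu_sp[1], mu_core, np.zeros(d_irr)])  # minority-group prototype

msd = lambda z, p: (np.linalg.norm(z - p, axis=1) ** 2).mean()
print(msd(z_min, p_maj), msd(z_ood, p_maj))
# ~20 vs ~7: against the single biased prototype, the ID minority sample
# looks farther (more OOD) than the spurious OOD sample.
print(msd(z_min, p_min), msd(z_ood, p_maj))
# ~4 vs ~7: with group-specific prototypes and nearest-prototype scoring
# (z_ood stays closest to p_maj), the ordering is corrected.
```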
Comment

Dear Reviewer, we would appreciate an opportunity to further discuss or address any concerns that might remain after our rebuttal response. We hope to hear back from you. Thanks!

Comment

Hi Reviewer jXq2,

Could you check whether the authors' rebuttal addresses your concerns? If further clarification is needed, please engage in the discussion during the discussion phase.

AC

Comment

Thank you for providing the details. My concerns have been addressed.

Comment

Thank you! Your insightful comment on providing theoretical justification will certainly enrich the paper.

Comment

In view of the extended discussion period, we hope to still hear back from you regarding the unresolved concerns!

Review (Rating: 4)

This work defines two OOD categories, namely SP-OOD and NSP-OOD. In response to SP-OOD, the paper proposes SPROD, which uses three stages to address the SP-OOD problem arising from spurious correlations. It leverages the samples misclassified by the classifier and constructs more detailed minority-group prototypes to enhance the model's ability to handle SP-OOD. In the experiments, five OOD datasets were used for SP-OOD and NSP-OOD evaluation, and it was verified that the conventional softmax score was not suitable for them, while the likelihood score was more suitable.

Strengths and Weaknesses

Strengths:

  1. The paper proposes a post-hoc method for SP-OOD. The method uses the class labels of each image and emphasizes the misclassified training samples to construct more detailed minority-group prototypes.

  2. The article presents suitable OOD score forms for its post-hoc method and experimentally compares the generative and discriminative scores, explaining why the softmax OOD score is not suitable.

Weaknesses:

  1. Although the article focuses on the experimental performance on SP-OOD, we still hope the authors can provide some experimental results on NSP-OOD and conventional OOD datasets. A real-world OOD set should be a combination of multiple OOD data types, so we would like to see that the proposed method improves detection performance on SP-OOD without sacrificing detection performance on other OOD types.

  2. The article lacks some ablation experiments. The authors could supplement relevant ablations, such as demonstrating the effectiveness of stage 2 and stage 3.

  3. In Figure 3 of the article, the square icons and other items are not specified.

  4. The most recent competing methods, such as CLIP-based methods, are missing in this work.

Questions

  1. Why is the performance on ImageNet100 lacking relative to the more common OOD datasets? Is it related to the number of categories, or are there other reasons? It is suggested that the authors conduct an analysis.

  2. Currently, the application of CLIP in OOD detection is a research hotspot. I wonder how the proposed method could be used in CLIP-based OOD detection, because the compared methods in the article lack comparisons with recent 2023-2025 methods.

Limitations

Yes

Final Justification

My concerns have been addressed.

Formatting Issues

No

Author Response

Reviewer Comment: Although the article focuses on the experimental performance on SP-OOD, we still hope the authors can provide some experimental results on NSP-OOD and conventional OOD datasets. A real-world OOD set should be a combination of multiple OOD data types, so we would like to see that the proposed method improves detection performance on SP-OOD without sacrificing detection performance on other OOD types.

Response:

We thank the reviewer for highlighting this important point. We agree that demonstrating robustness beyond SP-OOD serves as a valuable complementary analysis to further validate the general applicability of our method.

Regarding NSP-OOD, we refer the reviewer to the dedicated section titled "Performance on Non-Spurious OOD Datasets" (line 200) in the supplementary materials. As shown in Table 6, SPROD consistently achieves near-perfect detection performance across all NSP datasets, further supporting its broad robustness.

Regarding conventional OOD datasets, we initially focused on SP-OOD-related benchmarks due to their alignment with our problem setting and the extensive number of experiments required. However, we have now extended our evaluation to conventional datasets including CIFAR-10, CIFAR-100, and ImageNet-1k to demonstrate the reliability and generality of SPROD. We follow the OpenOOD benchmark protocol for post-hoc methods. For fair comparison, we selected the best-performing methods for each dataset as reported in Table 2 of the OpenOOD paper:

  • CIFAR-10: KNN and VIM for near-OOD and far-OOD, respectively
  • CIFAR-100: MLS and RMDS for near-OOD and far-OOD, respectively
  • ImageNet-1k: ASH as the top-performing method

The following table reports SPROD’s performance alongside these methods, with values for baselines taken directly from the OpenOOD paper:

| Method | CIFAR-10 Near | CIFAR-10 Far | CIFAR-100 Near | CIFAR-100 Far | ImageNet Near | ImageNet Far | Avg (%) |
|--------|---------------|--------------|----------------|---------------|---------------|--------------|---------|
| RMDS   | 89.80 | 92.20 | 80.15 | 82.92 | 76.99 | 86.38 | 84.74 |
| MLS    | 87.52 | 91.10 | 81.05 | 79.67 | 76.46 | 89.57 | 84.23 |
| VIM    | 88.68 | 93.48 | 74.98 | 81.70 | 72.08 | 92.68 | 83.27 |
| KNN    | 90.64 | 92.96 | 80.18 | 82.40 | 71.10 | 90.18 | 84.58 |
| ASH    | 75.12 | 78.49 | 78.20 | 80.58 | 78.17 | 95.74 | 81.05 |
| SPROD  | 89.04 | 91.78 | 81.80 | 79.93 | 75.06 | 95.29 | 85.15 |

These results show that SPROD remains competitive with the best-performing methods on general OOD benchmarks and achieves the highest average performance across all settings.

 


Reviewer Comment: The article lacks some ablation experiments. The author can supplement some relevant ablation experiments, such as proving the effectiveness of stage 2 and stage 3.

Response: We agree that ablation studies are important for understanding the effectiveness of individual components. In Section Ablation Study on SPROD Stages (Supplementary, line 216), we include a dedicated analysis isolating each of the three stages of SPROD on the Waterbirds dataset under varying spurious correlation strengths (50% and 90%). Results in Figures 6 and 7 show that while Stage 1 provides a strong baseline, Stages 2 and 3 further enhance robustness under stronger spurious correlations.

 


Reviewer Comment: Currently, the application of CLIP in OOD detection is a research hotspot. I wonder how the proposed method can be used in OOD detection based on CLIP. Because the comparison method in the article lacks the comparison of 2023-2025 methods in recent years.

Response:

Regarding CLIP-based methods

We appreciate the reviewer’s interest in recent CLIP-based approaches, which indeed represent an important direction in OOD detection research. We would like to clarify that our submission includes a dedicated comparison with CLIP-based methods in the supplementary material, specifically in Subsection A.1 (line 75). In this subsection, we present both quantitative results (Table 1) and detailed performance plots (Figures 1 and 2). Our results show that SPROD achieves state-of-the-art performance in this CLIP-based setup, further validating its generality and robustness.

Additionally, in both this subsection and the main paper (line 251), we briefly discussed our reasons for not focusing extensively on CLIP-based experiments. In particular, the zero-shot setup of CLIP tends to bypass the effects of spurious correlations in the training data (which are central to our study) making it less suitable for analyzing how methods handle spurious attributes during learning.

Regarding 2023-2025 methods

In our main experiments, we selected 11 prominent post-hoc OOD detection methods, aiming to cover a broad range of representative techniques and to highlight their limitations in the spurious-OOD setting.

To further address the concern regarding more recent methods (2023–2025), we additionally evaluated SPROD against 8 state-of-the-art post-hoc methods on the pretrained ResNet-50 backbone (mirroring our second table in the main paper). The results are summarized below:

| Method   | WB | CA | UC | AMC | SpI | Avg |
|----------|----|----|----|-----|-----|-----|
| NNGuide  | 70.6 ±2.9 | 49.8 ±4.2 | 43.6 ±2.1 | 79.4 ±0.0 | 85.1 ±0.8 | 65.7 |
| Relation | 80.7 ±0.2 | 60.4 ±2.5 | 96.0 ±0.5 | 74.5 ±0.3 | 81.8 ±0.7 | 78.7 |
| ASH      | 78.5 ±3.2 | 47.3 ±2.8 | 39.6 ±1.7 | 78.0 ±0.2 | 86.6 ±0.7 | 66.0 |
| GEN      | 62.3 ±0.6 | 46.0 ±1.4 | 38.5 ±0.3 | 80.2 ±0.0 | 80.8 ±0.4 | 61.6 |
| NECO     | 53.5 ±1.6 | 39.5 ±3.2 | 35.1 ±1.5 | 80.2 ±0.1 | 67.2 ±0.3 | 55.1 |
| SCALE    | 89.0 ±2.9 | 44.9 ±3.2 | 54.4 ±2.1 | 78.4 ±0.4 | 86.2 ±0.5 | 70.6 |
| fDBD     | 71.1 ±0.5 | 51.3 ±1.3 | 47.4 ±0.2 | 79.9 ±0.0 | 84.2 ±0.3 | 66.8 |
| NCI      | 84.0 ±0.1 | 46.4 ±2.4 | 54.8 ±0.8 | 78.5 ±0.1 | 84.9 ±0.2 | 69.7 |
| SPROD    | 98.8 ±0.0 | 61.6 ±0.9 | 97.4 ±0.0 | 82.1 ±0.5 | 85.3 ±0.0 | 85.0 |

SPROD maintains consistently strong performance across all tasks and achieves the highest average AUROC among all methods.

  1. NNGuide: Nearest Neighbor Guidance for Out-of-Distribution Detection, ICCV, 2023
  2. RELATION: Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data, NeurIPS, 2023
  3. ASH: Extremely Simple Activation Shaping for Out-of-Distribution Detection, ICLR, 2023
  4. GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection, CVPR, 2023
  5. NECO: NEural Collapse Based Out-of-distribution Detection, ICLR, 2024
  6. SCALE: Scaling for Training Time and Post-hoc Out-of-distribution Detection Enhancement, ICLR, 2024
  7. fDBD: Fast Decision Boundary based Out-of-Distribution Detector, ICML, 2024
  8. NCI: Detecting Out-of-distribution through the Lens of Neural Collapse, CVPR, 2025

 


Reviewer Comment: In Figure 3 of the article, the square icons and other items are not specified.

Response:

Thank you for pointing this out. The purple squares in Figure 3 represent spurious OOD samples. We labeled them in Figure 2 but inadvertently omitted the legend in Figure 3. We will correct this oversight in the camera-ready version.

Comment

Thank you for your thoughtful engagement and positive feedback. We are happy to address your concerns.

Comment

Dear Reviewer, we would appreciate an opportunity to further discuss or address any concerns that might remain after our rebuttal response. We hope to hear back from you. Thanks!

Comment

Thanks for the response.

The rebuttal largely addressed the concerns I initially raised. However, regarding the empirical results for conventional OOD datasets, I still wonder if the authors could provide a deeper analysis of the specific OOD domains where the proposed method underperforms. Additionally, why does it show lower effectiveness than KNN on CIFAR-10 Near, CIFAR-10 Far, and CIFAR-100 Far?

Comment

Thank you for your thoughtful feedback. We appreciate your observation. As shown in our experiments, different methods behave differently across datasets and backbones. Specifically, as you noted, SPROD is behind KNN by 1.6, 1.18, and 2.47 percent on CIFAR-10 Near, CIFAR-10 Far, and CIFAR-100 Far, respectively. However, it outperforms KNN on CIFAR-100 Near, ImageNet Near, and ImageNet Far by 1.62, 3.96, and 5.11 percent.

We believe these variations relate to the nature of the datasets. Metric-based methods like SPROD, KNN, and MDS are closely related but can behave differently in specific settings. For instance, while SPROD performs best on Waterbirds and CelebA, KNN shows a large advantage over MDS on Waterbirds, whereas MDS outperforms KNN on CelebA. This could be due to KNN’s sensitivity to outliers and annotation noise, which is known to exist in CelebA [1]. KNN is also sensitive to its hyperparameter K. On the other hand, MDS suffers in some settings (e.g., Spurious ImageNet), possibly due to overfitting from covariance estimation.

Overall, our results across diverse datasets and backbones highlight the consistency of SPROD. Its prototype-based formulation is simple by design and may not capture complex patterns like MDS, so it may not always be the top-performing method. But more importantly, SPROD does not show major failures across various general OOD and SP-OOD experiments. This is more aligned with our primary goal of proposing a robust OOD detection method.

Following your suggestion, we are happy to include a more detailed analysis of these variations in the final version.

[1] “A Quantitative Analysis of Labeling Issues in the CelebA Dataset”, ISVC 2022.

Comment

Thanks for the response, and my concerns are addressed.

Final Decision

This paper introduces a post-hoc prototype refinement framework for OOD detection, handling spurious OOD scenarios. Reviewers appreciated the clarity and motivation of the work and the thorough experimental evaluation. Additional experiments were provided during the rebuttal and addressed the main concerns with new results. Another set of major concerns was raised regarding incremental novelty and overlap with existing prototype-based methods, especially prototype-based and mixture-of-prototypes methods, as mentioned by jXq2, along with some relevant baseline models mentioned by rDjS. The AC also recognized this issue and considers it crucial. The rebuttal gave clarification and satisfied the reviewers. The AC strongly encourages the authors to incorporate the additional results into the paper. In particular, the work should include a clearer discussion of the most relevant prototype-based methods (e.g., PALM, CIDER [ICLR'23], etc.) in a suitable position in the main paper, considering the obvious connections, even if these methods require training or target different applications, to place the contribution in proper context.

Overall, despite some limitations, the paper is well executed and presents a practical and effective solution. The AC recommends acceptance and suggests the authors to integrate these revisions into the final version, including discussions with related works, added comparisons, and others addressed during rebuttal.

Public Comment

Great work! This paper might be relevant:

ASCOOD

https://arxiv.org/abs/2411.10794

Going Beyond Conventional OOD Detection / Image-based Outlier Synthesis With Training Data