PaperHub

Rating: 7.3 / 10 · Spotlight · 4 reviewers
Reviewer ratings: 5, 4, 4, 5 (min 4, max 5, std dev 0.5)
Average confidence: 3.5
Novelty 3.0 · Quality 3.0 · Clarity 2.5 · Significance 3.3
NeurIPS 2025

Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations

OpenReview · PDF
Submitted: 2025-05-03 · Updated: 2025-10-29

Abstract

Keywords
OOD Generalization Benchmarks

Reviews and Discussion

Review (Rating: 5)

This paper uses a gradient-based method, OODSelect, to partition benchmarks' OOD splits into semantically coherent subsets where accuracy on the line does not hold. The authors test the method on six datasets and demonstrate that it uncovers subsets where higher ID accuracy predicts lower OOD accuracy, sometimes comprising up to 77% of the usual OOD split. They also provide the identified subsets for five datasets for future research.
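A minimal sketch of this kind of gradient-based subset selection follows (a reconstruction for illustration only, not the authors' code; the correctness-matrix layout, probit-space Pearson objective, and soft size penalty are assumptions):

```python
# Sketch only: a reconstruction of gradient-based OOD subset selection as described above.
# The tensor layout, probit-space Pearson objective, and size penalty are assumptions.
import torch

def probit(p, eps=1e-4):
    # Inverse normal CDF of accuracies, as in the accuracy-on-the-line literature.
    return torch.distributions.Normal(0.0, 1.0).icdf(p.clamp(eps, 1 - eps))

def pearson(x, y):
    x, y = x - x.mean(), y - y.mean()
    return (x * y).sum() / (x.norm() * y.norm() + 1e-8)

def ood_select(correct, id_acc, k, steps=2000, lr=0.1):
    """correct: [n_models, n_ood] float tensor of 0/1 correctness on the OOD pool.
       id_acc:  [n_models] float tensor of fixed in-distribution accuracies.
       k:       target subset size."""
    logits = torch.zeros(correct.shape[1], requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    id_probit = probit(id_acc)
    for _ in range(steps):
        w = torch.sigmoid(logits)                          # soft inclusion mask over OOD examples
        ood_acc = (correct * w).sum(1) / (w.sum() + 1e-8)  # per-model accuracy on the soft subset
        loss = pearson(id_probit, probit(ood_acc))         # drive ID-OOD correlation negative
        loss = loss + 1e-3 * (w.sum() - k).abs()           # softly keep the subset near size k
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.topk(torch.sigmoid(logits), k).indices    # final hard subset of size k
```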

Strengths and Weaknesses

Strengths

  • The paper is well written, and the OODSelect method is well supported by the theoretical analysis as well as extensive experiments.
  • The paper explores selection consistency and coherence within the subset and demonstrates the semantic soundness of each partition. It also investigates the use of LLMs and VLMs to generate semantic concepts in the dataset.
  • Model pool size selection, subset size, and other design details are clearly illustrated in the main text or appendix, with rationale and supporting experiments.

Weaknesses

  • The experiments are only conducted in the image domain, and the semantic explanations can be challenging. But I understand that running additional experiments in other domains (like text or tabular data) during the rebuttal period may be difficult.

Questions

  • Can the method be applied to non-image domains (e.g., text or tabular)? What modifications would be necessary? In these domains, could the semantic explanations of the OODSelect set be easier?
  • Can the authors provide concrete computational costs with dataset size and model pool size for a better understanding of the extensive computational cost in the main paper?

Limitations

Yes

Final Justification

The authors addressed my concerns during the discussion period. I will keep my positive score.

Formatting Issues

None

Author Response

Thank you for your feedback on our work. We are glad you found our submission well-written and supported by extensive experiments. We are also glad you found our theoretical analysis to be a strength. We address your questions point-by-point below.

Can the method be applied to non-image domains (e.g., text or tabular)? What modifications would be necessary? In these domains, could the semantic explanations of the OODSelect set be easier?

Indeed, the method can be applied to non-image domains. The only modification is the type of models one applies; from the method’s perspective, there is no difference. Thank you for the suggestion; we have added WILDSCivilComments, a text dataset, to our analysis, and we find similar trends. The WILDS CivilComments dataset is a text classification benchmark derived from the Jigsaw Civil Comments platform, where the goal is to predict the toxicity of online comments. It focuses on evaluating model robustness to demographic distribution shifts, particularly across identity subgroups such as gender, race, and religion. We have 6 domains with spurious correlations related to the following identities: ['male', 'female', 'LGBTQ', 'christian', 'muslim', 'other_religions', 'black', 'white']. Our OOD identities are Christian and Black.

For instance, at 15000 / 52823 (~28%) of the dataset, we observe an ID-OOD correlation of -0.6. Additionally, the random and hard selections yield correlations that are positive or near zero.

We will include the full analysis in our final version.

One limitation of many state-of-the-art tabular and text benchmarks is that we know exactly what the spurious correlations are, and we did not find any unexpected insight from automating semantic explanations. While images also have some known spurious correlations, the field of adversarial machine learning has demonstrated that there are correlations that humans do not perceive but that models can use to make predictions [Szegedy et al., 2013]. However, our experiments with semantic explanations did not surface such features either.

Can the authors provide concrete computational costs with dataset size and model pool size for a better understanding of the extensive computational cost in the main paper?

Thank you for the suggestion; we will break this down into GPU hours per model and per dataset in Figure 5 and Table 2. Generally speaking, these computations depend heavily on the specific datasets and data modality, as well as the classes of models considered for the task. Our work serves as a baseline that can be interpolated or extrapolated to new datasets and model types. We include the per-model, per-dataset GPU hours along with some summary statistics.

GPU Resources per Model per Dataset

| Dataset | Mean | Median | Std dev | Min | Max |
| --- | --- | --- | --- | --- | --- |
| PACS | 1.224110 | 0.814722 | 1.273111 | 0.038333 | 6.694167 |
| VLCS | 2.128983 | 1.612917 | 1.363286 | 0.041111 | 6.658611 |
| TerraIncognita | 2.196349 | 1.921389 | 1.602738 | 0.036944 | 9.859167 |
| CXR | 3.624290 | 2.730278 | 3.371409 | 0.001944 | 18.010833 |
| WILDSCamelyon | 4.073941 | 1.830417 | 4.348143 | 0.775000 | 18.054444 |
| WILDSCivilComments | 9.605884 | 9.458056 | 3.703064 | 3.430278 | 17.541944 |

We hope these additional results clarify your questions and strengthen the paper.

Comment

Thanks for your response; it answers my question. I will keep my positive score.

Review (Rating: 4)

This paper revisits the well-acknowledged concept of ‘accuracy on the line’, i.e., that a model’s in-distribution performance and out-of-distribution performance are highly correlated. The authors design a test-time, adversarial, gradient-based optimization to reveal that there exists a subset of OOD data whose accuracy is inversely correlated with ID accuracy. Even when the overall correlation is positive, a subset of up to 77% of the OOD split can exhibit a negative correlation.

Strengths and Weaknesses

Strengths:

  1. The authors design a neat method to reveal that accuracy on the line could just be an artifact of aggregating heterogeneous OOD examples, where a subset of the OOD data might break this hypothesis. This is an interesting idea that is novel to the community; it also indicates that average metrics can be misleading, especially in safety-critical applications like medicine.
  2. The paper conducts experiments on diverse datasets, including medical and conventional classification datasets, showing that this behavior is universal.
  3. The paper also dives deeper into its experiments and provides good discussion. I find many of the insights interesting; for instance, the OODSelect examples do not correspond to the most misclassified examples (which often lead to near-zero correlation).

Weaknesses:

  1. This work asks an interesting question about revisiting benchmarking but does not provide sufficiently actionable items. For instance, there is an attempt to find semantic patterns among OODSelect data using a VLM, which could be an actionable item for future benchmark design (like debiasing, etc.), but the result and insight are limited. There is more discussion in the Broader Impact section, but it mostly appeals to future directions. A complete discussion of how this idea could inspire future algorithm design (especially at training time) and benchmark design would be great.
  2. The algorithm is computationally expensive. The authors should include a table showing how the computation scales with the number of models and amount of data.

Questions

  1. There are previous papers that also criticize the universality of accuracy on the line, like [1] and [2].
  2. This work mostly adopts non-VLM models for studying the correlation between ID and OOD datasets. However, as pointed out in [2], VLMs behave differently from non-VLMs in terms of ID-OOD correlation. How would the method/conclusions of this work apply to VLMs? (I assume it is applicable to zero-shot inference, since, if I understand correctly, only model predictions are required for OOD subset selection.)
  3. Why not use the ImageNet and ImageNet-OOD datasets in the experiments (which are more widely adopted by the community)? Is it due to the size of the data?
  4. In Table 1, I am not sure I follow some of the data. For Chest X-rays, the full OOD size is 129838 and the OODSelect subset is 50000 (70%), but 50000/129838 = 38.5%?

[1] Effective Robustness against Natural Distribution Shifts for Models with Different Training Data, NeurIPS 2023.
[2] LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies, ICML 2024.

Limitations

Yes

Final Justification

I thank the authors for their detailed reply. My concerns regarding compute and the related-work discussion are largely resolved. Regarding actionability, while the training-time implications remain open, I acknowledge that a screening protocol for future benchmarks can be a sufficient actionable contribution. Given this, I have raised my score.

Formatting Issues

In Table 1, PACS, 2.6 is missing a %. Many hyperlinks in this paper don't work, especially 'Appendix A'.

Author Response

Thank you for your thoughtful feedback on our work. We are glad you found our work and insights to be interesting and novel to the community, and you found our experiments to be comprehensive. We address your questions and concerns point-by-point below.

“This work asks an interesting question about revisiting benchmarking but does not provide sufficiently actionable items. For instance, there is an attempt to find semantic patterns among OODSelect data using a VLM, which could be an actionable item for future benchmark design (like debiasing, etc.), but the result and insight are limited. There is more discussion in the Broader Impact section, but it mostly appeals to future directions. A complete discussion of how this idea could inspire future algorithm design (especially at training time) and benchmark design would be great.”

We appreciate the call for clearer, immediately useful takeaways. Our updated draft will provide a more detailed discussion on this topic. As part of this, we will release the OODSelect subsets for every dataset so that researchers can evaluate new models on splits that break “accuracy-on-the-line.” Second, previous work recommends that future benchmarks begin by screening their OOD splits for subsets without accuracy on the line [Bell et al. 2024; Salaudeen et al. 2025]; our method can be applied out-of-the-box to any dataset that today exhibits a positive ID-OOD correlation. These two immediate steps—publicly available stress-test subsets and a simple screening protocol—constitute actionable guidance, and we now state this plainly rather than implying it.

“The algorithm is computationally expensive. The authors should include a table showing how the computation scales with the number of models and amount of data.”

Thank you for the suggestion; we will break this down into GPU hours per model and per dataset in Figure 5 and Table 2. Generally speaking, these computations depend heavily on the specific datasets and data modality, as well as the classes of models considered for the task. Our work serves as a baseline that can be interpolated or extrapolated to new datasets and model types. We include the per-model, per-dataset GPU hours along with some summary statistics.

GPU hours per model per dataset:

| Dataset | Mean | Median | Std dev | Min | Max |
| --- | --- | --- | --- | --- | --- |
| PACS | 1.224110 | 0.814722 | 1.273111 | 0.038333 | 6.694167 |
| VLCS | 2.128983 | 1.612917 | 1.363286 | 0.041111 | 6.658611 |
| TerraIncognita | 2.196349 | 1.921389 | 1.602738 | 0.036944 | 9.859167 |
| WILDSCamelyon | 4.073941 | 1.830417 | 4.348143 | 0.775000 | 18.054444 |
| CXR | 3.624290 | 2.730278 | 3.371409 | 0.001944 | 18.010833 |

“There are previous papers that also criticize the universality of accuracy on the line, like [1] and [2].”

Thank you for these suggestions; we will incorporate them into our related works and discuss how our work is complementary but different.

Importantly, our work is complementary but different: both papers focus on demonstrating different metrics’ correlations with the aggregated OOD set (Shi et al., 2023: multiple distributions; Shi et al., 2024: LCA), while our work focuses on identifying OOD subsets that behave differently from the global set. Our work is interested in accuracy; however, developing methods like ours to identify this behavior with other metrics merits future study.

“This work mostly adopts non-VLM models for studying the correlation between ID and OOD datasets. However, as pointed out in [2], VLMs behave differently from non-VLMs in terms of ID-OOD correlation. How would the method/conclusions of this work apply to VLMs? (I assume it is applicable to zero-shot inference, since, if I understand correctly, only model predictions are required for OOD subset selection.)”

Importantly, [2] shows that VLMs also obey AoTL (accuracy, not LCA), suggesting our conclusions should transfer if we restrict attention to that model class. One practical hurdle is that open-source VLMs seldom report accuracy on a held-out test set corresponding to their pre-training distribution, meaning that we cannot directly apply our analysis.

When we apply VLMs (the set of models evaluated in your reference [1]) to the ID and OOD subsets we identify for our trained models on the in-distribution data, we still find positive and strong correlations between the ID and selected OOD sets for >1% of the subsets; note these same correlations are strongly negative in our analysis.

This is likely because (i) both the ID and OOD datasets are ID for the VLMs (pretraining includes these public benchmarks), or (ii) both the ID and OOD sets are OOD for the VLMs. Our analysis requires some notion of models trained on some ID set and evaluated on some other OOD set. To run this analysis, we would need to train VLMs from scratch on our ID and OOD datasets. This is a much larger-scale experiment, considering the model and data sizes required to train VLMs, and it presents an interesting direction for future work.

We will include this analysis in our final version for completeness; thanks for the suggestion!

| Dataset | R |
| --- | --- |
| PACS | 0.78 |
| VLCS | 0.62 |
| Terra Incognita | 0.84 |
| WILDSCamelyon | ~1.00 |
| CXR | 0.94 |

“Why not use the ImageNet and ImageNet-OOD datasets in the experiments (which are more widely adopted by the community)? Is it due to the size of the data?”

We focused on multi-source domain generalization benchmarks that are standard in the literature [Gulrajani et al., 2020; Koh et al., 2020]; ImageNet -> ImageNet-OOD would be single-source. Also, as you mentioned, it requires downloading and processing TBs of ImageNet-21K; incorporating it would significantly increase the already extensive experimental computational cost. We will add a discussion of ImageNet-OOD to our related work.

“In Table 1, I am not sure I follow some of the data. For Chest X-rays, the full OOD size is 129838 and the OODSelect subset is 50000 (70%), but 50000/129838 = 38.5%?”

You are correct: the full CXR OOD set is 71,433 images, not 129,838 (the size of WILDSCamelyon), so the 50,000-image subset indeed represents ~70%. We will also address the other typos you pointed out; thanks!

Hyperlinks.

Apologies for the inconvenience. Please find our appendix in the supplemental material. Because the appendix is detached from the main paper at submission time, the links to appendix items do not resolve. In the final version, where they are together, this issue will naturally be resolved.

We hope these revisions resolve the outstanding concerns and strengthen the paper, and we would be grateful if you would consider raising the score accordingly.

Comment

We thank reviewer CTtt for their valuable feedback and suggestions to improve our work. We have addressed your questions and implemented your suggestions in our rebuttal. We would be grateful for your follow-up on our rebuttal. Thanks.

Kind Regards,

Authors

Comment

We thank reviewer CTtt for their time and efforts in reviewing our paper. As we are approaching the extended author-reviewer discussion period, we wanted to follow up on getting your response to our rebuttal to your review. We look forward to hearing back from you. Many thanks!

Regards,

Authors

Comment

I thank the authors for their detailed reply. My concerns regarding compute and the related-work discussion are largely resolved. Regarding actionability, while the training-time implications remain open, I acknowledge that a screening protocol for future benchmarks can be a sufficient actionable contribution. Given this, I have raised my score.

Review (Rating: 4)

The paper proposes that out-of-distribution (OOD) performance may be on average positively correlated with in-distribution (ID) performance for models of increasing capacity, but there may be a subset of OOD data that is actually negatively correlated with ID performance. To determine if this is the case, a method for splitting an OOD test set to minimize correlation with the ID training set is presented (OODSelect). The method is applied to a number of application-focused datasets with OOD subsets, and a large number of models of varying capacities are trained and evaluated to find that there is indeed a subset for which ID performance does not predict OOD performance. In at least one dataset, the subset that performs worse is associated with certain identifiable subpopulations that are unrelated to the data labels.

优缺点分析

Strengths

The paper is very relevant to practical applications, and also addresses an important issue with evaluating models: namely their generalization properties in real-world applications. The results are also very interesting and relevant - especially to the fairness and bias of models. The experiments cover a wide range of datasets and models.

Weaknesses

My main issue with the paper is it is not as well presented as it could be, and there are some omissions and odd details in the methods/results. The method also seems a bit overcomplicated and I would like to see the paper focus more on exploring its findings in greater detail (which are quite interesting). The actual methods, although original in execution, are not entirely original in purpose (see the additional citations in Questions). Having said that, I still think this paper addresses an important problem with insightful results, and will be happy to raise my score if the authors address the points below.

There is something strange going on in fig 3a and 4: the intercepts for OODSelect are sometimes negative, which shouldn't be possible since the slope is always negative and the accuracies are always positive. Are the regression lines applied to the probit transformed accuracies?

The proof sketch for Proposition 1 is not entirely convincing, and a counterexample to submodularity would help greatly.

I don't understand how the quantity of models was chosen, because it is the diversity of the models that really matters - e.g. if all of the models have the same performance up to noise, then there would probably be little or no correlation between ID and OOD test accuracies. As an example of why this is an issue: in figure 4d there is a clear line of correlated All/OODSelect accuracies on the right side. One possible explanation is that there is an architecture for which All/OODSelect accuracy are positively correlated, and if one were to only sample models from that architecture, then the OODSelect slope would become positive (just like All), invalidating the findings for fig 4d.

Minor issues:

  • Definition 1 should explain acc_{P_{ID}}, etc. Actually, maybe Definition 1 should follow the problem setup
  • Problem setup should make clear whether OOD examples are from the train or test set
  • Notation is somewhat unwieldy: long tildes, long text variables like acc_{P_{ID}}, abbreviations like AoTIL (why not just say positive/negative correlation?)
  • Typo in Table 1 (missing % sign)
  • In Proposition 1, \subset isn't well defined since \mathbf{s}_i is a vector of 0s and 1s
  • Paragraph at line 178 needs a reference to figure 2
  • Appendix is missing from the submission

Questions

What is the reason for applying the probit transform, as opposed to e.g. Spearman (rank) correlation?

Could the authors clarify the meaning of the second sentence in the quote below (line 67)?

Influence functions [33] may appear suitable at first glance, but they rank training points by leave-one-out influence, rather than partitioning the test/OOD set. Thus, applying influence functions in this context would still require an additional heuristic to define coherent subsets, while also inheriting known fragilities in modern deep networks.

In CXR 70% of OOD examples are negatively correlated. Where is the strong positive correlation between ID and all OOD examples coming from then? Surely the model must be doing significantly better on the remaining 30% of examples to counteract the OODSelect subset.

Paragraph at line 186 (selection consistency and coherence) - this seems intuitively sensible, and if anything, the opposite is surprising. Could the authors comment on how inconsistency can occur when increasing subset size (maybe in relation with proposition 1)?

The discussion of influence functions as well as the subset discovery method presented seem similar to influence scores computed by Feldman & Zhang (https://proceedings.neurips.cc/paper/2020/hash/1e14bfe2714193e7af5abc64ecbd6b46-Abstract.html). There is also a lot of related work in this area (sometimes called example difficulty scoring, e.g. https://arxiv.org/abs/2401.01867), some of which is relevant to the bias/fairness angle (e.g. https://arxiv.org/abs/2010.03058). Is there a way to leverage the scores of individual examples, such as influence scores from Feldman & Zhang, to do the same OOD selection?

Limitations

I think the main potential limitation is the range of models evaluated, but I cannot say either way whether the authors have adequately addressed this point from the main text alone (without Appendix A). As the authors discuss, the datasets chosen are also limited in the metadata available, and perhaps here the authors may want to examine datasets similar to the chest X-rays datasets, which have labelled subpopulations, in order to determine if the negatively correlated OOD subsets have bias/fairness implications.

Final Justification

Resolved: all issues except the one discussed in the comment below. The other reviews have also been addressed.

Unresolved: the evidence for negative correlation between ID and OOD accuracy is not entirely convincing, as seen in a subset of plots. This is likely due to the handling of different architectures. I argue in the comment below that the architecture is a confounding variable that should be accounted for to avoid Simpson's paradox. This issue does not invalidate the paper's results (many of the plots do show AotIL) but does weaken one of the paper's central claims.

Formatting Issues

N/A

Author Response

We thank the reviewer for the thoughtful comments and are pleased you found the problem, results, and significance compelling. We address each concern in point-by-point detail below.

the main potential limitation is the range of models evaluated, but I cannot say either way whether the authors have adequately addressed this point from the main text alone (without Appendix A). / “Model diversity and stability of the correlation.”

Please note that our appendix was included in the supplementary material.

Section 4 clarifies that we train models spanning over 35 architectures (AlexNet through ViT-L), with different pre-training, augmentations, and hyperparameters (L143). We also include transfer learning, finetuning, and training from scratch, and we provide a detailed list of the architectures in Appendix A. The diversity of the models keeps your example from occurring; we will clarify this in the final version of the paper. Once we have diversity, we want to select enough models for a stable global correlation, i.e., sampling new models does not change the global correlation by more than 1%. Importantly, the sampling is fully random and not based on this stopping criterion; stability is primarily a statistical criterion. Figure 5 in our submission summarizes these results.

datasets similar to the chest X-rays datasets, which have labelled subpopulations

Thank you for this suggestion. We include state-of-the-art domain generalization datasets to illustrate that this observation is not just a subpopulation issue. We have added the WILDSCivilComments (Koh et al., 2021) dataset to our analysis, which has demographic information (e.g., gender, religion) used to define domains, and we find the same behavior. For instance, at 15000 / 52823 (~28%) of the dataset, we have an ID-OOD correlation of -0.6, where this OOD subset consists of comments from Black individuals. This adds to the fairness implications of our work.

“The method also seems a bit overcomplicated and I would like to see the paper focus more on exploring its findings in greater detail (which are quite interesting).”

Thanks for this suggestion to improve our paper. We will move more details on the method to the appendix and expand on our findings, including the clarifications and additional results provided below, as you suggested.

“why can the intercept be negative?”

Indeed, the regression is applied to the probit transform, as in the definition of accuracy on the line in seminal previous work (more in the response to the comment below). After this transform, the linear fit can indeed have a negative intercept even though raw accuracies are strictly positive. We will state this explicitly in Section 3.2 and in the figure captions.
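As a small illustration of this point (synthetic accuracies, assumed only for the example), the probit transform maps any accuracy below 50% to a negative value, so a line fit in probit space can cross zero:

```python
# Illustration with synthetic accuracies: after the probit transform, accuracies
# below 50% map to negative values, so the fitted line can have a negative
# intercept even though the raw accuracies are strictly positive.
import numpy as np
from scipy.stats import norm, linregress

id_acc  = np.array([0.55, 0.62, 0.70, 0.78, 0.85])   # hypothetical ID accuracies
ood_acc = np.array([0.30, 0.35, 0.42, 0.48, 0.55])   # hypothetical OOD accuracies

fit = linregress(norm.ppf(id_acc), norm.ppf(ood_acc))
print(fit.slope, fit.intercept, fit.rvalue)           # intercept < 0 here
```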

“Why Pearson in probit space rather than Spearman?”

We follow the extensive “accuracy-on-the-line” literature [canonical references: Recht et al., 2019; Miller et al., 2021; Taori et al., 2020]. Our work addresses this specific empirical observation of correlation in probit space, which captures that structure directly. One limitation is that the Spearman rank correlation will hide sufficiently different examples where model behavior departs substantially from the general trend, an occurrence we want our metric to reflect; we show this below with an example from WILDSCamelyon. We aim to identify sets where models can perform well in-distribution but poorly out-of-distribution; if such sets have small mass, the Spearman correlation may not accurately reflect them.

It is true that the Spearman correlation could provide additional insights; in our experiments, applying the Spearman correlation to our selected subsets instead of the Pearson correlation yields the same conclusions.

We primarily aim to distinguish positively from negatively correlated OOD subset accuracy. We apply both Pearson and Spearman and threshold at 0 for positive vs. negative correlation, and we find concordance in whether the ID/OOD performances are positively or negatively correlated. We similarly find concordance for strong vs. weak correlations (|R or Spearman| > 0.3). For PACS and TerraIncognita, we identified near-perfect concordance; predictions are perfect once we account for error bars (we provide cases below), and many of these have correlations near 0. For VLCS and CXR, we find perfect concordance.

The WILDSCamelyon example is one where there is a strong negative Pearson R while the Spearman rank is near-zero. Importantly, both of these values suggest that a better in-distribution model does not imply a better out-of-distribution model, and are therefore settings we want to capture. The difference in values here is a function of a group of models with negative correlations, while many other models have positive correlations. So, Pearson R is sensitive to the impact of these models (a property we desire), while Spearman is not. However, we want our metric to be able to show that it is possible to have a negative correlation between ID and OOD accuracies for some set of models if they exist.

| Dataset | N Selected OOD | Pearson R | Spearman |
| --- | --- | --- | --- |
| PACS | 250 | 0.36 | -0.021 |
| VLCS (perfect concordance) | N/A | N/A | N/A |
| TerraIncognita | 2500 | 0.02 | -0.03 |
| WILDSCamelyon | 50 | -0.92 | 0.00 |
| CXR (perfect concordance) | N/A | N/A | N/A |
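A small sketch of this concordance check (the per-model accuracy layout is assumed; this is not our evaluation script):

```python
# Sketch of the sign-concordance check between Pearson R in probit space and the
# Spearman rank correlation (data layout assumed).
import numpy as np
from scipy.stats import pearsonr, spearmanr, norm

def sign_concordant(id_acc, ood_acc, eps=1e-4):
    """id_acc, ood_acc: arrays with one accuracy per model."""
    id_p  = norm.ppf(np.clip(id_acc, eps, 1 - eps))
    ood_p = norm.ppf(np.clip(ood_acc, eps, 1 - eps))
    r, _   = pearsonr(id_p, ood_p)
    rho, _ = spearmanr(id_acc, ood_acc)   # rank correlation is invariant to the probit transform
    return np.sign(r) == np.sign(rho), r, rho
```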

“...a counterexample to submodularity would help greatly.”

Please find such a counterexample in the proof of our submission in Appendix B4 (line 977).

“In CXR 70% of OOD examples are negatively correlated. Where is the strong positive correlation between ID and all OOD examples coming from then? Surely the model must be doing significantly better on the remaining 30% of examples to counteract the OODSelect subset.”

This is a good point that we will elaborate on in our updated draft. Your observation is close to the case: the remaining 30% necessarily has higher OOD accuracy than the selected subset, so it contributes significantly to the general trend. This is also an effect of the CXR slope having a very small absolute value when it is negative in the 70% case, even though the overall correlation is strongly positive. The remaining 30% then only needs to do much better relative to the range of OOD accuracy on the 70%, which in some cases spans less than 10% accuracy between the worst and best OOD accuracy. Please see Figure 7 in the Appendix, which illustrates this. This is also an interesting observation, as we often assume that only a majority group can have such an impact on the general trend, but a sufficiently out-of-distribution (and not too small) set can also have this effect. Importantly, the slope is also sensitive to factors unrelated to spurious correlation, such as uniform label noise, making it an unreliable metric in place of the Pearson correlation.

“how inconsistency can occur when increasing subset size (maybe in relation with proposition 1)?”

Each subset size is produced independently. With a submodular objective, greedily growing a set would guarantee monotonic gains and nested subsets. By contrast, our correlation objective is non‑submodular (Prop. 1): the marginal benefit of adding an example can flip sign depending on what is already selected. Empirically, we still observe high overlap across sizes (Jaccard ≥ 0.72) and semantic consistency (for example where we have metadata like CXR), indicating the optimizer tends to find coherent clusters, but the check is essential because the theory offers no guarantee.

“Influence functions discussion (line 67)? Is there a way to leverage the scores of individual examples, such as influence scores from Feldman & Zhang, to do the same OOD selection?”

We expand on our discussion in L67, speaking to this question. Influence functions [Feldman & Zhang 2020] score training points by how much they change a single test prediction under leave-one-out retraining. OODSelect, in contrast, must partition the test/OOD pool itself. Bridging that gap would require an extra, heuristic clustering step and would inherit the known numerical fragility of influence estimates in deep networks [Hu et al., 2025, Most Influential Subset Selection: Challenges, Promises, and Beyond]. Moreover, high-influence test examples are typically near-duplicates that the model already gets right, not the spurious-feature samples that can identify subsets with negative ID-OOD correlation. Our “Hard-example” baselines illustrate this: selecting the most-misclassified points leaves the correlation near 0, whereas OODSelect identifies negatively correlated groups.

Minor issues:

Thanks for pointing these out; we will address them in our final version:

  • OOD examples…train or test set. Since we do not train on any OOD examples, there is no train/test split; there is only a full OOD dataset from which we select subsets. We will clarify this in our final version.
  • Typos and references. We will address the typos and missing references; thank you.
  • Proposition 1 subset. Thank you for pointing this out. As you correctly inferred, we mean that the samples selected by the s_i's are subsets of each other, rather than the selection vectors themselves. We will fix this notation in our final version.
  • Def 1... We appreciate the comment on notation and acronyms. We want to reiterate that there is a literature on this observation of accuracy on the line (under the probit transform) that our work directly addresses, though our findings have general implications. We chose to remain consistent with the norms of seminal previous work rather than pollute future discussion with conflicting notation and acronyms.

We hope that our detailed clarifications, additional experiments, and revised presentation comprehensively address your concerns, and you will consider raising your score.

评论

We thank the reviewer for their detailed response and engagement; we truly appreciate it. As most of the concern is about architectural confounds, we will perform a systematic study of the architectures to augment our analysis. However, for the sake of completeness, we want to address your response, in case it merits another recalibration, as we believe there were a few misunderstandings in your response.

First, we want to emphasize that we are addressing an empirical phenomenon widely observed in the literature. The setting we study is exactly that under which the phenomenon has been observed, and in order to show the limitations of this observation, we must match that experimental setup. Please see the widely cited papers referenced in our responses above (just to cite a few; the literature on the phenomenon is quite extensive).

OOD subset's performance is less than ID performance, but the correlation between ID and OOD looks visually flat or even positive on the higher end of ID performance (see the "lines" of points that occur on the right sides of figures 4c, 4d, 11, 15, and 17). and 'which runs counter to the claim in line 172'

First, please note that these figures are on a probit scale (as is the norm in prior work), so visual flatness can be misleading; for instance, adjacent ticks can correspond to tens of points of accuracy (and thus look flat). We will be explicit about the scale in the figure captions in the final version. Hence, we provide the Pearson R, which captures the negative trend. A negative Pearson R, by definition, provides evidence for a negative correlation (while our explicit evaluation of misclassified examples is near 0). Hence, this does not run counter to our claim.

This circles back to my prior concerns about model diversity and the choice of correlation method.

We are happy to provide per-architecture correlations. We believe we address your question about model diversity, with over 35 different architectures in our rebuttal, and would be grateful for any explicit remaining concerns regarding diversity.

However, we want to emphasize that this would be counter to the empirical phenomenon we aim to address, as is the choice of correlation, which explicitly combines multiple architectures to obtain a diversity of models. In fact, our work supports that there are examples where models can improve in-distribution and globally out-of-distribution yet perform worse on a specific out-of-distribution subset, independent of architecture. Our analysis answers this question. Subsetting necessarily gives the same answer, given the strong overall trends.

Isolating our analysis to a single architecture makes the claim about the architecture rather than the dataset, which changes the discussion; this area of work is focused on the datasets themselves and their shifts (please see the cited works). Moreover, for very strong correlations, e.g., those in the figures you cited, the points lie strongly on the line, suggesting that the same is likely to hold for subsets, including when isolating architectures.

evaluating OODSelect on a held-out set of architectures (i.e., find the subset of AotIL examples using one set of architectures, then measure ID/OOD correlation on a disjoint set of architectures)

This is exactly what our experiments do (L146).

We thank the reviewer for their comments to improve our work. As most of the concern is about architectural confounds, we will perform a systematic study of the architectures to augment our analysis.

We hope our response further validates the strength of our analysis, particularly in light of the confusion about the apparent flatness of the lines on the probit scale, and that you will consider calibrating your score to reflect this. Many thanks!

Comment

Thank you for the clarifications. I agree that the choice of architectures is in line with prior work. However, there is a key difference: the OOD data in prior work is drawn independently of the models involved, whereas the subsets found by OODSelect are conditional on the model architectures involved, hence my concern about whether the choice of architectures could affect the results. Thank you also for pointing out the disjoint train test splits - it certainly addresses most of my concerns. I have a question however: are the splits disjoint in architecture (e.g. AlexNet vs ResNet) or just variants (e.g. ResNet 34 vs 100)?

I am still unsure about the strength of the results in figures 4d, 11, 15, 17. I meant by "lines of points" that these plots have points on the right side (higher ID accuracy) which are on a horizontal line (meaning OOD accuracy is constant for increasing ID accuracy) or with an upward trend. Surely these points in isolation cannot be negatively correlated with ID accuracy, even after the probit transform? Furthermore, if the Pearson R and slopes reported with each plot are computed post-probit transform, then shouldn't the blue lines fall visually along where the points are? This discrepancy is most noticeable in figure 11, where the orange lines clearly follow the orange points, but the blue lines completely miss the majority of the points. I think there may be a few outliers hidden by the legend that are heavily influencing the lines of best fit?

In any case, it appears that the relationship between ID and OOD accuracy is not consistent over different ranges of ID accuracies. This is why I argued that: a) the choice of architectures could change the reported correlations (consider that omitting outliers or choosing only the stronger architectures with points on the right side of the plots could change the results from AoTIL to AoTL), and b) the Pearson R alone does not imply AoTIL generally for all models, since it is sensitive to a few outlier models. In contrast, the plots in prior literature generally show a consistent linear trend over the entire range of ID performances, which is why I don't think these issues are relevant to their works.

Comment

Thank you for the detailed reply and for pointing me to the appendix. Most of my comments have been addressed and I have raised my score accordingly. There is one conceptual issue which the rebuttal has made clear to me, which unfortunately I don't think there is room to address in this review cycle, but I have described in detail below for the authors' consideration.

I think the proposed methods are able to find subsets of examples with poor OOD generalization. However, I think the evidence for negative correlation between ID and OOD accuracy is not entirely convincing. Specifically, some of the evidence points to OODSelect simply picking up the most misclassified examples, which runs counter to the claim in line 172:

Focusing specifically on the most misclassified examples, we find that the ID-OOD correlation is near 0 and rarely invert it as the OODSelect examples do.

A visual inspection of the ID/OOD accuracy plots in appendix A shows that generally, the OOD subset's performance is less than ID performance, but the correlation between ID and OOD looks visually flat or even positive on the higher end of ID performance (see the "lines" of points that occur on the right sides of figures 4c, 4d, 11, 15, and 17).

This circles back to my prior concerns about model diversity and the choice of correlation method. Different model families or architectures (e.g. Regnet vs Inception) may have drastically different correlations between ID and OOD performance, so one should consider architecture as a confounding variable instead of treating all runs as equivalent data points (e.g. currently it is not possible to tell if "lines" of points have the same architecture). Otherwise, the negative correlations could be due to Simpson's paradox.

In particular, it is possible that the OOD subsets are the examples which are especially sensitive to inductive biases of different architectures (see https://arxiv.org/abs/2401.01867 for what I mean by inductive bias), which also happen to be negatively correlated with ID performance. To be clear, I think it is useful to be able to surface such examples, but this is not the same as ID and OOD performance being negatively correlated for models with the same architecture, such as by training wider or deeper networks.

This also means that the observed correlations are sensitive to the architectures which models were sampled from - e.g. if there are more architectures with negative ID/OOD correlation than those with positive correlation, then the overall correlation will be negative. I noticed that in Appendix A, some architectures have more variants than others (e.g. 5 ResNets vs 2 SqueezeNets), but uniformly sampling across all variants would then result in more ResNets than SqueezeNets.

Regarding the use of Pearson correlation after applying the probit transform to accuracies, I agree with your justifications for the method, but also want to point out that since different measures of correlation are sensitive to different data points, the choice of correlation method could also affect which architectures are emphasized in the results. Also, since OODSelect directly optimizes for negative correlation, one needs to check if the OOD subset has overfit to the particular set of architectures used in OODSelect.

In summary, I think the negative correlation claim (AotIL), while not invalid, could be made more robust such as via:

  1. showing AotIL in a single architecture, e.g. by varying width and depth
  2. indicating the architecture of individual points in the plots (e.g. via hue)
  3. choosing and sampling from different architectures in a way that accounts for differences in ID accuracy or inductive bias - for example, not choosing multiple architectures with similar ID accuracy, or choosing 1 width/depth per architecture that has comparable ID and OOD performance to the other architectures.
  4. doing more systematic study of individual architectures, e.g. comparing transformer vs conv nets.
  5. revising the AotIL claims to be more specific, discussing its limitations and assumptions.
  6. evaluating OODSelect on a held-out set of architectures (i.e. find the subset of AotIL examples using one set of architectures, then measure ID/OOD correlation on a disjoint set of architectures)

Comment

We thank the reviewer again for their engagement. In the true spirit of author-reviewer discussion, your comments have been extremely valuable in improving our work. We truly appreciate your feedback! Thank you for this comment; we will implement your suggestions to augment our results!

independently of the models involved, whereas the subsets found by OODSelect are conditional on the model architectures involved, hence my concern about whether the choice of architectures could affect the results

You raise an important point that we will expand upon in our revision. Indeed, for all analyses of this type, including previous work, the results are conditional on the architecture and training choices (Teney et al., 2023: ID and OOD Performance Are Sometimes Inversely Correlated on Real-world Datasets). For instance, if we had a set of perfect models, we would have accuracy on the line both in aggregate and on subsets. Accuracy on the line, across models/architectures rather than for a single model/architecture, has been used as a predictor of OOD performance; our results show that for a set of models with this global property (e.g., models on a leaderboard), one can find subsets where the correlation is inverted. This aligns with the practical implications of our observation.

Our point is that, for choices that give AoTL, there can be subsets with weak correlation or AoTIL.

Surely these points in isolation cannot be negatively correlated with ID accuracy, even after the probit transform? Furthermore, if the Pearson R and slopes reported with each plot are computed post-probit transform, then shouldn't the blue lines fall visually along where the points are?

In our work, we primarily discuss the Pearson R, not the slope, because the slope is sensitive to many things irrelevant to our aim, e.g., uniform noise. Outliers, however, also affect the Pearson R, which we address below.

Most of our plots also show consistent linear behavior for all models; the strongest are Figures 4a-b, 7, 9d-e, 13, and 15a.

Generally, we are able to find many examples where the ID/OOD correlation inverts from aggregation to subselection --- whether to a weak correlation or a strongly negative one. Both are relevant. As we note in the submission, this will not be the case for every dataset; for instance, for the less diverse datasets, we are not able to find large subsets. Our purpose in this work is to demonstrate that this phenomenon can occur.

Importantly, for the figures other than Figure 11, when your observation occurs, the Pearson Rs are indeed relatively weak. For instance, in Figures 15d/e, there is a sort of phase shift from strongly negative to weakly positive, and the two cancel out. However, the direction of the trend is not determined only by outliers. Thanks to your suggestion of the Spearman rank correlation, which is robust to outliers, we are able to verify that there is strong concordance between the direction of the trend as determined by the Spearman rank and by the Pearson R. You can also verify this by visual inspection. We will add a careful discussion of these cases for each dataset in the final version.

We will also be sure to call out the observation in Figure 11 and perform additional analysis on this ID/OOD split. Specifically, we will perform an outlier detection test on the metric and then also report the correlation without outliers as well. In fact, it would be interesting to identify what these models that are near chance in-distribution but nearly perfect OOD are.

We will also add an analysis of different phases of model performance. Importantly, we cannot have 'accuracy on the (inverse) line' for a single model or a small set of models, since then there is effectively no 'line.' What we can say is that some models have a bigger delta than others between OOD and ID accuracy. But, like the slope, this delta is sensitive to things not important for our aim, e.g., uniform noise.

Additionally, there are settings where previous work also observes similar phenomena of non-uniform linearity, and those works often evaluate fewer multi-source datasets. For instance, see Figure 21 in Miller et al., 2021; Figure 5 in Taori et al., 2020; Teney et al., 2023; and Salaudeen et al., 2025.

Please also note that we provide theoretical analysis on the bounded effect of new models on the results (Lemma 1). We will expand discussion of this result in the main text and connect with the discussion above.

Thanks for pointing out your observation, which improves our work.

are the splits disjoint in architecture (e.g. AlexNet vs ResNet) or just variants (e.g. ResNet 34 vs 100)?

The splits are across models, not architectures. However, no single architecture dominates the distribution, and even within an architecture there are different pretraining, finetuning, and transfer learning strategies. We will include an analysis splitting across architectures to address this concern.

Thanks for the suggestion!

Review (Rating: 5)

The paper investigates the OOD splits in current benchmarks and proposes an approach to select subsets of these OOD samples before they are used for evaluating methods. This is useful for addressing concerns about spurious correlations (higher ID accuracy with reduced OOD accuracy) in machine learning models. In some of the datasets, such as Chest X-rays, they extracted around 70% of the OOD split.

Strengths and Weaknesses

Strengths:

  • Originality: The paper presents an interesting approach to selecting OOD splits from datasets.

  • Quality: The study is well executed, with thorough theoretical and empirical analysis showing the validity of the proposed approach. The authors provide a clean graphical analysis for each dataset. They also provide a detailed appendix with proofs, dataset analysis, and anonymized code.

  • Clarity: Definitions of key concepts, such as the objective and the proof of non-submodularity, are well explained. The practical recommendations on each dataset and benchmark provide clear and actionable insights for both researchers and practitioners regarding their selection.

  • Significance: The paper is useful for the study of hidden OOD data in current datasets, and it will help current distribution shift benchmarks adopt better ID-OOD splits, which aids in evaluating the generalizability of models.

Weaknesses

  • Can the selection of OOD samples (OODSelect), per the objective in Equation 2, bring out ID samples (potentially adversarial samples) that the model incorrectly classifies, apart from the original OOD samples? How can we differentiate between where the model is going wrong on ID samples and incorrectly predicted OOD samples? Is this problem addressed by using N models, with OODSelect samples performing poorly across a large subset of these models?

  • Missing baselines: The paper proposes a subset selection of OOD samples from a dataset considering an ID distribution. How about some baselines for the same problem? For instance, we could plot the latent feature distribution of data samples using a pretrained model (e.g. CLIP) and pick OOD samples based on some distance metric threshold. There are some discussions along these lines using foundation models in Appendix C. However, it would be nice to ground the numbers as in Table 1.

  • Alternative to the correlation metric: A simple distance metric (e.g. Euclidean distance) could make the objective submodular, and the subset selection problem could then be solved with a greedy algorithm. Could this be a good comparison?

Some related works that could strengthen the related work discussion:

  • Nagarajan et al. Understanding the failure modes of out-of-distribution generalization. ICLR 2021
  • Lin et al. Spurious Feature Diversification Improves Out-of-distribution Generalization. ICLR 2024
  • Deng et al. Robust Learning with Progressive Data Expansion Against Spurious Correlation. NeurIPS 2023.
  • Shi et al. LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies. ICML 2024

Questions

  • Are there any analyses of the OOD splits obtained from each dataset? What is their semantic coherence with the ID data?

Limitations

Yes

Final Justification

Thanks for addressing the concerns. The further discussion also brings in interesting insights, especially that CLIP-feature-based OOD selection is quite poor compared to the proposed OODSelect. Thanks for the hard baseline showing that OODSelect isolates spurious examples rather than generally difficult ones. Please include the overall rebuttal discussion in the final version of the paper. I am inclined to raise my rating.

Formatting Issues

No

Author Response

We thank the reviewer for the constructive feedback and the positive assessment of our paper’s originality, quality, clarity and significance. Below we respond to your concerns point-by-point and describe the changes made in the revised manuscript.

How can we differentiate between where the model is going wrong on ID samples and incorrectly predicted OOD samples? Is this problem addressed by using N models, with OODSelect samples performing poorly across a large subset of these models?

Equation 2 only moves the OOD accuracy vector; the ID vector is untouched, so any ID-sample misclassification pattern is not part of the optimization target (in our submission, L113: “we are not selecting models or altering the ID accuracies; we always correlate the same length-N vectors, only the OOD accuracy values change”). The OOD samples are also distinct from the ID samples in our datasets. For example, for TerraIncognita, the images are from distinct geographical locations; this disjointness holds for every dataset.

Furthermore, we included in our submission the “Hard” baseline requested by the reviewer: for every dataset, we select the N most incorrectly predicted OOD images (i.e., the ones that most models get wrong). Across all settings, this baseline leaves the ID-vs-OOD correlation near 0 and never results in a strong negative correlation; e.g., CXR (N selected = 10) gives R = −0.00 ± 0.00, TerraIncognita (N selected = 50) gives R = 0.18 ± 0.01, etc. (Table 3). Figure 2 visualises the same trend: misclassified sets hover around the x-axis while OODSelect can find examples with strongly negative accuracy correlations.
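A minimal sketch of this “Hard” baseline (the data layout is assumed):

```python
# Sketch of the "Hard" baseline: pick the N OOD examples that the largest number
# of models misclassify (data layout assumed).
import numpy as np

def hard_baseline(correct, n_select):
    """correct: [n_models, n_ood] 0/1 correctness matrix on the OOD pool."""
    error_rate = 1.0 - correct.mean(axis=0)      # fraction of models failing each example
    return np.argsort(-error_rate)[:n_select]    # indices of the most-misclassified examples
```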

Importantly, it is not the case that OODSelect performs “poorly” on large subsets; rather, there is a limit on how large a subset can be for the correlation. As N approaches the full OOD set, the subset correlation necessarily approaches the ID vs. full-OOD accuracy correlation (though in a non-submodular way; see Proposition 1).

Together, these analyses demonstrate that OODSelect isolates spurious rather than generally difficult examples, directly addressing the reviewer’s concern. We will clarify this point in the final version of the paper.

“...latent feature distribution of data samples using a pretrained model (e.g. CLIP) and pick OOD samples based on some distance metric threshold. ... Alternative to the correlation metric: a simple distance metric (e.g. Euclidean distance)”

Thank you for suggesting this comparison! We have added a CLIP-Euclidean baseline that, for each dataset, greedily retains the OOD examples farthest from the ID examples in CLIP space (Radford et al., 2021). Results are summarised in the table below.

We find that feature‑only selection—especially with general pretrained models—ignores the feature-label correlations that drive OOD failures. Our findings echo the recent conclusion that latent‑distance detectors often “answer the wrong question” for OOD detection [Li et al., 2025, Out‑of‑Distribution Detection Methods Answer the Wrong Questions]. We find that feature-wise selection sometimes yields an even stronger correlation between ID and OOD accuracy than random selection.

Comparison with CLIP Distance Selection

| Dataset | N for subset | R from CLIP distance selection | R from OODSelect | R Random Selection | R Full OOD |
| --- | --- | --- | --- | --- | --- |
| PACS | 10 | 0.52 | -0.34 | 0.60 | 0.84 |
| PACS | 100 | 0.67 | -0.30 | 0.66 | 0.84 |
| VLCS | 10 | 0.82 | -0.92 | 0.84 | 0.96 |
| VLCS | 100 | 0.91 | -0.89 | 0.86 | 0.96 |
| TerraIncognita | 10 | 0.24 | -0.90 | 0.47 | 0.85 |
| TerraIncognita | 500 | 0.72 | -0.93 | 0.67 | 0.85 |
| WILDSCamelyon | 10 | 0.97 | -0.89 | 0.78 | 0.99 |
| WILDSCamelyon | 1000 | 0.99 | -0.90 | 0.50 | 0.99 |
| CXR | 10 | 0.20 | -0.80 | 0.20 | 0.84 |
| CXR | 1000 | 0.83 | -0.97 | 0.13 | 0.84 |
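For reference, one way such a distance-based selection could be implemented is sketched below (distance to the ID centroid in CLIP feature space is an assumption; the baseline above may aggregate distances differently):

```python
# Sketch of a CLIP-Euclidean selection baseline: keep the OOD examples whose CLIP
# embeddings are farthest from the ID data (here, from the ID centroid).
import numpy as np

def clip_distance_select(ood_feats, id_feats, n_select):
    """ood_feats: [n_ood, d], id_feats: [n_id, d] precomputed CLIP image embeddings."""
    id_centroid = id_feats.mean(axis=0)
    dists = np.linalg.norm(ood_feats - id_centroid, axis=1)  # Euclidean distance to ID centroid
    return np.argsort(-dists)[:n_select]                     # farthest-from-ID OOD examples
```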

Are there any analyses of the OOD splits obtained from each dataset? What is their semantic coherence with the ID data?

Please see our paragraph on consistency and coherence (L186–L226), which we summarize here. Each subset size is selected via an independent optimization, so we quantify overlap (monotonicity) across sizes with the normalised Jaccard index (0 = no overlap, 1 = full overlap). Across all datasets, the index lies between 0.72 and 0.99, confirming that the subsets OODSelect identifies are both consistent and coherent. When metadata are available, we also find clear semantic structure. For the Chest X-ray dataset, for example, disease categories such as Pleural Effusion appear markedly more often in the OODSelect split than in the remainder of the OOD pool, indicating the method surfaces clinically meaningful sub-populations. Finally, we explored automated semantic explanations of the selected subsets using Dunlap et al.’s (2024) approach for “Describing differences in image sets with natural language.” The utility of this explainer proved highly variable: effective on natural-image datasets but far less reliable for non-natural images like chest X-rays or microscopy slides. Investigating more robust cross-modal explanation techniques is therefore an interesting avenue for future work.
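One plausible form of this overlap computation is sketched below (normalizing by the smaller subset so that nested subsets score 1 is an assumption about what "normalised" means here):

```python
# Sketch of the consistency check: overlap between OODSelect subsets of different
# sizes, normalized by the smaller subset so that nested subsets score 1
# (this normalization is an assumption, not necessarily the paper's definition).
def normalized_jaccard(subset_a, subset_b):
    a, b = set(subset_a), set(subset_b)
    return len(a & b) / min(len(a), len(b))
```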

Suggested related works.

We will also add a discussion of the related works you suggested. Nagarajan et al. (ICLR 2021) provide a theoretical analysis of failure modes in OOD generalization, showing how even linear models trained with gradient descent can rely on spurious features due to geometric and statistical factors. While their work explains why models fail, it does not offer tools to expose such failures at test time, as ours does. Lin et al. (ICLR 2024) demonstrate that promoting diversity in spurious feature reliance during training can improve OOD robustness. Their method is a training-time intervention, whereas OODSelect is a post-hoc evaluation tool. Deng et al. (NeurIPS 2023) introduce a progressive data expansion (PDE) curriculum that incrementally introduces spurious examples during training, improving worst-group accuracy. Our approach is complementary, enabling stress-testing of models regardless of their training strategy. Lastly, Shi et al. (ICML 2024) show that using class taxonomies and lowest common ancestor (LCA) distances restores a strong ID–OOD correlation, particularly for vision-language models. Unlike their taxonomy-based reformulation, we retain standard accuracy metrics and reveal that even within fixed OOD splits, large subsets can invert the ID–OOD trend—underscoring the limits of average-case evaluation.

We hope these clarifications and the new quantitative baseline demonstrate that OODSelect isolates genuinely spurious-correlation stress tests that elude distance heuristics, further strengthening the paper’s contributions. If we have addressed your concerns, we hope you will raise your score.

Comment

We thank reviewer aNDU for their valuable feedback and suggestions to improve our work. We have addressed your questions and implemented your suggestions in our rebuttal. We would be grateful for your follow-up on our rebuttal. Thanks.

Kind Regards,

Authors

Comment

We thank reviewer aNDU for their time and efforts in reviewing our paper. As we are approaching the extended author-reviewer discussion period, we wanted to follow up on getting your response to our rebuttal to your review. We look forward to hearing back from you. Many thanks!

Regards,

Authors

Final Decision

This paper looks at the accuracy-on-the-line phenomenon, where out-of-distribution generalization shows a strong correlation between in-distribution and out-of-distribution accuracy. The paper specifically establishes that there are semantically meaningful subsets of examples that have an inverse correlation, indicating that aggregation of these subsets causes a misleading correlation. The paper further develops a method to find such subsets, demonstrating this across a range of datasets. This is potentially an important finding that is novel and highly relevant to the important problem of understanding generalization, which the reviewers appreciated. Further, the analysis, theoretical motivation, and benchmarks were done in a rigorous manner. Reviewers had a number of questions related to simple baselines for this problem, the correlation metric used, lack of details throughout, relevance of findings to other modalities, computational complexity, and the confounding factor of architectures/models and their inductive biases. The authors provided a strong rebuttal with many results addressing the above concerns, and reviewers recommend acceptance. I agree, and believe this paper provides an interesting avenue that could open up a better understanding of OOD generalization. I recommend that the authors include many of the rebuttal elements in the final paper, and also note that the issue of confounding factors (especially architecture, but also training, etc.) is important to at least discuss in the paper.