Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection
We show that recent methods for OOD detection over-specialize on novel classes and perform unreliably on other types of shifts. We release a benchmark to study broad OOD detection and propose an ensemble method to alleviate current issues.
Abstract
Reviews and Discussion
The paper addresses the problem of OOD detection in deployed machine learning systems. The authors analyze existing OOD detectors, identifying a common limitation in their adaptability to diverse distribution shifts. To address this, they propose a new benchmark named BROAD, designed to evaluate OOD detectors across a wide spectrum of distribution shifts. This benchmark examines various scenarios, including novel classes, adversarial perturbations, synthetic images, corruptions, and multi-class inputs. Additionally, the authors introduce an approach that leverages an ensemble of reliable OOD detectors combined with a GMM.
Strengths
The paper explains the existing limitations of current OOD detectors, providing a clear and compelling rationale for advancement in this domain. The paper's main strength lies in the introduction of the BROAD benchmark, which establishes a robust evaluation framework. This benchmark not only rigorously assesses OOD detector performance but also offers a critical insight into their ability to extend beyond the scope of novel classes. Through meticulous experimentation and analysis, the authors provide compelling evidence of OOD detector capabilities and potential areas of improvement. This well-constructed evaluation framework significantly contributes to the depth and reliability of the research findings.
Weaknesses
The paper introduces an ensemble method for broad OOD detection; however, there are notable weaknesses. Its inefficiency and limited scalability raise questions about its practicality in real-world applications. Moreover, the paper lacks a clear roadmap or forward-looking guidance for the broader OOD detection community on how to effectively approach the challenges posed by the BROAD benchmark. While the intent to introduce a method is appreciated, the proposed method is clearly not the ideal solution to this intricate problem, as it does not offer new ideas or new directions to the OOD community. A more comprehensive analysis and discussion of potential future directions and steps for advancing OOD detection would greatly enhance the paper's overall impact and utility to the research community. This would provide valuable insights for researchers looking to build upon this work and make meaningful strides in the field of OOD detection.
Questions
See weaknesses.
We thank the referee for their careful consideration and feedback.
Regarding efficiency and scalability, is the reviewer referring to the computational overhead of the ensemble method? When inference cost is an issue, the fast version, ENS-F, only incurs a 25% overhead compared to inference without detection, while still performing competitively. Such an overhead is rather low in comparison to recent work, including acclaimed methods such as ViM and ODIN, which lead to overheads larger than 100% (see Table 6). While simple baselines such as MSP, max logits, and logits norm incur a negligible overhead, most recently published methods have comparable or higher overheads than ENS-F. ENS-R and ENS-V are indeed inefficient, but may still be adequate in scenarios where inference cost is not a significant concern.
We hope our work will encourage the community to stop over-focusing on novel class detection and instead consider the broader needs of real-world applications. Ideally, novel scores will emerge that are better suited to broad OOD detection. Such scores may not be SotA on novel class detection, but may achieve significant improvements in broad OOD detection, and would not have attracted attention without the correct evaluations. However, finding a single method that can efficiently discriminate a large variety of distribution shifts, each with its own unique properties, may prove difficult. In that case, we believe ensemble methods may remain an efficient way to address this complex and multi-faceted problem. We admit this is not the most satisfying solution, but it has significant merits, and might currently be the most adequate approach for real-world usage in many settings.
Thanks for your response. I have read the authors' responses and all other reviews, but I am keeping my rating the same.
This paper proposes an OOD benchmark comprising five different types of distribution shift. The results show that the performance of OOD detection methods is not consistent across different types of distribution shift. The paper proposes a method that ensembles different OOD detection methods to achieve consistent performance across different types of distribution shift.
Strengths
1). The assessment of broader OOD detection capabilities is interesting and probably important for the development of future OOD detection methods.
2). Extensive experiments have been done to benchmark the recent OOD detection methods.
3). Overall, the paper is clear and well-written.
Weaknesses
My concerns are mainly about the proposed method.
1). The ensemble of OOD detection methods seems ad hoc for this benchmark, as it evaluates and picks some of the methods that perform relatively well on the benchmark.
2). The proposed method of fitting a GMM over scores from different OOD detection methods does not make sense to me. For example, in Sec. 3, it says "this approach is adept at identifying atypical realizations of the underlying scores, even in situations where the marginal likelihood of each score is high, but their joint likelihood is low." It would be weird if most in-distribution samples could not achieve a high likelihood on each score while out-of-distribution samples could, since all these methods aim at measuring whether the sample is in-distribution. In other words, what is the advantage of fitting a GMM over taking the average of the different scores?
3). The time complexity and scalability of the proposed method are still of concern. The results show that the overhead of the proposed ENS-F is acceptable at 25% of the time of a normal inference. However, compared to the methods it ensembles (e.g., MSP takes 1% additional time), this overhead is extremely high.
Questions
1). The proposed method uses a validation set to fit the GMM; does this affect the performance? With the use of additional data, is it fair to compare the proposed method with other baseline methods?
2). Does the proposed ensemble method outperform simple ensemble methods such as taking the average of the scores or the largest score? If so, why would it be?
We thank the referee for their careful consideration and feedback.
Weakness 1: As stated on page 7, at the end of Section 3, we do not assume access to OOD samples to pick which scores to use for ensembling, precisely to avoid the issue mentioned by the referee. Instead, we describe the selection process for the different ensemble versions, which is based only on in-distribution ImageNet validation inputs and is thus independent of performance on BROAD. We could likely achieve better performance by picking scores based on BROAD, but refrained from doing so, in agreement with the referee's remark.
Weakness 2: Overall, generative modeling with a GMM is better suited because it models exactly the quantity of interest: how likely is it to observe the given set of scores, assuming the sample is in-distribution? The heuristic suggested by the referee, in contrast, discards valuable information.
While the referee says "these methods aim at measuring if the sample is in-distribution", and that is indeed their claimed goal, our work precisely establishes that, due to the standardization of evaluation benchmarks in the literature, these methods instead only discriminate between in-distribution samples and novel classes. Their behavior on other types of distribution shifts is a priori uncertain. Some methods have below-random AUC performance for certain distribution shifts (see Table 2), which demonstrates that a lower score does not always mean a higher chance of being in-distribution (instead, it means a higher chance of not being a novel class).
Moreover, the sentence cited by the referee simply points out that generative modeling is able to capture information in the joint distribution that is not present in the marginal distributions. For instance, EBO and MaxLogits are heavily correlated in-distribution (see Figure 3). Thus, even when the scores of EBO and MaxLogits are both within a reasonable range, a lower-than-usual score for EBO combined with a higher-than-usual score for MaxLogits has low joint density, which would not be captured by the marginal probabilities.
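To make this concrete, here is a small numerical sketch (the correlation value, the test point, and the use of scipy are purely illustrative assumptions, not values from our experiments): a sample can look ordinary under each marginal while being highly unlikely under the joint density of two correlated scores.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Two hypothetical in-distribution scores that are strongly correlated
# (standing in for, e.g., EBO and MaxLogits), each standardized to N(0, 1).
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.95],
                [0.95, 1.0]])
joint = multivariate_normal(mean=mean, cov=cov)

# A test point whose scores are each within one standard deviation of their
# marginal means, but which breaks the usual correlation: score 1 is
# lower than usual while score 2 is higher than usual.
x = np.array([-1.0, 1.0])

marginal_density = norm.pdf(x[0]) * norm.pdf(x[1])  # treats scores as independent
joint_density = joint.pdf(x)                        # accounts for their correlation

print(f"product of marginals: {marginal_density:.4f}")  # ~0.059, looks ordinary
print(f"joint density:        {joint_density:.2e}")     # ~1e-9, clearly atypical
```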
Additionally, nothing indicates that the distribution of scores must be unimodal. As clearly shown in Table 7 (appendix), using a mixture of several Gaussians performs better than a single Gaussian. An average of scores might be ill-suited if the distribution of each score is multimodal.
Finally, even if we rescale scores to balance their variances, depending on the functional form of their distributions they may not trigger false positives at the same frequency, and redundant scores would receive exaggerated weight.
Weakness 3: As the referee acknowledged, a 25% overhead is acceptable. While some methods indeed have faster inference, half of the methods have comparable – or even much higher – overhead. Many acclaimed recent works such as ViM and ODIN have overheads larger than 100% and were still found to be of high interest to the community. Besides, several of the fastest methods (MSP, max logits, and logits norm) are simple baselines commonly used in the literature and unrepresentative of the typical overhead of published methods. We believe ENS-F has, in fact, a smaller overhead than most methods introduced in the past 3 years, and it should thus not be considered a limitation.
Q1: It is important to note that the training of the GMM uses 45,000 in-distribution validation samples from the validation set of ImageNet (the comparison would indeed be unfair if our ensemble had access to some OOD samples). First, it is unlikely that the use of 45,000 additional images, compared to the 1.2 million samples in ImageNet 2012's training set, would make a significant difference. Most importantly, it is common for methods in the literature to use validation data for various purposes, such as tuning (CADet), learning parameters (MDS), or calibrating (ViM). Since we use the same validation data as most methods we compare to, we believe the comparison is fair. If anything, it can introduce a slight overfitting of the underlying scores to the validation data on which the GMM is trained, hindering performance, but we find that the GMM still largely outperforms the underlying methods.
Q2: We did not evaluate the heuristics mentioned by the referee. Firstly, because a generative approach seemed better suited for the reasons mentioned in the response to Weakness 2. Secondly, because we wanted to refrain from picking the ensemble method based on BROAD performance, which would risk overfitting our specific benchmark and may lead to poor generalization on unseen OOD datasets.
I thank the authors for addressing my questions. The proposed broader OOD detection benchmark is interesting and could be important. However, I have concerns over the proposed ensemble method and I think this paper lacks deeper insights. By mentioning other simple heuristic methods such as taking the average, my key point is that using a GMM is also a heuristic choice. Why does the distribution of each score have different modes? How do the differences between detection methods affect the performance on the broader OOD detection problem? Without answers to these questions, I think that simply reporting that the ensemble of different methods achieves better performance is not enough and cannot guarantee good performance on the broader OOD detection problem. Without further insights, in my opinion, the proposed method still seems ad hoc for this specific benchmark. I understand that the choice of detection methods is based on the results on an in-distribution validation set. Yet, it is a choice made within a small range of detection methods.
Overall, I think the problem of broader OOD detection is interesting, but more insights are required. I will keep my original rating unchanged.
The paper proposes a new visual OOD detection benchmark consisting of 5 types of distribution shift: (1) novel classes, (2) adversarial perturbations, (3) synthetic images, (4) corruptions, and (5) images with multiple objects. The paper further evaluates the performance of various OOD detection methods (that do not require training/fine-tuning) and observes that the performance is inconsistent across different types of distribution shifts. Lastly, the authors propose to ensemble various scores with Gaussian mixture models, which demonstrates better performance.
Strengths
- The overall organization of the paper is clear and easy to follow.
- The proposed ensembling method is straightforward and demonstrates good performance.
Weaknesses
- The major weakness of the paper is that most OOD detection scores considered in the paper are proposed to only handle novel classes. Expecting such OOD scores to detect adversarial perturbations and corruptions may be out-of-scope and unrealistic.
In particular, recent work [1] has demonstrated that when OOD samples are not involved during training (the setting considered in this work), it can be theoretically impossible for common OOD detection methods to work. While detecting adversarial perturbations and corruptions is an interesting task, directly applying post-hoc OOD detection scores is ill-justified and the failure is expected.
- In the multi-label scenario, it seems more reasonable to use object detection models instead of classification models. The failure of OOD detection based on classification models is expected. It would be more interesting to see the performance of OOD detection given bounding boxes.
Questions
Method:
- Can authors justify in theory or principle why OOD detection methods are suitable for detecting adversarial perturbations and corruptions?
Experiments:
- Can authors provide further OOD detection results with object detection models for the multi-label case?
[1] Fang et al., Is Out-of-Distribution Detection Learnable?, NeurIPS 2022
We thank the referee for their careful consideration and feedback.
Weaknesses
We agree that the OOD detection scores considered in the paper are only suited for novel class detection (though some of them generalize well to certain other types of shifts). In fact, that is exactly our point: while described as "OOD detection" methods, these are, in fact, "novel class detection" methods. While they are virtually always justified in the literature by the need to address unexpected test-time inputs, they are in fact only evaluated on novel class benchmarks. There is thus a discrepancy between the role these methods are supposed to play and what they are actually evaluated on and suitable for. Our work proposes to fix this issue by introducing a more comprehensive evaluation benchmark. The fact that these methods generalize unreliably to other types of shifts is indeed not surprising, since they were not evaluated on them in their original papers. However, it demonstrates a misalignment between the literature and the needs of real-world systems, where a large variety of unexpected inputs can occur. Existing "OOD detection" methods fail to detect certain types of OOD inputs encountered in real-world settings, as demonstrated by our work. In order to move toward useful systems, we need to move toward a more realistic definition of OOD detection and understand how existing methods perform in such a setting, which is exactly what our submission does. Thus we believe it is not out-of-scope to evaluate OOD detection methods on distribution shifts they were not designed for – but that occur in the type of problem they aim to address.
[1] is a very interesting work that establishes that it is impossible to discriminate the training distribution from all possible distribution shifts. However, this does not mean that it is impossible to discriminate it from the specific distribution shifts encountered in practice, which are only a small subset of all possible shifts. In fact, the performance of our ensemble method formally demonstrates that, at least to a reasonable extent, it is possible to simultaneously detect all the distribution shifts considered in BROAD. We argue there is a reasonable middle point between only considering detection of one specific and limited type of shift, and trying to discriminate against all possible (in the mathematical sense) shifts. That middle point should be to attempt detection of the shifts reasonably encountered in practice, and our work shows that, though difficult, it is not infeasible.
Indeed, using an object detection method would be much better suited for cocomagenet samples, which in that case would not even be considered OOD. But the principle of OOD detection is to protect against input types that are not expected, thus we cannot assume a priori knowledge of these input types to select a better suited model or paradigm. There are many classification models deployed in the real world that are designed to handle single-class inputs. Due to unexpected use, they may occasionally be presented with inputs that, in fact, contain several training classes. Since such cases are uncommon and outside of normal usage, it is often not reasonable to adopt an object detection framework just to mitigate this eventuality. This is connected to the general discrepancy noted by our work between real-world needs and the literature: the standardization of evaluation benchmarks has led recent OOD detection methods to only consider expected OOD inputs, although their point is precisely to detect inputs that are not expected.
Q1: Some of these methods were designed for simultaneous detection of adversarial samples (e.g., CADet), while others have been found in prior work to be powerful baselines for detecting adversarial inputs (e.g., ODIN and MDS). However, as argued above, our point is precisely to demonstrate that existing OOD detection methods are in general not suitable for detecting other shifts encountered in practice.
Q2: We are not sure what would be the grounds for detecting multi-label inputs if the deployed system is tailored to and trained on multi-label inputs. In that case being multi-label would not correspond to a different distribution.
I am grateful to the authors for their additional explanations regarding the motivation behind this work. The empirical evaluations presented do shed some light, but the significance of the majority of the experiments is still not clear to me. This is because many of the OOD methods evaluated in this study are not designed to handle certain types of OOD inputs, such as adversarial inputs. The observed failures and inconsistencies are well expected without empirical verification. In light of these considerations, I find it challenging to endorse the acceptance of this work in its current state.
This paper reviews Out of Distribution (OOD) Detection, in the sense of samples seen in the real world that were not covered in the training data. Two approaches are
- to create robust systems, designed to not degrade on OOD data, and
- to flag samples uncharacteristic of the training data. Distribution shift detection appears more practical, but can be fooled by the multiple ways OOD can occur, for such diverse reasons (in images) as novel classes, adversarial attacks, synthetic images, corruptions, and multiple labels.
The paper introduces
- a new OOD benchmark with 12 datasets representing the various OOD reasons
- benchmarking of a variety of existing methods published in the last decade,
- a demonstration of a Gaussian mixture model over an ensemble of existing methods, with significant gains over existing methods.
Strengths
The paper offers a comprehensive view of why image recognition models may fail on images that have not been seen during training, and by a comprehensive set of tests demonstrates the relative value of existing detection methods when applied to tasks they were intended for, and for other OOD tasks that are related, but not explicitly targeted by the existing method. Of note is the interesting comment on how to build OOD detectors with generative models, by use of a function they designate as "h(x)" that so to speak "sees inside" the generative network, offering an extended feature set for detection.
The paper introduces an OOD classifier that combines existing methods via a Gaussian mixture model, achieving better coverage and an average AUC score superior to existing methods.
Weaknesses
There isn't sufficient detail in the paper to reconstruct the Gaussian mixture models (GMMs) proposed by the authors. GMMs are conventionally used to estimate density functions for oddly-shaped distributions, e.g. with multiple modes. It is intuitive, in fact not unexpected, that an ensemble of detectors has better performance on average than any individual detector, so the novelty of this finding is limited. However, the results from the paper are not reproducible from the paper's contents. Given the scores, how is the GMM density learnt? In what sense is this an ensemble? How does this generate a classification and thus an AUC?
One gets the sense that this work would find a better audience in a more engineering-oriented conference where testing comparisons of performance were the primary interest, and algorithmic aspects were not.
Questions
If the construction and output of the GMM classifier is actually revealed in the paper and this review overlooked it, please explain.
We thank the referee for their careful consideration and feedback.
Reproducibility: As indicated in the abstract, we will release the full code to reproduce our experiments for the camera-ready. In terms of describing how the GMM is trained, as stated on page 7, "we propose the learning of a Gaussian mixture of the scores computed by existing OOD detection methods". Thus, given a set of scores S_1, …, S_m (the selection of which is discussed in detail), we simply train a GMM to learn the density of the score tuples. Given an image training set \{X_1, …, X_n\}, it means the GMM is trained on \{(S_1(X_1), S_2(X_1), ..., S_m(X_1)), …, (S_1(X_n), S_2(X_n), …, S_m(X_n))\}. If the reviewer feels that such a process is not naturally implied by the above sentence, we can add a paragraph laying out these specifications. There are two additional details we indeed forgot to specify, which should be fairly straightforward: we train the GMM using the EM algorithm, as is typically done, and given a test sample x, we use the negative likelihood of (S_1(x), …, S_m(x)) under the GMM as the detection score. That is, we detect as OOD the samples for which the learned density of the tuple of scores is below a threshold. For completeness, we will explicitly mention these two parts of the detection pipeline.
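For concreteness, here is a minimal sketch of this pipeline, assuming scikit-learn's GaussianMixture as the implementation; the number of components, the helper names, and the specific scores stacked are illustrative placeholders, not our released code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_score_gmm(scores_val, n_components=4):
    """Fit a GMM (via EM, as sklearn does) on tuples of detection scores.

    scores_val: array of shape (n_samples, m), where column j holds the
    j-th underlying OOD score S_j evaluated on in-distribution images.
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(scores_val)
    return gmm

def ood_score(gmm, scores_test):
    """Negative log-likelihood under the fitted GMM: higher means more OOD."""
    return -gmm.score_samples(scores_test)

# Hypothetical usage: the score matrices would be built by stacking the
# underlying detectors' outputs, e.g. np.stack([msp, max_logits, ...], axis=1).
# gmm = fit_score_gmm(scores_val)
# is_ood = ood_score(gmm, scores_test) > threshold
```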
Novelty and significance: We agree with the referee that an ensemble method outperforming the underlying learners is nothing surprising and does not by itself warrant publication. The main point here is not just that the GMM significantly increases average performance, but that it achieves a level of consistency across distribution shift types that individual methods do not have. The main contributions of our work are to establish that so-called "OOD detection" methods in the literature are in fact "novel class detection" methods, and over-specialize to that specific type of distribution shift while behaving unreliably on general OOD detection. We propose a novel benchmark to evaluate broad OOD detection, hoping it will encourage the design of more general methods that will perform reliably in the real world. In the meantime, the performance of our GMM approach indicates that, by combining the respective strengths of existing scores, it is able to attain a level of reliability across distribution shifts unmatched by the baselines.
Dear Reviewers,
The authors have responded to the reviews. Please read over the responses and take this opportunity to clarify any further issues before the end of the discussion period today (Nov 22 AOE).
Thanks, AC
This paper discusses the limitations of out-of-distribution detection algorithms, arguing that they should be able to handle a range of shifts. These limitations are demonstrated through an extensive benchmark on five different types of distribution shifts and with recent OOD methods; the paper finds that OOD methods do not perform consistently well across the different types of shift. The paper then proposes an ensemble approach using GMMs as a method that can handle a range of shifts, outperforming current OOD methods.
Reviewers appreciated the extensive benchmarking and generally well-written paper, though they had concerns on the proposed method for broad OOD detection (e.g. efficiency and motivation). The authors provided responses which were acknowledged by reviewers, but they did not raise their scores. During the discussion reviewers agreed that the paper would be better not including the proposed ensemble method, and rather focusing exclusively on the proposed benchmark, which they thought was valuable. One reviewer pointed out recent work considering "broader" OOD detection [1] and highlighted the more nuanced discussion required surrounding the definition of OOD in terms of the intended use-case. Most reviewers leaned towards rejection at the end of the discussion.
This is a somewhat borderline paper; the AC agrees with the reviewers that the benchmarking study is valuable, and the paper would be of much greater impact if it focused on benchmarking, rather than trying to propose a new method. The space saved from excluding the method can be used to provide further analysis and insight into behavior of the various algorithms in different shifts and to chart potential research directions going forward. In addition, including a more detailed discussion regarding why the definition of OOD should be broadened and the feasibility of doing so (e.g. in reference to [2]) would be helpful (as provided in the authors' responses). As such, the AC recommends revising the paper accordingly and resubmitting to a future venue.
[1] Bai et al., Feed Two Birds with One Scone: Exploiting Wild Data for Both Out-of-Distribution Generalization and Detection, ICML 2023
[2] Fang et al., Is Out-of-Distribution Detection Learnable?, NeurIPS 2022
Why Not a Higher Score
- Proposed GMM method is not well-motivated and seems ad-hoc
Why Not a Lower Score
N/A
Reject