Out-Of-Distribution Detection with Diversification (Provably)
Our theory and experiments demonstrate that training with diverse auxiliary outliers enhances OOD detection performance.
摘要
评审与讨论
The authors propose DiverseMixup, a Mixup data augmentation technique applied to auxiliary OOD data to improve the OOD detection capacities of classifiers trained with the Outlier Exposure technique. They provide a theoretical analysis justifying their approach and demonstrate the superior empirical performances of their technique.
优点
- The method appears to be novel and competitive compared to other OOD data augmentation methods while remaining quite simple and easy to implement.
- The ablation and complementary experiments are satisfying.
- The experiments answer many practical questions.
缺点
One of the main contributions of the paper is the theoretical analysis. However, there appear to be critical flaws in both the demonstrations and the hypotheses.
Major
- The quantities and are never defined. We can guess out of commonly used notation in optimization that (same for ) but this is the definition of , which is defined as a set (which is not straightforward - why would an argmin be a set in that case ?). It adds a lot of confusion, and we never know exactly what we are talking about, which is critical for a demonstration.
- In the demonstration of theorem 1, the authors define and as constants, but 1) they depend on which is supposed to be affected by the later-introduced and 2) the parts of l.530 that are replaced do not seem to match the definition of 's.
- Demonstrations of Theorems 2 and 3 seem to rely on one argument, which is: "Since , then ". I am concerned with the validity of this assumption (assuming that the definition of as sets make sense, which is not clear). As a counterexample, let's consider and (which implies that and ). Now, let's consider and such that
and
\\begin{dcases} \\epsilon_{P_\\tilde{X}}(h^*_{aux},f) = \\int_{\\mathcal{X_aux}} |h^*_{aux}(x) -f(x)|dx = \\epsilon_2 < \\epsilon_1,\\\\ \\int_{\\mathcal{X_aux} \\setminus \\mathcal{X_aux}} |h^*_{aux}(x) -f(x)|dx > \\epsilon_1 - \\epsilon_2 \\end{dcases}
where for simplicity, we omit , assuming that the behavior is similar on this input space region for and . In that case, clearly, minimizes (thanks to the inequality above that keeps sub-optimal for ) but does not minimize \\epsilon_{P_\\tilde{X}}(h,f).
Minor
- l. 111 perhaps you meant instead of ?
- l.122 is called a probability, whereas it is an expectation
- Some typos.
问题
I am puzzled because, on the one hand, the paper demonstrates strong empirical results, and the evaluation methodology is extensive and thorough, but on the other hand, I suspect that the authors' theoretical work is flawed - which does not affect the strength of the presented method but the validity of the paper. I am ready to improve my rating to acceptance if the authors prove my suspicions wrong during the rebuttal.
局限性
The authors have adequately addressed the limitations.
The detailed proof of the validity of the assumption "Since , then ".
For simplicity, we omit , assuming that the behavior is similar in this input space region for and .
We first express the expected error of hypotheses on the training data distribution and the unknown test-time data distribution as follows:
From the above expressions, we obtain:
The model is over-parameterized (line 126), which implies that our hypothesis space is large enough. Consequently, there exists the ideal hypothesis such that both and are minimized, i.e., . In this scenario, . We denote , thus .
I would like to thank the authors for their detailed and structured responses! I am satisfied with responses to W1, W2-2 (please include the details in the final manuscript following these first two remarks), 5 and 6.
I still have concerns about the rest. To start with the simpler:
4: My concern is about the fact that the sentence suggests that the support of lives in the same space as and , which cannot be true. In addition, a support can be defined for a measure, a distribution (which I assumed was implicitly defined in but after a double check, is not), or a function but not for a set. Instead, I would advise simply stating that
W2-1:
(i) How can you define as a constant if and are not and ?
(ii -iii) I am afraid that the regime where and therefore that you describe implies a "near-ideal hypothesis", where actually pretty much everything tends towards in the generalization bound 4. In other words, the problem I point out (that 's are not constant) is only alleviated in a regime where Eq. 4 makes no longer sense in practice. What do you think about that?
W3
The overparametrized case happens when the number of parameters is larger than the number of training points. In that case, the model might perfectly fit the training points (interpolate), but nothing is guaranteed about the generalization error. In addition, you define your errors with continuous integrals, for which the model is never overparametrized because integrals can be defined as the limit of a sum of terms (1 for each data point) where (Riemann definition).To obtain guarantees such as arbitrary minimization of the error, you should rely on universal approximation theorems, but they imply constraining assumptions for the underlying neural network (as a pointer see https://en.wikipedia.org/wiki/Universal_approximation_theorem and the references therein). These assumptions should be stated depending on the theorem you choose to use in your demonstration.
Thank you sincerely for your detailed response and constructive feedback. Your insights have greatly contributed to our work, and we truly appreciate your support.
We would like to further address your concerns as follows:
W4. The support of lives in the same space as and , which cannot be true. Instead, I would advise simply stating that .
We sincerely appreciate your feedback, which has drawn our attention to the lack of preciseness in that definition. Your suggestion is indeed helpful. we have decided to follow your advice and revise the definition of as follows: represent the input space of OOD data, where represents the entire input space in the open-world setting.
W2-1. (i) How can you define as a constant if and are not and ?
Thank you for your comment, we appreciate the opportunity to address your concerns as follows:
(i) is a constant, given that , we have , as a result, is a constant. Specifically, , which depends on and , where represents the unknown test-time distribution in the open-world and does not change throughout our analysis. Similarly, is a predefined hypothesis space that is fixed. Consequently, is a constant. As we derived in public comment (1), . Considering , we can conclude that is a constant.
(ii) We recognize that our current presentation of may have led to some misunderstanding. Our intention in introducing in Theorem 1 was to unify the small values and . For coherence in our derivation, we use the definition of directly. We appreciate your feedback and acknowledge that our current presentation could be clearer.
(iii) We have modified this part of the derivation in the revised version to enhance clarity. Specifically, after proving and clearly stating that is a constant, we have directly defined . This modification should make our reasoning more transparent and easier to follow.
W2-1(ii-iii) I am afraid that the regime where and therefore that you describe implies a "near-ideal hypothesis", where actually pretty much everything tends towards 0 in the generalization bound 4. In other words, the problem I point out (that 's are not constant) is only alleviated in a regime where Eq. 4 makes no longer sense in practice.
We sincerely appreciate your feedback. We would like to address your concerns by first explaining the rationale behind our assumptions and then discussing the impact on Eq.4 in practice.
(i) We assume the existence of an ideal hypothesis within the hypothesis space such that . According to universal approximation theorems, this condition can be met when the depth or width of deep neural networks satisfies certain conditions. Specifically, under these conditions, the model becomes a universal approximator, implying the existence of such that , leading to .
(ii) In practical scenarios, Eq. 4 represents an upper bound on the generalization error of the learned hypothesis . Moreover, when , each term in Eq. 4 retains its practical significance. To illustrate this, let us revisit Eq. 4:
where the empirical error term is minimized through optimization, the reducible error quantifies how closely approximates and the distribution shift error captures the discrepancy between training and test data distributions. These components contribute significantly to the error upper bound. As , only the term (related to the ideal error) approaches zero, while the other terms remain relevant and unaffected.
W3. The overparametrized case happens when the number of parameters is larger than the number of training points. In that case, the model might perfectly fit the training points, but nothing is guaranteed about the generalization error. In addition, you define your errors with continuous integrals, for which the model is never overparametrized because integrals can be defined as the limit of a sum of terms (1 for each data point) where (Riemann definition). To obtain guarantees such as arbitrary minimization of the error, you should rely on universal approximation theorems, but they imply constraining assumptions for the underlying neural network. These assumptions should be stated depending on the theorem you choose to use in your demonstration.
We sincerely appreciate your detailed response, thorough explanations, and constructive suggestions. Your input has significantly enhanced the theoretical rigor of our paper.
(i) We have acknowledged our misunderstanding regarding the overparameterized case. To obtain guarantees such as the arbitrary minimization of error, we should indeed rely on universal approximation theorems.
(ii) We are now making specific assumptions about the underlying neural network to strengthen our proof. Specifically, based on the paper [1], we assume our model is a fully-connected ReLU network with a width of (m+4), which can approximate any Lebesgue-integrable function from to with arbitrary accuracy with respect to distance. In this case, there exists an ideal hypothesis that minimizes both and simultaneously, i.e., , thus .
[1] Universal Approximation Theorem for Width-Bounded ReLU Networks.
Finally, we would like to thank you again for your thorough review of our paper, your detailed feedback, and your constructive suggestions. They have significantly improved the quality of our work.
Thank you for your perseverance and your honesty. I have final minor remarks that I do not expect the authors to respond to, but simply to take into account.
- Suggestion about the definition of : if you no longer use and directly bound it with (which is a cleaver way of alleviating concerns about not being a constant) I would recommend directly defining rather than for better readability.
- (Important) Assumptions for Universal Approximation Theorem: Make sure to clearly state the assumptions in the main paper in the Theorem, not only in the proof.
The authors have constructively engaged in a fruitful scientific discussion during this rebuttal, which has improved the quality of their manuscript and allowed to fix its initial theoretical imprecisions. As a result, I am happy to raise my score to acceptance.
Thank you again for your valuable input and for increasing your rating. We will incorporate your valuable suggestions in the revised version to further improve the quality of our paper.
The detailed proof from line 530 to the definition of .
Let's first review line 530 on the demonstration of theorem 1:
We denote , as the error of and on and , respectively.
In other word, for any , . For any , .
As a result, , .
Considering that (mentioned in line 126), for any , is holds. As a result, . We have:
We denote , so:
Proof of the negligible effect of on .
Specifically, and represent the error of the ideal hypothesis on the unknown test-time data distribution and the training data distribution , respectively. We can derive that . Moreover, given that our model is over-parameterized, we have a sufficiently large hypothesis space to include a near-ideal hypothesis such that is sufficiently small, i.e., . As a result, we can conclude that . We have put the detailed proof in the official comment.
The detailed proof as follows:
Considering that and represent the error of the ideal hypothesis on the unknown test-time data distribution and the training data distribution , respectively, we have:
In our setting, the model is over-parameterized, meaning we have a sufficiently large hypothesis space to include a near-ideal hypothesis such that is sufficiently small. Therefore, we can denote . Given that , . Thus, and are both negligible, and we use a small value to unify them in Theorem 1.
We thank the reviewer for recognizing our novel method and satisfying experiments. We appreciate your support and constructive suggestions and address your concerns as follows.
W1. The quantities and are never defined. Why would an argmin be a set in that case?
Thank you for your valuable comment. We agree that clearer definitions of and would improve our manuscript. Our choice to define the argmin as a set was purposeful and well-founded and we appreciate to clarify this as follow:
1) The definition of quantities and .
and is defined as the element in and , respectively, which can be denoted as: , where and refers to the sets of ideal hypotheses on the training data distribution and test-time data distribution . We will add the definition in manuscript.
2) Why would an argmin be a set?
(i) Our setting requires us to represent the optimal hypothesis as a set. In our setting, contains all hypotheses optimal for the training data distribution. However, these hypotheses may perform inconsistently on the test-time data distribution, necessitating the use of a set rather than a single .
(ii) The optimization problem behind deep neural networks is highly non-convex and the optimal solution is not unique [1] [2]. Therefore, defining argmin as a set offers generality.
(iii) Representing argmin as a set is not unprecedented in the field. For example, the theoretical analysis in [3] also employs this set-based representation.
W2. The authors define and as constants, but 1) they depend on which is supposed to be affected by the later-introduced and 2) the parts of line 530 that are replaced do not seem to match the definition of 's.
Sincerely thank you for your thorough review. We realize our derivation omitted some explanations, leading to misunderstandings. We're happy to clarify in detail as follow:
1) Clarification of and .
(i) We do not define and as constants. This misunderstanding may have arisen from our definition of as a constant in Theorem 1. We will clarify this in the revised paper.
(ii) is not affected by . depends on the unknown test data distribution . The later-introduced only affects the training data distribution, thus is unaffected by .
(iii) While is indeed influenced by , we can prove that is bounded by a very small constant. Therefore, in subsequent analyses, the effect of on is negligible. We have put the detailed discussion and proof in the official comment (1).
2) The parts of l.530 that are replaced do not seem to match the definition of 's.
Upon careful examination, we believe the derivation is accurate. However, we recognize that the presentation could be enhanced for clarity. We have provided a detailed derivation in the official comment (2).
W3. Demonstrations of Theorems 2 and 3 seem to rely on one argument, which is: "Since , then " I am concerned with the validity of this assumption.
Thank you for your valuable feedback and we greatly appreciate your attention to detail. We illustrate the validity of this assumption as follow:
(i) The assumption is reasonable in the over-parameterized setting. Considering that we can decomposing the ideal error on into two components: one for and another for the remaining OOD data . Considering that the model is over-parameterized (line 126), there exists the ideal hypothesis minimizing both error simultaneously. In other word, is the intersection of the sets of optimal hypothesis for auxiliary and remaining data, thereby . We have provided a detailed derivation in the official comment (3).
(ii) Discussion of the Counterexample. The counterexample suggests that the model's optimal solution in may not be optimal in because it balances performance in . This assumes there is no hypothesis that achieves optimal performance in both and . However, in over-parameterized setting, there exists the ideal hypothesis that is optimal in both and . Therefore, the counterexample does not hold in our setting.
4. line 111 perhaps you meant instead of ?
5. line 122 is called a probability, whereas it is an expectation.
6. Some typos
Thank you for your valuable suggestions, we have addressed your concerns as follows:
(i) We would like to clarify that is outside the support of . In other words, is within the support of .
(ii) We acknowledge the error on line 122 and have corrected it.
(iii) We have thoroughly reviewed and revised all representation issues throughout the paper.
References:
[1] The Loss Surface of Deep and Wide Neural Networks.
[2] On the Quality of the Initial Basin in Overspecified Neural Networks.
[3] Agree To Disagree: Diversity Through Dis-Agreement For Better Transferability.
The paper proposed Diversity-induced Mixup for OOD detection (diverseMix), which enhances the diversity of auxiliary outlier set for training in an efficient way.
优点
-
The paper is written well and is easy to understand.
-
The studied problem is very important.
-
The results seem to outperform state-of-the-art.
缺点
- My biggest concern is that there are already some papers that theoretically analyze the effect of auxiliary outliers and proposed some complimentary algorithms based on mixup, such as [1], [2] and [3]. It might be useful to clarify the differences.
[1] Out-of-distribution Detection with Implicit Outlier Transformation [2] Learning to augment distributions for out-of-distribution detection [3] Diversified outlier exposure for out-of-distribution detection via informative extrapolation
问题
see above
局限性
n/a
Thank you for your reviews. We are encouraged that you appreciate our studied problem and state-of-the-art result. We address your concerns as follows.
W1. My biggest concern is that there are already some papers that theoretically analyze the effect of auxiliary outliers and proposed some complimentary algorithms based on mixup, such as [1], [2] and [3]. It might be useful to clarify the differences.
Thanks for mentioning [1] [2] [3]. We first review these related works and then clarify the differences from the perspectives of motivations, techniques, and theory. Additionally, we have incorporated these related works into the manuscript.
(i) Reviews of related works.
DOE [1] proposes a novel and effective approach for improving OOD detection performance by implicitly synthesizing virtual outliers via model perturbation.
DAL [2] introduces a novel and effective framework for learning from the worst cases in the Wasserstein ball to enhance OOD detection performance.
DivOE [3] is an innovative and effective method for enhancing OOD detection performance by using informative extrapolation to generate new and informative outliers.
(ii) Differences in motivations. DiverseMix has a different motivation from DOE, DAL and DivOE. Our DiverseMix is proposed for enhancing the diversity of auxiliary outliers set through semantic-level interpolation to enhance OOD detection performance. In comparison, DOE focuses on improving the generalization of the original outlier exposure by exploring model-level perturbation. DAL focus on crafting an OOD distribution set in a Wasserstein ball centered on the auxiliary OOD distribution to alleviating the distribution discrepancy between auxiliary outliers and unseen OOD data. DivOE focuses on extrapolating auxiliary outliers to generate new informative outliers for enhance OOD detection performance.
(iii) Differences in technique. Our method, DiverseMix, has a different algorithmic design compared to others. Specifically, DiverseMix adaptively generates interpolation strategies based on outliers to create new outliers. In contrast, DOE, DAL, and DivOE primarily rely on adding perturbations. Specifically, DOE, DAL, and DivOE apply perturbations at the model level, feature level, and sample level, respectively, to mitigate the OOD distribution discrepancy issue.
(iv) Differences in theory. We prove that a more diverse set of auxiliary outliers could improve the detection capacity from the generalization perspective, and this theoretical insight inspired our method DiverseMix. We also provide insightful theoretical analysis verifying the superiority of DiverseMix. In comparison, DOE revealing that model perturbation leads to data transformation to enhance the generalization of OOD detector. DAL finds that the distribution discrepancy between the auxiliary and the real OOD data affecting the OOD detection performance. DivOE is based on the perspective of sample complexity to demonstrate its effectiveness.
References:
[1] Out-of-distribution Detection with Implicit Outlier Transformation.
[2] Learning to augment distributions for out-of-distribution detection.
[3] Diversified outlier exposure for out-of-distribution detection via informative extrapolation.
Thanks for the clear response on my concerns and questions. All the things has been resolved, so I increase my score to 6. Thanks!
Thank you so much for the valuable comments and increasing your rating. Thanks!
This study aims to explore the reasons behind the effectiveness of out-of-distribution (OOD) regularization methods by linking the auxiliary OOD dataset to generalizability. The researchers show that the variety within the auxiliary OOD datasets significantly influences the performance of OOD detectors. Moreover, they introduce a straightforward approach named diverseMix which is designed to enhance the diversity of the auxiliary dataset used for OOD regularization.
优点
Strengths:
- The paper is well-composed and presents an extensive array of experiments across multiple OOD detection benchmarks.
- This study offers important insights into the significance of auxiliary datasets in OOD regularization, addressing a frequently neglected aspect of OOD regularization techniques.
- The authors provide a robust theoretical foundation for diverseMix. Additionally, they show strong empirical evaluations which further highlight the effectiveness of diverseMix across a range of OOD experiments.
缺点
Weakness:
- The reviewer has some concerns regarding the empirical evaluations of diverseMix. In particular, the choice of how the ImageNet-1k is split into ImageNet-200 as ID while the remaining data is leveraged as OOD seems arbitrary.
问题
The primary question of the reviewer is why not leverage the entire ImageNet-1k dataset for ID whilst leveraging other unlabeled datasets for the auxiliary data.
局限性
The authors have adequately addressed any potential negative societal impacts of this work, and the reviewer does not anticipate any such negative impacts.
We sincerely thank the reviewer for your valuable comments and appreciate your recognition of the effective method as well as sufficient theoretical guarantees. We provide detailed responses to the constructive comments.
W1. The reviewer has some concerns regarding the empirical evaluations of diverseMix. In particular, the choice of how the ImageNet-1k is split into ImageNet-200 as ID while the remaining data is leveraged as OOD seems arbitrary.
Thank you for raising this important question. We sincerely appreciate your attention to the details of our experiments and pleasure to provide further clarification.
(i) Our splitting strategy was not arbitrary but carefully considered. We randomly selected 200 classes from ImageNet-1K as the ID categories, while the remaining 800 classes were used as OOD categories. This setting ensures the validity of our experiments.
(ii) This experiment setting is consistency with prior work. We followed the benchmark [1] to set ImageNet-200 as the ID dataset for our experiments.
Q1. The primary question of the reviewer is why not leverage the entire ImageNet-1k dataset for ID whilst leveraging other unlabeled datasets for the auxiliary data?
Thanks for the constructive suggestions. We are grateful for this insight and have conducted the additional experiments as suggested. Specifically, we used the entire ImageNet-1K dataset as the ID dataset and employed the SSB-hard dataset as auxiliary outliers. The experimental results are shown below, with columns representing different OOD datasets. The values in the table are presented in the format (FPR / AUROC ). From the experimental results, it is evident that our method remains effective even when ImageNet-1K is used as the ID dataset.
| Method | dtd | iNaturalist | ninco | average |
|---|---|---|---|---|
| OE | 73.90/76.82 | 49.33/89.53 | 76.03/80.52 | 66.42/82.29 |
| Energy | 69.80/82.56 | 74.40/85.58 | 81.66/77.32 | 75.29/81.82 |
| Mixoe | 69.48/78.07 | 46.61/89.72 | 74.17/80.79 | 63.42/82.86 |
| Ours | 68.17/78.69 | 42.71/90.98 | 73.29/81.27 | 61.39/83.65 |
References:
[1] OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection.
Dear reviewers, please read and respond to authors' rebuttal (if you haven't done so). Thanks.
Your AC
This submission received three ratings (6, 6 and 7), averaging 6.33, which is above the acceptance. After rebuttal, all reviewers' concerns have been well addressed, especially about the experiments, difference from previous works and the clarity of the theoretical notations and proof. After carefully checking the response of all reviewers, I suggest the acceptance. Hope the authors carefully follow the reviewers' advice to improve the submission finally.