PaperHub
Overall rating: 6.3 / 10 · Poster · 3 reviewers
Individual ratings: 7, 6, 6 (min 6, max 7, std 0.5)
Confidence: 4.0 · Soundness: 3.3 · Contribution: 2.7 · Presentation: 3.0
NeurIPS 2024

Out-Of-Distribution Detection with Diversification (Provably)

Submitted: 2024-05-14 · Updated: 2025-01-15
TL;DR

Our theory and experiments demonstrate that training with diverse auxiliary outliers enhances OOD detection performance.

Abstract

Keywords
OOD detection

Reviews and Discussion

Official Review
Rating: 7

The authors propose DiverseMixup, a Mixup data augmentation technique applied to auxiliary OOD data to improve the OOD detection capabilities of classifiers trained with the Outlier Exposure technique. They provide a theoretical analysis justifying their approach and demonstrate the superior empirical performance of their technique.
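A minimal sketch of the general recipe being described (outlier-exposure training with Mixup applied to the auxiliary outliers) is given below; `model`, the Beta-sampled mixing coefficient, and `oe_weight` are illustrative assumptions, not the paper's adaptive DiverseMixup strategy.

```python
# Minimal sketch: Outlier Exposure with Mixup applied to auxiliary outliers.
# `model`, `alpha`, and `oe_weight` are assumed placeholders, not the paper's exact setup.
import torch
import torch.nn.functional as F

def oe_loss_with_outlier_mixup(model, x_id, y_id, x_aux, alpha=1.0, oe_weight=0.5):
    # Mix random pairs of auxiliary outliers to diversify the outlier set.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_aux + (1.0 - lam) * x_aux[torch.randperm(x_aux.size(0))]

    logits_id, logits_aux = model(x_id), model(x_mix)
    ce = F.cross_entropy(logits_id, y_id)          # ID classification loss
    oe = -F.log_softmax(logits_aux, dim=1).mean()  # push outlier predictions toward uniform
    return ce + oe_weight * oe
```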

Strengths

  • The method appears to be novel and competitive compared to other OOD data augmentation methods while remaining quite simple and easy to implement.
  • The ablation and complementary experiments are satisfying.
  • The experiments answer many practical questions.

Weaknesses

One of the main contributions of the paper is the theoretical analysis. However, there appear to be critical flaws in both the demonstrations and the hypotheses.

Major

  1. The quantities $h^*_{ood}$ and $h^*_{aux}$ are never defined. We can guess from commonly used notation in optimization that $h^*_{ood} = \underset{h\in \mathcal{H}}{\operatorname{argmin}}\, \epsilon_{P_{\mathcal{X}}}(h,f)$ (same for $h^*_{aux}$), but this is the definition of $\mathcal{H}^*_{ood}$, which is defined as a set (which is not straightforward - why would an argmin be a set in that case?). It adds a lot of confusion, and we never know exactly what we are talking about, which is critical for a demonstration.
  2. In the demonstration of Theorem 1, the authors define $\beta_1$ and $\beta_2$ as constants, but 1) they depend on $P_{\mathcal{X}}$, which is supposed to be affected by the later-introduced $\mathcal{X}_{div}$, and 2) the parts of l.530 that are replaced do not seem to match the definition of the $\beta$'s.
  3. Demonstrations of Theorems 2 and 3 seem to rely on one argument, which is: "Since $\mathcal{X}_{aux} \subset \mathcal{X}_{ood}$, then $\mathcal{H}^*_{ood} \subset \mathcal{H}^*_{aux}$". I am concerned with the validity of this assumption (assuming that the definition of the $\mathcal{H}^*$'s as sets makes sense, which is not clear). As a counterexample, let's consider $\phi_{\mathcal{X}} = \mathrm{Unif}(0,3)$ and $\phi_{\widetilde{\mathcal{X}}} = \mathrm{Unif}(0,1)$ (which implies that $\mathcal{X}_{aux} = (0,1)$ and $\mathcal{X}_{ood} = (0,3)$). Now, let's consider $h^*_{ood}$ and $h^*_{aux}$ such that

$$\begin{cases} \int_{\mathcal{X}_{ood} \setminus \mathcal{X}_{aux}} |h^*_{ood}(x) - f(x)|\,dx = 0, \\ \epsilon_{P_{\mathcal{X}}}(h^*_{ood},f) = \int_{\mathcal{X}_{aux}} |h^*_{ood}(x) - f(x)|\,dx + \int_{\mathcal{X}_{ood} \setminus \mathcal{X}_{aux}} |h^*_{ood}(x) - f(x)|\,dx = \int_{\mathcal{X}_{aux}} |h^*_{ood}(x) - f(x)|\,dx = \epsilon_1, \end{cases}$$

and

$$\begin{cases} \epsilon_{P_{\widetilde{\mathcal{X}}}}(h^*_{aux},f) = \int_{\mathcal{X}_{aux}} |h^*_{aux}(x) - f(x)|\,dx = \epsilon_2 < \epsilon_1, \\ \int_{\mathcal{X}_{ood} \setminus \mathcal{X}_{aux}} |h^*_{aux}(x) - f(x)|\,dx > \epsilon_1 - \epsilon_2 \end{cases}$$

where for simplicity, we omit $\mathcal{X}_{id}$, assuming that the behavior is similar on this input space region for $h^*_{ood}$ and $h^*_{aux}$. In that case, clearly, $h^*_{ood}$ minimizes $\epsilon_{P_{\mathcal{X}}}(h,f)$ (thanks to the inequality above that keeps $h^*_{aux}$ sub-optimal for $P_{\mathcal{X}}$) but does not minimize $\epsilon_{P_{\widetilde{\mathcal{X}}}}(h,f)$.

Minor

  1. l. 111: perhaps you meant $\mathcal{X}_{id}$ instead of $\mathcal{Y}_{id}$?
  2. l. 122: $\epsilon$ is called a probability, whereas it is an expectation.
  3. Some typos.

Questions

I am puzzled because, on the one hand, the paper demonstrates strong empirical results, and the evaluation methodology is extensive and thorough, but on the other hand, I suspect that the authors' theoretical work is flawed - which does not affect the strength of the presented method but the validity of the paper. I am ready to improve my rating to acceptance if the authors prove my suspicions wrong during the rebuttal.

Limitations

The authors have adequately addressed the limitations.

Comment

The detailed proof of the validity of the assumption "Since $\mathcal{X}_{aux}\subset\mathcal{X}_{ood}$, then $\mathcal{H}^*_{ood}\subset \mathcal{H}^*_{aux}$".

For simplicity, we omit $\mathcal{X}_{id}$, assuming that the behavior is similar in this input space region for $h^*_{ood}$ and $h^*_{aux}$.

We first express the expected error of a hypothesis $h$ on the training data distribution $\mathcal{P}_{\widetilde{\mathcal{X}}}$ and the unknown test-time data distribution $\mathcal{P}_{\mathcal{X}}$ as follows:

$$\begin{cases} \epsilon_{\mathcal{P}_{\widetilde{\mathcal{X}}}}(h, f) = \int_{\mathcal{X}_{aux}} |h(x) - f(x)|\,dx = \epsilon_1, \\ \int_{\mathcal{X}_{ood} \setminus \mathcal{X}_{aux}} |h(x) - f(x)|\,dx = \epsilon_2, \\ \epsilon_{\mathcal{P}_{\mathcal{X}}}(h, f) = \int_{\mathcal{X}_{ood}} |h(x) - f(x)|\,dx = \int_{\mathcal{X}_{aux}} |h(x) - f(x)|\,dx + \int_{\mathcal{X}_{ood} \setminus \mathcal{X}_{aux}} |h(x) - f(x)|\,dx = \epsilon_1 + \epsilon_2. \end{cases}$$

From the above expressions, we obtain:

$$\begin{cases} \mathcal{H}^*_{aux} = \{ h : \arg\min\limits_h \epsilon_1 \}, \\ \mathcal{H}^*_{other} = \{ h : \arg\min\limits_h \epsilon_2 \}, \\ \mathcal{H}^*_{ood} = \{ h : \arg\min\limits_h (\epsilon_1 + \epsilon_2) \}. \end{cases}$$

The model is over-parameterized (line 126), which implies that our hypothesis space is large enough. Consequently, there exists an ideal hypothesis $h$ such that both $\epsilon_1$ and $\epsilon_2$ are minimized, i.e., $\mathcal{H}^*_{aux} \cap \mathcal{H}^*_{other} \neq \emptyset$. In this scenario, $\min_h (\epsilon_1 + \epsilon_2) = \min_h \epsilon_1 + \min_h \epsilon_2$. We denote $\mathcal{H}^*_{ood} = \mathcal{H}^*_{aux} \cap \mathcal{H}^*_{other}$, thus $\mathcal{H}^*_{ood} \subset \mathcal{H}^*_{aux}$.
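As a toy sanity check of this set argument (not part of the authors' proof), the snippet below uses invented error values for a three-element hypothesis set and verifies that the minimizers of $\epsilon_1 + \epsilon_2$ coincide with the common minimizers, hence form a subset of $\mathcal{H}^*_{aux}$.

```python
# Toy finite-hypothesis illustration with made-up error values.
eps = {                 # hypothesis -> (eps1 on X_aux, eps2 on X_ood \ X_aux)
    "h_a": (0.0, 0.5),
    "h_b": (0.0, 0.0),  # an "ideal" hypothesis minimizing both terms
    "h_c": (0.3, 0.0),
}
min_e1 = min(e1 for e1, _ in eps.values())
min_e2 = min(e2 for _, e2 in eps.values())
min_sum = min(e1 + e2 for e1, e2 in eps.values())

H_aux   = {h for h, (e1, _) in eps.items() if e1 == min_e1}
H_other = {h for h, (_, e2) in eps.items() if e2 == min_e2}
H_ood   = {h for h, (e1, e2) in eps.items() if e1 + e2 == min_sum}

assert H_ood == H_aux & H_other   # {"h_b"}
assert H_ood <= H_aux             # H*_ood is a subset of H*_aux
```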

Comment

I would like to thank the authors for their detailed and structured responses! I am satisfied with responses to W1, W2-2 (please include the details in the final manuscript following these first two remarks), 5 and 6.


I still have concerns about the rest. To start with the simpler:

4: My concern is about the fact that the sentence suggests that the support of $\mathcal{Y}_{id}$ lives in the same space as $\mathcal{X}_{ood}$ and $\mathcal{X}_{id}$, which cannot be true. In addition, a support can be defined for a measure, a distribution (which I assumed was implicitly defined in $\mathcal{X}_{id}$ but, after a double check, is not), or a function, but not for a set. Instead, I would advise simply stating that $\mathcal{X}_{ood} = \mathcal{X} \setminus \mathcal{X}_{id}$.

W2-1:

(i) How can you define $\beta$ as a constant if $\beta_1$ and $\beta_2$ are not and $\beta = 4\max(\beta_1, \beta_2)$?

(ii-iii) I am afraid that the regime where $\beta_1 \rightarrow 0$ and therefore $\beta_2 \rightarrow 0$ that you describe implies a "near-ideal hypothesis", where actually pretty much everything tends towards $0$ in the generalization bound (4). In other words, the problem I point out (that the $\beta$'s are not constant) is only alleviated in a regime where Eq. 4 no longer makes sense in practice. What do you think about that?

W3

The overparametrized case happens when the number of parameters is larger than the number of training points. In that case, the model might perfectly fit the training points (interpolate), but nothing is guaranteed about the generalization error. In addition, you define your errors with continuous integrals, for which the model is never overparametrized, because integrals can be defined as the limit of a sum of $n$ terms (1 for each data point) where $n \rightarrow \infty$ (Riemann definition). To obtain guarantees such as arbitrary minimization of the error, you should rely on universal approximation theorems, but they imply constraining assumptions for the underlying neural network (as a pointer, see https://en.wikipedia.org/wiki/Universal_approximation_theorem and the references therein). These assumptions should be stated depending on the theorem you choose to use in your demonstration.

Comment

Thank you sincerely for your detailed response and constructive feedback. Your insights have greatly contributed to our work, and we truly appreciate your support.


We would like to further address your concerns as follows:

W4. The support of $\mathcal{Y}_{id}$ lives in the same space as $\mathcal{X}_{ood}$ and $\mathcal{X}_{id}$, which cannot be true. Instead, I would advise simply stating that $\mathcal{X}_{ood} = \mathcal{X} \setminus \mathcal{X}_{id}$.

We sincerely appreciate your feedback, which has drawn our attention to the lack of precision in that definition. Your suggestion is indeed helpful, and we have decided to follow your advice and revise the definition of $\mathcal{X}_{ood}$ as follows: $\mathcal{X}_{ood} = \mathcal{X} \setminus \mathcal{X}_{id}$ represents the input space of OOD data, where $\mathcal{X}$ represents the entire input space in the open-world setting.

W2-1. (i) How can you define $\beta$ as a constant if $\beta_1$ and $\beta_2$ are not and $\beta = 4\max(\beta_1, \beta_2)$?

Thank you for your comment, we appreciate the opportunity to address your concerns as follows:

(i) $\beta_1$ is a constant; given that $\beta_2 \leq \beta_1$, we have $\beta = 4\max(\beta_1, \beta_2) = 4\beta_1$, and as a result, $\beta$ is a constant. Specifically, $\beta_1 = \min_{h \in \mathcal{H}} \epsilon_{x \sim \mathcal{P}_{\mathcal{X}}}(h, f)$, which depends on $\mathcal{P}_{\mathcal{X}}$ and $\mathcal{H}$, where $\mathcal{P}_{\mathcal{X}}$ represents the unknown test-time distribution in the open world and does not change throughout our analysis. Similarly, $\mathcal{H}$ is a predefined hypothesis space that is fixed. Consequently, $\beta_1$ is a constant. As we derived in public comment (1), $\beta_2 \leq \beta_1$. Considering $\beta = 4\max(\beta_1, \beta_2) = 4\beta_1$, we can conclude that $\beta$ is a constant.

(ii) We recognize that our current presentation of $\beta = 4\max(\beta_1, \beta_2)$ may have led to some misunderstanding. Our intention in introducing $\beta$ in Theorem 1 was to unify the small values $\beta_1$ and $\beta_2$. For coherence in our derivation, we use the definition $\beta = 4\max(\beta_1, \beta_2)$ directly. We appreciate your feedback and acknowledge that our current presentation could be clearer.

(iii) We have modified this part of the derivation in the revised version to enhance clarity. Specifically, after proving $\beta_2 \leq \beta_1$ and clearly stating that $\beta_1$ is a constant, we have directly defined $\beta = 4\beta_1$. This modification should make our reasoning more transparent and easier to follow.

W2-1 (ii-iii). I am afraid that the regime where $\beta_1 \rightarrow 0$ (and therefore $\beta_2 \rightarrow 0$) that you describe implies a "near-ideal hypothesis", where actually pretty much everything tends towards 0 in the generalization bound (4). In other words, the problem I point out (that the $\beta$'s are not constant) is only alleviated in a regime where Eq. 4 no longer makes sense in practice.

We sincerely appreciate your feedback. We would like to address your concerns by first explaining the rationale behind our assumptions and then discussing the impact on Eq.4 in practice.

(i) We assume the existence of an ideal hypothesis $h$ within the hypothesis space $\mathcal{H}$ such that $\beta_1 \rightarrow 0$. According to universal approximation theorems, this condition can be met when the depth or width of deep neural networks satisfies certain conditions. Specifically, under these conditions, the model becomes a universal approximator, implying the existence of $h^*_{ood} \in \mathcal{H}$ such that $h^*_{ood} \rightarrow f$, leading to $\beta_1 \rightarrow 0$.

(ii) In practical scenarios, Eq. 4 represents an upper bound on the generalization error of the learned hypothesis $h$. Moreover, when $\beta_1 \rightarrow 0$, each term in Eq. 4 retains its practical significance. To illustrate this, let us revisit Eq. 4:

$$\epsilon_{x\sim\mathcal{P}_{\mathcal{X}}}(h,f) \leq \underbrace{\hat{\epsilon}_{x \sim \mathcal{P}_{\widetilde{\mathcal{X}}}}(h,f)}_{\text{empirical error}} + \underbrace{\epsilon(h, h^*_{aux})}_{\text{reducible error}} + \underbrace{\sup_{h \in \mathcal{H}^*_{aux}} \epsilon_{x \sim \mathcal{P}_{\mathcal{X}}}(h, h^*_{ood})}_{\text{distribution shift error}} + \underbrace{\mathcal{R}_m(\mathcal{H})}_{\text{complexity}} + \sqrt{\frac{\ln(\frac{1}{\delta})}{2M}} + \beta$$

where the empirical error term is minimized through optimization, the reducible error quantifies how closely $h$ approximates $h^*_{aux}$, and the distribution shift error captures the discrepancy between training and test data distributions. These components contribute significantly to the error upper bound. As $\beta_1 \rightarrow 0$, only the $\beta$ term (related to the ideal error) approaches zero, while the other terms remain relevant and unaffected.

Comment

W3. The overparametrized case happens when the number of parameters is larger than the number of training points. In that case, the model might perfectly fit the training points, but nothing is guaranteed about the generalization error. In addition, you define your errors with continuous integrals, for which the model is never overparametrized, because integrals can be defined as the limit of a sum of $n$ terms (1 for each data point) where $n \rightarrow \infty$ (Riemann definition). To obtain guarantees such as arbitrary minimization of the error, you should rely on universal approximation theorems, but they imply constraining assumptions for the underlying neural network. These assumptions should be stated depending on the theorem you choose to use in your demonstration.

We sincerely appreciate your detailed response, thorough explanations, and constructive suggestions. Your input has significantly enhanced the theoretical rigor of our paper.

(i) We have acknowledged our misunderstanding regarding the overparameterized case. To obtain guarantees such as the arbitrary minimization of error, we should indeed rely on universal approximation theorems.

(ii) We are now making specific assumptions about the underlying neural network to strengthen our proof. Specifically, based on the paper [1], we assume our model is a fully-connected ReLU network with width $(m+4)$, which can approximate any Lebesgue-integrable function $f$ from $\mathbb{R}^m$ to $\mathbb{R}$ with arbitrary accuracy with respect to the $L^1$ distance. In this case, there exists an ideal hypothesis $h$ that minimizes both $\epsilon_1$ and $\epsilon_2$ simultaneously, i.e., $\mathcal{H}^*_{aux} \cap \mathcal{H}^*_{other} \neq \emptyset$, thus $\mathcal{H}^*_{ood} \subset \mathcal{H}^*_{aux}$.

[1] Universal Approximation Theorem for Width-Bounded ReLU Networks.
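As a rough illustration of the assumed network family, a sketch follows; the depth is an arbitrary illustrative choice, and this is not the construction used in [1] or the paper's actual architecture.

```python
# Sketch of a fully-connected ReLU network of width m+4 mapping R^m -> R.
# Depth is a free illustrative parameter here.
import torch.nn as nn

def width_bounded_relu_net(m: int, depth: int = 8) -> nn.Sequential:
    width = m + 4
    layers = [nn.Linear(m, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, 1))  # scalar output approximating f
    return nn.Sequential(*layers)
```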


Finally, we would like to thank you again for your thorough review of our paper, your detailed feedback, and your constructive suggestions. They have significantly improved the quality of our work.

Comment

Thank you for your perseverance and your honesty. I have final minor remarks that I do not expect the authors to respond to, but simply to take into account.

  • Suggestion about the definition of $\beta$: if you no longer use $\beta_2$ and directly bound it with $\beta_1$ (which is a clever way of alleviating concerns about $\beta_2$ not being a constant), I would recommend directly defining $\beta = 4\int\ldots$ rather than $\beta = 4\beta_1$ for better readability.
  • (Important) Assumptions for Universal Approximation Theorem: Make sure to clearly state the assumptions in the main paper in the Theorem, not only in the proof.

The authors have constructively engaged in a fruitful scientific discussion during this rebuttal, which has improved the quality of their manuscript and allowed its initial theoretical imprecisions to be fixed. As a result, I am happy to raise my score to acceptance.

Comment

Thank you again for your valuable input and for increasing your rating. We will incorporate your valuable suggestions in the revised version to further improve the quality of our paper.

Comment

The detailed proof from line 530 to the definition of $\beta$.

Let's first review line 530 in the demonstration of Theorem 1:

$$GError(h)\leq\epsilon_{x\sim\mathcal{P}_{\widetilde{\mathcal{X}}}}(h,f)+\textcolor{blue}{\epsilon_{x\sim\mathcal{P}_{\widetilde{\mathcal{X}}}}(h^*_{ood},f)}+\textcolor{blue}{\epsilon_{x\sim\mathcal{P}_{\mathcal{X}}}(h^*_{ood},f)}+\int |\phi_{\mathcal{X}}(x)-\phi_{\widetilde{\mathcal{X}}}(x)|\,\left|h(x)-h^*_{aux}(x)\right|\,dx+\epsilon_{x\sim\mathcal{P}_{\mathcal{X}}}(h^*_{aux},h^*_{ood})+\textcolor{blue}{\epsilon_{x\sim\mathcal{P}_{\widetilde{\mathcal{X}}}}(h^*_{aux},f)}+\textcolor{blue}{\epsilon_{x\sim\mathcal{P}_{\widetilde{\mathcal{X}}}}(h^*_{ood},f)}$$

We denote $\beta_1=\min_{h\in\mathcal{H}}\epsilon_{x\sim\mathcal{P}_{\mathcal{X}}}(h,f)$ and $\beta_2=\min_{h\in\mathcal{H}}\epsilon_{x\sim\mathcal{P}_{\widetilde{\mathcal{X}}}}(h,f)$ as the error of $h^*_{ood}$ and $h^*_{aux}$ on $\mathcal{P}_{\mathcal{X}}$ and $\mathcal{P}_{\widetilde{\mathcal{X}}}$, respectively.

In other words, for any $h\in \mathcal{H}^*_{ood}$, $\epsilon_{x\sim\mathcal{P}_{\mathcal{X}}}(h,f)=\beta_1$. For any $h\in \mathcal{H}^*_{aux}$, $\epsilon_{x\sim\mathcal{P}_{\widetilde{\mathcal{X}}}}(h,f)=\beta_2$.

As a result, $\textcolor{blue}{\epsilon_{x\sim\mathcal{P}_{\mathcal{X}}}(h^*_{ood},f)=\beta_1}$ and $\textcolor{blue}{\epsilon_{x\sim\mathcal{P}_{\widetilde{\mathcal{X}}}}(h^*_{aux},f)=\beta_2}$.

Considering that $\mathcal{H}^*_{ood}\subset \mathcal{H}^*_{aux}$ (mentioned in line 126), any $h\in \mathcal{H}^*_{ood}$ also satisfies $h\in \mathcal{H}^*_{aux}$. As a result, $\textcolor{blue}{\epsilon_{x\sim\mathcal{P}_{\widetilde{\mathcal{X}}}}(h^*_{ood},f)=\beta_2}$. We have:

$$GError(h)\leq \epsilon_{x\sim\mathcal{P}_{\widetilde{\mathcal{X}}}}(h,f)+\textcolor{blue}{\beta_2}+\textcolor{blue}{\beta_1}+\int |\phi_{\mathcal{X}}(x)-\phi_{\widetilde{\mathcal{X}}}(x)|\,\left|h(x)-h^*_{aux}(x)\right|\,dx+\epsilon_{x\sim\mathcal{P}_{\mathcal{X}}}(h^*_{aux},h^*_{ood})+\textcolor{blue}{\beta_2}+\textcolor{blue}{\beta_2}$$

We denote $\tfrac{1}{4}\beta=\max\{\beta_1,\beta_2\}$, i.e., $\beta = 4\max\{\beta_1,\beta_2\}$, so:

$$GError(h)\leq\epsilon_{x\sim\mathcal{P}_{\widetilde{\mathcal{X}}}}(h,f)+\int|\phi_{\mathcal{X}}(x)-\phi_{\widetilde{\mathcal{X}}}(x)|\,|h(x)-h^*_{aux}(x)|\,dx+\epsilon_{x\sim\mathcal{P}_{\mathcal{X}}}(h^*_{aux},h^*_{ood})+\textcolor{blue}{\beta}$$
Comment

Proof of the negligible effect of $\mathcal{X}_{div}$ on $\beta_2$.

Specifically, $\beta_1$ and $\beta_2$ represent the error of the ideal hypothesis on the unknown test-time data distribution $\mathcal{P}_{\mathcal{X}}$ and the training data distribution $\mathcal{P}_{\widetilde{\mathcal{X}}}$, respectively. We can derive that $\beta_2 \leq \beta_1$. Moreover, given that our model is over-parameterized, we have a sufficiently large hypothesis space that includes a near-ideal hypothesis such that $\beta_1$ is sufficiently small, i.e., $\beta_1 \rightarrow 0$. As a result, we can conclude that $\beta_2 \rightarrow 0$. We have put the detailed proof in the official comment.

The detailed proof is as follows:

Considering that $\beta_1$ and $\beta_2$ represent the error of the ideal hypothesis on the unknown test-time data distribution $\mathcal{P}_{\mathcal{X}}$ and the training data distribution $\mathcal{P}_{\widetilde{\mathcal{X}}}$, respectively, we have:

$$\beta_2 = \min_{h \in \mathcal{H}} \epsilon_{x \sim \mathcal{P}_{\widetilde{\mathcal{X}}}}(h, f) = \min_{h \in \mathcal{H}} \int_{\widetilde{\mathcal{X}}} |h(x) - f(x)|\,dx$$

$$\beta_1 = \min_{h \in \mathcal{H}} \epsilon_{x \sim \mathcal{P}_{\mathcal{X}}}(h, f) = \min_{h \in \mathcal{H}} \int_{\mathcal{X}} |h(x) - f(x)|\,dx = \min_{h \in \mathcal{H}} \left( \int_{\widetilde{\mathcal{X}}} |h(x) - f(x)|\,dx + \int_{\mathcal{X} \setminus \widetilde{\mathcal{X}}} |h(x) - f(x)|\,dx \right) \ge \min_{h \in \mathcal{H}} \int_{\widetilde{\mathcal{X}}} |h(x) - f(x)|\,dx + \min_{h \in \mathcal{H}} \int_{\mathcal{X} \setminus \widetilde{\mathcal{X}}} |h(x) - f(x)|\,dx \ge \beta_2$$

In our setting, the model is over-parameterized, meaning we have a sufficiently large hypothesis space that includes a near-ideal hypothesis such that $\beta_1$ is sufficiently small. Therefore, we can denote $\beta_1 \rightarrow 0$. Given that $\beta_2 \leq \beta_1$, $\beta_2 \rightarrow 0$. Thus, $\beta_1$ and $\beta_2$ are both negligible, and we use a small value $\beta$ to unify them in Theorem 1.

Author Response

We thank the reviewer for recognizing our novel method and satisfying experiments. We appreciate your support and constructive suggestions and address your concerns as follows.


W1. The quantities $h^*_{ood}$ and $h^*_{aux}$ are never defined. Why would an argmin be a set in that case?

Thank you for your valuable comment. We agree that clearer definitions of $h^*_{ood}$ and $h^*_{aux}$ would improve our manuscript. Our choice to define the argmin as a set was purposeful and well-founded, and we are happy to clarify this as follows:

1) The definition of the quantities $h^*_{ood}$ and $h^*_{aux}$.

$h^*_{ood}$ and $h^*_{aux}$ are defined as elements of $\mathcal{H}^*_{ood}$ and $\mathcal{H}^*_{aux}$, respectively, which can be denoted as $h^*_{ood}\in \mathcal{H}^*_{ood}$ and $h^*_{aux} \in \mathcal{H}^*_{aux}$, where $\mathcal{H}^*_{ood}$ and $\mathcal{H}^*_{aux}$ refer to the sets of ideal hypotheses on the test-time data distribution $\mathcal{P}_{\mathcal{X}}$ and the training data distribution $\mathcal{P}_{\widetilde{\mathcal{X}}}$, respectively. We will add these definitions to the manuscript.

2) Why would an argmin be a set?

(i) Our setting requires us to represent the optimal hypothesis as a set. In our setting, $\mathcal{H}^*_{aux}$ contains all hypotheses optimal for the training data distribution. However, these hypotheses may perform inconsistently on the test-time data distribution, necessitating the use of a set rather than a single $h^*_{aux}$.

(ii) The optimization problem behind deep neural networks is highly non-convex and the optimal solution is not unique [1] [2]. Therefore, defining argmin as a set offers generality.

(iii) Representing argmin as a set is not unprecedented in the field. For example, the theoretical analysis in [3] also employs this set-based representation.


W2. The authors define $\beta_1$ and $\beta_2$ as constants, but 1) they depend on $\mathcal{P}_{\mathcal{X}}$, which is supposed to be affected by the later-introduced $\mathcal{X}_{div}$, and 2) the parts of line 530 that are replaced do not seem to match the definition of the $\beta$'s.

We sincerely thank you for your thorough review. We realize our derivation omitted some explanations, leading to misunderstandings. We are happy to clarify in detail as follows:

1) Clarification of $\beta_1$ and $\beta_2$.

(i) We do not define $\beta_1$ and $\beta_2$ as constants. This misunderstanding may have arisen from our definition of $\beta$ as a constant in Theorem 1. We will clarify this in the revised paper.

(ii) $\beta_1$ is not affected by $\mathcal{X}_{div}$. $\beta_1$ depends on the unknown test data distribution $\mathcal{P}_{\mathcal{X}}$. The later-introduced $\mathcal{X}_{div}$ only affects the training data distribution, thus $\beta_1$ is unaffected by $\mathcal{X}_{div}$.

(iii) While $\beta_2$ is indeed influenced by $\mathcal{X}_{div}$, we can prove that $\beta_2$ is bounded by a very small constant. Therefore, in subsequent analyses, the effect of $\mathcal{X}_{div}$ on $\beta_2$ is negligible. We have put the detailed discussion and proof in official comment (1).

2) The parts of l.530 that are replaced do not seem to match the definition of the $\beta$'s.

Upon careful examination, we believe the derivation is accurate. However, we recognize that the presentation could be enhanced for clarity. We have provided a detailed derivation in the official comment (2).


W3. Demonstrations of Theorems 2 and 3 seem to rely on one argument, which is: "Since $\mathcal{X}_{aux}\subset\mathcal{X}_{ood}$, then $\mathcal{H}^*_{ood}\subset \mathcal{H}^*_{aux}$". I am concerned with the validity of this assumption.

Thank you for your valuable feedback; we greatly appreciate your attention to detail. We illustrate the validity of this assumption as follows:

(i) The assumption is reasonable in the over-parameterized setting. We can decompose the ideal error on $\mathcal{X}_{ood}$ into two components: one for $\mathcal{X}_{aux}$ and another for the remaining OOD data $\mathcal{X}_{ood} \setminus \mathcal{X}_{aux}$. Considering that the model is over-parameterized (line 126), there exists an ideal hypothesis minimizing both errors simultaneously. In other words, $\mathcal{H}^*_{ood}$ is the intersection of the sets of optimal hypotheses for the auxiliary and remaining data, thereby $\mathcal{H}^*_{ood}\subset\mathcal{H}^*_{aux}$. We have provided a detailed derivation in official comment (3).

(ii) Discussion of the counterexample. The counterexample suggests that the model's optimal solution on $\mathcal{X}_{ood}$ may not be optimal on $\mathcal{X}_{aux}$ because it balances performance on $\mathcal{X}_{ood} \setminus \mathcal{X}_{aux}$. This assumes there is no hypothesis $h \in \mathcal{H}$ that achieves optimal performance on both $\mathcal{X}_{aux}$ and $\mathcal{X}_{ood} \setminus \mathcal{X}_{aux}$. However, in the over-parameterized setting, there exists an ideal hypothesis that is optimal on both $\mathcal{X}_{aux}$ and $\mathcal{X}_{ood} \setminus \mathcal{X}_{aux}$. Therefore, the counterexample does not hold in our setting.


4. line 111: perhaps you meant $\mathcal{X}_{id}$ instead of $\mathcal{Y}_{id}$?

5. line 122: $\epsilon$ is called a probability, whereas it is an expectation.

6. Some typos

Thank you for your valuable suggestions, we have addressed your concerns as follows:

(i) We would like to clarify that $\mathcal{X}_{ood}$ is outside the support of $\mathcal{Y}_{id}$. In other words, $\mathcal{X}_{id}$ is within the support of $\mathcal{Y}_{id}$.

(ii) We acknowledge the error on line 122 and have corrected it.

(iii) We have thoroughly reviewed and revised all representation issues throughout the paper.

References:

[1] The Loss Surface of Deep and Wide Neural Networks.

[2] On the Quality of the Initial Basin in Overspecified Neural Networks.

[3] Agree To Disagree: Diversity Through Dis-Agreement For Better Transferability.

Official Review
Rating: 6

The paper proposed Diversity-induced Mixup for OOD detection (diverseMix), which enhances the diversity of auxiliary outlier set for training in an efficient way.

Strengths

  1. The paper is written well and is easy to understand.

  2. The studied problem is very important.

  3. The results seem to outperform state-of-the-art.

Weaknesses

  1. My biggest concern is that there are already some papers that theoretically analyze the effect of auxiliary outliers and propose some complementary algorithms based on mixup, such as [1], [2] and [3]. It might be useful to clarify the differences.

[1] Out-of-distribution Detection with Implicit Outlier Transformation
[2] Learning to augment distributions for out-of-distribution detection
[3] Diversified outlier exposure for out-of-distribution detection via informative extrapolation

Questions

see above

Limitations

n/a

Author Response

Thank you for your reviews. We are encouraged that you appreciate our studied problem and state-of-the-art result. We address your concerns as follows.

W1. My biggest concern is that there are already some papers that theoretically analyze the effect of auxiliary outliers and propose some complementary algorithms based on mixup, such as [1], [2] and [3]. It might be useful to clarify the differences.

Thanks for mentioning [1] [2] [3]. We first review these related works and then clarify the differences from the perspectives of motivations, techniques, and theory. Additionally, we have incorporated these related works into the manuscript.

(i) Reviews of related works.

DOE [1] proposes a novel and effective approach for improving OOD detection performance by implicitly synthesizing virtual outliers via model perturbation.

DAL [2] introduces a novel and effective framework for learning from the worst cases in the Wasserstein ball to enhance OOD detection performance.

DivOE [3] is an innovative and effective method for enhancing OOD detection performance by using informative extrapolation to generate new and informative outliers.

(ii) Differences in motivation. DiverseMix has a different motivation from DOE, DAL, and DivOE. Our DiverseMix is proposed to enhance the diversity of the auxiliary outlier set through semantic-level interpolation, thereby improving OOD detection performance. In comparison, DOE focuses on improving the generalization of the original outlier exposure by exploring model-level perturbation. DAL focuses on crafting an OOD distribution set in a Wasserstein ball centered on the auxiliary OOD distribution to alleviate the distribution discrepancy between auxiliary outliers and unseen OOD data. DivOE focuses on extrapolating auxiliary outliers to generate new informative outliers to enhance OOD detection performance.

(iii) Differences in technique. Our method, DiverseMix, has a different algorithmic design compared to others. Specifically, DiverseMix adaptively generates interpolation strategies based on outliers to create new outliers. In contrast, DOE, DAL, and DivOE primarily rely on adding perturbations. Specifically, DOE, DAL, and DivOE apply perturbations at the model level, feature level, and sample level, respectively, to mitigate the OOD distribution discrepancy issue.

(iv) Differences in theory. We prove that a more diverse set of auxiliary outliers improves detection capacity from the generalization perspective, and this theoretical insight inspired our method DiverseMix. We also provide an insightful theoretical analysis verifying the superiority of DiverseMix. In comparison, DOE reveals that model perturbation leads to data transformation that enhances the generalization of the OOD detector. DAL finds that the distribution discrepancy between the auxiliary and the real OOD data affects OOD detection performance. DivOE demonstrates its effectiveness from the perspective of sample complexity.

References:

[1] Out-of-distribution Detection with Implicit Outlier Transformation.

[2] Learning to augment distributions for out-of-distribution detection.

[3] Diversified outlier exposure for out-of-distribution detection via informative extrapolation.

Comment

Thanks for the clear response to my concerns and questions. All my concerns have been resolved, so I increase my score to 6. Thanks!

Comment

Thank you so much for the valuable comments and increasing your rating. Thanks!

Official Review
Rating: 6

This study aims to explore the reasons behind the effectiveness of out-of-distribution (OOD) regularization methods by linking the auxiliary OOD dataset to generalizability. The researchers show that the variety within the auxiliary OOD datasets significantly influences the performance of OOD detectors. Moreover, they introduce a straightforward approach named diverseMix which is designed to enhance the diversity of the auxiliary dataset used for OOD regularization.

Strengths


  • The paper is well-composed and presents an extensive array of experiments across multiple OOD detection benchmarks.
  • This study offers important insights into the significance of auxiliary datasets in OOD regularization, addressing a frequently neglected aspect of OOD regularization techniques.
  • The authors provide a robust theoretical foundation for diverseMix. Additionally, they show strong empirical evaluations which further highlight the effectiveness of diverseMix across a range of OOD experiments.

Weaknesses


  • The reviewer has some concerns regarding the empirical evaluations of diverseMix. In particular, the choice of how the ImageNet-1k is split into ImageNet-200 as ID while the remaining data is leveraged as OOD seems arbitrary.

Questions

The primary question of the reviewer is why not leverage the entire ImageNet-1k dataset for ID whilst leveraging other unlabeled datasets for the auxiliary data.

Limitations

The authors have adequately addressed any potential negative societal impacts of this work, and the reviewer does not anticipate any such negative impacts.

Author Response

We sincerely thank the reviewer for your valuable comments and appreciate your recognition of the effective method as well as sufficient theoretical guarantees. We provide detailed responses to the constructive comments.

W1. The reviewer has some concerns regarding the empirical evaluations of diverseMix. In particular, the choice of how the ImageNet-1k is split into ImageNet-200 as ID while the remaining data is leveraged as OOD seems arbitrary.

Thank you for raising this important question. We sincerely appreciate your attention to the details of our experiments and are pleased to provide further clarification.

(i) Our splitting strategy was not arbitrary but carefully considered. We randomly selected 200 classes from ImageNet-1K as the ID categories, while the remaining 800 classes were used as OOD categories. This setting ensures the validity of our experiments.

(ii) This experimental setting is consistent with prior work. We followed the benchmark [1] and set ImageNet-200 as the ID dataset for our experiments.
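For concreteness, a hypothetical sketch of such a random class split is given below; the seed and variable names are assumptions for illustration, not the exact protocol of the benchmark.

```python
# Hypothetical illustration: randomly split ImageNet-1K classes into 200 ID / 800 OOD.
import random

all_classes = list(range(1000))   # ImageNet-1K class indices
rng = random.Random(0)            # assumed seed for reproducibility
id_classes = sorted(rng.sample(all_classes, 200))
ood_classes = [c for c in all_classes if c not in set(id_classes)]
```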

Q1. The primary question of the reviewer is why not leverage the entire ImageNet-1k dataset for ID whilst leveraging other unlabeled datasets for the auxiliary data?

Thanks for the constructive suggestions. We are grateful for this insight and have conducted the additional experiments as suggested. Specifically, we used the entire ImageNet-1K dataset as the ID dataset and employed the SSB-hard dataset as auxiliary outliers. The experimental results are shown below, with columns representing different OOD datasets. The values in the table are presented in the format (FPR$\downarrow$ / AUROC$\uparrow$). From the experimental results, it is evident that our method remains effective even when ImageNet-1K is used as the ID dataset.

| Method | DTD | iNaturalist | NINCO | Average |
|---|---|---|---|---|
| OE | 73.90 / 76.82 | 49.33 / 89.53 | 76.03 / 80.52 | 66.42 / 82.29 |
| Energy | 69.80 / 82.56 | 74.40 / 85.58 | 81.66 / 77.32 | 75.29 / 81.82 |
| MixOE | 69.48 / 78.07 | 46.61 / 89.72 | 74.17 / 80.79 | 63.42 / 82.86 |
| Ours | 68.17 / 78.69 | 42.71 / 90.98 | 73.29 / 81.27 | 61.39 / 83.65 |
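For reference, the two reported metrics can be computed from per-sample detection scores as in the sketch below; these are assumed helper functions using scikit-learn, not the authors' evaluation code, and the convention here is that a higher score means more in-distribution.

```python
# FPR at 95% TPR and AUROC from ID/OOD detection scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def fpr_at_95_tpr(scores_id, scores_ood):
    threshold = np.percentile(scores_id, 5)         # keep 95% of ID samples above threshold
    return float(np.mean(scores_ood >= threshold))  # fraction of OOD wrongly accepted

def auroc(scores_id, scores_ood):
    labels = np.concatenate([np.ones_like(scores_id), np.zeros_like(scores_ood)])
    scores = np.concatenate([scores_id, scores_ood])
    return float(roc_auc_score(labels, scores))
```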

References:

[1] OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection.

Comment

Dear reviewers, please read and respond to the authors' rebuttal (if you haven't done so). Thanks.

Your AC

Final Decision

This submission received three ratings (6, 6 and 7), averaging 6.33, which is above the acceptance threshold. After the rebuttal, all reviewers' concerns have been well addressed, especially regarding the experiments, the differences from previous works, and the clarity of the theoretical notation and proofs. After carefully checking the responses to all reviewers, I suggest acceptance. I hope the authors carefully follow the reviewers' advice to improve the submission.