PaperHub
Overall rating: 4.8 / 10 — Rejected (4 reviewers)
Individual ratings: 3, 5, 6, 5 (lowest 3, highest 6, std. dev. 1.1)
Average confidence: 3.3
ICLR 2024

Conservative Prediction via Data-Driven Confidence Minimization

OpenReview | PDF
Submitted: 2023-09-23 · Updated: 2024-02-11
TL;DR

How does dataset choice for confidence minimization affect conservative prediction?

Abstract

Keywords
conservative prediction, confidence, uncertainty, robustness, selective classification, OOD detection

Reviews and Discussion

Review (Rating: 3)

Pretrained models can perform well on known observations but might be overconfident on unknown points, which hence incur high risks in safety-critical domains. Motivated by this, the authors proposed a conservative approach based on the Data-Driven Confidence Minimization for both selective classification and out-of-distribution detection tasks. Particularly, they introduced a regularizer in the objective function to penalize the over-confidence in those unknown observations.

Strengths

  • The authors provided insight into the choice of the auxiliary dataset serving as unknown observations, which is used in the regularization term.
  • Empirically, the authors conducted extensive experiments to show that the proposed method is promising.

Weaknesses

  • To be honest, I got overwhelmed for a while due to some confusing notations and definitions.

    • I am confused about the definition of "unknown" in selective classification. For example,
      • The authors first referred to the unknown examples as those not well-represented in training (paragraph 1, section 2); later they said that unknown examples are those misclassified points (paragraph 2, section 2). In selective classification, one may use "hard" observations (easily misclassified) instead of "unknown" to avoid confusion with the "unknown" in OOD.
      • When it comes to the notation for unlabeled data $D_u$, exclusively used in OOD detection, Propositions 4.1 and 4.2 then sound like they target OOD detection with the conclusion "DCM provably detects unknown examples". Here I presume that "detects unknown examples" solely means "detects OOD examples", excluding the "unknown" examples in the selective classification task.
    • The authors need to well-articulate the notations before these are used. For example,
      • What is $\mathcal{P}_{ID}$? It is not friendly for readers not familiar with this topic.
      • Please add the appendix reference for the definition of $\delta$-neighborhoods in the comment after Proposition 4.1.
      • What does $i$ in (4) stand for?
  • The uniform label distribution used for "unknown" observations is ad-hoc. It sounds like you are assuming all unknown examples overlap together and hence you cannot clearly distinguish them. However, what if there are only several overlapped classes? For example, some unknown observations from class 1 overlap with class 2, but they are disjoint with other classes. In this case, it is not appropriate to give non-zero probability to generate other “pseudo” labels for these unknown examples from class 1. Moreover, if this unknown example is an OOD point that is unlike any of the "known" classes, does it still make sense to give a non-zero probability to be labeled as "known" classes?

  • The authors need to discuss the practicality of the assumption "$D^\delta_{k}$ and $D^\delta_{unk}$ are disjoint" in Proposition 4.2 since the proposed method follows the corresponding theoretical guidance. In other words, is this assumption strong?

    • The unknown/misclassified points (I follow your definition of "unknown") in selective classification could be those hard points from some classes that intrinsically overlap with each other, then "known" and "unknown" are not disjoint.
    • In OOD detection, now that we have the assumption of "disjoint", why do we still bother with the unlabeled data? Why not generate auxiliary data around but separable from ID?
  • The theoretical guidance is not that clear: Are the two theorems practically useful and how do the authors capitalize on these theorems in the experiments? In particular, as per Line 2, Page 5, what kind of "appropriate threshold" is used? Did the authors use the value of the left-hand side of the inequality in (4)?

Questions

  • In Algorithm 2, since there are no "easily misclassified" examples with known labels like the validation data in Algorithm 1, why not just train the model based on $\mathcal{L}_{xent}+\lambda\mathcal{L}_{conf}$? In other words, is there any necessity for the prior step to optimize $\mathcal{L}_{xent}$?

  • Equation (9) and the comment "resulting in a mixture between the true label distribution $p$ and the uniform distribution $\mathcal{U}$, with mixture weight $\lambda$" after Proposition 4.2: Is this rearrangement (9) correct? Since $\mathcal{P}_{u}=\alpha_{test}\cdot\mathcal{P}_{ID} + (1-\alpha_{test})\cdot\mathcal{P}_{OOD}$, why is the mixture weight just $\lambda$, wouldn't there be an extra factor $\alpha_{test}$?

  • $\epsilon$ in Proposition 4.2 and the involved proof:

    • The exact value of $\epsilon$ depends on the model performance or $\mathcal{L}(\theta)$, how can we allow $\epsilon\leq\frac{1}{2N}\left(\frac{M-1}{(1+\lambda)M}\right)^2$ to conclude the inequality (15)?
    • What is $M$ in (15)?
  • Proposition A.1: I think the dimension of $p$ is $C+1$ when it comes to OOD detection (as the authors mentioned $p$ is the true label distribution). However, the dimension of $s$ is $C$ as the authors explicitly showed. Then is that legitimate for the expression $s-p$ and $s-\frac{\mathbf{1}}{C}$ in (6)?

  • Other minor issues:

    • Please consistently add a comma after "i.e."
    • What is (5) used for?

Details of Ethics Concerns

No

Comment

Thank you for your feedback. We address your comments below. Please let us know if our response addresses all of your concerns.

The theoretical guidance is not that clear: Are the two theorems practically useful and how do the authors capitalize on these theorems in the experiments?

Yes, the two theorems are practically useful. We explain their practical implications in the last part of Section 4. They show that it is important to use the right uncertainty dataset, one that includes examples close to the unknown examples that the model may encounter at test time.

In our experiments (Section 6), we capitalize on these findings by using the following uncertainty datasets: (1) unlabeled ID and OOD examples for OOD detection, and (2) misclassified validation examples for selective classification. These choices are validated by our results. We hope this clarification helps.

In particular, as per Line 2, Page 5, what kind of "appropriate threshold" is used? Did the authors use the value of the left-hand side of the inequality in (4)?

Our empirical evaluation considers different confidence thresholds, as is standard in prior selective classification works [7,8,9,10,11]: Acc@90 measures accuracy at a threshold such that the model predicts on 90% of the data, AUC is an aggregate metric that considers all thresholds, etc.

Yes, we use a version of the LHS of (4), since all of our evaluation is based on the interplay between confidence (=MSP) and accuracy.
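For concreteness, here is a minimal sketch (not the paper's code; the function name and array layout are illustrative) of how a coverage-based metric such as Acc@90 can be computed from MSP confidences:

```python
# Hypothetical sketch: accuracy at a fixed coverage from MSP confidences.
import numpy as np

def acc_at_coverage(probs, labels, coverage=0.9):
    """probs: (N, C) softmax outputs; labels: (N,) integer class labels."""
    confidence = probs.max(axis=1)              # MSP confidence per example
    correct = probs.argmax(axis=1) == labels
    order = np.argsort(-confidence)             # most confident first
    k = int(np.ceil(coverage * len(labels)))    # predict on, e.g., 90% of inputs
    return correct[order[:k]].mean()            # accuracy on the retained set
```

Sweeping the coverage (equivalently, the confidence threshold) and aggregating over thresholds gives threshold-free metrics such as the AUC mentioned above.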

Practicality of the assumption that known and unknown datasets are disjoint. The unknown/misclassified points (I follow your definition of "unknown") in selective classification could be those hard points from some classes that intrinsically overlap with each other, then "known" and "unknown" are not disjoint. In OOD detection, now that we have the assumption of "disjoint", why do we still bother with the unlabeled data? Why not generate auxiliary data around but separable from ID?

For selective classification: even though the known and unknown examples come from the same classes, they can be disjoint partitions of the overall distribution. For OOD detection: it’s not that any disjoint distribution works equally well; confidence minimization performs best when the uncertainty set is, in addition to being disjoint from the known data, close to the hard examples that the model may see at test time. Being separate from ID by itself does not make a dataset a useful uncertainty dataset.

Moreover, there are prior works that use synthetic outlier data: VOS [4], NPOS [5], and Dream-OOD [6]. We have provided comparisons to these methods with CIFAR-10 and CIFAR-100 as ID datasets. Our method outperforms them in most cases, showing the efficacy of using the unlabeled auxiliary set.

| ID Dataset / Network | Method | SVHN AUROC (↑) | SVHN FPR@95 (↓) | LSUN (Crop) AUROC (↑) | LSUN (Crop) FPR@95 (↓) | iSUN AUROC (↑) | iSUN FPR@95 (↓) | Texture AUROC (↑) | Texture FPR@95 (↓) | Places365 AUROC (↑) | Places365 FPR@95 (↓) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 / ResNet-18 | VOS | 96.37 | 15.69 | 93.82 | 27.64 | 94.87 | 30.42 | 93.68 | 32.68 | 91.78 | 37.95 |
| | NPOS | 97.64 | 5.61 | 97.52 | 4.08 | 94.92 | 14.13 | 94.67 | 8.39 | 91.35 | 18.57 |
| | DCM-Softmax | 99.7 (0.1) | 0.4 (0.3) | 98.6 (0.8) | 6.6 (3.0) | 99.7 (0.1) | 0.6 (0.2) | 97.1 (0.2) | 14.8 (0.3) | 92.4 (0.3) | 32.6 (2.1) |
| | DCM-MaxLogit | 99.8 (0.1) | 0.3 (0.1) | 98.7 (0.7) | 6.0 (3.0) | 99.8 (0.1) | 0.5 (0.2) | 97.1 (0.2) | 14.9 (0.4) | 92.5 (0.3) | 34.4 (2.1) |
| | DCM-Energy | 99.8 (0.1) | 0.1 (0.1) | 98.8 (0.7) | 5.3 (3.6) | 99.8 (0.1) | 0.1 (0.1) | 97.1 (0.2) | 16.1 (1.1) | 92.5 (0.3) | 35.6 (2.2) |
| CIFAR-100 / ResNet-34 | VOS | 73.11 | 78.50 | 85.72 | 59.05 | 82.66 | 72.45 | 80.08 | 75.35 | 75.85 | 84.55 |
| | NPOS | 97.84 | 11.14 | 82.43 | 56.27 | 85.48 | 51.72 | 92.44 | 35.20 | 71.30 | 79.08 |
| | Dream-OOD | 87.01 | 58.75 | 95.23 | 24.25 | 99.73 | 1.10 | 88.82 | 46.60 | 79.94 | 70.85 |
| | DCM-Softmax | 99.3 (0.2) | 1.8 (0.9) | 98.6 (0.3) | 8.7 (1.9) | 99.3 (0.2) | 2.3 (1.4) | 88.5 (0.5) | 46.6 (2.7) | 78.6 (0.4) | 67.7 (2.3) |
| | DCM-MaxLogit | 99.3 (0.2) | 2.0 (1.1) | 98.7 (0.2) | 7.3 (1.9) | 99.3 (0.2) | 2.3 (1.4) | 88.5 (0.5) | 46.9 (3.0) | 78.6 (0.4) | 68.8 (1.8) |
| | DCM-Energy | 99.2 (0.2) | 2.2 (1.2) | 98.9 (0.2) | 5.2 (2.0) | 99.4 (0.2) | 1.9 (0.8) | 89.3 (0.6) | 48.6 (3.3) | 78.3 (0.6) | 68.9 (1.9) |
Comment

Additionally, DCM significantly outperforms VOS [4] and NPOS [5] with larger datasets and model architectures. We ran additional OOD detection experiments using a pretrained CLIP ViT-B/16 with ImageNet-1K as ID.

| Methods | iNaturalist FPR@95 (↓) | iNaturalist AUROC (↑) | SUN FPR@95 (↓) | SUN AUROC (↑) | Places FPR@95 (↓) | Places AUROC (↑) | Textures FPR@95 (↓) | Textures AUROC (↑) |
|---|---|---|---|---|---|---|---|---|
| Fort et al., MSP | 54.05 | 87.43 | 73.37 | 78.03 | 72.98 | 78.03 | 68.85 | 79.06 |
| VOS | 31.65 | 94.53 | 43.03 | 91.92 | 41.62 | 90.23 | 56.67 | 86.74 |
| VOS+ | 28.99 | 94.62 | 36.88 | 92.57 | 38.39 | 91.23 | 61.02 | 86.33 |
| NPOS | 16.58 | 96.19 | 43.77 | 90.44 | 45.27 | 89.44 | 46.12 | 88.80 |
| DCM-Softmax | 2.6 (0.5) | 99.2 (0.1) | 32.9 (1.5) | 94.2 (0.2) | 35.9 (1.8) | 93.8 (0.3) | 11.2 (1.0) | 97.9 (0.1) |
| DCM-MaxLogit | 1.8 (0.4) | 99.4 (0.1) | 27.5 (1.4) | 94.9 (0.2) | 32.5 (2.8) | 94.5 (0.3) | 8.2 (0.8) | 98.3 (0.1) |
| DCM-Energy | 0.5 (0.2) | 99.6 (0.1) | 24.5 (1.7) | 95.8 (0.2) | 30.8 (3.0) | 95.4 (0.3) | 4.3 (0.6) | 98.8 (0.1) |

I am confused about the definition of "unknown" in selective classification.

“Unknown” is a term we use to talk about the two problem settings in a unified way: (1) for OOD detection, it is the OOD data, and (2) for selective classification, it is misclassified validation set examples. For selective classification, misclassified examples are "unknown" to the model because they indicate a gap in the model's knowledge due to insufficient coverage of the training distribution.

Propositions 4.1 and 4.2 then sound like they target OOD detection with the conclusion "DCM provably detects unknown examples". Here I presume that "detects unknown examples" solely means "detects OOD examples", excluding the "unknown" examples in the selective classification task.

By “detection”, we mean that the confidence values of a model trained with the DCM loss are high for known examples and low for unknown examples. This property is relevant to both the OOD detection and selective classification problem settings.

The authors need to well-articulate the notations before these are used. For example, what is $P_{ID}$? It is not friendly for readers not familiar with this topic.

$P_{ID}$ represents the distribution of in-distribution, or “known”, examples. It is the distribution of the training data. We’ve revised Section 2.2 to introduce this notation.

Please add the appendix reference for the definition of $\delta$-neighborhoods in the comment after Proposition 4.1.

Thank you for the suggestion; we have added a pointer to the appendix when $\delta$-neighborhoods are first introduced in the main text.

What does $i$ in (4) stand for?

$i$ was not needed in (4) and the proof; this was something we wrote in an earlier version of the proposition but forgot to delete. Thank you for catching this!

The uniform label distribution used for "unknown" observations is ad-hoc. It sounds like you are assuming all unknown examples overlap together and hence you cannot clearly distinguish them. However, what if there are only several overlapped classes? For example, some unknown observations from class 1 overlap with class 2, but they are disjoint with other classes. In this case, it is not appropriate to give non-zero probability to generate other “pseudo” labels for these unknown examples from class 1. Moreover, if this unknown example is an OOD point that is unlike any of the "known" classes, does it still make sense to give a non-zero probability to be labeled as "known" classes?

We respectfully disagree with the claim that the uniform label distribution is ad-hoc. While we agree that it is possible for classes to have fine-grained relationships as the reviewer mentions, we are not assuming such additional knowledge. The uniform label distribution is the maximum entropy distribution among categorical distributions, and thus, regularizing towards it is a natural choice for increasing uncertainty. We note that regularizing towards the uniform distribution is a standard choice that has been successful in several prior works [1,2,3].
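As a rough illustration of this kind of regularizer (a sketch under our own assumptions, not necessarily the exact loss used in the paper), minimizing the cross-entropy between the uniform distribution and the model's softmax output pushes predictions on uncertainty-set inputs toward maximum entropy:

```python
# Sketch of a confidence-minimization term toward the uniform label distribution.
import torch.nn.functional as F

def uniform_confidence_loss(logits):
    # Cross-entropy between the uniform distribution over C classes and the
    # model's prediction; up to an additive constant this equals KL(U || p_theta).
    log_probs = F.log_softmax(logits, dim=-1)
    return -log_probs.mean(dim=-1).mean()  # average over classes, then over the batch
```

Combined with the standard cross-entropy on labeled data, this yields an objective of the form $\mathcal{L}_{xent} + \lambda\mathcal{L}_{conf}$, where $\lambda$ controls how strongly confidence on the uncertainty set is suppressed.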

What is (5) used for?

(5) in the appendix states the DCM objective before we move on to the propositions and proofs. It is implicitly used in all later equations that involve the loss (6, 9, 10, …).

Comment

What is $M$ in (15)?

$M$ refers to the number of classes. We realize that this was a duplicate notation; $M$ was used in an earlier version, and we did not revise it. We have replaced $M$ with $C$ in the text.

Proposition A.1: I think the dimension of $p$ is $C + 1$ when it comes to OOD detection (as the authors mentioned $p$ is the true label distribution). However, the dimension of $s$ is $C$ as the authors explicitly showed. Then is that legitimate for the expression $s - p$ and $s - \frac{1}{C}$ in (6)?

The dimensionality of $p$ is always $C$, as Appendix A states throughout. Our analysis operates in a setting where $p$ assigns a $C$-way categorical distribution to all possible inputs, including OOD ones. Note that this is consistent with the evaluation of ours and many prior OOD detection papers, which uses the confidence of a $C$-way prediction to separate ID inputs from OOD ones.
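The DCM-Softmax, DCM-MaxLogit, and DCM-Energy variants in the tables above presumably correspond to the standard confidence scores computed from the same $C$-way logits; a generic sketch of these scores (matching their usual definitions in the OOD detection literature, not the authors' code):

```python
# Standard C-way confidence scores used to separate ID from OOD inputs
# (higher score = more ID-like).
import torch
import torch.nn.functional as F

def msp_score(logits):        # maximum softmax probability
    return F.softmax(logits, dim=-1).max(dim=-1).values

def max_logit_score(logits):  # maximum unnormalized logit
    return logits.max(dim=-1).values

def energy_score(logits):     # negative free energy; larger on ID inputs
    return torch.logsumexp(logits, dim=-1)

# An input is flagged as OOD when its score falls below a chosen threshold.
```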

Please consistently add a comma after "i.e."

Thanks for the suggestion: we’ve added commas after all instances of “i.e.” in the revised version (in red text).

[1] "Deep anomaly detection with outlier exposure." ICLR 2019

[2] "When does label smoothing help?." NeurIPS 2019

[3] "Conservative q-learning for offline reinforcement learning." NeurIPS 2020

[4] VOS: Learning What You Don't Know by Virtual Outlier Synthesis, ICLR 2022

[5] Non-parametric Outlier Synthesis, ICLR 2023

[6] Dream the Impossible: Outlier Imagination with Diffusion Models, NeurIPS 2023

[7] "Selective Classification for Deep Neural Networks." NeurIPS 2017

[8] "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks." ICLR 2018

[9] "Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks." ICLR 2019

[10] "Deep Gamblers: Learning to Abstain with Portfolio Theory." NeurIPS 2019

[11] "Self-Adaptive Training: beyond Empirical Risk Minimization." NeurIPS 2020

[12] "On the Opportunities and Risks of Foundation Models," arXiv 2021

Comment

In Algorithm 2, since there are no "easily misclassified" examples with known labels like the validation data in Algorithm 1, why not just train the model based on $L_{xent} + \lambda L_{conf}$? In other words, is there any necessity for the prior step to optimize $L_{xent}$?

The primary reason we first train with only $L_{xent}$ is computational efficiency: the fine-tuning step is much shorter than the pre-training steps, and the ratio of compute required becomes smaller as we use larger and larger datasets. When we train one epoch using $L_{xent} + \lambda L_{conf}$, the model essentially sees 2x the number of images compared to training using $L_{xent}$ only, and hence takes 2x the compute, since it sees an equal number of images from the training and auxiliary datasets. This also lets us take one ID pre-trained model and adapt it relatively quickly to different test distributions, avoiding the costly pre-training process for each different test distribution. We note that our method is resonant with the philosophy of using large pre-trained foundation models and fine-tuning them for individual downstream tasks [12]. Indeed, as the reviewer notes, it is possible to train the model based on $L_{xent} + \lambda L_{conf}$ alone. We have added experimental results in Appendix K for this: in general, we see that training directly from scratch using $L_{xent} + \lambda L_{conf}$ leads to slightly lower ID accuracy but comparable performance on OOD detection tasks. This is possibly because the cross-entropy loss on the ID training set and the confidence minimization loss on the ID examples within the unlabeled uncertainty set work in opposite directions, making learning the ID classification task harder. In contrast, if we fine-tune an ID pre-trained model, the model has already learned the ID task, and the fine-tuning step only further modifies the decision boundary, maintaining higher ID performance.

| ID Dataset | OOD Dataset | Classification Accuracy (Fine-tune) | Classification Accuracy (Pre-train) | OOD Detection AUROC (Fine-tune) | OOD Detection AUROC (Pre-train) | OOD Detection FPR@95 (Fine-tune) | OOD Detection FPR@95 (Pre-train) |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | SVHN | 90.6 | 93.7 | 99.7 | 99.7 | 0.8 | 0.4 |
| | TinyImageNet | 90.4 | 93.5 | 99.3 | 99.3 | 2.2 | 2.6 |
| | LSUN | 90.6 | 93.7 | 99.2 | 99.8 | 2.5 | 0.5 |
| | iSUN | 90.2 | 93.5 | 99.5 | 99.7 | 1.4 | 0.6 |
| CIFAR-100 | SVHN | 69.3 | 71.4 | 99.6 | 99.6 | 0.4 | 0.6 |
| | TinyImageNet | 68.1 | 71.1 | 99.0 | 98.7 | 3.2 | 5.9 |
| | LSUN | 69.0 | 71.0 | 99.7 | 99.5 | 0.7 | 1.1 |
| | iSUN | 68.1 | 71.2 | 99.4 | 99.1 | 2.0 | 2.7 |
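For illustration, here is a hypothetical sketch of the two-stage procedure discussed above (cross-entropy pre-training followed by a short DCM fine-tuning phase). Function and loader names are our own, and the confidence term is written as cross-entropy to the uniform distribution as one plausible instantiation:

```python
# Hypothetical sketch of pre-training with L_xent followed by fine-tuning with
# L_xent + lambda * L_conf; not the authors' implementation.
import torch
import torch.nn.functional as F

def pretrain(model, id_loader, optimizer, epochs):
    for _ in range(epochs):
        for x, y in id_loader:                 # labeled ID data
            loss = F.cross_entropy(model(x), y)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

def dcm_finetune(model, id_loader, unc_loader, optimizer, lam, epochs=1):
    for _ in range(epochs):
        for (x, y), x_unc in zip(id_loader, unc_loader):  # ID batch + uncertainty batch
            conf_loss = -F.log_softmax(model(x_unc), dim=-1).mean()  # push toward uniform
            loss = F.cross_entropy(model(x), y) + lam * conf_loss
            optimizer.zero_grad(); loss.backward(); optimizer.step()
```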

Equation (9) and the comment "resulting in a mixture between the true label distribution $p$ and the uniform distribution $U$, with mixture weight $\lambda$" after Proposition 4.2: Is this rearrangement (9) correct? Since $P_u = \alpha_{test} P_{ID} + (1 - \alpha_{test})P_{OOD}$, why is the mixture weight just $\lambda$, wouldn't there be an extra factor $\alpha_{test}$?

In our theoretical setup, we have assumed $\alpha_{test} = \frac{1}{2}$ for simplicity, which is mentioned at the start of Appendix A.1:

Let $D_u$ be an unlabeled test set where half the examples are sampled from $P_{ID}$, the other half are sampled from $P_{OOD}$.

Lemma A.3 also deals with the simplified transductive setup (which is generalized in Proposition A.4), where the ID examples in $D_u$ also appear in $D_{train}$. However, the more general version can be proved similarly, assuming $\alpha_{test}$ is absorbed into the constant $\lambda$.
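As a sketch of why $\alpha_{test}$ folds into the constant (our restatement under the stated assumptions, with $\ell_{conf}$ denoting the per-example confidence term, a symbol we introduce here):

$$\lambda\,\mathbb{E}_{x \sim \mathcal{P}_u}\big[\ell_{conf}(x;\theta)\big] = \lambda\,\alpha_{test}\,\mathbb{E}_{x \sim \mathcal{P}_{ID}}\big[\ell_{conf}(x;\theta)\big] + \lambda\,(1-\alpha_{test})\,\mathbb{E}_{x \sim \mathcal{P}_{OOD}}\big[\ell_{conf}(x;\theta)\big],$$

so by linearity of expectation the OOD component enters with effective weight $\lambda(1-\alpha_{test})$, which can simply be renamed $\lambda$.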

$\epsilon$ in Proposition 4.2 and the involved proof: The exact value of $\epsilon$ depends on the model performance or $L(\theta)$, how can we allow $\epsilon \leq \frac{1}{2N} \left(\frac{M - 1}{(1 + \lambda)M}\right)^2$ to conclude the inequality (15)?

The reviewer’s statement, “the exact value of $\epsilon$ depends on the model performance…”, is incorrect. $\epsilon$ is a free variable, and the lemma claims that there exists some $\epsilon$ such that (8) is true.

This is a standard trick in proving inequalities. We know that (14) holds for all $\epsilon > 0$. We can set $\epsilon$ to any positive value; by setting it to a value lower than $\frac{1}{2N} \left(\frac{M - 1}{(1 + \lambda)M}\right)^2$, we get equation (15).

Comment

We wanted to follow up to see if the response and revisions address your concerns. We are open to discussion if you have any additional questions or concerns, and if not, we kindly ask you to reevaluate your score. Thank you again for your reviews which helped to improve our paper!

Comment

Since the discussion period is ending soon, we wanted to check if you had any feedback based on our response. Specifically, let us know if we have addressed your concerns sufficiently. If you have any questions/suggestions based on our response, we are happy to discuss them.

Thank you for putting thought into our paper and for your valuable insights.

Comment

Sorry for the delayed response. I really appreciate the authors' effort in providing more empirical results. I still have the concerns below.

  • The assumption of “disjoint sets” does not convince me. In selective classification (some works of literature call it rejection) [1], those hard observations could intrinsically overlap within many classes. If they are disjoint, I do not think there is a necessity to bother selective classification since you have a clear separation to distinguish those classes. Even in OOD detection, the OOD class could be near-OOD [2] which is hard to detect.

  • “This property is relevant to both the OOD detection and selective classification problem settings.” By following your notation of $D_u$, which is exclusively introduced and used in sec. 3.2 (OOD detection), I get confused as to why this property holds for both tasks.

  • “uniform label distribution”: the uniform label distribution assumption theoretically helps to yield conclusions like Proposition 4.1. But it may sabotage other nice properties since you treat every class equally without explicitly using potential information conveyed, e.g., by $x$.

  • $\alpha_{test}=1/2$: This is another big concern of mine. The authors didn’t clearly disclose it in the main article, which may mislead readers to ignore this key parameter. Moreover, your loss function and the follow-up work based on this big assumption make the work convenient. Without this assumption, does the theorem still hold? If not, what is the alternative or general theorem? Moreover, does this assumption work well when the ground truth of $\alpha_{test}\neq \frac{1}{2}$? How about a sensitivity analysis of this parameter?

  • I do not agree with your statement on $\varepsilon$. On one hand, you say there “exists” $\epsilon>0$; on the other hand, you say $\epsilon$ is a free variable and “setting it to a value lower than ….” First of all, we do not know whether this kind of $\epsilon$ exists or not (you didn’t show it). Second, supposing there exists this kind of $\epsilon$, how could we let the existing value be lower than the value we want?

  • “The dimension of $p$”: Your presentation “Let $\mathcal{P}_{ID}$ be a distribution over $\mathcal{X}\times\{1,\cdots, C\}\subset\mathcal{X}\times\mathcal{Y}$, i.e., there are $C$ classes” indicates $C$ is the number of known/inlier classes. Now when it comes to OOD detection, there is an extra label for the OOD class. Then how could the label distribution $p$ have the same dimension as the number of inlier classes?

[1] Gangrade, Aditya, Anil Kag, and Venkatesh Saligrama. "Selective classification via one-sided prediction." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.

[2] Fort, Stanislav, Jie Ren, and Balaji Lakshminarayanan. "Exploring the limits of out-of-distribution detection." Advances in Neural Information Processing Systems 34 (2021): 7068-7081.

Comment

The assumption of “disjoint sets” does not convince me. In selective classification (some works of literature call it rejection) [1], those hard observations could intrinsically overlap within many classes. If they are disjoint, I do not think there is a necessity to bother selective classification since you have a clear separation to distinguish those classes. Even in OOD detection, the OOD class could be near-OOD [2] which is hard to detect.

Our analysis focuses on disjoint train and test distributions as a simplifying assumption. Note that even in the analysis, the unlabeled set overlaps with the training set; the fact that it’s still okay to minimize confidence on some training distribution inputs is the useful information here. Empirically, we thoroughly investigate scenarios that are much harder than disjoint settings, including near-OOD detection, and show that DCM consistently improves performance.

“This property is relevant to both the OOD detection and selective classification problem settings.” By following your notation of $D_u$, which is exclusively introduced and used in sec. 3.2 (OOD detection), I get confused as to why this property holds for both tasks.

The property holds for unknown examples, a notion that covers both the OOD detection and selective classification settings. To avoid confusion, we will edit our analysis section and proofs to say $D_{unc}$ instead of $D_u$.

“uniform label distribution”: the uniform label distribution assumption theoretically helps to yield conclusions like Proposition 4.1. But it may sabotage other nice properties since you treat every class equally without explicitly using potential information conveyed, e.g., by $x$.

The uniform distribution is not simply for theoretical convenience. It is a practical assumption that reflects high uncertainty when given no further information about the label. We could consider relationships between classes if we had such information, but we (like most other works in this area) do not make such assumptions. While any regularization could “sabotage other nice properties,” our experiments on many datasets demonstrate that this is not an issue in practice.

$\alpha_{test} = \frac{1}{2}$. This is another big concern of mine. The authors didn’t clearly disclose it in the main article, which may mislead readers to ignore this key parameter. Moreover, your loss function and the follow-up work based on this big assumption make the work convenient. Without this assumption, does the theorem still hold? If not, what is the alternative or general theorem?

The theorem still holds. The proof does not require any assumption on the value of $\alpha_{test}$. Changing $\alpha_{test}$ would only scale the rightmost term in Equation 9 by a constant factor, but the final lemma statement stays the same: a low DCM loss implies separation.

Moreover, does this assumption work well when the ground truth of $\alpha_{test} \neq \frac{1}{2}$? How about a sensitivity analysis of this parameter?

Our appendix includes ablation studies that change the ground-truth $\alpha_{test}$ (Fig. 4, left) and use a $\lambda$ that is misaligned with the ground-truth $\alpha_{test}$ (Fig. 2, right; Fig. 5, right). In all of these experiments, DCM was robust to this ratio.

I do not agree with your statement on $\epsilon$. On one hand, you say there “exists” $\epsilon > 0$; on the other hand, you say $\epsilon$ is a free variable and “setting it to a value lower than ….” First of all, we do not know whether this kind of $\epsilon$ exists or not (you didn’t show it). Second, supposing there exists this kind of $\epsilon$, how could we let the existing value be lower than the value we want?

This is a standard constructive proof of existence. We are proving that there exists some $\epsilon > 0$ such that the statement of the lemma is true. Everything up until equation (14) holds for all $\epsilon > 0$. Since we know that any $\epsilon \leq \frac{1}{2N} \left(\frac{M - 1}{(1 + \lambda)M}\right)^2$ would satisfy the statement, we have proven that there exists some $\epsilon > 0$ that satisfies the statement. For more on “let” in the context of constructive existence proofs, see the first example here, where “let n = 6” proves that such an n exists: https://users.math.msu.edu/users/duncan42/Recitation7.pdf

Comment

“The dimension of $p$”: Your presentation “Let $P_{ID}$ be a distribution over $\mathcal{X} \times \{1, \ldots, C\} \subset \mathcal{X} \times \mathcal{Y}$, i.e., there are $C$ classes” indicates $C$ is the number of known/inlier classes. Now when it comes to OOD detection, there is an extra label for the OOD class. Then how could the label distribution $p$ have the same dimension as the number of inlier classes?

Most of the OOD detection literature does not consider a separate $(C+1)$-th class. Instead, the goal is to have the predictions be close to uniform on OOD inputs. This is a standard choice in the OOD detection literature [1,2,3] and has the advantage of not needing architectural modifications. Our work follows this convention.

[1] "Enhancing the reliability of out-of-distribution image detection in neural networks." arXiv preprint arXiv:1706.02690 (2017).

[2] "Deep anomaly detection with outlier exposure." ICLR 2019

[3] "A simple unified framework for detecting out-of-distribution samples and adversarial attacks." Advances in neural information processing systems 31 (2018).

Review (Rating: 5)

This paper proposes a data-driven method to penalize over-confident predictions on unknown samples. Specifically, the authors suggest that auxiliary datasets containing unknown samples should be mixed with the original training dataset to obtain conservative predictions. In addition, the authors propose a two-stage training scheme: in the first stage, the model is trained with the training data; then the auxiliary dataset is combined with the training data to train the model with a loss composed of a cross-entropy term and a regularizer. To further understand the training scheme, the authors provide a theoretical analysis suggesting that the proposed method yields a prediction confidence that is always lower than the true confidence; additionally, according to the analysis, a known sample tends to be given larger confidence. To verify the proposed method, extensive experiments are conducted: the method is validated on selective classification and OOD detection across several image classification datasets.

Strengths

This paper proposes to use an auxiliary dataset combined with a penalized loss function to reduce confidence on unseen samples. To further understand the proposed method, the authors analyze it theoretically and obtain two reasonable interpretations. To validate its efficacy, several datasets are selected for experiments against different counterpart methods; the proposed method shows good performance on selective classification as well as OOD detection. An ablation study of the components of the method is also given for further analysis, and the authors show the prediction histogram to validate the proposition.

Weaknesses

It seems that Figure 2 is not in the correct order. Observing Table 1, we can find a performance drop in the IID setting for relatively simple datasets that can achieve classification accuracy of more than 99%, and we can also find an enhancement in the OOD setting. However, on relatively hard settings like FMoW, the IID performance is enhanced by the proposed method, but the OOD and IID+OOD performance does not show a significant gap compared with other methods. Taking the loss into consideration, I am wondering whether the key is to use a strong regularization on the training set and whether the enhancement for OOD comes at the cost of a performance drop in the IID setting. Could the authors show the result of adding strong label smoothing or similar regularization during pre-training for a more complete comparison? In addition, could the authors show results on larger datasets? I am wondering whether the regularization still works as the classification task becomes harder.

Questions

Please refer to weakness.

Comment

Thank you for your thoughtful feedback. We address your comments below. Please let us know if you have any remaining questions or concerns.

It seems that Figure 2 is not in the correct order.

Thanks for catching this; we have fixed the caption and associated text in the paper related to this.

Observing Table 1, we can find a performance drop in the IID setting for relatively simple datasets that can achieve classification accuracy of more than 99%, and we can also find an enhancement in the OOD setting. However, on relatively hard settings like FMoW, the IID performance is enhanced by the proposed method, but the OOD and IID+OOD performance does not show a significant gap compared with other methods.

First, there is an inherent tension between ID and OOD performance. Many interventions for making models more robust (in different senses of the word) improve OOD performance at the cost of a slight drop in ID performance. We note that Camelyon17 is a bigger dataset than FMoW (455k vs 141k images) and involves a more severe drop in ID->OOD performance: the Acc@90 for MSP drops by 21.6% in Camelyon vs 7.4% in FMoW. In this challenging setting, we see substantial benefits over existing works. While other methods are also competitive on FMoW, DCM is consistently among the highest performing in all settings involving OOD data.

‘Could the authors show the result of adding strong label smoothing or similar regularization during pre-training for a more complete comparison?’

Thank you for the suggestion. We ran additional experiments comparing pre-training with label smoothing for both selective classification and OOD detection. Adding label smoothing to pre-training does not enhance performance.

Below, we compare selective classification performance of DCM with an MSP classifier on FMoW pre-trained with varying degrees of label smoothing.

(Selective classification AUC on FMoW)

| | 0 | 0.25 | 0.5 | 0.75 | 1 | DCM |
|---|---|---|---|---|---|---|
| ID | 81.3 | 81.0 | 80.7 | 71.0 | 59.6 | 82.9 |
| ID+OOD | 77.1 | 76.9 | 76.4 | 67.8 | 52.5 | 78.9 |
| OOD | 74.5 | 74.3 | 74.2 | 64.0 | 55.1 | 76.4 |

We also conducted experiments comparing pre-training with label smoothing and weight decay as forms of regularization for OOD detection, detailed in Appendix K of the revised paper. We present results for label smoothing with CIFAR-100 as the ID dataset below; pre-training with label smoothing does not bridge the performance gap to DCM.

(OOD detection AUROC on CIFAR-100 as ID)

| | 0 | 0.25 | 0.5 | 0.75 | 1 | DCM |
|---|---|---|---|---|---|---|
| SVHN | 77.7 | 74.9 | 67.2 | 76.4 | 50.0 | 99.7 |
| LSUN | 68.5 | 56.8 | 61.7 | 63.3 | 50.0 | 99.5 |
| TinyImageNet | 68.0 | 59.4 | 62.4 | 58.5 | 50.0 | 98.7 |
| iSUN | 67.1 | 55.1 | 61 | 59.2 | 50.0 | 99.1 |

We suspect that label smoothing during pre-training reduces the model’s ability to differentiate between classes, degrading selective classification and OOD detection performance.
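For reference, a minimal sketch of the label-smoothing pre-training baseline compared above, using the built-in option in recent PyTorch versions (the smoothing value is illustrative):

```python
# Label-smoothing cross-entropy baseline; eps corresponds to the label
# smoothing values in the table columns above (eps = 0 is standard cross-entropy).
import torch.nn.functional as F

def smoothed_xent(logits, labels, eps=0.25):
    # Targets become (1 - eps) * one_hot(label) + eps / C.
    return F.cross_entropy(logits, labels, label_smoothing=eps)
```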

In addition, could the authors show results on larger datasets? I am wondering whether the regularization still works as the classification task becomes harder.

Thank you for the suggestion. We ran our method with a ViT-B/16 model pre-trained on ImageNet-1K, and tested on 4 large OOD datasets, including iNaturalist and Places365. Our method outperforms all baselines on all 4 OOD datasets by a large margin, including several strong recent approaches such as VOS [4] and NPOS [5].

| Methods | iNaturalist FPR@95 (↓) | iNaturalist AUROC (↑) | SUN FPR@95 (↓) | SUN AUROC (↑) | Places FPR@95 (↓) | Places AUROC (↑) | Textures FPR@95 (↓) | Textures AUROC (↑) |
|---|---|---|---|---|---|---|---|---|
| Fort et al., MSP | 54.05 | 87.43 | 73.37 | 78.03 | 72.98 | 78.03 | 68.85 | 79.06 |
| VOS | 31.65 | 94.53 | 43.03 | 91.92 | 41.62 | 90.23 | 56.67 | 86.74 |
| VOS+ | 28.99 | 94.62 | 36.88 | 92.57 | 38.39 | 91.23 | 61.02 | 86.33 |
| NPOS | 16.58 | 96.19 | 43.77 | 90.44 | 45.27 | 89.44 | 46.12 | 88.80 |
| DCM-Softmax | 2.6 (0.5) | 99.2 (0.1) | 32.9 (1.5) | 94.2 (0.2) | 35.9 (1.8) | 93.8 (0.3) | 11.2 (1.0) | 97.9 (0.1) |
| DCM-MaxLogit | 1.8 (0.4) | 99.4 (0.1) | 27.5 (1.4) | 94.9 (0.2) | 32.5 (2.8) | 94.5 (0.3) | 8.2 (0.8) | 98.3 (0.1) |
| DCM-Energy | 0.5 (0.2) | 99.6 (0.1) | 24.5 (1.7) | 95.8 (0.2) | 30.8 (3.0) | 95.4 (0.3) | 4.3 (0.6) | 98.8 (0.1) |

We also note that our evaluations for selective classification included two sizable datasets: Camelyon-17 (around 450,000 examples), and FMoW (around 150,000 examples, 63 classes).

Comment

We wanted to follow up to see if the response and revisions address your concerns. We are open to discussion if you have any additional questions or concerns, and if not, we kindly ask you to reevaluate your score. Thank you again for your reviews which helped to improve our paper!

Comment

Since the discussion period is ending soon, we wanted to check if you had any feedback based on our response. Specifically, let us know if we have addressed your concerns sufficiently. If you have any questions/suggestions based on our response, we are happy to discuss them.

Thank you for putting thought into our paper and for your valuable insights.

Comment

I appreciate the response, but I will keep my score.

Review (Rating: 6)

The paper proposes the Data-Driven Confidence Minimization (DCM) framework for detecting unknown inputs in safety-critical machine learning applications. By minimizing model confidence on an uncertainty dataset, DCM achieves provable detection of unknown test examples. Experimental results demonstrate that DCM outperforms existing approaches in selective classification and out-of-distribution detection tasks.

Strengths

Overall I think the paper is well motivated and well written. The proposed method is well motivated by the theoretical analysis, and the empirical performance is convincing.

Weaknesses

  • For the theoretical part, what can we say when the following does not hold: "(1) if the auxiliary set contains unknown examples similar to those seen at test time, confidence minimization leads to provable detection of unknown test examples"? Specifically, what if the auxiliary set DOES NOT contain unknown examples similar to those seen at test time? It is important to know the theoretical property in this case.

  • The experiment results are shown on CIFAR. I would be more interested in experiments on larger-scale datasets with foundation models, for example, CLIP on ImageNet. Nowadays, the interest of the community has shifted to foundation models. I believe the paper can benefit from this aspect.

Questions

See weakness

Comment

Thank you for your thoughtful feedback. We address your comments below. Please let us know if you have any remaining questions or concerns.

For the theoretical part, what can we say when the following does not hold: "(1) if the auxiliary set contains unknown examples similar to those seen at test time, confidence minimization leads to provable detection of unknown test examples"? Specifically, what if the auxiliary set DOES NOT contain unknown examples similar to those seen at test time? It is important to know the theoretical property in this case.

In the absence of any assumptions about how the auxiliary dataset relates to the test distribution, we cannot make any theoretical claims about the effect of confidence minimization on test inputs. In this scenario, Proposition A.1 still informs us about the resulting model’s behavior on the auxiliary data distribution: DCM will make the model more conservative in the sense that the predictive probability will be upper-bounded by the true probability distribution. How this affects the test distribution is a matter of generalization from auxiliary to test, which we would need further assumptions to study. We note that, empirically, unrelated auxiliary distributions still help in making the model more conservative overall [1,2], though not as much as the more relevant auxiliary data used in our work.

‘The experiment results are shown on CIFAR…’

We thank the reviewer for their suggestions. We have run DCM on a ViT-B/16 model pre-trained on ImageNet-1K, and tested on 4 large OOD datasets, including iNaturalist and Places365. Our method outperforms all baselines on all 4 OOD datasets, including recent baselines such as VOS [3] and NPOS [4], by a large margin.

| Methods | iNaturalist FPR@95 (↓) | iNaturalist AUROC (↑) | SUN FPR@95 (↓) | SUN AUROC (↑) | Places FPR@95 (↓) | Places AUROC (↑) | Textures FPR@95 (↓) | Textures AUROC (↑) |
|---|---|---|---|---|---|---|---|---|
| Fort et al., MSP | 54.05 | 87.43 | 73.37 | 78.03 | 72.98 | 78.03 | 68.85 | 79.06 |
| VOS | 31.65 | 94.53 | 43.03 | 91.92 | 41.62 | 90.23 | 56.67 | 86.74 |
| VOS+ | 28.99 | 94.62 | 36.88 | 92.57 | 38.39 | 91.23 | 61.02 | 86.33 |
| NPOS | 16.58 | 96.19 | 43.77 | 90.44 | 45.27 | 89.44 | 46.12 | 88.80 |
| DCM-Softmax | 2.6 (0.5) | 99.2 (0.1) | 32.9 (1.5) | 94.2 (0.2) | 35.9 (1.8) | 93.8 (0.3) | 11.2 (1.0) | 97.9 (0.1) |
| DCM-MaxLogit | 1.8 (0.4) | 99.4 (0.1) | 27.5 (1.4) | 94.9 (0.2) | 32.5 (2.8) | 94.5 (0.3) | 8.2 (0.8) | 98.3 (0.1) |
| DCM-Energy | 0.5 (0.2) | 99.6 (0.1) | 24.5 (1.7) | 95.8 (0.2) | 30.8 (3.0) | 95.4 (0.3) | 4.3 (0.6) | 98.8 (0.1) |

[1] Deep Anomaly Detection with Outlier Exposure, ICLR 2019

[2] Energy-based Out-of-distribution Detection, NeurIPS 2020

[3] VOS: Learning What You Don't Know by Virtual Outlier Synthesis, ICLR 2022

[4] Non-parametric Outlier Synthesis, ICLR 2023

Comment

We wanted to follow up to see if the response and revisions address your concerns. We are open to discussion if you have any additional questions or concerns, and if not, we kindly ask you to reevaluate your score. Thank you again for your reviews which helped to improve our paper!

Comment

Since the discussion period is ending soon, we wanted to check if you had any feedback based on our response. Specifically, let us know if we have addressed your concerns sufficiently. If you have any questions/suggestions based on our response, we are happy to discuss them.

Thank you for putting thought into our paper and for your valuable insights.

Comment

Thank you for the detailed response.

Review (Rating: 5)

This paper proposes a method called Data-Driven Confidence Minimization (DCM) for OOD detection and selective classification (i.e., a reject option). The method builds on Outlier Exposure, using different uncertainty datasets. For selective classification, the uncertainty dataset is misclassified examples in a val set. For OOD detection, the uncertainty dataset is a potential mixture of in-distribution and OOD data. The paper includes a proof that having a noisy uncertainty dataset in this manner still allows for separating ID and OOD examples. In experiments, DCM performs well compared to several OOD detection baselines including OE.

Strengths

  • Good empirical results
  • The paper is well-written and easy to follow

Weaknesses

  • There isn't much technical novelty on top of OE. The method is mainly about selecting a new uncertainty dataset, which is a fine direction to explore, but the approach is technically simple and possibly not substantial enough for ICLR.

  • The proof seems fairly obvious; it seems to be saying that datasets are separable even when the training data are noisy. I may have missed some details, but surely this is already well-known and a standard result in learning theory. I'm worried that this proof doesn't contribute new knowledge to the field and may give rise to a false impression.

  • There are numerous more recent baselines, e.g., Virtual Outlier Synthesis. It would be good to include some of these.

Questions

N/A

Comment

Thank you for your thoughtful feedback. We address your comments below. Please let us know if you have any remaining questions or concerns.

‘There isn't much technical novelty on top of OE…’

While our method is similar to OE in that both methods minimize confidence, we note the following key differences between our work and OE:

  • We analyze the role of the uncertainty set both theoretically and empirically.
  • We extend this confidence minimization framework to selective classification, which prior works such as OE do not explore.
  • Our method outperforms prior works that share our data assumptions (WOODS, ICML 2022 [1]; ERD, UAI 2022 [2]) while being simpler and computationally cheaper to run: it requires fewer algorithm-specific hyperparameters than WOODS [1] (1 for DCM, 8 for WOODS) and 5x less compute than the ensemble-based ERD [2].

The proof seems fairly obvious; it seems to be saying that datasets are separable even when the training data are noisy. I may have missed some details, but surely this is already well-known and a standard result in learning theory. I'm worried that this proof doesn't contribute new knowledge to the field and may give rise to a false impression.

We believe that our analysis establishes a novel and useful result in the context of using auxiliary datasets to minimize confidence. We agree that the proof technique itself is straightforward; we are not claiming that the proof itself constitutes a theoretical advance. However, its application in the realm of using auxiliary datasets for confidence minimization is a novel contribution. We welcome any suggestions for relevant prior work to compare and cite. To our knowledge, our analysis in the context of conservative prediction hasn't been explored in existing literature.

‘There are numerous more recent baselines…’

Thank you for the suggestions.

We added comparisons to VOS [3], NPOS [4] and Dream-OOD [5] on CIFAR-10 and CIFAR-100 to Appendix H of our revised paper. DCM outperforms these methods by 2.5% on CIFAR-10 and 2.9% on CIFAR-100, in terms of OOD detection AUROC, averaged over 5 OOD datasets each.

| ID Dataset / Network | Method | SVHN AUROC (↑) | SVHN FPR@95 (↓) | LSUN (Crop) AUROC (↑) | LSUN (Crop) FPR@95 (↓) | iSUN AUROC (↑) | iSUN FPR@95 (↓) | Texture AUROC (↑) | Texture FPR@95 (↓) | Places365 AUROC (↑) | Places365 FPR@95 (↓) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 / ResNet-18 | VOS | 96.37 | 15.69 | 93.82 | 27.64 | 94.87 | 30.42 | 93.68 | 32.68 | 91.78 | 37.95 |
| | NPOS | 97.64 | 5.61 | 97.52 | 4.08 | 94.92 | 14.13 | 94.67 | 8.39 | 91.35 | 18.57 |
| | DCM-Softmax | 99.7 (0.1) | 0.4 (0.3) | 98.6 (0.8) | 6.6 (3.0) | 99.7 (0.1) | 0.6 (0.2) | 97.1 (0.2) | 14.8 (0.3) | 92.4 (0.3) | 32.6 (2.1) |
| | DCM-MaxLogit | 99.8 (0.1) | 0.3 (0.1) | 98.7 (0.7) | 6.0 (3.0) | 99.8 (0.1) | 0.5 (0.2) | 97.1 (0.2) | 14.9 (0.4) | 92.5 (0.3) | 34.4 (2.1) |
| | DCM-Energy | 99.8 (0.1) | 0.1 (0.1) | 98.8 (0.7) | 5.3 (3.6) | 99.8 (0.1) | 0.1 (0.1) | 97.1 (0.2) | 16.1 (1.1) | 92.5 (0.3) | 35.6 (2.2) |
| CIFAR-100 / ResNet-34 | VOS | 73.11 | 78.50 | 85.72 | 59.05 | 82.66 | 72.45 | 80.08 | 75.35 | 75.85 | 84.55 |
| | NPOS | 97.84 | 11.14 | 82.43 | 56.27 | 85.48 | 51.72 | 92.44 | 35.20 | 71.30 | 79.08 |
| | Dream-OOD | 87.01 | 58.75 | 95.23 | 24.25 | 99.73 | 1.10 | 88.82 | 46.60 | 79.94 | 70.85 |
| | DCM-Softmax | 99.3 (0.2) | 1.8 (0.9) | 98.6 (0.3) | 8.7 (1.9) | 99.3 (0.2) | 2.3 (1.4) | 88.5 (0.5) | 46.6 (2.7) | 78.6 (0.4) | 67.7 (2.3) |
| | DCM-MaxLogit | 99.3 (0.2) | 2.0 (1.1) | 98.7 (0.2) | 7.3 (1.9) | 99.3 (0.2) | 2.3 (1.4) | 88.5 (0.5) | 46.9 (3.0) | 78.6 (0.4) | 68.8 (1.8) |
| | DCM-Energy | 99.2 (0.2) | 2.2 (1.2) | 98.9 (0.2) | 5.2 (2.0) | 99.4 (0.2) | 1.9 (0.8) | 89.3 (0.6) | 48.6 (3.3) | 78.3 (0.6) | 68.9 (1.9) |
Comment

Additionally, DCM significantly outperforms VOS [3] and NPOS [4] with larger datasets and model architectures. We ran additional OOD detection experiments using a pretrained CLIP ViT-B/16 with ImageNet-1K as ID and iNaturalist, Places365, SUN and Textures as OOD datasets. The complete table is in Appendix I.

| Methods | iNaturalist FPR@95 (↓) | iNaturalist AUROC (↑) | SUN FPR@95 (↓) | SUN AUROC (↑) | Places FPR@95 (↓) | Places AUROC (↑) | Textures FPR@95 (↓) | Textures AUROC (↑) |
|---|---|---|---|---|---|---|---|---|
| Fort et al., MSP | 54.05 | 87.43 | 73.37 | 78.03 | 72.98 | 78.03 | 68.85 | 79.06 |
| VOS | 31.65 | 94.53 | 43.03 | 91.92 | 41.62 | 90.23 | 56.67 | 86.74 |
| VOS+ | 28.99 | 94.62 | 36.88 | 92.57 | 38.39 | 91.23 | 61.02 | 86.33 |
| NPOS | 16.58 | 96.19 | 43.77 | 90.44 | 45.27 | 89.44 | 46.12 | 88.80 |
| DCM-Softmax | 2.6 (0.5) | 99.2 (0.1) | 32.9 (1.5) | 94.2 (0.2) | 35.9 (1.8) | 93.8 (0.3) | 11.2 (1.0) | 97.9 (0.1) |
| DCM-MaxLogit | 1.8 (0.4) | 99.4 (0.1) | 27.5 (1.4) | 94.9 (0.2) | 32.5 (2.8) | 94.5 (0.3) | 8.2 (0.8) | 98.3 (0.1) |
| DCM-Energy | 0.5 (0.2) | 99.6 (0.1) | 24.5 (1.7) | 95.8 (0.2) | 30.8 (3.0) | 95.4 (0.3) | 4.3 (0.6) | 98.8 (0.1) |

[1] Training OOD Detectors in their Natural Habitats, ICML 2022

[2] Semi-supervised novelty detection using ensembles with regularized disagreement, UAI 2022

[3] VOS: Learning What You Don't Know by Virtual Outlier Synthesis, ICLR 2022

[4] Non-parametric Outlier Synthesis, ICLR 2023

[5] Dream the Impossible: Outlier Imagination with Diffusion Models, NeurIPS 2023

Comment

We wanted to follow up to see if the response and revisions address your concerns. We are open to discussion if you have any additional questions or concerns, and if not, we kindly ask you to reevaluate your score. Thank you again for your reviews which helped to improve our paper!

Comment

Since the discussion period is ending soon, we wanted to check if you had any feedback based on our response. Specifically, let us know if we have addressed your concerns sufficiently. If you have any questions/suggestions based on our response, we are happy to discuss them.

Thank you for putting thought into our paper and for your valuable insights.

Comment

Hello,

My apologies for the late reply. The new results are helpful, but I'm still a bit hung up on the simplicity of the method and the practical value of the proof. From what I can see, the main takeaway of the proof is "we then demonstrate that DCM can provably detect unknown examples similar to those in the uncertainty set with an appropriate threshold on predicted confidence", but this just seems obvious to me. If one is training against a specific dataset (uncertainty set), then surely examples similar to that dataset can be separated. I'm just not sure what this is telling us that most readers wouldn't already know.

With respect to other sources of novelty highlighted in the responses, the selective classification setting is interesting, but I'm not sure if applying something very similar to OE with a specific uncertainty set of misclassified examples is enough technical novelty. The results are good, but that always has to be weighed against other aspects of a paper, and I think overall I'm still not convinced enough by the response to increase my score. If the authors don't have time to reply, I'll wait a few days and reread the rebuttal and paper before finalizing my score.

Comment

Dear Reviewer sVZ1,

Thank you for your response.

We agree our method is simple, but we view simplicity as a strength, especially in light of the strong empirical performance. Many impactful machine learning methods are valued for being straightforward, as they're easier to use and build upon.

The proof's practical value is to show that if some OOD examples are included in the uncertainty set, then filtering out ID examples from the uncertainty dataset (as done by OE) is unnecessary.

Finally, we also note that the ICLR 2024 call for papers lists societal considerations including "fairness, safety, privacy" as specifically relevant topics. Considering the critical reliability and safety risks of machine learning model errors, we feel that our work showing substantial mitigation of this problem is very relevant and will be of great interest to the ICLR community.

We appreciate your feedback and hope this clarifies our perspective.

Comment

Strong Performance Gains w/o Auxiliary Dataset: Another practical benefit of our method is that it shows strong performance even without a separate unlabeled dataset and only the test dataset (e.g., transductive setting). As detailed in Appendix G of our initial submission, minimizing confidence directly on the test set outperforms all other OOD detection baselines. This means that if practitioners have an unlabeled test set containing both ID (“known”) and OOD (“unknown”) data, they can use our approach directly on this test data. They can then identify OOD inputs within the same test set using the DCM-fine-tuned model.

Theoretical Findings on Importance of Unlabeled Test Examples for Consistent OOD Detection Performance: Prior work [1] has shown that OOD detection methods vary in effectiveness across different (ID, OOD) datasets. As the support set of OOD inputs can be very large, achieving consistent OOD detection is challenging. We show that if one has access to unlabeled data similar to that at test time, it is possible to detect all OOD inputs. We believe that this theoretical guarantee on OOD detection performance is useful to the research community.

We hope this clarifies our contribution and are happy to discuss further.

[1] "No True State-of-the-Art? OOD Detection Methods are Inconsistent across Datasets," https://arxiv.org/abs/2109.05554

Comment

We thank all reviewers for their comments and constructive feedback. According to your comments, we have revised our paper and uploaded the new version with changes in red. Below, we summarize the changes we made:

  1. Novelty and Comparison with Outlier Exposure (OE):

We note the following novel contributions of our work:

  • We theoretically and empirically analyze the role of the uncertainty set.
  • We extend this confidence minimization framework to selective classification, which prior works such as OE do not explore.
  • Our method outperforms prior works which share our data assumptions (WOODS, ICML 2022 [1]; ERD, UAI 2022 [2]), while being simpler and computationally cheaper to run. Our method requires fewer algorithm-specific hyperparameters than WOODS [1] (1 for DCM, 8 for WOODS), and 5x less compute than the ensemble-based ERD [2].

Finally, we note that the ICLR 2024 call for papers lists societal considerations including "fairness, safety, privacy." Considering the critical reliability and safety risks of machine learning model errors, we feel that our work, which shows substantial mitigation of this problem, is very relevant and will be of strong interest to the ICLR community.

  2. Additional Baselines:
  • We added comparisons to recent OOD detection methods like Virtual Outlier Synthesis (VOS), Non-parametric Outlier Synthesis (NPOS), and Dream-OOD to Tables 9, 10 of our revised paper.
  • DCM outperforms all of these baselines by 2.5% on CIFAR-10 and 2.9% on CIFAR-100, in terms of OOD detection AUROC, averaged over 5 OOD datasets each.
  3. Large-Scale Experiments on ImageNet-1K with Vision Transformers:
  • DCM shows significant performance gains over all baselines on large-scale datasets and model architectures.
  • On an ImageNet-1K OOD detection task with a CLIP ViT-B/16 backbone, DCM outperforms the leading baseline by 22.9% in FPR@95.
  4. Comparison of DCM with Pre-training with Label Smoothing:
  • We added comparisons to pre-training with label smoothing for both selective classification and OOD detection on challenging classification tasks.
  • DCM outperforms pre-training with label smoothing in both settings. Label smoothing does not improve performance in either setting compared to standard pre-training.

Please see below for detailed responses to each reviewer.

[1] Training OOD Detectors in their Natural Habitats, ICML 2022

[2] Semi-supervised novelty detection using ensembles with regularized disagreement, UAI 2022

AC Meta-Review

This paper addresses the challenge of making conservative predictions with machine learning models in safety-critical applications, where it's important for models to abstain from making predictions on "unknown" inputs not well-represented in training data. The authors build upon prior work that minimizes model confidence on an auxiliary outlier dataset, and provide a theoretical analysis of the choice of auxiliary dataset for confidence minimization.

Two key insights emerge from this analysis: (1) confidence minimization can lead to provable detection of unknown test examples if the auxiliary set contains unknown examples similar to those seen at test time, and (2) it is unnecessary to filter out known examples for out-of-distribution (OOD) detection if the first condition is satisfied.

Based on these insights, the authors propose the Data-Driven Confidence Minimization (DCM) framework, which minimizes confidence on an uncertainty dataset. The DCM framework is applied to selective classification and OOD detection, both scenarios where conservative prediction is crucial. The authors also provide practical methods for collecting uncertainty data for these settings.

The experiments demonstrate that DCM consistently outperforms current selective classification approaches on four datasets when tested on unseen distributions. It also outperforms state-of-the-art OOD detection methods, including Outlier Exposure, on eight ID-OOD dataset pairs based on CIFAR-10 and CIFAR-100. This research thus provides significant improvements in handling "unknown" inputs in machine learning, particularly in safety-critical applications.

Why not a higher score

The reviewers raise various concerns about the technical contribution and novelty of the current draft, as well as the insufficiency of the experiments. The majority of reviewers suggest rejecting the paper, with only one rating slightly above the borderline, given with low confidence. Rejection is therefore recommended.

Why not a lower score

N.A.

Final Decision

Reject