Nearest neighbor-based out-of-distribution detection via label smoothing
Abstract
Reviews and Discussion
This paper proposes a statistic based on the KNN density for OOD; when a sample is generated out of the distribution, the statistic tends to be high, and vice versa.
Strengths
I can't fully assess the strengths of this paper, as the presentation is poor; there is not even a formal formulation of the considered OOD problem, given that OOD could be framed differently depending on the application.
Weaknesses
This paper is not well written, which largely prevents me from fully understanding the proposed method. For example, the drawbacks of using the neural network's softmax predictions for OOD are mentioned in the third paragraph of the introduction, but the description is hard to follow; in addition, it is unclear how the proposed method based on the k-NN density can overcome the drawback. The mathematical writing is not rigorous either; for instance, there is no explanation of the notation used in Def. 1. The message that Theorem 1 tries to convey is also hard to follow.
Questions
The paper claims the proposed method is unsupervised. Does that mean the users do not have access to the information of whether examples in a training set are from in- or out-of-distribution? If that is the case, then how do you compute the proposed statistic, which needs access to in-distribution data? On the other hand, if unsupervised OOD implies that labels are not provided, then how do you use label smoothing to train the network?
Thank you for the review.
RE: "the drawbacks of using softmax predictions". We clearly provide one reason in the third paragraph -- "This is largely because the softmax probabilities sum to 1 and thus must assign the probability weights accordingly." This does not apply to our method since there is no explicit constraint on the model's learned embeddings.
RE: notation. While we feel the notation is easy to follow, we will clarify all notation used in the camera-ready version to avoid any confusion.
RE: "unsupervised". In our setup, the user trains a machine learning model on labeled samples () and deploys it in the real world. Now, the model receives a query sample and the user would like to know whether this sample is from or not. Our method uses the user's trained model and . Many other OOD methods, such as POEM, require samples from an auxiliary distribution in order to discriminate between and the test distribution . A limitation of these methods is that for them to work well, often needs to be similar or representative of , but the latter is not known beforehand. Our method sidesteps this as it does not require any auxiliary data.
Thank you for your reply. I have re-evaluated the work and have a better understanding now. The core of the proposed method is similar to the likelihood ratio test, where the proposed "radius ratio" acts as a proxy for the likelihood ratio. Does Corollary 1 state the precision of such a "radius ratio"? The provided proof covers only Theorem 1, and deriving Corollary 1 from Theorem 1 does not look immediate to me. Can you provide a complete proof from Theorem 1 to Corollary 1?
We thank the reviewer for taking a closer look at the paper. Here is the proof:
Case 1: x is OOD. We classify example x as OOD if . We define examples with as OOD; since in Theorem 1 we established that implies , all OOD examples are identified.
Case 2: x is ID. Theorem 1 says that if , then . This means that we can only misclassify ID examples where . We have based on Assumption 2.
Therefore, the probability of misclassifying an ID example is at most
Note that there is a mistake in the statement in the paper, as it is missing a factor and a term in the exponent, both coming from Assumption 2. As written in the paper, Assumption 2 is used only in Corollary 1 (and not in Theorem 1), but we forgot to apply the parameters of this assumption to the bound in Corollary 1 when going from Theorem 1 to Corollary 1.
We thank the reviewer for bringing this up and for the careful read of our results and apologize for any confusion it may have caused.
The paper proposes a nearest neighbor-based method for out-of-distribution detection with label smoothing. The paper utilizes k-NN-based density estimation at intermediate layers to identify OOD samples. The proposed method is backed up by theoretical analysis that provides high-probability statistical results. The experiment results show that the proposed method, with and without label smoothing, is usually among the best-performing methods.
Strengths
The paper is well organized and clearly written. The paper finds a good angle that combines k-NN density estimation with label smoothing and provides a theoretical analysis to support it. The paper clearly states the assumptions for theoretical results. The experiment results show decent performance compared to several baselines.
Weaknesses
The comparison is not very convincing: DeConf performs poorly, SVM and isolation forest are methods that directly build classifiers on the embedding layers, and only POEM is a state-of-the-art method that adopts a different approach. It is also not clear why Control has better performance than the other baselines, given that the paper criticizes the softmax so much.
The example in Figure 1 does not make a convincing case. The overlap does not seem too different between the no-LS and 0.1-LS cases. It would be better if the shrinkage could be quantified for better understanding.
The paper touches on a "conclusion" that distance is better than label distribution for OOD detection, but there is no further analysis to support this or to discuss whether the two contradict each other.
Questions
- How does the method compare with another unsupervised method ([1] Yu, Qing, and Kiyoharu Aizawa. "Unsupervised out-of-distribution detection by maximum classifier discrepancy." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019) and a specific layer-based method ([2] Sun, Yiyou, Chuan Guo, and Yixuan Li. "ReAct: Out-of-distribution detection with rectified activations." Advances in Neural Information Processing Systems 34 (2021): 144-157)?
- Can the difference in Figure 1 be quantified?
- Can the theoretical results be quantified in experiments, or can some kind of upper bound be calculated that can be compared with the result? I'm asking because Corollary 1 provides an interesting quantity, namely the probability of falsely identifying in-distribution examples as out-of-distribution while all OOD examples are identified.
- Is there any more insight on the comparison between distance-based and label-distribution-based approaches?
We appreciate the detailed review.
- Yu et al. train a model with two classification heads and two losses: the usual supervised cross-entropy loss on the labeled in-distribution samples, and another loss that encourages maximum discrepancy between the two heads on unlabeled points. Points that the heads disagree on are deemed OOD. This method requires auxiliary (unlabeled) OOD points, and its performance degrades when the auxiliary samples are not from the test distribution. In contrast, our method does not require auxiliary samples. Meanwhile, ReAct of Sun et al. truncates the activations of the model's penultimate layer, where the threshold c is chosen via a percentile on ID data. Any existing OOD detection scheme can then be used with the modified model (they use energy scoring). ReAct is an interesting scheme; we will try to benchmark against it before the rebuttal deadline, and if not, we can do so upon acceptance for the camera-ready version.
- We understand how the separation may not be readily apparent. We could report true negative and true positive rates at a range of thresholds, but this might be too much. Would reporting the ROC-AUC (which is the same performance metric we use in experiments) for each chart help? A short sketch of how this could be computed is given after this list.
- It may be difficult to quantify the theoretical results, because many constants are unknown, such as the extent to which the Label Smoothed Embedding Hypothesis holds, as well as the density and the intrinsic dimension of the data. Rather, our theoretical result was motivated by the behavior we observed empirically.
- The datasets we use have 10, 100, and 1000 classes. We find that our distance-based k-NN does well across the spectrum, with very large improvements over the label-distribution-based Control when the in-distribution dataset is Fashion MNIST.
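Regarding the second point, here is a minimal sketch of how each Figure 1 chart could be summarized by a single ROC-AUC number; the use of scikit-learn and the variable names are assumptions for illustration, not code from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def chart_auc(id_scores, ood_scores):
    """ROC-AUC of separating ID from OOD using the per-example scores
    shown in one Figure 1 chart (a larger score means more likely OOD)."""
    labels = np.concatenate([np.zeros(len(id_scores)), np.ones(len(ood_scores))])
    scores = np.concatenate([id_scores, ood_scores])
    return roc_auc_score(labels, scores)
```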
Thank you for the response. However, I'm still curious about the last point in weaknesses: do the distance-based approach and label-distribution-based approach contradict each other, and is the conclusion that the distance-based approach is better purely empirical?
Thanks for the reply. We cannot formally argue that distance-based and label-distribution-based approaches contradict each other, and our conclusion that the k-NN distance-based approach is better is indeed empirical. In the introduction, we state one pitfall of softmax confidence scoring (label-distribution-based): "However, the majority of the works still ultimately use the neural network’s softmax predictions which suffers from the following weakness. The uncertainty in the softmax function cannot distinguish between the following situations: (1) the example is actually in-distribution but there is high uncertainty in its predictions and (2) the example is actually out of distribution. This is largely because the softmax probabilities sum to 1 and thus must assign the probability weights accordingly."
One ablation we could report is to run our k-NN method, but rather than using euclidean distance on intermediate embeddings, we use the softmax class probabilities as the embedding and choose a distance function for probability distributions, like Wasserstein or total variation. This ablation would highlight the significance of our aforementioned point. We ran some experiments like this, and, with the caveat that the experiments were quick and dirty, we found that using label-distribution embeddings indeed performs worse than (unnormalized) intermediate embeddings. If this ablation sounds interesting, we can add it in the final revision of the paper.
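To make the proposed ablation concrete, here is a hedged sketch of the comparison; the total-variation distance, the brute-force distance computation, and all names are assumptions for exposition rather than the exact code behind the quick experiments mentioned above.

```python
import numpy as np

def knn_radius(train_feats, query_feats, k=1, metric="euclidean"):
    """Distance from each query point to its k-th nearest training point."""
    if metric == "euclidean":
        d = np.linalg.norm(query_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    elif metric == "total_variation":  # for softmax probability vectors
        d = 0.5 * np.abs(query_feats[:, None, :] - train_feats[None, :, :]).sum(axis=-1)
    else:
        raise ValueError("unknown metric")
    return np.sort(d, axis=1)[:, k - 1]

# Ablation: (a) unnormalized intermediate embeddings with euclidean distance vs.
# (b) softmax class probabilities with a distance between distributions.
# emb_* and prob_* would come from forward passes of the trained network.
# score_emb  = knn_radius(emb_train,  emb_query,  metric="euclidean")
# score_prob = knn_radius(prob_train, prob_query, metric="total_variation")
```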
The paper's main claim is that combining k-NN with label smoothing training can improve OOD detection performance. The paper also proposes the use of multiple intermediate representations. The paper is mainly theoretical, with experiments on small-scale OOD detection benchmark datasets.
Strengths
- The proposed theory is presented rigorously with well-defined notation.
- The main point of the paper is well described.
Weaknesses
- The focus of the paper is not clear. The theory is focused on showing the merits of label smoothing training but the method also utilizes multiple latent representations. Is there any theory on the benefit of using multiple latent representations for kNN?
- The experiments have been conducted on only small-scale datasets. Can this claim hold on the ImageNet-1k scale dataset as well?
- The sufficient condition of Proposition 1 is explained only technically, not intuitively. Hence, I cannot really determine whether this main theoretical result depends on a very strong assumption, in which case the theory could be quite trivial.
Questions
- Could the authors give a very descriptive and intuitive summary of the main theory? For me, the main theoretical point is too obvious, since label smoothing makes the network learn 'better' representations [1,2,3], particularly enhancing the intra/inter-class discriminant ratio, thereby possibly improving the separation between ID and OOD in the latent space. Why is the theory not trivial?
- Please address the above weaknesses.
[1] Xu, Yi, et al. "Towards understanding label smoothing." arXiv preprint arXiv:2006.11653 (2020).
[2] Yuan, Li, et al. "Revisiting knowledge distillation via label smoothing regularization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[3] Jung, Yoon Gyo, et al. "Periocular recognition in the wild with generalized label smoothing regularization." IEEE Signal Processing Letters 27 (2020): 1455-1459.
Thank you for the review.
While there is no theory on using multiple latent representations, our combination is fairly simple -- for each layer, we normalize the query's k-NN score (where score is the k-NN density estimate) by the average in-distribution k-NN score, and then we average this quantity across layers. The normalization helps make the per-layer scores more comparable, since distances between embeddings can be larger in one layer than another. While the purpose of pooling across layers is to make the statistic more reliable (as is done in the Robust k-NN baseline), in our ablations, we show that using just a single layer (the penultimate one) does pretty well. In the single layer case, our statistic is equivalent to the k-NN score, since the normalizing constant does not matter.
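For concreteness, here is a minimal sketch of this pooled statistic; the use of scikit-learn's NearestNeighbors, the function name, and the default choice of k are illustrative assumptions rather than our exact implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pooled_knn_score(train_embs, query_embs, k=1):
    """train_embs / query_embs: lists with one (n_samples, dim) array per layer.
    Returns the layer-averaged, normalized k-NN radius for each query point."""
    per_layer = []
    for tr, qr in zip(train_embs, query_embs):
        nn = NearestNeighbors(n_neighbors=k).fit(tr)
        # k-NN radius of each query point w.r.t. the in-distribution embeddings
        query_radius = nn.kneighbors(qr)[0][:, -1]
        # average k-NN radius of the in-distribution points themselves
        # (query with k+1 neighbors and drop the zero-distance self-match)
        id_radius = NearestNeighbors(n_neighbors=k + 1).fit(tr).kneighbors(tr)[0][:, -1]
        # normalize so that scores from different layers are on a comparable scale
        per_layer.append(query_radius / id_radius.mean())
    return np.mean(per_layer, axis=0)  # larger value => more likely OOD
```

With a single layer, the normalization is a constant, so ranking by this score is equivalent to ranking by the raw k-NN radius, as noted above.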
RE: "ImageNet-1k scale". Thank you for the suggestion. Here are the results when when the training / in-distribution (ID) is CelebA and SVHN and the OOD dataset is the validation set of resized Imagenet-1k (https://www.tensorflow.org/datasets/catalog/imagenet_resized ; 32x32 size). In both cases, our proposed method does well, and it is a leading method among the baselines when the ID is CelebA.
Each cell shows the mean (stderr):

| In / Out | Control | kNN k=1 (0.1) | kNN k=1 (no LS) | Deconf | Robust kNN | SVM | Isolation Forest | POEM |
|---|---|---|---|---|---|---|---|---|
| in: CelebA / out: Imagenet | 0.6469 (0.0067) | 0.6841 (0.0225) | 0.5976 (0.013) | 0.5347 (0.0239) | 0.6477 (0.0092) | 0.4123 (0.0309) | 0.436 (0.0175) | 0.6884 (0.0352) |
| in: SVHN / out: Imagenet | 0.8249 (0.0155) | 0.8516 (0.004) | 0.7232 (0.0087) | 0.6928 (0.0176) | 0.8765 (0.0023) | 0.5599 (0.0171) | 0.416 (0.0071) | 0.8075 (0.0219) |
RE: "Proposition 1". We provided an intuitive explanation of Proposition 1 before the statement, which says “that under certain regularity conditions on the density and X, we have that the ratio of the k-NN distance between an out-of-distribution point and an in-distribution point increases after this mapping,” where mapping refers to the one implied by the Label Smoothed Embedding Hypothesis when mapping from embeddings obtained w/o label smoothing to embeddings w/ label smoothing. We will further clarify this in the paper.
RE: "summary of the theory". The main non-trivial assumption we make is the Label Smoothed Embedding Hypothesis, which we parameterize by and , and our results rigorously formalize the implications of these assumptions on the behavior of k-NN distances, which in turn implies that our k-NN based procedure is more effective under label smoothing. While the reviewer is correct that this isn’t a standard assumption, we stress that such an assumption is not unreasonable given that this phenomenon has been established empirically several times before. Our results require the use of theoretical techniques that were established less than a decade ago, and don’t immediately follow from known results.
This paper presents a new method for detecting out-of-distribution data. The method primarily relies on the k-NN radius and label smoothing to differentiate between in-distribution and out-of-distribution data. The authors provide both theoretical and experimental evidence to demonstrate the superiority of their method over many other baselines.
Strengths
- The application of the k-NN radius to differentiate OOD (Out-Of-Distribution) data seems like a novel idea to me.
- In the theoretical part, I read the statements and found the results they proved to be reasonable.
- The authors compare their method with many baselines, and improvements can be observed in most test cases.
- The paper is clearly written and easy to understand.
Weaknesses
- It appears that the authors' method underperforms compared to some baselines in certain test cases. Could you provide any explanation as to why this occurs with some datasets?
Questions
- In Proposition 1 on page 5, should there be a plus sign before x0 and proj(x) instead of a minus sign? To me, a plus sign seems more indicative of a contraction. Could you please clarify if I am misunderstanding anything?
Thank you for the thoughtful review.
The noteworthy cases where our method underperforms are when the (in, out) pair is (CIFAR10, SVHN) or (CelebA, SVHN). In these cases, Control and POEM do better than all other baselines. While it can be challenging to pinpoint a cause for these particular losses, we suspect that the SVHN examples produce intermediate activations similar to those of CIFAR10 and CelebA (which the backbone network is trained on).
RE: "plus sign before x0 and proj(x)". The reviewer is correct that it should be a plus sign before each of these. We thank the reviewer for such a detailed reading of the paper and we apologize for any confusion it may have caused.
We thank all reviewers for their feedback. While we have replied to each review individually, we'd like to note that following one reviewer's recommendation, we've added results for Imagenet-1k (1000 classes) to show the performance of our method when the OOD data comes from a large number of classes.
This paper uses nearest neighbor ideas to detect OOD examples. The paper is not very well written and it is hard to identify the clear contribution over existing work. The reviews are borderline and mixed, with the positive score coming from a very brief and frankly low-quality review. I encourage the authors to incorporate the reviewer comments in a future revision.
Why not a higher score
Poorly written paper
Why not a lower score
NA
Reject