Combining Statistical Depth and Fermat Distance for Uncertainty Quantification
Abstract
Reviews and Discussion
This paper introduces a new method for Out-of-Distribution (OoD) detection based on the concepts of Lens Depth and Fermat distance. The method checks whether a sample's representation in the penultimate layer of a Neural Network is similar to the representations of the training samples. It is subjected to various OoD-detection tests and is shown to be on par with or exceed alternative methods. Unlike many alternatives, the proposed method does not intrude on the training process of the model and therefore cannot have a negative impact on classification performance. Alternative methods assume a Gaussian distribution in the hidden representation, whereas the use of (a modification of) Lens Depth allows estimating the "similarity" of the sample without assuming a particular distribution.
Strengths
- The application of Fermat Distance and Lens Depth introduces mathematical concepts that are not common knowledge and not obvious to a Machine Learning audience. The application of these methods in OoD detection is new (originality)
- Previous literature is well cited, and the mathematical concepts are clearly and intuitively introduced, with clearly stated relevance (clarity). The claims made follow naturally from the evidence and are not overstated. The evaluation is in line with common practice in the field of OoD detection (quality)
- The paper is well written and consistently builds a clear argumentation (clarity)
- Mathematical concepts are introduced with both formalism, and an intuitive explanation (clarity).
- The proposed method is competitive with other methods, and is minimally invasive to the training process. This could be helpful when the training process is outside of one's control, for example for large pre-trained models (significance)
Weaknesses
- Some minor claims are not entirely accurate. Line 4 says there are "no assumptions" about the form of the distribution, but there are only minimal assumptions (see question 3). Line 262 claims that the proposed measure is a good measure of "uncertainty estimation", but it's only evaluated for OoD detection, so it may be wildly over/underconfident and behave poorly on aleatoric uncertainty. Line 323 conjectures that OoD detection may ensure fairness, but I see no reason why. Line 5 claims that the proposed method is applicable to any classification model, but the performance is only tested for Neural Networks (quality/clarity)
- The explanation of Lens Depth may be made more intuitive with a visualisation to support Lines 94-99 (clarity)
- Presented results are not substantially better than previous methods. The authors argue that the main benefit is that the proposed method is minimally invasive to the training process, but they do not make a strong case for why this is necessary (significance)
Questions
- How computationally expensive is LD after the improvements discussed in Section 4.5? Is it substantially faster/slower to do inference than e.g. DDU?
- In Figure 4.2 you show that the LD still works with 200 samples to claim that the method also works for small datasets. At what dataset size does the method start to fail, and how catastrophic is that? A plot like Figure 4.2.B with decreasing sizes of the dataset may give this insight.
- Consider Figure D.1. What if two of the “blobs” belong to cluster A and the last to cluster B, so that there are two classes (C=2) but in three clusters. Would LD then still behave as desired? If LD then gives undesirable results, wouldn’t you say that there is at least some assumption about the shape of the distributions?
- How would the model perform if the two moons have more spread, to the point that the two classes might touch/overlap? Is there "uncertainty" between the two classes? I understand this is not the point of OOD-detection, but it can be a point of UQ. This might be a 'limitation' worth mentioning. LD is good at OOD-detection, but not for the general task of uncertainty estimation. Specifically, Line 262 says that LD is a good measure for uncertainty estimation, but only OOD-detection and being monotonically decreasing with accuracy are demonstrated. Estimating heteroscedastic aleatoric uncertainty and uncertainty calibration are not tested, but are properties of good uncertainty estimation. On Line 264 "uncertainty quantification" is said, while OOD-detection is investigated, though I think they are not exactly the same.
- In Figures 5.2b-5.2d the accuracy seems to plateau. Do the authors have any suggestions on what might be causing this, and how this might impact applications using LD?
- One important use case I’d consider for minimally invading the training process is OoD detection with pre-trained models. Can you elaborate on whether this would be a good use case for your method? If it is, consider stating this in the paper as well, to argue clearly for why minimally invasive OoD detection is desirable.
Limitations
The authors claim that their method works on all classification models, and without any assumptions on the distribution of the data. However, this claim lacks evidence: the authors only demonstrate effectiveness for Neural Networks on Computer Vision data. While it is true that the method may be applied to other models and other data, more research is needed to establish its effectiveness there. Other limitations are demonstrated and addressed. The positive conclusions are appropriately based on the findings and are not over-optimistic.
The authors discuss the high computational cost and demonstrate methods to make it more efficient, but it’s not clear what the remaining computational cost is.
First of all, thank you for your time and for this review. Here are our answers.
Q1. How computationally expensive is LD after the improvements? Is inference faster/slower than e.g. DDU?
The computational expense depends on the number n of points in the proposed reduced LD and the number of classes C; the complexity is of order O(Cn^2). At inference, it is slower than DDU, as DDU relies on a Gaussian assumption and an empirical covariance estimate, hence only simple and standard matrix linear algebra, which is well supported by standard Deep Learning packages. Our method uses less standard operations, which makes it slower. We believe there is much room for improvement through more native implementations, more parallelization, etc., especially when computing shortest paths for LD, but we did not consider this a priority: the objective of our paper is mainly to introduce an approach that is fairly new to the Deep Learning community.
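For reference, here is a minimal sketch of the empirical Lens Depth computation for one query against the n reduced points of one class (our own illustration with hypothetical names, not the exact implementation of the paper; the distances would be sample Fermat distances in our case). The comparison over all pairs of reference points is what gives the O(n^2) factor, repeated once per class:

```python
import numpy as np

def lens_depth(dx: np.ndarray, D: np.ndarray) -> float:
    """Empirical Lens Depth of a query point.

    dx : (n,) distances from the query to the n reference points.
    D  : (n, n) pairwise distances among the reference points.

    Counts the fraction of pairs (i, j) whose "lens"
    {z : max(d(z, q_i), d(z, q_j)) <= d(q_i, q_j)} contains the query.
    The pairwise comparison below is the O(n^2) step per query.
    """
    n = len(dx)
    inside = np.maximum.outer(dx, dx) <= D   # (n, n): query inside the lens of pair (i, j)?
    np.fill_diagonal(inside, False)          # ignore degenerate pairs i == j
    return inside.sum() / (n * (n - 1))      # condition is symmetric, so this is the pair average
```

The remaining cost, and the natural target for optimization, is the shortest-path computation that produces `D` and `dx` when the Fermat distance is used.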
Q2. At what dataset size does the method start to fail?
To answer your question, we have added an experiment on the spiral dataset where we keep only a certain percentage of the original data and decrease it until LD fails to capture the original distribution; please see the figure in the pdf file attached to the global rebuttal. Note that we randomly sample a small portion of the original points, so the sampled points can be concentrated in a small region instead of being spread along the spiral, and may therefore not represent the original distribution very faithfully. With that in mind, it is not surprising that LD fails to capture the original support once only a small fraction of the original size remains. We hope that the figure gives you more insight into our method in the small-data regime. Finally, thank you for your recommendation; we will add this to the appendix.
Q3. Consider Figure D.1. What if two of the “blobs” belong to cluster A and the last to cluster B, so that there are two classes (C=2) but in three clusters. Would LD then still behave as desired? If LD then gives undesirable results, wouldn’t you say that there is at least some assumption about the shape of the distributions?
This is a very interesting remark. In this example, we intentionally place the 3 clusters very far from each other to see the effect of our method. In the extreme scenario you propose, one class consists of 2 clusters. One could argue that in such a case these 2 clusters should not be too distinct: the main model is trained to classify well, so semantically similar inputs should lie close to each other, leading to a fairly connected cluster for each class. Hence, the "bad" effect could exist but should be very limited. In general, though, we agree that the cluster of each class should be sufficiently connected to obtain an ideal result. We will add this to our discussion.
Q4. How would the model perform if the two moons had more spread?
In the case you mention, we argue that this is not a case of OoD uncertainty but a case of decision uncertainty. For the latter, metrics such as predictive entropy are good candidates, as they relate to uncertainty in the decision. The aim of LD is to measure out-of-domain uncertainty, which stems from zones where we have no (or very little) data. As the model is not trained in these zones, we would like it not to predict there, since it can behave almost arbitrarily due to the scarcity of training data. This is unlike the case where the two moons have more spread (and even overlap), where we have enough data in the zone between the two classes. Finally, we agree with your remark; it is worth mentioning in the discussion.
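For completeness, a minimal sketch (ours, not from the paper) of the predictive-entropy score mentioned above, computed from the classifier's softmax output:

```python
import numpy as np

def predictive_entropy(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Entropy of the predictive distribution, one value per sample.

    probs : (batch, C) softmax probabilities from the classifier.
    High entropy flags decision uncertainty (e.g. points between overlapping
    classes), which is complementary to an out-of-domain score such as LD.
    """
    return -(probs * np.log(probs + eps)).sum(axis=-1)
```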
Q5. In Figures 5.2b-5.2d the accuracy seems to plateau
We think that the plateau corresponds to "difficult" regions where InD and OOD data are less distinguishable, so the rejected data could contain both InD and OOD samples. Such a phenomenon is not specific to LD and appears for other methods as well (see DUQ for example).
Q6. One important use case I’d consider for minimally invading the training process is OoD detection with pre-trained models:
Yes! This is indeed a point we have in mind, as SOTA models often become too large to retrain ourselves, but perhaps we did not insist enough on it. It will be added to the introduction; thank you for your recommendation. Ideally, this is a good use case where we have no idea about the distribution of the data and we want to keep the original model intact to make sure that its performance on the main task is not impacted.
Kind regards, The authors
Thank you for the additional insights and additions. The method is indeed promising, even if computationally expensive.
The proposed work is interesting and promising.
I thank the authors for their submission and the following discussion.
The paper presents a non-parametric approach to out-of-distribution (OOD) detection. Given a trained neural network classifier, it is proposed to combine the Lens Depth (LD) with the Fermat distance (in an improved form) to capture the geometry and density of the data in feature space. Without assuming any prior distribution, the paper detects OOD samples on toy and small-scale benchmarks.
Strengths
- The combination of the Lens Depth with the sample Fermat distance for the out-of-distribution problem is a solid and interesting contribution.
- The paper is well written and easy to follow. In general, the approach is clearly described.
- The results on small scale experiments are convincing.
- The approach presented does not interfere with the training process of the model.
Weaknesses
- An extension of the related work to include papers on OOD detection would be necessary given the content of the paper.
- An additional evaluation metric would be helpful, e.g. FPR-95, ECE. This point should be addressed.
- A large-scale evaluation, e.g. ImageNet, is also missing. This is the main limitation of the paper.
Questions
- What is the reason for not performing the ImageNet evaluation, given that it is quite common in the topic?
Limitations
The paper has a broader impact statement discussing the idea of robust decision making.
First of all, thank you for your kind review. Here are our answers.
Weaknesses
- Related work to include papers on OOD:
Thank you for your recommendation. We will add more references on OOD in related work.
- An additional evaluation metric:
We do appreciate your kind recommendation, which is legitimate. However, our choice to stick to AUROC is almost forced upon us, because single-forward UQ methods like DDU and DUQ unfortunately do not report metrics such as FPR-95. Furthermore, the references we found, such as [1] (and papers it cites), deal with ECE as a metric for OOD generalization but not for OOD UQ (Uncertainty Quantification). Thus, it is difficult for us to include a fair comparison with these metrics.
[1] Wald, Yoav, et al. "On calibration and out-of-domain generalization." Advances in neural information processing systems 34 (2021): 2215-2227.
- Large scale evaluation.
See next question.
Question:
This very legitimate question has been raised by another reviewer, and we reproduce the answer for your convenience.
Regarding the complexity of evaluation, we need a model trained on the InD set of size N, and for every OoD example the complexity is O(C N^2). Even if this is reasonable for a single example, which is the relevant case for applications, it can indeed become quite large for a full-scale evaluation with ImageNet as InD. We estimate that this would require approximately 60 hours on our hardware for a single run, which was not possible during this rebuttal, especially since multiple runs are needed. We did CIFAR100 / Tiny-ImageNet over 5 independent runs, as it required approximately 10 hours of experiments, and the presence of C=100 classes hopefully shows that the method scales well to larger datasets. For a fair comparison, we use the Wide-ResNet-28-10 model and the same training scheme as in the DDU paper to train models on CIFAR100. We see that the performance of our method is better than or on par with strong baseline methods.
Table. AUROC scores with CIFAR100 as InD data and Tiny-ImageNet as OOD data. Results of other methods are taken from the DDU paper, which used the same Wide-ResNet-28-10 model.
| Method | AUROC |
|---|---|
| LD (ours) | |
| Softmax Entropy | |
| Energy-based | |
| SNGP | |
| DDU | |
| 5-Ensemble |
Conclusion
Given our answers, we sincerely hope you will consider raising your score.
Kind regards, The authors
The revision has addressed most of my and other reviewers' points. I would still like to see more metrics in the evaluation protocol. For example, FPR-95 is quite useful for understanding how the proposed approach works. However, I am also in favour of the paper given the positive scores.
Dear TY6a. The end of the discussion period is close. I would be grateful if you could provide feedback regarding the authors' answers to your review.
This paper proposes a new method for OOD detection/scoring based on the lens depth and Fermat distance, arguing that it has advantages over prior methods by being non-parametric, non-invasive, (almost) tuning-parameter-free, and quite effective in adapting to the unknown structure of the data to identify OOD points.
Strengths
- Subject matter is important
- I found the paper really easy and fun to read.
- 4.2 is a nice, simple, and practical modification—very natural and clearly successful!
- Both the Lens Depth and Fermat Distance are nice, intuitive notions, and it is natural and fun to think about their combination!
- I raise a number of conceptual issues below, but at the end of the day the demonstration of the method on standard data sets, comparing it to state-of-the-art methods, is fairly compelling, hence my high score.
Weaknesses
- LD is interesting and intuitive but what happens when the data falls into two disjoint clusters? Then won't LD (with basically any distance I can think of, including Fermat distance) consider points in between those two clusters to be extremely central, despite the fact that, since they lie in neither distribution, they could reasonably be considered very OOD? Related: it seems the FD is infinite (whenever \beta>0) between two points separated by a region of zero density, suggesting that the sample version will be highly unstable in this setting, as it should not converge at all but instead diverge to infinity. I see this is addressed in 4.4 by computing sample FD separately per cluster, but how were the clusters computed? Clustering is no trivial task, and given that things go wrong without clustering, I imagine S(x) in eq (4.2) depends rather heavily on the clustering. This (seems to me important) aspect of the proposed method seems underexplored/underexplained in the paper.
- How does the convergence of the sample FD to the population FD depend on dimension? It’s a bit hard to believe it doesn’t suffer from some sort of curse of dimensionality, since it depends on a density and density estimation very much suffers from the curse of dimensionality. It seems many of the nice demonstrations of it in this paper occur in 2 dimensions (with the data lying nearly on a set of dimension 1), which doesn’t seem very representative of NN feature spaces.
- Claim of “no trainable parameter” in the abstract is rather misleading, given the need for choosing both \alpha (ok there is a case made that maybe this isn’t too important) and the clustering.
- Lit review is well-organized, but very focused on methods for NN OOD detection. The paper makes a big deal out of the method being non-intrusive, but another way of saying this is just that the proposed method is a way of scoring a point being OOD with respect to a distribution, which is a problem that, in general, has nothing to do with NNs or their feature representations. Surely there is a large body of work on outlier detection in statistics that could be considered in a similar light to this method, where one takes an off-the-shelf outlier detection method’s score and just applies it to the data transformed to be in the feature space of the NN? That is essentially what this paper is doing (though for a novel method, and I am not questioning its novelty). I just wonder what other existing methods are out there that could be doing something similar, even if they haven’t been explicitly applied to NNs.
- Section 4.5 and Appendix E: choices II and III seem like they would rather seriously break the connection between the estimated LD and the true LD, since the k-means clustering will in general (and in typical circumstances) have clusters with very different numbers of points in them, so by reducing to the cluster centers (or center+’s), you are representing very different numbers of points with different centers. Another way to say it is that the density of the n points via methods II and III is quite different from that of the original N points (or via method I), and hence using them to compute the LD will be quite different in nature from using method I or the original N points. I would expect these methods (II and III) to not even have any kind of consistency property to the true LD of the original points, given their change in the density.
- I appreciated the authors' honesty in reporting LL ratio results as being better than their method (of course, it comes with a more complex process), but it seems worth noting that it is substantially better. Since all the AUROC scores are close to 1, it is natural to look at 1-AUROC (so smaller is better), in which case the LL ratio gets 0.006 and LD gets 0.029, almost 5x higher. I don't think the authors were misleading in presenting these results, but I found the two sentences (lines 252-254) highlighting the challenges associated with the LL ratio to be a bit vague, and the results might be more convincing if those challenges were made more explicit (possibly in an appendix if there isn't room in the main paper).
- I don’t find Fig 5.2 very convincing, since the monotonicity here is a pretty weak property and no comparison is made with other methods—my guess would be that many methods satisfy monotonicity. Is that not the case?
Questions
- What is \alpha in Fig 4.1? Is it the same for all panels?
- Nothing about the proposed method seems to have anything to do with NNs or their feature space, and in particular, it is never mentioned why the method is applied to data points in the feature space, as opposed to the raw data points. I can imagine the reason is that the method works better with relatively “nice” densities, with fewer clusters and continuous densities supported on smooth manifolds, but there is no mention of this in the paper, and it seems like it merits discussion. I did see the last sentence mentions the method can be applied to any model with a feature space, but again, why is a feature space (or a classification model) even needed?
Limitations
I guess some of my points listed under “weaknesses” could be interpreted as limitations, and I would like to see them better addressed/discussed. If they are (even if the authors don’t change their method at all), that would raise my score.
First of all, thank you for your kind and insightful review. Here are our answers. Many aspects of the discussion will be added to the paper.
Weaknesses
W1: Two disjoint clusters in the data?
Very natural question. You are perfectly right that the population (ideal) Fermat distance in Eq. (3.4) would be infinite due to the vanishing density $f$. However, the sample FD would remain finite -- of the order of the Euclidean distance to the power $\alpha$. The finite-size effect thus stabilizes the value. In fact, Theorem 2.3 in the reference [6] proves convergence of the rescaled sample FD to the population FD, with a rescaling factor depending on the sample size, $\alpha$ and the dimension; more precisely, when the samples are either i.i.d. with density $f$ or drawn from a Poisson process with intensity proportional to $f$. It is crucial for $f$ to remain bounded from above and away from zero. As such, one could say that in between clusters we need a "very small density" but not a "zero density". Hence the need for connectedness.
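To make the objects explicit, here is a schematic recollection in our notation (the precise hypotheses and constants are those of [6]): the sample Fermat distance and its rescaled limit read

$$
D_{Q_n,\alpha}(x,y)\;=\;\min\Big\{\textstyle\sum_{i=0}^{K-1}\lVert q_{i+1}-q_i\rVert^{\alpha}\;:\;q_0=x,\;q_K=y,\;q_1,\dots,q_{K-1}\in Q_n\Big\},
$$

$$
n^{\beta}\,D_{Q_n,\alpha}(x,y)\;\xrightarrow[n\to\infty]{}\;\mu\,\inf_{\gamma:\,x\to y}\int_{\gamma} f^{-\beta}\,\mathrm{d}s,\qquad \beta=\tfrac{\alpha-1}{d},
$$

where $Q_n$ is the sample, $f$ the density on a $d$-dimensional support, and $\mu$ a constant. The limit stays finite only if $f$ is bounded away from zero along some path, which is the formal version of the connectedness requirement above.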
This purely mathematical answer is rather unsatisfying. A more practical take would be to argue that if a class is disconnected, there is a problem with the feature space. We agree that a better clustering would solve the issue. But here, the clustering is given by the labeling, since we are in a supervised classification setting. For a focus on (unsupervised) clustering, we found reference [1] on the use of FD for clustering.
[1] Sapienza et al. "Weighted geodesic distance following Fermat's principle."
[6] Groisman et al. "Nonhomogeneous Euclidean first-passage percolation and distance learning."
W2: Curse of dimensionality in the convergence of the sample FD to the population FD.
In the convergence result extracted from [6] above, the scaling does indeed depend on the dimension, thus reflecting a curse of dimensionality. However, one could argue that this dependence is not too bad. Moreover, [Theorem 2.7, 6] tackles the case of data living in an embedded manifold of dimension $d$, much smaller than the dimension $D$ of the embedding space. In this case, only $d$ matters and not $D$. This is a nice property for ML, where it is a common belief that data lives in a dimension much smaller than that of the raw data points.
In the end, these are sensible arguments for the lack of a curse of dimensionality in using FD.
W3. “No trainable parameter”
By "trainable param", one refers to params that are optimized (based on gradient descend for example) in the process. Here, is chosen a priori and fixed during the process. And as you point out, it is not very important. Also, perhaps there is a misunderstanding, but we point out (as in W1) that the clustering is part of the data. It is given by the classes in our supervised classification problem, which we want to make more robust thanks to OOD.
W4. Focus on "non-intrusive" and "NNs", ignoring that the method applies beyond NNs.
Yes we agree with your remark. A comprehensive answer is given with the related Question 2.
W5: The reductions of Section 4.5 seem like they would rather seriously break the connection between the estimated LD and the true LD.
Yes, your remark is correct: this could change the original density. However, in the end our objective is to measure how "central" a point is w.r.t. our data, and only the LD matters. So our motivation for using the reduced methods is to find a configuration of points that covers the support of the original data well. If this is the case, then even if the density changes, the change in LD is minimal and the ordering of points by LD is barely affected: points that are "central" keep a large LD, and points near the frontier of the original support still get a small LD.
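As a rough illustration of such a reduction (our own sketch with Euclidean distances for brevity, whereas the paper uses sample Fermat distances; the exact reduction variants are those of Section 4.5 / Appendix E), one can replace the N features of a class by n k-means centers and compute the LD of a query against those centers only:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def reduced_lens_depth(feats_class: np.ndarray, x: np.ndarray, n_centers: int = 200) -> float:
    """Lens Depth of the query feature vector x against k-means centers
    summarizing one class (a configuration of n points covering its support)."""
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(feats_class).cluster_centers_
    D = cdist(centers, centers)               # (n, n) pairwise distances among centers
    dx = cdist(x[None, :], centers)[0]        # (n,) distances from the query to the centers
    inside = np.maximum.outer(dx, dx) <= D    # query inside the lens of the pair (i, j)?
    np.fill_diagonal(inside, False)
    return inside.sum() / (n_centers * (n_centers - 1))
```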
W6: LL ratio
In this method, instead of using the main model directly, one needs to train two supplementary generative models to estimate densities. A first model is trained on the InD data and a second one on perturbed inputs, so that the second model captures only the background statistics. Under suitable assumptions, the authors show that the ratio between these two likelihoods cancels out the background information; consequently, the LL ratio focuses on the semantic part of the input and can be used as a score to distinguish InD from OOD data. This method needs adequate noise, such that the perturbed inputs contain only background information; this choice is itself complicated, as one needs a supplementary dataset to tune the noise. Moreover, one needs to train these two generative models very carefully so that they reflect the true underlying input density. This is quite complex. We will add this to the appendix.
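Schematically, in our notation, the resulting score is

$$
S_{\mathrm{LLR}}(x)\;=\;\log p_{\theta}(x)\;-\;\log p_{\theta_0}(x),
$$

where $p_{\theta}$ is the generative model trained on the InD inputs and $p_{\theta_0}$ the background model trained on the perturbed inputs. Both densities must be estimated accurately for the background terms to cancel, which is where the training difficulty mentioned above comes from.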
W7: Fig 5.2 on the monotonicity.
We made the figure because similar ones appear in the literature. We agree it is not fundamental. Nevertheless, we believe that this monotonicity is a reasonable sanity/reality check, which is nice to observe.
Questions
Q1. What is $\alpha$ in Fig 4.1?
We used $\alpha=3$ in this experiment, and it is the same for all panels.
Q2 (and W4). Beyond NNs, and feature space? Why not raw data points?
You are perfectly right that the combination LD+FD can be used for outlier detection in a more general context than NNs. This is what we had in mind in the very last sentence of the conclusion; we perhaps did not insist on it enough in the paper.
Regarding other literature, we cited the works of Cholaquidis et al. from statistics on LD and the works of Groisman et al. from probability of FD. We are unaware of any works combining the two, aside from ours, for any outlier detection.
Regarding the use of raw data points directly, this is indeed possible, as in [1] for clustering. However, in the context of NNs and vision, working on pixels is not a good idea. Feature spaces are more suitable in this context, as they extract the low-dimensional, semantic structure of the data more efficiently.
Kind regards
The authors
I thank the authors for their thoughtful and thorough rebuttal. I found it generally quite convincing, and was already quite positive on this paper. Clarifying that the clusters are just defined by the labels is important, sorry for my confusion on that! I do find the word "cluster" a bit unusual to refer to the points corresponding to an observed label, so the authors might consider rewording that. Anyway, I am raising my score to an 8.
The authors address the problem of out-of-distribution detection in supervised learning, with particular focus on neural network models. The developed method works in some feature (embedding) space by measuring the statistical depth of the query point with respect to some reference set of points. The particular implementation combines the lens depth function with the Fermat distance. The authors validate the proposed approach in a series of experiments on simulated and real-world data.
Strengths
- The paper is very well-written and easy to follow.
- The considered problem is relevant for practice as there is a significant demand in efficient and non-intrusive methods for uncertainty quantification.
- The proposed approach is solid with all the steps being properly motivated.
- The authors made a significant effort to provide a comprehensive literature review, experimental evaluation and analysis, though not all of these steps were fully successful (see Weaknesses and Questions below).
[After rebuttal comment] I appreciate the answer by the authors and increase my score to 6. My main concerns were addressed.
Weaknesses
- While the usage of statistical depth functions and distribution/manifold-related distances looks logical, it is not clear why the particular choices of Lens Depth and Fermat distance were made.
- The baselines considered are not comprehensive enough and some of the baselines are not interpreted correctly by the authors of the present paper. In particular: a. Non-Gaussianity of the embedding distribution was directly considered in [1], aiming to improve over GDA. I think it is worth comparing with this method, as the present paper targets the same issue, though with a completely different approach. b. I believe the authors incorrectly say that the difference between papers [2] and [3] lies only in the usage of spectral normalization. In my opinion, even more important is that [2] uses the Mahalanobis distance as the uncertainty measure while [3] considers the density of a Gaussian mixture instead.
- The experiments are done with relatively simple datasets like CIFAR-10 for in-distribution data and SVHN/CIFAR-100/TinyImageNet as OOD. With the proposed approach being relatively lightweight, it is not clear why not to consider CIFAR-100/ImageNet as in-distribution with corresponding OOD choices (like ImageNet-R or ImageNet-O as OOD for ImageNet).
References
[1] Kotelevskii, Nikita, et al. "Nonparametric uncertainty quantification for single deterministic neural network." Advances in Neural Information Processing Systems 35 (2022): 36308-36323.
[2] Lee, K., Lee, K., Lee, H., and Shin, J. "A simple unified framework for detecting out-of-distribution samples and adversarial attacks." Advances in Neural Information Processing Systems 31 (2018).
[3] Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H., and Gal, Y. "Deep deterministic uncertainty: A new simple baseline." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24384-24394, 2023.
Questions
- Why was Lens Depth chosen and not other statistical depth functions like half-space depth, simplicial depth, ...?
- Why was the Fermat distance chosen? One can consider many alternatives. For example, following the manifold learning literature, one can construct a kNN graph with Euclidean distances over the embeddings and then compute shortest paths over the resulting graph.
- Can you clarify how you implemented "GDA"-based methods? Did you use Mahalanobis distance or GMM-density?
- Why didn't you do the experiments with more complex datasets? Is it due to the high computational cost of the LD + Fermat distance approach?
- Have you tested the effectiveness of reduced LD on datasets more complex than MNIST? More complex models may lead to more complex embedding structures and require more points for the approximation.
Limitations
Limitations are adequately addressed
Comment on weakness
We believe that in [3], one uses GDA and not a Gaussian Mixture Model (GMM). More precisely, GMM consists in calculating the density of a point $x$ as $p(x)=\sum_c \pi_c\,\mathcal{N}(x;\mu_c,\Sigma_c)$, so one needs to fit both the weights $\pi_c$ and the parameters $(\mu_c,\Sigma_c)$ of each Gaussian (here $\mu_c$ is the mean and $\Sigma_c$ the covariance matrix). GDA, in contrast, consists in fitting only $(\mu_c,\Sigma_c)$ for each class, the weight of each class being taken as fixed (e.g., its empirical frequency) rather than optimized. As both approaches lead to a mixture of Gaussians, part of the ML literature conflates GDA with GMM. The most notable difference between these two approaches is that GMM has a smoothing effect on the density between the clusters, giving larger values in these zones.
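To make the distinction concrete, here is a minimal sketch (our own illustration, not the code of [3] or of our paper) of the two fitting procedures on feature vectors `Z` with labels `y`:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def gda_log_density(Z, y, x):
    """GDA: one Gaussian per class, parameters taken from the labeled features;
    the class weight is not optimized (here: empirical class frequency)."""
    density = 0.0
    for c in np.unique(y):
        Zc = Z[y == c]
        pi_c = len(Zc) / len(Z)
        mu_c, Sigma_c = Zc.mean(axis=0), np.cov(Zc, rowvar=False)
        density += pi_c * multivariate_normal(mu_c, Sigma_c, allow_singular=True).pdf(x)
    return np.log(density)

def gmm_log_density(Z, n_components, x):
    """GMM: weights and Gaussian parameters all fitted jointly by EM, ignoring labels."""
    gmm = GaussianMixture(n_components=n_components).fit(Z)
    return gmm.score_samples(np.atleast_2d(x))[0]
```

Both yield a mixture of Gaussians, which is why the two are often conflated; the difference is whether the mixture weights and assignments are optimized jointly (GMM) or fixed by the labels (GDA).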
We did mention non-Gaussianity as a desirable feature of our method, as does the method in [1]. Yet it is not the main selling point: see Q1-Q2 below for the synergies of LD+FD. We shall cite [1], but we believe that density estimation by kernels:
- is not a far departure from GDA
- does not provide a natural measure of centrality like Lens Depth
- needs bandwidth tuning
Thus instead of adding that benchmark, we chose to focus on Q4.
Answers to questions
Questions Q1-Q2. Why Lens Depth vs. Tukey depth (aka half-space depth) or simplicial depth? Why Fermat distance vs. typical manifold learning, e.g., shortest paths on a local kNN graph?
Thank you for these questions, which we answer jointly in order to stress that while the choice in (1) is for convenience, the choice in (2) is more fundamental. Moreover, the choices (1)+(2) are synergistic and not independent. The main idea of our method is the combination of a notion of depth or centrality w.r.t. a distribution, and a measure of length that gives shorter distances in high-density areas. Furthermore, it is highly desirable that the notion of depth adapts to the chosen distance.
Q1. As you relevantly point out, there are many notions of depth, and we chose not to delve into them. But indeed, it is easy to give a panorama of pros and cons:
- Tukey depth aka half-space depth. Pros: Computationally simple, naturally normalized. Cons: Euclidean, hence less synergy with the Fermat distance; half-space separation is very Euclidean.
- Simplicial depth. It counts the number of simplices of sample points that contain a given point. Cons: Not normalized into a probability; potentially exponential growth of the number of simplices; computationally complex/expensive, as algorithms exist in the 2D plane but the problem is not trivial in higher dimensions.
- LD. Pros: Works with any distance; normalized; moderate computational cost. In summary, we can structure this discussion into the table below.
| Depth notion | Adapts to any distance | Computational cost | Normalized into probability |
|---|---|---|---|
| Lens depth (LD) | Yes | Average | Yes |
| Half-space depth | No | Low | Yes |
| Simplicial depth | No as simplices are Euclidean | High | No |
LD is a well-rounded choice. And among these, it is the only one which can leverage the Fermat distance. If you think this is important, we could turn this explanation into an additional paragraph in the paper.
Q2. Indeed, manifold learning is the logical tool for evaluating distances in a latent space. However, typical methods such as those suggested do not have the extra feature of "shorter distances in high-density areas". Indeed, in the Fermat distance, the sum is small if the $q_i$'s are close together and $\alpha$ is high. And the idea of the Fermat distance, inspired by percolation theory in statistical physics, is to use a high parameter $\alpha$. This point is subtle yet important for us; perhaps we should insist more on it.
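To make the contrast concrete, here is a minimal sketch (our own code, not the paper's implementation) of the sample Fermat distance as a shortest path over the complete graph with edge weights $\lVert q_i-q_j\rVert^{\alpha}$; a plain kNN-graph geodesic would instead use raw Euclidean edge lengths and lose this density dependence:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def sample_fermat_distances(Q: np.ndarray, alpha: float = 3.0) -> np.ndarray:
    """All-pairs sample Fermat distances over a point cloud Q of shape (n, d).

    Each edge (i, j) carries weight ||q_i - q_j||**alpha, and the Fermat
    distance is the shortest path under these weights. With alpha > 1, one
    long jump costs much more than a chain of short hops, so paths through
    high-density regions (many nearby points) become cheaper: this is the
    "shorter distances in high-density areas" property discussed above.
    """
    W = cdist(Q, Q) ** alpha                      # complete graph of powered Euclidean edges
    return shortest_path(W, method="D", directed=False)
```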
All in all, we believe that we have made a sensible choice in combining LD as the notion of depth and the Fermat distance as the notion of distance.
Q3.
GDA method: please see our comment on weakness above. The result is a mixture of Gaussians, as illustrated in Fig. 1.1 in our main paper.
Q4. "Experiments with more complex datasets? High computational cost of the LD + FD approach? With the approach being relatively lightweight, it is not clear why not to consider CIFAR-100/ImageNet as in-distribution with corresponding OOD choices."
As per the weakness you raised, we added CIFAR100 (InD) vs. Tiny-ImageNet (OoD). Regarding the complexity of evaluation, we need a model trained on the InD set of size N, and for every OoD example the complexity is O(C N^2). Even if this is reasonable for a single example, which is the relevant case for applications, it can indeed become quite large for a full-scale evaluation with ImageNet as InD. We estimate that this would require approximately 60 hours on our hardware for a single run, which was not possible during this rebuttal, especially since multiple runs are needed. We did CIFAR100 / Tiny-ImageNet over 5 independent runs, as it required approximately 10 hours of experiments, and the presence of C=100 classes hopefully shows that the method scales well to larger datasets. For a fair comparison, we use the Wide-ResNet-28-10 model and the same training scheme as in the DDU paper.
Table. AUROC scores with CIFAR100 as InD data and Tiny-ImageNet as OOD data. Results of other methods are taken from the DDU paper, which used the same Wide-ResNet-28-10 model.
| Method | AUROC |
|---|---|
| LD (ours) | |
| Softmax Entropy | |
| Energy-based | |
| SNGP | |
| DDU | |
| 5-Ensemble |
Q5. “Reduced LD on more complex dataset than MNIST?”
Yes, this was done. Indeed, the results summarized in Tables 5.1 and 5.2 in the paper do use the reduced LD; this is mentioned in the details provided in Appendix A "Experiment details". And while the reduction is sizable, the AUROC results do not degrade. Finally, let us mention that the details of the CIFAR100/Tiny-ImageNet experiment will be added to the appendix.
Conclusion
Given our answers, we sincerely hope you will consider raising your score.
Kind regards, The authors
Dear authors,
your rebuttal was well received and you partially addressed my concerns. I will decide on the score changes after the discussion with other reviewers.
Dear exfg. The end of the discussion period is close. I would be grateful if you could provide feedback regarding the authors' answers to your review.
First of all, thank you all for your time and insightful reviews. Here are the main points in the rebuttal.
- We answered all the questions to the best of our ability. In particular, the questions of Reviewer NNtx led to extensive mathematical discussions.
- We shall add / nuance multiple points about limitations.
- Reviewer kN5D remarked that Figure 4.2 shows that LD still works with 200 samples, and asked at what dataset size the method starts to fail and how catastrophic that is.
In the attached pdf, you can find the figure showing the experiment proposed by Reviewer kN5D. We hope that it gives more insight into our method in the small-data regime.
- Reviewers exfg and TY6a kindly proposed to test our method on larger-scale datasets to see whether it scales well to more complex data. In the time allotted for the rebuttal, we added an experiment with CIFAR100 as InD data and Tiny-ImageNet as OOD; the presence of C=100 classes hopefully shows that the method scales well to larger datasets. We performed 5 independent runs. For a fair comparison, we use the Wide-ResNet-28-10 model and the same training scheme as in the DDU paper.
Table. AUROC scores with CIFAR100 as InD data and Tiny-ImageNet as OOD data. Results of other methods are taken from the DDU paper, which used the same Wide-ResNet-28-10 model.
| Method | AUROC |
|---|---|
| LD (ours) | |
| Softmax Entropy | |
| Energy-based | |
| SNGP | |
| DDU | |
| 5-Ensemble |
Kind regards,
The authors
The paper proposes a non-parametric approach to estimate whether an input point is OOD. For this, the authors calculate the so-called lens depth based on the Fermat distance.
The paper is well written. The experiments verify the method under different conditions and datasets. The method is based on theoretically well-defined concepts. Overall, the paper and the results look good enough to be accepted after incorporating the reviewers' feedback.