Mahalanobis++: Improving OOD Detection via Feature Normalization
We show that the Mahalanobis distance estimation is degraded by strong variations in the feature norm and provide a simple fix (projection to the unit sphere) that consistently improves the method and leads to new SOTA results.
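For concreteness, here is a minimal NumPy sketch of the proposed scheme as described in the TL;DR (project features to the unit sphere, then apply the standard Mahalanobis score with class means and a shared covariance). All names and the data layout are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fit_mahalanobis(Z, y, normalize=True):
    """Estimate class means and a shared covariance from features Z (n, d) with labels y."""
    if normalize:
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # project features to the unit sphere
    classes = np.unique(y)
    means = np.stack([Z[y == c].mean(axis=0) for c in classes])
    centered = Z - means[np.searchsorted(classes, y)]     # subtract each sample's class mean
    cov = centered.T @ centered / len(Z)                  # shared ("tied") covariance
    return means, np.linalg.pinv(cov)

def maha_score(Z, means, precision, normalize=True):
    """OOD score: minimal squared Mahalanobis distance to any class mean (lower = more ID)."""
    if normalize:
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    diffs = Z[:, None, :] - means[None, :, :]             # (n, K, d)
    d2 = np.einsum('nkd,de,nke->nk', diffs, precision, diffs)
    return d2.min(axis=1)
```

With `normalize=False` this reduces to the standard Mahalanobis baseline; `normalize=True` corresponds to the Mahalanobis++ variant discussed below.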
Reviews and Discussion
The paper proposes a simple fix to the post-hoc OOD detection technique based on the Mahalanobis distance computed in the feature space of the neural network of interest. This simple fix consists of normalizing the features by their norm before computing the distance. The authors emphasize how the samples violate the assumptions underlying the Mahalanobis distance in the feature space:
- Assumption 1: the class-wise features follow a multivariate normal distribution
- Assumption 2: the class-conditional covariance matrices are the same

They do so by analyzing the magnitude of the feature norm, emphasizing how the fix can alleviate this problem. Experiments on a comprehensive benchmark of various models empirically demonstrate the effectiveness of this method.
Questions for Authors
- Could you plot the distribution of normalized features in a plot similar to Figure 3?
Claims and Evidence
The problem with the feature norm is clearly illustrated with experiments based on Lemma 3.1, expected squared variance deviation, and QQ plots.
However, I have some concerns about the link between the fix and the assumptions. The fix intends to alleviate the difference between the feature norms of samples from different classes, but I do not see how it makes the features satisfy Assumptions 1 and 2. Specifically, nothing ensures that after the fix, which is just a normalization, the features follow a Gaussian distribution and that their covariance matrices are equal.
- QQ plots are there to show that the obtained features are closer to a Gaussian, but they might still be non-Gaussian.
- I do not see the relation between Assumption 2 and the expected squared variance deviation, which I think is not a standard metric. Two very different covariance matrices could have a deviation of zero with this metric. Why not conduct statistical tests, or use standard probability-distribution divergence measures that can be estimated?
Methods and Evaluation Criteria
- The experiments to emphasize the problem with feature norm are thorough and theoretically grounded
- The evaluation benchmark is extensive.
Theoretical Claims
I checked the proof of Lemma 3.1, which is OK. I skimmed through Appendix C (proof of the expected squared variance deviation) but did not check it thoroughly.
Experimental Design and Analysis
yes
Supplementary Material
yes
Relation to Prior Literature
The contributions are closely related to (Lee et al., 2018b) and (Ren et al., 2021), which are appropriately discussed.
Missing Essential References
All essential references are discussed to the best of my knowledge
Other Strengths and Weaknesses
Strengths
- The problem with Assumptions 1 and 2 is clearly emphasized
- The method is simple
- It consistently improves the performance of the Mahalanobis method
Weaknesses
- What is called a "fix" might not be an actual "fix" but just a tool to make the Mahalanobis method better
- Concerns with expected squared variance deviation
Other Comments or Suggestions
- In the proof of Lemma 3.1, the feature random variable is not introduced (the lemma is about its norm)
- In Eq. 5, please use the same notation as in Lemma 3.1 ("tr" instead of "trace")
We thank the reviewer for the thoughtful comments and appreciate the positive feedback. Below we address the reviewer's remarks:
-
“The fix intends to alleviate the difference between the feature norms of samples from different classes”
We would like to clarify that different feature norms for samples from different classes are not a problem per se. For instance, if the mean vectors of different classes were of different magnitudes (which is typically not the case), the Gaussian assumption could still be satisfied. However, our analysis shows that the observed feature norm distribution and the one we would expect under the Gaussian model are significantly different, the former for instance showing very heavy tails. We take this as an indication that the Gaussian assumption is violated, and substantiate this with further analysis (QQ plots, etc.).
-
“nothing ensures that after the fix, which is just a normalization, the features follow a normal gaussian, and that their covariance matrices are equal.” and “QQ plots are here to show that the obtained features are closer to normal gaussian, but they might still be non gaussian.”
We agree that we cannot guarantee that the features follow a normal distribution. In fact, we see no reason to expect the features to follow any particular distribution. However, we provide strong empirical evidence that modelling the feature distribution with a normal distribution with shared covariance is more appropriate after normalization. In particular, 1) the QQ plots are less skewed, 2) the shared covariance assumption is better satisfied, and 3) the feature norm no longer acts as a confounder for OOD detection.
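As an illustration of point 3, the confounding effect can be measured directly; a minimal sketch, assuming feature matrix `Z` and per-sample OOD scores (the helper name is hypothetical, not from the paper):

```python
import numpy as np

def norm_score_correlation(Z, scores):
    """Pearson correlation between feature norms and OOD scores: a high correlation
    suggests the feature norm acts as a confounder for the Mahalanobis distance."""
    norms = np.linalg.norm(Z, axis=1)
    return np.corrcoef(norms, scores)[0, 1]
```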
-
"What is called a 'fix' might not be an actual 'fix'"
In addition to the above, we are happy to rephrase "fix" as, e.g., "remedy".
-
“Could you plot the distribution of normalized features in a plot similar to Figure 3?”
The feature norms of the normalized features would show as a straight line at 1 with no deviation. Please let us know in case this does not clarify the question.
-
We thank the reviewer for the remarks about the trace notation and for noting that the feature random variable has not been properly introduced. We will adjust the notation and clarify that it is a random variable representing the feature distribution for a given input.
-
"Two very different covariance matrices could have a deviation of zero with this metric." (Eq. 5)
We respectfully disagree with the reviewer. In particular, we can write

$$\mathbb{E}_{u}\left[\left(\frac{u^\top \Sigma_1 u - u^\top \Sigma_2 u}{u^\top \Sigma_2 u}\right)^{2}\right] \geq 0.$$

Since $\Sigma_2$ is p.d., $u^\top \Sigma_2 u > 0$, and the only way for the expectation to be zero is that $u^\top \Sigma_1 u = u^\top \Sigma_2 u$ for all $u$, which is only the case when $\Sigma_1 = \Sigma_2$.
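A small Monte-Carlo check of this property, using the deviation as written above (an illustrative sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_sq_variance_deviation(S1, S2, n_dirs=100_000):
    """Monte-Carlo estimate of E_u[((u^T S1 u - u^T S2 u) / u^T S2 u)^2] over random directions u."""
    d = S1.shape[0]
    U = rng.standard_normal((n_dirs, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # uniform directions on the unit sphere
    v1 = np.einsum('nd,de,ne->n', U, S1, U)         # u^T S1 u
    v2 = np.einsum('nd,de,ne->n', U, S2, U)         # u^T S2 u > 0 for p.d. S2
    return np.mean(((v1 - v2) / v2) ** 2)

A = np.diag([1.0, 2.0, 3.0])
B = np.diag([3.0, 2.0, 1.0])                        # same trace and eigenvalues, different matrix
print(expected_sq_variance_deviation(A, A))         # ~0: identical covariances
print(expected_sq_variance_deviation(A, B))         # clearly > 0: different covariances
```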
-
"expected squared variance deviation ... is not a standard metric"
We agree that this metric is not commonly evaluated, but we argue that it is the right one to look at. In particular, the Mahalanobis distance performs a whitening by the variances: deviations in a certain direction are measured relative to the sample variance in that direction. Small absolute deviations can thus result in large distances when they are along a direction of small variance. We therefore need a measure that captures relative rather than absolute deviations, since absolute deviations would be dominated by directions of large variance. Our proposed measure computes the relative deviation of the variance of one covariance matrix from the other in every direction $u$ and averages this deviation over all directions. This is a natural way to assess whether two covariance matrices are similar in all possible directions in the feature space. A similar measure that is commonly used to compare covariance matrices is the Riemannian metric (see e.g. [1,2]). It is also possible to compute an appropriate measure with divergences like the KL divergence between the corresponding zero-mean Gaussians, $\mathrm{KL}\big(\mathcal{N}(0,\Sigma_1)\,\|\,\mathcal{N}(0,\Sigma_2)\big) = \tfrac{1}{2}\big(\mathrm{tr}(\Sigma_2^{-1}\Sigma_1) - d + \ln\tfrac{\det\Sigma_2}{\det\Sigma_1}\big)$. We evaluate both, confirming that normalization aligns the covariance structure in a meaningful way (lower is better):
| | Riemann (unnormalized) | Riemann (normalized) | KL (unnormalized) | KL (normalized) |
|---|---|---|---|---|
| mean | 98.2 | 88.2 | 1090.6 | 982.0 |
| median | 93.6 | 84.7 | 1011.0 | 908.3 |
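For reference, minimal implementations of the two measures in the table under their standard definitions (the affine-invariant Riemannian metric of [1,2] and the zero-mean Gaussian KL divergence); this is a sketch under those assumed conventions, not the authors' code:

```python
import numpy as np
from scipy.linalg import eigh

def riemann_dist(S1, S2):
    """Affine-invariant Riemannian distance between SPD matrices [1,2]:
    sqrt(sum_i log(lambda_i)^2), with lambda_i the generalized eigenvalues of (S1, S2)."""
    lam = eigh(S1, S2, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def gauss_kl(S1, S2):
    """KL(N(0, S1) || N(0, S2)) between zero-mean Gaussians."""
    d = S1.shape[0]
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(np.linalg.solve(S2, S1)) - d + logdet2 - logdet1)
```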
We are happy to discuss any of the points further!
[1] Förstner, W. & Moonen, B. (2000). A Metric for Covariance Matrices. doi:10.1007/978-3-662-05296-9_31.
[2] Pennec, X., Fillard, P., & Ayache, N. (2006). A Riemannian Framework for Tensor Computing. International Journal of Computer Vision, 66, 41–66.
I appreciate the authors' response and would like to increase my rating.
We are glad that the reviewer appreciates our rebuttal response and would like to thank them for raising the score!
This submission focuses on the OOD detection task and proposes a simple yet effective method to improve the Mahalanobis distance approach.
Update after rebuttal
The authors' rebuttal has largely addressed my concerns and I thus maintain my positive rating.
Questions for Authors
(see above in other weaknesses.)
Claims and Evidence
Through a mixture of theoretical and empirical analyses, the reviewer believes that the submission is supported by clear evidence.
Methods and Evaluation Criteria
Yes, the reviewer believes that the proposed method makes sense.
Theoretical Claims
The reviewer hasn't carefully checked the proof of Lemma 3.1. Yet, Lemma 3.1 at least does not contradict the reviewer's intuition.
Experimental Design and Analysis
Yes, the reviewer has gone through the experimental settings in the main paper and finds that they largely make sense.
Supplementary Material
The reviewer has quickly gone through the experimental parts in the supp, but not line by line for the theoretical part.
Relation to Prior Literature
The reviewer believes that the Gaussian distribution assumption is very common across the scientific literature. From this perspective, it is very interesting to see a method that can help mitigate the violation of this assumption in a relatively simple way.
Missing Essential References
N.A.
Other Strengths and Weaknesses
I really appreciate the authors' effort in performing extensive experimental analyses, which I believe strengthen this submission to a large extent. Below, I still have several queries about this submission that I hope can be addressed to further improve its quality.
-
I hope the related-work section can be better organized and the differences between the proposed method and existing methods better elaborated. For example, when the paper names a subsection called "Mahalanobis distance", my understanding is that it actually intends to review existing Mahalanobis-distance-based methods; it is important for this to be clarified. Meanwhile, it would be appreciated if the differences between the proposed method and existing similar methods were elaborated further.
-
When the authors present Lemma 3.1, if I am not wrong, it states only a property of Gaussian-distributed features, not a sufficient condition. If this is the case, I would appreciate the authors making this clearer to avoid readers' misunderstanding.
-
The authors claim that "we expect this to be negligible due to the large dataset size". I would first appreciate more explanation of, or elaboration on, this negligibility. Meanwhile, the authors seem to require the size to be very large (>10^6). What if, in some cases, this does not hold? Does the negligibility still hold?
-
Finally, if I am not wrong, the key motivation seems to be concentrating the feature norm. I am thus a bit curious: what if we not only normalize as in Eq. 6 but concentrate the features even further? What would happen? Meanwhile, while I admit its naturalness, is there any specific reason for the authors to choose to perform the concentration via normalization?
I still have these queries yet I remain positive on this submission. I thus vote for weak accept now.
Other Comments or Suggestions
N.A.
We thank the reviewer for their positive feedback, and for appreciating our work. We address the remarks below:
-
"organize related work section" and "elaborate difference to existing similar methods"
We will extend the discussion of related work and emphasize the differences from previous works that used feature normalization [3,4], the Mahalanobis distance, or both [1,2]. Most importantly, other works have investigated train-time methods that involve normalization, either implicitly through contrastive losses (CIDER [1], SSD [2]) or explicitly to improve OOD detection [3,4]. It is then natural to also apply normalization at inference time. For instance, CIDER applies KNN, and SSD performs k-means and then Mahalanobis. Those methods thus normalize their features for OOD detection because they also normalize during training. This is orthogonal to our work: the standard Mahalanobis method for OOD detection is a post-hoc method, where adjusting the pretraining scheme is not feasible. We show that in this setting, the Gaussian assumption underlying this method is often severely violated, and that normalizing the features better aligns with this assumption, consistently improving OOD detection across architectures and pretraining techniques. We will clarify this distinction and expand the discussion of other approaches in the paper (see the answer to reviewer jfEM for a more thorough discussion and quantitative comparisons to SSD). If there is a specific reference the reviewer would like us to discuss, please let us know.
-
Lemma 3.1, not sufficient condition
We will clarify that a concentrated feature norm is not a sufficient, but a necessary condition for a Gaussian distribution. Lemma 3.1 only shows that - under the assumption of a Gaussian distribution in feature space - we expect some concentration of the feature norm. To illustrate this, we sample from class-specific Gaussian distributions with the estimated means and shared covariance matrix (Figure 3-left), noting that in practice (Figure 3-right) the feature norms deviate strongly from the Gaussian model (e.g. via heavy tails). This suggests severe violations of the Gaussian assumption, which we substantiate by QQ plots and the variance alignment analysis. Our remedy - normalization - aligns the features better with the premise of normally distributed data with shared covariance matrix.
-
"elaborate on neligibility" (in QQ plot analysis)
In QQ plots, we compare empirical quantiles against a theoretical standard normal distribution. Since normalized and unnormalized features have different variances, their QQ plots would have different slopes, making direct comparison difficult. To align the comparisons, we divide both samples by their empirical standard deviation; this ensures both are evaluated against the same reference slope (black line in Figure 4). Dividing by the empirical standard deviation technically transforms a normal distribution into a Student's t-distribution with $n-1$ degrees of freedom. As the reviewer correctly pointed out, this matters for small $n$. However, the t-distribution converges to a Gaussian as $n \to \infty$, and for large $n$ the difference is typically negligible [5]. We use all ImageNet train features ($n > 10^6$) in our QQ plots, making the t-distribution practically Gaussian and allowing for the analysis we performed in the paper. We would like to stress that all of this is only a technicality in the analysis of the features via QQ plots, and is irrelevant for Maha++ as an OOD detection method.
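A minimal sketch of the standardization step described above, using SciPy's probability-plot utility (illustrative only; the exact projection of the features onto 1-D is an assumption):

```python
import numpy as np
import scipy.stats as st

def standardized_qq(x):
    """Divide a 1-D feature projection by its empirical std so that normalized and
    unnormalized features are compared against the same reference slope."""
    x = (x - np.mean(x)) / np.std(x)   # approx. N(0, 1) under the Gaussian model for large n
    (osm, osr), (slope, intercept, r) = st.probplot(x, dist="norm")
    return osm, osr, slope             # slope ~ 1 and r ~ 1 indicate a good Gaussian fit
```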
-
concentration of feature norm is "key motivation"
Our key motivation is not to concentrate the feature norm. Instead, feature norm concentration is a necessary condition IF the features were indeed normally distributed. As we find, the feature norms are, however, not concentrated, but for instance show extremely heavy tails. We take this as an indication that the Gaussian assumption is violated, and further validate it via QQ plots and our variance analysis. Regarding the reviewer's question about concentrating even further: we are not sure we understand what the reviewer means by this. One could, in principle, normalize differently, but this could change the direction of the features. We therefore opted for $\ell_2$-normalization. Does this answer the question?
We are happy to clarify any of the points further!
[1] Ming et al. How to exploit hyperspherical embeddings for out-of-distribution detection? ICLR 2023
[2] Sehwag et al. SSD: A unified framework for self-supervised outlier detection. ICLR 2021
[3] Regmi et al. T2FNorm: Train-time feature normalization for OOD detection in image classification. CVPR 2024 Workshop
[4] Haas et al. Linking neural collapse and l2 normalization with improved out-of-distribution detection in deep neural networks. TMLR 2023
[5] https://www.jmp.com/en/statistics-knowledge-portal/t-test/t-distribution
The paper revisits the Mahalanobis distance for out-of-distribution detection. It first examines how the assumptions underlying the Mahalanobis distance for OOD detection are violated by a variety of models. It then proposes a maximally simple but effective remedy by applying l2-normalization to the pre-logit features. The evaluation shows that this outperforms previous works by a significant margin.
Questions for Authors
None.
Claims and Evidence
The claims made by the paper are supported by clear and convincing evidence. The paper demonstrates that, empirically, feature distributions of some models do not fit the assumptions made by prior Mahalanobis distance-based OOD detection. Figure 5 further shows that, for SwinV2-B models, the feature norm is strongly correlated with the Mahalanobis distance, while being a bad OOD predictor, which in turn leads to suboptimal OOD detection performance. In contrast, applying l2-normalization as proposed reduces correlation between feature norms and Mahalanobis distance, which allows drawing a better decision boundary. The findings are furthermore validated by the quantitative evaluation of the proposed method on a wide variety of pre-trained models.
Methods and Evaluation Criteria
The method is well motivated by pointing out how the assumptions in prior Mahalanobis based OOD detection methods can be violated by some models. The evaluation metrics (false-positive rate at true positive rate of 95% in particular) and benchmark datasets make sense and are in line with prior work on OOD detection. I appreciate that the evaluation is performed on a wide variety of model types, architectures and sizes.
Theoretical Claims
The main theoretical claim can be found in equation 5 and is elaborated upon in the appendix, which I did only check superficially.
Experimental Design and Analysis
The main experimental design is focused on evaluating the OOD false positive rate at a fixed true positive rate of 95% across different models and datasets, which is in line with prior work. The experimental analysis demonstrates that models violate the assumptions of Mahalanobis-based OOD detection to varying degrees, but that most models benefit somewhat from l2-normalization as proposed.
Supplementary Material
The supplementary contains a lot of additional experimental results, proofs, and discussion. I did not check the entirety of the supplemental but found the discussion of Augreg ViTs particularly interesting.
Relation to Prior Literature
While the purely methodological innovation of this paper is minimal, its value lies in identifying and empirically demonstrating violations of key assumptions of Mahalanobis based OOD detection in practice, proposing a maximally simple remedy, and providing thorough evaluation of this remedy on a wide variety of models.
Missing Essential References
None.
Other Strengths and Weaknesses
None.
Other Comments or Suggestions
- ImageNet reference renders as "(University, 2015)"
We thank the reviewer for carefully reading and evaluating our paper, and we are glad that the reviewer finds that our claims are "supported by clear and convincing evidence", that our method is "well motivated", and that they appreciate the "wide variety of model types, architectures and sizes" in our "thorough evaluation". We agree with the reviewer that the results on augreg ViTs stand out, and think that investigating the underlying reasons for the behaviour of those models (i.e., why the augreg training scheme results in the favourable structure of the feature space) is an interesting direction for future research. We thank the reviewer for pointing out the incorrect ImageNet reference, which we will fix. For the rebuttal, we have included a more thorough discussion of and comparison to SSD (see response to reviewer jfEM), an evaluation of a DinoV2 model (also in response to reviewer jfEM), and more variance deviation measures (see response to reviewer qhSY). If there is anything else the reviewer would like to see addressed, we would be happy to discuss it.
This paper presents a holistic empirical analysis illustrating that the representations of most vision backbones violate the Gaussian distribution assumption. From this observation, the paper introduces a variation of the Mahalanobis distance for OOD detection called Mahalanobis++. Extensive experiments on multiple recent OOD benchmarks and various backbones are proposed to assess the good behavior of the proposed approach.
Update after rebuttal
I am satisfied with the rebuttal and will thus keep my positive rating
Questions for Authors
Despite bringing appreciated insights into the distance-based OOD literature, the related-work section misses a clear positioning: e.g., which challenges are unaddressed by normalized approaches such as [2,3] or CIDER [4]? What makes the proposed method better suited for OOD detection?
[2] Regmi, S., Panthi, B., Dotel, S., Gyawali, P. K., Stoyanov, D., and Bhattarai, B. T2FNorm: Train-time feature normalization for OOD detection in image classification. CVPR Workshop 2024
[3] Haas, J., Yolland, W., and Rabus, B. T. Exploring simple, high quality out-of-distribution detection with l2 normalization. TMLR 2024
[4] Ming, Y., Sun, Y., Dia, O., and Li, Y. How to exploit hyperspherical embeddings for out-of-distribution detection? ICLR 2023
Claims and Evidence
The principal claim concerns the violation of the class-wise unimodal Gaussian hypothesis of the representations. This is a reasonable claim as the Mahalanobis method does rely on strong relaxations for computational reasons. Moreover, this claim is supported by strong empirical evidence in this paper, see Fig. 3, 4, 5, and Table 1.
Methods and Evaluation Criteria
Evaluation criteria and benchmarks are standard for OOD detection.
Reporting only FPR95 is not standard practice, as this metric is not robust to small changes in the decision function and is particularly sensitive to class imbalance. FPR@95 highlights performance at a specific critical threshold but is typically complemented by AUC, ensuring a more holistic evaluation. I see that the AUC scores in the supplementary are still in favor of the proposed approach.
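For completeness, the FPR@95 metric discussed here can be computed as follows (a sketch; the sign convention, lower score = more in-distribution, is an assumption):

```python
import numpy as np

def fpr_at_95_tpr(scores_id, scores_ood):
    """FPR on OOD data at the threshold where 95% of ID samples are kept."""
    thresh = np.quantile(scores_id, 0.95)   # keep 95% of ID scores below this threshold
    return np.mean(scores_ood <= thresh)    # fraction of OOD samples wrongly kept
```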
Theoretical Claims
Lemma 3.1 does not support the indicated conclusion. First, the features should be concentrated around a different term. Moreover, the higher the dimension, the looser the upper bound.
Experimental Design and Analysis
The evaluation protocol is well designed. However, as many backbones pretrained with a contrastive loss also normalize the representations, a comparison of Mahalanobis with SSD [1] or on other DINO-like backbones would give important insight into the method and the importance of normalization for OOD detection.
[1] Sehwag, Vikash, Mung Chiang, and Prateek Mittal. “SSD: A Unified Framework for Self-Supervised Outlier Detection,” ICLR 2021
Supplementary Material
I checked the proof of Lemma 3.1 and the AUC results in Section E.
Relation to Prior Literature
The violation of the Gaussian assumption discussed in this paper is shared with multiple other distance-based papers. Other approaches have proposed pre-training strategies to mitigate this limitation.
The proposed extension to Mahalanobis is particularly incremental. However, it is well illustrated both by the empirical statistical evaluation of the feature dispersion and by extensive evaluations.
Missing Essential References
- Good performance of the Mahalanobis distance for OOD detection on normalized features has already been explored in SSD [1]. In the related work section, the authors state that "Adapting them to ImageNet-scale setups as post-hoc OOD detectors has so far not been successful". Same at the end of the method section: "While $\ell_2$-normalization has been used with non-parametric methods like KNN (Sun et al., 2022; Park et al., 2023a) or cosine similarity (Techapanurak et al., 2020), it is - to the best of our knowledge - not used with the Mahalanobis score". This is a bit of an overstatement, as Mahalanobis is a strong and cheap baseline even on large-scale datasets, and SSD has been successfully experimented with on ImageNet-1k. Thus, a broader discussion of and comparison with SSD is missing in the current paper.
[1] Sehwag, Vikash, Mung Chiang, and Prateek Mittal. “SSD: A Unified Framework for Self-Supervised Outlier Detection,” ICLR 2021
Other Strengths and Weaknesses
The paper is very well written and supported with extensive evaluations.
Other Comments or Suggestions
NA
We thank the reviewer for their valuable comments and address the remarks below:
-
"the features should be concentrated around " (in Lemma 3.1)
We thank the reviewer for checking our proof, but we strongly believe that the term stated in the lemma is correct. The term stated by the reviewer can even become complex if the norm of the mean is large enough. As the variance goes to zero, the norm of the random variable should concentrate around $\|\mu\|$, which is exactly what our term yields in this limit. We are happy to answer any questions on a particular step of the proof to resolve any potential confusion.
-
"the higher the dimension, the looser is the upper-bound" (in Lemma 3.1)
This is expected, as the squared $\ell_2$-norm grows with the dimension $d$. However, the deviation per dimension decreases: the right-hand side decreases relative to the norm itself, as the variance grows linearly in $d$. Lemma 3.1 shows that under the Gaussian assumption, we should see some concentration of the feature norms. To illustrate this, we simulate it in Figure 3 (left) by sampling from class-specific Gaussian distributions with the estimated means and shared covariance matrix, noting that the actual feature norms (Figure 3, right) deviate strongly from the Gaussian model (e.g., via heavy tails). This suggests severe violations of the Gaussian assumption, which we substantiate by QQ plots and the variance alignment analysis.
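A minimal simulation of this norm-concentration behaviour under the Gaussian model (toy dimension and covariance chosen for illustration; the paper instead uses the empirically estimated class means and shared covariance):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 768, 50_000                               # toy feature dimension / sample count
mu = rng.standard_normal(d)
Sigma = np.diag(rng.uniform(0.5, 1.5, size=d))   # toy shared covariance

Z = rng.multivariate_normal(mu, Sigma, size=n, method="cholesky")
norms = np.linalg.norm(Z, axis=1)
# Under the Gaussian model the norms concentrate: small relative spread, no heavy tails.
print(norms.mean(), norms.std() / norms.mean())
```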
-
"the sentence 'Adapting them to IN-scale setups ... has so far not been successful' ... is a bit of an overstatement ... as Mahalanobis ... has been successful on IN-1k"
We agree, this statement only refers to Gaussian mixture models (GMMs), and not to the Mahalanobis distance. We will clarify this in the paper and explain the difference between GMMs and the Mahalanobis distance.
-
"the sentence 'l2-normalization...is ... not used with the Mahalanobis score' ... is a bit of an overstatement ... as SSD ... has been successful on IN-1k"
We agree that Mahalanobis has been applied to normalized features in other works like SSD[1] and CIDER[2], and we should have chosen our statement more carefully. However, these are train-time methods where normalization is implicitly part of their contrastive loss. Those methods thus normalize their features for OOD detection because they also normalize during training. This is orthogonal to our work: The standard Mahalanobis method for OOD detection is a post-hoc method, where adjusting the pretraining scheme is not feasible. We show that in this setting, the Gaussian assumption underlying this method is often severely violated, and that normalizing the features better aligns with this assumption, consistently improving OOD detection across architectures and pretraining techniques. We will clarify this distinction and expand the discussion of [1-4] in the paper (see below for SSD).
-
"a broader discussion and comparison with SSD" and "which challenges are unaddressed by normalized approaches"
SSD involves three steps:
- Training with a supervised (SSD+) or unsupervised (SSD) contrastive loss (implicitly normalizing features),
- Cluster estimation via k-means in the normalized feature space,
- Mahalanobis-based OOD detection using cluster labels instead of class labels.
This setting differs fundamentally from ours, as SSD, like [1,3,4], is a train-time method. Methods like [1-4] cannot be directly applied to the pretrained checkpoints we evaluate. To demonstrate the advantages of post-hoc approaches, we evaluate SSD+ on NINCO using the ResNet50 from [5] (trained for 700 epochs). SSD+ is clearly outperformed by our top models, with FPR more than 3× higher. Our top models are obtained from various pretraining schemes, and retraining models with SSD or [1,3,4] at this scale is typically not feasible.

| model | FPR |
|---|---|
| SSD+ w. 100 clusters | 66.0% |
| SSD+ w. 500 clusters | 65.7% |
| SSD+ w. 1000 clusters | 67.8% |
| ConvNeXtV2-L + Maha++ | 18.4% |
| EVA02-L14 + Maha++ | 18.6% |
-
comparison with DINO
We report the FPR of a DinoV2-S model (fine-tuned on IN1k) on NINCO. Maha++ clearly outperforms Maha.
| Maha | Maha++ |
|---|---|
| 77.3% | 53.4% |
[1] Ming et al. How to exploit hyperspherical embeddings for out-of-distribution detection? ICLR 2023
[2] Sehwag et al. SSD: A unified framework for self-supervised outlier detection. ICLR 2021
[3] Regmi et al. T2FNorm: Train-time feature normalization for OOD detection in image classification. CVPR 2024 Workshop
[4] Haas et al. Linking neural collapse and l2 normalization with improved out-of-distribution detection in deep neural networks. TMLR 2023
[5] Sun et al. Out-of-distribution detection with deep nearest neighbors. ICML 2022
After review, the paper received four positive evaluations. Following the authors' rebuttal, one reviewer increased the rating, while the other three maintained their original scores. All reviewers acknowledged the paper's contributions and expressed satisfaction with its technical merits.
The AC concurs with the reviewers' assessments and recommends acceptance.