Inference, Fast and Slow: Reinterpreting VAEs for OOD Detection
We show that by leveraging sufficient statistics derived from the likelihood path (LPath) of VAEs, the LPath method can achieve SOTA performance for unsupervised, one-sample OOD detection.
Abstract
Reviews and Discussion
This paper introduces a novel VAE-based approach for unsupervised OOD detection. The core idea is to utilize the sufficient statistics generated during the encoding and decoding processes of a VAE as features for classical OOD detection algorithms, such as COPOD (Li et al., 2020) or MD (Lee et al., 2018; Maciejewski et al., 2022). The authors also propose a method to address the trade-off in selecting the latent dimension. This involves using two VAEs, one with a high latent dimension and the other with a lower one. The performance of the proposed method is empirically evaluated on benchmark datasets. The results demonstrate that the proposed method can achieve performance comparable to existing state-of-the-art approaches while maintaining a smaller model size.
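The two-stage recipe summarized above (extract per-sample VAE statistics, then run a classical detector such as MD on them) can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: the synthetic 4-d feature vectors play the role of the encoder/decoder statistics, and a Mahalanobis-distance scorer plays the role of the classical detector.

```python
import numpy as np

# Stage 1 (stand-in): in the paper these would be per-sample VAE statistics
# (encoder/decoder means and variances); here synthetic features substitute.
rng = np.random.default_rng(0)
id_stats = rng.normal(0.0, 1.0, size=(500, 4))   # "in-distribution" features
ood_stats = rng.normal(3.0, 1.0, size=(50, 4))   # shifted "OOD" features

# Stage 2: fit a classical detector on the ID statistics. A Mahalanobis
# distance (MD) scorer is sketched; COPOD could be substituted (e.g., via pyod).
mean = id_stats.mean(axis=0)
cov = np.cov(id_stats, rowvar=False) + 1e-6 * np.eye(id_stats.shape[1])
cov_inv = np.linalg.inv(cov)

def md_score(x):
    """Mahalanobis distance of each row of x from the fitted ID statistics."""
    d = x - mean
    return np.sqrt(np.einsum("...i,ij,...j->...", d, cov_inv, d))

# OOD samples should receive markedly larger scores than ID samples.
print(md_score(id_stats).mean() < md_score(ood_stats).mean())  # True
```

The score threshold (or a ranking metric such as AUC) then decides which test points are flagged as OOD.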
Strengths
- This paper clearly presents the proposed idea and algorithm, and is very easy to follow.
- The performance of the proposed algorithm is empirically demonstrated by comparison to existing SOTA approaches. It is shown that this method performs on par with others while using smaller models.
Weaknesses
As illustrated in Table 1, the performance improvement of the proposed algorithm is not substantial when compared to other competitors. In some instances, its performance is even inferior to other state-of-the-art (SOTA) methods. Furthermore, the datasets used in the experiments are all small in size. Consequently, the benefits of the proposed likelihood path principle and the associated algorithms have not been fully demonstrated. It remains uncertain whether the proposed concept and algorithm would be effective in more complex real-life scenarios.
Questions
- The proposed LPath principle and algorithm are presented in the context of a Gaussian VAE. Is it possible to extend them to a Gaussian mixture VAE or a hierarchical VAE? Doing so would likely improve OOD detection performance, at the cost of increased model size.
- As shown in Table 1, DDPM performs much better than the proposed methods in some cases. Why? Is it because the model capacity of the proposed algorithm is too small compared to DDPM? How about enlarging its model size and then comparing against DDPM again?
Some minor issues:
pp.122-123, the equation needs to be revised
pp.226, "While VAE optimization should already be driving Eqs. 17–19 to a small value", why?
pp.893, Theorem ??
As illustrated in Table 1, the performance improvement of the proposed algorithm is not substantial when compared to other competitors. In some instances, its performance is even inferior to other state-of-the-art (SOTA) methods.
We acknowledge that our method does not outperform all SOTA methods in every case. However, LPath-1M-COPOD and LPath-2M-COPOD are each competitive against most methods in most settings, and the union of the last three rows, which together constitute the new method, outperforms the other methods.
Our goal is to introduce a novel method that balances performance with efficiency in model size and computational resources. We believe the LPath approach offers a valuable trade-off and can inspire further exploration in OOD detection methods.
Furthermore, the datasets used in the experiments are all small in size. Consequently, the benefits of the proposed likelihood path principle and the associated algorithms have not been fully demonstrated. It remains uncertain whether the proposed concept and algorithm would be effective in more complex real-life scenarios.
We appreciate your concern about the scale of our datasets. We acknowledge that the datasets and models in this paper are relatively small by today’s standards. However, our main goal in this work is not to show a significant performance gain, but rather to show that the LPath principle can achieve competitive unsupervised OOD detection performance even with much smaller models.
The datasets we selected are widely used in the literature and include challenging cases like HFlip and VFlip. These are difficult scenarios where the ID and OOD data differ minimally. Demonstrating effectiveness in such cases highlights the potential of our approach.
We agree that testing the LPath method on larger, real-world datasets is important. Given the current challenges in unsupervised OOD detection, where achieving an AUC significantly above 0.5 is non-trivial, we consider our contributions a meaningful step forward. Future work will explore the applicability of our method in more complex scenarios.
Further, we are not trying to write a purely outcome-oriented paper that pushes the benchmark up by a few points through more tweaking; we are trying to write a concept- and method-oriented paper that presents a novel concept and, we hope, inspires more novel and interesting ideas.
Questions:
The proposed LPath principle and algorithm are presented in the context of a Gaussian VAE. Is it possible to extend them to a Gaussian mixture VAE or a hierarchical VAE? Doing so would likely improve OOD detection performance, at the cost of increased model size.
Yes, extending the LPath principle to Gaussian mixture VAEs or hierarchical VAEs is feasible and could potentially enhance OOD detection performance. However, this would involve increased model complexity and goes beyond our computational budget. We believe this is a promising direction for future research and could further validate our approach.
As shown in Table 1, DDPM performs much better than the proposed methods in some cases. Why? Is it because the model capacity of the proposed algorithm is too small compared to DDPM? How about enlarging its model size and then comparing against DDPM again?
DDPM indeed performed better than our method in 3 of the 9 cases, but only marginally, and we agree this could be due to model size (ours at 3M parameters vs. DDPM at 46M). We note that despite this disparity in parameter counts, we outperform DDPM in the CIFAR-10 (ID) case and are comparable in the FMNIST (ID) case. No other VAE-based method comes close.
While versions of the scaling laws are appearing across many domains, the main point of this paper is not to scale OOD detection or to achieve the best SOTA OOD performance in every case. Though both types of papers have their utility, ours is method-oriented rather than outcome-oriented: we aim to present a novel concept that may inspire more novel and interesting ideas, instead of pushing the benchmark up by a few points through more tweaking or larger models.
Some minor issues:
pp.122-123, the equation needs to be revised
Sorry, we are not sure what you mean; could you be more specific about how you think it should be revised?
pp.226, "While VAE optimization should already be driving Eqs. 17–19 to a small value", why?
Thank you for this question. We clarified the reasoning in the revised manuscript:
Eqn. 17 is small because the KL objective encourages the encoder to be close to the prior N(0, 1). Eqn. 18 should be close to 1, also due to the KL regularization. Eqn. 19 should be small because the model distribution converges weakly to the data distribution (Theorems 4 and 5 of [1]).
[1] Dai, Bin, and David Wipf. "Diagnosing and Enhancing VAE Models." International Conference on Learning Representations, 2019.
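To make the first two points concrete: for a Gaussian posterior and a standard normal prior, the VAE's per-dimension KL term has the closed form KL(N(μ, σ²) ‖ N(0, 1)) = ½(μ² + σ² − 1 − ln σ²), which vanishes exactly at μ = 0, σ = 1. A minimal numeric check (illustrative, not from the paper):

```python
import math

def kl_gauss_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), the per-dim VAE KL term."""
    return 0.5 * (mu**2 + sigma**2 - 1.0 - math.log(sigma**2))

# The KL penalty is zero only at mu = 0, sigma = 1, so minimizing the ELBO's
# regularizer pushes the encoder statistics toward the prior -- which is why
# the quantities in Eqs. 17-18 should already be small / close to 1 on ID data.
print(kl_gauss_to_std_normal(0.0, 1.0))  # 0.0
print(kl_gauss_to_std_normal(1.0, 2.0))  # positive for any other (mu, sigma)
```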
pp.893, Theorem ??
Thank you for catching this typo on page 893. We have corrected it in the revised manuscript.
The paper considers out-of-distribution (OOD) detection using variational autoencoders (VAEs). The idea is essentially to represent each data point with the mean-variance pairs of both the encoder and decoder. A density estimate is then formed over these statistics on the training data, and OOD detection is done with respect to this density estimate. Empirical results appear promising.
The approach is justified through a newly phrased "Likelihood Path" principle, which is linked to both the likelihood principle and information theory.
Strengths
- I think the key idea of extracting sufficient statistics per data point and doing OOD detection in a density of these is neat.
- Empirical performance suggests that the approach has some merit.
Weaknesses
Communication
- The paper is not an easy read as it quickly gets rather convoluted. For example, some efforts are spent on talking about "slow and fast weights", but it is unclear how this concept is related to the actual work. The actual work seems to be about sufficient statistics, so it seems like a detour to talk about "fast and slow weights". I may have missed an important insight here, but I do not see what the fast-and-slow-analogy brings to the table (beyond confusion for this reader).
- Another example of convoluted writing is that the method is motivated by the proposed "likelihood path" principle in Sec 3, but this principle is only discussed in Sec 4. It feels to me like the order of presentation is unconstructive. (I understand that such comments are fundamentally subjective, but I, at least, found the presentation confusing).
The phrased likelihood path principle is imprecise: the first sentence says that there is "more information", but it does not say relative to what. It does not help that the principle is repeated three times throughout the paper.
Conceptual
The paper presents a fairly broad principle and argues based on the idea that the extracted statistics are sufficient for the marginal likelihood. This, however, appears incorrect. As far as I can tell, they are sufficient statistics for the one-sample estimate of the marginal likelihood, but this is a notoriously different quantity than the marginal likelihood. Pragmatically, I understand why the one-sample estimate is considered, but I find it troublesome that this is not discussed more elaborately; instead the paper largely talks about the one-sample estimate as being the marginal likelihood.
Following the point above, Equations 23 and 24 appear incorrect. They would be correct for the ELBO or the one-sample estimate of the marginal likelihood, but they are incorrect for the marginal likelihood (which is what the equations state).
Questions
- I like that the approach works well with small models. Have you tried using your approach on larger models?
- Is the approach applicable to hierarchical VAEs? Since these have many latent variables, I assume the set of sufficient statistics becomes rather large, which may render the approach impractical. Do you have any experience with this?
The paper is not an easy read as it quickly gets rather convoluted. For example, some efforts are spent on talking about "slow and fast weights", but it is unclear how this concept is related to the actual work. The actual work seems to be about sufficient statistics, so it seems like a detour to talk about "fast and slow weights". I may have missed an important insight here, but I do not see what the fast-and-slow-analogy brings to the table (beyond confusion for this reader).
Thank you for your feedback on communication. Our intention was to establish a foundation for applying the sufficiency principle within deep generative models (DGMs). Specifically, we observe that the likelihood function with respect to the slow weights (the neural network parameters) is intractable (as noted on lines 304–307), making it challenging to extract sufficient statistics directly from them.
Our key insight is that we can apply the sufficiency principle to the fast weights, i.e., the mean and variance parameters in Gaussian VAEs, which are tractable and directly influenced by individual data points (as illustrated in Figure 1). This distinction is crucial because it enables us to extract sufficient statistics from the fast weights, facilitating effective OOD detection through density estimation on them.
Another example of convoluted writing is that the method is motivated by the proposed "likelihood path" principle in Sec 3, but this principle is only discussed in Sec 4. It feels to me like the order of presentation is unconstructive. (I understand that such comments are fundamentally subjective, but I, at least, found the presentation confusing).
The likelihood path principle relies on the likelihood and sufficiency principles, which in turn rely on the slow and fast weights of DGMs. We were also worried about the order of presentation, so we included a brief summary in the introduction to give readers a rough idea of the principle. We would welcome your feedback on whether the introduction is clear to you, or any suggestions on what a better ordering might look like given the dependencies described above.
The phrased likelihood path principle is imprecise: the first sentence say that there is "more information", but it does not say relative to what. It does not help that the principle is repeated three times throughout the paper.
We appreciate this feedback and agree that the phrasing could be more precise; we highly value precision in the statements and claims we make. In Section 4.3, we formally define the LPath principle in Equations (23) and (24). Specifically, Equation (24) shows that when density estimation is imperfect, the mutual information between the sufficient statistics and the sample is greater than the mutual information between the marginal likelihood and the sample. In other words, the sufficient statistics contain more information about the sample than the marginal likelihood does.
We will revise the manuscript to explicitly state that the "more information" is relative to the marginal likelihood, addressing the ambiguity you pointed out. Our experimental results in Table 1 support the LPath principle by demonstrating that leveraging the sufficient statistics leads to better OOD detection performance than using the marginal likelihood alone. This is significant because practical estimates of the marginal likelihood are inherently imperfect, making the LPath principle broadly applicable.
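A one-line sketch of this information-theoretic argument, in notation we introduce here (which may differ from the paper's): writing $T(x)$ for the sufficient statistics and $\hat{p}(x)$ for the one-sample likelihood estimate, $\hat{p}(x)$ is a deterministic function of $T(x)$, so the data processing inequality gives

```latex
% Data processing inequality: since \hat{p}(X) = f(T(X)) for deterministic f,
% the likelihood value can carry no more information about X than T(X) does.
I\bigl(X;\,\hat{p}(X)\bigr) \;\le\; I\bigl(X;\,T(X)\bigr)
```

with the inequality strict whenever the scalar likelihood value discards some of what the statistics know about $X$, matching the reading of Equation (24) described above.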
Following the point above, then equations 23 and 24 appear incorrect. They would be correct for the ELBO or the one-sample-estimate of the marginal likelihood, but they are incorrect for the marginal likelihood (which is what the equations state).
We appreciate your observation and apologize for any confusion regarding Equations (23) and (24). On lines 364-365, we clearly state that lines 366-374 concern the marginal likelihood estimate, not the theoretical marginal likelihood computed by integration. As you noted, the argument is correct for the ELBO or the one-sample estimate. Our main thesis is that the likelihood path (LPath) contains more information and can outperform the practical marginal likelihood estimates used in OOD detection.
While analyzing the theoretical marginal likelihood is interesting, it is not the primary focus of our work. Nevertheless, we can extend our argument on lines 366-374 to cases where we marginalize over the latent variable through sample summation or integration. Since Equations (23) and (24) hold uniformly for every latent sample, we can obtain an averaged version of these equations. We will include this extended discussion in the Appendix to provide additional clarity.
Questions:
I like that the approach works well with small models. Have you tried using your approach on larger models?
Thank you for your interest in scaling up our approach. While we are keen to explore the performance of our method on larger models, we have not yet conducted such experiments due to computational resource constraints. We consider this an important direction for future work and hope someone in the community with the compute budget will investigate it further.
Is the approach applicable to hierarchical VAEs? Since these have many latent variables, I assume the set of sufficient statistics becomes rather large, which may render the approach impractical. Do you have any experience with this?
You raise an excellent point. The key challenge we are solving is exactly what you said: when the number of latent variables becomes large, classical OOD detection methods suffer from the curse of dimensionality. Our method combines the best of both worlds by extracting sufficient statistics along the likelihood path, reducing the dimensionality of the problem while retaining as much information as possible for effective OOD detection.
In principle, the LPath approach can be applied to hierarchical VAEs, and extending it to them is an exciting avenue for future research. Although we have not yet experimented with hierarchical models, we believe our method could mitigate their complexity by focusing on the sufficient statistics. However, we leave applications to more models for future work, as the main contribution of this work is a novel method and a conceptual framework for reasoning about similar problems.
Thanks for the detailed replies.
I understand that a significant part of my feedback was subjective (e.g., I found the paper difficult to read). As an author, such feedback can be difficult to incorporate as it is inherently imprecise.
Regarding the requested feedback on the introduction, this is one place where I lacked clarity. For example, the introduction introduces two research questions. As far as I can tell, RQ1 is really the key goal of the paper, while RQ2 is more like an implementation detail. Currently, the introduction presents these questions as equally important, but after having read the paper I cannot help but consider these two questions as not being of equal importance.
So, while I appreciate the idea of explicating the driving research questions, I found them more confusing than helpful, because I felt I had to second-guess their relative importance. As a reader, having to guess is the last thing I want.
Dear Reviewer RmB7,
Thank you for your detailed response and for engaging with our rebuttal. While we appreciate your feedback, we would like to address your specific concerns about the introduction and the presentation of our research questions (RQs).
I understand that a significant part of my feedback was subjective (e.g., I found the paper difficult to read). As an author, such feedback can be difficult to incorporate as it is inherently imprecise.
We appreciate you acknowledging that a significant part of your feedback was subjective and imprecise. Thank you for being transparent.
“For example, the introduction introduces two research questions. As far as I can tell, RQ1 is really the key goal of the paper, while RQ2 is more like an implementation detail. Currently, the introduction presents these questions as equally important, but after having read the paper I cannot help but consider these two questions as not being of equal importance.”
We respectfully disagree with your interpretation. The purpose of RQs is to frame the key challenges our paper addresses, not to create a hierarchy of importance. It is common for papers to make multiple contributions, and we believe the notion of “relative importance” between RQs is neither necessary nor relevant. Could you clarify why you think presenting two RQs is problematic?
Moreover, the two RQs are interrelated: RQ1 establishes the overarching challenge, while RQ2 addresses a critical component of the solution that can potentially inspire future readers with similar challenges when applying the LPath principle. Dismissing RQ2 as an “implementation detail” misrepresents its role in addressing the scientific questions posed by the paper.
“So, while I appreciate the idea of explicating the driving research questions, I found them more confusing than helpful, because I felt I had to second-guess their relative importance. As a reader, having to guess is the last thing I want.”
We believe this critique stems from a misunderstanding. The RQs were included precisely to enhance readability, based on extensive rounds of feedback from colleagues unfamiliar with our work. We observed significant improvements in how they understood and engaged with the paper after organizing it around these questions. Could you elaborate on why you found them confusing? If there are specific parts of the introduction or RQs that could be clarified, we would be happy to address those concerns.
Thank you again for your time and for sharing your perspective. We hope this clarifies our approach and look forward to hearing your additional thoughts.
I understand and sympathize with your replies.
I want to emphasize that, due to the subjective nature of this debate, I would usually not have engaged. I only gave the feedback as you explicitly requested. I hope I have been clear that this part of my feedback was subjective and I, therefore, won't let it factor into my recommendation.
The paper introduces the LPath algorithm for OOD detection. LPath operates in two stages: first, it trains a VAE on in-distribution data and extracts key statistics from the fast weights related to the reconstruction error and latent variables; second, it applies classical density estimation methods to these statistics for OOD detection. The key idea is that the computational path to the likelihood function contains more informative details than the likelihood alone. The method is demonstrated on some simple datasets.
Strengths
- The paper addresses a crucial problem in machine learning: the need for reliable OOD detection, particularly in safety-critical applications. The proposed “LPath Principle” and its application to VAEs offer a promising direction for improving OOD detection performance.
- The paper presents a new perspective on OOD detection by reinterpreting VAEs through the lens of fast and slow weights in an interesting manner. The authors also demonstrate originality by pairing two VAEs with different latent dimensions to address the trade-off between encoder and decoder performance for OOD detection.
- The paper provides a thorough analysis of the LPath Principle, examining it from different perspectives. The authors ground their approach in well-established statistical principles. The combinatorial analysis, though its explanation lacks some clarity, seems well motivated.
Weaknesses
- While the paper generally explains the concepts well, most sections, particularly those discussing the statistical foundations and combinatorial analysis, are not well organised, and in most cases it is difficult to follow the main argument.
- While the results are competitive, the improvement over existing methods is marginal on some benchmarks, particularly those that seem saturated. This raises the question of whether the LPath approach offers significant practical advantages in those cases. To address this, the paper could benefit from an evaluation on more difficult datasets.
- The paper focuses primarily on Gaussian VAEs. While the method is presented as a general principle applicable to other deep generative models, the paper does not provide concrete examples or experimental results for models other than VAEs.
Questions
N/A
The paper presents a method for unsupervised out-of-distribution (OOD) detection that uses a VAE. It uses statistics generated by the VAE as features for existing OOD algorithms. Two reviewers made a case for rejection, while one reviewer thought the paper could be accepted. Overall, the case for rejection was more compelling, especially regarding the technical and conceptual presentation of the idea and its validation on overly simple datasets. The paper defines new foundational concepts, such as the "Likelihood Path" and "fast and slow inference", that could be more clearly communicated to the reader to verify that the proposed method is indeed more fundamental than a modest outlier detection method.
Additional Comments on Reviewer Discussion
The negative reviewers engaged with the authors during the discussion and did not wish to change their assessment. The authors questioned whether those reviewers really read the paper carefully. While the paper may be able to be modified and accepted to a top-tier conference, I am confident that the reviewers made informed assessments on the submission.
Reject