DOG: Diffusion-based Outlier Generation for Out-of-Distribution Detection
Abstract
Reviews and Discussion
This paper presents an effective outlier synthesis method based on Stable Diffusion for the setting where label information is given. The methodology consists of three parts: textual inversion to infer a pseudo-word embedding, synthesizing outliers from semantically similar concepts, and fine-tuning via outlier exposure. The method is tested on three benchmarks and generally improves over the compared baselines.
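For context on the first step: textual inversion (Gal et al., 2022) learns a new token embedding by minimizing the pretrained diffusion model's denoising loss with the pseudo-word inserted into a prompt template. The paper's exact formulation may differ, but the generic objective is

$$ v_* \;=\; \arg\min_{v}\; \mathbb{E}_{z \sim \mathcal{E}(x),\; \epsilon \sim \mathcal{N}(0, I),\; t} \Big[ \big\| \epsilon - \epsilon_\theta\big(z_t, t, c_\theta(p(S_*))\big) \big\|_2^2 \Big], $$

where $\epsilon_\theta$ is the frozen denoiser, $z_t$ the noised latent, $c_\theta$ the text encoder, and $p(S_*)$ a prompt template containing the pseudo-word $S_*$ whose embedding $v$ is optimized.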
Strengths
(1) Outlier exposure techniques that do not require explicit outlier data reduce memory constraints. I believe this is a clear plus for practical use.
(2) The method improves on previous outlier exposure works.
Weaknesses
(0) Fundamentally, I am not sure why such synthesized data should lie on the boundary of the decision region. The images synthesized via Stable Diffusion deviate substantially from the original data distribution since they possess different backgrounds. Furthermore, the methodology is limited to cases where the word embeddings are simple enough that Stable Diffusion produces coherent images of the word.
(1) I also have issues with the experimental setup. While the authors propose this as an effective outlier synthesis strategy, the experiments are conducted on relatively easy datasets. I think the paper could be improved by comparing against near-OOD or hard-OOD datasets in the OpenOOD benchmark [1]. Until then, I hesitate to agree that this is an effective synthesis strategy.
(2) I also think the performance of the proposed outlier synthesis strategy is not that significant. It scores 91.51 average AUROC, whereas simple post-hoc scoring strategies reach 94–95 on such benchmarks [2,3], which calls the efficacy of the method into question. Furthermore, I am not sure why [3] scores 95% in its original paper but 86% in this manuscript.
(3) I am further concerned about the empirical significance of this paper since it uses [3] as the test-time scoring function. This makes it hard to isolate the contribution of the outlier synthesis itself. I believe POEM does not use such ASH scoring.
(4) I also find the paper hard to understand despite it being a combination of existing works. Please state the OE loss function explicitly in mathematical form (for reference, the standard form is sketched after this list). Furthermore, what is Line 11 in Algorithm 1? How is it computed? Please link to the related section for improved clarity.
(5) This paper generates multiple images through the diffusion model, so the extra computational cost for training will be heavy. I do not see any discussion of this.
(6) The paper does not discuss challenging real-world datasets (e.g., MVTec) that a Stable Diffusion model cannot cover. The synthesis strategy will be limited on such datasets.
(7) The paper only considers a relatively easy supervised setup and cannot be extended to an unsupervised setup, especially when datasets with multiple semantics are given as in-distribution data. This limits the applicability of the proposed outlier synthesis strategy.
(8) Furthermore, I am curious how this method performs in one-class classification (e.g., the one-class CIFAR-10/100 experiments). There exists a work that synthesizes OOD data via a diffusion model on such datasets [4]. I would like to see how this method performs against [4] and other baselines, especially when the model is trained from scratch.
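For reference regarding point (4): the standard outlier exposure objective (Hendrycks et al., 2019), which I assume the paper builds on, is

$$ \mathcal{L} \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{in}}}\big[\ell_{\text{CE}}\big(f(x), y\big)\big] \;+\; \lambda\,\mathbb{E}_{x'\sim\mathcal{D}_{\text{out}}}\big[ H\big(\mathcal{U};\, f(x')\big)\big], $$

where $H(\mathcal{U}; f(x'))$ is the cross-entropy between the uniform distribution over classes and the model's softmax output. The paper should state explicitly whether it uses this form or a variant.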
Overall, I have various issues with the significance and experiments of this paper. Hence, I lean toward rejecting it.
Questions
See the weaknesses above.
[1] OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection, arXiv 2023.
[2] Boosting Out-of-Distribution Detection with Typical Features, Neural Information Processing Systems 2022.
[3] Extremely Simple Activation Shaping for Out-of-Distribution Detection, International Conference on Learning Representations 2023.
[4] Fake It Till You Make It: Towards Accurate Near-Distribution Novelty Detection, International Conference on Learning Representations 2023.
This paper introduces an outlier generation methodology, termed DOG, which harnesses text-to-image generative models. DOG's primary objective is to create near-OOD data points that closely approach the low-density boundary of the in-distribution (ID) data. The authors propose identifying text-based boundary anchors within the ID data, which serve as reference points for generating outlier data. The generated outlier data is then employed to train an OOD detection model, following prior research in the field. The authors conduct OOD detection experiments on CIFAR and ImageNet benchmarks, demonstrating superior detection performance compared to existing post-hoc and fine-tuning approaches.
Strengths
- The proposed approach exhibits strong OOD detection performance on image benchmarks.
- The framework for generating outliers, leveraging the text representation space, is novel.
- The paper is easy to comprehend.
Weaknesses
- The key technical novelty of this paper lies in the utilization of the text space for outlier generation. However, the motivation behind utilizing the text space for outlier generation is neither convincing nor well supported.
- Figure 1 in the introduction provides a general overview of the outlier generation but does not sufficiently support the primary motivation of employing the text space for outlier generation.
- In section 3.1, the explanations for motivation, such as "finding neighboring..." and "language is more readily...", are ambiguous and would benefit from additional clarity. Providing examples or statistical analyses to illustrate these points would make the motivation more informative and compelling.
- The paper also lacks definitions for certain key terms, making it challenging for readers to comprehend the content (see questions below).
- Defining the candidate WordNet set appears to be specific to CIFAR and ImageNet, and it's unclear how this approach can be generalized to other domains (such as speech or language) or tasks (like captioning).
- Figure 3 lacks axis labels and is challenging to interpret. Figure 3-c seems uninformative.
Questions
- What makes text manipulation preferable to manipulating the image representation space for generating outlier images? Can you provide any observations or statistical analyses?
- What are the definitions of “prompt template” and “pseudo-word”? Could you provide examples of a prompt, a pseudo-word, and the candidate sets?
- In Figure 3-c, what can we learn from the t-SNE plot?
The task is OOD detection. To calibrate the OOD detector, the authors propose generating OOD data using a pre-trained diffusion model. Specifically, the authors suggest finding descriptions similar to the ID labels in the WordNet concept set and then generating OOD images based on these descriptions.
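As a concreteness check on this retrieval step, here is a minimal sketch of how such a neighbor search might look; this is my own illustration, not the authors' code, and it assumes OpenAI's CLIP package plus NLTK's WordNet, with an arbitrary label list and top-k cutoff:

```python
# Hypothetical sketch of the WordNet neighbor search; not the authors' code.
# Assumes `pip install git+https://github.com/openai/CLIP.git` and
# `nltk.download("wordnet")` have been run.
import clip
import torch
from nltk.corpus import wordnet as wn

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

id_labels = ["airplane", "bird", "deer"]  # e.g. a few CIFAR-10 classes
candidates = sorted({lemma.name().replace("_", " ")
                     for synset in wn.all_synsets("n")
                     for lemma in synset.lemmas()})

def embed(words):
    # Encode a batch of words into unit-norm CLIP text embeddings.
    with torch.no_grad():
        feats = model.encode_text(clip.tokenize(words).to(device)).float()
    return torch.nn.functional.normalize(feats, dim=-1)

label_emb = embed(id_labels)
cand_emb = torch.cat([embed(candidates[i:i + 512])      # chunk to bound memory
                      for i in range(0, len(candidates), 512)])

# Cosine similarity between each ID label and every WordNet noun;
# the nearest non-identical words act as near-OOD prompt candidates.
sims = label_emb @ cand_emb.T
for label, row in zip(id_labels, sims):
    top = [candidates[j] for j in row.topk(15).indices.tolist()
           if candidates[j] != label][:10]
    print(label, "->", top)
```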
Strengths
- Generating near-OOD samples at the text level is intuitive and interesting.
- Good experimental results.
Weaknesses
- The motivation for generating the pseudo-word s_y is not clearly explained. Could we instead use YOLO to detect objects in images to obtain s_y?
- No results on the advantages of the dynamic adjustment.
- No results on the method's sensitivity to the hyperparameters η and λ.
- The writing is redundant. For example, in Section 3.1.1, both "The training objective Eq. 3 can also be seen as a form of reconstruction." and "the optimization objective Equation 3 can be interpreted as a reconstruction loss" convey the same idea.
- Typos: in Table 1, ATOM's FPR95 of 2.52 should be bolded instead of DOG's 2.65.
Questions
How can we distinguish whether the generated samples are anomalies of known classes or samples of new classes?
To solve the out-of-distribution (OOD) detection task, this paper aims to generate surrogate outliers and use them to train an OOD detector effectively. Specifically, the proposed method, Diffusion-based Outlier Generation (DOG), first optimizes a pseudo-word representation for each label via Textual Inversion and then finds near-OOD words for each pseudo-word. Using the near-OOD words, DOG generates effective surrogate outliers with a pre-trained text-conditional diffusion model. The paper demonstrates that the generated outliers can be effectively utilized for training the OOD detector, and the experimental results show the superior performance of the detector over various baselines.
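For readers unfamiliar with the generation step, a minimal sketch of what it might look like with the diffusers library follows; the checkpoint, prompt template, and word lists are illustrative assumptions, not details from the paper:

```python
# Illustrative sketch of rendering surrogate outliers with a frozen
# Stable Diffusion model; not the authors' implementation.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

# Hypothetical near-OOD words per ID class (stand-ins for the words the
# method would retrieve from WordNet).
near_ood_words = {"airplane": ["glider", "seaplane"],
                  "bird": ["bat", "kite"]}

outliers = []
for label, words in near_ood_words.items():
    for word in words:
        # Generic prompt template; the paper may condition on its learned
        # pseudo-words rather than a plain template like this one.
        images = pipe(f"a photo of a {word}", num_images_per_prompt=4).images
        outliers.extend(images)
# `outliers` would then serve as the surrogate D_out in outlier-exposure
# fine-tuning of the OOD detector.
```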
Strengths
- The generation process can be applied to any in-distribution dataset if a pre-trained text-conditional diffusion model (e.g., StableDiffusion) and a trained vision-language-aligned embedding space (e.g., CLIP) are given.
- The proposed method, DOG, outperforms various OOD detection methods.
Weaknesses
- This paper conducts only small-scale, low-resolution experiments based on CIFAR-10/100. Although the proposed method (DOG) outperforms baselines on these datasets, I am not convinced that DOG is scalable and effective on large-scale, high-resolution datasets.
- The proposed method seems sensitive to its hyperparameters, because the quality of the selected words is very important. For example, if the words are too far from the in-distribution or too similar to it, the generated samples could be ineffective for training an OOD detector.
- Lack of comparison with a recent method [1]. This method also generates outliers using a diffusion model, but it does not require a vision-language model like CLIP.
[1] Mirzaei et al., Fake It Till You Make It: Towards Accurate Near-Distribution Novelty Detection, ICLR 2023.
Questions
How should the hyperparameters be chosen for effective outlier generation?
To solve the out-of-distribution (OOD) detection task, this paper aims to generate surrogate outliers via textual inversion and use them to train an OOD detector effectively.
This paper demonstrates that the generated outliers can be effectively utilized for training the OOD detector. The experimental results show the superior performance of the detector over various baselines.
Why not a higher score
Most reviewers think the paper lacks a clear motivation and is limited to specific scenarios such as CIFAR-10. There are also many missing references and baselines.
There is no rebuttal.
Why not a lower score
N/A
Reject