PaperHub
5.5 / 10
Rejected · 4 reviewers
Lowest 5 · Highest 6 · Std 0.5
Scores: 5, 5, 6, 6
Confidence: 3.8
Correctness: 2.8
Contribution: 2.3
Presentation: 2.8
ICLR 2025

BOOD: Boundary-based Out-Of-Distribution Data Generation

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

This paper proposes a novel framework called Boundary-based Out-Of-Distribution data generation (BOOD), which synthesizes high-quality OOD features and generates human-compatible outlier images using diffusion models.

Abstract

Keywords
OOD detection · Diffusion models · Training data generation

Reviews & Discussion

Review (Rating: 5)

This paper addresses the OOD detection task by synthesizing outlier samples. To synthesize reasonable outliers, the method first selects the samples that reside near decision boundaries, then applies adversarial perturbations to their features until their predicted classes change. Finally, it uses a diffusion model to generate outlier images from the perturbed features, which are used for training the OOD classifier. Experiments on various datasets demonstrate that the proposed method achieves better performance than existing methods.
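To make the described pipeline concrete, here is a minimal sketch of its three stages. Every helper named here (`encode`, `steps_to_boundary`, `perturb_across_boundary`, `diffusion_decode`) is a hypothetical placeholder for illustration, not the authors' code.

```python
# Hypothetical sketch of the BOOD pipeline summarized above.
# All helper functions are illustrative placeholders.

def bood_pipeline(id_images, id_labels, classifier, r=0.05):
    # 1) Embed ID images into the latent feature space.
    feats = [encode(x) for x in id_images]

    # 2) Keep the fraction r of features that need the fewest gradient-ascent
    #    steps to flip the classifier's prediction (closest to the boundary).
    steps = [steps_to_boundary(z, y, classifier) for z, y in zip(feats, id_labels)]
    order = sorted(range(len(feats)), key=lambda i: steps[i])
    keep = order[: max(1, int(r * len(feats)))]

    # 3) Perturb each kept feature across the decision boundary, then decode
    #    the perturbed feature into an outlier image with a diffusion model.
    return [
        diffusion_decode(perturb_across_boundary(feats[i], id_labels[i], classifier))
        for i in keep
    ]
```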

Strengths

  • It introduces a new outlier synthesis approach that selects samples close to decision boundaries and distorts them. Outliers can be synthesized more easily from these samples than from others.
  • Extensive experiments validate the effectiveness of the proposed method and its core technical components, such as the sample selection strategy.
  • The paper is well written, with a clear structure and smooth logic, making its ideas and algorithms easy to follow.

Weaknesses

  1. The rationale for using adversarial attacks to perturb sample features remains insufficiently justified. Perturbing features to alter their class identities might unintentionally transform them into samples of other in-distribution classes. To address this concern, the authors should provide theoretical or empirical evidence demonstrating that their perturbation method reliably generates features distinct from existing classes. Additionally, a comparison with alternative perturbation strategies would help clarify the unique benefits of the proposed approach.
  2. The performance of random feature perturbations, such as adding Gaussian noise or displacing features away from class centroids, is highly relevant. I recommend an ablation study comparing the proposed perturbation method against these simpler alternatives; such an analysis would provide concrete evidence of the theoretical and empirical advantages of the method.
  3. The paper lacks sufficient detail on the architectures of the image encoder and the OOD classification model. For replication purposes, it is essential to include specifics such as the number and type of layers, activation functions, and other relevant parameters. A detailed description of these aspects would significantly enhance the reproducibility of the proposed algorithm.
  4. There is an error in Equation (2): the denominator should be $\Gamma(y_j)^{\top} z$ (a hedged reconstruction follows this list). While this observation is helpful, I suggest the authors conduct a thorough review of all equations and mathematical notations throughout the manuscript to ensure accuracy and consistency.
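For reference, here is a hedged reconstruction of the corrected Equation (2), assuming it is the standard softmax over class-token similarities used by cosine classifiers; the temperature $t$ is an assumption, not taken from the paper:

$$p(y_i \mid z) = \frac{\exp\left(\Gamma(y_i)^{\top} z / t\right)}{\sum_{j} \exp\left(\Gamma(y_j)^{\top} z / t\right)}$$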

Questions

  1. Clarification is needed regarding the sensitivity of the method to the hyperparameter $r$. An exploration of this sensitivity, perhaps through a sensitivity analysis, would provide valuable insight into the robustness and reliability of the proposed approach under varying conditions.
  2. The method performs significantly worse than NPOS on the OOD dataset Textures, as indicated in Table 2. An explanation for this performance discrepancy would be beneficial. The authors could analyze specific characteristics of the Textures dataset or aspects of their method that may contribute to this outcome.
Comment

Question 2: performance gap between NPOS and BOOD on the OOD dataset Textures when using ImageNet-100 as the ID dataset.

Textures is a dataset of textural images in the wild, which has a large discrepancy from the training distribution of Stable Diffusion. NPOS [2] is an OOD detection framework that leverages outlier features synthesized from low-likelihood areas of the latent feature space. We did observe the performance gap between NPOS [2] and BOOD on Textures when choosing ImageNet-100 as the ID dataset. We argue that this gap arises because Stable Diffusion lacks the ability to generate images located near Textures' distribution when ImageNet-100 is used as the ID dataset. We want to emphasize, however, that in the area of generative data augmentation, the performance of such frameworks is limited by the capability of the diffusion model. In the following table, we compare NPOS [2], BOOD, and another generative OOD data augmentation framework, DreamOOD [1]. All three methods are tested on Textures, using ImageNet-100 as the ID dataset.

| Method | Textures FPR95 ↓ | Textures AUROC ↑ | Average FPR95 ↓ | Average AUROC ↑ |
| --- | --- | --- | --- | --- |
| NPOS [2] | 8.98 | 98.13 | 44.00 | 89.04 |
| DreamOOD [1] | 53.99 | 85.56 | 38.76 | 92.02 |
| BOOD | 51.88 | 85.41 | 35.37 | 92.44 |

From the table above, both DreamOOD [1] and BOOD show performance gaps relative to NPOS [2] on Textures, indicating that the performance of generative data augmentation is bounded by the capability of the diffusion model. However, BOOD shows superior average OOD detection results, indicating that our framework is promising. This performance gap might be narrowed by stronger diffusion models in future work.

Weakness 3: architectures of the image encoder and the OOD classification model.

This is a good point regarding reproducibility. To guarantee fairness in comparisons between frameworks, we choose the same architectures for the image encoder and OOD classification model as DreamOOD [1]. We summarize the architectures below:

  • Image encoder. We employ a ResNet-34 architecture for the image encoder on both CIFAR-100 and ImageNet-100. Here is the breakdown:

    Input Layer:

    • Initial Conv2d: 3→64 channels, 3×3 kernel, stride=1, padding=1
    • BatchNorm2d
    • ReLU

    Main Blocks (using BasicBlock structure):

    • Layer1: 64→64 channels, 3 blocks
    • Layer2: 64→128 channels, 4 blocks, stride=2 at first block
    • Layer3: 128→256 channels, 6 blocks, stride=2 at first block
    • Layer4: 256→512 channels, 3 blocks, stride=2 at first block

    Final Layers:

    • Adaptive Average Pooling to (1,1)
    • Flatten
    • Linear transformation: 512→768 dimensions (the 768-dimensional features are aligned with the class token embeddings $\Gamma(y)$)

    Each BasicBlock contains:

    • Conv2d (3×3) → BatchNorm2d → ReLU
    • Conv2d (3×3) → BatchNorm2d
    • Skip connection (with optional 1×1 conv if dimensions change)
    • Final ReLU
  • OOD classification model. We also employ a ResNet-34 architecture for the OOD classification model. Most of the architecture is the same as the image encoder, except the final layer: the linear transformation changes from 512→768 to 512→100 (the number of classes is 100). A minimal sketch of both models follows below.
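As referenced above, here is a minimal PyTorch sketch of the two models, assuming torchvision's ResNet-34 as the base with the CIFAR-style stem described in the breakdown; this is an illustrative reconstruction, not the authors' released code.

```python
import torch.nn as nn
from torchvision.models import resnet34

def build_image_encoder(embed_dim: int = 768) -> nn.Module:
    """ResNet-34 encoder whose 512-d pooled features are mapped to 768-d
    to align with the class token embeddings Gamma(y)."""
    net = resnet34(weights=None)
    # Replace the ImageNet stem with the 3x3 / stride-1 stem listed above.
    net.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    net.maxpool = nn.Identity()  # the breakdown lists no max-pool after the stem
    net.fc = nn.Linear(512, embed_dim)  # final linear: 512 -> 768
    return net

def build_ood_classifier(num_classes: int = 100) -> nn.Module:
    """Same backbone as the encoder; only the head differs
    (512 -> 100 class logits instead of 512 -> 768 features)."""
    net = resnet34(weights=None)
    net.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    net.maxpool = nn.Identity()
    net.fc = nn.Linear(512, num_classes)
    return net
```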

We have included the above architecture explanations in the updated version of the paper, please check it, thanks!

Weakness 4: error in equation

We apologize for the typo. We have corrected it in the updated version of the paper, please check it.

We trust our responses have adequately resolved your concerns. We sincerely thank you for reviewing our rebuttal and for your constructive feedback. If any aspects still need clarification, we are happy to discuss them further.

[1] Xuefeng Du, Yiyou Sun, Jerry Zhu and Yixuan Li. "Dream the Impossible: Outlier Imagination with Diffusion Models." NeurIPS 2023.

[2] Leitian Tao, Xuefeng Du, Jerry Zhu, and Yixuan Li. "Non-parametric outlier synthesis." ICLR 2023.

Comment

Thanks to the authors for their efforts in preparing the response. The sensitivity analysis of $r$ and the network details are provided, and the performance on the Textures dataset is explained. However, it is still unclear why using a small step size for minor perturbations and additional perturbation steps can prevent features from transforming into other in-distribution classes. It is also a little strange that displacing features away from class centroids produces rather poor results. I tend to keep my original score.

Comment

We thank Reviewer v1ZR for the valuable reply. Below are our responses:

It is still unclear why using a small step size for minor perturbations and additional perturbation steps can prevent transforming features into other in-distribution classes.

Great question. First, we want to emphasize that a large step size $\alpha$ and additional perturbation steps $c$ do make a difference in the generated images. Setting $\alpha$ and $c$ to relatively large values perturbs the generated features further from the decision boundaries. From Figure 4 in the paper, we can observe that large $\alpha$ and $c$ might cause the synthesized OOD images to transform into other classes' distributions.

Moreover, since the prediction confidence produced by the classifier does not fully correlate with the distribution area of a feature, we cannot directly judge at the feature level whether a feature has transformed into another ID class from the classifier's prediction confidence. Our perturbation strategies alleviate this problem to a large extent, and the experimental results show that the control of $\alpha$ and $c$ is effective.

We also offer a possible strategy to address this problem further. After obtaining the identified ID boundary features, we perturb each feature for $c$ steps and calculate the entropy of the feature at each step using the following formula:

$$H(z_k) = -\sum_{i=1}^{n} p_i \log_e(p_i)$$

where $z_k$ denotes the synthesized feature at step $k$ ($k \le c$), $n$ denotes the number of classes, and $p_i$ denotes the prediction probability for class $i$. The more uniform the probability distribution is, the higher the entropy. When a feature is transformed towards another class's distribution, or lies in an ID area, the prediction probability for one specific class becomes abnormally high, resulting in lower entropy.

For each original feature, we rank the entropies of its $c$ perturbed versions and select the features whose perturbed samples show the highest entropies. These selected features are more likely to lie outside the ID area. A sketch of this filtering step is given below.
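The following is a hedged sketch of the entropy-based filtering described above; `classifier` (mapping a feature to class logits) and `perturb_step` (one gradient-ascent perturbation step) are assumed interfaces for illustration.

```python
import torch
import torch.nn.functional as F

def entropy(logits: torch.Tensor) -> torch.Tensor:
    # H(z_k) = -sum_i p_i * log(p_i), from the classifier's predicted probabilities
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def rank_perturbed_by_entropy(z0, classifier, perturb_step, c):
    """Perturb z0 for c steps; return the perturbed feature with maximum entropy,
    i.e. the one least confidently assigned to any single ID class."""
    z, candidates = z0, []
    for _ in range(c):
        z = perturb_step(z)
        candidates.append((entropy(classifier(z)).item(), z))
    best_h, best_z = max(candidates, key=lambda t: t[0])
    return best_z, best_h
```

The design intuition, per the text above, is that a near-uniform prediction (high entropy) signals a feature that sits between ID modes rather than inside one.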

We are optimizing this method and will conduct more experiments and present it in the camera-ready version. Thank you again for your constructive question.

Comment

We appreciate the reviewer for providing valuable advice. Below are our responses:

Weakness 1: the perturbation strategy may unintentionally transform the features into in-distribution classes.

Thank you for raising a reasonable concern. To prevent the generated OOD features from transforming into the distributions of other in-distribution classes, we set a relatively small step size $\alpha$ for minor perturbations in each iteration when synthesizing OOD features, which guarantees small deviations and prevents the synthesized features from entering other distributions. We also employ a small number of additional perturbation steps $c$ to guarantee that the synthesized OOD features do not step into other distributions after crossing the decision boundaries. A sketch of this perturbation loop is given below.
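Here is a minimal sketch of the perturbation described above, assuming a differentiable classifier over latent features; the loss choice and single-sample tensor shapes are illustrative assumptions, while $\alpha$ and $c$ correspond to the hyperparameters discussed in this thread.

```python
import torch
import torch.nn.functional as F

def perturb_across_boundary(z, y, classifier, alpha=0.015, c=2, max_iter=50):
    """Gradient ascent on the loss of the original label y: small steps push z
    up to the decision boundary, then at most c small steps past it."""
    z = z.clone().detach().requires_grad_(True)
    steps_past_boundary = 0
    for _ in range(max_iter + c):
        loss = F.cross_entropy(classifier(z), y)
        (grad,) = torch.autograd.grad(loss, z)
        # Small, normalized step: bounds the per-iteration displacement.
        z = (z + alpha * grad / (grad.norm() + 1e-12)).detach().requires_grad_(True)
        if classifier(z).argmax(dim=-1).item() != y.item():  # prediction flipped
            steps_past_boundary += 1
            if steps_past_boundary > c:  # stop after c extra steps
                break
    return z.detach()
```

Keeping $\alpha$ small bounds the per-step displacement, and capping the post-crossing steps at $c$ keeps the synthesized feature near the boundary rather than deep inside a neighboring class's distribution.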

To further provide empirical evidence supporting this design, we report the performance of BOOD on CIFAR-100 over a larger range of $\alpha$ and $c$:

| $c$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 0 | 12.19 | 97.20 |
| 1 | 11.98 | 97.32 |
| 2 | 10.67 | 97.42 |
| 3 | 11.23 | 97.24 |
| 4 | 12.65 | 97.02 |
| 5 | 13.91 | 96.84 |

| $\alpha$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 0.001 | 12.09 | 97.21 |
| 0.005 | 12.12 | 97.21 |
| 0.015 | 10.67 | 97.42 |
| 0.025 | 18.73 | 95.38 |
| 0.05 | 19.33 | 95.02 |
| 0.1 | 22.57 | 94.79 |

From the results above and Figure 4 in the paper, we can conclude that our perturbation strategy is effective. We have also included the tables above in the updated version of the paper.

Weakness 2: ablation studies on alternative perturbation strategies.

Our proposed strategy perturbs the identified ID boundary features along the direction of gradient ascent, aiming to synthesize OOD features distributed around the decision boundaries. To gain deeper insight into the effectiveness of our strategy, we provide additional ablation studies on different perturbation strategies: (1) adding Gaussian noise to the latent features, (2) displacing features away from class centroids, and (3) BOOD's perturbation strategy. The results are below:

| Method | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| (1) Gaussian noise | 18.99 | 95.04 |
| (2) Away from centroids | 40.51 | 91.63 |
| BOOD | 10.67 | 97.42 |

From the statistics above, we conclude that our perturbation strategy is effective. We have also included this analysis in the updated version of the submission, and we will explore more perturbation strategies for the camera-ready version.

Question 1: the sensitivity of the method to the hyperparameter r.

This is a great point. For the pruning rate $r$, we suggest a relatively mild rate: too small an $r$ leaves too few features to generate diverse OOD images, while a large $r$ may lead to selecting ID features that lie far from the boundaries. Below is BOOD's performance with different $r$ on CIFAR-100 (a sketch of the selection step follows the table):

| $r$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 2.5% | 13.45 | 96.84 |
| 5% | 12.47 | 97.34 |
| 10% | 13.31 | 97.02 |
| 20% | 15.88 | 95.68 |
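As referenced above, here is a hedged sketch of boundary-feature selection with pruning rate $r$, assuming the same gradient-ascent attack discussed in this thread; `classifier`, the feature shapes, and the defaults for $\alpha$ and $K$ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def steps_to_boundary(z, y, classifier, alpha=0.015, K=50):
    """Count gradient-ascent steps until the classifier's prediction flips."""
    z = z.clone().detach().requires_grad_(True)
    for step in range(1, K + 1):
        loss = F.cross_entropy(classifier(z), y)
        (grad,) = torch.autograd.grad(loss, z)
        z = (z + alpha * grad / (grad.norm() + 1e-12)).detach().requires_grad_(True)
        if classifier(z).argmax(dim=-1).item() != y.item():
            return step
    return K + 1  # never flipped within K steps: treat as far from the boundary

def select_boundary_features(feats, labels, classifier, r=0.05):
    """Keep the fraction r of features closest to the decision boundary."""
    steps = [steps_to_boundary(z, y, classifier) for z, y in zip(feats, labels)]
    k = max(1, int(r * len(feats)))
    order = sorted(range(len(feats)), key=lambda i: steps[i])[:k]  # fewest steps first
    return [feats[i] for i in order], [labels[i] for i in order]
```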

We have also attached the table above in the updated paper, please check it!

Comment

Dear Reviewer v1ZR:

The six aspects covered in our previous response directly address the concerns you raised: (1) we explain why our perturbation strategy will not transform the features into ID classes, (2) we provide ablation studies on different perturbation strategies, (3) we provide a sensitivity analysis of the hyperparameter $r$, (4) we explain the performance gap between NPOS and BOOD on Textures using ImageNet-100 as the ID dataset, (5) we describe the architectures of the image encoder and OOD classification model, and (6) we fix the error in Equation (2).

With the discussion session closing soon, we want to ensure we've adequately addressed all your questions. We truly appreciate your constructive comments throughout this process.

Best,

Authors

Review (Rating: 5)

This paper proposes a novel framework named Boundary-based Out-Of-Distribution data generation (BOOD). It first identifies the features closest to the decision boundary by calculating the minimal number of perturbation steps needed to change the model's prediction. Then, it generates outlier features by perturbing the identified boundary ID features along the gradient ascent direction. These synthetic features are fed into a diffusion model to generate OOD images, enhancing the model's ability to distinguish ID and OOD data. Extensive experiments show the effectiveness of the method.

Strengths

1. This paper proposes a novel boundary-based method for generating OOD data, leveraging diffusion models to identify ID data closest to decision boundaries and applying an outlier feature synthesis strategy to generate images located around decision boundaries. This approach provides high-quality and informative features for OOD detection.

2. This paper is technically sound. The ablation experiments, hyperparameter analysis experiments, and visualization experiments are all comprehensive.

3. This paper provides a clear and thorough introduction to the proposed methods and algorithmic procedures. The formulas and notations are well-explained, with detailed definitions for all symbols and terms used.

Weaknesses

1. One potential drawback is a notation conflict between the additional perturbation steps c (lines 287-288) and the earlier use of C for the number of classes. This overlap in symbols could cause confusion, so it might be beneficial to change the symbol for one of these terms to improve clarity.

2. In Table 2, the comparison with state-of-the-art (SOTA) methods could be enhanced by including more recent methods from 2024. This would better highlight the advantages and relevance of the proposed approach in the context of the latest advancements.

3. A limitation of the hyperparameter sensitivity analysis is that it could benefit from experimenting with a wider range of values to better demonstrate the rationale behind the chosen settings. Additionally, more intuitive visualizations could be provided to clearly illustrate the improvements of the proposed method over previous approaches.

Questions

See Weaknesses above.

Comment

We thank the reviewer for the feedback and constructive suggestions. Our response to the reviewer’s concerns is below:

Weakness 1: notation conflict between the additional perturbation steps c and the earlier use of C for the number of classes.

Thank you for pointing out the ambiguous notation. We have changed the notation for the number of classes to $V$. Please check L104 and L107 in the updated version of the paper.

Weakness 2: including more comparisons between BOOD and SOTA methods from 2024.

We provide a comparison between BOOD and a SOTA method, FodFoM [1], from ACM MM 2024. FodFoM is a framework that utilizes Stable Diffusion to generate outlier images for OOD detection. The results are summarized in the table below:

| Dataset | FodFoM FPR95 ↓ | FodFoM AUROC ↑ | BOOD FPR95 ↓ | BOOD AUROC ↑ |
| --- | --- | --- | --- | --- |
| SVHN | 33.19 | 94.02 | 5.70 | 98.31 |
| LSUN-R | 28.24 | 95.09 | 0.10 | 99.94 |
| LSUN-C | 26.79 | 95.04 | 1.70 | 99.32 |
| iSUN | 33.06 | 94.45 | 0.10 | 99.94 |
| Textures | 35.44 | 93.38 | 4.35 | 98.95 |
| Places365 | 42.30 | 90.68 | 41.20 | 90.30 |
| Average | 33.17 | 93.78 | 8.86 | 97.79 |

Compared to FodFoM, BOOD demonstrates superior performance, illustrating its competitiveness.

Weakness 3: experimenting with a wider range of hyperparameter values.

We provide an analysis of BOOD's performance over a wider range of hyperparameters below.

| $c$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 0 | 12.19 | 97.20 |
| 1 | 11.98 | 97.32 |
| 2 | 10.67 | 97.42 |
| 3 | 11.23 | 97.24 |
| 4 | 12.65 | 97.02 |
| 5 | 13.91 | 96.84 |

| $\alpha$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 0.001 | 12.09 | 97.21 |
| 0.005 | 12.12 | 97.21 |
| 0.015 | 10.67 | 97.42 |
| 0.025 | 18.73 | 95.38 |
| 0.05 | 19.33 | 95.02 |
| 0.1 | 22.57 | 94.79 |

From the results, we conclude that our original choices are optimal. Employing a smaller step size $\alpha$ facilitates nuanced differentiation between samples at different distances, resulting in precise boundary identification. Choosing a moderate $c$ guarantees that the synthesized OOD features are adequately distant from the ID boundaries and prevents them from being distributed in the in-distribution area. We have uploaded these analyses in the new version of the paper.

Weakness 4: additional visualizations to illustrate the improvements.

Good point! We provide more visualizations in Appendix B, Figure 8, in the updated version of the paper. Please check them, thanks!

We're confident that our explanations have fully addressed your concerns. We greatly appreciate your time in reviewing our rebuttal and your feedback. Please don't hesitate to raise any lingering concerns for discussion.

[1] Jiankang Chen, Ling Deng, Zhiyong Gan, Wei-Shi Zheng and Ruixuan Wang. "FodFoM: Fake Outlier Data by Foundation Models Creates Stronger Visual Out-of-Distribution Detector." ACM MM 2024.

Comment

Dear Reviewer dKDk:

We have thoroughly addressed your feedback across four key dimensions in our previous response: (1) we fixed the notation conflict for the number of classes, (2) we provided a comparison between BOOD and a SOTA method, (3) we provided an analysis of BOOD's performance over a wider range of hyperparameters, and (4) we provided additional visualizations in Appendix B of the updated PDF.

As we approach the end of the discussion window, please let us know if our responses have fully addressed your questions or if additional clarification would be helpful. Thank you for your thoughtful feedback.

Best,

Authors

Comment

Dear Reviewer dKDk,

As the discussion period draws to a close, we wanted to check if our responses have addressed your concerns adequately. Please let us know if you need any further clarification.

Best regards,

Authors

Review (Rating: 6)

This paper proposes BOOD, a method for synthesizing out-of-distribution images that lie close to the boundary, to enhance OOD detection performance. It first learns an image encoder whose feature space aligns with class token embeddings, and leverages it as a cosine classifier. Then it picks the images whose features need the fewest perturbation steps in the gradient ascent direction to change the cosine classifier's prediction, and generates OOD images from their perturbed features. It then uses the generated OOD images to regularize the training of an OOD classification model. Experimental results show that BOOD outperforms a variety of existing OOD detection approaches with CIFAR-100 and ImageNet-100 as ID data.

Strengths

  • This paper proposes a new approach for synthesizing out-of-distribution data by performing adversarial perturbation and generating images along the ID boundary. The method is intuitively and technically sound.

  • Performance-wise, the gain over existing methods is significant on CIFAR-100 as ID. The synthesized images look reasonable visually as boundary cases.

  • The writing and presentation of the paper are clear.

Weaknesses

  • The method seems to be bounded by the capability of the Stable Diffusion model. In cases where the ID data are very distinct from Stable Diffusion's training distribution (e.g., if the ID data is SVHN or textures, or other domains like medical imaging), or where the classification is very fine-grained, it is uncertain how effective the method would be.

  • The performance improvement on CIFAR-100 as ID data is significant but the improvement on ImageNet-100 is only marginal, although both datasets are natural images with 100 classes. This also somewhat raises some uncertainty about how much improvement BOOD can bring over the existing methods in general. It may be helpful to include more in-depth discussion or analysis on in which cases BOOD provides significant gains and in which cases its advantage over prior approaches is less obvious.

  • Minor point: there are several typos in the use of parenthetical vs. textual citations, e.g., L047, L179, L232.

Questions

  • How necessary is it to synthesize OOD data, as opposed to finding publicly available OOD data and seeing if training with them can generalize to unseen OOD data? How does BOOD compare with methods that use real OOD data for augmentation, such as [1]?

  • The method seems to involve various different hyperparameters, including pruning rate r, max perturbation iteration K, and regularization weight beta. How are they selected? If one applies BOOD to a new ID dataset, are there guidelines or general rules of how to select them?

  • Given that generation with diffusion models can be computationally expensive, it would be helpful to see more in-depth analysis on computation-performance tradeoffs (e.g. performance vs. the number of images generated per class).

[1] Hendrycks, Dan, Mantas Mazeika, and Thomas Dietterich. "Deep anomaly detection with outlier exposure." ICLR 2019.

Comment

Question 2: general guidelines for hyperparameter selection.

Our suggested guidelines for hyperparameter selection are summarized as follows:

For the pruning rate $r$, we recommend a moderate pruning rate, as insufficient pruning (small $r$) may limit the diversity of generated OOD images (not enough features), while excessive pruning (large $r$) risks selecting ID features that lie close to the class anchors, far from the boundaries.

| $r$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 2.5% | 13.45 | 96.84 |
| 5% | 12.47 | 97.34 |
| 10% | 13.31 | 97.02 |
| 20% | 15.88 | 95.68 |

Our analyses suggest selecting a relatively large maximum iteration number $K$ to ensure that most features cross the boundary. While more iterations do increase the computational overhead of boundary identification, the impact remains manageable. We present a detailed computational cost and performance analysis across varying $K$ values:

| $K$ value | Boundary identification time | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- | --- |
| 5 | ~9 sec | 17.69 | 94.33 |
| 50 | ~1.5 min | 12.47 | 97.34 |
| 100 | ~2.5 min | 12.47 | 97.34 |
| 200 | ~5 min | 12.47 | 97.34 |
| 400 | ~10 min | 12.47 | 97.34 |

For the regularization weight $\beta$: empirical evidence suggests optimal performance is achieved with a moderate regularization weight, as excessive OOD regularization can compromise OOD detection performance (a sketch of the assumed objective follows the table).

| $\beta$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 1.5 | 12.71 | 96.95 |
| 2 | 12.78 | 97.15 |
| 2.5 | 12.47 | 97.34 |
| 3 | 13.10 | 97.02 |
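As a point of reference for how $\beta$ enters training, and assuming a standard outlier-exposure-style objective (this form is an assumption; the paper's exact regularization term may differ), the loss takes the form:

$$\mathcal{L} = \mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{id}}}\big[\mathcal{L}_{\text{CE}}(f(x), y)\big] + \beta \, \mathbb{E}_{\tilde{x} \sim \mathcal{D}_{\text{ood}}}\big[\mathcal{L}_{\text{reg}}(f(\tilde{x}))\big]$$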

For the step size $\alpha$: a moderate $\alpha$ value is recommended for boundary feature identification and OOD feature synthesis. A large $\alpha$ leads to large jumps between iterations of the adversarial attack, making the counted number of steps to the decision boundary an inaccurate distance estimate.

| $\alpha$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 0.001 | 12.09 | 97.21 |
| 0.005 | 12.12 | 97.21 |
| 0.015 | 10.67 | 97.42 |
| 0.025 | 18.73 | 95.38 |
| 0.05 | 19.33 | 95.02 |
| 0.1 | 22.57 | 94.79 |

For the number of additional perturbation steps $c$: we suggest a moderate value, since a large $c$ may force the generated OOD features into other classes' distributions, while a small $c$ may not guarantee that the OOD features are adequately distant from the ID boundary.

| $c$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 0 | 12.19 | 97.20 |
| 1 | 11.98 | 97.32 |
| 2 | 10.67 | 97.42 |
| 3 | 11.23 | 97.24 |
| 4 | 12.65 | 97.02 |
| 5 | 13.91 | 96.84 |

The statistics above illustrate that our suggested guidelines for hyperparameter selection are effective. We have included the new analysis in Appendix C. Please check the updated submission of the paper, thanks!

Question 3: analysis on computation-performance tradeoffs.

This is a reasonable question concerning computational cost. As the number of OOD images per class increases, OOD detection performance improves. Below is the performance vs. the number of generated images per class on CIFAR-100:

| Images per class | FPR95 | AUROC | ID ACC |
| --- | --- | --- | --- |
| 100 | 25.21 | 93.63 | 65.14 |
| 500 | 15.83 | 96.1 | 73.18 |
| 1000 | 12.47 | 97.34 | 78.17 |

We also provide a computational cost comparison between BOOD and DreamOOD on CIFAR-100 below. The costs are almost identical.

| Computational cost | Building latent space | OOD feature synthesis | OOD image generation | OOD detection model regularization | Total |
| --- | --- | --- | --- | --- | --- |
| BOOD | ~0.62h | ~0.1h | ~7.5h | ~8.5h | ~16.72h |
| DreamOOD | ~0.61h | ~0.05h | ~7.5h | ~8.5h | ~16.66h |

We have included the computational cost comparisons in Appendix E of the updated version of the paper, please check them, thanks!

Weakness 3: several typos in the use of parenthetical vs. textual citations.

Thank you for pointing out the typos! We have already fixed them; please check the updated version of the paper.

We feel confident that we have thoroughly addressed the points you raised, and we thank the reviewer for taking the time to read our rebuttal and for the positive feedback. If you still have concerns, we are happy to discuss them.

[1] Hendrycks, Dan, Mantas Mazeika, and Thomas Dietterich. "Deep anomaly detection with outlier exposure." ICLR 2019.

[2] Xuefeng Du, Yiyou Sun, Jerry Zhu and Yixuan Li. "Dream the Impossible: Outlier Imagination with Diffusion Models." NeurIPS 2023.

Comment

We thank the reviewer for the thorough review! Our response to the reviewer’s concerns is below:

Weakness 1: The method seems to be bounded by the capability of the stable diffusion model.

We acknowledge this important limitation regarding domains significantly divergent from the diffusion model's training distribution (e.g., SVHN, Textures, and medical imaging). However, this constraint is inherent to all methodologies utilizing diffusion-based generative data augmentation. While future developments in generative modeling may address these limitations, we emphasize that our primary goal is to leverage diffusion models to generate informative OOD images and thus improve the OOD detection model's performance. With two novel strategies, (a) an adversarial perturbation strategy to precisely identify the ID features closest to the decision boundaries, and (b) an OOD feature synthesis strategy to generate outlier features distributed around the decision boundaries, we achieve state-of-the-art results on CIFAR-100 and ImageNet-100.

Question 1: the necessity of synthesizing OOD data compared to using publicly available OOD data, and comparison with methods that use real OOD data for augmentation.

Great point! Using publicly available OOD data as auxiliary training data for OOD detection is feasible, but it has two significant drawbacks: (1) it requires significant labor and time to label and filter an OOD dataset that has no overlap with the ID data, and (2) it is impossible to collect images distributed just outside the data distribution boundary, since such images cannot be captured in the real world. BOOD addresses these problems in two ways: (1) BOOD is an automatic OOD image generation framework, which significantly decreases the human labor traditionally involved in creating a new OOD dataset, and (2) BOOD leverages efficient feature perturbation strategies and a diffusion model to generate images distributed around the decision boundaries, eliminating the problem of being unable to collect unreal OOD images. In conclusion, synthesizing OOD data for OOD detection training is necessary.

To understand the effectiveness of BOOD, we provide a comparison of OOD detection results between BOOD and MSP+OE [1], using CIFAR-100 as the ID dataset. The results are below:

| Dataset | MSP + OE [1] FPR95 ↓ | MSP + OE [1] AUROC ↑ | BOOD FPR95 ↓ | BOOD AUROC ↑ |
| --- | --- | --- | --- | --- |
| SVHN | 42.9 | 86.9 | 3.85 | 99.07 |
| Places365 | 49.8 | 86.5 | 47.4 | 90.26 |
| LSUN | 57.5 | 83.4 | 3.7 | 98.96 |
| Textures | 54.4 | 84.8 | 7.25 | 98.45 |
| Average | 51.15 | 85.4 | 15.55 | 95.94 |

From the table above, we can conclude that our method is superior to the method using real OOD data for augmentation.

Weakness 2: The performance improvement on ImageNet-100 is only marginal.

We thank the reviewer for the valuable suggestions. In our research, we found that the performance improvement on ImageNet-100 is not as significant as the improvement on CIFAR-100, but BOOD still surpasses the state-of-the-art framework DreamOOD [2] by 3.39% in FPR95 (35.37% vs. 38.76%) and 0.42% in AUROC (92.44% vs. 92.02%). We suppose the gap in improvement has two causes: (1) the distribution of ImageNet-100 is not optimal as an ID dataset for building a latent feature space, and (2) the number of OOD images we generate for ImageNet-100 is not enough for the size of the dataset. We are still investigating the reasons behind this, and we thank the reviewer for the useful insight.

Comment

Dear Reviewer Fhg5:

Our earlier response systematically addressed your concerns from five perspectives: (1) we explain the reason for the marginal improvement on ImageNet-100, (2) we explain why the capability bound imposed by Stable Diffusion should not be our focus, (3) we explain the necessity of synthesizing OOD data and provide a comparison between BOOD and a method using real OOD data for augmentation, (4) we provide general guidelines for hyperparameter selection, and (5) we fix the typos in the use of parenthetical citations.

With the discussion period drawing to a close, we want to ensure our responses have addressed your questions adequately. Thank you again for your valuable feedback.

Best,

Authors

Comment

Thanks to the authors for the rebuttal. I am content with the response and remain positive about this work.

Comment

We are pleased to know that our reply successfully addressed your concerns. Thank you again for your valuable time and constructive feedback!

Review (Rating: 6)

The paper proposes a new OOD data generation framework that helps the model distinguish ID and OOD data more clearly by generating OOD samples near the decision boundary. Specifically, the method identifies ID boundary features by the minimal number of perturbation steps needed to change the model's prediction, and generates OOD features near the boundary through gradient ascent. Experiments on CIFAR-100 and ImageNet-100 demonstrate the effectiveness of the proposed algorithm.

Strengths

1. BOOD is the first framework capable of explicitly generating OOD data around the decision boundary, thereby providing informative functionality for shaping the decision boundary between ID and OOD data.

2. The paper is easy to follow.

3. Experimental results on the CIFAR-100 and ImageNet-100 datasets show that the BOOD method significantly outperforms existing SOTA methods, achieving substantial improvements.

Weaknesses

1. BOOD requires calculating the boundary positions of numerous features and generating images through a diffusion model, which may be computationally time-consuming.

2. The hyperparameters in the paper are crucial for synthesizing high-quality OOD features; it is recommended to provide the basis for their selection.

3. The adversarial perturbation strategy is an important component; it is recommended to provide a comparative analysis with other perturbation strategies to help readers gain a more comprehensive understanding of the experimental setup.

4. Descriptions of the presented figures are lacking in the main text.

Questions

1. List and compare the actual memory requirements of the proposed model.

2. Further comparative studies on different perturbation strategies could be added to help understand the impact of each strategy on the quality of the generated data, and to validate the performance variations of the BOOD method under different hyperparameters.

3. Provide additional descriptions of Figures 2, 3, and 4 in the main text for a more comprehensive evaluation.

Comment

We thank the reviewer for the feedback and constructive suggestions. Our response to the reviewer’s concerns is below:

Weakness 1 and Question 1: BOOD may be time-consuming

We appreciate this important inquiry regarding computational efficiency and resource utilization. It is worth noting that all methodologies in the domain of generative data augmentation with diffusion models necessarily involve an image generation process. Our main research focus is a framework that automatically utilizes diffusion models to generate an OOD dataset and improve OOD detection performance, in which we propose two key procedures: (a) an adversarial perturbation strategy to precisely identify the ID features closest to the decision boundaries, and (b) an OOD feature synthesis strategy to generate outlier features distributed around the decision boundaries. Optimizing the generation cost of diffusion models falls outside our research focus.

In our analysis, we conducted a comparative study of computational efficiency between BOOD and DreamOOD [1], another framework that generates OOD images through a diffusion model. We specifically focus on four key processes: (1) building the latent space, (2) OOD feature synthesis, (3) OOD image generation, and (4) regularization of the OOD detection model. To provide quantitative evidence, we present below a detailed comparison of computational requirements between BOOD and DreamOOD [1] on CIFAR-100:

| Computational cost | Building latent space | OOD feature synthesis | OOD image generation | OOD detection model regularization | Total |
| --- | --- | --- | --- | --- | --- |
| BOOD | ~0.62h | ~0.1h | ~7.5h | ~8.5h | ~16.72h |
| DreamOOD | ~0.61h | ~0.05h | ~7.5h | ~8.5h | ~16.66h |

We also summarize the memory requirements of BOOD and DreamOOD on CIFAR-100 below:

| Memory requirements | OOD features | OOD images | Total |
| --- | --- | --- | --- |
| BOOD | ~7.32MB | ~11.7G | ~11.7G |
| DreamOOD | ~2.9G | ~11.67G | ~14.57G |

Our empirical evaluation reveals that the differences between the two approaches are not significant. Thus, our proposed framework is neither time-consuming nor demanding in memory.

Weakness 2: need to provide the basis for hyperparameter selection.

We present a detailed analysis of hyperparameter sensitivity, with experiments conducted on the CIFAR-100 dataset. Based on our systematic investigation, we propose the following basis for hyperparameter selection:

For the pruning rate $r$, we recommend a moderate pruning rate, as insufficient pruning (small $r$) may limit the diversity of generated OOD images (not enough features), while excessive pruning (large $r$) risks selecting ID features that lie close to the class anchors, far from the boundaries.

| $r$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 2.5% | 13.45 | 96.84 |
| 5% | 12.47 | 97.34 |
| 10% | 13.31 | 97.02 |
| 20% | 15.88 | 95.68 |
Comment

Our analyses suggest selecting a relatively large maximum iteration number $K$ to ensure that most features cross the boundary. While more iterations do increase the computational overhead of boundary identification, the impact remains manageable. We present a detailed computational cost and performance analysis across varying $K$ values:

| $K$ value | Boundary identification time | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- | --- |
| 5 | ~9 sec | 17.69 | 94.33 |
| 50 | ~1.5 min | 12.47 | 97.34 |
| 100 | ~2.5 min | 12.47 | 97.34 |
| 200 | ~5 min | 12.47 | 97.34 |
| 400 | ~10 min | 12.47 | 97.34 |

For the regularization weight $\beta$: empirical evidence suggests optimal performance is achieved with a moderate regularization weight, as excessive OOD regularization can compromise OOD detection performance.

| $\beta$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 1.5 | 12.71 | 96.95 |
| 2 | 12.78 | 97.15 |
| 2.5 | 12.47 | 97.34 |
| 3 | 13.10 | 97.02 |

For the step size $\alpha$: a moderate $\alpha$ value is recommended for boundary feature identification and OOD feature synthesis. A large $\alpha$ leads to large jumps between iterations of the adversarial attack, making the counted number of steps to the decision boundary an inaccurate distance estimate.

| $\alpha$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 0.001 | 12.09 | 97.21 |
| 0.005 | 12.12 | 97.21 |
| 0.015 | 10.67 | 97.42 |
| 0.025 | 18.73 | 95.38 |
| 0.05 | 19.33 | 95.02 |
| 0.1 | 22.57 | 94.79 |

For the number of additional perturbation steps $c$: we suggest a moderate value, since a large $c$ may force the generated OOD features into other classes' distributions, while a small $c$ may not guarantee that the OOD features are adequately distant from the ID boundary.

| $c$ value | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| 0 | 12.19 | 97.20 |
| 1 | 11.98 | 97.32 |
| 2 | 10.67 | 97.42 |
| 3 | 11.23 | 97.24 |
| 4 | 12.65 | 97.02 |
| 5 | 13.91 | 96.84 |

The statistics above illustrate that our suggested basis for hyperparameter selection is effective. We have included the new analysis in Appendix C. Please check the updated submission of the paper, thanks!

Weakness 3 and Question 2: Further comparative studies on different perturbation strategies.

Our proposed strategy perturbs the identified ID boundary features along the direction of gradient ascent, aiming to synthesize OOD features distributed around the decision boundaries. To gain deeper insight into the effectiveness of our strategy, we provide additional ablation studies on different perturbation strategies: (1) adding Gaussian noise to the identified latent features, (2) displacing features away from class centroids, and (3) BOOD's perturbation strategy. The results are below:

| Method | FPR95 ↓ | AUROC ↑ |
| --- | --- | --- |
| (1) Gaussian noise | 18.99 | 95.04 |
| (2) Away from centroids | 40.51 | 91.63 |
| BOOD | 10.67 | 97.42 |

From the statistics above, we conclude that our perturbation strategy is effective. We have also included this analysis in the updated version of the submission, and we will explore more perturbation strategies for the camera-ready version.

Weakness 4 and Question 3: Descriptions of the images presented are lacking in the main text.

Thanks, this is a great point. Please check the updated version of the paper, where we include descriptions of Figures 2, 3, and 4 in the main text at L215, L252, L454, and L461.

We believe that our responses have sufficiently addressed your concerns. If you have further questions, we are pleased to discuss with you. Thank you again for taking the time to read our rebuttal and your constructive feedback!

[1] Xuefeng Du, Yiyou Sun, Jerry Zhu and Yixuan Li. "Dream the Impossible: Outlier Imagination with Diffusion Models." NeurIPS 2023.

Comment

Dear Reviewer YNDo:

In our previous response, we addressed your concerns in four aspects: (1) we provide the computational cost of BOOD to show that it is neither time-consuming nor memory-intensive, (2) we provide bases for selecting the hyperparameters $\alpha$, $\beta$, $c$, $K$, and $r$, (3) we provide comparative studies on different perturbation strategies, and (4) we include descriptions of Figures 2, 3, and 4 in the main text of the updated PDF.

As the discussion period ends soon, we just wanted to check if the response clarified your questions or needs further discussions. Thanks again for your constructive feedback.

Best,

Authors

Comment

Thanks for the very detailed response! I have carefully checked it, and most of my concerns have been clarified. I hope the additional experiments will be included in the final draft. Considering the overall quality, I will change my rating to borderline accept.

Comment

We are glad to hear that our response helped resolve your questions. Thank you again for your time and constructive feedback!

AC Meta-Review

This paper proposes generating anomaly samples near the decision boundary using normal samples. The reviewers' assessments were mixed, but upon further consideration, I identified a substantial existing body of literature on anomaly detection, particularly methods targeting samples near the out-of-distribution boundary, that the authors did not adequately acknowledge. Moreover, the idea of adversarially creating such samples is not new. Given these factors and the reviewers' concerns, I am inclined to recommend rejecting this submission.

Additional Comments on Reviewer Discussion

The reviewers raised concerns about the insufficient justification for using adversarial attacks to perturb sample features, the clarity of experimental details, and some minor weaknesses. Although the authors attempted to address these issues, not all reviewers engaged with their responses. From my perspective, while the authors’ answers seem convincing in some respects, I remain troubled by the novelty of their contribution. They claim to be the first to pursue this line of work, despite the existence of extensive related literature. As a result, I still have significant reservations about the paper’s originality.

Final Decision

Reject