PaperHub
Rating: 5.5/10 · Rejected · 4 reviewers
Scores: 6, 5, 6, 5 (min 5, max 6, std 0.5)
Confidence: 4.5 · Correctness: 2.8 · Contribution: 2.0 · Presentation: 2.8
ICLR 2025

Test Time Augmentations are Worth One Million Images for Out-of-Distribution Detection

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05

Abstract

Keywords
Out-of-distribution · Test time augmentation · OOD Detection

Reviews and Discussion

Official Review
Rating: 6
  • The author finds that some TTAs, the so-called IDAs, do not change the feature representation of ID data, while they have a greater influence on OOD data. Therefore, using these TTAs, each input image is assigned an ID score, i.e., the k-th highest similarity between the feature of the input image and the features of its TTA counterparts.

Strengths

  • The main idea is easy to follow.
  • The results shown in Table 4 are better than those of previous methods.
  • I would like to raise my rating if the weaknesses are well addressed.

Weaknesses

  • The layout needs further improvement. For example, lines 40, 48, 65-68, and 93-107 are either too crowded or leave too much blank space.
  • Reference error in Figure 1. The reference style should be corrected, as in KNN (Sun et al. 2020) and VIM (Wang et al. 2022). Check the remaining parts of your manuscript carefully.
  • What does the score in Table 1 represent? It looks like the F1-score.
  • Are IDAs stable across different datasets? I noticed the results shown in Table 2 are dataset-dependent. On CIFAR-10, the LPIPS in ascending order is Mask (0.0052) --> Crop (0.0171) --> Hflip (0.048) --> Rotate90 (0.082) --> Gray (0.1184) --> ColorJitter (0.1618) --> Invert (0.2368). On ImageNet, the LPIPS in ascending order is Mask (0.01) --> Crop (0.1425) --> Gray (0.2466) --> Hflip (0.2961) --> ColorJitter (0.4484) --> Invert (0.5656) --> Rotate90 (0.6312).
  • The definition of IDA and OODA from line 183 to line 142 is not rigorous. IDA and OODA are identified by the degree of feature change on a pretrained model. It is important to state that feature change is measured by a pretrained model rather than by eye; otherwise, all TTAs in Table 1 would be IDAs if feature change were judged by eye.

Questions

  • See Weaknesses.
Comment

Thank you for your attention to detail. We address your comments below:

  • Layout Improvements: We apologize for the layout issues. We have carefully revised the manuscript to address the spacing of the above lines (lines 40, 48, 65-68, and 93-107) as well as throughout the article.
  • Reference Error in Figure 1: Thank you for your careful review of the manuscript. We have corrected the style of the references and used the correct formatting throughout the manuscript.
  • Score in Table 1: The score reported in Table 1 is the AUROC (Area Under the Receiver Operating Characteristic curve). We will clarify this in the table caption and the relevant section of the text to avoid confusion.
  • Stability of IDAs Across Datasets: You make an excellent observation about the dataset dependency of the LPIPS ordering. We think this variance arises from the difference in feature complexity between CIFAR-10 and ImageNet. Despite this difference, Table 2 shows that the augmentation with the lowest LPIPS is always Mask on both datasets, and its LPIPS is much smaller than that of the other augmentations, indicating that Mask is not dataset-dependent. This is why we chose Mask in our method design in Section 3.
  • Rigorous Definition of IDA and OODA: Our definitions of IDA and OODA are derived from [RR1], which distinguishes between them by whether they affect the expression of image features. However, it is difficult to formulate a quantitative metric to define IDA and OODA because different datasets have different sensitivities to different augmentation methods, e.g., MNIST to grayscale changes and Texture to vertical flipping. To provide clearer differentiation, we introduce two methods in the paper to help distinguish IDA from OODA: one by visualizing the results (Fig. 3 and Fig. 10), and the other by quantifying the effect of augmentation on image features using LPIPS (Table 2). Although LPIPS seems to be a good quantitative metric, Table 2 reports the average LPIPS over 1000 images, which makes it difficult to strictly separate IDA from OODA. We believe that defining a rigorous metric to differentiate between IDA and OODA is a worthwhile direction for future work.
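As clarified above, the score in Table 1 is AUROC, which can be read as the probability that a randomly drawn InD sample receives a higher ID score than a randomly drawn OOD sample. A minimal self-contained sketch of the metric (the scores below are made up for illustration, not the paper's data):

```python
def auroc(id_scores, ood_scores):
    """AUROC: probability that a random ID sample scores higher than a
    random OOD sample (here, higher score = more in-distribution).
    Ties count as half a win."""
    pairs = 0
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            pairs += 1
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / pairs

# Toy example with made-up scores: ID samples should mostly score higher.
id_scores = [0.9, 0.8, 0.75, 0.6]
ood_scores = [0.7, 0.4, 0.3, 0.2]
print(auroc(id_scores, ood_scores))  # 0.9375
```

An AUROC of 0.5 corresponds to random separation and 1.0 to perfect separation, which is why values in the 80-95% range in the tables below indicate strong detectors.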

We appreciate your careful review and believe that addressing these points will improve the presentation and clarity of our manuscript.

[RR1] How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization. ICLR. 2023

Comment

Dear Reviewer 2Ca4,

We hope this message finds you well. We want to follow up on our previous response addressing your comments regarding layout improvements, reference formatting, OOD score clarification, and other technical aspects of our paper. We have made significant revisions based on your valuable feedback, including:

  • Adding clear table captions specifying the metrics used
  • Fixing all reference formatting issues
  • Improving layout consistency throughout the manuscript
  • Adding comprehensive baseline comparisons across architectures
  • Explaining the definitions of IDA and OODA

We would greatly appreciate your feedback on these revisions to ensure they adequately address your concerns. Your insights have been valuable in improving our paper's quality and clarity.

Thank you for your time and dedication.

Comment

Thanks for your response. I have no further questions.

Comment

Thank you again for your valuable review and feedback on our submission. Your thoughtful evaluation and score are greatly appreciated, as they will help improve the quality of our work. We value the time and effort you dedicated to reviewing our paper.

Comment

Thank you very much for reviewing our paper! We really appreciate your time and effort in providing these helpful comments. We have done our best to address each point you raised and made improvements based on your suggestions.

Below are our responses to your comments. We hope you'll have a chance to take a look!

Official Review
Rating: 5

Out-of-distribution (OOD) detection poses significant challenges for deploying machine learning models in safety-critical applications, and while data augmentations have been shown to enhance OOD detection by providing diverse features, previous research has primarily focused on their role during training, neglecting their effects during testing. This paper presents the first comprehensive study on test-time augmentation (TTA) and its impact on OOD detection, revealing that aggressive TTAs can lead to distribution shifts in OOD scores for in-distribution (InD) data, while mild TTAs do not, making them more effective for OOD detection. The authors propose a detection method that utilizes K-nearest-neighbor (KNN) searches on mild TTAs instead of InD data, achieving superior performance with just 25 TTAs compared to existing methods that use the entire training set of 1.2 million images on IMAGENET. Additionally, their approach is compatible with various model architectures and demonstrates robustness against adversarial examples.

Strengths

  1. Improved Efficiency: Achieves superior OOD detection with only 25 mild TTAs, reducing computational costs compared to methods using large datasets.

  2. Versatile and Robust: Compatible with various model architectures and resilient to adversarial attacks, enhancing reliability.

  3. Focus on Testing: Addresses the critical impact of test-time augmentation, leading to more effective OOD detection strategies.

Weaknesses

  1. The article lacks a preliminaries section.

  2. The method designed in the article lacks an overall algorithm description; the current narrative is not clear enough.

  3. What OOD score is used in Tables 1 and 2 is not clarified.

  4. Can the proposed TTA method adapt to more OOD scores, such as MSP and ASH?

  5. Adaptation to other architectures needs corresponding baselines, rather than just reporting the performance of the proposed method.

  6. How will it perform on more challenging benchmarks, such as hard OOD detection and pseudo-correlation benchmarks?

Questions

See Weaknesses.

Comment

Thank you for your valuable feedback. We address your concerns below:

Lack of Preliminaries Section: We have added a preliminaries subsection to the method section of the revised manuscript, which provides the necessary background on OOD detection, including common definitions and problem settings. We believe this addition will improve the paper's accessibility for readers less familiar with the field.

Lack of Algorithm Description: We have added a dedicated algorithm box to the Appendix that clearly outlines the steps of our TTA-enhanced KNN approach, including the sequential masking strategy, embedding extraction, the KNN search within the augmented set, and the final OOD score calculation.
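As a rough illustration of those steps, here is a minimal sketch. The 5x5 mask grid (25 views, matching the 25 TTAs used in the paper), the choice of k, the use of cosine similarity, and the toy encoder are our own assumptions for this example, not the paper's exact settings:

```python
import numpy as np

def sequential_masks(image, grid=5):
    """Generate TTA views by zeroing out one grid cell at a time (grid*grid views)."""
    h, w = image.shape[:2]
    views = []
    for i in range(grid):
        for j in range(grid):
            v = image.copy()
            v[i * h // grid:(i + 1) * h // grid, j * w // grid:(j + 1) * w // grid] = 0
            views.append(v)
    return views

def ood_score(image, encode, k=10, grid=5):
    """Score = negative k-th highest cosine similarity between the input's
    embedding and the embeddings of its masked views; higher = more OOD."""
    z = encode(image)
    z = z / np.linalg.norm(z)
    sims = []
    for v in sequential_masks(image, grid):
        zv = encode(v)
        sims.append(float(z @ (zv / np.linalg.norm(zv))))
    return -sorted(sims, reverse=True)[k - 1]  # k-th nearest neighbor in the TTA set

# Toy encoder standing in for a pretrained backbone: per-channel means.
encode = lambda img: img.mean(axis=(0, 1))

rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))
print(ood_score(x, encode))  # small when features are stable under masking (InD-like)
```

The key property this sketch captures is that the reference set is the input's own augmented views rather than stored InD training features, which is what makes the method InD-independent.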

Unclear OOD Score in Tables 1 and 2: We apologize for the lack of clarity. The metric used in Table 1 is AUC and the metric in Table 2 is LPIPS. Table 1 shows IDA's effectiveness for OOD detection compared to OODA, and Table 2 shows that masking is the mildest augmentation, yielding the lowest LPIPS score, which inspired the mask-based method design described in Section 3. We have made this clear in the table captions in the latest version.

Adaptability to Other OOD Scores: Our TTA-based OOD detection method relies on the similarity between original and augmented samples within a specific representation space. This makes a straightforward combination with post-hoc scoring methods like MSP challenging. However, as demonstrated in Figure 6, our similarity-based approach remains effective when applied to logits and softmax output. Furthermore, it's compatible and synergistic with post-hoc methods that modify the representation space, such as ReAct, leading to improved performance as shown in Table 5.

Adaptation to Other Architectures and Baselines: We initially compared our method's performance against baselines using the Swin Transformer (Appendix F, Table 13). To provide a more comprehensive evaluation, we expanded these experiments to encompass additional baselines and OOD datasets as follows:

AUC (%)  | NINCO | SSB-hard | iNaturalist | Places365 | SUN   | Texture | Avg
MSP      | 80.22 | 71.14    | 89.94       | 77.93     | 79.65 | 80.57   | 79.91
ML       | 81.15 | 68.20    | 89.07       | 73.06     | 75.58 | 79.08   | 77.69
ODIN     | 62.65 | 63.14    | 70.57       | 46.30     | 55.13 | 65.47   | 60.54
Energy   | 77.14 | 68.47    | 84.99       | 67.47     | 70.88 | 76.44   | 74.23
VIM      | 81.03 | 69.08    | 91.34       | 76.44     | 77.52 | 87.54   | 80.49
KNN      | 79.44 | 64.17    | 87.59       | 77.18     | 76.49 | 88.28   | 78.86
GradNorm | 45.52 | 49.98    | 38.70       | 26.41     | 32.78 | 35.46   | 38.14
DICE     | 41.20 | 57.20    | 32.60       | 32.53     | 35.55 | 70.80   | 44.98
GEN      | 80.66 | 68.04    | 90.68       | 80.50     | 81.64 | 82.32   | 80.64
NAC      | 76.58 | 67.29    | 91.48       | 75.53     | 80.87 | 83.14   | 79.15
ASH-B    | 82.26 | 70.13    | 94.32       | 85.14     | 88.10 | 89.75   | 84.95
ASH-S    | 80.24 | 68.24    | 92.61       | 81.64     | 85.56 | 87.65   | 82.66
ASH-P    | 82.35 | 67.73    | 93.19       | 83.42     | 87.48 | 89.05   | 83.87
Ours     | 81.37 | 67.71    | 90.79       | 78.58     | 81.89 | 84.04   | 80.73

The results demonstrate that our method maintains strong performance with transformer-based architectures like Swin Transformer, generally outperforming most baselines and ranking second only to ASH (consistent with the findings in Table 4). This suggests our method is model-agnostic: unlike some baselines (e.g., DICE), its performance remains robust across different model architectures. The updated results are included in the latest manuscript.

Performance on More Challenging Benchmarks: Following the suggestion in OpenOOD [RR1], we evaluated our approach on the near-OOD datasets in Table 3 and Table 4, namely CIFAR-100 for CIFAR-10, and NINCO and SSB-hard for ImageNet. Our method consistently outperforms baseline approaches on hard OOD detection tasks, as evidenced by the superior performance on near-OOD datasets such as CIFAR-100 (90.79% AUC), NINCO (82.19% AUC) and SSB-hard (71.43% AUC). This advantage can be attributed to the ability of sequential masking to capture subtle feature differences between InD and near-OOD samples. These results suggest that our TTA-based approach is particularly effective in challenging scenarios where traditional methods often struggle to maintain reliable detection performance. For pseudo-correlation benchmarks, we have not found relevant OOD test benchmarks, but we will discuss this as a valuable direction for future work.

We believe that addressing these points will significantly improve the clarity and completeness of our paper and better demonstrate the effectiveness and generalizability of our proposed method. Thank you again for your insightful comments.

[RR1] Yang J, Wang P, Zou D, et al. Openood: Benchmarking generalized out-of-distribution detection[J]. Advances in Neural Information Processing Systems, 2022, 35: 32598-32611.

Comment

Thank you very much for reviewing our paper! We really appreciate your time and effort in providing these helpful comments. We have done our best to address each point you raised and made improvements based on your suggestions.

Below are our responses to your comments. We hope you'll have a chance to take a look!

Comment

Thanks for your response. Some of my concerns have indeed been addressed, but I believe that, in this light, the TTA you proposed essentially functions as a combination of TTA + KNN for OOD scores. TTA has not been effectively adapted to various post hoc OOD scores, and many of the claims made in your paper seem to be somewhat overstated in my view. In fact, TTA is not a universal method for improving OOD detection performance. Therefore, I have decided to keep my score.

Comment

Thank you for your recognition of our response and thoughtful feedback.

Regarding the combination of our method with post-hoc OOD detection approaches, I believe there may be a misunderstanding. While our previous response indicated that direct combination with certain OOD scoring methods (like MSP, ML) is challenging, this doesn't mean integration is impossible. As demonstrated in Figure 6 of our paper, our method performs effectively in the representation spaces used by these post-hoc detection methods, suggesting indirect integration is feasible. Furthermore, for methods that modify the representation space (like ReAct), Table 5 shows that our approach can be directly combined to achieve enhanced performance.

Regarding your comment that "TTA is not a universal method for improving OOD detection performance," we agree that no single method, including ours, can claim universal superiority across all datasets and scenarios. Each approach has its strengths and limitations, and the choice of method often depends on the specific application context.

As for your observation about potential overstatements in our paper, we would greatly appreciate it if you could point out specific instances where you feel our claims may be overstated. This would help us improve the precision and accuracy of our presentation.

Please don't hesitate to let us know if you have further questions; we appreciate your valuable suggestions.

Official Review
Rating: 6

This paper focuses on performing test-time augmentation to improve semantic-shift OOD detection. It first categorizes test-time augmentations into In-Distribution Augmentation (IDA) and Out-of-Distribution Augmentation (OODA) and proposes to use Learned Perceptual Image Patch Similarity (LPIPS) distances to differentiate between IDA and OODA. Further, a TTA method based on sequential masking is proposed to boost OOD performance when employing the k-th nearest neighbor (KNN) as the OOD score. Moreover, thorough ablation studies are presented.

Strengths

  • The paper is well-written and clearly presented.
  • The findings related to IDA and OODA are novel, and the proposed OOD score, which does not rely on in-distribution data, is a logical and reasonable approach.
  • Section 2 is well-organized and clearly explained.

Weaknesses

  • Novelty and Baselines: While the proposed method shows novelty, a few relevant baseline methods are missing. It would be beneficial to categorize the proposed approach as a post-hoc OOD scoring method and compare it to existing techniques, such as Maximum Softmax Probability (MSP), Energy score, GradNorm [1], and GEN [2]. Including a comparison with GEN [2], a recent post-hoc OOD scoring method that only requires access to probability information, would strengthen the evaluation.

  • Additional Architectures: It is commendable that Table 8 includes TTA performance across different architectures. However, for a more comprehensive evaluation, the performance of other baseline methods (e.g., Energy score) should also be reported. Without these comparisons, it is difficult to discern whether the observed performance changes are due to architectural differences rather than a specific advantage of TTA. A table similar to Table 13 in Appendix F, but for other architectures (ViT-b-16 and ResNet-50), would address this concern.

  • Optional Suggestion for Table 1: For readability, it may be helpful to split Table 1 into two sections: one for methods that do not require access to training data and another for methods that do.

  • (Optional suggestion) Generalizability of TTA: It would be interesting to explore whether TTA could be combined with other post-hoc scoring methods, such as Energy and GEN.

References

[1] On the Importance of Gradients for Detecting Distributional Shifts in the Wild. NeurIPS, 2021.

[2] GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection. In CVPR, 2023.

Questions

See Weaknesses.

Comment

Thank you for the constructive feedback and valuable suggestions. We address each point below:

Novelty and Baselines: We appreciate the acknowledgment of our method's novelty. Following the reviewers' comments, we have added comparisons with more baselines, such as GradNorm, GEN, DICE and NAC, on both CIFAR-10 and ImageNet:

CIFAR-10 as InD:

AUC (%)  | Cifar100 | SVHN  | Texture | Places365 | iSUN  | LSUN  | Avg
GradNorm | 60.13    | 68.62 | 57.31   | 66.90     | 70.00 | 82.57 | 67.59
DICE     | 83.79    | 93.89 | 89.43   | 85.04     | 93.14 | 98.17 | 90.58
GEN      | 88.60    | 95.22 | 91.79   | 91.00     | 95.52 | 97.36 | 93.25
NAC      | 88.37    | 94.32 | 90.37   | 88.83     | 95.55 | 97.27 | 92.45
Ours     | 90.79    | 95.17 | 93.50   | 91.29     | 96.95 | 97.45 | 94.19

ImageNet as InD:

AUC (%)  | NINCO | SSB-hard | iNaturalist | Places365 | SUN   | Texture | Avg
GradNorm | 72.55 | 67.71    | 94.13       | 75.74     | 88.16 | 88.54   | 81.14
DICE     | 76.46 | 66.57    | 93.06       | 79.54     | 86.24 | 88.01   | 81.65
GEN      | 81.22 | 69.07    | 92.19       | 80.60     | 82.64 | 83.25   | 81.49
NAC      | 78.47 | 68.26    | 93.52       | 78.53     | 88.81 | 88.14   | 82.62
Ours     | 82.19 | 70.43    | 92.55       | 75.81     | 91.82 | 91.54   | 84.06

As shown in the tables, our method surpasses all baseline approaches on both CIFAR-10 and ImageNet, with particularly strong performance on near-OOD data, validating both its effectiveness and reliability. We have updated the relevant data in the latest manuscript.

Additional Architectures: Thank you for this important suggestion. We briefly compared the performance of our method with baseline methods on the Swin Transformer in Table 13 (Appendix F). To give a fuller picture, we extended the experiments to include more baseline methods and OOD datasets:

AUC (%)  | NINCO | SSB-hard | iNaturalist | Places365 | SUN   | Texture | Avg
MSP      | 80.22 | 71.14    | 89.94       | 77.93     | 79.65 | 80.57   | 79.91
ML       | 81.15 | 68.20    | 89.07       | 73.06     | 75.58 | 79.08   | 77.69
ODIN     | 62.65 | 63.14    | 70.57       | 46.30     | 55.13 | 65.47   | 60.54
Energy   | 77.14 | 68.47    | 84.99       | 67.47     | 70.88 | 76.44   | 74.23
VIM      | 81.03 | 69.08    | 91.34       | 76.44     | 77.52 | 87.54   | 80.49
KNN      | 79.44 | 64.17    | 87.59       | 77.18     | 76.49 | 88.28   | 78.86
GradNorm | 45.52 | 49.98    | 38.70       | 26.41     | 32.78 | 35.46   | 38.14
DICE     | 41.20 | 57.20    | 32.60       | 32.53     | 35.55 | 70.80   | 44.98
GEN      | 80.66 | 68.04    | 90.68       | 80.50     | 81.64 | 82.32   | 80.64
NAC      | 76.58 | 67.29    | 91.48       | 75.53     | 80.87 | 83.14   | 79.15
ASH-B    | 82.26 | 70.13    | 94.32       | 85.14     | 88.10 | 89.75   | 84.95
ASH-S    | 80.24 | 68.24    | 92.61       | 81.64     | 85.56 | 87.65   | 82.66
ASH-P    | 82.35 | 67.73    | 93.19       | 83.42     | 87.48 | 89.05   | 83.87
Ours     | 81.37 | 67.71    | 90.79       | 78.58     | 81.89 | 84.04   | 80.73

As shown, our method maintains strong performance on the transformer-based Swin Transformer, exceeding most baselines (ranking second only to ASH, in line with Table 4's findings). This shows that our method is model-agnostic: unlike some baselines (e.g., DICE), its performance is not tied to the model architecture. We have updated the table in the latest manuscript.

Splitting Table 1: Table 1 shows the performance of IDA and OODA on CIFAR-10; all tests were performed in the inference phase, independent of the training data. The table shows that IDAs can be used for OOD detection and that using multiple IDAs is superior to a single IDA. We think the reviewer's readability comment is directed at Tables 3 and 4, where we have added the relevant annotations in the latest manuscript.

Generalizability of TTA: Since our TTA-based method detects OOD samples via the similarity between the original and augmented samples in a certain representation space, it cannot simply be combined with post-hoc scoring methods such as Energy and GEN, whose outputs are scalars rather than vectors. However, Figure 6 in the main paper shows that our method remains valid when the similarity is computed on logits and softmax outputs. In addition, for post-hoc methods that rectify the representation space, such as ReAct, our approach works well in combination and achieves better performance (as shown in Table 5 in the main paper).

We thank you again for your recognition and valuable comments, and we believe that addressing these issues will improve the presentation and clarity of our manuscript.

Comment

Thank you for the response. I have no further questions. However, I recommend that the authors include results for additional baselines in the final version to further demonstrate the effectiveness of TTA.

Comment

Thank you for your positive feedback. We appreciate your thorough review and constructive suggestions, which have helped improve our paper significantly. As you recommended, we have already incorporated comprehensive comparisons with additional baselines (GradNorm, GEN, DICE, and NAC) in the latest version of our manuscript, as shown in Table 3 and Table 4. These new results further validate the effectiveness of our TTA-based approach, particularly in challenging scenarios like near-OOD detection.

Given that our revisions have thoroughly addressed your concerns and demonstrated strong performance against an expanded set of baselines, we would be grateful if you would consider improving your score.

Comment

Dear Reviewer,

Thank you again for your time and insightful comments on our manuscript. As the discussion window is nearing its end, we kindly ask if you could take some time to review our responses.

Your feedback is greatly valued, and we are eager to address any additional suggestions or concerns you may have to further improve our work.

Best regards, Authors

Comment

Thank you very much for reviewing our paper! We really appreciate your time and effort in providing these helpful comments. We have done our best to address each point you raised and made improvements based on your suggestions.

Below are our responses to your comments. We hope you'll have a chance to take a look!

Official Review
Rating: 5

In this study, the authors examine the impact of test-time augmentation on out-of-distribution (OOD) detection. The authors first show that strong test-time augmentations can cause distribution shifts in both OOD and ID data, while mild augmentations affect only OOD samples. Based on this analysis, the authors propose a new OOD detection method with a KNN search over mild augmentations. With 25 test-time augmentation samples, the authors show that the proposed method detects OOD samples with performance comparable to other methods.

Strengths

  • The authors analyze data augmentations and categorize them into strong (aggressive) and mild augmentations from the viewpoint of OOD detection (whether these augmentations shift in-distribution samples or not). It is very interesting to see the effect of different augmentation schemes on OOD detection.
  • The method can be used without retraining or accessing additional OOD samples.
  • The proposed method is simple and shows comparable performance on CIFAR10 and ImageNet.

Weaknesses

  • The novelty is very limited. Data augmentation has recently also been investigated in OOD detection and anomaly detection [R1-R2]. [R1] uses data augmentation for anomaly detection. [R2] uses simple test-time blurring (a kind of data augmentation) for OOD detection. It would be better to discuss related studies and clarify the novel contribution of this paper. This is important but is currently missing from the current version of the paper.
  • The paper might lead to a misunderstanding of the related studies. The authors claim that InD-Independent approaches (MSP, DICE, GradNorm, and ReAct) have a limitation in that they require access to in-distribution samples. Some of them can also be used just at test time, like the proposed method. Even using some portion of the training data is not a critical limitation of these studies in many cases. The authors also did not compare the proposed method with DICE and GradNorm. ASH, which does not require additional test data, also outperforms the proposed method. [R3] is also an important study, but it is missing.
  • It would be better to analyze the distribution shift of ID and OOD samples under data augmentation in a more theoretical way: why are some data augmentations good at OOD detection while others are not?
  • The proposed method requires 25 inferences, which limits its real-time use.
  • Comparison with other methods is performed only on CNN-based architectures; the ViT result is reported only for the proposed method. The OOD detection score differs with respect to the model, and it would be great to compare with other methods on a ViT architecture.

[R1] Cohen, Seffi, Niv Goldshlager, Lior Rokach, and Bracha Shapira. "Boosting anomaly detection using unsupervised diverse test-time augmentation." Information Sciences 626 (2023): 821-836.

[R2] Choi, Sungik, and Sae-Young Chung. "Novelty detection via blurring." ICLR 2019.

[R3] Liu, Yibing, Chris Xing Tian, Haoliang Li, Lei Ma, and Shiqi Wang. "Neuron activation coverage: Rethinking out-of-distribution detection and generalization." ICLR 2024.

Questions

  • It would be great to clarify the novel contribution of the method compared with other augmentation-based approaches.
  • It would be great to further discuss the limitations of InD-independent approaches, because they are unclear to me. It would be better to clearly identify which methods do not require access to InD samples.
  • Discussion of DICE, ASH, and [R3] would strengthen the paper.
  • How can the method generalize to different architectures? It would be great to compare with other methods on a ViT backbone.
Comment

Lack of Theoretical Analysis: Our approach builds on findings from [RR4], which demonstrated mild augmentation's superiority over aggressive augmentation for OOD detection during training. Motivated by this finding, we hypothesized that mild augmentation (IDA) is more suitable for OOD detection than aggressive augmentation (OODA) at test time as well. Our empirical results in Table 1 confirm this hypothesis. In addition, we show in Table 2, via LPIPS, that Mask is the mildest data augmentation, and we design a sequential masking TTA method for OOD detection accordingly. While lacking a formal theoretical proof, our work presents a coherent progression from hypothesis to empirical validation.
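To make the "mildness" criterion concrete, the feature change induced by an augmentation can be quantified roughly as follows. This sketch uses mean cosine distance in a toy feature space as a stand-in for LPIPS; the flatten "encoder" and the two example augmentations are our own illustrative assumptions, not the paper's setup:

```python
import numpy as np

def feature_change(images, augment, encode):
    """Average cosine distance between original and augmented features;
    smaller values indicate a milder (IDA-like) augmentation."""
    dists = []
    for img in images:
        a, b = encode(img), encode(augment(img))
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        dists.append(1.0 - float(cos))
    return sum(dists) / len(dists)

# Toy "encoder" and two example augmentations: a small mask vs. color inversion.
encode = lambda img: img.reshape(-1)
mask = lambda img: np.concatenate([np.zeros((2, img.shape[1])), img[2:]], axis=0)
invert = lambda img: 1.0 - img

rng = np.random.default_rng(1)
imgs = [rng.random((8, 8)) for _ in range(10)]
print(feature_change(imgs, mask, encode))    # small: masking a few rows is mild
print(feature_change(imgs, invert, encode))  # larger: inversion is aggressive
```

With a real pretrained backbone in place of the flatten encoder (or LPIPS in place of cosine distance), the same ranking logic reproduces the IDA/OODA ordering discussed above: Mask perturbs features least, Invert much more.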

Computational Cost: While our method requires 25 inferences, parallel processing minimizes the additional runtime. Our computational overhead is actually lower than that of several baselines, such as gradient-based methods (ODIN, GradNorm) that require backpropagation, or KNN, which must compute distances to the whole reference set. We evaluated the per-sample runtime of common OOD detection methods on an Nvidia 3090, averaging over 1000 samples each. Our method demonstrates a favorable balance between performance and runtime.

Dataset  | Method | Time   | Performance
Cifar10  | MSP    | 0.2281 | 92.13
         | ODIN   | 0.2732 | 90.14
         | KNN    | 0.2836 | 93.72
         | VIM    | 0.2416 | 92.66
         | Ours   | 0.2347 | 94.19
ImageNet | MSP    | 0.3119 | 79.05
         | ODIN   | 0.3993 | 81.67
         | KNN    | 0.3669 | 77.87
         | VIM    | 0.3338 | 80.58
         | Ours   | 0.3426 | 84.22

Generalizability to Different Architectures: Building on our initial Swin Transformer comparison in Table 13 (Appendix F), we've expanded our evaluation to include additional baseline methods and OOD datasets:

AUC (%)  | NINCO | SSB-hard | iNaturalist | Places365 | SUN   | Texture | Avg
MSP      | 80.22 | 71.14    | 89.94       | 77.93     | 79.65 | 80.57   | 79.91
ML       | 81.15 | 68.20    | 89.07       | 73.06     | 75.58 | 79.08   | 77.69
ODIN     | 62.65 | 63.14    | 70.57       | 46.30     | 55.13 | 65.47   | 60.54
Energy   | 77.14 | 68.47    | 84.99       | 67.47     | 70.88 | 76.44   | 74.23
VIM      | 81.03 | 69.08    | 91.34       | 76.44     | 77.52 | 87.54   | 80.49
KNN      | 79.44 | 64.17    | 87.59       | 77.18     | 76.49 | 88.28   | 78.86
GradNorm | 45.52 | 49.98    | 38.70       | 26.41     | 32.78 | 35.46   | 38.14
DICE     | 41.20 | 57.20    | 32.60       | 32.53     | 35.55 | 70.80   | 44.98
GEN      | 80.66 | 68.04    | 90.68       | 80.50     | 81.64 | 82.32   | 80.64
NAC      | 76.58 | 67.29    | 91.48       | 75.53     | 80.87 | 83.14   | 79.15
ASH-B    | 82.26 | 70.13    | 94.32       | 85.14     | 88.10 | 89.75   | 84.95
ASH-S    | 80.24 | 68.24    | 92.61       | 81.64     | 85.56 | 87.65   | 82.66
ASH-P    | 82.35 | 67.73    | 93.19       | 83.42     | 87.48 | 89.05   | 83.87
Ours     | 81.37 | 67.71    | 90.79       | 78.58     | 81.89 | 84.04   | 80.73

The results demonstrate our method's architecture-agnostic nature, performing strongly on Swin Transformer and exceeding most baselines (second only to ASH, consistent with Table 4). Unlike architecture-dependent approaches such as DICE, our method maintains its effectiveness across different model architectures. These results are updated in the latest manuscript.

We believe that addressing these points will significantly strengthen the paper and better highlight its contributions. We appreciate the reviewer's feedback.

[RR4] How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization. ICLR. 2023

Comment

We thank the reviewer for their insightful comments and suggestions. We address each point below:

Limited Novelty: We thank the reviewer for pointing us to the relevant literature. Our method introduces several key innovations:

  • Study the Effect of Augmentations on OOD Detection: Our comprehensive investigation into the impact of test-time data augmentation on OOD detection in Section 2 reveals that employing mild augmentation strategies leads to more favorable performance. This focus on mild TTAs and their unique benefits is a novel aspect of our work.
  • New Method for OOD Detection: Our key contribution lies in the novel use of test-time augmentations (Sequential Mask) with a KNN-based approach for OOD detection. To the best of our knowledge, this specific combination has not been explored before.
  • Method Agnostic Approach: Unlike methods based on training or model output, our method is model-agnostic and can be applied to any pre-trained classifier without requiring additional training or manipulation.

For the references [RR1] and [RR2] given by the reviewer, although they both consider augmentation methods, they require training phases: [RR1] utilizes blurred images during training, while [RR2] trains a Siamese network for tabular data anomaly detection in an unsupervised manner. In contrast, our method innovates by detecting OOD data through a direct comparison of original and augmented samples, eliminating the need for additional training or model modifications—a key distinction from existing augmentation-based OOD detection approaches.

Misunderstanding of InD-Independent Approaches: We apologize for the potential misunderstanding, which may stem from a slip in the reviewer's comment. What we state in the paper (lines 45-69) is that InD-dependent methods (KNN, ReAct, ViM) require collecting a large amount of InD data as a reference set, so their performance may be affected by the quantity and quality of that InD data (as shown in Fig. 1). This is why we aim to develop an InD-independent method.

Discussion of DICE, ASH, and [RR3]: This is a great suggestion. In response, we expanded our evaluation to include additional baseline methods (GradNorm, GEN, DICE, and NAC) on both CIFAR-10 and ImageNet:

CIFAR-10 as InD:

AUC (%)  | Cifar100 | SVHN  | Texture | Places365 | iSUN  | LSUN  | Avg
GradNorm | 60.13    | 68.62 | 57.31   | 66.90     | 70.00 | 82.57 | 67.59
DICE     | 83.79    | 93.89 | 89.43   | 85.04     | 93.14 | 98.17 | 90.58
GEN      | 88.60    | 95.22 | 91.79   | 91.00     | 95.52 | 97.36 | 93.25
NAC      | 88.37    | 94.32 | 90.37   | 88.83     | 95.55 | 97.27 | 92.45
Ours     | 90.79    | 95.17 | 93.50   | 91.29     | 96.95 | 97.45 | 94.19

ImageNet as InD:

AUC (%)  | NINCO | SSB-hard | iNaturalist | Places365 | SUN   | Texture | Avg
GradNorm | 72.55 | 67.71    | 94.13       | 75.74     | 88.16 | 88.54   | 81.14
DICE     | 76.46 | 66.57    | 93.06       | 79.54     | 86.24 | 88.01   | 81.65
GEN      | 81.22 | 69.07    | 92.19       | 80.60     | 82.64 | 83.25   | 81.49
NAC      | 78.47 | 68.26    | 93.52       | 78.53     | 88.81 | 88.14   | 82.62
Ours     | 82.19 | 70.43    | 92.55       | 75.81     | 91.82 | 91.54   | 84.06

The results show that our approach outperforms all baseline methods across both Cifar10 and ImageNet datasets, notably excelling on Near-OOD data.

For ASH, while it achieves better results on ImageNet, our method surpasses it on Cifar10. In addition, ASH has different optimal strategies on Cifar10 and ImageNet. More importantly, we believe a method's value is not limited to achieving the single best score: our approach offers a new direction for OOD detection with strong performance (weaker only than ASH on ImageNet).

These updated results have been incorporated into the latest manuscript.

[RR1] Choi, Sungik, and Sae-Young Chung. "Novelty detection via blurring." ICLR 2019.

[RR2] Cohen, Seffi, Niv Goldshlager, Lior Rokach, and Bracha Shapira. "Boosting anomaly detection using unsupervised diverse test-time augmentation." Information Sciences 626 (2023): 821-836.

[RR3] Liu, Yibing, Chris Xing Tian, Haoliang Li, Lei Ma, and Shiqi Wang. "Neuron activation coverage: Rethinking out-of-distribution detection and generalization." ICLR 2024

Comment

Thank you very much for reviewing our paper! We really appreciate your time and effort in providing these helpful comments. We have done our best to address each point you raised and made improvements based on your suggestions.

Below are our responses to your comments. We hope you'll have a chance to take a look!

Comment

Thank you for the responses! It would be great to clarify the position of this paper and the potential disadvantages of 'InD-dependent methods'. Because the models evaluated in this study are trained on the training data via supervised learning, using the same or a partial training set is not an additional disadvantage. It would be great to provide some examples where an 'InD-independent method' is valuable. Also, I think MSP, GradNorm, and ReAct could also work as 'InD-independent methods'.

Comment

Thank you for your recognition of our response.

Regarding InD-independent and InD-dependent methods, we fully agree that it's impossible to definitively state which approach is superior, as their effectiveness depends heavily on the specific application scenario.

When abundant InD data is available and easily accessible, InD-dependent methods offer a straightforward and effective solution. However, InD-independent methods like ours serve a valuable role in scenarios where:

  • Limited InD data availability: In some real-world applications, collecting or storing large amounts of InD data may be impractical due to privacy concerns, storage limitations, or data acquisition costs.
  • Model flexibility requirements: When the underlying model needs to be frequently updated or switched (e.g., in deployment scenarios with multiple model variants), InD-independent methods offer plug-and-play convenience without requiring new reference data collection or recalibration.

You make an excellent point about supervised training models in InD-independent methods. While it is true that InD-independent methods such as MSP, GradNorm, and DICE (as well as ours) calculate OOD scores using models trained on InD data, a key distinction is that they remain independent of specific InD examples during inference. This means the model serves as an interchangeable component: we can switch between different pre-trained models without collecting new reference data or modifying the OOD detection approach (the effectiveness of our approach across different datasets and model architectures illustrates this).
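A minimal sketch of this interchangeability (all class and parameter names here are hypothetical, not the paper's implementation): the detector depends only on a feature-extractor callable and a list of augmentations, so swapping backbones means passing a different callable, with no stored InD reference set.

```python
import numpy as np

class AugmentationOODDetector:
    # Hypothetical interface illustrating the "InD-independent" point:
    # the detector holds no reference set of InD examples, only a
    # feature-extractor callable and a list of augmentations, so
    # pre-trained backbones can be swapped without recollecting data.
    def __init__(self, extract_features, augmentations, k=1):
        self.extract = extract_features      # any pre-trained model's feature fn
        self.augmentations = augmentations   # list of input -> input transforms
        self.k = k

    def score(self, x):
        f = self.extract(x)
        f = f / np.linalg.norm(f)
        sims = []
        for aug in self.augmentations:
            g = self.extract(aug(x))
            sims.append(float(f @ (g / np.linalg.norm(g))))
        return -sorted(sims, reverse=True)[self.k - 1]

# Swapping "models" is just passing a different callable. With an
# identity extractor and identity augmentation, similarity is 1:
det_a = AugmentationOODDetector(lambda x: np.asarray(x, float),
                                [lambda x: x], k=1)
print(det_a.score([1.0, 0.0]))  # prints -1.0
```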

This flexibility and independence from explicit InD data during inference make InD-independent methods particularly valuable in certain practical applications, though we acknowledge both approaches have their merits depending on the specific use case.

If you have further questions, please don't hesitate to let us know.

AC Meta-Review

This paper suggests that test data augmentation can be beneficial for OOD detection. However, it is not well integrated into the existing literature, and its novelty remains questionable. Although the authors have presented some additional results and clarifications regarding their contributions, these efforts are not sufficiently convincing. As such, the current version of the paper is not suitable for publication.

Additional Reviewer Discussion Notes

The reviewers expressed concerns regarding the work’s novelty, its high inference time, and the generalizability of the proposed method. Although the authors attempted to address these issues, their responses were not entirely convincing. The paper requires substantial revisions. It is also worth noting that two of the reviewers did not participate in the rebuttal phase.

Final Decision

Reject