PaperHub
Overall rating: 6.0/10
Poster · 4 reviewers
Ratings: 6, 5, 8, 5 (min 5, max 8, std 1.2)
Confidence: 4.5
Correctness: 2.8 · Contribution: 3.3 · Presentation: 3.3
NeurIPS 2024

AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We propose adaptive negative proxies by exploring potential OOD images during testing for OOD detection

Abstract

Keywords
Adaptive negative proxy · OOD detection · vision-language models

Reviews and Discussion

Official Review (Rating: 6)

The paper "AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models" presents a novel approach to out-of-distribution (OOD) detection using pre-trained vision-language models (VLMs). The primary innovation is the introduction of adaptive negative proxies, which are dynamically generated during testing by exploring actual OOD images. This method addresses the semantic misalignment issues of previous approaches that use static negative labels. AdaNeg utilizes a feature memory bank to cache discriminative features from test images, creating task-adaptive and sample-adaptive proxies that better align with the specific OOD datasets. The approach combines static negative labels with adaptive proxies to enhance the performance of OOD detection, achieving significant improvements in benchmarks like ImageNet. The method is training-free, annotation-free, and maintains fast testing speeds.

Strengths

  1. Innovative Approach: The introduction of adaptive negative proxies to address semantic misalignment is a significant advancement. This dynamic generation of proxies during testing offers a novel solution to improve OOD detection.
  2. Effective Use of Vision-Language Models: Leveraging VLMs to integrate textual and visual knowledge enhances the robustness and accuracy of OOD detection.
  3. Performance Improvement: The method shows substantial improvements in standard benchmarks, particularly a 2.45% increase in AUROC and a 6.48% reduction in FPR95 on the ImageNet dataset.
  4. Training-Free and Annotation-Free: AdaNeg does not require additional training or manual annotations, making it highly efficient and practical for real-world applications.
  5. Scalability and Efficiency: The method maintains fast testing speeds and can dynamically adapt to new OOD datasets without significant computational overhead.
  6. Comprehensive Evaluation: Extensive experiments and analyses demonstrate the effectiveness and robustness of the proposed approach across various benchmarks.

Weaknesses

  1. Potential Overhead in Memory Management: The implementation of a memory bank for caching features may introduce significant overhead in memory management, especially when dealing with large-scale datasets or high-dimensional feature spaces.
  2. Generalization to Other Domains: Although the approach demonstrates promising results on existing public datasets, its effectiveness in other domains or with different types of data remains uncertain and requires further investigation.
  3. Testing Phase Dependency: It is unclear whether the approach can maintain the same level of reliable performance when only a small number of images are tested in practical applications. This dependency on the number of test images warrants additional examination.

Questions

See weaknesses.

Limitations

  1. Generalization to other domains.
  2. Dependency on test data.
Author Response

Dear Reviewer eV4A,

We sincerely thank you for the constructive comments and recognition of our work! Please find our responses below.

Q1: Potential Overhead in Memory Management: The implementation of a memory bank for caching features may introduce significant overhead in memory management, especially when dealing with large-scale datasets or high-dimensional feature spaces.

A1: Thanks for the question. Indeed, the memory bank introduces additional memory overhead, which we have briefly discussed in Section 5 of the main paper. However, we clarify that our memory overhead does not continuously increase with the scale of the dataset. This is because we drop image features with high prediction entropy when the memory bank is full, as detailed in Line 184 and Appendix A.2 of our submission.

For a typical high-dimensional feature of 512 dimensions and a maximum memory length L of 10 for each class, our memory bank occupies 214.75 MB of storage when using the ImageNet dataset as ID. This storage requirement is negligible compared to the memory consumption of the CLIP model during forward passes.
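For reference, a back-of-envelope check of this figure (a sketch only; the class counts of roughly 1,000 ID labels plus 10,000 negative labels and float32 storage are assumptions, not values stated in this reply):

```python
# Rough memory-bank footprint, assuming float32 features and
# ~1,000 ID classes + ~10,000 negative labels (assumed counts).
feat_dim = 512      # feature dimension
mem_len = 10        # L: cached features per class
num_classes = 1000 + 10000
total_bytes = num_classes * mem_len * feat_dim * 4   # 4 bytes per float32
print(f"{total_bytes / 2**20:.2f} MiB")   # ~214.84 MiB, close to the reported 214.75 MB
```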

Q2: Generalization to Other Domains: Although the approach demonstrates promising results on existing public datasets, its effectiveness in other domains or with different types of data remains uncertain and requires further investigation.

A2: Thanks for the nice suggestion. We further validate our method on the BIMCV-COVID19+ dataset [a], which includes medical images, following the OpenOOD setup [68, 71]. Specifically, we selected BIMCV as the ID dataset, which includes chest X-ray (CXR: CR, DX) images of COVID-19 patients and healthy individuals. For the OOD datasets, we follow the OpenOOD setup and use the CT-SCAN and X-Ray-Bone datasets. The CT-SCAN dataset includes computed tomography (CT) images of COVID-19 patients and healthy individuals, while the X-Ray-Bone dataset contains X-ray images of hands. As illustrated in the table below, our AdaNeg method consistently outperforms NegLabel on this medical image dataset. We will add these analyses in the revision.

| Methods       | CT-SCAN AUROC↑ | CT-SCAN FPR95↓ | X-Ray-Bone AUROC↑ | X-Ray-Bone FPR95↓ | Average AUROC↑ | Average FPR95↓ |
|---------------|----------------|----------------|-------------------|-------------------|----------------|----------------|
| NegLabel      | 63.53          | 100            | 99.68             | 0.56              | 81.61          | 50.28          |
| AdaNeg (Ours) | 93.48          | 100            | 99.99             | 0.11              | 96.74          | 50.06          |

[a] BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients

Q3: Testing Phase Dependency: It is unclear whether the approach can maintain the same level of reliable performance when only a small number of images are tested in practical applications. This dependency on the number of test images warrants additional examination.

A3: Thank you for your suggestion. We examined the dependency of our approach on the number of test images by evaluating its performance across different scales of test samples. As the number of test samples increases (from 900 to 90K), the cached feature data also increases, leading to an improvement in our method's results, as shown in the table below. Even with a small number of test samples (e.g., 90 and 900), our method significantly reduces FPR95 compared to NegLabel, demonstrating its robustness across different numbers of test images.

Note that with only 90 test images, the task of distinguishing between ID and OOD samples degenerates into a simpler task since the number of test images is even smaller than the number of classes (e.g., 1000 for ImageNet). Consequently, both NegLabel and our method achieve lower FPR95 in such an easier scenario.

We will add these findings to the revised paper.

| Num. of Test Images | 90    | 900   | 9K    | 45K   | 90K   |
|---------------------|-------|-------|-------|-------|-------|
| NegLabel            | 14.00 | 20.44 | 20.71 | 20.51 | 20.53 |
| AdaNeg              | 6.00  | 10.12 | 9.78  | 9.66  | 9.50  |

Table Caption: FPR95 (↓) with different numbers of test images, where test samples are randomly sampled from ImageNet (ID) and SUN (OOD) datasets while maintaining their relative proportions.

Comment

Thanks for your rebuttal, which has resolved my concerns.

Comment

We sincerely thank this reviewer for the positive feedback!

Authors of paper 9248

Official Review (Rating: 5)

In this paper, the authors propose AdaNeg, a test-time adaptation method for CLIP-based post-hoc OOD detection. AdaNeg is an extension of NegLabel and introduces a class-wise memory bank for each ID and negative label. The memory bank is gradually filled with ID and OOD features during model deployment. The authors design a margin-based approach to select positive and negative samples with high confidence, and propose a cache elimination mechanism to update the memory bank. Besides, AdaNeg uses cross-attention between the input sample and the memory bank to reweight the cached features. The experimental results show the proposed method outperforms the baseline methods on various benchmarks.
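For concreteness, a minimal sketch (not the authors' released code) of the memory read this summary describes, i.e., attention between the test feature and cached per-class features; all shapes, names, and the temperature value are assumptions:

```python
import torch
import torch.nn.functional as F

def adaptive_proxies(query, memory, tau=0.01):
    # query:  (D,)      L2-normalized test-image feature
    # memory: (C, L, D) cached features per class; zero vectors mark empty slots
    sim = torch.einsum("d,cld->cl", query, memory) / tau  # slot similarities
    valid = memory.abs().sum(dim=-1) > 0                  # ignore empty slots
    attn = sim.masked_fill(~valid, float("-inf")).softmax(dim=-1)
    attn = attn.nan_to_num()                              # all-empty class -> zero weights
    proxies = torch.einsum("cl,cld->cd", attn, memory)    # reweighted combination
    return F.normalize(proxies, dim=-1)                   # (C, D) sample-adaptive proxies
```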

Strengths

<1> AdaNeg uses dynamic OOD proxies instead of the static design of NegLabel, achieving SOTA performance in CLIP-based zero-shot OOD detection.

<2> The multi-modal score is an interesting design and explanation that demonstrates the improvement brought by using both text and image encoding capabilities in a multi-modal model.

<3> The paper is well organized and easy to follow.

Weaknesses

Major concerns

<1> AdaNeg is a test-time adaptation approach that caches features w.r.t. ID labels and negative labels by maintaining a class-wise memory bank. For OOD detection, the biggest problem with test-time adaptation methods is that the arrival time of OOD samples is uncertain. Compared with non-TTA methods, AdaNeg has greater uncertainty in its performance during the deployment phase and may even risk causing model collapse.

For example, when the model is deployed in a closed-world environment, almost all input samples are ID samples (I believe this is a very common scenario). In this case, the memory banks of negative labels will gradually be filled with ID samples (in long-term deployment, there will always be misclassified ID samples that enter the negative memory banks). Since the number of negative labels is much greater than that of ID labels, more and more ID samples will be misclassified as OOD over time. I suggest the authors conduct an experiment using the 1.28M training-set images of ImageNet-1k as input (this still meets the zero-shot setting of CLIP) and observe how the proportion of samples misclassified as OOD changes with the number of input samples. If the 1.28M images are fed repeatedly over multiple rounds, will the misclassification rate increase further? The opposite case is when OOD samples far outnumber ID samples: will this cause a greater false-positive risk? I hope the authors can test their method with different ID and OOD sample mixture ratios, such as 1:100, 1:10, 1:1, 10:1, 100:1.

In summary, I suggest the authors further study the TTA setting in OOD detection to strengthen the motivation of the work, since the input samples may come from two different distributions, ID and OOD. How to ensure the stability of a TTA OOD detection algorithm when the input stream is a non-stationary process is a problem worth studying.

<2> The negative labels provide the initial memory bank slots for AdaNeg, but it seems to me that the negative labels are not necessary. This suggests that we need to rethink AdaNeg's motivation for negative labels. Why do samples that are judged as negative need to be placed in the memory bank w.r.t. the negative label? What if the authors directly use the MCM score to judge negative samples and then let them organize themselves into OOD proxies? The authors need to provide a more detailed analysis (preferably theoretical analysis) to prove that the negative label-based memory bank design is necessary.

Further, negative labels simply select words that are semantically far from ID labels. For some OOD samples, they may be far away from both ID labels and negative labels. According to the mechanism of AdaNeg, they cannot enter the memory bank of negative labels. Is this a negative impact of designing a memory bank based on negative labels?

Minor concerns

<1> The authors need to provide more detailed experimental settings. The paper mentions that memory banks are task specific. When evaluating the model, taking the ImageNet-1k benchmark as an example, do the authors maintain an independent memory bank for each OOD dataset (precisely, each ID-OOD pair), or did the four OOD datasets share one memory bank?

<2> There seem to be some typos and symbol issues in the paper.

a) L247: the temperature τ = 100 seems to be τ = 0.01, because τ is in the denominator.

b) The subscript NL is case-inconsistent, e.g., in Eq. (4) and Eq. (8).

Questions

See weaknesses.

Limitations

Yes

Author Response

Dear Reviewer Pmx9,

We sincerely thank you for the constructive comments and recognition of our work! Please find our responses below.

Q1: Analyses on the stability of our method with different ID and OOD sample mixture ratios

A1: Many thanks for the detailed and valuable comments. Following this reviewer's suggestion, we investigated the stability of our method by constructing test sets with various mixture ratios of ID and OOD samples. Specifically, we adopted the 1.28M ImageNet training data as ID and randomly sampled 12.8K and 1.28K instances from the SUN OOD dataset to construct the ID:OOD ratios of 100:1 and 1000:1 settings, respectively. To construct the ID:OOD ratios of 1:100, 1:10, 1:1, and 10:1 settings, we used the full 40K SUN dataset as OOD and randomly sampled 400, 4K, 40K, and 400K instances from the ImageNet training data. We did not validate the setting where the test set contains only ID data, as the absence of OOD data makes it impossible to calculate evaluation metrics (e.g., FPR95 and AUROC).

As shown in the table below, our method outperforms NegLabel across a wide range of mixture ratios (from 1:100 to 100:1), validating the robustness and reliability of our approach. As pointed out by this reviewer, unbalanced mixture ratios do pose a challenge to our method. Our approach performs the best in scenarios with a balanced mixture of ID and OOD samples, reducing the FPR95 by 11.18%. As the mixture ratio becomes increasingly unbalanced, the improvement brought by our method gradually decreases. When the unbalanced ratio reaches 1000:1, our method shows some negative impact. We will include these analyses in the limitation part of our revised manuscript, and attempt to address this challenging setting in future work.

| ID:OOD Ratio | 1:100 | 1:10  | 1:1   | 10:1  | 100:1 | 1000:1 |
|--------------|-------|-------|-------|-------|-------|--------|
| NegLabel     | 22.42 | 21.11 | 20.99 | 20.92 | 21.48 | 23.69  |
| AdaNeg       | 21.00 | 12.49 | 9.81  | 15.61 | 20.71 | 26.28  |

Table Caption: FPR95 (↓) with different mixture ratios of ID and OOD samples.

Q2: The necessity of negative labels in the initialization of the memory bank.

A2: Thanks for the insightful questions. Our use of negative labels to extend the memory bank is carefully designed for effective implementation. Initially, our memory bank is empty, containing only zero values. Consequently, the derived negative proxies are also zero vectors. In other words, during the early stages of testing, it is impossible to generate effective negative proxies solely based on the memory bank. To enable our method to be operational from the beginning of the testing phase, we extend the memory bank with negative labels. It is important to note that while negative labels play a crucial role during the early stages, their influence diminishes as the memory bank gradually accumulates data, and the negative proxies will be progressively dominated by cached negative images.
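A tiny sketch of this seeding logic (shapes, names, and the dedicated text slot are illustrative assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

num_neg, mem_len, feat_dim = 10000, 10, 512   # assumed sizes
# Stand-in for CLIP text features of the negative labels.
neg_text = F.normalize(torch.randn(num_neg, feat_dim), dim=-1)

# An empty (all-zero) memory would yield zero-valued proxies, so one slot per
# negative class is seeded with the static negative-label text feature.
memory = torch.zeros(num_neg, mem_len + 1, feat_dim)
memory[:, 0] = neg_text   # static seed, kept throughout testing
# As the remaining slots fill with cached OOD image features, attention pooling
# is increasingly dominated by images and the text seed's influence diminishes.
```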

Regarding the reviewer's suggestion to use the MCM score to judge negative samples and organize them into OOD proxies, this is indeed feasible. One could cluster the negative samples selected by the MCM score and use the cluster centers as OOD proxies. However, this method has a significant drawback: it requires detecting a portion of negative samples before applying our method. In other words, this approach cannot be applied online to the initial test data. From this perspective, we utilize the negative labels as predefined initial cluster centers in our AdaNeg. During testing, these cluster centers are gradually refined as more image features are cached into the memory.

Considering that negative labels are words semantically distant from ID labels, if OOD samples are similarly distant from both ID and negative labels, they typically fall into the intermediate hard examples between ID and negative labels. Handling such intermediate hard examples is a long-standing challenge for all methods, including NegLabel and our AdaNeg. Our AdaNeg partially addresses these hard examples via an easy-to-hard strategy. Specifically, our method first caches easy OOD samples, which are closer to negative labels and farther from ID labels, into the memory. These easy OOD samples, being closer to the intermediate hard OOD examples, can serve as bridges to facilitate the detection of hard OOD samples. The effectiveness of our method is validated by the improved OOD detection performance shown in Table 2 of the main paper.

Q3: More detailed experimental settings.

A3: Thanks for the kind suggestion. In our experiments, we maintain an independent memory bank for each OOD dataset (i.e., each ID-OOD pair). In implementation, we clear the memory bank when switching to a new OOD dataset.

We validated performance using a shared memory bank across four OOD datasets: iNaturalist, SUN, Places, and Textures. In this setup, the memory bank retains features from previous OOD datasets when testing a new one. As shown in the table, shared memory banks outperform independent ones, suggesting that cached features aid in recognizing new OOD datasets. However, shared memory banks may leak information between datasets, which can be problematic in practice. Therefore, we use independent memory banks by default.

| Types of Memory Bank | iNaturalist | SUN  | Places | Textures | Average |
|----------------------|-------------|------|--------|----------|---------|
| Independent          | 0.59        | 9.50 | 34.34  | 31.27    | 18.92   |
| Shared               | 0.59        | 9.13 | 32.68  | 29.08    | 17.87   |

Table Caption: FPR95 (↓) with different types of memory banks.

Q4: Typos and symbol issues

A4: Thank you for the kind correction. We will correct τ = 0.01 in L247 and unify the subscript NL as S_nl in the revision. We will further proofread the manuscript carefully.

Comment

Thank you for the detailed responses. The authors have addressed most of my issues. I think sample imbalance is an important challenge for TTA. Therefore, I suggest that the authors may add an adaptive module to AdaNeg to alleviate the ID-OOD imbalance. I will keep my score.

Comment

Dear Reviewer Pmx9,

Thank you for highlighting the issue of model instability due to ID-OOD imbalance.

Following your suggestion, we have implemented an adaptive gap (AdaGap) strategy to dynamically adjust the memorization selection criteria. This approach builds on the observation that as the score S_nl increases/decreases, the probability that a sample is ID/OOD also increases accordingly. By enforcing a stringent selection criterion, we can effectively minimize the inclusion of misclassified samples in our memory. Specifically, we first estimate online the ratio of ID to OOD samples in the test data using a First-In-First-Out queue, which caches the ID/OOD estimation (cf. Eq. 8) of the most recent N samples:

MR = (Estimated ID Number) / (Estimated ID Number + Estimated OOD Number)

where the ID and OOD numbers are calculated within the queue.

Leveraging the estimated mix ratio (MR), we can dynamically adjust the gap g in memory caching to minimize the presence of misclassified samples within the memory. For instance, if ID samples predominate in the test samples (i.e., MR > 0.5), this could lead to an increased proportion of ID samples in the OOD memory. To counteract this, we refine the selection criterion for OOD memorization to cache only those OOD samples with higher confidence. This refinement involves modifying the selection criterion for memorization in Equation (8) as follows:

Negative: from S_nl(v) < γ - g·γ to S_nl(v) < γ - max(g, MR)·γ

Positive: from S_nl(v) ≥ γ + g·(1 - γ) to S_nl(v) ≥ γ + max(g, 1 - MR)·(1 - γ)

where g = 0.5 is the default gap analyzed in Figure 3(b).
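A minimal sketch of this adaptive-gap rule (the reduction of the Eq. (8) estimate to a simple threshold test, and all names, are assumptions for illustration):

```python
from collections import deque

class AdaGap:
    """FIFO-based mix-ratio estimate with adaptive caching thresholds."""

    def __init__(self, gamma=0.5, g=0.5, N=10_000):
        self.gamma, self.g = gamma, g
        self.recent = deque(maxlen=N)   # 1 = estimated ID, 0 = estimated OOD

    def observe(self, s_nl):
        # Binary ID/OOD estimate of the current sample (simplified stand-in
        # for the Eq. (8) estimation used in the paper).
        self.recent.append(1 if s_nl >= self.gamma else 0)

    @property
    def mr(self):
        return sum(self.recent) / max(len(self.recent), 1)

    def cache_decision(self, s_nl):
        # Returns "positive", "negative", or None (do not cache).
        g_neg = max(self.g, self.mr)       # stricter OOD caching when ID-heavy
        g_pos = max(self.g, 1 - self.mr)   # stricter ID caching when OOD-heavy
        if s_nl < self.gamma - g_neg * self.gamma:
            return "negative"
        if s_nl >= self.gamma + g_pos * (1 - self.gamma):
            return "positive"
        return None
```

Note that with MR = 1 the negative criterion becomes S_nl(v) < 0, so no test sample is cached as negative, matching the extreme case described below.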

In this way, our method remains consistent with our original version under balanced ID/OOD conditions (e.g., MR = 0.5). However, if the proportion of ID samples is higher in the test sample estimation (e.g., MR > 0.5), we increase the threshold for storing negative samples in the memory. In the extreme case where MR = 1, we estimate that there are no OOD samples among the test samples; thus, we stop storing test samples in the negative memory and only selectively cache test samples into the positive memory. We adjust our approach conversely when the MR value is lower than 0.5. This strategy enhances the robustness of our method against ID-OOD imbalance, as demonstrated in the table below:

| ID:OOD Ratio         | 1:100 | 1:10  | 1:1   | 10:1  | 100:1 | 1000:1 |
|----------------------|-------|-------|-------|-------|-------|--------|
| NegLabel             | 22.42 | 21.11 | 20.99 | 20.92 | 21.48 | 23.69  |
| AdaNeg               | 21.00 | 12.49 | 9.81  | 15.61 | 20.71 | 26.28  |
| AdaNeg (with AdaGap) | 20.50 | 12.22 | 9.73  | 12.98 | 15.61 | 18.43  |

Table Caption: FPR95 (↓) with different mixture ratios of ID and OOD samples.

Please kindly note that the MR is estimated with the most recent N test samples, allowing for dynamic online adjustment of the selection criterion. We set N=10,000 by default. This dynamic adjustment ensures that our memory caching strategy remains responsive to the evolving nature of the test sample distribution, thereby optimizing memory utilization and enhancing the accuracy of our domain distinction process. We will include these analyses and the AdaGap module in the revision.

Official Review (Rating: 8)

The authors introduce a new approach to leverage the pre-trained vision-language model for identifying out-of-distribution (OOD) samples. Compared to prior works that employ consistent negative labels across different OOD datasets, they introduce adaptive negative proxies to dynamically generate text labels during testing by exploring actual OOD images, thereby aligning more closely with the underlying OOD label space. Empirically, the proposed method demonstrates state-of-the-art performance across various OOD detection benchmarks especially on the large-scale ImageNet benchmark.

Strengths

  • Dynamically generating negative proxies is a simple and effective strategy.

  • The setting studied is very natural and this paper can easily stimulate further research in the area.

  • The proposed approach performs well, particularly on large-scale datasets such as ImageNet, effectively demonstrating its scalability.

  • The paper is nicely written.

Weaknesses

  • While the proposed AdaNeg shows clear improvements over training-free baselines, its overall performance on ImageNet still lags behind training-based methods. This raises the question of whether there are opportunities for complementarity between the two approaches.

  • Can the dynamic update of the memory bank and refinement of OOD proxies during the testing stage be considered a form of test-time training? The authors are requested to clarify the inherent connections and distinctions, especially from the perspectives of training versus training-free approaches.

  • If negative proxies can directly identify true out-of-distribution (OOD) test images during the testing phase, is it possible to use the identified OOD samples to update the model parameters online?

Questions

Please refer to the weaknesses.

Limitations

Yes

Author Response

Dear Reviewer ACdF,

We sincerely thank you for the constructive comments and recognition of our work! Please find our responses below.

Q1: While the proposed AdaNeg shows clear improvements over training-free baselines, its overall performance on ImageNet still lags behind training-based methods. This raises the question of whether there are opportunities for complementarity between the two approaches.

A1: Thanks for the nice suggestion! We validated the complementarity between our AdaNeg and the existing state-of-the-art method NegPrompt [31], which is characterized by learnable negative prompts. We reproduced the results of NegPrompt with the authors' released code. As shown in the table below, our method brings significant performance improvements over NegPrompt, validating the complementarity between our approach and training-based methods. We will include these analyses in the revision.

| Methods   | iNaturalist | SUN   | Places | Textures | Average |
|-----------|-------------|-------|--------|----------|---------|
| NegPrompt | 6.76        | 23.41 | 28.32  | 34.57    | 23.27   |
| + AdaNeg  | 3.87        | 11.35 | 25.45  | 29.79    | 17.62   |

Table Caption: FPR95 (↓) of AdaNeg based on the training-based NegPrompt method.

Q2: Can the dynamic update of the memory bank and refinement of OOD proxies during the testing stage be considered a form of test-time training? The authors are requested to clarify the inherent connections and distinctions, especially from the perspectives of training versus training-free approaches.

A2: Thanks for the suggestions. Our online update of the memory bank and refinement of OOD proxies is a kind of training-free test-time adaptation. Unlike existing training-required test-time adaptation methods [13, 15, 69], which typically require test-time optimization and subsequently slow down the testing process, our approach is optimization-free. It introduces only a lightweight memory interaction operation, enabling rapid and accurate testing, as analyzed in Table 4. We will further clarify this point in the revision.

Q3: If negative proxies can directly identify true out-of-distribution (OOD) test images during the testing phase, is it possible to use the identified OOD samples to update the model parameters online?

A3: Thank you for the comments. We performed experiments to use the identified OOD samples to update the text prompt with cross-entropy loss, similar to the training objective in [1]. However, it is unstable and often leads to model collapse, especially in challenging settings. For example, in the ImageNet (ID) and SSB-hard (OOD) setup, around 44% of OOD samples were misclassified as ID, disrupting model learning.

Our training-free method is more robust to such misclassifications due to the weighted combination of cached features. Additionally, updating model parameters online would significantly slow down the testing process. In contrast, our method ensures rapid and accurate testing with only lightweight memory interaction.

Comment

Thank you for the responses. My previous concerns have been well addressed. After carefully reviewing the other reviewers' comments and the authors' replies, I believe the paper has no significant flaws, and therefore, I choose to maintain my score.

Comment

We sincerely appreciate your positive feedback and your dedicated time to review our paper.

Authors of paper 9248

Official Review (Rating: 5)

This paper introduces a new algorithm for Out-Of-Distribution (OOD) sample detection. First, it analyzes the shortcomings of previous Vision-Language OOD detection methods and proposes improvements based on these findings. Specifically, the paper presents a scheme for online updating of the memory bank during testing to design better negative proxies. The authors conducted experiments on datasets such as ImageNet and CIFAR. According to the experimental results, the newly proposed method can enhance OOD detection performance.

Strengths

  1. Currently, vision-language models are developing rapidly, and using them for OOD sample detection is a promising direction. Approaching from this perspective may yield better results.
  2. The experiments in this paper are relatively thorough, encompassing both large datasets based on ImageNet and smaller datasets based on CIFAR. According to the authors' experimental results, the newly proposed method can improve the accuracy of OOD detection.

Weaknesses

  1. The motivation in this paper is not very clear. Specifically, in Figure 1(a), it is not evident why the newly proposed AdaNeg is better than NegLabel. On the contrary, the distribution of OOD samples seems to be closer to NegLabel.

  2. The method proposed in this paper is based on the features and results of test samples during testing, which limits the upper bound of the method. In my opinion, the effectiveness of the proposed method relies on the vision-language model's strong inherent OOD detection capability, meaning that most test samples can be correctly processed. Based on these correctly processed samples, the method can further improve the detection accuracy of other samples. However, if in a certain scenario, the model itself cannot correctly estimate most of the samples, this method might actually make the results worse.

  3. This paper merely performs optimizations based on the NegLabel framework, without many innovative points. The novelty of this improvement is insufficient to support a NeurIPS paper.

Questions

As shown in weakness. What is the meaning of Figure 1, and is the proposed method effective when the base model predicts most samples incorrectly?

Limitations

Potential negative societal impact is not applicable.

Author Response

Dear Reviewer yxd9,

We sincerely thank you for the constructive comments! We hope our following responses can address this reviewer's concerns.

Q1: The motivation in this paper is not very clear. Specifically, in Figure 1(a), it is not evident why the newly proposed AdaNeg is better than NegLabel. On the contrary, the distribution of OOD samples seems to be closer to NegLabel.

A1: Thank you for pointing out this issue. In the visualization of our submission, the adaptive negative proxies are not L2-normalized, whereas the features of ID, OOD, and NegLabel are normalized vectors. This makes the visualization unclear. We have corrected this issue by revising the visualization with normalized adaptive negative proxies. As expected, the distribution of adaptive negative proxies is much closer to the ground-truth OOD labels than to the negative labels in NegLabel. We have attached an updated figure in the PDF of the Global Response and will revise Figure 1(a) in the manuscript.

Q2: The method proposed in this paper is based on the features and results of test samples during testing, which limits the upper bound of the method. In my opinion, the effectiveness of the proposed method relies on the vision-language model's strong inherent OOD detection capability, meaning that most test samples can be correctly processed. Based on these correctly processed samples, the method can further improve the detection accuracy of other samples. However, if in a certain scenario, the model itself cannot correctly estimate most of the samples, this method might actually make the results worse.

A2: Thanks for the comments. We agree that our method leverages the inherent strong capability of vision-language models (VLMs). However, we would like to emphasize that our method is highly robust to misclassified samples.

For instance, we investigate a challenging task setting with ImageNet as the ID dataset and SSB-hard [58] as the OOD dataset, where NegLabel achieves an FPR95 of 77.26%. With a threshold γ = 0.5 and a gap g = 0, about 44% of OOD samples were incorrectly stored in memory slots for ID classes. In this scenario, our method achieves an FPR95 of 72.90%, a 4.36% improvement over NegLabel. This robustness comes from the weighted combination of cached features used to form the proxies, which resists interference from a few misclassified features.

In a worst-case scenario, where test samples are randomly cached into ID or OOD memories (50% OOD misclassified), our method achieves an FPR95 of 77.69%, comparable to NegLabel's 77.26%. Such extreme confusion is unlikely in real-world scenarios, where there are usually some clues to distinguish between ID and OOD samples.

Q3: This paper merely performs optimizations based on the NegLabel framework, without many innovative points. The novelty of this improvement is insufficient to support a NeurIPS paper.

A3: We appreciate this reviewer's comments; however, we would like to argue that our method introduces significant improvements to the NegLabel framework and makes notable contributions to negative-proxy-guided OOD detection. Our contributions and distinguishing features are as follows:

  • We identify the label space misalignment between existing negative-label-based proxies and the target OOD distributions. To address this issue, we dynamically generate adaptive negative proxies to align with the OOD label space more effectively. This is a "novel approach" and a "significant advancement," as recognized by Reviewers ACdF and eV4A.
  • We construct adaptive negative proxies using a feature memory bank that incorporates carefully designed write and read strategies. Additionally, we propose a novel multi-modal score that combines complementary textual and visual knowledge, which is an "interesting design" as highlighted by Reviewer Pmx9.
  • Our method is simple yet effective (cf. Reviewer ACdF), training-free, and annotation-free (cf. Reviewer eV4A). It exhibits good scalability (cf. Reviewers ACdF and eV4A), has been comprehensively evaluated (cf. Reviewers yxd9 and eV4A), and achieves substantial performance improvements (cf. Reviewers yxd9 and eV4A).
Comment

Thanks for your reply. The authors have addressed most of my concerns. I will raise my score.

Comment

We sincerely appreciate your positive feedback and your dedicated time to review our paper.

Authors of paper 9248

Author Response

Common Responses to All Reviewers

Dear Reviewers, Area Chairs, and Program Chairs:

We are grateful for the constructive comments and valuable feedback from the reviewers. We are glad that the reviewers found our idea novel (Reviewers ACdF and eV4A) and our design interesting (Reviewer Pmx9), and we appreciate their recognition of the wide scalability (Reviewers ACdF and eV4A), comprehensive evaluation (Reviewers yxd9 and eV4A), and substantially improved performance (Reviewers yxd9 and eV4A) of our method.

To address the reviewers' concerns, we have attached an updated Figure 1(a) in the enclosed PDF file and provided additional experiments and analyses on training-based methods, misclassified samples, different setups of test data, and medical image datasets. Please find our itemized responses to all reviewers' comments below, and we sincerely hope our responses can well address the reviewers' concerns.

Best regards,

Authors of Paper 9248

Final Decision

The paper was reviewed by four experts in the field. The paper mostly received positive comments from reviewers on its innovation, scalability, interesting design, and easy-to-follow presentation. Among the concerns raised by reviewers were analyses of stability, the necessity of negative labels, unclear motivation, and innovativeness. The rebuttal attempted to address all the concerns, and reviewers acknowledged that most of their concerns (e.g., not very clear motivation, possibly limited innovation) were adequately addressed. Therefore, the AC recommends the paper for acceptance. The authors are encouraged to include important points from the rebuttal and post-rebuttal discussion in the final version of the paper. We congratulate the authors on the acceptance of their paper at NeurIPS 2024.