Self-Calibrated Tuning of Vision-Language Models for Out-of-Distribution Detection
We propose a framework, named SCT, to mitigate the problem of spurious OOD features mined from ID data in prompt-tuning based OOD detection methods.
Abstract
Reviews and Discussion
This paper first reveals the relationship between the quality of out-of-distribution (OOD) features and the prediction uncertainty of in-distribution (ID) data. Then, the paper introduces modulating factors to weight the ID loss and OOD loss, with the weights being related to the ID data prediction confidence. The experiments are carried out on standard datasets.
Strengths
- The analysis of the relationship between OOD feature quality and ID prediction confidence is well-reasoned; lower ID confidence indeed affects the accuracy of foreground-background separation.
- Weighting the loss components is a straightforward approach, making it easier to understand.
- The writing is clear and easy to comprehend.
Weaknesses
- Overall, the technical contribution of this paper is relatively incremental, primarily focusing on how to weight the two loss components.
- The effectiveness of the proposed method is quite limited. For example, as shown in Table 2, the improvement in averaged AUROC under the 16-shot scenario is minimal, only around 0.3%, and there is even a slight decrease in results on ID-like data.
- There is a lack of comparison with existing state-of-the-art (SOTA) methods. For instance, the results reported in this paper are not as good as those of NegLabel [1] (AUROC 94.21 > 93.37), which is a zero-shot method that requires neither training nor training samples. The results for the ID-like method reported in this paper are also lower; the official paper reports 94.36 AUROC under 4-shot, while this paper reports 92.14 AUROC under 16-shot.
- More exploration is needed regarding the settings of the modulating functions in Equation 4.
- The statement in lines 158-159 is somewhat unclear. Should it be that inaccurate OOD features hinder the effective learning of better OOD detection?
[1] Jiang, Xue, et al. "Negative label guided ood detection with pretrained vision-language models." ICLR (2024).
Questions
See weaknesses
Limitations
NA
Thank you for your time devoted to reviewing this paper and your constructive suggestions. Here are our detailed replies to your questions.
Q1: Overall, the technical contribution of this paper is relatively incremental, primarily focusing on how to weight the two loss components.
Thanks for the valuable comments! We would like to re-clarify the novelty and insights of our SCT as follows.
Conceptually, the motivation of SCT is to mitigate the problem of spurious OOD features in prompt-tuning based OOD detection methods. Generally, these methods rely on the ID-irrelevant local context extracted by VLMs as the surrogate OOD features to perform regularization, the quality of which is greatly affected by the foreground-background decomposition of VLMs. As shown in Figure 1/4 in the submission, although VLMs can mask out some ID-related regions, large portions of the extracted OOD features (shown as the colored patches of images) obviously belong to ID features.
Empirically, we find that the quality of extracted OOD features is significantly correlated with the uncertainty level of ID data. As illustrated in the left panel of Figure 2 in the submission, the extracted OOD features become more inaccurate as the uncertainty increases. In the right panel of Figure 2, we train LoCoOp on multiple data groups with different uncertainty levels, and the results demonstrate that the OOD detection performance of LoCoOp can be significantly impacted by the uncertainty level of ID data. Therefore, to mitigate the issue of unreliable OOD features, we propose SCT to calibrate the influence of OOD regularization from different ID samples based on their uncertainty level.
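To make the mechanism concrete, below is a minimal PyTorch sketch of how such per-sample modulation could be wired into a LoCoOp-style objective. The function and variable names are ours, and the modulating functions `phi`/`psi` are placeholders; their exact forms (and monotonicity) are design choices studied in the paper, so this is an illustrative sketch rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def modulated_objective(logits_id, ood_region_probs, targets,
                        phi=lambda p: torch.ones_like(p),
                        psi=lambda p: torch.ones_like(p),
                        lam=0.25):
    """Illustrative uncertainty-modulated, LoCoOp-style objective (not the paper's exact form).

    logits_id:        (B, C) classification logits of the ID images
    ood_region_probs: (B, R, C) softmax probabilities of the extracted ID-irrelevant regions
    targets:          (B,) ground-truth ID labels
    phi / psi:        placeholder modulating functions of the true-class probability;
                      with phi = psi = 1 this reduces to the unmodulated loss.
    """
    probs = logits_id.softmax(dim=-1)
    # True-class probability of each sample, used as the per-sample confidence signal.
    p_y = probs.gather(1, targets[:, None]).squeeze(1).detach()

    loss_id = F.cross_entropy(logits_id, targets, reduction="none")               # (B,)
    # LoCoOp-style OOD regularization: maximize the entropy of ID-irrelevant regions.
    entropy = -(ood_region_probs * ood_region_probs.clamp_min(1e-8).log()).sum(-1)
    loss_ood = -entropy.mean(dim=1)                                               # (B,)

    return (phi(p_y) * loss_id + lam * psi(p_y) * loss_ood).mean()
```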
Q2: The effectiveness of the proposed method is quite limited. For example, as shown in Table 2, the improvement in averaged AUROC under the 16-shot scenario is minimal, only around 0.3%, and there is even a slight decrease in results on ID-like data.
Thanks for the comments! As shown in Table 1 in the submission, the AUROC of prompt-tuning based methods is close to saturation, with all methods exceeding 90%. However, the room for improvement of FPR95 is still very large, so these two metrics should not be treated equally. The improvement of SCT on FPR95 (e.g., +5.95% for IDLike and +2.73% for LSN under the 16-shot setting) is significant. Nevertheless, we will continue to pursue improvements in AUROC in future work!
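For reference, both metrics are computed from the same per-sample detection scores; the following sketch (scikit-learn based, assuming higher scores indicate ID) makes their definitions explicit.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_and_fpr95(scores_id, scores_ood):
    """scores_id / scores_ood: 1-D arrays of detection scores (higher = more ID-like)."""
    labels = np.concatenate([np.ones_like(scores_id), np.zeros_like(scores_ood)])
    scores = np.concatenate([scores_id, scores_ood])
    auroc = roc_auc_score(labels, scores)
    # FPR95: false positive rate on OOD at the threshold that keeps 95% of ID samples (TPR = 95%).
    threshold = np.percentile(scores_id, 5)
    fpr95 = float(np.mean(scores_ood >= threshold))
    return auroc, fpr95
```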
Q3: There is a lack of comparison with existing state-of-the-art (SOTA) methods. For instance, the results reported in this paper are not as good as those of NegLabel[1] (AUROC 94.21 > 93.37), which is a zero-shot method that does not require training and training samples. The results for ID-like method reported in this paper are also lower; the official paper reports 94.36 AUROC under 4-shot, while this paper reports 92.14 AUROC under 16-shot.
Thanks for the comment! Zero-shot methods and prompt-tuning based methods are compatible with each other, and combining them can further boost OOD detection performance. We conduct experiments on the compatibility of NegLabel with SCT in Table 5 in the attached PDF, and the results show that SCT can be combined with the advanced NegLabel for better OOD detection.
Regarding the results for ID-like method, we strictly follow the official source code [1] and the hyperparameter settings of the official paper on a single A100 GPU. We will add this to our revised version.
Q4: More exploration is needed regarding the settings of the modulating functions in Equation 4.
Thanks for the suggestions! We conduct experiments on more instantiations of the modulating functions in the following table. The results demonstrate that all the instantiations show significant improvement over LoCoOp, which verifies the effectiveness of the learning framework of SCT.
| Modulating function | FPR95 | AUROC | ID-ACC |
|---|---|---|---|
| LoCoOp (constant, no modulation) | 29.47 | 93.10 | 71.43 |
| power-2 | 27.41 | 93.14 | 71.42 |
| power-4 | 27.10 | 93.21 | 71.49 |
| log | 27.06 | 93.20 | 71.39 |
| triangle | 27.34 | 93.16 | 71.53 |
Q5: The statement in lines 158-159 is somewhat unclear. Should it be that inaccurate OOD features hinder the effective learning of better OOD detection?
Thanks for the comments! What we mean by the original statement is that we aim to design a mechanism to mitigate the issue of unreliable extracted OOD features. We will make this statement clearer in our revision.
Reference: [1] Bai, Han, et al. "ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection." CVPR, 2024
Thanks for your time and comments on our work. We have tried our best to address the concerns and provided detailed responses to all your comments and questions. Are there any unclear points that we should/could further clarify?
Thanks for your responses. Some of my concerns have been addressed; however, I still have concerns about the incremental technical contribution and the limited improvement on AUROC. In my experiments, the FPR95 fluctuates greatly, while the AUROC results are relatively stable, so I consider the AUROC results more reliable. I will maintain my current score.
Many thanks for your response and we will consider your suggestions in the revision.
Based on the observation that CLIP under-calibration affects the OOD regularization of existing prompt-tuning-based methods, i.e., samples with an uncertain true-class probability (referred to as ID uncertainty in this paper) may provide false OOD features and harm the negative training used in existing methods, the authors propose a simple training strategy, Self-Calibrated Tuning (SCT), which weights the losses by the ID uncertainty and is experimentally shown to improve FPR95.
Strengths
- The author also observed and attempted to study the important CLIP calibration problem.
- The paper is relatively easy to understand overall.
Weaknesses
My main concerns about this work are that the work is relatively incremental and empirical; there is insufficient discussion on the pros and cons of field-related methods (including other paradigms); the experiments are not sufficiently extensive, rigorous, or analyzed; and the method's improvement on common benchmarks is rather one-sided. The details are as follows:
[Method]
- (Inaccurate motivation verifications) If I understand correctly, the misclassified ratio on the horizontal axis in the right panel of Fig. 2 refers to the fraction of misclassified samples (i.e., 1 - accuracy), while your ID uncertainty refers to something like the True-Class Probability (TCP), i.e., the softmax probability of the ground-truth class. These are not two identical things, though there is a certain correlation: only within some ranges could TCP indicate accuracy [1]. Therefore, I think the author may have a biased understanding here, and the experimental results cannot fully reflect the motivation of the work: when "uncertain" ID samples are used as OOD regularization, some ID data are misdetected as OOD (FPR). If I have misunderstood, could the author clarify? Or maybe additional correct experiments are needed?
- (No calibration verifications) The claim and results in the work show that SCT helps with CLIP calibration, but there is no visualization of the calibration after training to illustrate the point (e.g., a before-and-after calibration comparison like in Fig. 2 could be added).
- (Lack of discussion of weighted training; modulating function rationale) The idea of weighted loss is very direct and easy to think of. Previous work should also be mentioned and discussed. For example, [2] is based on the feature representation paradigm and uses the activation scale as the ID-ness indicator (similar to ID uncertainty in this context) for weighted training to improve OOD detection. In comparison, I do not quite understand why the modulating function must be monotonically decreasing w.r.t. the confidence and not monotonically increasing, because the weighting method in [2] is monotonically increasing, and the result is also improved. Could the authors elaborate on this?
- (How about post-hoc CLIP calibration?) Usually, calibration is divided into two types: training-time and post-hoc [3] (calibration-related works are lacking in the paper). The former is used in this paper. The latter may be explored in OOD feature extraction methods, e.g., changing the rank operation (Eq. (6) & Fig. 3(d)). The authors may lack discussion in this aspect.
[Experiments]
- (Not much AUROC improvement) I understand that the method in this work mainly improves FPR95, but improving only FPR95 does not seem comprehensive enough, because AUROC is an equally important indicator and methods need to be proposed to improve it.
- (Lack of CIFAR results) Although the comparison method LoCoOp has not been experimented on CIFAR, CIFAR is indeed another important benchmark in the field of OOD detection, and I think it is necessary to supplement it.
- (Discussions with simpler yet more effective pre-trained features + post-hoc?) I would just like to know what the authors think about the (potential) advantages of the prompt-tuning-based method studied in the paper compared to post-hoc methods. After all, post-hoc methods do not require additional training and use the basic ResNet backbone; the FPR95 and AUROC on the main task of Tab. 1 on ImageNet-1k have reached 20.05 and 95.71 respectively, which are much better than the results reported in the paper (26.47, 93.37).
- Could the authors clarify on what validation set the hyperparameters are tuned?
- (Interpretations of the ablations.) Figure 3(b) shows that the results of selecting other regularization functions are very different, and the paper (L294-298) does not provide any analysis. I am curious how the authors would interpret these ablation study results. Similarly, the quality of OOD features extracted by different extraction methods also varies greatly, which seems very empirical (Fig. 3(d)).
- Table 1 is suggested to include results of combining newer post-hoc methods (e.g., ASH (Djurisic et al., 2022), Scale [2]) and fine-tuned methods, which will give readers a more comprehensive sense.
[Presentation]
- The paragraph introducing the OOD features (L189) should be moved forward, or at least before the first reference to Fig. 1, which will give readers a clearer background.
- Why is the left panel in Fig. 2 not arranged in ascending order of softmax output? The arrangement of 0.02, 0.89, 0.04, and 0.67 affects reading. What is it trying to say? It would be better to display the classes and images together for clarity.
References:
[1] Corbière, Charles, et al. "Addressing failure prediction by learning model confidence." NeurIPS, 2019.
[2] Xu, Kai, et al. "Scaling for Training-Time and Post-hoc Out-of-distribution Detection Enhancement." ICLR, 2024.
[3] Guo, Chuan, et al. "On calibration of modern neural networks." ICML, 2017.
Questions
Please see the weaknesses.
Limitations
The limitation regarding the lack of theoretical analysis (L571-574) had better be put in the main manuscript.
Thank you for your time devoted to reviewing this paper and your constructive suggestions. Here are our detailed replies to your questions.
W1: Inaccurate motivation verifications
Thanks for the constructive comments! Although true-class probability (TCP) and accuracy are not identical, classification accuracy can indicate TCP to a certain degree on the large-scale ImageNet-1k dataset. Since ImageNet-1k has a large number of classes and CLIP models are prompt-tuned with one-hot labels, the prediction probability of CLIP can be overconfident, which means that CLIP models normally output a high maximum softmax probability. Therefore, when models correctly classify samples, the TCP is likely to be very high. Nevertheless, we acknowledge that this correspondence is not exact.
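For clarity, the two quantities differ only in which entry of the softmax output is read off; a small sketch (names are ours):

```python
import torch

def tcp_and_msp(logits, targets):
    """logits: (N, C) classification logits; targets: (N,) ground-truth labels."""
    probs = logits.softmax(dim=-1)
    tcp = probs.gather(1, targets[:, None]).squeeze(1)    # true-class probability
    msp = probs.max(dim=-1).values                        # maximum softmax probability
    # tcp == msp exactly when a sample is classified correctly; otherwise tcp < msp.
    return tcp, msp
```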
To make our motivation verification clearer and more accurate, we conduct a new experiment on the correlation between data uncertainty and OOD detection performance. Specifically, we calculate the TCP of all the training samples in a 64-shot set using a CLIP model prompt-tuned with a 4-shot training set which contains no overlapping samples with the 64-shot set. We choose the data with the lowest and highest TCP for every ID class to generate two data groups with different uncertainty levels respectively, and train LoCoOp on these two data groups. As shown in Table 1 in the attached PDF, the OOD detection performance of LoCoOp is significantly impacted by the data uncertainty level, which is consistent with Figure 2 in the submission.
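A sketch of the grouping procedure described above, assuming the per-sample TCP values on the 64-shot pool have already been computed with the 4-shot-tuned model (names are ours):

```python
import numpy as np

def split_by_tcp(tcp, labels, shots_per_group):
    """tcp: (N,) true-class probabilities; labels: (N,) class ids.
    Returns index arrays of the lowest- and highest-TCP samples per class."""
    low_idx, high_idx = [], []
    for c in np.unique(labels):
        cls_idx = np.where(labels == c)[0]
        order = cls_idx[np.argsort(tcp[cls_idx])]         # ascending TCP within the class
        low_idx.extend(order[:shots_per_group])            # most uncertain samples
        high_idx.extend(order[-shots_per_group:])          # most confident samples
    return np.array(low_idx), np.array(high_idx)
```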
W2: No calibration verifications
Thanks for the valuable suggestions! We add the figures of calibration comparison as Figure 1 in the attached PDF. The illustrations show that our SCT can significantly help with CLIP calibration for more accurate OOD feature extraction. The extracted OOD features (shown as grey patches of images) of SCT models are significantly more accurate and contain less ID-relevant regions than LoCoOp models.
W3, W4
We leave the answers to the General Response.
W5: No much AUROC improvement
Thanks for the comments! As shown in Table 1 in the submission, the AUROC of prompt-tuning based methods is close to saturation, with all methods exceeding 90%. However, the room for improvement of FPR95 is still very large, so these two metrics should not be treated equally. We will continue to pursue improvements in AUROC in future work!
W6: (Lack of CIFAR results)
Thanks for the valuable suggestions! We conduct experiments on the CIFAR benchmark in Table 4 of the attached PDF and the results show that our SCT can still outperform the baselines under CIFAR benchmark.
W7: Discussions with pre-trained features + post-hoc
Thanks for the comment! First, prompt-tuning based methods can leverage the generalization ability of VLMs to better fit the domains of downstream tasks with relatively low computational cost. Secondly, post-hoc methods need to be built on a well-trained model, whose capacity greatly affects the OOD detection performance. Thirdly, post-hoc methods and prompt-tuning based methods are compatible with each other, further boosting the OOD detection performance. We conduct experiments on the compatibility of an advanced post-hoc method, NegLabel [1], with SCT in Table 5 in the attached PDF. The results show that SCT can be combined with post-hoc methods for better OOD detection.
W8: Validation set
Thanks for the question! We tune the hyperparameter on the dedicated OOD validation set in the OpenOOD v1.5 benchmark [2]. The other hyperparameters are chosen following previous works.
W9: About ablations.
Thanks for the constructive question! In the ablation study of Figure 3(b), we follow OE [3] and Energy-OE [4] to implement prompt-tuning based OOD detection with MSP and Energy regularization functions, respectively. The motivation for conducting this experiment is to verify that SCT outperforms LoCoOp under different regularization functions. Specifically, the energy regularization requires tuning two energy-margin threshold hyperparameters, which limits its advantage over other regularizations. As shown in Fig. 3(b), we conjecture that directly forcing the probability distribution of OOD features toward the uniform distribution (MSP) performs worse than entropy maximization under the setting of prompt tuning.
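For reference, minimal sketches of the two regularization functions compared in Figure 3(b), written over the softmax outputs of the extracted ID-irrelevant regions (illustrative implementations, not the exact OE/LoCoOp code):

```python
import torch

def uniform_matching_reg(ood_probs):
    """OE/MSP-style: cross-entropy between a uniform target and the OOD-region predictions."""
    return -ood_probs.clamp_min(1e-8).log().mean(dim=-1).mean()

def entropy_max_reg(ood_probs):
    """LoCoOp-style: maximize prediction entropy on OOD regions (returned as a loss to minimize)."""
    entropy = -(ood_probs * ood_probs.clamp_min(1e-8).log()).sum(dim=-1)
    return -entropy.mean()
```

Although both push OOD-region predictions toward the uniform distribution, they correspond (up to constants) to forward and reverse KL divergences to the uniform distribution respectively, so it is plausible that they behave differently under prompt tuning.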
Regarding OOD feature extraction methods, the probability-based and entropy-based methods both have a threshold hyperparameter to discriminate between ID and OOD features. The sensitivity of the different methods to their hyperparameters may be the reason behind the different OOD detection performance. For the entropy-based method, the performance is poor since it is challenging to determine the appropriate threshold [5].
W10: About Table 1
Thanks for the suggestion! We will conduct new experiments with newer post-hoc and fine-tuned methods on CLIP models in our revision. Since ASH and Scale cannot be directly applied to CLIP, as analyzed in the response to W3, we will include their results on conventional CNN models in the appendix for a fair comparison.
W11, W12, W13: About presentation and limitations.
Thanks for the suggestion! The original image arrangement was designed to compare neighboring images with different uncertainty levels. We will make the modifications in our manuscript.
References:
[1] Jiang, Liu, et al. "Negative Label Guided OOD Detection with Pretrained Vision-Language Models." ICLR, 2024.
[2] Zhang, Yang, et al. "OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection."
[3] Hendrycks, et al. "Deep Anomaly Detection with Outlier Exposure." ICLR, 2019.
[4] Liu, et al. "Energy-based Out-of-distribution Detection." NeurIPS, 2020.
[5] Miyai, Yu, et al. "LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning." NeurIPS, 2023.
Thanks for your time and comments on our work. We have tried our best to address the concerns and provided detailed responses to all your comments and questions. Are there any unclear points that we should/could further clarify?
I have carefully read the reviews of the reviewers and the author's detailed rebuttal, and I am sincerely grateful. In particular, the author corrected the experimental verification of TCP motivation and provided the benchmark result of CIFAR. However, in general, I am still worried about:
- If we refer to related methods such as feature-representation-based ones, 90 AUROC is not saturated, and their FPR is also relatively low (~20), so the AUROC performance should be improved further.
- The method is a bit incremental and empirical, lacking theory, as the authors also acknowledge. If I understand correctly, the authors want to achieve the balance and relative weighting of the OOD and ID terms via the modulating function. Even so, the ablation OOD/ID ratios in the last three rows of Tab. 4 all increase monotonically with p, so either the hyperparameters may not be tuned well or there should be a deeper understanding of the exact reasons that matter.
- For calibration verification, in addition to the recommended qualitative OOD region visualization comparison, you can also consider adding quantitative metrics such as ECE [4].
(Minor)
- Post-hoc studies could be more comprehensive, and choosing different K is heuristic and not a very good calibration strategy. The authors could consider temperature scaling [4], etc.
Based on the above reasons and careful consideration, I tend to keep the original score.
[4] A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)
Thank you very much for all the constructive feedback after reading our response! We will make sure to incorporate all suggestions in the revision.
This paper presents a novel few-shot approach to regularizing prompt tuning-based OOD detection methods called Self-Calibrated Tuning (SCT). SCT is specifically built to address the problems of incorrect OOD features being used in prompt tuning-based OOD detection methods. More specifically, by weighting regions of the image based on model confidence, SCT can better alleviate these issues in prompt tuning-based OOD detection methods. The resulting SCT method shows strong empirical improvements across a wide range of OOD detection methods.
Strengths
- The paper is well written and the authors provide a clear and concise motivation justifying the use of SCT.
- The author provides a timely analysis of the problem of incorrect OOD features extracted from ID data.
- SCT shows strong empirical performance across a wide range of traditional OOD detection methods and prompt tuning-based OOD detection methods.
- Additionally, given the nature of prompt tuning-based OOD detection methods, SCT can act in the more relevant few-shot setting.
Weaknesses
- A primary concern of the reviewer is the lack of evaluations against the more traditional CIFAR set of benchmarks for OOD detection.
- Additionally, the empirical performance gain of SCT (table 2) in combination with other prompt-tuning-based methods, seems minimal.
Questions
The reviewer would like to see some additional evaluations of SCT in the traditional CIFAR setting of OOD detection. The reviewer would also like to point out some small inconsistencies in the bolding for Table 2 (IDLike+SCT).
Limitations
The author provides adequate discussions on any limitations and broader impacts. Additionally, the reviewer does not foresee any potential negative social impacts from the work.
Thank you for your time devoted to reviewing this paper and your constructive suggestions. Here are our detailed replies to your questions.
W1: A primary concern of the reviewer is the lack of evaluations against the more traditional CIFAR set of benchmarks for OOD detection.
Thanks for the valuable suggestions! We conduct experiments on the CIFAR benchmark in Table 4 in the attached PDF and the results show that our SCT can still outperform the baselines under the CIFAR benchmark.
W2: Additionally, the empirical performance gain of SCT (table 2) in combination with other prompt-tuning-based methods, seems minimal.
Thanks for the comments! As shown in Table 1 in the submission, the AUROC of prompt-tuning based methods is close to saturation, with all methods exceeding 90%. However, the room for improvement of FPR95 is still very large, so these two metrics should not be treated equally. The improvement of SCT on FPR95 (e.g., +5.95% for IDLike and +2.73% for LSN under the 16-shot setting) is significant. Nevertheless, we will continue to pursue improvements in AUROC in future work!
Q1: The reviewer would also like to point out some small inconsistencies in the bolding for Table 2 (IDLike+SCT).
Thank you for pointing out this issue! We are sorry for the carelessness and the original bolded data should be 92.44 instead of 91.44. We will make the modification in the revision.
In response to challenges in OOD detection using CLIP-based methods, this paper introduces Self-Calibrated Tuning (SCT), a novel framework that addresses issues with unreliable OOD features extracted from ID data. SCT dynamically adjusts the influence of OOD regularization during model training based on the prediction uncertainty of ID samples. By introducing modulating factors into the learning objective, SCT directs the model's attention more effectively towards classification tasks, especially when training with low-confidence data. This adaptive approach improves the calibration of OOD features extracted from high-confidence ID data, enhancing the overall OOD detection performance of prompt tuning methods. Empirical evaluations on ImageNet-1k demonstrate SCT's effectiveness.
Strengths
- This paper is well-motivated and well-written. In particular, the authors propose to adaptively adjust the importance of OOD features and introduce SCT, motivated by the following finding: the performance of prompt tuning based methods is significantly affected by the uncertainty of the given ID data.
- The authors provide a comprehensive review of the research literature.
- The authors conduct a large number of experiments, and the results demonstrate the effectiveness of SCT on both official benchmarks and hard OOD detection tasks.
- In summary, I think SCT could become a great contribution to the OOD detection community.
Weaknesses
None in particular
Questions
- My concern is mainly about the computational cost and training cost of SCT, since it involves operations on dense/local features.
- My second concern is about the rationality of using pre-trained models (CLIP, etc.) to perform OOD detection tasks, because the concepts in both ID and OOD datasets are probably seen during the pre-training stage. I want to know the authors' opinions towards the benchmarking and research paradigm.
Limitations
Please refer to weaknesses.
Thank you for your time devoted to reviewing this paper and your constructive suggestions. Here are our detailed replies to your questions.
Q1: My concern is mainly about the computational cost and training cost of SCT, since it involves operations on dense/local features.
Thanks for your valuable question!
First, SCT doesn't incur any extra computational cost compared to LoCoOp due to its simple design. Technically, SCT introduces modulating factors respectively on the two components of the original learning objective. The modulating factors in Equation (4) in the submission only involve the computation of the prediction probability of the ground-truth class, which can be reused after the original forward pass of the CLIP model.
Secondly, the operations on local features involved in LoCoOp and SCT are also relatively low-cost in terms of computation. The local features are generated from the forward pass of the vision encoder of CLIP, which doesn't bring additional computational cost compared to regular training. Regarding the extraction of OOD features, we compute the similarity between local features and the text features of all ID classes, and we identify regions that do not include their ground-truth class in the top-K predicted classes as ID-irrelevant regions. Empirically, we evaluate the time and memory consumption of SCT compared with other baselines in Table 1, and the results show that SCT is relatively compute-efficient. The evaluation is conducted on a single A100 GPU with a batch size of 32.
Table 1. Evaluation of computational cost of SCT and baselines.
| Method | Time per iteration (s) | GPU Memory (MiB) | FPR95 | AUROC | ID-ACC |
|---|---|---|---|---|---|
| CoOp | 0.70 | 21140 | 35.09 | 91.99 | 71.93 |
| LoCoOp | 0.96 | 23036 | 29.47 | 93.10 | 71.43 |
| SCT | 0.96 | 23036 | 26.47 | 93.37 | 71.77 |
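For completeness, a minimal sketch of the top-K rule described above for selecting ID-irrelevant regions (variable names and the default K are ours, not the official implementation):

```python
import torch

def extract_ood_regions(local_feats, text_feats, targets, top_k=200):
    """local_feats: (B, R, D) L2-normalized region features from CLIP's vision encoder
    text_feats:  (C, D) L2-normalized text features of the ID classes
    targets:     (B,) ground-truth class index of each image
    Returns a boolean mask (B, R): True where a region is treated as ID-irrelevant (OOD)."""
    sims = local_feats @ text_feats.t()                    # (B, R, C) region-to-class similarity
    topk_classes = sims.topk(top_k, dim=-1).indices        # (B, R, K)
    gt = targets[:, None, None].expand_as(topk_classes)    # (B, R, K)
    # A region is ID-irrelevant if its ground-truth class is not among its top-K predictions.
    return (topk_classes != gt).all(dim=-1)                # (B, R)
```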
Q2: My second concern is about the rationality of using pre-trained models (CLIP, etc.) to perform OOD detection tasks, because the concepts in both ID and OOD datasets are probably seen during the pre-training stage. I want to know the authors' opinions towards the benchmarking and research paradigm.
A2:
Thanks for your valuable comments! The definition of VLM-based OOD detection differs significantly from that of conventional OOD detection. VLM-based OOD detection aims to detect samples that do not belong to any ID class text designated by the downstream task [1]. Therefore, the current benchmarks, such as the large-scale ImageNet-1k benchmark, still apply to VLM-based OOD detection as long as they satisfy the rule that ID and OOD concepts don't overlap. For the better development of this field, future works should be focused on building benchmarks based on realistic datasets and scenarios.
References: [1] Miyai, Yang, et al. "Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey." 2024.
This paper focuses on open-set detection based on CLIP. The authors propose an additional weighting mechanism on top of the LoCoOp method to alleviate the problem that the outlier-related regions extracted by LoCoOp are not trustworthy in some cases.
Strengths
Outlier detection with VLM is an interesting research direction.
Weaknesses
The contribution over LoCoOp is incremental. The only difference is an extra reweighting term based on the current prediction score. And the reweighting mechanism is purely based on heuristics, for example, to implicitly enforce hard sample mining.
Minor: The intuition in Figure 1/4 is not clear to me. The shown examples validate that LoCoOp can detect and mask-out the inlier-related regions well. Also, the GT label should be annotated.
Questions
Please clarify the novelty and new insights.
Limitations
N/A
Thank you for your time devoted to reviewing this paper and your constructive suggestions. Here are our detailed replies to your questions.
W1, Q1: The contribution over LoCoOp is incremental. The only difference is an extra reweighting term based on the current prediction score. And the reweighting mechanism is purely based on heuristics, for example, to implicitly enforce hard sample mining. Please clarify the novelty and new insights.
Thanks for the valuable comments! We would like to re-clarify the novelty and insights of our SCT as follows.
Conceptually, the motivation of SCT is to mitigate the problem of unreliable OOD features in prompt-tuning based OOD detection methods. Generally, these methods rely on the ID-irrelevant local context extracted by VLMs as the surrogate OOD features to perform regularization, the quality of which is greatly affected by the inaccurate foreground-background decomposition of VLMs. As shown in Figure 1/4 in the submission, although VLMs can mask out some ID-related regions (shown as the grey patches of images), large portions of the extracted OOD features (shown as the colored patches of images) obviously belong to ID features.
Empirically, we find that the quality of extracted OOD features is significantly correlated with the uncertainty level of ID data. As illustrated in the left panel of Figure 2 in the submission, the extracted OOD features become more inaccurate as the uncertainty increases. In the right panel of Figure 2, we train LoCoOp on multiple data groups with different uncertainty levels, and the results demonstrate that the OOD detection performance of LoCoOp can be significantly impacted by the uncertainty level of ID data. Therefore, to mitigate the issue of unreliable OOD features, we propose SCT to calibrate the influence of OOD regularization from different ID samples based on their uncertainty level.
Technically, despite the simple design, SCT is significantly different from hard sample mining. The latter conducts reweighting directly on the samples based on their classification difficulty during training, while the former adaptively adjusts the importance between the two components of the original learning objective for every single sample. Data with high uncertainty are directly down-weighted in hard sample mining, while they are utilized more for OOD regularization in SCT. As shown in Table 4 in the submission (reproduced below), under 16-shot ID data, the OOD detection performance of the simple hard-sample-mining-style weighting variants (second and third rows) is significantly inferior to SCT (last row), demonstrating the difference between SCT and hard sample mining.
| Setting | FPR95 | AUROC | ID-ACC |
|---|---|---|---|
| LoCoOp (no modulation) | 29.47 | 93.10 | 71.43 |
| Hard-sample-mining-style weighting (variant 1) | 29.30 | 92.66 | 71.50 |
| Hard-sample-mining-style weighting (variant 2) | 28.94 | 92.62 | 71.90 |
| SCT | 26.47 | 93.37 | 71.77 |
W2: The intuition in Figure 1/4 is not clear to me. The shown examples validate that LoCoOp can detect and mask-out the inlier-related regions well. Also, the GT label should be annotated.
Thanks for the suggestions! As shown in Figure 1/4, although VLMs can mask out some ID-related regions (shown as the gray patches of images), large portions of the extracted OOD features (shown as the colored patches of images) obviously belong to ID features. We will make the captions and figures clearer as suggested in our revised version.
Thanks for your time and comments on our work. We have tried our best to address the concerns and provided detailed responses to all your comments and questions. Are there any unclear points that we should/could further clarify?
Many thanks for the response. I carefully checked the authors' response and revisited the relevant parts of the paper. I would first like to note that I'm not very familiar with the relevant area and the evaluation standards of related works; against this background, the proposed method is slightly insufficient for me in terms of technical novelty and empirical contribution. The authors may either:
- empirically conduct more experiments to show solid improvement. As I noticed in the paper and the updated table here, the difference is not significant. If current benchmarks tend to saturate, the authors may move to other more challenging datasets.
- carry out some theoretical analysis. The current method is mainly built on heuristics; for example, the general uncertainty-based idea can also result in variants other than the one proposed, so why and how would the proposed idea be optimal?
I sincerely hope this can help make the submission better.
Many thanks for your feedback and we will consider your suggestions in the revision.
General Response
We appreciate all the reviewers for their thoughtful comments and suggestions on our paper.
We are very glad to see that the reviewers find the problem we focus on important within OOD detection research (R1, R2, R3, R4), our method simple yet adaptable to various other techniques (R1, R2, R4), and the experiments comprehensive, demonstrating the general effectiveness of SCT (R2, R4). We are also pleased that the reviewers find our writing very clear and easy to understand (R2, R3, R4, R5).
We have tried our best to address the reviewers' comments and concerns in individual responses to each reviewer with comprehensive experimental justification. The reviews allowed us to improve our draft; the contents added in the revised version and the attached PDF are summarized below:
From Reviewer wD2g
- Clarify the novelty and insights of SCT (see Figures 1 and 2 in the original draft)
- Explain and compare the difference between SCT and hard sample mining (see Table 4 in the original draft)
From Reviewer 34fm
- Conduct evaluation on the computational cost of SCT
- Discuss the rationality of utilization of pre-trained models for OOD detection.
From Reviewer fpY7
- Conduct experiments for more accurate motivation verifications. (see Table 1 in PDF)
- Supplement illustrations of calibration gains of SCT. (see Figure 1 in PDF)
- Discuss and compare the difference between SCT and other weighted training methods. (see Table 2 in PDF)
- Conduct experiments on post-hoc CLIP calibration (see Table 3 in PDF)
- Explain the performance gain of SCT on AUROC.
- Conduct experiments on the CIFAR benchmarks. (see Table 4 in PDF)
- Conduct experiments on the compatibility of SCT and advanced zero-shot method. (see Table 5 in PDF)
- Provide more analysis on the ablation study results. (see Figure 4 in the original submission)
From Reviewer pUkR
- Conduct experiments on the CIFAR benchmarks. (see Table 4 in PDF)
- Explain the performance gain of SCT in combination with other baselines.
- Correct the inconsistencies of experiment data in the original submission.
From Reviewer pBeU
- Clarify the novelty and insights of SCT (see Figures 1 and 2 in the original draft)
- Explain the performance gain of SCT in combination with other baselines.
- Conduct experiments on the compatibility of SCT and advanced zero-shot method. (see Table 5 in PDF)
- Conduct more explorations on the modulating functions in Equation 4.
- Clarify some unclear statements in the original submission.
We appreciate your comments and time! We have tried our best to address your concerns and revised the paper following the suggestions. Would you mind checking it and confirming if you have further questions?
Remaining answers:
For reviewer fpY7:
W3: Lack of discussion of weighted training; modulating function rationale
Thank you for recommending this work! We will clarify the difference between SCT and ISH as follows.
Conceptually, the activation scale factor, denoted as in the official paper, is derived as the quotient of the sum of all activations and the sum of un-pruned activations, which has no direct mathematical correlation with true-class probability of VLMs. Specifically, VLMs like CLIP compute prediction probability based on the cosine similarity between image and text features. The computation of cosine similarity involves the normalization of features, which naturally eliminates the effect of activation scale. Furthermore, we compute the Pearson correlation coefficient of the activation scale factors and of ImageNet-1k validation set utilizing CLIP, and the result is -0.05 with the p-value equal to 0.0002. showing that these two variables have no significant linear correlation.
Technically, ISH performs reweighting directly on the samples based on their activation scale factor, whereas SCT adaptively adjusts the importance between the two components of the original learning objective for every single sample. Data with high uncertainty are not directly down-weighted but are utilized more for OOD regularization in SCT.
Empirically, we conduct an experiment in which the modulating function is made monotonically increasing with respect to the true-class probability (Table 2 in the attached PDF), and the results show that the monotonically increasing function performs much worse than the monotonically decreasing one.
W4: Post-hoc CLIP calibration
Thanks for the valuable suggestions! We conduct experiments on the effect of different values of K in the rank operation in Table 3 in the attached PDF. The results demonstrate that post-hoc CLIP calibration is less effective than training-based calibration. We will include more calibration-related works in our revision.
Summary: This paper focused on OOD detection using CLIP. Specifically, it aims to improve the prompting-based method with OOD regularization. It found that the potentially poor quality of surrogate OOD features may hinder the performance, and reveals the relationship between the quality of OOD features and the prediction uncertainty of ID data. Then, the paper introduces Self-Calibrated Tuning (SCT), using modulating factors to weigh the ID loss and OOD loss, with the weights being related to the ID data prediction confidence. The resulting SCT method shows strong empirical improvements.
Strength: Interesting and important research direction; well-motivated and well-written; the finding/analysis was interesting and well-reasoned; comprehensive review; extensive experiments; effective and easy-to-understand algorithm. One reviewer thinks SCT could become a great contribution to the OOD detection community.
Weakness: Relatively incremental and empirical; inaccurate motivation verifications; lack of details/explanations on algorithm design; lack discussion/comparison to post-hoc methods; the experiments are not sufficient (lack comparisons to SOTA); missing results on CIFAR; the improvement is not clear on AUROC.
After rebuttal: The authors provided the rebuttal, including further explanations and experiments attempting to address all the reviewers' concerns. Three reviewers (with lower ratings) provided feedback. They acknowledged that a portion of their concerns is addressed. However, they still have concerns about the performance gain on AUROC and the incremental technical contributions. One reviewer is still concerned about the weight design. After rebuttal, the paper received 3-4-4-5-7 (average 4.6), with one reviewer (rating 4) acknowledging nonfamiliarity with the topic.
Recommendation: Given the mixed ratings, the AC read the paper, the reviews, the rebuttal, and the further discussion. The AC agrees with the multiple strengths mentioned by the reviewers. Regarding the two major unresolved concerns, 1) the performance gain on AUROC and 2) the incremental technical contribution, the AC has the following opinions. First, the AC agrees with the authors' argument that FPR95 is the more challenging metric. Although the overall gains across metrics may not be outstanding, it is nice to see that such a simple SCT method leads to consistent and compatible gains. While the final SCT method is simple, seemingly a simple add-on to LoCoOp, its generality and compatibility could be impactful to the community. Such a simple change to the objective function made the AC think of the focal loss, which, while simple, has become a go-to approach to resolve imbalance. The AC also thinks the design of the weight is reasonable. The AC also appreciates the reviewers' feedback, especially regarding other branches of OOD detection (e.g., feature-based, post-hoc). In the AC's humble opinion, a paper does not need to beat all the existing methods and claim the state of the art if it offers sufficient novel insights. When an area is under development, the AC thinks allowing different branches to thrive should be encouraged. The AC sees that the paper makes some valuable fundamental observations/insights and proposes an easily extensible solution, which follow-up work can build upon. Regarding all these aspects, the AC recommends acceptance.
That said, the AC finds that the reviewers' comments and discussions are highly valuable and asks the authors to carefully incorporate them into the final version.