Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models
We propose Dynamic Outlier Regularization to improve CLIP’s calibration under various evaluations without compromising vanilla fine-tuning.
Abstract
Reviews and Discussion
This paper investigates the calibration performance of CLIP following fine-tuning. The author observes a trade-off in calibration between base and new classes and proposes a method called Dynamic Outlier Regularization (DOR). DOR samples categories unrelated to the base class from a large vocabulary to minimize the feature deviation of novel textual labels. Empirical results demonstrate that this approach outperforms the standard fine-tuning method across various settings.
Strengths
- The authors conducted robust experiments demonstrating that prompt fine-tuning can lead to a decline in the model's calibration performance.
- The paper introduces an efficient regularization method designed to enhance the calibration performance of both base and novel classes.
- The paper provides performance results across multiple calibration evaluation metrics and a range of experimental settings.
- The writing is clear and well-structured, making it easy to read.
Weaknesses
- Previous works [1,2] have examined the calibration performance of pre-trained CLIP after fine-tuning. However, your paper lacks experimental results comparing your method with these studies. We recommend that you include such comparisons in your work.
- Given that your method is based on experimental observations from CoOp and KgCoOp, we have concerns about its generalizability. For example, in Table 1, your method underperforms compared to Vanilla TCP in half of the settings.
- This article primarily selects outliers from WordNet. We are curious whether using different lexical databases significantly affects the results.
If you can include relevant experiments and address my questions, I will consider increasing the score.
[1] Enabling Calibration In The Zero-Shot Inference of Large Vision-Language Models.
[2] Towards Calibrated Robust Fine-Tuning of Vision-Language Models.
Questions
See above.
Thank you for the constructive and elaborate feedback. Please find our response below:
1. Comparison with calibration methods for CLIP [W1]
Following the reviewer's suggestion, we compare DOR with two recent calibration approaches for CLIP: post-hoc scaling (ZS-TS [1]) and regularization-based training (CaRot [2]). For ZS-TS, we perform post-hoc calibration on the fine-tuned model using ImageNet-1k and evaluate the learned temperature on both the base and new classes. For CaRot, we directly incorporate DOR into it. We report the average ECE on the base-to-new datasets.
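For reference, a minimal sketch of post-hoc temperature scaling and the ECE metric used in this comparison is given below; the optimizer, bin count, and variable names are our own illustrative choices rather than the exact ZS-TS recipe.

```python
# Minimal sketch: learn a single temperature on held-out logits, then evaluate ECE.
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, lr=0.01, steps=200):
    """Learn one temperature by minimizing NLL on held-out (e.g., ImageNet-1k) logits."""
    log_t = torch.zeros(1, requires_grad=True)   # temperature = exp(log_t) stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

def expected_calibration_error(logits, labels, n_bins=15):
    """Standard equal-width-binning ECE on top-1 confidence."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    correct = pred.eq(labels).float()
    ece = torch.zeros(1)
    bins = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.float().mean() * (correct[mask].mean() - conf[mask].mean()).abs()
    return ece.item()

# Usage: temp = fit_temperature(val_logits, val_labels)
#        ece = expected_calibration_error(test_logits / temp, test_labels)
```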
The results in the table show that the optimized temperature significantly improves the calibration of the fine-tuned CLIP on new classes, but such calibration in ZS-TS cannot be transferred to the base classes. In contrast, our approach effectively mitigates the miscalibration issue on new classes while maintaining the calibration performance on the base classes. We add the detailed results in Appendix I and Table 15.
| Method | Metric | Vanilla | +ZS-TS | +DOR |
|---|---|---|---|---|
| CoOp | Base | 3.07 | 8.25 | 2.67 |
| CoOp | New | 14.58 | 7.96 | 6.49 |
| CoOp | HM | 8.83 | 8.11 | 4.58 |
| MaPLe | Base | 2.75 | 6.68 | 2.83 |
| MaPLe | New | 5.46 | 4.05 | 4.44 |
| MaPLe | HM | 4.11 | 5.37 | 3.64 |
| PromptSRC | Base | 3.75 | 6.74 | 3.88 |
| PromptSRC | New | 4.15 | 3.48 | 3.80 |
| PromptSRC | HM | 3.95 | 5.11 | 3.84 |
For CaRot, we present the results in the table below. We find that our DOR method can improve the calibration of CaRot on both base and new classes. This demonstrates that DOR is a flexible regularization strategy compatible with various fine-tuning methods. We add the detailed results in Appendix I and Table 16.
| Class | CaRot | +DOR |
|---|---|---|
| Base | 7.28 | 6.73 |
| New | 4.92 | 4.77 |
| HM | 5.89 | 5.59 |
2. The generalizability of DOR [W2]
We apologize for any potential misunderstanding. We clarify that the columns in Table 1 denote different measures of calibration error, rather than different cases. Notably, our method still achieves comparable performance on the non-bold results, rather than underperforming the baselines. In fact, we only bold results that are significantly improved. In addition, we focus on the overall average performance ("HM") in the base-to-new evaluation, so counting the number of bold results across all columns is not meaningful. We add this description at Line 313.
Additionally, we add results from two more prompt-tuning methods, PromptSRC and CoPrompt, in Table 1 of the revised manuscript. The results show that our approach significantly improves calibration on new classes while preserving base-class calibration, which leads to better average performance. In summary, the results demonstrate our method's flexibility as a regularization strategy compatible with various fine-tuning techniques.
3. Ablation on lexical databases [W3]
Thanks for your great suggestion. To evaluate whether different lexical databases significantly affect the results, we performed an ablation study on the choice of lexical database. Specifically, we consider two additional textual databases: CLIP's vocabulary and ConceptNet 5.7. CLIP's vocabulary includes 49,407 characters and words. For ConceptNet, we use raw sentences and filter out those exceeding CLIP's input limit (77 tokens); in total, it consists of 705,662 short sentences.
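For reference, a minimal sketch of the token-length filtering step for ConceptNet sentences, assuming the OpenAI `clip` package; the helper name is ours.

```python
# Hypothetical filtering step: keep only sentences that fit CLIP's 77-token context
# (including the start/end tokens). Assumes the OpenAI `clip` package is installed.
from clip.simple_tokenizer import SimpleTokenizer

_tokenizer = SimpleTokenizer()

def fits_clip_context(sentence: str, context_length: int = 77) -> bool:
    # +2 accounts for the <|startoftext|> and <|endoftext|> tokens.
    return len(_tokenizer.encode(sentence)) + 2 <= context_length

# outlier_pool = [s for s in conceptnet_sentences if fits_clip_context(s)]
```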
As shown below, we find that all databases can achieve better results than the baseline. Among the three databases, using WordNet achieves the best calibration performance. Additionally, we observed that short sentences may not perform as well as prompts like “a photo of [class].” We add the detailed results in Appendix G.4 and Table 12.
| Class | w/o DOR | WordNet | CLIP | ConceptNet5 |
|---|---|---|---|---|
| Base | 3.07 | 2.67 | 2.91 | 3.02 |
| New | 14.58 | 6.49 | 8.43 | 8.37 |
| HM | 8.83 | 4.58 | 5.67 | 5.70 |
[1] LeVine W, et al. Enabling Calibration In The Zero-Shot Inference of Large Vision-Language Models. ICLR workshop, 2023
[2] Changdae O, et al. Towards Calibrated Robust Fine-Tuning of Vision-Language Models. NeurIPS, 2024
Thank you for your response. You have addressed my questions and resolved my doubts.
Thank you for taking the time to review the response and raising your score. We are glad that our response addressed your concerns, which also improves the quality of this work.
This paper proposes Dynamic Outlier Regularization (DOR) to improve confidence calibration in prompt-tuned vision-language models (VLMs), particularly in CLIP. The authors argue that current prompt-tuning methods lead to miscalibration by creating a trade-off between base and new classes, with CoOp overconfident on new classes and KgCoOp underconfident on base classes. DOR aims to resolve this trade-off by introducing textual outliers to regularize model predictions, promoting consistent calibration across both base and new classes.
Strengths
- The paper identifies a relevant issue in the domain of prompt tuning and confidence calibration in VLMs, an area of growing importance.
- The authors introduce a novel idea of using dynamically sampled textual outliers to address calibration inconsistencies, and the approach shows effectiveness across various datasets.
- DOR’s flexibility in working with multiple prompt-tuning methods is a potential advantage.
Weaknesses
- The paper does not include comparisons with several recent and closely related methods, such as CoPrompt and PromptSRC, which also address calibration issues and trade-offs between base and new classes. Without these comparisons, it is unclear whether DOR provides any substantial advantage over the state of the art, especially since these methods were specifically designed to tackle the same calibration challenges.
- The primary claim—that prompt-tuning methods like CoOp and KgCoOp introduce calibration trade-offs between base and new classes—has already been extensively studied in prior works. For instance, CoPrompt effectively handles these issues and includes mechanisms specifically designed to manage calibration across both class types. As such, the problem statement lacks novelty, and the paper provides insufficient rationale for why DOR would be preferable to these existing methods.
- While the paper offers some empirical evidence for DOR’s effectiveness, it lacks analysis that explains why the use of textual outliers should systematically address calibration trade-offs.
- The proposed solution, while conceptually interesting, lacks practical guidelines on how to effectively select and implement outliers in a real-world setting. Given that the efficacy of DOR relies on appropriate textual outlier selection, more detailed criteria or algorithms for selecting these outliers would be necessary for practitioners to adopt this method.
- Interestingly, some of the latest methods show less improvement with the proposed solution compared to some of the earlier methods like CoOp. This indicates that the latest methods are already capable of handling the problem and don't require the solution proposed in this paper. Again, the paper lacks a comparison to the latest methods, making it difficult to understand if it has any usage.
Questions
See the weakness section.
6. Detailed clarification about the contribution of DOR [W2, W3]
We clarify the contribution of DOR in detail here since the reviewer is curious about why outliers would be preferable to these existing regularizations and how DOR breaks the calibration trade-off.
To illustrate why outliers can address the calibration trade-off, we can understand it from the perspective of gradient conflicts [3]. We can decouple CLIP's fine-tuning objective as $\mathcal{L} = \mathcal{L}_{ce} + \lambda \mathcal{L}_{reg}$, where $\mathcal{L}_{ce}$ is the cross-entropy loss for classification and $\mathcal{L}_{reg}$ denotes the regularization term. Previous methods introduce regularization terms such as $\mathcal{L}_{kg}$ (KgCoOp), $\mathcal{L}_{consistency}$ (CoPrompt), and $\mathcal{L}_{SCL}$ (PromptSRC). Specifically, we calculate the cosine similarity between the gradients of the prompt under $\mathcal{L}_{ce}$ and $\mathcal{L}_{reg}$ to reflect the degree of gradient conflict, and we conduct an ablation of these recent regularization terms in Appendix B.2.
As shown in Figure 8 of the revised manuscript, the gradient conflict distributions for KgCoOp, CoPrompt, and PromptSRC are predominantly within the negative range $[-1, 0)$, which indicates a conflict with the fine-tuning objective $\mathcal{L}_{ce}$. Considering that CoOp with vanilla $\mathcal{L}_{ce}$ is an effective calibrator for base classes (see Figure 1 (a)), we can infer that these regularization terms may hinder the calibration performance for base classes. In contrast, we observe that the gradient conflicts for DOR are primarily concentrated within $[0, 1]$. Compared to previous regularization terms, DOR leverages outliers to construct the regularization term and shows significantly fewer conflicts in the negative range. This phenomenon supports our claim that outliers can be used for regularization without interfering with the original fine-tuning objective for classification.
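A minimal sketch of how such gradient-conflict scores can be computed is given below; the function and argument names are illustrative, not the exact implementation behind Appendix B.2.

```python
# Sketch: cosine similarity between the gradients of the classification loss and a
# regularization term w.r.t. the learnable prompt; negative values indicate conflict.
import torch
import torch.nn.functional as F

def gradient_conflict(loss_ce, loss_reg, prompt_params):
    """prompt_params: list of learnable prompt tensors; returns a scalar cosine similarity."""
    g_ce = torch.autograd.grad(loss_ce, prompt_params, retain_graph=True)
    g_reg = torch.autograd.grad(loss_reg, prompt_params, retain_graph=True)
    g_ce = torch.cat([g.flatten() for g in g_ce])
    g_reg = torch.cat([g.flatten() for g in g_reg])
    return F.cosine_similarity(g_ce, g_reg, dim=0).item()
```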
To further illustrate the actual impact of outliers, we conducted an ablation on the proportion of outliers in the regularization term. Considering that KgCoOp uses a fixed set of base-class texts in its regularization term, we progressively replaced these texts with textual outliers at proportions from 0.1 to 1.0. We report the average performance on base-to-new datasets.
As shown in the table below, as the proportion of outliers increases, the miscalibration of base classes begins to decrease while the performance of new classes remains largely unaffected. The results strongly support our claim that the outlier can effectively mitigate the miscalibration issue on new classes while maintaining the calibration performance on the base classes. We add the detailed discussion in Appendix B.2 and Figure 8-9.
| Proportion | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Base | 5.32 | 5.31 | 5.07 | 5.02 | 5.02 | 4.68 | 4.48 | 4.10 | 3.41 | 2.88 | 2.84 |
| New | 4.59 | 4.61 | 4.74 | 4.71 | 4.75 | 4.97 | 4.55 | 4.54 | 4.82 | 4.76 | 4.65 |
| HM | 4.96 | 4.96 | 4.91 | 4.87 | 4.89 | 4.83 | 4.52 | 4.32 | 4.12 | 3.82 | 3.75 |
[1] Khattak M U, et al. Self-regulating prompts: Foundational model adaptation without forgetting. ICCV, 2023
[2] Roy S, et al. Consistency-guided prompt learning for vision-language models. ICLR, 2024.
[3] Shi G, et al. Recon: Reducing conflicting gradients from the root for multi-task learning. ICLR, 2023.
I would like to thank the authors for their efforts in the rebuttal. The rebuttal addresses most of my concerns, and I have increased the score. However, one weakness remains, and it is now clearer that the proposed solution only slightly improves the performance of the latest SOTA solutions. Two of the latest methods, added during the rebuttal, show only marginal improvements. This suggests that the calibration problem is not such a prominent issue for these methods.
Thank you for the reply and the improved score. Regarding the remaining concern, we clarify that the most important contribution of this work is the new insight into how prompt-tuning methods lead to miscalibration. We then propose a simple and effective method to validate the generalizability of this insight. The empirical results show that our method can universally improve existing prompt-tuning methods with a simple regularization, which justifies the new insight.
In addition, we find that the calibration performances of those SOTA methods are suboptimal for several domain-specific tasks (e.g., DTD, EuroSAT, and PathMNIST). Here, our method consistently reduces the miscalibration, especially on new classes (See Table 8 of Appendix F). For example, our method significantly reduces the overall ECE of CoPrompt from 10.33% to 6.92% on the medical dataset -- PathMNIST (See Tables 13 and 14 of Appendix H). The empirical results show that the calibration problem is still a prominent issue even for those SOTA methods, which highlights the importance of our work (especially the insight).
Dear Reviewer pueg,
We truly appreciate your dedication to reviewing our work. As the rebuttal period is ending soon, we eagerly await your feedback and would like to know if our response has adequately addressed your concerns. Are there any points that remain unclear or require further explanation?
In this work, we provide a new insight to understand the confidence calibration issue of prompt tuning and propose a novel method to universally improve the calibration performance of current methods. While the latest SOTA prompt-tuning methods achieve excellent ECE on some standard datasets, we show that they remain suboptimal for many domain-specific scenarios, which reveals their potential weakness in real applications. Note that our method can consistently improve the calibration performance of SOTA methods across various datasets, which not only validates the new insight but also provides a robust solution in different scenarios.
Best regards,
Authors.
As I mentioned earlier, I have increased my score as most of my original concerns were addressed by the authors. However, the reason I won't increase my score further is that the proposed solution does not show similar effectiveness across methods. In particular, some of the latest methods don't show considerable improvements, which suggests such methods don't need the proposed solution to work well.
We appreciate the reviewer for the insightful and detailed comments. Please find our response below:
1. Comparison with related methods [W1]
Thank you for the suggestion. Before providing the new results, we clarify that PromptSRC and CoPrompt do not study the calibration issue and do not evaluate the performance of confidence calibration throughout their paper.
As suggested by reviewers, we present (in the table below) new results with recent prompt-tuning methods: PromptSRC and CoPrompt. The results in the table show that our DOR can consistently improve calibration on new classes. Compared with the vanilla baselines, DOR can result in better overall calibration performance, while maintaining the calibration performance on the base classes. Furthermore, as shown in Table 2 of the revised manuscript, our method can also boost their generalization on new classes. We present the detailed calibration results in Table 1 and Appendix F of the revised manuscript.
| Method | ECE (Base) | ECE (New) | ECE (HM) | Acc (Base) | Acc (New) | Acc (HM) |
|---|---|---|---|---|---|---|
| PromptSRC | 3.75 | 4.15 | 3.95 | 84.77 | 75.79 | 80.28 |
| +DOR (Ours) | 3.88 | 3.80 | 3.84 | 84.79 | 76.02 | 80.41 |
| CoPrompt | 2.56 | 5.96 | 4.26 | 82.32 | 73.29 | 77.80 |
| +DOR (Ours) | 2.96 | 4.69 | 3.83 | 82.39 | 74.50 | 78.45 |
2. The contribution of DOR [W2]
Again, we clarify that PromptSRC and CoPrompt do not study the calibration issue and do not evaluate confidence calibration anywhere in their papers. These two papers do not affect the novelty of this work. Indeed, the trade-off of accuracy between base and new classes is widely discussed in the prompt-tuning literature, but those discussions do not include confidence calibration. In addition, we clarify that we do not claim the problem statement as our primary contribution. In this work, we provide an insightful analysis of how current prompt-tuning methods affect confidence calibration, and we then propose the DOR regularization to effectively mitigate the miscalibration of fine-tuned CLIP.
3. The analysis for using outliers [W3]
Thanks for your insightful question. In the analysis of Section 3, we show that the textual divergence is the key to the calibration performance of fine-tuned CLIP. Therefore, we regularize the textual divergence of new classes without restricting those of base classes, to preserve the calibration of zero-shot CLIP. This is why we utilize outliers in the regularization. We leave a detailed analysis in answer 6 of the response below.
4. The selection strategy of the outlier set [W4]
Thanks for your careful review. We clarify that DOR is not very sensitive to the strategy of outlier selection. In the original paper, we provide a discussion about the effect of the selection strategy in Table 4 and Section 6 of our original manuscript. Specifically, we evaluated DOR with three selection strategies: near-OOD (ours), far-OOD, and random-OOD. We report the average performance on base-to-new datasets using CoOp.
As shown below, the results indicate that the near-OOD strategy achieves the best performance in the base-to-new evaluation. Additionally, compared to the baseline (w/o regularization), models with regularization consistently reduce the ECE. Notably, the random selection serves as a strong baseline, highlighting the robustness of our proposed regularization term. To clarify this further, we add a detailed discussion in the first paragraph of Section 6 alongside Table 4.
| Policy | Base | New | HM |
|---|---|---|---|
| w/o DOR | 3.07 | 14.58 | 8.83 |
| Near | 2.68 | 7.09 | 4.89 |
| Far | 2.95 | 7.72 | 5.34 |
| Random | 2.80 | 7.33 | 5.07 |
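For concreteness, a minimal sketch of the near-OOD selection policy is shown below, assuming the candidate words (with base-class names already removed) and the base classes have been encoded into CLIP text features; the function name and default K are illustrative.

```python
# Sketch: rank candidate words by their maximum cosine similarity to the base-class
# text features and keep the top-K most similar ones (near-OOD).
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_near_ood(candidate_feats, base_feats, k=5000):
    """candidate_feats: [N, d] text features of candidate words; base_feats: [C, d]."""
    cand = F.normalize(candidate_feats, dim=-1)
    base = F.normalize(base_feats, dim=-1)
    sim = cand @ base.t()                  # [N, C] cosine similarities
    closeness = sim.max(dim=-1).values     # similarity to the nearest base class
    return closeness.topk(k).indices       # indices of the K nearest (near-OOD) candidates
```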
5. The broader influence of DOR [W5]
Yes, the latest prompt-tuning methods perform better than the earlier ones, but our method still improves their calibration performance. We highlight the flexibility of DOR: it integrates seamlessly with existing prompt-tuning methods without requiring significant modifications, making it easy to adopt within existing pipelines.
Importantly, we show that the latest methods still achieve suboptimal performance of confidence calibration in some domain-specific tasks. As shown below, our experiments on the medical dataset PathMNIST demonstrate that DOR can significantly improve the calibration of the latest method - CoPrompt. For example, DOR reduces the miscalibration of CoPrompt by 5.99% on base classes and 0.82% on new classes respectively, which highlights its great potential for real-world applications. We illustrate the detailed experiment in Appendix H and Table 13-14.
| Method | Base | New | HM |
|---|---|---|---|
| ZSCLIP | 29.80 | 15.27 | 22.54 |
| CoPrompt | 12.26 | 8.39 | 10.33 |
| +DOR | 6.27 | 7.57 | 6.92 |
Thank you for raising the concern of statistical significance. Due to the space limit, we use bold numbers to indicate the statistical significance of superior results. In particular, we adopt a t-test over 3 runs to determine statistical significance (we bold results with a p-value below 0.05). As shown in the tables, our method indeed does not make significant improvements in some cases. We justify that the effectiveness of our method is validated by universally improving the performance of current prompt-tuning methods in most cases. As you suggested, we will add the standard deviations in the appendix of the final version.
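For reference, a minimal sketch of such a significance test is given below; whether an unpaired or paired t-test is used, and the helper name, are our own assumptions.

```python
# Sketch: t-test over repeated runs, marking results with p < 0.05 as significant.
from scipy import stats

def is_significant(ours_runs, baseline_runs, alpha=0.05):
    """ours_runs, baseline_runs: lists of ECE values from repeated runs (e.g., 3 seeds)."""
    _, p_value = stats.ttest_ind(ours_runs, baseline_runs, equal_var=False)
    return p_value < alpha

# Example: is_significant([6.4, 6.5, 6.6], [14.5, 14.6, 14.7])  # -> True
```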
Additionally, we'd like to clarify that improving the accuracy is not the goal of this work. In Tables 2 and 3, we present the classification accuracy to show that our method does not decrease the generalization performance, with comparable (or even improved) results. This may be the point on which the reviewer has a misunderstanding.
Thank you for the prompt reply. We respectfully disagree that SOTA methods don't need the proposed solution to work well. In the discussion, we added empirical results on several domain-specific datasets like PathMNIST, as reviewers suggested. For your convenience, we attach the results in the table below. The results show that our method can significantly improve current SOTA methods to achieve better confidence calibration (e.g., from 10.33 to 6.92). This indicates the potential benefits of DOR for SOTA prompt-tuning methods.
Besides, our work provides a deeper understanding of the calibration issue of prompt-tuning methods. While some SOTA methods already achieve good calibration, it was unclear to the community why they benefit calibration and whether they perform consistently in unseen scenarios. This highlights the significance of this work for the community.
| Class | ZSCLIP | CoOp | +DOR | KgCoOp | +DOR | MaPLe | +DOR | DEPT | +DOR | PromptSRC | +DOR | CoPrompt | +DOR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base | 29.80 | 1.56 | 1.25 | 13.45 | 12.15 | 14.12 | 15.47 | 6.57 | 7.63 | 12.43 | 11.52 | 12.26 | 6.27 |
| New | 15.27 | 61.28 | 14.99 | 12.45 | 7.48 | 13.47 | 6.54 | 62.18 | 13.91 | 11.08 | 10.34 | 8.39 | 7.57 |
| HM | 22.54 | 31.42 | 8.12 | 12.95 | 9.82 | 13.80 | 11.01 | 34.38 | 10.77 | 11.76 | 10.93 | 10.33 | 6.92 |
The results on one benchmark of the medical domain do not justify such a big claim. Let's look at Table 2: MaPLe, KgCoOp, and CoPrompt are the previous best methods, and the improvement on all datasets is less than 1%.
I just realized there is no indication of a standard deviation. What is the significance of this 1% improvement? This can easily be the variance due to the initialization.
Similarly, Table 3 shows just a 0.14% improvement over MaPLe. Again, there is no indication of deviation. Unfortunately, I am less convinced with the results now.
This paper introduces Dynamic Outlier Regularization (DOR), a method to improve confidence calibration in fine-tuned VLMs by controlling textual feature divergence (also can be extended to visual tuning) through the use of selected outliers, thereby enhancing model reliability on both base and novel classes.
Strengths
- DOR integrates smoothly with popular prompt-tuning methods (e.g., CoOp, KgCoOp) without requiring major architectural changes, making it easy to adopt in existing pipelines.
- The authors provided insightful analysis of how current prompt-tuning methods impact confidence calibration, with clear explanations for why overconfidence or underconfidence arises in certain settings. The motivation is clear.
- The manuscript is well written and logically structured.
Weaknesses
- The ablation study is insufficient, e.g. how sensitive DOR is to the choice of outliers and whether different selection strategies could yield better results?
- The experiments are largely limited to standard benchmarks without applying the method to domain-specific tasks (e.g., medical imaging or autonomous systems as the authors mentioned in the Introduction), where calibration is especially critical.
Questions
- When you calculate the semantic similarity between textual labels in WordNet and the base classes, do you use cosine similarity? Please clarify. Would using a different metric (e.g., Euclidean distance) impact results?
- The outliers are selected based on the top-K, but there’s no mention of a specific similarity threshold. Would setting a threshold affect performance?
- How frequently are the outliers updated during training? Does the frequency affect DOR’s calibration performance?
- On page 8, lines 407-408, is it a typo (should be "CoCoOp" rather than "CoOp") or a wrong statement (the number is wrong if compared to the zero-shot CLIP)? Please correct it.
- What criteria were used to select visual outliers from ImageNet-1K? How to ensure these outliers are sufficiently distinct from base classes without introducing irrelevant noise?
- How does DOR influence the feature space of base classes when incorporating visual outliers?
We appreciate the reviewer for the insightful and detailed comments. Please find our response below:
1. The strategy of outlier selection [W1]
Thank you for your question. We have discussed this point in Section 6 (Table 4) of our original manuscript. Given a fixed lexical database (e.g., WordNet), we mainly considered three selection strategies: near-OOD (ours), far-OOD, and random-OOD.
The results are presented in the table below. We find that the near-OOD strategy produces the best performance under the base-to-new evaluation. Moreover, compared to the baseline (w/o regularization), the fine-tuned model with DOR consistently reduces the ECE. Interestingly, random selection serves as a strong baseline, demonstrating the robustness of our proposed regularization term. For a clearer illustration, we add a detailed discussion in the first paragraph of Section 6 and in Table 4.
| Policy | Base | New | HM |
|---|---|---|---|
| w/o DOR | 3.07 | 14.58 | 8.83 |
| Near | 2.68 | 7.09 | 4.89 |
| Far | 2.95 | 7.72 | 5.34 |
| Random | 2.80 | 7.33 | 5.07 |
2. Applying DOR to domain-specific tasks [W2]
Thank you for the great suggestion. Here, we conduct our experiments on PathMNIST [1] as the medical benchmark. PathMNIST is a multi-class classification task and consists of 9 types of tissues. We fine-tune the CLIP with 16 shots from the first 5 classes and evaluate the model on all 9 classes under the base-to-new evaluation protocol.
As shown in the table below, we find that DOR can effectively help with the calibration of fine-tuned CLIP on medical datasets. For example, DOR significantly reduces the ECE on new classes from 61.28% to 14.99% for CoOp. Moreover, DOR can be incorporated with the latest methods like CoPrompt and significantly reduces the overall ECE from 10.33% to 6.92%. In summary, DOR notably improves the calibration performance of prompt-tuning methods and is applicable to real-world domain-specific tasks. We add the detailed results in Appendix H and Tables 13-14.
| Class | ZSCLIP | CoOp | +DOR | KgCoOp | +DOR | MaPLe | +DOR | DEPT | +DOR | PromptSRC | +DOR | CoPrompt | +DOR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base | 29.80 | 1.56 | 1.25 | 13.45 | 12.15 | 14.12 | 15.47 | 6.57 | 7.63 | 12.43 | 11.52 | 12.26 | 6.27 |
| New | 15.27 | 61.28 | 14.99 | 12.45 | 7.48 | 13.47 | 6.54 | 62.18 | 13.91 | 11.08 | 10.34 | 8.39 | 7.57 |
| HM | 22.54 | 31.42 | 8.12 | 12.95 | 9.82 | 13.80 | 11.01 | 34.38 | 10.77 | 11.76 | 10.93 | 10.33 | 6.92 |
3. Ablation on similarity metric [Q1]
Thank you for the suggestion. As you suggested, we provide this ablation in Appendix G.1 and Figure 10. We conduct the experiments with three metrics: Cosine similarity, Euclidean distance (L2), and Mahalanobis distance. We report the average performance on base-to-new datasets using CoOp.
As shown in the table below, we observe that DOR consistently reduces the calibration error with all three metrics. Surprisingly, L2 may outperform cosine similarity on new classes. Following the common practice in vision-language models [2, 3], we adopt cosine similarity by default to measure the semantic similarity between textual features throughout this work. We add the detailed results in Appendix G.1 and Table 10.
| Class | w/o DOR | Cosine | Euclidean | Mahalanobis |
|---|---|---|---|---|
| Base | 3.07 | 2.67 | 2.92 | 2.83 |
| New | 14.58 | 6.49 | 4.74 | 6.73 |
| HM | 8.83 | 4.58 | 3.83 | 4.78 |
4. Ablation on the number of outliers [Q2]
In this work, we sample the outliers with the K highest scores, without defining a specific threshold. In particular, we set K=5000 throughout the paper. We believe your concern might relate to the number of outliers. As setting a score threshold is equivalent to changing the value of K, we add a sensitivity analysis of the number of outliers in the revised version.
As shown in Figure 10, we find that DOR is insensitive to the number of selected outliers when the number is relatively large (K>500). Therefore, we suggest a moderate number (e.g., 5000) to yield better calibration performance. We add the detailed results in Appendix G.2 and Figure 10.
| Method | Class | 0 | 10 | 50 | 100 | 500 | 1000 | 5000 | 10000 | 20000 |
|---|---|---|---|---|---|---|---|---|---|---|
| CoOp | Base | 2.82 | 2.97 | 2.83 | 3.01 | 2.80 | 2.69 | 2.67 | 2.73 | 2.72 |
| CoOp | New | 13.88 | 10.45 | 9.49 | 8.85 | 7.07 | 7.02 | 6.49 | 6.53 | 6.72 |
| MaPLe | Base | 2.77 | 2.90 | 3.06 | 3.24 | 3.01 | 3.03 | 2.93 | 2.93 | 2.88 |
| MaPLe | New | 5.37 | 4.89 | 4.93 | 4.67 | 4.52 | 4.57 | 4.59 | 5.21 | 5.15 |
| CoPrompt | Base | 2.56 | 2.80 | 2.77 | 2.98 | 2.87 | 2.86 | 2.96 | 2.88 | 2.85 |
| CoPrompt | New | 5.96 | 4.76 | 4.71 | 4.51 | 4.62 | 5.00 | 4.69 | 4.57 | 4.77 |
5. Frequency of outlier update [Q3]
We demonstrate that the dynamic update strategy in DOR is essential for calibrating fine-tuned CLIP. In our method, we randomly sample a batch of textual outliers from the selected outlier set at each iteration. Therefore, the textual outliers used in each iteration can differ, which establishes a dynamic regularization.
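A schematic training step illustrating this dynamic regularization is sketched below; the exact form of the deviation penalty, the loss weight, and the `model.*` helpers are illustrative assumptions, not the paper's exact implementation.

```python
# Schematic training step with dynamic outlier regularization.
import random
import torch
import torch.nn.functional as F

def dor_step(images, labels, base_classes, outlier_pool, model, lam=1.0, m=32):
    # Standard prompt-tuning objective on the base classes.
    logits = model.classify(images, base_classes)                 # hypothetical helper
    loss_ce = F.cross_entropy(logits, labels)

    # Dynamic regularization: resample a fresh batch of textual outliers each iteration.
    outliers = random.sample(outlier_pool, m)
    feats_tuned = F.normalize(model.encode_text_tuned(outliers), dim=-1)    # hypothetical helper
    with torch.no_grad():
        feats_zs = F.normalize(model.encode_text_frozen(outliers), dim=-1)  # hypothetical helper
    # Keep outlier text features close to their zero-shot CLIP counterparts (assumed penalty).
    loss_reg = (feats_tuned - feats_zs).abs().mean()

    return loss_ce + lam * loss_reg
```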
As for the effect of update frequency, we conduct an ablation on it, using larger update intervals to represent lower update frequencies. We report the average ECE on the base-to-new datasets with CoOp. As shown in the table below, low update frequencies (i.e., large update intervals) negatively impact the calibration performance of DOR. These results highlight the benefits of the dynamic update strategy in DOR, since it helps mitigate overfitting to noise [4] and achieves superior calibration performance.
| Interval | Base | New | HM |
|---|---|---|---|
| 1 | 2.67 | 6.49 | 4.58 |
| 10 | 2.76 | 6.72 | 4.74 |
| 100 | 2.71 | 7.00 | 4.86 |
| 1000 | 2.65 | 7.43 | 5.04 |
6. Some typos [Q4]
Thank you for pointing it out. We have fixed it in the revised version (Line 406).
7. Clarification on visual outlier selection [Q5]
Thank you for pointing out the ambiguous description. OOD (Out-of-Distribution) data primarily refer to Semantic Shifts, which involve changes in both the label space and the input space [5]. Hence, we suggest using images that do not share similar semantic concept relations with the images seen in the fine-tuning. Specifically, we use class names to filter out images from semantically similar classes. We construct the outlier set by retaining images from 50% of the classes of ImageNet-1k, which ensures these classes differ as much as possible from those used during fine-tuning. To make it clearer, we add a detailed description in Section 6.
8. Feature space when incorporating visual outliers [Q6]
Thanks for your great suggestion. We clarify that DOR-V maintains the performance on base classes and does not significantly affect the visual features. To quantify the difference between visual distributions, we measure the distance between distributions via Maximum Mean Discrepancy (MMD) and the Wasserstein distance. We compare the visual distributions of Zero-Shot CLIP (Z) and CLIP-Adapter (C); DOR-V (D) denotes our method applied to CLIP-Adapter. We report results on the base classes of the DTD dataset.
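For reference, a minimal sketch of how these two distances can be estimated between two feature sets is given below; a Gaussian-kernel MMD and an averaged per-dimension 1-D Wasserstein distance are used as simple estimators, which may differ from the exact estimators in Appendix J.

```python
# Sketch: distribution distances between two feature sets x, y of shape [N, d]
# (suitable for a few hundred samples; the full kernel matrix is O(N^2 * d) in memory).
import numpy as np
from scipy.stats import wasserstein_distance

def mmd_rbf(x, y, gamma=1.0):
    """Biased MMD^2 estimate with an RBF kernel."""
    xx = np.exp(-gamma * np.square(x[:, None] - x[None]).sum(-1)).mean()
    yy = np.exp(-gamma * np.square(y[:, None] - y[None]).sum(-1)).mean()
    xy = np.exp(-gamma * np.square(x[:, None] - y[None]).sum(-1)).mean()
    return xx + yy - 2 * xy

def avg_1d_wasserstein(x, y):
    """Average 1-D Wasserstein distance over feature dimensions (a simple proxy)."""
    return np.mean([wasserstein_distance(x[:, j], y[:, j]) for j in range(x.shape[1])])
```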
As shown in the table below, the gap between CLIP-Adapter and DOR-V is relatively small compared with their gaps to ZS-CLIP. The results show that DOR does not significantly affect the feature space of the original fine-tuning method. For a clearer view, we also provide a t-SNE visualization in Figure 11 of the revised manuscript. In summary, DOR-V does not significantly alter the feature space and achieves better calibration, as evidenced in Table 5 of the manuscript. We add the detailed results and visualization in Appendix J (Figure 11 & Table 17).
| Metric | Z ⟷ C | Z ⟷ D | C ⟷ D |
|---|---|---|---|
| MMD | 0.39 | 0.84 | 0.22 |
| Wasserstein | 1.48 | 1.01 | 0.47 |
[1] Yang J, et al. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 2023.
[2] Prasanna, et al. Does CLIP’s generalization performance mainly stem from high train-test similarity?. ICLR, 2024
[3] Yi C, et al. Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification. CVPR, 2024
[4] Wei H, et al. Open-set label noise can improve robustness against inherent label noise. NeurIPS, 2021
[5] Yang J, et al. Generalized out-of-distribution detection: A survey. IJCV, 2024
Dear Reviewer 5mjQ,
We truly appreciate your dedication to reviewing our work. As the rebuttal period is ending soon, we eagerly await your feedback and would like to know if our response has adequately addressed your concerns. Are there any points that remain unclear or require further explanation?
Best regards, Authors.
I would like to thank the authors for their detailed responses. Most of my concerns have been addressed and I have raised the score.
Great thanks for your recognition and for raising the score. We are glad that our discussion finally addressed your concerns, which also improved the quality of this work.
This paper, through CoOp and KgCoOp, observes that when a model undergoes adaptation via prompt tuning, CoOp tends to be overconfident in novel classes, while KgCoOp, on the contrary, becomes underconfident. To address this tendency towards overconfidence/underconfidence, this paper proposes a Dynamic Outlier Regularizer (DOR) term, which demonstrates performance improvements when the proposed regularizer is added to the algorithms for prompt learning of VLMs.
Strengths
- A new blind spot was identified (CoOp and KgCoOp become over-/under-confident).
- Experimental results show performance improvement when the proposed regularizer is added.
Weaknesses
- The motivation was inferred from CoOp and KgCoOp, but in fact, these two algorithms lack adequate consideration for novel classes. For an effective analysis of this phenomenon, algorithms that directly account for novel classes should be utilized, such as CoCoOp, MaPLe, PromptSRC, DEPT, and TCP. If similar tendencies are observed in these algorithms, it would strongly support the authors' claim.
- It appears that further explanation is needed regarding the concept of "textual divergence." From my understanding, this divergence is due to the diversity of textual representation arising from CoOp's prompt learning method. However, finding concrete evidence to confirm this explanation is challenging. Additionally, a more detailed explanation of the notation, particularly the keywords emphasized in the paper, would enhance the reader's understanding of the study.
- More comparisons with other algorithms are necessary. Given the large number of prompt learning algorithms, further experimental comparisons are needed to confirm whether the effect of this regularizer is generalizable. (Similar to the first weakness statement.)
Questions
Please refer to the weakness part.
Thanks for your valuable suggestions. Please find our response below:
1. Results with more methods for motivation [W1]
Thank you for the suggestion. As suggested by the reviewers, we extend the motivating analysis to more prompt-tuning methods, including CoCoOp, MaPLe, DEPT, PromptSRC, and CoPrompt. The results in the table below lead to the same conclusion as in Section 3.2: existing prompt-tuning methods significantly increase the FD score of textual features on both base and new classes, as well as the confidence. This leads to the overconfidence issue on new classes, as the accuracy is almost unchanged for those classes. We add a detailed visualization in Appendix B.1.
In addition, we clarify that KgCoOp leads to underconfidence on base classes, instead of novel classes (Line 227). Notably, we politely disagree with the reviewer's view that KgCoOp lacks adequate consideration for novel classes. As claimed in the Abstract of KgCoOp's paper, KgCoOp is designed to mitigate the generalization issue on unseen classes, which is also emphasized in Section 3. They also demonstrate the effectiveness of KgCoOp on unseen classes with empirical results in Table 2. This is why we use KgCoOp as the representative method to show the calibration issue of recent prompt learning methods.
| FD score | ZSCLIP | CoOp | CoCoOp | MaPLe | DEPT | KgCoOp | PromptSRC | CoPrompt |
|---|---|---|---|---|---|---|---|---|
| Base | 0.172 | 0.435 | 0.237 | 0.236 | 0.669 | 0.186 | 0.205 | 0.241 |
| New | 0.183 | 0.371 | 0.223 | 0.228 | 0.514 | 0.192 | 0.208 | 0.251 |
| Confidence | ZSCLIP | CoOp | CoCoOp | MaPLe | DEPT | KgCoOp | PromptSRC | CoPrompt |
|---|---|---|---|---|---|---|---|---|
| Base | 0.721 | 0.882 | 0.815 | 0.836 | 0.804 | 0.801 | 0.857 | 0.843 |
| New | 0.733 | 0.754 | 0.778 | 0.802 | 0.804 | 0.762 | 0.744 | 0.807 |
2. Explanation regarding "textual divergence" [W2]
Yes, this proposed feature divergence is closely related to the diversity of textual representation. As discussed in Section 3.2 (Line 206), we clarify that textual diversity is introduced by the cross-entropy loss, which is widely employed in not only CoOp but also many existing prompt tuning methods.
To further verify that larger textual divergence is correlated with higher output confidence, we employed Spearman correlation coefficients to compute the relationships between class-wise confidence and FD score on the DTD dataset. The average confidence of each class is derived from samples predicted to the corresponding label. The results of Spearman correlation reveal that textual divergence is positively correlated to output confidence, which supports our observation in Figure 6 of the manuscript.
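For reference, a minimal sketch of this correlation analysis is given below; the helper names are ours.

```python
# Sketch: Spearman's rho between class-wise FD scores and class-wise average confidence
# (confidence averaged over samples predicted as each class).
import numpy as np
from scipy.stats import spearmanr

def classwise_confidence(top1_conf, preds, num_classes):
    """top1_conf: [N] top-1 confidences; preds: [N] predicted labels."""
    return np.array([top1_conf[preds == c].mean() if (preds == c).any() else np.nan
                     for c in range(num_classes)])

def confidence_fd_correlation(fd_scores, conf_per_class):
    mask = ~np.isnan(conf_per_class)          # skip classes with no predicted samples
    rho, _ = spearmanr(fd_scores[mask], conf_per_class[mask])
    return rho
```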
| Class | ZSCLIP | CoOp | CoCoOp | MaPLe | DEPT | KgCoOp | PromptSRC | CoPrompt |
|---|---|---|---|---|---|---|---|---|
| Base | 0.70 | 0.31 | 0.55 | 0.52 | 0.28 | 0.64 | 0.58 | 0.51 |
| New | 0.82 | 0.59 | 0.65 | 0.53 | 0.52 | 0.77 | 0.67 | 0.80 |
3. Comparisons with latest prompt tuning algorithms [W3]
Thanks for the great suggestion. As suggested by reviewers, we provide new results with recent methods including PromptSRC [1] and CoPrompt [2]. The average performance on base-to-new datasets is presented in the table below. DOR improves the overall calibration and accuracy of models fine-tuned by PromptSRC or CoPrompt, which is consistent with our former results. Notably, DOR consistently reduces miscalibration on new classes and maintains a comparable performance on base classes. In summary, the results support our claim that our proposed DOR can consistently boost performance on new classes upon existing prompt tuning methods, without compromising the vanilla fine-tuning objective. We add the empirical results into Tables 1 and 2. The detailed results for each dataset can be found in Appendix F of the manuscript.
| Method | ECE (Base) | ECE (New) | ECE (HM) | Acc (Base) | Acc (New) | Acc (HM) |
|---|---|---|---|---|---|---|
| PromptSRC | 3.75 | 4.15 | 3.95 | 84.77 | 75.79 | 80.28 |
| +DOR (Ours) | 3.88 | 3.80 | 3.84 | 84.79 | 76.02 | 80.41 |
| CoPrompt | 2.56 | 5.96 | 4.26 | 82.32 | 73.29 | 77.80 |
| +DOR (Ours) | 2.96 | 4.69 | 3.83 | 82.39 | 74.50 | 78.45 |
[1] Khattak M U, et al. Self-regulating prompts: Foundational model adaptation without forgetting. ICCV, 2023
[2] Roy S, et al. Consistency-guided prompt learning for vision-language models. ICLR, 2024.
Dear Reviewer 2Y4k,
We truly appreciate your dedication to reviewing our work. As the rebuttal period is ending soon, we eagerly await your feedback and would like to know if our response has adequately addressed your concerns. Are there any points that remain unclear or require further explanation?
Best regards, Authors.
Thank you so much for diligently responding to my comments. I have read your response, and I will increase my rating to 6. Thank you.
Thank you for reviewing our response and increasing the score. We are delighted that our response addressed your concerns. Your feedback is highly valuable in improving the quality of this work.
We thank all the reviewers for their valuable comments and insightful suggestions. We are glad that reviewers find this work focuses on a relevant and important issue (pueg), which is a new blind spot (2Y4k) in vision-language models. We are also encouraged that reviewers appreciate that our motivation is clear (wLz3) with insightful analysis (5mjQ), and that the proposed DOR is flexible (pueg) with existing prompt-tuning methods. Besides, reviewers recognize that the writing is well-structured (5mjQ, wLz3) and logically organized (5mjQ).
In the following responses, we provide point-by-point responses to all reviewers’ comments and concerns. We list the main concerns of the reviewers and the changes below:
- Added experiments for DOR on more latest prompt tuning methods in Table 1-2 and Appendix F. [2Y4k, pueg]
- Added experiments for textual divergence on more methods in Appendix B.1. [2Y4k]
- Added explanation for DOR's contribution and motivation in Appendix B.2. [pueg]
- Added ablations on DOR, including the selection strategy [pueg, 5mjQ], similarity metric [5mjQ], number of outliers [5mjQ], update frequency [5mjQ], and outlier databases [wLz3], in Table 4 and Appendix G.
- Added experiments on domain-specific tasks in Appendix H. [5mjQ]
- Added experiments on existing calibration for CLIP in Appendix I. [wLz3]
- Added explanation on DOR-V in Line 510-513 and Appendix J. [5mjQ]
- Clarified the calibration evaluation in experiments in Line 313. [wLz3]
- Fixed typo in Line 406. [5mjQ]
We sincerely hope the following responses can address the concerns. We look forward to your response and are willing to provide clarifications to any concerns you may have.
For clarity, we highlight the revised part of the manuscript in blue color.
Dear Reviewers,
Thank you for your efforts in reviewing this paper. We highly encourage you to participate in interactive discussions with the authors before November 26, fostering a more dynamic exchange of ideas rather than a one-sided rebuttal.
Please feel free to share your thoughts and engage with the authors at your earliest convenience.
Thank you for your collaboration.
Best regards, ICLR 2025 Area Chair
This submission introduces Dynamic Outlier Regularization (DOR) to enhance confidence calibration in prompt-tuned vision-language models (VLMs), with a specific focus on CLIP. DOR primarily leverages textual outliers from WordNet to regularize model predictions, aiming to ensure consistent calibration across both base and novel classes. However, after reviewing the feedback and author rebuttals, the area chair is inclined to reject the paper, as the experimental improvements are marginal and the integration of WordNet-derived outliers may not be practical for real-world applications.
Additional Comments on Reviewer Discussion
Overall, the authors address most concerns during the rebuttal phase. However, as noted earlier, the improvements over baseline fine-tuning methods are marginal. The area chair also concurs with #pueg regarding the challenges of effectively selecting and implementing outliers in real-world applications.
Reject