PaperHub
Score: 6.8/10 · Spotlight · 4 reviewers
Ratings: 7, 7, 7, 6 (lowest 6, highest 7, std 0.4)
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.3
NeurIPS 2024

Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting

OpenReview · PDF
Submitted: 2024-05-07 · Updated: 2024-11-06

Abstract

Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions. While optimizing prompts on downstream labeled data has proven effective in improving performance, these methods entail labor costs for annotations and are limited by their quality. Additionally, since CLIP is pre-trained on highly imbalanced Web-scale data, it suffers from inherent label bias that leads to suboptimal performance. To tackle the above challenges, we propose a label-**F**ree p**ro**mpt distribution **l**earning and b**i**as **c**orrection framework, dubbed as **Frolic**, which boosts zero-shot performance without the need for labeled data. Specifically, our Frolic learns distributions over prompt prototypes to capture diverse visual representations and adaptively fuses these with the original CLIP through confidence matching. This fused model is further enhanced by correcting label bias via a label-free logit adjustment. Notably, our method is not only training-free but also circumvents the necessity for hyper-parameter tuning. Extensive experimental results across 16 datasets demonstrate the efficacy of our approach, particularly outperforming the state-of-the-art by an average of 2.6% on 10 datasets with CLIP ViT-B/16 and achieving an average margin of 1.5% on ImageNet and its five distribution shifts with CLIP ViT-B/16. Code is available at [https://github.com/zhuhsingyuu/Frolic](https://github.com/zhuhsingyuu/Frolic).
Keywords
vision-language model · zero-shot classification · logit adjustment

Reviews and Discussion

Review
Rating: 7

This paper proposes a method named Frolic to improve the zero-shot performance of vision-language models like CLIP. The method focuses on two key challenges: enhancing prompt distribution learning and correcting inherent label bias in pre-trained models without relying on labeled data. Experimental results across 16 datasets demonstrate performance improvements of the method.

Strengths

  • The paper is well-written and the method is easy to follow.
  • The method effectively addresses label bias inherent in pre-trained models, improving the robustness and accuracy of zero-shot predictions.
  • Ablation experiments are comprehensive.

Weaknesses

  • Does the paper utilize a validation set in the experiments? If not, this should be clarified.
  • How does the proposed method compare with the methods [1][2] in few-shot settings?
  • Although Frolic is described as training-free, there may still be hyperparameters involved in the method; how are these parameters chosen?
  • How does the prompt distribution learning contribute to the performance? Does it outperform the original CLIP?
  • Can the method be performed with a few labeled samples?

    [1] Black Box Few-Shot Adaptation for Vision-Language Models. ICCV 2023.
    [2] Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification. ECCV 2022.

Questions

The discussion of the method's limitations is unclear. The authors should discuss them in more depth.

Limitations

Please refer to the Weaknesses and Questions sections.

Author Response

[Q1] Does the paper utilize a validation set in the experiments? If not, this should be clarified.

[A1] We do not use a validation set because our method involves no hyperparameter search.


[Q2] How does the proposed method compare with the methods [1][2] in few-shot settings?

[A2] The methods [1][2] refer to LFA and Tip-Adapter, which boost CLIP's generalization using labeled training samples. In contrast, our Frolic does not require any labeled samples. We evaluate our method against LFA and Tip-Adapter on ImageNet and its variants, where LFA and Tip-Adapter utilize labeled samples only from ImageNet. The results below show that our method achieves the best performance, except against LFA on ImageNet.


| Method | ImageNet | ImageNet-A | ImageNet-V2 | ImageNet-R | ImageNet-Sketch |
| --- | --- | --- | --- | --- | --- |
| LFA | 72.6 | 51.5 | 64.7 | 76.1 | 48.0 |
| Tip-Adapter | 70.5 | 49.8 | 63.1 | 76.9 | 48.1 |
| Frolic | 70.9 | 60.4 | 64.7 | 80.7 | 53.3 |

[Q3] Although Frolic is described as training-free, there may still be hyperparameters involved in the method; how are these parameters chosen?

[A3] The only hyperparameter in our method is the tolerance $\epsilon = 0.01$, which governs the precision of numerical calculations rather than being a traditional model hyperparameter. As illustrated in Figure 4, when the iteration extends beyond 20 steps, $\epsilon$ approaches zero. While a lower $\epsilon$ is theoretically preferable for finer precision, we selected $\epsilon = 0.01$ to strike a practical balance between computational efficiency and convergence reliability.


[Q4] How does the prompt distribution learning contribute to the performance? Does it outperform the original CLIP?

[A4] As compared in Table 3 ($f_{\rm c}$ and $f_{\rm g}$), the prompt distribution learning ($f_{\rm g}$, shown in the third row) significantly outperforms the original CLIP ($f_{\rm c}$, shown in the first row). For example, $f_{\rm g}$ increases the performance from 65.1% to 68.8% on the 10 datasets and from 68.7% to 69.8% on ImageNet.


[Q5] Can the method be performed with a few labeled samples?

[A5] Our method can effectively incorporate labeled samples by replacing the class descriptions used to estimate $\mathbf{z}_1$ to $\mathbf{z}_K$ in Equation (3). To demonstrate this capability, we conducted additional experiments in a 16-shot setting using the ViT-B/16 backbone; the results below show that the 16-shot setting achieves better performance than using class descriptions.

| Method | Pets | Flowers | Aircraft | DTD | EuroSAT | Cars | Food | SUN | Caltech | UCF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Frolic | 92.9 | 74.8 | 31.5 | 56.1 | 58.5 | 69.1 | 87.2 | 70.8 | 95.2 | 75.2 |
| Frolic with labeled samples | 94.5 | 98.3 | 51.4 | 71.7 | 89.6 | 83.5 | 87.9 | 76.6 | 96.3 | 85.2 |
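
For reference, a minimal sketch of the labeled-sample variant discussed in [A5], under the assumption that each prototype $\mathbf{z}_k$ is taken as the mean of the normalized image features of that class's shots (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def shot_prototypes(feats, labels, num_classes):
    """feats: [N, d] L2-normalized image features of the labeled shots; labels: [N] class ids."""
    # Replace each text-derived prototype z_k by the mean feature of class k's shots.
    protos = np.stack([feats[labels == k].mean(axis=0) for k in range(num_classes)])
    # Re-normalize so the prototypes lie on the same unit sphere as CLIP text features.
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)
```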

[Q6] The discussion of the method's limitations is unclear. The authors should discuss them in more depth.

[A6] The quality and distribution of the data used in pre-training can significantly impact the performance of pre-trained models. Our method relies on the capabilities of pre-trained models for downstream tasks; if the pre-trained knowledge differs from the downstream task, the efficacy of our method may be limited. We have included a detailed discussion of limitations in the revised manuscript. Additionally, as you suggested in [Q5], few-shot learning presents a viable solution to these challenges. We appreciate your insights, which help us further exploit the potential of our method.

Comment

I'm grateful for your response. All of my concerns have been resolved, and I now have a deeper understanding of your method. Therefore, I have raised my rating.

Comment

We do appreciate the reviewer's positive support, and we are pleased to take the reviewer's advice to improve our work.

Review
Rating: 7

In this paper, the authors introduce a promising method for enhancing zero-shot vision models through the utilization of prompt distribution learning and bias correction. The method is particularly notable for being training-free and label-free, which greatly simplifies the implementation process. The authors provide comprehensive experimental results, which effectively demonstrate the efficacy of the proposed method.

Strengths

  1. This paper is easy to understand.
  2. The paper provides an in-depth theoretical discussion and a thorough experimental evaluation demonstrating the effectiveness of the proposed method.

Weaknesses

  1. The computation of second-order moments and the covariance matrix from the marginal distribution as discussed in Eq.(3) and (5) might be computationally intensive.
  2. The method relies on the pseudo-labels estimated by CLIP. How does this process influence the final results?
  3. The setting of results on the ImageNet variants dataset is unclear.
  4. The results lack comparison with other prompt distribution methods, such as CoOp and CoCoOp.

Questions

The author stated that their method does not require hyperparameter tuning. How does the proposed method compare to those requiring hyperparameter tuning?

Limitations

Refer to the Weaknesses and Questions sections.

Author Response

[Q1] The computation of second-order moments and the covariance matrix from the marginal distribution, as discussed in Eqs. (3) and (5), might be computationally intensive.

[A1] We have evaluated the running time, as presented in Table 5. The results show that while Frolic requires slightly more computation time (0.0078 seconds per sample) than the original CLIP (0.0072 seconds per sample), it yields improved performance, increasing accuracy from 68.7% to 71.1%.


[Q2] The method relies on the pseudo-labels estimated by CLIP. How does this process influence the final results?

[A2] Our method improves performance regardless of the quality of the pseudo-labels. As evidenced in Table 4, the original CLIP achieves 92.95% accuracy on the Caltech dataset, indicative of high-quality pseudo-labels, but only 24.8% on the Aircraft dataset, which represents lower-quality pseudo-labels. Our method effectively improves results across these quality levels, boosting accuracy from 92.95% to 95.1% on Caltech and from 24.8% to 31.4% on Aircraft.


[Q3] The setting of the results on the ImageNet variant datasets is unclear.

[A3] We use the model learned on ImageNet to evaluate its performance across the ImageNet variants.

We have provided details about each variant to ensure clarity in the revision:

ImageNet-V2: collected following the original ImageNet protocol, including 10,000 images of 1,000 ImageNet categories.

ImageNet-Sketch: including 50,000 sketch images covering 1,000 ImageNet categories.

ImageNet-R: containing renditions (e.g., art, cartoons, graffiti) of ImageNet classes, comprising 30,000 images from 200 ImageNet categories.

ImageNet-A: collecting real-world images that are misclassified by ResNet-50, totaling 7,500 images from 200 ImageNet categories.

ObjectNet: including 50,000 test images with varied rotation, background, and viewpoint, overlapping 113 classes with ImageNet.

These details have been included in the revised manuscript.


[Q4] The results lack comparison with other prompt distribution methods, such as CoOp and CoCoOp.

[A4] CoOp and CoCoOp require a training procedure with labeled samples, while our method does not involve any training. To ensure a fair comparison, we compare our Frolic with CoOp and CoCoOp in the cross-dataset setting, where CoOp and CoCoOp are trained only with labeled samples from ImageNet and then tested directly on the remaining datasets. The results below show that our Frolic not only avoids the complexities of training but also exhibits superior generalization compared to these methods.

| Method | ImageNet | Caltech | Pets | Cars | Flowers | Food101 | Aircraft | SUN397 | DTD | EuroSAT | UCF101 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CoOp | 71.5 | 93.7 | 89.1 | 64.5 | 68.7 | 85.3 | 18.4 | 64.1 | 41.9 | 46.3 | 66.5 |
| CoCoOp | 71.0 | 94.4 | 90.1 | 65.3 | 71.8 | 86.0 | 22.9 | 67.3 | 45.7 | 45.3 | 68.2 |
| Frolic | 73.3 | 95.4 | 93.6 | 71.7 | 74.3 | 88.2 | 31.8 | 72.8 | 58.0 | 65.3 | 75.9 |
Comment

Thank you for the authors' rebuttal, which has addressed most of my concerns. I have raised my scores.

Comment

We're grateful for your appreciation and endorsement. Your review holds significant value for us, providing insightful feedback that helps enhance our work.

Review
Rating: 7

This work aims to enhance the zero-shot performance of pre-trained vision-language models. Specifically, three strategies, namely label-free prompt distribution learning, adaptive calibration, and correction of pre-training label bias, are proposed and work together to improve performance on different downstream tasks. Experiments on diverse tasks confirm the effectiveness of the proposed method.

Strengths

  1. Enhancing the zero-shot performance has attracted much attention recently. This work discusses multiple challenges in this process and proposes corresponding algorithms with theoretical analysis.
  2. The proposed method is training-free and has no hyper-parameters, which makes it applicable to real applications.
  3. Diverse downstream tasks are included for evaluating the performance of the proposed method.
  4. The manuscript is well-written and easy to follow.

Weaknesses

  1. The work assumes a uniform prior, i.e., that classes are balanced. It would be better to discuss the scenario with an imbalanced prior.
  2. Besides CLIP, there are many other vision-language pre-trained models, e.g., BLIP. It would be better to include other pre-trained models when evaluating the performance of the proposed method.
  3. Why does $e_i$ follow a Gaussian distribution in L446?

Questions

I do not have critical questions about this work.

Limitations

Yes

Author Response

[Q1] The work assumes a uniform prior, i.e., that classes are balanced. It would be better to discuss the scenario with an imbalanced prior.

[A1] We consider that most of the datasets, such as ImageNet and its variants, are uniformly distributed, which leads us to assume $\pi_j = \frac{1}{K}$ in line 134 as a special case for simplicity. However, if the downstream dataset is imbalanced, we can derive the distribution vector $\boldsymbol{\pi} = Z^{-1}\boldsymbol{\mu}$ as outlined in Equation (5). This strategy allows our method to adapt flexibly to both balanced and imbalanced dataset distributions.


[Q2] Besides CLIP, there are many other vision-language pre-trained models, e.g., BLIP. It would be better to include other pre-trained models when evaluating the performance of the proposed method.

[A2] Thank you for your valuable suggestion. We have conducted additional experiments using BLIP with the ViT-B/16 backbone. The results presented below indicate that our method consistently improves over BLIP across various datasets.

| Method | Pets | Flowers | DTD | EuroSAT | Cars | Food | SUN | Caltech | UCF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLIP | 65.1 | 50.4 | 44.1 | 37.0 | 62.6 | 69.0 | 48.4 | 86.1 | 51.5 |
| + Frolic | 79.3 | 56.2 | 58.3 | 50.1 | 70.2 | 74.2 | 67.1 | 92.3 | 64.3 |

[Q3] Why does $e_i$ follow a Gaussian distribution in L446?

[A3] The original $\mathbf{x}$ follows a Gaussian distribution $\mathcal{N}(\mathbf{z}_j, \Sigma)$. We express the covariance matrix $\Sigma$ of $\mathbf{x}$ as an expansion in terms of its eigenvectors, as in Equation (24); the variable $e_i = \mathbf{u}_i^T(\mathbf{x} - \mathbf{z}_j)$ can then be interpreted as the projection of $\mathbf{x} - \mathbf{z}_j$ onto a new coordinate system defined by the orthogonal vectors $\mathbf{u}_i$. Since this transformation is linear and $\mathbf{x}$ is Gaussian, the transformed variables $e_i$ also follow a Gaussian distribution. Specifically, since the variance of the $i$-th coordinate is $\lambda_i$, $e_i$ follows the Gaussian distribution $\mathcal{N}(0, \lambda_i)$.
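
As a small numerical illustration of this argument (not from the paper; the dimension and the random covariance are arbitrary), one can check that the projections $e_i$ are indeed zero-mean with variance $\lambda_i$:

```python
import numpy as np

# If x ~ N(z_j, Sigma) and u_i is a unit eigenvector of Sigma with eigenvalue lambda_i,
# then e_i = u_i^T (x - z_j) should be Gaussian with mean 0 and variance lambda_i.
rng = np.random.default_rng(0)
d = 8
A = rng.normal(size=(d, d))
Sigma = A @ A.T                                      # a random positive-definite covariance
z_j = rng.normal(size=d)                             # stand-in for a class prototype

lam, U = np.linalg.eigh(Sigma)                       # Sigma = U diag(lam) U^T
x = rng.multivariate_normal(z_j, Sigma, size=200_000)
e = (x - z_j) @ U                                    # column i holds e_i = u_i^T (x - z_j)

print(np.abs(e.mean(axis=0)).max())                  # close to 0: each e_i is zero-mean
print(np.allclose(e.var(axis=0), lam, rtol=0.05))    # True: variances match the eigenvalues
```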

Comment

After reading the rebuttal and comments of other reviewers, I would like to raise my rating.

Comment

Thank you for considering our rebuttal and comments from your fellow reviewers. We appreciate your suggestions, which are crucial to the improvement of our work.

Review
Rating: 6

This paper presents Frolic, a label-free prompt learning method aimed at improving zero-shot visual recognition with vision-language models such as CLIP. The method is built upon estimating distributions over prompt prototypes to capture diverse visual representations, followed by a bias correction step. Experimental results demonstrate consistent improvements over prior-art methods on various benchmarks.

Strengths

  1. The approach does not require access to large-scale datasets for estimation, which many prior works require.
  2. The approach advances previous distribution estimation methods by removing the need for label information, which makes this line of work applicable to zero-shot recognition tasks.

Weaknesses

While this approach does not need access to large-scale pre-training data, it seems to make some weak assumptions on downstream tasks. If my understanding is correct, it assumes:

  1. the downstream task is balanced (line 134), which does not always hold true in reality. The long-tail nature of the real world does not guarantee that the testing class distribution will be balanced, even though the benchmarks do.
  2. a decent amount of testing data on downstream tasks is available (e.g., testing data does not come in an online fashion, one sample at a time) for estimating the beta term in the bias correction. Would the approach for bias estimation still hold in an online testing scenario? I would assume this is more aligned with reality too.

Questions

  1. The method does not seem to be evaluated on all the benchmark datasets used by previous works, for example CIFAR-10/100 and RESISC. While some of these datasets have saturated performance, it might still be helpful to include results there.
  2. It's not clear to me why [24] is not listed in the tables for comparison, as I would assume [24] to be the most direct baseline. It shares the same motivation as this paper, with the requirement of accessing pre-training data. Arguably, while LAION has been taken down, [24] already does the job for us. I would not consider an argument like "[24] requires large-scale pre-training data for estimating bias, thus we do not compare with it" to be valid, because this zero-shot visual recognition task itself works solely with CLIP-ish models, and estimating bias from datasets like LAION sounds like a fair practice to me.

Limitations

The authors have included such a discussion.

Author Response

[Q1] The downstream task is assumed to be balanced (line 134), which does not always hold true in reality. The long-tail nature of the real world does not guarantee that the testing class distribution will be balanced, even though the benchmarks do.

[A1] We acknowledge that real-world data often exhibit an imbalanced distribution. The balanced assumption in line 134 is a special case adopted for simplicity. In our method, if the downstream dataset is imbalanced, we can derive the distribution vector $\boldsymbol{\pi} = Z^{-1}\boldsymbol{\mu}$ as outlined in Equation (5). This strategy allows our method to adapt flexibly to both balanced and imbalanced dataset distributions.


[Q2] A decent amount of testing data on downstream tasks is available (e.g., testing data does not come in an online fashion, one sample at a time) for estimating the beta term in the bias correction. Would the approach for bias estimation still hold in an online testing scenario? I would assume this is more aligned with reality too.

[A2] Thank you for your insightful question. Our method can be easily extended to an online scenario by updating the estimate $S$ incrementally as each new test example is processed. Specifically, we first initialize the matrix $S$ with the identity matrix. Suppose we receive the $n$-th test sample with predicted label $j$ and predicted probability $\mathbf{p}$; we first update $\mathbf{s}_j = \frac{n-1}{n}\mathbf{s}_j + \frac{1}{n}\mathbf{p}$. We then compute the estimated $\beta$ and update $f_{\rm d}$. We have conducted additional experiments in the online setting to validate this extension; the results below show that the online Frolic achieves performance comparable to the original Frolic.

| Method | Caltech | Pets | Cars | Flowers | Food101 | Aircraft | SUN397 | DTD | EuroSAT | UCF101 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Frolic | 95.4 | 93.6 | 71.7 | 74.3 | 88.2 | 31.8 | 72.8 | 58.0 | 65.3 | 75.9 | 72.7 |
| Frolic (Online) | 94.3 | 93.1 | 71.4 | 74.5 | 88.1 | 31.5 | 73.0 | 57.4 | 64.5 | 75.1 | 72.3 |
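
For completeness, a minimal sketch of the incremental update described in [A2] (names such as `K`, `S`, and `online_step` are illustrative; the re-estimation of $\beta$ and the update of $f_{\rm d}$ reuse the offline formulas and are only indicated by a comment):

```python
import numpy as np

K = 1000                      # number of classes (illustrative)
S = np.eye(K)                 # running estimate, initialized with the identity matrix
n = 0                         # number of test samples processed so far

def online_step(p):
    """Process one test sample given its predicted probability vector p of shape [K]."""
    global n
    n += 1
    j = int(np.argmax(p))                     # predicted (pseudo) label of this sample
    S[j] = (n - 1) / n * S[j] + p / n         # s_j <- (n-1)/n * s_j + 1/n * p
    # beta is then re-estimated from S and the debiased logits f_d are updated,
    # following the same formulas as in the offline setting (omitted here).
    return j
```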

[Q3] The method does not seem to be evaluated on all the benchmark datasets used by previous works, for example CIFAR-10/100 and RESISC. While some of these datasets have saturated performance, it might still be helpful to include results there.

[A3] We have conducted experiments with ViT-B/16 on CIFAR-10, CIFAR-100, and RESISC, and the results are presented below:

| Method | CIFAR-10 | CIFAR-100 | RESISC |
| --- | --- | --- | --- |
| CLIP | 91.3 | 68.6 | 58.9 |
| Frolic | 92.6 | 70.0 | 64.4 |

We observe that on these three datasets, our method improves the performance over the original CLIP.


[Q4] It's not clear to me why [24] is not listed in the tables for comparison, as I would assume [24] to be the most direct baseline. It shares the same motivation as this paper, with the requirement of accessing pre-training data. Arguably, while LAION has been taken down, [24] already does the job for us. I would not consider an argument like "[24] requires large-scale pre-training data for estimating bias, thus we do not compare with it" to be valid, because this zero-shot visual recognition task itself works solely with CLIP-ish models, and estimating bias from datasets like LAION sounds like a fair practice to me.

[A4] Thank you for your suggestions. We acknowledge that [24], referred to as REAL, represents a crucial baseline and shares a similar motivation with our work. We compared REAL, which utilizes the LAION-400M dataset, against our Frolic using OpenCLIP (ViT-B/16) across several datasets. The summarized results below show that our Frolic clearly outperforms REAL, achieving an average improvement of 1.1% in accuracy. We have included these results in the revised manuscript.

| Method | ImageNet | Flowers | Cars | Aircraft | Pets | Food | DTD | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| REAL [24] | 68.1 | 73.1 | 84.0 | 18.8 | 90.5 | 85.2 | 59.8 | 68.5 |
| Frolic | 70.3 | 73.9 | 84.6 | 19.9 | 91.6 | 86.9 | 60.1 | 69.6 |
Comment

We are grateful for your feedback and we agree with your suggestions to enhance the quality of our work.

A1: We make the assumption that the mean of each class distribution can be represented by the text feature $\mathbf{z}_j$ obtained via prompting. Given unlabeled samples, we can compute their sample mean $\mathbf{m}$ and denote the unknown label priors as $\pi_j$. The sample mean can be represented as a linear combination of the $\mathbf{z}_j$:

$$\mathbf{m} = \int_{\mathbf{x}} \sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}; \mathbf{z}_j, \Sigma) \, \mathbf{x} \, \mathrm{d}\mathbf{x} = \sum_{j=1}^{K} \pi_j \mathbf{z}_j.$$

Let $Z = [\mathbf{z}_1, \ldots, \mathbf{z}_K]$; then the unknown label priors can be solved by $\boldsymbol{\pi} = Z^{-1}\mathbf{m}$.
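
A hedged sketch of this estimate in code (the least-squares solve stands in for the $Z^{-1}$ notation since $Z$ is generally not square, and the clipping/renormalization is an added safeguard, not from the paper; names are illustrative):

```python
import numpy as np

def estimate_label_prior(image_feats, text_protos):
    """image_feats: [N, d] unlabeled image features; text_protos: [K, d] prototypes z_1..z_K."""
    m = image_feats.mean(axis=0)                    # sample mean of the feature marginal
    Z = text_protos.T                               # d x K matrix [z_1, ..., z_K]
    pi, *_ = np.linalg.lstsq(Z, m, rcond=None)      # solve Z pi = m in the least-squares sense
    pi = np.clip(pi, 0.0, None)                     # clip small negatives from estimation noise
    return pi / pi.sum()                            # renormalize to a valid prior
```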


A2: We have conducted experiments using the stronger models mentioned, and the results are presented below. We find that:

  • Our Frolic outperforms the original CLIP across various datasets with both ViT-L/14 and ViT-L/14@336.
  • Our online version of Frolic maintains performance comparable to the original Frolic.
  • Our Frolic outperforms REAL with the ViT-L/14 backbone.
| ViT-L/14 | CIFAR-10 | CIFAR-100 | RESISC |
| --- | --- | --- | --- |
| CLIP | 95.8 | 78.6 | 65.7 |
| Frolic | 96.5 | 79.8 | 66.8 |

| ViT-L/14@336 | CIFAR-10 | CIFAR-100 | RESISC |
| --- | --- | --- | --- |
| CLIP | 91.3 | 79.2 | 66.7 |
| Frolic | 92.6 | 81.1 | 68.1 |

| ViT-L/14 | Caltech | Pets | Cars | Flowers | Food101 | Aircraft | SUN397 | DTD | EuroSAT | UCF101 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Frolic | 97.3 | 95.4 | 83.5 | 81.8 | 92.4 | 42.1 | 77.3 | 66.9 | 71.0 | 82.2 | 79.0 |
| Frolic (Online) | 96.7 | 95.6 | 83.3 | 82.1 | 91.7 | 41.9 | 77.7 | 66.7 | 70.4 | 81.5 | 78.8 |

| ViT-L/14@336 | Caltech | Pets | Cars | Flowers | Food101 | Aircraft | SUN397 | DTD | EuroSAT | UCF101 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Frolic | 97.7 | 96.2 | 85.1 | 83.2 | 92.8 | 44.3 | 78.5 | 67.6 | 72.9 | 84.3 | 80.3 |
| Frolic (Online) | 96.9 | 96.1 | 85.3 | 82.9 | 92.3 | 44.5 | 78.1 | 67.8 | 72.3 | 83.1 | 80.0 |

| ViT-L/14 | ImageNet | Flowers | Cars | Aircraft | Pets | Food | DTD | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| REAL [24] | 73.7 | 82.4 | 89.6 | 28.2 | 92.8 | 89.4 | 65.7 | 74.5 |
| Frolic | 74.2 | 82.9 | 90.5 | 29.1 | 93.3 | 90.1 | 66.6 | 75.2 |
Comment

Thanks for the response. All my concerns are resolved, and I have raised my rating.

Comment

We appreciate your acknowledgment and your feedback is valuable in helping us enhance our work further.

Comment

Thanks to the authors for providing the rebuttal and additional results.

Most of my concerns are resolved, while a few remain:

  1. In [A1], you mentioned "if the downstream datasets are imbalanced, we can derive the distribution vector $\boldsymbol{\pi} = Z^{-1}\boldsymbol{\mu}$". It's not clear to me how this can be achieved when working with zero-shot recognition. In a zero-shot scenario, you wouldn't have class distribution information about downstream tasks beforehand and would have to either make further assumptions or use some sort of estimation. Can you elaborate more on this?

  2. I see that most results are provided for ViT-B models; what about stronger ones like ViT-L and ViT-L@336?

Author Response

Dear Program Chair, Senior Area Chair, Area Chair, and Reviewers,

First of all, we gratefully thank all the reviewers for their thoughtful comments and feedback.

In this paper, we propose a label-free prompt distribution learning and bias correction framework, dubbed Frolic, to boost the performance of zero-shot models. The contributions of this paper are as follows:

  1. Simple and Practical Solution: Our method is not only training-free but also circumvents the necessity for hyper-parameter tuning.

  2. Comprehensive Evaluation: We demonstrate the effectiveness of our proposed method, Frolic, by conducting experiments across 16 datasets.

  3. Significant Performance Improvement: Our Frolic has a consistent and significant improvement over existing baselines. For example, it surpasses the state-of-the-art zero-shot models by a margin of 2.6% on average with CLIP ViT-B/16.

As our paper received mixed ratings, i.e., three positive (6, 6, 6) and one negative (4), we would appreciate it if the reviewers could have a look at our responses and revision. We have tried our best to address your concerns in detail in our responses, and we hope they answer your questions. Please let us know at your earliest convenience if you have further questions or concerns.

Best regards,

Authors of Paper #2527

Final Decision

The reviewers agree on acceptance based on the practical significance, efficient methodology, and convincing results. The authors are encouraged to include the discussions and results from the rebuttal in the final version.