PaperHub

Vision-Language Subspace Prompting

ICLR 2024 | Decision: Rejected
Average rating: 4.8/10 (4 reviewers; min 3, max 6, std 1.1; individual ratings: 5, 3, 5, 6)
Average confidence: 3.8
Submitted: 2023-09-17 | Updated: 2024-02-11

Abstract

Keywords
Prompt Learning; Vision-Language Models

Reviews and Discussion

Official Review
Rating: 5

This work focuses on prompt tuning for vision-language models (i.e., CLIP) and proposes a subspace-based prompt learning method that divides soft prompts into orthonormal subgroups, regularized by hard prompts. Experiments on base-to-new classes, domain generalization, and cross-dataset transfer settings show the effectiveness of the method.

Strengths

  • The proposed method achieved competitive performance on base-to-new classes, domain generalization, and cross-dataset transfer settings.

  • The method is simple but effective, although some of the insights behind it are not yet clear.

Weaknesses

  • Analysis about "Is subspace modeling useful" in Section 4.4. The conclusion is obtained from comparisons between SuPr w/o reg, CoOp, and CoOp-Ensemble. It is not clear what the detailed differences among the three methods are, which is essential for understanding whether the comparisons can support the conclusions, as the performance gain may come from other components.

  • SVD for subspace modeling. It is a bit hard for me to understand the role of SVD in subspace modeling. According to Sec. 3.2, it seems that SVD is used to guarantee that the matrix U_c is orthonormal. If so, is it possible to restrict U_c to be orthonormal without the SVD operation? Also, it would be interesting to see an ablation where U_c is no longer an orthonormal matrix. In this potential ablation study, can we say the subspaces are no longer disentangled/independent?

  • Main technical contribution. It seems that the main messages of this work are (1) dividing soft prompts into subgroups, and (2) regularizing soft prompts with hard prompts. There is a lack of insight into why the subgroup approach works, beyond the technical tricks.

  • Analysis on subspace. Does the subspace have any semantic information, or what does each subspace represent? That would contribute to explainability.

Questions

Please see weaknesses for detailed comments.

Comment

W4: Analysis on subspace. Thanks for the suggestion! We have now included qualitative visualizations in the revision, using Paella [4] to synthesize images from different (hard & learned) prompts. Please refer to Figures 5-11 in the appendix of the revision or the links below. The visualizations in Figures 5-8 show that the soft prompts learned by subspace modelling capture different intra-class variations, such as fine-grained attributes in terms of colour, texture and depiction style. This explains why our SuPr improves over CoOp, which tends to learn only the dominating concepts. Also, walking within a subspace across different bases shows interesting transitions along different attributes, demonstrating the wealth of semantic information captured in each subspace, as shown in Figures 9-11.

[4] Dominic Rampas, Pablo Pernias, and Marc Aubreville. A novel sampling scheme for text- and image-conditional image synthesis in quantized latent spaces. arXiv preprint arXiv:2211.07292, 2023.

Comment

W3: Main technical contribution. Please first refer to the response to Reviewer kdeh regarding 'novelty in subspace modelling'. We hope the take-home messages become clearer with the following clarification.

Clarification for Message (1): 'dividing soft prompts into subgroups' should not be treated as a main contribution by itself, as this is also required by CoOp-Ensemble. Our first main contribution is subspace modelling with multiple groups of soft prompts: no prior prompt-learning approach has considered subspace modelling to represent categories. We believe this is a significant contribution, as learning a subspace classifier leads to better extrapolation/generalization than prior vector/prototype-based class representations such as CoOp. We will show in our revision that a subspace classifier captures the intra-class variability of a class rather than a single dominating point, which improves generalization.
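
To make the contrast concrete, the following is a minimal sketch (the embedding width and prompt count are assumed values; this is an illustration, not our exact implementation) of scoring an image feature against a single class prototype versus against a subspace spanned by several soft-prompt embeddings:

```python
# Illustrative sketch only: vector (prototype) classifier vs. subspace classifier.
# DIM and K are assumed values, not taken from the paper.
import torch
import torch.nn.functional as F

DIM, K = 512, 4  # embedding width, number of soft-prompt embeddings per class

def vector_score(img_feat, prototype):
    """CoOp-style scoring: cosine similarity to a single class embedding."""
    return F.cosine_similarity(img_feat, prototype, dim=-1)

def subspace_score(img_feat, prompt_embs):
    """Subspace-style scoring: similarity between the image feature and its
    projection onto the subspace spanned by the class's soft-prompt embeddings."""
    U = torch.linalg.svd(prompt_embs.T, full_matrices=False).U  # (DIM, K) orthonormal basis
    proj = U @ (U.T @ img_feat)                                 # projection onto the subspace
    return F.cosine_similarity(img_feat, proj, dim=-1)

img_feat = F.normalize(torch.randn(DIM), dim=-1)
prompt_embs = torch.randn(K, DIM)        # stand-in for soft-prompt text embeddings
print(vector_score(img_feat, prompt_embs.mean(0)).item(),
      subspace_score(img_feat, prompt_embs).item())
```

Because the subspace retains several directions of intra-class variation, samples that deviate from the single dominating prototype can still be scored highly.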

Clarification for Message (2): 'regularizing soft prompts with hard prompts' can be realized in various ways, such as strong alignment between soft and hard prompts. Excessive alignment can harm an adapted VLM's performance on the base classes, as observed in [2,3]. Our contribution, however, is the specific approach of regularizing the soft-prompt-based linear subspaces by forcing them to span the hard-prompt embeddings. The resulting VLMs can be tailored well to the base classes while maintaining generalization to unseen classes.
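
As an illustration of the idea (a toy sketch under assumed shapes and a simple squared-residual loss, not the exact objective used in the paper), the span regularization can be viewed as penalizing the part of each hard-prompt embedding that falls outside the soft-prompt subspace:

```python
# Illustrative sketch only: penalize the component of each hard-prompt embedding
# that lies outside the subspace spanned by the soft-prompt embeddings of the class.
import torch

def span_regularizer(soft_embs, hard_embs):
    # soft_embs: (K, DIM) soft-prompt text embeddings defining the class subspace
    # hard_embs: (H, DIM) embeddings of hand-crafted prompts for the same class
    U = torch.linalg.svd(soft_embs.T, full_matrices=False).U  # (DIM, K) orthonormal basis
    residual = hard_embs - hard_embs @ U @ U.T                # component outside the subspace
    return residual.pow(2).sum(dim=-1).mean()

soft_embs = torch.randn(4, 512, requires_grad=True)           # assumed sizes
hard_embs = torch.randn(7, 512)
reg = span_regularizer(soft_embs, hard_embs)
reg.backward()                                                # gradients flow into the soft prompts
print(reg.item())
```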

Explanation for insights into why the proposed method works: Please refer to the response to the following concern. In addition, we compute the prediction scores for all test samples on several datasets and visualize the top 10% most prediction-confident samples in Figures 12-14. To better understand the selected samples, we cluster them into three clusters using K-means (a minimal sketch of this selection-and-clustering step is given after the references below). From the results, we can see that the confident test samples for the vector classifiers are less diverse than those for the subspace classifiers, indicating that vector classifiers learn only the dominating concepts. Among them, we can also see samples that are predicted correctly by the subspace classifiers but not by the vector classifiers.

  • Prediction-confident test samples of Freckled/Petunia/Ostrich for vector vs. subspace classifiers. Samples boxed in red are wrong predictions; samples boxed in blue are predicted correctly by the subspace classifiers but not by the vector classifiers.

[2] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In CVPR, 2023.

[3] Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In ICCV, 2023.
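
For completeness, a minimal sketch of the selection-and-clustering step mentioned above (dummy data and hypothetical names, not our actual analysis script):

```python
# Illustrative sketch only: keep the top-10% most confident test samples for a class
# and group their image features into three clusters to inspect their diversity.
import numpy as np
from sklearn.cluster import KMeans

def confident_clusters(img_feats, scores, top_frac=0.10, n_clusters=3, seed=0):
    # img_feats: (N, DIM) test image features; scores: (N,) confidence for one class
    k = max(n_clusters, int(round(top_frac * len(scores))))
    top_idx = np.argsort(scores)[-k:]                 # indices of the most confident samples
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return top_idx, km.fit_predict(img_feats[top_idx])

feats, conf = np.random.randn(1000, 512), np.random.rand(1000)   # dummy data
idx, cluster_ids = confident_clusters(feats, conf)
```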

Comment

Response to nXYV

W1: Analysis about "Is subspace modeling useful" in Section 4.4. We understand that this concern stems from confusion about our ablation study. Please refer to the response to all for this concern.

W2: SVD for subspace modelling. 1) Restricting U_c to be orthonormal without SVD: Good point. SVD was used to ensure the orthonormality of U_c during optimization. Placing this restriction on U_c directly was also considered in our preliminary implementation. However, U_c is formed from the soft-prompt embeddings, i.e., it is generated dynamically (as the soft prompts update iteratively), making this constraint hard to enforce on U_c. We will keep investigating this in future research. 2) What happens if U_c is not an orthonormal matrix? Indeed, an orthonormal matrix U_c is not strictly necessary for linear subspace modelling. We can use the raw support points from the soft-prompt embeddings and employ least-squares linear regression to model a linear subspace, as per [1] (a toy numerical sketch contrasting the two projections follows the table below). We tried this during development, and it performed similarly to SVD, just 0.23% weaker. Thus we kept SVD for simplicity and slightly better empirical performance. 3) Is the subspace no longer disentangled/independent when U_c is unconstrained? Yes. However, we would like to clarify that the orthogonality constraint imposed by SVD affects the bases that define each linear subspace, but it does not make the subspaces of different classes orthogonal to each other; i.e., there is currently no inter-subspace independence/orthogonality constraint. We also tried regularizing the subspaces to be orthogonal to each other during development, but this negatively affected performance.

[1] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, 2001.

| HOS (%) | ImageNet | Caltech101 | OxfordPets | StanfordCars | Flowers102 | Food101 | FGVCAircraft | SUN397 | DTD | EuroSAT | UCF101 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SuPr-Ens w/o SVD | 73.33 | 96.41 | 96.28 | 75.10 | 85.47 | 91.51 | 37.28 | 80.38 | 70.54 | 81.59 | 80.48 | 79.11 |
| SuPr-Ens | 73.74 | 96.40 | 96.64 | 75.08 | 85.54 | 91.55 | 37.05 | 80.51 | 70.79 | 81.59 | 81.85 | 79.33 |
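
To illustrate the relation between the two options discussed above (a toy numerical sketch with assumed shapes, not the paper's code): projecting onto the subspace via an orthonormal SVD basis and via least-squares regression on the raw soft-prompt embeddings yield the same projection when the embeddings span the same space.

```python
# Illustrative sketch only: SVD-basis projection vs. least-squares projection
# onto the subspace spanned by a class's soft-prompt embeddings (assumed shapes).
import torch

A = torch.randn(512, 4)                                  # columns: soft-prompt embeddings of one class
x = torch.randn(512)                                     # a test image feature

U = torch.linalg.svd(A, full_matrices=False).U           # orthonormal basis of the column space
proj_svd = U @ (U.T @ x)

coef = torch.linalg.lstsq(A, x.unsqueeze(1)).solution    # least-squares coefficients
proj_ls = (A @ coef).squeeze(1)

print(torch.allclose(proj_svd, proj_ls, atol=1e-4))      # True: both routes agree
```
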
Official Review
Rating: 3

The paper proposes SuPr, a novel subspace prompt learning method to improve the generalization ability of large pre-trained vision-language models, especially CLIP. Specifically, the authors learn several partitions of soft prompts and project them into subspaces, while using hard prompts to regularize them. The experimental results show the effectiveness of their method.

Strengths

  1. Improving the generalization ability of pre-trained models is an interesting topic.
  2. Using subspaces to enrich the semantic meaning of soft prompts is an interesting direction.

Weaknesses

  1. Results are not consistent. For some datasets, it achieves slightly better results than SOTA methods, but the results are not good on the EuroSAT dataset. The authors should at least explain the reasons or assumptions.
  2. The experiments are not sufficient. For example, there is no numerical ablation study for each component.
  3. Overall, the paper is written in a rushed way, which results in many confusing explanations.

Questions

See the weakness part.

Comment

Response to SWVW

W1: Inconsistent results on the EuroSAT dataset. Please refer to the response to all for this concern.

W2: Ablation study. Please refer to the response to all.

W3: Confusing explanations. We apologize for any confusing explanations. Please kindly let us know which parts are unclear, and we will clarify them in a revision.

Official Review
Rating: 5

This paper addresses prompt learning for vision-language models to achieve better base- and novel-class performance with subspace modelling. The paper proposes subspace modelling of soft prompts, as well as its regularization with hard prompts and ensembling methods. Experiments verify the effectiveness of the proposed method.

Strengths

  1. The overall method and experiments are reasonable and convincing. This is good practice for soft prompting of VLMs.
  2. The paper is well written and easy to follow.
  3. The paper marks the first integration of subspace modelling with VLMs.

Weaknesses

The improvement of this paper is not significant according to the tables (<1% in Tables 1, 2, and 3).

Questions

  1. This is a good practice of integrating subspace modelling with VLMs. What is the novelty of the method within the subspace modelling domain?
  2. Why is LASP not compared in Tables 3 and 4?

Comment

Response to kdeh

W1: Performance. Please refer to the response to all for this concern.

W2: Novelty in subspace modelling. Our method SuPr differs from typical subspace modelling methods in the literature in the following aspects: 1) We introduce a novel hard-prompt-based regularization that guides the modelled subspace to span the hard-prompt embeddings. This differs from the orthogonality regularization commonly used [1-3] to decouple the modelled subspaces of different classes (a toy sketch of such an orthogonality penalty is given after the references below). We also experimented with orthogonality regularization, which hurt performance; this indicates that prompt-based subspace modelling differs from conventional subspace methods, and orthogonalizing the subspaces is not beneficial. 2) We propose an ensembling method that improves our linear subspace modelling by learning multiple linear subspaces with different hard prompts.

[1] Chong You, Daniel Robinson, and René Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In CVPR, 2016.

[2] Christian Simon, Piotr Koniusz, Richard Nock, and Mehrtash Harandi. Adaptive subspaces for few-shot learning. In CVPR, 2020.

[3] Arnout Devos and Matthias Grossglauser. Regression networks for meta-learning few-shot classification. In AutoML, 2020.
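
For clarity, a toy sketch of the kind of inter-class orthogonality penalty referred to above (assumed shapes; not the exact regularizer used in our experiments):

```python
# Illustrative sketch only: a penalty on the overlap between the subspaces of two
# different classes (zero when the subspaces are mutually orthogonal).
import torch

def inter_subspace_overlap(U_a, U_b):
    # U_a, U_b: (DIM, K) orthonormal bases of two class subspaces
    return (U_a.T @ U_b).pow(2).sum()

U_a = torch.linalg.svd(torch.randn(512, 4), full_matrices=False).U
U_b = torch.linalg.svd(torch.randn(512, 4), full_matrices=False).U
print(inter_subspace_overlap(U_a, U_b).item())
```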

W3: Missing LASP in Tables 3 & 4. Initially, we noticed an inconsistency in the LASP paper: the mean of their reported per-dataset results did not match their reported average. Thus, we omitted their results from Table 3. In addition, LASP did not provide results for the evaluation in Table 4. Nevertheless, we have now re-implemented LASP for both settings in Tables 3 and 4 and included the results in the revision. The results show that our method consistently outperforms LASP.

Cross-dataset transfer (HOS):

| Method | ImageNet (source) | Caltech101 | OxfordPets | StanfordCars | Flowers102 | Food101 | FGVCAircraft | SUN397 | DTD | EuroSAT | UCF101 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SuPr | 71.70 | 94.20 | 89.80 | 64.90 | 70.70 | 86.30 | 23.00 | 66.50 | 45.50 | 50.20 | 67.70 | 65.88 |
| LASP | 71.40 | 93.30 | 89.88 | 65.01 | 70.20 | 85.39 | 20.88 | 66.74 | 43.67 | 45.32 | 69.07 | 64.95 |

Few-shot learning (HOS):

| Method | ImageNet | Caltech101 | OxfordPets | StanfordCars | Flowers102 | Food101 | FGVCAircraft | SUN397 | DTD | EuroSAT | UCF101 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SuPr | 69.77 | 95.17 | 93.13 | 76.80 | 94.23 | 86.00 | 35.53 | 73.60 | 64.97 | 73.23 | 79.97 | 76.58 |
| LASP | 70.47 | 94.70 | 92.58 | 71.97 | 89.48 | 85.85 | 30.60 | 72.32 | 58.39 | 68.80 | 78.24 | 73.95 |
Official Review
Rating: 6

This paper proposes a new subspace-based prompt learning method to strike a balance between hand-crafted and learnable prompts. The learned model achieves high performance on the base classes and also generalizes to new classes.

Strengths

-The paper is well-written and easy to follow.

-It is interesting to see that the proposed method works well on many datasets.

Weaknesses

-The proposed method fixes the parameters of the text encoder and image encoder. Would it achieve better performance if all these parameters were made learnable?

-Will the proposed training strategy introduce extra training cost?

Questions

See the weakness.

Comment

Response to Keaq

W1: Comparison with making all parameters learnable. We have now included results where all parameters are made learnable during fine-tuning for both CoOp and SuPr. The results below show that both methods suffer degraded performance when all parameters are freed for training. We attribute this observation to the limited training samples, which make over-parameterized models prone to overfitting. However, our SuPr still improves over CoOp by about 3.0% accuracy in this setting.

| HOS (%) | ImageNet | Caltech101 | OxfordPets | StanfordCars | Flowers102 | Food101 | FGVCAircraft | SUN397 | DTD | EuroSAT | UCF101 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CoOp | 71.92 | 93.73 | 94.47 | 68.13 | 74.06 | 85.19 | 28.75 | 72.51 | 54.24 | 68.90 | 67.46 | 71.66 |
| CoOp w/ all learnable | 59.89 | 92.35 | 91.05 | 65.90 | 62.17 | 80.00 | 23.68 | 71.32 | 55.71 | 68.84 | 76.59 | 69.26 |
| SuPr | 73.74 | 96.40 | 96.64 | 75.08 | 85.54 | 91.55 | 37.05 | 80.51 | 70.79 | 81.59 | 81.85 | 79.33 |
| SuPr w/ all learnable | 62.90 | 93.13 | 91.72 | 67.77 | 68.35 | 82.52 | 28.86 | 73.77 | 59.89 | 76.69 | 78.52 | 72.23 |

W2: Extra training cost? a) Number of trainable parameters: Our method builds on training multiple sets of soft prompts divided from the single set of soft prompts in CoOp, i.e., SuPr has the same number of trainable parameters as CoOp. SuPr-Ens has total parameters = number of ensembles × number of soft-prompt parameters, which introduces slightly more parameters. However, the parameter size of soft prompts is tiny, so the extra amount remains small. b) Computational cost: The following table shows the training time for adapting different models to ImageNet. The results show that our SuPr and SuPr-Ens do not introduce substantially more computational cost than the baseline (32 mins for CoOp). CoCoOp requires substantially more, 25.5 times longer than the baseline, followed by ProGrad at 2.38 times the cost of CoOp. Our SuPr and SuPr-Ens scale the training time to 1.5 and 1.78 times that of CoOp, which is comparable with the recent SoTA methods KgCoOp (1.06) and LASP (1.31).

| GPU: NVIDIA 3090 Ti | CoOp | CoCoOp | ProGrad | KgCoOp | LASP | SuPr | SuPr-Ens |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10-epoch training time on ImageNet (relative to CoOp) | 1.0 (32 mins) | 25.50 | 2.38 | 1.06 | 1.31 | 1.50 | 1.78 |
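
As a back-of-the-envelope illustration of the parameter accounting in part (a) above (the context length, embedding width, and ensemble size below are assumed values, not the paper's exact configuration):

```python
# Illustrative parameter count only; the hyperparameter values below are assumptions.
n_ctx, dim = 16, 512            # context tokens per prompt, text embedding width
n_ensembles = 4                 # number of ensembled subspaces in SuPr-Ens

coop_params = n_ctx * dim       # one set of soft prompts (CoOp)
supr_params = coop_params       # same budget, split into subgroups (SuPr)
supr_ens_params = n_ensembles * coop_params   # one set per ensemble member (SuPr-Ens)

print(coop_params, supr_params, supr_ens_params)   # 8192 8192 32768
```
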
Comment

Response to all

We appreciate the valuable feedback from all the reviewers. We have now addressed all concerns and have uploaded a revised version. In this revision, several new analyses have been included, especially to shed light on the insights of subspace modelling. (Please refer to the 'Qualitative Visualization' in the appendix for more details.)

First, we want to address several important concerns among the reviewers.

Experimental performance: Some reviewers raised concerns about the performance of our method, including marginal improvements and underperforming results on EuroSAT. a) Marginal improvements: In the revision, we have added the delta performance between each baseline and our SuPr-Ens in brackets in Table 1. The results show that our method demonstrates superiority over the competitors in the vast majority of cases. Many of the improvement margins are >1%, even compared with the latest SoTA methods, KgCoOp and LASP. b) EuroSAT results: We believe the non-ideal EuroSAT results should be treated as an exception rather than a weakness of our method. LASP shows extraordinary performance on EuroSAT, outperforming the other SoTA methods, including ProDA, KgCoOp, etc., by over 15% accuracy, which should be considered an outlier given the improvement margins in most prior works. Nevertheless, our SuPr-Ens outperforms all other competitors by more than 7.5% accuracy, clearly indicating its efficacy.

Ablation study: Apologies for the confusion caused by our ablation experiments shown in Figure 3(b) (and in Table 4 in the appendix) of the submission. We now reiterate and clarify them as follows:

  • CoOp: Vanilla CoOp, assuming the parameter size of the learnable soft prompts is M.

  • CoOp-Ensemble: Multiple CoOp models are introduced, i.e., multiple sets of soft prompts are learnable, whose total parameter size is also M. During inference, prediction is based on ensembling the multiple CoOp models.

  • SuPr w/o reg: Adding subspace modeling on top of CoOp-Ensemble.

  • SuPr: Adding hard-prompt based regularization on top of SuPr w/o reg.

  • SuPr-Ens: Ensembling separate linear subspaces, which are regularized by different subsets of hard prompts for a class.

| Set | CoOp | CoOp-Ensemble | SuPr w/o reg | SuPr | SuPr-Ens |
| --- | --- | --- | --- | --- | --- |
| Multiple Prompts |  | ✓ | ✓ | ✓ | ✓ |
| Subspace Modeling |  |  | ✓ | ✓ | ✓ |
| Regularization |  |  |  | ✓ | ✓ |
| Subspace Ensemble |  |  |  |  | ✓ |
| Base | 82.69 | 80.02 | 81.18 | 81.47 | 82.54 |
| New | 63.22 | 68.51 | 73.30 | 75.21 | 76.36 |
| H | 71.66 | 73.82 | 77.04 | 78.21 | 79.33 |

It can be seen that CoOp-Ensemble improves over CoOp by about 2.16% accuracy by learning multiple sets of soft prompts separately and ensembling the learned CoOp models. Incorporating subspace modelling on top of CoOp-Ensemble, in place of the simple ensembling, improves performance by a further 3.22%, evidencing the effectiveness of SuPr w/o reg. Furthermore, adding the hard-prompt-based regularization raises performance by another 1.17% (SuPr), which is improved by a further 1.12% through subspace ensembling (SuPr-Ens).

Comment

Dear reviewers,

Please have a look at our rebuttal and let us know if there are any further revisions or adjustments you would like us to make. Alternatively, please kindly consider raising your scores if there are no additional concerns.

Best regards,

The Authors

AC Meta-Review

The authors propose a novel subspace-based prompt learning method, named SuPr, which effectively models subspaces spanning the embeddings of both the learnable soft prompts and the textual/hard prompts. Hand-crafted prompts are further used to regularize the subspace-based alignment between hand-crafted and learnable prompts, to achieve excellent fitting of base classes and generalization to novel classes. The proposed method is evaluated on 11 diverse image classification datasets.

Pros:

  • The method is simple and efficient.
  • Many experimental results.

Cons:

  • Only marginal improvement.
  • Improvement is not consistent (e.g., EuroSAT).

The authors tried to address the reviewers' concerns very proactively. Unfortunately, only SWVW responded to the authors' rebuttal. SWVW disagrees that EuroSAT is only an outlier. The updated Table 1 also clearly shows that the improvement is not consistent. The AC agrees with SWVW's judgment and recommends rejection.

Why not a higher score

The rebuttal did not address the two weaknesses below.

  • Only marginal improvement.
  • Improvement is not consistent (e.g., EuroSAT).

Why not a lower score

N/A

Final Decision

Reject