Cross-Model Semi-Supervised Prompt Learning for Vision-Language Models
Abstract
Reviews and Discussion
This paper focuses on learning continuous soft prompts for adapting pre-trained vision-language models in a semi-supervised setting. To make the learned prompts invariant to different views of a given unlabeled sample, the authors propose a new scheme: varying the lengths of the visual and text prompts attached to these samples. The results show that the proposed method yields substantial gains in image classification performance.
Strengths
- This paper is well-written and easy to follow.
- To the best of my knowledge, the paper studies an under-explored problem: semi-supervised prompt learning for VLMs.
- Considering the two branches of the VLM, the authors propose to learn multi-modal prompts and to vary the lengths of the visual and text prompts attached to different views of the samples.
- The experiments show the superior effectiveness of XPL compared to the designed baselines.
Weaknesses
My concerns are summarized below:
- The authors illustrate the motivation of XPL with Figure 1(a,b), i.e., a large category-wise performance gap between two models leveraging different numbers of learnable prompts. I am curious whether the results in Figure 1(a,b) are averaged over several random seeds. In prompt learning, the random seed has a non-negligible effect on performance.
- For a more comprehensive comparison, I suggest that the authors add the results of zero-shot CLIP. Besides, prompt engineering methods (e.g., prompt ensembles) and more advanced prompt learning methods on labeled data should be compared.
Questions
- Are the results in Figure 1(a,b) averaged over several random seeds?
- The results of zero-shot CLIP, prompt engineering methods, and more advanced prompt learning methods should be included.
We thank Reviewer xki8 for finding the paper well-written and easy to follow. Below are our responses to the specific concerns.
Whether the results in Figure 1(a,b) are averaged over several random seeds: Thanks for the query. All the results in the paper, including those shown in Figure 1(a,b), are averaged over different random seeds.
Comparison with zero-shot CLIP and other prompt engineering methods: Being one of the first works in multi-modal semi-supervised prompt learning, we carefully designed the baselines for a comprehensive assessment. The Text Prompt Learning (TPL) and Visual Prompt Learning (VPL) baselines exactly follow the CoOp [1] approach, one with only text prompts as in the CoOp paper and the other with only visual prompts. As we found that XPL surpasses these baselines by a huge margin (refer to Figures 3 and 4 of the main paper), we refrained from adding the zero-shot CLIP values, which have been shown to be greatly inferior to the CoOp method itself [1]. However, in view of the reviewer's comment, we compare the lowest value obtained by XPL in the 1-shot setting (refer to Figure 4b in the main paper) with the 0-shot CLIP results. As observed in the table below, XPL even in the 1-shot setting surpasses the 0-shot CLIP values by a huge margin.
| | ImageNet | Caltech101 | OxfordPets | Flowers102 | UCF-101 | StanfordCars | EuroSAT | DTD |
|---|---|---|---|---|---|---|---|---|
| XPL (1-shot) | 66.1 | 92.9 | 87.4 | 87.7 | 69.4 | 65.1 | 80.1 | 53.8 |
| CoOp (1-shot) | 60.4 | 90.8 | 85.5 | 79.9 | 67.0 | 60.6 | 57.3 | 48.5 |
| CLIP (0-shot) | 59.1 | 85.0 | 84.8 | 66.2 | 60.0 | 56.4 | 39.1 | 42.2 |
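For reference, the 0-shot CLIP numbers above correspond to standard zero-shot CLIP classification, which can be sketched as below using the public openai/CLIP package; the backbone ("ViT-B/16"), prompt template, class names, and image path are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import clip  # the public openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # backbone chosen for illustration

# hypothetical class names and prompt template, for illustration only
classnames = ["forest", "river", "highway"]
text = clip.tokenize([f"a photo of a {c}." for c in classnames]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T   # scaled cosine similarities
    prediction = logits.softmax(dim=-1).argmax(dim=-1)  # zero-shot class prediction
```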
[1] Conditional prompt learning for vision-language models, CVPR 2022
Dear Reviewer xki8,
We sincerely thank you for all your efforts and apologize for sending this note regarding our paper. We have addressed all of your thoughtful questions and suggestions and have provided our responses in the rebuttal. However, as the discussion phase is still ongoing, we wondered if you might still have any concerns that we could address. Thank you for your service!
Best wishes,
Authors
This work proposes a novel framework for semi-supervised prompt learning tailored for VLMs and showcases the remarkable efficacy of the proposed method on 15 commonly used datasets. To achieve this, the work is built on two innovations: i) mutual knowledge distillation between VLMs with different prompt lengths, and ii) the combination of, and consistency between, weak and strong augmentation strategies. In my understanding, this work is largely based on existing basic methods, such as the dual-student architecture from semi-supervised learning and the combination of weak and strong augmentations from contrastive learning methods. The advantage of this work is that it is the first to investigate semi-supervised learning in VLMs. Moreover, it reveals the importance of cross-model distillation with different prompt lengths, which is interesting. It is also simple and effective on most datasets.
Strengths
The strengths of this work are as follows:
i) This work is the first to investigate semi-supervised learning for the efficient transfer learning (ETL) of VLMs.
ii) The framework of XPL is simple but effective, showing strong performance on 15 typical datasets.
iii) The mutual learning of the models with different prompt lengths looks interesting.
Weaknesses
There are a series of issues that need to be addressed.
i) I noticed that this work only validates the method against some simple baseline methods, such as CoOp and VPT. Is it possible to provide more experiments to validate the applicability of your method to recent prompt-learning-based few-shot learning works?
ii) This work lacks a theoretical analysis of why different prompt lengths are better.
iii) How the unlabeled data are utilized in "TPL^u" and "MPL^u" is not clearly explained.
iv) In Figure 18 of the supplementary, why is MPL lower than MPL^u on most datasets?
v) Is it possible to provide a comparison for 10%-50% labeled data?
vi) Conducting the ablation studies on hyper-parameters only on EuroSAT is not reasonable, given the randomness in few-shot learning.
Questions
The authors are expected to provide more thorough explanations of their method and the experiments listed in the weaknesses. Moreover, the contribution of this work should be further clarified, especially the difference or significance of the proposed method relative to existing basic methods, such as weak and strong augmentation in contrastive learning and model distillation with the dual-student architecture.
We thank Reviewer 18u3 for finding our framework simple but effective and the training policy interesting. Below are our responses to the specific concerns.
Applicability of our method to recent works on prompt learning: As ours is the first work in semi-supervised prompt learning in VLMs, we designed many possible SSL baselines for learning prompts by leveraging unlabeled data. We put forth the TPL, VPL and MPL baselines by carefully culling the SSL literature from related fields and showcase the efficacy of learning rich prompts using our framework by comparing with these over 15 datasets (refer to Figures 3 and 4 of the main paper). The main strength of XPL lies in efficient transfer learning in VLMs by exploiting the abundant unlabeled data freely available in the wild. However, for the sake of completeness and to address the reviewer's concerns regarding the applicability of XPL, we evaluate the generalizability of our approach from base to new classes (Table 2 of the main paper and Table 4 in the supplementary section) and compare the results with the more recent approach CoCoOp [1]. The following tables showcase the superiority of XPL on the individual datasets.
| EuroSAT | S | U | H |
|---|---|---|---|
| Co-CoOP | 87.49 | 60.04 | 71.21 |
| XPL | 97.80 | 58.90 | 73.52 |
| UCF-101 | S | U | H |
|---|---|---|---|
| Co-CoOP | 82.33 | 73.45 | 77.64 |
| XPL | 88.50 | 74.70 | 81.02 |
| Caltech101 | S | U | H |
|---|---|---|---|
| Co-CoOP | 97.96 | 93.81 | 95.84 |
| XPL | 98.95 | 92.49 | 95.61 |
| Oxford Pets | S | U | H |
|---|---|---|---|
| Co-CoOP | 87.49 | 60.04 | 71.21 |
| XPL | 97.80 | 58.80 | 73.44 |
| StanfordCars | S | U | H |
|---|---|---|---|
| Co-CoOP | 70.49 | 73.59 | 72.01 |
| XPL | 74.59 | 71.82 | 73.18 |
| DTD | S | U | H |
|---|---|---|---|
| Co-CoOP | 77.01 | 56 | 64.85 |
| XPL | 80.18 | 54.60 | 64.96 |
| Flowers102 | S | U | H |
|---|---|---|---|
| Co-CoOP | 94.87 | 71.75 | 81.71 |
| XPL | 98.24 | 69.87 | 81.66 |
Analysis of why different prompt lengths are better: While prompt learning enables an efficient and fast adaptation paradigm, its low capacity may not allow a single prompt learning model to achieve the best performance in all cases. As shown in Figures 1(a) and (b) of the main paper, two models varying in the number of learnable prompts (8 and 16 prompts) exhibit diverse category-wise performance: some classes in these datasets are better suited to 16 prompts while others show improvement with 8 prompts. Our co-teaching approach exploits these multiple prompt learners to acquire complementary knowledge, so they can provide better semi-supervision to each other. Further, we identify that directly using the same adaptation model to produce confident pseudo-labels for the unlabeled data may miss crucial information for certain categories. As shown in the notable works [2, 3], the discriminative power of a single model is too weak to assign a large number of high-quality pseudo-labels to the unlabeled data. A similar conjecture can be made when using a fixed prompt length, as shown in Figure 7 of the paper. Instead, the idea is to construct multiple pseudo-labels from different versions of the same unlabeled data and allow them to complement each other. In our work, we extend this idea further to a multiple-prompt learning paradigm. In addition to using different views of the unlabeled samples, the two pathways have different lengths of learnable prompts. Such a co-teaching framework enables better representation learning by not only forcing invariance to different views of the unlabeled data but also enforcing invariance towards prompts of different lengths. To the best of our knowledge, such dual invariance applied to semi-supervised VLMs has not been explored earlier. We have also showcased the effectiveness of our cross-model co-teaching design over several semi-supervised baselines in the main paper and added another baseline in response to the reviewers' queries.
Clarification on the utilization of unlabeled data in TPL, MPL: All the TPL, VPL and MPL baselines use the FixMatch [4] method as the underlying semi-supervised approach. Training on the unlabeled samples is carried out by generating pseudo-labels from the weakly augmented version of each sample and using them to compute the loss on its strongly augmented counterpart, following the FixMatch philosophy.
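For concreteness, below is a minimal PyTorch-style sketch of this FixMatch-style unlabeled loss; the `model`, `weak_aug` and `strong_aug` callables and the confidence threshold are illustrative placeholders rather than the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_unlabeled, weak_aug, strong_aug, threshold=0.95):
    """Pseudo-label the weakly augmented view, then train on the strongly augmented view."""
    with torch.no_grad():
        weak_logits = model(weak_aug(x_unlabeled))        # predictions on the weak view
        probs = weak_logits.softmax(dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)     # hard pseudo-labels and their confidences
        mask = (confidence >= threshold).float()          # keep only confident pseudo-labels

    strong_logits = model(strong_aug(x_unlabeled))        # predictions on the strong view
    per_sample_loss = F.cross_entropy(strong_logits, pseudo_labels, reduction="none")
    return (per_sample_loss * mask).mean()                # confidence-masked consistency loss
```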
In Figure 18 of the supplementary, why MPL is lower than MPL^u on most datasets: As illustrated in Section 4.2 of the main paper, the main performance gain in MPL^u is achieved by leveraging the unlabeled data. The MPL baseline utilizes only the few labeled samples, disregarding the pool of unlabeled data, which can lead to a loss of valuable knowledge. With unlabeled data, MPL^u achieves a significant gain over MPL, specifically in the low-labeled-data regime, thereby performing much better than the MPL baseline across the diverse datasets.
Comparison for higher proportions of labeled data: Although our work focuses on improving the performance on downstream tasks in the extremely low labeled-data regime, to address the reviewer's query we ran additional experiments on the EuroSAT dataset to evaluate XPL with higher proportions of labeled data -- 20% and 30%. As observed in the table below (Table 7 of the revised manuscript), the performance of XPL surpasses that of the next-best baseline MPL even in this higher labeled-data regime.
| EuroSAT | 20% | 30% |
|---|---|---|
| XPL | 98.1 | 99.2 |
| MPL | 97.4 | 98.7 |
Ablation studies on hyper-parameters for other datasets: In Section 4.3 of the main paper, we perform a sensitivity analysis of three hyperparameters -- the pseudo-label threshold, the ratio of unlabeled to labeled data, and the weight of the different loss terms -- on the EuroSAT dataset. As suggested by the reviewer, we further extend this hyperparameter sensitivity analysis to the other diverse datasets ISIC, Chest-Xray and Cropdiseases. Below are the results for the different hyperparameter settings on the respective datasets. We also modify the "Effect of Hyperparameters" paragraph in Section 4.3 of the paper to plot the average performance over these datasets for the different hyperparameters. Following the trend of EuroSAT, the same hyperparameter settings continue to be the most apt design choices for best performance.
| XPL | EuroSAT (1%) | EuroSAT (5%) | EuroSAT (10%) | ISIC (1%) | ISIC (5%) | ISIC (10%) | Chest-Xray (1%) | Chest-Xray (5%) | Chest-Xray (10%) | Cropdiseases (1%) | Cropdiseases (5%) | Cropdiseases (10%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 93.8 | 95.8 | 97.6 | 71.5 | 78.8 | 85.7 | 34.2 | 36.2 | 36.9 | 96.5 | 98.4 | 98.7 |
| | 88.3 | 90.4 | 92.8 | 63.6 | 77.2 | 84.4 | 30.3 | 35.7 | 36.0 | 93.2 | 95.8 | 97.0 |
| | 89.6 | 91.5 | 93.3 | 64.7 | 77.9 | 85.0 | 31.6 | 35.9 | 36.4 | 94.7 | 96.8 | 98.1 |
| | 93.8 | 95.8 | 97.6 | 71.5 | 78.8 | 85.7 | 34.2 | 36.2 | 36.9 | 96.5 | 98.4 | 98.7 |
| | 91.2 | 95.0 | 96.3 | 64.8 | 77.7 | 83.9 | 33.6 | 35.7 | 36.0 | 92.6 | 95.2 | 96.1 |
| | 94.1 | 96.2 | 97.7 | 71.6 | 79.0 | 86.2 | 35.0 | 36.4 | 37.1 | 96.9 | 98.8 | 99.6 |
| | 93.8 | 95.8 | 97.6 | 71.5 | 78.8 | 85.7 | 34.2 | 36.2 | 36.9 | 96.5 | 98.4 | 98.7 |
| | 92.3 | 94.9 | 96.7 | 66.3 | 75.8 | 83.4 | 30.3 | 34.7 | 34.4 | 94.7 | 97.1 | 97.6 |
| | 91.6 | 93.4 | 96.4 | 62.2 | 71.6 | 81.1 | 31.8 | 35.8 | 36.1 | 94.1 | 96.6 | 97.1 |
[1] Conditional prompt learning for vision-language models, CVPR 2022
[2] Cross-model pseudo-labeling for semi-supervised action recognition, CVPR 2022.
[3] Semi-supervised action recognition with temporal contrastive learning, CVPR 2021.
[4] FixMatch: Simplifying semi-supervised learning with consistency and confidence, NeurIPS 2020
Dear Reviewer 18u3,
We sincerely thank you for all your efforts and apologize for sending this note regarding our paper. We have addressed all of your thoughtful questions and suggestions and have provided our responses in the rebuttal. However, as the discussion phase is still ongoing, we wondered if you might still have any concerns that we could address. Thank you for your service!
Best wishes,
Authors
This paper introduces a semi-supervised, cross-model prompt learning for vision-language models (VLMs). The key idea of the paper relies on feeding different lengths of soft prompts to two pre-trained VLMs (referred to as primary and auxiliary networks). Given an unlabeled image, the authors create a pair of weakly and strongly augmented versions, pass them to the two networks, and use the confident prediction from one network as a pseudo-label for the other and vice versa. Experimental comparisons with different baselines are presented on several benchmark datasets. The supplementary material contains additional experimental analyses, visualizations and source code.
Strengths
- The paper tackles an interesting and timely topic. With the recent progress made in training powerful VLM models, it is worth studying how to prompt these pre-trained models for different downstream tasks
- The paper reads fairly well
- Source code is shared in the supplementary for reproducibility
Weaknesses
Major issues
- The work is very incremental
The technical novelty of this work is quite limited. The idea of passing a set of augmented versions of unlabeled image data through two different networks and constraining the networks to supervise each other has been explored in several existing works (e.g., [1,2]). Furthermore, the idea of deriving visual prompts directly from text prompts using a linear projection (coupling function) has been introduced in previous work (e.g., [4]), which the authors fail to cite and discuss.
- Several claims/motivations in the paper are not convincingly justified
The main claim of this paper is to use prompts of different lengths for the primary and auxiliary networks. However, the authors fail to give a convincing argument as to why that should lead to better performance. For instance, how did the authors arrive at using prompts of length N and N/2 for the primary and auxiliary networks, respectively? Why not N and N/3, or N and N/4?
The ablation experiment presented in Fig 7 of the main paper is counterintuitive to the key message of the paper. On page 5, the authors claim, "As the two models with different lengths differ in what they learn, they can complement in generating the supervision of each other". However, the results in Fig 7 show that this is not true. For instance, using the same prompt length for the two networks (N=8 or N=16) outperforms a model with different prompt lengths (N=8 for primary and N=4 for auxiliary). If different prompt lengths are indeed as important as the authors argue, how do they explain these results?
It can also be noticed from Fig 7 that a model with N=32 for the primary and N=8 for the auxiliary performs worse than the baseline model (N=16 for primary and N=8 for auxiliary). Does this mean that performance gets worse as the length difference between the two networks increases? Where is the threshold for this performance trade-off? This needs a rigorous justification, as the merit of the paper heavily relies on this argument.
Moreover, the performance of some of the baselines in Fig 7 is worse than the multimodal prompt learning (MPL) baseline. This raises the question of whether the proposed cross-model approach is indeed better than MPL, given its sensitivity to the prompt length.
- The experimental settings and comparisons are not clearly presented
There have been several works related to prompt learning in the text, vision, or multimodal domains. However, the authors' comparison fails to cite most of these works except for CoOp. Which VPL or MPL baselines are used in the paper? Did the authors design their own baselines or use previous works as baselines (but forget to cite them properly)? What are the experimental settings for these baselines? Why didn't the authors compare with recent works such as Co-CoOp [3] or MaPLe [4]?
While I appreciate the extensive comparisons on several datasets, the presented results (quantitative figures) are almost unreadable due to the very small size of the figures. It would be better to either draw bigger figures or use tables instead of figures for a clearer presentation of the experimental results.
- More ablation experiments are needed to justify the merit of the work
A simple experiment to show the benefit of the cross-model approach would be to use a single model and use the prediction of the weakly augmented input as a pseudo-label for the prediction of the strongly augmented input. However, such a baseline is missing in the paper.
It is also important to further explore what makes the cross-model approach based on one primary and one auxiliary network work. What if we use one primary and two auxiliary networks with a triplet of augmented inputs (1 weak and 2 strong - one for each auxiliary network) and each auxiliary network supervises the primary network and vice versa? Does this lead to better supervision of the primary network? These explorations would be important to strengthen this work.
Minor issues
- In the supervised training, why did the authors choose to use only the weakly augmented image?
- The paper needs some re-organization. Some of the results presented in the supplementary (e.g., Appendix B, D, E) should be in the main paper.
References
[1] Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition, CVPR 2022
[2] Semi-Supervised Semantic Segmentation with Cross-Consistency Training, CVPR 2021
[3] Conditional prompt learning for vision-language models, CVPR 2022
[4] MaPLe: Multi-modal Prompt Learning, CVPR 2023
Questions
Please refer to the questions raised in the "Weaknesses" section and try to address them carefully.
We thank Reviewer MsJj for finding our work an interesting and timely topic. Below are our responses to the specific concerns.
Novelty and advantage of co-teaching using prompts of different lengths: Thanks for the query. To the best of our knowledge, there is no existing Semi-Supervised Learning (SSL) work either in prompt learning or in its application to the multi-modal setting of VLMs. The primary challenges are two-fold: 1) the low quality of prompts learned in the presence of only a few labeled data alongside a vast set of unlabeled samples, and 2) exploiting both the text and visual modalities using prompts only, to extract rich representations from unlabeled samples. As mentioned in Section 1 of the main paper, although approaches like [1,2] employ cross-model representation learning, such paradigms have been explored neither for tuning prompts nor in the multi-modal context of VLMs, and these settings incur different challenges for learning efficient prompts with frozen VLMs. The main motivation of our co-training policy is to harness the complementary knowledge across two models whose prompt inputs have different lengths. While prompt learning enables an efficient and fast adaptation paradigm, its low capacity may not allow a single prompt learning model to achieve the best performance in all cases. As shown in Figures 1(a) and (b) of the main paper, two models varying in the number of learnable prompts (8 and 16 prompts) exhibit diverse category-wise performance: some classes in these datasets are better suited to 16 prompts while others improve with 8 prompts. Our co-teaching approach exploits these multiple prompt learners to acquire complementary knowledge and thus provide better semi-supervision to each other. Further, we identify that directly using the same adaptation model to produce confident pseudo-labels for the unlabeled data may miss crucial information for certain categories. To the best of our knowledge, such dual invariance (to different views of the unlabeled data and to prompts of different lengths) applied to semi-supervised VLMs has not been explored earlier. We have also showcased the effectiveness of our cross-model co-teaching design over several semi-supervised baselines in the main paper and added another baseline in response to the reviewer's subsequent question.
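To make the cross-model exchange concrete, here is a minimal PyTorch-style sketch of the kind of mutual pseudo-labeling described above; the `primary` and `auxiliary` pathways (e.g., with prompt lengths 16 and 8), the augmentation callables, and the threshold are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def cross_model_unlabeled_loss(primary, auxiliary, x_u, weak_aug, strong_aug, tau=0.95):
    """Each pathway pseudo-labels the weak view; the other pathway is trained on the strong view."""
    x_weak, x_strong = weak_aug(x_u), strong_aug(x_u)

    with torch.no_grad():
        probs_primary = primary(x_weak).softmax(dim=-1)      # primary pathway (longer prompts)
        probs_auxiliary = auxiliary(x_weak).softmax(dim=-1)  # auxiliary pathway (shorter prompts)

    def masked_ce(logits, teacher_probs):
        confidence, pseudo = teacher_probs.max(dim=-1)
        mask = (confidence >= tau).float()                   # keep only confident pseudo-labels
        return (F.cross_entropy(logits, pseudo, reduction="none") * mask).mean()

    # primary supervises auxiliary and vice versa, always on the strongly augmented view
    loss_auxiliary = masked_ce(auxiliary(x_strong), probs_primary)
    loss_primary = masked_ce(primary(x_strong), probs_auxiliary)
    return loss_primary + loss_auxiliary
```

Because the two pathways attach prompts of different lengths, each is exposed both to a different augmented view and to a differently prompted teacher, which is the dual invariance referred to above.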
Exploring the relation between the number of prompts in the primary and auxiliary pathways: Thanks for this interesting query. The different prompt lengths were chosen to minimize the number of learnable parameters while gaining a significant boost in performance. In Section 4.3 (the first ablation study) and in Figure 7 of the main paper, we show exactly this by considering prompt-length ratios of 4x (XPL(32,8)) and 1x (XPL(16,16)). We found that the combination of 16- and 8-length prompts for the primary and auxiliary networks, respectively, gave the best performance across the datasets. For XPL(32,8), we chose a 32-length prompt for the primary pathway and an 8-length prompt for the auxiliary. Similarly, for XPL(16,16), a prompt length of 16 was fixed for both pathways. We also detail the results in the following table. In spite of the increase in the number of learnable parameters in both the 4x and 1x scenarios, they fail to improve upon the complementary training of our original XPL(16,8).
| | EuroSAT (1%) | EuroSAT (5%) | Chest-Xray (1%) | Chest-Xray (5%) |
|---|---|---|---|---|
| XPL(32,8) | 93.2 | 95.5 | 34.3 | 35.8 |
| XPL(16,16) | 92.6 | 94.2 | 31.2 | 33.6 |
| XPL(16,8) | 93.8 | 95.8 | 34.2 | 36.2 |
Clarifications regarding the results presented in Figure 7 of the main paper: Thanks for this interesting question! While prompt learning is efficient and nicely addresses the overfitting problem of low labeled data, its low capacity may not allow a single prompt to achieve the best performance. This is illustrated in Figures 1(a) and (b) of the main paper and serves as one of the major motivations of our approach: two models with different numbers of learnable prompts (8 and 16) exhibit markedly different category-wise performance, with some classes better suited to 16 prompts and others to 8. XPL was introduced to leverage the complementary information from the two prompt sets within a single model during inference. As shown in Figure 7 of the main paper, compared to XPL with different prompt sets (row XPL(16,8)), the performance diminishes when the same number of prompts is used, whether of length 16 (row XPL(16,16)) or length 8 (row XPL(8,8)), showing the utility of using different prompt lengths in the primary and auxiliary models.
| | EuroSAT (1%) | EuroSAT (5%) | Chest-Xray (1%) | Chest-Xray (5%) |
|---|---|---|---|---|
| XPL(8,8) | 91.7 | 93.2 | 30.6 | 32.9 |
| XPL(16,16) | 92.6 | 94.2 | 31.2 | 33.6 |
| XPL(16,8) | 93.8 | 95.8 | 34.2 | 36.2 |
| XPL(32,8) | 93.2 | 95.5 | 34.3 | 35.8 |
However, to address the query of the reviewer: as illustrated in Section 4.3 of the main paper, models using the same prompt length, XPL(16,16) or XPL(8,8), outperform the model with different prompt lengths XPL(8,4) because using a much shorter prompt length of just 4 inhibits the capacity of the prompts and, as a result, the auxiliary path is not able to make any meaningful contribution. The very low capacity of the smaller prompt model brings down the overall performance.
The performance of XPL(32,8) is very comparable to that of XPL(16,8). As stated in the first ablation study of the main paper (Section 4.3, last line of the first ablation paragraph), the minor drop in accuracy for XPL(32,8) is possibly due to a large mismatch in the capacities of the two paths. To gauge the threshold at which the difference in prompt length between the primary and auxiliary pathways trades off performance, we started the analysis using the same prompt length for both pathways and gradually increased the difference in prompt lengths until reaching a satisfactory boost in performance.
Comparing the individual dataset plots between MPL (Figure 3 of the main paper) and our XPL along with its variations (Figure 7 of the main paper), it is observed that XPL(16,8), XPL(16,16), XPL(8,8) and XPL(32,8) consistently perform better than MPL across almost all datasets. Note that the MPL baseline uses a prompt length of 16. XPL(8,4) falls behind because a prompt length of just 4 fails to capture the class-wise discrimination, leading to poor-quality representations across both modalities of the VLM.
Comparison with other works like CoCoOp: Works such as CoCoOp [3] focus only on the generalizability of the approach from base to new classes without caring for parameter efficiency in transfer learning. The main strength of XPL lies in efficient transfer learning in VLMs by exploiting the abundant unlabeled data freely available in the wild. However, for the sake of completeness, we also provide the base-to-new generalizability performance of XPL in the ablation (Table 2 of the main paper and Table 4 in the supplementary section). As suggested by the reviewer, we compare the results of XPL with those of CoCoOp in this rebuttal for the datasets that are common between our work and CoCoOp. The following tables showcase the superiority of XPL on the individual datasets.
| EuroSAT | S | U | H |
|---|---|---|---|
| Co-CoOP | 87.49 | 60.04 | 71.21 |
| XPL | 97.80 | 58.90 | 73.52 |
| UCF-101 | S | U | H |
|---|---|---|---|
| Co-CoOP | 82.33 | 73.45 | 77.64 |
| XPL | 88.50 | 74.70 | 81.02 |
| Caltech101 | S | U | H |
|---|---|---|---|
| Co-CoOP | 97.96 | 93.81 | 95.84 |
| XPL | 98.95 | 92.49 | 95.61 |
| Oxford Pets | S | U | H |
|---|---|---|---|
| Co-CoOP | 87.49 | 60.04 | 71.21 |
| XPL | 97.80 | 58.80 | 73.44 |
| StanfordCars | S | U | H |
|---|---|---|---|
| Co-CoOP | 70.49 | 73.59 | 72.01 |
| XPL | 74.59 | 71.82 | 73.18 |
| DTD | S | U | H |
|---|---|---|---|
| Co-CoOP | 77.01 | 56 | 64.85 |
| XPL | 80.18 | 54.60 | 64.96 |
| Flowers102 | S | U | H |
|---|---|---|---|
| Co-CoOP | 94.87 | 71.75 | 81.71 |
| XPL | 98.24 | 69.87 | 81.66 |
Baseline with a single model using the prediction on the weakly augmented input as a pseudo-label for the prediction on the strongly augmented input: Thanks for the suggestion. This is exactly the MPL baseline of the main paper. In MPL, we use a single network, and training on the unlabeled samples is carried out by generating pseudo-labels from the weakly augmented version and using them to compute the loss on the strongly augmented counterpart, following the FixMatch philosophy. As shown in Figure 3 of the main paper, the performance of MPL is consistently worse than that of XPL across all datasets. This is in accordance with the common finding in the semi-supervised learning literature [1, 2] and can be attributed to noisy and incorrect pseudo-labels.
Experiment with one primary and two auxiliary networks with a triplet of augmented inputs (1 weak and 2 strong, one for each auxiliary network): We appreciate this suggestion. We ran an additional experiment using one primary and two auxiliary networks with a triplet of augmented inputs, where each auxiliary network supervises the primary network and vice versa. We keep prompt lengths of 16 and 8 for the primary and the first auxiliary network, respectively, as used in XPL. For the additional auxiliary branch, we use a prompt length of 4. We evaluate this approach on two diverse datasets, EuroSAT and ISIC, as presented in the table below (also added as Table 6 of the supplementary in the revised manuscript). We can see that adding one more auxiliary pathway does help to boost the performance, cementing our proposition of leveraging cross-model training for complementary knowledge, and the gain is consistent across the 1%, 5% and 10% proportions of labeled data. However, it should be noted that using an additional auxiliary pathway increases the learnable parameters and computation, quickly leading to diminishing returns.
| | EuroSAT (1%) | EuroSAT (5%) | EuroSAT (10%) | ISIC (1%) | ISIC (5%) | ISIC (10%) |
|---|---|---|---|---|---|---|
| XPL(16,8,4) | 94.2 | 96.6 | 98.2 | 73.2 | 80.1 | 87.8 |
| XPL(16,8) | 93.8 | 95.8 | 97.6 | 71.5 | 78.8 | 85.7 |
[1] Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition, CVPR 2022
[2] Semi-Supervised Semantic Segmentation with Cross-Consistency Training, CVPR 2021
[3] Conditional prompt learning for vision-language models, CVPR 2022
Dear Reviewer MsJj,
We sincerely thank you for all your efforts and apologize for sending this note regarding our paper. We have addressed all of your thoughtful questions and suggestions and have provided our responses in the rebuttal. However, as the discussion phase is still ongoing, we wondered if you might still have any concerns that we could address. Thank you for your service!
Best wishes,
Authors
The paper introduces Cross-model Prompt Learning (XPL), a semi-supervised approach for prompt learning in Vision-Language Models (VLMs), aiming to reduce the dependency on large labeled datasets. XPL employs dual pathways with variable soft prompt lengths to utilize unlabeled data for enhancing model performance in low-labeled-data regimes. The approach is validated on 15 datasets, showing that XPL outperforms the supervised baseline, particularly in few-shot classification tasks.
Strengths
- The method's ability to leverage unlabeled data effectively could substantially reduce the need for large labeled datasets.
- The approach is empirically validated across 15 diverse datasets, demonstrating its effectiveness and robustness in various contexts.
Weaknesses
- The approach primarily extends semi-supervised learning (SSL) principles to prompt learning with minimal adaptation, which may suggest that the level of novelty is somewhat constrained.
- There is a concern regarding the robustness of the learned prompts, as they appear to be highly sensitive to the distribution of the dataset, potentially limiting their applicability in diverse real-world scenarios.
Questions
- How does the proposed method differ fundamentally from existing SSL applications in prompt learning, and what specific adaptations have been made to tailor this approach to VLMs?
- Can the authors provide more insight into how the method would perform on out-of-distribution data or datasets with different characteristics than those tested?
- What measures have been taken to ensure that the learned prompts are not overly fitted to the specific datasets used in the experiments?
- Conduct additional experiments on out-of-distribution datasets or through domain adaptation challenges to evaluate the robustness and generalizability of the learned prompts.
We thank Reviewer f2tm for acknowledging that our approach could substantially reduce the need for large annotated datasets. Below are our responses to the specific concerns.
Novelty and difference from existing SSL applications in prompt learning: Thanks for the query. To the best of our knowledge, there is no existing Semi-Supervised Learning (SSL) work either in prompt learning or in its application to the multi-modal setting of VLMs. The primary challenges are two-fold: 1) the low quality of prompts learned in the presence of only a few labeled data alongside a vast set of unlabeled samples, and 2) exploiting both the text and visual modalities using prompts only, to extract rich representations from unlabeled samples. As shown in Figures 1(a) and (b) of the main paper, two models leveraging unlabeled data but with different numbers of learnable prompts exhibit markedly different category-wise performance. Typically (as shown by [1, 2]), in the low-labeled-data regime the learned representations tend to lack enough discriminative power for downstream tasks, thereby also failing to generate high-quality pseudo-labels. In such a constrained scenario, a combination of multiple models can come to the rescue. However, such a multiple-model approach has not been applied to learning prompts in a semi-supervised setting for large VLMs. Our XPL uniquely employs a cross-model co-training policy for learning high-quality prompts. This novel prompt learning setup for a large frozen VLM harnesses the complementary knowledge across two models whose prompt inputs have different lengths. In addition to using different views of the unlabeled images, the two pathways have different lengths of learnable prompts. Such a co-teaching framework enables better representation learning by not only forcing invariance to different views of the unlabeled data but also enforcing invariance towards prompts of different lengths. With respect to existing conventional SSL approaches, our carefully designed baselines do employ the standard SSL techniques for prompt learning in VLMs, as illustrated in the following paragraph.
As ours is the first work in semi-supervised prompt learning in VLMs, we designed several possible SSL baselines (namely TPL, VPL and MPL) by carefully culling the SSL literature from related fields and showcase the efficacy of our framework by comparing with these over 15 datasets (refer to Figures 3 and 4 of the main paper). These baselines use FixMatch [3], a standard approach for semi-supervised image classification, as the underlying semi-supervised method. As mentioned in Section 4.1, TPL, VPL and MPL make use of text prompts, visual prompts, and both text and visual prompts, respectively. Training on the unlabeled samples is carried out by generating pseudo-labels from the weakly augmented version and using them to compute the loss on the strongly augmented counterpart, following the FixMatch philosophy and without using multiple models. As can be observed from all the plots in Figures 3 and 4 of the main paper, XPL (which additionally uses multiple models) outperforms all these baselines, showing the effectiveness of the cross-model design. XPL improves over MPL (the strongest baseline) on average even in the 1-shot scenario, and offers a significant jump on the fine-grained DeepWeeds dataset in the 1-shot setup. Further, we have also demonstrated the effectiveness of our cross-model XPL over other traditional semi- and self-supervised approaches: pseudo-labeling [4], which does not employ co-teaching, and MoCo [5], which employs a momentum encoder (refer to the corresponding figure in the main paper). As observed, our XPL outperforms both the PL and the self-supervised MoCo baselines for all considered datasets across all scenarios.
Performance on other out-of-distribution datasets: As shown in Section 4.3 of the main paper, we showcase the robustness and generalizability of the prompts learned by XPL through domain-shift experiments where the labeled and unlabeled data come from two different distributions. From the results in Table 1 of the main paper, we observe that XPL is robust compared to the next-best baseline MPL on the complex DomainNet dataset. To further evaluate the robustness and generalizability of the learned prompts, we ran additional experiments on another benchmark dataset, Office-31 [6]. We follow domain-shift scenarios similar to those in the DomainNet experiments (refer to Table 1 of the paper), considering one setting where all unlabeled data belong to the source domain and another where all unlabeled data belong to the target domain; each domain pair in the table below is reported under these two unlabeled-data settings. As observed from the table (Table 5 of the revised manuscript), XPL maintains its advantage over the next-best baseline MPL across all domain-shift scenarios on Office-31 as well. XPL not only gives an accuracy boost over MPL within each setting for almost all scenarios, but XPL under the harder setting even surpasses MPL under the easier one. This strongly signifies the ability of our cross-model XPL approach to learn prompts that are robust and generalizable enough to harness richer representations even from out-of-distribution data.
| Office-31 | A → W | A → W | W → D | W → D | D → A | D → A |
|---|---|---|---|---|---|---|
| MPL | 82.8 | 81.7 | 86.4 | 85.2 | 84.2 | 81.9 |
| XPL | 84.7 | 84.0 | 88.2 | 87.1 | 85.5 | 84.6 |
Measures to handle overfitting: Thanks for the interesting question. One of the key motivations for employing a co-teaching paradigm in XPL is to make the learned prompts robust and generalizable, tackling the issue of overfitting; these qualities are already portrayed in the domain-shift experiments. We also employ several regularization techniques. In particular, we implement a consistency regularization policy across the primary and auxiliary pathways: while conventional consistency regularization is applied only across the augmented versions of the input samples, we extend the regularization to also account for the different lengths of prompts associated with the inputs of the two pathways. The training therefore generalizes across both the input augmentations and the prompt lengths. As different prompt lengths cater to different per-class performances, this encourages more generalized prompt representations. In addition, we employ other regularization techniques such as dropout along with layer and batch normalization.
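Schematically, and in our own notation rather than the paper's (λ is the unlabeled-loss weight and CE the cross-entropy), the kind of objective described above can be written as:

```latex
\mathcal{L}
= \underbrace{\mathrm{CE}\big(f_p(x_l^{w}),\, y\big) + \mathrm{CE}\big(f_a(x_l^{w}),\, y\big)}_{\text{supervised loss on labeled data}}
+ \lambda\, \underbrace{\Big[\, m_a\, \mathrm{CE}\big(f_p(x_u^{s}),\, \hat{y}_a\big) + m_p\, \mathrm{CE}\big(f_a(x_u^{s}),\, \hat{y}_p\big) \Big]}_{\text{cross-pathway consistency on unlabeled data}}
```

Here f_p and f_a denote the primary and auxiliary pathways with their different prompt lengths, x^w and x^s the weakly and strongly augmented views of a labeled sample x_l or unlabeled sample x_u, ŷ_p and ŷ_a the pseudo-labels each pathway produces from the weak view, and m_p, m_a ∈ {0, 1} indicate whether the corresponding pseudo-label confidence exceeds the threshold.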
References:
[1] Cross-model pseudo-labeling for semi-supervised action recognition, CVPR 2022.
[2] Semi-supervised action recognition with temporal contrastive learning, CVPR 2021.
[3] FixMatch: Simplifying semi-supervised learning with consistency and confidence, NeurIPS 2020
[4] Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, ICML 2013.
[5] Momentum contrast for unsupervised visual representation learning, CVPR 2020.
[6] Adapting visual category models to new domains, ECCV 2010
Dear Reviewer f2tm,
We sincerely thank you for all your efforts and apologize for sending this note regarding our paper. We have addressed all of your thoughtful questions and suggestions and have provided our responses in the rebuttal. However, as the discussion phase is still ongoing, we wondered if you might still have any concerns that we could address. Thank you for your service!
Best wishes,
Authors
We would like to thank all the reviewers for their constructive comments! We are encouraged that the reviewers appreciate our work for being the first to investigate semi-supervised learning for efficient transfer learning of VLMs to different downstream tasks (Reviewer 18u3), an interesting and timely topic for the community (Reviewer MsJj). We are glad that they find our work (a) to focus on an under-explored problem, semi-supervised prompt learning for VLMs (Reviewer xki8); (b) easy to understand (Reviewer xki8) and well written (Reviewer MsJj), with extensive experiments verifying the effectiveness of the proposed approach (Reviewers f2tm, 18u3); and (c) an interesting and efficient cross-model approach for finetuning vision-language models (Reviewer 18u3). We especially thank Reviewer MsJj for acknowledging our effort towards reproducibility.
We have addressed all the questions that the reviewers posed with additional experimental comparisons and clarifications. All of these additional experiments and suggestions have been added into the updated manuscript (changes are highlighted in blue). We kindly request the reviewers to have a look.
This paper introduces Cross-model Prompt Learning (XPL), a semi-supervised approach for prompt learning in Vision-Language Models (VLMs). It leverages dual pathways with variable soft prompt lengths to utilize unlabeled data, outperforming supervised baselines, particularly in few-shot classification tasks. The approach is innovative, but concerns about the novelty, robustness of learned prompts, and clarity in comparisons impact the overall contribution. Additional experiments on out-of-distribution datasets and ablation studies, along with improved theoretical justifications, are needed to strengthen the paper. Addressing these issues would enhance the significance of XPL in the semi-supervised VLM domain.
After the rebuttal, the above weaknesses were not well addressed and no reviewers were willing to increase their scores. The AC recommends rejection.
Why not a higher score
Besides the weaknesses above, due to the development of multimodal LLMs, prompt tuning for classification is now a bit outdated.
Why not a lower score
N/A
Reject