Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models
We introduce a Bayesian prompt learning approach that learns class-specific stochastic prompts for CLIP.
Abstract
Reviews and Discussion
The paper presents a novel approach to prompt tuning in Vision and Language Models (VLMs), where a distribution over prompts is learned in a per-class fashion. For a target class, the prompt embeddings are drawn from a latent distribution parameterized by a small learnable MLP, similar to (Derakhshani et al., 2022). However, contrary to (Derakhshani et al., 2022), the prompts are learned for each image class, and they are computed from patch embeddings rather than from the holistic image. In addition to the extended distribution over the prompts, the paper proposes an optimal-transport loss to align the image and text features. A broad set of experiments is conducted to showcase the benefits of the proposed approach w.r.t. CoOp and CoCoOp.
Strengths
The paper is technically sound and well motivated. On the one hand, extending the pool of prompts is appealing to improve generalization to new classes. On the other hand, the idea of using optimal transport to align the image and text probabilities sounds novel to me, which brings a new optimization technique to the domain of vision and language pre-training with good results.
The experiments follow the standard protocols, providing superior performance to CoOp and CoCoOp and showcasing the importance of having a proper set of prompts from which one can sample in a task-specific manner.
The paper is well documented, and while the writing can be improved (see below), the narrative is easy to follow and understand. The authors provide code with their submission that hopefully will be made publicly available for reproducibility.
Weaknesses
While acknowledging the novelty of extending the sampling pool of prompts to be patch-specific and class-specific, I wonder to what extent such novelty is merely marginal w.r.t. the framework proposed by Derakhshani et al. In my opinion, this extension is rather marginal, and while the authors have the merit of being the first to apply such an extension, the technical contribution in this sense seems small to me.
The use of optimal transport is in general well motivated, but the results shown in Table 1 and Figure 7(a) are a bit worrying, in the sense that adding this loss is in many cases detrimental. While novel, it is worth questioning whether its contribution is significant.
I wonder why the comparisons against state-of-the-art works dismiss ProDA (Lu et al., 2022) and Derakhshani et al. (2022).
The writing needs to be improved; while the narrative is well threaded, I believe the paper would benefit from proofreading.
I might have missed this point, but a question I have is to what extent having per-class prompts is beneficial and how this is applied to new classes. A proper description of this inference scenario would be desirable.
Questions
All my concerns are addressed above.
We thank reviewer zJm8 for the positive comments. Below, we address the concerns raised in your review.
Q1: ... but I am not sure if the results shown in Table 1 and Figure 7 (a) ...
Specifically, from Table 1 and Figure 7(a), we find that applying CT in isolation (P-Prompt) achieves a better score than the baseline CoCoOp in most cases, which shows the effectiveness of the CT module. Besides, our PBPrompt, which combines the CT and Bayesian frameworks, outperforms the other variants.
Q2: comaprison with ProDA and VPT (Derakshhani et al. 2022)
H scores of CoCoOp, ProDA, PBPrompt, and CoOp+VPT on the Base-to-New task
| Method | ImageNet | Caltech | Pets | Cars | Flowers | Food | Aircraft | SUN | DTD | EuroSAT | UCF | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CoCoOp | 73.10 | 95.84 | 96.43 | 72.01 | 81.71 | 90.99 | 27.74 | 78.27 | 64.85 | 71.21 | 77.64 | 75.83 |
| ProDA | 72.72 | 95.68 | 96.62 | 72.91 | 80.66 | 89.43 | 35.46 | 77.79 | 66.44 | 73.88 | 78.04 | 76.65 |
| VPT | 73.34 | 94.62 | 96.61 | 70.21 | 74.40 | 91.01 | 31.54 | 75.77 | 58.18 | 69.75 | 73.92 | 73.34 |
| PBPrompt | 73.76 | 96.66 | 96.92 | 73.02 | 83.12 | 91.22 | 34.64 | 78.35 | 66.41 | 80.34 | 79.51 | 77.86 |
Thank you for your valuable suggestion. We did not report the results of ProDA and VPT because their code has not been released. Fortunately, we found some results in previous works [1,2], and we report the comparison above (detailed results can be found in Table C.7 and Table C.9 in the Appendix). From these results, we find that our proposed PBPrompt outperforms ProDA and VPT with significant improvements in most cases, which demonstrates the superiority of our approach.
Q3: For the new class
We model the posterior as $q(\mathbf{z}_c) = \mathcal{N}(u(\mathbf{e}_c), \Sigma(\mathbf{e}_c))$, where $\mathbf{e}_c$ denotes the class embedding of the $c$-th class (the BERT embedding is used in our experiments). Thus, we can infer the distribution based on the embedding of a new class, which is easy to obtain in practice. We will add this to our revision.
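For illustration, a minimal sketch of this inference step is given below; the module structure and names (e.g., `PosteriorNet`, `embed_dim`) are hypothetical, not our released code. The posterior network maps a class-name embedding to a Gaussian, and the mean is used for an unseen class.

```python
# Hedged sketch (not the released code): querying a label-specific posterior
# q(z_c) = N(u(e_c), Sigma(e_c)) for an unseen class at test time.
# All module/variable names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class PosteriorNet(nn.Module):
    """Maps a class-name embedding e_c to the mean and log-variance of z_c."""
    def __init__(self, embed_dim: int = 512, latent_dim: int = 64):
        super().__init__()
        self.mu = nn.Linear(embed_dim, latent_dim)       # u(e_c)
        self.logvar = nn.Linear(embed_dim, latent_dim)   # log Sigma(e_c), diagonal

    def forward(self, e_c: torch.Tensor):
        return self.mu(e_c), self.logvar(e_c)

posterior = PosteriorNet()
e_new = torch.randn(1, 512)   # stand-in for the embedding of a new class name
mu, logvar = posterior(e_new)
z_new = mu                    # test time: use the posterior mean (no sampling)
# z_new is then fed to the prompt generator to produce the prompt for the new class.
```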
The paper proposes a method to address the prompt engineering problem. It generates stochastic prompts from randomly sampled inputs and a learnable self-attention generator, then aligns the text embeddings with the image embeddings using a bidirectional distance. The model is trained by optimizing the ELBO. The experimental results and ablation studies show the effectiveness of the proposed method.
Strengths
- The paper is well-written and easy to understand
- The idea of generating label-specific stochastic prompts is novel
- Results on multiple tasks validate the effectiveness of the proposed method
Weaknesses
- I'm thinking about the motivation for generating stochastic prompts. On one hand, there are different ways to describe a given class (e.g., "a dog that chews bones", "puppies are good friends of people"). On the other hand, we can always combine multiple prompts into one single, long prompt. Is it possible that the "stochastic" method works just because it provides more input variance during training and therefore reduces overfitting? (Especially when fine-tuning data is limited.)
Questions
- In Figure 7(b), I see that not all randomly generated prompts correlate with the class label well (e.g., the visualizations of the dog image). Is there any randomness in inference? How about the variances?
Details of Ethics Concerns
N/A
We thank reviewer EsJ6 for the positive comments. Below, we address the concerns raised in your review.
Q1: Is it possible that "stochastic" method works just because it provides more input variance in the training, therefore reduces the overfitting
For a given class, previous deterministic methods represent the class as a single point in the latent embedding space (the output of even a powerful single prompt remains a point), which may fail to cover the diverse visual attributes of the class. In this paper, we propose stochastic prompt generation under a Bayesian framework, where a class is modeled as a distribution in the embedding space. The stochastic method can be viewed as a Bayesian extension of deterministic methods that introduces variance (uncertainty), showing greater potential in modeling complex and structured data. Due to this distributed representation, the stochastic method captures diverse visual concepts of the given class, resulting in a more comprehensive label representation.
Q2: ...Are they any randomness in inference? How about the variances?
Thank you for your careful reading! One of the main purposes of Fig. 7(b) is to show the diversity of the prompts, and it is reasonable that not every randomly generated prompt correlates with the class label well. After training on a dataset, the learned posterior of a given class captures diverse visual semantics; some sampled prompts may focus on the surrounding environment and thus attend to the background with some probability.
Like previous Bayesian models, at the test stage we use the mean of the posterior to generate the prompt, which reduces the variance and results in robust performance.
This paper presents a Bayesian probabilistic approach to prompt tuning. In this method, label-specific stochastic prompts are generated hierarchically. This involves sampling a latent vector from an underlying distribution and utilizing a lightweight generative model. Additionally, a regularization technique is introduced to minimize the statistical distance between visual patches and linguistic prompts.
Strengths
- The paper is well-written and easy to follow.
- Experiments have shown that the proposed method outperforms baseline methods.
Weaknesses
- The concept of incorporating Bayesian neural networks for prompt learning was previously presented in "Improving Zero-Shot Generalization for CLIP with Synthesized Prompts, ICCV23." This diminishes the novelty of Bayesian prompt tuning. It would be nice to differentiate these two works. Also, the regularization seems to be a simple utilization of conditional transport.
- Some SOTA methods are missing from the experiments, e.g., "Improving Zero-Shot Generalization for CLIP with Synthesized Prompts, ICCV23", "Self-regulating Prompts: Foundational Model Adaptation without Forgetting, ICCV23", and "MaPLe: Multi-modal Prompt Learning, CVPR23". It is not adequate to compare only against the old baseline CoCoOp. More importantly, compared with these SOTA methods, the performance of the proposed method is much worse.
- The ablation study should be conducted on ImageNet, as it would be nice to see the effectiveness of the proposed method on the most challenging dataset.
Questions
Please refer to the weakness part.
Q3: Ablations on ImageNet
Table 1: Results of various variants on the few-shot task on ImageNet
| Dataset | Methods | 1 shot | 2 shots | 4 shots |
|---|---|---|---|---|
| ImageNet | | 68.27 | 69.30 | 69.92 |
| | | 69.03 | 69.79 | 70.23 |
| | PBPrompt | 69.55 | 69.90 | 70.50 |
Table 2: Ablation results of PBPrompt on ImageNet with 50 more training epochs.
Note that the values in brackets denote the difference from the original results in Table C.7 in the manuscript.
| | ImageNet |
|---|---|
| Base | 76.97 (+0.07) |
| New | 70.12 (-0.75) |
| H | 73.36 (-0.40) |
Following your advice, we have reported the ablation results on ImageNet above. (Due to limited time, we only report the ablation results that were newly conducted during the rebuttal period; more details about these ablation studies can be found in Appendix C.10. We will add the results for the number of Monte Carlo samples and the coefficient if necessary.) We find that our approach shows robustness and effectiveness on the ImageNet dataset.
We thank the reviewer again for valuable suggestions, which helped us improve the quality of the submission.
We thank reviewer H1KY for the comments and suggestions. Below, we address the concerns raised in your review. We hope our efforts can help you improve your rating of the paper.
Q1: Novelty of this paper.
We first thank the reviewer for the suggested paper. It is nice to see that several studies have been proposed to introduce uncertainty into prompt tuning, which is a critical challenge for robust prompt searching. We summarize the main differences between SHIP and our method below:
(1) Modeling of the latent variable $\mathbf{z}$. Both SHIP and PBPrompt introduce uncertainty into the prompt generation process. However, the latent variable ($\mathbf{z}_c$ in PBPrompt) models different levels of uncertainty and comes from different assumptions. SHIP introduces stochastic prompts for each image and infers a sample-dependent posterior:
$$
q(\mathbf{z}_i) = \mathcal{N}(u(\mathbf{x}_i), \Sigma(\mathbf{x}_i)),
$$
where $\mathbf{x}_i$ denotes the feature of the $i$-th image. In contrast, PBPrompt views each category as an underlying distribution and infers a label-specific posterior:
$$
q(\mathbf{z}_c) = \mathcal{N}(u(\mathbf{e}_c), \Sigma(\mathbf{e}_c)),
$$
where $\mathbf{e}_c$ denotes the embedding of the $c$-th category.
(2) Prior on $\mathbf{z}$. SHIP simply adopts the standard Gaussian as the prior of $\mathbf{z}_i$, i.e., $p(\mathbf{z}_i) = \mathcal{N}(\mathbf{0}, \mathbf{I})$, while PBPrompt utilizes a contextual prior $p(\mathbf{z}_c \mid \mathbf{e}_c)$ to capture label-specific features. This difference enables PBPrompt to access additional label semantics, achieving better prior guidance.
(3) Training pipelines. SHIP introduces an additional feature reconstruction loss to pre-train the VAE, and then finetunes the prompt via the task-specific loss. Our PBPrompt naturally integrates the stochastic prompts into the CLIP framework and directly optimizes the prompt via the combined ELBO.
As discussed above, we want to note that SHIP and our PBPrompt are quite different in terms of the generation of $\mathbf{z}$, the prior on $\mathbf{z}$, and the training pipeline. We will add these discussions in our revision.
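To make the contrast in (1) and (2) concrete, below is a minimal sketch; it is illustrative only and not taken from either codebase, and the module names, dimensions, and the diagonal-Gaussian parameterization of the contextual prior are assumptions.

```python
# Hedged sketch contrasting the two posteriors and priors discussed above;
# function names, dimensions, and the exact prior form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.distributions as D

latent_dim, feat_dim = 64, 512
enc_q = nn.Linear(feat_dim, 2 * latent_dim)   # posterior network: input -> (mu, logvar)
enc_p = nn.Linear(feat_dim, 2 * latent_dim)   # contextual prior network (PBPrompt only)

def gaussian(params):
    mu, logvar = params.chunk(2, dim=-1)
    return D.Normal(mu, (0.5 * logvar).exp())

# SHIP-style: posterior conditioned on an image feature x_i, prior is N(0, I).
x_i = torch.randn(1, feat_dim)
q_ship = gaussian(enc_q(x_i))
p_ship = D.Normal(torch.zeros(latent_dim), torch.ones(latent_dim))

# PBPrompt-style: posterior conditioned on the class embedding e_c,
# regularized toward a contextual prior p(z_c | e_c).
e_c = torch.randn(1, feat_dim)
q_pb = gaussian(enc_q(e_c))
p_pb = gaussian(enc_p(e_c))

kl_ship = D.kl_divergence(q_ship, p_ship).sum()  # sample-dependent KL term
kl_pb = D.kl_divergence(q_pb, p_pb).sum()        # label-specific KL term
```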
Please also note that it is not trivial to apply CT directly to prompt tuning. Unlike previous prompt tuning methods that often represent the image and label with global features (e.g., the <CLS> and <EOS> embeddings), we view the image patches and prompt embeddings as two discrete distributions. This makes it natural to apply CT to regularize the alignment between the vision and language domains.
Q2: Missed baselines and weak performance
Table 1: Additional Base-to-New results. The results of CoOp + VPT and CoOp + SHIP are copied from the SHIP paper.
| Method | Average | ImageNet | Caltech 101 | Oxford Pets | Stanford Cars | Flowers 102 | Food 101 | FGVC Aircraft | SUN 397 | DTD | EuroSAT | UCF 101 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CoOp + VPT | 73.34 | 72.60 | 94.62 | 96.61 | 70.21 | 74.40 | 91.01 | 31.54 | 75.77 | 58.18 | 69.75 | 73.92 |
| CoOp + SHIP | 76.73 | 72.79 | 96.36 | 93.01 | 71.14 | 83.06 | 90.87 | 33.28 | 77.35 | 64.65 | 76.22 | 78.91 |
| PBPrompt | 77.86 | 73.76 | 96.66 | 96.92 | 73.02 | 83.12 | 91.22 | 34.64 | 78.35 | 66.41 | 80.34 | 79.51 |
Table 2: Additional cross-dataset results. The results of CoOp + VPT and CoOp + SHIP are copied from the SHIP paper.
| Method | Caltech | Pets | Cars | Flowers | Food | Aircraft | SUN | DTD | EuroSAT | UCF | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CoOp + VPT | 93.67 | 89.27 | 65.50 | 70.20 | 86.27 | 22.13 | 66.57 | 46.93 | 47.43 | 67.21 | 65.51 |
| CoOp + SHIP | 94.04 | 90.38 | 65.55 | 69.67 | 86.40 | 21.90 | 66.26 | 45.69 | 48.17 | 68.52 | 65.69 |
| PBPrompt | 94.87 | 90.62 | 66.00 | 72.44 | 86.34 | 24.82 | 67.69 | 45.62 | 47.13 | 68.83 | 66.40 |
Following your advice, we have reported the additional results of SHIP above. For a fair comparison, we report the results of CoOp + SHIP on the base-to-new task and the cross-dataset transfer learning task in Table 1 and Table 2, respectively. We find that our approach outperforms SHIP in most cases, which indicates the effectiveness of our label-specific Bayesian generation.
As for the suggested PromptSRC and MaPLe, we note that they belong to multimodal prompt tuning and aim to optimize both the vision and text prompts, which is beyond the scope of this paper. It would be unfair to compare the proposed textual prompt tuning method PBPrompt with such multimodal methods.
This paper proposes to hierarchically generate label-specific stochastic prompts using generative modules from a sampled noisy latent vector. Then, a conditional transport framework is employed to establish a relationship between visual patches and textual prompts. Several experiments are performed in the few-shot, transfer learning, domain generalization, and base-to-new settings, using ViT-B/16 and RN50 as the backbones.
Strengths
- The idea of using a noisy latent vector combined with a deterministic mapping to generate diverse prompts, thereby alleviating the overfitting issue in vision-language prompt learning, is meaningful.
- The paper is well-organized and easy to follow.
Weaknesses
- In Fig. 3, the reported PLOT results using ViT-B/16 were run by this submission. However, these results are very different from those reported by PLOT on GitHub (https://github.com/CHENGY12/PLOT/tree/main/plot-pp). According to those results, PLOT achieves better few-shot performance when using ViT-B/16 as the visual backbone.
- The PLOT base-to-new experiment using ViT-B/16 reproduced in this paper also lacks credibility, considering that the performance of the proposed method and PLOT are comparable when similar experiments are performed using RN50 as the visual backbone.
- In my view, the primary reference for this paper is PLOT, and therefore it needs to be compared to PLOT as exhaustively as possible. However, this paper lacks some important comparisons. For example, PLOT mainly employs RN50 as the visual backbone; although this paper has added few-shot and base-to-new experiments using RN50 as the backbone, the domain generalization experiments using RN50 are missing.
- I appreciate the proposed Stochastic Prompt Generation; however, the conditional transport seems not meaningful. In the main text, the authors only claim that OT needs two stages. However, the first stage of OT is not time-costly in the CoOp-related experiments. So, what is the main contribution of using CT instead of OT?
- The paper lacks an ablation study on the proposed conditional transport and optimal transport (OT). We need to compare experiments using SPG and CT with experiments using SPG and OT to determine whether the proposed CT is meaningful.
- In PLOT, the number of prompts is set to 4. However, this paper only uses C for the number in Eq. (4) without stating the exact value in the experimental details, which may result in unfair comparisons. I also find that the Monte Carlo sampling number is set to 20 as the default setting. Does this Monte Carlo sampling number correspond to the number of PLOT prompts? If yes, this is unfair; please conduct fair experiments and explain the reason.
- The learnable parameters shown in Table C.9 indicate that the proposed approach uses many more parameters compared with CoCoOp and PLOT. I would like to know the composition of these parameters and whether the additional parameters, rather than the suggested method, are the reason for the performance improvement.
- Multi-modal approaches such as CoPrompt [1], MaPLe [2], and VioLET [3] can achieve much better base-to-new performance using ViT-B/16 as the visual backbone. I understand that the proposed approach only tunes the language branch; however, I wonder whether the proposed approach can further improve the multi-modal approaches.
- From the ablation studies in Table 1, P-Prompt shows better performance compared to B-Prompt, while in Figure 7(a), B-Prompt exhibits better few-shot performance. These results indicate that the proposed CT is useful for generalization while SPG accounts for better supervised performance. This may not be intuitive, as the proposed Bayesian approach is more capable of introducing uncertainty, thereby enhancing generalization performance and reducing overfitting (also mentioned in Sec. 2.2).
[1] Roy, Shuvendu, and Ali Etemad. "Consistency-guided Prompt Learning for Vision-Language Models." arXiv preprint arXiv:2306.01195 (2023).
[2] Khattak, Muhammad Uzair, et al. "Maple: Multi-modal prompt learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[3] Wang, Y., Liu, Y., Zhang, X., et al. "VioLET: Vision-Language Efficient Tuning with Collaborative Multi-modal Gradients." Proceedings of the 31st ACM International Conference on Multimedia. 2023: 4595-4605.
Questions
The proposed method claims to be generalizable and to solve the overfitting problem well. I am then curious: when we increase the number of training epochs to 50 or even 200 in the base-to-new experiments, does it affect the generalization performance?
Q8: Comparison with Multimodal prompt tuning ...
We thank the reviewer for the valuable suggestion. Theoretically, our proposed model can be adapted to the multimodal prompt tuning setting, and we leave this as future work.
Q9: Results of Table.1 and Figure.7(a)
Thank you for your careful reading again.
We apologize for the confusion caused by our mistakenly switching these two lines of results (B-Prompt and P-Prompt) in Table 1. This mistake has now been corrected, and you can check it in the revision. The results do indicate that the proposed Bayesian approach is more capable of introducing uncertainty, thereby enhancing generalization performance and reducing overfitting, which is consistent with the results reported in Table 1 and Figure 7(a).
Q10: when we increase the number of training epochs to 50 or even 200 in base-to-new experiments, does it affect the generalization performance
Effect of PBPrompt with 50 more training epochs on base-to-new generalization
The values in brackets denote the difference from the original results in Table C.7. Δ: the improvement of the harmonic mean compared to CoCoOp (without additional training epochs).
| | ImageNet | Caltech101 | Flowers102 | DTD | EuroSAT |
|---|---|---|---|---|---|
| Base | 76.97 (+0.07) | 98.01 (+0.03) | 96.68 (+1.21) | 80.44 (+2.41) | 91.86 (+2.32) |
| New | 70.12 (-0.75) | 94.43 (-0.94) | 71.16 (-2.44) | 52.15 (-5.66) | 68.08 (-4.79) |
| H | 73.36 (-0.40) | 96.19 (-0.47) | 81.98 (-1.14) | 63.28 (-1.57) | 78.20 (-2.14) |
| Δ | +0.26 | +0.35 | +0.27 | -1.57 | +6.99 |
We appreciate your constructive suggestions. Indeed, there is a trade-off between performance on base and new classes depending on the number of training epochs. Specifically, more training epochs lead to better accuracy on base classes and lower accuracy on new classes. Our proposed model shows better generalizability by achieving a higher harmonic mean (H score) on the base-to-new task; more details can be found in Sec. C.9. As in previous works, increasing the number of training epochs does affect the results.
We thank reviewer 19zz for the comments and suggestions. Below, we address the concerns raised in your review. Please let us know if you have any further concerns or whether this adequately addresses all the issues you raised with the paper.
Q1 & Q2: Comparison with PLOT
First, we want to note that it is unfair to compare our model with PLOT-pp (https://github.com/CHENGY12/PLOT/tree/main/plot-pp). PLOT-pp is a multimodal prompt tuning algorithm, where both the vision and text prompts are tuned to improve performance. In contrast, this paper aims to address the uncertainty issue in text prompt tuning, where the vision branch is fixed, following previous works.
Based on the above facts, and to be consistent and fair, we modify the official PLOT by only replacing the RN50 with ViT-B/16 and optimizing the textual prompts under the same training pipeline (this ensures that the PLOT results reproduced in this paper are credible).
From all results compared with PLOT using ViT-B/16, we find that PBPrompt has a significant improvement over PLOT on almost all datasets.
Moreover, we find that PLOT is sensitive to the backbone, showing inconsistent performance with RN50 and ViT-B/16 (the authors of PLOT explain that training with the OT distance needs discriminative local visual features, which may not be the case in ViT-based encoders; please refer to https://github.com/CHENGY12/PLOT/issues/1 for more details). In contrast, our model introduces uncertainty under the Bayesian framework, which shows strong robustness across different backbones (as discussed in Sec. 4.3 of the manuscript).
Q3: ..., the domain generalization experiments using RN50 are missing
| Method | ImageNet | ImageNetV2 | ImageNet-Sketch | ImageNet-A | ImageNet-R |
|---|---|---|---|---|---|
| CoOp | 61.91 | 54.26 | 32.47 | 21.78 | 54.21 |
| PLOT | 63.01 | 55.11 | 33.00 | 21.86 | 55.61 |
| PBPrompt | 62.95 | 54.77 | 34.10 | 24.85 | 59.89 |
Thank you for your constructive suggestions. We have added the domain generalization experiments using RN50 above. We find that PBPrompt with RN50 outperforms PLOT by achieving the best results on 3 of the 4 target datasets, which demonstrates the superiority of the proposed method.
We have added the comparison and detailed discussion in Sec. C.7 in the new revision.
Ablation studies on OT and SPG
| Dataset | Methods | 1 shot | 2 shots | 4 shots |
|---|---|---|---|---|
| ImageNet | | 68.27 | 69.30 | 69.92 |
| | | 69.03 | 69.79 | 70.23 |
| | PBPrompt | 69.55 | 69.90 | 70.50 |
| Caltech101 | | 92.86 | 93.91 | 94.51 |
| | | 93.39 | 93.76 | 94.62 |
| | PBPrompt | 93.92 | 94.40 | 94.83 |
| Flowers102 | | 73.56 | 82.04 | 87.00 |
| | | 74.16 | 82.66 | 87.92 |
| | PBPrompt | 75.43 | 83.37 | 88.90 |
| DTD | | 50.65 | 54.55 | 59.40 |
| | | 51.95 | 55.66 | 59.50 |
| | PBPrompt | 52.03 | 56.20 | 59.63 |
| EuroSAT | | 52.15 | 66.97 | 68.19 |
| | | 61.10 | 67.21 | 71.77 |
| | PBPrompt | 60.92 | 68.77 | 72.84 |
Q4 & Q5: ..., what is the main contribution of using CT instead of OT
First, we thank the reviewer for the appreciation of our proposed stochastic prompt generation (SPG), which is one of the main contributions of this paper. Both OT and CT can measure the semantic distance between two sets; we choose CT as the regularization mainly because:
- Mathematically, CT measures the distance bidirectionally, which shows a more holistic alignment:
$$
\mathcal{L} = \frac{1}{M}\sum_{m=1}^{M} \sum_{c=1}^{C} C(\mathbf{u}_m, \mathbf{g}_c)\, \frac{p_c \exp(\mathbf{u}_m^{T} \mathbf{g}_c)}{\sum_{c'=1}^{C} p_{c'}\exp(\mathbf{u}_m^{T} \mathbf{g}_{c'})} +
\sum_{c=1}^{C} p_c \sum_{m=1}^{M} C(\mathbf{g}_c, \mathbf{u}_m)\, \frac{\exp(\mathbf{g}_c^{T}\mathbf{u}_m)}{\sum_{m'=1}^{M} \exp(\mathbf{g}_c^{T} \mathbf{u}_{m'})},
$$
where the first patch-to-prompt term calculates the transport cost from the image patches to the prompt embeddings. This forces the expected prompt to have the closest semantics to the patch embeddings so that it receives a higher transport probability than other labels. The second prompt-to-patch term calculates the cost from the opposite view, which makes the prompt act as a weighted cluster that shares similar semantics with the visual patches. These two terms guide the learning of informative and distinguishing prompts, resulting in better representations.
- Unlike OT, which typically requires an inner iteration to estimate the transport plan, CT can be calculated directly from the cost matrix and transport probabilities. This makes CT more flexible to optimize jointly with the task-specific loss in an end-to-end manner, often yielding higher performance and stronger generalizability (a minimal sketch of this bidirectional computation is given below).
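For clarity, here is a minimal sketch of the bidirectional conditional-transport distance above; it assumes a cosine cost and uniform label probabilities $p_c$, and the tensor names and shapes are illustrative rather than our exact implementation.

```python
# Hedged sketch of the bidirectional CT distance between patch features and prompt
# embeddings; cosine cost and uniform p_c are assumptions, not the exact implementation.
import torch
import torch.nn.functional as F

def ct_distance(patches, prompts, p_c=None):
    """patches: (M, d) image-patch features u_m; prompts: (C, d) prompt embeddings g_c."""
    C = prompts.size(0)
    if p_c is None:
        p_c = torch.full((C,), 1.0 / C)            # uniform label probabilities
    u = F.normalize(patches, dim=-1)
    g = F.normalize(prompts, dim=-1)
    sim = u @ g.t()                                # (M, C) similarities u_m^T g_c
    cost = 1.0 - sim                               # cosine cost C(u_m, g_c)

    # patch -> prompt: p_c-weighted softmax over prompts for each patch
    pi_fwd = F.softmax(sim + p_c.log(), dim=1)     # (M, C)
    fwd = (cost * pi_fwd).sum(dim=1).mean()

    # prompt -> patch: softmax over patches for each prompt, weighted by p_c
    pi_bwd = F.softmax(sim.t(), dim=1)             # (C, M)
    bwd = (p_c * (cost.t() * pi_bwd).sum(dim=1)).sum()
    return fwd + bwd

patches = torch.randn(196, 512)   # e.g. ViT-B/16 patch tokens
prompts = torch.randn(100, 512)   # one prompt embedding per class
loss_ct = ct_distance(patches, prompts)
```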
Following your suggestion, we have added the ablation study on the proposed conditional transport (CT) and optimal transport (OT) on five datasets above. We find that CT outperforms OT in most cases. These results demonstrate the effectiveness of the introduced CT module.
Q6: The number of Monte Carlo samples
It is worth noting that the Monte Carlo sampling number is different from the number of PLOT prompts. More specifically, C, the number of PLOT prompts, denotes C sets of learnable prompt vectors, whereas N, the Monte Carlo sampling number in our method, denotes that we sample N prompts from a single set of learnable prompts. Thanks to the SPG module, our model infers the posterior of the prompt based on one set of learnable prompts. This makes it possible to sample 20 prompts without introducing 20 sets of learnable prompt vectors. Therefore, it is not reasonable to require the two parameters to be the same.
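To make this point concrete, the following minimal sketch (shapes and names are illustrative assumptions, not our exact implementation) shows that all N samples come from a single set of learnable mean and variance parameters via the reparameterization trick, with only the mean used at test time.

```python
# Hedged sketch: N Monte Carlo prompt samples from one set of learnable parameters,
# so increasing N adds no learnable prompt vectors (unlike C prompt sets in PLOT).
# Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

latent_dim, n_samples = 64, 20
mu = nn.Parameter(torch.zeros(latent_dim))        # one learnable mean
logvar = nn.Parameter(torch.zeros(latent_dim))    # one learnable log-variance

def sample_prompts(n, training=True):
    if not training:
        return mu.unsqueeze(0)                    # test time: use the posterior mean only
    eps = torch.randn(n, latent_dim)              # N i.i.d. noise draws
    return mu + (0.5 * logvar).exp() * eps        # N samples, same learnable parameters

z = sample_prompts(n_samples)                     # (20, 64) -> fed to the prompt generator
```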
Q7: ... the proposed approach uses many more params ...
The additional parameters mainly come from the SPG module, which is the main component of the proposed model and generates stochastic prompts given the class name. To evaluate the efficiency of the SPG, we build a deterministic variant that uses the inferred mean of the Gaussian as the sampled $\mathbf{z}$ in Eq. (2) of the manuscript, and we report its results above. We find that PBPrompt outperforms this variant by a significant gap on all datasets, which shows that the improvement comes from the proposed model rather than from the additional parameters.
Dear reviewers:
Thank you again for your time and your valuable comments and suggestions, which have helped us greatly in improving our submission. Following your advice, we have added more discussion about the difference between SHIP and PBPrompt (Sec. A in the appendix), the missing comparisons, including results of PLOT and PBPrompt with RN50 (Table C.9) and comparisons with ProDA and SHIP (Table 1, Table C.7, and Table C.10), and ablation studies on training epochs (Table C.11) and on SPG and OT (Table C.12).
All modifications are marked in blue for ease of reading. We hope the updated revision and our responses can address your concerns and help you improve your assessment of the paper.
Best regards,
Authors
The paper presents a method of prompt tuning called PBPrompt for vision language pre-trained models (VLPs). It learns label-specific prompt distribution and aligns the image patches and textual prompts by minimizing the CT distance.
The paper received mixed scores (6, 6, 5, 3). Some reviewers appreciated the interesting idea of generating label-specific stochastic prompts and the use of optimal transport to align image and text features. They also acknowledged the paper's superior performance compared to CoOp and CoCoOp, while others suggested rejection or a borderline rating, citing concerns about the novelty and effectiveness of the proposed approach, missing comparisons with state-of-the-art methods, and the generalizability of the proposed approach.
While the authors responded to these concerns with clarifications and additional experimental results, and the AC also thinks it is a novel and little-explored approach, the reviewers and the AC believe the paper is not ready in its present form, mainly due to the concerns raised about its practicality, generalizability, and novelty.
Why not a higher score
After carefully considering all the reviews and the authors' responses, the AC feels this is really a borderline paper. The pros and cons are as described above, based on the issues listed above.
Why not a lower score
N/A
Reject