PaperHub
Average rating: 5.2 / 10 (Poster, 5 reviewers; ratings 5, 5, 5, 5, 6; min 5, max 6, std 0.4)
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 3.0
NeurIPS 2024

Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

Prompt learning that distills the textual knowledge from natural language prompts to improve the downstream generalization of CLIP.

Abstract

Keywords
CLIP, Downstream Generalization, Prompt Learning, Natural Language Prompts

Reviews and Discussion

Review (Rating: 5)

This paper proposes a new method to enhance the downstream generalization of CLIP by distilling knowledge from LLM- or human-generated text prompts. The proposed method involves training a prompt generator to predict prompt embeddings (AAPE) based on images.

Strengths

  1. The method aggregates and distills task-related knowledge from LLM- or human-generated text prompts, leading to better downstream generalization of CLIP.
  2. The approach demonstrates good performance on various downstream vision-language tasks.

Weaknesses

  1. The analysis of the aggregated text prompt is missing. There is no evidence showing that the Input-Adapted Prompt Aggregator can alleviate the influence of noisy text.
  2. Evaluation efficiency may be significantly impacted by the need to combine the predicted prompts with the base text features.
  3. A minor error is found in line 181, where “a photo of a {class} + category type” should be “a photo of a {class}”.

Questions

Can the Input-Adapted Prompt Aggregator correctly distinguish between noisy and accurate text prompts? For example, in Fig. 1, the last text prompt in the natural language prompts (“This image is of a red Jeep Compass SUV from 2012.”) inaccurately describes the training image. What is the attention score of this prompt during aggregation?

FLOPs of the proposed method.

How long does it take to evaluate the proposed method and CLIP on ImageNet1k dataset?

Limitations

The analysis of the aggregated text prompt is missing. The proposed method should be applied to more vision-language models, such as SigLIP [1]. This limitation has been mentioned in the paper.

[1] Sigmoid Loss for Language Image Pre-Training.

Author Response

Thanks for your helpful suggestions to improve our work. Below is our point-by-point response.

Whether the prompt aggregator can suppress noisy text prompts.

The attached Rebuttal.pdf (Fig. 1) visualizes the attention score for some prompt samples. We do observe low scores for those noisy or redundant prompts (see the example of car image), which will be suppressed during aggregation. We plan to add a larger-scale visualization and a correlation analysis between the attention score and image-prompt similarity.
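To make the aggregation step concrete for readers of this thread, below is a minimal, illustrative sketch of input-adapted attention pooling over prompt embeddings. The function name, the single-query cross-attention form, and the temperature are assumptions for illustration; the paper's actual aggregator architecture may differ.

```python
import torch
import torch.nn.functional as F

def aggregate_prompts(image_feat, prompt_feats, temperature=0.07):
    """Illustrative input-adapted aggregation of LLM/human prompt embeddings.

    image_feat:   (d,)   CLIP image embedding used as the attention query
    prompt_feats: (N, d) embeddings of N natural-language prompts for this image
    Returns the aggregated prompt embedding and the per-prompt attention scores.
    """
    q = F.normalize(image_feat, dim=-1)
    k = F.normalize(prompt_feats, dim=-1)
    scores = (k @ q) / temperature      # (N,) image-prompt similarities
    attn = scores.softmax(dim=-1)       # noisy prompts should receive low weight
    p_a = attn @ prompt_feats           # (d,) weighted sum = aggregated embedding
    return p_a, attn
```

Under this reading, a prompt that poorly matches the image receives a low attention weight and contributes little to the aggregated prompt embedding.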

Inference efficiency, FLOPs, evaluation time on ImageNet1k.

Please refer to our Response to Common Concern for comparisons and discussions of inference cost in terms of # params, GFLOP and FPS. AAPE evaluation on ImageNet1k takes a reasonable amount of time (12.6 min).

Application to more vision-language models, such as SigLIP.

We have shown the benefits of the language priors learned in our AAPE for two categories of vision-language models: 1) contrastive CLIP and 2) generative LiMBeR that connects the vision encoder with an LLM. For future work, we do plan to study more multimodal models to test the generality of AAPE, as mentioned in L318-320. We can go the contrastive route and apply AAPE to e.g. SigLIP as suggested. What's more interesting is the application to larger-scale generative models like BLIP-2 and LLaVA (work in progress) for general purpose visual and language understanding.

Comment

I have read the author's response and maintain my rating.

Comment

Thanks for the feedback. We will continue to improve the paper and integrate all the insights from discussions with different reviewers.

Review (Rating: 5)

The proposed method leverages language priors for better downstream adaptation and generalization of CLIP [37], similar to CuPL [36], which utilizes prompts generated by an LLM (e.g., GPT-3) for zero-shot image classification with CLIP. Unlike CuPL, the proposed method aggregates multiple LLM-generated or human-generated prompts to construct an aggregated text feature. Using the aggregated text feature as guidance, the authors adaptively transform CLIP image features to tackle few-shot image classification, image-to-text retrieval, image captioning, or VQA. A notable difference from CuPL is that the proposed method does not run LLM inference at test time.

Strengths

This paper has the following strengths:

(i) The proposed method is timely, and seems to be effective and efficient. By aggregating text features produced from LLM-generated or human-generated prompts and then using the aggregated feature as guidance, the proposed method can effectively leverage language priors for better downstream adaptation and generalization of CLIP. After distilling the rich language knowledge at training time, the proposed method does not need to run LLM inference at test time.

(ii) The authors validate the effectiveness of the proposed distillation method for downstream tasks (few-shot image classification, image-to-text retrieval, image captioning, VQA). It demonstrates that the learned adaptive transformation guided by language priors actually improves the adaptation and generalization of CLIP for downstream tasks.

Weaknesses

This paper has the following weaknesses:

(i) This paper lacks justification of how the adaptive transformation of CLIP image features, guided by an aggregated text feature, overcomes the image-text modality gap that exists in the CLIP vision-language space [A]. Despite the modality gap, the proposed method simply encourages transformed CLIP image features to lie near an aggregated text feature (using an L2 distillation loss).

[A] Liang et al., “Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning”, NeurIPS, 2022.

(ii) The proposed method might not be able to improve the model's robustness to distribution shifts that are difficult to explain in natural language. For example, the Terra Incognita dataset [B] contains distribution shifts caused by different camera locations in wild environments. The widely used benchmark of in-the-wild distribution shifts, WILDS [C], also contains such distribution shifts. Since the proposed method leverages only language priors, it might not be good at handling those distribution shifts.

[B] Beery et al., “Recognition in Terra Incognita”, ECCV, 2018.

[C] Koh et al., “WILDS: A Benchmark of in-the-Wild Distribution Shifts”, ICML, 2021.

(iii) This paper lacks experimental results obtained with different distillation loss coefficient values, although the distillation loss is the main contribution of this paper. According to L191-196, it simply states that the proposed model is not sensitive to different coefficient values. Without relevant experimental results, it is difficult to evaluate this statement.

Questions

(i) Despite the image-text modality gap in CLIP vision-language space [A], the proposed method simply transforms CLIP image features using an aggregated text feature. Does “h(x)” overcome the image-text modality gap? How do the authors deal with the modality gap? If the proposed method fails to overcome the modality gap, the authors should justify why the proposed method leads to better downstream adaptation and generalization of CLIP.

[A] Liang et al., “Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning”, NeurIPS, 2022.

(ii) According to L176-180, the authors claim that learning h(x) is much more parameter-efficient than learning a sequence of token embeddings. Is it true? For example, suppose that we have 16 learnable word tokens, where the dimension size of each word embedding is 512. In this case, there are 16 * 512 learnable parameters. In contrast, as described in L169-170, h(·) is a network which consists of two fully connected layers. Given that the dimension size of CLIP features is 768 or 1024, the network seems to have 2 * 768 * 768 or 2 * 1024 * 1024 learnable parameters. What does it mean that learning h(x) is much more parameter-efficient than learning a sequence of token embeddings?

(iii) To validate the statement in L191-196 (“Performance is found to be not very sensitive to the λ value in a wide range.”), the authors need to provide experimental results obtained with different λ values.

Limitations

Yes, on pages 9 and 16.

Author Response

Thanks for your recognition of our work and the detailed feedback! We respond to specific comments below.

Does AAPE h(x) mitigate the image-text modality gap?

Great question! The short answer is yes. Essentially, AAPE belongs to the text prompt tuning methods that only operate on the text branch, and one tuning signal comes from an aggregation of external text knowledge. This may sound like we risk breaking the image-text feature alignment after finetuning. However, AAPE is directly predicted from the image feature x via our prompt generator h(·) (an image-to-text mapping function); AAPE does not involve separate token vectors as learned in CoOp and CoCoOp. Then h(·) can be viewed as a bridge between the image and text modalities, encouraging their alignment via an explicit feature mapping function. This way, our finetuning process in Fig. 2(b) can be re-interpreted from the perspective of modality gap mitigation: first we compute two text embeddings that are already image-aligned, w_i (from the frozen CLIP text encoder) and h(x); they are then combined using a projection g to compute an image-text alignment loss that further reduces the modality gap.

Note that our image-text mapping function h(·) is similar to that in MaPLe, except that MaPLe learns both image and text prompts (vs. our unimodal prompts). The key difference between the two methods is that our h(·) maps to a natural language equivalent while MaPLe does not introduce any language priors. This keeps the text distribution for our finetuning from deviating too much from the free-form texts used for CLIP pre-training. As a result, our learned image-to-text mapping is expected to preserve CLIP's level of feature alignment after finetuning. For empirical evidence, we compare the average cosine similarity score for the ground-truth image-prompt pairs on ImageNet. The higher the similarity score, the better the image-text alignment. AAPE scores 0.91, slightly higher than CLIP (0.89), which confirms our hypothesis. AAPE is also comparable to the multimodal prompting method MaPLe (score 0.92), yet with language priors, AAPE achieves better classification accuracy and generalizes to more vision-language tasks.

Natural language may not be able to handle particular distribution shifts e.g. on Terra Incognita and WILDS datasets

Thanks for this insightful comment. We do rely on the language priors learned in AAPE to specify descriptive class details that are often invariant to image-based distribution shifts. Table 4 confirms our robustness across ImageNet variants. Intuitively, the description of "boxy shape and sloping roofline" for a "Jeep car" is useful to distinguish it from other similar classes, regardless of the style of the input image, e.g., sketch images in ImageNet-Sketch or paintings in ImageNet-R. That said, we acknowledge that our model's robustness to domain shifts depends on the descriptiveness of the training image prompts and their coverage of discriminating image information in the considered domain. The suggested Terra Incognita and WILDS datasets are both good examples where one might need customized image prompts (preferably generated by an LLM at scale) with domain knowledge, in order to generalize across, e.g., different structural scaffolds of molecules in the WILDS dataset. We leave such investigations of LLM prompting to future work, and will add the above discussion to the main paper.

Ablation on the distillation loss coefficient λ

The attached Rebuttal.pdf includes the ablation results in Table 2. It shows that our AAPE performs robustly (with overlapping confidence intervals) when λ ∈ [3, 9], and still outperforms the strong baseline OGEN (H: 80.34) in this wide range of λ. By default, λ = 5 is used for the classification task.
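For readers following this ablation, the place where λ enters can be sketched as below. This is our reading of the setup described in this thread (a task loss plus an L2 distillation term pulling h(x) toward the aggregated prompt embedding p^a); the exact formulation in the paper may differ.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{task}} \;+\; \lambda \,\bigl\lVert h(x) - p^{a} \bigr\rVert_2^2,
\qquad \lambda = 5 \ \text{by default}, \quad \lambda \in [3, 9] \ \text{performs comparably}.
```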

L176-180: is it true that learning h(x) is more parameter-efficient than learning a sequence of token embeddings?

Sorry about the confusion. We meant that, for the purpose of augmenting w_i, the question is how to efficiently learn an image-conditional embedding that is equivalent to a full prompt sentence. So the comparison is only in the context of learning a sentence embedding conditioned on the image x. We could simply generate a single embedding for the sentence of length L based on x using h(x). Or we could generate L embeddings for individual word tokens based on x. Obviously the token-wise prediction is much less efficient, since we would need to train either L conditional generator networks {h_l}_{l=1}^L or a large-scale autoregressive network h. We will clarify this in the main paper.
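To make the comparison concrete, a rough parameter count under the authors' framing is given below (each conditional generator assumed to be a two-layer MLP of width d = 768, biases ignored; these numbers are illustrative only):

```latex
\underbrace{2d^{2}}_{\text{one sentence-level generator } h} \approx 1.18\,\text{M}
\qquad \text{vs.} \qquad
\underbrace{L \cdot 2d^{2}}_{\text{token-wise generators } \{h_l\}_{l=1}^{L}} \approx 18.9\,\text{M} \quad (L = 16).
```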

Comment

Dear Authors,

Thanks for the response! Most concerns were successfully resolved, except whether AAPE h(x) overcomes the image-text modality gap. Do the authors think AAPE h(x) would be located near the text features rather than the image features in Figure 1 of the paper [A]? (not between the text and image features)

  • [A] Liang et al., “Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning”, NeurIPS, 2022.

I'm still not convinced whether h(·), which consists of two fully connected layers, could fully overcome the image-text modality gap. If the authors claim that the lightweight module (i.e., two fully connected layers) is sufficient for overcoming the image-text gap by simply using an L2 distillation loss, then they need to provide validation results such as Figures 1, 4, or 8 in the paper [A]. If it is true, then the discovery would be helpful for understanding why the proposed method works well in general.

Best regards,

Reviewer yCM3

Comment

Thanks for sharing this paper [A]. After a careful read of [A], we realize that we mistook the concept of "modality gap" for "image-text alignment" in terms of (cosine) feature similarity. Based on cosine similarity, according to our previous response, AAPE does improve this metric (e.g., 0.91 vs. 0.89 for CLIP on ImageNet), mainly because h(x) is an explicit image-to-text mapping from the image feature x. On the other hand, [A] defines the modality gap as the Euclidean distance between the centers of the image and text features, and shows that contrastive learning preserves the modality gap. Here we provide some new analysis and empirical results for AAPE based on this definition of the modality gap.
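For reference, the two quantities discussed in this comment can be computed as in the sketch below (illustrative code; it assumes paired image and text features and the center-distance definition of the gap from [A]):

```python
import torch
import torch.nn.functional as F

def alignment_and_gap(img_feats: torch.Tensor, txt_feats: torch.Tensor):
    """img_feats, txt_feats: (N, d) paired image and text features.

    Returns (mean cosine similarity over ground-truth pairs,
             Euclidean distance between modality centers, i.e. the gap in [A]).
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    alignment = (img * txt).sum(dim=-1).mean()            # image-text alignment
    gap = (img.mean(dim=0) - txt.mean(dim=0)).norm(p=2)   # modality gap
    return alignment.item(), gap.item()
```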

AAPE does not overcome the modality gap. More accurately, AAPE may sometimes reduce the gap distance after prompt tuning, but only by a moderate amount (e.g., from 0.82 to 0.78 on ImageNet). This is because we do not train to reduce the gap. Instead, AAPE is optimized by two loss terms: 1) an L2 distillation loss that places h(x) near the aggregated text embedding p^a, and 2) a contrastive task loss for classification which maximizes the image-text feature similarity again. This is definitely different from the experiments in [A] that either manually shift the image and text features to explicitly close the gap (Section 4.2), or directly transform text features to be close to image features (Section 4.4).

Then why does AAPE generalize without closing the modality gap? One key argument in [A] is that it is not clear that reducing the modality gap is desirable for improving downstream performance. Interestingly, the authors found that a larger gap can help some zero-shot learning and fairness tasks, while other tasks benefit from a smaller gap. In our case, we similarly found that good generalization does not necessarily need a reduced modality gap, e.g., by pulling AAPE back to the neighborhood of the image features.

To explain AAPE's good generalization, we previously relied on the intuitive reasoning that AAPE introduces rich language priors and that AAPE promotes image-text alignment. Inspired by [A], here we attempt to approach it from the optimization perspective. Our hypothesis is that we have a reshaped loss and accuracy landscape when optimizing AAPE with a multi-task loss (L2 distillation + contrastive) rather than the contrastive loss alone. We will add visualizations of our reshaped loss/accuracy landscape w.r.t. the modality gap (similar to Figs. 3 and 10 in [A]), where the optimal gap is different from the gap of pretrained CLIP. Generally speaking, the optimal modality gap minimizes two different objectives in our multi-task loss which could move in opposite directions: the contrastive loss learns AAPE to (over)fit the seen classes (a common belief in the literature), while the L2 distillation loss moves AAPE closer to p^a with external text knowledge (thus modifying the image-text modality gap), which often benefits generalization to unseen classes as evidenced by our results.

It would be an interesting direction for future research to study both 1) how our multi-task optimization dynamics could affect the modality gap and 2) the relation between the gap distance and downstream generalization.

We plan to include the above analysis in the Appendix. We also want to add that AAPE is designed to be a general image captioning latent. This makes AAPE easily go beyond few-shot classification and generalize to more vision-language tasks. For better generalization in large-scale tasks, ongoing investigations include scaling up the training data and the model size of h(·) (beyond two fully connected layers).

Comment

Thanks for the detailed response!

As acknowledged by the authors, it seems that h(·) does not overcome the image-text modality gap. From this observation, what I wonder is why AAPE h(x) works well in general. If my understanding of the proposed method is correct, AAPE h(x) should be treated as a kind of text feature (rather than an image feature, although it is obtained via a transformation from image features). If not, the L2 distillation loss does not make sense due to the modality gap.

Could the authors elaborate on this from that perspective?

Comment

Thanks for your fast feedback, and yes, AAPE is a text feature vector. Since AAPE is generated from input image features and it's regularized by LLM-generated text prompts via a distillation loss, we actually view AAPE as a latent vector for image captioning.

In our previous response, we detailed the reasons why AAPE works. As a quick recap, AAPE learning is governed by two signals: the distillation loss that equips AAPE with a captioning capability which can generalize to describe both seen and unseen image classes, and the contrastive loss that allows AAPE to (over)fit the seen classes. Optimizing such a multi-task loss may lead to an increased or decreased image-text modality gap (depending on the image and prompt distributions used for tuning), whereas AAPE is shown to consistently achieve strong generalization performance. This is yet another proof of the main argument in [A]: good generalization does not necessarily need a reduced modality gap.

Comment

Thanks for the response!

Although I acknowledge the claim in the paper [A] (good generalization does not necessarily need a reduced modality gap), the claim is still limited to classification. The modality gap is essentially caused by the temperature scaling used in the contrastive loss for classification, and I agree that such a modality gap could sometimes be useful, or even ignored, in classification, as demonstrated in the paper [B].

  • [B] Zhang et al., "Diagnosing and Rectifying Vision Models using Language", ICLR, 2023.

However, I'm not convinced that such a modality gap is useful in general (not only for classification). Unlike classification, where the modality gap still preserves cosine similarity ranks, AAPE h(x) should be treated as a kind of text feature in this paper. It seems that this paper lacks justification and the proposed method just surprisingly works well in general.

Could the authors provide more justification from this perspective?

Comment

Some clarifications:

  • From our new empirical studies, we do not claim that the modality gap is useful. Instead, our main observation is that the gap is not highly correlated with generalization performance. We observed that after prompt tuning, the modality gap is reduced on 7 out of 11 datasets and moderately increased on the remaining 4, while the performance is consistently improved. This gives rise to our argument that "good generalization does not necessarily need a reduced modality gap", which is in line with [A]. To be fair, we also do not claim that we fully overcome the modality gap with AAPE learning.
  • One key reason for AAPE's good generalization is the strong image-text alignment as measured by cosine feature similarity. This is achieved by our image-to-text mapping function h(·), which encourages feature alignment with the increased similarity score given before. Hence, AAPE h(x) is actually a text embedding that preserves the cosine similarity notion, which could explain the reduced modality gap on 7 classification datasets.
  • From the optimization perspective, our good generalization is also attributed to the multi-task learning objective: the distillation loss introduces language priors into AAPE, which promotes generalization and avoids overfitting from the contrastive loss. Note that such multi-task learning shifts the modality gap differently for each dataset, but always in directions (on the loss landscape) that favor a loss decrease and thereby performance gains. This observation under the multi-task learning framework can generalize beyond the classification task, since it does not require varying the contrastive loss's temperature, which is limited to classification.
  • For the tasks of image captioning and VQA, we similarly optimize a multi-task objective on the COCO dataset to learn AAPE: distillation loss + contrastive loss for image-text retrieval. Our latest evaluation of the image-text modality gap shows the gap is actually reduced from 0.79 to 0.72 after AAPE learning. This may be a signal that our multi-task loss can both mitigate the modality gap to some extent and achieve competitive performance. We just need to verify this hypothesis on more tasks/data distributions, which may provide hints on the relation between the modality gap and downstream generalization in complex vision-language tasks.

We will provide justification in the Appendix for why AAPE generalizes, briefly summarized below:

  • Generalization is not strongly correlated to the modality gap reduction for classification, with empirical results.
  • Reasoning about image-text alignment and multi-task learning that promotes generalization, with empirical support showing the feature similarity score and the multi-task loss landscape vs. the modality gap, respectively.
  • Evaluating how the hypothesis of "multi-task loss may mitigate modality gap" extends to other vision-language tasks, with preliminary results.

Review (Rating: 5)

The paper proposes a new prompt embedding named Aggregate-and-Adapted Prompt Embedding (AAPE), which improves prompt learning by distilling knowledge about more detailed descriptions of classes into prompt embeddings. Concretely, the “summary” prompt is obtained by aggregating diverse reference prompts. Then, the prompt generator is trained to produce a prompt embedding that stays close to the aggregated summary while minimizing task loss at the same time. From the experiments, the proposed AAPE shows good performance on diverse tasks such as few-shot classification, VQA, and image-to-text retrieval.

Strengths

  • How to well train prompt embeddings is one of the important topics in fine-tuning Vision-Language Models.
  • The paper is well written.
  • The proposed approach shows good performance in multiple tasks.

Weaknesses

  • More details on using the CLIP reward are required. From my understanding, there is another loss to enhance the CLIP reward CLIP-S(x, p^a).
  • It would be better to include the performance of the model with only the Aggregate-and-Adapted Prompt Embedding (AAPE) h(x) in Table 1, without using the text embedding w_i and the projection g. This result would clearly show the performance gain induced by the Aggregate-and-Adapted Prompt Embedding.
  • It would be better if an efficiency comparison with other prompt learning methods, such as CoOp, MaPLe, and PromptSRC, were included in the paper.

Questions

Please refer to the weaknesses.

Limitations

The authors have adequately addressed the limitations and potential negative societal impact.

Author Response

Thanks for the positive feedback and constructive suggestions. For the efficiency concern, please refer to our Response to Common Concern. Answers to your other questions below.

More details on the CLIP reward

As mentioned in L160, we use the same CLIP reward as in [17], formulated as CLIP-S(x, p^a) = w · max(cos(x, p^a), 0); we will add this to the paper. This reward works well in our experiments. Studies of more advanced loss/reward functions are left for future work.
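A literal reading of this formula as code is given below; the weight w = 2.5 follows the original CLIPScore paper and is only an assumption here, since the exact value used in [17] is not stated in this thread.

```python
import torch
import torch.nn.functional as F

def clip_s(image_feat: torch.Tensor, prompt_feat: torch.Tensor, w: float = 2.5) -> torch.Tensor:
    """CLIP-S(x, p^a) = w * max(cos(x, p^a), 0), for 1-D feature vectors x and p^a."""
    cos = F.cosine_similarity(image_feat, prompt_feat, dim=0)
    return w * torch.clamp(cos, min=0.0)
```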

To show the benefits of AAPE, use h(x) only, without w_i and g, for classification

In response, we have successfully applied AAPE h(x) alone to the tasks of image-to-text retrieval, image captioning and VQA, which proves the efficacy of AAPE thanks to the language priors distilled into it. But for classification, it is not possible to only use h(x) to fulfill the task. We argue that it is necessary to include w_i and g to build a well-functioning classifier.

Specifically, the image-conditional h(x) can be viewed as an "image captioning latent vector". It does not necessarily encode the explicit class name, which prevents it from acting as a good text classifier for each class. Also, it is easy to imagine that directly matching the image feature x to the x-conditional h(x) would always lead to a high similarity score, which is not discriminative. Therefore, we start with a basic template (w_i) that allows us to manually encode the i-th class name. Then we combine w_i with h(x) so that h(x) can provide extra class descriptions that are adapted to the input image. For the combination strategy, we found it does not work to linearly combine w_i and h(x) via element-wise addition or a linear projection g, since that tends to ignore h(x) when we match the linear combination to x for classification (see L187-188). Hence we choose to use a nonlinear g on top of the concatenation of w_i and h(x) for their non-trivial fusion. Because of such design choices, to isolate the impact of natural language priors under the classification setting, we always use w_i and g and compare h(x) learned with and without the language prior distillation loss. Figs. 3-5 all confirm the positive contributions of language priors. We will add the above clarifications to the paper.
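A minimal sketch of this classifier head is shown below (concatenate w_i with h(x), fuse with a nonlinear g, then match against the image feature). The layer widths, ReLU activation, and logit scale are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Fuses the class-name template embedding w_i with AAPE h(x) via a nonlinear g."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, w, h_x, logit_scale: float = 100.0):
        # x: (B, d) image features; w: (C, d) class templates; h_x: (B, d) AAPE per image
        B, C = x.size(0), w.size(0)
        pair = torch.cat([w.unsqueeze(0).expand(B, -1, -1),
                          h_x.unsqueeze(1).expand(-1, C, -1)], dim=-1)   # (B, C, 2d)
        classifiers = F.normalize(self.g(pair), dim=-1)                  # (B, C, d)
        x = F.normalize(x, dim=-1)
        return logit_scale * torch.einsum("bd,bcd->bc", x, classifiers)  # class logits
```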

Comment

Thank you for your response. Most of my concerns have been addressed. I have reviewed all the reviews and rebuttals.

However, I still have a few questions:

  • What is the computational cost during the training phase?
  • I'm still curious about the performance of the proposed method without w_i and g on the classification task. I think that the proposed method can perform well on at least the base (seen) classes without using w_i and g.

Thanks.

Comment

Computational cost for training

Thanks for the reminder to add this information. Here we show the training cost in terms of GFLOP / time (min). We mainly compare AAPE (162.6/41.92) with CoCoOp (162.5/39.53) since both methods learn input-adaptive text prompts, except that AAPE incurs a small overhead for learning an additional prompt aggregator. When compared to CoOp (162.5/10.08) and PromptSRC (179.6/13.13), AAPE is less time-efficient but has comparable or better GFLOP.

Classification performance of AAPE h(x) without w_i and g

Thanks for providing the insights. We have finished testing an h(x)-only baseline for classification, i.e., using h(x) in place of x as a proxy image query, while the text classifier is the basic template w_i that allows us to encode different class names to perform classification. This simple setting is similar to that of the image-to-text retrieval task; both can be viewed as testbeds for the text knowledge captured in h(x).

| Method | Base | New | H |
| --- | --- | --- | --- |
| AAPE (default) | 84.72 | 77.54 | 80.97 |
| h(x) only | 84.01 | 75.93 | 79.77 |
| CuPL | 74.31 | 75.25 | 74.78 |
| AAPE w/o L_distill | 79.47 | 73.25 | 76.23 |
| CoCoOp | 80.47 | 71.69 | 75.83 |

We see from the table that our h(x)-only baseline actually performs well, on both base and new classes. This indicates good generalization when using an input-adapted "captioning latent" h(x) to distinguish different classes. Our default approach combines h(x) and the template w_i that contains explicit class-name information (using the projection g). By doing this, we achieve better performance than h(x)-only, and at the same time enable easy interpretation of the roles of h(x) and w_i during classification. When compared to CuPL, which simply ensembles LLM-generated text prompts for classification, our h(x)-only baseline outperforms it significantly by learning input-adapted prompts. As a reference, we also compare with AAPE w/o L_distill and CoCoOp, which both predict input-adapted text prompts but without language supervision. The benefits of our h(x)-only baseline are evident, thanks to the learned language priors. We will add the results to the final paper.
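For concreteness, the h(x)-only baseline described above can be read as the sketch below; the function name and the logit scale are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def hx_only_logits(h_x: torch.Tensor, w: torch.Tensor, logit_scale: float = 100.0):
    """h(x)-only baseline: the predicted AAPE h(x) replaces the image feature x as the
    query and is matched against the class-name template embeddings w_i (no fusion g).

    h_x: (B, d) AAPE per image;  w: (C, d) template embedding per class.
    """
    return logit_scale * F.normalize(h_x, dim=-1) @ F.normalize(w, dim=-1).t()  # (B, C)
```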

Review (Rating: 5)

This framework first aggregates textual knowledge from human or large language model (LLM) generated prompts into a summary aligned with each input image. This is achieved using a prompt aggregator. A prompt generator is then jointly trained to create prompt embeddings that are close to this aggregated summary while also minimizing task-specific loss. The method demonstrates improvements in performance on various downstream vision-language tasks, including few-shot classification, visual question answering (VQA), and image captioning without incurring LLM inference costs during testing.

Strengths

The motivation is clear and the method is reasonable to me. The proposed method extends the application of downstream tasks from the common image classification to image-text retrieval, visual question answering (VQA), and image captioning. The performance on Flickr30k dataset is remarkable.

Weaknesses

This paper misses one relevant piece of research in prompt learning, “ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models.” ArGue introduces an attribute-guided prompt tuning approach that outperforms traditional prompt learning methods on specific tasks and datasets. The failure to reference and compare against this latest method may limit the comprehensiveness and advancement of the proposed method.

The comparison methods used for the image-to-text retrieval task in this paper are somewhat outdated and do not incorporate the latest research advancements. For example, BLIP-2, as a new image-to-text retrieval method, has demonstrated superior performance across various tasks and datasets. To more accurately assess the effectiveness of the proposed method, the latest comparison methods such as BLIP-2 should be included in the experiments.

The method in this paper shows overfitting on the base classes, with significantly lower performance on the novel classes compared to the base classes. This indicates a lack of generalization ability when dealing with new categories, potentially leading to poor performance on unseen data in real-world applications.

Questions

In Figure 1, what input is provided to GPT-3? Is it only text? Why does the last sentence of the description contain "the image is..."?

Where are human-generated image captions obtained from?

Limitations

yes

Author Response

Compare with ArGue (CVPR 2024)

ArGue and our AAPE both learn text prompts that distill language priors from an LLM. However, the prompt learning mechanisms are different: ArGue learns individual prompt token vectors, which are combined with the embeddings of the class name and the class-wise visual attributes generated by the LLM. In contrast, we learn to directly generate the embedding of a full prompt sentence, which we call AAPE, supervised by both LLM-generated natural language prompts and the task loss. Note that our AAPE prediction is conditioned on the image feature x via a prompt generator h(x).

Our method offers the following benefits:

    1. Our input-conditioning mechanism focuses on extracting prompt features from the input image, while ArGue learns unconditional text prompts for each class. Without considering input information, ArGue has a higher risk of overfitting to training class distributions.
    2. Our image-conditional prompt generator h(x) can be viewed as an image-to-text mapping function, which helps to bridge the modality gap between image and text feature spaces. This further gives rise to improved generalization during prompt tuning.
       • Here we provide empirical support for AAPE's good generalization on the few-shot classification task. In the base-to-new class generalization setting, we have comparable Base/New/H class accuracy averaged on 11 datasets: ArGue-N (83.77/78.74/81.18) vs. AAPE (84.72/77.54/80.97). While in the domain generalization setting, we have better accuracy for ImageNet/-V2/-Sketch/-A/-R: ArGue-N (71.84/65.02/49.25/51.47/76.96) vs. AAPE (73.56/65.97/50.12/51.62/77.52).
    3. More importantly, our AAPE is designed to be a universal embedding directly applicable to various vision-language tasks. For example, in our text retrieval, image captioning and VQA experiments, AAPE h(x) is used as a standalone "image captioning latent" that achieves SOTA performance (Table 2 in the main paper). This is not possible with ArGue.

Compare with latest methods like BLIP-2 for the image-to-text retrieval task

As mentioned in Section 5.1 and Appendix B, we treat the image-to-text retrieval task as a small-scale proof of concept of our prompt learning method. Specifically, we learn AAPE on COCO dataset only, and test its zero-shot/finetuning performance on Flickr30k. The goal is not to push for SOTA performance, but to verify if the learned AAPE can successfully distill text knowledge from COCO captions which can be useful for downstream text retrieval.

Table 3 provides supportive results as AAPE obtains strong performance with both zero-shot and finetuned models on Flickr30k. What's most interesting is that our zero-shot Flickr30k performance is nearly on par with that of SOTA zero-shot models, including CLIP/SigLIP and the latest work Llip. This proves good generalization of AAPE. We will add results of more recent/competitive methods like BLIP-2 as suggested. But note this won't be a fair comparison since we only train on COCO while other methods like BLIP-2 are trained on billions of image-text pairs.

In the future, we plan to scale up training data to learn a universal AAPE. This not only enables fair comparisons with SOTA multimodal models for image-to-text retrieval, but also on more complex vision-language tasks, e.g., using AAPE+LiMBeR, which already proves promising.

AAPE overfits base classes, with lower performance on novel classes; poor generalization on unseen data

We humbly argue that, overall, AAPE improves generalization over unseen classes and data distributions. We acknowledge there exists a gap between the base and new class accuracies (7.18 points averaged across 11 datasets) for AAPE's few-shot classification experiments. One reason is that our prompt generator h(·) won't be trained sufficiently well on some datasets with limited class data, while we rely on a good h(·) to provide useful language priors for new class generalization. For example, the DTD and EuroSAT datasets have only 47 and 10 training classes respectively (16 shots per class), and their base-new accuracy difference is high (20.33 and 19.10 points). Meanwhile, the SUN397 dataset has 397 classes (hence a better learned h(·)), and the base-new accuracy gap is reduced to 3.06 points. This motivates us to scale up training data in real-world tasks to learn a universal prompt generator that generalizes over unseen categories.

Nevertheless, under the few-shot classification setting, our AAPE learning does not significantly widen the average base-new accuracy gap (7.18) in comparison to other methods, e.g., the competitive OGEN (gap 7.31) that doesn't leverage language priors. With the language priors, AAPE can actually improve both the base and new class accuracies for each of the 11 datasets (see Fig. 3), sometimes by a large margin. When well trained on ImageNet, AAPE also achieves SOTA results for domain generalization (Table 4). Furthermore, we have successfully trained a competent h(·) on COCO to describe complex scenes. It shows impressive zero-shot generalization to multiple tasks with varying data distributions, including image-to-text retrieval (on Flickr30k), image captioning (on NoCaps) and VQA (on VQA2).

Fig 1: what is the input to GPT-3? Why does the last sentence of the description contain "the image is..."?

As mentioned in L120-127, we follow the CuPL method to query GPT-3 with more than one LLM-prompt template (i.e., the GPT-3 inputs). Examples are "How can you identify a(n) {}?" and "Describe an image from the internet of a(n) {}". The latter example leads to the GPT-generated image prompts that start with "this image is...". We will clarify this in the paper.

Where are human-generated image captions obtained from?

As mentioned in L128-136, we use human-annotated captions (5 per image) from COCO dataset to represent complex scenes. We train on COCO captions for 3 vision-language tasks, i.e., image-to-text retrieval, image captioning and VQA.

Comment

I read all the reviews and rebuttals, and decide to maintain the original rating.

Comment

Thanks for the feedback. We will continue to improve the paper and integrate all the insights from discussions with different reviewers.

Review (Rating: 6)

The paper proposes a prompt learning method that distills knowledge from a pre-trained LLM while conditioning the prompts on the image embedding. A Prompt Aggregator module combines the LLM-generated prompts per image, and a Prompt Generator module generates a prompt from the image embedding. The modules are trained using a regularization loss between the aggregated and generated prompts, and the downstream task loss. Experimental results on 11 image classification datasets, four ImageNet domain variants, two image captioning datasets and a VQA dataset show the effectiveness of the proposed prompt learning method.

Strengths

  • The experimental results cover a broader spectrum that includes image classification, vision-language understanding, such as captioning and VQA. The results show the proposed method is effective across the range of tasks.

Weaknesses

  • The method shares a lot of similarity with [1], which distills LLM knowledge through prompts into the CLIP text encoder. It is important to compare with [1] and clarify how this work differs from it.
  • See questions

[1] Khattak, Muhammad Uzair, et al. "Learning to Prompt with Text Only Supervision for Vision-Language Models." arXiv preprint arXiv:2401.02418 (2024).

Questions

  • What is the number of parameters learned in comparison to other prompt learning methods? Is the final projection layer necessary?
  • The zero-shot performance on the captioning task seems promising. Is there a projection layer g for the VQA and captioning tasks as well? Perhaps the trained projection layer is giving a significant boost in the classification task. What are the authors' thoughts on this?

Limitations

none

Author Response

Thank you for the constructive feedback on our work. Regarding the efficiency concern, please refer to our Response to Common Concern. For other comments, our point-by-point response is as follows.

Compare and discuss how this work differs from ProText (arXiv:2401.02418)

Thanks for bringing this paper to our attention! ProText and our AAPE both learn text prompts that distill natural language priors from an LLM, with the same goal of improving generalization to novel classes or data distributions. However, the prompt learning mechanisms are different: ProText learns individual prompt token vectors, which are then combined with a text template to generate the embedding of a full prompt sentence that stays close to the LLM-generated sentences. In contrast, we learn to directly generate the prompt sentence embedding, which we call AAPE, supervised by both LLM-generated prompts and the task loss. Note that our AAPE prediction is conditioned on the image feature x via a prompt generator h(x).

Our method offers the following benefits:

    1. Our input-conditioning mechanism focuses on extracting prompt features from the input image, while ProText learns unconditional text prompts for each class. Without considering input information, ProText has a higher risk of overfitting to training classes (thus worse generalization to new classes).
    2. Our image-conditional prompt generator h(x) can be viewed as an image-to-text mapping function, which helps to bridge the modality gap between image and text feature spaces. This further gives rise to improved generalization during prompt tuning.
       • Here we provide empirical support for AAPE's better generalization on the few-shot classification task. In the base-to-new class generalization setting, we have the average Base/New/H class accuracy on 11 datasets: ProText (72.95/76.98/74.91) vs. AAPE (84.72/77.54/80.97). While in the domain generalization setting, we have the accuracy for ImageNet/-V2/-Sketch/-A/-R: ProText (70.22/63.54/49.45/51.47/77.35) vs. AAPE (73.56/65.97/50.12/51.62/77.52).
    3. More importantly, our AAPE is designed to be a universal embedding directly applicable to various vision-language tasks. For example, in our text retrieval, image captioning and VQA experiments, AAPE h(x) is used as a standalone "image captioning latent" that achieves SOTA performance (Table 2 in the main paper). This is not possible with ProText.

We will add the above discussions and comparisons to the revised paper.

Is the projection g necessary for classification? Is g used for the VQA/captioning tasks?

In response, we apply AAPE h(x) alone to the tasks of image-to-text retrieval, image captioning and VQA, which proves the efficacy of AAPE thanks to the language priors distilled into it. But for classification, it is not possible to only use h(x) to fulfill the task. We argue that it is necessary to include both w_i and g to build a well-functioning classifier.

Specifically, the image-conditional h(x) can be viewed as an "image captioning latent vector". It does not necessarily encode the explicit class name, which prevents it from acting as a good text classifier for each class. Therefore, we start with a basic template (w_i) that allows us to manually encode the i-th class name. Then we combine w_i with h(x) so that h(x) can provide extra class descriptions that are adapted to the input image. For the combination strategy, we found it does not work to linearly combine w_i and h(x) via element-wise addition or a linear projection g, since that tends to ignore h(x) when we match the linear combination to x for classification (see L187-188). Hence we choose to use a nonlinear g on top of the concatenation of w_i and h(x) for their non-trivial fusion.

We acknowledge that g introduces more parameters for prompt learning. One might wonder whether the increased model size contributes significantly to our classification accuracy gains. To this end, we compare with a variant of the CoCoOp method that similarly conditions prompt generation on the input image. Our implemented CoCoOp variant has a g attached after the text encoder, such that its total parameter count is similar to ours. This CoCoOp variant achieves an average H of 75.32 on 11 datasets, even worse than projection-free CoCoOp (75.83), mainly because of worse generalization on new classes. This indicates that adding the projection g with increased capacity does not necessarily mean higher performance for few-shot classification. On the other hand, our similar-sized AAPE with g achieves H 80.97, which benefits a lot from the learned language priors. Figs. 3-5 all confirm the positive contributions of language priors in AAPE. We will add the above clarifications to the paper.

Comment

Thank you for the response to the comments. I have read all the reviews and the rebuttals.

My concerns have been addressed sufficiently and I will raise my rating.

Comment

Thanks for raising the score! We promise we will integrate all the new discussions and results in the final paper.

Author Response

Response to Common Concern

Thanks to all reviewers for the thoughtful comments. Before responding to the questions raised by each reviewer, we first address the common concerns around efficiency.

The attached Rebuttal.pdf (Table 1) compares the inference compute cost between our AAPE and 3 types of prompt learning methods. CoOp and OGEN are the first type of methods, which learn fixed prompts with no adaptation to the input. The efficiency benefits of these methods are evident: the number of learned parameters is small (2k), and the fixed prompts only need a single forward pass for all batch data, leading to high inference speed (FPS). MaPLe and PromptSRC belong to another type of method that learns prompts for both the text and image modalities. These methods have GFLOP and FPS comparable to CoOp but many more parameters to learn, thus risking generalization with often sub-optimal new class accuracy.

CoCoOp and our AAPE both learn input-adaptive text prompts, with reasonable parameter count and comparable GFLOP. However, they suffer from low speed (FPS) since they require forwarding input-conditional prompts to the text encoder each time. Despite the low FPS, we argue that AAPE still shines:

  • Not only because of its highest classification accuracy, but also because AAPE scales much better with model size than CoCoOp-style methods that similarly condition prompt generation on the input. Table 1 in Rebuttal.pdf provides empirical evidence by comparing AAPE with a CoCoOp† baseline that has a similar parameter count (with a bigger prompt generation network). Note the degraded FPS and new class generalization of CoCoOp†, which make it underperform AAPE in both efficiency and accuracy.
  • More importantly, AAPE is designed to be a universal embedding directly applicable to various vision-language tasks. We have successfully applied AAPE to few-shot classification (with the help of a projection g), as well as to text retrieval, image captioning and VQA, where AAPE is used as a standalone "image captioning latent" (without g) to provide rich language priors. This is not possible with CoCoOp, OGEN or many other prompting methods for few-shot classification; AAPE's generality only sacrifices some efficiency in the classification task. Meanwhile, in complex vision-language tasks, AAPE achieves SOTA performance (Table 2 in the main paper) with higher efficiency than prior works, e.g., about 2.8/1.2 times faster than MAGMA for training/inference.
  • We leave as future work to further speed up AAPE inference particularly for classification, via pruning or distillation techniques to simplify the forward pass of prompts.
Final Decision

The paper introduces Aggregate-and-Adapt Natural Language Prompts for improving the downstream generalization of CLIP. Reviewers acknowledged the novelty of distilling textual knowledge from natural language prompts, either human-generated or from LLMs, to enhance performance across a range of vision-language tasks. The experimental results demonstrate the effectiveness of the method on tasks like few-shot classification, image captioning, and VQA.

However, concerns were raised about missing comparisons with relevant baseline methods, such as ProText, and some reviewers questioned the efficiency of the proposed approach compared to existing prompt-learning methods. Additionally, there was a call for a more detailed analysis of the aggregated text prompt's ability to handle noisy data. The authors provided a detailed rebuttal addressing these concerns, as well as others, with additional comparisons and clarifications, which were generally well received by the reviewers. Consequently, all reviewers vote for accepting the paper post-rebuttal.

Considering the novelty, the strong experimental results, and the authors' comprehensive response, which resolves the majority of the concerns, the AC recommends accepting the paper. However, the authors are encouraged to further strengthen the final version by including the suggested comparisons and additional analysis as promised in the rebuttal.