Finetuning CLIP to Reason about Pairwise Differences
We propose a method to fine-tune CLIP so that differences in image embeddings correspond to text descriptions of those differences, which can be synthetically generated by LLMs
Abstract
Reviews and Discussion
This paper proposes PC-CLIP, a method to finetune vision-language models to better reason about pairwise differences between images by aligning embedding differences with LLM-generated descriptions of visual differences. The approach lifts CLIP's difference-based classification from near-random performance while maintaining its zero-shot classification performance. The quality of the embeddings is also evaluated through text-to-image generation.
Strengths
- The paper is well-structured and clearly written, explaining the methodology and results effectively. The experiments are comprehensive, and I also find the methodology for evaluating the quality of the embeddings through text-to-image generation very interesting and insightful.
- The method is novel, and the results of the difference-based classification look promising. The method gains significant performance from the almost random performance of vanilla CLIP. This capability can potentially benefit downstream models.
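For readers wanting to ground this capability, below is a minimal sketch of how a difference-based classifier can be scored with CLIP-style encoders: the image-embedding difference is compared against the text embeddings of candidate difference descriptions. The function name and the exact task format are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def difference_based_classify(img_emb_1, img_emb_2, candidate_text_embs):
    """Pick the candidate difference description whose text embedding is most
    similar to the (normalized) difference of the two image embeddings.

    img_emb_1, img_emb_2: (d,) image embeddings from a CLIP-style image encoder.
    candidate_text_embs:  (num_candidates, d) embeddings of candidate
                          difference descriptions from the text encoder.
    Returns the index of the best-matching candidate.
    """
    diff = F.normalize(img_emb_1 - img_emb_2, dim=-1)
    cands = F.normalize(candidate_text_embs, dim=-1)
    scores = cands @ diff  # cosine similarity per candidate
    return int(scores.argmax())

# Toy usage with random embeddings (stand-ins for real CLIP features).
d = 768
g1, g2 = torch.randn(d), torch.randn(d)
texts = torch.randn(4, d)
print(difference_based_classify(g1, g2, texts))
```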
Weaknesses
See Questions.
Questions
- The method requires the LLM to generate the differences between two captions for both training and performing difference-based classification. However, the experiment only covers LLaMA2-13B-chat-hf as the LLM. I wonder if the authors have any insight into what happens when training is based on LLaMA2 but inference uses smaller or larger models. Would the change in the distribution of the output impact the performance of the model?
- I'm curious if this approach could be combined with other CLIP improvement techniques. For example, NegCLIP [1] improves compositional reasoning by focusing more on the text encoder. This method feels like a nice complement to it.
[1] When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?, Yuksekgonul et al., 2023
We thank the reviewer for their time in providing a thoughtful review. We appreciate that the reviewer found that our experiments are comprehensive and that our evaluations are interesting and insightful.
[T]he experiment only covers LLaMA2-13B-chat-hf as the LLM. I wonder if the authors have any insight into what happens when training is based on LLaMA2 but inference uses smaller or larger models. Would the change in the distribution of the output impact the performance of the model?
We believe that using different LLMs during training (e.g., for generating synthetic data) and during inference (e.g., for creating comparative prompts or extended class prompts) will not largely impact the performance of the model. In fact, for our comparative prompts and extended class prompts, we use GPT-4, and we still see improved performance despite this change in distribution. Our rationale is that there are many different ways to represent class names or text differences in natural language, and with sufficient diversity in our synthetic training data for PC-CLIP as well as the original pretraining data, we believe that our PC-CLIP model can account for these distributional differences.
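As an illustration of the kind of synthetic-data generation being discussed, here is a minimal sketch of prompting an instruction-tuned LLM (LLaMA2-13B-chat via Hugging Face transformers) to describe the difference between two captions. The prompt wording is a hypothetical template, not the paper's actual instructions.

```python
from transformers import pipeline

# Hypothetical prompt template; the paper's actual instructions may differ.
PROMPT = (
    "Describe in one sentence how the image in caption A differs from the "
    "image in caption B.\n"
    "Caption A: {a}\n"
    "Caption B: {b}\n"
    "Difference:"
)

# Any instruction-tuned chat model can be substituted here; LLaMA2-13B-chat
# requires accepting Meta's license on the Hugging Face Hub.
generator = pipeline(
    "text-generation", model="meta-llama/Llama-2-13b-chat-hf", device_map="auto"
)

a = "A brown dog running across a grassy field."
b = "A black cat sleeping on a windowsill."
out = generator(PROMPT.format(a=a, b=b), max_new_tokens=60, do_sample=False)
print(out[0]["generated_text"])
```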
I'm curious if this approach could be combined with other CLIP improvement techniques. For example, NegCLIP [1] improves compositional reasoning by focusing more on the text encoder. This method feels like a nice complement to it.
Yes, we believe that other CLIP improvement techniques can be complementary to PC-CLIP. In the case of NegCLIP, this further improves compositional reasoning, which would pair nicely with incorporating notions of pairwise differences. In our new experiments, we also show that using PC-CLIP with image-captioning techniques is complementary, achieving strong performance when coupled with existing pipelines for image-difference captioning. Thus, it seems that PC-CLIP can complement many other methods to achieve even stronger performance.
Thank you to the authors for the detailed response. The experiment with the image-captioning technique looks good.
We thank the reviewer for their continued engagement and appreciation of our new results. If you are satisfied with our response, we hope that you may kindly consider increasing your support for our paper.
This paper presents PC-CLIP, an approach to improve CLIP models' ability to reason about differences between images through fine-tuning CLIP with LLM-generated comparisons. The key idea is to align differences in CLIP's image embedding space with natural language descriptions of visual differences between image pairs generated by LLM. The authors demonstrate this enables new capabilities like difference-based classification while maintaining or improving standard zero-shot performance. They also introduce comparative prompting to leverage class-level differences during inference.
Strengths
- New capabilities: The method enables valuable new capabilities like difference-based classification and comparative prompting while maintaining or improving CLIP's core zero-shot performance.
- Comprehensive empirical validation: The work provides extensive experimental validation across multiple tasks, datasets, and evaluation metrics, including classification, embedding analysis, and generation. The baselines are also very reasonable (fine-tuning CLIP on the COCO original and rewritten captions).
Weaknesses
- Scaling up: The experiment focuses only on a single CLIP model size (ViT-L/14); it is unclear whether the approach still remains useful when the model is larger, and whether it becomes more or less useful.
- Limited improvements: While the method shows consistent improvements on standard classification tasks, many of the gains are relatively small (1-2% absolute improvement). It is unclear whether the cost is worth such small improvements.
- LLM dependency: The method's core dependency on LLM-generated comparisons introduces risks from hallucination and noise in the training signal, with limited analysis of how comparison quality impacts downstream performance or how to ensure robust comparison generation.
Questions
- What is the difference between Table 2 and Table 3? The numbers in Table 3 are much lower than in Table 2. Also, the PC-CLIP baseline numbers are much lower than CLIP (e.g., CUB 59.43 vs 52.57). Also, Table 3 is never referenced in the main text.
Also see weaknesses.
Thank you for your time in providing a thoughtful review. We appreciate that the reviewer found that our work provides extensive experimental validation with reasonable baselines.
Scaling up: The experiment focuses only on a single CLIP model size (ViT-L/14)
In response to your request, we provide an experiment where we train a ViT-H/14 model and show the results, compared to standard CLIP features with this model. We observe the following results:
| Model | CIFAR100 | CUB | EuroSAT | Flowers | SUN |
|---|---|---|---|---|---|
| CLIP (ViT-H) | 87.60 | 86.42 | 53.56 | 89.06 | 75.64 |
| PC-CLIP (ViT-H) | 87.85 | 86.85 | 53.93 | 88.91 | 75.68 |
This demonstrates that PC-CLIP is indeed effective at larger CLIP model scales and can scale up.
While the method shows consistent improvements on standard classification tasks, many of the gains are relatively small (1-2% absolute improvement). It is unclear whether the cost is worth such small improvements.
We argue that a single round of finetuning is worth the cost, given that it generally improves performance on a wide variety of tasks (including classification, difference-based classification, and image generation). This is a simple one-time cost that yields benefits during every inference pass. Moreover, the consistent improvements on standard classification tasks, as well as on other applications such as image-difference classification and text-to-image generation, make our overall finetuning (< 20 hours on a single A6000 GPU) well worth it.
The method's core dependency on LLM-generated comparisons introduces risks from hallucination and noise in the training signal, with limited analysis of how comparison quality impacts downstream performance or how to ensure robust comparison generation.
We agree that depending on LLMs for generating synthetic data introduces complications; however, this is standard practice in the field. Furthermore, our work benefits from improvements in LLMs as well as better sampling techniques that detect and reduce hallucinations [1, 2]. Thus, we do not see this as a limitation unique to our work. Please also see our experiment in which we add back the noisier examples that were previously filtered: we see only a slight drop in performance, still stronger than the baselines. Thus, our method is robust to noise in the LLM generations.
What is the difference between Table 2 and Table 3? The numbers in Table 3 are much lower than in Table 2. Also, the PC-CLIP baseline numbers are much lower than CLIP (e.g., CUB 59.43 vs 52.57). Also, Table 3 is never referenced in the main text.
The numbers in Table 3 reflect the performance on the 3 most highly confused pairs of classes in standard zeroshot classification. The numbers for both PC-CLIP and CLIP being lower than in Table 2 makes sense, as these are the hardest classes for the models (Table 3), as opposed to all classes (Table 2). The focus of Table 3 is to look at the relative improvement of comparative prompting with PC-CLIP or CLIP, in the (+ comp) column. The main takeaway is that comparative prompting seems to always benefit PC-CLIP, while this is not the case for CLIP.
Table 3 not being referenced is a typo on our part; thank you for pointing this out! We have updated this in our revision, we had meant to say "highly confused in Table 3" instead of "highly confused in Section 4.3".
[1] Farquhar et al. Detecting hallucinations in large language models using semantic entropy.
[2] Ji et al. Towards mitigating LLM hallucination via self reflection.
I thank the authors for providing a detailed response. I still have concerns about the improvement: the overall improvement on ViT-H seems even more marginal. Since you have trained the ViT-H model, could you perform similar experiments to demonstrate that the main conclusions still hold for this model, such as comparative prompting?
Could you perform similar experiments to demonstrate that the main conclusions still hold for this model, such as comparative prompting?
In response to your request, we have provided results with the larger CLIP model built on ViT-H when using comparative prompting and extended class prompts. We still observe better performance at this larger model scale with both of these approaches on a majority of the considered tasks.
Comparative Prompting
| Model | CIFAR100 | CUB | EuroSAT | Flowers | SUN |
|---|---|---|---|---|---|
| CLIP (ViT-H) | 87.6 | 86.31 | 55.81 | 89.87 | 75.57 |
| PC-CLIP (ViT-H) | 87.85 | 86.73 | 56.04 | 89.62 | 76.08 |
Extended Prompts
| Model | CIFAR100 | CUB | EuroSAT | Flowers | SUN |
|---|---|---|---|---|---|
| CLIP (ViT-H) | 88.77 | 84.52 | 58.26 | 89.02 | 71.98 |
| PC-CLIP (ViT-H) | 90.16 | 86.45 | 57.41 | 88.44 | 75.40 |
We thank the reviewer again for their time and consideration! We are more than happy to address any other concerns they might have.
Dear Reviewer Juc5,
In our response above, we have tried to address all your comments and concerns. To summarize, we have:
- Added new experiments with finetuning larger CLIP (ViT-H) models, demonstrating better performance with ZS classification at scale
- Added results in terms of comparative prompting and using extended class descriptions, which show PC-CLIP's benefits over the CLIP baseline.
- Addressed a question on robustness to hallucinations in the LLM generations, with new experiments that demonstrate leaving in the poor-quality generations still leads to improved performance over CLIP
- Addressed a question on the differences between Tables 2 and 3
Thank you again for taking the time to review our work and we hope to hear back from you soon. Please let us know if you have any additional questions!
The paper argues that the CLIP text embeddings do not exhibit the structure of purely text-based embeddings. Specifically, the difference between the CLIP text embeddings in the representation space does not encode the semantic difference between two captions in the raw space. To solve this problem, the authors propose to finetune the CLIP model on LLM-generated difference descriptions for the pair of image-text data sampled from a dataset. Post-training, the model can perform difference-based classification and zero-shot classification very well. In addition, the paper’s solution gives rise to the possibility of comparative prompting which can potentially improve the model classification on confused classes (e.g., crab and lobster), and embedding arithmetic with text-to-image generative models.
Strengths
- The paper proposes an effective method to solve the highlighted problem. Specifically, the synthetic data generation using LLMs for any pair of images is reasonable.
- The paper showcases the utility of their method using difference-based classification, zero-shot evaluation, and evaluating the quality of the learned method.
Weaknesses
- The paper lacks a good motivation on why CLIP models should exhibit the structure of purely-language based text embeddings. The CLIP pretraining never encourages such structure to emerge so it is not unusual to observe such behavior. In particular, the motivation never justifies why difference understanding is a desirable property.
- The paper’s problem does not seem relevant in the context of large multimodal models. Specifically, we have several models such as LLaVA, Qwen-VL that should be able to perform difference-based classification by design. In addition, we would not need to create difference descriptions during inference with these models.
- The paper is quite detached from the prior work on difference based image captioning [1,2]. It is a well-studied field where the focus is on understanding the fine-grained differences between the two images. In reality, we would need PC-CLIP for such tasks more than difference-based classification. There is no evaluation on such tasks.
- There is prior work that aims to address a similar problem [3,4]. The paper neither discusses nor compares with any of these methods. One of them is quite old, and warrants some level of comparison.
- Comparative prompting is not well-grounded. Specifically, how will we know if f_A is misaligned? Figure 3 expects that f_B is well-aligned but f_A is not; what happens if both f_A and f_B are misaligned?
- In Table 2, we observe that comparative prompting leads to worse results than non-comparative prompting for CIFAR100 and CUB, and almost identical results for SUN397. This suggests that this method may often not be effective.
- Additionally, comparative prompting requires a user to create difference captions and tune the alpha hyperparameter. It is not clear whether such efforts are worth the gains on completely unseen tasks and scenarios in comparison to a CLIP model which does not require any of these.
- Zero-shot CLIP results are not provided for a more standard dataset such as ImageNet. A part of this is probably due to the cost involved in finding the confusion classes and creating difference prompts for them.
- Apart from zero-shot classification, CLIP model achieved high-level robustness to natural distribution shifts (ImageNetv2/R/A/S). It is unclear how PC-CLIP would perform in such cases. Additionally, there are no linear probing results to understand the quality of the learned vision embeddings.
[1] Robust Change Captioning: https://arxiv.org/abs/1901.02527
[2] Spot the diff: https://arxiv.org/pdf/1808.10584
[3] CLIP4IDC: CLIP for Image Difference Captioning (https://arxiv.org/pdf/2206.00629)
[4] OneDiff: A Generalist Model for Image Difference Captioning (https://arxiv.org/pdf/2407.05645)
Questions
Mentioned in the weaknesses
We thank the reviewer for their detailed review and for their constructive feedback. We hope to answer your questions and address your concerns below with some of our responses and new experiments.
lacks a good motivation on why CLIP models should exhibit the structure of purely-language based text embeddings... CLIP pretraining never encourages such structure to emerge so it is not unusual to observe such behavior.
While it isn't unusual to observe such behavior in CLIP models, since it has not been encouraged, we believe that it is nonetheless a desirable property. For instance, the embedding space becomes much more interpretable and reflective of how we understand concepts through arithmetic and analogies. Furthermore, we note that this structure is not directly encouraged by purely text-based pretraining objectives either, yet it still emerges from text-based training. Finally, the benefits across our various experimental settings also support that CLIP benefits from an embedding space that better captures notions of differences.
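As a concrete illustration of the analogy structure discussed here, the snippet below probes CLIP's text embedding space using the Hugging Face transformers CLIP interface; it is a diagnostic sketch, not the paper's evaluation protocol, and the prompt strings are illustrative.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def embed(texts):
    # Encode and L2-normalize CLIP text features.
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        return F.normalize(model.get_text_features(**inputs), dim=-1)

king, man, woman, queen = embed(["a king", "a man", "a woman", "a queen"])
analogy = F.normalize(king - man + woman, dim=0)
# If the embedding space had word2vec-like additive structure, this cosine
# similarity would be close to 1; for vanilla CLIP it is typically much lower.
print(float(analogy @ queen))
```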
does not seem relevant in the context of large multimodal models
While it is true that other large multimodal models have stronger reasoning capabilities due to their LLM components, the main focus of this paper is to teach CLIP this ability. While CLIP is indeed an older model, it is still widely used, for example as the text encoder for text-to-image generation with Stable Diffusion. We would also like to highlight that it is a component in many of these large multimodal models (including LLaVA). Thus, improving this foundation model can have large benefits in many pipelines that build off of it (though we feel this is better left for future work), making it an important model to study and improve, with many potential downstream impacts.
quite detached from the prior work on difference based image captioning [1,2]
Indeed, thank you for pointing out this related work. We highlight that the main focus of our work is different: improving CLIP's embedding space to reflect these differences, which is the reason why our paper may seem detached from prior work on difference-based image captioning. In these papers, which we already discuss in our related work section, the goal is to generate text descriptions of the difference between two images, often using the CLIP model. We focus on the reverse: given a large synthetically generated set of these text descriptions of differences, can we improve CLIP embeddings (in terms of zeroshot and difference-based classification) by finetuning on them? This is a key distinction from these prior works. We have made this clearer in our revision, and we believe that our inclusion of new experiments in image-difference captioning better connects our work with this subfield of research.
In reality, we would need PC-CLIP for such tasks more than difference-based classification. There is no evaluation on such tasks… There is prior work that aims to address a similar problem [3,4]. The paper neither discusses nor compares with any of these methods.
In response to your request, we provide additional experiments on Image-Difference captioning and retrieval. To perform evaluations on such tasks, we use PC-CLIP paired with [3] for image-difference captioning (as they use the original vanilla CLIP weights). This allows for a fair comparison of the usability of PC-CLIP versus the standard approach in [3] that uses CLIP. We restate the experimental results from the overall response here for clarity:
Spot-the-diff retrieval
| Retrieval Models | Recall @ 5 | Recall @ 10 |
|---|---|---|
| CLIP + IDC | 3.0 | 3.7 (from [3]; 4.8 in our reproduction) |
| PC-CLIP + IDC | 3.6 | 5.2 |
Spot-the-diff caption
| Caption Models | BLEU-4 | METEOR | CIDEr-D | ROUGE-L |
|---|---|---|---|---|
| IFDC | 8.7 | 11.7 | 37.00 | 29.90 |
| VACC | 8.1 | 12.5 | 34.5 | 32.10 |
| CLIP + IDC | 10.61 | 12.82 | 41.17 | 32.96 |
| PC-CLIP + IDC | 10.96 | 12.82 | 43.09 | 33.24 |
We observe that using PC-CLIP for such a task is indeed helpful: we achieve stronger performance than simply using the CLIP weights in the pipeline of [3] on text-to-image-pair retrieval and on 3 of the 4 metrics for image-pair captioning, while matching performance on the last metric. This demonstrates the utility of PC-CLIP for such image-difference captioning and retrieval tasks.
We remark that the dataset Clevr-Change is no longer downloadable via the links in [3], so we cannot provide results on that dataset. Furthermore, [4] does not provide any code, so we cannot replicate their experiments. We have added these new experimental results and references into our revision.
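For clarity on the retrieval metric reported above, here is a minimal sketch of how Recall@K can be computed from a text-to-image-pair similarity matrix, assuming the correct candidate for each query lies on the diagonal; the actual pipeline of [3] differs in how the pair embeddings are formed.

```python
import numpy as np

def recall_at_k(similarity, k):
    """similarity: (num_queries, num_candidates) matrix where entry (i, j)
    scores query text i against candidate image pair j, and the ground-truth
    candidate for query i is assumed to be index i (diagonal)."""
    ranks = np.argsort(-similarity, axis=1)  # best candidates first
    hits = (ranks[:, :k] == np.arange(len(similarity))[:, None]).any(axis=1)
    return hits.mean()

# Toy example with random scores standing in for real embedding similarities.
sim = np.random.randn(100, 100)
print(recall_at_k(sim, k=5), recall_at_k(sim, k=10))
```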
Comparative prompting is not well-grounded... In Table 2, we observe that comparative prompting leads to worse results than non-comparative prompting for CIFAR100, CUB, and almost similar results for SUN397. Additionally, comparative prompting requires a user to create difference captions and tune the alpha hyperparameter. It is not clear whether such efforts are worth the gains on completely unseen tasks and scenarios in comparison to a CLIP model which does not require any of these.
Yes, we agree that comparative prompting does not always help performance. However, we argue that boosting performance on 2 of 5 tasks while maintaining performance on 1 task shows that it is a viable alternative, with large benefits in some cases (e.g., EuroSAT).
We believe that providing a natural way to incorporate prior domain knowledge into the zeroshot classifier, in the form of difference captions, is important and beneficial for performing classification with CLIP-style VLMs. This makes CLIP-based image classifiers more controllable and usable by domain experts through these comparative prompts.
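To make the mechanism concrete, below is a minimal sketch of one way comparative prompting could be applied at inference time: each class's text embedding is shifted by an alpha-weighted embedding of a difference caption contrasting it with a confused class. The combination rule and function name are assumptions for illustration; the paper's exact formula may differ.

```python
import torch
import torch.nn.functional as F

def comparative_logits(image_emb, class_text_embs, diff_text_embs, alpha=0.5):
    """image_emb:       (d,) image embedding.
    class_text_embs: (num_classes, d) embeddings of the plain class prompts.
    diff_text_embs:  (num_classes, d) embeddings of difference descriptions
                     contrasting each class with its most-confused neighbor.
    Returns cosine-similarity logits over classes."""
    shifted = F.normalize(class_text_embs + alpha * diff_text_embs, dim=-1)
    return shifted @ F.normalize(image_emb, dim=-1)

# Toy usage with random stand-ins for CLIP features.
d, num_classes = 768, 10
logits = comparative_logits(torch.randn(d), torch.randn(num_classes, d),
                            torch.randn(num_classes, d), alpha=0.3)
print(int(logits.argmax()))
```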
Apart from zero-shot classification, CLIP model achieved high-level robustness to natural distribution shifts (ImageNetv2/R/A/S). It is unclear how PC-CLIP would perform in such cases.
As per your request, we provide new results on the natural distribution shift benchmarks ImageNet-A and ImageNet-R.
| Model | ImageNet-A | ImageNet-R |
|---|---|---|
| CLIP | 69.07 | 90.33 |
| PC-CLIP | 69.2 | 90.47 |
We observe that PC-CLIP also achieves slightly stronger performance on these natural distribution shifts, in addition to the other downstream tasks, which likely involve even stronger distribution shifts from the CLIP pretraining data. That being said, we also believe that our datasets, such as EuroSAT, include data with very significant distribution shifts (e.g., satellite images) that are even more difficult than the natural distribution shifts mentioned by the reviewer. Thus, with our original datasets and these new natural distribution shift datasets, our experiments show the benefits of PC-CLIP over a variety of distribution shifts in practice.
Additionally, there are no linear probing results to understand the quality of the learned vision embeddings.
In response to your request, we provide additional results in terms of linear probing PC-CLIP embeddings. We train linear probes on top of the vision embeddings produced by CLIP and PC-CLIP (when doing full finetuning). We observe the following results:
| Model | CIFAR100 | CUB | EuroSAT | Flowers | SUN | ImageNet-A | ImageNet-R |
|---|---|---|---|---|---|---|---|
| CLIP | 90.78 | 88.51 | 89.00 | 98.52 | 84.70 | 69.47 | 91.93 |
| PC-CLIP | 90.67 | 88.42 | 89.00 | 98.55 | 84.73 | 70.4 | 92.0 |
PC-CLIP achieves better linear probing performance on a majority of tasks. We remark that while there are slight improvements on the vision encoder, there are much larger benefits for the text encoder. As mentioned in our paper in Table 6, we have results showing that the text embeddings are significantly improved in terms of localizing the differences between class names on various downstream tasks. This aligns with prior work that primarily highlights issues with CLIP’s text embedding space [1].
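For reference, a minimal sketch of a standard linear-probing protocol of the kind described above: frozen image features are extracted once and a logistic-regression probe is fit on top. The probe settings and data shapes are illustrative, not the exact ones behind the numbers in the table.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels, C=1.0):
    """Fit a linear classifier on frozen (PC-)CLIP image features and report
    test accuracy."""
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(500, 768)), rng.integers(0, 10, 500)
X_te, y_te = rng.normal(size=(200, 768)), rng.integers(0, 10, 200)
print(linear_probe(X_tr, y_tr, X_te, y_te))
```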
[1] Yuksekgonul, et. al. When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?
Hi,
I thank the authors for their diligent rebuttal. The numbers on image-difference captioning, retrieval and natural distribution shift look good. I am increasing my rating to 5.
As agreed upon by the authors, I think that comparative prompting may or may not work, since it succeeds on only 50% of the tasks. In addition, its assumption that one of the text representations is well-aligned does not make sense.
The paper proposes a method to train CLIP image/text encoders such that differences between CLIP's image embeddings are aligned with text descriptions of the image differences. The authors further show through empirical results that this yields better performance on difference-based classification while maintaining neutral performance on zero-shot prompting.
Strengths
The strengths of the paper are as follows:
- The proposed method is simple and can be easily extended as a main objective during CLIP pretraining (not just as a finetuning objective).
- The proposed method shows significant improvement on tasks involving difference-based classification.
Weaknesses
The weaknesses of the paper are:
- The main contribution of the paper is limited. Finetuning CLIP on the proposed objective (aligning g(I_1) - g(I_2) with f(T_{1,2})) is a novel contribution; however, in itself it is not sufficient. Similarly, the empirical studies in this paper are not very impressive. While Table 1 results on difference-based classification are impressive, the rest of the results show very marginal improvements over the baselines.
- There seems to be a lot of repetition in the form of summarizing the same set of results/observations over and again. The authors must reduce such repetition and include more details from the appendix, such as the training details.
Questions
- In Sec 4.1, it is mentioned that the 1000 randomly sampled images were paired and a filtering was applied to remove poor quality generations -- are there any ablations without such a filtering? This is important in the context of scaling up this approach to millions of images where we might end up with lower quality generations, so learning how much lower quality generations impact the final performance could be a good indicator of the scalability of this approach.
- In L419, I think the authors meant to say "highly confused in Table 3" instead of "highly confused in Section 4.3".
We thank the reviewer for their constructive feedback. We hope to answer your questions with our new experiment and address your concerns below.
The main contribution of the paper is limited…. While Table 1 results on difference-based classification are impressive, the rest of the results show very marginal improvements over the baselines.
We respectfully disagree with the reviewer that our contribution is limited. As you point out, our finetuning on pairwise differences is indeed novel, and the process to generate this data at scale with LLMs is also a novel contribution. While in some experiments the improvements are slight, we argue that performing this finetuning and then observing gains on the majority of downstream tasks is significant. In fact, the generality of our experimental results is one of the main appealing factors of PC-CLIP.
We have shown gains in terms of zeroshot classification (with various different techniques, including different LLM-generated class prompts), improved text embeddings (Table 6), improved arithmetic in embedding space for text-to-image generation (Table 5 and Figure 4), and new experiments that show improvements in image-difference retrieval and captioning (Tables 7 and 8). The fact that we achieve widespread benefits in the variety of considered experimental settings shows the significance of our approach.
lot of repetition in the form of summarizing the same set of results/observations over and again
This is a good observation. The repetitive nature of the dataset often comes from the fact that we have generated pairs from 1000 examples, and individual images will show up in a large number of pairs. This can naturally be reduced by considering a larger dataset over more than 1000 examples, and only allowing for examples to show up in a small number of pairs. This is a useful suggestion, and we hope to include this in future work. We believe that our results are strong even with these images appearing in multiple repetitive pairs.
In Sec 4.1, it is mentioned that the 1000 randomly sampled images were paired and a filtering was applied to remove poor quality generations -- are there any ablations without such a filtering?
In response to your comment, we performed an additional ablation where we finetune PC-CLIP on the full unfiltered set of data (where 200k examples have been filtered from the total 990k). We observe the following results (bolding the best-performing method and italicizing the second-best-performing method).
Zeroshot classification
| Model | CIFAR100 | CUB | EuroSAT | Flowers | SUN |
|---|---|---|---|---|---|
| CLIP | 85.59 | 81.72 | 54.96 | 81.51 | 72.46 |
| PC-CLIP Unfiltered | 85.81 | 80.46 | 58.81 | 81.57 | 73.11 |
| PC-CLIP | 86.12 | 80.08 | 57.15 | 81.95 | 73.58 |
Difference-based classification
| Model | AwA2 | CIFAR100 Size | CUB | Flowers Color |
|---|---|---|---|---|
| CLIP | 51.74 ± 1.34 | 54.92 ± 1.11 | 53.32 ± 0.22 | 52.97 ± 2.12 |
| PC-CLIP Unfiltered | 57.70 ± 0.37 | 65.40 ± 1.24 | 64.92 ± 0.16 | 67.08 ± 2.03 |
| PC-CLIP | 58.52 ± 0.46 | 67.44 ± 1.29 | 64.91 ± 0.21 | 67.55 ± 2.11 |
We observe that while PC-CLIP with filtering achieves the strongest performance on a majority of tasks, PC-CLIP Unfiltered outperforms vanilla CLIP weights on 4 of the 5 tasks, and even outperforms PC-CLIP with filtering on one task. This supports that our finetuning method is robust to noise in the LLM generations.
"highly confused in Table 3" instead of "highly confused in Section 4.3"
Thanks for pointing that out! We have fixed this in our revision.
Dear Reviewer i1CP,
In our response above, we have tried to address all your comments and concerns. To summarize, we have:
- Added a new experiment that demonstrates the robustness of PC-CLIP to the unfiltered LLM generations
- This experiment still shows improved performance in terms of ZS classification and difference-based classification
- Addressed a question on repetition in the generations, due to images being reused in multiple pairs
Thank you again for taking the time to review our work and we hope to hear back from you soon. Please let us know if you have any additional questions!
Thanks to the authors for the detailed rebuttal. I have gone through my own review, the other reviewers' reviews, and the authors' rebuttals on all reviews, and I maintain my rating of 5 due to the following outstanding concerns:
(1) Scaling up: the results on scaling up to larger CLIP models indeed look very marginal.
(2) Most of the results (in the original manuscript and the tables in the rebuttal) show very marginal improvements due to the proposed method.
(3) Several clarifications / other edits are needed in the manuscript.
We thank the reviewers for their thoughtful and descriptive feedback. We are glad that reviewers found that our finetuning objective is "novel" [i1CP, i5go], allowing for "valuable new capabilities like difference-based classification and comparative prompting" [Juc5], with a "comprehensive empirical evaluation" [Juc5, i5go] and "interesting and insightful" evaluations of the embedding space.
In response to reviewer comments, we have provided many new experiments in this rebuttal that we believe strengthen the contributions of the paper.
Image-difference Captioning and Retrieval
As suggested by reviewer SGyP, we have added new experiments on image-difference retrieval and captioning, following the experimental guidelines in [1]. When using PC-CLIP instead of CLIP in the pipeline of [1], we observe better performance in terms of both retrieval (i.e., picking the pair of images described by a text difference) and captioning (i.e., describing the difference between two images), as well as compared to the other captioning baselines from their paper.
| Retrieval Model | Recall @ 5 | Recall @ 10 |
|---|---|---|
| CLIP + IDC | 3.0 | 3.7 |
| PC-CLIP + IDC | 3.6 | 5.2 |
| Caption Models | BLEU-4 | METEOR | CIDEr-D | ROUGE-L |
|---|---|---|---|---|
| IFDC | 8.7 | 11.7 | 37.00 | 29.90 |
| VACC | 8.1 | 12.5 | 34.5 | 32.10 |
| CLIP + IDC | 10.61 | 12.82 | 41.17 | 32.96 |
| PC-CLIP + IDC | 10.96 | 12.82 | 43.09 | 33.24 |
Natural Distribution Shifts
Furthermore, we have added more results on the natural distribution shift benchmarks ImageNet-A and ImageNet-R, where we again achieve slight performance improvements over the CLIP baseline.
| Model | ImageNet-A | ImageNet-R |
|---|---|---|
| CLIP | 69.07 | 90.33 |
| PC-CLIP | 69.2 | 90.47 |
Ablations to Noise in Generations
In response to Reviewer i1CP's request, we performed an additional ablation where we finetune PC-CLIP on the full unfiltered set of data. We observe the following results (where the best-performing method is bolded and the second-best-performing method is italicized):
Zeroshot classification
| Model | CIFAR100 | CUB | EuroSAT | Flowers | SUN |
|---|---|---|---|---|---|
| CLIP | 85.59 | 81.72 | 54.96 | 81.51 | 72.46 |
| PC-CLIP Unfiltered | 85.81 | 80.46 | 58.81 | 81.57 | 73.11 |
| PC-CLIP | 86.12 | 80.08 | 57.15 | 81.95 | 73.58 |
Difference-based classification
| Model | AwA2 | CIFAR100 Size | CUB | Flowers Color |
|---|---|---|---|---|
| CLIP | 51.74 ± 1.34 | 54.92 ± 1.11 | 53.32 ± 0.22 | 52.97 ± 2.12 |
| PC-CLIP Unfiltered | 57.70 ± 0.37 | 65.40 ± 1.24 | 64.92 ± 0.16 | 67.08 ± 2.03 |
| PC-CLIP | 58.52 ± 0.46 | 67.44 ± 1.29 | 64.91 ± 0.21 | 67.55 ± 2.11 |
We observe that while PC-CLIP with filtering achieves the strongest performance on a majority of tasks, PC-CLIP Unfiltered outperforms vanilla CLIP weights on 4 of the 5 tasks, and even outperforms PC-CLIP with filtering on one task. This supports that our finetuning method is robust to noise in the LLM generations.
PC-CLIP at Larger CLIP Model Scale
As requested by Reviewer Juc5, we provide an experiment where we train a ViT-H/14 model and show the results, compared to standard CLIP features with this model. We observe the following results:
| Model | CIFAR100 | CUB | EuroSAT | Flowers | SUN |
|---|---|---|---|---|---|
| CLIP (ViT-H) | 87.60 | 86.42 | 53.56 | 89.06 | 75.64 |
| PC-CLIP (ViT-H) | 87.85 | 86.85 | 53.93 | 88.91 | 75.68 |
This demonstrates that PC-CLIP is indeed effective at larger CLIP model scales and can scale up.
We thank the reviewers again for their time and effort in providing detailed reviews. We have uploaded our revision, with changes highlighted in red. We now address other reviewer comments in their individual threads below.
[1] CLIP4IDC: CLIP for Image Difference Captioning
This paper studies how to improve CLIP image and text embeddings on the pairwise difference reasoning task. Concretely, the authors observe that a King - man + woman = Queen capability exists in text embedding models, while such a capability is missing in CLIP embeddings. To encourage this capability, the authors propose to use an LLM to explicitly output, in text, the difference between two captions. CLIP is then used to encode the LLM output and match it against the difference of the image embeddings of the corresponding two images.
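For concreteness, a minimal sketch of the kind of alignment objective described above, assuming a CLIP-style symmetric contrastive (InfoNCE) loss between image-embedding differences and the text embeddings of LLM-generated difference descriptions; this is one plausible reading of the training setup, not the paper's verbatim implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_difference_loss(img_emb_1, img_emb_2, diff_text_emb, temperature=0.07):
    """img_emb_1, img_emb_2: (B, d) image embeddings for the two images in each pair.
    diff_text_emb:        (B, d) text embeddings of the LLM-generated description
                          of how image 1 differs from image 2.
    Returns a symmetric InfoNCE loss aligning embedding differences with the
    difference texts, using in-batch negatives."""
    diff = F.normalize(img_emb_1 - img_emb_2, dim=-1)
    text = F.normalize(diff_text_emb, dim=-1)
    logits = diff @ text.t() / temperature
    targets = torch.arange(len(diff), device=diff.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
B, d = 8, 768
loss = pairwise_difference_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, d))
print(float(loss))
```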
Strengths:
- The paper is easy to follow
- The idea of pairwise difference encoding is novel and interesting.
Weaknesses:
- The authors do not clearly justify why such a capability is needed for CLIP embeddings. Although the authors argue for potential benefits in image generation and see improvement on CLIPScore, CLIPScore is not very reliable at capturing differences in detailed descriptions. The difference in Figure 4 is quite limited. There also appears to be no side-by-side human preference comparison to justify the effectiveness of the PC-CLIP text embedding model on the image generation task.
- Scaling up the size of CLIP achieves limited improvement.
- Comparative prompting does not seem to be well grounded and does not always improve performance. The significance of comparative prompting is limited.
Given all the weaknesses and the reviewers' ratings, I recommend rejection.
Additional Comments from the Reviewer Discussion
The reviewers raised concerns about:
- The motivation of having such capability on the CLIP model
- The performance of scaling up the CLIP model
- The comparative prompting significance.
- The hallucination introduced via LLM.
Although the authors provide detailed responses to all of them, the reviewers are not fully satisfied. While they either maintain or raise their scores, the ratings are 5 (marginally below the acceptance threshold).
Reject