PaperHub
Average Rating: 4.0 / 10 (Withdrawn · 3 reviewers)
Individual Ratings: 3, 6, 3 (min 3, max 6, std 1.4)
Average Confidence: 4.3
ICLR 2024

Fooling Contrastive Language-Image Pre-Training with CLIPMasterPrints

OpenReview · PDF
Submitted: 2023-09-21 · Updated: 2024-03-26
TL;DR

We mine and analyze fooling master images for Contrastive Language Image Pre-training, which attack many prompts at the same time and generalize to semantically related prompts.

Abstract

Keywords
contrastive language image pre-training; adversarial attacks; fooling master images; modality gap; projected gradient descent; latent variable evolution; black-box attack; ai robustness

Reviews and Discussion

Official Review (Rating: 3)

This paper examines the possibility of attacking pre-trained CLIP models, i.e. generating a fooling image that maximizes its embedding cosine similarity with a set of given prompts. Three possible attack methods are considered: stochastic gradient descent, latent variable evolution, and projected gradient descent. Experiments on ImageNet show the effectiveness of the proposed methods. It is also found that fooling images trained on a small number of image captions potentially generalize to a much larger number of semantically related captions.
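For concreteness, a minimal sketch of what such gradient-based mining could look like is given below (using the openai/CLIP Python package; the prompt list, the Adam optimizer, and the mean-similarity objective are illustrative assumptions rather than the paper's exact setup):

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)
model = model.float()
for p in model.parameters():
    p.requires_grad_(False)  # CLIP weights stay frozen; only the image pixels are optimized

# Hypothetical target prompts; the paper attacks many prompts simultaneously.
prompts = ["a photo of a dog", "a photo of a car", "a photo of a pizza"]
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# CLIP's input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

# Start from noise in [0, 1] and optimize the pixels directly.
img = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([img], lr=1e-2)

for step in range(1000):
    x = (img.clamp(0, 1) - mean) / std
    img_emb = model.encode_image(x)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = -(img_emb @ text_emb.T).mean()  # maximize mean cosine similarity to all prompts
    opt.zero_grad()
    loss.backward()
    opt.step()
```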

Strengths

Measuring the similarities between images and text is an important topic, and it is surprising to see that with simple adversarial attacks we are able to generate a fooling image that simultaneously maximizes its cosine similarity across multiple different prompts.

The use of latent variable evolution in the context of adversarial attacks is interesting. LVE does not require access to the model weights, thus overcoming a limitation of common gradient-based methods such as SGD and PGD.

The paper is well-written and easy to follow.

Weaknesses

I am not very convinced by the setting of the problem. In particular, I am not convinced why we need a fooling image that maximizes its similarity with many other prompts. What are the potential concerns of this vulnerability?

It seems that the methodology is not different enough from the existing literature. In particular, as admitted in the paper, none of SGD, LVE, or PGD is a technical contribution of this paper, and it seems that the paper merely evaluates these methods in the context of fooling CLIP.

Only the original CLIP models are evaluated in the paper, but there are many later improvements such as TCL (https://arxiv.org/pdf/2202.10401.pdf), ALBEF (https://arxiv.org/abs/2107.07651), BLIP (https://arxiv.org/abs/2201.12086), and so on. Do they have the same vulnerability? How does this vulnerability scale with the size of the pre-training data (https://arxiv.org/abs/2210.08402)?

Questions

The title is misleading. If I understand correctly, the paper does not include pre-training experiments but only carries out evaluations of pre-trained CLIP.

Please see weakness.

Comment

We thank the reviewer for their feedback and questions. We address them in a point-by-point manner:

I am not very convinced by the setting of the problem. In particular, I am not convinced why we need a fooling image that maximizes its similarity with many other prompts. What are the potential concerns of this vulnerability?

The reviewer makes an excellent point for classification settings: indeed, we find that in a classification setting, where the similarity scores are passed through a softmax, the attack has the same effect as a regular adversarial attack (i.e. the adversarial class with the incidentally highest similarity score becomes the predicted output class).
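For illustration, a toy numerical example of this effect is given below; the similarity values are made up, and the factor of 100 corresponds to CLIP's learned logit scale:

```python
import torch

# Made-up cosine similarities of a fooling master image to five class prompts:
# all are high, but after the softmax only the incidentally-largest one matters,
# so for a classifier the effect equals that of a standard adversarial example.
sims = torch.tensor([0.71, 0.69, 0.73, 0.70, 0.68])
probs = torch.softmax(100.0 * sims, dim=0)        # CLIP's learned logit scale is ~100
print(probs.argmax().item(), probs.max().item())  # class 2 dominates the prediction
```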

Nevertheless, as emphasised by the authors of CLIP (Radford et al., 2021), a further highly relevant application is zero-shot image retrieval, which offers plenty of attack surface by means of CLIPMasterPrints. In more detail, inserting a single CLIPMasterPrint into an existing database of images could potentially disrupt the system’s functionality for a wide range of search terms, as for each targeted search term the inserted fooling master image is likely to be the top result.

When inserting several CLIPMasterPrints into the database, even the top-n results could consist entirely of these adversarial images rather than the true results. While this is also possible when inserting “regular” adversarial examples, the number of examples needed for an attack using fooling master images is orders of magnitude lower than for regular adversarial examples. Practical malicious applications of this vulnerability include:

  • censorship of images related to a list of censored topics
  • adversarial product placement: targeting a variety of searched brands to advertise a different product as the top result
  • disruption of service: introducing a large number of unrecognizable CLIPMasterPrints for a wide range of topics, resulting in unintelligible results for many queries and thus reducing the quality of service of an image retrieval system.

Mechanisms for introducing CLIPMasterPrints into a database depend on the application but could be as simple as putting images online to be crawled by search engines or uploading them through web forms.
To make this point clearer, we have added a Section 5 on “Potential Attack Scenarios” to the revised manuscript.
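As a toy illustration of this retrieval threat model (random embeddings only, no real CLIP features involved): a single inserted vector crafted to lie close to many query embeddings hijacks the top-1 result for most queries.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy gallery of 1000 "benign" image embeddings and 20 query (prompt) embeddings,
# all random unit vectors; real CLIP embeddings are only mimicked in spirit here.
gallery = F.normalize(torch.randn(1000, 512), dim=-1)
queries = F.normalize(torch.randn(20, 512), dim=-1)

# Stand-in for a CLIPMasterPrint: a single embedding crafted to sit close to
# all query embeddings at once (here simply their normalized mean).
fooling = F.normalize(queries.mean(dim=0, keepdim=True), dim=-1)
gallery = torch.cat([gallery, fooling], dim=0)

top1 = (queries @ gallery.T).argmax(dim=-1)               # best match per query
hijacked = (top1 == gallery.shape[0] - 1).float().mean()  # fraction of queries won by the single inserted image
print(hijacked)                                           # close to 1.0
```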

It seems that the methodology is not different enough from the existing literature. In particular, as admitted in the paper, none of SGD, LVE, or PGD is a technical contribution of this paper, and it seems that the paper merely evaluates these methods in the context of fooling CLIP.

As we also point out in the response to Reviewer zgK9, it is important to note that LVE is a family of approaches and that our approach is not very similar to previous work, which we have also clarified again in the revised manuscript. For example, it differs in the evolved solutions, the generative model, the optimized loss function, and the application domain.
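For readers less familiar with this family of methods, a generic LVE-style loop could look like the sketch below: a latent vector is evolved with a gradient-free optimizer and decoded into an image, while the target model is only queried for similarity scores. The CMA-ES optimizer, the prompts, and the toy random-projection "decoder" are illustrative placeholders, not the paper's actual components:

```python
import numpy as np
import torch
import torch.nn.functional as F
import cma   # pip install cma
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)
model = model.float()

prompts = ["a photo of a dog", "a photo of a car", "a photo of a pizza"]  # illustrative targets
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

# Toy stand-in for the decoder of a pretrained generative model: a fixed random
# linear map from latent space to a low-resolution image, upsampled to CLIP's
# input size. The actual method would decode latents with a real generator.
latent_dim, lowres = 256, 32
proj = torch.tensor(np.random.default_rng(0).standard_normal((latent_dim, 3 * lowres * lowres)),
                    dtype=torch.float32, device=device) / latent_dim ** 0.5

def decode(z: np.ndarray) -> torch.Tensor:
    zt = torch.tensor(z, dtype=torch.float32, device=device)
    img = torch.sigmoid(zt @ proj).view(1, 3, lowres, lowres)
    return F.interpolate(img, size=224, mode="bilinear", align_corners=False)

def score(z: np.ndarray) -> float:
    # Query-only access to the model: no gradients are used, mimicking a black-box setting.
    with torch.no_grad():
        emb = model.encode_image((decode(z) - mean) / std)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        return (emb @ text_emb.T).mean().item()

es = cma.CMAEvolutionStrategy(np.zeros(latent_dim), 0.5)
for _ in range(100):
    candidates = es.ask()
    es.tell(candidates, [-score(z) for z in candidates])  # CMA-ES minimizes, so negate
fooling_image = decode(es.result.xbest)
```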

Only the original CLIP models are evaluated in the paper, but there are many later improvements such as TCL (https://arxiv.org/pdf/2202.10401.pdf), ALBEF (https://arxiv.org/abs/2107.07651), BLIP (https://arxiv.org/abs/2201.12086), and so on. Do they have the same vulnerability? How does this vulnerability scale with the size of the pre-training data (https://arxiv.org/abs/2210.08402)?

Given the much more common use of CLIP compared to its successors, we find the vulnerability of the original model to be highly significant even though newer approaches exist. Nevertheless, we agree that an evaluation on these improved models is desirable, and have therefore mined additional CLIPMasterPrints for BLIP (Li et al., 2022) and SigLIP (Zhai et al., 2023) models; we find that both approaches are vulnerable to the attack as well. The corresponding results can be found in Figure 5 of the revised manuscript. Investigating how the vulnerability scales w.r.t. dataset size would be a valuable addition to the manuscript, which we plan to add in the future by comparing e.g. the performance of mined CLIPMasterPrints on OpenCLIP models trained on LAION datasets of different sizes.

The title is misleading. If I understand correctly, the paper does not include pre-training experiments but only carries out evaluations of pre-trained CLIP.

We decided to spell out CLIP in the title, as we found “Fooling CLIP with CLIPMasterPrints” to be a poor choice in terms of style. But we agree with the reviewer that it could be misleading and have therefore renamed the paper to "Fooling Contrastive Language-Image Pre-trained Models with CLIPMasterPrints".

Concerning the reviewer’s remark on not conducting any pre-training: we propose a mitigation approach in which we refine the original ViT weights in order to push the manifold of text embeddings away from mined CLIPMasterPrints. However, this approach has been moved to the Appendix due to limited space.

Comment

I thank the authors for their rebuttal. Thanks for explaining the potential concerns of the vulnerability of CLIP, and I also appreciate the commitment to adding more results. However, I am still not entirely convinced by the use case and novelty of CLIPMasterPrints. It seems that the novelty of the paper is a specific variant of LVE, but then the comparison with vanilla LVE is missing. As regards the use case, on the one hand white-box attacks seem impractical as they assume access to the weights of the retrieval model, while on the other hand black-box attacks like LVE seem very easy to detect due to their difference from natural images. I am not convinced that CLIPMasterPrints can be a real threat.

Comment

Thank you for taking the time to read our rebuttal and respond. Regarding use cases: we believe that white-box attacks are going to become quite common in the future, as many startups and companies will rely on pre-trained foundation models, where open models will likely be preferred to avoid licensing fees and over-dependence on the issuing companies. Black-box attacks, on the other hand, despite producing unrecognizable images, still remain useful in a disruption-of-service setting as detailed in the paper: an image retrieval system is much more easily flooded with unrecognizable images if a single image responds with high cosine similarities to a large number of search terms (and their semantic equivalents).

Official Review (Rating: 6)

This paper studies adversarial attacks on CLIP models and proposes CLIPMasterPrints, a type of image that can maximize CLIP scores with a wide range of prompts. The paper proposes three ways to mine these CLIPMasterPrints: SGD, PGD, and gradient-free optimization for when the model weights are inaccessible. Detailed experiments on image recognition tasks show that the extracted CLIPMasterPrints can fool the pre-trained CLIP model across a wide range of categories. The authors also study how to mitigate the attack risks from CLIPMasterPrints by reducing the modality gap between the text and image encoders in CLIP.

Strengths

  1. This paper proposes a new type of adversarial attack for CLIP models, CLIPMasterPrints. Given that CLIP is a foundational vision-language model with wide application, this topic plays an important role in mitigating the risks of misusing CLIP.

  2. This paper proposes several technical ways to mine the CLIPMasterPrints, from gradient-based methods to non-gradient methods, which cover diverse scenarios depending on whether the CLIP weights are accessible or not.

  3. This paper also studies a way to reduce the risks of CLIPMasterPrints. Although the solution builds on an existing finding (the modality gap), it is still good to see a solution.

Weaknesses

  1. Limited experiments: this paper only conducts experiments on ImageNet. It would be interesting to see if the conclusion still holds for other recognition datasets (note that CLIP is evaluated on dozens of datasets). Moreover, this paper only uses CLIP ViT-L models. It would also be interesting to see the performance on CLIP ResNet models.

Questions

See Weakness.

Details of Ethics Concerns

Not applicable.

Comment

We thank the reviewer for their insights and kind words regarding our work. Figure 4 in the revised manuscript shows the performance of PGD-mined CLIPMasterPrints for an ensemble of 64 ResNet50 models as the image encoder on ImageNet. Additionally, we ran more experiments to compare with more recent improvements over CLIP, such as BLIP (Li et al., 2022) and SigLIP (Zhai et al., 2023). Our results show that the model using a ResNet ensemble is somewhat less vulnerable to CLIPMasterPrints, yet the obtained scores are still on par with the baseline. Furthermore, BLIP and SigLIP appear to be vulnerable to CLIPMasterPrints as well. We agree that results on different datasets would add to the overall picture and plan to add these (e.g. Flickr30k) in the future.

Comment

Hi, thanks for your rebuttal. You've addressed all my questions; I'll keep the score at 6 but raise my confidence to 5.

Comment

Thank you for taking the time to respond to our rebuttal.

Official Review (Rating: 3)

The paper focuses on mining images, referred to as “fooling master images” or “CLIPMasterPrints”, that fool the CLIP model by obtaining higher image-text similarity scores than clean image-text pairs. These CLIPMasterPrints are optimized to obtain high similarity scores across different text embeddings. The authors show that these CLIPMasterPrints also generalize to semantically related text prompts that are not directly considered in the optimization process. They attribute the existence of such CLIPMasterPrints to the poorly aligned CLIP image-text embeddings. They counter CLIPMasterPrints by adjusting the CLIP alignment, i.e. by shifting the centroids of the image and text embeddings.
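For reference, a minimal sketch of such centroid shifting in the spirit of Liang et al. (2022) is shown below; the `lam` parameter and the symmetric split of the gap vector are assumed for illustration, not necessarily the paper's exact countermeasure:

```python
import torch
import torch.nn.functional as F

def close_modality_gap(img_emb: torch.Tensor, txt_emb: torch.Tensor, lam: float = 0.5):
    """Shift image and text embeddings toward a shared centroid, then re-normalize.
    With lam=0.5 both modalities move halfway, so their centroids coincide."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    gap = img_emb.mean(dim=0) - txt_emb.mean(dim=0)        # the "modality gap" vector
    img_shifted = F.normalize(img_emb - lam * gap, dim=-1)
    txt_shifted = F.normalize(txt_emb + (1 - lam) * gap, dim=-1)
    return img_shifted, txt_shifted
```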

Strengths

  • Simple approach that successfully fools CLIP across different text embeddings.
  • Paper is easy to read and understand.
  • Illustrations are clear.

Weaknesses

Prior works have already shown that CLIP is vulnerable to adversarial attacks (Noever & Miller Noever, 2021; Daras & Dimakis, 2022; Goh et al., 2021). This work on fooling master images is a variant of such adversarial attacks, utilizing already existing optimization algorithms like SGD, PGD, and LVE (Latent Variable Evolution) to craft an image with perturbations. The difference here is that the adversarial objective aims to fool different text embeddings. Furthermore, countering CLIPMasterPrints is performed by shifting the centroids of the image and text embeddings, which was proposed in the prior work of Liang et al. (2022) (cited in the paper). Therefore, this limits the originality, novelty, and technical contributions of this work.

Questions

Although the concept of CLIPMasterPrints for CLIP is interesting, the paper does not meet the criteria for conference acceptance. I suggest this paper would be a better fit for a workshop.

Comment

We thank the reviewer for the feedback. However, we respectfully disagree concerning the paper’s contribution. We argue that the acceptance of a manuscript at major venues should not only depend on how simple or complex its proposed approaches are, but rather on the insight and results provided by the conducted work. We consider work that contributes an analysis of existing, widely used approaches such as CLIP equally important to work that proposes novel algorithms. In this work, we have focused on analysis and insight, demonstrating a novel and, in our opinion, somewhat unexpected and counterintuitive result. We show that the result can be obtained using different approaches, after which we connect it to a phenomenon previously observed in the literature (the modality gap) by showing that countermeasures against the observed phenomenon also mitigate the model's vulnerability to the attack. It is also important to note that LVE is a family of approaches and that our approach clearly differs from previous work, which we have clarified in the updated manuscript. For example, it is completely different from existing approaches in terms of the evolved solutions, the generative model used, the optimized loss function, and the application domain.

In a nutshell: the paper’s novelty lies not in the applied approaches (although we are not aware of any previous work using our LVE approach, also with respect to generating off-manifold images), but in the results and insights obtained, as well as in directing attention to the investigated vulnerability. We have clarified these points in the revised manuscript.

Comment

We would like to thank all the reviewers for their insightful comments, which we believe have helped us to significantly improve our paper. We address their comments in a point-by-point fashion below. In summary, we (1) have elaborated on the contributions of the paper and will emphasise them in the manuscript, (2) find that our approach holds when using different visual encoders and more recent improvements over CLIP (Li et al., 2022; Zhai et al., 2023), and (3) have elaborated on potential attack scenarios in image retrieval to which the introduced CLIPMasterPrints approach could be applied. The corresponding changes have been added to the revised paper.

Comment

We hope we were able to address the main concerns of the reviewers. We understand this is on quite short notice, yet we would be grateful to receive brief feedback from the reviewers on our recent revision before the discussion period closes.

If there are any further questions or issues, please let us know.