Overall rating: 5.0 / 10 (Rejected, 4 reviewers)
Individual ratings: 6, 5, 6, 3 (min 3, max 6, std 1.2)
Confidence: 4.8
Correctness: 2.8 | Contribution: 2.3 | Presentation: 2.8
ICLR 2025

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Submitted: 2024-09-21 | Updated: 2025-02-05


Keywords
Universal Adversarial Attacks, Vision-Language Pretraining Models, Generative Attacks

Reviews and Discussion

Review (Rating: 6)

This paper introduces a universal adversarial attack method called C-PGC for VLP models. By leveraging the contrastive learning mechanisms within VLP models, C-PGC generates a UAP that can effectively attack across different tasks and models without requiring individual perturbations for each input. Experiments demonstrate that C-PGC disrupts image-text feature alignment efficiently in both white-box and black-box settings, outperforming existing attack methods.

Strengths

  1. The writing is clear. The formulas are correct.
  2. The experiments are extensive and multi-dimensional.
  3. The research topic is important for VLMs.

Weaknesses

  1. The generator-based UAP method is time-consuming due to its indirect optimization approach, as it does not directly update the UAP.
  2. In multimodal contrastive loss, randomly selecting texts or images may not be ideal. Instead, selecting items related to the current image-text pair in the batch could improve the performance.
  3. The method uses the default settings of SGA in the experiment (i.e., resizing the original images into five scales {0.5, 0.75, 1, 1.25, 1.5} and applying Gaussian noise with a mean of 0 and a standard deviation of 0.5). It would be beneficial to explain the effect of this augmentation and to support the explanation with experimental results.

Questions

Please see the Weaknesses.

Ethics Concerns

The paper is related to adversarial attacks, an important subarea of AI security.

Comment

Thank you for dedicating your time and effort to reviewing our paper. We are deeply encouraged by your positive comments and recognition of our work. Below, we provide point-by-point responses to address your concerns.

[Q1] The time budget of generator training.

Yes! Training a generator can be time-consuming; this is a common issue across the research field of generative attacks [1,2,3]. However, the generative paradigm brings significant performance improvements, and once the generator finishes training, the universal perturbation can be applied directly to any image-text pair without requiring any sample-specific optimization.

[Q2] Positive sample selection.

We are deeply sorry for any misunderstanding our paper may have caused. Please kindly allow us to clarify the positive sample selection. Taking the image perturbation as an example, we first randomly sample a batch of texts as candidates, and we then choose the texts with the farthest feature distance from the current adversarial image as positive samples. That is, our framework does take the current input pair into consideration when selecting items for the contrastive loss. The ablation study in Section 4.4 of the main body has verified the effectiveness of this farthest-selection strategy.
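For concreteness, a minimal PyTorch-style sketch of this farthest-distance selection is given below (names and shapes are illustrative only, not our released implementation):

```python
import torch
import torch.nn.functional as F

def select_farthest_texts(adv_img_feat, cand_text_feats, k=1):
    """Pick the k candidate texts farthest (lowest cosine similarity) from the
    current adversarial image feature, to serve as positive samples."""
    # adv_img_feat:    (d,)   feature of the current adversarial image
    # cand_text_feats: (B, d) features of a randomly sampled batch of texts
    sim = F.cosine_similarity(cand_text_feats, adv_img_feat.unsqueeze(0), dim=-1)  # (B,)
    return torch.topk(-sim, k).indices  # indices of the farthest texts
```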

We will highlight this design to make it more clear. Also, we will include your constructive suggestion as a potential direction for a better selection strategy in the Discussion section of the Appendix.

[Q3] The effect of set-level augmentation.

Yes. As described in the method section, we are motivated by the significant gains introduced by SGA's augmentation technique [4] and hence integrate it into the proposed framework to enhance the universal perturbation. The underlying mechanism is to leverage the many-to-many relationships between images and texts: multiple augmented images provide set-level diverse guidance, which further improves the optimization direction. Following your suggestion, we conducted experiments in which this augmentation strategy is removed from our method.

Table A. Comparison of C-PGC and its variant that cancels the data augmentation on Flickr30K.

| Source | Method (Target →) | ALBEF TR | ALBEF IR | TCL TR | TCL IR | X-VLM TR | X-VLM IR | CLIP_VIT TR | CLIP_VIT IR | CLIP_CNN TR | CLIP_CNN IR | BLIP TR | BLIP IR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALBEF | w/o Aug | 69.78 | 74.79 | 47 | 57.26 | 20.43 | 37.55 | 42.36 | 65.17 | 53.63 | 71.6 | 41.22 | 55.34 |
| ALBEF | Ours | 90.13 | 88.82 | 62.11 | 64.48 | 20.53 | 39.38 | 43.1 | 65.93 | 54.4 | 72.51 | 44.79 | 56.36 |
| CLIP_VIT | w/o Aug | 18.5 | 37.8 | 22.19 | 39.86 | 13.47 | 34.17 | 86.46 | 87.11 | 61.53 | 71.36 | 25.03 | 44.73 |
| CLIP_VIT | Ours | 23.23 | 38.67 | 25.05 | 41.79 | 15.85 | 35.59 | 88.92 | 93.05 | 66.06 | 75.42 | 26.71 | 45.7 |

Table A demonstrates that the augmentation mechanism indeed provides an improvement in ASR, proving the rationality of set-level augmentation. We will include these results in Appendix E to complement our ablation studies.
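For reference, a simplified sketch of this set-level augmentation is shown below, using the SGA default settings quoted above (five scales and zero-mean Gaussian noise); the function name and exact details are illustrative rather than our released code:

```python
import torch
import torch.nn.functional as F

def set_level_augment(images, scales=(0.5, 0.75, 1.0, 1.25, 1.5), noise_std=0.5):
    """Return one augmented view of the batch per scale: resize the images and
    add zero-mean Gaussian noise (SGA-style set-level augmentation)."""
    # images: (B, C, H, W) tensor in the model's input range
    views = []
    for s in scales:
        resized = F.interpolate(images, scale_factor=s, mode="bilinear",
                                align_corners=False)
        views.append(resized + noise_std * torch.randn_like(resized))
    return views
```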

Thank you again for your thoughtful suggestion! If you have any further concerns, feel free to reach out to us. :)

[1] Feng W, Xu N, Zhang T, et al. Dynamic generative targeted attacks with pattern injection. In CVPR, 2023.

[2] Poursaeed O, Katsman I, Gao B, et al. Generative adversarial perturbations. In CVPR, 2018.

[3] Gao H, Zhang H, Wang J, et al. NUAT-GAN: Generating Black-box Natural Universal Adversarial Triggers for Text Classifiers Using Generative Adversarial Networks. In IEEE TIFS, 2024.

[4] Lu D, Wang Z, Wang T, et al. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models. In ICCV, 2023.

Review (Rating: 5)

The paper presents a novel framework, C-PGC, designed to generate Universal Adversarial Perturbations (UAPs) targeting Vision-Language Pretraining (VLP) models. The authors introduce a cross-modal conditional perturbation generator, which leverages both single-modal and cross-modal features to disrupt the learned alignment between visual and textual representations in VLP models.

Strengths

The writing style of the paper is commendably clear and concise, making it accessible to a broad audience within the machine learning and computer vision communities. The authors have taken care to present the technical details in a manner that is straightforward and easy to follow, even for readers who may not be deeply familiar with adversarial attacks or VLMs. The method’s components are explained in a way that balances technical rigor with simplicity. This makes the paper highly readable and ensures that a wide range of researchers and practitioners can engage with its contributions.

The experimental results demonstrate that the proposed C-PGC framework performs well across several benchmarks. The authors have conducted comprehensive experiments on multiple well-established datasets and across various VLP models. The results show consistent improvements in attack success rates (ASR) across both white-box and black-box settings.

Weaknesses

The paper, while strong overall, has several areas for improvement:

  1. Use of Contrastive Loss
    The inclusion of the contrastive loss ($\mathcal{L}_{CL}$) feels somewhat forced. Since the goal is to perform untargeted attacks, it seems unnecessary to rely on contrastive loss, which is typically used to enforce alignment between representations. While the authors have shown its utility through ablation studies, the logical foundation of using contrastive loss in an untargeted setting remains unclear. The paper could be improved by either rethinking the rationale behind using $\mathcal{L}_{CL}$ or exploring alternative loss functions better suited for untargeted attacks.

  2. Choice of Positive and Negative Samples for Contrastive Loss
    The current method of manually selecting the farthest sample as the positive or negative example feels arbitrary and unnecessarily complex. In the context of untargeted attacks, where the objective is not to make the adversarial sample resemble a specific class, it would make more sense to introduce a synthetic, "fictitious" sample that maximally deviates from the original, rather than relying on the farthest feature-distance sample. This approach could simplify the process and make the use of contrastive loss more coherent in the untargeted setting.

  3. Limited Comparison with State-of-the-Art
    The comparison of the proposed method only with GAP, a 2018 work, limits the scope of the evaluation. Given the rapid advancements in adversarial attack methods, comparing against more recent techniques would provide a clearer picture of the method’s competitiveness. For example, comparing with more contemporary adversarial generation methods would strengthen the experimental section and make the results more relevant to current research.

  4. Visual Design of the Framework Diagram
    The framework diagram could benefit from improved design and color harmony. While this is a minor issue, it affects the overall presentation quality. The authors could refer to well-designed diagrams from recent top-tier papers to make the visualizations clearer and more appealing.

These adjustments would enhance the logical foundation of the method and improve both the clarity and relevance of the experimental comparisons.

Post-Rebuttal:

The main issue lies in the fact that the authors' response fails to address my concerns regarding the use of Contrastive Loss in the paper. The goal of the paper is to construct untargeted adversarial attacks, yet the authors manually select negative samples to construct the Contrastive Loss, which is problematic. If we follow the authors' stated motivation that "destruction is easier than construction," the approach should involve maximizing the Contrastive Loss during normal training, rather than manually selecting negative samples to minimize it. For these reasons, I am inclined to maintain my current rating.

Questions

Please refer to the Weaknesses section. If the authors can address these issues, I would be willing to raise my score.

  1. Use of Contrastive Loss: Perhaps the authors could offer a more detailed explanation of their rationale for using contrastive loss in this context and discuss potential alternative loss functions they considered, as well as why contrastive loss was ultimately chosen despite its typical use in alignment tasks.

  2. Choice of Positive and Negative Samples for Contrastive Loss
    Introducing a synthetic, "fictitious" sample that maximally deviates from the original is a more direct way. Perhaps the authors could discuss the trade-offs between their current approach and this suggested method.

  3. Limited Comparison with State-of-the-Art
    The authors could select 2-3 recent and relevant adversarial attack methods from related work for comparison.

Comment

We sincerely express our gratitude for dedicating your valuable time to providing insightful suggestions that can enhance our paper. Your praise regarding our writing, methodology, experiments, and contributions has greatly encouraged and motivated us. Our detailed responses to all of your concerns are presented below.

[Q1] Use of Contrastive Loss.

We sincerely apologize for not adequately explaining the rationale behind employing the contrastive learning mechanism in our paper. We then provide a detailed explanation of the underlying principles of using contrastive loss, supported by more experimental results.

  1. Motivation. It is widely acknowledged that contrastive learning serves as a powerful and foundational tool for modality alignment in VLP models, establishing a nearly point-to-point relationship between image and text features. Our core idea stems from the general principle: "It's easier to tear down than to build up." Since contrastive learning can effectively establish robust and precise alignment, leveraging the same technique to disrupt these established alignments is expected to yield effective attack performance.

  2. Rationale. Taking the image attack as an example, the principle behind our contrastive learning-based attack can be understood from two perspectives. (1) Leverage the originally matched texts as negative samples to push the aligned image-text pair apart. This broadly corresponds to the common objective of untargeted attacks that you have kindly mentioned. (2) Additionally, our contrastive paradigm introduces additional benefits by using dissimilar texts as positive samples to pull the adversarial image $v^{adv}$ out of its original subspace and relocate it to an incorrect feature area. By simultaneously harnessing the collaborative effects of push (negative samples) and pull (positive samples), the proposed contrastive framework effectively destroys the modality alignment and achieves exceptional attack performance, which has been validated by comprehensive experimental results.

Besides, we also explore several potential alternative loss functions that align more directly with the common untargeted attack objective, including maximizing the negative cosine similarity $\mathcal{L}_{Cos}$ or the MSE distance $\mathcal{L}_{MSE}$ between the features of matched image-text pairs.
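To make the compared objectives concrete, here is a simplified, single-positive sketch of the three losses written as quantities to be minimised (illustrative only; the exact formulation and temperature follow the main text):

```python
import torch
import torch.nn.functional as F

def attack_losses(adv_img_feat, matched_txt_feat, far_txt_feat, tau=0.07):
    """Illustrative objectives for one adversarial image feature (anchor).
    matched_txt_feat: originally paired text (negative sample);
    far_txt_feat:     farthest selected text (positive sample)."""
    sim_pos = F.cosine_similarity(adv_img_feat, far_txt_feat, dim=0) / tau
    sim_neg = F.cosine_similarity(adv_img_feat, matched_txt_feat, dim=0) / tau

    # Contrastive attack loss: pull the anchor towards the dissimilar text
    # (positive) and push it away from the matched text (negative).
    l_cl = -torch.log(torch.exp(sim_pos) / (torch.exp(sim_pos) + torch.exp(sim_neg)))

    # Alternatives discussed above, written as losses to be minimised:
    l_cos = F.cosine_similarity(adv_img_feat, matched_txt_feat, dim=0)  # minimise similarity
    l_mse = -F.mse_loss(adv_img_feat, matched_txt_feat)                 # maximise MSE distance
    return l_cl, l_cos, l_mse
```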

Table A. ASR results of different loss functions when the surrogate is ALBEF.

| Method (Target →) | ALBEF TR | ALBEF IR | TCL TR | TCL IR | X-VLM TR | X-VLM IR | CLIP_VIT TR | CLIP_VIT IR | CLIP_CNN TR | CLIP_CNN IR | BLIP TR | BLIP IR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\mathcal{L}_{MSE}$ | 12.02 | 30.75 | 14.39 | 35.08 | 11.41 | 30.79 | 37.32 | 56.05 | 40.17 | 56.39 | 19.66 | 37.33 |
| $\mathcal{L}_{Cos}$ | 57.55 | 67.4 | 37.06 | 49.45 | 10.7 | 28.48 | 37.49 | 58.3 | 40.87 | 58.39 | 23.33 | 39.44 |
| $\mathcal{L}_{CL}$ | 76.46 | 82.46 | 56.52 | 62.61 | 14.33 | 33.61 | 42.98 | 62.81 | 46.11 | 65.58 | 27.13 | 46.44 |
| $\mathcal{L}_{MSE}+\mathcal{L}_{Dis}$ | 81.09 | 83.71 | 48.76 | 56.54 | 17.58 | 35.72 | 41.5 | 64.72 | 47.41 | 70.34 | 35.96 | 51.76 |
| $\mathcal{L}_{Cos}+\mathcal{L}_{Dis}$ | 65.2 | 72.71 | 36.13 | 50.06 | 18.63 | 36.74 | 42.23 | 65.17 | 50.91 | 69.78 | 36.91 | 50.69 |
| $\mathcal{L}_{CL}+\mathcal{L}_{Dis}$ | 90.13 | 88.82 | 62.11 | 64.48 | 20.53 | 39.38 | 43.1 | 65.93 | 54.4 | 72.51 | 44.79 | 56.36 |

Recall that $\mathcal{L}_{CL}$ and $\mathcal{L}_{Dis}$ denote the proposed contrastive loss and the unimodal loss term, respectively. As observed, the use of $\mathcal{L}_{CL}$ consistently brings significant ASR improvements, verifying the rationality and superiority of the contrastive loss.

We will include these analyses in a new section of the Appendix to supplement a more thorough understanding of the rationale behind the choice of contrastive loss.
Thank you once again for your inspiring suggestion!

Comment

[Q2] Choice of Samples for Contrastive Loss.

Very interesting suggestion! Introducing synthetic samples that are maximally distant from the anchor as positive samples is a promising strategy to enhance attack performance. A reasonable implementation might involve adversarial learning to generate such maximally distant samples. However, this strategy necessitates synthesizing these maximally distant samples for each input pair during the generator training, which can lead to a significant increase in the computational overhead. In contrast, the current sampling and selection method achieves impressive attack performance without imposing significant additional burdens, striking a balance between efficiency and effectiveness.
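As a rough illustration of what such an adversarially synthesised positive could look like (a hypothetical sketch, not the procedure used in our paper; note that the inner optimisation runs per input pair, which is exactly the overhead mentioned above):

```python
import torch
import torch.nn.functional as F

def synthesize_far_positive(anchor_feat, init_feat, steps=10, lr=0.1):
    """Adversarially synthesise a 'fictitious' positive feature whose cosine
    similarity to the anchor is minimised (per-sample, hence extra cost)."""
    pos = init_feat.clone().detach().requires_grad_(True)
    opt = torch.optim.SGD([pos], lr=lr)
    for _ in range(steps):
        loss = F.cosine_similarity(pos, anchor_feat, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pos.detach()
```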

Also, we will include this discussion in the Appendix to encourage future exploration of more efficient and effective positive sample selection strategies.

[Q3] Comparison with the State-of-the-Art.

Thanks for this thoughtful advice! We supplement a recent study of UAP on VLP models for comparison. Please refer to the [Q2] in the Common Concerns.

[Q4] Visual Design of the Framework Diagram.

Thank you for your valuable suggestion! Following your advice, we adjusted the color schemes and visual materials of the framework diagram in the revision of our paper, which will be uploaded soon. Furthermore, we will learn from more top-tier papers to further enhance the paper's expressiveness and appeal.

We hope these responses can address your concerns. Once again, we deeply appreciate your valuable suggestions for improving our work and would be delighted to further discuss with you.

Comment

Thanks for addressing my concerns; I get your point. Have you tried using the negative direction of the contrastive loss, instead of manually selecting negative samples to construct the contrastive loss?

Comment

Dear Reviewer U4zG,

Thank you once again for dedicating your valuable time to reviewing our paper and providing constructive comments!

As the end of the discussion period approaches, we kindly ask if our responses have satisfactorily addressed your concerns. Your feedback would be greatly appreciated, and we would be delighted to engage in further discussions with you.

Sincerely,

The Authors

Comment

Thanks for your inspiring acknowledgment of our response!

We apologize for not fully understanding what "the negative direction of the contrastive loss, instead of manually selecting negative samples to construct the contrastive loss" means, since it is generally necessary to obtain both positive and negative samples to formulate a contrastive loss. Does it refer to the strategy we proposed in our response to [Q2]? The experiments are close to completion and we will upload the results very soon.

We would appreciate it if you could provide more details so that we can more accurately address this matter. :)

Comment

maximum the contrastive loss

Comment

We are very sorry, but we have not attempted to maximize the contrastive loss, since maximizing it seems contrary to our attack goal.

Could you please tell us more details about the purpose of this experiment? Also, we wonder whether the supplementary experiments for [Q2] in Author Response (Part IV) have adequately addressed your concern.

Thanks for your response!

Comment

Dear U4zG,

We have carefully supplemented responses to your further questions and provided experiments following your suggestion. We look forward to your reply and welcome discussion on any questions regarding our paper and response.

Best regards,

Authors

Comment

Dear U4zG,

We apologize for still not fully understanding the aim of "maximum the contrastive loss". Therefore, we clarify the following facts as potential answers to your raised question. Please kindly review them to see if this response has solved your concern.

  1. We highlight that it requires both positive and negative samples to formulate the contrastive learning paradigm. Generally, positive samples are defined as the targets that the anchor aims to move closer to, while negative samples are those that it seeks to move away from [1,2,3].
  2. Maximizing the established contrastive loss violates our attack goal, since this operation actually pushes the adversarial image closer to the original matched texts (negative samples) and pulls the adversarial image away from the dissimilar texts (positive samples). This also fails to obey the definition of contrastive learning, since the anchor sample is supposed to get away from the negative samples and closer to the positive samples.
  3. We provide results where we directly maximize the cosine similarity and MSE between the adversarial sample and its paired data in Table A of Author Response (Part I). Also, we have provided experiments using synthetic samples based on adversarial learning in Table B of Author Response (Part IV).
  4. Do you actually mean maximizing the distance between the anchor sample and the positive samples?

We sincerely hope that you can tell us whether the above responses have adequately solved your concerns or provide more details about your question. We're glad to have further discussion with you. If you are satisfied with our responses, we would greatly appreciate it if you could kindly consider raising your score accordingly. :)

We're looking forward to your reply.

Sincerely,

Authors

[1] Khosla P, Teterwak P, Wang C, et al. Supervised contrastive learning. In NIPS, 2020.

[2] Li J, Selvaraju R, Gotmare A, et al. Align before fuse: Vision and language representation learning with momentum distillation. In NIPS, 2021.

[3] Yang J, Duan J, Tran S, et al. Vision-language pre-training with triple contrastive learning. In CVPR, 2022.

Comment

Dear Reviewer U4zG,

This is a gentle reminder that the rebuttal period is approaching its conclusion. Since we have addressed your major concerns (Q1, Q2, Q3, Q4) during the rebuttal, we would like to kindly ask if you could consider raising your score. :)

Best regards,

Authors

Comment

Dear Reviewer U4zG,

Since only about one hour remains in the discussion phase, we kindly send this message as a final gentle reminder of our responses.

As we have addressed your major concerns (Q1, Q2, Q3, Q4) during the rebuttal, we would like to kindly ask if you could consider raising your score. :)

Best regards,

Authors

Review (Rating: 6)

This paper addresses the vulnerability of Vision-Language Pre-training (VLP) models to universal adversarial perturbations, which are instance-agnostic and do not require individual perturbations for each input. The authors introduce a novel attack method, the Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC), which leverages contrastive learning to disrupt the multimodal alignment in VLP models. Experiments are conducted across multiple VLP models and tasks.

Strengths

  1. The proposed UAP framework addresses the inefficiencies of instance-specific attacks by incorporating cross-modal and unimodal guidance within a contrastive training setup, representing an advancement in universal adversarial attack methods.

  2. The paper thoroughly evaluates C-PGC's effectiveness across multiple VLP models and downstream tasks, and additionally analyzes various defense strategies to mitigate the potential risks posed by C-PGC.

Weaknesses

The proposed method leverages image and text attacks alongside cross-modal contrastive learning to generate universal adversarial perturbations. While this approach shows promise, the novelty may not be fully evident. I recommend that the authors consider further highlighting and reorganizing the unique contributions of the paper to enhance its clarity and impact.

Questions

  1. In the text modality attack, how do the authors maintain semantic similarity between original and adversarial texts? In the experiments, the authors should provide the similarity scores (e.g. bert_score) between original and adversarial texts to demonstrate that the modifications do not significantly alter the text's semantics.

  2. The authors should explain how ASR is calculated in the main text.

  3. As shown in Figure 4, the adversarial texts exhibit a clear semantic gap from the original texts. Thus, would using special characters (e.g. ##*) for the universal adversarial word be more effective?

  4. The authors propose a contrastive training perturbation generator to produce universal adversarial perturbations for images and text. I am curious about how this generator differs from general UAP methods (e.g. Data-free Universal Adversarial Perturbation and Black-box Attack ), justifying its designation as a "generator."

I look forward to your detailed response.

Comment

We would like to express our gratitude for your valuable suggestions and positive feedback on our paper. Our detailed responses are provided below.

[Q1] Semantic similarity between adversarial text and its original text.

Very inspiring question! We have carried out a deeper investigation into this issue; please refer to [Q1] of the Common Concerns for a detailed answer.

[Q2] The way to calculate the ASR.

We fully align our evaluation protocol with prior adversarial attacks on VLP models [1,2,3], where the ASR is calculated as the proportion of successful adversarial samples among the originally correctly predicted pairs. We will add this description to the experimental section. Thanks for your kind reminder!
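In other words, writing the computation explicitly (the notation here is only for illustration):

$$\mathrm{ASR} = \frac{\#\{\text{originally correct pairs that become mispredicted after applying the UAP}\}}{\#\{\text{originally correctly predicted pairs}\}} \times 100\%$$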

[Q3] Employment of special characters.

Interesting suggestion! We employ an optimization-based strategy to obtain the text perturbation, where we iteratively update the generator to output more effective text embeddings. In fact, the vocabulary used for mapping text embeddings back to discrete words already includes special tokens such as # and *. In other words, the final adversarial word is assigned based on the optimized text embeddings and may itself turn out to be a special token.
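As a rough sketch of this embedding-to-word mapping step (a simplified nearest-neighbour projection; function and variable names are illustrative, not our exact implementation):

```python
import torch

def embedding_to_token(adv_word_emb, vocab_embs, vocab_tokens):
    """Map an optimised adversarial word embedding back to a discrete token by
    nearest-neighbour search over the tokenizer vocabulary (which contains
    special tokens such as '#' and '*' alongside natural-language words)."""
    # adv_word_emb: (d,)   vocab_embs: (V, d)   vocab_tokens: list of V strings
    dists = torch.cdist(adv_word_emb.unsqueeze(0), vocab_embs).squeeze(0)  # (V,)
    return vocab_tokens[int(dists.argmin())]
```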

To further investigate the impact of these special tokens, we conduct experiments where we directly adopt # and * as adversarial tokens to evaluate their attack results:

Table A. ASR of C-PGC and its variants using special characters as the adversarial word.

| Source | Adv. word (Target →) | ALBEF TR | ALBEF IR | TCL TR | TCL IR | X-VLM TR | X-VLM IR | CLIP_VIT TR | CLIP_VIT IR | CLIP_CNN TR | CLIP_CNN IR | BLIP TR | BLIP IR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALBEF | # | 87.81 | 85.74 | 60.84 | 62.05 | 18.28 | 35.87 | 38.67 | 61.4 | 50.91 | 68.21 | 41.71 | 54.11 |
| ALBEF | * | 87.21 | 84.87 | 60.24 | 62.19 | 18.01 | 35.4 | 38.79 | 61.91 | 51.27 | 68.1 | 41.92 | 54.69 |
| ALBEF | Ours | 90.13 | 88.82 | 62.11 | 64.48 | 20.53 | 39.38 | 43.1 | 65.93 | 54.4 | 72.51 | 44.79 | 56.36 |
| CLIP_VIT | # | 21.25 | 37.41 | 24.27 | 41.04 | 14.71 | 34.38 | 87.07 | 92.39 | 63.2 | 75.14 | 25.46 | 44.19 |
| CLIP_VIT | * | 22.07 | 37.54 | 24.58 | 41.32 | 14.81 | 34.77 | 87.57 | 92.41 | 63.78 | 74.86 | 25.76 | 44.87 |
| CLIP_VIT | Ours | 23.23 | 38.67 | 25.05 | 41.79 | 15.85 | 35.59 | 88.92 | 93.05 | 66.06 | 75.42 | 26.71 | 45.7 |

Table B. Comparison of C-PGC and its variants using special characters as the adversarial word, in terms of the BERT score between clean and adversarial texts. We adopt 5,000 texts from Flickr30K to calculate these results.

| Adv. word (Source model →) | ALBEF P↑ | ALBEF R↑ | ALBEF F1↑ | TCL P↑ | TCL R↑ | TCL F1↑ | CLIP_VIT P↑ | CLIP_VIT R↑ | CLIP_VIT F1↑ | CLIP_CNN P↑ | CLIP_CNN R↑ | CLIP_CNN F1↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # | 0.8213 | 0.8419 | 0.8313 | 0.8171 | 0.8389 | 0.8277 | 0.8137 | 0.8339 | 0.8235 | 0.8156 | 0.8364 | 0.8257 |
| * | 0.8149 | 0.8251 | 0.8197 | 0.8098 | 0.8206 | 0.8149 | 0.8095 | 0.8206 | 0.8148 | 0.8097 | 0.8203 | 0.8147 |
| Ours | 0.8891 | 0.8613 | 0.8748 | 0.8924 | 0.8687 | 0.8802 | 0.8746 | 0.8684 | 0.8713 | 0.8948 | 0.8842 | 0.8893 |

Tables A and B show that our optimization-based strategy achieves both superior attack performance and higher semantic similarity. Note that using special tokens as adversarial words is also more conspicuous than natural language, which might compromise the attack stealthiness and increase the likelihood of being detected by human observers or automated filtering systems.
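For reference, the BERT scores in Table B can be computed with the public bert_score package along the following lines (a usage sketch; the caption pair below is only a placeholder, while the reported numbers use 5,000 Flickr30K texts):

```python
from bert_score import score  # pip install bert-score

# Parallel lists of original and adversarial captions (placeholders below).
clean_texts = ["A brown dog walks in the grass with its tongue hanging out."]
adv_texts = ["A brown dog getting in the grass with its tongue hanging out."]

P, R, F1 = score(adv_texts, clean_texts, lang="en")
print(f"P={P.mean():.4f}  R={R.mean():.4f}  F1={F1.mean():.4f}")
```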

We will supplement these discussions into the revision to promote future studies on the selection of the universal adversarial word. Thanks again for this interesting and inspiring question!

Comment

[Q4] Details about the perturbation generator.

Our generator $G_w(\cdot)$ adopts a decoder-based CNN architecture with cross-attention layers to integrate cross-modal knowledge. It upsamples a low-dimensional fixed noise vector $z_v$ into high-dimensional features. For image perturbations, $G_w(\cdot)$ directly outputs the image perturbation. For text attacks, it generates word embeddings that are subsequently flattened and mapped back to the discrete word domain. That is, the fundamental design principle is analogous to that of the generators employed in prior studies such as GAP [4] and NUAT-GAN [5], with differences primarily in structural composition rather than conceptual innovations.
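A heavily simplified sketch of such a conditional generator is given below (layer sizes, the perturbation budget, and all names are illustrative placeholders, not the actual architecture):

```python
import torch
import torch.nn as nn

class CondPerturbationGenerator(nn.Module):
    """Toy decoder-style generator: upsamples a fixed noise vector into an
    image-sized perturbation, with text features injected via cross-attention."""

    def __init__(self, z_dim=128, txt_dim=256, out_ch=3):
        super().__init__()
        self.fc = nn.Linear(z_dim, 256 * 8 * 8)
        self.txt_proj = nn.Linear(txt_dim, 256)
        self.cross_attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_ch, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z, txt_feats, eps=12 / 255):
        # z: (1, z_dim) fixed noise; txt_feats: (1, L, txt_dim) text condition;
        # eps: illustrative L_inf budget for the returned perturbation.
        h = self.fc(z).view(1, 256, 8, 8)
        q = h.flatten(2).transpose(1, 2)        # (1, 64, 256) spatial tokens
        kv = self.txt_proj(txt_feats)           # (1, L, 256)
        attn_out, _ = self.cross_attn(q, kv, kv)
        h = (q + attn_out).transpose(1, 2).reshape(1, 256, 8, 8)
        delta = self.decoder(h)                 # (1, out_ch, 64, 64), in [-1, 1]
        return eps * delta                      # bounded universal perturbation
```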

[Q5] Reorganization of contributions.

Thank you for your thoughtful suggestions. Our primary contributions are as follows:

  1. We design a cross-modal conditioned perturbation generator to produce effective UAPs for both image and text modalities.
  2. We propose the first malicious contrastive paradigm tailored for multimodal adversarial attacks. Firstly, we devise selection strategies (e.g., the farthest distance selection) to obtain positive and negative samples based on the attack objective. Then, we leverage these meticulously constructed samples to contrastively train the perturbation generator under both unimodal and multimodal guidance.

We will more clearly reorganize and highlight the innovations and contributions in the Introduction sections of the paper.

Thank you again for your valuable feedback! If you have any further questions or suggestions, please don’t hesitate to tell us. :)

[1] Zhang J, Yi Q, Sang J. Towards adversarial attack on vision-language pre-training models. In ACM MM, 2022.

[2] Lu D, Wang Z, Wang T, et al. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models. In ICCV, 2023.

[3] Wang H, Dong K, Zhu Z, et al. Transferable multimodal attack on vision-language pre-training models. In S&P, 2024.

[4] Poursaeed O, Katsman I, Gao B, et al. Generative adversarial perturbations. In CVPR, 2018.

[5] Gao H, Zhang H, Wang J, et al. NUAT-GAN: Generating Black-box Natural Universal Adversarial Triggers for Text Classifiers Using Generative Adversarial Networks. In IEEE TIFS, 2024.

Review (Rating: 3)

This paper proposes a method to learn universal perturbations that can transfer across different Vision-Language Pre-training (VLP) models and downstream tasks. The authors leverage a contrastive loss to disrupt cross-modal interactions and use a Euclidean distance-based loss to maximize the distance between adversarial data and the original data. Experimental results show that the proposed method achieves strong attack performance on various VLP models and downstream tasks.

Strengths

  1. The paper focuses on an important task of evaluating robustness of VLP models.
  2. Both adversarial images and texts are learned.

Weaknesses

Several concerns remain:

  1. Motivation:
  • In the abstract, the authors claim to "fully utilize the characteristics of Vision-and-Language (V+L) scenarios by incorporating both unimodal and cross-modal information." However, the authors do not seem to fully exploit the characteristics of different V+L scenarios or tasks.
  • In the introduction, Figure 1 compares two methods and claims that "the generator-based approach GAP consistently achieves superior ASR compared to UAP." Since UAP uses the DeepFool method to learn perturbations, its inferior performance compared to a generator-based approach does not necessarily demonstrate the superiority of the generator-based method over other approaches, e.g., PGD. More experiments, including comparisons with other strong baselines like PGD, or a more comprehensive analysis of why generator-based methods are required, are needed to support this claim.

  2. Algorithm:
  a. My biggest concerns are the definitions of universal perturbation learning and adversarial text learning.
  • The authors use generators to produce adversarial data based on cross-modal conditions. The main advantage of universal adversarial attacks is their ability to produce perturbations that are generalizable across all data without needing to generate sample-specific perturbations, thereby improving efficiency. In other words, universal perturbations should be independent of the test data and applicable to unseen data. However, relying on cross-modal conditions appears to conflict with this objective. If cross-modal conditions are required, why not generate sample-specific perturbations instead? The authors need to clarify how to maintain the universality of the perturbations without using cross-modal conditions. Additionally, the authors should report results of universal attacks for more scenarios, e.g., using perturbations generated on the Flickr30k dataset to attack models on the MSCOCO dataset.
  • Learning adversarial text perturbations requires ensuring that they do not compromise the quality of the original texts. However, the authors did not address this, rendering the algorithm impractical. Further verification is required, such as proposing metrics to evaluate the semantic consistency of perturbed texts or discussing potential methods to constrain the text perturbations to maintain readability and coherence.
  b. In addition, the authors utilize contrastive learning to disrupt the cross-modal relationships and use the Euclidean distance-based loss to enlarge the distance between adversarial data and their original counterparts. First, the authors utilize contrastive learning to enlarge the gap between multiple texts and minimize the distance with diverse target texts. However, I question whether setting different targets can truly maximize the distance between adversarial and original images. An ablation study is necessary to verify this, such as comparing the proposed approach with a baseline that does not use diverse target texts, or analyzing how different choices of target texts affect the effectiveness of the perturbations. Second, the authors use two distinct losses to maximize the distance between adversarial images and each modality in the original image-text pairs. It is unclear why two losses are needed, rather than using a unified loss for both modalities. Additional experiments comparing the two-loss approach with a unified loss, along with an analysis of the impact on both cross-modal and intra-modal relationships, should be conducted.
  c. Perturbation learning methods, including set-level augmentation, maximizing both intra- and inter-model differences, and leveraging contrastive learning, have already been explored by current approaches [1,2]. The specific contribution of this method remains unclear, aside from generating universal perturbations instead of sample-specific ones. Furthermore, the distinction between the generation of universal and sample-specific perturbations remains unclear.

  3. Experiments:
  • Previous works on universal adversarial attacks for VLP models should be discussed and compared, such as [3]. Additional comparison with relevant methods should be provided.
  • The authors apply data augmentation to improve the method's effectiveness, but additional comparisons with other augmentation techniques should be conducted to better demonstrate the proposed method's superiority. Examples include ScMix [3] and Admix [4].

[1] Lu D, Wang Z, Wang T, et al. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models. In ICCV, 2023.
[2] Kim M, Tack J, Hwang S J. Adversarial self-supervised contrastive learning. In NIPS, 2020.
[3] Wang X, He X, Wang J, et al. Admix: Enhancing the transferability of adversarial attacks. In ICCV, 2021.
[4] Zhang P F, Huang Z, Bai G. Universal adversarial perturbations for vision-language pre-trained models. In SIGIR, 2024.

Questions

Please refer to Weaknesses.

Comment

We sincerely thank you for your precious time and effort in providing a wealth of suggestions to enhance the quality of our paper. We have carefully read all the comments and provide detailed point-by-point responses as follows. Hopefully, we can adequately address your concerns.

[Q1.1] Fully utilize the characteristics of V+L scenario.

Thanks for your constructive advice. This claim aims to highlight that in contrast to the unimodal scenarios, we fully leverage the unique characteristics of multimodal scenarios to enhance the modeling of a universal perturbation that can generalize to various samples and V+L tasks. Specifically, we elaborate on this claim from the following main aspects.

  1. Cross-Modal Conditions in Perturbation Generator:
    We design a perturbation generator with cross-modal conditions to benefit from cross-modal knowledge. In contrast, generators in unimodal generative attacks rely solely on information from a single modality.
  2. Malicious Multimodal Contrastive Learning Paradigm:
    We formulate a multimodal contrastive learning paradigm for multimodal adversarial attacks. In comparison, unimodal attacks typically focus on interactions within a single modality to generate adversarial samples.
  3. Joint Perturbations Across Modalities:
    Similar to prior sample-specific adversarial attacks on VLP models [1, 2], our approach applies perturbations to both the image and text modalities. Conversely, unimodal attacks only allow the perturbation of a single modality.

Extensive experiments demonstrate the effectiveness of these techniques, revealing that our framework utilizes the characteristics of V+L scenarios to achieve superior performance.

Besides, while investigating task-level V+L characteristics to enhance attacks for specific tasks is a promising direction, we highlight that the primary focus of our paper lies in leveraging the shared and joint characteristics of Vision-Language scenarios to present a universal and versatile UAP that can effectively generalize to diverse downstream V+L tasks.
Thanks again for your thoughtful comment. We will supplement these analyses in Appendix I of our revision.

[Q1.2] Comparison with PGD-based UAP algorithm.

Thanks for this insightful suggestion! Following your advice, we supplement the comparison between PGD-based UAP and GAP as follows:

Table A. ASR results of GAP and UAP learned by DeepFool and PGD respectively.

| Method (Source →) | ALBEF TR | ALBEF IR | TCL TR | TCL IR | X-VLM TR | X-VLM IR | CLIP_VIT TR | CLIP_VIT IR | CLIP_CNN TR | CLIP_CNN IR | BLIP TR | BLIP IR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UAP_DeepFool | 13.98 | 14.95 | 12.28 | 16.59 | 3.66 | 13.78 | 20.94 | 36.68 | 26.42 | 41.88 | 11.85 | 18.76 |
| UAP_PGD | 19.83 | 16.88 | 15.8 | 18.54 | 4.25 | 14.08 | 20.41 | 35.58 | 27.52 | 42.7 | 12.73 | 18.98 |
| GAP | 69.78 | 81.59 | 22.15 | 29.97 | 6.61 | 18.37 | 23.4 | 37.54 | 29.92 | 44.29 | 16.09 | 28.12 |

The results indicate that the UAP learned through PGD indeed obtains better results compared to the DeepFool-based one. Nevertheless, GAP still achieves significantly higher fooling rates, validating the necessity of employing a generator. Furthermore, numerous studies on generative adversarial attacks [3,4,5] have also similarly highlighted the superior performance of generative methods over non-generative ones, thereby supplementing strong empirical support for our claim.
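For completeness, a simplified sketch of the PGD-style UAP baseline used in this comparison is shown below (illustrative; the loss callable, image size, and hyperparameters are placeholders rather than the exact setup):

```python
import torch

def pgd_uap(attack_loss, loader, eps=12 / 255, alpha=1 / 255, epochs=5):
    """Learn a single universal image perturbation by projected gradient
    ascent on an attack loss accumulated over batches (illustrative)."""
    delta = torch.zeros(1, 3, 224, 224, requires_grad=True)
    for _ in range(epochs):
        for images, texts in loader:                 # images: (B, 3, 224, 224)
            loss = attack_loss(images + delta, texts)
            loss.backward()
            with torch.no_grad():
                delta += alpha * delta.grad.sign()   # ascend the attack loss
                delta.clamp_(-eps, eps)              # project to the L_inf ball
            delta.grad.zero_()
    return delta.detach()
```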

We will include these experimental results and cite the related literature in the Introduction section to more confidently affirm our conclusion: "The generator-based approach GAP consistently achieves superior ASR compared to UAP." Thanks again for your valuable suggestion that helps us improve our paper.

Comment

Thanks for the authors' efforts to address the reviewer's concerns. I have several questions that need to be further addressed.

  1. In [Q1.1], leveraging cross-modal knowledge, such as the paired relationships between images and texts, is not a novel concept in cross-modal attacks. Are there any other new insights derived from this task? Furthermore, the authors claim to fully utilize the characteristics of vision-and-language (V+L) scenarios. However, different cross-modal tasks—such as cross-modal retrieval and image captioning—focus on distinct contents and relationships between images and texts, each with unique characteristics. It appears that the authors have not sufficiently explored these task-specific characteristics, raising concerns that the approach may not fully leverage the diverse aspects of V+L scenarios.

  2. In [Q1.2], why do generative methods achieve significantly better performance compared to PGD methods? Is there any underlying rationale behind this difference?

Comment

[Q3.2] Comparison of different data augmentations.

Thank you for this valuable suggestion. Initially, we were motivated by the impressive performance of SGA [2] and hence integrated its image augmentation into our framework to enhance the attack. Following your advice, we reproduce your suggested augmentation techniques while keeping other settings unchanged, i.e., C-PGC$_{ScMix}$ and C-PGC$_{Admix}$. Experimental results are presented as follows:

Table D. Attack performance under different data augmentation strategies.

| Source | Strategy (Target →) | ALBEF TR | ALBEF IR | TCL TR | TCL IR | X-VLM TR | X-VLM IR | CLIP_VIT TR | CLIP_VIT IR | CLIP_CNN TR | CLIP_CNN IR | BLIP TR | BLIP IR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALBEF | C-PGC$_{ScMix}$ | 66.08 | 76.26 | 39.03 | 51.24 | 20.73 | 37.47 | 40.02 | 65.58 | 50.13 | 71.85 | 34.6 | 51.9 |
| ALBEF | C-PGC$_{Admix}$ | 62.8 | 72.23 | 34.47 | 47.78 | 19 | 36.67 | 42 | 64.88 | 48.19 | 69.68 | 32.28 | 50.05 |
| ALBEF | Ours | 90.13 | 88.82 | 62.11 | 64.48 | 20.53 | 39.38 | 43.1 | 65.93 | 54.4 | 72.51 | 44.79 | 56.36 |
| CLIP_VIT | C-PGC$_{ScMix}$ | 20.55 | 37.46 | 24.43 | 41.39 | 13.52 | 32.6 | 79.93 | 88.64 | 55.44 | 69.43 | 24.4 | 43.06 |
| CLIP_VIT | C-PGC$_{Admix}$ | 19.53 | 37.04 | 24.02 | 41.5 | 14.74 | 34.26 | 85.34 | 91.8 | 59.07 | 71.78 | 23.66 | 43.22 |
| CLIP_VIT | Ours | 23.23 | 38.67 | 25.05 | 41.79 | 15.85 | 35.59 | 88.92 | 93.05 | 66.06 | 75.42 | 26.71 | 45.7 |

It can be observed that C-PGC with the current augmentation strategy outperforms ScMix and Admix, validating that the set-level guidance is more suitable for our contrastive training. This is attributed to SGA's alignment-preserving augmentation, which enriches image-text pairs while keeping their inherent alignments intact [2], thereby better maintaining the effectiveness of our malicious contrastive learning. We will include these results in Section E of the Appendix to provide insights for future research.

Finally, we would like to express our gratitude once again for your perceptive and valuable feedback. We hope that the comprehensive response above will effectively address your concerns. It would be our great pleasure to engage in further discussion with you.

[1] Zhang J, Yi Q, Sang J. Towards adversarial attack on vision-language pre-training models. In ACM MM, 2022.

[2] Lu D, Wang Z, Wang T, et al. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models. In ICCV, 2023.

[3] Hayes J, Danezis G. Learning universal adversarial perturbations with generative models. In IEEE Security and Privacy Workshops (SPW), 2018.

[4] Feng W, Xu N, Zhang T, et al. Dynamic generative targeted attacks with pattern injection. In CVPR, 2023.

[5] Gao H, Zhang H, Wang J, et al. NUAT-GAN: Generating Black-box Natural Universal Adversarial Triggers for Text Classifiers Using Generative Adversarial Networks. In IEEE TIFS, 2024.

[6] Moosavi-Dezfooli S M, Fawzi A, Fawzi O, et al. Universal adversarial perturbations. In CVPR, 2017.

[7] Chaubey A, Agrawal N, Barnwal K, et al. Universal adversarial perturbations: A survey. In arXiv, 2020.

[8] Zhang, Peng-Fei, Zi Huang, and Guangdong Bai. Universal Adversarial Perturbations for Vision-Language Pre-trained Models. In ACM SIGIR, 2024.

[9] Poursaeed O, Katsman I, Gao B, et al. Generative adversarial perturbations. In CVPR, 2018.

[10] Kim M, Tack J, Hwang S J. Adversarial self-supervised contrastive learning. In NIPS, 2020.

Comment

[Q2.2.3] Investigation of the loss function. (Updated)

Inspiring question! The two designed loss terms $\mathcal{L}_{CL}$ and $\mathcal{L}_{Dis}$ serve distinct roles in reaching the attack goal from the multimodal and unimodal perspectives, respectively. We have conducted ablation studies in Section 4.4 of the main body that corroborate the effectiveness of both terms, hence confirming the necessity of jointly employing $\mathcal{L}_{CL}$ and $\mathcal{L}_{Dis}$.

To further address your concerns, we conduct experiments with different unified loss alternatives, C-PGC$_{Cos}$ and C-PGC$_{MSE}$, which maximize the negative cosine similarity and the MSE between features of matched image-text pairs, respectively.
Table C. ASR of C-PGC and its variants on Flickr30K. The surrogate model is ALBEF.

| Method (Target →) | ALBEF TR | ALBEF IR | TCL TR | TCL IR | X-VLM TR | X-VLM IR | CLIP_VIT TR | CLIP_VIT IR | CLIP_CNN TR | CLIP_CNN IR | BLIP TR | BLIP IR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C-PGC$_{MSE}$ | 12.02 | 30.75 | 14.39 | 35.08 | 11.41 | 30.79 | 37.32 | 56.05 | 40.17 | 56.39 | 19.66 | 37.33 |
| C-PGC$_{Cos}$ | 57.55 | 67.4 | 37.06 | 49.45 | 10.7 | 28.48 | 37.49 | 58.3 | 40.87 | 58.39 | 23.33 | 39.44 |
| Ours | 90.13 | 88.82 | 62.11 | 64.48 | 20.53 | 39.38 | 43.1 | 65.93 | 54.4 | 72.51 | 44.79 | 56.36 |

These results again reveal the superiority of the proposed loss function over possible unified alternatives, which will also be added in Section E of the Appendix.

[Q2.3] Contribution of our work.

(1) As stated in the main text, the set-level augmentation was proposed by SGA [2] and we incorporate it into our framework for further enhancement, without claiming it as the major contribution. (2) Meanwhile, it is reasonable and common practice to pursue a strategy that jointly maximizes both types of diversity. However, the pivotal distinction lies in how it is effectively implemented and the degree to which it enhances attack efficacy. (3) We also highlight that contrastive learning is a general concept that has indeed been utilized by previous approaches for general beneficial purposes, such as modality alignment and enhancing adversarial robustness [10].

However, our contribution lies in making the first attempt to propose a novel contrastive paradigm tailored to the malicious scenario of vision-language UAP generation, which effectively fools different VLP models across diverse V+L scenarios. Based on the attack objective, we devise tailored positive and negative sample selection strategies (e.g., the farthest-distance selection) to facilitate the contrastive learning of the UAP. Moreover, we propose a novel perturbation generator architecture that takes cross-modal conditions as auxiliary information to promote the modeling of the multimodal UAP. By training the generator with both the introduced set-level cross-modal ($\mathcal{L}_{CL}$) and unimodal ($\mathcal{L}_{Dis}$) guidance, C-PGC generates a powerful UAP against VLP models with excellent generalization and transferability.

As for the distinction between universal and sample-specific attacks, we hope that our response in [Q2.1] has adequately solved your concerns.

[Q3.1] Comparison with a recent baseline.

Thank you for your constructive suggestion! I guess you are referring to [4] in your cited references, is that correct? We have supplemented our comparison with this excellent concurrent work and please refer to [Q2] in the Common Concerns for details.

Comment

[Q2.1] Definition of universal perturbation learning

We are deeply sorry for any misunderstanding our paper may cause you. Please kindly allow us to restate the training and inference paradigm of C-PGC to clarify the use of cross-modal conditions.

Taking the image perturbation as an example, the attacker has access to a training dataset of image-text pairs and learns a perturbation generator that transforms a fixed noise $z_v$ into the universal perturbation $\delta_v$. Specifically, for each training image-text pair $(v, \mathbf{t})$, the texts $\mathbf{t}$ are fed into the generator through cross-attention layers as cross-modal conditions, providing additional knowledge for learning an effective UAP.

However, once the contrastive training of the generator is finished, its final output $\delta_v$ becomes the image UAP, which can then be applied to any test data without requiring any cross-modal conditions. This is consistent with established practices of universal perturbations [6,7,8] and ensures the universality and applicability of the perturbation, which is independent of the test data and applicable to unseen data.

Also, our experimental setup fully aligns with previous UAP studies [6, 8, 9]: we use the training split of each dataset (e.g., MSCOCO) to learn the UAP and evaluate on the validation split. Moreover, experiments involving "using perturbations generated on the Flickr30k dataset to attack models on the MSCOCO dataset" have actually been presented in Table 7 of our Appendix (Table 12 in the revised version). We apologize for the lack of emphasis on this matter and will highlight this experiment in the main text.

[Q2.2.1] Semantic similarity between adversarial text and its original text

Thank you for this meaningful question. Please kindly refer to [Q1] in the Common Concerns for a detailed response.

[Q2.2.2] Ablation study of diverse target texts.

Very insightful suggestion! Following your advice, we implement a variant C-PGC$_{Sin}$, which uses only a single target text with the farthest distance as the positive sample:

Table B. Comparison of C-PGC and its variant using a single text as the target.

| Source | Method (Target →) | ALBEF TR | ALBEF IR | TCL TR | TCL IR | X-VLM TR | X-VLM IR | CLIP_VIT TR | CLIP_VIT IR | CLIP_CNN TR | CLIP_CNN IR | BLIP TR | BLIP IR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALBEF | C-PGC$_{Sin}$ | 82.99 | 86.14 | 49 | 56.98 | 18.19 | 35.79 | 40.52 | 65.9 | 51.09 | 69.68 | 38.54 | 52.86 |
| ALBEF | Ours | 90.13 | 88.82 | 62.11 | 64.48 | 20.53 | 39.38 | 43.1 | 65.93 | 54.4 | 72.51 | 44.79 | 56.36 |
| CLIP_VIT | C-PGC$_{Sin}$ | 20.55 | 37.46 | 24.43 | 41.39 | 13.52 | 32.6 | 79.93 | 88.64 | 55.44 | 69.43 | 24.4 | 43.06 |
| CLIP_VIT | Ours | 23.23 | 38.67 | 25.05 | 41.79 | 15.85 | 35.59 | 88.92 | 93.05 | 66.06 | 75.42 | 26.71 | 45.7 |

The results illustrate that the use of multiple target texts enhances attack effectiveness, validating the efficacy of set-level diverse guidance. As for the choice of target texts, we have provided an alternative strategy C-PGC$_{Rand}$ in Section 4.4 of the main text, where target texts are randomly selected instead of choosing the farthest ones; that comparison verifies the effectiveness of our farthest-selection strategy.

We will add the C-PGC$_{Sin}$ experiment to Section E of the Appendix. We appreciate your valuable suggestion for improving the comprehensiveness of our analysis.

Comment
  1. [Q2.2.1] In the second question from the original review, my concern is how to make sure that adversarial perturbations on texts do not influence the quality of the original texts. Calculating the semantic similarity between clean and adversarial texts cannot solve this problem. Instead, the higher similarity between them shows that the proposed method does not significantly alter the original semantics in the feature space, and thus it cannot achieve effective attacks. The authors should prove that the adversarial texts are visually plausible.
Comment

Dear Reviewer jJKC,

Thank you again for your precious efforts in reviewing our paper and the constructive comments!

As the end of the discussion period approaches, we would like to know whether our responses have properly addressed your concerns. Your feedback will be highly appreciated and we are glad to engage in further discussions with you.

Sincerely,

The Authors

Comment

[Q3] Influence the quality of original texts.

We sincerely apologize for the misunderstanding our response may cause you. Please kindly allow us to provide a more detailed explanation of this issue as follows.

1. Clarifications of the attack goal.

(1) We clarify that ensuring the quality of the initial text is not compromised is indeed valuable, but the basic objective of a malicious untargeted adversarial attacker is to fool the victim model into outputting incorrect predictions while ensuring attack imperceptibility [1, 2, 13, 14], rather than to preserve the quality of the original data.

(2) Correspondingly, our attack performance is verified by the excellent ASR in the extensive experiments of the main text and by the comparison with your kindly suggested baseline in Table C of the Common Concerns, while the attack stealthiness is supported by the results of multiple distance metrics in Tables A and B of the Common Concerns, together presenting a qualified and successful untargeted adversarial attack method.

2. Evaluation of the influence on the quality of original texts.

(3) We emphasize that how to evaluate and reduce the impact on the quality of the original texts is still an open issue that has not been explored by existing cross-modal adversarial attacks [1,2,8,15].

Since the original text is the reference that represents the original sentence semantics, the semantic similarity between the adversarial and original texts, measured by multiple text distance metrics, can be regarded as a reasonable way to evaluate the extent of influence on text quality. Therefore, the results in Tables A and B of the Common Concerns verify C-PGC's better performance in maintaining text quality compared with existing well-acknowledged methods [1,2].

(4) To further address your concerns, we provide several visual demonstrations between our method and SGA [4]. We observe that, in most cases, our method indeed exerts less impact on the quality of original texts.

(Clean) Man taking a photograph of a well-dressed group of teens.
(SGA) Man [rights] a photograph of a well-dressed group of teens.
(Ours) Man [getting] a photograph of a well-dressed group of teens.

(Clean) A young girl wearing a bulky red life jacket floating in a lake.
(SGA) A young girl wearing a bulky red life [school] floating in a lake.
(Ours) A young girl [getting] a bulky red life jacket floating in a lake.

(Clean) A brown dog walks in the grass with its tongue hanging out.
(SGA) A brown dog walks in the [new] with its tongue hanging out.
(Ours) A brown dog [getting] in the grass with its tongue hanging out.

(Clean) Two young men are loading fruit onto a bicycle.
(SGA) [Teens] young men are loading fruit onto a bicycle.
(Ours) Two young men are [getting] fruit unto a bicycle.

(Clean) A dog is walking through some gravel beside a river.
(SGA) A dog is walking through some [like] beside a river.
(Ours) A dog is [getting] through some gravel beside a river.

These visualization results correspond with our numeric results in Table A and B of the Common Concerns. More visualization results will be added to the Appendix. Thanks again for your valuable question!

We're more than glad to have more discussions with you if you have any further questions. :)

[11] Naseer M M, Khan S H, Khan M H, et al. Cross-domain transferability of adversarial perturbations. In NIPS, 2019.

[12] Yang X, Dong Y, Pang T, et al. Boosting transferability of targeted adversarial examples via hierarchical generative networks. In ECCV 2022.

[13] Dong Y, Liao F, Pang T, et al. Boosting adversarial attacks with momentum. In CVPR, 2018.

[14] Chakraborty A., Alam M., Dey V., et al. A survey on adversarial attacks and defences. In CAAI Transactions on Intelligence Technology, 2021.

[15] Wang H, Dong K, Zhu Z, et al. Transferable multimodal attack on vision-language pre-training models. In S&P, 2024.

Comment

Thanks for your detailed feedback! Your acknowledgment of our responses above encourages us a lot. We provide point-by-point responses below to further resolve your concerns.

[Q1.1] Clarification about Leveraging Cross-modal Knowledge.

A very insightful question! We highlight that the use of cross-modal knowledge in previous cross-modal attacks is limited to simply maximizing the feature distance between samples of different modalities for adversarial sample optimization, without any deeper utilization to further enhance attacks [1,2,8].

On the contrary, we are the first to incorporate cross-modal information into the perturbation generator as auxiliary knowledge to facilitate the modeling of multimodal adversarial perturbation. Moreover, we propose the first attack paradigm that utilizes cross-modal knowledge from a novel perspective of malicious contrastive learning. Extensive experimental results have validated the effectiveness of our proposed novel techniques to utilize the cross-modal knowledge, presenting a powerful universal attack framework.

[Q1.2] Task-specific characteristics

We are deeply sorry for any misunderstanding our response may cause you.

We clarify that the essential attack objective of this work aligns with existing well-acknowledged adversarial attacks on VLP models [1,2,8], which aim to fool the VLP models themselves rather than specific downstream tasks. That is, the generated perturbation is supposed to yield excellent attack performance tailored to the target VLP model regardless of the downstream task. Hence, our primary focus is to leverage the shared and joint characteristics of different Vision-Language scenarios to present a universal and versatile UAP that can effectively generalize to diverse downstream V+L tasks.

Also, it is a promising direction for future studies to investigate task-level V+L characteristics to further enhance attacks for specific downstream tasks. We have added them to encourage future studies in Appendix I of the revision.

[Q2] The Superiority of Generative Attacks

We're sorry for not providing a sufficient explanation of the underlying mechanism behind generative attacks.

The superiority stems from the powerful distributional modeling capability of generative models. Since a universal perturbation is learned over the data distribution rather than over specific instances, the generator facilitates the learning of the universal perturbation by better perceiving and capturing distribution-level features of diverse image-text samples [4,9,11,12], hence significantly enhancing the attack compared with non-generative methods.

Comment

Dear Reviewer jJKC,

We have carefully responded to each of your questions and revised our paper accordingly. The sufficient experiments and analyses have been presented above and the revision details are listed at the top of this page.

We look forward to your reply and welcome discussion on any issues regarding our paper and responses.

Best regards,

The Authors

Comment

Dear Reviewer jJKC,

We really appreciate your precious time and efforts in providing insightful feedback on our responses. To further address your concerns, we have presented detailed clarifications and visual demonstrations in the latest author's reply. We kindly request that you review them to see whether your remaining queries have been sufficiently addressed. If you are satisfied with our responses, we would greatly appreciate it if you could kindly consider raising your score accordingly. :)

We're looking forward to your reply.

Best regards,

Authors

Comment

Dear Reviewer jJKC,

This is a kind reminder that the rebuttal period is about to end. We have addressed the majority of your concerns in our first-stage responses and supplemented further experiments with detailed analyses of the remaining issues. If you are satisfied with our rebuttal, we would appreciate it if you could consider raising your score accordingly. :)

Best regards,

Authors

Comment

Thank you for your response. My feedback is as follows:

Re: [Q1.1] Clarification about Leveraging Cross-modal Knowledge

I suggest that the authors reflect this point in the revised version of the paper, as it is crucial to highlight the differences between attacks targeting Vision-Language Pre-training Models and those designed for other types of models.

Re: [Q3] Influence on the Quality of Original Texts

I still find the similarity measurement problematic. The goal of an attack is to increase the gap between adversarial data and the original data. If the semantic similarity between them is low, the attack becomes less effective. A few examples provided are insufficient to confirm that the adversarial texts are visually plausible.

Considering the concern regarding the adversarial text, I will maintain my score.

Comment

Re to "Re: [Q1.1] Clarification about Leveraging Cross-modal Knowledge"

Sure, we will incorporate this point into the Introduction and Method sections of the paper.

Re to "Re: [Q3] Influence on the Quality of Original Texts"

We are quite confused by this subjective opinion regarding the similarity measurement, given our detailed clarification and experimental results.

  1. The attack performance of our method has been directly measured by the ASR in our extensive experimental results. We have highlighted them again and provided a comparison with the SOTA baseline UAP method you suggested, which strongly confirms the superiority of our method. It is therefore unclear why the similarity measurement is used to cast doubt on the attack performance.
  2. We have emphasized that the basic requirement of adversarial attacks is attack stealthiness rather than strictly preserving the quality of the original text. These two notions may have been conflated, leading to harsh constraints on the text perturbation that almost all existing cross-modal attacks fail to satisfy.
  3. We have highlighted that evaluating and reducing the influence on the quality of original texts is still an open issue that has not been explored by existing cross-modal adversarial attacks [1,2,8,15].
  4. We also aim to convey that existing instance-specific attacks achieve strong attack performance through instance-wise adversarial perturbations, but they sacrifice semantic similarity and thus yield less stealthy attacks (which also compromises the text quality that concerns you most). Since our method achieves both better semantic similarity and better text-quality scores than existing instance-specific attacks, C-PGC is a well-qualified adversarial attack and is clearly acceptable in terms of attack stealthiness and its influence on text quality.

Despite the above facts, we have designed and provided substantial new experiments to address your concerns (including a direct analysis with the advanced GPT-4o), which repeatedly validate the better quality preservation of our method. However, the supplemented experiments appear to have been overlooked, and the objection to the measurement remains without specifying any alternative metrics or concrete problems with the evaluation protocol.

We kindly request that you review the provided experiments and re-assess the quality of this paper, which is the true purpose of the ICLR rebuttal phase. Your assessment means a lot to us.

Comment

Dear Reviewer jJKC,

We apologize for reminding you again, as the rebuttal period is about to end. Our latest quantitative experiments based on GPT-4o in Author Response (Part VII) appear to have been overlooked. We also present detailed clarifications regarding the stated reasons for not raising the score in Author Response (Part VIII).

We have made great efforts to address your concerns and would appreciate a fair and objective evaluation.

Best regards,

Authors

Comment

Supplementary experiment for '[Q3] Influence on the quality of original texts'

To better address your question, we devise a new experiment leveraging the advanced GPT-4o to quantitatively evaluate the influence of text perturbation on the quality of the original sentence. Specifically, we query GPT-4o with the following prompt, inspired by [16], over 5,000 pairs of original and perturbed texts.

As an experienced evaluator, your task is to evaluate the extent of the given word replacement operation to the original text quality. Give a specific integer score based on the following statements, ranging from 0 to 4:
Very Poor (0): The word replacement greatly impacts the quality of the initial sentence, e.g., largely destroying the basic semantic or basic grammar.
Poor (1): The word replacement causes significant semantic deviation and ambiguity, e.g., the perturbed sentence looks strange and illogical.
Fair (2): The perturbed sentence is readable and logical, showing likeness to the initial sentence with notable semantic variances.
Good (3): The perturbed sentence is reasonable and semantic-preserving.
Excellent (4): The word replacement nearly brings no significant influence on the sentence quality.

Every time you receive two sentences, the first one is an original sentence, and the second is a sentence perturbed by the word replacement operation.
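For reference, a minimal sketch of how such a scoring loop could be implemented is shown below. It assumes the OpenAI Python client and a plain integer reply from the model; the parsing logic and variable names are hypothetical and may differ from our actual evaluation script.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_pair(evaluator_prompt: str, original: str, perturbed: str) -> int:
    """Ask GPT-4o to rate the impact of a word replacement on text quality (0-4)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": evaluator_prompt},
            {"role": "user", "content": f"{original}\n{perturbed}"},
        ],
        temperature=0,
    )
    match = re.search(r"[0-4]", resp.choices[0].message.content)
    return int(match.group()) if match else 0

# average over all evaluated pairs (text_pairs is a hypothetical list of (original, perturbed) tuples)
# scores = [score_pair(PROMPT, o, p) for o, p in text_pairs]
# print(sum(scores) / len(scores))
```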

The obtained average scores are as follows:

Table E. GPT-4o scores of different methods in terms of their influence on the text quality. Higher values indicate better performance.

| Eval Model | Co-Attack | SGA | C-PGC |
|---|---|---|---|
| GPT-4o | 2.6846 | 2.802 | 2.9164 |

These results again show that C-PGC better preserves the original text quality. Among existing cross-modal attacks [1,2,8,15], this experiment is the first attempt to quantitatively evaluate the direct influence of text perturbation on the original sentence quality, and it serves as an important addition that further addresses your question.

[16] Peng Y, Cui Y, Tang H, et al. Dreambench++: A human-aligned benchmark for personalized image generation. In arXiv, 2024.

Comment

We would like to express our sincere gratitude once more for your invaluable time and efforts in evaluating our work!

A revision of our paper that fully incorporates your valuable suggestions, along with details of the revised content, has been uploaded. All modified content is highlighted in blue. Specifically, we have added or updated the following content in the paper.

Reviewer jJKC

  1. We clarify the claim of "Fully utilize the characteristics of V+L scenario" and encourage future studies to investigate task-level characteristics in Appendix I.
  2. We add the comparison with PGD-based UAP to Figure 1 and cite the support literature in the Introduction section.
  3. We highlight that our Appendix includes the cross-domain experiments in the introduction part of Section 4.
  4. We provide the analysis regarding semantic similarity between adversarial text and its original text in Appendix D.
  5. We conduct ablation studies of diverse target texts in Appendix E.
  6. We supplement further discussions of unified losses in Appendix B.
  7. We compare our C-PGC with the suggested baseline in Appendix C.
  8. We compare different data augmentations in Appendix E.

Reviewer AoYQ

  1. We provide the analysis regarding semantic similarity between adversarial text and its original text in Appendix D.
  2. We provide the way to calculate ASR in Section 4.2.
  3. We discuss the employment of special characters in Appendix J.
  4. We reorganize and highlight the innovations and contributions in the contribution summary of the Introduction.

Reviewer U4zG

  1. We provide the detailed motivation and rationale behind the contrastive loss in Appendix B.
  2. We discuss the trade-offs between C-PGC and the suggested method in Appendix I.
  3. We compare our C-PGC with a recent SOTA baseline in Appendix C.
  4. We improve the color schemes and visual materials of the framework diagram in Figure 3 of the Method.

Reviewer 4NRc

  1. We highlight the positive sample selection strategy in the multimodal contrastive loss part of the Method section.
  2. We discuss the effect of set-level augmentation in Table 11 of Appendix E.
Comment

We sincerely appreciate all the reviewers for dedicating their valuable time and effort to reviewing our paper and providing insightful comments and suggestions. Encouragingly, the reviewers praise C-PGC's design (R#U4zG, R#AoYQ), effectiveness (R#AoYQ, R#U4zG), and scalability and versatility (R#AoYQ, R#U4zG), and they appreciate the writing (R#U4zG, R#4NRc) and experiments (R#AoYQ, R#U4zG, R#4NRc) of our work. Below, we jointly address several concerns shared by different reviewers.

[Q1] R#jJKC, R#AoYQ: Semantic Similarity between Clean and Adversarial Texts.

Thanks for the insightful question. Yes! The basic objective of untargeted adversarial attacks is to fool the victim model into producing incorrect predictions [1,2], while the attacker is expected to preserve semantic similarity between the original and adversarial samples to ensure attack imperceptibility. In our implementation, we follow the rigorous setup of prior works [3,4,5] that modifies only a single word to preserve semantic similarity and attack stealthiness. To quantitatively evaluate C-PGC further, we follow the suggestion of R#AoYQ and calculate BERT scores [6], i.e., P (precision), R (recall), and F1, as metrics of the semantic similarity between 5,000 clean and adversarial sentences in Table A. For a more comprehensive evaluation, we also report BLEU and related metrics with ALBEF as the surrogate model in Table B.

Table A. Comparison of our C-PGC with two widely acknowledged methods, Co-Attack [3] and SGA [4], in terms of the semantic similarity between clean and adversarial texts.

| Source model → | ALBEF |  |  | TCL |  |  | CLIP_VIT |  |  | CLIP_CNN |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method ↓ | P↑ | R↑ | F1↑ | P↑ | R↑ | F1↑ | P↑ | R↑ | F1↑ | P↑ | R↑ | F1↑ |
| Co-Attack [3] | 0.8328 | 0.8589 | 0.8455 | 0.8325 | 0.8588 | 0.8453 | 0.8269 | 0.8526 | 0.8394 | 0.8271 | 0.853 | 0.8397 |
| SGA [4] | 0.8389 | 0.8654 | 0.8518 | 0.8376 | 0.8646 | 0.8509 | 0.8416 | 0.8697 | 0.8553 | 0.8378 | 0.865 | 0.8511 |
| Ours | 0.8891 | 0.8613 | 0.8748 | 0.8924 | 0.8687 | 0.8802 | 0.8746 | 0.8684 | 0.8713 | 0.8948 | 0.8842 | 0.8893 |

Table B. Comparison of BLEU and related text-similarity metrics for our C-PGC, Co-Attack, and SGA.

| Method ↓ | B@4 | METEOR | ROUGE_L | CIDEr | SPICE |
|---|---|---|---|---|---|
| Co-Attack [3] | 0.79 | 0.52 | 0.895 | 7.03 | 0.661 |
| SGA [4] | 0.798 | 0.527 | 0.898 | 7.159 | 0.668 |
| Ours | 0.889 | 0.552 | 0.905 | 8.036 | 0.671 |

Note that the previous sample-specific algorithms Co-Attack [3] and SGA [4] are provided as references. Notably, our method achieves better similarity scores than these widely acknowledged sample-specific methods across different surrogate VLP models, demonstrating the outstanding attack stealthiness of C-PGC.
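For reproducibility, the BERTScore values above can be computed with the public bert_score package along the lines of the sketch below; the language setting and the way sentence pairs are loaded are assumptions rather than our exact evaluation code.

```python
from bert_score import score  # pip install bert-score

def bertscore_similarity(clean_texts, adv_texts):
    """Return corpus-level precision, recall, and F1 between clean and adversarial captions."""
    P, R, F1 = score(adv_texts, clean_texts, lang="en", verbose=False)
    return P.mean().item(), R.mean().item(), F1.mean().item()

# usage (hypothetical lists of 5,000 aligned sentences):
# p, r, f1 = bertscore_similarity(clean_sentences, adversarial_sentences)
```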

To better understand its influence on attack performance in more practical scenarios, we evaluate C-PGC against a widely used language correction tool, LanguageTool (LT), which has previously been applied to adversarial text correction [5]. As shown in Table 2 of the main body, the minor ASR drop from NRP to NRP+LT indicates that C-PGC exhibits excellent attack imperceptibility and can effectively bypass this automated correction tool, which again underscores the generalizability of the generated UAP and presents a practical and effective textual perturbation strategy.
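The correction step can be emulated with the language_tool_python wrapper as sketched below; the exact tool configuration used in our experiments may differ, so treat this as an illustrative sketch only.

```python
import language_tool_python  # pip install language-tool-python

tool = language_tool_python.LanguageTool("en-US")

def correct_texts(adv_texts):
    """Run LanguageTool's automatic correction over adversarial captions
    before re-evaluating the attack (corresponding to the +LT setting above)."""
    return [tool.correct(t) for t in adv_texts]

# usage: corrected = correct_texts(adversarial_sentences)
# the corrected captions are then fed back to the victim VLP model to recompute ASR
```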

We will supplement these analyses in the Appendix and encourage further related studies. Thanks again for your insightful advice!

Comment

[Q2] R#jJKC, R#U4zG: Comparison with the recent SOTA.

As noted by R#jJKC, there is a concurrent study on UAP attacks against VLP models [7] that shows promising attack performance. For a fair comparison, we faithfully reproduce this method using its publicly released code under the same experimental settings as ours. Note that [7] implements several versions of the method, and we report its best results below:

Table C. Comparison of our C-PGC with a recent SOTA attack [7] on Flickr30K.

| Source ↓ | Target → | ALBEF |  | TCL |  | X-VLM |  | CLIP_VIT |  | CLIP_CNN |  | BLIP |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Method ↓ | TR | IR | TR | IR | TR | IR | TR | IR | TR | IR | TR | IR |
| ALBEF | ETU [7] | 78.01 | 84.56 | 29.92 | 35.91 | 14.33 | 22.03 | 23.77 | 39.2 | 33.55 | 47.69 | 22.61 | 32.28 |
| ALBEF | Ours | 90.13 | 88.82 | 62.11 | 64.48 | 20.53 | 39.38 | 43.1 | 65.93 | 54.4 | 72.51 | 44.79 | 56.36 |
| CLIP_VIT | ETU [7] | 14.8 | 25.23 | 21.22 | 30.87 | 10.87 | 24.96 | 84.14 | 90.45 | 57.51 | 65.51 | 16.4 | 27.22 |
| CLIP_VIT | Ours | 23.23 | 38.67 | 25.05 | 41.79 | 15.85 | 35.59 | 88.92 | 93.05 | 66.06 | 75.42 | 26.71 | 45.7 |

By contrastively training our cross-modal conditional generator, the proposed C-PGC greatly enhances the attack and achieves significant ASR improvements. Particularly in the more realistic and challenging transfer scenarios, our method performs considerably better, e.g., 32.19% and 28.57% higher ASR on the TR and IR tasks when transferring from ALBEF to TCL. These results confirm the superiority of our contrastive learning-based generative paradigm. We will cite this paper and include these results in a new section of the Appendix.

Thanks again for mentioning this great work! We would appreciate it if you could kindly inform us of any additional UAP methods targeting VLP models that we may have inadvertently omitted. We would be delighted to reproduce and compare these methods to ensure a comprehensive assessment.

[1] Dong Y, Liao F, Pang T, et al. Boosting adversarial attacks with momentum. In CVPR, 2018.

[2] Chakraborty A, Alam M, Dey V, et al. A survey on adversarial attacks and defences. In CAAI Transactions on Intelligence Technology, 2021.

[3] Zhang J, Yi Q, Sang J. Towards adversarial attack on vision-language pre-training models. In ACM MM, 2022.

[4] Lu D, Wang Z, Wang T, et al. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models. In ICCV, 2023.

[5] Wang H, Dong K, Zhu Z, et al. Transferable multimodal attack on vision-language pre-training models. In S&P, 2024.

[6] Zhang T, Kishore V, Wu F, et al. Bertscore: Evaluating text generation with bert. In ICLR, 2020.

[7] Zhang P F, Huang Z, Bai G. Universal adversarial perturbations for vision-language pre-trained models. In ACM SIGIR, 2024.

AC Meta-Review

This work reveals that Vision-Language Pre-training (VLP) models are vulnerable to instance-agnostic Universal Adversarial Perturbations (UAPs). It introduces a Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC) that leverages malicious contrastive learning to disrupt the multimodal alignment in VLP models, achieving effective and transferable attacks across various models and vision-language tasks. The authors provided additional experiments and explanations during the rebuttal phase to address the reviewers' concerns; however, some issues remain unresolved. The primary weaknesses of this work lie in the inefficiency and ineffectiveness of the proposed universal adversarial text generation approach. The method's reliance on identifying critical words before applying universal perturbations contradicts the principle of universal attacks, which should function on all unseen data without additional computation. The quality of the generated perturbations is questionable, as merely replacing words does not ensure imperceptibility, making the modified texts easily identifiable. Additionally, the high semantic similarity between the original and adversarial texts suggests the attack has minimal impact, undermining its effectiveness. The use of large language models to validate imperceptibility is also unconvincing. Furthermore, the performance improvement primarily stems from the underlying perturbation technique (e.g., DeepFool) rather than the novel strategy proposed in this work, as highlighted in the rebuttal. Therefore, we decide not to accept this work based on its current state.

Additional Comments from Reviewer Discussion

The authors provided additional experiments and explanations during the rebuttal phase to address the reviewers' concerns; however, some issues remain unresolved. The primary weaknesses of this work lie in the inefficiency and ineffectiveness of the proposed universal adversarial text generation approach. The quality of the generated perturbations is questionable, as merely replacing words does not ensure imperceptibility, making the modified texts easily identifiable. Additionally, the high semantic similarity between the original and adversarial texts suggests the attack has minimal impact, undermining its effectiveness. The use of large language models to validate imperceptibility is also unconvincing.

Final Decision

Reject