PaperHub
Rating: 5.7 / 10 · Poster · 3 reviewers (min 5, max 6, std 0.5)
Scores: 6, 6, 5
Confidence: 3.3 · Correctness: 3.0 · Contribution: 2.7 · Presentation: 2.3
NeurIPS 2024

Ask, Attend, Attack: An Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Black-box adversarial attack · Image-to-text model · Attention · Differential evolution

Reviews and Discussion

Review (Rating: 6)

The paper focuses on the vulnerability of image-to-text models to adversarial attacks, particularly in a decision-based black-box targeted attack scenario. The authors design a three-stage attack process: (i) Ask: guides attackers to create target texts that fulfill specific semantic requirements. (ii) Attend: identifies crucial regions of the image for the attack, reducing the attack's search space. (iii) Attack: utilizes an evolutionary algorithm to attack these critical regions, aligning with the semantics from the Ask stage to achieve targeted attacks without semantic loss. Tests on Transformer-based and CNN+RNN-based models demonstrate that the proposed AAA method performs effective targeted attacks.

Strengths

  1. A novel decision-based black-box targeted attack approach is proposed.

  2. The idea of reducing the search space is interesting.

Weaknesses

  1. The technical contribution is limited. Many concepts and ideas are adopted directly from existing work, and many steps rely on existing models such as CLIP, which only work on some benchmark datasets.

  2. The writing quality needs to be enhanced. For instance, the title should be "An Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models". Are there typos in $S_{sem}$ and $S_{seg}$? Some technical details are unclear. For example, how can the authors guarantee that target words remain close to the input space as many rounds of mutation proceed?

Questions

  1. How can the authors guarantee that target words remain close to the input space as many rounds of mutation proceed?

  2. Can the authors explain further why a visual model is a good choice of surrogate model? Will the model architecture or parameters affect the feature map? I suspect that for common datasets such as ImageNet and common architectures, these models could be similar. But what if different model architectures were trained on a more diverse dataset? As a black-box attack, it is hard to guarantee consistent feature-map generation across different models.

  3. Are the mean and median of heatmap $A$ around [0.3, 0.4] for all types of data?

Limitations

The authors mentioned the limitation of a high number of queries and low optimization efficiency.

Author Response

To Weakness 1:

Our technical contribution lies in carefully crafting specific designs to improve search efficiency and make our framework applicable in more difficult scenarios. Each design and its technical contribution is summarized as follows.

1) Improving search efficiency from different perspectives:

a) Reducing the search difficulty: Designing a target semantic dictionary as prior knowledge shortens the search path from image to target text as much as possible, improving the search efficiency of the Attack stage (line 137).

b) Reducing the computation time: The Attack stage requires calculating the cosine similarity matrix of the entire population's text in each iteration, which is time-consuming. The CLIP model's text encoder can process large amounts of text in parallel, so we disable CLIP's visual encoder and use the text encoder alone to compute the cosine similarity matrix between the target text and the output text of all chromosomes (Equation 13), speeding up the search; a code sketch of this idea appears after this list.

c) Reducing the search space: The large number of decision variables (pixels) leads to a vast search space and low search efficiency. We combine the Grad-CAM formula with decision variables to design Equation 10, reducing the range of decision variables in unimportant areas (e.g., background), significantly reducing the search space.

2) Dealing with different difficult scenarios: 

a) Black-box: Grad-CAM requires gradient information and is not suitable for black-box scenarios, nor can it handle the image-to-text task, where the number of output text categories is infinite. Based on previous research ([28], CVPR 2021) and our experiments (Figure 4), we designed a surrogate-model strategy to generate attention maps in black-box scenarios, and a CLIP-based mapping formula (Equation 9) that maps the output text to the category set of the surrogate model, enabling the generation of attention maps for image-to-text tasks with unlimited output categories.

b) Perturbation imperceptibility: The perturbations of adversarial examples need to be imperceptible to the human eye. The Grad-CAM-based Equation 10 we designed can also enhance the stealthiness of adversarial examples. This is because a smaller search range for decision variables means smaller perturbations at each pixel.
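
For concreteness, here is a minimal sketch of the batched text-similarity computation from design 1b, assuming the Hugging Face transformers CLIP interface; it illustrates the idea rather than reproducing our exact implementation:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def population_text_similarity(target_text, population_texts):
    # Encode the target text and every chromosome's output text in one batch;
    # CLIP's visual encoder is never called, which is the speed-up described above.
    inputs = tokenizer([target_text] + population_texts,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)     # shape (1 + N, d)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    return feats[1:] @ feats[0]                       # (N,) cosine similarities
```

Encoding the whole population in a single forward pass is what makes the per-iteration fitness evaluation cheap.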

To Weakness 2:

Thank you for pointing out these typos. The term $S_{seg}$ in line 167 is a typo and should be corrected to $S_{sem}$ . We have checked the rest of the paper and found no other typos. We will correct this in the final version. Your question about target words is addressed in Question 1 below.

To Question 1:

The mutation process in the Ask stage does not change the position of any words in the input space. Instead, it filters words related to the attacker’s specified semantics within a small hypersphere centered on the image (Equation 2). All target words are within a distance of radius $\eta$ from the image, ensuring that the target text (created based on the target semantic dictionary) is also positioned close to the image. This effectively reduces the search difficulty and improves the search efficiency in the Attack stage (Table 1 and Appendix B.3).

To Question 2:

DNNs with different architectures and parameters exhibit similar attention patterns for the same image, as concluded in prior adversarial attack work ([28], Figure 4, CVPR 2021) and validated by our qualitative and quantitative experiments on diverse CNN and Transformer models (Figure 4). The role of attention maps in our framework is to reduce the search space in unimportant regions (e.g., background), enhancing the stealthiness of adversarial images and reducing the search difficulty of the Attack stage. Even if different surrogate models produce slightly different attention maps, the cost is limited to additional computation in unimportant regions. Since adversarial attacks are not time-sensitive tasks, this cost is acceptable.

To Question 3:

Yes, this is a pattern we observed through statistical analysis of the heatmap value distribution. We have added the numerical distribution of all heatmaps as well as the median and mean values to the supplementary file in the rebuttal system.
Comment

Hello,

Is there any theoretical guarantee for question 1? What is the additional computation cost when dealing with different heatmaps? Does "DNNs with different architectures and parameters exhibit similar attention patterns for the same image" hold for other modalities?

Comment

To Question 1:

Is there any theoretical guarantee for question 1?

In fact, the guarantee that the chromosome (perturbed image) remains in close proximity to the clean image after multiple rounds of mutation in the Ask stage, thereby facilitating the search for the corresponding target words, is grounded in the following mathematical rationale:

Let $x$ denote the clean image, $n$ the number of pixels in the image, $\eta$ the maximum perturbation size for each pixel, and $x_i$ the $i$-th pixel of the image. Define $\Vert x \Vert_{\infty} = \max(|x_1|, \cdots, |x_i|, \cdots, |x_n|)$ as the infinity norm of $x$. Let $x_{r1}$, $x_{r2}$, and $x_{r3}$ be the images obtained by adding different random perturbations not exceeding $\eta$ to each pixel of $x$. Let $F$ be the mutation scaling factor, and $v$ the chromosome (perturbed image) obtained after mutation. The hypersphere $B(x, \eta)$ is centered at $x$ with radius $\eta$ and is characterized as $B(x, \eta) = \{z \mid \Vert x - z \Vert_{\infty} \leq \eta\}$.

The distance between the mutated chromosome vv and the clean image xx is given by:
$\Vert v - x \Vert_{\infty} = \Vert x_{r1} + F \cdot (x_{r2} - x_{r3}) - x \Vert_{\infty} = \Vert x_{r1} - x + F \cdot (x_{r2} - x) + F \cdot (x - x_{r3}) \Vert_{\infty}$

Employing the triangle inequality, we deduce: $\Vert v - x \Vert_{\infty} \leq \Vert x_{r1} - x \Vert_{\infty} + F \cdot \Vert x_{r2} - x \Vert_{\infty} + F \cdot \Vert x - x_{r3} \Vert_{\infty}$

Given the definition of the hypersphere $B(x, \eta)$, for any point $x'$ within $B(x, \eta)$, it follows that $\Vert x' - x \Vert_{\infty} = \max_i |x'_i - x_i| \leq \eta$.

Consequently: $\Vert v - x \Vert_{\infty} \leq \eta + F \cdot \eta + F \cdot \eta = (1 + 2F) \cdot \eta$

From this, the scaling factor $\frac{1}{1 + 2F}$ is derived. Let $v' = x + \frac{v - x}{1 + 2F}$, which yields $\Vert v' - x \Vert_{\infty} \leq \eta$.

By assigning $v'$ to $v$, we guarantee that the mutated chromosome $v$ will always be within the hypersphere $B(x, \eta)$, regardless of the number of iterations.

Through these calculations, we ensure that the new chromosome $v$ obtained after any number of mutation operations remains close to the input image, at a distance not exceeding $\eta$.
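
To make the rescaling concrete, here is a small numerical sketch of the argument above (array sizes and parameter values are illustrative only):

```python
import numpy as np

def mutate_within_ball(x, eta, F, rng):
    # DE/rand/1 mutation with three random donors drawn from B(x, eta),
    # followed by the 1/(1 + 2F) rescaling derived above.
    xr1, xr2, xr3 = (x + rng.uniform(-eta, eta, size=x.shape) for _ in range(3))
    v = xr1 + F * (xr2 - xr3)             # plain mutation: may leave B(x, eta)
    return x + (v - x) / (1.0 + 2.0 * F)  # rescaled so ||v' - x||_inf <= eta

rng = np.random.default_rng(0)
x = rng.random(10_000)  # stand-in for a flattened clean image
v = mutate_within_ball(x, eta=0.1, F=0.5, rng=rng)
assert np.max(np.abs(v - x)) <= 0.1 + 1e-12  # the bound holds for every pixel
```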

To Question 2:

What is the additional computation cost when dealing with different heatmaps?

The additional computational cost refers to the extra iterations required by the evolutionary algorithm to find a solution with the same fitness value. As shown in Figure 3(b) of the main paper, the experimental setup of AAA (w/o Attend) used extremely poor attention heatmaps (failing to identify any unimportant regions), resulting in an additional 200 iterations to find a solution of the same quality as AAA. This indicates that the quality of the attention heatmap (its ability to identify unimportant regions) is related to the convergence speed of the evolutionary algorithm, because higher-quality attention heatmaps can more effectively reduce the search space.

To Question 3:

Does "DNN with different architecture and parameters exhibit similar attention patterns for the same image" work for other modalities?

This conclusion holds within a single modality but does not hold across modalities. Observations on datasets from different modalities such as RGB, thermal imaging, and infrared reveal that different DNNs trained on the same modality have similar attention maps. However, for an image from another modality (thermal or infrared), these DNNs (trained on RGB) produce attention maps that differ. This may be because thermal and infrared images lack the color and texture features of RGB images, leaving DNNs unable, to varying degrees, to identify unimportant areas. It is worth noting that our work focuses on the RGB modality, addressing the challenges of black-box and targeted attacks without involving cross-modality attacks, as image-to-text models are mostly based on the RGB modality.

Review (Rating: 6)

The authors tackle the challenging problem of adversarial attacks on Image-to-Text Models, focusing specifically on the black-box attack scenario where an attacker has no access to the internal workings of the model, only its output. To address this challenge, they propose a novel framework called Ask, Attend, Attack (AAA), which leverages evolutionary algorithms to mount targeted attacks. The AAA framework consists of three stages: Ask, where the attacker generates target texts that satisfy specific semantic constraints; Attend, where crucial regions of the image are identified without access to the image encoder; and Attack, where an evolutionary algorithm is used to attack these identified regions with semantically related attacks that match the target texts from the Ask stage. This approach cleverly avoids the issue of semantic mismatch encountered in previous gray-box attacks. The authors demonstrate the effectiveness of their proposed framework through experiments on two distinct models.

Strengths

  1. The authors aptly acknowledge the complexity of the domain they tackle in this paper, which underscores the significance and novelty of their contributions.
  2. The authors provide a clear and comprehensive description of the three stages comprising their proposed AAA framework. Notably, each stage is carefully designed to address the semantic mismatch issue that has been previously observed in related research, offering a thoughtful and intentional approach to tackling this challenge.
  3. The experimental outcomes presented in this manuscript are encouraging, as the proposed AAA framework demonstrates strong performance across the two models examined. Additionally, I commend the authors for including an ablation study, which provides valuable insights into the effectiveness of individual components and sheds light on their contributions to the overall framework's success.

Weaknesses

  1. While I appreciate the effort invested in this paper, I do find some aspects lacking in terms of clear motivation. For instance, the example provided to justify the attack setting, where an attacker seeks to elicit political slogans and hate speech by introducing imperceptible perturbations to the image, seems somewhat weak compared to more established examples in the domain of image classification. Additionally, I would have liked a clearer explanation for the specific population definition chosen for the Ask stage; without further justification, it's difficult to fully understand the reasoning behind this design choice.
  2. The authors effectively emphasize the practicality of the attack setting, which is indeed a significant strength of this paper. However, I do have some reservations about the feasibility of the proposed solution. Specifically, I'm concerned that the attacker would need to interact with the victim model numerous times to generate the target image, which seems impractical and may not be a realistic scenario in many real-world applications.

Questions

  1. I see that the authors chose the $l_1$ adversarial attack domain. Can AAA be generalized to the more common $l_\infty$ attack?
  2. How are the different parameter values chosen?
  3. In Table 1, the epsilon values chosen are 25 and 15. These seem rather large compared to the image classification domain. How do the results look when epsilon is made even smaller?

Limitations

N/A

Author Response

To Weakness 1:

To underscore the significance of research into image-to-text black-box targeted attacks, we will add an example of societal harm and highlight a societal benefit: 1) Harm Example: Social media companies use image-to-text AI for content moderation of user-uploaded images. Black-box targeted attacks on image-to-text AI could enable attackers to alter the semantic content of images from illegal to legal, thus bypassing the moderation process and uploading the illegal images to social media. 2) Benefit: The introduction of image-to-text black-box targeted attacks can alert AI service providers to vulnerabilities in their AI products, providing a benchmark to improve the security of their AI products.

In Section 3.3, the Ask stage's population consists of randomly perturbed images, where each perturbed image is a chromosome and each pixel is a gene. The goal of the Ask stage is to obtain prior knowledge (the target semantic dictionary) that reduces the search difficulty of the evolutionary algorithm in the Attack stage. The target semantic dictionary contains words that are 1) close to the image in the input space and 2) related to the attacker's specified semantics. Equations 2, 3, and 4 describe a random search within a small hypersphere centered on the input image $x$ with radius $\eta$, ensuring all potential words are close to the input image. Equations 5, 6, and 7 store words related to the attacker's specified semantics from the output text of each chromosome in the target semantic dictionary, helping the attacker generate a target text that meets their requirements (target semantics) and can be searched efficiently.
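
As a rough illustration of this loop, the sketch below builds the dictionary from decision-based queries; `victim_caption` (the target model's output text for an image) and `word_similarity` (e.g., CLIP text similarity) are hypothetical helpers, and the threshold is illustrative rather than the value used in the paper:

```python
def build_target_semantic_dictionary(x, eta, target_semantics, n_queries, rng,
                                     victim_caption, word_similarity,
                                     threshold=0.7):
    # Random search inside the hypersphere B(x, eta) (cf. Equations 2-4):
    # each perturbed image is one chromosome, queried once against the victim.
    dictionary = set()
    for _ in range(n_queries):
        chromosome = x + rng.uniform(-eta, eta, size=x.shape)
        caption = victim_caption(chromosome)  # decision-based query: text only
        # Keep output words related to the specified semantics (cf. Equations 5-7).
        for word in caption.split():
            if word_similarity(word, target_semantics) >= threshold:
                dictionary.add(word)
    return dictionary
```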

To Weakness 2:

In Appendix C.1, we discuss the limitations of our framework in real-world applications, specifically the need for numerous interactions with the target model. This is due to our use of a basic differential evolution algorithm, which requires many iterations to converge to the optimal solution. A potential improvement (Appendix C.2) is to combine our framework with the current state-of-the-art evolutionary algorithms, significantly reducing the number of interactions needed.

To Question 1:

In fact, our framework uses the $l_\infty$ attack domain, not the $l_1$ domain. As shown in Equation 10, the value range for the $i$-th variable (pixel) of the clean image $x$ is $[-A(i) \cdot \eta, A(i) \cdot \eta]$, where $A(i)$ is the contribution weight (0 to 1) and $\eta$ is the perturbation size parameter. This means each pixel has its own perturbation limit (an $l_\infty$ attack domain), adjusted by its contribution weight; for example, less important areas (like the background) have reduced perturbation limits. Since each pixel has a different perturbation limit, we express the constraint in Equation 1 using the $l_1$ attack domain.
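
A minimal sketch of this attention-weighted bound, following the description of Equation 10 (names and shapes are illustrative, not taken from our code):

```python
import numpy as np

def clip_to_attention_bounds(delta, attention, eta):
    # Pixel i may be perturbed by at most A(i) * eta, so low-attention
    # regions (e.g., background) receive smaller perturbations.
    bound = attention * eta
    return np.clip(delta, -bound, bound)

rng = np.random.default_rng(1)
attention = rng.random((224, 224))            # stand-in for a surrogate heatmap in [0, 1]
delta = rng.uniform(-25.0, 25.0, (224, 224))  # raw perturbation with eta = 25
clipped = clip_to_attention_bounds(delta, attention, eta=25.0)
assert np.all(np.abs(clipped) <= attention * 25.0)
```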

To Question 2:

The parameter values are determined through ablation studies. For example, Appendix B.2 determines the target semantic strategy, Appendix B.3 determines the word selection strategy, Appendix B.4 determines the population size, Appendix B.6 determines the evolutionary algorithm, Figure 4 in the main text determines the surrogate model, and Figure 5 in the main text determines the perturbation size, etc.

To Question 3:

In Section 4.2 (Qualitative Experiment of Different Perturbation Sizes), we have compared attacks with epsilon values of 25, 15, 10, and 5 (Figure 5). Our conclusions are:

1) As perturbation size decreases, both our method and existing methods become more stealthy, but their attack performance decreases.

2) As perturbation size increases, our black-box method achieves nearly 100% target attack performance, while existing gray-box methods have lower performance due to semantic loss issues.
Comment

I thank the authors for responding to my comments. However, my concern regarding Weakness 1 still remains. Due to this factor and after reading through the other reviews and responses I have decided to keep my scores the same.

Comment

Dear Reviewer 8Kf2,

As the deadline for the discussion period is approaching, we would like to kindly request your feedback on our responses. We wish to express our deepest gratitude for the time and efforts you have dedicated to reviewing our work. We sincerely hope that our detailed responses have adequately addressed all the concerns and suggestions you raised.

We fully understand that you may be occupied with other commitments, but we would greatly value any comments you can provide on our responses before the deadline.

Thank you for your attention to this matter. We eagerly look forward to hearing from you soon.

Sincerely, Authors of Submission 11274

Review (Rating: 5)

This paper proposes a new adversarial attack, termed the AAA (Ask, Attend, Attack) attack, against image-to-text models. In the Ask stage, the attacker iteratively generates candidate individuals closer to the target semantics in the feature space of the target model. During the Attend stage, the attacker uses a surrogate model to generate a heatmap that guides the attack. Finally, in the Attack stage, an evolutionary algorithm searches for individuals with a closer feature distance between the output text and the target text.

Strengths

  • This paper presents the first black-box targeted adversarial attack towards image-to-text models.
  • The proposed attack is shown to be even more effective than existing gray-box attacks.
  • Comprehensive evaluations are conducted including detailed ablation studies on each component.

Weaknesses

  • The presentation of Section 3.3 is confusing. Steps 1, 2, and 3 randomly shuffle the input images; it is not well motivated or clearly explained why the attacker needs to generate the perturbed images following these steps.
  • The Ask stage seems to be "optimizing" the target text so that the target text would be naturally closer to the input image, which, from my perspective, makes the attack less "targeted" as it is actually easing the attack. As shown in Table 1, the Ask stage heavily contributes to the effectiveness of the AAA attack.

Questions

  • In line 262, "AAA (w/o Ask) means the target text is not from the target semantic dictionary, but random words". Why didn't the authors use the target semantics as the target text? It is confusing that AAA (w/o Ask) adopts random words as the target text while achieving performance comparable to gray-box attacks.

Limitations

See weaknesses and questions.

Author Response

To Weakness 1:

Section 3.3 (the Ask stage) aims to obtain prior knowledge for searching for the target text. This prior knowledge shortens the search path from image to target text as much as possible, improving the search efficiency of the Attack stage (line 137).

Specifically, we perform a random search within a small hypersphere centered on the image $x$ with a radius $\eta$, ensuring all potential words are close to the image (Equations 2, 3, and 4). Then, we compile a target semantic dictionary from the words related to the attacker’s desired semantics (Equations 5, 6, and 7). 

To Weakness 2:

We agree that the Ask stage makes the attack easier; this is exactly our goal, and it is reasonable:

1) Technically Reasonable: Prior knowledge reduces the search difficulty of the evolutionary algorithm. Moreover, adversarial attacks often rely on prior information: existing image-to-text gray-box attacks depend on gradient information, and black-box attacks on classification tasks rely on the similarity between surrogate and target models, whereas our prior information is the target semantic dictionary. Notably, our prior information is more easily obtainable in practice, while theirs is difficult to acquire.

2) Fair Comparison: We applied the same prior knowledge (target semantic dictionary) to both our method and the existing methods, meaning that all approaches in the comparative experiments utilized the identical target text, thereby guaranteeing a fair comparison.

3) Reasonable use scenarios: Attackers typically need to meet their target semantic requirements without fixating on specific words. For example, if an attacker wants to map an illegal image to a legal text related to “dogs”, and the target semantic dictionary (prior knowledge) includes the words “puppy” and “jogging” but not “little dog” and “running”, the attacker can choose the sentence “The puppy is jogging” as the target text, avoiding “The little dog is running”. The reason is that although both sentences meet the attacker’s needs, the former is derived from the dictionary and is therefore easier to search for.

To Question 1:

Using target semantics as the target text for AAA (w/o Ask) is unfair. In our paper, AAA (w/o Ask) randomly creates a complete and coherent sentence from all English words as the target text, which is fair.

The reason why the former is unfair: The target text must be a sentence, not a word, because the output text of an image-to-text model is always a complete sentence. If the target semantics (a single word, such as "animal" or "photograph") is used as the target text to calculate the similarity with the output text (a complete sentence), it is difficult to achieve a high similarity score.

The reason why the latter is fair: The sole difference between AAA and AAA (w/o Ask) lies in the range of words for the target text. For AAA, the range is "words within the target semantics dictionary," whereas for AAA (w/o Ask), it is "all English words." Our experimental setup for AAA (w/o Ask) fairly demonstrates the impact of word selection range on the attack. Additional comparative experiments with various word selection ranges are conducted in Appendix B.3, which can provide a better understanding.

It is important to emphasize that when comparing our method AAA with existing gray-box methods, the target texts are identical sentences, ensuring the fairness of the experiment.
Comment

Dear Reviewer xQiz,

As the deadline for the discussion period is approaching, we would like to kindly request your feedback on our responses. We wish to express our deepest gratitude for the time and efforts you have dedicated to reviewing our work. We sincerely hope that our detailed responses have adequately addressed all the concerns and suggestions you raised. We fully understand that you may be occupied with other commitments, but we would greatly value any comments you can provide on our responses before the deadline.

Thank you for your attention to this matter. We eagerly look forward to hearing from you soon.

Sincerely, Authors of Submission 11274

Author Response

Dear Reviewers, thank you for recognizing our work as novel, important, and experimentally sound. Your comments have greatly improved the quality and clarity of our manuscript. We address your concerns one by one below.

Final Decision

All reviewers have provided positive scores for this submission, highlighting its strengths in novelty and extensive experiments. Given the unanimous positive feedback and the recognition of its contribution to the area, the AC carefully reviewed the paper and concurred with the reviewers' assessments, therefore supporting the decision to accept this submission.