Investigating Human-Identifiable Features Hidden in Adversarial Perturbations
We demonstrate that adversarial perturbations contain human-identifiable features, which are responsible for a neural network’s misclassification.
Abstract
Reviews and Discussion
This work identifies the presence and effect of human-identifiable features in adversarial perturbations. The authors recognize that individual perturbations on a single input, while successful at fooling a model, do not produce distinct features that can be readily interpreted by humans. They posit that this is due to the presence of noise in the perturbations, and introduce a methodology to help overcome this by averaging many perturbations on the same image. The result produces perturbations that are significantly more human-understandable, as demonstrated through a human evaluator experiment. With these new perturbations, they identify two different effects that these perturbations have on their input: masking, which covers prominent features of the true class of the image, and generation, which creates prominent features of the target class. Overall, this work provides insights into features created in adversarial examples, introduces methodology that can increase explainability in the presence of adversarial examples, and provides explanations from their findings for well-known phenomena in adversarial training, transferability attacks, and interpretability.
Strengths
Thank you for your submission! I thoroughly enjoyed reading this paper; the results were compelling, the methodology was sound, the contributions and findings are novel and useful, explanations were clear, and I was surprised at how recognizable the generated perturbations were.
Some specific highlighted results/conclusions/contributions:
- As mentioned in the paper, there is a significant need for work that provides explanations for reasons as to why attacks are as successful as they are and why models are as vulnerable to adversarial examples as they are. This work bridges these two approaches by (a) evaluating a variety of attacks and (b) creatively extracting portions of perturbations that are well aligned across models and thus represent features that transfer across models
- The perturbations generated with this method were significantly clearer/more recognizable to me as a reader. Additionally, I felt that the claim of generating human recognizable perturbations was well supported by also incorporating the results showing that (a) human evaluators were able to recognize perturbations without associated inputs from the MM+G method at a rate significantly higher than random guessing and (b) the perturbations generated in the MM+G setting yield far more successful adversarial examples than the standard SM case
- The discussion section connected multiple trends in transferability, adversarial training, and clean/robust accuracy tradeoffs to reasonable explanations based on insights from this work.
Weaknesses
The breadth of experiments done was extensive, but I felt that in certain places, the depth of individual experiments could have been improved. Specifically:
- I would have preferred to see more samples per class evaluated (10 seems quite small to me)
- In the human evaluator test, I understand the limitation of testing all the attacks/settings but at the very least both settings under one attack should have been evaluated. At present, it is hard to give meaning to the 80.7% human evaluator accuracy under the BIM MM+G setting since there is not a BIM SM setting to compare it to. It would also be helpful to provide some justification for why BIM (over the other attacks) was chosen for this experiment.
- Similar to the previous point, including SM settings in the cosine similarity experiment would have been helpful to get a baseline sense of how similar perturbations usually are to each other and to see if the MM+G setting yields significantly different values.
Additionally, the paper is clear and concise as written, but there were some portions that could benefit from additional details, explanations, or citations, mainly in Section 4 (Experimental Method).
Specific (minor) suggestions for improvement:
- The notion of "incomplete components of the associated features" was lacking definition/explanation, adding some details around what this is supposed to represent would be helpful.
- The problem of "the number of available neural networks being limited" didn't feel clear/well motivated. There are many parameters that can be adjusted to produce different models (seeds, hyperparameters, optimizer, architecture, etc.). Further, it wasn't clear how the solution of applying noise to produce more inputs solved this problem.
- Some more citations to help support the contour extraction experiment would be helpful, particularly for claims that make statements about portions of the image that humans use for classification.
Questions
- How were the subset of classes chosen?
- How were the 200 inputs chosen? Were there any constraints or conditions for these inputs? Were all samples chosen correctly classified by all models?
- While it does appear that adding noise to produce additional inputs works well, the inspiration/motivation for doing this wasn't exactly clear. Why add noise rather than performing some kind of data augmentation?
- Why was the standard deviation of noise added to the inputs different for the different attack algorithms?
- Why were 270 models chosen for generating perturbations? Were these experiments tried with fewer models (besides the single model case)?
- It is mentioned in the human evaluator test that the lowest and highest accuracy in each subset was discarded before calculating the average. What was the purpose of this? And can you clarify exactly what was discarded (e.g., was data for a single sample removed from all participants or was data from a single participant removed from all samples?)
We wholeheartedly appreciate your encouragement and suggestions. Thank you very much.
How were the subset of classes chosen and 200 inputs chosen?
We carefully selected 20 classes from the ImageNet dataset to ensure maximal diversity in our study, covering a wide array of subjects such as animals, plants, architecture, toys, transportation, and utilities. To avoid human bias, we adopted a systematic approach, selecting the first 10 images from each class rather than hand-picking them.
While it does appear that adding noise to produce additional inputs works well, the inspiration/motivation for doing this wasn't exactly clear. Why add noise rather than performing some kind of data augmentation?
We believe that perturbations contain noise that may result in meaningless local variations [1]. As a consequence, introducing different Gaussian noises to the same input image may lead to different local variation patterns. Therefore, averaging those perturbations can effectively reduce the noise residing within them.
We believe that the changes in local variations are independent of specific data augmentation techniques because the variation itself is meaningless. Consequently, we anticipate that different augmentation techniques would yield similar results. However, adding noise offers the advantage of easily controlling the strength of augmentation by adjusting its standard deviation, which is why we chose to add noise instead of performing other forms of data augmentation.
Reference:
[1] Smoothgrad: Removing noise by adding noise. Daniel Smilkov, et al. Workshop on Visualization for Deep Learning. 2017.
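A minimal sketch of the noise-and-average procedure described above (the function and parameter names, and the default values, are illustrative assumptions rather than the paper's exact implementation):

```python
import torch

def averaged_perturbation(models, image, attack_fn, n_noise=10, sigma=0.02):
    """Average adversarial perturbations over surrogate models and Gaussian-noised
    copies of the same image, so that noise-like components cancel while shared
    (human-identifiable) components remain. attack_fn(model, x) is assumed to
    return the adversarial perturbation for input x (e.g. from BIM)."""
    total = torch.zeros_like(image)
    count = 0
    for model in models:
        for _ in range(n_noise):
            noisy = image + sigma * torch.randn_like(image)  # the "+G" step
            total += attack_fn(model, noisy)                 # one perturbation per model/noise pair
            count += 1
    return total / count  # the averaged (MM+G) perturbation
```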
Why was the standard deviation of noise added to the inputs different for the different attack algorithms?

In our research, Gaussian noise serves as an additional tool to further reduce the noise in perturbations. We found that different attack algorithms may need different levels of Gaussian noise to result in optimal clarity for human-identifiable features.
For BIM attacks, Gaussian noise with a standard deviation of 0.02 produces clear and pronounced human-identifiable features. CW and DeepFool attacks with a standard deviation of 0.02 also exhibit clear human-identifiable features; however, with a standard deviation of 0.05, the clarity of these features is further enhanced. Consequently, we set the standard deviation for BIM attacks to 0.02 and for CW and DeepFool attacks to 0.05.
We would like to emphasize that averaging perturbations from different models is sufficient for the emergence of human-identifiable features, as discussed in Appendix A. Adding Gaussian noise only serves as an additional tool to further reduce perturbation noise.
Why were 270 models chosen for generating perturbations? Were these experiments tried with fewer models (besides the single model case)?
We have conducted additional experiments to investigate the impact of the number of models used for averaging perturbations on the mean squared error (MSE) compared to perturbations from the MM setting. This analysis may give us a better insight into the number of models required to observe the emergence of human-identifiable features.
Prior to calculating the MSE score, we normalized the averaged perturbations using the standard deviations and means derived from the ImageNet dataset. This normalization ensures that the resulting MSE is on a scale comparable to MSE scores calculated from images sampled from the ImageNet dataset, whose input values range between 0 and 1.
Our findings reveal that to achieve MSE convergence within 0.05, an average of 25 models is required for the three attack algorithms. For an MSE of 0.02, 90 models are necessary, while an MSE of 0.01 needs 157 models. For detailed information, please refer to Appendix I.
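A possible sketch of how such a convergence curve can be computed (the normalization shown is one plausible reading of the description above; variable names are illustrative and the exact procedure may differ):

```python
import torch

# Per-channel ImageNet standard deviations (for pixel values in [0, 1]).
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def mse_convergence(per_model_perturbations, reference):
    """per_model_perturbations: one averaged perturbation tensor per surrogate model.
    reference: the perturbation averaged over all models (the convergence target).
    Returns the MSE between the running average over the first k models and the
    reference, rescaled with ImageNet statistics so values are comparable to
    images with pixel intensities in [0, 1]."""
    mses = []
    running = torch.zeros_like(reference)
    for k, p in enumerate(per_model_perturbations, start=1):
        running += p
        diff = (running / k - reference) * IMAGENET_STD  # map back toward [0, 1] pixel scale
        mses.append(diff.pow(2).mean().item())
    return mses
```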
It is mentioned in the human evaluator test that the lowest and highest accuracy in each subset was discarded before calculating the average. What was the purpose of this? And can you clarify exactly what was discarded (e.g., was data for a single sample removed from all participants, or was data from a single participant removed from all samples?)
To eliminate outliers, we excluded data from two participants in each group: the one with the highest accuracy and the one with the lowest accuracy. This decision was made based on statistical considerations, as we observed that within each group, some individuals may perform significantly better or worse than the norm. By removing these outliers, we aim to provide a representative average of human performance in the evaluation.
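For clarity, the trimming can be illustrated as follows (a toy example with hypothetical scores):

```python
def trimmed_group_accuracy(participant_accuracies):
    """Drop the single highest- and lowest-scoring participants in a group,
    then average the remaining scores (requires at least three participants)."""
    scores = sorted(participant_accuracies)[1:-1]
    return sum(scores) / len(scores)

# Example: five participants in one evaluation group
print(trimmed_group_accuracy([0.55, 0.78, 0.82, 0.85, 0.98]))  # ~0.82, average of the middle three
```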
This paper explores the human-identifiable features that are concealed within adversarial perturbations. To this end, this paper utilizes 270 models as surrogate models, introduces Gaussian noise to the input, and identifies the human-identifiable features. This paper shows that in targeted attacks, these features typically demonstrate a "generation effect" by producing features or objects of the target class. In contrast, in untargeted attacks, these features exhibit a "masking effect" by hiding the features or objects of the original class. This paper further claims the revealed phenomenon can interpret some properties of adversarial perturbations.
Strengths
- This paper revisits a critical concept in the context of adversarial robustness: the underlying mechanism of adversarial perturbations.
- This paper conducted human tests to verify that the emergence of semantic features is not coincidental, which is of importance.
- This paper validates the hypothesis across targeted and untargeted attacks and includes search-based attacks.
Weaknesses
This paper challenges a well-acknowledged phenomenon in the context of adversarial robustness: that perceptually aligned gradients (PAG), i.e., human-identifiable features in adversarial perturbations that align with human perception, exist only in robust models [1-3]. However, this paper claims that such features are also hidden in the perturbations of standardly trained (non-robust) models, which contradicts the current understanding of PAG. This concept of PAG has been well supported by various empirical and theoretical analyses in follow-up works, along with its various applications. Therefore, in my opinion, to challenge the existing theories that contradict the claim made, this paper should provide sufficient theoretical and empirical evidence to support the proposed claims. Unfortunately, not only has the evidence in this paper already been discovered or directly deduced by previous work, but it also cannot explain the contradicting theories, which I specify below.
- The experiment uses Gaussian noise to average the perturbations to reveal the human-identifiable features. However, this phenomenon has already been revealed in [4], which shows that randomized smoothing (adding Gaussian noises to the input and calculating the averaged gradient) on a single standardly trained model can lead to PAG and generate these features. Therefore, it's not a newly discovered phenomenon claimed in this paper that averaging gradient among perturbations with different noises can lead to human-identifiable features.
- The experiment also averages different models to reveal the human-identifiable features. However, this phenomenon is expected based on existing work [5, 6], which shows that a little adversarial robustness of the models can lead to PAG. Specifically, as ensembling more non-robust models can still enhance adversarial robustness to a certain extent, though not as robust as adversarially trained models, it can be inferred that the ensembled model can lead to such PAG and identifiable features. Even if this paper shows that the robust accuracy of the ensembled model against adversarial attacks is still low (in Figure 3), the enhanced robustness may still be sufficient to bring such PAG.
- In addition, it has also been shown [7] that the distribution of non-robust features [17] varies across different model architectures. Therefore, intuitively, the gradient (perturbation) of a single model (or a single kind of model architecture) may be noisy, but by averaging the gradients from different models, it is possible to converge toward the robust features.
Based on these discussions, the discovery made in this paper is somewhat trivial, since the observed phenomena have already been revealed in existing work or can be directly deduced from them. Furthermore, the evidence presented in this paper is insufficient to challenge the well-established theories of PAG, as this paper does not provide a clear explanation of the contradictions or confusions, which I specify below.
- There exist several works [8-10] that aim to explain why PAG only exists in robust models by characterizing the decision boundaries of different models, which is well supported by theoretical analysis. These works show that the fundamental difference in decision boundaries between standard and adversarially trained models leads to the (non-)existence of PAG, which contradicts the claim made in this paper in Section 7(2) that human-identifiable features also exist in non-robust models. Unfortunately, this paper does not discuss this viewpoint and does not conduct a theoretical analysis to overturn these theories.
- There also exist theories interpreting the existence of PAG in robust models by modeling adversarial training as energy-based models [11-12]. Additionally, the robust model also provides better guidance during the generation process of diffusion models [13-14], indicating the importance of robust models with PAG for better gradient and generation guidance. Since such a generation process requires multi-step sampling, which can be regarded as applying an average (ensemble) of gradients (perturbations) to the standardly trained model, this also contradicts the viewpoint in this paper and should be well-explained.
- In Section 7(1), the explanation for the transferability of adversarial examples contradicts existing works. This paper attributes the transferability to the human-identifiable (robust) features, but existing works [15-16] show that robust features may not always be helpful for adversarial examples transferring between models and that non-robust features still play a crucial role in transferring adversarial examples. Therefore, the claims made in this paper fail to explain the transferability of adversarial examples across models.
- The explanation of non-trivial accuracy for classifiers trained on a manipulated dataset [17] made in Section 7(3) is flawed. It is clear that in the manipulated dataset, which includes perturbations claimed as human-identifiable features in this paper, the features from the original class are still dominant over the perturbations. According to the interpretation within this paper, the model should still learn the features from the original class and cannot achieve clean accuracy in this noisy training setting. This contradicts the explanation proposed in this paper.
- In Appendix A, Figure 7, it appears that the masking effect of the perturbation without Gaussian noise significantly reduces the identifiability of human-identifiable features, compared to the results in the main paper (with Gaussian noise). Therefore, it can be inferred that ensembling Gaussian noise plays a more crucial role in generating the human-identifiable features than ensembling different models, which undermines the soundness of the claim that the presence of human-identifiable features is inherent in the perturbations themselves, rather than being a result of added Gaussian noise.
- There is a lack of ablation studies on the number of models to further support their claims. It is suggested to add experiments to analyze how many models or noises are required to emerge such human-identifiable features, which can provide a more intuitive view of how noisy the gradients are in the adversarial perturbations.
- For transfer attacks, this paper only compares BIM, CW, and DF, which are not specifically designed for transfer attacks. It is suggested to add a comparison with existing state-of-the-art transfer attacks, e.g., MI-FGSM [18], DI-FGSM [19], and the ensemble attack CWA [20], to substantiate the claims regarding transfer attacks. Since this paper claims that the success of transfer attacks is based on hidden human-identifiable features, it can be inferred that transfer attacks should exhibit more human-identifiable features, which should be supported by experiments evaluating these attacks designed for transferring.
- There is no statement on open sourcing and reproducibility. Since finding such 270 surrogate models is challenging to reproduce, I strongly suggest releasing the code.
[1] Robustness May Be at Odds with Accuracy. ICLR 2019
[2] Image Synthesis with a Single (Robust) Classifier. NeurIPS 2019
[3] Adversarial Robustness as a Prior for Learned Representations. arXiv:1906.00945
[4] Are Perceptually-Aligned Gradients a General Property of Robust Classifiers? NeurIPS 2019 Workshop
[5] On the Benefits of Models with Perceptually-Aligned Gradients. ICLR 2020 Workshop
[6] A Little Robustness Goes a Long Way: Leveraging Robust Features for Targeted Transfer Attacks. NeurIPS 2021
[7] Skip Connections Matter: On the Transferability of Adversarial Examples Generated with ResNets. ICLR 2021
[8] Bridging Adversarial Robustness and Gradient Interpretability. ICLR 2019 Workshop
[9] On the Connection Between Adversarial Robustness and Saliency Map Interpretability. ICML 2019
[10] Robust Models Are More Interpretable Because Attributions Look Normal. ICML 2022
[11] Towards Understanding the Generative Capability of Adversarially Robust Classifiers. ICCV 2021
[12] A Unified Contrastive Energy-based Model for Understanding the Generative Ability of Adversarial Training. ICLR 2022
[13] Enhancing Diffusion-Based Image Synthesis with Robust Classifier Guidance. TMLR
[14] BIGRoC: Boosting Image Generation via a Robust Classifier. TMLR
[15] Closer Look at the Transferability of Adversarial Examples: How They Fool Different Models Differently. WACV 2023
[16] Why Does Little Robustness Help? Understanding and Improving Adversarial Transferability from Surrogate Training. S&P 2024
[17] Adversarial Examples are not Bugs, they are Features. NeurIPS 2019
[18] Boosting adversarial attacks with momentum. CVPR 2018.
[19] Improving transferability of adversarial examples with input diversity. CVPR 2019.
[20] Rethinking Model Ensemble in Transfer-based Adversarial Attacks. arXiv:2303.09105
Questions
Please see the weaknesses above.
We respectfully disagree with the reviewer's comment, yet we appreciate the reviewer's time. The reviewer clearly has an opposite view from ours on the origin of adversarial perturbations. This contrast in views, in our opinion, has led to a very critical tone in the reviewer’s comments. In the following, we respond to each of the reviewer's points, providing clarifications and counterarguments where necessary.
Please note that references prefixed with R refer to the reviewer's reference list.
- The experiment uses Gaussian noise to average the perturbations to reveal the human-identifiable features. However, this phenomenon has already been revealed in [R4], which shows that randomized smoothing (adding Gaussian noises to the input and calculating the averaged gradient) on a single standardly trained model can lead to PAG and generate these features. Therefore, it's not a newly discovered phenomenon claimed in this paper that averaging gradient among perturbations with different noises can lead to human-identifiable features.
The reviewer confused our findings with those in [R4]. Our research reveals that human-identifiable (robust) features naturally occur in adversarial perturbations of 'standard-trained' neural networks, even without adding Gaussian noise. This stands in contrast to [R4], which associates perceptually aligned gradients with the robustness of a classifier, leading to a markedly different conclusion.
A method for creating a robust classifier is randomized smoothing, which adds Gaussian noise with a significant standard deviation (0.5, as mentioned in [R4]) to the original image and implicitly averages the model's predictions. The effectiveness of this technique lies in its certified radius, which guarantees the classifier's robustness within a certain range of perturbations and is proportional to the standard deviation of the added Gaussian noise, as confirmed by earlier research [1].
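For reference, the smoothed classifier referred to here can be sketched as a Monte Carlo approximation (the noise level and sample count below are illustrative, not taken from [R4] or [1]):

```python
import torch

def smoothed_predict(model, x, sigma=0.5, n_samples=100):
    """Monte Carlo approximation of a randomized-smoothing classifier: classify many
    Gaussian-noised copies of x and take a majority vote over the predicted classes.
    The certified radius of such a classifier grows with sigma (see [1])."""
    with torch.no_grad():
        votes = torch.zeros(model(x).shape[-1])
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)
            votes[model(noisy).argmax(dim=-1)] += 1
    return int(votes.argmax())
```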
In [R4], it is noted that as the standard deviation of the incorporated Gaussian noise increases, human-perceptible features become observable and evident in perturbations. Hence, the authors suggest that the increase in the model's robustness may lead to the presence of perceptually aligned gradients.
Our study, however, presents a different scenario. We demonstrate that human-identifiable (robust) features in adversarial perturbations are discernible when averaging perturbations from various models, even without Gaussian noise. This suggests that robust features inherently exist in standardly trained models, as averaging perturbations does not add new features to the original perturbation. Therefore, the distinction between our work and the study in [R4] is evident.
- The experiment also averages different models to reveal the human-identifiable features. However, this phenomenon is expected based on existing work [R5, R6], which shows that a little adversarial robustness of the models can lead to PAG. Specifically, as ensembling more non-robust models can still enhance adversarial robustness to a certain extent, though not as robust as adversarially trained models, it can be inferred that the ensembled model can lead to such PAG and identifiable features. Even if this paper shows that the robust accuracy of the ensembled model against adversarial attacks is still low (in Figure 3), the enhanced robustness may still be sufficient to bring such PAG.
The reviewer states that according to prior research [R5 and R6], our findings are expected. However, our results are fundamentally different from those studies. In [R5], the author shows that adversarial training with low magnitude enhances the model's gradient perceptibility without significantly boosting its robustness, while [R6] discusses enhanced transferability of adversarial perturbations through a similar training method and is unrelated to our work.
The reviewer, lacking concrete evidence, attributes the increased gradient perceptibility in [R5] to a marginal improvement in adversarial robustness, which is not stated in [R5]. The reviewer then states that our results are an expected extension of [R5], based on the assumption that a slight increase in the model's robustness will also increase the perturbations' perceptibility. There is no indication that such an assumption can be realized, and verifying it clearly does not fall within the scope of our work. Furthermore, it is noteworthy that several defensive algorithms, such as defensive distillation [2], marginally enhance the model's robustness. We would be interested to know if the reviewer could offer thoughts on whether perturbations in these defensively augmented models exhibit a perceptually aligned gradient.
In our experiment, we average perturbations from different models and demonstrate that human-identifiable features naturally exist in perturbations from a standard-trained model. Averaging helps to reduce noise and put together incomplete information without incorporating new features into the perturbations. This should not be confused with an increase in models' robustness.
- In addition, it has also been shown [R7] that the distribution of non-robust features [R17] varies across different model architectures. Therefore, intuitively, the gradient (perturbation) of a single model (or a single kind of model architecture) may be noisy, but by averaging the gradients from different models, it is possible to converge toward the robust features. Based on these discussions, the discovery made in this paper is somewhat trivial, since the observed phenomena have already been revealed in existing work or can be directly deduced from them. Furthermore, the evidence presented in this paper is insufficient to challenge the well-established theories of PAG, as this paper does not provide a clear explanation of the contradictions or confusions, which I specify below.
The paper in [R7] proposes an algorithm - the Skip Gradient Method (SGM) - that can increase the transferability of adversarial perturbations. Nothing in that paper discusses "the distribution of non-robust features". We would appreciate it if the reviewer could show where in the paper this claim is made.
The reviewer initially stated that robust features exist exclusively in robust models, based on the concept of perceptually aligned gradients. Our research reveals, however, that robust features are also present in perturbations originating from standard-trained models, a finding established by averaging perturbations across various models. Our work is significant as it highlights the role of robust features in fooling models with perturbations from standard-trained models. This insight has the potential to foster a deeper understanding of how perturbations can mislead models. Despite this being described clearly in the paper, the reviewer persistently extrapolates from other studies and attributes our findings simply to increased model robustness, which deviates from the actual explanation and facts.
[1] Robustness May Be at Odds with Accuracy. ICLR 2019
[2] Image Synthesis with a Single (Robust) Classifier. NeurIPS 2019
[3] Adversarial Robustness as a Prior for Learned Representations. arXiv:1906.00945
[4] Are Perceptually-Aligned Gradients a General Property of Robust Classifiers? NeurIPS 2019 Workshop
[5] On the Benefits of Models with Perceptually-Aligned Gradients. ICLR 2020 Workshop
[6] A Little Robustness Goes a Long Way: Leveraging Robust Features for Targeted Transfer Attacks. NeurIPS 2021
[7] Skip Connections Matter: On the Transferability of Adversarial Examples Generated with ResNets. ICLR 2021
[8] Bridging Adversarial Robustness and Gradient Interpretability. ICLR 2019 Workshop
[9] On the Connection Between Adversarial Robustness and Saliency Map Interpretability. ICML 2019
[10] Robust Models Are More Interpretable Because Attributions Look Normal. ICML 2022
[11] Towards Understanding the Generative Capability of Adversarially Robust Classifiers. ICCV 2021
[12] A Unified Contrastive Energy-based Model for Understanding the Generative Ability of Adversarial Training. ICLR 2022
[13] Enhancing Diffusion-Based Image Synthesis with Robust Classifier Guidance. TMLR
[14] BIGRoC: Boosting Image Generation via a Robust Classifier. TMLR
[15] Closer Look at the Transferability of Adversarial Examples: How They Fool Different Models Differently. WACV 2023
[16] Why Does Little Robustness Help? Understanding and Improving Adversarial Transferability from Surrogate Training. S&P 2024
[17] Adversarial Examples are not Bugs, they are Features. NeurIPS 2019
[18] Boosting adversarial attacks with momentum. CVPR 2018.
[19] Improving transferability of adversarial examples with input diversity. CVPR 2019.
[20] Rethinking Model Ensemble in Transfer-based Adversarial Attacks. arXiv:2303.09105
[21] Certified Adversarial Robustness via Randomized Smoothing. ICML 2019.
[22] Improving Adversarial Robustness via Promoting Ensemble Diversity. ICML 2019
[23] Self-Ensemble Adversarial Training for Improved Robustness. ICLR 2022
Dear authors,
Thank you for your response. I appreciate your time. However, I respectfully maintain my disagreement with your clarifications, which I will explain in detail below.
- The experiment uses Gaussian noise to average the perturbations to reveal the human-identifiable features. However, this phenomenon has already been revealed in [4], which shows that randomized smoothing (adding Gaussian noises to the input and calculating the averaged gradient) on a single standardly trained model can lead to PAG and generate these features.
I maintain my viewpoint on this concern. I believe that your clarifications have not yet distinguished your method from [4].
Our research reveals that human-identifiable (robust) features naturally occur in adversarial perturbations of 'standard-trained' neural networks, even without adding Gaussian noise. This stands in contrast to [R4], which associates perceptually aligned gradients with the robustness of a classifier, leading to a markedly different conclusion.
While I acknowledge that [4] focuses on randomized smoothing in [21], the leveraged methods are exactly the same. Specifically, adding Gaussian noise to the sample and averaging the adversarial perturbations, both used in randomized smoothing [4] and in your proposed +G, have the exact same underlying mechanism. Additionally, the model used in [4] is also a standardly trained model, which is still the same as your setting. Therefore, it is difficult to argue that your study presents a different scenario. The claim that adding Gaussian noise and then averaging over adversarial perturbations on standardly trained models can derive robust features has already been discovered.
- The experiment also averages different models to reveal the human-identifiable features. However, this phenomenon is expected based on existing work [5, 6], which shows that a little adversarial robustness of the models can lead to PAG. Specifically, as ensembling more non-robust models can still enhance adversarial robustness to a certain extent, though not as robust as adversarially trained models, it can be inferred that the ensembled model can lead to such PAG and identifiable features. Even if this paper shows that the robust accuracy of the ensembled model against adversarial attacks is still low (in Figure 3), the enhanced robustness may still be sufficient to bring such PAG.
The reviewer then states that our results are expected extension of [5], based on the assumption that a slight increase in the model’s robustness will also increase perturbations’ perceptibility. There is no indication that such assumption can be realized. Verifying such assumption clearly does not fall within our work’s domain.
Still, I maintain my viewpoint that ensembling multiple models into one can improve robustness, albeit not significantly, and thus improve perceptibility.
First, it has been demonstrated that model ensemble can enhance adversarial robustness [22-23], thus leading to robust features. While the mentioned works focus on adversarially trained models, their assertion about the connection between ensembling and adversarial robustness can be easily extended to non-robust models.
Furthermore, as stated in [5], the emergence of perceptual aligned gradients (for a single image, without being perturbed by noise) exhibits a strong correlation with model robustness. Therefore, since both methods of averaging over models and introducing perturbations contribute to improving robustness, recognized as randomized smoothing and model ensemble, these conclusions are undoubtedly anticipated.
- In addition, it has also been shown [7] that the distribution of non-robust features [17] varies across different model architectures. Therefore, intuitively, the gradient (perturbation) of a single model (or a single kind of model architecture) may be noisy, but by averaging the gradients from different models, it is possible to converge toward the robust features.
The paper in [R7] proposes an algorithm - the Skip Gradient Method (SGM) - that can increase the transferability of adversarial perturbations. In the paper, nothing is discussed about “the distribution of non-robust features”. We would appreciate it if the reviewer would show where the paper the claim is written.
The intuition here is that SGM [7] modifies the algorithm for generating adversarial examples based on the design of the architecture, which is specifically tailored for residual connection models. From this, it can be inferred that different model architectures exhibit different non-robust features. Intuitively, averaging various models leads to convergence towards robust features.
It seems that you missed Weakness 4. I would like to further discuss with you after you address it.
Best Regards,
- In Appendix A, Figure 7, it appears that the masking effect of the perturbation without Gaussian noise significantly reduces the identifiability of human-identifiable features, compared to the results in the main paper (with Gaussian noise). Therefore, it can be inferred that ensembling Gaussian noise plays a more crucial role in generating the human identifiable features than ensembling different models, which undermines the soundness of the claim that the presence of human-identifiable features is inherent in the perturbations themselves, rather than being a result of added Gaussian noise.
The reviewer's statement that the presence of human-identifiable features is due to the incorporation of Gaussian noise is incorrect. Two observations illustrate why:
- In the search-based attack, no Gaussian noise is added, yet human-identifiable features still emerge, as shown in Section 5.2.4 and Appendix D.
- In Appendix A, we have performed an experiment on perturbations generated without the incorporation of Gaussian noise (the MM setting). Human-identifiable features are still clearly observed in these perturbations, as demonstrated in Appendix A1.
Additionally, we disagree with the reviewer's assertion that averaging perturbations from various models (the MM setting) significantly diminishes the perceptibility of human-identifiable features. The cosine similarity between perturbations generated separately under the MM+G and MM settings for the BIM attack reaches as high as 0.80, based on an average over 200 perturbations. This high similarity indicates that Gaussian noise is not critical in influencing the perceptibility of perturbations.
- There is a lack of ablation studies on the number of models to further support their claims. It is suggested to add experiments to analyze how many models or noises are required to emerge such human-identifiable features, which can provide a more intuitive view of how noisy the gradients are in the adversarial perturbations.
We have added experiments analyzing the model count and the convergence for perturbations, please refer to Appendix I. Thank you for your suggestion.
References:
[1] Certified Adversarial Robustness via Randomized Smoothing. Jeremy Cohen, et al. ICML 2019.
[2] Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks. Nicolas Papernot, et al. 37th IEEE Symposium on Security and Privacy.
- There also exist theories interpreting the existence of PAG in robust models by modeling adversarial training as energy-based models [R11-R12]. Additionally, the robust model also provides better guidance during the generation process of diffusion models [R13-R14], indicating the importance of robust models with PAG for better gradient and generation guidance. Since such a generation process requires multi-step sampling, which can be regarded as applying an average (ensemble) of gradients (perturbations) to the standardly trained model, this also contradicts the viewpoint in this paper and should be well-explained.
In our research, we show, without a shred of doubt, that human-identifiable features exist in perturbations of standard-trained models. We firmly believe that theory is always capable of explaining experimental findings.
- In Section 7(1), the explanation for the transferability of adversarial examples contradicts existing works. This paper attributes the transferability to the human-identifiable (robust) features, but existing works [R15-R16] show that robust features may not be always helpful for adversarial examples transferring between models and non-robust features still play a crucial role in transferring adversarial examples. Therefore, the claims made in this paper fail to explain the transferability of adversarial examples across models.
In our research, we isolated human-identifiable features from perturbations and discovered their significant attack strength, leading us to conclude that these features contribute to the misclassification of a model. We see no conflict between our findings and those in [R15, R16], as [R15, R16] do not deny the strong transferability of robust features to deceiving models.
- The explanation of non-trivial accuracy for classifiers trained on a manipulated dataset [R17] made in Section 7(3) is flawed. It is clear that in the manipulated dataset, which includes perturbations claimed as human-identifiable features in this paper, the features from the original class are still dominant over the perturbations. According to the interpretation within this paper, the model should still learn the features from the original class and cannot achieve clean accuracy in this noisy training setting. This contradicts the explanation proposed in this paper.
The term "dominant" should be used with caution. While the original images' features are larger in pixel value, the perturbations generated by the models may align more closely with what the models learn, potentially playing a more significant role during training than the original features do. The discussion of which features dominate training is not the focus of this research. Since we have yet to demonstrate which types of features are more dominant, we emphasize in our paper that only a partial explanation can be offered.
We respectfully disagree with the reviewer's comment, yet we appreciate the reviewer's time.
While I acknowledge that [R4] focuses on randomized smoothing in [R21], the leveraged methods are exactly the same. Specifically, adding Gaussian noise to the sample and averaging the adversarial perturbations, both used in randomized smoothing [R4] and in your proposed +G, have the exact same underlying mechanism. Additionally, the model used in [R4] is also a standardly trained model, which is still the same as your setting. Therefore, it is difficult to argue that your study presents a different scenario. The claim that adding Gaussian noise and then averaging over adversarial perturbations on standardly trained models can derive robust features has already been discovered.
Please note that references prefixed with R refer to the reviewer's reference list.
The perturbations in [R4] are derived from a robust model, as the authors indicate in their abstract: "In this paper, we show that these perceptually aligned gradients (human-identifiable features) also occur under randomized smoothing, an alternative means of constructing adversarially-robust classifiers." The authors observe that as the model's robustness increases (by increasing the standard deviation of the incorporated Gaussian noise), the human-identifiable features in the perturbations become pronounced. They also report that human-identifiable features are not observed when noise is not incorporated. Consequently, they suggest that human-identifiable features are a consequence of a robust model.
In contrast, our study reveals that human-identifiable (robust) features are inherently present in standard models. By averaging perturbations from different standard models, even those without Gaussian noise, we uncover human-identifiable features. Notably, this averaging process does not introduce any new information, implying that these robust features are originally embedded in perturbations from standard models. This finding is very different from the conclusions drawn in [R4].
Clearly, our method and conclusion differ from [R4].
Still, I maintain my viewpoint that ensembling multiple models into one can improve robustness, albeit not significantly, and thus improve perceptibility. First, it has been demonstrated that model ensemble can enhance adversarial robustness [R22-R23], thus leading to robust features. While the mentioned works focus on adversarially trained models, their assertion about the connection between ensembling and adversarial robustness can be easily extended to non-robust models. Furthermore, as stated in [R5], the emergence of perceptual aligned gradients (for a single image, without being perturbed by noise) exhibits a strong correlation with model robustness. Therefore, since both methods of averaging over models and introducing perturbations contribute to improving robustness, recognized as randomized smoothing and model ensemble, these conclusions are undoubtedly anticipated.
[R5] does not make the claim that "the emergence of perceptual aligned gradients exhibits a strong correlation with model robustness." Instead, the abstract of [R5] states, "We perform experiments to show that interpretable and perceptually aligned gradients are present even in models that do not show high robustness to adversarial attacks." This statement differs significantly from the reviewer's assertion and, on the contrary, implicitly suggests that perceptually aligned gradients may not necessarily be related to the model's robustness.
There is no indication in the papers provided by the reviewer regarding the reviewer’s viewpoint that a slight increase in the model’s robustness will necessarily improve perturbations’ perceptibility. Furthermore, it is noteworthy that several defensive algorithms, such as defensive distillation, marginally enhance the model's robustness. We would be interested to know if the reviewer could offer thoughts on whether perturbations in these defensively augmented models exhibit a perception-aligned gradient.
The intuition here is that SGM [R7] modifies the algorithm for generating adversarial examples based on the design of the architecture, which is specifically tailored for residual connection models. From this, it can be inferred that different model architectures exhibit different non-robust features. Intuitively, averaging various models leads to convergence towards robust features.
A key contribution of our paper is demonstrating the existence of robust features in perturbations derived from standard models, a finding previously unrecognized in the community, as the reviewer's own earlier comments highlight. The reviewer noted, "This paper challenges a well-acknowledged phenomenon in the context of adversarial robustness: the perceptual aligned gradient (PAG), which refers to the human-identifiable features that align with human perception in adversarial perturbations, only exists in robust models [1-3]." The reviewer also noted, "There exist several works [8-10] that aim to explain the reason PAG only exists in robust models by characterizing the decision boundaries between different models, which is well supported by theoretical analysis."
In the previous comment, the reviewer stated: "In addition, it has also been shown [R7] that the distribution of non-robust features [R17] varies across different model architectures." As far as we are aware, this does not appear to have been addressed in [R7].
It seems that you missed Weakness 4. I would like to further discuss with you after you address it.
We have conducted experiments, and the results lead to our conclusion. It would be nice to build a theoretical model of what we have found, but this has not been done, as we are constrained by limited manpower and resources.
This paper delves into the exploration of the underlying reasons for adversarial perturbations. Specifically, the authors hypothesize that human-identifiable features are present within the perturbations, forming part of the inherent properties of these perturbations. To validate this hypothesis, the authors average perturbations generated by various neural networks to uncover the human-identifiable features.
Strengths
- This work finds that perturbations generated by existing methods statistically contain some human-identifiable features, which are clearly illustrated in the provided qualitative results.
- To uncover these human-identifiable features, the authors use a simple method that averages a large number of generated perturbations, which is reasonable.
- This paper demonstrates that perturbations produced by certain attack methods converge at the object region.
- This paper provides a clear narrative, supplemented by analytical insights.
Weaknesses
- In the first paragraph of Section 4, on what basis do you assert that (1) the noise in perturbations is independent and (2) two perturbations from different models display distinct human-identifiable features? I couldn't find any references or evidence supporting these claims.
- The gradient-based attacks, proposed five years ago, aren't sufficiently contemporary to test the paper's hypothesis. There exist many newer gradient-based attacks, such as [1, 2].
- I observed that detecting human-identifiable features necessitates 2,700 samples (270 models and 10 noise-infused seed samples). These may suggest that the averaged perturbation, generated by the three attacking methods, gravitates towards the object region. However, they don't confirm that in every model the generated perturbations house human-identifiable features. Hence, a deeper experimental analysis regarding model selection and the integration of Gaussian noise would be beneficial, perhaps including more ablation studies (like MM, MM+G, SM+G).
- Why choose only 20 fixed classes out of 1,000? And a mere 200 samples seem insufficient to substantiate the claims made in the paper.
- It's noted that perturbations of identical images from varying attack algorithms are presumably alike. However, the results don't include background noise similarity or image perturbation similarity. Providing experimental evidence for this would enhance the argument.
- The experimental analysis concerning the two distinct types of human-identifiable features (masking effect and generation effect) appears limited. Visualizing the perturbations for targeted attacks would be beneficial.
- Does the visualized perturbation come from cases where the attack was successful? How does the perturbation behave in the case of an unsuccessful attack?
- While the paper asserts findings across three different datasets, I could only locate a detailed attack accuracy comparison for ImageNet in Appendix E Table 1. It is also not clear why the NOISE performance surpasses that of IMAGE.
[1] Rony J, Hafemann L G, Oliveira L S, et al. Decoupling direction and norm for efficient gradient-based l2 adversarial attacks and defenses. ICCV 2019
[2] Wang X, He K. Enhancing the transferability of adversarial attacks through variance tuning. CVPR 2021.
Questions
See the questions in the weaknesses.
We thank the reviewer for providing valuable feedback on our work. In the following, we will individually respond to each question from the reviewer.
- In the first paragraph of Section 4, on what basis do you assert that (1) the noise in perturbations is independent and (2) two perturbations from different models display distinct human-identifiable features? I couldn't find any references or evidence supporting the claims.
(1) and (2) are assumptions based on existing literature or empirical observations, which inspired us to uncover human-identifiable features in perturbations through the process of averaging.
For assumption (1), the existing literature suggests that the noise observed in gradients relative to the loss function might be meaningless local variations in partial derivatives [1]. Therefore, it is reasonable to assume that the local variations of gradients trained independently from different models should not be aligned or dependent on each other.
Regarding assumption (2), our empirical observations show that two robust models can generate distinct human-identifiable features for the same image using the same attack algorithm. This leads to our hypothesis that perturbations from standard-trained models also contain distinct human-identifiable features.
[1] Smoothgrad: Removing noise by adding noise. Daniel Smilkov, et al. Workshop on Visualization for Deep Learning. 2017.
2. I observed that detecting human-identifiable features necessitates 2,700 samples (270 models and 10 noise-infused seed samples). These may suggest that the averaged perturbation, generated by the three attacking methods, gravitates towards the object region. However, they don't confirm that in every model, the generated perturbations house human-identifiable features. Hence, a deeper experimental analysis regarding model selection and the integration of Gaussian noise would be beneficial, perhaps including more ablation studies (like MM, MM+G, SM+G).
Thank you for your comment. We would like to note that we have included the MM setting in our ablation study, as detailed in Appendix A1. The results consistently align with those from the MM+G setting and demonstrate human-identifiable features.
Based on your suggestion, we have conducted additional experiments to explore the contribution of each model towards the human-identifiable features. This involved varying the number of models used to average perturbations and comparing their mean squared error (MSE) to the perturbations from the MM setting. We found that, to achieve MSE convergence within 0.05, an average of 25 models is needed for the three attack algorithms. For an MSE of 0.02, 90 models are required, while an MSE of 0.01 necessitates 157 models. For detailed information, please refer to Appendix I.
- Why choose only 20 fixed classes out of 1,000? And a mere 200 samples seem insufficient to substantiate the claims made in the paper.
We have carefully selected 20 classes from the ImageNet validation set to maximize the diversity of our study, including classes on animals, plants, architecture, toys, transportation, utility, etc. We chose the first 10 images, instead of hand-picking, in each class to eliminate the possibility of human bias. Additionally, we tested the results across three different datasets and with up to five attack algorithms. We consistently observed the emergence of human-identifiable features from perturbations in each and every case.
While we are eager to expand our experimental scope, the computational costs are prohibitively high. For instance, the DeepFool attack in our experiments alone required over 250 hours on a Tesla V100 GPU. Thus, further expanding the experiment's scale is challenging. However, we are confident that the consistent results obtained across various settings are sufficient to support our argument.
- It's noted that perturbations of identical images from varying attack algorithms are presumably alike. However, the results don't include background noise similarity or image perturbation similarity. Providing experimental evidence for this would enhance the argument.
The reason the cosine similarity converges regardless of the attack algorithm is that the masking effect reduces the contrast of human-identifiable features in an image, and those features should not depend on the attack algorithm used. In Section 5.2.3, we discuss why human-identifiable features are primarily located in the contour part of the perturbations. Therefore, computing the cosine similarity over the background part would not align with our objective of assessing the similarity of the masking effect across different attack algorithms. Since the background contains significantly fewer human-identifiable features, its inclusion is expected to reduce the overall cosine similarity score.
The table below presents cosine similarity scores between perturbations generated by different attack algorithms under the MM+G setting. The first row indicates which two attack algorithms are used for generating perturbations, while the second and third rows compare the cosine similarity between the entire perturbations and their contour parts, respectively.
| Attack Algorithms | BIM/CW | BIM/DeepFool | CW/DeepFool |
|---|---|---|---|
| Whole | 0.35 | 0.41 | 0.59 |
| Contour | 0.43 | 0.47 | 0.64 |
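For clarity, such scores can be computed along the following lines (a schematic sketch; the contour mask is assumed to come from the contour-extraction step, and the function and variable names are illustrative):

```python
import torch

def cosine_sim(p1, p2, mask=None):
    """Cosine similarity between two flattened perturbations; if a binary mask is
    given, the comparison is restricted to the masked (e.g. contour) region."""
    if mask is not None:
        p1, p2 = p1 * mask, p2 * mask
    p1, p2 = p1.flatten(), p2.flatten()
    return (torch.dot(p1, p2) / (p1.norm() * p2.norm() + 1e-12)).item()

# Usage with MM+G perturbations of the same image from two attacks:
# whole_score   = cosine_sim(pert_bim, pert_cw)
# contour_score = cosine_sim(pert_bim, pert_cw, mask=contour_mask)
```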
- The experimental analysis concerning the two distinct types of human-identifiable features (masking effect and generation effect) appears limited. Visualizing the perturbation for targeted attacks would be beneficial.
We have presented the results for targeted attacks in Section 6, which serve to demonstrate the generation effect. In Appendix H, more targeted attack examples are included.
- Does the visual perturbation come from cases where the attack was successful? How does the perturbation behave in the case of an unsuccessful attack?
The experiment is conducted under black-box attacks. Therefore, the attack's success depends on the testing model used. In the human evaluation test, we used all perturbations generated from the experiment, including those that failed to fool any of the testing models.
For the BIM attack, we found that 14 perturbations, containing human-identifiable features under the MM+G setting, could lower the label class score. However, none of the 4 testing models used in the paper were fooled by any of these perturbations. This is attributed to the fact that the testing models exhibit high confidence scores for the respective images, making them more challenging to deceive.
For those 14 images, the confidence score, derived from predictions processed via the SoftMax function, was, on average, 11.92% higher than that for images correctly classified in the experiment. This comparison is detailed in the following table, where the first row lists the respective testing models. The table entries consist of the confidence scores for the 14 images that were not deceived by the perturbations, as well as the average confidence scores for all correctly classified images, across the four testing models.
We further discovered that when the norm of the perturbations was increased to 0.04, only four of the perturbed images remained correctly classified by all 4 testing models. This finding suggests that these perturbations, while initially unable to fool the models due to their high confidence scores, can still fool the testing models as the norm is increased.
| Model | ResNet-50 | BN-Inception | DenseNet-121 | VGG-16 |
|---|---|---|---|---|
| 14 Images | 93.15% | 88.29% | 88.86% | 79.33% |
| Avg. | 82.54% | 80.46% | 78.32% | 71.08% |
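For reference, the confidence scores reported above correspond to the maximum SoftMax probability of a model's prediction, along these lines (an illustrative sketch, not the exact evaluation script):

```python
import torch
import torch.nn.functional as F

def prediction_confidence(model, x):
    """Confidence of the predicted class: the maximum SoftMax probability
    over the model's output logits for input x."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    return probs.max(dim=-1).values.item()
```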
- While the paper asserts findings across three different datasets, I could only locate a detailed attack accuracy comparison for ImageNet in Appendix E Table 1.
We would like to remind the reviewer that the attack accuracy for the remaining two datasets is located in Appendix C.3.
This paper conducted an interesting analysis of the human-identifiable features concealed in adversarial perturbations crafted by different attack algorithms. In order to obtain visually recognizable patterns from gradient-driven adversarial perturbations, multiple samplings over different threat models were used, based on the independence assumption. In the experiment sections, this paper conducted such analysis on various threat models (274 in total) with various attack algorithms (gradient-based, search-based), which is efficient and solid. While the resulting denoised adversarial perturbations seem to have some clear patterns that can be recognized by humans, a pure adversarial perturbation cannot reveal any information regarding the image itself. This paper also contains further discussion of the denoised adversarial perturbations by quantitatively analyzing their recognizability, checking their attack strength, and applying contour extraction. The overall analysis is plentiful and the results look interesting.
Strengths
- Evaluation on a large number of threat models and attack algorithms makes the experimental results reliable.
- The motivation of exploring human-identifiable features directly, instead of applying XAI methods for interpretation, is efficient and interesting.
- The overall writing is clear and easy to follow.
Weaknesses
While I do appreciate this important and intensive work on exploring the explainability of adversarial perturbations, I still have some major concerns about the paper as a whole.
- Human-identifiable features look vague: I remain unclear about how to logically define "human-identifiable" here. In Section 5.2.1 the authors conduct recognizability experiments on these denoised adversarial perturbations, but this can only prove that they are "model-identifiable". We cannot make such a claim merely by showing that some (or even all) of the extracted adversarial perturbations appear human-identifiable. A human-labeling experiment is required as strong evidence to prove this.
- The overall finding is not surprising: while it is good to see that the denoised adversarial perturbation is similar to its corresponding raw image, I am not surprised to see this, because gradient-based attacks perturb the model's prediction by optimizing the objective function along the pixel-gradient direction --- larger pixel gradients indicate that the corresponding pixels are important for the threat model to identify the input image. Thus the outcome of gradient optimization, the adversarial perturbation, should contain some important features for identifying the image. Search-based attacks also tend to follow the important pixels when crafting their perturbations. I think this paper should focus more on the targeted-attack scenario: given the raw image's key features and a targeted label, how would the adversarial perturbation reflect both concepts? Currently only a very short paragraph discusses this scenario (Section 6).
Questions
I have put all my concerns in the Weaknesses part, and I do think this paper has a lot of room for improvement.
However, I think the overall results are plentiful and interesting for other researchers to know about (especially the denoised perturbations under the targeted-attack scenario). It could be a very interesting workshop paper after being reorganized in a more logical way.
======================================================
Updates after reading authors' rebuttal:
I really appreciate the authors' efforts in further elaborating the importance of their findings - I now tend to believe this is an interesting finding that could inspire several future papers on further theoretical analysis. However, after checking Reviewer ZsSi's comments, there may be some literature implicitly discussing this scenario, and this paper lacks a contribution in further exploring the underlying reasons. I would like to raise my score to 5 but reduce my confidence to 3.
First, we would like to thank the reviewer for their detailed review of the paper. Please find our response below.
- Human-identifiable features look vague: I remain unclear about how to logically define "human-identifiable" here. In Section 5.2.1 the authors conduct recognizability experiments on these denoised adversarial perturbations, but this can only prove that they are "model-identifiable". We cannot make such a claim merely by showing that some (or even all) of the extracted adversarial perturbations appear human-identifiable. A human-labeling experiment is required as strong evidence to prove this.
We would like to note that we have conducted a human evaluation test on perturbations generated under the MM+G setting, as detailed in Section 5.2.1. Human participants achieved an average accuracy of 80.7%, indicating that the adversarial perturbations are highly identifiable by humans.
- The overall finding is not surprising: while it is good to see that the denoised adversarial perturbation is similar to its corresponding raw image, I am not surprised to see this, because gradient-based attacks perturb the model's prediction by optimizing the objective function along the pixel-gradient direction --- larger pixel gradients indicate that the corresponding pixels are important for the threat model to identify the input image. Thus the outcome of gradient optimization, the adversarial perturbation, should contain some important features for identifying the image. Search-based attacks also tend to follow the important pixels when crafting their perturbations. I think this paper should focus more on the targeted-attack scenario: given the raw image's key features and a targeted label, how would the adversarial perturbation reflect both concepts? Currently only a very short paragraph discusses this scenario (Section 6).
We thank the reviewer for providing us with the chance to clarify our work. In the following discussion, we will introduce the prevailing perspective on adversarial perturbations, and then explain the contribution of our work.
Although adversarial perturbations are expected to contain features beneficial for identifying images, there is no guarantee that these features will be human-understandable. In fact, the prevailing view in the field is that humans and machines rely on different types of features for classification. Adversarial perturbations, as a direct consequence of human-AI misalignment, are a type of feature (non-robust feature) that is highly informative for models yet incomprehensible to humans [1]. There is an even stronger version of this assumption stating that human-identifiable (robust) features should only exist in robust models, as suggested by reviewer 4 (ZsSi).
The reviewer might think that, with the advent of explainable AI tools, it should appear straightforward that gradients/adversarial perturbations must include features identifiable by humans. However, these tools often require preprocessing that substantially alters the image, which can mask the actual features relied upon by the model. Consequently, the reported results may be ad hoc and not accurately reflect reality [2].
Our research reveals that human-identifiable features inherently reside in perturbations from a standardly trained model, offering fresh insights into explaining the properties of perturbations including transferability, the relation between adversarial training and increased model explainability, and the human-AI misalignment experiment, all of which are crucial in understanding the mechanisms behind adversarial perturbations.
References:
[1] Adversarial Examples Are Not Bugs, They Are Features. Andrew Ilyas, et al. NeurIPS, 2019.
[2] Adversarial Examples and Human-ML Alignment. Aleksander Madry. MIT CBMM Talks (https://www.youtube.com/watch?v=AvcRBuFreFg, 22:52-23:13)
This paper studies how to extract human-identifiable features from adversarial examples. Based on the fact that DNN models are trained on human-labeled datasets, the authors assume that adversarial perturbations should also contain human-identifiable features.
The authors first clarify that two factors, excessive gradient noise and incomplete features, hinder feature extraction. Therefore, the authors propose to utilize noise augmentation and model ensembling to mitigate these negative effects. The authors find two interesting phenomena: a masking effect (untargeted attacks) and a generation effect (targeted attacks).
Strengths
- This problem is interesting. I like this topic.
- The visualization results are also promising.
Weaknesses
- Although this problem is interesting, the authors do not provide more surprising findings and insights compared with previous works.
1.1 That adversarial perturbations contain meaningful or human-identifiable features has been studied in these works [1,2]. They may correspond to "robust" features.
1.2 The proposed methods, noise augmentation and model ensembling, are widely used in transfer attacks. More transferable perturbations contain more "robust" features (human-identifiable features) and share more non-robust features. Previous work has shown this point [1].
1.3 Although the visualizations are very promising, we are uncertain how much assistance this can provide.
- Some claims in the article are unclear:
2.1 The two obstacles are not very clear. The first one (noisy gradients) is easy to understand; many transfer attacks also propose to mitigate this negative effect to improve adversarial transferability. However, there is insufficient evidence to support the second claim about incompletely learned features. Could you please provide more details about the second one?
2.2 Meanwhile, the comparison between these two points is also unclear. Which factor has a greater negative impact on extracting human-identifiable features? As shown in the experimental setting, the authors need to use a large number of ensembled models, which makes the method less practical.
2.3 The findings in Section 5.2.3 are interesting. The authors use the contour features to attack models, which also shows that contour features are more important than background information. Could the authors please discuss the connections and differences between this phenomenon and this work [3]?
- Could the authors please provide more results about the generation effect in targeted attacks?
[1] Adversarial Examples Are Not Bugs, They Are Features.
[2] Image Synthesis with a Single (Robust) Classifier.
[3] ImageNet-trained CNNs are biased towards texture: increasing shape bias improves accuracy and robustness.
Questions
Please see the Weaknesses part.
- Some claims in the article are unclear:
2.1 The two obstacles are not very clear. The first one (noisy gradients) is easy to understand; many transfer attacks also propose to mitigate this negative effect to improve adversarial transferability. However, there is insufficient evidence to support the second claim about incompletely learned features. Could you please provide more details about the second one?
2.2 Meanwhile, the comparison between these two points is also unclear. Which factor has a greater negative impact on extracting human-identifiable features? As shown in the experimental setting, the authors need to use a large number of ensembled models, which makes the method less practical.
2.3 The findings in Section 5.2.3 are interesting. The authors use the contour features to attack models, which also shows that contour features are more important than background information. Could the authors please discuss the connections and differences between this phenomenon and this work [3]?
2.1 The obstacles hindering the presence of robust features are assumptions made in the paper based on existing literature and empirical observations. These obstacles motivate us to propose perturbation averaging as a method to uncover robust features.
Regarding assumption (2), our empirical observations indicate that gradients/perturbations from robust models contain only a subset of human-identifiable features. This leads to the hypothesis that perturbations from standard models should likewise contain only an incomplete set of human-identifiable features.
We believe the reason is that, owing to capacity limits and the difficulty of reaching a learning optimum, neural networks may not fully utilize all robust features for classification. Perturbations, designed to maximize the model's loss function, affect only those features the model recognizes. Consequently, these perturbations might encompass only a subset of the features identifiable by humans.
2.2 Our research aims to investigate robust features in perturbations from standard models. Assessing which factor, noise or incompleteness, more significantly hinders the emergence of robust features is not our primary focus. We recognize the value of such an evaluation. However, discerning the dominant factor is challenging due to the concurrent presence of noise and the incompleteness of robust features in perturbations, and separating one factor without altering the nature of the perturbations is difficult. This issue deserves further investigation.
2.3 In [R3], the authors demonstrate that Convolutional Neural Networks (CNNs) classify images using both texture and contour, with texture playing a more significant role than it does in human classification, which relies more on contour.
We segment the perturbations into those within the contour of the object and those in the background (a sketch of this segmentation is given below). We find that the contour parts contain both the object's edges and textures similar to those in the original image. Figure 2 illustrates that perturbations from MM+G display texture-like features; for instance, one can observe a green, granular texture on the frog's back and a stripe-like texture on the baseball. These observations align with the findings in [R3] because, to deceive the model, perturbations may need to reduce the pixel values of both texture and contour features, which serve as critical factors influencing the model's decision.
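The following is a minimal sketch, under our own simplifying assumptions, of how a perturbation can be split into contour and background parts using a binary object mask. The mask source, the `eps` budget, and the function names are illustrative, not the exact implementation used in the paper.

```python
# Hedged sketch: split a perturbation into contour (object) and background parts
# with a binary object mask, then form the two partial adversarial images that
# can be compared for attack strength. Obtaining the mask itself (e.g., from
# manual annotation or a segmentation model) is not shown.
import torch

def split_perturbation(perturbation: torch.Tensor, object_mask: torch.Tensor):
    """perturbation: (C, H, W); object_mask: (H, W), 1 inside the object, 0 outside."""
    mask = object_mask.unsqueeze(0).float()        # broadcast the mask over channels
    contour_part = perturbation * mask             # keep only the object region
    background_part = perturbation * (1.0 - mask)  # keep only the background region
    return contour_part, background_part

def make_partial_adversarials(x_clean, perturbation, object_mask, eps=0.02):
    """Attach each part to the clean image separately for evaluation (eps is a placeholder)."""
    contour_part, background_part = split_perturbation(perturbation, object_mask)
    x_contour = (x_clean + eps * contour_part).clamp(0.0, 1.0)
    x_background = (x_clean + eps * background_part).clamp(0.0, 1.0)
    return x_contour, x_background
```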
- Could the authors please provide more results about the generation effect in targeted attacks?
3. Additional results on targeted attacks have been provided in Appendix H. Thank you for your suggestion.
We thank the reviewer for providing valuable feedback on our work. In the following, we will individually respond to each question from the reviewer.
- Although this problem is interesting, the authors do not provide more surprising findings and insights compared with previous works.
1.1 That adversarial perturbations contain meaningful or human-identifiable features has been studied in these works [1,2]. They may correspond to "robust" features.
1.2 The proposed methods, noise augmentation and model ensembling, are widely used in transfer attacks. More transferable perturbations contain more "robust" features (human-identifiable features) and share more non-robust features. Previous work has shown this point [1].
1.3 Although the visualizations are very promising, we are uncertain how much assistance this can provide.
Note: References cited by the reviewer are cited as R#. For example, the first reference cited by the reviewer is indicated as R1.
We thank the reviewer for giving us the chance to clarify our contributions. Our contribution is not to re-discover robust (human-identifiable) features but to demonstrate that robust features inherently exist in adversarial perturbations generated from a standardly trained model. In fact, the prevailing view in this field is that humans and machines rely on different types of features for classification. This view leads to the claim that adversarial perturbations from standard models, as a direct consequence of human-AI misalignment, are a type of feature (non-robust feature) that is highly informative for models yet incomprehensible to humans [R1].
With our experiments, we demonstrate that perturbations from standard models inherently contain robust features, rather than mere non-robust features. To the best of our knowledge, this finding has not yet been discussed in the literature and provides a new perspective on the mechanism governing how perturbations attack models.
Reference [R2] illustrates the effectiveness of robust models in image generation and restoration tasks. A related study [1] by the same research group delves into the reasons behind this and is closely connected to our work. That study shows that gradients and perturbations generated by robust models align more closely with human perception, while perturbations from standard networks are significantly noisier and may not be human-identifiable. According to the authors, standard models learn different features than robust models.
We show that, by averaging perturbations from standard models, robust features are still observable, indicating a shared learning aspect between standard and robust models. Furthermore, we offer a new yet plausible explanation for the increased perceptibility in adversarial training: Adversarial training could act as gradient regularization, reducing noise and homogenizing gradients, thus making perturbations more perceptible to humans, as stated in Section 7.2.
Our study uncovers human-identifiable (robust) features in perturbations from standard training, shedding new light on perturbations' properties, such as transferability, the link between adversarial training and enhanced model explainability, and the human-AI misalignment experiment. These aspects are key to understanding the mechanism of adversarial perturbations and to pursuing applications related to them.
Reference:
[1] Robustness May Be at Odds with Accuracy. Dimitris Tsipras, et al. ICLR 2019.
First, we would like to express our gratitude to all reviewers for the time and effort spent on reviewing our paper. In light of a second comment from a reviewer who has a strong opinion against our work, we believe it is important to further clarify the contributions of our research.
Community perspective:
According to a prevailing perspective in the community, adversarial perturbations derived from standard models are non-robust (non-human-identifiable) features, unlike those derived from robust models, which exhibit robust (human-identifiable) features [1]. A key factor behind this discrepancy is the fundamentally different features learned by robust and standard models [2,3]. Building on this premise, some researchers argue that robust features should not be present in perturbations derived from standard models [2,4].
Our findings:
In our research, we demonstrate that perturbations derived from standard models inherently possess robust features. This is achieved by generating perturbations from different standard models and then averaging them.
Although the integration of ensembling methods and augmentation techniques has been shown to yield robust features in perturbations [5,6], researchers attribute this phenomenon to the increased robustness of the model brought about by these techniques [7].
Our averaging method eliminates such concerns, allowing us to state that perturbations derived from standard models inherently include robust features. The reason is that averaging neither introduces new features into the perturbations nor alters their nature. To the best of our knowledge, this finding is entirely new.
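As a rough illustration of the averaging idea described above, the sketch below generates perturbations for the same image from several independently trained standard models on Gaussian-noise-augmented inputs and averages them. The `attack` callable, the number of noise samples, the noise level, and the normalization are our own illustrative choices, not the paper's exact settings.

```python
# Hedged sketch of perturbation averaging across standard models ("MM+G"-style).
# `attack` stands in for any per-model attack (e.g., BIM) that returns a
# perturbation tensor of the same shape as the input.
import torch

def averaged_perturbation(x, y, model_list, attack, n_noise=10, sigma=0.03):
    """x: clean image (C, H, W); y: label; model_list: standardly trained models."""
    perturbations = []
    for model in model_list:
        model.eval()
        for _ in range(n_noise):
            # Add Gaussian noise to the input before attacking (the "+G" component).
            x_noisy = (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0)
            delta = attack(model, x_noisy, y)  # perturbation from this model/sample
            perturbations.append(delta)
    avg = torch.stack(perturbations).mean(dim=0)
    # Normalize to unit L2 norm; the caller can rescale to the desired budget.
    return avg / (avg.norm() + 1e-12)
```

Because averaging is a linear operation over perturbations that were each produced independently, it does not inject features that were absent from the individual perturbations; it only suppresses their uncorrelated (noise) components.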
The discovery that adversarial perturbations from a standard model include human-identifiable (robust) features has led us to uncover two key properties of such perturbations:
- A convergence phenomenon among perturbations generated using different attack algorithms.
- An apparently stronger attack strength of the perturbation contour when compared with the background.
Furthermore, the presence of robust features in perturbations provides a straightforward explanation for several well-established properties of adversarial perturbations, including:
- Transferability across different models.
- Enhanced perceptibility during adversarial training.
- Human-AI misalignment experiments [1].
We thank the reviewers for their time and sincerely hope this discovery can be seen by other researchers in the community.
References:
[1] Adversarial Examples Are Not Bugs, They Are Features. Andrew Ilyas, et al. NeurIPS, 2019.
[2] Robustness May Be at Odds with Accuracy. Dimitris Tsipras, et al. ICLR 2019.
[3] Image Synthesis with a Single (Robust) Classifier. Shibani Santurkar, et al. NeurIPS 2019.
[4] On the Connection Between Adversarial Robustness and Saliency Map Interpretability. Christian Etmann, et al. ICML 2019.
[5] Adversarial Patch. Tom Brown, et al. arXiv 2017.
[6] Synthesizing Robust Adversarial Examples. Anish Athalye, et al. CoRR 2017.
[7] Are Perceptually-Aligned Gradients a General Property of Robust Classifiers? Simran Kaur, et al. NeurIPS 2019 Workshop.
This paper investigates human-identifiable features in adversarial perturbations. Gaussian noise is used to identify these features. In targeted attacks, these features generate features or objects of the target class. In untargeted attacks, these features hide features or objects of the original class. This phenomenon can explain some properties of adversarial perturbations.
Strengths: This paper revisits the fundamental mechanisms of adversarial perturbations, a problem of importance. This paper presents human studies that support the hypothesis that the emergence of semantic features is not coincidental. The hypothesis is validated across targeted and untargeted attacks, and includes search-based attacks.
Weaknesses: Upon reviewing the discussions between Reviewer ZsSi and the authors, I agree with some of the concerns raised by Reviewer ZsSi, particularly with respect to the connections to prior work. I have also read reference [4], which was provided by Reviewer ZsSi. It appears to me that the main difference between [4] and this paper is the order of the maximization (for finding the perturbation) and the average (expectation in [4]). However, [4] and this paper still share a number of similarities. The authors may want to consider comparing their proposed method to [4] in the experiments. Reviewer ZsSi also mentions other connections to other prior works, which the authors should consider in future revisions. In addition, I find that this paper is not well-written in terms of its mathematical notation. The meaning of Equation (1) is unclear. I would recommend using random variables rather than the symbol for the normal distribution in Equation (1). Additionally, the perturbation used in Equation (1) should be more formally defined, although it is possible to infer what it should be.
Why not a higher score
See the weaknesses above.
Why not a lower score
N/A
Reject