4.8

/10

Rejected4 位审稿人

最低3最高8标准差2.0

4.3

置信度

正确性2.3

贡献度2.3

表达2.3

ICLR 2025

Evaluating Model Robustness Against Unforeseen Adversarial Attacks

Maximilian Kaufmann,Daniel Kang,Yi Sun,Xuwang Yin,Steven Basart,Mantas Mazeika,Adam Dziedzic,Franziska Boenisch,Tom B Brown,Abhinav Kommula,Oliver Zhang,Akul Arora,Jacob Steinhardt,Dan Hendrycks

OpenReview PDF

提交: 2024-09-28更新: 2025-02-05

TL;DR

A new benchmark for evaluating the robustness of neural networks to adversaries not seen during training.

摘要

When considering real-world adversarial settings, defenders are unlikely to have access to the full range of deployment-time adversaries, and adversaries are likely to use realistic adversarial distortions that will not be limited to small $L_p$-constrained perturbations. To narrow in on this discrepancy between research and reality we introduce ImageNet-UA, a new benchmark for evaluating model robustness against a wide range of unforeseen adversaries. We make use of our benchmark to identify holes in current popular adversarial defense techniques, highlighting a rich space of techniques which can improve unforeseen robustness. We hope the greater variety and realism of ImageNet-UA will make it a useful tool for those working on real-world worst-case robustness, enabling development of more robust defenses which can generalize beyond attacks seen during training.

关键词

AI SafetyML safetyadversarial robustnessdistribution shiftunforeseen adversaries

评审与讨论

审稿意见

评分: 3置信度: 52024-10-23

This paper propose a benchmark of adversarial attacks computed on different noise patterns. These are made differentiable, and applied to images with PGD. Models are adversarially trained under a specific perturbation, and evaluated with others to test whether they have generalized across unseen attacks.

优点

Models should be robust to whatever noise it appears. I agree with the authors that model must be resilient to any perturbation of images, and testing more attacks is always useful.

Robustness augments when diversifying the noise. It is interesting to notice that models behave better when their training is enriched with the various noise patterns. However, this is somehow trivial due to the formulation of adversarial training: the better the approximation of the inner maximization is, the more robust the model it gets. Still, the authors did the right experiments, showing that this is empirically proved also in this setting.

缺点

The presented strategies are L $_p$ -bounded attacks. I disagree with the authors stating that the proposed attacks are beyond the L $_p$ norm. In particular, it can be still computed the amount of perturbation and bounded to the maximum budget. The whole optimization problem can also be written as an L $_p$ problem, and constrained to be such a problem. Maybe the projection is a bit different, because the authors want to use specific patterns. Hence, all the discussion regarding going beyond that threat model does not apply on this paper.

No details on the attack. The authors spend a lot of space to present the manipulations, and they vaguely state they are differentiable. However, I could not understand how. This content must be present in the main paper, since it is a claimed contribution. Placing them in the appendix is not the right thing to do. Hence, I strongly suggest the authors to remove figures (like 2 and 4, they are not useful to understand the efficacy of the methods) and expand on the technical details of the attack.

Debatable contribution of manipulations. I debate that the nine attacks are novel, since they are all well-known noise patterns already present in the literature. Also, JPEG compression has been extensively used a lot (also on the defensive side). I would frame that the real contribution is how the authors attach those noise patterns in a differentiable way (see previous comment). Hence, 3.3 is not really novel per se to the extent of my knowledge.

Unfair analysis. The authors show that models are not robust to new manipulations. However, the authors should compare the efficacy of attacks at the same perturbation budget (same norm) and same number of queries to the models (i.e. forward and backward). Otherwise, in absence of those, attacks can be easily biased towards being effective or not. Also, why the authors did not use AutoAttack, similarly to RobustBench? Also, the authors do not provide their implementation of PGD, which might also contain bugs.

Confusing presentation. There is a digression on L $_2$ adversarial training which is not clear to me why is discussed. Also, L $_2$ is still an L $_p$ perturbation, and the authors might want to further investigate this by studying why they obtained their results. Still, I did not understand the connection in comparing different threat models in that portion of the paper.

问题

Can the authors specify the amount of perturbation injected, so that it is comparable with PGD?
Can the authors test AutoAttack, with the same level of noise used for their perturbations?
Can the authors better clarify the novelty of their approach, by revising the content and clearly specify how they produce differentiable attacks?

评论- Author Response (1/2)

2024-11-27

Thank you for your careful analysis of our work. We hope the following response addresses your concerns.

Distinction between our attacks and pixel-space $L_p$ attacks.

I disagree with the authors stating that the proposed attacks are beyond the $L_p$ norm. In particular, it can be still computed the amount of perturbation and bounded to the maximum budget. The whole optimization problem can also be written as an $L_p$ problem, and constrained to be such a problem. Maybe the projection is a bit different, because the authors want to use specific patterns. Hence, all the discussion regarding going beyond that threat model does not apply on this paper.

We certainly agree that our attacks are $L_p$ perturbation sets in a latent space! In fact, we point this out on line 204. However, in the context of adversarial robustness the phrase “ $L_p$ attack” almost always refers to “pixel-space $L_p$ norm bounded additive perturbations”. This is commonly understood to be the meaning of “ $L_p$ attack”. Even though attacks that are not pixel-space $L_p$ attacks may still be formulated as $L_p$ perturbations in some latent space, these are typically not referred to as $L_p$ attacks. For examples of well-known papers demonstrating this, see [1, 2, 3, 4]. Thus, our discussion of going beyond the space of “ $L_p$ attacks” remains valid and in keeping with standard terminology. We have clarified this in the updated paper.

[1]: “Towards Deep Learning Models Resistant to Adversarial Attacks”. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu. ICLR 2018 [2]: “Certified Adversarial Robustness via Randomized Smoothing”. Jeremy M. Cohen, Elan Rosenfeld, J. Z. Kolter. ICML 2019 [3]: “Adversarial Training for Free!”. Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S. Davis, Gavin Taylor, Tom Goldstein. NeurIPS 2019 [4]: “Spatially Transformed Adversarial Examples”. Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, Dawn Song. ICLR 2018

Descriptions of differentiable attacks are already in the main paper.

No details on the attack. The authors spend a lot of space to present the manipulations, and they vaguely state they are differentiable. However, I could not understand how. This content must be present in the main paper, since it is a claimed contribution. Placing them in the appendix is not the right thing to do.

This information is already provided in the main paper, taking up most of page 5. Please see Section 3.3, in which we briefly describe how each of our 8 core attacks are designed in a differentiable manner. These descriptions include the most relevant details for each attack, while still being fairly brief to conserve space. For example, we explain that our Gabor attack is obtained by “optimizing the underlying sparse tensor which the Gabor kernels are applied to”. In Appendix B, we provide corresponding descriptions for the 11 additional attacks and a highly detailed description of the Wood attack (corresponding to Figure 3).

Clarifying technical novelty of attacks.

I debate that the nine attacks are novel, since they are all well-known noise patterns already present in the literature. … I would frame that the real contribution is how the authors attach those noise patterns in a differentiable way (see previous comment).

We fully agree that many of the noise patterns themselves are well-known (e.g., Gabor noise), and we are glad that you agree that making differentiable versions of these noise patterns / corruptions constitutes a real contribution. In many cases, making these noise patterns differentiable was nontrivial and required considerable manual effort. For example, our Fog attack uses a custom differentiable version of the diamond-square algorithm. We will clarify this in the updated paper. Thank you for your suggestion.

评论- Author Response (2/2)

2024-11-27

Clarifying selection of attack hyperparameters.

the authors should compare the efficacy of attacks at the same perturbation budget (same norm) and same number of queries to the models (i.e. forward and backward). Otherwise, in absence of those, attacks can be easily biased towards being effective or not. Also, why the authors did not use AutoAttack, similarly to RobustBench?

This is a great question that touches on the details of how to properly perform inter-attack comparisons. In particular, we respectfully believe this concern may be due to a misunderstanding. Please allow us to explain:

There are two notions of “attack” that are commonly used in the literature. The first notion is “attack as optimization method”. This includes optimization methods like PGD and AutoAttack, both of which can be used to optimize perturbations within the same perturbation set. The second notion is “attack as perturbation set”. This includes perturbation sets like the Elastic attack of Xiao et al. [4], adversarial patches [5], different $L_p$ perturbation sets, or indeed the 18 novel attacks that we propose.

This distinction is important, because the first notion of “attacks” are easy to compare to each other. For example, there is an unambiguous sense in which we can say AutoAttack is stronger than PGD. However, when comparing across multiple “attacks” in the second notion of the phrase (i.e., multiple perturbation sets), it becomes less meaningful to say that one is stronger than another. To see why, consider the difficulty of answering the question “Are patch attacks stronger than elastic warping attacks?” There is no clear answer to this question, because one can always make patches larger or smaller to make the attack stronger or weaker than a given warping attack. For similar reasons, it does not make sense to give our different attacks the same number of iterations or the same latent variable $L_p$ norm bounds. The latent spaces are different, so there is no sense in which we can have the “same” hyperparameters for each. It would be like trying to set the same hyperparameters for a patch attack and elastic attack.

You may well ask, “If comparisons between different perturbation sets are challenging, then what are we measuring in this paper?” Crucially, we are not measuring whether one attack is stronger than another, for the reasons stated above. Rather, we are demonstrating that these new attacks have different robustness profiles across models, indicating that new methods may be needed to address them.

To make comparisons more intuitive and efficient, we selected hyperparameters to maximize effectiveness of attacks within a reasonably small number of optimization steps and to yield roughly similar average performance within severity levels. This is why hyperparameters like optimization steps are different for our different attacks. But much like the comparison between elastic and patch attacks, it doesn’t make sense to pick the same hyperparameters for each attack, because they mean different things for each attack.

This is a fairly nuanced point, so we have added a discussion of it to the appendix of the updated paper. Thank you for bringing this to our attention.

[4]: “Spatially Transformed Adversarial Examples”. Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, Dawn Song. ICLR 2018 [5]: “Adversarial Patch”. Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, Justin Gilmer. NeurIPS 2017

Clarifying why we use PGD instead of AutoAttack.

Also, why the authors did not use AutoAttack, similarly to RobustBench?

As discussed above, we are primarily interested in developing new perturbation sets. One could apply multiple different optimization methods to these perturbation sets, including AutoAttack, random restarts, or any number of innovations from the adversarial examples literature. We use PGD for simplicity, although in Appendix M we do compare to a black-box grid search optimizer, demonstrating the value of our differentiable corruptions.

Reason for discussing $L_2$ adversarial training results.

There is a digression on $L_2$ adversarial training which is not clear to me why is discussed.

We think you may be referring to the paragraph “ $L_2$ -based adversarial training outperforms $L_\infty$ -based adversarial training” on line 395, but if not please correct us. This paragraph was added simply to discuss an interesting trend we observed where $L_2$ adversarially trained models obtained better performance on the UA2 metric than $L_\infty$ adversarially trained models, which we thought readers might be interested to see. This is a relatively short part of the results section and provides some potentially useful information, so we thought it was worth including.

If we have addressed the thrust of your concerns, we kindly ask that you consider raising your score.

评论- Author Response (additional requested information)

2024-11-28

Additional requested information on PGD implementation.

Also, the authors do not provide their implementation of PGD, which might also contain bugs.

Please see this code for our $L_\infty$ PGD implementation (anonymized): https://pastebin.com/b99EJSmi

We are quite confident the implementation is correct, as PGD is a simple attack to implement. In case it helps, we asked o1-preview whether our implementation is correct, and it said yes. Removing torch.sign on the gradient results in o1-preview changing its answer to "no", as expected. Please feel free to check the implementation yourself. If there are any other questions or concerns we can address, please let us know and we would be happy to help.

评论- Discussion period ending soon

2024-12-02

This is a gentle reminder that the discussion period is ending soon. We would be happy to answer any remaining questions or concerns, and we would be grateful if you could take our initial response into account in your final review.

审稿意见

评分: 8置信度: 42024-10-25

This paper introduces a set of 19 new non-Lp attacks for measuring unforeseen robustness. The authors introduce a score called UA2 which measures robustness across a set of 8 of these attacks for benchmarking existing models. Additionally, they provide experiments to demonstrate that robustness against these non-Lp attacks is directly correlated with Lp attack robustness or distribution shift robustness. The authors also evaluate several defenses including combining adversarial training with image augmentation (PixMix), multiattack training, and perceptual adversarial training.

优点

The authors propose many new attack types which can be used for measuring unforeseen robustness
Extensive experiments measuring the performance of the current state of research against the set of attacks proposed
Writing is clear

缺点

Overall, I thought this was a strong paper. One thing that could be incorporated for strengthening the paper could be adding some evaluations on the attacks left out of the 8 core attacks and demonstrating that the performance measured over the left out attack set is correlated with performance measured over the 8 core attack set. This would empirically demonstrate that the performance on the 8 core attacks are highly representative of the 19 attacks. Minor typos:

Table 1 caption: "disctinct"
line 318: $\varepsilon_a$ notation as opposed to $\epsilon_A$ used elsewhere

问题

In section 4.1 comparison to ImageNet-C what does "non-optimized" version of the attack mean?

评论- Author Response

2024-11-27

Thank you for your careful analysis of our work. We hope the following response addresses your concerns.

Evaluations on 11 additional attacks.

One thing that could be incorporated for strengthening the paper could be adding some evaluations on the attacks left out of the 8 core attacks and demonstrating that the performance measured over the left out attack set is correlated with performance measured over the 8 core attack set. This would empirically demonstrate that the performance on the 8 core attacks are highly representative of the 19 attacks.

We actually do include full evaluations on the 11 additional attacks (with a medium distortion level). These results are Figure 26, but are not properly referenced elsewhere in the paper. We have updated the paper to remedy this.

We have also added an evaluation of the correlation between the average accuracy on the 11 additional attacks and the average accuracy on the 8 core attacks. The Pearson correlation is 0.92, indicating that the 8 core attacks are a good representation of the full set of 19 attacks. Thank you for your suggestion.

审稿意见

评分: 3置信度: 42024-10-28

The authors propose ImageNet-UA, a benchmark for evaluating model robustness against unforeseen adversarial examples, i.e., adversarial perturbations not considered in the design phase of the model. Supposing the models' developers do not have access to worst-case adversarial examples during training, the benchmark proposes a set of non-Lp gradient-based attack methods as worst-case inputs to evaluate the model's robustness at test time. The authors provide an in-depth analysis of various adversarially trained models on different datasets from the benchmark.

优点

The authors designed and evaluated 19 non-Lp adversarial generation methods.
The authors proposed a wide analysis of the benchmark on several models and different datasets.

缺点

The authors did not compare the benchmark to other $L_p$ attack methods in the state of the art. Additionally, there is significant work that should be included in the comparison or at least mentioned in the related work
The author’s evaluation of their adversarial training method is incomplete.
The authors introduced a new metric UA2 (Unforeseen Adversarial Accuracy), but it is unclear why the new metric is necessary.

Comments

The authors did not compare the benchmark to other $L_p$ attack methods in the state of the art. Additionally, there is significant work that should be included in the comparison or at least mentioned in the related work

In section 4.1 the authors compare UA2 to other distribution-shift benchmarks, and they claim, based on the results (Table 5), that UA2 is a measure of worst-case robustness, like $L_p$ , but they did not assess the distributional shift of $L_p$ perturbed samples. Additionally, there have been recent workshops and efforts to address the real-world issues applied to ImageNet, and the authors should provide some references and possibly differentiate their work from these. One example is the ShiftHappens workshop at ICML 2022 https://shift-happens-benchmark.github.io

The author’s evaluation of their adversarial training method is incomplete.

In the Experiments section, there is a broad discussion on how UA2 robustness, based on the evaluation of unforeseen adversarial example, is distinct from $L_p$ robustness. However, the authors based this claim on the evaluation of just one $L_p$ attack method (PGD) and just on $L_\infty$ bound constraints, with a fixed number of optimization steps (only $50$ , very limited for ImageNet, and without providing evidence of convergence).

The authors introduced a new metric UA2 (Unforeseen Adversarial Accuracy), but it is unclear why the new metric is necessary.

The authors introduce the UA2 (Unforeseen Adversarial Accuracy) metric as the average Robust Accuracy achieved by the model over the non- $L_p$ methods in the benchmark. Although this metric provides an overall score relative to the benchmark, it doesn't allow for a thorough assessment of which attacks are truly effective (and thus worst case) and which are not. For example, the results with 'medium distortion,' shown in Figure 8, clearly indicate that three types of attacks (Elastic, Wood, Gabor) generally perform better than the others. Additionally, it's not clear why the authors use the average in this metric rather the worst case (so the perturbation that has the strong effect) Unforseen Adversarial, which would be an ensemble attack composed by all the attacks rather than single evaluations.

Additional comments:

In sect. 3.3, the authors select eight attacks out of the nineteen, but they don't give a justification on why they selected these. Specifically, is there a possibility that other, non-selected attacks, are super effective against specific models? A more substantial contribution would be to understand in which cases (e.g., for which preprocessing techniques or for which type of models) one attack works better than the other. As an example, a useful actionable insight would be "one can exclude attack A, B, C as the preprocessing technique would remove such perturbation". This insight gathering is lacking in the paper but it would be much more useful to enhance robustness evaluations to unforseen attacks.

In sect. 3.3, the authors suggest quantization only in the JPEG compression strategy, however it's still to be observed that, given that the benchmark is specific to the image domain, in theory all the manipulations should still produce images (thus quantized in integer values in [0, 255]). This aspect seems overlooked in the paper (at least in the text).

In section 4.2 the authors propose a combined adversarial training approach that consider $L_\infty$ adversarial training and "PixMix" augmentation. The authors' analysis of PixMix is incomplete in two ways. First, they don't explain what PixMix is or provide any reference. Second, in Table 6, they only show improvements in the UA2 metric, without evaluating how this adversarial training method affects the $L_p$ robustness.

Generally, the authors evaluate the robustness of models adversarially trained on a specific kind of $L_p$ perturbation, against attacks that optimize for a different type of $L_p$ perturbation. Indeed, the benchmark is composed of non- $L_p$ methods which optimize the adversarial sample considering both $L_\infty$ and $L_2$ constraints. When comparing with $L_p$ attack methods the authors considered only $L_\infty$ .

Minor issues:

there is a mixed use of american and british english (e.g., defenses and defences). Please, double check and remain consistent throughout the manuscript
Figure 3 could benefit from a more structured presentation. Specifically, naming the steps and providing a step-by-step overview would help understand the approach better

问题

Could you argue why certain non- $L_p$ attack methods are more effective than others?
Could you report the $L_p$ robustness of the models adversarially trained with $L_\infty$ and with $L_\infty$ + PixMix?
Could you report the results for an increased number of steps? Or please, provide evidence of convergence for the attacks.

评论- Author Response (1/3)

2024-11-27

Thank you for your careful analysis of our work. We hope the following response addresses your concerns.

Please see Table 2 for comparisons to $L_p$ attacks.

In section 4.1 the authors compare UA2 to other distribution-shift benchmarks, and they claim, based on the results (Table 5), that UA2 is a measure of worst-case robustness, like $L_p$ , but they did not assess the distributional shift of $L_p$ perturbed samples.

In Table 5 and Appendix I, we compare to prior distribution shift benchmarks, which naturally does not include $L_p$ attacks. Our comparison to $L_p$ attacks appears in Table 2. In particular, we compare to the popular PGD attack. We also include evaluations with the more recent AutoAttack method in Appendix F.4.

Comparisons to distribution shift literature.

there have been recent workshops and efforts to address the real-world issues applied to ImageNet, and the authors should provide some references and possibly differentiate their work from these. One example is the ShiftHappens workshop at ICML 2022

The distribution shift literature is indeed highly relevant to our work. In Table 5 and Appendix I, we already include comparisons to ImageNet-C and ImageNet-Sketch—two foundational datasets from this literature. The ICML 2022 ShiftHappens workshop included papers presenting other similar datasets, such as ImageNet-Cartoon and ImageNet-D. Thus, our existing comparisons already have fairly good coverage of these kinds of distribution shift datasets.

Our comparisons in Table 5 and Appendix I demonstrate that our differentiable worst-case corruptions have distinct properties from the average-case corruptions that are studied in the distribution shift literature. For example, Table 5 shows that the rankings between methods is similar for ImageNet-C and our unoptimized corruptions. But when we optimize our corruptions, accuracy drops substantially and the ranking between methods changes. This suggests that optimized, worst-case corruptions have different properties than average-case corruptions commonly studied in the distribution shift literature.

Importantly, we do not claim that our benchmark is “better” than ImageNet-C or other distribution shift benchmarks. Rather, our point is to demonstrate that they have very different properties and that worst-case corruptions may require novel methods to address. This is a desirable quality for any new benchmark. We will clarify these points in the updated paper. Thank you for your suggestion.

Clarifying $L_p$ attack selection and hyperparameters.

the authors based this claim on the evaluation of just one $L_p$ attack method (PGD) and just on $L_\infty$ bound constraints, with a fixed number of optimization steps (only $50$ , very limited for ImageNet, and without providing evidence of convergence).

Please note that we do evaluate AutoAttack in Appendix F.4, although for the specific reason of checking the unusually high robustness of DINOv2.

We agree that checking convergence of attacks is important. We did in fact check convergence when selecting hyperparameters for all our attacks. The PGD attacks in our main evaluations do converge for nearly all models. We will add loss plots to the updated paper to make this easy for readers to check for themselves. Thank you for your suggestion.

评论- Author Response (2/3)

2024-11-27

Clarifying reason for UA2 metric.

it is unclear why the new metric is necessary

Our UA2 metric is the analogue of the mean corruption error (mCE) metric in ImageNet-C. We don’t claim that the metric definition is a source of technical novelty, as it is simply the average robust accuracy across our attacks. Similarly to “mCE”, we choose the name “UA2” as a useful shorthand for the main metric that we care about. We have updated the paper to clarify that this is just the average robust accuracy. Thank you for your suggestion.

Although this metric provides an overall score relative to the benchmark, it doesn't allow for a thorough assessment of which attacks are truly effective (and thus worst case) and which are not. For example, the results with 'medium distortion,' shown in Figure 8, clearly indicate that three types of attacks (Elastic, Wood, Gabor) generally perform better than the others. Additionally, it's not clear why the authors use the average in this metric rather the worst case (so the perturbation that has the strong effect) Unforseen Adversarial, which would be an ensemble attack composed by all the attacks rather than single evaluations.

There are two notions of “attack” that are commonly used in the literature. The first notion is “attack as optimization method”. This includes optimization methods like PGD and AutoAttack, both of which can be used to optimize perturbations within the same perturbation set. The second notion is “attack as perturbation set”. This includes perturbation sets like the Elastic attack of Xiao et al. [1], adversarial patches [2], different $L_p$ perturbation sets, or indeed the 18 novel attacks that we propose.

This distinction is important, because ensemble attacks like AutoAttack only apply to the first notion of “attack” (multiple optimization methods within a single perturbation set). There is no corresponding way to do ensemble attacks for the second notion of “attack” (multiple perturbation sets). To see why, consider the difficulty of answering the question “Are patch attacks stronger than elastic warping attacks?” There is no clear answer to this question, because one can always make patches larger or smaller to make the attack stronger or weaker than a given warping attack. For similar reasons, we do not consider the ensemble attack of taking the minimum accuracy across our perturbation sets. There is simply no basis for comparison to say one attack is stronger than another. As a corollary, there is no meaningful way to do ensemble attacks.

This is a fairly nuanced point, so we have added a discussion of it to the appendix of the updated paper. Thank you for bringing this to our attention.

[1]: “Spatially Transformed Adversarial Examples”. Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, Dawn Song. ICLR 2018 [2]: “Adversarial Patch”. Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, Justin Gilmer. NeurIPS 2017

Clarifying reasons for selecting 8 core attacks.

In sect. 3.3, the authors select eight attacks out of the nineteen, but they don't give a justification on why they selected these.

Please see the start of Section 3.3 for a justification: “[we] select a core eight [attacks] for their computational efficiency, effectiveness, variety and preservation of semantics (see Appendix D for further discussion, and Appendix K for a human study on semantic preservation).” As noted, we provide further details of this selection process in Appendix D.

Specifically, is there a possibility that other, non-selected attacks, are super effective against specific models?

Yes, we do actually know that this is the case. As we mention in Appendix D, we left out some attacks, including Prison and Edge, due to their having a high correlation with PGD across models. This means that they performed well against the same models that PGD performs well against, which includes most non-robust models.

However, this does not mean those attacks are interesting to include. As we discuss in Appendix D, it is desirable to have diversity across the different attacks, such that we don’t have multiple attacks effectively measuring the same aspect of robustness. We operationalize this by designing attacks to be both visually distinct and distinct in their performance profile across a broad range of models.

评论- Author Response (3/3)

2024-11-27

Clarifying PixMix contribution.

First, they don't explain what PixMix is or provide any reference.

Thank you for pointing out the missing reference. This was a typo. We have added a reference to PixMix and a brief overview of the PixMix augmentation method introduced in prior work [6].

Second, in Table 6, they only show improvements in the UA2 metric, without evaluating how this adversarial training method affects the robustness.

Figure 10 shows full results for PGD adversarial training + PixMix. Accuracy on PGD decreases from 84.6% to 84.5%, showing that adding PixMix maintains similar PGD robustness while increasing robustness to our new attacks. We have added this comparison to Table 6 in the updated paper for ease of reference. Thank you for your suggestion.

[6]: “PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures”. Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt. CVPR 2022

Clarifying the distribution shift nature of our benchmark.

Generally, the authors evaluate the robustness of models adversarially trained on a specific kind of perturbation, against attacks that optimize for a different type of perturbation.

This is by design. As shown in Figure 1, our goal is to evaluate robustness against unseen test-time attacks.

Addressing minor issues.

there is a mixed use of american and british english (e.g., defenses and defences). Please, double check and remain consistent throughout the manuscript

Thank you for pointing this out. We will conduct a thorough pass to avoid this issue.

Figure 3 could benefit from a more structured presentation. Specifically, naming the steps and providing a step-by-step overview would help understand the approach better

We agree that Figure 3 is a bit confusing. We will add step numbers to the figure and update the caption to enable following along more easily. Thank you for your suggestion.

If we have addressed the thrust of your concerns, we kindly ask that you consider raising your score.

评论- Author Response (additional requested information)

2024-11-28

Additional requested information on quantization.

In sect. 3.3, the authors suggest quantization only in the JPEG compression strategy, however it's still to be observed that, given that the benchmark is specific to the image domain, in theory all the manipulations should still produce images (thus quantized in integer values in [0, 255]).

We use floating point images in all of our evaluations. This is standard practice in the adversarial robustness literature (e.g., this is standard in the RobustBench leaderboard). However, we agree that checking the performance of uint8 quantized images is important. We have run our core attacks on the standard ResNet-50 model 3 additional times, twice without quantization and once with quantization. The results are below.

	pgd(4/255)	JPEG	Elastic	Wood	Glitch	Kaleidoscope	Pixel	Snow	Gabor
No quantization run1	0.00%	0.00%	4.28%	6.25%	0.27%	0.00%	0.43%	0.03%	8.65%
No quantization run2	0.00%	0.00%	4.28%	6.25%	0.27%	0.00%	0.41%	0.03%	8.62%
With quantization	0.00%	0.00%	4.36%	6.56%	0.27%	0.00%	0.43%	0.03%	8.63%

As you can see in the above results, quantization to uint8 does not affect the final performance of the attacks. These results also highlight how our attacks give stable performance across multiple random seeds. We will add these results to the updated paper. Thank you for your suggestion.

评论- Response to authors

2024-11-28

I would like to thank the authors for their extensive rebuttal and clarifications.

I still have some remaining concerns about the overall "benchmark" evaluation. When proposing a benchmark, it's crucial to have a very clear evaluation protocol and standardization of the procedures. This is still not entirely convincing for me. Specifically:

The attacks used for testing have different "rules" each, it might be possible to "overfit" to the benchmark by submitting a model that is robust only to the specific perturbations (and perturbation sizes) used here;
The discussion on the UA2 metric still needs improvement, as it's still not clear what is the output of this evaluation. Worst-case evaluations are commonly used to evaluate the models, not the attacks. So from my perspective it's still relevant to take the worst rather than the average effect. It's not clear to me why the average would be helpful.
The selection of the 8 core attacks as well is not convincing, as different models might be affected more by one of the attacks that are not selected.

And finally, a minor concern:

the evaluation metric is the same as ImageNet-C, but why do we need a new name?

Additional concern, checking the other reviews and responses:

the PGD implementation presented as response to another review does not take the best results (i.e., the ones that maximize the loss), but just the last. This is not taking the worst case, as this assumes high linearity of the loss landscape).

I still maintain my score. I thank the authors for their additional experiments, that hopefully will improve this work furhter in the future and enrich it with new insights.

评论- Author Response (1/2)

2024-11-29

Thank you for your response. We believe we can address many of the concerns in your response, so we would be grateful if you could consider the following points.

Overfitting to the attacks is not allowed, as stated in our threat model.

The attacks used for testing have different "rules" each, it might be possible to "overfit" to the benchmark by submitting a model that is robust only to the specific perturbations (and perturbation sizes) used here;

In our threat model (Section 3.1), we describe how the defender does not have access to the attacks at training time. This is what we mean by the attacks being "unforeseen". All of our attacks are held-out for evaluation, which fully addresses the concerns of overfitting.

Also note that the description of attacks having rules also applies to all previously proposed attacks, including PGD. A key strength of our benchmark is that the attacks cannot be used at training time, allowing us to measure robustness to out-of-distribution adversaries.

Clarifying why averaging over perturbation sets is sensible.

The discussion on the UA2 metric still needs improvement, as it's still not clear what is the output of this evaluation. Worst-case evaluations are commonly used to evaluate the models, not the attacks. So from my perspective it's still relevant to take the worst rather than the average effect. It's not clear to me why the average would be helpful.

It may be easier to see why averaging over perturbation sets is sensible by looking at the related research problem of multi-attack robustness. In multi-attack robustness, the goal is to improve adversarial robustness against multiple different perturbation sets, oftentimes by training against multiple attacks at once. It is standard practice in this area to average over perturbation sets. For examples of papers that do this, see [1, 2]. The fundamental reason for taking the average is that different perturbation sets are not easily compared with each other. E.g., see our earlier response about how it is nontrivial to say that a patch attack is stronger than a warping attack; there is simply not a clear basis of comparison in most cases.

However, we do agree that taking the minimum accuracy across attacks (i.e., the union attack) can be informative if all the attacks are normalized in some manner. In our benchmark, we carefully tuned our attacks to ensure that all severity levels still preserve semantics of the image, which provides a basis of comparison so that different perturbation sets can still be compared. For this reason, we will add a minimum accuracy metric (union attack) to the updated paper. Thank you for your suggestion.

[1]: "Learning to Generate Noise for Multi-Attack Robustness". Divyam Madaan, Jinwoo Shin, Sung Ju Hwang. ICML 2021

[2]: "MultiRobustBench: Benchmarking Robustness Against Multiple Attacks". Sihui Dai, Saeed Mahloujifar, Chong Xiang, Vikash Sehwag, Pin-Yu Chen, Prateek Mittal. ICML 2023

The 8 core attacks are a good proxy for the full suite of 19 attacks.

The selection of the 8 core attacks as well is not convincing, as different models might be affected more by one of the attacks that are not selected.

Please note that we include extensive evaluations on the 11 additional attacks in Figure 26 of the submission. These results show that the extra attacks do not significantly change the ranking of models. The top 5 models shared between the Figure 26 and Figure 8 (the corresponding core attack evaluation) are exactly the same, with the top 2 models having the same ranking. This figure is not currently referenced in the text, so it was easy to miss. We have updated the paper to link to this figure in the main text.

Clarifying why we create a descriptive name for our metric.

the evaluation metric is the same as ImageNet-C, but why do we need a new name?

There may be a misunderstanding here. In our initial response, we mentioned that the mCE metric from ImageNet-C is an analogue to our UA2 metric. We did not mean that the metrics are the same. Rather, we meant that they are analogous in certain ways. Namely, our benchmark can be thought of as an adversarial version of ImageNet-C in which the corruptions are differentiable and can be optimized. Just as how ImageNet-C uses the name "mean corruption error" (mCE) as a descriptive name for their metric, we use the name "unforeseen adversarial accuracy" (UA2) for our metric. Coming up with descriptive names for metrics is common in benchmarking and can help readers follow along with the results.

评论- Author Response (2/2)

2024-11-29

We follow standard practice in PGD implementations.

the PGD implementation presented as response to another review does not take the best results (i.e., the ones that maximize the loss), but just the last. This is not taking the worst case, as this assumes high linearity of the loss landscape).

All major PGD implementations (foolbox, cleverhans, advertorch) use the final iterate, not the lowest-loss iterate along the optimization trajectory. These implementations are what RobustBench uses, so we follow this standard approach in our implementation. As we mentioned in our initial response, the PGD optimization trajectories do converge, so non-monotonicity isn't an issue.

foolbox implementation: https://github.com/bethgelab/foolbox/blob/2513a9a8675d7017e5266d3b0ed89124cb436ec5/foolbox/attacks/gradient_descent_base.py#L77

cleverhans implementation: https://github.com/cleverhans-lab/cleverhans/blob/574efc1d2f5c7e102c78cf0e937654e847267522/cleverhans/torch/attacks/projected_gradient_descent.py#L9

advertorch implementation: https://github.com/BorealisAI/advertorch/blob/e063e341c87d9a621ae1a8f72c3507d5ea5bd327/advertorch/attacks/iterative_projected_gradient.py#L32

If you have any remaining questions or concerns, we would be happy to answer them. If we have addressed the thrust of your concerns, we kindly ask that you consider raising your score.

评论- Discussion period ending soon

2024-12-02

This is a gentle reminder that the discussion period is ending soon. We would be happy to answer any remaining questions or concerns, and we would be grateful if you could take our most recent response into account in your final review.

审稿意见

评分: 5置信度: 42024-11-02

This paper introduces the ImageNet-UA benchmark and the UA2 metric to assess model robustness against a variety of unforeseen adversarial attacks. Experimental results indicate that traditional Lp robustness training may have limitations when addressing unknown attacks. The paper suggests employing L2 adversarial training combined with image augmentations to enhance model performance under these unexpected attack scenarios.

优点

The paper conducts comprehensive experiments using a set of 19 diverse non-Lp attacks, offering a thorough evaluation of model robustness in the face of various unforeseen adversarial scenarios.
By introducing the ImageNet-UA benchmark, the paper fills a gap in adversarial robustness research, enabling a more applicable assessment of model resilience to real-world adversarial attacks beyond traditional Lp constraints.

缺点

A considerable portion of the logical content is relegated to the appendix, which is suboptimal. Key elements of the logic should be presented within the main text. For instance, Figure 5 should either be compressed and included in the main paper or replaced with an equivalent narrative explanation.
In Figure 3, the term $m \times m \times 2$ lacks a corresponding definition, which should be clarified.
Equation 2 suggests that the adversarial attack should aim to maximize the loss function with respect to perturbations, rather than as currently presented.
The study mainly employs various image augmentation techniques to examine model performance without contributing significant advancements in model verification methods (i.e., assessing stability across augmented samples), which falls short of the standards for ICLR.

问题

Refer to Weaknesses

评论- Author Response

2024-11-27

Thank you for your careful analysis of our work. We hope the following response addresses your concerns.

Added reference to Figure 5.

Key elements of the logic should be presented within the main text. For instance, Figure 5 should either be compressed and included in the main paper or replaced with an equivalent narrative explanation.

We agree that Figure 5 would be good to reference early in the main text, since it shows our full suite of attacks. We have added a reference to Figure 5 in the introduction of the updated paper. Thank you for your suggestion.

Clarified Figure 3 caption.

In Figure 3, the term $m \times m \times 2$ lacks a corresponding definition, which should be clarified.

The $m$ value is the spatial dimension of the latent variable that we optimize in the Wood attack. This allows us to reduce the spatial frequency of the distortion by setting $m < n$ , where $n$ is the input image dimension. We have clarified this in the caption of Figure 3 in the updated paper. Thank you for your suggestion.

Fixed typos.

Equation 2 should indeed be the argmax, not argmin. We have fixed this typo in the updated paper. Thank you for noticing it.

Clarifying novelty of contributions.

The study mainly employs various image augmentation techniques to examine model performance without contributing significant advancements in model verification methods (i.e., assessing stability across augmented samples), which falls short of the standards for ICLR.

We agree that simply providing new image augmentations would not be as interesting, given that many works have already proposed image augmentations and distortions to evaluate robustness (e.g., ImageNet-C).

However, this would not be an accurate description of our work. Our work is very different from prior work on model verification, since we propose differentiable image corruptions that can be optimized to find worst-case corruptions. There have been multiple prior works accepted to ICLR proposing one new attack of this kind [1, 2], but we propose eighteen new attacks (nineteen in total, counting the elastic attack), all of which are manually curated to provide a high-quality evaluation.

If we have addressed the thrust of your concerns, we kindly ask that you consider raising your score.

[1]: “Spatially Transformed Adversarial Examples”. Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, Dawn Song. ICLR 2018

[2]: “Unrestricted Adversarial Examples via Semantic Manipulation”. Anand Bhattad, Min Jin Chong, Kaizhao Liang, Bo Li, D. A. Forsyth. ICLR 2020

评论- Discussion period ending soon

2024-12-02

AC 元评审

2024-12-19

The paper introduces ImageNet-UA, a benchmark designed to evaluate model robustness against unforeseen adversarial attacks that extend beyond traditional Lp-bounded threat models. It introduces UA2 to measure a model robustness across these diverse attacks. The benchmark includes 19 non-Lp attacks, from which 8 are highlighted as core attacks. The study demonstrates that combining adversarial training (particularly L2 adversarial training) with image augmentation techniques can enhance model robustness under unforeseen attacks.

The paper makes a contribution in building a benchmark to evaluate robustness under unseen attacks, which is a longstanding issue in adversarial robustness research. The paper conducts extensive experiments with 19 attacks, providing insights into robustness under unforeseen scenarios. However, the paper has several weaknesses.

Novelty: The attacks used in the benchmark are not entirely novel, as they are based on well-known noise patterns from existing literature.
Metric: The necessity and effectiveness of the UA2 metric are unclear. Questions arise about why it averages performance rather than focusing on the worst-case scenarios.
The selection of 8 core attacks out of 19 is not well justified.
The evaluation is limited to adversarial training models. Could you consider other generalizable defenses, e.g., diffusion-based defenses.

The paper received mixed reviews with one accept recommendation, one borderline reject recommendation, and two reject recommendations. After author-reviewer discussion and AC-reviewer discussion, the paper still has some weaknesses as listed above. Therefore, the AC considers that the paper falls short of the ICLR acceptance threshold and recommends rejection.

审稿人讨论附加意见

The reviewers raised several points about the paper.

Novelty of Attacks and Benchmark (Reviewer aFdd, 8hc1): The proposed non-Lp attacks are based on well-known noise patterns.
Justification of UA2 Metric (Reviewer 1oNW): The necessity and design of the UA2 metric (averaging instead of focusing on the worst-case scenarios) were questioned.
Writing quality (Reviewer 8hc1,aFdd ): Crucial details on the differentiability and implementation of the attacks were placed in the appendix, reducing clarity. The lack of explanation regarding PixMix and adversarial training techniques was also highlighted.

After author rebuttal, some of the concerns have been addressed. But three of the reviewers are still lean towards rejection. The AC agrees that the novelty is limited and some technical details are unclear. The paper needs further improvement to meet the standard of a top-tier conference. Therefore, the AC recommend rejection.

最终决定Reject

2025-01-22

Reject