Carefully Blending Adversarial Training and Purification Improves Adversarial Robustness
We propose a novel adversarial defence for image classifiers, merging adversarial training and purification: the internal representation of an adversarially-trained classifier is mapped to a distribution of denoised reconstructions to be classified.
Abstract
Reviews and Discussion
To better defend against adversarial attacks, the paper proposes a novel adversarial defense mechanism for image classification – CARSO – blending the paradigms of adversarial training and adversarial purification in a synergistic robustness-enhancing way.
Strengths
The paper proposes a novel defense mechanism.
The proposed method is validated on multiple datasets.
Weaknesses
- The presentation of the paper is poor.
a) In the first half of the paper, the author merely describes some background. There is a lack of analysis of existing methods, such as the shortcomings of the current methods, what problems the proposed method can solve, and why it can solve these problems.
b) Some descriptions are unclear, such as 'Upon completion of the training process, the encoder network may be discarded as it will not be used for inference.' I think 'may' should be removed here.
- The current experiments are insufficient to prove the effectiveness of the proposed method.
a) Table 2 simplifies a lot of information, which reduces clarity; for example, it only records the mean or best results of multiple methods and lacks the clean accuracy of the purification method. I suggest listing all methods according to both clean accuracy and adversarial accuracy. The existing content in Table 2 can be added as additional row information.
b) Since the paper does not give specific problems, only a general goal, which is to better defend against adversarial attacks, the experiments become relatively limited. I believe the author should re-summarize the shortcomings of existing methods and the advantages of the proposed method and conduct more experimental comparisons.
Questions
See Weaknesses.
Limitations
The authors have discussed limitations of the work.
We thank the Reviewer for his/her observations, noting however that part of the review is based on what we believe to be a mischaracterisation of our paper’s goal and contents. We will address the Reviewer’s concerns in a similar list-based format.
- Poor presentation.
a) We will start by noticing that the portion of the paper in which no novel information is added with respect to existing published literature accounts for at most ~1/4 of the lines (14 to 43, 80 to 135, 285 to 288; i.e. 90 lines out of 356 in total), and even less so in terms of page space – a figure far from the ‘first half’ mentioned by the Reviewer. Additionally, the Reviewer does not address whether, and for what specific reasons, the mentioned background plays an irrelevant role in the overall structure of the paper. This prevents us from addressing the merit of the question.
An analysis of existing methods in the fields of AT- and purification-based adversarial defences is given in Sections 1 and 2, and in part of Subsection 5.2. Specifically, some shortcomings of existing methods are addressed at lines 38 to 43, 98 and 99, 106 to 108, 113 and 114, 310 to 313. Our analysis is consistent with the goals of our paper, as outlined in the Abstract and the final paragraphs of Section 1: i.e. to propose a novel approach that achieves adversarial robustness thanks to the blending of AT and purification, and to assess its empirical robustness according to a standardised benchmark for perturbations, across some datasets. We do not believe that such a goal requires explicitly identifying pitfalls in existing approaches – which indeed serve as a basis for our method – and point-by-point solutions to them. Given that its architectural novelty is the main aspect of interest, we provide some justification of its inner workings (the ‘problems it can solve’) in Subsections 4.2, 4.3, 4.4, 4.5 and Appendices C and D. We also believe that the recorded empirical robustness, improving upon the existing state of the art, constitutes a definitive justification in spite of the (well-acknowledged) clean accuracy decrease.
b) While standing by the stance that the prescriptive meaning of the modal verb ‘may’ is acceptable in such a case, we can further clarify its meaning by operating the substitution: ‘the encoder network may be discarded’ → ‘the encoder network is discarded’. Hardly believing that this single element is responsible for a large disruption of clarity across the paper, we are prevented from commenting upon other passages, as they have never been explicitly mentioned by the Reviewer.
- Insufficient experiments.
We are unable to understand whether the Reviewer implicitly refers to the presence of other aspects of concern – besides later-mentioned points (a) and (b) – as the reason to believe our experiments to be ‘insufficient to prove the effectiveness of the proposed method’. We will comment on the points explicitly mentioned.
a) We are sorry that the Reviewer finds Table 2 a source of reduction in clarity. We are aware of the presentation choice to compare our method only to the best-performing existing models in terms of empirical robust accuracy – which we nonetheless find adequate in the light of our declared goals (see e.g. the Abstract, Section 1, and the previous point (1) of the numbered list). Additionally, we would point out that the choice of assessing different models on the basis of the worst-case scenario against different adversarial attacks is a well-established custom of the field, and a main staple in the definition of the AutoAttack benchmark. Also, no averaged result is shown in Table 2 (as the Reviewer states instead) – apart from dataset-averaged accuracies. Finally, the clean accuracies of all methods used in the comparison are shown either in Table 2 or in Appendix F (Table 15). In an effort to improve clarity in the presentation of our results, the contents of Table 15 can be moved to Table 2.
b) We believe that saying that ‘the paper does not give specific problems, only a general goal’ represents a mischaracterisation of our work and its goals. Indeed, we recall once more the main aim of our endeavour: to devise a novel technique for adversarial defence based upon the blending of adversarial training and purification – and to assess such method on a standardised benchmark (for perturbations on image classification) across some datasets. We believe – as all other Reviewers agree – that the novelty is the main aspect of interest in our proposal, together with the assessment that proves its efficacy. As such, we do not consider it a weakness that our method is not built as a response to specific shortcomings of other existing methods – apart from the inferior empirical robust accuracy that they ultimately provide in the experimental scope addressed, and their reliance on model approximation or surrogation to assess end-to-end white-box adaptive robustness. Aware that a further broadening of the experimental scope would increase its justificatory strength, we are prevented from commenting on specific aspects of the suggestion due to its non-specific nature.
Thanks for the responses.
As you can see, nearly all of my comments relate to the readability of the paper. Indeed, there are many key explanations and descriptions missing, as I pointed out in my initial review. Although I only provided a few examples for each issue, these examples are sufficient to demonstrate that the overall readability of the paper is weak.
Of course, these are my personal thoughts. If other reviewers and the AC consider this an easy-to-read paper, I fully agree with accepting the paper. Currently, I will maintain my score, and I will discuss this issue during the subsequent Reviewer-AC Discussions.
We thank the Reviewer for the answer and additional clarifications about his/her position.
We also understand and respect that the Reviewer finds readability the main weakness of our work. Believing in the role of peer review not just as a filter – but as a means for the betterment of submitted works – we tried to address the specific issues raised in the original review in accordance with their nature:
- When referring to the conceptual structuring of the paper or the lack of descriptive contextualisation (i.e.: weaknesses 1.a, 2.b), we referenced the specific passages of the paper addressing those points. We also provided a more justificatory explanation of our choices, in the light of the overall goal of our paper and the need to balance those descriptions with the introduction of the (many) novel aspects of our method.
- When referring to phrasing or technical/typographic aspects of presentation (i.e.: weaknesses 1.b, 2.a), we tried to accommodate the Reviewer’s suggestions as much as possible – recognising the improvement in clarity they provide.
We refer to our already-submitted rebuttal for the actual discussion of the points just mentioned.
We also recognise that the Reviewer may have found more issues while reading the paper than those explicitly stated. While the latter may be ‘sufficient to demonstrate that the overall readability of the paper is weak’ (at least in its initial version), such a choice may inadvertently prevent us from better addressing those clarity concerns.
Finally, we thank the reviewer for his/her clear statement on paper acceptance, and for the willingness to actively engage in reviewer-reviewer and reviewer-AC discussion.
This study proposes a novel adversarial defense method called CARSO. CARSO consists of two models: a classifier and a purifier. The classifier is (pre)trained to correctly classify possibly perturbed data. The encoder of the purifier is trained to generate a latent space from the internal representation of the classifier and the original (possibly perturbed) input. The decoder of the purifier is trained to reconstruct a sample from the latent representation and the internal representation of the classifier. The final prediction is determined by aggregating the outputs of the classifier for reconstructed data.
Detailed procedures are summarized as follows:
- The classifier is always kept frozen. The other parts, including the VAE and the small CNNs used for compression, are trained on a VAE loss consisting of a reconstruction term (a pixel-wise, channel-wise binary cross-entropy) and a KL-divergence term; a minimal sketch of such a loss is given after this list.
- The internal representation and the input are compressed by small CNNs before being fed to the encoder of the purifier.
- The classifier is pretrained according to [18] or [62].
- When training the purifier, each batch contains both clean and adversarial samples.
- The aggregation is represented by a double exponential function.
- Evaluations are conducted under attacks.
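To make the loss summarized above concrete, here is a minimal sketch of a VAE-style training objective of the kind described (pixel-wise, channel-wise binary cross-entropy reconstruction plus a KL-divergence term). The function name, tensor shapes, batch normalisation of the loss, and the `beta` weighting are illustrative assumptions, not the authors’ implementation:

```python
import torch
import torch.nn.functional as F

def purifier_vae_loss(x_clean, x_recon, mu, logvar, beta=1.0):
    """Illustrative VAE objective for the purifier (not the paper's exact code).

    x_clean : clean target image, shape (B, C, H, W), values in [0, 1]
    x_recon : decoder output for a possibly perturbed input, same shape
    mu, logvar : encoder outputs parameterising the approximate posterior
    beta : hypothetical weighting of the KL term
    """
    # Pixel-wise, channel-wise binary cross-entropy reconstruction term.
    recon = F.binary_cross_entropy(x_recon, x_clean, reduction="sum") / x_clean.size(0)
    # KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp()) / x_clean.size(0)
    return recon + beta * kl
```

Since the training batches mix clean and adversarially perturbed inputs, the decoder is presumably trained to reconstruct towards the clean image under both conditions, i.e. to denoise.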
Strengths
- The concept of blending adversarial training and purification is novel and interesting. The proposed method, CARSO, achieves robust accuracy that surpasses the SOTA adversarially trained models and purification methods, including diffusion-based models, despite its relatively simple mechanism.
- The evaluation was carefully conducted. The authors explicitly address common pitfalls in evaluating robustness. For example, they conducted end-to-end validation (full whitebox setting), addressed concerns about gradient obfuscation, and used PGD+EOT to address the stochasticity of CARSO.
- CARSO can utilize existing pretrained models, which have already achieved high robust accuracy.
- A wide variety of datasets (CIFAR-10, CIFAR-100, and TinyImageNet-200) were used for evaluation.
Weaknesses
1. In my opinion, the claim that CARSO surpasses the used adversarially trained model seems questionable. If my understanding is correct, during inference, the decoder takes class information only from the internal representation of the classifier. Thus, I believe the decoder can correctly reconstruct the sample only if the classifier, outputting the internal representation, can correctly extract class information from the original perturbed sample. Could the authors clarify this?
Note: Initially, I doubted whether some experimental or evaluation settings were appropriate. However, as far as I can tell, there are no issues. Just in case, I recommend the authors review their source code again.
2. CARSO sacrifices clean accuracy more significantly than existing SOTA methods. Additionally, to compare CARSO and the best AT/purification models in terms of clean accuracy, Table 2 should include the clean accuracy of the best AT/purification models (i.e., the contents in Table 15). The scenario or dataset columns in Table 2 might not be necessary.
3. Few ablation studies. The authors should include the case of perturbations and use internal representations from different layers. Particularly, the relationship between the layers used for extracting representation and robust accuracy is of interest.
Questions
Minor comments:
- The authors should standardize the meaning of each symbol. If I understand correctly, the same symbol represents the sample index in Figure 1 but the class index in Section 4.5, leading to low readability.
- In Figure 1, one of the symbols is not defined. I believe it is first explained in Line 270.
Limitations
The authors explicitly addressed the limitations in Section 5.3.
We thank the Reviewer for his/her constructive observations, and for the care put into writing the review. We will gladly comment upon all points raised, in a similar list-based format.
- On the superiority of CARSO w.r.t. AT classifier baselines. With respect to the claim that CARSO surpasses in empirical robust accuracy the individually-considered adversarially-trained classifier models it employs – we ultimately refer to experimental evaluation whose results are contained in Table 2 (specifically: the comparison of columns
AT/AAandC/rand-AA). Were the claim untrue – within the experimental scope considered – we would have observed a less or equalC/rand-AArobust accuracy in comparison to the one reported underAT/AA. Instead, the use of CARSO determines its marked increase, ranging from (CIFAR-10) to (TinyImageNet-200). Such observations alone would be sufficient to substantiate our claim.
A possible explanation of such results may be found in the justification to the method provided in Subsection 4.2, and in the use of a robust aggregation strategy (described in Subsection 4.5 and Appendix D), that plain AT classifiers both lack.
Specifically, adversarial attacks against any classifier target the distribution of logits contained in its last layer (whose arg-max usually constitutes the predicted class). Consequently, the Reviewer’s concern about overall robust accuracy being limited by that of the classifier would have been justified only if such a last-layer representation had been used as the whole internal representation of interest. Instead, the logits layer is not even used as part of the conditioning set of the purifier (see Table 5, Appendix E.2). The conditioning set chosen (i.e. the representations at intermediate layers of the classifier) does not even include proper class information, but just a collection of features that the decoder of the purifier learns to map to clean image reconstructions, under adversarial noise corruption against the classifier.
In such a setting, an adversary against CARSO would need to target the last-layer logits of the classifier (used to finally perform class assignment via robust aggregation) only by attacking multiple intermediate layers of that very same classifier, and through the decoder that uses them as input, in addition to the classifier itself. This may ultimately make CARSO a harder target to fool, in comparison to the classifier alone.
- On the significant clean accuracy toll. It is true, and we transparently recognise (see Subsection 5.3), that the specific version of CARSO evaluated in our work (i.e. using a VAE as generative purifier) imposes a heftier clean accuracy toll in comparison to existing methods (either AT-based or using diffusion/score-based models as generative purifiers). Such results ultimately depend upon the deliberate choice to assess the feasibility of the idea behind CARSO (i.e. blending AT and purification via representation-conditional purification and robust aggregation), and the robustness it produces, in the best-possible scenario for the attacker. In such light, the VAE-based purifier and the chosen robust aggregation strategy ensure exact end-to-end differentiability for the whole model. This guarantees that the evaluation is not dependent on approximated backward models, which can only provide a robustness upper bound and are more susceptible to gradient obfuscation.
As we mention in Subsection 5.3 and in the Conclusion, we are interested – and actively pursuing research – in different architectural choices for the purifier in CARSO and CARSO-like models, which may result in much more competitive clean accuracy, though making rigorous robustness evaluation more challenging with current tools.
We thank the Reviewer for the suggestions related to Table 2, and will definitely include in it the contents of Table 15, as a way to enhance transparency and clarity in the presentation of results.
- On ablation studies and further experiments. We are aware that the paper provides little space for ablation studies or experimental settings different from empirical adversarial robustness evaluation. The additional assessment of robustness has been excluded at this stage due to the generally more demanding challenges offered by ; we recognise however the significant added value it may contribute to our work. We are also particularly interested – and pursuing active research – in how the choice of layers to be used as conditioning set influences overall robustness. As noted in Subsection 5.3, we are planning further work in such a realm.
Answers to minor comments
We thank the reviewer for the precise remarks, that allow us to improve the clarity and legibility of the paper. In response to such observations, we have operated the following edits to our manuscript.
- In Subsection 4.5 and Appendix D (where the robust aggregation strategy is described), the class index is now referred to as , leaving index to reconstructed sample multiplicity, as shown in Figure 1.
- The symbol is now referred to in the caption of Figure 1 as VAE Loss and the reader explicitly redirected to Appendix B for a formal definition of it.
I appreciate the authors' clarification. In conclusion, I will maintain my rating of Weak Accept. Additionally, since the clarification addressed several unclear points, I am increasing my confidence score from 4 to 5. My detailed thoughts are as follows.
I believe that the CARSO proposed in this research presents a novel adversarial defense approach. Utilizing the internal representation of an adversarially robust model for conditioning a purifier is, in my opinion, a novel contribution beyond the trivial combination of adversarial training and purification. The idea of combining the two mainstream adversarial defense strategies—adversarial training and purification—and the resulting outcomes are likely to be of great interest to the community. Although there is a trade-off in clean accuracy, the achieved robust accuracy significantly surpasses the state-of-the-art, making it worthy of evaluation. Moreover, the end-to-end evaluation, investigation into gradient obfuscation, and use of PGD-EOT clearly address naturally arising questions from this approach (particularly the use of purifiers), which I found to be a solid evaluation.
As far as I am aware, the primary weaknesses of this research, as acknowledged by the authors, include the lack of certain experiments and the decrease in clean accuracy. For more details, please refer to my Weaknesses 2 and 3. However, regarding the former, I recognize that critical results demonstrating the method's effectiveness were sufficiently provided. There may also be areas for improvement in the presentation. While I did not find it difficult to understand the goal and concept of the study, I encountered some challenges in fully grasping the flow of the methodology. Revisiting the structure of Section 4 could enhance the quality of the paper.
Considering all these strengths and weaknesses, I will keep my rating.
P.S. In addition, considering the authors' rebuttal for my Weakness 1, I increased soundness from 2 to 3.
This paper integrates adversarial training and adversarial purification to enhance robustness. It specifically maps the internal representation of potentially perturbed inputs onto a distribution of tentative reconstructions. These reconstructions are then aggregated by the adversarially-trained classifier to improve overall performance.
Strengths
The idea of combining adversarial training and adversarial purification is interesting.
Weaknesses
1. The experiments are too weak. I hope the authors can refer to at least [1][2][3], which are relevant to adversarial purification, to conduct experiments from more dimensions and consider more baselines and fundamental experiments.
2. Could we just combine [1] with an adversarially-trained model to achieve similar performance?
3. Why should the classifier be adversarially trained for better accuracy?
4. Why can't we directly purify the image? Could we use an image-to-image method to purify the input image?
[1] DISCO: Adversarial Defense with Local Implicit Functions. [2] Diffusion Models for Adversarial Purification. [3] IRAD: Implicit Representation-driven Image Resampling against Adversarial Attacks.
Questions
The first two points mentioned above are my key concerns.
Limitations
The method heavily relies on training a VAE as the generative purification model.
We thank the Reviewer for his/her observations.
We would start by pointing out an inaccuracy in the summary of our paper. The tentative reconstructions of input images generated by the purifier are not aggregated by the classifier. Indeed, the classifier processes them independently from one another, outputting a distribution of logits for each. Such distributions are then aggregated by the robust aggregation strategy described in Subsection 4.5 (and justified in Appendix D), which constitutes an integral part of our method and a core contribution to its robustness.
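For concreteness, a minimal sketch of this inference-time data flow follows (possibly perturbed input → frozen classifier’s intermediate representations → conditional decoder → several tentative reconstructions → per-reconstruction outputs → aggregation). All module and parameter names are illustrative assumptions, and the plain probability averaging at the end merely stands in for the paper’s double-exponential robust aggregation of Subsection 4.5:

```python
import torch

def carso_style_predict(x, classifier, get_features, compressors, decoder,
                        latent_dim, n_samples=8):
    """Illustrative CARSO-style prediction (not the authors' exact code).

    1. Collect intermediate representations of the frozen, adversarially
       trained classifier on the possibly perturbed input x.
    2. Compress them and use the result to condition the purifier's decoder,
       drawing several stochastic reconstructions from the latent prior.
    3. Classify each reconstruction independently and aggregate the outputs.
    """
    feats = get_features(classifier, x)                            # list of intermediate activations
    cond = torch.cat([c(f).flatten(1) for c, f in zip(compressors, feats)], dim=1)

    probs = []
    for _ in range(n_samples):
        z = torch.randn(x.size(0), latent_dim, device=x.device)   # latent sample (encoder discarded at inference)
        x_rec = decoder(z, cond)                                   # representation-conditional reconstruction
        probs.append(classifier(x_rec).softmax(dim=-1))            # per-reconstruction class probabilities

    # Simple averaging here; the paper instead uses a double-exponential
    # robust aggregation (Subsection 4.5 and Appendix D).
    return torch.stack(probs, dim=0).mean(dim=0)
```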
With respect to the weaknesses identified:
- Weak experiments; additional baselines. We believe our experimental choices to be consistent with the goal of our paper: i.e., to show that it is possible to synergistically blend adversarial training and purification in a novel and meaningful way. The resulting model attains the state of the art in a hard adaptive benchmark across datasets, against the best AT- and purification-based defences. Remarkably, our evaluation relies on exact end-to-end differentiability (and not on best-effort approximation, such as BPDA), and explicitly accounts for the stochasticity of the method by using randomness-aware AutoAttack, which relies on EoT (a generic sketch of EoT gradient averaging is given after this list). Our evaluation also guards against common pitfalls of adversarial defences in general (e.g. gradient obfuscation, see Table 3) and of diffusion-based purification methods (such as the mentioned DiffPure, vulnerable to pitfalls identified in [4] and [5]).
Referring to the specific models, [2] is directly surpassed by [5] (which reconsiders the adversarial evaluation of diffusion-based models in the light of [4]), to which we directly compare. We thank the Reviewer for the suggestion of [1] and [3], concerned with transformation-based defences and using BPDA to assess adaptive white-box accuracy (reasons for which they had been excluded from direct comparison in the first place). Indeed, this allows us to show once more the effectiveness of CARSO. Even with the advantage given by the use of attacks against approximated models, we surpass their reported best robust accuracy in commonly-employed datasets. More specifically:
- Cascaded DISCO (k=5) [1] on CIFAR-10 attains a RA, compared to our ;
- IRAD [3] on CIFAR-10 attains a RA, compared to our ;
- IRAD [3] for CIFAR-100 attains a RA, compared to our .
If deemed appropriate, we will gladly add such comparisons to Tables 2 and 15, or in a dedicated appendix.
As far as the further remarks are concerned, the vague and non-specific wording prevents us from addressing specific issues related to our method.
- Combination of DISCO and AT. We are unaware of any published paper or experimental evidence that either confirms or disproves whether a hypothetical combination of DISCO and adversarial training would be effective, especially at the levels of robust accuracy attained by state-of-the-art models (including ours). We thank the Reviewer for the suggestion, which is definitely worth investigating in the future. We must admit, however, that we do not consider such a suggestion a weakness of our method, whose framing, formulation, and assessment are independent of it – and different in scope.
- On the adversarial training of the classifier. As with point (2), we hardly see the Reviewer’s question as a weakness of our method. The purifier used as part of CARSO maps internal features of the classifier to image reconstructions. As such, under adversarial noise, using a classifier whose internal features are as invariant as possible under such perturbations stabilises and enhances the robustness of the purifier. Reciprocally, using a classifier (the same one, in our case) trained under noisy corruptions to finally classify the purified reconstructions makes the overall process more robust also to non-adversarial artifacts the purification process may introduce. This, of course, comes at the cost of decreased clean-image accuracy, as we note.
- On direct image purification. As noticed in the case of points (2) and (3), once more, we do not believe the Reviewer’s question constitutes a weakness of our method. Firstly, Subsection 4.2 (lines 197 to 202) already provides a preliminary answer to such point. In detail, direct image purification (i.e. the mapping of a perturbed image to a tentatively clean one) is already the leading scheme used in legacy (e.g. [6]) as well as modern (e.g. the already mentioned [2] and [5]) adversarial purification methods. In a broader sense, [1] and [3] may also be included in such class of techniques. In the case of [6] (which uses VAEs as purification models), the approach even resulted in worse robustness compared to the classifier alone. In all remaining cases, the experimental evidence we provide shows that none of such methods is able to provide a robust accuracy better than ours, within the experimental scope considered.
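As referenced in point 1 above, attacks on stochastic defences average gradients over the defence’s randomness (Expectation over Transformation). A minimal, generic sketch of such EoT gradient estimation follows; it illustrates the principle only, and is not the randomness-aware AutoAttack implementation:

```python
import torch

def eot_input_gradient(model, x, y, loss_fn, n_draws=20):
    """Generic EoT gradient: average the input gradient over several
    stochastic forward passes, so that the attack targets the expected
    behaviour of a randomised defence rather than a single random draw."""
    x = x.clone().detach().requires_grad_(True)
    grad = torch.zeros_like(x)
    for _ in range(n_draws):
        loss = loss_fn(model(x), y)              # each call resamples the defence's randomness
        grad += torch.autograd.grad(loss, x)[0]
    return grad / n_draws
```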
Finally, we want to address the heavy reliance on VAE training of our method. As noted in Subsection 4.1 (lines 154 to 156), the purifier being a VAE is an inessential part of CARSO as a general method: “any model capable of stochastic conditional data generation at inference time” would suffice. In addition, the use of a VAE ensures that the attacker has access to exact end-to-end model gradients for evaluation, increasing the strength of our experimental setup.
References
[1], [2], [3]: as mentioned by the Reviewer.
[4] Lee, Kim: ‘Robust Evaluation of Diffusion-Based Adversarial Purification’, 2024.
[5] Lin et al.: ‘Robust Diffusion Models for Adversarial Purification’, 2024.
[6] Gu, Rigazio: ‘Towards Deep Neural Network Architectures Robust to Adversarial Examples’, 2015.
Inaccuracy in summary.
I didn't mean that these reconstructions are directly aggregated by the classifier at the image level, as it's clear that the classifier cannot achieve this. I also acknowledge that the robust aggregation strategy is intriguing.
Weak experiments; additional baselines
Apologies for the lack of clarity in my previous statement. By 'weak experiments,' I don't just mean that more baselines should be compared; I'm also suggesting that more models should be used in addition to WideResNet-28-10. Furthermore, the impact of 'adversarially-balanced batches' and other technical details hasn't been thoroughly explored. Thus, the performance comparison with the current best AT-trained model doesn't seem entirely fair.
Furthermore, I hope the experiments can provide more insight into the positioning of the proposed method. For instance, the advantage of adversarial purification is that it avoids changes to the original model and can be adaptively deployed across different models. On the other hand, adversarial training can reduce extra time consumption during inference. I would like to see the pros and cons of this method clearly outlined in your paper. Specifically, I expect to understand what benefits are gained from adding purification to AT and what trade-offs are made when integrating purification, rather than simply demonstrating the method's potential effectiveness. It's important to understand under which scenarios it is most effective.
Regarding the goal of your paper—to show that it is possible to synergistically blend adversarial training and purification in a novel and meaningful way—I believe that simply demonstrating this possibility is not sufficient for acceptance in this venue. Compared to previous work, you need to showcase advantages across a variety of scenarios.
Combination of DISCO and AT.
My intention in questioning this is to understand why a straightforward combination of existing methods like DISCO and AT wouldn't work just as effectively, given that it seems simpler. I'm curious about the motivation behind your specific blending approach.
- Specific blending approach. Relevant reasons against the simple juxtaposition of a VAE-based purifier and a classifier are contained in [2], where such an arrangement results in decreased robustness. Mitigation against such a pitfall is provided by the representation-conditional purification we devise, and the stochastic data generation offered by the VAE is turned into a robustness enhancement by the aggregation strategy we propose. Without performing an actual experiment – which is outside the scope of our paper in its current form – it would be impossible to determine whether DISCO+AT (or similar) methods are equally viable from an empirical viewpoint, nor whether they would be susceptible to known (or novel!) failure modes. For sure, the structure of DISCO prevents exact end-to-end algorithmic differentiability, and thus forces reliance on BPDA for proper attacks. As such, it offers a less demanding evaluation scenario, and an unavoidable robustness overestimation w.r.t. CARSO.
References
[1] Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks, 2018.
[2] Gu & Rigazio, Towards Deep Neural Network Architectures Robust to Adversarial Examples, 2015.
Thank you for your detailed and patient response; I appreciate your efforts. I acknowledge that this is an interesting paper, and I want to emphasize my opinions on a few points:
- More experiments are needed across a broader range of models, including an ablation study, to demonstrate the effectiveness of your method in a well-established manner. While I haven't explicitly mentioned using higher-resolution datasets, I believe that using ImageNet might be too inefficient for your approach. As I have repeatedly emphasized, additional experiments are crucial. I agree that the experiments support the goal of showing "that it is possible to synergistically blend adversarial training and purification in a novel and meaningful way." However, merely demonstrating this possibility is not sufficient. Stating, "we show that not only our approach is viable, but we also do so in a deliberately hard scenario," is just the most basic and necessary experiment to support your goal.
- The presentation of this paper needs significant improvement, particularly in terms of both the expression used and the quality of the figures.
We thank the Reviewer for the clarifications about his/her review. We will address the issues further specified in a list-based format.
- Experiments with classifiers other than WideResNet-28-10. With respect to the classifier model – to be used as part of the CARSO architecture – we tried to strike a balance between choosing a reasonably-performant pretrained AT model (representative of modern AT-based approaches) and keeping model size under control. Indeed, as larger adversarially-trained models practically always perform better in terms of clean and adversarial accuracy (keeping the training protocol fixed), the use of smaller classifiers within CARSO would increase the strength of experimental results. This is indeed the case, as we achieve better-than-SotA robust accuracy across our whole experimental scope.
To support the significance of such choice in the context of AT, the following statistics can be gathered from the RobustBench entries for robustness (commit 776bc95bb4167827fb102a32ac5aea62e46cfaab):
- CIFAR-10: are WideResNet-28-10s; are deeper and wider (Wide)ResNets; are deeper (but narrower) ResNets, with an overall larger number of parameters; are transformer-based models with a larger number of parameters; are (PreAct)ResNet-18s.
- CIFAR-100: are WRN-28-10s; are deeper and wider (W)RNs; are transformer-based models with a larger number of parameters; are (PA)RN-18s.
- Impact of adversarially-balanced batches. Though the impact of adversarially-balanced batches (ABBs) has not been thoroughly explored, a heuristic justification for their use within the training of CARSO is provided in Appendix C. With respect to the use of ABBs in the adversarial training of classification models that use AT as the only technique for robustness enhancement, we refer to [1], whose crucial aspects w.r.t. AT are reported in Appendix A (a generic sketch of the PGD inner maximisation is given after this list). The requirement of worst-case perturbations in the inner optimisation step theoretically discourages the use of non-worst-case examples (as would be the case of FGSM-generated examples) or non-entirely-perturbed batches. As such, usual PGD-based adversarial training – and derived techniques – remains the theoretically-recommended way to achieve robustness in the end-to-end training from scratch of classifier models. Since we train only the purifier part of CARSO with ABBs, such considerations do not apply in our case.
- Fairness of comparison with the best AT method. Given the analysis previously provided, and the inclusion of a direct comparison between CARSO-based models and their classifier models alone, we believe the additional comparison of CARSO-based models with the currently best-performing AT-based techniques to be fair within the experimental scope of interest. Especially so, given that the best-performing of such models have a larger number of trainable parameters – and an even larger one in the case of purification-based defences – w.r.t. CARSO.
- Positioning of CARSO and related experiments. In the design of experiments and their presentation within our work, we focused mainly on the specific measurement of empirical robust accuracy as the means of comparison with other existing approaches. While it is true that those particular experiments do not provide a clear-cut pros/cons analysis, the paper outlines some issues with existing AT- and purification-based approaches (Section 1, 2, and Subsection 5.2), and the differences between them and CARSO (Subsections 4.2, 4.3, 4.4, 4.5 and Appendices C and D).
- Goal of our work. We believe that the most precise description of the goals of our work is contained in the Abstract, Section 1 (the Introduction) and Section 6 (the Conclusion) when considered altogether. Specifically, the direct quote from our rebuttal, i.e. ‘to show that it is possible to synergistically blend adversarial training and purification in a novel and meaningful way’ is indeed one of our goals (as stated e.g. in the penultimate paragraph of the Introduction) – yet, it is hardly the only one our work achieves. Indeed, we show that not only our approach is viable, but we also do so in a deliberately hard scenario: in terms of norm-bound, attack choice, and requirements of end-to-end differentiability (which in turn allow for approximation-free assessment). Yet, we are able to attain the state-of-the-art in one of the most stringent robustness benchmarks available (randomness-aware AutoAttack) – against any kind of existing models, even developed well outside the compliance to our requirements.
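As referenced in the point on adversarially-balanced batches above, the inner optimisation step of standard adversarial training [1] seeks worst-case perturbations, typically via projected gradient descent. A minimal, generic L-infinity PGD sketch follows; the step size, budget, and iteration count are illustrative defaults rather than the settings used in the paper:

```python
import torch

def pgd_linf(model, x, y, loss_fn, eps=8/255, alpha=2/255, steps=10):
    """Generic L-infinity PGD inner maximisation: ascend the loss within the
    eps-ball around x, projecting back onto the ball (and [0, 1]) each step."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()                       # ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)     # project onto the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)                             # keep a valid image
    return x_adv.detach()
```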
We thank the Reviewer for his/her prompt response, the willingness to engage in further discussion, and the interest in our paper.
As for the specific contents of the Reviewer’s latest comment, we will address them in a similar list-based format, as usual.
- We will provide split answers in the sub-list that follows, according to the specific conceptual issues raised.
- Broadening of the experimental scope. We agree that any broadening of the experimental scope would constitute an improvement to our paper, as it would be for any work of science. As far as a broadening in model variety is concerned – as the Reviewer mentions – we will refer to our previous comment (part 1). As we have already shown – using the RobustBench accepted submissions as a representative leaderboard for adversarial training – our specific model choice for the classifier (i.e. a WideResNet-28-10), which is used internally to CARSO, is shared by of CIFAR-10 and CIFAR-100 overall entries in terms of architecture (ResNet or derived), and directly comparable with similar or larger models to of CIFAR-10 and CIFAR-100 entries in terms of both architecture and model size lower bound. Yet, against the whole set of models (including those not directly comparable to ours, due to architectural differences), we manage to obtain superior performance in terms of empirical robust accuracy.
- Ablation study. While we did not present it in the paper under the term ablation study, we conduct one crucial such experiment as a way to assess – as the Reviewer says – the effectiveness of our method and, to a lesser extent, the trade-offs of our approach. Indeed, in Table 2, we directly compare the accuracy of existing AT-trained models developed for the goal of adversarial robustness, with CARSO models using the very same AT-trained classifiers (up to weight values). We do so in terms of both clean (columns AT/Cl vs C/Cl) and robust (columns AT/AA vs C/rand-AA) accuracy, across three different datasets. As such, the comparison shows the direct effect of ablating away the entire additional structure we propose, whose results we lengthily commented upon: in brief, a marked increase in robust accuracy accompanied by a decrease in clean accuracy.
- Higher resolution datasets. Acknowledging that this point is entirely novel w.r.t. the previous review and post-rebuttal comment of the Reviewer, we agree – as we already said – that any broadening of the experimental scope would be beneficial to our work. We also believe, however, that – since we introduce our method as an entirely original one – the existing amount of evidence we provide about its effectiveness cannot be simply dismissed on the basis of dataset resolution being pixels.
- On the sufficiency of the goal of our paper. As far as the later statements by the Reviewer are concerned – as we also already said – we agree that merely showing that our approach is viable – and even doing so in a deliberately hard scenario – is not entirely sufficient. However, for some reason, the Reviewer fails to acknowledge the next part of the quoted sentence: with our method, we are able to significantly improve upon the empirical adversarial robustness of any existing model pursuing the same goal, which has been tested according to AutoAttack and whose results have been made public by its Authors. While we still believe that further goals are yet to be achieved by our paper and by our models, we honestly do not consider those results as just the ‘most basic and necessary to support [our] goal’.
- On the improvement of the ‘expressions used’ and the ‘quality of figures’. With all due respect, we are quite surprised by the Reviewer’s observations in this regard. Not because we do not believe our paper can be affected by those issues, but because the Reviewer voices these entirely new concerns so late in the review and discussion period. Given the very specific and technical nature of problems such as the choice of expressions, or the quality of pictographic content, knowing those specific terms and/or the aspects of poor quality in figures (of which there is only one, i.e. Figure 1) further in advance would have definitely allowed for pin-point interventions in the paper before the end of such discussion phase. Especially so in the case of pictures, which we could have submitted – according to the rules of the Conference – before the end of the rebuttal period.
We lack the specifics required to further comment upon the issues identified by the Reviewer.
We thank all reviewers for their time and useful remarks.
We would like to use this space to clarify once more, in an explicit fashion, the goals of our work. As stated in the Abstract, Section 1 (the Introduction) and Section 6 (the Conclusion), our first and foremost aim was that of introducing a novel approach to obtain adversarially-robust classification models in the context of deep learning. Such method (CARSO) is based upon the non-trivial architectural blending (i.e. different from simple model juxtaposition or chaining) of adversarial training and adversarial purification, together with a specific robust aggregation strategy of the multiple purified inputs whose classification finally constitutes the robust prediction of interest. We believe such aspects of novelty to be the most interesting element of our proposal.
Nonetheless, an empirical assessment of the method proposed – in a specific realisation (notably: using a VAE as the generative purifier) – is carried out in the norm-bound scenario, using images as input and the standardised AutoAttack routine as the adversary of choice. A comparison – in the very same setting – with the adversarially-trained classifiers used as part of CARSO, and with the overall best-performing AT-based and purification-based methods (in terms of adversarial robustness, and to the best of our knowledge) is also provided – and it shows the superior robust accuracy of CARSO in all scenarios considered.
This comes at the cost of decreased clean classification accuracy, as we transparently recognise. Such decrease, however, cannot be considered separately from the choice to use a VAE as the generative purifier of the specific CARSO model we employed. In turn, such choice was deliberate and determined by two self-imposed requirements: to assess our method in the worst-case scenario for the defender and to avoid gradient approximation in the process of attack. In such light, a VAE purifier offers exact end-to-end differentiability for the resulting model (all BPDA-based attacks do not) and a more than manageable computational load for the attacker (thus preventing the need of surrogate-based attacks, as it is the case for most diffusion-based purifiers). In a similar spirit, we explicitly performed gradient obfuscation diagnostics, to ensure the most rigorous robustness testing.
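As an illustration of the kind of diagnostic mentioned here, one widely used sanity check verifies that robust accuracy decreases monotonically with the attack budget and collapses under an (effectively) unbounded attack. The sketch below is generic and assumes a simple `attack_fn(model, x, y, eps)` interface; it is not the authors’ actual diagnostic suite:

```python
def obfuscation_sanity_check(model, attack_fn, loader,
                             eps_grid=(4/255, 8/255, 16/255, 1.0)):
    """Generic gradient-obfuscation diagnostic: accuracy under attack should
    fall monotonically as eps grows, reaching ~0% for an unbounded budget.
    Violations of either trend hint at masked or obfuscated gradients."""
    accuracies = []
    for eps in eps_grid:
        correct, total = 0, 0
        for x, y in loader:
            x_adv = attack_fn(model, x, y, eps)                     # assumed attack interface
            correct += (model(x_adv).argmax(dim=-1) == y).sum().item()
            total += y.numel()
        accuracies.append(correct / total)
    monotone = all(a >= b for a, b in zip(accuracies, accuracies[1:]))
    return accuracies, monotone
```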
Finally, it is not our intention to portray CARSO as a final and definitive method. Instead, we are much interested and actively pursuing research in the development and testing of CARSO and CARSO-like models whose purifier is not a VAE, their trade-offs, and the challenges they pose for a tight-bound robustness evaluation. Additionally, we believe further insight into the role of specific classifier layers to be used as internal representation in the structure of CARSO to be worth pursuing.
Changelog of the Manuscript
In the subsection that follows, we summarise all minor changes to the manuscript prompted by the Reviewers’ comments. We hope in this way to increase presentation clarity and reduce possible ambiguity.
- At lines 181 and 225, the following substitution is operated: ‘may be discarded’ → ‘is discarded’.
- The contents of Table 15 (Appendix F) are moved to and merged with Table 2.
- In Subsection 4.5 and Appendix D, the class index (was: ) is now referred to as , leaving index to reconstructed sample multiplicity, as already shown in Figure 1.
- The symbol is now referred to in the caption of Figure 1 as ‘VAE Loss’, with explicit reference to Appendix B for a formal definition of it.
- Potentially, a comparison with best-performing transformation-based defences may be added to Table 2 or to a dedicated appendix. In particular, as prompted by Reviewer WKot, we refer to the methods called DISCO [1] and IRAD [2].
References
[1] Ho & Vasconcelos, ‘DISCO: Adversarial Defence with Local Implicit Functions’, 2022
[2] Cao et al, ‘IRAD: Implicit Representation-driven Image Resampling against Adversarial Attacks’, 2023
Three reviewers diverge in their evaluations of this submission, with scores ranging from weak acceptance to rejection (ZCGH: 6, 7CCs: 4, WKot: 3). The authors provide a rebuttal, including a global comment and review-wise comments, and they participate actively in the discussion. Each reviewer replies and confirms their score. During the AC-reviewer discussion phase two reviewers maintain their position on rejection, while one reviewer argues for weak acceptance but understands the work may need one more round of revision and review for acceptance. The AC acknowledges the confidential author comments made during this phase and has considered them accordingly.
Due to the divergent scores, the AC has taken particular care in examining the submission itself and balancing all of the points made by the reviewers and authors. The AC sides with rejection.
To summarize the entirety of the discussion: while the proposed defense is in part novel w.r.t. existing adversarial training or input purification methods, this work has not sufficiently examined its performance across a diverse and challenging enough set of experiments to prove itself against other combinations of adversarial training and input purification (such as AT + DISCO per WKot). Furthermore, the clarity of the work may not be sufficient, even following a revision, for it to be broadly appreciated by the community (which was emphasized by 7CCs). As a potential counterpoint, ZCGH replied to the authors that "I recognize that critical results demonstrating the method's effectiveness were sufficiently provided", and the AC seconds the case made that ImageNet results are not strictly required. Nevertheless, when the arguments for and against are accounted for, the AC does not find grounds to overturn the two-out-of-three negative reviews. The AC also finds that the focus on a (conditional) VAE is not a weakness, as it is simply an instantiation of the proposed approach. However, the issues of clarity and comparison are sufficient to require another round of review, although the rebuttal experiments provided by the authors are appreciated. In total, the reservations from reviewers on clarity and the scope of models covered, coupled with the sacrifice of clean accuracy vs. the gain in robust accuracy, indicate that more work is needed.
The AC encourages the authors to incorporate the feedback from reviews into a resubmission to ICLR or CVPR given the interest in better merging of train-time and test-time robustness interventions and the robust aggregation due to sampling in this work.