Boosting Adversarial Robustness with CLAT: Criticality Leveraged Adversarial Training
CLAT mitigates adversarial overfitting by selectively fine-tuning robustness-critical layers and achieves SOTA performance, outperforming all baselines with over 2% improvements in both clean accuracy and adversarial robustness.
Abstract
Reviews and Discussion
The paper presents CLAT, a layer-aware adversarial training algorithm designed to mitigate adversarial overfitting by identifying and fine-tuning layers that learn non-robust features. The approach leverages layer criticality, a metric that quantifies a layer’s functional importance to subsequent features, to pinpoint critical layers for targeted fine-tuning. Experimental results demonstrate that CLAT enhances the adversarial robustness of existing methods, suggesting its effectiveness in improving model resilience against adversarial attacks.
Questions for Authors
Can you train the model from scratch using only your identified layers, with traditional robust training methods (TRADES, AT)? Will the results be better than fine-tuning all layers?
Claims and Evidence
The paper proposes a way to identify important layers for adversarial robustness. However, the actual method includes a first stage of "pretraining" using prior methods, presumably on all layers. It is unclear to me:
- If these layers are more critical, why not fine-tune only them from the beginning?
- Why use a different objective in Stage 2, where the criticality is minimized instead of the adversarial loss?
I think that the paper can benefit from more experiments to support the claim that the identified layers are indeed important.
Methods and Evaluation Criteria
- The definition of layer criticality is not clear to me. The paper would benefit if the authors grounded this concept in prior literature or backed it up with theoretical results.
Theoretical Claims
No theoretical content.
Experimental Design and Analysis
The evaluation focuses on existing adversarial training methods + CLAT. However, since CLAT is mainly for identifying important layers for training, it would make more sense to use these layers for adversarial training as well.
Supplementary Material
None
Relation to Prior Literature
The paper takes an architecture-centric approach by identifying critical layers within the network, which is orthogonal to most adversarial training methods that primarily focus on designing new loss functions.
Missing Essential References
None
Other Strengths and Weaknesses
Strengths
- The baselines are comprehensive, including AWP and SWA which also aim to mitigate robust overfitting.
- This method seems to be compatible with most robust training methods.
Other Comments or Suggestions
None
Thank you for your thoughtful questions and suggestions. We address each point below.
Finetuning from beginning / CLAT for training: CLAT can indeed be applied from the beginning, without any prior training, as well as after standard adversarial training. This directly addresses your suggestion of using the identified layers during the main adversarial training process.
As shown in Figure 1 (see the “CLAT for 100 epochs” curve), training only the identified critical layers from scratch eventually surpasses baseline adversarial training methods—demonstrating that CLAT is not limited to post hoc fine-tuning. While convergence is slower, final performance is higher than PGD-AT, reinforcing the utility of these layers during training itself. Lines 315–325 in the main text and Figure 5 in the appendix further support this observation with full training trajectories.
Using TRADES instead of PGD-AT mirrors these trends, with CLAT consistently improving performance when applied from scratch or as a fine-tuning step (see finetuning results in Table 1 and Table 2). In summary, CLAT is not merely an add-on to existing methods, but a flexible and general mechanism for improving adversarial robustness, whether applied during or after training.
Different objective: Continuing to optimize the standard adversarial loss after the model has already been trained often leads to degradation in generalization on clean data—a well-documented issue in adversarial training. To address this, we introduce a joint objective in Stage 2 that combines cross-entropy loss with a criticality term (Equation 6). This encourages the model to reduce reliance on non-robust features within the most sensitive layers, while still preserving performance on clean inputs. This insight and formulation are discussed in lines 158–174 of the paper.
The balance between the two terms is controlled by the hyperparameter λ, which we find to be stable across datasets and architectures. In practice, λ enables a smooth trade-off between robustness-enhancing updates and accuracy-preserving behavior during fine-tuning.
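Schematically (omitting the details given in Equation 6 of the paper), the Stage-2 objective takes the form of a clean cross-entropy term plus a λ-weighted criticality penalty over the selected layers; the notation below is a hedged reconstruction of the description above, not the exact formulation:

$$
\mathcal{L}_{\text{CLAT}}(\theta) \;=\; \mathcal{L}_{\text{CE}}\big(f_\theta(x),\, y\big) \;+\; \lambda \sum_{l \in \mathcal{C}} \text{crit}_l\big(\theta;\, x,\, x^{\text{adv}}\big),
$$

where $\mathcal{C}$ is the set of identified critical layers and $\text{crit}_l$ measures layer $l$'s sensitivity to the adversarial perturbation. Larger λ pushes harder on suppressing non-robust features in those layers, while smaller λ favors clean accuracy.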
More experiments to show identified layers are important: We believe our current experiments already provide strong support for the importance of the identified layers and reviewers eBwd and 9V5S both agree on the comprehensiveness of our evaluation. We conduct multiple ablations to highlight this, including: (1) comparing performance when fine-tuning critical versus random layers, (2) analyzing the effect of selecting layers with the lowest versus highest CIDX values, and (3) evaluating the consistency of selected critical layers across batch sizes and datasets for a fixed architecture—amongst several other results in the main paper and appendix.
Additionally, we provide evidence (see response on “consistency” to Reviewer 9V5S) that the identified critical layers remain consistent across different adversarial attacks used to compute them. We are happy to incorporate this into the appendix for completeness. Together, these experiments consistently demonstrate that the identified layers are not only meaningful, but central to CLAT’s improvements in both clean and adversarial accuracy.
Critical layer definition: Layer criticality refers to the extent to which individual layers contribute to adversarial vulnerability. We define this empirically by measuring how much each layer’s output changes when the input is perturbed adversarially—assigning higher scores to layers that exhibit larger activation shifts (Equation 5).
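As a rough illustration (not our exact implementation), a forward-pass-only scoring of layer criticality might look like the sketch below; the relative-shift formula and the function name `select_critical_layers` are simplifications for exposition, and the paper's criticality index (Equation 5) may normalize and aggregate differently.

```python
import torch

@torch.no_grad()  # layer selection uses forward passes only; no gradients needed
def select_critical_layers(model, layers, x_clean, x_adv, k=5):
    """Rank layers by how much their activations shift under adversarial input
    and return the indices of the k most-shifted ("critical") layers.
    Illustrative sketch only; the paper's exact index may differ."""
    acts = {}
    hooks = [m.register_forward_hook(
                 lambda mod, inp, out, i=i: acts.setdefault(i, []).append(out.detach()))
             for i, m in enumerate(layers)]
    model(x_clean)   # first pass records clean activations for each hooked layer
    model(x_adv)     # second pass records adversarial activations
    for h in hooks:
        h.remove()

    scores = {}
    for i, (clean, adv) in acts.items():
        diff = (adv - clean).flatten(1).norm(dim=1)          # per-sample activation shift
        base = clean.flatten(1).norm(dim=1).clamp_min(1e-8)  # per-sample clean scale
        scores[i] = (diff / base).mean().item()              # average relative shift
    return sorted(scores, key=scores.get, reverse=True)[:k]
```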
While the term “criticality” is new in this context, the idea relates to prior work on layerwise sensitivity, pruning, and fine-tuning for robustness [1]. However, such methods typically focus on redundancy reduction or apply global training updates. In contrast, CLAT introduces a novel, lightweight, and scalable mechanism to identify and leverage structurally important layers for improving adversarial robustness.
We present theoretical motivation and justification for this formulation in the Methods section and Appendix F, and Reviewer 8QH8 specifically noted the thoroughness of this explanation. We agree that a deeper theoretical understanding of certain phenomena—such as why specific layers consistently emerge as more critical—would be valuable, and we view this as an exciting direction for future work.
Nonetheless, our empirical results—including comparisons with random and non-critical layers, consistency across datasets, attacks, and batch sizes, and consistent performance gains—provide strong support for the validity and utility of our criticality measure.
The paper aims to make adversarial training more efficient by selectively training only the most critical layers based on a criticality factor. This factor is determined using the local Lipschitz constant, calculated as the average difference in a layer's features with and without adversarial perturbations added to the input. The identified "critical" layers are then trained by incorporating a regularization term into the objective function. Results on multiple datasets demonstrate performance improvements.
Questions for Authors
No questions.
Claims and Evidence
The paper makes two major claims: (1) layer-wise adversarial training can be more efficient, and (2) it can improve both robustness and clean accuracy. These claims are sufficiently supported by experimental evidence. However, they are not adequately compared with existing works that also utilize layer-wise training.
Methods and Evaluation Criteria
The benchmarks used for evaluation are reasonable for the method. However, the novelty and originality of the approach are limited given prior work.
Theoretical Claims
No theoretical claims are made.
Experimental Design and Analysis
I have examined the experimental design, including the robustness of CLAT against various adversarial attacks across different datasets and models, and found it to be reasonable. However, the paper lacks a strong analysis. For instance, each model has different critical layers, and it would be interesting to see a deeper analysis of this phenomenon.
Supplementary Material
I have skimmed through the supplementary materials and examined Section D.4 more closely.
Relation to Prior Literature
This paper aims to reduce the cost of adversarial robustness by training selective layers. A significant body of work has explored this topic from different angles.
Missing Essential References
There is a significant body of work on adversarial attacks and robustness at the layer and architectural level, dating back to 2017. However, these papers are not properly discussed in this paper. A few example papers are given below.
Regularizing Deep Networks Using Efficient Layerwise Adversarial Training (https://arxiv.org/pdf/1705.07819)
Free Adversarial Training with Layerwise Heuristic Learning
Layer-wise Adversarial Defense: An ODE Perspective (https://openreview.net/forum?id=Ef1nNHQHZ20)
Intriguing properties of adversarial training at scale (https://arxiv.org/abs/1906.03787)
Adversarial Attacks and Batch Normalization: A Batch Statistics Perspective (https://ieeexplore.ieee.org/abstract/document/10056932)
Smooth Adversarial Training (https://arxiv.org/abs/2006.14536)
Reliably fast adversarial training via latent adversarial perturbation (https://openaccess.thecvf.com/content/ICCV2021/html/Park_Reliably_Fast_Adversarial_Training_via_Latent_Adversarial_Perturbation_ICCV_2021_paper.html)
Other Strengths and Weaknesses
The writing needs improvement. Additionally, it would be useful to relate the work to previous papers that are closely aligned with the proposed approach.
Other Comments or Suggestions
I would advise the authors to avoid using overly complex phrases when describing their method, such as "a paradigm shift" (L45-46).
Thank you for your thoughtful feedback and for recognizing the experimental rigor of our work. Below, we respond to each of your concerns and questions.
Novelty: Please see the “Novelty” section in our response to Reviewer 8QH8. For a concrete comparison, consider RiFT [1], a fine-tuning method that also updates a subset of layers. While effective, RiFT identifies redundant layers using weight perturbations, whereas CLAT uses input perturbations to identify layers most responsible for adversarial vulnerability. CLAT consistently outperforms RiFT and other baselines across datasets and architectures (see Tables 1 and 2).
If there are other methods the reviewer believes are directly comparable, we would be happy to clarify distinctions.
Further analysis: Thank you for highlighting this point. We analyze layer selection across models, datasets, and training settings—examining the number of layers selected and when fine-tuning should begin. We also include multiple ablation studies verifying the impact of selected layers. Our response to Reviewer 9V5S under “Cidx consistency” shows that critical indices remain stable regardless of perturbation type.
We agree that deeper insight into why certain layers emerge as more important—perhaps due to architectural roles or training dynamics—would be valuable. As our focus is on practical improvements to robustness, we limit scope accordingly.
Additional baselines: Our submission prioritized SOTA methods, strong threat models (e.g., PGD, AutoAttack), and widely adopted training protocols. Below, we compare against omitted fine-tuning methods using their best-performing models and original settings. As shown, CLAT outperforms these methods—even under stronger attacks.
Note: FGSM is substantially weaker than PGD due to its single-step nature.
Comparison with Additional Baselines
| Method | Model | Attack | Adv. Accuracy |
|---|---|---|---|
| [3] | VGG-19 | FGSM, ε = 0.1 | 68.37 |
| [3] + CLAT | VGG-19 | FGSM, ε = 0.1 | 78.55 |
| [5] | ResNet-18 | iFGSM, ε = 8/255 | 46.29 |
| [5] + CLAT | ResNet-18 | iFGSM, ε = 8/255 | 59.95 |
| [6] | WRN28-10 | PGD-50-10, ε = 8/255 | 47.06 |
| [6] + CLAT | WRN28-10 | PGD-50-10, ε = 8/255 | 58.62 |
| [8] | ResNet-20 | PGD-20, ε = 8/255 | 51.07 |
| [8] + CLAT | ResNet-20 | PGD-20, ε = 8/255 | 54.20 |
| [9] | ResNet-50 | PGD-10, ε = 8/255 | 43.70 |
| [9] + CLAT | ResNet-50 | PGD-10, ε = 8/255 | 53.67 |
Clarifications:
- [3] Uses FGSM and perturbs all layers, with high overhead and limited robustness.
- [4] Superseded by stronger methods like Fast is Better than Free [10], which we include.
- [5] Underperforms under PGD.
- [6] Perturbs three fixed layers; does not adapt to model structure or input sensitivity.
- [7] Targets large-scale training; less relevant to models like ResNet-18, WRN-34-10.
- [8] Modifies BatchNorm; orthogonal to CLAT’s selective fine-tuning objective.
- [9] No longer SOTA, but CLAT improves its performance when combined.
Writing: Noted on writing style and complex phrasing. We will incorporate these refinements in the camera-ready version.
References
[3] Sankaranarayanan, S., Jain, A., Chellappa, R., & Lim, S. N. (2018). Regularizing Deep Networks Using Efficient Layerwise Adversarial Training. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
[4] Zheng, H., Zhang, M., & Huang, H. (2020). Free Adversarial Training with Heuristic Layerwise Perturbation. arXiv preprint arXiv:2010.03131.
[5] Yang, Z., Liu, Y., Bao, C., & Shi, Z. (2020). Layer-wise Adversarial Defense: An ODE Perspective. International Conference on Learning Representations (ICLR).
[6] Park, G. Y., & Lee, S. W. (2021). Reliably Fast Adversarial Training via Latent Adversarial Perturbation. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 7758-7767.
[7] Xie, C., & Yuille, A. (2019). Intriguing Properties of Adversarial Training at Scale. arXiv preprint arXiv:1906.03787. https://arxiv.org/abs/1906.03787
[8] Muhammad, A., Shamshad, F., & Bae, S.-H. (2023). Adversarial Attacks and Batch Normalization. IEEE Access, 11, 96449-96459. https://doi.org/10.1109/ACCESS.2023.3250661
[9] Xie, C., Tan, M., Gong, B., Yuille, A., & Le, Q. V. (2021). Smooth Adversarial Training. arXiv preprint arXiv:2006.14536. https://arxiv.org/abs/2006.14536
[10] Wong, E., Rice, L., & Kolter, J. Z. (2020). Fast Is Better Than Free: Revisiting Adversarial Training. arXiv preprint arXiv:2001.03994. https://arxiv.org/abs/2001.03994
The paper introduces CLAT (Criticality-Leveraged Adversarial Training), which aims to enhance the adversarial robustness of neural networks by identifying and fine-tuning critical layers that are most vulnerable to adversarial attacks. The main contributions include the development of a criticality index for layer selection and a fine-tuning objective to reduce the non-robust features of these layers. The authors claim that CLAT improves both clean accuracy and adversarial robustness while mitigating overfitting.
Questions for Authors
See my comments about strengths and weaknesses.
Claims and Evidence
The claims made in the submission are not sufficiently supported by novel or convincing evidence. The method proposed is more akin to a fine-tuning approach than a novel adversarial training method. The primary contribution lies in the selection of critical layers using the criticality index (Equation 3). However, the adversarial training loss (Equation 5) is not innovative and has been commonly used in other regularization techniques such as logit pairing [1] and follow-up papers. Despite the authors' claims that the feature weakness defined in Equation (2) approximates the local curvature value, the overall method lacks novelty and significance in the context of adversarial training, to the reviewer's knowledge.
[1] Adversarial logit pairing, 2018
Methods and Evaluation Criteria
The proposed methods, while relevant to the problem of adversarial robustness, do not introduce substantial advancements over existing techniques. The criticality index and the fine-tuning objective are conceptually interesting but do not represent a significant departure from current practices.
Theoretical Claims
The theoretical claims regarding the criticality index and its relationship to feature weakness are adequately presented but do not provide a strong basis for novelty. The connection between the proposed feature weakness metric and local curvature is not sufficiently explored or validated to support the claims of innovation.
Experimental Design and Analysis
The experimental designs appear insufficient to validate the significance of the proposed method.
- The authors tested CLAT on several datasets (CIFAR-10, CIFAR-100, Imagenette, and ImageNet) and network architectures, but the improvements over baseline adversarial training methods are marginal, as shown in Tables 1, 2, 3, and 4.
- The experiments lack breadth, as they do not include more recent backbone architectures (e.g., Vision Transformers).
- Furthermore, the comparisons with other state-of-the-art adversarial training methods are inadequate, limiting the strength of the conclusions drawn.
Supplementary Material
I reviewed the supplementary material, including additional experimental results and ablation studies. While these provide further context, they do not address the fundamental limitations of the proposed method.
Relation to Prior Literature
The key contributions of the paper are related to existing work in adversarial training and layer-specific fine-tuning. However, the proposed method does not significantly advance the field beyond current practices. The paper cites relevant literature but fails to distinguish itself from prior work in terms of innovation or impact.
Missing Essential References
The paper does not adequately discuss recent advancements in adversarial training that focus on layer-specific interventions or architectural modifications. The methods that explore the use of specialized layers or modules for enhancing robustness should be included to provide a more comprehensive context.
Other Strengths and Weaknesses
- The method proposed by the authors is more like a fine-tuning method than an adversarial training method.
(1) The main contribution of this paper is to propose a method for selecting critical layers in Eq. 3.
(2) The loss for adversarial training in Eq. 5 is a commonly used training objective; a similar formulation is, for example, interpreted and motivated in logit pairing [1]. Although the authors claim that the feature weakness defined in Equation (2) is also an effective approximation of the local curvature value, the proposed method still lacks innovation and importance from the perspective of adversarial training.
[1] Adversarial logit pairing, 2018
- The experimental designs are insufficient to validate the significance of the proposed method.
(1) The authors tested CLAT on several datasets (CIFAR-10, CIFAR-100, Imagenette, and ImageNet) and network architectures, but the improvements over baseline adversarial training methods are marginal, as shown in Tables 1, 2, 3, and 4.
(2) The experiments lack breadth, as they do not include more recent backbone architectures (e.g., Vision Transformers).
(3) Furthermore, the comparisons with other state-of-the-art adversarial training methods are inadequate, limiting the strength of the conclusions drawn.
Other Comments or Suggestions
No
Thank you so much for your time! Below, we’ve provided detailed responses to each of your concerns and criticisms. We hope this helps clarify everything.
Experimental support and breadth of evaluation:
Respectfully, we disagree with the concern regarding insufficient experimental support. As also noted by Reviewers 9V5S and eBwd, our submission presents a comprehensive and thoughtfully designed evaluation. We test CLAT across four diverse datasets (CIFAR-10, CIFAR-100, Imagenette, and ImageNet), multiple CNN architectures, and in conjunction with several adversarial training methods, including state-of-the-art approaches from RobustBench. These include both models trained from scratch and partially pretrained models.
To our knowledge, CLAT is the first fine-tuning method that can also be applied from scratch, demonstrating flexibility across training regimes. Despite this broad applicability and strong performance, CLAT introduces negligible overhead, requiring no backward passes or gradient computations for layer selection. We further support our claims with targeted ablation studies that isolate the contribution of each component.
While we cannot exhaustively include every prior method, we carefully prioritized baselines that reflect challenging, high-performing adversarial training settings. Several well-regarded SOTA methods accepted at top-tier venues conduct fewer evaluations in terms of dataset and model diversity. If there are specific baselines or comparisons the reviewer would like us to include, we are happy to provide additional results or clarify their exclusion.
Finetuning method:
Yes, CLAT is a fine-tuning method by design—and this is a core strength. It introduces a layer-selective strategy that mitigates overfitting and consistently improves both clean and adversarial performance. The proposed criticality index identifies layers still learning non-robust features that benefit from continued optimization.
As a fine-tuning method, CLAT is modular, integrates easily into existing adversarial training pipelines, and adds minimal overhead. Its ability to be applied from scratch further highlights the relevance of the selected layers and the generality of the approach.
More broadly, fine-tuning and adversarial training are not mutually exclusive. Fine-tuning has emerged as a valuable direction for improving robustness and has been recognized at top-tier venues [1].
Novelty and ALP:
CLAT is, to our knowledge, the only method that achieves state-of-the-art robustness across diverse training setups, by fine-tuning fewer than 5% of parameters with minimal overhead and no backward pass required for layer selection.
CLAT also introduces a distinct training objective compared to prior work such as Adversarial Logit Pairing (ALP) [2]. While ALP adds a global logit-level regularizer to align clean and adversarial outputs, CLAT uses a forward-pass-only feature sensitivity metric to select a small subset of layers and applies a layerwise regularizer to penalize their vulnerability. This objective targets internal robustness rather than output-level alignment and constrains optimization structurally and locally, in contrast to ALP’s full-model training.
This combination of non-gradient-based sensitivity analysis, feature-level regularization, and sparse fine-tuning defines a lightweight and general framework that is fundamentally distinct from prior approaches.
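For concreteness, ALP's regularizer can be written roughly as a logit-level pairing term (a paraphrase of [2], not a quotation of its exact loss):

$$
\mathcal{L}_{\text{ALP}} \;=\; \mathcal{L}_{\text{adv}}(\theta) \;+\; \lambda\,\big\| z_\theta(x) - z_\theta(x^{\text{adv}}) \big\|_2^2,
$$

where $z_\theta$ denotes the model's logits. CLAT's penalty instead acts on the intermediate features of a small set of selected layers, so the regularization is applied where the vulnerability is measured rather than at the output.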
Marginal improvements:
Please see our response to Reviewer 9V5S regarding this concern.
Transformer-based models:
Vision Transformers represent a fundamentally different architectural paradigm with structural properties that diverge significantly from CNNs. While this is an important direction, it is outside the scope of this work, which focuses on the structural role of layer selection in CNN-based adversarial robustness. Preliminary results of CLAT on TinyViT show an increase of approximately 3% in both clean and adversarial accuracy.
References
[1] Zhu, K., Wang, J., Hu, X., Xie, X., & Yang, G. (2023). Improving generalization of adversarial training via robust critical fine-tuning. arXiv preprint arXiv:2308.02533.
[2] Kannan, H., Kurakin, A., & Goodfellow, I. (2018). Adversarial logit pairing. arXiv preprint arXiv:1803.06373.
This paper proposes a criticality index to identify critical layers that are more prone to perturbation and then apply CLAT to fine-tune these layers for better clean and adversarial accuracy. Results are evaluated on various models, methods and datasets, proving the effectiveness of CLAT.
Questions for Authors
- Apart from using randomly selected layers as a comparison group to show the effectiveness of criticality, is there a more straightforward way to show that the chosen layers are the desired vulnerable layers? For example, visualizing the output variation of a number of layers, including both critical and non-critical ones.
- Will the critical layers vary given different perturbations? For example, CLAT uses untargeted perturbations. Will a shift in the perturbation result in a different selection of critical layers?
Claims and Evidence
My major concern is that despite the soundness of finding critical layers for emphasized AT, the entire theory is based on existing findings without much advancement. The improvement, although consistent across methods, models and datasets, is relatively limited (~2%). This may also suggest that the identification of critical layers may not contribute as significantly as expected to adversarial robustness.
Methods and Evaluation Criteria
Methods are intuitive and easy to understand. Evaluations are primarily comprehensive.
Theoretical Claims
There are no outstanding theoretical claims or analysis in this paper.
Experimental Design and Analysis
The experiments are persuasive, as the paper includes multiple methods and models. It also includes results on ImageNet and with randomly selected layers as a comparison with CLAT to demonstrate the soundness of the method, which is satisfactory. However, considering the significance of these experiments, I would recommend including the results in the main paper.
Supplementary Material
Parts D.4 to E.2.
Relation to Prior Literature
The idea is consistent with existing research and aligns with scientific consensus within AT and robustness overfitting. Methods are consistent across models and datasets, despite the relatively limited improvement. Overall, the contribution would be moderate.
Missing Essential References
N/A.
Other Strengths and Weaknesses
N/A.
Other Comments or Suggestions
N/A.
We’re grateful for your insightful comments and for appreciating the care we put into our experiments. Below, we offer detailed responses to each of your points.
Marginal improvement: We respectfully disagree. Gains of ~2% in adversarial robustness—particularly through fine-tuning—are considered meaningful in recent work (e.g., RiFT [1] reports average improvements of ~1.4%).
In addition, CLAT introduces a distinct and lightweight mechanism for identifying structurally important layers, based on their sensitivity to input perturbations. CLAT’s ability to improve robustness from scratch—across training settings and learning rates—while updating under 5% of parameters suggests these layers are inherently robust-relevant, not artifacts.
Critical Layers: We also validate our layer selection by comparing against low-criticality layers (Appendix E.3). Additionally, thanks for the suggestion—we now include a visualization of the criticality index for RN50 at the start of fine-tuning (post 70 epochs of AT), showing clear separation between high- and low-criticality layers (e.g., 34, 41, 48). Similar patterns hold across architectures and throughout fine-tuning. Link for image: https://imgur.com/a/iNKlttr
Cidx consistency: The identified critical layers remain stable across perturbation types. For instance, indices computed using AutoAttack closely match those from untargeted PGD for DN121, RN50, and RN18 on CIFAR-10.
| Network | PGD CIDX | AA CIDX |
|---|---|---|
| DN121 | [39, 14, 1, 3, 88] | [39, 14, 1, 3, 88] |
| RN50 | [34, 41, 48, 3, 36] | [34, 41, 48, 3, 36] |
| RN18 | [11, 10, 4, 2, 12] | [11, 10, 4, 2, 12] |
Additional Results: We agree and are happy to move the ImageNet and random ablations into the main paper, space permitting. They were placed in the appendix due to the broader use of CIFAR datasets for baseline comparison.
[1] Zhu, K., Wang, J., Hu, X., Xie, X., & Yang, G. (2023). Improving generalization of adversarial training via robust critical fine-tuning. arXiv preprint arXiv:2308.02533.
I am grateful for the authors' response. All my concerns are well-addressed.
This paper proposes to identify robustness-critical layers and to use only the parameters of these layers in adversarial fine-tuning. The initial reviews were quite divergent. Reviewers acknowledge the potential benefit of the proposed approach in making adversarial training more efficient, but are also critical of the novelty (a combination of existing findings), the actual contribution of the criticality measure, and the experimental evaluation. With the additional results provided in the rebuttal, the AC finds the paper meets the bar for ICML, because the proposed approach provides a consistent improvement and is intuitive and well motivated. The authors should include the additional results from the rebuttal in the final paper.