PaperHub
6.1/10
Poster · 4 reviewers
Min 3 · Max 4 · Std 0.4
Scores: 3, 3, 3, 4
ICML 2025

Adapting to Evolving Adversaries with Regularized Continual Robust Training

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

Regularization helps when sequentially adapting to new test-time attacks

Abstract

Robust training methods typically defend against specific attack types, such as $\ell_p$ attacks with fixed budgets, and rarely account for the fact that defenders may encounter new attacks over time. A natural solution is to adapt the defended model to new adversaries as they arise via fine-tuning, a method which we call continual robust training (CRT). However, when implemented naively, fine-tuning on new attacks degrades robustness on previous attacks. This raises the question: how can we improve the initial training and fine-tuning of the model to simultaneously achieve robustness against previous and new attacks? We present theoretical results which show that the gap in a model's robustness against different attacks is bounded by how far each attack perturbs a sample in the model's logit space, suggesting that regularizing with respect to this logit space distance can help maintain robustness against previous attacks. Extensive experiments on 3 datasets (CIFAR-10, CIFAR-100, and ImageNette) and over 100 attack combinations demonstrate that the proposed regularization improves robust accuracy with little overhead in training time. Our findings and open-source code lay the groundwork for the deployment of models robust to evolving attacks.
Keywords
Adversarial training, multi-attack, fine-tuning, regularization

Reviews and Discussion

Official Review
Rating: 3

Most robust training methods focus on specific attack types and struggle to maintain robustness when new attacks arise, making continual robust training (CRT) necessary. This paper proposes a logit-space regularization approach to preserve robustness across both previous and new attacks efficiently, demonstrating its effectiveness through theoretical analysis and extensive experiments on multiple datasets.

Questions for the Authors

  • My primary concern is the significance of studying CRT, given that in practice, attacks may not arrive sequentially. Additionally, does the order in which different attacks occur impact the final results of the proposed methods?
  • Another issue is the limited novelty of the method within the CRT setting. What potential future directions could further enhance ALR?
  • Also, could the authors discuss and compare the important related work [1] in their paper? [1] RAMP: Boosting Adversarial Robustness Against Multiple lp Perturbations for Universal Robustness.

Claims and Evidence

The claims for CRT are clearly defined and thoroughly discussed.

Methods and Evaluation Criteria

The evaluations include baselines from both multi-attack robustness and unforeseen attack robustness. The experiment is comprehensive with many settings/attack types, as well as different regularization methods.

Theoretical Claims

Yes. The proof looks reasonable to me.

Experimental Design and Analysis

The experimental designs are mostly reasonable to me: they cover many attack scenarios (sequences of up to 4 attack types applied continually) and compare many baselines as well as regularization methods.

Supplementary Material

Yes. The authors included the code as supplementary material and the experiments should be reproducible.

Relation to Prior Work

Within the adversarial robustness literature on multi-norm and unforeseen robustness, this paper contributes to the intersection of the two by proposing a new CRT scenario in which attacks are deployed continually.

Missing Important References

The work in [1], which is the newest work on multi-norm robustness, could be compared and discussed in the paper.

[1] RAMP: Boosting Adversarial Robustness Against Multiple lp Perturbations for Universal Robustness.

Other Strengths and Weaknesses

The paper is well written, and the problem is clearly presented and motivated.

Other Comments or Suggestions

  • The ALR component seems to lack novelty, as it is very similar to TRADES. Also, I noticed there are two terms for regularization in Theorem 3.1; for the implementation, which term do the authors regularize?
  • In real-world applications like autonomous driving, attacks may occur in real-time, yet CRT requires multiple rounds of fine-tuning. How can these challenges be addressed to ensure the practical deployment of CRT in such dynamic environments?
  • I noticed that the tables do not include an $\ell_1$ attack. Does this omission have any specific significance?
Author Response

Thank you for your insightful review.

Discussion of RAMP

Thank you for pointing us to this interesting, relevant work. RAMP targets robustness against multiple Lp norms and proposes a logit pairing loss that minimizes the KL divergence between the logits obtained under 2 different Lp attacks. Additionally, it uses gradient projection to integrate model updates between natural training and adversarial training for a better clean accuracy-robustness tradeoff. In comparison, our work addresses robustness against sequences of attacks, including non-Lp attacks, and our regularization term uses the $\ell_2$ distance between clean and adversarial logits. We will add this discussion to Appendix A and cite the RAMP paper in Section 5.
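To make the contrast concrete, here is a minimal sketch (not the authors' or RAMP's actual code) of the two regularization terms as described above, assuming the logits have already been computed:

```python
import torch
import torch.nn.functional as F

def ramp_style_pairing(logits_attack_a: torch.Tensor, logits_attack_b: torch.Tensor) -> torch.Tensor:
    # RAMP-style logit pairing: KL divergence between the predictive
    # distributions obtained under two different Lp attacks.
    return F.kl_div(
        F.log_softmax(logits_attack_a, dim=-1),
        F.softmax(logits_attack_b, dim=-1),
        reduction="batchmean",
    )

def alr_term(logits_clean: torch.Tensor, logits_adv: torch.Tensor) -> torch.Tensor:
    # ALR: l2 distance between clean and adversarial logits, averaged over the batch.
    return (logits_clean - logits_adv).norm(p=2, dim=-1).mean()
```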

Significance

We believe that it is reasonable to model the defender’s knowledge of potential threats in a sequential manner. Firstly, it takes time for researchers and attackers alike to develop new attacks. For example, the UAR benchmark [1] was initially released in 2019 and was then expanded with new non-Lp attacks in 2023. When new attacks are discovered, the defender would want to quickly adapt their model for robustness. Additionally, in the case that multiple attacks are discovered simultaneously, our regularized CRT can be used with multiple attacks at a single timestep. For example, in initial training, we can use existing methods for multi-attack training and add ALR with respect to all attacks, and in finetuning, we can use a finetuning strategy such as FT Croce + ALR, which can take multiple attacks into account.

[1] Kaufmann et al. (2019). Testing robustness against unforeseen adversaries. arXiv preprint

Real world attack setting

We discuss challenges of extending to real-time attacks in the “Extension to scenarios where defender has limited knowledge about the attack type” section of our response to Reviewer 9HKR.

Does the order in which different attacks occur impact the final results of the proposed methods?

We provide results for another ordering of the 4 attacks (Linf->StAdv->Recolor->L2) in Appendix Table 5. Overall, we observe the same trends with ALR helping in reducing forgetting of robustness on previous attacks and improving robustness on held out attacks. The final model obtained after all rounds of finetuning for this sequence achieves higher union all accuracy (3.32% higher than the sequence in the main paper) so order may have some impact on the final model performance. This is also shown in the finetuning ablations in Fig 3 as the matrix is not symmetric.

Which term is regularized in Theorem 3.1

For initial training with ALR, only the adversarial loss term corresponding to the initial known attack is regularized. In finetuning, the term regularized depends on the finetuning strategy: we regularize with respect to whichever attack the strategy selects for a given batch. Specifically, let $\mathcal{L}_1$ be the loss on the previous attack and $\mathcal{L}_2$ be the loss on the new attack used in finetuning. For FT Single, only the second term is regularized since we only use the new attack in finetuning. For FT Croce, both terms are regularized over the course of training since both attacks have a chance of being chosen (although we only compute regularization with respect to a single attack per batch).
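As a rough illustration of the per-batch behavior described above (placeholder attack callables and weights, not the paper's implementation), the two fine-tuning strategies might be sketched as:

```python
import random
import torch.nn.functional as F

def regularized_batch_loss(model, x, y, attack, alr_weight):
    # Adversarial loss on the chosen attack plus ALR computed for that same attack.
    x_adv = attack(model, x, y)
    logits_clean, logits_adv = model(x), model(x_adv)
    adv_loss = F.cross_entropy(logits_adv, y)
    alr = (logits_clean - logits_adv).norm(p=2, dim=-1).mean()
    return adv_loss + alr_weight * alr

def ft_single_step(model, x, y, new_attack, alr_weight):
    # FT Single: only the new attack is used, so only the new-attack term is regularized.
    return regularized_batch_loss(model, x, y, new_attack, alr_weight)

def ft_croce_step(model, x, y, prev_attack, new_attack, alr_weight):
    # FT Croce: each batch samples one of the two attacks, so over training both
    # loss terms end up regularized (one attack per batch).
    attack = random.choice([prev_attack, new_attack])
    return regularized_batch_loss(model, x, y, attack, alr_weight)
```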

Novelty and future directions

We acknowledge that there are similarities between our ALR regularizer and the TRADES regularizer, as both maximize a distance in the logit space. However, ALR is theoretically motivated by improving generalization to new/unforeseen attacks and reducing forgetting of previous attacks, while TRADES is motivated by balancing the clean accuracy-robustness tradeoff. We also highlight that the study of obtaining robustness against sequences of different attack threat models is novel, and we contribute extensive experiments across a variety of attack types and investigate the performance of random-noise-based regularization as well.

For an experimental comparison to TRADES, please refer to the “comparisons with TRADES” portion of response to reviewer 1uU6. We observe that ALR works better for our task as it does not trade off robustness on the initial attack.

We discuss future directions for enhancing this line of work in Appendix B. These include ways of detecting attacks, further improving finetuning efficiency, studying the impact of model capacity, and theoretical analysis comparing loss under different attacks for the initial model after training and model obtained after finetuning.

L1 attack

Please see our response to Reviewer JuEE, in the portion on “L0 Attack,” for an explanation of why we used the set of evaluation attacks in the paper. If the reviewer thinks it is necessary, we can add a few experiments with an L1 attack for the final version of the paper.

Reviewer Comment

Thanks for the response, which addresses my concerns. I will raise my score accordingly.

Author Comment

Thank you for your response and for raising your score. We will add the RAMP discussion and explicit comparison with TRADES into the updated version of the paper.

Official Review
Rating: 3

The paper presents a regularization method for robust continual learning and evaluates it using extensive experiments.

Questions for the Authors

  1. What do you mean by single-step optimization for ALR? I'm not sure how accurate it is compared with multiple steps.

  2. What would happen if we switched the order of the attacks? The paper seems to consider only two attack orderings, both starting with an Lp attack. What would happen if we started with a non-Lp attack?

Claims and Evidence

The authors claim ALR is essential for maintaining robust performance, but from the experiment section it seems adding ALR is sometimes not optimal; fine-tuning with MAX performs better.

Methods and Evaluation Criteria

I appreciate the extensive experimental results. However, I could not find how many trials the authors ran for each set of parameters, and there is no standard deviation in the result tables.

Theoretical Claims

  1. Does Theorem 3.1 hold for any $h$ in the hypothesis set? If so, it can be very loose. Normally, in the non-robust setting, the continual learning literature controls both a generalization bound (the generalization performance of the final model in terms of average error over all tasks) and a forgetting bound (the average loss difference between the final model and the model obtained right after learning each task). In this paper, the authors seem to combine these two metrics into a single bound.
  2. Is it possible to generalize to multiple subsequent attacks instead of just two?
  3. While I appreciate Definition 2.1, it does not seem to be used in any of the theorems.

Experimental Design and Analysis

  1. The attack method considers 10-step PGD with a 0.075 attack step size, which seemingly cannot search over the perturbation ball of radius 0.5 for CIFAR-10 under random initialization: the number of steps × the step size should be at least 2 × the perturbation radius (see the quick arithmetic check after this list). Otherwise, the paper does not consider random initialization for PGD attacks.

  2. Is there any reason for selecting the specific perturbation budget for each attack, or is it purely arbitrary? Are you assuming the attacks have roughly the same strength? What would happen if the perturbation budget for each attack were changed randomly?

  3. From Table 2, it seems different algorithms use different regularization parameters. How, then, do we set the (optimal) regularization parameter? In Appendix H, the authors present many tables (Tables 10-17) indicating that regularization performs better, but it remains unclear how to choose the regularizer. From these tables, a larger regularization strength appears to give better results against unseen attacks, but obviously we cannot set the parameter to infinity.

  4. For model selection, why not use a separate validation set? Don't you observe the robust overfitting phenomenon when using training data for model selection?

  5. It seems that using ALR as the regularizer does not always give the optimal performance. Should we focus on average accuracy or union accuracy? Should we focus on (known) or (all)? For example, Tables 6, 7, 8, and 9 show that ALR is not optimal; MAX or FT-MAX is instead.
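A quick arithmetic check of the step-size concern in item 1, taking the stated numbers (10 steps, step size 0.075, $\ell_2$ radius 0.5) at face value:

```latex
\underbrace{10}_{\text{steps}} \times \underbrace{0.075}_{\text{step size}} = 0.75
\;<\; 2 \times \underbrace{0.5}_{\text{radius}} = 1.0,
```

so an iterate initialized on the far side of the ball cannot traverse its full diameter.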

Supplementary Material

I have read the appendix of the paper.

Relation to Prior Work

The paper addresses the problem of maintaining robustness while transferring to different attack models, which is a rather important topic that has been studied in the literature.

Missing Important References

N/A

Other Strengths and Weaknesses

  1. Are there any comparisons with TRADES, given that the regularizer is similar except that one uses the L2 norm and the other the KL divergence?

Other Comments or Suggestions

  1. What is RCRT at the end of the paragraph of the paper? Regularized continual robust learning? The abbreviation does not seem consistent, and since there are so many abbreviations in the paper, I feel it would help readability to restate the full names periodically.

  2. In general, I feel the explanation of the experimental results could be more detailed and clearer, rather than simply listing all the tables and figures in the appendix. For example, for what kind of attack do you think ALR helps with robust transfer? Since the theorem only considers two attacks, it might be helpful to also consider such a setting in the experiments instead of 4 attacks.

  3. It is unclear whether ALR regularization works better than other regularizations. In Appendix H.1, the authors provide multiple tables, each dealing with one regularization. A better option would be to fix the attack order and vary the regularization, each with the same or different regularization parameters, and see which works best. I really appreciate the intensity of the experiments, but there is no need to consider that many attack sequences if the results are not analyzed and understood.

Typo: at the start of page 5, left column, "for $\ell_2$ attacks" appears twice; I imagine the second instance should be the $\ell_\infty$ attack.

Ethics Review Issues

N/A

Author Response

General clarifications

Goals in CAR

We optimize three objectives (Def 2.1): (1) robustness to known attacks, (2) robustness to unforeseen attacks, and (3) update efficiency. (Known) metrics correspond to (1), (all) metrics to (2), and training time to (3). Thus, (known), (all), and time metrics are all key for comparing techniques. MAX and FT-MAX may achieve better accuracy but are less efficient.
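For clarity, a small sketch (hypothetical variable names, not the paper's evaluation code) of how the Avg/Union metrics over known vs. all attacks could be computed from per-attack correctness masks:

```python
import numpy as np

def avg_and_union(correct_by_attack: dict, attacks: list) -> tuple:
    # correct_by_attack[a] is a boolean array over test samples: correct under attack a.
    mat = np.stack([correct_by_attack[a] for a in attacks])  # (num_attacks, num_samples)
    avg = mat.mean(axis=0).mean()     # average robust accuracy over the attacks
    union = mat.all(axis=0).mean()    # fraction of samples robust to every attack
    return avg, union

# (known) metrics use only the attacks seen so far; (all) metrics add held-out attacks:
# avg_known, union_known = avg_and_union(correct_by_attack, known_attacks)
# avg_all, union_all = avg_and_union(correct_by_attack, known_attacks + unforeseen_attacks)
```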

Finetuning Ablations

We direct the reviewer to Fig 3 and the discussion in Sec 4.4, which address a few of the reviewer’s questions. There, we study sequences of 2 attacks, including non-Lp starting attacks. These experiments make it clear that in finetuning, ALR is much better than random (Uniform and Gaussian) regularization, which hurts Union accuracy across the 2 attacks.

Theory

Thm 3.1, generalization and forgetting bounds

Thm 3.1 holds for any hypothesis $h$, so any model trained with the regularizer on the RHS will have a reduced loss gap. We find that this bound correlates with robust loss gaps in practice (Appendix E). Refining the bound to account for both generalization and forgetting is an interesting future direction. Our analysis relies on distances between representations of inputs to a fixed model rather than changes in representations through training. The latter must take into account changes in the representation space induced by training on new attacks, which is not well understood. With Thm 3.1 and Cor 3.2, we aim to strike a balance between robustness against individual attacks (both seen and unseen), union accuracy, and accuracy on clean samples.
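For readers without the paper at hand, the bound being discussed has roughly the following shape (a schematic paraphrase based on the abstract and this response, not the paper's exact statement): the robust-loss gap between two attacks is controlled by how far the attacks move samples in the model's logit space, up to a constant depending on the loss,

```latex
\bigl|\, \mathbb{E}\big[\mathcal{L}(h;A_1)\big] - \mathbb{E}\big[\mathcal{L}(h;A_2)\big] \,\bigr|
\;\lesssim\;
C \cdot \mathbb{E}_{x}\Big[\max_{i\in\{1,2\}} \bigl\| h\bigl(A_i(x)\bigr) - h(x) \bigr\|_2 \Big],
```

which is why penalizing the clean-vs-adversarial logit distance (the ALR term) tightens the right-hand side.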

Generalizing theory to more than 2 attacks

Thm 3.1 and Cor 3.2 hold for any two attacks, whether or not they were seen during the course of training. For a larger set of attacks, the maximum loss gap between any attack pair is subject to the bound in Thm 3.1. Attacks can be defined to optimize over the union of multiple adversarial constraints, allowing us to extend our theoretical results to the union of multiple attacks.

Experimental

PGD step size

PGD is used only in adversarial training. Evaluations use AutoAttack, which adapts step size for accurate robustness assessment. Prior work (Gowal et al. 2020, Rice et al. 2020) also omits random initialization in L2 adversarial training.

Selection of the perturbation budget

Comparing attack strengths, especially non-Lp, is challenging. We use default budgets from the original attack papers. Developing comparative metrics for attack strengths is an interesting direction.

How to set the regularization strength?

See “How should we set the regularization strength parameter?” in our response to Reviewer 9HKR.

Model selection

We select the epoch with the highest average validation accuracy on known attacks, effectively performing optimal early stopping to avoid robust overfitting.
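A minimal sketch of this selection rule (the training and evaluation callables are assumed placeholders, not the authors' code):

```python
import copy

def select_checkpoint(model, train_loader, val_loader, known_attacks,
                      num_epochs, train_one_epoch, robust_accuracy):
    """Keep the epoch whose average validation robust accuracy over the known
    attacks is highest, i.e. optimal early stopping against robust overfitting."""
    best_acc, best_state = -1.0, None
    for _ in range(num_epochs):
        train_one_epoch(model, train_loader)
        accs = [robust_accuracy(model, val_loader, atk) for atk in known_attacks]
        avg_acc = sum(accs) / len(accs)
        if avg_acc > best_acc:
            best_acc, best_state = avg_acc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return best_acc
```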

Average vs. Union Accuracy

The choice between them depends on application. Safety-critical settings prioritize union accuracy since it captures the worst case.

Other

TRADES comparison

TRADES is designed to improve the clean accuracy-robustness tradeoff, while ALR is designed to improve generalization across (seen and unforeseen) attacks. Since the TRADES regularizer also maximizes a distance (KL instead of L2) in the logit space, we expect it can also improve generalization across attacks, and we provide results below. As with our ALR experiments, we apply the regularizer on top of PGD L2 and Linf adversarial training. Regularization strength is shown in parentheses.

| Threat model | Reg. | Clean | L2 | Linf | StAdv | ReColor | Union |
|---|---|---|---|---|---|---|---|
| L2 | None | 91.17 | 69.70 | 28.41 | 2.08 | 44.94 | 1.24 |
| L2 | TRADES (1) | 90.43 | 70.08 | 31.33 | 0.89 | 38.51 | 0.60 |
| L2 | TRADES (3) | 88.93 | 70.05 | 33.81 | 9.04 | 58.25 | 6.74 |
| L2 | TRADES (6) | 88.76 | 69.69 | 33.00 | 7.04 | 56.82 | 5.51 |
| L2 | ALR (1) | 89.43 | 69.84 | 34.00 | 48.23 | 65.46 | 31.27 |
| Linf | None | 85.93 | 59.48 | 51.44 | 14.87 | 62.48 | 11.90 |
| Linf | TRADES (1) | 85.39 | 59.33 | 49.23 | 14.11 | 64.45 | 11.45 |
| Linf | TRADES (3) | 83.97 | 58.54 | 47.00 | 20.51 | 69.33 | 16.34 |
| Linf | TRADES (6) | 85.72 | 56.44 | 41.70 | 23.17 | 70.23 | 17.83 |
| Linf | ALR (0.5) | 83.18 | 58.15 | 51.49 | 34.78 | 58.15 | 29.87 |

Notably, increasing TRADES strength in Linf training trades off Linf performance, whereas ALR does not.

Appendix results

Tables in App. H.1 mirror Table 3’s ablations on initial training, showing consistency across datasets and attacks. These do not involve finetuning; for finetuning results, see Fig 3. The discussion of the Appendix figures currently appears early, before the related figures; we will fix this in the camera-ready.

Single step optimization in ALR

ALR optimizes a worst-case logit distance, which we compute with a single PGD step. This is less precise than using multiple steps but more efficient, aligning with our goal of improving efficiency.
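A rough sketch of what such a single-step approximation could look like (the perturbation set, step size, and budget here are illustrative assumptions, not the paper's settings):

```python
import torch

def alr_single_step(model, x, step_size=2/255, epsilon=8/255):
    """Approximate the worst-case clean-vs-perturbed logit distance with a
    single PGD-style step from a random start, then return the l2 penalty there."""
    logits_clean = model(x).detach()
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon).requires_grad_(True)
    dist = (model(x + delta) - logits_clean).norm(p=2, dim=-1).mean()
    grad = torch.autograd.grad(dist, delta)[0]
    delta = (delta + step_size * grad.sign()).clamp(-epsilon, epsilon).detach()
    reg = (model(x + delta) - logits_clean).norm(p=2, dim=-1).mean()
    return reg  # add lambda * reg to the adversarial training loss
```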

Official Review
Rating: 3

The paper proposes an algorithm to robustly finetune a model against newly proposed attacks. Specifically, the paper adds a regularization term called ALR at both the pretraining and finetuning stages. The regularization term bounds the distance between the clean logits and the adversarial logits. The experimental results show that ALR significantly improves robustness to the new attack.

Questions for the Authors

  1. In Table 1, the robust accuracy of Union (All) drops after the fine-tuning stage. Does this mean the initial model is actually the best model against unknown adversarial attacks?

Claims and Evidence

The paper's claims are generally well supported by the experimental results. However, there are several points that I find lack support.

  1. What is the function of the regularization term ALR? When used at the pretraining stage, does it accelerate the finetuning stage or make the initial model more robust to unknown attacks? To prove this, it would be better to add an ablation study using both AT and AT + ALR as the initial model for the finetuning process.

  2. The attack family lacks an $\ell_0$ attack.

Methods and Evaluation Criteria

The paper performs extensive experiments on different datasets using different kinds of attacks. The experimental results demonstrate the effectiveness of the ALR regularization, as it largely increases robust accuracy.

Theoretical Claims

The theoretical claims seem correct but I did not carefully check them.

Experimental Design and Analysis

Yes, I have checked the settings of the experimental designs and they are correct.

Supplementary Material

No.

Relation to Prior Work

No

Missing Important References

None

Other Strengths and Weaknesses

  1. The presentation of the paper needs improvement. For example, in Definition 2.1, the paper introduces several concepts such as $t$ and $\delta_{\text{known}}$ that are actually unnecessary for the paper. The authors could provide a more direct introduction to the method itself.

Other Comments or Suggestions

None.

Author Response

Thank you for your positive appraisal of the paper and interesting questions.

What is the function of the regularization term ALR? When used at the pretraining stage, does it accelerate the finetuning stage or make the initial model more robust to unknown attacks? To prove this, it would be better to add an ablation study using both AT and AT + ALR as the initial model for the finetuning process.

ALR in the pretraining stage serves to improve the generalization to unforeseen attacks (discussed in the second paragraph of Section 4.2), which then provides a better starting point when finetuning the model to the new attack. In Figure 5 in the Appendix, we have provided a comparison between finetuning (without regularization) from an AT + ALR initial model and finetuning (without regularization) from an AT initial model.

When used in finetuning, ALR serves to reduce forgetting of robustness to previous attacks in the sequence, which we demonstrate in Table 1.

L0 Attack

Because the goal of CRT is to quickly adapt the model to unforeseen attacks when they become known to the defender, we chose to use the same set of attacks (Linf, L2, StAdv, ReColor) used for evaluation in works on unforeseen robustness [1,2] as well as incorporate attacks from a benchmark for unforeseen robustness (Gabor, Snow, Pixel, Kaleidoscope, Glitch, Elastic, JPEG, Wood) [3]. This is why we opted to use the attack set that we used in evaluation.

L0 attacks involve combinatorial optimization, and it is unclear whether they can be easily integrated into adversarial training, which our framework is based on. Additionally, [4] demonstrates that L0 attacks can be quite weak and need to use large per-pixel perturbations.

[1] Laidlaw et al. (2021). Perceptual Adversarial Robustness: Defense Against Unseen Threat Models. International Conference on Learning Representations (ICLR).
[2] Dai, S., Mahloujifar, S., & Mittal, P. (2022). Formulating robustness against unforeseen attacks. Advances in Neural Information Processing Systems, 35, 8647-8661.
[3] Kaufmann, M., Kang, D., Sun, Y., Basart, S., Yin, X., Mazeika, M., ... & Hendrycks, D. (2019). Testing robustness against unforeseen adversaries. arXiv preprint arXiv:1908.08016.
[4] Zuo, F., Yang, B., Li, X., & Zeng, Q. (2019). Exploiting the inherent limitation of l0 adversarial examples. In 22nd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2019) (pp. 293-307).

Definition 2.1

The variables $t$ and $\delta_{\text{known}}$ are important for rigorously defining the problem we study (continual adaptive robustness) and the goals we hope a good algorithm for this setting will achieve, which we connect to our proposed CRT in Section 3.1. Specifically, we become aware of attacks sequentially over time, and at a specific point in time $t$, we want to be robust against attacks that we have known for a while, have some robustness to recently introduced attacks, and recover quickly from new attacks. To model this, we have 3 problem parameters: (1) $\delta_{\text{known}}$, which specifies the loss threshold that we can tolerate on attacks that we have known for a while; (2) $\delta_{\text{unknown}}$, which specifies the loss threshold that we can tolerate on recently introduced attacks; and (3) $\Delta t$, which specifies how long the model has to recover from new attacks. “Recovering from new attacks” means that the tolerated-loss threshold switches from $\delta_{\text{unknown}}$ to $\delta_{\text{known}}$, with $\delta_{\text{known}} < \delta_{\text{unknown}}$.

These three quantities also serve to motivate the metrics we measure in the experimental section: Union (known) and Avg (known) correspond to $\delta_{\text{known}}$, Union (all) and Avg (all) give a sense of $\delta_{\text{unknown}}$, and the training time corresponds to $\Delta t$. These connections are discussed within the results discussion in Section 4.2.
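A compact way to read these three parameters (a paraphrase of the setup described above, not Definition 2.1 verbatim): writing $t_A$ for the time at which attack $A$ becomes known to the defender, the model $h_t$ deployed at time $t$ should satisfy

```latex
\mathcal{L}(h_t; A) \le
\begin{cases}
\delta_{\text{known}},   & t - t_A \ge \Delta t \quad \text{(attacks known for a while)}\\[2pt]
\delta_{\text{unknown}}, & 0 \le t - t_A < \Delta t \quad \text{(recently introduced attacks)}
\end{cases}
\qquad \text{with } \delta_{\text{known}} < \delta_{\text{unknown}}.
```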

In Table 1, the robust accuracy of Union (All) drops after the fine-tuning stage. Does this mean the initial model is actually the best model against unknown adversarial attacks?

This result suggests that the features used for classifying robustly under the StAdv attack differ from those used for the Linf attack, so when we fine-tune the model on the StAdv attack at time step 1, generalization to the Linf attack gets worse, which in turn impacts the union accuracy. The fine-tuned model's performance also depends on the attack sequence, so it is hard to conclude whether the initial model will be better in terms of unforeseen robustness. For example, in Table 5 in the Appendix, union accuracy steadily increases after fine-tuning. In general, it is better to fine-tune on known attacks because these are the attacks that the defender is confident will affect the model’s performance.

Official Review
Rating: 4

This paper introduces Regularized Continual Robust Training (RCRT), a framework for adapting deep learning models to evolving adversarial attacks while maintaining robustness to previously seen threats. The authors theoretically demonstrate that the gap in robustness between different attacks is bounded by logit-space distances, and propose adversarial L2 regularization (ALR) to minimize this bound during both initial training and fine-tuning.

Questions for the Authors

  • The paper focuses on regularization at the logit level. Have you explored regularizing intermediate representations in the network, and if so, how does this compare to logit-level regularization?
  • For practical deployment, how would you recommend balancing the tradeoff between robustness and clean accuracy that regularization introduces? Are there guidelines for selecting the regularization strength λ based on the specific application needs?
  • Your results in Table 1 show that fine-tuning with only the new attack (FT Single) leads to significant forgetting of previous attacks. Have you explored methods from the continual learning literature (like replay buffers or elastic weight consolidation) that specifically target catastrophic forgetting, and if so, how do they compare to your regularization approach?
  • How might your approach extend to scenarios where the defender has limited knowledge about the attack type but can only observe the adversarial examples? This would be closer to real-world security scenarios where attackers don't reveal their methods.

Claims and Evidence

The claims are generally well-supported by both theoretical analysis and extensive empirical evidence. The theoretical bound connecting the robustness gap to logit distances (Theorem 3.1 and Corollary 3.2) is well-established and forms a sound basis for the proposed regularization technique. The empirical evaluation is comprehensive, covering multiple datasets, attack types, and regularization approaches, with clear performance metrics (Union accuracy, Average accuracy, and time overhead).

Methods and Evaluation Criteria

The methods and evaluation criteria are appropriate for the problem. The paper properly defines continual adaptive robustness and establishes clear metrics for success (known and unforeseen robustness thresholds with grace periods). The evaluation considers both effectiveness (robustness across attacks) and efficiency (training time), which are both critical for practical deployment.

Theoretical Claims

I checked the proofs of Theorem 3.1 and Corollary 3.2 and found them to be mathematically sound.

Experimental Design and Analysis

The experimental design is sound, with a clear selection of datasets and attack types and appropriate baseline methods for comparison.

Supplementary Material

I have reviewed the related work, experimental results, and theoretical results in the supplementary material.

Relation to Prior Work

The paper properly positions its contributions on robustness under continual shift relative to Dai et al., 2023 and Kaufmann et al., 2019. The authors should also discuss the literature on gradual domain adaptation, including papers like [1]. [1] Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning, pp. 5468–5479. PMLR, 2020.

Missing Important References

Not aware.

Other Strengths and Weaknesses

  • The regularization approaches considered (VR, ALR, UR, GR) all operate on model outputs (logits or later features) rather than exploring regularization of earlier representational layers, which might provide complementary benefits.
  • While the paper examines the effect of regularization on the tradeoff between robustness and clean accuracy, there's limited discussion of the optimal balance for different applications.
  • The evaluation focuses on image classification tasks; applicability to other domains remains unexplored.
  • The computational overhead of ALR, while modest, might still be a concern for very large models.

Other Comments or Suggestions

  • The paper would benefit from more visual examples of different attack types to help readers understand their qualitative differences.
  • Some discussion of potential applications and use cases where CAR would be particularly valuable would strengthen motivation.
Author Response

Thank you for your insightful review and positive appraisal of our paper.

Discussion of gradual domain adaptation

Thank you for pointing us to this line of work. The work referenced studies shifts in data distribution over time and proposes gradual self-training to adapt the source model without access to labels. Meanwhile, we propose regularized CRT as a solution to the expanding space of attacks over time, with the data distribution itself remaining the same, with access to attacks and labels. We will add a discussion of this related direction into Appendix A.

Regularization on intermediate representations

We provide experiments with regularization on the features at the layer before the logits (Results in Appendix Table 4 rows labelled “+ALR feature”). Overall, we observe similar results compared to logit level regularization. We also provide theoretical results for regularization at the layer before the logits in Appendix C.3.

Comparing to methods from continual learning

Thank you for this suggestion. We experimented with using EWC when finetuning for StAdv robustness from an L2-robust model with the FT Single approach. We provide results for 3 different EWC strengths (in parentheses) compared to unregularized FT Single and FT Single + ALR. Overall, we find that ALR’s improvement in robustness on known and unforeseen attacks is significant compared to EWC. EWC’s improvement over FT Single is similar to that of FT Croce in Table 1 (time step 1), which uses replay of previous attacks in finetuning. We will add these comparisons into the Appendix.

| Method | Clean | L2 | StAdv | Linf | ReColor | Avg Known | Union Known | Avg all | Union all |
|---|---|---|---|---|---|---|---|---|---|
| FT Single | 80.89 | 45.45 | 54.50 | 6.09 | 41.98 | 49.98 | 41.05 | 37.00 | 5.87 |
| FT Single + EWC (0.5) | 83.98 | 58.85 | 51.15 | 15.44 | 51.55 | 55.00 | 46.25 | 44.25 | 14.54 |
| FT Single + EWC (1) | 85.20 | 57.69 | 56.18 | 13.07 | 50.99 | 56.93 | 49.42 | 44.48 | 12.69 |
| FT Single + EWC (2) | 85.10 | 57.96 | 55.14 | 13.54 | 51.23 | 56.55 | 48.90 | 44.47 | 12.99 |
| FT Single + ALR | 87.24 | 62.22 | 61.50 | 21.40 | 70.87 | 61.86 | 55.04 | 54.00 | 21.14 |
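For context, a minimal sketch of the kind of EWC penalty used in this comparison as we understand it (a diagonal-Fisher-weighted penalty on parameter drift from the previous model; the details here are assumptions, not the exact implementation):

```python
import torch

def ewc_penalty(model, prev_params, fisher, strength):
    # Penalize movement of each parameter away from its value after training on
    # the previous attack, weighted by a diagonal Fisher information estimate.
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - prev_params[name]) ** 2).sum()
    return strength * penalty

# Used as: total_loss = finetuning_loss_on_new_attack + ewc_penalty(model, prev_params, fisher, lam)
```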

Extension to scenarios where defender has limited knowledge about the attack type

This is an interesting direction. In our work, we focus on changes in the defender’s knowledge of attacks over time which is useful in cases such as a research or security team discovering a new attack type. A real-time attack setting poses new challenges:
1. No access to the threat model - the defender does not know the threat model and cannot generate adversarial examples; they only have access to the perturbed data generated by the adversary.
2. Missing true labels and no access to the original unperturbed input - the defender also does not have the corresponding true labels or the original clean input for use in training.
3. Few-shot updates - it becomes critical that the model can be made robust with only a few examples of successful attacks; otherwise, the adversary has been exploiting the model's vulnerabilities for a long time.

Defending in this setting is outside the scope of this paper, but using generative models to model the perturbations [1] used by the adversary could help bridge the gap from points (1) and (2), allowing the defender to apply the attack to their own dataset and finetune with our proposed CRT + ALR. If the generative model can learn to model perturbations from only a few adversarial examples, this can also address (3). We will add this discussion to Appendix B’s discussion of future directions.

[1] Wong et al. 2020 Learning perturbation sets for robust machine learning. ICLR

Visual examples of different attack types

Thank you for this suggestion, we will add visual examples of each attack into the Appendix.

Discussion of potential applications

Solving CAR is of interest in any safety-critical domain where an attacker is motivated to evade an ML model. A good example is automated content moderation, where malicious actors try to post content that violates policies by uploading obfuscated images [2]. Strategies naturally evolve over time for motivated attackers, who can also use open-source methods proposed in the literature. As ML models continue to be used in sensitive domains such as finance, cyber-physical systems, and medicine, model deployers need methods to update their models against evolving threats. We will add this discussion to the updated paper.

[2] Stimberg et al. (2023) "Benchmarking robustness to adversarial image obfuscations." NeurIPS 2023

How should we set the regularization strength parameter?

We recommend selecting regularization strength based on how much tradeoff in clean accuracy (and starting attack accuracy in the case of uniform and gaussian regularization) that the model deployer can tolerate for the application.

Final Decision

Continual robust training (CRT) contributes to the intersection of adversarial robustness and continual learning by updating models to address new attacks that arise after initial training. This work addresses both the initial training and fine-tuning stages to achieve robustness to new and old attacks. 4/4 reviewers agree on acceptance with final scores of 4, 3, 3, 3. The authors provided a rebuttal, 4/4 reviewers acknowledged it, and s9Ym raised their score from 2 to 3. Weaknesses were raised concerning similarity to TRADES, the realism of a problem setting with sequential attacks, and the computational cost of fine-tuning over multiple rounds; however, these are counterbalanced by the contributions, and all reviewers are satisfied with the rebuttal. The area chair sides with acceptance in agreement with the consensus of the reviewers.