5.4

/10

Rejected5 位审稿人

最低5最高6标准差0.5

3.2

置信度

正确性2.6

贡献度2.4

表达2.6

ICLR 2025

Understand Clean Generalization and Robust Overfitting in Adversarial Training from Two Theoretical Views: Representation Complexity and Training Dynamics

Binghui Li,Yuanzhi Li

OpenReview PDF

提交: 2024-09-25更新: 2025-02-05

TL;DR

We provide a theoretical understanding why clean generalization and robust overfitting both happen in adversarial training.

摘要

Similar to surprising performance in the standard deep learning, deep nets trained by adversarial training also generalize well for unseen clean data (natural data). However, despite adversarial training can achieve low robust training error, there exists a significant robust generalization gap. We call this phenomenon the Clean Generalization and Robust Overfitting (CGRO). In this work, we study the CGRO phenomenon in adversarial training from two views: representation complexity and training dynamics. Specifically, we consider a binary classification setting with $N$ separated training data points. First, we prove that, based on the assumption that we assume there is $\operatorname{poly}(D)$-size clean classifier (where $D$ is the data dimension), ReLU net with only $O(N D)$ extra parameters is able to leverages robust memorization to achieve the CGRO, while robust classifier still requires exponential representation complexity in worst case. Next, we focus on a structured-data case to analyze training dynamics, where we train a two-layer convolutional network with $O(N D)$ width against adversarial perturbation. We then show that a three-stage phase transition occurs during learning process and the network provably converges to robust memorization regime, which thereby results in the CGRO. Besides, we also empirically verify our theoretical analysis by experiments in real-image recognition datasets.

关键词

deep learning theoryadversarial trainingclean generalization and robust overfittingrepresentation complexitytraining dynamicsfeature learning theory

评审与讨论

审稿意见

评分: 5置信度: 32024-11-02

The paper investigates the CGRO phenomenon observed in adversarial training (AT)—specifically, that a model trained by AT achieves good clean generalization (CG) while simultaneously encountering robust overfitting (RO). The paper offers a theoretical explanation for this phenomenon from two perspectives: the model’s representation complexity and the training dynamics of AT.

From the representation complexity perspective, the paper considers a binary classification problem where a classifier that achieves only CG exists. A CGRO classifier can then be constructed based on this "CG-only" classifier, requiring slightly more parameters, which implies that achieving CGRO requires a mildly higher representation complexity.

Regarding AT’s training dynamics, the paper considers a structured data setting where data consists of one patch with a true feature and other patches with spurious features. Under additional assumptions, the model trained by AT is shown to partially learn true features while memorizing spurious features, suggesting that such a model is a CGRO classifier.

优点

The paper explores an interesting phenomenon arising in AT and offers two theoretical explanations. The paper is well-written, with clear descriptions of the problem setup, definitions, assumptions, and proof sketches that effectively convey the intuition behind the theory.

缺点

There are several technical issues that I would like to discuss:

Lemma 4.5: the lemma constructs a classifer $f _ {\cal S}$ based on $f_{\rm clean}$ and claim that $f_{\cal S}$ is a CGRO classifier. However, although by construction $f_{\cal S}$ memorizes the training data ${\cal S}$ , this does not necessary imply that $f_{\cal S}$ will exhibit robust overfitting. The generalization performance of $f_{\cal S}$ depends on the choice of ${\cal S}$ . If ${\cal S}$ nicely covers the support of the data distribution ${\cal D}$ , that is given ${\cal S}$ we have $X\in \cup_ {i=1} ^{N}{\mathbb B} _ {p}(X_{i}, \delta)$ with high probability, then $f_{\cal S}$ , despite memorizing ${\cal S}$ , is still able to achieve a good robust testing accuracy. In this case, $f_{\cal S}$ is not CGRO. Could you specify conditions on ${\cal D}$ under which $f _ {\cal S}$ would exhibit robust overfitting with high probability over the randomness of drawing ${\cal S}$ from ${\cal D}$ ? Could you also provide theoretical analysis towards how the sample size $N$ as well as $\delta$ and $p$ affect ${\cal L} _ {\cal D}^{p,\delta}(f _ {\cal S})$ ?
Section 5.1, line 325: the definition of ${\mathbb A} _ {p}(X, \delta)$ is unclear. It seems that for any $X'\in{\mathbb A} _ {p}(X, \delta)$ other than $X$ , we have $X'[{\rm signal}(X)]=q w^{*}$ for some $q\in{\mathbb R}$ and $X'[j]=0$ for $j\ne {\rm signal}(X)$ . Is this correct? The construction of ${\mathbb A} _ {p}(X, \delta)$ is important for understanding the results in Theorem 5.9. I hope the authors could clarify the definition of ${\mathbb A} _ {p}(X, \delta)$ .
Theorem 5.9 : Theorem 5.9 attributes CGRO to the model trained by AT simultaneously learning the true feature and memorizing the spurious feature. Under the settings in Section 5, is it possible to show that a model trained via standard (non-adversarial) training would only learn the true feature and ignore the spurious feature? Establishing this distinction is important, as it would suggest that CGRO arises from the nature of AT itself rather than from the artificial construction of the data model.

I recommend the authors address these concerns. I'm willing to engage in a discussion with the authors and am flexible about adjusting my score.

问题

See weaknesses.

评论- Response to Review Be1H

2024-11-30

We sincerely thank the reviewer for the insightful feedback, and for highlighting the clarity of our writing. We are very glad to address the questions and suggestions raised by the reviewer, which we believe will help further refine our work. Below are our responses to the questions and suggestions raised by the reviewer.

[W1] Could you specify conditions on $\mathcal{D}$ under which $f _{\mathcal{S}}$ would exhibit robust overfitting with high probability over the randomness of drawing $\mathcal{S}$ from $\mathcal{D}$ ? Could you also provide theoretical analysis towards how the sample size $N$ as well as $\delta$ and $p$ affect $\mathcal{L} _{\mathcal{D}}^{p,\delta} (f _{\mathcal{S})$ ?

[E1] We sincerely thank the reviewer for the valuable and thoughtful suggestion. Indeed, if the data distribution $\mathcal{D}$ satisfies that the covering number of the supporting set is exponential in the data input dimension $D$ . And we also provide theoretical analysis about $\mathcal{L} _{\mathcal{D}}^{p,\delta} (f _{\mathcal{S})$ . See details in Appendix B. We propose a novel robust generalization bound that is mostly related to global flatness of loss landscape. And this Relation Is also observed in practice.

[W2] The construction of $\mathcal{A} _{p}(\boldsymbol{X},\delta)$ is important for understanding the results in Theorem 5.9. I hope the authors could clarify the definition of $\mathcal{A} _{p}(\boldsymbol{X},\delta)$ .

[E2] We would like to clarify that the perturbed range $\mathcal{A} _{p}(\boldsymbol{X},\delta)$ ensures that adversarial perturbations used to generate adversarial examples are always aligned with the meaningful signal vector $\boldsymbol{w}^{*}$ during adversarial training, which exactly simplifies the analysis. Specifically, we have

\boldsymbol{X}^{\text{adv}}[j] = \alpha(1-\gamma) y \boldsymbol{w}^{*}, j = \operatorname{signal}(\boldsymbol{X})

\boldsymbol{X}^{\text{adv}}[j] = \boldsymbol{X}[j], j \in [p] \setminus \operatorname{signal}(\boldsymbol{X})

We then derive the correctness as $y f _{\boldsymbol{W}}(\boldsymbol{X}^{\text{adv}}) = y \sum _{r=1}^{m} \langle\boldsymbol{w} _r, \boldsymbol{X}^{\text{adv}}[\operatorname{signal}(\boldsymbol{X})]\rangle^{q} + y \sum _{r=1}^{m}\sum _{j \in [P]\setminus \operatorname{signal}(\boldsymbol{X})} \langle \boldsymbol{w} _{r}, \boldsymbol{X}^{\text{adv}}[j] \rangle^{q} = \underbrace{y \sum _{r=1}^{m} \langle\boldsymbol{w} _r, \alpha(1-\gamma) y \boldsymbol{w}^{*} \rangle^{q}} _{{\alpha}^{q}(1-\gamma)^{q}\mathcal{U}} + \underbrace{y \sum _{r=1}^{m}\sum _{j \in [P]\setminus \operatorname{signal}(\boldsymbol{X})} \langle \boldsymbol{w} _{r}, \boldsymbol{X}[j] \rangle^{q}} _{\mathcal{V}}$

Thus, the model correctly classify the perturbed data if and only if $\alpha^{q}(1-\gamma)^{q}\mathcal{U} + \mathcal{V} \ge 0$ , which implies that we can analyze the perturbed data by tracking the dynamics of $\mathcal{U}$ and $\mathcal{V}$ .

[W3] Under the settings in Section 5, is it possible to show that a model trained via standard (non-adversarial) training would only learn the true feature and ignore the spurious feature? Establishing this distinction is important, as it would suggest that CGRO arises from the nature of AT itself rather than from the artificial construction of the data model.

[E3] We sincerely thank the reviewer for the insightful suggestion. As the reviewer pointed out, we can prove that a model trained using standard (non-adversarial) training will only learn the true feature and ignore the spurious feature. We will include this statement in the revision of our paper.

审稿意见

评分: 6置信度: 42024-11-04

This paper studies the Clean Generalization and Robust Overfitting (CGRO) phenomenon in adversarial training from two theoretical perspectives: representation complexity and training dynamics. Using ReLU networks, they show that CGRO classifiers require only poly(D)+ND parameters while robust classifiers need exponential complexity. They also analyze training dynamics using a two-layer CNN on structured data, showing how the network converges to a state of partial true feature learning and exact spurious feature memorization.

优点

The analysis is particularly noteworthy for its comprehensive approach - it not only establishes complexity bounds showing that CGRO classifiers require only poly(D)+ND parameters while robust classifiers need exponential complexity, but also provides insights into how this phenomenon emerges during training through a clear three-stage analysis of the training dynamics. The theoretical framework is well-supported by experimental validation on both real and synthetic datasets, demonstrating practical relevance. Most importantly, the paper's findings help explain a commonly observed phenomenon in adversarial training, offering valuable insights into why models often achieve CGRO in practice. The technical depth combined with practical significance makes this work a meaningful contribution to understanding adversarial robustness.

缺点

I have two major concern regarding the representation complexity and training dynamics, respectively.

Representation Complexity:

The central result at Line 255 states:

Clean Classifier (poly(D)) ≲ CGRO Classifier(poly(D)+ND) ≪ Robust Classifier (Ω(exp(D)))

However, the poly(D) complexity of clean classifiers is simply assumed in Assumption 4.3 rather than proven. This is problematic because if clean classification also requires exp(D) parameters, the entire complexity hierarchy becomes meaningless. The authors need to either:

Prove that clean classifiers can indeed be approximated by neural networks with poly(D) parameters, even though it is trivial.
Cite existing results that establish this fact, even though this is a well-known classical result.

I assume this is only a clarity issue, because the authors think this is too trivial or too classical and not worth discussing?

Training Dynamics:

While the analysis clearly shows that training leads to partial true feature learning and exact spurious feature memorization, there is a gap between this result and CGRO that needs to be addressed. Specifically, the paper should analyze:

For an unseen samples, what is the probability that the model will correctly classify it? Need analysis of the condition $α^q U + V ≥ 0$ for a new sample.
For adversarial examples generated from this unseen sample, what is the probability of misclassification? Need analysis of the condition $α^q U + V < 0$ for adversarial example.

These probability analyses would help complete the connection between the training dynamics results and the CGRO phenomenon.

问题

See weakness.

评论- Response to Review bQGX (2/2)

2024-11-30

[W2] Specifically, the paper should analyze:

For an unseen samples, what is the probability that the model will correctly classify it? Need analysis of the condition $\alpha^q U + V \geq 0$ for a new sample.
For adversarial examples generated from this unseen sample, what is the probability of misclassification? Need analysis of the condition $\alpha^q U + V < 0$ for adversarial example.

These probability analyses would help complete the connection between the training dynamics results and the CGRO phenomenon.

[E2] Theorem E.12 We sincerely thank the reviewer for the insightful suggestion. We would like to point out that our Theorem E.12 gives the probability analysis as the reviewer suggested. See details in the proof of Theorem E.12 (Appendix E). Here, we briefly provide high-level intuition behind the probability analysis as follows:

For an unseen clean data $(\boldsymbol{X}, y) \sim \mathcal{D}$ , we have $\mathbb{P} _{(\boldsymbol{X},y)\sim \mathcal{D}}[y f _{\boldsymbol{W}^{(T)}}(\boldsymbol{X}) < 0] = \mathbb{P} _{(\boldsymbol{X},y)\sim \mathcal{D}}[\alpha^{3} \sum _{r=1}^{m} (u _{r}^{(T)})^{3} + y \sum _{r=1}^{m}\sum _{j \in [P]\setminus \operatorname{signal}(\boldsymbol{X})} \langle \boldsymbol{w} _{r}^{(T)}, \boldsymbol{X}[j] \rangle ^{3} < 0]$ $\le \mathbb{P} _{(\boldsymbol{X},y)\sim \mathcal{D}}[ \sum _{r=1}^{m}\sum _{j \in [P]\setminus \operatorname{signal}(\boldsymbol{X})} \langle \boldsymbol{\beta} _{r}^{(T)}, \boldsymbol{X}[j] \rangle^{3} \ge \tilde{\Omega}(1) ] \le \operatorname{exp}(-\frac{\tilde{\Omega}(1)}{\sigma^{6}\sum _{r=1}^{m}\|\boldsymbol{\beta} _{r}^{(T)}\| _{2}^{6}}) \le O(\frac{1}{\operatorname{poly}(d)}) = o(1).$

It shows that, after adversarial training, the network learns the partial true feature, which ensures to correctly classify unseen clean data with high probability.

For the adversarial example $\boldsymbol{X} + \boldsymbol{\xi}$ generated from the unseen data $(\boldsymbol{X}, y) \sim \mathcal{D}$ and adversarial attack $\mathcal{A}:\mathbb{R}^{D}\rightarrow \mathbb{R}^{D}$ , we have $\mathbb{P} _{(\boldsymbol{X},y)\sim \mathcal{D}}[\operatorname{min} _{\|\boldsymbol{\xi}\| _{2}\le \delta}y f _{\boldsymbol{W}^{(T)}}(\boldsymbol{X}+\boldsymbol{\xi}) < 0] \ge \mathbb{P} _{(\boldsymbol{X},y)\sim \mathcal{D}}[y f _{\boldsymbol{W}^{(T)}}(\mathcal{A}(\boldsymbol{X})) < 0] = \mathbb{P} _{(\boldsymbol{X},y)\sim \mathcal{D}}[\alpha^{3}\sum _{r=1}^{m}(u _{r}^{(T)})^{3}(1-\gamma)^{3} + y \sum _{r=1}^{m}\sum _{j \in [P]\setminus \operatorname{signal}(\boldsymbol{X})} \langle \boldsymbol{w} _{r}^{(T)}, \boldsymbol{X}[j] \rangle^{3} < 0]$ $\ge \frac{1}{2} \mathbb{P} _{(\boldsymbol{X},y)\sim \mathcal{D}}[ \|\sum _{r=1}^{m}\sum _{j \in [P]\setminus \operatorname{signal}(\boldsymbol{X})} \langle \boldsymbol{\beta} _{r}^{(T)}, \boldsymbol{X}[j] \rangle^{3}\| \ge \tilde{\Omega}((1-\gamma)^{3})] \ge \frac{1}{2}(1 - \frac{\tilde{O}(d)}{2^{d}}) = \frac{1}{2} - o(1).$

It shows that, with probability at least $1/2$ , the network learner fails to classify the adversarial examples generated from unseen clean data.

Reference

[1] Allen-Zhu, Z., Li, Y., & Liang, Y. (2019). Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32.

[2] Lv, B., & Zhu, Z. (2022). Implicit bias of adversarial training for deep neural networks. In International Conference on Learning Representations.

[3] Allen-Zhu, Z., & Li, Y. (2023, July). Backward feature correction: How deep learning performs deep (hierarchical) learning. In The Thirty Sixth Annual Conference on Learning Theory (pp. 4598-4598). PMLR.

[4] Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndi´ c, N., Laskov, P., Giacinto, G. and Roli, F. (2013). Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13. Springer.

[5] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

[6] Goodfellow, I. J., Shlens, J. and Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

[7] Li, B., Jin, J., Zhong, H., Hopcroft, J., & Wang, L. (2022). Why robust generalization in deep learning is difficult: Perspective of expressive power. Advances in Neural Information Processing Systems, 35, 4370-4384.

评论- Response to Review bQGX (1/2)

2024-11-30

We sincerely thank the reviewer for the positive and insightful feedback, and for highlighting the strength of our theoretical contributions and the clarity of our writing. We are very glad to address the questions and suggestions raised by the reviewer, which we believe will help further refine our work. Below are our responses to the questions and suggestions raised by the reviewer.

[W1] However, the $\operatorname{poly}(D)$ complexity of clean classifiers is simply assumed in Assumption 4.3 rather than proven. This is problematic because if clean classification also requires $\operatorname{exp}(D)$ parameters, the entire complexity hierarchy becomes meaningless. The authors need to either: 1.Prove that clean classifiers can indeed be approximated by neural networks with $\operatorname{poly}(D)$ parameters, even though it is trivial.

Cite existing results that establish this fact, even though this is a well-known classical result.

[E1] We sincerely thank the reviewer for the valuable and thoughtful suggestion. As the reviewer mentioned, we would like to clarify that the main theorem relies on Assumption 4.3, where we apply a teacher-student learning framework. This framework assumes the existence of a ground-truth neural network of moderate size that can achieve perfect clean classification. This teacher-student setup is widely used in deep learning theory (e.g., [1][2][3]).

We also emphasize that the polynomial-size assumption arises from empirical observations, where mildly over-parameterized networks achieve good clean classification performance but poor robust classification performance (e.g., [4][5][6]), rather than from a rigorous mathematical proof. Furthermore, providing a lower bound for the representation complexity required to achieve robust generalization under Assumption 4.3 is highly non-trivial, due to the complex decision boundary induced by the ground-truth polynomial-size network. To address this, we build on the technique from [7] to establish an exponential lower bound in the worst case.

审稿意见

评分: 5置信度: 32024-11-04

The paper studies the robust overfitting phenomenon in adversarial training, where despite the model's generalization on clean data the robust accuracy differs significantly between training and test samples. The paper aims to show the existence of classifiers with clean generalization and robust overfitting (CGRO) and section 4 includes Theorem 4.4 with a polynomial upper-bound for CGRO classifiers and Theorem 4.7 on the exponentially growing required number of ReLU network parameters to achieve robust generalization. Next, the paper studies a structured-data setting and analyzes the training dynamics of a convolutional network, where the network is shown to converge to robust memorization. A few numerical results on MNIST and CIFAR10 are presented in Section 6.

优点

1- The paper studies a relevant research problem on robust overfitting in adversarial training. The paper presents several theoretical results on the phenomenon and the possibility of robust overfitting and clean generalization in adversarial learning.

缺点

1- The paper's presentation of the theoretical results can be significantly improved. One confusion I have about the theory discussion in the paper is the order-related notations and their meaning in the results. Beginning in Definition 3.4, the authors clarify in Remark 3.5 that the notations are with respect to data dimension $D$ . However, I could not understand how the authors treat the data dimension $D$ as a parameter that they can change and perform asymptotic analysis for. Isn't it the case that the data dimension $D$ is fixed and the asymptotic analysis should focus on the model complexity (number of parameters) or the sample size for asymptotic analysis?

Continuing about my confusion with the asymptotic analysis on the data dimension, let me refer to Theorem 4.4 where the authors make Assumptions 4.1 -4.3. Looking at these assumptions, they seem quite dependent on the dimension of $X$ . For example, if one changes the input dimension, should not the constants $R$ (Assumption 4.1) and $\delta$ (Assumption 4.2) be updated as the dimension $D$ changes? Also, in the statement of Theorem 4.4, how can the dimension $D$ change while the theorem sims to fix the input space $mathcal{X}$ ? If so, then the function $f_{CGRO}$ will change with dimension $D$ and it is not a fixed function to perform CGRP analysis for.

As I discussed above, the authors' asymptotic analysis in terms of the input dimension seems quite confusing, because it changes the input $X$ , the classifier architecture $f$ , and also affects the assumptions altogether. I may have missed some details about the authors' analysis and will look forward to discussing the analysis in the discussion period.

2- The numerical results seem rather insufficient. I understand that this is a theory work, but the results in Figure 2 and Table 1 does not support the authors' claim. First of all, the input dimension does not change when the authors vary the model size, therefore the experiments do not analyze the asymptotic scenario when $D$ grows. Also, the improvement in robust test accuracy seems to follow from the improvement in clean test accuracy as the model keeps growing in capacity and hence can achieve higher test accuracy on both clean and perturbed data.

3- Minor comment: In my opinion, copying a figure from another paper (Figure 1 in the introduction) is inappropriate in a top-level machine learning paper. It would have been better if the authors' replicated the results on a different dataset or architecture and while giving credit to the cited paper, they presented their own computed version of the figure in the submission.

问题

1- I explained my confusions with the authors' asymptotic analysis in data dimension $D$ . Can the authors clarify on the questions I raised on the analysis?

2- Can the authors provide more numerical results where the clean test accuracy remains stable while adversarial test accuracy improves where the model complexity grows?

评论- Response to Review iwSa

2024-11-30

We sincerely thank the reviewer for the thoughtful feedback. We are very glad to address the questions and suggestions raised by the reviewer, which we believe will help further refine our work. Below are our responses to the questions and suggestions raised by the reviewer.

[Q1] I explained my confusions with the authors' asymptotic analysis in data dimension $D$ . Can the authors clarify on the questions I raised on the analysis?

[A1] We would like to clarify that our asymptotic analysis in data dimension $D$ is widely used in the literature of deep learning theory, as demonstrated in papers such as [1][2][3] in our related work section. We would also like to point out that the constant property of $R$ is reasonable, since the range of image pixels is typically $0$ – $255$ , while the data dimension can be very large. Additionally, our assumption about the perturbation radius $\delta$ is reasonable, because, for real datasets, different classes tend to be well-separated, while the perturbation radius is often much smaller than the separation distance, as empirically observed in the work of [4].

[Q2] Can the authors provide more numerical results where the clean test accuracy remains stable while adversarial test accuracy improves where the model complexity grows?

[A2] We would like to point out that our experimental results show that clean test accuracy remains stable while adversarial test accuracy improves as model complexity grows. See Figure 1 (a) (b) and Table 1: for the MNIST dataset, when the model size increases from $12$ to $16$ , robust test accuracy significantly improves (from $77.96\%$ to $83.43\%$ ), while the change in clean test accuracy is minimal (from $95.06\%$ to $94.85\%$ ); for the CIFAR-10 dataset, when the model size increases from $5$ to $10$ , robust test accuracy significantly improves (from $46.25\%$ to $50.08\%$ ), while the change in clean test accuracy is minimal (from $85.83\%$ to $86.05\%$ ).

[Q3] Minor comment: In my opinion, copying a figure from another paper (Figure 1 in the introduction) is inappropriate in a top-level machine learning paper. It would have been better if the authors' replicated the results on a different dataset or architecture and while giving credit to the cited paper, they presented their own computed version of the figure in the submission.

[A3] We sincerely thank the reviewer for the suggestion. We have updated our experimental results in Figure 1 of the revised paper, where we apply the WideResNet architecture [5] to run the standard adversarial training algorithm on the CIFAR-10 dataset. The results show that CGRO occurs during adversarial training.

Reference

[1] Bubeck, S., & Sellke, M. (2021). A universal law of robustness via isoperimetry. Advances in Neural Information Processing Systems, 34, 28811-28822.

[2] Li, B., Jin, J., Zhong, H., Hopcroft, J., & Wang, L. (2022). Why robust generalization in deep learning is difficult: Perspective of expressive power. Advances in Neural Information Processing Systems, 35, 4370-4384.

[3] Allen-Zhu, Z. and Li, Y. (2023b). Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. In The Eleventh International Conference on Learning Representations.

[4] Yang, Y.-Y., Rashtchian, C., Zhang, H., Salakhutdinov, R. R. and Chaudhuri, K. (2020). A closer look at accuracy vs. robustness. Advances in neural information processing systems, 33 8588–8601.

[5] Zagoruyko, S. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.

审稿意见

评分: 6置信度: 22024-11-04

This paper analyses the phenomenon of clean generalization and robust overfitting (CGRO) from two perspectives: representation complexity and training dynamics. They first show that achieving a robust classifier on test distribution requires many more parameters than achieving a classifier with CGRO. Then, they further investigate the optimization dynamics of adversarial training and show that the weights of the classifier will finally move to the CGRO regime. Their theory about training dynamics is based on a specific data structure: Patch Data Distribution. Some experimental results are provided to validate the CGRO phenomenon and the three-stage dynamics.

优点

This paper provides two theoretical perspectives to understand the CGRO phenomenon. The theory is solid in general although it needs some assumptions of the data structure. The writing is clear and they provide detailed descriptions of the notions, setup, and related background.

缺点

The model representation perspective shows that the models need more parameters to achieve robust generalization than CGRO. But how do we validate that the number of parameters of the deep networks such as ResNet is in this CGRO range?
The data structure is not very realistic. I think a more realistic assumption would be the data lies in a low-dimensional subspace or manifold. The assumption that only a pixel-space patch has useful signals would be too tough in my opinion.
Although the authors validate the three stages of training dynamics in the synthetic structure data, they don't validate it on real data. Could you provide some experimental results to show it?

问题

Please refer to the Weaknesses part.

评论- Response to Review hyTi

2024-11-30

We sincerely thank the reviewer for the support and insightful feedback, and for highlighting the strength of our theoretical contributions and the clarity of our writing. We are very glad to address the questions and suggestions raised by the reviewer, which we believe will help further refine our work. Below are our responses to the questions and suggestions raised by the reviewer.

[W1] The model representation perspective shows that the models need more parameters to achieve robust generalization than CGRO. But how do we validate that the number of parameters of the deep networks such as ResNet is in this CGRO range?

[E1] We would like to clarify that, from the perspective of representation complexity, models require more parameters to achieve robust generalization than CGRO. However, we do not provide an explicit method to validate that the number of parameters in practical deep networks falls within this CGRO range, as our lower bound applies in the worst-case scenario. We also believe that exploring the range of parameter counts to determine whether a network is in the CGRO regime is an important and interesting avenue for future research.

[W2] The data structure is not very realistic. I think a more realistic assumption would be the data lies in a low-dimensional subspace or manifold. The assumption that only a pixel-space patch has useful signals would be too tough in my opinion.

[E2] We would like to clarify that the patch structure we use can be seen as a simplification of real-world vision-recognition datasets. Specifically, images are divided into signal patches that are meaningful for classification, such as the whisker of a cat or the nose of a dog, and noisy patches, like the uninformative background of a photo. Our assumption about patch data can also be generalized to situations where there exists a set of meaningful patches. However, analyzing the learning process in such cases would complicate our explanation and obscure the main idea we wish to present. Therefore, we focus on the case of a single meaningful patch in our work.

[W3] Although the authors validate the three stages of training dynamics in the synthetic structure data, they don't validate it on real data. Could you provide some experimental results to show it?

[E3] We sincerely thank the reviewer for the valuable and thoughtful suggestion. We would like to point out that, for real data and real models, it is difficult to rigorously define true feature learning and spurious feature learning. As a result, verifying the phase transition in real-world experiments is challenging, and this issue is commonly encountered in feature learning theory papers, such as [1][2][3][4]. We also believe that validating this on real data is an important and promising direction for future research.

Reference

[1] Allen-Zhu, Z. and Li, Y. (2023b). Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. In The Eleventh International Conference on Learning Representations.

[2] Chidambaram, M., Wang, X., Wu, C. and Ge, R. (2023). Provably learning diverse features in multi view data with midpoint mixup. In International Conference on Machine Learning. PMLR.

[3] Chen, Z., Deng, Y., Wu, Y., Gu, Q., & Li, Y. (2022). Towards understanding mixture of experts in deep learning. arXiv preprint arXiv:2208.02813.

[4] Zou, D., Cao, Y., Li, Y., & Gu, Q. (2023, July). The benefits of mixup for feature learning. In International Conference on Machine Learning (pp. 43423-43479). PMLR.

审稿意见

评分: 5置信度: 42024-11-06

The paper investigates the phenomenon of clean generalization and robust overfitting (CGRO) within the context of adversarial training. Initially, the authors demonstrate that, from the standpoint of model expressive power, achieving a CGRO classifier is significantly easier compared to robust generalization, which, in the worst-case scenario, necessitates exponentially large model capacity. Subsequently, the paper examines the training dynamics of adversarial training by considering a patch data distribution, employing a three-stage analysis technique to explore the feature learning process involved in adversarial training. Finally, numerical experiments are conducted to validate the theoretical insights presented in the study.

优点

The paper is well-structured and easy to follow.
The three-stage analysis of the training dynamics of adversarial training, based on the assumption of a patch data distribution, is both intriguing and innovative.
The theoretical findings are well-supported by the numerical experiments, demonstrating a strong correlation between the two.

缺点

The paper falls short in clarifying how its results contribute to understanding why CGRO occurs during adversarial training. The paper raises this broad question in the introduction but lacks a detailed explanation of how its findings can offer insights into addressing it.
From the perspective of representation complexity to illustrate the CGRO phenomenon, the main theorem relies on Assumption 4.3, which posits the existence of a clean ReLU network classifier of polynomial size. However, this assertion pertains to a factual claim and cannot be directly treated as an assumption. Instead, it would be more appropriate to formulate it as a theorem accompanied by a rigorous mathematical proof. This would strengthen the foundation of the theoretical results presented in the paper.
From the perspective of training dynamics to elucidate the CGRO phenomenon, the analysis depends on the assumption that the perturbation radius 𝛿 is only marginally smaller than the signal norm 𝛼. The theoretical results are derived specifically under this condition for the perturbation radius. However, in practical scenarios, the perturbation radius can be chosen arbitrarily. The paper does not adequately address whether the same theoretical results can be achieved in this more generalized context, which leaves an important gap in understanding the robustness of the findings. Further exploration of this aspect would enhance the comprehensiveness of the study.

问题

See the weakness.

评论- Response to Review kzDK (1/2)

2024-11-30

We sincerely thank the reviewer for the encouraging and insightful feedback, and for highlighting the strength of our theoretical contributions. We are very glad to address the questions and suggestions raised by the reviewer, which we believe will help further refine our work. Below are our responses to the questions and suggestions raised by the reviewer.

[W1] The paper falls short in clarifying how its results contribute to understanding why CGRO occurs during adversarial training. The paper raises this broad question in the introduction but lacks a detailed explanation of how its findings can offer insights into addressing it. [E1] We would like to clarify that we provide theoretical insights to understand why CGRO occurs from two perspectives: expressive power and training dynamics.

From the perspective of expressive power, we show that only polynomial representation complexity can achieve a CGRO classifier (Theorem 4.4), while an exactly robust classifier requires exponentially large representation complexity in the worst case (Theorem 4.7). This implies that the CGRO phenomenon may stem from the limits of expressive power in practice.
From the perspective of training dynamics, our theoretical analysis of the feature learning process suggests that, with moderate over-parameterization, the network model trained via adversarial training learns partial true features but exactly memorizes training-data-specific random spurious features (Theorem 5.9). This leads to perfect clean generalization and significant robust overfitting.

We believe that our theoretical insights will help in designing better adversarially robust algorithms to address robust learning problems.

[W2] From the perspective of representation complexity to illustrate the CGRO phenomenon, the main theorem relies on Assumption 4.3, which posits the existence of a clean ReLU network classifier of polynomial size. However, this assertion pertains to a factual claim and cannot be directly treated as an assumption. Instead, it would be more appropriate to formulate it as a theorem accompanied by a rigorous mathematical proof. This would strengthen the foundation of the theoretical results presented in the paper.

[E2] We sincerely thank the reviewer for the valuable and thoughtful suggestion. As the reviewer mentioned, we would like to clarify that the main theorem relies on Assumption 4.3, where we apply a teacher-student learning framework. This framework assumes the existence of a ground-truth neural network of moderate size that can achieve perfect clean classification. This teacher-student setup is widely used in deep learning theory (e.g., [1][2][3]).

We also emphasize that the polynomial-size assumption arises from empirical observations, where mildly over-parameterized networks achieve good clean classification performance but poor robust classification performance (e.g., [4][5][6]), rather than from a rigorous mathematical proof. Furthermore, providing a lower bound for the representation complexity required to achieve robust generalization under Assumption 4.3 is highly non-trivial due to the complex decision boundary induced by the ground-truth polynomial-size network. To address this, we build on the technique from [7] to establish an exponential lower bound in the worst case.

评论- Response to Review kzDK (2/2)

2024-11-30

[W3] From the perspective of training dynamics to elucidate the CGRO phenomenon, the analysis depends on the assumption that the perturbation radius $\delta$ is only marginally smaller than the signal norm $\alpha$ . The theoretical results are derived specifically under this condition for the perturbation radius. However, in practical scenarios, the perturbation radius can be chosen arbitrarily. The paper does not adequately address whether the same theoretical results can be achieved in this more generalized context, which leaves an important gap in understanding the robustness of the findings. Further exploration of this aspect would enhance the comprehensiveness of the study.

[E3] We sincerely thank the reviewer for the suggestion. We would like to point out that, for real datasets, different classes tend to be well-separated, while the perturbation radius is often much smaller than the separation distance, as empirically observed in the work of [8]. Under our theoretical setup for training dynamics analysis, the separation between the positive and negative classes is roughly of the order of the signal norm $\alpha$ . Thus, the assumption that the perturbation radius $\delta$ is only marginally smaller than the signal norm $\alpha$ is reasonable. We also acknowledge that exploring more generalized setups is an important and interesting future direction for our work.

Reference

[1] Allen-Zhu, Z., Li, Y., & Liang, Y. (2019). Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32.

[2] Lv, B., & Zhu, Z. (2022). Implicit bias of adversarial training for deep neural networks. In International Conference on Learning Representations.

[5] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

[6] Goodfellow, I. J., Shlens, J. and Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

[8] Yang, Y.-Y., Rashtchian, C., Zhang, H., Salakhutdinov, R. R. and Chaudhuri, K. (2020). A closer look at accuracy vs. robustness. Advances in neural information processing systems, 33 8588–8601.

2024-12-01

I appreciate the author's response; however, I remain unconvinced and will maintain my score. While I acknowledge the author's effort to explain the two perspectives in studying the CGRO phenomenon, I still find their contributions to understanding this phenomenon lacking and unclear from their theoretical findings.

From the perspective of expressive power, the authors state that Assumption 4.3 is based on empirical observations. However, this does not sufficiently demonstrate the truth of the assumption itself. Empirical observations alone cannot validate theoretical statements without further rigorous justification.

Regarding training dynamics, if the authors aim to align their theoretical results with the assertion that CGRO arises from limitations in expressive power, they must provide additional evidence demonstrating that highly over-parameterized models can indeed achieve robust generalization. If they fail to establish this link, it suggests that the issue may not stem from expressive power but rather from the challenges in the optimization process.

Overall, the author's arguments concerning the illustration of the CGRO phenomenon lack clarity, and their theoretical findings do not rigorously support their claims. Therefore, I feel justified in maintaining my score.

评论- Response to Review kzDK

2024-12-01

Dear Reviewer kzDK,

Thanks very much for the reply. We would like to address the reviewer’s concerns as follows.

[Q1] if the authors aim to align their theoretical results with the assertion that CGRO arises from limitations in expressive power, they must provide additional evidence demonstrating that highly over-parameterized models can indeed achieve robust generalization.

[A1] We would like to point out that our experimental results show that clean test accuracy remains stable, while adversarial test accuracy improves as model complexity grows. See Figure 1 (a) and (b) and Table 1: for the MNIST dataset, when the model size increases from $12$ to $16$ , robust test accuracy significantly improves (from $77.96\%$ to $83.43\%$ ), while the change in clean test accuracy is minimal (from $95.06\%$ to $94.85\%$ ); for the CIFAR-10 dataset, when the model size increases from $5$ to $10$ , robust test accuracy significantly improves (from $46.25\%$ to $50.08\%$ ), while the change in clean test accuracy is minimal (from $85.83\%$ to $86.05\%$ ). These real-world experimental results provide empirical evidence demonstrating that highly over-parameterized models can indeed achieve robust generalization.

[Q2] If they fail to establish this link, it suggests that the issue may not stem from expressive power but rather from the challenges in the optimization process.

[A2] We would like to clarify our theoretical findings regarding the CGRO phenomenon. First, from the perspective of expressive power, we show that achieving the CGRO classifier requires only a linear number of extra parameters based on the clean classifier, whereas robust generalization may require exponentially large models in the worst case. Although we observe that increased over-parameterization can improve robust performance, an exponentially large number of parameters is practically unattainable. Therefore, we focus on training dynamics under a moderately over-parameterized setup, aiming to explore what occurs during adversarial training that leads the network to the CGRO classifier. By rigorously analyzing the feature learning process in adversarial training, we further suggest that a network with a moderate size (smaller than an exponentially large model) will learn some true features but also memorize training-data-dependent random noise, which results in the CGRO phenomenon.

We sincerely thank the reviewer for the thoughtful and positive feedback and for the quick response.

AC 元评审

2024-12-19

This paper studies the Clean Generalization and Robust Overfitting (CGRO) of neural networks under adversarial training. It studies the CGRO from two views: representation complexity and training dynamics. A CGRO Classifier is defined as: clean test error is small, robust training error is small, but robust test error is large. The socre are quite mixed and the AC also checked this paper.

Representation complexity (Section 4): the main result shows that clean classifier requires $ploy(D)$ parameters, and CGRO requires $poly(D) + O(ND)$ parameters, while robust classifer requires $exp(D)$ parameters. It shows the separation between clean/robust classifiers from approximation theory, i.e.,

Clean Classifier (poly(D)) ≲ CGRO Classifier(poly(D)+ND) ≪ Robust Classifier (Ω(exp(D)))

However, the results heavily depend on Assumption 4.3 rather than proven, as mentioned by two reviewers as well. Though it may observe from empirical results, it requires a careful theoretical demonstration. The AC suggests the authors to check the following reference, which would be useful to remove Assumption 4.3.

[1] Shi, Z., Liu, F., Cao, Y., & Suykens, J. A. (2024). Can overfitted deep neural networks in adversarial training generalize?--An approximation viewpoint. arXiv preprint arXiv:2401.13624.

In training dynamics, the results show that adversarial training makes the neural network conducts partial true feature learning and exactly memorizes spurious features. That means, after adversarial training, the network correctly classifies unseen clean data with high probability in Theorem E.12; but fails to classify the adversarial examples generated from unseen clean data with probability at least 1/2.

I found that the statement is incomplete as the comparison to classifier under standard training is missing. I agree with the reviewer on this point: "a model trained via standard (non-adversarial) training would only learn the true feature and ignore the spurious feature? Establishing this distinction is important, as it would suggest that CGRO arises from the nature of AT itself rather than from the artificial construction of the data model."

After reading the paper and the reviewers' comments, I think this paper is good in general. If the above two points can be addressed, this paper will be very nice. Given the current version, it is difficult for the AC to make a clear decision personally for a conference submission. Considering the high-competitive venue of ICLR this year, I have to reject this submission finally. I hope the authors will be not dissapointed about this decision and will address these two issues to make the work perfect based on the above feedback as well as the reviewers' comments.

审稿人讨论附加意见

After discussion, some issues are unaddressed, e.g., Assumption 4.3 and the training dynamics when compared to classifier under standard training.

最终决定Reject

2025-01-22

Reject