Stability and Generalization of Adversarial Training for Shallow Neural Networks with Smooth Activation
Abstract
Reviews and Discussion
The paper studies generalization bounds for two-layer networks trained using several variants of adversarial training. The main results are a bound on the stability of the algorithm, which in turn provides a bound on the generalization and the robust accuracy.
Strengths
- As far as I know, the results presented in this paper are novel. The approach of studying the generalization capabilities of adversarial training through the lens of stability seems like a promising direction with interesting results.
- The paper studies 2-layer networks (where the weights of the second layer are fixed), which go beyond current research in this field, and potentially beyond the NTK/kernel regime.
- The results for the Moreau envelope are also novel in this context AFAIK, and provide tighter bounds than the GD or SGD analysis (although they are computationally less efficient).
Weaknesses
Although the results of this paper are interesting, there are some issues with the proofs and with some aspects of the problem setting, specifically the amount of over-parameterization. In more detail:
- I think the assumption on m is a bit misleading. If I understand correctly, the authors' assumption means that the change of the weights when doing GD is very small compared to the number of neurons. This in turn means that slightly changing the loss (by a single sample) and starting from the same initialization will lead to almost the same weights, hence the stability result. However, I don't think this assumption makes sense in more practical settings. It basically means that the number of iterations (or the learning rate) is so small that most weights barely move from their initialization. This is also emphasized in Theorem B.1, where the change has an exponential dependence. This means that even if m is slightly smaller than required, the difference grows exponentially and the algorithm is very unstable. This issue is not addressed in the paper and is only briefly mentioned in the proof intuition.
- The bounds in Theorem 3.2 don't make sense in some regimes: below a certain threshold the right-hand side of the bounds is negative. Do the bounds only hold above that threshold? If so, this seems to mean that the number of iterations might be very small compared to the learning rate. This also seems relevant to Theorem 3.6.
- There is an issue with the value of T in Corollary 3.3: which value of T gives the stated rate? If the relevant quantity is a constant, then it seems we would need a different choice of T to get a constant term rather than the one stated, no?
- Line 509: I don't understand this; how is Eq. (4) related to $\epsilon_{\mathrm{gen}}$?
I will be happy to see the authors' response to these issues and will consider raising my score accordingly.
Questions
- Given a target error, how should we set the parameters of the problem (i.e., the step size, number of iterations, width, etc.) to get generalization error and robust error smaller than the target?
- What happens when m is chosen independently of T, although we allow it to depend on n, i.e., the regime that is overparameterized relative to the number of samples?
Limitations
The authors discuss the limitations of their results.
W1: Regarding the assumption on m
The reviewer has a correct understanding of why stability holds, i.e., the change of the weights during GD is very small compared to the number of neurons. We utilize the weak convexity of the robust loss (see Lemma 4.1), among other ingredients, to establish stability (see Theorem 4.3).
Note that training neural networks (i.e., implementing the ERM rule) is known to be computationally hard, even for a two-layer network with only three hidden nodes. This has been known since the early 90s. Recent results in deep learning theory focus on two-layer neural networks that are greatly overparameterized (the width m is large) and involve a certain scale of initialization. Under these settings, we can bound the training error of GD after T iterations. These are the state-of-the-art results in the computational theory of deep learning.
Now, the important point we want to make here is that the key to the analysis is ensuring that the weights of the trained network do not change by much from initialization. One can then show that training neural networks is similar to the dynamics of kernel methods. This is the dominant framework for computational learning-theoretic guarantees for training neural networks and is aptly called the "lazy regime" or the "neural tangent kernel" (NTK) setting.
The early works that laid the foundations for this theory are [r1, r2, r3].
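For concreteness, here is a minimal Python/NumPy sketch of the kind of model and training procedure discussed above: a wide two-layer network with a smooth activation and a frozen second layer, trained by GD on the first-layer weights only. The particular width, data, softplus activation, and $1/\sqrt{m}$ scaling are illustrative assumptions rather than the paper's exact parameterization; the printed distance from initialization is the quantity that stays small in the lazy regime.

```python
import numpy as np

# Minimal sketch of the setting discussed above: a wide two-layer network with a
# smooth activation and a frozen second layer, trained by GD on the first layer.
# Width, data, scaling, and the softplus activation are illustrative assumptions.
m, d, n = 4096, 20, 200                      # width, input dimension, sample size
rng = np.random.default_rng(0)

W0 = rng.normal(0.0, 1.0, (m, d))            # first-layer initialization
a = rng.choice([-1.0, 1.0], m) / np.sqrt(m)  # frozen second layer at 1/sqrt(m) scale

def act(z):                                  # smooth activation (softplus)
    return np.log1p(np.exp(z))

def d_act(z):                                # its derivative (the sigmoid)
    return 1.0 / (1.0 + np.exp(-z))

def predict(W, X):                           # f(x) = sum_j a_j * act(<w_j, x>)
    return act(X @ W.T) @ a

def logistic_loss(W, X, y):
    return np.mean(np.log1p(np.exp(-y * predict(W, X))))

def grad(W, X, y):                           # gradient w.r.t. the trainable first layer
    g_out = -y / (1.0 + np.exp(y * predict(W, X)))   # d loss / d f, shape (n,)
    S = d_act(X @ W.T) * a                           # shape (n, m)
    return (S * g_out[:, None]).T @ X / len(y)       # shape (m, d)

X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.sign(rng.normal(size=n))

eta, T = 0.5, 100                            # step size and number of GD iterations
W = W0.copy()
for _ in range(T):
    W -= eta * grad(W, X, y)

# In the lazy regime the trained weights stay close to W0, which is what the
# stability argument exploits.
print("train loss:", logistic_loss(W, X, y))
print("distance from init:", np.linalg.norm(W - W0))
```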
W2: Regarding the bounds in Theorem 3.2
Yes, for that result of ours to make sense we do need that condition. However, we can also give a result of the same form as the expression in equation (4) of [r4], with a different factor; in that case we do not need the condition.
W3: Regarding the value of T in Corollary 3.3
It is possible that the reviewer overlooked one of the conditions in the corollary.
In order to obtain the stated rate, we need a certain quantity to be a small positive real number, and equivalently (through its definition) a few related quantities to be small positive real numbers. Note that with the value of T selected in Corollary 3.3, plugging it in, we equivalently require these quantities to be small positive real numbers. Under the assumptions made in Corollary 3.3, we can choose the remaining parameter to be a suitably small positive constant so that the three inequalities hold.
W4: Regarding line 509
The definition of the generalization gap is the robust test loss minus the robust training loss, $\epsilon_{\mathrm{gen}} = L_{\mathrm{rob}}(W_t) - \hat{L}_{\mathrm{rob}}(W_t;S)$,
so we can get line 509 by subtracting $\hat{L}_{\mathrm{rob}}(W_t;S)$ from both sides of Eq. (4); the left-hand side then becomes the generalization gap.
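Schematically, writing the right-hand side of Eq. (4) as $B$ (we do not reproduce the exact expression here), the rearrangement is:

```latex
% Schematic form of the step at line 509; B denotes the right-hand side of Eq. (4).
L_{\mathrm{rob}}(W_t) \;\le\; \hat{L}_{\mathrm{rob}}(W_t;S) + B
\quad\Longrightarrow\quad
\epsilon_{\mathrm{gen}} \;=\; L_{\mathrm{rob}}(W_t) - \hat{L}_{\mathrm{rob}}(W_t;S) \;\le\; B.
```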
Q1: Given a target error, how should we set the parameters of the problem to get generalization error and robust error below that target?
The relationship is a bit involved since our bounds are post hoc and not a priori. In other words, our bound is instance-specific and depends on the given training data and the output of the algorithm. This is in stark contrast to bounds based on uniform convergence, which hold simultaneously for all hypotheses in the class; there you can ask how many samples or iterations you need to ensure a given suboptimality. However, note that uniform convergence bounds can be overly pessimistic or even vacuous. Anyhow, even though there is not a simple answer, below we try to give an intuitive understanding.
Let's say you want to bound the expected robust test loss by a constant multiple of the expected robust training loss (see Lemma 4.2 relating the two quantities). To ensure this, we need the stability-based generalization term to be sufficiently small. Recalling the relevant definitions, we can give two possible ways of setting the different parameters (the step size, the number of iterations, and the width); in both cases one can check that the required inequalities hold.
Q2: What happens when m is chosen independently of T?
We can choose m to depend only on n, and then set T accordingly so that the required condition still holds. Note though that this seems superfluous, as there could be other regimes that would still give us the condition we need. This is why prior works (e.g., [r4, r5]) also state the assumption as we do.
This also makes sense, as our generalization bounds are based on algorithmic stability, which depends on the algorithmic parameters (the step size and the number of iterations). If we were using tools from uniform convergence, then m would naturally depend on the sample size and input dimension d.
[r1] Du, Simon S., et al. "Gradient descent provably optimizes over-parameterized neural networks." arXiv preprint arXiv:1810.02054 (2018).
[r2] Arora, Sanjeev, et al. "Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks." International Conference on Machine Learning. PMLR, 2019.
[r3] Ji, Ziwei, and Matus Telgarsky. "Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks." arXiv preprint arXiv:1909.12292 (2019).
[r4] Richards, Dominic, and Ilja Kuzborskij. "Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel." Advances in neural information processing systems 34 (2021): 8609-8621.
[r5] Lei, Yunwen, et al. "Stability and generalization analysis of gradient methods for shallow neural networks." Advances in Neural Information Processing Systems 35 (2022): 38557-38570.
I thank the authors for their response.
Other reviewers (including myself) found it strange that the overparameterization depends on the learning rate and number of iterations. However, after reading the authors' response I agree that this dependence is indeed the right one for a stability-based analysis of generalization, in contrast to uniform convergence analysis.
To conclude, my questions are answered, and I will raise my score to 7. I think the authors should add to the final version several clarifications that arose from the reviewers' questions. I also think it would strengthen the paper to include the proof for $\alpha_1 > 1$.
This paper uses uniform stability to analyze adversarial training on wide shallow networks when the adversarial perturbations are approximately optimal. Assuming there exists a robust network near initialization, the best network iterate has, in expectation, a correspondingly small robust test loss. The results for GD are extended to SGD and to training with the Moreau envelope.
Strengths
- The bounds are tied to how strong the adversarial training attack is, providing an approximation to practical attacks while allowing for tractability in the theoretical analysis.
- No assumptions are made on the data distribution, and the adversarial attack threat model is very general.
Weaknesses
- The bounds are in expectation and not high probability bounds.
- A large width is required for the bounds to hold.
- The parameter $\beta$ is set in the worst case, and as a result is likely to be high for practical attack algorithms. In fact, if $\beta$ is constant, then it appears Corollary 3.3 is vacuous.
Questions
- How realistic is the assumption that there exists a robust network in the vicinity of the initialization?
- Should I be thinking of $\beta$ as a constant? If so, how should Corollary 3.3 be interpreted?
Limitations
Yes.
W1: The bounds are in expectation and not high probability bounds.
We can give high probability bounds based on [r1], which relates high probability generalization bounds with algorithmic stability, but they are looser than the bounds in expectation. We had those originally in the paper but chose to remove them as they are not very informative.
W2: A large width is required for the bounds to hold.
Training neural networks (i.e., implementing the ERM rule) is known to be computationally hard, even for a two-layer network with only three hidden nodes. This has been known since the 90s. Recent results in deep learning theory focus on the setting where networks are greatly overparameterized. Indeed, all results giving computational guarantees for learning deep neural networks assume that the networks are overparameterized. So, it is not surprising that we need a similar condition in the adversarially robust setting. The assumption on the width m has also appeared in related prior works [r2] and [r3].
Also, note that, given the way we present the results, we can rewrite the condition above as an upper bound on the number of iterations and interpret it as early stopping.
W3: The parameter $\beta$ is set in the worst case, and as a result is likely to be high for practical attack algorithms. In fact, if $\beta$ is constant, then it appears Corollary 3.3 is vacuous.
[TLDR] $\beta$ is not a constant; it is a parameter you can choose. You can make it arbitrarily small by adding more computation. A large $\beta$ means that the quality of the simulated adversarial attacks is poor, so indeed our bounds suggest that the robust generalization will not be good. This is what should be expected; it is not a weakness of our result.
Recall that adversarial training involves finding an adversarial perturbation of every training example in the training set. Can we solve this optimization problem exactly? If so, then $\beta$ is equal to zero. Most works in the theory of robust learning assume that. We argue that in practice this is not true. So, we allow for approximate optimization when finding an adversarial example during training. We introduce a parameter $\beta$ that is a bound on the suboptimality of an adversarial example. Note that this is a design parameter that a practitioner can choose. If you choose a large $\beta$, training is easier but the resulting model is not good; so, of course, one should choose a small $\beta$. This involves a computational cost that most previous papers ignore. The only paper that carefully shows how many iterations are needed to find a $\beta$-suboptimal attack is that of Mianjy and Arora (2023); they show how many iterations of a PGD attack are needed to find a $\beta$-suboptimal attack.
[Mianjy and Arora, 2023] “Robustness Guarantees for Adversarially Trained Neural Networks,” NeurIPS 2023.
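To make the role of $\beta$ concrete, here is a minimal Python/NumPy sketch of adversarial training with an approximate inner maximizer. The $\ell_\infty$ threat model, the PGD step count and step size, and the placeholder gradient callbacks (`grad_x`, `grad_w`) are illustrative assumptions rather than the exact procedure analyzed in the paper; running more PGD steps corresponds to a smaller suboptimality $\beta$, at extra computational cost.

```python
import numpy as np

# Minimal sketch of adversarial training with an *approximate* inner maximizer.
# The l_inf threat model, PGD step count/size, and the gradient callbacks are
# illustrative assumptions; more PGD steps -> smaller suboptimality beta of the
# attack, at a higher computational cost.

def pgd_attack(grad_x, x, eps, steps, alpha):
    """Approximately maximize the per-example loss over the l_inf ball of radius eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_x(x + delta)                      # gradient of the loss w.r.t. the input
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)
    return x + delta

def adversarial_training_step(W, X, Y, grad_x, grad_w, eps, eta, pgd_steps=10):
    """One GD step on the empirical robust loss, using approximate attacks."""
    alpha = 2.5 * eps / pgd_steps                  # common heuristic PGD step size
    X_adv = np.stack([
        pgd_attack(lambda z, y=y: grad_x(W, z, y), x, eps, pgd_steps, alpha)
        for x, y in zip(X, Y)
    ])
    robust_grad = np.mean([grad_w(W, x, y) for x, y in zip(X_adv, Y)], axis=0)
    return W - eta * robust_grad
```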
Q1: How realistic is the assumption that there exists a robust network in the vicinity of the initialization?
[TLDR] The existence of a network that generalizes well in the vicinity of a suitably chosen initialization is a high-dimensional phenomenon that is central to the NTK setting, the dominant framework for the computational learning theory of deep neural networks. Nonetheless, our results also yield guarantees that do not depend on the distance from initialization.
Most computational learning results for deep neural networks assume overparameterization and a certain initialization that ensures that the weights of the trained network do not change much from initialization. Yet these trained networks, in the so-called lazy regime or NTK setting, are guaranteed to generalize. So, assuming the existence of a network that generalizes well in the vicinity of an initialization chosen at the right scale is a high-dimensional phenomenon. We argue that if the conditions of our theorems are met (i.e., we consider over-parameterized settings), then such an assumption is not unrealistic. It is the same as asking "how realistic is it that a unit ball has almost no volume and all of its mass is concentrated near the boundary?" Well, it happens in high dimensions.
While we give a bound on robust generalization in terms of the distance from initialization, this does not mean it is a necessary or the only condition for robust generalization. In fact, we can also give a robust generalization bound that does not depend on the distance from initialization; this follows from Theorem 3.1 and Lemma 4.2.
Q2: Should I be thinking of $\beta$ as a constant? If so, how should Corollary 3.3 be interpreted?
No, $\beta$ should not be treated as a constant; it is a design parameter you can choose. Please see the detailed discussion above.
[r1] Feldman, Vitaly, and Jan Vondrak. "High probability generalization bounds for uniformly stable algorithms with nearly optimal rate." Conference on Learning Theory. PMLR, 2019.
[r2] Richards, Dominic, and Ilja Kuzborskij. "Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel." Advances in neural information processing systems 34 (2021): 8609-8621.
[r3] Lei, Yunwen, Rong Jin, and Yiming Ying. "Stability and generalization analysis of gradient methods for shallow neural networks." Advances in Neural Information Processing Systems 35 (2022): 38557-38570.
Thank you for addressing my concerns, especially regarding $\beta$. I have raised my score to a 6.
I wonder if requiring the $\beta$-optimal condition to hold everywhere can be relaxed (for example, to an average-case condition) for networks where guaranteeing nearly optimal adversarial attacks is more difficult.
Adversarial training is a popular method for training models with enhanced robustness to adversarial examples. In recent years, many papers have studied the generalization of adversarial training in various models. In this paper, the authors study the generalization of adversarial training for a special class of shallow neural networks via uniform stability, and they show theoretical results for three different algorithms.
Strengths
- The authors show theoretical results for a special class of shallow neural networks with the logistic loss, where the resulting objective is non-convex and non-smooth.
- The authors discuss stability and generalization guarantees of three variants of adversarial training.
- The proofs in this work are clear and easy to read.
Weaknesses
- The model studied in this work is a special neural network. By fixing the weights of the last layer, it has only one layer of trainable parameters, and it is unclear whether the results shown for this special neural network can be generalized to regular neural networks.
- The authors claim that they use an over-parameterized neural network, but they explain "over-parameterized" as a condition on the width m. It is confusing that m depends on the number of iterations and the learning rate but does not depend on the number and dimension of the inputs.
- In Section 4, the bounds shown in the theoretical results depend on the width with the same order. I checked the proofs of this work and found that these bounds have this dependence because the weights of the last layer are initialized at a particular scale, which means that the bounds depend on the initialization of the last-layer weights. If the last layer is initialized at a different scale, does the result still hold for an over-parameterized neural network?
- The main techniques used in the proofs depend on the weak convexity of the loss function and the uniform stability of the algorithm. It is suggested to state the formal definition of a weakly convex function.
Questions
Is it possible to generalize the results from this special neural network to regular neural networks under mild assumptions, like Theorem 4.7 in [1]?
Limitations
This work studies a special neural network, and it is unclear whether the results can be generalized to regular neural networks.
W1: The model studied in this work is a special neural network. By fixing the weights of the last layer, it has only one layer of trainable parameters, and it is unclear whether the results shown for this special neural network can be generalized to regular neural networks.
It has been known since the early 90s that training neural networks is computationally hard, even for two-layer networks with three hidden neurons. Modern theory of deep learning avoids those hardness results by considering (a) overparameterized two-layer networks, (b) a certain small-scale randomized initialization, and (c) freezing the top layer and training only the bottom layer. Under this setting, we can bound the training error of GD after T iterations. These are the state-of-the-art results in the computational theory of deep learning. Yes, it may be too specialized and not close to practice, but this is all we have currently. So, even in the standard setting, it remains unclear if these results can be generalized to "regular" networks.
W2: The authors claim that they use an over-parameterized neural network, but they explain "over-parameterized" as a condition on the width m. It is confusing that m depends on the number of iterations and the learning rate but does not depend on the number and dimension of the inputs.
We state the condition on m in terms of the number of iterations T and the learning rate $\eta$, as our generalization bounds are based on algorithmic stability, which depends on those algorithmic parameters. Similar assumptions have also appeared in related prior works [r1] and [r2]. If we were using tools from uniform convergence, then m would naturally depend on the sample size n and the input dimension d. Note that we could still write the condition as a choice of m depending on n together with an early-stopping condition on T, but that would seem superfluous, as there could be other regimes that could still give us the condition we need.
[r1] Richards, Dominic, and Ilja Kuzborskij. "Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel." Advances in neural information processing systems 34 (2021): 8609-8621.
[r2] Lei, Yunwen, Rong Jin, and Yiming Ying. "Stability and generalization analysis of gradient methods for shallow neural networks." Advances in Neural Information Processing Systems 35 (2022): 38557-38570.
W3: In Section 4, the bounds shown in the theoretical results depend on the width with the same order. I checked the proofs of this work and found that these bounds have this dependence because the weights of the last layer are initialized at a particular scale, which means that the bounds depend on the initialization of the last-layer weights. If the last layer is initialized at a different scale, does the result still hold for an over-parameterized neural network?
No. The scale of the initialization is important. As we state above, this is the case for the dominant framework for a computational theory of deep learning, even in the standard setting. Such an initialization has also appeared in prior works, e.g., [r3, r4, r5]. In particular, we need the initialization to be at that scale for our Lemma 4.1 to hold.
[r3] Du, Simon S., et al. "Gradient descent provably optimizes over-parameterized neural networks." arXiv preprint arXiv:1810.02054 (2018).
[r4] Arora, Sanjeev, et al. "Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks." International Conference on Machine Learning. PMLR, 2019.
[r5] Ji, Ziwei, and Matus Telgarsky. "Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks." arXiv preprint arXiv:1909.12292 (2019).
W4: The main techniques used in the proofs depend on the weak convexity of the loss function and the uniform stability of the algorithm. It is suggested to state the formal definition of a weakly convex function.
We will include a formal definition of weakly convex functions.
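For reference, the standard definition we have in mind is the following (the exact modulus and notation used in the paper may differ):

```latex
% f : R^p -> R is said to be rho-weakly convex (rho >= 0) if
w \;\mapsto\; f(w) + \tfrac{\rho}{2}\,\lVert w \rVert_2^{2} \quad\text{is convex;}
% equivalently, for all w, w' and all lambda in [0, 1],
f(\lambda w + (1-\lambda) w')
  \;\le\; \lambda f(w) + (1-\lambda) f(w')
          + \tfrac{\rho\,\lambda(1-\lambda)}{2}\,\lVert w - w' \rVert_2^{2}.
```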
Q1: Is it possible to generalize the results from this special neural network to regular neural networks under mild assumptions, like Theorem 4.7 in [1]?
We are not sure what reference [1] is, but see our response above regarding the state of affairs of computational learning-theoretic results for training neural networks.
Thanks for your response. My main concern is regarding the setting of an over-parameterized neural network. After reading the authors' response and the comments from other reviewers, my concerns have been addressed. I will raise my rating to a "Weak Accept" accordingly.
This work studies the stability of adversarial training in two-layer neural networks for binary classification problems. The authors study gradient-descent-based adversarial training, with nearly optimal adversarial perturbations, in two-layer neural networks with a frozen second layer, and they obtain guarantees on the stability of the algorithm and its robust generalization. Furthermore, the paper considers a smoothed empirical robust loss and obtains guarantees for this case too. The proofs of the main results are deferred to the Appendix, yet a proof sketch is presented in Section 4 for the case of (exactly) optimal adversarial perturbations during training.
Strengths
The paper is well written and makes it easy to follow its contributions and contextualize them with respect to prior work. Furthermore, the results are cleanly organized and presented. The main result is a stability bound on GD-adversarial training, which overcomes the non-convexity and non-smoothness of the objective. I particularly enjoyed reading the proof sketch section and, in particular, Lemma 4.1, which makes it clear how the analysis can sidestep the aforementioned challenges. I consider this work to be a valuable contribution to the field of theoretical robust (deep) learning.
Weaknesses
The theoretical results hold for a special class of networks (two-layer, with smooth activations and frozen second-layer weights) and do not appear to be particularly surprising or useful for inspiring practical ideas. This is the reason why I do not recommend a higher score. Perhaps the authors could discuss some of their assumptions in Section 5; for instance, they could comment on the assumption of eq. (1). Should we consider this quantity to be constant during training? Do we expect it to be smaller or larger as training progresses? Could it be empirically estimated? Similarly, the authors could comment on the feasibility of adversarial training with the smoothed loss.
Questions
I have a few questions/suggestions:
- Lines 9 and 10 in the abstract: If I understand correctly, there should be something added along the lines of “provided there exists a robust network around the initialization.”
- Lines 193-194, “This view is also consistent with several empirical studies.”: can you please elaborate on this?
- Theorem 3.2: for the result on the generalization gap you should explicitly state the loss function, right? (Since eq. (3) is defined with respect to a generic function.)
Limitations
The limitations of the work are thoroughly discussed.
W1: Detached from practice?
Ours is a theoretical result, and theory always lags practice and rarely matches it perfectly. Having said that, we would like to remind the reviewer that it has been known since the early 90s that training neural networks is computationally hard, even for two-layer networks with three hidden neurons. Modern theory of deep learning avoids those hardness results by considering (a) overparameterized two-layer networks, (b) a certain small-scale randomized initialization, and (c) freezing the top layer and training only the bottom layer. Under this setting, we can bound the training error of GD after T iterations. These are the state-of-the-art results in the computational theory of deep learning. It is indeed too specialized and not close to practice, but this is all we have currently.
W2: Feasibility of adversarial training with the smoothed loss.
Yes, it is very feasible and practical, and it has been shown to be successful in prior work. [Xiao 2024] discusses at length the feasibility of adversarial training with a smoothed loss and presents extensive empirical results.
[Xiao 2024] "Uniformly Stable Algorithms for Adversarial Training and Beyond." arXiv preprint arXiv:2405.01817 (2024).
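For completeness, the Moreau-envelope smoothing referred to here is the standard construction (we write the smoothing parameter as $\lambda$; the exact notation and normalization in the paper and in [Xiao 2024] may differ): the smoothed objective replaces the empirical robust loss $\hat{L}_{\mathrm{rob}}(\cdot\,;S)$ by

```latex
% Moreau envelope of the empirical robust loss, with smoothing parameter lambda > 0.
M_{\lambda}\hat{L}_{\mathrm{rob}}(W;S)
  \;=\; \min_{V}\,\Big\{ \hat{L}_{\mathrm{rob}}(V;S)
        + \tfrac{1}{2\lambda}\,\lVert W - V \rVert^{2} \Big\}.
```

For a weakly convex loss and a small enough $\lambda$, this envelope is smooth, lower-bounds the original loss, and can be minimized by gradient methods whose steps involve an inner proximal subproblem.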
W3: How to think about $\beta$?
We should not think of $\beta$ as a constant. It is a parameter that you, as the algorithm designer, can choose. You can make it arbitrarily small by adding more computation. A large $\beta$ means that the quality of the simulated adversarial attacks is poor, so indeed our bounds suggest that the robust generalization will not be good. This is what should be expected; it is not a weakness of our result. We further expand on our comment above. Recall that adversarial training involves finding an adversarial perturbation of every training example in the training set. Can we solve this optimization problem exactly? If so, then $\beta$ is equal to zero. Most works in the theory of robust learning assume that. We argue that in practice this is not true. So, we allow for approximate optimization when finding an adversarial example during training. We introduce a parameter $\beta$ that is a bound on the suboptimality of an adversarial example. Note that this is a design parameter that a practitioner can choose. If you choose a large $\beta$, training is easier but the resulting model is not good; so, of course, one should choose a small $\beta$. This involves a computational cost that most previous papers ignore. The only paper that carefully shows how many iterations are needed to find a $\beta$-suboptimal attack is that of Mianjy and Arora (2023); they show how many iterations of a PGD attack are needed to find a $\beta$-suboptimal attack.
[Mianjy and Arora 2023] “Robustness Guarantees for Adversarially Trained Neural Networks,” NeurIPS 2023.
Q1: Lines 9 and 10 in the abstract: If I understand correctly, there should be something added along the lines of “provided there exists a robust network around the initialization.”
While we give a bound on robust generalization in terms of the distance from initialization, this does not mean it is a necessary or the only condition for robust generalization. In fact, we can also give a robust generalization bound that does not depend on the distance from initialization; this follows from Theorem 3.1 and Lemma 4.2.
Q2: Lines 193-194, “This view is also consistent with several empirical studies.”: can you please elaborate on this?
Early stopping is a standard approach for algorithmic regularization. It has been shown to prevent overfitting in many settings, see for example [r1, r2, r3]. Here, we show that it is able to mitigate robust overfitting as well.
[r1] Caruana, Rich, Steve Lawrence, and C. Giles. "Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping." Advances in neural information processing systems 13 (2000).
[r2] Rice, Leslie, Eric Wong, and Zico Kolter. "Overfitting in adversarially robust deep learning." International conference on machine learning. PMLR, 2020.
[r3] Pang, Tianyu, et al. "Bag of tricks for adversarial training." arXiv preprint arXiv:2010.00467 (2020).
Q3: Theorem 3.2: for the result on the generalization gap you should explicitly state the loss function, right? (Since eq. (3) is defined with respect to a generic function.)
Yes. We will state the loss function explicitly.
I apologise for my delayed response.
Thank you very much for your reply and the references provided (especially [Mianjy and Arora 2023]). I would advise you to reference this paper when introducing $\beta$, and to discuss its computational considerations.
This paper studies the stability and generalization of adversarial training for two-layer neural networks with smooth activation functions in binary classification problems, focusing on settings beyond the NTK regime and without data distribution assumptions. The authors derive generalization bounds based on uniform stability, particularly for gradient descent (GD) and stochastic gradient descent (SGD). After the authors’ responses, the reviewers recognized the theoretical novelty and generally agreed that this work contributes to a valuable theoretical understanding of adversarial training. Some reviewers also raised concerns about the practical relevance of the results, though this concern is common in theoretical papers. Overall, all reviewers appreciate this theoretical paper, and I recommend acceptance.