Gradient Flow Provably Learns Robust Classifiers for Data from Orthonormal Clusters
Abstract
Reviews and Discussion
This paper analyses the robustness of networks in a very fine-grained way. Specifically, it assumes the data distribution is a mixture of K Gaussian clusters whose centers are orthonormal. Under this assumption, the paper proves that the Bayes classifier approaches the optimally robust classifier when the dimension is sufficiently large or the intra-class variance is small. Furthermore, beyond existence, this paper uses gradient flow to prove that, under some assumptions on the initialization, pReLU networks converge to a nearly optimal robust classifier if the intra-cluster variance is small.
Strengths
This paper is well written and easy to follow. I appreciate the authors' effort to make a strongly technical paper easy to read, which will be beneficial for the community. For example, the authors provide many intuitions for the assumptions and the results of the theorems, and give a detailed proof sketch for readers to grasp.
This paper develops a full convergence analysis for gradient flow, proving the conjecture in (Min & Vidal, 2024), which is a significant contribution.
The technical contribution is solid. Although I did not check every step of the proofs, I believe they are correct after reading the proof sketch and some important parts of the proofs in the appendix.
The authors have discussed their limitations concretely, which will help readers understand their work comprehensively.
Weaknesses
The assumption on the data distribution seems too strong. The strong results may therefore offer little guidance for real-world learning scenarios.
Typo: In Line 061 it should be “orthonormal”.
Questions
Is your assumption reasonable for real-world datasets? For instance, the variance assumption?
(Min & Vidal 2024) has proved the (almost) robustness for classifier, and your Theorem 2 proves the similar result for Bayes optimal classifier. Is this a progress? I think the Bayes optimal classifier probably should be more robust than . Maybe “optimal” does not means “robust optimal” in some cases. Can you provide more discussions on that?
We thank the reviewer for the review and for acknowledging the strengths of our paper. We address your concerns below:
-
Is the data assumption reasonable: We have added a new experiment section addressing your concern about the practical relevance. We refer the reviewer to our revised Appendix A and our global response.
-
Bayes classifier vs. the pReLU classifier: We have shown that the Bayes classifier is nearly optimally robust and explained this by interpreting it as a nearest-cluster rule. When Min and Vidal, 2024 proposed their classifier, they did not state whether its robustness is optimal or not. The reviewer's question of why that classifier is also nearly optimally robust is valid, and we have some explanation: the classifier is also approximately a nearest-cluster rule when p is large. To see this, notice that its aggregation of per-cluster correlations behaves like a p-norm, and the p-norm gets closer to the max (∞-norm) as p increases; a small numerical sketch is given below.
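To make the last point concrete, here is a minimal numerical sketch (the centers, noise level, and aggregation below are illustrative assumptions, not the exact classifier form analyzed in the paper) showing that a pReLU-style aggregation of cluster correlations approaches the maximum correlation, i.e., a nearest-cluster rule, as p grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K orthonormal cluster centers in R^D (obtained via QR of a random matrix).
D, K = 64, 8
mu = np.linalg.qr(rng.standard_normal((D, K)))[0].T   # rows are orthonormal centers
x = mu[3] + 0.1 * rng.standard_normal(D)              # noisy sample near center 3

def soft_score(x, mu, p):
    """(sum_k relu(<mu_k, x>)^p)^(1/p): a pReLU-style aggregation of cluster correlations."""
    s = np.maximum(mu @ x, 0.0)
    return (s ** p).sum() ** (1.0 / p)

for p in [1, 2, 4, 8, 16]:
    print(f"p = {p:2d}: aggregated score = {soft_score(x, mu, p):.4f}")
print(f"max correlation (nearest-cluster score) = {np.maximum(mu @ x, 0.0).max():.4f}")
```

As p increases, the aggregated score converges to the largest correlation, so the decision is effectively made by the nearest cluster.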
We hope our response addresses your concerns. If you have additional questions/concerns, please feel free to post comments during the discussion phase.
References:
Min and Vidal, Can implicit bias imply adversarial robustness? ICML, 2024
This paper studies the problem "what is the maximum adversarial perturbation a neural network can tolerate?" from a theoretical perspective. The article contains three main theorems. Theorem 1 shows that, for the orthonormal cluster data, no Lebesgue measurable classifier can remain robust, with high probability, once the attack budget exceeds a critical value. Theorem 2 analyzes the robustness of the Bayes optimal classifier. Theorem 3 shows that, under some assumptions, a two-layer pReLU network converges to the optimal robust classifier after training.
Strengths
1, In my opinion, the conclusions are new and reasonable. 2, The motivation is reasonable. 3, The article is well written.
Weaknesses
1, The proofs are a bit long, so I did not take a closer look. I hope the authors can write the proofs of the theorems more concisely.
2, Overall, I maintain a positive attitude towards this article, but I still have many questions that I hope the author can answer seriously.
Questions
1, This paper mainly focuses on orthonormal cluster data, and to ensure the cluster centers are orthonormal there can be at most D (the dimension of the data) clusters in the data distribution; this type of data appears to be a combination of several normal distributions that are relatively far apart. So may I ask why this type of data is considered? What are the practical applications of this kind of data in reality?
2, Theorem 1 says: "Given a sample, we have equation (3)", but I think the probability in (3) is taken over the sample itself, so it should not be phrased "Given a sample" here. The same applies to Theorem 2.
3, For the network structure, the authors do not choose the ReLU network because "ReLU is non-differentiable", as stated in footnote 1. But in my opinion, this is not important: ReLU is differentiable almost everywhere, which seems sufficient, and much work has been done on ReLU networks. Moreover, ReLU is frequently used in the real world. So, what would happen if we take p=1? Is the authors' main conclusion (convergence to the optimal robust classifier) still correct?
4, Why is there an upper bound on the amount of data in Theorem 3? More data should lead to better training, so why do the authors need an upper bound on the data here?
5, In Theorem 3, I think the notation represents the parameters obtained after t steps of training, is that right? And what is the learning rate?
6, According to real-world experience, standard training makes the network non-robust. In the paper, as seen in equation (7) and the definition of the dataset in (6), the authors also do not consider robust (adversarial) training, yet according to Theorem 3, training leads to the optimal robust classifier. Is there a contradiction here?
We thank the reviewer for the review. We address your concerns below:
-
Orthonormal cluster data: We have added a new experiment section addressing your concern about practical relevance. We refer the reviewer to our revised Appendix A and our global response.
-
Notation in Theorems 1 and 2: Thank you. We have fixed the notation in Theorems 1 and 2.
-
What happens if one uses ReLU: First of all, we wrote footnote 1 only to justify the use of gradient flow: if we were to study ReLU networks, we would have to write equation (7) as a differential inclusion (and the reviewer is right that many such analyses exist), which is unnecessary in our manuscript since our main results are about pReLU (p>2). Secondly, the results for ReLU (the case p=1) have been shown in prior works, and we mention them in multiple sections (lines 97-107 and lines 384-395): Frei et al. 2023 showed that GF on two-layer ReLU networks provably converges to networks that are non-robust against adversarial attacks of a certain radius, and Min and Vidal, 2024 further explained this phenomenon.
-
upper bound on the amount of data: The upper bound on sample size is due to the current proof techniques used in this paper, and we have discussed this limitation in lines 517-527. We invite the reviewer to read lines 517-527; if they do not clarify, please feel free to ask further questions.
-
What the notation represents: Since this paper considers gradient flow dynamics, the continuous-time limit of gradient descent when the step size is infinitesimal, the quantity in Theorem 3 represents the solution to the ordinary differential equation in (7), and t denotes time, so there is no discrete learning rate. Nonetheless, there is a connection between the gradient flow solution and the iterates of gradient descent with a sufficiently small step size, for which we refer the reviewer to Elkabetz and Cohen, 2021 (see also the brief sketch after this list).
-
Robustness without adversarial training: This is the point we highlight in our manuscript: the correct choice of network architecture matters in determining the adversarial robustness of the trained network. For this mixture-of-Gaussians data, we show that, as long as the activation function is chosen carefully (pReLU, p>2), gradient flow training without adversarial examples can find the optimal robust classifier. Notice that in our Figure 2, if the network is NOT chosen carefully, the trained network is not robust, reminiscent of what the reviewer pointed out: "normal training makes the network non-robust".
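For completeness, here is the generic relationship we have in mind between gradient flow and gradient descent (the symbols $L$, $\theta$, and $\eta$ below are generic placeholders, not necessarily the notation of equation (7)). Gradient flow is the ordinary differential equation
$$\dot{\theta}(t) = -\nabla L(\theta(t)), \qquad \theta(0) = \theta_0,$$
and gradient descent is its forward-Euler discretization with step size $\eta$,
$$\theta_{k+1} = \theta_k - \eta\,\nabla L(\theta_k), \qquad \theta_k \approx \theta(k\eta) \quad \text{for small } \eta,$$
so the "time" $t$ in gradient flow plays the role of (number of steps) × (step size), and there is no separate learning rate to specify.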
We hope our response addresses your concerns. If you have additional questions/concerns, please feel free to post comments during the discussion phase.
References:
Min and Vidal, Can implicit bias imply adversarial robustness? ICML, 2024
Frei et al., The double-edged sword of implicit bias: Generalization vs. robustness in reLU networks. NeurIPS, 2023
Elkabetz and Cohen, Continuous vs. discrete optimization of deep neural networks. NeurIPS, 2021
According to the authors' response to question 6, when training on the data defined in the paper, pReLU (p>2) seems to be a better network structure than ReLU for obtaining robustness. Can this conclusion be generalized to other distributions? Can the authors' article guide the search for network structures that are conducive to robustness in more situations?
We thank the reviewer for the question. Our response to your question will be in two parts:
-
Can this conclusion be generalized to other distributions? Yes. Empirically, Min and Vidal, 2024 have shown that pReLU (p>2) is more robust than ReLU when trained on the MNIST dataset. In our newly added numerical section, we compare pReLU (p>2) with ReLU on the transfer learning task of classifying cats and dogs with a pretrained feature extractor, and pReLU is more robust (a minimal sketch of such a classification head is given after this list). These experiments provide empirical evidence that pReLU (p>2) can be more robust than ReLU for some real-world datasets. Theoretically, we conjecture that pReLU is suitable for learning data distributions concentrated within a finite number of sufficiently separated low-dimensional regions, for example, the multi-cluster data model we assume in the paper or a union of low-dimensional subspaces. Extending our theorems to these data models is ongoing work.
-
Can the authors' article guide the search for network structures that are conducive to robustness in more situations? Not directly. We think searching for robust network structures is possible for complicated data models (beyond those mentioned in the previous point) only when we understand their geometric structure better, and we try to convey this message in the introduction. Since this requires a joint research effort to understand both the data structure and the inductive bias of the network architecture during training, there is a significant amount of work to be done before we can perform such a search. Nonetheless, by showcasing a successful example with a mixture-of-Gaussians model, we hope our work can motivate more studies of data structure and the implicit bias of neural network training for adversarial robustness.
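For reference, below is a minimal sketch of the kind of classification head compared in the new experiments; the exact pReLU parameterization, widths, and training details used in Appendix A may differ (e.g., the paper's pReLU may include a normalization that this sketch omits), and the dimensions and hyperparameters here are illustrative (2048 is the ResNet152 penultimate feature width).

```python
import torch
import torch.nn as nn

class PowerReLUHead(nn.Module):
    """Two-layer head with relu(z)**p activation; p = 1 recovers a plain ReLU head.
    Illustrative parameterization only, not necessarily the one used in the paper."""
    def __init__(self, in_dim: int, hidden: int, p: int = 3):
        super().__init__()
        self.p = p
        self.w = nn.Linear(in_dim, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.v(torch.relu(self.w(x)) ** self.p)

# Usage sketch: train the head on frozen features from a pretrained extractor.
head = PowerReLUHead(in_dim=2048, hidden=128, p=3)
feats = torch.randn(16, 2048)                      # stand-in for extracted features
labels = torch.randint(0, 2, (16, 1)).float()      # binary labels (cats vs. dogs)
loss = nn.BCEWithLogitsLoss()(head(feats), labels)
loss.backward()
```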
My questions have been resolved. This article makes theoretical contributions, but there are also some weaknesses, such as the simple data structure (although the authors have provided some experimental evidence for the practicality of this kind of data, this data structure is still rather simple), so I believe 6 is a reasonable score.
This paper addresses the vulnerability of deep learning-based classifiers to adversarial attacks, which typically requires defense mechanisms or modified learning procedures, such as adding adversarial examples. The authors demonstrate that for certain data distributions, it is possible to train a provably robust classifier using standard learning methods without any additional defenses. Focusing on a binary classification problem where the data is generated from a mixture of Gaussian clusters with orthonormal centers, the paper first characterizes the largest attack radius that a classifier can defend against while maintaining high accuracy. It then proves the existence of optimal robust classifiers achieving this maximal robustness. Furthermore, the authors show that for data sampled from the orthonormal cluster model, gradient flow on a two-layer network with a polynomial ReLU activation, even without adversarial examples, can provably yield an optimal robust classifier.
Strengths
-
This paper provides solid theoretical analysis, proving that under the multi-cluster data assumption, a two-layer pReLU neural network with certain initialization conditions can converge to a robust solution.
-
The paper mainly utilizes the techniques from [1] and offers a two-phase training dynamics analysis based on early stopping.
-
Overall, the paper is written quite smoothly.
Reference
[1] Boursier, E., Pillaud-Vivien, L., & Flammarion, N. (2022). Gradient flow dynamics of shallow relu networks for square loss and orthogonal inputs. Advances in Neural Information Processing Systems, 35, 20105-20118.
Weaknesses
-
The non-degenerate initialization shape assumption used in the paper seems overly strong, as it requires that each Voronoi region contains at least one initialized weight, which may not be natural. Specifically, under a random initialization setup, if the dimension D is much larger than the number of clusters K, the randomly initialized weights should be approximately orthogonal to the K-dimensional subspace spanned by the cluster centers with high probability. This appears to suggest that the non-degeneracy gap might be very small.
-
The paper considers a setup with sufficiently small data variance. However, the empirical phenomena observed in [2] seem to be independent of the variance. Thus, the paper only partially addresses the conjecture in [2] by resolving the special case of orthogonal data with small variance.
-
The paper lacks effective experimental validation of its theoretical analysis and conclusions, such as numerical simulations on synthetic data and observations on real image classification datasets.
Reference
[2] Min, H., & Vidal, R. (2024). Can Implicit Bias Imply Adversarial Robustness?. arXiv preprint arXiv:2405.15942.
Questions
-
Could the authors provide an analysis and verification of whether their proposed non-degeneracy gap assumption holds for general small random initialization?
-
The main text does not seem to clearly explain why the small quantities in the conclusions are related to the variance. Could the authors offer a more intuitive explanation?
We thank the reviewer for the review. We address your concerns below:
-
Non-degeneracy gap assumption: The reviewer is right that the non-degeneracy gap is very small under random initialization when the dimension is large, which we also acknowledged in our remark "Requirement on the initialization" in lines 506-516 as one limitation of our current results. In that remark, we noted that the required non-degeneracy gap generally cannot be achieved by random initialization. However, this does not imply that GF cannot converge to a robust classifier from random initialization. From a random initialization, the GF dynamics should be split into three phases ("burn-in" phase -> alignment phase -> convergence phase). The "burn-in" phase is the initial chaotic phase in which each neuron moves towards the interior of one of the Voronoi regions (since each neuron is initialized close to the boundaries of these Voronoi regions, we have no control over which Voronoi region it "selects", as this highly depends on the sampled training data points). Once every neuron has reached the interior of one of the Voronoi regions with a sufficient non-degeneracy gap, our results apply from that point onward. We hope the reviewer can see the challenge in analyzing this "burn-in" phase, which we plan to tackle in future research. We also note that special but practically viable initializations, such as initializing each neuron as one of the training data points, satisfy our non-degeneracy gap assumption and might be a good alternative to random initialization.
-
Dependence on the variance: In our response, we will first point out that the dependence on the variance is also empirically observed in Min and Vidal, 2024. Then, we will discuss our view on such dependence.
-
In Figure 2(a) of Min and Vidal, 2024, they plot the distance between the trained network and the target robust classifier against different values of the variance, and the distance increases as the variance increases. Therefore, empirical results have already shown this dependence on the variance, and our results agree with such observations.
-
Why does the distance have to depend on the variance? Our explanation is as follows: to make sure the trained network is close to the robust classifier, each neuron must be aligned with one of the cluster centers, and any misalignment will be reflected in the distance between the two. However, since the training data only contain noisy samples around the cluster centers, the GF dynamics can only guide the neurons' directions into a neighborhood of the true cluster center, and the size of this neighborhood necessarily depends on the variance. Indeed, in our proof sketch for the alignment phase (lines 445-476), we show that the misalignment is controlled by the intra-cluster variance, which eventually enters our upper bound on the distance between the trained network and the robust classifier (a back-of-the-envelope illustration follows this list).
- Empirical validation: In Min and Vidal, 2024, the authors conjectured that pReLU networks converge to the robust classifier and offered empirical validation on synthetic data (mixture-of-Gaussians data) and on a real image dataset (MNIST). We feel their numerical validations are sufficient to show the effectiveness of pReLU in finding robust classifiers. Thus, we focus on formally proving their conjecture with a detailed proof sketch and technical discussion. Nonetheless, to address the practical relevance of our theorems, we have added numerical experiments to our revised manuscript; we invite the reviewer to check the new Appendix A and our global response.
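As a back-of-the-envelope illustration of the second point above (this is only for intuition and is not the argument or the bound used in our proof; the symbols below are generic, not necessarily those of the manuscript): suppose the $n_k$ training samples from cluster $k$ are $x_i = \mu_k + z_i$ with $z_i \sim \mathcal{N}(0, \sigma^2 I_D)$, and consider the empirical mean direction that a neuron could at best align with,
$$\hat{\mu}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} x_i = \mu_k + \frac{1}{n_k}\sum_{i=1}^{n_k} z_i, \qquad \mathbb{E}\,\big\|\hat{\mu}_k - \mu_k\big\|^2 = \frac{\sigma^2 D}{n_k}.$$
The misalignment between $\hat{\mu}_k$ and $\mu_k$ therefore shrinks with $\sigma$ and vanishes as $\sigma \to 0$, which is why the distance to the target classifier inherits a dependence on the variance.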
We hope our response addresses your concerns. If you have additional questions/concerns, please feel free to post comments during the discussion phase.
References:
Min and Vidal, Can implicit bias imply adversarial robustness? ICML, 2024
Thanks for your update! I appreciate the authors' efforts in updating the revision. Thus, I will keep my score.
The paper studies the binary classification problem in which the data comes from a mixture of Gaussian clusters with orthonormal cluster centers. The paper first shows that an attack with unlimited budget is not defendable and establishes a maximum budget for a plausible attack. Then, it shows that some classifier can achieve this robustness while maintaining accuracy. Finally, the paper shows that a neural network with a single hidden layer and polynomial ReLU activation can be trained using gradient descent to approximate this optimal classifier when the initialization of the network parameters is favorable.
Strengths
- Originality:
The paper is original in its focus on GMM distributed data.
- Quality:
The paper is rigorous in its presentation.
- Clarity:
The paper is clear for the most part.
- Significance:
The issue of finding plausible classification problems that are amenable to analysis is important considering the current stage in understanding the adversarial examples phenomenon.
---------Edited-----------
I increase my original score of 3 to 5 since the overall quality of the paper is very good.
Weaknesses
I believe that the paper is not ready for publication based on four key observations.
First, the assumptions of the analysis are very restrictive; the orthonormality condition on the cluster centers, for example, excludes even the simplest classification problems such as XOR.
Second, the paper makes no attempt to show that the analysis bears any relevance in real-world scenarios.
Third, the analysis does not connect to or explain any issue it reveals with regard to the current paradigm of training robust ANNs. While the main theorem asserts that the degree of the polynomial ReLU should be at least 3, it does not explain what makes first-degree ReLUs unsuitable. Furthermore, while the paper claims training robust classifiers is possible with simple gradient descent, it does not explain why we observe a trade-off between accuracy and robustness in practice.
Last but not least, some of the claims of the paper are obvious and have no significance. For example, we don't need a mathematical analysis to figure out the reason behind the fact that attacks with unlimited budget are not defendable. Moreover, we don't need a reason to believe that a robust classifier exists since every optimal Bayes classifier is accurate and robust by definition.
Questions
See the Weaknesses section.
We respectfully disagree with the reviewer's assessment of our manuscript:
-
Data assumption: Our data assumption is not restrictive when compared to other theoretical works using similar analyses. We have compared our data assumption in lines 397-409 to those in prior work that also studied the convergence of two-layer networks; ours is the least restrictive. Moreover, our theoretical results can characterize exactly what classifier the two-layer network learns via gradient flow, which cannot be done without some assumption, either on data or the network width. Having said this, we are working on relaxing these assumptions.
-
Relevance in real-world scenarios: We have added a new experiment section addressing your concern about practical relevance. We refer the reviewer to our revised Appendix A and our global response.
-
Why ReLU fails to be robust: The results for ReLU (the case p=1) have been shown in prior works, and we mention them in multiple sections (lines 97-107 and lines 384-395): Frei et al. 2023 showed that GF on two-layer ReLU networks provably converges to networks that are non-robust against adversarial attacks of a certain radius, and Min and Vidal, 2024 further explained this phenomenon.
-
Accuracy-robustness tradeoff: We have clearly stated what is considered a robust classifier in our paper in lines 142-149: one that can defend against attacks up to the critical radius while maintaining a robust accuracy almost as high as the clean accuracy, which entails no accuracy-robustness trade-off against attacks of radius smaller than this critical value. Our contribution is to show that robust classifiers in such a strong sense exist for our orthonormal cluster model and can be found by gradient flow. Moreover, our results still agree with the empirical observation that the accuracy-robustness tradeoff exists: the robust classifier we find can achieve high accuracy under any attack of radius smaller than the critical value, but has near-zero accuracy under attacks of larger radius, as shown in Figure 2. In order to achieve more robustness against attacks of radius larger than the critical value, one must trade off clean accuracy, for example, by using a constant classifier that always outputs the majority class. The same can be said for real datasets: robust classifiers in our strong sense exist for some unknown critical radius that might be small, and in practice, the algorithm might be looking for networks robust against attacks of radius much larger than this critical value, which necessarily incurs the accuracy-robustness tradeoff.
-
Obvious claims: The reviewer is misinterpreting our theorems.
For example, we don't need a mathematical analysis to figure out the reason behind the fact that attacks with unlimited budget are not defendable
Our theorem 1 does not assume an unlimited budget. For our orthonormal cluster model, we identify the critical attack budget: if the attack radius is smaller than this critical value, a robust classifier achieving high accuracy exists; and if the attack radius is larger, no classifier can achieve high accuracy.
Moreover, we don't need a reason to believe that robust classifier exists since every optimal Bayes classifier is accurate and robust by definition
We do not see why the Bayes classifier is robust by definition if its definition has nothing to do with adversarial attacks. Our Theorem 2 shows that the Bayes classifier for our orthonormal cluster model is nearly optimally robust because it can maintain high robust accuracy under any attack of radius smaller than the critical attack budget. We would like the reviewer to clarify why this is an obvious result that does not require proof. (For the geometric intuition behind the critical budget in the noiseless case, see the sketch following this list.)
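For the reviewer's convenience, here is the geometric intuition in the idealized noiseless case (this heuristic ignores the Gaussian noise; the precise critical budget and its dependence on the variance and the dimension are exactly what Theorems 1 and 2 quantify). For orthonormal, hence unit-norm, centers we have $\|\mu_i - \mu_j\|_2 = \sqrt{2}$ for $i \neq j$. If a point sits at $\mu_i$ and is perturbed by $\delta$ with $\|\delta\|_2 < \tfrac{\sqrt{2}}{2}$, then
$$\|\mu_i + \delta - \mu_i\|_2 = \|\delta\|_2 \;<\; \sqrt{2} - \|\delta\|_2 \;\le\; \|\mu_i + \delta - \mu_j\|_2,$$
so a nearest-cluster rule still assigns the perturbed point to cluster $i$; conversely, a perturbation of norm $\tfrac{\sqrt{2}}{2}$ along $\mu_j - \mu_i$ reaches the midpoint between the two centers, beyond which no classifier can be simultaneously accurate on both clusters. This is why the robustness of the Bayes/nearest-cluster classifier is a statement about a specific critical radius, not something that holds "by definition".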
We hope our response addresses your concerns. If you have additional questions/concerns, please feel free to post comments during the discussion phase.
References:
Min and Vidal, Can implicit bias imply adversarial robustness? ICML, 2024
Frei et al., The double-edged sword of implicit bias: Generalization vs. robustness in reLU networks. NeurIPS, 2023
Thank you for the response. I don't want to discourage the authors from pursuing this research, rather I think that the paper is asking the wrong questions. Even though the analysis is done rigorously, and the quality of the paper is very good, I'm afraid that answers to irrelevant questions are irrelevant.
- Data assumption:
My understanding of the analysis is that the paper is basically assuming that the training samples are sampled from a simplex, i.e. sampled from a Dirichlet distribution, and that the samples are separable, i.e. the parameters of the Dirichlet distribution are equal and less than one. The paper does not realize this simple characterization of the data and spends a lot of energy describing it through orthogonality, etc. Nevertheless, this does not change the fact that the assumption is not realistic or helpful. For example, it is completely useless for any domain that is a subset of the real line, because it is not even possible to define a classification problem that respects this assumption in 1D; the simplex in this case is a point. Considering this fact, I cannot see what good an analysis with this assumption can do. Previous papers might be onto something here, but I don't think that this paper is furthering the discussion in a significant direction. I can imagine that relaxing/changing this assumption would be a step in the right direction.
- Relevance in real-world scenarios:
I think the new experiments are fine; however, the experiments should also report the end-to-end robust accuracy before we can take them as evidence of real-world relevance.
- Why ReLU fails to be robust:
From what I understand, the pReLU activation of degree p belongs to $C^{p-1}$, the set of (p-1)-times differentiable functions. For example, the ReLU is continuous but not differentiable, and the squared ReLU is both continuous and differentiable but not twice differentiable. Consequently, a network that is activated with pReLU is also in $C^{p-1}$. We know that $C^{p-1} \subset C^{0}$ for $p > 1$, i.e. networks that are activated with ReLU are more expressive than those that are activated with pReLU with $p > 1$. So, it makes a lot of sense, even a little obvious if I may, that GD performs better for pReLU when the target decision boundary can be represented by a function in $C^{p-1}$; there are simply fewer hypotheses in the hypothesis space. This further shows why the assumptions on the data distribution are very important in making the analysis relevant. My take is that the paper has set the learning problem up to succeed.
- Accuracy-robustness tradeoff:
I think you are assuming that a tradeoff between accuracy and robustness is only possible when training samples overlap. While there are such proposals in the literature [A], there are also counterarguments [B]. I think the safest bet for now is to say that the trade-off is due to the nonexistence of a representation of the robust decision boundary in the hypothesis space. The analysis does not take this possibility into account, which further exacerbates the issues with the data assumption and the presented analysis that I have mentioned. The paper needs to make these issues explicit in the text.
- Obvious claims:
Our theorem 1 does not assume an unlimited budget.
The theorem gives an upper bound for the budget, so I am interpreting that it is rejecting the possibility of unlimited budget. The exact value of the bound is dependent on the data distribution and is irrelevant to the broader problem IMHO.
We do not see why the Bayes classifier is robust by definition if its definition has nothing to do with adversarial attacks.
AFAIK, the Bayes classifier is:
$$h^*(x) = \arg\max_{y}\; \Pr_{(X,Y)\sim \mathcal{D}}\left[\, Y = y \mid X = x \,\right],$$
in which $\mathcal{D}$ is the distribution of test samples. So, as long as we are choosing the right distribution, there is no need to mention any robustness condition. As an analogy, consider a human subject as the optimal Bayes classifier. If we find a perturbation for an image of a dog which changes the opinion of the human into believing that this is an image of a cat, then that image is indeed of a cat, i.e. the perturbation has moved the sample past the true decision boundary. What you are referring to, I think, is the application of attacks in adversarial training of the hypothesis, which always limits the budget so that a human is never convinced that the label of the image has truly changed, e.g. that the perturbation be imperceptible. As you can see, we are always training a classifier for which the corresponding optimal Bayes classifier exists in practice.
[A] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry, "Robustness may be at odds with accuracy," in International Conference on Learning Representations, 2019.
[B] Y.-Y. Yang, C. Rashtchian, H. Zhang, R. R. Salakhutdinov, and K. Chaudhuri, "A closer look at accuracy vs. robustness," in Advances in Neural Information Processing Systems, 2020.
We thank the reviewer for clarifying your comments. We disagree with the conclusion the reviewer arrived at:
I think that the paper is asking the wrong questions. Even though the analysis is done rigorously, and the quality of the paper is very good, I'm afraid that answers to irrelevant questions are irrelevant.
Although it is hard to know precisely what the reviewer thinks are the "wrong"/"irrelevant" questions, we are afraid that the reviewer is misunderstanding the scope and contribution of this paper. We will explain next.
- Contribution of this paper: As our paper title, "gradient flow provably learns robust classifiers for data from orthonormal clusters," clearly suggests, our main contribution is to show how the gradient descent algorithm on certain networks provably converges to a robust classifier for the orthonormal data model, NOT what a robust classifier for the orthonormal data model is. One can easily be convinced that a robust classifier exists for the orthonormal data model, either through our Theorems 1 and 2 (we state them as theorems for mathematical rigor) or as the reviewer explained. But how can we train a neural network to find the robust classifier? Since the neural networks are overparametrized in their width, there are infinitely many networks that interpolate the training data, so which classifier gradient descent/flow learns and whether the trained network is robust become important and nontrivial research questions. The mathematical way of answering these questions is through a rigorous convergence analysis of the training dynamics of gradient flow and its implicit bias (we have referenced many works in lines 118-120). Notably, the convergence analysis is nontrivial even for simple data distributions (lines 397-409). In Theorem 3, we show that pReLU networks converge to the robust classifier via gradient flow (Reviewers HTHc and MqXy consider this a solid theoretical contribution). We carefully compared our results with prior work, explained the proof sketch, and pointed out some limitations of our current analysis (Reviewer HTHc found that this "will help readers understand their work comprehensively"). We find it unfair for the reviewer to criticize our work from a pure learning perspective for having a simple data model and disregard all our main contributions in optimization.
-
Data assumptions: The reviewer criticizes our data assumptions based on an incorrect understanding of our data assumptions. Specifically, the reviewer believes "training samples are sampled from a simplex, i.e. sampled from a Dirichlet distribution." As we write in Line 56, "We consider a balanced mixture of K-Gaussians." By and large, the entire deep learning theory community acknowledges that there is a big gap between theory and practice. Existing theory operates under very restrictive settings in terms of architectures (e.g., two-layer ReLU networks or multi-layer networks with infinite width) or data assumptions (mixtures of Gaussians). Although we acknowledge that our approach involves a restrictive data assumption, we simply want to point out that it is less restrictive than the state of the art for the convergence analysis of learning dynamics (Lines 398-409), which is the topic of this paper. Simply put, this is where the state of the art is in terms of assumptions under which one can prove theorems that precisely characterize trained neural networks (Theorem 3) in deep learning theory.
-
Obvious claims: we are very puzzled by the reviewer's comments, such as "unlimited budget are not defendable" or "the exact value of the bound is dependent on the data distribution and is irrelevant to the broader problem IMHO." These comments do not make sense within the context of adversarial robustness. Specifically, all the literature on norm-bounded adversarial robustness is predicated on the idea of small, imperceptible adversarial perturbations. So, by definition of the problem, the budget is never unlimited: attacks are always bounded in norm. Moreover, all the literature on adversarial robustness assesses performance as a function of the attack budget. In addition, all of the literature on certified robustness is predicated on computing the largest budget for which one can guarantee that the classifier does not change its predictions for all inputs. The whole point of our results is that the maximum allowable attack budget for a trained ReLU network is much smaller than that for a trained pReLU network. Anyone working on adversarial learning would appreciate that this is a huge improvement, and the claim is far from obvious, given that proving it involves a convergence analysis of gradient flow on neural networks.
Nonetheless, we agree with multiple points the reviewer stated in the previous response, for example, the discussion about the hypothesis classes being different when using pReLU and ReLU networks and the discussion about the accuracy-robustness tradeoff; they are interesting and insightful, and we will include them as additional remarks in the manuscript, and we thank the reviewer for the references. (Since they involve debatable issues like the accuracy-robustness tradeoff, we will take caution in making these remarks; thus, unfortunately, they cannot be added before the discussion phase ends.)
We find it unfair for the reviewer to criticize our work from a pure learning perspective for having a simple data model and disregard all our main contributions in optimization.
I understand your frustration, but my job is to be objective, not to be fair. If the main contribution of the paper is in optimization, then it should not have been filed under "learning theory" as the primary area. That is the main reason that the paper is being reviewed by someone like me. Having said that, since the paper's analysis is not taking into account the issues of hypothesis spaces and sample complexities of those spaces, from the perspective of "learning theory" it is not a strong analysis. In other words, as far as learning theory is concerned, I believe that the proposition of the paper regarding GD is true for every ERM learning rule out there and the specific analysis of gradient flow seems a little superfluous. Nevertheless, I might change my mind after discussing the contributions with other reviewers and AC in the discussion period. Hence, I will increase the score of the paper to 5 to show that I am willing to discuss.
The reviewer criticizes our data assumptions based on an incorrect understanding of our data assumptions. Specifically, the reviewer believes “training samples are sampled from a simplex, i.e. sampled from a Dirichlet distribution.” As we write in Line 56, “We consider a balanced mixture of K-Gaussians.”
Figure 1 of the paper explicitly shows the simplex: the dashed line connecting the centers of the Gaussians is exactly the simplex. More formally, the cluster centers in $\mathbb{R}^D$ could be stacked in a matrix which has a QR decomposition in which the triangular factor is a rectangular diagonal matrix with its diagonal entries being 1. As you can see, when the number of clusters is at most the ambient dimension, the cluster centers fall on the corners of a simplex, and when it exceeds the dimension it is not possible to satisfy the orthogonality condition. Furthermore, since only support vectors are necessary to define the decision boundary, the training samples that are sampled from the simplex are sufficient for classification, since they are the samples that fall closest to any other cluster center.
Simply put, this is where the state of the art is.
The SOTA argument is not effective when it comes to theoretical papers. For an experimental paper, it is acceptable to be similar to SOTA, but for a theoretical paper we either need a new theory or an improvement over SOTA. That is what I meant when I said that the paper is asking the wrong questions: this particular configuration of data has already been analyzed, and now we need to move beyond it. The particularities of this data assumption are not significant enough to be discussed further IMHO. For example, I would consider dropping the orthogonality condition or the normality condition on the distribution of clusters as significant.
we are very puzzled by the reviewer's comments.
In my comment I am referring to Theorem 2 of the paper. The text following the theorem reads:
Therefore, [the Bayes classifier] is nearly optimally robust when the ambient dimension is large or the intra-class variance is small.
I find the conclusion of this theorem obvious. This problem would be alleviated if the paper draws a stronger connection between the theorems and the main goal of the paper.
We thank the reviewer for the response and willingness to reassess our manuscript. Since it is close to the end of the discussion phase, we will keep our response short as we have discussed our viewpoints in previous responses.
In other words, as far as learning theory is concerned, I believe that the proposition of the paper regarding GD is true for every ERM learning rule out there and the specific analysis of gradient flow seems a little superfluous.
The reviewer's comment about the hypothesis space is interesting, but we think our additional experiment in the newest revision is at odds with this conjecture. In this new experiment (please see Appendix A.1), we run SGD on a pReLU network and on a regular polynomial ReLU network (they induce the same function space for the same p), and the results are as follows (we invite the reviewer to check the details in our new revision):
- SGD on a regular polynomial ReLU network cannot find a robust classifier. Therefore, selecting the function space is important, but the way the function space is parametrized is also important.
- SGD on a pReLU network with a large initialization scale cannot find a robust classifier. Therefore, selecting the correct hyperparameter of the training algorithm is also important.
In summary, many important aspects of GD training affect the robustness of the trained network; our theoretical analysis, alongside many previous works on the implicit bias of GD, precisely shows why certain choices make sense, which is a non-trivial problem, as we can see in the experiment. We hope our new experiment can aid the reviewer's assessment of our manuscript.
(Note: We slightly edited our response to improve its clarity)
Several reviewers (reviewers sKL6, b91S, and HTHc) have raised concerns about the practical relevance of our data assumption. We thank them for their feedback, which has helped us improve our manuscript. We have added a numerical experiment section to Appendix A of our revised manuscript (text highlighted in blue), and we explain this revision in this global response.
In the newly added section, we solve the task of classifying cats and dogs via transfer learning using extracted features from a ResNet152 trained on ImageNet. We conjecture that the extracted features of the dog (or cat) class may naturally form many clusters: when the feature extractor is trained on ImageNet, dogs are further labeled by their breeds. Thus, the extracted features of dogs of the same breed should be sufficiently close, and features of dogs of different breeds should be sufficiently far apart, based on the well-known neural collapse phenomenon (Papyan et al., 2020; Galanti et al., 2021). If such a multi-cluster structure exists in the extracted features, then we expect that training a pReLU classification head can achieve better robust accuracy than training its ReLU counterpart.
In the experiments, we show that, indeed, in this transfer learning scenario, the multi-cluster structure arises due to the distinguishing power of a feature extractor trained on a large dataset with finer labels, and that in this case, pReLU networks with larger p achieve better robustness than their ReLU counterparts. Admittedly, our current theorems cannot fully explain the observed experimental results, since the extracted features form clusters with large variances and there are some correlations among these clusters, which does not follow our data assumption. Relaxing our data assumption to allow large variance and inter-cluster correlation is an important future research direction. (For completeness, a minimal synthetic sketch of the idealized data model assumed in our theorems is given below.)
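For readers who wish to reproduce a synthetic version of the setting, the following is a minimal sketch of the idealized data model assumed by our theorems (a balanced mixture of Gaussians with orthonormal means); all numerical values and the cluster-to-class assignment below are illustrative and are not the configuration used in Appendix A.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instantiation of the orthonormal cluster model.
D, K, n_per_cluster, sigma = 128, 6, 200, 0.05
mu = np.linalg.qr(rng.standard_normal((D, K)))[0].T   # K orthonormal cluster centers in R^D
cluster_class = rng.choice([-1.0, 1.0], size=K)       # binary class label of each cluster

X = np.concatenate([mu[k] + sigma * rng.standard_normal((n_per_cluster, D)) for k in range(K)])
y = np.repeat(cluster_class, n_per_cluster)
```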
We hope that our newly added experiments address reviewers' concerns about the practical relevance of our theorems, and we are happy to further improve them based on reviewers' feedback.
References:
Papyan et al., Prevalence of neural collapse during the terminal phase of deep learning training. PNAS, 2020
Galanti et al., On the role of neural collapse in transfer learning. ICLR, 2021
This paper shows that gradient descent can provably learn a robust classifier for binary classification data drawn from clusters with orthonormal means, under a specific initialization condition. The reviewers highlight the theoretical rigor, novelty of the results, technical contributions, and high writing quality of the paper. They also raise concerns about the limited practical implications and the stringent assumptions on the data distribution and the initialization condition. The AC agrees that this paper makes solid technical contributions to a very challenging problem, but thinks that the paper could be strengthened by a better argument for its practical relevance (e.g., what is the consequence of this result, given that in practice adversarial training is typically required to achieve robustness?) and by weaker requirements on the initialization.
Additional Comments from Reviewer Discussion
The authors added a numerical experiment during the discussion phase, aiming to address concerns about the practical relevance of their data assumption. This new experiment partially, but did not completely, address the reviewers' concerns.
Reject