5.0

/10

Poster3 位审稿人

最低5最高5标准差0.0

3.3

置信度

正确性3.0

贡献度2.7

表达2.3

NeurIPS 2024

Wide Two-Layer Networks can Learn from Adversarial Perturbations

Soichiro Kumano,Hiroshi Kera,Toshihiko Yamasaki

OpenReview PDF

提交: 2024-05-14更新: 2024-12-20

TL;DR

Justify the feature hypothesis and perturbation learning.

摘要

关键词

Adversarial PerturbationsAdversarial ExamplesAdversarial AttacksNon-Robust FeaturesPerturbation LearningLearning from Adversarial Perturbations

评审与讨论

审稿意见

评分: 5置信度: 32024-07-04

In this paper, the authors theoretically investigate the perturbation learning phenomenon from a future hypothesis perspective. Perturbation learning means that a classifier can be learned from adversarial examples with incorrect labels (the labels used to generate adversarial examples, but seems to be incorrect in human eyes). Their theory includes two parts. First, they show that the adversarial perturbations are the linear combination of training samples, thus containing the information of the clean training dataset. This theorem provides an intuition for the feature hypothesis. Then, they show that, under some conditions, the predictions of a classifier trained on adversarial perturbations (or adversarial examples) are consistent with those of a classifier trained on a clean dataset.

Some experiments on a synthetic Gaussian dataset are provided to show the effect of the input dimension and hidden dimension, validating their theorem.

优点

Understanding perturbation learning and the feature hypothesis is quite important in the domain of adversarial learning. This paper provides a deeper understanding of this problem.
In this paper, the authors provide a more general theory about perturbation learning compared to prior work. The theory is based on fewer constraints in training data distribution, training time, etc.

缺点

My main concern is whether the kernel regime is a suitable and extensible tool to study the feature hypothesis in adversarial training and explain other interesting phenomena such as the transferability of adversarial examples and the trade-off between robustness and accuracy.

According to the feature hypothesis of adversarial training, the data contains a set of features that could used for prediction. Humans and all kinds of neural networks use different feature subsets to classify the image (the subsets have overlaps). The feature hypothesis may provide a unified explanation for several open problems in adversarial training. However, it seems that the theory in this paper only focuses on perturbation learning and is difficult to extend to explain other phenomena. In my opinion, the framework should have a deeper exploration of the feature subsets that are used by the trained models to make predictions.

问题

Experimental Questions:

What is the effect of the sampling of adversarial labels on the accuracy of the model $g$ ? If we set the adversarial labels of all negative data to be positive and vice versa, will the accuracy of $g$ be very low?
What is the effect of the structures of $f$ and $g$ ? When these two networks have more divergent structures, I think the accuracy of $g$ must be lower. Is it correct in experiments? If it is, this is similar to the transferability of adversarial examples, i.e., the adversarial examples are easier to transfer to similar networks.

Theoretical Questions:

What do the functions $\hat{f}$ and $\hat{g}$ mean? What are the relationships between $f$ and $\hat{f}$ , $g$ and $\hat{g}$ ?

局限性

The authors state the limitation on the assumption of the width of the network in the paper.

作者回复

2024-08-06

We appreciate the reviewer's constructive comments.

My main concern is whether the kernel regime is a suitable and extensible tool to study the feature hypothesis in adversarial training and explain other interesting phenomena such as the transferability of adversarial examples and the trade-off between robustness and accuracy. ... The feature hypothesis may provide a unified explanation for several open problems in adversarial training. However, it seems that the theory in this paper only focuses on perturbation learning and is difficult to extend to explain other phenomena. In my opinion, the framework should have a deeper exploration of the feature subsets that are used by the trained models to make predictions

First, we would like to emphasize the following: the target of this study is perturbation learning, not others, such as adversarial training or transferability. While the feature hypothesis and perturbation learning are foundational for understanding several phenomena of adversarial examples and training, they are originally different research topics. We would appreciate it if the reviewers evaluated our work fairly, taking into account its scope.

Nevertheless, we found the questions insightful for our future work because our analysis is compatible with the setups of adversarial training and other related topics. Recall that all of our discussions stem from the update equation of gradient descent (flow), which is a fundamental and common component of deep learning. In addition, the main assumption is only wide width, which is only used to control each gradient descent step and not specific to perturbation learning. These advantages of our framework generally do not confxlict with the analysis of adversarial training and examples. Therefore, we consider that our framework is helpful for considering the features (subsets) of adversarial examples and adversarial training. If there are specific barriers that the reviewer foresees, we would greatly appreciate if they could be shared during the discussion period.

For these reasons, we believe our theoretical framework has the potential to address other problems and phenomena, including adversarial training and examples. However, we would like to respectfully emphasize that the primary focus of our research is perturbation learning and the feature hypothesis. We kindly ask the reviewer to consider evaluating our work from this perspective. We believe that our research offers novel and profound theoretical insights into perturbation learning.

What is the effect of the sampling of adversarial labels on the accuracy of the model $g$ ? If we set the adversarial labels of all negative data to be positive and vice versa, will the accuracy of $g$ be very low?

Empirically, the accuracy of $g$ decreases along with the increase of label flips (i.e., negative to positive and vice versa). However, it is known that deep neural networks can sometimes achieve above-chance accuracy even in such environments. Detailed experimental results are presented in [13] and [18]. On the other hand, for simpler models like two-layer networks, it is empirically difficult to achieve accuracy significantly above chance in such environments. Indeed, our theoretical results, similar to prior work [18], indicate that two-layer networks are unlikely to achieve prediction matching under flipped label conditions.

What is the effect of the structures of $f$ and $g$ ? When these two networks have more divergent structures, I think the accuracy of $g$ must be lower. Is it correct in experiments? If it is, this is similar to the transferability of adversarial examples, i.e., the adversarial examples are easier to transfer to similar networks.

The more different the structures of $f$ and $g$ are, the lower the accuracy of $g$ tends to be. As the reviewer pointed out, this behavior is similar to that observed in transferability, suggesting that the features captured by models during learning (and exploited by adversarial attacks) are dependent on the structure. These results are provided in Figure 3 [13].

What do the functions $\hat{f}$ and $\hat{g}$ mean? What are the relationships between $f$ and $\hat{f}$ , $g$ and $\hat{g}$ ?

$\hat{f}$ and $\hat{g}$ can be interpreted as the main components of $f$ and $g$ respectively. Specifically, $f(z) = \hat{f}(z) + \Delta_f(z)$ and $g(z) = \hat{g}(z) + \Delta_g(z)$ , where $\Delta_f(z)$ and $\Delta_g(z)$ are, in many cases, relatively smaller than $\hat{f}(z)$ and $\hat{g}(z)$ . Therefore, if the signs of $\hat{f}(z)$ and $\hat{g}(z)$ match (i.e., the agreement condition holds), in most cases, the signs of $f(z)$ and $g(z)$ also match (i.e., perturbation learning succeeds). The functional margin conditions represent the conditions under which $\hat{f}(z)$ and $\hat{g}(z)$ become sufficiently large relative to $\Delta_f(z)$ and $\Delta_g(z)$ .

审稿意见

评分: 5置信度: 42024-07-17

This work aims to provide an alternative theoretical analysis to justify feature hypothesis and perturbation learning. The analysis is based on approximation theory in the kernel regime (i.e., infinite width). They show that the adversarial perturbation contains sufficient data information, which can be retrieved by the perturbation learning when y labels are uniformly sampled.

优点

The paper is theoretically sound and relaxed certain conditions required by prior works.

缺点

What is the contribution beyond technical novelty? Are any new insights obtained via the new analysis compared to existing results? The conditions in the main theorems lack interpretability. I can understand the technical reason behind them, but how reasonable are these conditions, especially the agreement condition? The author argues that the agreement condition depends on the consistency of the correlation between z and y_nx_n. This statement is not rigorous, since z is a function argument, but not a r.v. and there is no correlation involved. I suppose a more meaningful question will be, if z follows the data distribution of x (e.g., a mixture of two Gaussian), what is the chance these conditions hold?

问题

See weakness.

局限性

Yes. The limitation is clearly statement in the paper.

作者回复

2024-08-06

We thank the reviewer's suggestive questions. The reviewer seems to appreciate our technical contributions, and the questions are thus more posed on the high-level understanding of the results. We here address this concern. We are willing to address any feedback and requests for further clarification upon your response during the discussion period, which will hopefully lead to improved the reviewer’s understanding and evaluation of our work.

What is the contribution beyond technical novelty? Are any new insights obtained ... compared to existing results?

Indeed, our work not only proposes a technically advanced analysis but also provides many insights into perturbation learning as follows:

A large dimension and longer training time strengthen the alignment between perturbations and training samples.
Similar samples in the training set are emphasized in perturbations.
Our results explicitly reveal the dependncy between the success of perturbation learning and each variable (dimension $d$ , sample size $N$ , perturbation size $\epsilon$ , and confidence $\delta$ )

We elaborate on them.

Feature hypothesis (Theorem 3.3).

Our results offer three new insights regarding the feature hypothesis.

First, the residual term $\xi_n$ in our result offers new insights into the alignment between perturbations and training samples. The direction of the perturbation vector is described by two components: the weighted sum of the training samples (main term) and the residual term. Our results suggest that as the input dimension increases, the residual term becomes smaller than the main term, and the alignment strengthens. In other words, perturbations more robustly contain class features. This insight was not obtainable from the existing research due to the absence of a residual in their limited problem setting.

Second, our result suggests that longer training time strengthens the directional agreement, which is supported by intuition and experience but has not been addressed in existing research.

Third, we identify the coefficient $\Phi(x_n, x_k)$ for each training sample. Our explicitly derived $\Phi(x_n, x_k)$ reveals that the coefficient of each training sample is determined by the slope of the activation function and depends on the similarity between $x_n$ and $x_k$ (cf. Eq. (4)). This implies that the more similar samples are included in the training set, the stronger their influence is reflected in the perturbations. While similar coefficients existed in previous research, they are could not be explicitly obtained.

Perturbation learning (Theorems 3.4 and 3.5)

The last insight is the explicit connection between the successful perturbation learning and training factors, including training time $T$ , perturbation size $\epsilon$ , input dimension $d$ , sample size $N$ , and confidence level $\delta$ , enhancing our understanding of their impacts on perturbation learning. For example, perturbation learning succeeds more easily with a sublinear speed with respect to the dimension $d$ and sample size $N$ . Furthermore, our findings indicate that either a large $d$ or $N$ alone is insufficient since Eqs. (9) and (10) include $d$ - and $N$ -irrelevant terms. In contrast, existing research does not explain the impact of these variables, only demonsrtating that it succeeds when both $d$ and $N$ are infinite.

Additionally, technical improvements in data distribution, perturbation design, training procedures, and network settings directly contribute to understanding the broad applicability of the feature hypothesis and the success of perturbation learning. These improvements are insightful in their own right.

The conditions in the main theorems lack interpretability. ... how reasonable are these conditions, especially the agreement condition? ... if z follows the data distribution of x ..., what is the chance these conditions hold?

In this context, we used "correlation" only to abbreviate (the sign of) the inner product of two vectors (not necessarily random variables). We will improve our manuscript to reduce any misleading.

We consider $z$ as an arbitrary $d$ -dimensional vector rather than a random variable to provide general results that do not depend on a specific probability distribution. This approach allows for easy consideration of $z$ as a random variable in subsequent analyses.

Let us consider the agreement condition in the following setting.

The positive sample size is $N/2$ , and the negative is $N/2$ .
$x_1, \ldots, x_N$ are i.i.d. and sampled from $\mathcal{N}(y_n\mu, I_d)$ , where $\mu$ is a $d$ -dimensional vector.
$z$ follows $\mathcal{N}(\mu, I_d)$ ; namely, $z$ is positive.

In this setting, informally, the following holds:

$\langle x_n, z\rangle = y_n \| \mu \|^2 \pm O(\sqrt{d}\ln(1/\delta))$
$\Phi_+ = \Phi(x_n, z)$ if $y_n = +1$ , otherwise $\Phi_- = \Phi(x_n, z)$
$\Phi_+ = \Phi(x_n, x_k)$ if $y_ny_k = +1$ , otherwise $\Phi_- = \Phi(x_n, x_k)$

(1) is derived from a concentration inequality, where $\delta$ is the confidence level. Since $x_n$ has the same probabilistic properties within each class, $\Phi(x_n, z)$ should have similar values for all $n$ satisfying $y_n = 1$ (or $y_n = -1$ ). While there would be probabilistic fluctuations, for simplicity, we fix $\Phi_+ := \Phi(x_n, z)$ if $y_n=1$ , and $\Phi_- := \phi(x_n, z)$ if $y_n = -1$ . Similarly, we fix $\Phi(x_n, x_k)$ as in (3).

Noting that $\Phi_+ = \Theta(1)$ and $\Phi_- = \Theta(1)$ , and if the sign of $\sum_k y_k \Phi(x_n, x_k) \langle x_k, z \rangle$ is the same for all $n$ , then the sign of $\hat{g}_a(z)$ is equal to it, we can derive:

$sgn(\hat{f}(z)) = (\Phi_+ + \Phi_-) \| \mu \|^2 \pm O(\sqrt{d}\ln(1/\delta)) = \Theta(d) \pm O(\sqrt{d}\ln(1/\delta))$
$sgn(\hat{g}(z)) = (\Phi_+ + \Phi_-) \| \mu \|^2 \pm O(\sqrt{d}\ln(1/\delta)) = \Theta(d) \pm O(\sqrt{d}\ln(1/\delta))$

This implies that if the input dimension $d$ is large, the agreement condition easily holds with high probability.

审稿意见

评分: 5置信度: 32024-07-19

Perturbation learning, where classifiers are trained on adversarial examples with their associated incorrect labels, results in non-trivial generalization. This work theoretically tackles the perplexing former phenomenon for wide two-layer networks in the kernel regime. The authors first prove that adversarial perturbations are parallel to a meaningful feature vector (that can contain whole dataset information under some additional assumptions) up to an error term. Second, the authors provide some conditions on when the predictions with perturbation learning match those of standard learning.

优点

The problem is a very interesting one and lacks more analysis, so this paper addresses a significant problem in an original way.
The setting and assumptions are very well described and clear.

缺点

I am overall confused about the choice of kernel regime to explain perturbation learning. As the authors acknowledge, there is no feature learning for the choice of width in this paper (the output of hidden units remains the same). It's not clear how such a framework can explain perturbation learning and, in particular, the "feature hypothesis," which authors claim that they do. Since there is no feature learning in this regime, there should not be any "feature hypothesis." This is not to say the analysis is uninteresting; it applies to perturbation learning with kernels. But this requires a completely different contextualization than the authors provide, one that focuses on the adversarial robustness of kernels rather than neural networks. And for the reasons above, I find the comparison with prior work a bit misleading as they tackle perturbation learning for different scenarios.
It is difficult to judge the validity of the assumptions in the paper.

Assumption 3.2. is very convoluted, and I don't see how one can justify this assumption in finite width settings. I equate this assumption to an infinite-width assumption, which is perfectly reasonable, but as I discussed above, I believe it changes the object of study.
The assumptions of Theorem 3.3. and Theorem 3.4. are discussed, and some intuitions are provided. But it is not discussed why they would hold for a small $\delta$ .

The $\delta$ dependencies in Theorems 3.3, 3.4, and 3.5 are confusing. Let's focus only on Theorem 3.3. When we take $\delta \to 0$ while everything else remains fixed, the statement is trivial. So, the interesting part of the statement is when $T_f$ or $d$ scales with $\delta$ . I believe it is more reasonable to consider $T_f \to \infty$ . So, this verifies that Assumption 3.2. is an infinite width assumption.

问题

Could you discuss again the choice of kernel regime vs. feature learning for your analysis after my comments in the Weaknesses section?
Could you provide references or evidence towards why $\epsilon$ should scale with $1/\sqrt{d}$ ?

局限性

I think the authors overclaim by saying, "We provided a theoretical justification for perturbation learning and the feature hypothesis" (L293). Their analysis is limited to the kernel regime, which is acknowledged, but its implications for the feature hypothesis are not well explained. In addition, scaling with $\delta$ in main theorems is not discussed, and as I pointed out, it points to other limitations towards infinite width.

作者回复

2024-08-07

We appreciate the reviewer's insightful comments. The reviewer seems to highly evaluate our problem and analysis as interesting but also has concerns about the theoretical principle, which results in a low initial score.

Regarding the concerns, the reviewer seems to have several fundamental misunderstandings about the kernel regime. We would like to address this in the rebuttal.

In short, a) network parameters change during training, b) feature learning occurs in the kernel regime, and c) our claim does not require infinite width, $\delta \to 0$ , and $T \to \infty$ . We elaborate on each below.

Network parameters change during training.

If the reviewer considers that the outputs of hidden units remain the same during training as at initialization, this is not correct. In our framework, hidden weights are trained and updated during training. Thus, the outputs of hidden units change from their initialization.

We employ gradient flow.

The reviewer might misunderstand that we employ a kernel method. We use gradient flow. Our training scenario is the same as in prior work [18].

Feature learning occurs even in a kernel regime

In our framework, i.e., a kernel regime, feature learning occurs. In other words, network parameters are updated and learn class features from training samples.

For clarity, we offer the following definitions:

Feature learning: A process in neural networks where parameters change during training according to the training dataset via gradient flow, extracting class features, and determining predictions.
Kernel regime: A situation where parameters in neural networks change within a small (but not infinitely small) margin from their initialization in a (finitely) large width setting.

The key distinction between our work and previous studies is that prior work considers weights that can change freely (unrestricted feature learning), while our work examines weights that can change freely only around their initialization point (feature learning in a kernel regime).

Our assumptions and theorems do not require infinite network width.

Assumption 3.2 does not imply infinite width. For the identity loss, it requires $m > O(d^2T^2)$ , where $d$ is the input dimension and $T$ is the continuous training time. Given the finiteness of $d$ and $T$ , $m$ remains finite.

Our theorems also do not necessitate infinite width. Let us consider simplified Assumption 3.2 and Theorem 3.4.

Assumption 3.2: Network width $m$ satisfies $m > T^2$ .
Theorem 3.4: Let $\delta$ be a small positive number. Under Assumption 3.2, for any $z \in \mathbb{R}^d$ , if $g_a(z) > 1 / T \delta$ holds, then, with probability at least $1 - \delta$ , perturbation learning succeeds.

$\delta = 0$ leads to $T = \infty$ and thus $m = \infty$ . However, for any small $\delta > 0$ , we can choose finite $T$ such that $g_a(z) > 1 / T \delta$ . Furthermore, for any $T > 0$ , we can select finite $m$ such that $m > T^2$ . Thus, for any $\delta > 0$ , a finite width is sufficient to consider Assumption 3.2 and Theorem 3.4. Note that strict positivity of $\delta$ is assumed in these theorems.

We also note that $\delta$ is the confidence level, and it is not necessary to consider it as $0$ or infinitesimal. For example, with $\delta = 0.01$ , we can choose a finite width $m$ that guarantees the success of perturbation learning with probability at least 99%. We believe that the impacts of $T$ , $d$ , and $N$ on perturbation learning with a fixed $\delta$ are more insightful than considering the dynamics as $\delta \to 0$ . The primary implication of Theorem 3.4 is that a large $T$ leads linearly, and $d$ and $N$ lead sublinearly, to more pronounced prediction matching with the same probability (fixed $\delta$ ).

Our results hold even for infinite network width.

The reviewer might consider that in sufficiently wide width, weights do not change from initialization, and thus networks do not learn class features. This is not correct.

For any $m$ and $T$ , trained weights can be represented as $v_i(T) = v_i(0) + \Delta_i(T)$ with $\\|\Delta_i(T)\\| = O(T/\sqrt{m})$ . $\Delta_i$ is not $0$ and aligns with data patterns, indicating that feature learning occurs. While $\\|\Delta_i(T)\\|$ approaches zero as $m \to \infty$ , it does not become zero for any $m$ . If $\\|\Delta_i(T)\\| = 0$ for every $i$ , the trained network output $f(T)$ would be identical to the initialized output $f(0)$ , meaning training would not work at all. For any $m$ (even infinitely large), each hidden weight changes slightly, and network outputs as the summation of outputs of hidden units change significantly during training. Therefore, neural networks learn class features, and our theorems are not invalidated for any width.

Feature learning is not necessary to justify the feature hypothesis.

The reviewer suggests that theoretical approaches without feature learning dynamics cannot explain perturbation learning and the feature hypothesis. However, we consider that feature learning is not necessary to explain them. The key requirement to justify them is that the image gradients through a classifier (i.e., the direction of the perturbation) contain data information (like Theorem 3.3). For example, we believe that for kernel methods (which our analysis is not), we can empirically and theoretically justify them because the image gradients contain training sample information. Note that feature learning and the feature hypothesis are unrelated concepts.

Scaling of $\epsilon$ (Question 2)

As shown in Theorems 3.4 and 3.5, the L2 perturbation size $\epsilon$ needs to scale with $\sqrt{d}$ . This scaling is a consequnce of the property of the L2 norm, which scales with $\sqrt{d}$ . To maintain the signal-to-noise ratio, L2 perturbation sizes must scale accordingly. For example, please refer to Table 2 in [1].

[1] V. Sehwag et al., Robust Learning Meets Generative Models: Can Proxy Distributions Improve Adversarial Robustness? ICLR22.

2024-08-10

Thanks for your detailed explanations.

I do agree with both of your claims in a) and c), and I did not claim otherwise in my review. I will detail what I meant in my original review.

Assumption 3.2.

It is true that Assumption 3.2. is satisfied by some $m$ for any $d$ and $T$ (let's grant that $\ell'$ is bounded). I do not claim otherwise. But I do not see why the finite $m$ regime satisfied by Assumption 3.2. is any more interesting than infinite $m$ . There is nothing in the practice of ML that would make me think Assumption 3.2. is more reasonable than the infinite $m$ assumption. Again, there is nothing wrong with assuming infinite $m$ : it is extremely hard to do good theory, and kernel regime is only one of the few approaches available.

Assumptions in Theorems 3.4., 3.5.

Let $\delta$ be a fixed small number. For any $z$ , there are two competing conditions on $T_f$ and $T_g$ now: First, you assume them to be $\Theta(1)$ . Then, you also assume the margin conditions, which depend on $T_f$ and $T_g$ . There is no discussion on why these margin conditions would hold for the regime imposed on $T_f$ and $T_g$ . In addition, there are remainder terms in margin conditions even if $T_f$ and $T_g$ go to infinity, which are not discussed.

Scaling with $\delta$

Theorems 3.4. and 3.5. have $\delta$ dependency only in multiplication with $T_f$ and $T_g$ . For a choice of confidence level $\delta$ , $T_f$ and $T_g$ will need to be larger than some quantity that depends inversely on $\delta$ . This means that $m$ itself has to be larger than some quantity that depends on $\delta$ .

This is further evidence, along with Assumption 3.2., that authors need an infinite-width approximation, and the result holds with probability degrading depending on the validity of this approximation. This should not come across as a surprise to the authors: this is the core gist of the kernel regime. All this considered, my main point is the fact that for some confidence levels, there are large numbers of $m$ verifying the assumptions of the paper, is not more interesting than an infinite $m$ assumption.

Network parameters change during the training.

From now on, I will only consider the $m \to \infty$ limit. As the Kernel Regime paragraph details in the Sketch of Proof section, the outputs of hidden neurons change negligibly. This is the basis of my first objection to the kernel regime and how it can be used to explain perturbation learning. In this regime, one cannot claim feature learning in the usual sense. The final output changes, but this does not imply that the network has learned specific features.

My understanding of perturbation learning was that it is equal to "learning" the features present in adversarial perturbations, the features that are claimed to exist by "feature hypothesis." And since there is no actual feature learning happening in the kernel regime, this cannot be studied in such a framework. However, the authors use a more nuanced definition, and more specifically, the word "enable," i.e., indicating that the features in adversarial perturbations can somehow help the classifiers (in contrast to the case where classifiers simply learn these features). This is exactly why I brought up the kernels, which would have no learning of features but alignment of the features such that they align with these features.

Final conclusion

I will increase my score to 4 from 3, as my initial judgment on the relationship between perturbation learning and the kernel regime was harsh. The authors do not explicitly claim features posited by "feature hypothesis" are learned by wide, two-layer networks.

I am willing to increase my score to 5 from 4 if the authors agree to incorporate

i) a discussion on the kernel regime and what it can model in neural networks in the related work section,

ii) more nuanced discussions on the limitations of their results, including the ones I have highlighted in the text.

评论- Official Comment by Authors (2/2)

2024-08-11

(I) Assumption 3.2 and scaling with $\delta$ .

(scaling with $\delta$ ) Theorems 3.4. ... on $\delta$ .

the result holds with probability degrading depending on the validity of this approximation

First, these are correct.

(in "scaling with $\delta$ ") for some confidence levels, there are large numbers of $m$ verifying the assumptions of the paper, is not more interesting than an infinite $m$ assumption.

(in "Assumption 3.2") I do not see why the finite $m$ regime satisfied by Assumption 3.2. is any more interesting than infinite $m$ .

As the reviewer pointed out, for some confidence levels, there are infinitely many values of $m$ that satisfy the conditions, naturally including infinite width. However, this does not imply that we must choose infinite width. We can select the smallest $m$ that satisfies the conditions.

Moreover, we do not claim that the finite width regime is more interesting than infinite width. The finite width is discussed in this paper because it is not theoretically necessary to consider infinite width. There may be some misunderstanding between us and the reviewer regarding the term "interesting." If our response does not address the reviewer's concerns, we would appreciate further clarification.

There is nothing in the practice of ML that would make me think Assumption 3.2. is more reasonable than the infinite $m$ assumption.

We have interpreted this concern as follows:

In practice, the experimenter does not know in advance how large $T$ (training time) needs to be for perturbation learning to succeed. If the experimenter continues training until perturbation learning succeeds, $T$ could become very large. Therefore, the experimenter may need to select a very large $m$ in advance, which is essentially equivalent to assuming infinite width.

This concern is valid when $T$ is variable during training and unknown in advance, and the experimenter is free to choose an arbitrarily large value.

However, it should be noted that our theory does not claim how the training evolves over time (i.e., $T$ ; variable). It characterizes the model trained with a designated setup (including the training time $T$ ; constant). In other words, we fix $T$ before training, as is typical in actual network training. For example, it is natural to set $T$ in advance to 100 epochs for MNIST. The experimenter selects a constant $T$ (although it does not always lead to the success of learning) and, accordingly, a finite $m$ , which is more reasonable than an infinite width assumption.

(II) Assumptions in Theorems 3.4., 3.5.

there are two competing conditions on $T_f$ and $T_g$ ...

The reviewer might consider that $T$ is substantially constrained to $c < T < C$ for some $c,C$ , which prevents the satisfaction of the functional margin conditions. We should note that the assumption $T = \Theta(1)$ is introduced only for notational simplicity of the functional margin conditions. Essentially, no assumption on $T$ without $T > 0$ is required. Simply speaking, this assumption is introduced to derive $O(T + (1/\sqrt{d})) = O(T)$ . Although not realistic, if $T = \Theta(1/d)$ , then $O(T + (1/\sqrt{d})) = O(1/\sqrt{d})$ . Without any assumptions, we would need to write unnecessarily intricate conditions, which we believe should not be included in the main text. We will revise any misleading assumptions.

there are remainder terms ... which are not discussed.

These time-independent terms are discussed in Lines 179--181.

(III) Network parameters change during the training.

The reviewer's understanding is generally correct. The reviewer seems view "feature learning" as the extraction of higher-level features that are potentially latent within the data. In other words, they interpret features not as raw data vectors or their parts but as more complex combinations of these elements. According to the reviewer, kernel methods and our framework (cf. Eq. (22) and (23)) perform "feature alignment" with raw data rather than "feature learning" as a higher-level feature extraction.

I partially agree with this assertion. The difference in intepretation between us and the reviewer seems to be in the scope of what is referred to as "features." In addition to these high-level features, we also consider the data vectors themselves as features and regard their simple extraction as a form of "learning." For example, in the binary classification of vertical and horizontal lines or in MNIST, the raw data itself would serve as features. However, we will revise any misleading terms that may lead to misinterpretation. Furthermore, we will explicitly state in the limitations section that we prove the feature hypothesis and perturbation learning from primitive feature perspective, but do not fully explain them with higher-level features.

Note that prior work shares the same limitation, and the extraction of higher-level features might require relaxing constraints related to depth rather than width.

评论- Official Comment by Authors (1/2)

2024-08-11

We sincerely appreciate the reviewer's prompt response and the effort they put into engaging with the discussion.

First reply: We will first address the two conditions that the reviewer has set for increasing the score. Following that, we will summarize our perspective on the generality of finite width, which may be a point of misunderstanding between us and the reviewer.

Second reply: We will provide responses to the specific questions raised by the reviewer.

The reviewer's suggestion (important)

We address the two conditions that the reviewer required for raising the score as follows:

(i) We will include studies related to the kernel regime, such as [14], in the related work section. Rather than focusing on studies related to training convergence, we believe it would be more effective to reference works like [8], which first discussed the invariance of hidden units, and [1] (see below), which uses the kernel regime to analyze the properties of adversarial examples.

(ii) In the limitations section, we will acknowledge that the "features" we focused on are primitive, and higher-level features potentially included in perturbations and learning from them may not be theoretically clear (cf. (III)).

If there are any discrepancies between the reviewer's requests and our understanding, we would be happy to address them.

[1] H. Zhang et al. Adversarial Noises Are Linearly Separable for (Nearly) Random Neural Networks. AISTATS23.

Generality of finite width (important)

Furthermore, we sincerely appreciate the reviewer's thoughtful suggestion for an additional discussion, and we will certainly incorporate it into the main text (please see the elaboration at the end of this reply).

However, we believe that a critical point has been missed by the reviewer: one of our main contributions is the proposal of a general theory that encompasses both finite and infinite-width cases. While the reviewer has interpreted our work with $m \to \infty$ , it is important to note that the finite case is at the very core of this research. The reviewer may argue that the finite case is neither more interesting nor more practical than the infinite case, but we respectfully and strongly object to this.

a) First of all, all networks that can be realized in practice are constrained by finite width. Although the assumption of infinite width is commonly used for theoretical convenience, the finiteness of width is indeed significant in practical machine learning. For instance, if we had infinitely many samples, there would be no generalization gap; if we had infinitely long training time, simulated annealing would discover a global optimum with probability one.

b) Second, by deriving conditions that depend on specific variables rather than assuming infinity, we can understand how assumptions about width change. For example, in this study, we have $m > O(d^2T^2)$ , which suggests that our assumptions become stricter as they are proportionate to squared $d$ and $T$ . This insight is not evident under the naive assumption of $m \to \infty$ .

最终决定Accept (poster)

2024-09-25

The paper presents a significant theoretical contribution by analyzing perturbation learning in wide two-layer networks within the kernel regime. The reviewers recognized the originality and importance of this work, with Reviewer MQtT noting that the paper addresses a “very interesting problem” and provides a “significant problem in an original way.” Despite initial concerns about the kernel regime’s applicability to feature learning, the authors clarified that feature learning does occur in this regime and emphasized that their results hold for finite-width settings, which is more practical. Reviewer 5BXU highlighted that the paper is “theoretically sound” and praised its relaxation of certain conditions from prior works, noting the explicit connections the authors make between training factors and the success of perturbation learning. Reviewer fzgX appreciated the deeper understanding the paper provides for the feature hypothesis in adversarial learning, though they suggested exploring extensions to other adversarial training phenomena in future work. The authors’ responses effectively addressed the reviewers’ concerns, leading to score improvements and demonstrating the paper’s potential to make a strong impact on the field. Given the theoretical depth, novel insights, and the thorough clarifications provided, I recommend accepting the paper.