PaperHub
Score: 7.2/10 · Poster · 4 reviewers (ratings 4, 4, 3, 4; min 3, max 4, std 0.4)
ICML 2025

Simplicity Bias and Optimization Threshold in Two-Layer ReLU Networks

OpenReview | PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
Neural Networks, Simplicity Bias, Implicit Bias, One-Hidden-Layer ReLU Network, Early Alignment

Reviews and Discussion

Review (Rating: 4)

This is a theoretical paper seeking to explain the phenomenon that, in some situations, overparameterized models fail to interpolate the training data once the number of noisy training samples is large, and instead converge to a solution that minimizes the test loss. This was observed with in-context learning and with diffusion models. In this paper, the architecture is overparameterized two-layer ReLU networks, and the focus is on training them from a small initialization by gradient flow on noisy data which is labelled by a linear teacher and which satisfies some further simplifying assumptions. The main result confirms the motivating phenomenon, and its proof provides further insights, pinpointing an early-phase alignment of neurons as the principal cause. Another contribution is a set of concentration bounds for the extremal vectors that drive the alignment process, and the assumptions here are less restrictive than in the main result. The paper also reports and discusses numerical experiments in setups related to, but extending, the setting of the main theoretical result.
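To give a concrete picture of this setting, below is a minimal sketch (not the authors' code) of the toy problem: an overparameterized two-layer ReLU network with small initialization, trained by plain gradient descent (a discrete stand-in for the gradient flow analyzed in the paper) on noisy labels from a linear teacher. All sizes, the noise level, the step size, and the initialization scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (assumptions, not the paper's values)
d, n, m = 10, 200, 400            # input dim, samples, hidden neurons
noise_std, init_scale, lr = 0.5, 1e-3, 0.02

# Noisy data labelled by a linear teacher
teacher = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ teacher + noise_std * rng.standard_normal(n)

# Two-layer ReLU network with small initialization (both layers at scale ~ 1/sqrt(m))
W = init_scale * rng.standard_normal((m, d)) / np.sqrt(m)   # inner weights
a = init_scale * rng.standard_normal(m) / np.sqrt(m)        # outer weights

def forward(X, W, a):
    # f(x) = sum_j a_j * ReLU(<w_j, x>)
    return np.maximum(X @ W.T, 0.0) @ a

# Plain gradient descent on the squared loss; a small step size only
# approximates the gradient flow studied in the paper.
for step in range(5000):                 # increase to observe the later training phases
    pre = X @ W.T                        # (n, m) pre-activations
    act = np.maximum(pre, 0.0)
    resid = act @ a - y                  # (n,) residuals
    grad_a = act.T @ resid / n
    grad_W = ((resid[:, None] * (pre > 0)) * a).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

train_loss = 0.5 * np.mean((forward(X, W, a) - y) ** 2)
print(f"train loss after training: {train_loss:.4f}")
```

In this regime one can track how neuron directions W / ||W|| evolve early in training before the weight norms grow, which is the "early alignment" behavior the review refers to.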

Questions for the Authors

I do not have any at this point.

Claims and Evidence

Full proofs of the theoretical results, which are the paper's main contributions, are provided in the appendix. In addition, the key proofs are sketched in the main text. For the numerical experiments, sufficient details are provided so that it should not be difficult to reproduce them.

Methods and Evaluation Criteria

The experiments explore several settings that go beyond the assumptions of the theoretical results, including the GeLU activation and the Adam optimizer. Although it could have been interesting to investigate real datasets, diffusion models, and in-context learning in order to link the theoretical results more firmly to practical situations, I think the level of experimental exploration the authors chose makes sense for this theoretical work, which seems to be the first in this direction.
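For concreteness, here is a sketch of the kind of variant the reviewer refers to, with a GeLU activation and the Adam optimizer (PyTorch); the architecture sizes and hyperparameters below are assumptions, not the paper's configuration.

```python
import torch

torch.manual_seed(0)
d, n, m = 10, 200, 400                        # illustrative sizes (assumptions)

# Noisy linear-teacher data, as in the theoretical setting
teacher = torch.randn(d) / d**0.5
X = torch.randn(n, d)
y = X @ teacher + 0.5 * torch.randn(n)

model = torch.nn.Sequential(
    torch.nn.Linear(d, m, bias=False),
    torch.nn.GELU(),                          # smooth activation instead of ReLU
    torch.nn.Linear(m, 1, bias=False),
)
# Small initialization on both layers, mimicking the feature-learning regime
for p in model.parameters():
    torch.nn.init.normal_(p, std=1e-3 / m**0.5)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = 0.5 * ((model(X).squeeze(-1) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.4f}")
```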

Theoretical Claims

Yes, and I am convinced that they are correct.

Experimental Design and Analysis

Yes, as far as possible.

Supplementary Material

I read the appendix of the paper.

Relation to Broader Scientific Literature

The paper builds on Boursier and Flammarion (2024), and on a number of recent works in the literature that investigate the training dynamics of one-hidden-layer ReLU and other homogeneous neural networks from a small initialization. However, as far as I know, the link between neuron alignment in early training and generalisation at the expense of interpolation is established rigorously for the first time in this work. Moreover, the theoretical results in this paper are not obvious and do not follow easily from the previous works.

Essential References Not Discussed

I am not aware of any.

Other Strengths and Weaknesses

The paper is well written and the main result is accompanied by a detailed discussion. The theoretical results are non-trivial, and their proofs are provided in the appendix. The concentration bounds result may indeed be useful for future work. The experimental results are interesting, and their discussion is informative.

The gap between the theoretical setting in this work and the empirically observed phenomenon with in-context learning and diffusion models is large, and it is not yet clear whether and how the properties of the training dynamics identified here are related to what actually happens in those practical situations. (E.g. it might be the case that the stability issues studied by Qiao et al. in NeurIPS 2024 are at play there to a greater extent than the neuron alignment in early training.)

Other Comments or Suggestions

In the References, check whether papers cited as on arXiv have been published, and consider providing a clickable link for each item.

Author Response

We sincerely appreciate the reviewer’s insightful feedback.

The gap between the theoretical setting in this work and the empirically observed phenomenon with in-context learning and diffusion models is large, and it is not yet clear whether and how the properties of the training dynamics identified here are related to what actually happens in those practical situations. (E.g. it might be the case that the stability issues studied by Qiao et al. in NeurIPS 2024 are at play there to a greater extent than the neuron alignment in early training.)

We agree that it remains unclear whether empirical observations in more complex architectures are better explained by our theoretical results or those of Qiao et al. (2024). Evaluating this in real-world settings would require extensive experiments beyond the scope of our primarily theoretical contribution, as also noted by the reviewer.

That said, Section A.5 in the appendix suggests that in our toy model—while not necessarily representative of more complex architectures—non-convergence appears to be driven by early alignment rather than stability issues.

We view our work as an alternative perspective to that of Qiao et al. (2024) in explaining non-convergence. In practice, it is likely that both phenomena contribute to the observed behavior in real-world settings.

In the References, check whether papers cited as on arXiv have been published, and consider providing a clickable link for each item.

We will update the references accordingly.

Review (Rating: 4)
  • In the context of two-layer ReLU networks, the paper theoretically explores the issue where trained models get stuck in spurious local minima of the training loss as the number of training samples exceeds a certain threshold.

  • It is demonstrated that networks might converge towards simpler solutions rather than interpolating the training data, a phenomenon referred to as directional alignment.

  • It is pointed out that this type of simplicity bias is indeed beneficial and enhances the generalization of trained models.

Questions for the Authors

I don't have other questions for the authors.

Claims and Evidence

The theoretical claims are supported by mathematical proofs.

Methods and Evaluation Criteria

An experiment with a toy model illustrates the theoretical result.

Theoretical Claims

I did not check the correctness of the proofs due to the time limit.

Experimental Design and Analysis

The experiment on training an overparameterized two-layer neural network shows that the behavior of the training and test losses is indeed consistent with the predicted regimes of generalization.

Supplementary Material

No, due to limited time.

Relation to Broader Scientific Literature

The paper advances the previous work on early alignment phenomenon by further indicating that this bias is beneficial to the generalization capability of the model.

Essential References Not Discussed

Not that I am aware of.

Other Strengths and Weaknesses

  • The paper makes original contributions on top of the work of Boursier and Flammarion (2024).
  • The paper seems technically sound to me, even though I did not carefully check the proofs of the theoretical claims.

Other Comments or Suggestions

Maybe the theorems and propositions could be rephrased in plainer, less mathematically rigorous language for greater clarity.

Author Response

We sincerely appreciate the reviewer’s insightful feedback.

the theorems and propositions could be rephrased in a more accessible and less mathematically rigorous manner to improve clarity

We will enhance the clarity of our results in the revised version. Specifically, we will include an informal statement of Theorem 4.1 at the end of the introduction. However, we do not find it feasible to do the same for Section 3, as it requires prior knowledge of extremal vectors. More precisely, we will make the following revisions regarding the clarity and organization of the paper:

  • Clarification of key results: We will include an informal statement of Theorem 4.1 at the end of the introduction.
  • Emphasis on the goal of Section 3: This section aims to provide a simple characterization of extremal vectors (up to $\sqrt{d/n}$ terms) when the number of training samples is large. This serves two purposes: (1) forming the first step in the proof of Theorem 4.1 and (2) offering a general framework that may be useful in future research, as noted by Reviewer TN6U.
  • Additional clarification of key results: We will add an explanation of how Proposition 3.1 relates to and complements Theorem 3.1.
  • Discussion of main implications: We will briefly summarize the key implications of Theorem 4.1 at the beginning of the discussion section.
  • Expanded discussion in the appendix: We will elaborate on our work’s connections to related literature, particularly regarding the double descent phenomenon. Additionally, we will compare the NTK/lazy training regime with the feature learning/mean-field regime.

We appreciate the reviewers' valuable suggestions and look forward to refining our work accordingly.

Review (Rating: 3)

This paper studies the simplicity bias of two-layer ReLU networks trained by gradient flow on a regression task. The authors show that despite overparameterization, the network may converge toward simpler solutions rather than merely interpolating the data, leading to a drastic improvement in test loss.

Questions for the Authors

  1. You showed that the empirical gradient approximates the expected gradient with an error rate of $O(\sqrt{\log n / n})$ in Theorem 3.1, and analyzed the critical directions/extremal vectors of the early training dynamics near some "good" expected gradient in Proposition 3.1. I am wondering about the relationship between these two results. Is Proposition 3.1 a consequence of Theorem 3.1? (A schematic form of this kind of concentration bound is sketched after this list.)
  2. I am also curious about the existence of the extremal vector in Equation (6), as it depends only on the distribution $\mu$. This suggests that the extremal vector exists only for sufficiently "good" $\mu$. A natural question is: for what choices of $\mu$ does the extremal vector in Equation (6) exist?
  3. What is the relationship between Section 3 and Section 4? I understand that Section 4 considers a special case of Section 3, where the data distribution $\mu$ follows a linear model. Are the lemmas in Section 3 useful for proving the lemmas in Section 4? Additionally, what is the connection between the implications of Proposition 3.1 and Theorem 4.1?
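As a purely illustrative aid for question 1, here is a schematic statement of the kind of uniform gradient-concentration bound referred to there. The notation $L_n$ and $L$ for the empirical and population losses is an assumption; the exact quantities bounded in the paper's Theorem 3.1 (the $D_n$ and $D$ of the authors' reply below) may differ.

```latex
% Schematic only: L_n and L denote the empirical and population losses,
% and the supremum ranges over unit-ball neuron weights.
\sup_{\|w\| \le 1} \bigl\| \nabla_w L_n(w) - \nabla_w L(w) \bigr\|
  \;=\; O\!\left( \sqrt{\tfrac{\log n}{n}} \right)
  \quad \text{with high probability over the draw of the } n \text{ samples.}
```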

Claims and Evidence

I have a question: In lines 72 and 73, you claim that learning useful features stops before full interpolation in modern architectures (even with prolonged training). I assume this statement refers to diffusion models and large language models. Is there any reference to support this claim?

Methods and Evaluation Criteria

Yes, although two-layer ReLU neural networks and gradient flow are oversimplified models used to make the problem tractable, they are accepted by the machine learning theory community.

Theoretical Claims

No, but the results seem reasonable to me.

Experimental Design and Analysis

Yes, I checked the experiment details, and they make sense to me.

Supplementary Material

No.

Relation to Broader Scientific Literature

This work contributes to the growing body of research on the implicit bias of overparameterized models, particularly in gradient-based optimization of neural networks. It builds on prior studies analyzing the simplicity bias in two-layer ReLU networks and extends these ideas by examining early training dynamics. However, clearer connections to empirical findings in large-scale deep learning, such as diffusion models and large language models, would strengthen its contribution.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths:

  1. Theoretical contributions provide insights into the simplicity bias of gradient-based optimization in overparameterized models.
  2. The analysis of early training dynamics is well-motivated and connects to the broader study of generalization in deep learning.

Weaknesses:

  1. The connection to real-world large-scale models, such as diffusion models and LLMs, remains somewhat abstract and could benefit from empirical validation or discussion of practical implications.
  2. The organization of the paper is somewhat unclear.

Other Comments or Suggestions

Please unify the notation in Theorem 3.1, as the mixed use of $\mathbb{E}_{X,y}$ and $\mathbb{E}_{\mu}$ is confusing.

Author Response

We sincerely appreciate the reviewer’s insightful feedback.

In lines 72 and 73, is there any reference to support this claim?

Yes, this phenomenon is extensively discussed by Raventos et al. (2024). In particular, Figure 4 and the discussion at the bottom of page 7 suggest that the model does not interpolate, even when trained indefinitely.

For diffusion models, Kadkhodaie et al. (2023) indicate that models are trained until convergence, yet interpolation ceases beyond a certain number of training samples. However, their experimental details do not specify the exact duration of training.

The connection to real-world large-scale models remains abstract

Assessing the causes of non-convergence in real-world setups—beyond the analysis in Section A.5 for our toy model—would require extensive experiments that warrant a separate study. As reviewer TN6U noted, our primary contribution is theoretical, making such empirical investigations beyond our current scope.

That said, Raventos et al. (2024) provide strong empirical evidence supporting the relevance of our work to ICL. Their experiments suggest that after a certain number of task examples, the model stops reaching the ERM solution (modeled as a Nadaraya-Watson kernel with poor generalization) and instead converges to a spurious local minimum that generalizes well, aligning with the optimal estimator. Furthermore, their findings indicate that prolonged training does not improve performance, reinforcing our claim.

A similar pattern is observed in Kadkhodaie et al. (2023) for diffusion models: the training loss remains low with few samples but increases beyond a certain threshold, while the test loss continues to improve.

The organization of the paper is somewhat unclear.

In response to the different reviewers’ feedback, we will make the following revisions regarding the clarity and organization of the paper:

  • We will include an informal statement of Theorem 4.1 at the end of the introduction.
  • We will emphasize the goal of Section 3 (see more below).
  • We will add an explanation of how Proposition 3.1 relates to and complements Theorem 3.1.
  • We will summarize the key implications of Theorem 4.1 at the beginning of the discussion section.
  • We will elaborate on our work’s connections to related literature, particularly regarding the double descent phenomenon, and compare the NTK/lazy training regime with the feature learning/mean-field regime.

We appreciate the reviewers' valuable suggestions and look forward to refining our work accordingly.

The mixed use of $\mathbb{E}_{X,y}$ and $\mathbb{E}_{\mu}$ is confusing.

We appreciate this feedback. The notation will be clarified (and unified) in the revised version.

Is Proposition 3.1 a consequence of Theorem 3.1?

As stated after Proposition 3.1, “Proposition 3.1 relies on the tail bound version of Theorem 3.1 and continuity arguments.” While Theorem 3.1 alone does not directly imply Proposition 3.1, its tail bound version plays a crucial role. This distinction is why we did not present Proposition 3.1 as a corollary. We will add further clarification in the revised version.

For what choices of $\mu$ does the extremal vector in Equation (6) exist?

Extremal vectors exist for any distribution $\mu$ since they correspond to critical directions of the continuous function $G$ (defined on line 196). However, if $G$ is non-differentiable, the definition of extremal vectors should be adapted to account for subgradients. Equation (6) assumes differentiability, which holds for distributions $\mu$ that are continuous with respect to the Lebesgue measure (or, more specifically, when the marginal of $x$ is continuous).
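To make the phrase "critical directions of $G$" concrete, the generic first-order condition for a unit vector to be a critical direction of a differentiable function restricted to the sphere is given below; this is a standard fact of constrained optimization, not a reproduction of the paper's Equation (6).

```latex
% A unit vector w* is a critical direction of a differentiable G restricted to the
% unit sphere iff the Euclidean gradient is radial, i.e. its projection onto the
% tangent space of the sphere vanishes.
\nabla G(w^\star) \;-\; \bigl\langle \nabla G(w^\star),\, w^\star \bigr\rangle\, w^\star \;=\; 0,
\qquad \|w^\star\| = 1 .
```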

What is the relationship between Section 3 and Section 4?

The first step in the proof of Theorem 4.1 relies on the tail bound version of Theorem 3.1. Specifically, Section 4 requires a characterization of extremal vectors in the finite-data setting, which is facilitated by first studying their infinite-data counterparts (Equation (6)) and then applying the tail bound version of Theorem 3.1 to establish their proximity.

While Proposition 3.1 is not used in Section 4, it supports the idea that early alignment behavior in the infinite-data case is not significantly different from the large but finite $n$ case. Theorem 3.1 bounds $|D_n - D|$, but additional assumptions are needed to directly conclude that the extremal vectors of $G_n$ are close to those of $G$. Proposition 3.1 addresses this gap. We believe that Theorem 3.1 (and its tail bound version) and Proposition 3.1 are broadly applicable and may be useful in future work, as suggested by Reviewer TN6U. The paper is structured accordingly: Section 3 presents a general theoretical result that can be used beyond this work, while Section 4 applies it to a specific setting. We will clarify this connection in the revised version.

Review (Rating: 4)

The paper theoretically studies how overparameterized networks converge to simpler generalizing solutions (as opposed to interpolating the training data) when there are sufficiently many training samples. They do this by studying early alignment, where networks align their weights to directions determined by the data early in training, before growing in magnitude. They show that once the number of training samples exceeds an optimization threshold, the early alignment is governed by the true loss, concentrating the weights on a few key directions and thus yielding a simplicity bias which they show persists throughout training.

Questions for the Authors

  1. Other work studying early alignment has done so in the context of feature learning and the neural tangent kernel (Jacot et al. 2019). Does data interpolation correspond to the “lazy” learning regime, or would you consider this to be something separate? Can you discuss the connection of this work to feature learning?
  2. The authors state “Loss of omnidirectionality is specific to the (leaky) ReLU activation and does not hold for smooth activations.” Do other (smooth) activations exhibit early alignment and do they learn generalizing solutions, or are they constrained to data interpolation?

Claims and Evidence

The claims made appear clear and well-supported.

Methods and Evaluation Criteria

The methods seem fine.

Theoretical Claims

I did not check the correctness of any proofs.

Experimental Design and Analysis

The experimental design seems well-done and there are additional experiments included in the supplementary to support the theory in the main text.

Supplementary Material

I reviewed the additional experimental results in the supplementary, which seem fine.

Relation to Broader Scientific Literature

The paper shows theoretically how the number of training samples impacts the early alignment phase, leading to a simplicity bias and show that this persists through training. This helps elucidate how the number of training samples, parameters, and data dimensionality relate to generalizing versus overfitting (data interpolating) solutions and the corresponding underlying mechanism.

Essential References Not Discussed

Atanasov et al. (2021) seems particularly relevant to this work, since they also study early alignment and how it relates to feature learning.

Other Strengths and Weaknesses

Strengths: The paper is well-motivated, well-written, and studies an important topic in deep learning theory. The theory appears sound, and the experiments complement the theory well and support its generality.

Weaknesses: The paper is quite dense and it’s difficult to follow the main points at times. The results and their applications could be emphasized more in a discussion that highlights their significance.

Other Comments or Suggestions

It might be helpful to readers to have some high-level explanations of the theoretical parts as well as some visuals to help with the intuition behind the theory. There could also be more emphasis and discussion on the simplicity bias. I suggest bringing another one of the simulations from the supplementary into the main text for more experimental support of the theory. It could also be interesting to link this to double descent and “data-double descent” (Henighan et al., 2023) in a discussion. In general I suggest having a more extensive discussion relating the results of this work to existing literature.

Author Response

We sincerely appreciate the reviewer’s insightful feedback.

Atanasov et al. (2021) seems particularly relevant since they also study early alignment and how it relates to feature learning.

Thank you for pointing out this reference. We will include it in the revised version, as it provides valuable insights into the early alignment phenomenon.

It might be helpful to readers to have some high-level explanations of the theoretical parts as well as some visuals to help with the intuition behind the theory. There could also be more emphasis and discussion on the simplicity bias. I suggest bringing another one of the simulations from the supplementary into the main text for more experimental support of the theory. It could also be interesting to link this to double descent and “data-double descent” (Henighan et al., 2023) in a discussion. In general I suggest having a more extensive discussion relating the results of this work to existing literature.

We appreciate these suggestions. Our discussion section already connects our results to the literature on convergence to global minima, implicit/simplicity bias, and benign & tempered overfitting. We thank the reviewer for pointing out Henighan et al. (2023), which we will discuss in the revised version. Our setting does not exhibit a data double descent phenomenon, as the observed test loss consistently decreases with the number of samples, whereas double descent is characterized by an intermediate bump in the loss curve. However, the toy experiments in Henighan et al. (2023) illustrate a similar phenomenon: for a sufficiently large number of training points, the training loss remains high while the model learns optimal features. It remains unclear though whether this high training loss stems from an underparameterized regime (i.e., the model lacks sufficient capacity to memorize the data) or if optimization fails to reach the ERM in their setup.

Given the breadth of related topics, we initially omitted certain discussions (e.g., NTK and double descent) that we felt were less central to our study. However, based on the reviewer’s feedback, we will add a section in the appendix relating our work to these topics.

Additionally, we will move Figure 3 (unless another figure is preferred by the reviewers) from the appendix to the main text, utilizing the allowed extra page.

Other work studying early alignment has done so in the context of feature learning and the neural tangent kernel (Jacot et al. 2019). Does data interpolation correspond to the “lazy” learning regime, or would you consider this to be something separate? Can you discuss the connection of this work to feature learning?

There might be some confusion here. We distinguish between feature learning and the NTK/lazy regime, as they involve fundamentally different training dynamics (see Chizat et al., On Lazy Training in Differentiable Programming, for an in-depth discussion). Our study specifically focuses on the feature learning regime with small initialization, as indicated by our initialization choice (Equation (2)), where both the inner and outer layers scale as $\frac{1}{\sqrt{m}}$.

In contrast, in the NTK/lazy regime (corresponding to large initialization scales), theory predicts that interpolation should occur at convergence, which is contrary to our main result. However, empirically demonstrating this interpolation in our toy model (with large $n$) is computationally challenging, as it would require an extremely large number of parameters. We will clarify this distinction and add a discussion on NTK vs. feature learning regimes in the appendix.
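For readers unfamiliar with the distinction, the following schematic contrasts the two regimes. The Gaussian form and the scale $\lambda$ in the first line are assumptions based on the description above (Equation (2) of the paper is not reproduced here), and the second line is the standard NTK parameterization from the literature rather than the paper's.

```latex
% Feature-learning / small-initialization regime (assumed form): both layers drawn at a
% small scale proportional to 1/sqrt(m).
w_j(0) \sim \frac{\lambda}{\sqrt{m}}\,\mathcal{N}(0, I_d), \qquad
a_j(0) \sim \frac{\lambda}{\sqrt{m}}\,\mathcal{N}(0, 1), \qquad \lambda \ll 1 .
% Standard NTK / lazy parameterization, for comparison: order-one weights with a
% 1/sqrt(m) output scaling, which keeps the learned features nearly frozen.
f_{\mathrm{NTK}}(x) = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} a_j\, \sigma(\langle w_j, x \rangle),
\qquad a_j \sim \mathcal{N}(0,1), \; w_j \sim \mathcal{N}(0, I_d).
```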

Do other (smooth) activations exhibit early alignment and do they learn generalizing solutions, or are they constrained to data interpolation?

Our experiments in the appendix include results with the smooth GeLU activation, where we observe similar early alignment behavior. While theory predicts that with an infinite number of neurons, interpolation should eventually occur, it remains unclear how many neurons are required in practice. Our experiments suggest that this number is quite large.

Final Decision

The paper shows that overparameterized two-layer ReLU networks can avoid interpolating training data when the number of samples exceeds a certain threshold. Instead, due to an early alignment of neurons with specific directions, the network develops a simplicity bias that persists throughout training and leads to better generalization. This behavior is analyzed theoretically and supported by experiments.

The reviewers agree that the paper makes a significant theoretical contribution, noting that: 1) the insights into early training dynamics and simplicity bias are interesting and novel, 2) the theoretical results are rigorous, non-trivial, and supported by experiments, and 3) the paper is fairly well-written, with a clear motivation and potential for further impact. A few concerns were raised, particularly regarding the connection to large-scale modern deep learning models and the accessibility of the theoretical content. While most of these points were addressed in the rebuttal, the paper would benefit from including high-level explanations of the theoretical results, along with visualizations to help build intuition around the key concepts.

I therefore recommend acceptance and encourage the authors to consider the feedback and update the final version of the paper accordingly.