PaperHub
Average rating: 5.5 / 10 · Rejected · 4 reviewers
Ratings: 5, 5, 6, 6 (min 5, max 6, std 0.5)
Confidence: 3.8
Soundness: 3.5 · Contribution: 2.5 · Presentation: 3.5
ICLR 2025

When Are Bias-Free ReLU Networks Effectively Linear Networks?

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

We show that two-layer bias-free ReLU networks cannot express nonlinear odd functions and have the same learning dynamics as linear networks under symmetry conditions on data.

Abstract

Keywords
ReLU network, linear network, gradient flow, implicit bias

Reviews and Discussion

Official Review
Rating: 5

This paper studies various cases where bias-free ReLU networks are equivalent to linear networks. This requires assumptions on the data (e.g., symmetry or a linear target) and on the initialization (e.g., rank-one and balanced). These assumptions make it possible to establish conservation laws, which imply that the network remains equivalent to a linear network (or a sum of independent linear networks) throughout the dynamics.
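To make the mechanism concrete, the conservation laws in question are of the balancedness type that holds for any 1-homogeneous activation under gradient flow (the notation below is mine and may not match the exact conserved quantity in the paper): writing the network as $f(x)=\sum_i a_i\,\sigma(w_i^\top x)$ and letting $\ell'$ denote the derivative of the loss with respect to the network output,

$$
\frac{d}{dt}\Big(\|w_i\|^2 - a_i^2\Big)
= 2\,w_i^\top \dot w_i - 2\,a_i \dot a_i
= -2\,\mathbb{E}\big[\ell'\, a_i\, \sigma'(w_i^\top x)\, w_i^\top x\big]
  + 2\,\mathbb{E}\big[\ell'\, a_i\, \sigma(w_i^\top x)\big] = 0,
$$

since $\sigma'(z)\,z=\sigma(z)$ for ReLU and leaky ReLU. Conserved quantities of this kind are what allow a structured initialization to be preserved along the whole trajectory.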

Strengths

The paper is very clearly written. It tackles the crucial question of understanding training dynamics of deep neural networks. I appreciate that the authors find settings simple enough for elementary analysis and exposition. An interesting and important contribution of the paper is to warn people wanting to study (bias-free) ReLU networks that their assumptions might make them inadvertently fall back to the linear case.

Weaknesses

My main concern regards the soundness and importance of the contribution. Most results rely on conservation laws, which show that, under a specific initialization and a particular family of data distributions, the network is equivalent to a (sum of independent) linear network throughout training. However, deviations from these conservation laws are not studied in a thorough manner, which would be crucial in order to substantiate the claim that “the bias terms in the network and the structures in the data play an essential role in learning nonlinear tasks with ReLU networks” (line 59). For instance, the results of Figure 2b strongly rely on a very small-scale initialization (and perhaps a small learning rate), and this is not clearly discussed. Furthermore, for several cases studied in the paper, preexisting works already gave similar results and additionally rigorously control some of the deviation terms from this ideal initialization scenario.

More precisely, the results of Section 4.1 are close to the ones of Lyu et al. While the current paper lifts the assumption of linear separability, it considers the ideal case of a rank-one and balanced initialization. On the contrary, in Lyu et al., the deviation terms from this perfect scenario are carefully controlled, showing that the network still converges to a (global-max-margin) linear solution. This makes the latter analysis more nuanced, and it challenges the present paper's assertion that it “incorporate[s] as special case” the results of Lyu et al. (l. 48). The claim that "we are able to give exact time-course solutions to certain two-layer ReLU networks in closed form, which has never been done for nonlinear networks outside the lazy learning regime” is therefore misleading, because Lyu et al provide a richer, though not closed-form, description, since they control additional error terms.

The situation is similar in Section 4.2: the decoupled dynamics from Appendix D is a simplification of the study of Boursier et al., which does not assume that the weight matrices are aligned with the data at initialization, but rather controls the deviation from alignment.

I encourage the authors to:

  • clarify that they study idealized cases both in terms of data and initialization, and that these idealized cases lead to conservation laws which entail equivalence with linear networks.
  • study more thoroughly deviations from this scenario, both in terms of data assumption and initialization assumptions (what happens for a Gaussian initialization depending on scale? Learning rate?).

Minor remarks:

  • A relevant reference for rank-one structure in deep linear networks for regression (to complement the references of line 400) is Marion and Chizat, Deep linear networks for regression are implicitly regularized towards flat minima, NeurIPS 2024. Also, the rank-one structure is only approximate for a non-vanishingly small initialization, whereas the authors seem to indicate the contrary on line 400.
  • A study of the rank of matrices in deep ReLU networks was done in Timor et al., Implicit Regularization Towards Rank Minimization in ReLU Networks, ALT 2023.

Questions

In Appendix C.5, the fact that the equivalence with a linear map holds even with large-scale initialization is very interesting, and could be further investigated. Does this equivalence also hold for larger learning rates?

Comment

Thank you for your detailed review. We'd like to clarify our contributions and how they differ from prior work.

  • Importance

    We very much agree with you that some prior works have studied cases where ReLU networks behave like one or several linear networks. However, the connections between ReLU and linear networks have not been explicitly highlighted and systematically summarized. We feel that these connections offer a valuable and appealing way of understanding ReLU networks. To this end, our paper focused on specifying and explaining these connections as clearly as we can, aiming to make a number of previous results in the ReLU network subfield more intuitive and accessible to a broader audience. We're glad to see that you and other reviewers found our presentation very clear.

    Additionally, as noted by all three other reviewers, the impact of removing bias is an interesting topic because bias-free networks have been studied often in prior theoretical works. Discussing their limitations sheds new light on the conclusions from these prior studies.

  • Soundness

    You are correct in pointing out that we did not pursue the direction of deriving bounds for initialization scale and learning rate. However, we believe that the rigor of our claims is at an appropriate level, in the sense that we did not overstate our results. For instance, the equality in Assumption 5 is indeed exactly conserved throughout training if it is true at initialization, which we prove in Section C.2.

    We do recognize that giving bounds on error terms is important. However, we believe that providing closed-form solutions under well-stated assumptions is also important and valuable. Moreover, while our assumptions and conclusions differ from those of Lyu et al, some of our conclusions are actually stronger: 1) we study both square and logistic loss (Lyu et al focused on logistic loss); 2) we give closed-form time-course solutions to certain two-layer ReLU networks, which had not been written out before; 3) we relax the assumption on the target from being linearly separable to being odd. We thus see our contributions as being complementary to theirs.

    To quote the final paragraph in Lyu et al: "A critical assumption for our convergence analysis is the linear separability of data. We left it as a future work to study simplicity bias and global margin maximization without assuming linear separability." Our work addresses part of this open question: to study simplicity bias without assuming linear separability.

  • Large Initialization, Large Learning Rate

    Thank you for bringing this interesting question into discussion. We do have empirical evidence that some of our results still hold with a moderately large learning rate and large initialization. In our revised pdf, we have added Section C.6 in Appendix and some signposts in the main text to make it clear that: while our analytical derivations used vanishing initialization and infinitesimal learning rate, we have empirical evidence that some of our results extend to large initialization and large learning rates.

    In the added Figure 8, we use a learning rate of 0.6, which is 150 times larger than the learning rate of 0.004 used in Figure 2. Similar to Figure 2, the loss curves in Figure 8 with different leaky ReLU slopes collapse onto one curve after rescaling time, and the differences between weight matrices are small. If the learning rate is further increased, the loss and weight curves oscillate and the equivalence breaks; but we typically wouldn't let our networks train in this oscillating regime. (A simplified sketch of this type of check is included at the end of this comment.)

  • Rephrasing

    We have deleted "as special cases" in line 48 and deleted "which has never been done for nonlinear networks outside the lazy learning regime" in line 58.

    You are correct that the rank-one structure is approximate for non-vanishingly small initialization. We have revised line 400 to clarify this point. Thanks for your suggestion.

  • Thank you for sharing the references. We have included them where you suggested.
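For concreteness, the kind of check underlying Figures 2 and 8 can be reproduced with a small script along the following lines. This is a simplified sketch rather than our actual experiment code: the dataset, network width, initialization construction, and hyperparameters below are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 5, 64, 200                 # input dim, hidden width, sample count
alpha, lr, steps = 0.2, 0.05, 3000   # leaky-ReLU slope, learning rate, GD steps

# Symmetric dataset: whenever x is included, so is -x; odd (here linear) target.
X_half = rng.standard_normal((n // 2, d))
X = np.vstack([X_half, -X_half])
y = X @ rng.standard_normal(d)

# Rank-one initialization W1 = a r^T, with the positive and negative entries of
# the second layer a carrying equal squared mass ("balanced" in this sketch).
r = 0.01 * rng.standard_normal(d)
a = 0.1 * np.concatenate([np.ones(h // 2), -np.ones(h // 2)])
W1, W2 = np.outer(a, r), a.copy()

def leaky(z):
    return np.where(z > 0, z, alpha * z)

def dleaky(z):
    return np.where(z > 0, 1.0, alpha)

def nonlinearity(f):
    """Relative residual of the best linear fit to the learned map on X."""
    beta, *_ = np.linalg.lstsq(X, f, rcond=None)
    return np.linalg.norm(f - X @ beta) / (np.linalg.norm(f) + 1e-12)

for t in range(steps):
    Z = X @ W1.T                     # (n, h) pre-activations
    f = leaky(Z) @ W2                # (n,) bias-free network outputs
    err = f - y
    gW2 = leaky(Z).T @ err / n                         # gradient wrt second layer
    gW1 = (dleaky(Z) * np.outer(err, W2)).T @ X / n    # gradient wrt first layer
    W2 -= lr * gW2
    W1 -= lr * gW1
    if t % 500 == 0:
        print(f"step {t:4d}  loss {np.mean(err**2):.4f}  "
              f"nonlinearity {nonlinearity(f):.2e}")
```

Rerunning this with different slopes, learning rates (e.g., 0.004 versus 0.6), or initialization scales gives the kind of comparison reported in Figures 2, 6, and 8; under the conditions of Theorem 7, the nonlinearity measure should stay near zero throughout training.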

We look forward to hearing whether our revision and response address the reviewer's concern. Please let us know if you have further questions.

Comment

Thank you for your answer, which is rather convincing. I increased the score and updated the review. Let me give a few follow-up comments.

Soundness and significance: I agree with your points, but I still think that the paper is overstating in some places. This is why I believe the paper could improve on soundness, by providing a more balanced narrative (although I appreciate that you already rephrased some of the problematic sentences), while I do not question the mathematical soundness of the results, which seem correct. More precisely,

  • I agree on the interest of the results. However, you mention several times (in the paper lines 20, 53, 58-60, and in the rebuttal) that your work helps to understand ReLU networks. In my opinion, this should be nuanced since the equivalence with linear networks holds under quite restrictive assumptions, and more crucially because in practice we actually want ReLU networks to behave differently from linear networks (otherwise, we may as well use linear regression), and thus to study those settings where ReLU networks do something more interesting than a linear map. This is not mentioned in the paper. As a consequence, I believe that the most important contribution of your paper is actually to warn people wanting to study (bias-free) ReLU networks that their assumptions might make them inadvertently fall back to the linear case.
  • regarding the comparison with Lyu et al. and Boursier et al.: I definitely agree that providing closed-form solutions under well-stated assumptions, as well as easier proofs, is important and valuable. However, this is not written in the paper, which rather frames the novelty by insisting on the differences in assumptions. While your assumptions on the dataset are indeed less restrictive, your assumptions on the initialization are much stronger, and this is not mentioned at all in your paper. In fact, lines 48 and 255 may lead the reader to believe that the results of Lyu et al. are a subset of your results, which is not the case. This could be fixed simply by adding a few lines to describe the differences with Lyu et al., including an acknowledgment that your assumptions on initialization significantly simplify the analysis by zeroing many terms, which are carefully controlled by Lyu et al.

Large initialization and learning rate: it would be nice to run the experiment with both large initialization and large learning rate at the same time (i.e., $w_{init}=0.5$ and learning rate of 0.6). This is because in other settings both quantities are known to interact, so it would be nice to see what happens here.

Comment

Thank you very much for your response. We really appreciate the time and effort you invested in helping us improve our manuscript for publication. We have updated our pdf according to your suggestions. Please let us know if our revision is appropriate and if you have any further feedback.

  • Balancing Narrative, Specifying Assumption

    We do like how you frame our contributions not as an advancement of what is wanted, but as a warning of what is to be avoided, which is often under-represented in the literature but just as important. We actually intended to convey the same message in the opening paragraph of our introduction: "This paper seeks to illuminate the implications of bias removal in ReLU networks, and so provide insight for theorists on when bias removal is desirable." We're sorry that we didn't get this point across effectively. We have revised lines 20, 53, 60 to adopt a more balanced narrative.

    We have revised line 48 and added a clarification of our initialization assumption in Remark 6. We hope that now Remarks 4 and 6 together clarify the relationship between our setup and that of Lyu et al -- our assumption on datasets is weaker and our assumption on initialization is stronger.

  • Large Initialization & Large Learning Rate

    Thank you for this follow-up question. We have added Figure 6d to demonstrate that the loss and weights curves with a large initialization and a large learning rate are qualitatively similar to the curves with a large initialization and a small learning rate. As in the case with a small learning rate, the loss curves in Figure 6d with different leaky ReLU slopes collapse to one curve after rescaling time.

Comment

Thank you for the answer and updating the paper. I increased the soundness score and kept the overall score.

Official Review
Rating: 5

This paper compares the gradient descent dynamics of ReLU networks with no bias to linear networks. First, the authors show that two-layer bias-free networks have limited expressivity. Next, the main result (Theorem 7) is that for "symmetric datasets" (Condition 3) and under a specific initialization (Assumption 5), the gradient flow trajectories of a two-layer ReLU network and a linear network are the same, up to weight and time-rescaling. The authors also consider extensions to other data distributions (such as orthogonal or XOR), and ReLU networks with depth > 2.
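To unpack the initialization assumption in my own notation (which may not match the paper's exactly): writing $W_1 = W_2^\top r^\top$, i.e. row $i$ of $W_1$ equal to $a_i r$ with $a = W_2^\top$, the network already computes a linear map at initialization whenever the positive and negative entries of $a$ carry equal squared mass (which is how I read "balanced"):

$$
h(x)=\sum_i a_i\,\sigma_\alpha(a_i\,r^\top x)
=\Big(\sum_{a_i>0}a_i^2\Big)\sigma_\alpha(r^\top x)-\Big(\sum_{a_i<0}a_i^2\Big)\sigma_\alpha(-r^\top x)
=\frac{1+\alpha}{2}\,\|a\|^2\,r^\top x,
$$

using the positive homogeneity of the leaky ReLU $\sigma_\alpha$ and the identity $\sigma_\alpha(z)-\sigma_\alpha(-z)=(1+\alpha)z$. The content of Theorem 7 is then that, on symmetric data, gradient flow preserves this structure.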

Strengths

  • Prior theoretical work considers networks with no bias, and thus it is an interesting question to understand the expressivity and learning dynamics of such networks.
  • The proofs appear to the best of my knowledge to be sound, and the paper is well-written and easy to follow.

Weaknesses

  • My main concern with the paper is that I find the contribution to be rather incremental, which limits the significance/impact of the work. For example, Theorem 7 requires both symmetry on the data and for the first layer to be initialized as rank 1 ($W_1 = W_2^\top r^\top$). While the latter assumption is justified as a consequence of training from infinitesimal initialization, I still find these to be rather strong assumptions, and I do not think such equivalence between linear networks and ReLU networks holds beyond this limited case.
  • I find that the current paper does not have much additional novelty compared to the prior work Lyu et al. (2021). Lyu et al. (2021) show that for a two-layer ReLU network trained on symmetric data, 1) starting from infinitesimal initialization, the weight matrix $W_1$ becomes rank 1, and 2) as training continues, this rank-1 component converges in the direction of the max-margin linear classifier. This second stage is exactly the same as the dynamics of a linear neural network (Ji & Telgarsky, 2019; Soudry et al., 2018). While the current paper does consider a slightly more general target which is not necessarily linearly separable, it seems to me that the linear separability assumption in Lyu et al. (2021) was there so that they could show convergence to max-margin, and thus I find the generalization to be rather minor.
  • In the setting of Section 4.2 (orthogonal or XOR data), the derivation in Appendix D relies on initializing the network so that the neurons of $W_1$ are perfectly aligned with the directions of the data points. This also seems like a rather strong assumption, and to me the important part of understanding these dynamics is showing that the neurons will align in the direction of the data points. Proving this for the setting of XOR data has been the goal of prior works [1, 2], and is quite challenging.

[1] SGD finds then tunes features in two-layer neural networks with near-optimal sample complexity: A case study in the XOR problem. Margalit Glasgow. ICLR 2024.
[2] Random Feature Amplification: Feature Learning and Generalization in Neural Networks. Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett. JMLR 2024.

Questions

  • I would appreciate if the authors could comment on my concerns re novelty and significance stated above.
  • Line 438 states that the second layer weights $W_2$ are nonnegative. Why is this true?

Minor comments:

  • It might be helpful to add a plot of the depth separation function in Section 3.2.
  • line 1329 "sumed"
Comment

Thank you for your interest in our work and for your thoughtful feedback. We'd like to respond below.

  • Significance of Contribution

    We acknowledge that prior works, including the references you mentioned, have studied cases where ReLU networks behave like one or several linear networks. However, the connections between ReLU and linear networks have not been explicitly highlighted and systematically summarized. We feel that these connections offer a valuable and appealing way of understanding ReLU networks. To this end, our paper focused on specifying and explaining these connections as clearly as we can, aiming to make a number of previous results in the ReLU network subfield more intuitive and accessible to a broader audience.

    Additionally, as you and other reviewers have noted, the impact of removing bias is an interesting question due to the large body of existing work that adopts the bias-removal simplification. Discussing the limitations of bias-free networks sheds new light on the conclusions from these prior studies. For instance, if we do not extend Lyu et al's results from linearly separable functions to odd functions, the disadvantages of two-layer bias-free ReLU networks may not be apparent. For linearly separable tasks, two-layer bias-free ReLU networks converge to the max-margin linear classifier, which is presumably a good solution. However, with our extension to odd functions, it becomes clear that while convergence to a max-margin linear map is advantageous when the target task is linear, it can be a disadvantage when the target is nonlinear.

  • Comment on Assumption

    Thank you for bringing this issue to discussion. The primary purpose of Section 4.2 (orthogonal or XOR data) is to identify two common cases where a two-layer ReLU network behaves like multiple independent linear networks. Our goal was not to improve upon the analytical analysis of these cases, which have been studied in existing literature, as you correctly noted. Rather, we aim to illustrate and explain their connections to linear networks.

    As for our assumptions in Section 4.1 (symmetric data), while our assumptions and conclusions differ from those of Lyu et al, some of our conclusions are actually stronger: 1) we study both square and logistic loss (Lyu et al focused on logistic loss); 2) we give closed-form time-course solutions (Corollary 8) to certain two-layer ReLU networks, which were not given in Lyu et al or other prior works; 3) we relax the assumption on the target from being linearly separable to being odd. We thus see our contributions as being complementary to theirs.

  • Non-negative Weights in Deep ReLU Networks

    Line 438 says the second-layer weights $\boldsymbol W_2$ are non-negative based on the empirical observation in Figure 4b. We plot $\boldsymbol W_2$ in color in Figure 4b and only see gray and white elements, representing positive and zero numbers. Analytically showing $\boldsymbol W_2$ is approximately non-negative is an intriguing future direction. We have revised the sentence in line 438 to clarify that the non-negativity of $\boldsymbol W_2$ is a statement based on empirical observation.

  • Thank you for the useful suggestion. We added "Section F Depth Separation" in our revised pdf to include a plot of the depth separation function and some discussions.

    We added the missed reference (Glasgow, 2024). Thanks.

    We corrected the typo "summed". Nice catch!

We hope our response and revision address the reviewer's questions and welcome any further suggestions.

Comment

Thank you to the authors for your detailed responses to my questions.

Upon reading the rest of the reviews and the rebuttals, my concerns about the novelty and significance of the contribution still remain. For instance, in Section 4.1, while I do acknowledge that the current paper considers a more general target than Lyu et al. (linear functions versus odd functions), the assumption of perfectly balanced initialization is quite restrictive. Moreover, given that bias-free ReLU networks can only express odd functions if they are linear, it is not clear how relevant an extension it is to consider odd, non-linear functions that the network cannot express in the first place. In Section 4.2, while the current paper does prove that a ReLU network behaves like multiple linear networks when $W_1$ is initialized in the directions of the data points, I still maintain that the complexity of the orthogonal data/XOR setting lies in proving why $W_1$ converges to these directions from random initialization (i.e., performs "feature learning"). This has been the focus of the prior works I mentioned in the XOR setting, and, as pointed out by reviewer aiN2, was the focus of Boursier et al. for the orthogonal data setting.

As such I would like to keep my original score.

Comment

Thank you for your attentive engagement and for further clarifying your concern. We regret that our initial rebuttal didn't effectively answer your question and would like to try providing a more pertinent answer below.

  • Clarification on Contribution

    We very much agree that the major technical hurdle of analyzing the orthogonal/XOR setting is the proof of how the first-layer weights align with several specific directions from random initialization. And we acknowledge that such proofs were the focus of prior works (Lyu et al; Boursier et al; and others) and were important contributions in those works.

    We'd like to clarify that the primary goal of our work is not to improve the proofs in those works. Instead, as reviewer aiN2 helped us recapitulate: our contribution is less an advancement of what is wanted than a note of what is to be avoided. The latter is often underrepresented in the literature but just as important.

    We'd be happy to rephrase relevant parts of our manuscript if they send the misleading message that our primary goal is to improve the proof of early phase alignment.

  • Comment on Restrictive Assumption

    We acknowledge that the perfectly balanced initialization assumption is made to simplify the analysis. Nonetheless, we give empirical evidence that the errors are small with random initialization (and also with large initialization and/or a large learning rate, as shown in Figures 2 & 6). We also explicitly noted in Remark 6 that prior literature has proven alignment under the weaker assumption of small random initialization.

    As for the assumption of odd target functions, it is not introduced for simplification but for giving a correct answer to our title question, "when are bias-free ReLU networks effectively linear networks?" Thus, while we agree that the odd target function is restrictive, it is necessarily so given the question we are investigating.

    Now we justify why this question warrants an inquiry. Though these conditions are restrictive, they are not uncommon in existing literature. By explicitly summarizing these conditions, we aim to provide a cautionary note for future research, highlighting the assumptions that may inadvertently place the analysis in a linear network regime.

  • Reason for Considering Functions outside the Network's Expressivity

    Thank you for this insightful question! Your helpful comment has led us to include a useful example in Figure 9, which demonstrates that the two-layer bias-free ReLU network doesn't always learn a bad solution even though the training loss cannot reach zero for odd and nonlinear datasets.

    We consider a linearly separable binary classification task with label flipping noise in Figure 9. When the data points satisfy our condition, the network learns a linear decision boundary, which is presumably a robust solution as it avoids overfitting the few noisy labels. On the other hand, when the nonlinear odd component of the target function is not due to noise, learning a linear solution is presumably bad, as you have correctly pointed out.

    Therefore, we illustrate that, for two-layer bias-free ReLU networks falling into our linear regime, the solution they learn may be good or bad depending on the specifics of the task.

Thank you again for taking the time to carefully review our paper. We welcome any further suggestion or feedback.

Official Review
Rating: 6

The paper studies the bias-free ReLU and leaky-ReLU networks, both two-layer networks and deep ones. The authors show that bias-free two-layer ReLU and leaky ReLU networks have limited expressivity, and cannot express non-linear odd functions. Showing that deep bias-free ReLU networks can express non-linear odd functions, the paper establishes a separation between bias-free deep and shallow networks. Additionally, the authors analyze the dynamics of training bias-free shallow networks under some distributional assumptions, showing theoretically and experimentally that their dynamics follows the dynamics of linear networks. Additionally, the authors provide some insights on the dynamics of deep bias-free ReLU networks.
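For intuition, one example of a nonlinear odd function that a three-layer bias-free ReLU network can express (not necessarily the construction used in the paper) is, with $\sigma$ denoting the ReLU,

$$
g(x_1,x_2)=\sigma\big(\sigma(x_1)-\sigma(x_2)-\sigma(-x_2)\big)-\sigma\big(\sigma(-x_1)-\sigma(x_2)-\sigma(-x_2)\big)
=\operatorname{sign}(x_1)\,\big(|x_1|-|x_2|\big)_+ ,
$$

which is odd and nonlinear, and uses only bias-free ReLU layers followed by a linear readout.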

Strengths

The paper gives novel insights on both expressivity and optimization of bias-free networks. As these networks have been studied often in prior theoretical works, discussing their limitations sheds new light on the conclusions from past theoretical works. Additionally, while ReLU networks often are trained with bias in practice, in certain situations bias-free networks have been used in practice, and thus understanding their behavior and limitation is important. The insights on the dynamics of these networks, drawing the connection to linear networks which are much better understood theoretically, helps advance our theoretical understanding of the dynamics and solutions found by ReLU networks.

Weaknesses

  • Previous work by Basri et al. (cited by the authors) shows that two-layer bias-free networks cannot express non-linear odd functions when the inputs are uniformly distributed. The authors claim that the result in the paper is stronger, but it's not clear to me that this is the case. My understanding is that the authors show: for any non-linear odd function $f$ and any bias-free (leaky) ReLU network $h$, there exists some input $x$ such that $f(x) \neq h(x)$. My understanding is that Basri et al. shows a stronger result: for $x$ sampled from the uniform distribution, $f(x) \neq h(x)$ (with high probability?). It is possible that I am misunderstanding either the Basri et al. result or the result shown in the paper, so I would appreciate it if the authors clarify this point. In any case, I believe that writing the result in the paper more formally with the right order of quantifiers will help clarify things (I sketch one possible formalization after this list).
  • The fact that bias-free ReLU networks are very limited is already evident from the somewhat trivial (and previously observed) fact that bias-free ReLU networks can only express positively homogeneous functions. While the results shown in the paper are indeed stronger, showing that the networks cannot express a larger family of functions, it is worth emphasizing that the fact that bias-free ReLU networks are very restricted is not a novel contribution of this work.
  • The introduction of Assumption 5 feels a little bit without context and is missing some details. Is this an assumption on the initialization? Throughout the network training? What is the vector $r$? Is the assumption satisfied for some $r$? It would be helpful to write this more formally, and clarify these points.
  • The bottom-line result/message of Section 5 is not clear. If the main point is stating a conjecture, it is worthwhile to state more precisely what the conjecture is, and how it is supported by the experiments.
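For reference, the formalization I have in mind is the following (my notation; the paper's statement may differ). Writing a bias-free two-layer leaky-ReLU network as $h(x)=\sum_i a_i\,\sigma_\alpha(w_i^\top x)$ with $\sigma_\alpha(z)=\max(z,\alpha z)$, its odd part is always linear:

$$
\frac{h(x)-h(-x)}{2}
=\frac12\sum_i a_i\big(\sigma_\alpha(w_i^\top x)-\sigma_\alpha(-w_i^\top x)\big)
=\frac{1+\alpha}{2}\Big(\sum_i a_i w_i\Big)^{\!\top} x,
$$

using $\sigma_\alpha(z)-\sigma_\alpha(-z)=(1+\alpha)z$. So the quantifiers would read: for every nonlinear odd $f$ and every such $h$, there exists an input $x$ with $f(x)\neq h(x)$. Relatedly, bias-free (leaky) ReLU networks of any depth are positively homogeneous, $h(cx)=c\,h(x)$ for all $c\ge 0$.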

Questions

  • The main result on the dynamics of bias-free networks (Theorem 7) is shown for infinite data (training on the distribution) and with gradient flow. How would these results change for finite data and/or training with GD/SGD?
Comment

Thank you for your feedback and constructive suggestions. We're glad you found our insights novel and important. We'd like to respond to your questions as follows.

  • Comparison of Expressivity Results

    Your understanding of our expressivity result is correct. Regarding Basri et al, their Theorems 2 & 4 showed that in the harmonic expansion of two-layer bias-free ReLU networks with input $x$ uniformly sampled on a sphere, the coefficients corresponding to odd frequencies greater than one are zero. (Functions with odd frequency one are linear functions; functions with odd frequency greater than one are nonlinear odd functions.) They didn't study the probability of $f(x)\neq h(x)$ conditioned on $x$ uniformly sampled on a sphere. Thus, their statement on expressivity is of the same strength as ours, while they used an input assumption that we didn't use.

    We have updated our pdf to specify the Theorem numbers when citing Basri et al. Hope this helps clarify. Please let us know if we misunderstood your suggestion.

  • Comments on Limited Expressivity

    We have reworded the beginning of Section 3 to clarify that the limitation of positively homogeneous functions is a known fact. Further, we'd like to add some comments about this topic below.

    Though we agree it is somewhat evident that bias-free ReLU networks have limited expressivity, the practical usage of bias-free ReLU layers is actually not that limited. As we have cited, a best paper at ICLR 2024 [1] showed that bias-free convolutional ReLU networks are state-of-the-art models in image denoising and that the removal of bias specifically helps generalization. Meta's Llama [2], one of the most influential open-source large language models, doesn't seem to have bias terms. Hence, we believe that giving a more accurate description of the expressivity of bias-free ReLU networks is useful.

    [1] Kadkhodaie et al. Generalization in diffusion models arises from geometry-adaptive harmonic representation. ICLR 2024.

    [2] https://github.com/meta-llama/llama/blob/main/llama/model.py

  • Clarification of Assumption 5

    Thank you for this useful suggestion! We have clarified in our revised pdf that Assumption 5 is made only at initialization, and in the proof of Theorem 7 we proved (instead of assumed) that Assumption 5 remains true throughout training. What $\boldsymbol r$ is depends on the random initialization. We have reworded the assumption to clarify that there exists some $\boldsymbol r$ such that $\boldsymbol W_1 = \boldsymbol W_2^\top \boldsymbol r^\top$.

  • Clarification of Conjecture in Section 5

    We're sorry that the conjecture was unclear. We have now re-written the last paragraph of Section 5. We hope it is clearer now and welcome any further suggestions for improvement.

  • Finite Data

    The exact assumption we used is that the empirical distribution of input $x$ satisfies $p(x)=p(-x)$. For infinite data, it incorporates common distributions such as any zero-mean normal distribution. For finite data, it means that if $x$ is present in the dataset, $-x$ is also present, which was exactly the assumption used in Lyu et al [3]. We have revised Remark 3 to explicitly describe what our assumption means for infinite and finite data.

    [3] Lyu et al. Gradient descent on two-layer nets: Margin maximization and simplicity bias. NeurIPS 2021.

  • Gradient Descent versus Gradient Flow

    We empirically find that our results still hold with a moderately large learning rate. In our revised pdf, we have added Section C.6 in the Appendix and some signposts in the main text to make it clear that: while our analytical derivations used an infinitesimal learning rate, we have empirical evidence that some of our results extend to large learning rates.

    In the added Figure 8, we use a learning rate of 0.6, which is 150 times larger than the learning rate of 0.004 used in Figure 2. Similar to Figure 2, the loss curves in Figure 8 with different leaky ReLU slopes collapse onto one curve after rescaling time, and the differences between weight matrices are small. If the learning rate is further increased, the loss and weight curves oscillate and the equivalence breaks; but we typically wouldn't let our networks train in this oscillating regime.

Comment

Thank you for your response. I will keep my original score.

Official Review
Rating: 6

The paper studies what functions bias-free ReLU and leaky ReLU networks can represent and what their training dynamics are. A difference between two-layer networks and deeper networks is established. Specifically, it is observed that such bias-free networks, when limited to two layers, cannot express any odd function except linear ones. However, there exist non-linear odd functions that can be expressed if the network has at least three layers. It is also established that, under certain assumptions on the training data, the training dynamics of bias-free two-layer (leaky) ReLU networks is essentially the same as that of linear networks.

Strengths

The paper is predominantly theoretical and full proofs are provided. The experimental parts complement the paper well and illustrate the theoretical findings.

The presentation is very good and polished and the main claims in the paper are clearly presented.

Understanding the expressiveness and training dynamics of different network architectures is important. The impact of removing bias from the architecture is also an interesting topic to study, in particular, as the authors point out, because in analytical studies of networks, bias terms are sometimes omitted for simplicity. The results on expressiveness of bias-free networks are mostly relatively simple observations. Understanding the training dynamics is much more involved.

Formally studying training dynamics is important, interesting, and challenging. This paper makes some welcome contributions. I found Section 5 as well as the discussion on "perturbed symmetric datasets" particularly intriguing.

Weaknesses

The main limitation of the results on training dynamics is that they are limited to the case where the target model is odd (in addition to some milder assumptions). Specifically, it is shown that in this case, two-layer bias-free (leaky) ReLU networks essentially behave like a linear network. There is evidence that even slight violations of this property of the target model make the network behave in a non-linear way in later phases of training. I appreciate that it is challenging, but deriving training dynamics for more different types of datasets would of course benefit the work.

In the writeup, I think it would be helpful to reemphasize the core assumption made on the target model more explicitly in places where the results are summarized or discussed. For example, in Section 6, in the sentence "Theorem 7 shows that under symmetry conditions on the dataset, two-layer bias-free (leaky) ReLU networks have the same time evolution as a linear network (modulo scale factors)", the phrase "symmetry conditions" does a lot of heavy lifting and it may be helpful to be more explicit about what Condition 3 says.

The authors could clarify that Assumption 5 is only for the initialization.

Questions

Is there any possibility of deriving the empirical results in Section 5 analytically? In terms of results, this is possibly the most interesting part of the paper. What are the main difficulties in doing so?

It would also be interesting to somehow quantify the observed lack of robustness when a two-layer bias-free network is trained on a dataset that nearly satisfies Condition 3. Although, once again, I appreciate this may be a very challenging task.

Comment

Thank you for your thoughtful comments and feedback. We are glad to know that you found our contributions important and interesting, and that our presentation is very clear. We'd like to respond to your questions below.

  • Dynamics for Different Datasets, Quantifying Non-Robustness

    This is an intriguing direction. We added Figure 5b in our revised pdf to show that the plateau duration in the loss curves for perturbed symmetric datasets scales approximately linearly with $1/\Delta y$. Our intuition is that the gradient update during the plateau has a $\Delta y$ component and the time will thus have a $1/\Delta y$ scaling factor. The simulations indeed match our intuition. This result is empirical at the moment, but we are working to provide an analytical analysis and a more general metric for quantifying how asymmetric a dataset is.

  • Reemphasize Assumption

    We have now explicitly written out our symmetric condition in the sentence you quoted. We have also edited summarizing sentences in the Introduction to make sure that we write out the symmetric condition and/or use a clickable hyperlink to the condition (Condition 3).

  • Clarification of Assumption 5

    We have revised to clarify that Assumption 5 is made only at initialization. Thank you for your useful suggestions!

  • Challenges of Analytical Derivation for Deep ReLU Networks

    This is indeed an interesting and challenging problem. We made an attempt to derive the late-phase dynamics in Appendix E, showing that weights that have formed a low-rank structure as in Equation 15 will maintain that structure. We left the early-phase dynamics to future work.

    Technically, studying the dynamics of deep ReLU networks is challenging because the nested nonlinearities introduce nested derivatives in the gradient descent differential equations. Most tools for analyzing the dynamics of two-layer ReLU networks do not apply to deep ReLU networks. For two-layer ReLU networks trained on symmetric datasets from small initialization, we use the tool of approximating the early phase dynamics with a linear differential equation (Equation 27), whose closed-form solution is available. For depth-$L$ ReLU networks, we can't reduce the early phase dynamics to a solvable differential equation with the same tool: the gradient update of a weight involves the multiplication of some nonlinear functions of $(L-1)$ weights, which is generally intractable. Thus, unsurprisingly, the literature on the learning dynamics of ReLU (and perhaps also other nonlinear) networks in the rich regime is quite sparse.

    We will include this discussion and a review of relevant literature on the dynamics of deep ReLU networks in the final revision.

We hope our response and revision are beneficial and welcome any further suggestions.

Comment

Thank you for this response.

Comment

We thank all reviewers for their constructive questions and their engagement during the rebuttal period.

We are glad to see that the reviewers found our presentation clear and recognized the importance of the question we investigate -- the impact of removing bias in ReLU networks. We believe the regimes we identify as equivalent to linear networks can serve as an accessible way to understand ReLU networks within these regimes, or as a cautionary note for future researchers aiming to study nonlinear behaviors of ReLU networks beyond the linear network regimes.

We also appreciate the constructive questions and have taken suggestions from reviewers to improve our paper. Our revisions are colored blue in the pdf and we summarize below.

  • We added Figure 6 to provide empirical evidence that our main equivalence result (Theorem 7), derived with infinitesimal initialization and learning rate, can apply to large initialization and learning rate, addressing questions from reviewers ejaf and aiN2.
  • We added Figure 5b to give a more quantitative description of how long two-layer ReLU networks follow the linear network dynamics when trained on slightly asymmetric datasets, answering a question from reviewer QuzZ.
  • We clarified our assumption and adjusted wording to adopt a more balanced narrative in places highlighted by reviewer aiN2.
  • We implemented further clarifications and rewording based on feedback from all reviewers.
AC Meta-Review

The authors provide a theoretical analysis of the expressivity and training dynamics of ReLU networks with no bias, and in particular compare them against linear networks. The most novel result is Theorem 7: under some conditions, the gradient flow trajectory of a two-layer ReLU network is the same as that of a linear network up to symmetry.

While no significant issues were raised, all the reviewers seem to be skeptical about (1) how useful Theorem 7 is under the restrictive conditions, and (2) the novelty of the rest of the results, which seem to be an incremental improvement over existing ones. This led to all the reviews being very borderline, despite all of them acknowledging the rebuttals.

Given that I have quite a few submissions in my batch with borderline reviews, and that no reviewers are willing to champion the acceptance of this paper, I would recommend reject for this paper.

However, I believe this is a close case. The authors have a well-written set of results, where novelty is the main criticism, which can be a great fit for other venues such as TMLR. If any improvements can be made in terms of technical conditions or overall conceptual understanding, this paper could also be a clearer accept.

Additional Comments on Reviewer Discussion

Two repeated points of discussion were on

  1. How restrictive the conditions are for Theorem 7, and
  2. How novel are the other results compared to existing work?

While neither is a critical issue that would lead to a clear reject, they do cast sufficient doubt on whether or not this paper should be a clear acceptance. This is essentially the main decision factor, and improvements on either point could lead to a clearer recommendation.

Final Decision

Reject