PaperHub

Overall score: 5.3/10 · Rejected (4 reviewers)
Individual ratings: 6, 5, 6, 4 (min 4, max 6, std 0.8)
Confidence: 3.0 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.0
NeurIPS 2024

Hamiltonian Mechanics of Feature Learning: Bottleneck Structure in Leaky ResNets

OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

The path of representations from input to output optimizes a kinetic energy that favors 'short' paths and a potential energy that favors low-dimensional representations.

Abstract

We study Leaky ResNets, which interpolate between ResNets ($\tilde{L}=0$) and Fully-Connected nets ($\tilde{L}\to\infty$) depending on an 'effective depth' hyper-parameter $\tilde{L}$. In the infinite depth limit, we study 'representation geodesics' $A_{p}$: continuous paths in representation space (similar to NeuralODEs) from input $p=0$ to output $p=1$ that minimize the parameter norm of the network. We give a Lagrangian and Hamiltonian reformulation, which highlights the importance of two terms: a kinetic energy which favors small layer derivatives $\partial_{p}A_{p}$ and a potential energy that favors low-dimensional representations, as measured by the 'Cost of Identity'. The balance between these two forces offers an intuitive understanding of feature learning in ResNets. We leverage this intuition to explain the emergence of a bottleneck structure, as observed in previous work: for large $\tilde{L}$ the potential energy dominates and leads to a separation of timescales, where the representation jumps rapidly from the high-dimensional inputs to a low-dimensional representation, moves slowly inside the space of low-dimensional representations, before jumping back to the potentially high-dimensional outputs. Inspired by this phenomenon, we train with an adaptive layer step-size to adapt to the separation of timescales.
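For orientation, the continuous-depth model discussed throughout the reviews below can be summarized as follows; this is a reconstruction from the abstract and the authors' replies, and the exact constant in the regularizer is an assumption on our part:

$$
\min_{W}\; C(A_1) + \frac{\lambda}{2\tilde{L}}\int_0^1 \|W_p\|_F^2\,dp
\qquad \text{subject to} \qquad
\partial_p A_p = -\tilde{L}\,A_p + W_p\,\sigma(A_p), \quad p\in[0,1],
$$

with $A_0$ given by the training inputs. The Lagrangian/Hamiltonian reformulation rewrites this objective in terms of a kinetic energy (favoring small $\partial_p A_p$) and a potential energy (favoring low-dimensional representations).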
Keywords

Feature Learning · Bottleneck Structure · NeuralODE · Hamiltonian mechanics

Reviews and Discussion

Review (Rating: 6)

The paper maps the dynamics of representations across layers of leaky ResNets to a Lagrangian and Hamiltonian formulation, giving an intuitive picture of a balance between two terms: a kinetic energy term which favors small layer derivatives and a potential energy that favors low-dimensional representations. This intuition is used to explain the emergence of a bottleneck structure, as observed in previous work.

Strengths

  1. The paper addresses a timely and important topic: feature learning in DNNs.

  2. The introduction provides a good connection to previous work.

  3. The mapping to a Hamiltonian formulation is interesting and provides a valuable intuition.

  4. The propositions and theorems are mostly clearly stated and the proofs seem sound.

Weaknesses

  1. Numerical experiments: a. Many of the figures are poorly explained and have missing labels, etc.; e.g., in Figs. 1b and 2b, what is the color code? b. I failed to find a mention of what data the models were trained / evaluated on. c. Fig. 2c - what is the projection onto?

  2. It is sometimes hard to follow the rationale and motivation for the "storyline" of the paper and its different sections could be better connected to each other.

  3. Novelty wrt previous works - in lines 206-208 a difference from similar works is mentioned, but this is very brief and does not look very significant on the face of it. It would be better to highlight and elaborate on what is new here relative to these previous works.

Technical comments: a. Eqs are not numbered. b. I found the notations to be confusing and non-standard, e.g. $\alpha_p \in \mathbb{R}^w$, which makes it harder to follow the derivations.

Typos: a. Lines 211-212: "layers of the layers..."

Questions

  1. Does $K_p$ bear some analogy with the mass matrix from classical mechanics?

  2. The backward pass variables $B_p$ play the role of momenta, but why are they defined as in line 190? Why not via a Legendre transformation of the velocities? Or are these equivalent?

  3. Line 224: the analogous condition in classical mechanics is that the potential energy depends only on the coordinates and not on the momenta, and that the kinetic energy is a quadratic form in the velocities. Is there some intuition that can be imported here?

  4. In Fig. 1c - why are the Hamiltonians not exactly conserved? These are quite substantial deviations.

Limitations

The authors have adequately addressed the limitations of their results.

Comment

Re item 2 - I can't point to any specific section, this was a general sense I had when reading the paper.

Yes, using $p$ as a continuous layer index and $w$ as the dimension is non-standard for me, but of course it's a matter of taste.

I read the author's response and comments by other reviewers and am currently inclined to keep my score of 6, although I should say that my level of confidence is such that I would not have a solid objection if this paper were to be rejected.

Comment

Thanks for your thoughtful review. Regarding the weaknesses you point to:

  1. For Figures 1b and 2b the colors have no meaning; we just assign different colors to different singular values for aesthetic purposes. The experimental setup and the synthetic data are described in Appendix B; we will add more details and refer to the appendix in the main text.

  2. What part(s) did you think did not fit the structure?

  3. This previous work defines the same Hamiltonian for non-leaky ResNets (as a minor point inside a more general paper), but there is no reformulation as the sum of two energies, nor any mention of a separation of timescales / Bottleneck structure. We will better discuss the difference in the main text.

We will also connect the different formulas more and number the equations where needed.

What part of $\alpha_{p}$ is confusing? Is it the use of $p$ to represent the layer instead of $\ell$? We switched to $p$ because at the end we do have discrete layers indexed by $\ell$.

Thanks, we'll fix the typos you pointed out.

Regarding your questions:

  1. Very nice observation; it does indeed seem to match the inverse of the mass matrix quite directly. We'll mention this connection.

  2. We followed Pontryagin's maximum principle to derive the momenta/dual variables and the Hamiltonian, but it should also be possible to recover them with the Legendre transform.

  3. Yes that's exactly the intuition.

  4. We do not have a definite answer to this, but we believe it comes from the layer discretization:

  • The bumps become smaller as the number of "layer steps" ($L$) is increased; switching to the adaptive layer step size also reduced this error.

  • The bumps appear at $p$ with large kinetic energies, which should be exactly the points where short layer steps are needed.

  • We simply use the forward Euler method to discretize, since this is what traditional ResNets can be interpreted as doing, but we are interested in studying better discretization methods, or even methods that guarantee a constant Hamiltonian, in the future (a sketch of the discretization is given below).
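To make the discretization concrete, here is a minimal NumPy sketch of the forward Euler scheme for the leaky ResNet ODE $\partial_p A_p = -\tilde{L} A_p + W_p\sigma(A_p)$, together with one hypothetical way of choosing adaptive layer steps (smaller steps where the layer derivative is large). The function names and the specific step-size rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_resnet_forward(A0, Ws, L_tilde, rhos):
    """Explicit (forward) Euler discretization of dA/dp = -L_tilde*A + W @ relu(A).

    A0   : (width, N) input representations
    Ws   : list of (width, width) weight matrices, one per layer step
    rhos : layer step sizes, assumed to sum to 1 (the range of p is [0, 1])
    """
    A = A0.copy()
    for W, rho in zip(Ws, rhos):
        dA = -L_tilde * A + W @ relu(A)   # right-hand side of the ODE
        A = A + rho * dA                  # one explicit Euler step
    return A

def adaptive_steps(A0, Ws, L_tilde):
    """Hypothetical adaptive rule: smaller steps where ||dA/dp||_F is large.

    One equidistant probe pass estimates the 'speed' at each layer; the step
    sizes are then taken inversely proportional to it and renormalized to sum to 1.
    """
    L = len(Ws)
    A = A0.copy()
    speeds = []
    for W in Ws:
        dA = -L_tilde * A + W @ relu(A)
        speeds.append(np.linalg.norm(dA) + 1e-8)
        A = A + (1.0 / L) * dA
    inv = 1.0 / np.array(speeds)
    return inv / inv.sum()
```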

Review (Rating: 5)

This paper studies feature learning in Leaky ResNets and shows the emergence of the previously studied Bottleneck structure under certain assumptions. In particular the paper provides a Hamiltonian formulation of the features and their dynamics to show that the ResNet will prefer low dimensional features (low potential energy) when the effective depth of the ResNet is large, which gives the Bottleneck structure. The paper also has a final section on choosing the scales of the residual layers across depths motivated by their theory.

Strengths

  1. Studies the problem of understanding feature learning in NNs, which is of broader interest in the NeurIPS community.
  2. The paper identifies the effect of “effective depth” in Leaky ResNets on the previously observed Bottleneck structure, through a Hamiltonian decomposition into kinetic and potential energy.
  3. In particular, the authors provide a nice intuition that the potential energy is minimised at large effective depths, which corresponds to low rank solutions.

Weaknesses

  1. The paper is unclear at several important moments, which compromises readability. For example, is the leakage parameter $\tilde{L}$ supposed to lie in $[0,1]$ (as suggested by line 80) or in $[0,\infty)$ (as is necessary for the "separation of timescales" arguments in Section 2.1)? Moreover, in line 224 the authors write closed forms for the Hamiltonian, but it is not clear how they obtain this object from the previously stated Hamiltonian on line 195.
  2. The theory seems tied to several simplifying assumptions, which reduces its generality in describing/understanding feature learning in standard NNs, e.g. the reliance on the ReLU activation, the need for weight decay to minimise parameter norms (though it has been called into question whether the role of weight decay is actually to find minimal parameter norms in practice, https://arxiv.org/abs/2310.04415), or the omission of normalisation layers.
  3. On a related note, the theory in the paper seems to have limited relevance for practice. The one glimmer of this is that the paper suggests changing the weighting on the residual branches in order to evenly balance the different layers in terms of how much the representations are changing, but this seems underdeveloped at present. It would be worth investigating whether this can improve training in practice. Moreover, the paper (and the works on the Bottleneck structure in general) seems to argue that the representations should be changing a lot at late layers because the representation shifts back from being low rank. But this runs counter to existing practical works that suggest one can prune late residual layers for improved efficiency (https://arxiv.org/abs/2207.07061). This represents a gap between this theory (of Bottleneck structures) and practice to me.
  4. The paper studies properties of optimal solutions (e.g. geodesics with minimal parameter norm) in terms of Hamiltonian energies etc (Theorem 4) but does not seem to discuss whether training dynamics will lead to such solutions in practice.

Questions

  1. The motivation to study the Bottleneck structure seems not fully convincing to me. I appreciate that the authors spend quite a lot of the introduction to justify the (information) Bottleneck setting from various angles, but has it been shown that practical networks display such a bottleneck structure? Especially in the modern age of large pre-training datasets when the models are underparameterised. It seems like the theory relies on assuming that a low rank structure exists in the function being learnt?
  2. What is the effect of $\gamma$ in the inverse in line 257 in terms of the claim that the Bottleneck rank is an integer? It seems that will only hold if $\gamma=0$, in which case is it necessary for the Bottleneck rank to be an integer?

Typos:

  1. q/2 in equation below line 78
  2. B^T in line 105.
  3. no $\tilde{L}$ in middle term of equation below line 112.
  4. "of the layers" repeated in line 212

Limitations

There is no limitations section.

Comment

Thanks for the thoughtful review. To answer the weaknesses you mention:

  1. Readability should be improved thanks to your and the other reviewers' remarks. Regarding $\tilde{L}$, our proofs only need $\tilde{L}\geq 0$, and you seem to have misunderstood line 80: the range $[0,1]$ is the integration range for $p$, not for $\tilde{L}$; we will clarify this. The Hamiltonian was derived using Pontryagin's maximum principle (which we will mention; a generic statement of the principle is sketched after this list), but the derivation is technical and not very enlightening. Note, however, that it is easy to check that it is the right Hamiltonian by computing the derivatives.

  2. We have plans to handle more general nonlinearities and normalizations, but we focused here on the simplest setting that exhibits a Bottleneck structure, to make the analysis as clean as possible. We also have plans regarding the training dynamics and how weight decay affects earlier training times; see our answer to Question 1 of Reviewer JYMC for more details.

  3. We also plan to analyse the effect of adaptive layer steps on real-world data more thoroughly, but this paper is mostly theoretical; the experiments are mostly there to help visualize the theory. You are right that in practice, and especially for classification tasks with Neural Collapse, the dimension does not increase back in the last few layers, but this can simply be understood as a 'half-bottleneck' (as observed in https://arxiv.org/abs/2402.08010) and our theory still applies.

  4. One big improvement of this paper over previous Bottleneck structure papers is that it applies at any (stable) local minimum and not only at the global minimizer; assuming convergence to a local minimum is much more reasonable.
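For readers unfamiliar with it, here is a generic statement of Pontryagin's principle for a problem of this shape (one common sign convention; the identification of the costate with the paper's backward-pass variables $B_p$, and the exact form of the running cost $r$, are assumptions on our part). To minimize a terminal cost plus a running cost,

$$
C(A_1) + \int_0^1 r(A_p, W_p)\,dp
\qquad \text{subject to} \qquad
\partial_p A_p = f(A_p, W_p),
$$

one introduces a costate $B_p$ and the Hamiltonian $H(A,B,W) = \langle B, f(A,W)\rangle - r(A,W)$; optimal trajectories then satisfy

$$
\partial_p A_p = \partial_B H, \qquad
\partial_p B_p = -\partial_A H, \qquad
W_p \in \arg\max_W H(A_p, B_p, W), \qquad
B_1 = -\nabla C(A_1),
$$

with $f(A,W) = -\tilde{L}A + W\sigma(A)$ here and $r$ presumably proportional to $\|W\|_F^2/\tilde{L}$.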

Regarding your questions:

  1. The Bottleneck structure has up to now been mostly studied theoretically, but it has the potential to explain multiple empirical observations, such as Neural Collapse, the Information Bottleneck and other low-rank bias observations, with a unified theoretical framework of feature learning. We think that theoretical analysis is not only useful for describing empirical observations, but can also be used to identify structures that can then be confirmed empirically afterwards.

  2. Thanks for pointing out that error; the Bottleneck rank will indeed not be exactly an integer, but should approach an integer as $\tilde{L}\to\infty$ and $\gamma\searrow 0$.

Thanks for finding these typos, we will fix them.

Comment

Thanks for the response and clarifications. It is good to hear that the authors plan in future work to address concerns regarding more general settings, practical relevance and training dynamics. I will keep my score for this submission.

Review (Rating: 6)

The paper introduces the so-called "Leaky ResNet" ordinary differential equation. Leaky ResNets are a variant of the NeuralODE with an additional vector field that attracts trajectories to the origin, the strength of which is governed by a parameter $\tilde{L}$ that is later shown to correspond to a separation of timescales.

The authors consider a system of ODEs resulting from the optimization of the empirical risk with a regularization term inversely proportional to $\tilde{L}$.

They define the "Cost of Identity" (COI), a quantity that expresses the coupling of the ODEs and evolves along the trajectories of the ODEs with initial values depending on the training dataset. They then proceed to show that the solutions spend most time in regions where the COI is close to an optimal value.

The final part of the paper proposes three different discretization schemes for the ODE and provides numerical results.

Strengths

  • The idea of showing that neural networks have a certain property by constructing a model where trajectories spend most of their time in regions with that property is interesting.

  • The authors explain their intuition as well as the underlying assumptions of their derivations and highlight the limitations of their analysis.

  • The COI seems to be novel and to reflect some interesting properties of ODE models for neural networks.

Weaknesses

  • The main results are not clearly stated in the abstract or introduction. The authors' stated goal is to study Leaky ResNets; it would be nice to have a rigorous statement of the results of that study at the beginning of the paper.

  • In the abstract, the authors state that the paper explains the emergence of a bottleneck structure in ResNets. It is not clear how this claim can be derived from the results in the paper.

  • There is no rigorous justification that the results from the study apply to neural networks with a finite number of layers.

  • One of the main ingredients used in the derivations is quoted from another paper in (l.101). It would be nice to have a short proof in the appendix or some explanation to make the paper more self-contained.

  • In general the paper lacks rigor; as the authors themselves note in many places, it is not guaranteed that all the quantities are finite and that the decompositions are justified. It would be welcome if some of the informal discussions could be replaced by rigorous statements and proofs.

  • The propositions/theorems in general lack clear statements of which assumptions are used. For example, at the top of page 4 it is stated that the decomposition only holds under a certain assumption (which is rarely satisfied in practice as noted later in the paper). Most of the discussion in the paper relies on this decomposition, but it is not immediately clear whether the formal results require the assumption.

  • In general there is a lot of discussion mixing formal definitions and informal arguments, for example reasoning in terms of quantities that don't exist in most cases of interest. This makes the paper somewhat hard to read. I can understand the authors' intention of providing some intuition, which might perhaps be better served by a combination of rigorous definitions together with a toy example where all the rigorous quantities simplify.

Questions

  • What are the precise assumptions required in each of the Propositions/Theorems?
  • How can this framework be used to analyze neural networks with a finite number of layers?

Limitations

The authors appropriately discuss the limitations of their results next to their statements. However, the extrapolation to feature learning in ResNets made in the abstract is not justified by the results of the paper.

Comment

Thanks for the thoughtful review. Regarding the weaknesses (and also the questions) you raise:

  • We agree with the sentiment, and we usually try to follow this approach in most papers, but for this paper most results require several definitions to be stated, so it is difficult to summarize all results at the beginning. Still, we will clarify the contribution section.

  • "Explaining" is subjective, we then precise what we show after the colon: that for large depth the dynamics switch between fast and slow dynamics coinciding with high and low-dimensional representations. It is true that our phrasing suggest that there will always be a fast then slow then fast sequence (corresponding to high to low to high dimensions), when in practice one can also observe half-bottlenecks where the sequence is only fast then slow, or the dynamics could remains low throughout all the layers, but this only happens when the dimension remains constant throughout all the layers. We simply view these as edge cases of the Bottleneck structure, and focused on the main `full' bottleneck for simplicity.

  • The NeuralODE approximation of DNNs as continuous paths has been used in many previous papers, and our numerical experiments on finite-layer networks seem to agree with our theoretical results. The match between NeuralODEs and ResNets has been studied in previous papers (https://arxiv.org/pdf/2205.14612), which offers a mixed conclusion: basically, that the match depends on the parameters of the network. We do think that investigating whether the Bottleneck structure helps or hinders this match could be an interesting follow-up work.

  • The formula of line 101 follows from the fact that $W_{p}$ is the minimal Frobenius norm solution of $\partial_{p}A_{p}=-\tilde{L}A_{p}+W_{p}\sigma(A_{p})$. We will add this derivation (a short sketch of the computation is given after this list).

  • The statements and proofs are rigorous; we do not rely on the assumption $\mathrm{Im}\,A_{p}^{T}\subset\mathrm{Im}\,\sigma(A_{p})^{T}$. We simply mention that under this assumption the separation of timescales and the Bottleneck structure would follow directly from the decomposition of the Hamiltonian into the two energies. Since it is a useful intuition, we describe this argument in the first paragraph of Section 2.1, but no proposition or theorem relies on it. Actually, the whole point of Theorem 4 is to show an approximate version of the same argument that does not require this assumption.

  • Again, while in the discussions we offer some intuitions that rely on specific assumptions, when it comes to the propositions and theorems we state all of our assumptions explicitly (for the results about the COI we will mention that the COI can be infinite, but since we only prove lower bounds this is not an issue).
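For reference, a short sketch of the computation mentioned in the fourth bullet above (our reconstruction; whether the resulting weighted norm matches the paper's $K_p$ notation depends on the definition of $K_p$, which we do not reproduce here). Writing the constraint as $W_p\,\sigma(A_p) = \partial_p A_p + \tilde{L}A_p$, the minimal Frobenius norm solution is given by the Moore-Penrose pseudo-inverse:

$$
W_p = \big(\partial_p A_p + \tilde{L}A_p\big)\,\sigma(A_p)^{+},
\qquad
\|W_p\|_F^2 = \operatorname{Tr}\!\Big[\big(\partial_p A_p + \tilde{L}A_p\big)\big(\sigma(A_p)^{T}\sigma(A_p)\big)^{+}\big(\partial_p A_p + \tilde{L}A_p\big)^{T}\Big],
$$

using the identity $M^{+}(M^{+})^{T} = (M^{T}M)^{+}$ with $M=\sigma(A_p)$.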

Your questions are answered in our responses to the weaknesses you raised.

Comment

Thank you for your response. As you point out, already the question of whether the NeuralODE models ResNets is delicate. In your discretization scheme in l. 275 you have both $\tilde{L}$ and $\rho_l$ that depend on $L$ in non-trivial ways, present in both the network architecture (as $\rho\tilde{L}$) and the regularized cost (as $\rho/\tilde{L}$). Assuming $L$ is the quantity we want to take to infinity, it is not immediately obvious that the discretization scheme converges to the expected continuous-time process.

Can you give a formula for a discrete leaky ResNet (and associated loss) only in terms of $L$ which converges to a version of the ODE with bottleneck structure as $L$ goes to infinity, even with some simplifying assumptions? And compare the resulting Leaky ResNet architecture to a standard ResNet?

Comment

We will add a result in the Appendix to describe the convergence of the discrete version to the continuous one. We need to assume that the derivative of the nonlinearity $\dot{\sigma}$ is Lipschitz, so it would not apply to the ReLU, but it would apply to smooth approximations of the ReLU. The argument is a simple adaptation of the convergence of Euler methods (explicit for $A_p$ and implicit for $B_p$):

For a fixed $\tilde{L}$ (by the way, $\tilde{L}$ does not depend on $L$; we are not sure where you got this impression?), consider parameters $W_{p_1},\dots,W_{p_L}$ that are critical points of the regularized cost with uniformly bounded $\tilde{A}_{p_\ell},\tilde{B}_{p_\ell}$ (i.e. $\|\tilde{A}_{p_\ell}\|_F,\|\tilde{B}_{p_\ell}\|_F \leq C_1$ for all $\ell$); then there is a continuous solution of the Hamiltonian dynamics $A_p,B_p$ such that for all $\ell$:

$$\|\tilde{A}_{p_\ell}-A_{p_\ell}\|_F,\ \|\tilde{B}_{p_\ell}-B_{p_\ell}\|_F \leq \dots$$

where the RHS goes to zero as the $\rho_\ell$ go to zero; more precisely, the dominant term should be proportional to $\sum_\ell \rho_\ell^2$. There is also a dependence on $\tilde{L}$, but since we consider it fixed, it is fine.

This implies that if one fixes $\tilde{L}$ and considers a uniformly bounded convergent sequence of critical points of the discrete regularized loss with increasing depth $L$, then as $L\to\infty$ it will converge to a solution of the Hamiltonian dynamics as long as $\sum_\ell \rho_\ell^2\searrow 0$. This is obviously the case for equidistant steps, but even adaptive steps will respect this condition under the same assumptions, so everything is fine.
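As a reminder of where a bound of this shape comes from (this is the standard explicit-Euler error analysis under the stated Lipschitz assumption, not the authors' exact statement): if the right-hand side of the dynamics is $\Lambda$-Lipschitz and sufficiently smooth along the path, each step contributes a local error of order $\rho_\ell^2$, and Gronwall-type accumulation gives, up to constants depending on the smoothness of the path,

$$
\max_\ell \big\|\tilde{A}_{p_\ell} - A_{p_\ell}\big\|_F \;\lesssim\; e^{\Lambda}\sum_\ell \rho_\ell^2,
$$

which matches the dominant $\sum_\ell \rho_\ell^2$ term above; the exponential dependence on $\Lambda$ (which grows with $\tilde{L}$) is also consistent with the exponential requirement on $L$ mentioned later in this thread.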

With some more work and a few extra assumptions, it might be possible to extend this result to the ReLU. Intuitively, we need to assume that at almost all layers the activations are away from the discontinuity of $\dot{\sigma}$, or something of this type. This assumption is quite reasonable for finite width $w$ and finite $N$, assuming that each neuron/training input pair crosses zero activation only a finite number of times throughout the layers. Hopefully, we can make this work for the final deadline.

Comment

I assumed $\tilde{L}$ would have to depend on $L$ since we want $\tilde{L}$ to go to infinity in the theorem. And the claim is about bottleneck structure as the number of layers $L$ goes to infinity, if I understand correctly.

Comment

That's a good point, however:

  • We are always allowed to choose $L$ as large as necessary to get a good match between the discrete and continuous versions. We are still writing the exact bound, but something like $L \gg \tilde{L}$ should be sufficient, or at worst a polynomial in $\tilde{L}$.
  • It appears that the Bottleneck structure we study already appears for pretty small $\tilde{L}\approx 5$; we can therefore easily choose an $L$ which is significantly larger than $\tilde{L}$.

Now it would be very interesting to capture closely how large $L$ needs to be to reach the continuous geodesic regime, and whether adaptive learning rates allow one to reach it earlier (our rough conjecture is that as $\tilde{L}\to\infty$ one needs $L \gg \tilde{L}$ for equidistant steps, but $L \gg 1$ might be sufficient for adaptive steps). We have plans to tackle this question, but it appears that a number of new concepts are required to answer it in a satisfying manner. The aspect that we are not yet able to capture is how 'deep' the first and last paths (going from high- to low- and finally back to high-dimension) need to be, to understand how many layers are needed to discretize them well. However, we believe that large depths correspond to the presence of hierarchies of symmetries where the dimension is reduced sequentially in multiple steps, and we are still looking for the right concepts and tools to capture such structure.

Comment

Thank you for your response, and I am looking forward to the convergence analysis.

If the final manuscript included a numerical scheme (equivalently, a sequence of neural network architectures) that correctly captures the interaction between $\tilde{L}\to\infty$ and the convergence of the numerical scheme, I would consider my concern regarding relevance to neural networks addressed. Whether it explains anything about ResNets in particular would of course depend on the concrete expression of the resulting scheme; you might also want to comment on this aspect.

My main concern is about the interaction between the two limits, especially given all the rescaling of time done in Section 1.2. An example that keeps $\tilde{L}$ fixed seems inconsistent with the stated main contribution of the paper (l. 202) and Proposition 5. I am willing to accept taking the Euler scheme first and then the limit in $\tilde{L}$, if the Euler scheme makes clear the higher-order dependence on $\tilde{L}$, using for example little-o notation, so that it would apply to a real-life setting where $L$ is merely big and not infinite. I don't see any problem with using a smooth approximation to ReLU or similar simplifications unrelated to questions of timescale change and discretization.

Comment

We have now worked a bit more on the proof, and $L$ actually needs to grow exponentially in $\tilde{L}$ for the naive Euler-method type of bound to work. We will add this result to the appendix. This would answer your question on how to build a practical network that would converge to the continuous limit we study: take the discretization already introduced in the paper (with equidistant steps for simplicity), and let $L\to\infty$ with $\tilde{L} = c \log L$ for a sufficiently small constant $c$.

However, we strongly believe that this exponential dependence is not tight in practice, because inside the bottleneck the dynamics seem to be much slower, and one should be able to leverage this to improve the bound. More precisely, if the operator norms of the weights $W_p$ inside the bottleneck are bounded by $\tilde{L} + \mathrm{cst}$, then $L \gg \tilde{L}$ should be sufficient. The reason why this could be the case is that inside the bottleneck the weight matrix should approach $\tilde{L}$ times the projection onto the $k^*$-dimensional span of the data, but we are not yet able to capture how fast it converges to this rescaled projection to control its operator norm.

Also, we want to add that we study the infinite-$\tilde{L}$ limit because it makes it easier to prove and illustrate the separation of timescales in and out of the bottleneck. This is mainly a theoretical endeavor, whose goal is to identify quantities/phenomena of interest. In practice it is more optimal to choose $\tilde{L}$ just large enough to start seeing a bottleneck (say such that the dynamics inside the bottleneck are roughly 3 times slower than outside) but not too large. In our experiments, this seems to happen for $\tilde{L}$ as small as 3 or 4, and for such $\tilde{L}$ a discretization is reasonable. We have plans to capture at which $\tilde{L}$ the bottleneck starts, but it will probably require some new concepts, and would thus best be done in a follow-up paper.

Though there is still a long way to go to rigorously prove what features are learned in a ResNet, our work identifies a very similar model (up to continuous approximation + leakage), for which a non-trivial phenomenon in the learned features - the bottleneck structure - can be interpreted as a rather simple separation of timescales coming from the changing balance of two energies/quantities. There exist few simple models to study feature learning in deep networks, and most models that can be analyzed theoretically are either shallow networks or linear networks; for those models it is impossible to describe depth-related phenomena such as the bottleneck structure.

Comment

I am willing to believe that the phenomenon studied appears for finite $\tilde{L}$. However, in the context of the present manuscript, the claimed contribution is about neural networks (discrete) and the presented "evidence" / theorems use $\tilde{L}\to\infty$ in a continuous-time setting. Thus the presented evidence only supports the claim if it is shown that the results transfer from the continuous to the discrete-time setting.

To clarify, the scheme / network architecture you propose is $a_0 = x$, $a_l = (1-\frac{c \log L}{L}) a_{l-1} + \frac{1}{L} W_l \sigma(a_{l-1})$, with loss $L(\theta)=C(a_L)+\frac{\lambda}{2 c L \log L} \sum_l \|W_l\|^2$ and with $c$ sufficiently small so that $\frac{c \log L}{L} < 1$ (which is clearly not a simple Euler scheme since the discretization step $1/L$ appears in some extra places)?

And you show that this is not converging to a dynamics with $a_L \approx a_0$ for $L$ sufficiently large, but to the leaky ResNet ODE instead, and that the bottleneck structure appears in this architecture as $L \to \infty$?
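For concreteness, a minimal PyTorch-style sketch of the scheme described in this exchange (the module and all names are ours, and this is only an illustration of the formula above, not the authors' code):

```python
import math
import torch
import torch.nn as nn

class DiscreteLeakyResNet(nn.Module):
    """a_l = (1 - c*log(L)/L) * a_{l-1} + (1/L) * W_l sigma(a_{l-1}), sigma = ReLU."""

    def __init__(self, width, L, c):
        super().__init__()
        assert c * math.log(L) / L < 1.0, "need c log(L)/L < 1"
        self.L, self.c = L, c
        self.layers = nn.ModuleList(
            [nn.Linear(width, width, bias=False) for _ in range(L)]
        )

    def forward(self, a):
        # leak factor (1 - L_tilde / L) with L_tilde = c log L
        leak = 1.0 - self.c * math.log(self.L) / self.L
        for layer in self.layers:
            # layer(x) computes x @ W^T, i.e. W sigma(a) applied to each sample
            a = leak * a + layer(torch.relu(a)) / self.L
        return a

    def weight_penalty(self, lam):
        # lambda / (2 c L log L) * sum_l ||W_l||_F^2
        sq = sum((layer.weight ** 2).sum() for layer in self.layers)
        return lam / (2.0 * self.c * self.L * math.log(self.L)) * sq
```

The total training objective would then be `criterion(model(x), y) + model.weight_penalty(lam)`, matching $C(a_L)$ plus the regularizer above.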

Comment

Yes, that would be the setup. We can then show that at any critical point of the loss with bounded $A_p,B_p$, the representation path will be close to a continuous solution of the Hamiltonian dynamics. Our results then apply to this continuous path under the required assumptions.

Also note that our main result, Theorem 4, applies for finite $\tilde{L}$, so another way to use our results would be to train a leaky ResNet for a fixed $\tilde{L}$ and $L = e^{c^{-1}\tilde{L}}$ until convergence to a critical point; then this solution is close to a continuous Leaky ResNet, and if the $B_p$ and the length of the path $\ell_{\gamma,\tilde{L}}$ are not too large (for the right choice of $\gamma$, e.g. $\gamma=c_2 \tilde{L}^{-1}$), then the LHS and RHS of both equations are small, which implies that the Hamiltonian will be close to the minimal cost of identity over the path, and that the kinetic energy is approximately proportional to the square root of the extra-COI, thus distinguishing slow layers, where the COI is close to minimal over the path and so the kinetic energy is small, from fast layers, where the extra-COI is sufficiently large that the kinetic energy can be shown to be large too.

In our numerical experiments we are in some sense implementing this strategy, and we do observe that the Hamiltonian of the discrete path is approximately constant (which suggests that we are close to a continuous solution), and it is close to the $\gamma$-Hamiltonians for the right choices of $\gamma$, and also to the minimal $\gamma$-COI over the path. We then observe that the $\gamma$-COI is close to minimal inside the bottleneck and becomes large close to the beginning and end.

Comment

Fair enough, I consider this aspect sufficiently addressed. I will increase my score to weak accept.

Review (Rating: 4)

This paper explores the feature learning dynamics in Leaky ResNets using Hamiltonian mechanics. By introducing the concept of 'representation geodesics', the authors analyze continuous paths in representation space that minimize the parameter norm of the network. The study provides a Lagrangian and Hamiltonian reformulation that highlights the importance of balancing kinetic and potential energy during feature learning. The potential energy, which favors low-dimensional representations, becomes dominant as the network depth increases, leading to a bottleneck structure. This structure involves rapid transitions from high-dimensional inputs to low-dimensional representations, slow evolution within the low-dimensional space, and eventual rapid transitions back to high-dimensional outputs. The paper also proposes training with adaptive layer step-size to better handle the separation of timescales observed in the representation dynamics. These insights offer a novel perspective on the mechanisms underlying feature learning in deep neural networks and suggest potential improvements in network training methodologies.

Strengths

  • This paper offers a novel approach for understanding feature learning by applying Hamiltonian mechanics to Leaky ResNets, bridging a gap between theoretical physics and machine learning.
  • This paper conducts experiments to validate the findings. Based on experiments, some interesting observations are obtained, which may give some new insights for future works.
  • The insights gained from this study have the potential to influence future research in neural network optimization and feature learning, advancing the state of the art in deep learning theory.

Weaknesses

  1. There are multiple typos in the article, which affect readability. Below are several obvious typos, and it is recommended that the authors carefully polish the language of the article.
    • The third word in line 24: "phenomenon" → "phenomena".
    • In line 27, "determines" → "determine".
    • In line 40, "lead" → "leads".
    • In line 68, the preposition "in" should be added after "interested".
    • The formula at the end of line 78 should be $\alpha_{q}'=\alpha_{q/c}$.
    • The formula between lines 78 and 79 is also incorrect. The last term should be $\frac{1}{c}\partial_{p}\alpha_{q/c}(x)$.
    • The expression of $K_p$ in line 111 is incorrect, because it should depend on $p$.
    • The formula between lines 112 and 113 is incorrect. The coefficient of the middle term on the right side of the equation should be 1 instead of $\tilde{L}$.
    • In line 145, "$||\tilde{L}A_{p}+\partial_{p}A_{p}||_{K_{p}^{+}}^{2}$" → "$||\tilde{L}A_{p}+\partial_{p}A_{p}||_{K_{p}}^{2}$".
    • In line 159, "bound" → "bounded".
    • The expression for $C(A)$ in line 190 is incorrect; it should be $C(A)=\frac{1}{N}||f^{*}(X)-A||_{F}^{2}$.
    • In line 282, "$\rho_{l}L<1$" → "$\rho_{l}\tilde{L}<1$".
    • In line 415, "cones" → "cone".
    • In Theorem 7 and its proof, it seems that all instances of $\sqrt{\gamma c}$ should be changed to $\gamma\sqrt{c}$.
    • In Proposition 9, $\tilde{Z}_{q}$ should be $\tilde{A}_{q}$ in the formula above line 485.
  2. The formula between lines 101 and 102 is a crucial one, so it is recommended that the authors provide the derivation process for this formula.
  3. In line 59 of the paper, the definition of $\sigma$ is given. First, the "+" in this formula should be replaced with a comma. Secondly, my question is about the last component "1" in the definition of $\sigma$. In the proofs of some propositions, is it necessary that $\sigma$ does not include this last component "1"?
    • In line 153, why $||A_{p}||_{K_{p}}^{2}=||A_{p}A_{p}^{+}||_{F}^{2}$ for non-negative $A_{p}$?
    • In line 156, why $||\sigma(A)||_{F}\le ||A||_{F}$?
    • In line 450, why $||A_{p_{0},\cdot i}||^{2}-||\sigma(A_{p_{0},\cdot i})||^{2}=c_{0}>0$ for all $\tilde{L}$?
  4. In the proof of Proposition 8, the authors did not explain why the limit exists. Secondly, the formula above line 454 is missing $O(\epsilon^4)$. Additionally, the formula above line 464 is incorrect.

Questions

  1. In practical applications, the training of neural networks is often affected by random initialization and noise. Could the authors elaborate on how the Hamiltonian maintains its stability and robustness in the presence of such randomness and noise? Is there theoretical or experimental support demonstrating the effectiveness of the Hamiltonian in describing the system dynamics under different training conditions?
  2. The authors propose an adaptive layer step-size to adapt to the separation of timescales observed in Leaky ResNets. How does this adaptive training approach impact the computational complexity and training time compared to standard training methods? Can the authors provide benchmarks or case studies demonstrating the trade-offs between performance gains and computational costs?

Limitations

The theory they proposed aims to provide guidance for better training of networks, and their experimental and implementation details have been documented.

Comment

Thanks for the thorough and thoughtful review. Responding to the weaknesses you point out:

  1. We will fix all the typos you have identified.

  2. This line follows from the fact that $W_{p}$ is the minimal Frobenius norm solution of $\partial_{p}A_{p}=-\tilde{L}A_{p}+W_{p}\sigma(A_{p})$. We will add this derivation.

  3. The pluses correspond to the notation $[x]_{+}=\max\{0,x\}$, which we forgot to define. Also, thanks for pointing out this error in the proof; the proofs of the results of Section 1.4 were done without bias. Thankfully, we can show that at each stable local minimum of the COI with bias, the COI with and without bias are the same, and all results for the no-bias COI then apply. We will add this result and a few other basic relations between the bias COI and no-bias COI in this section.

  4. You are correct that it may not converge; we will replace the statement by 'the limit is non-negative if it exists'. Note that the argument would also work for any convergent subsequence.

Regarding your questions:

  1. Our long-term goal is to prove how training dynamics converge to such a Bottleneck structure, and we believe that in practice the BN structure is hidden behind the noise of the initialization, but remains relevant. This is supported by the fact that when we train these networks we observe an earlier time where the train and test errors stop changing much and we start to see $k^{*}$ singular values coming out of the bulk of the weight spectra, but we need to train longer for the weight decay to 'get rid' of this noise. It seems that the Hamiltonian analysis only works at the end of training, but we are confident that it might be possible to extend it to earlier times with the right modifications. Again, this is our end goal, but our strategy is to first understand the BN structure in its cleanest form at the end of training and later describe corrections to it.

  2. The adaptive layer steps have no cost at inference time. During training, the computational cost is negligible (the training is approx. 2% longer with adaptive learning), especially since it is sufficient to update the $\rho$ every few steps (every 30 steps in our case). We will add some training-time information in the appendix.

Comment

Thanks for the response. I will keep my score.

Final Decision

This paper studies the dynamics of feature learning in leaky ResNets using tools from Hamiltonian dynamics. While the reviewers all agree that the results in the paper are interesting and relevant to the NeurIPS community, various reviewers also raised concerns about the clarity of writing and the presentation in the paper. I believe a careful update of the manuscript to address these concerns should be done before the paper is accepted, and I thus recommend rejection.