Noise Balance and Stationary Distribution of Stochastic Gradient Descent
Reviews and Discussion
The paper studies the effect of rescaling symmetry in SGD and shows that SGD tends to favor solutions with balanced gradient noise. The authors then derive an exact solution for the stationary distribution of a toy model trained by SGD. The derived solution sheds light on problems observed in deep learning such as fluctuation inversion and the edge of stability.
Strengths
The paper contributes to the understanding of SGD properties. The noise balance theorem is novel and important. The analytical solution as well as the interpretation is interesting and insightful.
Weaknesses
The results of the paper are interesting and important, but the writing needs refinement to improve clarity and precision. The conditions under which the results hold are sometimes omitted, leading to confusion. The language should also be made more precise.
Minor points:
- The first paragraph in related works appears to overstate the novelty of the results. Specifically, "our result is the first to derive an exact solution to the stationary distribution of SGD without any approximation" (lines 55-56). This is a strong claim, but it seems inaccurate. There are previous results showing exact solutions of the stationary distribution of SGD (e.g., Liu Ziyin 2021). Corollary I.1 in arXiv:2306.04251 (2023) also states the stationary distribution in a deep learning setup similar to the D=1 model discussed in this paper. Also, the solution given in the paper is for a specific model. These should be made clear.
- It seems that eq. 15 takes D=1, which has not been stated and thus is confusing.
- It is unclear why the left figure of Fig. 5 has only two theory lines instead of three.
Major points:
- The related works on symmetry and SGD dynamics are insufficient. There are a few related works that are missing, e. g. arXiv:2309.16932 (2023).
- The paper has not discussed convergence to the stationary distribution. The authors seem to assume convergence to stationary distribution and use interchangeably the SGD properties and the stationary solution properties (e. g. line 97-98). However, the properties of SGD can be very different from the properties of stationary solutions unless convergence to the stationary solutions is guaranteed. The authors should clarify this.
- The authors fail to discuss uniqueness of the stationary solutions. For example, it is unclear to me why eq (3) is a necessary and sufficient condition for stationarity. Eq (3) is a critical result in the paper, and it would be better to make it a theorem or corollary. However, since eq (2) cannot be interpreted as a deterministic ODE, the uniqueness condition for a stationary distribution should be justified, especially considering that C1 and C2 are not constant but depend on u and w.
- The equivalence of SGD bias and weight decay is not rigorous. (line 155-158) The C0 term is not constant but depends on u and w, while the weight decay rate is constant.
Questions
- The entire paper is based on rescaling symmetry, but why is tanh, which has no rescaling symmetry, used in Figs. 3 and 5?
Limitations
The authors have listed limitations at the end. The major limitations are the simplicity of the model and lack of experiments on deep neural networks.
Thank you for your feedback. We will answer the weaknesses and questions below.
Weaknesses:
The results of the paper are interesting and important,...The language should also be made more precise.
Thank you for your suggestions. We will do our best to refine our language in the revision.
Minor points: 1. The first paragraph in related works appears to overstate the novelty of the results...
Thanks for these references. We will include a discussion of these works. In the previous work Liu Ziyin 2021 (more precisely, Liu et al. 2021), the authors investigated the stationary distribution only near the local minima. In Eq. (13) in Sec. 4.6, they approximate the covariance matrix as a constant matrix near the minima, which represents the strength of the noise. Hence, their result is an approximate stationary distribution rather than an exact one. Corollary I.1 of arXiv:2306.04251 only gives a particular solution to the Fokker-Planck equation, assuming strong conditions on the initialization (A3-A4). In contrast, our solution is a general one, enumerating all possible solutions to the problem. In science and mathematics, there is a fundamental difference between the two. For example, the general solution to the Navier-Stokes equation is a difficult open problem in mathematics, whereas particular solutions to it are quite easy to find (e.g., https://en.wikipedia.org/wiki/Landau%E2%80%93Squire_jet). We will restate our contribution as the “first general exact solution to a deep-and-wide linear model with 1d input and output,” which we believe is accurate given the existing literature.
Also, the solution given in the paper is for a specific model. These should be made clear.
We agree that this point should be made more explicit, including in the abstract.
2. It seems that eq. 15 takes D=1, which has not been stated and thus is confusing.
Yes. Eq. (15) is for D=1. We will clarify this.
3. It is unclear why the left figure of Fig. 5 has only two theory lines instead of three.
The dashed lines in Fig. 5 show the upper and lower bounds of the tail of the stationary distribution derived around l.295-l.300, where the left dashed line corresponds to one bounding case and the right dashed line to the other.
Major points: 1. The related works on symmetry and SGD dynamics are insufficient. There are a few related works that are missing, e. g. arXiv:2309.16932 (2023).
Thank you for pointing out this related work. This work studies a discrete symmetry, whereas our focus is on the rescaling symmetry, which is a continuous symmetry. In addition, this work focuses on the structure of the Hessian under mirror symmetry, while we focus on the stationary solutions and the distribution of the parameters. We will clarify this.
2. The paper has not discussed convergence to the stationary distribution. The authors seem to assume convergence to stationary distribution and use interchangeably the SGD properties and the stationary solution properties (e. g. line 97-98). However, the properties of SGD can be very different from the properties of stationary solutions unless convergence to the stationary solutions is guaranteed. The authors should clarify this.
Yes. Convergence to the stationary distribution is assumed. The problem of convergence itself is a difficult mathematical problem and beyond the scope of our work (See https://doi.org/10.1007/s10884-018-9705-8 for an example). We will add a discussion on this point. For the stationarity of Eq (3), see the next point.
3. The authors fail to discuss uniqueness of the stationary solutions. For example, it is unclear to me why eq (3) is a necessary and sufficient condition for stationarity. Eq (3) is a critical result in the paper, and it would be better to make it a theorem or corollary. However, since eq (2) cannot be interpreted as a deterministic ODE, the uniqueness condition for a stationary distribution should be justified, especially considering that C1 and C2 are not constant but depend on u and w.
See the summary rebuttal. The solution to Eq (3) is unique in the sense that, for a given $u$ and $w$, there exists a unique rescaling of them (within the degenerate valley of the rescaling transformation) at which Eq (3) reaches stationarity. We also establish that if this fixed point does not change in time, Eq (3) will converge to it.
4. The equivalence of SGD bias and weight decay is not rigorous. (line 155-158) The C0 term is not constant but depends on u and w, while the weight decay rate is constant.
We meant that they are qualitatively similar because SGD bias is like an “adaptive” weight decay. We will clarify this.
Questions:
The entire paper is based on rescaling symmetry, but why is tanh, which has no rescaling symmetry, used in Figs. 3 and 5?
Here, we consider the tanh network to clarify the application of our insights to cases where the symmetry only approximately holds. For a small initialization, the tanh network can be approximated by a linear network since $\tanh(x) \approx x$ for small $x$. Thus, the rescaling symmetry approximately holds.
Limitations:
The authors have listed limitations at the end. The major limitations are the simplicity of the model and lack of experiments on deep neural networks.
Thanks for this criticism. We would like to politely emphasize that we do have a nontrivial theory for nonlinear models (Theorem 3.2), and experiments on nonlinear nets (ReLU networks in Fig. 2 and the tanh network in Figs. 3 and 5). We also include a new set of experiments on ReLU nets in the uploaded pdf.
I appreciate the authors' response to my review!
For the authors' response to my question, I would suggest replacing the experimental results with ones run with ReLU activation. ReLU is widely used, and it still makes little sense to me that the authors decided to run the experiments with tanh while the entire paper is based on rescaling symmetry.
For the supplement pdf, I have two follow up questions.
- From my understanding, Equation 124 should still be considered a stochastic differential equation by nature. Then, how can the authors arrive at Corollary D.2, which is a deterministic statement without any probability condition?
- In the right panel of the new Figure 6, why does the difference in the norm grow again later in time?
Hi! We noticed that you have not responded to our previous clarification. It would be really helpful to us when revising and improving our manuscript if we could hear more from you. We would be happy to hear any additional thoughts or feedback you have on our work and on our previous reply!
I thank the authors for the clarifications! I am unable to arrive at Corollary D.2 without seeing the proof. Overall, I think the paper has novel contributions and the results are interesting. However, I feel like it still needs some refinement, especially for language precision. For example, in the rebuttal pdf, the fixed point can be at infinity, and it is unclear to me what it means in Corollary D.2 for a neighborhood of the fixed point when it takes the value infinity.
Thank you for your additional questions.
For the authors' response to my question, I would suggest replacing the experimental results with ones run with ReLU activation. ReLU is widely used, and it still makes little sense to me that the authors decided to run the experiments with tanh while the entire paper is based on rescaling symmetry.
Thank you for your suggestion. We will move the results of tanh nets to the appendix, and use the ones for ReLU nets in the main text to avoid confusion.
For the supplement pdf, I have two follow up questions.
1. From my understanding, Equation 124 should still be considered a stochastic differential equation by nature. Then, how can the authors arrive at Corollary D.2, which is a deterministic statement without any probability condition?
This is a misunderstanding. There is no stochasticity in the evolution equation in Eq. 124. The diffusion terms in the dynamics of $u$ and $w$ cancel with each other due to the rescaling symmetry, even if $u$ and $w$ themselves are random. This (rather surprising) fact is essentially what the theorem proves. Also, please see our last reply to reviewer 56YC for an alternative and more technical proof of Theorem 3.1, which may be easier to understand for some readers.
Now, because Eq. (124) has no noise term, one can further construct a deterministic dynamics that strictly upper bounds Eq. (124), which is how one arrives at the corollary.
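Schematically, the argument has the form of a standard comparison principle (stated here in generic notation of our own; the concrete bound used for Corollary D.2 is the one in the pdf):

```latex
\[
\text{If } \frac{d}{dt}\,\Delta(t) \le g\big(\Delta(t)\big) \ \text{pathwise (no noise term), and } \dot y = g(y),\ y(0) = \Delta(0),
\]
\[
\text{then } \Delta(t) \le y(t) \ \text{for all } t \ge 0;
\quad \text{e.g. } g(\Delta) = -a\,\Delta \ \text{gives } \Delta(t) \le \Delta(0)\,e^{-a t}.
\]
```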
2. In the right panel of the new Figure 6, why does the difference in the norm grow again later in time?
First of all, Figure 6-right can be a little visually misleading. It looks like the neurons first have a decreasing and then an increasing norm difference, but the actual situation is simply that roughly half of the norm differences change sign during learning, and so on a log scale they appear to first decrease and then increase, even though they are quite monotonic in reality.
Secondly, zero is not necessarily the stationary point, and Corollary D.2 only guarantees convergence to a neighborhood of a stationary point, so there is no reason to expect convergence to any specific value, especially when the model is nonlinear. What happens most often is that the parameters fluctuate around some small neighborhood, which is exactly what Figure 6-right shows.
Please ask for additional clarification for any point that is not yet clear.
Thanks for the reply. We will refine our statements. What you point to is a corner case. Technically speaking, the fixed point is indeed allowed to be at infinity. The meaning is clear: when the fixed point is at infinity, the system diverges with probability 1 (here, a neighborhood of infinity simply means infinity, and thus divergence). This is not uncommon when the model has some sort of continuous symmetry. For example, in the case of scale invariance, it is well understood that SGD diverges along the degenerate directions (namely, the radial direction) when there is no weight decay (e.g., see https://arxiv.org/pdf/2102.12470).
For ReLU networks trained by gradient flows, it is classical that a type of Minkowski inner product between the coefficients of consecutive layers is preserved. The authors demonstrate a monotonicity of the same quantity for stochastic gradient descent in continuous time. They use this to study the invariant distribution of parameters trained by (continuous time) SGD.
Strengths
The topic is well-chosen and the results - if correct - are very interesting.
Weaknesses
- In its current form, I find the article a bit unpolished and the results not easy to access. Many questions remained unanswered when I tried reading the article (see questions).
- Important quantities are defined throughout the plain text. I understand that reading as a reviewer under time pressure is different from normal reading, but for instance in Theorem 3.1, I would have hoped for a more self-contained statement of the relations and properties that the quantities and distributions involved have to satisfy. As far as I can tell, the statement is fairly general and not specific to machine learning.
- I have serious doubts about Theorem 3.1. It is derived in Appendix A from Itô's Lemma without the diffusion term. This is valid in expectation over the noise, but not pointwise in the noise realization. Pointwise, there should be white noise in the 'time derivatives', i.e. the ODE identity should be written as an SDE. In the proof, equations (27) and (28) appear to be wrong.
- The authors do not pay any attention to whether solutions to the evolution equations exist (or are unique). Problems with regularity can sometimes be alleviated if the distributions involved are sufficiently regular, but I would appreciate a short discussion.
Questions
- The noise matrix defined below (1) is an uncentered covariance matrix, while the centered version appears to be used below (and I believe that this would be correct).
- In line 499, can you justify that the loss is a function of only the rank one matrix, and is the function sufficiently smooth to take derivatives? Is it defined on the space of matrices, or at least in an open neighborhood of the set of rank 1 matrices so that the derivatives are well-defined? This is easy to settle in the linear settings in the main article, but a more thorough consideration is required in general.
- The notation is very unfortunate - for a single derivative, there should not be two arguments. This would be a derivative of the lifted function in a given direction on a space of matrices... Can you explain why the modification is needed? And could it be avoided in the statement of Theorem 3.1 to focus on the more intuitive quantities involving the original loss?
- Where in the proof of Theorem 3.1 is there any indication that the noise terms balance? I have looked fairly closely and cannot find a justification for (3). I also assume that the covariance matrices of the two layers' gradients each depend on all three sets of variables? I am also unsure how precisely this indicates that 'gradient noise between the two layers is balanced'.
- What is the distribution of the data in the experiment of Figure 1? Please include code for reproducibility or describe the experiments which illustrate your point.
- How does (2) imply that 'a single and unique point is favored by SGD'?
- Would Theorem 3.1 hold for all neurons with leaky ReLU activation?
- What do you mean by 'the difference between GD and SGD at stationarity is O(1)'? GD does not have an invariant distribution and the terminal state depends on (random) initialization. Is the message that a gradient flow analysis is only valid for a fixed finite time horizon?
- Should the minimizer of (10) be as stated, or should it take a different value?
- Below (10), what are these quantities? Are they the same as in (8) for a different model?
- I find that the minimum is achieved at a different value, which differs from the one proposed in the article by a constant factor.
- I cannot find the proof of (12) in the appendix anywhere, and I am unsure how the quantity appearing in it is defined. More concrete references may be useful.
- In Figure 3, what is plotted in the depth 1 case? Is it the product of the parameters?
Limitations
Yes.
Thank you for your constructive feedback. We will answer the weaknesses and questions below.
Weaknesses:
Important quantities are defined throughout the plain text. I understand that...
This is a good question. Theorem 3.1 holds generally for an arbitrary network with the rescaling symmetry. The only condition we need is the symmetry of the network, regardless of the data distribution or the detailed properties of the loss and the gradient noise. So mathematically, this is a strong result, and our results may also be applicable to other fields; this is not a weakness but a strength of our work. In fact, solving ODEs with Lie groups is well-known in the mathematics literature, whereas using Lie groups to solve SDEs is, to the best of our knowledge, novel even for mathematicians.
I have serious doubts about Theorem 3.1. It is derived in...
This is a misunderstanding. While we did not write it explicitly, the time derivative of each variable is indeed stochastic. In the SDE limit, the explicit forms of the time derivatives in Eqs. (27) and (28) contain the noise terms, and these can be used to derive Eq. (29).
The authors do not pay any attention to...
This is a good point. In the revision, we will strengthen the theorem by proving the existence and uniqueness of the solutions to Eq. (2). See the summary rebuttal and the pdf.
For regularity, we do need some regularity conditions on the noise covariance matrix as a function of the data in order for the matrices $C_1$ and $C_2$ to exist and be well-behaved. One sufficient condition would be that the covariance matrix is a Lipschitz function of the data. We will clarify this.
Questions:
The noise matrix defined below (1) is...
This is a typo and we will fix it.
In line 499, can you justify that is a function of...
Here, we need the loss function to be a differentiable function of $u$ and $w$, and the rescaling symmetry implies some additional smoothness properties. To be specific, once the loss is written as a function of the rank-1 matrix, its derivatives can be expressed as compositions of the derivatives with respect to $u$ and $w$. Hence, as long as the loss function is sufficiently smooth in the parameters $u$ and $w$ and the parameters are away from zero, the derivative with respect to the rank-1 matrix is also smooth.
The notation is very unfortunate...
Here, the introduction of this notation allows us to derive the balancedness of the norms and the stationary distribution in the 1d case (see Corollary 3.3). The more intuitive expression of Eq. (2) is provided in Eq. (30), which uses the gradient of the original loss with respect to the parameters $u$ and $w$. Stationarity then directly means that the gradient noise on the two layers is balanced.
Where in the proof of...this indicates that 'gradient noise between the two layers is balanced'.
The relevant derivation is provided in Eq. (30), where the quantities in question are defined explicitly. We apologize for the confusion.
What is the distribution of in the experiment of Figure 1?...
This experiment is easy to reproduce. Here, the input is Gaussian and the label noise is an independent Gaussian. As long as the variances of the input and the label noise are nonvanishing, one will be able to reproduce this experiment.
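For reference, below is a minimal sketch of such an experiment. This is our own illustrative setup, not necessarily the exact configuration of Figure 1: a two-layer linear network with an (assumed) linear teacher, trained with mini-batch SGD on Gaussian inputs with independent Gaussian label noise, tracking the norm difference between the two layers. All dimensions and hyperparameters are placeholder choices.

```python
# Illustrative sketch only (not the paper's exact Figure 1 setup):
# a two-layer linear net f(x) = w^T U x trained with mini-batch SGD on Gaussian
# inputs with independent Gaussian label noise, tracking ||U||^2 - ||w||^2.
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 10, 20, 1000                                  # input dim, hidden width, dataset size
X = rng.normal(size=(n, d))                             # Gaussian inputs
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)   # assumed linear teacher + Gaussian label noise

U = rng.normal(size=(h, d))                             # first layer: large norm
w = 0.1 * rng.normal(size=h)                            # second layer: small norm (unbalanced init)

lr, batch = 1e-2, 8
gap = []
for _ in range(20000):
    idx = rng.integers(0, n, size=batch)
    xb, yb = X[idx], y[idx]
    err = xb @ U.T @ w - yb                             # residuals on the mini-batch
    grad_w = (err[:, None] * (xb @ U.T)).mean(axis=0)   # gradient of (1/2)*mean(err^2) w.r.t. w
    grad_U = np.outer(w, (err[:, None] * xb).mean(axis=0))  # gradient w.r.t. U
    w -= lr * grad_w
    U -= lr * grad_U
    gap.append(np.sum(U**2) - np.sum(w**2))

# The norm gap should shrink from its large initial value as the SGD noise
# drives the two layers toward (approximate) balance.
print("initial gap:", gap[0], "final gap:", gap[-1])
```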
How does (2) imply that 'a single and unique point is favored by SGD'?
When the r.h.s. of Eq (2) does not vanish, Eq. (2) has a unique fixed point in the degenerate valley of the rescaling transformation. Namely, there exists a unique rescaling factor such that the corresponding rescaling transformation makes Eq. (2) vanish. This is made precise in our updated Theorem 3.1; see the summary rebuttal for more detail.
Would Theorem 3.1 hold for all neurons with leaky ReLU activation?
Yes. Leaky ReLU is positively homogeneous, so the network retains the rescaling symmetry between the incoming and outgoing weights of each neuron (see the short derivation below). Therefore, we still have the law of balance with the same definitions of the noise terms in Eq. (2) as for the ReLU activation.
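For concreteness, here is the one-line homogeneity computation (in our notation: a single hidden unit $w\,\sigma(u^\top x)$ with leaky-ReLU slope $\alpha$; the multi-neuron case follows unit by unit):

```latex
\[
\sigma(\lambda z) = \max(\lambda z,\ \alpha \lambda z) = \lambda \max(z,\ \alpha z) = \lambda\,\sigma(z)
\qquad (\lambda > 0),
\]
\[
\text{so that}\qquad
\frac{w}{\lambda}\,\sigma\!\big((\lambda u)^\top x\big)
= \frac{w}{\lambda}\,\lambda\,\sigma\big(u^\top x\big)
= w\,\sigma\big(u^\top x\big).
\]
```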
What do you mean by 'the difference between GD and SGD at stationarity is O(1)'?...
Here, the O(1) difference means that the difference between the final values reached by SGD and GD does not depend on the noise strength. Therefore, GD can only approximate SGD over a finite time horizon.
Should the minimizer of (10) be...
This is a typo. We will correct the global minimizer of (10) in the revision.
Below (10), what are the ?...
Yes, the definitions are the same as in Eq. (8).
I find that the minimum...differs from the value proposed in the article by a constant factor.
Thank you for pointing this out. The definition of this quantity is actually given in Eq. (11). We will revise the formula in l.189 accordingly. We apologize for this typo.
I cannot find the proof of (12)...
Eq. (12) is straightforward to prove; we will include the definition of this quantity and a proof of Eq. (12) in the revision.
In Figure 3, what is plotted for in the depth 1 case?...
Yes, the plotted parameter in the depth-1 case is defined as the product of the two parameters.
I appreciate the authors' response and explanation.
- I looked at the proof of Theorem 3.1 and I am unable to fill in the gaps where diffusion terms should be. It is possible that the authors chose an unfortunate simplifying notation and that this is correct, but I cannot verify the result right now. I would also be surprised at true monotonicity for a stochastic quantity. In Corollary 3.3, the authors seem to use the result in an ODE fashion to conclude that |u-v|^2 in fact decays to zero.
- It is still not entirely clear to me how to show that the loss only depends on the rank 1 matrix and how the 'lifted' function would be defined on the space of matrices (not just the set of rank 1 matrices, which does not form a linear subspace). An explicit construction or formula would be appreciated.
I currently maintain my score and certainty. If the authors can provide a convincing answer to the two questions, I will most likely increase the score but drop the certainty, since I do not feel like I will be able to adequately assess its merit during the discussion period.
For the record, I find the work interesting and I believe that it deserves to be published, but I have reservations about the correctness (or the presentation, if this is really a question of notation).
I appreciate the authors' response and explanation.
I looked at the proof of Theorem 3.1 and I am unable to fill in the gaps where diffusion terms should be. It is possible that the authors chose an unfortunate simplifying notation and that this is correct, but I cannot verify the result right now. I would also be surprised at true monotonicity for a stochastic quantity. In Corollary 3.3, the authors seem to use the result in an ODE fashion to conclude that |u-v|^2 in fact decays to zero.
Thanks for this question. We clarify the details below. We chose a simplified notation in the original proof to avoid confusing the readers with too many technical details. More technically, the proof should start with Eq. (1) and the standard SDE formulation. The following is how one would arrive at the result via this route.
Also, a core aspect of the result of Theorem 3.1 is exactly that Eq. (2) is an ODE! This is because the diffusion terms cancel each other in the time evolution of this quantity. The mechanism for this cancellation is essential to the proof: the gradient noise covariance is low-rank due to the symmetry.
Now, we outline the proof of Theorem 3.1 in a more technical and standard manner. Stack the parameters of the two layers into a single vector $\theta = (u, w)$, and consider the quantity of interest as the function $f(\theta) = \|u\|^2 - \|w\|^2$. The dynamics of $\theta$ is described by Eq. (1) as the SDE $d\theta = -\nabla L(\theta)\,dt + \sqrt{\eta\,C(\theta)}\,dW_t$, where $C$ is the gradient noise covariance. To obtain the dynamics of $f$, we use Itô's lemma: if $\theta$ satisfies $d\theta = a\,dt + b\,dW_t$, then a function $f(\theta)$ satisfies $df = \big(\nabla f^\top a + \tfrac{1}{2}\operatorname{Tr}[b^\top \nabla^2 f\, b]\big)dt + \nabla f^\top b\,dW_t$. Hence, the dynamics of $f$ consists of a drift term and a diffusion term $\sqrt{\eta}\,\nabla f^\top \sqrt{C}\,dW_t$. To simplify this, we use the infinitesimal form of the rescaling symmetry: expanding $\ell(u + \epsilon u,\, w - \epsilon w) = \ell(u, w)$ to first order in $\epsilon$ gives $u^\top \nabla_u \ell - w^\top \nabla_w \ell = 0$ for every per-sample loss $\ell$. Taking the expectation of both sides gives the same identity for $L$, so the drift term $-\nabla f^\top \nabla L$ vanishes. In addition, the per-sample identity means that the direction $(u, -w) \propto \nabla f$ is orthogonal to every per-sample gradient deviation $\nabla\ell - \nabla L$, and hence lies in the null space of $C$. Therefore $\nabla f^\top \sqrt{C} = 0$, since $C$ and $\sqrt{C}$ share the same null space. Substituting these back into the evolution equation of $f$ leaves only the Itô correction term, which is nothing but Eq. (30). Here we can see that the diffusion terms vanish due to the rescaling symmetry.
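In display form, the key steps read as follows (the notation $\theta$, $f$, $C$, $J$ is ours for this reply and may differ slightly from the manuscript's):

```latex
\[
d\theta = -\nabla L(\theta)\,dt + \sqrt{\eta\,C(\theta)}\,dW_t,
\qquad \theta = (u, w), \qquad f(\theta) = \|u\|^2 - \|w\|^2,
\]
\[
df = \Big(-\nabla f^\top \nabla L + \tfrac{\eta}{2}\operatorname{Tr}\!\big[C\,\nabla^2 f\big]\Big)dt
   + \sqrt{\eta}\,\nabla f^\top \sqrt{C}\,dW_t,
\qquad \nabla f = 2\,(u, -w), \quad \nabla^2 f = 2J, \quad J = \operatorname{diag}(I, -I).
\]
\[
\text{Per-sample symmetry: } u^\top \nabla_u \ell - w^\top \nabla_w \ell = 0
\;\Longrightarrow\; \nabla f^\top \nabla L = 0, \quad \nabla f^\top \sqrt{C} = 0
\;\Longrightarrow\; \frac{d}{dt} f = \eta \operatorname{Tr}\!\big[C J\big].
\]
```

The absence of a $dW_t$ term in the last line is exactly what makes the evolution of $f$ an ODE; the remaining trace term is the noise-imbalance drift that appears, up to notation, in Eq. (30).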
It is still not entirely clear to me how to show that the loss only depends on the rank 1 matrix and how the 'lifted' function would be defined on the space of matrices (not just the set of rank 1 matrices, which does not form a linear subspace). An explicit construction or formula would be appreciated.
Due to the rescaling symmetry, the loss function only depends on this rank-1 matrix. To show this, we reorganize the parameters and define a new function of the rank-1 matrix. The derivatives of the loss with respect to $u$ and $w$ can then be rewritten as compositions of the derivative with respect to this matrix. Moreover, the rescaling symmetry $L(\lambda u, w/\lambda) = L(u, w)$ holds for an arbitrary $\lambda$, so the loss is constant along each rescaling orbit, and these orbits are exactly the sets of parameter pairs sharing the same rank-1 matrix (the decomposition of a rank-1 matrix being unique up to scaling). Hence, the loss function only depends on the rank-1 matrix. We will clarify this in the revision.
Please ask for additional clarification for any point that is not yet clear.
I looked at the proof of Theorem 3.1 and I am unable to fill in the gaps where diffusion terms should be.
I thank the authors for the explanation. It makes a lot more sense now and I appreciate the detailed presentation. I believe that integrating it into the article will help future readers as well.
It is still not entirely clear to me how to show that the loss only depends on the rank 1 matrix and how the 'lifted' function would be defined.
I still do not follow. Let me explain my dilemma: the authors define the function on the space of rank-1 matrices through the decomposition into two vectors. In my understanding, they argue that it does not depend on the scaling between the vectors which are used to decompose the matrix (unique up to scaling). Where I do not follow is when they take the derivative with respect to the ij-entry of the matrix, corresponding to a difference quotient in that coordinate direction.
However, the perturbed matrix is not necessarily a rank-1 matrix, meaning that the function is not defined there, as far as I can tell. This issue also arises in their previous response. I think this is the result of formal manipulations which may not be applicable here.
If the authors can provide an explanation or alternative argument, I will be happy to increase my score.
Thank you very much for this very good question.
It is true that our definition only defines the gradient vector of the lifted loss on a rank-1 subspace, but this is all that is required for us. Since the same argument applies to both noise terms, we focus on one of them.
Essentially, we only need the relevant contraction of the gradient to exist, and this only requires the gradient of the lifted loss to be defined on a rank-1 subspace, namely along the direction of the rank-1 matrix itself (we ignore the second term, as it follows from the same argument). Meanwhile, this quantity can be written as a sum of derivatives with respect to the original parameters (see our very first rebuttal), each of which is well-defined whenever the gradient of the original loss is well-defined. Since each term of the sum is well-defined, the quantity itself is also well-defined. Similarly, the other quantity appearing in the theorem is always well-defined.
Lastly, the fact that we only require a rank-1 condition for the gradient of the lifted loss does not imply that the associated noise matrix, viewed as a matrix, is rank-1. Simple examples show that this matrix can be full-rank, so that its largest and smallest eigenvalues are still well-defined.
I will increase my score. I still feel that this paper is not fully ready for publication and requires significant revision, but my concerns about mathematical correctness have been addressed. I thank the authors for engaging with the questions and providing thoughtful replies.
Thanks for the reply. We will do our best to incorporate the feedback from the reviewers to improve the manuscript.
This paper tries to analyze the specific features that carry the noise of SGD (through a continuous model). The authors show that there is a certain 'law of balance' across the layers when some invariance is assumed. Going further, they derive a toy model to push their study, showing that there is an analytic stationary solution to it. They finally propose a phenomenology related to the role of the noise of SGD when analyzing this precise stationary distribution.
Strengths
The idea that a conservation law for the gradient flow implies an asymptotic balancedness condition for the stochastic flow is a good and striking idea.
The one-dimensional examples that are given in the text are very pleasant to follow, and they are good exercises to display the ability of the stochastic flow to diverge from the gradient flow.
The example given in Eq.(13) is thoroughly analyzed.
Weaknesses
The paper presents the following weaknesses:
- The law of balance is an interesting phenomenon, yet considering it with a closer look, it seems that not much can be said generally and that one has to understand it case by case. In one dimension, sure, it is possible to conclude that balancedness will occur at exponential speed, yet in dimension more than one, it seems impossible to predict it surely.
- I have to say that I was a bit bothered by the general overselling of the paper:
  - As said before, the law of balance is truly valid asymptotically in one dimension.
  - The stationary distribution that the authors claim to be the first to derive is for a very specific model, which is not standard and does not resemble a diagonal network!
  - The fact that the stationary distribution can be computed is also very inherent to 1d calculation and is simply a recognition of a Pearson diffusion that already made its way into ML (at least in https://arxiv.org/pdf/2402.01382 and https://arxiv.org/pdf/2407.02322).
Minor typos/flaws:
- l.41: Fokker Planck is not inherently high-dimensional
- l.44: Start a new paragraph here (line break)
- l.165: The law of balance is not strictly applicable here since the loss is not scale invariant because of the regularization.
- Section 4.1, Depth-0: I think this is not currently the "most practical example" since it corresponds to an underparametrized model.
Questions
On top of the questions raised in the weaknesses section, here are some (more minor) questions:
- l.65: Eq (1) is not the usual covariance matrix that is used for Stochastic modified equations.
- l.95: Eq (3) seems important but it is difficult to follow where this comes from without intermediate calculations, can the authors develop?
- l.110: Same thing for the equation (3): can the authors develop?
- Thm 4.2: Are these the only stationary distributions? How do we know to which one the dynamics converges?
Limitations
As said before, all conclusions are drawn for models that live intrinsically in one dimension.
Thank you for your feedback. We will answer the weaknesses and questions below.
The law of balance is an interesting phenomenon, yet...
Thanks for raising this point. We stress that the law of balance applies to high-dimensional problems as well. In the high-dimensional case, the convergence to the noise-balance point is no longer strictly exponential, but the following three strong properties still hold:
- The fixed point (namely, the noise-balanced point) of Eq. (2) exists and is unique (among all degenerate solutions connected through the rescaling transformation). This is an insightful new theoretical result we will include that strengthens the law.
- Assuming that this fixed point does not change in time, the dynamics in Eq. (2) converges to this fixed point. This is another new theoretical result we will include that strengthens the law.
- Assuming that the noise matrices $C_1$ and $C_2$ are full-rank, convergence to a neighborhood of the fixed point will be exponentially fast. This is a direct corollary of Theorem 3.1, which also enriches and strengthens our result.
We also perform an additional set of experiments to validate the corollary. See the rebuttal summary and the attached pdf.
I have to say that I was a bit bothered by the general overselling of the paper:
Thanks for this criticism. We will do our best to tune down our statements to ensure that they are as accurate as achievable.
As said before the law of balance is truly valid asymptotically in one-dimension
As argued above, we stress that our law of balance (Theorem 3.1) is generally valid for a network of arbitrary dimension, and this message is strengthened by the newly added part of and corollary to Theorem 3.1.
The stationary distribution..., which is not standard and does not resemble a diagonal network!
We will make it explicit that the stationary distribution is for a special model. Also, we have never claimed the model to be a diagonal network (and diagonal networks are no more realistic than our model), nor have we claimed the model to be “standard.” Our claim actually has a very restrictive qualifier: that it is the first “general exact” solution of this specific model. We will emphasize these points.
The fact that the stationary distribution can be computed is also very inherent to 1d calculation and is simply a recognition of a Pearson diffusion that already made its way into ML (at least in https://arxiv.org/pdf/2402.01382 and https://arxiv.org/pdf/2407.02322).
It is true that 1d distributions are easier to derive, but our key contribution is not to derive a 1d distribution, but to reduce an arbitrary dimensional problem to 1d, and this is a special property of the SGD noise. This is not trivial and not an overclaim. Also, none of these references derived exact solutions. To be specific, at the beginning of Sec. 2.3 in the reference arXiv: 2402.01382, the authors applied the decoupling approximation to approximate the covariance of the gradient noise near the minima. In Lemma 4.4 of the reference arXiv: 2407.02322, the authors assume the strength of the noise to be a constant near the minima. In comparison, we give an exact solution for arbitrary initial parameters under the law of balance, which is not restricted to the points near the minima. We will include these references and clarification in the revision.
Minor typos/flaws:
l.41: Fokker Planck is not inherently high-dimensional
When the dimension of the dynamical variable is higher than 1, the Fokker-Planck equation is inherently high-dimensional. We will clarify this.
l.165: The law of balance is not strictly applicable here since the loss is not scale invariant because of the regularization.
What we want to derive is the dynamics of the balance quantity, which can be decomposed into two parts: (a) the contribution from the symmetric loss, and (b) the contribution from weight decay. These two parts can be analyzed separately. The contribution from the symmetric part directly follows from the law of balance.
Section 4.1, Depth-0: I think this is not currently the "most practical example" since it corresponds to an underparametrized model.
This is a misunderstanding, based on the questionable assumption that overparametrized models reach zero training loss and are more practical. For example, large language models often reach a training loss far above zero and are certainly underparametrized. For conventional CV models, it is also almost never the case that they reach exactly zero training loss. As long as the training loss is above zero, the setting qualitatively corresponds to this case.
Questions:
l.65: Eq (1) is not the usual covariance matrix that is used for Stochastic modified equations.
This is a typo. It should be the centered covariance matrix.
l.95: Eq (3) seems important but it is difficult to follow where this comes from without intermediate calculations, can the authors develop?
This is a rewriting of the right-hand side of Eq. (2), obtained by setting it equal to 0.
l.110: Same thing for the equation (3): can the authors develop?
This is also a rewriting of the right-hand side of Eq. (2). By following the steps in l.108 and l.109, substituting the definitions given there, and then setting Eq. (2) equal to 0, we obtain the expression at l.110.
Thm 4.2: Are these the only stationary distributions? How do we know to which one the dynamics converges?
Yes. Theorem 4.2 enumerates all the stationary distributions. The solution to which the network converges depends on the initial parameters: one region of initializations converges to the first solution, while the complementary region converges to the second.
I thank the authors for the good rebuttal, which has partially addressed my concerns. For this reason I decide to increase my score by one, while still thinking that the article lacks some convincing examples to strengthen its claims.
Thanks for your reply and update. We would really appreciate it if you could be more specific regarding "convincing examples." Which claim in the paper do you think requires more examples to strengthen it? Knowing this would really help us improve our work.
We thank all the reviewers for their constructive feedback, which has helped us greatly improve our manuscript. We are encouraged to see that all reviewers agree that our contribution is "good."
To address the concerns of the reviewers, we will make the following additions and changes to the manuscript. See the attached pdf for the updated theorem statements and additional experiments. We will include their proof in the revised manuscript.
- A strengthened version of Theorem 3.1, which states that the fixed point of Eq. (2) is unique (HVit, 56YC, mL9n), and that if we assume that this fixed point does not change in time, the dynamics in Eq. (2) will converge to this fixed point (HVit, 56YC, mL9n).
- This fixed point is unique in the following sense: when the r.h.s. of Eq (2) does not vanish, Eq. (2) has a unique fixed point in the degenerate valley of the rescaling transformation. Namely, there exists a unique rescaling factor such that the corresponding rescaling transformation makes Eq. (2) vanish. The intuition behind the uniqueness of the fixed point is this: in Eq. (2), one term is a monotonically increasing function of the rescaling factor, while the other is a monotonically decreasing function of it. When at least one of the two terms is nonvanishing, their sum must cross zero at a unique point (possibly at infinity) along the rescaling transformation.
- A corollary that states that if the noise matrices $C_1$ and $C_2$ are full-rank, Eq. (2) will converge to a neighborhood of the fixed point exponentially fast (HVit).
- A more accurate restatement of our second main contribution as the “first general (instead of ‘particular’) exact solution to a specific deep-and-wide linear model with 1d input and output for an arbitrary initialization.” This is accurate because, in comparison with prior works, our solution does not rely on any approximations (HVit) and is not a particular solution that is only applicable to special initialization conditions (mL9n).
- An additional set of experiments on linear and ReLU nets, which validates the corollary that in high dimensions, the convergence to a neighborhood of the noise-balanced solution is exponential. See Figure 6 (HVit).
- Additional discussion of technical points such as the condition on the regularity of data and the smoothness of the loss function (56YC).
- Discussion of related references (HVit, mL9n)
We look forward to your feedback on these revisions and are open to further discussion to enhance the manuscript.
This paper investigates the behavior of SGD, demonstrating its convergence to a noise-balanced solution. Subsequently, it discusses detailed properties of the stationary distribution of SGD. Generally, reviewers recognize the correctness and theoretical rigor of this paper, yet express concerns about the clarity and limitations of the theoretical models, and thus do not strongly support it. The authors are encouraged to substantially revise the paper to improve clarity and extend the theoretical results to a more general setting.