Exact Risk Curves of SignSGD in High Dimensions: Quantifying Preconditioning and Noise-Compression Effects
Abstract
Reviews and Discussion
Motivated by the signed-gradient interpretation of the stochastic gradient algorithm, the authors develop and apply an SDE/ODE framework to analyze the dynamics of the stochastic gradient algorithm with signed gradients. With this framework they study quantities such as the effective learning rate of the algorithm, preconditioning, and gradient noise. In addition, they state a conjecture that explains adaptive moment estimation (Adam) within their framework.
Strengths
I think the biggest strength of the paper is its clarity of presentation.
Weaknesses
The weaknesses of the paper lie on the technical side, especially with respect to relatively recent work that attempts to connect diagonal preconditioning and sign gradients to Adam.
-
For the general reader, the authors fail to explicitly define what SDE and ODE stand for.
-
In line 458, the authors could reframe the content to be more technically sound.
The use of the phrase "... optimal deterministic convex opt algorithm such as Heavy ball momentum and conjugate gradient..." is not accurate with respect to stochastic gradient algorithms. The wording presents the two algorithms as different algorithms, whereas you have the same stochastic gradient algorithm, only with a different learning-rate setting and a different lowpass-filter (momentum parameter) setting. A better phrase could be: "... the stochastic gradient algorithm with smooth gradients (momentum) and optimal convex optimization settings such as Polyak's Heavy Ball and the Conjugate Gradient ...". Note that the stochastic gradient algorithm itself does not care whether we use a minibatch gradient or not.
-
The extra conjecture adds essentially nothing to the paper. Adam is simply a learning-rate setting for the stochastic gradient algorithm. It is when you decompose the learning rate that you get a normalized gradient in expectation. It is the normalized gradient that has a connection with the sign function. Again, these things are not new. A careful study of the classic literature, especially Tsypkin's classic 1971 work on adaptation and learning in automatic systems, will make this much clearer. Also, diagonal preconditioning is only true if the Hessian of the objective function is diagonal. In the case of the learning rate called Adam, we are only implementing it by dividing the gradient by its estimated variance (or moment), which gives a normalized gradient in expectation. The diagonal preconditioning also holds only in expectation because, in a sense, the eigenvalues of the Hessian can be interpreted as variances of the gradient.
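To make the normalized-gradient point concrete, here is a minimal numerical sketch (the toy numbers and variable names are my own, purely illustrative): dividing the first moment of a noisy gradient by the root of its second moment produces a quantity in (-1, 1) that carries the sign of the true gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy scalar gradients with true mean mu and noise level sigma
mu, sigma = 0.5, 2.0
g = rng.normal(mu, sigma, size=100_000)

# Adam-style normalization: first moment over the root of the second moment
m = g.mean()                 # estimate of E[g]
v = (g ** 2).mean()          # estimate of E[g^2]
normalized = m / np.sqrt(v)  # lies in (-1, 1) and carries the sign of mu

# In the noiseless limit (sigma -> 0) this ratio tends to sign(mu)
assert np.sign(normalized) == np.sign(mu)
assert abs(normalized) < 1.0
```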
Questions
None.
Details of Ethics Concerns
None
Dear Reviewer:
Thank you for your comments and for taking the time to evaluate our manuscript.
Note that the stochastic gradient algorithm itself does not care whether we use a minibatch gradient or not.
Your point is taken regarding the similarities between heavy ball momentum and conjugate gradient. We avoided discussing stochastic gradient methods at this line because there is considerable nuance in showing acceleration for stochastic methods. We would be happy to include a reference if you have one in mind. To our knowledge, outside of minibatch SGD (where the batch is indeed relevant), there is no stochastic gradient + momentum setup with a provable speed-up.
The extra conjecture adds nothing essentially to the paper. Adam is simply a learning rate setting for the stochastic gradient algorithm… A careful study of Tsypkin's classic 1971 work on Adaptation and learning in automatic systems will make this much clearer.
Thank you for the Tsypkin reference! It's interesting to see some modern algorithms described so early in the literature.
The point of including the conjecture was to show that our methodology can take Adam - which has a stochastic learning rate - and compute a deterministic equivalent - which becomes an arbitrarily good approximation in high dimensions - that can be explicitly written in terms of the risk and the data covariance. We hope to formally prove the conjecture in future work, which would allow us to quantify the differences between SignSGD, Adam, and SGD.
[1] Malladi et al., "On the SDEs and Scaling Rules for Adaptive Gradient Algorithms," NeurIPS 2022.
[2] Kingma and Ba, "Adam: A Method for Stochastic Optimization," ICLR 2015.
-
The terms heavy-ball algorithm and conjugate gradient algorithm are just different names for the same stochastic gradient algorithm with different learning-rate and lowpass-filter (momentum) settings. For more on this, see Polyak's 1969 paper ('The conjugate gradient method in extremal problems').
-
The stochastic gradient algorithm only accepts a gradient estimate as its input.
-
We need to be careful here: the only way we can provably show acceleration for the stochastic gradient algorithm is to invoke approximation theory and assume either convexity or linearity in the local sense. It has nothing to do with momentum. It also has nothing to do with whether we make the gradient estimate an average over the training data.
Therefore, my note to you has nothing to do with convergence proofs, just the wording, as indicated in my first comment.
- To your last comment: I still maintain that the extra conjecture adds nothing to your paper. Again, we are just muddying the literature by distinguishing deterministic vs. stochastic learning rates. The learning rate is either constant or iterative.
Dear reviewer pG7Y,
We have updated the language in lines 429 and 455 to hopefully address the concerns you have raised.
Thanks,
The authors
The objective of this paper is to study SignSGD through the lens of SDEs in high dimensions. The authors use this continuous-time model to unveil aspects of the dynamics of SignSGD, namely i) its effective learning rate; ii) noise compression; iii) preconditioning; iv) gradient noise reshaping. They provide some limited experimental evidence supporting their claims and close by noting that they expect these findings to extend to Adam as well.
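For intuition about the object of study, a minimal SignSGD run on a toy least-squares problem with Gaussian label noise (my own illustrative instance, not the paper's exact setup) looks like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear regression instance (illustrative only)
d, n = 50, 2000
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d) / np.sqrt(d)
y = X @ w_star + 0.1 * rng.normal(size=n)   # Gaussian label noise

def risk(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

# One-sample SignSGD: step along the sign of a stochastic gradient
w, lr = np.zeros(d), 0.5 / d
for _ in range(5000):
    i = rng.integers(n)
    g = (X[i] @ w - y[i]) * X[i]            # stochastic gradient of the squared loss
    w -= lr * np.sign(g)

final_risk = risk(w)                        # well below the risk at initialization
```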
Strengths
-
Originality: The authors attempt to contribute by deriving an SDE for SignSGD in a simple high-dimensional setup, which appears to be a new direction in modeling this optimizer.
-
Quality: The derivation of an ODE that models the risk function for linear regression based on this SDE provides an interesting theoretical perspective which is also partially corroborated empirically.
-
Significance: The use of this model to predict the asymptotic expected risk level under Gaussian label noise offers some potential insights into SignSGD’s behavior in specific scenarios, though its practical relevance remains limited to the assumptions made.
-
Clarity: The paper presents the theoretical results in a generally understandable manner.
Weaknesses
Since there is no dedicated space to provide one, I will write my "Overall Comment" here. I relegate the "Detailed Feedback" to the Questions section, where I elaborate on the points below.
Overall Comment: The paper addresses interesting questions and demonstrates potential through its mathematically sound approach. To enhance the quality and impact of the manuscript, I recommend focusing on several key areas for improvement, detailed below. Given the current status of the paper, I recommend rejection; addressing these points could significantly strengthen the work, which certainly deserves it.
-
Inclusion of Related Works: To strengthen the manuscript, it would be beneficial to discuss existing literature on continuous-time modeling of optimizers, particularly the Weak-Approximation (WA) Framework introduced by Li et al. (2019). Including this framework and comparing it with your approach could provide valuable context and demonstrate how your work contributes to the field. Additionally, since the SDE of Adam has been derived by Malladi et al. (2022), and given that SignSGD is a special case of Adam, incorporating a comparison with these results would enhance the depth of the analysis. Including relevant references when stating key facts will also improve the manuscript's credibility (See below for details).
-
Assumption Justification and Validation: Clarifying and justifying the assumptions upon which your analysis is based would strengthen the paper. Providing theoretical or experimental validation for these assumptions will help readers understand their appropriateness and relevance. It would also be valuable to explain any advantages your setup may have over the Weak-Approximation framework. For instance, elaborating on the necessity of the high-dimensional setup in your SDE derivations, especially when such a requirement is not present in weak approximations, could clarify the benefits of your approach.
-
Addressing Conceptual Missteps: It's important to ensure that the continuous-time models derived in your work faithfully represent the behavior of the optimizers they are intended to model. While your paper provides guarantees for the risk (as per Theorem 1), extending these guarantees to other critical aspects—such as the gradient norm and the norm of the iterates (as defined in Definition 2 of Li et al.)—would strengthen the validity of your models. Currently, insights are derived primarily about the SDEs rather than the optimizers themselves but are framed as insights on the optimizers. To effectively carry these insights over to the optimizers, it would be beneficial to provide guarantees that the SDEs closely track the behavior of their respective optimizers or to include experimental verifications demonstrating this correspondence.
-
Inclusion of Experimental Validation: Incorporating experimental validation of your results and insights would greatly enhance the manuscript. Validating the theoretical findings empirically will demonstrate the practical applicability of your SDE models and confirm that they are informative. For instance, if you derive a bound on the risk using an SDE, empirically verifying this bound for the respective optimizer would provide strong support for your theoretical claims.
-
Improving Structure and Clarity: Enhancing the organization and clarity of the manuscript would significantly improve its readability and impact. Formalizing key results as Lemmas, Propositions, or Theorems, rather than discussing them in free text, would make the arguments more rigorous and easier to follow. Clearly defining all symbols and terms, such as "Vanilla ODE," will help avoid confusion. Additionally, ensuring that figures are accessible to all readers, including those who are colorblind, by choosing appropriate color schemes and providing clear labels, will enhance the overall quality of the presentation.
I have provided detailed feedback on these points in the Questions section below, where I elaborate further on how the manuscript can be improved.
References:
- Li et al. (2019) "Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations".
- Malladi et al. (2022) "On the SDEs and Scaling Rules for Adaptive Gradient Algorithms".
Questions
Detailed Feedback:
In the following, I provide detailed feedback. These points are generally marked in the same order as they appear in the paper and are not arranged by relevance.
-
Literature Review: It would be beneficial to include a review of the literature surrounding continuous-time models for optimizers. In particular, discussing weak approximations (Li et al. 2019) could enhance the context of your work, as they have been successful in deriving SDEs and ensuring their accuracy. Additionally, since not all readers may be familiar with SDEs and the concept of continuous-time models, providing visual aids—such as plotting a trajectory of the optimizer alongside that of the SDE—could make these ideas more accessible.
-
Discussion of Limitations: Including a section dedicated to the limitations of your approach would be appreciated. A subsection comparing your method with existing literature could provide readers with a clearer understanding of the advantages and potential drawbacks of your work.
-
High-Dimensional Setting Clarification: It would be helpful to clarify why your derivations require the high-dimensional setting, especially since the Weak Approximation framework does not have this requirement and has been effectively used to derive SDEs for various methods like SGD (Li et al.), Adam and RMSprop (Malladi et al.), and SAM (Compagnoni et al.). Explaining could enhance readers' understanding of your approach.
-
Focus on Linear Regression Model: It might be beneficial to explain why your analysis focuses on the linear regression model, particularly when the Weak Approximation framework allows for SDE derivations with relatively general loss functions. Highlighting any specific advantages your setting offers would strengthen your argument.
-
Relation to Existing SDEs: Considering that the SDE of Adam/RMSprop has been derived by Malladi et al. (2022), and since SignSGD is a special case of these algorithms, it would be helpful if you could explain how their SDE relates to yours. Demonstrating this connection could enhance the comprehensiveness of your work.
-
Definition Clarity: Regarding Definition 1, it appears that the risk definitions do not depend on the labels. It might improve clarity to list Assumption 1 first and then define the risks, enhancing the logical flow of your presentation.
-
Assumption 1 Concerns: In deep learning, we often encounter overparameterized regimes with more parameters than data points, leading to potentially infinitely many solutions of the regression task. This could make the SDE ill-posed because you define it in terms of a specific , but the dynamics might converge to a different solution. This could result in the risk in the denominator approaching zero while the numerator does not. Therefore, the SDE may have non-Lipschitz coefficients, which could hinder the existence and uniqueness of the solution: could you comment on this aspect?
-
Conceptual Mistake (Line 86): In Line 86, the statement that "an important characterizing feature of SignSGD is its effect on the covariance matrix K" seems conceptually wrong. The matrix K is fixed and exists independently; it is not influenced by the choice of optimizer. Perhaps it would be more accurate to say that K influences SignSGD.
-
Universality Reference (Line 96): In Line 96, you mention universality as if it's a commonly known concept. Providing a reference for this phenomenon and considering whether this is the best place to introduce it in your manuscript would be helpful.
-
Assumption 3 Organization and Discussion: Regarding Assumption 3, presenting assumptions on the matrices immediately after their definitions, rather than after Assumption 3, could avoid confusion. Additionally:
- i) You mention that the upper bound is standard; including a reference would be helpful. Regarding the lower bound, does it hold in practice? High-dimensional data often have a lot of structure, and their generating covariance might be degenerate.
- ii) There is no discussion of the second assumption; adding this would enhance understanding.
- iii) The discussion around the third assumption is somewhat informal. Clarifying what you mean by "typical" and providing references or experimental validation would strengthen this section, as real-world covariance matrices are not random and possess rich structures.
-
Assumption 4 Clarification: Concerning Assumption 4, this assumption appears cumbersome, especially since the Weak Approximation setting does not require such special rescaling. Does this imply that SignSGD needs to be run with a learning rate that scales with ? Clarifying this point would be beneficial.
-
(Assumption 5): Regarding Assumption 5, Equation (4) guarantees an upper bound on the resolvent. However, in Equation (6), you divide by it. How do you ensure that you are not dividing by zero?
-
Conceptual Misunderstanding (Line 149): In Line 149, there may be a conceptual misunderstanding. Compression does not change the landscape; the landscape exists independently, and the optimizer moves through it without altering it. Are you suggesting that there is some implicit regularization of the landscape (Li. et al. & Compagnoni et al.)?
-
Naming Convention (Line 153 and Definition 2): In Line 153 and Definition 2, you refer to the method as "Homogenized." It might be clearer to call it what it is: the SDE of SignSGD. Additionally, as mentioned earlier, the SDE appears to be ill-defined in overparameterized settings.
-
Recovery of ODE from SDE (Definition 2): Regarding Definition 2, explaining how we can recover the ODE of SignGD when there is no noise would be helpful. If this is not possible, it raises questions about the explanatory power of the method. Experimenting with your provided Python code did not clarify this for me. Additionally, relating and comparing this SDE with that of Adam in Malladi et al. (2022), if possible, would enhance the comprehensiveness of your analysis. If not possible, providing an explanation would be helpful.
-
Figure 1 Clarity: Concerning Figure 1, it's crucial for figures to be clearly readable and well-described. The labels in the legend are unclear; terms like "Vanilla ODE" are not defined in the text, and I had to refer to your code to understand them. Similarly for "ODE." Additionally, the color scheme is not color-blind-friendly. I recommend using different markers for the lines and avoiding overlapping colors, as it was challenging to distinguish them. Regarding the caption:
- i) Are you representing the dynamics of Sign(H)SGD, or is it the dynamics of the risk under the dynamics of Sign(H)SGD?
- ii) When you mention the "usefulness of the ODE," do you mean its faithfulness or accuracy in describing the optimizer's behavior?
- iii) The phrase "and the significant estimation of key quantities" is unclear. Could you clarify what you mean?
-
Extended Plotting (Figure 1): While the initial part of the dynamics is well captured in Figure 1, the central purpose of SDEs is to model the stochastic nature of the optimizer, which is often most evident near convergence. Could you extend the plots to include more epochs (e.g., 100 or 1000) to illustrate this aspect?
-
Terminology Clarification (Remark 1): In Remark 1, it might be confusing to refer to this as interpolation. You can have label noise and still achieve interpolation; indeed, it's possible to fit random inputs x to random outputs y. Clarifying your terminology here would be helpful.
-
Theorem 1 Commentary: Theorem 1 is a key result and could benefit from further commentary:
- i) Do you have any insights into how strong the exponential explosion is?
- ii) What role do the moments play in this statement?
- iii) It might be clearer to denote the two constants currently marked with using different letters to avoid confusion.
- iv) It seems that under your framework, you're only able to derive the curves for the loss, whereas the Weak Approximation method provides guarantees for each sufficiently regular test function (Def. 2 of Li et al.). This might limit your framework. Is it possible to generalize your result to other functions, such as the norm of the gradient or the norm of the iterates? Would each quantity require a separate theorem, or is there a way to cover a class of interesting test functions under a broader theorem, similar to the WA setup?
-
Formalizing ODE Results: It might enhance your manuscript to formalize the ODE for the Risk as a Proposition or Theorem, rather than just describing it in the text. Additionally, providing a similar result for SGD would allow for a better comparison between the methods.
-
Equation Clarification (Eq 10): Regarding Equation (10), is this a definition or an equality?
-
Theorem 2 Statement Clarity: Theorem 2 is another key result, and similar observations as those for Theorem 1 apply here. Also, is given by Equation (10), but its dynamics are described by Equations (11.a) and (11.b). It might be helpful to rephrase the statement to make this clearer.
-
SDE Derivation for SGD (Eq 13): Regarding Equation (13), the SDE of SGD does not seem to be derived in the paper, and I couldn't locate it in the given reference. Could you provide more details on this? Additionally, the SDE of SGD has been extensively studied and derived without requiring high-dimensional settings (see Li et al.). Why is the high-dimensional setting needed here? How can we see that in a noiseless setup, this SDE becomes the ODE of Gradient Descent?
-
Scheduler Boundedness (Eq 14): Concerning Equation (14), is this scheduler actually bounded? As you approach a stationary point, it seems to diverge. Additionally, this scheduler resembles the normalization used to define Normalized SGD. Could you elaborate on this point?
-
Highlighting SignSGD's Advantage (Lines 276-277): In Lines 276-277, you mention that SignSGD is favored when the noise has unbounded variance. This point hasn't been mentioned earlier. It might be helpful to highlight this earlier in the manuscript, provide a reference, or offer an argument or example to support this claim.
-
"Training with signSGD": To my understanding, you are not training anything here. It seems like you are comparing the dynamics of the SDE models for SGD and signSGD: It is important to remember that you are not comparing the behavior of the real optimizers but that of their SDE models. To confirm your insights from the models, you need to run experiments on the real optimizers.
-
Definition of Variables (Eq 15.b): In Equation (15.b), the variable is not defined at this point. It might be clearer to include a remark or lemma where you define all the variables used in the equation before proceeding with the analysis.
-
Preservation of Optimizer Properties (Line 315): In Line 315, if I understand correctly, you suggest that SignSGD does not converge properly when the risk is small, and to address this, you propose multiplying the increments of SignSGD by the square root of the risk. However, this could negate the optimizer's resistance to noises with unbounded variance. Could you explain why you would want to remove such a significant advantage of the optimizer?
-
Relevance of Comment (Lines 316-318): Lines 316-318 contain a comment that seems somewhat disconnected from the surrounding discussion and lacks sufficient motivation. You might consider elaborating further to clarify its relevance or consider removing it.
-
Empirical Validation (Theorem 3): Theorem 3 presents an interesting result, and I strongly encourage you to provide empirical validation for this limit. Including a corresponding result for SGD would greatly enhance the ability to compare the methods. Additionally, is it possible to extend this result to scenarios with unbounded variance?
-
Section 4.2 Clarification: Section 4.2 may need further clarification. If the noise variance approaches infinity, wouldn't the risk also tend to infinity? In that case, might not actually diverge to infinity. Modeling unbounded variance is highly relevant because it's precisely the situation where SignSGD would be most beneficial. Addressing this point could strengthen your analysis.
-
Framing and Definitions (Lines 370-377): In Lines 370-377:
- i) It might be beneficial to formalize this discussion as a Proposition. Additionally, kurtosis is not defined for all degrees of freedom of the Student's t-distribution, and it quickly becomes unbounded in cases of interest like unbounded variance. Could you clarify why you use this term in such contexts?
- ii) Is Equation (19) straightforward? Is it possible for the risk to be nearly zero while the noise has an increasingly large (possibly unbounded) variance?
-
Collecting Equations into Results: It might strengthen the flow of the paper if you combined Equations (19), (20), and (21) into a clearly stated result, and present the equivalent findings for SGD for comparison. Including experimental validation would further support your theoretical claims.
-
Scheduling SignSGD Concerns: Regarding Scheduling SignSGD, this appears to be a puzzling point. In Equation (24), you suggest using a scheduler that increases or decreases with the risk. This could effectively undo the normalization that contributes to SignSGD's effectiveness: For instance, if the noise variance is unbounded, this scheduler could diverge, which may not be desirable. It might be counterproductive to undo the adaptive gradient rescaling. Could you provide further insight into this?
-
Relation to Existing SDEs (Line 432): In Line 432, it seems that this analysis could be performed by considering the SDE of Adam as derived in Malladi et al. Could you elaborate on this point? In your appendix, I noticed only an informal SDE for the iterates of Adam, but not the differential equations for and .
-
Merging Results (Theorem 4): Regarding Theorem 4, is it possible to merge this result with Theorem 3, or am I misunderstanding something? Including experimental validation and comparisons with SGD would be valuable additions to your manuscript.
-
Elaboration Needed (Line 465): In Line 465, you mention "when the covariance—the Hessian of the risk." Could you elaborate on this point? Providing a reference or a brief explanation would help clarify this statement.
-
Practical Implications (Section 4.3): A general comment on Section 4.3: In practical terms, could you provide guidance on when it is better to use one method over the other? This would help readers understand the practical implications of your findings.
-
Quantitative Characterization (Section 4.4): In Section 4.4, the discussion is mostly qualitative and intuitive. It would be helpful to include a quantitative characterization of the differences in noise structure to provide a clearer understanding of your analysis.
Minor Points:
-
Rewriting for Clarity (Lines 87-93): It might improve the readability to rephrase this section into a cohesive paragraph rather than a list of facts and observations. Regarding Equation (4), elaborating further and possibly providing a brief proof for the derivation of would enhance the clarity.
-
Lines 94 to 96: Please rephrase and make correct use of words and commas. Suggestion: "Although our theory is framed in the setting of Gaussian data, as we will see, the results are still a good description for real-world, a priori non-Gaussian settings."
-
Punctuation (Line 248): Please add a comma before the word "respectively".
-
Consistent Notation: It would be helpful to maintain consistent notation throughout the manuscript. For example, choose either or for processes and use it consistently.
-
Formal Language (Line 267): The phrase "apples-to-apples" is informal. Consider using more formal language appropriate for a scientific paper.
-
Terminology Correction (Figure 3): The correct term is "Student's t," not "Student-t." Please make this correction.
Appendix:
While I won't go into detailed comments about the appendix, I noticed that it's quite technical and could benefit from better organization. Presenting technical lemmas upfront and breaking down major results into smaller components could streamline the proofs. Several missing preliminaries and notation definitions affect readability. Here are some examples that came to mind:
-
Equation Justification (Line 738): It would be preferable to first prove, in a technical lemma, what is necessary to justify Equation (44) before using it. Since you seem to drop this term and mention that you will justify it later, it's important to handle such a key step carefully to ensure clarity and rigor.
-
Assumption Clarity (Line 739): It appears that you're working under the assumption of bounded risk, which may not be realistic in practical scenarios. This is a significant assumption that should be explicitly stated in the main paper.
-
Accessibility of Concepts: Concepts like nets, stopped processes, and martingales may not be familiar to all readers. Including a section that compiles all the notation, technical lemmas, and theoretical preliminaries would help make the appendix more accessible and provide a more self-contained experience for readers. I attempted to follow the proofs but found it challenging to verify all the steps in a reasonable amount of time.
-
Lemma Placement (Line 836): Lemma 8 is referred to here but is stated much later in the appendix. Reorganizing the content so that lemmas are introduced before they are used would improve the flow and readability.
-
Undefined Notation (Line 915): The notation is used but not defined at this point. Defining all notation when first introduced would help prevent confusion.
-
Conceptual Clarification (Line 1007): There seems to be a conceptual misunderstanding here. It is the dynamics of the SDE that converge to those of the optimizer, not the other way around.
-
Variable Definition (Line 1025): The notation is used but not defined. Providing a definition would improve clarity.
-
Term Introduction (Line 1049): The term "HSGD" is mentioned, but it hasn't been introduced earlier in the paper.
-
Notation Clarification (Line 1630): The notation is used without prior definition.
Compagnoni et al. "An SDE for Modeling SAM: Theory and Insights".
Dear Reviewer QUHk,
First, thank you for the exceptionally thorough review that you have performed of our paper. Regardless of the outcome of the review process, your feedback has already helped us improve the paper.
Much of your review discusses the Weak-Approximation framework. We agree that a comparison between our work and the Weak-Approximation framework will improve our paper. The Weak-Approximation approach is quantitatively and qualitatively different from our approach, and we would like to explain the details here. We will make a more detailed, itemized response to your points later, but we wanted to begin discussion of this substantial point of your critique first.
The Weak-Approximation SDE is the following:
(Malladi et al. -- Adam SDE, Def 4.4). Let . Let and . Define the state of the SDE as and the dynamics as
where is -dimensional Brownian motion.
To recover SignSGD from Adam, we would take and . The noisy gradient model which most closely reproduces our setup has with as in (Eqn 2) of our paper. (Note Malladi et al. also treat RMSprop, with a slightly simpler SDE. But even recovering the RMSprop SDE from the Adam SDE requires taking a nontrivial limit . We will add an appendix to the paper in which we take these limits; even after doing so, however, the resulting SDE is substantially different from ours.)
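As a quick numerical sanity check on this limit (illustrative only; the test gradient and epsilon are arbitrary choices of ours): with both moment-decay parameters set to zero, the Adam update direction reduces to the coordinate-wise sign of the gradient.

```python
import numpy as np

# With beta1 = beta2 = 0, Adam's moments are m = g and v = g^2, so the
# update direction g / (sqrt(g^2) + eps) tends to sign(g) as eps -> 0.
g = np.array([3.0, -0.2, 1e-4, -7.0])   # arbitrary test gradient
eps = 1e-12

m = g                      # first moment with beta1 = 0
v = g ** 2                 # second moment with beta2 = 0
update = m / (np.sqrt(v) + eps)

assert np.allclose(update, np.sign(g), atol=1e-6)
```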
In comparison, our SDE is given by (Eqn 7):
The SDEs are different in many ways. In the comments below, we discuss some of these differences in detail.
- The SDEs are substantially different, and this stems from a fundamental difference between Weak-approximation and high-dimensional limit theory. In Weak-approximation, the formal comparison is:
The constant depends on and, crucially, on the dimension . Hence, weak approximation unfortunately says nothing about sequences of problems of growing dimension. In our setup, we look at problems of growing dimension and obtain a dimension-independent bound on the approximation error (Theorem 1).
- High-dimensionality has multiple consequences. One is that the loss-curves become deterministic as dimension grows. Hence we can describe them by a deterministic system of ODEs (Theorem 2), which are influenced by all the sources of noise (label-noise and gradient-noise). One of the criticisms made was:
Currently, insights are derived primarily about the SDEs rather than the optimizers themselves but are framed as insights on the optimizers.
The main goal of high-dimensional analysis is to understand the behavior of the optimizers themselves through the SDEs. The self-averaging seen in high-dimensions makes this possible; this is what makes the true risk trajectory comparable to the SDE risk trajectory (and, indeed, the averaged SDE risk described by our ODEs). This is the key content in Theorem 1 of our paper.
- Another consequence of high-dimensionality is that our SDE is easier to interpret: from the Malladi et al. SDE, we cannot see the factor or the , whose properties we develop from (Eqn 16) onward in the paper. High-dimensionality is a key ingredient here; we are putting additional restrictions on the problem setup in order to gain more information about the behavior of the optimizer. We do not believe the type of conclusions we draw about the high-dimensional SDE and ODE (especially Theorems 3 and 4) can be derived from the Malladi et al. SDE, because the key high-dimensional simplifications are not built into that framework.
- Once the problem instance has been specified (data covariance, noise structure, learning rate), we can solve the high-dimensional SDEs and corresponding ODEs for the risk and other quantities. This makes our high-dimensional approach a good tool for understanding how different aspects of the problem interact with the optimizer to generate learning dynamics. In contrast, Malladi et al. do not consider an optimization problem, and instead consider a 'noisy gradient model'; a major contribution of our paper is characterizing the optimization consequences of properties of SignSGD through its ODEs (esp. Theorems 3 and 4). Nonetheless, we propose to include the detailed comparison to Malladi et al. in the Appendix of the paper and include a shorter comparison in the main text.
- You also mention that a major limitation of Theorem 1 was the restriction to the risk:
While your paper provides guarantees for the risk (as per Theorem 1), extending these guarantees to other critical aspects—such as the gradient norm and the norm of the iterates (as defined in Definition 2 of Li et al.)—would strengthen the validity of your models.
We agree. In fact, the main theorem in the Appendix (Lemma 3) does compare other statistics, including the very important distance to optimality and the norm of the iterates. The gradient norm in our setup is actually given by the , which is already covered by Theorem 1. We will clarify this point in the main text and add some additional discussion around other optimizer statistics.
- There were additional concerns about experimental validations:
Incorporating experimental validation of your results and insights would greatly enhance the manuscript. Validating the theoretical findings empirically will demonstrate the practical applicability of your SDE models and confirm that they are informative.
We note that Figure 1 in the original submission already shows the correspondence between the raw signSGD simulation and our derived SDE and ODE. The SDE captures the distribution well (80% CI plotted in Figure 1, raw signSGD in purple, SDE in yellow), and the ODE representing the average describes both the average of the raw signSGD and the typical trajectory (given that the 80% CI envelope has the same shape as the ODE). The simulations in the original paper were done at fixed dimension ; we are preparing additional simulations which show that increasing the dimension improves the convergence of the process to the SDE, as our theorem suggests.
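As a qualitative illustration of this self-averaging effect, here is a self-contained sketch of our own (with assumed isotropic Gaussian data, an assumed noise level of 0.1, and a 1/d learning rate; it is not the paper's experimental code): the run-to-run spread of the final signSGD risk shrinks as the dimension grows.

```python
import numpy as np

def signsgd_final_risk(d, steps, gamma, seed):
    # Streaming signSGD on least squares: y = <x, theta*> + 0.1 * noise,
    # x ~ N(0, I_d). Excess risk is 0.5 * ||theta - theta*||^2.
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d) / np.sqrt(d)  # ||theta*|| ~ 1
    theta = np.zeros(d)
    for _ in range(steps):
        x = rng.standard_normal(d)
        noise = 0.1 * rng.standard_normal()
        g = x * (x @ (theta - theta_star) - noise)    # stochastic gradient
        theta = theta - gamma * np.sign(g)
    return 0.5 * np.sum((theta - theta_star) ** 2)

# Self-averaging: the run-to-run spread of the final risk shrinks with d.
spreads = {
    d: np.std([signsgd_final_risk(d, steps=5 * d, gamma=0.5 / d, seed=s)
               for s in range(8)])
    for d in (50, 800)
}
print(spreads)  # spread at d=800 is noticeably smaller than at d=50
```

With more seeds and dimensions one can also plot the full risk curves and watch them collapse onto a deterministic trajectory.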
We look forward to discussing further with you in the rebuttal period, and we reiterate that we will include a full response soon.
The authors.
3. Conceptual Accuracy and Veracity concerns
A number of comments are regarding phrasing and accuracy of some statements. We address these below indexed by the bullet point where these comments were made:
#8) We have changed the phrasing to “an important characterizing feature of SignSGD is its effect on the covariance of the signed stochastic gradients”
#13) While the landscape might remain constant (the problem itself has not changed) the landscape has effectively changed from the point of view of the optimizer. Given that we are writing about the optimizer it seems fair to make this claim.
#18) Good point, we will change how we refer to interpolation. Here we truly just mean noiseless.
#23) Thank you for pointing this out. We mistakenly referenced the wrong paper. This has been fixed.
#24) Yes, this scheduler is bounded so long as the noise has non-zero variance (as then where ). We have assumed this. In Remark 1, we point out that if we consider the no-noise scenario then we must assume the risk is bounded away from zero, and again the scheduler is bounded.
#25) In the case of infinite variance, SGD does not even converge. We have added a citation.
#26) Our main theorems (1 and 2) show that the risk curves of signSGD and signHSGD concentrate and in fact become identical as dimension goes to infinity. Thus, it is consistent to refer to training signSGD when we discuss the dynamics of signHSGD, since they refer to identical processes in the high-dimensional limit.
#27) We added a forward reference to the missing equation.
#28) We say why we would do this in the text. We also do not follow what rescaling by risk does with regards to variance. The sign function properly controls for gradients with infinite variance.
#29) This sentence was intended to provide a reason why one may want to keep the schedule from signSGD, as it adjusts for ‘vanishing gradients’. We have clarified the text.
#30) These are excellent questions: we will look into an expanded discussion and adding simulations.
#31) The risk is defined independently of the noise, so even if the noise variance goes to infinity, the risk will not.
#32)
- Kurtosis is a broader concept than the Pearson definition in terms of the fourth moment, which is of course useless in the case of distributions with infinite fourth moment. See Balanda and MacGillivray, "Kurtosis: A Critical Review."
- Eq (19) is a dominated convergence exercise.
#33) We believe it is clearer as written.
#34) This scheduler depends on the deterministic equivalent of which is independent of the noise. This is indeed the optimal schedule, which confirms the discussion earlier. Note the difference between -risk and -risk.
#35) See our other responses regarding Malladi et al.
#36) We do not like the idea of combining them, as they contain different information. (The suggestion to provide experimental validation is well-taken, and we will look into it).
#37) In our setup, we have . The Hessian of this function is , the covariance.
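The claim in #37 can be checked numerically. Below is a minimal sketch of our own (the diagonal covariance `K`, the sample size, and the noiseless labels are illustrative assumptions, not the paper's setup): second finite differences of the empirical least-squares risk recover the data covariance.

```python
import numpy as np

# For the least-squares risk R(theta) = 0.5 * E[(<x, theta> - y)^2], the
# Hessian is E[x x^T], the data covariance. Check by finite differences.
rng = np.random.default_rng(0)
d, n = 4, 200_000
K = np.diag([3.0, 1.0, 0.5, 0.1])                      # assumed covariance
X = rng.standard_normal((n, d)) * np.sqrt(np.diag(K))  # rows x ~ N(0, K)
theta_star = rng.standard_normal(d)
y = X @ theta_star                                     # noiseless labels

def risk(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

h, I = 1e-3, np.eye(d)
H = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        H[i, j] = (risk(theta_star + h * (I[i] + I[j])) - risk(theta_star + h * I[i])
                   - risk(theta_star + h * I[j]) + risk(theta_star)) / h ** 2
print(np.round(H, 2))  # close to diag(3, 1, 0.5, 0.1), up to sampling error
```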
From Appendix point #1): we do not actually drop this term. Equation 44 is a simplification of Equation 36 to be index-independent; we justify this in Lemma 1. This section is meant to build intuition for the reader about where the argument is going.
From Appendix point #6): as we show, the dynamics of signSGD and signHSGD converge, so it is valid to refer to either converging to the other.
Structural and formatting comments
Thanks for the numerous comments regarding the formatting of our manuscript. We have incorporated many of your suggestions which have improved the readability of the results. We reply below to select parts of your comments.
#6) Thank you for pointing this out. We have clarified the definition.
#9) The central limit theorem or laws of large numbers are the simplest examples. Universality means the particulars of the random variables being summed do not affect the average beyond their first and second moments. We have added a citation to a discussion.
#14) The term homogenized is used to match with existing literature [3]
#15) Please refer to Appendix B where the derivation of the ODE is explained, including in the case with 0 noise (this is the case with which gives by definition).
#16) The color map chosen for the figures is “plasma” from matplotlib. To the best of our knowledge this colormap is accessible to those with common forms of colorblindness. We would welcome specific suggestions to use instead to improve the accessibility of our paper.
- Figure 1 represents the dynamics of the risk under signSGD/signHSGD. We’ve clarified this.
- Yes
- Working with real data requires estimating as well as the noise distribution. In synthetic experiments these values are explicitly available since we chose them.
#17) Theorems 1 and 2 explicitly show that signHSGD and the deterministic equivalent match the dynamics of signSGD under the risk for longer time frames. In our revised manuscript we have added Figure 5 which shows the dynamics of the risk under signSGD and signHSGD along with their deterministic equivalent under long time scales (which is iterations) and with increasing dimension.
#19)
- The exponential term in Theorems 1 and 2 involving is caused by our inclusion of divergent learning rate schedules. Under convergent schedules it does not appear. However, in general we do not know this constant. We did not carefully investigate this aspect, nor would our proof necessarily give the optimal constant. Our main focus is that this constant is dimension- and quadratic-independent. See Theorem 5.
- The moments are required for a technical lemma regarding the moments of the signed gradient updates (Lemma 9). They affect the probability with which Equations (9) and (12) hold. The choice could be optimized but we make no attempt. Good point, we have changed this.
- We agree. In fact, in our original submission this result was included in Lemma 3 (Theorem 5 in the revision) which guaranteed curves under any quadratic function. We chose to focus on the risk for clarity of presentation. We have highlighted that our results hold for a general quadratic in the revision.
#20) In the original submission we derived similar results for SGD in Appendix B and referenced this in the main text (Now line 256). Our ODE results for SGD are not new, and due to space constraints, we have to leave these in the appendix.
#21) It’s a definition.
Thank you again for your exceptionally detailed review of our manuscript. Your suggestions have improved the structure and clarity of our manuscript. We have collected your points into a few key critiques and address each below. We’ll also reference specific bullet point numbers of your review when relevant.
1. Inclusion of Related Works, Literature Review, and Limitations (Detailed feedback points 1-5)
We appreciate the reviewer's recommendation to expand the discussion on related work and provide a clearer comparison with the Weak-Approximation (WA) framework [1] and Malladi et al.'s SDE derivations [2]. In response, we have added Appendix F, a new section comparing our approach with the frameworks discussed by [1] and [2], highlighting key differences in focus, methodology, and results.
We summarize some of the key points here:
We believe that although our paper also derives an SDE approximator for a discrete optimizer, the main results are fundamentally different. For one, we believe the weak-approximation framework they propose is unable to directly recover a corresponding signSGD SDE from their ADAM SDE formulation. For instance, in order to construct their ADAM SDE, they estimate the normalizing constants in continuous time by where . This estimator fails when , meaning that recovering a signSGD SDE is not simply a matter of selecting the appropriate hyperparameters, as is the case when recovering the signSGD optimizer from the Adam optimizer. We discuss this further in Appendix F of our updated manuscript.
We apply a different method of analysis to study optimization algorithms as compared to [1] and [2]. As a result, we arrive at different approximation bounds. Malladi et al. show a comparison between the SDE and the optimizer as the learning rate goes to zero. In contrast, we show that the exact dynamics of our SDE and signSGD closely track each other in the high-dimensional limit. This is what allows us to study the behavior of the actual optimizer (signSGD) by studying its SDE. Our goal is not only to derive SDEs for signSGD but also to gain insight into how adaptive methods like signSGD, and eventually Adam, behave in the limit of large problem sizes. The aspect of dimensionality is not addressed in [1] or [2] and thus requires a different set of tools.
It is true that the weak-approximation framework allows for a wider selection of loss functions and noise models. However, our focus on just quadratic statistics and linear regression, although restrictive, allows for a more in-depth analysis of the optimizer. For example, the exact effects of the noise can be seen in , as well as the transformation of the covariance induced by the sign operator. Our focus on linear regression and quadratic statistics is certainly also a limitation, and we believe an extension to GLMs is a natural direction for future research. However, even in the simple setting of linear regression, extracting these dynamics was a non-trivial task.
references
[1] - Li et al. (2019) "Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations".
[2] - Malladi et al. (2022) "On the SDEs and Scaling Rules for Adaptive Gradient Algorithms".
2. Justification and Validation of Assumptions
The bullet points #7, #10, #11, #12 raise questions regarding our setup.
#7) In our setup, the risk is defined as the expected value of the loss. This function only has 1 minimum, and it is . While certainly a concern in other setups, this will not cause our SDE to be ill-posed.
#10) Thank you for this suggestion. We have moved where this assumption is introduced.
- Assumption 1: The lower bound assumption is important and may not hold for every dataset. It has however held for the two settings (CIFAR10 and wikitext2) that we have checked.
- Assumption 2: There is random matrix theory that discusses this, but there is certainly no easy answer about how restrictive this assumption is. We have added a reference discussing this.
- Assumption 3: We removed the "typical" wording and clarified our position (with additional references discussing the links with real data). We push back on the idea that "real-world covariances are not random"; modeling data as samples from various random distributions (i.e., not just iid matrices) has been a useful principle in probability, statistics, and machine learning. Additionally, many real-world datasets have properties similar to samples from parameterizable, high-dimensional distributions. The assumptions we focus on here characterize some key properties of high-dimensional datasets (related to concepts like self-averaging and eigenvector delocalization). Violations of these assumptions in real data often come from low-dimensional structures (e.g., the dead pixels in MNIST). Real datasets of course contain both low-dimensional and high-dimensional parts; our assumptions are equivalent to focusing on how the high-dimensional parts affect training.
#11) In this setting both signSGD and regular SGD need to be trained with learning rates of or else they diverge. This is indeed a standard feature of training in high dimensions; for examples in neural network training see e.g. this work (footnote on page 4, and appendix F). Roughly speaking, as parameter count grows, each individual parameter needs to be changed by a smaller amount to change the model outputs by a comparable amount.
Note that weak approximation is also sensitive to dimension; the problem-dependent constant in the error bound will grow at least as in this setting. Therefore our learning rate scaling does not place any artificial restrictions and is simply a property of the underlying optimization problem.
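The scaling claim can be illustrated in the simplest streaming least-squares model (our own toy sketch with isotropic Gaussian data; in this model, mean-square stability requires eta < 2/(d+2) by a direct second-moment computation, so this is an assumption-laden simplification rather than the paper's result):

```python
import numpy as np

def sgd_final_risk(d, eta, steps, seed=0):
    # Streaming (one-pass) SGD on noiseless least squares with x ~ N(0, I_d).
    # delta = theta - theta*; excess risk is 0.5 * ||delta||^2.
    rng = np.random.default_rng(seed)
    delta = rng.standard_normal(d) / np.sqrt(d)        # ||delta|| ~ 1
    for _ in range(steps):
        x = rng.standard_normal(d)
        delta = delta - eta * x * (x @ delta)
        if delta @ delta > 1e12:
            return np.inf                              # diverged
    return 0.5 * (delta @ delta)

d = 400
# A dimension-independent step size exceeds the ~2/d stability threshold...
print(sgd_final_risk(d, eta=0.5, steps=200))           # diverges (inf)
# ...while eta proportional to 1/d converges.
print(sgd_final_risk(d, eta=1.0 / d, steps=5 * d))     # small final risk
```

This does not settle the disagreement about realistic models, but it shows the scaling is a genuine feature of this toy problem class.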
#12) If this value is then a row of the resolvent is . This implies the resolvent is non-invertible. However, the resolvent is invertible by definition. Moreover, the row of the resolvent is bounded from below by .
Re: bullet #2 in your Appendix notes, we highlight that we do not assume that the risk is bounded. Although our proof method starts by introducing a stopping time that bounds the risk, we later show that, with overwhelming probability, this stopping time is never triggered. This may be found in Lemma 4 of Appendix A. At no time do we require the risk to be bounded in Theorems 1 and 2.
Please let us know if we can clarify anything further. Your detailed review was quite helpful, and we'd love to know if we were able to address some of your criticisms (particularly those involving comparisons to the weak-approximation framework).
Dear Authors,
I apologize for the delay. This is not the only paper I am reviewing to this depth. I am taking my role as a reviewer very seriously, and it took days to form an opinion when I first read this paper and now it took a bit to revise everything again.
Reply to the points concerning the Overall Comment:
-
As I mentioned above, I agree that using the SDEs of Malladi et al. to directly derive the SDE of signSGD is not feasible, and one might need to attempt deriving it "from scratch" (if it is even doable!). However, I would still advise caution before dismissing the WA framework. For example, while [1] studies the behavior of SGD with the SDE derived under WA, it does so in detail, providing all sorts of explicit convergence results. This demonstrates that while the SDE for Adam derived in Malladi et al. does not allow for studying such problems (or maybe it does, but they simply did not explore this as their focus was on deriving scaling laws), it does not mean that the WA framework is inherently inadequate for such tasks.
-
Thanks for commenting on WA and providing a fair comparison. However, the empirical validation of the assumptions is still missing:
- (i) You claim to have verified it for these datasets, but I do not see this in the paper. Did I miss it?
- (ii) I understand your point, thanks for the reference. Perhaps you could experimentally verify this on the datasets you use by randomly projecting the data points into spaces of increasing size.
- (iii) While I do not fully understand the figure you reference, it would be beneficial to discuss it in your paper, as it pertains to a key (and somewhat controversial) assumption. In general, I would recommend reverting to "classic assumptions" if possible, or conducting a thorough experimental investigation if you are proposing new ones.
-
I understand why you use this terminology, but I suggest adding a sentence like: "Since the SDE approximates the dynamics of the optimizer as per Theorem X, and since we have experimentally validated this in many contexts, we will often say that the optimizer has property P even if it is derived from its SDE." I apologize for not being clearer earlier.
-
I am sure you are working on providing experimental verification of your theorems, which is currently missing. I look forward to trying out your code.
-
Structuring the paper is, of course, a matter of taste. However, I believe that mathematical contributions emerge from formalism and clarity, which in this case means framing your results as propositions rather than leaving them in free text.
Reply to the points regarding Detailed Feedback:
-
Thank you for including the discussion on WA. However, as highlighted by other reviewers, a section introducing SDEs would be beneficial and is still missing, as I noted previously. The same applies to much of your theoretical tools, which may not be familiar to all readers (see, for example, point 3 regarding your appendix). Additionally, an intuitive graphical representation of what it means to use continuous-time models to approximate an algorithm is still missing and would greatly enhance the reader's understanding.
-
I note that such a section is still missing, which I believe is detrimental to the paper.
-
I see this has been addressed as you discussed WA.
-
This point has not been addressed in the paper and is quite crucial. I read your reply, where you state that your SDE allows for a detailed analysis of your setup, even if it is simplistic. As I mentioned earlier, the WA setting allows for a full exploration of general loss functions for SGD. One cannot rule out that this may also be possible for Adam (Malladi) or even for signSGD. Please do not misunderstand me: I am not criticizing your model choice itself.
-
Thanks.
-
Thanks.
-
While the SDE appears to have a Lipschitz drift, did you verify all the other assumptions required for the existence and uniqueness of solutions to SDEs? There seems to be no discussion about the diffusion term, its affine growth, and so on. Additionally, I believe that in overparameterized settings, there may be multiple minima even for linear regression. Correct me if I am wrong. Otherwise, please elaborate on this point in the paper.
-
I still find the wording misleading. It is the "sign" operator that changes the covariance of the noise in signSGD. The covariance matrix is a property of the data and is not affected by the optimizer.
-
Thanks.
-
See point 2 above.
-
I am not convinced that you need to use a learning rate scaled with . You reference a specific paper, but this is not a standard requirement to ensure the convergence of SGD or signSGD. For example, you can train a 128M-GPT2 model with signSGD (or even SGD) using a learning rate much larger than . You can verify this by trying out the GitHub repository for nanoGPT. Additionally, please refer to Eq. 11 in [3] for a specific "optimal" scheduler for signSGD. Do you have references for general results supporting this restriction? I am not aware of such requirements under smoothness or strong convexity for either signSGD or SGD.
-
Thanks; that was a silly question on my part.
-
Once again, the wording is unclear and misleading. The landscape exists independently of the optimizer. You could say that signSGD likely explores different regions of the landscape compared to SGD, Adam, etc.
-
I understand, but you could address this directly the first time you mention the SDE.
-
I apologize for any misunderstanding. If one removes the stochasticity from SGD (e.g., using a full batch), one should recover the gradient flow ODE of GD: . Similarly, for signGD, the ODE is well-known and well-studied (see Eq. 5 in [3]). If removing the stochasticity from signSGD does not lead to this ODE, I would be concerned. Note that deriving the ODE of signGD does not require WA.
-
I apologize, but to me, that is a gray band in the figure. Could you consider using different shading textures or line styles? This would also improve accessibility for other color-blind readers. The same applies to Figure 2.
-
It seems the updated image is the same as the old one, with only the x-ticks changed. Am I missing something?
-
Ok.
-
Thank you, but why should I pick one value of over another? Should I not select the that maximizes the probability? It seems unnecessary to specify a constant for each ; you could simply state, "there exists a constant," and elaborate on this in the appendix.
-
Did you ignore this point? While you added the SGD discussion, the main issue remains: A mathematically oriented paper should formalize results as propositions, lemmas, theorems, and so on.
-
Why not use a symbol like ?
-
See the point about Theorem 1.
-
You fixed the reference but did not address the remainder of this point, which is significant and connected to point 15 above.
-
Typically, if the optimizer has converged, the gradient is zero: how can you reconcile this with the boundedness of your scheduler? Importantly, I see no comment on the relationship between this scheduler and Normalized SGD.
-
Thanks.
-
As discussed above, the wording remains misleading, at least until your experiments confirm your results.
-
Thanks. While this is acceptable, I would still recommend defining it earlier.
-
The risk scales with the variance of , as I understand it. If you multiply the learning rate by , you effectively undo a form of normalization. If the noise lacks bounded variance, could become infinite or very large, potentially leading to divergence. Why would you undo the property of signSGD that guards against noise? This seems like a key property to retain.
-
Alright.
-
Great—looking forward to the experiments and the code!
-
I apologize for the misunderstanding. However, this scheduler is not viable because cannot be computed for large datasets. Moreover, undoing the normalization seems counterproductive, as discussed in point 28. If you believe I am wrong, you could simulate this in your setup (which I encourage) and on a GPT2 model to test whether your results generalize to more relevant settings.
-
(i) I had not heard of this before—nice to know. Please specify this clearly in the paper.
(ii) True, thanks. Perhaps add a line in the appendix.
-
I disagree: Let us make an example. Let us take to be defined as a random variable whose density is the average between the density of a normal distribution and a normal distribution . has expected value and variance . Therefore, . If we now send to infinity, the variance of goes to infinity but goes to 0, rather than to as you claim in line 336. Similarly, if we write down Eq. 19 for the example above (btw you have a typo: you need to send to 0, not to ), we have that , which is actually small if (or the kurtosis if we want to use your loose definition) goes to . Therefore, formalizing results can help prevent these unpleasant situations. I verified my calculations experimentally, but maybe you can check yourself and come back with feedback.
-
See the discussion above.
-
Ok.
-
Great—looking forward to the experiments and code!
-
True, silly question on my part.
-
I believe you ignored this point. It is quite important.
-
Similarly, I believe you ignored this point. It relates to other points about clarity and formalism.
Reply to the points concerning the Minor Comments:
I assume you decided to ignore these? Many are quite straightforward to address and would help align the paper with a more formal and professional tone, addressing issues of low formalism and colloquialism.
Reply to the points concerning the Appendix:
-
While you acknowledged some points, the rest of the comment regarding how to structure the paper was ignored, which hinders the readability of the proofs.
-
You explicitly write "to work under the setting that the risk is bounded" in line 847 of the revised version. Am I missing something here?
It seems the other points in this section were also ignored.
References:
[1] Maulen-Soto, Rodrigo, Jalal Fadili, and Hedy Attouch. "An SDE perspective on stochastic convex optimization." arXiv preprint arXiv:2207.02750 (2022).
[2] Balles, Lukas, Fabian Pedregosa, and Nicolas Le Roux. "The geometry of sign gradient descent." arXiv preprint arXiv:2002.08056 (2020).
[3] Ma, Chao, Lei Wu, and E. Weinan. "A qualitative study of the dynamic behavior for adaptive gradient algorithms." Mathematical and Scientific Machine Learning. PMLR (2022).
Conclusion:
I feel like the paper has marginally improved, and much work remains to be done. Overall:
-
The soundness of the approach is promising and definitely present. However, as I highlighted, it is difficult to follow the proofs (and thus verify the theory!) because the appendix is not well-organized, not all quantities are properly introduced, and so on. Similarly, the main paper could benefit from restructuring the main contributions into lemmas, propositions, and so on. Since the experiments to validate the theory are missing, the experimental aspect feels somewhat lacking.
-
The presentation, as discussed, is quite poor and has not improved. I acknowledge that this is a matter of taste to some extent.
-
The contribution is fair: The theoretical framework appears powerful, albeit based on some "unorthodox" assumptions that lack experimental validation. Some results are sometimes puzzling and remain unvalidated experimentally (for now, as I understand).
Therefore, I am currently confirming my original score.
31-This discussion is all in the case of finite variance, where there is an interesting tradeoff and where there is a point to comparing to SGD (the whole of section 4 is about comparison).
Saying more about SignSGD in the infinite variance case is a reasonable goal for future research.
33-Thanks for pointing out the typo about . Indeed the comment was made with unimodal distributions in mind, your example is a clear counterexample to the statement about kurtosis. In fact, we think (19) is already relatively clear in terms of what the tradeoffs are regarding when it is large or small, so we removed the reference to kurtosis.
38-Sorry we did not reply directly to this point. We assume you refer to the role of conditioning in choosing which optimizer to use: after selecting learning rates correctly, signSGD gains from the diagonal preconditioning when the averaged condition number of decreases from the averaged condition number of . We added an explicit sentence to this effect in the text.
39-As above, sorry we did not reply directly to this point. We have discussed in the paper the difficulty of the analysis of for general covariances . Aside from the trivial case where is diagonal, we do not have a compelling non-diagonal example which would serve as a starting point for further analysis.
Minor Comments:
We appreciate the feedback regarding the formatting and have indeed made many changes as a result. We have already addressed most of the minor comments in the revision. With regards to some of the formatting and language suggestions that we did not implement, we respectfully disagree with the suggestions and have chosen to keep the current version.
Appendix:
Thank you for the feedback! Regarding your previous comments:
1-We have reformatted this section of the appendix to be more readable. Our intent was to provide motivation for the upcoming lemmas. We have also added an overview prior to the appendix to summarize the main content of each section.
2-We note that we do not specify a bound on the risk, hence this may be taken to be as large as one likes. Since the risk cannot blow up to infinity in finite time, if one chooses an appropriate upper bound, with some probability the stopping time will not be observed within a given time horizon. We show in Lemma 8 for signHSGD (see also Line 1385 in the updated manuscript) that one may choose a bound so that, with overwhelming probability, the bound is never exceeded within the given time horizon. The concentration result of Theorem 1 between signHSGD and signSGD allows us to extend this bound to signSGD. This effectively removes the bounded-risk assumption. This can be found in the original manuscript and has been clarified in red in the updated manuscript.
3-See 2) in detailed.
4-We believe it is more suitable to move the technical lemmas to a later section of the appendix, as their proofs are largely self-contained and do not directly relate to the core ideas of the main theorem. Given the technical nature of our main result, we consider it more important to present the key part of the appendix—the main proof—at its forefront.
5-This is defined in equation 51 in the original manuscript and equation 56 in the updated manuscript.
6-This is in fact interchangeable. Provided that both signSGD and signHSGD have the same starting point, they are defined independently of each other, i.e. one does not need to run signSGD to derive signHSGD.
7-Thank you for the catch! In the original manuscript, this was defined in Line 1519 in appendix A.3 after having been moved to a later point in the appendix. It is now defined near the beginning of appendix A.2.
8-Good catch! This has been fixed.
9-In the original manuscript, was defined at the start of the appendix A.1 in Line 661 and was defined in Line 742.
Please let us know if there is anything else you would like us to clarify. We appreciate the time you've spent reviewing our paper.
Dear Authors,
Thank you very much for the detailed reply and I apologize for the delay, but I had planned holidays for the past week: I was back in office this morning and I did what I could.
This is the last time I go through your document or code.
Major points:
- This is the first time that you mention that what you want to do is to analyze a sequence of problems of increasing dimension: this is not made explicit anywhere in the paper. Maybe it is better to clarify this aspect. I thought that the focus was the "high-dimensional limit" (you literally wrote "The main goal of this paper is to derive the high-dimensional limit of SignSGD"), with your approximations holding better and better as the dimension grows. Interestingly, one could then analogously argue that WA is not in the infinitesimal limit, but is also dealing with a sequence of problems with decreasing learning rate. So, I guess there are some parallelisms, and it is still unclear to me which framework is best for which setup. Correct me if I am wrong:
- WA: 1) has some restrictive (mainly due to technicalities from numerical analysis) assumptions, 2) seems to be tailored for small learning rates, 3) does not appear to have any assumptions on the dimension, 4) and can handle quite general nonconvex functions. On the one hand you say there is no reason why WA should be able to handle the high-dimensional setup. On the other, extensive experimental evidence (models with tens of millions of parameters and non-infinitesimal learning rates) from Malladi et al. seems to show that this is the case and that the theory works also outside of the mentioned assumptions;
- Your framework is 1) based on some restrictive (but fundamental) assumptions which are poorly justified, 2) tailored for high-dimensional cases, 3) uses a somewhat controversial scaling of the learning rate, 4) and is restricted to quadratic functions. On the one hand, it seems that it should handle the high-dimensional setting much better. On the other, it has only been (partially) verified on somewhat low-dimensional setups;
- (i) To clarify: You actually never checked your assumption? I am sorry, I misunderstood your statement, "It has however held for the two settings (CIFAR10 and wikitext2) that we have checked," interpreting it as confirmation of the assumptions on these datasets. In my opinion, verifying predictions to justify assumptions is a problematic approach, as "from true assumptions follow true statements," but "from false assumptions, true statements can also occasionally result".
- (ii) My concerns remain as previously stated.
- (iii) Currently, your justification is based on your own claims, references to unrelated images, and lacks any form of validation.
To be fair, not all assumptions have to hold in general: SGD converges under smoothness (and I believe this is the weakest assumption we can take but I might be wrong!). But is the loss of the DNNs smooth? I do not think so! But this is the best (again, I might be wrong!) that we can do. Fortunately, smoothness holds in very relevant cases that already shed a lot of light on the behavior of SGD. I am not here to criticize the assumptions, but you are not making many efforts to back them up.
Other points:
- I find no reference in the literature supporting the need for this rescaling. If it were necessary, it would appear in a general proof of SGD convergence for the strongly convex case: it does not. I experimented with this and found that it does not diverge; instead, it converges to the (very large) asymptotic risk value suggested by your theory.
- Excellent.
- (i) Do these schedulers somehow ruin the Lipschitzness of the coefficients of the SDE? Especially when the risk goes to 0? Anyway, these schedulers are not available for deployment because one cannot calculate the risk;
- (ii) What I am worried about is that you seem to claim that if one uses SGD with the scheduler in Eq. 14, it would behave like signSGD to some extent. But that normalization is exactly the one that defines Normalized SGD. How can that be?
- Do you now see my point? The lack of formalization allowed me to spot a counterexample. Anyway, if we send d to infinity, the variance of the variable I proposed also goes to , while goes to . If a single informal claim can be countered, what guarantees exist that no others are similarly vulnerable?
11-In our setting (and indeed, any linear regression setting with well-behaved covariance) the learning rate must be scaled as 1/d; we invite you to conduct numerical explorations in our code (for example in anisotropic_signHSGD.ipynb, change the line ‘lr=0.5’ to ‘lr = 0.5 * d’ and vary the ‘d’).
In neural network settings, the equivalent scaling dimension is typically the width of fully connected blocks --- not the number of parameters. For standard parameterizations, optimal (and often, largest stable) learning rates usually scale as 1/width. This is a well known phenomenon; see e.g. https://arxiv.org/pdf/2203.03466 figure 1, left panel. In your NanoGPT example, the embedding layers have dimension 768, implying a learning rate on the scale of 1e-3 - comparable to the supplied 6e-4.
The difference between the linear regression setting and the neural network setting reflects the fact that the important quantity is the effective dimension. In our problem and notation the effective dimension is . See https://arxiv.org/pdf/2406.11733 for a discussion (where it is called "intrinsic dimension"). In Appendix C.3 of that paper, this quantity is computed for ResNet18 on CIFAR10 and is found to be ~ 1e3.
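As a minimal self-contained illustration of this 1/d scaling (a toy isotropic least-squares problem of our own choosing, separate from the notebook mentioned above):

```python
import numpy as np

def signsgd_final_risk(d, lr, steps, seed=0):
    """One-sample signSGD on an isotropic least-squares problem; returns the final risk."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d) / np.sqrt(d)  # target with ||w_star|| ~ 1
    w = np.zeros(d)
    for _ in range(steps):
        x = rng.standard_normal(d)
        grad = (x @ (w - w_star)) * x             # stochastic gradient of 0.5*(x.w - x.w_star)^2
        w -= lr * np.sign(grad)                   # signSGD step
    return 0.5 * np.sum((w - w_star) ** 2)        # population risk (identity covariance)

d = 400
risk_fixed = signsgd_final_risk(d, lr=0.5, steps=4000)       # dimension-independent step size
risk_scaled = signsgd_final_risk(d, lr=0.5 / d, steps=4000)  # 1/d-scaled step size
```

With the fixed step size, the risk stabilizes at a very large value (as you observed in your own experiment above), while the 1/d-scaled step size drives the risk well below its initial value of about 0.5.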
13-We’ve updated the wording. When you apply the sign function to the gradient, you are no longer using the original least-squares geometry. So sure, the original landscape ‘exists’, but it is not the relevant geometry for understanding the algorithm. One point of our analysis is to allow a comparison between signSGD and a Langevin-type diffusion (signHSGD) whose gradient term is interpretable. In that sense the relevant optimization landscape has as a Hessian, and gradients are flattened by the sign function.
15-One component of the noise is inherent in the setup; our stochastic gradients are always noisy, as they come from a statistical setup. You can get an ODE by sending stepsize to 0, but it need not be the same one as when you have access to the exact gradient.
If you do a batch version of our setup, taking the batch average pre-sign, it can converge to the SDE of [3], because the stochastic gradients become effectively exact (pre-sign) for sufficiently large batches.
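A small sketch of this batch effect (our own toy least-squares setup, not an experiment from the paper): the fraction of coordinates on which the sign of the batch-averaged stochastic gradient agrees with the sign of the exact gradient grows with the batch size B.

```python
import numpy as np

def sign_agreement(d, B, trials=200, seed=0):
    """Fraction of coordinates where sign(batch-averaged stochastic gradient)
    matches sign(exact gradient), on a least-squares problem with identity covariance."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(d)        # current error vector w - w_star
    exact = r                         # exact gradient is E[(x.r)x] = r
    agree = 0.0
    for _ in range(trials):
        X = rng.standard_normal((B, d))
        g = X.T @ (X @ r) / B         # batch-averaged stochastic gradient
        agree += np.mean(np.sign(g) == np.sign(exact))
    return agree / trials

d = 50
a1, a64 = sign_agreement(d, B=1), sign_agreement(d, B=64)
# Agreement increases toward 1 as the batch grows.
```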
16-We have made changes to Figures 1, 2, and 3. Do they improve the visibility? Besides changing colors, we decreased the number of runs, to allow the confidence intervals of signSGD and signHSGD to fluctuate more (and hence not sit on top of each other).
17-This change was made in the last manuscript. The longer time horizon plots are in the appendix. Here we changed the axes to correspond to signSGD iterations.
19-Indeed, we could optimize for if the constant were known. But we don’t actually know the precise values of . See Lemma 12 and Corollary 2.
20-Theorem 2 is the Theorem? The SGD theorem is in Collins-Woodfin et al, which is cited. We’re not quite sure to which statements you are referring. We agree that all mathematical statements we are making should be stated formally (barring trivialities); so we have cleaned up the appendix, including putting more statements inside of formal lemma claims (see below)
21-Ok
22-Noted
23-See our response to 15 above. We chose this specific SGD-as-SDE formulation for comparison because it most closely matches our chosen setting for signSGD
24-As you write, this scheduler has problems as the risk goes to 0 (this is part of the discussion in Section 4.1, and is one of its undesirable features). For HSGD, it will actually cause the risk to not go to 0, even in the presence of 0 noise (as the risk gets small, the LR will go way above the descent threshold, causing the risk to increase). The observed dynamics of the risk will stabilize at a positive level. (In the isotropic case, using the ODEs in (25), one can easily verify this).
If one runs normalized SGD and performs an exponential moving average of the gradients, as in Cutkosky and Mehta ‘20, with bandwidth B, then in a limit of and with one should recover this scheduled HSGD as a high-dimensional equivalent.
26-In Figure 1 we do confirm that the dynamics of signSGD, signHSGD, and signODE match as predicted in Theorems 1 and 2. What more is missing?
28-In the -risk you have already taken an expectation over the noise, so either the variance is finite (like we are assuming in this section) or it is infinite and this makes no sense. If you are attempting to infer the -risk, of course your point is good, but that's not what is being discussed here.
Dear reviewer, we thank you for taking the time to thoroughly look over our response.
We have completed experimental validation of Theorems 3 and 4. For validation of Theorem 3 please see Figure 6 of the newly revised manuscript. We have also updated Figure 1 to include the limit risk value of Theorem 3 when the noise is Gaussian as well as the CIFAR10 dataset. Theorem 4 is validated in Figures 7 and 8. The code has been updated to include these new experiments.
We now respond to your overall comments:
- In our high-dimensional framework, we are dealing with a sequence of problems, where the complexity increases as the dimension grows. If we were to fix the dimension, in some sense, our work amounts to performing a kind of weak approximation. So there are strong parallels.
However, there are aspects of our framework which are also not present in a fixed-dimension setup. In particular, the risk curves concentrate around ODEs, which encode the effect of the randomness. This is not present in weak approximation, and it is a key result of this paper. We also hope this illustrates that there is some theory that we have to build from the ground-up, as this is nowhere in [1] or [2].
So to summarize: we are not dismissing Weak Approximation. But it’s also not a tool which is built for what we’re doing, which is to understand sequences of problems of growing dimension. Nor do the key results of our theory appear in weak approximation (risk curve concentration). They’re just different; they have different conclusions, different assumptions, and even different proof methods.
- i) We don’t claim to check our assumptions on datasets. We have checked the predictions – that the ODEs match the behavior of the optimizer – on real datasets (See Figure 1). Our Assumptions, especially 3 and 5, are designed with random problems in mind – it’s not even clear what they would mean for a single dataset.
ii) The suggestion of forming random projections is smart. We would be optimistic.
iii) The main goal of this paper is to derive the high-dimensional limit of SignSGD. This is not a weak approximation, and these assumptions are important and reasonable from the point of view of matrices in high dimensions.
Your point about checking assumptions is reasonable, and assumptions should be challenged. There are lots of directions to take this work. One would be to find non-asymptotic assumptions (e.g. in terms of ‘intrinsic dimension’ – see the discussion below), which could be applied to fixed problems. We think your request makes sense for a paper attempting to make those non-asymptotic comparisons, instead of trying to check these assumptions (which are not engineered for that task) on general empirical problems.
Detailed feedback:
1,2-These are of course good things to have in the paper, but there are space considerations. Discussion in the appendix will be ignored by everyone except a small fraction of the audience. We would need to sacrifice content of the paper to do this, and these do not rise to the level of concern that would justify that sacrifice. So in our judgment, it is simply inappropriate to add these.
The most significant limitations of this paper are built into this setup (e.g. linear regression), and we are certainly not hiding these. How does the theorem really change outside of linear regression? We don’t have anything smart to say, and it’s certainly a valuable direction for future research. Again, we don’t think this type of discussion adds enough to the paper to warrant cutting something else.
Also, the discussion we’re having is preserved on openreview. Many researchers benefit from openreview by going to the discussion after making an effort at reading the paper, and so there’s value in already having them here.
4-This relates back to point 1) in the main points above.
7-The diffusion term for signHSGD is trivially Lipschitz (for SDEs the relevant Lipschitzness is in the spatial variable), and this diffusion term is constant in space. Theorem 5.2.5 of Karatzas and Shreve applies directly. It applies as well to HSGD, although we are not proving anything about it.
Linear regression can have multiple minimizers in overparameterized settings, you are correct. Under assumption 2, the coefficient is Lipschitz regardless of the existence of multiple minimizers. Without assumption 2, you have to stop the SDE when you reach a small risk (as we do in Theorem 6).
8-This was changed in the last manuscript, text currently reads: "As we will see, an important characterizing feature of signSGD is its effect on the covariance of the signed stochastic gradients." Please elaborate.
... (continued)
References
[1] -Li et al. (2019) "Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations".
[2] - Malladi et al. (2022) "On the SDEs and Scaling Rules for Adaptive Gradient Algorithms".
- I see: Unfortunately, the information you mention is typically unavailable, since and similar matrices are unknown in practice. Are there perhaps practical insights that could be drawn? For example, can you recommend when a practitioner should choose signSGD over SGD based on observable behaviors in deep learning?
- To clarify, there is no formal statement or experiment backing this claim?
Regarding the appendix: It has improved significantly. While there is room for further refinement, the major rework is promising.
Regarding experiments: Nice job verifying Theorem 3. However, Theorem 4 presents a bound, and your verification focuses solely on , which is essentially verifying Theorem 3 again. Shouldn't you have plotted the curve described by Eq. 29 as a function of ?
Unfortunately, many other insights are not verified. For example, the effects of the schedulers are not verified --- Is SGD with the scheduler in Eq. 14 actually behaving like signSGD to some extent? The list goes on and it would be easier to pinpoint them all if your insights were formally conjectured in formal statements.
I am not asking for more experiments: I am simply acknowledging that there are some formal results with some degree of verification, some others without, and quite some informal insights without any experiments to support them.
To Conclude
I acknowledge the major changes and triple revision made to the paper: These include images, additional literature review, rewording of unclear paragraphs, and experiments. On top of this, great progress has been made on the appendix which is now significantly different (and better) than the original submission.
However, the following concerns persist:
- Assumptions: These appear non-standard in the literature, with insufficient justification for their validity.
- Formality vs. Intuition: Many insights remain qualitative or intuitive, leaving them open to counterexamples. For a theoretical paper, formal rigor should take precedence.
- Experimental Validation: Some results and insights lack experimental support. While I understand it may now be too late to address this, it is a notable limitation.
Post Scriptum
I want to highlight that I enjoyed this back-and-forth. I like what you are doing in this paper and believe it can shine. I would have not invested so many hours in this review process if I did not value this paper and your dedication.
If my feedback has come across as overly critical or adversarial at times, please know that my intention has always been to be the kind of reviewer I would have appreciated having: someone who thoroughly reads and evaluates the work, thoughtfully interacts with it, and strives to help the paper evolve into the best possible version for both the authors and the broader community.
I wish you every success moving forward, and I will now discuss this paper with the other reviewers and the Area Chair before finalizing my recommendation. Thank you again for your dedication and hard work!
This paper studies SignSGD to obtain a better qualitative understanding of the algorithm, especially in the high-dimensional limit. The authors derive an SDE and an ODE as continuous approximations of the limiting dynamics of SignSGD. Using this framework they quantify four effects of SignSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Finally, they conjecture how this analysis might extend to Adam.
Strengths
(1) The presentation is very clear;
(2) The results present a good quantitative understanding of SignSGD;
(3) A comparison with vanilla SGD is provided;
Weaknesses
Overall, the paper looks good.
Questions
Using continuous approximation may explain features of SignSGD but only at the qualitative level. At the quantitative level, long-time behavior of SignSGD and its continuous approximation may be very different. Can the authors explain more about this situation? This may include:
(1) Discuss any theoretical or empirical evidence the authors have for how well the continuous approximation matches discrete SignSGD over long time horizons.
(2) Clarify if there are specific regimes or conditions where the authors expect the approximation to break down for long-time behavior.
(3) Consider adding a discussion of the limitations of the continuous approximation approach, particularly for long-time dynamics, to their paper.
A suggested reference paper: Kushner, H. J. (1982). A cautionary note on the use of singular perturbation methods for “small noise” models. Stochastics: An International Journal of Probability and Stochastic Processes, 6(2):117–120.
Details of Ethics Concerns
N/A
Dear Reviewer, Thanks for your comments; we address your concerns with the approximations here.
Discuss any theoretical or empirical evidence the authors have for how well the continuous approximation matches discrete SignSGD over long time horizons.
The constant in Theorem 1 controls the timescale of good approximation. The key point here is that is independent of dimension. The rescaled time in the theorem is , where is the number of iterates; therefore the approximation is good for iterates. This is a key benefit of our theoretical analysis; by using the high dimensional limit, we have a bound that does NOT require a small learning rate (such as in the provided reference [1], see also [2] and [3]). We provide a more detailed comparison with approaches [2] and [3], and how our approach avoids certain limitations of those works, in Appendix F of our revision. We believe this is highly related to the issues raised in your reference [1].
Figure 1 in our original submission shows the correspondence between our theory and the true stochastic optimizer over a time period of 5000 iterates; the original figure was labeled in rescaled time (iterates/d). We have corrected the figure, and have added a plot in the appendix which extends this by another factor of 10 with good agreement (Figure 5), following your suggestion.
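To illustrate the rescaled-time picture concretely, here is a minimal sketch (our own toy quadratic, not the experiment behind Figure 1): with step size proportional to 1/d, running signSGD for n = t·d iterates yields risk values that roughly agree across dimensions at the same rescaled time t, reflecting the concentration around a d-independent limit.

```python
import numpy as np

def signsgd_risk_at_time(d, t, c=0.5, seed=0):
    """signSGD on isotropic least squares with step size c/d, run for n = t*d iterates;
    returns the risk at rescaled time t."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d) / np.sqrt(d)
    w = np.zeros(d)
    for _ in range(int(t * d)):
        x = rng.standard_normal(d)
        grad = (x @ (w - w_star)) * x   # stochastic gradient
        w -= (c / d) * np.sign(grad)    # 1/d-scaled signSGD step
    return 0.5 * np.sum((w - w_star) ** 2)

# Same rescaled time t = 2, two different dimensions:
risks = [signsgd_risk_at_time(d, t=2.0, seed=s) for d, s in [(200, 0), (800, 1)]]
```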
Clarify if there are specific regimes or conditions where the authors expect the approximation to break down for long-time behavior.
As stated by the theorem and validated by our experiments, we expect that for high dimensions the approximations are good for timescales long compared to the overall relaxation time of the dynamics. One case that might require a very high dimension for good approximation would be power-law features and covariance structure; this is studied in [4] for the SGD case.
Consider adding a discussion of the limitations of the continuous approximation approach, particularly for long-time dynamics, to their paper.
We have added some more discussion about other continuous approximations. Regarding very-long-time distributional behavior, such as convergence to the invariant distribution -- we can provide some information, such as Theorem 3, which describes the limit risk for constant stepsize. It's an interesting question if the invariant distribution of signHSGD meaningfully relates to the invariant distribution of signSGD, beyond this, but we don't have much to say.
[1] Kushner, H. J. (1982). A cautionary note on the use of singular perturbation methods for “small noise” models. Stochastics: An International Journal of Probability and Stochastic Processes, 6(2):117–120.
[2] Li et al., Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations: Journal of Machine Learning Research, 2019
[3] Malladi et al., On the SDEs and Scaling Rules for Adaptive Gradient Algorithms: NeurIPS 2022.
[4] Paquette et al., 4+3 Phases of Compute-Optimal Neural Scaling Laws. arXiv preprint arXiv:2405.15074 (2024).
Thank you very much. I keep the score.
We have uploaded a revised version of the manuscript. All changes made are tagged in red. We’re logging here a summary of our changes to the manuscript. We are thankful to all reviewers and especially to reviewer NSda for the detailed feedback, which we have used to improve our manuscript.
Regarding Weak Approximation (in response to Reviewer NSda): We added a section comparing the weak approximation setup (Malladi et al.) to our setup (Appendix F), trying to make as close a comparison as possible (Eq. 233). Even after doing so, we observe that there are major differences in the resulting SDEs, and Malladi et al. does not reproduce our result.
- Moreover, the Malladi et al. SDE depends only on the first and second moments of the noisy gradient, while our SDE depends on higher moments. In fact, even in a Weak Approximation framework, one cannot use only the first and second moments of the noisy gradient to match the behavior of SignSGD, because the moment structure of the sign-of-gradient vector cannot be expressed in terms of the first and second moments of the underlying gradient. Unfortunately, the most likely conclusion is that Malladi et al. cannot describe SignSGD, even in the sense of Weak Approximation.
- The technical sense of the comparison is not equivalent: Weak Approximation is a small-stepsize approximation statement with inexplicit dimension dependence, while our setup is a large-stepsize approximation with explicit dimension dependence. There is no mathematical reason to expect the Weak Approximation SDE to correctly reproduce the high-dimensional limit. This could also explain why the Malladi et al. SDE does not match.
- We think the added discussion (Appendix F) relating Malladi et al. to our SDE is overall a net positive for the paper in that it connects to related work. However, the conclusion we draw is that the theory of Malladi et al. says nothing about our problem setup, and so we simply cannot do a deeper investigation of, for example, how our conclusions relate to theirs.
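The claim that first and second moments do not pin down the sign's behavior can be checked with a one-dimensional example (our own construction): two noise distributions with identical mean and variance but different E[sign(g)].

```python
import math

# Distribution A: Gaussian with mean 0.5 and variance 1.
mu, sigma = 0.5, 1.0
e_sign_gauss = math.erf(mu / (sigma * math.sqrt(2)))  # E[sign(g)] = 2*Phi(mu/sigma) - 1

# Distribution B: two-point distribution with the SAME mean and variance:
# value 1.0 with probability 0.8, value -1.5 with probability 0.2.
p, a, b = 0.8, 1.0, -1.5
mean_b = p * a + (1 - p) * b                    # = 0.5, matches the Gaussian mean
var_b = p * a**2 + (1 - p) * b**2 - mean_b**2   # = 1.0, matches the Gaussian variance
e_sign_two_point = p * 1 + (1 - p) * (-1)       # = 0.6

# Same first two moments, yet E[sign(g)] differs: ~0.38 (Gaussian) vs 0.6 (two-point).
```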
Other changes: We have integrated many changes which were prompted by all three reviewers. Thank you for all the suggestions for improvement.
- We added further discussion of the assumptions (2 and 3).
- We fixed the labeling of the axes in Figure 1.
- We added a section overviewing the contents of the Appendix.
- We formulated two theorems in the appendix for our results with general quadratics. (These were already proven -- the point was to simply make clear the result of the proofs). We also added a formulation of our main result without assumption 2.
- We added a new Figure in the Appendix, showing the concentration effect over increasing d (Figure 5).
- We added a section comparing Weak Approximation to our setting.

Besides these, there were numerous small changes, marked in red.
Planned changes: We are looking into adding additional simulations related to Theorems 3 and 4, as well as comparisons in Theorem 3 across various axes (SignSGD vs SGD, different noises), as requested by Reviewer NSda.
Thank you very much for the rebuttal. You can find the detailed reply below the thread of our interactions.
- I took a closer look at the paper by Malladi, and indeed, deriving the SDE of signSGD from that of Adam and/or RMSprop is not feasible. One would need to derive the SDE for signSGD "from scratch" to enable a clear comparison, which would be an interesting challenge to undertake (if it is even possible to derive an SDE for signSGD under WA, which is not immediately obvious).
- This is a cumbersome claim. In my opinion, the fact that WA does not require any constraints on the dimensionality inherently makes this a non-issue. I would like to highlight that [1,2] perform experiments to validate their SDEs on PreResNet32, VGG19, and ResNet-50, which have up to 138M parameters, using learning rates that are not infinitesimal but rather standard, such as . Therefore, I would advise caution when making these comments and suggest refraining from dismissing a theoretical framework that has been successfully employed and experimentally validated in various setups for years. This is particularly important considering that in your experiments, d ranges from to , which can easily be classified as low-dimensional relative to the DL models mentioned above.
- I agree that the comparison is difficult to perform, and demonstrating this adds value to your paper. It does not detract from WA but rather shows that deriving the SDE for signSGD directly from Adam is not the appropriate approach.
Finally, I look forward to seeing your new experiments to substantiate your claims, as these are still missing.
I would like to add an additional point that I initially overlooked: I apologize for this oversight, but it is an important issue. The experimental details are quite poorly described. It would be best to specify all the hyperparameters, design choices, and related details. While one could read your code to find this information, it is not standard practice, as your code may become unavailable in the future.
[1] Li, Zhiyuan, Sadhika Malladi, and Sanjeev Arora. "On the validity of modeling SGD with stochastic differential equations (SDEs)." Advances in Neural Information Processing Systems 34 (2021): 12712-12725.
[2] Malladi, Sadhika, et al. "On the SDEs and scaling rules for adaptive gradient algorithms." Advances in Neural Information Processing Systems 35 (2022): 7697-7711.
Dear reviewers, AC,
We have uploaded a second revision.
There are many minor updates, which are marked in red, which we will not summarize here.
Besides these, we made the following more substantial changes
- We added an Appendix D with additional supporting experiments (3 sets), designed to add numerical verification to Theorems 1/2, Theorem 3, and Theorem 4 respectively. We have also updated the anonymous github linked in the paper with the code to generate these figures.
- We made a pass through Appendix A and B to improve the presentation of the proofs, putting all statements inside of formal lemmas, and adding additional details where we thought they were lacking.
- We added details to Appendix H on how the experiments were conducted.
- We updated the figures in the main text, with the intention of making them clearer to colorblind readers and improving the content of Figure 1.
We have updated the manuscript to include:
- Experimental validations for Theorem 3, which may be found in Figures 1 and 6.
- Experimental validations for Theorem 4, which may be found in Figures 7 and 8.
- An update to Figure 1 for improved clarity between the confidence intervals of signSGD and signHSGD.
- A reformatted Appendix for better readability.
The code for the new experiments can be found in the anonymous github link.
This paper presents an analysis of SIGNSGD in the high-dimensional limit, deriving a limiting stochastic differential equation (SDE) and ordinary differential equation (ODE) to characterize the risk. Within this framework, four key effects of SIGNSGD are quantified: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping.
Overall, the paper is well-written and represents an effort to understand the dynamics of SIGNSGD, albeit in relatively simple settings. This line of research, being in its infancy, is inherently interesting and holds promise for future developments. While early-stage contributions to emerging research directions often come with stricter assumptions or a less comprehensive treatment, a certain level of clarity, practicality, contextualization, intuition, coupled with reasonable assumptions is still essential.
During the rebuttal process, the paper underwent numerous changes based on reviewers' feedback, some of which are significant and may require additional time for thorough evaluation. However, the limited scope of the rebuttal process restricts the ability to fully assess these revisions. Moreover, as a primarily theoretical paper, its results rely on several assumptions that may be considered unintuitive. Greater effort to justify these assumptions, particularly through experimental validation, would have strengthened the work. Additionally, while the focus of the paper is theoretical, the results are derived using very simple models. Extending the insights to more complex, non-trivial models, at least experimentally, would provide greater value to the machine learning community and enhance the paper's broader impact.
Additional Comments on Reviewer Discussion
During the rebuttal process, several serious concerns and questions emerged regarding the unintuitive nature of the assumptions, the lack of discussion surrounding these assumptions and the limitations of the work, as well as its relation and contributions relative to prior studies and related frameworks. Additionally, questions were raised about the merits of the results beyond the simple models considered in the paper. While the authors, in good faith, made efforts to address many of these concerns, there remain significant issues that may require further adjustments and refinements to the paper.
Reject