PaperHub
7.8 / 10
Poster · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
ICML 2025

Exact risk curves of signSGD in High-Dimensions: quantifying preconditioning and noise-compression effects

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-08-10
TL;DR

We derive SDEs and ODEs for SignSGD in high-dimensions: identifying scheduling, preconditioning, noise compression effects.

Abstract

Keywords

signSGD · stochastic optimization · deep learning theory · high-dimensional probability · stochastic differential equation

Reviews and Discussion

Review
Rating: 4

The paper studies the precise risk curves of signSGD in the high-dimensional limit for quadratic loss with Gaussian data, under certain assumptions on the label noise. It contrasts the risk curves with those of SGD and quantifies the differences in terms of four effects: effective learning rate, noise compression, preconditioning, and gradient noise reshaping. The exact risk curves are numerically verified on various examples.

Questions for Authors

  1. Can the authors explain the reasoning behind why they expect $K_\sigma$ to have a power-law spectrum when $K$ has a power-law spectrum?

Claims and Evidence

The claims in the work are theoretical and proofs are provided in the Appendix.

Methods and Evaluation Criteria

N/A

Theoretical Claims

I had a brief look at Appendix A.1, B and C, which seem mostly correct to me.

Experimental Design and Analysis

N/A

Supplementary Material

I went through Appendix A.1, B and C.

Relation to Prior Work

The work is well placed among the recent works related to analyzing preconditioned gradient algorithms.

Missing Important References

N/A

Other Strengths and Weaknesses

The main strength of the paper is the rigorous extension of HSGD to signSGD and then using the result for the clear discussion of the differences between the risk curves of the two.

Other Comments or Suggestions

N/A

Author Response

Thank you for the review; we’re glad that you appreciate our contributions. Our expectation that $K_\sigma$ inherits a power-law spectrum from $K$ is mostly speculative. We should qualify that this claim is basis dependent: if $K$ is diagonal then $K_\sigma$ is the identity, which collapses any power-law structure. So this is more about 'very non-diagonal' matrices.

This is probably best defined as having eigenvectors which are very spread out over the space, such as in CIFAR-10 or in an i.i.d. random matrix. In the case of CIFAR-10, the map appears to preserve the power-law structure but change the power. We've included an additional plot to illustrate this effect: https://anonymous.4open.science/r/revision-2025-stuff-F275/cifar_eigenvalues.pdf

We don't know of an existing theorem demonstrating this effect.
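For concreteness, here is a minimal numerical sketch of the kind of check we have in mind (ours, not code from the paper), assuming the reshaping $(K_\sigma)_{ij} = \tfrac{2}{\pi}\arcsin\!\big(K_{ij}/\sqrt{K_{ii}K_{jj}}\big)$, which matches the diagonal-$K$-to-identity behaviour described above but whose normalization may differ from the definition in the paper:

```python
# Hypothetical illustration: compare the spectrum of a power-law covariance K
# with that of the reshaped matrix K_sigma, where K has a Haar-random
# ("very non-diagonal") eigenbasis.
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 500, 1.5

# Power-law eigenvalues in a random orthogonal eigenbasis.
eigvals = np.arange(1, d + 1, dtype=float) ** (-alpha)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
K = (Q * eigvals) @ Q.T

# Assumed sign/arcsine reshaping of the covariance.
corr = K / np.sqrt(np.outer(np.diag(K), np.diag(K)))
K_sigma = (2.0 / np.pi) * np.arcsin(np.clip(corr, -1.0, 1.0))

spec_K = np.sort(np.linalg.eigvalsh(K))[::-1]
spec_Ks = np.sort(np.linalg.eigvalsh(K_sigma))[::-1]

# Crude log-log slope over the bulk of the spectrum: a power law shows up as a
# straight line, so the fitted slope estimates the exponent.
idx = np.arange(10, d // 2)
slope_K = np.polyfit(np.log(idx), np.log(spec_K[idx]), 1)[0]
slope_Ks = np.polyfit(np.log(idx), np.log(spec_Ks[idx]), 1)[0]
print(f"fitted power of K:       {slope_K:.2f}")
print(f"fitted power of K_sigma: {slope_Ks:.2f}")
```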

We’ll revise this section to make ourselves clearer.

Review
Rating: 4

This paper studies SignSGD, which can be viewed as Adam without the moment accumulator, in the high-dimensional limit. The main goals of the paper are to quantitatively understand the observed preconditioning and "noise compression" effects of SignSGD in practice. Toward this end, a limiting SDE (SignHSGD) and ODE (SignODE) are derived for SignSGD on square-loss linear regression. Notably, this requires non-trivial analysis due to the discontinuous sign operation. The authors then compare SignSGD to SGD in various ways, demonstrating four key differences: 1. effective learning rate, 2. noise compression, 3. diagonal preconditioning, 4. gradient noise rescaling, each of which induces regimes where SignSGD may be favorable over SGD or vice versa. Most notably, SignSGD is expected to provide benefits for many non-Gaussian or heavy-tailed noise classes, supporting folk knowledge.

Questions for Authors

None beyond the above.

Claims and Evidence

The claims in this paper are well supported by rigorous theorem statements and proofs, and by supporting results in the appendix. Numerical experiments are presented to support the theoretical trends.

Methods and Evaluation Criteria

The numerical experiments are helpful for contextualizing the theory, and are documented in the appendix.

Theoretical Claims

I did not check all the proofs entirely. However, I checked the proof strategies of the main Theorems and they make sense to me. I also checked the comparisons between SGD and SignSGD; these are correct by my verification.

Experimental Design and Analysis

A sufficient description of experiment set-ups is contained in the main paper, with additional details contained in Appendix D and H.

Supplementary Material

I checked through most of the supplementary material to understand the proof structure of the major claims. I also checked through the additional section (Appendix E) regarding a partial extension to Adam.

Relation to Prior Work

This work belongs to the general category of literature concerned with capturing the practical behavior of optimization algorithms in deep learning settings. In particular, this paper takes the approach of deriving the corresponding continuous-time SDE/ODE of its algorithm of interest, SignSGD, and deriving some key properties therein. On the other hand, most prior works along this line study SGD. However, it is well-understood in practice that vanilla SGD experiences a host of problems in deep learning, for which various practical adjustments have been introduced. I think this work references the important prior work leading to this point, and provides some interesting tools and insights toward closing the "insight gap" between SGD and more practical methods like SignSGD and Adam. In particular, the results formalizing the unique interaction of SignSGD and noise are potentially a fruitful way of studying this type of adaptive gradient methods. Furthermore, I think the "case-by-case" results, such as when diagonal preconditioning hurts or helps (Section 4.3), are also potentially fruitful, as they may serve as a basis for understanding the relative benefit of other families of non-diagonal preconditioning methods in deep learning, such as KFAC and Shampoo.

Missing Important References

None as far as I can tell.

Other Strengths and Weaknesses

I think this paper is quite well-written and understandable. The main insights and arguments are likely of interest to the machine learning optimization community. I have described some of the strengths that stood out to me earlier. I have a few minor things to point out / ask:

  • The analysis in this paper is restricted to linear regression and MSE loss (which is acknowledged in the Discussion), and already seems quite involved. It would be helpful to highlight to what degree the broader class of noise considered here can expand the expressivity of the problem set-up. Additionally, can the "linearized" analysis here potentially extend to the NTK regime (despite the expressivity gap the NTK regime itself suffers)?

  • In the main theorems (Thm 1 and 2), the gap between the discrete descent method and the flow suffers an $\exp(T)$ factor from an application of Gronwall's inequality. This implies that for fixed dimension $d$, the drift blows up exponentially in time, which doesn't seem to be the case in practice. What would be required to make this bound less conservative, if possible?

Other Comments or Suggestions

Minor comments:

  • The authors should remember to put in the Impact Statement.
  • Line 304 first column: "very SignSGD" -> "SignSGD is very".

(Possibly errant) suggestions:

  • As previewed earlier, some other works have considered departing from diagonal preconditioning. The most relevant algorithm in this class is Shampoo (or, more recently still, Muon). The base curvature matrix of SignSGD and Shampoo is, I believe, the same (AdaGrad), but Shampoo does a layer-wise "Kronecker factorization" rather than the diagonal. I think it would be extremely interesting to see if this type of preconditioning has a similar beneficial interaction with noise, and whether it broadens the class of covariances where preconditioning is helpful.
Author Response

Thanks for your comments, we address some key questions here.

NTK Regime

If we consider the mean-square loss for a general neural network $f(\theta, x)$ in the NTK regime, so that $f(\theta, x) = f(\theta_0, x) + \nabla_{\theta} f(\theta_0, x)^\top (\theta - \theta_0)$, then the risk, in a student-teacher setup, is given by

$$\mathcal{R}(\theta_k) = \mathbb{E}\big[(f(\theta_k, x) - f(\theta_*, x))^2\big] = \mathbb{E}\big[\langle \nabla_{\theta} f(\theta_0, x),\, \theta_k - \theta_*\rangle^2\big] = (\theta_k - \theta_*)^\top\, \mathbb{E}[\Theta(x,x)]\, (\theta_k - \theta_*),$$

where $\Theta$ is the NTK. Therefore, in order to describe the NTK regime, we need to understand the expected signSGD updates where the "data" is given by $\nabla_{\theta} f(\theta_0, x)$. Right now, we must assume this is Gaussian: if the model is such that $\nabla_{\theta} f(\theta_0, x)$ is Gaussian, then our results hold as written. In the more likely case that $\nabla_{\theta} f(\theta_0, x)$ isn't Gaussian but is sufficiently regular (such as subgaussian), then, provided we can formulate our results for this larger class of data distributions, we can capture the behaviour of the NTK regime.

Quite possibly our equations, if not our method of proof, already hold for subgaussian data given the good agreement with the CIFAR-10 and IMDB datasets which are not themselves Gaussian.
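To illustrate the reduction above, here is a hypothetical Monte-Carlo sketch (ours, not the paper's code) checking that the linearized risk equals the quadratic form in the NTK second moment, with random tanh features standing in for $\nabla_\theta f(\theta_0, x)$ (which are then clearly non-Gaussian):

```python
# Hypothetical sketch: verify R(theta_k) = (theta_k - theta_*)^T E[Theta(x,x)] (theta_k - theta_*)
# empirically, using random tanh features as a stand-in for grad_theta f(theta_0, x).
import numpy as np

rng = np.random.default_rng(1)
d_in, p, n = 20, 50, 100_000

W = rng.standard_normal((p, d_in)) / np.sqrt(d_in)
theta_k = rng.standard_normal(p) / np.sqrt(p)
theta_star = rng.standard_normal(p) / np.sqrt(p)
diff = theta_k - theta_star

def features(X):
    # stand-in for grad_theta f(theta_0, x); clearly non-Gaussian "data"
    return np.tanh(X @ W.T)

# Monte-Carlo risk E[(<grad f, theta_k - theta_*>)^2] on fresh samples
risk_mc = np.mean((features(rng.standard_normal((n, d_in))) @ diff) ** 2)

# Quadratic form with the empirical NTK second moment from an independent batch
F = features(rng.standard_normal((n, d_in)))
risk_quad = diff @ (F.T @ F / n) @ diff

print(f"Monte-Carlo risk : {risk_mc:.6f}")
print(f"quadratic form   : {risk_quad:.6f}")  # agree up to Monte-Carlo error
```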

Exponential blowup in T

This $e^T$ term appears because we allow for very flexible learning rate scheduling, including schedules where the risk diverges. With an additional restriction on the smallness of the learning rate, one could improve the exponential explosion in $T$. It will not go away entirely, however -- optimistically one could hope for a $\mathrm{polylog}(T)$ factor. On very long time-frames there will always be a chance for the stochastic optimizer to explode.
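Schematically (our paraphrase of the standard Gronwall step, not the exact bound in the appendix): if the approximation error satisfies

$$\frac{\mathrm{d}}{\mathrm{d}t}\,|e_t| \;\le\; C\,|e_t| + \varepsilon_d, \qquad e_0 = 0,$$

then Gronwall gives $|e_T| \le \varepsilon_d\,(e^{CT}-1)/C$; without smallness assumptions on the learning rate the constant $C$ cannot be taken to be negative (contractive), which is where the exponential-in-$T$ factor enters.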

We don't have a full picture of how a proof like this would go, but roughly you would want to show that the projections of the errors onto each eigendirection of $\bar{K}$ are $O(1/\sqrt{d})$ and, crucially, remain bounded that way over time due to the contractive nature of the dynamics in that eigendirection. This is also the reason the small stepsize is needed -- to ensure the dynamics tend to contract in each eigendirection, so that the approximation errors are shrunk as well. (Our current proof uses a resolvent argument, which is an average over all eigendirections; this is technically simpler, but would not suffice here, since the contractiveness of the optimizer dynamics is lost.)

Review
Rating: 4

The paper studies the dynamics of signSGD in a linear regression setting with Gaussian covariates $x \sim N(0,K)$ and noisy labels $y = \langle x,\theta_\ast \rangle + \epsilon$. The authors derive a limiting SDE (signHSGD) that describes the dynamics of $\theta$ as the dimension $d \to \infty$, which depends on both the covariance $K$ and the distribution of $\epsilon$. After transforming this process into a process on the residual $r$, they derive a deterministic equivalent (signODE) which approximates the loss of signSGD as $d \to \infty$. Finally, they compare the SDE for signHSGD to the corresponding SDE for SGD to compare the behavior of the two algorithms and to predict when signSGD should outperform vanilla SGD.

Update after Rebuttal: I am satisfied with the authors' responses and have updated my score.

Questions for Authors

  • This paper focuses on the batch-size-1 setting. How are both signHSGD and the strength of the SDE approximation affected by the batch size? Would your analysis recover the "square root" scaling rule in Malladi et al. for Adam? Additionally, would the analysis extend to batch sizes that grow with $d$?

  • The analysis restricts to learning rates $\eta_t' = \Theta(1/d)$ (eq. 6). Is the issue that if $\eta_t' \gg 1/d$ the noise will dominate and the SDE becomes degenerate in the limit $d \to \infty$?

Claims and Evidence

This paper is focused on theoretical understanding and the technical claims are supported by the proofs in Appendices A, B.

There is a minor claim/conjecture which is unsupported (lines 416-424). It would be good to include a few words about why the authors expect this to hold.

Methods and Evaluation Criteria

This paper is focused on developing theoretical understanding in a linear regression setting, and the experiments on more realistic datasets support the idea that Gaussian universality may allow the analysis to extend beyond the Gaussian setting. However, this is not a central claim of the paper.

Theoretical Claims

I skimmed the proofs in appendices A, B but did not verify their correctness.

Experimental Design and Analysis

n/a

Supplementary Material

I skimmed the proofs in appendices A, B but did not verify their correctness.

Relation to Prior Work

This paper is attempting to understand the effectiveness of coordinate-wise adaptivity (as in Adam) in a concrete theoretical setting. The approach in this paper is most closely related to (Collins-Woodfin & Paquette 2023) which performed a similar analysis for SGD.

Missing Important References

n/a

Other Strengths and Weaknesses

Strengths:

  • The paper is generally well written and includes an extended discussion of the different terms in eq. 8
  • The theory seems to match experiments incredibly well (in the linear regression setting)
  • The derivation of the SDE for signHSGD is highly nontrivial

Weaknesses:

  • As acknowledged by the authors, this paper focuses on the simple setting of linear regression with Gaussian covariates. However, even in this setting, we have a very poor understanding of coordinate-wise adaptive optimizers like signSGD or Adam.
  • The paper focuses on online SGD with batch size 1. It would be interesting to extend these results to the more general case.
  • There is limited discussion about the effect of the noise reshaping $K \to K_\sigma$.

Other Comments or Suggestions

  • The use of $\eta_t'$ for the signSGD learning rate and $\eta_t := \eta(t)$ for the rescaled learning rate is a bit confusing
  • The discussions of the effects of $\epsilon$-compression (4.2) and preconditioning (4.3) are clear, but the noise reshaping (Section 4.4) is a bit handwavy. In particular, the conjectures in the last paragraph are difficult to follow and could merit additional explanation in the Appendix, even if there is no rigorous theory to back them up. It could also be useful to run some ablations with and without the reshaped noise covariance to check whether it's actually beneficial or just a side-effect of the preconditioning.
  • In Assumption 5, $z$ is not defined. Is the assumption that this holds for all $z$? Could this assumption simply be replaced with the assumption that $v^\top (\theta_0 - \theta^\ast)$ is sub-Gaussian for any $v$? Also, why does this assumption hold for deterministic $\theta_0, \theta_\ast$? If $v = R(z;\overline{K})_i$ is aligned with $\theta_0 - \theta^\ast$, then Assumption 5 forces $\|\theta_0 - \theta^\ast\| \lesssim d^{-1/2}$?
Author Response

We thank the reviewer for their detailed review and address the weaknesses and questions below.

Weaknesses

Noise reshaping

For the $K \to K_\sigma$ mapping, there is strong dependence on the eigenbasis of $K$. For example, if $K$ is diagonal then $K_\sigma = \mathrm{Id}$. Otherwise, if the eigenbasis is relatively random (such as what is seen in CIFAR and in Figure 4), the operation can preserve some spectral properties of $K$. For example, in CIFAR-10, $K_\sigma$ appears to preserve the power-law structure of $K$, albeit with a different power (plots here).

One major point (discussed below in our response on the learning rate) is that the trace of $K_\sigma$ can be much larger than the trace of $K$, which will cause signSGD gradient noise to be a major slow-down. Rigorously saying more is probably difficult, but we'll rewrite this section in order to be clearer.

Assumption 5

Thank you for catching this. We like your suggestion to instead assume $v^\top(\theta_0 - \theta_*)$ is subgaussian with constant $O(d^{-1/2})$. You're right that general deterministic initializations and targets do not fit the assumption. For the record, the assumption is for all $z \in B(\|K\|)$.

Questions

Batch sizes

This is a very good question -- thanks for bringing it up. We have worked out some of the mathematics for batch sizes small compared to $d$, and plan to add the following results to an appendix.

So first, just to be clear, we are following usual optimization (and standard package) conventions, and we are applying the minibatch average prior to applying the sign-function.

We will take a fixed batch size $b$, independent of the dimension of the problem. One could also look at other scaling regimes, and we expect (based on the fixed-batch-size case) that the behavior continues to hold for batch sizes satisfying $b/d \to 0$, but we're only 100% confident for fixed $b$ and $d \to \infty$. The good news is that even in this limited setting, the story is already interesting.

The bias term (or descent term) is the only term affected by minibatching. The gradient noise term (the Brownian term in homogenized SGD) is unaffected.

The homogenized-SGD equation is changed to

$$\mathrm{d}\Theta_t = -\,\eta_t\, N_b\, \frac{\varphi^{(b)}(R_t)}{\sqrt{R_t}}\; \overline{\mathbf{K}} \left(\Theta_t-\theta^{*}\right) \mathrm{d}t \;+\; \eta_t \sqrt{\frac{2 \mathbf{K}_\sigma}{\pi d}}\; \mathrm{d} \mathbf{B}_t.$$

The $N_b$ is a numerical constant (the ratio of the mean of a chi variable with $b$ degrees of freedom to the mean of a chi variable with $1$ degree of freedom). It is asymptotic to $\sqrt{b}$, consistent with what is shown in Malladi et al.
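For reference, $N_b$ has a closed form via the chi means, $\mathbb{E}[\chi_k] = \sqrt{2}\,\Gamma\!\big(\tfrac{k+1}{2}\big)/\Gamma\!\big(\tfrac{k}{2}\big)$; a quick sketch (ours, for illustration) showing the $\sqrt{b}$-order growth:

```python
# Illustration: N_b = E[chi_b] / E[chi_1], computed from the closed form of the chi mean.
import math

def chi_mean(k: int) -> float:
    # E[chi_k] = sqrt(2) * Gamma((k+1)/2) / Gamma(k/2)
    return math.sqrt(2.0) * math.exp(math.lgamma((k + 1) / 2) - math.lgamma(k / 2))

N1 = chi_mean(1)
for b in (1, 2, 4, 16, 64, 256):
    Nb = chi_mean(b) / N1
    print(f"b={b:4d}  N_b={Nb:8.3f}  N_b/sqrt(b)={Nb / math.sqrt(b):.3f}")
```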

The $\varphi^{(b)}$ is the $\varphi$ of a new, convolved noise. It is the $\varphi$ that results from looking at $\sum_{i=1}^b \epsilon^{(i)} \omega_i$, where the $\epsilon^{(i)}$ are i.i.d. copies of the label noise and $\omega$ is an independent random vector drawn uniformly from the sphere in dimension $b$.

We have run some simulations implementing this (code is available in the anonymous code repository for the paper). Simulations at fixed learning rate, varying batch size are available here and with $\sqrt{b}$ scaling here.

Learning rate scaling

Indeed, the choice of $\eta' = O(1/d)$ is needed to prevent (gradient) noise domination. See the Hessian term of equation 94, where the Hilbert-Schmidt inner product is $O(\mathrm{Tr}(K_\sigma))$. In this setting, if $\eta'$ decays more slowly than $1/d$, this noise term would dominate. Had $\eta'$ decayed faster, the noise term would vanish and we would be performing gradient flow in the high-dimensional limit.

For SGD it is possible to instead rescale by the "intrinsic dimension" of the data, defined as the trace of the covariance normalized by its operator norm ($\mathrm{Tr}(K)/\|K\|$). We think a similar change could be made for signSGD, but some care has to be taken since signSGD reshapes the gradient covariance to $K_\sigma$. This connects to our discussion about the reshaping of gradient noise: given a diagonal $K$ with intrinsic dimension $O(\sqrt{d})$, signSGD still has intrinsic dimension $O(d)$ since $K_\sigma = \mathrm{Id}$. This forces signSGD to use learning rates $O(d^{1/2})$ smaller than vanilla SGD to guarantee convergence, leading to slow dynamics.
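As a toy illustration of this point (our numbers, not from the paper):

```python
# A diagonal power-law K whose intrinsic dimension Tr(K)/||K|| is O(sqrt(d)),
# while its sign-reshaped covariance K_sigma = I_d has intrinsic dimension exactly d.
import numpy as np

d = 10_000
lam = np.arange(1, d + 1, dtype=float) ** (-0.5)   # eigenvalues of a diagonal K

intrinsic_K = lam.sum() / lam.max()                # Tr(K)/||K||, roughly 2*sqrt(d)
intrinsic_Ksigma = d                               # K_sigma = Id when K is diagonal

print(f"d = {d}, Tr(K)/||K|| ~ {intrinsic_K:.0f}, Tr(K_sigma)/||K_sigma|| = {intrinsic_Ksigma}")
```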

Let us know if we can clarify anything else; if we have addressed your concerns satisfactorily, we hope that you will consider revisiting your review score.

Review
Rating: 4

The authors analyze Sign-SGD for linear regression in high dimensions by deriving limiting differential equations that describe its behavior. Their analysis quantifies four main effects: learning rate adjustment, noise compression, diagonal preconditioning, and gradient noise reshaping. The paper includes theoretical proofs and experimental validation on both synthetic and real datasets.

Questions for Authors

  • Can you explain: "But we expect the conclusion of (27) remains mostly true in well-conditioned settings."
  • From the above: I would like to double check that the results from Theorem 3 roughly align with standard results from stochastic convex optimization regarding SGD with a fixed step-size. If they do not, would the authors explain the differences in more detail?

Claims and Evidence

Main Claims and Evidence:

Claims:

  1. Sign-HSGD and its deterministic equivalent ODE are good models for the risk dynamics induced by Sign-SGD
  2. The risk curves of Sign-SGD are well approximated by the risk curves of Sign-HSGD and this approximation improves as dimension grows.
  3. Convergence to a neighbourhood of the solution (?) under a fixed learning rate
  4. The condition number affects whether Sign-SGD or SGD should be favoured

Evidence:

  1. Figure 1
  2. Theorem 1 / Theorem 2
  3. Theorem 3
  4. Theorem 4

The evidence to me was convincing and followed from my intuition of the algorithm, however I did not review the appendices in detail. I especially appreciated the mixture of theoretical analysis with a grounded set of experiments which verified the results derived seemed to make sense.

Methods and Evaluation Criteria

Benchmarks were a mixture of toy problems, like linear regression under varying noise and preconditioning assumptions, which closely resemble the theoretical setting analyzed, as well as a set of more practical experiments on common ML benchmarks like CIFAR10 and IMDB.

Theoretical Claims

I did not verify the claims in detail, but based on my experience they seem reasonable. I would like to double check that the results from Theorem 3 roughly align with standard results from stochastic convex optimization regarding SGD with a fixed step-size. If they do not, would the authors explain the differences in more detail?

Experimental Design and Analysis

The experimental designs seem sound at face value, and are tailored to fit the larger narrative of the paper.

Supplementary Material

I did not have time to review the appendix in detail.

Relation to Prior Work

I think that this work seems to address a unique and important problem posed by modern optimization research, and I believe a better understanding of the optimization dynamics of algorithms like sign decent would benefit the larger machine learning community.

Missing Important References

The authors might want to look at "Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models".

Other Strengths and Weaknesses

Strengths:

  1. Comprehensive theoretical framework:
  • Proves convergence to limiting equations
  • Derives exact formulas for limiting risk
  • Quantifies specific effects (learning rate, noise compression, preconditioning, noise reshaping)
  2. Strong empirical validation:
  • Theory matches experiments even at moderate dimension (d=500)
  • Works on both synthetic and real datasets
  • Demonstrates convergence under different noise conditions
  3. Clear practical implications:
  • Shows when Sign-SGD might outperform SGD (based on condition numbers)
  • Explains behavior under different noise distributions

Weaknesses:

  1. Limited scope:
  • Theory only covers linear regression
  • Strongest results require specific noise conditions (C² density near zero)
  • Extensions to non-smooth noise only work above a risk threshold
  2. Some findings lack theoretical foundation:
  • Success on real (non-Gaussian) data isn't fully explained
  • Authors note this is left for future work
  3. Some practical aspects not addressed:
  • Doesn't analyze mini-batch settings
  • Doesn't fully connect to practical deep learning scenarios / analyze non-convex settings.

Other Comments or Suggestions

  • The "In a nonconvex setting, identifying ...." paragraph is broken and needs to be revised.
Author Response

We thank the reviewer for their comments, and address some key points below.

Minibatching:

We have added more details regarding batch sizes $b$ in our response to Reviewer ugE8. One particular consequence of minibatching is that our condition on the behaviour of the noise near $0$ relaxes significantly: averaging multiple sources of noise effectively conditions the noise, even for $b = 2$. Numerical results can be found here for fixed learning rate and varying batch size.

Questions:

First question

(27) compares the risk obtained by optimally scheduled signSGD to that of optimally scheduled SGD when the data is isotropic. We don't know the optimal learning rate schedule in non-isotropic settings, but we expect that whether or not signSGD outperforms SGD depends generally on the magnitude of $\psi$.

The argument can be sketched as follows: for any signSGD learning rate $\eta^{S}(t)$, run SGD with a learning rate of $\eta(t) = \eta^{S}(t)\,\phi(R_t) / \sqrt{R_t}$. Now the descent terms are the same up to the $K$ vs $\overline{K}$ distinction. With this learning rate we see the function $\psi$ in the variance term of the SGD risk equation. If the various matrices involved ($K$, $\overline{K}$, $K_\sigma$) are similar enough (as they are when $K = \mathrm{Id}$), then whether or not signSGD is beaten by SGD with this learning rate is again determined by the magnitude of $\psi$. This "similar enough" statement is what we mean by well-conditioned settings.
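In equation form, the matching we have in mind is (schematic, our shorthand for the descent terms only):

$$-\,\eta^{S}(t)\,\frac{\phi(R_t)}{\sqrt{R_t}}\;\overline{K}\,(\Theta_t-\theta^{*}) \quad\text{(signHSGD)} \qquad\text{vs.}\qquad -\,\eta(t)\,K\,(\Theta_t-\theta^{*}) \quad\text{(HSGD)},$$

so that with $\eta(t) = \eta^{S}(t)\,\phi(R_t)/\sqrt{R_t}$ the two coincide up to the $K$ versus $\overline{K}$ distinction.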

Second question

For SGD in the same setting, the limiting risk can be found in Equation (254) and is

$$\frac{\eta v^2 \operatorname{Tr}(K)}{2\left(2d-\operatorname{Tr}(K)\,\eta\right)}.$$

For sufficiently small $\eta$, the limiting SGD risk is $O(\eta v^2 \operatorname{Tr}(K)/d)$. This is consistent with 'neighborhood' convergence for fixed step-size SGD when $v > 0$ (e.g. Theorem 5.5 of Garrigos-Gower '23, Handbook of Convergence Theorems) and with convergence under a fixed step-size in the case $v = 0$.
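Concretely, for $\operatorname{Tr}(K)\,\eta \ll d$ this is just

$$\frac{\eta v^2 \operatorname{Tr}(K)}{2\left(2d-\operatorname{Tr}(K)\,\eta\right)} \;\approx\; \frac{\eta v^2 \operatorname{Tr}(K)}{4d} \;=\; O\!\left(\frac{\eta v^2 \operatorname{Tr}(K)}{d}\right).$$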

In contrast, we can use Theorem 3 to approximate the limiting signSGD risk, up to constants, as

$$\frac{\eta \operatorname{Tr}(\overline{K})}{d}\, \max\!\left(\frac{\eta \operatorname{Tr}(\overline{K})}{d},\; v\right).$$

Notably, even if $v = 0$ this limiting risk is not $0$. The simple explanation is that with SGD, the average magnitude of the gradient naturally decreases with the risk, so we effectively take smaller steps; for signSGD, however, the magnitude of the gradient is always $\sqrt{d}$ and does not change with the risk. To achieve similar results with signSGD, the stepsize needs to be decreased. This aligns with what is seen in practice for algorithms like Adam.

Other:

"In a nonconvex setting, identifying ...." paragraph. We have attempted to improve the wording for clarity, but we’d be happy for additional guidance here. Our intention is to say that dividing by the norm of the gradient could be reasonable in vanishing gradient situations, such as near saddles.

Final Decision

The paper studies the dynamics of signSGD in a linear regression setting with Gaussian covariates and label noise and derives a limiting SDE (SignHSGD). This paper makes a solid theoretical contribution towards understanding coordinate-wise adaptivity in optimization algorithms, which is at the core of optimization theory in the era of LLMs. Though the theory is limited to linear regression, the results and analysis are already quite involved, yet they predict the experiments incredibly well, even for a real dataset like CIFAR10 (though still in a linear regression setting). Reviewers unanimously recommend acceptance.