PaperHub
ICLR 2024 · Spotlight
Average rating: 7.7/10 (6 reviewers; min 6, max 10, std 1.4)
Individual ratings: 10, 8, 8, 6, 8, 6
Confidence: 3.2

High-dimensional SGD aligns with emerging outlier eigenspaces

OpenReview · PDF
Submitted: 2023-09-20 · Updated: 2024-03-17

Abstract

Keywords

stochastic gradient descent, Hessian, multi-layer neural networks, high-dimensional classification, Gaussian mixture model, XOR problem

Reviews and Discussion

Review (Rating: 10)

Background: Empirical findings in [1] show that the iterates of SGD, used for training deep learning models for classification, converge to a neighborhood of a tiny subspace of the parameter space - in particular, to the eigenspace of the top eigenvalues of the Hessian and the covariance matrix of gradients - which justifies why deep learning algorithms do not suffer the curse of dimensionality.

Summary: This manuscript rigorously proves that this observation should hold in multi-class logistic regression and 2-layer neural networks trained on Gaussian mixture models. Moreover, they also show that this alignment between the SGD iterates and the top eigenspace of the Hessian holds layer-wise and does not depend on the model's success.

[1] Gur-Ari, G., Roberts, D.A., & Dyer, E. (2018). Gradient Descent Happens in a Tiny Subspace. ArXiv, abs/1812.04754.

Strengths

  • It provides theoretical evidence for an important observation.

  • Although previous studies that considered single- and/or multi-index models [1,2] suggest that such alignments are expected, this work goes beyond them and proves more, such as the layer-wise alignment and the fact that alignment occurs even if the model fails to learn the ground truth.

  • Overall, it is a very good work! (However, there are some minor mistakes in the proof. Please see the Questions section)

[1] Damian, A., Lee, J.D., & Soltanolkotabi, M. (2022). Neural Networks can Learn Representations with Gradient Descent. ArXiv, abs/2206.15144.

[2] Mousavi-Hosseini, A., Park, S., Girotti, M., Mitliagkas, I., & Erdogdu, M.A. (2022). Neural Networks Efficiently Learn Low-Dimensional Representations with SGD. ArXiv, abs/2209.14863.

Weaknesses

NA

Questions

  • I checked the proofs corresponding to Theorems 3.1, 3.2, and 3.3 carefully. There are some minor mistakes (which should be fixable):

    • In Eq. (C.10), the second term on the RHS does not seem correct.
    • In Eq. (C.8), you missed the case $b \neq a = c$.
    • In the fifth equation on Page 26 of the Appendix (the one that starts with $\langle \nabla_c H, \nabla_a R_{aa}^{\perp} \rangle$), I think you missed a term on the RHS.
    • In the equation above (C.14) (the one that starts with $\nabla_c H \otimes \nabla_d H$), I think many terms are missing on the RHS. Please check that part carefully.
    • Because of the previous point, Eq. (C.14) should be incorrect, which eventually makes the correctness of the correction terms in Theorem 5.7 questionable. Can you check this part again as well?
  • I skimmed through the rest of the proofs. I could not see a major mistake.

  • As the last question for the authors: Your results crucially depend on the high-dimensional limit for SGD proven in [1]. In that work, it was shown that there is a critical learning rate that causes additional correction terms in the limit. Can you comment on how the existence of the correction terms affects your results?

[1] Arous, G.B., Gheissari, R., & Jagannath, A. (2022). High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. ArXiv, abs/2206.04030.

Comment

We thank the referee for their careful reading of the manuscript and helpful comments.

Thank you for pointing us to the references [1,2] in "strengths"; we have added references to both.

"I checked the proofs corresponding...": We are grateful to the referee for pointing these out, especially the missing terms in (C.14) and its effect on Theorem C.7. We have fixed all of these in the supplementary material (they do not any of our affect our main results).

"As the last question for the authors...": This is a great question; the correction term at critical scaling also scales as 1/λ1/\lambda, i.e., the variance of the noise. Thus, the effect of the critical scaling is getting absorbed as one of the sources of the O(1/λ)O(1/\lambda) errors that we sustain in our main results. It would require a more refined analysis of these errors to understand better the role of the critical scaling.

Comment

I thank the authors for their rebuttal and for welcoming the suggestions. My questions have been cleared up. I don't have further questions and I'll keep my score.

Review (Rating: 8)

This paper looks at the task of classification for high-dimensional Gaussian mixtures using one- or two-layer neural networks. The authors show that the SGD trajectories rapidly align with the low-rank outlier eigenspaces of the Hessian and gradient matrices.

Strengths

There have been many previous papers that study the empirical Hessian during SGD training. It has been observed that the Hessian spectrum can often be separated into a bulk component, which depends on the network architecture, and outlier eigenvalues, which depend on the data. For a simple Gaussian mixture model, this paper gives a rigorous proof of this phenomenon for SGD.
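
As a quick sanity check of the bulk-plus-outlier picture described above, here is a minimal NumPy sketch (our illustration, not the paper's code; the binary logistic model, the dimensions, and the SNR value are assumptions): it forms the empirical Hessian of the logistic loss on a two-component Gaussian mixture and compares its top eigenvector with the class-mean direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 400, 4000, 8.0               # dimension, samples, SNR (noise covariance I_d / lam)

mu = rng.standard_normal(d)
mu /= np.linalg.norm(mu)                 # unit-norm class mean
y = rng.choice([-1.0, 1.0], size=n)      # balanced labels
X = y[:, None] * mu + rng.standard_normal((n, d)) / np.sqrt(lam)

w = rng.standard_normal(d) / np.sqrt(d)  # uninformative parameter of O(1) norm
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

# Empirical Hessian of the logistic loss: (1/n) * sum_i sigma'(w.x_i) x_i x_i^T,
# where sigma'(m) = sigma(m) * sigma(-m) does not depend on the label.
s = sigma(X @ w) * sigma(-(X @ w))
H = (X * s[:, None]).T @ X / n

evals, evecs = np.linalg.eigh(H)
print("top three eigenvalues:", evals[-3:])          # typically one outlier above the bulk
print("overlap of top eigenvector with mu:", abs(evecs[:, -1] @ mu))
```

In small runs with these settings, the top eigenvalue typically separates clearly from the bulk and its eigenvector has overlap close to 1 with the mean direction, which is the outlier-plus-bulk structure the paper makes rigorous.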

Overall I think this is a strong theoretical result that characterizes the implicit bias of SGD and explains some of the previous empirical observations about the empirical Hessian.

Weaknesses

Overall I found this to be a very solid work. I did not find any major weaknesses.

Questions

N/A

Comment

We thank the referee for their careful reading and review.

Review (Rating: 8)

This manuscript investigates the alignment of the one-pass SGD trajectory with the leading eigenvectors of the Hessian in the high-dimensional limit. In particular, it shows that for two different tasks:

  1. multi-class logistic regression on a k-Gaussian mixture model;
  2. classifying an XOR Gaussian mixture with a two-layer neural network

for sufficiently high signal-to-noise ratio, the SGD trajectory of the model parameters approximately lies in the span of the top eigenvectors of the test Hessian after a number of steps which is linear in the input data dimension.
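
To make the alignment statement concrete, here is a small, hedged sketch of one-pass SGD (again our illustration, not the paper's setup: a binary mixture, a single outlier direction, and an untuned step size and horizon are all assumptions): it trains logistic regression for a number of steps proportional to d and tracks how much of the current iterate lies along the top eigenvector of a test Hessian evaluated at that iterate.

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 400, 8.0
mu = rng.standard_normal(d); mu /= np.linalg.norm(mu)
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    return y[:, None] * mu + rng.standard_normal((n, d)) / np.sqrt(lam), y

X_test, _ = sample(4000)                              # held-out data for the test Hessian

def top_test_hessian_dir(w):
    s = sigma(X_test @ w) * sigma(-(X_test @ w))      # per-sample sigma'(w . x)
    H = (X_test * s[:, None]).T @ X_test / len(X_test)
    return np.linalg.eigh(H)[1][:, -1]                # top eigenvector

w = rng.standard_normal(d) / np.sqrt(d)               # uninformative start of O(1) norm
delta = 0.5 / d                                       # step size of order 1/d
for t in range(1, 8 * d + 1):                         # one-pass SGD: fresh sample each step
    x, y = sample(1)
    x, y = x[0], y[0]
    w += delta * y * sigma(-y * (w @ x)) * x          # gradient step on the logistic loss
    if t % (2 * d) == 0:
        v = top_test_hessian_dir(w)
        print(f"step {t}: overlap with top Hessian direction = {abs(w @ v) / np.linalg.norm(w):.3f}")
```

With these illustrative settings the reported overlap usually grows from roughly 1/sqrt(d) at initialization to a substantial fraction of 1 within a few multiples of d steps, i.e. on the linear-in-dimension timescale mentioned above.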

Strengths

Characterising the geometry of SGD, and in particular finding the low-dimensional subspaces where the high-dimensional trajectory lies, is an important problem. In many theoretical setups, the relevant subspaces can be guessed from symmetries of the problem, and can be used to derive low-dimensional scaling limits for the projection of the trajectory onto these subspaces - an approach that has been employed in many old and recent works in the literature, c.f. [Saad & Solla 1995a,b; Goldt et al. 2019; Chen et al. 2019; Refinetti et al. 2021; Veiga et al. 2022; Ben Arous et al. 2022; Paquette et al. 2022; Arnaboldi et al. 2023; Shuo & Vershynin 2023]. However, an important limitation of this approach is that it is hard to generalise to setups where the relevant statistics are less clear to guess. This work proposes a roadmap to overcome this limitation, consisting of taking the span of the top outlier directions of the Hessian/G-matrix, and shows that this span correlates with the standard summary statistics in these examples. If this proves more general, it could open up an important new theoretical tool in the study of SGD.

Weaknesses

The main limitation is that the two classification tasks have a particular structure. Indeed, in both the linear case and the two-layer case considered here, the Hessian/G-matrix of the first-layer weights is proportional to $Y \otimes Y$. For a Gaussian mixture $Y = y\mu + Z_{\lambda}$, the relevant subspace spanned by the means will naturally pop up as an outlier of this matrix when the SNR $\lambda$ is large enough: this is the classical BBP transition. Therefore, it is not surprising that this also happens at the level of the Hessian/G-matrix. Maybe proving this requires a tour de force, but it is fair to doubt the generality of the conclusion. For instance, it would be desirable to show a similar result for tasks where the structure is in the labels, and not in the inputs, e.g. teacher-student models.
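
For readers less familiar with the BBP transition invoked here, the classical rank-one spiked-sample-covariance statement reads as follows (a standard random-matrix fact quoted for context, not taken from the manuscript): for data $x_i = \sqrt{\theta}\,\xi_i u + z_i$ with $z_i \sim \mathcal{N}(0, I_d)$, $\xi_i = \pm 1$, $\|u\| = 1$, and aspect ratio $\gamma = d/n$ fixed as $d, n \to \infty$,

$$\lambda_{\max}\Big(\tfrac{1}{n}\textstyle\sum_{i} x_i x_i^{\top}\Big) \;\to\; \begin{cases} (1+\sqrt{\gamma})^2, & \theta \le \sqrt{\gamma} \quad \text{(no outlier; bulk edge)},\\[2pt] (1+\theta)\big(1+\tfrac{\gamma}{\theta}\big), & \theta > \sqrt{\gamma} \quad \text{(outlier, with top eigenvector correlated with } u\text{)}. \end{cases}$$

Heuristically, with the normalization $Y = y\mu + Z_{\lambda}$, $Z_{\lambda} \sim \mathcal{N}(0, I_d/\lambda)$ used in this review, the effective spike strength is of order $\lambda\|\mu\|^2$, which is why a large enough SNR $\lambda$ pushes the mean direction out of the bulk.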

Questions

  • [Q1]: In Thm 3.1, how does the critical variance $\lambda_{0}$ for SGD alignment compare with the BBP threshold for the recovery of the means $\mu_{a}$ from $Y = y\mu + Z_{\lambda}$?

  • [Q2]: In both examples considered here, the timescale for the trajectory to correlate with the outlier subspace is linear in the dimension. How general should we expect this to be? For instance, would the authors expect this to hold in problems where escaping a fixed point takes longer than linear time in the dimension?

  • [Q3]: In the same spirit as the question above, would the authors expect the theorem to remain true at longer timescales? In general, should we expect these subspaces to be stable, or the SGD trajectories to eventually escape these subspaces?

  • [Q4]: From the statement of the results, it seems they hold for both the G-matrix and the Hessian. Is there any reason for looking at the Hessian instead of the computationally simpler G-matrix? Are the subspaces spanned by the top eigenvectors of both matrices equivalent, e.g. in the sense of Def. 2.2?

  • [Q5]: Are the authors aware of converse examples to their result? For instance, a problem where the underlying target function does depend only on a few relevant directions, but the outliers in the Hessian/G-matrix are not necessarily aligned with them?

Minor comments

  • The reference at the top of Page 2:

"Liao & Mahoney (2021) studied Hessians of some non-linear models beyond generalized linear models, also at initialization."

is misleading. The main result in Liao & Mahoney (2021) concerns the Hessian of a loss function of the type (see eq. 3) $L(w)=\frac{1}{n}\sum_{i=1}^{n}\ell(y_{i}, w^{\top}x_{i})$ under the assumption $y_{i}\sim f(y \mid w_{\star}^{\top}x_{i})$ (see eq. 1). Despite their unusual choice of terminology for this model ("G-GLM"), this is most commonly known as a generalized linear model in the literature... when not simply referred to as a "linear model".

  • The following references:

ODE limits of the SGD training with single layer networks have been derived and numerically solved in Mignacco et al. (2020); Loureiro et al. (2021).

are not accurate. Mignacco et al. (2020) and Loureiro et al. (2021) do study the classification of Gaussian mixtures with single-layer networks. However, they derive exact expressions for the training and test errors of the minimiser of the empirical risk. Indeed, this should correspond to the asymptotic performance of one-pass SGD as $t\to\infty$ in the example considered in Section 3.

  • The figures in the manuscript are not very readable. First, even if this is described in the caption, it would be good to label the (x,y)-axes for clarity. Second, they are small, and since the format is not vectorial they get pixelated when zoomed in. Third, what does the colour scheme mean? Maybe add a colour bar?

  • I understand the authors might not want to dwell on 30 years of literature on SGD scaling limits. But I would encourage them to mention at least some of the recent works along this line which are contemporary with [Ben Arous et al. 2022], to represent the diversity of this literature; see e.g. the list below.

Typos:

  • Page 7: "direcitons" -> "directions"

References

  • [Goldt et al. 2019] Sebastian Goldt, Madhu Advani, Andrew M. Saxe, Florent Krzakala, Lenka Zdeborová. "Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup". Part of Advances in Neural Information Processing Systems 32 (NeurIPS 2019)

  • [Chen et al. 2019] Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. "Gradient descent with random initialization: fast global convergence for nonconvex phase retrieval". Mathematical Programming, 176(1):5–37, Jul 2019

  • [Veiga et al. 2022] Rodrigo Veiga, Ludovic Stephan, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová. Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks. Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022).

  • [Paquette et al. 2022] Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington. "Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties". arXiv:2205.07069 [math.ST]

  • [Arnaboldi et al. 2023] Luca Arnaboldi, Ludovic Stephan, Florent Krzakala, Bruno Loureiro. From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks. Proceedings of Machine Learning Research vol 195:1–29, 2023.

  • [Shuo & Vershynin 2023] Yan Shuo Tan and Roman Vershynin. "Online stochastic gradient descent with arbitrary initialization solves non-smooth, non-convex phase retrieval". Journal of Machine Learning Research, 24(58):1–47, 2023

Comment

We thank the referee for their careful reading and helpful comments. We address these below.

Weaknesses: We agree with the referee that, in the case of multiclass logistic regression and the first layer of the XOR model, the results we present are natural. That said, we would like to emphasize the following subtle points:

  1. The spectra of these matrices are spatially dependent; in particular, their outlier eigenvectors, their spectra, and even the “critical” lambda itself vary in space. Note that even for the first layer, the spatial dependence of the coefficients of $Y \otimes Y$ can affect not only the eigenvalues but also the eigenvectors.
  2. For the second layer, the matrix is not proportional to the data matrices: it is a spatially (and thus time) dependent, non-linear transformation of the data. Moreover, the second layer is $O(1)$-dimensional and thus distinct from the regime of the classical BBP transition.
  3. It is not crucial to our analysis that the structure is in the feature data. In particular, similar arguments should extend to single index models/teacher-student networks.

[Q1]: We work here in the regime where $\lambda_0$ is large enough (though $O(1)$); it is a very interesting question to understand the critical $\lambda$ for the dynamical BBP transition.

[Q2]: We expect similar behavior to hold for problems where it takes more than linear time to escape a fixed point, except that the alignment will itself take a diverging time to appear. A good example to have in mind for this is a problem like tensor PCA or certain single-index models; this is an interesting avenue of future investigation.

[Q3]: The alignment should hold for much longer timescales, and depending on how strong an alignment one desires, that can be proven in some examples. In the XOR problem with multiple layers, one must be careful since there can be cascades of transits from one fixed point to another, and during those transits the alignment could be lost as the SGD looks for the next best fixed point, then regained once it settles there.

[Q4]: We agree with the referee that the G-matrix is computationally more natural; this was our motivation for considering that matrix. We consider the Hessian spectrum because this is the object that has received most of the attention in the literature. Furthermore, in the setting we consider here, the principal subspaces of the Hessian and the G-matrix are close in the sense of Def 2.2; however, their eigenvalues are different.
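
To make the [Q4] comparison concrete, here is a short sketch (our illustration with an assumed binary logistic model, not the authors' code) that builds both matrices at the same parameter value and compares their top eigendirections:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, lam = 400, 4000, 8.0
mu = rng.standard_normal(d); mu /= np.linalg.norm(mu)
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu + rng.standard_normal((n, d)) / np.sqrt(lam)
w = mu + rng.standard_normal(d) / np.sqrt(d)          # illustrative parameter with O(1) overlap with the mean
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

m = X @ w
H = (X * (sigma(m) * sigma(-m))[:, None]).T @ X / n   # empirical Hessian of the logistic loss
G = (X * (sigma(-y * m) ** 2)[:, None]).T @ X / n     # G-matrix: (1/n) sum_i grad_i grad_i^T

eH, VH = np.linalg.eigh(H)
eG, VG = np.linalg.eigh(G)
print("overlap of top eigenvectors of H and G:", abs(VH[:, -1] @ VG[:, -1]))
print("top eigenvalue of H:", eH[-1], " top eigenvalue of G:", eG[-1])
```

In runs of this toy model the two top eigenvectors essentially coincide while the top eigenvalues differ, which mirrors the reply above: the principal subspaces are close in the sense of Def 2.2 even though the spectra are not.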

[Q5]: We thank the referee for this question. In the XOR model, while the population loss has only four relevant directions, one for each hidden class, we show that there are regimes of initialization where the spectrum that emerges misses some of these directions. Please see Figure 4.3 and the related discussion.

"Reference in top of page 2": We reworded this to match the terminology from Liao & Mahoney (2021).

"The following references...": Thank you for pointing this out, we have corrected this.

"The figures are not very readable...": We have added to the captions descriptions of what the different colors are denoting in each figure. Due to space constraints, adding legends/axis labels becomes even harder to parse compared with having them in the caption, but we will try to vectorize the figures for the final version so that they are more clear when zooming.

"30 years of literature on SGD scaling limits": We have included the references as requested, except for Chen et al (2019) which appears to be less relevant as it is focused on sample complexity bounds for gradient descent.

Comment

I thank the authors for their detailed rebuttal and for welcoming the suggestions. Most of my questions have been cleared up. I just have an additional request for clarification and a comment.

"It is not crucial to our analysis that the structure is in the feature data. In particular, similar arguments should extend to single index models/teacher-student networks.".

Can the authors elaborate on what evidence they have for this claim?

"We reworded this to match the terminology from Liao & Mahoney (2021)."

I think this terminology is not standard and confusing. Do you understand why it is "generalized GLM" and not simply "GLM"?

But ok, this is a minor point and doesn't really concern this manuscript.

Comment

"Can the authors ellaborate what evidence they have for this claim?": We agree with the referee that extending our results to this setting is worth careful consideration, and are actively exploring this question. In light of the public nature of this forum, however, we do not feel comfortable expressing more than a belief that similar arguments should be applicable, without having a complete, rigorous proof.

Comment

I understand the authors' concern, and have no further question or comment. I retain my score towards acceptance.

Review (Rating: 6)

The paper investigates the interplay between the training dynamics of one-pass SGD and the spectral decompositions of the Hessian matrix and G-matrix. The authors study this interaction on two high-dimensional classification tasks with cross-entropy loss.

The authors start with the classification of k-component Gaussian mixture models by a single-layer network. The outlier-minibulk-bulk structure is shown by deriving limiting dynamical equations for the trajectory of summary statistics of the SGD trajectory. The authors also study the specific case where the means are orthogonal and show that one-pass SGD aligns with the eigenspace of the largest outlier eigenvalue.

The authors further study the classification of the XOR problem on the Gaussian mixture model via a two-layer network. It is shown that the alignment between SGD and outlier eigenspaces is present in each layer. The results match the spectral phase transition in spiked covariance matrices.
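
Since the XOR setting is the more involved of the two, a hedged, self-contained sketch may help fix ideas (entirely our illustration: the width, SNR, step size, horizon, and even the ReLU activation are assumptions, not necessarily the paper's exact model). It runs one-pass SGD on a two-layer network for the XOR Gaussian mixture and tracks how much of each first-layer row lies in the span of $\mu_1$ and $\mu_2$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, lam = 200, 16, 10.0                 # input dimension, hidden width, SNR
delta = 1.0 / d                           # step size of order 1/d
T = 50 * d                                # one-pass horizon, linear in d

# Orthonormal means; XOR labels: clusters +/-mu1 -> +1, clusters +/-mu2 -> -1
mu1, mu2 = np.eye(d)[0], np.eye(d)[1]
S = np.stack([mu1, mu2])                  # 2 x d orthonormal basis of the signal subspace

def sample():
    c = rng.integers(4)
    mean = (mu1, -mu1, mu2, -mu2)[c]
    label = 1.0 if c < 2 else -1.0
    return mean + rng.standard_normal(d) / np.sqrt(lam), label

W = rng.standard_normal((K, d)) / np.sqrt(d)   # first-layer weights
a = rng.standard_normal(K) / np.sqrt(K)        # second-layer weights
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def signal_fraction(W):
    # average over hidden units of ||projection of the row onto span{mu1, mu2}|| / ||row||
    return np.mean(np.linalg.norm(W @ S.T, axis=1) / np.linalg.norm(W, axis=1))

print("first-layer signal fraction at init:", round(signal_fraction(W), 3))
for _ in range(T):                              # one-pass SGD on the logistic loss
    x, y = sample()
    h = W @ x
    r = np.maximum(h, 0.0)                      # ReLU features
    g = -y * sigmoid(-y * (a @ r))              # d(loss)/d(output)
    grad_a = g * r
    grad_W = np.outer(g * a * (h > 0.0), x)
    a -= delta * grad_a
    W -= delta * grad_W
print("first-layer signal fraction after training:", round(signal_fraction(W), 3))
```

Whether and how fast the fraction grows depends on the seed, the SNR, and the horizon (and, as discussed below for the XOR problem, SGD can converge to suboptimal fixed points with positive probability), so this is only a qualitative illustration of the layer-wise alignment being described.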

Strengths

  • Provides a theoretical understanding of the interaction between the training dynamics of one-pass SGD and the outlier-bulk structure of the Hessian matrix and G-matrix.
  • Detailed explanations help understand the theory and make the manuscript easy to follow.
  • Numerical experiments verify the theorems.

Weaknesses

  • The comparison with and contribution over prior works are not very clear.
  • Some technical parts are hard to parse; it might be helpful to introduce the intuition behind the main proof steps.

Questions

I would like to ask the following questions to the authors:

  • Is it possible to extend the theory, which is based on the cross-entropy loss, to a general function class?
  • The authors mention in Section 3.2 that SGD finds the subspace generated by the outlier eigenvalues for any uninformative initialization with norm $O(1)$; does a similar property hold for XOR-type mixture models with two-layer networks?

Comment

We thank the referee for their careful reading and helpful comments.

Weaknesses: We have added further references to previous work as per this and other referees' requests to better contextualize our work. We have attempted to give intuition for the proof strategy in Section 5 (page 9), but are happy to expand in the supplements if there is anything specific that could be elucidated.

Question 1: Our focus in this work was to build a theoretical foundation for the experimental work, which was largely in the context of classification tasks. We expect that many of the phenomena we find extend to more general settings, and these are certainly interesting avenues for future work.

Question 2: Indeed, Gaussianity of the initialization is similarly not essential to the XOR theorems.

Review (Rating: 8)

This paper studies high-dimensional SGD by examining the alignment of the SGD iterates with the emerging outlier eigenspaces of the Hessian and gradient matrices. It is shown that after a short period of training, the SGD iterates start to align with the low-rank outlier eigenspaces of the empirical Hessian and empirical G-matrices, and that the alignment may happen layer-wise for multi-layer architectures. The main results are proved for two settings: 1) learning a Gaussian mixture model with linearly independent classes by a single-layer neural network, and 2) learning a Gaussian mixture model version of the XOR problem. Numerical evidence is also provided to illustrate the theoretical results.

Strengths

This paper is well written and easy to follow. The results are novel and solid, and the presentation of the technical results is very clear, with numerical illustrations of the theoretical predictions.

Weaknesses

  • The results are proved for two somewhat restricted settings, and it is not clear how to extend the argument to more general settings. It would be helpful if the authors could comment on the limitations of the current proof.

  • It seems that the main technical tool is from Ben Arous et al. (2022). It would be helpful to add some discussion on the technical novelty.

  • The results are valid for online SGD. It is worth commenting on whether this is necessary, as well as on what happens for multi-pass SGD.

  • A small typo: In the sentence above Theorem 4.1, "principal direcitons" -> "principal directions"

Questions

  • The learning rate $\delta = O(1/d)$ seems to be on a very small scale, especially for large $d$. Could the authors comment on this requirement for the learning rate?

  • The main theorems state that the alignment happens for $\ell \in [T_0\delta^{-1}, T_f \delta^{-1}]$. What's the corresponding generalization performance of the model on the test data?

Comment

We thank the referee for their careful reading and helpful comments. We address these below.

“Two rather restricted settings”: We agree that these two settings are rather restricted, and it would indeed be of interest to extend them to general, multi-class, non-linearly classifiable mixture models. We expect the random matrix concentration to be amenable to this greater level of generality. The biggest hurdle to such extensions is analyzing the limiting dynamical system for the stochastic gradient descent. These can be incredibly complex, even in simple settings like XOR labelings.

“Main technical tool”: Our analysis consists of two stages: understanding the SGD trajectory, and understanding the matrices’ spectra along those trajectories. For the first stage, the main technical tool is indeed from Ben Arous et al. (2022) (plus new stability/perturbative analysis of the limits). But the second half of the paper involves a disjoint set of tools, mainly from random matrix theory, which we describe on page 9 (Outline and Ideas of Proof).

“What happens for multi-pass SGD”: Other variants of SGD would indeed be very interesting to study, but the limit theory for the SGD is undeveloped in this setting.

Question 1: The learning rate has to be that small as a function of the dimension for the SGD to remain stable enough to converge in the right observables. In particular, if the learning rate is larger than that, then the resulting stochastic terms have diverging variances. That is the sense in which $\Theta(1/d)$ is identified as a critical step-size scaling in Ben Arous et al. (2022).
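
A heuristic, back-of-the-envelope version of this point (our sketch of the standard argument, not the authors' derivation): tracking a quadratic summary statistic such as the squared norm, one SGD step with a fresh sample $(x_\ell, y_\ell)$ gives

$$\|w_{\ell+1}\|^2 - \|w_\ell\|^2 = -2\delta\,\big\langle \nabla L(w_\ell; x_\ell, y_\ell),\, w_\ell \big\rangle + \delta^2\,\big\|\nabla L(w_\ell; x_\ell, y_\ell)\big\|^2 .$$

Since the per-sample gradient is proportional to the data point, $\|\nabla L\|^2 = \Theta(\|x\|^2) = \Theta(d/\lambda)$ when the covariance is $I_d/\lambda$. Over $\Theta(\delta^{-1})$ steps the drift term is $\Theta(1)$, while the accumulated second-order term is $\Theta(\delta^{-1} \cdot \delta^2 d/\lambda) = \Theta(\delta d/\lambda)$: negligible when $\delta \ll 1/d$, an $O(1/\lambda)$ correction at the critical scaling $\delta = \Theta(1/d)$, and divergent for larger step sizes. This is consistent with the authors' earlier remark that the critical-scaling correction gets absorbed into the $O(1/\lambda)$ error terms.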

Question 2: Thank you for this important question. The SGD analysis lets us understand the generalization error over the course of training. In the $k$-GMM setting, the population loss approaches $0$ on the same timescale as alignment occurs. In the XOR setting, the answer is more complicated. As shown in Ben Arous et al. (2022), on this timescale the SGD can converge to a classifier with poor generalization with positive probability over the initialization. In this case, the alignment still occurs, but we emphasize that when the latter happens, the corresponding principal eigenspace has dimension less than $4$: see Figure 4.3.

Comment

Thank you for the explanations. I would recommend incorporating the remarks about the second question into the manuscript. I don't have further questions and I'll keep my score.

Review (Rating: 6)

This paper studies SGD applied to two problems, namely multi-class logistic regression and XOR classification with a two-layer network. It is shown that in both scenarios, the training trajectory of SGD aligns with both the Hessian and the Fisher information during training. Moreover, for the XOR problem, the alignment occurs in both layers of the network.

Strengths

  • The results of alignment are technically strong.
  • The reasoning is solid and convincing.

Weaknesses

  1. The major weakness of this work is that the motivation for studying the evolution of the Hessian/Fisher information is not super clear. According to the introduction and the related works, from my perspective, the best thing we know about the relationship between the Hessian and the effectiveness of SGD is that "this common low-dimensional structure to the SGD and Hessian matrix may be key to many classification tasks in machine learning" (Gur-Ari et al., 2019). From my point of view, there is no solid evidence indicating that the main results of this work imply an advantage or a disadvantage of SGD. I will consider raising my score if this question is properly settled.
  2. Figure 3.3 and Figure 4.3 are a bit confusing without legends.
  3. In the paragraph following Theorem 4.2, the terms "GOE" and "Wishart matrix" may appear unfamiliar to many.
  4. In the notation for the covariance matrix $I_d/\lambda$, the symbol $\lambda$, which is typically a weight decay parameter, is used as the inverse of the variance, causing some confusion for me.

Questions

  • I'm trying to understand Theorem 3.1. Intuitively, as $\varepsilon$ approaches 0, $T_0$ should grow larger. Is it possible that $T_0$ becomes so large that the theorem becomes vacuous, i.e., $T_0 > M/d$?
  • Is there a counterpart of Theorem 3.3 for the XOR setting?
  • Is there any specific reason why the authors study the Hessian and Fisher information that is obtained from the test data instead of from the expectation w.r.t. the data distribution?

Comment

We thank the referee for their careful reading and helpful comments. We address these below.

“motivation of the work is not clear...”: The success of SGD for high-dimensional tasks is often attributed to its ability to find low-dimensional subspaces that capture the underlying data structure. This has not been rigorously shown, but it has long been predicted empirically and linked to outliers in the eigenspectra of loss Hessians and FI matrices. The goal of our work was to give this a strong theoretical foundation, demonstrating a precise manner in which it holds true in high-dimensional Gaussian mixture classification tasks, and to explore refined features of this low-dimensional alignment with more complex multi-layer networks. In these tasks, we also relate the outlier eigenspaces of the matrices to the relevant coordinates of the data, proving how and when the SGD succeeds at the classification task in terms of whether the outlier eigenspace of the Hessian has the same rank as the number of hidden classes.

Our goal was not to compare the performance of SGD to other optimization algorithms, though understanding their alignment phenomena would also be of significant interest. In particular, that may allow one to prove relative advantages/disadvantages of different learning algorithms.

“Figures 3.3 and 4.3”: We have modified the captions to explain what the different colors indicate in the figures. We will also try to vectorize the figures for the final version to make them clearer when zoomed in.

“GOE and Wishart unfamiliar”: We expanded the acronym GOE, and added a footnote with a reference to a classical random matrix theory textbook.

“$\lambda$ as a weight parameter”: We apologize for the confusion, and have added a phrase emphasizing that $\lambda$ is a signal-to-noise parameter in our paper.

Question 1. It is correct that as $\epsilon \to 0$, $T_0$ blows up. As long as one takes the number of samples $M$ to be a sufficiently large factor times (but still proportional to) the input dimension, we will have $T_0 < M/d$.

Question 2. We are not sure if this addresses your question, but in the XOR setting, we have already specialized the means to be orthonormal. It would be interesting to study the XOR phenomenology with more than two classes, and when the means are not orthonormal. We are happy to clarify further if this doesn’t address the question.

Question 3. Our analysis in fact contains the analogous results for the population Hessian and G-matrices. However, from a practical/data-driven perspective, the test matrices are more natural objects to analyze as those are what the practitioner has access to.

Comment

Thanks for the clarification! All of my confusions are resolved.

Public Comment

Dear Authors,

I read your ICLR submission with great interest.

It is a great contribution to understanding deep learning dynamics and the associated Hessian/Gradient Spectra.

I would like to draw your attention to two of my recent works which I think are very relevant but are absent in your work. Both of them strongly relate to deep learning dynamics.

[1] is a NeurIPS 2023 paper (arXiv 2022) focusing on analyzing the spectra of gradient covariances. [2] is an arXiv 2022 preprint focusing on analyzing the spectra of the Hessian.

References:

[1] Xie, Z., Tang, Q. Y., Sun, M., & Li, P. (2023, November). On the Overlooked Structure of Stochastic Gradients. In Thirty-seventh Conference on Neural Information Processing Systems.

[2] Xie, Z., Tang, Q. Y., Cai, Y., Sun, M., & Li, P. (2022). On the power-law Hessian spectrums in deep learning. arXiv preprint arXiv:2201.13011, 2.

Comment

Thank you for your comments and for bringing these works to our attention; we have included a reference in our revision.

AC Meta-Review

This is a great paper with technically solid contributions to the theory of high-dimensional SGD. A clear acceptance.

Why not a higher score

NA

Why not a lower score

NA

Final Decision

Accept (spotlight)