Flatter, Faster: Scaling Momentum for Optimal Speedup of SGD
We find a power-law relationship between the optimal momentum hyperparameter and the learning rate which maximizes generalization.
Abstract
Reviews and Discussion
The authors present a novel theoretical result on how the momentum hyperparameter and the learning rate jointly control the acceleration of SGD with momentum (SGDM). The main result is derived from stochastic process theory describing the trajectory of SGDM. Three models and datasets are also considered to support the derived roles of the learning rate and the momentum hyperparameter in moving along the longitudinal and transverse components of the trajectory. Balancing these ingredients of fast convergence to a flat local minimum is crucial for the runtime of the training process and the generalization of the obtained solution.
Strengths
I would like to thank the authors for the clear and sufficiently detailed introduction section that provides all the necessary results and ideas. The interpretation of convergence as a trade-off between two main scales (longitudinal and transverse) looks interesting and promising for further development in the case of adaptive step-size optimizers. The experimental results presented in Section 4 confirm the derived theoretical dependence of the momentum hyperparameter on the learning rate. The presented results can help practitioners tune hyperparameters more efficiently and reduce the consumption of computational resources.
Weaknesses
The weaknesses of the presented study are listed below
- Section 3 is very hard to read for non-experts in stochastic process theory. I suggest the authors compress it and expand the experimental evaluation section instead.
- Figure 2 presents fitted-line results that confirm the estimated dependence rule; however, I do not find an analysis of the variance of the derived estimate. I am not sure whether the dependence would change significantly if one added more points to the plots.
Questions
- The authors consider simple models from computer vision tasks, such as ResNet18 and an MLP. Could you please list the assumptions on the deep neural network that are necessary for the correctness of the derived relation between the learning rate and the momentum hyperparameter? These assumptions would be very helpful for extending the presented results to models from other domains, for example transformer-based LLMs.
- Is it possible to extend the proposed approach from Heavy Ball to a Nesterov-like acceleration scheme? If so, please comment on the potential obstacles to deriving similar results.
- This question relates to weakness 2: how robust is the derived estimate w.r.t. new points that may be added to the plots?
- How can the derived relations between the learning rate and the momentum term be interpreted from the perspective of loss-surface properties? What can the derived 2/3 power rule highlight about the loss surfaces of the considered models?
We thank the referee for their positive feedback on our paper, and for the constructive questions that helped us expand and improve our presentation. We believe we have addressed all the referee's questions. We detail our replies and the updates to the draft below:
1. The main assumption is that the neural network is in the overparameterized regime. There is extensive literature (as discussed in Sec. 2) supporting the statement that, in overparameterized neural networks, global minima tend to form connected manifolds, which we call "valleys". A separate but related assumption is that the closer the weights are to the valley, the more applicable we expect our prediction to be. Given that the relationship between overparameterization and the existence of zero-loss valleys has been studied across many models, we believe these assumptions are architecture-independent, and we expect them to apply to transformer-based models as well, in particular in the context of fine-tuning, where models are often overparameterized (see e.g. https://arxiv.org/abs/1908.05620). We also point out that, in our experiments, even when training is initialized far from the valley (e.g. due to random initialization), we still observe the scaling law we propose. In fact, we have now added in Sec. 4.3.1 and Fig. 2 (bottom right) an experiment training a ResNet-18 model on CIFAR-10 from random initialization, and we observe the optimal scaling relationship we propose.
2. Extending this approach to Nesterov acceleration involves evaluating the gradients at the lookahead point (the weights shifted by the momentum term) instead of at the current weights. Technically, while it would be somewhat more involved to derive the limit diffusion equation of Theorem 3.4, we expect this to still be doable, since that result is derived through a Taylor expansion around the valley and the Theorem is valid in the limiting case in which the configuration is close to the valley of global minima. We also note that it is in principle possible that Nesterov-like acceleration schemes lead to a power different from 2/3. The important point is that there would still be a power law, since the steps involved in the derivation would be qualitatively similar.
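For concreteness, the two update rules in a standard form (a schematic sketch with generic symbols: $\theta_t$ for the weights, $v_t$ for the momentum buffer, $\beta$ for the momentum hyperparameter and $\eta$ for the learning rate; the paper's own notation may differ):

$$\text{Heavy ball:}\quad v_{t+1} = \beta v_t - \eta \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$
$$\text{Nesterov:}\quad v_{t+1} = \beta v_t - \eta \nabla L(\theta_t + \beta v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$

so the only change is that the gradient is evaluated at the lookahead point $\theta_t + \beta v_t$.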
3. We included error bars for our numerical estimates (see e.g. Fig. 2), derived with a standard least-squares fitting routine that computes errors on the parameter estimates using the "linear approximation method" from Vugrin, Kay White, et al., "Confidence region estimation techniques for nonlinear regression in groundwater flow: Three case studies," Water Resources Research 43.3 (2007). In our case we assume that the true data lie on a line with i.i.d. noise. While this assumption may not be strictly satisfied (e.g. the noise of the data points may depend on the learning rate, or there might be deviations from our scaling law due to the finite strength of the noise we use in experiments), the data show a clear linear relationship and such deviations are not apparent. Responding specifically to the question about adding more points: we do not expect the estimate to change significantly, because new points would be independent from the old points and, as long as they lie inside the extensive range of learning-rate values we have explored, approximately identically distributed as well. This means that our error estimate of one standard deviation around the best-fit value provides a good measure for the expected deviation of the estimator of the slope.
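For illustration, a minimal sketch of this kind of fit and error estimate (not our actual analysis script; the library choice and the data values below are placeholders):

```python
import numpy as np

# Placeholder measurements: learning rates and the optimal momentum
# hyperparameters found for each of them (values chosen for illustration).
etas = np.array([1e-3, 3e-3, 1e-2, 3e-2, 1e-1])
beta_opt = np.array([0.990, 0.979, 0.954, 0.904, 0.785])

# Fit log(1 - beta_opt) = gamma * log(eta) + log(C) with a linear model.
x = np.log(etas)
y = np.log(1.0 - beta_opt)
coeffs, cov = np.polyfit(x, y, deg=1, cov=True)
gamma_hat, logC_hat = coeffs
gamma_err = np.sqrt(cov[0, 0])  # one standard deviation on the fitted slope

print(f"gamma = {gamma_hat:.3f} +/- {gamma_err:.3f}, C = {np.exp(logC_hat):.3f}")
```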
4. As we mentioned in point 1, the scaling law is related to the presence of a zero-loss valley. We do not expect the specific power of 2/3 to hold for a generic "rough" loss surface. An additional observation is that the power we found holds empirically for random initial conditions, i.e. even when we initialize far from any valley. We observed this across various datasets (artificial, FashionMNIST, and CIFAR10). In terms of the loss surface, this could mean that a large fraction of training is spent near a zero-loss valley.
Following the referee's suggestions, we compressed part of Secs. 3.1 and 3.2, leaving only the necessary definitions and the most important results. We extended the experiments section (Sec. 4) by including the results we obtained after the original submission.
Dear authors, thank you for the detailed response and the corresponding modifications of the main text. The presentation is now better written and more convincing. All raised questions and weaknesses have been addressed. Thus, my score remains the same.
This paper extends the analysis of (Li et al., 2022) to SGDM. Based on this analysis, the optimal momentum hyperparameter for accelerating training without hurting generalization is studied. Experiments on matrix sensing, a 6-layer MLP on FashionMNIST, and ResNet-18 on CIFAR10 are conducted to support the theoretical claims.
Strengths
- It is an important topic to study how the momentum hyperparameter is optimally picked in deep learning, as it allows for shrinking the search space of hyperparameters.
- The theoretical results (at least the part I have understood) are solid.
Weaknesses
- The biggest concern I have is regarding the presentation of this paper. The quantities $\tau_1$ and $\tau_2$ are the focus of this paper, but they are only defined through informal descriptions in the introduction (for example, "Define $\tau_2$ to be the number of time steps it takes so that the displacement of the weights along the zero-loss manifold becomes a finite number as we take one small parameter → 0 first, and the other → 0 afterward"), with no (even informal) definition elsewhere. This makes these two terms extremely hard to interpret and understand.
- This paper only considers optimization around the manifold of global minima. Although this setting is inherited from (Li et al., 2022), I wonder whether this framework can characterize the convergence rate along the whole trajectory, which is of more interest. For example, in Figure 2, the initialization is picked where perfect interpolation has already been achieved. What happens if the initialization is chosen as, for instance, Kaiming's initialization?
- The experiments are too few and too toy-like considering that this paper aims to provide a modification of the algorithm: there is only an experiment with a 2-layer MLP and an experiment on CIFAR 10 with a very uncommon initialization (as discussed above).
Questions
See weaknesses above.
We thank the referee for their constructive questions that helped us expand and improve our presentation, and that led us to design further experiments. We believe we have addressed all the referee's questions. We detail our replies and the updates to the draft below:
1. We appreciate the referee's feedback on the definitions of $\tau_1$ and $\tau_2$. Regarding $\tau_1$, in the introduction we replaced "mixing time" with "the inverse of the smallest nonzero eigenvalue of the damping term", which is more self-contained and should completely clarify what we mean. Regarding $\tau_2$, we added Appendix B to provide a formal definition of this quantity and included two more sentences at the end of the first paragraph of Sec. 1.2 to make the reasoning around $\tau_2$ more explicit. We hope these elements are now clarified.
2. We thank the referee for this question, as it prompted us to further investigate empirically what happens with random initialization. We had already used random initialization for part of our experiments (matrix sensing and the MLP on the artificial dataset), showing behavior consistent with the scaling law we predict, without the need to initialize near the manifold of global minima. To answer the referee's question, we performed a new set of experiments in which we trained ResNet-18 on CIFAR10 with Kaiming's initialization. We found clear evidence of the optimality of the 2/3 scaling law in this setting, as illustrated in Fig. 2 (bottom right) and Sec. 4.3.1. From a theoretical standpoint, the properties of the loss surface are known to be very challenging to characterize in generic deep-learning models, in particular away from the zero-loss valley. On one hand this makes it hard to rigorously explain this last empirical finding; on the other hand, we believe that, for this very reason, the empirical evidence we found is even more valuable.
3. As mentioned in point 2, stimulated by the referee's question, we took the opportunity to look at the standard setting of training with Kaiming initialization and SGD noise on CIFAR 10, finding strong empirical evidence that our prediction holds in realistic scenarios. We respectfully remind the referee that we also performed experiments on FashionMNIST using a deep MLP, which also provide positive evidence for our prediction. The aim of this paper is to develop a rigorous theory to ground our predictions, and to initiate and display its verification across progressively more complex models.
The paper analyzes a noisy version of gradient descent (meant to be a proxy for SGD but simpler to analyze theoretically) in the presence of momentum. Given momentum parameter \beta and learning rate \eta, two timescales are identified: 1) one corresponds to the relaxation time to the Gaussian distribution in the directions transverse to the zero-loss manifold; 2) the other corresponds to the amount of time it takes to have finite displacements along the zero-loss manifold. The authors argue that the most efficient training is obtained when the two timescales are comparable, which implies a relation between eta and beta. In the limits of small noise, small learning rate and large times, the authors derive the SDE of the limit diffusion process. The process is driven toward flat minima (small Hessian trace).
Strengths
- The paper is well written and well organized.
- The paper provides numerical experiments on synthetic and natural settings.
- Both rigorous proofs and heuristic arguments are given.
- The theoretical analysis of the timescales provides the simple prescription gamma=2/3 that is shown to speed up training and also give the best generalization in some settings.
Weaknesses
- The work seems to be a relatively straightforward extension of Li et al. (2022).
- It is not clear if the optimality of the gamma=2/3 exponent applies to standard SGD as well.
- The theory does not give a prescription for the prefactor C. I think this makes it not so useful in practice.
Questions
- What happens when using standard SGD instead of the label noise one? Can the author provide numerical evidence that gamma=2/3 is still a good exponent? My understanding is that all numerical experiments are carried out with SGD label noise.
- Could the author clarify the argument for the tau1 = tau2 criterion? In particular, it is not clear to me why we should care about transverse equilibration. What seems relevant is to move fast toward the flat region while staying close enough to the zero-loss manifold.
- What happens if phase 1 is also trained with SGD instead of GD? The starting point for the subsequent SGD label-noise diffusion would already be in a flat region, so there wouldn't be much drift?
- Is Fig 1 obtained with a single value of \eta and varying \gamma? Could the author show multiple sets of points corresponding to different eta values?
We thank the referee for their positive feedback on our paper, and for the constructive questions that helped us expand and improve our presentation. We believe we have addressed all the referee's questions. We detail our replies and the updates to the draft below:
1. The referee is correct that our analysis so far focused on SGD with label noise. We carried out a new set of simulations training ResNet-18 on CIFAR10 using plain SGD noise instead of label noise, and we find a clear signal of the gamma=2/3 scaling relationship. Strictly speaking, our mathematical framework currently assumes that the noise is constant, which is not quite the case for SGD noise and is the reason why we did not consider SGD noise in our experiments in the first place. Yet, this result suggests that in practice the differences with SGD noise may not matter much as far as the power law is concerned. It is gratifying to see empirically that the optimal scaling law still applies, and we are grateful to the referee for bringing up this question. We included the new results in Sec. 4.3.1 and Fig. 2 (bottom right).
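For clarity about the two noise models being compared, a minimal sketch (illustrative only; `model`, the MSE loss, and the noise scale `sigma` are placeholder choices, not the exact setup of the paper):

```python
import torch
import torch.nn.functional as F

def label_noise_step(model, x, y, opt, sigma=0.1):
    """Full-batch gradient step with label noise: Gaussian noise is added to
    the targets at every step, so the gradient noise has a roughly constant
    strength set by sigma (the constant-noise setting assumed by the theory)."""
    opt.zero_grad()
    y_noisy = y + sigma * torch.randn_like(y)
    F.mse_loss(model(x), y_noisy).backward()
    opt.step()

def minibatch_sgd_step(model, x, y, opt, batch_size=128):
    """Plain minibatch SGD step: the noise comes from subsampling the data,
    so its strength depends on the current weights and on the batch."""
    idx = torch.randperm(x.shape[0])[:batch_size]
    opt.zero_grad()
    F.mse_loss(model(x[idx]), y[idx]).backward()
    opt.step()
```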
2. The number of timesteps it takes for the model to converge to a low-curvature point of the zero-loss valley is controlled by the slower of $\tau_1$ and $\tau_2$. By "controlled" we mean that the longer this slower timescale is, the longer it takes to converge to such a low-curvature point. More specifically, optimality is reached when $\tau_1 \simeq \tau_2$ because, as illustrated in Fig. 1(a), if $\tau_1$ becomes faster, $\tau_2$ becomes slower, and vice versa. The transverse timescale is important because, if it is too slow, there might still be residual transverse fluctuations even though convergence has happened longitudinally (similar to the top left of Fig. 2).
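As a purely heuristic illustration of how the 2/3 power can arise from this balance (the scalings below are schematic shorthands, not the rigorous statements of Sec. 3): suppose the transverse equilibration time scales as $\tau_1 \sim 1/(1-\beta)$ (set by the damping of the momentum buffer), while the longitudinal displacement time scales as $\tau_2 \sim (1-\beta)^2/\eta^2$ (the drift along the valley is driven by the noise through an effective step size $\eta/(1-\beta)$ and enters at second order). Then

$$\tau_1 \simeq \tau_2 \;\Longleftrightarrow\; \frac{1}{1-\beta} \simeq \frac{(1-\beta)^2}{\eta^2} \;\Longleftrightarrow\; (1-\beta)^3 \simeq \eta^2 \;\Longleftrightarrow\; 1-\beta \simeq \eta^{2/3},$$

i.e. the scaling $\beta = 1 - C\,\eta^{2/3}$ with some prefactor $C$.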
3. Regarding our previous experiment on CIFAR10 (where we explicitly separated phase 1 and phase 2 in the training schedule), label noise would have continued to cause a drift unless SGD had already caused the weights to converge to the widest minimum by the end of phase 1. Please let us know if this answers your question; we are happy to discuss more.
4. Fig. 1 is obtained using various values of $\eta$ for each fixed value of $\gamma$. As explained in the last paragraph before App. A.2, the way we extracted the exponent was to fit the number of timesteps to convergence as a power law in the learning rate: we performed several runs with different values of $\eta$ and the associated $\beta$, and extracted the prefactor and the exponent through a fit. Because of this, the plots of Fig. 1 are only defined when running across several values of $\eta$; it is not possible to plot them for a single value of $\eta$.
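Schematically, the procedure looks as follows (a sketch only; `steps_to_convergence` stands in for running the actual training until our convergence criterion is met and is not a real function of our code):

```python
import numpy as np

def steps_to_convergence(eta, beta):
    """Placeholder: train with learning rate eta and momentum beta and
    return the number of timesteps until the convergence criterion is met."""
    raise NotImplementedError

def fit_convergence_exponent(etas, gamma, C=1.0):
    """For a fixed gamma, set beta = 1 - C * eta**gamma for each eta,
    measure the steps to convergence, and fit a power law in eta
    by linear regression in log-log space."""
    t_conv = [steps_to_convergence(eta, 1.0 - C * eta**gamma) for eta in etas]
    slope, intercept = np.polyfit(np.log(etas), np.log(t_conv), deg=1)
    return -slope, np.exp(intercept)  # exponent and prefactor of the power law
```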
Additionally, we have two remarks about the weaknesses mentioned by the referee:
- Prefactor: We appreciate the referee's critique that the theory does not give a prescription for the prefactor $C$. In principle, this could mean that we have not reduced the search space of hyperparameters, but we note two things. First, given that the optimal power is fixed to 2/3 by the theory, to know $C$ one only needs to perform a few runs to find the optimal momentum hyperparameter for a single fixed value of the learning rate. From this information it is then possible to extract the best momentum hyperparameter for any other value of the learning rate, thanks to the knowledge of the exponent. This effectively reduces the hyperparameter search to one parameter instead of two. Secondly, and intriguingly, we noted that the prefactor takes approximately the same value in all our experiments, independently of the models and tasks we considered.
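Concretely, the transfer can be done as follows (a small illustrative helper, not code from the paper): once an optimal pair $(\eta_0, \beta_0)$ has been found by a short search at a single learning rate, the scaling law fixes the momentum for any other learning rate.

```python
def momentum_for_lr(eta, eta0, beta0, gamma=2.0 / 3.0):
    """Given one calibrated pair (eta0, beta0), return the momentum predicted
    by the scaling beta = 1 - C * eta**gamma, with C = (1 - beta0) / eta0**gamma."""
    C = (1.0 - beta0) / eta0**gamma
    return 1.0 - C * eta**gamma

# Example: calibrated at eta0 = 0.1 with beta0 = 0.8, the predicted
# momentum at eta = 0.01 is about 0.957.
print(momentum_for_lr(0.01, eta0=0.1, beta0=0.8))
```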
- Simple extension: We would also like to note that, while (Li et al., 2022) was a main result inspiring our research, our framework has substantial differences. In order to treat the momentum hyperparameter as an additional small parameter (in addition to the learning rate $\eta$), we had to develop a completely different scheme of limits to derive our final equation. In that derivation, we relied on the small-noise limit rather than the small-learning-rate limit.
The paper proposes a new scaling of the momentum parameter beta (setting beta to 1 - eta^{2/3}). The paper derives this scaling via a theoretical analysis, which is the primary contribution of the paper. The analysis identifies two timescales: one corresponds to the time to converge to the stationary distribution in the directions transverse to the zero-loss manifold; the second comes from computing the amount of time it takes to have finite displacements along the zero-loss manifold. The authors propose to set these two timescales to be comparable, which leads to the proposed relation between eta and beta.
The paper's proposal is reasonably well motivated and the analysis is built upon the framework proposed by Li et al. While reviewers liked the analysis in the paper, the efficacy of such a proposal ultimately hinges upon how well it does in experiments, as there is no proof of a 'speedup' via this scaling. The experiments, as pointed out by reviewers, are a little lacking: they either verify the theory on small-scale models, or study larger models in a regime that it is unclear is ever reached during training. Upon request from the reviewers, a new experiment was added on a more standard training task, but the experiment was done during the rebuttal period, and its description is limited and its scope not sufficient to gauge whether the proposal leads to a more efficient scaling setup.
Overall, I recommend the authors consider expanding the scope of the experiments to larger models and baselines, and performing a clear verification of whether their proposal leads to a cleaner result. It would further be good to highlight the benefits of this scaling over using default values of momentum. I also request the authors to do a better job of writing and explaining their proposed scaling. As a reviewer also pointed out, the descriptions are too limited to fully understand the derivation quickly.
Why not a higher score
The paper is borderline, with a reasonable proposal and useful analysis to back it. However, the efficacy of the proposal can only be verified by experiments, and on that front I found the paper lacking. Please also see my meta-review.
Why not a lower score
NA
Reject