ICLR 2025
EDGE OF STOCHASTIC STABILITY: REVISITING THE EDGE OF STABILITY FOR SGD
Abstract
Recent findings by Cohen et al. (2021) demonstrate that during the training of neural networks with full-batch gradient descent at step size $\eta$, the sharpness, defined as the largest eigenvalue of the full-batch Hessian, consistently stabilizes at $2/\eta$. These results have significant implications for generalization and convergence. Unfortunately, this was observed not to be the case for mini-batch stochastic gradient descent (SGD), limiting the broader applicability of these findings.
We empirically discover that SGD trains in a different regime, which we call the Edge of Stochastic Stability. In this regime, the quantity that hovers at $2/\eta$ is instead the average, over batches, of the largest eigenvalue of the Hessian of the mini-batch loss, which is always larger than the sharpness.
This implies that the sharpness is generally lower when training with smaller batches or larger learning rates, providing a basis for the observed implicit regularization effect of SGD towards flatter minima and for a number of well-established empirical phenomena.
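Both quantities compared against $2/\eta$, the sharpness of the full-batch loss and the batch-wise average of the largest mini-batch Hessian eigenvalue, can be tracked with Hessian-vector products. The following is a minimal PyTorch sketch, not the authors' implementation: it estimates the top Hessian eigenvalue of a given loss via power iteration and averages it over a few mini-batches. The model, loss_fn, loader, and the number of power-iteration steps are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): estimate the largest Hessian
# eigenvalue of a loss via power iteration on Hessian-vector products.
import torch


def top_hessian_eigenvalue(loss, params, iters=20):
    """Power iteration on the Hessian of `loss` w.r.t. `params`.

    Power iteration converges to the eigenvalue of largest magnitude,
    which for these training losses is typically the top (positive) one.
    """
    # Keep the gradient graph so we can differentiate through it again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        # Normalize the current direction.
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product via the Pearlmutter trick: d/dθ (∇L · v).
        dot = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(dot, params, retain_graph=True)
        # Rayleigh quotient v^T H v approximates the top eigenvalue.
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig


def batch_sharpness(model, loss_fn, loader, device="cpu", num_batches=8):
    """Average of the top mini-batch Hessian eigenvalue over several batches."""
    params = [p for p in model.parameters() if p.requires_grad]
    vals = []
    for i, (x, y) in enumerate(loader):
        if i >= num_batches:
            break
        loss = loss_fn(model(x.to(device)), y.to(device))
        vals.append(top_hessian_eigenvalue(loss, params))
    return sum(vals) / len(vals)
```

Under these assumptions, comparing the full-batch estimate and `batch_sharpness` against $2/\eta$ over the course of training is the kind of measurement the abstract describes.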
Keywords
Edge of Stability, Stochastic Gradient Descent, Neural Networks
Reviews and Discussion
Withdrawal Notice from the Authors
We posted the wrong version by mistake and noticed only after the deadline.
Best regards,
The Authors