Training Deep Learning Models with Norm-Constrained LMOs
Abstract
Reviews and Discussion
This paper develops an algorithmic framework that can exploit an appropriate choice of norm for the entire neural network, with an emphasis on hyperparameter transfer across model sizes. The algorithm, called uSCG (unconstrained stochastic conditional gradient method), shows improvements both theoretically and practically when the norm-ball constraint matches the natural geometry of the problem.
update after rebuttal
My concerns have been addressed. Good paper.
Questions for Authors
- Since you build on top of modded-nanoGPT, I have to ask the question: what is your speedrun result against Muon?
Claims and Evidence
The evidence presented supports the claims.
Methods and Evaluation Criteria
The methods and evaluation both make sense.
Theoretical Claims
The theoretical claims are reasonable, though I have not checked the proofs in detail. I really like the part where different optimizers get organized under the LMO framework and presented in a clean and clear way. Very insightful paper!
Experimental Design and Analyses
The experiments are strong enough to support the claims for a mainly theoretical paper.
Supplementary Material
I have briefly scanned through the material for the theoretical part. No strong inconsistency has been spotted.
Relation to Existing Literature
This work is broadly related to a series of works that aim to beat AdamW with new optimizer proposals, such as Shampoo, SOAP, and Muon, usually in the context of large language model pretraining and finetuning.
Essential References Not Discussed
Not that I am aware of.
Other Strengths and Weaknesses
The insight and systematic analysis are quite interesting; unifying everything under the LMO framework gives me a new perspective on looking at things.
Other Comments or Suggestions
N/A
We are happy that the reviewer found the paper insightful.
"Since you build on top of modded-nanoGPT, I have to ask the question: what is your speedrun result against Muon?"
The speedrun configuration is the 124M parameter model with batch size 512 in Fig. 1. In this setting, Scion consistently (across 3 runs) achieves a slightly lower validation loss than Muon with the same wallclock time, which would translate into a slightly better wallclock time if the number of iterations were reduced to exactly match the validation loss of Muon.
The more substantial gain occurs at larger batch sizes. We conduct an additional experiment for the batch size 6144 configuration of Fig. 2, where we simply reduce the number of iterations until Scion matches the larger validation loss of Muon. We find that Scion can achieve the same validation loss as Muon with a 25% smaller wallclock time in this setting.
The paper introduces the unconstrained Stochastic Conditional Gradient method (uSCG), which builds upon classical Conditional Gradient methods by utilizing linear minimization oracles (LMOs) even for unconstrained optimization problems. This approach leverages the geometry of the problem by assigning different norms to different layers, particularly in language modeling tasks. The authors provide convergence guarantees for uSCG with constant batch sizes and introduce a principled framework for selecting effective constraints based on the input and output space geometries of neural network layers. The paper also discusses the relationship between their method and hyperparameter transfer techniques, such as the Maximal Update Parametrization, which allows for transferring optimal hyperparameters from smaller proxy models to larger ones. The numerical experiments show that the optimal learning rate for the proposed method to minimize validation loss is independent of the model width.
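To make the per-layer norm idea concrete, here is a minimal NumPy sketch of an LMO-style update that assigns different norm balls to different layers; the `lmo_spectral`/`lmo_sign` helpers, the layer names, and the radii are illustrative assumptions for this discussion, not the paper's exact algorithm.

```python
import numpy as np

def lmo_spectral(G, rho):
    # LMO over a spectral-norm ball of radius rho: -rho * U @ V^T from the reduced SVD of G
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -rho * U @ Vt

def lmo_sign(G, rho):
    # LMO over a max-norm-type ball of radius rho: a scaled Sign update
    return -rho * np.sign(G)

def lmo_step(params, grads, gamma, radii):
    """One unconstrained LMO step: hidden weight matrices get the spectral LMO,
    while the output layer gets the Sign LMO (a hypothetical assignment)."""
    new_params = {}
    for name, W in params.items():
        lmo = lmo_sign if name == "output" else lmo_spectral
        new_params[name] = W + gamma * lmo(grads[name], radii[name])
    return new_params
```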
Questions for Authors
No.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
No.
Experimental Design and Analyses
Yes.
Supplementary Material
No.
Relation to Existing Literature
Not applicable.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
The theoretical novelty of this paper in Section 5 is sound and concrete.
Nevertheless, the algorithmic contribution of this paper is unclear. Both Algorithms 1 and 2 are simply steepest descent methods with momentum. One minor novelty is that the authors use different norms for different layers, as described in Section 3.1. Specifically, the authors use Sign in the output layers. According to the numerical experiments, this change does make the optimal learning rate of the proposed method independent of the model width, but the validation loss improvement is negligible in Figure 1. The experiments on batch size sensitivity do support that the proposed method has better validation loss compared to Muon when the batch size is large. Nevertheless, the validation loss curves for all compared algorithms appear to increase with batch size, and at the smallest batch size of 512 the difference between the proposed method and Muon is not significant.
Besides, the numerical experiments in this paper are restricted to nanoGPT. It would be better to try different transformer architectures to show the applicability of the proposed method in LLM pre-training.
Other Comments or Suggestions
The algorithmic descriptions in Algorithms 1 and 2 are the same. Is this a typo?
We thank the reviewer for their feedback. We believe there are important misunderstandings that we address below.
"The algorithmic descriptions in algorithm 1 and 2 are same. Is this a typo?"
This is not a typo: the two algorithms are distinct and solve different problems, as detailed in Section 3. Algorithm 1 solves an unconstrained problem, while Algorithm 2 solves a norm-ball constrained problem.
For deep learning this leads to distinct behaviour of the weight norms between Algorithms 1 and 2, as illustrated in Fig. 8-9, since only Algorithm 2 ensures that the norm does not grow beyond the constraint radius. This explicit norm control can be particularly important for numerical stability and to avoid overfitting, as seen e.g. in the CIFAR-10 experiments, where Algorithm 2 (Scion) works without the heuristic weight normalization otherwise present in the Muon implementation for this problem.
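To illustrate the distinction, here is a minimal sketch (assuming the standard conditional-gradient form and an l-infinity-ball LMO purely as an example; this is not the paper's verbatim pseudocode): Algorithm 1 adds the LMO output to the current iterate, so the weight norm can keep growing, while Algorithm 2 moves the iterate toward the LMO output, keeping it inside the norm ball.

```python
import numpy as np

def lmo_linf(d, rho):
    # LMO over the l-infinity ball of radius rho: argmin_{||s||_inf <= rho} <d, s> = -rho * sign(d)
    return -rho * np.sign(d)

def uscg_step(x, d, gamma, rho):
    # Algorithm-1-style (unconstrained) update: the weight norm is free to grow over iterations
    return x + gamma * lmo_linf(d, rho)

def scg_step(x, d, gamma, rho):
    # Algorithm-2-style (constrained) update: a convex combination with a point inside the ball,
    # so if ||x||_inf <= rho and 0 <= gamma <= 1, the next iterate also stays inside the ball
    return (1 - gamma) * x + gamma * lmo_linf(d, rho)
```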
"Both algorithm 1 and 2 are simply steepest descent methods with momentum"
Neither of our two proposed algorithms is steepest descent (please see Section 4, where we explicitly compare with steepest descent). Algorithm 2 is a Conditional Gradient (aka Frank-Wolfe) based algorithm which applies to constrained problems, while with Algorithm 1 we show that the LMO can be used even for unconstrained problems.
One important distinction from steepest descent is that the stepsize does not have to depend on the Lipschitz constant (see our main results, Lemmas 5.3-5.6). Intuitively, this is due to the scale invariance of the LMO, which makes the algorithm more stable (the algorithm is less sensitive to the curvature).
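As a small numeric illustration of the scale invariance we have in mind (a hypothetical check, not code from the paper): rescaling the LMO's input by any positive constant leaves the output unchanged, so the step length is set by the stepsize and the radius rather than by the gradient scale or local curvature.

```python
import numpy as np

def lmo_linf(d, rho=1.0):
    # LMO over the l-infinity ball: -rho * sign(d)
    return -rho * np.sign(d)

d = np.array([0.3, -2.0, 5.0])
for c in (1e-3, 1.0, 1e3):
    # the LMO output is unchanged under positive rescaling of its input
    assert np.allclose(lmo_linf(c * d), lmo_linf(d))
```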
"The experiments on batch size sensitivity does support that the proposed method has better validation loss compares to Muon when batch size is large. Nevertheless, the validation loss curves for all compared algorithms look to be increasing w.r.t. batch size"
The loss increases with increasing batch size in the experiments because we keep the total number of tokens fixed. The validation loss as a function of batch size is expected to be worse for any optimizer in this setting – the comparison criterion is instead the rate at which performance deteriorates. As the reviewer points out, Scion achieves a better validation loss than the baselines for larger batch sizes.
One way to quantify the validation loss improvement is to ask how much more quickly Scion can achieve the same validation loss as one of the baselines. To this end, we conduct an additional experiment for the large batch size of 6144, where we simply reduce the number of iterations until Scion matches the larger validation loss of Muon. We find that Scion can achieve the same validation loss as Muon with a 25% smaller wallclock time.
The authors propose a new stochastic family of algorithms exploiting the so-called Linear Minimization Oracle (LMO) to solve optimization problems. It provides a more general framework and unifies many other algorithms as special cases. Theoretical guarantees as well as significant speed-up on the training of models such as nanoGPT demonstrate the effectiveness of their proposal.
Questions for Authors
I have one question: From what I understand, all the LMOs calculated in Tables 2, 3 and 4 do not take into account the non-linear activation function. If that is the case, do you think taking them into account can improve the optimization process? After all, the geometry of the data after the activation is what really matters, and not the pre-activated ones.
Claims and Evidence
All the claims and experimental evidence are clearly presented and justified. I hope the code is publicly available once the paper is published.
Methods and Evaluation Criteria
The proposed framework is quite interesting and well presented. The experiments conducted look credible and show a certain improvement over existing methods.
Theoretical Claims
I found all theoretical claims sound. I admit that I did not check the proof very carefully in the Appendix. However, for the parts that I did check, there is not any mathematical error.
To be verified: In Lemma D.1, I think the right constant is , and not .
Experimental Design and Analyses
I think the experimental designs are justified and credible.
Supplementary Material
I read several proofs in the Supplementary Material, notably Sections C and D.
Relation to Existing Literature
The paper provides a general framework to unify several methods in the literature.
Essential References Not Discussed
One particular line of related research (that is quite old) is natural gradient descent [1]. It also exploits the geometry of the function by defining a pseudo-metric on the parameter space via the KL-divergence. It is not very popular nowadays since: 1) calculating the Fisher matrix and its inverse is quite expensive, and 2) its empirical performance in the stochastic setting is not very impressive. However, I think it is worth mentioning because, to the best of my knowledge, it is among the first papers stepping away from the Euclidean metric and proposing a new metric for optimization.
[1]: S. Amari, "Natural Gradient Works Efficiently in Learning," in Neural Computation, vol. 10, no. 2, pp. 251-276, 15 Feb. 1998, doi: 10.1162/089976698300017746.
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
We thank the reviewer for their positive feedback.
"To be verified: In Lemma D.1, I think it is the right constant is , and not ."
The right constant is , since smoothness is taken w.r.t. the arbitrary norm . This is what eventually yields the dependency on seen in the main results of Section 5.
What might have caused the confusion is that there is a minor typo in the statement of Lemma D.1, since the guarantee of Lemma D.1 should be stated in terms of instead of . This is just a typo though, that does not change the final result, since the following Lemma 5.3 is correctly stated in terms of the dual norm. We will correct the typo in the final version.
"One particular line of related research (that is quite old) is natural gradient descent [1]. It also exploits the geometry of the function by defining a pseudo-metric in the parameters space by KL-divergence. It is not very popular nowadays since: 1) Calculating the Fischer matrix and its inverse is quite expensive, and 2) The empirical performance in stochastic setting is not very impressive. However, I think it is worth mentioning because it is among the first papers stepping away from the Euclidean metric and proposing a new metric for optimization, to the best of my knowledge. "
This is a good point. We will remember to cite and discuss natural gradient descent.
Geometry of activation functions
It might be possible to further improve performance by taking the geometry of activation functions into account, but it seems challenging. One place to look for ideas might be the vast literature on initialization.
I thank the authors for their responses. I have no further questions and I keep my score as it is.
This paper proposes a new approach to optimizing DNNs that leverages the Linear Minimization Oracle (LMO) over norm-constrained sets. The core idea is to adapt the optimizer a priori to the geometry of the neural network, rather than relying solely on on-the-fly adaptive methods like Adam. The authors introduce a new stochastic algorithm, unconstrained Stochastic Conditional Gradient (uSCG), which surprisingly applies the LMO concept to unconstrained problems. They also adapt the existing constrained Stochastic Conditional Gradient (SCG) method for non-convex deep learning. The key to their approach is choosing a norm for the LMO that reflects the structure of the neural network, particularly operator norms between layers. The paper claims that this approach leads to improved performance, better batch size robustness, memory efficiency, and zero-shot hyperparameter transferability across models of different widths. The authors provide theoretical convergence guarantees and empirical results on nanoGPT language modeling and CIFAR-10 image classification tasks. The framework unifies several known optimization algorithms like Normalized SGD, SignSGD, and Muon.
Questions for Authors
Comparison to other optimizers should be made on more diverse datasets.
Claims and Evidence
The claims are generally supported by evidence. However, while the claim of improved performance and hyperparameter transferability is well supported for language modeling, the evidence is weaker for image classification tasks, lacking comparison to other optimizers in this domain.
Methods and Evaluation Criteria
The core methodological idea of using norm-constrained LMOs, and particularly the uSCG algorithm, is novel and makes sense for adapting the optimizer to the network structure. The connection to operator norms is theoretically well-motivated. The nanoGPT experiments, including the use of a 3B parameter model, are appropriate for evaluating performance and hyperparameter transfer in language modeling. The CIFAR-10 experiments are insufficient for evaluating the method's general applicability. Comparison to other optimizers should be made on more diverse datasets.
Theoretical Claims
I didn't thoroughly check the correctness of the proofs.
Experimental Design and Analyses
The nanoGPT experiments are well-designed to test the core claims on language modeling. Varying model width, batch size, and comparing to Adam and Muon provides a strong evaluation. The CIFAR-10 experiments show the transferability of the optimal stepsize, but they do not establish the method's competitiveness in image classification.
Supplementary Material
The supplementary material is comprehensive and well-organized, providing more details for understanding and reproducing the results.
Relation to Existing Literature
The paper connects to several areas of the literature, including conditional gradient methods and adaptive optimization.
Essential References Not Discussed
None.
Other Strengths and Weaknesses
None.
Other Comments or Suggestions
The paper could benefit from a clearer explanation of the intuition behind why uSCG works for unconstrained problems.
We thank the reviewer for their feedback and address the remaining concerns below.
"The CIFAR-10 experiments show the transferability of the optimal stepsize, but they do not establish the method's competitiveness in image classification"
To establish competitiveness on image classification we conduct a speedrun to achieve the 94% test accuracy target on the airbench codebase. Scion achieves a 94.08% test accuracy (averaged over 200 runs) in the same wallclock time that the SOTA method Muon uses to achieve the target accuracy. This provides a 13.6% speedup over the tuned baseline of SGD with momentum.
Note that Scion achieves this SOTA performance without the ad-hoc weight normalization otherwise used in the airbench implementation of Muon.
"The paper could benefit from a clearer explanation of the intuition behind why uSCG works for unconstrained problems."
It was indeed also surprising to us initially that we could show convergence for uSCG.
After applying smoothness in the analysis there are mainly two terms to address: i) the direction of the update (given by an inner product), and ii) the magnitude of the update. For simplicity, let us consider the deterministic case, where applying smoothness to the update $x^{k+1} = x^k + \gamma\,\mathrm{lmo}(\nabla f(x^k))$ gives
$f(x^{k+1}) \le f(x^k) + \gamma \langle \nabla f(x^k), \mathrm{lmo}(\nabla f(x^k)) \rangle + \tfrac{L \gamma^2}{2} \| \mathrm{lmo}(\nabla f(x^k)) \|^2.$
The main insight is the relationship between the LMO over a norm-ball and the definition of the dual norm. Specifically, the value that the LMO attains on the boundary of the norm-ball is exactly the dual norm of the gradient. So the direction of the update (the inner product term) is still informative: it in fact decreases the function value by an amount proportional to the dual norm of the gradient. However, the magnitude of the update (the last term) will be large even close to a solution, but this can be taken care of by the stepsize choice (as made precise by Lemmas 5.3 and 5.4).
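As a small numeric check of this dual-norm relationship (illustrative only, for two common norm balls): the inner product between the gradient and the LMO output equals minus the radius times the dual norm of the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 1.0

# l-infinity ball: lmo(g) = -rho * sign(g), and <g, lmo(g)> = -rho * ||g||_1 (the dual norm)
g = rng.standard_normal(10)
assert np.isclose(g @ (-rho * np.sign(g)), -rho * np.abs(g).sum())

# spectral-norm ball: lmo(G) = -rho * U @ V^T, and <G, lmo(G)> = -rho * ||G||_nuclear (the dual norm)
G = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(G, full_matrices=False)
assert np.isclose(np.sum(G * (-rho * U @ Vt)), -rho * s.sum())
```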
Notice that we do not need to know the Lipschitz constant to select the stepsize as is otherwise the case in e.g. gradient descent. The intuitive reason for this is the scale invariance of the LMO. We will remark on these intuitions in the final version.
This paper proposes a novel stochastic optimization framework, Stochastic Conditional Gradient (SCG) and unconstrained SCG (uSCG), which leverages the Linear Minimization Oracle (LMO) to adapt to problem geometry in deep neural networks. The framework unifies existing methods like Normalized SGD, SignSGD, and Muon. By instantiating SCG/uSCG with operator norms (dubbed Scion/unconstrained Scion), the method achieves hyperparameter transferability across model sizes and competitive performance on tasks such as nanoGPT and CIFAR-10.
While many optimizers have claimed to beat Adam in recent years, none of them actually did; recent consensus suggests Muon may achieve this. This paper theoretically and empirically demonstrates that Scion/unconstrained Scion outperform Muon and Adam. Moreover, Scion/unconstrained Scion have another appealing property: zero-shot hyperparameter transferability across models of different widths, which means small proxy models can be used to tune hyperparameters for large models, significantly boosting LLM training efficiency. This positions Scion as a potential default optimizer for DNNs, akin to Adam's current dominance.
Three reviewers strongly endorse the paper; one raised concerns, but the authors addressed these in the rebuttal. Given the method’s technical novelty, empirical success, and broad applicability, the AC strongly recommends acceptance.