Stacey: Promoting Stochastic Steepest Descent via Accelerated $\ell_p$-Smooth Nonconvex Optimization
We propose Stacey, an accelerated $\ell_p$ steepest descent algorithm that leverages non-Euclidean geometry to achieve faster convergence and higher accuracy.
Abstract
Reviews and Discussion
The paper uses different mixed Lp norms to run SGD
Questions to Authors
NA
Claims and Evidence
The main takeaway is that different $\ell_p$ norms boost optimization performance for different problems. For CNNs, for example, they find $\ell_2$ to work best, but for LLMs they find another norm (e.g., $\ell_3$) to work better. Unfortunately, there are no error bars or standard deviations in the tables. Also, the learning rate for the baseline Adam is very off for LLMs: it is set to 1e-4 where it should really be set to 1e-3 or higher. The epsilon for Adam is also off; setting epsilon to 1e-8 basically makes Adam act as SGD + momentum. On LLMs the epsilon should be set closer to 1e-17.
Methods and Evaluation Criteria
I believe the baselines were not set correctly.
Theoretical Claims
The paper does a good job of contextualizing their method in modern optimization.
The paper gives some extensions to simpler claims from [1, 2], but is good overall.
[1] Guillaume Garrigos, Robert M. Gower, Handbook of Convergence Theorems for (Stochastic) Gradient Methods
[2] Ahmed Khaled, Peter Richtárik, Better Theory for SGD in the Nonconvex World
Experimental Design and Analysis
Please see the other sections.
Supplementary Material
I downloaded the code and ran it on a few benchmarks. I did not see a boost in modded-nanoGPT, which is a properly tuned baseline. I also did not see a boost vs. Adam in some toy problems like xor: https://github.com/lixilinx/psgd_torch/blob/master/rnn_xor_problem_general_purpose_preconditioner.py
Relation to Broader Scientific Literature
NA
Essential References Not Discussed
NA
Other Strengths and Weaknesses
The paper lacks error bars or standard deviations, and the baselines are weakly tuned.
Other Comments or Suggestions
Fix baselines.
We thank the reviewer for their comments and suggestions.
Error Bars and Standard Deviations: "No error bars or standard deviation in tables"
We ran 3 random seeds to obtain error bars.
CIFAR:
| Optimizer | Train NLL @50 | Train NLL @100 | Train NLL @200 | Test ACC @50 | Test ACC @100 | Test ACC @200 |
|---|---|---|---|---|---|---|
| SGD w/ Momentum | 0.0567 ± 0.0017 | 0.0441 ± 0.0014 | 0.0352 ± 0.0012 | 91.15 ± 0.30 | 92.02 ± 0.24 | 92.76 ± 0.13 |
| Adam | 0.0401 ± 0.0017 | 0.0182 ± 0.0017 | 0.0083 ± 0.0010 | 91.69 ± 0.18 | 92.13 ± 0.16 | 92.66 ± 0.36 |
| AdamW | 0.0590 ± 0.0010 | 0.0278 ± 0.0009 | 0.0195 ± 0.0015 | 90.59 ± 0.37 | 91.47 ± 0.42 | 92.12 ± 0.07 |
| Lion | 0.1006 ± 0.0571 | 0.2226 ± 0.1410 | 0.0245 ± 0.0043 | 89.38 ± 2.02 | 89.19 ± 1.88 | 92.15 ± 0.32 |
| Stacey(p,p) | 0.0423 ± 0.0009 | 0.0118 ± 0.0014 | 0.0021 ± 0.0011 | 91.88 ± 0.21 | 92.79 ± 0.16 | 93.79 ± 0.38 |
| Stacey(p,2) | 0.0614 ± 0.0031 | 0.0131 ± 0.0027 | 0.0014 ± 0.0005 | 90.83 ± 0.32 | 92.70 ± 0.28 | 93.54 ± 0.06 |
ImageNet:
| Optimizer | Train NLL @20 | Train NLL @40 | Train NLL @60 | Test Top-1 ACC @20 | Test Top-1 ACC @40 | Test Top-1 ACC @60 |
|---|---|---|---|---|---|---|
| Stacey(p,p) | 1.4680 ± 0.0150 | 1.1636 ± 0.0159 | 1.0324 ± 0.0100 | 66.93 ± 0.10 | 69.15 ± 0.15 | 69.87 ± 0.14 |
LLM:
| Optimizer | Train loss @5K | Train loss @10K | Train loss @20K | Train loss @30K | Test loss @5K | Test loss @10K | Test loss @20K | Test loss @30K |
|---|---|---|---|---|---|---|---|---|
| SGD w/ Momentum | 6.6704 ± 0.0129 | 6.5205 ± 0.0088 | 6.4206 ± 0.0055 | 6.3920 ± 0.0048 | 6.6558 ± 0.0131 | 6.5150 ± 0.0085 | 6.4173 ± 0.0038 | 6.3909 ± 0.0038 |
| Adam | 6.4548 ± 0.0028 | 6.3647 ± 0.0037 | 6.2851 ± 0.0030 | 6.2485 ± 0.0028 | 6.4493 ± 0.0017 | 6.3646 ± 0.0035 | 6.2820 ± 0.0037 | 6.2480 ± 0.0028 |
| AdamW | 5.6655 ± 0.0095 | 5.5172 ± 0.0081 | 5.4401 ± 0.0091 | 5.4268 ± 0.0096 | 5.6510 ± 0.0099 | 5.5171 ± 0.0080 | 5.4350 ± 0.0088 | 5.4240 ± 0.0093 |
| Lion | 6.8722 ± 0.0656 | 6.8190 ± 0.0549 | 6.8021 ± 0.0451 | 6.7794 ± 0.0425 | 6.8624 ± 0.0587 | 6.8220 ± 0.0500 | 6.7954 ± 0.0438 | 6.7733 ± 0.0413 |
| Stacey(p,p) | 5.4016 ± 0.0107 | 4.9938 ± 0.0209 | 4.6492 ± 0.0112 | 4.4962 ± 0.0123 | 5.3616 ± 0.0068 | 4.9655 ± 0.0169 | 4.6372 ± 0.0116 | 4.4879 ± 0.0132 |
| Stacey(p,2) | 6.2492 ± 0.0060 | 6.0038 ± 0.0319 | 5.7210 ± 0.0363 | 5.5841 ± 0.0379 | 6.2312 ± 0.0065 | 5.9867 ± 0.0313 | 5.7062 ± 0.0375 | 5.5755 ± 0.0375 |
Baseline Settings: "Learning rates for the baseline Adam is very off for LLMs; its set to 1e-4 where it should really be set to 1e-3 or higher. Also, the epsilon for Adam is also off. Setting epsilon to 1e-8 basically makes Adam act as SGD + M. On LLMs the epsilon should be set closer to 1e-17."
We would kindly ask the reviewer to provide additional details regarding which LLMs they are referring to, as we did not observe a notable difference using the suggested parameters for our setting.
| Optimizer | lr | eps | Test loss @15K | Test loss @30K |
|---|---|---|---|---|
| Adam | 1e-3 | 1e-17 | 6.4120 | 6.2341 |
| Adam (Our settings) | 1e-4 | 1e-8 | 6.3102 | 6.2485 |
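For concreteness, the two configurations compared above correspond to optimizer settings along these lines (a minimal PyTorch sketch with a placeholder model; the actual architecture and training loop follow the LLM setup described in the paper):

```python
import torch

# Placeholder parameters; the real experiments use the paper's LLM architecture.
model = torch.nn.Linear(512, 512)

# Reviewer-suggested settings: larger learning rate, much smaller epsilon.
adam_suggested = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-17)

# Settings used in the paper: smaller learning rate, PyTorch's default epsilon.
adam_paper = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-8)
```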
Performance Boost: "I downloaded the code and ran it on a few benchmarks. I did not see a boost in modded-nanoGPT, which is a properly tuned baseline. I also did not see a boost vs Adam in some toy problems like xor."
We would kindly ask the reviewer to elaborate on the details of their evaluation, as otherwise, we are unable to provide a proper response or explanation.
Thank you for the experiments. I will certainly take them into consideration.
Here are the last two. Feel free to tune as much as you like.
Modded-nanoGPT is the following benchmark: NanoGPT (124M) in 3 minutes. Can the authors run the 124M and the bigger benchmarks and report the results?
https://github.com/KellerJordan/modded-nanogpt
xor is the following benchmark. I tested the uploaded code, but I cannot seem to get Stacey to outperform Adam.
https://github.com/lixilinx/psgd_torch/blob/master/rnn_xor_problem_general_purpose_preconditioner.py
The results of the 124M benchmark are as follows:
| Optimizer | Val loss @0.2B tokens | Val loss @0.4B tokens | Val loss @0.6B tokens | Val loss @0.8B tokens |
|---|---|---|---|---|
| AdamW | 4.715 | 4.055 | 3.853 | 3.765 |
| Stacey(p,p) | 4.157 | 3.887 | 3.762 | 3.688 |
We observe a notable improvement over AdamW, which is consistent with the LLM experiments in our paper. We set nearly all of the hyperparameters the same as listed in the paper for the LLM experiments with Stacey(p,p) (Table 7), except for and .
We further wish to emphasize the overall contributions of our work, namely a primal-dual view of steepest descent that we justify both theoretically and empirically.
Having addressed these concerns and provided additional context for our contributions, we kindly ask the reviewer to reconsider their evaluation.
This paper introduces Stacey, an optimisation algorithm targeted at training deep neural networks (DNNs). Stacey generalises SignSGD and conventional SGD in a p-norm sense, where SGD uses the 2-norm to measure distance and SignSGD uses the inf-norm. On top of this, Stacey includes an acceleration scheme to aid the speed of convergence. The paper offers a theoretical result, specifically a convergence rate for a non-accelerated version of Stacey on smooth stochastic problems with bounded gradient and variance. Additionally, there is an empirical evaluation of Stacey on standard DNN benchmarks against popular deep learning algorithms.
Questions to Authors
- Stacey has a lot of hyperparameters, many more than typically tuned for SGD and Adam; do you think your comparison is fair given this fact? Did you run a comparison where the same limited number of hyperparameters (say only 1 or 2) are adjusted for all algorithms and the rest are left at their default values? Much of Adam's (& AdamW's) success is due to their robustness to their hyperparameters; algorithms that need extensive hyperparameter tuning are unlikely to be used in practice. The experiments presented in the paper do not make it clear to me how useful Stacey is as a practical algorithm.
- The p-norm used in Stacey is a hyperparameter; do you think it would be possible to adapt Stacey to automatically work out the best setting of "p" for a given problem during the optimisation process? This would help reduce the number of hyperparameters required.
- Further to the above, I understand the p-norm used to measure distance in Stacey is fixed for all parameters; do you think it would be possible to extend Stacey so it adaptively adjusts the p-norm used, say per layer?
- For some tasks Stacey(p,2) does better and for some Stacey(p,p) does better; what do you think might be causing this discrepancy?
- Do you think Stacey with acceleration enjoys any theoretical properties? Did you make any progress to this end?
- Is Stacey(p,p) equivalent to Stacey(p,2) and Stacey(2,p) if p = 2? If so, why do they seem to behave differently in your experiments?
Claims and Evidence
The theoretical claims seem well supported.
The empirical claims are supported; however, I do have some doubts about the fairness of the empirical evaluation when comparing to other methods, though it is difficult to be totally objective here, given the natural differences between optimisers. Inevitably the wider community will need to judge how Stacey performs in practice compared to existing methods.
Methods and Evaluation Criteria
The benchmarks look appropriate; it would have been nice to see some smaller networks and more classical optimisation problems considered, given these don't take much compute.
Theoretical Claims
I did not thoroughly check the proofs due to time. The theoretical claims presented are not for Stacey but for a far simpler non-accelerated algorithm.
Experimental Design and Analysis
The experimental design looks okay, though there are definitely some flaws here; see the weaknesses section.
Supplementary Material
Yes; see the experimental design section.
Relation to Broader Scientific Literature
The Relation To Broader Scientific Literature is well motivated.
Essential References Not Discussed
This recent paper is missing and would be great to see included as a baseline, especially given its lower number of hyperparameters.
Defazio A, Yang X, Khaled A, Mishchenko K, Mehta H, Cutkosky A. The road less scheduled. Advances in Neural Information Processing Systems. 2024 Dec 16;37:9974-10007.
Other Strengths and Weaknesses
Strengths
- The paper is well written.
- The idea behind the paper seems neat and some related theoretical results are provided.
- The benchmarks considered look appropriate; of course, it would be nice to see more, including some more classical non-deep-learning optimisation problems.
- This seems a promising direction for further research to build on top of.
Weaknesses
- The empirical experimental section has some flaws:
  i) Some recent baselines are missing from the experimental section, specifically the one introduced last year in "The Road Less Scheduled".
  ii) The number of hyperparameters tuned for Stacey seems to far exceed the other methods, making the comparison hard to judge.
  iii) It is not clear how robust Stacey is to its hyperparameters, nor how much hyperparameter tuning is required to get good results.
  iv) There is a lack of error bars or mention of variation between runs.
  v) Smaller-scale classical optimisation problems are missing.
  vi) For the ImageNet results, it is not clear why the results are presented at epoch 60 rather than the typical epoch 90, given at least some of the experiments were run for 90 epochs, as reported in the appendix.
  vii) Tables of some final results seem to be missing, such as PPL on transformer pretraining.
  viii) The number of iterations shown varies between plots for no explained reason.
  ix) Some of the results for Stacey(2,2) seem to differ between plots (Figures 5 & 6, and Figure 3 vs. Figures 11 & 12).
  x) The transformer pretraining experiments seem to be missing some important details: the model architecture (what is LLama-100?), why the test PPL increases, and unexplained differences in performance between plots (see point ix).
  xi) Missing results for Stochastic ℓp Descent.
- The totality of the above critiques makes me wonder how much the results are being presented in a way to make them seem best. I would suggest spending a little more time in the appendix making it clear why specific choices were made. Without the empirical results the paper doesn't offer enough of a contribution in my opinion, so it is essential that a reader is not left wondering why some seemingly odd choices have been made in the way the experiments have been conducted and presented. My score assumes greater clarity is given on the experiments in the final version of the paper.
- The theoretical results are not presented for Stacey but for a far simpler non-accelerated algorithm.
Other Comments or Suggestions
It would be great to detail the grids searched over in terms of hyperparameters, not just the final values.
Some idea of the robustness of Stacey to its hyperparameters would really help sell its practicality.
Please explain why the number of iterations at which results (both tables and plots) are shown seems to vary so much.
We thank the reviewer for their thoughtful comments and suggestions, and we respond to their questions below.
Missing Reference
Thanks for the suggestion. We will provide a citation and include it as a baseline in the updated manuscript.
Hyperparameters
Though we may tune and , we found that, similar to SGD and Adam, the best choice comes from a small set, i.e., and , and so we believe this provides a fair comparison with other methods. We did run comparisons where a limited number of hyperparameters were tuned, while the rest were left at default values, as such defaults let us reduce the scope of the search.
Questions and Comments on Experiments
- Lack of error bars/variation
For space reasons, we kindly point the reviewer to our rebuttal for Reviewer ddGc for the tables of results with error bars.
- Smaller scale classical optimisation problems are missing
While our method was designed with large-scale models in mind, we will include smaller scale experiments, and we would kindly ask the reviewer for suggested problems.
- ImageNet result epochs
Thanks for catching this. This is a typo from an outdated table in the appendix, which we will update accordingly.
- Tables of some final results seem to be missing
If we understand correctly, we will include tables of the PPL/transformer pretraining results, with error bars/variance included (as in the tables provided in the rebuttal).
- The number of iterations shown vary
The iteration counts in Figures 2 and 4 differ because they represent two distinct experimental setups: one for ImageNet classification, the other for LLM pretraining, each with its own training configuration and iteration schedule.
- x) The transformer pretraining experiment details
LLama-100 is adopted from the GitHub repo of "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection" [Zhao et al. 2024]. The PPL of that line increases because it is within a range in which our algorithm cannot converge.
- xi) Missing results for Stochastic ℓp Descent.
Stochastic ℓp descent is a special case of our algorithm; thus we have presented the empirically stronger versions of our algorithm (with acceleration), though we will include ablation studies in the updated manuscript.
- (Q4) Stacey(p,2) vs. Stacey(p,p)?
Basing our intuition on the theory for the convex case, coupling a non-Euclidean primal with a Euclidean dual update (Theorem 1 in [Bai & Bullins 2024]), vs. non-Euclidean primal and dual updates (e.g., Theorem 2 in [Diakonikolas & Guzmán 2024]), leads to different trade-offs between the problem geometry, initial iterate, and acceleration exponent, such that neither is uniformly better than the other.
- (Q6) Stacey for p=2?
This occurs due to numerical differences from handling general $p$. However, we acknowledge this could cause confusion, so we will provide the algorithm for the simplified case of Stacey(2,2), along with a discussion of this distinction.
Choice of $p$
(Q2) automatically work out "p"?
One idea would be to use occasional Hessian information to adjust $p$ over time, based on the spectrum density [Ghorbani et al. 2019]. There has also been much work on developing automated means of hyperparameter tuning, including e.g. parameter-free methods [Jacobson & Cutkosky 2022], from which we hope to draw inspiration.
(Q3) adjusting Stacey p choice per layer?
This is a wonderful suggestion. We may even benefit from a per-layer basis if we wished to leverage layer-wise Hessian information as a diagnostic tool (being far more compute-friendly).
Theoretical Properties of Stacey
The acceleration framework of Stacey builds on linear coupling [Allen-Zhu & Orecchia, 2017], which is optimal in the first-order smooth and convex setting, and HASD [Bai & Bullins, 2024], which achieves faster convergence in the smooth non-Euclidean setting. Beyond the deterministic and convex regime, the core foundation of Stacey, non-convex stochastic steepest descent, achieves a first-order oracle complexity that we discuss as tight in Sections 4.1 and 4.2.
Providing a further theoretical characterization of Stacey's acceleration remains challenging. Notably, even for widely used algorithms like Adam, existing theory typically shows only convergence or recovers first-order optimal regret, with limited formal evidence of superiority over other accelerated first-order methods. We suspect that capturing the acceleration of Stacey in theory would require more refined and tailored assumptions. While we leave this as an open direction for future work, we highlight that Stacey generalizes both SignSGD and Lion (as discussed in the third paragraph of Section 4.2), which suggests it offers greater flexibility and is more likely to admit improved theoretical properties.
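As a rough illustration of this generalization, the following minimal sketch shows the generic (unnormalized) ℓp steepest-descent direction, which interpolates between the SGD and signSGD directions via the dual exponent q = p/(p−1). This is our own illustration, not the exact Stacey update; normalization and step-size conventions vary.

```python
import torch

def lp_steepest_direction(grad: torch.Tensor, p: float) -> torch.Tensor:
    """Unnormalized steepest-descent direction w.r.t. the l_p norm (finite p > 1).

    Coordinate-wise this is sign(g_i) * |g_i|^(q - 1) with q = p / (p - 1):
    at p = 2 (q = 2) it is the gradient itself (SGD direction);
    as p -> infinity (q -> 1) it approaches sign(g) (signSGD direction).
    """
    q = p / (p - 1.0)
    return torch.sign(grad) * grad.abs().pow(q - 1.0)

g = torch.tensor([0.5, -2.0, 0.1])
print(lp_steepest_direction(g, p=2.0))    # equals g: SGD-like direction
print(lp_steepest_direction(g, p=100.0))  # close to sign(g): signSGD-like direction
```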
This paper introduces STACEY, a novel optimization algorithm designed to accelerate stochastic steepest descent via ℓp-smooth nonconvex optimization. The key contributions of this work include:
- The development of STACEY, which incorporates primal-dual iterate interpolation to improve convergence rates for non-Euclidean smooth optimization problems.
- A theoretical framework that generalizes both SGD (when $p = 2$) and signSGD (when $p = \infty$), with a convergence guarantee under standard assumptions.
- Empirical results demonstrating superior convergence speed and final accuracy compared to existing optimization methods, including SGD, Adam, AdamW, and Lion, on large-scale deep learning tasks such as image classification (CIFAR, ImageNet) and large language model (LLM) pretraining.
- A study on how different values of $p$ affect performance, showing that non-Euclidean norms can be more effective in certain settings than traditional ℓ2-based methods.
The paper is well-written, well-structured, and presents both strong theoretical contributions and compelling empirical results.
Questions to Authors
- How does the per-iteration computational cost of STACEY compare to SGD, Adam, and Lion in terms of runtime and memory consumption?
- Could you provide a practical heuristic or automated procedure for selecting $p$ in different tasks?
- How does STACEY perform when compared to second-order methods like Shampoo or K-FAC? Would these methods benefit from a similar ℓp-based approach?
Claims and Evidence
The paper makes several claims, all of which are generally well-supported:
- Claim: STACEY improves convergence rates over traditional SGD and adaptive optimizers. Evidence: Empirical evaluations on CIFAR, ImageNet, and LLM pretraining demonstrate improved speed and final accuracy.
- Claim: The proposed method generalizes previous optimization approaches by considering a broader class of ℓp-norms. Evidence: Theoretical analysis rigorously proves this generalization.
- Claim: STACEY benefits from the flexibility of choosing different values of $p$. Evidence: Experiments with different values of $p$ show that problem-specific norm choices can yield better performance.
One area where the claims could be strengthened is in providing a more detailed computational complexity comparison to confirm the practical efficiency of STACEY in large-scale settings.
Methods and Evaluation Criteria
The methods and evaluation criteria are appropriate and well-justified:
- Optimization Benchmarks: The paper evaluates STACEY on well-established benchmarks, including CIFAR-10, ImageNet, and LLM pretraining tasks.
- Comparisons: The algorithm is compared against widely used optimizers (SGD, Adam, AdamW, Lion), which are the correct baselines for this type of work.
- Evaluation Metrics: The paper reports training loss, test accuracy, and convergence speed, which are standard and relevant evaluation criteria for optimization algorithms in deep learning.
However, an additional runtime or computational cost comparison would be useful to fully assess the trade-offs of using STACEY in practice.
Theoretical Claims
I reviewed the theoretical claims and found them mostly sound:
- The convergence proof for stochastic ℓp steepest descent follows standard assumptions and logically extends previous results.
- The use of primal-dual interpolation is well-motivated, and the acceleration rate follows existing literature on non-Euclidean acceleration.
- The generalization to different ℓp norms appears correct and aligns with prior research on optimization under non-Euclidean norms.
One possible area for clarification is the effect of different values of $p$ on the acceleration exponent. An intuitive explanation of how acceleration scales with $p$ would improve clarity.
Experimental Design and Analysis
The experimental design is well-structured, with meaningful comparisons. I verified the validity of:
- Image classification experiments on CIFAR and ImageNet.
- LLM pretraining experiments on the C4 dataset.
- Ablation studies exploring the role of $p$.
Potential improvement:
- A detailed analysis of computational efficiency (runtime per iteration, memory overhead, etc.) would make the comparisons more complete.
- It would be useful to explore STACEY’s performance in additional non-Euclidean problems, such as adversarial training or reinforcement learning.
Supplementary Material
The supplementary material includes code, additional proofs, and extra experimental results. I reviewed:
- Appendix A: Detailed proofs of convergence rates.
- Appendix D: Hyperparameter tuning strategies.
These sections add depth to the paper and support the claims made in the main text.
Relation to Broader Scientific Literature
This paper is well-grounded in prior research on stochastic optimization and non-Euclidean geometry in machine learning:
- It builds upon SGD (Robbins & Monro, 1951), signSGD (Bernstein et al., 2018), and AdamW (Loshchilov & Hutter, 2019).
- It connects with recent studies on non-Euclidean optimization (Diakonikolas & Guzmán, 2024) and adaptive gradient methods.
- The idea of primal-dual interpolation is influenced by work on non-Euclidean acceleration (Allen-Zhu & Orecchia, 2017; Nemirovskii & Nesterov, 1985).
This paper extends these ideas in a meaningful way, demonstrating both theoretical and empirical improvements.
Essential References Not Discussed
The paper covers most essential references, but a comparison with curvature-aware optimizers (e.g., Shampoo, K-FAC) would be useful. These methods also attempt to handle non-Euclidean optimization challenges, making them relevant to the discussion.
I recommend citing Gupta et al. (2018) on Shampoo and Martens & Grosse (2015) on K-FAC to highlight how STACEY differs from these approaches.
Other Strengths and Weaknesses
Strengths
- Strong theoretical foundation with generalized convergence guarantees.
- Comprehensive empirical validation showing consistent improvements over baselines.
- Clear writing and structured explanations.
Weaknesses
- No computational cost analysis (memory and runtime comparisons are missing).
- Hyperparameter selection for $p$ is unclear (no systematic guidance provided).
- Missing comparisons with curvature-aware optimizers (e.g., Shampoo, K-FAC).
Other Comments or Suggestions
- Clarify the intuition behind acceleration for different values of $p$.
- Include an ablation study on the effect of different values of $p$ on generalization performance.
- Discuss whether STACEY could be extended to second-order methods or mixed-order approaches.
We thank the reviewer for their thoughtful comments and suggestions, and we respond to their questions below.
Detailed computational complexity in terms of runtime and memory consumption compared to SGD, Adam, and Lion
Runtime
Let $d$ be the number of parameters, and let each "basic operation" refer to simple scalar arithmetic (e.g., an addition, multiplication, or sign). We will ignore lower-level details (e.g., hardware vectorization) and focus on how many scalar operations are performed per parameter per iteration.
SGD
Key steps:
- Update the momentum buffer from the current gradient (if momentum is used).
- Apply the (scaled) momentum step to the parameters.
Approximate ops per parameter: ~5–6 scalar ops (fewer if no momentum is used).
Adam
Key steps:
- Update the first-moment estimate from the gradient.
- Update the second-moment estimate from the squared gradient.
- Apply the bias-corrected, coordinate-wise rescaled step to the parameters.
Approximate operations per parameter: ~9–12 scalar ops (slightly more if you count the bias-correction steps separately).
Lion
Key steps:
- Form the sign of an interpolation of the momentum buffer and the gradient, and apply it to the parameters.
- Update the momentum buffer.
Approximate operations per parameter: ~6–7 scalar ops (including sign as an operation).
Stacey
Key steps:
- Update the “momentum-like” buffer (coordinate-wise re-scaling).
- Update the dual vector (coordinate-wise multiplications/additions).
- Combine the two to get the final parameter update.
Representative breakdown:
- Update the momentum-like buffer: ~3–5 ops.
- Update the dual vector: ~3–4 ops (coordinate-wise re-scaling plus addition).
- Final parameter update: ~2–3 ops (linear combination and one addition/subtraction).
Approximate operations per parameter: ~9–12 scalar ops.
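To make the per-parameter counts above auditable, here is a coordinate-wise sketch of the four update rules written as plain scalar arithmetic. This is our illustration only; in particular, the Stacey step is a schematic stand-in for the actual primal-dual update, whose exact form is given in the paper, and the hyperparameter names (`beta`, `tau`, `q`) are ours.

```python
def sign(z):
    return (z > 0) - (z < 0)

def sgd_momentum_step(x, m, g, lr=0.1, beta=0.9):
    # Momentum buffer update, then parameter step.
    m = beta * m + g
    x = x - lr * m
    return x, m

def adam_step(x, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # First/second moments, bias correction, coordinate-wise rescaled step.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    x = x - lr * m_hat / (v_hat ** 0.5 + eps)
    return x, m, v

def lion_step(x, m, g, lr=1e-4, b1=0.9, b2=0.99):
    # Sign of an interpolated update, then momentum refresh.
    x = x - lr * sign(b1 * m + (1 - b1) * g)
    m = b2 * m + (1 - b2) * g
    return x, m

def stacey_like_step(x, m, z, g, lr=1e-3, beta=0.9, tau=0.5, q=1.5):
    # Schematic only: momentum-like buffer, dual vector with an l_p-style
    # coordinate-wise re-scaling, and a linear combination of the two.
    m = beta * m + (1 - beta) * g
    z = z - lr * sign(g) * abs(g) ** (q - 1)
    x = tau * z + (1 - tau) * (x - lr * m)
    return x, m, z
```

These scalar versions are only meant to make the operation tallies concrete; real implementations apply the same arithmetic element-wise over tensors.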
Memory Footprint
SGD: Momentum buffer (if used). Total auxiliary overhead: ~$d$ (zero without momentum).
Adam: First moment and second moment. Total auxiliary overhead: ~$2d$.
Lion: Momentum-like buffer. Total auxiliary overhead: ~$d$.
Stacey: Momentum-like buffer and dual vector. Total auxiliary overhead: ~$2d$.
Thus, Stacey’s memory requirement is similar to Adam, and slightly more than other single-pass gradient methods (though we note that its overhead compared to e.g. SGD, Lion comes precisely from the additional dual vector).
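These overheads are also easy to verify empirically. The snippet below is our own sanity check (not from the paper): it counts the auxiliary state held by any PyTorch optimizer relative to the model's parameter count.

```python
import torch

def optimizer_state_numel(opt: torch.optim.Optimizer) -> int:
    """Total number of scalars stored in the optimizer's auxiliary state buffers."""
    total = 0
    for state in opt.state.values():
        for value in state.values():
            if torch.is_tensor(value):
                total += value.numel()
    return total

model = torch.nn.Linear(1024, 1024)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# State buffers are created lazily, so take one optimization step first.
model(torch.randn(8, 1024)).sum().backward()
opt.step()

n_params = sum(p.numel() for p in model.parameters())
print(optimizer_state_numel(opt) / n_params)  # roughly 2 for Adam (exp_avg + exp_avg_sq)
```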
Choice of $p$: intuition, ablation, practical heuristics
Although it can be challenging to provide intuition for acceleration (even in the convex $\ell_2$, i.e. Nesterov's, case), at a high level the analysis carefully balances the primal and dual updates, leveraging the uniform convexity of the $\ell_p$ norm and an alternative smoothness-derived upper bound, as in Definition 2 and Lemma 1 in [Diakonikolas & Guzmán 2024]. We will include an ablation study of the influence of $p$ on the acceleration rate in the updated manuscript, to help provide additional insight into these theoretical results.
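For background, one standard way the uniform convexity of $\ell_p$ spaces with $p \ge 2$ is expressed is Clarkson's first inequality (shown here only as intuition; it is not necessarily the exact definition invoked in [Diakonikolas & Guzmán 2024]):

```latex
% Clarkson's first inequality, valid for all x, y and 2 <= p < infinity:
\left\| \frac{x + y}{2} \right\|_p^p + \left\| \frac{x - y}{2} \right\|_p^p
  \;\le\; \frac{1}{2} \left( \|x\|_p^p + \|y\|_p^p \right)
```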
Regarding heuristics for choosing $p$, one idea would be to use occasional Hessian information to determine how to adjust $p$ over time, based on the spectrum density [Ghorbani et al. 2019]. Additionally, there has been much work on developing automated means of hyperparameter tuning, including e.g. parameter-free methods [Jacobson & Cutkosky 2022], from which we may hope to draw inspiration.
Second-order curvature-aware optimizers: citations and discussion
We thank the reviewer for pointing out these helpful references. We will cite them and incorporate the following discussion in the revised version. Shampoo [Gupta et al., 2018], K-FAC [Martens & Grosse, 2015] and their follow-ups are indeed representative works on curvature-aware optimization methods. One notable difference is that these works are second-order methods that exploit the structure of the Hessian or Fisher information and focus on techniques for their efficient approximation, whereas our method, Stacey, is a first-order approach that explores non-Euclidean geometry through a differing norm. We will include comparisons with these curvature-aware optimizers in the updated version of our manuscript.
Extending Stacey to second-order methods is, in our view, a promising direction that aligns with our own considerations for future research as well. It is natural to investigate the non-Euclidean counterpart of a preconditioned gradient step, just as we have done in the first-order setting. Such a method holds the potential to benefit from both the curvature awareness provided by the Hessian information and the geometric advantages of operating under a differing norm.
This paper introduces Stacey, an accelerated l_p steepest descent algorithm leveraging primal-dual interpolation for non-Euclidean optimization. Theoretical guarantees and empirical evaluations on image classification and LLM pretraining suggest improved convergence speed, final accuracy, and benefits from adaptive p-values, advocating for non-Euclidean methods over conventional approaches.
The reviews are highly mixed: Reviewers d7Gi and NAH2 strongly support the work, emphasizing its novel theoretical insights into optimizer design, which could inspire future research. However, Reviewer ddGc raises significant concerns. The primary criticism centers on the experimental validity, arguing that Stacey's performance may not surpass Adam, let alone state-of-the-art optimizers (e.g., Shampoo, Muon, Scion). The AC also notes Stacey's hyperparameter complexity, which may limit practical adoption.
While the empirical claims require further validation, the theoretical contributions—particularly the framework for non-Euclidean optimization—align with ICML’s theory-oriented focus. Balancing these factors, the AC leans toward acceptance, as the work provides inspirations for future research into simpler, effective optimizers for realistic deep learning scenarios.