PaperHub
Score: 6.4/10
Poster · 3 reviewers
Ratings: 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 3.3
Novelty: 3.0 · Quality: 2.7 · Clarity: 2.7 · Significance: 3.0
NeurIPS 2025

SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords
Low-rank optimization · Steepest descent · Moment orthogonalization · Newton–Schulz method · Large language model · Memory-efficient finetuning · Pretraining

Reviews and Discussion

Review
Rating: 4

This paper introduces SUMO (Subspace-Aware Moment-Orthogonalization), a novel optimization algorithm designed to accelerate the training of LLMs while simultaneously enhancing memory efficiency. SUMO's core innovation is the use of an exact SVD to perform moment orthogonalization within a dynamically adapting low-dimensional subspace. This method ensures that optimization steps are better aligned with the spectral characteristics of the loss function.

Strengths and Weaknesses

  1. Strengths
    1. The paper provides substantial theoretical backing for its approach. These theoretical claims are validated through empirical evaluations.
    2. The paper successfully argues that the computational cost of SVD, typically a concern, is manageable and even preferable within the proposed low-rank framework.
    3. The authors have conducted a thorough empirical study across a diverse set of tasks, including pre-training, fine-tuning, and long-context reasoning.
  2. Weaknesses
    1. My primary concern is that it seems SUMO is essentially Muon with GaLore. In fact, GaLore stands for Gradient Low-Rank Projection and can be combined with various optimization methods. The GaLore mentioned in the paper specifically refers to Algorithm 2: Adam with GaLore from [1]. Therefore, SUMO differs from Muon with GaLore only in some minor modifications. The authors should provide a more detailed discussion of these differences to clearly articulate the novelty and distinct contributions of SUMO compared to Muon with GaLore.
    2. The pre-training experiments are conducted on up to 1.1B and 13.1B tokens. While the results are promising, validating the method’s effectiveness at a larger training scale would strengthen the claims regarding scalability. In practice, I have observed that many improvements appear effective during the early stages of training, but their advantages often diminish—or even reverse—after exposure to a substantially larger number of training tokens.

[1] Jiawei Zhao, et al. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. ICML 2024.

Questions

  1. Your complexity analysis notes that SVD can require approximately twice the FLOPs of Newton-Schulz5 in certain low-rank settings. However, your results show a significant speedup in terms of optimization steps (Figure 2) and total training time (Table 6). Could you provide more insight into the wall-clock time trade-off and confirm if the per-step overhead of SVD is consistently overcome by the faster convergence across all tested scenarios? And I find the results in Table 6 somewhat puzzling: the reported training times (in hours) are GaLore (2.59) > SUMO (Newton-Schulz5) (1.83) > SUMO (SVD) (1.56), which appears to contradict the relative algorithmic complexity of the methods.
  2. Since SUMO is built upon GaLore and Muon, a natural baseline for comparison would be these two methods. However, I did not observe such comparisons in the main experimental results presented in Tables 1, 2, and 3. I am particularly curious about how SUMO performs relative to Muon.

Limitations

Typos:

  1. Are the two square brackets in line 140 missing corresponding citations?

  2. In Algorithm 1, "Q_t ← Truncated_Randomized_SVD(G_t)" should be "Q^{(t)} ← Truncated_Randomized_SVD(G^{(t)})", and the other occurrences of G_t in Algorithm 1 should be G^{(t)}.

Final Justification

I appreciate the authors’ detailed response and the supplementary experiments provided, which help clarify several points. While some concerns remain, such as the lack of an ablation study isolating the specific impact of Block 1.1 and questions about scaling behavior with increasing training tokens, the paper presents a meaningful advancement, and the core ideas are well supported by theory and experiments. With further refinements and more thorough documentation in the final version, this work will make a valuable contribution to the field. Therefore, I raise my rating.

Formatting Issues

no

Author Response

We thank the reviewer for the helpful feedback and recognition of our theoretical and empirical contributions. We appreciate the concerns about the distinction from prior work and large-scale validation. These are important points, and we provide clarifications and new comparisons below.

W1 distinction from Muon. We thank the reviewer for raising this critical point, as it provides an excellent opportunity to clarify the significant distinctions and novel contributions of SUMO compared to Muon combined with GaLore. While both SUMO and Muon incorporate gradient orthogonalization, and GaLore provides gradient low-rank projection, their combination does not simply equate to SUMO. Specifically, SUMO differs notably from Muon combined with GaLore in the following key aspects:

Exact Orthogonalization via SVD: Muon uses Newton-Schulz iterations to approximate the spectral orthogonalization of gradient moments. However, as we demonstrate analytically in Lemma 3.2, this approximation can introduce significant errors, particularly in ill-conditioned optimization landscapes often encountered during the training of large language models (LLMs). In contrast, SUMO utilizes the exact Singular Value Decomposition (SVD) for orthogonalization, which is feasible because we are operating in a low-dimensional space. This approach completely eliminates the approximation error associated with Newton-Schulz methods.
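
As a rough illustration (a simplified sketch, not our actual implementation; the cubic iteration below is a textbook stand-in for Muon's tuned quintic Newton-Schulz variant), the two orthogonalization routes differ as follows when applied to a projected moment matrix:

```python
import torch

def orthogonalize_svd(m: torch.Tensor) -> torch.Tensor:
    # Exact polar factor via SVD: returns U @ V^T (all singular values set to 1).
    u, _, vh = torch.linalg.svd(m, full_matrices=False)
    return u @ vh

def orthogonalize_newton_schulz(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Textbook cubic Newton-Schulz iteration toward the polar factor
    # (a stand-in for Muon's tuned quintic variant).
    x = m / (m.norm() + 1e-7)  # scale so all singular values lie in (0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ (x.mT @ x)
    return x

# Ill-conditioned low-rank example: the iterative route leaves a visible residual,
# while the SVD route is exact by construction.
m = torch.randn(64, 8) @ torch.diag(torch.logspace(0, -4, 8)) @ torch.randn(8, 8)
print((orthogonalize_svd(m) - orthogonalize_newton_schulz(m)).norm())
```

Because SUMO applies this step to small projected matrices (rank r), the exact route stays cheap in practice.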

Subspace-Aware Dynamic Adaptation: GaLore, by definition, uses a fixed-rank gradient projection but does not dynamically adapt the optimization subspace or adjust its orthogonalization based on spectral information. SUMO explicitly performs SVD-based orthogonalization within a dynamically adapted, low-dimensional subspace. This subspace awareness ensures optimal alignment with the dominant spectral characteristics of the loss landscape, significantly improving convergence stability and speed, as validated empirically.

Moment Transformation Across Adaptive Subspaces: SUMO uniquely maintains consistency in optimization by transforming the first-order moments between dynamically changing subspaces (Algorithm 1 Block 1.1). This subspace transformation is critical to retaining accurate optimization steps and is a novel component that does not exist in either Muon or GaLore independently.
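
Schematically (a simplified sketch with illustrative shapes and names, not the verbatim Block 1.1), the carry-over works as follows: when the rank-r projection basis is refreshed, the low-rank moment is rotated into the new basis rather than re-initialized.

```python
import torch

def refresh_subspace_and_carry_moment(grad: torch.Tensor,
                                      q_old: torch.Tensor,
                                      moment_lr: torch.Tensor,
                                      rank: int):
    # grad: (m, n) full gradient; q_old: (m, r) previous basis;
    # moment_lr: (r, n) first moment kept in the old low-rank coordinates.
    u, _, _ = torch.linalg.svd(grad, full_matrices=False)
    q_new = u[:, :rank]                         # refreshed rank-r basis
    # Lift the old moment to the full space, then project onto the new basis,
    # instead of resetting the optimizer state at every subspace update.
    moment_new = q_new.T @ (q_old @ moment_lr)  # (r, n)
    return q_new, moment_new
```

An ablation of this block would amount to comparing the carry-over against zeroing `moment_lr` at every subspace refresh.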

Theoretical and Empirical Justification: Our theoretical analysis clearly distinguishes SUMO from Muon by quantifying the exact impact of Newton-Schulz approximation errors on convergence (Lemma 3.3, Remark 3.7). Empirically, we provide direct ablation studies comparing SUMO (exact SVD) against SUMO (Newton-Schulz) and GaLore alone. Results consistently show SUMO’s superior performance, attributed explicitly to eliminating orthogonalization errors via exact SVD and the adaptive subspace optimization strategy.

These distinct innovations highlight that SUMO is not merely a combination of existing methods but introduces significant theoretical and practical advances that directly address the shortcomings of Muon and GaLore when used individually or in combination.

W2 Scalability of SUMO. We appreciate the reviewer's feedback regarding the need for large-scale validation. A central motivation for our research on efficient fine-tuning and optimization is to democratize pretraining, making it accessible to smaller laboratories beyond the major industry players like OpenAI and Google. In this context, demonstrating consistent performance improvements across up to 13 billion tokens is quite significant. This scale exceeds the capacities of many academic settings and provides strong evidence of the method's robustness and scalability within realistic computational budgets. Additionally, we believe that the faster convergence of SUMO is valuable even for a larger number of tokens.

Q1 Explaining the convergence-time result. While SVD may have higher FLOPs per step in some low-rank settings compared to Newton-Schulz5, our empirical results demonstrate that SUMO (SVD) consistently achieves faster convergence in terms of optimization steps (Figure 2) and lower total training time (Table 6). This is primarily due to the improved stability and accuracy of exact SVD-based orthogonalization, which leads to a more direct and efficient path to convergence in the highly anisotropic loss landscapes of LLMs.

The seemingly contradictory results in Table 6, where SUMO (SVD) has a lower training time (1.56 hours) than GaLore (2.59 hours) and SUMO (Newton-Schulz5) (1.83 hours), despite the FLOPs discussion, are due to early stopping based on validation performance. Faster convergence means that SUMO (SVD) reaches the stopping criteria on validation performance (e.g., perplexity or accuracy on a validation set) in significantly fewer optimization steps. Therefore, even with a potentially higher per-step computational cost, the reduced number of total steps leads to a substantial decrease in overall wall-clock training time. This faster convergence rate consistently overcomes the per-step overhead of SVD across all tested scenarios, making SUMO (SVD) more efficient in practice.

Q2 Comparison with GaLore: We would like to clarify that GaLore is a primary baseline in all our key experiments. We compare against it directly in Table 2 (GLUE), Figure 2 (QNLI convergence), Table 3 (LLaMA pre-training), and Tables 4 and 5, where SUMO consistently shows superior performance and memory efficiency.

Comparison with Muon. Thank you for your suggestion. We opted not to include the original Muon because it operates under a different paradigm than SUMO. While SUMO is designed for low-rank, memory-efficient optimization, Muon updates the entire parameter space, leading to memory usage comparable to full fine-tuning, which is already addressed in our baselines.

We do provide a fair comparison: SUMO (Newton-Schulz5) incorporates Muon’s core approximation within our low-rank framework. The weaker performance of SUMO (Newton-Schulz5) compared to SUMO (SVD) underscores the advantages of exact orthogonalization, especially in memory-constrained environments. Additionally, applying Muon to all parameters introduces significant computational overhead, making it less suitable for the efficiency-focused scope of our work. We hope this clarifies our experimental design.

We would also kindly like to share the following new experimental results, which were also presented to the two other reviewers.

Hyperparameter grid search: To evaluate the impact of the subspace update frequency (K) and rank (r), we performed a grid search while pretraining the LLaMA 130M model on the C4 dataset. This specific setup allows for a direct comparison with GaLore. Table: Perplexity results from a grid search over subspace update frequency (K) and rank (r) for the LLaMA 130M model pretrained on the C4 dataset. Values are presented as GaLore/SUMO.

| Update Frequency | Rank = 128 | Rank = 256 | Rank = 512 |
|---|---|---|---|
| 100 | 29.7 / 28.27 | 27.9 / 26.74 | 27.4 / 26.73 |
| 250 | 28.1 / 27.86 | 26.5 / 24.87 | 26.2 / 24.82 |
| 500 | 27.2 / 25.91 | 25.6 / 24.98 | 25.3 / 24.31 |
| 1k | 26.8 / 25.83 | 25.1 / 25.42 | 24.8 / 24.93 |

Compared to Galore, SUMO demonstrates notably less sensitivity to variations in rank and update frequency. This suggests greater hyperparameter robustness and potentially easier tuning, achieving competitive or superior perplexity with more stable performance across the tested range.

Per-step wall-clock overhead (ms) of the randomized SVD. In the following table, we show time measurements for the subspace computations on the GLUE MRPC dataset, comparing SUMO and GaLore on an A100 GPU:

| Subspace update role | Average time (ms) of subspace update (initial rank 8) |
|---|---|
| GaLore SVD block | 0.6610947 |
| SUMO randomized SVD | 0.1232528 |

In addition, the average time of the moment's subspace transformation/rotation is measured at 0.023574 ms.

Moreover, we extend our evaluation to diverse commonsense and reasoning tasks in the zero-shot setting, following the exact evaluation protocol described in Table 4 of APOLLO (Zhu et al., SGD-like Memory, AdamW-level Performance). SUMO-pretrained models achieve lower perplexity and outperform AdamW on downstream benchmarks.

Table 7: Zero-shot evaluation of LLaMA-350M models pretrained with sequence length 1024 across reasoning tasks

| Method | Memory | Perplexity | BoolQ | RTE | HS | WG | OBQA | ARC-E | ARC-C | PIQA | SciQ | MathQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AdamW | 1.37G | 16.30 | 0.4917 | 0.4693 | 0.3688 | 0.5233 | 0.332 | 0.3729 | 0.2449 | 0.6534 | 0.609 | 0.2064 | 0.4272 |
| APOLLO | 0.34G | 15.64 | 0.5373 | 0.4698 | 0.3850 | 0.4925 | 0.322 | 0.3788 | 0.2483 | 0.6681 | 0.624 | 0.2127 | 0.4406 |
| APOLLO-Mini | 0.15G | 16.12 | 0.5376 | 0.4562 | 0.3707 | 0.5217 | 0.324 | 0.3758 | 0.2312 | 0.6638 | 0.619 | 0.2224 | 0.4374 |
| SUMO | 0.18G | 15.49 | 0.5479 | 0.4709 | 0.3937 | 0.5313 | 0.321 | 0.3832 | 0.2496 | 0.6709 | 0.623 | 0.2246 | 0.4416 |

We hope this addresses all your concerns, and we would be happy to further clarify open issues.

Comment

Thank you for the detailed response. While it addresses some of my concerns, the primary issues remain unresolved.

First, the main distinction between SUMO and Muon with GaLore appears to lie in Block 1.1. However, the paper lacks an ablation study specifically isolating the effect of Block 1.1, making it difficult to assess whether this modification is indeed impactful. It remains possible that Muon with GaLore already serves as a very strong baseline. Regarding the computation of the SVD, I consider this an implementation-level detail rather than a fundamental methodological innovation. The so-called Subspace-Aware Dynamic Adaptation appears to be already present in GaLore (see Algorithm 2 in the GaLore paper), and thus should not be claimed as a novel contribution.

Scaling involves two dimensions: model size and the number of training tokens. As I pointed out in Weakness 2, I remain concerned about whether the benefits of SUMO diminish as training tokens increase. Your current explanation does not sufficiently address this concern.

For the experimental section, I recommend that the authors more thoroughly document the experimental setup. For example, early stopping is mentioned in your response, but I could not find any mention of it in the main paper.

Comment

Thank you very much for your thoughtful review and detailed feedback. We appreciate your engagement with our work and would like to respectfully clarify our paper's contributions.

We understand your perspective that some elements of our paper may, at first sight, seem incremental or implementation-focused. However, we believe these algorithmic modifications, while subtle, are central to the method’s success and make a meaningful contribution to the field. As noted by reviewer “FdsP”, “the insight and motivation of this paper are very insightful and interesting; the algorithms are simple but clear and correct.”

Our contributions include a rigorous theoretical foundation that, to the best of our knowledge, introduces analytical expressions for the estimation error when using Newton-Schulz in Muon for the first time, along with convergence rates and a proof of convergence (see Lemma 3.2, Lemma 3.3, Theorem 3.8). Additionally, we provide new insights into optimizer state (moment) decay (Lemma 3.1), which are supported by empirical evidence (Figure 1) and help explain the practical impact of our design choices.

Regarding the algorithmic design, Block 2 represents a core innovation that distinguishes our method from both GaLore and Muon by explicitly leveraging subspace structure. Our implementation of Blocks 1 and 1.1, including the moment transformation and the use of randomized SVD, is not merely a minor engineering detail but rather a principled design decision that directly influences the method’s performance.

We hope this clarifies the motivation and significance behind our choices, and we thank you once again for the opportunity to engage in this constructive dialogue.

Comment

We sincerely appreciate your time and valuable comments.

The table below shows our findings on pre-training LLaMA 7B on the C4 dataset for up to 120K steps. It reports the validation perplexity and memory estimates for the methods.

Table 8: Preliminary results for pre-training LLaMA 7B on the C4 dataset, reporting validation perplexity and optimizer memory up to 120K steps. The results for Galore and Apollo are collected from the original papers. The experiments were conducted on an H200 GPU.

| Optimizer | Memory | 40K | 80K | 120K |
|---|---|---|---|---|
| 8-bit Adam | 13G | 18.09 | 15.47 | 14.83 |
| 8-bit GaLore | 4.9G | 17.94 | 15.39 | 14.95 |
| APOLLO | 1.6G | 17.55 | 14.39 | 13.23 |
| SUMO | 1.48G | 18.07 | 14.32 | 12.91 |
| Tokens (B) | – | 5.2 | 10.5 | 15.7 |

Even though our primary focus is on developing methods for hardware with limited resources, these initial results on a 7B-parameter model demonstrate that our approach is effective at a larger scale.

Due to the limited discussion time, we would be grateful for any additional feedback or confirmation that our rebuttal addressed your concerns.

Comment

We followed your suggestion and conducted a full fine-tuning experiment with vanilla Muon to complete the comparison for Table 2 in the paper, as presented below.

| Model | Memory | CoLA | STS-B | MRPC | RTE | SST-2 |
|---|---|---|---|---|---|---|
| Full Fine-Tuning | 747M | 62.24 | 90.92 | 91.30 | 79.42 | 94.57 |
| Muon Full Fine-Tuning | 458M | 61.19 | 90.98 | 92.14 | 80.83 | 94.71 |
| SUMO (Newton-Schulz5, rank=4) | 197M | 61.81 ± 0.02 | 90.81 ± 0.013 | 92.43 ± 0.034 | 79.33 ± 0.031 | 94.14 ± 0.028 |
| SUMO (SVD, rank=4) | 197M | 62.32 ± 0.015 | 91.05 ± 0.007 | 93.48 ± 0.022 | 81.08 ± 0.019 | 94.93 ± 0.01 |
| SUMO (Newton-Schulz5, rank=8) | 198M | 61.73 ± 0.021 | 90.77 ± 0.032 | 91.93 ± 0.04 | 79.66 ± 0.03 | 94.13 ± 0.025 |
| SUMO (SVD, rank=8) | 198M | 61.69 ± 0.014 | 91.11 ± 0.02 | 93.72 ± 0.018 | 81.38 ± 0.011 | 94.83 ± 0.01 |

These results show that SUMO achieves better performance with a significantly smaller memory footprint compared to the Muon full fine-tuning approach.

Comment

I thank the authors for their further response and experiments, which partially address my concerns. However, I still hope to see ablation experiments related to Block 1.1 in the revised version to validate its effectiveness.

Taking all factors into consideration, I will raise my rating accordingly.

Review
Rating: 4

The paper introduces SUMO (Subspace-Aware Moment-Orthogonalization), an optimizer that exploits the empirical observation that first-order momentum tensors in large-scale neural-network training quickly collapse to a very low-rank subspace. SUMO periodically (every K steps) identifies this subspace via randomized SVD, performs an exact orthogonalization of the momentum inside the reduced space, limits norm growth with a lightweight scaling gate, and then projects the update back to the full parameter space. The authors (i) prove that Newton–Schulz-style approximate orthogonalization leaves a non-negligible residual when the momentum matrix is ill-conditioned, (ii) provide convergence bounds that show strictly tighter guarantees when the SVD error δ → 0, and (iii) implement a practical algorithm that adds negligible wall-time overhead. Experiments on GLUE fine-tuning (RoBERTa-Base), LLaMA-style pre-training (60 M–1 B parameters), and GSM8k reasoning show consistent memory reductions (20–85 %), 1.2–1.6 × step-speedups, and modest accuracy/perplexity gains over AdamW, LoRA, GaLore, and Muon baselines.

Strengths and Weaknesses

Strengths:

  1. It identifies and formalizes the low-rank momentum phenomenon, then tailors an orthogonalization scheme to it.
  2. It supplies error bounds linking SVD precision, condition number, and convergence.
  3. It uses randomized SVD and subspace recycling to keep runtime overhead low.
  4. It covers three model scales, three task families, and compares to four strong baselines.
  5. It delivers both memory savings and quality improvements without extra hyper-parameters.

Weaknesses:

  1. The experiments include only single-seed runs; training token budgets differ across methods; and no wall-clock timing is reported.
  2. “invertible layer” and rank-1 convergence are not justified for attention/FFN layers.
  3. No study of subspace update period K, rank r, or γ (norm limiter).
  4. All experiments are single-GPU; no distributed data-parallel tests.

Questions

  1. How sensitive is SUMO to the SVD period K and rank r? Could you provide a small grid-search on GLUE?
  2. What is the per-step wall-clock overhead (ms) of the randomized SVD and subspace rotation on A100/H100?
  3. Does the “low-rank momentum” observation hold for attention and MLP weight matrices individually, or only when concatenated?
  4. Can you share preliminary results on >7 B-parameter models or multi-node training?

Limitations

Yes

Final Justification

Thanks to the authors for their efforts in addressing my concerns. Since most of my concerns were well clarified, I would like to raise my score accordingly.

Formatting Issues

NAN

Author Response

We thank the reviewer for the detailed and thoughtful review. We are glad that the core ideas, theoretical contributions, and practical benefits of SUMO were appreciated. We also acknowledge the valuable concerns raised about hyperparameter sensitivity, runtime metrics, and scalability, which we address in our response.

W1 adding std for different seeds run. We thank the reviewer for the suggestion. While our initial experiments followed community-standard single-seed evaluation (e.g., on GLUE), we observed consistent performance trends across tasks and datasets. To further demonstrate stability, we conducted 10 runs on CoLA, STS-B, MRPC, RTE, SST-2, and MNLI, and 5 runs on QNLI and QQP, using seeds drawn from a Gaussian distribution. We now report standard deviations to support the robustness of our results.

Table 2 Extended: Evaluation comparison of SUMO against state-of-the-art memory-efficient fine-tuning methods on the GLUE benchmark using the pre-trained RoBERTa-Base model. For comparison, we provide detailed results for SUMO using both SVD and Newton-Schulz5 orthogonalizations (ablation study). The table shows that SUMO consistently outperforms the low-rank fine-tuning methods and is relatively stable across runs, as indicated by the small standard deviations.

| Model | Memory | CoLA | STS-B | MRPC | RTE | SST-2 | MNLI | QNLI | QQP |
|---|---|---|---|---|---|---|---|---|---|
| Full Fine-Tuning | 747M | 62.24 | 90.92 | 91.30 | 79.42 | 94.57 | 87.18 | 92.33 | 92.28 |
| LoRA (rank=4) | 257M | 61.38 | 90.57 | 91.07 | 78.70 | 92.89 | 86.82 | 92.18 | 91.29 |
| GaLore (rank=4) | 253M | 60.35 | 90.73 | 92.25 | 79.42 | 94.00 | 87.00 | 92.24 | 91.06 |
| SUMO (Newton-Schulz5, rank=4) | 197M | 61.81 ± 0.02 | 90.81 ± 0.013 | 92.43 ± 0.034 | 79.33 ± 0.031 | 94.14 ± 0.028 | 86.90 ± 0.021 | 92.22 ± 0.04 | 91.26 ± 0.018 |
| SUMO (SVD, rank=4) | 197M | 62.32 ± 0.015 | 91.05 ± 0.007 | 93.48 ± 0.022 | 81.08 ± 0.019 | 94.93 ± 0.01 | 87.36 ± 0.023 | 93.25 ± 0.036 | 91.67 ± 0.015 |
| LoRA (rank=8) | 264M | 61.83 | 90.80 | 91.90 | 79.06 | 93.46 | 86.94 | 92.25 | 91.22 |
| GaLore (rank=8) | 257M | 60.06 | 90.82 | 92.00 | 79.78 | 94.38 | 87.17 | 92.20 | 91.11 |
| SUMO (Newton-Schulz5, rank=8) | 198M | 61.73 ± 0.021 | 90.77 ± 0.032 | 91.93 ± 0.04 | 79.66 ± 0.03 | 94.13 ± 0.025 | 87.20 ± 0.01 | 92.22 ± 0.031 | 91.36 ± 0.02 |
| SUMO (SVD, rank=8) | 198M | 61.69 ± 0.014 | 91.11 ± 0.02 | 93.72 ± 0.018 | 81.38 ± 0.011 | 94.83 ± 0.01 | 87.59 ± 0.012 | 93.65 ± 0.037 | 91.73 ± 0.018 |

Token Budget: We ensured a fair comparison of token budgets by matching the total number of optimization steps across all methods. Since SUMO converges faster, as shown in Figure 2, this means that fewer training tokens are generally required.

As for wall-clock timing, please refer to the following Table 6, presented in Appendix D, "Additional Experiments", of the supplementary material, where we report training times.

| Method | Rank | Time (h) | Memory (GB) | Accuracy (%) |
|---|---|---|---|---|
| LoRA | 32 | 0.40 | 14.36 | 45.80 |
| DoRA | 32 | 0.69 | 15.01 | 44.96 |
| GaLore | 32 | 2.59 | 15.15 | 58.40 |
| SUMO (Newton-Schulz5) | 32 | 1.83 | 13.86 | 58.47 |
| SUMO (SVD) | 32 | 1.56 | 13.86 | 61.23 |
| LoRA | 128 | 0.45 | 15.64 | 65.97 |
| DoRA | 128 | 0.72 | 16.17 | 66.81 |
| GaLore | 128 | 2.61 | 15.79 | 64.29 |
| SUMO (Newton-Schulz5) | 128 | 1.78 | 14.12 | 64.41 |
| SUMO (SVD) | 128 | 1.62 | 14.12 | 68.03 |

W2 Low‑Rank Gradient Behavior in FFN Layers. Weights in feed‑forward (FFN) blocks, especially the “project‑up” layer in Transformers, naturally exhibit a low-rank gradient structure during training. This phenomenon is theoretically substantiated in GI-rank collapse theory [1], which shows gradients in fully-connected architectures collapse to low rank, even under nonlinearity and standard activations. Separately, in the JoMA framework [2] and the GaLore study, it is shown that Transformer FFN gradients converge to extremely low rank (sometimes rank‑1), in practice during training, despite the layers not being strictly invertible [3] (in Appendix Sec. B). These findings support our empirical observations and validate why our SUMO method, leveraging low-rank orthogonalization, remains effective in FFN and attention layers. This will be clarified in the main text.

Illustration for the moment singular values decay is presented in Figure 1 (b), in the manuscript.

W3 Study of subspace update period K and rank r. Please refer to our reply to your Q1 (below), where we provide a grid search.

W4 Single GPU. Our focus in this paper was to evaluate SUMO under constrained compute settings (single GPU) to emphasize its practical benefits in memory-limited environments. Nonetheless, the core SUMO operations, low-rank projections, and subspace-aware updates are orthogonal to the data-parallel training paradigm and can be naturally integrated into distributed frameworks. To clarify, SUMO is implemented in PyTorch and compatible with torch.distributed. We are currently conducting multi-GPU experiments, which we plan to release with our open-source code.

Q1 Hyperparameter grid search: To evaluate the impact of the subspace update frequency (K) and rank (r), we performed a grid search while pretraining the LLaMA 130M model on the C4 dataset. This specific setup allows for a direct comparison with GaLore. Table: Perplexity results from a grid search over subspace update frequency (K) and rank (r) for the LLaMA 130M model pretrained on the C4 dataset. Values are presented as GaLore/SUMO.

| Update Frequency | Rank = 128 | Rank = 256 | Rank = 512 |
|---|---|---|---|
| 100 | 29.7 / 28.27 | 27.9 / 26.74 | 27.4 / 26.73 |
| 250 | 28.1 / 27.86 | 26.5 / 24.87 | 26.2 / 24.82 |
| 500 | 27.2 / 25.91 | 25.6 / 24.98 | 25.3 / 24.31 |
| 1k | 26.8 / 25.83 | 25.1 / 25.42 | 24.8 / 24.93 |

Compared to Galore, SUMO demonstrates notably less sensitivity to variations in rank and update frequency. This suggests greater hyperparameter robustness and potentially easier tuning, achieving competitive or superior perplexity with more stable performance across the tested range.

Q2 Per-step wall-clock overhead (ms) of the randomized SVD. In the following table, we show time measurements for the subspace computations on the GLUE MRPC dataset, comparing SUMO and GaLore on an A100 GPU:

| Subspace update role | Average time (ms) of subspace update (initial rank 8) |
|---|---|
| GaLore SVD block | 0.6610947 |
| SUMO randomized SVD | 0.1232528 |

In addition, the average time of the moment's subspace transformation/rotation is measured at 0.023574 ms.

Q3 Low-rank momentum observation. This low-rank momentum phenomenon occurs independently within MLP layers, as supported by our proof of Theorem 3.1. The concatenation of layers is not a prerequisite for the observation. Further empirical evidence of (first-order) moment singular value decay can be found in Figure 1 (b) of the manuscript. In this figure, the singular value decay of the momentum at step 100 shows that the top 4 singular values capture the vast majority of the information. Please refer to our response to W2 for a more detailed theoretical explanation.
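
As a simple way to probe this per layer (an illustrative sketch; the 95% energy threshold is an arbitrary choice, not a value from the paper), one can measure how many singular values of a momentum buffer are needed to capture most of its spectral energy:

```python
import torch

def effective_rank(moment: torch.Tensor, energy: float = 0.95) -> int:
    # Smallest k such that the top-k singular values hold `energy` of the
    # total squared spectral mass of a 2-D momentum matrix.
    s = torch.linalg.svdvals(moment)
    cum = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
    return int((cum < energy).sum().item()) + 1

# Example: probe the first-moment buffer of a single attention or MLP weight,
# e.g. effective_rank(optimizer.state[param]["exp_avg"]) for an Adam-style state.
```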

Q4 Large-scale results. We are working on a larger-scale evaluation and hope to report the results as soon as we have them. Please note, though, that our paper focuses on developing a method that performs well on hardware with limited resources.

We hope this addresses all your concerns, and we would be happy to further clarify open issues.

[1] Bradley T. Baker et al. Low-Rank Learning by Design: the Role of Network Architecture and Activation Linearity in Gradient Rank Collapse.

[2] Tian, Y. et al. Demystifying multilayer transformers via joint dynamics of MLP and attention. In The Twelfth International Conference on Learning Representations.

[3] Jiawei Zhao et al. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.

Comment

Dear reviewer, we sincerely appreciate your time and valuable comments. Due to the limited discussion time, we would be grateful for any additional feedback or confirmation that our rebuttal addressed your concerns.

Comment

Please see the final justification. Thanks for the authors' feedback and response. I appreciated that my concern was addressed well. My score was already updated.

Comment

The table below shows our findings on pre-training LLaMA 7B on the C4 dataset for up to 120K steps. It reports the validation perplexity and memory estimates for the methods.

Table 8: Preliminary results for pre-training LLaMA 7B on the C4 dataset, reporting validation perplexity and optimizer memory up to 120K steps. The results for Galore and Apollo are collected from the original papers. The experiments were conducted on an H200 GPU.

| Optimizer | Memory | 40K | 80K | 120K |
|---|---|---|---|---|
| 8-bit Adam | 13G | 18.09 | 15.47 | 14.83 |
| 8-bit GaLore | 4.9G | 17.94 | 15.39 | 14.95 |
| APOLLO | 1.6G | 17.55 | 14.39 | 13.23 |
| SUMO | 1.48G | 18.07 | 14.32 | 12.91 |
| Tokens (B) | – | 5.2 | 10.5 | 15.7 |

Even though our primary focus is on developing methods for hardware with limited resources, these initial results on a 7B-parameter model demonstrate that our approach is effective at a larger scale.

Review
Rating: 4

Based on the observation that the gradient rank decreases during training, the paper proposes a novel algorithm that achieves both memory efficiency and a fast convergence rate. The mathematics is solid, but the results are somewhat unclear and weak.

Strengths and Weaknesses

Strengths:

  1. the insight and motivation of this paper are very insightful and interesting, the algorithms are simple but clear and correct.
  2. The mathematics of the analysis and derivations is good, easy to follow and understand. The remarks are particularly interesting and insightful.

Weaknesses:

  1. The paper claims it "substantially improves convergence rates", but there is no direct evidence in the results section demonstrating the convergence rate, although various final results are better than the baselines.
  2. It also claims "SUMO effectively mitigates approximation errors", but verifies this only in the mathematical analysis, not in the results or an ablation study. It would be wonderful to demonstrate this point in an ablation study; I believe this is a missed opportunity.
  3. The Experiments and Results section is weak and unclear, particularly the tables; probably the authors had no space to explain them. It makes the paper look unfinished.
  4. Why are 4 different models used in the experiments? Besides model size, are there other reasons, such as why zero-shot uses Phi-2 and 8-shot uses LLaMA? Why not use both models in both tests?

Questions

  1. In Table 3, why doesn't the memory efficiency increase from the 350M model to the 1B model?
  2. In the introduction and in Section 3.1, the authors claim that both reference [2] and they themselves found that the gradient rank decreases through training; is [2] from the authors?
  3. The SVD of a large matrix takes a huge amount of computation as the model size increases. Can you actually apply this method to frontier models, like GPT-4o?
  4. There are many assumptions in the theoretical analysis; how reliable are these assumptions in a real application?

Limitations

Yes

Final Justification

I appreciate that the authors addressed most of my questions and argued against my concerns. They accepted that there is an important typo in their original manuscript, and their new results may lead to more concerns; all of this makes me worry that the results section of the paper is not mature and ready. Although I have this concern, I like the insight and motivation in this paper, so I keep my score.

Formatting Issues

  1. In line 140, there are two missing citations.
  2. The font size and format are not consistent, e.g., in Table 2 and Table 3.
Author Response

We sincerely thank the reviewer for their positive and encouraging feedback on our work. We are glad they found the motivation insightful, the algorithms clear and correct, and the mathematical analysis easy to follow. We appreciate the suggestions regarding the empirical validation of our claims and agree that clearer experiments and ablations would strengthen the work. We address these points in detail below.

W1 Convergence rate: We want to clarify that Figure 2 in Section 4 addresses this point. The figure presents a convergence plot on the QNLI task, where we directly compare the optimization steps required for SUMO (SVD), SUMO (Newton-Schulz5), and GaLore. The results show that SUMO with SVD converges approximately 1.6× faster than both baselines, reaching similar or higher final accuracy in significantly fewer steps. This provides direct visual and quantitative evidence supporting our claim of substantially improved convergence rates. We will add similar convergence figures for other GLUE tasks to the appendix.

W2 mathematical and empirical evidence for mitigation of the approximation error: We thank the reviewer for the helpful comment regarding empirical support for SUMO’s effectiveness in mitigating approximation errors. Below, we highlight and clarify points in the paper that demonstrate this point:

a. Mathematical Foundation (Equation 2): The approximation error associated with the Newton-Schulz orthogonalization, as derived in Formula (2), illustrates that this error is inherently dependent upon the condition number of the moment matrix. Given a fixed number of iterations (denoted by i), the approximation error bound grows exponentially with respect to the condition number: ‖E_i‖_F ≤ √r ( 1 - 1/κ )^(2i). This establishes a direct and quantifiable relationship between the numerical accuracy of orthogonalization and the spectral condition of the matrix, making the mitigation of approximation errors a critical aspect of the optimization process.
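
To give a sense of scale, plugging representative values into this bound (r = 8 and i = 5 iterations, chosen purely for illustration):

```latex
\|E_i\|_F \le \sqrt{r}\,\Bigl(1 - \tfrac{1}{\kappa}\Bigr)^{2i}
\quad\Longrightarrow\quad
\begin{cases}
\kappa = 2:   & \sqrt{8}\,(0.5)^{10} \approx 2.8 \times 10^{-3},\\
\kappa = 100: & \sqrt{8}\,(0.99)^{10} \approx 2.6.
\end{cases}
```

For a well-conditioned moment the residual is negligible after five iterations, while for an ill-conditioned one the bound becomes essentially vacuous, which is precisely the regime Figure 1 shows arising during training.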

b. Visual Evidence (Figure 1): Empirically, the impact of this mathematical relationship is visually substantiated in Figure 1. The condition number of the first-order moment matrices during training is shown to increase significantly, especially in early optimization steps, underscoring scenarios wherein approximation errors would notably affect convergence performance when using Newton-Schulz orthogonalization. These conditions confirm the relevance of addressing approximation errors for practical training stability and efficiency.

c. Ablation Study (Table 2): To concretely demonstrate this theoretical insight empirically, we conducted an ablation study explicitly comparing SUMO utilizing exact SVD orthogonalization against its counterpart employing Newton-Schulz5 orthogonalization across the GLUE benchmark tasks. As summarized in Table 2, the exact SVD variant consistently outperforms the Newton-Schulz5 variant across all GLUE tasks, despite identical optimization settings and model configurations.

This performance gap directly reflects the practical advantage gained by reducing approximation errors through exact SVD-based orthogonalization. For instance, the QNLI task clearly demonstrates accelerated convergence and higher accuracy for SUMO (SVD) over SUMO (Newton-Schulz5), which can be attributed explicitly to the superior numerical accuracy in optimization steps, mitigating errors arising from high condition numbers during moment orthogonalization.

Through this expanded explanation, we reinforce both mathematically and empirically the critical importance of using exact SVD orthogonalization within SUMO to effectively reduce approximation errors, thus achieving enhanced optimization performance.

W3 Elaboration on experiments section: We thank the reviewer for this feedback regarding the clarity of the Experiments and Results section. We therefore kindly elaborate in the following.

Table 2 (GLUE): SUMO consistently outperforms other memory-efficient methods while using less memory. It exceeds SOTA accuracy at ranks 4 and 8 on multiple tasks.

Table 3 (Pretraining): On C4, SUMO improves perplexity by >0.5% and reduces memory to ~1.25 at the lowest.

Tables 4–5 (Reasoning): On GSM8K, SUMO achieves 2%+ higher accuracy in both 0- and 8-shot settings.

These results collectively demonstrate that SUMO improves memory efficiency, convergence, and performance across all settings. We will incorporate these explanations and additional clarifications in the final version.

New results: We extend our evaluation to diverse commonsense and reasoning tasks in the zero-shot setting, following the exact evaluation protocol described in Table 4 of APOLLO (Zhu et al., SGD-like Memory, AdamW-level Performance). SUMO-pretrained models achieve lower perplexity and outperform AdamW on downstream benchmarks.

Table 7: Zero-shot evaluation of LLaMA-350M models pretrained with sequence length 1024 across reasoning tasks

| Method | Memory | Perplexity | BoolQ | RTE | HS | WG | OBQA | ARC-E | ARC-C | PIQA | SciQ | MathQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AdamW | 1.37G | 16.30 | 0.4917 | 0.4693 | 0.3688 | 0.5233 | 0.332 | 0.3729 | 0.2449 | 0.6534 | 0.609 | 0.2064 | 0.4272 |
| APOLLO | 0.34G | 15.64 | 0.5373 | 0.4698 | 0.3850 | 0.4925 | 0.322 | 0.3788 | 0.2483 | 0.6681 | 0.624 | 0.2127 | 0.4406 |
| APOLLO-Mini | 0.15G | 16.12 | 0.5376 | 0.4562 | 0.3707 | 0.5217 | 0.324 | 0.3758 | 0.2312 | 0.6638 | 0.619 | 0.2224 | 0.4374 |
| SUMO | 0.18G | 15.49 | 0.5479 | 0.4709 | 0.3937 | 0.5313 | 0.321 | 0.3832 | 0.2496 | 0.6709 | 0.623 | 0.2246 | 0.4416 |

In addition to Table 3, we include a comparison of our results with state-of-the-art methods, AdamSN[1] and LDAdam. The comparison shows that while other methods have a slight advantage on smaller models, SUMO is highly effective at scale, achieving a perplexity of 14.68 on the 1B model (vs. AdamSN's 14.96) with superior memory efficiency (2.87G vs. 6.17G).

W4 Experiment configuration. We thank the reviewer for this question. We use four models to evaluate our method under diverse and complementary scenarios (classification, regression, zero-shot, 8-shot), as follows:

  1. RoBERTa-base: for GLUE, to benchmark against well-established fine-tuning baselines.
  2. LLaMA (60M–1B): for pre-training experiments, to study memory-efficient scaling trends.
  3. Phi-2 (2.7B): for zero-shot GSM8K, because Phi-2 is optimized explicitly for reasoning-heavy tasks and provides a strong baseline for evaluating generalization without in-context examples.
  4. LLaMA (3B): for 8-shot GSM8K, since LLaMA is widely adopted for few-shot evaluation, where its strong in-context learning ability is well documented.

Importantly, our method is model-agnostic and would behave identically when applied to either model in both regimes. The chosen setup follows standard experimental protocols commonly used for each task in prior and related work.

In addition, below we expand Tables 4 and 5 from the paper to complete the 0-shot/8-shot evaluation for both models. Table 4/5: Zero-shot and 8-shot evaluations on the GSM8K dataset for Phi-2 (2.7B) and LLaMA (3B).

| Model | Method | Accuracy (0-shot) | Accuracy (8-shot) |
|---|---|---|---|
| Phi-2 (2.7B) | Base | 15.16% | 19.37% |
| Phi-2 (2.7B) | GaLore | 52.24% | 71.48% |
| Phi-2 (2.7B) | LoRA | 42.8% | 67.24% |
| Phi-2 (2.7B) | SUMO | 54.13% | 72.35% |
| LLaMA (3B) | Base | 11.52% | 17.93% |
| LLaMA (3B) | GaLore | 33.66% | 74.9% |
| LLaMA (3B) | LoRA | 30.78% | 68.3% |
| LLaMA (3B) | SUMO | 35.21% | 76.7% |

SUMO demonstrates the strongest generalization, outperforming other methods across both models in zero-shot and few-shot scenarios.

Q1 Typo Correction. Thanks! The memory size is 2.87 GB.

Q2 Please note that the low-rank analysis here differs from the one in [2]. In [2], the authors considered the gradient, while here we consider the momentum.

Q3 Methodology Clarification. A key contribution of SUMO is applying exact SVD within a low-dimensional subspace onto which gradients are projected, rather than on full-sized matrices. This is both practical and beneficial: exact SVD in this compact space captures the true dominant directions, improving gradient conditioning and accelerating convergence. As shown in Block 2 of Algorithm 1, this is feasible because (1) gradients in LLMs often exhibit low-rank structure, and (2) SUMO confines SVD to small matrices (e.g., rank 8 or 16), resulting in minimal overhead. Additionally, as detailed in Section 3.2 and Algorithm 1 (Block 1), SUMO employs Randomized Truncated SVD, which differs from GaLore, to efficiently update the low-rank projection every few hundred steps. This optimization reduces the computational cost from O(mn²) to O(mnr + mr²), making the method more scalable. As shown in Table 3, SUMO effectively pretrains LLaMA models with up to 1 billion parameters, achieving lower perplexity and memory usage compared to competing methods. Although scaling to advanced models like GPT-4o is a future goal, SUMO's adaptive low-rank design positions it favorably for this type of goal.
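
Schematically, a randomized truncated SVD of this complexity can be built with the standard Halko et al. construction (a simplified sketch with an illustrative oversampling of 4, not necessarily our exact routine):

```python
import torch

def randomized_truncated_svd(g: torch.Tensor, rank: int, oversample: int = 4):
    # Rank-r factorization of an (m, n) matrix in O(mnr + mr^2) instead of O(mn^2).
    m, n = g.shape
    k = min(rank + oversample, min(m, n))
    omega = torch.randn(n, k, device=g.device, dtype=g.dtype)  # random test matrix
    q, _ = torch.linalg.qr(g @ omega)       # (m, k) orthonormal range estimate
    b = q.T @ g                             # (k, n) small projected problem
    u_b, s, vh = torch.linalg.svd(b, full_matrices=False)
    return (q @ u_b)[:, :rank], s[:rank], vh[:rank]

# The left factor plays the role of the low-rank projection Q^{(t)} in Algorithm 1.
```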

Q4 Reliability of theoretical assumptions The assumptions presented are valid in practical applications. For instance, in Lemma 3.3, we assume an L-Lipschitz gradient and a bounded Newton–Schulz approximation error, both of which are commonly made in related work. We also provide a bound on this error in Lemma 3.2 and show it can be controlled through sufficient iterations. Figure 1 illustrates that large condition numbers often arise in practice, supporting the practical relevance of this bound. The smoothness and boundedness assumptions in Theorem 3.8 are similarly standard, appearing in works such as GaLore, AdaRankGrad, and Muon. While we acknowledge that these assumptions may not hold perfectly in all settings, even then we still observe strong empirical performance across multiple networks and benchmarks.

We hope this addresses all your concerns, and we would be happy to further clarify open issues.

Comment

Thank you for the authors' thorough reply to our comments. I still think the experimental results could be improved. The obvious mistake pointed out in my Q1 makes me skeptical of some of the results. The new table showing zero- and eight-shot results for the two models is helpful. Indeed, for zero-shot the proposed method shows a larger improvement over GaLore with the Phi-2 model, and for eight-shot it performs better with the LLaMA model. I appreciate the mathematics and the innovation of the proposed method, but the experimental results could be improved. Maybe put additional results in the appendix.

Comment

Dear reviewer, we sincerely appreciate your time and valuable comments. Due to the limited discussion time, we would be grateful for any additional feedback or confirmation that our rebuttal addressed your concerns.

Comment

We sincerely appreciate your time and valuable comments.

The table below shows our findings on pre-training LLaMA 7B on the C4 dataset for up to 120K steps. It reports the validation perplexity and memory estimates for the methods.

Table 8: Preliminary results for pre-training LLaMA 7B on the C4 dataset, reporting validation perplexity and optimizer memory up to 120K steps. The results for Galore and Apollo are collected from the original papers. The experiments were conducted on an H200 GPU.

| Optimizer | Memory | 40K | 80K | 120K |
|---|---|---|---|---|
| 8-bit Adam | 13G | 18.09 | 15.47 | 14.83 |
| 8-bit GaLore | 4.9G | 17.94 | 15.39 | 14.95 |
| APOLLO | 1.6G | 17.55 | 14.39 | 13.23 |
| SUMO | 1.48G | 18.07 | 14.32 | 12.91 |
| Tokens (B) | – | 5.2 | 10.5 | 15.7 |

Even though our primary focus is on developing methods for hardware with limited resources, these initial results on a 7B-parameter model demonstrate that our approach is effective at a larger scale.

Comment

In the following table, as an additional comparison to Table 2 in the paper, we present the result of full fine-tuning with vanilla Muon.

| Model | Memory | CoLA | STS-B | MRPC | RTE | SST-2 |
|---|---|---|---|---|---|---|
| Full Fine-Tuning | 747M | 62.24 | 90.92 | 91.30 | 79.42 | 94.57 |
| Muon Full Fine-Tuning | 458M | 61.19 | 90.98 | 92.14 | 80.83 | 94.71 |
| SUMO (Newton-Schulz5, rank=4) | 197M | 61.81 ± 0.02 | 90.81 ± 0.013 | 92.43 ± 0.034 | 79.33 ± 0.031 | 94.14 ± 0.028 |
| SUMO (SVD, rank=4) | 197M | 62.32 ± 0.015 | 91.05 ± 0.007 | 93.48 ± 0.022 | 81.08 ± 0.019 | 94.93 ± 0.01 |
| SUMO (Newton-Schulz5, rank=8) | 198M | 61.73 ± 0.021 | 90.77 ± 0.032 | 91.93 ± 0.04 | 79.66 ± 0.03 | 94.13 ± 0.025 |
| SUMO (SVD, rank=8) | 198M | 61.69 ± 0.014 | 91.11 ± 0.02 | 93.72 ± 0.018 | 81.38 ± 0.011 | 94.83 ± 0.01 |

These results show that SUMO achieves better performance with a significantly smaller memory footprint compared to the Muon full fine-tuning approach.

Comment

Dear Reviewer FdsP,

Thank you for your constructive engagement and thoughtful feedback. We believe we have addressed each weakness you identified: W1 with Figure 2 showing 1.6× faster convergence; W2 with the mathematical proofs now complemented by empirical validation through Table 2's ablation study (confirming exact SVD's superiority over Newton-Schulz5); and W3/W4 with a comprehensive evaluation including Table 7's zero-shot generalization results and Table 8's large-scale pre-training experiments up to 7B parameters, plus our comparison with vanilla Muon full fine-tuning showing memory efficiency and superior performance. We believe these additions, together with the theoretical contributions you recognized, present a stronger and more complete picture of our work. We kindly ask that you take this into account in your final assessment.

Final Decision

The paper proposes a new optimizer based on moment orthogonalization and demonstrates practical benefits when applying it to LLM training. After a rebuttal and discussion process in which the authors provided additional experimental results, all three reviewers lean toward accept and indicate that most, but perhaps not all, of their concerns were addressed. There is a consensus that the ideas and their technical development are a strength, though there may be room for further experimental validation. The Area Chair agrees with the unanimous reviewer rating.