Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks
Abstract
Reviews And Discussion
The authors introduce Sparse Spectral Training (SST), a memory-efficient way to update neural network parameters. Unlike low-rank methods, SST begins by computing the singular-value decomposition of a weight matrix. At each step, SST updates all singular values and a sample of singular vectors (sampled with probability weighted by the singular values). Because these updates can violate orthogonality, SST periodically re-initializes the low-rank parameters via an SVD of the weight matrix. On a variety of benchmarks spanning natural language processing, graph data science, and image classification, SST consistently outperforms low-rank methods; for hyperbolic models, it often even beats the full-rank models.
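For concreteness, here is a minimal PyTorch-style sketch of one training step as I understand the procedure; the function names, the sampling count, and the separate re-initialization helper are my own illustration, not the authors' code:

```python
import torch

def sst_step(U, S, Vt, grads, lr=1e-3, num_sampled=8):
    """U: (m, r), S: (r,), Vt: (r, n) parameterize the weight as W = U diag(S) Vt."""
    dU, dS, dVt = grads                       # gradients w.r.t. U, S, Vt from backprop
    S -= lr * dS                              # all singular values updated every step
    # only a sampled subset of singular-vector pairs is updated, with sampling
    # probability proportional to the singular values
    probs = S.abs() / S.abs().sum()
    idx = torch.multinomial(probs, num_sampled, replacement=False)
    U[:, idx] -= lr * dU[:, idx]
    Vt[idx, :] -= lr * dVt[idx, :]
    return U, S, Vt

def reinit_from_svd(U, S, Vt):
    # periodic re-initialization: restore orthogonality by re-decomposing the
    # reconstructed weight matrix
    W = U @ torch.diag(S) @ Vt
    return torch.linalg.svd(W, full_matrices=False)
```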
Update after rebuttal
I was satisfied with my discussion with the authors and am continuing to recommend acceptance.
Questions For Authors
- It seems to me that the Enhanced Gradients in Section 4.2 are not actually the partial derivatives of the loss with respect to each of these parameters. While this may work in practice, it seems like a misnomer to refer to them as gradients. Can you please either:
- Demonstrate to me why these "enhanced gradients" are still faithful to the computation performed by the model, or
- Perform ablations of these enhanced gradients to demonstrate their empirical usefulness
- What happens with V in Section 4.2? I assume it is updated analogously to U, via Equations 5-7 and using the enhanced gradients in Equation 9. If this is true, please edit the paper to reflect this. Please also be clear about whether the indices for updating U and V are shared or drawn independently.
- In 4.3, you explain that saddle point issues are due to the zero initialization of the low-rank factors. It occurs to me that initializing them differently (i.e. to nonzero matrices) could help avoid this issue, reducing the need for approaches like SST. What happens if you attempt this more trivial solution, and why is SST still a better option?
- In L323--325, you say "For all low-rank methods, all linear layers in the baseline models were replaced by low-rank layers." Can you elaborate on what exactly this means? It is not explained in Appendix E, and appears like a non-standard way of evaluating methods like LoRA.
- Additionally, please confirm that the "Full" benchmarks in Tables 1--4 are not affected by this change.
- What's going on with the placement of Table 5? Is it supposed to be part of the Appendix, or is it a float from the main body of the paper that has crawled over the page limit?
- Why did you feel it was important to include hyperbolic models throughout your benchmarks?
Claims And Evidence
All of the authors' claims are well-supported by the paper.
Methods And Evaluation Criteria
I am not an expert in memory-efficient/low-rank training methods and what typical benchmarks for them are, but the experiments make sense to me. I appreciate that the authors cover a wide range of NLP tasks and even evaluate zero-shot performance. Their approach to hyperbolic networks on graphs is basically canonical, though it is surprising that the authors include hyperbolic models in their evaluations at all.
Theoretical Claims
To the best of my knowledge, the only proof in this paper is in Appendix D (gradient of sparse spectral layer), which I read. It appears correct to me, and I do not feel that the authors make any other theoretical claims which need further justification.
Experimental Design And Analysis
The experiments as described in the main body of the paper appear sound. I read the descriptions of additional experiments in Appendices G, H, and J, and found them to be adequate as well.
Supplementary Material
I read Appendices D-K and reviewed the anonymized repo linked in the abstract.
Relation To Broader Scientific Literature
The authors lump together a number of memory-efficient training approaches into their "ReLORA*" generalization, demonstrate intuitively and empirically that all of these approaches suffer from saddle point issues which compromise their efficiency, and propose SST as an alternative. In Appendix G, they also compare to GaLore. They demonstrate a substantial improvement over existing methods, and also extend their approach to non-Euclidean (hyperbolic) settings.
Essential References Not Discussed
In general, the references seem reasonably complete. There is another work at the intersection of hyperbolic geometry and LLM fine-tuning [1] that may be worth discussing, though direct comparison to their approach could be difficult.
[1] Yang et al, 2024. Hyperbolic Fine-tuning for Large Language Models. https://arxiv.org/abs/2410.04010
Other Strengths And Weaknesses
Strengths:
- The authors craft an excellent narrative, providing elegant justifications for their way of thinking. I particularly liked their explanations for the following topics:
- Their explanation of saddle point issues in ReLORA*, combined with the immediate empirical evidence in Figure 2.
- In Section 4.4, they give an intuitive exploration-exploitation perspective on low-rank methods that feels quite obvious once it is stated.
- Why hyperbolic geometry is prone to overfitting, especially in high dimensions, and why SST can remedy this issue
- Introducing the effective step metrics to motivate their analysis in Figure 4
- Strong performance on all benchmarks, including those in the Appendix
- Extension to non-Euclidean geometry, where their method really shines, outperforming full-rank methods
Weaknesses:
- Some more pragmatic considerations could be included. In particular, I am interested in how this method interacts with quantization and how it scales to larger models.
- I feel that the comparison to GaLore should be incorporated into the main body of the paper
- I have concerns about the enhanced gradients (see questions for the authors)
Other Comments Or Suggestions
- Throughout the introduction, the authors use \citep for in-text references, where they should be using \citet:
- L084: reference to Hu et al
- L092: Reference to Lialin et al
- L097: Zhao et al
- L098: Meng et al
- Table 5: It would be better to bold the best method for each specific task, like in the other tables
- Tables 4 and 5 should probably be reversed in order
- ReLoRA* should be benchmarked in Table 4
Thank you for your insightful comments and suggestions.
W1: There is another work ...
Reply: Thank you for pointing out this recent work. We have included it in the related work section of our revised manuscript.
W2: I am interested in ...
Reply: Low-precision and mixed-precision training are standard techniques to accelerate LLM pre-training while reducing memory usage. In our experiments:
- For LLaMA pre-training, we used bfloat16 precision, which offers a wide dynamic range and is well-suited for stable large-scale training.
- For OPT pre-training, we used a mixed-precision setup, combining bf16 and TF32 to leverage fast matrix operations on modern hardware while preserving numerical stability.
Although low-precision training greatly improves efficiency, it can increase sensitivity to instability. We observed that LoRA and GaLore occasionally experienced training instability, which may be partially due to interactions between low-rank updates and reduced precision.
W3: Comparison to GaLore ...
Reply: Thank you for the suggestion. In the revised manuscript, we have moved the relevant results and analysis from the appendix into the main body.
W4: Other Comments Or Suggestions
Reply: Due to the character limit, we could not address each of these points in detail here. We have carefully revised the manuscript accordingly. Thank you again for your constructive feedback.
W5: It seems to me ...
Reply: The motivation behind the Enhanced Gradient is to decouple the updates of singular vectors and singular values, allowing singular vectors with smaller singular values to still receive meaningful updates. From another perspective, this effectively corresponds to using a larger learning rate for less dominant directions, similar in spirit to adaptive optimization methods like Adam, where updates are rescaled by the inverse of the magnitude. Enhanced Gradient remains faithful to the model’s computation in the sense that when the gradient of W approaches zero, the Enhanced Gradients of U and V also approach zero. This ensures convergence behavior remains aligned with the model’s optimization objective.
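To illustrate the intuition, here is a schematic sketch of the rescaling idea only; the exact form is given by Equation 9 in the paper, and the function and variable names below are illustrative rather than our implementation:

```python
import torch

def enhanced_grad_U(dW, V, S, eps=1e-8):
    # For W = U @ diag(S) @ V.T, the true gradient w.r.t. U is dW @ V @ diag(S),
    # so directions with small singular values barely move under plain SGD.
    raw = dW @ V @ torch.diag(S)
    # Rescaling each column by 1 / |sigma_i| (roughly recovering dW @ V) lets
    # less dominant directions receive comparable updates; when dW -> 0, the
    # rescaled gradient also -> 0, so stationary points are preserved.
    return raw / (S.abs() + eps)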
W6: What happens with ...
Reply: Due to the page limit, we had originally omitted the update formulas for V in Equations 6–8, and have now added them back.
To clarify, the indices for updating U and V are shared. This is necessary because the gradient update of U depends on V and vice versa, which promotes faster convergence.
W7: In 4.3 ...
Reply: Thank you for the insightful question. We conducted additional experiments where A and B in ReLoRA* are initialized with random values instead of zeros. The results (https://anonymous.4open.science/r/sparse_spectral_training-6A2C/relora_random.png) show clear improvement over the original ReLoRA*, but performance still falls short of SST.
One possible reason is that SST uses singular vectors of the weight matrix W, which may better align with the update directions. In contrast, randomly initialized vectors in high-dimensional space are typically close to orthogonal and may not capture meaningful structure. We are also testing randomly initialized ReLoRA* on LLaMA and will update the results as they become available.
W8: In L323--325...
Reply: Thank you for the question. The key difference from standard fine-tuning setups is that LoRA typically replaces only the query and value projection layers during fine-tuning, while freezing the rest of the model. In contrast, for pre-training, all linear layers, including the Q, K, V, O projections and the FFN, must be trainable. Therefore, to apply low-rank methods like LoRA, ReLoRA*, or SST to pre-training fairly, we replace all torch.nn.Linear layers of the original model with their corresponding low-rank implementations to maintain parameter efficiency across the entire model.
We confirm that the "Full" benchmarks are not affected by this replacement; they use standard full-rank training with all linear layers unchanged.
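As a minimal sketch of what this replacement looks like in practice (illustrative only; make_layer stands in for the specific LoRA, ReLoRA*, or SST layer constructor and this is not our exact code):

```python
import torch.nn as nn

def replace_linear_layers(module, make_layer, rank):
    """Recursively swap every torch.nn.Linear for a low-rank counterpart."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            # make_layer(in_features, out_features, rank) builds the replacement layer
            setattr(module, name, make_layer(child.in_features, child.out_features, rank))
        else:
            replace_linear_layers(child, make_layer, rank)
    return module
```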
W9: What's going on...
Reply: Thank you for pointing this out. Table 5 was originally intended as part of the Appendix but was referenced in the main text, so it wasn’t placed under any specific appendix section. We have revised the manuscript and moved it into the main body for clarity and consistency.
W10: Why did you feel...
Reply: Thank you for the question. Hyperbolic models often suffer from overfitting and instability in high-dimensional settings due to the exponential growth of space. Prior work has typically restricted them to low-dimensional use cases for this reason. We included hyperbolic models to show that SST, by limiting the parameter search space, mitigates overfitting and improves stability. Compared to both full-rank and other low-rank methods, SST is more robust and consistently delivers better performance in high-dimensional hyperbolic setups.
Thank you for your thoughtful answers to my review. I have no further questions, and am keeping my score at a 4.
Sparse Spectral Training (SST) is a memory-efficient pre-training method for large language models. It updates all singular values, selectively updates singular vectors, and achieves performance comparable to full-rank training. It is the first to incorporate a parameter-efficient pre-training process in hyperbolic space, demonstrating superior performance and generalizability across various data structures and models.
Questions For Authors
No.
Claims And Evidence
Experiments are conducted for large language models with Euclidean and hyperbolic backbone networks respectively. The performance of SST, evaluated by perplexity, BLEU score, F1 score, and accuracy, is comparable with full-rank training. In addition, an analysis of memory consumption and training time is included.
Methods And Evaluation Criteria
Yes, the proposed methods and evaluation criteria make sense for the problem of training large language models and inference efficiently.
Theoretical Claims
Yes. They are based on well-established mathematical principles such as singular value decomposition and gradient descent.
Experimental Design And Analysis
Result analysis should provide insights into future work and underlying reasons for the experimental results in the performance tables, not only provide numerical values already shown in the table.
Detailed analysis is needed for rank choice.
Experiments on hyperbolic graph neural networks are limited. The comparison between the hyperbolic transformer and graph neural networks needs further analysis.
Supplementary Material
No.
Relation To Broader Scientific Literature
The proposed method, Sparse Spectral Training (SST), is related to the broader scientific literature on parameter-efficient training methods and hyperbolic neural networks.
Essential References Not Discussed
No.
Other Strengths And Weaknesses
No.
Other Comments Or Suggestions
No.
Thank you for your detailed review and valuable insights.
Weakness 1: Result analysis should provide insights into future work and underlying reasons for the experimental results in the performance tables, not only provide numerical values already shown in the table.
Reply: Thank you for this valuable feedback. We agree that the original version of our paper focused too heavily on numerical results and lacked sufficient analysis. In the revised version, we have incorporated more discussion on the underlying causes of the observed performance differences, and we highlight two key examples below:
First, in our ablation study on sampling mechanisms (Appendix E.2), we examined why multinomial sampling outperforms other strategies such as uniform, sequential, and top-k selection. As shown in Table 17, top-k performs the worst because it repeatedly updates only a fixed subset of singular vectors, restricting the search space to a narrow low-rank subspace. In contrast, stochastic strategies like multinomial sampling encourage exploration across a broader spectral domain, leading to improved generalization and convergence. This observation suggests future work could explore adaptive or learnable sampling strategies to further enhance training dynamics.
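The difference between the selection strategies can be seen in a small illustrative example (toy code, not our training implementation):

```python
import torch

S = torch.tensor([3.0, 2.0, 1.0, 0.5, 0.1])  # hypothetical singular values, r = 5
k = 2

# top-k always revisits the same dominant directions (indices 0 and 1 here),
# confining updates to a fixed low-rank subspace
top_k_idx = torch.topk(S, k).indices

# multinomial sampling (probability proportional to the singular values) still
# favors dominant directions but leaves every direction reachable over training
sampled_idx = torch.multinomial(S / S.sum(), k, replacement=False)
```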
Second, in our extended comparison of SST and GaLore (Appendix G), we analyzed why SST consistently outperforms GaLore in low-rank regimes. GaLore relies on a low-rank projection of the gradient derived from a single batch, which can be noisy and unstable due to sampling variability. SST, on the other hand, uses the SVD of the weight matrix W, a more stable and aggregated representation over training steps, and updates all singular values at every iteration. This makes SST more robust, especially at small ranks. These findings also suggest a potential future direction in combining SST’s spectral stability with GaLore’s gradient-based projection approach.
We have incorporated these analyses into the main text in the revised version, utilizing the additional page allowed for the final version.
Weakness 2: Detailed analysis is needed for rank choice.
Reply: Thank you for raising this important point. We agree that a more in-depth discussion of rank is valuable. In our study, the rank is not treated as a tunable hyperparameter but rather as a practical constraint determined by available GPU memory. While increasing the rank generally improves performance, it also significantly increases memory consumption. To ensure fair comparisons, we used the same rank across LoRA, ReLoRA*, and SST in all experiments.
To offer further insight, we report BLEU scores on the IWSLT'14 dataset using a Transformer with model dimension 128:
| Rank (r) | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| LoRA | 12.44 | 14.16 | 16.37 | 20.56 | 23.30 | 25.12 | 26.11 |
| ReLoRA | 14.53 | 15.39 | 18.00 | 20.61 | 22.92 | 24.15 | 25.25 |
| SST | 17.49 | 20.69 | 22.80 | 24.19 | 25.12 | 26.08 | 26.15 |
These results show that SST consistently outperforms both LoRA and ReLoRA across all rank settings, particularly in the low-rank regime. This demonstrates SST’s robustness and practical utility for memory-efficient training under constrained compute budgets.
Weakness 3: Experiments on hyperbolic graph neural networks are limited. Comparison between hyperbolic transformer and graph neural networks needs further analysis.
Reply: Thank you for this suggestion. We acknowledge that our current experiments on hyperbolic graph neural networks and hyperbolic transformers are limited in scope. However, the results presented already highlight the advantages of SST in this setting.
SST consistently outperforms other low-rank methods across all tested configurations of hyperbolic models (with the exception of node classification on Airport with ), and in several cases, especially for the hyperbolic Transformer, it even surpasses full-rank training. As shown in Table 1, baseline methods like LoRA and ReLoRA suffer from training instability (e.g., NaN losses), whereas SST maintains stable convergence throughout.
Geometrically, hyperbolic space exhibits exponential volume growth with distance, unlike the polynomial growth of Euclidean space. This makes hyperbolic models highly expressive but also prone to overfitting, particularly in high dimensions. By selectively updating a small subset of singular vectors, SST constrains the parameter search space, helping to regularize the model and prevent overfitting.
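As a standard reference point (curvature -1, with Ω_{n-1} denoting the surface area of the unit (n-1)-sphere), the volume of a geodesic ball of radius r compares as follows:

```latex
% Euclidean ball volume grows polynomially in r; hyperbolic ball volume grows
% exponentially in r (for fixed dimension n).
\mathrm{Vol}_{\mathbb{R}^n}(B_r) \;=\; \frac{\Omega_{n-1}}{n}\, r^{n} \;\sim\; r^{n},
\qquad
\mathrm{Vol}_{\mathbb{H}^n}(B_r) \;=\; \Omega_{n-1} \int_0^{r} \sinh^{n-1}(t)\, dt \;\sim\; e^{(n-1)r}.
```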
We appreciate the reviewer’s suggestion and are actively running additional experiments to further evaluate SST on other hyperbolic architectures. We will share updated results as soon as they are available.
Training large language models faces challenges due to high GPU memory demands. Existing methods like LoRA and ReLoRA have limitations, such as being constrained by low-rank structure or suffering from saddle point issues. This paper proposes Sparse Spectral Training (SST). It updates all singular values, selectively updates singular vectors via multinomial sampling, and uses SVD for initialization and re-initialization to optimize memory usage during pre-training. SST outperforms existing memory-reduction training methods in various tasks across Euclidean and hyperbolic neural networks. It reduces the performance gap between low-rank methods and full-rank training, and is more memory-efficient, making it a promising technique for model pre-training.
Questions For Authors
N.A.
Claims And Evidence
Yes.
Methods And Evaluation Criteria
Yes.
Theoretical Claims
Yes.
Experimental Design And Analysis
Yes.
Supplementary Material
Yes.
Relation To Broader Scientific Literature
Yes.
Essential References Not Discussed
Yes.
Other Strengths And Weaknesses
Strengths:
- The paper is well written. Although I'm not that familiar with the hyperbolic concepts, I can roughly follow the paper.
- The proposed SST method is interesting, and its novelty is sufficient, as it is one of the few studies on efficient LLMs in hyperbolic space.
Weaknesses:
- There are not enough baselines and they are not particularly advanced. Consider introducing Dora, HIRA, Vera, FourierFT (at least Dora and Vera).
- I would like to discuss issues related to training versus fine-tuning with the authors. The paper does not actually distinguish between the two and describes a number of PEFT methods as "Previous memory reduction training techniques", but I personally have doubts about this. At least formally speaking, efficient training (such as pruning techniques) and PEFT (such as LoRA) are two different machine learning tasks: the former focuses on the efficiency of training and inference, while the latter focuses on the number of parameters to be fine-tuned (and people have recently begun to pay attention to issues such as memory and time in PEFT). Therefore, without additional explanation, a paper should clearly distinguish the two tasks, both for the development of the entire field of efficient AI from a macro perspective and so that readers can better understand your paper from a micro perspective. In addition, I think training and fine-tuning are intuitively different. The biggest difference is that fine-tuning starts from a certain amount of existing knowledge, and the common understanding is that good fine-tuning retains and amplifies useful knowledge while effectively learning new knowledge. Of course, without comprehensive research, I can only offer this intuition. But in any case, I think the two should be distinguished until the authors have a more comprehensive study and discussion. If the authors are willing to change the writing, I would suggest changing the training rhetoric to fine-tuning as much as possible (this seems to be the easiest change, because the baselines are all PEFT methods).
- Regarding weakness 2, the authors should feel free to refute me with references, facts, and logic, since it is somewhat of an open question.
Other Comments Or Suggestions
N.A.
Thank you for your insightful comments and suggestions. We are grateful for the opportunity to clarify and improve our work.
Weakness 1: There are not enough baselines and they are not particularly advanced. Consider introducing Dora, HIRA, Vera, FourierFT (at least Dora and Vera).
Reply: Thank you for the suggestion. In the original manuscript, we did not include Dora and Vera as baselines because their parameter search spaces are restricted to the low-rank subspace. As discussed in Section 3.1 "Limitation of LoRA," this constraint limits their expressiveness; while it may suffice for simpler tasks like fine-tuning, it becomes a major bottleneck for more complex tasks such as training from scratch, which is the focus of our work. Since ReLoRA and GaLore are explicitly designed for pre-training, we prioritized them as the most relevant baselines for comparison.
To further demonstrate the limitations of Dora and Vera in pre-training, we conducted additional experiments comparing them to SST:
BLEU scores (higher is better) on IWSLT’14 with Transformer models:
| (d, r) | Dora | Vera | LoRA | ReLoRA* | SST |
|---|---|---|---|---|---|
| (64, 8) | 18.19 | 9.22 | 18.08 | 18.12 | 22.28 |
| (64, 4) | 14.38 | 9.30 | 14.05 | 15.49 | 20.27 |
| (128, 16) | 23.46 | 12.22 | 23.30 | 22.92 | 25.12 |
| (128, 8) | 21.09 | 12.25 | 20.56 | 20.61 | 24.19 |
| (128, 4) | 18.12 | 11.61 | 16.37 | 18.00 | 22.80 |
These results show that while Dora slightly improves over LoRA, it still falls short of SST. Vera performs poorly across all settings.
Dora decomposes the pre-trained weights into magnitude and direction components, but this does not overcome the fundamental limitation of the low-rank subspace. Vera further constrains the trainable parameters to a single vector, which significantly reduces memory but fails to capture the complexity required for effective pre-training.
We appreciate this valuable suggestion and have added this analysis and experiment to the revised manuscript.
Weakness 2: I would like to discuss ... because the baselines are all PEFT methods).
Reply: Thank you for raising this important point. We agree that training and fine-tuning are fundamentally different tasks and should be clearly distinguished. In this paper, our focus is specifically on parameter-efficient pre-training, rather than parameter-efficient fine-tuning (PEFT) or sparse training.
There are two main reasons we emphasize pre-training. First, pre-training is considerably more challenging than fine-tuning, as it starts from random initialization without any prior knowledge. The model must learn meaningful representations entirely from scratch, making it more sensitive to optimization stability and expressiveness constraints. Methods that work well in fine-tuning often fail to generalize to this more difficult setting.
Second, pre-training requires significantly more computational and memory resources, which has become a major bottleneck. Many academic labs today lack the resources to conduct large-scale pre-training. SST is motivated by the desire to lower the resource barrier, making pre-training more accessible to the broader research community.
We appreciate the reviewer’s suggestion and have revised our manuscript to explicitly clarify that our work focuses on parameter-efficient pre-training. We have also updated the wording from "Previous memory reduction training techniques" to "Previous parameter-efficient training techniques", since LoRA is a PEFT method and ReLoRA is designed for pre-training. We hope these changes enhance the clarity of our contributions.
This paper introduces Sparse Spectral Training (SST), a memory-efficient pretraining method that updates all singular values while selectively updating singular vectors. It addresses the limitations of low-rank approaches (e.g., LoRA, ReLoRA) in the context of large-scale pretraining and shows strong empirical results across a diverse set of tasks, including hyperbolic models where it even outperforms full-rank baselines.
The reviewers were overall positive, highlighting the novelty of the method, strong performance, and clear writing. Concerns around missing baselines and clarity on training vs. fine-tuning were adequately addressed in the rebuttal, with new experiments and clarification added to the revision. Given the novelty, practical relevance, and solid evaluation, I agree with the positive consensus and recommend acceptance.