Hyperbolic Fine-Tuning for Large Language Models
A novel method to efficiently fine-tune Large Language Models in hyperbolic space, unlocking their latent tree-like structures and significantly boosting complex reasoning performance by adapting directly on the hyperbolic manifold.
Abstract
Reviews and Discussion
The paper explores the suitability of hyperbolic geometry for representing token embeddings in large language models (LLMs), motivated by the hierarchical structure commonly found in language data. It first identifies that token embeddings naturally display hyperbolic characteristics, such as a tree-like structure, where high-frequency (general) tokens cluster near the origin, and low-frequency (specific) tokens are positioned farther away, following a power-law distribution.
To better exploit this hierarchical structure, the authors propose HypLoRA, a new fine-tuning method that integrates hyperbolic geometry directly into the low-rank adaptation process, unlike traditional methods that use Euclidean geometry. The key idea of HypLoRA is to perform parameter-efficient updates directly on the hyperbolic manifold, avoiding the limitations of using standard Euclidean operations.
Through extensive experiments on arithmetic and commonsense reasoning benchmarks, the paper demonstrates that HypLoRA significantly improves the performance of existing LLMs compared to conventional Euclidean fine-tuning methods, confirming the advantage of explicitly modeling token hierarchies in hyperbolic space.
Strengths and Weaknesses
Strengths:
- Extensive experiments across diverse benchmarks and models
- Integration of hyperbolic geometry into low-rank adaptation (HypLoRA)
Weaknesses:
- Incremental improvements may not justify increased complexity and computational overhead in practical deployments.
Questions
NA
Limitations
yes
Justification for Final Rating
The authors have addressed my concern on the increased complexity and computational overhead.
Formatting Issues
no
Q: Incremental improvements may not justify increased complexity and computational overhead in practical deployments.
A: We thank the reviewer for raising the practical consideration of the trade-off between performance gains and computational overhead. We agree that any added complexity must be justified by meaningful improvements.
While HypLoRA introduces hyperbolic projection operations, we have taken care to ensure its practical overhead is minimal and justified by its performance benefits. In terms of model complexity, the number of additional trainable parameters in HypLoRA is negligible, remaining nearly identical to that of the standard LoRA method (as shown in Tables 3 and 7).
Regarding computational overhead, our analysis in Section 5.2 shows that while inference time is slightly increased compared to LoRA, HypLoRA remains more efficient than other competitive methods such as DoRA (Figure 2). Furthermore, memory consumption is comparable to LoRA (Table 7 and the table below), ensuring our method remains a practical choice for deployment.
GPU Allocated Memory during Fine-tuning (GB)
| Model | LoRA | DoRA | HypLoRA (stereographic) | HypLoRA (exp/log) |
|---|---|---|---|---|
| Qwen2.5 (r=32) | 28.63 | 28.64 | 28.63 | 28.63 |
| Gemma3-4B | 14.61 | 14.62 | 14.61 | 14.61 |
| LLaMA3-8B | 30.13 | 30.14 | 30.13 | 30.13 |
Moreover, the performance improvements offered by HypLoRA are not merely incremental; they are particularly significant on more challenging reasoning tasks. For instance, on the complex GSM8K and AQuA arithmetic benchmarks, HypLoRA delivers substantial gains over LoRA across multiple models, such as +9.8% and +12.2% improvements with Gemma-7B, respectively (Table 3). Similarly, on commonsense reasoning, we observe consistent gains of +1.8-3.0% on average (Table 4). This demonstrates that the benefits of explicitly modeling the latent hyperbolic geometry are most pronounced where complex, hierarchical relationships must be understood. Therefore, we believe the slight and manageable increase in computational cost is a worthwhile trade-off for the significant and targeted performance enhancements HypLoRA provides on difficult reasoning tasks.
Thank you for your clarifications on the increased complexity and computational overheads. I will update my score accordingly.
This paper studies the geometric properties of token embeddings in LLMs. It first performs a detailed analysis and finds that LLM embeddings exhibit strong hyperbolic properties. Based on this observation, the authors propose a new PEFT method, HypLoRA, to perform low-rank adaptation directly on a hyperbolic manifold. Through extensive experiments on multiple benchmarks, the authors show that HypLoRA outperforms LoRA across a variety of models.
Strengths and Weaknesses
Strengths:
- Novel and insightful perspective: The main strength of this paper lies in its novel and insightful perspective. Rather than simply proposing a new technique, the authors provide a compelling motivation by first analyzing the underlying geometric structure of LLM embeddings.
- Robust performance improvement: The proposed method, HypLoRA, shows consistent and robust performance improvement over the standard LoRA baseline. This is demonstrated across multiple base models (LLaMA, Gemma, etc.) and various inference benchmarks.
- Solid theoretical foundation: The research in this paper is well supported by theoretical analysis.
Weaknesses:
- Lack of pre-fine-tuning benchmarks: The experimental results table lacks the performance of the base model before any fine-tuning (i.e., zero-shot or few-shot performance). Including the performance of the base model allows the reader to measure the absolute improvement brought by PEFT itself. This is crucial to understand the overall impact of the two PEFT methods and to justify the need for fine-tuning in the first place.
- Motivation is disconnected from experimental analysis: The experimental analysis can be better aligned with the core motivation of the paper. The motivation of this paper is based on the idea that hyperbolic geometry can better model the hierarchical tree-like structures present in language. However, the evaluation section focuses mainly on showing its accuracy improvement on downstream tasks. The paper would be significantly strengthened if qualitative or quantitative analysis could be included to show how HypLoRA actually changes the embedding space to better represent these hierarchical structures.
Questions
- As shown in Table 3, the LLaMA3-8B model fine-tuned with LoRA achieves 69.8% accuracy on GSM8K. However, the official technical report for LLaMA-3 (https://ai.meta.com/blog/meta-llama-3/) shows an 8-shot CoT accuracy on GSM8K of about 79.6%. Can you explain the evaluation settings that lead to this result? This significant difference raises potential concerns about the strength of the LoRA baseline used for comparison. If the baseline is not optimally tuned, the gains reported by HypLoRA may be partially due to a suboptimal starting point rather than simply the advantages of the hyperbolic method.
- Regarding the stability of fine-tuning: It is well known that the performance of LoRA can be sensitive to hyperparameter choices (e.g., learning rate, rank "r"). The paper compares the final performance of LoRA and HypLoRA, but a deeper analysis of training stability would be helpful. Can you comment on the stability of HypLoRA compared to LoRA? Are the reported results based on the average of multiple runs with different random seeds to explain the variance?
Limitations
yes
Justification for Final Rating
The authors' response addressed my concerns regarding the performance of the base model and baseline. They also provided additional experimental results supporting their motivation. Regarding parameter stability, the authors provided variance results, which showed that their method does exhibit significant volatility, but the improvement is stable and therefore acceptable. I believe the method proposed in this paper has potential for further research, and therefore this paper can be accepted by the main conference.
Formatting Issues
no
Thank you for your thorough review and constructive feedback. In what follows, we use “Q” to denote each question, weakness, or issue raised and “A” for our response.
Q1: Base models’ Performance
A1: To address the reviewer's concerns within the limited rebuttal time, we report the arithmetic reasoning performance of four base LLMs, and of Gemma 3-4B for the commonsense reasoning evaluations.
Arithmetic reasoning task:
| Model | MAWPS | SVAMP | GSM8K | AQuA |
|---|---|---|---|---|
| LLaMA-7B | 51.7 | 32.4 | 15.7 | 16.9 |
| LLaMA-13B | 65.5 | 37.5 | 32.4 | 15.0 |
| Gemma-7B | 76.5 | 60.4 | 38.4 | 25.2 |
| Gemma-3 4B | 54.6 | 42.2 | 39.2 | 23.2 |
Commonsense reasoning task:
| Model | BoolQ | PIQA | SIQA | Winogrande | ARC-Challenge | ARC-Easy | OpenbookQA |
|---|---|---|---|---|---|---|---|
| Gemma3-4B | 58.41 | 77.31 | 63.10 | 14.92 | 61.43 | 72.011 | 54.2 |
These baseline results clearly demonstrate the value of fine-tuning. More importantly, these benchmarks further highlight the contribution of our proposed method, HypLoRA. Compared to both the base LLM and the LoRA baseline, HypLoRA consistently outperforms across most evaluations in both arithmetic and commonsense reasoning tasks.
Q2: Motivation is disconnected from experimental analysis: The experimental analysis can be better aligned with the core motivation of the paper.
A2: Thanks for your suggestion. In this response, we present a quantitative study of the embeddings finetuned using HypLoRA. We extend our hyperbolicity analysis beyond the input embeddings. We examine the relative hyperbolicity of the final hidden layer representations of Gemma 7B and Gemma 3-4B across five datasets, two focused on math reasoning (AQuA, GSM8K) and three on commonsense reasoning (ARC-Challenge, Winogrande, OpenbookQA). We conduct this analysis on both the base models and their fine-tuned counterparts using three PEFT methods: LoRA, DoRA, and HypLoRA.
First, as a baseline, the hyperbolicity of the initial, non-contextualized token embeddings is presented in Table 1.
Hyperbolicity of Initial Token Embeddings (Table 1)
| Method | AQuA | GSM8K | ARC-Challenge | Winogrande | OpenbookQA |
|---|---|---|---|---|---|
| Gemma 7B | 0.12 ± 0.01 | 0.11 ± 0.01 | 0.12 ± 0.01 | 0.11 ± 0.01 | 0.12 ± 0.01 |
| Gemma3 4B | 0.19 ± 0.01 | 0.19 ± 0.02 | 0.16 ± 0.02 | 0.17 ± 0.01 | 0.16 ± 0.02 |
Hyperbolicity of Final Hidden Layer Embeddings in Gemma 7B: (Table2-1)
| Dataset | Base LLM | LoRA | DoRA | HypLoRA |
|---|---|---|---|---|
| AQuA | 0.31 ± 0.04 | 0.24 ± 0.05 | 0.23 ± 0.05 | 0.22 ± 0.03 |
| GSM8K | 0.28 ± 0.04 | 0.21 ± 0.05 | 0.21 ± 0.05 | 0.20 ± 0.03 |
| ARC-Challenge | 0.30 ± 0.03 | 0.35 ± 0.03 | 0.36 ± 0.02 | 0.25 ± 0.02 |
| Winogrande | 0.22 ± 0.04 | 0.32 ± 0.02 | 0.27 ± 0.02 | 0.27 ± 0.02 |
| OpenbookQA | 0.30 ± 0.03 | 0.35 ± 0.03 | 0.38 ± 0.02 | 0.25 ± 0.02 |
Hyperbolicity of Final Hidden Layer Embeddings in Gemma3 4B: (Table2-2)
| Dataset | Base LLM | LoRA | DoRA | HypLoRA |
|---|---|---|---|---|
| AQuA | 0.17 ± 0.03 | 0.17 ± 0.03 | 0.19 ± 0.02 | 0.11 ± 0.01 |
| GSM8K | 0.16 ± 0.03 | 0.20 ± 0.03 | 0.19 ± 0.03 | 0.11 ± 0.02 |
| ARC-Challenge | 0.17 ± 0.02 | 0.21 ± 0.01 | 0.17 ± 0.02 | 0.20 ± 0.02 |
| Winogrande | 0.16 ± 0.02 | 0.16 ± 0.02 | 0.21 ± 0.01 | 0.12 ± 0.01 |
| OpenbookQA | 0.17 ± 0.03 | 0.16 ± 0.02 | 0.17 ± 0.03 | 0.11 ± 0.01 |
The results for the final hidden layer embeddings are shown in Table 2-1 and Table 2-2. Our analysis reveals several key findings: First, the final hidden states of the base models exhibit notable hyperbolicity, though it is generally less pronounced (i.e., higher values) than in the initial embeddings. Second, HypLoRA consistently learns final representations with a higher degree of hyperbolicity (lower δ values) across almost all datasets and models. This provides strong empirical evidence that our method actively preserves and enhances the hierarchical structure of the representations throughout the model, aligning the final contextualized embeddings with the geometric biases that are beneficial for reasoning. We also observe that as dataset complexity increases, the final hidden layer exhibits correspondingly higher relative hyperbolicity values.
For instance, AQuA, a particularly challenging dataset, shows a relative hyperbolicity of 0.31 in Gemma 7B and 0.17 in Gemma 3-4B. When these models were fine-tuned with HypLoRA, the degree of hyperbolicity increased (i.e., the relative hyperbolicity value decreased). In Gemma 7B, the drop from 0.31 to 0.22 led to a significant improvement in accuracy, reaching 46.5%, compared to LoRA’s 34.3%, which had a relative hyperbolicity of 0.24. Similar trends are observed in Gemma 3-4B as well. Moreover, in commonsense reasoning tasks, the effect is even more pronounced: while LoRA and DoRA tend to reduce the degree of hyperbolicity relative to the base LLM, HypLoRA increases it. As a result, the performance improvement with HypLoRA is both evident and consistent.
Q3: As shown in Table 3, the LLaMA3-8B model fine-tuned with LoRA achieves 69.8% accuracy on GSM8K. However, the official technical report for LLaMA-3 (https://ai.meta.com/blog/meta-llama-3/) shows an 8-shot CoT accuracy on GSM8K of about 79.6%. Can you explain the evaluation settings that lead to this result? This significant difference raises potential concerns about the strength of the LoRA baseline used for comparison. If the baseline is not optimally tuned, the gains reported by HypLoRA may be partially due to a suboptimal starting point rather than simply the advantages of the hyperbolic method.
A3: We follow the experimental setup outlined in Hu et al. [1], which has been adopted by various subsequent works. For the arithmetic and commonsense reasoning tasks, we fine-tune the LLMs using the Math10K and Commonsense170K datasets, respectively, both proposed in Hu et al. [1]. For the PEFT baselines, we adopt standard hyperparameters: LoRA rank 32, LoRA alpha 64, applied to the query, key, value, up-, and down-projection matrices (QKVUD). We use a batch size of 16, a micro-batch size of 4, a learning rate of 3e-4, and train for 3 epochs with a maximum sequence length of 256 tokens. These settings are consistent with Hu et al. [1] and DoRA, and are also similar to those used in the S2FT [2] framework. Notably, similar LoRA performance on GSM8K can be observed in the S2FT paper. Importantly, the 8-shot accuracy reported in the official LLaMA-3 blog refers to chain-of-thought prompting, which is not directly comparable to our setting, where the model is fine-tuned using Math10K, a dataset that includes GSM8K, AQuA, and MAWPS. Thus, drawing a direct comparison between these two settings would not be valid, as they represent different evaluation paradigms.
[1]Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E.P., Bing, L., Xu, X., Poria, S. and Lee, R.K.W., 2023. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. EMNLP 2023
[2] Yang, X., Leng, J., Guo, G., Zhao, J., Nakada, R., Zhang, L., Yao, H. and Chen, B., 2024. S2FT: Efficient, scalable and generalizable LLM fine-tuning by structured sparsity. Advances in Neural Information Processing Systems, 37, pp. 59912-59947.
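For concreteness, the baseline LoRA configuration described in A3 corresponds roughly to the sketch below using the Hugging Face `peft` API; the target module names assume a LLaMA-style architecture and are illustrative rather than taken from the paper's code.

```python
from peft import LoraConfig

# Sketch of the LoRA baseline settings described above: rank 32, alpha 64,
# applied to the query/key/value/up/down projections ("QKVUD").
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```

The remaining settings (batch size 16, micro-batch size 4, learning rate 3e-4, 3 epochs, maximum sequence length 256) are passed to the trainer and are independent of the adapter configuration itself.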
Q4: Regarding the stability of fine-tuning: It is well known that the performance of LoRA can be sensitive to hyperparameter choices (e.g., learning rate, rank "r"). The paper compares the final performance of LoRA and HypLoRA, but a deeper analysis of training stability would be helpful. Can you comment on the stability of HypLoRA compared to LoRA? Are the reported results based on the average of multiple runs with different random seeds to explain the variance?
A4: We thank the reviewer for the suggestion. The reported results were from the best single run across different curvature values (0.5, 1.0, 2.0). To address the stability concerns within the limited rebuttal time, we report the mean and standard deviation over 3 runs for LLaMA3-8B (LoRA vs. HypLoRA) and for Gemma3-4B (HypLoRA). While HypLoRA shows slightly higher variance in some cases, it consistently outperforms LoRA on most arithmetic tasks. Improving stability is an exciting direction for future work.
| Model | MAWPS | SVAMP | GSM8K | AQuA |
|---|---|---|---|---|
| LLaMA3-8B (LoRA) | 91.04 ± 0.14 | 80.75 ± 0.45 | 70.88 ± 0.99 | 41.77 ± 0.49 |
| LLaMA3-8B (HypLoRA) | 91.46 ± 0.54 | 79.27 ± 1.26 | 72.93 ± 0.27 | 44.75 ± 0.60 |
| Gemma 3 4B (HypLoRA) | 91.28 ± 0.46 | 77.77 ± 1.17 | 67.62 ± 1.22 | 46.89 ± 1.19 |
I have read the author's response, which addresses most of my concerns. I will keep my score unchanged.
Thank you for your prompt response. We are happy to have addressed most of your questions and concerns, and we appreciate your positive rating score.
This paper proposes HypLoRA, a low-rank adaptation technique designed for large language models (LLMs) operating in hyperbolic space. The authors begin by empirically demonstrating that token embeddings exhibit hierarchical structure, supported by measurements of hyperbolicity. Next, they explore the relationship between the distribution of the token frequency and hyperbolic geometry. Finally, they introduce a low-rank transformation directly formulated in the Lorentz model of hyperbolic space and present experimental results validating its effectiveness.
优缺点分析
Strength
LLM attracts huge attention in recent research, and using the hyperbolic geometry tool to analyze LLM is well-motivated and interesting for both communities. The paper lays out a step-by-step analysis, justifying the finding of the hierarchical structure of the token and using the hyperbolic geometry tool to enhance the low-rank transformation. I’m not from the LLM background and could not judge the significance from this aspect, but I believe integrating and using the hyperbolic tools in a new domain and analyzing the geometric insights should be appreciated.
Weaknesses
- The explanation of the “cancellation effect caused by the exponential and logarithmic maps” is unclear. If operations are done in hyperbolic space and then mapped to or from the tangent space, the overall transformation is not necessarily the same. So why and how does this cancellation appear here? This needs more clarification.
- The motivation for the design of the proposed direct Lorentz low-rank transformation is not well explained. While the authors want to explore the tool of hyperbolic geometry for LLM, there are a few questions raised in the design when examining the hidden states of LLM. First, why not transform the hidden states of LLMs from Euclidean space to hyperbolic space and stay in hyperbolic space instead of transforming the state back to Euclidean space? Second, why is transforming the space-like dimensions desired and not another alternative way?
- In the quantitative analysis of hyperbolicity (lines 194–196), the authors mention using GNNs to embed random graphs with known hyperbolicity. However, it’s not clear which GNN model is used or how the experiment is set up. Appendix C does not provide enough detail.
- In lines 199–202, the authors describe computing hyperbolicity “at the prompt level”, but it is not clear to me. What is the prompt level? Did you compute the hyperbolicity of x? What is the representation you computed for hyperbolicity? What exactly is measured and how? This part is hard to follow.
- While the authors briefly mentioned the linear transformation in lines 266-268, the description is vague and lacks explanation.
- In line 314, the paper refers to Appendix G for a “comprehensive comparison with additional methods,” but Appendix G contains GPU usage details. The comparison is either missing or mislabeled.
- The paper uses the Lorentz model for the main transformation, but when discussing the connection between power-law distribution and hyperbolic geometry, it suddenly switches to the Poincaré disk model without any introduction or reasoning. For readers unfamiliar with hyperbolic geometry, this may cause confusion and feel abrupt.
- The connection between the power-law distribution and hyperbolic geometry was not clear. First, the authors use the Poincaré disk model, which is two-dimensional, while the token embeddings are not; how to generalize the argument to higher dimensions is unclear. Second, the derivation was not clear on how token embeddings are expressed in polar coordinates. Also, the authors use “radial coordinate (correlating with token frequency)” and “angular coordinate (encoding semantic similarity)”, and then state the radial distribution, but it is unclear how it is derived. Each statement in lines 224-225 was just given without any explanation or derivation.
- In Equations 6, 7, and the following lines, the symbol "∼" is used without explanation of what kind of approximation or relation it indicates.
- In line 263, the paper refers to Appendix E.1 for the stereographic projection formulas, but the actual definitions are in Appendix E.2.
- Appendix F consists of the transformation analysis. However, it is not clearly referred to in the main text, and the context of the properties introduced in Appendix F is unclear, e.g., a proof is used to introduce a property without framing.
- There is no proof of Proposition 1, and the explanation of the dependency on the L2 norm was not clear. Following the authors' pointer to Appendix F.1, it was hard to follow how “the norm-dependent, higher-order modification enables HypLoRA to capture hierarchical relationships”.
Questions
See weakness above.
Limitations
In the paper checklist, the authors mention that limitations are discussed in the Conclusion (Section 6). However, the point raised there is a general observation about hyperbolic methods. The paper does not clearly identify under what conditions the method might fail or perform poorly. A more detailed discussion of the method’s limitations (such as sensitivity, dataset characteristics) would help clarify the scope and applicability of the approach.
Justification for Final Rating
The authors addressed my comments during the rebuttal and discussion. I believe this paper offers valuable insights and potential applicability for the hyperbolic community. Therefore, I recommend acceptance.
Formatting Issues
- The spacing between Figure 1 and the surrounding text is too tight, making the layout feel cluttered. A similar issue appears with Table 4, where the table is placed too close to the text without enough padding.
- There is a noticeable formatting issue between lines 224 and 225
Thank you for your constructive feedback. In what follows, "Q" denotes each question raised, and "A" denotes our response.
Q1: Explanation of cancellation effect
A1: The cancellation effect occurs because standard hyperbolic neural network approaches [1,2] apply transformations in the tangent space at the origin, requiring the sequence: Euclidean embedding → exponential map to hyperbolic space → logarithmic map to tangent space → linear transformation → exponential map back to hyperbolic space → projection to Euclidean space.
When these operations are chained together, the exponential and logarithmic maps are mutually inverse and effectively cancel out, reducing the entire sequence to approximately the original Euclidean transformation BAx, without preserving the beneficial hyperbolic geometry.
This is why we developed Direct Lorentz Low-rank Transformation (LLR) that operates directly on the hyperbolic manifold without intermediate tangent space mappings, preserving the hyperbolic structure.
[1] Hyperbolic Graph Convolutional Neural Networks
[2] Hyperbolic Neural Networks
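To make the cancellation concrete, the following minimal numerical sketch (assuming the standard Lorentz exponential and logarithmic maps at the origin with curvature K = 1; function names are illustrative) shows that chaining the maps around a linear layer reduces exactly to the Euclidean product:

```python
import torch

def exp_map_origin(v, K=1.0):
    # v: tangent vector at the Lorentz origin, given by its spatial part
    # (the time component of a tangent vector at the origin is 0).
    sqrt_K = K ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    time = sqrt_K * torch.cosh(norm / sqrt_K)
    space = sqrt_K * torch.sinh(norm / sqrt_K) * v / norm
    return torch.cat([time, space], dim=-1)          # point on the manifold

def log_map_origin(x, K=1.0):
    # Inverse of exp_map_origin: recovers the spatial tangent vector.
    sqrt_K = K ** 0.5
    time, space = x[..., :1], x[..., 1:]
    dist = sqrt_K * torch.acosh((time / sqrt_K).clamp_min(1.0))
    return dist * space / space.norm(dim=-1, keepdim=True).clamp_min(1e-8)

x = torch.randn(4, 8)            # Euclidean token embeddings
W = 0.1 * torch.randn(8, 8)      # stands in for the low-rank product BA

# Tangent-space pipeline: exp -> log -> linear -> exp -> log (back to Euclidean)
h = log_map_origin(exp_map_origin(x)) @ W.T
h = log_map_origin(exp_map_origin(h))

print(torch.allclose(h, x @ W.T, atol=1e-4))  # True: the maps cancel, leaving BAx
```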
Q2: Motivation of the direct LLR; the reason for transforming the space-like dimensions
A2: Regarding the decision to transform the hidden states back to Euclidean space, this is a fundamental requirement for our method's design as a parameter-efficient adapter. HypLoRA is intended to fine-tune existing, pre-trained LLMs, whose core components—such as the attention mechanism, MLP layers, and residual connections—are frozen and operate exclusively on Euclidean vectors. For our adapter to be compatible with this architecture, its output must be projected back into Euclidean space so it can be integrated into the model's residual stream. This "project-transform-project back" approach allows us to leverage the benefits of hyperbolic geometry in a localized, "plug-and-play" manner for specific tasks, avoiding the prohibitive cost and complexity of redesigning and retraining the entire language model in a new geometry.
For the second question about transforming space-like dimensions, as detailed in Section 5 and Appendix D, our spatial-like transformation can be interpreted as a constrained Lorentz rotation that preserves the hyperbolic manifold structure while enabling low-rank adaptation. The space-like components contain the semantic information analogous to the original token embeddings, while the time component is recomputed to maintain the Lorentz constraint (x₀² - ||xs||² = K). This design is inspired by established hyperbolic neural network architectures and allows us to apply linear transformations BA directly to the meaningful semantic dimensions while preserving hyperbolic geometry.
Q3: Details of GNN models
A3: We use a two-layer GCN as the GNN model.
Q4: Meaning of “prompt level”
A4: Prompt-level means that for each input prompt (e.g., a mathematical problem from GSM8K), we extract the token embeddings for all tokens within that specific prompt, treating each token embedding as a point in the metric space.
The "prompt level" refers to this per-prompt analysis rather than computing hyperbolicity across the entire vocabulary or dataset. We repeat this process for all prompts in each dataset and then average the resulting δ values to obtain the overall hyperbolicity measure reported in Table 2. We will revise this section in the manuscript to detail this procedure more clearly.
Q5: Linear transformation in lines 266-268
A5: Thanks for your comment. We direct your attention to the comprehensive mathematical formulation provided throughout Section 5 and Appendix D for the complete technical details.
Our Direct Lorentz Low-rank Transformation (LLR) operates directly on the hyperbolic manifold using the formula LLR(BA, x^H) = (√(||BAx^H_s||² + K), BAx^H_s), where x^H_s represents the space-like component of the hyperbolic representation and BA are the low-rank adaptation matrices (Equation 9). This transformation applies the linear adaptation BA to the semantic space-like dimensions while recomputing the time component to preserve the Lorentz manifold constraint x₀² - ||xs||² = K. As detailed in Appendix D, this can be interpreted as a constrained Lorentz rotation that maintains hyperbolic geometry while enabling parameter-efficient adaptation.
The linear transformation is inspired by established hyperbolic neural networks, which allows us to capture the hierarchical inductive bias of hyperbolic geometry while maintaining the efficiency and compatibility.
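A minimal sketch of the LLR operation as written in Equation 9 is given below; the tensor shapes and names are illustrative, and the actual HypLoRA module additionally handles the projection of Euclidean inputs onto the manifold and back.

```python
import torch

def lorentz_low_rank(x_space, A, B, K=1.0):
    # x_space: space-like component of a Lorentz-model point, shape (..., d)
    # A: (r, d) down-projection, B: (d, r) up-projection -- the LoRA factors
    z = x_space @ A.T @ B.T                                   # low-rank update BA x_s
    time = torch.sqrt(z.pow(2).sum(-1, keepdim=True) + K)     # restore x0^2 - ||x_s||^2 = K
    return torch.cat([time, z], dim=-1)                       # point back on the manifold
```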
Q6: GPU usage comparison is either missing or mislabeled.
A6: We apologize for this mislabeling error; the cross-reference will be corrected in the revised version.
Q7: The reason for using the Poincaré disk model when discussing the connection between power-law distributions and hyperbolic geometry.
A7: Thanks for your comments. While our main methodology employs the Lorentz model throughout due to its numerical stability and advantages for performing transformations within neural networks, the theoretical connection in Section 4.3 utilizes the Poincaré disk model because it provides a more intuitive geometric framework for explaining the relationship between power-law distributions and hyperbolic curvature.
The Poincaré disk model's polar coordinate system (r, θ) naturally aligns with the hierarchical organization we observe in token embeddings, where radial distance from the origin correlates with token specificity as demonstrated in our empirical analysis (Figure 1, Table 1).
Importantly, the Lorentz and Poincaré disk models are isometric; they are different representations of the same underlying hyperbolic geometry and share equivalent geometric properties. Therefore, the choice of the Poincaré disk model for the conceptual derivation in Section 4.3 does not affect the mathematical validity of the connection we draw between power-law distributions and hyperbolic space.
To prevent any potential confusion for readers, we will revise the manuscript accordingly.
Q8: Connection between the power-law distribution and hyperbolic geometry
A8: Thanks for your detailed and constructive feedback. The 2D Poincaré disk model offers an intuitive visualization of exponential volume growth, a key property that generalizes to n-dimensional hyperbolic spaces.
The generalization to higher dimensions follows standard hyperbolic geometry principles where the exponential volume growth property (C(r) ∼ e^r and A(r) ∼ e^r as r → ∞) holds in n-dimensional hyperbolic space, not just 2D.
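For reference, the exact 2D formulas behind this asymptotic statement (unit negative curvature) are

$$
C(r) = 2\pi \sinh(r), \qquad A(r) = 2\pi\big(\cosh(r) - 1\big), \qquad C(r),\ A(r) \sim \pi e^{r} \ \text{as } r \to \infty,
$$

and in n-dimensional hyperbolic space the corresponding boundary area and ball volume grow like e^{(n-1)r}, so the exponential-growth argument carries over directly.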
Our core argument builds upon the established work of Krioukov et al. [19], which demonstrates that embedding random geometric graphs in an n-dimensional hyperbolic space naturally gives rise to scale-free networks with a power-law degree distribution. The polar coordinate system is a conceptual framework used in that work to model this phenomenon. Our assignment of "radial coordinate correlating with token frequency" is motivated by our empirical findings in Section 4.1. The angular coordinate encoding semantic similarity follows from the hierarchical tree-like structure we demonstrate through δ-hyperbolicity analysis (Table 2).
Q9: In Equations 6, 7, and the following lines, the symbol "∼" is used without explanation of what kind of approximation or relation it indicates.
A9: The symbol "∼" indicates asymptotic equivalence as r approaches infinity, meaning the ratio of the left-hand side to the right-hand side approaches 1 in the limit. We will make it clear in the improved version.
Q10: In line 263, the paper refers to Appendix E.1 for the stereographic projection formulas, but the actual definitions are in Appendix E.2.
A10: Sorry for the typo. Line 263 should refer to Appendix E.2.
Q11: Missing reference to Appendix F in the main text
A11: The purpose of Appendix F is to provide the formal mathematical derivation that substantiates Proposition 1.
Specifically, it demonstrates how the non-linear projections inherent in HypLoRA introduce higher-order, norm-dependent terms into the adaptation, which is the core mechanism behind its improved performance. Thanks for the reviewer’s comments. We will revise the main text to more clearly introduce the purpose of the appendix when it is referenced.
Q12: Proof of Proposition 1
A12: The analysis in Appendix F serves as the mathematical derivation that substantiates Proposition 1. The core of the argument is that the sequence of non-linear projections to and from hyperbolic space results in an effective transformation that depends on the input token's L2 norm. As shown in Appendix F.1 (Eq. 30), the final update term from HypLoRA can be approximated as the standard LoRA update BAx plus a higher-order modification proportional to ||x||²·BAx. This second, higher-order term is directly proportional to the square of the L2 norm of the input token embedding x. This explicit dependency is the key mechanism that allows HypLoRA to perform a more nuanced adaptation than standard LoRA, which only applies the linear transformation BAx.
The link between this norm-dependent modification and "capturing hierarchical relationships" stems directly from our empirical findings in Section 4.1. In our analysis (Figure 1, Table 1), we demonstrate that token embeddings are organized hierarchically, where token specificity correlates with L2 norm: general, high-frequency tokens (e.g., "to", "and") have smaller norms, while specific, low-frequency tokens (e.g., "banana", "purple") have larger norms. Because the higher-order term in HypLoRA is proportional to ||x||², our method applies a larger, more significant adaptation to these specific, large-norm tokens. This allows the model to differentially weight the importance of tokens based on their position in the semantic hierarchy, enabling a more refined and context-aware fine-tuning.
Thank you for your rebuttal response. My concerns regarding the cancellation effect and the motivation behind the design of the LLR have been addressed through your clarification of the project–transform–project-back procedure and the requirement for the LLM model.
As a suggestion, it may be helpful to include a brief introduction to both the Poincaré and Lorentz models, along with a note on their isometric relationship, to improve accessibility for a broader audience. Additionally, emphasizing the cancellation effect using clear mathematical notation could strengthen the explanation. In the context of hyperbolic learning, we usually aim to keep the data within the hyperbolic space rather than projecting it back to Euclidean after transformation.
In addition, it might be worthwhile to include a discussion of existing hyperbolic dimension reduction methods and to clarify how the proposed LLR differs from them. For example, once the data is mapped to hyperbolic space, one might consider applying a transformation via horospheres like HoroPCA, which is also directly applied to the hyperbolic representation. Including such a discussion could help clarify the motivation behind your approach and better position the contribution within the context of existing work in the hyperbolic learning community.
Thank you for your suggestion. We will include a brief introduction to the Poincaré and Lorentz models and their isometric relationship, and will emphasize the cancellation effect using clear mathematical notation.
Regarding the difference from existing hyperbolic dimensionality reduction methods, we appreciate the opportunity to better position our work within the broader hyperbolic learning landscape.
While both HypLoRA and HoroPCA[1] operate on hyperbolic representations, their fundamental objectives and technical methodologies differ significantly. Our primary goal is supervised, parameter-efficient fine-tuning of pre-existing Euclidean LLMs for specific downstream tasks. This motivation differs from that of HoroPCA, which focuses on unsupervised dimensionality reduction to identify an optimal low-dimensional hyperbolic subspace that preserves the variance of the original data.
HypLoRA learns an AB matrix decomposition, where A functions as a dimensionality reduction matrix and B serves as a dimensionality expansion matrix. In contrast, HoroPCA (as well as other related methods, like PGA[2], EPGA[3]) exclusively involves the dimensionality reduction process, and the reduced representations cannot be directly aligned with the original dimensions of the LLMs, making it unsuitable for our fine-tuning objective.
Furthermore, these technical differences also lead to distinct computational efficiency characteristics. HoroPCA requires an iterative optimization process to identify principal directions (ideal points) while maximizing a variance objective, which becomes computationally intensive when applied across multiple Transformer components (query, key, value, and feed-forward layers) in LLMs. Conversely, HypLoRA avoids the need to search for optimal subspaces and instead directly learns a low-rank update (BA) for the LLM's existing weight matrices. This approach enables HypLoRA to maintain computational complexity comparable to standard LoRA.
Thank you again for your insightful suggestions. We will incorporate them into our revised version. Should you have any follow-up questions or additional suggestions, we would be happy to discuss them further.
[1] Chami, Ines, Albert Gu, Dat P. Nguyen, and Christopher Ré. "Horopca: Hyperbolic dimensionality reduction via horospherical projections." In International Conference on Machine Learning, pp. 1419-1429. PMLR, 2021.
[2] Fletcher, P. Thomas, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. "Principal geodesic analysis for the study of nonlinear statistics of shape." IEEE transactions on medical imaging 23, no. 8 (2004): 995-1005.
[3] Sommer, Stefan, François Lauze, Søren Hauberg, and Mads Nielsen. "Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations." In European conference on computer vision, pp. 43-56. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010.
I would like to thank the author for their effort to address my comments and the work they put into the rebuttal and discussion. I am satisfied with the proposed adjustments and look forward to reading the revision.
Thank you for your thoughtful comments and prompt response. We will incorporate your insightful suggestions, along with the discussion regarding HoroPCA, into our revised manuscript. If possible, we would greatly appreciate your consideration of adjusting your rating score. Thank you very much.
This paper first provided an empirical observation showing that the frequency of tokens tends to be inversely proportional to the norm of their embedding vectors in several LLMs on various datasets, indicating an underlying hierarchical structure. The paper then measures the hyperbolicity for the token embeddings of various LLMs to validate that conjecture. The paper then proposed to fine-tune LLMs by performing low-rank transformation of the input of each layer on the hyperbolic manifold. Extensive experiments are conducted on several reasoning datasets with various pretrained models.
Strengths and Weaknesses
Strengths:
- The consideration of transforming the token embeddings in the hyperbolic space is reasonable and interesting.
- The suggestion of an underlying hyperbolic structure of token embeddings from pretrained models is informative.
- The usage of the direct Lorentz Low-rank Transformation is interesting.
Weakness: The effectiveness and the motivation of the proposed method, HypLoRA, seem a bit unclear.
Please see the section of Questions.
Questions
- Comparing the final representations between HypLoRA and LoRA, eqns (30)-(32) and eqns (49)-(52), it seems that instead of injecting the matrix BA into the original weight, HypLoRA injects c·BA, where c is some constant that depends on the input norm (e.g., ||x||^(2k) for some integer k). In the end, the representative power of LoRA and HypLoRA seems to be equivalent, i.e., the solution set of HypLoRA is covered by LoRA. I am wondering what might be the advantage of doing all the transformations from and to the manifold, compared with directly injecting, e.g., c·BA, and optimizing over B and A?
- Assuming the tokens follow the hyperbolic structure, it seems natural to do any of the following (or simultaneously):
  - Fine-tune the embedding space by applying Riemannian optimization algorithms on the hyperbolic manifold, and fine-tune the other parameters separately, either in Euclidean space or via the proposed linear transformation on the hyperbolic manifold.
  - Train a hyperbolic network that does all the transformations in hyperbolic space before the final layers.

  For now, HypLoRA freezes the embedding space and the pretrained weights, injects a learnable linear transformation (by first projecting to the hyperbolic manifold and then projecting back to Euclidean space), and merges it with the original Euclidean linear transformation. I am wondering what the logic is behind this design and why it offers a more effective training procedure, considering that other fine-tuning algorithms use a similar number of parameters.
- The frequency of tokens seems to largely depend on the corpus or the context considered. I am wondering how the observation that the frequency of tokens tends to be inversely proportional to the norm can help for fine-tuning on new datasets.
- I am wondering how you choose or estimate the curvature constant and how it affects the final result.
Limitations
N/A
Justification for Final Rating
After the rebuttal, it seems that the authors further clarify the effectiveness of the proposed fine-tuning approach. Given the interesting connections between hyperbolic structure and token embeddings, and the potential broader impact, I increased my score to 5.
Formatting Issues
N/A
Thank you for your thorough review and constructive feedback. In what follows, we use “Q” to denote each question, weakness, or issue raised and “A” for our response.
Q1: Representative power of LoRA and HypLoRA and the advantage of HypLoRA
A1: Thank you for your thorough observations and constructive feedback. The main difference lies in the higher-order interaction term, which scales with ||x||² and represents a qualitatively different type of transformation that cannot be replicated or covered by standard LoRA.
This higher-order term creates a multiplicative coupling between the input embedding norm and the low-rank transformation matrices, fundamentally changing how the adaptation interacts with different tokens. This is not simply a scaled version of BAx; it is an entirely different functional form that introduces input-dependent adaptation strength. For example, for tokens with larger norms (more specific concepts like "dog" and "banana", with norm ~0.569 from Table 1), the higher-order term provides significantly stronger modifications, while for tokens with smaller norms (general function words like "to" and "in", with norm ~0.353), the adaptation is naturally attenuated. This creates an adaptive mechanism where the fine-tuning automatically emphasizes tokens that are more semantically specific and likely crucial for downstream task performance.
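As a quick illustration of the magnitude involved (using only the example norms quoted above; the actual update also carries curvature-dependent factors):

```python
# Relative strength of the ||x||^2-dependent term for the two example norms from Table 1.
specific, general = 0.569, 0.353      # e.g. "banana" vs. "to"
print(specific**2 / general**2)       # ~2.6: the specific token receives a ~2.6x stronger modification
```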
Q2: Why not apply Riemannian optimization, or do all the transformations in hyperbolic space before the final layers?
A2: Thanks for your critical suggestions. While these are theoretically appealing alternatives, they face significant practical limitations that motivated our design choices. Fine-tuning embeddings with Riemannian optimization on hyperbolic manifolds would require substantial computational overhead and careful handling of the manifold constraints, while training fully hyperbolic networks from scratch is prohibitively expensive for large-scale LLMs and would sacrifice the valuable pretrained representations that already capture hierarchical structure (as shown in our Section 4 analysis).
Our HypLoRA approach balances theoretical rigor with practical feasibility: we preserve the pretrained Euclidean embeddings and weights that already exhibit hyperbolic characteristics, while selectively applying hyperbolic transformations only to the low-rank adaptation matrices A and B. This hybrid design leverages the best of both worlds—maintaining compatibility with existing pretrained models while capturing hyperbolic inductive biases through the learnable curvature parameter and norm-dependent scaling terms.
Q3:The frequency of the tokens seem to be largely depend on the corpus or the context considered. I am wondering how the observation that the frequency of tokens tends to be inversely proportional to the norm can help for fine-tuning on new datasets.
A3: While it is true that absolute token frequencies are corpus-dependent, our core argument relies on the stable, underlying hierarchical structure of the embedding space, which is a fundamental property learned during the LLM's large-scale pre-training, not just on the specific fine-tuning dataset. Our cross-model and cross-dataset analysis in Appendices A and B demonstrates that this principle is a remarkably consistent feature across different model families and task domains.
This pre-existing geometric organization is precisely what makes our approach effective for fine-tuning on new datasets. Fine-tuning does not occur in a vacuum; it leverages the rich geometric landscape established during pre-training. When HypLoRA is applied to a new task, its norm-dependent update mechanism automatically exploits this latent structure. The higher-order term, which scales with ||x||², naturally applies a stronger adaptation to the tokens that the pre-trained model has already encoded as more specific (i.e., those with larger norms). These are often the most informative tokens for mastering the nuances of the new task.
Q4: I am wondering how you choose or estimate the curvature constant and how it affects the final result.
A4: We treat the curvature as a learnable parameter to allow the model to adapt to the specific geometric needs of each task. During our experiments, we initialize it by searching over the candidate set {0.5, 1.0, 2.0} and subsequently make it trainable during the fine-tuning process. This approach allows the model to automatically discover a suitable curvature, effectively balancing the hyperbolic inductive bias with the geometric characteristics of a given dataset.
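One simple way to realize such a trainable, strictly positive curvature is sketched below (a softplus parameterization; this is an assumption for illustration, since the reply does not specify the exact parameterization used):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableCurvature(nn.Module):
    """Curvature K initialized from a searched value (e.g. 0.5, 1.0, or 2.0) and kept positive."""
    def __init__(self, init_K=1.0):
        super().__init__()
        # inverse softplus so that K starts exactly at init_K
        self.raw_K = nn.Parameter(torch.log(torch.expm1(torch.tensor(float(init_K)))))

    @property
    def K(self):
        return F.softplus(self.raw_K)
```

The resulting `K` can be plugged into the exponential/logarithmic maps or the LLR time-component computation and is updated by the optimizer together with the adapter weights.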
I thank the authors for the response. It seems that the last point of my first question remains unclear, i.e., what is the advantage of the currently proposed method compared with directly injecting a simple LoRA module c·BA into the pretrained weight W₀ (i.e., W₀ + c·BA), where c is a scalar depending on ||x||. For instance, one can choose c = ||x||^(2k) for some integer k as you emphasized, or even a polynomial of ||x||. (We can either detach the gradient for c or keep the forward AD for it during training.)
Thank you for the follow-up and for this insightful question. While direct injection with a norm-dependent scalar appears similar, our hyperbolic approach is theoretically grounded rather than ad hoc. Compared with direct injection, the proposed method has a clear geometric meaning, and the curvature parameter provides principled control through learned geometric properties.
In terms of representational power: LoRA applies a uniform update BAx, while HypLoRA introduces a norm-dependent adaptation (BAx plus a higher-order term proportional to ||x||²·BAx) that standard LoRA cannot represent or cover.
Beyond the methodological innovation, our paper makes two key contributions: (1) comprehensive investigation of hyperbolic characteristics in LLM token embeddings, revealing power-law distributions and tree-like structures across multiple models and datasets, and (2) HypLoRA methodology that performs low-rank adaptation directly on hyperbolic manifolds while preserving geometric properties.
Thanks again for your critical feedback; we will add this discussion to our revised version and welcome further discussion.
Thank you for your response. It would be great to add comparison experiments to see if the additional operations of the current method, such as logarithmic and exponential maps, are necessary or not in practice, for improving computation efficiency.
Thank you for your suggestions. We will conduct this ablation study by removing these mappings and will share our findings shortly.
We sincerely appreciate Reviewer 8MJ1's valuable suggestion. Given the time constraints, we conducted a series of experiments on Gemma 7B by removing the corresponding mappings and observed a significant decrease in model performance, confirming that these mappings are indeed essential components. We will incorporate this analysis into our revised manuscript. We welcome further discussion.
| Configuration | MAWPS (8.5%) | AQuA (9.0%) | GSM8K (46.9%) | SVAMP (35.6%) | Weighted Avg |
|---|---|---|---|---|---|
| HypLoRA no log/exp | 89.9 | 33.9 | 63.8 | 77.0 | 68.04 |
| HypLoRA with log/exp | 89.5 | 40.1 | 66.3 | 78.4 | 70.20 |
| HypLoRA with no stereo | 89.5 | 34.3 | 61.4 | 74.2 | 65.91 |
| HypLoRA with stereo | 89.5 | 46.5 | 71.2 | 76.7 | 72.48 |
I thank the authors for the additional experiments. If injecting the (||x||² + 1)BAx module and training over B and A is less effective than the current approach, then I agree that the mappings provide additional benefit. Trusting the authors will make the appropriate analysis and revisions, I have increased my score to 5.
Thank you for your positive feedback. We greatly appreciate your constructive suggestions throughout the review process. We will incorporate all the above experimental results, including the comparisons with the (||x||² + 1)BAx module, into the final version of the paper.
This paper explores whether operating in Euclidean space is optimal for token embeddings in large language models. The authors find that token embeddings exhibit non-Euclidean characteristics: high-frequency tokens cluster near the origin, and embeddings display strong hyperbolicity, suggesting a latent tree-like structure. To exploit these properties, they propose HypLoRA, a novel method for low-rank fine-tuning directly in hyperbolic space. Experiments show that HypLoRA improves LLM performance on some benchmarks.
Strengths and Weaknesses
Strengths:
- The proposed method is interesting and novel, leveraging the geometric properties of token embeddings.
- The experimental evaluation is reasonably comprehensive, covering multiple models and datasets to support the claims.
- The paper is generally well-written and well-structured with clear explanations.
Weaknesses:
- The analysis of hierarchical structure and hyperbolicity appears to focus predominantly on the input embeddings. It would strengthen the work to examine whether these non-Euclidean properties persist or transform across deeper layers of the model.
- Additional discussion in the Questions section below
Questions
- Do the authors have any insights or empirical results regarding how the hierarchical structure and hyperbolic properties of embeddings evolve across deeper layers of the model, as well as with the token context? It could also be interesting to see how such behavior may differ when using HypLoRA as opposed to other parameter efficient finetuning methods.
- Is there any relationship between the degree of hyperbolicity in an LLM’s embeddings and the amount of performance improvements observed with HypLoRA?
- Could the insights from this work be applied beyond the low-rank adapter setting, for example in other fine-tuning strategies or pretraining?
Limitations
yes
Justification for Final Rating
My concerns and questions have been mostly addressed through the hyperbolicity analysis of the final layer embeddings and the comparison between accuracy improvements and the degree of hyperbolicity. Moreover, this geometric perspective holds potential for broader applications in training, beyond just parameter-efficient fine-tuning.
Formatting Issues
None
Thank you for your thorough review and constructive feedback. In the follows, we use “Q” to denote any question, weakness, or issue raised, and “A” to denote our response.
Q1: The analysis of hierarchical structure and hyperbolicity appears to focus predominantly on the input embeddings. It would strengthen the work to examine whether these non-Euclidean properties persist or transform across deeper layers of the model, as well as with the token context. It could also be interesting to see how such behavior may differ when using HypLoRA as opposed to other parameter-efficient fine-tuning methods.
A1: We thank the reviewer for this excellent suggestion. To extend our hyperbolicity analysis beyond the input embeddings, we further examine the relative hyperbolicity of the final hidden layer representations of Gemma 7B and Gemma 3-4B across five datasets, two focused on math reasoning (AQuA, GSM8K) and three on commonsense reasoning (ARC-Challenge, Winogrande, OpenbookQA). We compared the base models against versions fine-tuned with LoRA, DoRA, and our proposed HypLoRA across a diverse set of five reasoning datasets.
First, as a baseline, the hyperbolicity of the initial, non-contextualized token embeddings is presented in Table 1.
Hyperbolicity of Initial Token Embeddings (Table 1)
| Method | AQuA | GSM8K | ARC-Challenge | Winogrande | OpenbookQA |
|---|---|---|---|---|---|
| Gemma 7B | 0.12 ± 0.01 | 0.11 ± 0.01 | 0.12 ± 0.01 | 0.11 ± 0.01 | 0.12 ± 0.01 |
| Gemma3 4B | 0.19 ± 0.01 | 0.19 ± 0.02 | 0.16 ± 0.02 | 0.17 ± 0.01 | 0.16 ± 0.02 |
Hyperbolicity of Final Hidden Layer Embeddings in Gemma 7B: (Table2-1)
| Dataset | Base LLM | LoRA | DoRA | HypLoRA |
|---|---|---|---|---|
| AQuA | 0.31 ± 0.04 | 0.24 ± 0.05 | 0.23 ± 0.05 | 0.22 ± 0.03 |
| GSM8K | 0.28 ± 0.04 | 0.21 ± 0.05 | 0.21 ± 0.05 | 0.20 ± 0.03 |
| ARC-Challenge | 0.30 ± 0.03 | 0.35 ± 0.03 | 0.36 ± 0.02 | 0.25 ± 0.02 |
| Winogrande | 0.22 ± 0.04 | 0.32 ± 0.02 | 0.27 ± 0.02 | 0.27 ± 0.02 |
| OpenbookQA | 0.30 ± 0.03 | 0.35 ± 0.03 | 0.38 ± 0.02 | 0.25 ± 0.02 |
Hyperbolicity of Final Hidden Layer Embeddings in Gemma3 4B: (Table2-2)
| Dataset | Base LLM | LoRA | DoRA | HypLoRA |
|---|---|---|---|---|
| AQuA | 0.17 ± 0.03 | 0.17 ± 0.03 | 0.19 ± 0.02 | 0.11 ± 0.01 |
| GSM8K | 0.16 ± 0.03 | 0.20 ± 0.03 | 0.19 ± 0.03 | 0.11 ± 0.02 |
| ARC-Challenge | 0.17 ± 0.02 | 0.21 ± 0.01 | 0.17 ± 0.02 | 0.20 ± 0.02 |
| Winogrande | 0.16 ± 0.02 | 0.16 ± 0.02 | 0.21 ± 0.01 | 0.12 ± 0.01 |
| OpenbookQA | 0.17 ± 0.03 | 0.16 ± 0.02 | 0.17 ± 0.03 | 0.11 ± 0.01 |
The results for the final hidden layer embeddings are shown in Table 2. Our analysis reveals several key findings:
- First, the final hidden states of the base models exhibit a lower degree of hyperbolicity (i.e., larger hyperbolicity value than in the initial embeddings).
- Second, HypLoRA consistently learns final representations with a higher degree of hyperbolicity (lower δ values) across almost all datasets and models. This provides strong empirical evidence that our method actively preserves and enhances the hierarchical structure of the representations throughout the model, aligning the final contextualized embeddings with the geometric biases that are beneficial for reasoning.
Q2: Is there any relationship between the degree of hyperbolicity in an LLM’s embeddings and the amount of performance improvements observed with HypLoRA?
A2: We observe that as dataset complexity increases, the final hidden layer exhibits correspondingly higher relative hyperbolicity value.
For instance, AQuA, a particularly challenging dataset, shows a relative hyperbolicity of 0.31 in Gemma 7B and 0.17 in Gemma 3-4B. When these models were fine-tuned with HypLoRA, the degree of hyperbolicity increased (i.e., the relative hyperbolicity value decreased). In Gemma 7B, the drop from 0.31 to 0.22 led to a significant improvement in accuracy, reaching 46.5%, compared to LoRA’s 34.3%, which had a relative hyperbolicity of 0.24.
Similar trends are observed in Gemma 3-4B as well. Moreover, in commonsense reasoning tasks, the effect is even more pronounced: while LoRA and DoRA tend to reduce the degree of hyperbolicity relative to the base LLM, HypLoRA increases it. As a result, the performance improvement with HypLoRA is both evident and consistent.
Q3: Could the insights from this work be applied beyond the low-rank adapter setting, for example in other fine-tuning strategies or pretraining?
A3: The insights from HypLoRA have broad applicability beyond low-rank adaptation. The core principle—using norm-dependent scaling (||x||²) to automatically emphasize semantically specific tokens—can enhance any parameter-efficient fine-tuning method including adapters, prefix-tuning, and prompt-tuning, replacing their uniform scaling with hierarchically-aware adaptations that align with natural language structure.
For pretraining, the potential is also transformative. Since our analysis reveals pretrained embeddings naturally develop hyperbolic characteristics, incorporating hyperbolic inductive biases from the start—through curvature-aware attention, hyperbolic positional encodings, or learnable curvature parameters—could accelerate the emergence of beneficial hierarchical structure.
Rather than discovering these geometric properties implicitly through millions of training steps, models could be architected to respect linguistic hierarchy from initialization, potentially reducing training costs while improving the quality of learned representations and enabling more interpretable, structured language understanding.
I appreciate the authors’ response, and my concerns and questions have been mostly addressed through the hyperbolicity analysis of the final layer embeddings and the comparison between accuracy improvements and the degree of hyperbolicity. As such, I have tentatively adjusted my score.
As a side note, it may be worth further investigating the cases where the degree of hyperbolicity in the final layer embeddings decreases when using HypLoRA compared to the base LLM (eg. Winogrande for Gemma 7B) to explore any potential reasons behind this behavior.
We thank the reviewer for the follow-up. We are pleased to hear that most of your concerns and questions have been resolved. We are currently investigating the Winogrande case and exploring potential reasons underlying this behavior.
We observed that all fine-tuning methods, including standard LoRA, resulted in a higher degree of hyperbolicity (lower δ values) compared to the base model on most datasets, with Winogrande on Gemma 7B being a notable exception. This phenomenon suggests there may be a discrepancy between the pre-trained objective and new fine-tuning tasks, resulting in differing hyperbolicity trends across different datasets and task types. Comparing Winogrande results between Gemma 7B and Gemma 4B suggests this may also be related to the base model.
However, we have observed a consistent pattern where HypLoRA typically exhibits higher degree of hyperbolicity than LoRA across all datasets and base models, demonstrating that hyperbolic operations provide more effective geometric regularization regardless of task-specific requirements. Since our main focus here is on fine-tuning and exploiting the underlying hyperbolicity, this consistent pattern better reflects the rationality and effectiveness of our proposed method.
This paper proposes HypLoRA, a method for fine-tuning large language models that exploits the latent hierarchical structure in token embeddings. The motivation is based on the observation that token embeddings display hyperbolic characteristics, reflecting a tree-like structure. Standard low rank adaptation in Euclidean space does not preserve this property, while HypLoRA applies a Lorentz transformation directly in hyperbolic space, resulting in norm-dependent updates.
The strengths are a novel and well-motivated idea, a technically sound method with minimal additional cost, and thorough experiments across multiple models and reasoning benchmarks. Reviewers appreciated the additional analysis in the rebuttal, including evidence that hyperbolicity persists in hidden layers, ablations confirming the role of the mapping operations, and clarifications on baselines and efficiency. These responses resolved the main concerns, and all reviewers now support acceptance.
The weaknesses were mostly related to clarity in the initial submission, the strength of baseline comparisons, and a gap between the geometric motivation and the evaluation. Some reviewers also noted that the performance gains, while consistent, are moderate and that fine-tuning stability could be improved. Overall, this is a solid and creative contribution that connects hyperbolic geometry with parameter-efficient fine-tuning and demonstrates consistent benefits.