GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
This paper proposes an improved parameter-efficient fine-tuning method that utilizes granular low-rank adaptation.
Abstract
Reviews and Discussion
This paper proposes Granular Low-Rank Adaptation (GraLoRA), a parameter-efficient fine-tuning (PEFT) method. The authors identify that Low-Rank Adaptation (LoRA) can show stagnating or declining accuracy at higher ranks (e.g., 32-64). The paper attributes this to a "structural bottleneck" where outlier input channels with high activation values disproportionately influence and distort gradient updates. To address this, GraLoRA partitions weight matrices into a grid of sub-blocks, with each block having its own independent low-rank adapter. The paper claims this design localizes gradient updates, enhances expressive capacity, and mitigates the outlier issue, all while maintaining a comparable parameter count and computational cost to standard LoRA.
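For concreteness, here is a minimal sketch of the block structure described above — illustrative PyTorch code rather than the authors' implementation; the class name, the even $k \times k$ split, and the rank-$r/k$ per block are assumptions inferred from the summary:

```python
import torch
import torch.nn as nn

class GraLoRALayer(nn.Module):
    """Illustrative sketch: frozen base weight plus a k x k grid of independent
    low-rank adapters, each of rank r // k, so the adapter parameter count
    matches a single rank-r LoRA on the same m x n weight."""

    def __init__(self, m: int, n: int, r: int = 32, k: int = 2, alpha: float = 32.0):
        super().__init__()
        assert m % k == 0 and n % k == 0 and r % k == 0
        self.k, self.scale = k, alpha / r
        self.base = nn.Linear(n, m, bias=False)
        self.base.weight.requires_grad_(False)            # frozen pretrained weight
        rb, mb, nb = r // k, m // k, n // k
        # A_(i,j): (r/k) x (n/k) down-projection, B_(i,j): (m/k) x (r/k) up-projection
        self.A = nn.Parameter(torch.randn(k, k, rb, nb) * 0.01)
        self.B = nn.Parameter(torch.zeros(k, k, mb, rb))   # zero init => Delta W starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (..., n)
        xs = x.chunk(self.k, dim=-1)                       # split input channels into k slices
        rows = []
        for i in range(self.k):                            # output block-row i
            acc = sum(xs[j] @ self.A[i, j].T @ self.B[i, j].T for j in range(self.k))
            rows.append(acc)
        return self.base(x) + self.scale * torch.cat(rows, dim=-1)

layer = GraLoRALayer(m=64, n=64, r=8, k=2)
print(layer(torch.randn(4, 64)).shape)                     # torch.Size([4, 64])
```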
Strengths and Weaknesses
Strengths:
- The method is designed to incur the same parameter count and comparable computational FLOPs as standard LoRA. The authors analyze the trade-offs, such as memory overhead, and propose a "Hybrid GraLoRA" approach to improve performance in low-rank scenarios where the base method may be too constrained.
- The paper proposes GraLoRA, a method that partitions the weight matrix into independent blocks, each with a local low-rank adapter, to isolate gradient updates. The authors provide a theoretical analysis arguing that this structure increases the model's expressive capacity by raising the effective rank of the update matrix.
Weaknesses:
- The paper's core motivation is that LoRA's performance degrades at high ranks (e.g., > 64) because outlier channels distort the entire gradient update. The explanation for why this happens is not fully elaborated. Intuitively, increasing the rank should provide the update matrix with more capacity, allowing it to better approximate the localized gradients of FFT and thus representing a smooth transition toward FFT performance. The paper does not provide a clear theoretical justification for why increasing the rank would intensify the "entangled influence" of outliers rather than providing more degrees of freedom to isolate their impact, creating a conceptual question around the mechanism that causes a performance dip at high, but not full, ranks.
- The method introduces a new hyperparameter, 'k', which defines how many ways the matrix is partitioned. The paper shows that performance is sensitive to this value and that an improper choice can hurt performance, adding a layer of tuning complexity not present in vanilla LoRA.
- The paper compares GraLoRA against three other PEFT methods: standard LoRA, MoRA, and RaSA. While these are relevant and contemporary baselines, they all belong to the family of low-rank decomposition techniques. The evaluation does not include comparisons against other established families of PEFT methods, such as FourierFT [1] or BOFT [2], which claim high-rank adaptation. Both are readily available in standard libraries like Hugging Face PEFT.
[1] Gao, Ziqi, et al. "Parameter-efficient fine-tuning with discrete fourier transform." arXiv preprint arXiv:2405.03003 (2024).
[2] Liu, Weiyang, et al. "Parameter-efficient orthogonal finetuning via butterfly factorization." arXiv preprint arXiv:2311.06243 (2023).
Questions
- Could you address the motivation problem in weakness 1?
- Could you add baselines such as FourierFT and BOFT?
Limitations
Not so clear. The future work section should be revised to discuss limitations more directly.
Final Justification
The authors have addressed all my concerns.
Formatting Issues
N/A
Thank you for your review and for highlighting both the strengths of our work and areas for improvement. We have addressed each of your comments in detail below.
(W1 & Q1) Why increasing the rank intensifies the entangled influence of outliers
As discussed in Sections 2.1 and 2.2 of the paper, we analyze the gradient dynamics of LoRA to explain why it suffers at higher ranks, particularly in the presence of outlier channels.
Given the LoRA weight update $\Delta W = BA$, where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, and input $x \in \mathbb{R}^{n \times T}$ (with layer output $y$), the gradients with respect to $A$ and $B$ are:

$$\frac{\partial \mathcal{L}}{\partial A} = B^\top \frac{\partial \mathcal{L}}{\partial y} x^\top, \qquad \frac{\partial \mathcal{L}}{\partial B} = \frac{\partial \mathcal{L}}{\partial y} x^\top A^\top.$$

The corresponding gradient of $\Delta W$ becomes:

$$\frac{\partial \mathcal{L}}{\partial \Delta W} \approx \frac{\partial \mathcal{L}}{\partial B} A + B \frac{\partial \mathcal{L}}{\partial A} = \frac{\partial \mathcal{L}}{\partial y} x^\top A^\top A + B B^\top \frac{\partial \mathcal{L}}{\partial y} x^\top.$$

Note that $x^\top$ appears inside both terms. In particular, the first term, $\frac{\partial \mathcal{L}}{\partial y} x^\top A^\top A$, shows that $x^\top$ is multiplied between the output gradient and the Gram matrix $A^\top A$. If $x$ contains an outlier channel, i.e., a row with abnormally high magnitude, that channel will influence the entire update through these matrix multiplications.
Moreover, as the rank $r$ increases, the norm of the Gram matrix $A^\top A$ (and likewise $B B^\top$) also increases. This amplifies the effect of the outlier channel, widening the scale gap between outlier and non-outlier contributions and distorting the overall gradient landscape. In other words, the entanglement becomes worse as $r$ grows, further suppressing gradients from informative but less dominant channels.
This phenomenon is empirically validated in Figure 4, where increasing rank leads to disproportionately larger gradients, driven by the outlier channel. These observations highlight the structural limitation of LoRA at high ranks, motivating the design of GraLoRA to localize such influence.
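To make this concrete, the following small numerical sketch (our own illustration, under the $\Delta W = BA$ convention and initialization scales assumed above, not the paper's exact setup) shows how an outlier input channel stays confined to one column of the plain gradient $\frac{\partial \mathcal{L}}{\partial y} x^\top$ but leaks into every column of the fused LoRA term, and how the Gram-matrix norm grows with rank:

```python
import torch

torch.manual_seed(0)
m, n, r, T = 16, 16, 8, 32
B = torch.randn(m, r) * 0.1           # illustrative LoRA factors (not the paper's init)
A = torch.randn(r, n) * 0.1
x = torch.randn(n, T)
x[3] *= 50.0                          # channel 3 is an outlier input channel
g = torch.randn(m, T)                 # stand-in for dL/dy

grad_fft = g @ x.T                                     # FFT-style gradient, m x n
grad_fused = g @ x.T @ A.T @ A + B @ B.T @ g @ x.T     # fused LoRA update direction

def energy_outside_col(grad, c=3):
    total = grad.pow(2).sum()
    return ((total - grad[:, c].pow(2).sum()) / total).item()

print("FFT : energy outside outlier column =", energy_outside_col(grad_fft))    # tiny
print("LoRA: energy outside outlier column =", energy_outside_col(grad_fused))  # much larger

# The Gram-matrix norm also grows with the rank, amplifying the effect:
for rr in (8, 32, 128):
    Ar = torch.randn(rr, n) / n ** 0.5
    print(f"rank {rr}: ||A^T A||_F = {(Ar.T @ Ar).norm():.2f}")
```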
(W3 & Q2) Additional experiments with new baselines on varying tasks and models
Thank you for the suggestion. While the main comparisons in our paper focused on low-rank decomposition methods, we have since conducted additional experiments to evaluate the generality and robustness of GraLoRA across a broader range of model architectures, tasks, and PEFT baselines—including BOFT and FourierFT.
1. Commonsense Reasoning with Extensive PEFT Baselines
We conducted a new evaluation on the commonsense reasoning benchmark using the LLaMA3.2–3B model, comparing GraLoRA with a wide spectrum of 10 PEFT baselines, including full fine-tuning. All models were trained with rank 32. For consistency, we used the results reported in the LoRA-SB paper for LoRA-XS, LoRA-SB, rsLoRA, and PiSSA. For other methods, we conducted a learning rate sweep over {2e-4, 4e-4, 1e-3, 2e-3}, selecting the best-performing configuration per method. All other hyperparameters followed the settings in LoRA-SB.
| Method | Rank | Params | BoolQ | PIQA | SIQA | HS | WG | ARC-c | ARC-e | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FullFT | - | 3.21B | 70.4 | 85.6 | 80.5 | 91.9 | 85.0 | 75.3 | 88.5 | 81.9 | 82.4 |
| LoRA | 32 | 48.63M | 70.0 | 85.2 | 79.1 | 90.7 | 82.2 | 74.3 | 86.9 | 81.9 | 81.3 |
| LoRA-XS | 96 | 1.81M | 67.3 | 83.4 | 78.7 | 89.0 | 82.1 | 72.6 | 85.2 | 78.9 | 79.6 |
| LoRA-SB | 96 | 1.81M | 70.3 | 84.8 | 80.2 | 91.6 | 84.6 | 74.7 | 87.9 | 81.2 | 81.9 |
| rsLoRA | 32 | 48.63M | 69.8 | 85.1 | 78.9 | 90.5 | 82.0 | 74.2 | 86.7 | 81.7 | 81.1 |
| PiSSA | 32 | 48.63M | 70.1 | 85.4 | 79.4 | 90.9 | 82.7 | 74.6 | 87.2 | 81.8 | 81.5 |
| BOFT | 32 | 48.48M | 72.3 | 84.6 | 79.1 | 91.3 | 84.5 | 73.7 | 87.8 | 80.6 | 81.7 |
| MELoRA | 32 | 48.63M | 71.3 | 85.0 | 78.6 | 93.0 | 79.7 | 73.7 | 85.5 | 79.0 | 80.7 |
| MoRA | 32 | 48.63M | 72.4 | 86.1 | 80.1 | 92.3 | 84.8 | 76.8 | 88.8 | 84.8 | 83.3 |
| RaSA | 32 | 48.63M | 73.1 | 87.5 | 81.1 | 93.7 | 85.3 | 78.9 | 88.9 | 83.6 | 84.0 |
| GraLoRA | 32 | 48.63M | 74.1 | 86.5 | 80.8 | 93.8 | 87.5 | 79.9 | 89.5 | 84.8 | 84.6 |
GraLoRA achieves the highest average accuracy and ranks first on 5 out of 8 tasks. These results, combined with those in Table 3 of the main paper, confirm GraLoRA’s scalability and consistent improvements across model sizes, ranks, and PEFT baselines.
2. Mathematical Reasoning
We further evaluated GraLoRA on a mathematical reasoning task, using MetaMathQA for training and the MATH dataset for testing. Two models, LLaMA3.2–1B and Qwen2.5–1.5B, were fine-tuned, mostly following the settings from Hu et al. [1], with ranks 64 and 128. Following our paper’s heuristic, we used $k = 4$ for both.
| Model | Rank | Method | Accuracy |
|---|---|---|---|
| LLaMA3.2–1B | 64 | LoRA | 14.9% |
| | | GraLoRA | 15.2% |
| Qwen2.5–1.5B | 64 | LoRA | 23.6% |
| | | GraLoRA | 25.7% |
| | 128 | LoRA | 24.7% |
| | | GraLoRA | 28.9% |
GraLoRA consistently outperforms LoRA, especially on the larger model, highlighting its robustness across architectures and tasks requiring symbolic reasoning.
3. General Language Understanding (GLUE)
We evaluated GraLoRA on the GLUE benchmark, which comprises eight subtasks, using RoBERTa-base, an encoder-only architecture. To ensure fair comparison with recent PEFT methods designed for parameter efficiency, we included two additional baselines: VeRA and FourierFT.
Following the protocol of prior work, we excluded MNLI and QQP—two time-intensive tasks—which also meant we did not apply the MNLI-based tricks for MRPC, RTE, and STS-B (as used in the original LoRA paper). Accordingly, we retrained LoRA on these tasks without this optimization and report updated results.
While VeRA and FourierFT involve fewer trainable parameters, their training time is comparable to or even longer than LoRA with rank 8. Therefore, we set the LoRA and GraLoRA ranks to 8. Since this is a relatively low-rank setting, we also evaluated Hybrid GraLoRA by splitting the rank equally between LoRA and GraLoRA (i.e., 4+4), which is expected to be beneficial under such constraints.
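To make the 4+4 split concrete, the sketch below shows one way such a hybrid adapter can be composed (our own illustrative code; the even split and module layout are assumptions based on the description above):

```python
import torch
import torch.nn as nn

class HybridAdapter(nn.Module):
    """Splits a rank budget r between a global LoRA branch (rank r_lora) and a
    k x k GraLoRA grid whose blocks share the remaining rank (r - r_lora)."""

    def __init__(self, m, n, r=8, r_lora=4, k=2):
        super().__init__()
        rg = r - r_lora                        # rank assigned to the GraLoRA branch
        self.k, mb, nb, rb = k, m // k, n // k, rg // k
        self.A = nn.Parameter(torch.randn(r_lora, n) * 0.01)      # LoRA branch
        self.B = nn.Parameter(torch.zeros(m, r_lora))
        self.Ag = nn.Parameter(torch.randn(k, k, rb, nb) * 0.01)  # GraLoRA branch
        self.Bg = nn.Parameter(torch.zeros(k, k, mb, rb))

    def forward(self, x):                      # returns only the adapter delta
        delta = x @ self.A.T @ self.B.T
        xs = x.chunk(self.k, dim=-1)
        rows = [sum(xs[j] @ self.Ag[i, j].T @ self.Bg[i, j].T for j in range(self.k))
                for i in range(self.k)]
        return delta + torch.cat(rows, dim=-1)

print(HybridAdapter(64, 64)(torch.randn(2, 64)).shape)   # torch.Size([2, 64])
```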
All hyperparameters for GraLoRA followed those used in the VeRA implementation, except for learning rate, which was reduced by a factor of 5 to 10.
| Method | Params | SST-2 (%) | MRPC (%) | CoLA (%) | QNLI (%) | RTE (%) | STS-B (%) | Avg (%) |
|---|---|---|---|---|---|---|---|---|
| FT | 125M | 94.8 | 90.2 | 63.6 | 92.8 | 78.7 | 91.2 | 85.2 |
| LoRA | 0.3M | 95.1 | 86.5 | 63.4 | 93.3 | 76.2 | 90.6 | 84.2 |
| VeRA | 0.043M | 94.6 | 89.5 | 65.6 | 91.8 | 78.7 | 90.7 | 85.2 |
| FourierFT | 0.024M | 94.2 | 90.0 | 63.8 | 92.2 | 79.1 | 90.8 | 85.0 |
| GraLoRA | 0.3M | 95.2 | 89.7 | 65.3 | 93.0 | 80.9 | 91.1 | 85.8 |
| Hybrid GraLoRA | 0.3M | 95.2 | 90.2 | 64.1 | 93.4 | 79.8 | 91.2 | 85.6 |
| Best GraLoRA | 0.3M | 95.2 | 90.2 | 65.3 | 93.4 | 80.9 | 91.2 | 86.0 |
GraLoRA demonstrates strong performance in this low-rank regime, outperforming all baselines in average score. Hybrid GraLoRA delivers the most robust results, achieving the best performance on 4 out of 6 tasks. These results indicate that GraLoRA maintains high effectiveness even under constrained parameter budgets and in non-LLM architectures.
4. Diffusion Model Fine-Tuning
Finally, we applied GraLoRA to diffusion models by fine-tuning SDXL for personalization. We followed the official training setup from the HuggingFace diffusers repository, using the lambdalabs/naruto-blip-captions dataset. The dataset was split 90% for training and 10% for evaluation.
| Method | CLIP Similarity (%) | DINOv2 Similarity (%) |
|---|---|---|
| LoRA | 91.4 | 79.2 |
| GraLoRA | 91.9 | 81.3 |
GraLoRA consistently outperformed LoRA in both CLIP and DINOv2 similarity scores, further demonstrating its generality and effectiveness beyond LLMs—including vision-text and generative architectures like diffusion models.
(W2) Analysis of the Hyperparameter $k$
We acknowledge that introducing $k$—which determines the granularity of matrix partitioning—adds a degree of tuning not present in vanilla LoRA. However, we empirically found that choosing $k$ according to the rank provides consistently strong results across different models and tasks. Based on this, we use a single fixed $k$ for ranks up to 32, and $k = 4$ for higher ranks (64 and 128). All experiments in the main paper and during the rebuttal strictly follow this heuristic.
Thus, while $k$ is a new hyperparameter, we provide a practical and robust guideline that eliminates the need for exhaustive tuning in most settings.
(Limitations)
We appreciate the suggestion to more explicitly frame future directions around current limitations. Two key limitations were initially considered: effectiveness at low ranks (e.g., rank 8) and the hyperparameter sensitivity of $k$.
At small ranks, the per-block expressiveness of GraLoRA may diminish. To address this, we proposed the Hybrid GraLoRA architecture, which retains the fine-grained structure of GraLoRA while incorporating standard LoRA to maintain sufficient expressivity. Empirical results confirm that this approach remains competitive even in low-rank regimes (see GLUE experiment).
As noted, we provide a simple and effective heuristic for setting $k$. However, to eliminate manual tuning altogether, we believe this limitation can be addressed more fundamentally. As suggested in the paper’s future work section, adaptive or learned partitioning could allow the model itself to determine the optimal granularity, replacing the need for a fixed $k$. This would significantly enhance the usability and flexibility of GraLoRA in practical deployments.
To conclude, we sincerely thank the reviewer for the thoughtful and detailed feedback. Your comments have helped us clarify the theoretical foundations of GraLoRA and broaden our empirical validation across diverse model families, tasks, and PEFT baselines. In response, we have incorporated additional theoretical analysis on LoRA’s gradient behavior and GraLoRA’s mitigation strategy, and we have conducted extensive new experiments—including comparisons against BOFT, FourierFT, and applications to GLUE and diffusion models. These results, along with revised discussions on tuning complexity and limitations, will be included in the updated manuscript.
[1] Hu et al., "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models", EMNLP 2023
Thanks for the rebuttal. However, one critical question still remains on the motivation: why is full fine-tuning not worse, given the theory that a larger $r$ results in more significant abnormality?
Thank you for your thoughtful question. We agree that our analysis emphasizes how LoRA becomes increasingly susceptible to gradient distortion as the rank grows, especially in the presence of outlier input channels. This raises a valid concern: why does full fine-tuning (FFT) not suffer from a similar degradation?
The key distinction lies in how gradients are propagated. In FFT, the gradient with respect to the weight matrix is computed as

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial y} x^\top,$$

where each input channel affects only its corresponding column in the gradient matrix. Thus, even when an outlier channel exists in $x$, its influence is localized and does not entangle with the gradients of unrelated weights.
In contrast, LoRA introduces a structural bottleneck due to its low-rank decomposition $\Delta W = BA$. The fused gradient update becomes:

$$\frac{\partial \mathcal{L}}{\partial \Delta W} \approx \frac{\partial \mathcal{L}}{\partial y} x^\top A^\top A + B B^\top \frac{\partial \mathcal{L}}{\partial y} x^\top,$$
which, as shown in Equation (4) and Figure 2 of the paper, leads to global entanglement. Specifically, the interaction between $B$, $A$, and the outlier input $x$ causes the gradient distortion to propagate throughout the entire update matrix $\Delta W$, not just the region associated with the outlier. This structural entanglement is what causes LoRA’s performance to degrade at higher ranks, whereas FFT remains robust due to its localized update mechanism.
To address this limitation of LoRA, GraLoRA introduces block-wise reparameterization, which enables localized and independent adaptation across subregions of the weight matrix. As a result, only a fraction of the blocks—specifically, $1/k$ of them—are directly affected by the outlier channel, while the remaining blocks maintain a gradient landscape that closely resembles that of full fine-tuning. This structure significantly improves gradient locality and mitigates the global distortion observed in vanilla LoRA, resulting in more stable and efficient training. Further theoretical and empirical details on this mechanism are provided in Sections 3.1 and 3.3.
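A toy gradient check illustrates this locality claim; the block layout and shapes below are illustrative assumptions, not the authors' code:

```python
import torch

torch.manual_seed(0)
m, n, r, k, T = 32, 32, 8, 4, 64
mb, nb, rb = m // k, n // k, r // k

# k x k grid of sub-adapters (illustrative; B gets small random values here so that
# A receives nonzero gradients in this toy check, unlike the zero init used in training)
A = torch.nn.Parameter(torch.randn(k, k, rb, nb) * 0.1)
B = torch.nn.Parameter(torch.randn(k, k, mb, rb) * 0.1)

x = torch.randn(T, n)
x[:, 5] *= 50.0                     # outlier input channel 5 -> block column j = 5 // nb = 0
xs = x.chunk(k, dim=1)

rows = [sum(xs[j] @ A[i, j].T @ B[i, j].T for j in range(k)) for i in range(k)]
y = torch.cat(rows, dim=1)
y.sum().backward()                  # dummy loss with dL/dy = 1, independent of the outlier

per_block = A.grad.flatten(2).norm(dim=2)   # k x k table of per-block gradient norms
print(per_block)
# Only block column 0 (whose input slice contains channel 5) shows inflated gradients;
# the remaining (k-1)/k of the grid keeps gradients on the ordinary scale, mirroring
# the localized behavior of full fine-tuning.
```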
We appreciate your insightful observation and will ensure that this important distinction is articulated more clearly in the final version of the paper. Please let us know if further clarification is needed.
Respectfully, I believe that Equation (4) is mathematically wrong. As we know,

$$\frac{\partial \mathcal{L}}{\partial \Delta W} = \frac{\partial \mathcal{L}}{\partial y} x^\top;$$

if Equation (4) holds, we have

$$\frac{\partial \mathcal{L}}{\partial y} x^\top A^\top A + B B^\top \frac{\partial \mathcal{L}}{\partial y} x^\top = \frac{\partial \mathcal{L}}{\partial y} x^\top,$$

which does not hold for all $A$ and $B$ (e.g., if $B = 0$). Due to this flaw in the theory, which should be a cornerstone of the whole paper, I tend to reject this paper.
Thank you for your insightful comment. You are absolutely correct that the true gradient of the low-rank update matrix is given by:

$$\frac{\partial \mathcal{L}}{\partial \Delta W} = \frac{\partial \mathcal{L}}{\partial y} x^\top.$$

The intention behind Equation (4) in our paper was not to redefine this gradient, but rather to describe the effective update in the fused weight space induced by updates to the LoRA parameters $A$ and $B$. Since $\Delta W = BA$ is a function of these two matrices, its update must be computed via the chain rule with respect to $A$ and $B$. Projecting these gradients back into the space of $\Delta W$, we obtain:

$$\delta(\Delta W) \propto \frac{\partial \mathcal{L}}{\partial B} A + B \frac{\partial \mathcal{L}}{\partial A} = \frac{\partial \mathcal{L}}{\partial y} x^\top A^\top A + B B^\top \frac{\partial \mathcal{L}}{\partial y} x^\top.$$

This is the expression shown in Equation (4). It is not a restatement of the true gradient $\frac{\partial \mathcal{L}}{\partial \Delta W}$, but rather a decomposition of how the LoRA structure alters the weight update path due to its low-rank factorization.
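For completeness, the first-order expansion behind this reading of Equation (4) can be written out; this is a sketch under the $\Delta W = BA$ convention used above, with learning rate $\eta$:

$$
\begin{aligned}
\Delta W' &= \Big(B - \eta \tfrac{\partial \mathcal{L}}{\partial B}\Big)\Big(A - \eta \tfrac{\partial \mathcal{L}}{\partial A}\Big)\\
&= \Delta W - \eta \Big( \tfrac{\partial \mathcal{L}}{\partial B} A + B \tfrac{\partial \mathcal{L}}{\partial A} \Big) + \mathcal{O}(\eta^2)\\
&= \Delta W - \eta \Big( \tfrac{\partial \mathcal{L}}{\partial y} x^\top A^\top A + B B^\top \tfrac{\partial \mathcal{L}}{\partial y} x^\top \Big) + \mathcal{O}(\eta^2),
\end{aligned}
$$

so the bracketed term is the effective update direction in the fused weight space; it reduces to a simple rescaling of the true gradient $\frac{\partial \mathcal{L}}{\partial y} x^\top$ only in special cases (e.g., when $A^\top A$ and $B B^\top$ are proportional to the identity).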
We acknowledge that the notation on the left-hand side of Equation (4) may have caused confusion by implying equality with the true gradient. We will revise the notation and clarify the surrounding explanation in the final version of the paper. We hope this clarification addresses your concern, and we would greatly appreciate it if you could kindly reconsider this point in your evaluation.
Thanks and I have updated the score.
Thank you very much for taking the time to review our response and update the score. We truly appreciate your thoughtful evaluation and consideration.
The paper proposes GraLoRA, an improved version of LoRA that divides weight matrices into smaller blocks for independent adaptation. Experiments demonstrate that GraLoRA outperforms LoRA and other methods in tasks like code generation and reasoning, offering better accuracy without increasing computational cost.
Strengths and Weaknesses
Strengths:
- GraLoRA effectively addresses the limitations of LoRA by enhancing model capacity through granular, block-wise decomposition, improving performance without additional computational cost.
- The method consistently outperforms existing PEFT techniques across code and reasoning tasks, demonstrating scalability and robustness across model sizes and rank settings.
Weaknesses:
- The paper focuses on visualizing the gradient dynamics in the down-projection matrices of FFT but does not investigate other matrices, such as those in self-attention layers.
- In the code generation task, the paper applies the same learning rate across all methods with varying ranks, without conducting a grid search for optimal learning rates, which may not ensure a fair comparison.
- While the paper suggests that LoRA suffers from outlier channels in larger-rank optimization, it uses the LionW optimizer in experiments, which applies sign updates to mitigate the influence of large gradients, and momentum, which helps reduce significant gradient fluctuations.
Questions
- Does the hyperparameter $k$ require tuning for different models? Is there a simple way to find an optimal $k$?
- How does the method perform on math tasks?
- What would the performance gap be between GraLoRA and other PEFT methods if the number of epochs were increased to 3?
Limitations
Yes
Final Justification
The authors have addressed my concerns with extensive experimental results.
Formatting Issues
N/A
We sincerely thank you for your encouraging evaluation and the valuable suggestions for improvement. Below, we address each of your points and questions clearly.
(W1) Gradient dynamics on other projection matrices
In the paper, we identify a fundamental misalignment between LoRA updates and the true gradient landscape, particularly in the presence of outlier channels in the input. Prior studies have shown that outlier channels and weights are common in large models [1][2], and the impact of such outliers has been actively addressed in areas like quantization [3][4].
Our empirical analysis focuses on the down-projection layer of LLaMA3.1–8B, based on the findings from [2], where outliers were found mostly in the down-projection layers. However, any linear projection matrix with LoRA adapters—such as the Q, K, V projections in self-attention—can exhibit similar behavior. These matrices are functionally equivalent in structure and are therefore equally susceptible to the effects of outlier activations.
To further support this claim, we measured the kurtosis of the input activations of each projection for the first 10 layers of LLaMA3.1–8B. High kurtosis indicates a heavy-tailed distribution, which implies the presence of strong outliers.
| Projection \ Layer | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Q/K/V | 1117 | 1796 | 144 | 71 | 87 | 77 | 59 | 57 | 78 | 79 |
| O | 570 | 239 | 57 | 109 | 31 | 42 | 37 | 24 | 17 | 14 |
| Gate | 93 | 7 | 32 | 2 | 22 | 4 | 4 | 5 | 8 | 13 |
| Up | 93 | 7 | 32 | 2 | 22 | 4 | 4 | 5 | 8 | 13 |
| Down | 30567 | 1888483 | 13063 | 386 | 15855 | 183 | 156 | 4498 | 2598 | 99 |
Among these, down-projection layers exhibited the highest kurtosis values, indicating an extremely heavy-tailed distribution and thus a high likelihood of dominant outlier channels. However, most other layers are also leptokurtic, meaning they too are likely to be influenced by outlier activations.
This supports our central claim: the phenomenon of outlier-induced gradient distortion is not limited to the down-projection layer, but is broadly relevant across various parts of the model that use LoRA adapters.
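Such a measurement can be reproduced with forward hooks; the sketch below is our own illustration (the checkpoint id, the prompt, and the choice to pool kurtosis over all activation values are assumptions):

```python
import torch
from scipy.stats import kurtosis
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B"            # assumed (gated) checkpoint id
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(name)

stats = {}
def make_hook(layer_name):
    def hook(module, inputs, output):
        x = inputs[0].detach().float().flatten()
        stats[layer_name] = kurtosis(x.cpu().numpy())   # Fisher (excess) kurtosis
    return hook

for i, block in enumerate(model.model.layers[:10]):
    block.mlp.down_proj.register_forward_hook(make_hook(f"layer{i}.down_proj"))
    block.self_attn.q_proj.register_forward_hook(make_hook(f"layer{i}.q_proj"))

with torch.no_grad():
    model(**tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt"))
print(stats)
```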
(W3 & Q2) Experiments on additional tasks with different Optimizer
1. Mathematical Reasoning
We further evaluated GraLoRA on a mathematical reasoning task, using MetaMathQA for training and the MATH dataset for testing. Two models, LLaMA3.2–1B and Qwen2.5–1.5B, were fine-tuned, mostly following the settings from Hu et al. [5]. Following our paper’s heuristic, we used $k = 4$ for both rank 64 and 128. In addition, we replaced LionW with AdamW, since LionW itself can mitigate the influence of large gradients.
| Model | Rank | Method | Accuracy |
|---|---|---|---|
| LLaMA3.2–1B | 64 | LoRA | 14.9% |
| | | GraLoRA | 15.2% |
| Qwen2.5–1.5B | 64 | LoRA | 23.6% |
| | | GraLoRA | 25.7% |
| | 128 | LoRA | 24.7% |
| | | GraLoRA | 28.9% |
As shown in the table, GraLoRA showed superior performance on the math task for both models. This highlights the robustness of GraLoRA across architectures and hyperparameters, including the optimizer.
2. General Language Understanding (GLUE)
We evaluated GraLoRA on the GLUE benchmark using RoBERTa-base and included two additional baselines: VeRA and FourierFT.
Following the protocol of prior work, we excluded MNLI and QQP—two time-intensive tasks—which also meant we did not apply the MNLI-based tricks for MRPC, RTE, and STS-B (as used in the original LoRA paper). Accordingly, we retrained LoRA on these tasks without this optimization and report updated results.
While VeRA and FourierFT involve fewer trainable parameters, their training time is comparable to or even longer than LoRA with rank 8. Therefore, we set the LoRA and GraLoRA ranks to 8 for fair comparison. Since this is a relatively low-rank setting, we also evaluated Hybrid GraLoRA by splitting the rank equally between LoRA and GraLoRA (i.e., 4+4), which is expected to be beneficial under such constraints. All hyperparameters followed those used in the VeRA implementation, except for learning rate, which was reduced by a factor of 5 to 10.
| Method | Params | SST-2 | MRPC | CoLA | QNLI | RTE | STS-B | Avg |
|---|---|---|---|---|---|---|---|---|
| FT | 125M | 94.8 | 90.2 | 63.6 | 92.8 | 78.7 | 91.2 | 85.2 |
| LoRA | 0.3M | 95.1 | 86.5 | 63.4 | 93.3 | 76.2 | 90.6 | 84.2 |
| VeRA | 0.043M | 94.6 | 89.5 | 65.6 | 91.8 | 78.7 | 90.7 | 85.2 |
| FourierFT | 0.024M | 94.2 | 90.0 | 63.8 | 92.2 | 79.1 | 90.8 | 85.0 |
| GraLoRA | 0.3M | 95.2 | 89.7 | 65.3 | 93.0 | 80.9 | 91.1 | 85.8 |
| Hybrid GraLoRA | 0.3M | 95.2 | 90.2 | 64.1 | 93.4 | 79.8 | 91.2 | 85.6 |
| Best GraLoRA | 0.3M | 95.2 | 90.2 | 65.3 | 93.4 | 80.9 | 91.2 | 86.0 |
GraLoRA demonstrates strong performance in this low-rank regime, outperforming all baselines in average score. Hybrid GraLoRA delivers the most robust results, achieving the best performance on 4 out of 6 tasks. These results indicate that GraLoRA maintains high effectiveness even under constrained parameter budgets and in non-LLM architectures.
(W2 & Q3) Fair Comparison with Hyper-parameter search
We conducted an additional experiment on the commonsense reasoning task using the Qwen2.5–1.5B model, extending the number of training epochs from 2 to 3. In response to the concern about fixed learning rates, we also performed a grid search over learning rates {1e-4, 3e-4, 5e-4} for each method and selected the best-performing configuration individually.
| Model | Method | BoolQ | PIQA | SIQA | HS | WG | ARC-c | ARC-e | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5–1.5B | LoRA | 68.5 | 82.3 | 74.4 | 85.3 | 75.2 | 71.0 | 86.2 | 84.2 | 78.4 |
| | MoRA | 67.1 | 82.3 | 74.1 | 84.8 | 74.7 | 73.5 | 87.1 | 81.6 | 78.2 |
| | RaSA | 68.4 | 83.4 | 74.5 | 84.5 | 73.6 | 73.9 | 87.5 | 81.2 | 78.4 |
| | GraLoRA | 67.7 | 82.7 | 75.3 | 86.1 | 74.0 | 74.2 | 87.5 | 80.8 | 78.5 |
GraLoRA continues to achieve the highest average accuracy and ranks first on 4 out of 8 tasks, demonstrating robust performance even under altered training schedules and per-method parameter tuning.
Interestingly, we observed that overall accuracy slightly decreased compared to the 2-epoch setting reported in the main paper. We believe this is due to overfitting, since we observed that training loss continued to decrease in the third epoch, while evaluation loss began to increase.
3. Commonsense Reasoning with Extensive PEFT Baselines
We conducted a new evaluation on the commonsense reasoning benchmark using the LLaMA3.2–3B model, comparing GraLoRA with a wide spectrum of PEFT methods, including full fine-tuning, LoRA, LoRA-XS, LoRA-SB, rsLoRA, PiSSA, BOFT, MELoRA, MoRA, and RaSA. All models were trained with rank 32. For consistency, we used the results reported in the LoRA-SB paper for full fine-tuning, LoRA-XS, LoRA-SB, rsLoRA, and PiSSA. For other methods, we conducted a learning rate sweep over {2e-4, 4e-4, 1e-3, 2e-3}, selecting the best-performing configuration per method. All other hyperparameters followed the settings in LoRA-SB, including the use of the AdamW optimizer.
| Method | Rank | Params | BoolQ | PIQA | SIQA | HS | WG | ARC-c | ARC-e | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FullFT | - | 3.21B | 70.4 | 85.6 | 80.5 | 91.9 | 85.0 | 75.3 | 88.5 | 81.9 | 82.4 |
| LoRA | 32 | 48.63M | 70.0 | 85.2 | 79.1 | 90.7 | 82.2 | 74.3 | 86.9 | 81.9 | 81.3 |
| LoRA-XS | 96 | 1.81M | 67.3 | 83.4 | 78.7 | 89.0 | 82.1 | 72.6 | 85.2 | 78.9 | 79.6 |
| LoRA-SB | 96 | 1.81M | 70.3 | 84.8 | 80.2 | 91.6 | 84.6 | 74.7 | 87.9 | 81.2 | 81.9 |
| rsLoRA | 32 | 48.63M | 69.8 | 85.1 | 78.9 | 90.5 | 82.0 | 74.2 | 86.7 | 81.7 | 81.1 |
| PiSSA | 32 | 48.63M | 70.1 | 85.4 | 79.4 | 90.9 | 82.7 | 74.6 | 87.2 | 81.8 | 81.5 |
| BOFT | 32 | 48.48M | 72.3 | 84.6 | 79.1 | 91.3 | 84.5 | 73.7 | 87.8 | 80.6 | 81.7 |
| MELoRA | 32 | 48.63M | 71.3 | 85.0 | 78.6 | 93.0 | 79.7 | 73.7 | 85.5 | 79.0 | 80.7 |
| MoRA | 32 | 48.63M | 72.4 | 86.1 | 80.1 | 92.3 | 84.8 | 76.8 | 88.8 | 84.8 | 83.3 |
| RaSA | 32 | 48.63M | 73.1 | 87.5 | 81.1 | 93.7 | 85.3 | 78.9 | 88.9 | 83.6 | 84.0 |
| GraLoRA | 32 | 48.63M | 74.1 | 86.5 | 80.8 | 93.8 | 87.5 | 79.9 | 89.5 | 84.8 | 84.6 |
GraLoRA achieves the highest average accuracy and ranks first on 5 out of 8 tasks. These results, combined with those in Table 3 of the main paper, confirm GraLoRA’s scalability and consistent improvements across model sizes, ranks, and PEFT baselines.
4. Diffusion Model Fine-Tuning
Finally, we applied GraLoRA to diffusion models by fine-tuning SDXL for personalization. We followed the official training setup from the HuggingFace diffusers repository, using the lambdalabs/naruto-blip-captions dataset. The dataset was split 90% for training and 10% for evaluation.
| Method | CLIP Similarity (%) | DINOv2 Similarity (%) |
|---|---|---|
| LoRA | 91.4 | 79.2 |
| GraLoRA | 91.9 | 81.3 |
GraLoRA consistently outperformed LoRA in both CLIP and DINOv2 similarity scores, further demonstrating its generality and effectiveness beyond LLMs—including vision-text and generative architectures like diffusion models.
(Q1) Heuristic for the Hyperparameter $k$
We have further analyzed the hyperparameter sweep results in the ablation study section. Based on the sweep results and our experience, choosing $k$ according to the rank yielded stable performance under varying configurations, including model and task. Thus, we set a single fixed $k$ for ranks under 32, and $k = 4$ for higher ranks (64 and 128) in the experiments.
To conclude, we sincerely thank the reviewer for the thoughtful feedback, which has significantly contributed to improving our work. In response, we have added new theoretical analysis on the generality of gradient distortion beyond down-projection layers, extended evaluations on mathematical reasoning and GLUE, and included comprehensive comparisons with regularized LoRA variants such as MELoRA. We have also investigated the impact of training duration and hyperparameter tuning, and validated GraLoRA’s generality through experiments on diffusion models. These additions will be incorporated into the revised manuscript to reflect the valuable insights raised in your review.
[1] Bondarenko et al., "Understanding and Overcoming the Challenges of Efficient Transformer Quantization"
[2] Yu et al., "The Super Weight in Large Language Models"
[3] Xiao et al., "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models"
[4] Liu et al., "SpinQuant: LLM quantization with learned rotations"
[5] Hu et al., "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
Thank you very much for your detailed review and thoughtful questions. In response, we have substantially expanded both the theoretical and empirical analysis of GraLoRA: (W1) Gradient dynamics beyond down-projection layers, (W2 & Q3) Fair comparison with per-method learning rate tuning and extended training epochs, (W3 & Q2) Robustness across optimizers and tasks including math reasoning, GLUE, and diffusion models, and (Q1) Heuristic guidance for selecting the hyperparameter $k$. We will include all of these clarifications and results in the revised manuscript to strengthen the clarity, rigor, and generality of the work.
We are grateful for your earlier feedback, which played a key role in strengthening the paper. We hope that the expanded theoretical analysis, comprehensive benchmarking, and detailed ablation studies effectively address your concerns and offer a clearer view of GraLoRA’s contributions. If there are any remaining questions or if further clarification is needed, we would be very grateful for the opportunity to address them. We respectfully ask you to consider these additional materials in your final evaluation.
Thanks for the additional experiments. The authors' responses address my concerns, and I have updated the score.
Thank you very much for your thoughtful consideration and for taking the time to review our additional experiments. We truly appreciate your updated evaluation and your constructive feedback throughout the process.
This paper addresses a known limitation in LoRA: its performance degrades at higher ranks. Motivated by empirical observations and theoretical analysis, the authors propose GraLoRA (Granular Low-Rank Adaptation), which partitions the weight matrices into sub-blocks to mitigate the issue of “outlier channels” disproportionately influencing training.
Strengths and Weaknesses
Strengths: The paper is well-motivated. The logical flow from empirical observations (i.e., LoRA's degraded performance at higher ranks), to the hypothesis of outlier channel dominance, and finally to the proposed architectural change is clearly laid out and persuasive.
The concept of “outlier channel” is used to explain gradient entanglement in LoRA, which is an insightful analysis. Although similar patterns have been discussed in deep learning literature (see weakness below), framing this in the context of LoRA and fine-tuning is valuable.
GraLoRA is directly motivated by this analysis, offering a structured method to partition updates and reduce interference.
Experimental results show strong empirical gains, especially, the improvement for code generation across various model sizes and LoRA ranks is significant.
Weaknesses: The authors use the name “outlier channel”. It seems to me that the authors are implying that this is not a good phenomenon. But it is quite natural: many papers have shown that the learning dynamics of DNNs (not only ViTs) lie in low-dimensional spaces. Hence, the presence of dominant channels is expected. Particularly in fine-tuning scenarios, these dominant directions may actually assist in fast adaptation. A more nuanced discussion is warranted to distinguish harmful dominance from useful representation compression.
Although GraLoRA mitigates gradient interference via sub-block isolation, from a flexibility and capacity standpoint, the hierarchy seems to be: Full Fine-Tuning > LoRA > GraLoRA. That is, while GraLoRA may improve stability, it also constrains learning to smaller regions of the parameter space. The paper does not fully address this potential trade-off.
The evaluation is limited to natural language tasks. Given the generality of the proposed method, it would be helpful to test GraLoRA on vision-related fine-tuning tasks. This would further establish the robustness and generalizability of the method across modalities.
My own experience tells me that fine-tuning performance on NLP tasks often exhibits significant variance depending on initialization, data quality, and learning rate schedules. Some evaluation of this variance (e.g., standard deviations or box plots) would make the results more convincing.
Questions
GraLoRA introduces regularization through structural partitioning, hence, it could be viewed as a regularization method. Thus could the authors also compare GraLoRA with more regularization techniques in LoRA?
Can the authors include experiments on image-based tasks or vision transformers? These would serve as compelling evidence.
A special requirement is the comparison with MELoRA (MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning). In MELoRA the update is restricted to diagonal-blocks, which can be regarded as a more specific GraLoRA. I think it is better to use the GraLoRA directly on the tasks presented in that paper, which will be more convenient for you and meanwhile could test the generalization capability across tasks.
Limitations
Yes
Final Justification
This paper addresses a known limitation in LoRA: its performance degrades at higher ranks. Motivated by empirical observations and theoretical analysis, the authors propose GraLoRA (Granular Low-Rank Adaptation), which partitions the weight matrices into sub-blocks to mitigate the issue of “outlier channels” disproportionately influencing training.
The rebuttal, which contains additional explanation and experiments, has addressed most of my concerns. Now I vote for acceptance. Although the "outlier channel" (under this or similar names) is not a new finding, the authors' design is new and the improvement is well supported. Thus, I suggest acceptance.
Formatting Issues
No.
We are grateful for your careful reading of our paper and your positive remarks. Your comments have helped us clarify several important aspects.
(W1) Influence of Outlier Channels
As mentioned, prior studies have shown that outlier channels and weights are common in large models [1][2], and the impact of such outliers has been actively addressed in areas like quantization [3][4]. We also agree that the existence of outlier channels and weights is natural; they usually show high correlation with tokens like "." or other special tokens.
On the other hand, while the existence of outliers is not a problem in general situations, it can have a negative influence when training LoRA adapters. As illustrated in Figure 2, while full fine-tuning localizes the gradient impact, LoRA’s entire gradient update becomes disproportionately influenced by a single outlier.
This structural limitation of LoRA leads to misalignment between LoRA updates and the gradient landscape shaped by full fine-tuning. In addition, since outliers exist only in certain layers, they produce different gradient scales across layers, making the model harder to train.
(W2) Flexibility and Capacity analysis of GraLoRA
To compare the flexibility and capacity, we examine the column spaces of the weight update matrices from LoRA and GraLoRA within the framework of the Grassmannian manifold.
Let $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$, as illustrated in Figure 5 of the paper. In vanilla LoRA, assuming column-wise linear independence in both $A$ and $B$, the rank of $\Delta W = BA$ is $r$. Thus, its column space lies in the Grassmannian manifold $\mathrm{Gr}(r, m)$, which denotes the set of all $r$-dimensional linear subspaces of $\mathbb{R}^m$.
In GraLoRA, each sub-block has rank $r/k$, as shown in Section 3.2. The rank of each column block is therefore $r$; under the assumption that all sub-blocks are linearly independent, the maximum rank of $\Delta W_{\mathrm{GraLoRA}}$ can reach up to $kr$. Consequently, the column space of the GraLoRA update lies in $\mathrm{Gr}(kr, m)$.
In terms of dimension, the Grassmannian manifold $\mathrm{Gr}(r, m)$ has dimension $r(m - r)$, while $\mathrm{Gr}(kr, m)$ has dimension $kr(m - kr)$. Under the common setting where $kr \ll m$, this implies that

$$\dim \mathrm{Gr}(kr, m) > \dim \mathrm{Gr}(r, m).$$
Therefore, the subspace learned by GraLoRA spans a higher-dimensional manifold, providing greater representational flexibility and capacity than standard LoRA. This geometric perspective further supports our theoretical claim that GraLoRA enhances expressivity by expanding the effective rank of the adaptation subspace.
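The rank claim can be verified numerically; the following sketch (our own, with assumed dimensions) compares the rank of a vanilla LoRA update $BA$ with that of a GraLoRA-style update assembled from a $k \times k$ grid of rank-$r/k$ blocks at the same parameter count:

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 256
r, k = 32, 4

# Vanilla LoRA update: rank at most r
lora = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# GraLoRA-style update: k x k blocks, each of rank r/k, same total parameter count
rb, mb, nb = r // k, m // k, n // k
gralora = np.block([
    [rng.standard_normal((mb, rb)) @ rng.standard_normal((rb, nb)) for _ in range(k)]
    for _ in range(k)
])

print("LoRA    rank:", np.linalg.matrix_rank(lora))      # 32  (= r)
print("GraLoRA rank:", np.linalg.matrix_rank(gralora))   # 128 (= k * r) with probability 1
```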
(W4 & Q1 & Q3) Experiments with New Baselines and Variance Evaluation
1. Commonsense Reasoning with Extensive PEFT Baselines
We conducted a new evaluation on the commonsense reasoning benchmark using the LLaMA3.2–3B model, comparing GraLoRA with a wide spectrum of PEFT methods, including full fine-tuning, LoRA, LoRA-XS, LoRA-SB, rsLoRA, PiSSA, BOFT, MELoRA, MoRA, and RaSA. All models were trained with rank 32. For consistency, we used the results reported in the LoRA-SB paper for LoRA-XS, LoRA-SB, rsLoRA, and PiSSA. Here rsLoRA, BOFT, and MELoRA are popular regularization techniques in LoRA. For BOFT, MELoRA, MoRA, RaSA, and GraLoRA, we conducted a learning rate sweep over {2e-4, 4e-4, 1e-3, 2e-3}, selecting the best-performing configuration per method. All other hyperparameters followed the settings in LoRA-SB.
| Method | Rank | Params | BoolQ | PIQA | SIQA | HS | WG | ARC-c | ARC-e | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FullFT | - | 3.21B | 70.4 | 85.6 | 80.5 | 91.9 | 85.0 | 75.3 | 88.5 | 81.9 | 82.4 |
| LoRA | 32 | 48.63M | 70.0 | 85.2 | 79.1 | 90.7 | 82.2 | 74.3 | 86.9 | 81.9 | 81.3 |
| LoRA-XS | 96 | 1.81M | 67.3 | 83.4 | 78.7 | 89.0 | 82.1 | 72.6 | 85.2 | 78.9 | 79.6 |
| LoRA-SB | 96 | 1.81M | 70.3 | 84.8 | 80.2 | 91.6 | 84.6 | 74.7 | 87.9 | 81.2 | 81.9 |
| rsLoRA | 32 | 48.63M | 69.8 | 85.1 | 78.9 | 90.5 | 82.0 | 74.2 | 86.7 | 81.7 | 81.1 |
| PiSSA | 32 | 48.63M | 70.1 | 85.4 | 79.4 | 90.9 | 82.7 | 74.6 | 87.2 | 81.8 | 81.5 |
| BOFT | 32 | 48.48M | 72.3 | 84.6 | 79.1 | 91.3 | 84.5 | 73.7 | 87.8 | 80.6 | 81.7 |
| MELoRA | 32 | 48.63M | 71.3 | 85.0 | 78.6 | 93.0 | 79.7 | 73.7 | 85.5 | 79.0 | 80.7 |
| MoRA | 32 | 48.63M | 72.4 | 86.1 | 80.1 | 92.3 | 84.8 | 76.8 | 88.8 | 84.8 | 83.3 |
| RaSA | 32 | 48.63M | 73.1 | 87.5 | 81.1 | 93.7 | 85.3 | 78.9 | 88.9 | 83.6 | 84.0 |
| GraLoRA | 32 | 48.63M | 74.1 | 86.5 | 80.8 | 93.8 | 87.5 | 79.9 | 89.5 | 84.8 | 84.6 |
GraLoRA achieves the highest average accuracy and ranks first on 5 out of 8 tasks. These results, combined with those in Table 3 of the main paper, confirm GraLoRA’s scalability and consistent improvements across model sizes, ranks, and PEFT baselines.
2. Mathematical Reasoning
We further evaluated GraLoRA on a mathematical reasoning task, using the MetaMathQA dataset for training and the MATH benchmark for testing. Two models, LLaMA3.2–1B and Qwen2.5–1.5B, were fine-tuned, mostly following the settings from Hu et al. [1], with ranks 64 and 128. Following our paper’s heuristic, we used $k = 4$ for both.
| Model | Rank | Method | Accuracy |
|---|---|---|---|
| LLaMA3.2–1B | 64 | LoRA | 14.9% |
| | | GraLoRA | 15.2% |
| Qwen2.5–1.5B | 64 | LoRA | 23.6% |
| | | GraLoRA | 25.7% |
| | 128 | LoRA | 24.7% |
| | | GraLoRA | 28.9% |
GraLoRA consistently outperforms LoRA, especially on the larger model, highlighting its robustness across architectures and tasks requiring symbolic reasoning.
3. General Language Understanding (GLUE)
We evaluated GraLoRA on the GLUE benchmark, which comprises eight subtasks, using RoBERTa-base, an encoder-only architecture. To ensure fair comparison with recent PEFT methods designed for parameter efficiency, we included two additional baselines: VeRA and FourierFT.
Following the protocol of prior work, we excluded MNLI and QQP—two time-intensive tasks—which also meant we did not apply the MNLI-based tricks for MRPC, RTE, and STS-B (as used in the original LoRA paper). Accordingly, we retrained LoRA on these tasks without this optimization and report updated results.
While VeRA and FourierFT involve fewer trainable parameters, their training time is comparable to or even longer than LoRA with rank 8. Therefore, we set the LoRA and GraLoRA ranks to 8. Since this is a relatively low-rank setting, we also evaluated Hybrid GraLoRA by splitting the rank equally between LoRA and GraLoRA (i.e., 4+4), which is expected to be beneficial under such constraints.
All hyperparameters for GraLoRA followed those used in the VeRA implementation, except for learning rate, which was reduced by a factor of 5–10, as VeRA uses a learning rate approximately 10× larger than LoRA.
| Method | Params | SST-2 (%) | MRPC (%) | CoLA (%) | QNLI (%) | RTE (%) | STS-B (%) | Avg (%) |
|---|---|---|---|---|---|---|---|---|
| FT | 125M | 94.8 | 90.2 | 63.6 | 92.8 | 78.7 | 91.2 | 85.2 |
| LoRA | 0.3M | 95.1 | 86.5 | 63.4 | 93.3 | 76.2 | 90.6 | 84.2 |
| VeRA | 0.043M | 94.6 | 89.5 | 65.6 | 91.8 | 78.7 | 90.7 | 85.2 |
| FourierFT | 0.024M | 94.2 | 90.0 | 63.8 | 92.2 | 79.1 | 90.8 | 85.0 |
| GraLoRA | 0.3M | 95.2 | 89.7 | 65.3 | 93.0 | 80.9 | 91.1 | 85.8 |
| Hybrid GraLoRA | 0.3M | 95.2 | 90.2 | 64.1 | 93.4 | 79.8 | 91.2 | 85.6 |
| Best GraLoRA | 0.3M | 95.2 | 90.2 | 65.3 | 93.4 | 80.9 | 91.2 | 86.0 |
GraLoRA demonstrates strong performance in this encoder-only setting and low-rank regime, outperforming all baselines in average score. Hybrid GraLoRA delivers the most robust results, achieving the best performance on 4 out of 6 tasks. These results indicate that GraLoRA maintains high effectiveness even under constrained parameter budgets and in non-LLM architectures.
(W3 & Q2) Image Task Experiments
4. Diffusion Model Fine-Tuning
We applied GraLoRA to diffusion models by fine-tuning SDXL for personalization. We followed the official training setup from the HuggingFace diffusers repository, using the lambdalabs/naruto-blip-captions dataset. The dataset was split 90% for training and 10% for evaluation.
| Method | CLIP Similarity (%) | DINOv2 Similarity (%) |
|---|---|---|
| LoRA | 91.4 | 79.2 |
| GraLoRA | 91.9 | 81.3 |
GraLoRA consistently outperformed LoRA in both CLIP and DINOv2 similarity scores, further demonstrating its generality and effectiveness beyond LLMs—including vision-text and generative architectures like diffusion models.
To conclude, we sincerely thank the reviewer for their valuable suggestions and thoughtful critique. Your feedback prompted us to clarify the role of outlier channels in LoRA, deepen the theoretical analysis of GraLoRA’s expressivity via Grassmannian geometry, and conduct comprehensive new experiments—including comparisons against regularized LoRA variants such as MELoRA, as well as evaluations on mathematical reasoning, GLUE, and diffusion tasks. These additions, derived directly from your comments, will be incorporated into the revised manuscript to strengthen both its theoretical depth and empirical breadth.
Thank you very much for your detailed review and thoughtful questions. In response, we have substantially extended both the theoretical analysis and experimental evaluation of GraLoRA: (W1) Clarification on Outlier Channels, (W2) Grassmannian Geometry for Capacity Analysis, (W3 & Q2) Evaluation on Vision Tasks, (W4 & Q1) Variance Analysis and Comparison with Regularization Methods, and (Q3) Direct Comparison with MELoRA. These additions will be included in the revised manuscript to improve the depth, clarity, and generality of our contributions.
Thank you once again for your insightful and constructive review. Your feedback has led to significant improvements in both theoretical formulation and empirical breadth. We hope these updates address your concerns and provide a more complete picture of GraLoRA’s capabilities. If any part remains unclear, we would be grateful for the opportunity to further clarify. We respectfully ask that you consider these additional results and analysis in your final evaluation.
Thank you for the detailed rebuttal. The additional experiments and clarifications have addressed most of my concerns. Over the past few days, I have also read the discussions between the authors and the other reviewers, which further helped clarify the key points. Overall, I would like to raise my score.
Thank you very much for your thoughtful response and for taking the time to review both our rebuttal and the broader discussion. We’re grateful that the additional experiments and clarifications were helpful, and we truly appreciate your updated evaluation and constructive engagement throughout the review process.
The paper identifies bottlenecks in scaling LoRA to higher ranks and introduces GraLoRA, a granular low-rank adaptation approach. The authors observe that LoRA’s structure makes it sensitive to the full input, which allows outliers to disproportionately influence the gradient signal. To address this, they propose partitioning the weight matrix into sub-blocks, with each sub-block has its own LoRA adapter. Experimental results indicate that this block-partitioned strategy, GraLoRA, consistently outperforms standard LoRA and its variants across multiple settings.
Strengths and Weaknesses
Strengths
- The paper identifies bottlenecks in the performance of LoRA and highlights the sensitivity of gradient signals to outliers.
- A novel approach is proposed—partitioning the weight matrix into sub-blocks and training a separate LoRA adapter for each block.
- The analysis of memory and computational complexity indicates parity with conventional LoRA.
- Experimental validation demonstrates that GraLoRA consistently outperforms other well-known LoRA variants.
- The paper is well written, with experimental results presented in a clear and concise manner.
Weaknesses:
- Theoretical justification of GraLoRA’s gradient dynamics and its sensitivity to outliers is minimal. While empirical results suggest changes in gradients localized to interacting blocks, a more rigorous formulation would be valuable.
- The paper lacks an analysis of the subspaces learned by the adapters using Grassmannian geometry, which could provide insight into the observed performance gains.
- The discussion on effective rank in Section 3.2 is incomplete. The authors assume independence of all columns in the submatrices $B_{(i,k)}$ and $A_{(i,k)}$, but this assumption is not clearly justified. Empirical support is provided for the higher implicit rank hypothesis, but additional clarification is needed.
- The GraLoRA structure may not be amenable to efficient computation on resource-constrained or on-device deployments. Suggestions for possible optimizations would enhance the paper’s applicability.
- In scenarios requiring support for multiple adapters, GraLoRA could face scalability or architectural challenges. The paper does not address this issue.
Questions
The authors could help clarify the following:
- Why are all the columns of the $B_i$ and $A_i$ sub-matrices independent?
- How are sub-block adapters initialized? Is the initialization the same as in LoRA?
- Are the LoRA alpha factors the same across all blocks? Unfortunately, I couldn't find this in the paper.
- Would alternative approaches such as VeRA/LoRA-SB alter the impact of outliers in the gradient signal? Can VeRA/LoRA-SB-style adapters be leveraged for GraLoRA?
Limitations
Yes
Final Justification
The authors have tried to address all the questions I previously raised. I carefully reviewed their responses and comments during the rebuttal period and acknowledge their efforts to clarify the points of concern. My remaining reservations pertain to the following:
- The invocation of random matrix theory to justify higher rank in finite settings warrants more rigorous analysis. While the asymptotic behavior is well-understood, a non-asymptotic treatment would be more appropriate and informative in this context.
- Although I agree that GLoRA maintains the same parameter count, practical deployment on edge devices often necessitates adapter switching or fusion with varying weights. Such scenarios demand careful memory management to ensure efficient adapter switching with acceptable latency.
- The experimental results convincingly demonstrate the benefits of high-rank adapters. For example, MoRA performs comparably to GraLoRA across several metrics. It remains unclear whether the authors have investigated sparse high-rank adapters, which were another elegant effort to leverage high-rank adaptation.
I sincerely thank the authors for their diligent efforts in addressing the concerns raised. Based on their responses, I have increased my score.
Formatting Issues
The paper is formatted according to the guideline.
Thank you for your thoughtful review and constructive feedback. We appreciate your recognition of our contributions and the insightful points raised.
(W1) Theoretical justification of GraLoRA's gradient dynamics
While the gradient dynamics of LoRA are theoretically formulated in Equation (4) and Figure 2, and empirically validated in Figures 3 and 4, the analysis for GraLoRA in the main paper has thus far focused on empirical results (Figures 3 and 6). We now provide a more rigorous treatment of GraLoRA’s gradient dynamics.
GraLoRA divides the adapter into $k \times k$ sub-blocks. Each sub-block is defined as $\Delta W_{(i,j)} = B_{(i,j)} A_{(i,j)}$, where $B_{(i,j)} \in \mathbb{R}^{(m/k) \times (r/k)}$ and $A_{(i,j)} \in \mathbb{R}^{(r/k) \times (n/k)}$.
Given the input slice $x_j \in \mathbb{R}^{(n/k) \times T}$ and the output-slice gradient $\frac{\partial \mathcal{L}}{\partial y_i}$, the gradients of the parameters are:

$$\frac{\partial \mathcal{L}}{\partial A_{(i,j)}} = B_{(i,j)}^\top \frac{\partial \mathcal{L}}{\partial y_i} x_j^\top, \qquad \frac{\partial \mathcal{L}}{\partial B_{(i,j)}} = \frac{\partial \mathcal{L}}{\partial y_i} x_j^\top A_{(i,j)}^\top.$$
The corresponding gradient of the composed update matrix is:

$$\frac{\partial \mathcal{L}}{\partial \Delta W_{(i,j)}} \approx \frac{\partial \mathcal{L}}{\partial y_i} x_j^\top A_{(i,j)}^\top A_{(i,j)} + B_{(i,j)} B_{(i,j)}^\top \frac{\partial \mathcal{L}}{\partial y_i} x_j^\top.$$
This closely resembles the gradient dynamics of LoRA, maintaining the same underlying structure of interaction between the input and parameter gradients.
Now if an outlier exists in input slice $x_j$, only the $k$ sub-blocks in the $j$-th block column directly process the outlier input and receive the amplified gradient signals. In contrast, the remaining blocks remain largely unaffected, as they operate on non-outlier slices, in line with the observations in Figure 6. This selective exposure sharply contrasts with standard LoRA, in which a single outlier channel can influence the entire low-rank adapter due to global entanglement.
(W2 & Q1) Independence of all columns in the submatrices
The independence of columns in the submatrices $B_{(i,j)}$ and $A_{(i,j)}$ is a critical condition for achieving the maximum rank in GraLoRA, comparable to the condition in standard LoRA, where the rank is maximized when all columns of $B$ and $A$ are linearly independent.
To justify why this condition holds in practice, we refer to results from random matrix theory. Consider a random square matrix $M \in \mathbb{R}^{n \times n}$, where each entry is sampled independently from a continuous distribution (e.g., Gaussian). The event that such a matrix is rank-deficient is equivalent to $\det(M) = 0$. Since the determinant is a multivariate polynomial over the $n^2$-dimensional space of matrix entries, the set of rank-deficient matrices forms a lower-dimensional algebraic hypersurface of dimension at most $n^2 - 1$, which has Lebesgue measure zero in $\mathbb{R}^{n \times n}$. Therefore:

$$P(\det(M) = 0) = 0,$$

which implies

$$P(\operatorname{rank}(M) = n) = 1.$$
This logic extends naturally to non-square matrices $M \in \mathbb{R}^{p \times q}$. Rank-deficiency in such cases would require all $s \times s$ minors (where $s = \min(p, q)$) to vanish simultaneously, which again corresponds to a set of measure zero. Therefore, the independence condition in both LoRA and GraLoRA is well-supported by probabilistic arguments rooted in random matrix theory.
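This measure-zero argument is easy to confirm empirically; a quick sketch (our own illustration) drawing Gaussian matrices and checking that they are full rank in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
trials, p, q = 1000, 64, 8          # e.g., a sub-block of shape (r/k) x (n/k)

full_rank = sum(
    np.linalg.matrix_rank(rng.standard_normal((q, p))) == min(p, q)
    for _ in range(trials)
)
print(f"{full_rank}/{trials} random {q}x{p} Gaussian matrices are full rank")  # 1000/1000
```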
(W2) analysis with Grassmannian geometry
To compare the subspaces learned by the adapters, we examine the column spaces of the weight update matrices from LoRA and GraLoRA. Let $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$, as illustrated in Figure 5 of the paper. In vanilla LoRA, assuming column-wise linear independence in both $A$ and $B$, the rank of $\Delta W = BA$ is $r$. Thus, its column space lies in the Grassmannian manifold $\mathrm{Gr}(r, m)$, which denotes the set of all $r$-dimensional linear subspaces of $\mathbb{R}^m$.
In GraLoRA, each sub-block has rank $r/k$, as shown in Section 3.2. The rank of each column block becomes $r$ under the assumption that all sub-blocks are linearly independent, and $\Delta W_{\mathrm{GraLoRA}}$ can reach the maximum rank $kr$. Consequently, the column space of the GraLoRA update lies in $\mathrm{Gr}(kr, m)$.
In terms of dimension, the Grassmannian manifold $\mathrm{Gr}(r, m)$ has dimension $r(m - r)$, while $\mathrm{Gr}(kr, m)$ has dimension $kr(m - kr)$. Under the common setting where $kr \ll m$, this implies that $\dim \mathrm{Gr}(kr, m) > \dim \mathrm{Gr}(r, m)$. Therefore, the subspace learned by GraLoRA spans a higher-dimensional manifold, providing greater representational flexibility and capacity than standard LoRA. This geometric perspective further supports our theoretical claim that GraLoRA enhances expressivity by expanding the effective rank of the adaptation subspace.
(W3) GraLoRA on resource-constrained scenario
As discussed in Section 3.4 (Tradeoff Analysis), GraLoRA introduces no additional computational or memory overhead during training. This is because GraLoRA maintains the same number of parameters and FLOPs as standard LoRA.
During inference, GraLoRA adapters can be statically fused into the base model weights, just like in LoRA. Once merged, inference proceeds identically to that of the original model with updated weights, incurring no runtime penalty. Thus, for single-adapter deployments, such as in typical on-device or resource-constrained settings, GraLoRA remains as efficient as standard LoRA.
(W3) GraLoRA on multiple adapter scenario
We agree that multi-adapter settings introduce practical challenges due to the need to dynamically load and apply adapter weights without fusion. However, we emphasize that GraLoRA introduces no additional burden compared to LoRA in this scenario. Both methods involve the same number of parameters and identical FLOPs, resulting in equivalent memory and compute requirements per adapter.
To empirically validate this, we measured the inference latency of GraLoRA and LoRA under a multi-adapter setting using the LLaMA3.1–8B model with rank 32. We simulated a realistic scenario by generating 128 output tokens from 128 input tokens using a single NVIDIA H100 GPU, with adapters loaded dynamically and applied without fusion. The average end-to-end inference times were 5.1 seconds for LoRA and 5.3 seconds for GraLoRA.
This negligible difference demonstrates that GraLoRA scales effectively in multi-adapter deployments, and does not suffer from significant computational or architectural limitations compared to existing PEFT methods. We will include this measurement in the revised paper to clarify GraLoRA’s practicality in such scenarios.
(Q2 & Q3) Initialization and hyper-parameter settings of GraLoRA
Thank you for pointing this out. The initialization and hyperparameter settings for GraLoRA follow the same convention as in standard LoRA. Specifically, each sub-block adapter $A_{(i,j)}$ is initialized using Kaiming uniform initialization, while $B_{(i,j)}$ is initialized to zero. Regarding the scaling factor, we currently apply the same LoRA alpha value uniformly across all blocks. Consequently, each sub-block’s output is scaled identically, ensuring consistent training dynamics across the adapter grid.
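A minimal sketch of this initialization recipe (our own illustration; the tensor layout of the sub-block parameters is an assumption):

```python
import math
import torch
import torch.nn as nn

def init_gralora(A: nn.Parameter, B: nn.Parameter) -> None:
    """A: (k, k, r/k, n/k) sub-block down-projections, B: (k, k, m/k, r/k) up-projections.
    Kaiming-uniform on every A_(i,j), zeros on every B_(i,j), as in vanilla LoRA,
    so each block's product B_(i,j) A_(i,j) -- and hence the whole update -- starts at zero."""
    k = A.shape[0]
    with torch.no_grad():
        for i in range(k):
            for j in range(k):
                nn.init.kaiming_uniform_(A[i, j], a=math.sqrt(5))
        B.zero_()
```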
(Q4) Applicability of VeRA and LoRA-SB to GraLoRA
Both VeRA and LoRA-SB primarily aim to reduce the number of trainable parameters, but they do not fundamentally address the gradient sensitivity to outlier channels. In VeRA, the shared frozen matrices $A$ and $B$ still multiply the full input $x$, so outliers can dominate the gradient signal. LoRA-SB adopts the structure $\Delta W = B R A$, training only the small intermediate matrix $R$, but since $A$ still interacts with the entire input, it also remains vulnerable to outliers.
Nevertheless, both styles can be incorporated into GraLoRA. A VeRA-style GraLoRA would share $A$ and $B$ across sub-blocks, increasing parameter sharing. While this changes the latent dimensionality, the blockwise structure of GraLoRA still localizes the influence of outliers, offering improved robustness. For a LoRA-SB-style GraLoRA, each sub-block can be reparameterized as $B_{(i,j)} R_{(i,j)} A_{(i,j)}$ with only $R_{(i,j)}$ trained, preserving GraLoRA's outlier isolation while benefiting from reduced trainable parameters.
To further validate GraLoRA’s generality, we have conducted additional experiments across diverse tasks, including commonsense reasoning, mathematical reasoning, GLUE, and diffusion model fine-tuning. These evaluations incorporate VeRA and LoRA-SB as baselines and consistently demonstrate GraLoRA’s robustness and effectiveness across domains. As this response focuses on mathematical and theoretical aspects, the corresponding experimental results are provided in our responses to other reviewers.
To conclude, we sincerely thank the reviewer for the thoughtful and constructive feedback. The insightful comments guided us to develop additional theoretical analysis of GraLoRA’s gradient dynamics, independence assumptions, and subspace structure, which we have now included. Furthermore, we conducted new experiments, including inference latency and multi-task evaluation, that address practical concerns raised in the review. These theoretical clarifications and empirical results will be incorporated into the revised manuscript to improve clarity, completeness, and impact.
Thank you very much for your detailed review and thoughtful questions. In response, we have substantially expanded both the theoretical and empirical analyses of GraLoRA, addressing the following points: (W1) Gradient Dynamics, (W2 & Q1) Independence Assumption, (W2) Grassmannian Geometry, (W3) Efficiency in Resource-Constrained Settings, (Q2 & Q3) Initialization and Scaling, and (Q4) Applicability of Other Methods. These clarifications and results will be incorporated into the revised manuscript to improve the clarity and rigor of our presentation.
We sincerely appreciate your earlier feedback, which has greatly contributed to strengthening the paper. We hope that the additional theoretical insights, geometric analysis, and experimental evidence address your concerns and provide a clearer understanding of GraLoRA’s contributions. If any questions remain or further clarification is needed, we would be grateful for the opportunity to respond. We respectfully ask that you consider these updates in your final evaluation.
Thank you very much for your time and effort in reviewing our paper. We truly appreciate your thoughtful feedback and careful evaluation.
We are sorry to see that we may not have fully addressed all of your concerns. If there are any remaining issues or further clarifications needed, we would be grateful for the opportunity to discuss them. We remain open and eager to engage in any additional discussion to ensure the clarity and rigor of our work.
Dear Reviewer EFNJ,
Please consider to provide your feedback on the authors' rebuttal, which will be truly helpful for them.
Thanks
Dear Reviewers,
Thank you all for the big efforts. Please check authors' rebuttal to see if your original concerns have been addressed, as well as if you have any follow-up questions to the authors.
Dear Authors: Please engage with our Reviewers during this discussion period.
Thanks a lot.
This paper proposes GraLoRA, a granular low-rank adaptation approach designed to improve LoRA’s performance at higher ranks by partitioning weight matrices into sub-blocks. The work is well-motivated, technically sound, and supported by extensive experiments demonstrating clear improvements over standard LoRA.
Reviewers appreciated the clarity of the presentation, the empirical strength of the results, and the authors’ detailed rebuttal, which effectively addressed initial concerns. Several reviewers raised their scores after rebuttal, citing strong empirical evidence and useful clarifications.
Remaining concerns are relatively minor: some reviewers noted that the theoretical analysis could be more rigorous in finite settings, and that deployment considerations (e.g., adapter switching on edge devices) may need further exploration. These issues, however, do not undermine the overall contribution.
Overall, this paper received 1 Accept and 3 Borderline Accept ratings. The consensus is that this paper makes a meaningful and timely contribution to parameter-efficient fine-tuning. This meta-review concurs and recommends accepting this paper. The authors are encouraged to revise their paper carefully, taking into account all the comments and discussions.