PaperHub
Overall rating: 6.6/10 · Poster · 4 reviewers (min 3, max 4, std 0.5) · Individual ratings: 3, 4, 3, 4
ICML 2025

Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics

OpenReview · PDF
Submitted: 2025-01-13 · Updated: 2025-08-13
TL;DR

This paper demonstrates that the non-zero initialization of LoRA enhances robustness to learning rates.

Abstract

Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model. However, there is no theoretical support for this practice. In this paper, we investigate the impact of non-zero initialization on LoRA's fine-tuning dynamics from an infinite-width perspective. Our analysis reveals that, compared to zero initialization, simultaneously initializing $A$ and $B$ to non-zero values improves LoRA's robustness to suboptimal learning rates, particularly smaller ones. Further analysis indicates that although the non-zero initialization of $AB$ introduces random noise into the pretrained weight, it generally does not affect fine-tuning performance. In other words, fine-tuning does not need to strictly start from the pretrained model. The validity of our findings is confirmed through extensive experiments across various models and datasets. The code is available at https://github.com/Leopold1423/non_zero_lora-icml25.
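To make the setup concrete, here is a minimal, hypothetical PyTorch sketch of a LoRA layer supporting the standard zero initialization and the non-zero schemes discussed in this paper (with and without subtracting the initial product from the pretrained weight). It is illustrative only and not the authors' released implementation (see the linked repository for that); the class name, arguments, and the choice to scale the Gaussian standard deviation by beta times the Kaiming value follow the descriptions in the reviews and rebuttals below.

```python
import math
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / r) * B A x, with W frozen.

    init="A"   -> standard LoRA (B zero-initialized, Init[A])
    init="AB"  -> both A and B Gaussian, initial product subtracted from W (Init[AB])
    init="AB+" -> both A and B Gaussian, no subtraction (Init[AB+])
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0,
                 init: str = "A", beta: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # pretrained weight stays frozen

        n_in, n_out = base.in_features, base.out_features
        self.scaling = alpha / r
        self.A = nn.Parameter(torch.empty(r, n_in))
        self.B = nn.Parameter(torch.empty(n_out, r))

        std = beta * math.sqrt(1.0 / n_in)           # beta * Kaiming-scale std (delta_k^2 = 1/n)
        nn.init.normal_(self.A, std=std)
        if init == "A":
            nn.init.zeros_(self.B)                   # B A = 0 at t = 0: start exactly at W
        elif init in ("AB", "AB+"):
            nn.init.normal_(self.B, std=std)         # non-zero initialization of both matrices
            if init == "AB":
                with torch.no_grad():                # remove the initial low-rank product so the
                    self.base.weight -= self.scaling * self.B @ self.A  # layer still starts at W
        else:
            raise ValueError(f"unknown init scheme: {init}")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T


# Example: wrap a linear layer with non-zero initialization of both A and B.
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8, alpha=16.0, init="AB", beta=4.0)
```

With `init="AB+"` the initial product is left in place, which corresponds to the paper's finding that fine-tuning need not start exactly from the pretrained model.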
Keywords
Low-Rank Adaptation, Non-Zero Initialization

Reviews and Discussion

Review
Rating: 3

This paper studies how non-zero initialization improves the performance of LoRA, especially its stability.

  • The authors define 1) the notion of stability, $BAX = \Theta(1)$ for all LoRA layers when the width goes to infinity, where $X$ is the input; and 2) the notion of efficiency, namely that the linear update term is $\Theta(1)$.

  • Based on the above two criteria, the authors derive requirements on the random Gaussian variances of A and B as well as on the step size. When using SGD, the optimal initialization under these two criteria is neither the classical LoRA initialization nor other variants of zero initialization.

  • The authors then define the robustness of LoRA and derive similar requirements for it.

Questions for Authors

N/A

Claims and Evidence

This paper looks good and provides some findings beyond zero initialization. I understand the motivation for using non-zero initialization, but the current claim is weak. Previous non-zero initialization work, e.g., LoRA-GA, has a stronger motivation. For instance, LoRA-GA aims to ensure that LoRA gradient updates match full gradient updates as closely as possible, which is also the spirit of using LoRA.

Under theory-guided instructions, this paper gives some non-zero initialization strategies that achieve better performance than LoRA, e.g., a 2x speedup over LoRA. However, the comparison with previous work is limited: only LoRA is compared. I understand that the key idea of this work is to speed up training and obtain other benefits over LoRA. Nevertheless, the experimental comparison is not sufficient.

Another significant issue is that the derivation heavily follows previous work, e.g., Hayou et al. On first reading, it could make a good journal extension, but I'm not sure it can be regarded as an independent work.

Methods and Evaluation Criteria

The evaluation makes sense but the comparison is limited.

Theoretical Claims

The theoretical claim is ok in terms of the stability and efficiency.

Experimental Design and Analysis

The experiments are not sufficient. Only LoRA is compared.

Supplementary Material

Yes. I checked the proofs at a high level, e.g., Appendices B and C. Besides, I also read the experimental settings in Appendix D.

Relation to Existing Literature

This topic and the obtained findings are interesting to the machine learning community.

Missing Essential References

The essential references are sufficient, but it is true that not all of the main LoRA-based algorithms are discussed.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

Hi Reviewer wxon:

Thank you for your detailed and insightful comments. Below, we provide responses to each point individually. Additional experimental results can be found in https://anonymous.4open.science/r/nzlora_rebuttal-7D3E. To save space, we denote zero initialization as ZI and non-zero initialization as NZI.

Q1: "the current claim about motivation is weak"

R1: Unlike LoRA-GA, which was motivated by intuition, our approach is motivated by the theoretical analysis of LoRA's fine-tuning dynamics. From the solution set in Eqs.(5-6), we observe that stable and efficient learning imposes stricter constraints on learning rates, whereas the initialization space is more flexible. Traditional ZI ($\gamma[B_0]=-\infty$) is merely an extreme case. This motivates us to reconsider the necessity of ZI and explore the potential benefits of NZI. Based on this insight, we conduct further analysis and evaluation, leading to two key findings:

  1. NZI can reduce the sensitivity of LoRA to suboptimal (i.e., smaller) learning rates.
  2. The purpose of traditional ZI, "fine-tuning from a pre-trained model", is not strictly necessary.

Notably, our motivation and claims are not in competition with prior NZI methods, such as LoRA-GA and PiSSA. Instead, our findings offer a theoretical foundation for their effectiveness and provide an explanation for the observed performance improvements. Fig.11 in the above link shows that the accuracy gains of LoRA-GA and PiSSA primarily stem from NZI. These points will be clarified in the revised version of the paper.

Q2: "the experimental comparison is not sufficient"

R2: Following your suggestion, we have added additional comparisons and combinations of LoRA-based methods with NZI. Specifically, two key aspects are considered:

  1. Ablation comparison with LoRA-GA and PiSSA. As shown in Fig.11, a large portion of the accuracy gains in PiSSA and LoRA-GA can be attributed to the use of NZI. The remaining gains are due to the fact that initialization values derived from pre-trained weights or gradients are more effective than random noise.

  2. Combination with LoRA+ and HydraLoRA. We introduced NZI for both LoRA+ (using a larger learning rate for the matrix $B$) and HydraLoRA [1] (an asymmetric LoRA that uses one matrix $A$ and multiple matrices $B$). As shown in Figs.12-13, NZI enhances the robustness of LoRA+ and HydraLoRA to variations in learning rate and improves accuracy. The relevant settings are detailed in the figure caption.

[1] HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning, NeurIPS 2024.

Q3: "the derivation heavily follows with previous work"

R3: The derivation used in this paper, including the notation of $\gamma$ and $\Theta$ and the definitions of stability and efficiency, is widely employed in infinite-width analysis and can be traced back to Yang et al. (NeurIPS 2021; see line 520 of our paper). Hayou et al. used these tools to explore the effects of learning rate (ICML 2024; line 475 in our paper) and initialization (NeurIPS 2024; line 479 in our paper) on zero-initialized LoRA. However, a fundamental question remains unaddressed: Why is ZI necessary? In this paper, we extend these general derivations to examine the potential advantages of NZI.

Notably, the contribution and innovation of this paper lie not in proposing a new derivation method, but in the following three aspects:

  1. Motivation for NZI. We provided a comprehensive solution set that ensures LoRA's stable and efficient learning. Building on this, we observe that the solution set includes both zero and NZI, prompting us to investigate the role of NZI further. This constitutes a key distinction between our work and previous studies, where the potential of pure NZI has often been overlooked. Our research bridges this gap and provides preliminary evidence supporting the feasibility of NZI.
  2. A new metric for LoRA's fine-tuning dynamics, "robustness", is proposed. We compare the fine-tuning dynamics of ZI and NZI and define robustness in terms of the sensitivity of these dynamics to the learning rate. The central argument of this paper is that NZI exhibits superior robustness compared to ZI. We believe that this metric is crucial for LoRA fine-tuning dynamics, offering a significant extension and enhancement to existing theoretical derivations and analyses.
  3. Challenging a long-held assumption. Our analysis and experiments further show that fine-tuning does not need to strictly start from a pre-trained model. This challenges the default practice in previous studies, such as LoRA, LoRA-GA, and PiSSA.

These contributions provide valuable guidance for understanding LoRA initialization and fine-tuning LLMs. They represent notable advancements and are substantial enough to be considered as independent work.

Reviewer Comment

Thanks to the authors for their response with additional experiments.

The current explanation of the motivation looks good to me. I suggest the authors mention it (perhaps at a high level) in the introduction.

I understand the authors' claim that the contribution "lie[s] not in proposing a new derivation method, but in the following three aspects". NZI has been studied in LoRA-GA, LoRA-Pro, as well as in (Ponkshe et al., 2024) with some experiment-driven designs. This paper claims some theoretical understanding/analysis of NZI. There is one paper posted on arXiv (https://arxiv.org/abs/2502.01235) after the ICML deadline which builds a mathematical analysis framework of LoRA under NZI. I suggest the authors discuss this work in the updated version.

Based on the above, I increase my score to 3.

Author Comment

Thank you again for reviewing our paper and for your valuable feedback! We're glad that the additional experiments and motivation clarification resolved your concerns. Your comments were extremely helpful, and we truly appreciate you increasing the score based on our rebuttal response.

As suggested, we will enhance the introduction (Section 1) with a more detailed discussion of our motivation to improve clarity for readers.

Additionally, we sincerely appreciate your suggestion regarding the latest advances in LoRA initialization, particularly the LoRA-One paper on arXiv. We will carefully study these works and incorporate a discussion in our revision to better contextualize our theoretical contributions in relation to these recent developments.

Review
Rating: 4

This paper investigates the impact of non-zero initialization on the fine-tuning dynamics of LoRA. Traditionally, in LoRA, one of the low-rank matrices, A or B, is initialized to zero to ensure fine-tuning starts from the pretrained model. However, this practice lacks theoretical justification. The authors theoretically analyze the effects of initializing both A and B to non-zero values. Their key findings are: (1) Non-zero initialization improves robustness to suboptimal learning rates; and (2) Fine-tuning does not need to strictly start from the pretrained model. The authors validate these findings through extensive experiments across various models and datasets. These results challenge the conventional practice of zero initialization in LoRA and highlight the benefits of non-zero initialization.

Questions for Authors

How do the definitions of stability and efficiency in this paper differ from those in previous studies (e.g., LoRA+)?

Claims and Evidence

The claims made in the paper are theoretically proven and experimentally verified.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are reasonable. This paper studies the initialization problem of LoRA, a common but previously overlooked aspect of fine-tuning LLMs. The authors systematically compare different initialization methods (zero vs. non-zero) using models, datasets, and code based on published work, ensuring reliability.

Theoretical Claims

The reviewer carefully checked the theoretical claims and corresponding proofs in the paper, including all results in Section 3 and the proofs in Appendices B and C. To the reviewer, the claims and proofs are reasonable. Although there are minor typos, for example, $\gamma[A_0]\leq\eta_A$ in Eq.(19) in Appendix B should be $\gamma[A_0]\leq\gamma[\eta_A]$, these do not affect the validity of the theoretical results.

Experimental Design and Analysis

The reviewer checked the experimental setup, results, and analysis. As described in the paper, the authors conducted experiments on three standard benchmarks. The experimental setups were based on published work and aligned with general practices in LoRA fine-tuning. The authors primarily analyzed different initialization settings and learning rates, which is consistent with the paper's motivation. The experimental analysis is also reasonable and supports the theoretical findings.

Supplementary Material

The reviewer checked Appendices B and C for proofs, and Appendix D for additional experimental results. No other supplementary material.

Relation to Existing Literature

Previous work (Hayou et al., 2024b) discussed the difference between initializing A or B to zero but did not explore the rationale behind zero initialization. This paper fills that gap, demonstrating that both A and B can be initialized to non-zero values. These findings provide theoretical support for related LoRA variants, such as PiSSA and LoRA-GA, and significantly contribute to LoRA research.

Missing Essential References

To the best of the reviewer's knowledge, all related work on LoRA initialization has been covered.

Other Strengths and Weaknesses

Strengths: This paper challenges the traditional LoRA initialization method, and studies the significance of non-zero initialization from the perspective of robustness to learning rate. The method is simple yet insightful.

It also fundamentally overturns the purpose of traditional zero initialization (fine-tuning from pre-trained models). Experiments show that non-zero initialization with appropriate variance does not affect fine-tuning performance, indicating that fine-tuning does not need to start strictly from a pretrained model.

Weaknesses: A minor shortcoming is the lack of discussion on how the definitions of stability and efficiency in this paper differ from those in previous studies (e.g., LoRA+). The authors are encouraged to clarify this distinction in the appendix.

Other Comments or Suggestions

Typo in line 62: "raise" should be "raises." Incorrect reference to Llama 3 in line 799. Typo in Eq.(19): $\gamma[A_0]\leq\eta_A$ should be $\gamma[A_0]\leq\gamma[\eta_A]$.

Author Response

Hi Reviewer rPo6:

Thank you for your detailed and insightful comments. Below, we provide responses to each point individually. Additional experimental results can be found in https://anonymous.4open.science/r/nzlora_rebuttal-7D3E.

Q1: "typos in lines 62 and 799, and Eq (19)"

R1: Thank you for your thorough review. We will correct the identified typos and carefully re-examine the entire manuscript.

Q2: "how the definitions of stability and efficiency differ from those in previous studies"

R2: The only difference between our definitions and those in previous studies is that our stability definition is slightly less restrictive. Specifically, we do not consider interval stability, i.e., $Z_A=AZ=\Theta(1)$, where $Z$ is the input of the LoRA layer. Instead, we focus on the stability of the final output of LoRA, $Z_B=BAZ=\Theta(1)$. Eq.(5) outlines the conditions that must be met by the initialization and learning rate when interval stability is excluded. A detailed discussion of interval stability is provided in Appendix B.2. Given that other reviewers have raised concerns regarding interval stability, we summarize the key points related to this topic in Q3.

Q3: "interval stability"

R3: In this paper, stability is defined as $Z_B=BAZ=\Theta(1)$, where $Z$ represents the input to the LoRA layer. The condition $Z_B=\Theta(1)$ ensures the stability of LoRA's final output, while interval stability is defined as $Z_A=AZ=\Theta(1)$, which indicates the stability of LoRA's intermediate results. In Section 3, we present the solution set for stable and efficient learning without considering interval stability, as shown in Eq.(5) or as follows:

$\gamma[\eta_A]+\gamma[\eta_B]=-1$, $\gamma[A_0] \leq \gamma[\eta_A]$, and $\gamma[B_0] \leq \gamma[\eta_B]$.

When interval stability is considered (i.e., $Z_A=\Theta(1)$), an additional constraint is imposed: $\gamma[\eta_A]=-1$. Consequently, the solution set of the learning rate and initialization becomes Eq.(21) in Appendix B.2:

$\gamma[A_0] \leq \gamma[\eta_A]=-1$ and $\gamma[B_0] \leq \gamma[\eta_B]=0$.

Two important points should be noted here:

  1. $\gamma[\eta_A]=-1$ and $\gamma[\eta_B]=0$ are the key findings of LoRA+, which suggest using a larger learning rate for the matrix $B$ in practical applications.

  2. Regardless of the optimal values of $\gamma[\eta_A]$ and $\gamma[\eta_B]$, the conditions $\gamma[A_0] \leq \gamma[\eta_A]$ and $\gamma[B_0] \leq \gamma[\eta_B]$ must always be satisfied to ensure LoRA's stable and efficient learning. When both "$\leq$" become "$=$", the maximum robustness to the learning rate is achieved. Therefore, non-zero initialization can also enhance LoRA+'s robustness to the learning rate, as shown in Fig.12 in the above link. A worked instance of these conditions is given below.
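As a worked instance of these conditions (an illustrative restatement under the common default of a uniform learning rate for both matrices): if $\gamma[\eta_A]=\gamma[\eta_B]$, the constraint $\gamma[\eta_A]+\gamma[\eta_B]=-1$ forces

$$\gamma[\eta_A]=\gamma[\eta_B]=-\tfrac{1}{2},$$

and maximum robustness is reached when the inequalities are tight, i.e., $\gamma[A_0]=\gamma[B_0]=-\tfrac{1}{2}$. If interval stability is additionally imposed, the tight choice becomes $\gamma[A_0]=\gamma[\eta_A]=-1$ and $\gamma[B_0]=\gamma[\eta_B]=0$ instead.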

Review
Rating: 3

This paper investigates the impact of non-zero initialization in Low-Rank Adaptation (LoRA) fine-tuning, challenging the conventional practice of initializing one of the LoRA matrices (A or B) to zero. Through theoretical analysis and empirical validation, the authors demonstrate that simultaneously initializing A and B to non-zero values (Init[AB]) enhances LoRA's robustness to suboptimal learning rates, particularly smaller ones, which are common due to learning rate decay. The study finds that while non-zero initialization introduces slight noise to the pre-trained model, it does not degrade fine-tuning performance as long as appropriate initialization variances are used. Extensive experiments across models and datasets confirm that non-zero initialization improves accuracy, stability, and convergence speed, making it practical for LoRA-based fine-tuning.

Questions for Authors

See my questions in the analysis part.

Claims and Evidence

Well-supported claims:

  1. Non-zero initialization improves LoRA's robustness to suboptimal learning rates: This claim is supported by theoretical analysis and empirical evidence.
  2. Fine-tuning does not need to strictly start from the pre-trained model: Experiments and theoretical evidence are provided.
  3. Non-zero initialization achieves superior performance compared to zero initialization, particularly at smaller learning rates: The heatmaps and performance tables demonstrate consistent improvements when using Init[AB] instead of Init[A], especially in low learning rate scenarios.

Claims that need more evidence:

  1. Non-zero initialization improves performance in all cases: There is clearly a dependence on the learning rate for different tasks, as can be seen in Tables 2 and 3.
  2. What are the limits on the variance of the noise that can be used in the Init[AB] case?

Methods and Evaluation Criteria

The authors test their method with Llama 3-8B and T5 models on the GLUE and arithmetic reasoning benchmarks. It would be interesting to check how their method works in other fine-tuning settings, such as instruction tuning. Also, it is not clear how their method works with variants of LoRA such as Asymmetric LoRA, LoRA+, QLoRA, etc.

Theoretical Claims

No

Experimental Design and Analysis

Yes, I examined the soundness and validity of the experimental designs and analyses in the paper

  1. The paper evaluates natural language understanding (GLUE benchmark) and natural language generation (commonsense & arithmetic reasoning), ensuring broad applicability.
  2. The paper uses multiple model architectures: T5-Base (encoder-decoder) and Llama 3-8B (decoder-only transformer)
  3. The study systematically varies the learning rate (η) and initialization variance (β), allowing a detailed exploration of their effects.
  4. The heatmaps and accuracy tables provide clear evidence that non-zero initialization (Init[AB]) improves performance, particularly at lower learning rates.
  5. The toy model experiment provides intuitive validation that non-zero initialization reduces sensitivity to learning rate choices.

There are areas where the text can improve with additional details

  1. The reported accuracy differences (e.g., between Init[A] and Init[AB]) are sometimes small (e.g., ~1%).
  2. No confidence intervals or standard deviations are provided
  3. It’s unclear if different ranks or scaling factors would affect the relative performance of zero vs. non-zero initialization.
  4. The initialization variance (β) is tested in discrete steps (e.g., {1, 2, 4, 8, 16}), but there’s no justification for why these values were chosen.
  5. How does their method interact with newer variants of LoRA such as LoRA+, Asymmetric LoRA, etc.?

Supplementary Material

Yes, sections A, B, D, E

Relation to Existing Literature

This paper builds upon existing research in LoRA fine-tuning, neural network scaling, and weight initialization, challenging the conventional zero-initialization approach in LoRA. While prior work (Hu et al., 2022; Hayou et al., 2024a) focused on optimizing learning rates and rank selection, this study demonstrates that initializing both LoRA matrices (A and B) to non-zero values (Init[AB]) enhances robustness to suboptimal learning rates. By applying infinite-width scaling theory, it formalizes conditions for stable and efficient fine-tuning, extending insights from Kaiming initialization (He et al., 2015) to LoRA. Unlike recent empirical methods that use quantization errors (LoftQ), SVD (PiSSA), or gradient-based initialization (LoRA-GA), this paper provides a theoretical foundation for non-zero initialization and validates it with experiments across T5-Base, Llama 3-8B, and multiple benchmarks. These findings refine LoRA fine-tuning dynamics without additional computational cost, offering a practical and theoretically justified improvement.

Missing Essential References

The paper focuses on LoRA and the surrounding methods while not shedding light on other PEFT methods, such as BitFit [Zaken et al., 2022] and Adapters [Houlsby et al., 2019]. Adding these papers can help the reader understand the landscape better.

Other Strengths and Weaknesses

See my comments above

Other Comments or Suggestions

See my comments above

Author Response

Hi Reviewer zK3d:

Thank you for your detailed and insightful comments. Below, we provide responses to each point individually. Additional experimental results can be found in https://anonymous.4open.science/r/nzlora_rebuttal-7D3E.

Q1: "accuracy's dependence on the learning rate in Tables 2,3, and accuracy differences are sometimes small"

R1: Our analysis reveals that non-zero initialization can reduce the adverse effects of suboptimal learning rates on LoRA performance. This effect is particularly evident when the learning rate is below its optimal value. However, when the learning rate approaches its optimal value, the performance improvement from non-zero initialization becomes less significant.

Q2: "no justification for why β1,2,4,8,16\beta\in\\{1,2,4,8,16\\}"

R2: In our experiments, we set the initialization variance of matrices $A$ and $B$ as $\delta_A^2=\delta_B^2=(\beta \delta_k)^2$, where $\delta_k^2=1/n$ is the variance used in Kaiming initialization (the default setting of LoRA). Notably, $\beta$ does not strictly represent a variance, but rather a scaling factor applied to $\delta_k$.

Our analysis indicates that robustness improves as $\gamma[A_0]$ and $\gamma[B_0]$ approach $-1/2$. To explore this, we begin with standard Kaiming initialization ($\beta=1$, corresponding to $\gamma[A_0]=\gamma[B_0]=-1$) and systematically increase the variance ($\beta\in\{2,4,8,16\}$) to study its effects. Our experimental results (Figs.4 and 6 in the original paper) confirm that, within a certain range, increasing the initialization variance enhances robustness to learning rate variations and leads to better accuracy.
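To make the scale concrete (an illustrative calculation; the width value is hypothetical and not taken from the paper): at a hidden width of $n=4096$, the Kaiming standard deviation is $\delta_k=\sqrt{1/n}=1/64\approx 0.0156$, so the sampled entries of $A_0$ and $B_0$ have standard deviation

$$\beta\,\delta_k \approx 0.0156,\ 0.0313,\ 0.0625,\ 0.125,\ 0.25 \quad \text{for} \quad \beta\in\{1,2,4,8,16\},$$

i.e., the sweep spans slightly more than an order of magnitude of initialization noise.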

Q3: "limits on the variance in Init[AB]"

R3: The theoretical limits of the initialization variance are $\gamma[A_0] \leq -1/2$ and $\gamma[B_0] \leq -1/2$. However, this condition only describes the asymptotic behavior of the initialization variance as $n \to \infty$, rather than providing a specific value. To further investigate this, we performed ablation experiments on the variance limits of Llama 3-8B and T5-Base models. As shown in Fig.16, these limits vary across models and datasets (e.g., Init[AB]-Init[A] with $\beta=4$ is generally less than 0 in the commonsense reasoning task). However, the variance associated with Kaiming initialization (i.e., $\beta = 1$) is generally effective, yielding near-optimal accuracy.

Q4: "different ranks or scaling fators"

R4: We conducted ablation experiments with varying ranks and scaling factors. As shown in Fig.14 in the above link, adjusting these hyperparameters does not affect the improvement gained through non-zero initialization.

Q5: "different fine-tuning settings"

R5: Following your suggestion, we conducted experiments using an instruction tuning dataset, databricks-dolly-15k, and evaluated performance on the MMLU task. As shown in Fig.15, non-zero initialization enhances LoRA's robustness to small learning rates in the instruction tuning task, thereby improving accuracy. Notably, Llama 3-8B exhibits limited accuracy on MMLU, so the improvement due to non-zero initialization is less pronounced. However, the trend is still observable.

Q6: "LoRA variants"

R6: To address this question, we evaluated the impact of non-zero initialization on LoRA+ (using larger learning rates for the matrices $B$) and HydraLoRA [1], an asymmetric LoRA variant (using one matrix $A$ with multiple matrices $B$). The results are presented in Figs.12-13 in the above link.

  1. We tested LoRA+ on the GLUE and arithmetic reasoning tasks. The results show that appropriately increasing the learning rate of $B$ can indeed improve model accuracy. Most importantly, for the same learning rate, non-zero initialization significantly enhances the accuracy of LoRA+.
  2. We tested HydraLoRA on arithmetic reasoning tasks. To ensure that the non-zero initialized $AB$ terms could be subtracted from the pre-trained weights, we use the same initialization for the different $B$ matrices within a HydraLoRA layer. As shown in Fig.13, non-zero initialization also improves the robustness of HydraLoRA to the learning rate.

[1] HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning, NeurIPS 2024.

Q7: "standard deviations"

R7: The standard deviation for the GLUE datasets ranges from 0.01 to 0.4, while for the commonsense and arithmetic reasoning tasks it spans 0.2 to 0.4. In our experiments, smaller learning rates tend to converge less effectively and exhibit higher standard deviations. However, this effect is minimal compared to the performance gains achieved by non-zero initialization. We will include the full standard deviation results in the revised paper.

Q8: "surrounding methods such as BitFit and Adapters"

R8: Thank you for your suggestion. We will add a discussion of relevant PEFT methods in the revised paper.

Review
Rating: 4

This paper considers scaling of hyperparameters for LoRA finetuning from an infinite width perspective following [1, 2]. The key difference compared to past works is that a non-zero random initialization of both the B and A adapter matrices is considered. The initialization can optionally be subtracted from the pretrained weights to ensure the overall layer starts at the pretrained weights. The key observations are that the new initialization scheme allows "robustness" in addition to previous desiderata such as stability and efficiency. Informally robustness refers to a lack of sensitivity in the scaling of certain quantities to learning rate hyperparameters. The authors demonstrate on a variety of tasks the superior performance of the new scheme and improved robustness to suboptimal learning rates.

[1] LoRA+: Efficient Low Rank Adaptation of Large Models - Soufiane Hayou, Nikhil Ghosh, Bin Yu
[2] The Impact of Initialization on LoRA Finetuning Dynamics - Soufiane Hayou, Nikhil Ghosh, Bin Yu

Questions for Authors

Can the analysis say anything useful about PiSSA?

To clarify: is $\eta_A = \eta_B$ needed for maximum robustness? Also, in this case, is internal stability not achieved?

If we use LoRA+ and non-zero initialization will we do even better?

Using Init[AB] can reduce learning rate sensitivity, but will it increase initialization variance sensitivity compared to Init[A]?

Claims and Evidence

Yes the evidence is clear and convincing.

Methods and Evaluation Criteria

Yes the methods and evaluation criteria make sense.

Theoretical Claims

Yes the proofs appear correct.

Experimental Design and Analysis

Yes the experimental designs are solid.

Supplementary Material

Just briefly passed over the supplement.

Relation to Existing Literature

The paper is important for understanding the optimal setting of hyperparameters for LoRA finetuning, a popular parameter efficient finetuning method. The paper characterizes the scaling of certain quantities in terms of width and imposes various desiderata for finetuning akin to a variety of works such as [1, 2, 3]. Importantly this work goes beyond previous works by considering a non-zero initialization of LoRA. In particular they show that finetuning can be successful even when the non-zero initialization is not subtracted from the pretrained weights as long as the initialization variance is not too large, demonstrating robustness of the finetuning procedure to a noisy initialization. Furthermore, the non-zero initialization has certain advantages relative to other initializations including decreased sensitivity to learning rate hyperparameters and improved empirical performance.

[1] LoRA+: Efficient Low Rank Adaptation of Large Models - Soufiane Hayou, Nikhil Ghosh, Bin Yu
[2] The Impact of Initialization on LoRA Finetuning Dynamics - Soufiane Hayou, Nikhil Ghosh, Bin Yu
[3] Feature Learning in Infinite-Width Neural Networks - Greg Yang, Edward J. Hu

Missing Essential References

None.

Other Strengths and Weaknesses

The strength of this paper is that it expands the practical consideration of initializations for LoRA and offers evidence for the superiority of a new initialization. Empirically this initialization appears to be non-trivially better than the standard practice and is trivial to implement.

Other Comments or Suggestions

Typos: In Appendix Section B.2, the Hu et al. reference is incorrect. The last line of Eq.(20) should be $Z_A^{t-1}$, not $Z_B^{t-1}$.

The presentation in Sections 3.2 and 3.3 is a bit hard to parse at first, as is the definition and intent of "robustness". I don't think it is about perturbing $\gamma[\eta]$ (which doesn't make much sense) but really about perturbing $\eta$; in certain scalings the perturbation has a dominant quadratic dependence on $\eta$.

Also in the grid sweeps the optimum is in the top right corner. Can you extend beyond that to check that increasing the hyperparameters further does not bring improvements?

Author Response

Hi Reviewer uk4U:

Thank you for your detailed and insightful comments. Below, we provide responses to each point individually. Additional experimental results can be found in https://anonymous.4open.science/r/nzlora_rebuttal-7D3E.

Q1: "typos"

R1: Thanks again for catching these! All typos will be fixed in the revision.

Q2: "perturbing γ[η]\gamma[\eta] or η\eta"

R2: This question is essential for understanding our analysis. To improve clarity, we restate our infinite-width analysis as follows:

  1. In this paper, we focus on the asymptotic behavior of the learning rate, $\gamma[\eta]$, as the network width $n \to \infty$, rather than its exact value. The $\gamma$-operator is defined such that $\eta = \Theta(n^{\gamma[\eta]}) \approx c \cdot n^{\gamma[\eta]}$, where $c > 0$ is a constant and lower-order terms are neglected.
  2. As $n \to \infty$, the term $n^{\gamma[\eta]}$ dominates, making $\gamma[\eta]$ the key factor determining the asymptotic behavior of $\eta$. While the constant $c$ is important for exact values, it doesn't influence the asymptotic scaling behavior. Ignoring the influence of $c$, perturbations to $\gamma[\eta]$ and $\eta$ are effectively equivalent.
  3. Thus, we focus on how perturbations to $\gamma[\eta]$ affect the fine-tuning dynamics, excluding the constant $c$ in $\Theta$. Note that we analyze $\gamma[\eta]$ to guide learning rate and initialization choices, not to compute exact values (which depend on $c$). A toy numerical illustration is given below.
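To illustrate the $\gamma$-operator with a toy example (the numbers are ours, for illustration only): if the learning rate scales with width as

$$\eta = 3\,n^{-1/2}, \qquad \text{then} \qquad \gamma[\eta]=-\tfrac{1}{2},$$

and changing the constant (say from $3$ to $1.5$) rescales $\eta$ but leaves $\gamma[\eta]$ unchanged, whereas multiplying $\eta$ by an additional factor of $n^{-1/2}$ shifts $\gamma[\eta]$ to $-1$. Perturbations of the first kind are absorbed into the constant $c$; only the second kind change the exponent analyzed here.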

We appreciate the opportunity to clarify our analytical framework and will explicitly incorporate these refinements in the revised paper.

Q3: "extend beyond the top right corner"

R3: Following your suggestion, we have expanded the upper right corner of the heatmap, and the updated results are presented in Fig.16 in the above link. The results show that increasing the hyperparameters further does not lead to a substantial improvement in accuracy.

Q4: "insights about PiSSA"

R4: Our analysis suggests that PiSSA is more robust to variations in the learning rate, owing to its use of non-zero initialized LoRA, which is achieved through Truncated SVD on pre-trained weights. Fig.11 in the above link shows that a significant portion of the improvement in PiSSA's accuracy can be attributed to non-zero initialization. The remaining improvement is due to the fact that the initialization values derived from the pre-trained weights are more effective than random noise. A similar trend is observed in LoRA-GA, which employs gradients for non-zero initialization. Please refer to Fig.11 for further details.

Q5: "clarify the need of \eta_A = \eta_B, and the internal stability"

R5: In fact, the condition $\eta_A = \eta_B$ is not necessary for achieving maximum robustness. The solution set in Eq.(5) indicates that stable and efficient learning can be achieved as long as $\gamma[\eta_A] + \gamma[\eta_B] = -1$, $\gamma[A_0] \leq \gamma[\eta_A]$, and $\gamma[B_0] \leq \gamma[\eta_B]$. Under this condition, when both "$\leq$" become "$=$", maximum robustness is attained. By default, we set $\eta_A = \eta_B$ since fine-tuning typically employs a uniform learning rate for all LoRA weights. However, achieving internal stability further requires $\gamma[\eta_A] = -1$ and $\gamma[\eta_B] = 0$, which are the core propositions of LoRA+. Due to space limitations, further details on internal stability can be found in R3 to Reviewer rPo6 or in the analysis presented in Appendix B.2.

Q6: "LoRA+ with non-zero initialization"

R6: We integrated LoRA+ with non-zero initialization. As shown in Fig.12 in the above link, non-zero initialization also enhances the robustness of LoRA+ to variations in the learning rate, leading to improved model accuracy.

Q7: "initialization variance sensitivity of Init[AB]"

R7: Let's first explain the meaning of each initialization method:

Init[A]: $A_0\sim \mathcal{N}(0, \delta^2)$, $B_0=0$,

Init[AB]: $A_0\sim \mathcal{N}(0, \delta^2)$, $B_0\sim \mathcal{N}(0, \delta^2)$, with $\frac{\alpha}{r}A_0B_0$ subtracted from the pre-trained weights.

Init[AB+]: Same as Init[AB], but without the subtraction process.
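Put differently (a restatement added for clarity; writing $W_{\text{pre}}$ for the pre-trained weight, a symbol not used above), the weight the model effectively starts from is

$$W^{(0)}_{\text{eff}}=\begin{cases}W_{\text{pre}}, & \text{Init[A] (since } B_0=0\text{) and Init[AB] (initial product subtracted)},\\ W_{\text{pre}}+\tfrac{\alpha}{r}A_0B_0, & \text{Init[AB+]},\end{cases}$$

so only Init[AB+] starts from a noisy version of the pre-trained weight, which is exactly the setting behind the claim that fine-tuning need not start strictly from the pre-trained model.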

First, we emphasize that Init[A] itself exhibits sensitivity to variance. As shown in Eq.(6), stable and efficient learning requires the variance of $A_0$ to satisfy $\gamma[A_0] \leq -\frac{1}{2}$. In Init[AB], $B_0$ uses the same variance as $A_0$ and only needs to satisfy the same condition (i.e., $\gamma[B_0]\leq-1/2$). Therefore, Init[AB] does not introduce additional sensitivity to initialization variance but instead enhances the robustness of LoRA to $\eta_B$.

Notably, if Init[AB+] is used, a larger initialization variance results in greater noise, which negatively impacts performance. However, this issue arises due to the absence of noise subtraction in Init[AB+], rather than a fundamental limitation of Init[AB].

Final Decision

This paper received four effective reviews, and all of them are positive. Overall, the paper is of good quality and should be accepted.