LoRA vs Full Fine-tuning: An Illusion of Equivalence
Summary
Review and Discussion
This work introduces the concept of intruder dimensions that arise during fine-tuning of language models via low-rank adaptation (LoRA). Intruder dimensions are directions in the parameter space that disrupt the original structure of the pre-trained weights and strongly correlate with forgetting of pre-trained knowledge. Furthermore, the authors point out that fine-tuning via LoRA does not always result in less forgetting; the outcome strongly depends on the number of introduced intruder dimensions. Via causal intervention, the authors verify that scaling down the singular values associated with intruder dimensions recovers pre-trained knowledge while maintaining downstream performance.
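For concreteness, the detection procedure described above (comparing singular vectors of the fine-tuned weights against the pre-trained ones via cosine similarity) can be sketched as follows. This is a minimal illustration rather than the authors' actual code; the top-`k` cutoff and threshold `epsilon` are assumed parameters:

```python
import numpy as np

def count_intruder_dimensions(W_pre, W_tuned, k=10, epsilon=0.5):
    """Count top-k singular vectors of the tuned matrix whose maximum
    absolute cosine similarity to every pre-trained singular vector
    stays below epsilon (i.e., candidate intruder dimensions)."""
    U_pre, _, _ = np.linalg.svd(W_pre)
    U_tuned, _, _ = np.linalg.svd(W_tuned)
    intruders = 0
    for j in range(k):
        # Singular vectors are unit norm, so dot products are cosines.
        sims = np.abs(U_pre.T @ U_tuned[:, j])
        if sims.max() < epsilon:
            intruders += 1
    return intruders
```

Under this sketch, an unchanged matrix should yield zero intruders, while a large random rank-1 update (analogous to LoRA with r = 1) should introduce one.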
Strengths and Weaknesses
Strengths
The paper is well written and most of the claims are well supported.
Definition of a new concept that results from fine-tuning with low-rank adaptation, which is very relevant.
The experiments are well designed with interesting results and takeaways.
The key takeaways are based on multiple rigorous experiments, not just single seeds/experiments.
Weaknesses
I am generally very positive about this work; the following weaknesses are minor issues.
Unsupported claims
There are a few claims that are not entirely supported:
- line 166: "LoRA introduces new singular vectors that have a large contribution to the norm of the updated parameter matrix." - As far as I can see, there is no support for a change in norm, only in directions, since cosine similarity is used.
- line 210: "LoRA consistently has more intruder dimensions than full fine-tuning" - not entirely correct, see Appendix J Figure 15: LoRA r=64 has fewer intruder dimensions than full fine-tuning
- line 244: "This extends the finding that LoRA learns less " - I assume this is a typo and should be "forgets less" rather than "learns less", as there is no support for the learning less argument as far as I can see
Different distance measures
Currently, the authors only consider cosine similarity as the distance measure between singular vectors. It would be interesting to see whether intruder dimensions can be identified via different distance measures, or whether it is really only the difference in directions that enables identifying them.
Potential additional experiments
It is very interesting that initialization has a huge effect on intruder dimensions (Figure 23). It would be interesting to see how the number of intruder dimensions varies for different initialization schemes, specifically for data-driven initialization, e.g. [1,2,3].
[1] Wang et al., LoRA-GA: Low-Rank Adaptation with Gradient Approximation, NeurIPS 2024
[2] Paischer et al., One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation, NeurIPS 2024 ENLSP Workshop
[3] Yang et al., CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning, NeurIPS 2024
Another very interesting point would be showing where exactly the intruder dimensions appear. Are they only apparent in certain projection matrices (Q/K/V, etc.) and layers, and can they be related to the implicit biases introduced in [4]?
[4] Sun et al., Massive Activations in Large Language Models, COLM 2024
Finally, it would be interesting to see how the number of intruder dimensions changes with larger models.
Presentation
Generally, I recommend increasing the font size of the plots; they are partly a bit hard to read without zooming in a lot. It would be cool to have a zoomed-in version of the top-left part of Figure 1, right, to better identify intruder dimensions (same for Figure 2b). There are typos in lines 188 and 202.
Questions
See weaknesses.
Limitations
Some limitations are in the appendix, but they could be a bit more elaborate. To accommodate them in the main body of the paper, I suggest moving Algorithm 1 to the appendix; it does not add too much value.
Final Justification
My initial rating of the paper was already positive, and all my raised points have been addressed during the rebuttal. I see this work having an impact in understanding and improving LoRA-style fine-tuning; therefore, I stand by my original rating.
Formatting Issues
No major issues.
We thank you for your thoughtful review and are grateful for your positive assessment of our work. We have incorporated your suggestions, including fixing typos, making presentation changes, and updating claims, into the updated manuscript. We provide responses and more context where required below:
There are a few claims that are not entirely supported: line 166: "LoRA introduces new singular vectors that have a large contribution to the norm of the updated parameter matrix." - As far as I can see there is no support for a change in norm, only in directions because of using cosine similarity.
Thank you for pointing this out. While Figure 5 shows an increase in the singular value corresponding to the intruder dimension—suggesting a potential increase in the matrix norm—we currently don't report the norm itself. We will include this statistic in future revisions to more concretely support the claim.
Different distance measures Currently, the authors only consider cosine similarity as distance measures between singular vectors. It would be interesting if intruder dimensions could be identified via different distance measures, or whether it is really only the difference in directions that enables identifying them.
Thank you for the suggestion; it's a valuable point. While we focus on cosine similarity, we don’t claim it's the only viable measure. We chose it because, in the context of SVD, comparing singular vectors via cosine similarity offers a natural and interpretable way to track changes in the principal directions of the weight matrix. Since these directions define how the matrix transforms input, shifts in them are meaningful. That said, exploring alternative distance measures is an interesting direction for future work and could offer complementary insights.
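One concrete observation on this point (an aside of ours, not a claim from the paper): because singular vectors are unit norm, the Euclidean distance between them is a monotone function of their cosine similarity, so that particular alternative measure would flag the same intruder dimensions; a genuinely different signal would require, e.g., subspace-level distances.

```python
import numpy as np

# For unit-norm vectors u, v: ||u - v||^2 = 2 * (1 - u.v),
# so ranking singular-vector pairs by Euclidean distance is equivalent
# to ranking them by cosine similarity.
rng = np.random.default_rng(0)
u = rng.standard_normal(50); u /= np.linalg.norm(u)
v = rng.standard_normal(50); v /= np.linalg.norm(v)

assert np.isclose(np.linalg.norm(u - v) ** 2, 2.0 * (1.0 - u @ v))
```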
Another very interesting point would be showing where exactly the intruder dimensions appear in. Are they only apparent in certain projection matrices (Q/K/V, etc.) and layers and can they be related with implicit biases as introduced in [4]?
Across our experiments, intruder dimensions appear in all weight matrices, with no systematic difference in their number between MLP and attention matrices. We will add a section outlining these details to the appendix.
It would be interesting how the number of intruder dimensions changes with larger models.
We thank the reviewer for their extensive suggestions of additional experiments. Due to the limited length of the rebuttal, we will leave these experiments to future work.
Thank you for answering all of my raised points. Since I was already positive about the paper at the time of submission, I will retain my positive rating.
This paper thoroughly investigates the structural differences between weight updates from LoRA and full fine-tuning through the lens of SVD, and introduces a new concept, intruder dimensions, to quantify these differences. Extensive empirical studies and discussions are provided to better exemplify the properties and behaviors of this new concept.
Strengths and Weaknesses
Strengths
- The paper is well-written and clearly structured, with a logical and straightforward presentation.
- The concept of intruder dimension is well-defined and intuitively aligns with the mathematical underpinnings of representation learning, particularly in the context of SVD, where directions with higher variance tend to carry more informative content.
Weaknesses
-
The core concern is that the paper primarily presents extensive empirical comparisons between LoRA and full fine-tuning to illustrate the emergence and impact of intruder dimensions, without proposing new methods or actionable improvements based on these insights. As per the NeurIPS regulations, it may be better suited for the Datasets & Benchmarks track than the main conference.
-
Despite the extensive empirical evidence for the impact of intruder dimension, the theoretical explanation for why they appear in fine-tuning LoRA is still lacking. Mathematically, it is expected that constrained updates (e.g., via LoRA) tend to concentrate on directions associated with the largest singular values, while full fine-tuning distributes updates more evenly with sufficiently large update space. Without rigorous theoretical support, the introduction of the intruder dimension appears to be a restatement of this known behavior within the specific framing of LoRA.
-
I've several concerns regarding the empirical analyses:
- Regarding the correlation between the number of intruder dimensions and catastrophic forgetting (Lines 256-263): the experiments using a larger learning rate show that LoRA is not well optimized, as evidenced by a significant performance gap. In such cases, the increase in intruder dimensions could be a byproduct of suboptimal training rather than an inherent issue. For a well-optimized LoRA model (e.g., with lr=1e-4), the observation in [1] (LoRA learns less but forgets less) still holds. Thus, conclusions based on poorly trained models are less convincing.
- In Appendix H.2, the observation that performance drops when learned knowledge is perturbed and LoRA's preserved knowledge is exaggerated is unsurprising. This behavior is consistent with general deep learning dynamics and does not convincingly support the claim that intruder dimensions degrade OOD performance. The argument feels more like an expected consequence of disrupting well-internalized features.
- In Appendix I, was the hyperparameter search over learning rates conducted to ensure that LoRA with different alpha converged to similar test accuracy? If not, it's trivial that improper/insufficient learning naturally leads to low-rank solutions.
[1] Dan, Biderman, et al., "LoRA Learns Less and Forgets Less," TMLR 2024.
Questions
See Weaknesses.
Limitations
Yes.
Final Justification
Overall Assessment:
- I thank the authors for their clarifications, and I find this submission to be in alignment with the guidelines of the main conference.
- As noted by reviewers BfKg and GjPE, the exploration of an “intruder dimension” for LoRA seems to have limited practicality in real-world deployments, considering that the primary value of PEFT lies in enabling the rapid application of LLMs in industrial scenarios.
- I appreciate the authors’ substantial effort and the detailed clarifications provided.
After considering the discussions throughout the rebuttal phase, I view this paper as being around the borderline. I have accordingly adjusted my rating to 4, while noting that this does not imply a definitive decision regarding acceptance.
Formatting Issues
NA.
We thank the reviewer for their thoughtful review. We provide responses to their review below.
The core concern is that the paper primarily presents extensive empirical comparisons between LoRA and full fine-tuning to illustrate the emergence and impact of intruder dimensions, without proposing new methods or actionable improvements based on these insights. As per the NeurIPS regulation, it may be better suited for the Datasets & Benchmarks track rather than the main conference.
We are confused by your suggestion to change tracks. First, under the main track call for papers, the website states that NeurIPS "encourage in-depth analysis of existing methods that provide new insights in terms of their limitations or behaviour beyond the scope of the original work". Moreover, this paper provides neither a dataset nor a benchmark. Further, we do indeed provide actionable insights: we show several ways that the number of intruder dimensions can be reduced, including certain initializations (Fig. 10), lower learning rates (Fig. 7), and the setting described in Section N. We even identify a simple post-training intervention motivated by Section 5: scaling down the magnitude of intruder dimensions reduces forgetting while retaining nearly identical adaptation performance. These are concrete prescriptions that can be used to train better LoRA models that exhibit reduced forgetting and better out-of-distribution generalization.
Despite the extensive empirical evidence for the impact of intruder dimension, the theoretical explanation for why they appear in fine-tuning LoRA is still lacking.
We do provide mathematical justification for the occurrence of intruder dimensions in Section B.1. We showed that adding the outer product of random vectors to a weight matrix introduces an intruder dimension. This is analogous to the matrix product BA in LoRA when the rank is 1, and implies that intruder dimensions occur because the B and A vectors are uncorrelated with the columns/rows of the weight matrix W_0. Further, we showed empirically that training only the B matrix, while freezing the A matrix with all singular values equal to 1, reduces the amplification of the singular values caused by the matrix product and therefore reduces the number of high-ranking intruder dimensions, aligning with our mathematical intuition.
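To make this concrete, here is a small numerical sketch of that Section B.1 intuition (the dimension and update scale are illustrative choices, not values from the paper): a sufficiently large rank-1 update built from random unit vectors b and a makes b, up to noise, the new top left singular vector, and b has low overlap with every singular vector of the original matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
W0 = rng.standard_normal((d, d))  # stand-in for a pre-trained weight matrix
b = rng.standard_normal(d); b /= np.linalg.norm(b)
a = rng.standard_normal(d); a /= np.linalg.norm(a)

# Rank-1 update, analogous to the product BA in LoRA with r = 1,
# scaled large enough to dominate W0's spectrum.
W1 = W0 + 200.0 * np.outer(b, a)

U0, _, _ = np.linalg.svd(W0)
U1, _, _ = np.linalg.svd(W1)
top = U1[:, 0]  # highest-ranking left singular vector of the updated matrix

align_with_b = abs(top @ b)                 # high: the new direction is b
max_overlap_pre = np.abs(U0.T @ top).max()  # low: dissimilar to all pre-existing directions
```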
Mathematically, it is expected that constrained updates (e.g., via LoRA) tend to concentrate on directions associated with the largest singular values, while full fine-tuning distributes updates more evenly with sufficiently large update space. Without rigorous theoretical support, the introduction of the intruder dimension appears to be a restatement of this known behavior within the specific framing of LoRA.
What you describe as expected behavior actually doesn't align with our observations. Rather than changes occurring primarily along the existing high-ranking singular vectors, what happens is that a few new high-ranking singular vectors emerge, while the original ones remain largely unchanged or only slightly shifted. This suggests that the dynamics introduced by constrained updates like LoRA are more nuanced than simply concentrating on pre-existing dominant directions as you describe, whereas full fine-tuning distributes updates across all singular vectors.
It is important to note that, before our work, it was not expected that LoRA would update a weight matrix in a structurally different manner than full fine-tuning. Rather, it was expected that LoRA approximates full fine-tuning (https://arxiv.org/pdf/2106.09685), which would result in LoRA's update looking like an approximation of the fully fine-tuned weight update concentrated on the top existing singular values, as you mention.
Regarding the correlation between the number of intruder dimension and catastrophic forgetting (Lines 256-263), the experiments using a larger learning rate show that LoRA is not well optimized, as evidenced by a significant performance gap. In such cases, the increase in intruder dimensions could be a byproduct of suboptimal training rather than an inherent issue. For a well-optimized LoRA model (e.g., with lr=1e-4), the observation in [1] (LoRA learns less but forgets less) still holds. Thus, conclusions based on poorly trained models are less convincing.
We are concerned the reviewer is misinterpreting this section. All the models are equally well trained (as measured by equal validation and train loss) on the adaptation distribution, which are the typical measures used when fine-tuning models. Could you please explain which measure would make one of the models suboptimal relative to the other?
In the section you refer to, we sweep learning rates to see their effect on optimization. We find that we get similar test loss but significantly different forgetting profiles. For example, in Fig. 7, consider LoRA with lr=1e-4 and 2e-4. We see that the models have trained to within 0.5% test accuracy of each other across epochs (middle plot). However, we see that 2e-4 consistently forgets much more than 1e-4 when examining their corresponding pseudo loss. These are the data points used to indicate that intruder dimensions correlate with forgetting. We in fact support the conclusion of [1] ("LoRA learns less but forgets less") and extend it in this paper (lines 235-247).
The main conclusions of our paper are all based on well-trained models. We welcome the reviewer to follow up and provide clarification if we have misunderstood their concern. If there is another criterion by which one model would be considered suboptimal and that is typically monitored during fine-tuning, we would appreciate further clarification.
In Appendix H.2, the observation that performance drops when learned knowledge is perturbed and LoRA's preserved knowledge is exaggerated is unsurprising. This behavior is consistent with general deep learning dynamics and does not convincingly support the claim that intruder dimensions degrade OOD performance. The argument feels more like an expected consequence of disrupting well-internalized features.
Your suggested intuition is actually the opposite of our findings, making this a surprising result. Our finding is not that "performance drops when learned knowledge is perturbed", but rather that downstream performance barely changes when intruder dimensions are scaled down, while forgetting decreases significantly. We provide a key passage from our main text (lines 290-295) here:
"In one example for LLaMA2-7B fine-tuned on MetaMath with LoRA r = 256, we observe that scaling the top intruder dimension in each matrix leads to a 0.1% drop in test accuracy and a 33.3% drop in the forgetting induced by fine-tuning. In another, for RoBERTa-base fine-tuned on QQP, this leads to equivalent test accuracy and a 33.2% reduction in the forgetting induced by fine-tuning. In certain scenarios, we even see test accuracy improve along with a drop in forgetting. If we instead increase their contribution, we observe more forgetting."
This is an unexpected result. We welcome the reviewer to follow up and provide clarification if we have misunderstood their concern.
In Appendix I, was the hyperparameter search of learning rate conducted to ensure that LoRA with different alpha converged to similar test accuracy? If not, it's trivial that improper/insufficient learning naturally leads to low rank solutions.
Yes. We follow standard machine learning practice by conducting learning rate sweeps and ensuring that the different models compared converged to similar test accuracy, as shown in Table 2. In Appendix I, all models train to similar test loss and accuracy, making our investigation fair across ranks.
Thank you for the detailed response; I've read it carefully. I will discuss with the other reviewers and update my score as the discussion converges.
Thank you again for your thoughtful review. We are happy to answer any further questions you may have in case any come up.
This paper studies the difference between full fine-tuning and LoRA through spectral analysis of weight matrices. The authors identify intruder dimensions—high-ranking singular vectors in tuned weights that are dissimilar to any pre-trained directions—and show they are a key cause of forgetting. Scaling down these components reduces forgetting without harming downstream performance too much. The accumulation of intruder dimensions is also shown to hurt LoRA in continual learning.
Strengths and Weaknesses
Strengths:
-
The authors provide a novel perspective on the difference between LoRA and full fine-tuning. The concept of intruder dimensions reveals a concrete structural difference between full fine-tuning and LoRA.
-
The link between intruder dimensions and forgetting is insightful. It offers both theoretical understanding and potential avenues for mitigation.
Weaknesses
-
Logical mismatch in Section 5. Section 4 convincingly correlates the number of intruder dimensions with forgetting, yet Section 5 only experiments with scaling down the magnitude of the top intruder dimension. To solidify the argument, the authors should also manipulate the count of intruder dimensions and measure its effect on forgetting.
-
Limited practical guidance. The paper does not provide guidance on how to prevent the emergence of intruder dimensions during training, which is crucial for improving real-world applicability. As it stands, intruder dimensions appear to be primarily useful as a post-hoc model selection criterion. However, in practice, comparing two models based on their performance on pre-training data may be more straightforward.
Questions
-
The paper concludes that intruder dimensions cause forgetting. However, full fine-tuning, with few intruder dimensions, still exhibits more forgetting than LoRA with an appropriately chosen alpha. How do we understand this based on the findings in the paper?
-
In the continual learning setup, are the similarity matrices in Figure 9 all computed with the original model? I am wondering whether we can observe intruder dimensions over a model already tuned on another task (say, do weights after task 3 have intruder dimensions over weights after task 2)?
Limitations
Yes.
Final Justification
Resolved Issues:
- Clarified the relationship between intruder dimensions and forgetting.
- Provided a clear explanation of the method to prevent intruder dimensions and addressed potential logic mismatches.
Unresolved Issues:
- Additional evidence is needed to connect Sections 3 and 5 effectively.
- Algorithmic suggestions for mitigating intruder dimensions remain limited.
In conclusion, this is a technically solid and engaging paper with some limitations. I assign it a rating of 4.
Formatting Issues
No.
We thank the reviewer for their thoughtful review. We provide responses to their review below.
Logical mismatch in Section 5. Section 4 convincingly correlates the number of intruder dimensions with forgetting, yet Section 5 only experiments with scaling down the magnitude of the top intruder dimension. To solidify the argument, the authors should also manipulate the count of intruder dimensions and measure its effect on forgetting.
We actually do study altering the count of intruder dimensions in Section 5. When scaling with a factor of 0, we remove that intruder dimension. In Section 5, we remove the top intruder dimension in each weight matrix. This has the effect of removing up to 72 intruder dimensions from a RoBERTa model, and it significantly reduces forgetting while causing a less extreme drop in test performance. For examples, see Fig. 8. We hope this clarifies the reviewer's confusion.
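As a sketch of this intervention (a minimal illustration assuming the intruder's index in the SVD is already known; not the paper's exact implementation), scaling a singular value by a factor `lam`, with `lam = 0` removing the direction entirely:

```python
import numpy as np

def scale_singular_direction(W, index, lam):
    """Rebuild W with the singular value at position `index` scaled by lam.
    lam < 1 shrinks that direction's contribution; lam = 0 removes the
    direction entirely (one fewer high-ranking intruder dimension)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S = S.copy()
    S[index] *= lam
    # (U * S) broadcasts S over columns, i.e., U @ diag(S).
    return (U * S) @ Vt
```

Setting `lam = 1` reconstructs the original matrix, while `lam = 0` on the top direction reduces the matrix rank by one.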
Limited practical guidance. The paper does not provide guidance on how to prevent the emergence of intruder dimensions during training, which is crucial for improving real-world applicability. As it stands, intruder dimensions appear to be primarily useful as a post-hoc model selection criterion.
This paper provides several concrete suggestions on how to reduce the emergence of intruder dimensions at training time. We provide evidence that certain initializations (Fig. 10), lower learning rates (Fig. 7), and the setting described in Section N all lead to fewer intruder dimensions. These are concrete prescriptions that can be used to prevent the emergence of intruder dimensions during training.
Moreover, we show that scaling down the top intruder dimensions after LoRA fine-tuning recovers the drop in pretraining performance without affecting downstream adaptation. In one such case, we see forgetting reduced by 33.2% with no change in test accuracy (lines 292-293). This effectively resurfaces forgotten information without compromising task performance—a novel and actionable intervention that leads to improved OOD generalization which was not previously known.
These are a few practical guidelines offered by the paper.
The paper concludes that intruder dimensions cause forgetting. However, full-tuning with few intruder dimensions still exhibits more forgetting than LoRA with an appropriately chosen alpha. How do we understand this based on findings in the paper?
It is important to note that intruder dimensions are a cause of forgetting in LoRA'd models, but not the only cause of forgetting in fine-tuning in general. This is expected, since any deviation from the pre-trained weights, which were trained to minimize language modelling loss, should lead to an increase in language modelling loss (i.e., forgetting). This finding demonstrates that these two methods update weight matrices in fundamentally different ways, and it explains the structural basis of forgetting in LoRA'd models (and variants of LoRA). However, a functional explanation of forgetting in fully fine-tuned models may be different and is an interesting avenue for future study.
In the continual learning setup, are the similarity matrices in Figure 9 all computed with the original model? I am wondering can we observe intruder dimensions over a model already tuned on another task(say do weights after task 3 have intruder dimension over weights after task 2)?
Yes, all similarity matrices in Figure 9 are computed using the original model. That said, we do observe what you're describing: as we train on more tasks, new intruder dimensions emerge in distinct locations with every round of adaptation to a new task (highlighted in pink), indicating that intruder dimensions can indeed arise relative to earlier task weights (e.g., after Task 3 vs. Task 2).
Thank you for the rebuttal; it addresses most of my concerns, and I will raise my rating to 4.
Regarding my first question, I agree that when the scaling factor is set to 0, one intruder dimension is removed for each weight matrix. However, the paper does not include a figure showing the trend of pre-training loss versus the number of intruder dimensions. Moreover, this decrease follows a specific pattern, where exactly one intruder dimension is removed per matrix. I believe additional discussion or results linking the number of intruder dimensions to their magnitudes would strengthen the logical flow of the paper. At present, Section 3 focuses entirely on the count of intruder dimensions, while Section 5 discusses their magnitudes—two related but not equivalent concepts.
We thank the reviewer for their thoughtful responses and their decision to update their score.
Regarding my first question, I agree that when the scaling factor is set to 0, one intruder dimension is removed for each weight matrix. However, the paper does not include a figure showing the trend of pre-training loss versus the number of intruder dimensions. Moreover, this decrease follows a specific pattern, where exactly one intruder dimension is removed per matrix. I believe additional discussion or results linking the number of intruder dimensions to their magnitudes would strengthen the logical flow of the paper. At present, Section 3 focuses entirely on the count of intruder dimensions, while Section 5 discusses their magnitudes—two related but not equivalent concepts.
Thank you for this valuable suggestion. We will look into measuring and linking the number of intruder dimensions to their magnitudes in order to strengthen the logical flow of this work as you suggest. Due to the limited time remaining in the rebuttal cycle, we will be unable to complete this measure before the response deadline but will include this measure in the updated version of the manuscript.
We thank you again for your thoughtful responses and your positive reaction to our work.
The paper analyzes the differences between LoRA and full fine-tuning of LLMs by examining the spectral properties of the resulting weight matrices. Specifically, the authors compare the singular vectors of the pre-trained and fine-tuned weights and find that, while full fine-tuning largely preserves the original spectrum, LoRA introduces intruder dimensions—singular vectors with large singular values that have low similarity to any singular vector from the pre-trained model. The authors conduct thorough experiments to study how various hyperparameters, such as LoRA rank and alpha, learning rate, the parameters used to compute the number of intruder dimensions, and even variations of the LoRA method, affect these intruder dimensions. They then analyze the forgetting behavior of LoRA in both fine-tuning and continual learning settings, and argue that the intruder dimensions contribute to it.
Strengths and Weaknesses
Strengths:
- The paper addresses an important problem: understanding the properties of solutions obtained with LoRA is highly relevant, given its widespread use.
- To the best of my knowledge, the analysis of spectral differences between LoRA and full fine-tuning is novel, and the results are nontrivial and interesting.
- The experimental setup is strong. The use of different architectures and datasets supports the generality of the intruder dimension phenomenon.
- The paper is thorough and includes a wide range of additional experiments and ablations.
- The paper is clearly written and easy to follow.
Main weaknesses:
- The focus of the paper. The paper focuses on intruder dimensions as the main factor distinguishing the spectral properties of LoRA and full fine-tuning, but it is not entirely clear whether this choice captures the core difference. According to Figure 2b, the difference appears to be more uniform: all singular vectors in the LoRA solution tend to have lower cosine similarity, and there is no clear small subset of outlier directions with significantly lower similarity. This suggests that LoRA broadly rotates the basis rather than introducing a few distinct intruder dimensions. The paper does not provide a clear justification for why the top intruder dimensions should be considered more important than the overall rotation of the spectrum.
- Reasons behind intruder dimensions. The paper does not provide a clear explanation for why LoRA introduces intruder dimensions. Moreover, while the results convincingly show that LoRA solutions have intruder dimensions, the causal relationship is not fully established. In particular, it is possible that these dimensions are not specific to LoRA, but instead arise from larger overall changes to the weights. The experiments varying the learning rate for LoRA show that higher learning rates lead to more intruder dimensions, and it is plausible that full fine-tuning would show a similar pattern if higher learning rates were used. This raises the question of whether the observed differences between LoRA and full fine-tuning are due to specific properties of the LoRA method, or simply a consequence of differences in optimal hyperparameter ranges. Additional experiments that vary the learning rate in full fine-tuning and analysis of the relationship between the number of intruder dimensions and the magnitude of weight updates would help clarify this point.
- Practical importance of intruder dimensions. The analyses in Sections 4 and 5 do not clearly establish a causal link between intruder dimensions and forgetting or generalization. Moreover, they do not provide evidence that the top intruder dimensions contribute more significantly than the rest of the changes in the weights.
- The correlation between the number of intruder dimensions and forgetting observed in Section 4 may result from a shared underlying cause rather than a direct causal relationship. Specifically, larger changes in weights during fine-tuning (such as those resulting from higher learning rates in Figure 7) can lead to both a greater number of intruder dimensions and stronger adaptation to the fine-tuning data, which in turn increases forgetting. In this case, even though there is no strong correlation between the number of intruder dimensions and test accuracy on the fine-tuning task, there is likely a correlation with training accuracy/loss.
- While Section 4 claims that the number of intruder dimensions correlates with forgetting, Figure 7 shows that full fine-tuning has the fewest intruder dimensions, yet exhibits more forgetting than LoRA with a low learning rate.
- In Section 5, the experiments do not convincingly show that intruder dimensions contribute to forgetting more than other changes in weights. The comparison between downscaling intruder dimensions and regular singular vectors effectively compares removing part of the fine-tuning versus removing part of the pre-training. To show that intruder dimensions are specifically responsible for forgetting, a more appropriate comparison would be between downscaling intruder dimensions and downscaling the overall fine-tuning update. This distinction is important, since downscaling the overall fine-tuning update is also known to improve generalization (http://arxiv.org/abs/2109.01903). From the current experiments, it remains unclear whether intruder dimensions contribute more to forgetting than other changes introduced during fine-tuning.
Additional concerns, comments and questions:
- The analysis of the effect of LoRA rank on intruder dimensions could be improved. Figure 4 clearly shows a non-monotonic trend: as the LoRA rank increases, the number of intruder dimensions initially grows and then decreases. While the text notes that intruders decrease and converge toward full-rank behavior at high ranks, it does not discuss the initial increase at low ranks.
- The claim that LoRA forgets less than full fine-tuning, even at comparable performance levels, is not particularly strong. The results for LoRA and full fine-tuning in Table 2 differ significantly, and the observed differences in forgetting appear to correlate, at least to some extent, with differences in accuracy. For example, on the QQP task, both accuracy and forgetting differ between high-rank LoRA and full fine-tuning, while on MNLI, the results are almost identical in both respects.
- The results in Figure 17 do not seem to fully support the conclusions drawn from Figure 9a. Across different tasks, LoRA exhibits both more and less forgetting, suggesting that the effect may not generalize consistently.
- A more detailed analysis of intruder dimensions of different LoRA variants would be beneficial. There appears to be a non-monotonic dependence of the number of intruder dimensions on the epsilon parameter, which likely reflects properties of specific methods used.
- Are Figures 1, 2, and 3 based on the same experimental setting? In Figure 2, only one low-epsilon intruder dimension appears among the top 10 singular vectors, whereas Figures 1 and 3 seem to show a higher number of such dimensions.
- In Figure 2c, it seems odd to refer to a "normal" singular vector with cosine similarity equal to 1, given that Figure 2b clearly shows that the similarity is significantly lower for most vectors under both fine-tuning methods.
- It would be helpful to briefly define what alpha represents in the LoRA setup in the Background section, especially since its effect is analyzed later in the paper.
- Why do the full fine-tuning results differ between Tables 1 and 2?
- In Figure 23, the line type for both VeRA experiments is the same.
Questions
All questions and concerns are detailed in the Strengths and Weaknesses section. The main weaknesses 1–3 are the most critical and will have the greatest impact on my final evaluation after the rebuttal.
Limitations
I believe the limitations section should provide a more thorough discussion noting that the paper focuses only on one aspect of the spectral differences between LoRA and full fine-tuning, the intruder dimensions, while leaving aside the effect of the overall rotation of the spectrum and the changes in singular values.
Final justification
After careful consideration, I still hold the opinion that the paper is not ready for publication and requires significant revision. The rebuttal and discussion mostly addressed my concern about the focus of the paper (Weakness 1). However, the additional results largely confirmed that my concerns regarding the reasons behind the intruder dimensions and their connection to forgetting (Weaknesses 2 and 3) were reasonable. The main reason for my negative score is the section on forgetting (Weakness 3): the current claim that intruder dimensions are “special” with respect to forgetting is not convincing and may be incorrect.
Final comments on the main weaknesses:
- Focus on intruder dimensions (Weakness 1). After the discussion, I am convinced that intruder dimensions are indeed present in many LoRA experiments and that the paper’s focus is reasonable. However, some experiments still show very different behaviour (Figure 2b). Looking at Figure 4, both types of behaviours appear in practical experiments: clear outlier intruder dimensions result in an up–constant–up pattern (as in low-rank LoRA in the MNLI experiment), while a uniform change as in Figure 2b results in a constant–up pattern (as in high-rank LoRA in the MNLI experiment). I believe the paper should explicitly discuss this distinction.
- Reasons behind intruder dimensions (Weakness 2). Given that the LoRA weight update has a much higher L2 norm, it is not clear whether intruder dimensions are specifically related to LoRA or simply to a larger weight update in general. Based on the current results, in my opinion, the only justified conclusion is that LoRA and full fine-tuning under optimal hyperparameters have different update structures. This conclusion is interesting in itself, but the paper would then need to adjust its claims accordingly. Alternatively, the paper could include additional experiments, e.g., using a higher learning rate or longer training for full fine-tuning (or a lower learning rate for LoRA), to confirm whether LoRA and full fine-tuning still produce different weight structures when the weight update norms are similar.
- Practical importance of intruder dimensions (Weakness 3). After the discussion, I remain unconvinced that intruder dimensions are in any way special with respect to forgetting. Based on the previous point, the high correlation between the number of intruder dimensions and forgetting can most likely be explained by the strong correlation between the weight update norm and forgetting. Moreover, the similarity in the effects of downscaling intruder dimensions and downscaling the entire weight update in LoRA (Q4 in the discussion) contradicts the claim that intruder dimensions are particularly special.
Formatting issues
--
Thank you for your thoughtful review. We have incorporated your suggestions into our updated manuscript, including fixing the mistakes you caught, such as VeRA having the same line type twice in Fig. 23.
Where required, we provide responses to your review below:
Main weaknesses: The focus of the paper. The paper focuses on intruder dimensions as the main factor distinguishing the spectral properties of LoRA and full fine-tuning, but it is not entirely clear whether this choice captures the core difference.
Our primary objective was to assess whether different fine-tuning methods converged to functionally equivalent models despite differing parameterizations. To investigate this, we perform a spectral analysis to characterize how the transformations applied by each method differ.
Since the SVD decomposes a matrix into its principal components, studying how these principal components change (via cosine similarity) is an intuitive way to examine how a weight matrix is changed during fine-tuning. Empirically, we observe that while full fine-tuning, which directly manipulates the weights, makes small adjustments to the magnitude and direction of the singular vectors, LoRA, which uses a low-rank product, introduces new singular vectors with low cosine similarity to the existing singular vectors. We provide some intuition for why intruder dimensions are the right framework in Section B.2, where we showed that adding the outer product of a random vector to a weight matrix introduces an intruder dimension. This is analogous to the matrix product BA in LoRA when the rank is 1, and implies that intruder dimensions occur because the B and A vectors are uncorrelated with the columns/rows of the weight matrix W_0.
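To make the B.2 intuition concrete, here is a minimal numpy sketch (ours, purely illustrative: the matrix is random rather than real model weights, and the threshold eps=0.5 and window k=10 are stand-in choices, not the paper's settings). It shows that adding a large random rank-1 outer product introduces a high-ranking singular vector with low cosine similarity to every pre-trained singular vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained weight matrix (random, not real weights).
W0 = rng.normal(size=(64, 64)) / np.sqrt(64)
U0, _, _ = np.linalg.svd(W0)

# Rank-1 "LoRA-style" update from random vectors b and a, scaled so the
# new direction lands among the top singular values of the result.
b = rng.normal(size=(64, 1))
a = rng.normal(size=(1, 64))
W_ft = W0 + 0.5 * (b @ a)
U1, S1, _ = np.linalg.svd(W_ft)

# For each fine-tuned left singular vector, take its maximum cosine
# similarity to any pre-trained left singular vector (columns of U0 and
# U1 are orthonormal, so the inner products are already cosines).
max_sim = np.abs(U0.T @ U1).max(axis=0)

# Intruder dimensions: high-ranking singular vectors whose best match to
# the pre-trained basis falls below a threshold epsilon.
eps, k = 0.5, 10
intruders = [i for i in range(k) if max_sim[i] < eps]
print("intruder indices among top", k, "->", intruders)
```

Because the random direction b is essentially uncorrelated with all 64 pre-trained singular vectors, its best cosine match is small and it is flagged as an intruder, while the remaining top vectors match the pre-trained basis almost perfectly.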
We have conducted an extremely detailed examination and would have every reason to report a better measure had one been found. Does the reviewer have a suggestion for an alternative measure?
The paper does not provide a clear explanation for why LoRA introduces intruder dimensions. Moreover, while the results convincingly show that LoRA solutions have intruder dimensions, the causal relationship is not fully established.
It is important to note that intruder dimensions are an empirical observation of LoRA. However, we provide a few mathematical justifications for their occurrence. We showed that adding the outer product of a random vector to a weight matrix introduces an intruder dimension. This is analogous to the matrix product BA in LoRA when the rank is 1, and implies that intruder dimensions occur because the B and A vectors are uncorrelated with the columns/rows of the weight matrix W_0 (Section B.2). Further, we showed empirically that training only the B matrix, while freezing the A matrix with all singular values set to 1, eliminates the amplification of the singular values caused by the matrix product and therefore reduces the number of high-ranking intruder dimensions. This suggests that this multiplicative property may cause new singular vectors to have large singular values. These experiments provide evidence towards why LoRA causes intruder dimensions to be introduced.
it is possible that these dimensions are not specific to LoRA, but instead arise from larger overall changes to the weights ... The experiments varying the learning rate for LoRA show that higher learning rates lead to more intruder dimensions, and it is plausible that full fine-tuning would show a similar pattern if higher learning rates were used. This raises the question of whether the observed differences between LoRA and full fine-tuning are due to specific properties of the LoRA method, or simply a consequence of differences in optimal hyperparameter ranges.
The reviewer is absolutely correct to be curious about this, and we indeed investigated it. In our investigation, we used standard learning rates for both full fine-tuning and LoRA. When we conduct a large learning rate sweep for both methods, we observe that increasing the learning rate significantly above the default setting leads to training instabilities and divergence. Decreasing the learning rate makes it difficult to converge to similar performance, even with more training steps. Because of this, the resulting models perform significantly worse and therefore cannot be used in our analysis. For models that do converge to similar, near-optimal performance (Fig. 7), we observe that full fine-tuning contains no intruder dimensions while LoRA contains many.
Practical importance of intruder dimensions. The analyses in Sections 4 and 5 do not clearly establish a causal link between intruder dimensions and forgetting or generalization.
In Section 4, we find a strong correlation between intruder dimensions and forgetting. The reviewer is absolutely right to point out that this could be the result of a third variable causing both to increase. To show that this is not the case, in Section 5 we intervene on the intruder dimensions: we scale down their singular values, which reduces their contribution to the fine-tuned weights. By performing this causal experiment, we find that there is little change in test accuracy but a large impact on forgetting. One example (reported in the main text in lines 292-293) shows that applying this intervention to our model trained on QQP leads to no change in test accuracy but a 33.2% reduction in forgetting.
These findings lead to the following conclusion: intervening on intruder dimensions and scaling them down results in a large drop in forgetting but little change in test accuracy, showing that these intruder dimensions, and in particular their magnitude (singular value), cause a large amount of the forgetting in LoRA models.
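For concreteness, the intervention can be sketched as follows. This is a hypothetical re-implementation on random matrices, not the paper's code; the threshold eps, window k, and scale factor are illustrative choices:

```python
import numpy as np

def scale_intruders(W0, W_ft, eps=0.5, k=10, scale=0.0):
    """Shrink the singular values of intruder dimensions in W_ft.

    Illustrative sketch: an intruder is a top-k singular vector of W_ft
    whose best cosine match to the pre-trained basis is below eps.
    """
    U0, _, _ = np.linalg.svd(W0)
    U1, S1, Vt1 = np.linalg.svd(W_ft)
    max_sim = np.abs(U0.T @ U1).max(axis=0)  # best match per fine-tuned vector
    S_new = S1.copy()
    for i in range(k):                        # only inspect top-k vectors
        if max_sim[i] < eps:                  # intruder dimension
            S_new[i] *= scale                 # reduce its contribution
    return U1 @ np.diag(S_new) @ Vt1

# Toy check: removing a synthetic rank-1 intruder moves the matrix
# back toward the "pre-trained" weights.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(64, 64)) / np.sqrt(64)
W_ft = W0 + 0.5 * rng.normal(size=(64, 1)) @ rng.normal(size=(1, 64))
W_fixed = scale_intruders(W0, W_ft)
print(np.linalg.norm(W_ft - W0), ">", np.linalg.norm(W_fixed - W0))
```

In this toy setting, zeroing the intruder's singular value brings the matrix much closer to W0 in Frobenius norm while leaving the matched (non-intruder) directions intact, mirroring the recover-pre-trained-knowledge / keep-downstream-performance trade-off described above.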
We do not claim that intruder dimensions are the only possible cause of forgetting. Any change to the pre-trained weights should be expected to impact a model's base language modelling ability. Indeed, we observe that full fine-tuning makes small adjustments to the direction and magnitude of the pre-trained singular vectors. This still changes the fine-tuned weights, and we should expect forgetting to occur. Our causal experiment scaling intruder dimensions is specific to LoRA.
Intruder dimensions and Section 5 explain the structural basis of forgetting in LoRA'd models (and variants of LoRA). However, a functional explanation of forgetting in fully fine-tuned models may be different and is an interesting avenue for future study.
Additional concerns, comments and questions: The claim that LoRA forgets less than full fine-tuning, even at comparable performance levels, is not particularly strong. The results for LoRA and full fine-tuning in Table 2 differ significantly, and the observed differences in forgetting appear to correlate, at least to some extent, with differences in accuracy. For example, on the QQP task, both accuracy and forgetting differ between high-rank LoRA and full fine-tuning, while on MNLI, the results are almost identical in both respects.
Our conclusion that "LoRA forgets less than full fine-tuning, even at comparable performance levels" stems from our results in Table 2 and Fig. 6. Table 2 contains the test accuracies of all of our models, and in it we see models trained to approximately the same accuracy. We are not sure why the reviewer claims that the results for LoRA and full fine-tuning in Table 2 differ significantly. For example, in Table 2, all our models fine-tuned on MNLI perform within 0.5% of each other. Looking horizontally, comparing full fine-tuning and LoRA r=16 (without loss of generality), we see similar performance, with LoRA sometimes outperforming full fine-tuning and vice versa. Further, when correlating test accuracy and forgetting within datasets for Table 2 (as suggested by the reviewer), we get no statistically significant result (test: Spearman's rank-order correlation, p-value > 0.33 for each). This means that the observed differences in forgetting do not correlate with differences in accuracy.
When looking at Fig. 6b, we see that full fine-tuning always forgets more than LoRA. This leads us to conclude that LoRA forgets less than full fine-tuning, even at comparable performance levels, which extends the findings of [1]. We hope this clarifies the reviewer's concern.
[1] - https://arxiv.org/pdf/2405.09673
A more detailed analysis of intruder dimensions of different LoRA variants would be beneficial. There appears to be a non-monotonic dependence of the number of intruder dimensions on the epsilon parameter, which likely reflects properties of specific methods used.
We provide a preliminary study showing that our findings extend to other variants in Section P. We find that LoRA variants (LoRA+, VeRA, AdaLoRA, PiSSA) all have intruder dimensions, showing that these methods are not adequate for preventing them. We emphasize that our study of LoRA variants is orthogonal to our main study, and that it is difficult to conduct every detailed experiment we have on all LoRA variants. While we save an in-depth analysis for future work, we have no reason to suspect these will impact the claims we make in this paper.
Thanks for the detailed response!
The main weaknesses I pointed out are only partially addressed, so I have some clarification questions below. Questions 1/2/4 are more critical.
Weakness 1. Focus of the paper
Q1. I fully agree that analyzing how the principal components of weight matrices change is a natural choice here. My concern is not about the idea of analyzing SVD components, but rather about the focus on only the top subset of these components, rather than considering the average change across all of them. As I mentioned in the initial review, according to Figure 2b, the difference between LoRA and full fine-tuning appears more uniform: all singular vectors in the LoRA solution tend to have lower cosine similarity, with no clear small subset of outlier directions showing significantly lower similarity. Is it not the case usually? If it is, could you please elaborate on why the paper focuses only on changes in the top directions and does not analyze all components?
Weakness 2. Reasons behind intruder dimensions
Q2. Could you please provide the results for the learning rate values that led to reasonable training outcomes for full fine-tuning in your learning rate sweep (i.e., accuracy and number of intruder dimensions at different threshold levels)? Results even for 2-3 of them would already be valuable.
Q3. If possible, a comparison of the distances between pre-training and fine-tuning checkpoints for LoRA and full fine-tuning would also be very helpful to make sure that intruder dimensions are mostly related to LoRA and not higher changes in weights in general.
Weakness 3. Practical importance of intruder dimensions
Q4. While I agree that this intervention experiment is a good starting point for making a causal claim, I believe it currently lacks a proper baseline. To support the idea that there is something specific about the forgetting properties of intruder dimensions, the result should be compared to a baseline where the full fine-tuning weight update is downscaled. Without this comparison, it’s unclear whether the observed effect is related specifically to the intruder dimensions or is simply a general consequence of downscaling the fine-tuning update. Prior work (e.g., http://arxiv.org/abs/2109.01903) shows that the latter can also lead to similar effects. A strong argument would be to show that downscaling only the intruder dimensions results in less loss of accuracy at the same level of forgetting than downscaling the entire fine-tuning update.
Q5. If possible, could you please provide the results on the correlation between the number of intruder dimensions and the training loss on the fine-tuning task (instead of test accuracy provided in the paper)?
Additional concerns
Thanks for the correlation results, I found them very useful. Regarding the additional analysis of LoRA variants, I meant it only as a suggestion, and I fully agree with you that this is an interesting direction for future work!
Thank you for your detailed response and for your quick answer to our rebuttal. We provide responses to your follow up questions below. Please let us know if any of our below responses are unsatisfactory, and we would be happy to follow up.
Q1.
You are absolutely correct to bring up this point. We included Fig. 2b to help provide intuition on measuring the cosine similarity between pre-trained and fine-tuned singular vectors; it is not what is typically observed. We instead observe that all pre-trained singular vectors are preserved and that new, high-ranking singular vectors are introduced, as shown in Fig. 2a. We apologize that this choice of graphic confused the claims we make. Thank you for helping us clarify this.
Q2.
As requested, we provide accuracy and number of intruder dimensions for several full fine-tuning runs below. While we cannot update the PDF, we print out accuracy and number of intruder dimensions for several different runs. Each value in a list corresponds to an epoch (the first entry is after epoch 1, etc.; note that these values apply to the settings used in Fig. 7).
Full fine-tuning with lr=2.5e-6:
Number of Intruders: [0, 0, 0, 0, 0]
Test Accuracy: [0.8554, 0.8668, 0.8704, 0.8680, 0.8655]
Full fine-tuning with lr=5e-6:
Number of Intruders: [0, 0, 0, 0, 0]
Test Accuracy: [0.8601, 0.8742, 0.8699, 0.8648, 0.8656]
Full fine-tuning with lr=1e-5:
Number of Intruders: [0, 0, 0, 0, 0]
Test Accuracy: [0.8607, 0.8745, 0.8703, 0.8694, 0.8671]
We hope these values are helpful and encourage the reviewer to reach out with further requests or clarifying questions should they come up.
Q3.
To measure this, we select a model and measure the norm of the update to the weights (||W_ft - W_0||). We measure that full fine-tuning's update has an average weight norm of 0.000172, while LoRA's update has an average weight norm of 0.002839, a ~16x difference. However, this is consistent with our observation of intruder dimensions. Since LoRA has intruder dimensions (new singular vectors with large singular values), we should expect LoRA to have a larger weight-norm difference, because its update matrix will contain the intruder dimensions. In contrast, full fine-tuning, which makes subtle changes to singular values/vectors, will have a smaller weight-norm difference, because its update matrix will contain very small values (since W_ft stays close to W_0).
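This argument can be illustrated on synthetic matrices (the numbers below are illustrative, not the measurements reported above): a rank-1 update with a large singular value dominates the Frobenius norm of the update, while a uniformly small dense update has a small norm.

```python
import numpy as np

rng = np.random.default_rng(1)

# Full-fine-tuning-like update: tiny adjustments to every entry.
delta_full = 1e-3 * rng.normal(size=(64, 64))

# LoRA-like update: one large rank-1 intruder direction.
delta_lora = 0.5 * rng.normal(size=(64, 1)) @ rng.normal(size=(1, 64))

# For a rank-1 matrix, the Frobenius norm equals its single singular
# value, so the intruder's magnitude and the update norm coincide.
print("full-FT-like update norm:", np.linalg.norm(delta_full))
print("LoRA-like update norm:  ", np.linalg.norm(delta_lora))
```

The point of the sketch is that a large update norm and a high-ranking intruder dimension are two views of the same quantity for a low-rank update, so the ~16x norm gap is expected rather than contradictory.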
Q4.
This is indeed a good baseline to measure. When we compare LoRA'd models that either have their entire update scaled or just their intruder dimensions scaled, we get similar Pareto curves on the forgetting-versus-performance graph. However, we argue that this makes sense. Since we argue that most of the LoRA update is in the intruder dimensions, scaling the intruder dimensions down should have a similar effect to scaling the entire update down. We thank the reviewer for suggesting this measure and pointing us to this piece of literature.
Q5.
Unfortunately, we only log training loss at the batch level, which means we cannot calculate a global training loss measure. However, we estimate the value you request by averaging the batch training loss over the previous 100 batches. Calculating the Spearman correlation, we measure a statistic of -0.2174 and a p-value of 0.3573. This means there is no statistically significant relationship between training loss and intruder dimensions, which further shows that our results are not simply due to certain models being overfit to the training set.
We thank you again for your thoughtful review and response. Given that the rebuttal period is closing soon, we are reaching out to ensure we have addressed all the concerns raised. We would also be happy to answer any further questions if any have come up.
Weakness 1. I don’t find this argument convincing. Figure 2a is a motivational image, while Figure 2b is an actual plot of the results. So, if Figure 2b is not what is usually observed in the experiments, could you please provide some data on what is observed? For example, a histogram of cosine values from other experiments? And please explain why the results in Figure 2b are different.
Based on all the results from the paper, it seems that while the cosines for non-intruder vectors are close to one for the first several vectors (Figure 3), this is not the case for the vectors in general (Figures 1 and 2). The analysis in the paper is clearly not focused only on the first several vectors.
Weakness 2, Q1. Thanks for the additional results. However, my question was more about learning rates higher than the one used in the original experiment from the paper. It is expected that for lower learning rates, the number of intruder dimensions would also be 0.
Weakness 2, Q2. Thanks for the additional results. I agree with your discussion here; however, this result also supports the idea that the effect of intruder dimensions could be connected to the learning rate used.
Weakness 3, Q4. Again, I agree with your discussion here, but this result demonstrates that intruder dimensions are not particularly special in terms of forgetting compared to other dimensions.
Weakness 3, Q5. Thanks for the result, it is indeed an interesting observation.
I understand that there is not much time left in the discussion period, so I will keep that in mind. I would appreciate any response possible within the remaining time.
Thank you again for your thoughtful responses and involvement with this rebuttal.
Weakness 1. I don’t find this argument convincing. Figure 2a is a motivational image, while Figure 2b is an actual plot of the results. So, if Figure 2b is not what is usually observed in the experiments, could you please provide some data on what is observed? For example, a histogram of cosine values from other experiments? And please explain why the results in Figure 2b are different.
We apologize for any confusion caused by this plot. Fig. 2b is a graph specifically selected to motivate this paper and to suggest that LoRA and full fine-tuning update weight matrices differently. The proper plot to examine for the standard difference between full fine-tuning and LoRA is Fig. 1. In it, we see an 'offset' in LoRA due to intruder dimensions, whereas we see no offset for full fine-tuning. In particular, we call attention to the fact that both LoRA and full fine-tuning have a diagonal with the same slope and a similarly wide band; except for the intruder dimensions causing the offset, there is little difference between the two models.
While we cannot update the PDF or add any links to include any new figures due to rebuttal rules, we hope that Figure 1 can clarify this issue.
Based on all the results from the paper, it seems that while the cosines for non-intruder vectors are close to one for the first several vectors (Figure 3), this is not the case for the vectors in general (Figures 1 and 2). The analysis in the paper is clearly not focused only on the first several vectors.
We are unsure what the reviewer means by the last sentence. While we agree that low-ranked singular vectors in the fine-tuned models do not map as cleanly as high-ranking singular vectors (this is trivial: even with small changes to high-ranked singular vectors, low-ranked singular vectors are forced to change in order to continue to span the vector space), we do not observe a particularly pronounced effect here. Figure 1, as you mention above, actually supports this: we see that both full fine-tuning and LoRA have a clear diagonal, indicating an ordered mapping to the pre-trained singular vectors. Further, there is no clear difference in the width of the band between full fine-tuning and LoRA. We emphasize that the purpose of this paper is to study the differences between these two models.
Weakness 2, Q1. Thanks for the additional results. However, my question was more about learning rates higher than the one used in the original experiment from the paper. It is expected that for lower learning rates, the number of intruder dimensions would also be 0.
We are sorry if we misinterpreted your request from your earlier response. Unfortunately, due to the lack of time remaining in the rebuttal, we are unable to conduct new experiments. We hope this is acceptable to you. However, it is a common observation that increasing the learning rate leads to training instabilities and even divergence.
Weakness 2, Q2. Thanks for the additional results. I agree with your discussion here; however, this result also supports the idea that the effect of intruder dimensions could be connected to the learning rate used.
As we detail in Fig. 7 of the paper, we agree that the learning rate plays a role in intruder dimensions. However, we find it important to mention that we adopt no special hyperparameter settings for fine-tuning; rather, we replicate the training settings of existing works (https://arxiv.org/abs/2405.09673, https://arxiv.org/abs/2106.09685). To reiterate, we use standard hyperparameter settings that have been selected by others for their success in well-optimizing these models. This shows that our results are not due to a failure to well-optimize our LoRA models, but rather hold for LoRA's standard setting and use case.
Weakness 3, Q4. Again, I agree with your discussion here, but this result demonstrates that intruder dimensions are not particularly special in terms of forgetting compared to other dimensions.
We would like to re-emphasize the claim that the large portion of the LoRA update is in its intruder dimensions. If this is the case, scaling down either the intruder dimensions only or the full update should lead to similar results. In fact, this is exactly what we observe, showing that the magnitude of these intruder dimensions is responsible for forgetting and can be reduced in a way that lowers forgetting while maintaining performance.
Weakness 3, Q5. Thanks for the result, it is indeed an interesting observation.
Thank you again for your suggestions. We are glad that this particular question was helpful to you.
I understand that there is not much time left in the discussion period, so I will keep that in mind.
We thank you for your understanding and are grateful for your thoughtful feedback which has improved this paper.
Dear Reviewers and Authors,
Thank you all for your efforts so far. As the author–reviewer discussion period will conclude on August 6, please start the discussion as soon as possible.
For Reviewers: If you have not done so, please read the authors’ responses and, if necessary, continue the discussion with them.
- If your concerns have been addressed, consider updating your review and score accordingly.
- If some concerns remain, or if you share concerns raised by other reviewers, clearly state these in your review and consider adjusting your review (positively or negatively).
- If you feel that your concerns have not been addressed, you may also choose to keep your review as is.
- I will follow up with you again during the reviewer–AC discussion period (August 7–13) to finalize the reviews and scores.
For Authors: If you have not already done so, please respond to all questions raised by the reviewers. Keep your responses factual, concise, and ensure that every point raised is addressed.
Best regards,
The AC
This paper introduces the concept of intruder dimensions in LoRA fine-tuning, showing their strong correlation with forgetting and demonstrating, via causal interventions, that mitigating their effect can recover pre-trained knowledge without sacrificing downstream performance. Reviewers agreed the paper is novel, clearly written, and supported by extensive experiments across models, datasets, and hyperparameters.
While minor issues were raised (e.g., phrasing of claims, reliance on cosine similarity as a distance measure), these do not undermine the core contributions. The rebuttal effectively addressed concerns about hyperparameter bias, causal importance of intruders, and whether LoRA differences are simply basis rotations. Overall, the paper provides valuable insights into the spectral properties of LoRA and practical strategies for mitigating forgetting.