Gradient dynamics of low-rank fine-tuning beyond kernels
We analyze the SGD dynamics of learning rank-1 perturbations beyond the NTK setting, and prove linear sample complexity in the dimension for strong recovery.
Abstract
Reviews and Discussion
This paper studies the dynamics of two-layer network training in a student-teacher setting, focusing on the convergence of gradient descent under low-rank perturbations. The authors propose assumptions applicable to a wide range of activation functions and design a new algorithm that reduces the learning problem to optimizing the perturbation direction. On this basis, the authors prove that under mild assumptions, the student model converges to the teacher model over multiple iterations, and converges effectively even when the kernel approximation fails.
Strengths
- The paper provides an analysis method that goes beyond traditional kernel approximation and lays a theoretical foundation for understanding the training dynamics of two-layer networks.
- The paper's results apply to a wide range of commonly used activation functions and thus have strong generality.
Weaknesses
- The problem setting of this paper is quite different from LoRA as applied in practice and is closer to theoretical work on matrix factorization or two-layer networks in the student-teacher setting. This work only analyzes a single weight matrix and its corresponding rank-1 fine-tuning matrix, without considering the interaction between the fine-tuning of weights in different layers, and lacks analysis of LoRA applied to weights within the attention layer.
- The paper lacks a comparison of its setting and convergence results with other related studies on the convergence of two-layer networks in the student-teacher setting [1, 2, 3].
- This paper makes several simplifications (such as on , , and ) in its analysis of the dynamics for a single neuron, which limits its technical contribution and significance.
- The method in this paper cannot be simply and directly extended to the case of rank > 1.
Questions
- What are the similarities, differences, and advantages of this paper compared with existing studies of two-layer networks in the student-teacher setting [1, 2, 3]?
- For fine-tuning matrices of higher rank, what modifications would the method need in order to remain applicable? And can the method still guarantee convergence?
[1]. Oko, K., Song, Y., Suzuki, T., & Wu, D. (2024). Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations. In Conference on Learning Theory (COLT 2024).
[2]. Zhou, M., & Ge, R. (2024). How Does Gradient Descent Learn Features—A Local Analysis for Regularized Two-Layer Neural Networks. arXiv:2406.01766.
[3]. Xu, W., & Du, S. S. (2023). Over-parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron. In Conference on Learning Theory (COLT 2023).
Comparison to works [1,2,3]:
We are aware of these three works and believe they are quite orthogonal to our setting.
- [1] is on learning linear combinations of polynomial single-index models. This is squarely within the traditional “feature learning” regime, where the dynamics depend sensitively on the information/generative exponent.
- [2] considers training a student network on a target two-layer ReLU teacher network with relevant subspace of dimension , in the heavily overparametrized regime where the student width is exponential in (the exponential dependence would thus also appear in the runtime). While their analysis focuses on the “local convergence” regime where the student already achieves small loss, the setup and phenomenology are entirely different from ours.
- [3] studies learning a single ReLU using an overparametrized student network, showing that the convergence rate is exponentially slower than if the student itself were also a single ReLU. Again, our focus is not overparametrization. Furthermore, the target function we are learning is far richer than a single ReLU.
Our focus is on low-rank fine-tuning, a novel setup not considered in any of the works mentioned, and their analyses unfortunately say nothing about our setting. Therefore, we do not directly compare our work to these papers, but we are happy to add them to the Related Work section. To reiterate, our focus is on training a student on a teacher given by a low-rank perturbation of the initialization, and as we show, this is a genuinely new regime that operates strictly between kernel dynamics and full feature learning.
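To make the setup concrete, here is a minimal simulation sketch of this student-teacher model. The specific parametrization below (a perturbation of the form W + λ z uᵀ with a per-neuron sign vector z, a ReLU activation, exactly orthonormal base rows, and all scales) is an illustrative assumption on our part, not necessarily the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, lam = 64, 32, 1.0   # input dimension, width, perturbation scale (illustrative)

# Base (pre-trained) model: first-layer rows are exactly orthonormal here,
# an idealization of the near-orthogonality assumption.
W = np.linalg.qr(rng.standard_normal((d, m)))[0].T   # m x d with orthonormal rows
a = rng.standard_normal(m) / np.sqrt(m)              # second-layer weights

# Teacher: rank-1 perturbation of the first layer in a direction u
# orthogonal to the base rows, with hypothetical per-neuron signs z.
u = rng.standard_normal(d)
u -= W.T @ (W @ u)                                   # project out the base rows
u /= np.linalg.norm(u)
z = rng.choice([-1.0, 1.0], size=m)

relu = lambda t: np.maximum(t, 0.0)
def student(x, v):   # v is the current estimate of the perturbation direction
    return a @ relu((W + lam * np.outer(z, v)) @ x)

teacher = lambda x: student(x, u)                    # zero loss exactly at v = u
x = rng.standard_normal(d)
print(teacher(x) - student(x, u))                    # 0.0
```

The learning problem then reduces to recovering the single direction u on the unit sphere, which is the reduction our analysis exploits.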
Assumptions on parameters:
First, we do not make any assumptions about and are stating our results in terms of . We are also not assuming anything about except implicitly that it is known to us, which can be relaxed by replacing the terms with . For simplicity of presentation we focus on the case where it is known. Furthermore, we demonstrate via simulations that our findings are not sensitive to the assumptions we make on . We refer the reviewer to the comments left by reviewer DDTa, who noted, “However, I do not view this as a significant drawback for the authors, as the dynamics under this simplified setting already reveal interesting phenomena.” Given that ours is the first theoretical work on fine-tuning beyond the linear regime, we respectfully suggest evaluating it based on the fact that we have rigorously identified interesting and robust phenomena that are genuinely different from what was known in the two-layer literature.
Rank-1 setting:
Finally, we would like to comment on the rank-1 modeling. If the base network weights are 0 and the rank is greater than 1, this already subsumes the regime of multi-index models. While this regime has seen significant attention in recent years, the truth of the matter is that we still know fairly little. Single-index models (where the target network has 1 neuron) have certainly been widely studied and are quite well understood, but the works that go beyond that (i.e. multi-index models) actually make far more restrictive assumptions than we do on the target function (e.g. Abbe et al., 2023).
Hence, our understanding of the multi-index setting is already a bottleneck to understanding the rank > 1 setting, and we think the reviewer should keep this in mind when contextualizing our work. Regardless, even in the rank-1 setting, our modeling is more general than much of the multi-index model literature. In particular, our assumptions on the weights and the activations of the base layers are more general than those of many of the works that consider full training (e.g. Abbe et al., 2023 or [1]).
Regarding “LoRA application in practice,” we do not think this is a serious criticism. First, it is unclear why the reviewer mentions “weight LoRA within the attention layer,” as we focus on MLPs. The interaction between LoRA across layers is indeed an interesting open question; we cannot solve every possible question in our setting, but there is ample evidence that even in our setting there is a rich phenomenology to understand, and we make serious strides toward doing so. The reviewer is correct that our setting is closely related to the broader literature on training dynamics for shallow networks. We would like to point out that this direction remains an ongoing, highly active area of learning theory research that continues to be featured in NeurIPS/ICML/ICLR. Our work offers fresh insights into how the fine-tuning setting is genuinely different from the regime of learning from scratch. It is also one of the first works on fine-tuning to go beyond expressivity and kernel methods. Thus we believe the rank-1 setting deserves study, and the reviewer's criticism about rank > 1 is not fair given that even the special case of 0 base network weights is not well understood.
As the discussion period is coming to an end, we wanted to check if you had the opportunity to read our rebuttal where we addressed your concerns. Your time is greatly appreciated and we are happy to answer any remaining questions.
This paper studies the dynamics of low-rank fine-tuning (LoRA) of 2-layer feed-forward networks.
Strengths
This work studies the training dynamics of LoRA, and is the first such work.
Weaknesses
- The major drawback of this paper is the teacher-student setting. This setting implies that the model can already learn the data in terms of approximation. However, this is not necessarily true, and approximation capability seems to be a key obstacle that may affect the outcome of LoRA. It may make more sense if the authors could make some assumptions on the data and modify the teacher-student setting.
- This paper lacks discussion of LoRA with rank greater than 1. Although the authors leave this topic as future work, the reviewer believes this part is important (see Question 2).
- The main theorems only discuss the cosine similarity of and , and the authors are expected to connect the cosine similarity to the generalization error.
- The division of sections is a bit odd. Section 1 is the most important part and should be divided into several sections. Sections 2-4, however, seem to be parts of the proof, and it may make more sense to merge them into a single section.
Questions
- According to Figure 1, whether is frozen seems to have a significant effect, but the authors claim the opposite. It seems that by freezing , the initial dynamics are faster than in joint training. I am curious to see an explanation of this phenomenon, as well as why the authors conclude that freezing makes little difference.
- (Related to Weaknesses 1 and 2) What happens if the teacher model and the student model have different ranks in the low-rank approximation?
Rank-1 setting and student-teacher setup
Related to Weakness 1, note that even in the original LoRA paper, it was observed that rank-1 perturbations perform quite well on downstream tasks for GPT-3. More generally, the empirical success of LoRA is itself justification for why there does not seem to be an issue with approximation power, making it reasonable to model the data as coming from a low-rank perturbation of a pre-trained model. For this reason, the student-teacher setup is not a priori restrictive, though it is of course interesting to study richer settings, e.g. agnostic ones.
The reviewer’s objection might be the focus on rank-1 perturbations (since if we allow arbitrarily large ranks and second-layer training, we can learn any target function). We have two points related to this:
- Note that for learning rank > 1 perturbations, this is already at least as hard as learning multi-index models from scratch, as setting the base model weights to 0 recovers that setting. There are still many open questions in that setting, despite many years of intense study. We certainly intend to attack this direction in future work, and investigating the case when the rank of the perturbations of teacher and student models differ is part of this open direction. Nevertheless, our work makes clear that even the rank 1 setting is highly nontrivial and offers useful conceptual takeaways.
- The rank-1 setting already reveals interesting phenomena and meaningful separations between training from scratch and fine-tuning (see reviewer DDTa’s comments), as some of the quantities that dictate the complexity of training from scratch (such as the information or generative exponent) do not affect training complexity in the fine-tuning regime.
Generalization error
Related to the reviewer’s comment about relating learning u to the generalization error: in Appendix D3 we show that once u is learned, a simple algorithm that performs linear regression on the second layer of a two-layer network learns the target function to ε error over the dataset. Since this aspect of learning is more straightforward, we focus the discussion in the main text on learning the perturbation direction u, as there is a simple algorithm for learning the target function afterwards (see Theorem 10, Appendix D3). However, in the camera-ready version, we can elaborate more on this in the main text if this was not clear.
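This second stage can be sketched in a few lines (the parametrization W + λ z uᵀ, the ReLU activation, the assumption that the direction is already recovered exactly, and all names are illustrative assumptions, not the paper's exact procedure): once a direction û ≈ u is in hand, the first layer is frozen and the second layer is fit by least squares on the induced features.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, lam, n = 32, 16, 1.0, 4000

W = np.linalg.qr(rng.standard_normal((d, m)))[0].T   # base first layer, orthonormal rows
u = rng.standard_normal(d); u /= np.linalg.norm(u)   # true perturbation direction
z = rng.choice([-1.0, 1.0], size=m)                  # hypothetical perturbation signs
a_star = rng.standard_normal(m) / np.sqrt(m)         # teacher's second layer

relu = lambda t: np.maximum(t, 0.0)
X = rng.standard_normal((n, d))                      # fresh Gaussian inputs
y = relu(X @ (W + lam * np.outer(z, u)).T) @ a_star  # teacher labels

# Stage 2: with u_hat in hand (here u_hat = u for illustration), freeze the
# first layer and solve ordinary least squares for the second layer.
u_hat = u
Phi = relu(X @ (W + lam * np.outer(z, u_hat)).T)     # n x m feature matrix
a_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]
mse = np.mean((Phi @ a_hat - y) ** 2)
print(mse)                                           # ~ 0: target fit exactly
```

In the realizable case sketched here the regression recovers the teacher's second layer exactly; with an imperfect û the residual degrades gracefully with the angle between û and u.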
Organization
Related to the division of sections, we believe this is a fair comment, and we will make the organization of the sections and the presentation of the results clearer in the camera-ready version. We plan to separate the introduction and contributions from the technical discussion.
Frozen
Finally, the reviewer notes that freezing makes a significant difference and makes training faster. We suspect there may be some confusion about Figure 1, since the scales of the two plots are not the same, and the frozen case has a longer time scale. Once we adjust for this scaling, the difference turns out not to be so dramatic: the later stage of training with a frozen is longer only by a factor of 2. For the camera-ready version, we will make the figures clearer by using the same scaling for the time-iteration axis in Figure 1, as we have noticed this might be causing confusion. Nevertheless, we claim the difference between the two settings is not in asymptotic order, but rather a constant factor. Since we are interested in the complexity of fine-tuning in the dimension parameter and the number of neurons , this difference is not significant for us.
As the discussion period is coming to an end, we wanted to check if you had the opportunity to read our rebuttal where we addressed your concerns. Your time is greatly appreciated and we are happy to answer any remaining questions.
This submission examines the sample complexity of an idealized version of low-rank fine-tuning, where the base model is a two-layer neural network with near-orthogonal neurons, and the fine-tuning target function is a rank-1 perturbation of the first-layer parameters of the base model. The authors analyze the optimization dynamics of the online spherical gradient algorithm to find this rank-1 perturbation, based on the Hermite decomposition of the landscape. It is shown that, unlike the learning of single-index models, the fine-tuning sample complexity does not scale with the information exponent defined in (Ben Arous et al., 2021).
Strengths
The complexity of learning single- and multi-index models using SGD has been extensively studied in recent years. This submission considers a distinct setting from most prior works, focusing on learning a rank-1 perturbation on top of a rank- base model. Notably, the authors show that the quantity governing the statistical efficiency of online SGD in the standard single-index setting no longer plays a role in the fine-tuning setting. This result is a valuable contribution relevant to the ICLR community, as it suggests that once a base model is obtained, learning additional low-rank structure may not face the same computational challenges as learning a low-dimensional function from scratch.
Weaknesses
I have the following questions and concerns:
- There is a sizable gap between the studied problem setting and LoRA due to the absence of joint training (and the quantized vector ). Moreover, the near-orthogonal weights of the base model and the assumed orthogonality between the base weights and the rank-1 perturbation are nontrivial assumptions. However, I do not view this as a significant drawback for the authors, as the dynamics under this simplified setting already reveal interesting phenomena.
- The authors claim that the information exponent dictates the complexity of noisy gradient descent and provide a CSQ lower bound. However, it is known that SGD can achieve a sample complexity that does not depend on the information exponent (Lee et al., 2024; Arnaboldi et al., 2024). In these analyses, the complexity of SGD is instead governed by the generative exponent (Damian et al., 2024) associated with the SQ lower bound. From the authors' assumption on the activation , it appears that higher generative exponent functions are not excluded, which would create a gap between the statistical efficiency of fine-tuning and SQ learners. If this is the case, a discussion of this stronger separation should be included.
- Could the authors comment on the tightness of the sample complexity analysis? For sufficiently large , it appears that the fine-tuning complexity could be worse than learning a single-index model from scratch (if the information/generative exponent is small). A lower bound statement for online SGD analogous to Theorem 1.4 in (Ben Arous et al., 2021) would be informative.
- If the first-layer parameters are not optimized, but instead adaptation to the target function occurs by training the top layer , is there a lower bound on the required sample size? Intuitively, since the perturbation is assumed orthogonal to , the kernel model defined by the first layer cannot learn this perturbed teacher model.
- On Line 291, the authors wrote: "As we discuss at the end of Section 2, even if the base model or teacher model satisfies this in the setting we consider, it does not appear to pose a barrier for low-rank fine-tuning in the same way that it does for learning from scratch." The lower bounds cited here are constructed with specific weight configurations. Notably, such computational barriers are absent when the neurons are nearly orthogonal (Oko et al., 2024). It is unclear to me whether such a worst-case scenario can be attained by an orthogonal + rank-1 weight matrix. Could the authors clarify?
- The perturbed teacher model resembles a two-layer neural network where the first-layer parameters are optimized with a single gradient step. Specifically, the scaling of (which matches the Frobenius norm) aligns with the large learning rate regime in (Ba et al., 2022; Cui et al., 2024), and it is known that such a large learning rate is beneficial for learning single-index tasks. Could the authors comment on potential settings where the studied perturbed model might exhibit greater expressivity than the base model?
References:
- Lee et al. (2024), Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit.
- Arnaboldi et al. (2024), Repetita iuvant: Data repetition allows SGD to learn high-dimensional multi-index functions.
- Damian et al. (2024), Computational-statistical gaps in Gaussian single-index models.
- Oko et al. (2024), Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations.
- Ba et al. (2022), High-dimensional asymptotics of feature learning: how one gradient step improves the representation.
- Cui et al. (2024), Asymptotics of feature learning in two-layer networks after one gradient-step.
Questions
See weaknesses.
We thank the reviewer for the extensive and detailed feedback. We will respond to all the points:
- As the reviewer noted, our paper aims to shed light on the practical performance of fine tuning (specifically low-rank fine tuning), and some of the assumptions are to facilitate the analysis. Regardless, we reveal an interesting separation between fine-tuning and training from scratch.
- We agree that there are works that go beyond the information exponent (e.g. via sample repetition or smoothing), but they require modifications to the training algorithm, such as repeating samples. This is the main reason for our emphasis on the information exponent, but we will add discussion of these recent works to our related work section. Note that our algorithm does not require reusing samples, so it already establishes a separation between fine-tuning and learning single-index models without additional modifications. Regardless, we can compare our complexity results to general SQ algorithms, as our assumptions on the activation are quite general and encapsulate smooth, Lipschitz activations. As the reviewer has noted, Damian et al., 2024 show that even such functions can have high generative exponents, thus leading to runtime scaling with a large polynomial in the dimension. Note that their results are for single-index models and not for multi-index models as in Abbe et al., 2023, so we can only compare learning single-index models to fine-tuning for results that go beyond CSQ. However, if their results were generalized to multi-index functions with similar constructions, our separations should generalize as well. Overall, we will implement the reviewer's suggestion in our related work section and discussion of results.
- For the camera-ready version, we can aim to offer results similar to those in Ben Arous et al., 2021, as our analysis should be tight provided the variance bounds are tight. The main limitation in showing tightness is finding tight bounds on the variance of the gradients (as in the single-index setting). Because our setting involves many more degrees of freedom than single-index models, how the w_i's and the lambda_i's relate to each other can affect the variance, which in turn affects the tightness of our analysis. Hence, we do not immediately provide a tightness guarantee, as this would require more carefully bounding the variance; our goal is to investigate the dimension dependence, which is tight since the information-theoretic limit is linear in the dimension.
- Yes, it should not be too hard to prove lower bounds on the error when only the second layer is trained. This might be easy to show when , and we can add this result for the camera-ready submission.
- We do not claim one can immediately obtain complexity for learning the target network in our setup; rather, we claim the moment tensors or the activation do not sensitively affect the complexity. We will reword this passage to make this clearer. We cite the existing works to provide exposition on the complexity of learning two-layer networks, but it is an interesting direction to build on our analysis and exhibit examples of networks that are very hard () to learn from scratch and significantly easier to fine-tune. We are optimistic that such a construction could exist in the angularly-separated + rank-1 setting.
- While the reviewer points out an interesting connection, the setting we study is ultimately not about expressivity, because we focus on training dynamics at the input layer, not the output layer. Also note that we do not make assumptions about the randomness of W; that is a different setting, which relies on the random-feature model.
Finally, we are aiming to make some revisions to the paper given the reviewer’s comments. Concretely,
- We will add more discussion about the lower bounds (e.g. generative exponent) and algorithms that can go beyond the information exponent, and mention that our assumption on the activation holds in that setting.
- We will establish lower bounds for training only the second layer without learning the new perturbation direction, to also show separations between low-rank fine-tuning and adapting the second layer.
- We will comment on the tightness of the finite-sample analysis.
As the discussion period is coming to an end, we wanted to check if you had the opportunity to read our rebuttal where we addressed your concerns. Your time is greatly appreciated and we are happy to answer any remaining questions.
This work analyzes the training dynamics of Low-Rank Adaptation (LoRA) fine-tuning for 2-layer neural networks. Specifically, the work considers a student-teacher setup where a pre-trained model (student) learns to approach a target function (teacher) with a rank-1 modification. The analysis relies on some very strong assumptions, including knowledge of the parameter. The analysis utilizes the machinery for analyzing 2-layer neural networks.
Strengths
Personally, I have a hard time evaluating the merits of this work. The results are quite restrictive in that they only consider a rank-1 update and require certain very strong assumptions such as the orthogonality of the features. On the other hand, it is the first result of its kind, so there is certainly value to the result.
The analysis utilizes much of the built-up machinery for analyzing 2-layer neural networks and single-index models. I am not an expert in this particular area, so I am not able to properly judge how novel the proof techniques are, given the existing machinery. In the end, however, I feel that the results are not that strong compared to other results people have shown with single-index models in non-fine-tuning setups. I am also not quite sure what the main takeaway of the results should be.
I feel that an appropriate criterion for a paper like this would be to judge the novelty of the proof technique, looking at how much this paper advances the toolset in this area. If this result is obtained through a straightforward application of the existing tools, then it is not that interesting, since the conclusions are not that strong. If these results required the authors to build up substantially new tools and use novel arguments, then that would constitute progress. However, I am not intimately familiar with the 2-layer NN analysis techniques used in this paper, so this is difficult for me to judge.
Weaknesses
.
Questions
.
Novelty of results and proof
Related to the novelty of our proof technique and the strength of our results, while we do use some tools from existing work such as Hermite analysis and the drift-martingale decomposition (as do essentially all papers in this literature), there is a significant gap between the works that use these and our setting. To give the reviewer a sense for this, note that in the well-studied single-index model setting (Ben Arous et al., 2021) there is only one parameter (the overlap between ground truth and estimate) that controls the landscape, and the behavior of the population gradient is easily shown to be monotone. This is because they are training a single neuron to learn a function depending on a 1-dimensional projection. Their main contribution is the drift-martingale analysis, which connects their population gradient analysis (which is quite simple with single-index models) to the finite-sample analysis (i.e. stochastic gradient descent, rather than population gradient descent). We readily use this tool, but it is entirely orthogonal to the main thrust of our work, which is in analyzing the population gradient.
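For intuition on this single-index baseline, here is a minimal sketch of online spherical SGD on a single-index model, tracking the one overlap parameter that controls the landscape. The tanh activation, step size, and horizon are arbitrary illustrative choices, not values from either paper:

```python
import numpy as np

rng = np.random.default_rng(2)
d, eta, T = 100, 0.02, 40000

u = np.zeros(d); u[0] = 1.0                   # ground-truth direction
v = rng.standard_normal(d); v /= np.linalg.norm(v)

sigma = np.tanh                               # odd activation, information exponent 1
dsigma = lambda t: 1.0 - np.tanh(t) ** 2

for _ in range(T):
    x = rng.standard_normal(d)                # one fresh Gaussian sample per step
    g = (sigma(v @ x) - sigma(u @ x)) * dsigma(v @ x) * x  # per-sample gradient
    g -= (g @ v) * v                          # project onto the sphere's tangent space
    v -= eta * g
    v /= np.linalg.norm(v)                    # retract back to the unit sphere

print(v @ u)                                  # overlap after training, close to 1
```

Here the whole landscape collapses to the scalar overlap v·u, and the population drift is monotone in it; the point of our analysis is that no such one-parameter reduction is available in the fine-tuning setting.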
There is a sizable jump in complexity when going from lower bounding the population gradient in the traditional single-index setting to doing so in our fine-tuning setting. In our setting, the population gradient is much more complex and is affected by many variables, such as , the angles between , and the overlap . Because significantly more parameters affect the optimization landscape, we need to make some assumptions, simply due to the additional degrees of freedom in our model. As reviewer DDTa notes, these assumptions are not so limiting; indeed, compared to what is known in the multi-index setting, we actually make relatively weaker assumptions on the activations and structure of the target function. In particular, our contributions include but are not limited to:
- Reducing the analysis of training dynamics in MLPs to lower bounding a non-linearity term () in the population gradient. Note that the idea generalizes to other settings, where one can follow our proof structure to even analyze dynamics that involve more than 1 variable (e.g. 2).
- Analyzing the structure of the highly non-linear, non-convex function , which is significantly more involved compared to previous work. This differs from previous analyses in many ways:
- The angularly separated setting involves various moment tensors whose effect can be arbitrary. We provide anti-concentration results (e.g. probabilistic lower bounds, Appendix B) relating the inner products of these moment tensors to the magnitude of the population gradient.
- We work with a much more general class of activations than many of the works that focus on two-layer networks.
- Other than the generic assumptions about non-pathology (e.g. linear polynomial growth, existence of a weak derivative), we make a mild assumption about the decay of the Hermite coefficients of the activation. This is much more general than much of the work that considers polynomial, smooth, or specifically ReLU activations.
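The quantities involved here are easy to compute numerically. The sketch below evaluates the Hermite coefficients of an activation by Gauss-Hermite quadrature and reads off the information exponent (the smallest k ≥ 1 with a nonzero coefficient, in the sense of Ben Arous et al., 2021); it is a generic illustration, not tied to the paper's specific decay assumption:

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coeffs(sigma, K, n_quad=80):
    """c_k = E[sigma(g) He_k(g)] / k! for g ~ N(0,1), probabilists' Hermite basis."""
    x, w = hermegauss(n_quad)            # nodes/weights for the weight exp(-x^2/2)
    w = w / np.sqrt(2.0 * np.pi)         # normalize so weights sum to 1 (Gaussian mean)
    return np.array([(w @ (sigma(x) * hermeval(x, [0.0] * k + [1.0])))
                     / math.factorial(k) for k in range(K + 1)])

def information_exponent(sigma, K=10, tol=1e-8):
    c = hermite_coeffs(sigma, K)
    return int(np.argmax(np.abs(c[1:]) > tol)) + 1  # first k >= 1 with c_k != 0

relu = lambda t: np.maximum(t, 0.0)
print(information_exponent(relu))                   # 1 (c_1 = 1/2 for ReLU)
print(information_exponent(lambda t: t ** 2))       # 2 (purely even, so c_1 = 0)
```

In the training-from-scratch literature this exponent dictates the sample complexity; the point of our results is that in the fine-tuning regime it does not.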
Strength of conclusions
Now, focusing on our results: we show separations between fine-tuning, training from scratch, and linearized methods, which we believe to be a fairly strong conclusion. A priori, one would expect fine-tuning to be more tractable than pre-training in practice simply because it occurs in a linearized regime, as was suggested in prior work. We give the first proof that this is not the case, and that fine-tuning operates in a genuinely new regime not previously studied in the learning theory literature. The fact that the dynamics are genuinely nonlinear yet not dependent on the particulars of the activation function is quite surprising.
Rank-1 setting
Finally, we would like to comment on the rank-1 modeling. If the base network weights are 0 and the rank is greater than 1, this already subsumes the regime of multi-index models. While this regime has seen significant attention in recent years, the truth of the matter is that we still know fairly little. Single-index models (where the target network has 1 neuron) have certainly been widely studied and are quite well understood, but the works that go beyond that (i.e. multi-index models) actually make far more restrictive assumptions than we do on the target function (e.g. Abbe et al., 2023).
Hence, our understanding of the multi-index setting is already a bottleneck to understanding the rank > 1 setting, and we think the reviewer should keep this in mind when contextualizing our work. Regardless, even in the rank-1 setting, our modeling is more general than much of the multi-index model literature. In particular, our assumptions on the weights and the activations of the base layers are more general than those of many of the works that consider full training.
As the discussion period is coming to an end, we wanted to check if you had the opportunity to read our rebuttal where we addressed your concerns. Your time is greatly appreciated and we are happy to answer any remaining questions.
As the discussion period is ending, we would like to thank reviewer QMRj for responding to our rebuttal and, as a result, increasing their score. While the other reviewers did not respond to our comments, we believe that we similarly addressed their concerns.
To reiterate, our paper is the first work to go beyond the kernel regime and give an end-to-end analysis of low-rank fine-tuning. We note that the existing analyses of fine-tuning do not apply when the low-rank perturbation is not too small. We focused on the rank-1 perturbation case, as this is a clean but rich setting in which we can show meaningful phenomena, such as a separation between fine-tuning and training from scratch. We also reveal other interesting phenomena, such as the insensitivity to the non-linearity of the target function, that distinguish the fine-tuning regime from the feature learning regime. As reviewer DDTa noted, the assumptions we make do not pose a barrier to understanding fine-tuning. Our work has many interesting follow-up directions, such as going beyond rank-1 and understanding the transition from fine-tuning to feature learning: at what point does the perturbation 'erase' previous progress? Ultimately, our work has many merits, including a valuable contribution to understanding fine-tuning and a new, rich learning setup for future work.
This paper studies the sample complexity of a very restricted type of low-rank fine-tuning. The authors consider a student-teacher setting: the base model is a two-layer NN, and the fine-tuning target is a rank-1 perturbation of the first layer of the base model. Similar to existing works, the authors consider the online spherical gradient algorithm to recover the perturbation. It is shown that the sample complexity does not scale with the properties of the activation's Hermite expansion.
This paper was reviewed by four expert reviewers and received the following Scores/Confidences: 5/3, 5/4, 5/2, 6/3. I think the paper studies an interesting topic, but the authors were not able to convince the reviewers sufficiently well about the novelty and the relevance of the setting they consider. The following concerns were brought up by the reviewers:
- The major issue is that the studied model deviates too much from the actual LoRA setting.
- Somewhat less important: it is difficult to assess the tightness of the provided sample complexity results, as the authors do not provide any kind of lower bounds.
- Certain results are not well presented/explained.
No reviewer championed the paper, and the reviewers are not particularly excited about it. I think the majority of the concerns can be addressed, but that would require significant revision and another round of reviews. As such, based on the reviewers' suggestions, as well as my own assessment of the paper, I recommend not including this paper in the ICLR 2025 program.
Additional Comments from the Reviewer Discussion
The authors provided a detailed response and answered the reviewers' questions. However, the reviewers were not convinced about the setting considered in this work.
Reject