PaperHub
NeurIPS 2025 · Poster · 4 reviewers
Overall score: 6.8/10
Ratings: 5, 3, 5, 4 (min 3, max 5, std dev 0.8)
Average confidence: 3.8
Novelty 3.3 · Quality 2.8 · Clarity 3.0 · Significance 2.5

LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades

Submitted: 2025-04-14 · Updated: 2025-10-29
TL;DR

How can we leverage the existing LoRA weights to adapt to the latest model version with less effort?

Abstract

Keywords
LLM Upgrades; LoRA Adaptation

Reviews and Discussion

Review 1 (Rating: 5)

This paper introduces a modular framework for reusing LoRA adapters across newer versions of language models. The idea is novel and has practical relevance. Experimental results demonstrate that LoRASuite achieves better performance than conventional LoRA fine-tuning in low-data regimes.

Strengths and Weaknesses

Strengths

  • How to use LoRA weights in a new model version is a very interesting idea.
  • The paper is well-organized and well-written.
  • The experimental results are comprehensive, especially the comparisons across many backbone models.

Weaknesses

  • Actually, training a LoRA adapter for a new version of a model is not expensive. Although LoRASuite achieves some improvement over the base model, it still needs an additional training dataset to conduct the lightweight fine-tuning, which is not flexible.

  • No code is released, so I have doubts about the reproducibility of the results.

  • LoRASuite is sensitive to the learning rate. Given this, the experimental results should include more training-loss analysis, for example a training-loss figure and a model size comparison.

Questions

  • What are the statistics of the training dataset?

  • What is the training dataset for lightweight fine-tuning?


  • Could you present the training curves of the lightweight fine-tuning for LoRA (100), LoRASuite with LFT (100), and LoRA (10k)?

  • Could you show the trainable parameter sizes for LoRA (100), LoRASuite w LFT (100), and LoRA (10k)?

  • It is a little strange that LoRASuite w/o LFT and the base model get almost the same results. For example, when using the LoRA adapters from MiniCPM-S-1B on MiniCPM-2B, LoRASuite w/o LFT gets 23.96 compared to 23.85 for the base model.

  • "Although LoRASuite requires small-scale fine-tuning, this results in a modest memory reduction of 5.5GB, primarily due to the smaller sample size" Why can a smaller sample size reduce memory?

Limitations

Yes

Justification for Final Rating

I have read the paper again, and I think the idea is very interesting. The authors addressed most of my concerns, and I hope this paper can be accepted.

Formatting Issues

No

Author Response

Q1: What are the statistics of the training dataset?

A1: Thank you for your question. To ensure reproducibility, we use the same training and evaluation dataset as LLM-Adapters, which has over 1.2k stars on GitHub and has been widely adopted in published studies. We will include a detailed description of the dataset statistics in the experimental setup section for clarity and completeness.


Q2: What is the training dataset for lightweight fine-tuning?

A2: Thank you for your valuable feedback. The lightweight fine-tuning dataset is currently a randomly selected subset of the dataset used for full-scale LoRA training. We will clarify this in the revised manuscript.


Q3: Could you present the training curves of the lightweight fine-tuning for LoRA (100), LoRASuite with LFT (100), and LoRA (10k)?

A3: Thank you for your valuable feedback. As we cannot include additional figures during the rebuttal period, we instead provide the step-wise training loss for LoRA (100) and LoRASuite with LFT (100) under the settings of batch size = 16 and micro batch size = 4.

Model                    | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Step 6 | Step 7 | Step 8 | Step 9
LoRA (100)               | 1.210  | 1.199  | 1.081  | 1.120  | 1.067  | 1.108  | 0.996  | 1.124  | 1.027
LoRASuite with LFT (100) | 1.140  | 1.028  | 0.921  | 0.914  | 0.786  | 0.859  | 0.955  | 0.738  | 0.680

Model                    | Step 10 | Step 11 | Step 12 | Step 13 | Step 14 | Step 15 | Step 16 | Step 17 | Step 18
LoRA (100)               | 1.113   | 1.018   | 1.150   | 1.150   | 1.374   | 1.004   | 1.078   | 1.120   | 1.086
LoRASuite with LFT (100) | 0.702   | 0.683   | 0.709   | 0.749   | 0.675   | 0.625   | 0.601   | 0.760   | 0.663

Q4: Could you show the trainable parameter sizes for LoRA (100), LoRASuite w LFT (100), and LoRA (10k)?

A4: Thank you for your valuable feedback. Under the default setting with rank 32 for MiniCPM-2B, the number of trainable parameters for LoRA (100), LoRASuite with LFT (100), and LoRA (10k) is the same—23.59M, which is approximately 0.78% of all model parameters. We will clarify this in the revised manuscript.
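
For reference, this count follows directly from the LoRA parameterization: each adapted weight of shape (d_out, d_in) contributes r * (d_in + d_out) trainable parameters. The sketch below uses illustrative module shapes (40 layers, four 2304 x 2304 projections per layer, rank 32) which happen to reproduce the 23.59M figure, but the exact list of adapted modules is an assumption, not taken from the paper.

# Sketch: counting LoRA trainable parameters. Each adapted weight of shape
# (d_out, d_in) adds A (r x d_in) and B (d_out x r), i.e., r * (d_in + d_out) parameters.
# The module shapes below are illustrative assumptions, not the paper's exact config.

def lora_param_count(target_shapes, rank):
    """target_shapes: iterable of (d_out, d_in) for every adapted weight matrix."""
    return sum(rank * (d_in + d_out) for d_out, d_in in target_shapes)

# Hypothetical setup: 40 layers, four 2304 x 2304 projections adapted per layer, rank 32.
target_shapes = [(2304, 2304)] * 4 * 40
print(f"{lora_param_count(target_shapes, rank=32) / 1e6:.2f}M trainable parameters")  # 23.59M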


Q5: It is a little strange that LoRASuite w/o LFT and the base model get almost the same results. For example, when using the LoRA adapters from MiniCPM-S-1B on MiniCPM-2B, LoRASuite w/o LFT gets 23.96 compared to 23.85 for the base model.

A5: Thank you for your valuable feedback. As noted in Line 268, we believe this phenomenon occurs because the transformed LoRA, when relying solely on matrix multiplication without backpropagation, has limited compatibility with the new model. This is why a small-scale fine-tuning step is necessary. The significant performance improvement achieved after this fine-tuning further supports our assumption.


Q6: "Although LoRASuite requires small-scale fine-tuning, this results in a modest memory reduction of 5.5GB, primarily due to the smaller sample size." Why can a smaller sample size reduce memory?

A6: Thank you for your valuable feedback. While the core memory footprint of the model—such as parameters, gradients, and optimizer states—is independent of the dataset size, several other factors contribute to increased memory usage when the dataset is larger.

First, data pipeline overhead and auxiliary states (e.g., DataLoader prefetching, Trainer’s internal buffers, dataset shuffling, and caching) scale with both the number of samples and training steps. Second, if the full dataset is loaded into memory (e.g., using HuggingFace Datasets load_dataset(...) or pre-tokenized and stored as a list), the total number of samples directly affects host DRAM usage. Lastly, for variable-length inputs, larger datasets increase the likelihood of batches containing extremely long sequences, which can trigger excessive padding and dynamic memory allocation, resulting in transient memory spikes.
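
To make the padding point concrete, here is a small synthetic illustration (the length distribution and batch size are assumptions, not the paper's data): with a long-tailed length distribution, the worst-case padded batch grows noticeably with dataset size.

# Synthetic illustration: larger datasets are more likely to contain very long sequences,
# so some batches get padded to a much larger length, causing transient memory spikes.
# The length distribution and batch size here are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
BATCH_SIZE = 16

def worst_padded_batch_tokens(num_samples):
    # Long-tailed synthetic token lengths, capped at a 4096-token context window.
    lengths = np.minimum(rng.lognormal(mean=5.0, sigma=0.8, size=num_samples), 4096).astype(int)
    rng.shuffle(lengths)
    batches = [lengths[i:i + BATCH_SIZE] for i in range(0, num_samples, BATCH_SIZE)]
    # With dynamic padding, every sequence in a batch is padded to the batch maximum.
    return max(len(b) * int(b.max()) for b in batches)

print("worst padded batch (100 samples):", worst_padded_batch_tokens(100))
print("worst padded batch (10k samples):", worst_padded_batch_tokens(10_000))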

Comment

Summary of Our Rebuttal

  • Q1 & Q2 – Data Details: We use the same LLM-Adapters dataset (≈1.2K GitHub stars) that’s widely adopted in the literature. The “lightweight” fine-tuning (LFT) dataset is a random subset of the full LoRA training data. We will clarify this selection process in the revised manuscript.

  • Q3 – Training Curves: While we can’t add figures now, we’ve provided step-wise training losses for LoRA (100) vs. LoRASuite + LFT (100) under batch_size=16, micro_batch=4, demonstrating faster convergence for LoRASuite.

  • Q4 – Parameter Counts: With rank = 32 on MiniCPM-2B, all three methods—LoRA (100), LoRASuite + LFT (100), and LoRA (10 K)—train 23.59 M parameters (~0.78% of model size). We will state this explicitly.

  • Q5 – LoRASuite w/o LFT vs. Base Model: Transformed LoRA without fine-tuning relies solely on matrix multiplication, limiting compatibility with the upgraded model. The notable gains after a small LFT step confirm this assumption. Future research could investigate strategies to eliminate this LFT step without compromising performance.

  • Q6 – Memory Reduction with Smaller Samples: Although core footprints (params, grads, optimizer states) are fixed, larger datasets increase overhead in data loading, buffering, caching, and padding for variable-length inputs, causing transient memory spikes. Reducing sample size mitigates these factors.


We appreciate your detailed feedback. Are there any other questions or concerns—and, if our responses address your points, would you kindly consider updating your final rating?

Comment

Thanks for your response.

  1. For Q3, it seems that the training of LoRA has not converged well, which is a little strange. Why is LoRASuite with LFT more powerful?

  2. For Q4, when using LoRASuite without LFT, as you mentioned, “relying solely on matrix multiplication without backpropagation has limited compatibility with the new model.” Therefore, I would expect the performance to be poor from the start, suggesting that the structure of MiniCPM-2B would be compromised at the very beginning.

  3. I did not see the anonymous code, so I cannot verify that the results can be reproduced.

Comment

FQ1: For Q3, it seems that the training of LoRA has not converged well, which is a little strange. Why is LoRASuite with LFT more powerful?

Thanks for your question. As mentioned in Line 225, vanilla LoRA (100) uses the default linear learning-rate scheduler with a warm-up phase to stabilize training from random initialization. However, with limited data, a considerable number of steps is spent in the warm-up stage, which slows convergence and limits performance. In contrast, LoRASuite starts with transformed parameters already close to a good solution, allowing us to omit warm-up and use a higher learning rate to accelerate convergence. This leads to better performance, especially in low-data regimes.
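
For illustration, a minimal sketch of the two schedules being contrasted, written with plain PyTorch LambdaLR; the learning rates, warm-up length, and step counts are placeholders rather than the paper's actual hyperparameters.

# Sketch of the two learning-rate schedules being contrasted (placeholder values).
# Vanilla LoRA (100): linear decay with a warm-up phase, starting from random init.
# LoRASuite + LFT: no warm-up and a higher constant rate, since the transformed
# parameters already start near a good solution.

import torch

def linear_with_warmup(optimizer, warmup_steps, total_steps):
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                                      # ramp up
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))  # decay
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

params = [torch.nn.Parameter(torch.zeros(1))]

# Vanilla LoRA: with only ~18 steps available, a sizable share is spent warming up.
opt_lora = torch.optim.AdamW(params, lr=3e-4)
sched_lora = linear_with_warmup(opt_lora, warmup_steps=6, total_steps=18)

# LoRASuite + LFT: higher learning rate, constant schedule, no warm-up.
opt_suite = torch.optim.AdamW(params, lr=1e-3)
sched_suite = torch.optim.lr_scheduler.LambdaLR(opt_suite, lambda step: 1.0)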


FQ2: For Q4, when using LoRASuite without LFT, as you mentioned, “relying solely on matrix multiplication without backpropagation has limited compatibility with the new model.” Therefore, I would expect the performance to be poor from the start, suggesting that the structure of MiniCPM-2B would be compromised at the very beginning.

We'd like to clarify that the performance of LoRASuite without LFT is generally worse or comparable to the base MiniCPM-2B model without any adapters, as shown in Tables 4 and 5. This aligns with our findings in Figures 2 and 3, where similar behavior is observed across various LLM upgrade types. The one notable exception is the Yi-6B to Yi-1.5-9B upgrade, where LoRASuite w/o LFT significantly outperforms the base model. This is likely because the upgrade involves only changes in layer count, making matrix-based transformation alone sufficient for effective transfer. We will clarify this point in the revised manuscript.


FQ3: I did not see the anonymous code, so I cannot verify that the results can be reproduced.

Thank you for pointing this out. We fully understand the importance of reproducibility and had intended to share an anonymous GitHub link. However, the NeurIPS policy email on July 27 explicitly prohibits including any URLs—anonymous or not—in the rebuttal due to identity leakage concerns. We will make the code publicly available upon acceptance to ensure full reproducibility.

Comment

Thanks for your response. I think most of my concerns have been addressed. I read your paper again and the other reviews, and I would like to raise my score to 5 and hope this paper will be accepted. I hope the authors can further study how to reuse LoRA adapters across different backbones (perhaps not only LoRA, but also other PEFT methods).

Comment

Thank you very much for your thoughtful follow-up and for raising your score. We sincerely appreciate your constructive feedback and are glad to hear that most of your concerns have been addressed.

We also appreciate your suggestion regarding the broader applicability of PEFT methods beyond LoRA. We fully agree that exploring adapter reuse across different backbone architectures is an important and promising direction. We plan to investigate this further in future work and look forward to contributing more to this area.

Thank you again for your time and support.

Review 2 (Rating: 3)

This paper studies the problem of adapting LoRA adapters for large language models (LLMs) when upgrading to new model versions that differ in architecture or parameters. The authors identify six types of incompatibilities that may arise during such upgrades. To address these, they propose LoRASuite, a modular framework that applies a sequence of techniques, including dimensionality transfer, layer and attention head alignment, and a brief fine-tuning step, to transfer LoRA adapters from an older LLM to a newer one. The method is evaluated on several backbone LLM families (like Llama, Yi, MiniCPM, Qwen, etc.) and two types of tasks, reporting improved adapter transfer efficiency, compute savings, and downstream performance compared to standard LoRA retraining.

Strengths and Weaknesses

Strengths:

  1. The paper provides a systematic and detailed analysis of six specific incompatibility types, such as changes in vocabulary size, hidden dimensions, and attention structure, that arise when transferring LoRA adapters across large language model upgrades.
  2. It introduces a modular framework, LoRASuite, that integrates linear transfer, layer-wise alignment using CKA and dynamic programming, and attention head matching via the Hungarian algorithm, offering a practical and potentially extensible solution for LoRA transfer.
  3. The proposed approach is empirically validated on a diverse set of LLM backbones and on both math and commonsense QA tasks, with improved performance metrics compared to standard LoRA retraining.

Weaknesses:

  1. Although the paper proposes an interesting method for transferring LoRA adapters, it is primarily evaluated on small to medium-scale models (up to 9B parameters), whereas its practical value would be much greater for larger models (e.g., 13B to 70B+), where adapter retraining is significantly more resource-intensive. This leaves the impact on truly large-scale LLMs unverified.
  2. The selected models used for evaluation are relatively dated, for example, Llama-3 to Llama-3.1/3.2 and Qwen1.5 to Qwen2/2.5/3. As a result, it remains unclear whether the proposed method can bring notable improvements when applied to more recent or stronger model baselines.
  3. The main advantage of the proposed method, saving compute and improving performance in low-data regimes, is only demonstrated when the fine-tuning dataset is extremely small (e.g., 100 samples), and the benefit diminishes or disappears as the dataset size increases, according to results in Section 4.2. Furthermore, the paper does not describe how the small fine-tuning datasets are selected, nor does it report results averaged over multiple random draws. This omission is problematic given the potential instability and variance in performance when using such small data subsets.

Questions

  1. Please address the weaknesses mentioned before.
  2. line 292: what does "full-scale LoRA retraining" mean? Is it the same as "LoRA (10k)" in Table 4/5? Same for "LoRA (small)". Please define it in the paper explicitly and keep it consistent throughout the paper.
  3. Why did you choose to upgrade Qwen1.5-1.8B to Qwen2.5-3B, while skipping Qwen2-1.5B? It could be interesting to see a comparative study on the performance of LoRASuite when upgrading to Qwen2-1.5B vs. Qwen2.5-3B, as the former is a more direct upgrade in terms of model size and architecture.

Limitations

Yes

Justification for Final Rating

While the authors have acknowledged the concerns and promised additional experiments and clarifications, the major concerns regarding scalability to large/recent models and the statistical robustness of the small-sample adaptation regime remain unresolved at this stage. I would like to keep my original rating.

Formatting Issues

N/A

Author Response

Q1& Q2: Although the paper proposes an interesting method for transferring LoRA adapters, it is primarily evaluated on small to medium-scale models (up to 9B parameters), whereas its practical value would be much greater for larger models (e.g., 13B to 70B+), where adapter retraining is significantly more resource-intensive. This leaves the impact on truly large-scale LLMs unverified. The selected models used for evaluation are relatively dated, for example, Llama-3 to Llama-3.1/3.2 and Qwen1.5 to Qwen2/2.5/3. As a result, it remains unclear whether the proposed method can bring notable improvements when applied to more recent or stronger model baselines.

Thank you for your feedback. Our initial motivation was primarily aimed at mobile scenarios, where an on-device foundation model supports multiple application-specific LoRA adapters for different downstream tasks (as detailed in AICore). Thus, we initially focused on small to medium-scale models. We acknowledge your valuable point that applying LoRASuite to larger-scale models (e.g., 13B to 70B+ parameters) would significantly enhance its practical value, especially given the higher resource costs of retraining adapters at this scale. Due to time constraints, we will include additional experiments with more recent and larger models in the revised manuscript to address this concern and highlight the broader applicability of our approach.


Q3: The main advantage of the proposed method, saving compute and improving performance in low-data regimes, is only demonstrated when the fine-tuning dataset is extremely small (e.g., 100 samples), and the benefit diminishes or disappears as the dataset size increases, according to results in Section 4.2. Furthermore, the paper does not describe how the small fine-tuning datasets are selected, nor does it report results averaged over multiple random draws. This omission is problematic given the potential instability and variance in performance when using such small data subsets.

Thank you for your constructive feedback. Since the converted LoRA parameters already encapsulate the knowledge learned by the original model, the purpose of small-scale fine-tuning is merely to help the parameters adapt to the upgraded model. As the dataset size increases, the additional fine-tuning can easily lead to overfitting, which explains the diminishing benefit. Currently, the small fine-tuning datasets are randomly selected. We will explicitly describe the dataset selection process in the revised manuscript.


Q4: line 292: what does "full-scale LoRA retraining" mean? Is it the same as "LoRA (10k)" in Table 4/5? Same for "LoRA (small)". Please define it in the paper explicitly and keep it consistent throughout the paper.

Thank you for your feedback. "full-scale LoRA retraining" refers to the scenario labeled as "LoRA (10k)" in Tables 4 and 5. Additionally, as described in Line 293, LoRASuite involves a small-scale fine-tuning step to mitigate potential performance degradation caused by direct matrix multiplication; thus, "LoRA (small)" corresponds to the same small-scale fine-tuning labeled as "LoRA (100)" in Tables 4 and 5 when upgrading from MiniCPM-S-1B to MiniCPM-2B. We will explicitly define these terms and ensure consistency throughout the revised manuscript.


Q5: Why did you choose to upgrade Qwen1.5-1.8B to Qwen2.5-3B, while skipping Qwen2-1.5B? It could be interesting to see a comparative study on the performance of LoRASuite when upgrading to Qwen2-1.5B vs. Qwen2.5-3B, as the former is a more direct upgrade in terms of model size and architecture.

Thank you for this insightful suggestion. We initially chose the upgrade path from Qwen1.5-1.8B to Qwen2.5-3B to highlight LoRASuite’s effectiveness across intermediate size, layer count, and attention heads. However, we agree that evaluating a more direct upgrade to Qwen2-1.5B offers valuable insights into LoRASuite’s performance under incremental changes. The following table presents the performance on math tasks when upgrading from Qwen1.5-1.8B to Qwen2-1.5B, further confirming the effectiveness of LoRASuite in this setting. We will include this comparative study in the revised manuscript.

Base Model   | PEFT              | AddSub | MultiArith | SingleEq | GSM8K | AQuA  | MAWPS | SVAMP | Avg.
Qwen1.5-1.8B | -                 | 32.91  | 51.33      | 52.56    | 9.7   | 11.02 | 47.06 | 32.7  | 33.90
Qwen1.5-1.8B | LoRA (10k)        | 52.91  | 79.67      | 59.25    | 12.74 | 14.17 | 55.04 | 33.5  | 43.90
Qwen2-1.5B   | -                 | 46.58  | 72         | 42.32    | 19.86 | 13.78 | 33.61 | 25.3  | 36.21
Qwen2-1.5B   | LoRASuite w/o LFT | 45.06  | 73.67      | 45.08    | 21    | 15.35 | 32.35 | 26.7  | 37.03
Qwen2-1.5B   | LoRA (100)        | 38.23  | 68.5       | 39.76    | 19.03 | 13.39 | 26.47 | 23.1  | 32.64
Qwen2-1.5B   | LoRASuite w LFT   | 60.51  | 78.83      | 72.83    | 29.42 | 16.93 | 57.98 | 48.2  | 52.10
Qwen2-1.5B   | LoRA (10k)        | 70.38  | 88.83      | 76.57    | 30.33 | 19.29 | 71.85 | 51.4  | 58.38
Comment

Thanks for the rebuttal. While the authors have acknowledged the concerns and promised additional experiments and clarifications, the current revision does not yet provide the necessary new empirical evidence or robustness analysis. As a result, the major concerns regarding scalability to large/recent models and the statistical robustness of the small-sample adaptation regime remain unresolved at this stage. I would like to keep my original rating.

Comment

Thanks for your feedback. To help us address your concerns more effectively and ensure a responsible review process, could you kindly clarify what you specifically mean by “robustness analysis” and “statistical robustness of the small-sample adaptation regime”?

Additionally, we respectfully ask whether some of the core merits of our work may have been overlooked. While we have not yet included 70B-scale model results due to time constraints, we did provide the requested Qwen2-1.5B experiment. We believe this limitation does not diminish the key contributions of LoRASuite—a novel framework for structural adaptation, thoroughly validated across diverse model backbones, task types, and extensive ablation studies. Even when applied solely to small- and medium-scale models, LoRASuite provides a practical and effective solution for LoRA transfer, with significant impact in mobile and resource-constrained scenarios.

Comment

Summary of Our Rebuttal

  • Model Scope (Q1 & Q2): We initially focused on on-device scenarios—hence, evaluations on small-to-medium models (up to 9B). As requested, we rapidly completed the Qwen2-1.5B adaptation. Owing to time constraints, migrating the 70B model wasn't feasible during the rebuttal phase, but we will include it in the camera-ready version. In any case, this does not diminish our paper's core contribution: the first LoRA adaptation method for LLM upgrades with competitive performance.

  • Low-Data Regime Behavior (Q3): The converted LoRA parameters already carry the original model’s knowledge, so only a small fine-tuning step is needed. As dataset size grows, overfitting can reduce gains, which explains the diminishing returns. We will (1) detail how we randomly select the small fine-tuning subsets and (2) report results averaged over multiple random draws to capture variance.

  • Terminology Consistency (Q4): “Full-scale LoRA retraining” corresponds to “LoRA (10k)” in Tables 4–5, and “LoRA (small)” corresponds to “LoRA (100).” We will define these terms explicitly in the manuscript and ensure consistent usage throughout.


We appreciate your insightful feedback. If there are any remaining questions or concerns, please let us know—and, if you feel our revisions address your points, would you kindly consider updating your final rating?

Review 3 (Rating: 5)

This paper proposes a novel method to leverage existing LoRA weights for adaptation in the updated models. The core idea assumes that we have access to both old and newer weights of a specified model and therefore we can calculate a transfer matrix which we can then utilize to adapt the parameters. They use the Centered Kernel Alignment (CKA) method for layer mapping and the Hungarian algorithm (aka the Kuhn-Munkres algorithm) for attention head mapping. The CKA sequentially aligns corresponding layers to maximize the total similarity, and the Hungarian algorithm finds an optimal one-to-one matching between attention heads.

Strengths and Weaknesses

Strengths:

  • The paper is well-written and it is easy to understand. The application is technically sound and potentially useful as it can reduce the training costs.
  • The method is innovative and tries to tackle an interesting problem.
  • The experiments are informative and it can potentially help future research for solving this issue.

Weaknesses:

  • The LoRASuite is an expensive method (in the order of cubic) which might limit the applicability of this method.
  • It has not been clearly explained why they used different methods for head and layer mapping.
  • As shown in table 4 and 5, the performance of LoRASuite is not consistent across different datasets.
  • Their method doesn’t seem to be applicable to customized architectures.

Questions

  • Why in figure 4-b, the change of learning rate has a significant impact on the performance of LoRASuite? What’s the justification?

  • Why in Algorithm 2, the input to the cka_layer_mapping() is S? S has not been introduced properly.

Limitations

Yes

Justification for Final Rating

In general, this paper tries to solve a novel problem, the proposed method can be potentially useful, and the paper is technically sound and well-written. The authors also showed some level of competence during the rebuttal, which increased my trust in their results. Despite the limitations of the experiments, in the sense that they make strong assumptions about the architecture and the availability of the model weights, I would like to increase my score, as I still believe this work might initiate exciting future research directions.

Formatting Issues

No

Author Response

Q1: The LoRASuite is an expensive method (in the order of cubic) which might limit the applicability of this method.

A1: Thank you for your valuable comment. While LoRASuite involves algorithms with cubic complexity, the practical cost remains manageable. This is because the number of layers and attention heads in mainstream LLMs typically remains under 100, keeping the actual computation well within the capabilities of modern processors.


Q2: It has not been clearly explained why they used different methods for head and layer mapping.

A2: Thank you for your thoughtful comment. We use different methods for layer and head mapping based on the distinct structural roles they play. For layer mapping, we adopt a dynamic programming approach inspired by prior studies in image classification with CNNs, where different layers capture different levels of abstraction, and higher layers build upon the representations of lower layers. This motivates a sequential alignment strategy. In contrast, attention heads within the same layer are parallel and independent, so we treat head mapping as a bipartite matching problem and use the Hungarian algorithm to find the optimal one-to-one correspondence. We will clarify this design choice in the revised manuscript.
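
For concreteness, a minimal sketch of the head-matching step using SciPy's Hungarian solver (linear_sum_assignment); the cosine similarity between flattened per-head weights is a stand-in for whatever per-head similarity the paper actually uses.

# Sketch: one-to-one attention-head matching via the Hungarian algorithm.
# The cosine similarity between flattened per-head weights is a stand-in metric;
# the paper's actual head similarity may differ.

import numpy as np
from scipy.optimize import linear_sum_assignment

def head_mapping(old_heads, new_heads):
    """old_heads, new_heads: arrays of shape (n_heads, flattened_head_params)."""
    old_n = old_heads / np.linalg.norm(old_heads, axis=1, keepdims=True)
    new_n = new_heads / np.linalg.norm(new_heads, axis=1, keepdims=True)
    sim = old_n @ new_n.T                                   # (n_old, n_new) similarities
    rows, cols = linear_sum_assignment(sim, maximize=True)  # optimal bipartite matching
    return dict(zip(rows.tolist(), cols.tolist()))          # old head index -> new head index

rng = np.random.default_rng(0)
print(head_mapping(rng.standard_normal((8, 64)), rng.standard_normal((8, 64))))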


Q3: As shown in table 4 and 5, the performance of LoRASuite is not consistent across different datasets.

A3: Thank you for your observation. The performance variation of LoRASuite across different datasets can be attributed not only to the task differences but also to the underlying pretraining data of both the original and upgraded models. If the pretraining corpora differ significantly, the representations learned by the models may diverge, which can affect the compatibility and effectiveness of the transferred LoRA weights. We will clarify this important factor in the revised manuscript.


Q4: Their method doesn’t seem to be applicable to customized architectures.

A4: Thank you for your feedback. Our current evaluations focus on mainstream LLMs to ensure broad applicability and reproducibility. However, LoRASuite is also applicable to customized architectures, as long as the original and upgraded models share a similar structural design—for example, when both follow the Transformer paradigm. In such cases, the core idea of layer mapping and attention head mapping remains valid.


Q5: Why in figure 4-b, the change of learning rate has a significant impact on the performance of LoRASuite? What’s the justification?

A5: Thank you for your insightful feedback. Unlike LoRA (Small), which employs a linear learning-rate scheduler with a warm-up stage to stabilize optimization from random initialization, LoRASuite starts with transformed parameters that are already close to a good solution. Hence, we omit the warm-up phase. However, this initialization results in a steeper optimization landscape, where a small learning rate combined with limited data can cause the model to underfit or fail to converge effectively. We will clarify this in the revised manuscript.


Q6: Why in Algorithm 2, the input to the cka_layer_mapping() is S? S has not been introduced properly.

A6: Thank you for your feedback. As described in Line 175, the matrix $S$ is defined such that each element $S_{ij}$ represents the CKA similarity between the $i$-th layer of the original model and the $j$-th layer of the upgraded model. We will clarify the introduction of $S$ in Algorithm 2 in our revised manuscript.
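
For concreteness, a minimal sketch of how such a similarity matrix $S$ can be built with linear CKA over per-layer activations on a shared probe set; the probe data and the exact CKA variant are assumptions here, not details confirmed by the paper.

# Sketch: building the layer-similarity matrix S with linear CKA.
# acts_old[i] / acts_new[j] are activations of shape (n_samples, hidden_dim) collected
# from layer i of the old model and layer j of the new model on the same probe inputs.

import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices with the same number of rows."""
    X = X - X.mean(axis=0, keepdims=True)   # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

def build_similarity_matrix(acts_old, acts_new):
    S = np.zeros((len(acts_old), len(acts_new)))
    for i, X in enumerate(acts_old):
        for j, Y in enumerate(acts_new):
            S[i, j] = linear_cka(X, Y)      # S_ij: CKA(layer i of old, layer j of new)
    return S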

Comment

More questions which need clarification:

  • In the case of a non-identical architecture, what is the major limitation of the currently proposed algorithm? Is it just the activation functions? And how do you see an extension of your method facilitating application to the non-identical setting?

  • In the limitation section and line 224, you mentioned that "LoRASuite requires an additional small-scale fine-tuning step to achieve optimal performance". How do you know that's the optimal performance? In general, there is not much discussion around that fine-tuning step. How do you guarantee the comparison is fair?

  • In your experiments, do you assume that the models have been trained on exactly the same dataset? It seems at least the models used for comparison (MiniCPM-S-1B to MiniCPM-2B) are in this setting. Would you please confirm whether this is true?

  • Specifically in the setting mentioned in the previous note, one might ask why one doesn't use algebra to calculate the new LoRA weights if one already has the original weights, the new weights, the old LoRA weights, an equal architecture, and even an equal training set. We already know $W_0 + \Delta W_0$ and we are looking for a $\Delta W_1$ which, if added to $W_1$, gives the correct weight set. Why don't we calculate $\Delta W_1$ using something like this? Let's assume both adapted models will be in a similar vicinity (maybe with a small difference which one can model with $c$) for solving any downstream task; then potentially one can assume $W_1 + \Delta W_1 = W_0 + \Delta W_0 + c$ and calculate the new LoRA weights by $\Delta W_1 = W_0 - W_1 + \Delta W_0 + c$. And one can assume $c$ is just a learnable set of weights which you need to fine-tune.

  • It has been mentioned that you are using SVD for the decomposition of the newly found $(\Delta W_Q)_n$ after all head mappings are completed. Isn't that costly? If yes, why has it not been considered in the complexity analysis?

Comment

FQ1: In the case of non-identical architecture what's the major limitation in the current proposed algorithm? Is it just the activation functions? and how do you see an extension of your method will facilitate the application to non-identical setting?

Thank you for the insightful question. One major limitation of applying our current method to non-identical architectures lies in the potential mismatch in abstraction levels across layers, often caused by differences in activation functions or architectural design. This mismatch can lead to a scenario where the information represented by a single attention head in the original model may be distributed across multiple heads in different layers of the upgraded model.

To handle such distributed mappings, it would be necessary to compute similarities across all possible head-layer combinations between the two models, significantly increasing the search space. In this case, the time complexity grows to approximately $O((n_{layer} \times n_{head})^3)$, due to the cubic complexity of the Hungarian algorithm used for optimal matching.

Extending LoRASuite to support such cases would involve developing more scalable many-to-many or soft matching strategies, potentially incorporating approximation techniques or learned similarity metrics. We see this as an important direction for future work and will clarify this point in the revised manuscript.


FQ2: In the limitation section and line 224, you mentioned that "LoRASuite requires an additional small-scale fine-tuning step to achieve optimal performance". How do you know that's the optimal performance? In general, there is not much discussion around that fine-tuning step. How do you guarantee the comparison is fair?

Thank you for pointing this out. Our statement that "LoRASuite requires an additional small-scale fine-tuning step to achieve optimal performance" refers specifically to the best average performance reported in Tables 4 and 5. We acknowledge that "optimal" may imply a stronger guarantee than intended. We will revise the wording in the manuscript for greater precision and add further clarification around the fine-tuning setup to avoid any ambiguity.


FQ3: In your experiments, do you assume that the models have been trained on exactly the same dataset? It seems at least the models used for comparison (MiniCPM-S-1B to MiniCPM-2B) are in this setting. Would you please confirm whether this is true?

Thank you for your thoughtful question. For a fair comparison, we use exactly the same training dataset for corresponding models. Specifically, the LoRA (10k) adapters for both the original (MiniCPM-S-1B) and upgraded (MiniCPM-2B) models are trained on the same full dataset. Likewise, the datasets used for LoRA (100) and LoRASuite with LFT (100) are identical and randomly selected subsets of the original LoRA (10k) dataset. We will clarify this setup in the revised manuscript.


FQ4: Specifically in the setting mentioned in the previous note, one might ask why one doesn't use algebra to calculate the new LoRA weights if one already has the original weights, the new weights, the old LoRA weights, an equal architecture, and even an equal training set. We already know $W_0 + \Delta W_0$, and we are looking for a $\Delta W_1$ which, if added to $W_1$, gives the correct weight set. Why don't we calculate $\Delta W_1$ using something like this? Let's assume both adapted models will be in a similar vicinity (maybe with a small difference which one can model with $c$) for solving any downstream task; then potentially one can assume $W_1 + \Delta W_1 = W_0 + \Delta W_0 + c$ and calculate the new LoRA weights by $\Delta W_1 = W_0 - W_1 + \Delta W_0 + c$. And one can assume $c$ is just a learnable set of weights which you need to fine-tune.

Thank you for your insightful feedback. As you suggested, a straightforward algebraic solution—bypassing head mapping—was explored in our ablation study in Appendix A.6. Specifically, the LoRASuite w/o Head Mapping setting in Figure 13 corresponds to this approach. However, we observed that the full LoRASuite, which includes explicit head mapping, consistently outperforms this simplified method on both math and commonsense tasks. We will clarify this more explicitly in the revised manuscript.


FQ5: It has been mentioned that you are using SVD for the decomposition of the newly found $(\Delta W_Q)_n$ after all head mappings are completed. Isn't that costly? If yes, why has it not been considered in the complexity analysis?

Thank you for raising this important point. You're absolutely right that the SVD step introduces additional computational cost, and we will revise our complexity analysis to explicitly include it for completeness. That said, since SVD dominates the overall runtime—especially for large projection matrices—the overhead from layer and head mapping becomes relatively negligible in comparison.
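
For clarity, a minimal sketch of the SVD step in question: re-factorizing a transformed full-rank update back into rank-r LoRA factors. This is a generic truncated SVD, not necessarily the authors' exact procedure.

# Sketch: re-factorizing a transformed LoRA update into rank-r factors via truncated SVD,
# so that B @ A approximates new_delta_W. Generic procedure, shown for illustration.

import numpy as np

def refactor_lora(new_delta_W, r):
    U, s, Vt = np.linalg.svd(new_delta_W, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    B = U[:, :r] * sqrt_s            # (d_out, r)
    A = sqrt_s[:, None] * Vt[:r]     # (r, d_in)
    return B, A

rng = np.random.default_rng(0)
dW = rng.standard_normal((2304, 2304)) * 0.01   # hypothetical transformed update
B, A = refactor_lora(dW, r=32)
print(B.shape, A.shape)                          # (2304, 32) (32, 2304)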

Comment

Thanks for the clarification. That said, I’m still unsure why LoRASuite without Head Mapping is considered equivalent to the approach I described earlier. Could you clarify this reasoning in more detail?

It seems to me that the paper uses overly complex language, which ends up hiding some of the most important points.

Concerning the SVD cost, if the overhead from layer and head mapping becomes insignificant compared to the dominant runtime of the SVD itself, doesn’t that render the current complexity analysis somewhat misleading?

Comment

1. Thanks for the clarification. That said, I’m still unsure why LoRASuite without Head Mapping is considered equivalent to the approach I described earlier. Could you clarify this reasoning in more detail?

Thank you for the clarification. Upon closer inspection, we acknowledge that there are indeed differences between LoRASuite without Head Mapping and the approach you described. Below is a comparison of the two implementations (where $W_x$ is used to align the dimensional mismatch between $W_0$ and $W_1$):

# LoRASuite without Head Mapping
new_delta_W_Q = W_x.T @ delta_W_Q @ W_x 

# Your proposed approach
new_delta_W_Q = W_x.T @ (W_old_Q + delta_W_Q) @ W_x - W_new_Q

To clarify further, we re-ran the experiments using your suggested method. The performance comparison is shown below:

PEFT Module                | Math (Avg.) | Commonsense (Avg.)
LoRASuite                  | 43.80       | 42.34
LoRASuite w/o Head Mapping | 42.58       | 41.95
Your proposed approach     | 41.60       | 33.62

The performance gap, especially on commonsense tasks, suggests that directly aligning to the trained $W_0 + \Delta W_0$ may discard valuable information encoded in $W_1$ through its pretraining and post-training. As a result, even with further lightweight fine-tuning, this method struggles to fully leverage the potential of the stronger target model $W_1$. We hope this clarifies the distinction.


2. It seems to me that the papers use overly complex language, which ends up hiding some of the most important points.

Thank you for the feedback. Our intent was to be precise, not to obscure key points. If there are specific sentences that seem unclear or overly complex, we would greatly appreciate it if you could point them out—we'll be happy to clarify them.


3. Concerning the SVD cost, if the overhead from layer and head mapping becomes insignificant compared to the dominant runtime of the SVD itself, doesn’t that render the current complexity analysis somewhat misleading?

Thank you for the follow-up. As noted earlier, you're absolutely right that the SVD step introduces significant computational cost. We will revise our complexity analysis accordingly to $O(n_{layer}(\Delta_{layer}^2 + n_{head}^3 + n_{hidden}^3))$. Since $n_{hidden} \gg \Delta_{layer}, n_{head}$, the overall complexity is effectively dominated by the SVD and simplifies to $O(n_{layer} n_{hidden}^3)$. We will make this clearer in the revised manuscript.

Review 4 (Rating: 4)

The paper proposes LoRASuite, a modular approach tailored specifically to various types of LLM updates. It identifies structural differences (e.g., vocabulary size, hidden size, attention heads) between old and new models and adapts existing LoRA weights accordingly through transfer matrices, layer/head mapping (via CKA and the Hungarian algorithm), and a lightweight fine-tuning (LFT) step. Experimental evaluations demonstrate that LoRASuite consistently surpasses small-scale vanilla LoRA methods and full-scale LoRA retraining, while significantly reducing memory consumption by 5.5 GB and computational time by 78.23%.

Strengths and Weaknesses

Strengths:

  • The paper studies an interesting yet practically important and under-explored problem of adapting LoRA weights during LLM upgrades.
  • The paper systematically explores the problem by explicitly identifying the potentially changed dimensions (e.g., vocabulary size, hidden size, attention heads).
  • Extensive experiments across multiple model upgrades and task types (math, commonsense), together with thorough baselines, demonstrate strong empirical results. Extensive ablations on layer and head mapping.

Weaknesses: See my questions.

Questions

  • Why is LoRASuite much more sensitive to the learning rate compared to vanilla LoRA (Fig. 4-b)?
  • How do you set the maximum offset constraint in Algo 1?
  • What happens if the optimal head mapping is not one-to-one (i.e., the information could be distributed across or aggregated from multiple heads)?
  • What happens if the tokenizer or embedding changes significantly?
  • What could be some potential solutions for implicit upgrades, such as changes in pre-training datasets and post-training methods?
  • Does LoRASuite still work when the LoRA weights are for different tasks?
  • Curious whether there are relatively different sensitivities when matching various factors, e.g., is layer matching more important/sensitive than head mapping?

Limitations

See my questions.

Justification for Final Rating

The research problem is novel and interesting, and the paper presents a systematic exploration with strong empirical results. The rebuttal largely addressed my concerns. The main limitation lies in the assumptions made about model upgrades: a similar architecture that enables the proposed transfer matrix, and the availability of the model weights. More practical scenarios include implicit upgrades and finer-grained information mapping across attention heads, which make the mapping more challenging and expensive (see my comments). These could limit the method's generalization to other real-world settings. But overall the paper is interesting, so I keep my score of a "Borderline Accept".

Formatting Issues

N/A

Author Response

Q1: Why is LoRASuite much more sensitive to the learning rate compared to vanilla LoRA (fig4-b)?

A1: Thank you for your insightful feedback. Unlike LoRA (Small), which employs a linear learning-rate scheduler with a warm-up stage to stabilize optimization from random initialization, LoRASuite starts with transformed parameters that are already close to a good solution. Hence, we omit the warm-up phase. However, this initialization results in a steeper optimization landscape, where a small learning rate combined with limited data can cause the model to underfit or fail to converge effectively. We will clarify this in the revised manuscript.


Q2: How do you set the maximum offset constraint in Algo 1?

A2: Thank you for your feedback. The maximum offset constraint in Algorithm 1 is set to the absolute difference in layer counts between the original and upgraded models. We will clarify this detail in the revised manuscript.
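
For illustration, a plausible sketch of a monotonic layer-alignment dynamic program under such an offset constraint. This is a reconstruction based only on the description above and the CKA matrix S from Algorithm 2; it is not the paper's exact Algorithm 1.

# Plausible sketch (not the paper's exact Algorithm 1): monotonic layer alignment by
# dynamic programming. Each old layer i is mapped to a new layer f(i) so that f is
# non-decreasing, |i - f(i)| <= max_offset, and the total CKA similarity S[i, f(i)]
# is maximized.

import numpy as np

def dp_layer_mapping(S, max_offset):
    n_old, n_new = S.shape
    NEG = -1e9
    dp = np.full((n_old, n_new), NEG)
    back = np.zeros((n_old, n_new), dtype=int)
    for j in range(n_new):
        if abs(0 - j) <= max_offset:
            dp[0, j] = S[0, j]
    for i in range(1, n_old):
        best_prev, best_j = NEG, -1
        for j in range(n_new):
            if dp[i - 1, j] > best_prev:        # best alignment of layers 0..i-1 ending at column <= j
                best_prev, best_j = dp[i - 1, j], j
            if abs(i - j) <= max_offset and best_prev > NEG:
                dp[i, j] = best_prev + S[i, j]
                back[i, j] = best_j
    # Backtrack from the best final cell to recover the mapping.
    j = int(np.argmax(dp[-1]))
    mapping = [0] * n_old
    for i in range(n_old - 1, -1, -1):
        mapping[i] = j
        j = back[i, j]
    return mapping                               # mapping[i] = new-layer index for old layer i

# Usage: max offset set to the absolute difference in layer counts, as described above.
# mapping = dp_layer_mapping(S, max_offset=abs(S.shape[0] - S.shape[1]))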


Q3: What happens if the optimal head mapping is not one-to-one (i.e., the information could be distributed across or aggregated from multiple heads)?

A3: Thank you for your insightful feedback. Indeed, as you pointed out, information being distributed across or aggregated from multiple heads is a valid scenario. In such cases, one would need to compute similarity scores between all attention heads across every layer of the original and upgraded models, distribute information from each attention head in the original model to multiple heads in the upgraded model, and aggregate these mappings accordingly. However, this comprehensive method would have a time complexity of approximately $O((n_{layer} \times n_{head})^3)$ due to the cubic complexity of the Hungarian algorithm. Hence, our current method can be viewed as a practical simplification to mitigate computational overhead.


Q4: What happens if the tokenizer or embedding changes significantly?

A4: Thank you for this valuable question. As described in Line 130, we manually calculate the transformation matrix using shared tokens between models. However, if the tokenizer or embedding changes significantly, additional alignment techniques—such as supervised or unsupervised embedding alignment methods [1], or entity alignment methods from graph domains [2]—would be required before applying our proposed method. Investigating these alignment strategies is an interesting direction for future research, and we will clarify this in our revised manuscript.

[1] (EMNLP’2018) Gromov-Wasserstein Alignment of Word Embedding Spaces.

[2] (NIPS’24) Entity Alignment with Noisy Annotations from Large Language Models.
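
For illustration, a minimal sketch of one way such a transformation matrix can be computed from shared vocabulary tokens by least squares; it assumes access to both embedding matrices and vocabularies and may differ from the exact construction described at Line 130.

# Sketch: building a hidden-dimension transfer matrix W_x from shared vocabulary tokens,
# so that E_old[shared] @ W_x ~= E_new[shared]. Assumes access to both embedding matrices
# and tokenizer vocabularies; the paper's exact construction may differ.

import numpy as np

def shared_token_transform(E_old, E_new, vocab_old, vocab_new):
    """E_old: (V_old, d_old), E_new: (V_new, d_new); vocab_*: dict token -> row index."""
    shared = sorted(set(vocab_old) & set(vocab_new))
    X = E_old[[vocab_old[t] for t in shared]]      # (n_shared, d_old)
    Y = E_new[[vocab_new[t] for t in shared]]      # (n_shared, d_new)
    W_x, *_ = np.linalg.lstsq(X, Y, rcond=None)    # (d_old, d_new) least-squares solution
    return W_x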


Q5: What could be some potential solutions for implicit upgrades, such as changes in pre-training datasets and post-training methods?

A5: Thank you for your insightful feedback. For implicit upgrades involving changes in pre-training datasets or post-training methods, prior studies on knowledge editing [1, 2, 3, 4] provide promising directions. Integrating knowledge-editing techniques with our proposed methods would be an interesting area for future research. We will clarify this in our revised manuscript.

[1] (NeurIPS'22) Locating and Editing Factual Associations in GPT.

[2] (ICLR'23) Mass-Editing Memory in a Transformer.

[3] (NeurIPS'24) WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models.

[4] (NeurIPS'24) Knowledge Circuits in Pretrained Transformers.


Q6: Does the LoRASuite still work when the LoRA weights are for different tasks?

A6: Thank you for this insightful question. LoRASuite is designed to adapt LoRA weights across model upgrades and remains effective even when the weights are trained on different tasks. Prior studies such as DARE and TIES have shown that merging task-specific LoRA modules is both feasible and effective. In our case, each task-specific LoRA can first be transformed using LoRASuite, and then merged using established methods like TIES to ensure compatibility with the upgraded model while preserving task-specific knowledge.

For example, we merged the MiniCPM-S-1B math and commonsense LoRA adapters using TIES with equal weights (0.5 each), and performed the same merging on the LoRASuite-transformed adapters for MiniCPM-2B. Compared to the simple average of individual task performances, the merged LoRASuite-transformed adapters improved math and commonsense scores by 1.37× and 1.96×, respectively. We will include these results in the revised manuscript.

PEFT Module                                          | Avg. performance on Math Tasks | Avg. performance on Commonsense Tasks
TIES-merged LoRA on MiniCPM-S-1B                     | 17.45                          | 11.53
TIES-merged LoRASuite-transformed LoRA on MiniCPM-2B | 23.90 (1.37×)                  | 22.57 (1.96×)

Q7: Curious whether there are relatively different sensitivities when matching various factors, e.g., is layer matching more important/sensitive than head mapping?

A7: Thank you for your insightful feedback. In Appendix 6.1, we conducted ablation studies on both layer mapping and head mapping. Based on the results shown in Figures 11 and 12, we observed that layer mapping has a significantly greater influence on the average performance compared to head mapping. We will clarify this observation in our revised manuscript.

Comment

Thank you for your detailed response to address my concerns. I will maintain my score.

Comment

Thank you for taking the time to review our rebuttal and for your engagement during the discussion phase. While we understand that you have chosen to maintain your score, we appreciate your thoughtful feedback and the opportunity to clarify our work.

Comment

Dear Reviewers,

We sincerely appreciate your thoughtful and constructive feedback. Below, we summarize the key contributions of our submission and the additional evaluations we conducted during the rebuttal phase to address your comments. We hope these efforts have clarified our methodology and demonstrated the broader value of our work.


Summary of Our Original Contributions:

  • LoRASuite Framework: We propose LoRASuite, a novel framework for adapting LoRA weights across model upgrades. It is the first to address a comprehensive set of structural changes—including hidden size, layer depth, attention head count, and attention types—with competitive performance.
  • Layer and Head Mapping Algorithms: We introduce a dynamic programming-based layer mapping strategy and a Hungarian-based attention head mapping method. These were motivated by the hierarchical representation behavior observed in prior CNN studies and the parallel multi-head attention mechanism.
  • Large-scale, diverse-task evaluation: We conduct comprehensive experiments across a broad set of models. Evaluations span multiple downstream tasks—including math and commonsense reasoning—with extensive ablation studies and sensitivity analyses, which demonstrate both robustness and generalizability.

Additional Experiments and Clarifications in Rebuttal:

  • Direct Comparison to Qwen2-1.5B: In response to reviewer feedback, we added a comparison of upgrades from Qwen1.5-1.8B to Qwen2-1.5B on math tasks. While our original focus was on mobile-centric scenarios (e.g., AICore), we will include evaluations on larger-scale models in the revised manuscript.
  • Cross-Task Merging Performance: We showed that merging LoRASuite-transformed LoRA modules using TIES significantly outperforms naive merging. For example, merging math and commonsense LoRAs led to 1.37× and 1.96× improvements in respective task scores.
  • Training Curve Disclosure: Although figures are restricted during the rebuttal phase, we provided per-step loss values for LoRA (100) and LoRASuite w/ LFT (100), and will include full training curves (including LoRA 10k) in the revised manuscript.
  • Learning Rate Sensitivity: Unlike LoRA (Small), which uses a linear learning-rate scheduler with warm-up to stabilize training from random initialization, LoRASuite starts from a well-initialized transformed state that is already near a good solution. As a result, we omit the warm-up phase. However, this favorable initialization creates a steeper optimization landscape, where a small learning rate and limited data can lead to underfitting or poor convergence.
  • Memory Efficiency Explanation: We elaborated on why small-scale fine-tuning reduces memory usage—namely due to lower data pipeline overhead, reduced activation storage, and more stable padding behavior in shorter sequences.

Reviewer-Author Discussion

We hope these clarifications and additional experiments have addressed your concerns. As the Reviewer-Author Discussion period reaches its midpoint, we remain eager to answer any further questions you may have. We believe LoRASuite provides a theoretically sound, practically efficient solution for LoRA transfer across LLM upgrades, with strong empirical evidence across diverse settings.

Thank you again for your time and feedback. We would greatly appreciate it if you could let us know whether our responses have resolved your concerns or if you might consider revising your rating.

Best regards,

Authors of Submission 2277

Final Decision

It received ratings of 5, 5, 4, 3. The only review that does not suggest accepting it is concerned about the scalability of the method to more recent models. However, the positive reviews point out that the paper presents a novel approach to adapting pre-trained model weights for updates in new versions of the same model architecture. This is an innovative solution that demonstrates promising results across various backbone models. The paper is well-structured, clearly written, and supported by extensive empirical evidence. While there may be limitations in terms of applicability to all real-world scenarios, such as implicit upgrades or different architectures, the work opens up exciting possibilities for future research in this area. Overall, these factors contribute to a solid justification for accepting the paper.