Accurate and Efficient Low-Rank Model Merging in Core Space
The paper proposes Core Space Merging, a method to efficiently merge LoRA-adapted models by aligning them in a shared low-rank subspace, achieving higher accuracy and major speedups over prior merging techniques.
Abstract
Reviews and Discussion
This paper proposes a novel approach for merging LoRA-finetuned instances of the same model on various tasks within the Core Space, as opposed to the full space. The authors show that the proposed approach preserves information while significantly improving computational complexity as well as the final accuracy. The empirical study on both language and vision tasks shows the merits of the proposed framework in practice.
Strengths and Weaknesses
Strengths:
- The paper proposes an interesting low-rank merging methodology via the core space that is theoretically sound.
- The experimental study is well-done and convincing.
- The paper is generally very well-written with a good flow.
Weaknesses:
- While the authors have claimed in the Limitation section that the proposed approach can also be applied in scenarios with heterogeneous ranks, the extension is not straightforward unless the corresponding LoRA updates still have the same rank, even though the rank might be different across different layers. In fact, this points to an important limitation of the proposed framework which hasn't been discussed in the paper; that is, the proposed method is not applicable to models that are finetuned via the rank-adaptive version of LoRA (e.g. AdaLoRA), where the rank assigned to each weight matrix may be different across the tasks. However, the full space and KnOTS do not suffer from this issue in general.
- The argument provided for explaining the superior accuracy of the proposed framework is somewhat weak. In particular, the authors have implied that the better alignment between the core spaces of different tasks is the main reason for the better accuracy, but this is merely a correlation at best. A theoretical analysis or ablation studies are needed to back this "implicit" claim and establish the connection between subspace alignment and the generalization error.
Questions
- Are you claiming that the proposed methodology can be applied even when the LoRA rank is different among different tasks? If so, how?
- How does higher subspace alignment explain better generalization of the model after the merge?
Limitations
As I mentioned above, dealing with different ranks for the same weight matrix is not straightforward and seems to be an important limitation of the proposed framework which hasn't been discussed in the paper.
Justification for Final Rating
The authors have made a convincing argument in their response to my concerns about the proposed methodology. As a result, I'd like to increase my score to 5.
Formatting Concerns
No formatting concern.
We are pleased that the Reviewer leans towards acceptance, acknowledging the paper’s clarity, novelty, efficiency, and effectiveness of the proposed approach, as well as the convincing theoretical and experimental study. We thank the Reviewer for the constructive feedback, and below we respond to specific points raised.
W1/Q1/L1 Application to the heterogeneous-rank LoRA setting
While the authors have claimed in the Limitation section that the proposed approach can also be applied in scenarios with heterogeneous ranks, the extension is not straightforward unless the corresponding LoRA updates still have the same rank, even though the rank might be different across different layers. In fact, this points to an important limitation of the proposed framework which hasn't been discussed in the paper; that is, the proposed method is not applicable to models that are finetuned via the rank-adaptive version of LoRA (e.g. AdaLoRA), where the rank assigned to each weight matrix may be different across the tasks. However, the full space and KnOTS do not suffer from this issue in general. Are you claiming that the proposed methodology can be applied even when the LoRA rank is different among different tasks? If so, how?
While handling LoRA modules with heterogeneous ranks might seem non-trivial at first glance, in practice our method naturally supports it without requiring any modification.
As shown in our implementation (see Listing A in the Appendix or the get_core_matrices function in task_merger.py in the provided code at L214), we compute the reference bases via SVD on the stacked A and B matrices across all tasks. Even when individual LoRA modules have different ranks, these matrices can still be concatenated across tasks, resulting in an aggregate basis that spans the combined subspaces. The projection and alignment operations are then applied accordingly for each local task core matrix.
To illustrate this, consider two tasks: one with a LoRA module of rank 4 and another with a rank of 8. Suppose:
Task 1 has $B_1 \in \mathbb{R}^{m \times 4}$ and $A_1 \in \mathbb{R}^{4 \times n}$;
Task 2 has $B_2 \in \mathbb{R}^{m \times 8}$ and $A_2 \in \mathbb{R}^{8 \times n}$.
We concatenate these as $B_{\text{cat}} = [B_1 \; B_2] \in \mathbb{R}^{m \times 12}$ and $A_{\text{cat}} = [A_1^\top \; A_2^\top]^\top \in \mathbb{R}^{12 \times n}$.
Applying SVD to these stacked matrices yields shared reference bases $U_{\text{ref}}$ and $V_{\text{ref}}$, where the combined rank is at most $4 + 8 = 12$. Each task's $B_t$ and $A_t$ are then individually aligned to these reference bases, regardless of their original rank. This setup enables us to handle heterogeneous-rank LoRA modules naturally, without the need to pad, truncate, or enforce uniformity.
This flexibility arises because SVD makes no assumptions about input ranks; it always produces valid orthonormal bases for the column and row spaces. Our merging framework leverages this to support variable-rank LoRA modules seamlessly.
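To make this concrete, below is a minimal, self-contained sketch (a simplified stand-in for the actual get_core_matrices routine, with toy dimensions) showing that the reference bases and core matrices are well defined, and that the projection remains lossless, even when per-task ranks differ:

```python
import torch

def reference_bases(B_list, A_list):
    """Build shared reference bases from LoRA factors with possibly different ranks.

    B_list: per-task B matrices of shape (m, r_t); the ranks r_t may differ.
    A_list: per-task A matrices of shape (r_t, n).
    """
    # Stack B horizontally and A vertically; the concatenation is valid
    # even when the per-task ranks r_t differ.
    B_stack = torch.cat(B_list, dim=1)   # (m, sum of r_t)
    A_stack = torch.cat(A_list, dim=0)   # (sum of r_t, n)

    # Left singular vectors of the stacked B and right singular vectors of
    # the stacked A give orthonormal bases spanning the union of subspaces.
    U_ref, _, _ = torch.linalg.svd(B_stack, full_matrices=False)
    _, _, Vh_ref = torch.linalg.svd(A_stack, full_matrices=False)
    return U_ref, Vh_ref

# Toy example with heterogeneous ranks (4 and 8).
m, n = 32, 48
B_list = [torch.randn(m, 4, dtype=torch.float64), torch.randn(m, 8, dtype=torch.float64)]
A_list = [torch.randn(4, n, dtype=torch.float64), torch.randn(8, n, dtype=torch.float64)]
U_ref, Vh_ref = reference_bases(B_list, A_list)

# Each task update is projected into the shared Core Space regardless of its
# rank, and reconstruction from the core matrix is lossless.
for B_t, A_t in zip(B_list, A_list):
    core_t = U_ref.T @ B_t @ A_t @ Vh_ref.T   # low-dimensional core matrix
    recon = U_ref @ core_t @ Vh_ref           # back to full space
    assert torch.allclose(recon, B_t @ A_t)
```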
To further support the generality of our method, we evaluate performance in a heterogeneous-rank setting, where LoRA modules have varying capacities across tasks. Specifically, we randomly assign rank 16 to half of the tasks and rank 64 to the other half, maintaining a 50/50 split. The assigned ranks per dataset are shown below:
| LoRA Rank 16 | LoRA Rank 64 |
|---|---|
| Cars | DTD |
| EuroSAT | GTSRB |
| MNIST | RESISC |
| SVHN | SUN397 |
Despite this rank mismatch, our method handles the merging seamlessly, without requiring any changes to the implementation. This is possible because we compute the Core Space by stacking the A and B matrices across all tasks—regardless of their individual ranks—and perform SVD to obtain shared reference bases. Each task’s local core matrix is then aligned accordingly, based on its native dimensionality.
The results below show that our method continues to outperform other merging baselines even in this more general setting:
| Space | TA | TIES | DARE-TIES | TSV | TIES + Iso-C | DARE-TIES + Iso-C | TSV + Iso-C | Iso-C |
|---|---|---|---|---|---|---|---|---|
| Full | 64.34 (-) | 63.50 (0.00) | 63.81 (0.00) | 67.95 (0.00) | 66.90 (0.00) | 67.12 (0.00) | 68.72 (0.00) | 72.06 (0.00) |
| KnOTS | 64.34 (-) | 65.13 (+1.63) | 66.69 (+2.88) | 64.20 (-3.75) | 64.16 (-2.74) | 63.37 (-3.75) | 70.40 (+1.68) | 71.26 (-0.80) |
| Core | 64.34 (-) | 70.59 (+7.09) | 69.43 (+5.62) | 67.41 (-0.54) | 72.56 (+5.66) | 72.68 (+5.56) | 71.51 (+2.79) | 74.90 (+2.84) |
We will add this discussion to the final version of the manuscript.
W2/Q2 Relationship between subspace alignment and the generalization error
The argument provided for explaining the superior accuracy of the proposed framework is somewhat weak. In particular, the authors have implied that the better alignment between the core spaces of different tasks is the main reason for the better accuracy, but this is merely a correlation at best. A theoretical analysis or ablation studies are needed to back this “implicit” claim and establish the connection between subspace alignment and the generalization error.
The authors of the Subspace Alignment Ratio (SAR) metric provide a theoretical analysis explaining why higher SAR leads to better performance (see [27], Appendix A.3). They demonstrate that when SAR is low, the merging interference is high. The interference is defined as the L1 distance between the final activations of the task-specific model and the activations of the merged model. Therefore, we follow the experimental protocol from [27] to compare the interference when merging with TSV + Iso-C in Full Space versus Core Space. For each dataset, we collect the activations from the final layer (i.e., the projection to a common vision-language space) of both the task-specific model and the merged model. We present the average distance across all the samples in the test set in the table below. We observe lower interference when merging in Core Space, highlighting its effectiveness.
| Dataset | L1(Full, Task-specific) | L1(Core, Task-specific) |
|---|---|---|
| Cars | 0.255 | 0.234 |
| DTD | 0.224 | 0.183 |
| EuroSAT | 0.178 | 0.164 |
| GTSRB | 0.184 | 0.103 |
| MNIST | 0.175 | 0.132 |
| RESISC | 0.245 | 0.211 |
| SUN397 | 0.308 | 0.278 |
| SVHN | 0.239 | 0.197 |
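For reference, a minimal sketch of this measurement protocol is given below; the `final_features` accessor is hypothetical and stands in for the model's projection to the shared vision-language space:

```python
import torch

def mean_l1_interference(task_model, merged_model, loader, device="cuda"):
    """Average L1 distance between final-layer activations of the task-specific
    and merged models, following the protocol of [27], Appendix A.3.

    `final_features` is a hypothetical accessor standing in for the projection
    to the shared vision-language space; adapt it to the actual model API.
    """
    task_model.eval()
    merged_model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for images, _ in loader:
            images = images.to(device)
            z_task = task_model.final_features(images)      # (batch, d)
            z_merged = merged_model.final_features(images)  # (batch, d)
            total += (z_task - z_merged).abs().mean(dim=1).sum().item()
            count += images.size(0)
    return total / count
```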
We will report this analysis in the final version of the paper.
I thank the authors for taking time to respond to my concerns. The presented arguments are well put and convincing. As a result, I'd like to increase my score to 5 (Accept).
This paper tackles the problem of merging multiple models that have been fine-tuned using Low-Rank Adaptation (LoRA). The authors identify a key challenge: existing merging methods either operate on the full-size weight matrices, sacrificing the efficiency of LoRA, or are suboptimal when applied directly to the low-rank factors. To address this, they propose "Core Space Merging," a novel framework that operates entirely in a low-dimensional space. The core idea is to first project the individual LoRA adaptors into a shared, common basis called "reference bases." This projection creates small "core matrices" for each task. The paper proves that this projection is lossless, meaning no information is lost in the transformation. Merging is then performed efficiently on these small core matrices. The final merged model is reconstructed by projecting the merged core matrix back into the original parameter space. The authors provide theoretical analysis of the method's correctness and computational complexity, and demonstrate through experiments on both language (Llama 3 8B) and vision (ViT) models that their approach is faster and also achieves superior state-of-the-art performance.
Strengths and Weaknesses
Strengths:
LoRA merging is a compelling research area following the democratization of large-model finetuning. Improving model merging methods allows practitioners to potentially leverage existing pre-trained checkpoints to improve performance at a fraction of the finetuning cost.
Aligning the SVD bases is a novel idea that proves effective in practice.
Theorem 10 justifies that the projection into the core space is lossless as long as zero reconstruction error with respect to the reference bases is achieved.
The complexity analysis (Section 4.3) demonstrates the efficiency gains over prior work.
Given these strengths, I believe that the proposed merging algorithm is likely to be adopted by both researchers and practitioners.
Weaknesses:
A dominant advantage of model merging which is not studied here is how existing knowledge can be combined to generalize to new tasks. Prior works such as aTLAS [1] have studied this task (including with LoRAs) and should be compared with in future work or cited as a limitation of the current paper.
It is not clear that the reconstruction will be lossless if the total rank of the LoRAs to combine exceeds the maximum rank of the target weight matrix.
[1] Zhang, et al. "Knowledge composition using task vectors with learned anisotropic scaling." Advances in Neural Information Processing Systems (2024)
Questions
Have the authors studied edge cases where the total rank of the LoRAs exceeds the maximum rank of the target weight matrix? Is the reconstruction error still 0 in this case?
Limitations
It is not clear that the reconstruction will still be lossless if the total rank of the LoRAs exceeds the maximum rank of the weight matrix
No experiments studying generalization to an unseen task
Justification for Final Rating
In light of the other reviews and of the authors' answer, I maintain my original score.
Formatting Concerns
No concern
We are pleased that the Reviewer leans towards acceptance, acknowledging the significance of the research topic, the theoretical soundness, as well as novelty, efficiency, and efficacy of the proposed approach. We thank the Reviewer for the constructive feedback, and below we respond to specific points raised.
W1/L2 Generalization to unseen tasks
A dominant advantage of model merging which is not studied here is how existing knowledge can be combined to generalize to new tasks. Prior works such as aTLAS [1] have studied this task (including with LoRAs) and should be compared with in future work or cited as a limitation of the current paper.
We would like to clarify that our method is not designed for out-of-distribution generalization. Instead, our approach focuses on efficient and accurate merging of already fine-tuned LoRA modules where all task-specific components are available at merge time. Additionally, we note that aTLAS operates in a fundamentally different setting: it learns trainable parameters to guide the merging process based on the target task data, requiring an additional training stage. In contrast, our method performs training-free merging, eliminating the need for labeled data or task supervision. This makes our approach truly zero-shot at the merging stage and suitable in situations where retraining is undesirable or infeasible.
Exploring generalization to unseen tasks is an interesting future direction. However, it falls outside the current scope, where our focus is on designing simple, efficient, and high-performing merging strategies for practical multi-task use.
We thank the Reviewer for the suggestion. We will consider studying this setting and include the corresponding discussion, mentioning aTLAS, in the Future Work section.
W2/Q1/L1 Reconstruction error when total LoRA rank exceeds maximum rank of target weight matrix
Have the authors studied edge cases where the total rank of the LoRAs exceeds the maximum rank of the target weight matrix? Is the reconstruction error still 0 in this case? It is not clear that the reconstruction will still be lossless if the total rank of the LoRAs exceeds the maximum rank of the weight matrix.
We thank the Reviewer for the insightful question and revisit our analysis under the relaxed assumption that the total LoRA rank may exceed the maximum rank of the weight matrix.
From an experimental perspective, we confirm that the reconstruction error remains zero in this case. Specifically, consider merging 8 ViT-B/32 models fine-tuned with a LoRA rank high enough that the total stacked rank exceeds the dimensionality of the fine-tuned weight matrices, i.e., $T \cdot r > \min(m, n)$. We computed the reconstruction error of Eq. (10) for both the $B$ and $A$ matrices in this scenario when merging in Core Space, and found it to be exactly zero.
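As an illustration, the following minimal numerical check (with toy dimensions, not the actual ViT-B/32 shapes) reproduces this behavior: the total stacked rank exceeds $\min(m, n)$, yet projecting each update onto the reference bases and reconstructing it remains exact up to floating-point precision.

```python
import torch

torch.manual_seed(0)
T, r, m, n = 8, 32, 64, 64   # total stacked rank T*r = 256 > min(m, n) = 64
B_list = [torch.randn(m, r, dtype=torch.float64) for _ in range(T)]
A_list = [torch.randn(r, n, dtype=torch.float64) for _ in range(T)]

# Reference bases from the stacked factors; the thin SVD truncates them
# automatically to at most min(m, T*r) and min(n, T*r) directions.
U_ref, _, _ = torch.linalg.svd(torch.cat(B_list, dim=1), full_matrices=False)
_, _, Vh_ref = torch.linalg.svd(torch.cat(A_list, dim=0), full_matrices=False)

# Project each task update into Core Space and reconstruct it.
max_err = max(
    (U_ref @ (U_ref.T @ B @ A @ Vh_ref.T) @ Vh_ref - B @ A).abs().max().item()
    for B, A in zip(B_list, A_list)
)
print(f"max reconstruction error: {max_err:.2e}")  # numerically zero
```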
From a theoretical point of view, although the derivations in Appendix A assume $T \cdot r \le \min(m, n)$, we find that this assumption is not strictly necessary.
When $T \cdot r > m$ (for $B$) and $T \cdot r > n$ (for $A$), stacking the matrices and taking their SVD produces reference bases $U_{\text{ref}}$ and $V_{\text{ref}}$ with intrinsic ranks $r_U \le m$ and $r_V \le n$, since the number of linearly independent directions cannot exceed the number of rows or columns. We now focus on $B$; analogous reasoning applies to $A$.
We use the truncated orthonormal basis $\bar{U} \in \mathbb{R}^{m \times r_U}$, where $r_U$ is the intrinsic rank of the stacked LoRA matrices $[B_1, \dots, B_T]$. These directions span the entire LoRA update space.
Referring back to Eq. (A) in Appendix A.1, we can restate the least-squares problem for $B_t$ with respect to the truncated basis $\bar{U}$. Following the same steps as in the original derivation, we recover the same closed-form solution, namely the orthogonal projection onto $\bar{U}$. The same reasoning applies to $A_t$.
Similarly, the alignment error defined in Eq. (C) of Appendix A.2 remains unchanged when using $\bar{U}$. In Appendix A.3, the search space for orthonormal matrices shrinks accordingly to dimension $r_U$, and the resulting alignment error is still zero.
To summarize, the problem behaves as if the total LoRA rank were reduced to the intrinsic ranks $r_U$ and $r_V$, since directions beyond $r_U$ or $r_V$ are linearly dependent and do not affect the reconstruction.
We will update the appendix with a generalized formulation that incorporates $r_U$ and $r_V$, removing the need for the assumption $T \cdot r \le \min(m, n)$.
We thank the reviewer for their constructive feedback and the time dedicated to assessing our submission. We noticed that the reviewer updated their score following the rebuttal phase but did not provide further comments in the discussion thread.
For clarification, we understand that the final justification and updated scores are not visible to authors during the discussion phase. If the reviewer has any feedback regarding whether the response addressed the concerns raised in the initial review, we would greatly appreciate hearing it and discussing it.
This paper introduces Core Space Merging, a novel framework for efficiently and effectively merging LoRA modules. The core idea is to project the task-specific low-rank matrices into a referenced "Core Space" before applying existing model merging algorithms. The authors provide theoretical guarantees that this strategy is lossless and conduct extensive experiments on vision and language tasks to demonstrate that their method improves the performance of various merging techniques with computation efficiency.
Strengths and Weaknesses
Strengths:
- The proposed method is simple, yet remarkably effective and efficient. It significantly improves upon existing merging techniques while drastically reducing computational overhead.
- The paper provides a solid theoretical foundation for the Core Space framework.
- The comprehensive experiments show consistent performance improvements when integrating Core Space with a wide range of existing merging methods.
- The paper is well-written and provides complete and effective experimental details, along with sufficient analysis and ablation studies to ensure the validity and reproducibility of the findings.
Weaknesses:
- The notation in Eq.(7) and Algorithm 1 is inconsistent and the dimensions in the latter formulation seem incorrect. If the formulation in Eq.(7) is correct, it raises a question about the necessity of performing SVD on each individual task's LoRA matrices, since the core matrix can be obtained directly.
- As the authors claim, a current limitation is that the method's applicability has only been demonstrated for LoRA. It would be valuable for future work to explore its compatibility with other PEFT methods to test the generality of the Core Space concept. Examples can include:
- VeRA, which uses trainable vectors and frozen random matrices.
- HydraLoRA, which uses an asymmetric structure with a shared A matrix.
[r1] VeRA: Vector-based Random Matrix Adaptation
[r2] HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning
Questions
- In your proposed method, the final merged update has a rank of up to $T \cdot r$. In contrast, methods like Full Space or KnOTS merging might result in merged updates of a higher rank. For a fair comparison, do you consider providing the merging results under the same target rank?
- The construction of reference bases by stacking task-specific matrices can be viewed as creating a basis for a mixture of LoRA experts. This bears some conceptual similarity to findings in CopRA (specifically their Section 3.1). It introduces a learnable invertible matrix P to minimize the alignment difference, which seems related to Eq.(5) in this paper. Could you please elaborate on the relationship and key differences between Core Space Merging and the strategy proposed in CopRA?
[r3] CopRA: A Progressive LoRA Training Strategy, UniReps Workshop.
Limitations
yes
Justification for Final Rating
Most of my concerns have now been addressed, and I will maintain my 'Accept' rating.
Formatting Concerns
no
We are pleased that the Reviewer recommends full acceptance, acknowledging the simplicity, effectiveness, and efficacy of the proposed method, as well as its solid theoretical foundation, comprehensive empirical evaluation, and the clarity of the paper. We thank the Reviewer for the constructive feedback, and below we respond to specific points raised.
W1 SVD redundancy
The notation in Eq.(7) and Algorithm 1 is inconsistent and the dimensions in the latter formulation seem incorrect. If the formulation in Eq.(7) is correct, it raises a question about the necessity of performing SVD on each individual task's LoRA matrices, since the core matrix can be obtained directly.
We thank the Reviewer for identifying the inconsistency between Eq. (7) and Algorithm 1, and for raising the concern about the necessity of computing per-task SVDs.
As correctly observed, the core matrix can be computed directly by projecting the product of the LoRA factors onto the reference bases, i.e., as $U_{\text{ref}}^\top B_t A_t V_{\text{ref}}$, without the need to decompose each LoRA matrix individually.
We have tested the simplified formulation and confirm that it yields identical results, while being marginally more computationally efficient. We will revise both Eq. (7) and Algorithm 1 accordingly to reflect the corrected and streamlined implementation.
That said, the per-task SVDs are needed from a theoretical standpoint: they help formalize the projection process and support the derivations in Appendix A, including orthogonality, rank preservation, and the zero-error reconstruction proofs. For this reason, we may retain the SVD-based formulation in the theoretical exposition while aligning the algorithmic description with the more efficient practical implementation.
We thank the Reviewer again for this helpful clarification, which improves both the accuracy and clarity of our work.
W2 Core Space for other PEFT methods
As the authors claim, a current limitation is that the method's applicability has only been demonstrated for LoRA. It would be valuable for future work to explore its compatibility with other PEFT methods to test the generality of the Core Space concept.
We present below the average normalized accuracy of merged VeRA models with rank 16. Our method consistently outperforms KnOTS, showing that the Core Space framework applies to various types of PEFT methods. However, we also notice that the normalized accuracies are generally lower than those obtained with LoRA, suggesting that further research in this direction is needed. We will provide the detailed results and absolute accuracies in the final version of the manuscript and release the checkpoints used.
| Space | TA | TIES | DARE-TIES | TSV | Iso-C | TIES + Iso-C | DARE-TIES + Iso-C | TSV + Iso-C |
|---|---|---|---|---|---|---|---|---|
| Full | 64.78 (-) | 63.93 (0.00) | 64.88 (0.00) | 63.74 (0.00) | 64.08 (0.00) | 63.35 (0.00) | 63.35 (0.00) | 63.56 (0.00) |
| KnOTS | 64.78 (-) | 65.26 (+1.33) | 65.63 (+0.75) | 65.33 (+1.59) | 63.30 (-0.78) | 63.26 (-0.09) | 63.27 (-0.08) | 64.71 (+1.15) |
| Core | 64.78 (-) | 65.31 (+1.38) | 65.35 (+0.47) | 65.56 (+1.82) | 66.27 (+2.19) | 64.38 (+1.03) | 64.35 (+1.00) | 66.56 (+3.00) |
Q1 Effective rank of Full and KnOTS spaces
In your proposed method, the final merged update has a rank of up to $T \cdot r$. In contrast, methods like Full Space or KnOTS merging might result in merged updates of a higher rank. For a fair comparison, do you consider providing the merging results under the same target rank?
The final rank of the merged update matrix depends on both the merging space and the merging method. We report in the table below the rank of the merged matrices, averaged across all layers, obtained by merging 8 ViT-B/32 models fine-tuned with LoRA rank $r = 16$.
| Merging Space | TA | TSV | TIES |
|---|---|---|---|
| Full | 128.00 | 128.00 | 766.25 |
| KnOTS | 128.00 | 128.00 | 128.00 |
| Core | 128.00 | 128.00 | 128.00 |
In most cases, the target rank of the merged update is equal to $T \cdot r = 128$. The only exception is merging with TIES in Full space, where the target rank approaches the full dimensionality of the weight matrices (768 for ViT-B/32). This is because, when TIES performs trimming on reconstructed weight matrices, it does not preserve the low-rank structure of the matrix. On the other hand, both Core and KnOTS operate entirely in a constrained $(T \cdot r)$-dimensional space.
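For reference, a minimal sketch of how these average ranks can be computed; the `merged_updates` mapping of per-layer merged updates is hypothetical:

```python
import torch

def average_rank(merged_updates):
    """Average numerical rank of merged updates across layers.

    `merged_updates` is a hypothetical {layer_name: delta_W tensor} mapping.
    """
    ranks = [torch.linalg.matrix_rank(dw).item() for dw in merged_updates.values()]
    return sum(ranks) / len(ranks)

# Example: random rank-128 updates in a 768-dimensional space.
updates = {f"layer_{i}": torch.randn(768, 128) @ torch.randn(128, 768) for i in range(4)}
print(average_rank(updates))  # 128.0
```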
Q2 Comparison with CopRA
The construction of reference bases by stacking task-specific matrices can be viewed as creating a basis for a mixture of LoRA experts. This bears some conceptual similarity to findings in CopRA (specifically their Section 3.1). It introduces a learnable invertible matrix to minimize the alignment difference, which seems related to Eq.(5) in this paper. Could you please elaborate on the relationship and key differences between Core Space Merging and the strategy proposed in CopRA?
We thank the Reviewer for bringing this work to our attention. We agree there is some connection between our approach and CopRA regarding the alignment of low-rank updates. However, there are key differences in motivation, scope, and technical implementation.
In CopRA (Sec. 3.1), a learnable invertible matrix $P$ is introduced to reparameterize a set of LoRA matrices (i.e., mapping $(B, A)$ to $(BP, P^{-1}A)$) to align them better with those of another task. In contrast, Core Space aligns low-rank updates across tasks by projecting them into a shared latent subspace.
Alignment Mechanism: CopRA uses a single learnable invertible matrix to minimize the difference between two LoRA modules. In Core Space, we use reference bases (via SVD) and task-specific closed-form projections to align all task updates into a common subspace.
Learning Objective: While CopRA minimizes the alignment error between two specific models through gradient descent, we construct a globally optimal shared basis across all tasks, ensuring zero reconstruction error without additional learning (see Eq. 9, Sec. 4.2).
Scalability: CopRA is limited to pairwise alignment, whereas Core Space is designed for merging multi-task models with efficient subspace operations. Specifically, as shown in Eq. 5 of CopRA, one of the two LoRA matrices needs to be used as an anchor. It is unclear, in a multiple-task scenario, how CopRA would select the anchor point.
Information Preservation: There is no guarantee that reparameterization with $P$ is lossless, while in Core Space, the transformation is lossless.
Finally, while CopRA introduces a valuable strategy for pairwise alignment using a learnable transformation, Core Space Merging generalizes this concept to the multi-task setting, offering a non-parametric, efficient, and provably lossless merging framework.
Thank you for the detailed rebuttal and the improvements. I am pleased with the additional comparisons, and I will keep my original score.
This paper addresses the problem of efficiently merging multiple models that have been adapted using Low-Rank Adaptation (LoRA). The authors identify a key challenge: existing merging techniques often require reconstructing the full weight matrices from the low-rank components before merging, which results in computational overhead. To solve this, the authors propose to identify a shared, low-dimensional basis for all LoRA updates by performing an SVD on the concatenated LoRA matrices. Then, any merging algorithm can be applied in this compact space. The authors provide a theoretical framework to show that the projection is lossless, alongside a complexity analysis showing significant efficiency gains. The authors show extensive experiments on both vision (ViT) and language (Llama 3 8B) models and show that the proposed method not only has a significant computational advantage but also results in improved performance.
Strengths and Weaknesses
Strengths:
- The paper is well-written, with a clear motivation.
- The proposed method is theoretically grounded and straightforward.
- The primary contribution is that the proposed method is not only dramatically more efficient compared to KnOTS but also improves the final performance of merging algorithms.
- The experimental validation includes many model scales, domains and model merging techniques as baselines. The authors show consistent improvements across all combinations.
Weaknesses:
- The paper focuses on the efficiency of merging which I am not sure is really an issue. Even in weight space, the transformation from low rank matrices to full will include a few matrix multiplications which is minor compared to fine-tuning. In the meantime, there exist new sota methods (in full space) that work much better than TA (perhaps with stronger assumptions). The paper focuses on optimizing merging efficiency rather than full final performance.
- The framework's ability to merge models assumes that the different task-specific LoRA updates can be aligned within a common reference basis. This assumption may not hold for highly dissimilar tasks, which could limit the method's applicability. This can also be problematic beyond classification benchmarks where the fine-tuning dynamics become more complex.
- Minor: Figure 2 needs more context to become standalone (as much as possible of course)
Questions
- Why are the numbers so low in Table 3? For TA: MNIST 53.12, SVHN 41.25 for an avg of 63.78 normalized. In comparison, the TSV paper reports 76.55 for TA and 92.31 for TSV (for full fine-tuning), while the paper reports 66.66 (for LoRA transformed to full). Why such a huge discrepancy? Similar comments apply to the ViT-L/14 table in the appendix.
- What are the single-task accuracies here? Most papers report both absolute and normalized, or just absolute, but the authors here choose only normalized.
- “in Sec. 5.1 we show that merging in Core Space yields better performance than merging in alternative spaces, when the merging function is non-linear.” Can you explain your intuition as to why this happens?
- “Our approach is found to yield compact representations of each task and improve alignment between subspaces of each task.” Isn't this at odds with weight interference from TIES [32] and task interference from [a]?
- Figure 1: isn't the bulk of the computation for e.g. TIES the validation to compute the optimal scaling? This applies to KnOTS as well as the proposed method.
Wang, K., Dimitriadis, N., Ortiz-Jimenez, G., Fleuret, F., & Frossard, P. (2024). Localizing task information for improved model merging and compression. arXiv preprint arXiv:2405.07813.
Limitations
Yes
Justification for Final Rating
Given the authors' rebuttal and the comments from other reviewers, I have updated my score.
Formatting Concerns
None.
We are pleased that the Reviewer appreciates the clarity of the paper, its theoretical soundness, the effectiveness and efficacy of the proposed method, and the thorough empirical evaluation. We thank the Reviewer for the constructive feedback, and below we respond to specific points raised.
W1 On the focus on efficiency
The paper focuses on the efficiency of merging which I am not sure is really an issue. Even in weight space, the transformation from low rank matrices to full will include a few matrix multiplications which is minor compared to fine-tuning. In the meantime, there exist new sota methods (in full space) that work much better than TA (perhaps with stronger assumptions). The paper focuses on optimizing merging efficiency rather than full final performance.
We acknowledge that training LoRA adapters for single-task weights is more expensive than most merging strategies, and improving final performance remains essential. Our method contributes to this by showing that merging in the Core Space significantly boosts the performance of several state-of-the-art approaches.
That said, we believe that efficiency is also an important and orthogonal goal. While merging is cheaper than fine-tuning, it can still be costly for large models, such as Llama or ViT, especially when done at inference time. For instance, very recent methods, such as MASS [A], EMR-Merging [B], and Twin Merging [C], perform multiple merges during inference, where lightweight operations directly impact speed and scalability. Even when merging is done offline, reducing its cost enables efficient tuning of the scaling factor, especially in large models such as Llama, where merging is more expensive than running multiple validation inferences (as we discuss in Q5).
Regarding the point “there exist new SOTA methods (in full space) that work much better than TA”, our work assumes access to LoRA fine-tuned weights, as in KnOTS. The LoRA setting is increasingly relevant due to the prohibitive cost of full fine-tuning of large models like Llama 8B, which requires large-scale computational resources. Additionally, as deeply described in the KnOTS paper, merging LoRA models is not as straightforward as merging fully fine-tuned models, leading to much poorer performance.
We comprehensively evaluate recent SOTA methods in this setting -- including TSV, CART, and Iso-C -- which were originally tested on fully fine-tuned models. When LoRA checkpoints are available, this corresponds to merging the reconstructed full-space matrices $\Delta W_t = B_t A_t$. We also compare against SOTA LoRA-specific methods such as the KnOTS variants. Our results show that merging LoRA modules with TSV, CART, and Iso-C is significantly improved when performed in the Core Space, which we find to be both more effective and more efficient than KnOTS.
W2 Common reference basis and task diversity
The framework’s ability to merge models assumes that the different task-specific LoRA updates can be aligned within a common reference basis. This assumption may not hold for highly dissimilar tasks, which could limit the method’s applicability. This can also be problematic beyond classification benchmarks where the fine-tuning dynamics become more complex.
We do not assume a shared basis a priori. Instead, we construct the “reference basis” as defined in Eq. 4, using the left singular vectors of the horizontally concatenated B matrices and the right singular vectors of the vertically concatenated A matrices. These singular vectors always exist and are obtained via SVD. Regarding task diversity, in our experiments we have already merged a wide range of diverse tasks (as suggested by the Reviewer), including fine-grained datasets (Cars and CUB), satellite imagery (EuroSAT and RESISC), and digits (MNIST and SVHN). Exploring merging beyond classification is indeed an open and interesting direction, which we plan to explore in future work.
W3 Figure 2 context
We thank the Reviewer for the suggestion and will revise Figure 2 to include more standalone context and improve clarity in the final version. Moreover, we propose an improved caption as follows:
Figure 2. Full Space Merging (left) first reconstructs the full-space matrices $\Delta W_t = B_t A_t$, and then performs merging in the full space to obtain $\Delta W_{\text{merged}}$. KnOTS Merging concatenates the reconstructed matrices and performs a costly Singular Value Decomposition (SVD) on the high-dimensional concatenated matrix. Afterwards, it merges the matrices and performs reconstruction to obtain the final $\Delta W_{\text{merged}}$. The proposed Core Space Merging (right) performs SVD on a concatenation of the low-dimensional $B_t$ and $A_t$ matrices to obtain the reference bases. Afterwards, it performs SVD on the individual $B_t$ and $A_t$ matrices and calculates the core matrices. It then performs merging in the Core Space and reconstructs to obtain the final $\Delta W_{\text{merged}}$. Core Space Merging performs merging in a low-dimensional space, allowing for efficient merging methods that would be extremely expensive when performed in a high-dimensional space, such as Full Space or KnOTS.
Q1 Low numbers with respect to the TSV paper
Why are the numbers so low in Table 3? [...]
As discussed above, converting LoRA to full matrices does not yield the same result as full-finetuning. Indeed, we are using the same checkpoints as the KnOTS paper, which differ from those used in TSV; thus, the results differ significantly. The low normalized scores, on the other hand, highlight how underexplored and difficult LoRA merging still is (as discussed in the KnOTS paper), as most prior work focuses on merging fully fine-tuned models. We hope that our work will spark further attention to this important setting.
Q2 Missing absolute accuracies
We follow the KnOTS setting and thus obtain the same fine-tuned accuracies, which are as follows:
| Backbone | Cars | DTD | EuroSAT | GTSRB | MNIST | RESISC | SUN397 | SVHN | Avg |
|---|---|---|---|---|---|---|---|---|---|
| ViT-B/32 r=16 | 74.0 | 58.3 | 99.0 | 92.7 | 99.3 | 88.4 | 64.5 | 96.2 | 84.1 |
| ViT-L/14 r=16 | 99.7 | 70.0 | 98.5 | 97.2 | 99.5 | 95.7 | 79.6 | 97.7 | 92.7 |
Due to limited space, we decided to follow KnOTS and report only the normalized accuracies. However, we agree that it would be helpful to include the absolute accuracies, and we will add them in the final version of the manuscript. The absolute accuracies can be obtained as $\text{Acc}_{\text{abs}} = \text{Acc}_{\text{norm}} \times \text{Acc}_{\text{ft}}$, where the fine-tuned accuracy $\text{Acc}_{\text{ft}}$ is reported above.
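For example, using the TA numbers quoted by the Reviewer, a normalized MNIST accuracy of 53.12 on ViT-B/32 corresponds to an absolute accuracy of roughly 0.5312 × 99.3 ≈ 52.8.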
Q3/Q4 Better performance and weight interference
“in Sec. 5.1 we show that merging in Core Space yields better performance than merging in alternative spaces, when the merging function is non-linear.” Can you explain your intuition as to why this happens?
“Our approach is found to yield compact representations of each task and improve alignment between subspaces of each task.” Isn't this at odds with weight interference from TIES [32] and task interference from [a]?
We appreciate the Reviewer’s question about potential conflicts with prior work on weight interference (TIES [32]) and task interference (TALL [a]). Our approach, however, is fundamentally different: rather than merging in original weight space, we operate in a shared Core Space, constructed via SVD over all task-specific LoRA updates.
In our paper (L271), we demonstrate that merging in the Core Space yields dense representations, whereas the full space contains many unused or redundant components (as illustrated in Figure 4). This dense space has an important property: as shown in Figure 5, the Subspace Alignment Ratio (SAR) across tasks increases when working in the Core Space. In the Iso-C paper ([27], Appendix A.3), the authors demonstrate that a higher SAR correlates with reduced task interference, which can be measured as the L1 distance between the activations of the merged model and those of the single task fine-tuned model. We follow the evaluation protocol from [27] and measure the L1 distance between the activation of the merged model (in Core Space) and the original task-specific models. We found a reduction in interference compared to merging in full space (see our response to Reviewer DB4t, W2, for a detailed description of these results).
The performance improvement occurs only with non-linear merging functions due to the design of our Core Space. When combined with a linear merging function, merging in Core Space becomes equivalent to merging in the Full Space (as shown in L200).
Q5 Cost of searching optimal scaling factor
Figure 1: isn;t the bulk of the computation for e.g. ties the validation to compute the optimal scaling? This applies to knots as well as the proposed method.
As noted by the Reviewer, inference is costly when tuning the scaling factors. This is a common problem among all model merging approaches. However, in large models such as Llama 8B, the cost of merging becomes dominant. For example, in the KnOTS space, merging can be over five times more expensive than inference on the validation set due to its cubic complexity. In contrast, inference cost grows linearly with the model size. We report below the costs of merging and of a single validation inference for Llama 8B.
| Space | Val. Inference (s) | TIES merging (s) | TSV merging (s) | Iso-C merging (s) |
|---|---|---|---|---|
| Full | 900 | 72 | 3360 | 540 |
| KnOTS | 900 | 3000 | 4800 | 4860 |
| Core | 900 | 8 | 12 | 8 |
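For instance, a single Iso-C merge in the KnOTS space (4860 s) costs more than five validation passes (5 × 900 s = 4500 s), whereas the same merge in Core Space (8 s) is negligible in comparison.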
[A] Crisostomi et al. MASS: MoErging through Adaptive Subspace Selection, arxiv preprint 2025
[B] Huang et al. EMR-Merging: Tuning-Free High-Performance Model Merging, NeurIPS 2024
[C] Lu et al. Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging, NeurIPS 2024
Thank you for your thorough answers and clarifications. Regarding W2, my comment is about higher diversity among checkpoints; the original task arithmetic benchmark focuses only on classification tasks fine-tuned with the exact same hyperparameters. Therefore, it would be interesting to see how the method will perform if this very strong assumption is violated (which is the case for real-world scenarios). The authors' rebuttal has addressed many of my concerns and I will update my score accordingly.
We sincerely thank the reviewer for the thoughtful feedback and helpful clarifications.
Regarding W2, we appreciate the observation about the strong assumption of consistent hyperparameters across tasks in the original benchmark. We agree that this limits the realism of the setting and that evaluating the method under more diverse conditions is an important direction.
While the concern was raised in a different context by another reviewer, we conducted an additional experiment involving heterogeneous LoRA ranks across checkpoints trained on different datasets, which we believe is relevant to the issue of hyperparameter diversity. Specifically, we randomly assign rank 16 to half of the tasks and rank 64 to the other half, maintaining a 50/50 split. Varying the LoRA rank introduces a non-trivial form of variation between models, significantly altering their training dynamics. We refer to the response to reviewer DB4t for additional details on this experiment.
Our results, reported below, show that the method continues to perform well even under these relaxed and more heterogeneous conditions, which we believe is a promising step toward assessing its robustness in realistic scenarios.
| Space | TA | TIES | DARE-TIES | TSV | TIES + Iso-C | DARE-TIES + Iso-C | TSV + Iso-C | Iso-C |
|---|---|---|---|---|---|---|---|---|
| Full | 64.34 (-) | 63.50 (0.00) | 63.81 (0.00) | 67.95 (0.00) | 66.90 (0.00) | 67.12 (0.00) | 68.72 (0.00) | 72.06 (0.00) |
| KnOTS | 64.34 (-) | 65.13 (+1.63) | 66.69 (+2.88) | 64.20 (-3.75) | 64.16 (-2.74) | 63.37 (-3.75) | 70.40 (+1.68) | 71.26 (-0.80) |
| Core | 64.34 (-) | 70.59 (+7.09) | 69.43 (+5.62) | 67.41 (-0.54) | 72.56 (+5.66) | 72.68 (+5.56) | 71.51 (+2.79) | 74.90 (+2.84) |
This paper proposes Core Space Merging, a framework for efficiently and effectively merging LoRA-adapted models. The key idea is to project low-rank updates into a shared Core Space before merging, ensuring information preservation and computational efficiency. The approach is supported by theoretical analysis and validated with extensive experiments on both vision and language tasks, showing clear gains in efficiency and accuracy over existing methods.
Pros:
- Clear, well-written paper with strong motivation and theoretical grounding.
- Demonstrates both efficiency (avoiding costly full-matrix reconstructions) and improved performance.
- Extensive empirical validation across domains, scales, and baselines, with consistent improvements.
- Rebuttal convincingly addresses reviewer concerns (e.g., heterogeneous ranks, subspace alignment, comparison with alternative methods).
- High potential for adoption in both research and practice due to simplicity and effectiveness.
Cons:
- Focuses mainly on efficiency; broader questions of merging for generalization to unseen tasks remain unexplored.
- Assumptions about shared alignment basis may limit applicability to highly dissimilar tasks.
- Current scope limited to LoRA (though preliminary results on other PEFT methods are promising).
The paper makes a solid, well-justified contribution to efficient model merging with LoRA. Despite some limitations, the method is novel, theoretically sound, and empirically strong, with clear impact for both the deep learning and parameter-efficient fine-tuning communities.