ICML 2025 · Poster
Rating: 7.8/10 · 4 reviewers (scores: 4, 4, 4, 4; min 4, max 4, std 0.0)

No Task Left Behind: Isotropic Model Merging with Common and Task-Specific Subspaces

OpenReview · PDF
Submitted: 2025-01-10 · Updated: 2025-07-24
TL;DR

We achieve state-of-the-art model merging results by proposing isotropic merging in common and task-specific subspaces of weight update matrices, across vision and language tasks, for both fully fine-tuned and LoRA-adapted models

Abstract

Keywords
Model merging

Reviews and Discussion

Official Review (Rating: 4)

This paper demonstrates that alignment between the individual components of task-specific and merged matrices is strongly correlated with performance improvements over a pre-trained model. Building on this finding, the authors propose an isotropic merging framework that flattens the singular value spectrum of task matrices, thereby enhancing alignment and narrowing the performance gap. Furthermore, they incorporate both common and task-specific subspaces to further optimize alignment and boost performance. The proposed approach achieves state-of-the-art results.

Questions for Authors

  1. Why can subspace alignment eliminate conflicts and improve performance?
  2. How is the ratio of singular values controlled between the common and task-specific subspaces?

Claims and Evidence

The proposed Normalized Accuracy Improvement and Subspace Alignment Ratio are supported by evidence. These metrics provide quantitative validation for the proposed isotropic merging framework, showcasing the alignment improvements and their direct impact on model performance.

Methods and Evaluation Criteria

It is intuitive to keep task-specific knowledge in the near-zero-singular-value part of the common subspace and discard the unimportant part.

Theoretical Claims

This paper does not conduct theoretical analysis.

Experimental Design and Analysis

The experimental design followed Task Singular Vectors, but was limited to vision tasks.

Supplementary Material

Supplementary material provides the code.

Relation to Prior Literature

The method is a further improvement of Task Singular Vectors, removing noise through SVD.

[1] Task Singular Vectors: Reducing Task Interference in Model Merging. arXiv.

Missing Important References

Section 4.2, which discusses retaining components from the common subspace and the orthogonal projection in Equation 10, bears resemblance to the shared subspace optimization concept in DOGE [2]. It is recommended to discuss this.

[2] Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent. arXiv.

Other Strengths and Weaknesses

Strengths: The article is well-written and clear, with simple and effective methods achieving state-of-the-art results in model merging. The proposed methods and metrics are novel and intuitive.

Weaknesses: Further analysis is lacking, e.g., why SVD is applicable to model merging (presumably due to the parameter redundancy introduced by fine-tuning). As it stands, the method reads more like an experimental discovery; such analysis would elevate the article to a higher level.

Other Comments or Suggestions

Suggestions: Add experiments on NLP tasks to verify the generalizability of the method and make the article more complete.

Author Response

We are pleased that the Reviewer appreciates the soundness of our introduced metrics, the simplicity and effectiveness of the proposed approaches, and clear writing. We thank the Reviewer for the comments and we respond below to specific points.

[Reference 1 (R1)]: Section 4.2, which discusses retaining components from the common subspace and the orthogonal projection in Equation 10, bears resemblance to the shared subspace optimization concept in DOGE [2]. It is recommended to discuss this.

Thank you for pointing us to this recent reference. We were not aware of the DOGE paper at the time of preparing the submission, as the preprint appeared two weeks before the deadline. Here we compare this approach with ours:

  • Definition of common/shared subspace: We define the common subspace as the top-$k$ components of the sum of the individual task matrices. DOGE defines the shared subspace as the concatenation of the top-$k$ components from each task matrix followed by SVD, which resembles the TSV method.
  • Orthogonal projection: Both Iso-CTS and DOGE use the idea of orthogonal projection. DOGE applies it to the gradient of $\Delta$ to restrict the optimization process from changing the shared space. Iso-CTS uses the orthogonal projection at the level of weight matrices to determine the task-specific subspace that is orthogonal to the common subspace.

Iso-CTS and DOGE both use the ideas of a common/shared subspace and orthogonal projection, albeit in different ways. Moreover, the results of our approaches are better than those of DOGE. We believe that this discussion is valuable and we will add a detailed version of it to the revised manuscript.
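For concreteness, the following is a condensed NumPy sketch of the Iso-CTS construction as described in this discussion (common subspace from the top-$k$ components of the task sum, task-specific directions from the orthogonal complement, isotropic spectrum at the end). It is our reading of Algorithm 2, not the released code; the function names, the even per-task slot allocation, and the `k_ratio` default are illustrative assumptions.

```python
import numpy as np

def orthonormalize(mat):
    # Nearest semi-orthogonal matrix via SVD (polar factor), cf. Alg. 2, line 11
    p, _, qt = np.linalg.svd(mat, full_matrices=False)
    return p @ qt

def iso_cts(deltas, k_ratio=0.8):
    delta_ta = sum(deltas)                                # task-arithmetic sum
    u, s, vt = np.linalg.svd(delta_ta, full_matrices=False)
    r = len(s)
    k = int(k_ratio * r)                                  # common-subspace size (k/r = 0.8)
    n_spec = (r - k) // len(deltas)                       # task-specific slots per task

    us, vs = [u[:, :k]], [vt[:k].T]                       # start from the common subspace
    p_orth = np.eye(u.shape[0]) - u[:, :k] @ u[:, :k].T   # projector onto its complement
    for d in deltas:
        ut, _, vtt = np.linalg.svd(p_orth @ d, full_matrices=False)
        us.append(ut[:, :n_spec])                         # leading task-specific directions
        vs.append(vtt[:n_spec].T)

    u_all = orthonormalize(np.hstack(us))
    v_all = orthonormalize(np.hstack(vs))
    return s.mean() * (u_all @ v_all.T)                   # isotropic spectrum (Alg. 2, line 12)
```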

[Weakness 1 (W1)]: Further analysis is lacking, e.g., why SVD is applicable to model merging (presumably due to the parameter redundancy introduced by fine-tuning). As it stands, the method reads more like an experimental discovery; such analysis would elevate the article to a higher level.

It is known that fine-tuning large pre-trained models results in a low-rank parameter update. This observation enables efficient fine-tuning of models using inherently low-rank adaptation techniques such as LoRA. Consequently, the recent TSV paper shows how a low-rank approximation of the parameter update matrices, obtained using SVD, can be used to facilitate model merging.

In our paper, we propose to extend the scope of SVD-based analysis for the purpose of model merging. Most importantly, by introducing the SAR metric we show that SVD can help in understanding the overlap between task-specific and merged matrices. Moreover, we show that by modifying the spectrum of singular values of the merged matrix we can increase the alignment between the task and merged matrices (see the response to Reviewer ff9S, section Q2, for a detailed discussion).
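As an illustration, the spectrum modification itself (Iso-C) reduces to a few lines. This is a minimal sketch under our own naming, not the released code; the merged layer weight would then be $\theta_0 + \alpha \cdot$ the returned update, with $\alpha$ tuned on validation data:

```python
import numpy as np

def iso_c(deltas):
    """Flatten the singular spectrum of the summed task matrices to its mean."""
    delta_ta = sum(deltas)                                # task-arithmetic sum
    u, s, vt = np.linalg.svd(delta_ta, full_matrices=False)
    return s.mean() * (u @ vt)                            # same singular vectors, flat spectrum
```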

[Suggestion 1 (S1)]: Supplement experiments on NLP tasks to verify the generalizability of the method and make the article more complete.

We present NLP results in response to Reviewer mnUL (section C3). Iso-C and Iso-CTS outperform other baselines across two presented settings.

[Question 1 (Q1)]: Why can subspace alignment eliminate conflicts and improve performance?

Consider the Subspace Alignment Ratio between a task matrix and the merged task matrix. SAR quantifies the overlap between the subspaces spanned by the dominant singular vectors of these matrices. If SAR is low, the overlap between these subspaces is small and the corresponding singular vectors are close to orthogonal, so the merged matrix cannot reliably represent the dominant components of the task matrix; this leads to low performance on the corresponding task. Conversely, high SAR indicates high subspace overlap, meaning the merged matrix can reliably represent the important components of the task matrix, which results in high performance.
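A minimal sketch of SAR as described above (our illustration; the subspace dimension `k` is assumed to be chosen by the reconstruction criterion of Eq. 6 in the paper):

```python
import numpy as np

def sar(delta_t, delta_m, k):
    """Fraction of delta_t's Frobenius norm captured by delta_m's top-k subspace."""
    u_k = np.linalg.svd(delta_m, full_matrices=False)[0][:, :k]
    proj = u_k @ (u_k.T @ delta_t)                        # projection onto top-k subspace
    return np.linalg.norm(proj, "fro") / np.linalg.norm(delta_t, "fro")
```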

[Q2]: How is the ratio of singular values controlled between the common and task-specific subspaces?

The ratio of singular values between the common and task-specific subspaces is controlled by the hyperparameter $k$, which is fixed for all experiments. $k$ is chosen such that $k/r$ for a single layer equals 0.8. The final paragraph of Section 5.3 of the paper contains an analysis of the impact of $k$ on performance.

Reviewer Comment

Thank you for the authors' rebuttal. Most of the explanations you provided were things I already understood during my review, and I was trying to ask for deeper explanations, such as why subspace alignment eliminates conflicts; you have merely repeated the definition and findings of SAR from the paper. Because averaging the singular values reduces the Frobenius norm and condition number of the task vectors, you need to search for a larger $\lambda$, which is unstable on LLMs. As I said before, these questions are meant to encourage further analysis to elevate the article to a higher level. I also want to point out that the checkpoints used by ISO and TSV are different from those used in most model merging methods (which come from Task Arithmetic), which leads to slightly higher results. Additionally, I would like to ask why the authors did not compare with methods such as EMR-Merging or Twin-Merging.

The current response does not satisfy me. If the further responses are better, I will increase my score.

Author Comment

Why subspace alignment eliminates conflict.

We thank the Reviewer for encouraging us to think more deeply about the relationship between subspace alignment and merging conflicts. We will incorporate this analysis in the revised manuscript.

Intuitively, we can minimize task interference (i.e. eliminate the conflicts) by ensuring that the internal representations of task $j$ remain stable after merging. Let $\theta_0$ be the pre-trained weights for a layer $l$. Define the task matrix $\Delta_j=\theta_j-\theta_0$ and the merged task matrix $\Delta_M$ for the layer $l$. Then, for an input $x_j^{(l)}$, we desire that the post-merging activation $h_j^{(l)}=(\theta_0+\alpha\Delta_M)x_j^{(l)}$, with $\alpha$ chosen on a validation set, be close to the task-specific activation $\hat{h}_j^{(l)}=(\theta_0+\Delta_j)x_j^{(l)}$. Hence, we can quantify the interference using:

$$\|\hat{h}_j^{(l)}-h_j^{(l)}\| = \|(\Delta_j-\alpha\Delta_M)x_j^{(l)}\| \leq \|\Delta_j-\alpha\Delta_M\|\cdot\|x_j^{(l)}\|$$

To show that the interference is lower when the Subspace Alignment Ratio (SAR) between $\Delta_j$ and $\Delta_M$ is higher, we decompose $\Delta_j$ into components aligned with and orthogonal to $\Delta_M$:

$$\Delta_j=\Delta_j^{\parallel}+\Delta_j^{\perp} \quad\text{for}\quad \Delta_j^{\parallel}=\Pi_{k_M,M}\Delta_j \quad\text{and}\quad \Delta_j^{\perp}=(I-\Pi_{k_M,M})\Delta_j,$$

where $\Pi_{k_M,M}$ is the projection matrix onto the subspace spanned by the top $k_M$ left-singular vectors of $\Delta_M$ (see Eqs. 5-6 for their definitions). By rewriting the SAR we have:

$$\text{SAR}(\Delta_j,\Delta_M)=\frac{\|\Delta_j^{\parallel}\|_F}{\|\Delta_j^{\parallel}+\Delta_j^{\perp}\|_F}.$$

Similarly, decomposing $\Delta_M$ into $\Delta_M^{\parallel}$ and $\Delta_M^{\perp}$, we write:

$$\|\Delta_j-\alpha\Delta_M\| = \|\Delta_j^{\parallel}-\alpha\Delta_M^{\parallel}+\Delta_j^{\perp}-\alpha\Delta_M^{\perp}\| \approx \|\Delta_j^{\parallel}-\alpha\Delta_M^{\parallel}+\Delta_j^{\perp}\|,$$ since $k_M$ minimizes the approximation error of $\Delta_M$ (i.e. $\Delta_M^{\perp}\approx 0$).

If SAR is close to 1, then $\|\Delta_j^{\perp}\|$ is small, so interference mainly depends on $\|\Delta_j^{\parallel}-\alpha\Delta_M^{\parallel}\|$. Conversely, if SAR is near zero, the large orthogonal component $\Delta_j^{\perp}$ increases the overall interference, regardless of the choice of $\alpha$. Even with the optimal $\alpha$ chosen via validation, interference cannot be reduced below the norm of the orthogonal component.

Iso-C increases the SAR of $\Delta_t$ with the merged model, bringing it close to 1 (as shown in the paper), by flattening the singular values. Thus, the optimal $\alpha$ can adjust the merged model such that interference is minimized. In contrast, Task Arithmetic (TA), with SAR varying across tasks, exhibits interference that cannot be reduced below the norm of the orthogonal component.
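The floor on interference described above can also be checked numerically. Below is a toy verification (our own illustration, not from the paper's code), using a merged update that is exactly rank-$k$ to realize the $\Delta_M^{\perp}\approx 0$ idealization:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 64, 16
delta_m = rng.normal(size=(m, k)) @ rng.normal(size=(k, m))   # exactly rank-k merged update
delta_j = rng.normal(size=(m, m))                             # task matrix

u_k = np.linalg.svd(delta_m, full_matrices=False)[0][:, :k]
orth = delta_j - u_k @ (u_k.T @ delta_j)                      # (I - Pi) Delta_j

# No scaling alpha pushes interference below the orthogonal component's norm
best = min(np.linalg.norm(delta_j - a * delta_m, "fro")
           for a in np.linspace(0.0, 3.0, 301))
assert best >= np.linalg.norm(orth, "fro") - 1e-9
```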

We also experimentally show that the interference, measured as the L1 distance between the final embeddings of the task-specific and merged models (following [1]), is lower for Iso-C than for TA on ViT-B/16: https://imgur.com/a/b9Lpk8q.

[1] Representation Surgery for Multi-Task Model Merging, ICML 2024

Search for a larger $\lambda$; this is unstable on LLMs.

It is true that averaging the singular values reduces the Frobenius norm and that we need to search for a larger $\lambda$. However, we did not observe instabilities for $\alpha \in [0.5, 3.1]$ (plot: https://imgur.com/a/GgB6nFD) in NLP experiments on T5-Large, a 770M-parameter LLM (see the response to Rev. mnUL, Sec. C3).

checkpoints used by ISO and TSV are different

Thank you for pointing out this important detail. We use the checkpoints introduced by Consensus Merging in all the experiments in our paper (both for our and the competing methods), providing a fair comparison.

However, many other papers use TA checkpoints, and we were not aware of this when comparing with additional methods during this rebuttal. We reran Iso-C and Iso-CTS using the TA checkpoints to fairly compare with methods that reported merging using them:

| Method | ViT-B/32 | ViT-L/14 |
| --- | --- | --- |
| Fisher | 68.3 | 83.7 |
| RegMean | 71.8 | 82.2 |
| PCB | 76.3 | 87.5 |
| CART | 83.0 | 90.8 |
| Iso-C | 84.1 | 92.5 |
| Iso-CTS | 84.3 | 93.0 |

Iso-C and Iso-CTS still outperform all of the added methods.

compare with EMR merging or Twin merging.

We would like to highlight that, during this rebuttal, we added comparisons with 4 vision methods, including the recent SOTAs CART and PCB, as well as PEFT evaluations (e.g. the recent KnOTS) and NLP experiments. Moreover, we consider merging methods that result in a single set of multi-task weights and do not change the inference procedure, so the merged model can be used as a drop-in replacement for the pre-trained model. Twin-Merging, however, composes task-specific components at test time and alters the inference algorithm, more than doubling its cost. Similarly, EMR-Merging uses additional per-task parameter masks and rescalers to perform inference. We will include this discussion in the revised manuscript.

Official Review (Rating: 4)

This paper focuses on bridging the performance gap between the merged and task-specific models. They first show that the subspace alignment of merged and task-specific models correlates with performance improvement. Then, they propose an isotropic merging method to improve the merging performance via flattening the singular values. An extension is proposed to further improve the alignment and performance by considering the task-specific subspaces. Empirical results show that their method consistently outperforms the baselines.

After rebuttal:

I think most concerns are well addressed after the rebuttal. I particularly like the interesting findings about the reasons that lead to the performance gap in TA. I believe those insights are valuable and critical to the community. Theoretical analysis is provided to enhance the understanding. Experimental design and results are also improved.

I have no questions now. I just suggest that the authors summarize all the reviews and appropriately integrate them into the paper, whether in the main paper or the appendix. I am very glad to raise my score to 4.

Questions for Authors

  1. In Fig. 2, the authors try to show that task vector orthogonality is not the reason for the TA performance improvement. However, Fig. 2(b) is confusing to me. The x-axis is the similarity between $\Delta_i$ and $\Delta_{TA}$. Why do we need to compare these similarities? Fig. 2(a) shows the task-vector orthogonality, while 2(b) shows some bad NAI, which already seems enough to conclude the lack of correlation.
  2. I may be missing or misunderstanding something, but I have a question regarding the motivation of the proposed method. In Sec. 3.3, the authors show that the merging performance is correlated with $SAR_{avg}$. However, in Sec. 4.1, the authors use Fig. 1(a) as their motivation to "flatten" the singular values. I wonder what the relationship is between these two motivations. The authors claim that the variability of $SAR_{avg}$ is due to the skewness in Fig. 1(a), but I didn't see a clear relationship.
  3. Some important baselines are missing in the experiments, such as Fisher merging, RegMean, and the recent SOTA PCB-merging. Recent literature also competes with them. Is there any reason the authors do not compare their method with those baselines?
  4. What are the values of $k$ in Tab. 1?
  5. In Fig. 4(c), the performance on SUN397 is worst with $\beta=0.5$. Though this is the only inconsistency, I am curious if there is any explanation.
  6. In Fig. 5(a), the performance of Iso-C improves significantly when $\beta \to 1$, while in Fig. 6, the improvement of Iso-CTS is marginal compared to Iso-C. Does this mean that most of the improvement is due to the isotropic singular values and not the design in Alg. 2? I am curious about the performance of Iso-CTS w/o line 12 (i.e., when there are no isotropic singular values).


Claims and Evidence

I am confused about Fig.2, where the authors propose their motivation. The motivation itself makes sense to me, but Fig.2 is confusing. See questions for details.


After rebuttal:

I think the claims are clear to me right now and are supported by sufficient evidence.

Methods and Evaluation Criteria

I think some important baselines are missing. The motivation of the method is unclear to me as well. See questions for details.

Theoretical Claims

N/A


After rebuttal:

They propose some theoretical analysis, which I believe will strengthen the interpretability of this paper and provide insightful views.

Experimental Design and Analysis

There is a dataset in which the performance is inconsistent. Also, the improvement of Iso-CTS over Iso-C is not sufficiently discussed. See questions for some comments.


After rebuttal:

Concerns are well addressed.

Supplementary Material

I only checked the README.md file.

Relation to Prior Literature

Model merging is a popular method to construct a multi-task model without retraining. However, the performance gap between the task-specific and merged models is critical. This paper first analyzes the key reason why the merged model performs worse and then proposes a novel method to address it. I believe this paper can help enhance the understanding of model merging and be applied as an effective method to merge models.

Missing Important References

N/A

Other Strengths and Weaknesses

The overall organization and writing of this paper are good. I also like the findings that the subspace alignment is critical to model merging, which I believe is important to the understanding of model merging. My main concern is two-fold. First, it lacks some important baselines. Second, there are some potential issues regarding the motivations, the analysis of experimental results, and the effectiveness of the extended method. Please see the questions for details.


After rebuttal:

I think most concerns are well addressed after the rebuttal. I particularly like the interesting findings about the reasons that lead to the performance gap in TA. I believe those insights are valuable and critical to the community. Theoretical analysis is provided to enhance the understanding. Experimental design and results are also improved.

Other Comments or Suggestions

N/A

Author Response

We are pleased that the Reviewer appreciates the novelty of the proposed method, the significance of our contribution, and the clear writing. We thank the Reviewer for their constructive feedback, and below we respond to specific points raised.

[Question 1 (Q1)]: In Fig.2, the author tries to show...

Providing Fig. 2(a) along with only the Normalized Accuracy Improvement (NAI) for each task would show that all task vectors $\Delta_t$ exhibit similar cosine similarities close to zero, yet the NAIs vary significantly (e.g. the DTD task vector is orthogonal to all others but has the lowest NAI). We agree that this alone suggests that mere orthogonality among task vectors does not explain differences in performance.

Fig. 2(b), however, takes this analysis one step further by examining whether the cosine similarity between each $\Delta_t$ and the task arithmetic vector $\Delta_{TA}$ correlates with NAI. Intuitively, one might expect that a higher similarity between $\Delta_t$ and $\Delta_{TA}$ would result in a higher NAI, but no clear correlation is found. This reinforces the observation that cosine similarity is not a good predictor of performance improvement. We conducted this analysis because we believe that a key factor in understanding the effectiveness of task arithmetic model merging is to directly compare each task matrix with the merged model matrix. Since the cosine similarity between task vectors and the task arithmetic vector alone does not explain performance gains, we propose analyzing the Subspace Alignment Ratio (SAR) between each individual task matrix and the merged task matrix, which indeed shows a positive correlation with performance improvement (Fig. 3). In the revised manuscript we will include elements of this discussion to more clearly link the results in Figs. 2 and 3.

[Q2]: I may be missing or misunderstanding something, but I have a question regarding the motivation of the proposed method...

The subspace alignment ratio $SAR_{\text{avg}}(\Delta_t, \Delta_{TA})$ quantifies how well a task matrix $\Delta_t$ is represented by the subspace of the task arithmetic matrix $\Delta_{TA}$. The subspace dimension $k_M$, as defined in Eq. (6), is determined by the number of singular vectors required to minimize the reconstruction error in terms of the Frobenius norm. Because the singular value spectrum of $\Delta_{TA}$ is skewed, only a few singular values are large, leading to a low $k_M$ (see Fig. 4(a), $\beta=0.0$ (TA), marked by a vertical red dashed line). Relying on these few singular vectors to represent each task matrix produces a highly variable $SAR_{avg}$ across tasks (Fig. 4(b), $\beta=0.0$), indicating that some tasks are not well captured by this limited subspace.

The motivation for "flattening" the singular values via Iso-C in Sec. 4.1 is to address this issue. By scaling the singular values, the influence of less dominant singular values increases while that of the dominant ones decreases. This adjustment raises the effective subspace dimensionality $k_M$ (as shown by the vertical dashed lines for $\beta > 0$ in Fig. 4(a)), resulting in a subspace that better represents all task matrices. Consequently, this leads to a higher $SAR_{avg}$ (Fig. 4(b)) and Normalized Accuracy Improvement (Fig. 4(c)). Thus, the skewness in Fig. 1(a) explains the variability in $SAR_{avg}$ (and hence merging performance), and the singular value flattening is introduced as a solution to this limitation.

[Q3]: Some important baselines are missing in the experiments, such as Fisher merging, RegMean, and a recent SOTA PCB-merging...

Thank you for pointing out the PCB method, which is a recent and relevant baseline. Originally, we omitted Fisher Merging and RegMean for brevity, as they are outperformed by many recent methods. We additionally include the CART baseline requested by Reviewer BR4L. The PCB paper reports the average absolute accuracy for merging 8 tasks across 2 model sizes, and we compare these results in the table below:

| Method | ViT-B/32 | ViT-L/14 |
| --- | --- | --- |
| Fisher | 68.3 | 82.2 |
| RegMean | 71.8 | 83.7 |
| PCB | 76.3 | 87.5 |
| CART | 83.0 | 90.8 |
| Iso-C | 86.3 | 94.2 |
| Iso-CTS | 86.2 | 94.7 |

Iso-C and Iso-CTS outperform all of the added baselines.

[Q4]: What are the values of $k$ in Tab. 1?

We use $k/r = 0.8$ as the default for all experiments (see L408-410, right column), where $r$ is the number of singular values for a given layer. Therefore, $k$ can vary across layers according to the $r$ of each particular layer.

[Q6]: ...I am curious about the performance of Iso-CTS w/o line 12...

We present the comparison of the performance of Iso-CTS and Iso-CTS w/o line 12: https://imgur.com/a/39CrGKJ. We observe that isotropic scaling is indeed a crucial component of Iso-CTS. However, the design in Alg. 2 also plays an important role, especially when the number of merged models increases, leading to up to 2.8% improvement on 20 tasks (see Table 1).

Reviewer Comment

Thanks a lot for your detailed response! While most of my concerns are addressed, I still have a question regarding your response to Q2. Based on the response, my understanding is that a skewed spectrum leads to a lower $k_M$, which further results in a higher SAR. However, the latter relationship is still unclear to me.

From your response, the logic of the "causal chain" is $\{\sigma_i\}_i \to k_M \to \text{SAR}$. But I don't see why a lower $k_M$ leads to a higher SAR from Eqs. 5 and 6. The results in Fig. 4 that you mentioned only imply $\{\sigma_i\}_i \to \text{SAR}$ or $k_M \to \text{SAR}$, but I am still unsure whether it is due to a lower $k_M$.

While I think this does not influence the quality and contribution of this paper, I'd still like to know whether $k_M$ affects SAR and how it does so. Could you explain a bit (theoretically) based on Eqs. 5 and 6? Empirical results are also acceptable, but I think it could be hard to verify via experiments. I am glad to raise my score if this can be addressed.

Due to the limited number of communication rounds, I'd like to summarize my review here. I really like the findings of this paper, esp. Fig. 2 and 3, which provide new views to understand task arithmetic. Though some unclear points may be due to my misunderstanding, I strongly encourage the authors to make them clearer for readers, as other reviewers also posted similar questions. Overall, this is an interesting and solid paper.

Author Comment

We thank the Reviewer for engaging with us in this discussion. Below, we formally clarify the causal chain $\{\sigma_i\}\rightarrow k_M\rightarrow \text{SAR}$.

The Subspace Alignment Ratio (SAR) between a task matrix $\Delta_t$ and a merged matrix $\Delta_M$ is:

$$\text{SAR}(\Delta_t,\Delta_M;k_M)=\frac{\|\Pi_{k_M,M}\Delta_t\|_F}{\|\Delta_t\|_F},$$

where $\Pi_{k_M,M}=U_{k_M,M}U^\top_{k_M,M}$ is the projection onto the subspace spanned by the top $k_M$ left-singular vectors of $\Delta_M$. The rank $k_M$ minimizes the approximation error:

$$k_M=\min\lbrace k:\|\Delta_M-\Pi_{k,M}\Delta_M\|_F\leq\epsilon\|\Delta_M\|_F\rbrace.$$

$\sigma_i\rightarrow k_M$: the connection between the skewness of the spectrum of $\Delta_M$ and $k_M$.

Using the SVD $\Delta_M=U\Sigma V^T$, where $\Sigma=\text{diag}(\sigma_1,\ldots,\sigma_r)$, by the definition of the Frobenius norm we have:

$$\|\Delta_M\|_F^2=\sum_{i=1}^r\sigma_i^2,\quad \|\Delta_M-\Pi_{k,M}\Delta_M\|_F^2=\sum_{i=k+1}^r\sigma_i^2.$$

Hence, the relative approximation error becomes: $$\frac{\|\Delta_M-\Pi_{k,M}\Delta_M\|_F^2}{\|\Delta_M\|_F^2}=\frac{\sum_{i=k+1}^r\sigma_i^2}{\sum_{i=1}^r\sigma_i^2},$$

and $k_M$ can be defined as:

$$k_M=\min\left\lbrace k:\frac{\sum_{i=k+1}^r\sigma_i^2}{\sum_{i=1}^r\sigma_i^2}\leq\epsilon^2\right\rbrace.$$

This formulation is equivalent to the one used in the paper but explicitly shows how the skewness of the spectrum $\lbrace\sigma_i\rbrace$ controls $k_M$. When $\Delta_M$ has a skewed spectrum (e.g. $\sigma_1^2 \gg \sum_{i=2}^r \sigma_i^2$), a small $k_M$ is enough to satisfy the condition. This explains why Task Arithmetic $\Delta_{TA}$ ($\beta=0$ in Fig. 4(a)), which has a skewed spectrum, yields a smaller $k_M$ than Iso-C, whose flatter spectrum leads to a larger $k_M$. We believe that expressing $k_M$ directly in terms of singular values highlights the link between spectral skewness and subspace dimensionality. We will adopt this definition in the revised version.
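This definition of $k_M$ is directly computable from the spectrum. The following small sketch (our illustration, with an arbitrary $\epsilon=0.05$ and a made-up spectrum) shows how flattening a skewed spectrum to its mean raises $k_M$:

```python
import numpy as np

def k_m(s, eps=0.05):
    """Smallest k whose tail energy falls below eps^2 of the total (see above)."""
    energy = s**2
    tail = energy.sum() - np.cumsum(energy)               # sum_{i>k} sigma_i^2
    return int(np.argmax(tail <= eps**2 * energy.sum()) + 1)

s_skewed = np.array([10.0, 1.0, 0.5, 0.3, 0.2, 0.1])      # skewed, TA-like spectrum
s_flat = np.full_like(s_skewed, s_skewed.mean())          # Iso-C flattening
print(k_m(s_skewed), k_m(s_flat))                         # prints: 3 6
```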

$k_M \rightarrow \text{SAR}$: the connection between $k_M$ and SAR.

The rank $k_M$ defines the effective rank of the subspace identified by the merged model, determined directly by its spectrum. Let $k_{TA}$ be the effective rank of $\Delta_{TA}$, and define

$$T=\lbrace u_1,\ldots,u_{k_{TA}}\rbrace$$

as the orthonormal basis formed by those $k_{TA}$ singular vectors. Flattening the spectrum of $\Delta_{TA}$ (Fig. 4(a)) yields $\Delta_{Iso\text{-}C}$ with effective rank $k_{Iso}>k_{TA}$ (as discussed previously). This flattening modifies only the singular values of TA, leaving the singular vectors unchanged. Therefore, the original subspace $T$ is contained within the larger subspace spanned by the top singular vectors of $\Delta_{Iso\text{-}C}$, defined as:

$$I=\lbrace u_1,\ldots,u_{k_{TA}},\ldots,u_{k_{Iso}}\rbrace.$$

Thus, by construction, we have $T\subset I$.

For simplicity, let $\Pi_T=\Pi_{k_{TA},TA}$ and $\Pi_I=\Pi_{k_{Iso},Iso}$ denote the projection operators onto the subspaces spanned by $T$ and $I$, respectively. Since $T\subset I$, for any matrix $\Delta_t$ it holds that:

$$\text{SAR}(\Delta_t,\Delta_{TA})=\frac{\|\Pi_T\Delta_t\|_F}{\|\Delta_t\|_F}\leq\frac{\|\Pi_I\Delta_t\|_F}{\|\Delta_t\|_F}=\text{SAR}(\Delta_t,\Delta_{Iso\text{-}C}).$$

This inequality holds because, by definition: $$\frac{\|\Pi_T\Delta_t\|_F^2}{\|\Delta_t\|_F^2}=\frac{\sum_{i=1}^{k_{TA}}\sum_j\langle u_i,\Delta_t^{(j)}\rangle^2}{\|\Delta_t\|_F^2}\leq\frac{\sum_{i=1}^{k_{TA}}\sum_j\langle u_i,\Delta_t^{(j)}\rangle^2+\sum_{i=k_{TA}+1}^{k_{Iso}}\sum_j\langle u_i,\Delta_t^{(j)}\rangle^2}{\|\Delta_t\|_F^2}=\frac{\|\Pi_I\Delta_t\|_F^2}{\|\Delta_t\|_F^2},$$

where $\Delta_t^{(j)}$ denotes the $j$-th column of $\Delta_t$. Equality holds (i.e. $\text{SAR}(\Delta_t,\Delta_{TA})=\text{SAR}(\Delta_t,\Delta_{Iso\text{-}C})$) only if the additional vectors added to the basis $T$, that is $\lbrace u_{k_{TA}+1},\ldots,u_{k_{Iso}}\rbrace$, are orthogonal to each $\Delta_t^{(j)}$ or, equivalently, if they do not intersect the column space of $\Delta_t$ (i.e. its left singular vectors).

Hence, in general, a lower $k_M$ yields a smaller or equal SAR compared to a larger $k_M$. However, our empirical findings show that enriching the basis $T$ with singular vectors corresponding to smaller singular values in the original task arithmetic spectrum (i.e. $\lbrace u_{k_{TA}+1},\ldots,u_{k_{Iso}}\rbrace$) consistently increases the alignment ratio (Fig. 4(b)), implying that these vectors are relevant for representing each task matrix $\Delta_t$ and not orthogonal to its left singular vectors. This analysis formally supports the claim that a higher effective rank $k_M$ for the merged matrix leads to a higher SAR. We will make explicit the connection between $k_M$, Iso-C and SAR at the end of Section 4.1 in the final version of the paper.
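The monotonicity argument above is also easy to verify numerically. The sketch below (our illustration with random matrices) checks that SAR never decreases as the retained subspace dimension $k$ grows, precisely because flattening keeps the singular vectors fixed and only enlarges the retained basis:

```python
import numpy as np

rng = np.random.default_rng(1)
delta_ta = rng.normal(size=(32, 32))
delta_t = rng.normal(size=(32, 32))

u = np.linalg.svd(delta_ta, full_matrices=False)[0]       # fixed singular vectors

def sar_at(k):
    proj = u[:, :k] @ (u[:, :k].T @ delta_t)
    return np.linalg.norm(proj, "fro") / np.linalg.norm(delta_t, "fro")

sars = [sar_at(k) for k in range(1, 33)]
assert all(a <= b + 1e-12 for a, b in zip(sars, sars[1:]))  # SAR never decreases with k
```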

Official Review (Rating: 4)

This paper proposes a novel model merging framework that enhances alignment between the subspaces of the task models and the merged model. The framework includes two algorithms: (1) Iso-C, which achieves isotropy by flattening the spectrum to the averaged singular values, and (2) Iso-CTS, in which the lowest spectral components are further replaced by task-specific directions. Experiments on merging 8, 14, and 20 CLIP models demonstrate the effectiveness of this framework.

Update after rebuttal

Thank you for the very detailed and complete rebuttal. I appreciate the authors' newly added theoretical justification and arguments under the responses to Reviewer ff9S. Also, thank you for providing the comparisons to those new baselines; they clearly show the advantages of Iso-C and Iso-CTS. The new results the authors provided for LoRA FT models and also T5 models (under the responses to Reviewer mnUL) are also convincing. The new results provided during the rebuttal have greatly improved the quality of the paper, and I hope the authors can include them in the final version. I have updated my scoring to reflect this.

Questions for Authors

  • The claim in Section 3.2 (lines 145–147) is somewhat confusing. The TA paper suspected that its effectiveness arises from the cosine similarity between the vectorized representations of the task matrices being close to zero. From Fig. 2(a), they are indeed close to zero?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

N/A

Experimental Design and Analysis

The multi-task model merging experimental designs and the metrics for evaluating merged model follow the standard procedure of model merging work. The three analyses experiments in Section 5.3 are sound and valid.

Supplementary Material

Yes, skimmed the whole appendix.

Relation to Prior Literature

Yes. This paper provides an analysis of how an isotropic merged matrix can enhance model merging performance, complementing existing spectral-based merging methods by showing that solely modifying singular values can be a powerful approach. This work also analyzes individual task performance, offering a fresh evaluation angle in areas such as fairness.

Missing Important References

N/A

Other Strengths and Weaknesses

Strengths

  • Both Iso-C and Iso-CTS exhibit low hyperparameter sensitivity, as demonstrated in the analysis and appendix.
  • The proposed framework enhances model merging performance as well as fairness.
  • The proposed framework uses subspace alignment ratio as the metric for quantifying subspace similarity.
  • The paper is generally clear, and the overall structure is easy to follow.

Weaknesses

  • Lacks theoretical justification or motivation for why making the merged matrix isotropic enhances the average subspace alignment ratio.
  • The SVD baseline is insufficient.

[1] Choi, J., Kim, D., Lee, C., & Hong, S. (2024). Revisiting weight averaging for model merging. arXiv preprint arXiv:2412.12153.

  • The main experiments are conducted on a single model family.
  • Current results are based on merging fully fine-tuned models. Evaluating the approach on PEFT models (e.g. LoRA) would provide a more complete understanding of methods' capabilities.
  • As acknowledged by the authors in the Limitations section, the methods have not been tested in the NLP domain.

Other Comments or Suggestions

  • Iso-CTS requires multiple SVD operations; can the authors provide a complexity analysis of the proposed method?
  • Why Table 1 only reports “average absolute accuracy” and “average normalized accuracy” but not the proposed NAI?
  • The authors should provide more motivation for the performance gains by Iso-C and Iso-CTS and include supporting numerical results to echo "no task left behind". e.g., in Figure 3(a), tasks that were less represented in the TA model (such as Cars, DTD, and SUN397) exhibit greater performance improvement after Iso-C, which aligns with expectations.
Author Response

We are glad that the Reviewer appreciates the soundness of the experimental protocol and analyses, the effectiveness of the proposed approach, and clear writing. We thank the Reviewer for their comments and we respond below.

[Weakness 1 (W1)]: Lacks theoretical justification...

We provide a detailed explanation of why making the merged matrix isotropic increases subspace alignment in the response to Reviewer ff9S (section Q2).

[W2]: SVD baseline is insufficient.

Thank you for pointing out CART, which is a recent and very relevant baseline. We added it (alongside the baselines suggested by Reviewer ff9S, section Q3) to our main table for 8 tasks (as these are the only results reported in the CART paper):

| Method | ViT-B/32 | ViT-L/14 |
| --- | --- | --- |
| Fisher | 68.3 | 83.7 |
| RegMean | 71.8 | 82.2 |
| PCB | 76.3 | 87.5 |
| CART | 83.0 | 90.8 |
| Iso-C | 86.3 | 94.2 |
| Iso-CTS | 86.2 | 94.7 |

Iso-C and Iso-CTS outperform all these methods.

[W3]: Experiments on a single model family.

In this rebuttal we add NLP experiments highlighting the effectiveness of the Iso methods on T5, an encoder-decoder language transformer (see the response to Reviewer mnUL, section C3).

[W4]: ...Evaluating the approach on PEFT (e.g. LoRA)...

Thank you for the suggestion; it helps to emphasize the generalizability of our approach. We follow the evaluation protocol of KnOTS [1], a recent SOTA (ICLR 2025) method for merging LoRA fine-tuned models, tested on 8 vision tasks using ViT-B/32 and ViT-L/14. For comparison, we merge the task-specific LoRA weights (provided by the authors) into the pre-trained models, and then apply Iso-C and Iso-CTS. Below, we present the average normalized accuracy:

| Method | ViT-B/32 | ViT-L/14 |
| --- | --- | --- |
| KnOTS-TIES | 68.0 | 78.2 |
| KnOTS-DARE | 63.9 | 75.6 |
| Iso-C | 74.4 | 89.4 |
| Iso-CTS | 75.0 | 89.6 |

Iso-CTS achieves SOTA results in the LoRA merging setting. Note that our method is a general-purpose merging technique, while KnOTS is specifically designed for LoRA merging. This highlights the versatility of the Iso methods.

[1] Stoica et al. Model merging with SVD to tie the Knots, ICLR 2025.
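For clarity, the protocol described above amounts to materializing each LoRA update as a dense task matrix and then merging as usual. Below is a hedged sketch (our illustration; `iso_c` is the flattening routine sketched earlier in this discussion, and the $\alpha/r$ scaling convention follows the standard LoRA formulation, which may differ from the released checkpoints):

```python
import numpy as np

def lora_task_matrix(lora_a, lora_b, lora_alpha, rank):
    """Dense update Delta = (alpha / r) * B @ A from the LoRA factors."""
    return (lora_alpha / rank) * lora_b @ lora_a

# Hypothetical usage, per layer:
# deltas = [lora_task_matrix(a_t, b_t, lora_alpha=16, rank=8) for a_t, b_t in factors]
# theta_merged = theta_0 + alpha * iso_c(deltas)
```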

[W5]: The methods have not been tested in the NLP domain.

See our response to Reviewer mnUL (section C3) for NLP results.

[Comment 1 (C1)]: ...can authors provide a complexity analysis?

Let $\Delta_t \in \mathbb{R}^{m\times n}$, with $m\geq n$, and let $T$ and $L$ be the number of tasks and network layers, respectively. For simplicity, assume that each layer has a single matrix of dimensions $m\times n$. In the analysis below, line numbers refer to Algorithm 2 in the main paper.

  • Iso-CTS: one SVD on $\Delta_{TA}$ (lines 2-3), whose complexity is $O(mn^2)$; applied to each layer, this gives $O(Lmn^2)$. One SVD on each $\Delta_t$, $t=1..T$, for each layer (line 5), giving $O(LTmn^2)$. Finally, line 11 requires two SVDs on matrices $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$; since $m \geq n$, we have $r=n$, giving $O(2Lmn^2)$. The total complexity is $O(\text{Iso-CTS})=O(Lmn^2+LTmn^2+2Lmn^2)=O(LTmn^2)$.
  • Iso-C: one SVD on $\Delta_{TA}$ per layer, with complexity $O(\text{Iso-C})=O(Lmn^2)$.
  • TSV (our nearest competitor): $T$ SVDs per layer on each task matrix (line 1, Alg. 1, TSV paper) and two additional SVDs per layer (lines 10-11, Alg. 1, TSV paper), thus $O(\text{TSV})=O(LTmn^2+2Lmn^2)=O(LTmn^2)$.

While Iso-CTS and TSV share the same asymptotic complexity, Iso-CTS incurs slightly more overhead due to the SVD on $\Delta_{TA}$ (lines 2-3). Both methods can be further optimized by computing truncated SVDs for Iso-CTS (line 7) and TSV (line 1, Alg. 1, TSV paper), since only a few components are retained. This reduces the complexity of both approaches. Iso-C is the most computationally efficient algorithm.
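The truncated-SVD optimization mentioned above can be realized, for example, with SciPy's iterative solver, which returns only the top-$k$ singular triplets instead of the full decomposition (our illustration; the shapes and $k$ are arbitrary):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
delta = rng.normal(size=(768, 768))
k = 32                                    # number of retained components

u, s, vt = svds(delta, k=k)               # top-k singular triplets only
order = np.argsort(s)[::-1]               # svds returns ascending singular values
u, s, vt = u[:, order], s[order], vt[order]
```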

[C2]: Why Table 1 only reports “average absolute accuracy” and “average normalized accuracy” but not NAI?

We report these two metrics to stay consistent with previous literature (Consensus TA, TSV-M). In the revised manuscript we will add a Table reporting NAI.

[C3]: The authors should provide more motivation...

Thank you for the suggestion. In the revised manuscript we will put more emphasis on "no task left behind" achieved by Iso methods highlighting higher performance improvements for tasks underrepresented in TA.

[Question 1 (Q1)]: The claim in Section 3.2...

Yes, each pair of task vectors has near-zero cosine similarity (Fig. 2(a)). However, our analysis goes a step further by comparing the cosine similarity between individual task vectors and the task arithmetic vector, demonstrating that this measure alone does not correlate with normalized accuracy improvement (Fig. 2(b)). Since cosine similarity alone does not explain performance gains, we introduce SAR. Unlike cosine similarity, SAR allows for meaningful differentiation among task matrices by highlighting shared subspaces (Fig. 3(b)). Additionally, SAR positively correlates with NAI (Fig. 3(a)).

Official Review (Rating: 4)

The paper studies how to improve model merging methods by leveraging the singular value decomposition (SVD) of task matrices, defined as the differences between fine-tuned models' weight matrices and the pre-trained model. The authors first show that merging performance correlates with the alignment between the top eigenspace of task-specific and merged matrices. Building on this insight, they propose isotropic merging (ISO-C), which replaces the singular values of merged matrices with a uniform spectrum. Additionally, they refine the merged matrices by substituting directions associated with small singular values with task-specific eigenspaces orthogonal to the top eigenspace of the merged matrices before flattening the spectrum. These approaches achieve state-of-the-art performance on standard computer vision model merging benchmarks.

Update after rebuttal

I maintain my positive assessment.

Questions for Authors

N/A

Claims and Evidence

The claims are well supported.

Methods and Evaluation Criteria

The paper includes 3 standard model merging benchmarks with 8, 14, and 20 tasks and evaluates 3 CLIP models with ViT base and large encoders, as is standard in this research area.

Theoretical Claims

N/A

Experimental Design and Analysis

The experimental design is sound.

Supplementary Material

The supplementary material (appendix) was reviewed. I did not review the attached code.

Relation to Prior Literature

The paper contributes to the ongoing research on improving model merging and mitigating task interference. Similarly to concurrent studies, it considers the SVD of the weight matrices. The proposed techniques, uniform singular value scaling and selective incorporation of task-specific subspaces, are novel and improve performance over previous techniques across model sizes and numbers of merged tasks.

Missing Important References

N/A

Other Strengths and Weaknesses

Other strengths:

  • The proposed method is original and contributes novel insights into model merging.
  • Model merging via weight interpolation is a relatively recent but impactful area. Improving it by reducing task interference is a significant contribution.
  • The paper is clearly written and well-structured.

Minor weaknesses:

  • Sec. 3, along with Fig. 2 and 3, lacks details of the experimental setting and the models considered.
  • $\rho$ (L195, right) is undefined in the main text.
  • L240-246 repeatedly mention more/less correlated tasks. I think this terminology is vague and should be clarified in terms of alignment.
  • The claim in L256-258 (right) would benefit from explicit justification.

Other Comments or Suggestions

  • Have you analyzed SAR at different depths of the models?
  • I assume Fig. 3(a) is obtained for a ViT base. How would it look for the larger model?
  • Finally, I think the paper would be much stronger if it also presented some results for NLP, as is standard in the field, since model merging is widely relevant beyond computer vision.
Author Response

We are pleased that the Reviewer acknowledges the novelty of the proposed method, the contribution towards understanding model merging and clear writing. We thank the Reviewer for the constructive feedback, and below we respond to specific points raised.

[Minor weakness 1 (MW1)]: Sec. 3, along with Fig. 2 and 3, lacks details of the experimental setting and the models considered.

We used ViT-B/16 and 8 tasks (see Sec. 5.1 for details). We will clarify this in the revised manuscript.

[MW2]: $\rho$ (L195, right) is undefined in the main text.

$\rho$ is the Pearson correlation coefficient, defined in the caption of Fig. 3. We will unify the notation to $\rho_{TA}$ and add the definition to the main text.

[MW3]: L240-246 repeatedly mention more/less correlated tasks. I think this terminology is vague and should be clarified in terms of alignment.

We will clarify this part in the revised manuscript to avoid vagueness:

However, significant variability in the average alignment ratio across datasets leads to a lower accuracy improvement for less aligned tasks compared to tasks belonging to groups of high alignment. This variability stems from the skewness of the task arithmetic spectrum (Fig. 1), which is concentrated in the first few singular values (which we call top or dominant), favoring the tasks from the highly aligned groups.

[MW4]: The claim in L256-258 (right) would benefit from explicit justification.

We can formalize the SVD problem for the first left principal singular vector as the variance maximization problem:

$$u_1=\arg\max_{\|u\|=1}\|\Delta_{TA}^T u\|^2=u^T\left(\sum_{t=1}^T\Delta_t\Delta_t^T\right)u+u^T\left(\sum_{t,s=1,\,t\neq s}^T\Delta_t\Delta_s^T\right)u$$

If a particular task $\Delta_j$ has dominant directions with significantly lower intensity compared to the other tasks (i.e. a lower Frobenius norm), then its individual contribution $\Delta_j\Delta_j^T$ to the total variance becomes smaller. Similarly, cross terms involving $\Delta_j$ will also be comparatively small. Therefore, task $j$ contributes less to the maximized variance captured by the first principal singular direction.

Moreover, if the directions of $\Delta_j$ are orthogonal or nearly orthogonal to $u_1$ (i.e. $u_1^T\Delta_j=0$), task $j$ contributes minimally or not at all along this principal direction. Similar considerations apply to the subsequent singular vectors $u_2,\ldots,u_k$ defining the common subspace. Finally, as the number of tasks $T$ increases and the tasks become more diverse, it becomes increasingly likely that tasks with distinct but smaller-magnitude directions will be underrepresented or absent in the dominant singular directions identified by the task arithmetic decomposition. This is empirically supported by our results: Iso-CTS provides the most improvement when the number of tasks increases.
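This underrepresentation effect is easy to reproduce in a toy setting. The sketch below (our own illustration; rank-1 task matrices with random directions, all names hypothetical) shows that a task with much lower intensity is largely absent from the top singular directions of the task-arithmetic sum:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
u_dirs = rng.normal(size=(5, d))
u_dirs /= np.linalg.norm(u_dirs, axis=1, keepdims=True)
v_dirs = rng.normal(size=(5, d))
weights = [10.0, 10.0, 10.0, 10.0, 0.1]   # last task has much lower intensity
tasks = [w * np.outer(u_dirs[i], v_dirs[i]) for i, w in enumerate(weights)]

u = np.linalg.svd(sum(tasks), full_matrices=False)[0]
p4 = u[:, :4] @ u[:, :4].T                # top-4 subspace of the task-arithmetic sum

for i, t in enumerate(tasks):
    overlap = np.linalg.norm(p4 @ t, "fro") / np.linalg.norm(t, "fro")
    print(f"task {i}: fraction in top-4 subspace = {overlap:.2f}")
# Tasks 0-3 land near 1.0; the low-intensity task is largely left out.
```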

[Comment 1 (C1)]: Have you analyzed SAR at different depths of the models?

We analyze SAR at different depths. For the ViT-B/16 model, we calculate SAR between the fine-tuned and merged weight matrices and average over all the matrices from a given layer. We present the results here: https://imgur.com/a/tLnEoAi. We observe that the alignment is higher for Iso across all layers of the vision transformer. One might expect early layers to be more aligned, but we find that for both approaches the alignment is similar across layers.

[C2]: I assume Fig. 3(a) is obtained for a ViT base. How would it look for the larger model?

Yes, we obtain Fig. 3(a) for ViT-B/16. See the Figure for ViT-L/14 here: https://imgur.com/a/3V6xv7T. It closely resembles Fig. 3(a) for ViT-B/16 from the paper.

[C3]: I think the paper would be much stronger presenting some results for NLP...

We present NLP results following the experimental setup from [1]. We use the T5-Large-LM-Adapt base model fine-tuned on tasks from the T0 mixture. We consider subsets of 8 and 7 NLP tasks, adhering to the setup of Table 1 from [1], and compute the average accuracy of Iso-C and Iso-CTS in these settings:

| Method | 8 tasks (Zhou et al., 2022) | 7 tasks (Yadav et al., 2023) |
| --- | --- | --- |
| Weight Avg. | 56.4 | 60.5 |
| TA | 63.8 | 69.2 |
| TIES | 62.8 | 71.9 |
| Fisher | 57.7 | 61.0 |
| RegMean | 69.1 | 74.3 |
| MaTS | 72.5 | 81.5 |
| Iso-C | 75.6 | 83.3 |
| Iso-CTS | 75.2 | 82.8 |

Both Iso-C and Iso-CTS significantly outperform the competing approaches, which highlights the versatility of our proposed methods. We observe that Iso-CTS achieves slightly worse results than Iso-C. This is consistent with our vision results, in which both approaches performed very similarly when merging 8 models. We argue that the common space captures all the directions necessary to reliably represent these 7 and 8 NLP tasks, while task-specific subspaces may become more effective when merging more models.

[1] Tam et al. Merging by Matching Models in Task Parameter Subspaces, TMLR 2024

Reviewer Comment

Thank you for the response. I maintain my positive assessment.

Final Decision

The work proposes to improve model merging by reducing task interference. The method computes the SVD of the summed task vectors (which are actually per-layer weight matrices) and then flattens the singular values to their average.

Strong improvements on standard computer vision model merging datasets are shown, yielding a similar performance to the recent Task Singular Vectors (TSV) method.

Reviewers unanimously liked the paper, but several issues were pointed out:

  • Reviewer ff9S, Reviewer BR4L, Reviewer 38Gq found the motivation why flattening the singular values works unclear.
  • Reviewer mnUL, Reviewer BR4L pointed out the lack of comparisons on NLP datasets.
  • Reviewer BR4L pointed out a missing complexity analysis and lack of comparison to another SVD-based baseline.

The authors addressed all these points in their rebuttal, and all reviewers raised their score to 4. I recommend acceptance of the work.