PaperHub
5.5 / 10
Poster · 4 reviewers
Ratings: 3, 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent

OpenReview · PDF
Submitted: 2025-01-09 · Updated: 2025-07-24

Abstract

Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. Existing methods attempt to alleviate task conflicts by sparsifying task vectors or promoting orthogonality among them. However, they overlook the fundamental target of model merging: the merged model performs as closely as possible to task-specific models on respective tasks. We find these methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance. Based on our findings, we frame model merging as a constrained optimization problem ($i.e.$, minimizing the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge) and solve it via adaptive projective gradient descent. Specifically, we align the merged model with individual models by decomposing and reconstituting the loss function, alleviating conflicts through $data-free$ optimization of task vectors. To retain shared knowledge, we optimize this objective by projecting gradients within a $shared subspace$ spanning all tasks. Moreover, we view merging coefficients as adaptive learning rates and propose a task-aware, training-free strategy. Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
Keywords
Model Merging · Task Vector · Gradient Projection

Reviews and Discussion

Official Review (Rating: 3)

This paper views model merging from a multi-task learning angle. It designs an adaptive projective gradient descent method that tries to minimize the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge. Specifically, the method only uses gradients in the orthogonal direction of the shared space of task vectors. Experiments show performance improvement compared to baseline methods.

update after rebuttal

As discussed during the rebuttal, I generally support the acceptance of this paper despite its performance gap issue. The proposal should be helpful for some other researchers working in this field.

Questions for Authors

NA

Claims and Evidence

The main goal of the paper, which is ambitious, is "ensuring the merged model performs comparably to task-specific models on respective tasks". This is not well supported by the experimental results. Based on Tables 1, 2, and 3, the model merging proposal does not maintain the same level of performance as the individually trained ones, and is even worse than multi-task learning in many cases.

Theoretically, it is also unclear why "only take gradient steps in the direction orthogonal to the shared space" (Line 68) can help us achieve this ambitious goal. The argument is not convincing.

Methods and Evaluation Criteria

The proposed method needs more justification to help me understand why it can help keep task-specific information while model merging.

The experimental evaluations are diverse and the proposal shows better performance compared to previous model merging methods.

Theoretical Claims

The paper does not provide much theoretical evidence.

Experimental Design and Analysis

The experimental designs are satisfactory, comparing the proposal with SOTA methods and analyzing different modules of the method.

Supplementary Material

No.

Relation to Prior Work

This paper is related to many works on model merging and multi-task learning.

Missing Important References

I am not aware of closely related works that were not discussed in the paper.

Other Strengths and Weaknesses

Other strengths:

  • This paper is interesting and shows promising performance compared to previous methods based on the experimental results.
  • The manuscript is well-written, with a proper discussion of previous works.

Other weakness:

  • The connection between the motivation (keeping task-specific information) and the method is not very clear.

Other Comments or Suggestions

NA

Author Response

Thanks for your review and detailed comments. We hope the following discussion can address your concerns!


Q1: Based on Tables 1, 2, and 3, the model merging proposal does not maintain the same level of performance as the individually trained ones, and is even worse than multi-task learning in many cases.

A1: Transfer learning has driven the proliferation of fine-tuned models, but deploying separate models for each task creates significant storage and computational burdens. While multi-task learning could address this, it involves costly training and simultaneous access to all tasks. Additionally, determining the optimal data mixture for effective multi-task training can be complex and resource-intensive. Model merging addresses these challenges by compressing task-specific models without requiring access to training data (privacy or copyright). The performance gap between merged model and individual models or multi-task learning is an inherent constraint, as merging multiple trained models into a single model occurs without the benefit of costly computations.

While previous methods focus on alleviating conflicts between tasks, our approach takes a more direct path by establishing the minimization of the gap between the merged model and individual models as our explicit optimization objective (Line 24). By formulating and effectively solving this as a data-free constrained optimization problem, we achieve significant performance improvements. On ViT-L/14, our method reaches 92.6% performance, approaching the 93.5% achieved by multi-task learning—a substantial narrowing of the gap. We have revised the description of model merging requirements in Line 18, and greatly appreciate your suggestion.


Q2: Theoretically, it is also unclear why "only take gradient steps in the direction orthogonal to the shared space" (Line 68) can help us achieve this ambitious goal. The argument is not convincing.

A2: The optimization objective in Eq. (5) promotes orthogonality between task vectors to mitigate conflicts, while multi-task learning similarly emphasizes shared representations. Parameters between similar tasks can be shared (e.g., applying the MNIST task vector improves accuracy on SVHN). Therefore, we propose constructing a shared subspace $S_{share}$ to preserve common representations. By constraining task vector optimization to reduce updates along $S_{share}$, we maintain shared knowledge while minimizing the gap for each task as defined in Eq. (5).
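
To make the projection concrete, here is a minimal PyTorch-style sketch of how a shared subspace could be built from the task vectors of a single linear layer and how a gradient could then be restricted largely to its orthogonal complement. The function names, the per-task rank `k`, and the `keep_shared` knob are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def build_shared_subspace(task_vectors, k):
    # Take the top-k left singular directions of each task vector
    # (one weight matrix per task) and orthonormalize their union.
    bases = []
    for tau in task_vectors:                      # tau: (out_dim, in_dim)
        U, _, _ = torch.linalg.svd(tau, full_matrices=False)
        bases.append(U[:, :k])
    stacked = torch.cat(bases, dim=1)             # union of per-task bases
    U_share, _, _ = torch.linalg.svd(stacked, full_matrices=False)
    return U_share                                # columns span S_share

def project_gradient(grad, U_share, keep_shared=0.0):
    # Split the gradient into its component inside S_share and the
    # orthogonal remainder; downweight (or drop) the shared component
    # so that updates largely leave shared knowledge untouched.
    g_shared = U_share @ (U_share.T @ grad)
    return (grad - g_shared) + keep_shared * g_shared
```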

Ablation studies demonstrate a 3.5% improvement with $S_{share}$. Table 7 presents a comparison of different gradient directions, revealing dataset-specific performance variations. Our method achieves significant improvements on the DTD dataset while showing decreased performance on SVHN. This pattern stems from DTD's reliance on rich textural features that are preserved in $S_{share}$. In contrast, SVHN's visual representations differ substantially from other tasks, making the primary components in $S_{share}$ less suitable. This observation is further validated by examining the performance gap between pre-trained and fine-tuned models: SVHN exhibits the lowest pre-trained performance (31.4%) but achieves remarkable results after fine-tuning (97.5%), indicating its strong dependence on task-specific features. In summary, our approach effectively preserves shared knowledge across tasks while achieving optimal overall performance.


Q3: The connection between the motivation (keeping task-specific information) and the method is not very clear.

A3: To isolate task-specific information, the task vector is defined as $\tau_i = \theta_i - \theta_0$ to capture unique characteristics for each task. While preserving task-specific information through simple vector addition is straightforward, the challenge in model merging lies in managing conflicts between multiple tasks. This challenge becomes evident in Figure 1, which demonstrates how performance consistently declines across all merging methods as the number of tasks increases, directly reflecting increased task conflicts.

As shown in Eq. (3), we measure the gap between the merged model and individual models in terms of task-specific losses. To alleviate conflicts, we introduce a modification vector $\Delta$ for each task vector. This leads to our optimization objective in Eq. (4), which aims to achieve optimal cross-task performance by optimizing $\Delta$. Through this optimization process, the merged model approximates the behavior of task-specific models while effectively resolving conflicts. In short, by minimizing our proposed loss function, we ensure the merged model preserves essential task-specific information.
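
For a concrete picture of the objective, below is a minimal sketch of one plausible data-free surrogate: the merged model is $\theta_0 + \sum_i \lambda_i(\tau_i + \Delta)$, each per-task gap is linearized with the $-\tau_j$ gradient proxy the authors describe in their responses to other reviewers, and $\Delta$ is the only optimized quantity. The squared aggregation and all names are assumptions of this sketch, not the paper's exact Eq. (4)/(5).

```python
import torch

def surrogate_gap_loss(delta, task_vectors, lambdas):
    # Merged parameters relative to theta_0: sum_i lambda_i * (tau_i + delta).
    merged_shift = sum(l * (tau + delta) for l, tau in zip(lambdas, task_vectors))
    loss = 0.0
    for tau_j in task_vectors:
        # First-order gap for task j with grad L_j(theta_0) ~ -c * tau_j:
        #   L_j(theta*) - L_j(theta_j)  ~  -c * tau_j . (theta* - theta_j),
        # where theta* - theta_j = merged_shift - tau_j (both relative to theta_0).
        gap_j = (tau_j * (merged_shift - tau_j)).sum()
        loss = loss + gap_j ** 2   # squared aggregation: an assumption of this sketch
    return loss

# Hypothetical usage: Delta starts from zeros and is optimized by gradient
# descent; in the full method each gradient step would additionally be
# projected with respect to the shared subspace before being applied.
# delta = torch.zeros_like(task_vectors[0], requires_grad=True)
# optimizer = torch.optim.SGD([delta], lr=1e-2)
```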

Reviewer Comment

I would like to thank the authors for providing a detailed rebuttal. However, my concerns mentioned above were not solved, including the performance gap and the theoretical advantage of the proposal. Therefore, I decided to maintain my original ratings.

Author Comment

Thank you again for your thorough review. We incorporate your constructive suggestions to better explain our method. Regarding concerns about the performance gap, please refer to our recent discussion with Reviewer QBR6. We acknowledge that theoretical advantage is not our primary contribution. Our paper directly models the multi-task model merging problem and empirically validates our motivation through experimental evidence.

Official Review (Rating: 3)

This paper addresses the challenge of merging multiple task-specific models into a unified model without accessing their original training data. The authors identify critical limitations in existing methods, such as discarding task-specific information during conflict resolution and over-enforcing orthogonality, which erodes shared knowledge. They propose DOGE, a constrained optimization framework that minimizes performance gaps via data-free gradient descent, projects updates orthogonally to a shared subspace (preserving common representations), and employs task-aware merging coefficients derived from task vector norms.

Questions for Authors

  • Is $\Delta$ a task-specific vector? Since you mentioned $\Delta$ is a modification vector to each task vector, and it is not indexed by the task, it was a bit confusing to me at the beginning. Maybe rephrase it to "a universal modification vector to each task vector".

Claims and Evidence

  • Correct me if I am wrong: to use the Taylor expansion, the expansion point and the pre-trained model should be very close. This may need to be pointed out and justified.
  • In addition, did the authors evaluate how accurately this first-order Taylor expansion approximates the loss, to validate this choice?

Methods and Evaluation Criteria

Methods:

  • In Algorithm 1, the authors should define $\Delta$ and state whether it is an input or how it is initialized.

Evaluation:

  • The benchmark datasets are commonly used in task vector based model merging.
  • However, I would like to see a comparison between DOGE and other strong baseline methods such as EMR merging and Twin merging.

Theoretical Claims

There are no theoretical claims in this paper.

Experimental Design and Analysis

Yes, I checked the experiments. As I mentioned before, it would be great to add comparisons between DOGE and other strong baseline methods such as EMR merging and Twin merging.

Supplementary Material

I did not review the supplementary material.

Relation to Prior Work

This paper is related to prior ideas including twin-merging (modulating shared and exclusive knowledge) and representation surgery (trying to make the representation of the merged model close to each individual task). It uses Taylor expansion to approximate the loss without using any data (similar to the idea in MAP (using Taylor expansion to approximate loss function)).

Missing Important References

For the Taylor expansion part, it would be helpful to cite a related work (MAP: https://arxiv.org/pdf/2406.07529) which also uses Taylor expansion to approximate the loss function / evaluation metric.

Other Strengths and Weaknesses

Strengths:

  • Empirical results (performance gains, robustness to task scaling, cross-domain generalization) convincingly demonstrate DOGE’s practical efficacy.

  • The plug-and-play design and compatibility with architectures like ViT/LoRA are validated experimentally.

Weaknesses:

  • Since the method requires additional optimization and additional modification vectors for each task, I would like the authors to present the additional time/space that DOGE requires.

Other Comments or Suggestions

  • In the methodology section, $\lVert\cdot\rVert_{Gap}$ and $\lVert\cdot\rVert_{S_{share}}$ make it seem like you are defining new norms. I would avoid using them as subscripts of the norm symbol.
  • Tables 5 and 6 are not numbered in the order they appear in the paper.
  • Table 5: it is interesting that the selected tasks, MNIST and EuroSAT, are relatively easy for the ViT models. It would be interesting to see the generalization performance on SUN397, DTD, and Cars.
Author Response

Q1: Taylor expansion may need to point this out and justify.

A1: During fine-tuning, parameter evolution in pre-trained models is frequently minimal, indicating that training remains within the tangent space where the Taylor expansion closely approximates network behavior. This aligns with MAP, which examines task vector magnitudes and employs a second-order Taylor expansion to approximate metrics. It provides a formal proof that the remainder of the Taylor series is negligible and, interestingly, proposes using linear regression to estimate the Hessian. Thanks for suggesting this related work! It strengthens our theoretical foundation, and we include this reference to further substantiate the rationale behind our approach.

We examined the difference between the first-order Taylor expansion and the original loss, finding them to be within the same order of magnitude, confirming the accuracy of the estimation. Since calculating the gradient $\nabla_{\theta}\mathcal{L}_j(\theta_0)$ requires specific data $\mathcal{D}_j$, we used the task vector $\tau_j$ as an approximation. Interestingly, when we attempted to optimize using actual gradients computed from specific data, we observed performance degradation. We attribute this to highly unstable gradients at the initialization, which complicated the optimization process. Thus, approximating the original loss using task vectors appears to be the superior way.
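
For reference, the approximation under discussion can be written out as follows (our paraphrase of the rebuttal, with $c > 0$ an unspecified constant absorbing the accumulated learning rates):

$$\mathcal{L}_j(\theta) - \mathcal{L}_j(\theta_0) \approx \nabla_{\theta}\mathcal{L}_j(\theta_0)^{\top}(\theta - \theta_0) \approx -c\,\tau_j^{\top}(\theta - \theta_0), \qquad \tau_j = \theta_j - \theta_0.$$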


Q2: Whether $\Delta$ is input or how it is initialized.

A2: $\Delta$ is initialized as a zero tensor with the same shape as the task vector.


Q3: Strong baseline methods such as EMR merging, Twin merging.

A3: As discussed in Twin merging [2], they all belong to dynamic merging, which requires additional storage for task-specific modules. Such methods face parallelization challenges during inference, necessitating either dynamic I/O loading of task-specific modules or storing all modules in GPU memory. EMR merging requires priors during inference to load corresponding modules, while Twin merging trains a router using validation datasets to select modules.

Both EMR merging and Twin merging can be viewed as lightweight WEMoE [3], yet they still impose storage demands (2.25× our approach). For instance, EMR merging's proposed mask implementation still uses 8-bit Bool types, and Twin merging's module reconstruction $U\Sigma V$ requires matrix operations that may not reduce peak GPU consumption. Notably, these approaches avoid direct comparison with WEMoE, which is unsurprising. According to the no free lunch theorem, performance increases with the number of retained parameters, with complete task-specific models representing the upper performance bound.

Our approach, by contrast, is a static merging plug-and-play method (like TA and Ties merging) that maintains standard model size and enables parallelized inference. We compare our method with SOTA static merging approaches such as AdaMerging and PCB-Merging. We believe methods should first be classified before conducting fair comparisons within each category. Otherwise, MoE methods will always outperform others simply due to larger parameter count.


Q4: Present the time/space that DOGE requires.

A4: We have reported training time and memory usage in Table 10 of the Appendix, demonstrating remarkably efficient performance with only 121 seconds total training time and a memory usage of 729MB. We will relocate this information to the main text.


Q5: Generalization performance on SUN397, DTD, and Cars.

A5: Based on your request, we conducted experiments evaluating generalization on three unseen tasks when merging five other tasks. The results reveal that SUN397, DTD, and Cars datasets pose challenges for ViT models, while MNIST/EuroSAT show limited generalization to these complex tasks. Despite this, our method consistently outperformed other model merging approaches by a significant margin.

| Method | RESISC45 | SVHN | GTSRB | MNIST | EuroSAT | Seen Avg. | SUN397 | Cars | DTD | Unseen Avg. |
|-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Pre-trained | 60.6 | 23.5 | 30.4 | 47.6 | 45.6 | 41.5 | 63.2 | 59.9 | 43.9 | 55.6 |
| Task Arithmetic | 52.8 | 83.9 | 71.1 | 97.7 | 61.9 | 73.5 | 27.9 | 25.0 | 26.4 | 26.4 |
| Ties-Merging | 74.6 | 89.1 | 81.8 | 97.7 | 73.7 | 83.4 | 57.5 | 51.9 | 38.7 | 49.4 |
| AdaMerging | 73.5 | 76.0 | 81.5 | 97.4 | 69.4 | 79.6 | 42.3 | 37.8 | 32.0 | 37.4 |
| DOGE TA | 82.6 | 89.4 | 89.0 | 98.6 | 92.3 | 90.4 | 58.7 | 54.3 | 41.4 | 51.5 |

Q6: Is $\Delta$ a task-specific vector?

A6: Thanks for the suggestion. $\Delta$ is a universal modification vector to each task vector. In our experiments, using a universal modification vector yields performance nearly identical to that of task-specific modification vectors, as they are mathematically equivalent when optimizing Eq. (5).


[1] EMR-Merging: Tuning-Free High-Performance Model Merging. NeurIPS 2024.
[2] Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging. NeurIPS 2024.
[3] Merging Multi-Task Models via Weight-Ensembling Mixture of Experts. ICML 2024.

Official Review (Rating: 3)

The authors introduced an approach to merging tasks for a multi-task learning purpose while maintaining performance comparable to task-specific models. They formulated the problem as a constrained optimization task, solved using adaptive projected gradient descent. To facilitate task merging, they introduced a modification vector for each task, acting as a correction mechanism. To achieve this, they constructed a shared subspace using SVD to capture common features, optimizing within this space to minimize task conflicts. The gradient of the modification vector is decomposed into two components: one projected onto the shared subspace and the other orthogonal to it. Additionally, they introduced merging coefficients based on the norm of task vectors to mitigate the dominance of any single task’s gradient influence.

Questions for Authors

None

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

No

Experimental Design and Analysis

No

Supplementary Material

Yes, sections A, B, C, and D

Relation to Prior Work

The authors are trying to tackle the issue with merging parameters of the model achieving good performance comparable to task-specific methods by reducing conflict of tasks. They mentioned most of the other literature work that addressed this issue.

Missing Important References

Yes, the Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging was introduced in the paper "Merging Multi-Task Models via Weight-Ensembling Mixture of Experts", published at ICML 2024. Additionally, an extended arXiv version, "Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging" (E-WEMoE), further refines this approach. Both papers should be included in the related work section for a comprehensive discussion.

Other Strengths and Weaknesses

Strengths:

  • They applied their approach to both vision and NLP tasks.

Weaknesses:

  • Each dataset should have a brief description.
  • Most of the included datasets focus on a single task, primarily classification. Can this approach be applied to heterogeneous MTL?
  • SVD is computationally expensive. Can this approach be applied to Llama 2 or Llama 3?
  • Traditional MTL needs to be clarified more. For instance, what is its architecture?
  • The results were not compared against the WEMoE framework from "Merging Multi-Task Models via Weight-Ensembling Mixture of Experts" and the E-WEMoE framework from "Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging".
  • Additionally, Figure 3 is similar to one in the E-WEMoE paper.
  • For vision tasks, the results fall short compared to those reported in the WEMoE and E-WEMoE papers.

Other Comments or Suggestions

None

Ethics Review Concerns

Figure 3 closely resembles the one presented in the paper "Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging", sharing the same representation and color scheme. However, there is no proper citation to this arXiv paper. Notably, this arXiv paper is an extension of the ICML 2024 accepted paper, "Merging Multi-Task Models via Weight-Ensembling Mixture of Experts", and both papers report better results than the paper currently under review. Given that the authors reproduced a highly similar image without citing the original work—regardless of whether they are the same authors—the omission appears intentional, particularly since the prior work (in both papers) demonstrates superior performance.

Author Response

Q1: The omission appears intentional, since the prior work demonstrates superior performance.

A1: Our approach differs fundamentally from [1,2] in both setting and methodology. Our objective is to close the performance gap between model merging and multi-task learning without introducing additional computation and memory requirements—a core and previously unresolved challenge in model merging research.

  • Parameter

    MoE architecture preserves MLP layers from each fine-tuned task-specific model and the pre-trained model, while additionally training a router module. In contrast, our merged model maintains a standard model size. The parameter comparison for ViT-B/32 (8 tasks) is as follows:

    | | Total Parameters |
    |-|:-:|
    | Individual | 113.45M |
    | Ours | 113.45M |
    | WEMoE | 573.96M |

    The primary motivation for model merging is parameter reduction. If performance were the sole consideration, retaining each task-specific model would be the trivial solution. Our method aims to compress multiple models (whether 8 or even 20) into a single standard-sized model, which aligns with the typical settings in model merging and multi-task learning. As MoE methods (89.4%) exceed the performance upper bound of multi-task learning (88.9%), comparing our approach directly with MoE would be inappropriate.

  • Data Requirements

    MoE approaches employ unlabelled test datasets to train the router module, whereas our optimization of task vectors is data-free. The performance benefits from test-time adaptation are self-evident. Merging based solely on model parameters is more practical and represents the focus of most model merging methods.

  • Computational Overhead

    Static merging maintains inference costs equivalent to standard models, while MoE dynamic merging consumes more memory and computational resources (router + $k$ activated experts). The inference-phase memory usage comparison is as follows:

    | | ViT-B/32 (8 tasks) | ViT-B/32 (20 tasks) | ViT-L/14 (8 tasks) |
    |:-:|:-:|:-:|:-:|
    | Ours | 963.42MB | 963.42MB | 3772.63MB |
    | WEMoE | 2750.65MB | 5346.00MB | 10063.64MB |

    Similarly, test-time adaptation incurs additional training costs, while our method requires only lightweight training overhead (as shown in Table 10 of our paper):

    | | Memory, ViT-B/32 (8 tasks) | Memory, ViT-L/14 (8 tasks) | Time, ViT-B/32 (8 tasks) | Time, ViT-L/14 (8 tasks) |
    |:-:|:-:|:-:|:-:|:-:|
    | Ours | 729MB | 2448MB | 2.02min | 5.18min |
    | WEMoE | 3744.19MB | 24535.53MB | 7.07min | 56.84min |

    Notably, our method can be trained layer by layer, enabling model merging for large models with minimal memory requirements.

  • Regarding Figure 3

    Figure 3 visualizes task vector magnitudes, highlighting a phenomenon inherently observable across domain benchmarks. E-WEMoE and DOGE propose different approaches to address this phenomenon. Figure 3 was drawn with assistance from the E-WEMoE authors to create a new version. Associating the performance gap with the missing citation introduces a conceptual misunderstanding, as fair comparison is impossible due to differing settings. Meanwhile, we compare our approach with state-of-the-art methods in both data-free and test-time adaptation scenarios (described in lines 314-328). We appreciate your feedback and will introduce MoE-like methods and clearly describe the differences.


Q2: Each dataset should have a description. Most of the included datasets focus on classification. Can this approach be applied to heterogeneous MTL?

A2: We will add descriptions for each dataset. Model merging in CV indeed focuses primarily on classification tasks, following common experimental settings (as acknowledged by Reviewer 97J9). Research on heterogeneous model merging remains limited, with existing work mainly centered on VGG and ResNet architectures using CIFAR datasets. We would welcome suggestions for appropriate benchmarks to explore this direction.


Q3: SVD is computationally expensive. Can this approach be applied to Llama 2 or Llama 3?

A3: SVD computation only needs to be performed once at the beginning. As shown in Table 10, which details the computation overhead, our approach requires minimal memory and time. We conducted experiments following standard LLM settings, completing the merging in 58 min on a single A100 GPU. We report normalized scores on merging WizardLM-13B (Instruction-Following), WizardMath-13B (Math), and llama-2-13b-code-alpaca (Code). Our method achieves optimal average performance across tasks.

| Method | AlpacaEval | GSM8K | MATH | HumanEval | MBPP | Avg. |
|-|:-:|:-:|:-:|:-:|:-:|:-:|
| Individual | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| TA | 102.7 | 91.0 | 70.5 | 50.0 | 87.7 | 80.4 |
| TIES | 98.1 | 97.4 | 68.1 | 60.0 | 89.4 | 82.6 |
| TA + DARE | 103.1 | 88.0 | 72.5 | 63.3 | 92.9 | 84.0 |
| TIES + DARE | 107.9 | 90.3 | 65.6 | 80.0 | 92.4 | 87.2 |
| Ours | 107.5 | 105.0 | 94.4 | 56.7 | 86.5 | 90.0 |

[1] Merging Multi-Task Models via Weight-Ensembling Mixture of Experts. ICML 2024.
[2] Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging. ArXiv 2024.

Reviewer Comment
  1. The current response is contradictory. Authors mentioned "If performance were the sole consideration, retaining each task-specific model would be the trivial solution", yet the paper’s primary comparison focuses only on accuracy. Furthermore, the authors do not provide any direct comparison regarding computation or memory efficiency against the state-of-the-art, which undermines their claim.

Regarding the comparison to WEMoE, the authors argue that it is unfair to compare their approach against WEMoE (MoE dynamic merging) due to differences in merging strategies. However, both methods fundamentally merge parameters, where WEMoE does so dynamically based on test data, while the proposed approach employs a static merging method. Given that both tackle the same problem using the same datasets, the comparison appears valid. Moreover, the prior work, WEMoE, was evaluated against the same baselines, such as AdaMerging (test adaptation), Ties-Merging, and Task Arithmetic (data-free methods), which were used by the authors in this paper to evaluate their approach.

Therefore, the justification for claiming unfairness in comparison to WEMoE is unconvincing. The authors should explicitly include a discussion of these prior works in their manuscript, clearly outlining the pros and cons of their approach relative to them. In particular, while their method may offer improvements in computational and memory efficiency, it is important to address the fact that WEMoE surpasses their approach in accuracy. A balanced discussion of these trade-offs—accuracy versus resource efficiency—would provide a more comprehensive evaluation of the contributions of this work.

  2. The computational overhead comparison supports the claim that the static merging approach offers significant savings compared to dynamic MoE methods (i.e., WEMoE). However, the analysis would be stronger if it quantified these benefits—for example, by stating the percentage reduction in memory usage and computation overhead relative to WEMoE, and reporting any corresponding percentage loss in accuracy. A detailed discussion of the trade-offs between resource savings and potential accuracy impacts should be included, as it would provide a more comprehensive evaluation of the method's overall effectiveness.

Other comments:

  1. For Figure 3, it appears similar to one presented in a previous paper. The authors stated explicitly that they created this version with assistance from the E-WEMoE authors, indicating that it is derived from prior work (including their code). Therefore, it is essential that they provide proper citation to the original source in the figure caption.

  2. The authors did not respond to this question "Traditional MTL needs to be clarified more. For instance, what is its architecture?".

Author Comment

Thanks for your time and feedback. Please find point-by-point responses to your concerns below:


Q1: The current response is contradictory, yet the paper’s primary comparison focuses only on accuracy.

A1: Our response is not contradictory. The target of model merging is to merge multiple models into a single model that approaches the accuracy of task-specific models. Model merging has developed rapidly, leading to inconsistencies across many works. This is a current issue in the field, as there is no clear distinction based on parameters, data requirements, and computational costs. For example, SOTA dynamic merging methods like EMR merging and Twin merging, which function as lightweight WEMoE, also did not compare with WEMoE in their evaluations, instead comparing against AdaMerging (test adaptation) and Ties-Merging (data-free). As stated in the paper, DOGE is a plug-and-play method—we incorporate it into classic methods from both test adaptation (AdaMerging) and data-free (Task Arithmetic) categories, achieving SOTA performance in static merging. DOGE can similarly enhance dynamic methods by replacing their weighted averaging components.


Q2: However, the analysis would be stronger if it quantified these benefits—for example, by stating the percentage reduction in memory usage and computation overhead relative to WEMoE.

A2: We will provide a comprehensive comparison table and include a detailed discussion of previous works in the manuscript, offering readers a thorough evaluation:

| Method | Parameters | Router | Data | Parallel | Performance |
|-|:-:|:-:|:-:|:-:|:-:|
| TA [1] | 1× | - | - | static | 69.1 |
| AdaMerging [2] | 1× | - | unlabeled test dataset | static | 80.1 |
| TA+DOGE | 1× | - | - | static | 81.0 (↑11.6) |
| AdaMerging+DOGE | 1× | - | unlabeled test dataset | static | 85.9 (↑5.8) |
| Surgery [3] | >1× | - | unlabeled test dataset | static | 80.9 |
| WEMoE [4] | 5× | trained router | unlabeled test dataset | dynamic | 89.4 |
| EMR merging [5] | 4× | perfect router | - | dynamic | 88.7 |
| Twin merging [6] | 2.25× | trained router | labeled validation dataset | dynamic | 86.1 |
| Traditional MTL | 1× | - | - | - | 88.9 |
| Multiple Models | 8× | - | - | - | 90.8 |

As shown, merging multiple models into a single model presents significant challenges. DOGE, as a plug-and-play method, substantially improves accuracy. Dynamic merging methods face parallelization issues during inference, requiring either dynamic I/O loading of task-specific modules or storing all modules in GPU memory. EMR merging needs priors during inference to load corresponding modules, while WEMoE and Twin merging train routers to select modules. We believe methods should be classified before conducting fair comparisons within each category. Otherwise, according to the no free lunch theorem, MoE methods will always outperform any static merging method simply due to their larger parameter count.


Q3: It is essential that they provide proper citation to the original source in the figure caption.

A3: Thank you for bringing this oversight to our attention. We will provide a proper citation in the figure caption.


Q4: Traditional MTL needs to be clarified more. For instance, what is its architecture?

A4: We apologize for the previous omission. As explained in Appendix C (Lines 582-583), Traditional MTL trains a single base model on all tasks simultaneously. The architecture is the standard base model.


To summarize our contribution again: We frame model merging as a constrained optimization problem, propose projective gradient descent that optimizes a data-free objective, and design task-aware merging coefficients. Comprehensive experiments validate our plug-and-play capability.

Your discussion regarding MoE methods has helped us provide a more comprehensive evaluation in our paper. We believe that clearer categorization and comparison will benefit the model merging community as a whole. Thank you sincerely, and we wish you a pleasant day.


[1] Editing Models with Task Arithmetic. ICLR 2023.
[2] AdaMerging: Adaptive Model Merging for Multi-Task Learning. ICLR 2024.
[3] Representation Surgery for Multi-Task Model Merging. ICML 2024.
[4] Merging Multi-Task Models via Weight-Ensembling Mixture of Experts. ICML 2024.
[5] EMR-Merging: Tuning-Free High-Performance Model Merging. NeurIPS 2024.
[6] Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging. NeurIPS 2024.

Official Review (Rating: 3)

This paper proposes a new perspective on model merging—treating it as a multi-task learning problem rather than merely a parameter-level combination of multiple expert models. The main idea is to preserve each task’s strong performance while reconciling the potential conflicts that arise when unifying several task-specific models into a single, merged network. To that end, the authors introduce an approach they call “Adaptive Projective Gradient Descent” (DOGE). This method formulates the model-merging goal as a constrained optimization problem that minimizes the “performance gap” between the merged model and each of the individual expert models, while explicitly retaining cross-task shared representations.

The procedure has three core steps. First, it refines (or “modifies”) each task-specific vector so that merging doesn’t simply discard conflict-ridden parameters that might actually be performance-critical. Second, it projects the gradients of these task modifications onto a shared subspace to maintain overlapping knowledge across tasks rather than forcing all task vectors into near-orthogonality. Third, it adapts the merging coefficients in a “training-free” way that is reminiscent of how adaptive optimizers dynamically adjust the learning rate; effectively, the magnitude of each merging coefficient is scaled inversely by the norm of the corresponding task vector. The overall pipeline is then shown to achieve strong performance across diverse architectures (vision and language) and tasks (classification, generation) without requiring access to the original datasets.
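
As a small illustration of the third step, a training-free coefficient rule of the kind described here could look like the following; the exact normalization and the value of the global factor eta in the paper may differ, so treat this as a sketch rather than the paper's formula.

```python
import torch

def task_aware_coefficients(task_vectors, eta=0.05):
    # One merging coefficient per task, scaled inversely to the norm of its
    # task vector so that no single large task vector dominates the merge.
    # eta acts as the global scaling factor (the value here is illustrative).
    norms = torch.stack([tau.norm() for tau in task_vectors])
    return eta / norms

# Hypothetical usage, with delta the optimized modification vector:
# lambdas = task_aware_coefficients(task_vectors)
# theta_star = theta_0 + sum(l * (tau + delta) for l, tau in zip(lambdas, task_vectors))
```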

Questions for Authors

  • How sensitive is the overall performance to the choice of the subspace basis size and the global scaling factor η?

  • Could you provide further empirical or theoretical justification for approximating $\nabla_{\theta}\mathcal{L}_j(\theta_0)$ by $-\tau_j$? Under what conditions might this approximation break down? Should I expect gradient-based MTL methods (https://github.com/thuml/awesome-multi-task-learning), e.g., MGDA, CAGrad, PCGrad, IMTL, to outperform task arithmetic?

Claims and Evidence

  • Improved performance on merged models: DOGE is claimed to achieve higher accuracy than previous data-free merging approaches by better preserving task-specific information while retaining shared representations.
  • Effectiveness of gradient projection: By projecting gradient updates orthogonally to the shared subspace, the method aims to resolve task conflicts without sacrificing common knowledge.
  • Task-aware coefficient design: The adaptive (training-free) merging coefficients based on task vector norms provide a natural way to balance gradient contributions across tasks.

The evidence supporting these claims comes from extensive experimental results on multiple benchmarks in both vision and NLP domains. Quantitative comparisons across a variety of baselines (non-merging methods, data-free approaches, and test-time adaptation techniques) and detailed ablation studies show significant improvements. Although the experimental evidence is robust, one might wish for more discussion regarding statistical variability (e.g., more error bar analysis or significance testing) to further solidify the claims.

Methods and Evaluation Criteria

The proposed method is well-motivated and methodologically sound. Key components include:

  • Data-Free Objective: Derived via a first-order Taylor expansion, the objective minimizes the loss gap between the merged model and each individual model by approximating the unavailable gradient with the task vector.
  • Shared Subspace Construction: SVD is used to extract task-specific subspaces, which are then combined and refined to form a shared subspace that guides the gradient projection.
  • Adaptive Merging Coefficients: Interpreting task vectors as cumulative gradients leads to a natural formulation where merging coefficients play a role akin to adaptive learning rates.
  • The evaluation criteria—such as average accuracy across tasks (or Spearman’s ρ for STSB in NLP) and performance on out-of-distribution or unseen tasks—are appropriate for demonstrating the effectiveness and generalization of the method. The comprehensive experiments across different architectures and task modalities further validate the method’s practicality.

Theoretical Claims

The paper provides heuristic derivations rather than fully formal proofs. Notably:

  • The use of a first-order Taylor expansion to derive a data-free objective is a reasonable approximation given the unavailability of task data.
  • Approximating the gradient of the task loss at the pre-trained model using the task vector (i.e., −τ_j) is intuitively justified by interpreting the task vector as an accumulation of gradients.
  • The decomposition of the gradient into components within and orthogonal to the shared subspace is well-motivated, though the justification remains somewhat heuristic.

While these theoretical insights are plausible and backed by empirical evidence, a more rigorous treatment or further formal analysis would help strengthen the theoretical claims.

Experimental Design and Analysis

The experimental setup is comprehensive:

  • The authors evaluate on eight-task vision benchmarks using CLIP-based ViT-B/32 and ViT-L/14 models, as well as on eight-task language benchmarks using LoRA fine-tuned Flan-T5 models.
  • Multiple baselines, including both data-free and test-time adaptation methods, are used for comparison.
  • Detailed ablations assess the contribution of each module (∆ optimization, shared subspace projection, and adaptive λ), lending credibility to the claims about each component’s effectiveness.
  • Additional experiments on unseen tasks and corrupted test sets reinforce the method’s robustness.

One minor suggestion is to include more explicit details on computational overhead (I found that some recent work also updates ∆ by gradient descent, but I am not sure how expensive this procedure is) and convergence behavior across varying numbers of tasks.

Supplementary Material

The supplementary material (including appendices) appears to provide:

  • Additional experimental details (e.g., dataset specifics, hyperparameter settings, implementation details).
  • Extended ablation studies and discussions on sensitivity analyses (e.g., effect of varying the subspace rank).
  • Further comparisons with baselines and additional visualizations that support the claims in the main text.

Relation to Prior Work

The paper is well-situated within the broader context of multi-task learning and model merging:

  • It builds upon previous work in data-free model merging (e.g., Task Arithmetic, Ties-Merging) and test-time adaptation (e.g., AdaMerging).
  • It draws connections to multi-task learning strategies that emphasize gradient alignment and modular architectures.

Missing Important References

Some relevant works that tackle model merging in subspaces need to be discussed:

  • Gargiulo, A. A., Crisostomi, D., Bucarelli, M. S., Scardapane, S., Silvestri, F., and Rodolà, E. Task Singular Vectors: Reducing Task Interference in Model Merging. arXiv preprint arXiv:2412.00081, 2024.
  • Stoica, G., Ramesh, P., Ecsedi, B., Choshen, L., and Hoffman, J. Model Merging with SVD to Tie the KnOTS. arXiv preprint arXiv:2410.19735, 2024.

Other Strengths and Weaknesses

Strengths:

  • Comprehensive Evaluation: The extensive experimental validation across both vision and language domains, along with detailed ablation studies, convincingly demonstrates the method’s effectiveness.
  • Practical Relevance: The approach is designed to work in data-free scenarios, which is particularly appealing in settings where access to original training data is restricted due to privacy or logistical concerns.

Weaknesses:

  • Theoretical Rigor: Some derivations, particularly the gradient approximations and the rationale behind using −τ_j as a proxy for the gradient, could benefit from a more rigorous treatment.
  • Hyperparameter Sensitivity: The method involves several hyperparameters (e.g., the subspace basis size and global scaling factor η) whose selection may critically affect performance. More discussion on sensitivity analysis would be helpful.
  • Computational Overhead: A deeper analysis of the additional computational costs (e.g., due to SVD and projection operations) would enhance understanding of the method’s scalability.

Other Comments or Suggestions

  • Clarity in Derivations: Some steps in the derivation of the data-free objective could be elaborated further. A step-by-step explanation with more intuition would improve readability. For example, $\lVert\theta^\ast - \theta_i\rVert_{Gap}$ appears in Eq. (2) without introduction.
  • Limitations and Future Work: It would be beneficial for the authors to include a discussion on potential limitations (e.g., cases where tasks are highly heterogeneous) and directions for future research.
Author Response

Thanks for your detailed comments. We hope the following discussion can address your concerns!


Q1: Some relevant works that tackle model merging in subspaces need to be discussed.

A1: Thanks for suggesting additional relevant work. We will discuss them in related work: TSV [1] aggregates task vectors within their subspaces via low-rank approximation and whitens matrices to minimize interference. KnOTS [2] aligns representation spaces between LoRA models using SVD, enabling the application of merging methods. Both these methods and ours recognize parameter low-rankness and implement merging within subspaces.


Q2: Theoretical Rigor: The rationale behind using $-\tau_j$ as a proxy for the gradient could benefit from a more rigorous treatment.

A2: Under the Neural Tangent Kernel assumption (i.e., fine-tuning often occurs in a linear regime), which has been validated in prior work [3,4], $\nabla_{\theta}\mathcal{L}_j(\theta_0)$ can be estimated as $k\tau_j$ with $k < 0$. Here, $\tau_j = \theta_T - \theta_0 = -\sum_{t=1}^{T}\alpha_t\nabla_{\theta_t}\mathcal{L}_j(\theta_t)$, where $\alpha_t$ denotes the learning rate and $T$ the number of update iterations. Given the (approximate) linearity of the loss in the vicinity of $\theta_0$, we have $\nabla_{\theta_t}\mathcal{L}_j(\theta_t) = \nabla_{\theta_0}\mathcal{L}_j(\theta_0)$. Therefore, we derive $\nabla_{\theta}\mathcal{L}_j(\theta_0) = -\frac{\tau_j}{\sum_{t=1}^{T}\alpha_t}$.


Q3: Hyperparameter Sensitivity: The method involves several hyperparameters (e.g., the subspace basis size $k$ and global scaling factor $\eta$) whose selection may critically affect performance.

A3: We have conducted experiments on the subspace basis size $k$ in Figure 4, which displays performance with varying rank ratios alongside the explained standard deviation. We also investigated the relationship between different projection directions and basis sizes. Additional sensitivity analysis for the global scaling factor $\eta$ is supplemented as follows:

| $\eta$ | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
|-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| ViT-B/32 | 79.5 | 80.3 | 80.6 | 80.9 | 81.0 | 80.8 | 80.7 | 80.2 | 79.8 |

The evaluation across $\eta$ values from 0.01 to 0.09 demonstrates that performance remains stable and even achieves higher results. (We did not conduct a specialized grid search; this setting was chosen because the calculated $\lambda$ was close to 0.3.) This consistency across different $\eta$ values verifies the robustness of our approach and highlights the practicality of applying task-aware coefficients.


Q4: Computational Overhead: A deeper analysis of the additional computational costs (e.g., due to SVD and projection operations) would enhance understanding of the method’s scalability.

A4: We have reported training time and memory usage in Table 10 of the Appendix, showing an efficient total training time of only 121 seconds and memory usage of 729MB. The SVD operation only needs to be executed once at the beginning, with a computational complexity of $O(\min(mn^2, m^2n))$. We appreciate your reminder and will relocate this to the main text. Moreover, we supplement the final version with convergence loss curves for 8 and 20 tasks, showing that convergence is typically achieved within 100 to 200 iterations.


Q5: Clarity in Derivations: A step-by-step explanation with more intuition would improve readability. For example, ∥θ∗ − θi∥_Gap apprears in Eq. (2) without introduction.

A5: We apologize for any confusion caused. Eq. (2) is a brief mathematical summary presented before the detailed methodology. We have revised it and explained each symbol's meaning. Combined with the proof presented above, this will enhance the overall clarity of the derivation.


Q6: It would be beneficial for the authors to include a discussion on potential limitations and directions for future research.

A6: Thanks for your suggestion. A potential limitation is the lack of consideration for heterogeneous model merging, which requires transformation when task vectors have inconsistent shapes or layer numbers. Regarding future research, we are extending our work to LLMs by merging WizardLM-13B, WizardMath-13B, and llama-2-13b-code-alpaca, achieving SOTA performance. For detailed table results, please refer to our response to Reviewer QBR6.


Q7: Should I expect gradient-based MTL methods can outperform task arithmetic.

A7: Yes. Task arithmetic implements MTL in a training-free manner and can be viewed as a post-transfer approach for existing models, while MTL methods typically serve as performance upper bounds that we aim to approach.


[1] Task Singular Vectors: Reducing Task Interference in Model Merging. CVPR 2025.
[2] Model Merging with SVD to Tie the KnOTS. ICLR 2025.
[3] Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models. NeurIPS 2023.
[4] A Linearized Framework and A New Benchmark for Model Selection for Fine-Tuning. ArXiv 2021.

Final Decision

This paper initially received borderline scores. The main concerns raised by the reviewers included: (1) lack of discussion on related work, such as subspace-based methods for model merging; (2) insufficient justification for the theoretical analysis, particularly the linearity assumption of the loss landscape near fine-tuned parameters; (3) limited analysis of computational cost; and (4) missing comparisons with MoE-based methods, such as approaches using routers.

During the rebuttal, the authors addressed most of these concerns effectively, leading to a consensus among reviewers in favor of acceptance. In particular, the AC believes that comparing MoE-based and static model merging approaches purely in terms of accuracy is not very meaningful and that computational cost should be considered as part of the evaluation. For example, MoE approaches with a zero-compression rate are effectively equivalent to storing all fine-tuned parameters separately, diminishing their practical advantage.