Average Rating: 5.3 / 10 (Rejected, 3 reviewers)
Ratings: 5, 5, 6 (min 5, max 6, std 0.5)
Confidence: 4.0
Correctness: 3.0 | Contribution: 2.7 | Presentation: 3.0
ICLR 2025

What Matters for Model Merging at Scale?

Submitted: 2024-09-25 | Updated: 2025-02-05

Abstract

Keywords
model merging, weight averaging, averaging, composition, modular model, generalization

Reviews and Discussion

Official Review
Rating: 5

This paper focuses on model merging, exploring various dimensions including model sizes (1B-64B parameters), merging techniques (Averaging, Task Arithmetic, TIES-Merging, DARE-TIES), and the number of expert models (up to 8). The evaluation covers both the tasks the experts were trained on (held-in tasks) and unseen tasks (zero-shot generalization). Unlike previous studies conducted at small model scales, this study extends the exploration to the effect of scaling model size, as well as base model quality and the number of expert models. Key findings include: (1) Merging is more effective when using strong instruction-tuned base models compared to pretrained ones. (2) Larger models are easier to merge. (3) Merging improves generalization capabilities, and with strong base models and increasing numbers of merged experts, the merged model can outperform multitask-trained models. (4) Larger models can merge more expert models effectively.

Strengths

  1. The objective of this study is to offer profound insights regarding the scalability aspect of model merging, which indeed represents a significant direction within the realm of "scaling".
  2. The research presented herein exhibits a comprehensive and meticulous experimental design, which encompasses multiple dimensions such as model sizes, merging methods, and the count of experts. The results are presented in a highly satisfactory manner. Through a sequence of well-conducted experiments, it has been clearly demonstrated that the merged model can effectively harness the diverse expert knowledge. This beneficial effect becomes more pronounced with the increase in model size and when instruction-tuned base models are utilized.
  3. The paper is generally well-written and easy to follow.

Weaknesses

  1. The fact that the study's exclusive concentration lies on PaLM-based models does give rise to legitimate concerns regarding the generalizability of the findings to other architectural frameworks such as GPT, LLaMA, and Qwen.
  2. Incomplete theoretical exploration: The paper is heavily empirical, lacking the theoretical analysis needed to explain the observed phenomena, for example, the relationship between weight disentanglement and merging effectiveness.
  3. Constraints in experimental design: The experimental design of the paper is primarily focused on a narrow range of model sizes (1B to 64B parameters) and a limited number of expert models (up to 8).

Questions

  1. What are the theoretical and practical implications that arise when the merging process extends beyond involving 8 experts? Additionally, does there exist an anticipated performance ceiling in such a context?
  2. It is of great significance to explore the following aspects: Firstly, how could the findings obtained from the current study be extrapolated and applied to other model architectures? Secondly, which specific architectural features might exert an impact on the performance of the merging process?
  3. In Section 4.3, a pertinent query arises regarding the strength of the multitask baseline. Specifically, one might question whether the multitask baseline is overly potent. For example, in the context of 6-expert merging, should the baseline be trained solely on those specific 6 tasks rather than on a combination of all 8 tasks?
Comment

Thanks a lot for your valuable review. We would like to address your concerns as follows:

Weakness-1: Other model families.

Response: This study primarily focused on PaLM-2 models due to resource and infrastructure constraints. However, model merging has been shown to work across almost all transformer-based models [TIES, DARE, WIDEN, Task Arithmetic, FusionBench] in both the vision and language domains. We believe that the architectural differences between PaLM-2, GPT, LLaMA, and Qwen are minimal, and we expect our findings to hold across all of these model families. Moreover, it is very expensive to perform full fine-tuning on large datasets for multiple different model families, which restricts the number of experiments that can be done.


Weakness-2: Incomplete theoretical exploration

Response: The empirical nature of the paper is by design, as it is hard to answer most of the questions studied in the paper (the impact of model size, model quality, etc.) from a purely theoretical perspective. The main goal of the paper is to help practitioners develop more intuition about model merging and to use our work as a reference point on what to expect when merging large models.


Weakness-3: Constraints in Experimental Design

Response: We believe that model sizes up to 64B cover a large fraction of the models that people typically use. Most models are only released in sizes up to 70B parameters, and there are very few models with more than 100B parameters. Most of these >100B models are not fully fine-tuned by practitioners due to resource and other constraints. Hence, due to both resource constraints and the limited practicality of fine-tuning bigger models, we skip them.

For the number of models, most past works that study models >7B, such as [DARE, WIDEN], merge only up to 3 models, so merging up to 8 models already goes significantly beyond what exists in the literature. Additionally, it is worth noting that the cost of such experimentation grows significantly as we add more experts.


Question-1: theoretical and practical implications over 8 experts.

Response: Based on our empirical observations, we expect the trends reported in the paper to continue when merging more than 8 experts. For good base models, this means we expect the normalized held-in performance to remain the same or degrade compared to merging 8 models, while we expect the held-out performance to improve as more models are merged. Theoretically, model merging is not yet well understood, and in general it is hard to make concrete claims about theoretical upper or lower bounds, which is also beyond the intended scope of this work. We hope that our empirical insights will help both theorists and practitioners dig deeper and develop answers to some of the theoretical questions raised here.


Question-2: Extension to different architectures, and which architectural designs matter for merging.

Response: Most works on model merging do not distinguish between different types of parameters, say linear layers vs. layer/batch-norm parameters. This is because the main prerequisite for model merging is the linear mode connectivity (LMC) property. In most practical cases, LMC holds if we start from a good zero-shot model and perform a small amount of fine-tuning so that the model parameters do not change much. This ensures that the fine-tuned models and the base model all lie in the same loss basin, which makes merging them easier. Hence, as shown at small scales by other works like [DARE, WIDEN], model merging works well across different architectures. We expect our findings to translate to other similar model families (LLaMA, Qwen, Mistral, etc.), which are all transformer-based architectures.

One thing to note when merging models is that the vocabularies of the different models should be aligned. For example, special tokens are sometimes added when fine-tuning a model; in such cases the embedding layers may not be aligned, and for each token in the vocabulary we might have to carefully merge the embeddings from the different models.
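As an illustration of the kind of parameter-space merging discussed above, here is a minimal, hedged sketch of task-arithmetic merging over aligned state dicts (the function name, the scaling coefficient `lam`, and the checkpoint paths are illustrative assumptions, not the paper's implementation):

```python
import torch

def task_arithmetic_merge(base_sd, expert_sds, lam=0.3):
    """Merge fine-tuned experts into one model via task arithmetic:
    merged = base + lam * sum_i (expert_i - base).
    base_sd and each entry of expert_sds are state_dicts with identical
    keys, shapes, and (for embeddings) an aligned vocabulary."""
    merged = {}
    for name, base_w in base_sd.items():
        # Task vector for each expert: how fine-tuning moved this parameter.
        task_vectors = [sd[name] - base_w for sd in expert_sds]
        merged[name] = base_w + lam * torch.stack(task_vectors).sum(dim=0)
    return merged

# Usage sketch with hypothetical checkpoint files:
# base = torch.load("base.pt")
# experts = [torch.load(f"expert_{i}.pt") for i in range(8)]
# merged_sd = task_arithmetic_merge(base, experts, lam=0.3)
```

Simple averaging corresponds to setting `lam = 1 / num_experts`, while TIES-Merging and DARE-TIES additionally sparsify and/or sign-align the task vectors before combining them.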


Comment

Question-3: Overly potent multitask baseline.

Response: Yes, this observation is correct. Our multitask baseline is always trained on all 8 tasks, making it stronger than it would be in a strictly fair comparison. However, this choice was made to reduce the computational cost: for each value of num_models (2, 4, 6, 8) we select 3 random task subsets to average out the effect of the tasks selected. Given that we consider 4 model sizes (1B, 8B, 24B, 64B), matching every configuration would require 4 × 4 × 3 = 48 full multitask fine-tuning runs, which are expensive. Hence, we decided to keep the stronger multitask baseline, as our main focus was on the merged models, for which we performed all experiments thoroughly.


Concluding Remark: We would be happy to answer any other questions or further clarify our responses. Please consider updating your score if you find that the insights presented in the paper are useful to practitioners working on model post-training to better understand how they can leverage model merging for large models.


[DARE]: Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch

[Task Arithmetic]: Editing Models with Task Arithmetic

[TIES]: Resolving interference when merging models

[WIDEN]: Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement

[FusionBench]: FusionBench: A Comprehensive Benchmark of Deep Model Fusion

Comment

Thank you for the clarification!

The authors have provided reasonable explanations for the limitations of their paper, especially regarding resource constraints and experimental design. However, their responses lack in-depth discussion of the theoretical exploration and specific evidence to support generalizability across model families. For example, I would like to see at least some experimental data from other model families to further support the claims about the performance trends.

Moreover, while they offer some insights into the theoretical and practical implications of merging more than eight expert models, these points require more detailed analysis for support. Overall, the authors need to provide more evidence and in-depth analysis on the theoretical foundations and generalizability across model families to enhance the persuasiveness and impact of their paper.

Comment

Thank you for your valuable feedback. We are pleased that most of your other concerns have been addressed.

Merging more than 8 models: While we do not have a theoretical bound on what happens as we merge more than 8 experts, we expect the held-in performance to decrease and the held-out performance to increase up to a certain point and then plateau. Other similar studies [1] have merged up to 25 models with small base models (see Figure 5 in [1]) and reached similar conclusions, as stated in their Section 4.6 takeaways: "When scaling up the number of models being merged, held-in performance decreases whereas generalization performance increases, though these trends plateau after around 10 models". Hence, our findings are consistent with other concurrent studies, and we expect similar behavior with large models.

Experiments on other model families: We are running experiments with LLaMA models; however, these experiments are expected to take some time, as the whole experimental setup needs to change because LLaMA models are already trained on the datasets we use. Hence, we were not able to get those results in time, but we are committed to improving our work and will add the LLaMA results to the final version of the paper.

We request the reviewer to weigh the useful insights in the paper along with our commitment to add the LLaMA experiments in the camera-ready version, and to consider increasing their score.

[1] Realistic Evaluation of Model Merging For Compositional Generalization, Tam et al 2024 (https://arxiv.org/pdf/2409.18314)

Official Review
Rating: 5

This is an empirical study that answers questions about model merging. The authors explore the following research questions:

  • What is the effect of using pretrained vs. instruction-tuned base models for creating expert models for merging? The instruction-tuned base models outperform the pretrained models in the merging process.
  • Does model merging become easier or harder as the model size increases? Larger models consistently showed better performance in merging, indicating that they are easier to merge effectively.
  • How does merging affect zero-shot generalization to held-out (unseen) tasks, and how is this influenced by model size? Merged models based on larger and instruction-tuned base models have improved zero-shot generalization, sometimes surpassing multitask baselines.
  • How many expert models can be merged without performance loss, and how does this depend on model size? Larger models could effectively merge more expert models without significant performance degradation, whereas smaller models experienced performance drops when merging more experts.

Strengths

  • This is a comprehensive evaluation that systematically examines multiple factors (model size, base model quality, number of experts, and merging methods) across a large-scale experimental setup, providing robust insights.

Weaknesses

  • It seems that when comparing the merging of pretrained "experts" and finetuned "experts", the pretrained one is never finetuned after the merging process. I think it might be unfair to compare a never-finetuned checkpoint with a finetuned checkpoint (although it is a merged checkpoint). Thus, it is very natural to predict that merging finetuned "experts" is better than merging pretrained "experts".
  • All the tasks (held-in and held-out) are text based. It would be better to involve some vision-based tasks.
  • The smallest model is 1B. That is small for text models but probably still fairly large for vision models.
  • If adding vision tasks, it would be great to check both vision-transformer-based and ResNet-based models.
  • Besides, I am also wondering if the way of training the "expert" matters, e.g., zero-shot contrastive loss for classification vs. supervised learning for classification.

Questions

See the weaknesses above.

Comment

Thanks a lot for your valuable review. We would like to address your concerns as follows:

Weakness-1: Unfair to compare a never-finetuned checkpoint with a finetuned checkpoint.

Response: We would like to note that the expert models are always created by fine-tuning the base model, hence we are always merging fine-tuned models. The only difference is whether the base model has gone through instruction tuning or not. The setup makes sense because expert models created from both the base model and the IT model reach reasonable performance. Moreover, we normalize the merged model's performance by the base model's performance, which removes any differences between the capabilities of the experts created from the base and IT models. The main difference this experiment highlights is that instruction tuning makes the process of merging easier.
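For concreteness, a hedged sketch of the normalization described above (the symbols are ours; the exact reference scores follow the setup described in the paper):

```latex
% Normalized performance of a merged model, averaged over the task set T.
% Perf_t(.) is the evaluation metric on task t; theta_ref denotes the
% reference model used for normalization, as described above.
\[
\mathrm{NormPerf}(\theta_{\mathrm{merged}})
  = \frac{1}{|T|} \sum_{t \in T}
    \frac{\mathrm{Perf}_t(\theta_{\mathrm{merged}})}{\mathrm{Perf}_t(\theta_{\mathrm{ref}})}
\]
```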


Weakness-2: All the tasks (held-in and held-out) are text based. It would be better if involving some vision based tasks.

Response: Yes, all of the tasks are text based, because most large models (>13B) are primarily trained on text in an autoregressive manner; hence we focus on text. For vision, there are very few model families with sizes over 13B that span a whole range of sizes, say 1/8/24/64B parameters.


Weakness-3/4: The smallest model is 1B. It is small for text models but probably still fairly large for vision models. If adding vision tasks, it would be great to check both vision transformer based models and resnet based models.

Response: As mentioned in L282:285 and L152:154, one of the goals of our work is to help practitioners better understand what they can expect if they use model merging to perform post-training of large language models. For this reason we study text-based LLMs, which are large in size. Regarding vision models, past works like [TIES] and [Task Arithmetic] have already performed merging on ViT variants up to the Large size (~350M parameters) and shown that model merging works well. We know that model behavior changes as models become very large, and this study aims to see whether the benefits of model merging hold at scale or diminish as models get bigger. However, the difference between a 350M and a 1B ViT is not significant enough to change model behavior, so ViT experiments would add little toward this goal. Having said that, we expect 1B vision models to merge about as well as 350M ViT models.


Weakness-5: Besides, I am also wondering if the way of training the "expert" matters. e.g. Zero-shot contrastive loss for classification vs supervised learning for classification.

Response: From past work we know that models trained with supervised learning [see TIES, Task Arithmetic] as well as contrastive learning [WISE-FT] can be merged easily. Hence, while we expect the training objective to have some impact on mergeability, models trained with either of these objectives can be merged.

More generally, the way experts are trained matters because we need the linear mode connectivity property to hold for model merging to work well. Even small things like training the model for too long with a very high learning rate can take it far away from the base model, which can potentially break the LMC property. Hence, it might be harder to merge experts created with a very high learning rate.


Concluding Remark: We would be happy to answer any other questions or further clarify our responses. Please consider updating your score if you find that the insights presented in the paper are useful to practitioners working on model post-training to better understand how they can leverage model merging for large models.


[Task Arithmetic]: Editing Models with Task Arithmetic

[TIES]: Resolving interference when merging models

[WISE-FT]: Robust fine-tuning of zero-shot models

Comment

We hope that we have addressed all of your concerns in our rebuttal; please let us know if you have more questions. We request you to consider increasing your score.

Comment

Dear Author,

Regarding W2–W5: I understand that there are relatively few large-scale vision models (>13B parameters) available, with the largest currently being 22B. However, if the aim is to answer general questions such as, "Does model merging become easier or harder as model size increases?", it is important to consider not only large models but also smaller ones (e.g., ranging from 10M to 4B). Additionally, the scope should not be restricted to models trained solely with autoregressive loss. Otherwise, the conclusions drawn will be limited to models with parameters specifically within the 1B–64B range and trained using autoregressive objectives.

This paper appears to be strongly experimental in nature. To support a more general conclusion, it is crucial to conduct comprehensive experiments. The suggestions from W2 to W5 aim to make the paper's findings more broadly applicable by incorporating the following variations:

  • Model size: Including sub-1B models alongside larger ones. (We know large models perform very differently from small models in many ways.)
  • Tasks: Expanding beyond language tasks to include vision-related tasks.
  • Loss functions: Exploring different training paradigms, such as contrastive or supervised learning, in addition to autoregressive loss.
  • Architectures: Comparing transformers with alternative designs like ResNets.

Despite these suggestions, there have been no updates to the experiments. The authors argued that the goal of the paper is "to help practitioners better understand what to expect when using model merging for post-training of large language models." While this focus is clear, it is too narrow in my view, and the contribution remains insufficient for an experimental paper. Without addressing the broader range of scenarios mentioned above, the findings lack the generalizability needed to provide robust guidance to practitioners.

For these reasons, I believe it is appropriate to maintain the original score. I hope you understand my perspective.

Official Review
Rating: 6

This paper systematically analyzes the impact of different model sizes, base model quality, model merging methods, and the number of expert models on the effectiveness of model merging, and draws five key conclusions. Overall, this paper is of some significance to the model merging community.

Strengths

  • This paper systematically reveals the impact of different model sizes, quality, quantity, and merging methods on the effectiveness of model merging.
  • The figures and tables in this paper are very clear.
  • This paper is well organized and clearly written.

Weaknesses

  • Some inconsistencies lack explanation:
    • (1) In Figure 1, why is multi-tasking better than single-tasking in 8B and 24B, but multi-tasking is not better than single-tasking in 1B and 64B? How does this relate to model size?
    • (2) In Figure 5 (PaLM-2-24B, PaLM-2-64B), why is the generalization performance when the number of experts is 8 not as good as when the number of experts is 6? Why does the TIES method perform worse than the pre-trained model when the number of experts increases in PaLM-2-24B?
    • (3) In Figure 6, under PaLM-2-Held-Out, 64B is significantly better than 24B. Why is 64B not as good as 24B under PaLM-2-IT-Held-Out.
    • (4) In Figure 7, why is the performance of merging 8 experts better than merging 4 and 6 experts under the Held-In-64B setting? The greater the number of tasks, shouldn't task conflicts be more serious?
  • There is a lack of outlook or suggestions for future directions based on the phenomena observed in this paper.
  • Lack of source code and checkpoints. As this is an evaluation paper, the author can consider open-sourcing the resources used in the paper to facilitate further reproduction and research by the model merging community.
  • The author can consider adding discussions of the following related work.
    • Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities. Arxiv, 2024.
    • Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities. Arxiv, 2024.
    • FusionBench: A Comprehensive Benchmark of Deep Model Fusion. Arxiv, 2024.
    • Arcee's MergeKit: A Toolkit for Merging Large Language Models. Arxiv, 2024.
  • Some minor errors:
    • References are repeated, "Language models are super mario" appears twice, "Extend model merging from fine-tuned" appears three times, and "Model ratatouille" appears twice. The author needs to check carefully whether other references are repeated.
    • Reference year error: Ties-merging was published in NeurIPS 2023 instead of 2024. Similarly, "Task arithmetic in the tangent Space" was also published in NeurIPS 2023 instead of 2024. The author needs to check the year of other references.

Questions

Please refer to the Weaknesses section

Comment

Weakness-2: There is a lack of outlook or suggestions for future directions based on the phenomena observed in this paper.

Response: We have added this to Section 4.6 (Discussion and Takeaways). Here is a summary of the suggested future directions.

Based on our findings, the following future directions are worth exploring. First, exploring strategies to mitigate held-in performance loss during merging; some form of iterative branched training and merging of expert models could be fruitful here. Second, investigating weight disentanglement and its relationship with zero-shot capabilities could yield deeper insights for improving both held-in and held-out performance. Third, digging deeper into why different merging methods perform similarly at scale; it would be valuable to assess the generality of this finding either theoretically or empirically across different models. Lastly, there are some minor aberrations in our findings which we believe are influenced by the choice of the expert data distributions. A thorough study analyzing the impact of data distribution on the merging process would be very useful; answering questions like which models are easier or harder to merge could yield insights for building better merging methods.


Weakness-3: Lack of source code and checkpoints. As this is an evaluation paper, the author can consider open-sourcing the resources used in the paper to facilitate further reproduction and research by the model merging community.

Response: We are currently in the process of getting approvals to release the codebase and the model checkpoints. We will try our best to release both the checkpoints and the codebase used to perform these large-scale merging experiments.


Weakness-4: The author can consider adding discussions of the following related work...

Response: Thank you, we have added a discussion with them in Section 2, highlighted in blue.


Weakness-5: Repeated references and incorrect years in citations.

Response: Thank you for pointing these out; we have checked and fixed the bib file.


Concluding Remark: We would be happy to answer any other questions or further clarify our responses. Please consider updating your score if you find that the insights presented in the paper are useful to practitioners working on model post-training to better understand how they can leverage model merging for large models.


Comment

Thank you for the authors' response. These responses resolved my concerns about some of the inconsistencies. In addition, the authors also discussed related work and future directions. For the code and weight checkpoints, it would be positive if they could be open sourced in the future. Therefore, I have decided to keep my original positive score.

Comment

We are glad we could address all of your concerns. We would appreciate it if you would reconsider increasing your score. Thanks for your hard work.

Comment

Thanks a lot for your valuable review. We would like to address your concerns as follows:

Weakness-1.1: In Figure 1, why is multi-tasking better than single-tasking in 8B and 24B, but multi-tasking is not better than single-tasking in 1B and 64B? How does this relate to model size?

Response: Thanks for asking this question. When we first observed this, we suspected that the multitask models for the 1B/64B sizes were undertrained or undertuned. However, even after significant tuning, multitask performance for the 8B/24B models remained better than that of the task-specific expert models, and vice versa for 1B/64B. After careful tuning, we concluded that this is an artifact of how the PaLM-2 base models were trained: the 8B/24B models were slightly undertuned during pre-training, which makes them harder to fine-tune on specific tasks with limited data. However, when we have a large multitask mixture of data, the model is able to learn effectively from it and surpasses the single-task models. Hence, we believe this is due to the amount of training/tuning the underlying base model has gone through rather than its size.


Weakness-1.2: In Figure 5 (PaLM-2-24B, PaLM-2-64B), why is the generalization performance when the number of experts is 8 not as good as when the number of experts is 6? Why does the TIES method perform worse than the pre-trained model when the number of experts increases in PaLM-2-24B?

Response: Given that the PaLM-2 base model is not a good zero-shot model, it is hard to explain all the aberrations in its results. However, we have some hypotheses.

Our hypothesis for the worse performance of TIES is as follows. Given more models to merge, we expect more conflicts; TIES resolves the conflict for each parameter by moving in the direction with the highest total magnitude across all the models being merged. This works well when the base model is well tuned, since that leads to a much flatter loss landscape: when moving in the + (resolved-sign) direction, we expect performance to improve on some tasks with minimal to no degradation on others. However, for pre-trained models the loss landscape is sharper, so moving in the + direction may hurt the performance of tasks that moved in the negative direction during fine-tuning.

Given this, we would like to reiterate that it is hard to predict trends for base models that are not good.
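To make the sign-election mechanism described above concrete, here is a minimal, hedged sketch of TIES-style merging for a single weight tensor (the function name and the `density`/`lam` hyperparameters are illustrative; this is a simplified reading of the method, not the paper's or the TIES authors' implementation):

```python
import torch

def ties_merge_tensor(base, experts, density=0.2, lam=1.0):
    """Simplified sketch of TIES-style merging for one weight tensor.
    base: torch.Tensor of base-model weights; experts: list of fine-tuned
    tensors with the same shape. Hypothetical helper, not the paper's code."""
    # 1) Task vectors: how each expert moved away from the base model.
    tvs = torch.stack([e - base for e in experts])      # (n_experts, ...)
    flat = tvs.flatten(1)                               # (n_experts, numel)
    # 2) Trim: keep only the top-`density` fraction of entries by magnitude.
    k = max(1, int(density * flat.shape[1]))
    thresh = flat.abs().topk(k, dim=1).values[:, -1:]   # k-th largest per expert
    flat = torch.where(flat.abs() >= thresh, flat, torch.zeros_like(flat))
    # 3) Elect a sign per parameter: the direction with the largest total magnitude.
    elected = flat.sum(dim=0).sign()
    # 4) Disjoint mean: average only entries whose sign agrees with the elected sign.
    agree = (flat.sign() == elected) & (flat != 0)
    merged_tv = torch.where(agree, flat, torch.zeros_like(flat)).sum(dim=0)
    merged_tv = merged_tv / agree.sum(dim=0).clamp(min=1)
    # 5) Move the base weights along the merged task vector.
    return base + lam * merged_tv.view_as(base)
```

The hypothesis above concerns step 3: with more experts, more entries end up disagreeing with the elected sign, which is benign in a flat loss basin but can hurt tasks that moved the other way when the base model's landscape is sharp.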


Weakness-1.3: In Figure 6, under PaLM-2-Held-Out, 64B is significantly better than 24B. Why is 64B not as good as 24B under PaLM-2-IT-Held-Out.

Response: Thanks for asking this question. As discussed in our response to Weakness-1.1, we concluded that the PaLM-2-24B model is undertrained/undertuned compared to the PaLM-2-64B model. Hence, in (PaLM-2, Held-Out), 64B > 24B because the 64B models are better trained and therefore yield more disentangled experts, which merge better and lead to larger improvements in generalization.

When looking at the IT model, (PaLM-2-IT, Held-Out), we would first like to note that the plots show relative improvements compared to the PaLM-2-IT models. Given that the 64B-IT model is a better zero-shot model than the 24B-IT model, achieving a similar relative improvement for 64B is much harder. Moreover, since the experts for both model sizes are created using the same amount of data, the bigger model may be more data-constrained as well. Hence, when we fine-tune the 24B model and merge the experts, the relative performance gains for the 24B-IT model are higher than for the 64B-IT model.


Weakness-1.4: In Figure 7, why is the performance of merging 8 experts better than merging 4 and 6 experts under the Held-In-64B setting? The greater the number of tasks, shouldn't task conflicts be more serious?

Response: In general, we expect task conflicts to increase as the number of models increases. However, it is possible that the 2 added tasks are much more compatible with some of the existing tasks, leading to high performance on them, which could result in a better average performance when merging 8 tasks compared to 6. Because of this, we believe that a thorough study analyzing the impact of data distribution on the merging process would be very useful. Answering questions like which models are easier to merge and which ones are harder could yield insights for building better merging methods.

AC Meta-Review

Overall, this is an interesting paper but it ultimately needs more extensive empirical analysis as pointed out by several reviewers.

The post-rebuttal ratings are at a borderline but are leaning more towards rejection. In addition, both reviewers who are leaning more towards rejection have not been fully convinced with the authors' rebuttal.

Given this, the AC believes that this paper is unfortunately below the acceptance bar at ICLR. The authors are encouraged to revise the paper with more extensive experiments and resubmit to a future venue. The AC believes that with a stronger set of experiments, the paper will have much better ratings in the next round.

Additional Comments on Reviewer Discussion

There has been some discussion between the authors and reviewers. The authors have made efforts to address concerns raised by the reviewers. But, unfortunately, important concerns regarding the paper's empirical studies have not been thoroughly addressed. As a result, two reviewers maintain a decent but below-bar rating (5) for this paper. Their provided reasons, in my opinion, are valid.

Final Decision

Reject