ICLR 2025 (Rejected)

Average rating: 5.5 / 10 (4 reviewers; min 5, max 6, std 0.5)
Individual ratings: 6, 5, 5, 6
Average confidence: 3.3 | Correctness: 2.5 | Contribution: 2.3 | Presentation: 2.5

Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks

Submitted: 2024-09-25 | Updated: 2025-02-05
TL;DR

We merge transformers trained from different initializations on different tasks.

Abstract

Keywords

Model merging, Knowledge Distillation, Deep Learning

Reviews and Discussion

Review (Rating: 6)

This paper explores merging large transformers trained on different tasks from distinct initializations, deviating from the traditional setting where models are required to originate from the same checkpoint. The authors build upon and enhance the concept of folding from ZipIt! [1], successfully extending its application to transformers and thereby broadening its original scope. Furthermore, the authors employ data augmentation to address the constraints of limited-data settings. This approach has achieved remarkable results on models such as MLPs and ViTs.
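For concreteness, here is a minimal toy sketch of the folding concept referenced above (our own illustrative construction, not code from the paper): merge matrix M and unmerge matrix U are absorbed into a single weight of the original size, so the merged model carries no extra parameters at inference time. FS-Merge learns M and U; here we fix them to simple averaging/duplication.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # hidden width of each original model

# Corresponding linear layers from two models trained independently.
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

# Merge matrix M: (d, 2d), maps concatenated features to a merged space.
# Unmerge matrix U: (2d, d), approximately inverts M.
# Plain averaging/duplication here stands in for the learned matrices.
M = 0.5 * np.concatenate([np.eye(d), np.eye(d)], axis=1)
U = np.concatenate([np.eye(d), np.eye(d)], axis=0)

# Block-diagonal stack of the two original layers: (2d, 2d).
W_block = np.block([[W1, np.zeros((d, d))],
                    [np.zeros((d, d)), W2]])

# "Folding": absorb M and U into one merged weight of the original size.
W_merged = M @ W_block @ U  # (d, d)

x = rng.normal(size=d)
print(W_merged @ x)              # merged layer output
print(0.5 * (W1 @ x + W2 @ x))   # identical here, since M/U just average
```

With learned M and U (as in ZipIt! and FS-Merge), the folded weight can do much better than naive weight averaging while keeping the same folded form.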

Strengths

  1. Expanding the scope of model merging is a highly significant direction.
  2. Very detailed introduction and guidance.
  3. Thorough ablation studies and large-scale experimental results and discussions.

Weaknesses

  1. Concerns about application scenarios: While the paper takes a step forward in expanding the scope of model merging, it does not significantly enhance the applicability of model merging scenarios. The target scenario (same model architecture with identical pre-training data but different initialization, followed by fine-tuning on different tasks) is not common in practice.
  2. I noticed that data augmentation techniques are also used and discussed in Zipit. I suggest the authors address this point when discussing the connections with Zipit, highlighting similarities and differences in the methods used.

Questions

  1. There is not much discussion about the next steps in the paper. What are the authors' thoughts on merging truly similar architectures but trained entirely from scratch on different datasets?
  2. I am a little curious about the potential effects if the method were applied to the more common setting we have now (originating from the same checkpoint, fine-tuning on different tasks).

Typos: Table 22, last row.

[1] Stoica et al., "ZipIt! Merging Models from Different Tasks without Training"

Comment

We are grateful to the reviewer for their valuable feedback. To address some of the reviewer's concerns, we have added new experiments to our manuscript (Appendix F.3, Appendix F.4), which are discussed in bullets 3 and 4 of this response.

  1. “The target scenario… is not common in practice”

We believe the current merging literature focuses on a narrow scenario—merging models fine-tuned from the same initialization—and that it is important to extend model merging to broader scenarios, as demonstrated for CNNs in [1,2,3]. For example, merging weights of two unrelated models from an online repository (e.g., Hugging Face or GitHub) is a practical use case where models are often not fine-tuned from the same initial model, making most existing merging techniques unsuitable.

  2. “I suggest the authors address this point when discussing the connections with ZipIt”

We thank the reviewer for this suggestion. Our manuscript already details the similarities and differences with ZipIt (Section 2, Appendix A.2) and the effect of data augmentation (Appendix F.7, Appendix G). We will ensure these are referenced in the main text. Please let us know if further clarification is needed.

  3. “...discussion about the next steps in the paper… merging truly similar architectures but trained entirely from scratch on different datasets?”

To address this concern, we have added new experiments to our work (Appendix F.3, Tables 22 and 23) merging fine-tuned ViTs from different pre-training strategies (CLIP contrastive approach and supervised learning with ImageNet-1K). Again, FS-Merge consistently outperformed the baselines across all pre-training strategies.

  4. “I am a little curious about the potential effects if the method were applied to the more common setting”

In response to the reviewer's request, we evaluated FS-Merge in the scenario commonly studied in the literature, which involves merging models fine-tuned on different tasks from the same pre-trained initialization (Appendix F.4, Table 24). While FS-Merge outperforms most baselines in this setting, its margin over methods like distillation and RegMean is small, with RegMean achieving comparable performance using fewer resources. Thus, despite its success in this scenario, we conclude that FS-Merge's true strength lies in more challenging scenarios involving models with different initializations.

[1] Stoica et al., "ZipIt! Merging Models from Different Tasks without Training"

[2] Ainsworth et al., “Git Re-Basin: Merging Models modulo Permutation Symmetries”

[3] Jordan et al., “REPAIR: REnormalizing Permuted Activations for Interpolation Repair”

Comment

Thanks for the response. I am willing to keep my score.

Review (Rating: 5)

This work addresses the challenging task of merging large transformers trained on different tasks from distinct initializations. The authors first demonstrate that traditional merging methods fail catastrophically in this scenario. To tackle this, they propose Foldable SuperNet Merge (FS-Merge), a method that optimizes a SuperNet to fuse the original models using a feature reconstruction loss. FS-Merge is straightforward, data-efficient, and capable of merging models with varying widths. The method is evaluated against existing approaches, including knowledge distillation, on MLPs and transformers across diverse settings, sizes, tasks, and modalities. FS-Merge consistently achieves state-of-the-art (SOTA) results, particularly in data-limited scenarios.

Strengths

  1. The paper tackles an intriguing problem: merging large transformers trained on different tasks from distinct initializations into a single model.
  2. The proposed method is simple and easy to follow.

Weaknesses

  1. FS-Merge is a training-based merging method, which can be costly compared to other model merging techniques.
  2. The figures, such as Figure 3, are low resolution, and the overall writing quality of the paper is not very professional, requiring significant improvement and refinement.
  3. The datasets used, such as MNIST and SVHN, are relatively small, and the performance improvements appear marginal. Also, the experimental setup seems unique to this paper and not aligned with standard practices in prior literature.
  4. The paper is largely empirical, lacking an in-depth discussion on why the proposed method effectively combines models trained on different domains.

Questions

  1. What is the training cost associated with different datasets, such as FLOPs?
  2. The proposed approach is similar to adapter-based methods. Could the authors discuss this similarity?
  3. It would be beneficial to include results on larger-scale datasets, such as the VTAB-1K benchmark and even ImageNet-1K, which are widely used in model merging or domain adaptation research.
Comment

We would like to thank the reviewer for the time and effort, as well as the constructive feedback. To address some of the reviewer's concerns, we have added new experiments to our manuscript (Appendix F.5), which are discussed in the response to question 3.

Weakness 1. "“FS-Merge… can be costly compared to other model merging techniques”

As noted in the article (e.g., Table 1), traditional efficient merging methods fail catastrophically in our challenging setting of merging transformers with different initializations, leaving models performing at chance level. Consequently, only resource-intensive non-local methods are effective in this scenario.

Weakness 2. “The figures, such as Figure 3, are low resolution”

We thank the reviewer for this helpful feedback. The figures have been replaced with higher-resolution versions.

“the overall writing quality of the paper is not very professional, requiring significant improvement and refinement”.

We apologize for any difficulties the reviewer experienced with the language of our article. If the reviewer could kindly provide specific examples of unclear sections, we will address them in the final manuscript.

Weakness 3. “Performance improvements appear marginal”

In most vision experiments, FS-Merge outperforms the strongest baseline, KD, by a margin of over 4%, and by more than 10% in some cases. To the best of our knowledge, such improvements are typically considered significant in image classification problems.

“The experimental setup seems unique to this paper and not aligned with standard practices in prior literature”

Indeed, this work seeks to broaden the scope of model merging by addressing the challenging scenario of merging transformers from different initializations, an important and practical direction as demonstrated for CNNs [1,2,3]. For instance, merging unrelated models from online repositories like Hugging Face or GitHub often involves models not fine-tuned from the same initial model, rendering most existing techniques unsuitable. Thus, we view our unique setup as a strength, not a weakness.

Weakness 4. “The paper is largely empirical, lacking an in-depth discussion on why the proposed method effectively combines models”

We thank the reviewer for this important note. Most model merging works are indeed empirically focused. As discussed in the article, we believe FS-Merge succeeds where traditional methods fail due to its use of a global objective rather than a local one, along with its ability to employ more complex merging rules. Additionally, FS-Merge outperforms KD by leveraging the knowledge encoded in the original model weights. We discuss these issues in detail in Appendix A.2, Appendix G, and Appendix H.
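To make the local/global distinction concrete, schematically and with notation introduced here only for illustration (not the paper's exact formulation): for two models with layer-$\ell$ features $f^1_\ell(x), f^2_\ell(x)$, a local method fits each merge/unmerge pair in isolation,

$$\min_{M_\ell, U_\ell}\ \mathbb{E}_x \left\| U_\ell M_\ell \begin{bmatrix} f^1_\ell(x) \\ f^2_\ell(x) \end{bmatrix} - \begin{bmatrix} f^1_\ell(x) \\ f^2_\ell(x) \end{bmatrix} \right\|^2,$$

whereas a global objective trains all $\{M_\ell, U_\ell\}$ jointly through the Foldable SuperNet $g_{\{M,U\}}$ to reconstruct the final features,

$$\min_{\{M_\ell, U_\ell\}_\ell}\ \mathbb{E}_x \left\| g_{\{M,U\}}(x) - \begin{bmatrix} f^1_L(x) \\ f^2_L(x) \end{bmatrix} \right\|^2,$$

so early-layer errors are penalized only through their effect on the end-to-end reconstruction.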

Question 1. “What is the training cost associated with different datasets, such as FLOPs?”

In our experience, the practically limiting resources on a GPU are runtime and memory usage rather than FLOPs. In the case of merging four ViT-B-16 models with 100 original images and 1,000 augmented images per dataset: RegMean takes about 3 minutes and 4,056 MB; distillation, 1.9 hours and 15,740 MB; FS-Merge, 3.6 hours and 29,550 MB; and FS-Merge seq., around 2.2 hours and 19,088 MB.

Question 2. “...is similar to adapter-based methods. Could the authors discuss this similarity?”

We thank the reviewer for this suggestion. We have added a section discussing adapter-based methods (Appendix A.1). Both approaches inject and train additional weights, but FS-Merge merges knowledge from multiple models, unlike adapter-based methods, which fine-tune a single model. After merging, FS-Merge folds the trainable parameters into the original weights, whereas many adapter methods keep the added modules in the fine-tuned model, retaining extra complexity.

Question 3. “It would be beneficial to include results on larger-scale datasets”

To mitigate this concern, we conducted new experiments, merging models fine-tuned on SUN397, Food101, and CIFAR100. Combined, these three datasets contain over 290,000 images and 598 classes. To our knowledge, this is the most challenging merging attempt in the transformer merging literature under our setting. FS-Merge outperforms the baselines in this scenario as well (see Appendix F.5, Table 25). We also wish to emphasize that some of our existing experiments were performed on large or challenging datasets, such as MNLI (392,000 data points, Table 26) and QQP (363,000 data points, Table 26), or merging groups of 5 ViTs (Table 19).

[1] Stoica et al., "ZipIt! Merging Models from Different Tasks without Training"

[2] Ainsworth et al., “Git Re-Basin: Merging Models modulo Permutation Symmetries”

[3] Jordan et al., “REPAIR: REnormalizing Permuted Activations for Interpolation Repair”

Comment

With the discussion period ending in less than two days, we kindly ask the reviewer if there are any remaining concerns or feedback we should address. We have carefully responded to the comments and questions provided so far and would be happy to clarify or discuss any additional points.

Comment

Thanks for the authors' responses. I have read the rebuttal and the other reviewers' comments. I share similar concerns to Reviewer eSD4, which align with my initial impression in the review. As the authors acknowledge in the rebuttal, the experimental settings in this work differ significantly from those of prior methods. The proposed method requires additional computational cost, and using KD as the model merging baseline (it is unclear whether KD is a weak or strong baseline) is vague and somewhat odd, which makes the comparisons appear unclear and potentially unfair. Furthermore, I do not find the proposed method to be particularly novel in its approach.

Review (Rating: 5)

The paper tackles the problem of merging two Transformers with different initializations and target tasks. To address this problem, the paper extends the previous idea of folding two weight matrices to the Transformer architecture and proposes to optimize the merging/unmerging layers for folding via knowledge distillation on unlabeled data.

Strengths

  1. The paper is well-structured and easy to follow.
  2. The paper polishes the idea of merging by folding from [1], i.e., merging two weights by inserting merging/unmerging layers, and extends it to the specific architecture of the Transformer. Also, the paper shows that it works well when combined with knowledge distillation on unlabeled data.

Weaknesses

  1. Although the extension of the previous idea to the Transformer is a technical contribution, the novelty of the proposed method is still limited, since it mainly applies feature-level knowledge distillation to the merging/unmerging layers originally proposed in [1].
  2. While the title suggests that this paper addresses the general problem of merging Transformers with different initializations, the experiments are performed only with Vision Transformers on a few downstream tasks. Also, since various pre-trained Vision Transformers are available, the method should be tested with other initializations rather than just the ImageNet-1K pre-trained one.
  3. The proposed approach relies heavily on optimization with unlabeled data, while previous works (SLERP, RegMean, ZipIt, Opt) are designed to be applied without any optimization (though they can also be combined with additional finetuning). If such optimization is allowed, the problem can also be reduced to multi-task learning or distillation (with unlabeled data in this case), which now has plenty of existing approaches (e.g., [2,3,4,5]). Thus, the paper should discuss its relationship to such works in more depth.
  4. There is a concern about computational/memory inefficiency in the optimization phase, particularly as the size of the models to be merged increases, because the knowledge distillation part involves large matrices. Since model merging is typically performed on a low-end GPU, the proposed approach may not be promising.
  5. The number of optimized parameters reported in Tables 3, 4, and 5 may be (possibly intentionally) misleading. It suggests that the proposed method is more than 10x as efficient as vanilla knowledge distillation, but the actual merging time reported in Table 6 of the Appendix shows this is not the case. Rather, the proposed method appears more than 2x less efficient, possibly due to the above weakness.

[1] Stoica et al., "ZipIt! Merging Models from Different Tasks without Training"

[2] Li et al., "An Unsupervised Multiple-Task and Multiple-Teacher Model for Cross-lingual Named Entity Recognition"

[3] Nguyen-Meidine et al., "Unsupervised Multi-Target Domain Adaptation Through Knowledge Distillation"

[4] Park and Kwak, "Feature-level Ensemble Knowledge Distillation for Aggregating Knowledge from Multiple Networks"

[5] Shi et al., "Data-free Model Fusion with Generator Assistants"

Questions

See Weaknesses.

Comment

We thank the reviewer for their insightful comments and suggestions. We appreciate the opportunity to improve our manuscript. To address some of the reviewer's concerns, we have added new experiments to our manuscript (Appendix F.3), which are discussed in bullet 2 of this response.

  1. “The novelty of the proposed method is still limited…”

We wish to emphasize that a naive version of FS-Merge, which simply trains M and U layers with KD, is insufficient for this challenging setting. Our success stems from key innovations: shifting to a global reconstruction problem, parametrizing M and U matrices as diagonal plus low-rank matrices, leveraging “first” initializations, and using data augmentations. These contributions are discussed in Section 2, Appendix H, and ablation studies (Appendix G). We also proposed a more efficient version (FS-Merge seq.) and evaluated alternative approaches, such as using inner features (Appendix H.3), which did not improve accuracy. Additionally, our work offers the most detailed evaluation to date of merging transformers from different initializations, covering both vision and text domains.
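As a rough illustration of the diagonal-plus-low-rank parametrization mentioned above, here is a minimal sketch (illustrative names and initialization; not the actual implementation):

```python
import torch
import torch.nn as nn

class DiagPlusLowRank(nn.Module):
    """A merge/unmerge matrix parametrized as an identity-like diagonal
    plus a rank-r correction: W = D + A @ B. Minimal illustrative sketch;
    names and initialization are ours, not the paper's code."""

    def __init__(self, d_out: int, d_in: int, rank: int):
        super().__init__()
        k = min(d_out, d_in)
        self.diag = nn.Parameter(torch.ones(k))                 # main-diagonal entries
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.01)  # low-rank factor
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # low-rank factor
        self.d_out, self.d_in = d_out, d_in

    def weight(self) -> torch.Tensor:
        # Rectangular "diagonal": place self.diag on the main diagonal.
        W = torch.zeros(self.d_out, self.d_in)
        idx = torch.arange(min(self.d_out, self.d_in))
        W[idx, idx] = self.diag
        return W + self.A @ self.B  # diagonal + low-rank correction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight().T

# e.g., a merge layer M mapping concatenated width 2*768 down to 768 with a
# rank-16 correction: far fewer parameters than a dense 768x1536 matrix.
M = DiagPlusLowRank(d_out=768, d_in=2 * 768, rank=16)
x = torch.randn(4, 2 * 768)
print(M(x).shape)  # torch.Size([4, 768])
```

One appealing property: initializing the diagonal near an identity-like map and the low-rank factors near zero keeps the SuperNet close to one original model at the start, which is plausibly the role of the "first" initialization.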

  2. “The experiments are performed only with the Vision Transformer with a few downstream tasks…”

Our work is not limited to Vision Transformers; it also includes merging fine-tuned BERT models pre-trained with Masked Language Modeling and Next Sentence Prediction (see Appendices F.5 and F.7, mentioned on lines 313-314 in the main text).

“It should be tested with other initializations rather than just ImageNet-1K pre-trained ones”

To address this concern, we have added new experiments to our work (Appendix F.3, Tables 22 and 23) merging fine-tuned ViTs from different pre-training strategies (CLIP contrastive approach and supervised learning with ImageNet-1K). Again, FS-Merge consistently outperformed the baselines across all pre-training strategies.

  3. “Distillation… now has plenty of existing approaches ([2,3,4,5])… the paper should discuss more on the relationship to such works”.

We evaluated many KD approaches (Appendix G.2), including different initializations (random, average, RegMean), inner feature usage (Appendix H.3), and augmentations. The KD method used in our article is the strongest we identified. Regarding those references, [2] is specifically designed for NER and [5] assumes no access to the original teacher's data, making them irrelevant to our setting. [4] employs inner features, which we found unhelpful (see Appendix H.3). [3] uses the teacher models in a progressive manner, similar to FS-Merge seq., which showed somewhat lower performance than the original FS-Merge; therefore, we believe it is unlikely to outperform our KD baseline. We thank the reviewer for these references; they have been added to our article.

  4. “There is a concern about the computational/memory inefficiency…” + “in Table 6… the proposed method seems more than 2x inefficient…”

As explained in lines 466-475 of Section 3 and in Appendix C:

a) When merging two models, our method is comparable in efficiency to KD.

b) Our method becomes less efficient when more models are merged (e.g. the mentioned result in Table 6 in the appendix merges four models).

c) We provide ways to improve efficiency, such as reducing the rank of M, U matrices and using FS-Merge seq. (Appendix B.4). FS-Merge seq. outperforms KD in accuracy while maintaining comparable efficiency.

Thus, our method involves an efficiency-performance tradeoff but consistently outperforms KD. Moreover, the costly merging phase occurs only once; afterward, all methods, including FS-Merge, produce a model of the original size with the same resource requirements. Additionally, our implementation of FS-Merge could likely be optimized for greater efficiency.

  5. “The number of optimized parameters… may be (possibly intentionally) misleading”

We thank the reviewer for this remark. We did not intend to obscure efficiency limitations, as we discussed them in detail in Section 3 and Appendix C, as mentioned above. To avoid confusion, we changed the column from “number of learnable parameters” to “Does the method use learnable parameters” (see Tables 3, 4, and 5).

We hope these revisions address all the reviewer’s concerns and clarify the intentions and contributions of our work. We thank the reviewer again for their careful review and helpful feedback.

[2] Li et al., "An Unsupervised Multiple-Task and Multiple-Teacher Model for Cross-lingual Named Entity Recognition"

[3] Nguyen-Meidine et al., "Unsupervised Multi-Target Domain Adaptation Through Knowledge Distillation"

[4] Park and Kwak, "Feature-level Ensemble Knowledge Distillation for Aggregating Knowledge from Multiple Networks"

[5] Shi et al., "Data-free Model Fusion with Generator Assistants"

Comment

Thank you for the clarification and additional experiments. Based on these efforts, I have raised my score. However, since my initial concerns (particularly W1, W3, and W4) remain unresolved, I cannot suggest acceptance of this work. In particular, since the proposed method introduces a tremendous cost for model merging due to its heavy use of knowledge distillation, the experimental comparison is unfair to most previous merging methods except the KD baseline (which itself is not well studied in the model merging literature), and the practicality is also limited.

Review (Rating: 6)

This paper addresses the challenging problem of merging large transformers trained on different tasks from distinct initializations, where prior works typically rely on models that share a common pretrained initialization. The proposed method, FS-Merge, utilizes a feature reconstruction loss to merge the original models effectively.

Strengths

  • The method is data-efficient, requiring only an unlabeled subset of the training data for optimization, which is advantageous when full access to data is limited.
  • The paper presents comprehensive experimental results across various model architectures and data scenarios, demonstrating the scalability and effectiveness of FS-Merge in merging models of different scales and tasks.

Weaknesses

  • While the method is designed for models trained from scratch, it would be insightful to investigate its performance when applied to pretrained models that are fine-tuned on different sources. Specifically, it would be beneficial to explore potential challenges or advantages this application might present compared to merging models trained from scratch. This analysis could provide a broader understanding of the method's applicability and limitations.
  • The paper could benefit from a discussion on robustness to distribution shifts, similar to what is explored in the WiSE-FT paper, "Robust fine-tuning of zero-shot models".
  • It would be helpful to compare this method to Mixture of Experts (MoE) approaches, such as "Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM", to provide context on how FS-Merge differs or could complement these strategies. I encourage the authors to provide a comparison between FS-Merge and MoE approaches, with a focus on key differences such as computational efficiency, model size, and how FS-Merge could potentially complement MoE methods. This comparison would contextualize the distinct contributions of FS-Merge and highlight its unique strengths.

Questions

  • Why does distillation perform poorly in the last task (C, M, C100, E) as shown in Table 4, while it performs well in the earlier tasks? An analysis or hypothesis explaining this discrepancy would strengthen the paper's results section. Discussing potential contributing factors such as task similarity, dataset characteristics, or interactions between specific models being merged would add depth to the findings and clarify this discrepancy.
  • Given the use of the “first” initialization, how sensitive is the method to changes in the order of models? Would changing the order significantly affect the merged model’s performance, and if so, why? The authors should consider conducting an ablation study on the impact of model ordering during initialization. Reporting performance metrics for different orderings and discussing any observed patterns or practical implications would provide valuable insights into the method's robustness and guide its real-world application.
Comment

We appreciate the reviewer's insightful comments and suggestions. To address some of the reviewer's concerns, we have added new experiments to our manuscript (Appendix F.4), which are discussed in bullet 1 of this response.

  1. “to investigate its performance when applied to pretrained models that are fine-tuned on different sources”

In response to the reviewer's request, we evaluated FS-Merge in the more common scenario found in the literature, where models fine-tuned on different tasks originate from the same pre-trained initialization (Appendix F.4, Table 24). While FS-Merge outperforms most baselines in this setting, its margin over methods like distillation and RegMean is small, with RegMean achieving comparable performance using fewer resources. Thus, despite its success in this scenario, we conclude that FS-Merge's true strength lies in more challenging scenarios involving models with different initializations.

  2. “It would be helpful to compare this method to Mixture of Experts (MoE) approaches, such as 'Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM'”

In contrast to the MoE approach, which retains multiple MLP blocks in memory, model merging methods such as FS-Merge fuse these blocks, offering memory and computational advantages. We added “Branch-Train-MiX” and “Robust fine-tuning of zero-shot models” to our references.

  3. “Why does distillation perform poorly in the last task (C, M, C100, E) as shown in Table 4, while it performs well in the earlier tasks?”

We thank the reviewer for this important question. Discussion of which groups of tasks or models are easier to merge is indeed crucial, and we plan to investigate it in future work. Cars, MNIST, CIFAR100, and EuroSAT involve very different domains, but there appear to be additional complex factors influencing mergeability (such as the performance of the individual models and the difficulty of the tasks).

4. “Given the use of the ‘first’ initialization, how sensitive is the method to changes in the order of models?”

Based on our limited experiments on the subject, changing the order of tasks does affect the resulting accuracy of the merged model in both FS-Merge and KD. Selecting the order of tasks is a significant and challenging question, which remains an active topic in the fields of continual and curriculum learning [1, 2]. For example, [1] shows that it is sometimes better to start with the easiest tasks and sometimes better to begin with the hardest ones, and [2] shows that a complex interaction between task similarity and overparameterization can affect the final accuracy. We therefore believe this question is a great direction for future work.

[1] Saglietti et al., “An Analytical Theory of Curriculum Learning in Teacher-Student Networks”

[2] Goldfarb et al., “The Joint Effect of Task Similarity and Overparameterization on Catastrophic Forgetting -- An Analytical Model”

Comment

As the discussion period ends in less than two days, could the reviewer kindly let us know if there are any remaining concerns we should address? We have responded to the comments and questions raised by the reviewer and would greatly appreciate the opportunity to address any additional feedback.

Comment

Thank you for the author's response. It was interesting to learn that ordering can affect performance. Including an experiment to demonstrate this impact could strengthen the paper. However, my concern remains regarding the lack of consistency in the choice of datasets and ordering. For instance, in the newly added Table 24, the order and datasets are unique and not consistent with those used in previous experiments. This inconsistency makes it challenging to compare the method across different scenarios. After considering the other reviews, I have decided to maintain my original rating.

Comment

We appreciate the reviewer’s feedback and apologize for the misunderstanding. We did not realize there was a concern regarding using the same order of datasets. Instead, we interpreted the comment about ordering as a request to perform an ablation study on how the order of datasets affects the merging methods.

To understand the effect of ordering, we conducted the following experiment. We merged pairs of ViT-B-16 models, testing both possible dataset orders. Our results below show that:

(1) Ordering does affect the accuracy of the merged model.

(2) Overall, FS-Merge outperforms KD in both orders.

| Merge order | KD (Per-task / Joint) | FS-Merge (Per-task / Joint) |
|---|---|---|
| RESISC45, then MNIST | 91.49 / 90.24 | 94.30 / 92.75 |
| MNIST, then RESISC45 | 75.15 / 73.44 | 75.32 / 73.56 |
| EuroSAT, then CIFAR100 | 72.02 / 66.34 | 71.86 / 68.23 |
| CIFAR100, then EuroSAT | 84.65 / 83.37 | 87.30 / 86.08 |
| SVHN, then DTD | 63.25 / 62.44 | 64.12 / 62.03 |
| DTD, then SVHN | 70.06 / 65.62 | 72.23 / 67.54 |

Comment

We deeply appreciate the time and effort the reviewers have dedicated to reviewing our work, and thank them for the constructive feedback.

We are encouraged that the reviewers found our paper well-written and easy to follow (HgAx , eSD4) and recognized that it addresses a challenging and intriguing problem (2F4V, HgAx, bVWy). The reviewers also highlighted the data efficiency of the proposed method (2F4V, HgAx) and its ability to achieve SOTA results across various scenarios, architectures, model sizes, datasets, and modalities (2F4V, HgAx, bVWy).

Following the reviewers' responses to our rebuttal, several questions remain. We would like to address them here:

  1. Isn’t the comparison between traditional merging methods and FS-Merge or the KD baseline “unfair”, due to the latter methods requiring significantly more resources?

This comparison was meant only to demonstrate that traditional merging methods perform poorly in our setting — across all the tasks, modalities, and model sizes we evaluated. This motivated using KD and developing FS-Merge, to address this challenging scenario.

However, we can understand the sentiment that if we could somehow add more resources to the traditional local methods, this would make the comparison more “fair”. Interestingly, we already did something very similar: in Appendix H.1 we examined the (more resource-intensive) local version of FS-Merge and found it is ineffective in our setting. This suggests that traditional methods are ineffective because they rely on local objectives (which are inherently unsuitable for this challenging task) — not only because they use fewer resources. Therefore, we are not sure what we can do to make the comparison more fair.

  2. Is the KD method a strong baseline?

Yes, since it performs better than existing merging methods both in:

(a) The standard ‘same init’ setting (Table 24)

(b) Our ‘different init’ setting (most other tables in our paper)

Moreover, we tested various KD approaches for model merging (Appendix G.2), including different initializations, inner feature usage (Appendix H.3), and augmentations. The KD method that we identified as the strongest was used as a baseline (and it was still outperformed by FS-Merge).

Thus, we do not see a better baseline than the KD method we used. However, merging transformers from different initializations is a relatively new direction, and we hope future works will introduce additional baselines.

  3. Are the suggested methods too resource-intensive to be practical?

No, since there is no better option. Specifically, in our challenging setting:

(a) No other method works at all, other than KD and FS-Merge.

(b) FS-Merge (or FS-Merge seq.) has a computational footprint comparable to KD, yet achieves better accuracy. This should not come as a surprise: many times in the past, computationally cheap methods worked well in an ‘easy’ setting but became inadequate in more challenging cases, which led to more computationally intensive methods (e.g., neural networks replacing linear models).

AC Meta-Review

Summary: The paper introduces FS-Merge, a method for merging large transformers trained on different tasks with distinct initializations. It extends the concept of folding from previous work and uses feature reconstruction loss with unlabeled data for optimization. The method shows promise in data-limited scenarios and is effective across various model architectures and tasks.

Strengths:

FS-Merge addresses the challenging problem of merging models trained from scratch, offering a data-efficient solution that does not require labeled data for optimization.

The paper provides a comprehensive evaluation, demonstrating FS-Merge's effectiveness and scalability across different model sizes, tasks, and data scenarios.

Drawbacks:

The method's reliance on knowledge distillation makes it computationally expensive compared to other model merging techniques, which could limit its practical applicability.

The paper lacks a comparison to Mixture of Experts (MoE) approaches, which could offer insights into how FS-Merge differs or complements these strategies.

There is a concern about the robustness of FS-Merge to distribution shifts and its sensitivity to the order of models during initialization, which could affect the merged model's performance.

Based on the above points, I must reject this work due to concerns about its computational efficiency, lack of comparison to existing methods like MoE, and potential robustness issues.

Additional Comments from the Reviewer Discussion

The concerns are not well addressed. Most of the reviewers agree to reject this work.

Final Decision

Reject