PaperHub

Pretrained Hybrids with MAD Skills

ICLR 2025 | Rejected | 3 reviewers
Overall rating: 4.7/10 (individual ratings: 6, 3, 5; min 3, max 6, std dev 1.2)
Confidence: 3.3 | Correctness: 2.7 | Contribution: 2.7 | Presentation: 3.0

OpenReview | PDF
Submitted: 2024-09-24 | Updated: 2025-02-05
TL;DR

We develop a framework for creating pretrained hybrid models from existing pretrained models.

Abstract

Keywords
hybrid architectures, large language models, transformers, state space models, model merging, neural architecture search, mechanistic search

Reviews and Discussion

Review
Rating: 6

The paper proposes a new framework for automating the design of hybrid language models by reusing existing pretrained models, drawing on ideas from Neural Architecture Search. The authors show that their approach allows merged language models to be competitive with their component models and even outperform them on fine-tuning tasks.

Strengths

  1. The paper is well-written and the motivation is clear and convincing.
  2. The figures are well-designed and helpful.
  3. Interesting results that demonstrate the effectiveness of the framework on specific models.

Weaknesses

  1. The experiments are mostly on smaller models, and it is not clear whether MAD remains effective with larger models. The authors could run additional experiments with other open-source models to validate this.

Questions

  1. Have you done experiments to validate whether the performance continues to hold at scale?
  2. What is the overhead of the projectors?
Comment

Thank you for noting the strengths of our paper, including the writing, motivation, and our current set of results!

On scaling to larger models. This is an excellent point – we are actively working to obtain more compute so that we can evaluate Manticore at sizes of 1B+ parameters. Our current set of evaluations includes two compute scales: combining GPT-Neo-125M with Mamba-130M to produce a ~255M parameter model, and Pythia-410M with Mamba-370M to produce a ~780M parameter model. Once we obtain additional compute resources, we plan to scale up Manticore to ~5.5-6B parameters using Mamba2-2.7B with a comparably sized transformer, such as Pythia-2.8B, as our component models. Please let us know if there are specific concerns that you have with respect to scaling; we would be happy to discuss this further!

On the overhead of projectors. The projectors introduce minimal overhead. We have added an inference cost analysis in Appendix D.3 showing that the FLOPs incurred due to projector usage are dominated by the inference FLOPs from the forward passes through the component models.
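For intuition, here is a back-of-envelope sketch of why projector FLOPs tend to be small relative to the component models' forward passes. The model sizes match the smaller configuration mentioned above (GPT-Neo-125M + Mamba-130M), but the projector dimensions and count are hypothetical placeholders rather than the configuration analyzed in Appendix D.3, and the standard "forward FLOPs per token ≈ 2 × parameters" approximation is assumed.

```python
# Back-of-envelope illustration (hypothetical numbers): why projector FLOPs are
# dominated by the component models' forward passes. Uses the standard
# approximation that a dense forward pass costs roughly 2 FLOPs per parameter
# per token. The projector dimensions and count below are illustrative guesses,
# not the paper's reported configuration.
params_components = 125e6 + 130e6        # GPT-Neo-125M + Mamba-130M
flops_components_per_token = 2 * params_components

d_model, d_shared, num_projectors = 768, 768, 24   # hypothetical projector setup
flops_projectors_per_token = 2 * num_projectors * d_model * d_shared

share = flops_projectors_per_token / (flops_components_per_token + flops_projectors_per_token)
print(f"component FLOPs/token: {flops_components_per_token:.2e}")
print(f"projector FLOPs/token: {flops_projectors_per_token:.2e}")
print(f"projector share of total: {share:.1%}")   # only a few percent in this setup
```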

Comment

Thanks again for the feedback in your original review. We believe we have addressed your concerns in our rebuttal, and we would love to engage with you further if you have any remaining concerns before the end of the rebuttal period. Please note that we have also included a learned ensemble baseline in response to reviewer Dze5.

Review
Rating: 3

The paper introduces Manticore, a framework for creating hybrid architectures by combining pretrained components from different language models (LMs). Manticore aims to automate hybrid model design by reusing pretrained model components from distinct architectures, such as Transformers and state-space models, using projectors to align feature representations between these models. The framework's goal is to merge model architectures in a way that preserves their respective strengths, ideally achieving better performance than the individual components. Through experiments on the Long Range Arena (LRA) and MAD tasks, Manticore demonstrates comparable or improved performance over some individual models and existing hybrid models.
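To make the mechanism in this summary concrete, here is a minimal PyTorch-style sketch of a projector-plus-mixture block: each component block's features are translated through linear projectors into a shared space and then combined with softmax-normalized mixture weights. All class, argument, and parameter names are hypothetical illustrations, not the paper's actual code, and the real framework may differ in how and where components are mixed.

```python
# Minimal sketch of a projector-plus-mixture block (names are hypothetical):
# outputs of two pretrained component blocks (e.g., a transformer block and an
# SSM block) are mapped through linear projectors into a shared feature space
# and combined with learned mixture weights.
import torch
import torch.nn as nn


class ProjectedMixtureBlock(nn.Module):
    def __init__(self, block_a: nn.Module, block_b: nn.Module,
                 dim_a: int, dim_b: int, dim_shared: int):
        super().__init__()
        self.block_a, self.block_b = block_a, block_b
        # Projectors translate between each component's feature space and a shared space.
        self.proj_in_a = nn.Linear(dim_shared, dim_a)
        self.proj_in_b = nn.Linear(dim_shared, dim_b)
        self.proj_out_a = nn.Linear(dim_a, dim_shared)
        self.proj_out_b = nn.Linear(dim_b, dim_shared)
        # One scalar mixture weight per component, normalized with a softmax.
        self.alpha = nn.Parameter(torch.zeros(2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_shared)
        out_a = self.proj_out_a(self.block_a(self.proj_in_a(x)))
        out_b = self.proj_out_b(self.block_b(self.proj_in_b(x)))
        w = torch.softmax(self.alpha, dim=0)
        return w[0] * out_a + w[1] * out_b
```

The sketch shows only the core projector-plus-gating pattern for a single pair of blocks; it is not a substitute for the framework's full construction.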

Strengths

  • Manticore’s approach to combining pretrained models from different architectures using projectors and mixture weights is innovative and extends beyond typical model merging methods.
  • Manticore’s design, which allows for fine-tuning and programming pretrained hybrids, offers a degree of flexibility, making it potentially beneficial for practitioners looking to leverage diverse model architectures.
  • Testing across LRA and MAD tasks provides an initial sense of the framework's potential, although the evaluation depth limits the conclusions drawn.

Weaknesses

  • The main claim, "Pretrained hybrids can outperform their component models on fine-tuning tasks," is not well-supported. A fair comparison would entail fine-tuning Manticore and its component models under the same budget to evaluate relative gains. Without this, it’s unclear whether the hybrid approach provides substantial benefits beyond those of individually optimized models.
  • Manticore requires a dedicated training process for projector layers and mixture weights, potentially adding overhead and limiting applicability in constrained environments.
  • The absence of publicly available code restricts reproducibility and verification of the results, limiting the community's ability to assess the framework’s impact fully.

Questions

  1. How would Manticore perform relative to its component models if they were fine-tuned under the same computational budget? Would this comparison validate Manticore’s primary claim of outperforming its component models?
  2. Could the authors elaborate on the memory and computational efficiency of Manticore hybrids, especially compared to traditional single-model architectures? Does the use of projectors introduce significant overhead?
Comment

Thank you for noting the strengths of our paper, including the innovation and flexibility of our approach!

On the evaluation depth. Please note that our evaluation is not only on LRA and the MAD tasks – these only constitute a small fraction of our experimental results. We also evaluate on a variety of natural language tasks, including Alpaca, Penn Treebank, ELI5, NI Spanish QA + Alpaca, NI Chinese QA + Alpaca, MLQA Vietnamese QA + Alpaca, NI all non-English QA, OpenOrcha, XQuAD Arabic, and a Penn Treebank synthetic that we designed. We emphasize this point as these constitute quite a substantial portion of our experimental results.

On fine-tuning Manticore and its component models with the same budget. This is an interesting question, which we think highlights some of the nuances of why Manticore is useful. In Appendix D, we have added a thorough discussion of the tradeoffs and challenges associated with selecting baselines under different definitions of “budget.” Please see the general response for a summary and Appendix D for details.

We feel that your comments have helped us to craft these much needed clarifications that we have added in Appendix D, so thank you!

On projector overhead and applicability to compute-constrained environments. Please see our new FLOPs analyses that cover training FLOPs (Appendix D.2) and inference FLOPs (Appendix D.3); these should clarify the overhead associated with our training process.

Beyond this, what constrained environments are you referring to specifically?

On publicly available code. Our code is available here: https://anonymous.4open.science/r/manticore-anon. We have also added this link to the main text and to the Appendix of the paper.

Comment

Thanks again for the feedback in your original review. We believe we have addressed your concerns in our rebuttal, and we would love to engage with you further if you have any remaining concerns before the end of the rebuttal period. Please note that we have also included a learned ensemble baseline in response to reviewer Dze5.

Review
Rating: 5

Manticore is a framework for automating the creation of hybrid language model architectures by reusing pretrained models from different architectures. By leveraging NAS and projectors, Manticore enables pretrained hybrid models that combine the strengths of multiple architectures, allowing for flexible and high-performing LLMs without extensive manual design or retraining.

Strengths

This paper focuses on an important and relevant problem of using architectural components from different state-of-the-art model architectures to construct a hybrid model that provides the best of all worlds without incurring expensive pre-training and search space exploration overheads.

It introduces a novel idea of projectors that enable different architectures to interact in each other's feature spaces by projecting onto an intermediate shared feature space that acts as a translator between them.

Weaknesses

Although the idea of projectors is novel, using gating to combine the contributions of different architectures has been explored in Mixture of Experts [1] and weighted ensemble averaging, and it finds direct use in this paper.

The evaluation compares the combined hybrid, which has 2x the number of parameters and >2x the FLOPs due to projectors and gating, against individual models of half the size.

In Table 1, Mamba is already better than Pythia on all of the tasks; in Table 2, Mambaformer is also better on all tasks but one; and in Table 4, GPT-Neo is better on all tasks. So creating a hybrid achieves almost the same score as the better component model.

[1] https://arxiv.org/abs/1701.06538

Questions

A fair comparison would either be a model of the same architecture with 2x the size, or the hybrid being discretized so that it has the same number of FLOPs/parameters as one of the individual models. This would address weakness 2.

A stronger test would be to consider models that perform better on some tasks than others and then validate whether the hybrid does achieve the best of both worlds. This would address weakness 3.

Comment

Thank you for pointing out the importance and relevance of our problem, as well as the novelty of our approach!

On novelty with respect to MOE. It is true that MOE also uses gating, but that is where the analogy between Manticore and MOE ends. Gating has been used as a basic primitive in deep learning since at least 1997, when the LSTM was introduced. Furthermore, MOE serves a different purpose: as far as we are aware, MOE combines experts that share the same architecture and does not involve combining pretrained components. We believe that the gating used in Manticore is much more closely related to differentiable NAS (both in terms of motivation and implementation), which we discuss at length in our paper.
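For readers unfamiliar with the connection, the sketch below shows the generic first-order DARTS-style alternation that this response alludes to: softmax-gated mixture (architecture) weights are updated by gradient descent on one data split while the ordinary model weights are updated on another. This is a standard illustration of differentiable NAS, not the authors' training loop, and the parameter-name filter is an assumption for the sketch.

```python
# Generic first-order DARTS-style alternation (illustrative only). Assumes the
# hybrid model exposes its mixture weights as parameters whose names contain
# "alpha", as in the earlier sketch; this naming is an assumption, not the
# paper's API.
import torch

def make_optimizers(model, lr_w=1e-4, lr_alpha=1e-2):
    alpha_params  = [p for n, p in model.named_parameters() if "alpha" in n]
    weight_params = [p for n, p in model.named_parameters() if "alpha" not in n]
    return (torch.optim.AdamW(weight_params, lr=lr_w),
            torch.optim.Adam(alpha_params, lr=lr_alpha))

def darts_style_step(model, loss_fn, train_batch, val_batch, w_opt, alpha_opt):
    # 1) Ordinary model weights take a gradient step on the training split.
    w_opt.zero_grad(); alpha_opt.zero_grad()
    loss_fn(model(train_batch["x"]), train_batch["y"]).backward()
    w_opt.step()
    # 2) Mixture (architecture) weights take a gradient step on the validation
    #    split (the first-order approximation used in DARTS).
    w_opt.zero_grad(); alpha_opt.zero_grad()
    loss_fn(model(val_batch["x"]), val_batch["y"]).backward()
    alpha_opt.step()
```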

On the fairness of our baselines. Your point regarding the comparison of parameter count and FLOPs between our component model baselines and Manticore is valid. To address this, we have added a thorough discussion of the tradeoffs and challenges associated with selecting baselines under different definitions of “budget”; this can be found in Appendix D. Please see the general response for a summary and Appendix D for details.

On weakness 3. We indeed have results of this form, please refer to Figure 6 in the main text, and our new Table 9 in Appendix D. These results are exciting precisely for the reason that you mentioned: Pythia performs better on some of the tasks that we evaluate, while Mamba performs better on others. When we combine datasets on which Pythia and Mamba specialize, we find that the hybrid outperforms both of them, thus achieving the best of both worlds.

Comment

Thank you for the response.

The main issue, even after this response, remains that the baseline comparison should be more fair with respect to how one would combine existing models. That would not be a simple equally weighted ensemble but rather a gated/weighted approach that is learned by the fine-tuning process.

Even in the new Table 9, Mamba is better than Pythia on most tasks, and on the two tasks where Pythia is better, it is only negligibly better.

The new discussion on FLOPs being the right metric to make two techniques comparable does make sense, but this is true only for training.

Comment

Thank you for the reply. We have responded to your concerns below.

On the fairness of our baselines (continued). We have also run this baseline, and we include the results for the learned ensemble below in an updated version of Table 9 (a schematic sketch of the baseline follows the table). For this, we tune the ensemble weights using the same DARTS search procedure used in Manticore, and we find that Manticore outperforms the learned ensemble baseline on all but one task.

Task         | Pythia-410M (A) | Mamba-370M (B) | Ensemble [A, B] | Learned Ensemble [A, B] | Manticore [A, B]
Es. + Alpaca | 1.819           | 1.704          | 2.172           | 1.688                   | 1.664
Ch. + Alpaca | 3.729           | 3.447          | 3.854           | 3.756                   | 3.369
Vi. + Alpaca | 2.130           | 2.004          | 2.173           | 2.041                   | 1.980
NI non-En    | 1.764           | 1.560          | 1.652           | 1.379                   | 1.530
OpenOrcha    | 1.570           | 1.576          | 1.756           | 1.630                   | 1.553
XQuAD Ar.    | 0.205           | 0.207          | 0.533           | 0.442                   | 0.201
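For clarity, here is a hypothetical sketch of what a learned ensemble baseline of this kind can look like: the component LMs are frozen and only a pair of softmax-normalized weights over their output distributions is trained. This is one plausible formulation, not necessarily the exact baseline implemented above; the class name and the assumption that each component returns next-token logits are illustrative.

```python
# Hypothetical learned-ensemble baseline: frozen component LMs, trainable mixture
# weights over their predictive distributions. Assumes each component maps token
# ids to next-token logits; this interface is an assumption for the sketch.
import torch
import torch.nn as nn


class LearnedEnsembleLM(nn.Module):
    def __init__(self, lm_a: nn.Module, lm_b: nn.Module):
        super().__init__()
        self.lm_a, self.lm_b = lm_a, lm_b
        for p in list(self.lm_a.parameters()) + list(self.lm_b.parameters()):
            p.requires_grad_(False)             # components stay frozen
        self.alpha = nn.Parameter(torch.zeros(2))  # the only trained parameters

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.alpha, dim=0)
        log_p_a = torch.log_softmax(self.lm_a(input_ids), dim=-1)
        log_p_b = torch.log_softmax(self.lm_b(input_ids), dim=-1)
        # log of the weighted mixture of the two predictive distributions
        return torch.logsumexp(
            torch.stack([w[0].log() + log_p_a, w[1].log() + log_p_b]), dim=0
        )
```

The contrast with the hybrid is that the ensemble mixes only the final predictive distributions, whereas the hybrid mixes projected intermediate features inside the network.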

On weakness 3 (continued). What we mean is that the tasks in Table 9 comprise combinations of different datasets, on which Pythia or Mamba might perform better, and on the combined dataset, we see that Manticore outperforms both Pythia and Mamba. We believe this is consistent with your suggestion "A stronger test would be to consider models that perform better in some tasks than others and then validate if the hybrid does achieve the best of both worlds. This would address the weakness 3."

For example, in Table 9 and Figure 6, we evaluated on a dataset comprised of 50% NI Chinese QA and 50% Alpaca instructions. The Pythia model outperforms Mamba on the NI Chinese QA split (3.567 vs 3.764) whereas Pythia is worse than Mamba on the Alpaca split (3.504 vs 3.447). Then on the combined dataset (as shown in Table 9 and Figure 6), Manticore outperforms both Pythia and Mamba. We will include this information in the final version of the paper.

On the FLOPs discussion. We are not sure what the reviewer means by "this is true only for training" and would love to clarify. Our analysis covers training, inference cost, as well as parameter count. While we believe we are in agreement that FLOPs is an appropriate metric to measure inference cost, if there are other potential metrics we are happy to discuss them as well!

Comment

Thank you all for the helpful comments, insights, and questions! Your responses included many excellent questions that have allowed us to highlight the advantages of Manticore.

Since submission, we have added a variety of new content to the paper. This includes new baselines, insightful analyses, clarifying discussions, and an anonymous link to our code repository. We have highlighted these changes to the text in red.

Most importantly, in Appendix D, we have added a thorough discussion of the tradeoffs and challenges associated with selecting baselines under different definitions of a “budget.” To summarize, we cover three different definitions:

Parameter count

  • Comparison along the axis of parameter count is challenging because larger models are typically pretrained for longer and pretrained models of specific sizes aren’t guaranteed to always exist. For example, if we were to combine Mamba-2.8B and GPT-Neo-2.7B, we would not be able to obtain 5.5B parameter Mamba and GPT-Neo/Pythia baselines for comparison to the Manticore hybrid.

Training FLOPs

  • The fact that we compare Manticore to the better of two component models means that both component models need to be fine-tuned in order to compare Manticore to the best component model; this is the baseline that we compare against in the main text. We conduct a training FLOPs analysis and show that the total training FLOPs contribution is dominated by pretraining, and we note that the (already relatively small) projector pretraining cost is amortized over potentially many fine-tuning runs (see the illustrative arithmetic after this list).

Inference FLOPs

  • It is true that our component model baselines naturally incur only about half of the inference FLOPs required by Manticore. We show that the inference FLOPs contribution of the projectors is minimal, and our analysis motivates a new baseline which we have added: ensembles of component models. Manticore outperforms this baseline while roughly matching the inference FLOPs and total training FLOPs.
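As a rough numeric illustration of the training-FLOPs point above, the snippet below uses the common "training FLOPs ≈ 6 × parameters × tokens" rule of thumb. All token counts are hypothetical placeholders chosen only to show orders of magnitude; they are not figures from the paper or from the component models' actual training runs, and every stage is charged at the full combined parameter count for simplicity.

```python
# Illustrative budget arithmetic (hypothetical token counts, not the paper's numbers).
def train_flops(params, tokens):
    return 6 * params * tokens   # standard rule-of-thumb approximation

params = 255e6                              # e.g., GPT-Neo-125M + Mamba-130M combined
pretraining = train_flops(params, 300e9)    # hypothetical pretraining tokens
projectors  = train_flops(params, 1e9)      # hypothetical projector-pretraining tokens
finetuning  = train_flops(params, 0.1e9)    # hypothetical fine-tuning tokens

total = pretraining + projectors + finetuning
print(f"pretraining share of total training FLOPs: {pretraining / total:.1%}")
```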

We are excited about these new additions to the paper, and we would love to dive into more detail on these new results should you have any questions about them.

AC Meta-Review

This paper introduces a novel framework for building hybrid language models from pretrained components. While reviewers praised the idea and clarity, they raised concerns about the fairness of comparisons, scalability, and demonstration of true hybrid benefits over simpler baselines. The authors provided additional analyses and new baselines, but concerns remained. Although the approach is promising, the paper currently falls short of acceptance. Future work should include more rigorous baseline comparisons, larger-scale experiments, and stronger evidence that hybrids outperform simpler ensembles. Unfortunately, at this time, rejection is recommended.

Additional Comments on Reviewer Discussion

Initial reviews raised concerns about model scaling, computational budget comparisons, projector overhead, and hybrid model performance benefits. Authors addressed these by adding FLOPs analyses in Appendix D, releasing code, demonstrating complementary model strengths on different tasks, and including a learned ensemble baseline. Only one reviewer followed up, maintaining concerns about baseline comparisons while accepting the FLOPs analysis. The area chair noted the paper was borderline and requested reviewer responses, though most did not re-engage during rebuttal. Key updates focused on computational budget analysis and new baselines, with mixed reviewer satisfaction about demonstrated benefits.

Final Decision

Reject