PaperHub
7.8 / 10
Poster · 4 reviewers
Scores: 4, 3, 4, 5 (min 3, max 5, std dev 0.7)
ICML 2025

Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Keywords: evolutionary algorithm, model adaptation, swarm intelligence

Reviews and Discussion

Review 1 (Rating: 4)

This paper introduces MODEL SWARMS, a collaborative search algorithm for adapting large language models (LLMs) through principles of swarm intelligence, leveraging collective behaviors to guide individual systems. Inspired by Particle Swarm Optimization (PSO), MODEL SWARMS optimizes collaboration among diverse LLM experts by navigating their weight space towards maximizing a defined utility function. Extensive empirical evaluations demonstrate that MODEL SWARMS effectively adapts LLM expert ensembles across single-task scenarios, multi-task domains, reward modeling, and varied human preferences, consistently outperforming 12 baseline model composition methods.

Questions for Authors

--

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The methods and evaluation criteria make sense for the problem.

Theoretical Claims

Not applicable.

Experimental Design and Analysis

The experimental designs and analysis are detailed and promising.

Supplementary Material

--

Relation to Existing Literature

--

Missing Essential References

--

Other Strengths and Weaknesses

Strengths:

  1. The proposed method is innovative and conceptually intriguing, and the manuscript is clearly structured, well-written, and accessible to readers.

  2. The experimental evaluation is comprehensive, covering a wide range of settings and tasks, and the empirical results convincingly demonstrate the effectiveness and promise of the approach.

Other Comments or Suggestions

  1. It would be beneficial to include a convergence analysis of the proposed update mechanism to better understand the theoretical properties and stability of the optimization process.

  2. Could the authors clarify the computational cost associated with training the proposed method and provide a detailed comparison of training efficiency relative to the baseline approaches?

Author Response

We would like to thank the reviewer for their thoughtful comments and feedback.

It would be beneficial to include a convergence analysis of the proposed update mechanism to better understand the theoretical properties and stability of the optimization process.

We investigate convergence and stability empirically in several studies in the appendix.

In line 872 and Figure 6, we visualize the search trajectory of models. Starting as diverse model checkpoints, they gradually converge to an area in the weight space that optimizes utility function f.

In line 860 on the right, we identify the three randomness factors in Model Swarms: while the algorithm is not fully deterministic, Table 5 shows that these stochastic factors actually help exploration and performance.

In line 1028 on the right and Figure 15, we demonstrate the convergence of model performance across tasks and optimization iterations. The stability of the search is task-dependent, with some per-model performance plots smooth and some turbulent. Nevertheless, the global best performance steadily increases across tasks.

There are theoretical analyses of PSO convergence in evolutionary algorithm research, such as [1], but the assumptions made in classic optimization problems are not applicable to the LLM setting. For example, the initial particles in classic PSO are often initialized randomly or on a grid, while the initial LMs in the swarm cannot be random/arbitrary (taking a random point in the 7-billion-dimensional space would almost certainly yield a failed language model). Instead, they are seeded with curated model checkpoints trained on different training and fine-tuning data mixtures. Given these differences, we decided to take an empirical route to analyzing convergence and stability properties.

[1] Van den Bergh, Frans, and Andries P. Engelbrecht. "A study of particle swarm optimization particle trajectories." Information Sciences 176.8 (2006): 937-971.

Could the authors clarify the computational cost associated with training the proposed method and provide a detailed comparison of training efficiency relative to the baseline approaches?

The main computational cost comes from model inference: evaluating the LM checkpoints on the utility function, most simply performance on a small validation set. The trivial and static baselines in Table 1 are less expensive since they don't require this evaluation. Model Swarms is on par with dynamic baselines such as EvolMerge and Lorahub, which also evaluate models on the validation set at each iteration.

We refer the reviewer to line 980 in the appendix for a more detailed discussion of the computational cost. To recap, Model Swarms has complexity linear in the number of models and in the cost of one model inference, and each run takes about 10-15 iterations on average. Empirically, with 5 40GB GPUs a Model Swarms search runs in under an hour; with 2 GPUs it takes about 3 hours. We propose further acceleration with dropout-k and dropout-n in Figure 7.
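As a back-of-the-envelope illustration of this linear scaling (the swarm size below is an assumed placeholder, not a number from the paper; the iteration count matches the 10-15 range above and the 200-example validation set is the figure cited in the reviews):

```python
# Rough cost model for one Model Swarms search (illustrative numbers only).
n_experts = 10       # swarm size: an assumed placeholder, not from the paper
n_iterations = 15    # upper end of the ~10-15 iterations reported above
val_examples = 200   # small validation set serving as the utility function f

# Each iteration evaluates every expert once on the validation set, so the
# total number of inference calls is linear in both swarm size and iterations.
inference_calls = n_iterations * n_experts * val_examples
print(f"~{inference_calls:,} validation-set inferences per search")  # ~30,000
```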

Review 2 (Rating: 3)

This paper proposes a Particle Swarm Optimization (PSO) based collaborative search algorithm for large language models, where LLM weights are treated as particles and PSO is applied to search for the best-performing LLM on a target task. Experimental results show that the searched LLMs outperform the initial LLMs and other LLM composition methods on diverse tasks.
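For reference, classic PSO, which this approach adapts to the LLM weight space, updates each particle with a velocity and position rule of roughly this form (standard PSO notation; the paper's exact velocity terms may differ):

```latex
v_i^{(t+1)} = r_v\, v_i^{(t)} + c_1 r_1 \left(p_i - x_i^{(t)}\right) + c_2 r_2 \left(g - x_i^{(t)}\right),
\qquad x_i^{(t+1)} = x_i^{(t)} + \lambda\, v_i^{(t+1)}
```

where x_i is the weight vector of expert i, p_i its personal best, g the swarm's global best under the utility function f, and r_1, r_2 are random scalars drawn at each step.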

Update after rebuttal

Since my concerns have been addressed, I have changed my score to 3.

Questions for Authors

There are various advanced and adaptive PSO variants in the EC community. I wonder why the authors use a (relatively) simple PSO method; might an advanced PSO variant obtain better performance?

Claims and Evidence

The authors claim that the PSO-based LLM composition method can achieve superior performance with as few as 200 examples. Experimental results demonstrate that the searched LLMs outperform the initial LLMs and other LLM composition methods. However, this demonstration does not show the superiority of the searched LLM over state-of-the-art LLMs on these tasks.

Methods and Evaluation Criteria

The overall method utilizes PSO to collaboratively optimize a set of initial LLMs and search for the best LLM for the target task. It makes sense that the searched LLM can surpass the initial models, but the absolute performance is not evaluated. Besides, the proposed method does not provide an alignment method for LLMs of different sizes, which may limit further applications of Model Swarms.

Theoretical Claims

There are no theoretical claims in this paper.

Experimental Design and Analysis

In the experiments, the authors compare Model Swarms with the initial models and other composition methods. However, the initial models are fine-tuned on data from Tulu-v2, which is significantly different from the tested tasks; this may lead to biased evaluation. Besides, the state-of-the-art LLMs for these tasks and GEMMA-7B models fine-tuned directly on them are not included, even though fine-tuning GEMMA-7B for 5*10 epochs (the budget used for training the initial models) on the target tasks might obtain better performance.

Supplementary Material

Appendix A discusses the key strengths of MODEL SWARMS and its relationship with related research fields. Appendix B presents analyses of ablations, hyperparameter settings, complexity, modularity, etc. Appendix C presents experimental details.

Relation to Existing Literature

  1. Large Language Models: The method is proposed for LLM adaptation.

  2. Swarm Intelligence: PSO is employed in the proposed method to search for LLMs from a set of initial models.

  3. Model Composition: The proposed method adapts LLMs to new tasks by composing existing LLMs.

Missing Essential References

The discussion of related methods is sound; I did not see any essential references left undiscussed.

Other Strengths and Weaknesses

Weakness: The initial expert LLMs for different fields usually have different model architectures and sizes. Since the proposed method can only be applied to a set of LLMs with the same architecture and size, its contribution might be limited.

Other Comments or Suggestions

None.

Author Response

We would like to thank the reviewer for their thoughtful comments and feedback.

Besides, the proposed method does not provide an alignment method for LLMs of different sizes, which may limit further applications of Model Swarms.

The initial expert LLMs for different fields usually have different model architectures and sizes. Since the proposed method can only be applied to a set of LLMs with the same architecture and size, its contribution might be limited.

We propose token swarms on line 372, an extended version that composes models with different sizes/architectures by conducting PSO in the token probability space.

Instead of defining the search space as model parameters, in token swarms the PSO search determines how token probability distributions from heterogeneous models should be aggregated. We show that this works for a swarm of 4 Gemma and 4 Mistral models in Figure 5, with performance improvement regardless of the architecture.
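For intuition, here is a minimal sketch of this aggregation step (illustrative only; the names and the normalization scheme are placeholder choices, not the actual implementation). It assumes the experts' vocabularies have already been reconciled into a shared one (see [1] below for how tokenizer differences can be handled); the mixture weights, not the model parameters, are what the PSO search optimizes:

```python
import numpy as np

def aggregate_next_token_probs(per_model_probs, weights):
    """Mix next-token distributions from heterogeneous experts.

    per_model_probs: list of (vocab_size,) arrays over a shared vocabulary.
    weights: (n_experts,) mixture weights; these form the PSO search space.
    """
    w = np.clip(np.asarray(weights, dtype=float), 0.0, None)
    w = w / w.sum()                    # project weights onto the simplex
    mixed = np.zeros_like(per_model_probs[0])
    for wi, probs in zip(w, per_model_probs):
        mixed += wi * probs            # weighted mixture of distributions
    return mixed / mixed.sum()         # renormalize against numerical drift
```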

The token probability variant takes inspiration from [1], which provides great context on aggregation in the token probability space, reconciling tokenizer differences, etc.

[1] Liu, Alisa, et al. "Tuning Language Models by Proxy." First Conference on Language Modeling.

However, this demonstration does not show the superiority of the searched LLM over state-of-the-art LLMs on these tasks.

It makes sense that the searched LLM can surpass the initial models, but the absolute performance is not evaluated.

Besides, the state-of-the-art LLMs for these tasks and GEMMA-7B models fine-tuned directly on them are not included, even though fine-tuning GEMMA-7B for 5*10 epochs (the budget used for training the initial models) on the target tasks might obtain better performance.

The reviewer asks for two comparisons: Model Swarms against fine-tuning Gemma-7B on the target task, and Model Swarms against “state-of-the-art LLMs”.

The former is a fair comparison. In line 1065 and Table 9, we compare Model Swarms against direct fine-tuning on the task data of the utility function and see consistent improvement. The main issue here is generalization: when the available data is small, direct fine-tuning tends to memorize, with very large validation-test gaps, while Model Swarms leads to better skill learning and composition. We fine-tuned for 5 epochs in Table 9 and empirically find that fine-tuning for 5*10 epochs exacerbates memorization and harms generalization.

We are unsure if the latter is a fair ask. We don't think taking 7B models and applying adaptation would outperform "state-of-the-art" models that are much larger and much more extensively trained. We don't see the goal of this work as establishing a new state of the art on the evaluated tasks, but rather as demonstrating that the adaptation and composition of diverse LMs can be achieved through collaborative search and swarm intelligence. Thus we believe our selection of 12 model composition baselines is adequate for this purpose.

However, the initial models are fine-tuned on data from Tulu-v2, which is significantly different from the tested tasks; this may lead to biased evaluation.

This difference between SFT data and evaluation tasks is actually a good thing: it means the initial swarm was not overfitted to the evaluation tasks and that Model Swarms searches achieve genuine on-the-fly adaptation/generalization to these tasks.

There are various advanced and adaptive PSO variants in the EC community. I wonder why the authors use a (relatively) simple PSO method; might an advanced PSO variant obtain better performance?

We agree that there is a broad literature on evolutionary algorithms (EAs), on PSO and beyond, much of which could be adapted for today's LLM research. Model Swarms takes one of the first stabs at this interdisciplinary research direction, so we decided to keep the EA component simple and straightforward, hoping to get the ball rolling on future evolutionary designs for LLMs.

Review 3 (Rating: 4)
  • This paper introduces Model Swarms, a search algorithm to adapt LLM capabilities
  • The model is based on particle swarm optimization (PSO)
  • Model Swarms (MS) = multiple LLM experts collaboratively search for new "adapted models"
  • The purpose of these newly adapted models is to reach capabilities beyond those of the current models
  • MS is evaluated on four LLM adaptation objectives, achieving superior performance on each

Questions for Authors

Will the code be released publicly upon publication?

Claims and Evidence

  • Empirical—MS outperforms baselines in all four LLM adaptation objectives — this claim is supported by the empirical results and analysis in Sections 4 and 5
  • Other claims from Section 5 (e.g. "diversity matters") are well-supported by the results and discussion

Methods and Evaluation Criteria

  • The MS method is a natural, though novel and interesting, solution to the problem of producing new adapted models
  • The baselines chosen are reasonable and seem to cover comparable training schemes in prior works
  • The four adaptation objectives appear to be well-chosen, and offer a robust set of challenges to test the capabilities of MS relative to baselines. The domains themselves also appear to be natural choices.

Theoretical Claims

N/A

Experimental Design and Analysis

From Section 3 and Appendix C, the experimental setup appears to have been carefully designed. All relevant training details (beyond the code itself) are provided; results seem reproducible from details in the paper.

Supplementary Material

Yes, I took a brief pass through all the supplementary material.

Relation to Existing Literature

  • MS is related to other works attempting to compose LLMs to produce more capable models
  • Being an evolutionary approach, it is also related to other evolutionary algorithms applied to the LLM space
  • MS appears to be a novel construction in both of these areas, from the authors' contextualization in the related work section

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths:

  • Paper is well-written, well-structured, and clear
  • The method makes sense
  • The analysis is precise and insightful

Weaknesses:

  • Nothing major

Other Comments or Suggestions

N/A

Author Response

We would like to thank the reviewer for their thoughtful comments and feedback.

Will the code be released publicly upon publication?

Yes, along with the best-found model checkpoints for all tasks.

Review 4 (Rating: 5)

The authors propose utilizing a population of base language models (with the same architecture but different weight initialization) and then study how to finetune them for downstream utilization. Their method consists of treating each model as a particle in the weight space, assigning random "exploration" velocities to each, along with an exploitation velocity corresponding to moving towards the current best particle. Subsequently, a weight update moves each particle along its own trajectory after which this process is repeated. The authors test their method on a variety of downstream utilization paths and find improvements in comparison to several recently published baselines.
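A minimal sketch of one such update step in Python, assuming flattened weight vectors and classic PSO velocity terms (hyperparameter names and values are placeholders, not the authors' released implementation; the paper's variant may use additional velocity terms):

```python
import numpy as np

def swarm_step(x, v, p_best, g_best, utility,
               rv=0.9, c1=0.5, c2=0.5, lam=0.1, rng=None):
    """One PSO-style update over a swarm of flattened expert weight vectors.

    x, v: (n_experts, n_params) particle positions (weights) and velocities
    p_best: per-particle best weights so far; g_best: global best weights
    utility: callable mapping a weight vector to a scalar score f(x_i)
    """
    rng = rng or np.random.default_rng()
    for i in range(len(x)):
        r1, r2 = rng.random(), rng.random()      # fresh randomness each step
        v[i] = (rv * v[i]
                + c1 * r1 * (p_best[i] - x[i])   # pull toward personal best
                + c2 * r2 * (g_best - x[i]))     # pull toward global best
        x[i] += lam * v[i]                       # move the particle
        if utility(x[i]) > utility(p_best[i]):   # track each particle's best
            p_best[i] = x[i].copy()
    i_best = max(range(len(x)), key=lambda j: utility(p_best[j]))
    return x, v, p_best, p_best[i_best].copy()   # updated global best
```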

Questions for Authors

How does the computational cost of ModelSwarms compare with baselines?

Claims and Evidence

The claims made are well supported by the authors; swarm optimization seems to be a general enough paradigm to combine expert LLMs in weight space.

Methods and Evaluation Criteria

The authors study their method's performance in single/multitask, reward modeling and human preference settings and demonstrate improvements across the board. They perform a thorough analysis of the dynamics of training and study the trajectories of eventually successful particles.

Theoretical Claims

N/A

Experimental Design and Analysis

The experiments are clean and easy to understand, with informative baselines selected. The analysis of trajectories and the ablations on agent diversity are informative.

Supplementary Material

I read through the supplementary material briefly. The baseline descriptions and the training dynamics were very informative.

Relation to Existing Literature

The paper is very relevant to the AI community at large. This form of gradient-free optimization is useful for post-training alignment, and I believe this paper is an impressive demonstration of a well-known idea at scale.

Missing Essential References

N/A

Other Strengths and Weaknesses

The thoroughness of experiments and general insights from this paper are very useful. I appreciate the number of baselines studied. It was a joy to review this paper.

Other Comments or Suggestions

N/A

Author Response

We would like to thank the reviewer for their thoughtful comments and feedback.

How does the computational cost of ModelSwarms compare with baselines?

The main computational cost comes from model inference: evaluating the LM checkpoints on the utility function, most simply performance on a small validation set. The trivial and static baselines in Table 1 are less expensive since they don't require this evaluation. Model Swarms is on par with dynamic baselines such as EvolMerge and Lorahub, which also evaluate models on the validation set at each iteration.

We refer the reviewer to line 980 in the appendix for a more detailed discussion of the computational cost. To recap, Model Swarms has complexity linear in the number of models and in the cost of one model inference, and each run takes about 10-15 iterations on average. Empirically, with 5 40GB GPUs a Model Swarms search runs in under an hour; with 2 GPUs it takes about 3 hours. We propose further acceleration with dropout-k and dropout-n in Figure 7.

Final Decision

While some questions remain around theoretical analysis and comparisons to state-of-the-art models, the authors have made a convincing case for their primary goal: demonstrating the potential for adaptation and composition through this novel approach.