Balancing Act: Diversity and Consistency in Large Language Model Ensembles
Abstract
Reviews and Discussion
This paper proposes a unified framework to examine the trade-off between diversity and consistency and its effect on the final performance of model ensembles. The authors propose a dynamic mixture-of-agents approach to optimize this balance for task-specific capabilities, ultimately enhancing overall performance.
Strengths
- The motivation of balancing the model diversity and output consistency in model ensemble is compelling and well-founded.
- The unified dynamic mixture-of-agents framework effectively addresses the trade-off between diversity and consistency across various tasks.
Weaknesses
- The proposed dynamic mixture-of-agents framework necessitates the individual search and optimization of MoA structures for each distinct task. The process relies on divergence filtering and mixture optimization, which is costly and requires additional task-specific datasets for evaluation.
- The application of EigenScore, originally proposed for hallucination detection in a single model, presents inherent limitations when extended to an ensemble of multiple models. This is primarily due to the fact that the sentence-embedding spaces of various models are not aligned during their pre-training or fine-tuning phases. Consequently, these embeddings do not inhabit the same representational space, which poses challenges for direct comparison and aggregation across different models.
- The design of the mixture optimization leads to a scenario where the final MoA model is absolutely dominated by a single model, as shown in Fig. 3 (left), since the search process for each run is driven by a greedy algorithm that replaces the model with the lowest delta with the one with the highest delta.
Questions
- In Section 4.4, Table 2 presents a comparison between DMoA/Sonnet and the Claude-3.5-Sonnet baseline. While DMoA/Sonnet demonstrates a marginal performance improvement (91.85 vs. 90.20 normalized accuracy on BBH), it is important to consider the associated computational costs. DMoA/Sonnet necessitates multiple inferences across diverse models and subsequent aggregations using Claude-3.5-Sonnet. This process incurs significantly higher expenses compared to the baseline due to the additional model inferences and the substantially longer input required for aggregation. Moreover, the efficacy of the MoA approach is heavily contingent upon the final aggregation model employed. When Claude-3.5-Sonnet is not utilized as the aggregation model in the DMoA approach, a substantial performance degradation is observed (90.20 vs. 83.63 normalized accuracy on BBH).
- What would happen if DMoA were tested on the seven benchmarks (AlpacaEval 2.0, MT-Bench, GSM8K, MATH, CSQA, ARC-C, ARC-E) in accordance with the experimental setups in Sections 4.1 and 4.2?
- Based on the experimental findings presented in Section 4.3, several key conclusions can be drawn regarding the impact of diversity and consistency on various cognitive abilities. Firstly, high levels of diversity appear to have a detrimental effect across all measured abilities. Secondly, strong consistency enhances reasoning and mathematical capabilities but impairs instruction-following proficiency. Lastly, when strong consistency is coupled with an appropriate degree of supplemental diversity, there is an observed improvement in instruction-following abilities, though this comes at the cost of diminished mathematical and reasoning skills. Compared to the discussion in the current version, the above summary appears to more accurately reflect the core idea of this paper: balancing diversity and consistency for model ensembles across various tasks.
We thank the reviewer for their questions and insights. We are glad they found the motivation for our work to be compelling and well-founded, and that they feel our DMoA effectively addresses the trade-off between diversity and consistency across various tasks. We have made a number of amendments to our manuscript based on the feedback, and respond to the queries below.
The application of EigenScore, originally proposed for hallucination detection in a single model, presents inherent limitations when extended to an ensemble of multiple models. This is primarily due to the fact that the sentence-embedding spaces of various models are not aligned during their pre-training or fine-tuning [...]
Thank you for this point. We completely agree that the sentence-embedding spaces of different models are not necessarily aligned. To address this, we project each model’s output into the same embedding space using the text-embedding-3-small model from OpenAI. We clarify this information in the additional experimental setup section in Appendix B.3: “Language and embedding models”. We have additionally clarified in the main manuscript that our analyses utilised a shared embedding space.
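For intuition, below is a minimal sketch of how a consistency score can be computed once all outputs are projected into one shared embedding space. It assumes the regularized log-eigenvalue form of the original EigenScore work and uses random vectors in place of real text-embedding-3-small embeddings; it is illustrative only, not the exact implementation used in the paper.

```python
import numpy as np

def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """Consistency score for K outputs embedded in one shared space.

    embeddings: (K, d) matrix, one row per model output, all produced by the
    same external embedding model so that rows are directly comparable.
    Lower values indicate more semantically consistent outputs.
    """
    K = embeddings.shape[0]
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # The K x K Gram matrix shares its non-zero spectrum with the covariance,
    # which keeps the computation cheap when the embedding dimension is large.
    gram = centered @ centered.T / K
    eigvals = np.linalg.eigvalsh(gram + alpha * np.eye(K))
    return float(np.mean(np.log(eigvals)))

# Toy usage: random vectors stand in for real text-embedding-3-small outputs.
rng = np.random.default_rng(0)
consistent = np.tile(rng.normal(size=(1, 256)), (6, 1)) + 0.01 * rng.normal(size=(6, 256))
diverse = rng.normal(size=(6, 256))
print(eigenscore(consistent), eigenscore(diverse))  # the consistent set scores lower
```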
The design of mixture optimization leads to a scenario where the final MoA model is absolutely dominated by a single model [...]
The figure illustrates an example whereby one model, which was more robust at a particular type of mathematical reasoning, dominated the mixture. Whilst this phenomenon was observed in this particular instance, it was not always the case. As you correctly describe, the algorithm replaces the model with the lowest delta with the model with the highest delta; however, in the next step of the algorithm, if performance degrades, this step can be reversed. Indeed, in other tasks (particularly AlpacaEval 2.0 and MT-Bench) the ‘optimal’ set of LLMs was heterogeneous. We describe the algorithm and additional stopping criteria in more detail in Appendix C.
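For concreteness, a minimal sketch of the greedy swap-with-rollback loop described above. The helpers `model_delta` and `evaluate_mixture` and the fixed step budget are hypothetical placeholders; the actual algorithm and stopping criteria are given in Appendix C.

```python
from typing import Callable, List

def optimize_mixture(
    initial_models: List[str],
    model_delta: Callable[[str, List[str]], float],   # estimated contribution of a model to the mixture
    evaluate_mixture: Callable[[List[str]], float],   # validation-set score of a candidate mixture
    max_steps: int = 10,
) -> List[str]:
    """Greedy swap of lowest-delta for highest-delta model, with rollback."""
    mixture = list(initial_models)
    best_score = evaluate_mixture(mixture)
    for _ in range(max_steps):
        deltas = [model_delta(m, mixture) for m in mixture]
        worst = min(range(len(mixture)), key=deltas.__getitem__)
        best = max(range(len(mixture)), key=deltas.__getitem__)
        if worst == best:
            break
        proposal = list(mixture)
        proposal[worst] = mixture[best]        # swap in another copy of the highest-delta model
        score = evaluate_mixture(proposal)
        if score > best_score:                 # keep the swap only if validation performance improves
            mixture, best_score = proposal, score
        else:                                  # otherwise the step is reversed and the search stops
            break
    return mixture
```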
The proposed dynamic mixture of agent framework necessitates the individual search and optimization of MoA structures for each distinct task. The process relies on divergence filtering and mixture optimization, which is costly [...] In Section 4.4, Table 2 presents a comparison between DMoA/Sonnet and the Claude-3.5-Sonnet baseline. While DMoA/Sonnet demonstrates a marginal performance improvement (91.85 vs. 90.20 normalized accuracy on BBH), it is important to consider the associated computational costs [...]
Thank you for this important point. In light of this, we have now added a cost analysis section based on our experimental results in Section 4.4. We reproduce the main figure here for convenience (anonymous fig. 1). The cost analysis can be found in Appendix I. We find that the dynamic mixture of agents framework sits on the Pareto-optimal front between performance and operating cost. Additionally, the DMoA achieves performance on the Big Bench Hard (BBH) benchmark similar to that of gpt-4o-2024-05-13, but with significantly lower input/output token costs. The baseline DMoA (which does not utilize Claude Sonnet as the aggregator) achieves 92.7% of the performance of Claude 3.5 Sonnet, and indeed we now also have a new test-time inference analysis in Appendix J which shows that, by scaling the number of layers, a fully open-source DMoA can achieve 99.05% of Claude 3.5 Sonnet’s performance on BBH.
What would happen if DMoA were tested on the seven benchmarks (AlpacaEval 2.0, MT-Bench, GSM8K, MATH, CSQA, ARC-C, ARC-E) in accordance with the experimental setups in Sections 4.1 and 4.2?
This is a nice suggestion. We have now expanded Appendix E (adding E.4: “Additional results”), where we demonstrate the performance of the DMoA on the instruction-following, arithmetic reasoning, and common-sense reasoning benchmarks used in the gated mixture-of-agents and mixture-optimization experiments. We reproduce the table here for convenience. In summary, the DMoA outperforms other models and ensembling strategies across the majority of the benchmarks.
| Model | AlpacaEval | MT-Bench | GSM8K | MATH | CSQA | ARC-C | ARC-E |
|---|---|---|---|---|---|---|---|
| DMoA | 63.21 | 9.19 | 96.67 | 71.23 | 87.51 | 92.50 | 94.47 |
| GMoA | 58.66 | 8.97 | 94.23 | 56.35 | 85.20 | 92.32 | 93.75 |
| MoA | 59.50 | 9.19 | 93.87 | 55.22 | 84.32 | 91.85 | 94.31 |
| Llama-3-70B | 34.4 | 8.8 | 93.0 | 50.4 | 83.8 | 90.5 | 94.1 |
| Qwen-1.5-110B | 43.9 | 8.9 | 85.4 | 49.6 | 82.1 | 69.6 | 93.9 |
| Qwen-1.5-72B | 36.6 | 8.4 | 79.5 | 34.1 | 83.2 | 65.9 | 92.7 |
| WizardLM-8x22B | 51.3 | 8.8 | 81.6 | 22.7 | 69.0 | 62.5 | 90.1 |
| Mixtral 8x22B | 30.9 | 8.8 | 83.7 | 41.7 | 81.7 | 70.7 | 91.8 |
| DBRX-Instruct | 25.4 | 8.4 | 72.8 | 32.5 | 82.2 | 68.9 | 89.7 |
| GPT-4 Omni (05/13) | 57.5 | 9.19 | 94.1 | 61.2 | 88.6 | 94.6 | 94.3 |
Based on the experimental findings presented in Section 4.3, several key conclusions can be drawn regarding the impact of diversity and consistency on various cognitive abilities. Firstly, high levels of diversity appear to have a detrimental effect across all measured abilities. Secondly, strong consistency enhances reasoning and mathematical capabilities but impairs instruction-following proficiency. Lastly, when strong consistency is coupled with an appropriate degree of supplemental diversity, there is an observed improvement in instruction-following abilities, though this comes at the cost of diminished mathematical and reasoning skills. Compared to the discussion in the current version, the above summary appears to more accurately reflect the core idea of this paper: balancing diversity and consistency for model ensembles across various tasks.
We are deeply appreciative of this insightful comment, and have adjusted our discussion section to more clearly reflect this refinement of points. Thank you once more.
Thanks for the explanations and the additional experiments! They indeed address many of my concerns.
The authors in this work look at Mixture of Agents (MoA) and propose:
- A framework that captures different variations of possible MoAs
- A divergence metric called EigenDivergence, based on the hallucination detected in K sampled outputs, with the additional proposition of using an external embedding instead of the model's internal embedding
- An optimization algorithm based on incremental performance gains and usage
- DMoAs that dynamically select the models
Results of this work are as follows: 1) Gated MoA is compared against standard MoA and other openly available models, with GMoA providing only marginal improvements in settings where some models perform noticeably better, while underperforming in settings where all models perform similarly; 2) mixtures do not transfer the same way across various tasks; and 3) DMoA performs better on BBH.
Strengths
- Methodology, presentation, and results are good for the first 3 / 4 proposed contributions - I quite liked the idea of EigenDivergence and the analysis around it
- The takeaways on diversity and consistency are clear and well presented
Weaknesses
- Sections 3.3 and 4.4 were both unclear and unnecessary, in my opinion - I couldn't quite understand how these sections tie to the main point of this paper. The presentation and motivation around this subset of contributions require a significant re-write
- No critical focus on looking at the intermediate reasoning - while I get that semantic diversity was a focus, I would have liked to see a deeper look at how diverse the semantic reasoning looked, with a few examples - were there instances where correct reasoning by multiple models was still judged to be semantically diverse, etc. I liked Appendix H - it was a good start, but a detailed, controlled experiment could've provided readers with much more about EigenDivergence
Nitpicks that can be easily fixed and do not affect the review/score:
- Typos in some parts - line 182 for example
Questions
- Can the details of 3.3 and 4.4 be clarified? (see weaknesses for comments)
- Can Appendix F.2 be expanded to understand the behaviour of multiple sentence embeddings? The 0.78 doesn't make much sense as an individual number - it is unclear if it is a consequence of the choice of embedding models / how much variance can exist, etc.
- Is there a reason why closed-source models weren't used, since the EigenDivergence doesn't require access to the weights?
Is there a reason why closed-source models weren't used, since the EigenDivergence doesn't require access to the weights?
We chose to use freely accessible open-source models for reproducibility and accessibility purposes. Closed models can change in capabilities over time without formal announcement [1]. On top of that, models are often simply deprecated, making replication harder still [2]. Furthermore, closed-source models can be significantly costlier to run [3,4]. We therefore chose to run freely available and locally hostable open-source models to aid the reproducibility of our frameworks as well as to ensure model accessibility.
References:
- Chen, Lingjiao, Matei Zaharia, and James Zou. "How is ChatGPT's behavior changing over time?." arXiv preprint arXiv:2307.09009 (2023).
- OpenAI. "Deprecations." OpenAI Platform, OpenAI, https://platform.openai.com/docs/deprecations. [Accessed: 18/11/2024]
- Model pricing, Together AI. https://www.together.ai/pricing [Accessed: 18/11/2024]
- OpenAI pricing, https://openai.com/api/pricing/ [Accessed: 18/11/2024]
Thank you for your clear responses. I think the readability and additional details have improved the paper.
We thank the reviewer for their questions and insights. We are glad they found our methodology, presentation, and results to be good for most of our contributions. We are also deeply appreciative that they found our takeaways on diversity and consistency to be clear and well presented. We have made a number of amendments to our manuscript based on the feedback, and respond to the queries below.
No critical focus on looking at the intermediate reasoning - while I get that semantic diversity was a focus, I would have liked to see a deeper look at how diverse the semantic reasoning looked, with a few examples - were there instances where correct reasoning by multiple models was still judged to be semantically diverse, etc. I liked Appendix H - it was a good start, but a detailed, controlled experiment could've provided readers with much more about EigenDivergence
This is a nice suggestion. In order to provide more insight into the links between EigenDivergence, semantic reasoning traces, and correctness, we performed an additional experiment in Appendix H. We reproduce the main figure of this analysis here for convenience (anonymous fig. 1). In summary, we subsample 400 questions across GSM8K and ARC-C and assess the relationship between individual answer accuracy and semantic consistency in the layer as measured by the EigenDivergence (ED). We find that the ED scores of individually correct answers are more negative on average (i.e., they increase semantic consistency in the mixture) than those of individually incorrect answers, which are more positive (i.e., they decrease semantic consistency if included in the mixture). Additionally, in both datasets, we find that correct answers in layers containing at least one incorrect answer are more negative on average. For instance, in GSM8K, correct answers had a mean ED score of -0.269 compared to -0.065 for incorrect answers, with similar patterns observed in ARC-C. This suggests that filtering based on ED scores is more likely to remove divergent outputs that are statistically more likely to be incorrect in close-ended tasks like arithmetic and common-sense reasoning.
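As an illustration of this kind of analysis, the sketch below computes a leave-one-out, per-output ED contribution and compares the means for correct versus incorrect outputs. The leave-one-out decomposition and the toy data are assumptions for illustration, not necessarily the exact procedure used in Appendix H.

```python
import numpy as np

def layer_eigenscore(E: np.ndarray, alpha: float = 1e-3) -> float:
    # Same regularized log-eigenvalue consistency score as in the earlier sketch.
    C = E - E.mean(axis=0, keepdims=True)
    gram = C @ C.T / len(E)
    return float(np.mean(np.log(np.linalg.eigvalsh(gram + alpha * np.eye(len(E))))))

def per_output_ed(E: np.ndarray) -> np.ndarray:
    """Leave-one-out ED: how much each output shifts the layer's EigenScore.

    Negative values mean the output pulls the layer towards semantic consistency.
    """
    full = layer_eigenscore(E)
    return np.array([full - layer_eigenscore(np.delete(E, i, axis=0)) for i in range(len(E))])

# Toy comparison of mean ED for correct vs. incorrect outputs (labels are illustrative).
rng = np.random.default_rng(1)
E = rng.normal(size=(6, 128))                      # six proposer outputs, embedded in a shared space
correct = np.array([True, True, True, False, False, True])
ed = per_output_ed(E)
print(ed[correct].mean(), ed[~correct].mean())
```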
Can the details of 3.3 and 4.4 be clarified? [...]
Thank you for your feedback. The motivation behind the DMoA approach was to create a dynamic, task-specific ensembling framework informed by the results of the preceding experiments (GMoA and mixture optimization) and the ablation studies which examined the trade-offs between task performance and ensemble diversity/consistency. We investigated this trade-off at the semantic level (GMoA) and the mixture-composition level (mixture optimization). From these experiments we accrued three main insights: 1) when models agree, this tends to improve performance; 2) the utility of their agreement depends on whether the models in the ensemble have sufficient expertise to answer the current query correctly; 3) different tasks require different skills, which appear to exist in a trade-off with one another. This motivated us to develop a ‘dynamic’ inference-time strategy (Section 3.3), which operationalises these insights. Namely, we identify a set of skills required to solve the current task, and then estimate which models might perform well given these skills. We construct an ensemble of these models at inference time, and aggregate their outputs before synthesising them into a final high-quality solution. Indeed, we show that this outperforms ‘static’ ensembling (Section 4.4) and can achieve leading results on the challenging BBH benchmark. Further to your feedback, we have made the following amendments to the manuscript: we have clarified the motivation for DMoA in Section 3.3 by more explicitly connecting the experiment to the prior experiments; we have clarified that the DMoA effectively mitigates some of the trade-off behaviours identified in the previous experiments; and we have better motivated the experiment in a number of areas of the methods section.
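To make the inference-time flow concrete, here is a minimal sketch of the select-then-aggregate pipeline described above. The helper names (`skill_profile`, `generate`, `aggregate`) and the fixed ensemble size are hypothetical placeholders rather than the paper's actual interfaces.

```python
from typing import Callable, Dict, List

def dmoa_answer(
    query: str,
    skill_profile: Callable[[str], List[str]],        # skills the query is estimated to require
    model_skill_scores: Dict[str, Dict[str, float]],  # per-model estimated strength on each skill
    generate: Callable[[str, str], str],              # generate(model_name, prompt) -> output text
    aggregate: Callable[[str, List[str]], str],       # aggregator synthesising proposer outputs
    aggregator: str = "aggregator-model",
    ensemble_size: int = 4,
) -> str:
    """Select task-relevant models at inference time, then aggregate and synthesise."""
    skills = skill_profile(query)

    def fit(model: str) -> float:
        # Estimated suitability of a model for the current query's skill profile.
        return sum(model_skill_scores[model].get(s, 0.0) for s in skills)

    ensemble = sorted(model_skill_scores, key=fit, reverse=True)[:ensemble_size]
    proposals = [generate(m, query) for m in ensemble]
    return aggregate(aggregator, proposals)
```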
Can Appendix F.2 be expanded to understand the behaviour of multiple sentence embeddings? The 0.78 doesn't make much sense as an individual number - it is unclear if it is a consequence of the choice of embedding models / how much variance can exist, etc.
We have now expanded the analysis in Appendix F.2 to include two additional embedding models, namely OpenAI’s text-embedding-3-large and text-embedding-ada-002. In all cases the correlation coefficient remains between 0.739 and 0.781 (3 s.f.), demonstrating minimal variance across different embedding models.
Typos in some parts - line 182 for example
Thank you very much indeed. We have corrected this.
The paper proposes a new model ensemble strategy that builds on top of Mixture of Agents, called Dynamic Mixture of Agents (DMoA). First, the authors attempt to unify model ensemble methods with a formal definition. The overall hypothesis for all the experiments is that different tasks require different ensemble diversity/consistency. With that, they first run an experiment where they propose divergence metrics, EigenScore and EigenDivergence, that measure how semantically consistent/inconsistent LLM outputs are relative to the overall semantics. The effect of this Gated MoA is nuanced. For the next experiment, they try to optimize the MoA structure by proposing a model delta that correlates with task performance. They found that there are trade-offs between tasks: if they optimize for one task, performance decreases on another. Lastly, they tried DMoA, which chooses mixtures based on certain criteria. They showed that DMoA outperforms MoA on BBH.
Strengths
- The authors design and execute a pretty comprehensive analysis and extensions to the model ensembling method Mixture-of-Agents. The analysis and conclusions are useful because they give intuitions on what may or may not work in mixture-of-models frameworks. They first investigate whether reducing semantic inconsistency would help (it didn't; I wonder if the reverse would help, though -- increasing the diversity). Then a mixture optimization method is proposed, and they found some useful insights, e.g., aggregate-and-synthesise works better than ranking and self-consistency. They then tested DMoA, which dynamically uses models with different expertise for the mixture.
- It is quite important to dive deeper into what makes model collaboration work, as model ensemble methods are a promising approach to inference-time scaling.
- The tasks are pretty comprehensive.
Weaknesses
- The purpose of the first experiment is a bit questionable. The result from divergence filtering is nuanced, as it only slightly improves four out of five reasoning tasks and decreases performance on three tasks. I understand you use this experiment's results as insights to build DMoA in Section 4.4, but I question whether those insights are actually correlated.
- For insight 1 in section 4.4, divergence filtering is used as supporting evidence: "As shown in Sec. 4.1, LLM ensembles outperform individual models regardless of divergence filtering, across both open- and close-ended tasks." If the conclusion holds "regardless of divergence filtering", then it doesn't really add much value to the argument.
- For insight 2, it says "We found in Sec. 4.1 that removing information from an ensemble can improve task-specific performance." But it improves four and hurts three.
- For insight 3, it says "Performance varies when semantic diversity is altered within a fixed ensemble (Sec. 4.1)." This claim is not that clear and doesn't provide much insight since performance can vary if you change anything about the mixture. It would be more useful if more patterns could be discovered.
- Similarly, I am concerned about the claim in section 4.3: "two GMoA variants...one with the two most semantically divergent outputs removed (maximizing consistency), and one with the two most consistent outputs removed (maximizing diversity)...indicating that some semantic consistency is necessary for high-quality results, even in open-ended instruction-following queries." I don't think removing the two most consistent outputs would maximize diversity. You could very well be removing two very similar outputs that are very distinct from all other outputs. This also has the unwanted effect of removing an output that is more likely correct (since more models produce it).
- This leads to insight 1 in section 4.4 that "Maximizing semantic diversity harms performance (Sec. 4.3), indicating that some cross-validation between outputs is necessary for high-quality results, even in open-ended instruction-following tasks." This can be problematic since the premise is wrong.
- The structure of the paper gives a strong vibe that the first two experiments are very disconnected from the third, as neither divergence filtering (GMoA) nor the mixture optimization methods are used in the third. And the insights they provide are also very limited, which I elaborated on above. I would encourage maybe downplaying the portion of the first two experiments and focusing more on the third.
Questions
- I wonder about the performance of DMoA on the seven tasks used in the first two experiments. This should be a natural progression. It feels weird that the first two experiments use the same set of seven tasks while the third uses BBH.
This leads to insight 1 in section 4.4 that "Maximizing semantic diversity harms performance (Sec. 4.3), indicating that some cross-validation between outputs is necessary for high-quality results, even in open-ended instruction-following tasks." This can be problematic since the premise is wrong.
This is an interesting point of discussion. We completely agree that edge cases can exist. There could, for instance, exist two outputs which are semantically very similar to one another, but very different from all other outputs (as you suggest) – in this case removing them would not necessarily maximise the overall semantic diversity of the outputs. However, in most other cases removing more semantically similar information will, on average, leave behind outputs which are more semantically distinct from one another. On the point of removing outputs that are more semantically similar, we note that this does not necessarily relate to the correctness of the answers. For instance, we found that dbrx-instruct and mixtral-8x22B-instruct-v0.1 produce very semantically similar results. Despite their similarity, they can both be wrong (and indeed, subtle errors can propagate differently throughout their reasoning traces despite a very similar overall structure to their chains). We believe this to be an important limitation of our EigenDivergence score, which we hope to mitigate in future work. We discuss this in more depth (and provide an illustrative example) in Appendix H of the paper. This leads to an inherent difficulty which we try to resolve with the DMoA framework. To achieve good performance, you should get models to ‘agree’ with one another. But models can agree whilst both are incorrect. Thus you need two main steps: 1) Select (or estimate) which models are likely to be highly performant given a specific task and create an ensemble of them; 2) Query the ensemble, and see if these models agree. Indeed, this is what our proposed DMoA approach attempts to do.
The structure of the paper gives a strong vibe that the first two experiments are very disconnected from the third [...]
Thank you for this thoughtful suggestion. To address your feedback, we have clarified the connections between the first two experiments and the DMoA in the introductory paragraph of the methods section and have additionally added a transitional sentence to the DMoA section itself. We hope that this helps to emphasise that the first two experiments develop methodologies that allow us to accrue insights that directly inform and support the development of the DMoA.
I wonder about the performance of DMoA on the seven tasks used in the first two experiments [...]
This is a nice suggestion. We have now expanded Appendix E (adding E.4: “Additional results”), where we demonstrate the performance of the DMoA on the instruction-following, arithmetic reasoning, and common-sense reasoning benchmarks used in the gated mixture-of-agents and mixture-optimization experiments. We reproduce the table below for convenience. In summary, the DMoA outperforms other models and ensembling strategies across the majority of the benchmarks.
| Model | AlpacaEval | MT-Bench | GSM8K | MATH | CSQA | ARC-C | ARC-E |
|---|---|---|---|---|---|---|---|
| DMoA | 63.21 | 9.19 | 96.67 | 71.23 | 87.51 | 92.50 | 94.47 |
| GMoA | 58.66 | 8.97 | 94.23 | 56.35 | 85.20 | 92.32 | 93.75 |
| MoA | 59.50 | 9.19 | 93.87 | 55.22 | 84.32 | 91.85 | 94.31 |
| Llama-3-70B | 34.4 | 8.8 | 93.0 | 50.4 | 83.8 | 90.5 | 94.1 |
| Qwen-1.5-110B | 43.9 | 8.9 | 85.4 | 49.6 | 82.1 | 69.6 | 93.9 |
| Qwen-1.5-72B | 36.6 | 8.4 | 79.5 | 34.1 | 83.2 | 65.9 | 92.7 |
| WizardLM-8x22B | 51.3 | 8.8 | 81.6 | 22.7 | 69.0 | 62.5 | 90.1 |
| Mixtral 8x22B | 30.9 | 8.8 | 83.7 | 41.7 | 81.7 | 70.7 | 91.8 |
| DBRX-Instruct | 25.4 | 8.4 | 72.8 | 32.5 | 82.2 | 68.9 | 89.7 |
| GPT-4 Omni (05/13) | 57.5 | 9.19 | 94.1 | 61.2 | 88.6 | 94.6 | 94.3 |
References:
- Xu, Ziwei, Sanjay Jain, and Mohan Kankanhalli. "Hallucination is inevitable: An innate limitation of large language models." arXiv preprint arXiv:2401.11817 (2024).
- Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in language models." arXiv preprint arXiv:2203.11171 (2022).
We thank the reviewer for their questions and insights. We are glad they found our analysis comprehensive, our conclusions to be useful, and our task set to be comprehensive also. We have made a number of amendments to our manuscript based on the feedback, and respond to the queries below.
For insight 1 in section 4.4, divergence filtering is used as supporting evidence: "As shown in Sec. 4.1, LLM ensembles outperform individual models regardless of divergence filtering, across both open- and close-ended tasks." If the conclusion holds "regardless of divergence filtering", then it doesn't really add much value to the argument.
Thank you for this comment - we merely intended to highlight that given our results in Table 1, we can see that any form of ensembling appears to correlate with higher benchmark performance relative to the performance of individual models, which aligns well with prior literature on the topic [1,2]. We use this as a basis for justifying the use of an ensembling strategy for our proposed DMoA.
For insight 2, it says "We found in Sec. 4.1 that removing information from an ensemble can improve task-specific performance." But it improves four and hurts three.
Well noticed - here, we state that task-dependent LLM expertise is crucial for boosting task performance. As you noted, removing information can improve performance in four tasks. These tasks evaluated arithmetic and common-sense reasoning, which (in this instance) all had a ‘single best’ correct answer and are thus close-ended tasks (e.g., a single correct mathematical answer in the MATH benchmark, or a fixed correct multiple-choice index for CSQA). When we remove information for these benchmarks, we note a slight improvement in performance. For open-ended tasks without a ‘fixed’ correct answer, particularly for instruction following, the opposite effect is noted - that is, removing information appears to harm performance. Since removing information for close-ended tasks seems to improve performance on average, excluding models with semantically divergent answers can help in these settings. This implies that merely increasing the number of reasoning traces, without considering whether the models are suited to the task, might actually lower performance. We have additionally added a new section to Appendix H which investigates the relationship between semantic divergence and performance in close-ended tasks, and the results further support our main experiments. In aggregate, these results contravene prior suggestions that simply adding more reasoning traces (even from multiple heterogeneous models) always improves performance up to some plateau [2]. It would therefore be better to stack LLMs based on their task-dependent expertise. We have amended the manuscript to better clarify the instances when removing information can lead to improved performance.
For insight 3, it says "Performance varies when semantic diversity is altered within a fixed ensemble (Sec. 4.1)." This claim is not that clear [...]
We agree with this suggestion – and think that our mixture optimization experiment provides a much more natural illustration of the insight that task-dependent skills demonstrate a trade-off with one another. We have amended the main manuscript accordingly.
The paper introduces a unified approach to explore the balance between diversity and consistency in LLM ensembles. It addresses the challenge of finding optimal ensembling methods, where traditional strategies like self-consistency decoding (focusing on output consistency) and Mixture of Agents (MoA, focusing on model diversity) each have limitations. The authors propose the Dynamic Mixture of Agents (DMoA), a novel inference-time ensembling strategy that integrates insights on the interplay between diversity and consistency.
- The paper establishes a framework for examining trade-offs between task performance, diversity, and consistency within LLM ensembles.
- Introducing a consistency score enables selective filtering within model outputs, enhancing ensemble consistency.
- This method refines mixtures of agents by considering semantic and model-level adjustments, optimizing task performance.
- DMoA dynamically selects models based on task-specific needs, achieving state-of-the-art results on the Big Bench Hard benchmark, demonstrating effective balance across tasks.
Strengths
- This paper addresses an intriguing and nuanced problem—how to balance consistency and diversity within LLM ensembles. This issue is fundamental, as different tasks often require prioritizing one over the other.
- The paper is well-structured and logically rigorous. The authors provide clear, precise definitions of both DMoA and GMoA frameworks, enabling readers to fully understand the distinctions and specific innovations in each method.
- The authors show that DMoA achieves state-of-the-art results on the Big Bench Hard (BBH) benchmark, indicating the framework's efficacy across various challenging tasks.
Weaknesses
While the paper demonstrates the effectiveness of GMoA and DMoA in achieving high performance, it does not analyze the computational costs associated with these methods. Cost considerations are essential in ensemble approaches, particularly with LLMs, where scaling and inference-time model selection can be computationally intensive. For a comprehensive comparison, the authors should evaluate the total cost associated with GMoA and DMoA, including resource utilization during inference and model selection. This cost analysis would offer a more balanced view of the trade-offs between performance gains and computational expense, especially compared to other ensemble strategies and Chain-of-Thought (CoT) approaches.
Questions
- I am curious if the proposed methods, particularly DMoA and GMoA, align with test-time scaling laws. Specifically, does performance consistently improve as the number of models in the ensemble or the length of inference chains increase?
- In the experimental setup, the paper mentions constructing a "MoA-Lite" variant with a limited number of layers. What would happen if additional layers were added to MoA?
- In what specific scenarios are GMoA and DMoA each most effective? Can these methods be combined within a single framework to further improve performance?
We thank the reviewer for their questions and insights. We are glad they found our topic to be nuanced and fundamental, our paper to be well-structured and logically rigorous, and our framework to be efficacious. We have made a number of amendments to our manuscript based on the feedback, and respond to the queries below.
While the paper demonstrates the effectiveness of GMoA and DMoA in achieving high performance, it does not analyze the computational costs associated with these methods [...]
This is a nice suggestion. We have now added a cost analysis section, which can be found in Appendix I, based on our experiment in Section 4.4. We reproduce the main figure here (anonymous fig. 1) for convenience. In summary, costs were calculated using pricing data from API providers' websites. Individual models generally offered low-cost options with worse performance, while ensembles provided enhanced performance at higher costs. Frontier models outperformed open-source models but incurred significant cost premiums. The Dynamic Mixture of Agents (DMoA) approach achieved a well-balanced position that sits on the Pareto-optimal front, offering performance similar to gpt-4o-2024-05-13 at a significantly lower inference cost. The Pareto front progression from DMoA to Claude-3.5-Sonnet to DMoA/Sonnet surpassed gpt-4o-2024-05-13. DMoA/Sonnet achieved the highest normalised accuracy but is the most expensive, whereas the fully open-source DMoA offered a balanced trade-off, delivering high performance at moderate cost.
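For reference, a small sketch of how the Pareto-optimal front over (cost, accuracy) points can be identified. The numbers below are placeholders for illustration only; the real measurements are in Appendix I.

```python
from typing import Dict, List, Tuple

def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return systems not dominated by any other system (lower cost and higher accuracy)."""
    front = []
    for name, (cost, acc) in points.items():
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for other, (c, a) in points.items() if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: points[n][0])

# Placeholder values (cost per query in arbitrary units, normalised accuracy) -- not the paper's data.
systems = {
    "open-source baseline": (0.5, 70.0),
    "DMoA": (2.0, 84.0),
    "Claude-3.5-Sonnet": (4.0, 90.0),
    "DMoA/Sonnet": (9.0, 92.0),
    "gpt-4o": (6.0, 85.0),
}
print(pareto_front(systems))  # the front runs from the cheapest non-dominated system upwards
```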
I am curious if the proposed methods, particularly DMoA and GMoA, align with test-time scaling laws. Specifically, does performance consistently improve as the number of models in the ensemble or the length of inference chains increase? In the experimental setup, the paper mentions constructing a "MoA-Lite" variant with a limited number of layers. What would happen if additional layers were added to MoA?
Thank you for these suggestions. To investigate scaling the number of models per layer and the number of layers, we ran an experiment based on our results from Section 4.4. We have added this new section as Appendix J. We reproduce the main figures here for convenience: (anonymous fig. 2), (anonymous fig. 3). In summary, we find that adding more models per layer consistently improves performance on the BBH benchmark. We investigate up to 10 models per layer, and in this setting we achieve a normalised accuracy of 87.36%, which represents 96.85% of Claude 3.5 Sonnet’s performance with only open-source models. With regards to adding more layers, we find that four layers achieve an even greater normalised accuracy of 89.34%, which is 99.05% of Claude 3.5 Sonnet’s performance for this task.
In what specific scenarios are GMoA and DMoA each most effective? [...]
We see the GMoA as a preliminary investigation of how semantic similarity between outputs affects the performance of LLM ensembles across a number of disparate benchmarks. This investigation yields a number of insights which allow us to develop the more flexible DMoA framework. Nevertheless, combining the idea of performing inference-time selection of high-performing models with semantic filtering is an interesting avenue for future work. We should point out, however, that in our third ablation study (Section 4.3, Figure 4, right plot) we note that filtering already specialised ensembles can degrade performance.
Thank you very much for your thorough and patient response. The newly added experiments are highly interesting, and I am delighted to see that the DMoA method demonstrates significant advantages in both efficiency and test-time scaling laws. I look forward to seeing a more detailed analysis of these aspects in the final version of your paper.
We are deeply appreciative of all of the reviewers for their time and detailed reviews. We appreciate their recognition of the effectiveness of our inference-time ensembling framework (bHjk, 7TjC), of the nuanced and fundamental importance of the research topic (bHjk), of our well-structured, logically rigorous, and comprehensive work (EPHj, bHjk, jGzM), and of the clarity of our takeaways (EPHj). In response to their feedback, we have made several amendments to the manuscript, which we believe has improved its quality. We summarise these here:
| Category | Amendment |
|---|---|
| Cost Analysis | Added a detailed cost analysis section in Appendix I; showed DMoA sits on the Pareto-optimal front, balancing performance and cost. |
| Scaling Experiments | Expanded Appendix J with experiments on scaling models and layers; demonstrated that increasing models and layers improves BBH benchmark performance. |
| Methodology Links | Strengthened connections between GMoA, mixture optimization, and DMoA in the manuscript; explained how earlier experiments informed the design of DMoA. |
| DMoA Benchmarks | Added results for DMoA on seven benchmarks in Appendix E.4; found DMoA outperforms other models and ensembling strategies across a majority of benchmarks. |
| EigenDivergence | Conducted a new analysis in Appendix H on the relationship between EigenDivergence and correctness; showed that semantic consistency improves close-ended task performance. |
| Semantic Diversity | Expanded discussion on semantic diversity and filtering in Appendix H; addressed limitations and edge cases of EigenDivergence. |
| Discussion Section | Incorporated feedback on balancing diversity and consistency; revised to reflect trade-offs and task-specific performance impacts. |
| Embedding Analysis | Expanded Appendix F.2 to analyze multiple embedding models; demonstrated minimal variance in results across embeddings. |
| Additional Clarification | Clarified the motivation for DMoA over static ensembles in the main manuscript. |
We hope these amendments and our additional analyses address the reviewers’ feedback and thank them once more for their valuable insights and time.
This paper presents a unified framework for examining trade-offs between diversity and consistency in LLM ensembles through the Dynamic Mixture of Agents (DMoA) approach. The key scientific contribution is demonstrating that different reasoning tasks require distinct balancing of model diversity and output consistency. The work's major strengths include: comprehensive analysis of ensemble methods, clear empirical validation achieving state-of-the-art results on BBH benchmark, and practical insights into task-dependent trade-offs. While there were initial concerns about computational costs and methodology connections, the authors thoroughly addressed these through additional experiments and analyses during rebuttal, particularly around cost-performance trade-offs and scaling behavior. The paper makes a meaningful contribution to understanding and optimizing LLM ensemble methods.
Additional Comments on Reviewer Discussion
The discussion during rebuttal focused on three main areas: computational costs, methodology connections, and empirical validation. The authors addressed these by: 1) Adding detailed cost analysis showing DMoA achieves Pareto-optimal performance/cost trade-off; 2) Clarifying connections between GMoA, mixture optimization and DMoA while expanding methodology sections; 3) Providing comprehensive results across additional benchmarks. Reviewer bHjk found the new scaling experiments "highly interesting," while EPHj noted improved readability. The thorough responses and substantial additions, particularly around cost-effectiveness and empirical validation, strengthen the paper's contributions and address the main reviewer concerns, supporting acceptance.
Accept (Poster)