PaperHub
Overall rating: 5.5 / 10
Decision: Rejected (4 reviewers)
Ratings: 8, 5, 3, 6 (min 3, max 8, std. dev. 1.8)
Confidence: 3.0
Correctness: 2.8
Contribution: 3.3
Presentation: 3.0
ICLR 2025

Towards Efficient and Scalable Multi-agent Reasoning via Bayesian Nash Equilibrium

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05

Abstract

Keywords
Large Language Models, Reasoning, Multiagent Reasoning

Reviews and Discussion

Review
Rating: 8

The paper proposes the EcoNash framework, a new solution designed for efficient and scalable coordination in multi-agent systems utilizing Large Language Models (LLMs). The framework leverages Bayesian Nash Equilibrium (BNE) to optimize multi-agent interactions while minimizing communication overhead. By combining global and local coordinators and clustering LLM agents, EcoNash achieves effective coordination with reduced computational costs.

The paper introduces a detailed theoretical analysis, including a convergence guarantee and a Bayesian regret bound, demonstrating the efficiency and stability of the proposed approach. EcoNash’s hierarchical structure enables it to achieve sublinear regret, significantly improving learning and convergence rates compared to existing multi-agent debate methods.

Experimental results across multiple benchmarks show that EcoNash outperforms both single-agent and traditional multi-agent methods, achieving improved scalability and efficiency. The framework contributes to advancing multi-agent reasoning, particularly in large-scale deployments where resource constraints and scalability are critical considerations.

EcoNash’s hierarchical coordination and emphasis on distributed reasoning allow agents to achieve Bayesian Nash Equilibrium independently, which is theoretically well-founded and empirically validated. Overall, the paper is well-written, methodologically sound, and experimentally thorough. I recommend it for acceptance.

Strengths

This paper introduces EcoNash, a novel multi-agent reasoning framework that combines hierarchical coordination with Bayesian Nash Equilibrium (BNE) principles to enhance the scalability and efficiency of large language models (LLMs) in multi-agent systems. The EcoNash framework reduces computational costs and minimizes the need for extensive inter-agent communication, which addresses common challenges in multi-agent LLM frameworks. By leveraging distributed reasoning and theoretically proving convergence with BNE, the authors provide a framework that is both theoretically sound and empirically validated.

The paper is well-written, methodologically sound, and experimentally thorough. The experiments on benchmark datasets convincingly demonstrate the framework’s scalability and efficiency, while the theoretical analysis provides a solid foundation for the proposed coordination mechanism. I recommend the paper for acceptance.

Weaknesses

While the paper presents a valuable contribution, a few areas could be enhanced to fully realize its goals:

  1. Scalability of Coordination Mechanism: EcoNash leverages a central Coordinator LLM to achieve distributed reasoning among Execution LLMs, yet the scalability of this coordination mechanism as agent numbers grow is unclear. Since scalability is a primary aim, it would be useful to explore alternative coordination structures (e.g., decentralized or multi-coordinator configurations) that could mitigate potential bottlenecks from a single coordinator, especially in large agent systems.

  2. Assumptions in Convergence Proofs: The paper’s theoretical contributions are strong, particularly in proving convergence to BNE. However, several key assumptions, such as those related to the reward structures and learning rates, are briefly mentioned without full elaboration. Detailed analysis on how these assumptions align with real-world LLM behavior could clarify the conditions under which EcoNash’s convergence is robust. A more thorough exploration of these theoretical aspects would strengthen the rigor of the convergence claims.

  3. Heterogeneous Agent Configurations: The experiments focus on homogeneous agent setups, which may not fully represent practical, mixed-capability multi-agent settings where different models operate together. Including experiments with heterogeneous agents (e.g., LLMs of varying sizes or architectures) would better validate EcoNash's adaptability and efficiency under more diverse configurations, highlighting its flexibility in real-world applications.

  4. Reward Design Specificity: The paper’s use of multi-faceted reward designs (e.g., action likelihood, task-specific, and self-evaluation rewards) is innovative, but the implementation lacks full transparency, particularly regarding how these rewards are balanced to avoid bias or feedback loops. Providing more specific details on the reward design and ensuring balanced feedback mechanisms would strengthen the framework’s robustness and generalizability. Additionally, discussing potential trade-offs among different reward types (e.g., consistency versus creativity) could offer further insights.

  5. Comparison with Related Multi-Agent Systems: The paper compares EcoNash against traditional multi-agent debate and coordination methods, but it could benefit from a comparison with additional recent multi-agent LLM approaches. For example, frameworks that optimize coordination in decentralized ways or those that involve explicit negotiation protocols could serve as useful baselines. Such comparisons would contextualize EcoNash’s performance and clarify its unique strengths.

Addressing these areas could enhance EcoNash's clarity, scalability, and applicability in diverse multi-agent environments, further supporting its practical contributions.

Questions

  1. Scalability of the Coordination Mechanism: Could the authors clarify how EcoNash's coordination structure scales as the number of agents grows, particularly with a single Coordinator LLM? Are there any plans or existing methods for adapting the framework to include multiple coordinators or a more decentralized approach to avoid potential bottlenecks?

  2. Clarification on Assumptions in Convergence Proofs: The theoretical results on convergence to BNE are valuable. However, could the authors clarify the assumptions made regarding the reward structures, learning rates, and agent behavior during convergence? Additionally, how sensitive are the convergence results to these assumptions?

  3. Heterogeneous Agent Configurations: Could the authors discuss the feasibility and potential implications of using EcoNash with heterogeneous agents? Are there specific challenges the authors anticipate if agents of different sizes or architectures were integrated? Experimenting with a mixed-LLM setup could provide insights into EcoNash’s adaptability.

  4. Details on Reward Balancing and Bias Prevention: The use of diverse rewards (action likelihood, task-specific, and self-evaluation) is innovative. Could the authors provide more details on how these reward types are balanced during training to prevent any unintended biases or feedback loops? How sensitive is the model's performance to changes in the reward balance?

  5. Additional Baselines for Comparison: EcoNash is compared against traditional multi-agent debate and coordination methods, but it could benefit from a comparison with recent multi-agent LLM frameworks, particularly those using decentralized coordination or negotiation protocols. Would the authors consider including these as additional baselines, if possible, in the final paper?

  6. Implementation of Self-Evaluation Rewards: The self-evaluation reward mechanism is intriguing for improving reasoning consistency. Could the authors elaborate on how this reward is designed and applied across different tasks? Are there any specific criteria used for self-evaluation that may vary by task type?

These questions aim to clarify potential improvements to EcoNash and enhance understanding of the approach. Addressing these points could strengthen the paper’s claims, particularly around scalability, reward balancing, and adaptability in diverse agent settings.

Comment

We sincerely thank Reviewer 3bEo for the valuable feedback. We have addressed all the comments you provided. Please find our point-by-point responses below. Any further comments and discussions are welcome! We apologize for the lack of clarity and the missing details in the initial submission. We have added extensive implementation details to enhance the understanding of our work. All modifications are highlighted in blue in the updated manuscript.


Q1. Scalability of Coordination Mechanism

"Scalability of the Coordination Mechanism: Could the authors clarify how EcoNash's coordination structure scales as the number of agents grows, particularly with a single Coordinator LLM? Are there any plans or existing methods for adapting the framework to include multiple coordinators or a more decentralized approach to avoid potential bottlenecks?"

Reply:

Thank you for your response. We are actively researching this area. As shown in Sec 4.4, Figure 2, and Figure 3, our findings using LLaMA 3.1 70B indicate that performance declines when a single Coordinator LLM handles more than six Execution LLMs. However, additional experiments with LLaMA 3.1 405B suggest that a stronger Coordinator LLM can effectively manage more Execution LLMs, with performance only starting to decline when coordinating more than nine LLaMA 3.1 70B models.

This performance difference is closely related to the capability of the Coordinator LLM itself. Beyond forming strategies and formats based on the given question (as demonstrated in Appendix D.2, where LLaMA 3.1 70B generates high-quality strategies and formats), the Coordinator LLM must also evaluate the responses from the Execution LLMs, generate rewards, and update parameters based on the reward-derived loss to approximate a BNE. In Table 6, we provide results using random rewards for evaluation, which show a significant drop in performance.

Therefore, we believe introducing an External Reward Model could be a reasonable solution, despite the additional computational overhead. Additionally, employing a more powerful model as the Coordinator LLM or fine-tuning the evaluation capability of the Coordinator LLM through specific methods would also be promising approaches.


Q2. Clarification on Assumptions in Convergence Proofs

"Clarification on Assumptions in Convergence Proofs: The theoretical results on convergence to BNE are valuable. However, could the authors clarify the assumptions made regarding the reward structures, learning rates, and agent behavior during convergence? Additionally, how sensitive are the convergence results to these assumptions?"

Reply:

We have revised Appendices A.3, B.2, and B.3 to further clarify our BNE convergence and provide additional details regarding the theoretical underpinnings. Specifically, Appendices B.2 and B.3 explain why EcoNash achieves a superior regret bound compared to MAD. The core assumptions behind these results are Assumption 5 (Q-function Estimation Error) and Assumption 6 (Policy Suboptimality), which are based on the following conditions:

  1. Convex and compact policy space.

  2. Lipschitz continuous policy functions.

  3. Bounded and continuous Q-functions.

  4. Proper learning rate scheduling.

Regarding learning rate sensitivity, we emphasize that excessively high learning rates can destabilize our on-policy learning process. Additionally, since we use on-policy learning within the datasets, we must keep the replay buffer from growing too large in order to prevent instability.

In contrast, Assumption 7 highlights that in multi-agent systems with the following conditions, policy suboptimality remains bounded from below:

  • Non-cooperative zero-sum or constant-sum game structure.

  • Strategic uncertainty due to incomplete information.

  • No mechanism for joint strategy optimization.

  • Competitive reward structure.

This necessitates the introduction of a coordinator to guide the system and the design of a bounded and continuous reward space to ensure effective coordination.
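To make this contrast concrete, the sketch below (an illustration with hypothetical constants, not values from the paper) shows why a per-step error decaying as $O(1/\sqrt{t})$ yields sublinear cumulative regret, whereas a per-step suboptimality bounded below by $\delta_{\min}$ accumulates linearly:

```python
# Illustrative only: hypothetical constants, not values from the paper.
import math

C = 1.0          # constant in the O(1/sqrt(t)) per-step error bound
delta_min = 0.1  # persistent per-step suboptimality in the debate setting
T = 10_000       # number of rounds

# Coordinated setting: per-step error C/sqrt(t) -> cumulative regret ~ 2*C*sqrt(T)
regret_coordinated = sum(C / math.sqrt(t) for t in range(1, T + 1))

# Debate setting: per-step suboptimality never falls below delta_min -> regret >= delta_min * T
regret_debate = delta_min * T

print(f"sum of C/sqrt(t) over {T} rounds ~ {regret_coordinated:.1f} (vs 2*C*sqrt(T) = {2 * C * math.sqrt(T):.1f})")
print(f"delta_min * T = {regret_debate:.1f}")
```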

Comment

Q3: Heterogeneous Agent Configurations

"Could the authors discuss the feasibility and potential implications of using EcoNash with heterogeneous agents? Are there specific challenges the authors anticipate if agents of different sizes or architectures were integrated? Experimenting with a mixed-LLM setup could provide insights into EcoNash’s adaptability."

Reply:

Thank you for this insightful question. We appreciate the opportunity to discuss the feasibility, challenges, and potential adaptations for integrating heterogeneous agents within the EcoNash framework.


Feasibility of Using EcoNash with Heterogeneous Agents

EcoNash is designed to accommodate agents of varying sizes and architectures. Its core components, the belief encoder and the centralized mixing network, are architecture-agnostic, enabling seamless integration of diverse Large Language Models (LLMs). This modularity ensures that EcoNash can effectively manage heterogeneous agents without requiring major changes to the framework.


Potential Challenges and Mitigations

  1. Diverse Capabilities and Response Patterns

    • Challenge: Different LLMs exhibit varying reasoning depths, language proficiency, and response styles, leading to potential inconsistencies in quality and format.
    • Mitigation: The belief encoder employs multi-head attention to harmonize disparate outputs, while the centralized mixing network processes these aggregated beliefs, aligning actions through a shared coordination signal $C$.
  2. Integration of Different Architectures

    • Challenge: Variations in underlying architectures may complicate encoding and decoding strategies.
    • Mitigation: By abstracting inter-agent communication through belief states, EcoNash minimizes architecture-specific dependencies. Attention mechanisms further enable seamless integration.
  3. Scalability and Computational Overhead

    • Challenge: Heterogeneous setups increase computational complexity, particularly in belief encoding and strategy coordination.
    • Mitigation: Efficient multi-head attention and parallel processing are employed to manage scalability, while the CTDE paradigm ensures computational overhead remains manageable.

Role of Attention Mechanisms in Handling Heterogeneous Agents

  1. Belief Encoder with Multi-Head Attention

    • Functionality: Aggregates belief states from all agents, capturing diverse inter-agent relationships.
    • Impact: Synthesizes information from agents with varying capabilities, producing a comprehensive group-level representation $\mathbf{E}$.
  2. Centralized Mixing Network with Self-Attention

    • Functionality: Processes prompt embeddings $\{\mathbf{e}_i^t\}_{i=1}^N$ to capture dependencies and local-global interactions.
    • Impact: Dynamically weighs each agent's contribution based on task relevance, facilitating effective coordination despite differences.

Adaptations in EcoNash for Heterogeneous Agents

  1. Customized Belief Networks
    Each agent maintains its own belief network $B_i(\mathbf{\tau}_i, O_i; \theta_i^B)$, ensuring belief states $\mathbf{b}_i$ are tailored to individual strengths and weaknesses.

  2. Adaptive Mixing Mechanism
    The centralized mixing network $f_{\text{mix}}$ integrates diverse belief states and prompt embeddings, leveraging stronger agents while mitigating weaker ones.


Experimental Insights and Future Work

While our current experiments focus on homogeneous agents, we recognize the importance of heterogeneous setups. Future work will include:

  1. Performance Analysis

    • Evaluate the impact of agent heterogeneity on BNE convergence, task performance, and efficiency.
    • Identify emergent behaviors or coordination patterns in mixed setups.
  2. Enhanced Compatibility

    • Investigate dynamic weighting strategies based on real-time metrics.
    • Explore transfer learning or meta-learning for belief network and mixing mechanism adaptability.

In summary, while integrating heterogeneous agents presents challenges, the modular and flexible design of EcoNash, supported by robust attention mechanisms, positions it well for diverse LLM integration. We are optimistic about its feasibility and plan to explore its potential in future experiments.

Comment

Q4: Details on Reward Balancing and Bias Prevention

"The use of diverse rewards (action likelihood, task-specific, and self-evaluation) is innovative. Could the authors provide more details on how these reward types are balanced during training to prevent any unintended biases or feedback loops? How sensitive is the model's performance to changes in the reward balance?"

Reply:

Thank you for this comment. In Appendix B.4, we provide more detailed explanations of the reward settings; a summary follows:

Continuous Reward Components:

  • Action Likelihood Reward ($r_i^{\text{AL}}$): uses cosine similarity $\text{sim}(u_i, C)$, which is inherently continuous in both $u_i$ and $C$.

  • Task-Specific Reward ($r_i^{\text{TS}}$): employs normalized scores through $\text{eval}(u_i, \text{task})$, avoiding binary 0/1 judgments. The $\text{eval}$ function produces continuous scores by considering:

    • For mathematical problems: partial credit for solution steps and reasoning quality.
    • For planning tasks: response relevance on a continuous scale.

  • Collaborative Contribution Reward ($r_i^{\text{CC}}$): evaluates quality on a continuous scale.

Continuity Guarantees:

  • All reward components are bounded through the $\min(R_{\text{max}}, \cdot)$ operation.

  • The weighted combination $r_i = \alpha_1 r_i^{\text{AL}} + \alpha_2 r_i^{\text{TS}} + \alpha_3 r_i^{\text{CC}}$ preserves continuity.

  • The dynamic weight adjustment mechanism, using gradient-based updates, ensures smooth transitions.

Task-Specific Evaluation:

  • While some tasks might naturally suggest binary outcomes (correct/incorrect), our $\text{eval}$ function deliberately produces continuous scores by:

  • Evaluating solution completeness.

  • Assessing reasoning quality.

  • Considering solution strategy.

  • This continuous evaluation aligns with how human experts assess solutions, especially in complex tasks.

Conclusion:

This design ensures that our payoff function maintains continuity with respect to $\theta$, satisfying the conditions required for the existence of BNE.
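To illustrate the combination above, here is a minimal sketch (the weight values, $R_{\text{max}}$, and component scores are placeholders, not the settings used in the paper):

```python
# Sketch of the bounded, weighted reward r_i = a1*r_AL + a2*r_TS + a3*r_CC,
# with each component clipped via min(R_max, .). All values are illustrative.
R_MAX = 1.0

def combine_rewards(r_al: float, r_ts: float, r_cc: float,
                    alphas=(0.3, 0.5, 0.2)) -> float:
    a1, a2, a3 = alphas
    # Bound each continuous component, then take the weighted sum.
    bounded = [min(R_MAX, r) for r in (r_al, r_ts, r_cc)]
    return a1 * bounded[0] + a2 * bounded[1] + a3 * bounded[2]

# Example: a partially correct solution with a high collaborative contribution.
print(combine_rewards(r_al=0.82, r_ts=0.65, r_cc=0.9))  # continuous scalar in [0, R_MAX]
```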

Q5: Additional Baselines for Comparison

"EcoNash is compared against traditional multi-agent debate and coordination methods, but it could benefit from a comparison with recent multi-agent LLM frameworks, particularly those using decentralized coordination or negotiation protocols. Would the authors consider including these as additional baselines, if possible, in the final paper?"

Reply:

We are highly interested in comparing our approach with these methods, particularly those employing decentralized coordination or negotiation protocols. However, at the time of writing, we were unable to identify corresponding baseline methods specifically tailored for comparison. Current multi-agent LLM systems, such as MetaGPT (https://arxiv.org/abs/2308.00352) and ChatDev (https://arxiv.org/abs/2307.07924), focus on agent-based collaboration, while other methods, like the one in https://arxiv.org/abs/2410.08115, rely on fine-tuning strategies. This limitation led us to select prompt-based methods for comparison instead. We remain open to suggestions and welcome any strong baselines you might propose. We would be glad to perform comparisons and analyze their regret bounds to further evaluate and refine our framework.

  • Baseline Comparisons: We included comparisons with:

    • [1] Feng et al. "Alphazero-like tree-search can guide large language model decoding and training."
    • [2] Liu et al. "Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding."
  • Updated Results:

| Benchmark  | Improvement over TS-LLM (%) | Improvement over PPO-MCTS (%) |
| ---------- | --------------------------- | ----------------------------- |
| GSM8K      | 2.7%                        | 4.9%                          |
| GSM-Hard   | 10.9%                       | 13.5%                         |
| SVAMP      | 3.0%                        | 3.0%                          |
| StrategyQA | 4.2%                        | 5.1%                          |
| MATH       | 4.9%                        | 8.2%                          |

Analysis:

  1. GSM8K: EcoNash outperforms TS-LLM by 2.7% and PPO-MCTS by 4.9%, demonstrating strong reasoning capabilities.

  2. GSM-Hard: EcoNash shows a 10.9% improvement over TS-LLM and 13.5% over PPO-MCTS, excelling in complex challenges.

  3. SVAMP: EcoNash exceeds TS-LLM and PPO-MCTS by 3.0%, showcasing efficiency in arithmetic tasks.

  4. StrategyQA: EcoNash surpasses TS-LLM by 4.2% and PPO-MCTS by 5.1%, highlighting its strategic reasoning ability.

  5. MATH: EcoNash achieves a 4.9% and 8.2% improvement over TS-LLM and PPO-MCTS, demonstrating strength in advanced math tasks.

Comment

Q6: Implementation of Self-Evaluation Rewards

"The self-evaluation reward mechanism is intriguing for improving reasoning consistency. Could the authors elaborate on how this reward is designed and applied across different tasks? Are there any specific criteria used for self-evaluation that may vary by task type?"

Reply:

Thank you for your insightful question. Below, we explain how the self-evaluation reward mechanism is designed and adapted to different tasks, focusing on mathematical reasoning and planning benchmarks.


Task-Specific Criteria for Self-Evaluation

1. Mathematical Reasoning Tasks (e.g., GSM8K, SVAMP, MATH, GSM-Hard)

Mathematical reasoning tasks emphasize correctness and step-by-step reasoning clarity, given the structured nature of the problems.

Evaluation Criteria:

  • Correctness: The Task-Specific Reward ($r_i^{\text{TS}}$) evaluates how accurate the final answer is relative to the ground truth.
  • Reasoning Steps: Intermediate steps are assessed for logical consistency and clarity, with partial credit given for accurate reasoning even if the final answer is incorrect.

Example: For the problem $2x + 3 = 7$, the reward would consider:

  1. Correct isolation of $x$ (i.e., $x = 2$).
  2. Clear explanation of steps, such as subtracting 3 and dividing by 2.

This ensures the model learns to prioritize both process clarity and solution accuracy.


2. Planning Tasks (e.g., TravelPlanner, Strategy QA)

Planning tasks require agents to generate actionable, coherent plans under various constraints.

Evaluation Criteria:

  • Relevance: Responses are evaluated for alignment with key task objectives, such as adhering to user-defined constraints (e.g., budget, time).
  • Coherence: Plans must exhibit logical structure, with each step building naturally on the previous ones.

Example: For TravelPlanner, where agents create travel itineraries under budget constraints:

  1. The evaluation rewards agents for satisfying constraints while optimizing time and cost.
  2. Penalties are applied for infeasible or irrelevant actions.

This approach ensures that agents focus on generating practical and effective solutions.


Dynamic Reward Adaptation

To adapt to different tasks, we dynamically adjust the weights ($\alpha_1, \alpha_2, \alpha_3$) of the reward components—Action Likelihood, Task-Specific Reward, and Collaborative Contribution Reward. This ensures that the reward system remains flexible:

  • Mathematical Tasks: Focus on correctness and reasoning clarity ($\alpha_1, \alpha_2$).
  • Planning Tasks: Emphasize relevance, coherence, and adherence to constraints ($\alpha_2, \alpha_3$).

This dynamic adjustment ensures agents optimize for the most relevant criteria for each task type.
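A minimal sketch of what such task-dependent weighting could look like (the task labels and weight values below are hypothetical, not the tuned settings from the paper):

```python
# Hypothetical task-dependent weights (alpha_1: action likelihood,
# alpha_2: task-specific, alpha_3: collaborative contribution).
TASK_WEIGHTS = {
    "math":     (0.35, 0.50, 0.15),  # emphasize correctness and reasoning clarity
    "planning": (0.15, 0.45, 0.40),  # emphasize relevance, coherence, constraints
}

def reward_weights(task_type: str) -> tuple[float, float, float]:
    """Return (alpha_1, alpha_2, alpha_3) for the given task type."""
    return TASK_WEIGHTS.get(task_type, (1 / 3, 1 / 3, 1 / 3))  # uniform fallback

print(reward_weights("math"))      # (0.35, 0.5, 0.15)
print(reward_weights("planning"))  # (0.15, 0.45, 0.4)
```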


We hope this explanation addresses your question and are happy to provide further clarification if needed.

Comment

Dear Reviewer 3bEo,

We sincerely thank you for your efforts in reviewing our work and for your great support! We also greatly appreciate your insightful questions and hope that our responses have helped to clarify them.

Please let us know if you need any further information or if there are additional points you would like to discuss with us. Thank you very much!

Best regards,

Authors of #14101

Comment

Thank you to the author for the detailed response to my questions, effectively addressing my doubts and concerns about the method. Sincerely looking forward to your paper being accepted.

Comment

Dear Reviewer 3bEo,

Thank you for your support of our work. We are pleased that our responses were able to address your doubts and concerns about our method effectively.

Once again, we sincerely appreciate your valuable feedback and time, which have greatly contributed to improving our paper.

Best regards,
Authors of #14101

Review
Rating: 5

This paper proposes EcoNash, a multi-agent framework geared towards achieving Bayesian Nash equilibrium, which has sublinear regret instead of the linear regret of the regular multi-agent debate setup. The method consists of a coordinator LLM that gives a question-specific strategy to each executor LLM, and executor LLMs that act as solvers and execute the strategy based on their internal state using agent-specific belief networks. The beliefs from all the agents are relayed to the coordinator LLM via embeddings from the belief encoder, which then aggregates and commits. The networks are trained using gradient descent with TD and SD loss. Empirically, the authors show that across 4 different LLMs (of different sizes and families), EcoNash outperforms various prompting-based baselines (CoT, SC, multi-agent debate) while also using fewer tokens (more efficient, lower latency).

Strengths

  • Multi-agent techniques are a popular and effective method for improving the reasoning performance of LLMs, however, they are computationally expensive. So the paper addresses a problem on an important topic and their better regret bounds and token efficiency results are compelling.
  • The BNE take on multi-agent conversations is relatively novel to this domain and conceptually intriguing -- it could be useful for the ICLR community
  • The empirical results over prompting baselines give good improvement on multiple datasets and models

Weaknesses

  • The authors can present the paper's content more clearly and make it easier to follow. I found a lot of crucial details not adequately explained in theoretical results, experimental setup and training, and general examples/intuition that would make the paper more appealing to the multi-agent reasoning community. See questions below (1-6, 8)
  • I have major questions or doubts about the assumptions made in the theoretical results, which are mostly delegated to the appendix. Overall this section was tough to follow and missed necessary background and intuition needed for it to be beneficial for LLM reasoning community. See questions below (7, 11)
  • The paper does not clearly explain their experimental setup, which involves training, and in its current form, it would be hard for any reader to be able to have a working implementation of their method (also brings up reproducibility concern). It also raises doubts about if the baseline comparisons are fair or adequate (no training baselines), and the ablations and analysis for why the method works appear to be insufficient (see questions: 8-10, 12-15).

PS: I would be open to revisiting the scores if the authors add some of these missing details and simplify the explanations and content presentation of the paper. In the current form, I don't think the content is clear and coherent for the community at large to benefit from it.

Questions

  1. 3.1.1: What is a type function?

  2. Problem with the notation: shouldn't the tuple of $\theta$s and actions be ordered? $\theta_i$ and $\theta_{-i}$ do not account for the orderings; the authors should clarify that.

  3. Appendix A.3: $H$ is entropy and $I$ is mutual information – if so, this should be stated or defined.

  4. Also, the roles of the coordinator and executor LLMs should be defined in the method before referring to this proof, i.e., somewhere in 3.1, possibly the “implementation” part of 3.1.1 as the title suggests?

  5. Broader point on Sec 3: for it to be useful for researchers in LLM agents, it is important to build the intuition for what coordinator LLMs inputs and especially outputs are (question and strategy) and should be stated at the start of Sec 3. While some of these examples and prompts are referred to in Sec 4 (mainly in the appendix), it would be useful to introduce these earlier on, or add an example in the figure, or possibly moved to the main paper.

  6. Figure 1: need to label the belief network – blue box?

  7. Lemma 1: How do you get the bound for estimation and policy suboptimality errors as $(\sqrt{t})^{-1}$? This is the assumption that the proof in Appendix B.2 hinges on and is not explained in the appendix either. Similarly, why did the authors set the value of $\delta_t$ to a constant max in B.3?

  8. Figure 1 and Sec 3.3.1: Can the authors share more details on “informative strategy and a format based on the input question q” that the Coordinator LLM generates?

  9. Missing Training details/experimental setup of the belief encoding and network, number of parameters, what learning rate and hyperparameters were used, how were they tuned? What is the training dataset?

  10. All the baselines are solely prompting-based, but their method trains the belief network and encoding. Can you compare it to directly finetuning the LLM with LoRA with minimal parameters or prefix tuning or soft-prompt optimization which may be more similar to what the authors do?

  11. Issue with the prompts in Appendix C and computing rewards using prompts: they use LLM-as-a-judge capabilities to assess model responses for reasoning, which is already something models are bad at (Huang et al. 2024), and require models to stick to token counts, which models are also bad at (Yuan et al. 2024). So do the authors use a cutoff or in-context examples?

    • To measure the soundness and relevance of these reward scores, what if you sampled the scores with high temperature, added noise, or replaced them with random rewards – how does the performance change in these scenarios?
    • Most strategy suggestions from the coordinator can be seen as a plan and then asking the executors to solve it. This is similar to works such as Plan and Solve, Adapt, etc., but how faithful are the executors to the plan as pointed out in Lyu et al. 2023. How do we know that most improvements are not coming from additional diversity (missing factor in homogeneous LLM agent settings) from sampling multiple plans that the coordinator sends to different LLMs. To simulate this, authors could run self-consistency with "plan and solve" instead of just solutions.
    • Does assumption 2 (posterior alignment) in Appendix A.3 imply the executor LLM is faithful to the plan/strategy generated by the coordinator LLM? If so, that may not always occur, since CoT and math responses in zero-shot generative settings are not known to be faithful. E.g., the model can say it is solving $2x - 6 + 2 = 0$ but then state that $x = 3$ (instead of $x = 2$).
    • In general, it is unclear how simply prompting the LLM guarantees the assumptions in A.3 are met.
  12. Dataset metrics for TravelPlanner in Sec 4.2 need to be explained.

  13. Minor: Figures 2 and 3 are never referred to in text, which makes it harder to read the relevant sections

  14. Instead of using LLM as a judge to output the reward, something models are shown not to be good at, why not use an external RM from the reward bench? Also, if you already have scores for different executors that give answers, why not do weighted SC, or best-of-N based on reward, or an ensemble?

  15. In sec 4.3, when comparing the token counts, the SC numbers look quite high. Is there an explanation for this? Usually SC is done using temperature sampling (so input is provided once, and k outputs are sampled). Are you counting the input tokens 64 times too, and is that the reason for the high token counts?

Comment

We sincerely thank Reviewer 2o1n for the valuable feedback. We have addressed all the comments you provided. Please find our point-by-point responses below. Any further comments and discussions are welcome! We apologize for the lack of clarity and the missing details in the initial submission. We have added extensive implementation details to enhance the understanding of our work. All modifications are highlighted in blue in the updated manuscript.


Q1: Accuracy of BNE in the MA-LLM Framework (Section 3.1.1)

"The authors can present the paper's content more clearly and make it easier to follow. I found a lot of crucial details not adequately explained in theoretical results, experimental setup and training, and general examples/intuition that would make the paper more appealing to the multi-agent reasoning community. See questions below (1-6, 8)."

Reply:

Thank you for your valuable comments. We appreciate the opportunity to clarify and improve our manuscript. Let me address each point:


1. "3.1.1: What is a type function?"

Reply:

In our revised framework, we have moved away from the type function formulation to a more intuitive belief-based representation. Instead of using type functions, we directly model agent states through belief states $\mathbf{b}_i \in \mathbb{R}^d$ and belief networks $B_i(\mathbf{\tau}_i, O_i; \theta_i^B)$. This approach better captures the dynamic nature of LLM interactions while maintaining mathematical rigor. The detailed revision can be found in Sec 3.1.1, highlighted in blue.


2. "Problem with the notation: shouldn't the tuple of $\theta$s and actions be ordered, so $\theta_i$ and $\theta_{-i}$ do not account for the orderings? The authors should clarify that."

Reply:

You raise a good point about the ordering issue. In our revised framework, we've eliminated the $\theta_i$ and $\theta_{-i}$ notation to avoid ambiguity. Instead, we use belief network parameters $\theta_i^B$ for individual agents and the group-level representation $\mathbf{E}$ generated by $f_e(\{\mathbf{b}_i\}_{i=1}^N; \theta_e)$ to capture inter-agent relationships. This notation better reflects the hierarchical structure of our framework.


3. "Appendix A.3: $H$ is entropy and $I$ is mutual information—if so, it should be stated or defined?"

Reply:

Thank you for your comment. We have revised the entire Appendix A.3 section, providing complete explanations for all the assumptions and their relation to EcoNash. We have explicitly defined $H$ as entropy and $I$ as mutual information to enhance clarity. Specifically, here are the definition of belief entropy and the statement of Assumption 3:

Definition: Belief Entropy

For a given time $t$, the belief entropy $H_t$ is defined as the Shannon entropy of the aggregated belief embeddings:

$$H_t = -\sum_{i=1}^N \mathbb{E}_{\mathbf{b}_i \sim B_i}[\mathbf{b}_i \log \mathbf{b}_i],$$

where $B_i$ represents the belief network of agent $i$.

Assumption: Game Regularity

There exists a constant $\eta > 0$ such that for any $t_1 < t_2$, if the entropy difference satisfies $H_{t_1} - H_{t_2} \leq \log 2$, then the following holds:

$$I(\theta_i^B; \xi(\mathbf{e}_i, \mathbf{E}) \mid D_{t_1}) \leq 4\eta \cdot I(\theta_i^B; \xi(\mathbf{e}_i, \mathbf{E}) \mid D_{t_2}),$$

where:

  • $I(\cdot; \cdot \mid \cdot)$ represents the conditional mutual information,
  • $\theta_i^B$ are the parameters of the belief network for agent $i$,
  • $\xi(\mathbf{e}_i, \mathbf{E})$ denotes the coordination outcome based on the agent's prompt embeddings $\mathbf{e}_i$ and the group-level representation $\mathbf{E}$,
  • $D_t$ refers to the dataset or information set available at time $t$.

This assumption ensures stability in how the belief network parameters influence coordination outcomes over time.
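For intuition, here is a small numeric sketch of the belief-entropy quantity defined above (our own illustration; it assumes each belief state is normalized into a probability vector before the entropy is taken, which is not spelled out here):

```python
# Sketch: Shannon-style entropy of aggregated belief embeddings,
# assuming each belief state b_i is normalized into a probability vector.
import numpy as np

def belief_entropy(belief_states: list[np.ndarray]) -> float:
    """H_t = -sum_i sum_k p_i[k] * log(p_i[k]) over all agents' normalized beliefs."""
    total = 0.0
    for b in belief_states:
        p = b / b.sum()                          # normalize to a probability vector
        total += -(p * np.log(p + 1e-12)).sum()  # small constant avoids log(0)
    return total

rng = np.random.default_rng(0)
beliefs = [rng.random(8) for _ in range(3)]      # 3 agents with 8-dim belief states
print(belief_entropy(beliefs))
```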


Comment

4. "Also, the roles of coordinator and executor LLMs should be defined in the method before referring to this proof, i.e., somewhere in 3.1, possibly the 'implementation' part of 3.1.1 as the title suggests?"

Reply:

Thank you for your valuable suggestion to improve our readability. In the revised Section 3.1.1, we have provided definitions for the coordinator and executor LLMs:

  • Bayesian Nash Equilibrium (BNE): A strategy profile where each agent maximizes its expected utility based on its beliefs about other agents' strategies.

  • Coordinator LLM: Takes a question as input and outputs corresponding strategy and format specifications to guide execution LLMs. After receiving answers from execution LLMs, it generates a final commitment to address the question.

  • Execution LLMs: Each maintains its belief state $\mathbf{b}_i \in \mathbb{R}^d$ and receives observations $O_i = [e_t, e_s, \mathbf{b}_i]^\top$, where $e_t$ encodes the task and $e_s$ represents the coordinator's strategy.

  • Belief Network: $B_i(\mathbf{\tau}_i, O_i; \theta_i^B)$ updates each agent's state based on its history $\mathbf{\tau}_i$ and current observation, generating prompt embeddings $\mathbf{e}_i$.

  • Belief Encoder: $f_e(\{\mathbf{b}_i\}_{i=1}^N; \theta_e)$ aggregates these beliefs into group information $\mathbf{E}$, which the coordinator LLM uses to guide coordination through a commitment $C$.
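To make these roles concrete, here is a schematic sketch of one inference round as described above (all callables and signatures are placeholders rather than the authors' API, and belief handling is simplified for brevity):

```python
# Schematic sketch of one EcoNash inference round. All callables are placeholders,
# not the authors' implementation; belief handling is simplified to keep it short.
from typing import Callable, Sequence

def inference_round(
    question: str,
    plan: Callable[[str], tuple[str, str]],                 # coordinator: question -> (strategy, format)
    executors: Sequence[Callable[[str, str, str], tuple[str, list[float]]]],
    #                                                         each: (question, strategy, format) -> (answer, belief b_i)
    encode_beliefs: Callable[[list[list[float]]], list[float]],  # belief encoder: {b_i} -> group information E
    commit: Callable[[str, list[str], list[float]], str],   # coordinator: (question, answers, E) -> commitment C
) -> str:
    strategy, answer_format = plan(question)                 # 1. strategy and format from the coordinator
    answers, beliefs = [], []
    for run in executors:                                    # 2. each execution LLM answers and updates its belief
        answer, b_i = run(question, strategy, answer_format)
        answers.append(answer)
        beliefs.append(b_i)
    group_info = encode_beliefs(beliefs)                     # 3. belief encoder aggregates beliefs into E
    return commit(question, answers, group_info)             # 4. coordinator emits the final commitment C
```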


5. "Broader point on Sec 3: For it to be useful for researchers in LLM agents, it is important to build the intuition for what coordinator LLM's inputs and especially outputs are (question and strategy) and should be stated at the start of Sec 3. While some of these examples and prompts are referred to in Sec 4 (mainly in the appendix), it would be useful to introduce these earlier on, or add an example in the figure, or possibly move to the main paper."

Reply:

Thank you for this valuable suggestion about improving the intuition behind coordinator LLM inputs and outputs. We have made several revisions to address this:

Section 3.1.1:

We enhanced the explanation of the coordinator-executor framework to clarify:

  • The coordinator LLM's input includes both the question and belief states $\mathbf{b}_i \in \mathbb{R}^d$.
  • Its outputs consist of strategy specifications ($e_s$) and the final commitment ($C$).
  • The interaction of these components within the belief network architecture.

Figure 1:

We provided a clearer illustration of the inference and optimization phases. The updated figure better explains the inputs and outputs of the mixing network and the coordinator LLM.

Section 3.3.2:

We revised the description of the Centralized Mixing Network, emphasizing its role in coordinating belief information from execution LLMs to optimize towards a Bayesian Nash Equilibrium (BNE). Key updates include:

  • Processing Prompt Embeddings:
    • The prompt embeddings $\{\mathbf{e}_i^t\}_{i=1}^N$ are processed via self-attention to capture intra-agent dependencies, producing transformed embeddings $\{\mathbf{w}_i^t\}_{i=1}^N$.
  • Feature Transformations:
    • These transformed embeddings are concatenated with the group-level representation $\mathbf{E}^t$ to generate feature transformations $\{F_i^t\}_{i=1}^N$, encoding both local agent-specific and global group-level information.
  • Global Value Function:
    • The feature transformations $\{F_i^t\}_{i=1}^N$ and individual Q-values $\{Q_i^t\}_{i=1}^N$ are combined via multi-head attention to compute the global value function $Q_{\text{tot}}^t$, aligning with Figure 1.
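As a rough sketch of this pipeline (our own simplified reading; the dimensions, layer choices, and the final $Q_{\text{tot}}$ readout are assumptions, not the paper's exact architecture):

```python
# Sketch of a centralized mixing network: self-attention over prompt embeddings,
# concatenation with the group representation E, and a multi-head-attention
# combination with per-agent Q-values. Dimensions and layers are assumptions.
import torch
import torch.nn as nn

class MixingNetworkSketch(nn.Module):
    def __init__(self, d_embed: int = 64, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_embed, n_heads, batch_first=True)
        self.feature_proj = nn.Linear(2 * d_embed, d_embed)   # concat(w_i, E) -> F_i
        self.q_attn = nn.MultiheadAttention(d_embed, n_heads, batch_first=True)
        self.readout = nn.Linear(d_embed, 1)

    def forward(self, prompt_emb: torch.Tensor, group_repr: torch.Tensor,
                q_values: torch.Tensor) -> torch.Tensor:
        # prompt_emb: (batch, N, d) per-agent prompt embeddings e_i^t
        # group_repr: (batch, d)    group-level representation E^t
        # q_values:   (batch, N)    individual Q_i^t
        w, _ = self.self_attn(prompt_emb, prompt_emb, prompt_emb)  # intra-agent dependencies
        E = group_repr.unsqueeze(1).expand_as(w)                   # broadcast E^t to each agent
        F = self.feature_proj(torch.cat([w, E], dim=-1))           # feature transformations F_i^t
        q_feat = q_values.unsqueeze(-1) * F                        # weight features by Q_i^t
        mixed, _ = self.q_attn(q_feat, F, F)                       # combine via multi-head attention
        return self.readout(mixed.mean(dim=1)).squeeze(-1)         # global value Q_tot^t

q_tot = MixingNetworkSketch()(torch.randn(2, 5, 64), torch.randn(2, 64), torch.randn(2, 5))
print(q_tot.shape)  # torch.Size([2])
```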

Examples in Appendix D.2:

We added intuitive examples of coordinator outputs to better illustrate its functionality. For instance:

Example:

Q1: John buys 3 pizzas for $12 each. If he gives the delivery person a 20% tip on the total, how much did he spend in total?
S1 (Strategy): Calculate pizza subtotal first. Add 20% of subtotal for tip. Sum for final amount.
F1 (Format):

  1. Pizza cost = $? × ?
  2. Tip = ? × subtotal
  3. Total = subtotal + tip

This example demonstrates how the coordinator:

  • Breaks down complex problems into manageable steps.
  • Provides clear formatting guidelines.
  • Ensures consistency across executor responses while maintaining computational efficiency (total strategy + format: 35 tokens).

6. "Figure 1: Need to label the belief network—the blue box?"

Reply:

We have provided a new Figure 1 to make it clearer, separating the optimization and inference phases. We have detailed the composition of the belief network and mixing network, and we have labeled all components, including the belief network (blue box).


Comment

8. "Figure 1 and Sec 3.3.1: Can the authors share more details on “informative strategy and a format based on the input question q” that the Coordinator LLM generates?"

Reply:

We have provided additional informative strategies and formats generated by the Coordinator LLM in Appendix D.2 for the GSM8K, MATH, and SVAMP datasets. For each dataset, we included five examples and reported their total token consumption. This addition offers clearer insights into how the coordinator guides the execution LLMs.


Q2: Assumptions Made in the Theoretical Results

"I have major questions or doubts about the assumptions made in the theoretical results, which are mostly delegated to the appendix. Overall, this section was tough to follow and missed necessary background and intuition needed for it to be beneficial for the LLM reasoning community. See questions below (7, 11)."

Reply:

Thank you for this feedback. We acknowledge that the theoretical section needed more clarity. We have revised the relevant appendices (Appendix A.3, Appendix B.2, and Appendix B.3) and expanded explanations (Sec 3.3.1 and Sec 3.3.2) to make the assumptions and proofs more accessible.


7. "Lemma 1: How do you get the bound for estimation and policy suboptimality errors as $O(t^{-1/2})$? This is the assumption that the proof in Appendix B.2 hinges on and is not explained in the appendix either. Similarly, why did the authors set the value of $\delta_{\min}$ to a constant max in B.3."

Reply:

Thank you for this question about our error bounds. We have modified Appendices B.1, B.2, and B.3 to make them clearer. Let me clarify our assumptions and their justification:


Estimation Errors and Policy Suboptimality

The errors are defined as follows:

  • Estimation Error: Related to Q-function approximation.
  • Policy Suboptimality: Measures the deviation from the optimal policy.

Assumption: Q-Function Estimation Error

The estimation error decreases as:

$$\epsilon_t \leq \frac{C_{\epsilon}}{t^{\alpha}}, \quad \text{with } \alpha = \frac{1}{2}.$$

This rate is justified by:

  • Stochastic Approximation Theory showing $O(t^{-1/2})$ convergence (Borkar, 2009).
  • Minimax Optimality in stochastic optimization (Nemirovski, 2009).
  • Achieving these rates through proper learning rate scheduling.
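For intuition on the $O(t^{-1/2})$ rate, here is a tiny stochastic-approximation example (our own illustration, unrelated to the paper's actual Q-function estimator): a running-mean estimate with step size $1/t$ has an error that shrinks at roughly the $1/\sqrt{t}$ rate.

```python
# Illustration only: estimating a mean by stochastic approximation with step size 1/t.
# The error shrinks at roughly the 1/sqrt(t) rate cited above.
import math
import random

random.seed(0)
true_mean, estimate = 2.0, 0.0
for t in range(1, 100_001):
    sample = true_mean + random.gauss(0.0, 1.0)   # noisy observation
    estimate += (sample - estimate) / t           # Robbins-Monro update, alpha_t = 1/t
    if t in (100, 10_000):
        print(f"t={t:>6}  |error|={abs(estimate - true_mean):.4f}  1/sqrt(t)={1 / math.sqrt(t):.4f}")
print(f"t=100000  |error|={abs(estimate - true_mean):.4f}  1/sqrt(t)={1 / math.sqrt(100_000):.4f}")
```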

Assumption: Policy Suboptimality

The policy suboptimality decreases as:

$$\delta_t \leq \frac{C_{\delta}}{t^{\beta}}, \quad \text{with } \beta = \frac{1}{2}.$$

This rate is supported by standard results in online learning and online convex optimization (see the references below).


Debate Setting: Persistent Policy Suboptimality

In the debate setting, policy suboptimality remains bounded below:

$$\delta_t \geq \delta_{\min} > 0$$

Assumption: Persistent Policy Suboptimality

This assumption is justified by the non-cooperative, competitive structure of the debate setting (see Assumption 7 and the game-theoretic references below).


References

  1. Borkar, V. S. (2009). Stochastic Approximation: A Dynamical Systems Viewpoint. Springer.
  2. Nemirovski, A. (2009). "Robust Stochastic Optimization via Convex Programming." Mathematics of Operations Research.
  3. Hazan, E. (2016). Introduction to Online Convex Optimization. Foundations and Trends® in Optimization.
  4. Shalev-Shwartz, S. (2012). "Online Learning and Online Convex Optimization." Foundations and Trends® in Machine Learning.
  5. Zhang, K., et al. (2021). "Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms." Handbook of Reinforcement Learning and Control.
  6. Fudenberg, D., & Tirole, J. (1998). Game Theory. MIT Press.
  7. Owe, A. (2013). Information Theory and its Applications. Wiley.
  8. Lanctot, M., et al. (2017). "A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning." NeurIPS.
Comment

11. "Issue with the prompts in Appendix C and computing rewards using prompts: Using LLM-as-a-judge capabilities to assess model responses for reasoning, which is already something models are bad at (Huang et al., 2024). It requires models to stick to token counts, which models are bad at (Yuan et al., 2024). So do the authors use a cutoff or in-context examples?"

Reply:

Thank you for this question. We will address your concerns point by point:


1. Sampling with High Temperature or Adding Noise

  • Question: "To measure the soundness and relevance of these reward scores, what if you sampled the scores with high temperature, added noise, or replaced them with random rewards—how does the performance change in these scenarios?"

  • Reply:

    • We have updated the supplementary reward setting in Appendix B.4 to make it clearer.
    • We conducted ablation experiments, replacing all rewards with random rewards, and updated our Table 6.
  • Findings:

    • The performance of LLaMA 3.1 70B drops to 62.71%, even worse than zero-shot CoT (68.24%).
  • Interpretation:

    • Random rewards prevent the coordinator from reasonably judging the direction of the commitment output, leading to degraded performance.

2. Faithfulness of Executors to the Plan

  • Question: "Most strategy suggestions from the coordinator can be seen as a plan and then asking the executors to solve it. This is similar to works such as Plan-and-Solve, Adapt, etc., but how faithful are the executors to the plan as pointed out in Lyu et al., 2023? How do we know that most improvements are not coming from additional diversity (missing factor in homogeneous LLM agent settings) from sampling multiple plans that the coordinator sends to different LLMs? To simulate this, authors could run self-consistency with 'plan and solve' instead of just solutions."

  • Reply:

    • Training Objective: The EcoNash framework aims to adjust each execution LLM's belief state towards BNE under the influence of the coordinator and belief encoder. As demonstrated in Sec 3.3.2, the alignment of local and global objectives within EcoNash ensures the reliability of optimization.

    • Monotonicity Proof: We have proved in Appendix A.5 the monotonicity of the mixing network, ensuring that improvements in individual agent performances contribute positively to the overall system performance.

    • Homogeneous and Heterogeneous Settings: EcoNash improves performance under both homogeneous and heterogeneous LLM settings, as shown in Section 4.3 (Additional Results) and Table 3.

    • Additional Experiments: To address the reviewer's concern about the contribution of strategy diversity in our approach, we conducted a comparative experiment between Plan-and-Solve with Self-Consistency (P&S-SC) and EcoNash using LLaMA 3.1 70B on MATH and GSM-Hard datasets. For P&S-SC, we implemented an extensive sampling strategy with 10 different plans per problem and 10 solutions per plan (100 total solutions), applying majority voting for answer selection. The results show that P&S-SC achieves moderate improvements over Zero-shot CoT (MATH: 73.86% vs 68.24%, +5.62%; GSM-Hard: 42.12% vs 36.78%, +5.34%). However, EcoNash demonstrates substantially higher performance (MATH: 81.47%, GSM-Hard: 51.43%), maintaining significant advantages over P&S-SC (MATH: +14.80%, GSM-Hard: +17.20%) despite P&S-SC's extensive sampling. These results strongly suggest that EcoNash's superior performance cannot be attributed merely to sampling diversity, but rather stems from its fundamental architectural advantages, including dynamic strategy adaptation, Nash equilibrium-based optimization, and effective multi-agent coordination, which enable more sophisticated reasoning capabilities beyond what can be achieved through increased sampling alone.

    • Comparison with Original Plan-and-Solve: It's worth noting that in the original Plan-and-Solve paper, their improvements over Zero-shot CoT were relatively modest. As quoted from their paper: "PS+ prompting improves the accuracy over Zero-shot CoT by at least 5% for all datasets except GSM8K which sees a 2.9% improvement... PS prompting enjoys 2.5% higher average accuracy than that of Zero-shot CoT." In contrast, EcoNash shows much more substantial improvements, achieving approximately 16% improvement over Zero-shot CoT on GSM8K (90.17% vs 74.74%). This significant performance gap further supports our argument that EcoNash's advantages stem from its fundamental architectural innovations rather than simple planning or sampling diversity.


Comment

3. Assumption 2 (Posterior Alignment) and Executor Faithfulness

  • Question: "Does Assumption 2 (posterior alignment) in Appendix A.3 imply the executor LLM is faithful to the plan/strategy generated by the coordinator LLM? If so, that may not always occur since CoT and math responses in zero-shot generative settings are not known to be faithful. E.g., the model can say it's solving for 2x - 6 + 2 = 0 but say that it solves x = 3 (instead of x = 2)."

  • Reply:

    • We acknowledge this concern regarding the executor LLM's faithfulness to the coordinator LLM. We have modified Appendix A.3 (Assumptions) and added additional explanations to avoid misunderstandings. Specifically:
  • Assumption 2 (Approximate Posterior Alignment):

  • Does not require that the executor LLM strictly follows the coordinator's instructions without error.

  • Aims for alignment within an acceptable error range.

  • We introduce the Kullback-Leibler divergence $D_{\text{KL}}$ to quantify the alignment between the executor LLM's strategy distribution $P_{\text{LLM}}$ and the coordinator's expected posterior distribution $P_{\text{post}}$.

  • By setting an acceptable error boundary $\epsilon$, we allow the executor to deviate slightly from the coordinator's strategy, which is more realistic in practical applications.
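As a small numeric illustration of this alignment criterion (the distributions and tolerance below are made up, not values from the paper):

```python
# Sketch: check approximate posterior alignment via D_KL(P_LLM || P_post) <= eps.
# Distributions and epsilon are illustrative placeholders.
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_llm  = [0.70, 0.20, 0.10]   # executor's distribution over 3 candidate strategies
p_post = [0.60, 0.25, 0.15]   # coordinator's expected posterior
eps = 0.05

d = kl_divergence(p_llm, p_post)
print(f"D_KL = {d:.4f}, aligned within eps: {d <= eps}")
```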

4. Sticking to Token Counts

"It requires models to stick to token counts, which models are bad at (Yuan et al., 2024). So do the authors use a cutoff or in-context examples?"

Reply:

Thank you for your insightful observation regarding token length control. While we initially set a 50-token constraint for the coordinator's strategy generation, we recognize the challenges in ensuring precise adherence to token counts by LLMs, as demonstrated in the study you cite (Yuan et al., 2024). That research revealed that approximately 95% of responses fall within 1.4 times the target length, with 50% within 1.0 times.

Furthermore, our experiments confirmed that LLaMA exhibits significantly fewer instances of exceeding the token limit compared to GPT and Claude, a finding also reported in the aforementioned study. We merely verified this observation in our context.

To address this inherent variability while maintaining concise and effective coordination, we employ a robust two-stage approach:

  1. A primary instruction targeting a 50-token length.
  2. A 70-token hard cutoff with an automatic regeneration mechanism if exceeded.

We have provided additional clarification in Line 335, and further details regarding the strategy and format can be found in Appendix D.2. The results show that, across the datasets we tested, the average token consumption remained well below 50 tokens.
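A rough sketch of this two-stage control (`generate_strategy` and `count_tokens` are placeholder callables; the truncation fallback at the end is our own addition for completeness, not part of the paper's described mechanism):

```python
# Sketch of the two-stage length control: ~50-token target, 70-token hard cutoff
# with regeneration. `generate_strategy` and `count_tokens` are placeholders.
from typing import Callable

TARGET_TOKENS, HARD_LIMIT, MAX_RETRIES = 50, 70, 3

def constrained_strategy(question: str,
                         generate_strategy: Callable[[str, int], str],
                         count_tokens: Callable[[str], int]) -> str:
    for _ in range(MAX_RETRIES):
        strategy = generate_strategy(question, TARGET_TOKENS)  # prompt asks for ~50 tokens
        if count_tokens(strategy) <= HARD_LIMIT:               # accept if within the hard cutoff
            return strategy
    # Fallback (our own addition, not described in the paper): truncate the last attempt.
    return " ".join(strategy.split()[:HARD_LIMIT])
```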

Q3: Clarification of Experimental Setup

"The paper does not clearly explain their experimental setup, which involves training, and in its current form, it would be hard for any reader to have a working implementation of their method (also brings up reproducibility concerns). It also raises doubts about whether the baseline comparisons are fair or adequate (no training baselines), and the ablations and analysis for why the method works appear to be insufficient (see questions: 8-10, 12-15)."

Reply:

We apologize for the lack of clarity in our experimental setup. We have made significant revisions to address these concerns.


8. "Missing training details/experimental setup of the belief encoding and network, number of parameters, what learning rate and hyperparameters were used, how were they tuned? What is the training dataset?"

Reply:

  • We have revised Section 3.3.2 (Optimization Phase) to improve its readability, provided a complete description of the inputs/outputs of the belief network, belief encoder, and mixing network, and added Appendix B.6 with a comprehensive hyperparameter table corresponding to Section 3.3.2. Here are some additional points:

  • API Usage: We use the Together API for inference while performing on-policy learning towards BNE.

  • Dataset Handling: In our dataset, each question is answered once, and each answer is treated as trajectory information. This information is fed into the replay buffer to update the parameters of our belief network, belief encoder, and mixing network.

  • Cost Example: For instance, using LLaMA 3.1-405B on the MATH dataset incurs an approximate cost of $75.

  • Early Stopping: Detailed criteria for early stopping are provided in Section 3.3.2, with hyperparameters listed in Appendix B.5.
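Putting these points together, a schematic sketch of the optimization loop (all names are placeholders, not the authors' code; the loss is assumed to be a differentiable tensor produced from the reward-derived objective):

```python
# Schematic sketch: one answer per question, stored as a trajectory in a replay
# buffer that is used to update the belief networks, belief encoder, and mixing
# network. All names are placeholders; `compute_loss` is assumed to return a
# differentiable tensor (e.g., a TD-style, reward-derived objective).
from collections import deque

def train(questions, run_inference_round, compute_loss, optimizer, buffer_size=256):
    replay_buffer = deque(maxlen=buffer_size)        # kept small to preserve on-policy stability
    for question in questions:
        trajectory = run_inference_round(question)   # answers, beliefs, rewards for this question
        replay_buffer.append(trajectory)

        loss = compute_loss(list(replay_buffer))     # reward-derived loss over buffered trajectories
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```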


Comment

"9. All the baselines are solely prompting-based, but their method trains the belief network and encoding. Can you compare it to directly fine-tuning the LLM with LoRA with minimal parameters or prefix tuning or soft-prompt optimization, which may be more similar to what the authors do?"

Reply: We believe that directly fine-tuning the LLM with LoRA, prefix tuning, or soft-prompt optimization may present an unfair comparison because these methods modify the LLM parameters directly. However, to address your concern:

  • Comparison with RL-Based Methods:

  • We have provided comparisons with methods that also train action-value functions to achieve stronger reasoning ability, which we believe is a more reasonable and fair comparison:

  • Comparison Methods:

  • [1] Feng, Xidong, et al. "Alphazero-like tree-search can guide large language model decoding and training." arXiv preprint arXiv:2309.17179 (2023).

  • [2] Liu, Jiacheng, et al. "Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding." First Conference on Language Modeling. 2024.

  • Updated Results:

| Benchmark  | Improvement over TS-LLM (%) | Improvement over PPO-MCTS (%) |
| ---------- | --------------------------- | ----------------------------- |
| GSM8K      | 2.7%                        | 4.9%                          |
| GSM-Hard   | 10.9%                       | 13.5%                         |
| SVAMP      | 3.0%                        | 3.0%                          |
| StrategyQA | 4.2%                        | 5.1%                          |
| MATH       | 4.9%                        | 8.2%                          |

Analysis

  1. GSM8K: EcoNash achieves a 2.7% higher average score compared to TS-LLM and a 4.9% improvement over PPO-MCTS. This demonstrates its superior performance in solving standard mathematical problems.

  2. GSM-Hard: EcoNash achieves a 10.9% improvement over TS-LLM and a 13.5% improvement over PPO-MCTS. This highlights its significant advantage in tackling complex mathematical challenges.

  3. SVAMP: EcoNash outperforms TS-LLM by 3.0% and PPO-MCTS by 3.0%, showcasing its efficiency in arithmetic and mathematical reasoning tasks.

  4. StrategyQA: EcoNash achieves a 4.2% higher average score compared to TS-LLM and a 5.1% improvement over PPO-MCTS, demonstrating its superior understanding and decision-making capabilities in strategic question-answering tasks.

  5. MATH: EcoNash outperforms TS-LLM by 4.9% and PPO-MCTS by 8.2%, highlighting its strength in addressing advanced mathematical problems.


"12. Dataset metrics for TravelPlanner in Sec 4.2 need to be explained."

Reply:

We apologize for this oversight. We have provided detailed explanations of the dataset metrics for TravelPlanner and all datasets used in EcoNash in Appendix B.5 (Task Setups), including:

  • Description of the dataset.

  • Evaluation metrics used.

  • Specific challenges associated with the task.


"13. Minor: Figures 2 and 3 are never referred to in the text, which makes it harder to read the relevant sections."

Reply:

Thank you for pointing this out. We have rearranged Figures 2 and 3 and provided corresponding references in Section 4.4. This should make it easier to connect the figures with the relevant discussion in the text.


Comment

14. "Instead of using LLM as a judge to output the reward, something models are shown not to be good at, why not use an external RM from the reward bench? Also, if you already have scores for different executors that give answers, why not do weighted SC, or best-of-N based on reward or an ensemble?"

Reply:

Thank you for these excellent suggestions. Let me address each point:

Reasons for Not Using External RMs:

  • Fair Comparison: To ensure fair comparison with baseline methods that don't use external RMs.
  • Self-Contained System: To maintain a system that doesn't rely on external components.
  • Demonstrate Effectiveness: To showcase the effectiveness of our coordination mechanism even without specialized reward models.

Weighted Self-Consistency (SC) or Ensemble Methods

Advantages of Our Framework:

  • Token Efficiency: Unlike weighted SC, which requires multiple samples/generations, our coordinator-based approach guarantees one-shot outputs, significantly reducing token usage.
  • Theoretical Guarantees: Our framework provides monotonic convergence towards BNE, while weighted ensembles lack such theoretical guarantees.
  • Dynamic Adaptation: The coordinator actively learns and adjusts strategies based on executor performance, providing more sophisticated coordination than static weighting schemes.

Future Directions

We acknowledge the potential value of integrating external reward models and ensemble techniques into our framework. Future work could explore how these approaches might complement our existing mechanism, particularly in cases where more granular or domain-specific feedback is needed. Incorporating external RMs could also enhance the flexibility of our system, while ensemble methods may improve performance by leveraging the strengths of multiple models. However, careful attention would need to be paid to how these additions might affect the system's efficiency and theoretical guarantees.


"15. In Sec 4.3, when comparing the token counts, the SC numbers look quite high. Is there an explanation for this? Usually, SC is done using temperature sampling (so input is provided once, and k outputs are sampled). Are you counting the input tokens 64 times too, and is that the reason for the high token counts?"

Reply:

Thank you for this careful observation about the Self-Consistency (SC) token counts.

Correction

  • You are absolutely right: we erroneously counted the input tokens 64 times, when they should have been counted only once.
  • We have corrected this error in our calculations, and the revised results in Table 4 reflect this adjustment.

Impact

  • The corrected results show significantly lower token usage for SC than before, though it remains substantially higher than that of the other methods.
  • This correction further emphasizes EcoNash's token efficiency while maintaining superior performance.
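
For illustration, here is a minimal sketch of the corrected accounting (variable names are ours, used only for exposition):

```python
# Corrected Self-Consistency token accounting: the prompt is billed once,
# and each of the k sampled completions adds only its own output tokens.
def sc_token_count(input_tokens: int, output_token_counts: list[int]) -> int:
    return input_tokens + sum(output_token_counts)

# The earlier (incorrect) tally was effectively
#   k * input_tokens + sum(output_token_counts)   with k = 64,
# which inflated the input-side cost by a factor of k.
```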

We hope that our responses have addressed your concerns satisfactorily. We are grateful for your insightful comments, which have helped us improve the clarity and quality of our manuscript.

Comment

Dear Reviewer 2o1n,

We sincerely thank you for your efforts in reviewing our work and for your great support! We also greatly appreciate your insightful questions and hope that our responses have helped to clarify them.

Please let us know if you need any further information or if there are additional points you would like to discuss with us. Thank you very much!

Best regards,

Authors of #14101

Comment

I thank the authors for their efforts, I have updated my rating accordingly.

Comment

Dear Reviewer 2o1n,

Thank you for recognizing our work and for raising your score accordingly. Based on your feedback, we have revised the relevant sections and outlined the main changes as follows to enhance presentation and soundness:

  1. Presentation

    • Readability and Intuitiveness
      In Section 3.1, we have provided a more intuitive explanation of BNE to enhance readability and understanding.

    • Detailed Description of Optimization
      We have updated Figure 1 for clearer visualization, expanded the explanations of each component in Section 3.3.2, and added a detailed description of the reward settings in Appendix B.4.

    • Supplementary Dataset Descriptions
      Detailed descriptions of each dataset are now included in Appendix B.5 to provide better context and understanding of the data used.

    • Coherent Theoretical Explanations
      We have included theoretical proofs in Theorem 1 demonstrating the existence of BNE and in Proposition 1 showing how prompt embeddings can be effectively tuned to achieve BNE. Additionally, Lemma 1 establishes a lower regret bound based on Bayesian Regret, and Appendix A.5 provides a proof of monotonic improvement to further support our framework.

  2. Soundness

    • Assumptions and References
      We have elaborated on the assumptions of Lemma 1 and included relevant references to support them.

    • Bayesian Regret Expansion
      The Bayesian regret derived from Lemma 1 has been further detailed in Appendix B.2 and B.3 to ensure comprehensive understanding.

    • Hyperparameters and Code Update
      Appendix B.6 now contains detailed descriptions of all hyperparameters, and we have updated our implementation code to reflect these changes and improve reproducibility.

    • Experimental Setup Enhancements
      We revised the experimental setup section to include two baseline comparisons with a learned action-value function and clarified how we adhere to token length instructions within our framework.

    • Strategy and Format Examples
      Appendix D.2 includes detailed examples of strategies and formats generated by the coordinator LLM to illustrate their application.

We kindly ask you to review our responses and confirm if you have any further questions. Any additional comments and discussions are welcome!

Thank you once again for your valuable feedback and time.

Best regards,

Authors of #14101

Comment

Dear Reviewer 2o1n,

We deeply appreciate your detailed review and insightful comments. Your feedback has significantly improved the quality of our submission.

Revisions Made

In our rebuttal, we have comprehensively addressed your concerns and enhanced the manuscript's readability by providing additional explanations. Specifically, we have:

  • Added detailed information on the implementation in Sections 3.1.1 and 3.3.2.
  • Included proofs in Appendices A.3, B.2, and B.3.
  • Elaborated on the reward settings in Appendix B.4.
  • Detailed the dataset setup in Appendix B.5.
  • Provided a table of hyperparameters in Appendix B.6.
  • Expanded the formats and strategies generated by the coordinator in Section D.2.

We kindly request your feedback on our responses, especially as the deadline for submitting the revised PDF is approaching. Your further insights would be invaluable to ensure that our revisions meet the necessary standards.

Thank you once again for your time and constructive input.

Best regards,

Authors #14101

Comment

We sincerely thank you for your response during the discussion period, which has been very helpful. If there are any remaining weaknesses in the paper that might have prevented a higher score, we would greatly appreciate it if you could share them. As the discussion phase nears its conclusion, we remain available to promptly address any additional concerns or provide further clarification as needed.

Review
3

This paper presents a way to optimize multi-agent LLM systems for Bayesian Nash Equilibrium (BNE). It showed strong results against other popular benchmark methods. Even though the paper is very well-written, I have to reserve my recommendation for acceptance until some confusion is cleared up.

Strengths

  1. The paper is written very well. The structure and organization are both very clear.
  2. The paper uses actual game theory objectives to ground multi-agent debate optimization. This effort is very applaudable.

Weaknesses

  1. Currently, the paper lacks quite a few key details to properly understand the work.
  2. Training details are almost completely missing.
  3. The provided code (through a URL) is very minimal/barebone. It includes data files and code for Game of 24 -- which is not even reported in the actual paper.
  4. The comparison is between EcoNash (a method that involves fine-tuning the model) and other methods (RAP, ToT, rStar, etc.) -- essentially comparing a fine-tuned model with prompting methods. The higher performance is not entirely surprising.
  5. It might be cool to see if fine-tuning, according to Nash equilibria, leads agents to develop distinct "personas" or emergent differentiations.

Questions

  1. What is the hardware spec for fine-tuning the LLaMA3.1-405B model? If it is hard to estimate or proprietary, please express that clearly in the paper. Similarly, please report the time spent running the training loop. Early stopping is mentioned but no hyperparameter is reported (how many epochs did you train, what's the exact criteria)? If you used an API for fine-tuning, that is totally fine too -- please report which company's API you used (it seems like Together's API?), and can you report the cost (a rough number is ok)?
  2. Glicksberg's Fixed Point Theorem [1] seems to be used to prove the min-max value for zero-sum games. I believe in your framework, it's a cooperative game? Does this theorem still apply? Please point me to some textbook/tutorial/paper that uses this theorem to prove convergence for Bayesian Nash Equilibrium.
  3. Is the proof for the payoff function to be continuous w.r.t. $\theta$ valid? This might be my misunderstanding -- but in Sec 3.3.2. Reward Setting: you use a lot of different rewards, including correctness, which we know is 0/1 for many of the questions.

[1] https://en.wikipedia.org/wiki/Glicksberg%27s_theorem


Additional questions and concerns:

I am concerned the authors are not acting in good faith. For example, if you look at the draft, Table 8 just says "Hyperparameters of EcoNash", but the authors' follow-up response to me explained this is only a representative example for the GSM8K task with LLaMA 3.1 70B. So why is this not clearly stated in the paper draft when they added the table?

Through this process, the learned embeddings guide the executor LLMs to generate responses that are more consistent and coordinated, enabling improved overall performance.

For inference, we used the Together API. Regarding performance validation on a math dataset using LLaMA 3.1 405B, the approximate cost was $75.

These two parts combined are very confusing. How are embeddings sent to executor LLMs to guide their response generation? The Together API does not seem to support taking in embeddings (it only supports returning embeddings). Can the authors at least explain this? Is it possible for the authors to show which part of the Together API they used to send in embeddings that guide the executor LLM to generate responses?

Comment

Q2: Code provided through a URL

"The provided code (through a URL) is very minimal/barebones. It includes data files and code for Game of 24—which is not even reported in the actual paper."

Reply:

Thank you for pointing this out. We have refined and submitted our complete codebase, along with detailed explanations in the README file, at https://anonymous.4open.science/status/EcoNash-867A. We will continue to maintain and fully open-source the code.

Regarding the Implementation of the Game of 24:

Initially, we conducted experiments on the Game of 24 as a mathematical planning task. However, we observed that tree search-based methods (e.g., ToT, MCTS-Rollout, TM-LLS) consistently outperformed methods like CoT and self-consistency on this task. This created an imbalance in comparison, as the evaluation unfairly favored certain approaches rather than providing a well-rounded assessment across diverse tasks.

Additionally, using the Game of 24 for comparing token consumption is not entirely reasonable, as search depth becomes a significant factor, as highlighted in https://arxiv.org/pdf/2309.17179:

"We argue that the Path@1/Path@N metric may not be reasonable. We also include the number of computations used in Path@1 generation (average number of tokens in sentence-level and average number of forward computations in token-level for solving a single problem). We refer readers to the second row of Fig. 2 for the Path@N result, with token/forward number as the x-axis. TS-LLM variants consume much more computation than CoT, making the comparison unfair.”

Considering these observations, we decided to focus on a more complex and practical task—the travel planner—as our planning task benchmark, as shown in Table 2. This benchmark provides a broader and more realistic evaluation scope, aligning better with our objective of testing generalizable planning methods.


Q3: Potential unfair comparison

"The comparison is between EcoNash (a method that involves fine-tuning the model) and other methods (RAP, ToT, rStar, etc.)—essentially comparing a fine-tuned model with prompting methods. The higher performance is not entirely surprising."

Reply:

Thank you for pointing this out. However, we would like to clarify that the EcoNash framework does not fine-tune the LLMs directly. Instead, we focus on adjusting the prompt embeddings and training the belief networks to modify the outputs of the LLMs. Through the coordinator's guidance, we aim to steer the multi-agent LLM system towards a Bayesian Nash Equilibrium (BNE) in an incomplete information game setting.

To address your concern about unfair comparison, we have added other baselines that involve learned action-value functions (arXiv:2309.17179 and arXiv:2309.15028). We have included the updated results in Table 1 of the revised manuscript. Comparing with these baselines enhances the persuasiveness of our method by demonstrating its advantages over approaches that also learn action-value functions.

| Benchmark  | Improvement over TS-LLM (%) | Improvement over PPO-MCTS (%) |
|------------|-----------------------------|-------------------------------|
| GSM8K      | 2.7%                        | 4.9%                          |
| GSM-Hard   | 10.9%                       | 13.5%                         |
| SVAMP      | 3.0%                        | 3.0%                          |
| StrategyQA | 4.2%                        | 5.1%                          |
| MATH       | 4.9%                        | 8.2%                          |

Analysis

  1. GSM8K: EcoNash achieves a 2.7% higher average score compared to TS-LLM and a 4.9% improvement over PPO-MCTS. This demonstrates its superior performance in solving standard mathematical problems.

  2. GSM-Hard: EcoNash achieves a 10.9% improvement over TS-LLM and a 13.5% improvement over PPO-MCTS. This highlights its significant advantage in tackling complex mathematical challenges.

  3. SVAMP: EcoNash outperforms TS-LLM by 3.0% and PPO-MCTS by 3.0%, showcasing its efficiency in arithmetic and mathematical reasoning tasks.

  4. StrategyQA: EcoNash achieves a 4.2% higher average score compared to TS-LLM and a 5.1% improvement over PPO-MCTS, demonstrating its superior understanding and decision-making capabilities in strategic question-answering tasks.

  5. MATH: EcoNash outperforms TS-LLM by 4.9% and PPO-MCTS by 8.2%, highlighting its strength in addressing advanced mathematical problems.


Comment

Q4: Emergent personas in Nash equilibria

"It might be cool to see if fine-tuning, according to Nash equilibria, leads agents to develop distinct 'personas' or emergent differentiations."

Reply:

Thank you for this thoughtful suggestion. However, we would like to clarify that our framework is specifically designed to promote coordinated behavior among execution LLMs, rather than encouraging differentiation.

Protocol

  • The framework is built to guide all execution LLMs towards a unified Nash Equilibrium strategy through the centralized mixing network.
  • The commitment $C$, generated by $f_{\text{mix}}$, serves as a coordination signal to align the behaviors of all agents.
  • The similarity difference loss explicitly encourages consistency between each agent's actions and the commitment, ensuring coherent decision-making.
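
For illustration, here is a minimal sketch of such a similarity-based consistency term (shapes and names are assumed for exposition, not taken from our codebase):

```python
# Each executor's output embedding u_i is pulled towards the commitment
# embedding C produced by the mixing network.
import torch
import torch.nn.functional as F

def similarity_difference_loss(agent_embs: torch.Tensor, commitment: torch.Tensor) -> torch.Tensor:
    """agent_embs: (N, d), one embedding per executor; commitment: (d,)."""
    sims = F.cosine_similarity(agent_embs, commitment.unsqueeze(0), dim=-1)  # (N,)
    return (1.0 - sims).mean()  # zero only when every executor is fully aligned with C
```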

Task-Specific Considerations

  • For domains such as mathematical problem-solving and planning, where optimal solutions are well-defined, behavioral differentiation may introduce unnecessary complexity and potential conflicts.
  • Coordinated decision-making is critical in these tasks, as the primary objective is to converge on the correct solution rather than explore diverse approaches.
  • Therefore, our framework prioritizes convergence to optimal strategies, avoiding the potential pitfalls of diverse but suboptimal behaviors.

Theoretical Foundation

  • The monotonicity property, proven in the Appendix, guarantees that individual agent improvements contribute to the overall system's performance through aligned behavior.
  • This alignment is especially important for tasks with clear objectives and well-defined solutions, as it ensures that all agents work synergistically towards a common goal.

While emergent diversity could be valuable in creative tasks such as writing, where variety is an asset, our focus on mathematical reasoning and planning emphasizes the importance of coordination. In these scenarios, the framework is designed to align agents' efforts toward finding optimal solutions, rather than fostering distinct behavioral patterns.


Q5: Hardware specifications for training

"What is the hardware spec for fine-tuning the LLaMA 3.1-405B model? If it is hard to estimate or proprietary, please express that clearly in the paper. Similarly, please report the time spent running the training loop. Early stopping is mentioned but no hyperparameter is reported (how many epochs did you train, what's the exact criteria)? If you used an API for fine-tuning, that is totally fine too—please report which company's API you used (it seems like Together's API?), and can you report the cost (a rough number is ok)?"

Reply:

Thank you for your question. However, the EcoNash framework does not fine-tune the LLMs directly. Instead, we focus on adjusting the prompt embeddings and training the belief networks, which is more akin to a multi-agent reinforcement learning (MARL) task.

Key Points:

  • API Usage: We use the Together API for on-policy learning.
  • Dataset Handling: In our dataset, each question is answered once, and each answer is treated as trajectory information. This information is fed into the replay buffer to update the parameters of our belief network, belief encoder, and mixing network.
  • Cost Example: For instance, using LLaMA 3.1-405B on the MATH dataset incurs an approximate cost of $75.
  • Early Stopping: Detailed criteria for early stopping are provided in Section 3.3.2, with hyperparameters listed in Appendix B.5.

Comment

Q6: Applicability of Glicksberg's Fixed Point Theorem

"Glicksberg's Fixed Point Theorem [1] seems to be used to prove the min-max value for zero-sum games. I believe in your framework, it's a cooperative game? Does this theorem still apply? Please point me to some textbook/tutorial/paper that uses this theorem to prove convergence for Bayesian Nash Equilibrium."

Reply:

Thank you for this important question about the applicability of Glicksberg's Fixed Point Theorem in our framework.

Necessity of Glicksberg's Theorem

  • While our framework implements a cooperative game, we specifically chose Glicksberg's theorem because we are dealing with infinite-dimensional strategy spaces arising from LLM outputs and continuous belief states.
  • The theorem's applicability is determined by the properties of strategy spaces and payoff functions, rather than the game type (zero-sum vs. cooperative).
  • For specific applications to Bayesian Nash Equilibrium in similar settings with infinite-dimensional strategy spaces, we refer to:
    • [1]Ui T. Bayesian potentials and information structures: Team decision problems revisited[J]. International Journal of Economic Theory, 2009, 5(3): 271-291.
    • [2]Balder E J. A unifying approach to existence of Nash equilibria[J]. International Journal of Game Theory, 1995, 24: 79-94.
    • [3] Emmons S, Oesterheld C, Critch A, et al. For learning in symmetric teams, local optima are global nash equilibria[C]//International Conference on Machine Learning. PMLR, 2022: 5924-5943.

Why Not Kakutani's Theorem?

  • Kakutani's Fixed Point Theorem is foundational but primarily suited for finite-dimensional spaces.
  • In our framework, each execution LLM operates in an infinite-dimensional space due to:
    • The continuous nature of language model outputs.
    • Continuous probability distributions over actions.
    • Continuous belief state updates.
  • Glicksberg's theorem naturally extends to these infinite-dimensional topological vector spaces while maintaining similar conditions (e.g., upper hemicontinuity, convex-valuedness).

Technical Necessity

  • The infinite-dimensional aspects of our framework make Glicksberg's theorem not just applicable but necessary, as it handles:
    • Continuous mapping between belief states and mixed strategies.
    • Weak topology considerations in continuous strategy spaces.
    • Complex relationships between belief networks and policy optimization.
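
For readers who prefer a schematic statement, the argument can be summarized as follows (our notation; please see the references above for formal treatments):

```latex
% Each agent i has a strategy space \Sigma_i that is a nonempty, compact,
% convex subset of a locally convex topological vector space, and an expected
% payoff that is continuous (and quasi-concave in the agent's own strategy).
\[
\mathrm{BR}_i(\sigma_{-i}) = \arg\max_{\sigma_i \in \Sigma_i}
  \mathbb{E}_{\theta \sim p(\theta)}\big[u_i(\sigma_i, \sigma_{-i}; \theta)\big],
\qquad
\mathrm{BR}(\sigma) = \prod_{i=1}^{N} \mathrm{BR}_i(\sigma_{-i}).
\]
% Under these conditions, Glicksberg's theorem yields a fixed point
% \sigma^* \in \mathrm{BR}(\sigma^*), i.e., a Bayesian Nash equilibrium;
% neither a zero-sum structure nor finite-dimensional strategy spaces are required.
```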

Q7: Problem about reward setting and payoff function continuity

"Is the proof for the payoff function to be continuous w.r.t. θ valid? This might be my misunderstanding—but in Sec 3.3.2. Reward Setting: you use a lot of different rewards, including correctness, which we know is 0/1 for many of the questions."

Reply:

Thank you for this important question about the continuity of our payoff function. We have added supplementary explanations in Appendix B.4 to make our reward setting clearer.

Continuous Reward Components:

  • Action Likelihood Reward ($r_i^{\text{AL}}$):

    • Uses cosine similarity $\text{sim}(u_i, C)$, which is inherently continuous in both $u_i$ and $C$.
  • Task-Specific Reward ($r_i^{\text{TS}}$):

    • Employs normalized scores through $\text{eval}(u_i, \text{task})$, avoiding binary 0/1 judgments.
    • The $\text{eval}$ function produces continuous scores by considering:
      • For mathematical problems: partial credit for solution steps and reasoning quality.
      • For planning tasks: response relevance on a continuous scale.
  • Collaborative Contribution Reward ($r_i^{\text{CC}}$):

    • Evaluates quality on a continuous scale.

Continuity Guarantees:

  • All reward components are bounded through the $\min(R_{\text{max}}, \cdot)$ operation.
  • The weighted combination $r_i = \alpha_1 r_i^{\text{AL}} + \alpha_2 r_i^{\text{TS}} + \alpha_3 r_i^{\text{CC}}$ preserves continuity.
  • The dynamic weight adjustment mechanism, using gradient-based updates, ensures smooth transitions.
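
As an illustrative sketch of this bounded, weighted combination (the $\alpha$ values and $R_{\text{max}}$ below are placeholders, not the settings used in the paper):

```python
# Each component is clipped at R_max before weighting, keeping r_i bounded and
# continuous in its inputs.
def combined_reward(r_al: float, r_ts: float, r_cc: float,
                    alphas=(0.4, 0.4, 0.2), r_max: float = 1.0) -> float:
    a1, a2, a3 = alphas
    r_al, r_ts, r_cc = (min(r_max, r) for r in (r_al, r_ts, r_cc))  # bound each component
    return a1 * r_al + a2 * r_ts + a3 * r_cc
```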

Task-Specific Evaluation:

  • While some tasks might naturally suggest binary outcomes (correct/incorrect), our $\text{eval}$ function deliberately produces continuous scores by:
    • Evaluating solution completeness.
    • Assessing reasoning quality.
    • Considering solution strategy.
  • This continuous evaluation aligns with how human experts assess solutions, especially in complex tasks.

Conclusion:

This design ensures that our payoff function maintains continuity with respect to $\theta$, satisfying the conditions required for the existence of BNE.


We hope that our responses have addressed your concerns satisfactorily. We are grateful for your insightful comments, which have helped us improve the clarity and quality of our manuscript.

Comment

We sincerely thank Reviewer PYcK5 for the valuable feedback. We have addressed all the comments you provided. Please find our point-by-point responses below. Any further comments and discussions are welcome!


Q1: Lack of key details and missing training information

1. "Currently, the paper lacks quite a few key details to properly understand the work."

2. "Training details are almost completely missing."

Reply:

We apologize for the lack of clarity and the missing details in the initial submission. We have added extensive implementation details to enhance the understanding of our work. All modifications are highlighted in blue in the updated manuscript. Here is a summary of the changes:

  1. Section 3.1: Revised the description of the BNE in the multi-agent LLM framework for better clarity, specifically elaborating on how we quantify the effectiveness of different strategies.

  2. Section 3.2: Updated Appendix A.3 to provide further explanation on the basis and reasoning behind Lemma 1. Refined Appendices B.2 and B.3 to detail the computation of the Bayesian regret bounds for EcoNash and MAD based on Lemma 1.

  3. Section 3.3.1: Updated Figure 1 to more clearly illustrate the specific procedures in the inference and optimization phases.

  4. Section 3.3.2 (Reward Setting): Added more detailed descriptions of the reward setting in Appendix B.4 to supplement the main text.

  5. Section 3.3.2 (Individual Network / Centralized Mixing Network): Made systematic modifications to both the individual belief networks and the centralized mixing network to ensure theoretical completeness and learning stability.

  6. Section 3.3.2 (Belief Encoder): Added a detailed definition of the belief encoder and its optimization details to clarify how it aggregates group-level information.

  7. Section 3.3.2 (Early Stopping): Provided detailed criteria for early stopping to ensure efficient optimization and convergence.

  8. Appendix B.5: Included a comprehensive table of hyperparameters corresponding to Section 3.3.2 to enhance reproducibility.

  9. Appendix D.2: Added more strategy examples generated by the coordinator to provide additional insights.

  10. Table 1: Introduced additional performance comparisons with baselines that use learned action-value functions, specifically:

    • [1] Feng, Xidong, et al. "Alphazero-like tree-search can guide large language model decoding and training." arXiv preprint arXiv:2309.17179 (2023).
    • [2] Liu, Jiacheng, et al. "Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding." First Conference on Language Modeling. 2024.

Optimization Procedure:

  1. Each execution LLM maintains its belief state $\mathbf{b}_i \in \mathbb{R}^d$ and receives observations $O_i = [e_t, e_s, \mathbf{b}_i]^\top$, where $e_t$ encodes the task and $e_s$ represents the coordinator's strategy.

  2. The belief network is defined as $B_i(\tau_i, O_i; \theta_i^B)$, which updates each agent's state based on its history $\tau_i$ and current observation, generating prompt embeddings $\mathbf{e}_i$.

  3. The belief network outputs:

    • A set of prompt embeddings $\mathbf{e}_i^t$ for $i = 1, \dots, N$,
    • A set of individual Q-values $Q_i^t$ for $i = 1, \dots, N$.
  4. The belief encoder aggregates the belief states from all agents to generate a group-level representation using multi-head attention with $H$ attention heads to capture inter-agent relationships.

  5. The Centralized Mixing Network is designed to coordinate belief information from execution LLMs, receiving:

    • A set of prompt embeddings $\mathbf{e}_i^t$ for $i = 1, \dots, N$,
    • The group-level representation $\mathbf{E}^t$,
    • A set of individual Q-values $Q_i^t$ for $i = 1, \dots, N$.
  6. After achieving commitment, the mixing network computes:

    • The similarity difference (SD) loss,
    • The global value function $Q_{\text{tot}}^t$.
  7. A combined loss is then used to update all parameters.
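
As an illustrative sketch of this flow (the layer widths and loss weight below are assumed for exposition and are not the exact settings in our code):

```python
# Per-agent Q-values and the group-level representation are combined into a
# global value Q_tot and a joint TD + SD objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingNetwork(nn.Module):
    def __init__(self, n_agents: int, emb_dim: int):
        super().__init__()
        self.q_mixer = nn.Sequential(nn.Linear(n_agents, 64), nn.ReLU(), nn.Linear(64, 1))
        self.commit_head = nn.Linear(emb_dim, emb_dim)  # group representation -> commitment C

    def forward(self, agent_qs: torch.Tensor, group_repr: torch.Tensor):
        q_tot = self.q_mixer(agent_qs)               # (B, N) -> (B, 1) global value
        commitment = self.commit_head(group_repr)    # (B, d)
        return q_tot, commitment

def joint_loss(q_tot, td_target, agent_embs, commitment, lam: float = 0.5):
    td_loss = F.mse_loss(q_tot, td_target)                                                    # TD term
    sd_loss = (1 - F.cosine_similarity(agent_embs, commitment.unsqueeze(1), dim=-1)).mean()   # SD term
    return td_loss + lam * sd_loss
```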


Comment

Dear Reviewer PYck,

We sincerely thank you for your efforts in reviewing our work and for your great support! We also greatly appreciate your insightful questions and hope that our responses have helped to clarify them.

Please let us know if you need any further information or if there are additional points you would like to discuss with us. Thank you very much!

Best regards,

Authors of #14101

Comment

Response to Reviewer PYck

Dear Reviewer PYck,

Thank you very much for your time and valuable comments.

During the rebuttal period, we provided detailed responses to all your comments and questions point-by-point. Specifically, we:

  • Added missing key details and training information (Q1)
  • Updated the corresponding code URL (Q2)
  • Incorporated new baselines to address the issue of unfair comparison (Q3)
  • Explained the concept of Emergent Personas and its relationship with EcoNash (Q4)
  • Clarified that we do not fine-tune the LLMs directly and discussed the computational cost of using Llama 3.1 405B (Q5)
  • Explained why Glicksberg's Fixed Point Theorem is applicable (Q6)
  • Added an explanation of the reward mechanism (Q7)

We have also updated the revised version of our manuscript accordingly and outlined the main changes to enhance both presentation and soundness as follows:


Presentation

  • Readability and Intuitiveness

In Section 3.1, we have provided a more intuitive explanation of Bayesian Nash Equilibrium (BNE) to enhance readability and understanding.

  • Detailed Description of Optimization

We updated Figure 1 for clearer visualization, expanded the explanations of each component in Section 3.3.2, and added a detailed description of the reward settings in Appendix B.4.

  • Supplementary Dataset Descriptions

We have included detailed descriptions of each dataset in Appendix B.5 to provide better context and understanding of the data used.

  • Coherent Theoretical Explanations

We have included theoretical proofs in Theorem 1 to demonstrate the existence of BNE and in Proposition 1 to show how prompt embeddings can be effectively tuned to achieve BNE. Additionally, Lemma 1 establishes a lower regret bound based on Bayesian Regret, and Appendix A.5 provides a proof of monotonic improvement to further support our framework.


Soundness

  • Assumptions and References

We have elaborated on the assumptions of Lemma 1 and included relevant references to support these assumptions.

  • Bayesian Regret Expansion

The Bayesian regret derived from Lemma 1 has been further detailed in Appendices B.2 and B.3 to ensure comprehensive understanding.

  • Hyperparameters and Code Update

Appendix B.6 now contains detailed descriptions of all hyperparameters, and we have updated our implementation code to reflect these changes and improve reproducibility.

  • Experimental Setup Enhancements

We revised the experimental setup section to include two baseline comparisons with a learned action-value function and clarified how we adhere to token length instructions within our framework.

  • Strategy and Format Examples:

    Appendix D.2 includes detailed examples of strategies and formats generated by the coordinator LLM to illustrate their application.


Would you mind checking our responses and confirming whether you have any further questions?

Any comments and discussions are welcome!

Thanks for your attention and best regards, Authors of #14101

Comment

Dear Reviewer PYck,

We deeply appreciate your detailed review and insightful comments. Your feedback has significantly improved the quality of our submission.

Revisions Made

In our rebuttal, we have comprehensively addressed your concerns and enhanced the manuscript's readability by providing additional explanations. Specifically, we have:

  • Added detailed information on the implementation in Sections 3.1.1 and 3.3.2.
  • Included proofs in Appendices A.3, B.2, and B.3.
  • Elaborated on the reward settings in Appendix B.4.
  • Detailed the dataset setup in Appendix B.5.
  • Provided a table of hyperparameters in Appendix B.6.
  • Expanded the formats and strategies generated by the coordinator in Section D.2.

We kindly request your feedback on our responses, especially as the deadline for submitting the revised PDF is approaching. Your further insights would be invaluable to ensure that our revisions meet the necessary standards.

Thank you once again for your time and constructive input, any additional comments and discussions are welcome!

Best regards,

Authors #14101

Comment

Dear Reviewer PYck,

We sincerely appreciate your thoughtful comments and valuable feedback on our submission. We have carefully addressed all the concerns and questions raised, providing detailed responses, corresponding revision and additional experimental results as needed. Your insights have been instrumental in refining and strengthening our work.

As the discussion phase approaches its conclusion, we kindly request your feedback on our responses at your earliest convenience. Your input is invaluable in helping us ensure that any remaining concerns are thoroughly addressed. If there are any aspects of our responses that require further clarification, we would be glad to provide them within the available timeframe.

Thank you once again for your time and effort in reviewing our paper. We deeply value your engagement and look forward to your thoughts.

Best regards,

Authors #14101

Comment

This rebuttal is beyond incredible.

I promise I will increase my score to at least a 6 and possibly an 8 if I can fully understand the empirical side of this paper. But seeing these details raised even more questions and I'm incredibly worried. This might be an incredibly confused paper.

I will very temporarily lower my score to 3 -- and I hope this is not too discouraging to the authors because I appreciate the great amount of work they have done to improve the paper. Please understand that, as reviewers, we don't have the resources to run your code. Therefore, we have to rely on what you wrote. The current level of detail is incredibly worrisome.

For mathematical problems: partial credit for solution steps and reasoning quality.

For planning tasks: response relevance on a continuous scale.

I read through Appendix B.4. However, I don't see how you defined partial credit. I also don't see how to define response relevance. The authors clearly spent a lot of time revising the paper and drafting rebuttals; I wish they could have spent just a little bit more time to provide this extra information.

Provided a table of hyperparameters in Appendix B.6.

Are Table 8 hyperparameters for ALL tasks listed in B.5? I find that hard to believe, but please let me know if that's the case.

Figure 1

Now I understand the authors are training a belief network and a mix network. These are trained to produce continuous embeddings. My question is:

Is the learned embedding used by coordinator LLM or executor LLM to guide their response generation? If so, how? If not, then your entire algorithm is essentially just sampling from coordinator and executors till convergence? Can you explain how the learned prompt embedding is helping LLMs make better generations?

Section B.5 Task Setup

This section only explained each task. What is the training set for your optimization process? At inference time, do you run Algorithm 1 till convergence / early stop for EACH test query? How many updates/steps are typically seen for each test query? Multi-Agent debate already takes a long time to converge. How long does your method take to converge on average for these datasets?

Algorithm 1

Algorithm 1 shows $T_{\max}$ as the max number of iterations. Then, I checked Table 8 to see how high this number is (related to my previous question). Table 8 says $T_{\max}$ is temperature, and it's set to be 2. What is going on?

I hope the authors will explain a little bit more (in the simplest way possible):

  1. How is belief network and mix network being optimized?
  2. What happens at inference time for the test set data?

Right now I'm not sure I fully understand what is going on.

Comment

Thank you for your thorough review and valuable feedback on our work. We have addressed each of your questions in detail below.

Q1: Reward Setting

"I read through Appendix B.4. However, I don't see how you defined partial credit. I also don't see how to define response relevance."

Response:

We apologize for the lack of clarity in Appendix B.4 and appreciate the opportunity to elaborate on how we define our reward mechanism for mathematical problems and response relevance for planning tasks.

1. Mathematical Reasoning Tasks (e.g., GSM8K, SVAMP, MATH, GSM-Hard)

  • Goal: Emphasize both the correctness of the final answer and the clarity of step-by-step reasoning.
  • Evaluation: Performed by the coordinator LLM, which assigns a continuous reward based on:
    • Correctness of Final Answer
    • Quality of Reasoning Steps
    • Logical Coherence
    • Mathematical Rigor

Example:
For the problem 2x + 3 = 7:

  • Coordinator evaluates:
    • The solution process (e.g., subtracting 3 from both sides, dividing by 2)
    • The final answer (x = 2)
    • The logical flow and mathematical validity of each step

2. Planning Tasks (e.g., TravelPlanner, StrategyQA)

  • Goal: Generate coherent and actionable plans under specific constraints.
  • Evaluation: Coordinator LLM assigns a continuous reward based on:
    • Constraint Compliance
    • Plan Coherence
    • Solution Completeness
    • Practical Viability

Example:
In TravelPlanner:

  • Constraints: Budget limits, accommodation preferences, transportation options, and specific activities.
  • Evaluation: How well the plan satisfies constraints and creates a practical itinerary.
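
For illustration, here is a minimal sketch of eliciting such a continuous score (the rubric wording and the `coordinator_llm` callable are hypothetical, not our exact prompts):

```python
RUBRIC = (
    "Rate the candidate travel plan from 0.0 to 1.0 on each criterion: "
    "constraint compliance, plan coherence, completeness, practical viability. "
    "Return four numbers separated by commas."
)

def continuous_plan_score(plan: str, constraints: str, coordinator_llm) -> float:
    reply = coordinator_llm(f"{RUBRIC}\nConstraints: {constraints}\nPlan: {plan}")
    scores = [float(x) for x in reply.split(",")]
    return sum(scores) / len(scores)  # averaged into a single score in [0, 1]
```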

Q2: Hyperparameters in Table 8

"Are Table 8 hyperparameters for ALL tasks listed in B.5? I find that hard to believe, but please let me know if that's the case."

Response:

Thank you for bringing up this important point. The hyperparameters listed in Table 8 serve as a representative example for the GSM8K task with LLaMA 3.1 70B. In practice, we use different hyperparameter settings for different tasks and models to account for their unique characteristics. This approach of adapting hyperparameters across tasks has also been highlighted in the comparative method TS-LLM (arXiv:2309.17179).

Here are some task-specific configurations for different models:

For MATH dataset:

| Model          | Learning Rate (η) | Episodes | Batch Size | Buffer Size | Update Interval |
|----------------|-------------------|----------|------------|-------------|-----------------|
| LLaMA 3.1 8B   | 0.0015            | 150      | 32         | 32          | 4               |
| LLaMA 3.1 70B  | 0.0010            | 150      | 32         | 32          | 8               |
| LLaMA 3.1 405B | 0.0005            | 150      | 16         | 32          | 16              |

For GSM-Hard dataset:

| Model          | Learning Rate (η) | Episodes | Batch Size | Buffer Size | Update Interval |
|----------------|-------------------|----------|------------|-------------|-----------------|
| LLaMA 3.1 8B   | 0.0012            | 140      | 24         | 32          | 4               |
| LLaMA 3.1 70B  | 0.0010            | 140      | 24         | 32          | 8               |
| LLaMA 3.1 405B | 0.0008            | 140      | 16         | 32          | 16              |

Key adjustments are made to:

  • Learning Rate (η): Adjusted based on model size and task complexity
  • Batch Size: Adjusted based on model size
  • Update Interval: More frequent updates for smaller models, less frequent for larger models
  • Network Architecture: Remains consistent with Table 8 across all tasks

We will include additional tables in the revised manuscript to present the complete hyperparameter configurations for all tasks and models.

Comment

Q3: Learned Embedding and Optimization Phase

"Is the learned embedding used by coordinator LLM or executor LLM to guide their response generation? Can you explain how the learned prompt embedding is helping LLMs make better generations?"

Response:

Yes, the learned embeddings are used by the executor LLMs to guide their response generation.

To help LLMs make better generations, we introduced incomplete information game theory into the multi-agent LLM framework (Sec 3.1) to address the token consumption issue in Multi-Agent Debate. To optimize the system towards a Bayesian Nash Equilibrium, we incorporated the concept of "belief." In LLMs, belief is reflected in how they generate outputs, and our optimization goal is to achieve belief coordination among all executors, where each executor's strategy is the best response to the outputs of others.

The theoretical reasoning that motivates this is as follows: Theorem 1 proves the existence of a BNE in the MA-LLM setting, and Proposition 1 shows that it can be reached by modifying the prompt embeddings. Lemma 1 then analyzes the resulting advantages over MAD under this framework. We therefore optimize our framework towards the BNE with the following procedure:

Optimization Process:

  1. Belief State and Prompt Embedding Adjustment:

    • Each executor LLM maintains a belief state b_i, which is used to adjust its prompt embedding
    • These embeddings influence the LLM's generation strategy, guiding their response generation
    • The mix network coordinates belief information among all executors to align strategies
  2. Inference and Commitment Generation:

    • At inference time t, each executor LLM generates an answer based on its belief state b_i^{t-1} from the previous step
    • The coordinator LLM aggregates these answers to form a final commitment and calculates the corresponding reward and Similarity Difference (SD) loss
  3. Optimization and Belief Update:

    • The mix network calculates the Temporal Difference (TD) loss based on the reward
    • Gradient descent is performed using the SD loss and TD loss to update the mix network and belief networks
    • New belief states are obtained for the next inference step

Through this process, the learned embeddings guide the executor LLMs to generate responses that are more consistent and coordinated, enabling improved overall performance.

Q4: Training Set, Inference Procedure, and Convergence Time

"This section only explained each task. What is the training set for your optimization process? At inference time, do you run Algorithm 1 till convergence / early stop for EACH test query? How many updates/steps are typically seen for each test query? Multi-Agent debate already takes a long time to converge. How long does your method take to converge on average for these datasets?"

Response:

What is the training set for your optimization process?

We train on each task's training set until early stopping. For example, on the MATH task with LLaMA 3.1 70B, we optimize on its training set for approximately 130-200 parameter updates, with each update using a batch of 32 samples, until early stopping is reached, rather than training on the entire training set. Early stopping is reached earlier with LLaMA 3.1 405B.

At inference time, do you run Algorithm 1 till convergence/early stop for EACH test query?

No, during inference on the test set, we do not perform any further training or parameter updates. We only use the trained framework for a single inference iteration. This ensures efficient inference without the computational overhead of convergence per query.

How many updates/steps are typically seen for each test query?

  • Only one iteration is performed using the trained framework
  • No parameter updates during inference

How long does your method take to converge on average for these datasets?

During the training phase, our method converges in approximately 130 to 200 parameter updates on datasets like MATH with LLaMA 3.1 70B, while the inference phase requires only a single iteration using the trained framework, without any per-query convergence.

Comment

Q5: Network optimization?

How is belief network and mix network being optimized?

Response:

Please see the response in Q3.

Q6: What Happens at Inference Time for the Test Set Data?

"What happens at inference time for the test set data?"

Response:

During the inference phase, the trained framework is directly applied to test set data without any further updates to the belief and mix networks. This ensures that the evaluation process remains efficient and fair. Below, we provide an overview of the key steps during both the training and inference phases to clarify how the framework operates.

Training Phase:

  • Optimizes belief and mix networks using on-policy training
  • Data stored in a replay buffer for updates
  • Early stopping criteria applied to prevent overfitting

Inference Phase:

  • No Further Training: Belief and mix networks remain fixed
  • Response Generation: Executor LLMs generate responses based on learned policies
  • Coordinator LLM aggregates responses to produce the final output
  • Only one iteration is performed during the inference of test set data.
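
As an illustrative sketch of the two phases (all helper names are assumed for exposition and do not correspond to our codebase):

```python
# Belief/mixing networks are updated only during training; each test query is
# answered with a single forward pass and no parameter updates.
def train(framework, train_set, max_updates: int = 200):
    for _ in range(max_updates):
        framework.collect_trajectories(train_set)         # on-policy rollouts -> replay buffer
        batch = framework.replay_buffer.sample(framework.batch_size)
        framework.update_networks(batch)                  # joint TD + SD gradient step
        if framework.early_stopping_met():
            break

def infer(framework, query: str) -> str:
    answers = [ex.generate(query) for ex in framework.executors]   # beliefs stay fixed
    return framework.coordinator.aggregate(query, answers)         # one iteration only
```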

Since we cannot upload the updated PDF at this time, we have summarized our planned modifications below:

  • Clarify Reward Settings: We will add clear definitions and examples of "partial credit" and "response relevance," detailing how model performance is assessed for mathematical problems and planning tasks, in Appendix B.4.

  • Correct Hyperparameter Notations: We will clarify the confusion between the maximum iterations in Algorithm 1 and the temperature in Table 8, and provide task-specific hyperparameter settings where needed, in Appendix B.6.

  • Detail the Optimization Process: We will use intuitive language and examples to explain how the learned embeddings guide executor LLMs' responses, in Section 3.3.2.

  • Clarify Training and Inference Procedures: We will distinguish between training and inference operations, emphasizing that no further training occurs on the test set and inference is only for evaluation, in Section 3.3.2 and Appendix B.5.


We sincerely hope that our responses address your concerns and provide a better understanding of our work. Thank you again for your valuable feedback.

Best regards,
Authors #14101

Comment

Dear Reviewer PYck,

Thank you for your careful review and important questions. We would like to address your concerns point by point:

  1. Regarding Table 8: We acknowledge that the current label "Hyperparameters of EcoNash" could be misleading. We want to clarify that Table 8's primary purpose is to demonstrate the network parameter settings of the EcoNash framework that are common across all tasks. While different models may require some adjustments in learning rate, episode batch size, and update interval, we have not concealed this aspect. We have been transparent about these task-specific settings in our codebase, specifically in config.yaml and the README documentation (line 154). We will revise the label of Table 8 to make this distinction clearer in the updated manuscript.

  2. Regarding the embeddings and Together API: We want to clarify that we do not send embeddings to executor LLMs to guide their response generation. According to the official Together API documentation (https://docs.together.ai/reference/chat-completions-1), we can only adjust the API parameters to influence the output. As we explicitly stated in Section 3.3.2 (Individual Belief Network, line 238), our approach involves adjusting the $T_i$ and $P_i$ parameters to influence the LLM outputs. This implementation can be verified in our code, specifically in the llm_wrapper.py file, where the DynamicParamNetwork (line 25) and APIHandler (line 85) classes demonstrate how we achieve this through API parameter adjustments rather than embedding inputs, as sketched below.
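
Here is a minimal sketch of that interaction; the `chat_complete` helper is hypothetical and stands in for the chat-completions client, and the point is only that executors are steered through sampling parameters predicted from the belief state, never through injected embeddings:

```python
def query_executor(chat_complete, model: str, prompt: str, belief_net, belief_state):
    params = belief_net.predict_params(belief_state)   # e.g. {"temperature": T_i, "top_p": P_i}
    return chat_complete(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=params["temperature"],
        top_p=params["top_p"],
    )
```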

We hope these clarifications address your concerns. We appreciate your feedback as it helps us improve the clarity and transparency of our work.

Best regards,

Authors #14101

Review
6

The paper proposes EcoNash, a hierarchical reinforcement learning framework for scalable, multi-agent reasoning by using Bayesian Nash Equilibrium (BNE) to enhance coordination among large language models (LLMs). This paper first proposes a tight bound for MA-LLM's performance improvement. It then introduces EcoNash, which reduces the communication and computational costs typical in multi-agent systems by enabling LLMs to independently generate optimal responses based on their own beliefs. Experimental results show EcoNash surpasses single and multi-agent models in complex reasoning tasks and proves to be effective at scaling with increased model ensemble sizes.

Strengths

  1. The introduction is organized and well-written.
  2. The experiments are comprehensive, including multiple baselines and different sizes of LMs.
  3. The empirical results demonstrate the superiority of the proposed method against the baselines.
  4. The method can be scaled by increasing the number of Agents and improving performance.

Weaknesses

  1. I don't see a strong connection between Sec 3.2 and the following method
  2. Fig 1 should be clearer.
  3. Lack of information about the model structure of the belief encoders and hyperparameters of optimizing them.
  4. Since the proposed method needs to tune a Q function, the author should include a baseline with a learned action-value function, e.g. [1][2]
  5. In Table 4, the author may also show the performance w.r.t. the token consumption.

Questions

  1. Line 196, why differences of Q-values can be bound by separating estimation errors and policy suboptimality? There should be some references.
  2. Appendix A.3 should be named 'Assumptions' instead of 'PROOF OF PROPOSITION', and several references are recommended here. It's the most direct way to show that these are standard assumptions.
  3. What does assumption 3 mean in Appendix A.3?
  4. Equation 1, how to update target Q network weight? Is the Q target value missing an input of belief?
  5. Will the Coordinator LM and the Executor LM be tuned?
  6. What's the computation cost of tuning the Belief Encoders and Coordinator LM?
  7. For each task/dataset and each LLM, a new group Belief Encoder and Q values will be trained. Is that correct or wrong?

Reference

[1] Feng, Xidong, et al. "Alphazero-like tree-search can guide large language model decoding and training." arXiv preprint arXiv:2309.17179 (2023).

[2] Liu, Jiacheng, et al. "Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding." First Conference on Language Modeling. 2024.

Comment

We sincerely thank Reviewer bum5 for the valuable feedback. We have addressed all the comments you provided. Please find our point-by-point responses below. Any further comments and discussions are welcome!


Q1: The connection between Section 3.2 and the following method

"I don't see a strong connection between Section 3.2 and the following method."

Reply:

Thank you for your valuable feedback. In Section 3.1, we define the Bayesian Nash Equilibrium (BNE) within the context of multi-agent LLMs, including the evaluation method and the proof of its existence. In Section 3.2, we build upon these definitions to introduce Lemma 1, which allows us to evaluate the performance of multi-agent LLM systems through Bayesian regret analysis. Our analysis demonstrates that EcoNash achieves a sublinear regret bound, in contrast to the linear regret of existing multi-agent debate methods. In the subsequent Section 3.3, we present the specific implementation of our architecture based on the assumptions satisfying Lemma 1 (as detailed in the Appendix). This implementation is then correlated with the results in Section 4 (Experiments), providing a cohesive flow from theoretical foundations to practical application.


Q2: Presentation problem in Figure 1

"Figure 1 should be clearer."

Reply:

We appreciate your feedback. We have updated our manuscript and provided a new Figure 1. In the revised figure, we have separated the inference and optimization phases into left and right parts, respectively. Additionally, we have provided detailed structures and data flow for each network in the optimization part. This redesign aims to enhance clarity and improve the visual representation of our framework.


Q3: Missing details about belief encoders and hyperparameters

"Lack of information about the model structure of the belief encoders and hyperparameters of optimizing them."

Reply:

Thank you for pointing this out. We have revised Section 3.3.2 (Optimization Phase) in the updated manuscript (see the blue-highlighted parts). We have added detailed descriptions of the belief encoders:

“The belief encoder $f_e(\cdot; \theta_e)$ aggregates the belief states from all agents to generate a group-level representation $E = f_e(\{b_i\}_{i=1}^N; \theta_e)$, using multi-head attention with $H$ attention heads to capture inter-agent relationships. Each head is computed as:........ ”
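
For illustration, a minimal sketch of this aggregation step (the head count and pooling choice are assumed for exposition, not the exact manuscript configuration):

```python
# Multi-head self-attention over the N per-agent belief states, pooled into a
# group-level representation E.
import torch
import torch.nn as nn

class BeliefEncoder(nn.Module):
    def __init__(self, d: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=n_heads, batch_first=True)

    def forward(self, beliefs: torch.Tensor) -> torch.Tensor:
        """beliefs: (B, N, d) -> group representation E: (B, d)."""
        attended, _ = self.attn(beliefs, beliefs, beliefs)   # attention across agents
        return attended.mean(dim=1)
```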

We have also added Appendix B.5 to detail all the hyperparameters used to train EcoNash, providing transparency and facilitating reproducibility.


Q4: Additional performance comparison with a baseline with a learned action-value function

"Since the proposed method needs to tune a Q function, the author should include a baseline with a learned action-value function, e.g., [1][2]."

Reply:

Thank you for this valuable suggestion. We have compared our method with the two methods you mentioned ([1] and [2]) that use learned action-value functions. We have included the updated results in Table 1 of the revised manuscript. Comparing with these baselines enhances the persuasiveness of our method by demonstrating its advantages over approaches that also learn action-value functions. Specifically:

| Benchmark  | Improvement over TS-LLM (%) | Improvement over PPO-MCTS (%) |
|------------|-----------------------------|-------------------------------|
| GSM8K      | 2.7%                        | 4.9%                          |
| GSM-Hard   | 10.9%                       | 13.5%                         |
| SVAMP      | 3.0%                        | 3.0%                          |
| StrategyQA | 4.2%                        | 5.1%                          |
| MATH       | 4.9%                        | 8.2%                          |

Analysis

  1. GSM8K: EcoNash achieves a 2.7% higher average score compared to TS-LLM and a 4.9% improvement over PPO-MCTS. This demonstrates its superior performance in solving standard mathematical problems.

  2. GSM-Hard: EcoNash achieves a 10.9% improvement over TS-LLM and a 13.5% improvement over PPO-MCTS. This highlights its significant advantage in tackling complex mathematical challenges.

  3. SVAMP: EcoNash outperforms TS-LLM by 3.0% and PPO-MCTS by 3.0%, showcasing its efficiency in arithmetic and mathematical reasoning tasks.

  4. StrategyQA: EcoNash achieves a 4.2% higher average score compared to TS-LLM and a 5.1% improvement over PPO-MCTS, demonstrating its superior understanding and decision-making capabilities in strategic question-answering tasks.

  5. MATH: EcoNash outperforms TS-LLM by 4.9% and PPO-MCTS by 8.2%, highlighting its strength in addressing advanced mathematical problems.


Comment


Q5: Problem in Table 4

"In Table 4, the author may also show the performance with respect to the token consumption."

Reply:

Thank you for this comment. We have updated Table 4 to include the performance metrics with respect to token consumption. This addition provides a more comprehensive evaluation of our method's efficiency and resource utilization.


Q6: Why can differences of Q-values be bound by separating estimation errors and policy suboptimality?

"Line 196, why can differences of Q-values be bound by separating estimation errors and policy suboptimality? There should be some references."

Reply:

Thank you for this insightful comment. The decomposition of Q-value differences into estimation errors and policy suboptimality is grounded in fundamental principles of reinforcement learning and value function approximation. This separation is valid and useful due to the following reasons:

  • Additive Nature of the Errors: The Q-value difference between the learned policy and the optimal policy can naturally be expressed as a sum of two components:

    1. Errors arising from imperfect estimation of the value function (estimation errors).
    2. Deviations due to the learned policy not being optimal (policy suboptimality). Since these two sources of errors are orthogonal in nature—one stemming from approximation inaccuracies and the other from the choice of suboptimal actions—they can be analyzed separately.
  • Bellman Error Decomposition: The Bellman equation provides a framework to propagate estimation errors through the value function. Errors in the Q-function (e.g., Temporal Difference errors) propagate through iterative updates. When the learned policy deviates from the optimal policy, this propagation introduces policy suboptimality. By isolating these two terms, their individual contributions to the overall Q-value difference can be explicitly analyzed and bounded.

  • Theoretical Insights from Approximation Theory: Function approximation theory allows us to characterize errors introduced by approximating the value function (e.g., using neural networks). These errors are independent of the policy's suboptimality, which depends on how the policy interacts with the environment. This independence permits a clean separation of the two effects.
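
Schematically (in our notation), the decomposition referred to above is:

```latex
% \hat{Q} is the learned estimate and \hat{\pi} the policy derived from it.
\[
Q^{*}(s,a) - Q^{\hat{\pi}}(s,a)
= \underbrace{\big(Q^{*}(s,a) - \hat{Q}(s,a)\big)}_{\text{estimation error}}
+ \underbrace{\big(\hat{Q}(s,a) - Q^{\hat{\pi}}(s,a)\big)}_{\text{policy suboptimality}},
\]
% so the total gap is bounded by the sum of the two terms' magnitudes, which
% can then be analyzed and controlled separately.
```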

References:

  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
    Discusses the separation of estimation and policy errors in the context of TD learning.

  2. Kakade, S. M., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. Proceedings of the 19th International Conference on Machine Learning (ICML), 206–213.
    Explores bounding errors in approximate RL by separating approximation and policy impacts.

  3. Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. Proceedings of the 35th International Conference on Machine Learning (ICML), 1587–1596.
    Highlights how separating approximation and policy errors enhances actor-critic stability.

  4. Munos, R. (2005). Error Bounds for Approximate Policy Iteration. Proceedings of the 22nd International Conference on Machine Learning (ICML), 560–567.
    Provides theoretical bounds for policy suboptimality and its relationship to approximation errors.

This decomposition is theoretically valid because the Q-value differences can be independently affected by how well the value function is estimated and how optimal the policy is. Separating these two allows for deeper insights into their contributions and potential remedies.


Comment

Q7: Problem about Appendix A.3 (explanation and related references)

"1. Appendix A.3 should be named 'Assumptions' instead of 'PROOF OF PROPOSITION', and several references are recommended here. It's the most direct way to show that these are standard assumptions."
"2. What does Assumption 3 mean in Appendix A.3?"

Reply:

Thank you for this comment. We have renamed Appendix A.3 to "Assumptions" instead of "PROOF OF PROPOSITION" to accurately reflect its content. Because the previous version was not sufficiently readable, we have also rewritten Appendix A.3 with complete explanations for each assumption.

Regarding Assumption 3: it essentially describes how information flows and remains stable in our multi-agent learning framework. We break it down intuitively below.

The assumption states that for any two time points $t_1 < t_2$ where the entropy difference $H_{t_1} - H_{t_2}$ is bounded by $\log(2)$, the mutual information between an agent's type $\theta_i$ and the coordination outcome $\xi$ at the earlier time $t_1$ is at most $4\eta$ times the mutual information at the later time $t_2$.
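
For reference, a compact restatement of this condition is given below (the time subscript on $I$ is our shorthand for the time at which the mutual information is evaluated):

$$
H_{t_1} - H_{t_2} \le \log 2
\;\;\Longrightarrow\;\;
I_{t_1}\bigl(\theta_i;\, \xi(a_i, a_{-i})\bigr)
\le 4\eta \, I_{t_2}\bigl(\theta_i;\, \xi(a_i, a_{-i})\bigr),
\qquad \text{for all } t_1 < t_2 .
$$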

Key Purposes:

a) Stability of Information Flow: It ensures stability in how information flows between language models over time. This prevents both information collapse (where agents lose too much information) and explosion (where information growth becomes uncontrollable).
b) Quantifying Influence: Through the mutual information term $I(\theta_i; \xi(a_i, a_{-i}))$, we quantify how much an agent's type influences its ability to coordinate actions with others. This is crucial for understanding how language models learn to work together.
c) Regulated Adaptation: The constant $\eta$ acts as a regulatory parameter that controls how quickly agents can adapt their strategies based on what they learn from interactions. This ensures learning happens at a manageable pace.

This type of regularity condition is well-established in the literature, particularly in information design in games [Bergemann and Morris, 2019], and has proven especially relevant for language model coordination tasks [Andreas et al., 2020].

The assumption is essential for proving the convergence properties of our learning algorithm and ensuring that the multi-agent system remains stable during training.


Q8: About the update of target Q-network weights

"Equation 1, how to update target Q-network weights? Is the Q-target value missing an input of belief?"

Reply:

Thank you for your valuable feedback, and we apologize for the previous lack of clarity regarding the belief state description.

We first updated the description of the relationship between the belief state and the observation in the MA-LLM framework (line 127) and then introduced the parameters of the belief network. Correspondingly, we revised all relevant descriptions of the optimization process in Sec 3.3.2 to make the optimization of the belief state explicit. Our optimization procedure is summarized as follows:

  1. Each execution LLM maintains its belief state $\mathbf{b}_i \in \mathbb{R}^d$ and receives observations $O_i = [e_t, e_s, \mathbf{b}_i]^\top$, where $e_t$ encodes the task and $e_s$ represents the coordinator's strategy.
  2. The belief network is defined as $B_i(\mathbf{\tau}_i, O_i; \theta_i^B)$, which updates each agent's state based on its history $\mathbf{\tau}_i$ and current observation, generating prompt embeddings $\mathbf{e}_i$.
  3. The belief network outputs a set of prompt embeddings $\mathbf{e}_i^t$ for $i = 1, \dots, N$ and a set of individual Q-values $Q_i^t$ for $i = 1, \dots, N$, which are used in the mixing network.
  4. The belief encoder aggregates the belief states from all agents to generate a group-level representation, using multi-head attention with $H$ attention heads to capture inter-agent relationships.
  5. The Centralized Mixing Network is designed to coordinate belief information from the execution LLMs, receiving:
    • A set of prompt embeddings $\mathbf{e}_i^t$ for $i = 1, \dots, N$,
    • The group-level representation $\mathbf{E}^t$,
    • A set of individual Q-values $\{Q_i^t\}_{i=1}^N$.
  6. After achieving commitment, the mixing network computes:
    • The similarity difference (SD) loss,
    • The global value function $Q_{\text{tot}}^t$.
  7. A combined loss is then used to update all parameters.

The specific modifications can be observed in Figure 1 and Sec 3.3.
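
To make the pipeline above concrete, here is a minimal, self-contained PyTorch sketch of the forward pass (steps 1–6). It is our own simplification for illustration, not the released implementation: the per-agent belief network is a GRU cell, the group-level representation comes from multi-head self-attention over the agents' embeddings, and a QMIX-style mixer with non-negative weights combines the individual Q-values into $Q_{\text{tot}}$. For brevity the mixer conditions only on the pooled group representation, and the SD loss and combined training loss of step 7 are omitted.

```python
import torch
import torch.nn as nn

class BeliefNet(nn.Module):
    """Per-agent belief network: (observation, hidden history) -> (prompt embedding, Q_i)."""
    def __init__(self, obs_dim, hid=64, emb_dim=32):
        super().__init__()
        self.gru = nn.GRUCell(obs_dim, hid)     # summarizes the agent's history tau_i
        self.to_emb = nn.Linear(hid, emb_dim)   # prompt embedding e_i^t
        self.to_q = nn.Linear(hid, 1)           # individual Q-value Q_i^t

    def forward(self, obs, h):
        h = self.gru(obs, h)
        return self.to_emb(h), self.to_q(h).squeeze(-1), h

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: absolute-valued weights keep Q_tot monotone in every Q_i."""
    def __init__(self, n_agents, emb_dim, hid=32):
        super().__init__()
        self.n_agents, self.hid = n_agents, hid
        self.w1 = nn.Linear(emb_dim, n_agents * hid)
        self.b1 = nn.Linear(emb_dim, hid)
        self.w2 = nn.Linear(emb_dim, hid)
        self.b2 = nn.Linear(emb_dim, 1)

    def forward(self, q_agents, group_repr):        # q_agents: (B, N), group_repr: (B, emb_dim)
        B = q_agents.size(0)
        w1 = torch.abs(self.w1(group_repr)).view(B, self.n_agents, self.hid)
        b1 = self.b1(group_repr).view(B, 1, self.hid)
        hidden = torch.relu(torch.bmm(q_agents.unsqueeze(1), w1) + b1)   # (B, 1, hid)
        w2 = torch.abs(self.w2(group_repr)).view(B, self.hid, 1)
        b2 = self.b2(group_repr).view(B, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(B)                      # Q_tot: (B,)

# Toy forward pass with made-up sizes.
B, N, obs_dim, emb_dim = 4, 3, 16, 32
belief_nets = nn.ModuleList([BeliefNet(obs_dim, emb_dim=emb_dim) for _ in range(N)])
attn = nn.MultiheadAttention(embed_dim=emb_dim, num_heads=4, batch_first=True)
mixer = MonotonicMixer(N, emb_dim)

obs = torch.randn(B, N, obs_dim)                  # stands in for [e_t, e_s, b_i] observations
hiddens = [torch.zeros(B, 64) for _ in range(N)]
embs, qs = [], []
for i, net in enumerate(belief_nets):
    e_i, q_i, hiddens[i] = net(obs[:, i], hiddens[i])
    embs.append(e_i)
    qs.append(q_i)
embs = torch.stack(embs, dim=1)                   # (B, N, emb_dim) prompt embeddings
q_agents = torch.stack(qs, dim=1)                 # (B, N) individual Q-values
group, _ = attn(embs, embs, embs)                 # multi-head attention over agents
group_repr = group.mean(dim=1)                    # pooled group-level representation E^t
q_tot = mixer(q_agents, group_repr)               # global value function Q_tot
print(q_tot.shape)                                # torch.Size([4])
```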

Comment

Q9: Will the LMs be tuned?

"Will the Coordinator LM and the Executor LM be tuned?"

Reply:

Thank you for your question. The EcoNash framework does not directly fine-tune the LLMs. Instead, it focuses on adjusting the prompt embeddings and training the belief networks to influence the outputs of the LLMs, as mentioned in line 236. By leveraging the coordinator's guidance, the system is directed toward a BNE in an incomplete information game setting.

Theoretical support for this framework includes:

  1. Theorem 1: Establishes the existence of a BNE under the given setting.

  2. Proposition 1: Demonstrates that adjusting the prompt embeddings using TD loss leads to convergence to the BNE.

  3. Lemma 1: Confirms that our approach achieves sublinear regret in the multi-agent LLM system, ensuring effective convergence without directly fine-tuning the LLMs.

  4. Appendix A.5: Guarantees the monotonicity of the mixing network, ensuring that during the optimization process, improvements in individual agent performance positively impact global coordination, enabling stable convergence to the equilibrium.
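
As a concrete illustration of the training boundary described above, the sketch below keeps the LLM as a gradient-free black box and lets a TD-style loss update only the small belief/prompt modules. It is our own simplification: `query_llm`, the layer sizes, and the one-step reward are hypothetical, not taken from the paper's code.

```python
import torch
import torch.nn as nn

def query_llm(prompt_embedding):
    # Stand-in for an API call to a frozen LLM; no gradient flows through it.
    return "<LLM response text>"

obs_dim, hid, emb_dim = 16, 64, 32
belief = nn.GRUCell(obs_dim, hid)          # trainable belief network
prompt_head = nn.Linear(hid, emb_dim)      # trainable: produces the prompt embedding
q_head = nn.Linear(hid, 1)                 # trainable: individual Q-value
opt = torch.optim.Adam(
    list(belief.parameters()) + list(prompt_head.parameters()) + list(q_head.parameters()),
    lr=3e-4,
)

obs, h = torch.randn(1, obs_dim), torch.zeros(1, hid)
h = belief(obs, h)
prompt_emb, q_value = prompt_head(h), q_head(h)
_ = query_llm(prompt_emb.detach())               # frozen LLM generates the answer
reward = torch.tensor([[1.0]])                   # e.g., the answer was judged correct
loss = (reward - q_value).pow(2).mean()          # terminal TD target (no bootstrap term)
opt.zero_grad(); loss.backward(); opt.step()     # only the small modules above are updated
```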


Q10: Computation cost

"What's the computation cost of tuning the Belief Encoders and Coordinator LM?"

Reply:

Thank you for this comment. We do not fine-tune any LLMs in our framework. The computation cost involves training the belief encoders and the mixing network, which is handled in a multi-agent reinforcement learning manner.

As detailed in Appendix B.5, we have provided the hyperparameters of EcoNash. The estimated total number of trainable parameters is around 2 million, ensuring that the computational overhead remains manageable.

For inference, we used the Together API. Regarding performance validation on a math dataset using LLaMA 3.1 405B, the approximate cost was $75.


Q11: For each task and each LLM, a new group Belief Encoder and Q-values will be trained. Is that correct or wrong?

"For each task/dataset and each LLM, a new group Belief Encoder and Q-values will be trained. Is that correct or wrong?"

Reply:

Thank you for this question. Yes, that is correct. For each different task setting and each combination of LLMs, we train different Belief Encoders, individual Belief Networks, and Mixing Networks. This approach allows us to adapt to different tasks and enables the system to converge towards the BNE in various settings. Tailoring the networks to specific tasks and LLM combinations ensures optimal performance and coordination among agents.


We hope that our responses have addressed your concerns satisfactorily. We are grateful for your insightful comments, which have helped us improve the clarity and quality of our manuscript.

Comment

Thank you to the authors for the detailed response to my questions; it effectively addresses my doubts and concerns about the method. I will seriously consider raising my score.

Comment

Dear Reviewer bum5,

Thank you for recognizing our work and for raising your score accordingly! Based on your feedback, we have revised the relevant sections and outlined the main changes to enhance presentation and soundness as follows:

  1. Presentation

    • Readability and Intuitiveness
      In Section 3.1, we have provided a more intuitive explanation of BNE to enhance readability and understanding.

    • Detailed Description of Optimization
      We have updated Figure 1 for clearer visualization, expanded the explanations of each component in Section 3.3.2, and added a detailed description of the reward settings in Appendix B.4.

    • Supplementary Dataset Descriptions
      Detailed descriptions of each dataset are now included in Appendix B.5 to provide better context and understanding of the data used.

    • Coherent Theoretical Explanations
      We have included theoretical proofs in Theorem 1 demonstrating the existence of BNE and in Proposition 1 showing how prompt embeddings can be effectively tuned to achieve BNE. Additionally, Lemma 1 establishes a lower regret bound based on Bayesian Regret, and Appendix A.5 provides a proof of monotonic improvement to further support our framework.

  2. Soundness

    • Assumptions and References
      We have elaborated on the assumptions of Lemma 1 and included relevant references to support them.

    • Bayesian Regret Expansion
      The Bayesian regret derived from Lemma 1 has been further detailed in Appendix B.2 and B.3 to ensure comprehensive understanding.

    • Hyperparameters and Code Update
      Appendix B.6 now contains detailed descriptions of all hyperparameters, and we have updated our implementation code to reflect these changes and improve reproducibility.

    • Experimental Setup Enhancements
      We revised the experimental setup section to include two baseline comparisons with a learned action-value function and clarified how we adhere to token length instructions within our framework.

    • Strategy and Format Examples
      Appendix D.2 includes detailed examples of strategies and formats generated by the coordinator LLM to illustrate their application.

We kindly ask you to review our responses and confirm if you have any further questions. Any additional comments and discussions are welcome!

Thank you once again for your valuable feedback and time.

Best regards,

Authors of #14101

Comment

Dear Reviewer bum5,

We deeply appreciate your detailed review and insightful comments. Your feedback has significantly improved the quality of our submission.

Revisions Made

In our rebuttal, we have comprehensively addressed your concerns and enhanced the manuscript's readability by providing additional explanations. Specifically, we have:

  • Added detailed information on the implementation in Sections 3.1.1 and 3.3.2.
  • Included proofs in Appendices A.3, B.2, and B.3.
  • Elaborated on the reward settings in Appendix B.4.
  • Detailed the dataset setup in Appendix B.5.
  • Provided a table of hyperparameters in Appendix B.6.
  • Expanded the formats and strategies generated by the coordinator in Section D.2.

We kindly request your feedback on our responses, especially as the deadline for submitting the revised PDF is approaching. Your further insights would be invaluable to ensure that our revisions meet the necessary standards.

Thank you once again for your time and constructive input; any additional comments and discussions are welcome!

Best regards,

Authors #14101

AC Meta-Review

This paper proposes a hierarchical reinforcement learning framework for scalable multi-agent reasoning using Bayesian Nash Equilibrium. Multi-agent reasoning could potentially alleviate many problems encountered with single-agent reasoning, and hence the topic could appeal to a broad community. Yet the proposed method is not well presented. In particular, the DEC-MDP is described somewhat vaguely, and there are relatively few details about the training and inference of the model. The connection between the theoretical part and the overall inference framework is also unclear; that is, the experimental methods are somewhat disconnected from the theory. Furthermore, the improvements observed in the experiments cannot fully justify the complex design of the method.

Additional Reviewer Discussion Comments

Two main concerns are shared by most reviewers: first, the presentation of the paper, which remains an issue after the rebuttal; second, the setting and details of the experiments. The authors tried to clarify these points during the discussion period, yet most reviewers found the clarifications unconvincing.

Final Decision

Reject