Q4: Lack of detail in experiments

A4: Thanks for your suggestion. We have carried out additional experiments to enrich our presentation. Below we report how the number of agents affects performance on the quantitative experiments with GPT-3.5-turbo. We run each setting three times on HumanEval (coding) and MGSM (math) and report the averaged performance.

(As a brief recap: Solo means 1 role-assigner agent + 1 decision-making agent + 1 evaluation agent; Group is the same except that there are multiple decision-making agents.)

	CoT	Solo	Group-2	Group-3	Group-4
Mathematical Reasoning
Programming
Average Performance

where Group-x indicates x decision-making agents. In general, using the AgentVerse framework with one to three decision-making agents gives satisfactory results, and the averaged performance is highest with 3 decision-making agents. Although these datasets primarily test specific agent abilities and do not fully exploit the diversity of a multi-agent setup, we still observe an upward trend in average performance as the number of agents increases up to three. The diminishing returns upon further scaling can be attributed to communication inefficiencies, as discussed in Section 3.1.
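For concreteness, the aggregation behind the table can be sketched as follows. This is a minimal illustration, not our evaluation code: it assumes that "Average Performance" is the unweighted mean of the two per-task averages, the function name `aggregate` is hypothetical, and the scores in the example are placeholders rather than experimental results.

```python
from statistics import mean


def aggregate(runs_by_task):
    """Average each setting over its runs, then average across tasks.

    runs_by_task maps task name -> setting name -> list of per-run scores.
    Returns (per_task_means, overall_means).
    Assumption: every task reports the same set of settings.
    """
    # Step 1: mean over the repeated runs of each (task, setting) pair.
    per_task = {
        task: {setting: mean(scores) for setting, scores in by_setting.items()}
        for task, by_setting in runs_by_task.items()
    }
    # Step 2: unweighted mean across tasks for each setting (assumed definition).
    settings = list(next(iter(per_task.values())))
    overall = {s: mean(per_task[t][s] for t in per_task) for s in settings}
    return per_task, overall


# Placeholder scores for illustration only; NOT results from the experiments.
example = {
    "MGSM": {"Solo": [0.0, 0.5, 1.0], "Group-2": [1.0, 1.0, 1.0]},
    "HumanEval": {"Solo": [0.5, 0.5, 0.5], "Group-2": [0.0, 0.0, 0.0]},
}
per_task, overall = aggregate(example)
```

Here `per_task` corresponds to the Mathematical Reasoning and Programming rows, and `overall` to the Average Performance row.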

Meanwhile, we emphasize that AgentVerse's true potential is best observed in more complex challenges. Our case studies, tool utilization experiments, and Minecraft game-playing scenarios are prime examples in which the benefits of the multi-agent framework are more pronounced.

Q5: Missing analyses and explanations

A5: In Section 3.1, we discuss the performance variance between the Group and Solo settings for GPT-3.5-turbo and GPT-4. We also examine the conflict resolution capabilities of LLMs in multi-agent systems. We will add more detailed explanations of these phenomena in the revision.