PaperHub
Score: 6.1/10
Decision: Rejected · 4 reviewers
Ratings: 4, 3, 3, 3 (min 3, max 4, std 0.4)
ICML 2025

Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-06-18

Abstract

Keywords
language agent · multi-agent system

Reviews and Discussion

Official Review
Rating: 4

The authors propose a sequence of three phases to come up with a high-performing agentic system. The phases are: 1. individual “block” prompt optimization as warmup, 2. the topology refinement phase without altering the prompts, and 3. joint prompt optimization for all “blocks”. In phase 2 the authors score the topologies with the introduced “incremental influence” metric.

The authors verify the proposed method on an extensive set of 8 benchmarks. Even though the majority of the experiments were done on a single model, Gemini 1.5 Pro, the authors verify the findings on Claude 3.5 Sonnet. The authors compare against an extensive set of 6 baselines, including strong competitors. For prompt optimization, the authors use the off-the-shelf MIPRO optimizer.

Questions for Authors

In Formula 1 it is not clear why $a$ is a function of a single sample $x$ (referring to $a(x)$). Is the workflow of the configuration $a$ a singular version applied to all samples of the dataset $\mathcal{D}$?

There is no formal definition of a “block” in Section 2.

Line 307. It’s not clear what a “proper” prompt design means.

Claims and Evidence

The work claims that the combination of prompt and topology optimization is the key for high MAS performance. This claim is well supported by the experimental results.

While topology optimization is well discussed and compared to prior work, the prompt optimizer is arbitrarily chosen to be MIPRO, without considering or even discussing the alternatives [1-4]. Many references to key papers on automatic prompt optimization (APO) are missing:

[1] Yang et al, 2024., Large language models as optimizers.

[2] Yuksekgonul et al, 2024. TextGrad: Automatic "Differentiation" via Text.

[3] Cheng et al, 2024. Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs.

[4] Wang et al, 2024. How to Correctly do Semantic Backpropagation on Language-based Agentic Systems.

The claim that stages 1 and 2 of MASS are parallelizable is legit.

However, the details of the implementation are incomplete.

In Figure 2, the “propose new workflow” step should be the focal point of the implementation walkthrough, but it is barely discussed in the section “Workflow topology optimization”. A formal algorithm for “propose new workflows” is missing, even though it is mentioned as step 13 in Algorithm 1.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

No theoretical claims. The convergence of all three optimization stages of the algorithm is not studied.

Experimental Design and Analysis

The scores of the baselines and the proposed method are sound.

Supplementary Material

I did not find the supplementary material, specifically the code of the experiments. Without the code and without the formal algorithm for “propose new workflows” the paper is incomplete.

Relation to Existing Literature

The paper is an effort towards AGI.

Missing Essential References

The 4 essential references are not discussed, please see “Claims And Evidence”.

Other Strengths and Weaknesses

The undoubted merit of the paper is showcasing joint prompt and workflow optimization and its benefits, even though the two were performed in an interleaved manner.

I appreciate Figure 8 with the final found best topologies for each dataset.

Other Comments or Suggestions

I’d appreciate examples of original and refined prompts during 1PO and 3PO. It is not clear what a typical prompt improvement looks like.

Ethics Review Concerns

No.

Author Response

We thank the reviewer for their insightful suggestions and for acknowledging the undoubted merits of MASS-style optimization! We appreciate the reviewer pointing out the details in the paper that could be further clarified. We have provided further clarifications in this response and will update the final manuscript. We hope that our response sufficiently addresses the reviewer’s concerns, and that the reviewer could consider improving their score.

Alternatives to the prompt optimizer.

We appreciate the reviewer for suggesting many insightful works that have advanced the field of prompt optimization and will certainly include all the referred literature in the related work. We’d like to recall that MASS is a plug-and-play framework that works with arbitrary prompt optimizers, and one of our primary contributions is identifying the influence of prompt optimization on the MAS. We integrate MIPRO as a representative prompt optimizer due to the importance of simultaneous instruction and exemplar optimization, as justified in [1, 2], which show superior performance over OPRO-style [3] instruction-only optimization methods. It is also worth noting that the MASS framework itself is agnostic to the prompt optimizer, so any prospective better method can only enhance the overall performance of MASS. Below, we additionally provide an ablation of common prompt optimizers, showing that MASS with exemplar optimization (+DSPy) also leads to significant gains. In line with the reviewer, we have also considered extending the existing PO to feedback-based optimizers (e.g., ProTeGi or TextGrad) that may come with better sample efficiency, which we have included as desirable future work (line 727).

| Method      | MATH |
|-------------|------|
| CoT         | 66.7 |
| MASS +APE   | 73.3 |
| MASS +DSPy  | 78.2 |
| MASS +MIPRO | 81.0 |

[1] Opsahl-Ong, K., Ryan, M. J., Purtell, J., Broman, D., Potts, C., Zaharia, M., & Khattab, O. (2024). Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. EMNLP 2024.

[2] Wan, X., Sun, R., Nakhost, H., & Arik, S. O. (2024). Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization. NeurIPS 2024.

[3] Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D., & Chen, X. (2023). Large Language Models as Optimizers. ICLR 2024.

Details of parallelization in MASS.

We thank the reviewer for pointing out the parallelizable feature of our optimization. We will highlight lines 5-8 & 12-13 of Algorithm 1 to indicate the phases that can be parallelized to improve the efficiency of the MASS implementation.

Clarification on the “propose new workflow” in workflow-level topology optimization.

We thank the reviewer for suggesting that the topology optimization be formulated more precisely. Given the configuration space per topology building block, as described in step 12, we conduct rejection sampling to draw workflow candidates. Formally, a workflow is randomly sampled from a pruned configuration space within a maximum budget, i.e., $a \sim \mathcal{A}$ s.t. $N(a) < \text{Budget}$, where $N(a)$ caps the overall number of agents and $\mathcal{A} = (N_{\text{aggregate}}, N_{\text{reflect}}, N_{\text{debate}}, N_{\text{tool}}, \dots)$ is the configuration space defined in Sec. 2.2. Each search dimension is weighted by the influence of that dimension and treated as deactivated if $\mathrm{Uniform}(0, 1) > p_{a_i}$. The workflow $W(a) = (a_i, a_{i+1}, \dots)$ is then constructed by a predefined rule that arranges the order of agents (line 267). We included the specification of the detailed search space in App. A, Table 3. Overall, we thank the reviewer and will expand Algorithm 1, step 13, from a one-line description to the suggested mathematical formulations.
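As a rough illustration only (the function name, the per-dimension count sampling via `randint`, and the bounded retry loop are our assumptions, not the paper's exact procedure), the rejection sampling described above could be sketched as:

```python
import random

def sample_workflow(dims, influence_probs, max_counts, budget, max_tries=1000):
    """Sample an agent configuration by rejection sampling (illustrative sketch).

    dims: dimension names, e.g. ["aggregate", "reflect", "debate", "tool"]
    influence_probs: per-dimension activation probability p_{a_i}
    max_counts: per-dimension cap on the number of agents
    budget: maximum total number of agents N(a)
    """
    for _ in range(max_tries):
        config = {}
        for d in dims:
            # Deactivate dimension d with probability 1 - p_{a_i}
            if random.random() > influence_probs[d]:
                config[d] = 0
            else:
                config[d] = random.randint(1, max_counts[d])
        # Accept only non-empty configurations within the agent budget
        if 0 < sum(config.values()) < budget:
            return config
    raise RuntimeError("no valid configuration found within max_tries")
```

The accepted configuration would then be arranged into an ordered workflow by the paper's predefined rule (line 267), which is not reproduced here.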

In Formula 1, it is not clear why $a$ is a function of a single sample $x$ (referring to $a(x)$). Is the workflow of the configuration a singular version applied to all samples of the dataset?

In Equation 1, the optimization objective is to maximize the expectation of a configuration $a$ over all samples. Therefore, it is expressed as $a(x)$ for a single sample but marginalized over the whole dataset $\mathcal{D}$.
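On this reading (writing $\mathrm{Eval}$ for the paper's scoring function, a symbol we are assuming), Equation 1 would take the form:

```latex
a^{*} \;=\; \arg\max_{a \in \mathcal{A}} \; \mathbb{E}_{x \sim \mathcal{D}}\!\left[\,\mathrm{Eval}\big(a(x)\big)\,\right]
```

That is, a single configuration $a$ is optimized, its per-sample output $a(x)$ is scored, and the score is averaged over the dataset.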

There is no formal definition of a “block” in Section 2.

We thank the reviewer for the suggestion to provide a formal definition of “blocks” earlier. In Sec. 2, line 65, we refer to the topology of agents as building blocks. Formally, building blocks are the minimum set of agents within a type of topology, and they form the final search space of MASS. We currently define them from Sec. 2.2, line 161, with a visualization provided in Figure 8. We will move this information earlier in light of your suggestion.

Line 307. It’s not clear what a “proper” prompt design means.

We appreciate the reviewer pointing this out. The “proper” prompt design actually refers to “prompt optimization”, and we will rephrase the sentence using “prompt optimization” instead for clarity.

Official Review
Rating: 3

This paper investigates the interactions between multiple design spaces, including prompts and topologies, and studies the impact of various aspects such as optimizing prompts, scaling the number of agents, and involving different types of topologies. The optimized MAS is generated by optimizing the identified influential components. The optimization process consists of block-level prompt optimization, workflow topology optimization, and workflow-level prompt optimization.

Questions for Authors

In Algorithm 1, how do you prune the design space based on the selection probability?

Claims and Evidence

Yes. The claims are supported by clear evidence.

Methods and Evaluation Criteria

The evaluation criteria generally make sense. Reporting metrics about cost is also needed.

Theoretical Claims

There is no theoretical claim in this paper.

Experimental Design and Analysis

I have checked the experiments. The chosen benchmarks are generally appropriate. However, since the optimization incurs API cost, the discussion of cost is limited.

Supplementary Material

some of the prompts

Relation to Existing Literature

May be related to AI agents.

Missing Essential References

no

Other Strengths and Weaknesses

Strengths

  1. The design factors for MAS performance are analyzed, the importance of prompts is emphasized, and the use of APO on MAS is implemented.
  2. The overall logic is good and makes clear the reasons for the implementation of the method. The motivation is clear.

Weaknesses

  1. The optimization process relies heavily on a pre-defined dataset, and the pipeline is somewhat long, making it slow and resource-intensive to optimize the workflow.
  2. Critical algorithms are not described in detail, just sketched over. For example, it is unclear how the authors optimize the topology.
  3. Comparisons of token consumption are not discussed. As the proposed method involves many steps that consume tokens, this part is critical.

Other Comments or Suggestions

It seems that the proposed method requires substantial token consumption during optimization, which should be discussed in detail.

Author Response

We thank the reviewer for their insightful suggestions! Please see below for our detailed response, where we provide more details on the token consumption with further clarifications on optimizing the topology. We will include your valuable discussion in the manuscript. We hope that the reviewer could consider increasing the score if they feel the concerns have been sufficiently addressed.

The optimization process relies heavily on a pre-defined dataset, and the pipeline is somewhat long, making it slow and resource-intensive for optimizing the workflow.

We agree with the reviewer that optimizing a workflow requires a validation set. However, we would like to recall that all workflow optimization baselines require some form of labeled samples. As reflected in Figure 6, MASS is capable of exploiting a more refined and effective search space, advancing multi-agent performance while being more computation-efficient than the state-of-the-art automatic agent design methods (ADAS and AFlow). The 3-stage pipeline, while seemingly long, is justified in that each stage leads to concrete performance improvements (Fig. 5 & Table 5). In future work, we expect the sample efficiency of MASS could be further improved by researching more sample-efficient prompt optimizers (line 727) and low-cost proxies as optimization objectives.

Critical algorithms are not described in detail, just sketched over. For example, it is unclear how the authors optimize the topology.

We thank the reviewer for suggesting a better description of the topology optimization. Given the configuration space per topology building block, we conduct rejection sampling to draw workflow candidates. Formally, a workflow is randomly sampled from a pruned configuration space within a maximum budget, i.e., $a \sim \mathcal{A}$ s.t. $N(a) < \text{Budget}$, where $N(a)$ caps the overall number of agents and $\mathcal{A} = (N_{\text{aggregate}}, N_{\text{reflect}}, N_{\text{debate}}, N_{\text{tool}}, \dots)$ is the configuration space defined in Sec. 2.2. Each search dimension is weighted by the influence of that dimension and rejected if $\mathrm{Uniform}(0, 1) > p_{a_i}$. The workflow $W(a) = (a_i, a_{i+1}, \dots)$ is then constructed by a predefined rule that arranges the order of agents (line 267). We have included the specification of the detailed search space in App. A, Table 3. Overall, we thank the reviewer and will update Algorithm 1, steps 12 & 13, with better mathematical formulations.

The comparisons on token consumption are not discussed. As the proposed method involves many steps that require token consumption, this part is critical.

We thank the reviewer for suggesting a token consumption report. Here, we include an additional table of the actual token cost compared to baselines. In accordance with Figure 6 in the paper, we show that the training cost of MASS is comparable to state-of-the-art automatic agent design baselines.

| Method  | Training: Input Tokens | Output Tokens | Cost ($) | Runtime (mins) | Inference (per query): Input Tokens | Output Tokens | Cost ($) | Acc (%) |
|---------|------|-----|------|----|------|------|--------|------|
| SC      | -    | -   | -    | -  | 1538 | 3013 | 0.0010 | 69.3 |
| Reflect | -    | -   | -    | -  | 2051 | 850  | 0.0004 | 71.3 |
| Debate  | -    | -   | -    | -  | 6536 | 2483 | 0.0012 | 71.7 |
| AFlow   | 11M  | 8M  | 3.89 | 67 | 2523 | 1481 | 0.0006 | 64.3 |
| ADAS    | 23M  | 13M | 5.61 | 55 | 7850 | 3335 | 0.0016 | 72.7 |
| MASS    | 24M  | 11M | 5.09 | 58 | 6645 | 3263 | 0.0014 | 81.0 |

In Algorithm 1, how do you prune the design space based on the selection probability?

In sampling a valid configuration of agents at each iteration, we first prune each search dimension with its normalized selection probability, guaranteeing that at least one dimension is kept active. The target dimension is rejected if $\mathrm{Uniform}(0, 1) > p_{a_i}$, where $a_i$ refers to an individual dimension in $\mathcal{A}$. For example, given the original search space $\mathcal{A} = (N_{\text{aggregate}}, N_{\text{reflect}}, N_{\text{debate}}, \dots)$, if $p_{\text{reflect}}$ is 0.8, then in each sampling iteration this dimension has a 20% chance of being pruned (i.e., turned off), and the remaining dimensions form the current search space, such that $\mathcal{A}$ becomes $(N_{\text{aggregate}}, N_{\text{debate}}, \dots)$. We hope this addresses your concerns.
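A minimal sketch of this pruning step (the function name and the fallback that keeps the most influential dimension are illustrative assumptions; the paper's exact normalization scheme is not specified here):

```python
import random

def prune_search_space(selection_probs):
    """Prune each search dimension with its selection probability p_{a_i}:
    a dimension is turned off when Uniform(0, 1) > p_{a_i}.
    Guarantees at least one dimension stays active."""
    active = [d for d, p in selection_probs.items() if random.random() <= p]
    if not active:
        # Assumed fallback: keep the most influential dimension
        # so the search space is never empty.
        active = [max(selection_probs, key=selection_probs.get)]
    return active
```

For example, with `{"aggregate": 0.9, "reflect": 0.8, "debate": 0.5}`, each call returns the subset of dimensions that survived pruning for that iteration.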

Official Review
Rating: 3

The paper proposes Multi-Agent System Search (MASS), a novel multi-stage framework designed to optimize multi-agent systems (MAS) by automating the design of prompts and topologies. The authors demonstrate that both prompt design and agent topology significantly impact the performance of MAS. The optimization process is divided into three stages: 1) block-level prompt optimization (local prompt optimization for each agent); 2) workflow topology optimization (determining the most effective agent configuration); and 3) workflow-level prompt optimization (global optimization of prompts conditioned on the selected topology). The MASS framework leverages a plug-and-play approach to optimize individual agent prompts and the overall workflow structure, resulting in a more effective MAS for complex tasks.

The authors claim that MASS outperforms existing multi-agent systems, both manually-crafted and automated, across multiple benchmarks, including reasoning tasks (e.g., MATH), multi-hop understanding (e.g., HotpotQA), and code generation tasks (e.g., LiveCodeBench). Furthermore, the paper proposes guidelines for building effective MAS based on the insights gained from the optimization process.

Update after rebuttal: Thanks for the authors’ response and revision. I’ve checked the authors’ responses as well as the concerns and comments of all other reviewers. I agree that the revised paper has improved over the original version. However, it is still not enough to increase my rating to the next level, and I’d keep my rating as “weak accept” considering the level of significance and novelty of the paper. I still think it is a borderline paper - it might be publishable at ICML if there is space, but I won’t push hard for its acceptance if other reviewers have strongly different opinions.

Questions for Authors

N/A

Claims and Evidence

The claims in this paper are largely supported by empirical evidence. The authors compare MASS against several baselines (e.g., CoT, Self-Consistency, Self-Refine, Multi-Agent Debate, ADAS, and AFlow), showing that MASS leads to substantial performance improvements across all tasks. Results are provided in tables and figures (e.g., Table 1 and Figures 2-7), demonstrating the effectiveness of the framework.

Strengths: The paper presents a clear experimental setup with rigorous benchmarks. Results in Table 1 show substantial performance gains for MASS in comparison to both manual and automatic MAS systems. Ablation studies (Figures 5 & 6) effectively demonstrate the importance of each optimization stage in MASS.

Potential Issues: While the results are compelling, the generalization to larger-scale or more diverse real-world settings is not extensively explored. Future work could address this by applying MASS to a broader range of domains.

Methods and Evaluation Criteria

The methods presented are well-motivated and effective for the problem at hand. The approach is a clear improvement over prior work, particularly due to the integration of both prompt and topology optimization in a staged process. The pruned search space reduces the combinatorial complexity, allowing for more efficient optimization of MAS.

Evaluation Criteria: The paper uses a wide range of benchmarks to validate MASS, such as reasoning tasks, multi-hop understanding, and coding tasks. The comparative performance across tasks shows that MASS consistently outperforms existing baselines, both manual and automated.

Strengths: The evaluation is thorough and diverse, covering a wide array of task domains. Real-world applications such as LiveCodeBench and HotpotQA demonstrate the scalability and robustness of MASS.

Theoretical Claims

The theoretical claims seem sound. The authors claim that prompts and topologies play critical roles in MAS design, and they back this up with an in-depth analysis of the design space. The paper formulates optimization problems and provides justifications for pruning the search space based on observed relationships between prompt design and system performance.

Correctness: The mathematical formulation of the optimization problem is clear, and the theory behind the multi-stage optimization process is well-articulated. The stage-by-stage optimization approach is logically sound, and the benefits of each stage (block-level prompt optimization, topology optimization, and workflow-level prompt optimization) are well-supported by experimental data.

Experimental Design and Analysis

The experimental design is solid, and the analysis appears robust. The authors compare MASS with a variety of baselines, providing detailed statistical results and ablation studies that validate the contributions of each part of the MASS framework.

Strengths: Ablation studies show that stage-wise optimization (starting from block-level to workflow-level) provides significant improvements. Cost-effectiveness analysis demonstrates the computational efficiency of MASS, including comparisons with baselines like AFlow and ADAS.

Possible Concerns: While the ablation studies are comprehensive, the real-world applicability of MASS (e.g., in extremely large-scale systems with thousands of agents) could be further explored in future experiments.

Supplementary Material

N/A

Relation to Existing Literature

The paper makes important contributions to the field of automated multi-agent system design. It draws on existing work related to prompt optimization, multi-agent collaboration, and topology design. The paper positions MASS as a significant improvement over prior methods like ADAS, AFlow, and DyLAN.

Related Work: The paper acknowledges existing works in MAS optimization, such as DyLAN (Liu et al., 2024) and Archon (Saad-Falcon et al., 2024), but it goes beyond them by incorporating both prompt optimization and topology optimization in the design process. The use of joint optimization (prompts and topologies) aligns with emerging trends in neural architecture search (NAS), where search space design is becoming increasingly important (e.g., Zhou et al., 2023).

Missing Essential References

N/A

Other Strengths and Weaknesses

Some weaknesses and areas for improvement:

  1. Why specific topologies are more effective than others is not deeply explored beyond empirical evidence.
  2. Scalability considerations: The paper does not extensively discuss the computational cost of running MASS. How MASS scales with increasing agent complexity remains unclear. More details on runtime and computational overhead across different tasks would be useful.
  3. Baseline comparisons & ablations: a deeper discussion of why MASS outperforms AFlow across different tasks would add value. It would also be beneficial to include an ablation study evaluating the impact of different topology configurations.
  4. While the paper provides strong empirical results, a discussion on limitations (e.g., cases where MASS underperforms) is missing.

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their insightful and positive comments, especially for many valuable suggestions on a deeper discussion on topology design space, impacts of actual configurations, and further extending MASS to real-world applications! We have included your suggestions in this response and will also incorporate them into the final manuscript. We hope that in light of the response, the reviewer could consider improving their score.

While the results are compelling, the generalization to larger-scale or more diverse real-world settings is not extensively explored.

Thank you for suggesting the exploration of larger-scale, real-world agent applications. The scale investigated in our work (roughly O(10) agents) aligns with current state-of-the-art agent design methodologies. Importantly, this scale is characteristic of numerous deployed real-world MAS applications where complex interactions within smaller teams are common [1]. Therefore, while we recognize the value of scaling further and consider it an important avenue for future research, our current focus addresses a highly relevant regime. Extending MASS to systems with substantially more agents remains a key area for future investigation.

[1] Xia, C. S., Deng, Y., Dunn, S., & Zhang, L. (2024). Agentless: Demystifying LLM-based Software Engineering Agents.

Why specific topologies are more effective than others is not deeply explored beyond empirical evidence? It would also be beneficial to include an ablation study evaluating the impact of different topology configurations.

We thank the reviewer for raising this very insightful question! The optimal topology does indicate certain patterns per task family, and there are topologies that demonstrate clear advantages over others on particular tasks. Inspecting Figure 8, we notice that the “debating” topology brings significant gains to all multi-hop tasks that require factual knowledge (HotpotQA, MuSiQue, and 2WikiMQA), which aligns with [2], which argues that debating elicits more truthful answers. Reasoning tasks (MATH and DROP) benefit from more exploration, where SC becomes more effective. Lastly, the coding tasks share a common pattern of reflection with tool use. However, even the best configurations within the same task family still show differentiation, indicating the necessity of automatic optimization. Therefore, regardless of the underlying complexity of the task-dependent topology, the unique advantage of MASS is its ability to identify the most influential topology automatically for any customized search space. We will incorporate this discussion into a new ablation subsection.

[2] Khan, A., Hughes, J., Valentine, D., Ruis, L., Sachan, K., Radhakrishnan, A., ... & Perez, E. (2024). Debating with more persuasive LLMs leads to more truthful answers. ICML 2024.

Scalability considerations: The paper does not extensively discuss the computational cost of running MASS. How MASS scales with increasing agent complexity remains unclear. More details on runtime and computational overhead across different tasks would be useful.

We thank the reviewer for suggesting a token consumption report. Here, we include an additional table of the actual token cost for running MASS and how it is compared to baselines, where we show the training cost and the actual run-time of MASS is comparable to the training cost of auto-agent baselines.

| Method  | Training: Input Tokens | Output Tokens | Cost ($) | Runtime (mins) | Inference (per query): Input Tokens | Output Tokens | Cost ($) | Acc (%) |
|---------|------|-----|------|----|------|------|--------|------|
| SC      | -    | -   | -    | -  | 1538 | 3013 | 0.0010 | 69.3 |
| Reflect | -    | -   | -    | -  | 2051 | 850  | 0.0004 | 71.3 |
| Debate  | -    | -   | -    | -  | 6536 | 2483 | 0.0012 | 71.7 |
| AFlow   | 11M  | 8M  | 3.89 | 67 | 2523 | 1481 | 0.0006 | 64.3 |
| ADAS    | 23M  | 13M | 5.61 | 55 | 7850 | 3335 | 0.0016 | 72.7 |
| MASS    | 24M  | 11M | 5.09 | 58 | 6645 | 3263 | 0.0014 | 81.0 |

Baseline comparisons & ablations: a deeper discussion of why MASS outperforms AFlow across different tasks would add value. While the paper provides strong empirical results, a discussion on limitations (e.g., cases where MASS underperforms) is missing.

We agree with the reviewer that a deeper comparison with AFlow can shed light on and add value for future designs. We have provided a discussion in lines 323-382 and 431-434, and we are happy to elaborate further: the core differentiation between MASS and AFlow lies in the design of the optimization dimension (i.e., the search space), and we observe that the importance of search space design outweighs that of the actual search algorithm, as has been reflected in much of the NAS literature. More precisely, MASS exploits a more effective prompt design space in conjunction with a general topology space, whereas AFlow searches over a more constrained set of operators with a very limited prompt space tailored to certain tasks. Therefore, the human prior in defining these operators can provide implicit advantages on some tasks (e.g., 2WikiMQA in Table 1), where MASS shows lower but still comparable results.

Official Review
Rating: 3

This paper formulates multi-agent system design as a joint prompt and topology optimization problem. It introduces the MASS framework, a multi-stage search process that interleaves block-level prompt optimization, workflow topology optimization, and workflow-level prompt refinement, to efficiently navigate the vast MAS design space. By focusing on the most influential prompts and a pruned set of topologies, the method dynamically constructs multi-agent systems. Experiments across benchmarks for reasoning, long-context understanding, and coding demonstrate that MASS significantly outperforms both manually-crafted and automatically-generated alternatives.

Questions for Authors

  1. Figure 2 suggests that prompt-optimized agents exhibit superior scalability relative to Debate, Reflect, and self-consistency. Could the authors provide additional evidence or discussion on whether this scalability advantage persists across other models and benchmarks?

  2. Figure 8 shows that the agent topologies are noticeably more complex than the baseline approaches (e.g., CoT, self-consistency) presented in Table 1. Could the authors elaborate on whether this discrepancy in complexity affects the fairness of the comparison? Moreover, if the sampling numbers for the Self-Consistency baseline were increased, would the proposed method still maintain its performance advantage?

Claims and Evidence

  1. The paper proposes MASS framework that jointly optimizes both the agent prompts and the system topology. It addresses the interdependence of prompt design and agent connectivity, which prior work often treats in isolation.

  2. The paper includes detailed ablation studies that clearly illustrate the contribution of each optimization stage, enhancing the credibility of the proposed method.

Methods and Evaluation Criteria

  1. My main concern is that the proposed method appears to lack sufficient novelty. The prompt optimizer leverages the existing MIPRO approach, and the topology optimization primarily relies on established topological designs.

  2. The proposed method depends on many manual design choices, such as the topology search space, and does not explain why these particular topology structures were chosen. Moreover, these topology structures might not be applicable to all domains.

Theoretical Claims

No Theoretical Claims.

Experimental Design and Analysis

The paper uses large base models, Gemini 1.5 and Claude 3.5 Sonnet, which makes it unclear whether the proposed method would be effective on smaller open-source models, such as Llama 3.1 8B, or on other proprietary models. This limitation restricts the method’s applicability.

The paper does not compare with some essential baselines, such as GPTSwarm (which also treats agents as optimizable graphs), MacNet (which also considers LLM agents on graphs), and more recent multi-agent debate methods.

Moreover, the proposed method consumes a large number of tokens. Its efficiency has not been compared, especially against more cost-efficient methods such as self-consistency.

Supplementary Material

Yes.

Relation to Existing Literature

This paper is broadly related to multi-agent systems research and LLM research.

Missing Essential References

No.

Other Strengths and Weaknesses

No.

Other Comments or Suggestions

There are still some minor errors in the manuscript; for example, it appears the x-ticks are missing in Figure 3.

Post rebuttal: I am not convinced that it is worthwhile to spend so much effort (in terms of the number of tokens consumed, which is the most straightforward measurement, rather than the money spent) to “optimize” the topologies for performance improvement on only a single dataset. In fact, without clear evidence of generalization, this approach risks being a form of “overfitting.” However, this may not be a limitation unique to this specific work, but rather a broader issue within this research direction. It might therefore be unfair to criticize only this paper for this limitation. With that in mind, I have adjusted my rating from 2 to 3 (weak accept, though rejection remains justifiable).

Author Response

We thank the reviewer for their constructive suggestions! We have included additional experimental results with open-source models, a graph optimization baseline, and provided clear comparisons on the token consumption. We hope in light of our response, the reviewer could consider improving their score.

Novelty of MASS

A key contribution of this work is highlighting critical, under-explored design dimensions in MAS: the significant impact of prompt engineering and the redundancy in conventional topology choices. Unlike prior approaches often using manual prompts or emphasizing scaling alone, we demonstrate that optimizing these elements yields substantial gains and interacts critically with other dimensions like agent scaling. We argue this insight is foundational for MAS—a developing field—as it clarifies the necessity for automated co-optimization, answering why methods like strong prompt optimization are essential.

Leveraging this understanding, we introduce MASS, a novel methodology for automated MAS design. While adaptable to various prompt optimizers (MIPRO was used here), MASS employs a distinctive three-stage, interleaved optimization strategy. This approach navigates the complex design space by sequentially refining the most influential components within pruned search space boundaries. Consequently, MASS achieves state-of-the-art performance, significantly outperforming existing automated design methods and the specialized prompt optimizer used in isolation (Fig 5, left).

Justification of topology design choices

While there is some manual design in deciding the search perimeter, our topology search space aligns with well-established topological designs, including SC, reflect, and debate, which have been shown to be generalizable and effective across a wide spectrum of tasks in the broad literature and were also used in the search spaces of seminal previous works like ADAS and AFlow; we chose them for fair comparison and generality. However, the MASS framework itself does not depend on a specific topology space, and it can easily be extended to customized topology choices. We leave experimental results with MASS on other, more involved topology search spaces to future work.

Open-source models

We extend our experiments to Mistral-Nemo-12B, where the consistent gains demonstrate the applicability of MASS even to small open-source models.

| Method | MATH | DROP | HotpotQA |
|--------|------|------|----------|
| CoT    | 13.3 | 49.0 | 55.9 |
| SC     | 22.0 | 57.6 | 58.9 |
| Refine | 14.3 | 48.6 | 52.5 |
| Debate | 26.0 | 61.4 | 59.5 |
| MASS   | 43.7 | 68.4 | 62.6 |

Graph optimization baseline

Following the reviewer’s suggestion, we report GPTSwarm and observe that its graph optimization is more effective at improving inference efficiency (from a fully connected graph to a sparse graph) than at enhancing task performance, whereas the prompt optimization component of MASS led to particularly significant contributions.

| Method           | MATH | HumanEval |
|------------------|------|-----------|
| GPTSwarm (Pro)   | 76.0 | 85.0 |
| MASS (Pro)       | 84.7 | 91.7 |
| GPTSwarm (Flash) | 61.0 | 73.0 |
| MASS (Flash)     | 81.0 | 84.7 |

Cost-efficiency of the method

We present the token consumption comparison for gemini-flash below, where it is clear that MASS consumes compute comparable to ADAS and AFlow:

| Method  | Training: Input Tokens | Output Tokens | Cost ($) | Runtime (mins) | Inference (per query): Input Tokens | Output Tokens | Cost ($) | Acc (%) |
|---------|------|-----|------|----|------|------|--------|------|
| SC      | -    | -   | -    | -  | 1538 | 3013 | 0.0010 | 69.3 |
| Reflect | -    | -   | -    | -  | 2051 | 850  | 0.0004 | 71.3 |
| Debate  | -    | -   | -    | -  | 6536 | 2483 | 0.0012 | 71.7 |
| AFlow   | 11M  | 8M  | 3.89 | 67 | 2523 | 1481 | 0.0006 | 64.3 |
| ADAS    | 23M  | 13M | 5.61 | 55 | 7850 | 3335 | 0.0016 | 72.7 |
| MASS    | 24M  | 11M | 5.09 | 58 | 6645 | 3263 | 0.0014 | 81.0 |

Q1: Generalization on prompt scalability

We’d like to refer to Table 5, page 16, where the gain from “Base” to “1PO” (the prompt optimization step) far exceeds, e.g., that from “1PO” to “2TO” (the topology search step), which aligns with the observation in Fig. 2. We additionally show Claude results below, demonstrating that the observation is not specific to Gemini models.

| Claude | Avg. |
|--------|------|
| Base   | 60.2 |
| 1PO    | 70.0 |
| 2TO    | 71.9 |
| 3PO    | 72.4 |

Q2: Fairness of comparisons

As mentioned in the cost comparison table above, we’d like to emphasize that we broadly controlled for cost, and all methods consume roughly the same token/dollar budgets. We also provide detailed specifications of the baselines in App. B.2, all of which come with a comparable number of tokens consumed relative to MASS.

Regarding further scaling SC: first, we note that SC is part of the MASS search space, so MASS can naturally benefit from SC scaling. Second, as shown in Figures 2 & 9, even when SC brings significant benefits, it saturates earlier than MASS-optimized topologies, which show better token-effectiveness. On the other tasks in Table 1, even with a large budget, SC makes only limited gains on multi-hop tasks, whereas the Debate topology substantially advances performance. This observation further consolidates the necessity of automatic topology optimization in MASS.

Final Decision

This paper provides an optimization method for agents using LLMs that improves both the prompts and how they are combined in a so-called topology. After a detailed discussion with the authors, all reviewers are generally favorable. However, the area chair and the senior area chair lean towards rejection, because the method is fundamentally brute-force and the contribution to knowledge is incremental. The work is thorough, but will likely be superseded soon by other research whose methods are more efficient, and possibly more insightful.