On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents
We investigate the resilience of multi-agent collaboration with faulty agents, find that hierarchical structures are the most resilient, and propose methods to improve performance under errors.
Abstract
Reviews and Discussion
The paper explores how well multi-agent systems based on large language models (LLMs) can handle errors introduced by faulty agents. It specifically looks at resilience across tasks like writing code, solving math problems, translating text, and evaluating written text. To test this, it uses two methods: AUTOTRANSFORM, which alters agent behaviors, and AUTOINJECT, which directly introduces errors into agent communications. Findings show that hierarchical system structures cope best with faults, outperforming flat and linear setups. Interestingly, linear structures were the least resilient. The authors recommend hierarchical models for designing robust multi-agent collaborations using LLMs.
Questions for Authors
See my previous comments.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
There are not many theoretical claims in this paper.
Experimental Design and Analysis
I examined the experimental design and analyses conducted in this paper. The experiments utilized two well-defined error simulation methods, AUTOTRANSFORM and AUTOINJECT, across three clearly differentiated multi-agent system structures (linear, flat, and hierarchical). The metrics used (accuracy and task-specific metrics such as Pass@1 and BLEURT) were appropriate and clearly presented. Overall, the experimental design and analyses are sound and rigorous.
My only concern is that the authors use only a few baselines, such as Self-Collab and CAMEL, which cannot fully reflect current advanced methods. I suggest adding more discussion of graph-based frameworks (LangGraph, GPTSwarm) and popular industrial frameworks such as Dify.
Supplementary Material
I reviewed the supplementary materials; they contain code without additional analyses and do not influence my review.
Relation to Prior Literature
The paper's key contributions connect closely with broader scientific literature on multi-agent systems, resilience, and safety in AI collaborations.
Missing Important References
More discussion of work published in 2024 and 2025 is needed, for example, on automated evaluation methods. Besides, in the introduction, I suggest the authors move the core claim to the very beginning, since some of the basic claims are no longer very novel on their own, though I like the basic idea.
Other Strengths and Weaknesses
- It would be beneficial if the authors could include additional experiments to demonstrate whether the observed issues persist across other frameworks.
- The idea of this paper is not brand-new, but I appreciate the solid writing.
Other Comments or Suggestions
See my previous comments.
Ethical Review Concerns
No ethical concerns.
We deeply appreciate reviewer 3cYX's time spent reviewing and your insightful comments. We are particularly encouraged that you find the experiments and analyses sound and rigorous. We address your concerns one by one:
[Design & Analysis] [W1] Advanced methods such as graph-based frameworks.
Inspired by GPTSwarm and MacNet, we design two advanced graph-based multi-agent frameworks using four agents:
- Complete graph (a flat structure): each agent generates its own answer. After receiving all other agents' answers in the next round, each agent re-generates its answer after reflection. The final answer is the majority one.
- Star (a hierarchical structure): one leader proposes three approaches and distributes them to the three agents. After receiving the solutions, the leader gives its evaluation and generates the final answer.
We evaluate GPT-3.5 on the Math task. The performance is shown in the table below, with a minimal sketch of the two topologies after it. We can conclude that our methods and analyses are applicable to diverse frameworks. The flat structure still has lower performance since there is no leader coordinating the work.
| System Type | Vanilla | AutoTransform | AutoInject |
|---|---|---|---|
| Graph | 28.00 | 20.00 | 16.00 |
| Star | 36.00 | 30.00 | 28.00 |
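For concreteness, here is a minimal sketch of the two topologies, assuming a generic `llm(prompt)` completion helper (a hypothetical stand-in, not our actual prompts or code):

```python
from collections import Counter

def llm(prompt: str) -> str:
    """Hypothetical chat-completion wrapper (e.g., around GPT-3.5)."""
    raise NotImplementedError

def complete_graph(question: str, n_agents: int = 4, n_rounds: int = 2) -> str:
    """Flat structure: every agent sees all other agents' previous answers."""
    answers = [llm(f"Solve: {question}") for _ in range(n_agents)]
    for _ in range(n_rounds - 1):
        answers = [
            llm(f"Solve: {question}\n"
                f"Other agents answered: {answers[:i] + answers[i + 1:]}\n"
                f"Rethink and give your final answer.")
            for i in range(n_agents)
        ]
    return Counter(answers).most_common(1)[0][0]  # majority vote

def star(question: str, n_workers: int = 3) -> str:
    """Hierarchical structure: a leader decomposes, workers solve, the leader merges."""
    approaches = llm(f"Propose {n_workers} distinct approaches to: {question}").splitlines()
    solutions = [llm(f"Solve: {question}\nUse this approach: {a}") for a in approaches[:n_workers]]
    return llm(f"Question: {question}\nCandidate solutions: {solutions}\n"
               f"Evaluate the candidates and give the final answer.")
```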
[Reference 1] More related work in 2024-2025.
We have thoroughly cited and compared relevant 2024-2025 papers [1-9] in our paper, according to all reviewers' suggestions. In short, NetSafe is the most relevant. While it provides impressive findings on the impact of different numbers of faulty agents, it does not investigate more subjective tasks like translation and text evaluation. We also include the code generation task, enabling us to study the impact of error types.
Additionally, our proposed AutoTransform and AutoInject methods can be viewed as automated evaluation tools for simulating failure scenarios and systematically measuring the robustness of multi-agent systems. This aligns with recent trends in automatic evaluation techniques (e.g., Zhang et al., 2024; Ju et al., 2024) where LLMs are used to both generate perturbations and evaluate outcomes.
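As a concrete illustration of this perturb-and-evaluate loop, here is a minimal sketch; `llm` is a generic completion helper, and `system`, `outbox_hook`, and the task interface are hypothetical stand-ins rather than our actual implementation:

```python
import random

def llm(prompt: str) -> str:
    """Hypothetical chat-completion wrapper; plug in any client."""
    raise NotImplementedError

def auto_inject(message: str, error_type: str = "syntactic", rate: float = 0.2) -> str:
    """AutoInject-style perturbation: ask an LLM to rewrite an inter-agent
    message with controlled errors, leaving everything else intact."""
    return llm(
        f"Rewrite the following message, introducing {error_type} errors in "
        f"roughly {rate:.0%} of its lines and changing nothing else:\n\n{message}"
    )

def resilience(system, tasks, error_type: str = "syntactic", rate: float = 0.2) -> float:
    """Fraction of tasks solved while one agent's outgoing messages are perturbed."""
    faulty = random.choice(system.agents)  # designate one agent as faulty
    faulty.outbox_hook = lambda m: auto_inject(m, error_type, rate)
    solved = sum(system.solve(t.prompt) == t.answer for t in tasks)
    return solved / len(tasks)
```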
[1] Zhang et al., Multi-agent Architecture Search via Agentic Supernet. 2025.
[2] Yu et al., A Survey on Trustworthy LLM Agents: Threats and Countermeasures. 2025.
[3] Mao et al., AgentSafe: Safeguarding Large Language Model-based Multi-agent Systems via Hierarchical Data Management. 2025.
[4] Zhou et al., CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models. 2025.
[5] Wang et al., G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems. 2025.
[6] Yu et al., LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models. 2025.
[7] Zhang et al., G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks. 2024.
[8] Yu et al., NetSafe: Exploring the Topological Safety of Multi-agent Network. 2024.
[9] Tran et al., Multi-Agent Collaboration Mechanisms: A Survey of LLMs. 2025.
[Zhang et al. (already in our paper)] PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety.
[Ju et al. (already in our paper)] Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities.
[Reference 2] Improving the introduction section.
We appreciate that you like our basic ideas. We have moved the core claims to the beginning of our introduction section. The text is shown below:
Just as human organizations rely on resilient structures to function despite internal errors, multi-agent systems built on large language models (LLMs) must also withstand faulty participants to remain effective. In this paper, we argue that the resilience of LLM-based multi-agent collaboration critically depends on the system’s structural design. As these agents increasingly take on complex, collaborative tasks, their decentralized nature makes them vulnerable to clumsy or even malicious agents—those that degrade performance or disrupt workflows. Drawing a parallel to organizational theory, we find that hierarchical structures—common in robust human institutions—offer superior resilience over flat or linear ones. To rigorously test this hypothesis, we propose two methods for simulating faulty agents, and demonstrate that structural design, combined with simple defensive strategies, can significantly mitigate performance degradation.
The paper investigates the resilience of different multi-agent architectures to faulty agents. Two approaches, autotransform, which transforms the system prompt of the agent into a malicious one, and autoinject, which takes the outputs of other agents and intentionally injects specific errors, have been proposed to create faulty agents for the evaluation. The experiments involve four different tasks and six agents. The results show that hierarchical design of the multi-agent system is more resilient to faulty agents.
Questions for Authors
When the core LLM of one of the agents is changed to a large reasoning model with a lot of (perhaps redundant) reasoning outputs, will the performance be affected?
Claims and Evidence
"We are the first to examine how different structures of multi-agent systems affect resilience when faulty agents exist and disrupt collaboration."
There are other works from a few months earlier that study the same problem, with similar results.
"The hierarchical structure has a higher resilience than the other two, exhibiting the smallest accuracy drop."
Intuitively, I tend to believe this is correct. However, beyond the structure, how the agents communicate with each other also influences the robustness. The results in Figure 3 involve two agents per type, but without the distinction of the communication scheme, which weakens the conclusion.
Methods and Evaluation Criteria
I am concerned about the design of autotransform, which assumes that a single agent functions with a single system prompt. In fact, the agent can interact with the environment and use tools in various ways. The system prompts can be diverse and involve multi-hop interaction with either the user or the environment. It is unclear whether the approach can be generalized to more general agent settings.
Also, the evaluation focuses on code generation, math problems, translation, and text evaluation. All these tasks can be achieved using an agent or just a model. Tasks not achievable by a standalone model should be considered.
Theoretical Claims
There are no theoretical results in this work.
Experimental Design and Analysis
See Methods and Evaluation Criteria.
Supplementary Material
I reviewed Appendix D, Prompt Details.
Relation to Prior Literature
Zhang et al, G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks, Oct 2024.
This paper studied different topologies of multi-agent systems.
Yu et al, NetSafe: Exploring the Topological Safety of Multi-agent Network, Oct 2024.
This paper studies which topology of multi-agent system is safer, which is highly related to the current work.
Missing Important References
[1] Zhang et al, G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks, Oct 2024. [2] Yu et al, NetSafe: Exploring the Topological Safety of Multi-agent Network, Oct 2024.
These two papers are highly relevant and were posted three months before the submission deadline. Especially in [2], there are similar conclusions regarding the relation between safety and agent topology.
Other Strengths and Weaknesses
The paper is well-written, and the motivation is well-explained. However, the scope of the work needs to be expanded. Comparison with previous works should be included.
Other Comments or Suggestions
n.a.
We deeply thank reviewer iwu4 for reviewing and appreciate the suggestions. We are encouraged that you find our paper well-written and sufficiently motivated! We address your concerns one by one:
[Claim & Evidence 1] [Broader Literature] [Reference] [W1] Missing papers.
Thanks for providing the missing references. We have thoroughly cited and compared the works of Zhang and Yu in our Related Work section. Additionally, we have included discussions of other relevant papers [1-6]. In short, NetSafe is the most relevant. While it provides impressive findings on the impact of different numbers of faulty agents, it does not investigate more subjective tasks like translation and text evaluation. We also include the code generation task, enabling us to study the impact of error types.
Additionally, according to the ICML 2025 Reviewer Instructions (icml.cc/Conferences/2025/ReviewerInstructions), “Authors cannot expect to discuss other papers that have only been made publicly available within four months of the submission deadline.” Thus the two papers posted in October are concurrent work with our submission.
[1] Zhang et al., Multi-agent Architecture Search via Agentic Supernet.
[2] Yu et al., A Survey on Trustworthy LLM Agents: Threats and Countermeasures.
[3] Mao et al., AgentSafe: Safeguarding Large Language Model-based Multi-agent Systems via Hierarchical Data Management.
[4] Zhou et al., CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models.
[5] Wang et al., G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems.
[6] Yu et al., LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models.
[Claim & Evidence 2] The communication scheme of agents.
In our experiments, all of our selected systems adopt a direct-messaging communication scheme, meaning that only the recipient agent can see the message and no other message-processing schemes such as summarizing or broadcasting are involved. Therefore, the communication scheme is a controlled variable in our experiment and does not influence our conclusion on system structures.
[Method & Evaluation 1] Whether AutoTransform can be generalized to more general agent settings (tool uses).
AutoTransform takes the whole agent profile as the subject of its transformation. While system prompts can be diverse (e.g., for multiple tool uses), an agent's profile is usually consistent, since it includes the agent's main goals, task descriptions, and constraints. Therefore, our method works with agents equipped with tools that interact with the environment.
Moreover, some of our selected systems are already equipped with external tools. For instance, Self-collab equips agents with a Python interpreter for code testing, and MetaGPT's agents have access to a message pool from which they retrieve history messages when performing tasks. This ensures our method remains valid in these more general settings. We also compare the performance on code tasks of systems with and without tool modules. As shown below, the effect of AutoTransform is consistent across the two types of agent systems.
| System Type | Vanilla | AutoTransform |
|---|---|---|
| w/o Tool Use | 63.20 | 47.00 |
| w/ Tool Use | 66.27 | 42.70 |
[Method & Evaluation 2] Other tasks.
We acknowledge the need to evaluate more complex, real-world tasks that are hard for a standalone model to complete, such as building complete, executable software. For example, consider generating a Snake game with pygame. Using a single GPT-3.5, the code is incomplete and the snake cannot be controlled; using CAMEL, we obtain an executable game with extensive features.
When using AutoTransform, the code is still executable, but the snake's speed is too fast and the mapping between the arrow keys and the snake's movement is scrambled (e.g., pressing "left" makes the snake move upward). When using AutoInject, the game fails to run at all due to the multiple injected errors. Our method is thus still effective in such real-world scenarios.
Additionally, although these tasks can be completed by single agents, the single-agent vs. multi-agent results in Figure 4 on page 6 show that multi-agent systems improve performance over single agents.
[Q1] CoT models.
We explore the impact of CoT using a reasoning model, o1-mini, on the Math task due to budget limits. The results are presented in the tables below. For models using CoT, the performance is largely improved. However, we can still observe the performance decrease caused by faulty agents. Hierarchical structures still outperform the other two.
| o1-mini (Math) | Linear | Flat | Hierarchical |
|---|---|---|---|
| No Attack | 78.00 | 80.00 | 81.00 |
| AutoTransform | 14.00 | 18.00 | 63.00 |
| AutoInject | 70.00 | 74.00 | 77.00 |
This paper investigates the resilience of large language model (LLM)-based multi-agent systems against faulty or malicious agents. It compares different system architectures—Linear, Flat, and Hierarchical—across tasks like code generation, math problem-solving, translation, and text evaluation, finding hierarchical structures most resilient. Additionally, it introduces methods to simulate agent errors (AUTOTRANSFORM, AUTOINJECT) and proposes defense strategies (Challenger, Inspector) that significantly enhance system robustness.
Questions for Authors
N/A
Claims and Evidence
Overall, the paper's claims are well-supported by clear experimental results and analyses. Key findings—such as hierarchical structures offering better resilience, semantic errors causing greater impact than syntactic ones, and certain error injections even improving performance—are convincingly demonstrated. However, the explanation for why AUTOTRANSFORM errors have less impact than AUTOINJECT in GPT-3.5 contexts may need further empirical validation or deeper analysis.
Methods and Evaluation Criteria
In a general sense, the methods make sense.
When you state the reasons in lines 261-268, you should support them with evidence such as citations or examples; similarly, in lines 247-253, you should provide more support instead of just giving a reason without proof.
Theoretical Claims
N/A
Experimental Design and Analysis
In your experiments, you use a different multi-agent framework (MetaGPT, CAMEL, etc.) for each structure. How do you ensure a fair comparison of the structures, given that each framework uses different system prompts and special designs? I question the fairness of using six different code bases to represent the different structures.
Supplementary Material
Yes
Prompt and case study
Relation to Prior Literature
The paper relates to prompt attacks, agent attacks, and agent system robustness.
Missing Important References
N/A
Other Strengths and Weaknesses
Strengths:
- The paper addresses an important and timely research topic.
- It represents the first systematic exploration within this specific area.
Weaknesses:
- The graphical presentation could be improved for better readability and clarity. Some explanations require refinement; for example, on line 19, clearly define what agents A, B, and C represent, ideally adopting a more formal mathematical notation.
- The explanations provided between lines 261–268 and 247–253 lack sufficient empirical support or citations. The authors should strengthen their claims by providing concrete examples or referencing relevant prior studies.
- When presenting results and findings, explicitly include numerical data and specific evidence to substantiate conclusions clearly.
Other Comments or Suggestions
Suggestions:
- I would like to see other communication topologies considered, such as tree, star, and graph, as in Scaling Large Language Model-based Multi-Agent Collaboration.
- I would like to see how your method affects a single agent; you could then compare whether multi-agent systems handle it better or worse.
Ethical Review Concerns
Since this research is about multi-agent attack and defense, an ethics review is required.
We deeply thank reviewer xvqz for your time and effort in reviewing our work, and for your invaluable comments that further enrich our paper. We are particularly encouraged that you find our claims well-supported and our methods and evaluations reasonable, and by your recognition of the importance and novelty of our work. We address your concerns here:
[Claim & Evidence] AutoTransform is harder to control than AutoInject.
Faulty agents produced by AutoTransform have less impact than direct error injection (AutoInject) because GPT-3.5-Turbo has weaker precise control over the errors it generates. We conduct an analysis using AutoTransform to instruct agents to introduce syntax errors into 20% and 40% of the code lines. The distribution of actually produced error rates is summarized below:
| Error Rate | Avg | Std | Min | Max |
|---|---|---|---|---|
| Instruct 20% | 1.56 | 3.65 | 0.00 | 14.30 |
| Instruct 40% | 9.49 | 26.70 | 0.00 | 90.10 |
These results indicate significant variability, with agents struggling to consistently achieve the precise error rates of 20% or 40%. This underscores the necessity and robustness of our AutoInject method.
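For reference, here is a sketch of how such per-sample statistics could be computed, assuming a hypothetical `count_error_lines` syntax checker over generated code (illustrative only, not our evaluation code):

```python
import statistics

def realized_error_rate(code: str, count_error_lines) -> float:
    """Percentage of non-empty lines flagged as syntactically broken."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return 100.0 * count_error_lines(lines) / max(len(lines), 1)

def summarize(rates: list[float]) -> dict:
    """Avg/Std/Min/Max over per-sample realized error rates, as in the table."""
    return {
        "Avg": round(statistics.mean(rates), 2),
        "Std": round(statistics.stdev(rates), 2) if len(rates) > 1 else 0.0,
        "Min": round(min(rates), 2),
        "Max": round(max(rates), 2),
    }
```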
[Method & Evaluation] [W2 & W3] Inclusion of numerical data and specific evidence to support our claims.
We have added numbers at (1) Left-column: Line 248, 251, 261, 274 and (2) Right-column: Line 251, 253, 257, 263, 292, 308, from the results in Table 5, 6, 7, and 8 in Appendix C on page 13.
[Design & Analysis] Fair comparison using the 6 different code bases.
Different structures have different role designs. For example, a flat system may not have a leader, whereas the other two structures do. There is no single set of profiles applicable to all multi-agent systems. Therefore, instead of manually writing profiles, we use the code bases from published studies, which the original authors have optimized for the specific structures. To offer a fairer comparison, i.e., to mitigate the impact of using different prompts, we include two systems per structure.
[W1] Mathematical definition of agent collaboration.
A multi-agent system can be defined as a graph $G = (V, E)$, where $V = \{a_1, \ldots, a_n\}$ represents agents and $E \subseteq V \times V$ is a set of directed edges. Each edge $(a_i, a_j) \in E$ denotes that agent $a_i$ reports to agent $a_j$.
- Linear systems are directed path graphs: $\forall\, 1 \le i < n$, we have $(a_i, a_{i+1}) \in E$; for the endpoints, $\deg^{-}(a_1) = 0$, $\deg^{+}(a_1) = 1$, $\deg^{-}(a_n) = 1$, and $\deg^{+}(a_n) = 0$. Agents in this structure form a chain from $a_1$ to $a_n$.
- Flat systems are directed complete graphs with bidirectional edges: $\forall\, i \neq j$, both $(a_i, a_j) \in E$ and $(a_j, a_i) \in E$. This represents a fully connected, non-hierarchical structure.
- Hierarchical systems are rooted directed trees: there exists a unique root agent $a_r$ such that $\deg^{+}(a_r) = 0$, and $\forall\, a_i \neq a_r$, $\deg^{+}(a_i) = 1$. The structure is acyclic and forms a strict top-down hierarchy.
[Q1] Other communication topologies.
Inspired by MacNet and GPTSwarm, we design two advanced graph-based multi-agent frameworks using four agents:
- Complete graph (a flat structure): each agent generates its own answer. After receiving all other agents' answers in the next round, each agent re-generates its answer after reflection. The final answer is the majority one.
- Star (a hierarchical structure): one leader proposes three approaches and distributes them to the three agents. After receiving the solutions, the leader gives its evaluation and generates the final answer.
We evaluate GPT-3.5 on the Math task. The performance is shown in the table below. We can conclude that our methods and analyses are applicable to diverse frameworks. The flat structure still has lower performance since there is no leader coordinating the work.
| System Type | Vanilla | AutoTransform | AutoInject |
|---|---|---|---|
| Graph | 28 | 20 | 16 |
| Star | 36 | 30 | 28 |
[Q2] Impact on single agents.
We conduct experiments applying the two error-introducing methods to a single GPT-3.5-Turbo-based agent across all four tasks. The performance is shown below:
| Tasks | Vanilla | AutoInject | AutoTransform |
|---|---|---|---|
| Code | 58.41 | 15.24 | 3.92 |
| Math | 24.00 | 18.00 | 8.00 |
| Translate | 68.42 | 61.08 | 68.42 |
| Text Eval | 41.25 | 32.50 | 18.75 |
Compared to the single-agent performance (averaged over the four tasks in the table below), all three types of multi-agent systems show better resilience against both methods. This is because the systems have other “good” agents for reviewing and testing, which can identify the errors made by the faulty agent.
| Systems | Vanilla | AutoInject | AutoTransform |
|---|---|---|---|
| Single Agent | 48.02 | 31.71 | 21.66 |
| Linear | 55.62 | 38.27 | 38.24 |
| Flat | 54.37 | 40.25 | 43.93 |
| Hierarchical | 53.00 | 48.12 | 46.57 |
This paper explores the resilience of multi-agent collaboration by introducing faulty agents and errors. It takes an empirical approach to examine performance drops across different scenarios and multi-agent system structures. The authors introduce the AUTOTRANSFORM and AUTOINJECT algorithms for creating faulty agents, with the evaluation based on error types, quantities, and frequencies. The work provides insights into the impact of faulty agents across different tasks, including code generation, translation, math, and text evaluation.
Questions for Authors
- Can the authors explore dynamic scenarios with multiple faulty agents? There should be a way to control the number of agents and the functionality-to-errors ratio in each agent to be more realistic.
- Can you explain why certain structures are more resilient, thereby leading to better configurations or optimization of the agent systems?
- Noises can also lead to better performance, can the authors explore further the key mechanisms behind these observations?
- What are the practical implications and broad impacts of your paper?
Claims and Evidence
The paper provides comprehensive empirical experiments with multiple tasks (code generation, math solving, translation, and text evaluation), system structures (linear, flat, and hierarchical), and error types (semantic and syntactic). The impacts of faulty agents are observed in various settings and illustrated in different figures.
Methods and Evaluation Criteria
The proposed methods are reasonable for investigating the resilience of multi-agent AI systems with the two algorithms, AUTOTRANSFORM and AUTOINJECT. The evaluation also covers a number of tasks, including coding, math solving, translation, and text evaluation, for better generalizability of the findings.
Theoretical Claims
The paper does not have mathematical proofs.
Experimental Design and Analysis
The experimental design of the paper is reasonably well-executed, with multiple experiments across different multi-agent systems and diverse tasks. The authors also provide various ways of controlling errors to investigate the performance impacts.
Supplementary Material
Yes
Relation to Prior Literature
- The paper broadly relates to LLM-based multi-agent collaboration, building on existing literature work, such as MetaGPT, AgentVerse, etc. The authors investigate the concept of resilience in these multi-agent systems by introducing faulty agents.
- It also provides a theoretical basis for understanding multi-agent collaboration topology, demonstrating that certain structures perform better.
- The key contributions of the paper extend the broad areas of AI safety and trustworthiness.
Missing Important References
Some missing references:
- Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., ... & Irving, G. (2022). Red teaming language models with language models. arXiv preprint arXiv:2202.03286. - vulnerabilities in LLMs.
- Tan, S., Joty, S., Baxter, K., Taeihagh, A., Bennett, G. A., & Kan, M. Y. (2021). Reliability testing for natural language processing systems. arXiv preprint arXiv:2105.02590. - reliability testing for NLP systems
- Tran, K. T., Dao, D., Nguyen, M. D., Pham, Q. V., O'Sullivan, B., & Nguyen, H. D. (2025). Multi-Agent Collaboration Mechanisms: A Survey of LLMs. arXiv preprint arXiv:2501.06322. - multi-agent collaboration
Other Strengths and Weaknesses
Strengths:
- Comprehensive evaluations across multiple dimensions: tasks, structures, error types and rates, providing understanding and benchmarks of resilience in multi-agent AI systems.
- Clear implications for designing robust MAS with findings on various multi-agent structures.
- Clear explanations of observations and findings, relating the performance impacts to collaboration mechanisms.
Weaknesses:
- Lack of theoretical analyses/formal proofs and deeper investigations on collaboration mechanisms.
- Faulty agents in the world might be more complicated with dynamic scenarios among multiple agents.
- Only GPT models were investigated without the use of chain-of-thought abilities (which might be useful in improving resilience)
- The insights of multi-agent structures are limited as there should be mechanisms to improve existing structures.
Other Comments or Suggestions
Nil
We deeply appreciate reviewer LJmt’s time for reviewing and providing valuable suggestions. We are encouraged that you find our experiments comprehensive and well-executed, methods reasonable, and conclusions helpful for the community. We address your concerns here:
[W1] Mathematical definition of agent collaboration.
A multi-agent system can be defined as a graph $G = (V, E)$, where $V = \{a_1, \ldots, a_n\}$ is the set of agents and each directed edge $(a_i, a_j) \in E$ denotes that agent $a_i$ reports to agent $a_j$.
- Linear systems: directed path graphs, where $\forall\, 1 \le i < n$, $(a_i, a_{i+1}) \in E$; for the endpoints, $\deg^{-}(a_1) = 0$, $\deg^{+}(a_1) = 1$, $\deg^{-}(a_n) = 1$, and $\deg^{+}(a_n) = 0$.
- Flat systems: directed complete graphs, where $\forall\, i \neq j$, both $(a_i, a_j) \in E$ and $(a_j, a_i) \in E$.
- Hierarchical systems: rooted directed trees, where there exists a unique root agent $a_r$ such that $\deg^{+}(a_r) = 0$, and $\forall\, a_i \neq a_r$, $\deg^{+}(a_i) = 1$.
[W2] Real-world complicated scenario.
Consider generating a Snake game with pygame. Using a single GPT-3.5, the code is incomplete and the snake cannot be controlled; using CAMEL, we obtain an executable game.
When using AutoTransform, the code is still executable, but the mapping between the arrow keys and the snake's movement is scrambled. When using AutoInject, the game fails to run at all. Our method is thus still effective in such real-world scenarios.
[W3] Other models & CoT.
We conduct experiments using the LLaMA-3.1-70B-Instruct model on all four tasks and a reasoning model, o1-mini, on the Math task. The results are presented in the tables below. The finding that hierarchical structures perform best still holds for non-GPT models. Performance improves substantially with CoT (o1), while we can still observe the impact of faulty agents. Hierarchical structures still outperform the other two.
| LLaMA | Linear | Flat | Hierarchical |
|---|---|---|---|
| No Attack | 73.78 | 76.83 | 76.15 |
| AT | 11.90 | 39.03 | 66.96 |
| AI | 38.72 | 36.59 | 55.64 |
| o1 | Linear | Flat | Hierarchical |
|---|---|---|---|
| No Attack | 78 | 80 | 81 |
| AT | 14 | 18 | 63 |
| AI | 70 | 74 | 77 |
[W4] Advanced structures.
Inspired by MacNet and GPTSwarm, we design two advanced graph-based multi-agent frameworks using four agents:
- Complete graph (a flat structure): each agent generates its own answer. In the next round, each agent re-generates its answer after reflecting on the others' results.
- Star (a hierarchical structure): one leader proposes three approaches, distributes them to the three agents, and generates the final answer.
We evaluate GPT-3.5 on the Math task. Our methods and analyses are applicable to diverse frameworks; the flat structure still has lower performance.
| System Type | Vanilla | AT | AI |
|---|---|---|---|
| Graph | 28 | 20 | 16 |
| Star | 36 | 30 | 28 |
[Q1] Multiple faulty agents.
We explore the scenario of two faulty agents in AgentVerse on the Math task using GPT-3.5. AgentVerse has 4 agents, with Solver faulty by default. Here, we make an additional one, either Critic or Planner, faulty. Our methods are still valid. The Planner, who decides the high-level direction, plays a more important role. Its faults cause a greater performance decrease.
| Faulty Agents | Error Method | Performance |
|---|---|---|
| None | None | 28 |
| Solver | AI | 20 |
| S + Critic | AI | 14 |
| S + Planner | AI | 12 |
| S | AT | 16 |
| S + Critic | AT | 14 |
| S + Planner | AT | 2 |
[Q2] Explanation on performance of different structures.
We can draw an analogy with real-world human organizational structures. A hierarchical structure enables centralized decision-making, where a top-level role gathers information and efficiently distributes decisions through clear chains of command. In contrast, a flat structure often lacks clear leadership, leading to decision paralysis and coordination issues. A linear structure has a defined chain of command, but communication is slower, and top leaders have limited oversight of lower levels.
[Q3] Explanation of how noise improves performance.
(1) Double Checking: Injecting obvious errors prompts agents to respond with corrections, which often fix both the injected and existing issues. (2) Divergent Thinking: Systems can stagnate due to repetitive reasoning from identical LLMs. Introducing major errors shifts the discussion, promoting fresh insights.
[Q4] Practical implications.
(1) Designing hierarchical multi-agent systems helps, reflecting a prevalent collaboration mode in real-world human society. (2) Letting agents raise questions or suggestions about others' results boosts fault recovery.
[References] Missing papers.
We have cited and compared the works of Perez, Tan, and Tran in our paper. In short, we consider a new scenario where the vulnerability comes from weaker agents in a multi-agent system.
Thanks for addressing my comments. I have updated my score.
We deeply thank reviewer LJmt for checking our response! We are glad that your concerns are addressed. Your comments are important for further improving this work. Thanks once again for re-considering your rating!
This paper presents a systematic and timely investigation into the resilience of large language model (LLM)-based multi-agent systems when subject to faulty agents. Through the design and application of two novel error injection methods—AUTOTRANSFORM and AUTOINJECT—the authors simulate semantic and syntactic disruptions in agents' behavior across various collaborative settings. The experiments, conducted over diverse tasks (code generation, math solving, translation, and text evaluation) and multi-agent topologies (linear, flat, and hierarchical), consistently demonstrate that hierarchical structures are the most robust to such failures (Reviewers 1, 2, 3, 4). The work contributes empirical insights into the safety and robustness of collaborative AI, supported by well-executed experimental protocols and meaningful metrics. Reviewers also praise the paper's relevance to the broader literature on LLM agents and its practical implications for designing trustworthy AI systems (1, 2, 4).
Weaknesses and Limitations: Despite its strengths, reviewers highlighted several limitations that should be addressed in future revisions. The lack of theoretical analysis and formal proofs (1, 3) reduces the generalizability of some claims, and the evaluation is constrained by a reliance on only GPT-based models, excluding more diverse LLMs or reasoning strategies like chain-of-thought (1). Several reviewers questioned the fairness of comparing architectures based on different frameworks and system prompts (2, 3), noting that communication protocols and prompting strategies could confound results. Furthermore, important recent works on agent topology and safety were omitted from the citations (1, 3, 4), and the discussion could benefit from integration with other architectures like LangGraph or GPTSwarm (4). Lastly, reviewers suggest that the scope of agent fault modeling could be expanded to consider more dynamic or complex failure scenarios (1, 3). Nonetheless, these issues are outweighed by the novelty, execution quality, and potential impact of the work, supporting a positive acceptance recommendation.