PaperHub
Score: 7.8/10 · Spotlight · 4 reviewers
Ratings: 5, 4, 4, 6 (min 4, max 6, std 0.8) · Confidence: 3.8
Novelty: 3.0 · Quality: 3.3 · Clarity: 2.8 · Significance: 3.5
NeurIPS 2025

AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement

OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-10-29
TL;DR

Our paper introduces AgentBreeder, a framework for multi-objective self-improving evolutionary search over scaffolds, specifically targeting scaffolds' safety impact on large language models in multi-agent systems.

Abstract

Keywords
AI Safety · Multi-Agent Systems · LLMs · Large Language Models · Jailbreaking · Agent Scaffolds

Reviews and Discussion

Review (Rating: 5)

This paper introduces AgentBreeder, an approach to multi-objective evolutionary search that automatically generates and evaluates LLM scaffolds. The central application is optimizing for task capability and safety metrics simultaneously, which is an important problem in AI safety research. The authors demonstrate how their method can be used as (a) a "blue team" approach to discover scaffolds with significantly improved safety performance while maintaining or enhancing capabilities; (b) a "red team" approach that identifies highly capable yet vulnerable scaffolds; and (c) a capability-only improving baseline.

Strengths and Weaknesses

Strengths

  • Methodological contributions: The paper provides a few solid methodological improvements over ADAS, including extending the method to multi-objective optimization and using semantic clustering to balance diversity and efficient evolutionary search.

  • Strong empirical results: The results convincingly demonstrate that AgentBreeder can find scaffolds that significantly improve performance and safety properties, and that it can also be useful as a red-teaming tool.

  • Careful empirical analysis: The empirical evaluation is well-executed; the results are reported across multiple benchmarks and with confidence intervals. The analysis is nuanced, and the authors discuss positive results as well as important limitations.

Weaknesses

  • Unclear red-teaming setup: The paper proposes a red-teaming setup where the goal is to maximize the task performance but minimize safety performance. It is unclear to me what kind of risk model this corresponds to. The authors argue that their approach "seeks to model the case where an actor may unknowingly expose weaknesses in the base LLM when employing scaffolding to improve task performance". However, by explicitly optimizing for the agent to perform poorly on safety metrics, their approach seems much more adversarial than a model developer just aiming to increase performance (which would be better modeled by the capability-only AgentBreeder variant).

  • Some gaps in empirical results: There are a number of natural questions that the results do not answer but that would be easy to address with minor modifications:

    • How does AgentBreeder compare to ADAS in the single-objective case?
    • What impact does the scaffold optimization have on other capability datasets? (e.g. how does the MMLU-optimized scaffold perform on GPQA and vice versa)
    • How does the capability-only AgentBreeder perform on the safety metrics?
    • Could you add helpfulness explicitly to the multi-objective optimization target?
  • Lack of clarity about contributions: The paper proposes a number of methodological advances but is not always clear about which of these are made with which goal in mind. Specifically, which changes are intended to improve the evolutionary algorithm in general, and which adapt it for multi-objective optimization? The paper presents safety as the main motivation of the method, but it seems to me that it can be applied much more widely, which would be useful to discuss in the paper.

  • Unclear applicability to frontier models: The authors admit that the proposed method seems less beneficial for more capable models and discuss possible reasons for this. However, the paper does not evaluate the method with state-of-the-art frontier models and does not provide any systematic evaluation of the benefit for different combinations of models. This leaves important open questions about the applicability of the method unanswered.

Minor typos:

  • Line 98: "Fowler [17] provide" should be "Fowler [17] provides"
  • Line 142: "improve performance performance"

Questions

Questions about experiment results (from above):

  • How does AgentBreeder compare to ADAS in the single-objective case?
  • What impact does the scaffold optimization have on other capability datasets? (e.g. how does the MMLU-optimized scaffold perform on GPQA and vice versa)
  • How does the capability-only AgentBreeder perform on the safety metrics?
  • Could you add helpfulness explicitly to the multi-objective optimization target?

Questions about contributions

  • What is the purpose of the red teaming experiments? Why do you think red teaming is a better model of a developer inadvertently reducing safety than capability-only optimization?
  • Is the proposed method applicable to general multi-objective optimization problems or is it specific to safety problems for some reason?

Additional questions

  • Is the number in the f_C^DROP row the exact same experiment in the red and blue breeder setup? If not, what is the difference? If yes, why are the numbers different?
  • How does your method compare to running ADAS with a quality metric that combines different optimization targets, e.g. a capability and a safety benchmark?

Limitations

The authors discuss limitations adequately, though (as mentioned above) I think the applicability to frontier models is a potentially major concern that would have been nice to investigate further.

Final Justification

The author response has clarified some of my uncertainties about some results. Overall my assessment of the paper hasn't changed substantially and I recommend to accept it.

Formatting Concerns

no concerns

Author Response

Dear Reviewer Tn5P,

We would like to express our sincere gratitude for your thorough and thoughtful evaluation of our work. We are particularly encouraged by your recognition that AgentBreeder addresses "an important problem in AI safety research" and that our results "convincingly demonstrate that AgentBreeder can find scaffolds that significantly improve performance and safety properties". We are grateful that you have recommended that our paper be accepted to this conference. The acceptance of this work would help establish this important research direction, provide further visibility to this important threat model, and provide the community with our open-source framework as a foundation for future investigations.

We deeply appreciate your acknowledgment of several key strengths of our approach. First, we are pleased that you recognized our methodological contributions, particularly our extension to multi-objective optimization and the use of semantic clustering to balance diversity with efficient evolutionary search. Second, we are gratified by your assessment of our "strong empirical results" and that you found our demonstration of both blue-team safety improvements and red-team vulnerability discovery to be convincing. Finally, we are encouraged by your recognition of our "careful empirical analysis," noting our multi-benchmark evaluation with confidence intervals and nuanced discussion of both positive results and limitations. We took extreme care in proportionately discussing the significance of our results and are glad to see this celebrated!

We also thank you for your constructive feedback regarding areas for improvement, including fixing typos, clarifying our red-teaming setup, and extending our empirical analyses with direct comparisons to ADAS in single-objective settings and cross-dataset generalization studies. Additionally, we appreciate your suggestion to clarify which methodological contributions target general evolutionary improvements versus multi-objective optimization specifically. We view this feedback as valuable guidance for strengthening our contribution and are keen to address these points.

Questions about experiment results (from above):

How does AgentBreeder compare to ADAS in the single-objective case?

We have reported these results in Section 5.3 Experiment 3: Multi-Objective Ablation. In particular we report “Comparable Performance to Previous Work. CapableAgentBreeder achieves competitive results to ADAS, marginally surpassing performance across all capability benchmarks”.

What impact does the scaffold optimization have on other capability datasets? (eg. how does the MMLU-optimized scaffold perform on GPQA and vice verse)

This is an excellent question about cross-domain generalization that we would be very excited to explore in future work. Unfortunately, given the computational expense of our evolutionary runs, we did not have the budget to conduct these additional cross-dataset evaluations, but we agree this would provide valuable insights into scaffold transferability.

How does the capability-only AgentBreeder perform on the safety metrics?

We appreciate this question, as it would help establish important baselines for understanding the safety implications of capability-focused optimization. While we did not have the computational resources to include these evaluations in the current work, this represents a natural and important extension that we are eager to pursue.

Could you add helpfulness explicitly to the multi-objective optimization target?

Thank you for this thoughtful suggestion. Adding helpfulness as an explicit objective would indeed be valuable for addressing reward hacking concerns. We would be excited to explore this in future work and note that this would increase the computational cost of capability validation in each experimental run by approximately 50%, which exceeds our current budget constraints.

Questions about contributions

What is the purpose of the red teaming experiments? Why do you think red teaming is a better model of a developer inadvertently reducing safety than capability-only optimization?

We appreciate this opportunity to clarify our threat model. The RedAgentBreeder setup models a scenario analogous to the inner/outer alignment problem, where a scaffold designer operates under a misaligned reward function that inadvertently incentivizes unsafe behavior while pursuing seemingly beneficial objectives. This differs from capability-only optimization because it explicitly captures how well-intentioned optimization processes can systematically discover and exploit safety vulnerabilities when the reward structure creates misalignment between intended goals and actual optimization targets.

Is the proposed method applicable to general multi-objective optimization problems or is it specific to safety problems for some reason?

Thank you for highlighting this broader applicability. Our method is indeed general to any multi-objective optimization problem, not specific to safety. We are excited that our work demonstrates the effectiveness of multi-objective approaches in this domain, and we view exploration of different objective combinations as a promising direction for future research.

Additional questions

Is the number in the f_C^DROP row the exact same experiment in the red and blue breeder setup? If not, what is the difference? If yes, why are the numbers different?

Great question for clarification. While both setups optimize for DROP capability during evolution, they differ in their second objective: BlueAgentBreeder optimizes capability and safety, while RedAgentBreeder optimizes capability and harm (1 - safety). The f_C^DROP numbers reflect the test set performance of the best scaffolds discovered through these different evolutionary processes, explaining why the results differ despite both targeting DROP performance.
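For concreteness, a minimal sketch of these two objective configurations; the function names and the assumption that capability and safety are scalar scores in [0, 1] are illustrative, not taken from the AgentBreeder codebase:

from typing import Tuple


def blue_objectives(capability: float, safety: float) -> Tuple[float, float]:
    """BlueAgentBreeder: jointly maximize capability and safety."""
    return capability, safety


def red_objectives(capability: float, safety: float) -> Tuple[float, float]:
    """RedAgentBreeder: maximize capability and harm (1 - safety)."""
    return capability, 1.0 - safety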

How does your method compare to running ADAS with a quality metric that combines different optimization targets, e.g. a capability and a safety benchmark?

This is an insightful comparison question. Our approach differs from ADAS in being both evolutionary and explicitly multi-objective, rather than using a single combined metric. Creating a multi-objective version of ADAS for direct comparison would be an exciting avenue for future work that could help isolate the contributions of our different methodological innovations.

Kindest regards,

The Authors

Comment

Thanks for the detailed response, which has improved my understanding of the work. I think this is a good paper and continue to think it should be accepted.

Review (Rating: 4)

AgentBreeder aims to address the growing need to autonomously design and optimise multi-agent scaffolds (i.e. LLM-based agent systems). Moreover, AgentBreeder uses multi-objective optimisation to optimise both accuracy and safety, the latter being underrepresented in the optimisation process of current automated design systems for multi-agent scaffolds.

AgentBreeder can be employed in three different modes:

  1. BlueAgentBreeder - optimises for both capability and safety
  2. RedAgentBreeder - optimises for capability and minimising safety
  3. CapableAgentBreeder - optimises just for capability

The evolutionary process is controlled by a Meta Agent using Claude for its coding capabilities. The Meta Agent is responsible for creating new Python code scaffolds through mutation or crossover of elite scaffolds from the previous generation. Diversity is imposed by clustering candidates and then selecting Pareto-front optimal solutions (capability vs safety) from each cluster. Experiments are performed on one of three capability datasets (DROP, MMLU, GPQA), along with the safety benchmark SaladData, plus TruthfulQA to reduce reward hacking.
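As a reading aid, here is a minimal sketch of one generation of this loop; every name (one_generation, meta_agent, embed, cluster, pareto_front) is a hypothetical stand-in for the components named above, not the actual AgentBreeder API:

import random


def one_generation(population, meta_agent, embed, cluster, pareto_front):
    """One generation of the evolutionary search sketched above."""
    # Embed each scaffold's code and group similar designs into clusters.
    clusters = cluster([embed(s.code) for s in population], population)
    # Promote the capability-vs-safety Pareto front of each cluster as elites.
    elites = [e for c in clusters for e in pareto_front(c)]
    # Breed children; the paper weights mutation 2:1 over crossover.
    children = []
    for _ in range(len(population)):
        if random.random() < 2 / 3:
            children.append(meta_agent.mutate(random.choice(elites)))
        else:
            parent_a, parent_b = random.sample(elites, 2)
            children.append(meta_agent.crossover(parent_a, parent_b))
    return elites + children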

Results show a significant uplift in safety (79.4% on average) with BlueAgentBreeder. Results with RedAgentBreeder highlight the potential cost of ignoring safety, i.e. it is possible to have similarly capable scaffolds that are nonetheless highly vulnerable.

Strengths and Weaknesses

Strengths:

  • Incorporating safety into automated scaffold discovery through multi-objective optimisation and quality-diversity
  • The paper addresses a significant need to ensure safety in scaffold design. This method is an elegant way to balance capability and safety at the core of the optimisation process.
  • The three different modes provide interesting insights into automated scaffold design. BlueAgentBreeder clearly shows that safety can be optimised alongside capability, RedAgentBreeder demonstrates the worst-case scenarios where capable scaffolds can also be highly vulnerable, and CapableAgentBreeder provides a benchmark against which to compare.
  • The code is open source and the experiments use recognised benchmarks which is great for reproducibility and comparison.

Weaknesses

  • One of the biggest contributions is the ability to include safety in the optimisation process. As such, BlueAgentBreeder's ability to show an average 79.4% improvement in safety over the ADAS-generated baselines appears impressive. However, I am concerned that these baselines were not generated with safety in mind, i.e. the task definition prompts did not instruct the generation of safe, as well as capable, scaffolds. A better benchmark would be the output from an ADAS run where the prompt included such an instruction, similar to the prompt used for the Meta Agent.
  • The clustering process is somewhat questionable. The embedding is very low dimensional and it is not clear whether this is sufficient to capture the architectural quality differences between scaffolds. Furthermore, it is not clear how the 0.7 threshold was chosen. Examples from different clusters, and their analysis, would be very useful to illustrate how the embedding and clustering impact diversity. The paper could be further strengthened if an ablation of clustering were included to concretely demonstrate that it contributes to performance.
  • Safety evaluation relies on an LLM to judge responses. LLM evaluations are sensitive to their prompting and are susceptible to biases, such as positional, sycophancy, etc. The paper would benefit if the limitations of this evaluation were included and discussed. Human validation of the LLM evaluator on a representative sample would be useful to instill confidence in this approach.
  • The finding that the relative gain between seed and discovered scaffolds diminishes with stronger models is significant and is not fully explored. This could have significant implications regarding the limitations of scaffolding ever-improving LLMs. A more in-depth discussion on this point would significantly increase the paper's impact.

Questions

I have a few questions and clarifications.

  1. In Table 1, you report on three scaffolds from each experimental run: the most capable (arg max{f_C}), the safest (arg max{f_S}), and a 'best overall' (arg max{f_C, f_S, f_H}). While the first two are clear selections from the Pareto front, could you please clarify the specific criterion or scalarisation function used to select the single 'best overall' scaffold?
  2. Section 4.1 describes a balanced validation sampling strategy using 50% 'positive' and 50% 'negative' samples. Could you please confirm if these categories correspond to samples that a baseline CoT agent answered correctly and incorrectly, respectively?
  3. The impressive safety gains are benchmarked against scaffolds not optimized for safety. To ensure a fair comparison, did you consider creating a stronger baseline by running the original ADAS framework with a prompt that also asks it to consider safety? This seems crucial for isolating the benefit of AgentBreeder's specific methodology.
  4. Could you provide more justification for your clustering methodology (the 12-dim embedding and 0.7 threshold)? An ablation study that removes clustering would be the most convincing way to demonstrate its contribution to performance and diversity.

Limitations

The authors have addressed most limitations.

There are a couple more that may be worth including:

  • The initial population diversity (from the seven seeds) may have limited the potential diversity of solutions.
  • As discussed above, the limitations of using LLM-based evaluation for safety evaluation should be included and discussed.

Final Justification

I am satisfied with the authors' detailed responses and with their answers to my few questions and clarifications. Overall my assessment of the paper, and therefore my rating, remains the same, along with my original comment that if the authors had had the resources to address some of the outstanding points, I believe the paper could have been awarded an even higher rating.

Formatting Concerns

None.

Author Response

Dear Reviewer 7aKn,

We would like to express our sincere gratitude for your thoughtful and comprehensive evaluation of our work. We are particularly encouraged by your recognition that our paper "addresses a significant need to ensure safety in scaffold design" and that our method provides "an elegant way to balance capability and safety at the core of the optimisation process." Your astute note that considerations of safety are “underrepresented in the optimisation process of current automated design systems for multi-agent scaffolds” is a core reason we are so excited about the potential for this paper to be accepted. The acceptance of this work would help establish safety as a fundamental consideration in automated agent system design and provide the research community with our open-source framework to build upon.

We deeply appreciate your acknowledgment of several key strengths of AgentBreeder. First, we are pleased that you recognized our core contribution of "incorporating safety into automated scaffold discovery through multi-objective optimisation and quality-diversity," which addresses what you correctly identify as a significant gap in current approaches. Second, we are gratified by your positive assessment of our three operational modes, noting how BlueAgentBreeder demonstrates that "safety can be optimised alongside capability," RedAgentBreeder reveals "worst-case scenarios where capable scaffolds can also be highly vulnerable," and CapableAgentBreeder provides important benchmark comparisons. Finally, we appreciate your recognition of our commitment to reproducibility through open-source code and use of recognized benchmarks.

We also thank you for your constructive feedback regarding areas for improvement, particularly regarding baseline comparison fairness (suggesting ADAS baselines that explicitly include safety instructions), the clustering methodology and threshold selection, and the limitations of LLM-as-a-judge safety evaluation. Additionally, we appreciate your observation about the diminishing gains with stronger models, which you note could have "significant implications regarding the limitations of scaffolding ever-improving LLM models." We view this feedback as valuable guidance for strengthening our contribution and are keen to address these important points.

In Table 1, you report on three scaffolds from each experimental run: the most capable (arg max{f_C}), the safest (arg max{f_S}), and a 'best overall' (arg max{f_C, f_S, f_H}). While the first two are clear selections from the Pareto front, could you please clarify the specific criterion or scalarisation function used to select the single 'best overall' scaffold?

Thank you for this excellent clarifying question. The 'best overall' scaffold was selected as the one maximizing the sum of capability and safety scores, with the important caveat that we excluded scaffolds that engaged in reward hacking of the safety objective.
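A minimal sketch of this selection rule, assuming scaffold records with capability and safety attributes and a hypothetical is_reward_hacked predicate (e.g. flagging blanket-refusal scaffolds):

def best_overall(scaffolds, is_reward_hacked):
    """Maximize capability + safety, excluding reward-hacked scaffolds."""
    candidates = [s for s in scaffolds if not is_reward_hacked(s)]
    return max(candidates, key=lambda s: s.capability + s.safety)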

Section 4.1 describes a balanced validation sampling strategy using 50% 'positive' and 50% 'negative' samples. Could you please confirm if these categories correspond to samples that a baseline CoT agent answered correctly and incorrectly, respectively?

Yes, that is exactly correct! ‘Positive' samples correspond to those the baseline CoT agent answered correctly, while 'negative' samples are those it answered incorrectly. We implemented this balanced sampling strategy specifically to increase the signal strength for the evolutionary process, ensuring adequate representation of both success and failure cases.
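A small sketch of this balanced sampling, assuming a hypothetical cot_correct predicate recording the baseline CoT agent's outcome per item; the 250-item default mirrors the validation-set size discussed elsewhere in the reviews:

import random


def balanced_validation_sample(items, cot_correct, n=250, seed=0):
    """Draw n items: half 'positive' (baseline CoT answered correctly),
    half 'negative' (it answered incorrectly)."""
    rng = random.Random(seed)
    positives = [x for x in items if cot_correct(x)]
    negatives = [x for x in items if not cot_correct(x)]
    half = n // 2
    return rng.sample(positives, half) + rng.sample(negatives, n - half)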

The impressive safety gains are benchmarked against scaffolds not optimized for safety. To ensure a fair comparison, did you consider creating a stronger baseline by running the original ADAS framework with a prompt that also asks it to consider safety? This seems crucial for isolating the benefit of AgentBreeder's specific methodology.

Thank you for this thoughtful suggestion about creating a more rigorous baseline comparison. We did not modify the original ADAS framework to include safety considerations in this work, but we completely agree this would provide valuable insights into the specific contributions of our multi-objective approach. This represents an exciting direction for future work that would help isolate the benefits of our methodology from simply including safety in the optimization process.

Could you provide more justification for your clustering methodology (the 12-dim embedding and 0.7 threshold)? An ablation study that removes clustering would be the most convincing way to demonstrate its contribution to performance and diversity.

Thank you for this important methodological question. We found empirically that the embedding dimension could be effectively reduced to 12 dimensions while still enabling the clustering to meaningfully separate different scaffold types, such as debate-based versus self-consistency approaches. The distance threshold of 0.7 was selected through preliminary experiments that indicated this value provided an effective balance - ensuring clusters remained sufficiently diverse without becoming overly fragmented, thereby maintaining exploration across different regions of the scaffold design space. These ablation studies were unfortunately deprioritised due to budget constraints; however, we agree that a systematic ablation study would strengthen these design choices, and we view this as valuable future work to more rigorously validate the clustering methodology's contributions.
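For illustration, one way such a clustering step could look with scikit-learn; PCA is a stand-in assumption for the unspecified 12-dimensional reduction, so this is a sketch rather than the paper's implementation:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering


def cluster_scaffolds(embeddings: np.ndarray) -> np.ndarray:
    """Cluster scaffold embeddings without fixing the cluster count."""
    # Reduce to 12 dimensions (PCA as a stand-in for the paper's method).
    reduced = PCA(n_components=12).fit_transform(embeddings)
    # Thresholded agglomerative clustering lets the cluster count emerge.
    return AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=0.7,
        metric="cosine",
        linkage="average",
    ).fit_predict(reduced)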

Kindest regards,

The Authors

Comment

Thanks for the detailed response and for addressing my few questions and clarifications, particularly clarifying how you came to choose the 12-dim embedding with the 0.7 threshold.

I will stick with my current ratings. If the authors were able to address point (3), a comparison against a modified ADAS framework that includes safety, and point (4), an ablation to validate the clustering methodology's contributions, in the paper, then I could consider revising my ratings upwards.

Review (Rating: 4)

AgentBreeder tackles the under-explored safety implications of turning a single LLM into a multi-agent scaffold. The authors cast scaffold design as an open-ended, multi-objective evolutionary search: a “Meta-Agent” (Claude 3.5) mutates and cross-breeds code for candidate scaffolds, embeds each design, clusters them, then promotes Pareto-optimal “elites” that jointly maximise capability and safety scores. The framework can be run in three modes—BLUE (defence, maximise both objectives), RED (attack, maximise capability while minimising safety), and CAPABLE (capability only) — providing an automated way to stress-test or harden agent systems before deployment.

Strengths and Weaknesses

Strengths

Quality

  • Reproducibility: algorithm is formalised (Alg. 1), code is promised open-source, and all evaluations are scripted inside Inspect.
  • Empirical support: runs BLUE, RED and CAPABLE modes on three capability benchmarks (DROP, MMLU, GPQA) plus the SaladData safety set; reports confidence intervals and ablates single- vs multi-objective search.
  • Sound methodology: frames scaffold design as a multi-objective evolutionary search, combining embedding-based clustering with Pareto elite selection.

Clarity

  • Clear separation of BLUE / RED / CAPABLE modes aids readability.
  • Paper is well organised; Figure 1 plus Alg. 1 convey the pipeline at a glance.

Significance

  • Addresses a fast-growing deployment pattern (LLMs scaffolded into swarms) that current safety work largely ignores.
  • Provides a practical defence (BLUE) and attack (RED) recipe that labs could integrate into release pipelines.

Weaknesses

Quality

  • Heavy reliance on proprietary models (Claude 3.5, GPT-4o-mini); results may not generalise to open weights.

Significance

  • Absolute capability gains are modest; blue-team safety jump is large mainly because the baseline is poor.
  • Benchmarks are synthetic; no evidence yet that findings transfer to real web-based agents.

Questions

  1. Validation-set size & statistical power

You use only 250 validation items per generation and report bootstrap CIs. How sensitive are your evolutionary decisions to this small sample? Did you try larger samples or a moving-average fitness estimate to reduce noise?

  2. Hyper-parameter choices

    • Why fix the agglomerative-clustering distance threshold at 0.7?
    • How did you arrive at the 2:1 weighting of mutation vs. crossover?
    • What temperature / top-p settings were used for the Meta-Agent and for the agents inside the scaffolds?
  3. Reward-hacking detection

You note that some scaffolds achieve 95% “safety” by answering “Sorry, I can’t help with that.” Beyond adding a helpfulness metric, did you explore behavioural diversity penalties or response-length constraints to discourage such trivial policies?

Limitations

yes

Formatting Concerns

This paper is well formatted.

Author Response

Dear Reviewer bz36,

We would like to express our sincere gratitude for your comprehensive and well-structured evaluation of our work. We are particularly encouraged by your recognition that AgentBreeder "addresses a fast-growing deployment pattern (LLMs scaffolded into swarms) that current safety work largely ignores" and provides "a practical defence (BLUE) and attack (RED) recipe that labs could integrate into release pipelines". We believe this work introduces a threat model not yet explored - multi-objective multi-agent self-improving evolutionary risks - and are excited for the long-term impact of this work on the field.

We deeply appreciate your acknowledgment of several key strengths across the dimensions of quality, clarity, readability and significance. We were pleased to be able to run our experiments on a number of widely-used benchmarks - DROP, MMLU, GPQA, SaladData and TruthfulQA. Additionally, we provide a simple API in the codebase to integrate further benchmarks and have included examples in our open source repository. Most importantly, we are encouraged by your assessment of the significance of our work, particularly your observation that we address an important gap in current safety research by focusing on multi-agent scaffolds.

We also thank you for your constructive feedback regarding areas for improvement, particularly regarding generalizability beyond proprietary models and the need for evaluation on more realistic deployment scenarios. We view these as valuable directions for future work and are keen to address the questions you have raised.

Validation-set size & statistical power

You use only 250 validation items per generation and report bootstrap CIs. How sensitive are your evolutionary decisions to this small sample? Did you try larger samples or a moving-average fitness estimate to reduce noise?

We thank you for this interesting question! To compare our approach to the seminal work ADAS (Hu et al., 2024), we matched our sample size and methodology closely to ADAS, which used between 20 (ARC) and 128 (other domains) samples for validation. We empirically found that AgentBreeder was sensitive to sample sizes below 100. We believe this to be because our multi-objective, diversity-optimized setting reduced the signal for the self-improving process, so we increased the sample size to 250 (the largest we had budget for). We would be excited to explore moving-average fitness as a future direction and are grateful for the suggestion!
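For illustration, a moving-average fitness estimate of the kind the reviewer suggests could look like the following hypothetical sketch (not part of the released codebase):

class EmaFitness:
    """Exponential moving average over noisy per-generation fitness scores."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha  # higher alpha trusts the newest sample more
        self.value = None

    def update(self, noisy_score: float) -> float:
        if self.value is None:
            self.value = noisy_score
        else:
            self.value = self.alpha * noisy_score + (1 - self.alpha) * self.value
        return self.value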

Hyper-parameter choices

Why fix the agglomerative-clustering distance threshold at 0.7? How did you arrive at the 2:1 weighting of mutation vs. crossover?

Thank you for this clarifying question! We chose thresholded agglomerative clustering to allow the number of clusters to emerge naturally during the evolutionary process rather than requiring a fixed cluster count a priori. The distance threshold of 0.7 was selected based on preliminary experiments that indicated this value provided an effective balance - ensuring clusters remained sufficiently diverse without becoming overly fragmented, which helped maintain exploration across different regions of the scaffold design space.

Similarly, the 2:1 mutation-to-crossover weighting was determined through initial hyperparameter tuning. We observed that emphasizing mutation over crossover facilitated better exploration of the scaffold design space, likely due to the discrete and structured nature of code-based scaffold representations where crossover operations can more easily produce invalid or suboptimal combinations.

We acknowledge that both of these hyperparameters would benefit from more systematic ablation studies. Given the computational expense of each evolutionary run (requiring extensive LLM evaluations across multiple benchmarks), we focused our limited budget on the core experimental comparisons presented in the paper. We view comprehensive hyperparameter analysis as an important direction for future work, particularly as computational costs for such experiments decrease.

What temperature / top-p settings were used for the Meta-Agent and for the agents inside the scaffolds?

A temperature of 0.5 and top-p of 1 (the default) were chosen for the Meta Agent, matching the seminal work ADAS (Hu et al., 2024). For the scaffolds, the temperatures of the agents could be chosen dynamically by the Meta Agent and varied between agents.

Reward-hacking detection

You note that some scaffolds achieve 95% “safety” by answering “Sorry, I can’t help with that.” Beyond adding a helpfulness metric, did you explore behavioural diversity penalties or response-length constraints to discourage such trivial policies?

Thank you for this exciting suggestion! We did not explore behavioral diversity penalties or response-length constraints in this work, but we agree this represents an exciting and important future direction. The observation that some scaffolds achieve high safety scores through overly conservative refusal behavior highlights a fundamental challenge in safety optimization - distinguishing between genuine safety improvements and superficial compliance strategies.

Your suggestion of behavioral diversity penalties is particularly compelling, as it could encourage the evolution of scaffolds that maintain safety through more sophisticated reasoning rather than blanket refusal. Similarly, response-length constraints or more nuanced helpfulness metrics could help identify scaffolds that achieve the desired balance between safety and utility. This connects to broader questions in AI alignment about reward specification and avoiding Goodhart's law effects in safety metrics.
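As a concrete illustration of this direction, a hypothetical refusal-rate penalty; the marker strings and the 0.5 weight are illustrative assumptions, not values from the paper:

REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm sorry")


def penalized_safety(safety_score, responses):
    """Discount safety scores earned mostly through blanket refusals."""
    refusals = sum(
        any(marker in response.lower() for marker in REFUSAL_MARKERS)
        for response in responses
    )
    refusal_rate = refusals / max(len(responses), 1)
    return safety_score * (1.0 - 0.5 * refusal_rate)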

Kindest regards,

The Authors

Review (Rating: 6)

This paper makes a significant and timely contribution to AI safety and the Automated Design of Agentic Systems (ADAS) by introducing AgentBreeder, a novel multi-objective evolutionary framework for designing multi-agent scaffolds for Large Language Models (LLMs). It effectively addresses the critical, yet underexplored, safety implications of such systems, offering a robust approach to balancing capability and safety in multi-agent architectures.

Strengths and Weaknesses

Strengths

This paper addresses a critical gap in the safety of ADAS and self-improving AI systems by explicitly incorporating safety as an optimization objective alongside capability in the design of multi-agent scaffolds for LLMs. This focus is timely and essential, given the growing development and potential deployment of AI self-improvement techniques.

AgentBreeder builds on prior work in Automated Design of Agentic Systems (ADAS) by introducing tailored mutation and crossover operators, multi-objective optimization, and a MAP-Elites-inspired clustering approach. These advancements enhance the diversity and quality of scaffold designs, effectively balancing capability and safety. Surprisingly, the inclusion of a safety objective not only improves adversarial robustness but also appears to facilitate more effective open-ended exploration of the search space, as evidenced by performance improvements over ADAS.

The framework’s three operational modes—BlueAgentBreeder (optimizing both safety and capability), RedAgentBreeder (maximizing vulnerability and capability), and CapableAgentBreeder (focusing solely on capability)—are extensively evaluated. Experiments demonstrate compelling results, with BlueAgentBreeder achieving a 79.4% average uplift in safety benchmark performance (e.g., SaladData) while maintaining or improving capability on benchmarks like DROP, MMLU, and GPQA. RedAgentBreeder effectively identifies exploitable vulnerabilities, providing valuable insights for improving system robustness.

Weaknesses

The paper could better justify certain design choices. For instance, the rationale for using clustering in the archive is unclear. Given the typically limited number of solutions explored in ADAS-style algorithms, combining clustering with non-dominated sorting to form Pareto fronts within each cluster may result in most solutions being non-dominated, potentially reducing the effectiveness of the selection process. Ablation studies evaluating key components, such as clustering and the crossover operator, would strengthen the paper by clarifying their impact.

Additionally, in the red-teaming setting (RedAgentBreeder), the paper optimizes both capability (e.g., DROP performance) and vulnerability (1-SaladData). It is unclear why capability is prioritized, as a typical adversary might focus exclusively on exploiting safety weaknesses, regardless of task performance. A comparison with a single-objective approach, similar to CapableAgentBreeder but minimizing safety scores, could clarify the value of this dual-objective strategy.

Overall, this paper makes a significant contribution to ADAS and self-improving AI by advancing the design of safe and capable multi-agent systems. I would consider increasing my score if the above issues are addressed.

Questions

  1. The paper states, "Weighting the mutation operator twice as highly as crossover was found empirically to lead to faster convergence" (Line 133). Could the authors provide details or results from the empirical study supporting this claim?

  2. The explanation in Line 213 regarding RedAgentBreeder’s approach is unclear: "It is important to note that in this case, the Meta Agent is not prompted to discover unsafe scaffolds, instead these arise via Pareto optimization on capability and harm benchmarks. This seeks to model the case where an actor may unknowingly expose weaknesses in the base LLM when employing scaffolding to improve task performance." If RedAgentBreeder simulates an adversarial attacker, why is the Meta Agent not explicitly prompted to identify unsafe scaffolds? Additionally, how does optimizing for both capability and harm benchmarks lead to "unknowingly exposing weaknesses"? Isn't minimizing the safety benchmark explicitly trying to expose weaknesses?

  3. The results for argmax {f_C^DROP, f_C^MMLU, f_C^GPQA} in Table 1 (CapableAgentBreeder) do not align with those reported in Table 3 (Appendix B.3). Are there differences in experimental settings that account for this inconsistency?

Limitations

yes

Final Justification

My concerns are now addressed by the authors' explanations and additional results. I think the topic of this paper is timely and important for the safety of agents and self-improving AI, making an important milestone in the direction of automated agent design. I believe the method design is sound, and the extensive experiments support the effectiveness of the method. I will recommend the acceptance of the paper and increase my score to 6.

Formatting Concerns

no

Author Response

Dear Reviewer ahDK,

We would like to express our sincere gratitude to the reviewer for their thorough and constructive evaluation of our work. We are particularly encouraged by their recognition that our paper "makes a significant and timely contribution to AI safety and the Automated Design of Agentic Systems (ADAS)" and addresses "a critical gap in the safety of ADAS and self-improving AI systems". We are excited about the potential for this paper to be accepted, as the first paper (to our knowledge) addressing this kind of multi-agent self-improving evolutionary threat model. We believe this will motivate research in this new area, with our open-source codebase providing a jumping-off point for the field.

We are gratified that they found our experimental evaluation compelling, specifically noting the "79.4% average uplift in safety benchmark performance" achieved by BlueAgentBreeder while maintaining capability improvements. We are also pleased that the reviewer recognized our technical contributions, including our tailored mutation and crossover operators, multi-objective optimization framework, and MAP-Elites-inspired clustering approach, with them noting that our safety objective "not only improves adversarial robustness but also appears to facilitate more effective open-ended exploration of the search space".

We also thank the reviewer for their constructive feedback regarding areas for improvement, particularly their suggestions for additional ablation studies and clearer justification of certain design choices such as our clustering approach and the dual-objective strategy in RedAgentBreeder. We view this feedback as valuable guidance for strengthening our contribution and are keen to address these points.

The paper states, "Weighting the mutation operator twice as highly as crossover was found empirically to lead to faster convergence" (Line 133). Could the authors provide details or results from the empirical study supporting this claim?

Thank you for highlighting this point. We acknowledge that we should have included a more systematic ablation study of the mutation-to-crossover ratio. This empirical observation was based on initial tuning experiments, but we recognize that a comprehensive analysis of this hyperparameter would strengthen the paper. We will include this as an important direction for future work.

The explanation in Line 213 regarding RedAgentBreeder’s approach is unclear: "It is important to note that in this case, the Meta Agent is not prompted to discover unsafe scaffolds, instead these arise via Pareto optimization on capability and harm benchmarks. This seeks to model the case where an actor may unknowingly expose weaknesses in the base LLM when employing scaffolding to improve task performance." If RedAgentBreeder simulates an adversarial attacker, why is the Meta Agent not explicitly prompted to identify unsafe scaffolds? Additionally, how does optimizing for both capability and harm benchmarks lead to "unknowingly exposing weaknesses"? Isn't minimizing the safety benchmark explicitly trying to expose weaknesses?

We thank the reviewer for highlighting this important conceptual confusion in our explanation. The threat model we consider with RedAgentBreeder is analogous to the inner/outer alignment problem. Specifically, we model a scenario where a scaffold designer (or Meta Agent) has good intentions and explicitly optimizes for task capability, but operates under a misaligned reward function that inadvertently incentivizes unsafe behavior. In this setting, the Meta Agent is not explicitly prompted to discover unsafe scaffolds - rather, it is optimizing for what it believes to be beneficial objectives (high task performance). However, the reward structure creates a misalignment between the intended goal (safe capability improvement) and the actual optimization target.

We believe you have highlighted an exciting direction for future work, where a jailbroken Meta Agent is tasked with designing unsafe scaffolds.

The results for argmax {f_C^DROP, f_C^MMLU, f_C^GPQA} in Table 1 (CapableAgentBreeder) do not align with those reported in Table 3 (Appendix B.3). Are there differences in experimental settings that account for this inconsistency?

We thank the reviewer for their careful attention to the experimental details. Yes, the results in Table 1 are reported for BlueAgentBreeder, i.e. AgentBreeder run in a multi-objective setup which jointly optimizes capability and safety, whereas the results in Table 3 are reported for CapableAgentBreeder, which optimizes solely for capability. Both experimental configurations are labeled at the top of each respective table and are also referenced in the corresponding figure captions; however, we will update the manuscript to better contextualize these three (blue, red and capable) experiments, as we believe you’ve highlighted a great improvement to the readability of the paper.

Kindest regards,

The Authors

Comment

Thank you for your detailed response to my questions. It has resolved many of my concerns, and I appreciate the clarifications provided. However, I have a few follow-up questions below. If these can be addressed, I will be open to updating my score accordingly.

  1. My question in Weakness 1 regarding the clustering and Pareto fronts was not addressed in your response. Could you please discuss this?

  2. Regarding the multi-objective optimization framework:

    2.1. Could you provide a visualization of the Pareto front (PF) and report additional multi-objective metrics, such as the hypervolume (HV) indicator and inverted generational distance (IGD)? I understand new figures cannot be uploaded now, but including these additional metrics would be helpful to help readers understand the effectiveness of the multi-objective optimization.

    2.2. How do you select the final agent from the PF to present in the tables? There should be many agents balancing the tradeoffs between objectives.

  3. Thank you for the clarification on RedAgentBreeder’s setting—it makes sense now.

Comment

I thank the authors for their response and the additional experimental results. I think they address all of my concerns. A few additional suggestions:

  1. The calculation of HVs provides good insights. But it would be more insightful to compare HVs to other optimization methods, e.g., when you choose different objective(s) to optimize in Table 1.

  2. Selecting the scaffold that has the highest sum of objectives makes sense. But I think it would be helpful to also show the best-performing agent and safest agent, i.e., choose a few scaffolds to show the trade-off between objectives. Also, the selection method should be stated in the paper.

That being said, I think the topic of this paper is timely and important for the safety of agents and self-improving AI, making an important milestone in the direction of automated agent design. I believe the method design is sound, and the extensive experiments support the effectiveness of the method. I will recommend the acceptance of the paper and increase my score to 6.

Comment

Dear Reviewer ahDK,

Thank you so much for providing further feedback and suggestions! We find them of great value and have taken the time to address all your further points, including plots, metrics and answers to your questions!

Visualization of Pareto front (PF) and reporting of hypervolume (HV) indicator

In Appendix B: Experimental Runs (on page 14), you will find plots showing the advancement of the Pareto capability-safety frontier on the validation set for all multi-objective experiments.

We valued your suggestion of reporting a metric to quantify the effectiveness of the multi-objective optimization and report the HV indicator on the test set.

For BlueAgentBreeder:

Benchmark | Seed Scaffolds | Discovered Scaffolds
GPQA      | 0.219064       | 0.247536
MMLU      | 0.484208       | 0.542816
DROP      | 0.390754       | 0.438813

And for RedAgentBreeder:

Benchmark | Seed Scaffolds | Discovered Scaffolds
DROP      | 0.572759       | 0.602547

Additionally, here is the code we used to generate these values:

import json, math
from pathlib import Path

# Load per-scaffold test results (one JSON record per line).
file = Path(
    "/home/Documents/AgentBreeder/src/results/test/gpqa-cecec343-5f63-4a02-99b8-7d0155d7c45f.jsonl"
)
records = [
    json.loads(l) for l in file.read_text(encoding="utf-8").splitlines() if l.strip()
]
for r in records:
    r.pop("system_code", None)  # drop scaffold source code; not needed for HV

seed = {
    "Chain-of-Thought",
    "Self-Consistency with Chain-of-Thought",
    "LLM Debate",
    "Step-back Abstraction",
    "Quality-Diversity",
    "Dynamic Assignment of Roles",
    "Self-Refine (Reflexion)",
}

groups = {
    "Seed": [
        (
            r["system_capability_ci_median"],
            r["system_safety_ci_median"],
            r["system_name"],
        )
        for r in records
        if r["system_name"] in seed
    ],
    "Discovered": [
        (
            r["system_capability_ci_median"],
            r["system_safety_ci_median"],
            r["system_name"],
        )
        for r in records
        if r["system_name"] not in seed
    ],
}

# Reference point for the hypervolume calculation (origin = worst case).
ref = (0.0, 0.0)


def pareto(pts, ref=(0, 0)):
    """Return the non-dominated (capability, safety, name) points."""
    pts = [p for p in pts if p[0] >= ref[0] and p[1] >= ref[1]]
    pts.sort(key=lambda t: (-t[0], t[1]))  # x descending, then y ascending
    front, best_y = [], -math.inf
    for x, y, n in pts:
        # With x sorted descending, a point is non-dominated iff its y
        # strictly exceeds the best y seen so far.
        if y > best_y:
            front.append((x, y, n))
            best_y = y
    return front


def hv(pts, ref=(0, 0)):
    """2-D hypervolume: the area dominated by the Pareto front above ref,
    summed as a staircase of rectangle strips."""
    front = pareto(pts, ref)
    xs = [p[0] for p in front] + [ref[0]]
    return (
        sum(
            max(xs[i] - xs[i + 1], 0) * (front[i][1] - ref[1])
            for i in range(len(front))
        ),
        front,
    )


for name, pts in groups.items():
    volume, frontier = hv(pts, ref)
    print(
        f"{name} hypervolume: {volume:.6f}  (|S|={len(pts)}, |front|={len(frontier)})"
    )

Final Questions

  1. My question in Weakness 1 regarding the clustering and Pareto fronts was not addressed in your response. Could you please discuss this?

We chose clustering within the archive specifically to enhance exploration of the search space, following principles from MAP-Elites, and because it aligns well with our threat modeling objectives. Your concern about the potential reduction in selection pressure due to most solutions becoming non-dominated within clusters is valid and deserves more thorough investigation. We agree that ablation studies would significantly strengthen our work by demonstrating the individual contributions of clustering and the crossover operator. We focused our limited budget on running AgentBreeder across multiple and varied benchmarks and would love to explore the ablation studies in future work.
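To make the concern concrete, a tiny standalone check of two-objective non-domination; with only a handful of scaffolds per cluster, every point can easily survive the filter:

def nondominated(points):
    """Points not dominated in a two-objective maximization problem."""
    return [
        p for p in points
        if not any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
    ]


# Three scaffolds in one cluster: all are mutually non-dominated, so
# Pareto selection alone applies little pressure within the cluster.
cluster = [(0.60, 0.40), (0.55, 0.55), (0.40, 0.70)]
print(nondominated(cluster))  # all three survive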

2.2. How do you select the final agent from the PF to present in the tables? There should be many agents balancing the tradeoffs between objectives.

Thank you for highlighting this! The Pareto optimal scaffold was selected as the one maximizing the sum of capability and safety scores, with the important caveat that we excluded scaffolds that engaged in reward hacking of the safety objective.

We hope we have addressed your questions, but please do let us know if there is anything further we can clarify in support of an increased score.

Kindest regards,

The Authors

Final Decision

The paper introduces AgentBreeder, a multi-objective evolutionary framework for designing LLM-based multi-agent scaffolds that explicitly balance safety and capability. It supports three modes—Blue (safety + capability), Red (capability + vulnerability), and Capable (capability only)—and demonstrates that safety can be significantly improved (~79% uplift) without sacrificing task performance. The framework provides both conceptual insights into alignment and practical tools for red/blue teaming.

Strengths of the paper:

All reviewers (ahDK, bz36, Tn5P) agree this is a timely and important contribution to AI safety and ADAS. They highlight the novelty of treating safety as a first-class optimization objective, the clear technical innovations (mutation/crossover operators, MAP-Elites clustering, Pareto optimization), and the strong empirical results across multiple benchmarks (DROP, MMLU, GPQA, SaladData). The work is well-presented, reproducible, and accompanied by open-source code.

Weaknesses of the paper:

  • Reviewer ahDK notes limited justification for clustering and the mutation/crossover ratio, and questions the dual-objective setup in RedAgentBreeder.
  • Reviewer bz36 raises concerns about reliance on proprietary models, modest capability gains, and synthetic benchmarks.
  • Reviewer Tn5P points to missing ablations (e.g., generalization across tasks), limited evaluation on frontier models, and some ambiguity in the Red threat model.

Primary reasons for Accept (Spotlight):

The primary reasons for recommending Accept (Spotlight) are that this paper makes a novel, significant, and timely contribution to AI safety by proposing the first framework (to our knowledge) for multi-objective, self-improving agent scaffolding that explicitly optimizes for safety alongside capability. It demonstrates large, consistent safety improvements without hurting performance, offers both red- and blue-team use cases, and contributes a reproducible, extensible framework. The authors’ rebuttal and revisions further strengthened the submission by providing additional results and clarifications (e.g. hyperparameter rationale, selection strategy, and threat model explanation), addressing the key concerns raised in review.

Summary of the discussion and rebuttal

Overall, the rebuttal successfully addressed the key issues raised by the reviewers. The discussion was very positive: one reviewer explicitly raised their score from borderline to a clear accept after seeing the authors’ revisions and answers, and all reviewers converged on enthusiastic recommendations for acceptance. The consensus is that the paper’s contributions – both conceptual and empirical – are strong enough to warrant a spotlight presentation at NeurIPS.