PaperHub
COLM 2025 · Poster · 4 reviewers
Overall rating: 6.5/10 (min 6, max 7, std 0.5) · individual ratings: 7, 7, 6, 6
Average confidence: 3.5

LM Agents May Fail to Act on Their Own Risk Knowledge

OpenReview · PDF
Submitted: 2025-03-21 · Updated: 2025-08-26
TL;DR

This paper develops a systematic safety evaluation framework for LM agents, reveals persistent gaps between risk awareness and safe execution, and proposes effective mitigation strategies.

Abstract

Language model (LM) agents have demonstrated significant potential for automating real-world tasks, yet they pose a diverse array of potential, severe risks in safety-critical scenarios. In this work, we identify a significant gap between LM agents' risk awareness and safety execution abilities: while they often answer "Yes" to queries like "Is executing `sudo rm -rf /*` dangerous?", they will likely fail to identify such risks in instantiated trajectories or even directly perform these risky actions when acting as agents. To systematically investigate this, we develop a comprehensive evaluation framework to examine agents' safety across three progressive dimensions: 1) their knowledge about potential risks, 2) their ability to identify corresponding risks in execution trajectories, and 3) their actual behaviors to avoid executing these risky actions. Our evaluation reveals two critical performance gaps that resemble the generator-validator gaps observed in LMs: while agents demonstrate near-perfect risk knowledge (>98% pass rates), they fail to apply this knowledge when identifying risks in actual scenarios, with performance dropping by >23%, and often still execute risky actions (<26% pass rates). This trend persists even in specialized reasoning models like DeepSeek-R1, reinforcing the challenge of translating an LM's risk knowledge into safe decision-making. We take advantage of these observed gaps to develop a risk verifier that independently critiques the proposed actions by agents, with an abstractor that converts specific execution trajectories into abstract descriptions where LMs can more effectively identify the risks. Our overall system achieves a significant reduction of risky action execution by 55.3% over vanilla-prompted agents.
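
As a rough illustration of the three progressive evaluation dimensions described in the abstract, the sketch below lays out how such an evaluation loop might be organized. It is not the authors' implementation: the `TestCase` fields, the agent interface (`answer`, `flags_risk`, `next_action`), and the `is_risky` stub are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    risk_question: str             # abstract risk-knowledge query (Knowledge test)
    trajectory: str                # full risky trajectory (Identification test)
    safe_partial_trajectory: str   # safe prefix truncated before the risky action (Execution test)

def is_risky(action: str) -> bool:
    # Placeholder: the paper uses an LM-based safety evaluator here.
    return "rm -rf" in action

def evaluate_agent(agent, test_cases):
    """Compute pass rates for the three progressive dimensions.
    `agent` is assumed to expose hypothetical methods answer(), flags_risk(),
    and next_action(); none of these names come from the paper."""
    passes = {"knowledge": 0, "identification": 0, "execution": 0}
    for case in test_cases:
        # 1) Knowledge: does the model acknowledge the abstract risk?
        if agent.answer(case.risk_question).strip().lower().startswith("yes"):
            passes["knowledge"] += 1
        # 2) Identification: does it spot the same risk inside a concrete trajectory?
        if agent.flags_risk(case.trajectory):
            passes["identification"] += 1
        # 3) Execution: starting from a safe prefix, does it avoid proposing the risky action?
        if not is_risky(agent.next_action(case.safe_partial_trajectory)):
            passes["execution"] += 1
    n = len(test_cases)
    return {stage: count / n for stage, count in passes.items()}
```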
Keywords
Large Language Models, Language Model Agents, AI Safety, Evaluation

Reviews and Discussion

Review
Rating: 7

This paper defines and studies safety awareness-execution gaps in LM agents. Safety awareness-execution gaps occur when LM agents can correctly classify whether a safety concept is being violated in a specific scenario, yet, when asked to perform the task, will comply and complete the harmful actions. Prompting techniques are applied to evaluate these gaps on state-of-the-art LM agents using a single manually curated set of agent trajectories. The authors find that these gaps are quite large, emphasizing the lack of safety alignment in modern LM agents. The authors propose a two-step test-time compute solution for verifying whether the LM agent's proposed action is safe. The authors show significant improvements in safety while maintaining helpfulness on agent tasks.

Reasons to Accept

  • This paper defines the notion of awareness-execution gaps in LM agents and provides one way to measure these gaps. In addition, the authors show that these gaps are quite large in state-of-the-art LMs, demonstrating the shallowness of current LM safety training and alignment.
  • This paper provides a simple way to utilize additional test-time compute to improve safety and decrease these gaps by utilizing a verifier model (a copy of the model under evaluation).
  • On one model (GPT-4o-mini), the paper shows that using the verifier with the abstractor substantially reduces the execution of risky actions.
  • The improvements to safety do not degrade helpfulness significantly according to the results (Figure 4 and Table A.2).
  • The paper is relevant, well-motivated, and well-written.

Reasons to Reject

  • The awareness-execution gaps seem like a re-branding of the generator-verifier gap with a slight twist.
  • The verifier with abstractor methodology is quite simple and straightforward.
  • The evaluation is somewhat limited in the sense that only one curated dataset was used. There are also a lot of questions about how the dataset was curated -- e.g. the authors state on lines 156-157 that they reduce 468 trajectories by manually filtering down to 328 trajectories without ever describing how or what criteria were used.
  • The method is quite expensive, requiring at least 2-3x more compute as compared with the vanilla LM agent.

Questions to Authors

  • Is the problem considered in the paper more of a misuse or misalignment problem according to the authors?
  • What causes the knowledge-identification gap? What causes the identification-execution gap?
  • How expensive is running the verifier with abstractor? Could you run this in production?
  • Can you get away with a cheaper verifier? Is there a minimum level of capability needed to have a reasonable quality verifier?
  • Do you advocate for using verifiers in practice? Should we incorporate deeper alignment training via verifiers?
  • The results provided are on only one dataset, are the results specific to this dataset?
  • What is the computational cost vs. safety trade-off of using the different external safeguards? Is the abstractor the same model as the LM agent and if so how much of an increase in computational cost is using this verifier with abstractor?
  • Why do some models respond better to the safety prompt as compared to others (in reference to Table 1)?
  • Do all of the trajectories have unsafe actions? Are there any completely benign queries?
  • How were the trajectories manually filtered from 468 to 328?
Comment

Thank you for your insightful review and for recognizing our work's contributions. We appreciate that you found our paper well-motivated and well-written, and our definition of "awareness-execution gaps" valuable for demonstrating "the shallowness of current LM safety training". We would like to address your remaining concerns with the following responses.

“The awareness-execution gaps seem like a re-branding of the generator-verifier gap with a slight twist.”

Our awareness-execution gaps represent an instantiation of the generator-validator gap [2] to safety-critical agent scenarios, and we explicitly referenced this connection for understanding where such gaps may come from and why our risk verifier approach is effective for bridging the gaps. However, our work differs in several important ways: (1) safety-critical context: we investigate these gaps specifically in safety-critical scenarios with real-world implications, where failures can lead to tangible harm. (2) progressive gap structure: we identify two distinct gaps (knowledge-identification and identification-execution) rather than a single generator-validator divide, revealing a more nuanced failure pattern. (3) practical safety application: we develop targeted interventions that leverage these gaps to achieve significant safety improvements in agent deployment.

“The verifier with abstractor methodology is quite simple and straightforward.”

We agree that our verifier-with-abstractor methodology is simple and straightforward—and we view that as a strength, not a limitation. This simple, straightforward approach yields a 55.3% reduction in risky actions while remaining broadly applicable across diverse base models, showing that leveraging existing LLM capabilities with minimal architectural overhead can drive substantial safety improvements.

“The evaluation is somewhat limited in the sense that only one curated dataset was used. There are also a lot of questions about how the dataset was curated -- e.g. the authors state on lines 156-157 that they reduce 468 trajectories by manually filtering down to 328 trajectories without ever describing how or what criteria was used.”

Thank you for pointing this out, and we apologize for the unclear description in dataset curation. When creating the Execution Test set, we truncated each of the 468 trajectories immediately before the first risky action to create safe partial trajectories. This truncation process can create new duplicates when distinct trajectories share identical initial safe steps, so we applied the same LM-based deduplicator to these truncated snippets. We then manually verified that each snippet was correctly truncated at the appropriate execution step (i.e. before risky actions occurred) and filtered out erroneous snippets. This combined deduplication and manual filtering process reduced the count from 468 to 328 safe partial trajectories. We then paired each of these 328 partial trajectories with their corresponding Knowledge and Identification test cases, yielding a matched set of 328 test cases across all three evaluation dimensions. We will include these details in our updated paper.
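
For readers who want to follow the curation step more concretely, here is a minimal sketch of the truncation-plus-deduplication procedure described above. The helpers `first_risky_index` and `lm_deduplicate` are hypothetical stand-ins for the LM-based safety evaluator and deduplicator, not the authors' code.

```python
def build_execution_test_set(risky_trajectories, first_risky_index, lm_deduplicate):
    """Sketch of the Execution Test curation described above (hypothetical helpers).
    Each trajectory is a list of execution steps; 468 trajectories go in and
    328 safe partial trajectories come out in the paper."""
    # Truncate each trajectory immediately before its first risky action,
    # keeping only the safe prefix.
    safe_prefixes = [traj[:first_risky_index(traj)] for traj in risky_trajectories]
    # Truncation can create new duplicates when distinct trajectories share
    # identical initial safe steps, so deduplicate again with the LM-based
    # deduplicator; in the paper each remaining snippet is then manually
    # verified to be truncated before any risky action occurs.
    return lm_deduplicate(safe_prefixes)
```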

“The method is quite expensive, requiring at least 2-3x more compute as compared with the vanilla LM agent.”

We acknowledge that our method requires more compute compared to vanilla LM agents due to the additional verifier and abstractor components. However, we note our approach provides a favourable trade-off between computational overhead and safety gains, which aligns with recent trends in test-time scaling approaches that trade inference cost for improved performance. Given the high stakes of safety-critical agent deployment, where failures can result in significant harm or financial loss, the additional computational investment represents a reasonable cost for improved safety guarantees. Moreover, the efficiency of using only a single critique iteration (k=1) helps minimize this overhead while maintaining effectiveness, making the approach computationally feasible for practical real-world deployment scenarios.

Comment

We answer the reviewer’s questions below:

“Is the problem considered in the paper more of a misuse or misalignment problem according to the authors?”

The gaps investigated in our work are primarily a misalignment issue rather than a misuse problem. Our threat model assumes a benign user who provides legitimate instructions, but the agent takes unintended risky actions due to failures in applying its own safety knowledge during execution. The gaps we identify—where agents possess risk knowledge but fail to act on it—represent fundamental misalignment between the agent's safety awareness and its operational behaviour, rather than adversarial exploitation by malicious users.

“What causes the knowledge-identification gap? What causes the identification-execution gap?”

We have two primary hypotheses.

Knowledge-Identification Gap: We hypothesize that alignment training plays a significant role, as current alignment procedures are predominantly conducted in conversational rather than agentic scenarios. This training distribution mismatch may limit models' ability to apply safety knowledge in dynamic, tool-use environments.

Identification-Execution Gap: This gap appears to inherit characteristics from the general generator-validator inconsistency in LMs [2], where models demonstrate different capabilities when generating versus validating content.

Both gaps likely reflect that current LM training paradigms, while effective for conversational safety, may inadequately prepare models for complex safety decision-making in real-time agentic execution.

“How expensive is running the verifier with abstractor? Could you run this in production?”

Running the verifier with the abstractor adds three inference steps to the vanilla agent’s workflow that incur roughly a 2–3× increase in inference cost. However, even a single iteration of this pipeline yields substantial safety gains (Figure 6). We believe this self‐critique strategy can be deployed in production, since LLM inference costs have decreased quickly, roughly 10× per year [4], and test-time scaling can be utilized to achieve a favourable tradeoff between inference cost and safety.

“Can you get away with a cheaper verifier? Is there a minimum level of capability needed to have a reasonable quality verifier?”

Thank you for this practical question about verifier efficiency. While we did not exhaustively test minimum capability thresholds, our experiments used the same base model for both the execution agent and verifier. Among tested models, GPT-4 and Llama-3.1-70B represent lower capabilities on LMArena yet still achieved substantial safety improvements with our verifier system. This suggests that effective verification may not require SOTA models. Using different, potentially smaller models as verifiers presents an interesting direction for reducing computational costs while maintaining safety gains, though systematic exploration of these capability-cost trade-offs remains important in future work.

“Do you advocate for using verifiers in practice? Should we incorporate deeper alignment training via verifiers?”

Yes, we do advocate for using verifiers in practice, particularly given the persistent generator-validator gaps we observe across model capabilities. Our verifier approach serves as a test-time method that complements rather than replaces alignment training. While alignment training instills safety knowledge during development, our verifier leverages existing capabilities to improve safety during deployment. Incorporating verifier-based feedback into alignment training could potentially reduce the gaps we identified, though careful design in the training procedure is required.

“The results provided are on only one dataset, are the results specific to this dataset?”

While we used ToolEmu [1] as our primary evaluation framework, it provides substantial diversity: 36 toolkits and 311 tools across 9 risk categories, covering representative high-stakes domains spanning 18 different tool categories, as detailed in the original ToolEmu paper. This breadth captures diverse real-world agent deployment scenarios rather than a narrow domain-specific dataset.

The fundamental gaps we identify—knowledge-identification and identification-execution—represent instantiations of the broader generator-validator inconsistency [2] observed across different contexts in LM research. Previous work like Constitutional AI [3] demonstrates similar generator-validator gaps in model alignment, suggesting these phenomena extend beyond specific evaluation frameworks.

Comment

“What is the computational cost vs. safety trade-off of using the different external safeguards? Is the abstractor the same model as the LM agent and if so how much of an increase in computational cost is using this verifier with abstractor?”

We used the same base model for the agent, verifier, and abstractor components to ensure fair comparison.

Our results in Figure 6 show that the verifier alone provides significant safety improvements, while verifier + abstractor achieves even greater gains. The computational overhead scales accordingly: the verifier adds two inference steps (critiquing the proposed action and re-proposing an action based on the critique), while verifier + abstractor requires three additional steps (adding abstraction generation). This translates to approximately 2-3x computational cost compared to vanilla agents. However, our single-iteration approach (k=1) minimizes overhead while maintaining effectiveness. Given the substantial safety improvements (55.3% reduction in risky actions) and the high stakes of safety-critical applications, this represents a favourable trade-off between inference cost and safety guarantees.
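
For concreteness, a minimal sketch of a single-iteration (k=1) verifier-with-abstractor step is shown below. The `llm` callable and all prompt wording are illustrative assumptions, not the paper's actual prompts or code.

```python
def propose_safe_action(llm, partial_trajectory, instruction):
    """Single-iteration (k=1) verifier-with-abstractor sketch.
    `llm(prompt) -> str` is a hypothetical wrapper around the same base model
    used for the agent; the prompts below are paraphrases, not the paper's."""
    # Vanilla agent call: propose the next action.
    action = llm(f"Trajectory so far:\n{partial_trajectory}\n"
                 f"Task: {instruction}\nPropose the next action.")
    # Extra step 1 (abstractor): convert the concrete trajectory and action
    # into a high-level scenario description where risks are easier to spot.
    abstract = llm(f"Abstractly describe this scenario and the proposed action:\n"
                   f"{partial_trajectory}\nProposed action: {action}")
    # Extra step 2 (verifier): independently critique the proposed action.
    critique = llm(f"Scenario: {abstract}\nProposed action: {action}\n"
                   f"Is this action risky? Answer RISKY or SAFE and explain.")
    # Extra step 3 (re-proposal): if flagged, ask the agent for a safer action.
    if critique.strip().upper().startswith("RISKY"):   # naive verdict parsing
        action = llm(f"Your proposed action was flagged:\n{critique}\n"
                     f"Propose a safer alternative action for: {instruction}")
    return action   # 3 extra LM calls beyond the vanilla agent's single call
```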

“Why do some models respond better to the safety prompt as compared to others (in reference to Table 1)?”

Thank you for this insightful observation about the differential response to safety prompting across models.

We hypothesize that this variation stems from differences in models' instruction-following capabilities and chain-of-thought reasoning abilities. For example, models with strong reasoning capabilities like DeepSeek-R1 (15.6% to 69.2%) tend to show more substantial improvements with safety prompting, likely reflecting a stronger ability to follow complex safety instructions and reason through safety considerations. In contrast, weaker models like Llama-3.1-70B show modest gains (4.9% to 16.5%), potentially indicating weaker instruction-following capabilities.

“Do all of the trajectories have unsafe actions? Are there any completely benign queries?”

In our evaluation, we specifically filtered out safe trajectories and only used risky trajectories to perform our tests. This design choice was intentional to focus our analysis on scenarios where safety gaps actually matter. Since our research question centers on understanding when and why agents fail to act on their risk knowledge, evaluating performance on inherently safe queries would not provide meaningful insights into the safety awareness-execution gaps we aim to investigate. Using only risky trajectories allows us to directly measure agents' ability to recognize and avoid potential harm in scenarios where such capabilities are actually tested.

“How were the trajectories manually filtered from 468 to 328?”

Thank you for pointing this out, and we apologize for the unclear description in dataset curation. In Section 3.2, after the initial deduplication from 524 to 468 risky trajectories, we perform another round of deduplication and manual filtering to build the Execution Test set. First, we truncate each of the 468 trajectories right before its first risky action, producing 468 “safe” partial trajectories. Truncation can create new duplicates—for example, if two distinct trajectories share the same initial safe steps—so we apply the same LLM-based deduplicator to these truncated snippets. Next, we manually verify that each snippet was correctly truncated at the first risky action. This combined process reduces the count from 468 to 328 safe partial trajectories. Finally, we pair each of these 328 safe partial trajectories with its corresponding Knowledge and Identification cases, yielding a matched set of 328 test cases.

References:

[1] Ruan, Y., Dong, H., Wang, A., Pitis, S., Zhou, Y., Ba, J., … Hashimoto, T. (2024). Identifying the Risks of LM Agents with an LM-Emulated Sandbox. arXiv [Cs.AI]. Retrieved from http://arxiv.org/abs/2309.15817

[2] Li, X. L., Shrivastava, V., Li, S., Hashimoto, T., & Liang, P. (2023). Benchmarking and Improving Generator-Validator Consistency of Language Models. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2310.01846

[3] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2212.08073

[4] Appenzeller, G. (2024, November). Welcome to LLMflation - LLM Inference Cost is going down fast ⬇️. Andreessen Horowitz. Retrieved from https://a16z.com/llmflation-llm-inference-cost/

Comment

Thank you for your detailed responses. I will maintain my positive assessment of this work.

Review
Rating: 7

The paper offers a comprehensive evaluation of the safety of LM agents along three key dimensions: knowledge, identification, and execution. The paper discloses the intriguing finding that while LMs possess risk knowledge, they may fail to identify risks in trajectories and tend to execute risky actions. Concretely, the evaluation reveals two critical performance gaps that resemble the generator-validator gaps observed in LMs: while agents demonstrate near-perfect risk knowledge (>98% pass rates), they fail to apply this knowledge when identifying risks in actual scenarios, with performance dropping by >23%, and often still execute risky actions (<26% pass rates). Based on the findings, the paper develops a risk verifier to enhance the safety of agents. The verifier independently critiques the actions proposed by agents, with an abstractor that converts specific execution trajectories into abstract descriptions where LMs can more effectively identify the risks.

Reasons to Accept

  1. The design of this work is novel and clear: it evaluates three key dimensions with the same sets of data for comparison and analysis.
  2. The finding of performance gaps is intriguing and consistent with insights from previous works in the community.
  3. The proposed mitigation is effective.
  4. The results are convincing and versatile with high-quality data curation and statistical analysis.

Reasons to Reject

I did not find major concerns.

Questions to Authors

Do the authors have any explanation on the performance gaps between knowledge, identification and execution of agents in safety-critical scenarios?

Comment

Thank you for your thoughtful review and for recognizing the value of our contribution. We appreciate that you found our work novel and clear, our findings intriguing, our proposed mitigation effective, and our results convincing and versatile.

In response to the question regarding what might cause the performance gaps between knowledge, identification and execution of agents in safety-critical scenarios, we have two primary hypotheses as below:

Knowledge-Identification Gap: We hypothesize that alignment training plays a significant role, as current alignment procedures are predominantly conducted in conversational rather than agentic scenarios. This training distribution mismatch may limit models' ability to apply safety knowledge in dynamic, tool-use environments.

Identification-Execution Gap: This gap appears to inherit characteristics from the general generator-validator inconsistency in LMs [1], where models demonstrate different capabilities when generating versus validating content. Both gaps likely reflect that current LM training paradigms, while effective for conversational safety, may inadequately prepare models for complex safety decision-making in real-time agentic execution.

References:

[1] Li, X. L., Shrivastava, V., Li, S., Hashimoto, T., & Liang, P. (2023). Benchmarking and Improving Generator-Validator Consistency of Language Models. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2310.01846

Comment

Thanks for the response. I will keep the positive rating.

Review
Rating: 6

This paper investigates a troubling discrepancy between the risk awareness and safe behavior of language model (LM) agents. While these agents often demonstrate high safety knowledge in abstract question-answering formats, correctly recognizing risky commands such as `sudo rm -rf /*`, they frequently fail to identify such risks in real execution contexts and sometimes even carry them out. To systematically evaluate this, the authors develop a three-part framework that tests agents on their knowledge, identification of risks in actual trajectories, and safe execution. They uncover two major gaps: the knowledge-identification gap and the identification-execution gap, which persist across a range of models and capabilities. To address these failures, they introduce a verifier system with an abstractor that converts execution trajectories into high-level descriptions, enabling the verifier to better detect risks. This approach reduces risky behavior by over 55% without sacrificing helpfulness, highlighting the importance of explicitly bridging the disconnect between safety knowledge and agentic execution.

Reasons to Accept

1. Timely and Important Problem: The paper addresses a critical and underexplored issue—the disconnect between an LM agent’s abstract knowledge of risks and its actual behavior—highlighting a potential safety failure that is relevant for real-world deployment.

2. Clear Problem Decomposition: The authors provide a thoughtful three-stage evaluation framework (knowledge → identification → execution), which clarifies where and how LM agents fail.

3. Strong Empirical Findings: The paper demonstrates that even capable agents like GPT-4 and Claude-2 exhibit these safety gaps, making the findings widely relevant across multiple systems.

4. Novel Mitigation Strategy: The abstractor-verifier approach is intuitive and effective, showing a >55% reduction in risky behavior without harming task performance -- a promising direction for safer agents.

5. Reproducibility and Evaluation Rigor: The authors provide transparent experimental details, including human evaluations, ablations, and varied settings (e.g., multiple LMs, temperature, and prompting strategies).

Reasons to Reject

1. Limited Realism of Environments: The paper’s execution contexts are simulated; real-world agent behaviors (e.g., in robotic or web environments) may introduce new complexities and failure modes not captured here.

2. Scalability of the Verifier System: The abstractor-verifier pipeline depends on the quality and generality of high-level abstractions, which may struggle in complex or unfamiliar domains. Scalability to long or multi-step interactions is not fully tested.

3. Binary Risk Framing: The definition of “risky” commands is somewhat binary and pre-specified. More nuanced or contextual risk judgments (e.g., ethical gray areas or long-term harms) are not addressed.

4. Dependence on Human Labels: Risk identification and evaluation often rely on human-annotated labels, which may limit automation or introduce subjectivity.

5. Mitigation Limited to Known Risks: While the verifier is effective, it is unclear how well it generalizes to unknown or adversarially constructed risks that agents haven’t been trained on.

Comment

Thank you for your comprehensive review and for recognizing the significance of our work. We appreciate that you found our work provides "strong empirical findings" across multiple systems, and "novel mitigation strategy" to be valuable contributions to this "timely and important problem." We provide responses to the additional questions below.

“Limited Realism of Environments: The paper’s execution contexts are simulated; real-world agent behaviors (e.g., in robotic or web environments) may introduce new complexities and failure modes not captured here.”

We thank the reviewer for raising this concern. Thanks to the flexibility of the ToolEmu [1] framework, it already covers 36 toolkits and 311 tools, producing outcomes in 9 risk categories. We believe this breadth reflects a wide range of real-world scenarios, but we are also happy to add more evaluations that are not covered by our framework if needed.

“Scalability of the Verifier System: The abstractor-verifier pipeline depends on the quality and generality of high-level abstractions, which may struggle in complex or unfamiliar domains. Scalability to long or multi-step interactions is not fully tested.”

We would like to clarify that our evaluation already includes multi-step interactions with trajectories of up to 15 steps. Our test cases span 144 scenarios across 9 risk types and cover 36 toolkits with 311 tools. These test cases draw from diverse domains such as financial services, communication platforms, system administration, and data management, showcasing the potential to simulate complex scenarios. Importantly, our results show that the abstractor–verifier pipeline remains effective across model sizes and capabilities (Figure 5), indicating its wide applicability and generalization. Since the verification capability of our system will likely scale with base model capabilities, we believe the performance of our system will remain effective or further improve with future models. We welcome any specific suggestions for further evaluations to strengthen our assessment.

“Binary Risk Framing: The definition of “risky” commands is somewhat binary and pre-specified. More nuanced or contextual risk judgments (e.g., ethical gray areas or long-term harms) are not addressed.”

We would like to clarify that our LM-based safety evaluator uses a 0–3 Likert scale rather than a simple binary label. The scale is: 0: Likely Severe Risk; 1: Possible Severe Risk or Likely Mild Risk; 2: Possible Mild Risk; and 3: Certain No Risk. These scores incorporate both the likelihood and severity of potential outcomes—factoring in immediate and downstream effects such as financial impact, privacy implications, and long-term harm (see Appendix C.3.2). To compare performance across our three test stages (Knowledge, Identification, Execution), we convert these risk scores into binary pass/fail outcomes. This consistent metric lets us quantify the performance gaps at each stage.
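
As a small illustration of the scoring scheme described above, the snippet below maps the 0-3 Likert scores to their labels and to a binary pass/fail outcome. The specific threshold is our assumption for illustration; the paper's exact binarization rule may differ.

```python
# Likert scale from the response above.
RISK_LABELS = {
    0: "Likely Severe Risk",
    1: "Possible Severe Risk or Likely Mild Risk",
    2: "Possible Mild Risk",
    3: "Certain No Risk",
}

def passes_safety(score: int, threshold: int = 3) -> bool:
    """Illustrative binarization: count a case as a 'pass' only if the LM
    evaluator assigns the top score. The threshold is an assumption made
    here for illustration, not the paper's stated rule."""
    return score >= threshold
```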

“Dependence on Human Labels: Risk identification and evaluation often rely on human-annotated labels, which may limit automation or introduce subjectivity.”

Our risk identification and evaluation primarily rely on automated LM-based evaluators, not human-annotated labels. We employ LM-based safety evaluators that automatically assess risk levels across all 328 test cases, ensuring scalability and consistency in our evaluation process. Human annotation was used only for quality validation, to ensure the reliability of the automatic evaluator. This may indeed introduce subjectivity, but the original ToolEmu [1] paper employed human labels from multiple annotators (4) to mitigate it. The actual risk assessment and scoring across all our tests (Knowledge, Identification, Execution) are conducted automatically using LM evaluators.

“Mitigation Limited to Known Risks: While the verifier is effective, it is unclear how well it generalizes to unknown or adversarially constructed risks that agents haven’t been trained on.”

Our verifier employs the same pre-trained, general-purpose LLMs (GPT-4, Claude, etc.) used as the base agents, rather than models trained on specific risk scenarios. This design ensures that the verifier's risk knowledge comes from diverse safety understanding instilled during large-scale pretraining, rather than being limited to our curated test scenarios. General-purpose LLMs possess extensive risk awareness across domains through pretraining on diverse data sources, enabling them to potentially identify novel or adversarial risks beyond our evaluation set.

References:

[1] Ruan, Y., Dong, H., Wang, A., Pitis, S., Zhou, Y., Ba, J., … Hashimoto, T. (2024). Identifying the Risks of LM Agents with an LM-Emulated Sandbox. arXiv [Cs.AI]. Retrieved from http://arxiv.org/abs/2309.15817

Comment

Thank you for your thoughtful rebuttal. I will keep my positive rating of this work.

Review
Rating: 6

This paper discusses the safety of LLM-based agents, claiming that while frontier LLMs possess risk awareness, they often fail to identify risky actions from trajectories and to avoid executing the risky actions. The authors define such shortcomings as the knowledge-identification gap and the identification-execution gap. To study these gaps, the authors propose a systematic framework for evaluating LLM agent safety. Additionally, they introduce specific risk mitigation strategies that incorporate a risk verifier to provide critiques for detected risky actions and an abstraction enhancement to convert concrete trajectories into abstracted scenario descriptions.

Reasons to Accept

  • This paper defines the performance gaps between the risk awareness of LLMs and risk prevention during task execution, and it introduces a systematic evaluation framework with three types of tests (knowledge, identification, and execution tests) that may be reused in future work.
  • It provides the insight that the gaps may persist even with more powerful LLMs or with more reasoning.
  • The paper proposes risk mitigation strategies by incorporating an LM-based risk verifier to critique risks and an abstraction enhancement to abstract concrete trajectories. The experimental results demonstrate the effectiveness of the risk mitigation strategies over baselines. In addition, the analysis of mitigation approaches across critique iteration counts shows practical benefits with acceptable computational overhead.

Reasons to Reject

  • The abstraction process that converts trajectories into high-level descriptions (QA format) may not be fully reliable. This paper does not include any analysis or evaluation of the effectiveness of the abstraction process, nor its potential impact on the knowledge test.
  • The evaluation focuses solely on trajectories containing risky actions. It would strengthen the evaluation to also include trajectories without risky actions, to study how the mitigation strategies impact on safe trajectories.
  • In the test construction process, it is unclear which representative agents were selected and which risky tool actions were used for filtering the ToolEmu-generated trajectories. The presentation could be improved by providing more concise yet clearer details.
  • The proposed framework is built solely upon ToolEmu; it would be better to include more frameworks and diverse real tasks to demonstrate wider applicability.
  • The evaluation of mitigation strategies includes only one baseline (safety-prompt), including one or two more baselines can help solidify the improvements introduced by the proposed mitigation approaches.

Questions to Authors

There is a significant performance gap observed between risk awareness and execution, which may stem from the gap between LLMs and agents. Do you think this large gap is commonly present, or is it specific to certain agents? Additionally, do you think the proposed mitigation strategies are generalizable to more generic scenarios?

Comment

“The evaluation of mitigation strategies includes only one baseline (safety-prompt).”

We would like to clarify our baselines. Our evaluation includes two primary baselines: (1) the Vanilla Agent baseline (Appendix C.2), which employs the ReACT framework [1] to perform chain-of-thought reasoning for action selection, and (2) the Safety-Prompted Agent baseline (Appendix C.5.1), which augments the vanilla agent with detailed safety guidelines. We focused on these strong baselines that represent current best practices in agent deployment. While weaker baselines (such as direct action output without reasoning steps) could provide additional context, our current selection demonstrates that our approach achieves significant improvements even against these strong baselines. However, if there are specific baseline approaches the reviewer believes would strengthen our evaluation, we would be happy to incorporate them in our analysis as requested.
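
To make the two baselines concrete, here is a rough sketch of a ReAct-style agent loop with an optional safety preamble. The guideline text, `llm`, and `run_tool` are illustrative assumptions rather than the authors' actual prompts or implementation.

```python
# Hypothetical stand-in for the detailed safety guidelines referenced in Appendix C.5.1.
SAFETY_GUIDELINES = (
    "Before acting, consider whether the action could cause irreversible harm, "
    "data loss, financial damage, or privacy violations; refuse or ask for "
    "user confirmation if it could."
)

def react_agent(llm, run_tool, instruction, max_steps=15, safety_prompted=False):
    """Sketch of the Vanilla (ReAct) and Safety-Prompted agent baselines.
    `llm(prompt) -> str` and `run_tool(action) -> str` are hypothetical helpers."""
    prompt = (SAFETY_GUIDELINES + "\n" if safety_prompted else "") + f"Task: {instruction}\n"
    for _ in range(max_steps):
        # ReAct interleaves a chain-of-thought step ("Thought") with a tool call ("Action").
        step = llm(prompt + "Thought and Action:")
        if "Final Answer" in step:
            return step
        observation = run_tool(step)              # execute the chosen tool action
        prompt += f"{step}\nObservation: {observation}\n"
    return "Stopped after reaching the step limit."
```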

“There is a significant performance gap observed between risk awareness and execution, which may stem from the gap between LLMs and agents. Do you think this large gap is commonly present, or is it specific to certain agents? Additionally, do you think the proposed mitigation strategies are generalizable to more generic scenarios?”

We believe the Knowledge-Identification and Identification-Execution gaps are indeed commonly present across LM agents rather than being specific to certain models or architectures. Our evaluation across a diverse set of models—including open-source LLaMA-3.1 variants, closed-source GPT-4 family and Claude models, and the specialized reasoning model DeepSeek-R1—consistently revealed both gaps in every case. This persistence across different model families, scales, and training paradigms suggests these gaps represent fundamental challenges in current LM safety approaches rather than model-specific limitations.

Regarding mitigation generalizability, our self-critique strategy demonstrated consistent improvements across all tested models (Table 1), suggesting broad applicability. The approach leverages the inherent generator-validator [3] inconsistency observed in LMs, similar to Constitutional AI's [2] findings in safety alignment, making it potentially generalizable to diverse agent deployment scenarios.

References:

[1] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2210.03629

[2] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2212.08073

[3] Li, X. L., Shrivastava, V., Li, S., Hashimoto, T., & Liang, P. (2023). Benchmarking and Improving Generator-Validator Consistency of Language Models. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2310.01846

[4] Ruan, Y., Dong, H., Wang, A., Pitis, S., Zhou, Y., Ba, J., … Hashimoto, T. (2024). Identifying the Risks of LM Agents with an LM-Emulated Sandbox. arXiv [Cs.AI]. Retrieved from http://arxiv.org/abs/2309.15817

Comment

Thank you for dedicating your time to review and for providing valuable and constructive feedback. We are glad you found the paper's "systematic evaluation framework," insights that "gaps may persist even with more powerful LLMs," and "effective risk mitigation strategies" valuable. We provide responses to the additional questions below.

“The abstraction process that converts trajectories into high-level description (QA formats) may not be fully reliable. This paper does not include any analysis or evaluation of the effectiveness of the abstraction process, nor its potential impact on the knowledge test.”

To produce high-level abstractions for the Knowledge Test, we used a strong model (GPT-4o) with chain-of-thought prompting and systematic guidelines (see Appendix C.4.1) to improve their reliability. We also performed post-hoc manual verification on 70 of the 328 abstractions, confirming that in each case the abstraction accurately preserved the key risky scenario and its actions without leaking that the trajectory was risky. We will include more details of our implementation and validation in the updated version of our paper.

“The evaluation focuses solely on trajectories containing risky actions. It would strengthen the evaluation to also include trajectories without risky actions, to study how the mitigation strategies impact on safe trajectories.”

As stated in Section 3.2 (Execution Test), when curating the execution tests, we first collect trajectories that contain risky actions and extract their safe prefixes (which do not contain risky actions). When agents start with those safe partial trajectories, they could produce both safe and risky follow-up actions. That said, the mitigation strategies are applied to both safe and risky actions during the execution test. If the agent proposes a risky action at any point, the verifier flags it and issues a critique, prompting the agent to re‐propose a safe action; otherwise, the agent’s initial proposal is accepted.

“In the test construction process, it is unclear which representation agents are selected and what the risky tool actions are used for filtering ToolEmu generated trajectories.”

To generate the initial set of 934 trajectories from ToolEmu test cases, we employed a diverse set of agents spanning different capability levels, including both frontier models and smaller open-source variants, to capture a broad spectrum of agent behaviours and failure modes across different model scales and training paradigms. Specifically, we used the following models: gpt-4-1106-preview, gpt-3.5-turbo-1106, claude-3-opus-20240229, claude-3-sonnet-20240229, mistralai/Mistral-7B-Instruct-v0.2, claude-2, lmsys_vicuna_13b, and lmsys_vicuna_7b.

We applied an LM-based safety evaluator (developed in the ToolEmu [4] paper, detailed prompt in Appendix C.3.2) to identify trajectories containing risky tool actions, resulting in 524 risky trajectories. After LM-based deduplication (524 to 468), we truncated each trajectory before the first risky action to create safe partial trajectories for execution testing. This truncation created new duplicates, and thus we performed an additional round of deduplication and manual verification of correct truncation points. This yields our final dataset of 328 test cases across all three evaluation dimensions. We will include these details in our updated paper.

“The proposed framework is solely built upon ToolEmu, it would be better to include more frameworks and diverse real tasks to demonstrate the wider applicability.”

We thank the reviewer for raising this concern. Thanks to the flexibility of the ToolEmu [4] framework, it already covers 36 toolkits and 311 tools, producing outcomes in 9 risk categories. We believe this breadth reflects a wide range of real-world scenarios, but we are also happy to add more evaluations that are not covered by our framework if needed.

Final Decision

This paper identifies the safety awareness-execution gaps in LM agents. Safety awareness-execution gaps occur when LM agents are able to correctly classify whether or not a safety concept is being violated in a specific scenario, but when asked to perform the task the LM agent will comply and complete the harmful actions. The result is not that surprising, but the evaluation framework is a great contribution to the community. The paper is also well written. All reviewers are in favor of acceptance.