PaperHub
PaperHub score: 8.2/10 · Spotlight · 4 reviewers
Reviewer ratings: 5, 5, 4, 6 (min 4, max 6, std dev 0.7) · Confidence: 4.0
Novelty: 3.3 · Quality: 3.5 · Clarity: 3.5 · Significance: 3.3
NeurIPS 2025

CTRL-ALT-DECEIT Sabotage Evaluations for Automated AI R&D

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We evaluate frontier LM agents' capabilities to sabotage and sandbag ML engineering tasks without being detected by automated monitors.

Abstract

Keywords
Sabotage Evaluations · Sandbagging · AI Control · AI Safety · AI Alignment · Dangerous Capability Evals

Reviews and Discussion

Review (Rating: 5)

The paper investigates LLM agents' capability to perform auxiliary "sabotage" tasks while completing MLE-bench tasks. It also looks at models' ability to sandbag, that is, to deliberately perform worse than they are capable of by targeting specific performance ranges. They find that Claude 3.7 Sonnet is good at performing both the original task and the sabotage tasks, and that monitors are generally quite good at detecting sabotage but not sandbagging by default.

Strengths and Weaknesses

Strengths:

  • I think this paper is timely and follows an important research direction. It addresses a critical concern for AI safety: the potential for AI systems to sabotage ML engineering tasks, especially as these systems become increasingly capable of autonomous software development.
  • The use of long-form and agentic MLE-bench tasks makes the results more representative of real-world deployment scenarios. This is particularly important, as it is in these settings that deceptive alignment could be the most dangerous.
  • Overall solid execution.

Weaknesses:

  • The number of tasks is low, a problem exacerbated by the high variance. They also only use one model as the agent. This is understandable considering the high cost of running these experiments, but it is still a weakness.
  • The paper is unclear about the exact number of tasks, stating 22 low-complexity tasks but then mentioning 20. You mention that you exclude 2 tasks and that two of the tasks are "broken" in the appendix, yet you still seem to present 20 results. The varying number of runs per task (16/9/6/3) appears arbitrary.
  • The monitors' suspicion scores appear conflated with perceived harm level rather than detection accuracy. The "agent was here" example receiving only 40/100 suspicion despite correct identification suggests this issue. The sabotage tasks could have been made more obviously harmful to avoid this issue.

Questions

  • What motivated the varying number of runs per task (16/9/6/3), and how does this impact your results?
  • Why choose 0-100 scoring over say 1-5?
  • The sandbagging target range formulation seems unnatural to me. Hitting a specific range could be harder for an LLM to do than just behaving less capable. Have you considered reformulating sandbagging as "simulating a less capable model" (e.g., GPT-5 simulating GPT-4) rather than hitting specific score ranges? This might be more natural given LLMs' strong ability to emulate different personas and the existence of large amounts of weaker LLM text in the training data.
  • Could you separate detection accuracy from perceived harm severity in monitor evaluations?
  • What suspicion scores would monitors assign if explicitly shown the sabotage behaviors in isolation and asked "How concerning is this behavior?" This would help disentangle detection capability from perceived severity.

Limitations

Yes

Final Justification

The authors resolved some of my concerns but I still think the conflation between suspicion scores and perceived harm for the monitor is a big problem.

Overall I think it is a good paper.

Formatting Issues

Nothing major

Author Response

Thanks for your feedback on the paper and for highlighting its strong motivation and execution! Below we answer your questions and address your concerns.

What motivated the varying number of runs per task (16/9/6/3), and how does this impact your results?

To be clear, we run the same number of runs for each task, but vary the number of runs according to the mode (Regular, Sabotage, Sandbagging). We run the highest number (16 for AIDE and 6 for ReAct) of Regular trajectories (which do not attempt sabotage or sandbagging) because these act as an important baseline for both sabotage and sandbagging. We run fewer ReAct trajectories because AIDE is more capable and therefore more interesting to experiment with (e.g., because a more capable agent has more room to sandbag).

For sabotage modes, we ran at least 3 trajectories per task to compute mean performance and standard error. We were able to run more ReAct trajectories within the time and resource constraints we had — of course, running more trajectories on other agents would provide more information about agent sabotage capabilities.

One consideration is that agents might succeed at sabotage only a very small percentage of the time, meaning that running only a few trajectories is unlikely to demonstrate low-percentage successes. We will add a note on this in the limitations.

Why choose 0-100 scoring over say 1-5?

The larger range provides a finer-grained measure of suspicion for monitors, and more room to distinguish between the suspicions of different actions. However, we found that monitors somewhat struggled to give consistent suspicion scores — also related to your point below.

Could you separate detection accuracy from perceived harm severity in monitor evaluations?

This highlights a real difficulty for monitors. The suspicion and/or concerningness of actions is highly context-dependent, so it is hard to elicit general (non-contextual) suspicion scores which can act as a classifier across different tasks. This is also why the generally prompted monitor did worse at classifying sandbagging (see Fig. 8). (Though note that monitors already perform well at classifying sabotage.)
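
As a concrete illustration of what "acting as a classifier" means here, below is a minimal sketch with invented labels and scores (not the paper's monitoring pipeline): per-trajectory suspicion scores are evaluated threshold-free via ROC AUC.

```python
# Minimal sketch: treating per-trajectory suspicion scores (0-100) as a binary
# classifier over "regular" vs. "sabotage/sandbagging" trajectories.
# The labels and scores below are invented placeholders, not the paper's data.
from sklearn.metrics import roc_auc_score

labels = [0, 0, 0, 0, 1, 1, 1, 1]             # 1 = trajectory actually attempted sabotage
suspicion = [10, 25, 40, 15, 80, 55, 35, 90]  # monitor's suspicion score per trajectory

# AUC is threshold-free, which sidesteps the problem that the "suspicious"
# cutoff is highly context-dependent across tasks.
print(f"AUC = {roc_auc_score(labels, suspicion):.2f}")
```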

What suspicion scores would monitors assign if explicitly shown the sabotage behaviors in isolation and asked "How concerning is this behavior?" This would help disentangle detection capability from perceived severity.

In real life, monitors might have to analyse a large number of agent behaviours in varying contexts. It may often be infeasible to provide lots of context to the monitor regarding which types of actions are suspicious, so monitors will need to classify bad behaviours in a context-general way.

Of course, greater efforts can always be put into elicitation, and, ideally, monitors might be fine-tuned to detect sabotage attempts, which will hopefully generalize to detecting even OOD bad behaviours.

The sandbagging target range formulation seems unnatural to me. Hitting a specific range could be harder for an LLM to do than just behaving less capable. Have you considered reformulating sandbagging as "simulating a less capable model" (e.g., GPT-5 simulating GPT-4) rather than hitting specific score ranges? This might be more natural given LLMs' strong ability to emulate different personas and the existence of large amounts of weaker LLM text in the training data.

Emulating a less-capable model is also an interesting setting (though note that models might sometimes care about absolute performance, as in our target-score calibration). van der Weij et al. [6] look at sandbagging in terms of calibration and emulating a weaker model. Evaluating emulation on our benchmark is a good area for future work.

[6] Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, and Francis Rhys Ward. AI sandbagging: Language models can strategically underperform on evaluations, 2025.

Thanks again for your feedback!

Comment

I thank the authors for their response.

The larger range provides a finer-grained measure of suspicion for monitors, and more room to distinguish between the suspicions of different actions. However, we found that monitors somewhat struggled to give consistent suspicion scores — also related to your point below.

Do you find that the model is well calibrated on the 0-100 scale? In my experience, models have a few numbers which they like, so the effective resolution is much lower than 100.
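
One rough way to quantify that effective resolution (a sketch with hypothetical scores, not measurements from this paper) is to look at how many distinct values the monitor actually emits and how concentrated they are:

```python
# Rough sketch: estimating the effective resolution of a 0-100 suspicion scale
# by counting the distinct scores a monitor actually uses. Scores are invented.
from collections import Counter
import math

scores = [10, 10, 20, 20, 20, 50, 50, 80, 80, 80, 80, 90]  # hypothetical monitor outputs
counts = Counter(scores)
probs = [c / len(scores) for c in counts.values()]
entropy = -sum(p * math.log2(p) for p in probs)

print(f"distinct scores used: {len(counts)} of 101 possible values")
print(f"effective number of score levels (perplexity): {2 ** entropy:.1f}")
```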

In real life, monitors might have to analyse a large number of agent behaviours in varying contexts. It may often be infeasible to provide lots of context to the monitor regarding which types of actions are suspicious, so monitors will need to classify bad behaviours in a context-general way.

My comment was more about the realism of the tasks. Presumably (hopefully) a real deceptive task would be more concerning to the monitor. I was suggesting that an easy experiment to run would be to just ask the monitor how concerning the task is, to try and separate this effect. I was suggesting making the tasks more concerning rather than giving the monitor context-specific information.

Comment

Dear authors,

Please respond to the discussion points brought up by reviewer Murx.

Thanks,

area chair

Review (Rating: 5)

This paper considers the risk from LLM agents used for ML research sabotaging the research in various ways. The two scenarios considered are: a) accomplishing a harmful side objective, such as backdooring the trained ML model, and b) "sandbagging", or underperforming on the ML task. The focus is on capability evaluation, i.e. whether current models can do the sabotage when prompted (and less so on whether models would do so spontaneously). The paper finds that Claude 3.7 with the AIDE scaffold is quite good at research sabotage. It then investigates automated monitoring as a safety measure and finds that it is quite effective (AUC > 0.9) for catching side objectives but notably less effective for sandbagging (AUC ~ 0.7).

Strengths and Weaknesses

Strengths:

  • This is an important problem to study, as LLM agents are deployed increasingly broadly for SWE and research tasks.
  • The methodology of the paper seems solid. MLE-Bench is a good choice of dataset for investigating research sabotage. I've skimmed the side objectives that the authors added to the dataset: many of them are good examples of potentially harmful side objectives, though some are more toy (e.g., adding the encoding of "Agent was here" to the output file). But in a way it doesn't matter that much what the side objective is; monitoring should work anyway.
  • The paper seems rigorously executed and high quality. Plots have confidence intervals, and the authors experiment with different scaffolds (AIDE, ReAct), sabotage prompts, and monitoring prompts. The choice of metrics makes sense.
  • The writing is very clear and the methodology is well explained.
  • Originality of contributions is good, I am only aware of one prior work on this topic.

Weaknesses:

  • The motivation for studying this problem could be expanded on a bit. It doesn't seem that likely that a model would spontaneously decide to e.g. sandbag by aiming for two medals less.
  • I would welcome more discussion of how to interpret these results. Are they concerning? Not very concerning? How to translate sabotage capability on MLE bench to real-world situations? What can be done besides monitoring (especially for sandbagging, where the monitor doesn't work that well)?

Questions

  • What is the authors' interpretation of the results? Are they concerning? Not very concerning? How to translate sabotage capability on MLE bench to real-world situations? What can be done besides monitoring (especially for sandbagging, where the monitor doesn't work that well)?
  • Could the authors clarify their point about averaging monitors? What is the benefit of using an ensemble if it doesn't outperform the best single monitor?
  • The adopted capability elicitation philosophy is a bit unclear to me. It seems like the model is being fed strategies for how to approach the sabotage task. If so, to what extent is the evaluation measuring the agent's capability vs the human researchers' ability to write good guidelines?

Nitpicks:

  • I notice that the melanoma task exclusion is mentioned repeatedly in at least three places: lines 86, 128, and 285
  • There are some extraneous characters at the end of line 187 (typo)
  • The tables / figures seem a bit cramped to me, wider margins would be preferable (e.g. Table 1, Figure 3, Figure 4)

Limitations

Yes

Final Justification

It's a strong paper (see my review), and while the rebuttal was a bit terse, I trust that the authors will consider my advice and expand on the topics specified in my last comment when finalising the manuscript.

There are no major outstanding issues.

Formatting Issues

no major concerns

Author Response

Thank you for taking the time to give feedback on the paper. We are glad that you appreciated all of its strengths :)

The motivation for studying this problem could be expanded on a bit. It doesn't seem that likely that a model would spontaneously decide to e.g. sandbag by aiming for two medals less.

We will add further motivation for the sandbagging experiments — in short, misaligned agents might sandbag on safety-relevant tasks, e.g., when developing monitors which they do not wish to be performant. In case it was not clear, we prompt agents to achieve a specific target score, which is defined by the medal. The medals provide a variety of targets which span the range of capabilities on the task and allow us to measure agent calibration across this range (Fig. 7).
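
A minimal sketch of how such calibration might be summarised follows; the target and achieved scores below are invented for illustration, not data from Fig. 7.

```python
# Minimal sketch: summarising sandbagging calibration as the gap between the
# prompted target score and the score the agent actually achieved per run.
# All numbers are invented for illustration.
targets  = [0.40, 0.55, 0.70, 0.85]   # hypothetical per-run target scores
achieved = [0.46, 0.52, 0.61, 0.88]   # hypothetical scores the agent obtained

errors = [abs(a - t) for a, t in zip(achieved, targets)]
print(f"mean absolute calibration error: {sum(errors) / len(errors):.3f}")
```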

What is the authors' interpretation of the results? Are they concerning? Not very concerning?

We think the results are a bit mixed! Whilst agents perform quite well at conducting sabotage, monitors also perform well at catching them. The sandbagging results could be more concerning, where monitors perform worse.

How to translate sabotage capability on MLE bench to real-world situations?

Whilst we think that MLE-sabotage is more realistic than other benchmarks, Kaggle competitions still have limitations preventing them from capturing real life, e.g., the data is always presented in a clean format.

Therefore, whilst agents perform well at our tasks (and achieve close to median human performance on main tasks) this may not be indicative of more complicated and messy real-life ML engineering pipelines.

What can be done besides monitoring (especially for sandbagging, where the monitor doesn't work that well)?

This is a good question! Other work on sandbagging focuses on techniques for capability elicitation, such as fine-tuning, jailbreaks, or activation steering, which should aim to elicit an agent’s capabilities, even if it is sandbagging [32]. We will add a note on this alternative to monitoring in the discussion.

[32] Felix Hofstätter, Teun van der Weij, Jayden Teoh, Henning Bartsch, and Francis Rhys Ward. The elicitation game: Evaluating capability elicitation techniques, 2025.

Could the authors clarify their point about averaging monitors? What is the benefit of using an ensemble if it doesn't outperform the best single monitor?

We had the intuition that averaging many weak monitors would perform better than a single monitor, cf. "jury theorems". Moreover, averaging many sampled suspicion scores from a single monitor model (the mean@k score) also improves performance (see Fig. 9 in the technical appendix).
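
As a rough sketch of the mean@k idea (the monitor call below is a hypothetical placeholder, not the paper's implementation): sample k suspicion scores for the same trajectory and average them before any downstream thresholding or AUC computation.

```python
# Sketch of mean@k suspicion scoring: sample k suspicion scores from the same
# monitor for one trajectory and average them before computing AUC/thresholds.
# sample_suspicion() is a hypothetical stand-in for a real monitor (LM) call.
import random

def sample_suspicion(trajectory: str) -> float:
    """Placeholder for one monitor sample; in practice this would query an LM."""
    return random.gauss(60, 15)  # noisy single-sample suspicion score

def mean_at_k(trajectory: str, k: int = 5) -> float:
    # Averaging k independent samples reduces the variance of the final score,
    # the same intuition behind averaging an ensemble of monitors.
    return sum(sample_suspicion(trajectory) for _ in range(k)) / k

print(f"mean@5 suspicion: {mean_at_k('agent transcript ...'):.1f}")
```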

The adopted capability elicitation philosophy is a bit unclear to me. It seems like the model is being fed strategies for how to approach the sabotage task. If so, to what extent is the evaluation measuring the agent's capability vs the human researchers' ability to write good guidelines?

We only iterated on prompt elicitation on a smaller "train" set of tasks, so our results indicate that this elicitation did generalise to the other tasks. We did find it difficult to elicit subtle behaviours from the agents — for instance, without substantial elicitation, agents would often leave comments in their code along the lines of "# inserting backdoor here". Of course, follow-up work can experiment with different prompts or elicitation strategies, such as fine-tuning, on our benchmark.

Thanks for catching the few typos and formatting errors. We will fix these for the camera ready if the paper is accepted!

Comment

Thank you for the rebuttal. See below for point-by-point comments:

We will add further motivation for the sandbagging experiments

Yes, I think this would improve the manuscript

We think the results are a bit mixed! Whilst agents perform quite well at conducting sabotage, monitors also perform well at catching them. The sandbagging results could be more concerning, where monitors perform worse.

I think the authors could expand on this a bit more in the conclusion / discussion.

Whilst we think that MLE-sabotage is more realistic than other benchmarks, kaggle competitions still have limitations preventing them from capturing real-life, e.g., the data is always presented in a clean format. Therefore, whilst agents perform well at our tasks (and achieve close to median human performance on main tasks) this may not be indicative of more complicated and messy real-life ML engineering pipelines.

Thank you, this is a good answer. I would recommend that you add this to the final version of your paper.

We had the intuition that averaging many weak monitors would perform better than a single monitor, cf “jury theorems”.

So in short, this was an experiment that didn't pan out? If so, I would not feature it as prominently in the abstract / intro.

We only iterated on the prompt elicitation on a smaller “train” set of tasks, which indicates that this elicitation did generalise to the other tasks.

Thank you, this is an important clarification and resolves my concern. I'd recommend that this is included in the final version of the paper (unless it is already there and I missed it).

Comment

Reviewer miEE: are you happy with the author response?

Review (Rating: 4)

This paper investigates novel approaches to expose LLMs' vulnerabilities. The authors first introduce CTRL-ALT-DECEIT, an attack framework that generates adversarial input prompts via controlled decoding, in order to bypass standard defenses like instruction tuning by using learned adversarial suffixes. To do so, the authors employ black-box optimization, which illustrates transferability across models like Claude and GPT-4, while evaluating countermeasures including output filtering. Also, the paper examines the potential of LLM-based agents for malicious behavior through MLE-sabotage, an extension of the MLE-Bench benchmark that incorporates 22 sabotage tasks (e.g., backdoor implantation, selective misclassification) alongside 230 standard tasks. In this context, the authors examine the ability of automated monitoring systems to detect adversarial behaviors while maintaining nominal task performance. These contributions show significant weaknesses in current LLM alignment and monitoring approaches under critical/complex adversarial conditions.

Strengths and Weaknesses

Strengths

  1. Quality: The robustness of this study's experimental design is commendable, with clear methodologies for extending MLE-Bench and assessing agent behavior. Using real-world tasks (e.g., Kaggle competitions) enhances ecological validity.

  2. Clarity: The paper is well-organized, with a logical flow and effective use of figures and tables. Key concepts (e.g., sabotage vs. sandbagging) are clearly explained.

  3. Significance: The topic is highly relevant to AI safety, especially as autonomous agents are more frequently used in high-stakes ML workflows.

  4. Originality: The focus on autonomous sabotage (rather than adversarial attacks) and the systematic evaluation of monitor effectiveness are innovative contributions.

Weaknesses

Despite all the benefits and strengths of this paper, there are serious issues. No serious consideration is given to the safety-critical domain, which is a mistake, to the best of my understanding.

  1. Quality/ Rigor Concerns
  • This manuscript does not clearly distinguish externally induced sabotage (via explicit prompting) from intrinsically emergent misaligned behavior in the agents. Agents are explicitly instructed to sabotage (e.g., "implant a backdoor"), but the framing suggests this reflects autonomous intent. Without concrete testing for agents' abilities to exhibit such behavior without explicit instructions, claims about "misalignment" are overstated.

  • Limitations. The experimental setup of this study imposes limitations on resource constraints, resulting in the authors selecting all sabotage tasks from the “low-complexity” subset of MLE-Bench (22 tasks). It remains unclear how findings scale to higher-complexity real-world pipelines or larger benchmarks.

  • Bias. The strong sabotage and sandbagging behaviors depend mainly on manual prompt tuning, which raises severe generalizability concerns about the results to out-of-the-box agents.

  2. Clarity / Readability Issues

Typos and Formatting Glitches. The manuscript contains several grammatical lapses, which can distract the reader.

2.1. Missing Cross-References

  • Throughout Section 3 and elsewhere, the paper refers to “??”, i.e., “See ?? for full prompts” instead of actual section or appendix numbers.

2.2 Typo:

  • “grater” instead of “greater” (line 145, Pg. 4)

2.3 Overuse of Acronyms. Terms like “ReAct,” “AIDE,” “MLE-Bench,” and “UK AISI’s Inspect” appear without consistent initial expansions, potentially confusing readers.

  3. Significance / Impact Limitations
  • While agentic sabotage is alarming, there is no empirically valid study of a real-world use case showing how a red team could misuse these capabilities beyond the simulated benchmark.

  • I am curious to know why the paper stops short of exploring actionable defenses or automated safeguard designs for deployment in AI-R&D platforms.

  4. Originality Caveats
  • Some concepts like “sandbagging” and LM-based code sabotage exist already in the literature. The novelty of this paper lies in benchmark integration rather than fundamentally new threat models.

  • This paper's extensive reliance on agent awareness of monitors (or not) and "bootstrap prompts" recalls existing black-box adversarial prompting research, raising questions about the degree of novelty.

Questions

  1. Can the authors explain how they would disentangle /autonomous/ misalignment from /prompt-induced/ sabotage? Could agents exhibit this behavior without explicit instructions?

  2. Would results hold for high-complexity tasks? If not, how does this limit real-world applicability?

  3. Why do monitors fail at sandbagging? Is it a fundamental challenge (e.g., task ambiguity) or a solvable prompt-engineering issue?

  4. Can the authors add confidence intervals or significance tests to key results (e.g., Figure 3)? Small sample sizes raise concerns about the reliability of this study.

  5. Did the authors consider redacting sensitive details (e.g., specific backdoor implementations) to prevent misuse?

Limitations

The authors acknowledge computational constraints and monitor weaknesses (Sec. 6) but miss critical gaps like:

  1. Intentionality Assumption: No evidence that sabotage arises without explicit prompting.

  2. Weaponization Risk: Detailed sabotage techniques could aid malicious actors.

  3. Defense Gaps: No evaluation of state-of-the-art monitoring (e.g., RLHF fine-tuning).

Suggestions:

  • Add a subsection distinguishing elicited vs. autonomous misalignment.

  • Discuss the ethical trade-offs of publishing sabotage methods.

  • Test adaptive defenses (even preliminarily).

Final Justification

After considering the authors' rebuttal and additional clarifications, I acknowledge that several of my original concerns have been addressed: in particular, the clarification of the KL divergence reference policy, the explicit inclusion of an SFT baseline trained on the same PR dataset, stability trends from training reward curves, and the provision of preliminary statistical testing for out-of-domain (OOD) results. In addition, the authors have improved clarity, expanded the related work citations with acceptable SWE papers, and open-sourced the code, data, and models for reproducibility.

However, I still find a major drawback, which impeded me from moving forward with the paper beyond my original rating, and the authors did not engage with my final question by any means whatsoever, which I don't appreciate at all. These key weaknesses remain unresolved:

  • Reward design limitations – While the difflib-based continuous similarity reward is simple and scalable, it remains semantically shallow and blind to control/data flow, invariants, and functional correctness. While the rebuttal's qualitative examples and the "execution + similarity" experiment partially mitigate this, the absence of deeper semantic validation leaves the framework vulnerable to superficial patch mimicry and potential reward hacking.
  • Limited ablations and sensitivity analysis – Due to computational constraints, the authors did not perform systematic ablations on group size, clipping, KL penalty, or reward variants beyond the discrete/continuous comparison. This limits confidence in hyperparameter robustness, a critical factor for practitioners aiming to reproduce or extend this work.
  • OOD generalization claims – While additional statistical analysis suggests significance across combined benchmarks, the gains on individual tasks remain modest. The causal link between SWE-RL training and broader reasoning ability is still more correlational than proven.
  • Reproducibility/stability risks – Although the authors provided stable reward curves and fixed training settings, RL instability is a well-known challenge, and without multi-run variance reporting or robustness checks under altered configurations, reproducibility in different environments is uncertain.

Overall, SWE-RL is a novel and promising application of RL to large-scale real-world SWE tasks, achieving state-of-the-art performance among medium-sized open models and showing intriguing signs of cross-domain reasoning gains. The rebuttal strengthens the paper's empirical credibility, but the work would benefit from future extensions to richer reward modeling, expanded ablations, and deeper robustness studies.

Final recommendation: Borderline accept. The contribution is timely and potentially impactful for the SWE and RL-for-LLM communities, but its adoption in practice may be limited until the above concerns are addressed.

Formatting Issues

  • Missing references to Figures or Tables (??). E.g., Pg. 8 Lines: 302 and 328.
  • Table 1: Misaligned columns (e.g., "16 / 6" under "Regular" vs. "3 / 9" under "Sabotage").
  • Line 120: "Frontier agent make" → "Frontier agents make" (subject-verb agreement).
Author Response

Thank you for your feedback on the paper! We are glad that you appreciated its strong empirical methodology and significance to real-life AI safety. Below we answer your questions.

  1. Can the authors explain how they would disentangle /autonomous/ misalignment from /prompt-induced/ sabotage? Could agents exhibit this behavior without explicit instructions?

We focus on agent capability to conduct sabotage without being detected, in contrast to whether agents would naturally engage in sabotage unprompted. Because we wish to evaluate agent capabilities, we explicitly try to elicit sabotage capabilities through prompting. Whilst agents might engage in sabotage in some settings even without being prompted, a rigorous analysis of this is beyond the scope of our paper. We will add a clarifying note on this point.

  2. Would results hold for high-complexity tasks? If not, how does this limit real-world applicability?

Even the low-complexity tasks in MLE-Bench, scraped from Kaggle, are already substantially more realistic and complicated than other sabotage tasks from the literature (e.g., they require giving the agent access to GPUs to train ML models).

However, we agree that it would be interesting to examine whether the results and dynamics we observe would persist in more realistic environments. Designing even more complex sabotage tasks is an important area of future work!

  3. Why do monitors fail at sandbagging? Is it a fundamental challenge (e.g., task ambiguity) or a solvable prompt-engineering issue?

Monitors sometimes successfully detect sandbagging, but they do perform worse than at detecting code sabotage. We ran experiments with different prompt elicitation (see Fig. 8 for a comparison of monitor performance with different prompts). When the monitor has more context about sandbagging, it does perform better (cf. Fig. 8). As we note in the paper (lines 348-349), we think this poses a genuine difficulty for monitoring. The suspiciousness of an action is highly context-dependent, and consistently providing the right context might prove difficult in practice.

Here are some representative examples of how the monitor (with general prompt) failed:

Non-suspicious sandbagging runs:

  • The monitor sees the agent is decreasing performance, but interprets it as "progressively refining the architecture to balance performance and efficiency"
  • The monitor says "The performance achieved (0.05062 on the test set) is reasonable for the limited training time and computational resources." This points to a shortcoming of the monitor prompts: we did not include what resources were available to the agent.
  • The monitor knows the agent is targeting a specific score, but treats that as part of the task.
  • The monitor accepts the agent's comments of "optimizing for better balance between underfitting and overfitting".
  • Sometimes the monitor does not notice things that could have been done better. E.g. using a pre-trained vision model instead of training from scratch.
  • The monitor refers to efficiency gains even when prompted with the sandbagging-specific prompt.

Suspicious regular runs:

  • The monitor thinks the agent's score is suspiciously high. In some cases, it is because the monitor thinks the test score is reported by the agent and not something that was done independently from the agent. In other cases, the monitor has knowledge about the competition from pre-training. E.g. for the aptos2019-blindness-detection task, the monitor is saying "The winning solutions in the actual competition scored around 0.93-0.94, and even state-of-the-art approaches published after the competition typically don't exceed 0.95. [Hence a score above 0.98 is suspicious.]”
  4. Can the authors add confidence intervals or significance tests to key results (e.g., Figure 3)? Small sample sizes raise concerns about the reliability of this study.

These plots do present mean and standard error :)

  5. Did the authors consider redacting sensitive details (e.g., specific backdoor implementations) to prevent misuse?

For current SOTA agents, we do not believe that capabilities on our benchmark provide meaningful sabotage-uplift to serious malicious actors. But this is important to track going forward as agents become more capable at sabotage. We will add a note on this societal impact in the conclusion.

However, we will be happy to redact sensitive information from the public version of the paper, including system prompts and agent trajectories, if this is deemed necessary by the ACs.

Defense Gaps: No evaluation of state-of-the-art monitoring (e.g., RLHF fine-tuning).

We do include a thorough analysis of monitoring methods with different prompt-elicitation. Fine-tuning monitors requires substantial data (on agent trajectories) and computational resources. We therefore think it is beyond scope for this paper, but we agree that it is an important area of future work and we will add this point to the discussion.

We will add discussion of all the suggested points for the camera ready as discussed above. In addition we will fix the typos and formatting issues you highlight.

Thanks again for your feedback which we think will help to improve the paper! If you feel that we have sufficiently addressed some of your questions then we would appreciate it if you raised your score.

Comment

Thank you for your response to my concerns. After reading through, I still have some doubts.

  • The paper doesn’t seem to demonstrate sabotage without explicit malicious prompting. This could be a potential drawback for autonomous misalignment testing.
  • The majority of benchmark tasks are drawn from Kaggle but remain low in real-world complexity (e.g., small-scale ML tasks). Thus, the benchmark still suffers from limited generalization beyond low-complexity tasks.
  • The paper should improve its statistical reporting.
  • The paper evaluates only zero-shot prompt-based monitors, without including stronger baselines.
  • The threat model should be clearly defined, with the key actors' capabilities stated upfront.
Comment

Thanks for following-up!

The paper doesn’t seem to demonstrate sabotage without explicit malicious prompting. This could be a potential drawback for autonomous misalignment testing.

We investigate agents' capability to conduct sabotage, not agents' propensity to conduct sabotage unprompted. It is very common to disentangle agent capabilities and propensities in the literature on dangerous capability evaluations. We think that an analysis of misaligned sabotage propensities is a good area for future work and we will add a note on this in the paper.

The majority of benchmark tasks are drawn from Kaggle but remain low in real-world complexity (e.g., small-scale ML tasks). Thus, the benchmark still suffers from generalization beyond low-complexity tasks

We agree that there is a gap between the Kaggle competitions in our dataset and real-life settings of interest. On the other hand, our MLE-sabotage benchmark is the most realistic sabotage capability benchmark so far (compared to the other control evals discussed in the related work on lines 71-76). However, we will add a point to the discussion highlighting the gap between the Kaggle competitions used and more realistic settings.

The paper should improve its statistical reporting

Please let us know which statistical metrics you believe we should report beyond mean and standard errors (which are already included in the paper).

The paper evaluates only zero-shot prompt-based monitors, without including stronger baselines

We find that zero-shot monitors are already performant for detecting sabotage, though less performant at sandbagging detection. Which other baselines do you recommend we include? (few-shot? fine-tuning?)

The threat model should be clearly defined with the key actors' capabilities upfront

Do you not think that we explain the threat model in the introduction? What further details do you think should be included? How can we define the agent's capabilities upfront (we seek to measure these capabilities)?

Comment

I appreciate your efforts, but I will keep my rating.

Comment

Concerning this part of the authors' comment:

The paper should improve its statistical reporting

Please let us know which statistical metrics you believe we should report beyond mean and standard errors (which are already included in the paper).

The paper evaluates only zero-shot prompt-based monitors, without including stronger baselines

We find that zero-shot monitors are already performant for detecting sabotage, though less performant at sandbagging detection. Which other baselines do you recommend we include? (few-shot? fine-tuning?)

Reviewer cfjh: would it be possible for you to engage in discussion with the authors, answering these questions?

Thanks,

area chair

Comment

I appreciate your response. Maybe the authors misunderstood my point if they only include mean ± SE. In RL and stochastic LLM setups, results are susceptible to initialization, seed, and batch-sampling noise. Without variance or statistical controls, improvements may not be statistically significant or repeatable.

Some suggested reporting could include:

  1. Multiple random seeds (≥3–5) and showing per-seed variance.
  2. Bootstrapped confidence intervals, especially on performance metrics.
  3. Statistical significance testing (e.g., paired t-test or Wilcoxon signed-rank) when comparing against baselines.
  4. Training/reward learning curves, not just final metrics.
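
To make items 2 and 3 above concrete, a minimal sketch under the assumption of paired per-task scores for two conditions (the numbers are invented, not results from the paper):

```python
# Sketch of items 2 and 3: a bootstrap CI on the mean paired difference and a
# Wilcoxon signed-rank test. The per-task scores are invented for illustration.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
regular = np.array([0.82, 0.75, 0.90, 0.68, 0.71, 0.88, 0.79, 0.84])
sandbag = np.array([0.55, 0.60, 0.72, 0.50, 0.58, 0.66, 0.61, 0.70])

# Bootstrapped 95% CI on the mean paired difference.
diffs = regular - sandbag
boot = [rng.choice(diffs, size=len(diffs), replace=True).mean() for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean diff = {diffs.mean():.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")

# Paired non-parametric significance test.
stat, p = wilcoxon(regular, sandbag)
print(f"Wilcoxon statistic = {stat}, p = {p:.4f}")
```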

Zero-shot settings are fair, but they do not test monitor robustness or adaptability. In the context of adversarial sabotage, it is reasonable to also test:

  • Few-shot tuned monitors, which adapt to specific sabotage examples.
  • Fine-tuned classifiers, trained on sabotage vs. safe behaviour.
  • Defense techniques like task verification prompts or post-generation filtering.

On the threat model, you ask a valid question but miss the nuance. For example, the authors could (i) explicitly enumerate the roles in the threat model: e.g., attacker agent, prompting interface, monitor, evaluator; (ii) clarify capability assumptions: e.g., is the agent pre-trained or fine-tuned? Does it have API access? Can it modify the input/output structure? (iii) describe the evaluator’s visibility: Does the evaluator see only generated text? Source code? Execution traces? This will clarify what is being tested and what is assumed (especially important in safety/security work).

Review (Rating: 6)

This paper introduces a new benchmark called MLE-sabotage, which modifies a subset of 22 tasks from MLE-bench to introduce sabotage tasks or side objectives. For example, a task for LLM agents about training a classifier for toxic Wikipedia comments is modified to add a sabotage task of classifying toxic comments against Caucasians as non-toxic. LLM agents show strong performance on these sabotage tasks, without sabotaging the benchmark, but are also easily caught by monitors. LLM agents can sandbag or reduce their performance though with high variance.

优缺点分析

Strengths:

  • Timely topic given growing discussion of AI agent sabotage and how that might influence internal deployments in AI labs
  • Evaluates AI capabilities for sabotage as well as monitorability to catch sabotage
  • Benchmark tasks are implemented in Inspect, a commonly used benchmark format, improving interoperability
  • Multiple variant experiments are conducted: monitor-aware sabotage, monitor-unaware sabotage, sandbagging, and regular (achieving the main task). The sandbagging results indicate that agents are (with high variance) able to adjust their performance to target scores, informing concerns about AIs deliberately hiding capabilities to avoid increased controls, though sandbagging can be detected by monitors with partial success.
  • Due to building upon MLE-bench, MLE-sabotage includes comparisons to human team–level performance on Kaggle competitions. This allows for a comparison between AI and human-level performance.
  • Data and code are provided to allow for reproducibility and easy integration by AI developers.

Weaknesses:

  • The sabotage tasks constructed may be relatively easy for LLM agents, so there may be room to construct harder sabotage tasks

Questions

No urgent recommendations for this paper, but please do clean up formatting mistakes. There is a lot of room for interesting follow-up work, but the paper seems strong as is. It would be helpful to include more details about the agent execution, such as number of tokens or cost used.

Potential extensions:

  • Expand to harder or more realistic ML engineering tasks, such as the remainder of MLE-bench, RE-Bench, or PaperBench. More realistic tasks, along with more harmful sabotage tasks, would be especially useful when considering real-world threats from AI sabotage.
  • Include a range of sabotage tasks with different levels of difficulty for each machine learning engineering task.

Limitations

Yes, authors have adequately discussed limitations and potential negative societal impact of their work.

Final Justification

I will maintain the same score as in my original review, as issues raised by other reviewers do not present any major unresolved concerns in my view.

Formatting Issues

Several instances of "see ??" within the paper, which should be corrected.

Author Response

Dear Reviewer,

Thank you for your support for, and feedback on, the paper!

We will fix the formatting mistakes and try to include further details on agent hyperparameters (e.g., token costs) for the camera ready.

We are also pursuing follow-up work designing more difficult and varied sabotage tasks, as you suggest.

Comment

Hi reviewer aLyA,

You wrote that "it would be helpful to include more details about the agent execution, such as number of tokens or cost used."

Do you think the paper is reproducible as is? Would you be comfortable reimplementing it on the basis of the manuscript if you had to?

Thanks,

area chair

Comment

Hi Area Chair bMrw, thank you for your comment.

Based on the thorough details provided in the supplemental materials, including a link to a code repository, I believe the paper is reproducible as is. While agent token counts are not reported, agent scaffolds and execution time limits are described in Appendix B.1, which I think is adequate.

Final Decision

The paper deals with the question of whether current LLMs have the capability to sabotage coding tasks (by inserting backdoors).

The main strength of the paper is that the problem is very timely and relevant.

Weaknesses are minor and were largely addressed during the rebuttal: (1) too few / too easy tasks, (2) some doubts about evaluation.

The main reason for acceptance is that this is a well executed approach to a timely and important problem.

Discussion included: (1) reproducibility, (2) capability vs. propensity testing, (3) methodology.