ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks
Benchmark for IT automation tasks
Abstract
Reviews and Discussion
The paper introduces IT-Bench, a specialized benchmarking framework designed to evaluate AI agents on real-world IT automation tasks across three key domains: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). Built from 94 scenarios derived from actual incidents, CIS benchmarks, and FinOps guidelines, IT-Bench operates in realistic environments—such as Kubernetes clusters—integrated with industry-standard tools like Grafana and Prometheus. The framework assesses agent performance using metrics such as pass@1 and time to resolution. Testing reveals significant challenges for even advanced models like GPT-4o, with success rates of 13.8% in SRE, 25.2% in CISO, and 0% in FinOps. These results highlight the complexity of IT automation and the pressing need for enhanced AI capabilities in this field.
Questions for the Authors
Can IT-Bench incorporate resilience testing (e.g., telemetry noise) to better simulate production unpredictability?
Claims and Evidence
The authors assert that IT-Bench is a robust, extensible, and practical framework for evaluating AI agents in IT automation. This claim is supported by several key points:
- Real-World Scenarios: The benchmark incorporates 94 tasks rooted in genuine IT incidents and established industry standards, ensuring relevance and applicability.
- Framework Design: By integrating authentic environments and observability tools, IT-Bench mirrors the conditions AI agents would encounter in practice.
Methods and Evaluation Criteria
IT-Bench models agent-environment interactions as a Partially Observable Markov Decision Process (POMDP), capturing the inherent partial observability of IT systems. Agents are evaluated using a suite of metrics, including pass@1 (success on the first attempt), fault localization accuracy, and fault propagation chain analysis, among others. These metrics provide a comprehensive measure of performance across diverse dimensions.
Theoretical Claims
The use of a POMDP to model agent-environment interactions is theoretically sound and aligns with established AI research paradigms. The NTAM metric is a novel contribution that accounts for topology-aware fault localization. However, the paper overlooks potential biases in scenario selection (e.g., over-representation of Kubernetes-based systems) and lacks a discussion of generalizability across diverse IT ecosystems. NTAM's parameter tuning is heuristic-based and needs empirical justification.
Experimental Design and Analysis
Experiments
The paper evaluates baseline agents, such as GPT-4o and Llama-3.3-70B, across IT-Bench’s scenarios, yielding the following insights:
- Low Success Rates: Performance is notably weak, with FinOps tasks achieving a 0% success rate.
- Complexity Challenges: Agent effectiveness diminishes as scenario complexity increases.
- Environmental Factors: Non-deterministic elements, such as real-time telemetry data, pose significant hurdles for agents, underscoring the unpredictable nature of real-world IT settings.
Supplementary Material
I have thoroughly examined the supplementary material, which provides additional support for the paper's claims and methods.
Relation to Existing Literature
The introduction of IT-Bench makes a valuable contribution to the literature on AI agent evaluation for IT automation. Its focus on single-agent performance establishes a strong foundation for assessing individual AI capabilities in these contexts. However, many real-world IT operations involve complex, collaborative environments where multiple agents or human-AI interactions are critical. Extending IT-Bench to incorporate multi-agent collaboration or human-AI workflows could significantly enhance its generalizability, addressing the scalability and robustness concerns raised in this review. Such an advancement would align the framework with emerging research on collaborative AI systems, positioning IT-Bench as a versatile tool for evaluating AI-driven solutions in dynamic, multi-actor settings.
Missing Important References
None
Other Strengths and Weaknesses
Strengths
- Comprehensive Metrics: IT-Bench employs a diverse set of evaluation criteria, including pass@1 for accuracy and time-based metrics like Mean Time to Diagnosis (MTTD) and Mean Time to Repair (MTTR). This multifaceted approach assesses both precision and efficiency, offering a holistic view of agent performance.
- Openness: The framework and baseline agents are open-sourced, encouraging collaboration and further development by the research community.
Weaknesses
- Narrow Agent Focus: The evaluation centers on large language model-based agents (e.g., GPT-4o, Llama), neglecting alternative AI approaches, such as rule-based systems or reinforcement learning, that might outperform in specific IT automation contexts.
Other Comments or Suggestions
Refer to the former comment.
Q1. Can IT-Bench incorporate resilience testing (e.g., telemetry noise) to better simulate production unpredictability?
Yes! We already incorporate resilience testing tools such as Chaos Mesh in ITBench, which can be used to evaluate agentic technologies under different resilience testing scenarios. ITBench also supports evaluating the effectiveness of agents under different kinds and verbosity levels of telemetry data: any telemetry source can easily be turned on or off.
We currently do not add random noise to telemetry data. The mechanism for adding noise is straightforward. However, one of our key principles is high fidelity to real-world IT scenarios, so we are working on policies for adding noise that resembles faulty telemetry in the field.
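As an illustration only, the following minimal Python sketch shows one straightforward way such noise injection could look, assuming metrics are available as (timestamp, value) samples; the function, its parameters, and the defaults are hypothetical and not part of ITBench.

```python
import random

def inject_telemetry_noise(series, rel_sigma=0.05, drop_prob=0.01, seed=None):
    """Perturb a metric time series: add relative Gaussian noise to each sample
    and occasionally drop samples to mimic flaky or faulty telemetry.

    `series` is a list of (timestamp, value) pairs; all parameters here are
    illustrative assumptions, not ITBench settings."""
    rng = random.Random(seed)
    noisy = []
    for ts, value in series:
        if rng.random() < drop_prob:   # simulate a missing scrape
            continue
        jitter = rng.gauss(0.0, rel_sigma) * value
        noisy.append((ts, value + jitter))
    return noisy

# Example: 5% relative noise and 1% dropped samples on a CPU-usage series.
cpu_usage = [(t, 0.40 + 0.01 * t) for t in range(10)]
print(inject_telemetry_noise(cpu_usage, seed=7))
```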
We also respond to the following important comments.
C1. Many real-world IT operations involve complex, collaborative environments where multiple agents or human-AI interactions are critical. Extending IT-Bench to incorporate multi-agent collaboration or human-AI workflows could significantly enhance its generalizability, addressing scalability and robustness concerns raised in this review.
Those are excellent suggestions! Multi-agent systems and human-AI interactions are important components on the ITBench roadmap, and we will discuss them in the paper. We implicitly use "agents" to refer to multi-agent systems; our SRE agents already have a multi-agent form: the mitigation agents interact with the diagnosis agents to determine the resolution based on the root causes. Similarly, the CISO agent is a multi-agent system comprising skilled agents targeting OPA-Ansible, Kyverno, and OPA-Kubectl.
C2. Potential biases in scenario selection (e.g., over-representation of Kubernetes-based systems) and lack of discussion on generalizability across diverse IT ecosystems
Thanks for the question. We will discuss generalizability in the final version. The design of ITBench is not specific to Kubernetes-based stacks; it can be extended to other IT infrastructures (e.g., Docker Swarm and Nomad from HashiCorp), though doing so would require engineering effort.
Kubernetes was chosen because it is the de facto open-source IT infrastructure for cloud and datacenter systems today. Its design is in principle similar to proprietary infrastructure systems such as Google's Borg, Meta's Twine/Tupperware, AWS's ECS, and Snowflake's ECS, and managed Kubernetes services are offered by all major cloud providers (e.g., Google, Azure, AWS, IBM). To keep ITBench an open platform, we intend to use only open-source systems as components rather than proprietary ones, so Kubernetes seems to be the best choice. Note that most cloud systems research uses Kubernetes as the representative infrastructure (in a similar vein to how Linux is used in OS research and x86-64 in architecture research).
In the context of ITBench, Kubernetes serves only as the container orchestration infrastructure. Many IT tasks go beyond the Kubernetes layer; for example, applications like Hotel Reservation are not specific to Kubernetes, so misconfigurations in applications are orthogonal to Kubernetes (they occur in the same form regardless of the orchestration infrastructure). The same holds for node failures and network disconnections.
Note that we use the term "Kubernetes" to refer to the broader container-orchestration-based IT infrastructure, which is not limited to the original Kubernetes project (https://github.com/kubernetes/kubernetes). In fact, we use different backends such as MiniKube, K3d, and Kind (for laptop-based setups).
This is a benchmark paper. It evaluates recent LLM agent systems in three IT domains: (1) Site Reliability Engineering (SRE), (2) Compliance and Security Operations (CISO), and (3) Financial Operations (FinOps). The main contribution of this paper is the preparation of three benchmarks and a thorough evaluation of candidate systems. Beyond that, the paper also implements several IT agents following a conventional design. In the experiment section, the paper summarizes the key observation that current agent systems still cannot perform well on these real IT tasks.
Questions for the Authors
Can you explain more about the unique challenge of IT tasks compared with the broader SWE tasks?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
N/A, no theoretical claims
Experimental Design and Analysis
Yes
Supplementary Material
Yes, I checked the agent frameworks used in the experiments and the related work section in the appendix.
Relation to Existing Literature
- The agent framework used in this paper follows the widely used ReAct style.
- The evaluation is consistent with the previous setting.
Missing Important References
No
Other Strengths and Weaknesses
The main limitation of this paper is its limited theoretical depth. Because current LLM agent systems are black boxes due to the LLM, we cannot clearly analyze and understand the decision process. The paper shows that existing agent systems cannot solve current IT challenges but fails to give a theoretical analysis.
Other Comments or Suggestions
No
Q1. unique challenge of IT tasks compared with the broader SWE tasks
IT tasks are more diverse than SWE tasks. Consider SRE, which is closer to SWE than FinOps/CISO: SRE involves distributed systems (multi-machine, full stack: app, platform, OS, hardware & their integrations), whereas SWE focuses on single programs.
Key differences between SWE and SRE include:
- Complexity/Scope: SRE systems are larger-scale and more diverse than single programs. Root causes are broader than just bugs, including hardware/network faults, misconfigs, overload.
- Diagnosis: SRE diagnosis differs from SWE debugging. SRE issues are often hard to reproduce (dependencies, scale, non-determinism) & lack source code. SWE debugging assumes reproducibility with source/debuggers (GDB).
- Context/Agent Fit: IT systems' scale/complexity makes full model context infeasible (vs. SWE). Agents are needed; multi-agent/multi-modal designs handle diverse IT data (metrics, traces, logs).
- Goals/Actions: SRE goal: production service reliability/availability. SWE goal: program correctness. SRE mitigation prioritizes immediate service restoration (rollbacks, feature flags, etc.), beyond just fixing bugs (SWE focus).
- Environment/Safety: SRE is production; SWE is development. Safety is paramount for SRE agents (unlike SWE agents trying anything for tests); unsafe trial-and-error unacceptable. SRE actions need risk/impact assessment.
- Evaluation: Evaluating SRE agents is harder than SWE. SWE uses public data (GitHub); replicating production SRE systems is difficult (scale, proprietary). This motivates ITBench for enabling AI in this complex domain.
We also respond to the following important comments.
C1. LLM agent systems are black-box
We acknowledge the challenge of understanding LLM agent decision processes. However, ITBench allows us to investigate agent behavior empirically, for the following reasons:
- Detailed Trajectory Logging: We record comprehensive logs for each step, including the specific tool used, the exact inputs (including full prompts), and the resulting action. This provides necessary data for analysis.
- ReAct Framework: Our agent utilizes the ReAct framework, which prompts the LLM to output its reasoning ("Thought") before acting. This captures intermediate reasoning steps, offering insight into the decision process.
- Error Source Differentiation: By combining detailed logs and ReAct traces, we can distinguish between high-level reasoning failures (strategy errors) and lower-level execution errors (tool usage mistakes). An automated process categorizes failures, enabling quantitative analysis of failure modes.
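To make the logging concrete, here is a minimal sketch of the kind of per-step trajectory record and coarse failure categorization described above; the field names, tool names, and categories are illustrative assumptions, not ITBench's actual schema.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional
import json

@dataclass
class TrajectoryStep:
    thought: str                 # ReAct "Thought" emitted before acting
    tool: str                    # tool the agent chose (hypothetical name)
    tool_input: str              # exact input passed to the tool
    observation: str             # what the environment returned
    error: Optional[str] = None  # populated when the tool call fails

def categorize_failure(steps: List[TrajectoryStep]) -> str:
    """Coarse split: execution errors (tool misuse) vs. reasoning/strategy
    failures (no tool errors, but the task was still not solved)."""
    if any(step.error for step in steps):
        return "execution_error"
    return "reasoning_failure"

steps = [
    TrajectoryStep("Check pod status first.", "kubectl_tool",
                   "get pods -n hotel-reservation", "3 pods CrashLoopBackOff"),
    TrajectoryStep("Inspect logs of the failing pod.", "kubectl_tool",
                   "logs frontend-abc123", "", error="pod not found"),
]
print(json.dumps([asdict(s) for s in steps], indent=2))
print(categorize_failure(steps))  # -> execution_error
```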
We provide two case studies of the SRE agent as exemplars:
Prompt problem: Trajectory analysis revealed flaws in an agent using Granite-3.1-8B (e.g., tool misuse linked to prompt errors). Fixing the prompts based on this analysis significantly improved success rates (3.3% to 8%), reduced errors (incorrect tool calls 7% to 0.7%), and balanced tool usage.
Planning problem: To quantitatively assess reasoning strategy (e.g., SRE diagnosis), we compare the agent's explored path (from its 'Thoughts' and tool use) against the ground-truth fault propagation chain (the causal sequence). The rationale is that diagnosis often traces this chain in reverse. We computed the following metrics across trajectories:
- Detoured services: Avg services explored off the ground-truth path (lower = better focus).
- Relative covered services: Avg ratio of relevant on-path services explored vs. ground-truth length (higher ≈ 1 = better alignment).
We analyzed these metrics separately for successful and unsuccessful trajectories, focusing on GPT-4o versus Granite-8B:
- For successful trajectories: GPT-4o demonstrated significantly better reasoning quality. It achieved much higher alignment with the ground-truth path (avg Relative Covered Services: 0.75 for GPT-4o vs. 0.30 for Granite-8B) and substantially less deviation into irrelevant services (avg Detoured Services: 0.98 for GPT-4o vs. 2.00 for Granite-8B).
- For unsuccessful trajectories: Even when failing, GPT-4o maintained better reasoning metrics compared to Granite-8B. GPT-4o still covered more of the relevant path (avg Relative Covered Services: 0.48 vs. 0.27) and explored fewer irrelevant services (avg Detoured Services: 2.1 vs. 3.1) than Granite-8B did during its failures.
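For clarity, the following minimal sketch shows how the two per-trajectory metrics could be computed from the set of explored services and the ground-truth fault propagation chain; it reflects our description above, not the exact ITBench implementation.

```python
def trajectory_metrics(explored, ground_truth_chain):
    """explored: services the agent touched (extracted from Thoughts/tool calls);
    ground_truth_chain: causal sequence of services in the fault propagation chain."""
    explored_set = set(explored)
    chain_set = set(ground_truth_chain)
    detoured = len(explored_set - chain_set)              # off-path services explored
    covered = len(explored_set & chain_set)               # on-path services explored
    relative_covered = covered / len(ground_truth_chain)  # closer to 1 = better alignment
    return detoured, relative_covered

# Example: the fault propagates geo -> search -> frontend; the agent also
# detours into the unrelated "rate" service.
explored = ["frontend", "search", "geo", "rate"]
chain = ["geo", "search", "frontend"]
print(trajectory_metrics(explored, chain))  # -> (1, 1.0)
```

Averaging these per-trajectory values yields the reported Detoured Services and Relative Covered Services numbers.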
This paper presents IT-Bench, a framework that benchmarks AI agents for IT automation across roles including Site Reliability Engineering, Compliance and Security Operations and Financial Operations. It offers 94 real-world scenarios with automated, partial scoring evaluation and a leaderboard to ensure reproducibility. The framework models each scenario as a tuple of metadata, environments, triggering events, and desired outcomes, and benchmarks the agent's performance. Evaluations using various LLMs reveal modest success rates with FinOps unresolved. This demonstrates the challenges in automating IT complex tasks.
Questions for the Authors
- In the abstract, you mention that the benchmark can be easily extended through community contributions. Could you elaborate on the process for adding new tasks to IT-Bench? Given that task scenarios often involve complex, task-specific setups and requirements, how do you ensure that integration is manageable for contributors? Are there guidelines designed to standardize the addition of new tasks to IT-Bench?
- The abstract claims that IT-Bench supports partial scoring. Could you clarify how partial scoring is implemented during evaluation? Specifically, how are partial scores computed and used to assess agent performance?
- For the natural language to code tools in the SRE/FinOps settings (e.g., NL2Kubectl), which LLM backbone is used to translate natural language into specific code? Is this backbone the same model used for the agents, or is it fixed to a particular model?
- What is the average token consumption for running the agent on one task?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
No theoretical claims are proposed in this paper.
Experimental Design and Analysis
Yes
Supplementary Material
No supplementary material provided.
Relation to Existing Literature
This paper presents a novel direction for benchmarking LLM agents in real-world IT tasks, extending their application far beyond SWE applications. IT-Bench provides a comprehensive framework that covers multiple IT personas and reflects the complexity of actual IT operations.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths:
- The paper introduces a comprehensive benchmark that reflects real-world IT automation challenges. It unifies multiple IT roles into one framework, ensuring broad applicability and practical relevance. Moreover, the use of real-world scenarios and industry benchmarks enhances its authenticity. Drawing on actual incidents and standards, the benchmark offers a realistic testbed for AI agent performance.
- The evaluation pipeline is automated and systematic. It proposes several well-defined metrics with real-world implications. The design of a leaderboard can provide performance insight.
Weaknesses:
- The domain coverage is imbalanced. FinOps contains only 2 tasks. A success rate of 0% cannot support any claims in difficulty due to the limited dataset size.
- The framework's complexity and infrastructure demands may hinder accessibility. The detailed environment setup and the challenge of integrating new benchmarks could restrict broader adoption and ease of extension.
Other Comments or Suggestions
I still have reservations regarding the motivation for employing AI agents to address IT challenges. While the paper cites the CrowdStrike incident to demonstrate the need for intelligent IT resolution, it is not clear to me how deploying agents would prevent such failures in practice. In production-grade environments where errors can have significant consequences, ensuring the reliability of AI agents is crucial. For example, what if an alarm fix inadvertently triggers a cascade of additional errors? I believe a deeper discussion of the built-in safeguards, error mitigation strategies, and overall reliability assurances of agents in IT is required.
Q1.1 Complexity and infrastructure demands may hinder accessibility
The framework's complexity is abstracted away behind the agent interface, which is designed for accessibility, similar in principle to SWE-agent. AI researchers in the broader community have already been able to use ITBench. Environment setup is automated ("push-button") using provided scripts, masking infrastructure details. The framework runs on laptops (≥16GB RAM, Linux/macOS) for smaller tasks, allowing developers to quickly pull and work on tasks; workstations or cloud VMs are needed for larger problems.
ITBench aims to enable research on complex real-world problems. Reducing the inherent complexity is a non-goal; modeling it is necessary to rigorously evaluate agents on realistic IT infrastructures. Simplifying would impair task fidelity and evaluation validity. Our principle is to provide an accessible interface while preserving the necessary complexity for meaningful evaluation, balancing scalability and resource efficiency.
Q1.2 On extensibility
Extensibility is a first-class design principle of ITBench and is treated seriously. We promote and welcome open contributions from researchers and practitioners. Unlike the agent interface, extending the benchmarks requires expertise in IT infrastructure to maintain accuracy and realism.
We provide clear guidelines in our repositories (anonymized per the double-blind policy) to standardize the addition of new tasks based on their required extensions. The main effort lies in (1) ensuring the reproducibility of the problems in the tasks, and (2) defining task-specific criteria for partial scoring (the rest of the scoring is automated based on whether the alarms are resolved by the agents). In our experience, the setup is rarely a problem as it is largely automated, and reproducibility is verified through an automated Continuous Integration (CI) pipeline.
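To illustrate what a contributed task might specify, here is a purely hypothetical Python sketch of a scenario descriptor, loosely following the paper's scenario tuple of metadata, environment, triggering event, and desired outcome; none of the field names or values come from the actual ITBench repositories.

```python
# Hypothetical scenario descriptor; field names and values are illustrative only.
new_sre_scenario = {
    "metadata": {
        "id": "sre-custom-001",
        "domain": "SRE",
        "complexity": "medium",
    },
    "environment": {
        "platform": "kubernetes",            # deployed by the automated setup scripts
        "application": "hotel-reservation",
        "observability": ["prometheus", "grafana"],
    },
    "trigger": {
        "type": "fault_injection",
        "fault": "misconfigured_service_port",
    },
    "desired_outcome": {
        "alert_resolved": True,                            # automated pass/fail check
        "ground_truth_root_cause": ["frontend-service"],   # used for partial scoring
    },
}
```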
Q2. How partial scoring is implemented
Partial scoring is fundamental to our benchmark for systematic, fine-grained assessment of agent reasoning in IT tasks. It values intermediate steps when perfect solutions are hard to achieve, necessitating novel metrics tailored to specific IT domains. We exemplify partial scoring for root-cause diagnosis for SRE scenarios. Given large topologies (100K+ nodes), exact identification is difficult, but recognizing topologically close components demonstrates valuable reasoning. To quantify this, we developed the Normalized Topology Aware Match (NTAM) metric using expert-validated principles: topological proximity, node importance within the fault chain, effective search space reduction, and output length constraints. Inspired by information retrieval ranking (like BM25), NTAM measures prediction relevance and features tunable hyperparameters (see Appendix F).
Crucially, partial scoring methods are domain-specific. For FinOps, we supplement NTAM with other proximity metrics evaluating alignment with optimal cost/efficiency values by measuring relative difference, rather than requiring exact matches. This tailored approach ensures nuanced performance evaluation across diverse task types.
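As a concrete illustration of the relative-difference idea for FinOps partial scoring, the sketch below assigns partial credit based on how close a predicted cost/efficiency value is to the optimum; the linear decay and cutoff are our assumptions for illustration, not ITBench's exact formula (NTAM itself is specified in Appendix F).

```python
def proximity_score(predicted, optimal, eps=1e-9):
    """Return a score in [0, 1]: 1.0 for an exact match with the optimal value,
    decaying linearly with the relative difference and clipped to 0 once the
    prediction is off by 100% or more."""
    rel_diff = abs(predicted - optimal) / (abs(optimal) + eps)
    return max(0.0, 1.0 - rel_diff)

# Example: the agent recommends a $42/h configuration when the optimum is $40/h.
print(proximity_score(42.0, 40.0))  # -> 0.95 (close, earns most of the credit)
print(proximity_score(90.0, 40.0))  # -> 0.0  (too far off for partial credit)
```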
Q3. LLM for NL-to-code: Agent's or fixed?
We use the same LLMs for both the planner and the tools (e.g., NL2Kubectl). Exploring hybrid LLM configuration is our future work. For example, we can potentially use small models for NL-to-code tools, but our experience shows that small models are not yet good at generating code or using tools effectively.
Q4. Token utilization
Average token consumption varies by model and task type; GPT-4o uses ~675k ± 205k tokens for SRE tasks, ~208k ± 263k for FinOps tasks, and ~23k ± 32k for CISO tasks.
We also respond to the following important comments.
C1. FinOps contains only 2 tasks
ITBench is an evolving benchmark. FinOps tasks increased from 2 at submission to 10 currently, and we continue adding tasks. FinOps initially had fewer tasks as it's a less established field; we are actively defining and adding scenarios. We evaluated agents on these 10 tasks; smaller models struggle, while GPT-4o achieved a 0.2 pass@1 score. We acknowledge the current scarcity and lack of statistical representativeness for FinOps (thank you) and will clarify this in the updated paper. Overall, the benchmark is growing, e.g., SRE tasks increased from 42 to 98 since submission via community contributions.
C2. A deeper discussion of the overall reliability assurances of agents in IT is required
We agree that safety and reliability are critical, and we will add a deeper discussion as suggested. We are also enhancing ITBench to provide finer-grained safety feedback within the SRE, FinOps, and CISO contexts. Furthermore, ITBench already models real enterprise IT settings where SRE, FinOps, and Compliance tasks are interlinked: an agent's action (e.g., an unsafe SRE command) can trigger cross-domain issues (such as compliance violations or FinOps costs) that ITBench measures. However, this is a subject of future work.
I appreciate the authors’ efforts to clarify the points I mentioned. I have no further questions and have revised my score accordingly. I am voting for acceptance of this paper.
Thank you for taking the time and updating the score.
The reviewers are universally positive about this paper, noting that it 1) makes a significant contribution to the community in the form of a benchmark and evaluation infrastructure for using AI agents to automate IT tasks, 2) is comprehensive, includes a wide variety of real-world data, and may be extended easily in the future, 3) includes automatic evaluation with a leaderboard, which makes it accessible and will further motivate work in the community, and 4) is a clear and detailed examination of the benchmark, including assessment of related methods. There are some minor issues, such as imbalances among the categories of IT automation tasks, but the strengths of this work and its likely impact far outweigh these minor negatives.