STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds
Summary
Reviews and Discussion
This paper introduces STRATUS, a novel LLM-based multi-agent system designed to automate Site Reliability Engineering (SRE) in large-scale cloud environments. As cloud systems scale and become increasingly complex, traditional human-in-the-loop reliability practices struggle to keep up with the volume and speed of failures such as software bugs, misconfigurations, and hardware faults.
Strengths and Weaknesses
STRATUS addresses this challenge by coordinating a set of specialized agents—such as detection, diagnosis, mitigation, and undo agents—within a structured state machine to autonomously perform end-to-end failure management. The authors also propose a formal safety property called Transactional No-Regression (TNR) to guide safe and reversible actions during agentic operation, enabling iterative and safe learning and recovery.
Empirical evaluations on benchmark SRE testbeds (AIOpsLab and ITBench) show that STRATUS significantly outperforms existing SRE systems, achieving at least 1.5× higher success rates in failure mitigation across various modules. The work demonstrates the practical feasibility and advantages of deploying LLM-driven autonomous systems for cloud reliability management.
Questions
Issue: STRATUS includes several specialized agents (detection, diagnosis, mitigation, undo), but it remains unclear how essential each component is.
Limitations
Same as the questions above.
Final Justification
The rebuttal squarely addresses the core concern about the necessity and design of STRATUS’s specialized agents. The authors (i) justify detection/diagnosis/mitigation/undo as first-class SRE tasks, with undo motivated by production safety. I will keep the score 4.
Formatting Issues
No
Thank you for the insightful review! We'll carefully address the review comments.
We answer the listed question.
> Q1: STRATUS includes several specialized agents (detection, diagnosis, mitigation, undo), but it remains unclear how essential each component is.
Thanks for the question. The Detection, Diagnosis, and Mitigation agents correspond to the common SRE tasks (see Ref. [12] in the submitted manuscript) and are thus essential components of an SRE system. In AIOpsLab and ITBench, those tasks are explicitly required for evaluating any agentic SRE system, except that ITBench does not evaluate detection. The Undo agent is introduced for the safety guarantee, which is the focus of the paper, and we argue that production safety must be a first-class principle of any SRE agent.
Note that our multi-agent architecture is customizable and extensible – one can add, remove, and replace agents. For example, ITBench does not need the detection agent.
If the question is about our design of the specialized agents, we did consider and experiment with a conversation-based multi-agent framework, AutoGen, to construct a team of conversable agents with the roles of Application Developer, Platform Engineer, System Administrator, and a Team Manager to aggregate opinions and issue actions. However, we found that such a design was neither effective nor efficient.
We sampled 48 problems from AIOpsLab, where the incidents occur at the application and platform layers. The role-based design solved only 10 of the 48 problems. The multi-agent conversations tend to over-communicate, and each agent has only a partial view of the entire system stack. For example, when round robin is used as the speaker-selection method, every agent speaks in each step, slowing down decision making and causing agents to repeatedly collect telemetry data to investigate/confirm other agents' observations and hypotheses. Solving even a simple detection problem, user_unregistered_mongodb-1-detection, took 893 seconds, 25X longer than Stratus (Table 8 in Appendix). We believe that multi-agent conversation is not the right design for SRE agents, despite its promise in use cases like coding, Q&A, and entertainment as shown in the AutoGen paper.
The rationales of our design are:
- Each agent has a global view of the system stack (which is a key advantage of AI agents over human engineers).
- Each agent corresponds to a concrete SRE task.
- The design enables timely decision making as SRE tasks are time sensitive.
- The state-machine based control logic facilitates safety reasoning (unlike in many use cases such as coding and writing, safety is the first principle for SRE agents).
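For illustration, the state-machine control logic described above can be sketched as an explicit transition table. This is a minimal sketch with hypothetical state and event names, not the actual Stratus implementation:

```python
# Hypothetical sketch of state-machine orchestration for SRE agents.
# States map to the specialized agents; "regressed" triggers the undo path.

TRANSITIONS = {
    "DETECT":   {"failure_found": "DIAGNOSE", "healthy": "DETECT"},
    "DIAGNOSE": {"root_cause_found": "MITIGATE", "inconclusive": "DIAGNOSE"},
    "MITIGATE": {"mitigated": "DETECT", "regressed": "UNDO"},
    "UNDO":     {"rolled_back": "MITIGATE"},  # retry with a new strategy
}

def step(state: str, event: str) -> str:
    """Advance the controller; unknown events keep the current state."""
    return TRANSITIONS.get(state, {}).get(event, state)

# Example trace: a failed mitigation is undone, then retried.
state = "DETECT"
for event in ["failure_found", "root_cause_found", "regressed", "rolled_back"]:
    state = step(state, event)
print(state)  # MITIGATE
```

The explicit transition table is what makes safety reasoning tractable: every path back to a mitigation attempt must pass through the undo state.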
I thank the authors for their responses. The response addresses most of my concerns. Therefore, I remain positive about this work.
This paper proposes Stratus, an LLM-based multi-agent system for fully autonomous Site Reliability Engineering (SRE) in cloud environments. Specifically, it presents a multi-agent architecture with specialized agents (detection, diagnosis, mitigation, undo) coordinated via a state machine, and a formal safety framework, i.e., Transactional No-Regression (TNR), to ensure mitigation actions do not degrade the system beyond the initial fault severity. Extensive experimental evaluation on SRE benchmarks (AIOpsLab/ITBench) shows 1.5–5.4× higher mitigation success rates compared to state-of-the-art baselines.
Strengths and Weaknesses
Strengths:
- This work claims to make the first effort to enable fully autonomous failure mitigation (beyond diagnosis/recommendation).
- The proposed TNR formalizes safety for agentic systems operating critical infrastructure.
- It provides rigorous TNR proof and ablation studies to validate design choices.
- The code is open-sourced, which makes it easy to reproduce the results.
- This paper is well-structured and easy to follow.
Weaknesses:
- Benchmarks (AIOpsLab/ITBench) emulate isolated faults. Absent evaluation on cascading failures (e.g., network partitions inducing state corruption) or multi-tenant edge cases.
- Diagnosis agent achieves only 34.6% RCA accuracy. Root causes (e.g., hardware vs. misconfiguration) are conflated in failure traces.
- Mitigation latency and cost hinder real-time deployment. There is no analysis of cold-start delays or agent memory footprint.
- Natural language tools may permit prompt injection attacks. Sandboxing lacks adversarial testing.
- State-machine orchestration requires manual workflow definitions. Contrasts with dynamic agent frameworks. Limits adaptability to novel failures.
Questions
- Can Stratus handle correlated faults (e.g., simultaneous node/memory failures)? If not, what TNR extensions are needed? Real-world validation?
- How would you modify TNR for environments with parallel agents? Would distributed locking suffice?
- Is it possible to replace GPT-4o with smaller LLMs for undo agent? In this way, it may provide latency/accuracy trade-offs.
- RCA failures often misattribute hardware issues to config errors. Would embedding domain knowledge improve it?
- Could agents be spawned on-demand? How would state-machine consistency be maintained?
Limitations
Yes
Final Justification
I am grateful for the authors' significant efforts in their response, which have addressed the majority of my concerns. Taking into account the overall quality of this work, I would like to maintain my rating score.
Formatting Issues
N/A
Thank you for the insightful review! We'll carefully address the review comments.
We first answer the explicitly listed questions and then respond to other concerns.
> Q1: Can Stratus handle correlated faults (e.g., simultaneous node/memory failures)? If not, what TNR extensions are needed? Real-world validation?
Stratus can handle cascading failures. Many problems in the two benchmarks are cascading failures. As the applications all use the microservice architecture, the initial fault propagates to dependent services. Mitigating such failures requires the agent to resolve the root cause rather than the dependent services along the propagation chain. Stratus can also handle correlated failures. 25 problems in the two benchmarks manifest as correlated failures, where the root cause fails multiple dependent services. Similar to cascading failures, mitigating correlated failures requires Stratus to resolve the root cause.
A more challenging case is concurrent faults that are inter-dependent. We crafted such a new problem in ITBench: a quorum system of three nodes (one leader and two followers), where we inject data corruption into the followers' data volumes and a crashing bug into the leader. A correct mitigation requires first fixing the data volumes of the followers and then restarting the leader. Unfortunately, Stratus failed to mitigate this problem as it didn't localize the follower data volumes as one of the root causes. It did restart the leader; however, due to the dependencies, restarting the leader without fixing the two followers cannot mitigate the failure (the leader cannot form the quorum). Note that TNR still applies to this problem, even though Stratus failed to solve it after many attempts.
> Q2: How would you modify TNR for environments with parallel agents? Would distributed locking suffice?
To adapt Transactional No-Regression (TNR) to environments with concurrent agents, we can evolve the single-writer lock (A-Lock) into a more fine-grained, resource-aware concurrency control system. This can be achieved by designing a Coordination Controller, which effectively functions as a distributed lock manager. The basic idea is to allow multiple mitigation agents to execute their transactions concurrently, provided their intended actions do not conflict over shared resources. This controller dynamically inspects the "mitigation plan" from each agent, identifies the specific resources the plan needs to modify, and only grants execution rights if those resources are not already locked by another active agent; otherwise, the plan aborts and its locked resources are released. This preserves TNR's safety guarantee for each concurrent transaction by ensuring that concurrent operations are properly isolated.
This model can be formalized by extending TNR.
Each concurrently executing mitigation agent works on the output of the diagnosis agent. The mitigation agent first comes up with a plan p, which is associated with a set of system resources R(p) containing the unique identifiers of all system components its write actions will mutate. The controller maintains a global set of locked resources L, which is the union of the resource sets of all currently executing plans: L = ∪ᵢ R(pᵢ). A newly submitted plan is admitted for execution if and only if its resource set is disjoint from the global lock set: R(p) ∩ L = ∅.
Upon admission, the plan's resources are atomically added to the global lock set (L ← L ∪ R(p)), effectively acquiring a distributed lock on them. The agent then proceeds to execute its transaction under the original TNR semantics, guaranteeing that the severity metric σ does not increase: σ(s) ≤ σ(s₀) for every intermediate state s of the transaction. Once the transaction completes (via commit or abort), its resources are released: L ← L \ R(p). This mechanism upholds "Writer Exclusivity" (A1) at the resource level rather than the system level, ensuring that the safety properties of TNR hold for each transaction. Note that if the transaction aborts, the undo agent is responsible for undoing all mutation actions of the plan. This strict resource isolation inherently prevents read-after-write hazards, as each agent is guaranteed a consistent view of every resource locked by its own transaction.
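As a minimal sketch of this admission protocol (illustrative only; the class and method names are our assumptions, not Stratus code):

```python
import threading

class CoordinationController:
    """Resource-aware lock manager sketch: a plan is admitted only if its
    resource set R(p) is disjoint from the global lock set L."""

    def __init__(self):
        self._locked: set[str] = set()   # global lock set L
        self._mutex = threading.Lock()   # protects L itself

    def try_admit(self, resources: set[str]) -> bool:
        with self._mutex:
            if self._locked & resources:  # R(p) ∩ L ≠ ∅ → abort the plan
                return False
            self._locked |= resources     # L ← L ∪ R(p)
            return True

    def release(self, resources: set[str]) -> None:
        with self._mutex:
            self._locked -= resources     # L ← L \ R(p) on commit or abort

ctrl = CoordinationController()
assert ctrl.try_admit({"svc/frontend", "pvc/data-0"})   # admitted
assert not ctrl.try_admit({"pvc/data-0"})               # conflict → abort
ctrl.release({"svc/frontend", "pvc/data-0"})
assert ctrl.try_admit({"pvc/data-0"})                   # admitted after release
```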
> Q3: Is it possible to replace GPT-4o with smaller LLMs for undo agent? In this way, it may provide latency/accuracy trade-offs.
Yes. A smaller model can effectively replace GPT-4o in the undo agent (the model is used for undo API calls). We tried Llama3.3-70B, and it was as effective as GPT-4o. We also tried Llama3.2-11B (an even smaller model), but the 11B model was ineffective (e.g., incorrect tool usage such as wrong API invocations and wrong parameters).
> Q4: RCA failures often misattribute hardware issues to config errors. Would embedding domain knowledge improve it?
In principle, domain knowledge is definitely helpful for enhancing RCA agents especially for known problems; for example, prior work (e.g., Ref. [20, 61] in the submission manuscript) shows that leveraging Troubleshooting Guides (TSG) can improve the effectiveness of RCA. Such domain knowledge can be encoded either through model fine-tuning or by RAG (Retrieval-Augmented Generation). Unfortunately, domain knowledge such as TSG is largely proprietary. Future work can explore using publicly available data (e.g., user manual and GitHub issues) to enhance RCA agents.
> Q5: Could agents be spawned on-demand? How would state-machine consistency be maintained?
Yes. Only the Detection agent is long-running. The Diagnosis and Mitigation agents are spawned on demand when a failure is detected. We expect these agents to have lifecycles lasting until the failure is resolved; therefore, the state machine is maintained throughout the failure duration. We don't see a benefit to more fine-grained agent lifecycles.
> Q6: Mitigation latency and cost hinder real-time deployment.
Stratus is not slower or more expensive than the other evaluated agents; the reason Stratus took longer on harder problems is that TNR enables Stratus to make multiple attempts to solve the problem, while the other agents can make only a few attempts. It is a key advantage of Stratus, not a limitation.
We have a few ideas for accelerating Stratus (inspired by your questions; thank you!): (1) caching solutions so that future resolutions of similar problems can be much faster (which becomes domain knowledge in a similar vein as Q4); (2) using smaller models (see Q3); and (3) asking the AI model to reason about the efficiency of mitigation strategies (e.g., cheapest strategy first).
> Q7: Analysis of cold-start delays or agent memory footprint.
Our agents have no cold-start delay as they directly use existing LLMs, and they don't maintain long-term memory.
We measured the runtime memory footprints of the agents when addressing a heavy problem (taking 13 steps to mitigate). The total DRAM usage of the Detection, Diagnosis, Mitigation, and Undo agents is 548.3 MB, 533.3 MB, 545.7 MB, and 534.0 MB, respectively. The agent runtime (CrewAI) and dependencies (e.g., tools) dominate the memory usage (around 530 MB). The remaining memory usage comes from prompts and runtime in-memory state.
> Q8: Natural language tools may permit prompt injection attacks. Sandboxing lacks adversarial testing.
We will expand on the discussion of safeguards against LLM errors, including sandboxed execution, access control, human-in-the-loop approval, and tamper detection.
> Q9: State-machine orchestration requires manual workflow definitions. Contrasts with dynamic agent frameworks. Limits adaptability to novel failures.
Fair point. There’s a tradeoff between creativity and the ability of safety reasoning; we prioritized the latter as safety is the first-class principle. We believe that the current state machine doesn’t impair creativity as Detection, Diagnosis, Mitigation, and Undo are essential to SRE; novel failures are addressed by creativity and autonomy of the corresponding agents.
STRATUS is a multi-agent system for autonomous Site Reliability Engineering (SRE) in cloud environments. It automates failure detection, diagnosis, and mitigation. Its main contribution is "Transactional No-Regression (TNR)," a safety specification ensuring that mitigation actions can be undone and system health never worsens. STRATUS uses specialized agents (detection, diagnosis, mitigation, undo) and tools for cloud interaction, with sandboxing and a stack-based undo for TNR. Evaluations on AIOpsLab and ITBench show STRATUS significantly outperforms existing SRE agents in mitigation success rates, primarily due to TNR's undo-and-retry mechanism, validating its potential for practical autonomous cloud reliability.
Strengths and Weaknesses
Strengths: I appreciate the author tackling a very important problem in AIOps and system reliability.
- Significance: Addresses critical, real-world SRE challenges by pursuing fully autonomous cloud reliability, moving beyond human-assisted approaches. This could transform incident response.
- Originality: Introduces and formalizes "Transactional No-Regression (TNR)," a novel safety mechanism enabling safe exploration and iterative mitigation through guaranteed undo capabilities. Its specialized multi-agent architecture for SRE is also a unique design choice.
- Quality: Backs TNR with rigorous formalization and implementation details. Demonstrates strong empirical validation, significantly outperforming baselines in mitigation success rates across multiple LLMs and benchmarks (AIOpsLab, ITBench).
- Clarity: The paper is well-structured, clearly written, and enhanced by illustrative examples and a detailed appendix, making complex concepts accessible.
Weaknesses: The overall evaluation is constrained.
- Quality - Practical Limitations:
- Imperfect Undo: Acknowledged challenges in achieving "perfect undo" for all complex, stateful, or external interactions mean some subtle degradations might persist post-rollback.
- Performance/Cost Overhead: STRATUS is significantly slower and more expensive than other agents, posing a concern for urgent incidents where rapid resolution is critical. It would be helpful to state solutions that accelerate the process.
- Benchmark Generalizability: TNR's benefits were less evident in ITBench due to a prevalence of non-persistent faults, raising questions about its effectiveness for persistent, harder-to-undo real-world issues.
- Significance - Applicability Constraints:
- No Concurrent Writers: The assumption of no concurrent writer agents significantly limits its applicability in highly dynamic, multi-actor cloud environments where simultaneous changes are common.
- Oracle/Success Definition: The success criteria, based on immediate alerts and basic health checks, might not fully capture subtle, latent, or long-term system degradations, potentially leading to a false sense of full recovery.
Questions
Section 3.1.1 lists three assumptions for TNR: Writer Exclusivity (A1), Faithful Undo (A2), and Bounded Risk Window (A3). Could you discuss a scenario where one of these assumptions might be violated in a real-world cloud system and the potential consequences for STRATUS's safety guarantees?
The paper mentions that "realizing perfect undo for all conceivable state changes in complex environments like cloud systems remains a practical challenge." What are some specific types of state changes or operations in a cloud environment that would be particularly difficult to perfectly undo, and how might STRATUS's current implementation address or be limited by these challenges?
Persistent vs. Non-Persistent Faults in ITBench: Your analysis of ITBench (Table 6) points out that "in 8 out of 18 problems, restarting the target pods clears the incident alerts" because "the injected faults do not persist." This suggests a specific characteristic of the benchmark. How would STRATUS's performance and the effectiveness of TNR differ when confronted with persistent faults (e.g., a hardware defect or a deeply embedded misconfiguration) that cannot be resolved by simple restarts?
Limitations
The authors acknowledge some technical limitations (perfect undo challenges, no concurrency control, incomplete confinement rules, ITBench's non-persistent faults). However, they don't adequately address potential negative societal impacts.
Suggestions:
- "Perfect Undo": Detail why certain changes are hard to undo and propose handling for partial/irreversible actions.
- LLM Errors: Beyond TNR, explain strategies to minimize initial LLM errors caused by non-determinism and hallucination (e.g., structured generation, fine-tuning, prompt engineering, ensemble methods, human-in-the-loop for high-risk ops).
Address other concerns:
- Security Implications: Detail safeguards against compromise and the risk of amplified attacks.
- Accountability/Liability: Who is responsible for failures caused by STRATUS?
Final Justification
The authors addressed most of my concerns; therefore, I raised the score.
Formatting Issues
The paper is well-formatted.
Thank you for the insightful review! We'll carefully address the review comments.
We first answer the explicitly listed questions and then respond to other concerns.
> Q1: Could you discuss a scenario where one of these assumptions might be violated in a real-world cloud system and the potential consequences for STRATUS's safety guarantees?
The simplest example is A2 (Faithful Undo). AI agents are imperfect. If the mitigation agent makes a wrong mitigation attempt, say deleting a large file (e.g., to clean up disk space) when the root cause is not a full disk, it needs to recover the file to restore the system state as if the wrong mitigation procedure had never been applied. However, without special handling as in Stratus, file deletion cannot be undone, which violates TNR. Basically, without TNR, the cure could be worse than the disease. TNR ensures that unsuccessful mitigation actions can be rolled back, which enables Stratus to try different mitigation strategies.
> Q2: What are some specific types of state changes or operations that would be particularly difficult to perfectly undo; how might STRATUS address or be limited by these challenges?
Modern cloud platforms (e.g., Google's Borg, Meta's Twine, Kubernetes, DBOS) all follow state-centric designs where the system states are maintained in fault-tolerant, strongly consistent data stores (e.g., etcd in Kubernetes). Operations are realized by state reconciliation, which can be done reliably through declarative, state-centric interfaces. In principle, undo can be cleanly done by reconciling the system to a recorded state. Stratus leverages such state-reconciliation principles to realize systematic undo using a stack-based mechanism.
The state-reconciliation principle requires modern cloud systems to expose a state-centric management interface (e.g., CRD in Kubernetes). Modern systems support such interfaces and thus are referred to as "cloud-native" systems. Therefore, we believe that the Stratus style of system undo is generic and practical.
However, recent research [Acto-SOSP'23, Sieve-OSDI'22] shows that implementing a declarative, state-centric interface is nontrivial because (1) it's error-prone to implement a declarative state-centric interface using common programming languages (e.g., Go, Rust, Python), which are imperative, and (2) some cloud-native systems are transformed from legacy codebases and thus inherently struggle to support arbitrary state reconciliation; this is particularly true for database systems, where on-disk data management is known to be challenging and expensive.
The aforementioned limitations are arguably orthogonal to Stratus's agent designs. Currently, such problems are mostly worked around by disallowing state reconciliations that cannot be easily undone (without system redesigns). The same practice can be applied to Stratus by only allowing it to explore operations that can be effectively undone. Going beyond, we shall explore ways to automatically analyze the side effects of a given operation and determine whether it can be undone and how costly the undo would be.
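The stack-based undo mentioned above can be sketched as follows. This is a simplified illustration of the state-reconciliation idea, where an in-memory dict stands in for a declarative state store like etcd; all names are hypothetical:

```python
class UndoStack:
    """Sketch of stack-based undo: record each mutated resource's prior
    declarative state, then undo by reconciling back in reverse order."""

    def __init__(self, cluster: dict):
        self.cluster = cluster   # resource name -> declarative state
        self._stack = []         # (resource, previous state), LIFO

    def apply_state(self, resource: str, desired: dict) -> None:
        # Push the previous state before mutating (copy to avoid aliasing).
        self._stack.append((resource, dict(self.cluster.get(resource, {}))))
        self.cluster[resource] = desired

    def undo_all(self) -> None:
        # Reconcile each resource back to its recorded state, newest first.
        while self._stack:
            resource, previous = self._stack.pop()
            self.cluster[resource] = previous

cluster = {"deploy/cart": {"replicas": 3}}
txn = UndoStack(cluster)
txn.apply_state("deploy/cart", {"replicas": 0})   # a wrong mitigation attempt
txn.undo_all()                                    # roll the transaction back
print(cluster["deploy/cart"])  # {'replicas': 3}
```

Real systems need more than this (e.g., resources created mid-transaction or irreversible side effects, as discussed above), which is exactly where the confinement of hard-to-undo operations applies.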
> Q3: How would STRATUS's performance and the effectiveness of TNR differ when confronted with persistent faults that cannot be resolved by simple restarts?
We shall clarify that only 8 out of 18 ITBench problems have non-persistent faults. The remaining 10 ITBench problems all have persistent faults and all the 13 AIOpsLab mitigation problems have persistent faults (e.g., misconfigurations and persistently failing nodes) – simple restarts cannot resolve these failures.
Stratus successfully mitigates 9 out of 13 mitigation problems in AIOpsLab (all persistent faults), showing its ability to resolve persistent faults. In terms of TNR, Stratus retried at least once in over 80% of the problems and at least five times in over 30% of the problems. In the 10 ITBench problems with persistent faults, Stratus mitigated 2 of them. Stratus retried at least 8 times per problem.
> Q4: STRATUS is significantly slower and more expensive than other agents; It would be helpful to state solutions that accelerate the process.
Stratus is not slower or more expensive than the other evaluated agents; the reason Stratus took longer on harder problems is that TNR enables Stratus to make multiple attempts to solve the problem, while the other agents can make only a few attempts. It is a key advantage of Stratus, not a limitation.
We have a few ideas for accelerating Stratus: (1) caching solutions so that future resolutions of similar problems can be much faster (which becomes domain knowledge in a similar vein as Q4 to Reviewer sLKR); (2) using smaller models (Q3 to Reviewer sLKR); and (3) asking the AI model to reason about the efficiency of mitigation strategies (e.g., cheapest strategy first).
> Q5: No Concurrent Writers: The assumption of no concurrent writer agents significantly limits its applicability in highly dynamic, multi-actor cloud environments where simultaneous changes are common.
Our current operation model is to deploy one Stratus per cloud system (each cloud system is managed in a separate namespace).
We plan to extend Stratus to multiple writers by evolving A-Lock into a fine-grained, resource-aware concurrency control system. This can be achieved by designing a Coordination Controller, which effectively functions as a distributed lock manager. The basic idea is to allow multiple mitigation agents to execute their transactions concurrently, provided their intended actions do not conflict over shared resources. This controller dynamically inspects the "mitigation plan" from each agent, identifies the specific resources the plan needs to modify, and only grants execution rights if those resources are not already locked by another active agent; otherwise, the plan aborts and its locked resources are released. This preserves TNR's safety guarantee for each concurrent transaction by ensuring that concurrent operations are properly isolated.
This model can be formalized by extending TNR.
Each mitigation agent first comes up with a plan p, which is associated with a set of resources R(p) that its write actions will mutate. The controller maintains a global set of locked resources L, which is the union of the resource sets of all currently executing plans: L = ∪ᵢ R(pᵢ). A newly submitted plan is admitted for execution if and only if its resource set is disjoint from the global lock set: R(p) ∩ L = ∅.
Upon admission, the plan's resources are atomically added to the global lock set (L ← L ∪ R(p)), effectively acquiring a distributed lock on them. The agent then proceeds to execute its transaction under the original TNR semantics, guaranteeing that the severity metric σ does not increase: σ(s) ≤ σ(s₀) for every intermediate state s of the transaction. Once the transaction completes (via commit or abort), its resources are released: L ← L \ R(p). This mechanism upholds "Writer Exclusivity" (A1) at the resource level rather than the system level, ensuring that the safety properties of TNR hold for each transaction. Note that if the transaction aborts, the undo agent is responsible for undoing all mutation actions of the plan. This strict resource isolation inherently prevents read-after-write hazards, as each agent is guaranteed a consistent view of every resource locked by its own transaction.
> Q6: Oracle/Success definition; subtle, latent degradations
We use standard observability data, which defines Service Level Agreements, as the oracle (SRE agents like Stratus target observable symptoms and user-perceived incidents). Addressing subtle, latent degradation (e.g., small performance regressions) may require enhancing system observability. One intriguing direction is agentic techniques that autonomously enhance observability for a given problem (e.g., exposing more low-level knobs and metrics).
> Q7: Detail safeguards against compromise and the risk of amplified attacks.
We’ll expand on the discussion of safeguards against LLM errors, including sandboxed execution, access control, human-in-the-loop approval, and tamper detection.
> Q8: Accountability/Liability: Who is responsible for failures caused by Stratus?
This is a deep question! Accountability requires careful thought – it doesn't resemble the current human-based DevOps/SRE accountability system. The system needs to account for the LLM developers, the agent developers, and the agent operators.
Under current industry practice:
- Agents with writer roles (which can change system states) are required to request human approval before execution.
- Agents’ action stack and tool interface are intercepted for every tool call or action pushed/popped.
- Before executing a write operation, the agent presents a concise summary of the proposed command, its intent, and relevant telemetry, enabling the human to verify it.
- The success rates of agents are monitored to decide their autonomy.
> Q9: Potential negative societal impact
We don't expect negative societal impacts. Our goal is to emphasize safety as the first-class principle of SRE agents and demonstrate one way of doing it. With the safety guarantee, system states do not degrade beyond the initial error state – Stratus/TNR enables more effective failure mitigation, which increases the availability of cloud systems (the computing infrastructures of many end-user applications) and leads to positive societal impacts. Note that all experiments are conducted in emulated environments (AIOpsLab and ITBench); the research does not reduce availability of real-world systems.
Thanks for your review again! We would like to ask if there are any comments or questions regarding our rebuttal. We want to make sure that your questions in the review have been addressed.
I sincerely appreciate the authors' effort to address the questions. The authors' clarifications are helpful, especially the plans to address the limitations mentioned, e.g., "plan to extend Stratus for multiple writers by evolving A-Lock into a fine-grained, resource-aware concurrency control system." Therefore, I raise my score.
Dear Reviewer fQgT,
Thank you for your review of this paper. We would appreciate it if you could check the authors' response to see whether they have addressed your comments.
Thank you
The authors present STRATUS, a method for Site Reliability Engineering of cloud services. The multi-agent approach presents TNR, which allows for failure mitigation, a key advantage over existing approaches. It shows improved mitigation of failures, though at the cost of longer runtime and computational overhead.
Strengths and Weaknesses
Strengths
- Clarity: The algorithm is explained in detail and TNR is proven by induction
- Quality: The experiments show strong results across different tasks.
- Significance: The work addresses a key real-world problem and shows strong promise
- Originality: TNR is an important aspect for SRE
Weaknesses
- Clarity: The authors must formally state exactly what ε is. They must also add a diagram or clear example of how the LLM is used by the agents.
- Please rearrange the figures and rewrite the captions to contain the full information rather than relying on the text, e.g., Figure 5.
- Quality: The reasons for Stratus struggling on RCA problems need to be expanded upon: the authors have expanded on limitations elsewhere in the paper, but should do so here as well.
- Would like to see an ablation of the multi-agent system, i.e., with different methods of separating the agents
- While autonomy provides obvious benefits, the authors must be careful not to downplay the importance of human intervention.
Questions
- Why does Stratus underperform on RCA tasks?
- Were any other multi-agent decompositions considered, and were any experiments run on these?
- Can the system be adapted for hybrid settings, i.e., how would you balance agent autonomy with human oversight in safety-critical environments?
Limitations
Yes
Final Justification
I commend the authors for thoroughly engaging with all reviewer feedback.
Taking into account the rebuttal and the other reviews, I am content to keep my score the same, contingent upon the authors incorporating the changes outlined in their rebuttal.
Formatting Issues
None
Thank you for the insightful review! We will carefully address the review comments, including improving the presentation (e.g., formally defining ε, adding clear examples, and improving figure captions).
We answer the questions as follows.
> Q1: Why does Stratus underperform on RCA tasks?
We looked into the results and trajectories; the low RCA success rate is largely due to the ambiguous metrics used by AIOpsLab. The way AIOpsLab measures RCA success is to match the agent’s output with pre-defined category labels; however, the category labels are often ambiguous and not mutually exclusive.
For example, for a port misconfiguration, AIOpsLab defines the root cause to be “Virtualization” (as the layer) and “Misconfiguration” (as the fault). Stratus answers the root cause to be “Application” (as the layer) and “Network/Storage Issue” (as the fault). Arguably, Stratus is more precise in describing the root cause, e.g., a port is more of an application-level issue than “Virtualization” (AIOpsLab does not have a traditional hypervisor-like virtualization layer). However, Stratus is considered wrong due to the mismatch of labels. We reached out to the AIOpsLab authors and they acknowledged this problem.
We manually went through all the Stratus RCA answers; the true success rate of Stratus (GPT-4o) is 50% (13/26), as confirmed by the AIOpsLab authors.
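To make the scoring issue concrete, below is a minimal sketch of exact-label RCA grading in the spirit of the scheme described above; the `rca_score` function and the label dictionaries are illustrative assumptions, not AIOpsLab's actual code.

```python
# Hypothetical sketch: exact string matching of (layer, fault) labels.
# Illustrates the grading scheme only; not AIOpsLab's implementation.
def rca_score(answer: dict, truth: dict) -> bool:
    """An RCA answer counts as correct only if both labels match exactly."""
    return answer["layer"] == truth["layer"] and answer["fault"] == truth["fault"]

# Ground-truth labels for the port misconfiguration example above:
truth = {"layer": "Virtualization", "fault": "Misconfiguration"}
# Stratus's arguably more precise answer:
agent = {"layer": "Application", "fault": "Network/Storage Issue"}

print(rca_score(agent, truth))  # a defensible answer is graded as wrong
```

Under exact matching, any deviation from the pre-defined label strings is scored as a failure, which is why ambiguous, non-mutually-exclusive categories deflate the reported success rate.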
> Q2: Were any other multi-agent decompositions considered and were any experiments run on these?
Yes. We considered and experimented with a conversation-based multi-agent framework, AutoGen, to construct a team of conversable agents with the roles of Application Developer, Platform Engineer, System Administrator, and a Team Manager to aggregate opinions and issue actions. However, we found that such a decomposition was neither effective nor efficient.
We sampled 48 problems from AIOpsLab, where the incidents occur at the application and platform layers. The role-based decomposition solved only 10 out of 48 problems. The multi-agent conversations tend to over-communicate, and each agent only has a partial view of the entire system stack. For example, when round robin is used as the speaker-selection method, every agent speaks in each step, slowing down decision making and causing agents to repeatedly collect telemetry data to investigate/confirm other agents’ observations and hypotheses. As another example, solving a simple detection problem, user_unregistered_mongodb-1-detection, took 893 seconds, which is 25× longer than Stratus (Table 8 in the Appendix). We believe that multi-agent conversation is not the right design for SRE agents, despite its promise in use cases like coding, Q&A, and entertainment, as shown in the AutoGen paper.
The design rationale of our decomposition is as follows:
- Each agent has a global view of the system stack (which is a key advantage of AI agents over human engineers).
- Each agent corresponds to a concrete SRE task.
- The design enables timely decision making as SRE tasks are time sensitive.
- The state-machine-based control logic facilitates safety reasoning (unlike many use cases such as coding and writing, safety is the first principle for SRE agents).
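As an illustration of this control logic, here is a minimal, hypothetical sketch of a state-machine loop that drives one agent per SRE task and routes failed mitigations through an undo-and-retry path. The `Stage` enum, the agent callables, and the retry budget are illustrative assumptions, not Stratus's implementation.

```python
from enum import Enum, auto

class Stage(Enum):
    DETECT = auto()
    DIAGNOSE = auto()
    MITIGATE = auto()
    UNDO = auto()
    DONE = auto()

def run_pipeline(agents, system, max_retries=3):
    """Drive per-task agents through a fixed state machine.

    `agents` maps each Stage to a callable taking the system handle and
    returning (ok, result). A failed mitigation routes through the undo
    agent before retrying, mimicking an undo-and-retry safety loop.
    """
    stage, retries = Stage.DETECT, 0
    while stage is not Stage.DONE:
        if stage is Stage.DETECT:
            incident_found, _ = agents[Stage.DETECT](system)
            stage = Stage.DIAGNOSE if incident_found else Stage.DONE
        elif stage is Stage.DIAGNOSE:
            agents[Stage.DIAGNOSE](system)  # produce a root-cause hypothesis
            stage = Stage.MITIGATE
        elif stage is Stage.MITIGATE:
            mitigated, _ = agents[Stage.MITIGATE](system)
            if mitigated:
                stage = Stage.DONE
            elif retries < max_retries:
                stage = Stage.UNDO          # roll back before retrying
            else:
                stage = Stage.DONE          # budget exhausted; escalate
        else:  # Stage.UNDO
            agents[Stage.UNDO](system)      # revert the failed mitigation
            retries += 1
            stage = Stage.MITIGATE
    return retries
```

The fixed transitions make safety reasoning tractable: every path from a failed mitigation back to another attempt is forced through the undo agent, so system state is restored before each retry.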
> Q3: Can the system be adapted for hybrid settings i.e. how would you balance agent autonomy with human oversight in safety-critical environments?
Thanks for the question! Full autonomy is our moonshot; in practice, human-agent interaction is very important. We will make this point clear.
Our design supports hybrid operations in the following ways:
- Agents with the writer role (which can change system states) can be required to always request human approval before execution. The agents’ action stack and tool interface are interceptable before every tool call and before every action is pushed or popped. Before executing a write operation, the system can present a concise summary of the proposed command, its intent, and relevant telemetry, enabling the human to verify it.
- Policy-based control for human-in-the-loop preferences such as:
- Always ask before executing write operations.
- Approve specific commands this time, require confirmation next time.
- Remember and always allow specific commands after approval.
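A minimal sketch of such policy-based gating, assuming a hypothetical `WriteGate` class and a human-approval callback (neither is part of Stratus's actual interface):

```python
# Hypothetical policy-gated executor for write operations.
# Policies: ask every time, allow once, or always allow after approval.
ASK, ALLOW_ONCE, ALWAYS_ALLOW, DENY = "ask", "allow_once", "always_allow", "deny"

class WriteGate:
    def __init__(self, approve):
        self.approve = approve   # callback presenting the command to a human
        self.policy = {}         # remembered per-command policy

    def execute(self, command, run):
        rule = self.policy.get(command, ASK)
        if rule == ALWAYS_ALLOW:
            return run(command)                  # previously whitelisted
        # Default: ask the human before any state-changing action.
        decision = self.approve(command)
        if decision == ALWAYS_ALLOW:
            self.policy[command] = ALWAYS_ALLOW  # remember the approval
        if decision in (ALLOW_ONCE, ALWAYS_ALLOW):
            return run(command)                  # approved this invocation
        return None                              # denied: no state change
```

An `ALLOW_ONCE` decision is deliberately not remembered, so the same command prompts for confirmation again next time, matching the second preference above.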
Certainly, how to design an effective human-agent interface is still an open question and we plan to explore hybrid designs as future work.
Thanks for your review again! We would like to ask if there are any comments or questions regarding our rebuttal. We want to make sure that your questions in the review have been addressed.
I commend the authors for thoroughly engaging with all reviewer feedback. I believe that the discussion on underperformance on RCA tasks is essential, should the paper be accepted.
Taking into account the rebuttal and the other reviews, I am content to keep my score the same, contingent upon the authors incorporating the changes outlined in their rebuttal.
Dear Reviewer Q1w9,
Thank you for your review of this paper. We would appreciate it if you could check the authors' response to see whether they have addressed your comments.
Thank you
The paper introduces STRATUS, an LLM-based multi-agent system designed for autonomous Site Reliability Engineering (SRE) in cloud services. The system uses multiple specialized agents and a key safety specification called Transactional No-Regression (TNR), which allows for safe, iterative failure mitigation with an undo-and-retry mechanism. Experimental results on the AIOpsLab and ITBench benchmark suites demonstrate that STRATUS significantly outperforms state-of-the-art SRE agents, improving the failure mitigation success rate by at least 1.5 times. The reviewers’ concerns were adequately addressed by the authors during the rebuttal period. Both the reviewers and I found the paper to have strong potential impact for applications, with solid theoretical foundations in SRE and the broader machine learning community.