PaperHub
5.5 / 10
Poster · 4 reviewers
Ratings: 3, 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

A scalable, efficient, and fast open-source framework to test multimodal AI agents on PC within a full Windows OS environment

Abstract

Large language models (LLMs) show potential as computer agents, enhancing productivity and software accessibility in multi-modal tasks. However, measuring agent performance in sufficiently realistic and complex environments becomes increasingly challenging as: (i) most benchmarks are limited to specific modalities/domains (e.g., text-only, web navigation, Q&A) and (ii) full benchmark evaluations are slow (on the order of multiple hours or days) given the multi-step sequential nature of tasks. To address these challenges, we introduce Windows Agent Arena: a general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real OS to use the same applications and tools available to human users when performing tasks. We create 150+ diverse tasks across representative domains that require agentic abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized for a full benchmark evaluation in as little as 20 minutes. Our work not only speeds up the development and evaluation cycle of multi-modal agents, but also highlights and analyzes existing shortfalls in the agentic abilities of several multimodal LLMs as agents within the Windows computing environment, with the best achieving only a 19.5% success rate compared to a human success rate of 74.5%.
Keywords
agents, benchmark, computer agents, AI agents, multimodal agents, large language models

Reviews and Discussion

Official Review
Rating: 3

This paper introduces a new agent environment, WINDOWS AGENT ARENA, for testing agents' capabilities within the operating system (exclusively focusing on Windows OS). The authors claim that WINDOWS AGENT ARENA is fully scalable and parallelizable, which enables rapid evaluations compared to other similar benchmarks. The experiments show that current LLM-based agents are still far from human-level performance, and WINDOWS AGENT ARENA serves as a valuable testbed for future work in testing agent systems on Windows OS.

Questions for Authors

Please refer to the weaknesses.

Claims and Evidence

The claims made in the submission are supported by the experiments.

Methods and Evaluation Criteria

The evaluation criterion adopted in this paper is the success rate, which is reasonable for assessing agents’ performance in the OS environment and evaluating task completion.

Theoretical Claims

Considering this is an application-driven benchmark paper, there is no need to verify any theoretical claims.

Experimental Design and Analysis

As a benchmark paper, I have some concerns about the comprehensiveness of the experiments. Specifically:

  1. The paper provides only global statistics on the success rates of different agents' performance across several task types. However, it remains unclear what specific challenges the proposed benchmark presents, as well as the underlying reasons for the poor performance of current agents. This lack of detail does not offer further insights for future work.

  2. It would be beneficial to test and discuss additional agent frameworks. As the authors mention, the paper primarily relies on Chain-of-Thought reasoning. However, other widely-used agent frameworks, such as ReAct and Reflexion, should also be incorporated and explored.

Supplementary Material

I have reviewed the appendices referenced in the main text.

Relation to Prior Work

This paper primarily focuses on Windows OS as the environment for testing agents’ performance, addressing a gap in the current research community. Its scalability and parallelizability enable rapid evaluations compared to other similar benchmarks, making it a significant contribution.

Missing Important References

No, the references appear to be sufficient.

Other Strengths and Weaknesses

Weaknesses

  1. Some tasks overlap with those from OS World, as they are adopted from it. It would be beneficial to include more Windows OS-specific tasks, such as system settings changes or other Windows-specific activities, to make the work more applicable and realistic.

  2. Given that Windows OS is fully closed-source, how can the authors ensure that future works will have sufficient access to modify settings in order to improve and adjust their methodologies? If not, the impact of this benchmark could be problematic, as the improvement strategies would be limited to a narrow scope.

Other Comments or Suggestions

No.

Author Response

Thank you for your comments/feedback. Please see our answers below.

RE: Reasons for poor agentic performance and further insights for future work

  • Due to character limits in our reply, we refer to our responses to Reviewers 88jp and sjR1 where we detail reasons for agents’ general/common failures and by task domain/category, respectively. We plan on including these, along with further insights (as noted in our response to sjR1), in our revisions.

RE: Other frameworks (ReAct/Reflexion)

  • We designed our agent to align with ReAct principles and will clarify this better in our paper; a minimal structural sketch of the loop appears after this list. We already (Appendix D):
    • Explicitly instruct the agent to perform structured multi-step planning (reasoning)
    • Ask it to describe why it chose a certain action (rationale)
    • Require it to perform a single action per step, wait for the screen to update (observe), then iterate again.
    • Explicitly instruct agent to "Verify at each step whether you're on track."
  • Due to space constraints, we list a subset of Reflexion results, one of which comes close to best performance (19.5%):

| inputs | model | Office | Web Browser | Windows System | Coding | Media & Video | Windows Utils | Total |
|--------|-------|--------|-------------|----------------|--------|---------------|---------------|-------|
| a11y | gpt-4o | 0% | 9.4% | 30.4% | 0% | 0% | 8.3% | 7.2% |
| a11y + omniparser | gpt-4o | 0% | 18.9% | 28.3% | 25% | 16.7% | 0% | 14.3% |
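
To make the ReAct-style loop described above concrete, here is a minimal structural sketch (one action per step, with memory of prior steps); all helper functions are hypothetical placeholders, not the benchmark's actual API:

```python
# Minimal structural sketch of the ReAct-style loop described above
# (observe -> reason -> act, exactly one action per step). The helper
# functions are hypothetical placeholders, not the benchmark's actual API.

def take_screenshot():        # placeholder: grab the current screen image
    return b""

def query_a11y_tree():        # placeholder: dump the UIA accessibility tree
    return {}

def llm_plan_step(**kwargs):  # placeholder: one LLM call -> (rationale, action)
    return "reasoning...", "DONE"

def execute_action(action):   # placeholder: run a mouse/keyboard/OS command
    return "screen updated"

MAX_STEPS = 20

def run_episode(task_instruction: str) -> bool:
    history = []  # memory of prior (rationale, action, observation) triples
    for _ in range(MAX_STEPS):
        screenshot = take_screenshot()        # observe
        a11y_tree = query_a11y_tree()
        rationale, action = llm_plan_step(    # reason: rationale + single next action
            instruction=task_instruction,
            screenshot=screenshot,
            a11y_tree=a11y_tree,
            history=history,
        )
        if action == "DONE":                  # agent signals task completion
            return True
        observation = execute_action(action)  # act, then wait for screen update
        history.append((rationale, action, observation))
    return False  # ran out of steps

print(run_episode("Open Notepad and type hello"))
```

The key point is that exactly one action is executed per iteration before the agent re-observes the screen and verifies whether it is still on track.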

RE: More Windows-specific tasks/activities

  • To clarify, of our 150+ tasks, a considerable portion (~33%) indeed focuses on Windows-exclusive aspects like its system settings and utilities/tools.
  • We also modified and tailored the adapted tasks to specifically target/test aspects unique to Windows (Windows-specific UI conventions, interaction styles, native application workflows and context menus, ribbon interfaces, dialog windows, and keyboard shortcuts), thereby ensuring tasks are performed in a natural, Windows-specific way. As such, our adapted/inspired tasks exclusively leverage Windows’ APIs, UI elements (e.g., Windows File Explorer, Taskbar, native menus), and common user actions on Windows (e.g., drag-and-drop interactions, native clipboard behaviors).
  • Many computer activities/workflows are not specific to a single OS—which is why we substantially modify tasks to reflect realistic and common Windows workloads done in a Windows-specific/native way.
  • To illustrate, for a "creating a pivot table from spreadsheet data" task, differences in native OS elements, distinct UI ribbons, iconography, etc. create different visual/interactive elements and different ways of performing the same task depending on the OS; as a result, agent performance—including visual perception/reasoning, planning, action—can differ considerably.
    • Windows:
      • Default icon theme ("Colibre") matches Microsoft's Fluent UI style and the ribbon-style toolbar (resembling Microsoft Excel’s ribbon).
      • Steps for creating a pivot table involve selecting data, accessing "Insert" > "Pivot Table..." through a ribbon-like toolbar prominently positioned at the top without the need to navigate to anything.
      • Dialogs for selecting ranges/placement within sheets use native Windows UI elements (Windows-specific dialog windows and controls)
    • Linux (e.g., Ubuntu with GNOME):
      • Default icon theme ("Breeze" or "Elementary") follows Linux's typical flat design conventions which diverge significantly from Windows’ UI; ribbon-style toolbars are not the default and a traditional toolbar with menu-driven interface is used instead.
      • Creating a pivot table requires navigating via a menu ("Insert" > … > "Pivot Table..."), involving GTK-styled pop-ups/dialogs that differ visually/structurally from Windows.
      • Native Linux dialogs, influenced by GNOME/KDE, display different visual and interactive characteristics (e.g., GTK-based dialog boxes) than Windows, altering how an agent parses visual cues and UI elements, plans, and acts on Windows vs. other OSes.
  • Lastly, we faced limitations in our task design due to the paywalled/closed-source nature of Microsoft programs (e.g., Office 365); despite our efforts to make these accessible, we used popular open-source counterparts instead.

RE: Ensuring access to Windows

  • We provide a way to use a free evaluation copy of Windows for benchmark deployment, which can be easily and continuously renewed thereafter. Our benchmark also allows users to easily install their own programs, add/configure new tasks, etc. By default, it relies on widely maintained open-source applications, so anyone can modify these as needed.
  • Even if Windows changes system settings in future releases, the provisioning setup for the VM image and Docker can be easily updated to work with new OS versions. Additionally, our method for accessing screen information uses the accessibility (a11y) tree from Win32 apps that is consistently maintained.
Reviewer Comment

Thanks for your elaboration and complementary experiments. I have no further questions and will maintain my score.

Author Comment

We thank the reviewer again for the thoughtful feedback as well as taking the time to evaluate our work. Please let us know if there are any further questions/feedback!

Official Review
Rating: 3

This paper introduces Windows Agent Arena, a benchmark for evaluating multimodal operating system agents. The benchmark contains over 150 diverse tasks on the Windows OS platform and leverages the Azure cloud environment for parallel evaluation.

Questions for Authors

Does this benchmark provide desired plan steps for each task, such as the trajectory needed to successfully complete one task?

With humans achieving a 74.5% success rate, what are the common failure modes in this dataset? How many participants were involved in this test setting?

In Table 2, I don't see any functionality for audio files, such as listening to audio. Does the benchmark support this capability?

Claims and Evidence

The benchmark supports multiple modalities and domains across the Windows operating system. Leveraging the Azure cloud environment, it enables rapid, parallel evaluations. Additionally, the authors provide unrestricted access to the Windows OS testing environment.

Methods and Evaluation Criteria

The evaluation is comprehensive, covering various closed-source and open-source models. The limited performance of existing models demonstrates the benchmark's potential and difficulty.

However, the baseline methods exclude agent systems like Camel or smolagent, instead focusing only on several MLLMs. The evaluation methodology for these models on the benchmark remains unclear. Additionally, while Appendix D mentions an agent system called Navi, its absence from the main paper creates confusion.

Theoretical Claims

There are no theoretical claims in this paper.

Experimental Design and Analysis

Please check the Method section above.

Supplementary Material

Yes, I have read through the supplementary material.

Relation to Prior Work

This paper can be seen as an extension of OSWorld to the Windows OS. More tasks are desired for a comprehensive evaluation. Please see the Weakness below.

Missing Important References

This paper has discussed the essential references.

Other Strengths and Weaknesses

A major weakness is the limited number of tasks. While this benchmark is important for the agent community given Windows OS's widespread global usage, it would benefit from additional tasks to enable more comprehensive evaluation of agent systems. Furthermore, the evaluation methodology could be more granular. Beyond measuring final success rates, analyzing intermediate process steps would provide valuable insights into why and how tasks fail.

Other Comments or Suggestions

No other comments or suggestions

Author Response

We thank the reviewer for the insightful feedback; we address each point below.

RE: Agent systems like Camel/smolagent & evaluation for these model systems

  • Our focus was on single agent systems, particularly state-of-the-art/popular multi-modal LLMs commonly used as agent/reasoning backbones.
  • However, multi-agent systems like Camel are also able to run on our benchmark as is. We believe this to be an important direction and plan on future work for multi-agent evaluations.
  • Thank you for bringing smolagents to our attention; we had not seen it, as it was released around our submission (~late Dec 2024). smolagents is more of an agent wrapper, but it has elements that strongly resemble our agent already (e.g., memory, a ReAct-like loop, writing Pythonic code as actions, and tools/functions).

RE: Appendix D mentions Navi

  • Thank you for pointing this out. This was an artifact from earlier experiments and has been removed along with similar mistakes.

RE: Limited number of tasks

  • We prioritized creating fewer tasks, each representing distinct but realistic skills/workflows, to avoid trivial task variations and better control for (and understand) agentic failures on these categories; we also make it easy for users to create their own tasks.
  • Nonetheless, we still have >3x as many Windows tasks (>150) as the next closest (~40 tasks in OSWorld).

RE: More granular evaluation and intermediate steps on task failures

  • We agree that this is important. Due to character limits, we refer to our responses to Reviewer sjR1 and 88jp detailing general agent failures and task-domain specific ones, respectively. We will include the full set of failures and analysis in our revised paper.

RE: Does this benchmark provide desired plan steps/trajectory to complete task

  • No, given the task complexity and open-endedness of the computing environment, there are multiple ways to accomplish any single task so many valid trajectories can exist.
  • e.g., even a simple task like creating a new file and saving it to a location can be completed in multiple ways. The agent can: navigate to the location via File Explorer and create the file (creation itself can happen in multiple ways: keyboard shortcut, right-clicking the context menu, the Explorer create-file button, etc.), or it can create the file via PowerShell, or it can open a file in a text/code editor and then save it, etc. (two such trajectories are sketched after this list).
  • A desired/preset trajectory can also introduce potential bias; we use number of steps to track trajectory quality but allow the agent freedom otherwise.
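
To make this concrete, below is a hypothetical sketch of two equally valid trajectories for the same file-creation task; the action names are invented for illustration and are not the benchmark's action vocabulary:

```python
# Hypothetical illustration of two equally valid trajectories for the same
# task ("create notes.txt in Documents"). Action names are invented for
# illustration only; they are not the benchmark's action vocabulary.
trajectory_gui = [
    'open_app("File Explorer")',
    'navigate("C:/Users/user/Documents")',
    'context_menu("New > Text Document")',
    'type_text("notes.txt")',
    'press_key("enter")',
]

trajectory_shell = [
    'open_app("PowerShell")',
    'type_text("New-Item -Path $HOME/Documents/notes.txt -ItemType File")',
    'press_key("enter")',
]

# Both end states satisfy the same final-state check (the file exists), so a
# single "desired" trajectory would unfairly penalize one valid strategy.
for name, traj in [("GUI", trajectory_gui), ("PowerShell", trajectory_shell)]:
    print(f"{name}: {len(traj)} steps")
```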

RE: With a 74.5% human success rate, what are common failure modes?

  • Among human evals, the most common failure cases were: (1) the inability to find the “correct” next step to progress (e.g., lack of knowledge/familiarity of how to perform certain functions within a program), (2) misinterpreting the task instruction to mean something else, and (3) carelessness and human error. Our human success rate appears similar/comparable to those on other benchmarks (e.g., 70-80% for OSWorld, AndroidWorld, etc.)
    • Example of (1): the user could not figure out how to convert an MP4 to MP3 in VLC player (unable to find the relevant settings after several minutes), resulting in the user giving up and marking the task as failed.
    • Example of (2): On a task setting Chrome to automatically delete all on-device site data, the participant tries to configure deleting history and cookies, but “on-device site data” actually refers to a separate/different concept under “privacy and security.”
    • Example of (3): on a task that asks to change the first letter of each word in a document to uppercase, the participant forgets to capitalize 2-3 words.
    • We will include the full set of details/examples in the revised manuscript.

RE: Number of participants

  • One participant, a casual Windows user, performed the tasks without any human or digital assistance (i.e., no internet); statistics and details are reported in Appendix B. There were originally 3 participants, but of the other 2, one attempted fewer than a quarter of the tasks and the other even fewer.

RE: Audio capability

  • Yes, our benchmark does support audio as it was designed to be flexible in integrating additional modalities. We describe an implementation (our intended way) below.
  • One way is to have the user provide audio recordings (e.g., a WAV file) of the spoken task instructions along with an audio-capable agent (e.g., giving a VLLM an external tool like speech-to-text transcription, or using an agent with inherent audio understanding). The agent would need to understand the task via audio and match the STT transcription to the actual task JSON/instruction. Beyond that, nothing else would change and the benchmark would run as normal (a hypothetical sketch of this flow follows this list).
  • Other ways are also possible so long as the corresponding task JSONs (see Appendix A.5 for details) are properly defined; these JSONs are designed to be flexible and easily customizable.
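
As a hedged sketch of the first approach (the transcribe() helper, the similarity matching, and the "instruction" JSON field are assumptions for illustration; see Appendix A.5 for the actual task JSON schema):

```python
# Hypothetical sketch of the audio flow described above: transcribe a spoken
# instruction, then match it to an existing task definition. transcribe() and
# the "instruction" field name are assumptions made for illustration.
import json
from difflib import SequenceMatcher

def transcribe(wav_path: str) -> str:
    # Placeholder for whatever speech-to-text tool the agent has access to.
    return "Create a new spreadsheet and save it to the Desktop"

def match_task(transcript: str, task_json_paths: list[str]) -> str:
    """Return the task JSON whose instruction text best matches the transcript."""
    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    best_path, best_score = "", 0.0
    for path in task_json_paths:
        with open(path) as f:
            task = json.load(f)
        score = similarity(transcript, task.get("instruction", ""))
        if score > best_score:
            best_path, best_score = path, score
    return best_path

# From here, the matched task would run through the benchmark as usual.
```
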
Reviewer Comment

Thanks for the response and most of my concerns are addressed and I will raise my score by 1.

Author Comment

Thank you again for your time, questions, and feedback. If there are any further questions, comments, or concerns, please let us know!

Official Review
Rating: 3

The paper introduces Windows Agent Arena, a benchmark for testing multi-modal AI agents on real Windows OS tasks. It provides 154 tasks across everyday applications (office, web browsing, coding, etc.) and uses a scalable, parallel evaluation setup (e.g., using Azure) so that tests finish in about 20 minutes. The work is practical and motivated by the need for a realistic, open evaluation environment for Windows agents.

Questions for Authors

  1. How exactly is the Windows environment provided to researchers (pre-built images, trial licenses, etc.)?
  2. What are the main reasons for agent failures (vision, planning, or action execution)?

Claims and Evidence

Yes.

  • Claims: The paper claims that current AI agents (including models like GPT-4V) are far below human performance (only ~19.5% success vs. 74.5% human).
  • Evidence: They back this up with extensive experimental results and comparisons to other benchmarks. The evidence is solid, though some details (like the “free access” to Windows) could be clearer.

Methods and Evaluation Criteria

Yes.

  • Environment: Defines a POMDP where agents see screenshots and accessibility trees, then act via mouse, keyboard, and OS commands.
  • Evaluation: Uses binary rewards (success/fail) with automated checks on the final state. This is a reasonable choice for a practical benchmark.

Theoretical Claims

No new theoretical proofs are presented. The paper uses standard formulations (POMDP) to describe the environment.

Experimental Design and Analysis

Yes.

  • The experiments test several state-of-the-art models and different input parsing methods.
  • Ablations show that combining UI accessibility data with pixel-based detectors improves performance.
  • Overall, the design is solid and shows that the benchmark is challenging.

Supplementary Material

Yes.

The reviewer has checked the supplementary material and Appendix in the PDF.

Relation to Prior Work

The paper builds on and extends ideas from previous benchmarks like OSWorld, WebArena, and AndroidWorld. It fills an important gap by focusing on the Windows OS.

Missing Important References

The reviewer thinks there is no additional related work that needs to be discussed.

Other Strengths and Weaknesses

Strengths:

  1. Practical, scalable, and fills a clear gap in current research.
  2. The task set is diverse and realistic.

Weaknesses:

  1. Current agent performance is very low, and some setup details (e.g., Windows licensing, reproducibility) could be clearer.

Other Comments or Suggestions

No further suggestions or comments.

Author Response

Thank you for your questions/feedback! We address your points in turn below.

RE: Low agent performance

  • Yes, performance is low in absolute terms; however, in relative terms, it is largely in line with trends on other comparable benchmarks (e.g., best success rates of ~10-20% on OSWorld, Visual Web Arena, etc.). The gap we observe is consistent with the low performance of current multimodal LLM agents on real-world computer tasks, along with some known problems of current agents (e.g., imperfect visual grounding/reasoning).

RE: Benchmark setup and how the Windows environment is provided

  • We cover this in Sec. 3.3 (pgs 5-6), but we lay out more details below which we will include in the paper for further clarity.
  • The Windows environment is set up using a two-step approach (a hypothetical orchestration sketch follows this list):
    • First, users download a Windows 11 Enterprise Evaluation ISO (a free 90-day trial provided by Microsoft) and use our automated provisioning scripts to build what we call a "golden image." This image is a fully functional Windows 11 VM pre-configured with all the necessary software and settings for running our benchmark.
    • Second, once the golden image is created, it is integrated with Docker to streamline deployment and testing. This allows researchers to run the environment locally (via WSL/Linux) or on Azure for parallelization.
  • In short, the environment is provided as a pre-built, reproducible VM image (derived from a trial license) coupled with Docker containerization, ensuring a consistent and ready-to-use setup for computer-use research. As a result, our benchmark is one of the few (if not the only) that provides users open access to the Windows OS and computing environment for agent research and development. Our benchmark also allows users to install their own programs/applications, add new tasks for their own needs, etc.
  • This way, despite not being an open-source OS, users can access Windows via our benchmark as a self-contained docker image; users can then utilize a free evaluation copy for benchmark deployment that can be easily and continuously renewed thereafter.
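
For illustration, a hypothetical orchestration of this two-step flow might look as follows (the provisioning script name and image tag are placeholders, not the repository's actual entry points):

```python
# Hypothetical orchestration of the two-step setup described above:
# (1) build a pre-configured "golden image" VM from the free Windows 11
# Enterprise Evaluation ISO, then (2) wrap it with Docker so runs are
# reproducible locally (WSL/Linux) or on Azure. The script name and image
# tag below are placeholders, not the repository's actual entry points.
import subprocess

def build_golden_image(iso_path: str) -> None:
    # Placeholder provisioning step: installs applications and applies the
    # benchmark's settings on top of the evaluation ISO.
    subprocess.run(["python", "provision_golden_image.py", "--iso", iso_path], check=True)

def build_container(image_tag: str) -> None:
    # Package the prepared VM image into a Docker image for deployment.
    subprocess.run(["docker", "build", "-t", image_tag, "."], check=True)

def run_benchmark(image_tag: str) -> None:
    # Each container runs one worker; launching several containers (locally
    # or on Azure) is what enables the parallel, ~20-minute evaluations.
    subprocess.run(["docker", "run", "--rm", image_tag], check=True)

if __name__ == "__main__":
    build_golden_image("Win11_Enterprise_Eval.iso")
    build_container("winarena:dev")
    run_benchmark("winarena:dev")
```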

RE: main reasons for agent failures

  • We provide a more detailed list below which we will include in the revised paper:
  1. Vision-related failures. In our experience, agents:

    • Fail to recognize if the previous action did anything, especially when there's no image modality and the screen state must be inferred from the accessibility tree
    • Over-trust captions from omniparser; when the labels are wrong, the agent is misled into clicking incorrect UI elements.
    • Hallucinate successful actions (i.e., clicking the wrong element, then hallucinating that the correct dialog/next state was reached even if the ground-truth computer state did not actually change).
    • Click the wrong element with a similar bounding box ID, especially when the screen has dense bounding boxes, causing the agent to click incorrect but visually adjacent UI elements.
    • Be “blind” to popups, leaving it unable to exit out
  2. Planning-related failures. The agent can:

    • Fail to follow the output structure resulting in the actions failing to be parsed, causing downstream parsing and execution to fail entirely.
    • Repeat actions, such as clicking a mislabeled icon multiple times
    • Take the instruction too literally, and scroll/search for non-existent UI elements even if the correct UI element is already on the screen. Misinterpretation of instructions or overly literal plan formation causes unnecessary and incorrect actions.
    • Run out of steps early (hitting the max step counter) due to some of the issues above
    • Complete the task, but not output "DONE" (forgetting to signal task termination), failing to correctly recognize successful task completion or termination conditions.
  3. Action execution-related failures. The agent can:

    • Often forget to press the "enter" key (for example, when typing a URL into the address bar) to confirm an action
    • Attempt to scroll without hovering the cursor over the scrollable area, resulting in multiple scroll attempts with no change
    • Select semantically similar but incorrect element (e.g., selecting a button labeled "Start" when intending another "Start" button or selecting an item from a "Size" column instead of the header) due to ambiguous element identification.
    • Try to output absolute pixel coordinates to select cells in office apps, incorrectly relying on absolute positioning rather than omniparser's output (e.g., element IDs).
    • Close a secondary window, incorrectly assume it’s still open, then attempt to click UI elements that no longer exist.
Reviewer Comment

The reviewer has read the rebuttal carefully. The reviewer has no more questions and will maintain the score.

Official Review
Rating: 3

This paper introduces Windows Agent Arena, a benchmark environment specifically designed for evaluating multi-modal AI agents within the Windows operating system. The authors develop a suite of 154 diverse tasks across various applications (Office, Web Browsing, Windows System, Coding, Media & Video, and Windows Utilities) that require planning, screen understanding, and tool usage capabilities. This benchmark also introduces scalable infrastructure, which allows for parallel evaluation in as little as 20 minutes compared to days for sequential evaluation. The authors evaluate several state-of-the-art multi-modal LLMs as agent backbones, with the best configuration achieving only a 19.5% success rate compared to human performance of 74.5%, highlighting significant room for improvement in this domain.

Questions for Authors

  • The gap between human and agent performance is substantial (74.5% vs. 19.5%). Based on your analyses, which specific capabilities would need to be improved first to make the most significant progress in closing this gap?
  • You mention that UIA markers can take "from a few seconds up to several minutes to be queried depending on screen complexity." Could you elaborate on how this impacts the practical utility of UIA-based approaches and whether there are ways to optimize this process?
  • The paper identifies visual-language misalignment as a common cause of failures. Have you considered pre-training or fine-tuning approaches specifically designed to improve this alignment for Windows OS interfaces?
  • How sensitive is agent performance to the specific formulation of prompts? Did you experiment with different prompt designs, and if so, how much variation in performance did you observe?

Claims and Evidence

The primary claims about the benchmark's value, scalability, and performance findings are well-supported by evidence. The parallel execution advantage is clearly demonstrated with timing data. The extensive evaluation of various agent configurations (20+ variants) provides compelling evidence for the benchmark's utility in comparing different approaches. The gap between the best agent performance (19.5%) and human performance (74.5%) convincingly demonstrates the current limitations of multi-modal agents in Windows environments.

Methods and Evaluation Criteria

The methods proposed for agent evaluation are appropriate and well-designed. The authors formalize agent behavior as a partially observable Markov decision process (POMDP) with clearly defined observation and action spaces. The evaluation based on device state after agent execution is a sound approach for determining task success. The benchmark includes a good variety of tasks across different applications, representing realistic user workflows in Windows OS.
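
For concreteness, a minimal illustration of the observation/action interface and the binary, final-state reward described here (all names are hypothetical placeholders rather than the benchmark's actual API):

```python
# Minimal illustration of the POMDP-style interface and binary final-state
# reward described above. All class and function names are hypothetical
# placeholders for illustration, not the benchmark's actual API.
import os
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes  # raw screen capture
    a11y_tree: dict    # UIA accessibility tree snapshot

@dataclass
class Action:
    kind: str          # e.g., "click", "type", "hotkey", "shell"
    args: dict

def final_state_reward() -> int:
    """Binary reward computed from the device state after the episode ends.

    A real task would run one or more automated checks (file contents,
    application settings, registry keys, ...); here we check a single
    hypothetical output file.
    """
    expected_file = r"C:\Users\user\Documents\report.pdf"  # hypothetical target
    return int(os.path.exists(expected_file))

# reward is 1 only if the automated check on the final state passes
print("reward:", final_state_reward())
```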

Theoretical Claims

N/A

Experimental Design and Analysis

The experimental design for evaluating different agent configurations is comprehensive and well-executed. The ablation studies comparing different visual parsing methods and model backbones provide valuable insights. The analysis of failure cases correctly identifies key challenges for current agents, particularly in visual-language alignment and precise Set-of-Marks identification. The Azure parallelization approach for benchmark evaluation is creative and effective.

Supplementary Material

N/A

Relation to Prior Work

The paper positions itself well within the growing body of research on agent benchmarks. The authors provide a comprehensive overview of related work, including other benchmarks such as OSWorld, AndroidWorld, WebArena, and GAIA. They clearly articulate how Windows Agent Arena addresses limitations in existing benchmarks, particularly for Windows OS evaluation and benchmark scalability.

Missing Important References

N/A

Other Strengths and Weaknesses

Strengths:

  • Parallelizable evaluation: The Azure-based parallel evaluation infrastructure dramatically reduces evaluation time compared to serial execution.
  • Comprehensive ablation studies: The extensive evaluation of different model and input parsing configurations provides valuable insights into the factors affecting agent performance.
  • Practical tasks: The benchmark includes realistic tasks that actual Windows users would perform, increasing its practical relevance.
  • Open-source contributions: The authors provide infrastructure and tools to make agent research more accessible to the community.

Weaknesses:

  • Limited discussion of optimizing agent prompts: There is limited analysis of how prompt design impacts performance in this paper.
  • Incomplete explanation of the gap between human and agent performance: More analysis could be provided on which specific capabilities need to be improved to close this gap.
  • Dependency on commercial cloud infrastructure for parallelization: The benchmark's scalability advantage depends on Azure, which may limit accessibility for some researchers.

Other Comments or Suggestions

  • The paper would benefit from a deeper analysis of the performance disparities across different task domains (e.g., why agents perform better on Web Browsing and Windows System tasks compared to Office tasks).
  • It would be helpful to see more detailed failure analysis for specific task categories, rather than just general failure patterns.
  • Some discussion of computational costs associated with the parallelization approach would help researchers better understand resource requirements.
Author Response

We thank the reviewer for the thoughtful comments/questions. We address each point in turn below.

RE: deeper analysis of performance disparities and failures across different task domains

  • Higher-quality accessibility trees on Chromium web browsers, less UI/element “clutter”, and cleaner UI ribbons/interfaces on Windows system and web-browser screens (i.e., a less dense concentration of icons/buttons and sparser/cleaner separation between icons) make it easier for the agent to visually parse and identify the screen accurately. LibreOffice programs have denser UI icons/elements on screen (e.g., more words, cells, icons), resulting in much lower performance.

  • We list some common failures by some task categories below and will include a full, more granular set in our revised manuscript (due to character limits here).

    • Office: the dense UI ribbons of LibreOffice result in Set-of-Marks not bounding every icon, overlapping bounding boxes, etc., limiting the agent's screen understanding
      • In Calc, the agent can’t sufficiently capture all spreadsheet cells/icons visually, limiting task understanding and available info. E.g., failing to properly set cell formats due to misinterpreting formatting menus (e.g., 12345 becoming 0012345)
      • In Writer, misinterpreting words' positions due to incorrect bounding box detection for UI elements, resulting in text not properly splitting for text alignment tasks. Failing to accurately localize sentences for certain tasks due to insufficiently granular visual parsing, causing incorrect highlighting or missed annotations
    • Web
      • Difficulty with UI slider controls and, at times, scrolling dropdown menus
      • Inability to close or navigate out of pop-ups despite attempts to do so (sometimes the way to exit a pop-up is not made visually obvious)
      • Failing to visually recognize numeric cues on webpages, causing navigation to incorrect sites (e.g., "find the most popular banter discussion thread" task)
      • When setting a webpage as homepage, misinterpreting the config screen due to visually similar options (like "Startup pages" vs. "Home button")
    • System
      • With File Explorer, compressing files using 7-Zip can result in the agent failing to correctly interact with 7-Zip's interface due to visually misinterpreting fields, causing failed compression
    • Utilities
      • When using calculator to calculate differences between dates, the agent can fail to input the correct dates (correct planning but incorrect action mapping due to set-of-marks not recognizing the keys/digits)
      • When counting word instances in a text file, the agent can fail to correctly parse the UI’s displayed count, resulting in incorrect counts

RE: computational costs and resource requirements of parallelization

  • We provide this info in Appendix A.6 (pg 20) describing resources, time, and cost of parallelization.

RE: specific capabilities to close this gap

  • One is to extend visual/screen understanding to text and tabular data (in addition to set-of-marks around icons/UI elements). This would help agents on tasks better represented as text (e.g., notepad, command line/terminal) or a mix of both UI icons and text (e.g., spreadsheet and word processor programs).
  • Another is better self-verification: the agent sometimes believes its action changed the screen state even when it doesn't take effect (even with screen feedback). Unfortunately, instructing the agent to verify/check does not fully resolve this either.

RE: impact of query time on UIA utility and optimizations

  • Since the UIA tree API wasn’t designed for high-throughput queries and supplies the tree piece-by-piece, extracting a full snapshot results in significant latency. The problem only worsens with more applications/windows open.
  • Possible optimizations include: (1) focusing only on the “active” window’s tree (though one would need to re-query the tree when switching programs), or (2) caching the tree for each program/app ahead of time, which would relieve latency but requires additional overhead. A sketch of option (1) follows this list.
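
A hedged sketch of optimization (1), assuming pywinauto's UIA backend (this illustrates the idea and is not the benchmark's implementation):

```python
# Hypothetical sketch of optimization (1): query only the foreground window's
# UIA subtree instead of the full desktop tree. Assumes pywinauto's "uia"
# backend; this illustrates the idea and is not the benchmark's code.
import ctypes

from pywinauto import Desktop

def active_window_a11y(max_elements: int = 500) -> list[dict]:
    # Win32 call: handle of the window currently in the foreground.
    hwnd = ctypes.windll.user32.GetForegroundWindow()
    window = Desktop(backend="uia").window(handle=hwnd)

    elements = []
    for elem in window.descendants():  # walk only this window's subtree
        info = elem.element_info
        elements.append({"name": info.name, "control_type": info.control_type})
        if len(elements) >= max_elements:  # cap traversal to bound query latency
            break
    return elements

if __name__ == "__main__":
    for entry in active_window_a11y()[:20]:
        print(entry)
```

Restricting traversal to the foreground window (and capping the number of elements) bounds per-step query latency, at the cost of re-querying whenever focus switches to another application.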

RE: pre-training/fine-tuning to improve Windows alignment

  • Yes, good point. We considered it; however, we realized that it required significant amounts of trajectory data on Windows which was not available at the time, serving as part of the inspiration for creating this benchmark (i.e., to help generate said data).

RE: sensitivity of performance to prompts

  • We’ve observed variations in overall success rates (which we will add to the revised paper) from removing certain components from the agent’s prompt for GPT4V/o:
    • w/o memory: (19.5% → ~10-12%)
    • w/o in-context examples and provided functions for the agent to use: (19.5% → ~2-4%)
  • The prompts also change depending on the different kinds of inputs (e.g., omniparser output that contains the set-of-marks and element IDs, accessibility tree or a11y, etc.). These variations and their impact on performance are in Table 4.
Final Decision

This paper presents Windows Agent Arena, a timely benchmark for evaluating multi-modal agents in realistic Windows OS environments. Reviewers praised its scalability, task diversity, and thorough experimental design, highlighting the significant performance gap between current agents and human users as a valuable insight for future research. I believe it is a clear accept.