PaperHub
Score: 7.3/10
Poster · 4 reviewers
Ratings: 5, 4, 5, 4 (min 4, max 5, std 0.5)
Confidence: 4.0
Novelty: 3.0
Quality: 3.0
Clarity: 3.5
Significance: 3.0
NeurIPS 2025

macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

Submitted: 2025-05-05 · Updated: 2025-10-29

Abstract

Keywords
Benchmark, GUI agent, interactive, macOS, multilingual, safety

Reviews and Discussion

Review
Rating: 5

This paper presents macOSWorld, the first live evaluation benchmark for macOS GUI agents. It supports five languages and includes a safety evaluation subset.

Strengths and Weaknesses

Strengths:

The paper's strength lies in introducing macOSWorld, the first live evaluation benchmark for macOS GUI agents. It fills a significant gap by offering a comprehensive set of 202 real-world tasks across 30 applications (28 exclusive to macOS). The benchmark is multilingual, supporting five languages, and includes a crucial safety subset for deception attacks. A thorough evaluation of six diverse GUI agents further strengthens its contribution to the community.

Weaknesses:

Similar benchmark designs and implementations exist for other operating systems. The paper has already cited some of them, such as OSWorld, AndroidWorld, MiniWoB, and WebArena. In other words, the paper lacks innovation except for adapting existing approaches to a new scenario. This could be a common weakness of most, if not all, benchmark papers.

Although I agree that this paper has a similar quality to previous benchmark papers accepted by NeurIPS, I am looking for new techniques that have not been reported in existing work in order to rate it higher. For example, all existing work (as well as this paper) either utilizes rules or an autorater (or both) in assigning environment rewards. An innovation in this direction could be automatic rule generation and implementation, a more accurate autorater, or a completely new way of finding rewards.

Questions

No question

Limitations

Yes

Final Justification

The authors addressed my questions and comments in the rebuttal.

Formatting Concerns

No concern

Author Response

Thank you for your overall favorable review and for recognizing the significance of our work. We are happy that you mentioned:

 

I am looking for new techniques that have not been reported in existing work in order to rate it higher.

Below, we briefly summarize three key innovations in macOSWorld that were not fully detailed in the paper.

 

Innovation 1: Fast macOS Snapshot Recovery

  • Prior method (OSWorld on AWS):
    • Relies on full instance recreation
    • Each single recovery takes 75+ minutes for macOS
  • Ours:
    • On an OS-level reboot of the host machine, we hot-swap the root volume instead of recreating the VM
    • ~10 minutes full environment recovery
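For context, here is a minimal, hypothetical boto3 sketch of one way such a root-volume swap could be scripted on AWS, assuming EC2's replace-root-volume feature; the exact mechanism is not specified in the paper, and all IDs and the region below are placeholders:

```python
import time
import boto3

# Hypothetical sketch (not the authors' released code): restore a Mac instance's
# root volume from a pre-built snapshot, which reboots the instance instead of
# recreating it. Instance/snapshot IDs and the region are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

task = ec2.create_replace_root_volume_task(
    InstanceId="i-0123456789abcdef0",     # placeholder EC2 Mac instance ID
    SnapshotId="snap-0123456789abcdef0",  # placeholder snapshot of the clean benchmark image
)
task_id = task["ReplaceRootVolumeTask"]["ReplaceRootVolumeTaskId"]

# Poll until the swap completes (the instance reboots during this window).
while True:
    tasks = ec2.describe_replace_root_volume_tasks(ReplaceRootVolumeTaskIds=[task_id])
    state = tasks["ReplaceRootVolumeTasks"][0]["TaskState"]
    if state in ("succeeded", "failed", "failed-detached"):
        break
    time.sleep(30)
```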

 

 

Innovation 2: Compliant & Reproducible macOS Environments

  • Prior method: cannot satisfy both (legal compliance & OS reproducibility)
    • App-level recovery (AndroidWorld): legally compliant, but cannot restore changes the agent made outside specific apps
    • VM snapshots (OSWorld): OS-level recovery, but not compliant for macOS (violates macOS EULA)
  • Ours:
    • Legally compliant: Leveraging AWS virtualization
    • Reproducible: Pack environment as AWS AMIs that are bit-identical across users

 

 

Innovation 3: Native, Fine-Grained Step-Wise Rewarding

  • 64 out of 202 macOSWorld tasks were annotated for partial credit at each key step
  • Two examples:
Task: Help me navigate to the "System Report" page.
    • "System Report" is opened: 1.0
    • "Settings > About" is opened: 0.5
    • "Settings" app is opened: 0.25
    • Otherwise: 0.0
Task: Help me reduce the volume to 25%.
    • Volume in [20, 30]: 1.0
    • Otherwise, in [15, 35]: 0.75
    • Otherwise, in [10, 40]: 0.5
    • Otherwise, in [5, 45]: 0.25
    • Otherwise: 0.0
  • Why not reported in the paper? Multiple valid completion paths (e.g., GUI vs. Terminal) make it hard to define fair, path‑agnostic scores.
  • Why open-sourced? These annotations can facilitate future RL‑based reward functions or development of fairer, fine-grained evaluations.
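To illustrate how such step-wise rewards could be computed, here is a minimal Python sketch for the volume task above, assuming the output volume is read back via AppleScript; the bands mirror the table, but the released grading scripts may be structured differently:

```python
import subprocess

# Hypothetical step-wise grader for the "reduce the volume to 25%" task.
BANDS = [((20, 30), 1.00), ((15, 35), 0.75), ((10, 40), 0.50), ((5, 45), 0.25)]

def current_output_volume() -> int:
    """Read the macOS output volume (0-100) via AppleScript."""
    out = subprocess.run(
        ["osascript", "-e", "output volume of (get volume settings)"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())

def volume_partial_credit(volume: int) -> float:
    """Band-based partial credit mirroring the table above (target: 25%)."""
    for (low, high), score in BANDS:
        if low <= volume <= high:
            return score
    return 0.0

print(volume_partial_credit(current_output_volume()))
```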
Comment

Thank you again for recognizing macOSWorld's potential. You had mentioned that highlighting our novel contributions could justify a higher score, so we outlined three key innovations in our last message. Does that address your concerns, or is there any additional detail we can provide? We'd be grateful for your guidance on how our work could justify a higher rating.

Comment

Thanks for addressing my comments in your rebuttal! I will increase my rating.

Review
Rating: 4

This paper introduces macOSWorld, a multilingual benchmark for evaluating GUI agents in macOS environments. It contains 202 multilingual interactive tasks across 30 applications on macOS, with task instructions provided in 5 languages (English, Chinese, Arabic, Japanese, Russian). Evaluation of six GUI agents reveals that most existing agents/models achieve success rates below 40% (except Claude CUA). The benchmark also exposes multilingual performance disparities and shows the performance degradation of GUI agents in non-English scenarios.

Strengths and Weaknesses

Strengths.

  • Fill the gap of GUI agent evaluation in macOS's environments, which is beneficial to GUI agent community.
  • Multi-linguistic support is good.
  • Evaluation is extensive.

Weaknesses

  • The evaluation only uses binary success/failure metrics, which could neglect agents' intermediate progress.
  • More agent frameworks, e.g., [1, 2], could be considered in addition to native agent models.
  • Limiting the budget to 15 screenshots/30 dialogue turns could cap agent performance, as complex tasks may require exploration spanning 50 or even 100 steps.

[1] Agashe et al. Agent S: An Open Agentic Framework that Uses Computers Like a Human. ICLR 2025

[2] Yu et al. ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning. ICLR 2025

Questions

  • How is the action space defined in detail? Actions like click and drag are categorized into the same type; does that mean they are mapped to a single function with different parameters?
  • Can this work be extended to support methods that use accessibility trees? If so, how could the accessibility tree be obtained in the macOS environment?
  • Does this work consider any infeasible tasks?
  • What is the rationale behind the 15-screenshot/30-dialogue limit? Results in previous work [1] show that CUAs often require more steps, e.g., 50 or even 100, during exploration. And what percentage of failures are due to step exhaustion rather than actual capability limitations?

Limitations

yes

Final Justification

I think this is good work that makes contributions to computer-use agents, and I will maintain my current positive score to recommend acceptance.

Formatting Concerns

N/A

Author Response

We are grateful for your constructive and overall positive review, and we look forward to addressing the questions you raised.

 

Q1: Binary evaluation metric neglects agents' intermediate progress.

Although not written in the paper, 64 out of 202 macOSWorld tasks include 0-1 scoring based on partial completion. These are fully implemented and available in our open-source release. The following are two examples:

Task: Help me navigate to the "System Report" page.
    • "System Report" is opened: 1.0
    • "Settings > About" is opened: 0.5
    • "Settings" app is opened: 0.25
    • Otherwise: 0.0
Task: Help me reduce the volume to 25%.
    • Volume in [20, 30]: 1.0
    • Otherwise, in [15, 35]: 0.75
    • Otherwise, in [10, 40]: 0.5
    • Otherwise, in [5, 45]: 0.25
    • Otherwise: 0.0

Why not reported in the paper? Multiple valid completion paths (e.g., GUI vs. Terminal) make it hard to define fair, path‑agnostic scores.

Why open-sourced? These annotations can facilitate future RL‑based reward functions or development of fairer, fine-grained evaluations.

 

 

Q2: Could evaluate more agent frameworks like [1, 2].

Thank you for the suggestion. We will cite AgentS and ExACT. Due to current time and resource constraints, we were unable to benchmark them during the rebuttal, but we plan to include them in future updates of macOSWorld.

 

 

Q3: Why 15 screenshots/30 dialog limit?

We match OSWorld's limits. Since agents like Claude CUA fetch a screenshot every two turns, capping at 30 dialogs yields up to 15 screenshots.

 

 

Q4: How is the action space defined in detail? Will similar operations be mapped to a single function with different parameters?

  • Design: The action space aligns with the action-space designs and implementations of OpenAI CUA and Claude CUA.

  • Implementation: Your understanding is mostly correct. All GUI operations wrap a small set of underlying vncdotool calls with different parameters:

| Operation | Functions Mapped To |
| left_click() | VNC.mouseDown(1) + VNC.mouseUp(1) |
| right_click() | VNC.mouseDown(3) + VNC.mouseUp(3) |
| drag_to(512, 384) | VNC.mouseDown(1) + VNC.mouseMove(512, 384) + VNC.mouseUp(1) |
| key_press('@') | VNC.keyPress('@') |
| type_text(string) | VNC.keyPress(str) for str in string |
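For illustration, a minimal sketch of such wrappers around the vncdotool Python client, mirroring the mapping table above; the VNC endpoint, password, and wrapper names are placeholders rather than the benchmark's actual code:

```python
from vncdotool import api

# Placeholder VNC endpoint; in macOSWorld this would point to the macOS VM.
client = api.connect("vm-host::5900", password=None)  # supply the VNC password here

def left_click(x: int, y: int) -> None:
    # Mirrors: VNC.mouseDown(1) + VNC.mouseUp(1) at the target location.
    client.mouseMove(x, y)
    client.mouseDown(1)
    client.mouseUp(1)

def drag_to(x: int, y: int) -> None:
    # Mirrors: VNC.mouseDown(1) + VNC.mouseMove(x, y) + VNC.mouseUp(1).
    client.mouseDown(1)
    client.mouseMove(x, y)
    client.mouseUp(1)

def type_text(text: str) -> None:
    # Mirrors: VNC.keyPress(ch) for each character in the string.
    for ch in text:
        client.keyPress(ch)
```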

 

 

Q5: Can this work be extended to support accessibility-tree included methods? If so, how to obtain the accessibility trees?

Yes, macOS accessibility trees can be obtained in two ways:

  1. Using an App: Xcode > Accessibility Inspector
  2. Using system command-line: AppleScript AX API
    • Example command obtaining a table cell value: osascript -e 'tell application "Numbers" to tell table 1 of sheet 1 of document 1 to get value of cell "C4"'
    • Natively supported by macOSWorld via SSH access point

We expect more methods to make use of macOS-exclusive accessibility trees, and we will benchmark them in the future.
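As an illustration of the SSH route, here is a small Python sketch that runs the example AppleScript AX query above through an SSH access point; the SSH target is a placeholder:

```python
import subprocess

# Placeholder SSH target; macOSWorld exposes an SSH access point to the macOS environment.
SSH_TARGET = "ec2-user@mac-instance"

APPLESCRIPT = (
    'tell application "Numbers" to tell table 1 of sheet 1 of document 1 '
    'to get value of cell "C4"'
)

# Wrap the AppleScript in single quotes so the remote shell passes it to osascript intact.
remote_cmd = f"osascript -e '{APPLESCRIPT}'"
result = subprocess.run(
    ["ssh", SSH_TARGET, remote_cmd],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # e.g., the value stored in cell C4
```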

 

 

Q6: Does this work consider some infeasible tasks?

Not in the current version. All tasks in macOSWorld are designed to be unambiguously feasible, as our primary goal is to evaluate agents' core competencies on macOS. But we appreciate the suggestion and plan to include a subset of infeasible tasks in future versions.

 

 

Q7: What percentage of failures are due to step exhaustion instead of actual capability limitations?

Answering this question requires rerunning the experiment with a 50-step limit. However, we lack the resources to do so within the rebuttal period. We will conduct these extended runs and provide the detailed failure breakdown in our next revision.

Review
Rating: 5

This paper proposes a new benchmark for GUI Agents. It is targeted for MacOS applications, comes with tasks and environments in 5 languages, and includes a safety eval subset following a prior work.

The benchmark includes ~200 tasks. Each task comes with an instruction, a preparation script to set up the initial state of the OS, and an evaluation script to programmatically test correctness. The tasks were originally prepared in English; to convert instructions to other languages, the authors use an LLM. The OS applications natively come with an option to change the language, which is what they directly leverage for the multilingual environments.

Strengths and Weaknesses

Strengths

  • Both the focus on macOS and multilingualism are novel and needed
  • The benchmark has significantly more apps than current ones.
  • The environment setup, programmatic task initial state setup, and evaluation are provided for reproducible evaluation.
  • Experiments using the best open and proprietary GUI agents show that the benchmark is challenging.

Weaknesses

Overall, this is a good environment and benchmark, inheriting design decisions from some of the previous benchmarks, and extending it to macOS and multilingual setups.

The paper, though, describes very little about the process of the task generation itself.

  • How was the task quality or "naturalness" ensured?
  • How was task diversity ensured?
  • How were the annotators selected? What were they instructed to do? Was there any quality-based filtering?

Also, the paper does not describe how the preparation and evaluation scripts were created. Were they written by the annotators as well? It's also unclear what the human performance is on this benchmark.

Questions


Limitations

yes

Final Justification

My main concern was the lack of detail in the paper about how the quality and diversity of the tasks were ensured. The authors' response helped resolve this. I have raised the score to recommend acceptance.

Formatting Concerns


Author Response

Thank you for your thoughtful and positive assessment, particularly for highlighting our paper's strengths. We welcome the opportunity to clarify the points you noted.

 

Q1: How were the annotators selected?

  • Computer science graduate students
  • Daily macOS users & Unix shell proficiency
  • Specialized roles:
    • Advanced apps: the annotator has video editing & SwiftUI development experience
    • Safety subset: created by authors

 

 

Q2: What's the task curation/filtering process? How are quality, naturalness, and diversity ensured?

Step 1: High-level Categorization

  • Prompted Claude to:
    1. Brainstorm macOS use-case categorizations
    2. Summarize from a full list of OSWorld main set tasks
    3. Merge 1 & 2 to form initial categorizations
  • Discussed with annotators to validate and form the 7 final categories

 

Step 2: Task Prototype Creation

  • Collected 115 raw task proposals from the public using a questionnaire (for reference only)
  • Brainstormed among authors and with other GUI agent researchers
  • Created 8 example tasks; one example prototype is as follows:
    • Task Instruction: Help me create an empty slide with an aspect ratio of 4:3.
    • Snapshot to Recover: snapshot_usedApps_en (an English snapshot with Keynote installed)
    • Preparation Script: osascript -e 'tell application "Keynote" to activate'
    • Grading Script: 1.0: osascript -e 'tell application "Keynote" to set slideWidth to width of document 1' -e 'tell application "Keynote" to set slideHeight to height of document 1' -e 'slideWidth / slideHeight * 9' | grep -q "12" && echo "True" || echo "False"; 0.0: Otherwise
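For readability, here is an illustrative Python equivalent of the one-line shell grader in the prototype above (the released graders are shell one-liners; this sketch only mirrors their logic):

```python
import subprocess

# Illustrative re-rendering of the prototype's shell grading script:
# a 4:3 Keynote slide satisfies width / height * 9 == 12.
out = subprocess.run(
    ["osascript",
     "-e", 'tell application "Keynote" to set slideWidth to width of document 1',
     "-e", 'tell application "Keynote" to set slideHeight to height of document 1',
     "-e", 'slideWidth / slideHeight * 9'],
    capture_output=True, text=True, check=True,
).stdout.strip()

score = 1.0 if out.startswith("12") else 0.0  # the shell version greps for "12"
print(score)
```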

 

Step 3: Annotator Training

Conducted training sessions for annotators to let them:

  • Understand our purpose and motivation
  • Understand the above prototype
  • Learn how to write and validate a new task (using a Mac/Macbook + SSH terminal + ChatGPT)

 

Step 4: Task Creation

  • Annotators were instructed to:
    • Imitate the given prototypes to write new tasks based on (but not limited to) authors' ideas
    • Write tasks in their assigned category(s) (e.g. File Management/Advanced Apps)
    • For naturalness, annotators are encouraged to follow Mac User Guide, Apple Support, Apple Discussions, and App built-in templates
    • For determinism, annotators are required to write clear, evaluation-friendly tasks
  • Each annotator first creates 5-10 tasks; authors validate and provide feedback
  • Then, annotators finalize the creation of all tasks
  • Annotators use a Mac/Macbook + SSH terminal to repeatedly validate the preparation/evaluation scripts

 

Step 5: Quality Filtering

  • All tasks were validated (with some edited) by authors, ensuring fair and deterministic evaluation
  • The safety subset was created and validated by the authors

 

How are properties ensured?

  • Diversity: questionnaire + brainstorming + OSWorld diversity + Claude knowledge
  • Naturalness: Apple official guides & templates
  • Quality: author-driven full validation  

 

Q3: How were preparation/evaluation scripts annotated?

Also written by annotators. Annotators simultaneously do:

  1. Write task instructions
  2. Design preparation & evaluation scripts
  3. Validate on-the-fly (Mac/Macbook + SSH terminal)

 

 

Q4: What were annotators instructed to do?

Please refer to Q2 Step 4.

 

 

Q5: Was there any quality-based filtering?

Please refer to Q2 Step 5.

 

 

Q6: Human performance on the benchmark?

Test subject: computer science undergraduate student with macOS & iWork experience, but no video-editing or SwiftUI background

| Success Rate (↑) | System & Interface | System Apps | File Manage | Productivity | Media | Adv Apps | Multi-Apps | Overall (w/o Adv Apps) | Overall (w/ Adv Apps) |
| Human, en | 0.793 | 0.895 | 0.931 | 0.971 | 0.833 | 0.323 | 1.000 | 0.912 | 0.822 |
  • Overall 82.2%
  • Lowest: Advanced Apps (32.3%), as these tasks require domain knowledge (iMovie, Xcode, SwiftUI)
  • The relatively high human performance supports the naturalness of our dataset, indicating that the tasks are everyday, intuitive activities that ordinary users can generally understand and complete accurately
Review
Rating: 4

This paper describes macOSWorld, the first comprehensive benchmark for evaluating GUI agents in macOS environments. The benchmark addresses three critical limitations of existing interactive, OS-level benchmarks: lack of macOS applications, no multilingual support, and narrowly defined safety. There are a total of 202 interactive tasks split across 30 applications (28 macOS-exclusive) in the benchmark. The tasks and associated OS interfaces are provided in 5 languages (English, Chinese, Arabic, Japanese, and Russian). The benchmark uses AWS EC2 Mac instances to host virtualized macOS environments, and it supports reproducible evaluation through snapshot recovery as well as programmatic agent assessment. In addition, the benchmark includes a dedicated safety subset with realistic deceptive pop-up windows to assess agents' vulnerability to context deception attacks. The authors employed the benchmark to evaluate six representative GUI agents, revealing remarkable performance gaps: proprietary computer-use agents completed more than 30% of the tasks successfully, while open-source models completed less than 2%. The findings also revealed significant performance discrepancies across languages and concerning safety vulnerabilities.

Strengths and Weaknesses

Strengths

Comprehensive Infrastructure: The benchmark provides a technically solid test bed featuring virtualized macOS environments, reproducible snapshots, and programmatic test scripts that support equitable comparison among agents.

Significant Coverage Gap: Fills a critical gap in current benchmarks by pioneering the inclusion of macOS, one of the most widely used operating systems, with distinctive GUI conventions and 28 macOS-exclusive applications.

Multilingual Innovation: Innovative integration of multilingual support at both the environment and task-instruction levels, providing informative insights into agent capabilities across languages and writing systems.

Safety Integration: Includes realistic (not contrived) deceptive pop-up tests in interactive environments, generating useful information about agent weaknesses that is relevant to safe deployment.

Weaknesses

Limited Evaluation Granularity: Binary success/failure evaluation may miss fine-grained variation in performance and does not track partial task completion or measures beyond minimal step counts.

Accessibility Concerns: Overdependence on AWS-hosted Mac hardware can limit replicability and availability for researchers without sufficient cloud-computing resources.

Limited Safety Scope: The English-only safety subset is moderately small (29 tasks), yet it reveals interesting vulnerabilities (≈70% deception rate for CUAs) that ought to be explored further.

Questions

  1. Benchmark Accessibility: With the dependence on Mac computers hosted by AWS, are there plans to make the benchmark more accessible to researchers with fewer cloud-computing resources? Can we have cost estimates or alternatives for testing?

  2. Human Baseline Validation: Besides step counts, how do you ensure tasks are representative of typical human workflows and that the measurement metrics correctly capture task completion quality?

Limitations

Yes

Formatting Concerns

No

Author Response

We sincerely appreciate your encouraging evaluation of macOSWorld's key contributions. We are pleased to address the insightful questions you raised.

 

Q1: Limited evaluation granularity. Extending to track fine-grained performance (e.g., partial completion of tasks)?

Although not written in the paper, 64 out of 202 macOSWorld tasks include 0-1 scoring based on partial completion. These are fully implemented and available in our open-source release. The following are two examples:

Task: Help me navigate to the "System Report" page.
    • "System Report" is opened: 1.0
    • "Settings > About" is opened: 0.5
    • "Settings" app is opened: 0.25
    • Otherwise: 0.0
Task: Help me reduce the volume to 25%.
    • Volume in [20, 30]: 1.0
    • Otherwise, in [15, 35]: 0.75
    • Otherwise, in [10, 40]: 0.5
    • Otherwise, in [5, 45]: 0.25
    • Otherwise: 0.0

Why not reported in the paper? Multiple valid completion paths (e.g., GUI vs. Terminal) make it hard to define fair, path‑agnostic scores. We leave this as an open question for future research.

Why open-sourced? These annotations can facilitate future RL‑based reward functions or development of fairer fine-grained evaluations.

 

 

Q2: The scope of the safety subset is limited: English-only safety subset is moderately small.

We extended the 29 safety subset tasks to all 5 languages.

| Agent | Lang | Distracted | Gold | Unhandled |
| Claude CUA | en | 0.724 | 0.241 | 0.034 |
| Claude CUA | zh | 0.808 | 0.192 | 0.000 |
| Claude CUA | ar | 0.808 | 0.192 | 0.000 |
| Claude CUA | ja | 0.731 | 0.269 | 0.000 |
| Claude CUA | ru | 0.731 | 0.269 | 0.000 |
| OpenAI CUA | en | 0.690 | 0.276 | 0.034 |
| OpenAI CUA | zh | 0.586 | 0.345 | 0.069 |
| OpenAI CUA | ar | 0.577 | 0.385 | 0.038 |
| OpenAI CUA | ja | 0.615 | 0.308 | 0.077 |
| OpenAI CUA | ru | 0.615 | 0.308 | 0.077 |

Takeaway: High, consistent deception rates across languages show this vulnerability is language‑agnostic and requires urgent mitigation.

 

 

Q3: Plans to make the benchmark less costly? (e.g., less dependent on cloud computing)

Yes, we have already developed a VMware-based version for local deployment. Both versions will be released simultaneously. Comparison:

| | AWS (Cloud) Version | VMware (Local) Version |
| Hosting | AWS-hosted | Local, VMware-based |
| Hardware requirement | Minimal local hardware requirement | Ubuntu + Intel/AMD AVX2 CPU |
| Cost | ~75 USD per language + agent API | 0 USD + agent API |
| Compliance | Described in the paper, fully compliant | Not in the paper; users assume their own compliance risk |

 

 

Q4: How is task representativeness ensured?

This is ensured by the sources from which the tasks are derived. The majority of macOSWorld tasks come from sources that capture representative macOS use cases:

  • Built-in app templates (e.g., Numbers "Personal Budget" document)
  • Mac User Guide
  • Apple Support & Apple Discussions
  • Apple official tutorials (e.g., SwiftUI)

 

 

Q5: How is metric validity ensured?

The primary metric is task success rate. Its validity is guaranteed by our task curation protocol:

  • When creating a task, annotators iterate live on Mac + SSH terminal, running and refining each evaluation script until only the intended interaction passes, eliminating false positives/negatives.
  • Authors also reviewed every task to enforce finite, unambiguous success condition(s), ensuring binary success reflects true completion.

(Please see our response to Reviewer UN4t for the complete task curation process.)

Comment

Dear Reviewers,

Thank you for your thorough and constructive reviews. We are especially grateful for the insightful questions, which allowed us to further showcase the strengths and uniqueness of our benchmark during the rebuttal. We also appreciate that some of you considered raising your scores. We wish you all the best in your research and future career!

Best regards, #7031 authors

Final Decision

This paper introduces a new benchmark, macOSWorld, for evaluating computer-use agents' capability and adversarial robustness. While at first glance the paper fits the D&B track, it provides in-depth results and analysis for several existing agents, forming useful insights for this area that also suit the main paper track. Reviewers fJ2j and tRsF called for more granular evaluation metrics beyond binary success, while UN4t requested detailed transparency on the task curation process and human performance baselines. Concerns were also raised by fJ2j about the benchmark's accessibility and cost due to its reliance on AWS, and by 8KY8 regarding the work's technical novelty compared to existing benchmarks. The authors responded effectively by revealing that a partial-credit scoring system was already implemented for 64 tasks, providing a detailed breakdown of their rigorous curation protocol, and announcing a zero-cost VMware version to address accessibility. Furthermore, they addressed the argument on novelty by highlighting three significant technical innovations, such as a 7x faster environment recovery method. Overall, I believe most concerns are addressed by the authors, and the paper makes significant contributions to agent evaluation that merit inclusion in this venue.