PaperHub
Score: 7.8 / 10
Decision: Poster · 4 reviewers
Ratings: 5, 5, 5, 4 (min 4, max 5, std 0.4)
Confidence: 3.3
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation

Submitted: 2025-04-29 · Updated: 2025-10-29
TL;DR

A GUI critic for pre-operative error diagnosis, trained with a reasoning-bootstrapping data collection pipeline and reinforcement learning.

Abstract

Keywords
GUI agent, critic models, RL, GRPO

Reviews and Discussion

Official Review
Rating: 5

This work constitutes a significant advance for reliable GUI automation. Its formalization of pre-operative critique, coupled with S-GRPO and high-quality datasets, resolves critical safety/efficiency challenges. The dual static/dynamic validation demonstrates both diagnostic precision and operational efficacy, offering a foundational framework for safe human-AI interaction.

Strengths and Weaknesses

Strengths:

  • The work introduces the first formalized pre-operative critic mechanism for GUI automation, marking a conceptual shift from post-hoc correction to proactive error prevention. It proposes S-GRPO, a novel suggestion-aware reward formulation that extends GRPO to support multimodal critique generation. Furthermore, the authors develop a reasoning-bootstrapping pipeline to collect high-quality data without ground-truth, effectively addressing the lack of existing GUI critic datasets.

  • The paper demonstrates strong experimental design, evaluating the approach on both static (GUI-Critic-Test) and dynamic (AndroidWorld) benchmarks across mobile and web platforms. Comprehensive ablation studies (Table 3–4, Fig. 3) provide clear justification for key design choices.

Weaknesses:

I don’t have major concerns, but there are a few minor points worth noting:

  • The method shows relatively weak performance on the GUI-W benchmark (63.08% vs. 69.20% on GUI-I), raising concerns about its cross-platform generalization, particularly in desktop or web-based environments. This aspect is not sufficiently discussed in the paper.

  • The experiments are primarily conducted on scripted benchmarks like AndroidWorld. It remains unclear how well the method handles more complex, real-world conditions—such as dynamic UIs, unpredictable layout changes, or network-induced delays—which are common in practical GUI automation tasks.

Questions

  • How is I_similar(s, s′) computed in Equation (3)? The paper lacks detail here. Is the similarity judged by a vision-language model (e.g., Qwen2.5-VL-72B), or is it rule-based? More clarity would help understand how reliable and generalizable this reward is.

  • The critic model appears to perform reasoning-like functions similar to those found in advanced GUI action models with built-in "thinking" capabilities. This raises an important question: does introducing a separate critic reduce the need for reasoning within the action model itself, or do they complement each other? Clarifying this relationship would provide deeper insight into the system design and its modularity.

  • The paper does not discuss the computational overhead introduced by the real-time pre-critic mechanism. Since the critic is designed to operate before the GUI action model, it may lead to increased latency in practical applications. A discussion on runtime efficiency and its trade-offs would help assess the feasibility of deployment in real-world systems.

Limitations

yes

Final Justification

This paper presents a significant contribution to reliable GUI automation through its formalization of pre-operative critique, combined with the S-GRPO training strategy and a high-quality dataset. The approach effectively addresses key challenges in safety and efficiency, with strong empirical results.

Formatting Issues

The paper followed the formatting instructions.

Author Response

Thank you for your valuable feedback and for recognizing the contributions of our work. Your insights are greatly appreciated and will help us further improve the quality of our manuscript.

W1. Weak performance on the GUI-W

  1. About Domain Gap.

    • The performance difference between GUI-W (63.08%) and GUI-I (69.20%) benchmarks stems from the domain gap between web and mobile scenarios, as our model was trained on mobile GUI scenarios.

    • We strategically focused on mobile applications because they present many challenges (e.g., frequent interface updates, diverse interaction patterns, and heterogeneous UI layouts). These characteristics make mobile GUI a more challenging and representative test bed for our approach.

  2. Cross-domain Generalization and Methodology.

    • Despite this domain-specific training, our model demonstrates strong cross-domain generalization capabilities. As shown in Table 1 in the main text, it achieves superior critic accuracy on the GUI-W benchmark compared to other MLLMs with many more parameters (e.g., Qwen2.5-VL-72B at 60.05%), even though web interfaces differ significantly from mobile UIs in terms of layout structure and action space.

    • Furthermore, our core methodological contributions - the S-GRPO training strategy and reasoning-bootstrapping based data collection pipeline - are domain-agnostic and can be readily adapted to other GUI environments including web and desktop platforms. We'll incorporate this analysis into Section 4.2.1 (Static Evaluation) of the revised manuscript.

  3. Additional Validation on a dynamic desktop benchmark OSWorld [1*]. To more thoroughly validate the model's effectiveness in desktop scenarios, we have added experiments on the OSWorld benchmark, which is a challenging dynamic benchmark in a real computer environment that contains 369 computer tasks involving real web and desktop apps. We take Qwen2.5-VL-72B as the action model (baseline), and the results are shown below.

| Model | Success Tasks Number |
| --- | --- |
| Baseline: Qwen2.5-VL-72B | 38 |
| Baseline + Pre-Critic: Qwen2.5-VL-7B | 43 |
| Baseline + Pre-Critic: GUI-Critic-R1 (Ours) | 50 |

From the table, we can find that our GUI-Critic-R1 significantly boosts the number of successful tasks on OSWorld (50 vs 38) and outperforms the configuration that employs Qwen2.5-VL-7B as the pre-critic (50 vs 43). The results show that our model also exhibits satisfactory generalization ability in desktop and web environments.

[1*] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS'24.

W2. Real-world GUI Conditions

  1. Diversity of training sets. Although dynamic testing was conducted on AndroidWorld, the training data included over 100 real-world apps, covering a wide range of realistic conditions.
  2. Real-world static evaluation. In this paper, we introduce a static evaluation benchmark including GUI-I/S/W test datasets (Section 4.1) that covers a broad spectrum of real-world applications and heterogeneous UI layouts. The consistent superiority of our method on this suite of tests provides evidence of its effectiveness.
  3. Real-world experiments. Currently, there is no established suitable real-world dynamic benchmark in mobile scenarios. In order to verify the capabilities of our model in real-world scenarios, we added real-device experiments on Mobile-Agent-v2 test set [36]. In this experiment, Qwen2.5-VL-72B serves as the actor agent, while our proposed GUI-Critic-R1 serves as the pre-critic model. We assess the performance on 24 advanced instructions targeting external applications (Table 5/6 in Reference [36]).
| Model | SR |
| --- | --- |
| Baseline (Qwen2.5-VL-72B) | 9/24 |
| Baseline + GUI-Critic-R1 (Ours) | 15/24 |

The results show that with the assistance of our pre-critic, the agent successfully executed 6 tasks it had previously failed to execute. For instance, while executing “Find a video on Bilibili and give it the triple—like, coin, and favorite,” the agent assumed the task was complete after liking and favoriting, but the critic noticed that the coin had not yet been given and guided the agent model to finish the instruction.

Q1. How is I_similar(s, s′) computed?

We leverage an LLM (Qwen2.5-72B) to compute the semantic similarity between each generated suggestion and its ground-truth annotation (Lines 225–226 and 264). The prompt used for this calculation is given in Appendix Table 4. We will add a detailed explanation of this similarity-scoring procedure, with reference to Eq. 3, in the implementation section of the revised manuscript.
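
For concreteness, the sketch below shows one way such an LLM-judged similarity check could be wired up. The endpoint, model name, prompt wording, and function name are assumptions for illustration only; the authors' actual judging prompt is the one referenced above (their Appendix Table 4).

```python
# Hypothetical sketch of the LLM-judged similarity check behind the suggestion
# reward in Eq. (3); endpoint, model name, and prompt are assumed placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local Qwen2.5-72B server

JUDGE_TEMPLATE = (
    "Ground-truth suggestion: {gt}\n"
    "Generated suggestion: {pred}\n"
    "Do these describe the same corrective operation? Answer 'yes' or 'no'."
)

def i_similar(pred: str, gt: str) -> int:
    """Return 1 if the judge LLM deems the two suggestions semantically equivalent, else 0."""
    reply = client.chat.completions.create(
        model="Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(gt=gt, pred=pred)}],
        temperature=0.0,
        max_tokens=4,
    )
    return int(reply.choices[0].message.content.strip().lower().startswith("yes"))
```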

Q2. Does the critic reduce the need for reasoning within the action model itself?

Thank you for this valuable suggestion. No, introducing a separate critic does not and should not reduce the need for reasoning within the action model itself.

  • Clarification: Foremost, we aim to clarify that the introduction of the critic is intended to complement the action model, rather than to supplant its reasoning capabilities.

  • Sequential workflow of the system: In our workflow, the action model first makes decisions independently, followed by the critic's evaluation and potential feedback (the relevant prompt is provided in Appendix Table 5). If the action model were too simplified and more error-prone, it would lead to frequent critic interventions and multiple revision cycles, impacting the system's efficiency and response time. This is particularly crucial in GUI automation, where timely and accurate first-attempt decisions are important for a smooth user experience.

  • Advantage: Based on this consideration, both models should maintain their full reasoning capabilities and complement the system at different points. The action model is responsible for generating concrete GUI actions, whereas the critic model focuses on detecting erroneous decisions and providing corrective guidance. This combination of strong initial decision-making and additional verification makes our system more robust and efficient.

Q3. Computational overhead

Following the reviewer's suggestion, we conducted a comprehensive runtime analysis on AndroidWorld benchmark as shown in the table below. The experiments measure task runtime, step counts, and success rates. To ensure fair comparison, all models (except GPT-4o) are deployed with identical hardware resources. The latency for each step includes environment interaction time, model inference time, and action execution time. "Avg Success Steps" denotes the average step number of all successful tasks, and "Avg success latency" represents the average latency of all successful tasks.

| Model | Avg Steps | Avg Success Steps | Avg Latency (s) | Avg Success Latency (s) | Success Rate (%) |
| --- | --- | --- | --- | --- | --- |
| Baseline: Qwen2.5-VL-72B | 10.48 | 9.88 | 162.06 | 130.81 | 22.4 |
| + Pre-Critic: Qwen2.5-VL-7B | 10.22 | 8.37 | 286.02 | 184.35 | 20.3 |
| + Pre-Critic: Qwen2.5-VL-72B | 7.86 | 7.27 | 188.09 | 190.07 | 23.2 |
| + Pre-Critic: GPT-4o | 8.11 | 7.90 | 232.52 | 202.22 | 22.4 |
| + Pre-Critic: GUI-Critic-R1 (Ours) | 7.19 | 7.09 | 156.91 | 127.61 | 27.6 |

The results show that:

  1. Our GUI-Critic-R1 achieves a clear efficiency improvement over the Baseline: We achieve lower average latency (156.91s vs 162.06s) while substantially reducing the average number of steps (7.19 vs 10.48). Moreover, our model maintains a higher success rate (27.6% vs 22.4%), showing the effectiveness of the pre-operative critic mechanism.

    • Analysis of Latency Reduction: Although the critic mechanism introduces an additional VLM forward pass at each step, the significant reduction in the total number of steps (~31% reduction from baseline) more than compensates for this overhead. This is because our pre-operative critic mechanism can prevent erroneous actions before execution, thus avoiding time-consuming retry attempts. This early error detection capability enables the model to make more accurate decisions in fewer steps, leading to more efficient task completion.
  2. Our GUI-Critic-R1 shows superior efficiency compared with other pre-critic models. It operates faster than both the Qwen2.5-VL-72B critic (156.91s vs 188.09s) and the GPT-4o critic (156.91s vs 232.52s). This performance advantage highlights our model's ability to strike an optimal balance between computational cost and effectiveness.

In conclusion, our proposed pre-operative critic mechanism achieves both high operational efficiency and improved task success rates, making it a practical solution for real-world GUI automation deployment. We will integrate these new experimental results into the revised manuscript.

Comment

Thank you for your detailed and well-organized rebuttal. I appreciate the thoughtful clarifications and the new experimental results you have added. The authors have addressed all my previous concerns. In particular:

  • They offered a clear explanation of the domain gap and cross-domain performance, supported by new evaluations on the OSWorld benchmark.

  • They clarified the purpose and workflow of the critic model, ensuring it complements rather than replaces the reasoning of the action model.

  • The additional runtime analysis convincingly demonstrates the efficiency and effectiveness of the proposed GUI-Critic-R1.

  • The explanation of semantic similarity computation was also clear and complete.

Overall, the rebuttal strengthens the paper, and I will increase the score.

Comment

We appreciate the positive assessment of our rebuttal. We are deeply encouraged by this. We will certainly incorporate all of your suggestions into the revised version to further enhance the quality and clarity of the manuscript.

Official Review
Rating: 5

This paper proposes a pre-operative critic mechanism for MLLM-based GUI agents. The critic can point out potential errors and provide suggestions for an action before execution. To train the critic model, the authors propose Suggestion-aware Group Relative Policy Optimization (S-GRPO), which contains a suggestion reward besides a right-or-wrong signal. Besides, a data collection pipeline called reasoning bootstrapping is also developed for critic training and testing. Evaluation results show the superiority of this method.

Strengths and Weaknesses

Strengths:

  • The idea of a pre-operative critic is novel and useful for GUI agent development.
  • The paper is well-written and logically organized. The method and results are presented clearly.
  • The dataset of GUI-Critic-Train and GUI-Critic-Test offers data support for future research on GUI agent critics.

Weaknesses: Please refer to the questions.

Questions

  1. The experiments for critic evaluation are quite sufficient; however, the main dynamic evaluation is conducted mainly in Android environments. Is this method also suitable for broader scenarios such as desktop applications?
  2. Online GUI agents require high real-time performance, so how efficient is the critic?
  • In Sec. 4.2.2, the authors introduce a metric that measures the proportion of tasks for which the ‘baseline + critic’ model achieves fewer steps than the baseline, called Efficiency Advantage Rate (EAR). However, besides step number, token number and latency are also factors that influence efficiency. If the critic tends to generate long suggestions, the task execution time may be longer.
  • Besides, for samples where the 'baseline + critic' model takes more steps than the baseline, the EAR metric does not evaluate how much worse they are. So why choose the EAR metric rather than computing average step numbers?
  3. A typo: in line 11 of the abstract, 'Suggestion-aware Gradient Relative Policy Optimization (S-GRPO)'. Gradient -> Group.

Limitations

Yes. The authors have addressed the limitations in Appendix.

Final Justification

  1. The authors further conduct experiments on OSWorld benchmark, showing the ability of GUI-Critic-R1 to handle tasks of broader scenarios.
  2. The authors provide a more comprehensive efficiency comparison on average (success) steps and average (success) latency, making the evaluation more reliable.

Formatting Issues

No, there are no formatting concerns.

Author Response

Thank you for your thoughtful review and acknowledging the potential and contribution of our work. We appreciate your insightful comments, which have provided us with an opportunity to refine our manuscript and address critical aspects that will enhance the clarity and impact of our research.

Q1. Experiments on broader scenarios such as desktop application

Our GUI-Critic-R1 is designed to be domain-agnostic and can generalize well across different GUI environments. As suggested by the reviewer, to validate the model's effectiveness in broader scenarios such as desktop, we add a new experiment on the OSWorld benchmark, which is a challenging dynamic benchmark in a real computer environment that contains 369 computer tasks involving real web and desktop apps. We take Qwen2.5-VL-72B as the action model (baseline), and the results are shown below.

| Model | Success Tasks Number |
| --- | --- |
| Baseline: Qwen2.5-VL-72B | 38 |
| Baseline + Pre-Critic: Qwen2.5-VL-7B | 43 |
| Baseline + Pre-Critic: GUI-Critic-R1 (Ours) | 50 |

From the table, we can find that our GUI-Critic-R1 significantly boosts the number of successful tasks on OSWorld (50 vs 38) and outperforms the configuration that employs Qwen 2.5-VL-7B as the pre-critic (50 vs 43). The results show that our model also exhibits satisfactory generalization ability in desktop and web environments.

[1*] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS'24.

Q2. How efficient is the critic?

1. Comprehensive efficiency analysis. To make the efficiency assessment more comprehensive and credible, we recorded the step number, token number, and the latency required to complete tasks on the AndroidWorld benchmark as follows. "Avg Success Steps" denotes the average step number of all successful tasks, and "Avg success latency" represents the average latency of all successful tasks.

| Model | Avg Steps | Avg Success Steps | Avg Latency (s) | Avg Success Latency (s) | Success Rate (%) |
| --- | --- | --- | --- | --- | --- |
| Baseline: Qwen2.5-VL-72B | 10.48 | 9.88 | 162.06 | 130.81 | 22.4 |
| + Pre-Critic: Qwen2.5-VL-7B | 10.22 | 8.37 | 286.02 | 184.35 | 20.3 |
| + Pre-Critic: Qwen2.5-VL-72B | 7.86 | 7.27 | 188.09 | 190.07 | 23.2 |
| + Pre-Critic: GPT-4o | 8.11 | 7.90 | 232.52 | 202.22 | 22.4 |
| + Pre-Critic: GUI-Critic-R1 (Ours) | 7.19 | 7.09 | 156.91 | 127.61 | 27.6 |

The results clearly show that:

  • Our GUI-Critic-R1 achieves a clear efficiency improvement over the Baseline. We achieve lower average latency (156.91s vs 162.06s) while substantially reducing the average number of steps (7.19 vs 10.48). Moreover, our model maintains a higher success rate (27.6% vs 22.4%), showing the effectiveness of the pre-operative critic mechanism. Besides, our GUI-Critic-R1 generates suggestions averaging fewer than 30 words.
  • Analysis of Latency Reduction: Although the critic mechanism introduces an additional VLM forward pass at each step, the significant reduction in the total number of steps (~31% reduction from baseline) more than compensates for this overhead. This is because our pre-operative critic mechanism can prevent erroneous actions before execution, thus avoiding time-consuming retry attempts. This early error detection capability enables the model to make more accurate decisions in fewer steps, leading to more efficient task completion.
  • Our GUI-Critic-R1 shows superior efficiency compared with other pre-critic models. It operates faster than both Qwen2.5-VL-72B critic (156.91s vs 188.09s) and GPT-4o critic (156.91s vs 232.52s). This performance advantage highlights our model's ability to strike an optimal balance between computational cost and effectiveness. In conclusion, our proposed pre-operative critic mechanism achieves both high operational efficiency and improved task success rates, making it a practical solution for real-world GUI automation deployment. We will integrate these new experimental results into the revised manuscript.

2. The rationality of EAR (Efficiency Advantage Rate). We believe that EAR effectively reflects whether our model, compared to the baseline, can complete the same tasks with fewer steps. Average execution time and average steps can be significantly skewed by large advantages on a small number of tasks. We fully acknowledge the reviewer's perspective that EAR alone is not comprehensive enough. Indeed, EAR, average step count, and average latency each measure model efficiency from different angles, and all these metrics are necessary for a complete evaluation. We will update the revised manuscript with these results.
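
As a toy illustration of this point (made-up step counts, not the paper's data), the snippet below contrasts EAR with the average step count: a single task with a very large saving can dominate the mean, while EAR only counts how often the critic-assisted run wins.

```python
# Toy comparison of EAR vs. average steps on hypothetical per-task step counts.
baseline_steps = [12, 9, 15, 8, 30]
critic_steps   = [10, 8, 11, 9, 5]   # one task with a huge saving skews the mean

ear = sum(c < b for b, c in zip(baseline_steps, critic_steps)) / len(baseline_steps)
avg_baseline = sum(baseline_steps) / len(baseline_steps)
avg_critic = sum(critic_steps) / len(critic_steps)

print(f"EAR = {ear:.2f}")                                    # 0.80: the critic wins on 4 of 5 tasks
print(f"avg steps: {avg_baseline:.1f} -> {avg_critic:.1f}")  # 14.8 -> 8.6, dominated by the 30 -> 5 task
```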

Q3. Typo

We thank the reviewer for pointing out the issue in our abstract; we will correct this in the manuscript.

Comment

Thank you for your detailed responses, which have largely addressed my concerns. I appreciate the clarifications provided. I will maintain my original score.

Comment

We are grateful that the reviewer has acknowledged our rebuttal and found that our responses have addressed the concerns. We appreciate your maintaining the original positive score of 5. We will thoroughly consider your recommendations in the revised manuscript, with the aim of further enhancing our paper.

Official Review
Rating: 5

The paper tackles the problem of catching step-level errors before they cripple an LLM-driven GUI agent. It introduces a pre-operative critic architecture embodied in GUI-Critic-R1, a vision-language model fine-tuned to judge a candidate action, explain why it might backfire, and suggest a fix, all before the click lands.

At the heart of the system is Suggestion-aware Group Relative Policy Optimization (S-GRPO). After a lightweight RFT cold-start, S-GRPO samples groups of candidate critiques; a tri-part reward (format, accuracy, and a novel suggestion reward that enforces actionable fixes) ranks them, and policy gradients push the model toward the best-scoring alternative. Algorithm 1 (S-GRPO) orchestrates this loop, while a built-in KL term keeps the critic tethered to a reference policy for stability.
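
For readers unfamiliar with GRPO-style training, here is a minimal sketch of the group-relative advantage step described above. The equal reward weights, field names, and the exact suggestion check are illustrative assumptions rather than the paper's formulation, and the KL regularization handled by the RL trainer is omitted.

```python
# Minimal sketch of a tri-part reward and GRPO-style group-relative advantages;
# weights and field names are illustrative assumptions, not the paper's exact recipe.
from statistics import mean, pstdev

def composite_reward(sample: dict, gt: dict) -> float:
    r_format = 1.0 if sample["follows_think_score_suggestion_format"] else 0.0
    r_acc = 1.0 if sample["score"] == gt["score"] else 0.0
    # the suggestion term only applies when the ground truth marks the action as erroneous
    r_sugg = 1.0 if gt["score"] == 0 and sample["suggestion_judged_similar"] else 0.0
    return r_format + r_acc + r_sugg

def group_relative_advantages(group: list, gt: dict) -> list:
    """Standardize each sampled critique's reward within its group."""
    rewards = [composite_reward(s, gt) for s in group]
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]
```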

Training data come from a reasoning-bootstrapping pipeline: positive actions are harvested from existing GUI trajectories; negative actions are hallucinated by an off-the-shelf agent and filtered by GPT-4o; Chain-of-Thought critiques are then bootstrapped in a progressive <thinking/score/suggestion> format to fill the gaps. The result is GUI-Critic-Train (~11k samples, 6k with full CoT) and GUI-Critic-Test spanning instruction, scenario and domain generalisation settings (GUI-I/S/W).
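
To make the pipeline's output concrete, a rough sketch of what a single training record could look like is shown below; the field names are assumptions for illustration, not the released dataset's schema. Records whose critique the verifier (GPT-4o in the paper) rejects would be dropped before training.

```python
# Assumed illustrative schema for one GUI-Critic-Train record; field names are hypothetical.
from dataclasses import dataclass

@dataclass
class CriticSample:
    screenshot_path: str   # current GUI screen
    instruction: str       # user task being executed
    candidate_action: str  # proposed action (harvested positive or sampled negative)
    label: int             # 1 = correct operation, 0 = erroneous operation
    thinking: str          # bootstrapped chain-of-thought analysis of the operation
    score: int             # critique verdict, expected to mirror the label
    suggestion: str        # corrective suggestion, non-empty when the action is erroneous
```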

With a 7B Qwen-VL backbone, GUI-Critic-R1 lifts static critic accuracy on GUI-I from 54.9% to 69.2%, and suggestion accuracy from 43.1% to 52.4%, outperforming GPT-4o and other open models. Dropped into AndroidWorld as a plug-and-play pre-critic, it pushes the downstream task success rate from 22.4% to 27.6% while shortening action traces, beating both post-hoc reflection and larger pretrained critics.

Overall, the work shows that coupling boot-strapped GUI reasoning data with S-GRPO yields a compact yet reliable critic that lets GUI agents look before they leap, preventing irreversible clicks and trimming redundant detours in real-time automation.

Strengths and Weaknesses

Strengths

Critical Safety Angle: Tackles the overlooked problem of pre-emptively vetoing bad GUI actions, turning reactive agents into agents that can “look before they leap.”

Novel Pre-Critic Architecture: Couples a vision-language backbone with explicit <thinking/score/suggestion> reasoning slots, so the critic not only flags errors but explains and repairs them in natural language.

S-GRPO Innovation: Introduces Suggestion-aware Group Relative Policy Optimization - an RL recipe that adds a suggestion reward on top of format and accuracy terms, directly optimising for actionable fixes rather than mere classification.

Boot-Strapped Dataset Contribution: Releases GUI-Critic-Train/Test, built with a reasoning-bootstrapping pipeline that mixes real trajectories, hallucinated negatives, and GPT-4o-verified critiques.

Good Empirical Gains: Lifts static critic accuracy on GUI-I from 54.9% to 69.2% and downstream AndroidWorld task success from 22.4% to 27.6%, outperforming GPT-4o and larger open models.

Lightweight Footprint: Achieves those gains with a 7B model, showing that safety and efficiency can co-exist without resorting to larger models.

Plug-and-Play Integration: Can be inserted as a decision gate in existing agents with no retraining of the actor, immediately shortening action traces and avoiding irreversible clicks.


Weaknesses

Latency Overhead: Every action now triggers a VLM forward pass, adding perceptible delay; the paper reports gains but no detailed timing budget for real-time RPA scenarios.

Limited Ablations: The ablation table omits variants without the suggestion reward or with alternative optimisers, leaving open how much each component truly matters.

False-Positive Risk: A conservative critic occasionally blocks valid but unfamiliar actions, lowering overall task throughput, yet false-negative/false-positive trade-offs are not quantified.

Questions

  1. Exactly how much end-to-end delay does the critic add per GUI action (mean, p95, worst-case) and how does it scale with screen resolution?
  2. How much does each S-GRPO reward head (format, accuracy, suggestion) contribute, and how sensitive are results to group size k?
  3. At the chosen threshold, how often does the critic wrongly block a valid action or permit a harmful one?
  4. Does a critic trained on AndroidWorld transfer to desktop or web GUIs?
  5. Will GUI-Critic-Train/Test and training scripts be released (licence + timeline)?

Limitations

Yes

Formatting Issues

  1. G in S-GRPO is expanded as "Gradient" in abstract
  2. Duplicate bibliography entries – the same Mobile-Agent-v2 citation appears twice as [36] and [37].

Author Response

We greatly appreciate your recognition of our work and your insightful and constructive feedback. We have carefully addressed each of your concerns and are confident that the resulting revisions will further improve the quality of our study.

W1 & Q1. Latency Overhead

  1. Timing budget: Following the reviewer's suggestion, we conducted a comprehensive runtime analysis on AndroidWorld benchmark as shown in the table below. The experiments measure task runtime, step counts, and success rates. To ensure fair comparison, all models (except GPT-4o) are deployed with identical hardware resources. The latency for each step includes environment interaction time, model inference time, and action execution time. "Avg Success Steps" denotes the average step number of all successful tasks, and "Avg success latency" represents the average latency of all successful tasks.
| Model | Avg Steps | Avg Success Steps | Avg Latency (s) | Avg Success Latency (s) | Success Rate (%) |
| --- | --- | --- | --- | --- | --- |
| Baseline: Qwen2.5-VL-72B | 10.48 | 9.88 | 162.06 | 130.81 | 22.4 |
| + Pre-Critic: Qwen2.5-VL-7B | 10.22 | 8.37 | 286.02 | 184.35 | 20.3 |
| + Pre-Critic: Qwen2.5-VL-72B | 7.86 | 7.27 | 188.09 | 190.07 | 23.2 |
| + Pre-Critic: GPT-4o | 8.11 | 7.90 | 232.52 | 202.22 | 22.4 |
| + Pre-Critic: GUI-Critic-R1 (Ours) | 7.19 | 7.09 | 156.91 | 127.61 | 27.6 |

The results show that:

  • Our GUI-Critic-R1 achieves a clear efficiency improvement over the Baseline: We achieve lower average latency (156.91s vs 162.06s) while substantially reducing the average number of steps (7.19 vs 10.48). Moreover, our model maintains a higher success rate (27.6% vs 22.4%), showing the effectiveness of the pre-operative critic mechanism.

  • Analysis of Latency Reduction: Although the critic mechanism introduces an additional VLM forward pass at each step, the significant reduction in the total number of steps (~31% reduction from baseline) more than compensates for this overhead. This is because our pre-operative critic mechanism can prevent erroneous actions before execution, thus avoiding time-consuming retry attempts. This early error detection capability enables the model to make more accurate decisions in fewer steps, leading to more efficient task completion.

  • Our GUI-Critic-R1 shows superior efficiency compared with other pre-critic models. It operates faster than both Qwen2.5-VL-72B critic (156.91s vs 188.09s) and GPT-4o critic (156.91s vs 232.52s). This performance advantage highlights our model's ability to strike an optimal balance between computational cost and effectiveness.

In conclusion, our proposed pre-operative critic mechanism achieves both high operational efficiency and improved task success rates, making it a practical solution for real-world GUI automation deployment. We will integrate these new experimental results into the revised manuscript.

  2. End-to-end delay: As shown in the table below, the critic introduces end-to-end latency per GUI action: 6.28s on average, with 7.11s at p95 and 7.26s in the worst case.

  3. Screen Resolution Influence: Additionally, we evaluate the inference time cost of our model at different screen resolutions. Specifically, we resize the original screenshot by the corresponding scale factor and evaluate the average inference time per test case on the GUI-I test set. As shown in the following table, we find that higher resolutions lead to slightly increased inference times, and the difference is relatively small. This indicates that image resolution does not greatly affect the inference delay.

| Scale Factor | 0.1 | 0.5 | 1.0 | 1.5 | 2.0 |
| --- | --- | --- | --- | --- | --- |
| Avg Time (s) | 6.20 | 6.27 | 6.28 | 6.35 | 6.42 |
| p95 (s) | 7.10 | 7.10 | 7.11 | 7.26 | 7.44 |
| Worst (s) | 7.23 | 7.18 | 7.26 | 7.40 | 7.63 |

W2 & Q2. Limited Ablations

  1. Ablations on S-GRPO Rewards: Due to our oversight, the annotation in the fourth row of Table 4 was incorrect; this row should present the ablation results for r_s. We apologize for this error and will correct it in the revised manuscript. As shown in the table, when the r_s term is removed, the model's performance decreases on both the Critic Accuracy and Suggestion Accuracy metrics. Furthermore, we have added ablation experiments for r_a and r_f. Removing either of these components leads to a performance drop, as the accuracy reward (r_a) and format reward (r_f) are both essential reward components for ensuring the reliability of the intra-group advantage estimation for GRPO.
| Model | Critic Accuracy | Suggestion Accuracy |
| --- | --- | --- |
| w/o r_f | 64.65 | 51.31 |
| w/o r_a | 63.82 | 49.28 |
| w/o r_s | 66.01 | 47.71 |
| Ours | 69.20 | 52.43 |
  2. Ablations on Group Size: We analyze the impact of group size in the S-GRPO phase in Section 4.3. The group size balances performance with resource utilization, leading to our selection of 6 as optimal (see Figure 3).

W3 & Q3. False-Positive Results

The false-positive (FP) and false-negative (FN) trade-offs are indeed crucial evaluation metrics for assessing a model's ability to accurately identify decision errors without misclassifying correct actions as errors. We evaluated the FP and FN metrics on the GUI-I dataset, and the results are presented in the table below. CA denotes critic accuracy.

| Model | FN (%) ↓ | FP (%) ↓ | CA (%) ↑ |
| --- | --- | --- | --- |
| GPT-4o | 10.24 | 23.17 | 66.01 |
| Qwen-7B | 19.44 | 25.93 | 56.40 |
| Qwen-72B | 20.37 | 24.75 | 54.88 |
| Ours | 8.69 | 21.84 | 69.20 |

As shown, our model achieves the lowest FN (8.69%) and FP (21.84%) rates among the compared models, outperforming GPT-4o as well as the open-source Qwen baselines on both metrics.

  • Our method achieves the lowest false-negative rate, only 8.69% FN, substantially outperforming Qwen-7B (19.44%) and even surpassing GPT-4o (10.24%).

  • This proves that our method achieves a strong balance between efficiency and effectiveness in GUI automation, significantly reducing both false rejections of valid operations and false approvals of harmful ones.

Q4. Extend to desktop or web

  1. Cross-Domain generalization to web interfaces. While the model was trained on mobile GUI data from GUI-Critic-Train dataset, we extensively evaluated its transferability to web scenarios through the GUI-W task in our GUI-Critic Benchmark. As shown in Table 1 in the main text, our GUI-Critic-R1 achieves the best performance (63.08%) compared to other open source MLLMs, indicating its effective domain adaptation from mobile to web interfaces despite the inherent differences in UI layouts and interaction patterns.

  2. Additional validation on a dynamic desktop benchmark OSWorld [1*]. To more thoroughly validate the model's effectiveness in desktop scenarios, we add a new experiment on the OSWorld benchmark, which is a challenging dynamic benchmark in a real computer environment that contains 369 computer tasks involving real web and desktop apps. We take Qwen2.5-VL-72B as the action model (baseline), and the results are shown below.

| Model | Success Tasks Number |
| --- | --- |
| Baseline: Qwen2.5-VL-72B | 38 |
| Baseline + Pre-Critic: Qwen2.5-VL-7B | 43 |
| Baseline + Pre-Critic: GUI-Critic-R1 (Ours) | 50 |

From the table, we can find that our GUI-Critic-R1 significantly boosts the number of successful tasks on OSWorld (50 vs 38) and outperforms the configuration that employs Qwen 2.5-VL-7B as the pre-critic (50 vs 43). The results show that our model also exhibits satisfactory generalization ability in desktop and web environments.

[1*] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS'24.

Q5. Release plan

We have submitted the test code in the supplementary materials, and we plan to open-source the model parameters, dataset, and related resources in the future.

Q6. Typos

We thank the reviewer for pointing out the issues in our abstract and references; we will correct these in the manuscript.

Comment

Thank you for the detailed rebuttal and the additional experiments. My main take-aways are below.


What was fully resolved

  • Latency and efficiency

    • Per-step delay numbers (mean 6.28 s, p95 7.11 s) and the AndroidWorld runtime table show that the critic’s extra forward pass is offset by the ~31 % reduction in steps.
    • Resolution sweep indicates only a marginal cost increase, so my real-time concern is largely alleviated.
  • Ablation depth

    • New reward-head ablations confirm that all three heads matter; removing any of them drops both critic- and suggestion-accuracy.
    • Group-size sweep (k = 2…6) clarifies why k = 6 was chosen.
  • Error trade-offs

    • FN (8.69 %) and FP (21.84 %) are now quantified and compare favourably to GPT-4o and open baselines.
  • Cross-domain generalisation

    • Results on GUI-W plus the new OSWorld desktop study (38 -> 50 successes) convincingly demonstrate transfer beyond AndroidWorld.
  • Formatting issues

    • Authors acknowledged the acronym typo and duplicate citations.

Impact on my scores

  • Overall assessment: Stays at 5 - Accept. The rebuttal squarely addressed all technical questions.
  • Confidence: Remains 2 (I am comfortable defending this updated assessment).
Comment

Thank you for your very comprehensive analysis of our submission! Addressing your concerns and questions greatly improved our paper, and the camera-ready version will benefit greatly from these changes.

Official Review
Rating: 4

This paper proposes a critic model, GUI-Critic-R1, which can detect errors and provide corrective suggestions before actually executing actions in GUI tasks. To train this model, the authors first construct a dataset consisting of successful trajectories and sampled incorrect operations. Based on this dataset, critiques are collected via rejective sampling. GUI-Critic-R1 is trained with 1 epoch of SFT, followed by 10 epochs of GRPO. As the dataset contains two supervised signals, correctness and suggestion, the paper introduces an additional suggestion reward into GRPO, alongside the accuracy and format rewards. Experiments demonstrate that GUI-Critic-R1 outperforms its base model, Qwen-7B, and is comparable to some closed-source models.

Strengths and Weaknesses

Strengths

  • This paper presents a complete pipeline, covering both data collection and model training.
  • This paper introduces a novel perspective by making the critic model provide not only a correctness signal but also suggestions.

Weaknesses

  • The idea of a pre-operative critic is interesting, but its necessity is not fully convincing. Since the dataset already contains step-level correctness labels and suggestions, it is natural to ask why not use this data to directly train a better policy model, instead of training a separate critic. Adding a critic makes the system more complex, and it is not clear if the extra component brings more benefit than directly improving the policy. Also, in some dynamic or partially observable GUI environments, it might be hard for the critic to predict the outcome of an action without actually running it.
  • The overall technical novelty is somewhat limited. This paper mostly adapts existing RFT + RL techniques to a specific GUI automation context, with relatively limited insights for broader domains.
  • The data collection process relies on successful trajectories from public datasets and filtering with GPT-4o, which limits the scalability of the approach. Step-level correctness signals and suggestions may become unavailable when GUI tasks extend beyond the coverage of existing datasets or exceed the reasoning capabilities of GPT-4o.

Questions

  • In Section 3.3.1, the RFT stage involves two datasets: one with CoT and one without. Are different system prompts used for these two datasets to indicate whether the response should include a CoT?
  • After the critic model outputs the correctness score, critique, and suggestion, how exactly does the agent utilize this information? If an action is judged to be incorrect, does the agent directly follow the suggested correction, or does it treat the critique and suggestion as context and re-generate a new action?
  • Typo in Table 4: the position of r_s seems to be incorrect.

Limitations

The data collection process relies on successful trajectories from public datasets and filtering with GPT-4o, which limits the scalability of the approach. Step-level correctness signals and suggestions may become unavailable when GUI tasks extend beyond the coverage of existing datasets or exceed the reasoning capabilities of GPT-4o.

Final Justification

I am satisfied with the rebuttal.

Formatting Issues

None

Author Response

We greatly appreciate the insightful and constructive feedback provided. We have carefully addressed each concern below.

W1. Necessity of pre-critic

We agree that improving the policy model can enhance GUI interaction performance. However, a pre-operative critic is also necessary in GUI automation.

1. The decoupled relation between policy and critic model and the necessity of our pre-operative critic.

(1). The policy and critic models complement the system in different dimensions.

  • Policy generates actions while critic predicts errors and provides suggestions. Both components contribute uniquely to task completion, and the combination of decision-making and safety verification creates our efficient system.
  • According to Table 2 of main text and Table 1 of Appendix, upgrading the policy from Qwen2.5-VL-72B to GPT-4o improves success rate from 27.6% to 29.4% on AndroidWorld (using GUI-Critic-R1 as pre-critic). Similarly, enhancing the critic from Qwen2.5-VL-7B to GUI-Critic-R1 improves performance from 20.3% to 27.6% (using Qwen2.5-VL-72B as policy), showing both components' upgrades enhance overall performance.

(2). Pre-operative critic is necessary in GUI automation.

  • In dynamic GUI environments, generating accurate decisions is challenging for the GUI agent, leading to high error rates [36, 27]. While self-reflection strategies fail in complex reasoning tasks [34, 52], an external pre-critic can objectively detect overlooked mistakes and provide corrective suggestions.
  • Dynamic evaluation on AndroidWorld (Table 2) and case studies (Figure 4) confirm that the pre-critic can effectively identify errors and provide remedial feedback in GUI automation, demonstrating the necessity of an explicit critic module.

2. Training an effective policy model with the GUI-Critic-Train dataset is challenging due to its larger action space versus the critic's binary classification. Policies typically need extensive training data (e.g., UI-Tars [26]). Following the reviewer's suggestion, we constructed an equally-sized policy training dataset based on GUI-Critic-Train and applied cold-start and GRPO, yielding a policy model Qwen2.5-VL-7B-Policy. We evaluate it on AndroidWorld as follows.

| Model | SR (%) |
| --- | --- |
| Policy: Qwen2.5-VL-7B | 14.6 |
| Policy: Qwen2.5-VL-7B-Policy | 15.5 |
| Policy + Pre-Critic: Qwen2.5-VL-7B + GUI-Critic-R1 | 20.6 |
| Policy + Pre-Critic: Qwen2.5-VL-7B-Policy + GUI-Critic-R1 | 21.5 |

The results clearly show that Qwen2.5-VL-7B-Policy only obtains a small gain after training (14.6% to 15.5%), which indicates that training a good policy model is difficult. Besides, adding our critic to Qwen2.5-VL-7B-Policy noticeably improves the performance from 15.5% to 21.5%, demonstrating the complementary roles of the policy and critic models and the necessity of the critic model.

3. Focused Scope and Proven Effectiveness. Regarding the challenge of predicting outcomes in complex GUI environments, we acknowledge this limitation but emphasize that our critic primarily focuses on identifying obvious errors through static analysis of the GUI state, rather than predicting all possible outcomes. Though not perfect in anticipating consequences, the 7% SR improvement on AndroidWorld (Table 3) validates its value in detecting errors and providing suggestions, enhancing task completion and efficiency. This confirms pre-critic as a promising research direction.

We will incorporate these clarifications in the revised manuscript.

W2. Technical novelty and insights for broader domains

We respectfully disagree with the assessment of limited technical novelty. Instead of simply adapting existing RFT + RL techniques to a specific GUI automation context, we advance and specialize these methods to meet the distinctive demands of the GUI pre-critic task.

1. Introduce technical innovations when compared with the standard GRPO method.

  • The GUI pre-critic task presents unique technical challenges that traditional RFT+RL methods cannot directly address, as it requires a more reliable thought process (critique). Specifically, traditional RL rewards only focus on the accuracy of the final answer. In the GUI pre-critic, simply adopting the same technique leads to a critical credit assignment problem: when the model makes a correct judgment, it's difficult to determine whether this resulted from proper reasoning or merely superficial pattern matching. This problem is acute in GUI scenario where reliable reasoning is crucial for both action judgment and suggestion generation.
  • To address this challenge, we introduce a suggestion reward as an innovative indirect supervisory signal for CoT reasoning, through measuring the semantic reliability of the generated suggestions. This reward is crucial as producing accurate revision suggestions requires proper analysis of the GUI operation, thereby encouraging the model to develop robust reasoning capabilities. The experiments in Table 4 demonstrate that this simple yet effective innovation of reward raises Suggestion Accuracy (SA) by 4.72 % and 7.02 % on GUI-I and GUI-S, respectively.
| RFT | r_f | r_a | r_s | CA (GUI-I) | SA (GUI-I) | CA (GUI-S) | SA (GUI-S) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | ✓ | ✓ | ✗ | 66.01 | 47.71 | 57.89 | 40.35 |
| ✓ | ✓ | ✓ | ✓ | 69.20 | 52.43 | 58.77 | 47.37 |

2. Introduce two key methodological innovations in reasoning-bootstrapping based data collection pipeline. To address the lack of suitable training data for GUI critics, we propose the following two key modules:

  • A progressive Chain-of-Thought paradigm specifically designed for step-by-step GUI operation analysis,
  • A Reasoning-Bootstrapping pipeline that enables automatic generation of high-quality critique data

Table 3 in the main text shows that incorporating these two components yields consistent gains, validating their effectiveness.

3. Broader applicability to multimodal interactive domains. Our core methodological contributions are domain-agnostic and can be readily adapted to other online domains.

  • The S-GRPO framework naturally extends to scenarios where reliable reasoning and error prevention are crucial, such as robotic manipulation where operational mistakes can be costly.
  • Our reasoning-bootstrapping method also provides practical insights for generating high-quality training data in dynamic environments.

W3. Scalability of the data collection process

1. Regarding the public-dataset concern, our data pipeline supports multiple data sources.

  • We support various data sources. For example, private/custom data from real-world applications and synthetic data generated by automated systems can be used. Public datasets are chosen for their accessibility, quality, representativeness, and diversity.
  • We acknowledge that successful trajectories are necessary as ground truth for correct GUI operations. This fundamentally aligns with how a pre-critic model should be learned - by referencing successful examples combined with generated negative samples.
  • When encountering GUI tasks beyond existing datasets, we can use automated GUI exploration methods [1*][2*] to collect successful trajectories. The negative sampling strategy and rule-based validation criteria can then generate training data, as in Section 3.1. Notably, both modules are domain-agnostic: the former capitalizes on the MLLM's general understanding of GUI interactions and focuses on universal principles (e.g., visibility and clickability) rather than being tailored to specific GUI scenarios, while the latter applies consistent rules across different GUI contexts.

2. Regarding the GPT-4o concern, our GPT-4o filtering is optional, not mandatory.

  • As described in Section 3.1, GPT-4o filters erroneous critic annotations for data quality. Without filtering, the ablation study w/o DF in Table 3 reveals only a 2% performance drop and still demonstrates strong performance versus open-source baselines in Table 1. These results indicate that our pipeline's effectiveness isn't entirely GPT-4o dependent.
  • GPT-4o's reasoning capability concern is also addressable. The GPT-4o filtering is optional with minimal impact on results. Moreover, GPT-4o is just one possible choice. Our framework can incorporate newer, more capable models as they emerge, and multiple models or verification methods can be combined to enhance reliability.

We will incorporate the aforementioned analysis in the final version of the paper.

[1*] UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

[2*] GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

Q1. Distinct prompts for CoT vs. no-CoT sets

Yes. Different system prompts are used for the CoT and no-CoT datasets. The CoT dataset uses the prompt in Appendix Table 2, while the no-CoT dataset removes the thought-related format, requesting only the score and suggestion. We'll add this clarification to Section 4.1 and include both prompts in the Appendix.

Q2. How does the agent use the critic output?

The agent treats the critic’s feedback as context and regenerates a new action.

At each step, if the critic gives a positive correctness score, the action is executed; otherwise, the agent replans with the critique and suggestion in context. The feedback integration prompt is in Appendix Table 5. We will add this clarification to Section 4.4.2 in the revised manuscript.

This design leverages complementary roles: policy generates actions while critic detects errors and provides suggestions. Due to GUI environments' dynamic nature, policy alone has high error rates [36]. An external pre-critic offers actionable suggestions, making it essential for reliable GUI automation.
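
A hedged sketch of that critic-gated loop is shown below; the callable names and signatures are placeholders standing in for the actual action model and GUI-Critic-R1, not the authors' API.

```python
# Sketch of a pre-operative, critic-gated step loop; all names are placeholders.
from typing import Callable, Optional, Tuple

def critic_gated_step(
    observation: dict,
    instruction: str,
    actor_fn: Callable[[dict, str, Optional[Tuple[str, str]]], dict],
    critic_fn: Callable[[dict, str, dict], dict],
    max_replans: int = 2,
) -> dict:
    """Return the first proposed action the pre-critic approves, or the last proposal."""
    feedback: Optional[Tuple[str, str]] = None
    action: dict = {}
    for _ in range(max_replans + 1):
        action = actor_fn(observation, instruction, feedback)   # actor proposes an action
        verdict = critic_fn(observation, instruction, action)   # pre-critic vets it before execution
        if verdict.get("score") == 1:                            # positive verdict: execute as-is
            return action
        # negative verdict: replan with the critique and suggestion added to the actor's context
        feedback = (verdict.get("critique", ""), verdict.get("suggestion", ""))
    return action  # fall back to the last proposal after exhausting replans
```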

Q3. Typo

We sincerely appreciate your catching this issue. In Table 4, the ✗ in the fourth row should be under r_s. We will correct this in the manuscript, with the updated line shown in the W2 response.

Comment

Thank you very much again for your thoughtful evaluation and the valuable feedback. We have carefully considered all of the queries and have incorporated revisions to address the concerns. We hope that our responses have adequately resolved the concerns. Please let us know if you have any remaining questions or if there are any other aspects you would like us to clarify further. We are grateful for your guidance and look forward to your final assessment.

Comment

I have increased the score accordingly. The rebuttal is strong, and I am satisfied.

Comment

We would like to reiterate our sincere appreciation. We are pleased to learn that you are satisfied with our response. We will incorporate clarifications addressing the concerns in the final version. This will be tremendously beneficial in enhancing the quality of our paper.

Final Decision

The paper introduces GUI-Critic-R1, a pre-operative critic mechanism for Multimodal Large Language Models (MLLMs) in GUI automation. The key contributions are:

  1. Pre-operative Error Detection: Unlike post-hoc correction, the critic evaluates actions before execution, preventing irreversible errors (e.g., accidental deletions or payments).
  2. S-GRPO Training: A novel Suggestion-aware Group Relative Policy Optimization method that incorporates a suggestion reward to improve critique reliability.
  3. Dataset Creation: A reasoning-bootstrapping pipeline generates GUI-Critic-Train and GUI-Critic-Test, addressing the lack of GUI critic datasets.
  4. Empirical Results:
    • Static Evaluation: Outperforms open-source MLLMs (e.g., Qwen-7B) and matches GPT-4o in critic accuracy.
    • Dynamic Evaluation: On AndroidWorld, integrating GUI-Critic-R1 improves task success rates and reduces steps.

All reviewers are satisfied with the rebuttal and give an acceptance recommendation, though minor weaknesses remain (e.g., domain generalization).

Final Decision: Accept.