GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
Abstract: This paper investigates what kind of R1-Zero-like training is suitable for grounding tasks in GUI agents.
Reviews and Discussion
This paper studies methods for training agents to navigate Graphical User Interfaces (GUIs) with R1-style RL post-training. The authors investigate three design components: the effect of input templates, the effect and limitations of reward functions used for GUI environments, and the policy objective. They propose several improvements to these components, tailored to GUI environments:
- whereas templates which encourage long chains of thought (CoT) have proven effective in reasoning-based tasks (like math and coding), here they find that they are detrimental for more vision-based tasks like localizing objects in a screenshot, and that removing instructions encouraging CoTs improves performance for object grounding in GUIs.
- they find that existing reward functions for object grounding in GUIs, such as hit rate or IoU, lead to reward hacking behaviors. Based on this observation, they propose a different reward function which is not prone to these types of reward hacks.
- they find that the GRPO objective introduces two biases for GUI-based tasks: 1) a length bias, due to a term that divides by the output length, and 2) a difficulty bias, due to the division by the standard deviation of rewards. They propose two modifications: the division by output length is replaced by a constant, and the objective is weighted by a proxy for question difficulty, which here is based on the bounding box size.
With these 3 modifications, the system is trained on GUI-based benchmarks, where it leads to improvements over previous methods.
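For concreteness, below is a minimal sketch of a box-size-aware grounding reward of the kind described above, combining a hit check, IoU, and a term that penalizes mismatched box sizes. The hit definition, the weights, and the size term are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only (not the paper's exact reward): a grounding reward
# combining a hit check, IoU, and a box-size term so that neither tiny boxes
# (which can hack hit-based rewards) nor oversized boxes (which can hack
# IoU thresholds) are trivially rewarded. Boxes are (x1, y1, x2, y2).

def box_area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(pred, gt):
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(pred) + box_area(gt) - inter
    return inter / union if union > 0 else 0.0

def hit(pred, gt):
    # Assumed hit test: the predicted box center lies inside the ground-truth box.
    cx, cy = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    return float(gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3])

def grounding_reward(pred, gt, w_hit=0.4, w_iou=0.4, w_size=0.2):
    # Size term: ratio of the smaller to the larger box area, which is low for
    # both degenerate tiny boxes and inflated oversized boxes.
    a_p, a_g = box_area(pred), box_area(gt)
    size_term = min(a_p, a_g) / max(a_p, a_g) if max(a_p, a_g) > 0 else 0.0
    return w_hit * hit(pred, gt) + w_iou * iou(pred, gt) + w_size * size_term
```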
First, I should mention that GUI-based agents are not my research area, and I am not very familiar with the main methods, benchmarks, and prior work, hence my review and recommendation should be weighted accordingly. Overall, this paper seems very specific to GUI-based environments: the template, reward functions, and modifications to the GRPO objective are motivated by this class of problems and the failure modes of previous methods on them. While the improvements are reasonable, it's unclear to me that these insights would transfer to other domains or have broader applicability (the paper does not claim or present any evidence of this). The paper seems quite application-focused, and I think it might be better suited to a venue more focused on this area, rather than NeurIPS, which (at least historically) tends to be more focused on general ML methods. For this reason, I'm recommending reject.
Strengths and Weaknesses
Strengths:
- The ablations and analyses appear carefully done, with the claims backed up by results and visualizations showing the described behavior and failure modes.
- The empirical results for the final system appear to improve noticeably over the baselines, and many different methods are included in the comparison.
Weaknesses:
As mentioned above, the main weakness in my opinion is that the insights do not transfer much outside the area of GUI-based agents. They are quite specific to the modalities which the current system uses (for example, bounding boxes over areas of the screen, reward functions based on bounding boxes and object grounding on the screen, etc). So, other than researchers specifically working in this area, the paper may not be that relevant.
Questions
The input and output spaces described in lines 102-104 could be clearer.
- For example, it wasn't clear to me what exactly the action space of the agent is: does it jointly produce text and a bounding box? Are there separate heads for each? Is it all in tokens produced autoregressively? Are other keys part of the action space?
- What exactly is the input? I understand it's an image; is it just RGB pixels, or are there other post-processing steps applied?
Limitations
Yes.
Final Justification
One of my main concerns was that the scope of the paper is overly specific to GUI-based environments for NeurIPS (as noted in my review, this paper is outside my research area). However, after reading the other reviews and some of the references shared by the authors, I see that this work may be within scope for NeurIPS. Therefore, I am raising my score from 3 to 4. My confidence is still at 1 however, so I largely defer to other reviewers who have more expertise in this area.
Formatting Issues
None.
We sincerely thank the reviewer for the constructive comments and suggestions, which are very helpful for improving our paper. We are also grateful that you recognized the strengths of our paper. Please find our point-by-point responses below.
Weakness. Broader Impact Beyond GUI: As mentioned above, the main weakness in my opinion is that the insights do not transfer much outside the area of GUI-based agents. They are quite specific to the modalities which the current system uses (for example, bounding boxes over areas of the screen, reward functions based on bounding boxes and object grounding on the screen, etc). So, other than researchers specifically working in this area, the paper may not be that relevant.
Thanks for your detailed comments. While our work is evaluated in the context of GUI agents, we believe its contributions are broadly relevant to the ML community. In fact, many recent works on GUI agents have been published at major ML venues [1–5], reflecting growing interest in this area within the community.
In particular, our methods offer insights that extend beyond GUI environments:
- Fast vs. Slow Thinking: We introduce the text ratio metric to guide when a model should perform explicit reasoning. This principle can apply to a wide range of multimodal models, not just those for GUI tasks.
- Reward Design and Hacking: Grounding tasks using bounding boxes are common in MLLMs such as Qwen2.5-VL [6], VLM-R1 [7], and Visual-RFT [8], not just in GUI-based environments. Although our experiments focus on GUI agents, the reward design challenges we study and the methods we propose are applicable to these general grounding tasks as well. Our analysis provides insights for improving reward functions in common visual tasks and, as noted by Reviewer jR9M, also offers guidance for future work on GUI agents.
- Difficulty-Aware Learning: Our use of difficulty coefficients in GRPO training is based on the relative size of the target object. This is a practical design that can be easily applied to other multimodal tasks where the difficulty of the target can be estimated.
Although our experiments are focused on GUI environments, the core methods are applicable in broader multimodal and RL settings. We appreciate the reviewer’s perspective and would be happy to further elaborate on the broad impact of our work in the updated version.
Q1. Model Output Action Space Details: For example, it wasn't clear to me what exactly the action space of the agent is: does it jointly produce text and a bounding box? are there separate heads for each? Is it all in tokens produced autoregressively? Are other keys part of the action space?
Thanks for your question about the action space of our model. Our action space consists solely of text tokens. The bounding box information is represented as part of the text output, generated autoregressively by the model. There are no other elements in the action space.
For example, the model might output a JSON-like text sequence such as:
[
{"bbox_2d": [1726, 226, 1787, 264], "label": "locked rotation"}
]
In this example, the bounding box coordinates [1726, 226, 1787, 264] are embedded within the text tokens generated by the model. This is consistent with the approach used in Qwen2.5-VL.
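Since the coordinates live inside ordinary text tokens, any downstream reward or evaluation step has to parse them back out. Below is a minimal parsing sketch for the JSON-like output above; the exact parsing logic used in the paper is not specified, so this is only an illustration.

```python
import json
import re

def parse_bbox(model_output: str):
    """Extract the first bbox_2d entry from a JSON-like model response.

    Illustrative assumption: the output contains a JSON array such as
    [{"bbox_2d": [x1, y1, x2, y2], "label": "..."}]. Returns the four
    coordinates, or None if parsing fails (which a reward function would
    typically score as zero).
    """
    match = re.search(r"\[.*\]", model_output, re.DOTALL)
    if match is None:
        return None
    try:
        items = json.loads(match.group(0))
        return [float(v) for v in items[0]["bbox_2d"]]
    except (ValueError, KeyError, IndexError, TypeError):
        return None

# Example with the output shown above:
# parse_bbox('[{"bbox_2d": [1726, 226, 1787, 264], "label": "locked rotation"}]')
# -> [1726.0, 226.0, 1787.0, 264.0]
```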
We will include a detailed example of the model’s input and output formats in the appendix. Thank you for your helpful question.
Q2. Model Input Description: What exactly is the input - I understand it's an image, is it just RGB pixels? Or are there other post-processing steps applied.
Thanks for your question about the model input. Our model takes two parts as input:
- Text input, which includes the instruction template and the user question. For example: "Grounding instruction is: {user question}. Help to locate and output its bbox coordinates using JSON format."
- Image input, which is provided as raw RGB pixels. The image is passed into an image encoder that processes the pixels into a sequence of tokens. These image tokens are then combined with the text tokens and fed jointly into the language model for forward computation and text generation (a rough input-assembly sketch is given below).
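For illustration, a minimal sketch of how such an input could be assembled and run with the Hugging Face transformers interface for Qwen2.5-VL is given below. The class names assume a recent transformers release with Qwen2.5-VL support, and the file name and question are placeholders; this is not the authors' training code.

```python
# Sketch only: assembling the grounding prompt and a raw RGB screenshot for
# Qwen2.5-VL via Hugging Face transformers (class names assume a recent
# release with Qwen2.5-VL support; paths and the question are placeholders).
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")

screenshot = Image.open("screenshot.png").convert("RGB")   # raw RGB pixels
question = "the lock rotation toggle"                      # hypothetical user question
prompt = (f"Grounding instruction is: {question}. "
          "Help to locate and output its bbox coordinates using JSON format.")

messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# The processor encodes the image into vision tokens and interleaves them with
# the text tokens; the joint sequence is fed to the language model.
inputs = processor(text=[text], images=[screenshot], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```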
We hope this clarifies the input processing of our model. Please let us know if you would like more details.
We hope the above response can fully address your concerns. We really appreciate your insightful and constructive comments to further help us improve the quality of our work. Please do not hesitate to let us know if you have any further questions. Thanks again!
[1] Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration. NeurIPS 2024.
[2] DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents. ICLR 2025.
[3] Lightweight Neural App Control. ICLR 2025.
[4] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. ICLR 2025 Oral.
[5] OS-ATLAS: A Foundation Action Model for Generalist GUI Agents. ICLR 2025.
[6] Qwen2.5-VL Technical Report. arXiv.
[7] VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model. arXiv.
[8] Visual-RFT: Visual Reinforcement Fine-Tuning. arXiv.
This paper performs a systematic study on the impact of using R1-Zero-like training for GUI agents along three axes: input design, output evaluation, and policy update. The authors note that the current R1-Zero paradigm causes performance degradation for GUI agents. They propose solutions along each axis tailored to GUI agents and achieve state-of-the-art performance within the same model size family, surpassing even the larger UI-TARS-7B, on the ScreenSpot and ScreenSpot-Pro datasets. This paper provides useful insights for the GRPO training design of GUI agents.
Strengths and Weaknesses
Thank you for sharing your work. Below are some strengths/weaknesses points. The main point that could be very useful to address is studying performance on different base models (please see the questions section).
Strengths:
- Significance & Originality: This paper points to interesting insights regarding training GUI agents with GRPO. To the best of my knowledge, this is the first paper to do so.
- Clarity & quality: The paper is well organized and the writing is clear. Figures/tables are clear and helpful.
Weaknesses:
- Clarity & quality:
- The insights highlighted in the paper are only evaluated on one base model. Thus, it is unclear whether the results found are model-specific or general.
- It is generally important to evaluate RL methods over several runs due to high variance.
Questions
- Do the authors plan to release their checkpoint? This model could be very useful for the research community to use and build on top of.
- How does the proposed method perform when using different base model families? How does this method scale with model size?
- Would it be possible to perform multiple runs and report mean/std?
- Could there be any potential downstream effects on GUI control tasks when using the proposed training paradigm (e.g., degradation in downstream web navigation performance due to training to limit reasoning)?
Limitations
Yes to negative societal impact. No to limitations regarding statistical significance testing and base model ablations (please see the questions section).
Final Justification
Authors addressed most of my concerns. Hence I have reflected that in my updated scores.
Formatting Issues
None
We sincerely thank the reviewer for the constructive comments and suggestions, which are very helpful for improving our paper. We are also grateful that you recognized the strengths of our paper. Please find our point-by-point responses below.
Q1. Model Checkpoint Availability: Do the authors plan to release their checkpoint? This model could be very useful for the research community to use and build on top of.
Thanks for the valuable suggestion. We agree that open-sourcing the model is important for supporting the research community. Due to previous company policies, we are unable to release the original checkpoints and data used in this paper.
However, during the rebuttal period, we re-trained the model (backbone: Qwen2.5-VL-3B-Instruct) using our proposed method on the dataset from UI-R1 on an external server (4×L20 GPUs), and achieved performance exceeding that reported in the original paper (presented in the following responses). This new checkpoint, training code, and training datasets have all been released on HuggingFace. However, due to the rebuttal policy that prohibits including any form of URL links in the response, we are unable to provide the direct link here. We will ensure that all relevant materials remain publicly accessible for better reproducibility.
In addition, we find that using a smaller gradient_accumulation_steps=1 (compared to 32 in the main table) can further improve performance. Based on this finding, we plan to continue training on larger datasets such as GUI-R1-3K and the public datasets used in the paper, and will release those checkpoints as well for supporting future research. Thanks again for the suggestion.
Q2 & W1. Cross-Family and Scale Evaluation: How does the proposed method perform when using different base model families? How does this method scale with model size? & The insights highlighted in the paper are only evaluated on one base model. Thus, it is unclear whether the results found are model-specific or general.
Thank you for the helpful suggestion. We agree that testing across different model families and sizes can better demonstrate the generality of our method. To this end, we conducted experiments on two additional models: InternVL3-2B-Instruct (a different mainstream family and smaller size) and Qwen2.5-VL-7B-Instruct (a larger size). We also retrain Qwen2.5-VL-3B-Instruct using our method for comparison.
Since UI-R1 was originally trained on Qwen2.5-VL-3B, we follow the same training protocol from the UI-R1 paper and retrain Qwen2.5-VL-7B-Instruct and InternVL3-2B-Instruct on the UI-R1 dataset. Due to resource constraints, we used LoRA for training the 7B model, while other models were fully fine-tuned. For consistency and fair comparison with UI-R1-3B, we used the same training data and training settings.
Performance on ScreenSpot
| Model | Mobile | Desktop | Web | Avg. |
|---|---|---|---|---|
| Qwen2.5-VL-3B | | | | |
| basemodel | 85.6 | 70.4 | 65.9 | 74.8 |
| UI-R1 | – | – | – | 85.4 |
| GUI-G1 | 96.2 | 87.4 | 84.3 | 89.8 |
| Qwen2.5-VL-7B | | | | |
| basemodel | 88.2 | 80.5 | 80.1 | 83.4 |
| UI-R1 (re-train) | 80.6 | 78.7 | 84.2 | 81.4 |
| GUI-G1 | 88.8 | 82.3 | 80.8 | 84.4 |
| InternVL3-2B-Instruct | | | | |
| basemodel | 0.2 | 13.2 | 0.2 | 3.6 |
| UI-R1 (re-train) | 0.4 | 14.4 | 0.5 | 4.1 |
| GUI-G1 | 0.8 | 16.8 | 0.2 | 4.8 |
Performance on ScreenSpot-Pro
| Model | CAD | Development | Creative | Scientific | Office | OS | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | | | | | | | |
| basemodel | 28.4 | 22.1 | 26.7 | 36.6 | 33.0 | 22.4 | 28.1 |
| UI-R1 | – | – | – | – | – | – | 17.8 |
| GUI-G1 | 44.3 | 39.4 | 39.1 | 49.9 | 56.1 | 35.5 | 43.8 |
| Qwen2.5-VL-7B | | | | | | | |
| basemodel | 21.1 | 27.4 | 27.0 | 38.2 | 46.1 | 30.1 | 31.1 |
| UI-R1 (re-train) | 22.6 | 27.8 | 29.6 | 37.8 | 50.9 | 39.3 | 33.7 |
| GUI-G1 | 20.7 | 28.8 | 28.7 | 40.6 | 45.2 | 36.2 | 32.6 |
| InternVL3-2B-Instruct | | | | | | | |
| basemodel | 3.1 | 0.3 | 1.2 | 1.6 | 0.4 | 0.5 | 1.2 |
| UI-R1 (re-train) | 7.3 | 0.0 | 1.5 | 5.9 | 1.7 | 1.0 | 2.8 |
| GUI-G1 | 6.9 | 1.3 | 1.5 | 5.5 | 1.7 | 0.5 | 2.9 |
Key Findings:
- Model families: Our method consistently improves performance across both the Qwen and InternVL families, despite InternVL showing generally lower baseline performance. This aligns with recent trends, where leading GUI agents [1–3] predominantly adopt Qwen as the base model.
- Model size: Interestingly, scaling from 3B to 7B resulted in performance drops. We attribute this to two factors: (a) the 7B model was trained with LoRA instead of full fine-tuning, and (b) larger models may not benefit as much from task-specific grounding fine-tuning, as also observed in Table 1 of [2]. Moreover, UI-TARS-72B performs worse on ScreenSpot (88.4) than the smaller UI-TARS-7B (89.5). Our experiments on the Qwen series further support this observation, suggesting that grounding performance does not necessarily scale with model size.
- Across all models and settings (except Qwen2.5-VL-7B on ScreenSpot-Pro), our method consistently outperforms the UI-R1 approach, further validating its effectiveness.
Thank you again for the suggestion. It helps us better assess the robustness and scalability of our method, and we will include these results in the final version.
Q3 & W2. Multiple Runs and Statistical Consistency: Would it be possible to perform multiple runs and report mean/std? & It is generally important to evaluate RL methods over several runs due to high variance.
Thank you for your suggestion. We have conducted three independent runs of our method (GUI-G1) on the UI-R1 dataset and settings, using random seeds 2023, 2024, and 2025. For evaluation, we used the ScreenSpot and ScreenSpot-Pro test sets respectively. The results are highly consistent, with small variance across runs.
- On ScreenSpot, the average performance from the three runs is 0.89780, 0.90094, and 0.89544 → mean = 0.8987, standard deviation = 0.0028, 95% confidence interval = [0.8938, 0.9035].
- On ScreenSpot-Pro, the average performance from the three runs is 0.43896, 0.43833, and 0.43580 → mean = 0.4377, standard deviation = 0.0017, 95% confidence interval = [0.4344, 0.4410].
We have updated the paper to include these statistics. Thank you again for your helpful feedback.
Q4. Downstream GUI Task Impact: Could there be any potential downstream effects on GUI control tasks when using the proposed training paradigm (e.g., degradation in downstream web navigation performance due to training to limit reasoning)?
Thank you for your helpful question. There are two types of GUI agents in current systems:
- Agents that output both action types and grounding coordinates. These can complete tasks alone.
- Agents that split planning and grounding. The planner handles reasoning, and the grounding model only needs to find the correct UI element.
Our model belongs to the second design. It does not perform explicit reasoning, and this does not hurt downstream performance. Instead, better grounding improves the agent’s success rate.
We follow the Table 4 setup from [4] and test on the Cross_Web split of Multimodal-Mind2Web. The planner was GPT-4o. The results are:
| Grounding Model | Element Accuracy (%) |
|---|---|
| Qwen2.5-VL-3B-Instruct | 45.82 |
| UI-R1-3B (w/ reasoning) | 43.54 |
| GUI-G1-3B (w/o reasoning) | 46.04 |
As shown, GUI-G1 achieves the best performance, even without reasoning ability. This suggests that strong grounding is more important than reasoning in this setup.
We hope the above response can fully address your concerns. We really appreciate your insightful and constructive comments to further help us improve the quality of our work. Please do not hesitate to let us know if you have any further questions. Thanks again!
[1] UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning. arXiv.
[2] GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents. arXiv.
[3] InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners. arXiv.
[4] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. ICLR 2025 Oral.
Thanks for addressing most of my concerns. In light of these updates, I am happy to increase my score. Kindly please ensure to add the updates in the writing.
Thank you very much for your positive feedback and recognition of our work. We’re glad that the revisions have addressed your concerns. We will make sure to incorporate these updates into the final version of the paper.
This paper performs a comprehensive analysis of R1-style training for the GUI grounding objective. They identify three bottlenecks. 1. GUI grounding performance actually decreases with increasing output tokens (thinking tokens). 2. Using only hit-based or IoU-based reward functions leads to reward hacking, with hit-based rewards favoring smaller boxes and IoU-based reward functions favoring larger boxes. 3. The GRPO objective leads to overfitting on easier-to-ground samples due to the generation of fewer output tokens. They see a trend where the examples the model gets correct have increasingly shorter chains of thought and the examples it gets wrong have increasingly longer chains of thought.
The authors propose GUI-G1 model to mitigate these three issues. This is done through the use of a Fast thinking template, which skips reasoning, a combination of IOU, hit-based, and additional box-based reward functions, and the use of a difficulty-aware objective, which favors the contribution of examples with smaller ground truth bounding boxes.
Strengths and Weaknesses
Strengths
- This paper is well-written and timely. The authors clearly analyse and describe the problems encountered by current GUI grounding models and propose well-motivated solutions to these problems.
- The performed analysis is thorough and offers useful insights that can be used in future work for improving on GUI grounding models.
- The GUI-G1 model shows good positive results on both ScreenSpot and ScreenSpot-pro.
- Interestingly, the method requires less training data, and the resulting model gets better results in fewer output tokens compared to previous works.
Weaknesses
- The section on the thinking template is a bit confusing. I have added some follow-up questions in the questions and suggestions below. More importantly, an ablation exclusively on the thinking templates is missing. While the analysis does point to better results with fewer tokens, this analysis is performed on a different model (InfiGUI-R1). Also, the analysis only gives a hypothesis that needs to be validated through an explicit ablation on the GUI-G1 model.
- The net increase from the difficulty-aware addition to the objective seems very small in Table 2. I think this objective should be evaluated on Screenspot-pro to reflect its benefit.
Questions
Questions:
- In Figure 2, right, what are the exact specifications of Low and High Text ratios, and how do you get them? Is this an intervention experiment where low is with a fast thinking template and high is with a slow thinking template? Or is it just an observation where you threshold some value that defines how many is high and how many is low?
- From my understanding, higher image tokens are obtained with higher resolution tokens only. So, how do you get different text ratios for the same image size? Is this just by controlling the number of generated tokens? Is there some other way of enhancing the image and increasing the image tokens?
Suggestions:
- Please include the UI-TARS-7B and UGround-v1-7B baselines in the GUI-specific models section of Table 4 for a better comparison, since you are comparing with other 7B models, and these models are compared in other tables.
- Line 198 should say Figure 6.
- The difficulty-aware objective should be validated on Screenspot-Pro.
Limitations
Yes.
Final Justification
The authors sufficiently addressed my key concerns regarding the following points:
- Explicit Ablation on thinking template
- Experiment showing the impact of DifficultyReweight on the Screenspot-Pro benchmark
- Inclusion of additional requested baselines
I would like to maintain my positive score.
Formatting Issues
None.
We sincerely thank the reviewer for the constructive comments and suggestions, which are very helpful for improving our paper. We are also grateful that you recognized the strengths of our paper. Please find our point-by-point responses below.
W1. Thinking Template Ablation Results: The section on the thinking template is a bit confusing. I have added some follow-up questions in the question suggestions. More importantly, an ablation exclusively on the thinking templates is missing. While the analysis does point to better results with fewer tokens, this analysis is performed on a different model (InfiGUI-R1). Also, the analysis only gives a hypothesis that needs to be validated through an explicit ablation in the GUI-G1 model.
Thanks for the thoughtful suggestion. We agree that an explicit ablation on the thinking templates is necessary to validate our hypothesis. To this end, we conducted a controlled experiment using the Qwen2.5-VL-3B-Instruct model, trained on the dataset used in UI-R1 [1]. We trained two versions—with and without the thinking template—under identical settings taken from UI-R1, fully fine-tuning all model parameters. The results are as follows:
Performance on ScreenSpot
| Template | Mobile | Desktop | Web | Avg. |
|---|---|---|---|---|
| Thinking | 94.6 | 84.7 | 84.9 | 88.7 |
| No_Thinking | 96.2 | 87.4 | 84.3 | 89.8 |
Performance on ScreenSpot-Pro
| Template | CAD | Development | Creative | Scientific | Office | OS | Avg. |
|---|---|---|---|---|---|---|---|
| Thinking | 39.5 | 39.5 | 37.8 | 48.8 | 51.3 | 40.3 | 42.4 |
| No_Thinking | 44.3 | 39.4 | 39.1 | 49.9 | 56.1 | 35.5 | 43.8 |
The results show that encouraging direct answers (Fast Thinking) consistently yields better performance, supporting our hypothesis. We appreciate your feedback in helping us strengthen the analysis, and we will add this experiment to the final version.
W2 & S3. ScreenSpot-Pro Difficulty Reweighting Ablation: The net increase from the difficulty-aware addition to the objective seems very small in Table 2. I think this objective should be evaluated on Screenspot-pro to reflect its benefit. & The difficulty-aware objective should be validated on Screenspot-Pro.
Thanks for the suggestion. We evaluate the same checkpoint used in Table 2 on the ScreenSpot-Pro dataset, and the results are shown below:
| Method | CAD | Development | Creative | Scientific | Office | OS | Avg. |
|---|---|---|---|---|---|---|---|
| GRPO | 37.9 | 38.1 | 37.8 | 45.7 | 51.3 | 29.1 | 40.0 |
| -LengthBias | 39.5 | 37.8 | 38.7 | 46.9 | 51.7 | 28.1 | 40.5 |
| +DifficultyReweight | 41.4 | 44.2 | 42.5 | 49.6 | 59.6 | 38.3 | 45.7 |
The results show that both the length bias correction and difficulty reweighting improve performance on ScreenSpot-Pro. The impact of removing the length bias is relatively small, which we suspect is because ScreenSpot-Pro consists primarily of high-resolution images—thus, the influence of text tokens is minimal compared to the large number of image tokens. In contrast, the proposed difficulty reweighting leads to significant gains, further validating its effectiveness in helping the model better handle harder samples.
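For clarity, the sketch below illustrates the two ablated components in a GRPO-style advantage computation: replacing the per-response length normalization with a constant, and reweighting each group's advantages by a difficulty coefficient derived from the relative size of the ground-truth box. The specific constant, the coefficient form, and the decision to keep the standard-deviation normalization are assumptions for illustration, not the paper's exact equations.

```python
import numpy as np

def grpo_advantages(rewards, gt_box_area, image_area, response_lengths,
                    remove_length_bias=True, difficulty_reweight=True,
                    max_len=1024):
    """Sketch of the ablated advantage computation for one sampled group.

    Assumptions (for illustration only): the group mean/std normalization of
    vanilla GRPO is kept; "difficulty" is proxied by how small the ground-truth
    box is relative to the image; the length bias is removed by scaling
    per-token losses with a constant (1 / max_len) instead of 1 / |o_i|.
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()
    std = rewards.std()
    if std > 0:
        adv = adv / std                      # vanilla GRPO group normalization

    if difficulty_reweight:
        rel_size = float(gt_box_area) / float(image_area)
        difficulty = 1.0 - rel_size          # smaller target -> harder sample
        adv = adv * (1.0 + difficulty)       # up-weight harder samples

    if remove_length_bias:
        token_scale = 1.0 / max_len          # constant denominator
    else:
        token_scale = 1.0 / np.asarray(response_lengths, dtype=float)
    return adv * token_scale
```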
Note: We also observe that the results in this table are better than those reported in the main table, showing the same trend as the results in W1. We suspect this is because both UI-R1 and our analysis experiments used gradient_accumulation_steps of 1, whereas the main table used a larger value of 32. For GRPO training on the grounding task with Qwen2.5-VL-3B-Instruct, faster updates (i.e., smaller accumulation steps) appear to lead to better performance. Moving forward, we plan to train on the full dataset with smaller gradient_accumulation_steps and release the updated model.
Thanks again for your helpful feedback and we will add the results to the final version.
Q1. Low/High Text Ratio Explanation: In Figure 2, right, what are the exact specifications of Low and High Text ratios, and how do you get them? Is this an intervention experiment where low is with a fast thinking template and high is with a slow thinking template? Or is it just an observation where you threshold some value that defines how many is high and how many is low?
Thanks for your question. As defined in Line 105, the Text Ratio is computed as: (n_ins + n_thinking) / (n_ins + n_thinking + n_img), which represents the proportion of text tokens in the input sequence before generating the bounding box. A higher text ratio indicates that more textual information is used to generate the grounding box.
To plot Figure 2 (right), we first divide the data into three groups based on image size. Within each group, we calculate the Text Ratio for all examples and then split them in half: the 50% with lower text ratios are categorized as Low Text Ratio, and the remaining 50% as High Text Ratio. The process is illustrated in the caption of Figure 2.
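A small sketch of this grouping procedure is shown below, assuming per-example token counts are available; splitting the image-size groups into equally sized bins is our assumption here, and the variable names are illustrative.

```python
import numpy as np

def split_by_text_ratio(n_ins, n_think, n_img, n_size_bins=3):
    """Bin examples by image-token count, then median-split each bin into
    Low / High text-ratio halves, as described for Figure 2 (right).

    All arguments are per-example token counts (instruction, thinking, image).
    Returns a list of (low_indices, high_indices) pairs, one per image-size bin.
    """
    n_ins, n_think, n_img = (np.asarray(a, dtype=float) for a in (n_ins, n_think, n_img))
    text_ratio = (n_ins + n_think) / (n_ins + n_think + n_img)

    order = np.argsort(n_img)                       # sort by image-token count
    groups = []
    for idx in np.array_split(order, n_size_bins):  # three image-size groups
        median = np.median(text_ratio[idx])
        low = idx[text_ratio[idx] <= median]        # Low Text Ratio half
        high = idx[text_ratio[idx] > median]        # High Text Ratio half
        groups.append((low, high))
    return groups
```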
We appreciate your suggestion and will include detailed settings of this experiment in the appendix.
Q2. Text Ratio Control Explanation: From my understanding, higher image tokens are obtained with higher resolution tokens only. So, how do you get different text ratios for the same image size? Is this just by controlling the number of generated tokens? Is there some other way of enhancing the image and increasing the image tokens?
Thanks for the excellent question. As you pointed out, to control for confounding factors, it is important to analyze different text ratios under the same image size.
While it is theoretically possible to manipulate text ratios for the same sample by controlling the length of generated text or altering the number of image tokens, such interventions may negatively impact model performance. Moreover, we currently do not have an effective method to precisely control these factors without introducing other variables.
Therefore, as described in Q1, we group the test samples based on image size. Within each group, images are of similar resolution, allowing us to fairly compare samples with different text ratios.
S1. Inclusion of UI-TARS-7B and UGround-v1-7B: Please include the UI-TARS-7B and UGround-v1-7B baselines in the GUI-specific models section of Table 4 for a better comparison, since you are comparing with other 7B models, and these models are compared in other tables.
Thank you for the suggestion. Our model, with only 3B parameters, achieves better overall performance (89.8 on ScreenSpot and 43.8 on ScreenSpot-Pro) than UGround-v1-7B (86.3 on ScreenSpot), UI-TARS-7B-SFT (91.6 on ScreenSpot and 35.7 on ScreenSpot-Pro), and UI-TARS-7B-DPO (89.5 on ScreenSpot and 35.7 on ScreenSpot-Pro). We will include the results of UI-TARS-7B and UGround-v1-7B in both Table 4 and Table 5 for a more complete comparison.
S2. Typos: Line 198 should say Figure 6.
Thanks for your suggestion. We have corrected this reference typo.
We hope the above response can fully address your concerns. We really appreciate your insightful and constructive comments to further help us improve the quality of our work. Please do not hesitate to let us know if you have any further questions. Thanks again!
[1] UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning. arXiv.
Thank you for the response. This addresses my concerns. I think the additional results on the ScreenSpot-Pro benchmark are significant and show the benefits of the Difficulty Reweighting more clearly. Please add them to the main paper in subsequent revisions. I will maintain my score as positive feedback.
Thank you for your positive feedback and recognition. We are glad to have addressed your concerns and will include the results on ScreenSpot-Pro in subsequent revisions.
This paper analyzes the R1-Zero-like training approach applied to GUI grounding tasks, identifying three key issues in current methods: reward hacking resulting from the reward function, biases inherent in the GRPO training objective, and degraded performance due to excessive reliance on lengthy textual reasoning. To mitigate this issue, the authors propose corresponding improvements, including (1) a new prompt designed to reduce unnecessary textual reasoning, (2) a constraint on reward function based on bounding box size to deal with reward hacking, (3) normalization of the RL objective with response length and sample difficulty. Leveraging these improvements, the authors train a GUI grounding model based on Qwen2.5-VL-3B-Instruct, achieving strong results on the ScreenSpot and ScreenSpot-Pro benchmarks.
Strengths and Weaknesses
Strengths:
- The paper provides helpful empirical insights into the limitations of current GUI grounding training pipelines, such as reward hacking and biases in the RL objective. These insights contribute meaningfully to improving model training.
- The proposed solutions are effective and well-motivated, addressing the identified limitations. Experimental results validate that the proposed methods lead to good performance on two widely adopted GUI visual grounding benchmarks, ScreenSpot and ScreenSpot-Pro.
Weaknesses:
- The performance of the baseline model (Qwen2.5-VL-Instruct-3B) is omitted (marked "--") in Tables 4 and 5 without explanation. This omission raises questions about the comparability and validity of the reported improvements. The authors should clarify why these baseline results are absent, such as indicating if the base model struggled significantly with instruction-following tasks.
- The writing and presentation could benefit from some improvements. I have listed some uncertainties in the question section in detail.
Questions
- Lines 91-92 appear to use "output evaluation" and "reward function" together. It would be great if the authors could further elaborate the difference between these two terms. Otherwise, standardizing the terminology would enhance clarity.
- The concepts of "Slow Thinking" and "Fast Thinking" are under-specified and somewhat ambiguous. It would be great if the authors could provide a formal definition of these two concepts.
- In line 83, the term "template" is unclear. Could the authors explicitly clarify what a "template" entails in this context? Is it simply a specific part of the prompt describing the output format, or does it encompass additional instructions for the model's reasoning process?
Limitations
No major limitation
Final Justification
The authors have addressed my concerns about the ambiguity of their results and terms. I will keep my scores unchanged.
Formatting Issues
NA
We sincerely thank the reviewer for the constructive comments and suggestions, which are very helpful for improving our paper. We are also grateful that you recognized the strengths of our paper. Please find our point-by-point responses below.
W1. Result Omission Explanation: The performance of the baseline model (Qwen2.5-VL-Instruct-3B) is omitted (marked "--") in Tables 4 and 5 without explanation. This omission raises questions about the comparability and validity of the reported improvements. The authors should clarify why these baseline results are absent, such as indicating if the base model struggled significantly with instruction-following tasks.
Thank you for your comment. The results for Qwen2.5-VL-Instruct-3B in Tables 4 and 5 are copied from the official technical report [1] (the same applies to Kimi-VL). The report only provides overall scores and does not break down the results by category (mobile, web, desktop). Since the evaluation code was not released, we are unable to reproduce the detailed results and therefore only report the overall score. We will clarify this omission in the captions of Tables 4 and 5. We appreciate your feedback in helping us improve the clarity of the paper.
W2. Writing and Presentation: The writing and presentation could benefit from some improvements. I have listed some uncertainties in the question section in detail.
Thank you for your suggestions to help us improve the quality of our paper. We provide point-by-point responses and corresponding revisions below.
Q1. Relation between "output evaluation" and "reward function": Lines 91-92 appear to use "output evaluation" and "reward function" together. It would be great if the authors could further elaborate the difference between these two terms. Otherwise, standardizing the terminology would enhance clarity.
Thanks for your thoughtful comment. Sorry for the confusion about the relationship between output evaluation and reward function. In our framework, output evaluation is a general concept, and reward function is a specific instance of it in the context of RL. Other paradigms (e.g., SFT or unsupervised learning) may use different evaluation signals such as loss or entropy. Similarly, template and RL objective are concrete instances of input design and policy update, respectively. We will clarify this in the introduction to make these relationships clearer. Thank you again for helping us improve the writing quality.
Q3. Formal Definition of "Template": In line 83, the term "template" is unclear. Could the authors explicitly clarify what a "template" entails in this context? Is it simply a specific part of the prompt describing the output format, or does it encompass additional instructions for the model's reasoning process?
Thanks for your question. In our paper, template refers to the system-level instruction that appears before the user query. It includes both the expected output format and additional instructions that guide the model’s reasoning process. So it is not limited to one or the other.
To illustrate this, Table 1 in DeepSeek-R1 [3] provides an example prompt that begins with:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: prompt. Assistant:
This system-level instruction before the user query is what we refer to as the template. We will revise Section 3.1 to provide a clear and explicit definition. Thanks again for your suggestion, which helps us improve the clarity of the paper.
Q2. Formal Definition of "Slow Thinking" and "Fast Thinking": The concepts of "Slow Thinking" and "Fast Thinking" are under-specified and somewhat ambiguous. It would be great if the authors could provide a formal definition of these two concepts.
Thanks for your valuable suggestions. We believe it is necessary to provide clear definitions of key terms. We will include a clear comparison of the definitions of Fast Thinking and Slow Thinking in the appendix to improve clarity:
- Briefly, according to Kahneman’s Thinking, Fast and Slow, System 1 (Fast Thinking) is “fast, instinctive, and emotional,” while System 2 (Slow Thinking) is “slower, more deliberative, and more logical.”
- In recent LLM research, such as the Qwen3 model [2], Non-thinking mode corresponds to fast thinking, where the model directly generates an answer without intermediate reasoning. Thinking mode involves multi-step reasoning before the final response, often using chain-of-thought (CoT) prompting. This corresponds to slow thinking.
- In our work, Slow Thinking means the model observes the screen and performs reasoning before outputting a grounding coordinate, while Fast Thinking encourages direct answer generation without explicit reasoning, similar to the Non-thinking mode in Qwen3 (an illustrative template pair is sketched below).
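To make the distinction concrete, the two templates could look roughly as follows; the slow-thinking version mirrors the DeepSeek-R1-style prompt quoted in our response to Q3, and the fast-thinking version mirrors the grounding instruction quoted earlier in this discussion. Both are paraphrased illustrations rather than the exact prompts used in the paper.

```python
# Paraphrased illustrations only; the exact templates in the paper may differ.

SLOW_THINKING_TEMPLATE = (
    "You first think about the reasoning process in the mind and then provide "
    "the answer. Enclose the reasoning within <think> </think> tags and the "
    "answer within <answer> </answer> tags.\n"
    "Grounding instruction is: {question}. "
    "Help to locate and output its bbox coordinates using JSON format."
)

FAST_THINKING_TEMPLATE = (
    "Grounding instruction is: {question}. "
    "Help to locate and output its bbox coordinates using JSON format."
)
```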
We thank the reviewer for the valuable suggestion, which helps improve the rigor and clarity of our presentation.
We hope the above response can fully address your concerns. We really appreciate your insightful and constructive comments to further help us improve the quality of our work. Please do not hesitate to let us know if you have any further questions. Thanks again!
[1] Qwen2.5-VL Technical Report. arXiv.
[2] Qwen3 Technical Report. arXiv.
[3] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv.
Dear Reviewers and ACs,
We sincerely thank all reviewers and ACs for their great effort and constructive comments on our submission. As we are now approaching the end of the author-reviewer discussion, we would like to summarize our response. During the rebuttal period, we focused on the reviewers' beneficial suggestions and did our best to add several experiments and respond to each point. We believe our current responses address all of the reviewers' concerns.
First of all, we are very encouraged by the large number and diversity of positive comments. Several key strengths of our work have been highlighted by the reviewers:
- Insightful Analysis and Practical Solutions: Our work is recognized for its well-motivated (Reviewer jR9M and uv8E) analysis of limitations in current GUI grounding pipelines. Reviewer uv8E praised its "helpful empirical insights" and "effective, well-motivated solutions." Reviewer jR9M highlighted the "thorough analysis" and its value for improving GUI grounding models, while Reviewer c2Uh found the analyses "carefully done."
- Strong Empirical Results with Improved Efficiency: The proposed GUI-G1 model consistently outperforms baselines on ScreenSpot and ScreenSpot-Pro. Reviewer uv8E noted its "good performance," jR9M highlighted better results with fewer output tokens and less training data, and c2Uh confirmed noticeable improvements over baselines.
- Clear and Accessible Presentation: The paper is praised for its clear writing and organization. Reviewer jR9M called it "well-written and timely," wn5i commended the "clear writing," "well-organized" structure, and helpful figures/tables, and c2Uh noted the claims are well-supported by results and visualizations.
Moreover, we have carefully addressed these constructive comments with further clarifications and experiments:
- [Reviewer wn5i, Reviewer jR9M] We confirmed that our model will be publicly released and expanded our experiments to cover models of different sizes and families, with repeated runs.
- [Reviewer jR9M] We added an ablation study to assess whether the model engages in actual reasoning.
- [Reviewer jR9M] We included the performance of difficulty-aware and length-debias methods on the ScreenSpot-Pro benchmark.
- [Reviewer wn5i] We further added experimental results in web navigation scenarios.
- [Reviewer uv8E] We clarified the reasons for missing results in the table, added terminology explanations, and committed to improving the clarity and presentation of the paper.
We are also encouraged that Reviewers wn5i and jR9M confirmed our responses have addressed their concerns, and that Reviewer wn5i was happy to increase their score. Although Reviewer c2Uh did not respond, we clarified in our rebuttal that our work aligns well with the NeurIPS scope. We believe our paper, which focuses on understanding R1-Zero-like training in GUI agents, can also provide valuable insights to broader multimodal and reinforcement learning research, as evidenced by the growing attention to GUI agents at major ML venues.
We hope our responses address all of the reviewers' concerns, and we are more than eager to have further discussions with the reviewers in response to these revisions.
Thanks once again for handling the review process of our submission.
Best regards,
Submission 22787 Authors
This paper investigates R1-Zero-like training for GUI grounding tasks, identifying key limitations in current approaches and proposing targeted solutions. Through careful analysis of input templates, reward functions, and RL policy objectives, the authors show how existing practices can lead to degraded performance, reward hacking, or overfitting. They introduce a fast-thinking template to reduce unnecessary reasoning, a box-size-aware reward function, and a difficulty-aware objective, leading to a new model (GUI-G1) that achieves state-of-the-art results on ScreenSpot and ScreenSpot-Pro benchmarks, surpassing even larger models.
In general, the paper is well-written, timely, and provides valuable empirical insights into the failure modes of GUI grounding pipelines. Reviewers consistently praised the thorough analysis, strong empirical results, and clear presentation, and noted that the proposed solutions are both effective and well-motivated. However, as pointed out by the reviewers, there are some concerns about the scope of applicability beyond GUI agents, as well as requests for additional clarity and expanded evaluations across different model families and training runs. These concerns were largely addressed in the rebuttal through additional experiments, clarifications, and commitments to release code and checkpoints.
Overall, the paper makes a solid and impactful contribution to the understanding and improvement of GUI agent training. Given the novelty of the analysis, the effectiveness of the proposed methods, and the strong empirical performance, I recommend accepting the paper.