PaperHub
7.5 / 10 · Spotlight · 4 reviewers (min 6, max 8, std dev 0.9)
Individual ratings: 8, 8, 8, 6
Confidence: 4.0 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.8
ICLR 2025

OS-ATLAS: Foundation Action Model for Generalist GUI Agents

Submitted: 2024-09-25 · Updated: 2025-04-08
TL;DR

Through innovations in both data and modeling, we are releasing the first foundational action model specifically designed for GUI agents.

Abstract

Keywords
GUI agent, language agent, GUI grounding, executable language grounding

Reviews and Discussion

Official Review (Rating: 8)

This paper proposes a high-performance GUI Agent, OS-ATLAS, trained on a synthetic grounding dataset and operating within a unified action space. Experiments on grounding tasks and agent tasks demonstrate that OS-ATLAS significantly outperforms previous state-of-the-art models.

Strengths

This paper produces a large-scale, open-source corpus of screenshots covering multiple platforms and proposes a high-performance GUI Agent that significantly outperforms previous VLM-based GUI action models, making a substantial contribution to the field of vision-based GUI Agents. Additionally, the paper is logically coherent, relatively complete in content, and well-organized.

Weaknesses

Experimental Concerns

  1. Table 4 and Table 5 lack the results of state-of-the-art methods. It would be more appropriate to include these SOTA results for comparison with OS-ATLAS to demonstrate that it outperforms previous methods. Specifically, OdysseyAgent achieves 65.22% AMS (or SR, as referred to in this paper) on the GUI-Odyssey benchmark, which outperforms OS-ATLAS (4B/8B).

  2. OS-ATLAS (8B) performs worse than OS-ATLAS (4B) on the AndroidControl-Low/High and GUI-Odyssey benchmarks. Moreover, OS-ATLAS (8B) exhibits poorer performance on the AndroidControl-High and GUI-Odyssey benchmarks compared to SeeClick, which uses Qwen-VL as the base model and a lower fixed input resolution (448x448). It would be beneficial to provide an explanation for this performance gap.

Questions

Which benchmark is used in Figure 5?

Comment

Thanks for your questions. At the very beginning, we need to admit that we had a bug in the evaluation script of the 7B model: we forgot to skip certain special tokens when decoding with the 7B model. As a result, the 7B model's performance was much lower than expected. Note that this bug only affects the 7B model's results in Tables 4 and 5 and does NOT influence any other experiments. Along with the release of our model weights, we will also release the evaluation scripts to help researchers reproduce these results. You can find the new results in the updated paper. We also make a copy here for your convenience:

Results on Web and Desktop platforms under the OOD setting:

| Model | GUI-Act-Web (Type / Grd. / SR) | OmniAct-Web (Type / Grd. / SR) | OmniAct-Desktop (Type / Grd. / SR) |
|---|---|---|---|
| GPT-4o | 77.09 / 45.02 / 41.84 | 79.33 / 42.79 / 34.06 | 79.97 / 63.25 / 50.67 |
| OS-Atlas-4B | 79.22 / 58.57 / 42.62 | 46.74 / 49.24 / 22.99 | 63.30 / 42.55 / 26.94 |
| OS-Atlas-7B | 86.95 / 75.61 / 57.02 | 86.12 / 69.35 / 59.99 | 90.24 / 62.87 / 56.73 |

Results on Web and Desktop platforms under the supervised finetuning setting:

| Model | GUI-Act-Web (Type / Grd. / SR) | OmniAct-Web (Type / Grd. / SR) | OmniAct-Desktop (Type / Grd. / SR) |
|---|---|---|---|
| InternVL-2-4B | 81.42 / 47.03 / 36.17 | 47.51 / 51.34 / 24.39 | 67.00 / 44.47 / 29.80 |
| Qwen2-VL-7B | 89.36 / 90.66 / 82.27 | 89.22 / 85.94 / 78.58 | 96.27 / 94.52 / 91.77 |
| SeeClick | 88.79 / 78.59 / 72.34 | 86.98 / 75.48 / 68.59 | 96.79 / 70.22 / 72.69 |
| OS-Atlas-4B | 89.36 / 89.16 / 81.06 | 88.56 / 82.00 / 73.91 | 96.51 / 85.53 / 84.78 |
| OS-Atlas-7B | 89.08 / 91.60 / 82.70 | 97.15 / 95.41 / 93.56 | 97.15 / 95.85 / 94.05 |

Results on Mobile platform under the OOD setting:

| Model | AC-Low (Type / Grd. / SR) | AC-High (Type / Grd. / SR) | Odyssey (Type / Grd. / SR) |
|---|---|---|---|
| GPT-4o | 74.33 / 38.67 / 28.39 | 63.06 / 30.90 / 21.17 | 37.50 / 14.17 / 5.36 |
| OS-Atlas-4B | 64.58 / 71.19 / 40.62 | 49.01 / 49.51 / 22.77 | 49.63 / 34.63 / 20.25 |
| OS-Atlas-7B | 73.00 / 73.37 / 50.94 | 57.44 / 54.90 / 29.83 | 60.42 / 39.74 / 26.96 |

Results on Mobile platform under the supervised finetuning setting:

| Model | AC-Low (Type / Grd. / SR) | AC-High (Type / Grd. / SR) | Odyssey (Type / Grd. / SR) |
|---|---|---|---|
| InternVL-2-4B | 90.94 / 84.05 / 80.10 | 84.09 / 72.73 / 66.72 | 82.13 / 55.53 / 51.45 |
| Qwen2-VL-7B | 91.94 / 86.50 / 82.56 | 83.83 / 77.68 / 69.72 | 83.54 / 65.89 / 60.23 |
| SeeClick | 93.00 / 73.42 / 75.00 | 82.94 / 62.87 / 59.11 | 70.99 / 52.44 / 53.92 |
| OS-Atlas-4B | 91.92 / 83.76 / 80.64 | 84.69 / 73.79 / 67.54 | 83.47 / 61.37 / 56.39 |
| OS-Atlas-7B | 93.61 / 87.97 / 85.22 | 85.22 / 78.48 / 71.17 | 84.47 / 67.80 / 61.98 |

Comment

Q1: Table 4 and Table 5 lack the results of state-of-the-art methods. It would be more appropriate to include these SOTA results for comparison with OS-ATLAS to demonstrate that it outperforms previous methods. Specifically, OdysseyAgent achieves 65.22% AMS (or SR, as referred to in this paper) on the GUI-Odyssey benchmark, which outperforms OS-ATLAS (4B/7B).

Thanks for your suggestion. Initially, we omitted the results of SOTA models because each benchmark has a different SOTA model, which would clutter the table. However, we agree that including SOTA model comparisons provides a more comprehensive perspective. Below, we present the per-task success rate (SR):

| Model | GUI-Act-Web | OmniAct-Web | OmniAct-Desktop | AC-Low | AC-High | GUI-Odyssey |
|---|---|---|---|---|---|---|
| SeeClick | 72.34 | 68.59 | 72.69 | 75.00 | 59.11 | 53.92 |
| MiniCPM-GUI [1] | 74.90 | - | - | - | - | - |
| PaLM-2S [2] | - | - | - | 83.40 | 63.60 | - |
| OdysseyAgent [3] | - | - | - | - | - | 53.68 / 65.22 |
| OS-Atlas-4B | 81.06 | 73.91 | 84.78 | 80.64 | 67.54 | 56.39 / 63.10 |
| OS-Atlas-7B | 82.70 | 93.56 | 94.05 | 85.22 | 71.17 | 61.98 / 68.16 |
| OS-Atlas-4B-Pro | 77.80 | 80.65 | 92.99 | 81.02 | 65.46 | 66.30 |
| OS-Atlas-7B-Pro | 80.99 | 98.60 | 97.04 | 84.29 | 71.55 | 78.54 |

We include the SOTA model from each benchmark paper, such as MiniCPM-GUI (GUI-Act), PaLM-2S (AndroidControl), and OdysseyAgent (GUI-Odyssey). Since the OmniAct paper only evaluated open-source models with poor performance, we provide SeeClick results as its SOTA. As shown in the table above, OS-Atlas outperforms the state-of-the-art models on GUI-Act, OmniAct, and AC. Notably, OdysseyAgent reports results both without and with history input (53.68 and 65.22, respectively), where history represents a sequence of past actions. In contrast, most other datasets use past steps' instructions as history. Thus, in the submitted version, we did not utilize the history field from GUI-Odyssey during either training or inference. To ensure a fair comparison, we conducted additional experiments incorporating the action history input, which resulted in 63.10 and 68.16 for OS-Atlas-4B and 7B, respectively. Under both settings, we find that OS-Atlas significantly outperforms OdysseyAgent.

To ensure that most datasets remain available for OOD evaluation, OS-Atlas is initially trained using a limited selection of 3 agent datasets. To fully leverage its potential for broader applications, we use all 7 agent datasets for multitask fine-tuning. We found that OS-Atlas-Pro generally outperforms our previous model across most datasets, with the exceptions of GUI-Act-Web and AndroidControl-Low. These two datasets have significantly larger training sets, which makes simple SFT prone to overfitting and results in artificially high performance.


Q2: OS-ATLAS (7B) performs worse than OS-ATLAS (4B) on the AndroidControl-Low/High and GUI-Odyssey benchmarks. Moreover, OS-ATLAS (7B) exhibits poorer performance on the AndroidControl-High and GUI-Odyssey benchmarks compared to SeeClick, which uses Qwen-VL as the base model and a lower fixed input resolution (448x448). It would be beneficial to provide an explanation for this performance gap.

A2: As explained in Response Part 1 above, this gap was caused by a mistake in the 7B model's evaluation script. With the updated Tables 4 and 5, where OS-ATLAS (7B) now demonstrates significant performance improvements over both OS-ATLAS (4B) and SeeClick, we believe this concern is addressed.


Q3: Which benchmark is used in Figure 5?

A3: We use the same benchmarks as in Tables 4 and 5. In Figure 5, we report the averaged performance on each platform (e.g., the Web domain includes GUI-Act-Web and OmniAct-Web). We will provide a clearer explanation of the benchmark composition in a later version of the paper.

Comment

Thanks for your explanation. The updated results appear to be sound. With the experimental setup properly aligned, OS-Atlas demonstrates SOTA performance. I have no further questions.

Comment

Thanks so much for taking a closer look at our rebuttal. We really appreciate how carefully you considered our responses and the willingness to reconsider the paper's score. Your thoughtful feedback has been incredibly helpful in refining our research.

Official Review (Rating: 8)

The paper aims to enhance foundation models for GUI agent research. The authors introduce a GUI grounding data synthesis toolkit and collect a large-scale dataset using it. With this dataset, OS-Atlas is developed, a cross-platform foundation action model for GUIs. Detailed results and evaluations are also discussed.

Strengths

  • An open-source toolkit and data are made available to the research community, which is currently lacking in the GUI agent research area.
  • The data collection and modeling processes are extensive, well planned, and well motivated.

Weaknesses

  • I would appreciate more details about the toolkit, particularly on its usability and how other researchers could potentially customize it for their own use cases.

Questions

  • Could the authors comment on the usability of the toolkit and how other researchers could potentially customize it for their own use cases?
  • Any discussion on choosing pretraining + fine-tuning over mixing grounding and action data?
Comment

Q1: I would appreciate more details about the toolkit, particularly on its usability and how other researchers could potentially customize it for their own use cases.

A1: We will provide detailed documentation alongside the release of the toolkit. In brief, our goal is to make the toolkit as user-friendly as possible, allowing users to simply click and go.

  • The web toolkit is the easiest to use. Researchers only need to prepare a list of web URLs they wish to crawl and then initiate the synthesis toolkit with a single click.
  • For mobile and desktop toolkits, researchers must manually install the applications from which they want to collect data. They will also need to adjust the synthesis configuration accordingly before starting the script. Once initiated, the script will automate the remaining data synthesis process.

Additionally, we will include tutorials that demonstrate how to modify the scripts to collect specific data types (e.g., for those interested in gathering more grounding data for a search bar).
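
To give a concrete picture of this workflow, here is a minimal sketch of how the web toolkit could be driven. The class and function names (`WebCrawlConfig`, `synthesize_grounding_data`) are hypothetical placeholders, not the toolkit's actual API, which will be documented with the release.

```python
# Hypothetical sketch of driving the web grounding-data synthesis toolkit.
# All names and fields below are illustrative placeholders.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WebCrawlConfig:
    urls: List[str]                        # pages to crawl, one entry per URL
    output_dir: str                        # where screenshots and element annotations are written
    max_pages_per_site: int = 50
    viewport: Tuple[int, int] = (1920, 1080)

def synthesize_grounding_data(cfg: WebCrawlConfig) -> None:
    """Placeholder pipeline: render each page and dump (screenshot, element, bbox) triples."""
    for url in cfg.urls:
        # 1) render the page at cfg.viewport
        # 2) walk the DOM / accessibility tree to collect interactable elements
        # 3) save the screenshot plus (element text, bounding box) pairs to cfg.output_dir
        pass

# "Click-and-go" style usage: prepare the URL list and start a synthesis run.
synthesize_grounding_data(WebCrawlConfig(urls=["https://example.com"], output_dir="web_grounding_data/"))
```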


Q2: Any discussion on choosing pretraining + fine-tuning over mixing grounding and action data?

A2: Thank you for the insightful question. First, we must acknowledge that we chose a two-stage training approach primarily due to GPU constraints. Given the large number of SFT experiments we conducted, which involved fine-tuning the pretrained grounding model on a single agent dataset, we could not afford to repeatedly retrain the model using grounding data combined with different action data splits.

Following your suggestion, we experimented with a mixed-training strategy that combines all agentic data from six benchmarks with all grounding data. The performance of the mixed-trained model is comparable to that of the two-stage training strategy. Interestingly, we found that the mixed-trained model is more robust to changes in the prompts. Due to the short rebuttal period and GPU constraints, we are currently experimenting with a 1B model, but we will scale the experiments to 4B/7B and add a section to the appendix discussing these findings.

Comment

Thank you for your response, I have no further questions.

Official Review (Rating: 8)

This paper presents a project that aims to replace commercial VLMs such as GPT-4o as the base model for GUI agents, with a special focus on GUI grounding and atomic action execution. The project involves a large-scale grounding and action dataset and a trained model named OS-Atlas. Benchmarking on various grounding and atomic action datasets shows SOTA performance.

Strengths

The main claim of the paper is to serve as an open-source alternative to commercial VLMs like GPT-4o in building GUI agents. The claim is well supported, with the condition that all the data, code, and model are open-sourced as promised.

Specifically, the contribution mainly involves (1) a dataset with 2.3 million screenshots on various platforms and operating systems; (2) OS-Atlas (4B/8B) model that achieves SOTA performance on GUI grounding and atomic action execution.

Having a good base model and data would be beneficial for the community on future GUI research.

Weaknesses

  1. The contribution and novelty are limited to the main claim of serving as an open-source alternative to commercial VLMs. VLMs like GPT-4o are not designed specifically for GUI scenarios and fail to address GUI-specific challenges; neither does this work.

1.1. The study does not touch any GUI-specific challenges, such as specific visual and text distributions, cross-system differences, long action planning, and so on.

1.2. It also does not present new evaluations that are more suitable for GUI. For example, setting up an emulator-style benchmark with the data synthesis toolkit.

Nonetheless, those aspects are beyond the claim of the paper, and future research on related topics could benefit from the data and model released by this study.

  2. Extra analyses would be beneficial:

2.1. There could be more analysis of data and model scaling, to help future studies understand the amount of extra data and compute needed if an application requires further model enhancement or fine-tuning. The existing scaling analyses in Figure 3 and Sec. 4.2 are somewhat noisy and fail to give clear directions.

2.2. To better support the claim of being effective “universally across all GUIs,” there could be more discussion of per-task and per-environment performance. It remains unclear why certain tasks perform better or worse on web/desktop/mobile and different operating systems. Are such differences attributable to the model, data, or task setting? And what are the corresponding data and compute needed to improve on a specific setting of interest?

Questions

Please see weaknesses for questions, mostly on the analyses part. The released data and model would be a good contribution to the community, looking forward to seeing it being released.

Comment

Q1: The study does not touch any GUI specific challenges, such as specific visual and text distribution, cross-system differences, long action planning, and so on. It also does not present new evaluations that are more suitable for GUI. For example, setting up an emulator-style benchmark with the data synthesis toolkit. Nonetheless, those aspects are beyond the claim of the paper, and future research on related topics could benefit from the data and model released by this study.

A1: We believe we have made efforts to address some key challenges in GUI interaction: GUI element grounding, single-step GUI problem solving (i.e., action grounding), and their cross-system generalization. We have conducted evaluations on OSWorld (Table 3), an emulator-style interactive benchmark. However, we agree with the reviewer that additional GUI-specific challenges, such as long-term action planning and developing more comprehensive benchmarks, are important directions that need to be addressed in future research.


Q2: There could be more analyses on the data and model scaling, to help future study understand the amount of extra data and compute needed if the application needs further model enhancement or finetuning. The existing scaling analyses in Figure 3 and Sec. 4.2. are somewhat noisy and fail to give clear directions.

A2: Thanks for pointing this out. We also find that downstream task performance is not an ideal metric to measure scaling law and can be noisy. This is because downstream datasets often cannot accurately reflect the true data distribution, and evaluation metrics are too coarse-grained – for instance, correctly clicking an element does not necessarily mean the predicted coordinates exactly match the ground truth.
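
To illustrate this coarseness, a step-level check along the following lines is typical for such benchmarks (a minimal sketch under assumed metric definitions, not our exact evaluation code): the predicted action type must match, and a predicted click counts as correct whenever its coordinates land anywhere inside the ground-truth element's bounding box.

```python
def step_correct(pred_type, pred_point, gt_type, gt_bbox):
    """Hedged sketch of a coarse step-level success check (assumed, not the exact benchmark metric).

    pred_point: (x, y) predicted click coordinates
    gt_bbox:    (x1, y1, x2, y2) ground-truth element box
    """
    if pred_type != gt_type:
        return False
    if gt_type == "CLICK":
        x, y = pred_point
        x1, y1, x2, y2 = gt_bbox
        # Any point inside the target element counts as correct, so many distinct
        # coordinate predictions collapse to the same outcome.
        return x1 <= x <= x2 and y1 <= y <= y2
    return True  # non-click actions: type match only (parameter checks omitted in this sketch)
```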

To study the scaling effect more rigorously, we plot the loss curve and, following [1], fit a power law-based scaling curve, as shown in Figure 7 in the updated paper's appendix. Through our scaling law analysis, we estimate that increasing training data by 8 times could lead to a 40% relative reduction in loss. Moreover, scaling data by 64 times might potentially yield a 57% relative decrease in loss.

[1] Scaling laws for neural language models.
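
The curve-fitting step can be sketched as below, assuming a power law of the form L(N) = a·N^(-alpha) + c as in [1]; the (data size, loss) pairs are illustrative placeholders, not the measurements behind Figure 7.

```python
# Sketch of fitting a power-law scaling curve L(N) = a * N^(-alpha) + c to
# pretraining losses measured at several data scales, following [1].
# The (data size, loss) pairs below are hypothetical placeholders.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    return a * np.power(n, -alpha) + c

data_sizes = np.array([0.1, 0.3, 1.0, 2.3])   # training instances, in millions (hypothetical)
losses = np.array([1.90, 1.55, 1.30, 1.18])   # grounding pretraining loss (hypothetical)

(a, alpha, c), _ = curve_fit(power_law, data_sizes, losses, p0=(0.3, 0.5, 1.0))

# Extrapolate the relative loss reduction expected from an 8x increase in data.
current = power_law(2.3, a, alpha, c)
scaled = power_law(8 * 2.3, a, alpha, c)
print(f"fitted: a={a:.3f}, alpha={alpha:.3f}, c={c:.3f}")
print(f"predicted relative loss reduction at 8x data: {(current - scaled) / current:.1%}")
```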


Q3: To better support the claim on being effective “universally across all GUIs,” there could be more discussion on the per-task and environment performance. It remains unclear why certain tasks perform better or worse on web/desktop/mobile and different operating systems.

A3: The primary reason behind varying performance across datasets is their inherent difficulty levels. For instance, in the desktop domain, we achieve over 90% success rate (SR) on OmniAct, while for OSWorld, even human performance is below 80%. Consequently, comparing absolute performance across benchmarks is not particularly insightful. However, we find that desktop tasks are generally more complex than mobile and web domain tasks, due to the inherent complexity of desktop screenshots.

Another reason behind varying performance across datasets is that different benchmarks employ varied experimental setups. To maintain our model's universality, we intentionally avoided over-optimizing our design for specific benchmarks. Instead, we adopted a unified evaluation setting to demonstrate our model's generalizability across GUI environments. Take GUI-Odyssey as an example: our best model, OS-Atlas-7B, initially achieves a 61% SR as in Table 5, compared to their SOTA model OdysseyAgent at 65% SR. The performance difference stems from their unique approach of using previous action sequences as historical context to inform predictions. By adopting their proposed best practice, we can improve our performance to 68% SR.

In essence, achieving optimal performance on certain tasks would require case-by-case considerations regarding model, data, and computational scaling, which is not the focus of our paper. But we are more than willing to continue this discussion to share our observations and learned lessons if you are interested in certain benchmarks.

Comment

I will raise my score to an accept, as most of my other concerns have been addressed.

Comment

Thanks for your constructive question and strong support. Please feel free to open the discussion if you have further questions.

Official Review (Rating: 6)

The work introduces OS-ATLAS, a foundational GUI action model for GUI grounding and out-of-distribution (OOD) agentic tasks. The authors synthesize large-scale multi-platform GUI grounding data. The OS-ATLAS model is trained on this data and further fine-tuned with action datasets to address specific agent tasks. The model demonstrates significant performance improvements over previous models and shows the effectiveness of the synthetic GUI grounding dataset. The authors also promise to open-source the data and related models, which will significantly impact the development of GUI agents.

Strengths

  1. The work proposes a pretraining-then-fine-tuning framework for GUI grounding and agentic tasks, which is natural and unifies the training and evaluation process of these tasks.
  2. The work synthesizes large-scale multi-platform GUI grounding data for pretraining and promises to open-source the data.
  3. The experimental results showed significant improvement, demonstrating their effectiveness.

Weaknesses

  1. The authors propose synthesizing large-scale GUI grounding datasets across platforms. However, the number of instances in Table 1 is not similar for each platform, especially for desktop (54K).
  2. There are some missing details: how to ensure the correctness of the synthetic data, the specifics of action tuning and the format of the inputs and outputs, and the precise meaning of "thought."

Note that the score given assumes all concerns are addressed during the discussion stage.

Questions

See the Weakness part.

Comment

Q1: The number of instances in Table 1 is not similar for each platform, especially for desktop (54K)

A1: The considerable difference in the volume of web data compared to desktop data arises from the inherent diversity of the data itself. Statistics show that, as of 2024, the average person uses 9 mobile apps daily and 30 apps monthly.[1] In contrast, the average internet user visited around 30 unique web pages each day as of 2010.[2] As a result, the diversity of desktop data is significantly lower than that of web data.

Our dataset of 54K desktop entries includes system apps and the 10 most commonly used third-party apps. In contrast, our web data comes from millions of different websites. We initially gathered 330K desktop data samples and applied strict filtering criteria to ensure a high-quality dataset. The experimental results in ScreenSpot's desktop domain and the ablation analysis validate the effectiveness of this data. Nonetheless, we are eager to collaborate with the community to further scale both desktop and mobile data following the release of our toolkit.

[1] https://buildfire.com/app-statistics/

[2] Differential Internet Behavior’s of Students from Gender Groups. IJCA 2010


Q2: There are some missing details: how to ensure the correctness of the synthetic data, the specifics of action tuning and the format of the inputs and outputs, and the precise meaning of "thought."

A2: Thanks for pointing this out; we will include a concrete example and the following clarification regarding action fine-tuning:

Inputs:
<system_prompt>
<image>
Task: In the Arts & Culture app, I want to create an art gallery with the title Self Art.
History: 
Step 1: Click on the Favorites 
Step 2: Click on the Galleries section

Outputs:
thoughts:
Click on the Create Gallery button
actions:
CLICK <point>[[255,439]]</point>

During action finetuning, as shown in the example, the model will take the following input:

  • A system prompt, as detailed in Table 6 in the appendix.
  • The task associated screenshot.
  • Task instruction and action histories, which contain thoughts or subtask instructions from previous steps.

The model is trained to predict the following two fields:

  • thoughts: the reasoning about what the model should do at the current step. You can also understand it as a model-generated subtask instruction.
  • actions: action type and parameters (e.g., coordinates)
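
To make the serialization concrete, one training pair could be assembled roughly as in the sketch below. Only the fields shown in the example above come from our setup; the helper `build_example` and its exact string layout are hypothetical.

```python
# Rough, hypothetical sketch of serializing one action-finetuning example into a
# (prompt, target) pair. The real system prompt follows Table 6 in the appendix;
# everything not shown in the example above is illustrative.
def build_example(system_prompt, image_token, task, history_steps, thought, action):
    history = "\n".join(f"Step {i + 1}: {h}" for i, h in enumerate(history_steps))
    prompt = (
        f"{system_prompt}\n{image_token}\n"
        f"Task: {task}\n"
        f"History:\n{history}\n"
    )
    target = f"thoughts:\n{thought}\nactions:\n{action}"
    return prompt, target

prompt, target = build_example(
    system_prompt="<system_prompt>",
    image_token="<image>",
    task="In the Arts & Culture app, I want to create an art gallery with the title Self Art.",
    history_steps=["Click on the Favorites", "Click on the Galleries section"],
    thought="Click on the Create Gallery button",
    action="CLICK <point>[[255,439]]</point>",
)
```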

Q3: How to ensure the correctness of the synthetic data

A3: We apply both rule-based and model-based methods to synthesize data. For rule-based methods, the data is considered correct as long as the human annotations in the annotated accessibility (a11y) tree are accurate. We only need to filter out data with rendering errors or network errors, as detailed in lines 210-215.

For model-synthesized data, validating correctness remains a challenge that even the entire LLM community is struggling with. We can only assess the correctness and effectiveness of the data through the model's performance after training. During our preliminary investigation, we conducted experiments to ensure that incorporating this model-synthesized data would not degrade the model's performance. In one controlled evaluation, we found that after pre-training on rule-synthesized data, continued training on model-synthesized data slightly improves performance.

We also randomly sample 50 data entries from each split to manually validate their correctness before proceeding with final model training.
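
A minimal sketch of this filtering and spot-checking procedure, assuming a hypothetical entry schema with `render_error` / `network_error` flags (the real checks are those described in lines 210-215):

```python
import random

def filter_rule_synthesized(entries):
    """Hedged sketch of the rule-based filter: drop entries flagged with rendering or
    network errors. The 'render_error' / 'network_error' fields are a hypothetical
    schema standing in for the actual checks."""
    return [e for e in entries if not e.get("render_error") and not e.get("network_error")]

def spot_check(split_entries, k=50, seed=0):
    """Randomly sample k entries from a data split for manual correctness review."""
    rng = random.Random(seed)
    return rng.sample(split_entries, min(k, len(split_entries)))
```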

Comment

Dear ACs and Reviewers,

We sincerely appreciate the thoughtful and constructive feedback provided in your reviews. We have carefully addressed each of your concerns, and this response outlines the updates made to our manuscript.

  • To follow the naming convention of the original Qwen2-VL paper, we have renamed our model from OS-Atlas-8B to OS-Atlas-7B, even though the model contains 8.4B parameters.

  • We have updated the results in Tables 4 and 5. Please refer to 'Response Part 1' to Reviewer giX8 for comprehensive explanations.

  • To support broader use and future research, we have introduced the OS-Atlas-Pro models (4B / 7B) in section 5.4.

  • Following the suggestion from Reviewer Gb96, we have added a discussion of the GUI grounding pretraining scaling law in Appendix F.

Best regards,

Authors of OS-Atlas

AC Meta-Review

The work introduces a new model for GUI grounding and UI interaction. The paper introduces a toolkit to curate synthetic GUI grounding data for multiple platforms, including mobile, desktop, and web. Using this toolkit, the authors collect a large (2.3 million instances) dataset for multi-platform GUI grounding. A model trained on this data shows improvements across multiple platforms on the popular ScreenSpot benchmark. Further, they also train the model to perform single-step navigation given the current screenshot, goal, and action history, and show improvements over past methods on navigation benchmarks for mobile, web, and desktop.

All reviewers agree that the toolkit, associated data and models will be a useful resource to the community, and will help further research in building digital UI interaction models. All reviewers also agreed that the experiments conducted in the manuscript are sound and exhaustive, and the model trained on their data shows strong performance compared to previous baselines.

Overall, I think the contributions (data, models) of the paper will be useful to the community, especially, as the problem of digital task execution is catching the attention of the community in recent times.

Additional Comments from the Reviewer Discussion

The reviewer discussion period for the paper was productive and a few points stood out.

Reviewer eJ95g raised concerns about ensuring the correctness of the synthetic data. The authors performed some rule-based filtering to remove data with rendering or networking errors. The authors also observed that incorporating synthetic data improves the model's performance, but agreed that ensuring the correctness of synthetic data for such a large dataset is challenging. We encourage the authors to include this as a limitation in the final manuscript.

Reviewer Gb96 asked several clarifying questions about scaling laws for data and models. The authors performed some analysis during the rebuttal, which should be added to the appendix. There was also some discussion during the reviewer discussion phase around measuring performance across environments, which would be useful to present in the appendix.

Reviewer giX8 encouraged the authors to include previous SOTA methods for each environment. During the discussion phase, the authors found a bug in their evaluation; fixing it improved the numbers. Providing a more exhaustive table and accurately representing state-of-the-art methods in the appendix will make the paper self-sufficient and improve its readability.

Final Decision

Accept (Spotlight)