PaperHub

Overall score: 7.8/10 (Spotlight, 4 reviewers)
Ratings: 5, 5, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 3.0
Quality: 3.5
Clarity: 3.3
Significance: 3.0

Abstract

Keywords

Web Agent, Reward Model, LLM

Reviews and Discussion

Official Review
Rating: 5

This paper introduces Web-Shepherd, the first process reward model (PRM) designed for assessing web navigation tasks step-by-step. To train and evaluate this model, the authors created two new resources: the WebPRM Collection, a large dataset with 40,000 step-level preference pairs, and WebRewardBench, the first benchmark for evaluating web navigation PRMs. Web-Shepherd utilizes a unique checklist-based system to decompose high-level instructions into clear subgoals, enabling more accurate progress assessment. Experiments show that Web-Shepherd significantly outperforms existing models, achieving about 30 points higher accuracy than GPT-4o on WebRewardBench and improving agent success rates on the WebArena-lite benchmark by 10.9 points, all while being 10 times more cost-effective than using a prompted GPT-4o-mini as a verifier.

Strengths and Weaknesses

Strengths:

  • The introduction of a PRM benchmark and checklist-based reward modeling is interesting and impactful.

  • The paper has conducted rigorous experiments with insightful discussions.

  • The proposed approach is effective not only on the proposed reward modeling benchmark but also on “downstream” web agent tasks.

Weaknesses:

  • The paper only conducts web agent evaluation on one benchmark. Showing effectiveness on additional benchmarks like VisualWebArena can help us better appreciate the performance and understand the usefulness of process rewards based on multimodal inputs.

  • The authors could better demonstrate the value of their approach by showing that equipping agents with such reward-guided search improves efficiency (e.g., fewer searched states, shorter paths to the goal).

Questions

  • The authors could elaborate more on the experimental setting for checklist generation. For example, which model was used? Is the checklist generation model the same architecture as the reward model introduced in Section 5.2? Did you train a single model on both checklist generation and reward modeling? Is your checklist generation data essentially distilled from GPT-4o, but with guidance from expert trajectories during the distillation process?

  • In Table 1, for WEB-SHEPHERD (3B), did you train two separate models for the T and T+I settings? Why didn't you report performance in the T + I setting for the 8B model as well?

  • In Table 2, separate experiments with a fixed checklist for every PRM would help isolate checklist quality and better reveal each PRM's reward modeling capability.

  • How can your approach address situations where a “change-of-plan” is needed? The agent may encounter an unseen website where the generated checklist is not applicable. In these scenarios, the agent would likely need to modify the checklist (e.g., increasing or decreasing its length). Can your approach still provide useful feedback for refining actions (Table 3) in these cases?

  • Line 189: should the equation be divided by K, since it is an average?

Limitations

yes

Final Justification

The authors have addressed all my concerns and I would recommend this paper to be accepted.

Formatting Concerns

N/A

Author Response

Dear Reviewer THVf,

First of all, we sincerely appreciate the time and effort you took to review our work, and we are grateful for your recognition of its contributions.


W1

We also acknowledge the reviewer’s suggestion that evaluating on additional benchmarks would strengthen the paper. In response, we conducted supplementary experiments on the WorkArena benchmark [1], which differs from WebArena in task structure and domain.

Specifically, we used GPT-4o-mini as the policy model and compared the performance of greedy search and reward-guided search with GPT-4o-mini and Web-Shepherd (3B and 8B). As shown in the results below, reward-guided search with Web-Shepherd continues to show clear advantages, validating the transferability of our method to unseen websites.

| Method | Dashboard | Form | Knowledge | List-filter | List-sort | Menu | Service Catalog | Total |
|---|---|---|---|---|---|---|---|---|
| w/o Trajectory Search | 50.00 | 0.00 | 10.00 | 0.00 | 5.00 | 25.00 | 2.22 | 9.39 |
| GPT-4o-mini | 55.00 | 10.00 | 10.00 | 0.00 | 6.67 | 20.00 | 5.56 | 12.42 |
| Web-Shepherd (3B) | 57.50 | 14.00 | 10.00 | 0.00 | 10.00 | 10.00 | 11.11 | 14.85 |
| Web-Shepherd (8B) | 65.00 | 14.00 | 20.00 | 0.00 | 10.00 | 20.00 | 7.78 | 15.76 |

Although we additionally experimented with benchmarks that include both textual and visual observations, our current model is mainly trained on textual inputs. We made this design choice deliberately, as we found that naively incorporating screenshots often introduces ambiguity and noise that can hinder performance. Nonetheless, we recognize that certain tasks (e.g., “find the red shirt”) require understanding visual context. This highlights the need for future work to develop more effective multimodal reward models that can jointly leverage textual and visual inputs for web agent training. Furthermore, we agree with the reviewer’s suggestion and plan to extend our experiments to VisualWebArena to better assess the model’s capabilities in multimodal scenarios.


W2

Thank you for raising this important point regarding efficiency. To evaluate whether our reward-guided search improves not only accuracy but also sample efficiency, we conducted a comparison against greedy search. As shown in the table below, our reward-guided approach significantly improves both success rate (Acc) and efficiency (Avg. Step to reach the goal) across different policies.

In particular, we observe that using a strong reward model (Ours-8B) leads to more efficient behavior, achieving higher accuracy with fewer steps, compared to using GPT-4o-mini as the reward model. This underscores the critical role of a robust reward model in guiding agents effectively during search.

| Policy | Search Method | Acc | Avg. Step |
|---|---|---|---|
| GPT-4o-mini | w/o trajectory search | 23.64 | 19.07 |
| GPT-4o-mini | GPT-4o-mini | 24.24 | 19.06 |
| GPT-4o-mini | Ours (8B) | 34.55 | 14.73 |
| GPT-4o | w/o trajectory search | 31.52 | 16.16 |
| GPT-4o | GPT-4o-mini | 26.67 | 13.30 |
| GPT-4o | Ours (8B) | 39.39 | 12.51 |

Q1

As illustrated in Figure 5, WebShepherd is not a framework but a single unified model that performs two key roles: (1) generating a checklist for a given task, and (2) providing step-level rewards and natural language feedback for the agent’s actions based on the checklist.
To train the model, we use high-quality checklists derived from human-annotated trajectories with GPT-4o. Checklist generation is not distilled from GPT-4o, but learned directly from expert supervision, enabling strong generalization.
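To make the two-stage flow described above concrete, here is a minimal sketch; the `generate` helper and the prompt wording are hypothetical illustrations, not the exact implementation used in the paper.

```python
# Minimal sketch of the two-stage inference flow (checklist generation, then
# checklist-grounded judgment), assuming a single instruction-tuned LLM is
# wrapped behind a hypothetical `generate(prompt) -> str` helper.
def evaluate_step(generate, instruction, observation, action, checklist=None):
    # Stage 1: generate the checklist once per task from the user instruction.
    if checklist is None:
        checklist = generate(
            f"Decompose the following web task into a checklist of subgoals:\n{instruction}"
        )
    # Stage 2: judge the current action against the checklist, producing
    # natural-language feedback plus per-item completion labels.
    judgment = generate(
        f"Task: {instruction}\nChecklist:\n{checklist}\n"
        f"Current observation:\n{observation}\nProposed action: {action}\n"
        "For each checklist item, state whether it is completed and explain why."
    )
    return checklist, judgment
```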


Q2

When comparing text-only and multimodal inputs in the 3B setting, we did not observe a notable improvement from the multimodal input, so we did not train an 8B model for the multimodal setting. Furthermore, our GPU resources were not sufficient to fine-tune 8B VLMs.


Q3

Thank you for this insightful suggestion. To isolate the effect of checklist quality on reward modeling, we conducted an experiment using manually annotated checklists. Specifically, we annotated checklists for 50 tasks and compared reward-guided search performance using self-generated versus human-annotated checklists.

The results (see table below) indicate that there is no significant performance gap between the two settings. This suggests that once the checklist quality reaches a certain threshold, the performance is more influenced by the reward prediction or the policy model. However, since the evaluation was done on a limited set of 50 examples, further large-scale experiments are necessary to draw definitive conclusions.

Additionally, we observed that while our model generally produces useful and well-structured checklists, there are notable failure cases:

  • Redundant Checklist: The model occasionally adds unnecessary subgoals, which can confuse the reward assignment or mislead the agent.
  • Missing Subgoals: Some key steps are omitted, resulting in coarse-grained checklists that fail to reflect the full task structure.

Although these issues did not result in notable performance drops in this experiment, they highlight opportunities for improvement in checklist generation quality.

| PRM | Method | Accuracy |
|---|---|---|
| - | w/o trajectory search | 0.18 |
| GPT-4o-mini | w/ self-generated checklist | 0.20 |
| GPT-4o-mini | w/ human checklist | 0.24 |
| Ours (3B) | w/ self-generated checklist | 0.24 |
| Ours (3B) | w/ human checklist | 0.24 |

Q4

We appreciate the reviewer’s interesting question. When the checklist is completely wrong, we can assume that there is an error in the checklist generation given that the reward score does not increase after several attempts. In these cases, a practical strategy is to sample a new checklist from Web-Shepherd, possibly refining it based on the agent’s past trajectory.

Importantly, our model is explicitly trained to generate coarse-grained checklists rather than overly detailed ones. This design choice helps avoid overfitting to specific website structures and allows for better generalization to unseen websites. As shown in our main experiments on WebArena and additional experiments on WorkArena, Web-Shepherd performs effectively even in out-of-distribution environments.

Furthermore, we observed that even if part of the checklist is incorrect, the remaining steps still provide effective reward. This allows the reward model to continue offering valuable feedback, helping the agent refine its actions and move toward the goal, even when a change-of-plan is required.


Q5

Thank you for pointing this out. We will correct the equation by dividing by K in the next version of the draft.
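For reference, one plausible form of the corrected step reward, assuming it averages per-checklist-item "Yes" probabilities over the K checklist items; the notation below is ours and may differ from the paper's.

```latex
% q: task instruction, c_k: k-th checklist item,
% o_t: observation and a_t: action at step t (notation hypothetical)
r_t = \frac{1}{K} \sum_{k=1}^{K} p_\theta\left(\text{Yes} \mid q,\, c_k,\, o_t,\, a_t\right)
```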


References

[1] Workarena: How capable are web agents at solving common knowledge work tasks? (Drouin et al., 2024)

Comment

Thank you for your detailed responses. All my concerns have been addressed and I have raised my score.

Comment

We really appreciate the time you took to carefully read our response and reconsider your evaluation. We're glad the clarifications were helpful, and your updated score truly means a lot to us.

Official Review
Rating: 5

This paper proposes a checklist-based process reward model (PRM) for web agents. The PRM is trained on a large dataset, the WebPRM Collection, of step-level preference pairs on web navigation tasks collected by the authors. To better evaluate the PRM, they also propose WebRewardBench and show that Web-Shepherd performs better than GPT-4o-mini. They also evaluate it via best-of-N search on WebArena and show that Web-Shepherd can serve as a good reward model, outperforming GPT-4o-mini.

Strengths and Weaknesses

Strengths:

  1. PRM is an important topic for building better web agents and can be applied in multiple different scenarios: inference-time search, RL, etc.
  2. This work produces a reward model, along with the dataset to train on and a benchmark to evaluate reward models for web agents. Overall, it is solid and impressive.
  3. The WebPRM collection is based on Mind2Web, but the reward model trained on it, i.e. Web-Shepherd, performs well on WebArena, which is quite out-of-domain.

Weaknesses:

  1. Is Web-Shepherd trained on both checklist generation and reward generation, or only on reward generation? If it is also trained on checklist generation, one potential problem is that the accuracy of the generated checklist is bounded by GPT-4o. If it is co-trained on both tasks, would training two separate models for the two tasks perform better than a single one, given that the nature of the two tasks does not align perfectly?
  2. I noticed a performance drop on the GitLab split when using GPT-4o as the base policy and Web-Shepherd as the PRM. One possible reason could be the low quality of the checklists generated for this website. It would be interesting to conduct an experiment comparing with a ground-truth evaluator (human) in a later version to see how well Web-Shepherd generates rewards given an oracle checklist, but considering the large amount of work required for this, this is not required during this period.

Questions

  1. Could you provide some concrete examples? Maybe pick one or two tasks to briefly summarize in which step and how the reward model helps guide the agent? This will make it more intuitive.
  2. Some missing experimental details: was the main experiment conducted with the verbalizer scoring strategy?

Limitations

Yes

Final Justification

Most of my issues are solved, except for part of W2, which is about the weakness of the checklist-based reward model. However, I do not think it is a reason to reject the paper. I think the main part of the paper is solid enough to be accepted.

Formatting Concerns

No such concerns

Author Response

Dear Reviewer 88aJ,

We appreciate your thoughtful review and recognition of our efforts. We hope that the following responses help clarify the reviewer’s questions:


W1

Yes, Web-Shepherd is trained on both checklist generation and reward modeling. To ensure the checklist quality is not biased or limited by GPT-4o’s outputs, we provide human-annotated trajectories to GPT-4o when generating a checklist dataset. This ensures that the model learns from expert reasoning grounded in successful demonstrations.

We fine-tuned Qwen2.5-3B under two settings: (a) only the reward generation dataset, and (b) joint training with both the checklist and reward generation datasets. We evaluated both settings on WebRewardBench using ground-truth checklists. As shown in our results, joint training does not lead to significant performance degradation. In fact, we observe modest improvements on certain out-of-domain tasks, suggesting that multi-task learning may enhance generalization.

| Setting | Mind2Web MRR (step) | Mind2Web ACC (step) | Mind2Web ACC (traj) | WebArena MRR (step) | WebArena ACC (step) | WebArena ACC (traj) |
|---|---|---|---|---|---|---|
| (a) only reward generation | 88.84 | 80.31 | 48.35 | 89.06 | 82.61 | 45.00 |
| (b) checklist + reward generation | 87.62 | 79.72 | 48.57 | 91.06 | 85.51 | 60.00 |

W2

Thank you for the insightful suggestion. To better understand the performance drop on the GitLab split, we manually annotated checklists and conducted experiments using them. We observed performance improvements compared to using self-generated checklists. We also found that the checklist quality for GitLab was slightly lower than for other websites, which may have contributed to the observed performance drop.

However, the performance gain was still limited compared to the baseline (i.e., w/o trajectory search). We identified two main bottlenecks:
(1) Lack of correct action candidates due to limitations in the policy model, and
(2) Reward prediction errors that occur even when high-quality checklists are provided.

This is likely because the remaining failures after GPT-4o’s greedy search tend to be more challenging cases, which are less amenable to recovery through reward-guided search alone.

| Method (Policy: GPT-4o) | GitLab |
|---|---|
| w/o trajectory search | 56.25 |
| w/ self-generated checklist (Ours 8B) | 46.88 |
| w/ human checklist (Ours 8B) | 60.00 |

Q1

Here’s an example for clarification.
Task: “File upload” on a CMS

First, WebShepherd generates the checklist:
(1) navigate to the correct page,
(2) locate the upload button, and
(3) select a file.

Then, WebShepherd provides natural language feedback and a scalar reward for the agent’s actions. For example:

Action: “Clicked on the ‘Settings’ tab”

  • Feedback: “The agent clicked on the Settings tab, which does not directly lead to the upload interface…”
  • Reward: 0.09 (checklist 1 is incomplete)

Action: “Clicked the 'Media' page”

  • Feedback: “The agent navigated to the Media page, which is the correct destination for uploading files...”
  • Reward: 0.31 (checklist 1 is completed)

This fine-grained, step-level supervision provides informative signals that help the agent make progress toward the final goal. Moreover, the natural language feedback can serve as an additional source of guidance when the agent needs to refine its actions.


Q2

Following previous work [1], we use the verbalizer to calculate the score. This allows the reward model to produce more discriminative scores by computing probabilities over semantically meaningful tokens (e.g., “Yes”, “No”) rather than relying solely on scalar outputs.
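As a concrete illustration of verbalizer-style scoring in the spirit of generative verifiers, here is a minimal sketch; the checkpoint name and prompt format are placeholders, not the exact Web-Shepherd setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint, not the released Web-Shepherd weights.
MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def verbalized_score(prompt: str) -> float:
    """Score a checklist item as P('Yes') / (P('Yes') + P('No')) for the next token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("No", add_special_tokens=False)[0]
    yes_prob, no_prob = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
    return yes_prob.item()

# Example: the prompt ends right before the judgment token ("Yes"/"No").
score = verbalized_score(
    "Checklist item: Navigate to the Media page.\nIs this item completed? Answer: "
)
```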


References

[1] Generative verifiers: Reward modeling as next-token prediction (Zhang et al., 2024)

Comment

Thanks for your detailed response. Most of my concerns are well addressed. I just want to point out a small problem. From the experiments you did for W2, it seems that even when gold checklists are provided, the trajectory search method still struggles to solve the problem. Apart from the reasons you raised, it might also be a limitation of checklist-based reward generation: sometimes, even when the steps needed to complete the task (i.e., the checklist) are provided, it is still not obvious to the model whether the current observation is heading in the correct direction, which requires deeper reasoning about or understanding of the task. But your model shows 85% Acc on the WebArena split, which seems to contradict the low improvements on GitLab mentioned above. So it might be good to collect more instances for WebArena to better reflect the real reward accuracy. However, I appreciate your effort on PRMs for web agents, which is crucial for scaling RL in this domain, and your effort during the rebuttal period. I will keep my score and welcome any further discussion.

Comment

We appreciate your observation regarding the W2 results, and we agree that even with gold checklists, the model may still struggle due to the inherent limitations of checklist-based reward modeling.

As you pointed out, the WebArena split in WebRewardBench includes relatively fewer instances compared to Mind2Web. This discrepancy arises because Mind2Web is based on publicly available human trajectories, whereas WebArena gold trajectories were manually collected by us, which naturally limits the scale. In this work, we focused primarily on validating our proposed PRM framework, and thus we were not able to collect larger-scale WebArena data within the current scope.

However, we fully agree that in order to develop more robust and accurate reward models for Web Agents, expanding the WebArena dataset is an important next step. Thank you again for your insightful feedback.

Official Review
Rating: 4

This paper introduces Web-Shepherd, a PRM specifically designed to evaluate and guide web navigation agents at the step level. Prior work has relied on prompting general-purpose MLLMs as evaluators, which can be slow and expensive for practical use. This paper first constructs the "WebPRM Collection", a large-scale dataset of 40K step-level preference pairs, where each task instruction is paired with a human-annotated checklist of subgoals. Second, it introduces WebRewardBench, an evaluation benchmark for accurately assessing the performance of web navigation PRMs without requiring costly full trajectory rollouts. Web-Shepherd works in two stages: it first generates a task-specific checklist from the user's instruction and then uses the checklist to assign dense, step-level rewards to an agent's actions. Experiments demonstrate that Web-Shepherd significantly outperforms strong MLLM baselines like GPT-4o on WebRewardBench. Furthermore, when used to guide a GPT-4o-mini policy in a trajectory search setting on WebArena-lite, it achieves a substantial performance boost of over 10 points, while being 10x more cost-efficient than using the same MLLM as a verifier.

Strengths and Weaknesses

Strengths

  • [s1] This paper is well written and easy to follow.
  • [s2] The WebPRM Collection (40K examples) and WebRewardBench (built on WebArena and Mind2Web) can be good resources for the research community (and potentially industry).
  • [s3] Constructing checklist seems to be the good way to evaluate the intermediate process in web navigation.
  • [s4] The proposed Web-Shepherd-8B achieved the best performance among other MLLMs on WebRewardBench (in Table 1). Also, GPT-4o-mini/GPT-4o policy with Web-Shepherd-8B achieved the best performance on WebArena-lite.

Weaknesses

  • [w1] WebRewardBench seems to be already saturated in Table 1. There is some doubt about its sustainability (or effective lifespan) as a benchmark.
  • [w2] Evaluating the intermediate states of web navigation can be an ill-defined task (there is no single solution, and the key states and actions are very sparse in some tasks). I feel it is fine for now because Web-Shepherd-8B actually improved trajectory search performance, but the PRM approach may not be the optimal way to guide web navigation.
  • [w3] There is no comparison against an ORM in trajectory search. For instance, https://arxiv.org/abs/2411.02337 has built an ORM for web navigation.
  • [w4] The dataset and benchmark might be too closely tied to WebArena and Mind2Web. They may be less helpful for other web navigation tasks or situations.
  • [w5] In Table 1, none of the models can leverage any image information at all.

Questions

Please see Weaknesses above.

Limitations

yes

Final Justification

The authors provided detailed follow-up experiments in the rebuttal. My concerns were [w1] benchmark saturation and sustainability, [w2] the optimality of the PRM approach for guiding web navigation performance, and [w4] "overfitting" to WebArena and Mind2Web. I understand these points may draw mixed opinions, and I feel the authors provided some positive signs toward addressing them. As in the initial review, I'm leaning toward acceptance.

Formatting Concerns

N/A

Author Response

Dear Reviewer MZUs,

Thank you for your deep understanding of our work, and we appreciate the interesting questions you raised. We would like to address the concerns in the comments below:


W1

We notice that Web-Shepherd has already achieved fairly high performance on WebRewardBench, especially in terms of MRR. However, when examining step-level accuracy (ACC step) and trajectory-level accuracy (ACC traj), there is still room for improvement. For example, in Table 6 (Appendix E), the ACC (traj) of Web-Shepherd (8B) is around 50-60 on Mind2Web.

In addition, considering recent findings that reasoning models (e.g., OpenAI’s o-series) can be used as PRMs [1] and are strong evaluators of web agents [2], we tested o4-mini’s performance on WebRewardBench. Even though it benefits from a long sequence of reasoning, we find that o4-mini struggles on WebRewardBench.

| Model | Mind2Web MRR (step) | Mind2Web ACC (step) | Mind2Web ACC (traj) | WebArena MRR (step) | WebArena ACC (step) | WebArena ACC (traj) |
|---|---|---|---|---|---|---|
| GPT-4o | 69.30 | 49.93 | 8.07 | 69.73 | 53.62 | 15.00 |
| o4-mini (text) | 73.64 | 58.11 | 22.61 | 70.00 | 55.07 | 15.00 |

W2

We fully agree with the reviewer’s point. We are still actively exploring the broader question of how to better guide web agents. As the reviewer noted, it remains unclear whether step-level rewards are inherently superior to outcome-based rewards, as each approach has distinct advantages and limitations.

  • ORM: Outcome-based reward modeling assesses only the final state to determine task success or failure. This simplifies the task objective and yields a more straightforward evaluation criterion compared to PRM. However, as noted in WebJudge [2], evaluating a web agent’s performance solely based on the final state overlooks the quality of intermediate actions and introduces a significant gap compared to human evaluation.

  • PRM: Process-based reward modeling evaluates agent behavior at a finer granularity by assessing step-level progress. While this provides a more detailed signal, it can introduce noise, especially when preferences over individual steps are ambiguous. In designing Web-Shepherd, we aimed to mitigate this by focusing on whether the agent achieves sub-goals, rather than modeling step-level preferences directly. As shown in Section 7.2, our approach (Checklist-based Generative Reward Modeling) effectively addresses key limitations of conventional PRM (e.g., Bradley-Terry modeling), demonstrating strong generalization to the WebArena benchmark.

For future work, a promising direction is to use both signals from ORM and PRM, leveraging the strengths of both to better capture task success and agent behavior.


W3

To compare PRM with ORM, we use the ORM provided in WebRL, which is based on LLaMA-3.1-8B-Instruct and trained on the expanded version of WebArena-Lite (in-domain for this experiment). For a fair comparison, we trained Web-Shepherd using the same backbone LLM, LLaMA-3.1-8B-Instruct. We show the results in the table below.

| Method | Shopping | CMS | Reddit | Gitlab | Map | Total |
|---|---|---|---|---|---|---|
| w/o Trajectory Search | 21.74 | 22.86 | 19.05 | 34.28 | 19.35 | 23.64 |
| ORM | 32.61 | 17.14 | 14.29 | 31.25 | 16.13 | 23.64 |
| PRM | 28.26 | 37.14 | 19.05 | 28.12 | 29.03 | 29.09 |

In terms of Total score, which is an average across all domains, the ORM performs on par with w/o Trajectory Search. There is a significant increase in the Shopping domain (21.74 -> 32.61), but the ORM underperforms w/o Trajectory Search in the other domains.
When comparing our PRM with the ORM, our method outperforms overall, showcasing the effectiveness of process-level guidance. Since this version is based on LLaMA-3.1-8B, which is usually weaker than Qwen-3, it yields lower performance than our original Web-Shepherd. Lastly, while the PRM selects a single action to execute without rolling out all action candidates, the ORM needs to run 5 trajectories independently, which incurs 5x more environment-agent interaction cost.
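To make the cost difference concrete, here is a schematic comparison of the two search strategies; interfaces such as `policy.propose`, `prm.score`, `orm.score`, and `env.rollout` are hypothetical placeholders, not actual APIs from the paper.

```python
def prm_guided_step(policy, prm, state, n_candidates=5):
    """Reward-guided search: score candidate actions with the PRM, execute only the best one."""
    candidates = [policy.propose(state) for _ in range(n_candidates)]
    scores = [prm.score(state, action) for action in candidates]  # no environment interaction here
    return candidates[scores.index(max(scores))]  # a single action is executed per step

def orm_best_of_n(policy, orm, env, task, n_trajectories=5):
    """Outcome-based search: roll out N full trajectories, keep the one the ORM rates highest."""
    trajectories = [env.rollout(policy, task) for _ in range(n_trajectories)]  # N full rollouts
    scores = [orm.score(task, traj) for traj in trajectories]
    return trajectories[scores.index(max(scores))]
```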


W4

We agree that testing on more benchmarks covering diverse domains would increase the robustness of our experiments. Thus, we expanded the trajectory exploration experiment to WorkArena [3], which is completely out-of-domain for Web-Shepherd. (We also carefully inspected other benchmarks, such as AssistantBench, but we frequently faced CAPTCHA issues that block web agents from further exploring the websites. We are still looking for other benchmarks on which we could run our experiments.)

| Method | Dashboard | Form | Knowledge | List-filter | List-sort | Menu | Service Catalog | Total |
|---|---|---|---|---|---|---|---|---|
| w/o Trajectory Search | 50.00 | 0.00 | 10.00 | 0.00 | 5.00 | 25.00 | 2.22 | 9.39 |
| GPT-4o-mini | 55.00 | 10.00 | 10.00 | 0.00 | 6.67 | 20.00 | 5.56 | 12.42 |
| Web-Shepherd (3B) | 57.50 | 14.00 | 10.00 | 0.00 | 10.00 | 10.00 | 11.11 | 14.85 |
| Web-Shepherd (8B) | 65.00 | 14.00 | 20.00 | 0.00 | 10.00 | 20.00 | 7.78 | 15.76 |

We find that trajectory search with a PRM also improves the success rate on WorkArena when comparing w/o Trajectory Search against using GPT-4o-mini as the PRM (the Total success rate increases from 9.39 to 12.42). Furthermore, Web-Shepherd outperforms GPT-4o-mini across all domains except Menu. A potential explanation for the relatively low performance in the Menu domain is the complexity of the menu bar in WorkArena, which includes multi-level dropdown structures and search boxes embedded within the dropdowns. In such cases, the policy model often generates highly unreliable action candidates, and Web-Shepherd appears to be undertrained for handling these complex interaction patterns.


W5

When we first observed the results with multimodal input in Table 1, they seemed counterintuitive, and we initially suspected an error in our implementation. After extensive verification and repeated trials, however, we found no mistakes. We later realized that similar patterns of low performance with multimodal input have also been reported in prior work. For instance, AgentRewardBench [4], which introduces a benchmark for evaluating ORMs for web agents, similarly observes that incorporating both text and image observations can underperform compared to using text-only observations. This degradation is due to the increased complexity of multimodal input, which may distract the model and hinder learning.

On the other hand, we observe subtle improvements from visual information. In Table 6 (Appendix E.1), we provide results on WebRewardBench with the generated checklist (in the main table, we report results with the reference checklist). In the table below, we observe modest improvements on all metrics except ACC (traj) on Mind2Web. It seems that self-generated checklists require much less reasoning cost compared to reference checklists that are unfamiliar to the model.

| PRM | Modality | Mind2Web MRR | Mind2Web ACC (step) | Mind2Web ACC (traj) | WebArena MRR | WebArena ACC (step) | WebArena ACC (traj) |
|---|---|---|---|---|---|---|---|
| GPT-4o-mini | Text | 56.5 | 30.6 | 3.9 | 51.3 | 30.4 | 5.0 |
| GPT-4o-mini | Text+Image | 57.4 | 32.1 | 2.5 | 57.8 | 37.7 | 15.0 |

References

[1] Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators (Kim et al., 2025)

[2] An Illusion of Progress? Assessing the Current State of Web Agents (Xue et al., 2025)

[3] Workarena: How capable are web agents at solving common knowledge work tasks? (Drouin et al., 2024)

[4] AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories (Lù et al., 2025)

Comment

Thank you for the detailed response and extensive additional experiments despite the limited timeline. As in the initial review, I'm leaning toward acceptance.

Comment

Thank you very much for taking the time to carefully read our response. We are glad to hear that it was helpful. Once again, we truly appreciate your feedback. It has been valuable in improving our work.

Official Review
Rating: 5

The authors propose to use process-based reward predicted by a relatively small specialized model to provide fine-grained feedback for long horizon web navigation tasks carried out by web agents. This is in contrast with works that ask heavyweight off the shelf MLLMs to verify whether each step is successful, and promises to be more efficient and effective. Furthermore, the authors create and share the WebPRM Collection dataset with human per-step annotations for 40k trajectories, along with the WebRewardBench that measures the performance of a candidate model on this dataset. With this dataset, the authors train a new model (Web-Shepherd) to provide step-wise process-based reward assignment. The authors show that this model greatly outperforms current out of the box large proprietary and open source models for this task.

Strengths and Weaknesses

Strengths

The work represents a significant contribution to the community to advance agentic AI for web tasks by providing the large and useful WebPRM Collection dataset, which is based on human collected trajectories of interactions that have undergone a rigorous level of review by evaluators.

The fine-tuning of the WebShepherd on the proposed dataset is well conceived and executed. The experiment to produce a lightweight Web-Shepherd model shows convincing performance improvements over state of the art proprietary and open source models, and is shown to translate to significantly higher success rates when used to guide trajectory search for GPT-4o and GPT-4o-mini.

The paper is overall very well written and structured, and very thorough in providing a complete view of the dataset’s characteristics in the supplementary material. For example, the data collection diagram in Figure 3 provides ample information to understand the process and its strengths / limitations. The authors undertake an extensive amount of quantitative analysis of the impact of the Web-Shepherd model, for example by looking into exactly how much improvements in checklist quality translate into reward accuracy.

Weaknesses

The originality of the work is not exceptionally strong. Process based rewards have been well understood to yield better performance for the final model when used as a much denser supervision source. It is also not entirely surprising that a small model fine-tuned on the PRM dataset can outperform off the shelf larger models.

As well, the work is missing some important citations. While it is true that, to my knowledge, the work is the first to propose a human-annotated dataset for process-based rewards in the context of web agents, a similar concept has been used in other domains. For example, ReST-MCTS (Zhang et al., NeurIPS 2024) proposed per-step rewards for math and scientific reasoning. VisualPRM (Wang et al., arXiv 2025, not yet published in a peer-reviewed venue) provided a PRM dataset for visual reasoning. It would be good to contrast the work with these and explain how the authors’ proposal differs from / improves on those methodologies.

Questions

It would be great to see the authors’ response to the concerns listed in the weaknesses section.

As well, as with dataset papers, it would be nice to have some insight into data scaling laws: how much impact on the WebShepherd model would there be if the training dataset size was halved vs current vs doubled? This could give the reader an understanding of whether there is merit in further scaling up such data collection.

Limitations

It would be great to see the authors’ response to the concerns listed in the weaknesses section.

As well, as with dataset papers, it would be nice to have some insight into data scaling laws: how much impact on the WebShepherd model would there be if the training dataset size was halved vs current vs doubled? This could give the reader an understanding of whether there is merit in further scaling up such data collection.

Final Justification

The new experimental results showing how performance scales with the amount of available data are illuminating, and it seems that future work to further increase the dataset size can still result in additional performance gains.

Formatting Concerns

none

Author Response

Dear Reviewer nLse,

We appreciate your comments and positive feedback on our work. We will address the concerns and the questions raised by the reviewer in the comments below:


W1

We agree that process-based rewards have been previously explored and are known to provide denser supervision, often leading to improved downstream performance. Our work does not claim to introduce an entirely novel concept, but rather focuses on adapting and operationalizing PRM in the context of complex, multi-turn web environments—an application domain where reward modeling poses unique challenges. While it may not be surprising that a small model fine-tuned on the PRM dataset can outperform larger off-the-shelf models, we believe our contributions lie in the careful design of the reward modeling, extensive analysis, and the empirical insights obtained from adapting PRM to this setting.


W2

Thank you for highlighting the importance of prior work such as ReST-MCTS and VisualPRM. As the reviewer noted, these studies have effectively demonstrated the benefits of process-level supervision, and our work builds upon these insights. The key distinction between our approach and theirs lies in how the process-level reward data is constructed. While ReST-MCTS and VisualPRM rely on MCTS or Monte Carlo sampling to annotate reward scores, such methods depend heavily on the performance of the underlying policy model. This dependency poses a significant limitation in the context of web agents, where the action and observation spaces are extremely large, resulting in a very low likelihood of obtaining positive rewards through rollout-based methods.

Our work provides a path toward modeling PRMs for web agents where the process reward is hard to estimate from the outcome reward due to relatively low success rate and significant cost of agent-environment interaction time during rollout. We leverage synthetic and human annotation in a hybrid manner and improve generalization capability. We will make sure that this discussion is included in the next version of the draft.


Q1

Following the reviewer’s suggestion, we analyzed the effect of (1) the number of instructions and (2) the number of rejected actions in the dataset on PRM performance. Specifically, we construct datasets from subsets of the WebPRM Collection containing 0.25, 0.5, and 0.75 of the instructions (with their corresponding chosen-rejected pairs), and with a maximum of 1, 2, or 3 rejected actions per instruction. We trained variants of Web-Shepherd on these datasets using the same model (i.e., Qwen-2.5-3B-Instruct) and hyperparameters. The results are shown in the tables below.

Overall, using about half of the original dataset (in terms of either the number of instructions or the number of rejected actions) leads to a drastic decrease in ACC (traj) on both benchmarks. In particular, on the out-of-domain benchmark WebArena, the instruction ablation drops ACC (traj) from 60.0 to 15.0, which suggests the model fails to generalize to unseen domains. In the rejected-action ablation, removing even one rejected action is critical, dropping the WebArena ACC (traj) from 60.0 to 20.0. These results highlight that both the number of instructions and the number of rejected actions are critical for training an effective PRM; reducing either significantly impairs generalization, particularly in out-of-domain settings such as WebArena.

**Ablation on the number of instructions**

| Proportion of instructions | Mind2Web MRR (step) | Mind2Web ACC (step) | Mind2Web ACC (traj) | WebArena MRR (step) | WebArena ACC (step) | WebArena ACC (traj) |
|---|---|---|---|---|---|---|
| 0.25 | 68.38 | 49.75 | 11.34 | 65.97 | 47.83 | 10.00 |
| 0.5 | 77.46 | 62.56 | 21.54 | 73.57 | 56.52 | 15.00 |
| 0.75 | 83.64 | 72.00 | 33.63 | 88.09 | 79.71 | 55.00 |
| Ours (3B) | 87.62 | 79.72 | 48.57 | 91.06 | 85.51 | 60.00 |

**Ablation on the number of rejected actions**

| Max number of rejected actions | Mind2Web MRR (step) | Mind2Web ACC (step) | Mind2Web ACC (traj) | WebArena MRR (step) | WebArena ACC (step) | WebArena ACC (traj) |
|---|---|---|---|---|---|---|
| 1 | 71.42 | 52.77 | 12.79 | 63.04 | 43.48 | 10.00 |
| 2 | 77.66 | 62.72 | 17.55 | 76.47 | 63.77 | 20.00 |
| 3 | 79.70 | 66.04 | 24.91 | 77.46 | 63.77 | 20.00 |
| 4 (Ours, 3B) | 87.62 | 79.72 | 48.57 | 91.06 | 85.51 | 60.00 |

Comment

I'd like to thank the authors for their response to the issues I raised in the original review. The new experimental results showing how performance scales with the amount of available data are illuminating, and it seems that future work to further increase the dataset size can still result in additional performance gains.

I would like to reiterate my support for the acceptance of this paper.

Comment

Thank you for the kind follow-up and continued support. We are happy to hear that the additional experiments were helpful. We also think that further scaling the dataset could lead to even better results, and we plan to explore that direction more in future work.

Final Decision

This paper proposes Web-Shepherd, a process reward model for web navigation, offering step-level evaluation that outperforms GPT-4o in accuracy by a large margin. The work also contributes the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs, and introduces WebRewardBench, a meta-evaluation benchmark for evaluating PRMs.

Summary of Strengths:

  • The paper is clearly written and well-structured.
  • The work contributes new resources and a benchmark: the WebPRM Collection and WebRewardBench can benefit both research and industry.
  • Web-Shepherd outperforms larger models on WebRewardBench and improves trajectory search performance in WebArena-lite while being more cost-efficient.
  • Even when the model is trained with Mind2Web data, Web-Shepherd generalizes well to WebArena, showing robustness to different data.
  • PRMs are useful for advancing web agents during inference-time search and reinforcement learning.

Summary of Weaknesses:

  • The paper doesn't compare to ORMs. This was discussed during the rebuttal, and the authors included additional experiments to clarify.
  • The dataset and the benchmark are tightly tied to WebArena and Mind2Web, so the work may have limited applicability to broader web navigation settings. Showing results on additional benchmarks could be useful, which the authors mention as future work in the rebuttal.

The discussions and clarifications that the authors provided resulted in score increases by some reviewers. Given the resources and the novel contributions with good results, the paper would be a good contribution to NeurIPS.