Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
Abstract
Reviews and Discussion
This paper introduces a framework for autonomous skill discovery for real-world Internet agents, called PAE (Proposer-Agent-Evaluator). Compared to existing approaches, this method allows the agent to collect, explore and refine new skills automatically, without relying on a limited set of predefined human-annotated instructions. The author(s) show how the additional PAE components introduced (the task proposer, the agent policy and the autonomous evaluator) allow for zero-shot SOTA generalization performance, as well as better success rates than open-source VLMs and finetuned open-source VLMs on two environments, WebVoyager and WebArena.
Strengths
- The author(s) highlight and prove the feasibility of using much smaller VLMs and thus significantly lowering the test-time compute through their method, as opposed to finetuning larger models.
- The author(s) are willing to open-source the code and models to encourage further development in the research community.
- Overall, the paper is easy to follow and clearly motivates the research question addressed and the main contribution in relation to existing literature. Concepts introduced in the main paper nicely point to corresponding appendix sections when needed, which provide additional detail.
- The author(s) are doing a great job explaining how their work differentiates itself from existing research along 3 separate dimensions: foundation model agents, self-generated instructions and unsupervised skill discovery in RL. I particularly appreciate the focus on enabling the community to use much smaller sized models in terms of parameters, and therefore significantly less test-time compute.
- I appreciated the analysis on failure modes in Section 6, highlighting the value added by the PAE method, as opposed to SFT.
- It is very good to see that each action contains a chain-of-thought element, aiding the interpretability of the model's choices. The author(s) also take preventive measures, as highlighted in their Ethics statement.
Weaknesses
- To strengthen the submission/contribution, it would be good to elaborate on why proprietary VLMs outperform all open-source methods.
- It would be great to add some details/discussion on inference times, comparing the open-source VLMs, the SFT variation and PAE.
- It wasn't very clear to me how you define the reward for the agent policy; could you elaborate on that?
Questions
Clarification Questions:
- For all components, the task proposer, autonomous evaluator and agent policy, could you elaborate more on the choice of VLMs: Claude-3-Sonnet and the LLaVa models? Why those and not others?
- For evaluation, could you elaborate on your choice of open-source VLMs? Why are they chosen as opposed to others?
- In the main results, under the scaling paragraph, line 400, I’m not quite clear on the following statement “Again, LLaVa-34B PAE beats LLaVa-7B SFT on 12 out of 13..”. Is that the intended comparison, or did you mean to compare across PAE models only?
- In the main results, under the generalization paragraph, line 405, why is LLaVa-7B PAE highlighted, when actually you present results for the larger LLaVa-34B PAE as well in Table 3? Also in Table 3, why did you drop some of the other models from Tables 1 and 2 (Claude 3.5, InternVL and vanilla LLaVa-7/34B)?
- In the alignment with human judgements, could you elaborate on how many human annotators were included in the user study and if possible how many hours they allocated to the task?
Minor comments/suggestions:
- There are a few minor typos: on line 91 “for foundation model agents” (remove “a”), on line 191 in section 3.2, task distribution should be C instead of R, on line 485 “LLaVa-7B SFT knows that”
- For consistency across the entire paper, use Claude-3-Sonnet – on line 269 in section 4.3, it is referred to as Claude-Sonnet-3.
- Line 317 the LLaVa model name needs to be in bold to align with the others.
Ethics Concern Details
Do you have the right permissions in place to sample from the user demos on the websites (Section 4.2, for the context-aware task proposer)? [Edit: concerns addressed by the authors; data was collected in-house.]
Q5: How did we choose the baselines of open-source VLMs?
Again, we chose the LLaVa Next series as it had the best performance on VisualWebBench (https://visualwebbench.github.io/) when the main experiments were conducted. InternVL-2.5-Xcomposer and Qwen2VL are the latest open-source VLM releases around the time of submission of this paper, so they are also added to the paper.
Q6: In the alignment with human judgements, could you elaborate on how many human annotators were included in the user study and if possible how many hours they allocated to the task?
Five annotators participated in the user study, and the entire annotation process took around 40 annotator-hours with the help of a designated user interface.
Q7: In the main results, under the generalization paragraph, line 405, why is LLaVa-7B PAE highlighted, when actually you present results for the larger LLaVa-34B PAE as well in Table 3? Also in Table 3, why did you drop some of the other models from Tables 1 and 2 (Claude 3.5, InternVL and vanilla LLaVa-7/34B)?
We bold the results of LLaVa-7B PAE to highlight the improvement of PAE compared to SFT at both the 7B and 34B scales. The omission from Table 3 of the baselines that are not main points of comparison is due to budgetary constraints. We are working to get more compute and API credits to include those results in the final version of the paper.
Q8: Typos.
Thanks so much for catching the typos! We have fixed them in blue in the updated pdf.
Response to Ethics concerns.
In our experiments, all user demos were collected by the authors instead of actual users, so there should be no permission/privacy concerns.
Thank you very much for addressing the clarification questions raised. Regarding the human annotators and the user demos, would it be possible to include this information in the appendix? It would be a great piece of information to add for the community, in case others would like to conduct similar user studies. I'll maintain my score of 8.
Thanks so much for reading our rebuttal. We have updated Appendix B to include the above information on human annotations and user demos, along with the instructions provided to the annotators.
Thank you for your review and positive feedback on the paper! Please let us know if your questions are addressed. We are happy to discuss further. We answer your questions below:
Q1: It would be good to elaborate on why closed-source models outperform all open-source models.
We have provided a failure mode analysis in Figure 5 of the paper. In particular, the failure mode that Claude 3.5 Sonnet reduces most compared to the open-source models is low-level web navigation skills. We suspect that this might result from the training corpus of proprietary models containing more and better web-related data and observations in the form of set-of-marks.
Q2: It would be good to add more details on inference time analysis.
After PAE training, the model directly outputs its thoughts and actions at each step without incurring any additional inference-time cost. As shown in the table below for WebArena Easy, we found that the thoughts after PAE training became more accurate but not longer, so PAE has roughly the same inference-time cost as the SFT policy, as measured by the number of tokens in the action (including the thought) per step.
| Model | Average Number of Tokens Per Action (Including Thought) |
|---|---|
| LLaVa-7B SFT | 331.0 |
| LLaVa-7B PAE | 301.7 |
Q3: How is the reward defined for the agent policy?
PAE makes use of an image-based outcome evaluator that evaluates the success of each trajectory from the last three screenshots and the answer from the agent. The outcome-based evaluator provides a reward of 1 at the end of the trajectory if the task is considered successful and 0 otherwise; rewards at all other steps are set to 0. The use of an outcome-only evaluator is one of our design choices, which we found best exploits the asymmetric capabilities of SOTA VLMs as evaluators versus as agents in web agent tasks (additional experiment results on this are provided in blue in Section 6 Discussions). In our preliminary experiments, we found that using VLMs to provide dense annotations such as per-step rewards is less robust due to hallucinations. We are re-running those experiments and will update these results once they are done.
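To make the sparse reward structure concrete, below is a minimal sketch of how such an outcome-only reward could be assigned to a trajectory. The function and attribute names (`evaluator_says_success`, `step.screenshot`, `step.answer`) are our own illustration and are not taken from the paper's code.

```python
# Minimal sketch of the outcome-only sparse reward described above.
# `evaluator_says_success` stands in for the VLM evaluator: given the task,
# the last screenshots, and the agent's final answer, it returns True/False.

def assign_rewards(trajectory, task, evaluator_says_success):
    """Return one reward per step: 0 everywhere except the final step,
    which gets 1 if the evaluator judges the trajectory successful."""
    last_screenshots = [step.screenshot for step in trajectory[-3:]]
    final_answer = trajectory[-1].answer
    success = evaluator_says_success(task, last_screenshots, final_answer)

    rewards = [0.0] * len(trajectory)
    rewards[-1] = 1.0 if success else 0.0
    return rewards
```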
Q4: Why did we choose Claude 3 Sonnet and LLaVa Next for our components?
We chose Claude 3 Sonnet because we happened to have credits for this model, but in principle any VLM that can reason across multiple images can be used as the task proposer and evaluator. We have also provided additional results using only open-source models, Qwen2VL-7B and Qwen2VL-72B, for proposing tasks and conducting evaluations, thereby eliminating the need for closed-source models. We have updated the results in Section 6 Discussions of the paper and also provide the table of the most relevant results below for your convenience:
| Model | OpenStreetMap | PostMill | OneStopMarket | Average |
|---|---|---|---|---|
| Claude 3 Sonnet | 24.3 | 55.8 | 41.7 | 36.0 |
| Qwen2VL-7B | 0.7 | 10.2 | 20.2 | 7.5 |
| Qwen2VL-72B | 16.0 | 32.8 | 32.7 | 23.9 |
| LLaVa7B-SFT | 15.2 | 16.8 | 25.4 | 18.0 |
| LLaVa7B-PAE | 19.5 | 21.1 | 42.3 | 24.6 |
| LLaVa7B-PAE (Qwen2VL-7B) | 20.2 | 25.0 | 28.6 | 23.1 |
| LLaVa7B-PAE (Qwen2VL-72B) | 17.9 | 30.6 | 39.2 | 26.0 |
In this table, we find that LLaVa-7B PAE using Qwen2VL-72B as the task proposer and evaluator achieves similar performance to using Claude 3 Sonnet as the task proposer and evaluator, despite their significant difference in agent performance (23.9% compared to 36.0% average success rate). As a result of this improvement, LLaVa-7B PAE using Qwen2VL-72B as the task proposer and evaluator achieves better performance than Qwen2VL-72B itself. Perhaps even more surprisingly, even Qwen2VL-7B, with much weaker agent performance than LLaVa-7B SFT (7.5% compared to 18.0%), can be used to make significant improvements (from 18.0% to 23.1%). These results demonstrate that the improvements from PAE essentially stem from taking advantage of the asymmetric capabilities of state-of-the-art VLMs as agents and as task proposers/evaluators, instead of simply imitating a stronger VLM.
We chose the LLaVa Next models because they were the best open-source VLMs on VisualWebBench (https://visualwebbench.github.io/) at the time the main experiments were conducted. However, more capable open-source VLMs such as Qwen2VL have also become available recently.
This paper proposes a new framework called Proposer-Agent-Evaluator (PAE) which allows foundation model agents to autonomously discover and practice skills in the wild. The framework consists of a context-aware task proposer that suggests tasks for the agent to practice with website context information, an agent policy that attempts these tasks in the real world, and an autonomous model-based success evaluator that evaluates the resulting trajectories. The success evaluation serves as the reward signal for the agent to refine its policies. The authors validate PAE on challenging vision-based web navigation using real-world and self-hosted websites and demonstrate significant improvements in the zero-shot generalization capability of VLM Internet agents compared to state-of-the-art open-source VLM agents.
Strengths
- The paper is well written and easy to follow. The problem of foundation model agents and the PAE method is formally defined and clearly illustrated with figures. The idea of the Proposer-Agent-Evaluator framework seems reasonable in that the proposer should approximate the real-world task distribution and the agent learns to maximize the autonomous reward function.
- Extensive and systematic experiments are carried out to verify the effectiveness of PAE compared with untrained VLMs and SFT training. Error analysis and the study of context complexity also show that the context-aware method can discover low-level web skills.
Weaknesses
- While the PAE framework seems reasonable, the method itself is naive and lacks novelty. Basically, the authors train a VLM with data containing instructions and successful trajectories relying on the proprietary VLMs, which is a common approach and does not include inspirational techniques. Although human annotation may overlook some long-tail real-world tasks, it's unclear why VLMs provided with web context can approximate the real-world task distribution. Also, behavior cloning the successful trajectories labeled by a VLM should not be described as reinforcement learning (line 117).
- It's not explained why behavior cloning is adopted instead of other training methods including SFT to discover web skills. Is the LLaVa-SFT baseline trained on the same trajectories as the proposed PAE model? What are the performances of other leading-edge VLMs such as GPT-4o? Comparisons with other trained web agents on Web Voyager or Web Arena are also missing.
- Several typos in line 91 ('a foundation model agents') and line 191 (task distribution R).
Questions
Response to the questions and concerns mentioned in the Weaknesses.
Q5: Comparisons with other trained web agents are missing.
According to a recent VLM agent benchmark (https://github.com/THUDM/VisualAgentBench/tree/main?tab=readme-ov-file#leaderboard), there are no other trained generalist open-source VLM web agents that can be applied directly without task-specific fine-tuning. We are happy to add additional baselines of such trained web agents, if they are open-source. Thus, another significant contribution of this work is to release the medium-size sota open-source checkpoints and large-scale dataset for generalist VLM web agents. As mentioned by Reviewer wpvf, the release of our sota open-source checkpoint and large-scale dataset can facilitate RL and fine-tuning research in the VLM web agent domain, while previously the community has to rely on very large models such as Qwen72B or proprietary models. We will also release our training infrastructure that supports efficient large-scale interactions with thousands of distributed web browsers.
Q6: The performance of other cutting-edge VLMs such as GPT-4o is missing.
We would like to mention that the performance of GPT-4o is orthogonal to the contribution of PAE, as it is built upon a significantly more capable base model. We did not include GPT-4o results because of practical constraints on OpenAI API credits and rate limits at the time of paper submission, but we are working on acquiring such credits to include the performance of GPT-4o in the final paper.
Q7: Several typos
Thanks for pointing out those typos. We have corrected them in blue in the updated pdf.
[1] Jenny Zhang, Joel Lehman, Kenneth Stanley, and Jeff Clune. Omni: Open-endedness via models of human notions of interestingness, 2024.
[2] Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code, 2024.
[3] Cedric Colas, Laetitia Teodorescu, Pierre-Yves Oudeyer, Xingdi Yuan, and Marc-Alexandre Cote. Augmenting autotelic agents with large language models, 2023.
[4] Cédric Colas, Tristan Karch, Nicolas Lair, Jean-Michel Dussoux, Clément Moulin-Frier, Peter Ford Dominey, and Pierre-Yves Oudeyer. Language as a cognitive tool to imagine goals in curiosity-driven exploration, 2020.
[5] Jesse Zhang, Jiahui Zhang, Karl Pertsch, Ziyi Liu, Xiang Ren, Minsuk Chang, Shao-Hua Sun, and Joseph J. Lim. Bootstrap your own skills: Learning to solve new tasks with large language model guidance, 2023b.
[6] Yuqing Du, Olivia Watkins, Zihan Wang, Cedric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. Guiding pretraining in reinforcement learning with large language models, 2023.
Open-source contributions: Another significant contribution of this work is to release the medium-size sota open-source checkpoints and large-scale dataset for generalist VLM web agents. As mentioned by Reviewer wpvf, the release of our sota open-source checkpoint and large-scale dataset can facilitate RL and fine-tuning research in the VLM web agent domain, while previously the community has to rely on very large models such as Qwen72B or proprietary models. We will also release our training infrastructure that supports efficient large-scale interactions with thousands of distributed web browsers.
Q2: Unclear why tasks proposed by context-aware VLMs can approximate real-world task distribution
While a VLM-based task proposer cannot perfectly approximate the real-world task distribution, it is much cheaper and more scalable than asking humans to annotate tasks. However, as discussed above, careful design choices must be made so that the skills the agents acquire through synthetic tasks can generalize to the distribution of real-world tasks. We provide an additional ablation result below showing the effectiveness of our design decision of adding a reasoning step before web operations.
| Model | OpenStreetMap | PostMill | OneStopMarket | Average |
|---|---|---|---|---|
| Claude 3 Sonnet | 24.3 | 55.8 | 41.7 | 36.0 |
| LLaVa7B-SFT | 15.2 | 16.8 | 25.4 | 18.0 |
| LLaVa7B-PAE | 19.5 | 21.1 | 42.3 | 24.6 |
| LLaVa7B-SFT (no reasoning) | 2.8 | 0 | 8.0 | 3.8 |
| LLaVa7B-PAE (no reasoning) | 9.6 | 8.0 | 32.0 | 14.4 |
In addition to the system-level design decisions and open-source contributions, PAE also differs from the prior literature in its incorporation of an additional reasoning step before taking actions. Only the action parts determine the actual web operations, but the reasoning traces are also included in the filtered BC loss. As shown in the table above and updated as Figure 6 in the pdf, while the agent can still learn diverse skills to a similar level on the proposed tasks without the reasoning step, its generalization to unseen human-written tasks declines significantly. This illustrates the important difference between learning skills alone and generalizing to unseen human-written tasks.
Q3: Is LLaVa-SFT trained on the same trajectories as LLaVa-PAE?
There might be an important misunderstanding of our approach, and we would like to clarify that LLaVa-SFT and LLaVa-PAE use fundamentally different data. LLaVa-SFT is trained on trajectory demonstrations from Claude 3 Sonnet, following common approaches, so that it has non-trivial performance as a web agent (better than the 0.9% when LLaVa is used zero-shot, as shown in Table 1 of the paper) and can enter the autonomous RL loop of PAE. LLaVa-PAE is trained on top of that with entirely self-generated trajectories. With the additional experiments shown in the answer to the question above, the PAE training loop can be entirely independent of closed-source models, leading to a fully autonomous skill discovery system that generalizes to unseen real user tasks. This working autonomous skill discovery system is the novel contribution of this work.
Q4: Behavior Cloning successful trajectories labeled by a VLM should not be described as RL. And why is behavior cloning adopted instead of other training methods such as SFT?
In this work, filtered behavior cloning serves as a policy optimization technique for the rewards provided by the autonomous evaluator within an online RL loop, where we iterate between rolling out the current policy to collect data and performing policy optimization. Importantly, the use of a behavior cloning loss on successful trajectories as the policy optimization method does not alter the fact that our primary workflow adheres to an online reinforcement learning paradigm. However, we understand that whether filtered behavior cloning should be considered RL can be ambiguous. As such, we have updated Section 4.4 of the pdf to describe Filtered Behavior Cloning as a policy optimization method instead of an RL method. It is important to note that PAE adopts an online RL paradigm instead of only SFT on pre-collected successful trajectories.
To reply to your question on why behavior cloning is adopted instead of other training methods such as SFT: there might be a terminology difference between the RL community and the foundation model community. We use online Filtered Behavior Cloning to mean negative log-likelihood training on the successful trajectories collected online, which is the same as SFT on the successful trajectories collected online.
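To illustrate what this online loop looks like, here is a rough sketch; `propose_tasks`, `rollout`, `evaluate`, and `update_policy` are hypothetical callables supplied by the caller, not the released training code.

```python
# Sketch of the online filtered-BC loop described above: propose tasks from
# website contexts, roll out the current policy, keep only trajectories that
# the outcome evaluator marks successful, and fine-tune on them with NLL.

def pae_training_loop(policy, propose_tasks, rollout, evaluate, update_policy,
                      websites, num_iterations):
    for _ in range(num_iterations):
        tasks = propose_tasks(websites)                      # context-aware proposals
        trajs = [rollout(policy, task) for task in tasks]    # online data collection
        successes = [traj for task, traj in zip(tasks, trajs)
                     if evaluate(task, traj) == 1]           # outcome-based 0/1 reward
        # "Filtered BC" = SFT (negative log-likelihood) on the successful
        # online trajectories, imitating both thoughts and actions.
        policy = update_policy(policy, successes)
    return policy
```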
Thank you for your review and feedback on the paper. To address the raised questions, we have provided additional experiment results using only open-source models as task proposers and evaluators, showing that the performance of PAE does not depend on proprietary VLMs. We have also updated the pdf with major rewrites based on your suggestions and framed the novel contribution of this work as a working system that successfully generalizes autonomous skill discovery to real-world human requests, as opposed to the novelty of individual components. Please let us know if your concerns are addressed. We are happy to discuss further. We answer your questions below:
Q1: The method itself lacks novelty and has to rely on proprietary VLMs for instructions and successful trajectories.
First of all, we would like to clarify that PAE iteratively improves the agent model using self-generated trajectories instead of successful trajectories from proprietary VLMs. The design principle of PAE is to take the best advantage of the asymmetric capabilities of SOTA VLMs as agents versus as proposers/evaluators. Intuitively, VLMs are very good at confirming whether a specific product has been added to the shopping cart (e.g., by looking at the final screenshot to see if the shopping cart contains the product), while less good at actually navigating the web to find the product and add it to the cart. To show that the improvement of PAE does not depend on closed-source models, we have provided additional results using only open-source models, Qwen2VL-7B and Qwen2VL-72B, for proposing tasks and conducting evaluations, thereby eliminating the need for closed-source models. We have updated the results in Section 6 Discussions of the paper and also provide the table of the most relevant results below for your convenience:
| Model | OpenStreetMap | PostMill | OneStopMarket | Average |
|---|---|---|---|---|
| Claude 3 Sonnet | 24.3 | 55.8 | 41.7 | 36.0 |
| Qwen2VL-7B | 0.7 | 10.2 | 20.2 | 7.5 |
| Qwen2VL-72B | 16.0 | 32.8 | 32.7 | 23.9 |
| LLaVa7B-SFT | 15.2 | 16.8 | 25.4 | 18.0 |
| LLaVa7B-PAE | 19.5 | 21.1 | 42.3 | 24.6 |
| LLaVa7B-PAE (Qwen2VL-7B) | 20.2 | 25.0 | 28.6 | 23.1 |
| LLaVa7B-PAE (Qwen2VL-72B) | 17.9 | 30.6 | 39.2 | 26.0 |
In this table, we find that LLaVa-7B PAE using Qwen2VL-72B as the task proposer and evaluator achieves similar performance to using Claude 3 Sonnet as the task proposer and evaluator, despite their significant difference in agent performance (23.9% compared to 36.0% average success rate). As a result of this improvement, LLaVa-7B PAE using Qwen2VL-72B as the task proposer and evaluator achieves better performance than Qwen2VL-72B itself. Perhaps even more surprisingly, even Qwen2VL-7B, with much weaker agent performance than LLaVa-7B SFT (7.5% compared to 18.0%), can be used to make significant improvements (from 18.0% to 23.1%). These results demonstrate that the improvements from PAE essentially stem from taking advantage of the asymmetric capabilities of state-of-the-art VLMs as agents and as task proposers/evaluators, instead of simply imitating a stronger VLM.
To further address the novelty concern: the novel contribution of this work is a system that learns to solve real web agent tasks by using autonomous skill proposals and reinforcement learning. While a wide range of skill discovery methods with LLM task proposers have been proposed in the literature, such works typically focus on learning diverse skills in relatively simplified environments such as games [1, 2, 3, 4] and robotics tasks with a limited number of scenes [5, 6]. This work takes an important step forward to study when those skills can generalize to human requests in realistic benchmarks in the context of web agents and what the best design choices are for such generalization, including an agent policy with additional reasoning steps, a context-aware task proposer, and an image-based outcome evaluator. While each individual component of this work is not novel, the novel contribution of this paper is to present the first working system for such an unsupervised skill discovery procedure to generalize to unseen real-world benchmarks with human requests with the right design decisions, along with extensive experimental analysis of the effectiveness of this framework (as recognized by reviewers uKzp, 4H8g, vLaT, and wpvf). We have included discussions of all your cited papers and additional works along this line of autonomous RL and task proposers in the related works.
I appreciate your detailed response and additional results.
A1. Thanks for the clarification. It makes more sense that the advanced VLMs can provide reward signals for the base model (LLaVA-7B here) to iteratively improve it with filtered trajectories. And I agree that VLMs are better at confirming status, while poorer at making decisions. This leads to my other concern that the trained model is actually distilling prior knowledge from advanced VLMs (Qwen2VL here), which can be limited when the trained model is getting better. I'm curious whether the model can also provide feedback signals and teach itself in a bootstrap manner, which would generalize more conveniently to other scenarios.
A2. I understand that using VLMs for instruction generation is a more practical way to carry out large-scale training. However, in the paper, the authors emphasized the disadvantage of human-generated instructions, which "fail to capture the long-tail distribution of real-world tasks" and the VLM, as "the task proposer approximates the ground-truth task distribution" of the real world. By merely providing the web context such as screenshots (humans can also have access to that), how can the VLMs approximate the real-world task distribution? It is unclear how the VLMs can overcome this problem through your method.
A3. I see. Thanks for the explanation.
A4. Behavior cloning is a term in the imitation learning literature and commonly adopts MSE or cross-entropy loss. It's weird to use it here and claim it to be the same as SFT. Also, the filtered behavior cloning is not formally defined or referred to any prior approach that defines it or uses it to train language models.
A5&6. It's a pity that more comparison results can not be obtained. I believe more results against other methods can make the contribution of this paper more convincing.
Based on my opinions above, some of my concerns remain unaddressed. I decide to maintain the overall score and update the contribution score.
Dear Reviewer vLaT,
Thanks for your response, but we believe there might be some misunderstandings of our paper, and we are not sure your concerns are valid grounds for rejection (a score of 3).
A1 the trained agent is actually distilling knowledge from a more advanced VLM (Qwen2VL)
Thanks for recognizing the motivation of PAE that "VLMs are better at confirming status while poorer at making decisions". Regarding the concern that "the trained agent is actually distilling knowledge from a more advanced VLM (Qwen2VL)", we believe our additional results in Tables 1, 2, and 3 of the updated manuscript address it. Although Qwen2VL-72B is indeed a more advanced VLM (with a 22.6% success rate compared to LLaVa-7B SFT at 14.6% in Table 1), Qwen2VL-7B is worse than LLaVa-7B SFT (a 1.4% success rate compared to 14.6%), yet it can be used as the proposer and evaluator to improve the performance of LLaVa-7B SFT to 21.7% (Table 1). The improvement therefore cannot come entirely from distilling knowledge from Qwen2VL-7B, because the trained agent reaches 21.7% while Qwen2VL-7B only has a success rate of 1.4%. Therefore this concern should not be a ground for rejection.
A5&6 It's a pity that more comparison results can not be obtained. I believe more results against other methods can make the contribution of this paper more convincing.
We have included ablations and some additional comparisons (such as different choices of actors and evaluators) as suggested by other reviewers in Figure 3 of the manuscript. We believe that we have included the strongest baselines central to our contribution (from the recent leaderboard of VisualAgentBench, https://github.com/THUDM/VisualAgentBench/tree/main?tab=readme-ov-file#leaderboard). If there are specific baseline comparisons critical to our contribution, we are happy to add them.
Phrasing-related issues:
A2.
Thanks for recognizing the advantage of PAE of being more practical for large-scale training. That is the motivation for the entirely autonomous framework of PAE, and we did not intend to say that the task proposer can come up with better tasks than humans given an unlimited budget for human annotations. "The task proposer approximates the ground-truth task distribution" has already been updated to "serve as a proxy to improve on the ground-truth task distribution" in line 183 of the latest manuscript. We will remove the confusing sentences related to the "long-tail distribution of real-world tasks" in a future version.
A4
In lines 253-254 of the manuscript, we have defined the loss for Filtered BC as one that "simply imitates all thoughts and actions in successful trajectories with the negative log likelihood loss." In a future version, we will also include a mathematical definition and references to prior literature that uses this method [1, 2]; a sketch of one possible formulation is given after the references below.
[1] Offline RL For Natural Language Generation with Implicit Language Q Learning, ICLR 2023
[2] ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL, ICML 2024
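For reference, one plausible way to write that objective down — our own formulation of the verbal description above, not necessarily the exact equation that will appear in the paper — is:

```latex
% Filtered BC: NLL over thoughts c_t and actions a_t of trajectories that the
% outcome evaluator labels successful, i.e. r(\tau) \in \{0, 1\}.
\mathcal{L}_{\mathrm{FBC}}(\theta)
  = -\,\mathbb{E}_{x \sim p_{\mathrm{task}},\; \tau \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}
    \Big[\, r(\tau) \sum_{t} \log \pi_{\theta}\big(c_t, a_t \mid s_t, x\big) \Big]
```

Here $x$ is a proposed task, $s_t$ the observation at step $t$, $c_t$ and $a_t$ the corresponding thought and action, and $r(\tau)$ the 0/1 outcome judgment from the autonomous evaluator.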
Dear reviewer vLaT,
Thanks so much for providing your feedback, and we would like to let you know that we have gathered more evidence showing that the improvement of PAE stems from taking advantage of the asymmetric capabilities of SOTA VLMs instead of copying stronger proprietary models. We have carried out additional experiments and updated the main results Table 1, Table 2, and Table 3 with additional results in red using open source models as proposers and evaluators. As the results in Table 2 (of WebArena) are already included in the previous rebuttal, we mainly summarize the new results from Table 1 and Table 3 (of WebVoyager and unseen websites) in the table below:
| Model | Seen Websites | Unseen Websites |
|---|---|---|
| Claude 3 Sonnet | 42.4 | 25.0 |
| Qwen2VL-7B | 1.4 | 1.4 |
| Qwen2VL-72B | 22.6 | 8.3 |
| LLaVa7B-SFT | 14.9 | 9.1 |
| LLaVa7B-PAE | 22.3 | 16.3 |
| LLaVa7B-PAE (w/ Qwen2VL-7B) | 21.7 | 13.7 |
Similar to the conclusion of the open-source experiments on WebArena presented in the previous rebuttal, our new results provide more empirical evidence that the improvement of PAE is not bounded by the agent capability of SOTA VLMs: even Qwen2VL-7B (with an agent success rate of only 1.4%) can be used as the task proposer and evaluator to improve the performance of LLaVa7B-SFT from 14.9% to 21.7% on seen websites and from 9.1% to 13.7% on unseen websites. We believe that these results strongly validate our claim that the improvements of PAE stem from the asymmetric capabilities of SOTA VLMs as agents and as proposers/evaluators, instead of copying a stronger model. While our experiments using Qwen2VL-72B are still running, we expect them to yield even better results and will add them to the paper once they are ready.
Please let us know if the additional results have addressed your concerns and we are happy to engage in further discussions.
This paper introduces Proposer-Agent-Evaluator (PAE), a novel framework for autonomous skill discovery in foundation model agents, particularly focusing on Internet agents. PAE consists of three key components: a context-aware task proposer, an agent policy, and an autonomous evaluator. The framework enables agents to autonomously discover and practice skills without human supervision, potentially leading to a more diverse and scalable skill repertoire. The authors validate PAE on challenging vision-based web navigation tasks using both real-world and self-hosted websites. Results show that PAE significantly improves the zero-shot generalization capability of VLM Internet agents. Notably, the PAE-trained model outperforms other state-of-the-art open-source VLM agents. The authors claim this work represents the first attempt to apply autonomous task proposal with reinforcement learning for agents, achieving state-of-the-art performance among open-source models.
Strengths
This paper demonstrates several notable strengths. The authors conducted comprehensive evaluations of their PAE framework across multiple challenging benchmarks, including WebVoyager and WebArena. They compared their method against a range of baselines, including proprietary models, state-of-the-art open-source vision-language models, and supervised fine-tuning approaches. This thorough evaluation provides a clear picture of PAE's performance in relation to existing methods.
The paper also presents a detailed analysis of the results, including error analysis and generalization studies. The authors break down different types of errors made by the models, providing insights into where improvements are made and what challenges remain. They also examine how well the skills learned through PAE transfer to unseen websites, demonstrating the framework's ability to develop general web browsing capabilities. Additionally, the authors investigate how PAE scales with larger base models and explore the impact of different context information on performance. This level of analysis and discussion significantly strengthens the paper by providing a nuanced understanding of the method's capabilities and limitations.
Weaknesses
I am a little unclear about the novelty of the actual algorithm underlying PAE in comparison to something like OMNI (https://arxiv.org/pdf/2306.01711) and subsequent work. I think applying this kind of open-ended task selection for skill discovery to the internet agent case is definitely unique, but how does PAE differ from other open-ended skill discovery algorithms in other RL domains? If the authors could clarify this either in the related work section or after introducing their method in section 3.3 that would be really helpful.
Overall the results of the paper are strong, but framing the algorithmic novelty can be done with more clarity. If the novelty arises from the domain then this point and the difficulties of internet-based environments should be emphasized more rather than as just a potential application case as section 4 treats it as.
Questions
How does PAE differ from other methods for skill discovery via foundation models such as OMNI?
Thank you for your review and for recognizing that the experiments are extensive and the paper is easy to follow. To address the raised questions, we have provided more clarification on the novel contribution of the paper, significantly rewritten the relevant sections of the paper in blue as suggested, and provided additional ablation experiments in the experiments section to support our contribution.
Q1: What is the difference of PAE with other methods for autonomous RL and task proposers.
The novel contribution of this work is a complete working system that learns to solve real web agent tasks by using autonomous skill proposals and reinforcement learning. While a wide range of skill discovery methods with LLM task proposers have been proposed in the literature, such works typically focus on learning diverse skills in relatively simplified environments such as games [1, 2, 3, 4] and robotics tasks with a limited number of scenes [5, 6]. This work takes an important step forward to study when those skills can generalize to human requests in realistic benchmarks in the context of web agents and what the best design choices are for such generalization, including an agent policy with additional reasoning steps, a context-aware task proposer, and an image-based outcome evaluator. While each individual component of this work is not novel, the novel contribution of this paper is to present the first working system for such an unsupervised skill discovery procedure to generalize to unseen real-world benchmarks with human requests with the right design decisions, along with extensive experimental analysis of the effectiveness of this framework (as recognized by reviewers uKzp, 4H8g, vLaT and wpvf).
Open-source contributions: Another significant contribution of this work is to release the medium-size sota open-source checkpoints and large-scale dataset for generalist VLM web agents. As mentioned by Reviewer wpvf, the release of our sota open-source checkpoint and large-scale dataset can facilitate RL and fine-tuning research in the VLM web agent domain, while previously the community has to rely on very large models such as Qwen72B or proprietary models. We will also release our training infrastructure that supports efficient large-scale interactions with thousands of distributed web browsers.
Q2: Framing the algorithmic novelty can be done with more clarity.
Thanks for your suggestions on adding more comparisons with prior works along the lines of OMNI, and for the concrete suggestion of phrasing the novelty as building a complete working autonomous skill discovery system that generalizes to real-world web agent applications. To better situate the contribution of our paper as the first working system for autonomous skill discovery that generalizes to real web agent requests, rather than the novelty of individual components, we have adjusted the related claims in the paper (colored in blue). Concretely, we have rewritten the introduction to focus on our contribution as a working autonomous skill discovery system for realistic web agent tasks, added a discussion of the differences of PAE from prior works on autonomous RL and task proposers in the related works section, rephrased the technical section to talk more about the specific design decisions we made for VLM web agents, and added additional ablation results in the experiment section.
[1] Jenny Zhang, Joel Lehman, Kenneth Stanley, and Jeff Clune. Omni: Open-endedness via models of human notions of interestingness, 2024.
[2] Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code, 2024.
[3] Cedric Colas, Laetitia Teodorescu, Pierre-Yves Oudeyer, Xingdi Yuan, and Marc-Alexandre Cote. Augmenting autotelic agents with large language models, 2023.
[4] Cédric Colas, Tristan Karch, Nicolas Lair, Jean-Michel Dussoux, Clément Moulin-Frier, Peter Ford Dominey, and Pierre-Yves Oudeyer. Language as a cognitive tool to imagine goals in curiosity-driven exploration, 2020.
[5] Jesse Zhang, Jiahui Zhang, Karl Pertsch, Ziyi Liu, Xiang Ren, Minsuk Chang, Shao-Hua Sun, and Joseph J. Lim. Bootstrap your own skills: Learning to solve new tasks with large language model guidance, 2023b.
[6] Yuqing Du, Olivia Watkins, Zihan Wang, Cedric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. Guiding pretraining in reinforcement learning with large language models, 2023.
Thank you for the clarification. I think the rewritten sections framed your work in a much better light and made the contributions more clear. I also appreciate the additional ablations and qualitative analysis included in the appendix. As such, I am happy to increase my score!
Thanks so much for helping us improve the narrative of the paper and upgrading your score to 8! We are very happy that all of your concerns have been sufficiently addressed!
The paper presents Proposer-Agent-Evaluator (PAE), a framework enabling autonomous skill discovery for vision-language model (VLM) internet agents. By employing a context-aware task proposer and an autonomous evaluator, PAE allows VLM agents to independently explore and refine skills for web navigation tasks, demonstrating improvements in success rates on benchmarks like WebVoyager and WebArena Easy compared to other open-source models. However, the model still relies on closed-source VLMs for effective task generation and evaluation, thus partially constraining the proposed framework’s open-source utility.
Strengths
- Demonstrates an increase in performance for open-source model weights compared to previous open-source models.
- PAE framework leverages self-generated tasks, avoiding static human-annotated instructions.
Weaknesses
- The term "model-based task proposer" is used ambiguously. In reinforcement learning (RL), "model-based" generally implies use of a dynamics model for planning, whereas in this paper it refers to LLM prompting, which can lead to confusion in terminology.
- The novelty and motivation of the contributions are limited. Although the paper claims state-of-the-art (SoTA) for open-source models, it requires closed-source models for generating and evaluating tasks. This reliance on closed-source models highlights that the bottleneck remains the quality of these closed-source models, rather than any specific methodological improvement introduced by PAE.
- The paper makes an assumption that the ground-truth task distribution and reward functions are inaccessible, even though in simulated environments, these can often be directly obtained (e.g., verifying if a button was clicked). If the goal is to improve policy generalization skills, and use ground-truth distributions and rewards only as a way to evaluate the method, the paper would benefit from a clearer rationale and explanation for this assumption (Section 3.1).
- The paper lacks references to relevant prior work in autonomous task proposal and RL, and some claims of novelty are inaccurate, given that similar approaches have been explored [1, 2, 3, 4].
[1] Zhang, J., Zhang, J., Pertsch, K., Liu, Z., Ren, X., Chang, M., ... & Lim, J. J. (2023). Bootstrap your own skills: Learning to solve new tasks with large language model guidance. arXiv preprint arXiv:2310.10021. [2] Zhang, J., Lehman, J., Stanley, K., & Clune, J. (2023). Omni: Open-endedness via models of human notions of interestingness. arXiv preprint arXiv:2306.01711. [3] Faldor, M., Zhang, J., Cully, A., & Clune, J. (2024). OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code. arXiv preprint arXiv:2405.15568. [4] Colas, C., Teodorescu, L., Oudeyer, P. Y., Yuan, X., & Côté, M. A. (2023, November). Augmenting autotelic agents with large language models. In Conference on Lifelong Learning Agents (pp. 205-226). PMLR.
Questions
- In PAE, why sample tasks uniformly rather than prioritizing by difficulty or learning progress?
Edit: updated score from 3 to 5
[1] Jenny Zhang, Joel Lehman, Kenneth Stanley, and Jeff Clune. Omni: Open-endedness via models of human notions of interestingness, 2024.
[2] Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code, 2024.
[3] Cedric Colas, Laetitia Teodorescu, Pierre-Yves Oudeyer, Xingdi Yuan, and Marc-Alexandre Cote. Augmenting autotelic agents with large language models, 2023.
[4] Cédric Colas, Tristan Karch, Nicolas Lair, Jean-Michel Dussoux, Clément Moulin-Frier, Peter Ford Dominey, and Pierre-Yves Oudeyer. Language as a cognitive tool to imagine goals in curiosity-driven exploration, 2020.
[5] Jesse Zhang, Jiahui Zhang, Karl Pertsch, Ziyi Liu, Xiang Ren, Minsuk Chang, Shao-Hua Sun, and Joseph J. Lim. Bootstrap your own skills: Learning to solve new tasks with large language model guidance, 2023b.
[6] Yuqing Du, Olivia Watkins, Zihan Wang, Cedric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. Guiding pretraining in reinforcement learning with large language models, 2023.
Q2: References and relations to prior works on autonomous RL and task proposals.
The novel contribution of this work is a system that learns to solve real web agent tasks by using autonomous skill proposals and reinforcement learning. While a wide range of skill discovery methods with LLM task proposers have been proposed in the literature, such works typically focus on learning diverse skills in relatively simplified environments such as games [1, 2, 3, 4] and robotics tasks with a limited number of scenes [5, 6]. This work takes an important step forward to study when those skills can generalize to human requests in realistic benchmarks in the context of web agents and what the best design choices are for such generalization, including an agent policy with additional reasoning steps, a context-aware task proposer, and an image-based outcome evaluator. While each individual component of this work is not novel, the novel contribution of this paper is to present the first working system for such an unsupervised skill discovery procedure to generalize to unseen real-world benchmarks with human requests with the right design decisions, along with extensive experimental analysis of the effectiveness of this framework (as recognized by reviewers uKzp, 4H8g, vLaT and wpvf). We have included discussions of all your cited papers and additional works along this line of autonomous RL and task proposers in the related works.
Methodological difference from prior works in autonomous RL and task proposers.
| Model | OpenStreetMap | PostMill | OneStopMarket | Average |
|---|---|---|---|---|
| Claude 3 Sonnet | 24.3 | 55.8 | 41.7 | 36.0 |
| LLaVa7B-SFT | 15.2 | 16.8 | 25.4 | 18.0 |
| LLaVa7B-PAE | 19.5 | 21.1 | 42.3 | 24.6 |
| LLaVa7B-SFT (no reasoning) | 2.8 | 0 | 8.0 | 3.8 |
| LLaVa7B-PAE (no reasoning) | 9.6 | 8.0 | 32.0 | 14.4 |
In addition to the system-level design decisions and open-source contributions, PAE also differs from the prior literature in its incorporation of an additional reasoning step before taking actions. Only the action parts determine the actual web operations, but the reasoning traces are also included in the filtered BC loss. As shown in the table above and updated as Figure 6 in the pdf, while the agent can still learn diverse skills to a similar level on the proposed tasks without the reasoning step, its generalization to unseen human-written tasks declines significantly. This illustrates the important difference between learning skills alone, as in prior works, and generalizing to unseen human-written tasks.
Q3: The writing of the paper can be improved based on the rationale of generalization and better terminologies.
To better situate the contribution of our paper as the first working system for autonomous skill discovery that generalizes to real web agent requests, we have adjusted the related claims in the paper (colored in blue). Concretely, we have rewritten the introduction to focus on our contribution as a working autonomous skill discovery system for realistic web agent tasks, added a discussion of the differences of PAE from prior works on autonomous RL and task proposers in the related works section, rephrased the technical section to talk more about the specific design decisions we made for VLM web agents, and added additional ablation results in the experiment section.
We have also updated Section 3.1 to emphasize that the ground-truth reward functions are only accessible for evaluation. We have also updated the terminology of “model-based” to be “VLM-based”.
Q4: Why sample tasks uniformly rather than prioritizing tasks by difficulty or learning progress.
In our experiments presented in Figure 6, we have considered the possibility of using user demos as contexts to inspire the task proposer to come up with more feasible tasks (similar to the “model of interestingness” as done in OMNI). Indeed, we have found that this prioritization of more feasible tasks can accelerate the learning process and lead to policies that generalize better to human-annotated tasks.
Thank you for your review and feedback on the paper. To address the raised questions, we have provided additional experiment results using only open-source models as task proposers and evaluators, showing that the SOTA performance of PAE is not bottlenecked by closed-source models. We have also updated the pdf with major rewrites based on your suggested references to prior works, additional experiments, and a clearer rationale for the novel contribution as a working system that successfully generalizes autonomous skill discovery to real-world human requests, as opposed to the novelty of individual components.
Please let us know if your concerns are addressed. We are happy to discuss further. We answer your questions below:
Q1: The novelty and motivation of the paper are limited because it is bottlenecked by the dependence on closed-source models.
The design principle of PAE is to take the best advantage of the asymmetric capabilities of SOTA VLMs as agents versus as proposers/evaluators for web agents, instead of imitating a more capable VLM. Intuitively, VLMs are very good at confirming whether a specific product has been added to the shopping cart (e.g., by looking at the final screenshot to see if the shopping cart contains the product), while less good at actually navigating the web to find the product and add it to the cart. To show that the improvement of PAE is not bottlenecked by the dependence on closed-source models, we have provided additional results using only open-source models, Qwen2VL-7B and Qwen2VL-72B, for proposing tasks and conducting evaluations, thereby eliminating the need for closed-source models. We have updated the results in Section 6 Discussions of the paper and also provide the table of the most relevant results below for your convenience:
| Model | OpenStreetMap | PostMill | OneStopMarket | Average |
|---|---|---|---|---|
| Claude 3 Sonnet | 24.3 | 55.8 | 41.7 | 36.0 |
| Qwen2VL-7B | 0.7 | 10.2 | 20.2 | 7.5 |
| Qwen2VL-72B | 16.0 | 32.8 | 32.7 | 23.9 |
| LLaVa7B-SFT | 15.2 | 16.8 | 25.4 | 18.0 |
| LLaVa7B-PAE | 19.5 | 21.1 | 42.3 | 24.6 |
| LLaVa7B-PAE (Qwen2VL-7B) | 20.2 | 25.0 | 28.6 | 23.1 |
| LLaVa7B-PAE (Qwen2VL-72B) | 17.9 | 30.6 | 39.2 | 26.0 |
In this table, we find that LLaVa-7B PAE using Qwen2VL-72B as the task proposer and evaluator achieves similar performance to using Claude 3 Sonnet as the task proposer and evaluator, despite their significant difference in agent performance (23.9% compared to 36.0% average success rate). As a result of this improvement, LLaVa-7B PAE using Qwen2VL-72B as the task proposer and evaluator achieves better performance than Qwen2VL-72B itself. Perhaps even more surprisingly, even Qwen2VL-7B, with much weaker agent performance than LLaVa-7B SFT (7.5% compared to 18.0%), can be used to make significant improvements (from 18.0% to 23.1%). These results demonstrate that the improvements from PAE essentially stem from taking advantage of the asymmetric capabilities of state-of-the-art VLMs as agents and as task proposers/evaluators, instead of simply imitating a stronger VLM.
Open-source contributions: Another significant contribution of this work is to release the medium-size sota open-source checkpoints and large-scale dataset for generalist VLM web agents. As mentioned by Reviewer wpvf, the release of our sota open-source checkpoint and large-scale dataset can facilitate RL and fine-tuning research in the VLM web agent domain, while previously the community has to rely on very large models such as Qwen72B or proprietary models. We will also release our training infrastructure that supports efficient large-scale interactions with thousands of distributed web browsers.
I thank the authors for the new experiments and clarifications on how this work is different from previous work. In response, I have increased my score.
new results using open-sourced models
Why not include the results on the open-source models in tables 1, 2, and 3? I think this is a valuable comparison to see to understand more deeply if the increase in performance by PAE is bounded by what the sota VLM is.
Minor things:
“generalize to the ground-truth task distribution and reward function only used as evaluations”
should be “which are only used for evaluations”
Dear reviewer SFbJ,
Thanks so much for providing your feedback and additional suggestions for improving the paper to show more evidence that PAE is not bounded by the use of proprietary models. Following your suggestions, we have carried out additional experiments and updated the main results in Table 1, Table 2, and Table 3 with additional results using open-source models as proposers and evaluators. We have also updated the main results in Section 5 with an additional discussion, in red, on whether PAE is bounded by SOTA VLMs. As the results in Table 2 (WebArena) are already included in the previous rebuttal, we mainly summarize the new results from Table 1 and Table 3 (WebVoyager and unseen websites) in the table below:
| Model | Seen Websites | Unseen Websites |
|---|---|---|
| Claude 3 Sonnet | 42.4 | 25.0 |
| Qwen2VL-7B | 1.4 | 1.4 |
| Qwen2VL-72B | 22.6 | 8.3 |
| LLaVa7B-SFT | 14.9 | 9.1 |
| LLaVa7B-PAE | 22.3 | 16.3 |
| LLaVa7B-PAE (w/ Qwen2VL-7B) | 21.7 | 13.7 |
Similar to the conclusion of the open-source experiments on WebArena presented in the previous rebuttal, our new results provide more empirical evidence that the improvement of PAE is not bounded by the agent capability of SOTA VLMs: even Qwen2VL-7B (with an agent success rate of only 1.4%) can be used as the task proposer and evaluator to improve the performance of LLaVa7B-SFT from 14.9% to 21.7% on seen websites and from 9.1% to 13.7% on unseen websites. We believe that these results strongly validate our claim that the improvements of PAE stem from the asymmetric capabilities of SOTA VLMs as agents and as proposers/evaluators, instead of copying a stronger model. While our experiments using Qwen2VL-72B are still running, we expect them to yield even better results and will add them to the paper once they are ready.
Finally, to provide even more justification for how PAE makes the best design choices to take advantage of the asymmetric capabilities of SOTA VLMs, we compare the outcome-based evaluation used in PAE with other choices of evaluators in the related literature, such as step-based evaluators and function-based evaluators, in the table below:
| Model | OpenStreetMap | PostMill | OneStopMarket | Average |
|---|---|---|---|---|
| Claude 3 Sonnet | 24.3 | 55.8 | 41.7 | 36.0 |
| LLaVa7B-SFT | 15.2 | 16.8 | 25.4 | 18.0 |
| LLaVa7B-PAE | 19.5 | 21.1 | 42.3 | 24.6 |
| LLaVa7B-PAE (step reward) | 8.2 | 12.2 | 15.7 | 10.6 |
| LLaVa7B-PAE (function reward) | 16.8 | 16.3 | 28.0 | 19.2 |
In our implementation of the step-based evaluator, we ask Claude 3 Sonnet to evaluate whether each step is correct (i.e., whether it gets the agent closer to the goal) and behavior clone all the steps considered correct by the step-based evaluator. In our implementation of the function-based evaluator, we provide 3 examples of verification functions as used by WebArena and ask Claude 3 Sonnet to come up with verification functions to functionally verify final task success (e.g., checking if the final website URL matches the ground-truth URL). As shown in Figure 3, both the step-based and function-based evaluators perform worse than the outcome-based evaluator, and the step-based evaluator even leads to worse performance than the SFT checkpoint it starts from. We found that the step-based evaluator hallucinated more often and tended to be too "generous" in considering each step successful, potentially because evaluating the success of each step is significantly harder than evaluating only the success of the final outcome. Furthermore, we found that the function-based evaluator often hallucinates the success criterion for the verification function (e.g., making up a non-existent URL that the agent needs to go to), resulting in most tasks being impossible to learn. In contrast, the design choice of an outcome-based evaluator best exploits the asymmetric capability of SOTA VLMs as agents and as evaluators and provides reliable reward signals for the policy to improve, resulting in better performance.
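To make the comparison between the three evaluator designs more concrete, here is a rough sketch of how they differ; `ask_vlm_yes_no` and `ask_vlm_for_checker` are hypothetical helpers standing in for calls to the judge VLM, not functions from our codebase.

```python
# Sketch of the three evaluator variants compared above.
# `ask_vlm_yes_no(prompt, images, text)` is assumed to return 1 or 0 from a
# judge VLM; `ask_vlm_for_checker(task)` is assumed to return a Python callable.

def outcome_reward(ask_vlm_yes_no, task, trajectory):
    # Outcome-based: judge only the final result from the last screenshots
    # and the agent's answer (the design adopted by PAE).
    images = [step.screenshot for step in trajectory[-3:]]
    return ask_vlm_yes_no(f"Was this task completed? Task: {task}",
                          images, trajectory[-1].answer)

def step_rewards(ask_vlm_yes_no, task, trajectory):
    # Step-based: judge every step ("does this action get closer to the goal?"),
    # then behavior clone only the steps judged correct.
    return [ask_vlm_yes_no(f"Does this step make progress on: {task}?",
                           [step.screenshot], step.action)
            for step in trajectory]

def function_reward(ask_vlm_for_checker, task, trajectory):
    # Function-based: ask the judge VLM to write a verification function
    # (e.g. a URL check) and run it on the final state.
    checker = ask_vlm_for_checker(task)
    return int(checker(trajectory[-1]))
```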
We have also fixed the minor grammar issue that you pointed out. We appreciate your help in improving the paper and would love to know whether the additional results have addressed your concerns. We are more than happy to engage in further discussion.
Dear Reviewer SFbj,
Thank you for your time and effort in reviewing our manuscript. We have updated the pdf with the additional results requested in your previous response. We hope that our additional results and clarifications address your concerns and provide additional grounds for your re-evaluation of the paper. Should you have any additional questions or require further clarification, we would be more than happy to address them. Given that the deadline for updating the pdf is within a day, if additional updates to the paper are needed, we kindly ask you to let us know soon so that we have sufficient time to make those changes on our end.
Dear Reviewer SFbJ,
Thank you for your patience while we worked on the experiments you requested. We are pleased to share that we have successfully completed all the suggested experiments, and we will update the corresponding results in Tables 1, 2, and 3 in a later version of the manuscript.
In summary, as detailed in the table below, we observe significant improvements using PAE on both the seen and unseen website sets of WebVoyager. Notably, PAE remains effective even when relying solely on open-source VLMs as task proposers and evaluators:
- Qwen2VL-7B: Despite the low baseline agent success rate of Qwen2VL-7B (1.4%), LLaVa7B-PAE (w/ Qwen2VL-7B) achieves remarkable improvements of 6.8% on seen websites (from 14.9% to 21.7%) and 4.6% (from 9.1% to 13.7%) on unseen websites.
- Qwen2VL-72B: LLaVa7B-PAE (w/ Qwen2VL-72B) achieves comparable performance to LLaVa7B-PAE with Claude 3 Sonnet, even though the agent success rates of Qwen2VL-72B and Claude 3 Sonnet differ significantly (22.6% vs. 42.4% on seen websites).
Moreover, both LLaVa7B-PAE w/ Qwen2VL-7B and LLaVa7B-PAE w/ Qwen2VL-72B outperform their corresponding task proposer/evaluator models (Qwen2VL-7B and Qwen2VL-72B). This highlights that PAE is not constrained by the agent performance of the task proposer/evaluator model and can improve the policy beyond it.
| Model | Seen Websites | Unseen Websites |
|---|---|---|
| Claude 3 Sonnet | 42.4 | 25.0 |
| Qwen2VL-7B | 1.4 | 1.4 |
| Qwen2VL-72B | 22.6 | 8.3 |
| LLaVa7B-SFT | 14.9 | 9.1 |
| LLaVa7B-PAE | 22.3 | 16.3 |
| LLaVa7B-PAE (w/ Qwen2VL-7B) | 21.7 | 13.7 |
| LLaVa7B-PAE (w/ Qwen2VL-72B) | 23.6 | 15.9 |
We hope these new results provide further clarity and allow you to re-evaluate our submission. We look forward to hearing back from you as the discussion period is coming to an end.
This paper proposes a framework, Proposer-Agent-Evaluator (PAE), to boost the performance of agents in web tasks. In more detail, it uses a foundation model to automatically generate task instructions and another model to evaluate task completion and generate a reward. Lastly, RL is used to train a policy model.
Strengths
- The experiments are extensive. This paper compares with many SOTA baselines and makes many useful analyses. It shows the effectiveness of the proposed framework.
- This paper is easy to follow.
Weaknesses
- The main weakness is still the novelty. The method can be divided into two parts. The first is using existing foundation models to process data or automatically generate data; this part is not novel, as many other works have tried similar pipelines to collect data. The second part is using RL to fine-tune a model, which is also a standard procedure.
[1] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Questions
Will the author build a large-scale dataset that can be released to the public? The author mentioned the model only generates sparse rewards; will batch RL be affected, since only part of the data has rewards?
Rewriting of the paper: To better situate the contribution of our paper as the first working system for autonomous skill discovery that generalizes to real web agent requests, rather than as the novelty of individual components, we have adjusted the related claims in the paper (colored in blue). Concretely, we have rewritten the introduction to focus on our contribution as a working autonomous skill discovery system for realistic web agent tasks, added a discussion in the related works section of how PAE differs from prior works on autonomous RL and task proposers, rephrased the technical section to talk more about the specific design decisions we made for VLM web agents, and added additional ablation results on our design decisions in the experiment section.
Q2: Will the author build a large-scale dataset that can be released to the public?
Yes, another significant contribution of this work is the release of medium-sized SOTA open-source checkpoints and a large-scale dataset for generalist VLM web agents. As mentioned by Reviewer wpvf, the release of our SOTA open-source checkpoint and large-scale dataset can facilitate RL and fine-tuning research in the VLM web agent domain, whereas previously the community had to rely on very large models such as Qwen72B or proprietary models. We will also release our training infrastructure, which supports efficient large-scale interactions with thousands of distributed web browsers.
Q3: Will the batch RL be affected since only parts of the data have rewards?
It is important to note that PAE is an online framework in which the agent periodically interacts with the environment and performs policy optimization, rather than a batch/offline RL setting in which the agent performs policy optimization on a fixed dataset. As shown in Figure 6 in the new pdf, the ability to interact with the environment and collect new on-policy data is essential for the steady improvement of the agent, avoiding the pessimism issues of batch/offline RL [7]. The use of an outcome-only evaluator is one of the design choices we found to best take advantage of the asymmetric capabilities of SOTA VLMs as evaluators versus as agents in web agent tasks (additional experiment results on this are provided in blue in Section 6 Discussions). In our preliminary experiments, we found that using VLMs to provide dense annotations such as per-step rewards (where a VLM annotates whether each action takes the agent closer to the goal and we behavior clone all actions judged to do so) is less robust due to hallucinations. We are re-running those experiments and will update these results once they are done.
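For clarity, here is a minimal sketch of the online loop described above. The helper functions (`propose_tasks`, `rollout`, `outcome_reward`, `update_policy`) are hypothetical placeholders rather than our released code; the point is that fresh on-policy rollouts are collected in every iteration and only the sparse, final-outcome reward is used to filter them before the policy update.

```python
def pae_online_iteration_loop(policy, websites, proposer, evaluator,
                              num_iterations: int = 10,
                              rollouts_per_iteration: int = 512):
    """Sketch of an online propose-rollout-evaluate-update loop.
    All helper functions referenced below are hypothetical placeholders."""
    for _ in range(num_iterations):
        # 1. Propose fresh tasks grounded in the target websites.
        tasks = propose_tasks(proposer, websites, n=rollouts_per_iteration)

        # 2. Collect new on-policy rollouts with the *current* policy
        #    (this is what distinguishes the setup from batch/offline RL).
        trajectories = [rollout(policy, task) for task in tasks]

        # 3. Sparse reward: a single outcome judgment per trajectory,
        #    no per-step annotation.
        rewards = [outcome_reward(task, traj, evaluator)
                   for task, traj in zip(tasks, trajectories)]

        # 4. Keep only successful rollouts and update the policy on them
        #    (e.g. a filtered behavior-cloning or reward-weighted update).
        successes = [(task, traj)
                     for task, traj, r in zip(tasks, trajectories, rewards)
                     if r == 1.0]
        policy = update_policy(policy, successes)
    return policy
```

Because unsuccessful rollouts are simply dropped, having rewards for only a subset of trajectories does not break the update; it only reduces the amount of data cloned in that iteration.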
[1] Jenny Zhang, Joel Lehman, Kenneth Stanley, and Jeff Clune. Omni: Open-endedness via models of human notions of interestingness, 2024.
[2] Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code, 2024.
[3] Cédric Colas, Laetitia Teodorescu, Pierre-Yves Oudeyer, Xingdi Yuan, and Marc-Alexandre Côté. Augmenting autotelic agents with large language models, 2023.
[4] Cédric Colas, Tristan Karch, Nicolas Lair, Jean-Michel Dussoux, Clément Moulin-Frier, Peter Ford Dominey, and Pierre-Yves Oudeyer. Language as a cognitive tool to imagine goals in curiosity-driven exploration, 2020.
[5] Jesse Zhang, Jiahui Zhang, Karl Pertsch, Ziyi Liu, Xiang Ren, Minsuk Chang, Shao-Hua Sun, and Joseph J. Lim. Bootstrap your own skills: Learning to solve new tasks with large language model guidance, 2023.
[6] Yuqing Du, Olivia Watkins, Zihan Wang, Cédric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. Guiding pretraining in reinforcement learning with large language models, 2023.
[7] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Offline Reinforcement Learning, 2020. http://arxiv.org/abs/2006.04779.
Thank you for your review and the recognition that the "experiments are extensive" and the "paper is easy to follow". To address the raised questions, we have provided more clarification on the novel contribution of the paper, significantly rewritten the relevant sections of the paper in blue as suggested, and provided additional ablation results of PAE with open-source models alone to show that PAE is not simply imitating data generated by a stronger VLM. We are also currently running the experiments comparing our outcome-based evaluator with a dense step-based evaluator and will provide those results as soon as they have finished. Please kindly let us know if your concerns are addressed or if further clarifications or experiments are required. We are happy to discuss further. We answer your questions below:
Q1: The main issue is still the novelty. Utilizing existing foundation models to process data or automatically generate data is not novel. Using RL to finetune the model is not novel either.
The novel contribution of this work is a complete working system that learns to solve real web agent tasks through autonomous skill proposals and reinforcement learning. While a wide range of skill discovery methods with LLM task proposers have been proposed in the literature, such works typically focus on learning diverse skills in relatively simplified environments such as games [1, 2, 3, 4] and robotics tasks with a limited number of scenes [5, 6]. This work takes an important step forward by studying when those skills can generalize to human requests on realistic benchmarks in the context of web agents, and what the best design choices are for such generalization, including an agent policy with additional reasoning steps, a context-aware task proposer, and an image-based outcome evaluator. While each individual component of this work is not novel, the novel contribution of this paper is to put together the first working system in which such an unsupervised skill discovery procedure generalizes to unseen real-world benchmarks with human requests through the right design decisions, and to present extensive experimental analysis of the effectiveness of this framework (as recognized by reviewers uKzp, 4H8g, vLaT and wpvf).
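As an illustration of what a context-aware task proposer can look like, below is a minimal sketch assuming a generic chat-style VLM call. The prompt wording and the `query_vlm` placeholder (the same kind of stub as in the earlier evaluator sketch) are assumptions for illustration, not the paper's exact prompt or code.

```python
import json


def query_vlm(model: str, prompt: str, images) -> str:
    """Placeholder for a call to the proposer VLM (Claude 3 Sonnet, Qwen2VL, ...)."""
    raise NotImplementedError


def propose_tasks(proposer_model: str, website_url: str, screenshot, n: int = 20):
    """Ask the proposer VLM for candidate practice tasks grounded in the website."""
    prompt = (
        f"You are proposing practice tasks for a web agent operating on {website_url}.\n"
        f"Based on the attached screenshot of the site, list {n} realistic tasks that "
        "a user might request on this website. Return a JSON list of task strings."
    )
    response = query_vlm(proposer_model, prompt, images=[screenshot])
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # Fall back to simple line splitting if the model did not return valid JSON.
        return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]
```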
To give more context, the design principle of PAE is to take the best advantage of the asymmetric capabilities of SOTA VLMs as agents versus as proposers/evaluators for web tasks, instead of imitating a more capable VLM. Intuitively, VLMs are very good at confirming whether a specific product has been added to the shopping cart (e.g., by checking whether the final screenshot shows the product in the cart), but less good at actually navigating the web to find the product and add it to the cart. To show that the improvement of PAE is not bottlenecked by a dependence on closed-source models, we have provided additional results using only open-source models such as Qwen2VL-7B and Qwen2VL-72B for proposing tasks and conducting evaluations, thereby eliminating the need for closed-source models. We have updated the results in the Section 6 discussions of the paper and also provide the table of the most relevant results below for your convenience:
| Model | OpenStreetMap | PostMill | OneStopMarket | Average |
|---|---|---|---|---|
| Claude 3 Sonnet | 24.3 | 55.8 | 41.7 | 36.0 |
| Qwen2VL-7B | 0.7 | 10.2 | 20.2 | 7.5 |
| Qwen2VL-72B | 16.0 | 32.8 | 32.7 | 23.9 |
| LLaVa7B-SFT | 15.2 | 16.8 | 25.4 | 18.0 |
| LLaVa7B-PAE | 19.5 | 21.1 | 42.3 | 24.6 |
| LLaVa7B-PAE (Qwen2VL-7B) | 20.2 | 25.0 | 28.6 | 23.1 |
| LLaVa7B-PAE (Qwen2VL-72B) | 17.9 | 30.6 | 39.2 | 26.0 |
In this table, we found that LLaVa-7B PAE using Qwen2VL-72B as the task proposer and evaluator achieves a similar performance to using Claude 3 Sonnet as the task proposer and evaluator, despite their significant difference in agent performance (23.9% compared to 36.0% average success rate). As a result of this improvement, LLaVa-7B PAE using Qwen2VL-72B as the task proposer and evaluator achieves better performance than Qwen2VL-72B itself. Perhaps even more surprisingly, even Qwen2VL-7B, with much inferior agent performance compared to LLaVa-7B SFT (7.5% compared to 18.0%), can be used to make significant improvements (from 18.0% to 23.1%). These results demonstrate that the improvements from PAE are rooted in taking advantage of the asymmetric capabilities of state-of-the-art VLMs as agents and as task proposers/evaluators, rather than in simply imitating a stronger VLM.
Dear reviewer uKzp,
Thank you for your efforts in reviewing the paper. We have included additional experiment results in the paper (Figure 3 and the discussion in Section 6) on the effect of different design choices for the evaluator. We would appreciate it if you could kindly let us know whether these new experiment results and the clarifications above have addressed your concerns. We are happy to engage in further discussion.
| Model | OpenStreetMap | PostMill | OneStopMarket | Average |
|---|---|---|---|---|
| Claude 3 Sonnet | 24.3 | 55.8 | 41.7 | 36.0 |
| LLaVa7B-SFT | 15.2 | 16.8 | 25.4 | 18.0 |
| LLaVa7B-PAE | 19.5 | 21.1 | 42.3 | 24.6 |
| LLaVa7B-PAE (step reward) | 8.2 | 12.2 | 15.7 | 10.6 |
| LLaVa7B-PAE (function reward) | 16.8 | 16.3 | 28.0 | 19.2 |
As shown in the table above, we compare the outcome-based evaluator used in PAE with other evaluator choices from the related literature, namely step-based and function-based evaluators. In our implementation of the step-based evaluator, we ask Claude 3 Sonnet to evaluate whether each step is correct (i.e., whether it gets the agent closer to the goal) and behavior clone all the steps the evaluator considers correct. In our implementation of the function-based evaluator, we provide 3 examples of verification functions as used by WebArena and ask Claude 3 Sonnet to come up with verification functions that functionally verify final task success (e.g., checking whether the final website URL matches a ground-truth URL). As shown in Figure 3, both the step-based and the function-based evaluator perform worse than the outcome-based evaluator, and the step-based evaluator even degrades performance below the SFT checkpoint it starts from. We found that the step-based evaluator hallucinated more often and tended to be too "generous" when judging the success of individual steps, potentially because evaluating the success of each step is significantly harder than evaluating only the final outcome. Furthermore, the function-based evaluator often hallucinates the success criterion of its verification function (e.g., making up a non-existent URL the agent needs to reach), rendering most tasks impossible to learn. In contrast, the outcome-based evaluator best exploits the asymmetric capability of SOTA VLMs as agents versus as evaluators, providing reliable reward signals for the policy to improve and thus better performance.
I sincerely appreciate the author's rebuttal and the additional supplementary experiments. I never had any issues with the experiments in this paper. My primary focus has always been on the methodological aspect. Based on the author's response, the components of the system exist, but the contribution lies in using a similar system to accomplish a new setting or task. Correct me if I am wrong. While I acknowledge that this new setting or task is indeed novel, I still feel it does not meet the novelty standard required by ICLR. For example, when the Transformer was first introduced, people applied it to various other tasks. However, without further improvements, the novelty of such works has often been questioned.
Dear reviewer uKzp,
Thank you for your comments and for acknowledging that our setting is novel. However, we respectfully disagree with the view that system papers in a new setting with the right design choices (papers that introduce a working machine learning system in a new setting, with comprehensive experimental results validating the design choices of each component) are unworthy of publication at ICLR. As evidenced by past successful system papers published in top-tier ML venues (e.g., [1, 2, 3, 4]), we believe such system papers are valued at ICLR as well. In particular, [3] is probably the closest to our work: it proposes a fine-tuning pipeline for web agents with synthetically generated HTML data. While it did not claim any individual component to be novel, [3] was accepted to ICLR 2024 as an oral presentation for the effectiveness of its system and the importance of its new setting. Similarly, [4] proposes a system that applies instruction tuning to the new setting of visual question answering; without claiming methodological novelty, it was accepted to NeurIPS 2023 as an oral.
Finally, we would like to emphasize that our setting of realistic web agents is significantly different from, and more realistic than, the settings considered in prior works using autonomous RL and task proposers in games and robotic simulations (lines 136-154 in the Related Works section of our manuscript). One key advancement is that our setting requires autonomous skill discovery to generalize to realistic human-annotated instructions, instead of only learning a diverse set of skills. In our experiments, we have shown that this is only possible with the right design choices for using SOTA VLMs (e.g., the additional reasoning step and the use of an outcome-based evaluator). While the individual components of our learning system are not novel, the combination of these components with the right design choices, and its application to a novel setting, is our novel contribution.
We hope that the evidence of prior successful system papers (with an emphasis on a new setting instead of a new methodology) can help you re-evaluate our submission, and thank you so much for your engagement in the discussion!
[1] Data-Driven Offline Optimization For Architecting Hardware Accelerators. ICLR 2022.
[2] Chip Placement with Deep Reinforcement Learning. ICML 2022.
[3] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. ICLR 2024 (Oral).
[4] Visual Instruction Tuning. NeurIPS 2023 (Oral).
This paper presents an interesting, working system for web-based agents. The core idea is to propose skills using VLMs and then finetune the low-level policies using filtered experience. The reviewers acknowledged that the proposed system is technically solid and can achieve good performance.
The reviewers reached the consensus that the current paper took existing well-known methods and applied them in a new domain. While there was disagreement among reviewers about whether this is novel enough, the AC believes that this can be good enough for a system paper if the paper is technically solid. However, the description of the method is poor and inaccurate. There is no formal definition of the learning objectives in the paper. The RL algorithm is not really an RL algorithm, but a type of BC. This is particularly concerning since the pseudocode provided in Appendix A includes a step described as "Update the agent policy with any RL algorithm." This can be misleading for readers who try to implement a similar method, when in fact no RL algorithm is applied in the paper. This, combined with the limited novelty, makes this submission not ready for publication in its current form. If the technical description were clearer and more accurate, this could be a good system paper.
Additional Comments on Reviewer Discussion
During the reviewer discussion, the AC urged reviewers to provide justification for their final recommendations based on the rebuttal. Reviewers who gave negative ratings all responded and provided their reasons and updated opinions about the paper after reading the rebuttal. The AC also read the revised paper, particularly the technical details that were questioned by the reviewers. In the end, the AC does not agree with some of the reviewers' statements that the novelty of a system paper that applies existing methods to a new domain is below the ICLR standard. However, the main reason for rejection is the lack of clear, accurate methodology details.
Reject