Tree Search for Language Model Agents
We develop an inference time tree search algorithm that improves the relative performance of LLM web agents by up to 39.7% on VisualWebArena and 28.0% on WebArena.
Abstract
Reviews and Discussion
This paper proposes a search-based approach to improve the performance of language model (LM) agents on realistic web automation tasks. Existing LM agents struggle with multi-step reasoning, planning, and using environmental feedback, which is crucial for success on open-ended web tasks. The authors introduce a best-first tree search algorithm that operates within the actual environment space and is complementary to existing state-of-the-art LM agents. The search procedure allows agents to explore a larger number of potentially promising trajectories at test time, reducing uncertainty through explicit exploration and multi-step planning. The authors evaluate their approach on the challenging VisualWebArena and WebArena benchmarks, demonstrating significant improvements over baseline LM agents and setting new state-of-the-art success rates.
Strengths
- The methodology presented in this paper is robust, as the application of tree search methods within a network environment proves to be an effective solution.
- The structure of the paper is logically organized, beginning with an examination of existing issues and specifically addressing the challenge of implementing multi-step reasoning in network environments.
- The authors propose a novel method, conduct rigorous testing, and perform comprehensive ablation studies.
- The explanations are articulated clearly and are easily comprehensible; the combination of diagrams and text effectively illustrates the improvements achieved by the proposed method.
- The experimental section is meticulously detailed, ensuring the reproducibility of the experiments, which enhances the credibility of the findings.
Weaknesses
- The authors' central concept is to apply tree search methods within a network environment to achieve enhanced performance. However, as tree search techniques are commonly used in other systems, this approach exhibits limited innovation.
- In Section 5.4, the authors acknowledge that this method may result in slow execution, a significant drawback often associated with tree search. Nevertheless, the paper does not present a comparison of execution times between this method and other approaches, nor does it provide an in-depth discussion of the temporal challenges involved. The discussion primarily revolves around the search budget c, yet c does not accurately reflect the time expenditure. Readers are more concerned with the issue of temporal efficiency rather than the selection of hyperparameters.
Questions
- In Figure 2, it can be observed that there is a noticeable improvement as the search budget c increases from 15 to 20. Have the authors attempted to experiment with a larger c? This might yield better results.
- The parameter c is set as the maximum search budget. Have the authors tested the average number of searches conducted for both successful and unsuccessful cases? It might be beneficial to use a larger search budget during testing and observe how the training performance improves as c increases.
We thank the reviewer for their detailed review. We are glad that they found our method novel, the paper logically organized, and the experiments and ablations comprehensive. We are pleased that they recognized the effectiveness of our proposed tree search approach, and the findings reproducible and credible. We address specific questions below and will incorporate all feedback.
On innovation/novelty of the approach
However, as tree search techniques are commonly used in other systems, this approach exhibits limited innovation.
While it is true that tree search is commonly used in other systems (from as early as the 1950s, as we highlighted in Sec 2.3), we respectfully disagree that this limits the innovation of our approach. The primary contribution of our paper is to show the viability of such a search algorithm on complex, highly realistic websites, such as those in (Visual)WebArena. Adapting search for this domain (implementing backtracking and building an effective value function) is non-trivial. Here we show, for the first time, through extensive ablations and analysis, what is needed to make search effective in these complex and realistic web environments (for which the value function is not clear-cut, unlike in previous applications in math and games). Please also see our top-level response to all reviewers for further discussion.
Execution time compared to other approaches
In Sec A.2.2 and Figure 6 in the appendix, we compared the search approach (c=5) against a re-ranking approach adapted from [1] that generates and re-ranks multiple trajectories (up to 7). The n=5 case of the re-rank approach uses comparable compute to our search procedure, as c=5 expands 5 nodes at each search step (similar to generating 5 trajectories). Our results highlight that our approach scales more effectively than this generate-and-re-rank method, which suggests that our search approach uses computation more efficiently.
Moreover, as the cost of LLM inference continues to improve (due to hardware and ML systems innovations), we believe that cost of search will decrease and it is likely to become even more practical over time. Exploring ways to improve the efficiency and computational cost of search (as well as the inference cost of base LLM agents) is an exciting direction for future work.
References
[1] Pan, Jiayi, et al. "Autonomous evaluation and refinement of digital agents." arXiv preprint arXiv:2404.06474 (2024).
Results for larger c
In Figure 2, it can be observed that there is a noticeable improvement as the search budget c increases from 15 to 20. Have the authors attempted to experiment with a larger c? This might yield better results.
Due to compute and budget limitations, we were unfortunately unable to run full experiments with larger values, but our scaling curve (Figure 2) suggests that our search algorithm is not yet saturated at c=20. In future work with access to more resources, it will be valuable to expand beyond c=20, and develop more comprehensive "scaling laws" for the search budget, the depth and breadth hyperparameters, and the baseline agent model.
Max search budget and performance
We detailed the performance with increasing c in Figure 2, as well as the performance when we increase the depth and breadth of the search tree (Table 3). We find that the results align with intuition: as more search is conducted, success rate generally improves. We also find that aside from scaling c, we need to scale the depth and the breadth of the search tree to achieve the best results (see Sec 5.1 of the main paper for more detailed discussions).
I greatly appreciate the effort you have put into your research and the thoughtful answers you provided to my questions. However, I will be maintaining my original score. While your responses have addressed some of my concerns, they have not fully resolved all of my issues. I look forward to seeing more of your research in the future.
We thank the reviewer for looking over our response, and we're glad that some of your concerns were addressed. Please let us know if there is anything else that we can further clarify before the end of the discussion period.
The paper addresses a critical limitation of language models (LMs) used in autonomous web agents, particularly their difficulty in performing multi-step reasoning, planning, and utilizing environmental feedback in decision-making tasks. The authors propose an inference-time search algorithm that enhances the decision-making capabilities of LM agents by incorporating a best-first tree search method. To handle the lack of clear-cut rewards in diverse web environments, they propose a model-based value function to guide best-first search. The proposed approach allows agents to effectively explore and evaluate multiple action paths within interactive web environments, thereby improving their performance.
Strengths
- The paper adeptly addresses several critical challenges faced by LLM agents in real-world web environments: the difficulty in obtaining clear rewards, the accumulation of errors, and the complexity of multimodal interactive web environments. The motivation for this research is both meaningful and reasonable.
- The study introduces a tree search algorithm specifically tailored for LLM-based multi-step planning in web environments. This framework is both clear and straightforward.
- The authors provide qualitative examples of agent trajectories in Section 5.3, which significantly aids in understanding the operational principles and effectiveness of the proposed method.
- The paper is well-written, with clear and readable presentation of formulas, visualizations of figures and tables, and experimental results.
Weaknesses
- Lack of technical contribution. The integration of tree search techniques with LLM planning is not entirely novel, as there is existing research in this area, e.g., [1, 2, 3, 4]. Thus, the contribution of this paper in terms of technique novelty may need to be reconsidered, as it incrementally applies existing tree search-based LLM planning frameworks to the web agent domain.
[1] Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models (Released in 6 Oct 2023)
[2] When is Tree Search Useful for LLM Planning? It Depends on the Discriminator (Released in 16 Feb 2024)
[3] LLM Self-Training via Process Reward Guided Tree Search (Released in 6 Jun 2024)
[4] LiteSearch: Efficacious Tree Search for LLM (Released in 29 Jun 2024)
- Lack a specially designed evaluation method for value function. The authors evaluate the impact of different value functions on the success rate of the entire framework in order to select a best value function, as mentioned between lines 401 and 412. However, assessing value functions based solely on the overall framework's outcome might be indirectly influenced by various potential factors. It would be beneficial for the authors to propose a dedicated evaluation method for value functions, essentially allowing for a more stable and accurate selection process.
- The performance and credibility of the proposed framework seems limited, as observed in Table 2 (lines 324-341):
- The use of different baselines across two different datasets raises questions about the credibility of the experiments.
- On the WebArena benchmark, the authors only report the performance of two base LLMs augmented by their method, without the performance of the SOTA baselines augmented by their method, suggesting that the method may not universally integrate with diverse models or scenarios as claimed.
- Since the base LLMs already perform sub-optimally, the observed improvements after involving the proposed method are not surprising.
- The highest success rate achieved by the framework is 19.2% on the WebArena benchmark, which is considerably lower than multiple baselines, raising doubts about its efficacy.
- Some mentioned future works should have been addressed in this paper:
- Integrating proposed search strategies into domain-specific SOTA baselines (mentioned as future work on lines 274 and 349).
- Experiments with larger search budgets (mentioned as future work on line 377).
- Mitigating the time cost of tree search-based LLM planning (mentioned as future work on line 502).
For instance, the study should report the performance of the domain-specific SOTA baselines augmented by the proposed method under the maximum step constraint setting used in the reported baselines, instead of just using the base LLMs and a small step constraint. Furthermore, addressing the time consumption of tree search for LLM planning is crucial for the feasibility of practical applications, making this an essential consideration.
Simply involving these crucial issues into future work may lead to skepticisms about the credibility and robustness of this study.
Questions
Please refer to the 'Weaknesses' section.
We thank the reviewer for their helpful comments. We are glad that they recognized that our approach addresses several critical challenges faced by LLM web agents, and that they appreciated the clarity of the approach and the structure of our paper. We address specific questions below and will incorporate all feedback.
Novelty
We thank the reviewer for the references: we had discussed [1,2] in the related work, and have now updated the manuscript to cite and discuss [3,4] as well. While using tree search for LLMs is not new, we respectfully disagree that our approach is incremental. Previous work, including the provided references, primarily focused on quantitative tasks (such as math, science, and coding) which have more clearly defined value functions, such as execution tests, or on highly simplified web environments (such as MiniWoB or WebShop as in [1]). In contrast, our approach shows viability for the first time on complex, highly realistic websites, such as those in (Visual)WebArena. For example, the shopping environment in (V)WA is far more complex than that of WebShop, with advanced search mechanisms, an account creation system, and end-to-end shopping cart checkout. Developing a search algorithm for this setting (implementing backtracking accurately, and iterating to develop a value function that is good enough) is non-trivial, and we demonstrate it here, achieving substantial gains over a baseline without search.
Moreover, as also highlighted by the reviewer, one of the highlights of our approach is that it is simple, straightforward, and effective. The value of our paper is not in coming up with a convoluted solution, but showcasing for the first time how we can build a general search algorithm for realistic web tasks. We also refer the reviewer to the comments in our top-level response for further details.
Evaluating the value function
Lack a specially designed evaluation method for value function. The authors evaluate the impact of different value functions on the success rate of the entire framework in order to select a best value function, as mentioned between lines 401 and 412. However, assessing value functions based solely on the overall framework's outcome might be indirectly influenced by various potential factors. It would be beneficial for the authors to propose a dedicated evaluation method for value functions, essentially allowing for a more stable and accurate selection process.
We agree that it could be beneficial to have a value function specific evaluation. At the moment, we do not believe there exists a good benchmark to measure the effectiveness of value functions for web tasks, but we agree that developing such a benchmark (similar to RewardBench [5] but for web agents) would be a very valuable contribution.
Since our work has shown that value functions can substantially improve web agents, future work that evaluates and improves value functions is complementary to our findings, and should improve agents further. We see this not as a limitation of our work, but an opportunity for future work.
References
[5] Lambert, Nathan, et al. "Rewardbench: Evaluating reward models for language modeling." arXiv preprint arXiv:2403.13787 (2024).
Performance vs. SOTA
As the main contribution of our work is to introduce a general tree search algorithm for LLM web agents, we aim to reduce domain specific knowledge in the base agents, which many SOTA methods employ (as also highlighted by reviewer fowi and discussed in our response above). Our experimental results in Table 6 and 7 also demonstrate that our approach doesn’t overfit to a specific type of website, and shows consistent effectiveness over all websites in both WA and VWA.
Due to our compute and budget limitations, we were also unable to run search for extended horizons (all our experiments were limited to a maximum of 5 steps), which also negatively affected performance compared to other SOTA baselines (which typically run for around 30 steps). We also believe that our method is complementary to advances in efficiency and ML systems research, and is likely to benefit from these improvements in the near future. For these reasons, we believe that efficiency and time consumption are not significant hurdles to practical deployment.
Thanks for the authors' response. I will keep my score after reviewing it. These responses did not address my concerns, particularly regarding weaknesses 3 and 4. Although the authors claim their framework is simple, straightforward, and effective, there are many issues in the experimental section (see weakness 3), raising concerns about credibility. Additionally, many important problems are not addressed, and simply dismissing them as future work diminishes the contribution of this paper (see weakness 4).
Thank you for reviewing our responses. We would like to reiterate that our approach is an inference-time tree search algorithm, and agnostic to the base model. For these reasons, we used the baselines that made the most sense for each benchmark: VWA is a multimodal benchmark, and hence it is natural that its baselines differ from those of WA, which is more text focused. Also, as noted, we will release all models, code, and scripts to replicate experiments, which will hopefully alleviate some of the concerns on credibility.
simply dismissing them as future work diminishes the contribution of this paper (see weakness 4).
The goal of our paper is definitely not to solve all existing problems with LLM web agents, but to present an approach that solves some of them (and we thank the reviewer for noting in their original review that we "address several critical challenges faced by LLM agents"), and to unlock interesting new directions for future research. We believe that we have accomplished this in our paper by proposing a well motivated, generalizable approach with rigorous analysis and ablations, which are key contributions towards advancing LLM agents.
This paper proposes a tree search algorithm for web agents.
The search algorithm expands states using best first search where the value of the state comes from LLM self-evaluation. Specifically, the value comes from self-evaluation of the parent state. This proceeds up to a max depth and search budget after which the algorithm linearly rolls out to complete the task.
Their results show that tree search can significantly improve LLM-based web agents.
They claim to be the first tree-search algorithm for web agents.
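The best-first procedure summarized above can be illustrated with a minimal sketch. This is not the paper's implementation: `env.execute`, `propose_actions`, and `value_fn` are assumed stand-in interfaces (replaying a trajectory from the start, sampling candidate actions, and scoring a state in [0, 1], respectively).

```python
import heapq

def best_first_search(env, propose_actions, value_fn, budget=10, max_depth=5, branching=3):
    """Sketch of best-first tree search over environment states.

    Assumed interfaces (not the paper's actual API):
      - env.execute(trajectory) -> state: replays a list of actions from the start
      - propose_actions(state, b) -> list of candidate actions
      - value_fn(state) -> float in [0, 1] estimating task success
    """
    root_state = env.execute([])
    # Max-heap via negated values; each entry carries the trajectory of
    # actions that reaches the corresponding state.
    frontier = [(-value_fn(root_state), [])]
    best_value, best_traj = -1.0, []
    expanded = 0
    while frontier and expanded < budget:
        neg_v, traj = heapq.heappop(frontier)
        expanded += 1
        if -neg_v > best_value:
            best_value, best_traj = -neg_v, traj
        if len(traj) >= max_depth:
            continue
        state = env.execute(traj)  # "backtrack" by replaying the trajectory
        for action in propose_actions(state, branching):
            child = env.execute(traj + [action])
            heapq.heappush(frontier, (-value_fn(child), traj + [action]))
    return best_traj  # commit to the highest-value trajectory found
```

Once the budget is exhausted, the agent commits to `best_traj`, matching the described behavior of executing the highest-value trajectory found during search.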
Strengths
Originality: This paper proposes an original tree search algorithm for LLM based web agents. The paper also claims to be the first such algorithm.
Quality: The paper is of high quality. The results are pretty strong and clear.
Clarity: The paper is generally clear and easy to follow. However, there could be some improvement here.
Significance: There has been a trend of applying tree search and test-time compute across the many applications of LLMs, so it is not surprising to see a tree search algorithm for LLM web agents. However, in a first-to-the-flag sense, it is significant.
The results show that this is a meaningful route to improved performance and seems likely to inspire other works. It clearly clears the bar for publication significance.
Value function approach seems smart: multi-modal, last d screenshots, scores in {0, 0.5, 1}, and averaging over multiple reasoning samples. This could be a source of a lot of noise, but this seems to be a way of reducing that. The trade-off is more compute, which is further balanced by only evaluating states when they are expanded, not when they are generated.
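The averaging over multiple reasoning samples described here can be sketched as follows; `score_once` is a hypothetical stand-in for one CoT-prompted judgment from the model.

```python
def self_consistent_value(score_once, n_samples=5):
    """Average several noisy judgments in {0, 0.5, 1} into one value estimate,
    reducing the variance of any single model sample."""
    scores = [score_once() for _ in range(n_samples)]
    return sum(scores) / len(scores)
```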
“we generate 20 outputs from the model by prompting it with CoT reasoning (Wei et al., 2022), and aggregate the count of the action candidates. We use the top-b actions with the highest counts for branching.” - This is a good way to generate distinct action candidates
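The sampling-and-counting step quoted above amounts to majority voting over sampled actions; a small sketch, where `sample_action` is a hypothetical stand-in for one CoT-prompted model call:

```python
from collections import Counter

def top_b_actions(sample_action, b, n_samples=20):
    """Sample n_samples actions, count identical candidates, and keep the
    b most frequent ones for branching."""
    counts = Counter(sample_action() for _ in range(n_samples))
    return [action for action, _ in counts.most_common(b)]
```

Since candidates are aggregated by exact match, this implicitly assumes sampled actions are normalized to a canonical string form before counting.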
Weaknesses
WebArena results do not look strong relative to other modern works. However, some of those works seem to be a bit over-optimized for the benchmark, whereas this work is more general.
There are some clarity issues to work out with the writing.
The section on destructive actions seems highly speculative. This is a major weakness of the approach: in real-world settings it is difficult to conduct search when lots of backtracking is required. It is not even clear to me how backtracking could be conducted in these situations, even excluding destructive actions.
Another weakness of this method is that it involves more inference time compute.
Questions
Algorithm 1 has “Backtrack and execute new actions to get to state s_p”
- How is this done in your implementation?
Algorithm 1 has “”” for i ← 1 to b do: Execute a_ip to get to state s_ip ”””
- How do you execute multiple actions from the same state in a web environment? This is similar to the above question. Essentially, how is backtracking done?
Tab 2 shows GPT-4o gets 18.9% across the whole VWA benchmark. On the 200 question subset, the results without search are 24.5%. And with b=5, d=5 the results are 37% compared to 26.4% in table 2. This seems like a large discrepancy. Can this be attributed to the sampled questions?
Line 405: “(which is a sparse reward signal that returns either 0 or 1)”
- What about just showing the accuracy of the value function when the ground truth is available? I.e. Something like mean square error of the value function in cases when ground truth is available.
How is tie-breaking done? Seems like it would make sense to break ties by depth and/or by the number of "votes" during generation.
Suggestions:
Algorithm 1 should be in main section of paper. It is easier to be able to refer to it in the method section.
“For example, a search budget of 10 indicates that at most 10 nodes will be expanded, after which the agent will commit to and execute the trajectory with the highest value” - This needs to be part of the algorithm. The reader needs a place to be able to refer to and understand how the algorithm works. They should not need to comb through ablation results to know how the method works.
Line 404: “prompted zero-shot (with just the current observation)”
- Maybe note this is because LLaVa only accepts one screenshot
This is maybe nit-picky but wouldn't all the children of a state have the same value? In figure 1, (7) and (9) have different values.
“Alloted” -> “allotted”
Line 483: backtrack -> backtracks
Figure 1 should be referred to in the text.
We thank the reviewer for their comprehensive review, and for these detailed suggestions that will make the paper more readable! We are glad that they found the paper to be of high quality, the results strong and clear, and recognized the significance of our approach in introducing test-time compute for LM web agents.
Other SOTA WebArena results
We agree that many state-of-the-art approaches include domain specific biases that are tailored for WebArena, and we indeed opted to develop a more general tree search methodology that can be applied to any baseline agent. In practical deployment scenarios, it will likely be useful to still include some of these domain specific tweaks (including to our value function and our search algorithm) but we will leave these for practitioners to explore and experiment with.
Backtracking feasibility and implementation
We agree that in many real world web settings it is difficult to backtrack. However, search is likely still useful for many economically valuable web tasks which can be trivially parallelized (and for which destructive actions are not a concern). For example, the same Google Sheets document can be copied n times and we can run search over the n tabs in parallel.
Other web tasks can also be parallelized similarly if we have an effective way to avoid destructive actions (e.g., by implementing a denylist for pages such as checkout), such as exploring multiple categories in parallel on a shopping site to find the best item. Of course, significant work will need to be done to make this viable in production, but there are already many existing real world scenarios where search can be both practical and effective.
For our backtracking implementation, we have added section A.4.1 to the appendix of the manuscript with more implementation details, and will include code pointers once the code is publicly released. Please also see our top-level response to all reviewers for further discussion.
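As a rough illustration of the general idea (not the implementation detailed in Sec A.4.1), backtracking without native undo support can be done by resetting the environment and replaying the action prefix that leads to the target node; a gym-style `reset`/`step` interface is assumed here.

```python
def backtrack_to(env, action_prefix):
    """Return the environment to the state reached by `action_prefix`.

    Without undo support, backtracking is implemented by resetting the
    environment and replaying the actions that led to the target node.
    """
    obs = env.reset()
    for action in action_prefix:
        obs = env.step(action)
    return obs
```

This replay cost is what benchmark-specific optimizations (e.g., restoring from intermediate checkpoints) aim to reduce.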
More inference time compute required
Another weakness of this method is that it involves more inference time compute.
We believe that this is a feature rather than a weakness: our approach indeed expends more computation at inference, but achieves better results as more search is performed (Figure 2 and Table 3). While at present it may be impractical to apply the full search algorithm on all tasks, we believe that search unlocks a valuable "knob" that can be used to tradeoff performance and inference compute for web agents. For example, we can perform less (or no) search for easy tasks, while expending more compute on harder or more consequential tasks to get better results. Please also see our top-level response to all reviewers for more details.
Tie-breaking
In the event of ties we keep the original node (the one that was explored first, which had a higher value than other options at that point in time). We ran an ablation where we broke ties by preferring the most shallow node, and on the 200 example ablation subset this achieved a score of 34.5%. Breaking ties by preferring deeper trajectories achieved a score of 35.0%. Both of these are slightly worse compared to the success rate of 37.0% when we keep the original path. Breaking ties by the number of "votes" would definitely be interesting to explore in future extensions but would likely require taking even more samples during search to reduce noise.
Table 2 results vs. ablation subset
Indeed, this is likely because the sampled set of the first 100/50/50 tasks of VWA shopping/reddit/classifieds have easier difficulty levels than the full set (perhaps due to how VWA was originally constructed). The full dataset and the ablation subset have the following distribution of task difficulty levels:
| Difficulty | Full Dataset | Ablation Subset |
|---|---|---|
| Easy | 22.5% | 32.5% |
| Medium | 41.5% | 41.0% |
| Hard | 35.9% | 26.5% |
MSE of value function
This would be an interesting metric, but for the (V)WA tasks, MSE would likely be less interpretable than the success rate numbers in Table 4, as most nodes have a groundtruth value of 0. Early termination (when the groundtruth is 0) is as detrimental as failure to correctly terminate (when the groundtruth is 1), which also makes it harder to interpret MSE in some of these cases.
Other suggestions
Thank you for the detailed suggestions! We have incorporated most of the feedback in the revised draft. However, as we do not have sufficient space within the ICLR page limit, we opted to leave the algorithm in the appendix, as we felt that the high level overview of the approach in Figure 1 provided a better introduction for new readers. We added a reference in Figure 1's caption to point to the location of the algorithm details.
Thank you for the response.
I appreciate adding the additional details in the appendix.
I have some follow-up questions:
Inference compute
It seems that this algorithm may be especially slow, even among other search algorithms. The reason is that, as per Algorithm 1, when you sample actions you execute all of them (line 23 of Alg 1). This requires resetting the state b times and rerunning the previous actions. This is a lot of time (though not necessarily compute) spent waiting to execute actions that may never actually end up being searched over.
Ablation subset
This is a very high discrepancy in performance, which is a little concerning. I understand that this is a small ablation though. Is it possible that the Easy/Medium/Hard labels do not actually make a difference in success rate?
Summary
I stick by my review that this is an acceptable paper but would appreciate follow-up to some of these concerns. Thank you for the response and for making some of the requested changes.
We thank the reviewer for going through our response. Here are our replies to the follow up questions:
Inference compute
This is a lot of time (though not necessarily compute) spent waiting to execute actions that may never actually end up being even searched over.
Yes, it is correct that this can be a bottleneck. As a benchmark-specific optimization, we also persisted intermediate checkpoints in (V)WA (i.e., when the URL of a page changes) and restore from these states rather than from the initial state whenever available. The experiments are also trivially parallelizable at the task level, and we run up to 3-6 environments in parallel (as this does not increase overall API/model calls and only requires additional CPU instances, which are relatively cheap). We will update the paper to point to these exact details in the code once it is public.
There is likely additional optimization that we can perform to speed this up including possibly doing lossy but "good enough" backtracking, but we opted to implement our more expensive but accurate version as a first step, and leave further optimization for future work.
Ablation subset
We apologize for the ambiguity in our initial response: the table we shared above is actually the percentage of each difficulty level in the full vs. ablation subset. Thus, the ablation subset contains easier tasks overall compared to the full set: 32.5% of the ablation subset is marked as easy, compared to only 22.5% of the full set. We also generally observed that the agents achieve higher success rates on easier tasks (Table 5 of the main paper). These details align with the relatively higher success rates we observed on the ablation subset.
We hope this addresses your questions. Please let us know if there is anything else that we can further clarify before the end of the discussion period.
This paper introduces a tree search algorithm for improving the performance of language model agents. By implementing a best-first search approach that operates in the environment's actual state space, the authors enable LM agents to achieve multi-step planning and exploration during inference. Experiments are conducted on VisualWebArena and WebArena benchmarks with improved performance, showing that introducing search allows agents to perform better on complex tasks.
Strengths
- Experiments are conducted on several practical benchmark datasets such as VisualWebArena and WebArena, showcasing its effectiveness in web-based tasks.
- The search algorithm is compatible with a variety of LM agents and does not require fine-tuning or retraining.
- Extensive hyper-parameter analysis is provided.
Weaknesses
- The search algorithm demands significant computational resources due to the increased number of environment interactions. This may limit its practical applicability in real-time or resource-constrained environments.
- The success of the best-first search depends heavily on the quality of the value function. Although self-consistency techniques were used, further improvements in the value function are needed for optimal performance.
- The paper briefly addresses the issue of destructive actions in search trajectories but lacks a comprehensive solution. This is a critical consideration for deployment in real-world web environments where irreversible actions are possible.
Questions
- How does the search algorithm perform on tasks with very high action complexity or longer required sequences? Are there any indications that increasing the search budget might lead to diminishing returns in such cases?
- Can the authors provide more details on how the environment resets are managed during backtracking? Specifically, what challenges arise, and how might these affect the agent's efficiency?
- What were the key considerations in selecting the self-consistency technique for the value function? How does this compare to alternative approaches like reinforcement learning-based value functions?
- Are there any safeguards or heuristics incorporated to prevent the agent from becoming stuck in specific repetitive or irrelevant actions during search?
We thank the reviewer for their helpful comments. We are pleased that the reviewer found the experimental results practical, and our proposed method effective. We are glad that the reviewer recognized the generality of our search algorithm. We address specific queries below, and have incorporated all feedback into the revised manuscript.
Requiring increased computational resources
The search algorithm demands significant computational resources due to the increased number of environment interactions. This may limit its practical applicability in real-time or resource-constrained environments.
We believe that this is a feature rather than a weakness: our approach indeed expends more computation at inference, but achieves better results as more search is performed (Figure 2 and Table 3). While at present it may be impractical to apply the full search algorithm on all tasks, we believe that search unlocks a valuable "knob" that can be used to tradeoff performance and inference compute for web agents. For example, we can perform less (or no) search for easy tasks, while expending more compute on harder or more consequential tasks to get better results. Please also see our top-level response to all reviewers for further discussion.
Quality of the value function
success of the best-first search depends heavily on the quality of the value function
What were the key considerations in selecting the self-consistency technique for the value function? How does this compare to alternative approaches like reinforcement learning-based value functions?
We agree, and this is true for almost all approaches that use search. For some domains (e.g., chess or Go) the value function is 100% deterministic and accurate — but for general web domains, this is not the case. For these reasons, it’s difficult to build a perfect value function. We chose self-consistency for this paper primarily because it performed well empirically (see Table 4), but we believe that there is a lot that future work can do.
In particular, we believe that finetuning reward models specifically for web tasks will be a very promising future direction, and is likely to improve our approach significantly (but this may require much more compute resources and labeled data than is available to us at present).
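As a concrete illustration of the self-consistency idea described above, the value of a state can be estimated by sampling several independent LLM judgments at nonzero temperature and averaging them. This is only a sketch: `score_fn` is a hypothetical stand-in for whatever LLM call is available, assumed to return a scalar rating in [0, 1] parsed from the model's response.

```python
from statistics import mean
from typing import Callable

def self_consistency_value(score_fn: Callable[[str], float],
                           prompt: str,
                           n_samples: int = 5) -> float:
    """Average several independent samples of a stochastic LLM-based scorer.

    `score_fn` is a hypothetical stand-in for a call that prompts an LLM
    (sampled at nonzero temperature) to rate how promising the current
    state is, returning a scalar in [0, 1].
    """
    scores = [score_fn(prompt) for _ in range(n_samples)]
    return mean(scores)  # self-consistency: mean over independent samples
```

Averaging reduces the variance of any single noisy judgment, which matters because best-first search repeatedly compares these estimates when choosing which state to expand next.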
Destructive actions
The primary goal of our paper is to demonstrate the viability of a tree search algorithm (for the first time) in complex and realistic web environments. One existing practical deployment scenario that we discussed in Sec 5.4 is to run the algorithm in environments where actions are always reversible, e.g., Google Docs or Google Sheets.
However, we agree that deployment of such algorithms in general real world web environments requires more engineering work (and briefly discuss what might be required for this in our Ethics Statement). We believe that this is a very important direction for future work to explore, but also believe that it is out of the scope of this paper.
Performance on more complex or longer tasks
How does the search algorithm perform on tasks with very high action complexity or longer required sequences? Are there any indications that increasing the search budget might lead to diminishing returns in such cases?
Our results actually suggest the opposite: search is likely to be more effective on more complex or longer tasks. Sec 5.2 of the paper includes a breakdown of tasks according to their difficulty. In the VWA benchmark, easy tasks are defined as those that require 3 or fewer human actions to complete, medium tasks require 4-9, and hard tasks require 10 or more. Our results show that search yields larger relative improvements on medium (+75%) and hard tasks (+47%) than on easy tasks (+24%), which likely do not benefit as much from exploration and planning. For the hard tasks, we believe the gains are also limited because our compute budget did not allow us to run search over trees deeper than 5 nodes (see Sec. 5.2), which is why many hard tasks (defined in VWA to require more than 10 actions) did not benefit as much from search as medium tasks.
Environment resets
We have added section A.4.1 to the appendix of the manuscript with more implementation details, and will include code pointers once the code is publicly released. Please also see our top-level response to all reviewers for more discussion.
Safeguards or heuristics for repetition
Are there any safeguards or heuristics incorporated to prevent the agent from becoming stuck in specific repetitive or irrelevant actions during search?
In our current implementation, we did not introduce such mechanisms, as we opted to develop a general approach that can be deployed to a variety of web environments. For specific use cases, such safeguards would make a lot of sense (e.g., we might stop a shopping agent from ever visiting pages whose URL ends with "/checkout").
Thank you again for your helpful suggestions! As the discussion deadline approaches, we wanted to follow up to see if there are any additional concerns or questions of yours that we might address.
We thank all reviewers for their valuable comments. We are glad that reviewers found the paper to be well written and clear (5q9M, fowi, 6Eak), the approach to be effective (DTsq), experiments meticulously detailed and reproducible (6Eak), and a strong and a meaningful route to improved LLM web agent performance (fowi).
We are also glad that reviewers recognized the significance of our approach, finding it novel (6Eak), compatible with a variety of baseline LLM agents (DTsq), and to address several critical challenges faced by LLM agents in real-world web environments (5q9M), and that our paper is likely to inspire follow up work (fowi).
We address several common questions below, and provide responses to specific queries to individual reviewers below. We have also updated the manuscript draft on OpenReview to incorporate all feedback.
Requiring increased compute for search
Reviewers DTsq and fowi pointed out that our approach, like other tree search algorithms, demands additional compute compared to models that do not search.
We believe that this is a feature rather than a weakness: our approach indeed expends more computation at inference, but achieves better results as more search is performed (see Figure 2 and Table 3). While at present it may be impractical to apply the full search algorithm on all tasks, we believe that search unlocks a valuable "knob" that can be used to spend inference compute to achieve increased performance for web agents. For example, we can perform less (or no) search for easy tasks, while expending more compute on harder or more consequential tasks to get better results.
In Table 5 of the paper, we demonstrated that our search algorithm is relatively more helpful for medium and hard VWA tasks, which shows that being able to control the amount of search dynamically can be beneficial. In real-world web tasks, we may wish to expend more computation to achieve better results on harder and more consequential tasks, and our approach provides one method to do so for web agents. Additionally, as the cost of LLM inference continues to fall (driven by hardware and ML systems innovations), we believe the cost of search will decrease and the approach will become more practical over time.
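To make the "knob" concrete, a minimal best-first search loop with an explicit expansion budget might look like the following sketch. The `expand`, `value`, and `is_goal` callables are hypothetical stand-ins for the agent's action proposer, state evaluator, and success check; this is an illustration of the budgeted-search idea, not the paper's implementation.

```python
import heapq

def best_first_search(root, expand, value, is_goal, budget=20):
    """Best-first search with an explicit expansion budget.

    `budget` is the compute "knob": a larger budget trades more
    inference-time computation for more exploration.
    """
    counter = 0  # tie-breaker so heapq never compares states directly
    frontier = [(-value(root), counter, root)]
    best_state, best_value = root, value(root)
    expansions = 0
    while frontier and expansions < budget:
        _, _, state = heapq.heappop(frontier)  # highest-value state first
        if is_goal(state):
            return state
        expansions += 1
        for child in expand(state):
            v = value(child)
            if v > best_value:
                best_state, best_value = child, v
            counter += 1
            heapq.heappush(frontier, (-v, counter, child))
    return best_state  # best state found within the budget
```

Setting `budget=0` degenerates to the no-search baseline, while raising it buys additional exploration, which is the tradeoff discussed above.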
Innovation/novelty of the approach
Reviewers 6Eak and 5q9M suggest that our approach may exhibit limited novelty as tree search techniques are commonly used in other systems.
While it is true that tree search is commonly used in other systems (from as early as the 1950s, as we highlighted in Sec 2.3 of our paper), we respectfully disagree that this limits the innovation of our approach. The primary contribution of our paper is to show the viability of such a search algorithm on complex, highly realistic websites, such as those in (Visual)WebArena. Adapting search to this domain (implementing backtracking, building an effective value function) is non-trivial. Previous work primarily focused on quantitative tasks (math, science, coding), which have more clearly defined value functions such as execution tests, or on highly simplified web environments. Here we show, for the first time, through extensive ablations and analysis, what is needed to make search effective in these complex and realistic web environments, for which the value function is not clear-cut, unlike in previous applications in math and games.
Moreover, as also highlighted by reviewer 5q9M, one of the highlights of our approach is that it is simple, straightforward, and effective. The value of our paper is not in coming up with a convoluted solution, but showcasing for the first time how we can build a general search algorithm that works well on realistic web tasks.
Backtracking implementation details
Reviewers DTsq and fowi requested more details on how we implemented backtracking and environment reset. We have added section A.4.1 to the appendix of the manuscript with more implementation details. We essentially reset the environment to the initial state, and perform the saved sequence of actions sequentially whenever we wish to return to a previously stored state. We implemented backtracking this way because naively executing the "back" browser action does not perfectly persist state: information such as scroll offset and text entered into inputs is often lost.
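The reset-and-replay scheme described above can be sketched as follows. The `reset` and `step` callables are hypothetical stand-ins for the browser environment interface; the actual implementation details are in appendix A.4.1 and the released code.

```python
from typing import Any, Callable, List

def goto_state(reset: Callable[[], Any],
               step: Callable[[Any, str], Any],
               action_trace: List[str]) -> Any:
    """Return to a previously stored state by resetting the environment to
    its initial configuration and replaying the saved action sequence.

    Replay is used because the browser "back" action does not persist
    details such as scroll offsets or text typed into inputs.
    """
    obs = reset()  # restore the environment's initial state
    for action in action_trace:
        obs = step(obs, action)  # re-execute each saved action in order
    return obs
```

The cost of each backtrack therefore grows with the depth of the target node, which is one reason deep search trees become expensive.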
Our code (including backtracking and environment reset scripts) will also be completely open sourced. Once public, we will update the paper to point to the exact files containing these details to further facilitate reproducibility.
We thank all reviewers again for their valuable comments and suggestions. We believe that our responses and proposed changes address the concerns brought up in each review, but please let us know if there is anything else we can further clarify before the end of the discussion phase.
In this paper, the authors propose an inference-time tree search algorithm for LLM agents to perform web exploration and planning tasks. Reviewers noted that the experiments demonstrate the effectiveness of the approach and its compatibility with various LLM models. However, they also expressed concerns about the computational costs, the novelty of the approach, and its overall completeness. In summary, reviewers acknowledged the contributions of this work but suggested that further development is needed to make it a solid publication.
Additional comments from reviewer discussion
In the rebuttal, the authors argued that the increased computational cost is a feature rather than a weakness, as it leads to better results. They also addressed questions about novelty and clarified experimental details. However, not all reviewers were convinced that these responses fully resolved their concerns.
Reject