BEARCUBS: A benchmark for computer-using web agents

COLM 2025 (Poster) · OpenReview · PDF
Submitted: 2025-03-18 · Updated: 2025-08-26
Ratings: 6, 6, 5, 6 from 4 reviewers (average 5.8/10; min 5, max 6, standard deviation 0.4) · Average confidence: 4.0
TL;DR

We introduce BEARCUBS, a benchmark of 111 information-seeking questions designed to evaluate a web agent’s ability to search, browse, and identify factual information from the web.

Keywords

computer-use agent, benchmark, multimodal

Reviews and Discussion

Official Review (Rating: 6)

The paper proposes a new benchmark for evaluating web browsing agents on information retrieval and question-answering tasks on real-world websites. The benchmark consists of 111 questions, collected from the paper authors as well as crowdsourcing, and extensively validated and filtered to ensure a challenging set. In particular, the multimodal subset is designed to be impossible to solve without multimodal capabilities, and the authors promise to preserve this in the future by evolving the benchmark as the web changes.

Reasons to Accept

The problem is highly timely, with a proliferation of both commercial and research systems in this space, and the need to accurately assess their capabilities. Other commonly used benchmarks are close to saturation (WebVoyager) or have methodological limitations (WebArena). Hence, there is a need for a new benchmark.

The goal of constructing an adversarially challenging benchmark is sound, and it seems this was largely accomplished, given poor results from state-of-the-art models, while humans with relevant domain-specific knowledge can solve the benchmark tasks with no issue. In particular, the fact that human-perceived difficulty correlates inversely with agent performance could make this a very powerful benchmark for the goal of building human-quality agents.

The authors promise to keep the benchmark up to date over time, which avoids the limitations of both simulated environments and recorded websites, and keeps the benchmark relevant even if answers leak into training sets.

Reasons to Reject

There are a few concerns about the design of the benchmark which might undermine some of the claims in the paper:

  1. The benchmark is designed to be adversarial to search methods, but the baseline LLM+search that only provides the top 10 search snippets is naive and probably does not match SOTA search agents like Perplexity or Google Search AI Overview. It is unclear what a SOTA agent would do on this benchmark, and it is unclear whether a crawler+RAG-based mechanism that also indexes and retrieves page screenshots could be successful.
  2. It is unclear which specific skills of browser agents this benchmark is actually measuring: is it multimodal understanding of page screenshots? Navigation? Interaction with forms, widgets, and other controls? The benchmark is very broad in its use cases. As a browser agent benchmark, it is unclear how meaningful it is to include sheet-music understanding or video understanding. On the other hand, as a general-purpose, all-encompassing agent benchmark, it is clearly too small to be representative.
  3. Related to the previous point, it is unclear how realistic the questions are, and thus how transferable results on this benchmark are to real-world agent usage. The purpose of a benchmark should be to inform or assess real-world performance. Yet the example questions seem unnatural, perhaps due to the requirement of exact-match checking for answers or to how the questions were obtained.

I also see potential issues around the evaluation protocol:

  1. Answers are manually evaluated, and what appears to be a subjective judgement is made on whether the agent is sufficiently confident in its answer. This is highly unusual: a more customary approach would be to use an LLM judge, or to instruct the agents to produce answers in a machine-readable format through appropriate prompting. I understand the desire to use the same prompts for all agents, but a per-model prompt template (independent of the goal) would smooth out irrelevant differences among systems and still achieve the benchmark's goals.
  2. All agents are asked to bypass CAPTCHAs. It is first of all questionable to ask publicly released systems (with safety concerns and protections) to solve CAPTCHAs, when compared to unreleased research systems or open-source systems that have no such limitation. It is also generally unclear what the value is of evaluating an agent on its ability to override a CAPTCHA, because as CAPTCHAs become better or unbypassable (e.g., through attestation), it is reasonable to expect that agents will only be operated on websites that explicitly consent to AI interaction.

Questions for Authors

  1. I don't fully understand how filtering was done for multimodal questions (lines 53-54, lines 474-475), given that Deep Research nonetheless solves 9% of multimodal-only questions. This seems inconsistent in the text: either those questions are indeed solvable by text-only methods, or they are randomly guessable with almost 10% accuracy, which would raise further questions.

  2. OpenAI Deep Research is described as an agent without computer use in Table 2. This might be a bit misleading: Deep Research has a browsing tool (separate from the search tool) which can interact with live websites. Note that OpenAI itself evaluated Deep Research on a benchmark specifically for browser agents (https://openai.com/index/browsecomp/).

Comment

We thank Reviewer ivpU for the detailed feedback. We appreciate the recognition of BEARCUBS’ timeliness. Below we address the main concerns and questions:

Weakness #1: LLM+search is naive

The reviewer asks for a stronger baseline. Google Search AI Overview does not provide API access. We ran Perplexity's sonar-pro model via their API, and the results are as follows:

Correct: 5
Wrong: 31
No direct answer: 75

Perplexity sonar-pro does not outperform the other baselines reported in our paper. Concerning the crawler + RAG-based mechanism that indexes and retrieves page screenshots: this setup effectively functions as an agent that gathers information and reacts to it to generate a final answer, making it less suitable as a baseline for our purpose, which is to ensure that BearCubs cannot be easily solved by LLMs' parametric knowledge or by simple search queries.
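For concreteness, a minimal sketch of this kind of LLM+search baseline call is shown below. It assumes Perplexity's OpenAI-compatible chat-completions endpoint and the sonar-pro model name; the prompt wording and the example question are illustrative rather than the exact setup we used.

```python
# Minimal sketch of an LLM+search baseline via Perplexity's API (assumed
# OpenAI-compatible endpoint; the prompt here is illustrative, not the
# exact prompt used in the paper).
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["PERPLEXITY_API_KEY"],
    base_url="https://api.perplexity.ai",
)

def ask_sonar_pro(question: str) -> str:
    """Query Perplexity sonar-pro with a single BEARCUBS-style question."""
    response = client.chat.completions.create(
        model="sonar-pro",
        messages=[
            {"role": "system",
             "content": "Answer the question concisely. If you cannot find a "
                        "direct answer, say 'no direct answer'."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask_sonar_pro(
        "Using the CPI calculator from the U.S. Bureau of Labor Statistics, "
        "what is the inflation-adjusted equivalent of $1,234 from February "
        "1985 in July 2022?"))
```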

Weakness #2.1: BearCubs does not target specific agent skills

While it is important to assess individual capabilities (e.g., navigation, multimodal understanding), BEARCUBS aims to evaluate agents holistically. Real-world usage often requires combining multiple skills fluidly, and our benchmark is designed to reflect this integrated ability.

Weakness #2.2: BearCubs contains impractical tasks

BEARCUBS mainly targets computer-using agents, which are designed to operate on a laptop like humans. Tasks such as reading music sheets or interpreting videos are deliberately included to test this claim of human-like interaction.

Weakness #2.3: Benchmark size

We address this in the common response Nr. 2. In short, although BEARCUBS is modest in size, each task is challenging and skill-intensive, providing a meaningful signal for agent evaluation.

Weakness #3: Realism of tasks

We discuss this in detail in the common response Nr. 3. While our human-generated questions may differ from natural user queries, they serve as a valuable intermediate benchmark towards building agents that can reliably retrieve and reason over grounded web information in real-world scenarios.

Weakness #4: Manual evaluation

As detailed in our common response Nr. 1, manual evaluation was essential to deeply analyze agent behavior and trajectories. For those focused solely on final answer accuracy, we developed a prompting-based auto-rater, which offers a rapid and efficient way to assess agent performance.

Weakness #5: It is questionable to ask agents to bypass CAPTCHAs.

First, this “user-intervention-minimizing” prompt is only given to computer-using agents. Second, most BearCubs questions do not require solving CAPTCHAs. Third, given that these computer-using agents are positioned as human-equivalent (e.g., Anthropic), we hold them to the same standard as a human user. Lastly, while it may be possible that in the future websites will explicitly grant interaction permission for AI, it is not yet a common practice. No agent is currently designed to handle such consent. Therefore, we evaluate these computer-using agents by the same behavioral standards we expect from humans.

Question #1: How did Deep Research solve some multimodal questions if it was used to filter out questions?

Deep Research correctly answered 5 multimodal questions, but its trajectories and final answers showed no evidence of accessing pages with the correct answers. This suggests it may have guessed or hidden parts of its trajectory. However, hiding parts of the trajectory that contain the correct answer was not a behavior we observed elsewhere during our evaluation. Regardless, since agents provide their trajectories (also in the final outputs), those trajectories should be transparent and verifiable to build user trust, underscoring the need to evaluate both answers and search behavior.

Question #2: Why is OpenAI Deep Research described as an agent without computer use?

As described in Lines 26-30, we consider computer-using agents to be those that can process pixels on the screen and control a virtual keyboard and mouse, enabling them to complete tasks like watching videos or playing online games. While OpenAI Deep Research can browse live websites, it lacks these capabilities and thus falls outside our definition.

Comment

Thank you for answering my review. The points regarding the size of the benchmark and the selection of tasks are already covered in other discussion threads, so I won't comment further.

Thank you for addressing my other comment regarding manual evaluation. I understand the need to evaluate trajectories but I think in practice the LLM judge you present now will be necessary for others to use the benchmark reliably.

Regarding the CAPTCHAs, my point wasn't on whether CAPTCHAs are useful to solve in abstract or not, but whether actual commercial systems - with safety precautions - should be held to the same standard as research systems with no guardrails. But evidently actual systems also bypass CAPTCHAs when instructed, so this is perhaps moot.

As for Deep Research, I'm not 100% convinced that distinguishing between "controls a browser at the HTML level" and "controls a browser at the pixel level" is that meaningful, but your point is taken, thank you.

Comment

Thank you for your response!

We'd like to clarify one small point: we did not mean to suggest that OpenAI's Deep Research operates at the HTML level. That description in the paper was intended for agents like Mind2Web. In fact, we do not know how Deep Research functions internally. When we referred to "processing pixels," we meant the agent's ability to use a computer interactively, particularly in a multimodal manner, like humans, as claimed by some agent providers (e.g., Anthropic).

We appreciate your engagement throughout the discussion.

Official Review (Rating: 6)

This paper presents a benchmark consisting of 111 information-seeking questions designed to evaluate a web agent’s ability to search, browse, and identify factual information from the web. The benchmark requires: (1) accessing live web content (not synthetic or simulated pages), and (2) performing a broad range of multimodal interactions including video understanding and 3D navigation. Questions have short, unambiguous answers, and the benchmark includes human-validated browsing trajectories. Human accuracy is 84.7%; the authors assert that this is explained in large part by search inefficiencies and domain knowledge gaps.

Reasons to Accept

The benchmark seems like it might last for a while, since GPT-4o zero-shot performance was low, and OpenAI Deep Research achieves 60% accuracy on text-based questions but only 9% on multimodal questions.

Reasons to Reject

The benchmark is quite small, with just 111 questions and “Solving BEARCUBS requires visiting 108 unique top-level URLs, minimizing the risk of agents overfitting to specific websites.” This is not a lot by modern standards.

Questions for Authors

The example provided on page 8 includes a sequence of URLs from Deep Research. However, it might be the case that antidistillation sampling (i.e., https://arxiv.org/abs/2504.13146) or some other form of obfuscation is being used by OpenAI. Have you considered this?

Section 5.1 says “Analysis reveals that the most common factor for wrong answers is the human overlooking details in the question or answer (e.g., Example 1 in Table 6 in Appendix F). The next most common factor is lack of topic understanding (see also Example 1). Additional error sources are listed in Table 6, along with examples and explanations. In each error case, if the human annotator had spent more time or had more domain knowledge, they would likely have been able to find the correct answer.” But why does the abstract point to search inefficiencies and knowledge gaps?

Comment

We thank Reviewer HtzH for the constructive review. We address the comments below.

Weakness #1: Size of the benchmark

We address this in detail in our common response Nr. 2. In short, each question in BearCubs is carefully designed to require a diverse set of web-browsing skills and to have a large answer search space. The widely adopted AIME benchmark is also small, yet it still supports impactful evaluation.

Question #1: Antidistillation sampling

Thank you for sharing this paper! We will cite the work and add the discussion of the possibility that antidistillation sampling may be used. However, if OpenAI Deep Research indeed uses antidistillation sampling to generate the trajectory, its performance defeats the intended goal of preserving the teacher model’s capabilities.

Question #2: Discrepancy between abstract and content

We appreciate your close reading. You are right to note the mismatch. In our study, “search inefficiencies” may stem from annotators lacking domain-specific knowledge or other reasons. We will revise the abstract to more clearly reflect the main sources of error, namely, overlooking details and lacking topic understanding.

Comment

I have read the author response and will stay with my score of 6.

Official Review (Rating: 5)

The paper introduces BEARCUBS, a benchmark of 111 information-seeking tasks for evaluating web agents operating on the live web using real-world multimodal interactions. Unlike prior benchmarks based on synthetic or static pages, BEARCUBS emphasizes unpredictability and realistic human-like browsing. It features human-verified trajectories, adversarial question design, and ongoing updates to prevent contamination.

Reasons to Accept

  • Moves beyond static environments, and the multimodal requirement ensures questions cannot be answered via a simple Google search or LLM knowledge. Human accuracy (85%) confirms that the tasks are solvable yet challenging.

  • Each question includes a validated human trajectory, enabling fine-grained analysis of agent behavior and supporting more transparent evaluation.

Reasons to Reject

  • The dataset size is small at 111 questions which, while described as "small but mighty," limits its statistical strength for training or broad evaluation.

  • The paper lacks an automation pipeline; all evaluations are conducted manually, making reproduction and benchmarking slow and resource-intensive, and the promised benchmark maintenance remains a future commitment rather than a demonstrated feature.

  • At the time of review, the benchmark does not appear to be publicly released or open-sourced. Without access to the dataset or tooling, it’s difficult to verify claims or build upon this work.

Questions for Authors

  1. What are your concrete plans for automating evaluation or building community tooling so BEARCUBS can scale beyond internal experiments and remain viable as a long-term benchmark?

Comment : I would like to see the authors' rebuttal and am open to adjusting my score if the concerns I raised—particularly around benchmark accessibility, evaluation scalability, and maintenance—are clarified or addressed.

Comment

We thank Reviewer osWr for the valuable review. We address the comments below.

Weakness #1: Size of the benchmark

As discussed in our common response Nr. 2, while BEARCUBS is relatively small in size, each question is carefully designed to require a diverse set of computer-using skills. We draw a parallel to the widely adopted small-scale benchmark AIME, which similarly supports impactful evaluation despite its limited size.

The reviewer raised a concern about the statistical strength of our results due to the small size of BearCubs. BearCubs poses open-ended questions with answers grounded in the real-world web, making random guessing ineffective, unlike in multiple-choice settings. As a result, model accuracy tends to show low variance (see the results in common response Nr. 2).

Weakness #2: Manual evaluation is slow

We address this in the common response Nr. 1. Manual evaluation was a deliberate choice to enable in-depth analysis of agent behavior. For broader use, however, we agree that automation is desirable. For those focused solely on final answer accuracy, we developed a prompting-based auto-rater, which offers a rapid and efficient way to assess agent performance.

Weakness #3: Lack of public access

We uploaded our dataset as supplementary material with the submission. We exclude the answers to prevent strong search agents (e.g., OpenAI’s Deep Research) from directly retrieving them. We have considered releasing the answers in a pickle file but it is unclear whether a strong agent will be able to retrieve and deserialize the file.

Question: Plan for automation and maintaining BearCubs as a viable long-term benchmark

We developed an LLM-as-Judge auto-rater for final answer evaluation. Researchers can submit their final answers to us for a quick accuracy evaluation. The auto-rater takes about 1 min 30 s to evaluate all 111 questions and costs around 0.8 USD.

Comment

After reading your responses, I am still confused about the dataset/benchmark itself. For example, one of the questions in the benchmark is "Using the CPI calculator from the U.S. Bureau of Labor Statistics, what is the inflation-adjusted equivalent of $1,234 from February 1985 in July 2022?" How does the LLM-as-judge scoring work (do you only give a reward at the final step, or at each step), and why do questions of this kind matter? In other words, how do you verify that the LLM-as-judge works, and what is your scoring method?

Comment

Thank you for your thoughtful follow-up.

1. How does the auto-rater work?

Our LLM-as-Judge auto-rater evaluates only the final answer, which is deterministic and unambiguous. The auto-rater is given the question, gold answer, and agent response, and determines whether the agent response unambiguously entails the gold answer. We evaluate the auto-rater's performance against our manually assigned labels by measuring how often the auto-rater label and our manual label match. More details are in the common response.

2. Do we give rewards for each step?

Our auto-rater only rewards correct final answers. Evaluating intermediate steps is important but nontrivial. As we discuss in the paper (§6), trajectory evaluation involves multiple dimensions, including interpretability, credibility, and planning quality, each of which demands its own carefully designed rubric. We view this as an open and important research problem, complementary to ours. Our primary goal with BEARCUBS is to establish a challenging and real-web-based testbed to examine whether computer-using agents are truly capable of following instructions and grounding answers in web content, as some companies claim. We hope our benchmark can serve as both an evaluation tool and a starting point for future work on trajectory-level evaluation.

3. Why do questions like the CPI calculator one matter?

The CPI calculator task evaluates whether an agent can follow instructions to use an online tool correctly. Specifically, an agent needs to navigate to a government site, fill in structured fields, and extract a result. This kind of task is representative of many real-world use cases (e.g., filling out forms, querying online databases, booking tickets). If agents are to assist users with everyday tasks on the web, they must succeed at this kind of interaction. Our benchmark reveals that even with the answer source specified, agents often fail whereas humans succeed. This highlights a key capability gap.

We hope this clarifies your question, and we're happy to provide further details if helpful.

Comment

I see your point, and I agree with the idea of using an LLM as a judge. However, you mentioned that the questions are meant to reflect real-world use cases, and I struggle to see the practical relevance of something like: "Using the CPI calculator from the U.S. Bureau of Labor Statistics, what is the inflation-adjusted equivalent of $1,234 from February 1985 in July 2022?"

That doesn’t seem like something most people would encounter in daily life.

How do you select these questions? Is there a cross-check or validation process involved? I reviewed the question list, and it feels like many of them are just included to be difficult rather than truly representative of everyday situations.

Comment

We appreciate your follow-up and provide our responses below.

BearCubs is designed to test real-world web interactions. While not every question will feel familiar to every user, our goal is to evaluate capabilities that generalize across realistic web-based tasks.

1. Grounding in the real-world web: Unlike synthetic benchmarks, BearCubs requires agents to operate on the live web, where content is multimodal and structures vary unpredictably.

2. Essential real-world interactions: The BearCubs tasks mirror common user activities like retrieving financial records, filling in structured forms, interacting with embedded tools, or interpreting multimodal content. The CPI calculator question, for example, while perhaps niche in its exact form, tests a set of skills that are broadly applicable to real-world agent utility.

All questions are manually created and curated. Sections 2, 3, and Appendix B outline our criteria. We ensure that each question (1) has a verifiable web-based answer, and (2) requires meaningful interaction beyond static lookup. We also filter out questions prone to shortcuts. For example, Figure 2 shows a case where an agent is supposed to watch a video and find the answer. However, OpenAI Deep Research, despite limited multimodal capabilities, found a text-based workaround, which forced us to remove the question.

We appreciate your feedback. It helps us clarify both the purpose and the scope of our benchmark.

Comment

Thank you for the response. However, it still remains unclear how the question set was initially constructed and what systematic process—if any—was followed to ensure the benchmark is both representative and well-calibrated in terms of difficulty and realism.

While you mention manual curation and criteria in Sections 2, 3, and Appendix B, these sections do not clearly describe:

  • The source of the questions – Were they derived from actual user queries, brainstormed by authors, or adapted from existing benchmarks?

  • Selection process – How were questions chosen or rejected beyond ad-hoc examples like the one in Figure 2? Was there any inter-annotator agreement or external validation? Or did you simply select the questions that GPT Deep Research cannot answer?

  • Representativeness – What makes these questions reflective of "real-world" web agent usage? Many questions appear to be constructed to be difficult rather than typical.

Without a more transparent and reproducible selection methodology, it's hard to assess the reliability and generalizability of the benchmark.

And I agree with reviewer ivpU. It seems that the two goals you are pursuing are not aligned. If your aim is to evaluate the general capabilities of a web agent, then including hard questions makes sense. However, the size of your benchmark is too small to support strong claims in that direction. On the other hand, if your goal is to reflect real-world scenarios, then the current benchmark does not achieve that, as many of the questions are not representative of everyday situations; they are simply difficult.

Comment

Thank you both for the thoughtful and constructive feedback. We respond to the key concerns below.

On Realism

We appreciate the discussion around the intended scope of BearCubs. To clarify, when we use the term “realism” in our responses to the reviewers, we are not claiming that the benchmark reflects the distribution of real-world user queries. Instead, the term is used to summarize the points raised by the reviewers. When we describe BearCubs as targeting “real-world web interactions,” we refer specifically to the following:

(1) Agents must operate on the live web, with all its structural and multimodal unpredictability, rather than a simulated environment or memorized corpus.

(2) The types of interactions and skills required (e.g., navigating nested content, interpreting multimodal inputs, interacting with dynamic web elements) are closely aligned with those needed to perform real-world tasks. For example, the abilities tested in our CPI question are directly applicable to tasks like filling out forms, querying online databases, or booking tickets. These represent a tail distribution of user goals that, while less common, are critical and aspirational for future agents, as they demand strong generalization capabilities.

The overall goal of BearCubs is to make the benchmark intentionally hard yet solvable, while preserving naturalistic web interaction flows.

On Dataset Construction and Representativeness

Lines 117–122 describe our data sourcing. The majority of the questions (65) were written by the authors, with the remainder authored by freelancers on Upwork. We do not derive questions from existing benchmarks or from actual user logs. We will clarify this in the paper.

The inclusion criteria, detailed in Appendix B, are as follows:

(1) Each question must provide minimal but sufficient context to unambiguously lead to a correct answer.

(2) The answer must be concise, unique, and easily verifiable.

(3) For multimodal questions, the answer must not be solvable using text-only methods (e.g., Deep Research).

(4) All answers must be publicly accessible without login or paywall restrictions.

We used Deep Research as an ad hoc filtering tool to exclude multimodal questions that could be answered via unintended shortcuts. For example, if a question is video-based, we want to ensure that agents answering it correctly genuinely demonstrate multimodal capabilities.

Concerning inter-annotator agreement, each question from the freelancers is checked by one author. All questions are further checked by at least two additional authors to ensure the criteria are met (Lines 478-479 and Figure 3). This review process ensures that each question is clear, well-formed, and aligned with the benchmark’s goals.

On Dataset Size

We acknowledge that BearCubs is relatively small. However, compact benchmarks are not unusual in high-impact evaluations. For instance, AIME includes only 30 math questions yet is widely used by OpenAI, DeepSeek, and others. CodeElo contains 387 samples, and Proof or Bluff evaluates using just 6 USAMO problems.

The difficulty of our dataset is reflected in the low performance of current agents (19.6% average accuracy). Given the vast open-ended search space of the live web, it is unlikely that agents can consistently succeed by chance. Thus, even with a modest number of questions, BearCubs yields a meaningful and reliable signal of model limitations.

Scraping agents without API access imposes a non-trivial workload. Hence, we re-ran DeepSeek with and without Google Search three additional times. The results show low variance, further supporting the benchmark's stability and robustness.

Comment

Thank you for participating in the discussion, and thank you also for responding to my review below.

If I might interject here, I think you're trying to have it both ways. Here, you're saying "BearCubs is designed to test real-world web interactions", whereas in the other response you say "BEARCUBS aims to evaluate agents holistically". Those aims are necessarily in contrast: like the difference between an end-to-end test and a unit test for software, an evaluation of a specific set of capabilities is necessarily different from an overall evaluation of an agent. Without putting words in the mouth of the other reviewer, I think what we're saying here is that it's ok to have questions that are perhaps a bit convoluted, if the goal is coverage of the different capabilities and interactions that an agent has to perform on the web. But if the goal is realism, then it's a different story.

Official Review (Rating: 6)

This paper proposes BEARCUBS, which assesses web agents' capabilities in real-world, multimodal web interactions. It comprises 111 information-seeking questions requiring agents to interact with live web content. The questions were written by the researchers and crowdsourced workers. Each question, its viable trajectory, and the visited links were verified by at least two authors to ensure it cannot be answered by text-based workarounds. The authors state that BEARCUBS will be a growing and evolving dataset, with periodic updates to replace invalid or contaminated questions.

Experiments on the new benchmark show that SOTA open-domain agents significantly underperform human users. The main issues are inadequate exploration, weak multimodal capabilities, and difficulties in source selection.

Reasons to Accept

  1. It presents a focused development of a challenging benchmark for information-seeking agents.
  2. The authors have invested significant effort in ensuring the quality and reliability of the BEARCUBS dataset.
  3. SOTA agent models fail significantly on this benchmark, exposing their limitations.

Reasons to Reject

  1. The dataset curation relies heavily on manual efforts, making it difficult to scale up the benchmark or automatically maintain and update the dataset over time.

  2. It is unclear how the authors and workers devised the questions; upon examining the questions, it appears that the workers already knew the answers beforehand, resulting in tasks that differ notably from typical information-seeking questions posed by real-world users.

Comment

We thank Reviewer KfhM for the thoughtful feedback and for acknowledging our effort in creating a high-quality benchmark.

Weakness #1: Manual evaluation and maintenance scalability

We address this in the common response Nr. 1. Manual evaluation is essential for gaining deep insights into agent behavior, particularly through tracking and analyzing agent trajectories. For those focused solely on final answer accuracy, we developed a prompting-based auto-rater, which offers a rapid and efficient way to assess agent performance.

Weakness #2: Realism of tasks

We address this concern in the common response Nr. 3.

Comment

I have read the author's responses and I'll keep my original score.

Comment

We thank the reviewers for their constructive feedback. Several key points were raised by multiple reviewers, which we address below.

1. Manual evaluation and maintenance scalability.

In BearCubs, our manual effort is broken down into three parts:

(1) manual question creation and curation: This is to ensure the high quality of our dataset. It is also a common practice adopted in other works such as BrowseComp and Frames.

(2) manual agent scraping: Interacting with agents without API access is challenging. Selenium-based automation proved unreliable, so we opted for manual interaction, which allowed us to better observe agent behavior and gain insights into their successes and failures.

(3) manual answer evaluation: We evaluated agent answers during manual scraping. This process can be automated. To support this, we developed a prompting-based LLM-as-Judge auto-rater that efficiently estimates answer correctness when only the final answer matters. Given a question, gold answer, and model response, the auto-rater assigns one of four labels: correct, wrong, no answer (stall/loop), or no direct answer (uncertainty/abstention). We assess its accuracy against human labels using both four-way and binary classification across five agents. Running on all 111 questions, the auto-rater with gpt-4o-2024-11-20 takes about 1 min 30 s and costs around 0.8 USD. Below are the accuracy results; a minimal sketch of the auto-rater follows the table.

                          4-way    Binary
Anthropic Computer Use    99.1%    100%
Proxy                     99.1%    100%
Grok                      99.1%    99.1%
OpenAI Operator           97.3%    98.2%
OpenAI Deep Research      96.4%    96.4%
  • We ran the evaluation on Anthropic and Proxy three times and obtained identical results each time.

  • The binary results are derived by collapsing the four-way classification outcomes.
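The sketch below assumes an OpenAI-compatible client and the gpt-4o-2024-11-20 judge mentioned above; the judging prompt and label parsing are illustrative assumptions rather than our released implementation.

```python
# Minimal sketch of the prompting-based LLM-as-Judge auto-rater described
# above. The judging prompt and parsing are illustrative assumptions, not
# the released implementation.
from openai import OpenAI

LABELS = ["correct", "wrong", "no answer", "no direct answer"]

JUDGE_PROMPT = """You are grading a web agent's answer.
Question: {question}
Gold answer: {gold}
Agent response: {response}

Choose exactly one label:
- correct: the response unambiguously entails the gold answer
- wrong: the response gives a different answer
- no answer: the agent stalled or looped without answering
- no direct answer: the agent abstained or expressed uncertainty

Reply with the label only."""

client = OpenAI()

def auto_rate(question: str, gold: str, response: str) -> str:
    """Return one of the four labels for a single question."""
    completion = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(
                       question=question, gold=gold, response=response)}],
    )
    label = completion.choices[0].message.content.strip().lower()
    # Fall back to the abstention label if the judge replies off-format.
    return label if label in LABELS else "no direct answer"

def to_binary(label: str) -> bool:
    """Collapse the four-way label into correct / not correct."""
    return label == "correct"
```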

2. Size of the benchmark.

Constructing high-quality questions is challenging. Our experience with freelancers showed that, after a freelancer had written just a few accepted questions, the quality dropped significantly. Besides that, OpenAI's Deep Research often found unintended search shortcuts for multimodal questions, forcing us to discard many questions.

Small benchmarks are not rare. The widely used AIME benchmark, for instance, has only 30 questions but is regularly used for evaluation and model improvement (e.g., OpenAI and DeepSeek). Other than AIME, CodeElo has 387 questions and the Proof or Bluff paper only uses 6 USAMO questions for testing.

Despite its size, BEARCUBS is still meaningful. First, the search space of BearCubs questions is large (i.e., the whole real-world web), unlike multiple-choice questions. Hence, it is difficult for agents to guess the correct answer, which prevents high accuracy variance. Second, solving even one BEARCUBS question requires a complex set of skills, making each example valuable for measuring performance progression. To verify the first point, we ran DeepSeek R1 with and without Serper three additional times, yielding the following accuracies with low standard deviation:

                          1st run   2nd run   3rd run   4th run   Std. dev.
DeepSeek R1 w/o Serper    7.2%      5.4%      5.4%      5.4%      0.8%
DeepSeek R1 w/ Serper     1.8%      0.9%      2.7%      2.7%      0.7%
  • The first run was conducted two months ago via the Fireworks AI API. The new runs were done via OpenRouter.
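As a sanity check, the standard deviations in the table above can be recomputed from the per-run accuracies; the snippet below assumes a population standard deviation over the four runs, which reproduces the reported values.

```python
# Reproduce the reported standard deviations from the per-run accuracies,
# assuming a population standard deviation over the four runs.
import statistics

runs = {
    "DeepSeek R1 w/o Serper": [7.2, 5.4, 5.4, 5.4],
    "DeepSeek R1 w/ Serper": [1.8, 0.9, 2.7, 2.7],
}

for name, accs in runs.items():
    print(f"{name}: std = {statistics.pstdev(accs):.1f}%")
# -> 0.8% and 0.7%, matching the table above
```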

3. Realism of the tasks

Our question creation process typically begins with browsing the web to uncover information. The creator then formulates a question based on that information and verifies that it cannot be easily answered through simple search queries. This is a natural workflow in the absence of real user query logs.

BEARCUBS may differ from typical user queries and, in some cases, be easier since it often specifies the answer source. Yet agents still struggle given this advantage. This highlights BEARCUBS as a valuable test of an agent’s ability to locate and reason with grounded information, serving as preparation for real-world scenarios where identifying the correct source is the critical first step.

Comment

Dear reviewers,

Thank you once again for your thoughtful and constructive reviews of the submitted paper. The authors have carefully considered your feedback and have now submitted their responses for clarification and further discussion.

We kindly invite you to take a moment to review the authors’ replies at your earliest convenience if you haven't done it yet. Engaging in this discussion phase not only helps ensure clarity and rigor in the final work but also fosters a collaborative review process that strengthens the quality and impact of research in our community.

I greatly appreciate your continued support in upholding the standards of peer review.

Comment

Thank you to the reviewers for the thoughtful engagement. We briefly reiterate two key clarifications in case they were missed earlier:

  1. On query representativeness: Our benchmark questions are not intended to mimic real user queries but to probe core web interaction skills, such as form filling, tool usage, or multimedia interpretation, that are essential for solving real-world tasks.

  2. On benchmark size: Although small in size, BearCubs shows low performance variance across runs (cf. DeepSeek R1 in the common response), and agents generally achieve low accuracy. This underscores the benchmark’s reliability and highlights its value for assessing agent capabilities.

We hope this clarifies our intentions, and we are happy to address any further questions.

Final Decision

The proposed method for adversarially generating web navigation tasks is well-founded and presents a meaningful challenge to current state-of-the-art models. The authors’ commitment to continuously updating the benchmark further enhances its long-term value and relevance to the web agent research community. As such, this benchmark represents a valuable contribution to the field of web agent evaluation. While the authors note that small benchmark sizes are not uncommon, providing statistically significant testing would strengthen this point.