PaperHub
COLM 2025 · Poster · 4 reviewers
Rating: 6.3/10 (scores: 6, 7, 6, 6; min 6, max 7, std 0.4)
Confidence: 4.3

Inducing Programmatic Skills for Agentic Tasks

OpenReview · PDF
Submitted: 2025-03-20 · Updated: 2025-08-26
TL;DR

We propose ASI, Agent Skill Induction, which induces and applies skill programs from web navigation experiences without supervision, yielding improved correctness and efficiency.

Abstract

Keywords
agent, skill learning, web navigation, scalability, generalization

Reviews and Discussion

Review (Rating: 6)

The authors are interested in LLM-based agents. Broadly, we would like agents to improve as they interact with their world. Rather than update LLM parameters (expensive, slow, data-hungry), it's common to prompt LLMs to describe task-specific skills, then include these as actions, function calls, tools, etc. in the LLM prompt later on.

In this work, the authors propose explicitly storing these task-specific skills as executable programs (in the form of Python functions) rather than textual descriptions of skills. The advantages of this approach are twofold: first, program skills can be verified by executing the new function/skill and then verifying the outcome; second, program skills often reduce the number of steps taken by an LLM because the LLM runs multiple primitive actions (click, fill, etc) in one "function call".
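For example (my own sketch to illustrate, using the paper's click/fill primitives; the function name and element IDs are mine), such a program skill might look like:

def search_product(query: str):
    """An induced program skill (illustrative sketch; assumes the
    environment primitives click/fill are in scope)."""
    click("search_box")        # primitive action 1
    fill("search_box", query)  # primitive action 2
    click("search_button")     # primitive action 3; one skill call covers three LLM steps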

The authors evaluate their method (ASI) and compare it to another adaptive agent that uses textual skills (AWM) on WebArena, on some new long-horizon tasks, and on some of the real websites associated with the sites in WebArena. ASI outperforms AWM when using Claude 3.5 Sonnet (not sure if 3.5 or 3.6 (new)).

The authors do some analysis to understand why ASI outperforms AWM and conclude that ASI induces higher-quality skills and that verifying skills improves the end success rate.

Quality

The paper is pretty well-written, without typos or major grammatical errors.

However, the methodology is quite complex and the different components of the method are not ablated at all.

The most straightforward ablations might be:

  1. Remove episode cleaning and evaluate on a subset of WebArena.
  2. Remove skill verification and evaluate on a subset of WebArena.

Furthermore, there is no mention of statistical confidence or error at all. As I understand it, an adaptive agent will score differently based on which tasks it completes first. Therefore, I would want at least 2 runs of the agent on different task orderings in order to see any variation in performance. I understand that agent evaluation is expensive, but task order is a new confounder that doesn't seem to be mentioned at all.

Furthermore, it would be great to do a statistical test of some kind that ASI is significantly better than AWM. You could use bootstrapped confidence intervals. Is 40.4 actually a meaningful improvement over 36.3 with only 812 examples?
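A paired bootstrap over per-task outcomes would be cheap to run; here is a minimal sketch of what I have in mind (my own code, assuming per-task binary success vectors):

import numpy as np

def bootstrap_diff_ci(asi_outcomes, awm_outcomes, n_boot=10_000, seed=0):
    """95% CI for the ASI-AWM success-rate difference, resampling the
    812 tasks with replacement (paired bootstrap sketch)."""
    rng = np.random.default_rng(seed)
    asi = np.asarray(asi_outcomes)  # e.g., 812 zeros/ones
    awm = np.asarray(awm_outcomes)
    idx = rng.integers(0, len(asi), size=(n_boot, len(asi)))  # resample task indices
    diffs = asi[idx].mean(axis=1) - awm[idx].mean(axis=1)
    return np.percentile(diffs, [2.5, 97.5])  # CI excluding 0 => significant at 5%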

Clarity

I find the paper hard to understand. There are many different concepts being thrown around (tasks, skills, programs, text-skills, actions, steps, functions, skill actions, etc). As best I can understand, here is what they each mean:

  • task: high-level goal that a user or agent should try to accomplish, often described in natural language.
  • action: a low-level primitive that an agent can use to interact with the environment. This is akin to actions in standard RL-speak.
  • step: an action in an episode (I think).
  • skill: some fixed composition of actions. These can be described with natural language (AWM) or in code (ASI). While skills can be added to the agent's action space, they are actually sequences of the original actions (and other skills?).
  • program-skill: A skill that is represented with code.
  • text-skill: A skill that is represented with language.
  • function: ??? I think this is a program-skill?

Lines 139-140 say: "Given a clean input episode e, we now prompt the induction module I to produce one or more program functions to represent reusable skills D = {d} as executable programs."

What is a program function? What is an executable program?

Beyond the vocabulary, the methodology is not very clear.

This entire paragraph (lines 132-138) is very hard for me to understand.

We first clean the input episodes to ensure the induction quality. We remove all the steps that cause execution errors such as invalid argument format, to keep these invalid actions from distracting agent predictions. Furthermore, noticing the long and possibly redundant thought process generated by agents along with each action, we simplify each thought text paragraph into a short one-sentence description (e.g., “Clicked the directions button to access the route planning feature”) using LM, effectively reducing the thought content from 87.9 to 13.4 tokens per step.

Who cleans the input episodes? Are they manually cleaned? Does an LLM clean them? How do you know which steps (are these actions?) caused errors? Why do you need to summarize the thoughts?

Originality

As far as I know, the idea of inducing new program-based tools for web agents was original at the time of submission. Since then, there are similar works that are available as preprints (https://arxiv.org/pdf/2504.07079 is one).

Significance

It's a good idea. Tool use is clearly a powerful paradigm in practice (DeepResearch, o3, Claude Code). Adding additional tools to an LLM's toolkit will likely enable self-improving models.

Reasons to Accept

  • Original idea at time of submission
  • Good idea, likely to be incorporated and built upon by many others.

Reasons to Reject

  • Lack of methodology ablations
  • Lack of statistical significance
  • Lack of methodological clarity
Comment

Thank you for recognizing the importance and novelty of our idea, and for the detailed and constructive suggestions!

Quality: Ablating Skill Verification and Episode Cleaning in ASI

In Section 3.3 (Table 4), we have already ablated the critical skill-verification component: unverified induction yields a 6.4% lower success rate (SR), showing the benefit of skill verification.

We also appreciate your suggestion of ablating the data-processing (episode cleaning) step. We additionally experiment with AWM and ASI without episode cleaning, which obtain SRs of 35.5 (AWM) and 38.5 (ASI). Overall, episode cleaning improves AWM and ASI by 0.8 and 1.9 points. Nonetheless, episode cleaning is not necessary for AWM/ASI to work, as they still substantially outperform the vanilla baseline by 2.8 and 5.8 points; cleaning also brings bonus efficiency benefits, since cleaned episodes contain fewer tokens.

Quality: Statistical Analysis of ASI-AWM

Thank you for your suggestion. We conduct a paired t-test between the AWM and ASI results, and obtain a t-statistic of -2.21 and a p-value of 0.027; both |t| > 2 and p < 0.05 indicate that ASI performs statistically significantly better than AWM. We would like to note that WebArena enforces a particular task order so that earlier tasks do not interfere with later ones, thus we cannot freely change the example ordering, and we conduct all experiments following the original WebArena order. Further, AWM's ablation on the Mind2Web dataset shows that inducing reusable skills can be robust to example ordering, which could extrapolate to our program-skill setup. Nonetheless, we agree that example ordering is an interesting angle for analyzing ASI in future work.
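Concretely, the test amounts to the following (a sketch; the outcome vectors are our per-task records, and the variable names here are placeholders):

from scipy import stats

# awm_outcomes, asi_outcomes: per-task binary success indicators (0/1)
# for the 812 WebArena tasks, in the original task order (placeholder names).
t_stat, p_value = stats.ttest_rel(awm_outcomes, asi_outcomes)
# on our results: t = -2.21, p = 0.027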

Clarity: concepts and descriptions

Thank you for your detailed read and feedback!

For the concepts, thank you for helping clarify. Yes, your understanding was largely correct; here are our updated definitions:

  • task: a natural-language instruction that the agent attempts to accomplish.
  • action: an executable program function; this can be pre-defined (e.g., the 'click' primitive) or a later-induced high-level skill (e.g., search_product).
  • step: a discrete timestep during the task-solving process; the step at time t is associated with the observed state s_t, the action taken a_t, and the state after action execution s_{t+1}.
  • skill: correct. Further, program-skills and text-skills contain the same information semantically, just presented in different formats.
  • function: this is just the general meaning of a program function, not a specialized term.

Lines 139-140: A program function here means a function written in a Python program. We add "executable" to emphasize that programs can be executed (and thus verified), in contrast to the baseline AWM method, which uses non-executable text.

Lines 132-138: We write a script to automatically clean the input episodes. The browsergym platform returns an error message if an action does not execute correctly in the environment. The chains-of-thought are very long (87.9 tokens on average) and redundant (see the example below), which may interfere with the induction process/quality, so our script leverages an LLM to summarize them, for both quality and efficiency (reducing length by 6.6x).
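Schematically, the cleaning script works as follows (a sketch; the step fields and the summarize_with_llm helper are placeholders for our actual implementation):

def clean_episode(episode):
    """Drop failed steps and compress thoughts before skill induction (sketch)."""
    cleaned = []
    for step in episode:
        if step.error_message:  # browsergym reports an error for failed actions
            continue            # drop invalid steps so they don't distract induction
        # compress the long thought into one sentence (avg 87.9 -> 13.4 tokens)
        step.thought = summarize_with_llm(step.thought)
        cleaned.append(step)
    return cleaned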

Hope this helps clarify the terminology! We will add these clarifications in the revised paper.

Originality

The mentioned SkillWeaver is concurrent (later) work, so we do not think it affects the originality of ours. Furthermore, SkillWeaver creates APIs offline, which means it requires access to supervision signals and a costly learning process over excessive offline episodes, and it potentially suffers from offline-online task mismatches. In contrast, ASI requires no additional offline compute and runs directly online: it is supervision-free, learns efficiently from just the online agent episodes, and distributionally matches the test queries.

Comment

Thank you for the clarifications, ablations and statistical tests.

I am adjusting my score to a 6.

Review (Rating: 7)

This paper presents a novel approach to web agent skill acquisition through Agent Skill Induction (ASI), which represents skills as executable programs. The idea is sound and practically beneficial: programmatic skills allow verification, reuse, and abstraction, which the authors leverage effectively. The performance gains on WebArena are significant and consistent, and the ability to generalize across websites is an important strength.

Reasons to Accept

The paper tackles an important challenge in interactive AI systems: enabling agents to learn web-based tasks efficiently and adaptively. The results are strong, showing improvements over static and text-skill agents. The proposed approach also improves efficiency by grouping useful skills, leading to token efficiency and fewer LLM prediction steps.

Reasons to Reject

More detailed analysis and discussions of failure cases (e.g., when induced skills are incompatible with target websites) would strengthen the paper.

Questions to Authors

Does the proposed skill induction simply group (chain) existing skills?

Comment

Thank you for recognizing the importance of our tackled problem, and ASI’s effectiveness in improving agent correctness and efficiency.

R1: More analysis of ASI failure cases

In the cross-website generalization tests, we showcased one of the most common failure cases of ASI in Figure 4: a skill induced on one website is incompatible with the design of other websites. For in-domain web activities, in both the basic (Section 3) and scaled-up (Section 4) scenarios, the main bottleneck of ASI is skill-program quality, which we attempt to guarantee via the method described in Section 2.3. We will add more failure-case examples to the appendix of the revised version.

Q1: Is skill induction chaining existing skills?

Not necessarily. Although many skills simply execute existing actions in sequence, some leverage additional program primitives (e.g., if-else branches, for/while loops), such as:

def browse_category_by_navigation(menu_id_sequence: list):
    """Browse products by navigating through a sequence of menu IDs.

    This function allows navigation through a series of menu interactions.

    Args:
        menu_id_sequence (list): A list of menu IDs to interact with
            sequentially, using hover actions followed by a final click.

    Example usage:
        browse_category_by_navigation(['735', '786', '797'])
        # Navigates Home & Kitchen -> Storage & Organization -> Racks, Shelves & Drawers
    """
    for menu_id in menu_id_sequence[:-1]:
        hover(menu_id)  # Hover to expand each intermediate menu
    click(menu_id_sequence[-1])  # Click the final ID to land in the target category


def navigate_and_sort_category(category_id: str, subcategory_id: str | None = None):
    """Navigate to a product category and sort items by price.

    Args:
        category_id: ID of the main category link
        subcategory_id: Optional ID of the subcategory link, if needed

    Returns:
        None

    Examples:
        navigate_and_sort_category('1241', '1873')  # For PS4 under Video Games
        navigate_and_sort_category('1245')  # For main category only
    """
    click(category_id)  # Click the main category
    if subcategory_id:
        click(subcategory_id)  # Click the subcategory if provided
    select_option("1553", "Price")  # Sort by price ascending

We are happy to incorporate these examples into the revised paper.

Review (Rating: 6)

This paper introduces Agent Skill Induction (ASI), a method that enables web agents to autonomously acquire and apply program-based skills through interaction with web environments. The core idea is to induce reusable skills from action trajectories in the form of executable Python programs. These skills can be verified through execution, ensuring correctness by checking task completion. Experiments on the WebArena benchmark demonstrate that ASI outperforms both no-skill and text-based skill baselines by 23.5% and 11.3% in success rate, respectively, while reducing action steps by 10.7–15.3%. ASI also shows strong performance on long-horizon tasks and in cross-domain transfer settings.

Reasons to Accept

  1. Online learning and adaptation of reusable skills for lifelong improvement is an important and timely research direction for building more general and capable agents.
  2. The proposed approach is conceptually sound, and the experimental results are promising, demonstrating clear improvements in both task success and efficiency.

Reasons to Reject

  1. Positioning relative to prior work is unclear. In particular, the comparison with AWM is not sufficiently accurate or differentiated. While the paper claims AWM represents skills as non-executable text, the original AWM paper stores workflow trajectories that include executable actions in the web environment. Furthermore, AWM also includes an evaluation step conceptually similar to the verification step in ASI. Given these overlaps, the novelty and unique contributions of ASI need to be more clearly articulated.
  2. The scaled-up browsing (S3) and cross-domain transfer (S4) experiments appear to introduce new evaluation setups, but lack sufficient detail. Specifically, the paper should clarify how tasks are selected, how performance is measured, and what controls are used, in order to improve the soundness and reproducibility of the evaluation.
  3. Minor issue: The paper is missing citations to relevant prior work that also leverages code as a representation for learnable agent skills, including:
  • Wang et al., Executable Code Actions Elicit Better LLM Agents, ICML 2024.
  • Singh et al., ProgPrompt: Generating Situated Robot Task Plans Using Large Language Models, ICRA 2023.
Comment

Thank you for your recognition of the importance of our method, as well as its effectiveness in accuracy and efficiency.

R1: Position relative to prior work

Compared to AWM, the main difference of ASI is representing skills as programs, thus (i) integrating skills into the agent's action space instead of its memory, and (ii) enabling execution-based verification to improve the quality of induced skills. According to our results in Table 1, this design change brings a further significant success-rate increase of 11.3%; it also boosts agent efficiency by up to 85.2% compared to AWM in long-horizon web activities (Tables 1, 4, and 6).
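To illustrate difference (i) with a schematic sketch (not our exact code; the names are illustrative): a verified program skill joins the same callable registry as the primitive actions, rather than being appended to memory as text.

def click(elem_id: str): ...   # stand-in for the browsergym primitive

ACTIONS = {"click": click}     # ...plus fill, hover, select_option, etc.

def register_skill(fn):
    ACTIONS[fn.__name__] = fn  # the agent can now emit this skill as an action

def goto_wishlist():           # a toy induced skill for illustration
    click("account_menu")
    click("wishlist_link")

register_skill(goto_wishlist)  # callable like a primitive from now on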

R2: Details about Scaled-Up and Cross-Domain Settings

  1. How tasks are selected: We create tasks for the two scenarios by extending WebArena examples. For scaled-up, we take "search for coffee product" from WebArena and expand it to "search for coffee, mug, and coffee machine" (as exemplified in Table 8), naturally searching for multiple relevant items. For cross-domain, we simply change the search website from the sandboxed OneStopMarket to Target (see all website mappings in Table 5). We create 10 examples per website per scenario, and provide all examples in Appendix B.1 and B.2.
  2. How performance is measured: As stated in lines 268--270, we manually create evaluation checkpoints for each task and report the checkpoint-average success rate. We also provide detailed checkpoint lists for all examples in Appendix B.1 and B.2.
  3. What controls are used: As partially mentioned in point 1 above, we control the task (scaling up the processed items) and the domain (switching to an in-domain, same-purpose website), to study the scaled-up and cross-domain scenarios with minimal task shifts.

We could not include all details in the main content due to page limits, but we are happy to elaborate on them in the revised paper!

R3: Add citation

Thank you for the paper suggestions. We will add and discuss them in the related work.

Comment

Thank you for the response. As shown in Figure 2, the improved efficiency comes from grouping multiple atomic actions into a single combined action (which only counts as one step). However, the actual number of environment interactions remains unchanged. In this context, measuring input/output token counts from the LLM might provide a clearer and more accurate metric for efficiency than simply counting steps.

Additionally, if I understand correctly, during the execution of a learned skill, the agent does not receive intermediate observations between atomic actions, nor can it adapt the pre-recorded action sequence. This may pose issues when there are slight variations in the environment, where a previously recorded action may no longer be valid. How does the proposed method handle such situations?

Comment

1. Measuring efficiency by number of input/output tokens

Thank you for your suggestion! We further calculate the number of input/output tokens of the Vanilla baseline and ASI agents in the table below. ASI reduces input tokens by 8.0% and output tokens by 12.9%, showing efficiency improvements consistent with the step-based measures. We will include these token-based measures in the revised paper.

| Method  | # Input Tokens | # Output Tokens |
| ------- | -------------- | --------------- |
| Vanilla | 301190.2       | 761.4           |
| ASI     | 277208.1       | 663.1           |

2. How agents handle varying intermediate states during skill execution

Our current ASI method handles varying intermediate states in a post-hoc manner: executing a skill that is incompatible with a changed environment incurs an error, based on which the agent refines or induces new skills to adapt to the new environment. Meanwhile, we have explored inserting an observe() function into skill implementations to allow intermediate state access, but this function often requires extra LLM calls without sufficient stability, dragging down the overall success and efficiency scores; we are happy to add a discussion of this observe() function in the revised paper. Nonetheless, we agree that it is important to continue improving ASI robustness in future work!
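For reference, the explored variant looks roughly like the following (a hypothetical sketch; observe() and the element names are illustrative, and the environment primitives are assumed in scope as in our other examples):

def add_to_cart(product_id: str):
    click(product_id)
    page_text = observe()              # extra LLM-mediated read of the new page
    if "Select options" in page_text:  # adapt when the page variant differs
        click("Customize")
    click("Add to Cart")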

Comment

Thanks for the reply. It would be beneficial to add the discussion of point 2 to the paper as a potential limitation of ASI. Since my concerns are addressed, I will increase my score.

Review (Rating: 6)

This paper introduces a new approach to online skill extraction by combining episode filtering, skill extraction, and skill verification. Given a natural-language query, the agent first navigates a website and collects an episode. This episode is evaluated using an LLM agent and discarded if deemed unsuccessful. If the episode is a success, a separate LLM-based induction model extracts skills by combining and reusing primitive actions. These new skills are then tested for whether the agent can reuse or adapt to them; this is done by using the first part of the skill-action trajectory as a prefix and prompting the agent to complete the rest. If the new episode is a success, the skills are added to the action space, rather than into memory as text-based clues. Experimental results on WebArena show improvement compared to the previous AWM agent. Using verification and programs gives better performance than unverified text or programs. Later, the authors scale the benchmark up to longer-horizon tasks where multiple skills may be needed, and to real-world transfer between similar-domain websites. ASI still performs favorably, and by updating existing skills on unseen websites, it reduces the number of steps.

Reasons to Accept

  1. The paper introduces a practical approach to online skill extraction without the need for manual human labor. In particular, skill extraction combined with skill evaluation leads to practical improvements.

  2. Experiments are done on existing benchmarks as well as more realistic setups.

  3. The paper is generally easy to follow.

Reasons to Reject

  1. You mention that agents can spot the incompatibility and don't use those skills. It is not clear to me why. Given that understanding incompatibility might require knowing the environment transitions, how does the agent predict that the skill would generate an incompatible next state? Is the suggestion that LLMs are trained on environment transitions?

  2. Skill acquisition, in the form introduced in the paper, assumes deterministic environments, even for the same website. This is also supported by the fact that ASI uses a single episode to extract skills (as opposed to learning from multiple episodes) which assumes generalization across episodes. It is not clear how well this would generalize to any change in the underlying dynamics or any stochasticity. For example, a popup might randomly appear or web elements might include randomly generated field values.

  • As the agent continues to solve tasks in WebArena, in how many cases does it reuse a skill vs. rediscover a new one?
  • How much does the performance change when you shuffle the order of test tasks?
  3. Does the induction module need the whole episode, or only a subset of the information in it? Given that each action is accompanied by a reasoning text, would the reasoning texts be enough to combine subsequent actions into a skill?
  • Is there any case where the induction module generates a skill beyond just chaining subsequent actions?
  4. Skill verification tests an agent's ability to generate primitive actions given a prefix of a mixed primitive + skill-action subtrajectory. While this tests whether the agent can adapt to skill-action subtrajectories, it doesn't necessarily suggest that the agent will successfully reuse these skills. Can you address this drawback of the skill verification step?
  • If the agent is robust to some forms of perturbation, such as an incorrect skill generated very early on, this could potentially lead to the addition of incorrect skills into the action space. Can you explain how accurate the induction and LM-based evaluator models are?
  5. As a baseline, rather than extracting skills, what happens if you use previously seen trajectories as prompts? Can the agent copy the corresponding actions as well as with skill extraction?
Comment

Thank you for your recognition of the effectiveness of our ASI method, comprehensive experiments, and clarity in paper writing!

R1: How Agents Learn Not to Use Incompatible Skills

Agents often spot the incompatibility of a skill from the discrepancy between the web design (e.g., the site opens a filtering sidebar) and the skill implementation (e.g., it expects a dropdown menu for sorting options), as we illustrate in Figure 4. This can be more pronounced when the agent has been trained on the website. Even if the agent cannot spot the incompatibility beforehand, it will receive execution errors when trying to use the skill; with these error messages included in context, the agent is less likely to use the incompatible skill in later steps.

R2: ASI assumes deterministic environments

We start with single-episode skill induction to demonstrate the effectiveness of the programmatic format, and it took careful design to make even the single-episode scenario work. We agree that enabling multi-episode skill induction, or domain randomization, could be an interesting follow-up to enable greater reusability. Nonetheless, our current method allows agents to refer to existing skills when inducing new ones. For the pop-up case, our agent can write an extended version of a previous skill with extra lines closing the pop-up window (exemplified below), making it reusable and generalizable across episodes.

def search_product(name: str):  # signature reconstructed; the original line was garbled
    if has_popup_window(): click("Close")  # added: dismiss a randomly appearing pop-up
    fill("Search textbox", name)
    click("Search button")

R2.1: Rate of reusing and inducing skills

The numbers of examples that (i) attempt to create a new skill, (ii) successfully create a new skill, and (iii) reuse a skill are shown in the table below. Overall, the agent reuses at least one skill in 42.5% of cases, quite frequently across websites, while these skills are created from only 7.6% of cases, demonstrating the high scalability of our skill-learning approach.

| Domain   | Attempted Creation | Successful Creation | Reuse Skill | # Examples |
| -------- | ------------------ | ------------------- | ----------- | ---------- |
| shopping | 21                 | 8                   | 96          | 180        |
| admin    | 38                 | 15                  | 108         | 156        |
| reddit   | 24                 | 11                  | 14          | 105        |
| map      | 13                 | 10                  | 80          | 109        |
| gitlab   | 25                 | 11                  | 10          | 175        |

R2.2: Shuffling example ordering

We did not compare different example orderings, because the WebArena benchmark implementation enforces a particular order so that earlier tasks do not contaminate the environment for later tasks; we therefore cannot freely change the example ordering, and we conduct all experiments following the original WebArena order. We do agree that example ordering is an interesting angle for analyzing ASI in future work.

R3: Induction input

We provide the reasoning chain-of-thought and the action associated with each step of the entire episode as the skill-induction input. We find that reasoning chains-of-thought are a sufficient augmentation to actions for inducing high-quality skills, whereas adding further state information (e.g., the webpage accessibility tree) is long and noisy and often hurts end-task success rate. We provide the whole episode for induction because identifying sub-episodes first and then applying skill induction to them would take extra compute, is not practically simple, and often hurt performance in our preliminary studies.

R4: Skill verification

Our goal is to test whether agents can successfully use induced skills. To achieve this, the agent must (i) reuse the skills, which we encourage via the prefix trajectory (with steps rewritten using the new skills), and (ii) use them correctly, which we check by evaluating the agent's task re-solving trajectory. So our skill verification does tell whether agents can (i) reuse the skill and (ii) do so correctly to solve the task. Note that we do not add an induced skill to the agent's action space if either (i) or (ii) fails, to avoid low-quality skills affecting later task-solving.
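Schematically, the verification loop is (a sketch; the helper names are placeholders for our implementation):

def verify_and_add(task, episode, new_skills, agent, evaluator, action_space):
    # (i) encourage reuse: rewrite the episode prefix using the new skills
    prefix = rewrite_with_skills(episode, new_skills)
    # (ii) check correct use: the agent re-solves the task from that prefix
    trajectory = agent.complete(task, prefix=prefix)
    if evaluator.is_success(task, trajectory):  # both (i) and (ii) must pass
        action_space.extend(new_skills)         # otherwise the skills are discarded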

R5: Baseline: add previous episodes directly in context

Thank you for the suggestion. We agree this is a reasonable baseline and measure its performance on WebArena: it reaches a success rate of 35.6, which is 4.8 points lower than ASI. Furthermore, this baseline agent uses an average of 6.3 steps to solve tasks, 26.0% more than ASI, because it cannot call high-level skills to produce shorter trajectories. We will report these numbers in the revised paper.

Final Decision

This paper presents Agent Skill Induction (ASI), a novel method for inducing and verifying executable programmatic skills in web agents. The approach is original, practical, and demonstrates clear improvements over baselines like AWM in success rate, efficiency, and generalization across tasks and domains. All reviewers rated the paper above the acceptance threshold, and the authors provided strong rebuttals addressing concerns around clarity, baseline comparisons, ablations, and statistical significance. Reviewers acknowledged the value of the program-based skill representation and its potential for broader agent generalization.