Agent Workflow Memory
We propose Agent Workflow Memory, which enables digital agents to induce and reuse increasingly complex workflows on the fly.
Abstract
Reviews and Discussion
This paper proposes Agent Workflow Memory (AWM), which summarizes and abstracts common workflow experiences from previous trajectories. However, it seems too similar to the Reflexion line of work. The authors should discuss their significant contributions compared with these works.
Questions for the Authors
Could you discuss the significant contributions compared with these works?
[1] Shinn, Noah, et al. "Reflexion: Language agents with verbal reinforcement learning." Advances in Neural Information Processing Systems 36 (2023): 8634-8652.
[2] Zhao, Andrew, et al. "Expel: Llm agents are experiential learners." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 17. 2024.
[3] Zheng, Longtao, et al. "Synapse: Trajectory-as-exemplar prompting with memory for computer control." arXiv preprint arXiv:2306.07863 (2023).
Claims and Evidence
The paper claims the proposed method can abstract common workflows from past experiences and apply them to guide future interactions. The experiments are related to this claim.
Methods and Evaluation Criteria
I think the proposed methods are too similar to the Reflexion line of work, such as [1-3]. The authors should discuss their significant contributions compared with these works.
[1] Shinn, Noah, et al. "Reflexion: Language agents with verbal reinforcement learning." Advances in Neural Information Processing Systems 36 (2023): 8634-8652.
[2] Zhao, Andrew, et al. "Expel: Llm agents are experiential learners." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 17. 2024.
[3] Zheng, Longtao, et al. "Synapse: Trajectory-as-exemplar prompting with memory for computer control." arXiv preprint arXiv:2306.07863 (2023).
Theoretical Claims
No theoretical claims in this paper.
Experimental Design and Analysis
The authors conduct experiments on WebArena and Mind2Web. The domains are limited.
Supplementary Material
I have viewed the supplementary material briefly.
Relation to Prior Literature
Yes, very relevant.
Missing Important References
[1] Shinn, Noah, et al. "Reflexion: Language agents with verbal reinforcement learning." Advances in Neural Information Processing Systems 36 (2023): 8634-8652.
[2] Zhao, Andrew, et al. "Expel: Llm agents are experiential learners." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 17. 2024.
Other Strengths and Weaknesses
See above.
Other Comments or Suggestions
See above.
Thank you for the feedback and for pointing out these relevant works. We list the core differences between AWM and these works below, and will add these more detailed discussions in our revised paper version!
Difference from Reflexion
- Content: AWM learns procedural knowledge about how to solve specialized, reusable tasks, i.e., it learns from correct trajectories only; Reflexion receives errors and comes up with plausible (erroneous, correct) solution pairs.
- Context window: AWM carries reusable procedural knowledge across multiple tasks experienced over time, and can therefore leverage and benefit from procedural similarity across tasks. Reflexion only maintains learned reflections within the current single-task solving session.
Difference from Expel
- Requirement on training data: Expel requires learning from training examples; AWM not only supports learning from training/offline tasks, but can also be applied directly online at test time without any training data beforehand.
- Memory granularity: Expel stores raw trajectories in memory (similar to Synapse below), whereas our AWM stores sub-trajectory-level, context-abstracted procedures, which we have shown to be more reusable and beneficial in our experiments (Table 3 and Table 4).
Difference from Synapse
- Full-trajectory vs. sub-trajectory: AWM learns reusable procedures that are often only part of the trajectory for a given task; in other words, a complex, long-horizon task usually requires concatenating multiple workflows to solve it correctly. Synapse only remembers full trajectories, which are often less reusable for other tasks (see the example in lines 147-161).
- Dependent vs. independent of example-specific context: Synapse records contexts specific to individual examples, e.g., “Flight from New York to Pittsburgh”, while AWM abstracts those variable contexts away into, e.g., “Flight from {departure} to {destination}”, so a workflow can more easily be reused for other examples, such as booking a flight from Chicago to Toronto (also see the example in lines 147-161).
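To illustrate the contrast above, the abstraction step can be sketched as a small templating pass. This is a hypothetical illustration only: AWM's actual induction is LM-based, and `abstract_step`/`instantiate_step` are invented names, not code from the paper.

```python
# Hypothetical sketch of context abstraction, not the paper's actual LM-based
# induction. Example-specific values are replaced by named slots so a recorded
# step can be reused for new task instances.

def abstract_step(step: str, bindings: dict[str, str]) -> str:
    """Replace concrete values with {variable} placeholders."""
    for var, value in bindings.items():
        step = step.replace(value, "{" + var + "}")
    return step

def instantiate_step(template: str, values: dict[str, str]) -> str:
    """Fill a workflow template with values from a new task."""
    return template.format(**values)

# Abstract a Synapse-style concrete step into an AWM-style template ...
concrete = "Flight from New York to Pittsburgh"
template = abstract_step(
    concrete, {"departure": "New York", "destination": "Pittsburgh"}
)
# ... then reuse it for a different task instance.
reused = instantiate_step(
    template, {"departure": "Chicago", "destination": "Toronto"}
)
```

The same template thus serves both the recorded example and unseen instances, which is the reusability argument made above.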
The paper introduces Agent Workflow Memory (AWM), a method designed to improve the performance of language model-based agents in long-horizon, complex tasks such as web navigation. AWM enables agents to learn reusable workflows from past task experiences and integrate them into memory for guiding future actions. Unlike conventional methods that rely on static training examples or in-context learning, AWM dynamically induces workflows that capture frequent sub-routines across tasks, allowing agents to generalize better and solve tasks more efficiently.
Update After Rebuttal
The paper presents an unclear and problematic methodology for workflow induction and is missing key benchmarks in the evaluation. Thus, I keep my score.
Questions for the Authors
Please check comments above.
Claims and Evidence
- The claim that “AWM enables agents to learn and reuse workflows, improving task success rates” lacks sufficient justification. The reported improvements could be task-specific rather than a generalizable advantage of AWM. Given that web browsing tasks inherently contain repetitive action sequences, AWM’s success may stem more from the structured nature of these tasks rather than from a fundamental ability to generalize workflow induction across diverse domains. A more rigorous evaluation on non-web tasks or across varied domains would strengthen this claim.
- AWM is fundamentally a prompt-based approach that depends heavily on the underlying base LLM. This reliance raises concerns about its adaptability to different LLM architectures. If AWM’s effectiveness is largely dictated by the capabilities of the LLM it prompts (e.g., GPT-4 vs. a weaker model), then its contribution as an independent methodological advancement is limited. Without demonstrating robustness across multiple LLMs, the method’s general applicability remains uncertain.
Methods and Evaluation Criteria
Issues in Section 2.3: "Inducing and Using Workflows"
Section 2.3 presents an unclear and problematic methodology for workflow induction.
- Ambiguity in the Definition of Workflow Induction
- Lines 138–140 state: “The workflow induction module operates by taking in E and producing a set of workflows, as …” but do not clearly define how this transformation occurs.
- The notation suggests that the induction module takes a set of experiences E as input and produces a set of workflows. However, the output structure (as described in Line 140) appears identical to the input experiences, maintaining the same elements.
- This lack of transformation raises concerns: Is the module simply repackaging past trajectories instead of deriving reusable workflows? If not, how does it extract structured workflows from raw experiences?
- Unclear Control Over Workflow Granularity
- The method does not specify how it determines the appropriate granularity of induced workflows.
- A workflow could capture a low-level skill (e.g., searching for an item in a menu) that is widely reusable across tasks, or it could represent a high-level task (e.g., booking a hotel), which is more specific and less transferable.
- If the induction process fails to regulate granularity, it risks either:
- Extracting too generic workflows that do not meaningfully aid decision-making.
- Inducing overly specific workflows that are not broadly reusable, even if stored in memory.
- The paper should clarify how the induction module balances this trade-off, potentially by leveraging hierarchical workflow structures or controlling abstraction levels during induction.
Without addressing these issues, the workflow induction process remains ill-defined and difficult to evaluate, limiting the interpretability and generalizability of the proposed method.
Theoretical Claims
There are no theoretical claims in this paper.
Experimental Design and Analysis
The overall results appear promising, but there are significant concerns regarding the generalization of AWM beyond the specific LLM used and missing key benchmarks in the evaluation.
- Dependence on Heavy Prompting and LLM Generalization
- The entire method is heavily reliant on prompting, making its generalization to other LLMs uncertain.
- The paper does not explore how AWM performs with weaker LLM architectures (e.g., LLaMA or other open-sourced models).
- A broader evaluation across different LLMs would provide stronger evidence of its robustness.
- Missing WebVoyager Benchmark
- The results focus on two web navigation tasks: WebArena and Mind2Web. However, WebVoyager, another widely used web-based agent benchmark, is not discussed or evaluated.
- Incomplete Benchmark Comparisons in Table 1
- Table 1 presents results primarily compared to BrowserGym, but misses two important web agent benchmarks: WorkArena and MiniWob.
- Both WorkArena and MiniWob are tasks where BrowserGym excels, making it unclear how AWM would compare in scenarios beyond the two selected benchmarks.
- A more comprehensive benchmark comparison is needed to properly situate AWM’s performance relative to existing methods.
Supplementary Material
I reviewed all supplementary material.
Relation to Prior Literature
This paper relates to the LLM agent, especially browser-based agent applications.
Missing Important References
This paper misses the following benchmarks:
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models, arXiv:2401.13919
- MiniWob - Humphreys, Peter C., et al. "A data-driven approach for learning to control computers." International Conference on Machine Learning. PMLR, 2022.
- WorkArena - Drouin, Alexandre, et al. "Workarena: How capable are web agents at solving common knowledge work tasks?." arXiv preprint arXiv:2403.07718 (2024).
Other Strengths and Weaknesses
A major concern with the paper is lack of clarity, as it introduces several key concepts without proper definitions or explanations. This makes it difficult to fully understand the method and its implications.
- One unclear aspect is the definition of agent memory size. It is unclear whether the memory scales dynamically or remains fixed. If it is fixed, the paper does not explain how its size was determined. More details on memory allocation and constraints would help clarify its practical feasibility.
- Another issue is the lack of clear definitions for states, observations, and actions in the agent's design. The paper does not specify whether states are represented as webpage snapshots, structured representations, or tokenized text inputs. It is also unclear how observations differ from states and how they integrate with the memory system. Furthermore, the paper does not clarify what action space the agent operates in. For example, it is unknown whether the agent uses predefined UI actions like "CLICK" and "SCROLL" or if it generates free-text commands.
- The term "base memory," mentioned in line 59 on the right, is also ambiguous. The paper does not define whether "base memory" refers to the default knowledge the agent starts with, the workflow memory that stores reusable subroutines, or a temporary buffer for task execution. It is also unclear whether there are other types of memory besides base memory and, if so, how they differ.
Other Comments or Suggestions
Typos in L115 Right, should be not W.
Claims and Evidence 1: AWM Improvement May Come from Structured Nature of Actions
First, we argue that WebArena and Mind2Web queries are realistic and human-annotated without intentionally injecting superficial structure, so we believe that experimenting on them is well-motivated. Our method depends on structures that are shared across trajectories (e.g., adding an item to a cart, searching for a term), so individual trajectories need not be repetitive. Meanwhile, datasets for other tasks, e.g., embodied/robotics ones, often have more superficially structured queries (e.g., “harvest 1 dirt” and “harvest 8 dirt” in MineDojo), so we do not fully agree that switching to other tasks would alleviate the risk of learning superficial task structures.
Claim and Evidence 2 & Experimental Design 1: Dependency on Prompting Strong LLMs
Although we experiment with strong GPT models, we do achieve very substantial improvements (over 50%) compared to using the vanilla GPT-4o model. For this reason, we would like to argue that AWM is effective rather than a “limited” “methodological advancement”. One prerequisite for AWM (and many other autonomous agent frameworks) is that the LM backbone needs to have reasonable task-solving abilities. Our main focus in this work is to establish this promising self-adaptive agent framework. It would be interesting future work to investigate the prerequisites on LM capabilities, or other aspects, for AWM to be effective.
Ambiguity in the Definition of Workflow Induction
As we described in lines 141-161, we input several examples into the GPT model and ask it to generate workflows, which differ from the input examples in two aspects: (i) they have sub-trajectory-level granularity and are thus more reusable than full-trajectory examples, and (ii) they abstract away example-specific contexts and replace them with descriptive variables (e.g., “New York” becomes “{destination city}”) to facilitate generalization across tasks.
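A minimal sketch of this induction call follows. It is hypothetical: `call_llm` stands in for any chat-completion API, and the prompt wording is illustrative, not the paper's actual prompt.

```python
# Hypothetical sketch of LM-based workflow induction; `call_llm` is a stand-in
# for any chat-completion API and the prompt text is illustrative only.

INDUCTION_PROMPT = (
    "Given the agent trajectories below, extract reusable workflows.\n"
    "1. Keep each workflow at sub-trajectory granularity: a short routine,\n"
    "   not a full task.\n"
    "2. Abstract away example-specific contexts into descriptive variables,\n"
    "   e.g., 'New York' becomes '{destination city}'.\n\n"
    "Trajectories:\n{trajectories}"
)

def induce_workflows(experiences: list[str], call_llm) -> str:
    """Ask the LM to turn raw trajectories into abstracted workflows."""
    joined = "\n\n".join(experiences)
    # Use str.replace rather than str.format so the literal {variables}
    # elsewhere in the prompt are left untouched.
    return call_llm(INDUCTION_PROMPT.replace("{trajectories}", joined))
```

Using `str.replace` for the single substitution slot avoids `str.format` raising a `KeyError` on the braces that the prompt itself asks the model to produce.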
Unclear Control over Workflow Granularity
We agree that, since we mostly rely on LMs to come up with workflows, we have limited control over the granularity the LM generates. The alternative, rule-based induction, hardly produces any workflows at all given the dynamic nature of web navigation environments. Empirically, we have carefully analyzed the quality of all induced workflows in Appendix A.3: the number of workflows induced, their coverage of test queries, their functional overlap, etc. Please find more analysis in Appendix A.3 and example workflows in Appendix A.2. Throughout this analysis, we find the workflows have reasonable granularity, as we prompted the LLM to generate sub-trajectories and abstract away example-specific contexts.
Benchmark missing: WebVoyager, MiniWob++, WorkArena
WebArena and Mind2Web are more widely used. WebVoyager may raise safety concerns as it operates on real-world websites; MiniWob++ is simplified and agent performance on it is already near saturation (over 95% success); WorkArena is application-specific and largely follows the WebArena setup. We are interested in seeing future work experiment with AWM on these more specialized benchmarks.
O1. Definition of Memory Size
In the offline setting, we learn workflows from training examples and keep the memory size fixed in the later inference process. In the online setting, the memory size is continuously growing as the agent solves more tasks and induces more workflows.
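The two settings can be sketched as follows. This is a hypothetical structure: `WorkflowMemory`, `induce`, and `judge` are illustrative names standing in for the paper's components, not its actual code.

```python
# Hypothetical sketch of the offline vs. online settings described above.
# `induce` turns trajectories into workflows; `judge` is the LM-based
# correctness evaluator. Both are stand-ins, not the paper's implementation.

class WorkflowMemory:
    def __init__(self, workflows=None):
        self.workflows = list(workflows or [])

    def add(self, new_workflows):
        self.workflows.extend(new_workflows)

def run_offline(agent, train_tasks, test_tasks, induce):
    # Offline: induce workflows once from training trajectories,
    # then keep the memory fixed during inference.
    trajectories = [agent.solve(t, memory=None) for t in train_tasks]
    memory = WorkflowMemory(induce(trajectories))
    return [agent.solve(t, memory=memory) for t in test_tasks], memory

def run_online(agent, tasks, induce, judge):
    # Online: the memory keeps growing as tasks are solved; only
    # trajectories the judge predicts as correct become workflows.
    memory = WorkflowMemory()
    results = []
    for task in tasks:
        trajectory = agent.solve(task, memory=memory)
        if judge(trajectory):
            memory.add(induce([trajectory]))
        results.append(trajectory)
    return results, memory
```

The contrast is in when `induce` runs: once up front (offline) versus after every task the judge accepts (online), which is why the online memory size grows over time.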
O2. Unclear observation and action space
As mentioned in lines 189-195, we adopt the BrowserGym framework and use the webpage accessibility tree as the agent observation. We adopt the WebArena action space supported by BrowserGym. We will add the list of actions to the revised paper version.
O3. What is the “base memory”
Base memory refers to the original agent memory before adding any induced workflows. Implementation-wise, this contains instructions for the model to follow the user query and solve the tasks. We provide the exact prompt, a.k.a., the base memory, in Appendix A.1.
The authors introduce Agent Workflow Memory (AWM), a method that enables agents to learn and reuse task workflows, similar to how humans leverage past experiences to solve complex tasks. AWM identifies commonly used routines and selectively provides them to guide future actions in both offline and online settings. Applied to the Mind2Web and WebArena benchmarks, AWM improves baseline success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps needed for successful navigation. Also, online AWM generalizes robustly across tasks, websites, and domains, outperforming baselines by 8.9 to 14.0 absolute points.
Questions for the Authors
N/A
Claims and Evidence
From a performance perspective, the argument is well supported, as the proposed methods enhance agent performance. A clear upward trend is observed with an increasing number of workflows in memory. However, certain weaknesses in the experimental design remain, which will be discussed in the following sections.
Methods and Evaluation Criteria
The proposed methods and evaluation benchmarks are well-designed and reasonable.
Theoretical Claims
N/A
Experimental Design and Analysis
While the research direction is highly promising with significant potential impact, there are several weaknesses in the method and experimental design that should be addressed:
Robustness of agent performance: The memory construction appears to be heavily dependent on the selected tasks and workflows in the offline setting. While the paper provides a temporal analysis for the online setting which is valuable, it remains unclear how the method performs when the selected offline tasks vary significantly. Will the agent's performance remain stable across diverse task selections? This is an important question that should be addressed in the rebuttal.
Memory formulation: The current memory framework seems limited to subroutine workflows, but memory can extend beyond that. Experienced environment states themselves are crucial, as they enable agents to anticipate action effects and make more informed predictions. Also, failed trials can serve as valuable memory, helping the model recognize and avoid repeating mistakes. Expanding the scope of memory formulation beyond its current constraints could significantly enhance the approach.
Generalization across task templates: It is somewhat surprising that memory from one task type benefits another, given the potentially large differences in trajectories and elements involved. A more detailed analysis is needed to justify why this generalization works as demonstrated in Table 2. For instance, if no relevant workflows exist in memory, will the agent still retrieve one, leading to erroneous predictions? Further insights into this phenomenon would strengthen the paper.
Reliance on memory: As mentioned earlier, memory may not always be beneficial. Is there a mechanism for the agent to disregard unhelpful memory and rely on its own knowledge for planning? More broadly, the analysis could be greatly improved by detailing how the current pipeline operates in practice and how error cases are handled, rather than focusing solely on the textual representation of workflows.
Supplementary Material
Yes. I reviewed the prompting code inside it.
Relation to Prior Literature
Agent workflow memory has the potential to become a popular paradigm for enabling agents to adapt to new environments, an area that has been under-explored in prior work. The paper provides a strong formulation of this paradigm.
Missing Important References
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
Thank you for recognizing the effectiveness of our AWM approach!
AWM’s robustness to varied offline task selections
We agree that offline task selection is crucial for AWM to achieve better performance: higher distribution overlap between offline/training tasks and online/testing tasks improves agent performance. As we directly adopt the Mind2Web dataset in our Mind2Web experiments (to compare fairly with existing methods), its three test sets (introduced at the beginning of Section 3.2) already introduce mismatches in tasks, websites, and domains relative to those covered by the training examples. As shown in Table 4, our AWM still outperforms baselines on all three test sets, demonstrating its robustness when the train-test task distribution varies to different extents.
Extend memory beyond subroutines
We agree with the reviewer that memory can take various forms. Our work aims to take a first step in learning one of the most crucial memory types, reusable procedural workflows, which boost agent success and efficiency. It would be a promising direction to extend this adaptive AWM method to other types of memories, such as the experienced states and failed trials the reviewer suggests. However, as this requires substantial extra engineering effort and ideation, we think it would be more suitable to study these memory variants in standalone work.
Generalization across tasks
We provide one case of this cross-task generalization behavior in Figure 6, where the agent learns to build new workflows on top of previously learned, easier ones. Nonetheless, our agent can choose not to refer to existing workflows and instead come up with new ones based solely on the action trajectory when solving new examples (as illustrated in the Figure 2 pipeline), and can therefore to some extent avoid bias from irrelevant existing workflows.
Reliance on memory
Our AWM pipeline induces workflows only from trajectories that are predicted as correct, and is therefore not expected to generate erroneous workflows that cannot solve tasks correctly. We also manually evaluated all the induced workflows (7.4 per website on average, so it is feasible to check all of them carefully; more details in Appendix A.3) and found them correct and useful. We cannot provide example workflows for each website in this response due to length limits; please find more examples of induced workflows in Appendix A.2. It would be an interesting follow-up study to examine and improve the quality of workflows and make AWM more effective, if possible.
Search for a Product
# To search for a product, I need to enter the search term in the search bar and press Enter.
fill('search_bar_id', '{search_term}')
press('search_bar_id', 'Enter')
Find Directions Between Two Points
# To find directions between two points, I will use the directions feature on OpenStreetMap. I will start by clicking on the "Find directions between two points" link.
click('149')
# Next, I will fill in the "From" and "To" fields with the respective locations and select the mode of transportation.
fill('158', 'FROM_LOCATION')
fill('163', 'TO_LOCATION')
select_option('166', 'MODE_OF_TRANSPORTATION')
click('171')
# I will then retrieve and send the calculated travel time and distance to the user.
send_msg_to_user('The estimated travel time from FROM_LOCATION to TO_LOCATION is ESTIMATED_TIME.')
Motivated by the ability of humans to flexibly solve complex tasks by learning reusable workflows from past experiences, this paper introduces Agent Workflow Memory (AWM), a training-free paradigm designed to enhance the performance of language model-based agents in solving long-horizon, complex tasks. The core of AWM is an LM-based module that extracts reusable workflows from an agent's experience trajectories. These workflows serve as structured memory, allowing the agent to leverage past knowledge to guide future decision-making. The proposed method is evaluated on two web-based benchmarks, demonstrating strong cross-template and cross-task generalization. Additionally, the paper explores the impact of online/offline settings and different workflow representations on overall performance.
Questions for the Authors
- In the online setting, is the LM-based evaluation of experience applied to the overall outcome or to each step of the process? Additionally, I would like to understand the impact of incorrect experiences being converted into workflows on the final test results. Could you provide experimental results where different proportions of erroneous workflows (e.g., 20% error, 50% error) are introduced?
- The experiments in the paper focus solely on web-based agent tasks. Can AWM be effectively applied to other agent environments, such as ALFWorld or SciWorld? Additionally, I am curious whether workflows learned from web-based tasks can generalize to these environments.
- I noticed that the experiments are conducted using high-quality OpenAI models. To what extent does the capability of the underlying LLM affect the effectiveness of AWM? I would like to see results using more open-source models (e.g., LLaMA, Qwen).
- How does directly using raw experiences as few-shot demonstrations compare to using AWM in terms of test performance? A comparison between these two settings would help clarify the advantage of workflow-based memory.
Claims and Evidence
The paper claims that learning reusable workflows from past experiences can guide agents in solving complex, long-horizon tasks, thereby improving their performance and that AWM operates effectively in both online and offline settings. The supporting evidence for these claims is thoroughly reported in the experimental results, demonstrating significant improvements on WebArena and Mind2Web benchmarks.
However, the claim that AWM can quickly learn and adapt across all tasks is only validated on web-based tasks, lacking evaluation on broader agent tasks, which may limit the assessment of its generalization capability.
Methods and Evaluation Criteria
The proposed method and evaluation criteria are well-aligned with the problem of improving LLM-based agents' performance on long-horizon, complex tasks.
Theoretical Claims
The paper primarily focuses on an empirical evaluation of the proposed method AWM rather than presenting formal theoretical claims. As there are no mathematical theorems or formal derivations included, there are no proofs to verify.
Experimental Design and Analysis
I checked the soundness and validity of the experimental designs and analyses presented in the paper, particularly focusing on the evaluation of Agent Workflow Memory (AWM) on the WebArena and Mind2Web benchmarks. The use of these benchmarks provides a realistic and diverse testing ground, and the comparison with SOTA baselines is a strong validation of the method's improvements. The analysis is relatively detailed, covering AWM's generalization, the differences between online and offline settings, and the impact of workflow representations.
Supplementary Material
I checked the appendix.
Relation to Prior Literature
The paper's key contributions are closely related to several areas in the broader scientific literature, particularly in LLM-based agents and memory-augmented learning. The study builds on prior work in web-based agents (WebArena and Mind2Web [Deng et al., 2023]) that evaluate LLMs on digital task execution. The idea of integrating structured memory for long-term adaptation is related to the prior work on LLM memory mechanisms. Unlike methods that use retrieval, AWM autonomously induces structured workflow memories, allowing for more flexible and scalable adaptation.
Missing Important References
No
Other Strengths and Weaknesses
Strengths:
- The proposed method is training-free, minimizing computational overhead and making it more efficient compared to methods requiring extensive fine-tuning.
- AWM is applicable in both offline and online settings, allowing it to function effectively with or without high-quality annotated data, thereby reducing annotation costs.
- The experimental analysis is comprehensive, covering AWM's generalization capabilities, the differences between online and offline settings, and the impact of workflow representations on performance.
Weaknesses:
- The experiments are limited to web-based agent tasks, lacking evaluations on other types of agent tasks (e.g., robotics, API-based interactions), which may restrict the assessment of AWM's broader applicability.
- The LM-based workflow construction and LM-based judgement of experiences in the online setting may introduce inherent biases from the language model, yet the paper does not analyze their potential impact.
- The approach shares similarities with few-shot learning, limiting its novelty. Additionally, the paper does not explore the direct impact of using raw experience versus structured workflows on agent performance, which could provide further insights into the benefits of workflow induction.
Other Comments or Suggestions
See Question
Thank you for recognizing the soundness of our method, the general applicability in online and offline settings, and the comprehensiveness of our experimental analysis!
W1. Experiments tailored to web agents
We fully agree that AWM can be similarly applied to other agentic tasks such as robotics or tool use. We conducted comprehensive experiments on two major web agent benchmarks emphasizing varied correctness and generalization aspects, which we believe constitute sufficient contributions for this work and demonstrate AWM's effectiveness; many influential agent works also focus on a single domain [1,2]. Nonetheless, applying AWM to other tasks such as robotics would require substantial engineering effort (setting up hardware and software), which we think would be more suitable to explore in standalone future works.
W2. LM introduces bias into workflow construction and judgment processes
For workflow construction, we compared LM-based induction and rule-based induction (no LM involved) in Section 4.1 & Table 5. The rule-based method directly adopts content from the example without introducing any bias from the model, but the comparable scores show that the LM does not introduce any “negative bias” into the workflows. For the judgment process, we compared LM-based evaluation and ground-truth evaluation on WebArena in preliminary studies, and found that LM-based judgment leads to a higher average success rate. We analyzed the results and found the ground-truth evaluators to be overly strict and to disagree more with human correctness judgments, whereas the LM is more flexible and makes more accurate judgments. For example, ground-truth evaluation checks whether the final-state URL exactly matches the annotated one (“xxx.map.com/cafe-near-cmu”), while the agent may land on “xxx.map.com/cafes-near-cmu”, which searches the plural “cafes” instead of “cafe” but still obtains the correct results; the ground-truth evaluator will inaccurately judge this result as wrong, but the LM-based evaluator and humans often still mark it as correct.
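As a toy illustration of why exact URL matching is brittle: the plural-stripping normalization below is invented for this example only; the real WebArena evaluators and the LM judge are more involved.

```python
import re

# Toy illustration only: the normalization is invented for this example
# and is not how WebArena's evaluators or the LM judge actually work.

def exact_match(pred_url: str, gold_url: str) -> bool:
    return pred_url == gold_url

def lenient_match(pred_url: str, gold_url: str) -> bool:
    def norm(url: str) -> str:
        url = url.rstrip("/").lower()
        # Crudely strip a plural 's' at the end of a path segment.
        return re.sub(r"(?<=[a-z])s(?=[-/]|$)", "", url)
    return norm(pred_url) == norm(gold_url)

gold = "xxx.map.com/cafe-near-cmu"
pred = "xxx.map.com/cafes-near-cmu"  # plural search, still the right results
```

Here `exact_match(pred, gold)` fails on the singular/plural difference while `lenient_match(pred, gold)` accepts it, mirroring the disagreement between ground-truth evaluators and human judgment described above.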
W3 & Q4. Compare to using raw experiences
We have compared AWM to using raw experiences on both datasets. For the Mind2Web results in Table 3, the Synapse method conditions on retrieved raw examples for task-solving, and AWM outperforms it by 13.1% in success rate. For the WebArena results in Table 4, rule-based induction is a stronger version of using raw examples, in that we further remove invalid steps in the raw examples to improve their quality. As AWM achieves comparable results to this stronger raw-example baseline, we expect AWM to perform better than using the original raw examples.
Q1. Online evaluation granularity and quality
The LM-based evaluation is applied to the overall outcome, i.e., to the full trajectory. We induce “correct” workflows only from trajectories that are predicted as correct, preventing incorrect trajectories from being converted into workflows. We made this design choice to ensure the quality of workflows and the effectiveness of the entire pipeline. It would be an interesting extension of our work to consider adding incorrect trajectories for workflow construction, but it would necessitate stronger trajectory evaluators and design changes in the pipeline to make it effective in improving the end success rate.
Q2. Can AWM generalize to ALFWorld and SciWorld
Yes, our AWM method generally applies to agent frameworks, as everything is built on top of general agent components. Whether the workflows produced in our experiments can directly transfer depends on the task and domain overlap between our benchmarks (WebArena, Mind2Web) and these datasets (ALFWorld, SciWorld); it would be interesting to explore AWM's performance on ALFWorld and SciWorld in future work.
Q3. AWM Requirement on Underlying Model Capabilities
AWM does require the LM to have certain capabilities in task-solving, trajectory evaluation, and workflow induction. We have experimented with LLaMA 70B and DeepSeek-R1 32B and found that they benefit less from AWM, with success rates moving from 7.9 to 6.2 and from 13.5 to 11.6 before and after applying AWM, respectively; we will reference these results in the revised paper. Nonetheless, our main goal is to showcase this promising self-adaptive agent framework as agents become increasingly stronger.
[1] Liu, Evan Zheran, et al. "Reinforcement learning on web interfaces using workflow-guided exploration."
[2] Wang, Guanzhi, et al. "Voyager: An open-ended embodied agent with large language models."
This paper introduces AWM, a framework that enables agents to extract reusable workflows from past experiences and use them to guide future actions in long-horizon tasks like web navigation. The motivation is intuitive — procedural memory helps humans generalize. The empirical results on WebArena and Mind2Web are solid, showing significant gains over baseline agents.
However, several concerns remain unresolved. First, the workflow induction process is underspecified: it is not clear how generalizable or robust the abstraction mechanism is, especially in terms of granularity control and reproducibility. There is a lack of clarity on whether the method truly generates reusable workflows or just repackages prior trajectories. Additionally, the memory management (size, updating policy, etc.) and the agent's action/observation definitions are not well articulated. Lastly, the method feels adjacent to prior work in Reflexion, Expel, and Synapse. The authors address these differences, emphasizing AWM's focus on procedural subroutines over error-based memory or raw trajectory replay, but it still feels like an incremental advance in a crowded space (although I know this work was released on arXiv a while back, if we take that into account).