PaperHub
Average rating: 6.3/10 (ratings: 6, 7, 6, 6; min 6, max 7, std 0.4)
Poster · 4 reviewers · Average confidence: 3.5
COLM 2025

DynaSaur: Large Language Agents Beyond Predefined Actions

OpenReview · PDF
Submitted: 2025-03-22 · Updated: 2025-08-26
TL;DR

We propose a flexible LLM agent framework for open-ended environments, where it dynamically generates new actions when existing ones are insufficient. These actions accumulate over time for future reuse.

Abstract

Keywords
LLM, LLM agents

Reviews and Discussion

Review
Rating: 6

The paper proposes DynaSaur, an LLM agent framework which lets the system write, execute and cache its own Python functions at run-time. The framework treats code itself as the universal, dynamically expandable action space, plus persisting successful functions for reuse. Extensive experiments on GAIA plus four static QA datasets show consistent gains over strong fixed-tool baselines. The empirical evidence is generally convincing, and the manuscript is clearly written.

Reasons to Accept

  • The manuscript is clearly and effectively written, making the proposed ideas and methodology easy to understand.

  • The conclusions presented in the paper are thoroughly supported by comprehensive experimental results, enhancing the credibility and strength of the findings.

Reasons to Reject

  • Revision required, for example, improper citation formatting in L58-L60.

  • The manuscript does not adequately discuss relevant related works, e.g., Voyager [1], which similarly addresses autonomous skill creation and persistent retrieval. Including a direct comparison would significantly clarify the novelty of the current approach, such as its formal MDP framework and the broader range of task evaluations.

[1] Wang, Guanzhi, et al. "Voyager: An open-ended embodied agent with large language models." arXiv preprint arXiv:2305.16291 (2023).

Questions for the Authors

  • (Q1) The choice of retrieval parameter k=10 lacks justification. Could the authors explain this choice? Furthermore, varying k and using different embedding models to evaluate the robustness and efficiency of the cache retrieval mechanism would enhance understanding.

  • (Q2) There is a concern regarding memory scalability. While the retrieval method helps keep the prompt short, the action library (Ag) appears capable of unlimited growth. Could the authors provide empirical data on the growth rate and final size of Ag after extensive usage (e.g., thousands of episodes), and clarify how this impacts retrieval latency?

Comment

We thank you for your time and the constructive feedback. We would like to address your questions below:

1. “Revision required, for example, improper citation formatting in L58-L60.”

Thank you for pointing this out. We have addressed the issue on our end and will include the fix in the revised version.

2. “The manuscript does not adequately discuss relevant related works, e.g., Voyager, which similarly addresses autonomous skill creation and persistent retrieval. Including a direct comparison would significantly clarify the novelty of the current approach, such as its formal MDP framework and the broader range of task evaluations.”

While our method might share some high-level motivations with Voyager, Voyager relies on many Minecraft-specific assumptions, whereas our method is designed as a more general framework. Specifically, we differ in the following aspects:

  1. Voyager assumes that tasks are organized into a well-structured hierarchy: In Minecraft, each item can be crafted from a set of lower-tier items, forming a well-defined hierarchical progression within the game. Voyager’s curriculum design heavily exploits this assumption. Given the current items in the inventory, the current position, and nearby items, it uses an LLM to suggest the next immediate task that can be performed (e.g., crafting higher-tier items). In contrast, we do not make such an assumption, as tasks in the real world are open-ended and often lack a structure that can be predetermined.

  2. Tasks in Voyager follow well-defined templates: Voyager limits its task set to adhere to specific templates, such as “Mine [quantity] [block],” “Craft [quantity] [item],” “Smelt [quantity] [item],” “Kill [quantity] [mob],” and “Cook [quantity] [food].” In contrast, we make no assumptions about the task distribution. The tasks in GAIA, which we evaluated, exhibit significantly greater diversity and complexity, including web browsing, file processing, symbolic reasoning, commonsense reasoning, and more.

  3. Voyager’s generated actions depend on low-level actions from the Mineflayer API: The actions generated by Voyager must conform to the Mineflayer API in order to interact with the Minecraft environment. In contrast, we impose minimal constraints on the generated actions (only basic formatting requirements for proper parsing). As a result, our method is more flexible and easier to adapt to different environments.

In summary, Voyager is specific to the Minecraft environment and cannot generalize to other environments without redesigning or heavily modifying the entire system. In contrast, DynaSaur is designed to be a general, domain-independent framework that can be applied to diverse environments. As such, a direct comparison with Voyager is not applicable in our case.

3. “The choice of retrieval parameter k=10 lacks justification. Could the authors explain this choice? Furthermore, varying k and using different embedding models to evaluate the robustness and efficiency of the cache retrieval mechanism would enhance understanding.”

We thank the reviewer for raising this point. We chose k=10 to strike a balance between ensuring that the retrieved toolset includes relevant actions for the current task and keeping the context length manageable. We will clarify this rationale in the revised manuscript. Additionally, we are currently conducting experiments with varying k values and different embedding models to better evaluate the robustness and efficiency of the retrieval mechanism, and will include these results shortly.

4. “There is a concern regarding memory scalability. While the retrieval method helps keep the prompt short, the action library (Ag) appears capable of unlimited growth. Could the authors provide empirical data on the growth rate and final size of Ag after extensive usage (e.g., thousands of episodes), and clarify how this impacts retrieval latency?”

We have addressed the majority of this question in our response to the common concerns above. Unfortunately, running the system for thousands of episodes is economically infeasible given our budget constraints. However, we conducted experiments over 500 episodes on the MATH dataset, which we believe is sufficiently significant to demonstrate that memory usage remains well within manageable bounds.

Regarding retrieval latency, we use Chroma, a production-grade vector database optimized for large-scale retrieval. According to its official benchmarks, Chroma achieves query latencies of approximately 30ms even with databases containing over 500,000 vectors [1], which is sufficient for our current scale and expected growth.

[1] https://docs.trychroma.com/production/administration/performance#latency-and-collection-size
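For illustration, a minimal sketch of how such an action library can be backed by Chroma, assuming its default embedding function; the collection name and helper functions below are illustrative choices, not our exact implementation:

```python
import chromadb

# Minimal sketch, not the exact implementation: an action library backed by
# Chroma with its default embedding function. Collection name and metadata
# schema are illustrative.
client = chromadb.PersistentClient(path="./action_library")
actions = client.get_or_create_collection(name="generated_actions")

def register_action(name: str, signature: str, description: str, source: str) -> None:
    # The natural-language description is what gets embedded and searched.
    actions.add(
        ids=[name],
        documents=[f"{signature}: {description}"],
        metadatas=[{"source": source}],
    )

def get_relevant_tools(task: str, k: int = 10) -> list[str]:
    # Return the k stored action descriptions most similar to the current task.
    result = actions.query(query_texts=[task], n_results=k)
    return result["documents"][0]
```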

Comment

I thank the authors for their response. I would be willing to raise my score once the results from the retrieval ablations (varying k and embeddings) are reported. These results are essential to support the claimed robustness and efficiency of the cache mechanism, and would solidify the empirical foundation of the method.

Comment

We sincerely thank the reviewer for their willingness to consider raising the score and for the valuable suggestion to conduct more detailed retrieval ablation studies, which have indeed strengthened our empirical findings. We have completed the requested experiments and present the new results below.

Effect of Varying Number of Retrieved Actions (k):

| k | 5 | 10 | 15 |
| --- | --- | --- | --- |
| Success Rate | 45% | 47% | 51% |

In this experiment, we fixed the embedding model and varied k, the number of retrieved actions, then evaluated our method on the first 100 examples from the GAIA dataset. We observed two key trends:

  1. Performance does not degrade significantly when k is small, suggesting robustness to the retrieval size; and
  2. Success rate improves as k increases, indicating that retrieving a larger set of relevant actions can benefit the agent’s task-solving ability.

Effect of Using Different Embedding Models (with k=10):

| Embedding Model | MTEB Eval Score | GAIA Success Rate | Runtime |
| --- | --- | --- | --- |
| text-embedding-ada-002 | 61.0% | 54% | 01:10:59 |
| text-embedding-3-small | 62.3% | 40% | 01:14:23 |
| text-embedding-3-large | 64.6% | 45% | 00:56:17 |

Here, we fixed k=10 and compared three embedding models. We report the success rate on GAIA, total runtime, and each model’s performance on the MTEB evaluation suite as a proxy for general embedding quality. Interestingly, text-embedding-ada-002 resulted in the highest GAIA success rate, while text-embedding-3-large strikes the best trade-off between performance and efficiency.

These results collectively demonstrate that DynaSaur performs robustly across different values of k and embedding models, further supporting its generalizability.

Additionally, while not requested by Reviewer CTUg, we also ran experiments with open-source LLMs (included in the “common concerns” section above). These results show that DynaSaur maintains strong performance across different LLM backbones, further reinforcing the robustness of our method.

We once again thank the reviewer for the thoughtful suggestion and hope these additional results provide compelling evidence to support a higher evaluation score.

Comment

I appreciate the effort from the authors to address my concerns. That said, the current results remain somewhat inconclusive. In particular:

  • The 4-point increase in success rate when increasing k (from 47% to 51%) could stem from simply including more code in the prompt, rather than retrieving more relevant actions. Since it's unclear how including irrelevant or unused actions impacts performance, it becomes difficult to validate the retrieval mechanism itself or understand whether the observed gains come from better relevance or just more tokens. This also makes it hard to determine the appropriate value of k to adopt in practice, as we cannot observe the trade-off between recall and prompt noise.

  • All three tested models (text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large) are from OpenAI. While they differ in size and cost, they likely share similar architectural and training biases. To truly assess embedding robustness, I encourage testing models from other families.

In summary, while I appreciate the additional analysis, my primary concern remains with the first point above. I consider this issue central to validating the method, and it will influence my final recommendation for the paper.

Comment

We sincerely thank the reviewer for the continued engagement and insightful feedback. We would like to clarify the retrieval mechanism and share further experimental results that address the raised concerns.

Clarification on Retrieved Action Content:

The reviewer noted that the observed 4-point gain in success rate (from 47% to 51%) when increasing k might be due to simply including more code in the prompt. However, we would like to clarify that our system does not include the full code implementation of the retrieved actions in the context. Instead, each retrieved action is shown in a lightweight, natural language format:

- action_name(arg1, arg2, ...) -> output: Description of what the action does

Therefore, the performance gain is unlikely due to trivial factors such as the inclusion of more code in the context. We will clarify how we represent the retrieved actions more explicitly in the revised manuscript.
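For concreteness, a hypothetical helper that renders a generated action into this lightweight format might look like the following (illustrative only, not our exact code; the example function is also hypothetical):

```python
import inspect

def describe_action(fn) -> str:
    # Render an action as its name, arguments, and the first docstring line
    # only; the function body is never placed in the prompt.
    sig = inspect.signature(fn)
    args = ", ".join(sig.parameters)
    summary = (inspect.getdoc(fn) or "No description.").splitlines()[0]
    return f"- {fn.__name__}({args}) -> output: {summary}"

def convert_csv_to_json(input_path, output_path):
    """Convert a CSV file to a JSON file and return the output path."""
    ...

print(describe_action(convert_csv_to_json))
# - convert_csv_to_json(input_path, output_path) -> output: Convert a CSV file to a JSON file and return the output path.
```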

Impact of Irrelevant Actions:

We appreciate the reviewer’s suggestion to investigate how irrelevant or unused actions affect performance. We agree that this is an interesting analysis that helps us understand whether the observed gains come from the improved relevance of the retrieved actions. To this end, we implemented a random retriever that selects k=10 actions uniformly at random from the generated action set. This variant achieved a 41% success rate, 6 points lower than the original (47%), indicating that relevant retrieval contributes meaningfully to performance.
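For reference, this random-retriever baseline amounts to the following sketch, assuming the generated action set is available as a flat list:

```python
import random

def random_retriever(action_library: list, k: int = 10) -> list:
    # Ignore the task entirely and sample k actions uniformly at random
    # from the generated action set.
    return random.sample(action_library, k=min(k, len(action_library)))
```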

On a side note, DynaSaur is designed to adapt by generating new functions at runtime if no retrieved action is useful. This fallback capability explains why DynaSaur with the random retriever does not fail catastrophically.

On Choosing the Appropriate Value of k:

We would like to clarify that the goal of using a retriever is to control prompt length and computational cost, not necessarily to optimize performance. In principle, including all generated actions would yield the best possible recall, and we believe it is unlikely to cause performance degradation due to prompt noise, given that multiple recent studies have shown that frontier LLMs are increasingly better at handling long and noisy contexts [1–5].

To concretely test this, we conducted an additional experiment where no retrieval is used; instead, all generated actions are included in the prompt. This variant achieved a 53% success rate, which is the highest among all tested configurations. These results support our hypothesis that increasing k can benefit performance, but doing so without control may lead to higher costs.

Experiments with Different Embedding Model Families:

We also conducted additional experiments with embedding models from different architectural families, beyond OpenAI models, including Salesforce’s SFR-Embedding-Code-2B_R and Microsoft’s multilingual-e5-large-instruct:

| Embedding Model | MTEB Eval Score | CoIR Eval Score | GAIA Success Rate |
| --- | --- | --- | --- |
| text-embedding-ada-002 | 61.0% | 45.6% | 54% |
| text-embedding-3-small | 62.3% | - | 40% |
| text-embedding-3-large | 64.6% | - | 45% |
| multilingual-e5-large-instruct | 76.2% | - | 50% |
| SFR-Embedding-Code-2B_R | 67.4% | - | 54% |

These results confirm that our method performs robustly across a range of embedding models from diverse sources, not limited to OpenAI models.

In Summary:

  • Our prompts do not include raw code, only action names, inputs, outputs, and descriptions.
  • Retrieval quality matters: random retrieval degrades performance by 6 points.
  • Increasing k improves performance, but at higher cost.
  • Our system is robust across different embedding model families.

We hope these clarifications and experiments address the reviewer’s thoughtful concerns and further demonstrate the robustness of our proposed framework.

References

[1] Tom Burns. GPT-4o’s Memory Breakthrough! https://nian.llmonpy.ai/intro. 2024

[2] An Yang et al. Qwen2 Technical Report. https://arxiv.org/abs/2407.10671. 2024

[3] An Yang et al. Qwen2.5-1M Technical Report. https://arxiv.org/abs/2501.15383. 2025

[4] DeepSeek-AI. DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437. 2024

[5] Aaron Grattafiori et al. The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783. 2024

Comment

I thank the authors for the detailed clarification. Correspondingly, I have raised my score.

Review
Rating: 7

This paper introduces an agent framework that can dynamically create and compose actions as needed. It addresses one of the main limitations of current agentic systems: fixed and predefined tools are not enough for open-ended real-world scenarios. Experiments across multiple benchmarks show that the proposed framework significantly improves flexibility and outperforms prior methods.

Reasons to Accept

  • The proposed action implementation and action accumulation mechanisms are innovative.

  • The experimental results are compelling, with a thorough ablation study and comprehensive results analysis.

  • The paper is well-structured and clearly written.

Reasons to Reject

  • The action implementation capability relies heavily on the coding proficiency of the backend LLM, which may constrain the framework's applicability across diverse use cases.

  • Given that some initial actions are derived from AutoGen, it would be beneficial to include AutoGen as a baseline tool for comparison to better contextualize the proposed framework's contributions.

Comment

Thank you for your insightful and constructive comments on our paper. We address your questions below:

1. “The action implementation capability relies heavily on the coding proficiency of the backend LLM, which may constrain the framework's applicability across diverse use cases.”

Please see our response to the common concerns above.

2. “Given that some initial actions are derived from AutoGen, it would be beneficial to include AutoGen as a baseline tool for comparison to better contextualize the proposed framework's contributions.”

We appreciate the reviewer’s suggestion to include AutoGen as a baseline. However, we would like to clarify that AutoGen is not itself an agent system, but rather a library for building custom multi-agent workflows. Therefore, a comparison with AutoGen as a baseline agent is unfortunately not feasible.

Comment

Dear Reviewer coxz,

Thank you once again for your valuable comments. Regarding generalization to the backend LLM, we have added additional experiments on open-source LLMs, including ones with weaker code generation abilities, in the “Common Concerns” section. These results demonstrate the robustness of our method across a broader range of models, which we believe addresses your concern about the reliance on the coding abilities of the underlying LLMs.

We would be grateful for your thoughts on these new findings and look forward to any further discussion you may have.

Review
Rating: 6

This paper introduces DynaSaur, a framework that enables LLM agents to dynamically create and compose arbitrary actions represented as python functions, rather than being constrained to a fixed set of predefined actions. The authors show that this approach allows agents to extend their capabilities on the fly and build a library of reusable functions over time.

Reasons to Accept

  1. This paper addresses a genuine limitation in existing LLM agent frameworks: the constraint of fixed, predefined action sets.
  2. Clean architecture using python functions for action abstraction and accumulation
  3. Shows meaningful performance improvements across multiple benchmarks

Reasons to Reject

  1. The fundamental idea of using code generation for LLM agents is not novel. Prior work already explores similar concepts
  2. The action retrieval and accumulation mechanisms raise context window and memory concerns as the function set grows. No real analysis of long term usability or degradation.
  3. Limited analysis of the quality of generated functions. Many functions might be too task-specific to be truly reusable
  4. No clear comparison with strong baseline methods that use a more comprehensive set of predefined actions
  5. The paper mostly tests with GPT-4o, so we don't know if it works with other LLMs, especially open-source ones

Questions for the Authors

  1. How does the approach handle incorrect or malicious code generation? What safeguards would be necessary for real-world deployment?
  2. How well does the approach transfer to open-source LLMs with weaker code generation capabilities?
  3. Have you explored using this framework in truly interactive environments (dynamic) rather than static benchmarks?
  4. How would this architecture adapt to edge-device (phone, tablet, etc.) deployment where code execution is resource constrained?
  5. Could this system generate incorrect but plausible functions that mislead the agent? Any checks for semantic validity beyond syntactic correctness?
Comment

5. “The paper mostly tests with GPT-4o, so we don't know if it works with other LLMs, especially open-source ones”

Please see our response to common concerns above.

6. “How does the approach handle incorrect or malicious code generation? What safeguards would be necessary for real-world deployment?”

We addressed this in our Ethics Statement section, which we would like to reiterate here: to handle malicious code generated by LLMs, one can apply a safety filter or a world-model-based formal verifier to each action during creation or prior to execution. Furthermore, the agent should be deployed in an isolated environment with restricted inbound and outbound traffic. Limiting file system permissions, such as enforcing read-only access or encouraging minimal edits instead of overwriting files, can further reduce the risk of unintended or harmful behavior. Restricting the agent’s permissions also helps prevent the execution of malicious scripts.
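As one concrete illustration of such sandboxing, agent-generated code can be executed in a separate process with hard resource caps. The following is a minimal POSIX-only sketch with a hypothetical helper name; a production deployment would additionally use containerization, network isolation, and filesystem restrictions as described above:

```python
import resource
import subprocess

def run_untrusted(code: str, timeout_s: int = 30, mem_bytes: int = 1 << 30) -> str:
    # Execute agent-generated code in a child process with a wall-clock timeout
    # and caps on CPU time and address space. Illustrative sketch only.
    def limit_resources():
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))

    proc = subprocess.run(
        ["python3", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,               # raises TimeoutExpired if exceeded
        preexec_fn=limit_resources,      # applied in the child before exec (POSIX only)
    )
    return proc.stdout if proc.returncode == 0 else f"ERROR: {proc.stderr}"
```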

7. “How well does the approach transfer to open-source LLMs with weaker code generation capabilities?”

Please see our response to common concerns above.

8. “Have you explored using this framework in truly interactive environments (dynamic) rather than static benchmarks?”

As noted in Section 4.1, most existing interactive environments do not permit arbitrary code execution, making it infeasible to run experiments using our framework in those settings.

Although not explicitly stated in the paper, our evaluation methodology inherently involves interactive processes between the agent and its environment. Specifically, our environment is an IPython kernel operating within a computer equipped with internet access and interacting directly with an underlying operating system, as illustrated in Figure 1.

9. “How would this architecture adapt to edge-device (phone, tablet, etc.) deployment where code execution is resource constrained?”

While adapting our architecture to resource-constrained edge devices is slightly beyond the current scope of our work, we envision potential strategies to address such constraints. For instance, explicit constraints on resources (e.g., "ensure code execution does not exceed 5 GB of CPU memory") could be specified beforehand. Alternatively, the framework could incorporate mechanisms to detect resource usage dynamically, alerting the agent when approaching or surpassing resource limits, enabling the agent to adjust its subsequent actions accordingly.

10. “Could this system generate incorrect but plausible functions that mislead the agent? Any checks for semantic validity beyond syntactic correctness?”

As with most code-generation systems, there is indeed a possibility of generating functions that are syntactically correct but semantically flawed or misleading. While our current system primarily checks for syntactic correctness, integrating comprehensive semantic validation remains an open problem and an active area of research. However, our design mitigates this issue somewhat through iterative interactions: incorrect code is usually quickly identified through execution outcomes (e.g., returning a zero or an empty list) and corrected by the agent in subsequent steps. We believe incorporating explicit semantic validation mechanisms or leveraging runtime feedback loops for immediate correction represents promising directions for future enhancement of our approach.

Comment

We thank reviewer P4Yg for their thorough and insightful comments on our paper; we would like to address them below:

1. “The fundamental idea of using code generation for LLM agents is not novel. Prior work already explores similar concepts”

Please see our response to common concerns above.

2. “The action retrieval and accumulation mechanisms raise context window and memory concerns as the function set grows. No real analysis of long term usability or degradation.”

For concerns about memory scalability, please refer to our response above, where we provide empirical measurements and discuss growth control strategies.

Regarding the context window, this is precisely the motivation for incorporating an action retrieval mechanism. Instead of appending the entire action set to the prompt, which would scale poorly, the agent has the autonomy to invoke a get_relevant_tools action that retrieves only the top-k relevant actions. This keeps the prompt length effectively bounded, regardless of the size of the full action set.

As for long-term usability and degradation, we address this in Section 4.3.3 through the action coverage metric, which essentially measures how often previously generated actions are reused in successful trajectories. As shown in Figure 3, this action coverage metric increases over time, indicating that accumulated actions remain relevant and useful, with no observable signs of degradation in terms of action coverage.

3. “Limited analysis of the quality of generated functions. Many functions might be too task-specific to be truly reusable”

As mentioned above, in Section 4.3.3 we provide a detailed discussion on reusability, and we also include examples of reusable actions in Figure 9. Beyond reusability, we also analyze the types and complexity of the generated actions. Additionally, we include a case study that highlights how our proposed framework offers more flexibility and effectiveness compared to standard agent frameworks with predefined action spaces. If the reviewer had specific aspects in mind that they felt were underexplored, we would greatly appreciate more detailed suggestions to help us strengthen the work further.

4. “No clear comparison with strong baseline methods that use a more comprehensive set of predefined actions”

We would like to emphasize that increasing the comprehensiveness of predefined actions does not address the fundamental limitations highlighted in our paper: (1) it significantly restricts the planning and acting capabilities of LLM agents by limiting them to predefined actions, and (2) it requires extensive human effort to enumerate and implement all possible actions, making this approach impractical for complex environments with a vast number of potential actions. These limitations are precisely what motivate our proposed framework.

Moreover, as described in Section 4.1, we ensure a fair comparison by initializing our action set with the same predefined actions used by the baselines. If a baseline with a more comprehensive action set were to be considered, we would initialize DynaSaur with the same set to maintain consistency.

Comment

Dear Reviewer P4Yg,

Thank you once again for your thorough and insightful feedback. In response, we have added additional experiments on open-source LLMs in the “Common Concerns” section. These results demonstrate the robustness of our method across a broader range of model backbones, including those with weaker code generation capabilities.

We would be grateful for your thoughts on these new findings and look forward to any further discussion you may have.

Comment

Thank you for the clarifications. I've updated my score.

Review
Rating: 6

This paper presents a novel framework for large language model (LLM) agents that eliminates the need for a predefined action set. Instead of selecting from a static list of actions, the proposed agent dynamically generates and composes executable programs in a general-purpose programming language, enabling more flexible and adaptive decision-making. Furthermore, the agent maintains a memory of past generated actions for future reuse. Through extensive evaluations on multiple benchmarks, the framework demonstrates superior adaptability, especially in open-ended or failure-prone scenarios where fixed action sets are insufficient.

Reasons to Accept

The paper tackles a critical bottleneck in current LLM agent systems — reliance on fixed action sets — which hinders performance in dynamic, real-world environments.

The framework’s ability to generate and compose actions on-the-fly, combined with execution via a programming language, introduces a highly flexible and generalizable mechanism for agent behavior.

The accumulation and reuse of generated actions adds a valuable layer of learning and efficiency, simulating a form of long-term agent memory.

The experimental results clearly support the proposed framework, showing significant improvements over baselines on multiple benchmarks.

Reasons to Reject

While the idea of using program generation is compelling, it builds on existing trends in code-based agents. The framework may appear incremental unless more detailed comparisons with prior works are provided.

Questions for the Authors

NA

Comment

We thank the reviewer for taking the time to write a positive assessment and express strong support for our paper. We are encouraged by the reviewer’s recognition of our contributions.

Regarding the novelty of our method, please refer to our response to the common concerns above.

Comment

Thank you for your response.

Comment

On the novelty of DynaSaur

We thank the reviewers for pointing out the connection to prior work on code generation for LLM-based agents. We agree that code generation has been explored in earlier work, particularly in domain-specific settings such as math problem solving [1], VQA [2], or software engineering [3].

However, our contribution is not merely about generating code to solve a task. Rather, we propose a novel perspective: treating a general-purpose programming language as a general and composable action space for agents across diverse domains. In our framework, code is not just a solution; it is the interface through which the agent interacts with the environment. This allows for:

  • Full autonomy in acting: the agent can define and invoke any valid operation permitted by the language, enabling highly flexible and adaptive behavior.
  • Composability of actions: programs allow low-level operations to be composed into more complex behaviors, facilitating abstraction, reuse, and hierarchical reasoning.

To our knowledge, this broader framing of code as the action language for generalist agents has not been fully articulated or operationalized in prior work. We will clarify this distinction in our revision.

References

[1] Cheng Qian, Chi Han, Yi R. Fung, Yujia Qin, Zhiyuan Liu, Heng Ji. CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models. 2023

[2] Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R. Fung, Hao Peng, Heng Ji. CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets. 2024

[3] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. 2024

On memory concerns

We thank the reviewers for their questions regarding the memory scalability of our method. We address these in two parts:

  1. Memory footprint of generated actions is minimal and sublinear in growth: The table below summarizes the size of the action set after action accumulation on two datasets, along with their respective growth rates:

| Dataset | # Training tasks | # Generated actions | Avg. size of an action | Total size of the action set | Growth rate (# new actions / task) |
| --- | --- | --- | --- | --- | --- |
| GAIA | 165 | 80 | 0.43 kB | 34.40 kB | 0.48 |
| MATH | 546 | 143 | 0.36 kB | 51.48 kB | 0.26 |

Each action is a relatively short Python function, making its memory footprint negligible. For perspective, a 1GB memory budget can store over 2 million such actions. Importantly, the action growth rate is sublinear. On the GAIA dataset, DynaSaur generates 0.48 actions per training task, but this drops to 0.31 during testing, indicating saturation as the agent accumulates more actions.

  2. Controlling action set growth is straightforward: While we did not incorporate pruning mechanisms in our current implementation, several simple strategies can be used to prevent unbounded growth. For example, one can set a fixed upper limit on the action set size and periodically remove actions based on usage frequency or recency. This ensures that only high-utility actions are retained, keeping memory usage and retrieval latency bounded over time.
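A minimal sketch of such a recency-based pruning policy (illustrative only; not part of the current implementation, and the class name is hypothetical):

```python
from collections import OrderedDict

class BoundedActionLibrary:
    # Keep at most max_size actions, evicting the least recently used one
    # when the limit is exceeded. Sketch of the pruning strategy described
    # above, not part of the current DynaSaur implementation.

    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self._actions = OrderedDict()  # action name -> source code

    def add(self, name: str, source: str) -> None:
        self._actions[name] = source
        self._actions.move_to_end(name)        # mark as most recently used
        if len(self._actions) > self.max_size:
            self._actions.popitem(last=False)  # evict least recently used

    def get(self, name: str) -> str:
        source = self._actions[name]
        self._actions.move_to_end(name)        # reuse refreshes recency
        return source
```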

On transferability to other LLMs

We thank the reviewers for these insightful comments. We are currently conducting additional experiments with a broader set of LLMs, including open-source models, to evaluate the framework’s robustness across different backbone LLMs. We will include these results soon.

Comment

We thank the reviewers again for their valuable suggestions. In response, we have conducted additional experiments using open-source LLMs to evaluate:

  1. whether our method generalizes beyond proprietary models, and
  2. whether it remains effective when paired with models that exhibit weaker code generation abilities.

To this end, we tested two open-source models:

  • Qwen2.5-Coder-32B-Instruct: a code-specialized, instruction-tuned model.
  • Qwen2.5-32B-Instruct: a general-purpose instruction-tuned model with weaker code generation capabilities.

Qwen2.5-Coder-32B-Instruct

| Agent Pipeline | Level 1 | Level 2 | Level 3 | Average |
| --- | --- | --- | --- | --- |
| No Pipeline | 13.21% | 1.16% | 0.00% | 4.85% |
| Sybil System | 20.75% | 8.14% | 0.00% | 10.91% |
| HF Agent | 13.21% | 10.47% | 3.85% | 10.30% |
| DynaSaur | 35.85% | 20.93% | 11.54% | 24.24% |

Qwen2.5-32B-Instruct

| Agent Pipeline | Level 1 | Level 2 | Level 3 | Average |
| --- | --- | --- | --- | --- |
| No Pipeline | 5.66% | 5.81% | 0.00% | 4.85% |
| Sybil System | 24.53% | 10.47% | 0.00% | 13.33% |
| HF Agent | 26.42% | 11.63% | 3.85% | 15.15% |
| DynaSaur | 35.85% | 30.23% | 3.85% | 27.88% |

These results highlight several findings:

  1. Open-source compatibility: DynaSaur works effectively with open-source models, not just proprietary LLMs.
  2. Robustness to weaker code generators: Even with a general-purpose model like Qwen2.5-32B-Instruct, which has weaker code generation abilities, DynaSaur significantly outperforms other pipelines.
  3. Model-wise variability: Surprisingly, Qwen2.5-32B-Instruct even slightly outperforms its code-specialized variant. We hypothesize that its stronger commonsense reasoning and decision-making capabilities may compensate for its weaker coding skills, ultimately resulting in better overall performance.

Taken together, these results further support the robustness and versatility of DynaSaur. We hope they provide a solid basis for the reviewers to consider raising their scores.

Final Decision

I recommend accepting this paper. DynaSaur presents a valuable contribution to LLM agent research by enabling dynamic action creation and composition rather than relying on predefined action sets. All four reviewers rated the paper positively (scores of 6, 6, 7, and 6), recognizing its innovative approach to agent flexibility. The authors responded thoroughly to reviewer concerns, providing additional experiments with open-source LLMs and different embedding models that demonstrated the framework's robustness across model families. While some reviewers initially questioned the novelty compared to code generation approaches, the authors clarified their unique contribution: treating a general-purpose programming language as a composable action space for agents across domains. The paper includes solid experimental validation and addresses important practical concerns about memory scalability and retrieval effectiveness. This work represents a meaningful step forward in creating more adaptable and autonomous LLM agents.