PaperHub
Overall score: 6.0 / 10 — Rejected (4 reviewers)
Reviewer ratings: 4, 5, 2, 4 (min 2, max 5, std dev 1.1)
Average confidence: 3.3
Novelty 3.0 · Quality 3.0 · Clarity 3.3 · Significance 2.8
NeurIPS 2025

Sample, Predict, then Proceed: Self-Verification Sampling for Tool Use of LLMs

OpenReview · PDF
Submitted: 2025-05-03 · Updated: 2025-10-29

Abstract

Keywords
LLM · Tool Use · World Modelling

Reviews and Discussion

Official Review
Rating: 4

The authors present an approach to improving the tool-use capabilities of agentic LLMs with Dynamics Modeling (DyMo), where a model is fine-tuned not only to correctly generate function calls for APIs, but also to predict the subsequent "states" (outcomes of the function call) that would result from making the call. By fine-tuning in both a supervised fashion and an RL fashion to correctly make the function call and predict its output, the authors demonstrate that performance can be improved along a number of axes on the Berkeley Function Calling Leaderboard (BFCL). Moreover, the authors present self-verification sampling (SVS), where test-time compute is used to generate multiple candidate samples and proceed only with the generated 'trajectory' that has the highest likelihood of success. The extensive analysis shows decent performance gains compared to a number of relevant baselines, and ablations demonstrate that state-completion data, RL, and SVS all contribute to the overall performance gains.
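For my own understanding, the SVS procedure can be sketched roughly as follows; this is my reading rather than the authors' code, and the `llm` helper methods are hypothetical stand-ins:

```python
# Hedged sketch of Self-Verification Sampling (SVS); the `llm` methods are
# hypothetical. The same LLM serves as the policy (proposing tool calls) and
# as the internal environment model (predicting each call's outcome).
def svs_step(llm, prompt, k=8, refuse_threshold=0.5):
    candidates = [llm.sample_tool_call(prompt) for _ in range(k)]
    scored = []
    for call in candidates:
        predicted_state = llm.predict_next_state(prompt, call)  # internal simulation, no real execution
        score = llm.success_probability(prompt, call, predicted_state)
        scored.append((score, call))
    best_score, best_call = max(scored, key=lambda t: t[0])
    if best_score < refuse_threshold:
        return None      # refuse: no candidate is predicted to succeed
    return best_call     # proceed with the most reliable candidate only
```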

Strengths and Weaknesses

Strengths:

  • Tool use and agentic behavior from foundation models have come to the forefront as a topic of great interest. The authors contribute an insight that falls squarely into the crosshairs of this space, and I find the approach useful and insightful for the community.
  • The problem setup and solution are well presented, understandable, and sensible. It makes sense from the most obvious perspective to increase the training data to also predict state completion, but the RL approach and the very straightforward SVS method are practically valuable applications of the key insight and meaningfully improve the model’s performance.
  • There are some nice experimental takeaways (lines 270 and 277 in particular) that demonstrate that the authors’ approach can be a useful test-time scaling method.
  • The experiments are rather extensive in their analysis. The authors spare no detail in providing interpretations of results and ablating specific training decisions made in the overall SFT / RL / SVS approach.
  • I really think that Dynamics Modeling, or trying to get these foundation models to judge themselves by 'speculating' what will happen as a consequence of their actions, is a very important line of research. I think this work should be part of the community's growing focus, and the NeurIPS community would benefit from seeing it.

Weaknesses:

  • My biggest issue has to do with the "RL" analogy made throughout the paper. The analogy itself makes sense - you can treat a model as both the 'policy' and the 'dynamics' in an RL setup and train the model to be accurate in both regards. However, the thing that makes RL so hard is that trajectories often happen over long horizons, and one misstep can be fatal (or needs to be corrected, in many cases). The same is true here: compounding hallucinations are very plausible if the model mispredicts a future state, even if it gets the tool-use call correct. What would happen in this situation? The experimental evaluation, from my understanding, doesn't really dig into this sort of behavior, where long-horizon tool use requires multiple correct steps in sequence (and in this work in particular, multiple correct future-state predictions). This raises questions about how a fine-tuned model would behave once it gets out of distribution, and whether or not this could lead to critical mistakes.
  • I'm not sure what the purpose of Section 4.2 is. The same model is trained both to be a binary classifier and to actually perform the tool-use / state-prediction actions, right? I'm not entirely sure what the point of analyzing the model as a binary classifier is. It seems like a totally separate problem from trying to generate correct tool-use calls or predict states.
  • I have some clarifying questions and uncertainties about details, as listed in my questions below.

Questions

  • ~Line 105: "Note that using no-tool environment is sometimes available in some experiments measuring hallucination." I don't understand this sentence. Can you clarify?
  • Line 136: Where is this reward function coming from? What is it measuring? Correct tool use? How fine-grained is the reward function?
  • Line 139: What does 'simultaneously minimize' mean, precisely? Minimizing the additive loss of the two objectives?
  • Line 184: Can the authors comment on any potential for data poisoning in the main evaluation?
  • Line 305: Does this imply that the model gets more confident in not refusing to answer, but then answers incorrectly at a higher rate on these potentially 'trickier' problems? I'd like to see an analysis of what kinds of prompts it stops refusing as training progresses, but still gets wrong.

Limitations

I did not see an explicit limitations section in the main text or the appendix. I'd really like for the authors to add this section (appendix is fine) to the final version of their paper. Namely, I'd like for the authors to list both the limitations of their experimental analysis (e.g., only evaluating on one-step sequences of tool use) as well as the overall approach (e.g., potential for compounding error if the state prediction is incorrect at some point along a trajectory.) I expect the authors to add it and my score is somewhat contingent upon this appearing in the final version of the paper.

Justification for Final Rating

After a discussion with the authors and the other reviewers, I do agree that the benchmarking and scoping of the contribution are somewhat inaccurate and improper, given the issues with the RL framing and the BFCL evaluation covering only single-step tool use. I've updated my score accordingly.

Formatting Issues

n/a

Author Response

We thank the reviewer for recognising both the importance and the practical value of our work. We also appreciate the thoughtful suggestions and questions that will help us further improve the paper. Our point‑by‑point responses follow.


1. From single-step, single-turn to multi-step, multi-turn (common reply)

We appreciate the reviewer’s acknowledgement that single‑turn interactions are a reasonable starting point. We agree that multi-turn and multi-step tool-use is certainly attractive. However, we believe that proving and understanding the effectiveness of single-turn is a necessary prerequisite for extending the proposed method. Therefore, we chose to maintain a clean setup to deeply understand the capability of LLMs to model the single-step environments.

  • Multi-turn: Modelling a multi‑turn conversation requires an LLM to represent not only the environment but also the users. This dual modelling problem is almost unexplored, so isolating environment modelling first allows clearer insight.

  • Multi-step: Multi-step interactions arise when an LLM must perform several successive actions in response to a single user prompt. Because these tasks decompose into a sequence of single steps, a firm grasp of the single-step setting is essential. At the same time, extending our method to multi-step settings requires an abstract, not exact, state representation. During long-horizon planning, overly specific or slightly incorrect details in a predicted state can actually hinder the model, so predicting the state type might be enough. In fact, we believe that humans act the same way: when we plan a long-horizon task, we picture a coarse pipeline rather than every concrete intermediate result. How to build such abstract/coarse representations for LLMs is still a largely unexplored challenge. Moreover, multi-step scenarios also introduce orthogonal methods such as ReAct, Reflexion, Self-Refine, and Tree of Thoughts. Given our limited time and resources, we focused on a narrower scope that still provides actionable insights for the community.

  • Planned updates: We will expand Section 5 with a discussion of current limitations and outline future works on multi‑turn, multi‑step extensions to address the reviewer’s concerns more fully.


2. Objective of Section 4.2

Thanks for pointing this out. We just realised that the definition of $z$ in Section 3.1 is a bit misleading, and we'll update the text to fix the problem.

Let's first review the representation of $z$ in our setup. As stated in line 179 and Appendix A.2, $z$ is constituted by (i) the special pass/error token, and (ii) the exact result message returned from the tool engines. This aligns with the definition of an environment in the RL framework, i.e. a joint distribution over states and rewards conditioned on a given pair of state and action.
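For concreteness, here is a purely hypothetical illustration of these two components; the actual special tokens and formatting are those defined in Appendix A.2 and may differ from what is shown:

```python
# Hypothetical illustration only; the real special tokens and message format
# are defined in Appendix A.2 and may differ from what is shown here.
z_success = "<PASS> {'temperature': 18.5, 'unit': 'celsius'}"                      # pass token + engine result
z_failure = "<ERROR> TypeError: get_weather() missing required argument: 'city'"  # error token + error message
```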

The first experiment aims to evaluate how accurately the models predict subsequent states. We considered several alternatives. For example, we could require an exact string match between the predicted state and the ground truth. However, this approach is inappropriate for successful tool calls, particularly in the executable category, because the ground-truth results, which come from live Python functions or REST calls, can vary over time. In effect, we would be measuring hallucination on outputs. As stated in the reply above about the multi-step setup, measuring state-prediction capability requires an abstract state representation, which remains unexplored within the LLM community.

We therefore returned to the role of the internal environment model within SVS. Because SVS ultimately relies on binary rewards, we framed the evaluation as a binary classification task. This setup also reveals how much the generation ability of an LLM can improve by leveraging its discrimination capability.

We hope this clarifies the objective of the experiments in Section 4.2.


3. Minor issues

Thank you for collecting these detailed comments. Each will help us raise the overall quality of the paper.

  • Line 105: The original sentence was unclear due to several typos. New text: "Note that certain prompts require no tool-use in categories designed to measure hallucination."

  • Line 136: We'll add more details about the reward function we adapted from the BFCL work. Briefly, the scores are always binary. For the executable category, a score of 1 is given only if running the model's tool call yields exactly the same result as running the ground-truth call. For all other categories, a score of 1 is given only if the model's tool call is identical to the ground-truth call. In those cases, whether parameter order matters depends on the fine-grained category. Further information can be found in the original BFCL blog post; a rough sketch of this scoring rule is given after this list.

  • Line 139: Thank you for the suggestion. We've corrected it to a more rigorous expression. Your proposal is clearer; indeed, the objective is "minimising the additive loss of the two objectives", as sketched after this list.

  • Line 184: We ensured that user questions in the validation set never appear in the training set. Tools in the validation set may appear, but only paired with different user questions. This design allows the models to learn tool dynamics rather than memorise specific prompts.

  • Line 305: Line 305 and Figure 5 address test-time compute rather than training. They show the effect of increasing the number of generated candidates. We'll put the results comparing refusal rates over categories at $k=1$ and $k=64$ into the appendix. We hope these results will shed light on the model's refusal behaviour.
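Two brief sketches to make the Line 136 and Line 139 replies above more concrete. First, the binary reward; the helper names below are hypothetical stand-ins, and the BFCL evaluation code remains the authoritative reference:

```python
# Hedged sketch of the binary reward described in the Line 136 reply above;
# `execute` and the category check are hypothetical, not the BFCL API.
def binary_reward(model_call, ground_truth_call, category, execute=None):
    if category == "executable":
        # reward 1 only if running both calls yields exactly the same result
        return int(execute(model_call) == execute(ground_truth_call))
    # other categories: reward 1 only if the calls match
    # (whether parameter order matters depends on the fine-grained category)
    return int(model_call == ground_truth_call)
```

Second, the additive objective from the Line 139 reply, assuming both terms are standard token-level negative log-likelihoods weighted equally:

```latex
% Sketch of the additive SFT objective (equal weighting assumed):
\mathcal{L}(\theta)
  = \underbrace{-\log \pi_\theta(y \mid x)}_{\text{tool-call generation}}
  \;+\;
  \underbrace{-\log \pi_\theta(z \mid x, y)}_{\text{subsequent-state prediction}}
```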

Please let us know if any further clarification would be helpful.

Comment

Thanks to the authors for their response.

Modelling a multi‑turn conversation requires an LLM to represent not only the environment but also the users. This dual modelling problem is almost unexplored, so isolating environment modelling first allows clearer insight.

Well, the "user" is the LLM as well, so you are already doing this representation, just in one step. If we're to extend the RL analogy, this would be doing model-based RL, which is already what you're proposing. The policy, which is also represented by the LLM's prediction of what action to take, is part of the overall system.

I agree with the authors that performing well in single-step settings is crucial and more or less a prerequisite for succeeding in the multi-step (long-horizon) regime. My issue is more that the analogy of RL is used somewhat heavily throughout the paper, but reducing the problem to single-step is a bit incongruent, since the central challenge of RL is behaving over a sequence of multiple timesteps (otherwise, you could have analogized it to MAB or some other single-step problem.) I'd encourage the authors to reconsider using this analogy in the paper. The contribution is still noteworthy even without trying to frame it as an RL context.

Because SVS ultimately relies on binary rewards, we framed the evaluation as a binary classification task. This setup also reveals how much the generation ability of an LLM can improve by leveraging its discrimination capability.

Thanks for the clarification. I think designing the reward as a binary indicator is certainly a fine way (and correctly captures the desired behavior.) I'd be curious about other reward function designs that could potentially be useful here (perhaps densifying the reward space more.) But this is well out of scope of the current contribution.

I appreciate the authors addressing my other comments as well. I will keep my score as-is but encourage the authors to be as precise as possible in how they frame their problem from a technical standpoint.

Comment

Thank you for further engagement in the discussion. We feel it would be helpful to clarify the following two issues further: (i) what we mean by "user modelling" in this work, and (ii) why our setup is a legitimate RL problem in a one-step MDP (a.k.a. a contextual MAB), and thus not an analogy, along with evidence from existing works.


I. User modelling in our setting

Appendix A.1 gives Example 1, which shows how we organise multiple turns into a single token sequence and clarifies the boundary between the user and the system. The sequence begins with a system preamble, followed by a human user prompt. The system then calls the LLM to produce a completion. If the completion is a tool call, the system executes the call internally rather than returning it to the user.

In this paper, a “user” is a human interacting with our system, thus by no means an LLM. “User modelling” in our previous reply therefore means modelling how human users would react to the final system output. Note that an LLM completion is not necessarily a final output, because the next turn may be an internal interaction between the LLM and the tool engine.
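Schematically, such a sequence can be illustrated as below; this is a hypothetical rendering, and the exact template is the one shown in Example 1 of Appendix A.1:

```python
# Hypothetical rendering of the turn structure described above; the exact
# template is given in Example 1 of Appendix A.1.
sequence = [
    ("system",    "Preamble: available tools, formatting instructions, ..."),
    ("user",      "Book me a table for two at 7pm tonight."),        # human user turn
    ("assistant", "book_table(party_size=2, time='19:00')"),         # LLM completion = tool call
    ("tool",      "<PASS> reservation_id=..."),                      # internal turn with the tool engine, not shown to the user
    ("assistant", "Your table for two is booked for 7pm tonight."),  # final output returned to the user
]
```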


II. Formulate our setup as RL

We would like to clarify that our setting is a valid one-step MDP rather than an analogy. A precise formulation is as follows:

  1. Sample an initial state $s_0 \sim \mu$, where $\mu$ is an unknown distribution over initial states induced by human users outside the system. In our notation, $s_0$ corresponds to the rendered user prompt $x$;

  2. The policy LLM produces an action $y \sim \pi_\theta(\cdot \mid x)$. Here $y$ is a complete response, not a single token;

  3. The environment returns a subsequent state and a reward, i.e. $z = (r, s_1) \sim \mathcal{E}(x, y)$. The environment function here can be decomposed into a transition function $s_1 \sim p(\cdot \mid x, y)$ and a binary reward function $r(x, y) \in \{0, 1\}$;

  4. For every pair $(x, y)$, the state $s_1$ is terminal.

If we ignore the variability among terminal states, this reduces to a contextual bandit. A contextual bandit is a special case of an MDP in which the process terminates immediately after one action. This view aligns with the RLHF work. Specifically, Ouyang et al. [1] describe their method as RLHF in a bandit environment (Section 3.4). More detailed discussion about token-level actions versus response-level actions appears in Sections 2.1 and 2.2 of the RLOO paper [2].
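Written out as a single expression, the induced policy objective is the standard one-step (contextual-bandit) return; this is only a restatement of steps 1-4 above, not a new formulation:

```latex
% One-step MDP / contextual-bandit objective implied by the formulation above:
J(\theta) \;=\; \mathbb{E}_{x \sim \mu}\;\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
\big[\, r(x, y) \,\big], \qquad r(x, y) \in \{0, 1\}.
```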

Lastly, we feel it is necessary to clarify that our method is not model-based RL. Model-based RL trains the policy by interacting with an internal environment model. However, in our experiments, the policy LLM is trained by interacting with the oracle environment. So, policy learning in our work is model-free. This is why we used "planning algorithms from the RL community" in line 368.

We hope these clarifications further specify the scope and positioning of our contribution. In addition, we will revise Section 3.1 to present a clearer formulation of our method within the RL framework.

[1] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35, 27730-27744.

[2] Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., ... & Hooker, S. (2024). Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740.

Comment

Thanks to the authors for the clarifications. Sorry for the misunderstanding of "user" - your response clears it up. I agree that considering the human user's reactions and responses is a very challenging problem and is well out of scope.

To clarify on my own end, my understanding is that there are sometimes cases when a single human user's request would require multiple tool use invocations. Suppose a human user wants their LLM agent to plan travel accommodations. This would perhaps require (1) tool use to book flights, then (2) using the results of step 1 to book lodging that is compatible with the time constraints of the flight, and so forth. So here, a policy that is invoking different tool use calls will have to also model the results of these calls so that they can be used in future steps. I believe this is what you mean by "Note that an LLM completion is not necessarily a final output, because the next turn may be an internal interaction between the LLM and the tool engine."

I appreciate the precision of the contextual MAB formulation you've presented here. Can you include this somewhere in section 3 (probably 3.1)?

Model-based RL trains the policy by interacting with an internal environment model. However, in our experiments, the policy LLM is trained by interacting with the oracle environment. So, policy learning in our work is model-free.

So I'm a bit confused here - model-based RL indeed learns an internal environment model (and a policy, although the policy is often just following a learned value function), but it of course interacts with the ground-truth environment in order to learn an accurate representation of the internal environment model. The LLM is trained by interacting with the ground-truth environment, which is what I assume you mean by "the oracle environment", but it is trained to model the oracle environment, no? I agree that the lines between model-free and model-based blur a bit here, because the internal environment model is not explicitly used to conduct a policy. But the "environment's dynamics", so to speak, are indeed being learned, if I understand the core contribution correctly.

Sorry for being a bit pedantic here - it's not my intention to be a stickler and I do still think the paper is worthy of acceptance. But I think the impact of the paper can be strengthened by making sure the technical precision is as sharp as possible so that future readers can build off the foundation when extending the work (such as to multi-step interactions.)

Comment

Thank you for further engaging in the discussion. We are glad our previous responses have helped move the discussion forward. We have also observed a gap between the NLP and conventional RL communities, and we believe a tutorial on RL for LLMs from a conventional RL perspective would benefit the broader community. We respond to your points below.


1. One-step MDP setup

Your interpretation of the travel planning example is very well aligned with our intent. We will incorporate a brief one-step MDP formulation into Section 3.1 and provide a more formal treatment in an appendix that states assumptions and notation explicitly.


2. Our Method vs Model-based RL

We agree that the boundary between model-free and model-based methods can blur. Our aim is to be precise about how our approach uses the learned world model.

Model-based RL typically improves the policy using simulated experience from an internal environment model, for example through rollouts, planning, or value learning on trajectories predicted by the model. A representative example is the Dreamer V3 paper [1]. In its Section 2.2 "Critic learning", the policy networks "learn behaviours purely from abstract trajectories of representations predicted by the world model." In addition, Section 8.1 of Sutton's book also describes the core idea of model-based RL as updating the policy with "simulated experience."

On the other hand, while our LLM learns an internal environment model, the policy induced by the same LLM is trained only on transitions from the ground-truth environment. As illustrated in our Section 3.2, the policy is never updated using samples generated by the internal model and is not improved via simulated rollouts. The internal model is used only for one-step planning during inference time, while policy learning remains model-free.

We hope this clarifies the distinction between our method and model-based RL.

[1] Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104.

Official Review
Rating: 5

This paper proposes two new methods:

  1. Dynamics Modelling (DyMo): a post-training technique that teaches an LLM to predict not only the tool necessary for a given prompt but also the next environment state this call will create. DyMo works with both supervised fine-tuning (SFT) and RL.
  2. Self-Verification Sampling (SVS): an inference strategy that generates several candidates, internally simulates their outcomes with DyMo, and then executes the best one. This is useful for problems where we cannot easily use the environment to evaluate multiple candidates in order to pick the best. Additionally, SVS also allows the LLM to refuse the task, in case its internal simulation shows that it couldn't solve it.

When implemented on top of Cohere R7B, DyMo helps reduce hallucinations and improves accuracy on the Berkeley Function Calling Leaderboard V2, achieving performance comparable to GPT-4o. SVS further scales test-time compute without extra environment queries, maintaining high accuracy while also allowing the model to avoid unnecessary computational cost by refusing requests that it believes it cannot solve. Overall, the experimental results seem very solid.

Strengths and Weaknesses

Strengths

The paper is mostly well-written and quite complete. With minor exceptions, the paper is sufficiently formal, and the in-text examples help clarify the concepts (with some small caveats). The idea is also intuitively simple.

To the best of my knowledge, both contributions are also new, at least in the context of tool selection. Moreover, the results are quite convincing and the experimental evaluation is well detailed.

Weaknesses

One concern is regarding the scope of the paper. The idea seems quite general conceptually, and I would like to see how it performs in other scenarios. I understand that tool-calling was the problem that inspired both methods, but it feels that both techniques could be of use in more general cases. I missed a discussion on what would need to be adapted in potential alternative applications.

I would also like DyMo and SVS implemented on other LLMs to get a better picture. I understand that this is too much work, but I think it would be a useful addition.

I also found Figures 1 and 2 more confusing than helpful. I could not understand the details of Figure 1 until late in the text, and I did not understand the advantages of SVS from Figure 2 until later on as well.

Other comments:

  • The sentence on line 35 ("One may not...") is quite confusing. In general, the introduction has unnecessary hedging in the form of auxiliary verbs.
  • In most of the paper, you use '-' instead of '---' or '--'. I don't know which dash you want, but '-' is definitely not the one.
  • "the the" on line 209
  • Are you sure you can have coloured symbols in text?
  • Figure 5 should follow the same style as Figures 3 and 4
  • Appendix requires a spellcheck

Questions

  1. What are some other potential applications, besides tool use, of your two methods? What would need to be adapted? Any ideas on how to do so or perhaps even results?

  2. Have you implemented the two methods on top of other LLMs? If so, what are the results?

Limitations

yes

Justification for Final Rating

After the discussion and after reading the other reviews, I am still positive about this submission. Other reviewers brought up relevant concerns about the paper, but I still think that this submission has enough interesting ideas to be accepted.

Formatting Issues

None.

Author Response

We thank the reviewer for recognizing both the novelty of our method and the convincingness of our experiments.


1. Application of the proposed methods in other scenarios

Thank you for highlighting the need to discuss future directions. We are eager to apply our approach to more general tasks, since we believe it can serve as a versatile component of LLM post training. We will further elaborate on potential future work regarding the broader application of our method in the Discussion section.


2. Training more LLMs

Because our resources were limited, we focused on analysing the method in depth so that the NeurIPS community would gain clear insights and solid evidence. Rather than exploring more models, we provided a comprehensive analysis of one. Given the simplicity of our approach, we look forward to seeing other LLM teams integrate it into their post training pipelines.


3. Figures 1 and 2

We have redesigned Figures 1 and 2 to improve clarity and readability. We'll update the paper to include them once the revision portal becomes available.


4. Minor issues

Thank you for drawing attention to these details.

  • Line 35: We will revise the sentence to make the meaning clearer.
  • Hyphens: We will review the entire manuscript and ensure consistent usage.
  • Typo in line 209: Thank you for reading our paper so carefully. We'll update the paper to correct this typo.
  • Figure 3&4&5: All three figures have been updated. We'll update the paper once the revision portal becomes available.
  • Appendix: We will run a full spell check and put an extended discussion section.

Please let us know if any additional clarification would be helpful.

Comment

Thank you very much for your responses. My questions were minor (compared to the other reviews) and the topic is not exactly in my area of expertise, but I would still appreciate if you could be more precise on what other scenarios you plan to add to the discussion of potential future work.

Also, I understand the point of all other reviewers regarding single-step vs multi-step, but I can also understand your perspective that single-step is a good starting point.

I already had a positive opinion towards your submission, and I continue to do so. I will continue to follow the discussion with the other reviewers, but for now I still recommend the acceptance of your paper.

Comment

Thank you for engaging positively with our work.

Regarding future extensions, we initially considered an autonomous coding agent built on our method. We soon realised that, for resource-rich languages such as Python and Java, the abundance of existing examples would require an impractically massive dataset to show a clear improvement. A more promising direction is to apply our technique to low-resource languages like Fortran or COBOL, where each new data point can yield significant gains.

We are also experimenting with training an internal world model on Android emulators. As noted in our common reply, this setting demands abstract state representations. To achieve this, we must strip raw API outputs from the state descriptions, a challenge that remains underexplored.

Beyond these software domains, our method could benefit self-driving vehicles. A textual world model might improve motion prediction of surrounding traffic participants, which could also serve as a route to interpretability for self-driving models. The EMMA paper (End-to-End Multimodal Model for Autonomous Driving) has explored multimodal models for driving, yet translating those ideas into a language-based world model offers fertile ground for research.

Similarly, robots could use a textual world model for high-level planning. Humans perform mental simulations of complex, long-horizon tasks before acting; equipping robots with this capability would enhance planning over intricate sequences. However, achieving this will depend on abstracting away low-level dynamics to capture only the essential high-level state transitions.

Official Review
Rating: 2

This work introduces DyMo, an approach for augmenting LMs' tool-use capability in stateful environments. It trains the model to predict the state of the environment after the function call alongside the function call itself (via both SFT and RL). The authors demonstrate improved success rates and reduced hallucinations on BFCLv2. The authors further develop a test-time strategy called Self-Verification Sampling (SVS), enabling models to sample multiple action candidates, predict their outcomes internally, simulating environment behavior, and select the most reliable one without interacting with the actual environment.

Strengths and Weaknesses

Strengths

  • Well-motivated and novel idea.
  • Good result improvements on the BFCLv2 benchmark, especially on the irrelevance category.
  • Clearly written paper.

Weaknesses

Experimental setup does not faithfully represent the framing

This paper introduces a well-motivated idea of learning the environment dynamics/function itself, but the actual implementation and experimentation setup is too simplified to be actually faithful to this framing.

The core of the problem is the experiment setup or benchmark itself, BFCLv2. This benchmark only contains an evaluation of single-step function calls, i.e., given an instruction and a set of function definitions, the task is to generate single or parallel function calls in one go. In other words, it does not involve any agent-environment interaction, which is necessary for a canonical definition of an environment.

Additionally, it does not come with executable functions (even the small portion that involved executable APIs was recently removed due to instability and unusable API keys from external services), which again is necessary for learning the functions in the first place.

In the given setup, the method is effectively "only" learning the evaluation function for a non-interactive (environment-less) task and using it as a way of calibrating the model's outputs, refusing to output otherwise. This is similar to previous efforts for other non-interactive task setups.

Again, the author's proposed idea of learning the environment function is novel and interesting. However, the used benchmark is not sufficient to demonstrate the validity of the claims. I'd suggest that authors instead carry out experiments in a tool-use benchmark, like AppWorld [2], which comes with a stateful execution-based environment/functions and evaluation with tasks requiring actual agent-environment interaction. This is much better suited for demonstrating the value of this approach.

Alternatively, I suggest reframing the proposed method, not as learning the environment function, but as calibrating LMs for function calling by learning the evaluation function.

Citations suggestions:

  • [1] "Selective Question Answering under Domain Shift" for discussion around Selective QA and calibration training.
  • [2] "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents" for tool-use benchmark providing stateful execution-based environment and evaluation.

Questions

--

Limitations

yes

Justification for Final Rating

I am continuing with my original recommendation and scores. The details and reasoning are in the rebuttal response. The TLDR is that the claims about models being able to predict environment state are not true, or at least, do not fairly represent what is done in this work, and would be misleading to anyone who does not know the low-level details about the BFCLv2 benchmark.

Formatting Issues

--

Author Response

We thank the reviewer for noting both the novelty and the strong empirical performance of our work. We address your concerns point by point.


1. Benchmark insufficiency

The core of the problem is the experiment setup or benchmark itself, BFCLv2. This benchmark only contains an evaluation of single-step function calls, i.e., given an instruction and a set of function definitions, the task is to generate single or parallel function calls in one go. In other words, it does not involve any agent-environment interaction, which is necessary for a canonical definition of an environment.

First, BFCL has become a standard benchmark for agentic and tool‑use tasks, as shown by recent technical reports from Llama 3 (Section 5.2.7), Qwen 3 (Section 4.6), Command‑A (Section 4.2), and Gemini‑1.5 (Section 6.1.5).

Next, we show how our experimental setup meets the canonical definition of an environment within RL, in two steps: (i) a multi‑armed bandit is already recognised as a legitimate environment; (ii) our setting is strictly richer than a bandit, and therefore qualifies as an environment as well.

i. Bandit as environment: Section 1.1 of Bandit Algorithms calls a bandit problem “a sequential game between a learner and an environment,” where the environment merely “reveals a reward” for a given action. The preamble to Part I of Reinforcement Learning: An Introduction likewise states that bandit problems correspond to the special case with a single state (“for the special case of the reinforcement learning problem in which there is only a single state, called bandit problems.”). Both sources regard a bandit as the simplest possible environment.

ii. Why our setup is richer: A bandit has only one state, whereas our setting contains many. As defined in line 107, we treat the entire prompt $x$ as an input state, combining the available tool list with the user request. Even for the same request, the tool set may change, so the input state varies. The resulting states vary as well. Because our setup generalises the bandit by allowing a nontrivial state space, it fits properly within the RL notion of an environment.

Fun fact: we asked several leading LLMs including ChatGPT, Grok, Claude, and Gemini whether a bandit counts as an environment, and they all answered yes, describing it as a trivial one.


2. The single‑turn, single‑step setup (common reply)

We agree that multi-turn and multi-step tool-use is certainly attractive. However, we believe that proving and understanding the effectiveness of single-turn is a necessary prerequisite for extending the proposed method. Therefore, we chose to maintain a clean setup to deeply understand the capability of LLMs to model the single-step environments.

  • Multi-turn: Modelling a multi‑turn conversation requires an LLM to represent not only the environment but also the users. This dual modelling problem is almost unexplored, so isolating environment modelling first allows clearer insight.

  • Multi-step: Multi-step interactions arise when an LLM must perform several successive actions in response to a single user prompt. Because these tasks decompose into a sequence of single steps, a firm grasp of the single-step setting is essential. At the same time, extending our method to multi-step settings requires an abstract, not exact, state representation. During long-horizon planning, overly specific or slightly incorrect details in a predicted state can actually hinder the model, so predicting the state type might be enough. In fact, we believe that humans act the same way: when we plan a long-horizon task, we picture a coarse pipeline rather than every concrete intermediate result. How to build such abstract/coarse representations for LLMs is still a largely unexplored challenge. Moreover, multi-step scenarios also introduce orthogonal methods such as ReAct, Reflexion, Self-Refine, and Tree of Thoughts. Given our limited time and resources, we focused on a narrower scope that still provides actionable insights for the community.

  • Planned updates: We will expand Section 5 with a discussion of current limitations and outline future works on multi‑turn, multi‑step extensions to address the reviewer’s concerns more fully.


3. Executable functions

Table 1 shows that we cover the “Exec” category, which includes real‑time Python execution and real‑time REST API calls. For the exact data composition, please see the original BFCL-V1 and BFCL-V2 blog posts. As you noted, "a small portion that involved executable APIs was recently removed", which actually confirms that our dataset does contain executable functions.


4. Evaluation function vs. environment function

We are not certain what is meant by “evaluation function” and “previous efforts” in this context. Could you specify how you define these terms? A precise definition of “evaluation function” and “environment function” would greatly help our communication.

Comment

As further support, Ouyang et al. [1] explicitly describe their approach as RL in a “bandit environment,” a framing that has been widely accepted and adopted by the community.

[1] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35, 27730-27744.

Official Review
Rating: 4

In this paper, the authors address the challenge of Large Language Models (LLMs) using tools in stateful environments, where actions are often irreversible (e.g., making a payment, booking a flight). This context renders traditional trial-and-error methods impractical. To solve this, the paper introduces two core contributions:

  1. DyMo teaches an LLM not only to act (generate a tool call) but also to predict the "subsequent state" of the environment after that action is executed.

  2. SVS: This is a new inference-time strategy summarized as "Sample, Predict, then Proceed". When given a task, the model first generates k candidate tool calls. It then uses its internal environment model, learned via DyMo, to predict the outcome of each candidate without querying the real environment. Finally, it uses a scoring function to select and execute only the most reliable option.

In essence, DyMo can be thought of as a simplified world model tailored for tool use. It allows the agent to leverage this internal model for planning before executing actions in the real environment. The combination of DyMo and SVS provides an innovative and scalable framework for LLMs to use tools more safely and reliably in dynamic, real-world environments.

Strengths and Weaknesses

Pros: The design of DyMo is very sound. Its use of both successful and failed run-logs is a key insight, essential for building a comprehensive and robust internal model of the environment. SVS strategy is an intelligent inference-time optimization. It mitigates real-world risk by performing "mental simulation" to evaluate potential actions, avoiding the cost and irreversible consequences of direct trial-and-error in the live environment.

Cons: The paper's focus on single-turn interactions is a reasonable starting point, but the true power of LLM agents is often realized in complex, multi-turn tasks. This raises a crucial question about the tractability of learning the dynamics model in more realistic settings. For the specific tool-use scenarios in the paper, the model appears learnable. However, as the complexity of the environment grows, can we still expect the LLM to reliably predict the next state or reward? The potential for a combinatorial explosion in the state space and the accumulation of predictive errors across multiple steps pose significant hurdles that warrant further investigation.

Questions

See my comments on pros and cons

Limitations

Yes

Justification for Final Rating

This paper addresses the critical challenge of enabling Large Language Models (LLMs) to use tools safely in stateful environments, where actions can be irreversible. The proposed solution is an elegant framework combining two core contributions:

DyMo: A training method that teaches an LLM not only to generate a tool call but also to predict the environment's subsequent state after the action. This effectively equips the model with a simplified, internal world model tailored for tool use.

SVS (Sample, Predict, then Proceed): An inference-time strategy where the model first generates several candidate actions. It then uses its DyMo-trained internal model to simulate the outcome of each candidate without acting in the real world. Finally, it selects and executes only the most reliable option based on a scoring function.

The combination of DyMo and SVS provides an innovative and practical approach for safer tool use, allowing the agent to "look before it leaps." This is a significant step toward deploying LLM agents in real-world, high-stakes scenarios.

Regarding the discussion, I acknowledge the valid concern raised by Reviewer wJXU about BFCL. However, I do not believe this point fundamentally undermines the paper's primary contribution. The core idea of leveraging an internal world model for pre-execution simulation is novel and holds significant value independently.

Formatting Issues

No

Author Response

We thank the reviewer for highlighting the soundness, originality, and especially the scalability of our work, which makes it a natural fit for modern LLM development. Our point-by-point responses are as follows.


1. Applying the proposed method in more complex environments

Thanks for asking this question. We'd like to start from the key element of modern LLMs' success: scaling! Over the past three years the community has systematically scaled up (i) the volume of training data, (ii) the number of parameters, and (iii) test-time compute. Future work will surely introduce new axes, but these three already illustrate why our method scales well.

i. Data: Production systems continuously accumulate run-logs. As argued in the paper, these logs form a valuable training corpus for an LLM's internal world model. In a complex environment, we believe our method would remain effective, as it benefits from the run-log data that the environment keeps generating. In complex environments, producing annotated/synthesised gold trajectories for supervised learning grows increasingly difficult. By contrast, our approach can benefit from failed trials to refine internal world models, a capability that constitutes one of its key advantages (a sketch of this run-log-to-training-data idea is given below, after point iii).

ii. Model size: An 8B-parameter model trained with our approach already achieves a strong success rate, and larger models can be expected to perform even better. Resource limits prevented us from training a bigger model for this study, yet virtually every domain has shown a clear benefit from parameter scaling, and our method is fully compatible with it. We're happy to include a discussion of model scaling as part of future work in Section 5.

iii. Test-time compute: Our SVS method is designed precisely to increase test-time compute, so it is inherently aligned with this scaling axis.

To sum up, we believe that more data, larger models, and more test-time compute all reinforce the effectiveness of our approach in high-complexity environments.
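To make point (i) concrete, below is a minimal sketch of how run-logs could be converted into world-model training examples; the field names are hypothetical and not our actual data schema:

```python
# Hedged sketch of converting run-logs into world-model training examples;
# the field names are hypothetical and not our actual data schema.
def log_to_example(log_entry):
    prompt = log_entry["rendered_prompt"]   # x: tool list + user request
    tool_call = log_entry["tool_call"]      # y: the action that was actually executed
    outcome = log_entry["engine_result"]    # z: pass/error token + result message
    # Both successful and failed trials supervise predicting z given (x, y);
    # only successful trials (or the binary RL reward) supervise y given x.
    return {"input": prompt + "\n" + tool_call, "target": outcome}
```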


2. From single-step, single-turn to multi-step, multi-turn (common reply)

We appreciate the reviewer’s acknowledgement that single‑turn interactions are a reasonable starting point. We agree that multi-turn and multi-step tool-use is certainly attractive. However, we believe that proving and understanding the effectiveness of single-turn is a necessary prerequisite for extending the proposed method. Therefore, we chose to maintain a clean setup to deeply understand the capability of LLMs to model the single-step environments.

  • Multi-turn: Modelling a multi‑turn conversation requires an LLM to represent not only the environment but also the users. This dual modelling problem is almost unexplored, so isolating environment modelling first allows clearer insight.

  • Multi-step: Multi-step interactions arise when an LLM must perform several successive actions in response to a single user prompt. Because these tasks decompose into a sequence of single steps, a firm grasp of the single-step setting is essential. At the same time, extending our method to multi-step settings requires an abstract, not exact, state representation. During long-horizon planning, overly specific or slightly incorrect details in a predicted state can actually hinder the model, so predicting the state type might be enough. In fact, we believe that humans act the same way: when we plan a long-horizon task, we picture a coarse pipeline rather than every concrete intermediate result. How to build such abstract/coarse representations for LLMs is still a largely unexplored challenge. Moreover, multi-step scenarios also introduce orthogonal methods such as ReAct, Reflexion, Self-Refine, and Tree of Thoughts. Given our limited time and resources, we focused on a narrower scope that still provides actionable insights for the community.

  • Planned updates: We will expand Section 5 with a discussion of current limitations and outline future works on multi‑turn, multi‑step extensions to address the reviewer’s concerns more fully.

Comment

Thank you for your additional explanation. I'll maintain my score.

Final Decision

This paper introduces DyMo (Dynamics Modeling) and SVS (Self-Verification Sampling) for improving LLM tool use in stateful environments. DyMo trains models to predict both tool calls and subsequent environment states, while SVS uses test-time sampling to select the most reliable actions without environment interaction. The work demonstrates solid empirical results on BFCL-V2, with reviewers acknowledging the novelty and practical value of the approach. Key strengths include the well-motivated design, strong experimental performance achieving GPT-4o level results, and the scalability of the framework for modern LLM development. However, initial weaknesses emerged around experimental validity and scope limitations. The primary concern centers on whether BFCL-V2 adequately represents true stateful environments, with critics arguing it functions more as evaluation/calibration rather than genuine environment state prediction. Additional limitations include restriction to single-step interactions, evaluation on only one model (Cohere R7B), and questions about generalizability to more complex multi-step scenarios.

The rebuttal period generated substantial discussion around the core validity concerns. Reviewer wJXU maintained strong objections about BFCL-V2 not representing true stateful environments, arguing the method learns evaluation functions rather than environment dynamics, and that claims about predicting environment states are misleading. The authors defended their approach by framing it within bandit settings as legitimate RL environments and highlighting that BFCL contains executable functions, but wJXU remained unconvinced and kept their reject recommendation. Reviewers DScQ and Jbjy acknowledged wJXU's concerns as valid and relevant, agreeing they should be addressed in the paper, but both maintained their acceptance recommendations, viewing the contribution as sufficiently valuable despite the limitations. DScQ appreciated the additional future work directions provided by authors, while Jbjy engaged in detailed technical discussions about RL framing and requested more precision in problem formulation. Reviewer BsHi maintained their borderline accept position throughout. The borderline discussion reflects this split, with the decision to reject weighing the legitimate concerns about experimental scope and validity of claims against the paper's novel insights and practical contributions.