Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL
Abstract
Reviews and Discussion
This paper proposes leveraging goal-conditioned value functions to guide the reasoning process of LLM agents, with a framework that scales to large API-based models. During the offline training phase, training samples are summarized and embedded to train a lightweight goal-conditioned Q-function. At inference time, a natural language critic utilizes this value function to generate feedback in natural language, helping the LLM agent refine its reasoning. Extensive experiments across diverse benchmarks demonstrate the method’s effectiveness compared to traditional RL training and task-specific prompting approaches.
Strengths and Weaknesses
Pros
- The paper is clearly written and easy to follow.
- The proposed method is evaluated on three datasets spanning different domains and compared against a variety of baselines, demonstrating strong performance.
- The method shows promising efficiency.
Cons
- The authors claim that their method benefits from training a goal-conditioned Q-function to score the likelihood of different future outcomes. A natural baseline would be to directly prompt an LLM (e.g., GPT-4) to predict probabilities of future outcomes while keeping all other components unchanged, to verify whether the trained Q-function provides an advantage.
- The experiments exclusively use GPT-4 as the decision-making LLM. The framework should also be evaluated on open-source models such as LLaMA or Qwen to better validate its generality—especially given that RL-based baselines use smaller models like GPT-2.
- The authors should include ablation studies on the number of goals and the number of refinement stages to analyze their individual contributions.
- Recent reasoning benchmarks such as AIME, GPQA, and Humanity's Last Exam also require multi-step reasoning, which aligns well with the paper's focus on long-horizon reasoning tasks where the goal corresponds to the final answer. This framework could achieve greater impact by demonstrating improvements on these more challenging reasoning benchmarks.
Questions
See Cons above
Limitations
yes
Final Justification
The authors propose training a goal-conditioned Q-function to evaluate the likelihood of various future outcomes. While the idea is promising, the rebuttal lacks a thorough analysis of key components. For example, the authors do not quantify how the choice of hyperparameters m and n affects downstream performance, nor do they explore the applicability of the method beyond the simple agent planning tasks discussed in the paper. Furthermore, although additional experiments using 32B and 7B models are presented, these models are finetuned on QwQ-32B and Qwen2.5-7B, respectively, both of which are highly optimized. In contrast, the RL baselines rely on the GPT-2 series, raising fairness concerns due to the substantial disparity in base model capacity. Addressing these limitations would greatly enhance the paper. Therefore, I believe the current score appropriately reflects the paper’s strengths and weaknesses, and I encourage the authors to include these additional analyses in future revisions.
Formatting Issues
no
Thank you for your suggestions and comments. We have used your suggestions to strengthen the message of our paper through the addition of new baselines. We will also address your points individually below.
Comparison to new baselines
You suggested two important modifications to compare against: (1) using LLMs to score the goals without any training, and (2) using smaller LLMs as the backbone. To this end, we have added a comparison to ReAct+Replan, which keeps our inference procedure but uses GPT-4 to automatically score the likelihood of goals, as well as versions of PNLC using two different, smaller open-source models. See our discussion below that describes the new baselines in more detail and discusses results.
Studies on n and m
We agree that this would be valuable to do. We chose m=2 because we found that additional steps of refinement were often not necessary for our tasks, and n=4 was chosen to contain a mixture of both positive and negative goals. Overall, we wanted to keep the inference time comparable to that of the fast GPT-4-based versions of the baselines we compare against. However, we will include a section in our updated paper where we vary n and m separately to show how our method scales with these two hyperparameters.
Reasoning benchmarks
It is true that there exist many reasoning benchmarks that require multi-step reasoning. However, we would like to make a distinction between them and the multi-step problems that we consider. Namely, reasoning tasks such as math or question-answering are still single-step tasks, even if that single step may involve multiple "steps" of reasoning, because in the end one action is generated before a reward is received. In single-step tasks, LLMs have proven to be relatively effective at judging the quality of responses a priori (without training). However, in long-horizon LLM agent tasks, LLMs would need to predict multiple steps into the future, which they are not nearly as competent at. This is illustrated when we evaluate the ReAct+Replan baseline, where we see that even frontier LLMs are not good at assessing the long-term impact of actions.
New Baselines
After compiling the suggestions across multiple reviewers, we decided to add the following new baselines:
PNLC-7B; PNLC-32B-reasoning: These are two variations of PNLC where we replace the GPT-4 backbone LLM with smaller models, namely Qwen2.5-7B and QwQ-32B. PNLC-7B is meant to be more directly comparable to baselines such as Agent Q, which uses a similarly sized model. PNLC-32B-reasoning uses a substantially bigger model that was pretrained to produce reasoning traces; however, it is still much smaller than frontier LLMs, and we use it as a middle ground. It is important to note that in either case, we do not perform any fine-tuning on the backbone LLM, in contrast to approaches such as Agent Q. We believe a major strength of our approach is its lightweight nature, as many practitioners lack the resources to directly fine-tune LLMs as policies.
ReAct+Replan: This method acts as an ablation of our approach, where the same self-refinement inference procedure using subgoals is preserved, but the scores of subgoals are directly inferred by the base LLM rather than learned via offline goal-conditioned RL. In this method, a thought and action are generated by a ReAct agent, the agent then assesses its action by generating n=4 positive and negative subgoals and scoring their likelihood, and finally the agent is allowed to refine its initial action up to m=2 times (a rough sketch of this loop is given below).
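A rough sketch of this shared score-and-refine loop (used both by ReAct+Replan and, with a learned value function in place of LLM self-scoring, by PNLC) is shown below. All helper functions (generate_thought_action, propose_subgoals, score_goal, format_feedback, refine), the is_positive flag on subgoals, and the threshold-based acceptance rule are hypothetical stand-ins for illustration, not the exact components used in the paper.

```python
# Illustrative sketch of the n-subgoal / m-refinement inference loop.
# In ReAct+Replan, score_goal prompts the base LLM for a likelihood;
# in PNLC, it queries the learned goal-conditioned value function.
def act_with_refinement(llm, state, score_goal, n=4, m=2, accept_threshold=0.5):
    thought, action = generate_thought_action(llm, state)   # initial ReAct step
    for _ in range(m):
        # Imagine n hypothetical positive and negative future outcomes (subgoals).
        goals = propose_subgoals(llm, state, thought, n=n)
        scores = {g: score_goal(state, thought, g) for g in goals}
        # If the positive outcomes already look likely enough, keep the action.
        if max(scores[g] for g in goals if g.is_positive) >= accept_threshold:
            break
        # Otherwise, feed the scored outcomes back as natural-language feedback
        # and let the LLM refine its thought and action.
        feedback = format_feedback(scores)
        thought, action = refine(llm, state, thought, feedback)
    return thought, action
```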
Due to time constraints, we only present results for the persuasion task here, but we will update with results for WebShop and Avalon.
ReAct+Replan: 0.62
PNLC-7B: 0.56
PNLC-32B-reasoning: 0.76
PNLC (ours): 0.87
From the results, we see that model size clearly plays a role in the overall performance of our approach PNLC. However, even with medium-sized reasoning models, our method achieves performance competitive with LATS and GDP-Zero, which use frontier LLMs. Furthermore, we would like to stress that our method circumvents training the LLM, so it is relatively cheap and lightweight to use frontier LLMs instead. In fact, the only advantage of using smaller models is lower inference time, as training is not impacted at all by using a larger model.
In addition, we see that although ReAct+Replan performs better than PNLC-7B, it does not come close to the performance of our full method. This highlights that our offline training is important, and that even frontier LLMs are incapable of providing accurate scores for goals in long-horizon tasks a priori. Qualitatively, we see that even frontier LLMs provide uncalibrated, often overly optimistic scores, such that they rarely deem their own actions to be suboptimal. As a result, the Replan stage often does not accomplish anything beyond confirming that the initial action was good.
We show an example of this below:
State:
AI: Hello! Have you heard of Save the Children?
Donor: Yes. I know of that charity.
AI: Great! Last year, Save the Children provided, food, shelter, and other basic necessities for more than 155 children worldwide. As I'm sure you already know, we rely on donations to provide that help.
Donor: Great. But I am not interested in helping you.
AI: I understand your hesitation. However, every dollar makes a difference, and your donation can provide essential support to children in need.
Donor: I'm skeptical of donations. I think it's a waste of money.
Thought:
The donor is skeptical of the effectiveness of donations. I should address this skepticism by providing concrete evidence of the charity's impact.
Evaluation:
The donor expresses interest in learning more about how their donation would be used and asks for specific examples of impact. (Score: 0.6)
The donor acknowledges the information but remains unconvinced, stating they need more time to consider or research further. (Score: 0.4)
The donor brings up a new, unrelated objection. (Score: 0.2)
Refined Thought:
I should emphasize how donations are used efficiently and effectively to achieve specific, measurable outcomes.
Thank you for your rebuttal. The authors primarily added comparisons against ReAct+Replan and versions of PNLC using smaller models, concluding that with a 32B model, PNLC can outperform the baseline. While this highlights PNLC's potential, further analysis is still needed. Additionally, the paper would benefit from ablation studies on the hyperparameters n and m, as well as an extension of the downstream tasks beyond the agent planning domain. I recommend that the authors include these results in the next revision to strengthen the contribution.
Thank you for the response! To add on to the results we provided earlier, we have included results for the additional baselines in the other two experimental domains. Due to time constraints, we did not evaluate the PNLC-32B-reasoning baseline, but we found that the performance of PNLC-7B was closer to that of our proposed PNLC than in the persuasion task (probably because the two other tasks are less natural and benefit less from existing knowledge in LLM pretraining), so we do not anticipate drastically different performance.
WebShop (Score/Success Rate):
ReAct+Replan: 53.8/25.0
PNLC-7B: 74.8/47.0
PNLC (ours): 78.2/48.0
Avalon:
ReAct+Replan: 27.0
PNLC-7B: 41.0
PNLC (ours): 48.0
As evidenced in the results, using the LLM to plan without seeing any true trajectories (either from training or from search during inference) performs poorly. This is likely because these benchmarks require generations (tool calls or game actions) that look less natural than textual utterances. In such environments, having the LLM plan can even be detrimental, as we see that performance dropped by a marginal amount in the WebShop task. This acts as additional evidence that some training is required for LLM agents to perform well.
Furthermore, your comment on investigating how our method scales with n and m is important; in the tasks we consider, we found that increasing n beyond 1-2 positive and negative goals each, or m beyond a single revision, was largely unhelpful, but we will perform a more holistic analysis in our revised work.
In this paper, the authors propose to use goal-conditioned value functions to guide the long-horizon planning of LLM agents in domains like tool use and social deduction. Specifically, they train a lightweight (2-layer) goal-conditioned value function using offline reinforcement learning and use it to guide the LLM policy's actions during inference -- by (1) scoring its actions and (2) feeding the scores back to enable the LLM to self-correct. The authors demonstrate superior performance across a range of long-horizon tasks compared to established baselines like ReAct and Reflexion. Moreover, the authors show that their method performs better than, and is more efficient (during inference) than, baselines that integrate search (like MCTS) with LLMs.
Strengths and Weaknesses
Strengths:
- The paper is generally well-written and easy to follow.
- The authors compare their method against a wide range of baselines, including both general (e.g., ReAct, Reflexion) and domain-specific methods (e.g., AgentQ), and show strong performance improvements.
Weaknesses:
- Dependence on specially prepared offline data. The method relies on offline data that includes thoughts and step-level goals, which don’t exist by default. These annotations must be generated using a large language model (here, GPT-3.5), adding a significant preprocessing step.
- Domain-specific training required. In L283-284, the authors state: "However, it is important to note that such methods use domain-specific prompting mechanisms and do not naively generalize to other tasks." But your proposed method also requires training with domain-specific data. The authors state that, in the future, one could use the reasoning abilities of LLM agents. Wouldn't this be the same as the Self-Plan baseline? Moreover, fine-tuning them on offline data would be even more expensive -- going against the very motivation of the paper. This looks like a chicken-and-egg dilemma.
- Lack of more targeted ablations. The ablation results show that using step-level goals to compute value and feeding it back into the policy LLM gives a large boost. To this end, it would be useful to see if similar improvements occur when adding step-level goals to existing baselines like ReAct, Self-Plan, or Reflexion. For example, prompt ReAct to generate a step-level goal along with the thought, or in Self-Plan, generate multiple responses with goals before selecting the best one -- this would essentially be the same as the future work that the authors propose to address the domain-specific pretraining of the value function. A similar variant could be used for Reflexion. Finally, the proposed method bears a high overlap with heuristic planning with LLMs like [3], which uses a pretrained value function (Pay) to guide the search. This could be a potential baseline, too.
- Time comparison missing. There is only a brief mention of runtime trade-offs (lines 286–288), but no clear analysis. It would be helpful to compare AgentQ with Mixtral-7B using MCTS with n=5 or GPT-4 generating 5 responses every time-step. Note that the proposed method also requires (1) to augment the offline data by running a GPT-3.5 model for thought and step-level goal-generation, (2) to compute embeddings of the (s, a, g) pairs using another LLM.
[1] SayCanPay: Heuristic planning with large language models using learnable domain knowledge, Hazra et al., AAAI 2024
Questions
Questions:
- What is SR in Tables 1,2?
Major Points:
- The authors have cited the wrong paper for their offline RL approach (Reference 12). It is crucial that the authors pay attention to this. The correct Implicit Q-Learning paper is [2].
- Please clearly define the notations used in Lines 162–163 -- like the distinction between and is unclear. Without a precise definition or explanation of these terms, the paper is not self-contained.
Minor points:
- L233: natural language value -> natural language string
- In general, it would be useful for the reader if the authors clearly state that every rollout/trajectory also has an overall task/goal that is then further subdivided into sub-goals. A GA-MDP formulation could be used for the same [3]
[2] Offline reinforcement learning with implicit q-learning, Kostrikov et al., 2021
[3] Goal-Conditioned Reinforcement Learning: Problems and Solutions, Liu et al., IJCAI 2022
Limitations
Please see Weaknesses above.
Final Justification
The authors have provided a very convincing rebuttal, and the new experiments largely address my main concerns. I also find the contribution significant in showing the community how to effectively use offline data to train models using offline RL methods which haven't been explored in the context of LLMs.
Formatting Issues
No.
Thank you for your detailed review. We aim to address your concerns with the work below, as well as provide additional baselines to strengthen the message of our paper.
Dependence on specially prepared offline data
It is true that we rely on offline data to train our method, but the data can be collected by any ReAct agent that generates both thoughts and actions. We would like to note that the goals are obtained simply from taking prefixes of the collected trajectories, and do not need a specialized model.
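To illustrate how such training tuples can be built from ordinary ReAct rollouts without a specialized model, here is one plausible construction, where a goal for a given step is the (summarized) state at the end of a longer prefix of the same trajectory, and goals drawn from other trajectories serve as negatives. The relabeling scheme, the summarize() helper, and the tuple format are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical construction of goal-conditioned training tuples from offline
# ReAct trajectories. summarize() stands in for whatever turns a raw state
# into a short natural-language goal description; the positive/negative
# sampling scheme is an illustrative guess.
import random

def make_goal_tuples(trajectories, summarize, k_max=5):
    tuples = []
    for i, traj in enumerate(trajectories):          # traj: list of (state, thought) pairs
        for t, (state, thought) in enumerate(traj[:-1]):
            # Positive goal: a state actually reached later in this trajectory.
            k = random.randint(1, min(k_max, len(traj) - 1 - t))
            pos_goal = summarize(traj[t + k][0])
            tuples.append((state, thought, pos_goal, 1.0))
            # Negative goal: a state from some other trajectory.
            j = random.choice([j for j in range(len(trajectories)) if j != i])
            neg_goal = summarize(random.choice(trajectories[j])[0])
            tuples.append((state, thought, neg_goal, 0.0))
    return tuples
```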
The method that we propose is a lightweight training method: lightweight in the sense that we do not directly fine-tune an LLM policy. However, it is still a training method like ArCHer or Agent Q that we compare against, and any training method relies on some way of collecting trajectories to train on. We would argue that the fact that our method is offline makes it less specialized than other training approaches that rely on on-policy data collected by the LLM policy being trained. The only methods that do not require any collected data would be prompting approaches such as ReAct, which we show are insufficient for complex, long-horizon tasks.
Domain-specific training required
When we say that our method generalizes to other tasks better than other baselines such as Agent Q or GDP-Zero, what we mean is that our method does not have any fundamental constraints that prohibit it from being used in other tasks. For example, GDP-Zero requires a taxonomy of dialogue strategies, and therefore cannot be generalized to tasks with action spaces beyond dialogue utterances.
It is true that our method requires task-specific data to be collected. However, this is true for any method that relies on some fine-tuning and not just prompting. We compare against strong prompting baselines and show that just prompting is insufficient to solve complex long-horizon tasks, so we think it is fair to propose a method that requires some form of training. We argue that out of all possible approaches that require some training, ours is particularly lightweight as (1) we do not directly train the LLM policy, which is expensive to do, and (2) we only need offline data from any ReAct policy and not on-policy data from the policy being trained. It is also important to note that the best-performing prompting baselines Reflexion and LATS collect on-policy trajectories to be used as prompts. Hence, most baselines that we compare against use task-specific data in some form.
Lack of more targeted ablations
This is an excellent point. To this end, we added an additional baseline called ReAct+Replan where an LLM self-scores the likelihood of reaching goals, without any offline training. We see that in long-horizon tasks, the LLM is often overly optimistic and scores suboptimal actions too highly. See our discussion below that describes the new baseline in more detail and discusses results.
Time comparison missing
We agree and will spend more effort comparing the inference times of all approaches. You are correct that a single inference step via our method requires multiple calls to an LLM, specifically to (1) generate goals, (2) compute embeddings, (3) refine the initial action. Overall, we find our approach takes similar time to the (n=5) version of the prompting baselines using GPT-4. Because Agent Q uses a small model that can be hosted locally, it takes noticeably less time than ours during inference (1.5s vs 6s). We also add comparisons of versions of our method PNLC using smaller models. See the discussion below for description of these new baselines and discussion of results.
Answer to question
SR is short for success rate, or the percentage of times that a maximum score was achieved. We will make this more clear in our updated paper.
New Baselines
After compiling the suggestions across multiple reviewers, we decided to add the following new baselines:
PNLC-7B; PNLC-32B-reasoning: These are two variations of PNLC where we replace the GPT-4 backbone LLM with smaller models, namely Qwen2.5-7B and QwQ-32B. PNLC-7B is meant to be more directly comparable to baselines such as Agent Q, which uses a similarly sized model. PNLC-32B-reasoning uses a substantially bigger model that was pretrained to produce reasoning traces; however, it is still much smaller than frontier LLMs, and we use it as a middle ground. It is important to note that in either case, we do not perform any fine-tuning on the backbone LLM, in contrast to approaches such as Agent Q. We believe a major strength of our approach is its lightweight nature, as many practitioners lack the resources to directly fine-tune LLMs as policies.
ReAct+Replan: This method acts as an ablation of our approach, where the same self-refinement inference procedure using subgoals is preserved, but the scores of subgoals are directly inferred by the base LLM rather than learned via offline goal-conditioned RL. In this method, a thought and action are generated by a ReAct agent, the agent then assesses its action by generating n=4 positive and negative subgoals and scoring their likelihood, and finally the agent is allowed to refine its initial action up to m=2 times.
Due to time constraints, we only present results for the persuasion task here, but we will update with results for WebShop and Avalon.
ReAct+Replan: 0.62
PNLC-7B: 0.56
PNLC-32B-reasoning: 0.76
PNLC (ours): 0.87
From the results, we see that model size clearly plays a role in the overall performance of our approach PNLC. However, even with medium-sized reasoning models, our method achieves performance competitive with LATS and GDP-Zero, which use frontier LLMs. Furthermore, we would like to stress that our method circumvents training the LLM, so it is relatively cheap and lightweight to use frontier LLMs instead. In fact, the only advantage of using smaller models is lower inference time, as training is not impacted at all by using a larger model.
In addition, we see that although ReAct+Replan performs better than PNLC-7B, it does not come close to the performance of our full method. This highlights that our offline training is important, and that even frontier LLMs are incapable of providing accurate scores for goals in long-horizon tasks a priori. Qualitatively, we see that even frontier LLMs provide uncalibrated, often overly optimistic scores, such that they rarely deem their own actions to be suboptimal. As a result, the Replan stage often does not accomplish anything beyond confirming that the initial action was good.
We show an example of this below:
State:
AI: Hello! Have you heard of Save the Children?
Donor: Yes. I know of that charity.
AI: Great! Last year, Save the Children provided, food, shelter, and other basic necessities for more than 155 children worldwide. As I'm sure you already know, we rely on donations to provide that help.
Donor: Great. But I am not interested in helping you.
AI: I understand your hesitation. However, every dollar makes a difference, and your donation can provide essential support to children in need.
Donor: I'm skeptical of donations. I think it's a waste of money.
Thought:
The donor is skeptical of the effectiveness of donations. I should address this skepticism by providing concrete evidence of the charity's impact.
Evaluation:
The donor expresses interest in learning more about how their donation would be used and asks for specific examples of impact. (Score: 0.6)
The donor acknowledges the information but remains unconvinced, stating they need more time to consider or research further. (Score: 0.4)
The donor brings up a new, unrelated objection. (Score: 0.2)
Refined Thought:
I should emphasize how donations are used efficiently and effectively to achieve specific, measurable outcomes.
Thank you, authors, for answering my questions so comprehensively. I find your arguments convincing that the requirement for domain-specific and specially prepared data is both necessary and justified by the perceived improvements. I also appreciate the ReAct + Replan baseline results.
Regarding my earlier question ("Finally, the proposed method bears a high overlap with heuristic planning with LLMs like [1] that uses a pretrained value function (Pay) to guide the search."), could the authors compare and show how the two are different? I realized I made a mistake in pointing to the exact reference -- ironically, after noting a similar issue earlier. Apologies.
[1] SayCanPay: Heuristic planning with large language models using learnable domain knowledge, Hazra et al., AAAI 2024
Thank you for pointing out the additional method SayCanPay; it is definitely a very relevant baseline! We would like to point out that we do evaluate our own heuristic planning baseline in the ablation studies, namely one where a scalar Q-value is directly learned over high-level actions and used to filter actions, without any conversion to a textual description using our proposed subgoal learning. This learned Q-value is effectively the same as that learned in the "Pay" component of SayCanPay, with one minor distinction being that we use Q-learning whereas SayCanPay directly uses the Monte-Carlo returns, which should not substantially affect performance.
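To make that distinction concrete, the sketch below contrasts the two value-learning targets; it is purely illustrative (hypothetical function names and tensor layouts) rather than the implementation of either SayCanPay's "Pay" estimator or our Q-learning baseline.

```python
# Illustrative contrast between Monte-Carlo return targets (SayCanPay-style
# "Pay") and one-step bootstrapped Q-learning targets. Shapes and names are
# assumptions for the sketch.
import torch

def monte_carlo_targets(rewards, gamma=0.99):
    """Discounted return-to-go for each step of one trajectory (rewards: list of floats)."""
    targets, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    return torch.tensor(list(reversed(targets)))

def q_learning_targets(rewards, next_values, dones, gamma=0.99):
    """One-step TD targets that bootstrap from a (frozen) estimate of the next value."""
    return rewards + gamma * (1.0 - dones) * next_values
```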
In our heuristic planning baseline, we see a stark drop in performance. We believe this is because, although scalar values are sufficient in theory, in practice they are very difficult to learn accurately in LLM agent tasks. Hence, a major hypothesis we adopt in our work is that when scalar values are hard to learn, values expressed as textual descriptions are greatly preferable due to their interpretability and flexibility (the LLM agent can choose to use the information provided by the textual description however it likes).
One thing that SayCanPay provides on top of our evaluated baseline is the "Can" component that is trained on expert trajectories. However, we do not assume access to any expert trajectories, which makes this component difficult to adapt to our experimental setup. Nevertheless, we will evaluate an adaptation of SayCanPay, where the "Can" component is learned using only the high-reward trajectories from our offline dataset. We will include such a comparison in our revised work.
Thank you, I'm largely satisfied with the answers and find the new updates very convincing. I'm hoping that the authors would add the baseline too. Thus, I've raised my score. All the best to the authors!
This paper introduces Planning with a Natural Language Critic (PNLC), a scalable approach to enhance the reasoning and planning capabilities of LLM agents in multi-turn, interactive settings without requiring fine-tuning of the base model. PNLC leverages offline goal-conditioned RL to train lightweight value functions that predict the likelihood of achieving positive or negative outcomes based on proposed high-level strategies. These value functions act as natural language critics during inference, enabling LLM agents to refine their reasoning by evaluating hypothetical future scenarios without the need for expensive inference-time search. The method outperforms state-of-the-art approaches across benchmarks in web navigation, social deduction games, and persuasive dialogue tasks, while requiring significantly lower computational costs. PNLC’s modular design makes it scalable to frontier LLMs like GPT-4, but the approach is limited by the need for task-specific value functions and reliance on the base LLM's domain knowledge.
Strengths and Weaknesses
Strengths:
- Scalability to Frontier LLMs: The method scales effectively to advanced models like GPT-4, which are typically constrained in their ability to perform RL fine-tuning due to API and computational limitations.
- Strong Performance Across Tasks: PNLC demonstrates consistent success across diverse tasks requiring interactive decision-making and reasoning, such as web navigation, social deduction, and persuasion.
Weaknesses:
- Missing Baselines and Limited Critic Design: PNLC employs a lightweight two-layer perceptron for goal-conditioned value prediction, which may have limitations in capturing complex relationships. Exploring the use of LLMs as critic agents in a zero-shot setting could provide valuable insights into performance improvements.
- Structured Subgoal Reflection: The generation of subgoals during inference appears to mimic a manually-designed self-reflection process. Including experiments to directly compare PNLC with a baseline that uses these structured subgoals as self-reflection materials would strengthen the paper's claims and highlight the added benefits of PNLC's approach.
Questions
See above
Limitations
yes
Final Justification
Although the new baselines reflect the potential of the PNLC method, they cannot strictly be counted as ablations of it. So I decided to keep the same score.
Formatting Issues
None
We would like to thank the reviewer for their insightful comments. We address points that the reviewer raised below, particularly in including additional baselines to compare against.
Limited Critic Design
We agree that an LLM as a critic would allow for potential improvements. A major goal of our approach was to be lightweight, and not require any RL fine-tuning of LLMs. Hence, we decided that our method would work over embeddings of existing LLMs to reduce the need for large-scale training.
Missing Baselines
We believe you made a good point that we should additionally evaluate a baseline that mimics our subgoal inference procedure. To this end, we added an additional baseline called ReAct+Replan where we additionally prompt a ReAct agent to generate subgoals and score their likelihood. This directly mimics what our method proposes, but scores goals without the offline goal-conditioned training that we perform. See our discussion below on the additional baselines that we have added.
New Baselines
After compiling the suggestions across multiple reviewers, we decided to add the following new baselines:
PNLC-7B; PNLC-32B-reasoning: These are two variations of PNLC where we replace the GPT-4 backbone LLM with smaller models, namely Qwen2.5-7B and QwQ-32B. PNLC-7B is meant to be more directly comparable to baselines such as Agent Q, which uses a similarly sized model. PNLC-32B-reasoning uses a substantially bigger model that was pretrained to produce reasoning traces; however, it is still much smaller than frontier LLMs, and we use it as a middle ground. It is important to note that in either case, we do not perform any fine-tuning on the backbone LLM, in contrast to approaches such as Agent Q. We believe a major strength of our approach is its lightweight nature, as many practitioners lack the resources to directly fine-tune LLMs as policies.
ReAct+Replan: This method acts as an ablation of our approach, where the same self-refinement inference procedure using subgoals is preserved, but the scores of subgoals are directly inferred by the base LLM rather than learned via offline goal-conditioned RL. In this method, a thought and action are generated by a ReAct agent, the agent then assesses its action by generating n=4 positive and negative subgoals and scoring their likelihood, and finally the agent is allowed to refine its initial action up to m=2 times.
Due to time constraints, we only present results for the persuasion task here, but we will update with results for WebShop and Avalon.
ReAct+Replan: 0.62
PNLC-7B: 0.56
PNLC-32B-reasoning: 0.76
PNLC (ours): 0.87
From the results, we see that model size clearly plays a role in the overall performance of our approach PNLC. However, even with medium-sized reasoning models, our method achieves performance competitive with LATS and GDP-Zero, which use frontier LLMs. Furthermore, we would like to stress that our method circumvents training the LLM, so it is relatively cheap and lightweight to use frontier LLMs instead. In fact, the only advantage of using smaller models is lower inference time, as training is not impacted at all by using a larger model.
In addition, we see that although ReAct+Replan performs better than PNLC-7B, it does not come close to the performance of our full method. This highlights that our offline training is important, and that even frontier LLMs are incapable of providing accurate scores for goals in long-horizon tasks a priori. Qualitatively, we see that even frontier LLMs provide uncalibrated, often overly optimistic scores, such that they rarely deem their own actions to be suboptimal. As a result, the Replan stage often does not accomplish anything beyond confirming that the initial action was good.
We show an example of this below:
State:
AI: Hello! Have you heard of Save the Children?
Donor: Yes. I know of that charity.
AI: Great! Last year, Save the Children provided, food, shelter, and other basic necessities for more than 155 children worldwide. As I'm sure you already know, we rely on donations to provide that help.
Donor: Great. But I am not interested in helping you.
AI: I understand your hesitation. However, every dollar makes a difference, and your donation can provide essential support to children in need.
Donor: I'm skeptical of donations. I think it's a waste of money.
Thought:
The donor is skeptical of the effectiveness of donations. I should address this skepticism by providing concrete evidence of the charity's impact.
Evaluation:
The donor expresses interest in learning more about how their donation would be used and asks for specific examples of impact. (Score: 0.6)
The donor acknowledges the information but remains unconvinced, stating they need more time to consider or research further. (Score: 0.4)
The donor brings up a new, unrelated objection. (Score: 0.2)
Refined Thought:
I should emphasize how donations are used efficiently and effectively to achieve specific, measurable outcomes.
Thank you for your response and the extra experimental results. Although the new baselines reflect the potential of the PNLC method, they cannot strictly be counted as ablations of it. Therefore, I believe the current score adequately reflects the overall quality of the paper. I recommend that the authors include these results in the next revision.
Thank you for the response! We believe that the ReAct+Replan baseline exactly aims to behave as the method you suggested of using subgoals in a "manually-designed self-reflection process," without any training on offline data or search during inference. Regarding a more complex critic, that is indeed also an excellent modification to compare to; however, because training an LLM critic is much more expensive, we are unable to provide results during the discussion period but will provide them in our revised work.
To add on to the results we provided earlier, we have included results for the additional baselines in the other two experimental domains. Due to time constraints, we did not evaluate the PNLC-32B-reasoning baseline, but we found that the performance of PNLC-7B was closer to that of our proposed PNLC than in the persuasion task (probably because the two other tasks are less natural and benefit less from existing knowledge in LLM pretraining), so we do not anticipate drastically different performance.
WebShop (Score/Success Rate):
ReAct+Replan: 53.8/25.0
PNLC-7B: 74.8/47.0
PNLC (ours): 78.2/48.0
Avalon:
ReAct+Replan: 27.0
PNLC-7B: 41.0
PNLC (ours): 48.0
As evidenced in the results, using the LLM to plan without seeing any true trajectories (either from training or from search during inference) performs poorly. This is likely because these benchmarks require generations (tool calls or game actions) that look less natural than textual utterances. In such environments, having the LLM plan can even be detrimental, as we see that performance dropped by a marginal amount in the WebShop task. This acts as additional evidence that some training is required for LLM agents to perform well.
This paper presents Planning with a Natural Language Critic (PNLC), a method that involves training a goal-conditioned Q-function Q(s, a_thought, g), where g is a goal (a natural language summary of a state after s) on a held-out dataset D, which is a relatively small network trained on language model embeddings. At inference time, an LLM uses Q to solve problems by sampling thoughts and goals from an initial state, evaluating Q on them, and reflecting in order to come up with other candidate actions, etc. The paper demonstrates good performance relative to a number of prompting and fine-tuning baselines and demonstrates the utility of each component of the method through ablations.
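For concreteness, a minimal sketch of this kind of critic is shown below: a two-layer head over frozen LLM embeddings of (state, thought, goal), trained with a generic TD-style update on offline transitions. The embedding dimension, batch layout, and loss are illustrative assumptions, not the authors' exact recipe (the paper's offline RL method is implicit Q-learning, per the review discussion above); the plain TD regression here is a simplification.

```python
# Minimal sketch of a lightweight goal-conditioned critic Q(s, a_thought, g)
# over frozen LLM embeddings. Dimensions, batch layout, and the plain TD update
# are illustrative assumptions.
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    """Two-layer MLP scoring how likely goal g is reached from (state, thought)."""
    def __init__(self, emb_dim: int = 1536, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * emb_dim, hidden),   # concatenated state/thought/goal embeddings
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s_emb, thought_emb, goal_emb):
        x = torch.cat([s_emb, thought_emb, goal_emb], dim=-1)
        return torch.sigmoid(self.net(x)).squeeze(-1)   # likelihood-style score in [0, 1]

def td_update(q, target_q, batch, optimizer, gamma=0.99):
    """One offline TD step on embedded transitions; 'reached' is 1.0 if goal g was achieved."""
    s, a, g, s_next, a_next, reached = batch
    with torch.no_grad():
        target = reached + gamma * (1.0 - reached) * target_q(s_next, a_next, g)
    loss = nn.functional.mse_loss(q(s, a, g), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```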
Strengths and Weaknesses
Strengths:
+Clarity: the paper is fairly easy to understand, with generally clear methods and results and good examples provided throughout.
+Quality: The work uses a substantial collection of relevant baselines on well-chosen benchmarking environments.
+Significance: the work shows considerable performance gains over baselines and ablations.
+Originality: to my knowledge, this precise method hasn’t been used before.
Weaknesses:
-Significance: I have some concern about precisely how to think about these comparisons. PNLC learns from an offline dataset of examples. As far as I can tell, none of the non-finetuning baselines make use of these data — it would be helpful to have versions that make at least some use in prompting (e.g. with few-shot examples). Were these included? The baselines that do fine-tuning appear a bit underpowered…the ArCHers use GPT-2, whereas PNLC gets to use GPT-3 representations, which is a huge difference. As far as I can tell, the only other baseline that does finetuning is Agent Q, which uses a Mistral model and doesn’t get access to GPT-4.
-Originality: As I said above, I don't know of any method that does precisely this; however, the paper omits a lot of relevant related work (while e.g. DeepSeek-R1 [4] and Self-Taught Reasoner [3] and its follow-up work fine-tune the language model directly, and hence can be considered unfair baselines with much higher train-time computation budgets, it feels pretty funny to omit these sorts of ideas from related work), including works that come quite close. Indeed, the original GSM8K paper [1] discusses training verifiers to be used at inference time, and a lot of very popular work, e.g. [2], builds upon this.
-Clarity (minor): a lot of important details (what the offline datasets are, how finetuning baselines are treated) are reserved entirely for the supplementary. Not a huge deal, but I feel like the paper would be easier to read if some more of these details were included earlier.
[1] Cobbe, Karl, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert et al. "Training verifiers to solve math word problems." arXiv preprint arXiv:2110.14168 (2021).
[2] Hosseini, Arian, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. "V-star: Training verifiers for self-taught reasoners." arXiv preprint arXiv:2402.06457 (2024).
[3] Zelikman, Eric, Yuhuai Wu, Jesse Mu, and Noah Goodman. "Star: Bootstrapping reasoning with reasoning." Advances in Neural Information Processing Systems 35 (2022): 15476-15488.
[4] Guo, Daya, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).
Questions
-Did any of the prompting baselines get access to any of the datasets generated for training of the main method?
Limitations
yes
Final Justification
Stated in another comment.
Formatting Issues
none
Thank you for the detailed review, and for providing many helpful references. We aim to address any concerns or comments you had about our work below, as well as present additional experimental results to support the message of our work.
Unfair comparisons
We admit that there are a lot of nuances between the different baselines that may make comparisons difficult or even unfair. We view our method as a lightweight offline algorithm: "lightweight" meaning that we do not directly fine-tune the LLM policy, which can be prohibitively expensive for large models, and "offline" meaning we rely on a static dataset collected in advance by any logging policy.
In summary, we aim to compare against the strongest possible baselines, but the complexity of each considered baseline sometimes limits design choices such as LLM policy size or use of static data.
Regarding model size, out of all baselines, ArCHer and Agent Q rely on fine-tuning the LLM. Because ArCHer uses Q-learning that requires multiple copies of the model in memory, prior implementations of the method were done entirely on GPT-2 or even smaller models. In contrast, Agent Q uses DPO, which can be trained on larger models but still not at the scale of state-of-the-art frontier models, so we used Mistral-7B, which is exactly what prior work did as well. We relied on prior implementations of both approaches, and did not try to improve their compute or memory efficiency. For all other "lightweight" approaches, including ours, we used the most capable frontier LLMs. Hence, we compared against the strongest version of each baseline that is publicly available and computationally feasible.
Regarding the use of additional static data, it is true that only ArCHer, Agent Q, and our method train on an extra static dataset. That is because all other baselines do not contain any training component. Regarding few-shot prompting, several approaches such as Reflexion and LATS generate their own samples to be ultimately used within custom prompts, with LATS using up to 30 additional trajectories. We believe these additional trajectories are more useful to these methods than samples from our static dataset, as they are on-policy and directly correspond to how the LLM policy will act. Thus, again we believe we compare against the strongest version of each considered baseline.
To address the reviewer’s concerns regarding unfair comparisons, we add a comparison of our method PNLC using two smaller models, one 7B model and one 32B reasoning model. See our discussion below on results for these additional methods.
Discussion of LLM-based verifiers
Thank you for bringing up the related literature on LLM-based verifiers. It is true that they are related to our approach, and we will include an additional section discussing them. The main difference between our approach and this line of work is that we consider long-horizon LLM agent tasks. For single-step tasks such as math or question-answering, even if each response may involve multiple "steps" of reasoning, LLMs have proven to be relatively effective at judging the quality of responses a priori. However, in long-horizon LLM agent tasks, LLMs would need to predict multiple steps into the future, which they are not nearly as competent at. To illustrate this, we added an additional baseline called ReAct+Replan where an LLM self-scores the likelihood of reaching goals, without any offline training. We see that in long-horizon tasks, the LLM is often overly optimistic and scores suboptimal actions too highly. See our discussion below that describes the new baseline in more detail and discusses results.
New Baselines
After compiling the suggestions across multiple reviewers, we decided to add the following new baselines:
PNLC-7B; PNLC-32B-reasoning: These are two variations of PNLC where we replace the GPT-4 backbone LLM with smaller models, namely Qwen2.5-7B and QwQ-32B. PNLC-7B is meant to be more directly comparable to baselines such as Agent Q, which uses a similarly sized model. PNLC-32B-reasoning uses a substantially bigger model that was pretrained to produce reasoning traces; however, it is still much smaller than frontier LLMs, and we use it as a middle ground. It is important to note that in either case, we do not perform any fine-tuning on the backbone LLM, in contrast to approaches such as Agent Q. We believe a major strength of our approach is its lightweight nature, as many practitioners lack the resources to directly fine-tune LLMs as policies.
ReAct+Replan: This method acts as an ablation of our approach, where the same self-refinement inference procedure using subgoals is preserved, but the scores of subgoals are directly inferred by the base LLM rather than learned via offline goal-conditioned RL. In this method, a thought and action are generated by a ReAct agent, the agent then assesses its action by generating n=4 positive and negative subgoals and scoring their likelihood, and finally the agent is allowed to refine its initial action up to m=2 times.
Due to time constraints, we only present results for the persuasion task here, but we will update with results for WebShop and Avalon.
ReAct+Replan: 0.62
PNLC-7B: 0.56
PNLC-32B-reasoning: 0.76
PNLC (ours): 0.87
From the results, we see that model size clearly plays a role in the overall performance of our approach PNLC. However, even with medium-sized reasoning models, our method achieves performance competitive with LATS and GDP-Zero, which use frontier LLMs. Furthermore, we would like to stress that our method circumvents training the LLM, so it is relatively cheap and lightweight to use frontier LLMs instead. In fact, the only advantage of using smaller models is lower inference time, as training is not impacted at all by using a larger model.
In addition, we see that although ReAct+Replan performs better than PNLC-7B, it does not come close to the performance of our full method. This highlights that our offline training is important, and that even frontier LLMs are incapable of providing accurate scores for goals in long-horizon tasks a priori. Qualitatively, we see that even frontier LLMs provide uncalibrated, often overly optimistic scores, such that they rarely deem their own actions to be suboptimal. As a result, the Replan stage often does not accomplish anything beyond confirming that the initial action was good.
We show an example of this below:
State:
AI: Hello! Have you heard of Save the Children?
Donor: Yes. I know of that charity.
AI: Great! Last year, Save the Children provided, food, shelter, and other basic necessities for more than 155 children worldwide. As I'm sure you already know, we rely on donations to provide that help.
Donor: Great. But I am not interested in helping you.
AI: I understand your hesitation. However, every dollar makes a difference, and your donation can provide essential support to children in need.
Donor: I'm skeptical of donations. I think it's a waste of money.
Thought:
The donor is skeptical of the effectiveness of donations. I should address this skepticism by providing concrete evidence of the charity's impact.
Evaluation:
The donor expresses interest in learning more about how their donation would be used and asks for specific examples of impact. (Score: 0.6)
The donor acknowledges the information but remains unconvinced, stating they need more time to consider or research further. (Score: 0.4)
The donor brings up a new, unrelated objection. (Score: 0.2)
Refined Thought:
I should emphasize how donations are used efficiently and effectively to achieve specific, measurable outcomes.
Thanks much to the authors for this additional effort.
Given the additional experimental evidence, I'm inclined to raise my score somewhat. The new results contextualize the tradeoffs with this vs other approaches more, though the contexts in which this should be judged a win are complex. It would also be really helpful to see the other experimental evidence, given where this is placing the model among baselines.
Thank you for the response! We have added results for the additional baselines in the other two experimental domains. Due to time constraints, we did not evaluate the PNLC-32B-reasoning baseline, but we found that the performance of PNLC-7B was closer to that of our proposed PNLC than in the persuasion task (probably because the two other tasks are less natural and benefit less from existing knowledge in LLM pretraining), so we do not anticipate drastically different performance.
WebShop (Score/Success Rate):
ReAct+Replan: 53.8/25.0
PNLC-7B: 74.8/47.0
PNLC (ours): 78.2/48.0
Avalon:
ReAct+Replan: 27.0
PNLC-7B: 41.0
PNLC (ours): 48.0
As evidenced in the results, using the LLM to plan without seeing any true trajectories (either from training or from search during inference) performs poorly. This is likely because these benchmarks require generations (tool calls or game actions) that look less natural than textual utterances. In such environments, having the LLM plan can even be detrimental, as we see that performance dropped by a marginal amount in the WebShop task. This acts as additional evidence that some training is required for LLM agents to perform well.
This paper proposes a method called PNLC, which helps large language models improve planning in long, multi-step tasks. The key idea is to use a goal-conditioned value function trained offline to guide the model’s reasoning. The authors show strong results on tasks like persuasion, dialogue, and web navigation, and the method works without fine-tuning the base LLM. After the rebuttal, three reviewers support acceptance. They were convinced by the new experiments, especially the new baselines and comparisons added by the authors.
Some points were raised about missing ablations and evaluation fairness. The authors replied with careful explanations and provided useful updates. The results show that their method is efficient and competitive even when smaller models are used. The reviewers agree that the contribution is original and relevant for the NeurIPS community. Based on the reviews and the discussion, I recommend accepting this paper.