Enhancing Decision-Making of Large Language Models via Actor-Critic
We propose a novel LLM-based Actor-Critic framework that enhances LLMs' decision-making through long-term action evaluations and efficient policy improvements
Abstract
Reviews and Discussion
This paper proposes a gradient-free Actor-Critic framework (LAC) to enhance the decision-making capabilities of LLMs. LAC integrates a value-based critic that offers quantitative feedback to guide policy improvement and employs a gradient-free optimization approach to update the actor. Experimental results demonstrate that LAC outperforms GPT-4 with ReAct on ALFWorld and BabyAI-Text tasks.
Questions for Authors
- LATS is also one of the baselines; it uses external feedback and MCTS to improve performance. Why did the authors only report its performance on the WebShop benchmark? Can it be adapted to the other benchmarks?
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Yes.
Experimental Design and Analysis
Yes.
Supplementary Material
Yes.
Relation to Existing Literature
The proposed framework demonstrates the potential of the actor-critic paradigm for LLM-based decision-making scenarios.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths:
- The paper proposes a token-level action evaluation method and uses a gradient-free method to update the policy LLM. The idea makes sense and is concise.
- The experimental setup and the provided implementation details are comprehensive.
Weakness:
- Lack of in-depth analysis and explanation. This work constructs a tailored actor-critic framework for decision-making scenarios, and the reported performance improvements are promising. Nevertheless, the experimental results (i.e., Sections 5.2, 5.3, 5.4) do not offer an illustrative explanation of the potential reasons behind this improvement. This limits the impact of the paper, since the actor-critic framework is broadly similar to its use in general domains (e.g., reasoning).
- From this paper, I cannot see the technical difference between the actor-critic framework in the decision-making scenario and in the general LLM reasoning scenario. It seems that the paper simply adapts the general actor-critic framework to decision-making tasks without domain-specific justification or design. This limits the overall novelty of the paper.
- The overall paper writing is barely satisfactory; some parts lack intuitive explanations, making it difficult for readers to understand.
Other Comments or Suggestions
- More in-depth analysis of the experimental results.
Q1: Lack of intuitive explanations and in-depth analysis.
A: We appreciate the reviewer's emphasis on intuitive explanations and analyses. In Figures 10 and 11 (Appendix A.7), we provide detailed illustrative analyses using representative examples from ALFWorld and BabyAI-Text, which demonstrate how each component of our actor-critic framework (LAC) influences action selection.
Fig.10 presents a concrete scenario where the agent's goal is to "put a saltshaker in the drawer." At a critical decision step (Step 4), we observe the following intuitive distinction between components.
- The LLM-based prior policy alone mistakenly suggests "go to drawer 1" because the base LLM overlooks that the agent has already found the correct object ("saltshaker 1") in cabinet 2. This error exemplifies the common hallucination problem in LLMs, which occurs when the model disregards previous states and recommends irrelevant actions.
- In contrast, the critic suggests "take saltshaker 1 from cabinet 2" because it evaluates potential actions by predicting future trajectories and determines that this action will successfully pick up the correct object.
- Our method leverages these distinct insights by optimizing the prior policy's action distribution based on the critic's evaluation (see Lines 4-8 in Algo.1). It effectively corrects the errors introduced by the prior policy, balancing the strengths of prior policy (flexible but sometimes inaccurate) and critic evaluations (accurate but computationally intensive).
This illustrative example explicitly reveals why integrating actor (prior policy) and critic leads to substantial performance improvement.
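To make this correction mechanism concrete, here is a small numerical illustration. The numbers are invented for exposition, and we assume the standard exponential-reweighting form that KL-constrained policy improvement admits; the exact expressions in Algo.1 / Eq.5 may differ in detail.

```latex
% Illustrative numbers only (not taken from the paper).
% Prior policy:  \pi(a_1 \mid h) = 0.6 for a_1 = "go to drawer 1",
%                \pi(a_2 \mid h) = 0.3 for a_2 = "take saltshaker 1 from cabinet 2".
% Critic values: Q(h, a_1) = -1, \quad Q(h, a_2) = +1, \quad temperature \alpha = 1.
\tilde{\pi}(a \mid h) \propto \pi(a \mid h)\, e^{Q(h,a)/\alpha}
\;\;\Rightarrow\;\;
\tilde{\pi}(a_1 \mid h) \propto 0.6\, e^{-1} \approx 0.22, \qquad
\tilde{\pi}(a_2 \mid h) \propto 0.3\, e^{+1} \approx 0.82 .
```

After normalization the reweighted policy prefers a_2, i.e., the critic's evaluation overturns the prior policy's hallucinated choice while leaving the rest of the distribution essentially untouched.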
Q2: The difference of the actor-critic framework between the decision-making scenario and the general LLM reasoning scenario.
A: We respectfully disagree with this claim.
Compared to classic RL actor-critic algorithms, our method is distinguished by two key contributions that specifically address significant challenges in enhancing LLMs' decision-making capabilities: (1) To effectively extract action-evaluation information from LLMs, we propose a novel Q-function estimation approach (Sec.4.1) that leverages LLMs' internal belief about the success or failure of the current task; (2) To efficiently utilize these action evaluations for policy optimization, we formulate the policy improvement problem as a KL-divergence-constrained optimization and derive a closed-form solution (Sec.4.2), allowing us to optimize the policy in a gradient-free way.
Compared to the general LLM reasoning scenario, our method introduces two novel features that are specially designed for sequential decision-making problems: (1) To support long-term planning, our critic module first predicts possible future trajectories for action evaluation, which significantly improves the accuracy of action-value estimation. (2) To improve LLM's decision-making ability, our method introduces a deliberate and principled integration of LLMs as both prior policy (actor) and action evaluation (critic). Rather than simply prioritizing one over the other as in prior work, our approach synergistically combines the two, achieving a balanced framework optimized explicitly for sequential decision-making scenarios.
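To make the first feature (critic-side trajectory prediction) more concrete, below is a minimal sketch of how such a critic could evaluate a candidate action. This is our own illustration rather than the paper's implementation; the two callables stand in for whatever LLM interface is used.

```python
# Sketch only (not the paper's code): evaluate a candidate action by letting an
# LLM act as a world model, imagine a short future trajectory, and then score
# the imagined outcome by the model's own belief that the task will succeed.
from typing import Callable

def evaluate_action(
    history: str,
    action: str,
    generate: Callable[[str], str],        # LLM text continuation (world model / actor)
    success_prob: Callable[[str], float],  # LLM's belief that the task will succeed
    horizon: int = 3,
) -> float:
    """Estimate the long-term value of `action` via an imagined rollout."""
    trajectory = f"{history}\n> {action}"
    for _ in range(horizon):
        observation = generate(trajectory + "\nObservation:")  # imagined next state
        next_action = generate(trajectory + f"\nObservation: {observation}\n> ")
        trajectory += f"\nObservation: {observation}\n> {next_action}"
    return success_prob(trajectory)

# Toy usage with stand-in callables; a real system would query an actual LLM.
value = evaluate_action(
    "Task: put a saltshaker in the drawer.",
    "take saltshaker 1 from cabinet 2",
    generate=lambda prompt: "You pick up the saltshaker 1 from the cabinet 2.",
    success_prob=lambda trajectory: 0.8,
)
```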
Q3: LATS's performance in other benchmarks.
A: LATS cannot be directly applied to the other benchmarks we used, for two main reasons. (1) LATS requires the ability to revert the agent to earlier states in the environment, which ALFWorld and BabyAI-Text do not support. LATS relies on model-free MCTS, using the environment simulator as a world model and reverting the simulator to earlier states during tree search. This limitation is also noted in the original LATS paper (page 9). (2) While it might be possible to modify these environments to make them reversible, doing so would create an unfair comparison: our method and the other baselines do not rely on simulators during reasoning in ALFWorld and BabyAI-Text, whereas LATS would gain an advantage from this modification.
Nevertheless, we still attempted to adapt LATS for ALFWorld by using LLMs as world models, similar to our method, for a fair comparison. The results, presented in the table below, show that LATS fails in almost all tasks. This is because its tree search depends heavily on the environment simulator for precise state transitions. With only LLM-based world models, the state transitions often deviate from the actual environments, due to LLMs' inherent hallucinations and the partial observability of ALFWorld.
Table 1: Performance comparison of LAC (Ours) and LATS in ALFWorld.
| Methods / Models | CodeLlama-7B | Gemma-7B | Llama-3-8B | Mistral-7B |
|---|---|---|---|---|
| LAC (Ours) | 0.79 | 0.84 | 0.78 | 0.79 |
| LATS | 0.00 | 0.00 | 0.03 | 0.00 |
Thanks for your efforts and insightful comments! We hope our clarification addresses your concerns and sincerely appreciate it if you could re-evaluate our work. Any further feedback is much appreciated.
Thanks for the explanation based on a case study. After the rebuttal, we still argue that the proposed approach mostly combines existing general actor-critic ideas, leaving the novelty concerns unresolved. Therefore, I will keep my score. Thanks for the authors' efforts.
Q4: After rebuttal, we still argue that the proposed approach mostly combines existing general actor-critic ideas, leaving the novelty concerns. Therefore, I will keep my score. Thanks for the author's efforts.
A: We appreciate the reviewer's timely feedback and for acknowledging our experimental explanation. To address the reviewer’s concern regarding novelty, we would like to clarify further why our method represents a substantial advancement beyond simply combining existing actor-critic ideas.
We also would like to emphasize that while our method is inspired by the general actor-critic paradigm, it introduces key innovations specifically tailored to LLM-based decision-making, which, to our best knowledge, have not been explored in prior actor-critic work:
- Q-function estimation without explicit rewards (Sec. 4.1): Unlike traditional actor-critic methods that typically rely on explicit, predefined reward signals, our approach formulates a Q-function that leverages LLMs' internal success and failure probabilities to estimate action values. This strategy enables the critic to estimate action values effectively in scenarios where external rewards are sparse or entirely absent. Such scenarios are prevalent in real-world decision-making tasks involving language models, clearly distinguishing our contribution from classical actor-critic methods (a code sketch of this estimation appears after this list).
- Gradient-free policy optimization (Sec. 4.2): Instead of updating the policy via gradient-based learning, we derive a closed-form solution for the KL-constrained optimization problem. This gradient-free approach circumvents the complexities associated with differentiating through LLM-generated actions, providing an efficient and effective alternative explicitly tailored for LLM-based policies.
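As promised above, here is a purely illustrative sketch of the first point: reading success/failure beliefs off an LLM's next-token probabilities. The prompt wording, the choice of " yes"/" no" tokens, and the log-odds score are our own assumptions, not the paper's exact Eq. 1-2.

```python
# Sketch only: extracting a task-success belief from next-token probabilities
# with Hugging Face transformers. Prompt wording, token choice, and the
# log-odds score are illustrative assumptions, not the paper's exact Eq. 1-2.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # any causal LM would do here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def q_estimate(trajectory: str) -> float:
    """Score a (partial) trajectory by the model's belief in eventual success."""
    prompt = trajectory + "\nWill the task be completed successfully? Answer yes or no:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits.float(), dim=-1)
    p_success = probs[tok.encode(" yes", add_special_tokens=False)[0]].item()
    p_failure = probs[tok.encode(" no", add_special_tokens=False)[0]].item()
    # One simple score that uses both beliefs and grows with the success belief:
    # the log-odds of success versus failure.
    return math.log((p_success + 1e-8) / (p_failure + 1e-8))
```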
To further strengthen our claim of novelty, we will clearly articulate these distinctions in the revised manuscript, emphasizing how these specific innovations directly address practical limitations in existing actor-critic frameworks when applied to large language models. Additionally, we will reinforce our claims by clearly referencing the empirical results, which demonstrate the practical effectiveness of our approach on challenging decision-making benchmarks.
We thank the reviewer again for highlighting this crucial point, as it has helped us sharpen and better communicate the unique contributions of our work.
We hope this further clarification helps demonstrate that our approach is not merely an adaptation of existing actor-critic ideas but introduces novel and necessary modifications to make actor-critic viable for LLM-based decision-making.
The paper proposes LAC, a framework for improving the decision-making capabilities of LLMs by integrating a base LLM with action evaluations derived from token logits and trajectory rollouts. The authors conduct experiments across the ALFWorld, BabyAI-Text, and WebShop benchmarks and show the superiority of LAC over baselines.
Update after rebuttal
Thanks for running the additional evaluation on Crafter. Based on the new results, it seems there is overlap between the Naive baseline and LAC. Overall, I find the idea interesting, but my main concern remains about LAC's effectiveness on more complex tasks. Therefore, I will keep my current score.
Questions for Authors
- see "Other Strengths And Weaknesses"
- Curious if LAC can also be extended to multimodal LLMs for decision-making in multimodal benchmarks like VisualWebArena [1]?
[1]: Koh, J. Y., Lo, R., Jang, L., Duvvur, V., Lim, M. C., Huang, P. Y., ... & Fried, D. VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.
Claims and Evidence
The paper's claims about LAC improving LLM decision-making are generally supported by the experimental results across ALFWorld, BabyAI-Text, and WebShop benchmarks. The experiments show consistent performance improvements over baselines. The ablation studies adequately demonstrate that each component contributes to the overall performance. However, the significance of these improvements is somewhat limited by the simplicity of the chosen benchmarks, which may not sufficiently challenge newer LLMs.
Methods and Evaluation Criteria
- The evaluation criteria and benchmarks (ALFWorld, BabyAI, WebShop) are now too simple for the latest LLMs. Testing on more challenging environments like WebArena [1] or BALROG [2] would provide a more compelling evaluation.
- Not a concern, but it would be interesting to see how the method performs with reasoning LLMs like DeepSeek-R1-Distill-Llama or DeepSeek-R1-Distill-Qwen, which might also be suitable for the policy network (π_LLM)
[1]: Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., ... & Neubig, G. WebArena: A Realistic Web Environment for Building Autonomous Agents. In The Twelfth International Conference on Learning Representations.
[2]: Paglieri, D., Cupiał, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., ... & Rocktäschel, T. (2024). BALROG: Benchmarking Agentic LLM and VLM Reasoning on Games. arXiv preprint arXiv:2411.13543.
Theoretical Claims
Yes - Appendix B.1
Experimental Design and Analysis
- The experiments are generally sound with appropriate comparisons to relevant baselines and comprehensive ablation studies.
- The computational analysis is also useful
Supplementary Material
Yes - the appendix and a high-level overview of the code
Relation to Existing Literature
Improving the decision-making capabilities of LLMs is a highly relevant problem today, and this paper aims to improve the capabilities of LLMs so that they become better agents.
Missing Important References
In related works, it would be nice to cite works like: Koh, J. Y., McAleer, S., Fried, D., & Salakhutdinov, R. (2024). Tree search for language model agents. arXiv preprint arXiv:2407.01476.
Other Strengths and Weaknesses
Strengths:
- Well-written and easy to follow with clear explanations
- Comprehensive experiments and ablation studies that isolate the impact of each component
Weaknesses:
- [Minor] The approach mostly combines existing ideas rather than introducing fundamentally new concepts. Although I personally feel a lot of agentic LLM applications are more about implementation and orchestration and less about research novelty - so this is a minor concern of mine.
- Evaluation on relatively simple benchmarks that don't sufficiently challenge newer LLMs - as I said above, I would like to see the results on either WebArena or BALROG or another challenging benchmark
- Lack of discussion on failure cases - Since agentic LLMs are already seeing rapid adoption in applied use-cases, it would be nice to have small section on the failure cases of using Q_llm (are there any specific kind of tasks/trajectories where Q_llm doesn't generalize?)
Other Comments or Suggestions
Line 135, column 2: typo: "few-show" -> "few-shot"
Thanks for your comments and valuable suggestions. Here we provide detailed explanations to address your questions.
Q1: Lack of discussion on failure cases. It would be nice to have a small section on the failure cases of using Q_llm (are there any specific kinds of tasks/trajectories where Q_llm doesn't generalize?)
A: We have provided illustrative examples from ALFWorld and BabyAI-Text in Figures 10 and 11, respectively. In these cases, the critic (Q_llm) alone may occasionally fail to select the correct actions. This is because the estimation of Q-values relies on the LLM's ability to predict future trajectories. When the LLM experiences significant hallucinations, the Q-values can deviate, leading to suboptimal action selection.
For example, in Figure 10, the agent needs to find and take a "saltshaker". In step 2, the critic suggests "go to drawer 1", because it incorrectly predicts that the saltshaker is there, resulting in a suboptimal action. In contrast, the prior policy suggests a more systematic search, starting with "cabinet 1" and then moving to "cabinet 2", where the target saltshaker is actually located.
Similarly, there are failure cases where incorrect action evaluations by the critic lead to unsuccessful task completion. For instance, in the task "find two soapbars and put them in the cabinet", the critic mistakenly identifies a "soapbottle" as a "soapbar", causing the agent to take the wrong object and fail the task.
In summary, the critic may struggle to generalize in cases where the LLM suffers from significant hallucinations, such as mis-predicting future trajectories due to partial observability or incorrectly identifying target objects due to the base LLM's inherent hallucinations. Thank you for the suggestion, and we will expand the discussion on failure cases in the revised version.
Q2: In related works, it would be nice to cite works like: Tree search for language model agents. arXiv preprint arXiv:2407.01476.
A: Thank you for pointing out this relevant work. We will include this work in the revised version of our paper.
Q3: [Minor] The approach mostly combines existing ideas rather than introducing fundamentally new concepts. This is a minor concern of mine.
A: While our method draws inspiration from the actor-critic algorithm in classic RL, it is distinguished by two key contributions that address significant challenges in enhancing LLMs' decision-making capabilities: (1) We propose a novel Q-function estimation approach (Sec. 4.1) to extract action evaluation information from LLMs that leverages LLMs' internal belief about success or failure of the current task; (2) We formulate the policy improvement problem as a KL-divergence constrained optimization and derive a closed-form solution (Sec. 4.2), allowing us to optimize the policy in a gradient-free manner using the action evaluation.
Q4: How does the method perform with reasoning-LLMs like DeepSeek-R1 Distill Llama or Qwen?
A: We have conducted preliminary experiments with reasoning LLMs like DeepSeek-R1-Distill-Qwen-7B. However, we observed that they often tend to overthink rather than output direct environmental actions in both our method and the baseline ReAct. For instance, even when we explicitly prompt the reasoning LLMs to output actions (e.g., "Please make sure your response is in the following format:\n> {The action you choose}"), the models still generated detailed explanations but avoided selecting the next action. A typical response might be: "I need to find a key to open the safe or locate the pencil in the drawers. Since I can’t (…), I’m unable (…), I must (…)". This issue has also been noted in prior work [1]. We believe that using reasoning LLMs for decision-making tasks requires deeper exploration to balance internal reasoning with effective environmental interaction.
Q5: Can LAC be applied to more challenging environments (WebArena or BALROG), and multimodal decision-making tasks (VisualWebArena)?
A: Thank you for the suggestion. Due to the limited time and computing resources available during the rebuttal period, we were unable to conduct experiments on these benchmarks, which generally require larger base LLMs for better performance. Given that our method has already demonstrated effectiveness in various complex sequential decision-making tasks (e.g., ALFWorld, BabyAI-Text and Webshop), we believe it can be extended to these benchmarks as well. We will explore this direction in future work.
Thanks again for your efforts and insightful comments! We hope our clarification addresses your concerns. Any further feedback and discussions are much appreciated.
[1] Cuadron, Alejandro, et al. "The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks." arXiv preprint arXiv:2502.08235 (2025).
Thank you authors for addressing my concerns. However, I still believe WebShop is a significantly simpler environment compared to WebArena or BALROG. Additionally, I disagree that larger base LLMs are strictly necessary for demonstrating your method’s effectiveness. BALROG, for instance, already benchmarks models such as Llama 3B, 7B and 11B, which indicates feasibility for evaluating smaller-scale models. A valuable demonstration would involve comparing the performance of your method using a smaller model, such as Llama 3B, or 7B, against the baseline without your proposed approach. Consequently, I maintain my original evaluation score.
Q6: Evaluating LAC on more complex benchmarks like BALROG.
A: Thank you for the timely feedback and the constructive suggestion.
Our paper has already included experiments on BabyAI-Text, one of the benchmarks from BALROG. To address your concern regarding environmental complexity, we conducted preliminary experiments on Crafter, another benchmark from BALROG. Crafter is a 2D survival game specifically designed to test long-horizon reasoning and sequential decision-making, with tasks involving resource gathering, crafting, and combat. It represents a significantly more complex setting than ALFWorld and WebShop.
Due to time and resource constraints during the rebuttal phase, we evaluated our method (LAC) on this benchmark using Llama-3.1-8B-it, following BALROG's official evaluation protocol. We compare LAC with several representative baselines from BALROG's GitHub repository. The preliminary results are summarized in the following table:
Table 1: Performance comparison of LAC (Ours) and other baselines in Crafter (from BALROG).
| Methods / Models | Llama-3.1-8B-it |
|---|---|
| LAC (Ours) | 25.91% ± 1.93% |
| Naive (direct action generation) | 20.45% ± 4.26% |
| Robust Naive (formatted actions) | 4.55% ± 1.57% |
| CoT (ReAct, reason then act) | 18.64% ± 3.24% |
| Robust CoT (reason + formatted actions) | 15.46% ± 3.59% |
| Few-Shot (in-context examples) | 12.73% ± 1.25% |
As shown in the table above, our method achieves higher performance than the other available baselines under identical evaluation settings. These preliminary results provide further evidence of the robustness, effectiveness, and adaptability of our proposed actor-critic approach (LAC), particularly in significantly more challenging and complex decision-making environments. It is worth noting that CoT performs worse than the Naive baseline on the Crafter benchmark. After discussing this with the authors of the BALROG paper, we hypothesize that it is due to the model's inconsistency across its chains of thought: two consecutive chains of thought might lead the model to take actions that push toward different goals, which is not ideal. The BALROG authors also note that this is a problem especially with smaller, weaker models.
We appreciate your suggestion and will include these results in the revised version.
To enable planning with a next-token-generation autoregressive model, this paper proposes to evaluate each action with a critic. Rather than directly self-judging actions, the critic ranks them based on the output logits associated with the likelihood of the predicted actions being good or bad. The policy is then updated by selecting the action with the highest logit value. The approach is tested on several tasks, including ALFWorld, BabyAI, and WebShop. Across all models, it consistently demonstrates dominant performance over the other baselines.
Questions for Authors
- How can a proportional policy be updated without gradients?
- What does maximizing Q mean if it does not represent cumulative rewards? Why optimize Q-values instead of maximizing the prediction probability?
- In Equation 4, what is the distribution of the history h?
- What task is depicted in Figure 5? Why does LAC require fewer steps?
Claims and Evidence
As claimed, the paper presents a method to leverage LLMs' prior knowledge for action evaluation by using prediction logits to classify actions as good or bad.
Long-term planning is achieved by annotating actions as good or bad based on their effects on the final goal, as described in Appendix C.3.
However, the process of gradient-free policy updates remains unclear. In Equation 5, the new policy is obtained by multiplying exponentials of Q-values, but several questions arise: (1) How many actions are evaluated per state? (2) Is the policy updated after each action, or in batch updates? (3) How are policy parameters updated, particularly in the case of an infinite action space?
Methods and Evaluation Criteria
The proposed method effectively enables long-term planning by evaluating actions and updating policies with fewer steps while achieving enhanced performance.
Evaluation is conducted on several control tasks, measuring success rate, rewards, and computational requirements across various base models. Additionally, ablation studies are also thoroughly performed.
Theoretical Claims
No, I did not check the proof.
Experimental Design and Analysis
For the experimental results shown in Figure 2, the proposed method, LAC, is finetuned, while the other baselines are not. As shown in Figure 9, the performance of ReAct can indeed be improved with further finetuning, so the comparison is unfair. Is it possible to report performance without finetuning for the comparison, and to show in a separate figure that performance can be further enhanced with finetuning?
Supplementary Material
I checked implementation details and extra experimental results, which are clearly stated.
Relation to Existing Literature
The paper is well fitted into the literature with enough baselines chosen.
Missing Important References
Not to my knowledge.
Other Strengths and Weaknesses
See other sections.
Other Comments or Suggestions
See other sections.
Thanks for your comments and valuable suggestions. Here we provide detailed explanations and experimental results to address your questions.
Q1: Questions regarding gradient-free policy updates: (1) How many actions are evaluated per state? (2) Is the policy updated after each action, or in batch updates? (3) How are policy parameters updated, particularly in the case of an infinite action space?
A: (1) The number of candidate actions is a hyperparameter, set to 5 in our experiments; that is, we evaluate the top 5 candidate actions per state. Empirically, evaluating the top 3-5 actions is generally sufficient for the benchmarks used: because these actions are sampled by leveraging the LLM's prior knowledge, effective actions are included within this set in most cases.
(2) The policy is improved by reweighting with Eq.5 after evaluating all sampled candidate actions for the current state.
(3) We do not update policy parameters directly. Instead, we derive a new policy from the original one using a closed-form solution (Eq.5-6) of the policy improvement objective (Eq.4). For infinite action spaces, we evaluate the policy by assessing a set of top candidate actions, similar to how we handle finite action spaces.
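Putting answers (1)-(3) together, the per-step procedure can be summarized by the sketch below. This is our own illustration: `sample_actions`, `prior_logprob`, and `q_estimate` are hypothetical stand-ins for the actor and critic LLM calls, and the exponential reweighting is the standard closed form for a KL-constrained objective of this type, so the exact expressions in Eq.5-6 may differ.

```python
# Sketch of the gradient-free, per-step decision loop: sample a few candidate
# actions from the actor (prior policy), score each with the critic, reweight
# the candidates in closed form, and act greedily. No model parameters change.
import math
from typing import Callable, List

def choose_action(
    history: str,
    sample_actions: Callable[[str, int], List[str]],  # actor: top-k action proposals
    prior_logprob: Callable[[str, str], float],       # actor: log pi(a | h)
    q_estimate: Callable[[str, str], float],          # critic: Q(h, a)
    k: int = 5,
    alpha: float = 1.0,
) -> str:
    candidates = sample_actions(history, k)
    # Reweighted (unnormalized) log-probabilities: log pi(a | h) + Q(h, a) / alpha.
    scores = [prior_logprob(history, a) + q_estimate(history, a) / alpha
              for a in candidates]
    best = max(scores)
    weights = [math.exp(s - best) for s in scores]  # numerically stable softmax
    total = sum(weights)
    improved_policy = {a: w / total for a, w in zip(candidates, weights)}
    return max(improved_policy, key=improved_policy.get)
```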
Q2: Report performance without finetuning for comparison and show in a separate figure that the performance can be further enhanced.
A: Thank you for the suggestion. We will include comparisons without finetuning in the revised version of the paper. The results, presented in the table below, show that our method outperforms the baseline both with and without finetuning, and that performance can be further improved with finetuning. With finetuning, our method also yields more consistent performance across different LLMs.
Table 1: Ablation studies on the impact of finetuning.
| Methods / Models | CodeLlama-7B | Gemma-7B | Llama-3-8B | Mistral-7B |
|---|---|---|---|---|
| LAC (w/o finetuning) | 0.39 | 0.59 | 0.71 | 0.57 |
| ReAct (w/o finetuning) | 0.20 | 0.54 | 0.31 | 0.34 |
| LAC (w/ finetuning) | 0.79 | 0.84 | 0.78 | 0.79 |
| ReAct (w/ finetuning) | 0.38 | 0.70 | 0.73 | 0.65 |
Q3: How can a proportional policy be updated without gradients?
A: We update the distribution by adjusting the probabilities of the top candidate actions, while leaving the probabilities of other actions unchanged. Specifically, we formulate the policy improvement objective as a KL-constrained optimization problem (Eq.4). We then derive a closed-form solution (Eq.5) to this problem, which is a weighted combination of the original policy and the action evaluation values.
Q4: What does maximizing Q mean if it does not represent cumulative rewards? Why optimize Q-values instead of maximizing the prediction probability?
A: In our method, maximizing the Q-function corresponds to maximizing the success probability for the current task. The benchmarks simulate realistic scenarios where there is no immediate reward during execution, and the environment only provides feedback on task completion (success or failure) at the end of each episode. To model this, we design the Q-function to be positively correlated with success probability using a Sigmoid function (Eq.1), ensuring that maximizing Q effectively maximizes the success probability.
While directly maximizing the success probability is also valid and could lead to a different Q-function formulation, we compared our approach against it in Appendix A.2. Our method outperformed it in most tasks and models. We speculate this is because our method leverages more internal information from LLMs, utilizing both success and failure probabilities (as derived in Eq.2), resulting in more accurate and stable action evaluations. While there could be other formulations that use additional internal information, our approach remains both simple and effective.
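For concreteness, one plausible instantiation of such a Q-function (our own illustration of the relationship described above, not necessarily the paper's exact Eq. 1-2) applies a sigmoid to the log-ratio of the success and failure beliefs:

```latex
% Illustrative form only; the paper's Eq. 1-2 may differ in detail.
% p_s, p_f: the LLM's internal probabilities of eventual success / failure.
Q(h_t, a_t)
  = \sigma\!\big(\log p_s(h_t, a_t) - \log p_f(h_t, a_t)\big)
  = \frac{p_s(h_t, a_t)}{p_s(h_t, a_t) + p_f(h_t, a_t)},
\qquad \sigma(x) = \frac{1}{1 + e^{-x}} .
```

This form uses both beliefs, lies in (0, 1), and increases monotonically with the success belief, so maximizing it maximizes the relative success probability.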
Q5: In Equation 4, what is the distribution of the history h?
A: The history in Eq.4 is not a distribution, but rather a given context for the policy and critic.
Q6: What task is depicted in Figure 5? Why does LAC require fewer steps?
A: The task depicted in Figure 5 is from ALFWorld. The metric "Steps per task" is calculated as an average over both successful and failed tasks. LAC requires fewer steps due to its higher success rate, enabling it to complete tasks within the maximum step limit, while other baselines often reach this limit without completing the tasks. If we only consider successful tasks, the step cost is similar across methods: Ours: 15.32 steps, ReAct: 17.75 steps, and RAP: 16.36 steps.
Thanks again for your efforts and insightful comments! We hope our clarification addresses your concerns. Any further feedback and discussions are much appreciated.
By sampling only a few top candidate actions, the method does not compute the closed form of the new policy; there remains a nonzero probability of missing the true argmax action. This approach also differs from the actor-critic framework, which updates policies iteratively. The paper lacks sufficient discussion of these weaknesses.
Nevertheless, the closed-form expression offers a convenient mechanism for reweighting action probabilities using the judgments produced by the LLM, potentially leading to improved performance on shown tasks.
Given these considerations, I will maintain my current score.
Q7: By sampling only a few top candidate actions, the method does not compute the closed form of the new policy; there remains a nonzero probability of missing the true argmax action. This approach also differs from the actor-critic framework, which updates policies iteratively. The paper lacks sufficient discussion of these weaknesses.
A: We thank the reviewer for highlighting an important aspect of our approach, which we recognize deserves further discussion and clarification.
(1) Regarding the approximation by sampling a few top candidate actions: We agree that sampling top candidate actions introduces a nonzero probability of missing the true argmax action, especially when distinctions among candidate actions are subtle. However, this choice is primarily driven by computational practicality: explicitly computing or evaluating the full action distribution for large or open-ended action spaces common in LLM-based decision-making is typically intractable. Empirically, we find that generating a small subset of candidate actions from a strong LLM prior is often sufficient to include promising actions, thus making the trade-off between computational efficiency and accuracy acceptable.
(2) Regarding the comparison to classic iterative actor-critic frameworks: The reviewer correctly notes that our approach deviates from traditional iterative gradient-based actor-critic frameworks. However, this deviation—employing a one-shot, gradient-free policy improvement—is an intentional design choice driven by the computational challenges of applying gradient-based optimization to LLM-generated textual actions. We view this as a strength, as it provides a practical and efficient alternative specifically tailored for LLM decision-making tasks. Our approach can significantly reduce computational overhead without substantially sacrificing performance.
We sincerely thank the reviewer for emphasizing these points, which we believe will improve the clarity, rigor, and practical impact of our manuscript.
In this work, the authors propose LLM-based Actor-Critic (LAC), a framework designed to improve LLM-based agents' sequential decision-making capabilities by leveraging the LLM's own beliefs about the world as a critic. There are two core components: a Q-function estimation method that leverages the LLM's internal beliefs rather than relying on hand-crafted reward design, and a gradient-free policy optimization method via KL-constrained policy optimization.
The reviewers' concerns were mainly regarding two aspects:
- clarity of the methodology, especially its connection to and differences from traditional actor-critic methods in the RL literature.
- simplicity of the datasets/environments used to demonstrate LAC, and insufficient evidence of how LAC could generalize to more realistic tasks.
The authors made efforts to address both aspects:
- There was extensive discussion between the authors and reviewers on the first point, especially the conversation with Reviewer WfPB. In my opinion the authors did a great job: the additional statistical analysis aimed at understanding how each key component (Q-function estimation and gradient-free policy optimization) really works under the hood strengthens the work, providing stronger evidence in addition to the experimental results and analyses in the paper. Since multiple reviewers mentioned the qualitative analysis, I suggest the authors improve Appendix A.7 (currently only a single sentence) and move it to the main paper.
- For the second point, the authors conducted new experiments on the Crafter environment (part of BALROG, as suggested by Reviewer e1C1). I fully acknowledge that, given the short period of time, the authors have already done a good job following the reviewers' suggestions. However, given that 1) Crafter is arguably another simple environment (a 2D grid world) and is in many ways even simpler than ALFWorld, and 2) the new results suggest an overlap between the performance of the Naive baseline and LAC (considering variance), I am less convinced that these additional results fully address concern 2.
Overall, the work has merits, and the authors did a good job addressing most of the reviewers' concerns.