Tree-Planner: Efficient Close-loop Task Planning with Large Language Models
Abstract
Reviews and Discussion
Language model planning frameworks operate iteratively by generating one action per time-step, appending it back to the prompt (along with the observation for closed-loop planning), and repeating the process until the task is executed. Such approaches are token-inefficient because they repeatedly consume prompt tokens (the task, action-observation history, and in-context exemplars) for each action-step generation. Moreover, in closed-loop planning, such methods may often produce infeasible action steps, thus requiring corrective steps or replanning (local or global). In this paper, the authors propose a way to make closed-loop planning (i) more token-efficient (i.e. requiring fewer tokens) than an iterative planner and (ii) more replanning-efficient (requiring fewer corrective steps).
Specifically, the authors propose to sample multiple plans from the LLM in a non-iterative and offline fashion (Plan Sampling). These plans are then merged into a tree-like structure to avoid repeating common actions (Action Tree Construction). Finally, executable actions are selected step by step by prompting the LLM with the task, the current observation, and the history of executed actions (Grounded Deciding). The non-iterative plan sampling approach is more token-efficient. The tree-like structure makes it possible to backtrack and select different steps when an action fails, without having to plan from scratch, avoiding large token costs.
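To make the token-cost contrast concrete, here is a rough back-of-the-envelope comparison (purely illustrative; the notation is ours, not the paper's Section 5.1 formulation). Let $C_p$ be the size of the static prompt (task description plus in-context exemplars), $T$ the number of action steps, $c_a$ the tokens added per action-observation pair, $N$ the number of sampled plans, and $c_d$ the tokens of one grounded-deciding query (assumed much smaller than $C_p$):

$$\underbrace{\sum_{t=1}^{T}\left(C_p + t\,c_a\right)}_{\text{iterative planner}} \;=\; T\,C_p + \frac{T(T+1)}{2}\,c_a \qquad \text{vs.} \qquad \underbrace{C_p + N\,T\,c_a}_{\text{plan sampling}} \;+\; \underbrace{T\,c_d}_{\text{grounded deciding}}.$$

Under these assumptions the large static prompt is paid once rather than $T$ times, which is the source of the claimed token savings.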
Strengths
- Overall, the paper is well-motivated.
- Through a number of empirical and ablation studies in the VirtualHome environment, the authors show that their approach is significantly more token-efficient than general iterative closed-loop planning approaches. The authors also show improved performance w.r.t. ProgPrompt; however, ProgPrompt is more token-efficient (Table 1).
- The framework would be of significance to the planning and decision-making audience (especially in academia) who want to use large language models for similar work.
Weaknesses
- Originality: Placing this paper in the context of existing works (like Tree-of-Thoughts & SayCan), many of the contributions seem to be merely engineering tweaks that do not contribute (significantly) on a more fundamental level. Considering how generating and planning over tree-like structures has been explored in the past in open-loop/offline planning [A][B], where planning trees are generated iteratively while simultaneously grounding the actions, the only novelty here is to decouple the tree construction from the grounding (Grounded Deciding) for token efficiency. Backtracking is also something that has already been introduced in Tree-of-Thoughts, although not in a proper task-planning framework.
- Clarity: From what I understand, the sampled plans were generated in one go (non-iteratively) to avoid higher costs. However, iterative planning is useful in avoiding compounding errors, and it is unclear how the authors handle such errors in their non-iterative setup (see Questions). Certain details like the definition of "corrections" are somewhat ambiguous.
[A] Reasoning with Language Model is Planning with World Model, Hao et al., 2023
[B] SayCanPay: Heuristic Planning with Large Language Models Using Learnable Domain Knowledge, Hazra et al., 2023
Questions
- When sampling plans (action sequences) non-iteratively, an error in one action step (something that diverges from the list of executable actions) can lead to subsequent actions being erroneous, leading to a so-called "compounding effect" as highlighted in [C]. This would lead to most of the sampled plans being infeasible. How do the authors address this problem?
- "In terms of correction efficiency, TREE-PLANNER reduces the number of corrections by 37.99% and 40.52%, respectively:" How do you define and measure an error "correction"? For instance, it could be (i) a new action insertion; (ii) an action deletion; (iii) replacing one action with another generated action. Is it evaluated irrespective of the success rate (i.e. is it only measured for plans that have succeeded in correctly executing the task, or for any plan)?
Minor comments:
a considerable part of the prompt "is" less …
Figure 4: Leave node -> leaf node
Missing citation: SayCanPay (see [B] in Weaknesses)
[C] Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents, Huang et al., 2022
We are deeply grateful for your recognition of our paper's motivation, performance, and potential academic impact. Your positive feedback is highly encouraging.
Weaknesses
W1. [Originality: Placing this paper in the context of existing works (like Tree-of-Thoughts & SayCan), much of the contributions seem to be merely engineering tweaks that do not contribute (significantly) on a more fundamental level.]
Thank you for highlighting the importance of differentiating our work from existing literature. We would like to clarify and further emphasize the novelty and contributions of TREE-PLANNER:
- TREE-PLANNER fundamentally differs from the mentioned works in several aspects: a) Both ToT[1] and RAP[2] leverage LLMs for tree construction. However, these approaches do not tackle decision-making in specific physical environments, a key challenge TREE-PLANNER addresses. Our method is unique in its ability to handle decision-making within constrained environments; b) SayCan[3] and its subsequent developments, like Grounded Decoding[4], utilize hidden layer representations of LLMs. This information is often inaccessible to many researchers and practitioners. TREE-PLANNER, on the other hand, demonstrates an innovative application of black-box LLMs for efficient task planning; c) Another mentioned work, SayCanPay[5], is a concurrent study (submitted to arXiv one month before our submission deadline) and focuses on a different aspect of efficiency. While SayCanPay addresses plan length, our work is centered on token efficiency and error correction. We believe this distinction does not diminish the novelty of our paper. For a more detailed discussion of these differences and their implications, please refer to Section 6 of our paper.
- TREE-PLANNER introduces a novel perspective on token efficiency: Our work is, to the best of our knowledge, the first to identify and address the issue of token efficiency in LLM-based planning, especially in complex environments. This is a critical aspect, as token consumption directly impacts the feasibility and scalability of LLM applications in real-world scenarios.
[References]
[1] Yao S, Yu D, Zhao J, et al. Tree of thoughts: Deliberate problem solving with large language models[J]. arXiv preprint arXiv:2305.10601, 2023.
[2] Hao S, Gu Y, Ma H, et al. Reasoning with the language model is planning with the world model[J]. arXiv preprint arXiv:2305.14992, 2023.
[3] Brohan A, Chebotar Y, Finn C, et al. Do as I can, not as I say: Grounding language in robotic affordances[C]//Conference on Robot Learning. PMLR, 2023: 287-318.
[4] Huang W, Xia F, Shah D, et al. Grounded decoding: Guiding text generation with grounded models for robot control[J]. arXiv preprint arXiv:2303.00855, 2023.
[5] Hazra R, Martires P Z D, De Raedt L. SayCanPay: Heuristic Planning with Large Language Models using Learnable Domain Knowledge[J]. arXiv preprint arXiv:2308.12682, 2023.
W2.1 [Clarity: From what I understand, the sampled plans were generated in one go (non-iteratively) to avoid higher costs. However, iterative planning is useful in avoiding compounding errors, and it is unclear as to how the authors handle such errors in their non-iterative setup (see Questions).]
Thanks for your valuable feedback. A detailed discussion of how our method avoids compounding errors is presented in the response to Q1.
W2.2 [Certain details like the definition of "corrections" are somewhat ambiguous.]
Thanks for the opportunity to include more details about our research. We have included a detailed clarification of "correction" in the response to Q2.
Questions
Q1. [When sampling plans (action sequences) non-iteratively, an error in one action step (something that diverges from the list of executable actions) can lead to subsequent actions being erroneous, leading to a so-called "compounding effect," as highlighted in [C]. This would lead to most of the sampled plans being infeasible. How do the authors address this problem?]
We thank you for your valuable comments.
(i) In Zero-shot Planner[1], compounding errors are traditionally mitigated through heuristic evaluation functions. Our approach effectively utilizes 'grounded deciding' to achieve a similar outcome, thereby preventing compounding errors with real-time environmental observations. To further enhance this process, we have implemented three heuristic evaluation functions akin to those in Zero-shot Planner for Best First Search: a) Language-Guided BFS; b) Environment-Guided BFS; c) Hybrid BFS. The detailed introduction to each method and experimental results are presented in Appendix C.5. Our experimental results reinforce the efficacy of grounded deciding, demonstrating its capability to implicitly avoid compounding errors by leveraging LLMs.
(ii) As detailed in our paper's quantitative analysis in Appendix C.3, the number of unique plans sampled increases approximately linearly with the sampling variable N, thereby ensuring a sufficiently large search space. Furthermore, our study in Section 5.2 illustrates that the plans generated through plan sampling maintain a high success rate. This underscores the robustness and efficiency of our plan sampling methodology.
[Reference]
[1] Huang W, Abbeel P, Pathak D, et al. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents[C]//International Conference on Machine Learning. PMLR, 2022: 9118-9147.
Q2.1 ["In terms of correction efficiency, TREE-PLANNER reduces the number of corrections by 37.99% and 40.52%, respectively:" How do you define and measure an error "correction": For instance, it could be (i) a new action insertion; (ii) an action deletion; (iii) replacing one action with another generated action.]
We appreciate the opportunity to clarify the definition and measurement of "error correction" in our work. In our study, a "correction" is the process where the model replaces an action that failed during execution in the environment with an alternative action. This definition aligns with the one provided by Raman et al. (2022)[1].
[Reference]
[1] Raman S S, Cohen V, Rosen E, et al. Planning with large language models via corrective re-prompting[J]. arXiv preprint arXiv:2211.09935, 2022.
Q2.2 ["Is it evaluated irrespective of the success rate (i.e. is it only measured for plans that have succeded in correctly executing the task or for any plan)?]
Thanks for the opportunity to provide more details. The evaluation of the number of corrections is an effectiveness metric that is independent of the success rate. In other words, we calculate this metric for all plans, including those that successfully execute the task and those that do not. We acknowledge the need to explicitly include these details in the next version of our paper to provide clearer insights into our methodology.
Q3 [Other Comments]
Thank you for your reminder! We will thoroughly revise our paper based on your suggestions.
Thank you authors for taking the time to address the questions.
- Novelty: I'm not convinced by this statement: Our method is unique in its ability to handle decision-making within constrained environments. This can be done with any online (closed-loop) planning framework like SayCan or Grounded Decoding. The novelty here (compared to SayCan) is the corrective action that you enable, which is again nullified if you consider tree search frameworks like ToT. While ToT has not been demonstrated in the classical "planning" domain, one could easily extrapolate the reasoning tasks of ToT into classical planning. Also, I'm utterly confused by what you mean by b) SayCan[3] and its subsequent developments, like Grounded Decoding[4], utilize hidden layer representations of LLMs. This information is often inaccessible to many researchers and practitioners. TREE-PLANNER, on the other hand, demonstrates an innovative application of black-box LLMs for efficient task planning. In the end, the remaining novelty is the improved token efficiency, which, I'd reiterate, seems more like an "engineering" problem lacking significant novelty.
- Regarding compounding effect (Q1): Again, I disagree. The compounding effect arises when an action generated by the LLM does not adhere to any action in the action set of the environment. It is a practical concern because it is difficult to constrain the generation of the LLM to the exact action space. This causes subsequent actions to be affected as well, and it can be more pronounced if the plan is generated in one go. Indeed, grounded deciding would help to ground the plans by selecting executable actions for each step. However, from what I understand, since all the plans were pre-generated, it is likely that the compounding effect would lead to most plans being already infeasible. In the Zero-shot Planner, they use a simple approach to map each generated action to some action in the action set of the environment.
Thank you for your reply and valuable feedback.
A. [Further Clarifications on Novelty]
We appreciate your insights and the opportunity to clarify the novelty of our work.
- Tree-Planner shares similarities with previous methods, yet it advances distinctly beyond them. Other reviewers also acknowledge this improvement, as stated by #cPrG: "The approach itself can be regarded as an enhancement over previous methods. The basis for such improvements is well justified, targeting specific limitations rather than merely fine-tuning pre-existing models. This approach lends a unique originality to the work."
- In response to "this can be done with any online (closed-loop) planning framework like SayCan or Grounded Decoding. The novelty here (compared to SayCan) is the corrective action that you enable": Compared to works that focus on online LLM-based planning like SayCan, Tree-Planner effectively combines offline planning (the plan sampling module) and online planning (the grounded deciding module), rather than merely incorporating action correction into a classical online planning framework. Ultimately, this approach surpasses the performance (success rate) of the online LLM-based planning baselines (Zero-shot Planner[1] and Iterative Planner) even in the setting without action correction. Furthermore, reviewer #mZgD also regards this approach as interesting: "Good idea: While for model-based planning, offline planning might be weaker than integrated planning and execution, which might differ for LLM-based planning. Obtaining a policy without the distraction of the incoming observations might allow a more systematic execution."
- We respectfully disagree that token efficiency is solely an engineering problem, for the following reasons:
a) Computational efficiency has long been studied as a research problem in the NLP community, where researchers focus on how to reduce the computational cost of transformers. Previous studies have explored modifications to the architecture of transformers [2][3] as well as the adoption of alternative sampling methods [4].
b) LLMs' black-box nature imposes constraints on the study of efficiency. It is often impractical and challenging to retrain LLMs after altering their model structure. Consequently, an increasing number of researchers have begun investigating the reduction of token consumption to achieve similar objectives [5][6][7]. The goal of token efficiency is to reduce the number of tokens consumed by LLMs, which is typically associated with computational cost (when the number of API calls is similar, higher token consumption means longer inference time and higher computational cost). Research on token efficiency can be considered another form of efficient-transformer research in the era of LLMs and has the potential to significantly enhance the practical applications of LLMs.
- Our paper also includes a formulation in Section 5.1 to demonstrate the token efficiency of our approach, going beyond relying solely on empirical results.
[Reference]
[1] Huang W, Abbeel P, Pathak D, et al. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents[C]//International Conference on Machine Learning. PMLR, 2022: 9118-9147.
[2] Kitaev N, Kaiser Ł, Levskaya A. Reformer: The efficient transformer[J]. arXiv preprint arXiv:2001.04451, 2020.
[3] Katharopoulos A, Vyas A, Pappas N, et al. Transformers are rnns: Fast autoregressive transformers with linear attention[C]//International conference on machine learning. PMLR, 2020: 5156-5165.
[4] Chen C, Borgeaud S, Irving G, et al. Accelerating large language model decoding with speculative sampling[J]. arXiv preprint arXiv:2302.01318, 2023.
[5] Kaneko M, Okazaki N. Reducing Sequence Length by Predicting Edit Operations with Large Language Models[J]. arXiv preprint arXiv:2305.11862, 2023.
[6] Cheng Z, Kasai J, Yu T. Batch prompting: Efficient inference with large language model apis[J]. arXiv preprint arXiv:2301.08721, 2023.
[7] ICLR 2024 Submission 360. "BatchPrompt: Accomplish more with less."
B. [Further Clarifications on Compounding Effect]
Thank you for providing the opportunity to elaborate further. In response to Q1, we primarily elucidated how we ground actions within a specific scene, where all actions are already grounded in the action space. Regarding how we ground the LLM's generation into the available action space, we employed the following approaches:
- Prompt Engineering (see Appendix F):
a) Our prompt incorporates all available actions and their output formats (Instruction), as well as operable objects in the environment (Global Information).
b) We augmented the prompt with hints to ensure the actions generated by the LLM are viable within the action space: "All action_name of the sub-tasks must be chosen from the above actions and follow the corresponding format."
- In-Context Learning: The in-context exemplars also guide the output format of the Language Model, as analyzed in [1].
- Post-Processing (see Appendix B.3): Even with steps 1 and 2, LLMs might still generate plans that are not actionable. To address this, we implemented an additional post-processing module:
a) Format Checking: If the Language Model's generation includes actions that do not conform to the format (and thus cannot be parsed), these generations are discarded. Given GPT-3.5's robust ability to follow formats, the number of discarded items is minimal.
b) Object & Action Translation: Even if the format is correct, the plan generated by the Language Model might include actions or objects not present in the environment. This often occurs due to the generation of a semantically accurate but not exactly matching result (e.g., the environment's object name is "food_egg," but the generated action includes "egg"). In such cases, we parse the Language Model's action string and match it with the environment's available actions and objects based on BERT similarity. For example, for the LLM's generated string '[Switch On] <telly>,' we first use an action parser to identify the action name as "Switch On" and the object name as "telly." Then, based on BERT similarity, we match these with available actions and objects, resulting in the matched action name "SwitchOn" and object name "TV." Consequently, "[SwitchOn] <TV>" becomes an actionable action in the environment.
In summary, our logical structure is as follows: (i) the response to B addresses how to ground LLM's generation into the available action space (unstructured text -> actionable action); (ii) the response to Q1 focuses on how to select appropriate actions within a specific environment (actionable action -> correct action).
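As a concrete illustration of the Object & Action Translation step described above, here is a minimal, self-contained sketch (the helper names are ours, and the toy character-trigram similarity is only a stand-in for the BERT-based similarity used in our implementation):

```python
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy character-trigram profile; a real implementation would use BERT embeddings.
    t = text.lower().replace("_", " ")
    return Counter(t[i:i + 3] for i in range(max(len(t) - 2, 1)))

def similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between two trigram profiles.
    dot = sum(a[k] * b[k] for k in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb + 1e-8)

def translate_action(raw: str, env_actions: list[str], env_objects: list[str]) -> str | None:
    """Map an LLM-generated action string onto the environment vocabulary.

    Example: '[Switch On] <telly>' is parsed into ('Switch On', 'telly') and then
    matched to the closest available action and object, e.g. '[SwitchOn] <TV>'.
    Returns None when the string cannot be parsed (format checking).
    """
    m = re.match(r"\[(?P<action>[^\]]+)\]\s*<(?P<obj>[^>]+)>", raw)
    if m is None:
        return None  # discard generations that do not follow the required format
    action, obj = m.group("action"), m.group("obj")
    best_action = max(env_actions, key=lambda a: similarity(embed(action), embed(a)))
    best_object = max(env_objects, key=lambda o: similarity(embed(obj), embed(o)))
    return f"[{best_action}] <{best_object}>"
```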
[References]
[1] Min S, Lyu X, Holtzman A, et al. Rethinking the role of demonstrations: What makes in-context learning work?[J]. arXiv preprint arXiv:2202.12837, 2022.
Thanks again for your meticulous review of our paper. Should you have any further concerns, we are more than willing to provide additional information.
Dear Reviewer #m5sK
Thanks for the insightful and constructive critique you provided on our manuscript. With the rebuttal period drawing to a close, we are eager to know whether our responses and revisions have effectively addressed your initial concerns and queries.
We are looking forward to receiving your post-rebuttal rating and any additional feedback you may have. Please feel free to reach out if you require further information or suggest any additional modifications.
Once again, thank you for your invaluable contribution and guidance.
Best wishes,
Authors
This paper revisits LLM-based closed-loop task planning on a classical application benchmark (VirtualHome). Previous LLM use has faced two major limitations: high token consumption and redundant error correction. The proposed approach, TREE-PLANNER, aims at addressing the above limitations by reframing LLM-based task planning into three distinct phases: plan sampling, action tree construction, and grounded deciding. This decomposition of queries into a single plan-sampling call ensures token efficiency, while factoring plans into action trees facilitates backtracking and error correction. The system is compared to two baseline systems, Zero-Shot Planner and ProgPrompt, and surpasses the state of the art on multiple criteria, each time by at least a few percentage points.
Strengths
The paper addresses a competitive topic in the field of LLMs for agents, and its rationale is presented in a very compelling fashion. It is technically sound, with an excellent balance of technical description and use of supplementary material for examples. The approach itself can be seen as an improvement over previous approaches, but the grounds for such improvements are well justified and address specific limitations rather than incremental tuning of pre-existing models, conferring originality to the work. Very good presentation of related work by structuring it through relevant topics (with the exception of LLMs for traditional planning). Well-presented technical insights, in particular the formalization of token consumption and the analysis of error types. Detailed results, using a panel of criteria both technical and practical (i.e. cost), which are quite convincing.
Weaknesses
Although the paper includes a fairly comprehensive list of related works, it does not address some of the LLM-based planning work. For instance, while 'Tree of Thoughts' is referenced, there is no mention of 'Faithful CoT' [Lyu et al., 2023] and its suggestion for LLM to produce PDDL representations to be passed to a regular Planner (which would still support online planning or replanning via restarting). It would be good to include such a discussion, or to justify not including it (e.g. on grounds of partial observability), provided such grounds effectively preclude adoption.
Lyu, Q., Havaldar, S., Stein, A., Zhang, L., Rao, D., Wong, E., Apidianaki, M. and Callison-Burch, C., 2023. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379.
Questions
- What guarantees scalability to domains comprising a larger number of actions and greater tree perplexity?
- Could you discuss how different LLM could be used (GPT-4, or LLaMA) with a similar approach and whether the prompting strategy would be affected?
- Is partial observability really a salient property of this specific set of 28 actions?
Details of Ethics Concerns
N/A
We are deeply grateful for your encouraging and insightful feedback, particularly your recognition of our paper's technical soundness and originality. Your appreciation of our formalization of token efficiency and the comprehensive nature of our quantitative and qualitative results is highly valued. We are also thankful for your acknowledgment of the balance we struck between technical detail, clear examples, and the structured presentation of our related work. Such feedback motivates us to further our research in this promising field. Thank you for your thorough and constructive review.
Weaknesses
W1. [Although the paper includes a fairly comprehensive list of related works, it does not address some of the LLM-based planning work.]
Thank you for your valuable feedback. Integrating LLMs with traditional task planners (like PDDL planners) is an important aspect of LLM-based planning, and we will include a discussion of these papers in the next version of our manuscript. [1][2] propose producing the PDDL description of a given problem and then leveraging a classical planner to find an optimal plan. However, such work typically relies on the existence of a domain PDDL, which is often unavailable in many scenarios. While [3] suggests using LLMs as world models to construct domain PDDLs, this process still necessitates manual intervention due to issues like LLM hallucination. Besides, this line of work generally assumes a fully observable environment, whereas our paper focuses on the more challenging partially observable setting.
[Reference]
[1] Lyu, Q., Havaldar, S., Stein, A., Zhang, L., Rao, D., Wong, E., Apidianaki, M. and Callison-Burch, C., 2023. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379.
[2] Liu B, Jiang Y, Zhang X, et al. Llm+ p: Empowering large language models with optimal planning proficiency[J]. arXiv preprint arXiv:2304.11477, 2023.
[3] Guan L, Valmeekam K, Sreedharan S, et al. Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning[J]. arXiv preprint arXiv:2305.14909, 2023.
Questions
Q1. [What guarantees scalability to domains comprising a larger number of actions and greater tree perplexity?]
We greatly appreciate the insightful feedback provided. In response to this question, we will provide explanations from two distinct perspectives:
(i) The stored commonsense knowledge within LLMs plays a crucial role in maintaining the tree perplexity within manageable limits. This is due to LLMs' ability to efficiently navigate and prune decision trees based on contextual understanding.
(ii) Grounded deciding can also be combined with other search algorithms to address issues of greater tree perplexity. For instance, AlphaGo[1] uses Monte Carlo Tree Search (MCTS) to handle high tree complexity. We conducted additional experiments using the Best-First Search algorithm, as detailed in Appendix C.5.
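For illustration, here is a minimal best-first search sketch over an action tree (all names are ours; the scoring function is a placeholder for a language- or environment-guided heuristic such as those evaluated in Appendix C.5, not our exact implementation):

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Node:
    action: str
    children: list["Node"] = field(default_factory=list)

def best_first_search(root: Node, score, is_goal) -> list[str]:
    """Expand tree nodes in order of a heuristic score (higher = more promising).

    `score(path)` is a placeholder for a language-/environment-guided heuristic;
    `is_goal(path)` checks whether the executed prefix satisfies the task.
    """
    counter = 0  # tie-breaker so heapq never has to compare Node objects
    frontier = [(-score([root.action]), counter, root, [root.action])]
    while frontier:
        _, _, node, path = heapq.heappop(frontier)
        if is_goal(path):
            return path
        for child in node.children:
            counter += 1
            new_path = path + [child.action]
            heapq.heappush(frontier, (-score(new_path), counter, child, new_path))
    return []
```

Grounded deciding can be seen as replacing the fixed `score` with an LLM-based choice at each branching node of the tree.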
[Reference]
[1] Silver, D., Huang, A., Maddison, C. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016). https://doi.org/10.1038/nature16961
Q2 [Could you discuss how different LLM could be used (GPT-4 or LLaMA) with a similar approach and whether the prompting strategy would be affected?]
Thank you for your valuable question about the interplay between the choice of Large Language Models (LLMs) and the prompting strategy. Your question raises an important aspect of our work: the potential variance in effectiveness when employing other advanced LLMs like GPT-4 or LLaMA.
As for the influence of prompting strategy, we conducted experiments with three additional prompting strategies: (i) GPT-4 optimized prompt, (ii) code-style prompt, and (iii) chain-of-thought prompt. (more details can be found in Additional Experiments in General Response)
| Method | EXEC ↑ | SR ↑ | GCR ↑ | Cost ↓ | No. Correction ↓ |
|---|---|---|---|---|---|
| TreePlanner | 88.26±2.47 | 41.58±3.20 | 59.55±3.20 | 4.54±0.16 | 2.04±0.26 |
| GPT-4 Optimized Prompt | 89.52±2.15 | 39.25±0.54 | 58.80±1.05 | 4.73±0.13 | 2.06±0.18 |
| Code-Style Prompt | 70.70±3.24 | 37.10±0.89 | 50.74±2.73 | 6.25±0.57 | 3.12±0.31 |
| Chain-of-thought | 89.13±2.04 | 42.02±1.27 | 60.13±1.89 | 5.72±0.91 | 2.38±0.17 |
We observed that different prompting strategies yielded varying results. For instance, using a GPT-4 optimized prompt and a code-style prompt didn't significantly improve performance compared to the original prompt. In contrast, employing a chain-of-thought prompting strategy resulted in a slight performance enhancement.
It is worth noting that code-style prompting can lead to a significant decrease in performance. This could be attributed to the relatively weaker code-processing capabilities of GPT-3.5 itself. In contrast, in the experiments with ProgPrompt[1], code-style prompts demonstrated stronger performance than natural language prompts, where the experiments were conducted using Codex[2]. As the Codex API has been discontinued, we could not conduct similar experiments on Codex to explore this further. In future work, we will explore other code LLMs[3][4] for our research.
Based on these findings, we believe that the choice of prompting strategy should be compatible with the training approach of the LLM. For example, code-style prompts may be more suitable for LLMs trained on code. Additionally, for LLMs trained on structured data, such as GraphGPT[5], directly providing the symbolic graph as input may yield better results than converting it into natural language prompts as in our study. Considering GPT-4's advanced understanding and context retention abilities, it's plausible that a more complex and nuanced prompting strategy could yield better results than we've observed with GPT-3.5.
We appreciate your question for highlighting this important avenue for future research.
[Reference]
[1] Singh I, Blukis V, Mousavian A, et al. Progprompt: Generating situated robot task plans using large language models[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 11523-11530.
[2] Chen M, Tworek J, Jun H, et al. Evaluating large language models trained on code[J]. arXiv preprint arXiv:2107.03374, 2021.
[3] Xu Y, Su H, Xing C, et al. Lemur: Harmonizing Natural Language and Code for Language Agents[J]. arXiv preprint arXiv:2310.06830, 2023.
[4] Roziere B, Gehring J, Gloeckle F, et al. Code llama: Open foundation models for code[J]. arXiv preprint arXiv:2308.12950, 2023.
[5] Tang J, Yang Y, Wei W, et al. Graphgpt: Graph instruction tuning for large language models[J]. arXiv preprint arXiv:2310.13023, 2023.
Q3. [Is partial observability really a salient property of this specific set of 28 actions?]
We appreciate your inquiry regarding the salience of partial observability for the action set in our study. Indeed, partial observability plays a crucial role in successfully executing these actions within the VirtualHome environment. To illustrate: (i) Manipulation-Related Actions: Actions such as 'Grab' and 'Touch' are successful only when the robot is close to the target object. This proximity is a crucial aspect of the robot's partial observation, determining the feasibility of these actions. (ii) Navigation-Related Actions: Actions like 'Walk' and 'Find' are available only when the robot and the target object are in the same room. (iii) Object Accessibility: The robot's ability to interact with objects is also governed by partial observability. For example, if an object is placed inside a closed container, it becomes invisible to the robot's observation, rendering the object inaccessible for any interaction.
These examples underscore that partial observability is not just a theoretical consideration but a practical constraint that significantly influences the behaviors of the robot.
Dear Authors, Thanks for your comprehensive response. I was not totally convinced by the statement about common sense being embedded in LLM (Q1), as this point is still debated, in particular the extent to which common sense can support other tasks such as planning, but you have provided a reference. I was also marginally convinced on partial observability, although I now accept some relevance to the 28-actions scenario. I found your answer to Q2 a valuable extension to the paper's result. However, I found your answer to Reviewer K1gw on comparison to search-based planning somehow surprising since heuristics can be derived automatically (that is the essence of modern search-based planning). Considering these various aspects I see no argument to further raise my rating which is already positive.
Dear Reviewer #cPrG,
Thanks for your invaluable contribution and guidance for our work.
We are pleased you found our response to Question 2 a significant extension of our paper's results. Regarding your recognition of our search-based experiments, we are grateful for your acknowledgment and will ensure this aspect is more clearly highlighted in our revised manuscript.
Regarding your concerns about embedding common sense in LLMs (Q1), we understand this is an area of ongoing debate. Our reference aims to provide a perspective on current understanding, but we acknowledge that the field is rapidly evolving. We believe that Tree-Planner and ongoing research will continue to shed light on the extent to which common sense can aid in complex tasks such as task planning.
Thank you once again for your positive evaluation and constructive comments.
Best regards,
Authors
The paper presents an algorithm specialized in LLMs for interacting with an environment for solving tasks that require planning. In this context, the state, the observations and the actions are represented as text.
In previous work, called iterative planning in the submission, when an action is executed, another call to an LLM appends the observation and attempts to obtain a new action. In the planning literature, that is close to what is called integrated planning & execution, as the agent must choose actions for achieving the goal, but observations might lead the agent to change the course of action completely.
In contrast, the paper follows a more restricted mode that can be more efficient under some circumstances: offline planning, then execution of the plan. This is the more studied case in the symbolic planning literature, where offline planning has received much more attention than execution.
The proposed algorithm —Tree-Planner— is relevant for domains expressed in text, where an LLM could approximate a world model and could produce a plan using knowledge widely available in text.
While the task can be formulated as a POMDP, the initial phase –called plan sampling– does not deal with further possible observations but just the general scene. Indeed, the samples are converted into a tree that represents common prefixes of the plans. So, the policy from plan sampling can be seen as a stochastic policy. That makes sense in some cases, but it generally cannot solve POMDPs where policies might need to map state distributions into actions and update state distributions with new observations.
Always assuming the use of an LLM, the main hypotheses mentioned in the abstract are:
- Tree-planner might be more effective than alternatives like iterative planning by avoiding redundant error correction.
- Tree-planner might have a lower token cost.
The second hypothesis is true, almost by construction, as the algorithm only needs less information about actions and objects. The first hypothesis is only studied for VirtualHome, a domain of some interest. The main claim is that backtracking allows more effective recovery than appending the error and letting the LLM recover.
The paper shows how high-quality data was refined to allow the evaluation of metrics like success rate.
The experimental setup reports a comparison with baselines including iterative-planning, all using the proprietary LLM GPT-3.5.
The results show improvements in the 35 tasks of the dataset. For N=50 samples, the best possible GCR (goal condition recall) reaches 81.2% of the tasks.
Strengths
- Good idea: While for model-based planning, offline planning might be weaker than integrated planning and execution, that might be different for LLM-based planning. Obtaining a policy without the distraction of the incoming observations might allow a more systematic execution.
- The paper is reasonably well-written.
- The benchmark is interesting.
- The LLM-based baselines are reasonable enough. Some readers might want to see Chain-of-Thought or Self-reflection, but I think zero-shot and iterative replanning are enough.
- The results using the ground truth plans (oracle planner) show that more samples lead to a smaller improvement, showing the initial phase tends to find more of the correct plan.
- Using success rate is robust with respect to alternative ways to achieve the goal.
Weaknesses
- Not clear if the policy is grounded in the environment or in the LLM implicit distribution.
- Section 3.1 mentions the initial observation as part of the prompt, but the prompts in section F include a long “(observation)” for iterative planning and the grounded-deciding phase, but not for plan sampling.
- If a detailed initial observation is given, then it’s possible that many of the actions are grounded on them, reducing the significance of seeing the algorithm as dealing with a POMDP.
- If a detailed initial observation is not given, then the LLM might be retrieving possible courses of action given the initial description but not grounded in the environment.
- Both scenarios diminish the apparent significance of the work.
- No discussion of the systematicity of LLM-based planning.
- LLM-based planning is a popular topic, but the increasing body of work might call for a more careful examination. Each paper in this direction is an opportunity to examine the problem and the methodology.
- In general, planning seems easier in the “happy path,” where text with high likelihood matches what might work in an environment, but things can get more complicated quickly.
- Consider the task of “calling Mary using the cell phone”. A good plan is not about assuming that the cell phone is on a table but exploring the space until the cell phone is found. A diverse set of plans might include the more usual places, perhaps even mentioning multiple places in a single path. But it’s possible that the cell phone is in the fridge or in the supermarket bag.
- While the prompt in the grounded deciding phase includes the full observation, the super-market bag would be there, prompting to choose among a few actions to pass the responsibility of examining the supermarket bag and the freezer to the plan sampling phase.
- It’s possible that finding a cell phone might require samples beyond a fixed N, and no alternative is offered.
- So, the key underlying problem is that a flexible plan dealing with observations cannot be found using a fixed number of samples. Some tasks might require fewer samples, and some might require more. That’s precisely why planning algorithms use search.
- Experimental significance: a single environment with only 35 tasks.
- The dataset subsection explains that 35 tasks were used, but the tasks were not listed. The VH has many other tasks. The 35 tasks may be biased in a way that affects the significance of the observations.
- Experimental significance: Only VH
- There are other environments for testing these ideas. For instance, the reference below. The related work section must cover what other alternatives are used in related work and why it’s a good idea to select only this environment.
- (Jericho) Hausknecht, Matthew, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. “Interactive Fiction Games: A Colossal Adventure.” arXiv:1909.05398 [Cs], February 25, 2020. http://arxiv.org/abs/1909.05398.
- In general, given the absence of task descriptions and their plans, it’s hard to know whether the results are significant.
- Even if we have the 35 tasks and their plans, the results might be an artifact of this particular environment. Perhaps the current version of GPT 3.5 is better at those household tasks than at navigating in a store to buy groceries.
Secondary issues:
- Goal Condition Recall might be a misleading metric
- different subgoals might be easier than others
- we don’t know the structure of the subgoals
- Confusions in the theoretical emphasis on POMDPs.
- The hallmark of POMDPs, being a Markov decision process, is the Markov assumption: decisions can be made just by looking at the state without the history. However, the policy in section 2 includes the history of actions but perhaps not the observation history.
- The emphasis on POMDPs is over-stated as the VH has high visibility while other environments feature more partial observability (for instance, Jericho)
- Global replan is non-comparable with other algorithms.
- Whether it’s possible to teleport to the initial state or not fundamentally changes the problem.
- The notion of inverse actions is a form of symbolic knowledge, leaving the scope of the paper.
- A more purist approach would task the LLM to undo the last k actions, perhaps offering the list of actions.
Minor
- The best GCR in section 5.2 is interesting, but GCR is not a good predictor of SR (success rate).
- Actually, in Table 2, the gap between GCR and SR is higher for Tree-Planner than for other approaches.
- Related: Table 3 reports 45.5% of errors due to missing correct plan.
- It must be clarified if the LLM is called when there is only one possible action.
- Some action trees in the appendix have a few actions with no branches. This affects the overall cost but not other statistics.
- Alg 1 doesn’t add much. Any algorithm for building a “trie” data structure would work.
Recommendations:
- The paper should distinguish the world model from planning. Discussing the percentage of correct plans is useful but not actionable in a new domain as the set of possible plans in a domain is heavily unbounded.
- The directions suggested in error analysis (re-sample, chain-of-thought, self-reflection) might increase the context size, reducing one of the advantages.
Questions
- What are some other environments where this method could be tested?
- Why was only VH selected?
- Given a fixed task, is the same prompt used for the four scenes?
- Does the observation of plan-sampling include the full description of the room or just the type of room?
- I understand that in VirtualHome, the top-down view allows the agent to see everything except what’s hidden in containers like drawers.
- The prompts in section F include a long “(observation)” for iterative planning and the grounded-deciding phase, but not for plan sampling.
- Did you consider adding the observation history?
- The list of actions with the observation context where they were made might need to be more informative.
- Unless the observations are monotonic, meaning that after each action, the agent observes more and more. For instance, once the agent opens a drawer, it would still see what's there.
- Are the observations monotonic?
- GCR metric: what’s the distribution of the number of goal conditions? Are the number of goal conditions the same for each of the 35 tasks?
- Sect 5.2
- Did you randomize the results for table 2? A new seed might generate another set of plans.
- What would be the results for SR? Table 1 shows how GCR grows with SR, but doesn’t predict SR. Actually, the gap between GCR and SR is higher for Tree-Planner than for other approaches.
- Related: Table 3 reports 45.5% of errors due to missing correct plan.
- What is the inverse of all the actions?
- The supplementary material lists them, please list the corresponding inverse, say how that information was obtained, and discuss how realistic it is for that information to be available or learned.
- Branches in action trees.
- Some node action trees have no branches. See leftmost path Fig 12. Is the LLM prompted to choose one action, or are such actions executed directly? Is this accounted for in the cost of the predictions?
- Please report the average number of branches for reaching the leaves for each of the 35 tasks. This is equivalent to aggregating sequences without branching into a single action. For instance, Fig 12 has 8 leaves with [1, 1, 2, 2, 3, 3, 2, 3] branches, an average of 2.125 branches per leaf. So, the agent should make an average of 2.125 decisions when following that branch.
- Please discuss the number of error corrections in Table 1 compared to the number of branches. Otherwise, it’s hard to know whether 3.29 vs 1.86 is a significant difference.
- Sect 4.1: Dataset: what are the 35 unique tasks?
- VirtualHome has multiple programs for some tasks, not a single gold plan.
- Did you create new scenes or used scenes in VH?
- Are the gold truth plans associated to a task or to (task, scene)?
- Please list the 35 tasks descriptions.
- What is the distribution of the number of steps for the gold truth? Even better, provide this information for each task
- Methodological: was N=50 selected using the 35 tasks?
- Is it possible that examining the tasks leads to bias in the selection of N?
- Sect 4.1: implementation.
- Was majority vote only used for grounded-deciding? Please add a reference for the method.
- Table 3: errors for Tree-planner N 25
- Is this success rate, SR?
- The false negative is misleading. In the appendix, the error explanation says that the environment reports that there is no keyboard. Are keyboards part of the possible objects? The observations in the prompt mention a computer but no keyboard.
- The example of the keyboard should be an environment misunderstanding.
- Please fix the tables and refine the definition of false negative. Perhaps you might want to add false negatives to the missing correct plan category, calling it "semantically correct."
- What are the error modalities with correction?
- Sect C.1 describes performance across plan length.
- Can you provide the statistics for other approaches?
- What about kind of task?
We are grateful for the reviewer's appreciation of our novel LLM-based planning approach and the paper's clarity. Your feedback is very valuable to us.
Weaknesses
W1. [Not clear if the policy is grounded in the environment or the LLM implicit distribution.]
Thank you for highlighting this aspect. In our experiments, detailed initial observations were included in the prompt. About "the LLM potentially retrieving actions based on the initial description rather than being environmentally grounded": Firstly, while the LLM's initial plan sampling might not be directly grounded, it provides a broad search space. In the grounded deciding stage, the LLM can select environmentally grounded actions from this set. Secondly, the LLM is prompted to generate plans conforming to commonsense logic during the plan sampling stage, as indicated in "The temporal relationship between subtask sequences must adhere to commonsense logic." (Appendix F). This approach ensures that the LLM considers dynamic environmental changes, reducing reliance on the initial state alone and enhancing the practical relevance of our model in dealing with POMDP scenarios. We hope these perspectives will be considered, and we sincerely appreciate the opportunity to clarify these points.
W2. [No discussion of the systematicity of LLM-based planning.]
We appreciate your insightful observations regarding the systematicity and flexibility of LLM-based planning. Your concern about the fixed number of samples (N) in planning tasks is well-founded and highlights a crucial aspect of our research.
We have conducted additional experiments using GPT-4 to dynamically adjust the sampling number (N) based on the specific task. This approach allows the model to propose a more tailored sampling N and plan length, depending on the task's complexity and context.
As detailed in our response to Q11.1, these experiments revealed that while the model's prediction of the number of steps has a reasonable mean error (3.19 steps), there is significant variability in performance across different tasks, as indicated by a standard deviation of 2.73 steps. This variability underscores the challenge of predicting task complexity with high accuracy.
Moreover, we observed a moderate positive correlation between task complexity (measured by the number of steps) and the suggested Sampling_N. This finding aligns with the intuitive notion that more complex tasks require more samples for exploring a more comprehensive range of potential plans.
In conclusion, dynamically adjusting the sampling N based on the task is a promising method to address the concern you raised. However, it is clear that the LLM's capability to predict the number of steps accurately still requires further improvement. This could be achieved through advanced techniques such as prompt engineering and in-context learning, which we plan to explore in our future work. Your feedback has been invaluable in guiding these efforts and ensuring our methodology is robust and adaptable to various planning scenarios.
W3. [Experimental significance: a single environment with only 35 tasks. The 35 tasks may be biased in a way that affects the significance of the observations.]
Thank you for pointing this out. We have tried to avoid introducing bias in the process of dataset construction.
In earlier planning papers based on VirtualHome, evaluations were based on the Longest Common Subsequence (LCS) and human evaluations [1]. While this applies to the entire VirtualHome dataset, it has two main drawbacks:
- The LCS metric struggles to effectively assess the actual impact of a plan in an environment;
- Human evaluation is highly labor-intensive.
Therefore, we referred to the method of using success rate to evaluate the success of a plan [2]. However, some tasks in VirtualHome are not suitable for this type of automated testing. For example:
Example 1: Task Name: Juggling Description: The robot would catch any items thrown at it and immediately start to juggle them at a pro level. It would have a max of 6 things it can juggle at once. That would be awesome to watch. Explanation: The task name is ambiguous, and the task's goal conditions are unclear.
Example 2: Task Name: Check Email Description: Turn on the laptop. Connect to the Internet. Type in the URL. Check for new email messages. Explanation: As it is impossible to obtain information about "checking email" in VirtualHome, the final goal conditions might include On(computer), but it is impossible to determine if the robot checked the email, allowing for potential cheating by the planner.
The annotation process mainly involved filtering out the parts of VirtualHome unsuitable for automated testing (success rate). Thus, we believe this annotation method avoids introducing bias as much as possible, allows for better automated testing, and avoids the subjectivity of human evaluation and the limitations of LCS.
[References]
[1] Huang W, Abbeel P, Pathak D, et al. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents[C]//International Conference on Machine Learning. PMLR, 2022: 9118-9147.
[2] Singh I, Blukis V, Mousavian A, et al. Progprompt: Generating situated robot task plans using large language models[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 11523-11530.
W4. [Experimental significance: Only VH]
Thanks for your valuable feedback. We have provided a detailed discussion about this question in the response to Q1.
W5. [In general, given the absence of task descriptions and their plans, it's hard to know whether the results are significant.]
We appreciate the opportunity to provide additional details about the dataset. This is presented in the response to Q11.
W6.1 [Secondary issues - Goal Condition Recall might be a misleading metric. Different subgoals might be easier than others]
We appreciate your insights regarding using Goal Condition Recall (GCR) as a metric in our study. GCR was chosen as an intermediate indicator alongside Success Rate (SR) to provide a more granular assessment of the system's performance in task planning. While SR is an excellent measure of absolute performance, it can be overly strict, and GCR offers a way to evaluate performance nuances that SR might overlook. Moreover, the effectiveness of GCR is not only in assessing absolute performance but also in comparing relative performance against baseline models. In summary, while we recognize the potential limitations of GCR as a standalone metric, its combination with SR provides a comprehensive and nuanced understanding of the system's performance.
W6.2 [Secondary issues - we don't know the structure of the subgoals]
In Section 4.1, we introduced goal conditions, and here we provide a more comprehensive explanation of the structure of subgoals. Subgoals primarily consist of the eventual state of the relationship between two objects and the final desired state of a specific object. Either part may be absent, but at least one subgoal is mandatory. Taking the task "Make Coffee" as an example, it has two goal conditions: one is inside(coffee, coffee_maker), and the other is on(coffee_maker).
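To make this concrete, below is a small illustrative sketch of how such goal conditions could be represented and how a recall-style metric over them could be computed (the representation and helper are ours for illustration; they are not necessarily the exact format or GCR definition used in the paper):

```python
# Goal conditions represented as (predicate, arguments) tuples -- illustrative only.
goal_conditions = {
    ("inside", ("coffee", "coffee_maker")),
    ("on", ("coffee_maker",)),
}

def goal_condition_recall(final_state: set, goals: set) -> float:
    """Fraction of goal conditions that hold in the final environment state."""
    if not goals:
        return 1.0
    return len(goals & final_state) / len(goals)

# Example: the plan switched the coffee maker on but never put the coffee inside it.
final_state = {("on", ("coffee_maker",)), ("inside", ("mug", "cabinet"))}
print(goal_condition_recall(final_state, goal_conditions))  # 0.5
```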
W7.1 [Secondary issues - Confusions in the theoretical emphasis on POMDPs. The hallmark of POMDPs, being a Markov decision process, is the Markov assumption: decisions can be made just by looking at the state without the history. However, the policy in section 2 includes the history of actions but perhaps not the observation history.]
Thank you for raising your concern. We appreciate your interest in our approach and would like to take this opportunity to clarify how it aligns with the POMDP framework.
In traditional Markov Decision Processes (MDPs), the Markov assumption implies that decisions can be based solely on the current state without needing historical information. However, in the case of POMDPs, due to the inherent partial observability of states, it becomes necessary to consider historical data to estimate the current state better. This does not contradict the Markov assumption but instead adapts it to the context of partial observability.
Moreover, our approach is consistent with methodologies applied in recent literature[1][2]. Our experiments were designed to ensure a fair comparison between the Tree Planner and other baseline models. We remain confident that including action history in our policy does not undermine the findings of our paper. We appreciate your consideration of these points.
[Reference]
[1] Singh I, Blukis V, Mousavian A, et al. Progprompt: Generating situated robot task plans using large language models[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 11523-11530.
[2] Li S, Puig X, Paxton C, et al. Pre-trained language models for interactive decision-making[J]. Advances in Neural Information Processing Systems, 2022, 35: 31199-31212.
W7.2 [Secondary issues - Confusions in the theoretical emphasis on POMDPs. The focus on POMDPs is over-stated as the VH has high visibility while other environments feature more partial observability (for instance, Jericho).]
We appreciate the opportunity to clarify this point further. Indeed, the partial observation in VirtualHome is primarily manifested in the visibility limitations within a room and the inability to observe objects inside closed containers. This setup provides more observability compared to environments like Jericho[1] or Alfworld[2], and we plan to include a comparison with these environments in our revised paper to highlight this difference.
However, we still maintain that VirtualHome satisfies the definition of partial observation. Our main objective is to distinguish between partial and complete observation in VirtualHome, as in full observation settings, task planning becomes simpler and does not require exploration of the environment. The partial observation in VirtualHome is more realistic for real-world scenarios. Moreover, while VirtualHome offers more observability than other partial observation environments, which may reduce the POMDP challenge to some extent, it also introduces unique issues related to token efficiency. This aspect adds value to our research and contributes to a more nuanced understanding of POMDPs in different settings.
[Reference]
[1] (Jericho) Hausknecht M, Ammanabrolu P, Côté M A, Yuan X. Interactive Fiction Games: A Colossal Adventure[J]. arXiv preprint arXiv:1909.05398, 2020.
[2] Shridhar M, Yuan X, Côté M A, et al. Alfworld: Aligning text and embodied environments for interactive learning[J]. arXiv preprint arXiv:2010.03768, 2020.
W8. [Secondary issues - Global replan is non-comparable with other algorithms]
Thank you for your valuable feedback. The design of our global replan baseline was primarily inspired by error correction approaches detailed in [1]. In specific scenarios, such as web interaction [2], the assumptions of global replan are particularly applicable. This can lead to an uneven comparison to some extent. In the next version of our paper, we plan to elaborate on this point, providing a clearer distinction between global replan and other methodologies.
[Reference]
[1] Shinn N, Cassano F, Gopinath A, et al. Reflexion: Language agents with verbal reinforcement learning[C]//Thirty-seventh Conference on Neural Information Processing Systems. 2023.
[2] Yao S, Chen H, Yang J, et al. Webshop: Towards scalable real-world web interaction with grounded language agents[J]. Advances in Neural Information Processing Systems, 2022, 35: 20744-20757.
W9. [Secondary issues - The notion of inverse actions is a form of symbolic knowledge, leaving the scope of the paper. A more purist approach would task the LLM to undo the last k actions, perhaps offering the list of actions.]
We appreciate your insights regarding the concept of inverse actions in our paper. In response to Q7, we provided more details of inverse actions used in our study and explored the feasibility of learning inverse actions in real-world scenarios.
We agree that employing a language model like GPT-4 to undo actions is viable. To this end, we conducted experiments in which we provided GPT-4 with instructions (similar to those in Appendix F), a list of k executed actions, and their corresponding observations (formatted as alternating action–observation pairs, i.e., [Action1] [Observation1] [Action2] [Observation2] ...). As demonstrated in the case study below, GPT-4 can accurately generate the corresponding inverse actions.
Case study:
Task: Make coffee
Executed Actions:
- [Walk] <dining_room> (1)
- [Find] <coffee> (1)
- [Grab] <coffee> (1)
- [Find] <coffee_maker> (1)
- [Open] <coffee_maker> (1)
- [PutBack] <coffee> (1) <coffee_maker> (1)
- [Close] <coffee_maker> (1)
- [SwitchOn] <coffee_maker> (1)
Generated Undo Actions by GPT4:
- [SwitchOff] <coffee_maker> (1)
- [Open] <coffee_maker> (1)
- [Grab] <coffee> (1)
- [Close] <coffee_maker> (1)
- [Find] <table> (1)
- [PutBack] <coffee> (1) <table> (1)
- [Walk] <home_office> (1)
W10. [Minor - The best GCR in section 5.2 is interesting, but GCR is not a good predictor of SR (success rate).]
Thank you for your insightful question regarding the relationship between GCR and SR in our experimental results; we appreciate the opportunity to address this point. Please see our response to Q6. We have also added the corresponding SR results to Appendix C.4 in the updated version of the paper.
W11. [Minor - It must be clarified if the LLM is called when there is only one possible action.]
Thank you for pointing this out. We do not call the LLM when there is only one possible action; instead, we apply that action directly. More discussion is provided in the response to Q8.
W12. [Minor - Alg 1 doesn't add much. Any algorithm for building a "trie" data structure would work.]
We agree that other standard tree construction algorithms could serve as substitutes. The intention behind presenting Algorithm 1 was to delineate the process of building the action tree in TREE-PLANNER in a clear, algorithmic format. We recognize, however, that this has occupied a notable portion of the manuscript. To address this, in the next version of our paper, we plan to describe the action tree construction process in a textual format for brevity and move Algorithm 1 to the Appendix. This change will conserve space in the main text while still providing detailed methodological insights for interested readers.
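For readers who want a concrete reference point, below is a minimal sketch of how sampled plans could be merged into an action tree, i.e., a trie keyed on action strings. The names `ActionNode` and `build_action_tree` are ours for illustration and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ActionNode:
    action: str                                   # e.g., "[Walk] <dining_room> (1)"
    children: Dict[str, "ActionNode"] = field(default_factory=dict)

def build_action_tree(plans: List[List[str]]) -> ActionNode:
    """Merge sampled plans into a trie so that shared prefixes are stored once."""
    root = ActionNode(action="<root>")
    for plan in plans:
        node = root
        for action in plan:
            # Reuse the existing branch if this action was already sampled here,
            # otherwise create a new child node.
            node = node.children.setdefault(action, ActionNode(action=action))
    return root

# Two sampled plans that share their first two steps are merged into one branch.
plans = [
    ["[Walk] <dining_room> (1)", "[Find] <coffee> (1)", "[Grab] <coffee> (1)"],
    ["[Walk] <dining_room> (1)", "[Find] <coffee> (1)", "[Find] <coffee_maker> (1)"],
]
tree = build_action_tree(plans)
assert len(tree.children) == 1  # the shared prefix appears only once
```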
W13. [Recommendations - The paper should distinguish the world model from planning. Discussing the percentage of correct plans is useful but not actionable in a new domain as the set of possible plans in a domain is heavily unbounded.]
We thank the reviewer for their valuable feedback. In the next version of our paper, we will explicitly add content to distinguish between the world model and planning. We acknowledge that calculating the proportion of correct plans becomes exceedingly difficult when the number of potential plans is unbounded. The plan sampling process effectively narrows the space of possible plans down to a more manageable scope: despite the effectively unbounded set of plans in each new domain, our method uses the LLM to filter for the most promising and practical candidates.
W14. [Recommendations - The directions suggested in error analysis (re-sample, chain-of-thought, self-reflection) might increase the context size, reducing one of the advantages.]
Yes, our discussion here primarily focuses on improving system performance. We agree that introducing these techniques requires considering the tradeoff between performance and efficiency.
Questions
Q1. [What are some other environments where this method could be tested?]
Thank you for your valuable suggestion. In the revised version of our paper, we will include a comparative analysis of the VH environment against other alternatives, particularly addressing why we chose VH for our experiments.
To elaborate, VH is a highly complex environment encompassing multiple rooms and objects, with intricate inter-object relationships that closely mimic real-world settings (refer to Appendix F). This complexity not only makes the environment more challenging for planners but also highlights the issue of token inefficiency more prominently. The abundance of tokens required for representing observations and global information in VH starkly contrasts with environments like Jericho[2], as referenced. This distinction is crucial for our study's focus on addressing token inefficiency in complex, real-world scenarios.
Furthermore, in contrast to VH, another popular embodied planning environment is Alfworld [1], which faces fewer challenges related to token inefficiency due to its simpler environment (single room, fewer objects). In the subsequent version of our paper, we will also include discussions on a broader range of planning environments, such as those mentioned in [3] and others.
We believe this clarification will enhance the understanding of our choice and its relevance to the research objectives.
[Reference]
[1] Shridhar M, Yuan X, Côté M A, et al. Alfworld: Aligning text and embodied environments for interactive learning[J]. arXiv preprint arXiv:2010.03768, 2020.
[2] (Jericho) Hausknecht, Matthew, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. "Interactive Fiction Games: A Colossal Adventure." arXiv:1909.05398 [Cs], February 25, 2020. http://arxiv.org/abs/1909.05398.
[3] Liu X, Yu H, Zhang H, et al. Agentbench: Evaluating llms as agents[J]. arXiv preprint arXiv:2308.03688, 2023.
Q2. [Given a fixed task, is the same prompt used for the four scenes?]
The prompts vary based on the differences between the scenes. For the task "Turn on light," we have different gold plans for various scenes:
Scenes 1, 3, 4:
Gold Plan:
- [Walk] <light> (1)
- [Find] <light> (1)
- [SwitchOn] <light> (1)
- [Find] <light> (2)
- [SwitchOn] <light> (2)
Explanation: In these scenes, there are two lights present, so the plan is to turn both on.
Scene 2:
Gold Plan:
- [Find] <light> (1)
- [PlugIn] <light> (1)
- [SwitchOn] <light> (1)
Explanation: In this scene, the light is unplugged, so the plan includes plugging it in before turning it on.
Q3. [Does the observation of plan-sampling include the full description of the room or just the type of room?]
We provide the complete room description, not just the room type, in the plan sampling phase. That is, the detailed initial observation is included, similar to what is presented during the grounded deciding and iterative planning phases. We will correct this detail in the next version of our manuscript. Thank you for pointing out this oversight.
Q4. [Did you consider adding the observation history? Are the observations monotonic?]
We appreciate your insightful question regarding including observation history in our model. The environment is not monotonic. We initially did not incorporate the observation history, which aligns with current methodologies in task planning research[1][2]. However, we recognize the potential value of integrating past observations to enhance decision-making processes.
In response to this perspective, we have conducted additional experiments where the model inputs include sequences like [Action1] [Observation1] [Action2] [Observation2]...
| | EXEC ↑ | SR ↑ | GCR ↑ | Cost ↓ |
|---|---|---|---|---|
| w/o observation history | 49.01±5.67 | 28.14±2.45 | 35.84±4.20 | 3.48±0.04 |
| with observation history | 48.83±3.72 | 26.63±1.02 | 34.42±1.72 | 8.31±0.03 |
According to our experimental results, when observations are added to the model inputs, the performance of the model does not improve, but the token consumption is more than double compared to when observations are not included.
We believe that in situations involving shorter observations, incorporating the observation history into the context can indeed aid LLMs in making more informed decisions (for example, the executed result of 'dir' command in bash). However, when dealing with longer observations, this approach results in a significant consumption of tokens, thereby prolonging the reasoning processes. Moreover, models often exhibit poorer performance in handling longer texts and may experience issues with losing attention towards long contexts [3]. As described in Appendix F, our test environment is quite intricate, resulting in extensive observations at each timestep. Integrating the complete observation history in such cases would lead to substantial token usage, which is a crucial consideration given the per-call token charges of LLMs.
In robot task planning, one alternative could be focusing solely on the environmental changes caused by each action, as these usually affect only a few object relations.
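As a purely illustrative sketch of this delta-observation idea (our own example, not part of the paper), one could keep only the object relations that changed after each action instead of the full observation:

```python
from typing import Dict, List, Set, Tuple

Relation = Tuple[str, str, str]  # e.g., ("coffee", "inside", "coffee_maker")

def observation_delta(before: Set[Relation], after: Set[Relation]) -> Dict[str, List[Relation]]:
    """Return only the relations added or removed by the last action."""
    return {"added": sorted(after - before), "removed": sorted(before - after)}

before = {("coffee", "on", "table"), ("coffee_maker", "state", "closed")}
after = {("coffee", "inside", "coffee_maker"), ("coffee_maker", "state", "closed")}
print(observation_delta(before, after))
# {'added': [('coffee', 'inside', 'coffee_maker')], 'removed': [('coffee', 'on', 'table')]}
```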
Your question has indeed highlighted an essential aspect of our research, and we are actively exploring the balance between comprehensive context inclusion and token efficiency. The additional experiments and ongoing investigations will be thoroughly detailed in our subsequent publication.
[References]
[1] Huang W, Xia F, Shah D, et al. Grounded decoding: Guiding text generation with grounded models for robot control. arXiv preprint arXiv:2303.00855, 2023.
[2] Singh I, Blukis V, Mousavian A, et al. Progprompt: Generating situated robot task plans using large language models. 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2023: 11523-11530.
[3] Liu N F, Lin K, Hewitt J, et al. Lost in the middle: How language models use long contexts[J]. arXiv preprint arXiv:2307.03172, 2023.
Q5. [GCR metric: what's the distribution of the number of goal conditions? Are the number of goal conditions the same for each of the 35 tasks?]
The distribution of goal conditions is as follows:
| Number of Goal Conditions | Number of Tasks |
|---|---|
| 1 | 2 |
| 2 | 15 |
| 3 | 10 |
| 4 | 6 |
| 5 | 2 |
Thank you for pointing this out. We recognize the importance of providing detailed distributions of goal conditions for each task to understand the complexity and diversity of our dataset. We will include this crucial information in the revised version of the paper to ensure readers can fully understand our dataset.
Q6.1 [Sec 5.2 - Did you randomize the results for table 2? A new seed might generate another set of plans.]
We appreciate your inquiry regarding the randomization of results in Table 2. To ensure the robustness and reliability of our results, we adopted the same randomization method as in Table 1. Specifically, we conducted three independent runs for each scene and reported the average metrics. We will include these details in the next version of our paper.
Q6.2 [Sec 5.2 - What would be the results for SR? Table 1 shows how GCR grows with SR, but doesn't predict SR. Actually, the gap between GCR and SR is higher for Tree-Planner than for other approaches.]
We computed the corresponding Success Rate (SR) results for Figure 5 (refer to Section 5.2); they are presented in Appendix C.4 of the updated version of the paper. As shown in the figure, the trend of SR with respect to N closely follows that of GCR. While the gap between SR<sub>avg</sub> and GCR<sub>avg</sub> mirrors that observed in Table 1, the gap between SR<sub>max</sub> and GCR<sub>max</sub> is more pronounced, at approximately 20%.
Q7 [What is the inverse of all the actions? How that information was obtained? How realistic it is for that information to be available or learned?]
We categorize inverse actions into two types:
Type I: Inverse Actions Implemented in VirtualHome:
- Sleep / WakeUp
- StandUp / Sit or Lie (dependent on the robot's previous state)
- Grab / PutBack
- Pull / Push
- Open / Close
- SwitchOn / SwitchOff
- PutBack / Grab
- PutIn / Grab
- Lie / StandUp
- Drop / Grab
Type II: Inverse Actions Not Implemented in VirtualHome:
These actions either change the object's location (e.g., Walk, Find, TurnTo) or do not significantly alter the environment's state (e.g., Watch, Read, PointAt). For these, we recover the state to its previous condition.
Since our research focused on task planning, we assume that each action's low-level control is already implemented, and inverse actions are equivalent to undo operations, essentially reverting the environment to a prior state. However, the design of inverse actions would vary in real-world scenarios depending on the implementation of the low-level controllers and the physical constraints of the robot. For instance, if actions are implemented through motion control methods, the inverse would involve reversing or undoing a robot's movements. For RL-based low-level controllers, one possible approach is to learn a separate RL policy specifically designed to counteract the effects of the original actions. This counteractive policy would be trained to identify and execute actions that revert the environment to its previous state or negate the impact of the initial action. Moreover, if environmental changes are not solely caused by the robot (e.g., another person moving in the same space), inverse action design must consider these dynamic changes.
Another approach could be recording environmental state changes after each operation, where inverse actions would mean restoring the environment to a prior state.
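For illustration only, the Type I pairs above could be encoded as a lookup table, with Type II actions handled by restoring the recorded environment state; the helper names below are ours, not the paper's:

```python
# Type I inverse pairs implemented in VirtualHome (per the list above);
# the state-dependent StandUp/Sit-or-Lie pair is omitted for simplicity.
INVERSE = {
    "Sleep": "WakeUp", "WakeUp": "Sleep",
    "Open": "Close", "Close": "Open",
    "SwitchOn": "SwitchOff", "SwitchOff": "SwitchOn",
    "Grab": "PutBack", "PutBack": "Grab",
    "PutIn": "Grab", "Drop": "Grab",
    "Pull": "Push", "Push": "Pull",
    "Lie": "StandUp",
}

def undo_plan(executed):
    """Return inverse actions for a list of (verb, args) steps, most recent first.

    Verbs without a Type I inverse (e.g., Walk, Find, TurnTo) are skipped here;
    their effects would instead be reverted by restoring the recorded state.
    """
    return [(INVERSE[verb], args) for verb, args in reversed(executed) if verb in INVERSE]

print(undo_plan([("Open", "<coffee_maker> (1)"), ("SwitchOn", "<coffee_maker> (1)")]))
# [('SwitchOff', '<coffee_maker> (1)'), ('Close', '<coffee_maker> (1)')]
```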
Q8. [Branches in action trees. Some node action trees have no branches. See leftmost path Fig 12. Is the LLM prompt for choosing one action, or are they executed? Is this accounted in the cost of the predictions?]
Thanks for pointing this out. We appreciate the opportunity to clarify this aspect of TREE-PLANNER. We do not initiate an LLM call when there is only one possible action. This approach is a strategic cost-saving measure, particularly beneficial in scenarios where specific actions naturally follow others. For example, the action '[Lie] <bed>' is typically followed by '[Sleep].' In an iterative planning framework, these two actions would necessitate separate LLM calls, resulting in additional token consumption. We will make this detail more apparent in the next version of the paper.
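A minimal sketch of this rule (our illustration, assuming the `ActionNode` structure from the earlier trie sketch and a hypothetical `llm_choose` callback) is:

```python
def next_action(node, llm_choose):
    """Advance one step in the action tree, querying the LLM only at real branch points."""
    children = list(node.children.values())
    if not children:
        return None                 # leaf node: the plan is finished
    if len(children) == 1:
        return children[0]          # single option: apply it directly, no LLM call
    return llm_choose(children)     # branch point: grounded deciding via the LLM
```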
Q9. [Please report the average number of branches for reaching the leaves for each of the 35 tasks.]
Thank you for suggesting the additional data analysis approach. We have calculated the average number of branches needed to reach the leaves for each task under N=50, and the result is provided in response to Q11.1 ('Number of Branches'). The first four tasks served as in-context examples and thus do not have corresponding branch numbers.
Q10. [Please discuss the number of error corrections in Table 1 compared to the number of branches. Otherwise, it's hard to know whether 3.29 vs 1.86 is a significant difference.]
| Number of Branches | Number of Tasks | Average No.Correction |
|---|---|---|
| [0, 5) | 4 | 0.23 |
| [5, 10) | 21 | 1.91 |
| [10, 15) | 5 | 3.60 |
| [15, 20) | 1 | 5.17 |
To enhance our analysis of the relationship between the number of error corrections and the number of branches, we have compiled their distribution data. The table indicates a general trend where the complexity of tasks, as measured by the number of branches, correlates with an increased likelihood of requiring corrections. This pattern is expected, as tasks with more branches typically involve more decision points, thereby elevating the potential for errors or the necessity for adjustments.
Tasks with a greater number of branches present more opportunities for errors, hence a higher possibility for corrections. This, however, does not directly correspond to a greater number of error corrections. For instance, a task like "Clean bathroom," with 15.8 branches, averages 5.17 corrections, illustrating that the frequency of corrections also hinges on the error probability at each branch. Conversely, tasks with fewer branches might have fewer decision points, suggesting less complexity and reduced chances for errors. Nevertheless, the existence of branches inherently introduces the possibility of decision-related errors.
Furthermore, in all tasks, the number of branches (calculated as the average number of branches multiplied by the number of unique plans) exceeds the maximum set number for error corrections (10). This ensures that the scope for error corrections is not constrained by an overly limited decision space in the action tree.
| Task | Number of Steps | Suggested Sampling N | Predicted No.Steps | Number of Branches | Success Rate | Number of Corrections | Number of New Objects |
|---|---|---|---|---|---|---|---|
| Shut off alarm | 6.0 | 10 | 4 | 4.9 | 1.00 | 0.00 | 188 |
| Drink | 4.8 | 50 | 8 | 7.8 | 0.67 | 1.67 | 177 |
| Look at mirror | 5.0 | 10 | 4 | 5.5 | 1.00 | 0.33 | 217 |
| Put on your shoes | 8.0 | 50 | 12 | 6.9 | 0.00 | 1.00 | 222 |
| Shampoo hair | 13.0 | 50 | 12 | 9.2 | 0.00 | 1.33 | 215 |
| Plug in nightlight | 5.0 | 50 | 8 | 5.7 | 0.00 | 0.00 | 161 |
| Get dressed | 10.0 | 50 | 12 | 10.1 | 0.22 | 0.11 | 163 |
| Put egg on dining table | 8.0 | 50 | 12 | 7.8 | 0.50 | 0.17 | 194 |
| Charge phone | 3.0 | 50 | 12 | 4.7 | 1.00 | 0.50 | 194 |
| Make coffee | 11.0 | 50 | 10 | 7.6 | 0.22 | 3.44 | 194 |
| Read novel in bed | 12.0 | 50 | 10 | 9.8 | 0.67 | 5.50 | 194 |
| Sit at dining table | 6.0 | 10 | 3 | 8.2 | 0.00 | 1.67 | 250 |
Q11.2 [Sec 4.1 - VirtualHome has multiple programs for some tasks, not a single gold plan.]
Yes, during the construction of our dataset, we filtered out certain plans that could not be correctly executed within the environment and ensured that the final gold plans achieved a success rate of 1. We annotated one gold plan each for tasks not included in the VirtualHome dataset, such as Charging Phone, Making Coffee, etc.
Q11.3 [Sec 4.1 - Did you create new scenes or used scenes in VH?]
We modified existing scenes from VirtualHome to create new scenes for our study. We made minor environmental adjustments, like changing the states of objects and adding or removing objects. This was done to generate scenes related to, yet distinct from, the original VirtualHome scenes. Below is a table mapping our newly created scenes to the original VirtualHome scenes:
| Tree-Planner Scene | VirtualHome Scene |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 5 |
| 4 | 6 |
Q11.4 [Sec 4.1 - Are the gold truth plans associated to a task or to (task, scene)?]
The gold plans are associated with (task, scene) pairs. As discussed in Q2, plans may vary across scenes.
Q12. [Methodological: was N=50 selected using the 35 tasks?]
We recognize your apprehension about the possibility of bias in choosing N based on the tasks. In selecting N=50, we did not engage in cherry-picking; in fact, the model achieves a better success rate (42%) under N=75. However, we believe that 25 and 50 are more convenient for comparison. Furthermore, we undertook supplementary experiments to examine the correlation between the sampling number (N) and the task type (complexity), as elaborated in our responses to W2 and Q11.1.
Q13. [Was majority vote only used for grounded-deciding? Please add a reference for the method.]
Yes, the majority vote was used for grounded deciding to mitigate format errors in the LLM's output. This operation draws inspiration from Wang et al. (2022)[1]. We also tried to apply a majority vote to the baseline model Iterative Planner. However, it was observed that using a majority vote in this context resulted in a lower success rate compared to employing greedy decoding with a temperature setting of zero.
[Reference]
[1] Wang X, Wei J, Schuurmans D, et al. Self-consistency improves chain of thought reasoning in language models[J]. arXiv preprint arXiv:2203.11171, 2022.
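As a rough sketch of the majority-vote step described above (our illustration; `sample_decision` stands in for a single LLM call that returns a candidate action), one could aggregate several sampled decisions as follows:

```python
from collections import Counter

def majority_vote(sample_decision, candidates, k=5):
    """Sample k decisions and return the most frequent valid candidate.

    Samples that do not match any candidate action (e.g., format errors)
    are discarded before counting.
    """
    votes = [sample_decision(candidates) for _ in range(k)]
    valid = [v for v in votes if v in candidates]
    if not valid:
        return candidates[0]  # fallback when every sample is malformed
    return Counter(valid).most_common(1)[0][0]
```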
Q14.1 [Table 3 - Is this success rate, SR?]
Yes, the error cases are selected based on their corresponding success rate scores.
Q14.2 [The false negative is misleading. In the Appendix, the error explanation says that the environment reports that there is no keyboard. Are keyboards part of the possible objects? The observations in the prompt mention a computer but no keyboard.]
Thank you for the opportunity to further explain this particular error case (refer to Appendix E.5). The keyboard is indeed present in the corresponding scene. The issue arises from the implementation of VirtualHome itself: when the robot performs an action that manipulates an object, such as Grab, SwitchOn, or Type, the environment requires the robot to be close to that object. In the example provided, the robot first executed [SwitchOn] <computer> (1), which implies that it was next to the computer. Common sense suggests that the robot should also be near the keyboard at this point. However, in VirtualHome, this results in an error, necessitating the execution of [Find] <keyboard> (1), which may seem redundant.
Nevertheless, we acknowledge and appreciate your suggestion to reclassify the "false negative" as "semantically correct." However, after carefully re-examining the erroneous samples, we believe this particular error should not be classified under missing correct plans, as such errors could also occur in grounded deciding processes. The revised Table 3 is included in the updated version of our paper.
Q15.3 [What are the error modalities with correction?]
We further analyzed the error types in the with correction model. The corresponding results are presented in Appendix E.6 of the updated version of our paper.
Q16.1 [Sec C.1 - Can you provide the statistics for other approaches?]
Yes, we have included the statistical data for the Local Replan method in Appendix C.1 of the updated version of the paper. The experimental findings demonstrate that while TreePlanner delivers comparable or superior performance across various plan lengths, it possesses a significant advantage in the number of corrections required.
Q16.2 [Sec C.1 - What about kind of task?]
The statistics for each task are provided in the response to Q11.1.
ICLR Tree planner
Thank you for your comments, both addressing the weakness and my specific questions.
Let’s start with the weakness:
W1: [Not clear if the policy is grounded in the environment or the LLM implicit distribution.]
As a detailed initial observation is given, the question is whether many of the actions are grounded on it, reducing the significance of viewing the algorithm as dealing with a POMDP.
I agree that VH is a POMDP. The answer agrees that VH has more observability than Jericho[1] or Alfworld[2].
The question is what role the objects in the first observation (used during sampling) play versus those that appear during execution, given that the sampled plans do not account for observations.
Perhaps a way to evaluate this is to report these statistics for successful traces and the gold truth: how many objects were in the initial observation, and how many objects appear later? Let's call the ones that appear later new objects. This question is about the original setting, not the one reported in Q4 about the observation history.
The lower the number of new objects, the less the problem requires partial observability. I mean, there could always be another room or another drawer full of objects, but if they are not relevant to the goal, they don't make the problem harder.
This statistic would be crucial to defend whether the algorithm is dealing with partial observability. How? If success is highly correlated with a low number of new objects, that means the algorithm is not dealing well with the cases where observations are needed. Of course, one additional factor is how complex the plan is, for instance in number of steps. I'd need to think about how to show that, but it might be possible to make a figure where the x-axis is the number of steps and the y-axis is the proportion of new objects in the plan, both for the gold truth and the successful plans.
W2: [No discussion of the systematicity of LLM-based planning.]
I said:
The key underlying problem is that a flexible plan dealing with observations cannot be found using a fixed number of samples. Some tasks might require fewer samples, and some might require more. That’s precisely why planning algorithms use search.
The answer commented that the longer the plan, the more samples are needed. That makes sense because the required tree-search size is exponential in the length of the solution. For a binary tree, if the solution has length n, the full tree has 2^n leaves, so in principle the number of samples might need to be near 2^n. Common sense might prune that tree, but the number of leaves would increase anyway.
The answer says:
We have conducted additional experiments using GPT-4 to dynamically adjust the sampling number (N) based on the specific task.
Can you elaborate on this? Determining the number of leaves of the tree is as hard as planning. Using an external, more powerful tool to obtain an estimate of the number of leaves seems like an external solution to a fundamental issue.
My key concern is that the whole notion of a fixed number of samples is fundamentally broken. Sure, you can report a number, but it doesn’t tell me how to fix it.
Perhaps it'd be better if you argue that the results are for a fixed number, report the results, but elaborate further on what happens with other numbers like N=75, as you mentioned, or N=25. It'd be good to know whether, even if the hyper-parameter is set externally, the contribution still holds.
I’ll process later the rest of the comments.
Thank you!
Thank you for your prompt and invaluable feedback.
W1: [Not clear if the policy is grounded in the environment or the LLM implicit distribution.]
- We have supplemented the table in our response to Q11 (Parts 9 and 10) by incorporating the number of new objects encountered during execution of the gold plan; the average is 205.29 and the standard deviation is 22.48. Additionally, we have calculated the Pearson correlation coefficient between the "success rate" and the "number of new objects". The coefficient, approximately -0.174, indicates a marginal negative correlation: an increase in the number of new objects related to a task marginally reduces its success rate.
- We have also calculated the ratio of the number of new objects to the total number of objects observed throughout the whole trajectory. The mean and standard deviation of this proportion are 0.72 and 0.065, respectively. Moreover, the inverse correlation between this proportion and the success rate is stronger than that in point 1), with a Pearson correlation coefficient of −0.3244 (a minimal computation sketch is given after this list).
| Proportion Interval | Number of Tasks | Success Rate (Mean) |
|---|---|---|
| (0.6, 0.7] | 15 | 0.504 |
| (0.7, 0.8] | 13 | 0.376 |
| (0.8, 0.9] | 3 | 0.057 |
- We have included an analytical diagram in our paper, detailing the number of tasks and the proportion of new objects, as per your suggestion, in Appendix C.6.
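The correlation computation referenced in the list above can be reproduced with a short script of the following shape (the numbers here are illustrative placeholders, not the full per-task table):

```python
from scipy.stats import pearsonr

# Per-task statistics (placeholder values for illustration only).
success_rate = [1.00, 0.67, 0.00, 0.22, 0.50]
new_object_ratio = [0.63, 0.68, 0.82, 0.74, 0.71]

r, p_value = pearsonr(new_object_ratio, success_rate)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```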
Q11.1 [Sec 4.1 - Dataset: what are the 35 unique tasks? Please list the 35 task descriptions. What is the distribution of the number of steps for the gold truth? Even better, provide this information for each task]
As the number of plans may vary across different scenes, we have calculated the average length of the gold plans in each scenario. As shown in the table below, our dataset encompasses a highly diverse array of household tasks.
| Task | Number of Steps | Suggested Sampling N | Predicted No.Steps | Number of Branches | Success Rate | Number of Corrections | Number of New Objects |
|---|---|---|---|---|---|---|---|
| Watch TV | 7.5 | 10 | 8 | None | None | None | None |
| Turn on light | 5.8 | 10 | 6 | None | None | None | None |
| Go to sleep | 5.5 | 10 | 4 | None | None | None | None |
| Brush teeth | 8.8 | 50 | 12 | None | None | None | None |
| Clean toilet | 6.0 | 50 | 12 | 9.8 | 0.00 | 6.00 | 255 |
| Wash monitor | 16.0 | 50 | 10 | 11.3 | 0.00 | 10.00 | 202 |
| Shave | 8.0 | 50 | 12 | 9.9 | 0.50 | 1.00 | 223 |
| Hang up jacket | 7.8 | 50 | 8 | 7.6 | 0.00 | 5.67 | 216 |
| Read newspaper | 7.3 | 50 | 10 | 7.5 | 0.00 | 0.33 | 217 |
| Sleep | 4.8 | 10 | 4 | 5.4 | 1.00 | 0.00 | 201 |
| Put alarm clock in bedroom | 8.0 | 50 | 12 | 7.0 | 0.00 | 6.00 | 195 |
| Sit in chair | 5.0 | 10 | 4 | 5.7 | 0.17 | 0.00 | 260 |
| Open bathroom window | 4.0 | 10 | 4 | 4.7 | 1.00 | 0.00 | 189 |
| Wash face | 22.8 | 50 | 12 | 10.4 | 0.67 | 1.67 | 216 |
| Computer work | 15.0 | 100 | 10 | 12.3 | 1.00 | 2.00 | 198 |
| Take nap | 4.8 | 10 | 4 | 6.6 | 0.92 | 0.75 | 213 |
| Hand washing | 14.0 | 100 | 10 | 10.0 | 0.00 | 4.22 | 219 |
| Clean bathroom | 14.0 | 50 | 12 | 15.8 | 0.00 | 5.17 | 196 |
| Pick up phone | 4.5 | 50 | 10 | 4.4 | 0.50 | 0.42 | 198 |
| Go to toilet | 5.0 | 10 | 4 | 5.9 | 0.58 | 0.42 | 209 |
| Pick up spare change on dresser | 16.0 | 50 | 6 | 8.2 | 0.00 | 0.17 | 195 |
| Wash sink | 14.0 | 50 | 12 | 8.9 | 0.00 | 4.67 | 180 |
| Put on glasses | 6.0 | 50 | 12 | 6.1 | 1.00 | 0.00 | 213 |
W2: [No discussion of the systematicity of LLM-based planning.]
- Further elaboration on "We have conducted additional experiments using GPT-4 to dynamically adjust the sampling number (N) based on the specific task.": Specifically, we prompt GPT-4 to generate a) the predicted number of steps for a given task and b) the suggested sampling N. The prompt structure is: [Instruction], [Description of Tree-Planner], [Global Information], [Initial Observation], [In-Context Examples]. The [Global Information], [Initial Observation], and [In-Context Examples] are the same as those in Appendix F, and [Description of Tree-Planner] introduces the methodology flow. The key part of [Instruction] is: "You need to suggest the hyper-parameter sampling_n and predict plan length based on the given information." In addition, we employed prompt engineering techniques, such as the chain-of-thought approach, to enhance the LLM's performance.
- Utilizing GPT-4 for the dynamic adjustment of hyperparameters is an open exploration in addressing the issue you posed: "So, the key underlying problem is that a flexible plan dealing with observations cannot be found using a fixed number of samples." Here, the LLM's commonsense knowledge also plays a role in configuring the model's hyperparameters. Given the challenge of annotating the optimal sampling_n for various tasks, it is difficult to establish an adaptive sampling_n using a training-based approach, so we consider GPT-4 a good starting point for a training-free experiment.
- Our contributions remain valid across different fixed sampling numbers. As demonstrated in the table below, we maintain superior performance over the iterative planner while preserving high token efficiency under N=25, 50, and 75.
| Method | Exec ↑ | SR ↑ | GCR ↑ | Cost ↓ | No.Correction ↓ |
|---|---|---|---|---|---|
| w/o correction | |||||
| Iterative Planner | 44.54±6.09 | 27.04±4.65 | 33.25±5.32 | 5.12±0.14 | N/A |
| Tree Planner N=25 | 55.74±0.92 | 28.33±1.18 | 39.96±0.16 | 2.39±0.44 | N/A |
| Tree Planner N=50 | 49.01±5.67 | 28.14±2.45 | 35.84±4.20 | 3.48±0.04 | N/A |
| Tree Planner N=75 | 47.44±4.40 | 28.06±2.82 | 34.42±2.88 | 4.27±0.03 | N/A |
| with correction | |||||
| Iterative Planner | 79.66±2.33 | 37.46±1.71 | 51.9±0.15 | 12.88±0.17 | 3.29±0.46 |
| Tree Planner N=25 | 89.13±0.17 | 35.30±1.78 | 56.65±1.09 | 3.30±0.01 | 1.85±0.05 |
| Tree Planner N=50 | 88.26±2.47 | 41.58±3.20 | 59.55±3.20 | 4.54±0.16 | 2.04±0.26 |
| Tree Planner N=75 | 89.27±0.03 | 42.15±2.79 | 60.12±4.20 | 5.36±0.02 | 2.24±0.58 |
Dear Reviewer #mZgD,
I hope this message reaches you well. As we are approaching the final hour of the rebuttal period for our manuscript, I wanted to send a brief reminder in case there are any last-minute comments or revisions you would like to make.
Your insights and feedback have been incredibly valuable to us, and we have done our utmost to address all the points raised in your initial review.
We understand the time constraints and appreciate any effort you can make in these final moments to review our latest submission. Your expertise and thoughtful analysis are crucial to the final quality of our work.
Thank you once again for your dedication and support throughout this review process. We eagerly await any final remarks you may have.
Warm regards,
Authors
The paper proposes TreePlanner, an LLM-based algorithm for task planning. TreePlanner uses the LLM more efficiently without reducing performance. The main contributions are algorithmic and empirical. The key idea is to decompose the planner into two distinct stages of LLM use. In the first stage (plan sampling), an expensive set of calls to the LLM is used to sample a number of full plans. These plans are aggregated into a tree structure which is then searched over in the second stage of the planner. The main source of efficiency (wrt LLM costs) is that the second stage (grounded deciding) can use a simpler prompt to perform action selection (instead of action generation). Experiments on tasks from the VirtualHome domain indicate the method is far more efficient without any loss of performance, compared to strong baselines.
Strengths
- The paper tackles an important and popular problem of leveraging LLMs for task planning. Progress here would likely be impactful and of interest to the community.
- The main ideas in TreePlanner are intuitively clear. The two-step approach for efficient use of the LLM seems novel, to my knowledge.
- The results show that compared to the baselines, TreePlanner makes very efficient use of the LLM without losing performance.
Weaknesses
- While the method clearly shows good performance on the considered domains, the experiments could be improved to address important questions about the approach, namely prompt engineering and the most efficient use of an LLM in task planning.
  - For example, how good is the current prompt used in the plan sampling step? Given that errors made here are currently catastrophic and the prompt needs to be task-specific, a careful analysis of the quality of this prompt would be interesting.
  - Another open question is whether it's more efficient to restrict the use of the LLM to plan sampling alone. See the questions for more details. (I'd be open to adjusting my score based on the responses to these questions.)
- The set of tasks considered is from a single domain (VirtualHome). This makes it difficult to evaluate the broader utility of the proposed ideas.
- Some of the implementation details of the prompts, especially the one used in grounded deciding, are not fully described. See the questions for more details.
Questions
- I was unable to identify what error information is included in the history of the prompt used for grounded deciding. Per Figure 2 and Appendix F.3, it seems that some information about previously failed actions is included in the history part of the prompt. If yes, are failed actions restricted to those on the current trajectory from the root to the current node, or are they a global history over the entire tree? Which variant is better and why?
- How sensitive is overall performance (e.g., success rate) to the prompt used in Plan Sampling? Since the absence of valid / optimal plans in the constructed tree leads to severely degraded performance and LLM performance is known to be very dependent on its prompt, it'd be useful to understand how much performance might be gained from a better Step 1 prompt.
  - A related (but harder) question: how does the choice of LLM affect performance (e.g., off-the-shelf black box vs fine-tuned, etc.)?
- Would it be possible to include additional tasks beyond those in VirtualHome? For example, is Toolbench relevant here?
- Given that the paper's emphasis is on efficient use of the LLM, what performance improvement does the LLM offer in grounded deciding over other action selection mechanisms? For example, is it feasible to replace the use of the LLM in grounded deciding with "classical" action selection techniques (e.g., UCB, best-first heuristic search)? How might this perform? More generally, please discuss where exactly the LLM is required / beneficial over classical search-based planners.
We are immensely grateful to you for recognizing the importance and novelty of our TreePlanner approach and for appreciating its clarity and efficient performance compared to existing baselines. This positive feedback is greatly encouraging.
Weaknesses
W1.1 [While the method clearly shows good performance on the considered domains, the experiments could be improved to address important questions about the approach, namely prompt engineering and the most efficient use of an LLM in task planning. How good is the current prompt used in the plan sampling step?]
Thank you for pointing out this potential improvement to our methodology. We have added experiments with varying prompting strategies; the corresponding results are presented in the response to Q2.1. They show that different prompts have varying impacts, but the prompt we currently employ remains sufficiently effective compared to the alternatives.
W1.2 [Whether it's more efficient to restrict the use of the LLM to plan sampling alone.]
Thanks for your suggestion. Corresponding experimental results and discussion are presented in the response to Q4 and Appendix C.
W2. [The set of tasks considered are from a single domain (VirtualHome). This makes it difficult to evaluate the broader utility of the proposed ideas.]
Thanks for your valuable feedback. The answer to this question is presented in response to Q3.
W3. [Some of the implementation details of the prompts, especially the one used in grounded deciding, are not fully described. See the questions for more details.]
Thanks for the opportunity to include more details about our research. Corresponding details are included in the response to Q1.
Questions
Q1 [what error information is included in the history of the prompt used for grounded deciding. If yes, are failed actions restricted to those on the current trajectory from the root to the current node, or are they a global history over the entire tree? Which variant is better and why?]
Thank you for the opportunity to provide more details. Error information typically refers to situations where an action cannot be executed due to its precondition not being met (e.g., unable to operate on objects inside a fridge before opening it with [Open] <fridge>). The error information included in the prompt used for grounded deciding is in the format of [trajectory 1], [error information 1], [trajectory 2], [error information 2], and so on, indicating a global history over the entire tree.
We chose this variant because we believe LLMs can optimize themselves implicitly with multiple failed trajectories. This is consistent with the viewpoint presented in [1]. LLMs can use error information as a feedback signal to improve their decision-making process and avoid repeating the same mistakes.
[Reference]
[1] Shinn N, Cassano F, Gopinath A, et al. Reflexion: Language agents with verbal reinforcement learning[C]//Thirty-seventh Conference on Neural Information Processing Systems. 2023.
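A rough sketch of how such a global error history could be serialized into the grounded-deciding prompt (an illustrative helper of ours, not the paper's exact template):

```python
def format_error_history(failed_trajectories):
    """Serialize (actions, error_message) pairs collected over the entire tree."""
    lines = []
    for i, (actions, error) in enumerate(failed_trajectories, start=1):
        lines.append(f"[trajectory {i}] " + ", ".join(actions))
        lines.append(f"[error information {i}] {error}")
    return "\n".join(lines)

print(format_error_history([
    (["[Find] <fridge> (1)", "[Grab] <milk> (1)"],
     "Precondition not met: the fridge is closed."),
]))
```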
Q2.1 [How sensitive is overall performance (e.g., success rate) to the prompt used in Plan Sampling? Since the absence of valid / optimal plans in the constructed tree leads to severely degraded performance and LLM performance is known to be very dependent on its prompt, it'd be useful to understand how much performance might be gained from a better Step 1 prompt?]
Thanks for your insightful feedback! In response to your query about the sensitivity of plan sampling's performance to the prompt used, we have conducted additional experiments to understand this aspect. The original prompts in our paper were designed following the guidelines discussed in [1], and we explored three variants of the prompting strategy:
(i) GPT-4 Optimized Prompt: This variant involves using GPT-4 to optimize our original prompt based on the Prompt Engineering Guide from OpenAI.
(ii) Code-Style Prompt[2][3]: Transforming the original prompt into a Python code format. In this variant, the prompt generates code and produces comments that serve as "thoughts," aiding in the clarity and interpretability of the plan sampling process.
(iii) Chain-of-thought[4]: Tasking LLMs to first generate a comprehensive chain of thought before sampling possible plans.
These experiments were conducted under the condition of N=50. As shown in the table, the GPT-4 Optimized Prompt and the Code-Style Prompt do not lead to performance improvements. In contrast, using a chain-of-thought prompt enhances performance (+0.44 SR).
The negative impact on performance observed with code-style prompts may be attributed to the limited code-generation ability of GPT-3.5. We will explore models specifically designed for code generation with code-style prompts in future work[5][6].
| Method | EXEC ↑ | SR ↑ | GCR ↑ | Cost ↓ | No.Correction ↓ |
|---|---|---|---|---|---|
| TreePlanner | 88.26±2.47 | 41.58±3.20 | 59.55±3.20 | 4.54±0.16 | 2.04±0.26 |
| GPT-4 Optimized Prompt | 89.52±2.15 | 39.25±0.54 | 58.80±1.05 | 4.73±0.13 | 2.06±0.18 |
| Code-Style Prompt | 70.70±3.24 | 37.10±0.89 | 50.74±2.73 | 6.25±0.57 | 3.12±0.31 |
| Chain-of-thought | 89.13±2.04 | 42.02±1.27 | 60.13±1.89 | 5.72±0.91 | 2.38±0.17 |
[Reference]
[1] Vemprala S, Bonatti R, Bucker A, et al. Chatgpt for robotics: Design principles and model abilities[J]. Microsoft Auton. Syst. Robot. Res, 2023, 2: 20.
[2] Singh I, Blukis V, Mousavian A, et al. Progprompt: Generating situated robot task plans using large language models[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 11523-11530.
[3] Liang J, Huang W, Xia F, et al. Code as policies: Language model programs for embodied control[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 9493-9500.
[4] Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models[J]. Advances in Neural Information Processing Systems, 2022, 35: 24824-24837.
[5] Xu Y, Su H, Xing C, et al. Lemur: Harmonizing Natural Language and Code for Language Agents[J]. arXiv preprint arXiv:2310.06830, 2023.
[6] Roziere B, Gehring J, Gloeckle F, et al. Code llama: Open foundation models for code[J]. arXiv preprint arXiv:2308.12950, 2023.
Q2.2 [A related (but harder) question, how does the choice of LLM affect performance (e.g., off-the-shelf black box vs fine-tuned, etc.)?]
Thank you for raising this important question regarding the influence of different LLM choices on task planning performance. While utilizing a fine-tuned model can offer tailored performance improvements for specific tasks, it might also weaken the model's generalization capabilities. This is particularly evident when the model encounters tasks not covered in the fine-tuning dataset, where its performance may diminish.
Furthermore, as highlighted in Liu et al. (2023)[1], API-based commercial models still tend to surpass open-sourced variants regarding overall effectiveness. This inherent disparity in the baseline capabilities of models limits the extent to which fine-tuning can enhance their performance.
Another critical aspect is the choice of prompting strategy, which is closely intertwined with the nature of the LLM. For instance, models primarily pre-trained on code[2][3], may yield better results with code-style prompts. Likewise, LLMs that have undergone instruction tuning with structured data, such as GraphGPT[4], may potentially yield better results when directly fed with symbolic graphs as input, as opposed to converting them into natural language, as done in our study.
[Reference]
[1] Liu X, Yu H, Zhang H, et al. Agentbench: Evaluating llms as agents[J]. arXiv preprint arXiv:2308.03688, 2023.
[2] Li R, Allal L B, Zi Y, et al. StarCoder: may the source be with you![J]. arXiv preprint arXiv:2305.06161, 2023.
[3] Chen M, Tworek J, Jun H, et al. Evaluating large language models trained on code[J]. arXiv preprint arXiv:2107.03374, 2021.
[4] Tang J, Yang Y, Wei W, et al. Graphgpt: Graph instruction tuning for large language models[J]. arXiv preprint arXiv:2310.13023, 2023.
Q3. [Would it be possible to include additional tasks beyond those in VirtualHome? For example, is Toolbench relevant here?]
Thank you for your insightful suggestion regarding exploring additional tasks and environments beyond VirtualHome (VH) in our research. In response to your query, we are open to and considering including new tasks and environments in our future work.
To clarify, our initial choice of VH was due to its complexity and real-world-like scenarios (refer to Appendix A and Appendix F). This complexity underscores the challenge of token inefficiency, a central focus of our study. However, we recognize the importance of diversifying our experiments to comprehensively assess our approach's applicability and robustness.
In contrast to VH, another popular embodied planning environment is Alfworld [1], which faces fewer challenges related to token inefficiency due to its simpler environment (single room, fewer objects). Regarding the specific mention of Toolbench, we find this suggestion intriguing. Toolbench, or similar environments, could offer unique challenges and scenarios that differ from those in VH. Their inclusion could enrich our understanding of token inefficiency in various contexts and further validate the effectiveness of our proposed solutions.
In the subsequent version of our paper, we will also include discussions on a broader range of planning environments[2][3][4]. We appreciate your valuable feedback, which has guided us to consider these important extensions to our work.
[Reference]
[1] Shridhar M, Yuan X, Côté M A, et al. Alfworld: Aligning text and embodied environments for interactive learning[J]. arXiv preprint arXiv:2010.03768, 2020.
[2] (Jericho) Hausknecht, Matthew, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. "Interactive Fiction Games: A Colossal Adventure." arXiv:1909.05398 [Cs], February 25, 2020. http://arxiv.org/abs/1909.05398.
[3] Liu X, Yu H, Zhang H, et al. Agentbench: Evaluating llms as agents[J]. arXiv preprint arXiv:2308.03688, 2023.
Q4.1 [Given the paper's emphasis is on efficient use of the LLM, what performance improvement does the LLM offer in grounded deciding over other action selection mechanisms? For example, is it feasible to replace the use of the LLM in grounded deciding with "classical" action selection techniques (e.g., UCB, best-first heuristic search)? How might this perform?]
We appreciate your insightful question regarding using LLMs in grounded deciding compared to classical action selection techniques. Grounded deciding in our approach essentially represents a best-first search that employs Large Language Models (LLMs) as an implicit, pre-parametrized evaluation function.
Because it requires extensive simulation within the environment, Monte Carlo Tree Search (MCTS) suffers from efficiency issues, so comparing it against methods that do not require such simulation would be unfair.
We further compared grounded deciding (an LLM-based evaluation function) with other heuristic methods: (i) Language-Guided BFS: this method evaluates actions based on how likely they are to occur, as determined by a language model; it uses the log probability to estimate each action's likelihood, and the higher the probability, the more weight is given to that action. (ii) Environment-Guided BFS: this method evaluates an action based on the observability of the objects involved; actions involving easily observable objects are given higher weight. (iii) Hybrid BFS: this method uses a parameter, α, to balance the two; if α is closer to 1, the language model's influence is stronger, and if it is closer to 0, environmental observability takes precedence. More details and a formalization of the three heuristic methods can be found in Appendix C.5 of the updated version of the paper.
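To give a concrete sense of the hybrid scheme, here is a simplified scoring sketch of our own (the precise formalization is in Appendix C.5; the function and field names below are illustrative):

```python
import math

def hybrid_score(action, lm_logprob, observed_objects, alpha=0.5):
    """Score a candidate action for best-first search.

    lm_logprob:       log-probability of the action under the language model
    observed_objects: set of object names currently visible to the agent
    alpha:            1.0 -> purely language-guided, 0.0 -> purely environment-guided
    """
    lang_term = math.exp(lm_logprob)  # convert log-probability to a value in (0, 1]
    objects = action["objects"]
    env_term = sum(o in observed_objects for o in objects) / max(len(objects), 1)
    return alpha * lang_term + (1 - alpha) * env_term

candidate = {"name": "[Grab] <coffee> (1)", "objects": ["coffee"]}
print(hybrid_score(candidate, lm_logprob=-0.5, observed_objects={"coffee", "table"}, alpha=0.7))
```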
| Method | EXEC ↑ | SR ↑ | GCR ↑ | Cost ↓ |
|---|---|---|---|---|
| Grounded Deciding (LLM) | 49.01±5.67 | 28.14±2.45 | 35.84±4.20 | 3.48±0.04 |
| Language-Guided BFS | 34.71±0.65 | 15.06±1.02 | 22.47±1.72 | 2.16±0.02 |
| Environment-Guided BFS | 37.26±0.17 | 6.24±0.65 | 12.10±1.41 | 2.16±0.02 |
| Hybrid BFS | 38.71±0.42 | 17.74±0.69 | 25.34±0.95 | 2.16±0.02 |
As demonstrated by the experimental results, although the heuristic methods have a clear advantage in token efficiency, their SR falls short of grounded deciding (which uses the LLM as an implicit evaluation function). Moreover, combining the two heuristic evaluation functions (Hybrid BFS) outperforms either one alone, demonstrating that there is still room for improvement by optimizing the evaluation function in heuristic methods.
In summary, this comparison showcases the unique advantage of grounded deciding: (i) LLM's ability to implicitly construct a robust evaluation function that adapts to various situations without the need for extensive simulations; (ii) While classical heuristic searches are feasible, the process of constructing an evaluation function is often time-consuming and laborious.
Q4.2 [please discuss where exactly the LLM is required / beneficial over classical search-based planners.]
We appreciate the opportunity to discuss the advantages of Large Language Models (LLMs) over classical search-based planners in task planning. LLMs possess several inherent advantages: they are trained on large text corpora and exhibit multi-task generalization and substantial commonsense knowledge[1], which provides a strong foundation for task planning. More specifically, the unique benefits of LLMs can be summarized as follows:
- Robust Implicit Search Evaluation: using LLMs as an implicit evaluation function reduces the need for manually designing heuristic evaluation functions, a time-consuming and complex process (refer to the response to Q4.1).
- Domain-Specific Knowledge Limitation in Classical Planners: traditional PDDL planners typically rely on domain-specific PDDL files, which are often unavailable or based on assumptions of full observation. LLM-based planning, in contrast, does not depend on domain PDDL files. Although there have been efforts to construct domain PDDL files using LLMs, these still require manual involvement[2].
- Natural Interaction and User-Friendliness: LLM-based planners, using language as an interface, offer more natural human interaction than classical search-based planners. They are more user-friendly and easily adapt to specific human preferences and customizations in task planning. For example, when tidying a room, a user may prefer the robot to clean the bedroom first; such preferences can be easily introduced through language in LLM-based planners.
[Reference]
[1] Li X L, Kuncoro A, Hoffmann J, et al. A systematic investigation of commonsense knowledge in large language models[C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022: 11838-11855.
[2] Guan L, Valmeekam K, Sreedharan S, et al. Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning[J]. arXiv preprint arXiv:2305.14909, 2023.
Dear Reviewer K1gW,
Thank you for the valuable time and effort you put into reviewing our manuscript. Your insightful comments have greatly contributed to enhancing our work. In response, we have thoroughly revised our paper and updated the results, striving to fully address your observations.
We are eager to continue our discussion to ensure that our revisions meet your expectations. Should you have any further questions or additional feedback, we welcome your guidance and are fully prepared to clarify any issues. Your input is essential in helping us refine our paper.
Best regards,
The Authors.
Dear Reviewer #K1gw,
I hope this message finds you well. I am writing to express my sincere gratitude for the valuable insights and constructive feedback you provided for our manuscript.
As the rebuttal period is nearing its conclusion, I am reaching out to kindly inquire if our response has sufficiently addressed your concerns and questions. We have endeavored to thoroughly answer each of your points and hope that our revisions meet your expectations.
We are looking forward to your post-rebuttal rating and any further comments you might have. Please do not hesitate to contact us if there are additional clarifications or modifications you would recommend.
Thank you once again for your engagement and support.
Best Regards,
Authors
General Response Part I
We sincerely thank all reviewers for their meticulous assessment and valuable insights regarding our paper. We appreciate reviewers recognizing the contributions of our work, including the novel Tree-Planner approach [K1gw,cRrG], the effectiveness of our model [K1gw,cRrG,m5sK], and our meticulous dataset construction along with the judicious selection of metrics [mZgD,cRrG]. We are also thankful for recognizing the significance and intrigue of the problem we investigated and the commendation of our work's presentation by all reviewers.
Modification of Manuscript
- Corresponding SR results of Figure 5. (refers to Appendix C.4)
- Updated distribution of error types. (refers to Table 3, Section 5.4)
- Error modalities with correction. (refers to Appendix E.6)
- Experimental Results with Best-First Search (refers to Appendix C.5)
Additional Experiments
1. Experiments with Other Prompting Strategy [K1gw,cPrG]
We explored three variants of the prompting strategy: (i) GPT-4 Optimized Prompt: This variant involves using GPT-4 to optimize our original prompt based on Prompt Engineering Guide from OpenAI. (ii) Code-Style Prompt[1][2]: Transforming the original prompt into a Python code format. In this variant, the prompt generates code and produces comments that serve as "thoughts," aiding in the clarity and interpretability of the plan sampling process. (iii) Chain-of-thought[3]: Tasking LLMs to first generate a comprehensive chain of thought before sampling possible plans. More discussions are presented in the response to Q2.1, reviewer #K1gw and Q2, reviewer #cPrG.
| Method | EXEC ↑ | SR ↑ | GCR ↑ | Cost ↓ | No.Correction ↓ |
|---|---|---|---|---|---|
| TreePlanner | 88.26±2.47 | 41.58±3.20 | 59.55±3.20 | 4.54±0.16 | 2.04±0.26 |
| GPT-4 Optimized Prompt | 89.52±2.15 | 39.25±0.54 | 58.80±1.05 | 4.73±0.13 | 2.06±0.18 |
| Code-Style Prompt | 70.70±3.24 | 37.10±0.89 | 50.74±2.73 | 6.25±0.57 | 3.12±0.31 |
| Chain-of-thought | 89.13±2.04 | 42.02±1.27 | 60.13±1.89 | 5.72±0.91 | 2.38±0.17 |
[Reference]
[1] Singh I, Blukis V, Mousavian A, et al. Progprompt: Generating situated robot task plans using large language models[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 11523-11530.
[2] Liang J, Huang W, Xia F, et al. Code as policies: Language model programs for embodied control[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 9493-9500.
[3] Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models[J]. Advances in Neural Information Processing Systems, 2022, 35: 24824-24837.
2. Experiments with Best-First Search Algorithm [K1gw,m5sK]
We explored the outcomes of replacing grounded deciding with other best-first search algorithms: (i) Language-Guided BFS: This method evaluates actions based on how likely they are to occur, as determined by a language model. It uses the "log probability" concept to estimate each action's likelihood. The higher the probability, the more weight is given to that action; (ii) Environment-Guided BFS: This method evaluates the action based on the observability of objects involved. Actions involving easily observable objects are given higher weight; (iii) Hybrid BFS: This method uses a parameter, α, to balance the two methods. If α is closer to 1, the language model's influence is stronger; if it is closer to 0, the environmental observability takes precedence... Detailed experimental setup and results are presented in Appendix C.5. The experiment is related to the response to Q4.1, reviewer #K1gw and Q1, reviewer #m5sK.
3. Grounded Deciding with Observation History [mZgD]
We conducted additional experiments by adding the observation history during grounded deciding. The experiments are related to the response to Q4, reviewer #mZgD.
| | EXEC ↑ | SR ↑ | GCR ↑ | Cost ↓ |
|---|---|---|---|---|
| w/o observation history | 49.01±5.67 | 28.14±2.45 | 35.84±4.20 | 3.48±0.04 |
| with observation history | 48.83±3.72 | 26.63±1.02 | 34.42±1.72 | 8.31±0.03 |
Additional Data Analysis
1. The Distribution of Goal Conditions [mZgD]
| Number of Goal Conditions | Number of Tasks |
|---|---|
| 1 | 2 |
| 2 | 15 |
| 3 | 10 |
| 4 | 6 |
| 5 | 2 |
2. Statistics of Each Task [mZgD]
We have included a table to present detailed statistics for each task in response to Q11.1, reviewer #mZgD.
- Columns "Task" and "Number of Steps" are related to Q11.1, W5, #mZgD
- Columns "Suggested Sampling N" and "Predicted No.Steps" are related to Q9, #mZgD
- Column "Number of Branches" is related to Q9, Q10, #mZgD
- Columns "Success Rate" and "Number of Corrections" are related to Q16.2, #mZgD
3. Quantitative and Qualitative Analysis on the Number of Branches [mZgD]
More discussions are presented in the response to Q10, #mZgD
| Number of Branches | Number of Tasks | Average No.Correction |
|---|---|---|
| [0, 5) | 4 | 0.23 |
| [5, 10) | 21 | 1.91 |
| [10, 15) | 5 | 3.60 |
| [15, 20) | 1 | 5.17 |
Synopsis: This paper tackles the problem of close-loop task planning with an LLM, with the objectives of minimizing token consumption and incorporating feedback for partial replanning. The paper introduces a tree-sampling-based solution and empirically demonstrates reductions in token input and improvements in error correction.
Strengths:
- The paper tackles token efficiency, which is a real pragmatic issue in deploying LLMs
- The tree-based approach seems both sound and novel
- The results are compelling
Weaknesses:
- There are a number of closely related previous approaches, and the "degree of novelty" may not be entirely clear.
Why not a higher score
As mentioned in the additional comments section, the paper is not quite groundbreaking - it implements an idea that many would consider an apparent next step given the specific motivations, but it is still the first (as far as I am aware) to do so. It will generate some good discussions, and may lead to interesting follow-on research.
Why not a lower score
It's still novel, and it has good empirical results. The authors provided reasonable answers to the reviewers, even if the reviewers were being nitpicky.
Accept (poster)