Leveraging Environment Interaction for Automated PDDL Translation and Planning with Large Language Models
We automate PDDL file generation for planning problems using LLMs and environment interaction
Abstract
Reviews and Discussion
The paper proposes an approach to leverage LLMs and environment feedback to automatically generate PDDL domain and problem description files without human intervention. They do so with an iterative refinement approach that generates multiple PDDL problem and domain candidates based on feedback obtained from the environment. The authors show experimentally that their approach solves 66% of tasks across the 10 PDDL domains they tried.
Strengths
The problem addressed in the paper is an important and interesting problem. The proposed approach with regard to the EW metric is novel and promising.
Weaknesses
Assumptions: Assumption 2 may not be realistic. Oftentimes, people may not know exactly the right way to capture the domain knowledge, i.e., what they should have said to ensure the preconditions, effects, and initial state are all captured. What about the case where the domain description is missing a constraint or precondition?
Environment requirement: regarding the applicability of the work, there is a dependency on the environment for refinement, which may limit the impact of the proposed solution, as the environment may not always be available for all domains.
Novelty: the authors claim to be the first to enable PDDL generation using LLMs without human intervention. However, there exist at least two related works that also generate PDDL domains and problems without human intervention:
- Large Language Models as Planning Domain Generators, ICAPS 2024 (https://github.com/IBM/NL2PDDL)
- AUTOPLANBENCH: Automatically Generating Benchmarks for LLM Planners from PDDL, PRL 2024 (https://github.com/minecraft-saar/autoplanbench/tree/main)
The paper's presentation can be improved. See the questions section.
Questions
- What does a 66% solve rate mean? Does it mean 7 out of 10 problems are solved? And what does that mean? Is the PDDL now correct? How many problems of that domain are the different approaches now solving correctly? Can you please clarify further?
- How do you know you have a planning problem at hand that can be turned into PDDL? Have you tried it on Alfworld (which has a PDDL, but it's not ideal), or Mind2Web, or anything that does not have an existing PDDL? Maybe this relates to Assumption 2.
- With regard to refinement, how do you know when to stop? Is there a threshold on the metric that signals when to stop the refinement?
- Can you say anything regarding Algorithm 1's soundness/correctness (and the approach in general)? Also, with respect to Algorithm 1, how many times must an LLM be called to come up with a reasonable PDDL domain and a reasonable PDDL problem? Can you comment on the associated cost as well (any information beyond the token size would be great)?
- In the notation of Section 3, why is the set of all possible actions A separate from D? Isn't A always part of D?
- Regarding planning error, what about unsolvable instances? Do you assume all instances are solvable?
- Regarding Equation 1, can you clarify: do you generate multiple problem instances but only one domain, or do you also refine the domain multiple times, each time generating multiple problem files?
- Can you clarify why you need the ground-truth p and d, and how you use the ground truth q to validate the answers? Also, what happens if you are not provided with the ground-truth domain/problem? Does this mean that even though a ground-truth domain/problem is given, we are going to use LLMs to generate the domain/problem?
- Can you please comment on the applicability of your approach if the PDDL environment is not known/given?
Limitations
Assumption 2 may limit the scope and impact of the work. Also, what about cases where the ground-truth domain/problem is not known, which is the case in most real applications? Having to rely on the PDDL environment also limits the scope and impact of the work.
We thank the reviewer for their positive comments. Due to character limit constraints, below we summarize and answer the main questions raised by the reviewer.
Q1: Assumption 2 may be unrealistic. What if the domain description lacks a constraint?
We take the first steps toward using the environment as a source of refinement, on top of natural language, for domain PDDL generation. We have removed the biggest assumption/blocker, which is human feedback. While our framework, in its current state, may not provide high accuracy in the absence of a natural language description, or with only a partial description, we are excited to see future extensions of our work that rely more on environment interaction and less (or not at all) on natural language descriptions. In particular, partial information in NL can be addressed in future work through better prompting for the LLM to propose and fill in missing information. This can be further enhanced by multiple guesses in an overall tree-search-like reasoning setup on top of the "search tree" in our method. Our work lays the groundwork for achieving this goal.
Q2: There is a dependency on the environment to do the refinement, and that also may limit the impact of the proposed solution as the environment may not always be available for all domains.
We understand the concern that the "PDDL" environment might not always be available, thus limiting the impact. Our framework is not limited to PDDL environments, although the current implementation is. The only scenario where our framework is not applicable is when environment feedback is significantly delayed or slow. In such cases, relying on fully automated agent planning and action execution may not be advisable.
Q3: Novelty: two other works generate PDDL domain and problem without human intervention.
We thank the reviewer for bringing these very recent related works to our attention, and we will add citations to our paper. However, these two papers do not change the fact that ours is the first work to generate both domain and problem PDDLs end-to-end without human intervention. To elaborate: (1) the work of Oswald et al., despite generating domain PDDL from natural language and proposing heuristics for PDDL action-domain comparison, differs in two significant ways: first, they assume the predicates are given, which relaxes the problem too much; second, their work does not directly translate the problem, and uses the ground-truth problem PDDL instance to compare the compatibility of two domains. (2) The work of Stein et al. focuses on translating PDDL to natural language, which is the opposite of what we seek to achieve: translating natural language to PDDL.
Q4: What does the 66% solve rate mean? What does 7 out of 10 domains solved mean?
A 66% solve rate means that out of 100 tasks (10 problems for each of the 10 environments), 66 were solved by finding a correct plan. Note that a task is solved iff both the problem and domain are translated successfully. We observe that even when the translation of the domain is completely successful, the LLM sometimes makes minor mistakes in translating one or more of the 10 problems of the same domain. As such, we consider a domain solved if more than 50% of its tasks are solved correctly.
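As a toy illustration of this aggregation rule (the function and data below are ours, not the paper's code), a task counts as solved only when both the domain and problem translations yield a correct plan, and a domain counts as solved when more than 50% of its tasks are solved:

```python
def solve_rates(results):
    """results: dict mapping domain name -> list of per-task booleans
    (True iff a correct plan was found for that task)."""
    tasks = [ok for per_domain in results.values() for ok in per_domain]
    task_rate = sum(tasks) / len(tasks)          # fraction of tasks solved
    domains_solved = sum(
        1 for per_domain in results.values()
        if sum(per_domain) / len(per_domain) > 0.5   # majority of tasks solved
    )
    return task_rate, domains_solved

# Hypothetical example with 2 domains of 10 tasks each:
# domain A solves 9/10 tasks (counted as solved), B solves 4/10 (not solved).
example = {"A": [True] * 9 + [False], "B": [True] * 4 + [False] * 6}
rate, solved = solve_rates(example)  # rate = 0.65, solved = 1
```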
Q5: Clarify the need for ground truth PDDL. What if the PDDL environment is not known or given?
We thank the reviewer for noting the potential of our framework for general planning problems that do not have a PDDL environment. As we noted regarding this potential applicability in Section 4.1, we leave the extension to future work. Automatically generating PDDL problems and domains without human intervention is a challenging enough problem that we will have to address these other questions in future work. We use the ground-truth PDDLs only to retrieve the list of possible actions and their applicability, consistent with our Assumption 1. Hence, our framework is, by design, agnostic to the underlying environment. Therefore, as long as the action interface is expressible in PDDL and Assumptions 1 and 2 are met, our method is applicable to the underlying environment.
Q6: How do you know when to stop refinement?
Following prior works on code generation [2, 3], we set a maximum number of conversation turns in our experiments. That said, if the exploration walk metric reaches 1.0 and the task at hand is solved, we stop early and do not continue the refinement.
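A minimal sketch of this stopping rule (function names and signatures are illustrative, not the paper's implementation): refinement ends either when the EW metric reaches 1.0 and the task is solved, or when the maximum number of conversation turns is exhausted.

```python
def refine(candidate, environment, max_turns, step):
    """candidate: current PDDL draft.
    environment: callable returning (ew_score, solved, feedback) for a draft.
    step: callable that produces a new draft from (candidate, feedback)."""
    for _ in range(max_turns):
        ew, solved, feedback = environment(candidate)
        if ew == 1.0 and solved:
            break  # early stop: no further refinement needed
        candidate = step(candidate, feedback)
    return candidate

# Toy usage: an integer "candidate" that becomes correct after two fixes.
env = lambda c: (1.0, True, None) if c >= 2 else (0.5, False, "action failed")
result = refine(0, env, max_turns=5, step=lambda c, fb: c + 1)  # result == 2
```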
Q7: In Algorithm 1, how many times must an LLM be called? What is the associated cost beyond the token size?
For the P&D method with n_p problem samples, n_d domain samples, and t conversation turns, the language model is called n_p × t × n_d times. Computing the EW metric for each domain-problem pair takes less than two minutes on a 64-core server CPU.
Q9: Regarding planning error, what about unsolvable instances?
We consider any unsolvable instance as part of a planning error. In general, a planning error means that no plan achieves the desired goal from the initial state, whether due to domain-problem incompatibility or unsolvability.
Q10: Do you generate multiple problem instances and refine the domain multiple times?
For each generated problem, we run a fresh instance of domain refinement. As part of problem translation, some predicates are generated; as such, the generated domain must also conform to the defined predicates, and should be generated from scratch.
[1] Guan et al., "Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning", NeurIPS 2023.
[2] Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback", NeurIPS 2023.
[3] Chen et al., "Teaching Large Language Models to Self-Debug", ICLR 2024.
Can you please point to where you discuss computational complexity and soundness and completeness of your approach?
I would say that the notion of human intervention can be interpreted in multiple ways: no human, human before the LLM call, human in the loop (after the LLM call), etc. It would be good to position your work within one of these. The related work has, as you point out, the assumption of given predicates, but that does not mean humans are in the loop while the LLM is being called, right?
I am still not convinced regarding the generality of the work given the current assumptions (need for the environment, etc).
Can you please point to where you discuss computational complexity and soundness and completeness of your approach?
In response to Q7, we provide more detail on how many times the LLM is called and the time complexity of the EW metric (which we will add to our paper). In addition to that, we have mentioned in our paper (line 352) that in Table 2, using the GPT-4 model, we used 12.40 million input tokens and 8.73 million output tokens. Due to the closed-source nature of GPT-4, we are not able to compute the number of FLOPs or any other metric associated with the complexity of the LLM in our experiments beyond the token count. We would be happy to provide more information if the reviewer has particular metrics in mind.
In regard to the soundness and completeness of our approach, we have formalized all the metrics and setup (section 4.1). We also provide desirable properties of our introduced EW metric (lines 259-272), and design rigorous experiments with quantifiable metrics (i.e., domain term removal experiments, and plan-not-found metric) to verify the usefulness of the EW metric (sections 4.2, 4.3, and figure 2). These are in addition to the strong results we get in Table 2 by applying our method to PDDL environments.
I would say that the notion of human intervention can be interpreted in multiple ways: no human, human before the LLM call, human in the loop (after the LLM call), etc. It would be good to position your work within one of these. The related work has, as you point out, the assumption of given predicates, but that does not mean humans are in the loop while the LLM is being called, right?
Once the input problem descriptions are given (which is part of the benchmarking setup, not the method), our method requires absolutely no human intervention, from the very beginning of the problem/domain proposal to the very end of obtaining the final evaluation metrics. We will make this clearer in our paper to avoid confusion. We should point out that several papers already do not require human intervention (such as LLM+P and LLM-DP in Table 1 of our paper, as well as the work you mentioned). However, the important point is that none of these works check the "Domain translation" criterion, where the language model needs to come up with correct predicates and preconditions/effects. The assumption that "predicates are given" relaxes the problem too much and does not pass the "Domain Translation" checkmark (in Table 1 of our paper). Therefore, this does not change the fact that our work is the first to not require human intervention.
I am still not convinced regarding the generality of the work given the current assumptions (need for the environment, etc).
Our main goal is to move towards fully automated planning with LLM agents, and such a goal by definition requires interaction of the agent with the environment as one of its essential parts; most LLM agentic workflows need some environment interaction. In fact, this is something that even humans rely on in everyday scenarios (e.g., pushing/pulling to open a door when the initial mental model of which way the door opens is wrong). In the absence of an environment, relying on automated agents may not be advisable.
Re complexity, I am actually referring to the big-O complexity of the approach (depending on the size of the input/dataset). See the following as examples of such complexity analysis:
https://arxiv.org/html/2406.11911v2 https://arxiv.org/abs/2404.11833
We appreciate the time and effort you have dedicated to the author-reviewer discussion and thank you for bringing our attention to these relevant papers. To provide a big-O analysis similar to the work of Katz et al., our method has an overall time complexity of O(n_p · t · n_d + m) for generating domain and problem PDDLs (where n_p, n_d, and t are the numbers of problem samples, domain samples, and refinement turns, and m is the number of tasks). Here, the first term is for the one-time domain translation, and the second term is for translating each of the m tasks.
We thank you again for taking the time to review our paper. We hope to have addressed your questions and concerns. We noticed that the score was recently adjusted from 6 to 5. We would be grateful if you could give insights behind this change, and allow us to provide further explanation.
If the change is related to the complexity of our method in relation to the two references above, we would like to emphasize that these papers are very recent and do not appear to have been published yet. Furthermore, there is no evidence from those papers that they can solve the problem we target in this work.
- I don't agree with the novelty point regarding human intervention. The related work I mentioned also does not have the human in the loop. The notion of human intervention can be interpreted in different ways.
- The complexity numbers: I am not sure about what was just provided, as there is no explanation of what the different terms are, and you don't discuss how you got there or why this complexity analysis was not part of the main paper.
- There are assumptions that are unrealistic in a general sense, so the significance of the work is questionable.
For these reasons, I have lowered my score.
We have previously addressed all three points that the reviewer made in the last reply above. Given that the discussion period is nearing its end, it would have been helpful if the reviewer had provided actual reasons instead of repeating the same general claims that we have already rebutted, so that we could provide further explanation.
We try to re-summarize our previous responses nonetheless and elaborate as best as we can.
I don't agree with the novelty point regarding human intervention. The related work I mentioned also does not have the human in the loop. The notion of human intervention can be interpreted in different ways.
The related work that the reviewer referenced is responded to under "Q3: Novelty: Two other works generate PDDL domain and problem without human intervention" of https://openreview.net/forum?id=RzlCqnncQv&noteId=ofKer0JBL5. In short, the referenced papers do not solve the same problem at all, and there is no evidence that they could tackle the problem we are targeting.
What we mean by no human intervention or human feedback is explained in the previous response at https://openreview.net/forum?id=RzlCqnncQv&noteId=lSXjugc9Wg, starting from the sentence "Once the input problem descriptions are given [...]"
The complexity numbers
The complexity number is explained at https://openreview.net/forum?id=RzlCqnncQv&noteId=tvVozFzGuB, where the sources of the first and second terms are explained. Note that the notation is consistent with what is used in the paper text, in the main Algorithm 1 and the main result Table 1, so the result should be evident from the double loop in Algorithm 1 and the fact that the innermost portion generates n_d samples from the LLM on line 6 of Algorithm 1.
To elaborate on the complexity further: (1) The complexity of domain generation is O(n_p · t · n_d). This is because, at first, n_p problem candidates are generated, and for each problem candidate the algorithm goes through a refinement procedure (lines 1 and 2 of Algorithm 1). The refinement is a tree of depth t (where t is the maximum number of refinement turns) (line 5), and at each level of the tree, one node is expanded with n_d children (where n_d is the number of domain refinement candidates) (line 6), which leads to O(n_p · t · n_d) complexity. (2) Once the domain is ready, the complexity of task generation for m tasks is O(m), since for each task we only call the LLM once to get a problem translation. Therefore, the overall complexity is O(n_p · t · n_d + m).
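This call count can be sketched as a toy function (the variable names follow our reconstruction above and are illustrative, not the paper's notation): each of the n_p problem candidates triggers a refinement tree of depth t, each expanded node samples n_d domain candidates, and problem translation adds one LLM call per task.

```python
def total_llm_calls(n_p, t, n_d, num_tasks):
    """Count LLM calls under the nested-loop structure described above."""
    domain_calls = n_p * t * n_d   # n_p refinement trees, depth t, n_d children per node
    problem_calls = num_tasks      # one translation call per task
    return domain_calls + problem_calls

calls = total_llm_calls(2, 3, 4, 10)  # 2*3*4 + 10 = 34
```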
We did not claim anything regarding computational complexity or efficiency, only the ability to solve very challenging planning problems automatically without human feedback; so, initially, it did not seem crucial to include the complexity, especially given that it is straightforward to read off from Algorithm 1. Instead, we included the actual token cost from the LLMs in the paper. We will, however, add the complexity analysis to the paper.
There are assumptions that are unrealistic in a general sense, so the significance of the work is questionable.
This is already addressed in the rebuttal at https://openreview.net/forum?id=RzlCqnncQv&noteId=ofKer0JBL5 and in the last section of the comment at https://openreview.net/forum?id=RzlCqnncQv&noteId=lSXjugc9Wg. Again, it would have been helpful to have more than "in a general sense" as criticism so that we can explain further.
We acknowledge that the reviewer has the right to revise their score as they see fit. We respectfully disagree with the assessments made by the reviewer but unfortunately cannot provide further explanation, because the last round of comments is too general and the discussion period is coming to an end.
The paper presents an approach that leverages LLMs to generate PDDL domain and problem files from natural language descriptions, and refine them iteratively based on environment interactions. In particular, it proposes an Exploration Walk (EW) metric that provides feedback signals to guide the iterative refinement process. In experiments, the proposed method successfully recovers PDDL files in 7 of 10 domains, outperforming LLM-based planners.
Strengths
- The paper studies the important problem of learning PDDL domains from embodied interaction for classical planning. This is a promising direction to enable long-horizon planning with formal guarantees. You may also find this recent paper [1] highly relevant.
- In contrast to existing work that requires human intervention, the paper boldly attempts to generate PDDL files automatically using feedback from environment interaction. The proposed EW metric does make some sense to me.
- In experiments, the method seems to be fairly capable of recovering valid PDDL files.
[1] Han, Muzhi, Yifeng Zhu, Song-Chun Zhu, Ying Nian Wu, and Yuke Zhu. "InterPreT: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning." RSS 2024.
Weaknesses
- My major concern is about the main contribution of the paper - the automatic mechanism that iteratively refines the generated PDDL files.
- EW does not provide a sufficient objective. The paper presents an Exploration Walk (EW) metric to provide feedback for the refinement process. The metric measures the difference between the generated PDDL domain and the ground-truth environment via the feasible action sequences within them. While EW=1 is a necessary condition for the generated domain to be valid, it is not a sufficient condition. I agree that EW can provide guidance at the initial stage, but in the end the objective in Equation (1) should be the one to optimize to produce a valid domain.
- The exact feedback is not explained enough. The paper does not seem to elaborate on the form of feedback provided to the LLM for refinement - given that the EW score is just a number. While the authors mention this briefly in the Appendix, I still don't fully understand how it works exactly. As this is the key part that makes the proposed approach possible, I would suggest the authors provide more details in the main paper.
- The effectiveness of scalar-based feedback is doubtful. Given that the feedback is a number that provides little information on what the exact issue is (whether it is in the problem file or the domain file, whether it is a precondition/effect term or the predicate design, and which line), I doubt whether the LLM can perform reasonable refinement. I think it is highly possible that the iterative refinement process will go nowhere.
- Another important doubt is about the problem setup - where the natural language descriptions are translated line-by-line from the ground-truth PDDL files.
- This setup is fundamentally different from what the problem of "generating PDDL" should be. Under this setup, the challenge is no longer generating PDDL files, which requires exploiting environment interactions, but translating natural language into PDDL precisely without losing any information. More specifically, I believe the difficulty is identifying the important predicates. Once the predicates are ready, the precondition & effect terms and the initial & goal states should be relatively simple to translate with GPT-4 and some prompt engineering.
- The proposed approach seems to be unaligned with the challenge. While I think utilizing something similar to the EW metric is the way to go, it does not align well with the challenge posed by this problem setup.
Questions
- What is t in Algorithm 1? If it is the number of refinement iterations, how many iterations are used in the paper?
- In experiments, the method runs 4 times and the best result is used for evaluation. I'm curious what the statistics of the 4 runs look like. Does a "magical seed" lead to good results while the others fail?
- The language descriptions are generated by GPT-4 from ground-truth PDDL files. I would expect there might be missing items or hallucination. I wonder, do you manually check and fix the generated outputs?
Limitations
See Weakness and Questions
We thank the reviewer for the feedback. However, we respectfully disagree with the criticisms around our core contributions and problem setup, as we believe there are misunderstandings about our work.
The reviewer seems to believe that we only use the scalar feedback provided by the EW score and that this is insufficient. However, the scalar EW metric is only used to pick the best response from the LLM; the textual feedback (as explained in Figure 2 and Appendix B.3) consists of the incorrect set of actions and a textual description of the last state. Please see the attached rebuttal PDF in the overall response for the exact form of the feedback.
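A minimal sketch of this selection step (function names are illustrative, not the paper's code): the scalar EW score only ranks candidate responses, while the textual feedback is what goes back to the LLM for the next refinement turn.

```python
def pick_best(candidates, ew_score):
    """Return the candidate with the highest EW score.

    candidates: list of candidate PDDL responses from the LLM.
    ew_score: callable mapping a candidate to its scalar EW score."""
    return max(candidates, key=ew_score)

# Toy usage: string length as a stand-in for a real EW evaluator.
best = pick_best(["d1", "d23", "d2"], ew_score=len)  # "d23"
```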
The second group of criticism appears to try to redefine the problem that we set out to solve and concludes that our solution misaligns as a result. However, our goal is toward fully automatic planning in very complex problems from natural language description. To this end, all challenges surrounding automatic PDDL generation need to be resolved, including but not limited to the predicate identification problem raised by the reviewer, and we have made substantial progress in the overall problem, as evidenced by the benchmark results.
We provide additional responses to each point raised by the reviewer below:
W1 about core contributions: EW does not provide a sufficient objective, the objective in Equation (1) should be the one to optimize to produce a valid domain.
We agree that the final objective is Eq (1); however, this equation is not easy to optimize. We should note that it is common practice in machine learning to optimize a proxy objective instead when the primary objective is not easy to optimize. For instance, optimizing the cross-entropy loss as a proxy for accuracy or optimizing the L1 loss function as a proxy for sparsity. We believe the EW score serves a similar purpose as a proxy for Equation (1), which is hard to optimize.
W2 about core contributions: The exact feedback is not explained enough
We will add an example of feedback for the Termes environment to our paper. Please see the attached rebuttal page for the example.
W3 about core contributions: The scalar provides little feedback.
Please see the comment at the beginning of this response.
Weakness about problem setup: This setup is fundamentally different from what the problem of "generating PDDL" should be. The difficulty is to identify important predicates. Once the predicates are ready, the precondition & effect terms and initial & goal states should be relatively simple to be translated with GPT-4 with some prompt engineering.
Our goal is toward fully automatic planning in very complex problems from natural language description. To this end, all challenges surrounding automatic PDDL generation need to be resolved, and we have made substantial progress in the overall direction, as evidenced by the benchmark results. We appreciate the reviewer's acknowledgment of the challenge in identifying important predicates, which we did target in this work. However, we would like to emphasize that determining the correct preconditions and effects is also more complex than it might seem. Our observations indicate that GPT-4 sometimes misses preconditions or effects. For instance, as shown in the attached rebuttal PDF, the LLM corrected an action's precondition only after receiving feedback from the environment about the action's illegality. We also note that the level of description in our setting is similar to that of prior works (e.g., [1]). Overall, we are excited to see future extensions of our work that rely more on environment interaction and less (or not at all) on natural language descriptions. Our work lays the groundwork for achieving this goal.
Responses to questions
Q: What is t in Algorithm 1? If it is the number of refinement iterations, how many iterations are used in the paper?
It is the maximum number of refinement iterations (or conversation turns). Following prior works on code generation [2, 3], we fix a maximum number of turns in our experiments.
Q: In experiments, the method runs 4 times and the best result is used for evaluation. I'm curious what the statistics of the 4 runs look like. Does a "magical seed" lead to good results while the others fail?
We did not "optimize for the random seed," if that is what the question hints at. Furthermore, in a realistic setting, it is perfectly reasonable to try a few times and use the EW score to decide on the final solution, as that relies only on environment feedback. Finally, the Best@4 metric is similar to the Pass@K metric, which is commonly used in the code generation literature (e.g., [3, 4]).
As for the statistics, it depends on the environment. For example, on Termes, all four seeds succeed in recovering the correct domain. On Grippers, three seeds succeed. On harder environments, such as Hiking, only two seeds succeed in recovering the correct domain PDDL, and on Floortile only one seed succeeds.
Q: The language descriptions are generated by GPT-4 from ground-truth PDDL files. I would expect there might be missing items or hallucination. I wonder, do you manually check and fix the generated outputs?
We did manually check for hallucination and observed that hallucination cases in back-translation are very rare, though not zero, and we fixed the cases we detected. We will publicly release the data along with the camera-ready version of our paper.
Note that this manual checking for hallucination is only for creating the natural language description dataset for the problem setup. For automatic PDDL generation, there is no human in the loop.
[1] Liu et al., "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency", 2023.
[2] Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback", NeurIPS 2023.
[3] Chen et al., "Teaching Large Language Models to Self-Debug", ICLR 2024.
[4] Chen et al., "Evaluating Large Language Models Trained on Code", 2021.
Thank you for your rebuttal. Please find my further comments below:
- About the problem setup: I understand this paper leverages environment interaction to improve the accuracy of translating a natural language domain description into PDDL. In this sense, I find the title "Leveraging Environment Interaction for Automated PDDL Generation" misleading - automatically recovering a PDDL domain from interaction is fundamentally different from the goal of this work. Instead, I would suggest the authors consider using "PDDL translation" or "PDDL generation from language descriptions" when referring to the problem.
- About EW: I agree that the EW metric provides good feedback to further correct the translated PDDL domain file, which already has fair quality.
- About the feedback: It makes sense to me that textual feedback is used for correcting the PDDL translation. However, this aspect seems to be missing from the method section of the paper. Highlighting the form of feedback used is very important - it helps readers understand the method better.
- About evaluation: Though I understand that LLM-based translation and correction usually involves a lot of randomness, I think it is necessary to report the percentage of runs in which the method produces valid PDDL files that allow planning.
After the authors clarify the problem setup and the above concerns, I will be very happy to increase my score.
We are delighted that our response clarified your concerns, and we appreciate your feedback. To incorporate your feedback, we will make the following changes to our camera-ready version:
- We will replace the term "PDDL generation" with "PDDL translation" in both the title and the main body of the paper.
- To better explain the feedback system, we will move the feedback format explanation from the appendix to the main body. This is in addition to the feedback example we provided in our rebuttal, which will be added to the appendix in the paper.
- For each environment, we will report the number of seeds that succeed in generating a correct domain PDDL.
Thank you for incorporating the changes. I have raised my score to Weak Accept. Also, I think it would be very helpful if you could briefly discuss the topic of learning PDDL domains from interaction experience [1, 2, 3] (without the natural language domain description), and how it relates to and differs from your problem setup, in the related work section.
[1] Ahmetoglu, Alper, et al. "Deepsym: Deep symbol generation and rule learning for planning from unsupervised robot interaction." Journal of Artificial Intelligence Research (2022)
[2] Silver, Tom, et al. "Predicate invention for bilevel planning." Proceedings of the AAAI Conference on Artificial Intelligence (2023)
[3] Han, Muzhi, et al. "InterPreT: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning." Robotics: Science and Systems (2024)
We thank you for providing references to related work. We will read them in detail and discuss their relationship to our setup where relevant.
This work talks about generating PDDL domain and problem files with LLMs. Specifically, it improves existing frameworks, particularly Guan et al. [8], in terms of increasing the degree of automation & eliminating the need for human corrective feedback. The core contribution of this work is the EW score. To compute the score, it only requires access to a set of executable plans and an executability checker (which can be either a simulator or the actual environment). The EW score is used to select sampled domain models given by the LLMs.
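To make the EW idea above concrete, here is a rough, self-contained sketch (not the paper's actual implementation; the STRIPS-style action encoding, function names, and parameters are all our own illustrative assumptions). It scores a candidate domain model by the fraction of random-walk steps, sampled under the candidate, that the reference environment also accepts:

```python
import random

# Minimal STRIPS-style action: (name, preconditions, add_effects, del_effects),
# where each component after the name is a frozenset of ground facts.
# All names here are illustrative assumptions, not the paper's implementation.
def applicable(state, action):
    _, pre, _, _ = action
    return pre <= state

def apply_action(state, action):
    _, _, add, delete = action
    return (state - delete) | add

def ew_score(candidate, reference, init_state, n_walks=100, walk_len=10, seed=0):
    """Fraction of random-walk steps sampled under `candidate` that the
    `reference` environment also accepts (an EW-style similarity proxy)."""
    rng = random.Random(seed)
    accepted = total = 0
    for _ in range(n_walks):
        cand_state, ref_state = init_state, init_state
        for _ in range(walk_len):
            choices = [a for a in candidate if applicable(cand_state, a)]
            if not choices:
                break  # dead end under the candidate model
            a = rng.choice(choices)
            total += 1
            # Try to execute the same-named action in the reference environment.
            ref_a = next((r for r in reference if r[0] == a[0]), None)
            if ref_a is not None and applicable(ref_state, ref_a):
                accepted += 1
                ref_state = apply_action(ref_state, ref_a)
            cand_state = apply_action(cand_state, a)
    return accepted / total if total else 0.0
```

A candidate identical to the reference scores 1.0, while a candidate with, say, a missing precondition samples walks the environment rejects and scores below 1.0 — matching the role the review describes: a graded signal for selecting among sampled domain models.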
优点
- The paper is well-written, with precise and rigorous wording and formalism.
- The attempt to reduce the need for human feedback in domain generation is a meaningful and useful step forward.
- The introduction of the EW score not only forms the foundation of this work but also holds potential for future applications/research (e.g., for evaluation or as a heuristic)
缺点
- The domain model sampling/generation is done in a relatively simple way. The feedback message could be more informative than just indicating the inexecutable step or action in a plan.
- Regarding the structure of the related work section, the distinction between "intrinsic reasoning" and "external reasoning" seems unclear to me, especially given that "with the assistance of basic external tools [14]" is mentioned under the "intrinsic reasoning" subsection. Also, even for the task of PDDL generation, a certain degree of "intrinsic reasoning" is needed. Rather than "intrinsic reasoning", I guess the authors likely meant "direct plan generation."
- It is not clear how much knowledge is provided in the domain NL description. Clarifying this helps readers understand whether this work leverages LLMs as a knowledge source or as a "translator." -- from the examples in the appendix, it seems that LLMs are used as the latter in this work (note that I am not saying translation is trivial)
- While the EW score may be effective for selecting candidate domain models and guiding their generation, its suitability as an evaluation metric is questionable. We know there exists a "ground-truth" PDDL, and our goal is to fully recover its functionality. This is a binary 0/1 problem; it is not the case that a generated model with a 0.8 avg solve rate is more usable than a model with a 0.2 avg solve rate.
- Also, I think it's important to mention that the avg solve rate can only serve as an approximate measure of the equivalency between two domain models. A 100% avg solve rate doesn't guarantee model equivalency (but this seems to be an easier-to-compute measure)
- Assumption 2 is stated in a loose way. An NL description of a domain can be given at different degrees of detail (which correspond to different levels of difficulty in domain PDDL generation).
- Line 211: I think it's better to say "as long as PDDL is expressive enough to capture the dynamic / working mechanisms of the environment" rather than "the env supports PDDL action interface."
- It's unclear what the takeaway of Sec. 4.2 should be. Firstly, "plan-not-found" only accounts for a certain fraction of the consequences caused by removing a term or predicate. Other consequences, such as producing invalid plans, can also occur. Secondly, it is well known that obtaining a valid domain model is challenging, even for humans. The authors should better explain the connection between Sec. 4.2 and the other parts of the paper.
- The authors should give more information on the computational complexity/cost (e.g., time consumption) associated with the calculation of the EW score per candidate model.
Overall, I find this manuscript well-written, and the idea can be valuable to the community. Therefore, I am leaning towards recommending acceptance.
问题
See the weakness section.
局限性
See the weakness section.
We thank the reviewer for their positive comments. Below, we address the main questions raised by the reviewer:
W1: The domain model sampling/generation is done in a relatively simple way. The feedback message could be more informative than just indicating the inexecutable step or action in a plan.
Respectfully, our perspective here differs. The simplicity of our feedback messages is a deliberate choice to showcase the general applicability of our framework. Improving the solve rate from 29% to 66% with a simple variant of our framework demonstrates the impact of its core idea and a path forward for future innovation. More detailed feedback messages are a promising future direction, among other possibilities enabled by our approach.
W2: Regarding the structure of the related work section, the distinction between "intrinsic reasoning" and "external reasoning" seems unclear to me, especially given that "with the assistance of basic external tools [14]" is mentioned under the "intrinsic reasoning" subsection. Also, even for the task of PDDL generation, certain degree of "intrinsic reasoning" is needed. Rather than saying "intrinsic reasoning", I guess the authors likely meant "direct plan generation."
We appreciate the reviewer's feedback on the structure of the related work section. To clarify, we will replace the term "intrinsic reasoning" with "direct reasoning." This should better capture the distinction we intended to make between different types of reasoning approaches in our study.
W3: Are LLMs used as knowledge sources or translators?
The LLMs are mainly used as translators; however, some elementary general knowledge and reasoning is required for the LLM to come up with correct predicates and preconditions/effects.
W4: While the EW score may be effective for selecting candidate domain models and guiding their generation, its suitability as an evaluation metric is questionable. We know there exists a "ground-truth" PDDL, and our goal is to fully recover its functionality. This is a binary 0/1 problem; it is not the case that a generated model with a 0.8 avg solve rate is more usable than a model with a 0.2 avg solve rate.
While achieving exact functional equivalence with the ground-truth PDDL is ideal, this is often unattainable for hard PDDL domains, resulting in incompatible output plans (and consequently functionally inequivalent PDDL domains). Therefore, a proxy metric is essential for meaningful comparison. Such proxy metrics are crucial for comparing different models and understanding trends that would have been unpredictable through the primary metric [5]. In Section 4.2, we demonstrate that the EW metric is a suitable proxy for our setting.
W5: It's important to mention that the avg solve rate can only serve as an approximate measure of the equivalency between two domain models. A 100% avg. solve rate doesn't guarantee model equivalency (but this seems to be an easier-to-compute measure)
We will add a sentence to clarify that a complete task solve rate does not imply exact domain equivalency. However, we note that it is common practice to test generated PDDL code using the task solve rate (e.g., [1], [2]). More generally, the majority of the code generation literature (e.g., Codex [3], AlphaCode [4]) tests generated Python or C++ code against unit test cases and reports final accuracy based on performance on those cases.
W6: Assumption 2 is stated in a loose way. An NL description of a domain can be given at different degrees of detail (which correspond to different levels of difficulty in domain PDDL generation).
This assumption is part of the input to our framework. The amount of detail provided only affects the accuracy, not the framework itself. Our natural language descriptions maintain a degree of detail comparable to previous works (e.g., [1]). Studying the trade-off between the amount of detail in natural language and accuracy is an interesting area for future work.
W7: Line 211: I think it's better to say "as long as PDDL is expressive enough to capture the dynamic / working mechanisms of the environment" rather than "the env supports PDDL action interface."
Thank you for your constructive feedback. We will change this line as suggested by the reviewer.
W8: It's unclear what the takeaway of Sec. 4.2 should be. Firstly, "plan-not-found" only accounts for a certain fraction of consequences caused by removing a term or predicate. Other consequences, such as producing invalid plans, can also occur. Secondly, it is well known that obtaining a valid domain model is challenging, even for humans. The authors should better explain the connection between Sec. 4.2 and the other parts of the paper.
Section 4.2 aims to convey that even the smallest divergence from the ground-truth domain PDDL leads to no plan being found, let alone a valid one. This sensitivity is a significant motivation for our shift from traditional plan-search approaches to the introduction of the exploration walk mechanism.
W9: The authors should give more information on the computational complexity/cost (e.g., time consumption) associated with the calculation of EW score per candidate model.
Computing the EW is relatively negligible compared to the cost of LLM inference. In our experiments, computing the EW score for a single domain-problem pair takes less than two minutes on a 64-core server CPU. We will add this information to our paper.
[1] Liu et al., LLM+P: Empowering Large Language Models with Optimal Planning Proficiency, 2023.
[2] Guan et al., Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning, NeurIPS 2023.
[3] Chen et al., Evaluating Large Language Models Trained on Code, 2021.
[4] Li et al., Competition-Level Code Generation with AlphaCode, 2022.
[5] Schaeffer et al., Are Emergent Abilities of Large Language Models a Mirage?, NeurIPS 2023.
I acknowledge that I have read the authors' response. I find my original evaluation appropriate for the current manuscript and will therefore maintain the current score.
This work presents an approach for modeling planning environments via PDDL generation using LLMs and environment feedback, without relying on human intervention. This is achieved by an Exploration Walk (EW) metric to measure domain similarity and guide domain refinement, and an iterative rectifying method that leverages LLMs to generate and refine PDDL domain and problem files. The evaluation of this method is performed alongside baselines on ten different standard planning domains from IPC.
优点
- The presentation of the work is quite clear and concise.
- The exploration walk method included in the PDDL file correction loop is a simple and lightweight way to obtain approximate scores for the domain generation process.
- A decent amount of experimentation and analysis have been performed and stated in the work.
缺点
- Section 4.2 - demonstration of the brittleness of PDDL generation can be made more realistic such as additionally including hallucinated object identifiers or actions or symbols, which are highly probable with LLM-based code format generators.
- The PDDL generation with LLMs approach is not as novel, and the exploration walk / environment feedback approach may not be as useful for generating completely new domains from descriptions or making custom modifications to existing domain files. The method does not really differ for rectifying problem files.
- There are intrinsic problems with generating domain files from descriptions for domains such as Barman - where the levels of shaker and actions such as clean shot, empty shot do not translate well for LLM-based generation. More analysis and description in this line of argument are necessary.
问题
- This work on generating programs for planning problems [https://arxiv.org/abs/2305.11014] may also need to be cited in related work.
- The authors may benefit from more sophisticated exploration approaches in this paper [https://arxiv.org/abs/2406.07145] that solve a different problem.
局限性
Limitations have been addressed by the authors.
We thank the reviewer for their positive comments. Below, we respond to the weaknesses and questions raised by the reviewer.
W1: Section 4.2 - demonstration of the brittleness of PDDL generation can be made more realistic such as additionally including hallucinated object identifiers or actions or symbols, which are highly probable with LLM-based code format generators.
We already have a qualitative demonstration of the brittleness of PDDL generation in Appendix A.2, where we show a real (and common) example in which the LLM makes a subtle error in predicate design and, as a result, the whole domain produces invalid plans. This is in addition to the quantitative metrics shown in Section 4.2. We will update the writing to highlight this result more.
W2: The PDDL generation with LLMs approach is not as novel and the exploration walk/environment feedback approach may not be as useful in generating completely new domains from descriptions or making custom modifications to existing domain files. The method does not really differ for rectifying problem files.
PDDL generation itself is not novel, and we do not claim it as our contribution. We should emphasize that the main contribution of our work is a framework that eliminates the need for human feedback when generating PDDL domains and problems, taking the first steps toward such automation. We are certainly excited to see extensions of our method to "custom domain modification", or an exploration-walk-like metric for PDDL problem files, but we believe these are out of the current scope and are interesting directions for future research.
W3: There are intrinsic problems with generating domain files from descriptions for domains such as Barman - where the levels of shaker and actions such as clean shot, empty shot do not translate well for LLM-based generation. More analysis and description in this line of argument are necessary.
Indeed, predicate design is a crucial part of LLM-based generation, and we already provide a qualitative analysis for the Grippers environment in Appendix A.2. We observe similar challenges in the Barman environment, and we kept the Grippers example since that environment is simple enough to be illustrative throughout the paper. Per the reviewer's suggestion, we will add the Barman example to Appendix A.2 for comprehensiveness.
Q1: This work on generating programs for planning problems [https://arxiv.org/abs/2305.11014] may also need to be cited in related work.
We have already cited and discussed this work in our related work section.
Q2: The authors may benefit from more sophisticated exploration approaches in this paper [https://arxiv.org/abs/2406.07145] that solve a different problem.
We thank the reviewer for pointing out this work. There is a body of literature on exploration strategies, and we will add citations to some papers in that literature, including the one mentioned by the reviewer. However, the method in this suggested reference targets a very different problem; adapting it to our setting would be highly non-trivial and could warrant a new research effort.
Additionally, we should stress two points: 1) as established through our empirical results on 10 domains, the current simple EW already works; and 2) our approach is a general framework, so components of the current implementation can be improved in future work.
Thanks to the authors for their response and clarifying answers. I have read the rebuttal responses and am satisfied with the proposed modifications. I find my original evaluation appropriate, but however have increased my confidence rating.
We thank the reviewers for their constructive feedback. We are encouraged that the reviewers find the automated PDDL generation problem important (cVUW, wCSx), find our exploration walk metric novel and promising (cVUW, wCSx, zxRN, iKF8), appreciate the analysis in the paper (iKF8), and find our paper well-written (zxRN, iKF8).
We have provided individual answers to the questions raised by the reviewers. To incorporate the modifications requested by the reviewers, we will revise our manuscript with the following:
- Add an example of exact environment feedback (provided in the rebuttal PDF page). This example showcases the reasoning of LLM for domain refinement and provides a clear example of the refinement prompt.
- Add an example from the Barman environment in Appendix A.2 (in addition to the Gripper example that we already have) to emphasize the cruciality of predicate design.
- Clarify the terminology in the related work section.
- Provide more detailed information on the computational complexity of the EW metric.
- Add citations to recent related works suggested by reviewers.
If the concerns are well addressed, we kindly ask the reviewers to raise the ratings accordingly. Should there be any further questions, please do not hesitate to ask us during the discussion period.
The reviewers generally agreed on the fact that the paper is tackling an important problem. Additionally, there was a general consensus that the Exploration Walk (EW) metric is an interesting contribution and could spur follow-up work in the future. The fact that the authors have also focused on leveraging minimal environmental feedback is also a useful aspect of the work.
While some of the reviewers noted the accessibility and clarity of the writing, the discussions clearly pointed out that the current exposition leads to confusion about the contributions of the paper, the setting, and the assumptions made. The authors should make a point of incorporating the points that were brought up during the discussion.
Other important points highlighted during the discussion include the existence of related work that also does not need human input, what the lack of human input even means here, and explicitly capturing the cost associated with learning the model. I would also advise the authors to include a discussion connecting the proposed methods to the larger literature on learning PDDL models from interacting with the environment.
With that said, in my opinion, the paper makes a meaningful contribution to the space of planning with LLMs. As such, my recommendation is to accept the paper.