PaperHub
Rating: 5.0 / 10 (Poster; 4 reviewers; min 3, max 7, std 1.6)
Individual ratings: 4, 7, 6, 3
Confidence: 3.5 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.0
NeurIPS 2024

Thought of Search: Planning with Language Models Through The Lens of Efficiency

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We find that recent trends in using LLMs for planning are profoundly uneconomical, unsound and incomplete. We propose a significantly more efficient approach that is sound and complete and argue for a responsible use of compute resources.

Abstract

Keywords
planning · large language models · search

Reviews and Discussion

Review
Rating: 4

The paper analyzes the use of LLMs in planning and proposes having the LLM write the successor function and goal test as code instead of solving the problem directly. Experiments show that using the generated code yields higher accuracy and fewer calls to the LLM than LLM-based solving.

Strengths

The paper is very well written and easy to follow. The discussion is very detailed and provides extra insights on the topic.

Weaknesses

  • First of all, I think the conclusion is not surprising and does not bring any new insight. The idea of having the LLM write code instead of directly solving problems has appeared extensively in LLM reasoning [1] and, more specifically, in LLM planning [2][3].
  • Because the conclusion does not provide any big insight, I expected to see strong experiments, which is not the case in this paper. The selected benchmarks are quite classic in several respects, and similar code has been on GitHub for at least a year. It is hard to tell whether the success in generating code comes from LLMs memorizing similar content from GitHub or from a genuine ability to generate correct code. Even so, the authors mention that “The mistakes GPT-4 makes when producing the code repeat from one experiment to another”, which is not a good sign if one wants to deploy similar methods in more general applications.
  • One minor drawback is that the authors fail to cover some closely related works, for example the ones I mentioned in the first weakness. A more thorough literature review is recommended to make the paper more convincing.

[1]. Zhou, Aojun, et al. "Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification." arXiv preprint arXiv:2308.07921 (2023).

[2]. Liu, Bo, et al. "Llm+ p: Empowering large language models with optimal planning proficiency." arXiv preprint arXiv:2304.11477 (2023).

[3]. Guan, Lin, et al. "Leveraging pre-trained large language models to construct and utilize world models for model-based task planning." Advances in Neural Information Processing Systems 36 (2023): 79081-79094.

Questions

I am concerned about the current need of human feedback to successfully generate the code. While I can see from the appendix that a large portion of them might need such feedback, can you provide some statistics about this?

Limitations

Yes.

Author Response

We thank the reviewer for their feedback. We hope that our clarification regarding the statistics on when feedback was needed alleviates the reviewer’s concerns and allows them to raise their rating.

Answer to the Question:

We have provided the average number of interactions with the model (required to produce correct code) in the paper. The average is computed over 5 separate runs. The number of times feedback was needed is one less than the number of interactions, since the first interaction, which asks to produce the code, is also counted.
  • 24 Game (line 204): 1.2 interactions on average for the successor function and 1 for the goal test, i.e., the goal test needed no feedback and the successor function needed a single round of feedback in one of the 5 runs.
  • Mini Crosswords (lines 219-220): 2.4 interactions on average for a valid successor function and 1.4 for the goal test.
  • BlocksWorld (line 239): 2.8 and 1 interactions on average for the successor function and goal test, respectively.
  • PrOntoQA (line 275): 1.6 and 1 interactions on average for the successor function and goal test, respectively.
The sum of the two averages (successor function + goal test) is shown in the “Calls” column of Table 1, last row.

On weaknesses:

  • While we agree that the idea of asking language models to produce code is not revolutionary, we are not aware of other work that uses language models to produce search components such as the successor function and goal test. Further, we consider this to be only one of the contributions of our work. Please see the response to all reviewers for a discussion of the contributions of our work.

  • On related work: we will add the mentioned papers to the related work.

    • Guan et al., NeurIPS 2023 is already cited in our work. The paper proposes generating a classical planning model (PDDL) under particular assumptions. The direction is complementary to our work and is probably the more efficient method, when applicable, as it allows using existing planners. Unfortunately, not all planning problems are easily captured by a classical planning model, and in such cases our method can still help. One example of such a case is the 24 Game we experiment with in our work.
    • Liu et al., arXiv 2023 assumes that the PDDL domain (the major part of the PDDL model) already exists and proposes a way to use LLMs to produce PDDL problem instances (objects, initial state, goal). We do not make such an assumption; rather, we ask the LLM to convert a natural language domain description into Python code.
    • While our focus is on using LLMs to generate code for the successor and goal functions of search problems, Zhou et al., ICLR 2024 focuses on math reasoning problems. They illustrate that LLMs can be used to generate, execute (verify), and self-refine Python code for math problems. Their study corroborates our finding that LLMs can indeed be used to generate verifiable code with some feedback.
Comment

Thanks for addressing my concerns and pointing out the connection between the literature I mentioned.

I am still concerned that the novelty of this work is quite incremental. Although I agree with the authors that not all problems can be expressed in PDDL, I do not think this poses a significant enough challenge to the problem studied in this paper: translating a human language description into a specific code language. Given that existing planners also use search in their underlying architectures, the work seems to merely challenge the LLM to generate a more popular code language, Python, instead of PDDL. In my opinion, the current paper has not adequately addressed the strong connection with this NeurIPS paper. Instead of hiding this reference behind a brief citation, I would like to see a full discussion of how the current paper differs sufficiently from the NeurIPS paper.

Furthermore, I personally believe that the gap between math reasoning problems and planning problems is larger than the gap from PDDL to Python. If the authors believe Zhou et al., ICLR 2024 can be used as a cross-reference to support that LLMs can generate verifiable code with automatic feedback instead of the human feedback used in the current paper, then I believe the contribution to be even smaller. Otherwise, the need for human feedback in the loop still seems to be a big challenge for the proposed algorithm.

Comment

Comparison with Zhou et al.

We share your belief that the gap between math reasoning problems and planning problems is large. Therefore, their approach requires an independent interaction with the LLM for every math problem. A search problem, however, should not require independent interactions with the LLM for each instance; that is the main premise of our work. A single interaction with the LLM to generate the successor and goal functions allows solving all problems in that domain. In our previous comment we only meant to highlight that the success of Zhou et al.'s work indicates that LLMs can generate code and refine it with feedback. In our work we leverage this code generation and refinement ability of LLMs, but the similarity ends there. We highlight some of the differences between our work and Zhou et al. below:

  1. They propose generation of code as a means to solve a given math problem and generate the final answer. In our work, the final answer is the code itself.
  2. Their approach is not iterative or incremental in nature: they propose generating a predefined collection of validators for a particular instance, regardless of the performance of previously generated validators. Ours iteratively fixes issues with the previously generated code.

On the need for human feedback

The need for human feedback is shared by all approaches investigated in our work. In our case, human feedback is needed only when producing the solver and is not required once a sound solver is produced, as the solutions are then guaranteed to be correct. In the case of the previous approaches, human feedback is needed to validate each and every one of the produced solutions, an almost impossible task. We are not aware of any acknowledgement of this limitation in the existing literature. We hope that the challenge of alleviating the need for human feedback can be adequately addressed in future work.

On novelty

Finally, we would like to emphasize that a large portion of our work focuses on filling the gap in the current literature on planning with LLMs with regard to the computational complexity and properties of the proposed algorithms. This investigation is essential for understanding these approaches and building on them, so we believe the investigation itself satisfies the novel-contribution requirement. Additionally, none of the existing approaches use language models to generate code for search components, so that contribution is also novel.

Comment

Thank you for your clarification. This detailed comparison with Guan et al. is greatly helpful for judging the novelty of this paper.

However, as the soundness of the proposed method comes directly from generating a code formulation of the problem instead of its solutions, a more thorough literature review on the effect of code interpreters on reasoning is needed, which should include, but not be limited to, Zhou et al. Besides, to make the paper novel and impactful, the need for human feedback should also be reduced, given that work like self-debug [1], which has been cited in the rebuttal to other reviewers, has been around in code generation for a long time. The difference between this work and existing work, while not as simple as merely changing the coding language from PDDL to Python, is still relatively small and limited.

Therefore, I am keeping my score at the moment.

Comment

We are happy to see that you acknowledge that the papers you mentioned do not reduce the novelty of our approach. We would like to again highlight the "On novelty" section in our previous response: our contribution in this paper is much more than the method we propose. We will update the paper with related work on code generation. If you have any concrete work in mind that reduces the novelty of our contributions, we would gladly address your concerns.

We would like to reiterate that reducing the need for human feedback should be a focus of a separate investigation.

Comment

Comparison with Guan et al.

Guan et al., NeurIPS 2023 propose a method that consists of three parts: domain construction, domain refinement by a human, and planning with the refined domain. To construct the domain, an LLM is queried for action preconditions and effects on an action-by-action basis, providing it with a natural language description of the action and a (possibly incomplete) set of predicates. The LLM can provide not only action parameters, preconditions, and effects, but can also propose missing predicates. Few-shot examples from a BlocksWorld domain are given in the prompt. The process is repeated once, with the full list of predicates from the first iteration. Additionally, PDDL problem instances are created using the predicates from the domain generation process.
The feedback is provided in two forms:

  1. A symbolic validator, VAL, is run on the domain and problem files, with the result translated into natural language feedback. This feedback mostly concerns the syntax of the generated PDDL model.
  2. The generated PDDL domain is translated into natural language and presented to a human expert. The expert provides explicit feedback on missing and extra preconditions and effects of each action.

Upon correspondence with the authors of Guan et al., we validated the following conceptual differences from our work:

  1. They are interested in actions only, assuming the initial state and goal are given. Our work does not make this assumption; rather, it asks the LLM to generate a goal function.
  2. They feed the language model targeted pieces of information (single action descriptions one by one, predicates to use). We provide the entire description of the problem and ask for a single successor function.
  3. Their feedback on the generated PDDL is explicit and requires a nuanced understanding of modeling in PDDL (e.g., 'unnecessary precondition "no other object stacked on object ?x"'), while ours is mostly generic Python code feedback (e.g., issues with a shallow copy of a dictionary with list values, or an error trace; see the sketch below) and generic logic (e.g., 'two of the operations are not symmetric, division and subtraction').
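For readers unfamiliar with the shallow-copy pitfall mentioned above, a minimal hypothetical example (not taken from the paper's transcripts) of the kind of bug such generic feedback targets:

```python
import copy

state = {"table": ["a", "b"], "stack": ["c"]}

shallow = dict(state)           # copies the dict, but the inner lists are shared
shallow["table"].append("d")    # silently mutates the parent state as well
assert state["table"] == ["a", "b", "d"]

deep = copy.deepcopy(state)     # independent copy: successors built from it
deep["table"].append("e")       # no longer corrupt the parent state
assert state["table"] == ["a", "b", "d"]
```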

We agree that this comparison is interesting and will add the discussion to the paper.

PDDL vs Python code generation

There is a conceptual difference between representing planning problems in PDDL and coding a search problem (successor function and goal test). To overcome the limitations of PDDL, modelers often resort to tricks such as adding predicates that encode the negation of modeled predicates, or predicates that explicitly encode information derivable from other predicates, for example (hand empty) in addition to (holding ?b) in BlocksWorld, because such derivations cannot be performed in the preconditions and effects of classical planning and would require axioms, which are rarely supported by existing planners. As a result, human validation of such PDDL models is often harder and requires more skill than validating a code-based successor function and goal test.
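As an editorial illustration of this point (a sketch, not the code generated in the paper): in Python, a fact such as hand-emptiness can be derived on the fly from the state, whereas a classical PDDL model typically maintains it as an explicit predicate.

```python
def hand_empty(state):
    # Derived on the fly from the state; no explicit (hand empty) predicate needed.
    return state["holding"] is None

def successors(state):
    """Partial BlocksWorld sketch (pick-up-from-table action only, for brevity).
    state = {"holding": block or None, "clear": frozenset, "on_table": frozenset}."""
    if not hand_empty(state):
        return
    for block in state["clear"] & state["on_table"]:
        yield {
            "holding": block,
            "clear": state["clear"] - {block},
            "on_table": state["on_table"] - {block},
        }
```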

Review
Rating: 7

The authors propose a position paper arguing that current work on LLMs for planning wastes significant compute, on top of having poor algorithmic and empirical performance. The authors also propose ideas for using LLMs more efficiently and effectively, namely using them to produce the components of a search algorithm instead of calling them directly during search. More specifically, the proposed method uses the LLM to generate the successor generator and goal state checker, alongside user feedback.

Strengths

The paper has several strengths as a position paper. It provides a simple yet original idea, namely that researchers should strive for responsible usage of LLMs for planning in terms of compute efficiency, and it argues that this would actually improve the performance of LLMs for planning as well.

The idea is complemented by an extensive survey of LLMs for planning methods with a summary of worst case complexities, and whether the algorithms are sound or complete. Furthermore, it is complemented by a novel methodology for using LLMs for planning which adheres to the authors' proposition: more efficient and effective LLMs for planning research.

The experiments are extensive with a wide variety of planning benchmarks, implementation details, and results. The results are quite positive as summarised in Table 1, with minimal calls to LLMs in comparison to existing approaches. The authors are also transparent about the limitations of their approach in its current state, being that it requires user interaction. Nevertheless, this does not undermine the proposition of the position paper.

Weaknesses

With regards to the idea of the paper, there are no major weaknesses. Nevertheless, the paper could benefit from some additional details regarding experiments, and improved clarity in certain areas.

  • It may not be clear to some readers how %States can exceed 100%. From its definition in lines 344-345, it is not clear whether visiting the same state twice counts twice toward the percentage, but from the next two sentences it does seem to be the case.
  • Minor formatting issue: the benchmark domains are not consistently capitalised, e.g. 24 game and crossword in line 357, whereas elsewhere they are capitalised, such as in Table 1 and on page 8.
  • Although the focus of the paper is LLMs for planning, the paper misses more general related work on learning for planning/generalised planning. Such methods are orders of magnitude more efficient than LLMs at evaluating learned heuristics (ToT or GoT in LLM terminology) or policies, and they solve problems orders of magnitude larger than those tackled in LLM research. Example works include learned heuristics [1] or generalised policies [2], as well as foundation models for planning [3].

[1] Simon Ståhlberg, Blai Bonet, Hector Geffner: Learning General Optimal Policies with Graph Neural Networks: Expressive Power, Transparency, and Limits. ICAPS 2022: 629-637

[2] Dominik Drexler, Jendrik Seipp, Hector Geffner: Expressing and Exploiting Subgoal Structure in Classical Planning Using Sketches. J. Artif. Intell. Res. 80 (2024)

[3] Dillon Ze Chen, Sylvie Thiébaux, Felipe W. Trevizan: Learning Domain-Independent Heuristics for Grounded and Lifted Planning. AAAI 2024: 20078-20086

Questions

No questions or clarifications that could change my opinion.

Limitations

The authors adequately address the limitations of their work, and also the checklist with appropriate justification.

Author Response

We thank the reviewer for their feedback and support. We hope that the reviewer could advocate for the paper.

Your understanding regarding %States is correct; we will clarify the text in the final version.

Comment

I thank the authors for their response and clarification.

I also acknowledge that I have read the strengths and weaknesses pointed out in other reviews as well as the corresponding rebuttals but still stand with my rating.

More specifically, I am still convinced by the message concerning the efficient usage of LLMs that the paper proposes with a focus on soundness.

Comment

We greatly appreciate your support. It does look like we will need it.

Review
Rating: 6

Existing LLM-based planning approaches usually involve search with multiple passes of the model, which leads to significant inefficiency and cost while failing to guarantee the correctness of the generated plans. Motivated by this issue, this paper first analyzes the soundness, completeness, and complexity of a series of existing planning methods, arguing that they do not meet the standard for these properties. A new algorithm, Thought of Search (ToS), is proposed to alleviate the heavy computing demand of search-based planning. It queries LLMs to generate code implementations of the successor function and goal test. The proposed method achieves a 100% success rate on four representative search problems, while the total time spent (on CPU) is shorter than or comparable to a single LLM evaluation. Further discussion describes how, compared with other approaches, ToS is not only sound and complete but also more cost-effective and able to explore a much larger portion of the state space with only O(1) LLM calls.
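For concreteness, a minimal sketch of the search phase described above, assuming the LLM-generated successors(state) and is_goal(state) functions have already been obtained (the names and state representation are illustrative, not taken from the paper):

```python
from collections import deque

def bfs(initial_state, successors, is_goal):
    """Blind breadth-first search over states reachable from initial_state.

    successors and is_goal are the two LLM-generated components; the search
    itself runs on the CPU with no further calls to the language model.
    States are assumed to be hashable (e.g., tuples).
    """
    frontier = deque([(initial_state, [])])
    visited = {initial_state}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path + [state]          # sequence of states from initial to goal
        for nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, path + [state]))
    return None                            # instance is unsolvable
```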

Strengths

  1. Impactful motivation. Recently, there have been increasing efforts to improve the planning ability of LLMs. However, since this line of research is still at its early stage, the efficiency of the proposed methods is easy to overlook, especially when the system contains multiple agents. Therefore, I strongly agree with the motivation of this paper (i.e., the need for sound and complete LLM-based approaches that uphold efficiency) and believe in its importance for future research.
  2. Comprehensive analysis. The paper systematically and comprehensively studies the properties of twelve related works that are commonly used or recently proposed, providing convincing support for the authors’ claim and, more importantly, valuable information to the community.
  3. Solid results. Across all four evaluated tasks, ToS consistently generates only valid final solutions and reaches 100% accuracy in a relatively short time. Notably, the search is run on a CPU. These results are sufficient to demonstrate the effectiveness of the proposed method.

Weaknesses

  1. Current work on code generation with LLMs shows that the correctness of the generated code is not guaranteed. In this paper, the implementations obtained in the experiments are also not always valid on the first attempt and require human feedback. Thus, I am slightly concerned about whether ToS can generalize to more difficult tasks while maintaining its nice properties. Nonetheless, this is likely to be alleviated by combining it with some automated optimization techniques, as discussed by the authors. Therefore, I am still relatively optimistic about this approach at this point. Further experiments (if any) to address this concern are welcome.

  2. While the paper's contents are generally well organized, which I appreciate, there are quite a few typos and wrong words/phrases. A non-exhaustive list is given below. I highly recommend a careful and thorough check of the whole paper to fix all the mistakes.

  • Line 22: ‘The purpose of our work is precisely that.’
  • Line 99: ‘Reflection’
  • Line 323: ‘a lower than reported by the approaches accuracy’

Questions

See weaknesses

Limitations

The limitations have been discussed.

Author Response

We thank the reviewer for their feedback and hope that we can somewhat alleviate their concerns and get their strong support for this work.

Existing literature on code generation, e.g., [1, 2], as well as the literature on generalized planning with LLMs, e.g., [3], shows evidence that automated feedback improves LLM performance. Our preliminary investigation of feedback automation for ToS supports this as well.

We will proofread the paper and fix the typos - thank you for pointing these out!

[1] Chen, Xinyun, et al. "Teaching large language models to self-debug." ICLR 2024.

[2] Zhang, Kechi, et al. "Self-edit: Fault-aware code editor for code generation." ACL 2023.

[3] Silver, Tom, et al. "Generalized planning in PDDL domains with pretrained large language models." AAAI 2024.

Comment

We hope that our responses have strengthened your support for the paper. We would greatly appreciate if that could be reflected in your final score.

Comment

I've read the authors' responses and am satisfied with them. I've read other reviewers' comments (especially uMM4 and 4XWb), and think their arguments are also quite valid. Based on all these, I choose to keep my rating unchanged. But I want to emphasize that my confidence score is not high (only 3).

Review
Rating: 3

The authors propose Thought of Search, a think-before-searching strategy for solving automated planning problems using LLMs. They use GPT-4 to generate Python code for the successor state generator and goal test functions, which are the crucial parts of any search. They argue that this method is sound and complete and requires the fewest calls to the LLM to successfully solve the problems when compared to the relevant literature.

Strengths

  1. Analysis of complexities for the existing methodologies of using LLMs for planning.
  2. Innovative use of the LLM to reduce the number of calls made to solve the problems.

Weaknesses

However, there are several concerns regarding the proposed work:

  1. If the Large Language Model (LLM) is being used to generate successor and goal test functions, which are already intrinsic components of automated planners, it is unclear what improvement is being made. The inherent memory and time complexities associated with solving these planning problems are not addressed or mitigated by the proposed approach. It appears that the LLM is simply re-writing (a few components of) a more basic planner, which does not seem sufficiently innovative for a NeurIPS-level conference.
  2. The process is not fully automated, as it requires human intervention to re-prompt the LLM until the correct code is generated.
  3. The methodology would have been more compelling if it had included an exploration of generating useful heuristic functions.

Questions

  1. Why did the authors use only GPT-4 for their experimentation? No reasons are presented in the paper. Presenting a comparison of the performance with different LLMs would have been interesting to see.
  2. What is the need to use LLMs to generate successor and goal test functions and solve the planning problems with a naive blind search? Is the objective to evaluate GPT-4's capability to generate Python code, or to assess its reasoning ability in solving planning problems?
  3. Comparing the number of calls to the LLM with other approaches may not be appropriate. Your methodology necessitates generating the search components once for each domain, whereas other approaches involve the LLM in solving every problem within the domain. A more relevant comparison would be the average time required to solve these problems and the success rate across different methodologies.

Limitations

  1. Showcasing results with only GPT-4 model.
  2. Incomplete comparisons with the other existing approaches as mentioned above.
Author Response

We thank the reviewer for their feedback.

Answers to the Questions:

  1. Our purpose was to show the feasibility of the approach rather than to compare which models are better at generating successor functions/goal tests for the four search problems. For that purpose one language model is sufficient. Further, all papers we compare to (described in Section 2) use GPT (3, 3.5, or 4) in their experiments; therefore, by using a GPT model we could compare to their results without redoing their experiments, which we show in this paper to be so unnecessarily wasteful.

  2. The objective of this work is neither to evaluate GPT-4's capability to generate Python code, nor to assess its reasoning ability in solving planning problems. Our primary objective is to fill the gap in the current literature on planning with LLMs and point out the inefficiency and pitfalls of the current trends. Please see the response to all reviewers for the discussion on this.
    In order to solve a search/planning problem, one needs to be able to capture the dynamics of that problem. Of course, for planning problems that are already expressed in a formal language, such as PDDL, one can simply use an existing PDDL planner, which internally performs a search, defining the successor and goal functions based on the PDDL. If the problem does not yet have a PDDL model but can be easily captured in PDDL, one may prefer to do so (with or without the help of LLMs), as done in, e.g., the cited works (Guan et al., NeurIPS 2023; Oswald et al., ICAPS 2024). Many search problems, however, are not easy to capture in PDDL, and therefore an alternative approach is needed. This is the case for many of the planning domains mentioned in the recent literature, and we use some of them in our work. Probably the best example is the 24 Game, which has numeric features not easily captured by classical planning (a sketch of such a numeric successor function is given at the end of this response). Please see the discussion in Section 3.

  3. The number of calls to the language model is precisely how we measure complexity in this work. Additionally, we provide the overall time and accuracy comparison in the paper (more on this below).

  • One of the major advantages of our approach is that we only need to call the LLM a constant number of times per domain, regardless of the number of problems in the domain. However, even if you want to take away this advantage and say that each domain has only a small number of problems, our approach needs fewer LLM calls per domain than the other approaches need per problem in the domain. Further, after the successor function and goal test are obtained, solving all the problems in a domain by search on a single core of a personal computer's CPU typically takes about as much time as a single call to the LLM.

  • Both the total search time (the average is easy to derive since the number of instances is provided) and the accuracy/success rate results are provided in the paper. The success rate of our approach is 100% in all the tested domains. The success rates reported in the literature for the 24 Game and for PrOntoQA are presented in lines 198 and 268, respectively. The success rate of ToT on Mini Crosswords is 20%, and the success rate of RAP on BlocksWorld is 100%, 88%, and 42% for problems with solution lengths of 2, 4, and 6, respectively. We forgot to mention this in the paper and will add it. The total search time results for our approach are provided in the text, lines 210, 226, 247, 278. To exemplify, solving all 1362 instances of the 24 Game takes 2 seconds with the “fastest” successor generator and 7 seconds with the “slowest” among the 5 times we conducted the experiment. As mentioned before, our accuracy is 100%: we solve all 1361 solvable games and report unsolvable for the unsolvable one. In comparison, the ToT approach (Yao et al., NeurIPS 2023) restricts its experiments to 100 out of the 1362 tasks and performs ~100 calls to an LLM per task. Assuming the same average number of calls and 7 seconds per call, the ToT approach would take around 10^6 seconds, or 11 days, on all 1362 tasks, achieving the reported success rate of 75%. Even if LLMs become significantly more efficient and the time of a single call drops to 1 second, it would still take ToT more than 1.5 days.

Please also see the response to Reviewer PhwG regarding automating the feedback.
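For illustration, a minimal sketch of a numeric successor function and goal test for the 24 Game, as referenced in answer 2 above (an editorial sketch under the standard 24 Game rules, not the exact code produced by GPT-4 in the paper):

```python
from itertools import combinations

def successors(state):
    """24 Game: replace two numbers with the result of one arithmetic operation,
    so an n-number state yields (n-1)-number successor states."""
    results = set()
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [n for k, n in enumerate(state) if k not in (i, j)]
        candidates = {a + b, a * b, a - b, b - a}
        if b != 0:
            candidates.add(a / b)
        if a != 0:
            candidates.add(b / a)
        for value in candidates:
            results.add(tuple(sorted(rest + [value])))
    return results

def is_goal(state):
    return len(state) == 1 and abs(state[0] - 24) < 1e-6

# e.g. successors((3, 3, 8, 8)) contains (3, 8, 11) but never the one-step state (24,)
```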

Comment

Thank you for addressing my comments.

I am still not convinced that the work in its current state is novel enough for a NeurIPS submission. The authors see their methodology as applicable in cases where devising PDDL is harder. However, generating feasible plans for realistic or near-realistic domains with the proposed methodology is still blind search and would require a lot of computation and resources.

I read through the rebuttal responses to the other reviewers. On automation - "Our preliminary investigation of feedback automation for ToS supports this as well." - do the authors provide this preliminary investigation in their paper? If so, can you please point out where?

Comment

Our preliminary investigation of feedback automation is not part of this work.

It is not clear to us whether the reviewer's intention is that blind search is not as effective as the previous approaches (ToT, RAP, etc.) or that blind search is not as effective as heuristic search on large realistic domains. Could you please clarify?

Comment

I meant that a blind search is not as effective as a heuristic search on large, realistic domains. Your goal to point out inefficiencies in current trends is clear, but exploring ways to generate useful heuristic functions could have made your work more innovative and impactful.

Comment

Indeed, blind search may often be less effective than heuristic search on large realistic domains. Deriving heuristic estimates for planning tasks described in a natural language is a very interesting area. It deserves a focused systematic investigation, which is now made possible by our work.

We hope that we were able to address the reviewer's concerns.

Author Response

The main objective of our work is to fill the gap in the current literature on planning with LLMs with regard to the computational complexity and properties of the proposed algorithms. Our main contribution is precisely this investigation. We show that the current literature proposes inefficient methods for producing unsound results. We present complexity analysis of the algorithms proposed and establish the lack of soundness and completeness of these algorithms. We not only show just how inefficient the results are, but also show that there is another way by proposing a simple alternative.

We would like to highlight the importance of soundness, which is mostly overlooked by the existing body of literature. With soundness, the solutions produced by search are guaranteed to be correct and do not require external validation. Without soundness, the produced solutions have a large potential to be invalid, and an automated validator would be needed. It is not clear from the literature on the algorithms for planning with LLMs studied in this work whether such validators were created and used to verify the claimed success rates. To clarify, let us use the 24 Game as an example: assume the initial state is [3 3 8 8] (one of the instances in the existing benchmark). An LLM could produce [24] as a successor during search. If you do not validate each transition from [3 3 8 8] to [24], you would not be able to validate the answer. Note that there is no way to reach [24] from [3 3 8 8] in a single transition. Producing a goal state does not mean solving the problem; a sound validator is required.
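For concreteness, a minimal sketch of the transition-level validation described above, assuming some successors(state) and is_goal(state) functions as in the rest of this discussion (the names are illustrative):

```python
def is_valid_plan(initial_state, state_sequence, successors, is_goal):
    """Replay a claimed solution, checking every transition against the
    successor function and the final state against the goal test."""
    current = initial_state
    for nxt in state_sequence:
        if nxt not in successors(current):
            # e.g. (24,) is not a successor of (3, 3, 8, 8): rejected here
            return False
        current = nxt
    return is_goal(current)
```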

We feel that it is crucial to expose the scientific community to these results, giving our position a stage at this point in time, in an attempt to reduce the amount of work that continues the same trend in the recent literature.

Comment

We would like to emphasize the need to get our message out there. People are mostly unaware of just how incredibly inefficient the approaches we investigate in this work are. To take another perspective, each call to GPT-4o consumes roughly 0.2 kWh. Solving all 1362 tasks of the 24 Game with ToT (~100 calls to GPT-4 per task) would consume around 27 MWh, more than the average U.S. household uses in 2.5 years.
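The back-of-the-envelope arithmetic behind this figure (an editorial sketch; the 0.2 kWh per call comes from the comment above, and the annual household consumption is an assumed rough figure):

```python
tasks = 1362              # 24 Game instances
calls_per_task = 100      # approximate number of ToT calls per instance
kwh_per_call = 0.2        # per-call energy estimate quoted above

total_kwh = tasks * calls_per_task * kwh_per_call
print(total_kwh)                 # 27240.0 kWh, i.e. roughly 27 MWh
print(total_kwh / 10_600)        # ~2.6, assuming ~10,600 kWh per U.S. household per year
```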

Final Decision

I first present a full review, followed by my meta-review.

Review

The paper reviews the literature on solving planning problems with LLMs, emphasizing cost in terms of the number of LLM calls. The paper remarks on the fragility of LLMs for planning, including GPT-4, and focuses instead on code generation for the successor function and goal test function. In principle, the method is applicable in contexts where both functions can be obtained by code generation. It mentions how sample plans can be used to verify the planning model. While manual intervention is used to produce the code, other lines of work might be relevant. The idea is quite simple, but it fills a gap in the literature.

If there is no data to evaluate the generated code, trust falls back on the errors of other LLM-based approaches. On the other hand, perfect testing might require an unbounded amount of data or a simulator that could be used from a terminal.

Soundness: 3 (good) · Presentation: 3 (good) · Contribution: 1 (poor) · Rating: 7 (accept) · Confidence: 5

Meta-review

The reviewers agree on the value of the cost analysis for other prompting algorithms. The diversity of domains seems sufficient, as the domains are different from each other. Regarding some of the weaknesses reported by reviewers:

  • uMM4
    • “The generator function and goal detection are part of the planning algorithm.” They are used by search or planning algorithms, so they are not the planner itself.
    • “The process is not fully automated.”
  • PhwG
    • The correctness of the code is not guaranteed, as also happens with popular approaches like CoT.

I’m recommending the acceptance of the paper.