Why Solving Multi-agent Path Finding with Large Language Models has not Succeeded Yet
In this paper, we present our position on why even the latest large language models have not yet succeeded in multi-agent path finding, and what researchers could do to address the underlying challenges.
Abstract
Reviews and Discussion
This is a paper that attempts to show that LLMs are not yet a viable option for solving multi-agent path finding (MAPF). The authors describe a method for prompting an LLM for a solution, checking the output for collisions, and iteratively re-prompting the LLM until a solution is found or until a max number of iterations is reached. They show that LLMs can solve MAPF problems when the planning problem is simple (such as single agent problems in an empty room with no obstacles) and that the capabilities of LLMs as MAPF solvers break down as the problem becomes more complex (multiple agents and more complex obstacle scenarios). They then discuss three possible causes of the poor performance (LLM’s reasoning capabilities, context length limits, and map “understanding”).
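The workflow summarized above (prompt the LLM, check for collisions, re-prompt until success or an iteration limit) can be sketched as a simple loop. This is an illustrative reconstruction, not the authors' actual code; `query_llm`, `parse_actions`, and `has_conflict` are hypothetical stand-ins for the paper's prompting, parsing, and collision-checking components.

```python
def solve_mapf(problem, query_llm, parse_actions, has_conflict, max_iters=10):
    """Illustrative prompt-check-reprompt loop for LLM-based MAPF.

    `query_llm`, `parse_actions`, and `has_conflict` are hypothetical
    stand-ins for the prompting, parsing, and collision-checker steps.
    """
    prompt = f"Solve this MAPF instance: {problem}"
    for _ in range(max_iters):
        reply = query_llm(prompt)
        actions = parse_actions(reply)
        conflict = has_conflict(problem, actions)
        if conflict is None:  # collision-free: accept the solution
            return actions
        # Otherwise, re-prompt with feedback about the detected conflict.
        prompt = f"{prompt}\nYour last answer had a conflict: {conflict}. Try again."
    return None  # give up after max_iters attempts
```

A stub `query_llm` that returns a corrected answer on the second attempt would exit this loop after two iterations, which mirrors the iterative re-prompting behavior the review describes.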
Strengths
LLMs are indeed experiencing a surge of popularity in a huge spectrum of applications. Any work that contributes to understanding their limitations is important.
Weaknesses
- While it’s true that their method didn’t work well, the authors do not present enough evidence to justify saying “LLMs do not yet work for MAPF” (which is the main claim of the paper).
- Experimental results are given in Tables 1-4 without sufficient explanation of the methods. Would recommend a more concise problem statement in the paper (not appendix) that gives specifics of the scenarios, prompts, LLM outputs, and the collision checker.
- The justification for the SBS method that the authors describe in the paragraph starting on line 207 is not clear. Other than keeping the context length shorter, it is not apparent why this is a good approach.
- Many claims are made with little to substantiate them:
  a. Line 041: previous work "barely covers multi-agent planning" (Chen 2023b and Agashe 2023).
  b. Lines 047-049: Disagree on the list of previous methods. It should also include LLMs (Chen et al. 2023b).
  c. Broad claims about LLMs are made in the section on Understanding Obstacle Locations that do not seem supported by the single example presented.
  d. The authors claim (line 428) that the current workflow does not include any tool use, but they use an external collision checker (with little or no description of the checker).
  e. The claim made on line 363, "people barely provide any such information online since people have common knowledge of what to do with a map," seems unfounded and there is nothing to support it.
- There are significant grammar issues throughout. Some examples:
  a. (Line 323) "However, recent studies have shown that long models like GPT4-turbo-128K are not a model whose capacity in 8K length also works when given a 128K-tokens input."
  b. (Line 371) "which is killed by using much more total number of steps than it should"
- In lines 102-103 the authors say “we hope LLMs can be an alternative model to the current MAPF RL based models without additional training”. However, they have previously stated that in their formulation, this didn’t work. At this point, it’s better to say what you observed than what you had hoped for.
- The authors state on line 201 “It is unclear how well LLMs can solve MAPF problems,” but the main claim of the paper is that they aren’t good at it.
- In Figure 4, the goal of Agent 1 is (3,1) and the goal of Agent 2 is (2,0), however, Agent 1 ends up at (0,1) and Agent 2 ends up at (0,3). These are not at the goals, and in fact, Agent 2 has moved further away from the goal, so it’s not clear why this is a “validated solution”, except for the fact that they did not collide.
- It seems intuitive and not scientifically interesting that success rate drops as the size of the problem grows (Lines 424-426).
- The formatting of references seems non-standard.
Questions
- Would comparing to a wider range of baseline methods (other than just 0S and SBS) help substantiate the claim that LLMs are not good for MAPF problems?
- Under Methods: “Following common practices of LLMs” what exactly are these?
- The authors mention giving the LLM “stepwise local information” what exactly does this mean? (I assume it is related to the context window size issue?)
- The authors only give the LLM information to solve the “next step” and keep re-prompting until the LLM provides a solution with no conflicts. How does this affect the global optimality of the solution?
- Is there any additional reasoning/evidence for why the SBS method is advantageous (other than the context window length)? Are there any downsides to the SBS method?
- What is meant by "validated solution" in Figure 4? The agents do not reach the goals specified in the figure.
Thank you for your reviews. Here we answer your questions one by one:
- Our study breaks down the reasons for failure into three aspects: context length, understanding obstacles, and general reasoning capability related to pathfinding. All three are known LLM challenges, and all of them are known to be insensitive to prompt wording and cannot be solved with prompt engineering alone. Thus, we believe that introducing other baselines would leave our conclusion unchanged, so there is little point in including them.
- Thanks for pointing this out. These papers work on LLMs for planning problems [1, 2, 3]. We will add them to our paper.
- The stepwise local information differs between designs. Given the page limit, we can only include the prompts and examples in our appendix, as you noted in the weaknesses. However, we still invite you to look at Fig. 8, which gives an example of this.
- In the current paper, we have primarily focused on the success rate because it is still too low, and the only successful scenarios are those that do not require many detours. This means that, even with single-step information, the solution quality (or global optimality, as you mentioned) is not significantly affected. Since the current successful scenarios stop at an agent count of 8 on maps like "room," and the number of instances tested is limited to 5 per setting due to the high cost, metrics other than success rate that contribute to measuring solution quality exhibit a high standard deviation. Furthermore, as the LLM solver is currently unable to find a solution in many cases, we believe it is more critical at this stage to focus on achieving any solution rather than prioritizing a good solution while exploring the potential of LLMs as alternative solvers for MAPF.
- While we do not have clear evidence and it is difficult to justify, there is a small possibility that breaking down the process could also help the LLM reason more thoroughly, similar to the Chain-of-Thought prompting [4], which enables LLMs to think step-by-step in solving difficult math problems. However, it is not possible to decouple this effect from the context window length issue in the MAPF problem, so we cannot verify this hypothesis. Thus, we can only conclude that SBS is better than OS when the success rate is the sole consideration. However, SBS could potentially have a downside in terms of solution quality—for instance, requiring a greater number of total steps—compared to OS in scenarios where both could generate solutions. Nonetheless, since OS fails to provide a solution in those cases, this cannot be verified.
- Figure 4 shows the unmodified output from the LLM. In this context, "validated" means that no collision occurs at the current step.
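For context, "no collision at the current step" can be checked with the standard MAPF vertex- and edge-conflict rules. The sketch below is a generic illustration of such a single-step checker, not the paper's actual implementation; the function name and data layout are assumptions.

```python
def step_conflict(current, proposed):
    """Return a description of the first conflict in a single MAPF step,
    or None if the step is collision-free.

    `current` and `proposed` map agent ids to (row, col) positions
    before and after the step, respectively.
    """
    agents = sorted(proposed)
    for i, a in enumerate(agents):
        for b in agents[i + 1:]:
            # Vertex conflict: two agents end up in the same cell.
            if proposed[a] == proposed[b]:
                return f"vertex conflict: {a} and {b} at {proposed[a]}"
            # Edge (swap) conflict: two agents exchange cells.
            if proposed[a] == current[b] and proposed[b] == current[a]:
                return f"edge conflict: {a} and {b} swap"
    return None
```

Note that an agent moving into a cell just vacated by another agent (a "following" move) passes this check, which is the standard convention in the MAPF benchmark literature.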
[1] Kambhampati, S., Valmeekam, K., Guan, L., et al. Position: LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks. In Forty-first International Conference on Machine Learning (ICML), 2024.
[2] Kalyanpur, A., Saravanakumar, K. K., Barres, V., et al. LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic. arXiv preprint arXiv:2406.17663, 2024.
[3] Chen, Y., Arkin, J., Zhang, Y., Roy, N., & Fan, C. Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 4311-4317. IEEE, 2024.
[4] Wei, J., Wang, X., Schuurmans, D., et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.
Thanks for your responses to my questions. After reconsidering the paper including your responses, my overall assessment remains unchanged.
While I appreciate the answers to my questions, some of the points made are not clear to me (e.g. giving step-wise information to the LLM is similar to Chain of Thought prompting), and several of the weaknesses I identified have not been fully addressed (e.g. weaknesses 3,4).
I believe the weaknesses I identified would require more substantial revisions/improvements than those provided to meet the standards expected for publication in ICLR.
Dear Reviewer,
Thank you for your further response.
To address your concerns outlined in Weaknesses 4 and 5, we have revised our paper to ensure greater accuracy in our statements. We invite you to review the updated version, and we are happy to address any remaining concerns you may have. Specifically for the points you raised in the introduction, we revised the wording in the relevant paragraph to maintain the narrative without diminishing the contributions of existing works. Since the primary focus of (Chen et al., 2023b) and (Agashe et al., 2023b) is on multi-agent task planning rather than actual path planning, we believe that referencing these works is not necessary in the updated text, as we discuss our differences with them later in the related work section.
Regarding Weakness 3, we believe this is closely related to your comment that "giving step-wise information to the LLM is similar to Chain of Thought (CoT) prompting." First, we would like to clarify the two pairs of ablation studies presented in our paper, which may have caused some confusion. The first pair compares global observations (GO) versus single-step observations (SSO). This refers to whether local obstacle information—such as whether agent 1 can move left at a given moment—is provided at each step. The second pair compares one-shot (OS) generation versus step-by-step (SBS) generation. This refers to whether the LLM provides the entire path in a single response or generates actions for each agent iteratively, step by step. We believe that your original Weakness 3 and Question 5 primarily relate to the second pair of comparisons (OS vs. SBS). However, in your most recent response, it appears that you are discussing the connection between single-step observations (SSO) from the first pair of comparisons and CoT prompting. We agree that this connection is limited. To clarify again, the step-by-step generation approach allows LLMs to produce additional intermediate outputs, enabling them to reason effectively about action choices during the process. This aligns closely with the purpose of CoT prompting and its theoretical advantages, as outlined in [1].
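The OS-versus-SBS distinction described above can be made concrete with a short sketch. This is an illustrative reconstruction under assumed interfaces (`query_llm`, `at_goal`, `apply_step` are hypothetical stand-ins), not the authors' implementation.

```python
def one_shot(query_llm, problem):
    """OS: ask the LLM for the entire multi-step plan in a single response."""
    return query_llm(f"Plan all steps for: {problem}")

def step_by_step(query_llm, problem, at_goal, apply_step, max_steps=100):
    """SBS: ask for one joint action per round, feeding back the new state.

    The intermediate exchanges give the LLM room to reason about each
    action choice, which is the connection to CoT-style prompting.
    """
    state, plan = problem, []
    for _ in range(max_steps):
        if at_goal(state):
            return plan
        step = query_llm(f"Current state: {state}. Give the next joint action.")
        plan.append(step)
        state = apply_step(state, step)
    return plan
```

In the SBS variant, each round's prompt only needs the current state rather than the full plan so far, which is also why it keeps the per-call context shorter.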
Thank you again for your constructive suggestions on improving the clarity of our writing. We hope our response has addressed your concerns. Please let us know if you have any further questions or feedback.
[1] Li, Z., Liu, H., Zhou, D., et al. Chain of Thought Empowers Transformers to Solve Inherently Serial Problems. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
This paper addresses the challenge of multi-agent pathfinding using large language models (LLMs). The authors demonstrate that while LLMs have proven effective for single-agent planning and, to a certain extent, for multi-agent pathfinding in relatively simple environments, current LLM capabilities are inadequate for planning in more complex settings.
Strengths
- This paper effectively highlights the current limitations of LLMs in multi-agent pathfinding (MAPF) domains through experiments conducted across environments of varying complexity.
- The authors provide an analysis of potential reasons behind LLMs’ challenges in effective planning.
- This work could serve as a valuable reference for future research directions in MAPF with LLMs.
Weaknesses
- The paper only addresses the limitations of LLMs in centralized planning paradigms. The authors claim to use the history of the agents as input in the prompts. Due to this, the context window limit is reached quite quickly when the number of agents is high. But since the environment is Markov, shouldn’t it be able to decide actions for the agents just based on their current states?
- The methods used to demonstrate the limitations of LLMs are relatively straightforward. Including comparisons to more advanced modular architectures, such as those incorporating memory modules and decentralized planners (e.g., [1, 2, 3]), would have strengthened the analysis—even though surpassing state-of-the-art classical planners is not the primary objective of the paper.
- Overall, the paper feels more akin to a research proposal than a definitive study; it identifies key challenges and proposes future research directions but lacks extensive experimental results demonstrating effective solutions.
- The authors suggest three possible reasons for LLMs' shortcomings in MAPF: reasoning limitations, context length constraints, and difficulty understanding obstacle locations. However, these challenges should also arise in single-agent scenarios, where the agent must similarly reason and interpret obstacles, yet LLMs perform well in those cases. Could the authors provide further insights by comparing single-agent and multi-agent tasks?
References:
[1] Zhang, H., Du, W., Shan, J., Zhou, Q., Du, Y., Tenenbaum, J. B., Shu, T., & Gan, C. Building Cooperative Embodied Agents Modularly with Large Language Models. https://arxiv.org/abs/2307.02485
[2] Nayak, S., Orozco, A. M., Have, M. T., Thirumalai, V., Zhang, J., Chen, D., ... & Balakrishnan, H. Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments. arXiv preprint arXiv:2407.10031, 2024.
[3] Chen, Y., Arkin, J., Zhang, Y., Roy, N., & Fan, C. Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 4311-4317. IEEE, 2024.
Questions
There are some serious concerns that I have pointed out in the previous section.
Limitations:
The methods used to obtain plans with LLMs are quite simple. It would have been better to have some more sophisticated methods (used in prior literature) to compare and show their failure modes. Just using LLMs for planning might not be a reasonable approach and probably that’s why many of the recent papers come up with more sophisticated methods (different roles with LLMs, decentralization, etc.)
Thank you for your review. Here we address your concerns point by point:
since the environment is Markov, shouldn’t it be able to decide actions for the agents just based on their current states?
Indeed, the MAPF problem itself is Markov. However, because the LLM is not perfect, it can leverage past history to change its preferred action in the same state and potentially escape from being stuck locally. We invite you to look at the case study in our appendix, where such behavior, instead of improving performance, decreases the performance of the o1 series, but perfectly demonstrates how the LLM leverages past history.
Including comparisons to more advanced modular architectures, such as those incorporating memory modules and decentralized planners (e.g., [1, 2, 3])
Thank you for pointing them out. We want to clarify that we have already cited the third paper you mention, and its conclusion that decentralized planners are no better than a centralized planner guided us to focus on the centralized design, given that we are currently focusing on the success rate itself.
the paper feels more akin to a research proposal
Indeed, our paper is, in fact, a position paper on what future research on LLMs for MAPF should be. We sincerely hope you might reevaluate its significance in this light.
Could the authors provide further insights by comparing single-agent and multi-agent tasks?
While the three factors analyzed in this paper also apply to single-agent planning, the multi-agent setting introduces additional challenges due to the output length increasing at least linearly with the number of agents. Longer outputs expand the total context length, necessitating a restart mechanism in our algorithm. This mechanism reinitializes the entire LLM system with a new problem, using the current locations of all agents as their starting points when context length limits are reached. While this approach addresses the immediate problem, it negatively impacts final solutions by causing the algorithm to lose earlier information, such as each agent's preferred direction and potential map locations that the LLM initially struggled to encode. These losses further exacerbate the other two challenges discussed.
In single-agent settings, the LLM's extended input context can help avoid repeated paths, even when the model does not perfectly understand obstacle locations. Similarly, the solution checker can guide the LLM in producing viable paths even if the generated path is suboptimal. However, in multi-agent settings, the limited capabilities of LLMs lead to more frequent mistakes, resulting in additional restarts. Each restart compounds the loss of historical information, such as agent preferences and map details, ultimately increasing failure rates.
Our experiments demonstrate that on an empty map, where collision avoidance is the only constraint, LLM solvers can effectively scale to handle up to 16 agents (one agent for every four cells on the map). However, on larger maps with fixed obstacles, LLM solvers struggle even with only eight agents. Comparing these results suggests that agent collisions are not the primary factor driving the performance gap. Instead, the increased output length appears to be the main reason for the performance differences between single-agent pathfinding and multi-agent pathfinding (MAPF).
Besides, it is possible that the reasoning capabilities of current LLMs also contribute to the failure. As our case study on the o1 models showed, LLMs can sometimes fail to use past history to guide future actions correctly, an issue we did not observe in the single-agent scenario. However, we believe this is a smaller issue: even after we manually regenerated incorrect steps to resolve it, the performance on MAPF problems remained worse than on the single-agent task, so we believe the main issue is still the complexity introduced by the multi-agent setting.
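The restart mechanism described in this response (reinitializing the session with the agents' current locations as new starts once the context limit is reached) can be sketched as follows. This is a hedged illustration, not the authors' code; `solve_one_episode` is a hypothetical stand-in for running one LLM session until it finishes or fills its context window.

```python
def solve_with_restarts(solve_one_episode, starts, goals, max_restarts=5):
    """Sketch of a context-limit restart loop for LLM-based MAPF.

    `solve_one_episode` is a hypothetical stand-in that runs one LLM
    session and returns (partial_path, final_positions, done), where
    `done` means all agents reached their goals before the context filled.
    """
    full_path, positions = [], starts
    for _ in range(max_restarts):
        partial, positions, done = solve_one_episode(positions, goals)
        full_path.extend(partial)
        if done:
            return full_path  # all agents reached their goals
        # A restart discards conversation history here, losing each agent's
        # preferred direction and any accumulated map knowledge.
    return None  # failed within the restart budget
```

The information loss mentioned in the response happens between iterations of this loop: only the agents' positions carry over, not the conversation.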
The paper investigates the use of LLMs for solving multi-agent path-finding problems(MAPF), focusing on moving multiple agents from start to goal locations without collisions. The study explores whether LLMs, without additional training or heuristic guidance, can effectively generate valid paths for agents in different MAPF scenarios, including simple and complex environments. Experiments reveal that while LLMs can solve straightforward MAPF cases with limited obstacles, they struggle with more challenging environments, often failing to generate collision-free solutions. The paper identifies three primary reasons for LLMs’ limitations in MAPF: lack of advanced reasoning, context length limitations, and difficulty understanding spatial map information. Based on these findings, the authors suggest directions for future work to address these limitations and improve LLMs' MAPF performance.
Strengths
Firstly, the paper is well-structured, and it clearly explains why LLMs may be suitable candidates for MAPF due to their reasoning capabilities and large contextual understanding. It conducts experiments on various MAPF benchmark maps, evaluating LLMs’ performance in different scenarios and identifying specific failure points. The paper’s breakdown of LLM limitations, such as reasoning and spatial understanding, provides useful insights for future improvements in LLM-based MAPF solutions. Different prompt styles and input representations (e.g., text-only, multimodal) are compared, contributing valuable insights into how prompt structure affects LLM performance.
Weaknesses
The paper does not compare the LLM-based approach with traditional MAPF algorithms (e.g., heuristic search, SAT, or reinforcement learning). Including baseline comparisons would provide a better understanding of how LLMs perform relative to established methods. It also lacks visualizations of agent paths and collision instances, which would improve clarity and provide a more intuitive understanding of LLM performance. Success metrics focus on whether a solution is collision-free, with limited emphasis on solution quality (e.g., path optimality or efficiency). Detailed metrics would offer a clearer picture of LLMs’ efficacy in generating high-quality paths.
Questions
The authors can consider including baseline methods like heuristic search or SAT-based MAPF algorithms for comparison. Such comparisons would clarify whether LLMs bring any unique advantage to MAPF. Besides, they can evaluate the generated paths for metrics like makespan, and path length. Including these metrics could highlight the quality of LLM-generated solutions relative to optimal or near-optimal paths.
Thank you for your review. Here we address the weaknesses you pointed out one by one:
- Regarding the comparison with traditional MAPF algorithms: the success rate of typical MAPF algorithms like CBS or EECBS is 100%, with runtimes under 0.1 seconds. While LLMs currently offer no advantage, we are the first to explore the possibility of solving MAPF with LLMs, and the objective of our paper is to demonstrate that LLMs can solve small problems through prompting alone and to discuss what is stopping them from solving larger scenarios. We hope our paper can inspire future research to enable LLMs to function as solvers, offering the significant advantage of leveraging the rapid advancements in LLM technology while eliminating the additional training required by current RL-based methods.
- Our results (Table 3) show that with 8 agents, all methods already fail to find feasible solutions (0-20% success rate). When the generated solution is infeasible, it does not make sense to measure quality - one cannot focus on quality before consistently achieving feasibility. Therefore, as the LLM solver currently cannot even find feasible solutions consistently for a high number of agents, we believe it is more important, at this stage of exploring LLMs as alternative MAPF solvers, to focus on getting a feasible solution in the harder scenarios (with 8 agents) than getting a better solution in the easy scenarios (with 2-4 agents), which do not need many detours and thus leave little improvement margin.
Thank you for the detailed explanation. After reconsidering the paper including your responses, my overall assessment remains unchanged. For the first question, while I understand that the primary aim of the paper is exploratory rather than competitive, a comparison with traditional MAPF algorithms would still provide essential context. Even if LLMs cannot yet match traditional methods, benchmarking against these standards could highlight where LLMs fall short and provide a stronger foundation for motivating future work. For the second question, I agree that focusing on obtaining feasible solutions in harder scenarios is indeed more critical at this stage. However, the current evaluation primarily highlights the failures without clearly analyzing the underlying reasons for these failures. Including additional metrics, such as the number of collisions, path overlap, or the proportion of agents reaching their goals, could provide more insight into why the LLMs fail and help identify specific bottlenecks in their performance.
Dear Reviewer,
Thank you for your response and your constructive suggestions.
Regarding the inclusion of performance benchmarks for previous algorithms, we have added a sentence similar to the one we discussed with you in the discussion section of our paper. We invite you to review our updated paper, where the changes are highlighted in blue, and let us know if you agree with the chosen location for presenting these results.
Regarding the underlying reasons for these failures, we have already included the breakdown: "77% of the failures occurred because the LLM agents began to oscillate in a specific area of the map, while the remaining failures were due to excessively long detours," prior to further analyzing the failures from the LLM perspective in our paper. Compared to the additional metrics you suggested, we believe our current two classes of failures are more intuitive and directly linked to the reasons analyzed later in the paper. Furthermore, these reasons are well-recognized issues with LLMs, making them valuable research topics for future exploration. More specifically, collisions and path overlap never occur in our current workflow, as the LLM regenerates solutions whenever our solution checker detects conflicts in the actions generated for a given step. Regarding the proportion of agents reaching their goals, we found this metric to be highly scenario-dependent and subject to significant randomness. For example, in some scenarios, 6 out of 8 agents may successfully reach their goals, while in others, only 2 succeed. Due to this variability, we are uncertain how this metric could contribute to further analysis of the reasons for failure and therefore did not include it in our paper. We would greatly appreciate any additional suggestions on how this metric could be leveraged effectively. Thank you!
The paper explores the feasibility of using large language models (LLMs) for solving the Multi-Agent Path Finding (MAPF) problem. While LLMs have demonstrated success in various fields, this study examines their limitations in handling MAPF due to issues with reasoning capabilities, context length limits, and obstacle comprehension. Experiments on standard MAPF benchmarks show that LLMs perform well on simple scenarios but struggle as problem complexity increases. The authors conclude that current LLMs are insufficient for MAPF without additional improvements or hybrid systems integrating traditional path-planning methods.
Strengths
- This paper addresses a unique application of LLMs to multi-agent coordination, specifically MAPF.
- The paper identifies general limitations of LLMs in multi-agent coordination tasks, with some illustrative failure cases.
- The discussion outlines the general challenges of using LLMs in MAPF and suggests broad areas for improvement.
Weaknesses
- Given that these problems are well-addressed by analytical methods, could the authors elaborate on the concrete advantages of using LLMs for MAPF compared to existing analytical methods?
- The findings primarily reiterate known LLM challenges (e.g., context limitations and reasoning issues) without introducing MAPF-specific insights or innovations, so the relevance to MAPF needs to be clarified. The authors should highlight any MAPF-specific challenges or insights they discovered.
- The experiments are restricted to simple cases that may not generalize to real-world MAPF tasks, which undermines the strength of the study’s conclusions.
- The chosen prompt design lacks justification as the best approach for MAPF. Without alternative prompting methods or tuning strategies, it is unclear if the observed limitations are universal or specific to this setup.
Questions
- How do you justify that the proposed prompt method is the best approach and that its failure indicates no other prompting method can address the MAPF problem? How do you ensure the conclusions are generalizable beyond this specific scenario and prompt?
- Given that the insights are common LLM limitations, not specific to MAPF, what is the unique benefit of this research? Are there distinct challenges in MAPF that differ from general LLM challenges, making any of the observations particularly relevant?
- Why use LLMs for MAPF or even SAPF? The necessity is unclear, as these problems can be well-addressed using traditional analytical methods.
Thanks for your review. Here, we address your concerns about the weaknesses (labeled W1-W4 sequentially) and questions (labeled Q1-Q3).
Q1. W4. These questions concern whether the proposed prompt method is the best one and whether any other prompting method could address the MAPF problem. Our study has broken down the reasons for failure into three aspects, i.e., context length, understanding obstacles, and general reasoning capability related to pathfinding. All three aspects are known LLM challenges, as you already acknowledged, and all of them are also known to be insensitive to prompt wording and cannot be solved with prompt engineering alone. Thus, we believe that whether our prompt is the best one is less important; even if there is a better prompt, our conclusion will remain the same.
Q2. W1. W2. These are questions about the contribution of our work in comparison to both current work on MAPF and current work on LLM.
From the perspective of MAPF researchers, while current LLMs show no advantage compared to existing search-based or RL-based methods for MAPF, we hope our paper can enable research on LLM-based MAPF solvers that take advantage of the rapid advancements in LLM technology. More importantly, from the perspective of LLM researchers, MAPF remains one of the challenging tasks on which LLMs still perform very poorly. When the Blocksworld benchmark (which is related to MAPF) was first introduced to the LLM community, it was hard for LLMs at the time, yet it can now be solved much better by OpenAI's latest o1 model. We hope that by introducing MAPF as an LLM planning benchmark in our paper, we can provide a useful next frontier to challenge the abilities of LLMs in the domains of long context, symbolic understanding, and planning/reasoning.
Q3. We have included the results of SAPF in section 2.2 of our paper, where the success rate of LLMs as the solver for SAPF is much better than for MAPF with the same workflow that involves a rule-based checker. That is why we moved beyond SAPF to MAPF where the challenges of context length compound with the other challenges to significantly reduce the success rate.
W3. We currently use scenarios from the MAPF benchmark [R1]. It is the most commonly used benchmark that captures the key challenges of real-world MAPF problems, and the MAPF research community generally believes that good performance on this benchmark generalizes to real-world MAPF tasks with relatively easy adaptations.
[R1] Stern, R., Sturtevant, N., Felner, A., et al. Multi-Agent Pathfinding: Definitions, Variants, and Benchmarks. In Proceedings of the International Symposium on Combinatorial Search (SoCS), 10(1):151-158, 2019.
The paper explores the use of large language models (LLMs) for multi-agent path-finding (MAPF) problems. The study investigates whether LLMs are able to generate valid paths for agents in different MAPF scenarios without heuristic guidance or additional training. Experiments show that while LLMs can effectively solve simple MAPF problems with a small number of obstacles, they face significant challenges in complex environments, frequently failing to produce collision-free solutions. The paper highlights three contributing factors: insufficient advanced reasoning capabilities, restrictions due to context length, and challenges in comprehending spatial information. Drawing from these insights, the authors propose future research directions to overcome these challenges and enhance LLMs’ performance in MAPF tasks.
The paper was reviewed by three referees who agree on the paper's key strengths and weaknesses. All three reviewers appreciate the identification of the limitations of using LLMs for multi-agent planning, which provides valuable insight for future work in LLM-based MAPF. However, the reviewers emphasize the need to compare to other MAPF algorithms, including traditional methods as well as more advanced algorithms. The reviewers recognize that the paper's objective is not to outperform classic planners, but these comparisons would strengthen the paper's analysis by highlighting where LLMs fall short and would help motivate the use of LLMs for MAPF. Relatedly, the paper would benefit from experiments on more complex scenarios with further analysis of the successes and failures of LLMs.
Additional Comments from Reviewer Discussion
The reviewers appreciated the authors' responses, but all agreed that their primary concerns with the paper remained.
Reject