COLM 2024 · Poster · 4 reviewers
Average rating: 6.8 / 10 (scores: 6, 7, 7, 7; min 6, max 7, std 0.4)
Average confidence: 3.5

MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models

Submitted: 2024-03-18 · Updated: 2024-08-26
TL;DR

We propose MANGO, a benchmark to evaluate the capabilities of large language models to perform text-based mapping and navigation.


Keywords

large language models, robotics, mapping, navigation, text game

Reviews and Discussion

Official Review (Rating: 6)

The paper introduces a new benchmark called MANGO, which focuses on evaluating the mapping and navigation abilities of large language models. The benchmark consists of a set of mazes paired with textual walkthroughs, challenging models to navigate through the mazes based on the provided instructions. The paper discusses the motivation behind creating the benchmark, the design of the mazes and walkthroughs, and the evaluation of models, including GPT-4, on the benchmark tasks.

Reasons to Accept

The paper introduces a novel benchmark, MANGO, which addresses an important and previously unexplored aspect of evaluating large language models: their mapping and navigation abilities in text-based environments.

Reasons to Reject

The paper presents a benchmark with 53 mazes, which could be seen as limited in scale. Moreover, it is unclear how to enable LLMs to comprehend complex mazes, as doing so may require instructions with more tokens than the LLM's context window can accommodate.

The paper briefly mentions the poor performance of GPT-4 on certain questions but does not delve deeply into the specific limitations or challenges faced by the model in mapping and navigation tasks. A more detailed analysis of model weaknesses could offer valuable insights for future improvements.

Author Response

Thank you for your positive review!

We appreciate your feedback, and will improve our presentation for camera-ready.

… limited in scale

We are sorry that the current version hasn't emphasized the scale of our dataset.

First, please see Table 2 in App A for the statistics of the full data. Note that Table 2 in the current version contains mistakes; the actual data is larger than what is displayed, and we will make corrections for the camera-ready!

In fact, our full data is large-scale. It has:

  • 2M EASY DF questions;
  • 1M HARD DF questions;
  • 190K EASY RF questions;
  • 5K HARD RF questions.

On average, each maze / map / game has ~60K DF questions and ~4K RF questions.

Second, the 53 games are diverse: they span different topics and genres, were designed by different companies, and are set in different eras. They cover a wide variety of maps (small vs. big houses, long vs. short halls, towns vs. forests, verbose vs. concise scene descriptions, etc.).

… more tokens than the LLM window can accommodate.

The length of walkthroughs ranges from ~20 to >1000 steps. Many of them easily exceed the context windows of most modern LLMs. That is why we could only evaluate the LLMs on a subset of our data (see Sec 3.1 and App A).

However, we will publish the full data (see App A), together with all our source code and a website hosting the benchmark, to facilitate future research (e.g., with future LLMs that have sufficiently large context windows).

delve deeply into the specific limitations or challenges… more detailed analysis of model weaknesses…

This is very important and interesting!

We would like to perform a deeper analysis, and have documented our thoughts in App D. However, we are afraid that the amount of investigation may warrant a standalone paper, and thus leave it to future work.

For example, we plan to investigate whether the representations of LLMs have captured the structures of the maps (e.g., by probing).

E.g., we plan to "upgrade the MANGO benchmark by enriching its spatial and structural configurations" (to resemble realistic settings) and then analyze the broader implications of the evaluation done on MANGO.

Official Review (Rating: 7)

The paper introduces MANGO, a benchmark for assessing the mapping and navigation abilities of large language models (LLMs). The benchmark comprises 53 mazes from text-based games paired with walkthroughs as well as mapping and navigation questions. The authors evaluate the performance of several LLMs, including GPT-4, on this benchmark. They find that even GPT-4 struggles with the tasks, particularly on challenging questions involving unseen paths. The paper demonstrates that strong mapping and navigation abilities can benefit LLMs in downstream tasks like playing text-based games.

The paper is well-structured, with a detailed description of the benchmark creation process and comprehensive experiments. The methodology is clearly described, and the results are presented in an organized manner with supporting statistical analyses and visualizations. The MANGO benchmark is novel in its focus on evaluating LLMs' mapping and navigation abilities, addressing an aspect that has received limited attention in prior research. The findings have implications for applications such as embodied agents and robotics.

However, the benchmark's current scope may be limited, as it focuses on text-based game environments. The paper lacks a comprehensive comparison with existing methods for mapping and navigation tasks and does not provide a deep analysis of the underlying mechanisms contributing to the LLMs' performance. Overall, the paper makes a contribution by introducing a new benchmark and evaluating LLMs' mapping and navigation abilities, but there is room for improvement in terms of scope, comparison with existing methods, and analysis of underlying mechanisms.

Reasons to Accept

  1. Novel benchmark for evaluating mapping and navigation abilities: The paper introduces MANGO, a new benchmark designed to assess the mapping and navigation abilities of large language models (LLMs). This benchmark addresses an aspect of LLMs that has received limited attention in prior research. Acquiring spatial knowledge from textual descriptions and reasoning about navigation are relevant for various applications, such as embodied agents and robotics. By providing a standardized benchmark and evaluating current state-of-the-art LLMs, this work contributes to the field and can facilitate future research in this area.

  2. Well-designed methodology and thorough evaluation: The authors present a detailed description of the benchmark creation process, including maze collection, question generation, and the development of an evaluation program. The experiments cover multiple LLMs and analyze their performance on different question types and difficulty levels. The statistical analyses and visualizations support the findings, providing insights into LLMs' limitations and potential improvements in mapping and navigation tasks. The clarity of the methodology promotes reproducibility and enables future research to build upon this work.

  3. Relevance to downstream tasks and real-world applications: The paper demonstrates the importance of strong mapping and navigation abilities for LLMs in downstream tasks, such as playing text-based games. This finding suggests that improving the spatial reasoning capabilities of LLMs can benefit their performance in relevant real-world applications, particularly in the context of embodied agents and robotics. The MANGO benchmark offers a tool for researchers and practitioners to evaluate and enhance the spatial reasoning capabilities of LLMs, which can contribute to their effective deployment in various domains, such as autonomous navigation, intelligent assistants, and human-robot interaction.

Reasons to Reject

  1. Limited scope of the benchmark: While the MANGO benchmark introduces a novel approach to evaluating the mapping and navigation abilities of LLMs, the current scope of the benchmark may be limited. The benchmark consists of 53 mazes derived from text-based games representing a specific environment type. It is unclear whether the performance of LLMs on this benchmark would generalize to other types of environments or real-world scenarios. The authors could consider expanding the benchmark to include a more diverse set of environments or discussing the limitations and potential challenges in generalizing the findings to other contexts to strengthen the paper.

  2. Lack of comparison with existing methods: The paper focuses on evaluating the performance of LLMs on the MANGO benchmark but does not provide a comprehensive comparison with existing methods for mapping and navigation tasks. While the authors mention that their work complements previous research on vision-language navigation, they do not directly compare the performance of LLMs with state-of-the-art approaches in this domain. Including a comparative analysis with existing methods could help contextualize the performance of LLMs and provide a clearer understanding of their strengths and limitations in mapping and navigation tasks.

  3. Insufficient analysis of the underlying mechanisms: Although the paper thoroughly evaluates LLMs' performance on the MANGO benchmark, there is limited analysis of the underlying mechanisms that contribute to their successes and failures. The authors provide some insights into the factors that influence the performance of LLMs, such as maze size and path length, but do not delve deeper into the specific aspects of the models' architectures or training that may impact their mapping and navigation abilities. A more detailed analysis of the models' internal representations and decision-making processes could provide valuable insights for future research and for developing more effective LLMs for spatial reasoning tasks.

Questions to Authors

  1. Generalizability of the benchmark: The current MANGO benchmark comprises 53 mazes derived from text-based games. How well do you expect the performance of LLMs on this benchmark to generalize to other types of environments or real-world scenarios? Are there any plans to expand the benchmark to include a more diverse set of environments, and what challenges do you anticipate in doing so?

  2. Comparison with existing methods: The paper focuses on evaluating the performance of LLMs on the MANGO benchmark but does not provide a comprehensive comparison with existing methods for mapping and navigation tasks, particularly in the domain of vision-language navigation. Could you discuss how the performance of LLMs on the MANGO benchmark compares to state-of-the-art approaches in related domains? What insights can be gained from such a comparison, and how might it inform future research in this area?

  3. Analysis of underlying mechanisms: The paper thoroughly evaluates LLMs' performance on the MANGO benchmark but offers limited analysis of the underlying mechanisms that contribute to their successes and failures. Can you provide more insight into how the specific aspects of the models' architectures or training might impact their mapping and navigation abilities? What additional experiments or analyses could be conducted to better understand LLMs' internal representations and decision-making processes in the context of spatial reasoning tasks?

Author Response

Thank you for your positive review and insightful questions!

Our App D sketches a few future directions that aim to address open research questions similar to yours. We will include more details in the camera-ready.

Generalizability of the benchmark… unclear whether the performance of LLMs on this benchmark would generalize to other types of environments or real-world scenarios

We believe that the performance and analysis on MANGO have broader implications, and we plan to demonstrate this in our future work.

In App D, we proposed a future direction (page 29) to "upgrade the MANGO benchmark by enriching its spatial and structural configurations" (such that it looks more like realistic scenarios).

Lack of comparison with existing methods…

Due to the page limit, we moved some empirical comparisons to the appendices. We can move them back to the main paper with the extra page allowed by COLM for the camera-ready.

In App C.4, we evaluated a class of classical methods that (1) map natural-language instructions to symbolic graphs and (2) call symbolic search algorithms to answer the questions. However, in our experiments, the natural-to-symbolic translation is difficult (as reported in much of the existing literature), and thus the performance of these methods is poor. Improving them may also be an interesting future direction.
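
To make the second stage of this pipeline concrete, below is a minimal, hypothetical sketch (not the implementation in App C.4): once a walkthrough has been translated into symbolic edges, destination finding reduces to replaying moves over the graph, and route finding reduces to breadth-first search. The edge data and function names are illustrative assumptions.

```python
from collections import deque

# Hypothetical output of stage (1): direction-labeled edges extracted from a
# walkthrough (illustrative data, not taken from the benchmark).
edges = [
    ("kitchen", "east", "hallway"),
    ("hallway", "north", "attic"),
    ("hallway", "south", "cellar"),
]

# Build an adjacency map: location -> {direction: next location}.
graph = {}
for src, direction, dst in edges:
    graph.setdefault(src, {})[direction] = dst

def find_destination(start, moves):
    """DF-style query: follow a list of directions; return None if a move is impossible."""
    loc = start
    for d in moves:
        loc = graph.get(loc, {}).get(d)
        if loc is None:
            return None
    return loc

def find_route(start, goal):
    """RF-style query: BFS returns one shortest list of directions, or None if unreachable."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        loc, path = queue.popleft()
        if loc == goal:
            return path
        for d, nxt in graph.get(loc, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [d]))
    return None

print(find_destination("kitchen", ["east", "north"]))  # attic
print(find_route("kitchen", "cellar"))                 # ['east', 'south']
```

As noted above, the hard part is stage (1): if the natural-to-symbolic translation produces a wrong edge list, the symbolic search simply inherits those errors.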

Other types of "state-of-the-art approaches" are not directly applicable since our setting has key differences (e.g., linguistic, but not visual, input) from the "related domains"; details can be found in Related Work (e.g., discussion about Walter et al., 2012, etc).

Insufficient analysis of the underlying mechanisms… more detailed analysis of the models' internal representations and decision-making processes

This is very important and interesting! It requires a significant amount of investigation, which we think may warrant a standalone paper. Thus, we leave it to future work.

However, we have indeed thought about it, and documented our thoughts in App D (page 26). Specifically, we would like to investigate whether the representations of LLMs have captured the structures of the maps (e.g., by probing). We will elaborate on our thoughts in the camera-ready.

Official Review (Rating: 7)

The paper introduces a benchmark to test the ability of large language models to perform text-based mapping and navigation. The benchmark includes several maps paired with a walkthrough, along with a large number of destination-finding and route-finding questions and answers. Through evaluations of multiple open-source and closed-source large language models, the authors demonstrate that this task is very challenging even for the most advanced models.

Reasons to Accept

What I particularly appreciated about this paper is the close attention paid to creating the benchmark and the extensive manual work and/or verification. I also liked the analysis that was performed on the results.

Reasons to Reject

Two aspects where I believe the paper could be improved:

  1. The comparison to human performance is hand-wavy at best. The authors mention that humans answered a randomly selected set of questions, and based on that the performance of the models was deemed below human performance. Given that this is a benchmark, I believe it is important to have a strong human reference. I would like to see significant details included on this aspect of the evaluation, including among others: how many questions were evaluated by humans; what is the agreement between them; what is the performance on easy versus hard questions; etc.

  2. It would be useful to include information on how the prompt robustness was tested. Just as an example, considering the prompt shown in Example 9: how would a change in this prompt affect the model? For instance, if the prompt is ordered differently, if a paraphrase is used, or if additional irrelevant details are included, etc.

Consider including the actual walkthrough used (even if only partial) for the example in Fig 1, to give a better sense of what is seen by the models.

Minor: acronyms are not consistently used; e.g., see the captions for Examples 2-5, the use of LLM (or not), etc.

Questions to Authors

Please see items 1 and 2 under “reasons to reject”

Author Response

Thank you for your positive review!

We appreciate your constructive feedback, and will improve the final version accordingly. COLM has allowed an extra page for camera-ready, with which we can add all the new information that you would like.

comparison to the human performance is hand wavy…

During this rebuttal period, we carried out more comprehensive human evaluation, and we will include the details in the camera-ready.

Briefly speaking, our takeaways are:

  • humans can achieve perfect success rates on DF questions, but ~90% on RF questions;
  • DF questions also take less time to answer than RF questions;
  • humans need to sketch the map to achieve a high success rate;
  • humans are good at handling imputed edges: they can answer hard questions well if they sketch maps; they may answer easy questions poorly if they don't sketch maps.

include information on how the prompt robustness was tested.

Appendix B.2 has already discussed prompt robustness (to some extent). We will include more details in the camera-ready.

The structure of LLM output is sensitive to the prompts, which is why we had to carefully tune the prompts (such that LLMs return easy-to-parse output). For example, we found it helpful to ask LLMs to format their answers as a Python list of Python dictionaries (Exp 7 and 8, Sec 3.1, App B.2); we also found it helpful to end our prompt with '[' (the opening symbol of a Python list) to elicit the LLM to actually produce output in the desired format.
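
As a hedged illustration of this trick (the actual prompts appear in Exp 7 and 8 and App B.2; the wording and dictionary keys below are placeholders, not the benchmark's real schema), the prompt states the required structure and ends with the opening '[', and the parser re-attaches that '[' before evaluating the text as a Python literal:

```python
import ast

# Illustrative prompt skeleton (paraphrased; the real prompts are in Exp 7-8 / App B.2).
prompt = (
    "Answer with a Python list of Python dictionaries, one per step, "
    'each of the form {"from": ..., "action": ..., "to": ...}.\n'
    "Question: How can you go from Kitchen to Attic?\n"
    "Answer: ["  # ending the prompt with '[' nudges the model to continue the list
)

# A completion as the model might return it (it continues right after the '[').
completion = (
    '{"from": "Kitchen", "action": "east", "to": "Hallway"}, '
    '{"from": "Hallway", "action": "north", "to": "Attic"}]'
)

def parse_answer(completion):
    """Re-attach the '[' that the prompt already opened, then parse as a Python literal."""
    text = completion if completion.lstrip().startswith("[") else "[" + completion
    try:
        return ast.literal_eval(text)
    except (SyntaxError, ValueError):
        return None  # malformed output counts as a parsing failure

print(parse_answer(completion))
# [{'from': 'Kitchen', 'action': 'east', 'to': 'Hallway'},
#  {'from': 'Hallway', 'action': 'north', 'to': 'Attic'}]
```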

As long as the output follows the desired structure, its content is robust to the prompts. In our pilot experiments, we found that different prompts (created by different authors) may generate the same output (regardless of its format) and achieve the same success rates (after parsing).

We haven't experimented with different orders of sentences or distracting context. But we think that these are very interesting ideas, and we would love to explore them, if not for the camera-ready then in future work.

including the actual walkthrough used (even if only partial) for the example in Fig 1

We will include it in the camera-ready.

In addition, we are ready to release all our data (including walkthroughs) and source code as well as a website hosting the benchmark.

acronyms are not consistently used

Thanks! We will fix these.

Comment

Thank you for your answers, and the additional evaluation.

To follow up on the human evaluation -- some questions I had in my original review that still feel unanswered: "how many questions were evaluated by humans; what is the agreement between them; what is the performance on easy versus hard questions; etc."

Comment

Thank you for the response!

The detailed human evaluation is summarized below. 31 RF questions and 30 DF questions were evaluated. Success rate (loose) is the main metric used in the paper, which allows partial string matching using edit-distance (discussed in Sec 2.3). The strict version requires exact string matching and is provided as a reference. The reasoning accuracy metric is discussed in Appendix A.6.

Humans generally achieve near-perfect success rates with no significant difference between easy and hard questions. The distinction between easy and hard questions lies in the need to reason about reversed edges (e.g., inferring "b->west->a" from "a->east->b"), which is challenging for LLMs but straightforward for humans.
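
For concreteness, the sketch below (our illustration, with an assumed direction vocabulary rather than the benchmark's exact action set) shows what reasoning about reversed, i.e. imputed, edges amounts to:

```python
# Observed edges come straight from the walkthrough; hard questions additionally
# require the reversed ("imputed") edges that the walkthrough never states.
OPPOSITE = {
    "north": "south", "south": "north",
    "east": "west", "west": "east",
    "up": "down", "down": "up",
}

def impute_reversed_edges(observed):
    """For each observed a -> direction -> b, add b -> opposite(direction) -> a."""
    imputed = set(observed)
    for src, direction, dst in observed:
        rev = OPPOSITE.get(direction)
        if rev is not None:  # some actions (e.g., "enter") have no obvious reverse
            imputed.add((dst, rev, src))
    return imputed

print(impute_reversed_edges({("a", "east", "b")}))
# contains ('b', 'west', 'a'), the imputed edge, alongside the observed one
```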

Imperfections in reasoning accuracy scores are due to: 1) the requirement for exact string matching, and 2) human errors in typing location names with special symbols.

| Task | Difficulty | Total Tasks | Success Rate (Loose) | Success Rate (Strict) | Reasoning Accuracy |
|---|---|---|---|---|---|
| Route Finding | All | 31 | 0.8211 | 0.8065 | 0.6129 |
| Route Finding | Easy | 20 | 0.7727 | 0.75 | 0.55 |
| Route Finding | Hard | 11 | 0.9091 | 0.9091 | 0.7273 |
| Destination Finding | All | 30 | 1.0 | 1.0 | 0.5667 |
| Destination Finding | Easy | 21 | 1.0 | 1.0 | 0.6667 |
| Destination Finding | Hard | 9 | 1.0 | 1.0 | 0.3333 |

Previously, different experts evaluated different questions, so a human agreement score wasn't available. Due to time constraints, a subset of the previously evaluated questions (10 RF, including 2 HARD, and 10 DF, including 3 HARD) was re-evaluated by different experts. Since DF and RF questions can have multiple valid solutions, the answers were not directly compared for agreement. Instead, the mean squared error (MSE) was computed for each metric across tasks and difficulty levels. For each file evaluated in both rounds, the squared error was calculated as (x₁ - x₂)² and then averaged by task type, difficulty, and metric type, as shown below. In general, MSE was low for both RF and DF questions in terms of our main metric.

| Task | Difficulty | Loose MSE | Strict MSE | Reasoning MSE |
|---|---|---|---|---|
| Route Finding | All | 0.1442 | 0.2222 | 0.3333 |
| Route Finding | Easy | 0.0425 | 0.1429 | 0.1429 |
| Route Finding | Hard | 0.5 | 0.5 | 1.0 |
| Destination Finding | All | 0.0 | 0.0 | 0.6 |
| Destination Finding | Easy | 0.0 | 0.0 | 0.5714 |
| Destination Finding | Hard | 0.0 | 0.0 | 0.6667 |
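
A minimal sketch of the agreement computation described above, assuming per-question metric scores collected from the two evaluation rounds (the grouping keys and toy numbers are illustrative, not the actual data):

```python
# (task type, difficulty, metric) -> list of (round-1 score, round-2 score) pairs
# for the questions evaluated in both rounds (toy numbers for illustration only).
paired_scores = {
    ("RF", "Easy", "loose"): [(1.0, 1.0), (0.5, 1.0)],
    ("RF", "Hard", "loose"): [(1.0, 0.0)],
}

def mse(pairs):
    """Mean of (x1 - x2)^2 over the paired evaluations."""
    return sum((x1 - x2) ** 2 for x1, x2 in pairs) / len(pairs)

for key, pairs in paired_scores.items():
    print(key, round(mse(pairs), 4))
# ('RF', 'Easy', 'loose') 0.125
# ('RF', 'Hard', 'loose') 1.0
```
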
Official Review (Rating: 7)

The authors propose a benchmark to test LLMs' ability at text-based spatial reasoning (mapping and navigation). The benchmark uses mazes that are paired with walkthroughs which cover all the locations but not necessarily all the paths. Evaluation results show that current best models still struggle on this task. The authors also provide tests to showcase how the mapping capability can improve text-game-playing ability.

Reasons to Accept

  • Text-only spatial reasoning has been anecdotally known as a challenging task for even the best LLMs; this work operationalizes this idea and proposes a concrete way to assess this ability
  • The construction of the benchmark is methodologically sound and novel (visits all locations but not all paths)
  • The authors properly discussed and analyzed what makes the mazes hard and challenging

In all, I find the work solid in its methodology and comprehensiveness.

Reasons to Reject

  • The text is a bit dense, which makes it hard to follow at times (e.g., 2.1, 2.2, 2.3; especially 2.2); it might be better to split up the sections when it comes to introducing new ideas (e.g., That is, we have to traverse an extended graph that includes imputed edges. An imputed edge denotes [...]).
  • The authors showcase the usefulness of having the mapping capability added to the model for downstream tasks. However, this is done by sampling minigames from the same suite; it might help if the authors could provide some quantification of how different the downstream task is from the walkthrough to really understand its usefulness.
Author Response

Thank you for your positive review!

We appreciate your constructive feedback, and will improve the final version accordingly.

Paper writing is dense… may split sections when introducing new ideas…

With the extra page allowed by COLM camera-ready, we will follow your advice and improve our presentation.

it might help if … provide some quantification of how different the downstream task is from the walkthrough to really understand its usefulness.

This is a great point! We will clarify it in the camera-ready.

In our current experiments, each minigame is a randomly sampled prefix of a walkthrough, thus resembling the structure of the walkthrough. We plan to perform a deeper evaluation of this aspect in future work, such as generalization to navigation scenarios with significantly different structures, styles, and configurations (like what we have sketched in App D).

Final Decision

The paper introduces MANGO, a benchmark to evaluate large language models' (LLMs) abilities in text-based mapping and navigation. It uses 53 mazes from text games, each with a walkthrough covering all locations but not all possible paths.

Pros:

  1. The paper effectively operationalizes the known difficulty of text-only spatial reasoning for LLMs, proposing a concrete assessment method.
  2. Significant effort and attention to detail were invested in creating and verifying the benchmark, including manual work and verification. The evaluation includes multiple models and demonstrates the task's difficulty, reinforcing the benchmark's relevance.
  3. The dataset has 53 diverse mazes, which are mapped to a large number of DF and RF questions.
  4. More details, answering all questions on the human evaluation, are provided in the rebuttal.

Cons:

  1. As the minigames are trajectories sampled from walkthrough prefixes, it is unclear how different they are as downstream tasks compared to the benchmark itself. Reviewer 1xds raises the same concern. It is advisable that the authors conduct downstream-task experiments to validate this.
  2. During the rebuttal, the human evaluation included a study on sketching maps. It is unclear how this map-sketching relates to model performance.

Neutral comment: An analysis of model performance across categories of failures would make the analysis stronger. The authors presented some categories of failures, which, if made more systematic, would be helpful for future directions. When the reviewers pointed out this direction, the authors seemed keen on performing a model-representation analysis, which, as they responded, would warrant a standalone paper. However, at least discussing the categories or types of failures beyond easy vs. hard questions would provide better insight into the benchmark.

Originality: The paper is clear in its presentation of the benchmark and its importance, though some sections are dense. The originality lies in the novel benchmark for assessing text-based mapping and navigation, which addresses a previously underexplored area of LLM capabilities, particularly in its framing of mazes as a text-based setting for spatial reasoning.

Significance: The work is significant as it identifies a critical gap in current LLM performance and provides a structured way to evaluate and improve text-based spatial reasoning, which could enhance performance in related downstream tasks.