MapIQ: Evaluating Multimodal Large Language Models for Map Question Answering
We introduce MapIQ, a benchmark for evaluating multimodal large language models (MLLMs) on map-based visual question answering (Map-VQA) across three map types, and assess their robustness to map design variations.
Abstract
Reviews and Discussion
This paper investigates how well multimodal large language models (MLLMs) read data visualisations. To this end, a new benchmark data set is created for visual question answering with maps (Map-VQA). This data set, MapIQ, contains 14K question-answer pairs spanning three map types and is used to evaluate various MLLMs. In addition, the impact of map design changes (e.g., different color schemes) is evaluated.
Reasons to Accept
This research provides interesting insights into the robustness of MLLMs and their reliance on internal geographic knowledge for the task of Map-VQA by establishing a new benchmark data set that (1) introduces two less-researched map types (cartograms and proportional symbol maps) in addition to the better-researched choropleth maps, (2) covers six VA tasks, and (3) offers metadata sourced from six thematic categories. A variety of open-source and closed-source models are evaluated on this benchmark and compared with a human performance baseline.
Reasons to Reject
Some parts of the methodology and evaluation are difficult to follow. It is, for instance, not completely clear to me how the gold standard was created. In order to reliably evaluate the systems, the only trustworthy way to create the gold standard for the test set is to manually annotate the labels. In Section 3, it is stated that the answers of the question-answer pairs were reviewed by humans to ensure accuracy. In Section 4, however, the authors state that only 9% of the test set was manually validated by humans. I do not understand how systems can be evaluated automatically with high precision if the answers for the rest of the test set have not been validated manually to create a real gold standard.
Questions for Authors
See remark above about the creation of the gold standard test set.
- Some information could be added to the explanation to clarify the details of the methodology, e.g.
- line 107: what are the five class labels?
- line 172: why do you assign scores of 0 or 100 to calculate accuracy? (Usually accuracy is simply the number of correct answers divided by the size of the test set.)
1. Gold standard for test set
Clarification on Ground Truth/Gold Standard Creation (Section 3): The term "answers" in Section 3.3 refers to ground truth labels, not model outputs. These were programmatically extracted from geospatial metadata (geojson files) using Python scripts—no visual map inspection was involved. Given that we had access to the original metadata and our questions were designed to be objective and factual (e.g., "What is the attribute class of AL?"), programmatic extraction was not only feasible but also more accurate and reliable than manual annotation.
Example process (Question Type: MCQ, Task Type: Retrieve Value):
- Parse geojson file and extract the question and the map topic.
- Identify target state (e.g., Alabama from "What is the attribute class of AL?")
- Look up ground truth value (e.g., 11% population mobility).
- Match to correct option (e.g., "d. 10.0–11.2").
- Return ground truth answer (e.g., ['d']).
Scripts were developed for each Task Type–Question Type combination. We validated representative samples and iteratively refined scripts based on consistent error patterns, ensuring the programmatic approach provided us the gold standard for the test set. The ground truth responses for the test set are provided in the Supplementary Materials (OSF: Folder Name- Ground Truth Responses).
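For concreteness, a minimal sketch of what such an extraction script could look like is given below. The geojson property names, the parsed question fields, and the option format are hypothetical illustrations, not the exact scripts used in the paper.

```python
import json

def ground_truth_retrieve_value(geojson_path, state_abbr, options):
    """Illustrative Retrieve Value look-up from map metadata (hypothetical schema).

    `state_abbr` and `options` are assumed to have already been parsed from the
    question text, e.g. options = {"d": (10.0, 11.2), ...}.
    """
    with open(geojson_path) as f:
        geo = json.load(f)

    # Find the feature for the target state and read its attribute value.
    value = next(
        feat["properties"]["attribute_value"]
        for feat in geo["features"]
        if feat["properties"]["state_abbr"] == state_abbr
    )

    # Match the numeric value against the MCQ class ranges, e.g. "d. 10.0-11.2".
    for letter, (low, high) in options.items():
        if low <= value <= high:
            return [letter]
    return []
```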
Clarification on Section 4 (Validation and Human Baseline Evaluation): Section 4 contains two distinct processes, both separate from ground truth creation in Section 3:
4.1 Validation (MLLM Response Cleaning)
This was the manual postprocessing performed across the entire test dataset for all model responses to ensure consistent evaluation.
Process Details:
a. Despite explicit formatting instructions (e.g., "Format your response as: 'My answer is [Your Option]'"), models occasionally included extraneous content such as reasoning steps, explanations, or hallucinated information alongside their answers.
b. Example (Molmo response):
- Question: What is the attribute class of WA?
- Raw Response: "To answer this question, I need to: [reasoning steps...] After examining the map and legend, Washington's circle falls into the second largest category, which corresponds to the 99.9–163 range. My answer is 99.9–163"
- Cleaned Response: [99.9, 163]
c. Example (Idefics response):
- Question: True or False: In the Midwest zone, the range of attribute value is between 26.5 to 30.4
- Raw Response: "No."
- Cleaned Response: FALSE
d. The manual validation ensured that:
(1) actual answers were correctly extracted from noisy responses,
(2) responses were formatted consistently for programmatic evaluation, and
(3) formatting variations didn't artificially penalize model performance.
e. Data Availability: All raw model responses and their cleaned versions are provided in the Supplementary Materials (OSF: "MLLM Responses > Baseline").
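For illustration, a cleaned response could then be scored against the ground truth with a simple comparison such as the sketch below; the exact matching rules used in the paper are not reproduced here, and this normalization scheme is an assumption.

```python
def score_response(cleaned, ground_truth):
    """Return 100 for a match with the ground truth, 0 otherwise (hypothetical rules).

    Inputs are assumed to be in the normalized formats shown above, e.g. an
    option list like ['d'], a numeric range like [99.9, 163], or 'FALSE'.
    """
    if isinstance(cleaned, list) and isinstance(ground_truth, list):
        # Option letters or numeric ranges compare element-wise.
        return 100 if cleaned == ground_truth else 0
    # Fall back to a case-insensitive string comparison (e.g. TRUE/FALSE).
    return 100 if str(cleaned).strip().lower() == str(ground_truth).strip().lower() else 0
```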
4.2 Human Baseline Evaluation
This describes a completely separate process: collecting human performance data by presenting 450 questions (~9% of the test set) to human evaluators under identical visual conditions as the MLLMs—meaning they saw the same maps and prompts without access to underlying metadata. This 9% sample establishes human performance benchmarks for comparison with MLLM results.
Important Distinctions:
- Different personnel: The ground truth and MLLM response validation were carried out by one human, who was distinct from the two humans involved in the human baseline evaluation.
- Different purposes: Validation ensures fair evaluation; human baseline establishes performance benchmarks.
- Different scope: Validation covered 100% of the test set; human baseline covered 9% for benchmarking.
Revision plan: We can clarify these distinctions in the manuscript and provide additional details where needed.
2. What are the five class labels?
The five class labels used in the Fisher–Jenks classification are:
- Very Low
- Low
- Medium
- High
- Very High
Each of the 258 datasets contained continuous data that we discretized using the Fisher–Jenks classification scheme, which minimizes within-class variance while maximizing between-class variance. We chose five classes following standard cartographic practice and to ensure visual encodings remained interpretable for the MLLMs without excessive color or size variations.
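As an illustration only, a five-class Fisher–Jenks discretization can be reproduced with an off-the-shelf implementation such as `mapclassify`; the specific library used in the paper is not stated, so this is an assumed choice.

```python
import numpy as np
import mapclassify  # assumed Fisher-Jenks implementation; the paper does not name one

labels = ["Very Low", "Low", "Medium", "High", "Very High"]

# Hypothetical continuous state-level values for one dataset.
values = np.random.default_rng(0).uniform(5.0, 30.0, size=50)

# Fisher-Jenks picks class breaks that minimize within-class variance (k = 5).
classifier = mapclassify.FisherJenks(values, k=5)

print(classifier.bins)                      # upper break of each of the five classes
print([labels[i] for i in classifier.yb])   # class label assigned to each value
```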
3. Why do you assign scores of 0 or 100 to calculate accuracy?
We used the standard formula for accuracy—the number of correct answers divided by the total number of questions—which produces a value between 0 and 1. For consistency in reporting, we multiplied this value by 100 to express accuracy as a percentage. For example, an accuracy of 0.45 was reported as 45.
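In other words, averaging per-question scores of 0 or 100 is equivalent to the standard accuracy formula scaled to a percentage; a toy example:

```python
def accuracy_percent(scores):
    """Average per-question scores of 0/100 -> accuracy as a percentage."""
    return sum(scores) / len(scores)

scores = [100, 0, 100, 100]               # 3 of 4 questions answered correctly
assert accuracy_percent(scores) == 75.0   # same as (3 / 4) * 100
```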
Thank you for your clarifications and answers to my questions. I clearly misunderstood some parts in the paper, but I think adding more detailed information and clarification on the creation of the gold standard data set will improve the general quality of the paper. I have updated my score.
Thank you very much for your thoughtful response and for updating your score! We agree that clearer documentation of the gold standard creation process will strengthen the paper, and we will include clarifications in Section 3 and the Appendix as part of our final revision.
This paper introduces MapIQ, a Visual QA benchmark focused on US maps. The benchmark contains QA pairs across three map types, generated using government data from six domains: economic, housing, social, health, crime, and environmental. The authors evaluate seven MLLMs - four open-source and three closed-source - on a subset of the benchmark. Their findings show that there still exists a large performance gap between these models and human annotators. They also provide analysis across different map types, tasks, question formats, and domains. Additionally, the authors test the robustness of MLLMs’ performances by modifying map design elements such as font sizes, legends, and color schemes to measure how these variations affect performance.
Reasons to Accept
The paper presents a benchmark highly relevant to real-world applications and reveals significant performance variations among state-of-the-art multimodal models in visual analytics tasks. The inclusion of the varying map design experiments is also quite interesting for testing MLLMs’ robustness - how different font sizes and color schemes can lead to performance changes.
Reasons to Reject
Qualitative Analysis and Useful Insights: While the paper presents many quantitative results in baselines and ablations, it is still hard to understand what types of errors models make for different map types/question types. Given the same question, how do different models answer it when the map type/question format differs? What are the confounding factors when a model makes a mistake? It is difficult to further understand or extract useful insights for future model improvement without knowing the specifics of these error types.
Use of Different Domain: From Figure 10 in the appendix, it seems that all question types are largely unrelated to the specific data domain on which the map is based. If this is the case, could the maps be generated using synthetic data drawn from a random distribution? The current setup is useful as a visual analytics dataset without eliciting any further demographic knowledge. I think the dataset would be more interesting if it could somehow incorporate some domain-specific questions.
Questions for Authors
- In the Validation section on page 4, you mention that "Removing these extraneous details was particularly time-consuming". Was this postprocessing done manually for all test data points for all models? Can you provide more details?
- Please use consistent tense across the paper; it frequently switches between present and past tense (e.g., lines 78 and 164).
- Any thoughts on why Gemini 1.5 Pro is performing worse than the random baseline for the binary question in Figure 3?
1. Qualitative Analysis and Useful Insights
We agree that qualitative error analysis is crucial for understanding model limitations. While we provide a few quantitative insights for each experimental variable (Section 5), along with ablation study results and analysis (Section 6 and Appendix A.3.2), given below are additional insights that can be extracted:
Failure Mode Analysis: From the results of the ablation study of Qwen2-VL (Section 6), we observed that altering legend font size led to a significant drop in accuracy, suggesting difficulty in interpreting complex or non-standard legends. Similarly, altering color encodings and scales (e.g., flipped, divergent, or spectral palettes) resulted in performance degradation on map-based questions, indicating insufficient visual grounding and sensitivity to design variations.
Map Type and Task Type Interaction Analysis: As a case study, we analyze Claude 3.5 Sonnet’s performance variation across interacting variables, specifically Map Type and Task Type. The model performs notably better on cartograms than on choropleth maps for Spatial Clusters (+5.76) and Retrieve Value (+3.63), likely due to the simplified spatial layout of uniform hexagonal bins, which reduces confounding visual variance. In contrast, symbol maps prove most challenging, especially for Pairwise Point Comparison and Retrieve Value, revealing difficulty in fine-grained size discrimination. Interestingly, for Spatial Clusters, performance on symbol maps exceeds that on choropleth maps (+0.60), suggesting the model can more reliably group similarly sized symbols than interpret similarly colored, variably sized regions.
Together, these findings highlight that MLLMs struggle not only with legend parsing but also exhibit modality-specific weaknesses in visual grounding depending on the map type and visual encoding strategy.
Revision Plan: We can expand this analysis in the appendix and provide specific examples (like the interaction analysis case study) to better understand failure modes and guide targeted improvements.
2. Use of Different Domain
Our study targeted low-level visual analytical tasks that relied purely on visual encoding interpretation, deliberately excluding domain knowledge requirements. The questions remained consistent across different map themes, which allowed us to isolate performance variations that may have stemmed from the models’ internal priors rather than differences in question content (see Section 5, Figure 3, variable "Theme"). Domain-specific questions would have shifted the focus toward high-level reasoning, fundamentally changing the research question from "Can MLLMs read visual map information?" to "Can MLLMs perform domain-expert analysis?" To maintain this focus and investigate topic-based bias, we deliberately chose real-world datasets instead of synthetic data. We aimed to explore whether MLLMs performed differently based on their prior knowledge of thematic domains (e.g., demographics vs. environmental data), given that the questions remained unchanged. Using realistic contexts enabled systematic probing of this bias, which would not have been possible with random synthetic data lacking meaningful semantic content.
3. Post-processing Details
Yes, postprocessing was manually performed across the entire test dataset for all model responses to ensure consistent evaluation. Despite explicit formatting instructions, models occasionally included extraneous reasoning, explanations, or hallucinations.
Examples:
- Question: What is the attribute class of WA? → Raw Response: "To answer this question, I need to: [reasoning steps...] After examining the map and legend, Washington's circle falls into the second largest category, which corresponds to the 99.9–163 range. My answer is 99.9–163" → Cleaned Response: [99.9, 163]
- Question: True or False: In the Midwest zone, the range of attribute value is between 26.5 to 30.4 → Raw Response: "No" → Cleaned Response: FALSE
Manual validation ensured: (1) correct answer extraction, (2) consistent formatting, (3) fair evaluation without penalizing formatting variations.
Data Availability: Raw and cleaned responses are available in Supplementary Materials (OSF: "MLLM Responses > Baseline").
4. Consistent Tense
We will review the text to ensure tense consistency throughout.
5. Gemini 1.5 Pro Performance
Gemini's below-random performance likely stems from response bias mismatch: ground truth skews toward False (1221 vs. 779 True), but Gemini systematically favors True predictions (981 vs. 639 False). Additionally, the model may struggle with uncertainty and negation—common zero-shot binary classification challenges.
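A rough back-of-the-envelope calculation illustrates the point: if we assume, purely for illustration, that the predictions are independent of the ground truth, the bias mismatch alone already pushes expected accuracy below 50%. (The quoted prediction counts do not sum to 2,000, presumably because some responses could not be mapped to True/False.)

```python
# Counts quoted above.
gt_true, gt_false = 779, 1221       # ground-truth label distribution
pred_true, pred_false = 981, 639    # Gemini 1.5 Pro prediction distribution

p_gt_true = gt_true / (gt_true + gt_false)
p_pred_true = pred_true / (pred_true + pred_false)

# Expected accuracy if predictions were independent of the ground truth:
# P(pred=T) * P(gt=T) + P(pred=F) * P(gt=F)
expected_acc = p_pred_true * p_gt_true + (1 - p_pred_true) * (1 - p_gt_true)
print(f"{expected_acc:.3f}")  # ~0.477, i.e. below the 0.5 random baseline
```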
Thank you for the response! Regarding using different domains, "whether MLLMs performed differently based on their prior knowledge of thematic domains" seems to be a little contradictory to what you mentioned earlier on "targeting low-level visual analytical tasks without internal priors". I agree with the authors' thoughts on using real-world data. But to exclude topical bias from considerations or as a way to scale up, randomly generated maps could be something to consider to assess the models' ability for only low-level visual signals.
Also, it would be great to see more qualitative analysis and improvement suggestions in the revision. I have adjusted my score.
Thank you very much for your thoughtful feedback and for adjusting your score. We truly appreciate it! We agree that synthetic data could be a valuable direction for future work, especially for scaling up studies where the “Theme” variable is less critical, and we see a strong merit in that suggestion.
Regarding qualitative analysis and model improvement insights: we’ve already conducted additional analyses like the ones outlined in our earlier response (e.g., Claude 3.5 Sonnet's performance across task–map combinations and failure cases from ablation studies). We plan to incorporate these into Appendix A.3.2 in the camera-ready version to provide clearer examples of error types and design-driven limitations.
The topic of the paper is visual question answering with maps (Map-VQA), in particular extending the range of VQA to more map types than have been considered in the past: earlier work focused almost exclusively on choropleth maps, whereas this paper additionally covers cartograms and proportional symbol maps. An important part of the paper is the creation of a new Map-VQA benchmark called MapIQ, which consists of 14,706 question-answer pairs. The corpus covers six different themes and is used to evaluate six different visual analysis tasks, including a human baseline. Seven different multimodal LLMs (MLLMs), both closed-source and open-source, are evaluated on the new MapIQ benchmark. The major outcomes (Figs. 2 and 3) show that humans are quite good at answering map-related VQA questions across map types and task types, while essentially all tested MLLMs remain far behind human performance. Although the closed-source models perform better on average than the open-source ones, some open-source models, e.g., Qwen2-VL, are competitive.
Reasons to Accept
- new, interesting benchmark for Map-VQA
- zero-shot evaluation of different MLLMs, both closed-source and open-source
- evaluation across several themes and different map types
Reasons to Reject
- As far as I understand, all questions are instances of a small set of question templates: for each task type, three to four question types are considered, and for each of these only one question template is defined (see Figure 7). This means there is a very small (actually no) range of question paraphrases. Since the question templates seem to be very task- and domain-specific, the current experimental setting does not consider the robustness of the MLLMs with respect to question paraphrases.
- Some details are missing in Section 3.3 regarding how many human experts were involved and the inter-annotator agreement.
1. Standardized Question Templates
Our experimental setup was designed to reflect a zero-shot setting, where all models are evaluated on the same set of standardized, task-specific question templates. This ensures consistency and fairness across models, eliminating any confounding effects from prompt engineering or fine-tuning. If by “question paraphrases” you are referring to variations in wording or alternative prompt formulations (e.g., rephrased instructions or stylistic changes), we intentionally did not explore such variants. Incorporating paraphrases would shift the focus toward prompt optimization strategies. Our goal was to assess inherent model capabilities under controlled conditions, not their performance under varying prompt formulations. That said, we agree that evaluating model robustness to paraphrasing is a valuable direction, and our dataset and framework can readily support such future extensions. We will update the future work section to recognize the need for this.
2. More Details on Human Validation and Human Baseline Evaluation
In Section 3.3, a single human expert was involved in validating the consistency and accuracy of the ground truth responses extracted programmatically using Python scripts. Since this step focused on verifying deterministic outputs from scripted logic, inter-annotator agreement was not applicable. For Section 4 (Human Baseline Evaluation), two expert map readers independently answered the map-based questions. Given the small annotator pool, we did not compute inter-rater agreement. Instead, we averaged their performance metrics across task types, following the same aggregation strategy used for reporting MLLM performance in Figure 2. This provided a reliable and practical estimate of human performance without requiring additional annotators. We will update Sections 3.3 and 4 in the manuscript to reflect these clarifications.
Important Distinctions:
- Different personnel: The ground truth response validation was carried out by one human, who was distinct from the two humans involved in the human baseline evaluation.
- Different purposes: Validation ensures fair evaluation; human baseline establishes performance benchmarks.
- Different scope: Validation covered 100% of the test set; human baseline covered 9% for benchmarking.
The paper introduces MapIQ, a new benchmark for evaluating multimodal large language models (MLLMs) on map-based visual question answering (Map-VQA). It expands beyond prior benchmarks by including three map types—choropleth, cartogram, and proportional symbol maps—and questions derived from six thematic domains like crime, housing, and environment. The benchmark consists of over 14,000 QA pairs targeting six different visual analytic tasks (e.g., value retrieval, spatial clusters, regional comparisons) and various question formats (binary, multiple choice, list, single value). The authors evaluate seven MLLMs, including closed-source (Claude 3.5, Gemini 1.5, GPT-4o) and open-source (Qwen2-VL, InternVL2.5-MPO, etc.) models, comparing their performance across tasks, map types, and question formats against a human baseline. They also study model robustness by introducing 15 map design variations (e.g., rotated maps, removed legends, color flips). Claude 3.5 performs best overall, but all models significantly underperform humans, especially in tasks like range detection and interpreting proportional symbol maps.
Reasons to Accept
The benchmark fills a clear gap in current VQA datasets by extending evaluation from common charts (e.g., bar graphs) to geospatial maps, a domain with different cognitive demands and visual representations. Including three distinct map types and six task types allows for a much broader and more nuanced evaluation of MLLMs' spatial reasoning and visual grounding. The paper is also methodologically rigorous—it controls for thematic domains, question types, and task difficulty, and includes both closed- and open-source models, enabling fair comparisons. The added robustness study on map design variations is a thoughtful touch, revealing how fragile some models are to simple visual tweaks.
Another strength is the introduction of thematic bias analysis, probing whether MLLMs rely on real-world topic priors (e.g., assuming higher crime rates in certain regions) rather than actual map interpretation. This goes beyond accuracy and touches on model behavior, which is often underexplored. The dataset is also well-constructed: questions are grounded in geospatial literature, carefully templated, and validated, while the maps follow strong cartographic principles (e.g., consistent color schemes, spatial normalization), making the benchmark high-quality and reusable.
Reasons to Reject
While the benchmark introduces different map types, the task types themselves are still relatively low-level (e.g., identifying max/min values, comparing two states). The benchmark doesn't evaluate more complex or realistic tasks like navigation, multi-hop spatial reasoning, or causal inference across maps, which could better showcase the capabilities of MLLMs. There's also a lack of diversity in geographic resolution—it only includes U.S. state-level maps, missing finer-grained (e.g., county or city) or international data.
The analysis of model failure modes is somewhat shallow. Although Claude 3.5 performs best, the paper doesn't deeply explore why certain models fail—e.g., whether it’s due to poor legend parsing, weak visual grounding, or reliance on text priors. Some question types (like single-value answers) seem to trip up models significantly, but the authors don’t investigate how to fix that.
Very Minor: While seven models are tested, many of the stronger open-source MLLMs (e.g., Qwen2.5VL, recent Gemini API variants) are missing, and the zero-shot-only evaluation setup may underrepresent models’ true potential.
1. The benchmark doesn’t evaluate more complex or realistic tasks (e.g., navigation, multi-hop reasoning).
We appreciate this suggestion, but our study targets low-level visual analytical tasks (Section 2, Appendix A1.2) that rely on visual encoding interpretation, deliberately excluding higher-order reasoning. This design isolates models’ core visual comprehension skills without domain knowledge.
Tasks like navigation or causal inference involve high-level analytical reasoning, requiring complex cognitive processes and domain-specific knowledge. Evaluating those would shift the research question from "Can MLLMs read visual map information?" to "Can MLLMs perform advanced spatial reasoning?" Our current approach builds a baseline for visual understanding, which is essential before assessing more complex reasoning capabilities.
2. The benchmark lacks geographic diversity—it only uses U.S. state-level maps.
We acknowledge this and chose U.S. maps as a deliberate starting point:
- Exposure: U.S. data likely dominates MLLM pretraining. If models struggle with familiar maps, greater challenges can be expected with underrepresented regions. This setup provides a conservative baseline.
- Extensibility: Our framework easily scales to city, county, or international levels by substituting underlying data while maintaining task types and evaluation strategy.
- Future Work: We noted this limitation in the paper and believe expanding to international and finer-resolution data is a valuable next step.
Even with the most familiar geography, models exhibit notable performance gaps, validating our benchmark’s challenge level.
3. The analysis of model failure modes is somewhat shallow.
We agree and recognize the need for deeper failure analysis. However, several challenges shaped our current scope:
- Methodological Constraints: MLLMs—especially closed-source models like GPT-4V or Gemini—offer limited transparency. We cannot inspect internal representations, making in-depth diagnosis difficult.
- Current Analysis: We conducted ablation studies for Claude 3.5 Sonnet and Qwen2VL (Section 6, Appendix A.3.2), modifying visual elements like legends and color schemes. Results showed performance degradation when legends were small or color schemes were flipped/divergent—indicating weak visual grounding and limitations in legend parsing.
- Next Steps: We plan to expand this section with more insights and clearly tie them to MLLM limitations. Deeper per-model studies could offer further improvements.
4. Some question types (e.g., single-value answers) trip up models, but the authors don’t explore fixes.
We agree that improving performance on harder question types is important, but this falls outside our zero-shot benchmarking objective.
- Design Intent: We deliberately avoided question-specific tuning to assess core capabilities across models.
- Optimization Strategies: Techniques like few-shot learning or chain-of-thought prompting could help but would shift our goal from benchmarking to optimization.
- Future Work: We acknowledge this as a limitation and plan to support future work exploring such strategies. Our benchmark offers a standard evaluation foundation for doing so.
5. Why were newer models (e.g., Qwen2.5VL, recent Gemini) not included; zero-shot justification?
We selected the most representative models available when the study began. The rapid pace of MLLM updates makes continuous inclusion impractical without delaying research.
- Model Set: We included a diverse mix of open/closed models with different architectures and performance tiers for broad coverage.
- Benchmark Extensibility: The dataset and protocol are reusable, so future models can be evaluated comparably.
- Zero-Shot Rationale: We emphasized zero-shot settings to reflect realistic use without fine-tuning and avoid introducing confounding factors.
Thanks for the detailed reply. I was already satisfied with the current benchmark (score 7 in the beginning), and I just posted my comments to potentially further improve this work. It seems the authors prefer to keep their current scope and not change/add anything. I respect that, and I won't change the score.
Thank you for your thoughtful comments and for the high score. We truly appreciate your constructive suggestions. Your feedback has been valuable and will certainly inform future iterations of this work. Thank you again for your engagement and support!
This paper proposes a new benchmark for testing multimodal large language models based on questions about different types of maps, going beyond past work, which tested less detailed visual setups. The authors analyze a large set of models and highlight their main errors. Reviewers agree that this benchmark fills a gap in the current literature and will be useful for practitioners.