Do Vision-Language Models Really Understand Visual Language?
Abstract
Reviews and Discussion
This paper argues that LVLMs' good performance in diagram reasoning mainly comes from pre-existing background knowledge rather than a genuine understanding of the diagrams' relational structure.
The authors develop a test suite specifically to assess diagram comprehension:
- Their evaluations focus on two main tasks: fast recognition and slow, multi-step reasoning.
- Their evaluations are performed not only on synthetic diagrams but also on real diagrams.
- Their evaluations cover open-source models as well as large API-accessible models.
Based on their experiments and analysis, the authors conclude that models can accurately recognize and reason about entities in the diagrams, while they struggle to recognize the relations.
Questions for Authors
I have some questions regarding this paper:
- Can the authors explain why using text labels performs better on QA in Table 2?
- Can the authors provide performance on QA for relations using visual icons, like Table 3?
- Did the experiments in Table 4 use text labels, visual icons, or both?
- For the synthetic diagrams, do they only have text entities or visual entities? Can the authors provide some results on synthetic diagrams with both text entities and visual entities?
Claims and Evidence
The claims made in the submission are supported by clear and convincing evidence. The authors show details about how they set up each experiment.
Methods and Evaluation Criteria
The proposed evaluation criteria make sense.
Theoretical Claims
Their theoretical claim (LVLMs rely on knowledge shortcuts when answering complex diagram reasoning questions) is supported by their detailed experiment designs and analyses.
Experimental Design and Analysis
The main contribution of this paper is its evaluation suite, from data creation to experiment implementation.
Using both synthetic diagrams and real diagrams makes sense. However, please answer my questions about the synthetic diagrams in the sections below.
The experimental design begins with simpler tasks—such as identifying entities and relationships in synthetic diagrams—and gradually progresses to more complex tasks involving real diagrams. Additionally, the authors conduct both quantitative and qualitative analyses on knowledge shortcut experiments, demonstrating that LVLMs may not truly understand diagrams but instead rely on ingrained shortcuts to generate answers.
Supplementary Material
I reviewed the supplementary materials. The authors present test suite details, related work, as well as some limitations of their work.
Relation to Prior Work
The key contribution of the paper is to raise concerns about LVLMs on complex diagram reasoning tasks, which relates to existing findings on current LVLMs' reasoning abilities over both 2D and 3D images.
Essential References Not Discussed
The essential references are discussed by the authors in their supplementary materials.
Other Strengths and Weaknesses
Paper Strengths:
- The paper is well-structured, and the topic is interesting.
- The authors show the detailed implementation for each experiment.
- The authors argue their claims logically, from synthetic diagrams experiments to real diagrams experiments.
Major Weaknesses (please also refer to the Questions For Authors):
- Limited analysis of failure cases.
- Missing details on some key settings.
Other Comments or Suggestions
Here are some minor suggestions:
- I strongly suggest that the authors highlight some abbreviations, such as NR and NC.
- L181: it seems like you only have one Intuition, so why label it as Intuition 1?
We thank the reviewer for acknowledging that our claims are clear and convincing. We will address the proposed concerns and answer the questions below. Additionally, we are pleased to inform you that we have extended our evaluation model set to include 9 models, as suggested by Reviewer 8S38 and Reviewer 3mRA, with consistent conclusions, further enhancing the reliability of our work.
Point 1: Limited analysis of failure cases.
Response: We agree with the reviewer that a more detailed failure case analysis would provide valuable insights into why the models fail. We would like to kindly remind the reviewer that Section 5 already includes a qualitative analysis. Additionally, we have provided examples, along with our prompts and the models' responses, for each experiment in the Appendix (Figures 11–34). While these figures do not provide a thorough analysis for each case, they offer readers a clearer understanding of why the models fail by showcasing their responses. We will also emphasize this point in the main paper.
Question 1: Can the authors explain why using text labels performs better on QA in Table 2?
Response: We compare the performance of entity recognition on synthetic diagrams, distinguishing between text entities and visual entities. The results show that models can perfectly identify text entities. We believe this is due to the models being specifically trained on numerous OCR datasets (as mentioned in their technical reports). As a result, their ability to recognize and extract text is significantly stronger compared to their ability to recognize visual elements.
Question 3: Did the experiments in Table 4 use text labels, visual icons, or both?
Response: Based on the results in Table 2, we observe that the model performs well in recognizing text entities. To ensure that our subsequent experiments—such as exploring the models' abilities in more complex tasks like counting and relation recognition—are not influenced by entity recognition errors, we default to representing entities in text for the remainder of our experiments.
Question 2 and 4: Can the authors provide performance on QA for relations using visual icons, like Table 3? For the synthetic diagrams, do they only have text entities or visual entities? Can the authors provide some results on synthetic diagrams with both text entities and visual entities?
Response: We appreciate the thoughtful questions and thank the reviewer for carefully reading our paper. The underlying logic is that since models can accurately recognize text entities, any failure to recognize relations likely stems from a lack of relation recognition ability, rather than being bottlenecked by entity recognition. We understand the reviewer’s concern regarding entity representation. To address this, we report the performance of entity/relation recognition on synthetic diagrams with entity representations of text, visual, and both. Additionally, to better align with the results, we reformat the multi-choice QA into binary classification (i.e., with only one negative option).
| GPT-4V | Precision (Visual) | Recall (Visual) | F1 (Visual) | Accuracy (Visual) | Precision (Text) | Recall (Text) | F1 (Text) | Accuracy (Text) | Precision (Both) | Recall (Both) | F1 (Both) | Accuracy (Both) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Entity recognition | 97.37 | 84.52 | 89.54 | 91.12 | 99.86 | 99.75 | 99.78 | 99.79 | 99.86 | 99.75 | 99.78 | 99.79 |
| Relation recognition | 55.98 | 48.82 | 51.58 | 55.77 | 54.82 | 40.33 | 45.28 | 55.83 | 55.05 | 47.00 | 49.99 | 56.42 |
The results reveal two key observations. First, representing entities with both text and visual information yields similar performance to text-only entities, and the model still struggles with relation recognition regardless of entity representation. Second, the recall for visual entity recognition is relatively lower, indicating that the model cannot perfectly identify the correct visual entities, rather than misclassifying wrong entities as correct ones. We hope these updated results help address the reviewer’s concern regarding entity representation.
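For completeness, the sketch below illustrates how the binary-classification metrics reported above can be computed once the model's yes/no answers are parsed. It is a minimal, hypothetical example; the labels and predictions shown are placeholders, not taken from our actual evaluation code.

```python
# Minimal, hypothetical sketch of the binary recognition metrics above.
# Each query asks whether a single candidate entity/relation appears in the
# diagram; the model's "yes"/"no" answer is parsed to 1/0 and compared with
# the gold label. The lists below are illustrative placeholders.
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

gold = [1, 1, 0, 1, 0, 0, 1, 0]  # ground-truth presence of each candidate
pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model's parsed yes/no answer

print(f"Precision: {precision_score(gold, pred) * 100:.2f}")
print(f"Recall:    {recall_score(gold, pred) * 100:.2f}")
print(f"F1:        {f1_score(gold, pred) * 100:.2f}")
print(f"Accuracy:  {accuracy_score(gold, pred) * 100:.2f}")
```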
Thanks for the authors' response. I will keep my score.
Dear Reviewer en7J:
Thank you for taking the time to review our paper and for your initial comments.
We carefully addressed all the concerns you raised in our rebuttal, including providing new experimental results and clarifications. While we appreciate your brief follow-up, we were hoping for a bit more engagement—especially to confirm whether our detailed responses resolved your questions. If any concerns remain, we would be more than happy to further elaborate or clarify.
Thank you again for your time and consideration.
Best regards,
The authors
The paper introduces a new benchmark dataset for vision-language models. The benchmark is based on diagram understanding. Unlike most existing diagram understanding datasets, the dataset introduced in this work is specifically designed to study the understanding of relationships between depicted entities by using graphs whose vertices are entities (e.g., objects) connected with lines (edges in the graph). The paper studies the performance of existing vision-language models on various question types and subsets (e.g., real vs. synthetic images / entities vs. relations / edges random vs. semantic / etc.). The paper uses the results as evidence to argue that the existing models do not truly understand diagrams but that they use shortcuts to answer questions.
Questions for Authors
N/A
Claims and Evidence
The claims are supported by clear evidence and are convincing to me.
Methods and Evaluation Criteria
The paper introduces a benchmark dataset itself. The paper uses the performance of models on this data to derive the main hypothesis in this paper, which is that existing vision-language models use shortcuts rather than truly understanding visual information in diagrams.
Theoretical Claims
N/A
This is an empirical study.
Experimental Design and Analysis
The experiments are fairly comprehensive and seem sound overall. The evidence brought forth in support of the paper's hypothesis is derived from multiple different performance evaluations on variations of the images and questions.
The paper uses a mix of graph images generated from synthetic and real imagery to support the hypothesis that real imagery is more amenable to learning shortcuts by leveraging background knowledge as opposed to the information derived from the image itself.
Supplementary Material
I reviewed the supplementary material but did not read all the presented points, variations and examples in detail.
Relation to Prior Work
All the relevant similar existing work is discussed in a Related Works section located in the Appendix. This is a bit strange and unconventional, but probably OK.
Essential References Not Discussed
I am not an expert on recent diagram understanding benchmarks, but the presented related works section seems reasonably comprehensive.
Other Strengths and Weaknesses
The paper is very well written and well-structured, and exceptionally easy to follow as a result.
Not a lot of models are evaluated (LLaMA, LLaVA, Qwen2, and three closed cloud models). As this is a benchmark paper, I would say this is the bare minimum. Considering the huge number of vision-language models that have been released in recent times, it would be great if the sample of tested models were larger; if the results held up, that would increase the confidence in the hypothesis brought forth in this work.
The use of word2vec to generate synthetic relations to test for the use of background knowledge seems like a very nice and clever idea.
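As an aside, the sketch below shows how I understand this idea. It is my own illustration using gensim, not the authors' code; the entity list, pretrained model, and similarity threshold are all assumptions, and the paper's actual procedure may differ. The point is that word-vector similarity can separate "semantic" edges, which background knowledge alone could guess, from "random" edges that can only be read off the diagram.

```python
# Illustrative sketch (not the authors' code): use word2vec similarity to split
# candidate entity pairs into "semantic" edges (guessable from background
# knowledge) and "random" edges (only answerable by reading the diagram).
# Entities, model choice, and the 0.4 threshold are assumptions for this sketch.
import itertools
import random

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pretrained word2vec vectors

entities = ["dog", "bone", "car", "cloud", "leash"]
pairs = [(a, b) for a, b in itertools.combinations(entities, 2)
         if a in wv and b in wv]

semantic_edges = [(a, b) for a, b in pairs if wv.similarity(a, b) > 0.4]
remaining = [p for p in pairs if p not in semantic_edges]
random_edges = random.sample(remaining, k=min(3, len(remaining)))

print("semantic edges:", semantic_edges)
print("random edges:  ", random_edges)
```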
The results on knowledge inclusion are convincing enough in my opinion, but not as clear-cut as one would hope after reading the earlier parts of the paper.
Other Comments or Suggestions
N/A
We thank the reviewer for the very positive feedback. We have carefully designed our experiments to ensure they are comprehensive and our conclusions are as convincing as possible. We will address the proposed concern about the evaluation model set below.
Concern Point: Not a lot of models are evaluated...
Response: We are grateful for this constructive suggestion. As mentioned in our response to Reviewer 8S38, we have added three new models—Molmo-7B, Qwen2-VL-2B, and Qwen2-VL-72B—to all of our experiments. Our model set now includes three groups with 9 models: large models (GPT-4V, GPT-4o, and Gemini), small models (LLaVA-7B, Molmo-7B, LLaMA-11B), and models with three scales (Qwen2-VL with 2B, 7B, and 72B versions). The performance trends and our conclusions remain consistent, and we believe this expanded model set enhances the reliability of our conclusions.
This paper investigates the diagram comprehension capabilities of Large Vision-Language Models (LVLMs) by developing a comprehensive test suite across synthetic and real-world diagrams. The evaluation reveals that while LVLMs can accurately recognize entities, they struggle to understand relationships within diagrams. Further analysis suggests that their apparent reasoning ability is largely due to leveraging background knowledge as shortcuts, rather than genuine diagram understanding.
Questions for Authors
I have no further questions.
Claims and Evidence
Yes, the claims made in the submission are supported by clear and convincing evidence.
- LVLMs can accurately recognize entities but struggle to understand relationships within diagrams: supported by Tables 2-3.
- Reasoning ability is largely due to leveraging background knowledge as shortcuts: supported by Tables 4-7.
Methods and Evaluation Criteria
This paper investigates the diagram comprehension capabilities of Large Vision-Language Models (LVLMs) by developing a comprehensive test suite across synthetic and real-world diagrams. The evaluation reveals that while LVLMs can accurately recognize entities, their ability to understand relationships is notably limited, often relying on background knowledge as shortcuts rather than genuinely perceiving diagram structures. However, the paper does not clearly specify the scale of evaluated models, raising the question of whether larger models (e.g., 72B) might exhibit improved relational reasoning and diagram perception or if these limitations persist—or even worsen—with increased model size. Addressing this would clarify whether scaling alone mitigates these issues or if fundamental architectural improvements are needed for genuine diagram understanding.
Theoretical Claims
There are no theoretical claims.
Experimental Design and Analysis
The experimental design is comprehensive and easy to follow, effectively supporting the observations and final conclusions. However, the relationships tested in both synthetic and real-world images are primarily spatial and follow arrow-based dependencies. There are more complex relationships that could be explored, such as logical, hierarchical, or causal relationships, to provide a more holistic evaluation of LVLMs’ diagram comprehension capabilities.
Supplementary Material
Yes. I reviewed all sections in the supplementary material.
Relation to Prior Work
Refer to the "Essential References Not Discussed" section.
Essential References Not Discussed
Exploring the visual limitations of LVLMs is not a new topic, as many previous studies have addressed this issue. However, the authors fail to provide a direct comparison with these works. For example:
[1] Hulingxiao He et al., Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-Modal Large Language Models, ICLR 2025.
[2] Shengbang Tong et al., Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, NeurIPS 2024.
[3] Shengbang Tong et al., Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs, CVPR 2024.
[4] Xingyu Fu et al., BLINK: Multimodal Large Language Models Can See but Not Perceive, ECCV 2024.
Other Strengths and Weaknesses
This paper contributes to the diagram understanding of LVLMs, touching on visual reasoning and shortcut learning. The findings suggest that existing LVLMs lack true diagram understanding, opening avenues for research on improving relational reasoning in LVLMs through better graph-based representations. However, additional experiments are needed to make these conclusions more convincing: evaluating larger models to determine whether scaling mitigates these limitations, and expanding benchmarks to test more complex, non-spatial relationships beyond arrows and spatial positioning.
Other Comments or Suggestions
Refer to the "Other Strengths and Weaknesses" section.
We thank the reviewer for acknowledging that our experimental design is comprehensive and convincing. Our responses to the proposed concerns are provided below.
Point 1: ... raising the question of whether larger models (e.g., 72B) might exhibit improved relational reasoning and diagram perception....
Response: We appreciate the constructive suggestions provided by the reviewer. We are pleased to report that we have now completed experiments on three additional models: Molmo-7B, Qwen2-VL-2B, and Qwen2-VL-72B. The results for these models have been incorporated into the paper (Tables 2, 3, 4, 5, 6, 7, 9, and 11). Overall, the trends and conclusions remain consistent (Molmo performs similarly to LLaMA, and Qwen-2B and Qwen-72B perform similarly to Qwen-7B). We will briefly present some of their scores here (Tables 2, 3, and 4).
Table 2 (entity understanding on synthetic diagrams)
| Acc (%) | Text Entity (recognition) | Text Entity (counting) | Visual Entity (recognition) | Visual Entity (counting) |
|---|---|---|---|---|
| LLaVA | 38.9 | 26.6 | 46.4 | 30.8 |
| Molmo | 93.4 | 78.8 | 64.1 | 54.0 |
| LLaMA | 91.3 | 90.9 | 72.7 | 70.1 |
| Qwen-2B | 82.3 | 73.2 | 63.1 | 53.5 |
| Qwen-7B | 97.6 | 73.0 | 94.5 | 73.0 |
| Qwen-72B | 99.1 | 97.9 | 90.6 | 86.4 |
| GPT-4V | 97.8 | 99.6 | 85.7 | 93.7 |
| GPT-4o | 99.2 | 100.0 | 92.6 | 94.9 |
| Gemini | 88.1 | 95.8 | 87.7 | 86.5 |
Table 3 (relation understanding on synthetic diagrams)
| Acc (%) | Implicit Relation (recognition) | Implicit Relation (counting) | Explicit Relation (recognition) | Explicit Relation (counting) |
|---|---|---|---|---|
| LLaVA | 30.2 | 27.5 | 35.1 | 28.3 |
| Molmo | 71.7 | 31.2 | 59.1 | 50.4 |
| LLaMA | 75.4 | 32.0 | 55.2 | 46.1 |
| Qwen-2B | 63.3 | 29.8 | 44.0 | 33.1 |
| Qwen-7B | 74.4 | 59.0 | 59.8 | 51.5 |
| Qwen-72B | 77.9 | 67.1 | 70.3 | 63.8 |
| GPT-4V | 72.3 | 34.4 | 61.6 | 59.5 |
| GPT-4o | 77.3 | 55.3 | 76.6 | 70.2 |
| Gemini | 60.9 | 31.8 | 68.5 | 70.2 |
Table 4 (entity and relation understanding on real diagrams)
| Model | Entity (recognition) | Entity (counting) | Relation (recognition) | Relation (counting) |
|---|---|---|---|---|
| LLaVA | 56.5 | 37.3 | 45.1 | 30.2 |
| Molmo | 82.7 | 54.9 | 59.8 | 51.2 |
| LLaMA | 87.3 | 59.7 | 73.7 | 51.2 |
| Qwen-2B | 66.6 | 40.7 | 45.7 | 39.3 |
| Qwen-7B | 90.0 | 56.1 | 71.4 | 58.0 |
| Qwen-72B | 93.7 | 77.8 | 79.4 | 69.8 |
| GPT-4V | 88.9 | 78.8 | 78.7 | 59.9 |
| GPT-4o | 93.1 | 82.3 | 84.1 | 72.9 |
| Gemini | 85.0 | 68.4 | 80.5 | 57.7 |
Point 2: There are more complex relationships that could be explored, such as logical, hierarchical, or causal relationships, …
Response: While our synthetic diagrams may not include complex relationships, our real-world diagram dataset covers a wide range of relationship types (more details can be found in Figure 4 and Table 8 in Appendix D). Unfortunately, for rarer cases, we are constrained by the limitations of existing datasets, but we have included as much diversity in the relationships as possible.
Point 3: However, the authors fail to provide a direct comparison with these works. For example:...
Response: We sincerely appreciate the reviewer for pointing out these relevant works. We will include them in our related work section and provide a discussion on them.
Thank you for your detailed responses. My concern regarding large-scale model performance has been addressed. However, could you further elaborate on the novelty of your work? As I previously noted, the visual limitations of vision-language models (VLMs) have been explored in earlier studies. Could you provide a brief discussion comparing your approach with the related works I cited [1–3] and clarify what sets your method apart?
We’re glad to hear that the new results have addressed the reviewer’s concern. Regarding the comparison with the cited works, our approach differs in several important ways:
[1] focuses on evaluating the recognition capabilities of small open-source models (up to 10B), particularly in fine-grained visual recognition tasks. [2] emphasizes visual grounding and the role of vision encoders, introducing a new benchmark and improving model design based on those insights. [3] uncovers fundamental visual shortcomings in CLIP-based VLMs and proposes a solution by incorporating features from self-supervised vision models.
In contrast, our work centers on a different research question: we investigate why VLMs often fail at basic visual abilities—such as recognition and counting—despite appearing capable of reasoning over complex diagrams. Rather than focusing on fixing these issues, we design targeted experiments and curate datasets to explain these counter-intuitive behaviors.
In short, while the cited works aim to identify and improve limitations in VLMs, our goal is to understand why those limitations exist through carefully controlled evaluations. We believe this focus offers a complementary and novel perspective.
The paper presents a comprehensive and detailed study on whether LVLMs are able to understand visual diagrams. The authors develop a test suite of synthetic as well as real diagrams and test multiple LVLMs on it. The authors find that the models are able to identify entities well. However, the models have limited abilities in recognizing relations, especially synthetic relations and relations that cannot be shortcut with background knowledge. This suggests that existing LVLMs have limited abilities in diagram reasoning.
Questions for Authors
An interesting phenomenon I found in the numbers is that LLaVA performs very poorly on the benchmark compared with other models. However, I noticed that its scores on standard benchmarks are similar to LLaMA's and Qwen's. Any thoughts on this?
Also, can the models do much better on "relation" with some SFT with your synthetic data?
Claims and Evidence
Yes, the paper clearly presents all the claims, together with clear evidence.
Methods and Evaluation Criteria
Yes, the evaluation makes sense for diagram understanding.
Theoretical Claims
No theoretical claims.
Experimental Design and Analysis
Yes, I checked each one and the paper is sound.
Supplementary Material
I checked it; it contains the code and benchmark data.
Relation to Prior Work
Many prior works on LVLMs relate to diagram understanding; for example, popular benchmarks like MMMU and ChartQA contain diagram understanding problems. This paper focuses on entities and relations and reveals that existing models still have a long way to go in recognizing and reasoning about relations.
Essential References Not Discussed
No
Other Strengths and Weaknesses
The paper is overall well-written and clearly structured. The conclusions are sound and interesting and reveal a particular weakness of existing LVLMs. The experiment design is interesting and informational.
One weakness is that the scope of the problem covered is limited, with only entities connected by relations. It would be nicer if the authors could discuss more general and complex diagrams, which would have a bigger impact on the community.
Other Comments or Suggestions
Page 4, line 172, right column: missing a space in "detailsof".
We would like to thank the reviewer for acknowledging the clarity and robustness of our work. Below, we provide our responses to the concerns and questions raised. Additionally, in line with the suggestions from Reviewer 8S38 and Reviewer 3mRA, we have expanded our evaluation model set to include 9 models, and the results remain consistent, further reinforcing the validity of our findings.
Point 1: the scope of the problem covered is limited, … if the authors could discuss more general and complex diagrams, …
Response: We agree with the reviewer that expanding the range of diagram types discussed could be valuable for the community. In our work, we design two evaluation settings: one with synthetic data and another with real-world data. For the synthetic evaluation, we focus on a typical diagram type (entity-relation) to ensure clarity and intuitiveness. For the real-world evaluation, we gather data from six domains, incorporating various diagram types from existing works (Figure 4 and Table 8 in Appendix D provide additional details). While there is potential to include more general or complex diagram types, our work is constrained by the availability of existing diagram datasets. Nonetheless, we have aimed to cover as diverse a range of diagram types as possible.
Question 1: ... LLaVA performs very poorly on the benchmark compared with other models. However, I noticed that its scores on standard benchmarks are similar to LLaMA's and Qwen's. Any thoughts on this?
Response: We appreciate the reviewer for carefully reviewing our paper. Indeed, the performance of LLaVA is significantly lower than that of the other models. We believe this can be attributed to differences in the training patterns. Unlike the other models, LLaVA is primarily trained with real-world images, while the other models incorporate augmented synthetic data and reinforcement learning techniques, such as DPO, which likely enhance their generalizability. We hypothesize that these additional training phases contribute to the stronger performance observed in these models on our benchmark.
Question 2: ... can the models do much better on "relation" with some SFT with your synthetic data?
Response: Although SFT with our data could likely improve the evaluation accuracy on our benchmark, we do not believe it would lead to true diagram understanding ability or generalizability in practice. In our work, our primary focus is to highlight the weaknesses of existing Vision-Language Models (VLMs). Addressing how to improve these models is a broader topic, and we encourage future work to explore potential solutions.
Thanks for the clarification! I will maintain my overall positive scores.
This paper proposes a benchmark for evaluating VLM performance on diagram comprehension, including both synthetic and real-world diagrams. Analysis on the benchmark with existing VLMs show that they mostly fail to represent relational information about entities in diagrams, and that current success in this task is mostly due to background knowledge rather than genuinely understanding diagram contents. The benchmark could be a useful addition to existing work in studying the limitations of modern VLMs, although as pointed out by a couple of reviewers, the scope of diagrams included in the dataset is relatively limited, focusing on relations between entities. To the extent that ICML is welcoming benchmark/dataset/analysis papers, I think this is a relatively solid contribution.