PaperHub
Rating: 5.5/10 · Poster · 4 reviewers (min 5, max 6, std. dev. 0.5)
Individual ratings: 6, 6, 5, 5
Confidence: 4.0
Soundness: 2.8 · Contribution: 2.8 · Presentation: 2.5
NeurIPS 2024

GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We pioneer a vision-language graph reasoning setting in which VLMs compete with LLMs, showcase the benefits of vision for language-based graph reasoning, and provide a framework to derive graph VQA benchmarks from existing data.

Abstract

Large Language Models (LLMs) are increasingly used for various tasks with graph structures. Though LLMs can process graph information in a textual format, they overlook the rich vision modality, which is an intuitive way for humans to comprehend structural information and conduct general graph reasoning. The potential benefits and capabilities of representing graph structures as visual images (i.e., visual graphs) are still unexplored. To fill the gap, we innovatively propose an end-to-end framework, called Graph to vIsual and Textual IntegrAtion (GITA), which is the first to incorporate visual graphs into general graph reasoning. Besides, we establish the Graph-based Vision-Language Question Answering (GVLQA) dataset from existing graph data, which is the first vision-language dataset for general graph reasoning purposes. Extensive experiments on the GVLQA dataset and five real-world datasets show that GITA outperforms mainstream LLMs in terms of general graph reasoning capabilities. Moreover, we highlight the effectiveness of layout augmentation on visual graphs and pretraining on the GVLQA dataset.
Keywords
Graph Reasoning · Visual Question Answering · Large Multimodal Model

Reviews and Discussion

Review
Rating: 6

The paper introduces the GITA framework, which innovatively integrates visual graphs into general graph reasoning tasks for LLMs. By combining visual and textual information of a graph, GITA improves the comprehensibility and flexibility of graph reasoning, outperforming other LLM approaches. The authors also develop the Graph-based Vision-Language Question Answering (GVLQA) dataset, the first vision-language dataset for graph reasoning. The study also highlights the benefits of layout augmentation in visual graphs and pretraining on the GVLQA dataset.

Strengths

  1. Integrating visual graphs into graph reasoning tasks for LLMs is a novel approach. It represents a creative combination of visual and textual modalities to improve graph comprehension and reasoning. The creation of the GVLQA dataset is original and fills a gap in the currently available graph reasoning datasets by incorporating both visual and textual elements.

  2. Extensive experiments on both the GVLQA and five real-world datasets validate the effectiveness of the proposed framework, providing strong empirical evidence for the paper's claims.

  3. The paper is clearly written, with well-defined sections and logical flow. The detailed explanations of the GITA framework's components and the dataset construction process enhance understanding.

  4. By successfully integrating visual information into graph reasoning tasks, the paper addresses a significant limitation of existing LLMs and VLMs. This has the potential to substantially advance the field of graph reasoning.

Weaknesses

  1. The motivation of the framework is challenged. As for the commonly used graph reasoning methods such as GNNs, the authors state that "These methods often lack generalizability, flexibility, and user-friendliness." However, the advantages of GITA in these areas are not directly demonstrated. The zero-shot performance of LLM-based methods is poor for many questions (such as MaxFlow, SP, TS). Considering this, users need to fine-tune when addressing questions on a new dataset. Additionally, GITA requires data augmentation and manual template-based construction for task-specific queries. Therefore, it is not clear that LLM-based methods like GITA are better than GNNs in terms of generalizability, flexibility, and user-friendliness, and the motivation for using LLM for graph reasoning is not justified.

  2. While the paper compares GITA to various language models, it does not provide a comparison to dedicated GNN models that are designed specifically for graph reasoning tasks. A comparison to GNNs would help demonstrate the performance gains of GITA and its advantages or disadvantages.

  3. The paper mentions k-hop subgraph sampling for handling large graphs but does not provide a detailed analysis of GITA's scalability or performance as graph sizes increase. A more comprehensive evaluation of scalability, potentially including comparisons to dedicated graph methods, would be valuable for assessing the practical applicability of GITA in real-world scenarios.

  4. The alignment between visual and textual modalities could be explored. The paper notes performance degradation in some tasks for larger models due to potential alignment issues, but it does not provide a detailed analysis or solution. Further investigation into improving modality alignment is needed.

  5. Some places need double-check. For example, the meaning of GITA needs to be unified: the authors mention GITA as "Graph to vIsual and Textual Integration" in line 44, while they refer to it as "Graph to Image-Txt Assistant" in line 117. Another example is in line 120, which needs to be changed to "Firstly, V and D are designed to produce visual depictions and textual descriptions."

Questions

Please see the above weaknesses.

Limitations

The authors have acknowledged several key limitations of their work but could benefit from providing more comprehensive solutions and experimental evidence. For my suggestions for improvement, please see the "Weaknesses" section.

Author Response

Thank you for your thorough comments and insightful suggestions. We address your concerns and adopt your suggestions as follows:

W1: It is not clear that LLM-based methods like GITA are better than GNNs in terms of generalizability, flexibility, and user-friendliness, and the motivation for using LLM for graph reasoning is not justified.

In Appendix A, we have detailed GITA's flexibility, user-friendliness, and generalizability. Unlike GNNs, whose task-specific feature engineering and architecture adjustments require expertise in model architectures and coding, GITA employs a consistent architecture that simplifies adaptation to new tasks using language-based templates. This makes it accessible even to non-experts and significantly enhances flexibility. Moreover, GITA uses natural language for input and output, enabling intuitive graph reasoning through simple queries such as 'Is there a cycle in this graph?' with straightforward 'Yes' or 'No' answers, thereby improving user-friendliness compared to the human-unreadable vector representations/embeddings in GNNs.

Besides, we further demonstrate that GITA-ZS (zero-shot) also achieves promising performance. According to the results in Table 3 of the rebuttal supplement PDF, where 'PT' denotes advanced prompting techniques, namely in-context learning (ICL), chain-of-thought (CoT), and self-consistency (SC), its performance can be further improved both by upgrading the VLM reasoner (i.e., from GPT-4V to GPT-4o) and by applying these prompting techniques. Together with the results reported in the submission, this shows that GITA exhibits promising zero-shot capabilities (generalizability across tasks).

W2: While the paper compares GITA to various language models, it does not provide a comparison to dedicated GNN models that are designed specifically for graph reasoning tasks. A comparison to GNNs would help demonstrate the performance gains of GITA and its advantages or disadvantages.

Following your suggestion, we compare with dedicated GNNs, including GCN and SAGE, and present the results in Table 4 of the rebuttal supplement PDF. Compared to dedicated GNNs, zero-shot GITA (with prompting techniques) and the fine-tuned GITA-7B model have similar average graph reasoning performance. The larger GITA-13B model performs slightly better.

In particular, compared to GNNs, GITA shows a stronger ability to recognize local structures in graphs (Connect and Cycle) and to accomplish tasks with obvious layout heuristics (BGM). We believe this advantage comes from GITA's visual perception. For SP and MaxFlow, GITA performs worse than GNNs, possibly because GNNs process edge weights more effectively through their message-passing mechanism. For HP and TS, GITA-ZS performs best. These results will be included in the revision.

W3: The paper mentions k-hop subgraph sampling for handling large graphs but does not provide a detailed analysis of GITA's scalability or performance as graph sizes increase. A more comprehensive evaluation of scalability, potentially including comparisons to dedicated graph methods, would be valuable for assessing the practical applicability of GITA in real-world scenarios.

Following your suggestion, we conducted experiments to evaluate the scalability and performance of GITA and dedicated GNNs while increasing the number of hops k on the ca-HepTh dataset.

The results in Table 5 of the supplement PDF demonstrate that the scalability of GITA remains stable as the sampled graph size (i.e., k) increases. According to the accuracy reported in Table 6 of the supplement PDF, GITA, GCN, and SAGE all achieve their peak performance at k = 2, demonstrating that a small sampled graph size is sufficient. Dedicated GNNs show higher peak performance than GITA but degrade more when k becomes larger (e.g., 3 or 4), which shows that GITA's performance is more stable than that of dedicated GNNs with respect to k. We will include these results in the revision.

W4: The alignment between visual and textual modalities could be explored. The paper notes performance degradation in some tasks for larger models due to potential alignment issues, but it does not provide a detailed analysis or solution. Further investigation into improving modality alignment is needed.

Because existing Multimodal Large Language Models (MLLMs) are not inherently attuned to graph data, they require fine-tuning to effectively align vision and text inputs in a graph context. The effectiveness of this alignment during fine-tuning is influenced by the proportion of tunable parameters. Following LLaVA, the trainable parameter ratio of the larger GITA-13B is only 0.78%, much smaller than the 1.46% of GITA-7B. This limitation in tunable parameters may result in less effective alignment for GITA-13B than for GITA-7B, potentially leading to poorer performance of GITA-13B on some tasks.

To address this issue, we propose two solutions: one is to employ full fine-tuning, which directly tunes all parameters to align the modalities. The other is to apply the proposed GITA to more diverse graph data, thereby obtaining a richer source of vision-language data for alignment ([r1] illustrates that diverse and rich data can greatly improve alignment). These additional vision-language data can either be fine-tuned together with the current task data or be used for pretraining (as demonstrated in Section 5.3, using GVLQA checkpoints for real-world datasets).

[r1] Visual Instruction Tuning. NeurIPS, 2023
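
For illustration only, the trainable-parameter ratios quoted above (1.46% for GITA-7B vs. 0.78% for GITA-13B) can be measured with a generic PyTorch snippet like the one below; this is a hypothetical sketch, not the authors' code, and `model` stands for any adapter-tuned VLM.

```python
import torch.nn as nn

def trainable_ratio(model: nn.Module) -> float:
    """Fraction of parameters that receive gradients (e.g., LoRA adapters + projector)."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# The ratio shrinks as the frozen backbone grows (13B vs. 7B) while the adapter
# size stays roughly constant, which is consistent with the numbers above.
```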

W5: Some places need double-check, including the inconsistent full name of GITA and a typo in line 120.

Thank you for pointing out the inconsistent terms and notations. We will fix them in the revision.

Comment

I thank the author for the rebuttal. It addressed most of my questions and I am increasing the score to 6.

Review
Rating: 6

To fill the gap that LLMs overlook the rich vision modality of graph structures, this paper proposes GITA to incorporate visual graphs into general graph reasoning. A large graph vision-language dataset called GVLQA is constructed to boost general graph reasoning capabilities.

Strengths

  1. This paper involves a relatively large amount of work, proposing a large vision-language dataset, which can greatly promote the development of VLMs.
  2. Integrating visual graphs into VLMs holds significant value.

Weaknesses

  1. This paper appears to be quite technical or engineering-oriented, with the models largely based on existing tools, and lacks strong novelty in terms of methodology.
  2. This paper primarily focuses on integrating visual graphs into large models; however, it fails to provide any specific examples of visual graphs throughout the paper, making it difficult for readers to comprehend how the visual images improve performance.
  3. On page 3, line 120, does the 'G' at the end actually refer to 'D'? Some explanation should be provided.

Questions

Will the dataset be released in the future? If so, I would appreciate it.

Limitations

See the weaknesses.

Author Response

We acknowledge and appreciate your insightful review. Below, you can find our responses addressing your concerns point by point. If you have any additional questions or require further clarification, please feel free to let us know.

W1: This paper appears to be quite technical or engineering-oriented, with the models largely based on existing tools, and lacks strong novelty in terms of methodology.

We acknowledge that our work builds upon existing backbone models and tools. Incorporating visual information into general graph reasoning, however, is an interesting idea that has not been explored before. As a result, we have encountered several challenging issues during our work. For instance,

  1. how to maintain the usability of vision graphs while managing context length in large-scale graphs;
  2. how to balance consistency and variability in vision graphs, and the impact of specialized augmentations for vision graphs, etc.

To handle those issues that have never been explored before, we proposed the GITA framework and provided valuable empirical findings.

W2: This paper primarily focuses on integrating visual graphs into large models; however, it fails to provide any specific examples of visual graphs throughout the paper, making it difficult for readers to comprehend how the visual images improve performance.

The illustrations of the graphs are provided in Figure 2 as part of the case study, to help readers understand how vision plays a role in graph reasoning. We have also included examples of visual graphs generated by the different graph visualizer tools we implemented in Appendix C, as well as illustrations of these visual graphs for each subset of GVLQA in Figures 6-9 in Appendix G. We will highlight them in the revision.

W3: On page 3, line 120, does the 'G' at the end actually refer to 'D'? Some explanation should be provided.

Thank you for pointing out this typo. The 'G' at the end should be 'D'. We will correct it in the revision.

Q1: Will the dataset be released in the future? If so, I appreciate it.

Of course! The dataset will be released soon. Due to the submission policy, we do not provide the link in the submission.

Comment

Thanks for the rebuttal. My concerns are addressed; therefore, I maintain my score.

Review
Rating: 5

The paper introduces an end-to-end framework called Graph to Visual and Textual Integration (GITA) to visualize graphs in order to improve LLMs’ reasoning capabilities on graph tasks. GITA consists of three main components: a Graph Describer to translate a graph into a natural language description, a Graph Visualizer to visualize the graph as an image, and a Questioner to generate task-specific queries conditioned on the task information. GITA shows improvements compared to vanilla LLMs and VLMs on graph reasoning tasks and some real-world graph datasets.

Strengths

  1. The motivation is convincing. Just as humans reason over structured data, it is natural to visualize the structure first. The additional visual modality can provide rich information that can assist in reasoning over structured data. Therefore, generalizing this human behavior to LLMs/VLMs makes a lot of sense.

  2. The method is simple yet effective for some graph reasoning tasks on small graphs. The idea of visualization is a form of data augmentation.

Weaknesses

  1. The applications are limited. The proposed method is only applicable to tiny graphs or large graphs with k-hop sampling, which hurts its practical application value.
  2. The design of key components, including the Graph Describer, Graph Visualizer, and Task-specific Query, requires human efforts to adjust to fit different datasets.
  3. Lack of comparison with SOTA methods. The experiments only compared with LLMs and VLMs; some graph LLMs should also be compared, such as Graph Chain-of-Thought, GraphLLM, and GraphToken [1-3], etc.

[1] Graph Chain-of-Thought, https://arxiv.org/abs/2404.07103

[2] GraphLLM, https://arxiv.org/abs/2310.05845

[3] GraphToken, https://arxiv.org/abs/2402.05862

Questions

  1. Can you provide some statistics about the graph size (average number of nodes and edges) used in your dataset?
  2. In Table 1, is GITA-7B (VO) equivalent to Llava-7B?
  3. In Table 1, in the fine-tuning setting, how are the LLMs fine-tuned? Is GITA-7B fine-tuned on only the alignment projector or the whole model?

Limitations

See weaknesses above.

Author Response

We acknowledge and appreciate your insightful review. Below, you can find our responses addressing your concerns point by point. If you have any additional questions or require further clarification, please feel free to let us know.

W1: The applications are limited. The proposed method is only applicable to tiny graphs or large graphs with k-hop sampling, which hurts its practical application value.

In fact, k-hop subgraph sampling is a common practice in graph learning. Typically, k-hop sampling does not lead to performance degradation; on the contrary, k-hop subgraphs show a more dedicated awareness of local structural details. For example, ShaDow-GNN [1] indicates, from both theoretical and empirical analysis, that in graph data one can ignore distant neighbors (k >= 4), and the most effective k is 2 or 3.

To further illustrate, we conducted experiments showing the performance of both GITA and Vicuna while varying k on the large graph datasets ca-GrQc and ca-HepTh, with results in the following table. It is evident that both GITA and Vicuna reach their best performance at k = 2. Thus, increasing k does not necessarily enhance performance, and k-hop sampling does not lead to a decline in performance. Based on this observation, we think that the proposed GITA method is applicable to general graph reasoning tasks.

| Model | ca-GrQc | ca-HepTh |
| --- | --- | --- |
| Vicuna (k=2) | 78.95 | 89.85 |
| Vicuna (k=3) | 78.95 | 89.66 |
| Vicuna (k=4) | 76.53 | 85.24 |
| GITA (k=2) | 79.70 | 91.13 |
| GITA (k=3) | 79.67 | 90.31 |
| GITA (k=4) | 75.47 | 86.10 |

[1] Decoupling the Depth and Scope of Graph Neural Networks. NeurIPS, 2021.
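
As a concrete illustration of the k-hop sampling discussed above, here is a minimal sketch using networkx; it is a generic, hypothetical example, not the authors' implementation.

```python
import networkx as nx

def k_hop_subgraph(G: nx.Graph, center, k: int) -> nx.Graph:
    """Return the subgraph induced by all nodes within k hops of `center`."""
    return nx.ego_graph(G, center, radius=k)

# Toy example: on the path 0-1-2-3-4, the 2-hop neighborhood of node 0 is {0, 1, 2}.
G = nx.path_graph(5)
print(sorted(k_hop_subgraph(G, 0, k=2).nodes()))  # [0, 1, 2]
```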

W2: The design of key components, including the Graph Describer, Graph Visualizer, and Task-specific Query, requires human efforts to adjust to fit different datasets.

We think that the human effort required by GITA is negligible, for the following reasons.

  1. For a new dataset, the Graph Visualizer and Graph Describer in GITA can be used directly in a function-invoking manner. We provide default settings, so users do not need to design them.
  2. The task-specific template inside the Questioner is the only component requiring human effort. It only requires users to describe the task definition and the concrete meanings of the graph elements in user-friendly natural language.
  3. We also offer an automated approach that allows users to generate task-specific queries by prompting an agent like ChatGPT; an example of query generation for a custom gaming scenario is provided in Appendix E. Though this automated approach still requires human input to describe the task initially, this is a minimal and unavoidable requirement for any language-based method. (A minimal sketch of how these components compose is given after this list.)
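
To make the pipeline above concrete, here is a minimal, hypothetical sketch of how a Graph Visualizer, Graph Describer, and template-based Questioner could be composed. The function names, layout choice, and template are illustrative assumptions, not the exact GITA implementation.

```python
import networkx as nx
import matplotlib.pyplot as plt

def visualize(G: nx.Graph, path: str = "graph.png") -> str:
    """Graph Visualizer: render the bare structure as an image (layout is a free choice)."""
    nx.draw(G, pos=nx.spring_layout(G, seed=0), with_labels=True, node_color="lightblue")
    plt.savefig(path)
    plt.close()
    return path

def describe(G: nx.Graph) -> str:
    """Graph Describer: translate the structure into a textual edge list."""
    edges = ", ".join(f"({u}, {v})" for u, v in G.edges())
    return f"The graph has nodes {sorted(G.nodes())} and undirected edges {edges}."

def question(task: str) -> str:
    """Questioner: fill a task-specific natural-language template (one toy template shown)."""
    templates = {"cycle": "Is there a cycle in this undirected graph? Answer Yes or No."}
    return templates[task]

G = nx.cycle_graph(4)
vqa_sample = {"image": visualize(G), "text": describe(G) + " " + question("cycle")}
```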

W3: Lack of comparison with SOTA methods. The experiments only compared with LLMs and VLMs; some graph LLMs should also be compared, such as Graph Chain-of-Thought, GraphLLM, and GraphToken.

Graph Chain-of-Thought is not applicable to general graph reasoning tasks because it is built on knowledge graphs (KGs), while general graph reasoning tasks usually do not have a corresponding KG. GraphToken does not provide its data or code. Hence, in the following table, we provide a performance comparison with GraphLLM on a randomly chosen half-size GVLQA-Base subset.

| Model | Connect | Cycle | TS | SP | MaxFlow | BGM | HP | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GraphLLM-7B | 94.74 | 92.36 | 42.17 | 56.72 | 52.0 | 58.24 | 26.3 | 60.36 |
| GITA-7B | 99.05 | 97.48 | 44.11 | 33.05 | 24.89 | 93.37 | 28.15 | 60.01 |

Based on the results, for substructure-awareness tasks such as Connect and Cycle and for tasks with beneficial visual heuristics such as BGM, the visual modality introduced by GITA is more advantageous than the graph modality introduced by GraphLLM. GraphLLM shows its superiority on MaxFlow and SP, which may be because its graph transformer encoder is more effective at processing edge weights. Finally, GITA and GraphLLM show similar abilities on the sequential ordering tasks TS and HP, and comparable average performance. We will include this comparison in the revision.

Q1: Can you provide some statistics about the graph size (average number of nodes and edges) used in your dataset?

The average numbers of nodes and edges for each task in GVLQA are shown in the following table.

| Average / Task | Connect | Cycle | TS | SP | MaxFlow | BGM | HP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| #nodes | 25.01 | 23.42 | 21.86 | 13.65 | 13.90 | 21.13 | 13.24 |
| #edges | 95.46 | 23.66 | 114.10 | 23.99 | 49.16 | 51.03 | 45.05 |

Q2: In Table 1, is GITA-7B (VO) equivalent to LLaVa-7B?

GITA-7B (VO) represents a variant of GITA using LLaVA-7B to reason over the visual graph generated by the GITA Graph Visualizer and a direct question like "Is there a cycle in this undirected graph?", but without the textual description of the graph generated by the GITA Graph Describer.

Q3: In Table 1, in the fine-tuning setting, how are the LLMs fine-tuned? Is GITA-7B fine-tuned on only the alignment projector or the whole model?

We have introduced the detailed fine-tuning settings in both Section 5 and Appendix F of our submission. For the fine-tuning setting in Table 1, we fine-tune LoRA adapters for all weight matrices within the text decoder along with the alignment projector, while keeping the vision encoder of the VLM reasoner frozen.
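
As an illustrative sketch of such a setup (assuming the Hugging Face peft library; the tiny stand-in model and its module names are hypothetical, not the authors' training code):

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Tiny stand-in for a VLM: a vision encoder, an alignment projector, and decoder weight matrices.
class TinyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(64, 32)
        self.projector = nn.Linear(32, 32)
        self.q_proj = nn.Linear(32, 32)
        self.v_proj = nn.Linear(32, 32)

    def forward(self, image, text_hidden):
        vision_feat = self.projector(self.vision_encoder(image))
        return self.q_proj(text_hidden) + self.v_proj(vision_feat)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # LoRA adapters on the decoder weight matrices
    modules_to_save=["projector"],        # the alignment projector is trained in full
)
model = get_peft_model(TinyVLM(), lora_cfg)
# get_peft_model freezes all remaining parameters, so the vision encoder stays frozen.
model.print_trainable_parameters()
```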

Comment

Thank you to the authors for the rebuttal. I have carefully reviewed it, as well as the responses to the other reviewers.

Regarding W1, my concern about the limited applications remains. I acknowledge that k-hop sampling is a widely adopted approach and may not lead to performance degradation in some cases. However, for datasets like Cora and CiteSeer, the reported accuracy is much lower compared to pure GNN methods[1] (Table 3). I am uncertain whether this is due to the k-hop sampling or because GITA only provides graph structures without node/edge attributes. If it is the latter, then GITA may not be effective enough for handling graphs with attributes, whether in numerical or text format.

For W3, the average performance of GraphLLM and GITA is similar. I am conservative about the contribution of GITA over existing methods.

For the new experiments during the rebuttal, the comparison with GNNs in Tables 5-7 is an important baseline and should be displayed in the main text. Besides, for small datasets (Table 5), simple GNN methods have performance very close to GITA. For large datasets (Tables 6, 7), pure GNN methods show better performance and significantly better efficiency. I am skeptical about the necessity of using an LLM with visual information to address these tasks, given that simple GNNs already perform well with great efficiency.

For W2 and Q1-Q3, my questions are well addressed.

[1] Cora leaderboard, https://paperswithcode.com/sota/node-classification-on-cora

Comment

Q2. For W3, the average performance of GraphLLM and GITA is similar. I am conservative about the contribution of GITA over existing methods.

We would like to highlight that the contributions of GITA are unique compared with existing methods such as GraphLLM.

1. Pioneering Use of Vision in Language-Based Graph Reasoning: GITA is the first to explore the effectiveness of vision in language-based graph reasoning. We believe that this vision benefit (our contribution) is not covered by existing methods like GraphLLM. As illustrated, our experiments show that GITA and GraphLLM excel in different types of graph understanding tasks. Therefore, the methods proposed by us (vision modality) and by them (graph modality) are expected to be combined and complement each other in our future work.

2. Superior Performance: Though GraphLLM and GITA-7B have similar average performance, when applying our proposed vision augmentation, GITA-VO (AUGLY) can be more powerful than GraphLLM in overall performance (i.e., 63.36% in Table 2 vs. 60.36%). Besides, zero-shot GITA can also achieve average performance comparable to GraphLLM (59.73% in rebuttal supplement Table 3), while being training-free.


Q3. For the new experiments during rebuttal, for table 5-7, the comparison with GNNs is an important baseline and should be displayed in the main text.

Following your suggestion, we will add them to the main text in the revision.

Q4. I am skeptical about the necessity of using LLM with visual information to address these tasks given simple GNNs have already perform good with great efficiency.

GITA and GNNs highlight the distinction between general-purpose and specialized solutions. As discussed in our motivation and in our response to W1 of Reviewer oE8d, GNNs are not sufficiently flexible, general, or user-friendly for addressing general graph reasoning tasks, for the following reasons:

1. Ease of Adaptation (Flexibility): GNNs often need to be combined with other models to meet specific task requirements. For instance, sequence-output tasks (such as SP, HP, TS) require integration with LSTMs, transformers, etc. In contrast, GITA achieves this with a unified model architecture. Besides, adapting GNNs to specific tasks requires modifications to the model structure, necessitating a background in deep learning and coding skills. GITA, on the other hand, only requires language skills, which everyone possesses, to accommodate task variations.

2. Zero-Shot Capability (Generalizability): GNNs lack zero-shot capabilities, whereas GITA has promising zero-shot capabilities (as demonstrated in rebuttal Table 3).

3. Human-Readable Operations and Agent Capability (User-friendliness): GNNs operate on unreadable vectors, while GITA operates on human-understandable language and images. Therefore, although it is not as efficient as GNNs, GITA can serve as an agent that answers graph questions in natural language within acceptable time, whereas GNNs cannot because they do not handle language.

Therefore, GITA is a more flexible, general and user-friendly framework for general-purpose graph reasoning than GNNs.


To sum up, this paper is the first work to conduct general-purpose vision-language graph reasoning, and it illustrates that vision can bring overall advances to LLM-based graph reasoning. As an initial direction, it is not as mature as GNNs, which have been explored for years. However, it shows comparable overall performance and distinct advantages (flexibility, generalizability, user-friendliness, and promising zero-shot capabilities), demonstrating that it is a promising direction to explore as a general solution for graph reasoning.

Comment

Thank you to the authors for the detailed response. I have no further questions and will raise my score to 5.

Comment

Thank you for your detailed, patient, and insightful discussion. We respond to these questions as follows.

Q1. Regarding W1, my concern about the limited applications remains. I acknowledge that K-hop sampling is a widely adopted approach and may not lead to performance degradation in some cases. However, for datasets like Cora and CiteSeer, the reported accuracy is much lower compared to pure GNN methods (Table 3). I am uncertain whether this is due to the k-hop sampling or because GITA only provides graph structures without node/edge attributes. If it is the latter, GITA may not be effective enough for handling graphs with attributes, whether in numerical or text format.

We think the reported performance on Cora and Citeseer is inferior to the leading models mostly because we do not use node attributes, not because of the k-hop sampling. This is because many methods in the leaderboard you listed (e.g., Graph Transformers like UGT, and most GNN variants, where k is equivalent to the number of GNN layers) also use k-hop sampling yet achieve leading performance.

However, we want to point out that although using extra information beyond graph structures, such as node attributes, is helpful (in Cora and Citeseer, they provide extra 0-1 binary word occurrence information), it requires specific models to handle, which hurts generalizability and conflicts with our motivation of a "general-purpose" graph reasoning framework. Note that generality is a significant reason why research works like GraphLLM and our GITA are interested in using LLMs for graph reasoning; GraphLLM also does not consider node attributes.

The harm that node attributes cause to generalizability is reflected in several aspects. First, handling node attributes typically requires designing a specific model tailored to their shapes (e.g., vector dimensions, length, matrix size), which hinders a general and consistent solution for handling diverse tasks. Second, the meanings of node attributes (e.g., word vectors of titles, embeddings of degree or index, one-hot representations of node classes) vary across datasets, so the model can overfit to a specific task and lose zero-shot ability. Finally, the diverse models needed to handle node attributes also increase model complexity.

As the major concern of GITA is how well vision+LLM performs in general graph scenarios, where node attributes are not necessarily provided by default, we concentrate only on pure graph structures and do not incorporate task-specific models for handling extra node attributes.

Moreover, we can provide several potential solutions for combining GITA with node attributes, and leave them for future work.

  1. We can include the text node attributes with the explanation of their concrete meanings inside the text prompt. However, such an approach may not perform well for some types of attributes such as occurrence data, because they are too abstract.
  2. Similar to many existing works, we can use a specific module to encode the attributes and another fusion module to combine them with the backbone model (i.e., MLLMs used in GITA). However, such an approach needs extra designs for these additional modules and needs task-specific adjustments.

In the following table, we also provide an additional comparison between GITA and dedicated GNNs under the non-attribute setting on Cora and Citeseer, where the non-attribute setting is commonly used to evaluate how well models understand graph structure. The experimental results show that GITA is much more effective on non-attributed Cora and Citeseer than dedicated GNNs, showcasing a more powerful awareness of pure graph structure.

| Model | Cora | Citeseer |
| --- | --- | --- |
| GITA | 85.24 | 75.07 |
| GCN | 73.35 | 68.71 |
| SAGE | 69.19 | 64.69 |
Review
Rating: 5

This paper introduces Graph to Visual and Textual Integration (GITA), a novel framework that enhances graph reasoning by integrating visual representations with traditional text-based processing. GITA's innovation lies in using both visual and textual inputs to address graph-structured data, a significant deviation from the typical text-only approaches in Large Language Models (LLMs). Key contributions include the development of GITA, the creation of the Graph-based Vision-Language Question Answering (GVLQA) dataset—the first for visual-textual graph reasoning—and extensive evaluations showing GITA's superior performance over existing LLMs across various datasets, highlighting the advantage of incorporating visual information into graph reasoning tasks.

Strengths

  1. The paper introduces the novel idea of integrating visual graphs with textual descriptions for enhancing graph reasoning tasks.
  2. The creation of the GVLQA dataset is a significant contribution, as it is the first vision-language dataset designed specifically for general graph reasoning.

Weaknesses

  1. While I think it is an interesting idea to introduce vision into graph reasoning tasks, I am skeptical about its value. This is because vision is a perceptual ability, i.e., a fast intelligence (system one), while graph reasoning is a task that requires rigorous inference modeling and step-by-step execution to get the final precise answer; e.g., the graph reasoning tasks covered in this paper have corresponding graph algorithms that yield the precise answer. In my opinion, compared with visual ability, the ability to rigorously execute traditional graph data structures and algorithms is the key for LLMs/MLLMs to solve graph reasoning tasks. The experiments in Table 1 also show that, except for Connect and Cycle, which can be answered quickly and visually (provided the graph layout is concise and clear), GITA (VO) does not perform well on the other graph reasoning tasks.
  2. I think this paper is like a data track paper, i.e., proposing novel datasets, testing the capabilities of existing LLM/MLLM, and finally proposing viable solutions for testing. Instead, this paper is counterproductive in wrapping it up as a methodological framework paper. The biggest problem is that the graph visualizer, the graph describer, and the questioner are both part of the methodology and the means of constructing the dataset, which is very confusing. I don't think they fit as part of the methodology because they are just tools for engineering the dataset, and there is no innovation at the level of ideas, modeling, or anything else.
  3. Table 1 compares only text-based LLMs and lacks a comparison with visually aware MLLMs such as GPT-4V/GPT-4o/LLaVA/Gemini.

Questions

Please refer to the weaknesses.

Limitations

Yes.

Author Response

We acknowledge and appreciate your insightful review. Below, you can find our responses addressing your concerns point by point. If you have any additional questions or require further clarification, please feel free to let us know.

Note: Some tables mentioned in this rebuttal are contained in our rebuttal supplement PDF.

W1: While I think it is an interesting idea to introduce vision into graph reasoning tasks, I am skeptical about its value. This is because vision is a perceptual ability, which is a fast intelligence (i.e., system one). While graph reasoning is a task that requires rigorous inference modeling and step-by-step execution to get the final precise answer, e.g., the graph reasoning tasks covered in this paper have their corresponding graph algorithms to get the precise answer. In my opinion, compared with visual ability, the ability to rigorously execute traditional graph data structures and algorithms is the key for LLM/MLLM to have the ability to solve graph reasoning tasks. The experiments in Table 1 also show that, except for Connect and Cycle, which can be answered quickly and visually (provided the graph layout is concise and clear), GITA (VO) does not perform well on the other graph reasoning tasks.

We agree with you that the visual modality alone may not be sufficient for graph reasoning tasks. Please note that our work shows that integrating the visual modality with the textual modality achieves better performance than either single modality, as evidenced by Tables 1 and 3 of our submitted paper.

While "VO" demonstrates limited performance in Table 1 for most tasks, we have successfully enhanced its capabilities through layout augmentation in GITA, as evidenced in Table 2.

The visual modality can complement the textual modality in graph reasoning tasks. Here we provide a case study. Typically, the visual modality is better than the textual modality at recognizing beneficial substructures/local patterns, some of which are crucial in graph reasoning. For instance, the hop number serves as a heuristic in shortest path calculations, leaf nodes are critical in topological sorting, and cycles must be prevented in Hamiltonian path construction. We extracted these substructures from GVLQA-Base and manually labeled them. Employing the frozen ViT in LLaVA with a trainable MLP decoder, we achieved identification accuracies of 89.92%, 95.16%, and 92.39% for hop number counting, leaf node identification, and cycle identification, respectively. In contrast, using a pre-trained BERT with the same trainable MLP decoder, the accuracies are significantly lower (55.47%, 26.33%, and 60.32%). Therefore, the effectiveness of integrating the visual and textual modalities may be because the visual modality provides extra beneficial structural information such as these substructures.
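
For reference, a frozen-encoder probe of this kind can be sketched generically in PyTorch as below; the feature dimension, head size, and the random stand-in encoder are hypothetical assumptions, not the authors' exact ViT/BERT setup.

```python
import torch
import torch.nn as nn

class FrozenEncoderProbe(nn.Module):
    """Frozen feature extractor + trainable MLP head, as in the substructure probes above."""
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # only the MLP head receives gradients
            p.requires_grad = False
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, x):
        with torch.no_grad():                # keep the backbone strictly frozen
            feats = self.encoder(x)
        return self.head(feats)

# Toy usage: a linear layer stands in for a frozen ViT/BERT feature extractor.
probe = FrozenEncoderProbe(nn.Linear(768, 768), feat_dim=768, num_classes=2)
logits = probe(torch.randn(4, 768))          # shape: (4, 2)
```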

Besides, the benefits of vision do not conflict with the benefits that come from step-by-step reasoning (e.g., chain-of-thought, CoT). Table 1 in our rebuttal supplement PDF shows that the model benefits from both simultaneously.

W2: I think this paper is like a data track paper, i.e., proposing novel datasets, testing the capabilities of existing LLM/MLLM, and finally proposing viable solutions for testing. Instead, this paper is counterproductive in wrapping it up as a methodological framework paper. The biggest problem is that the graph visualizer, the graph describer, and the questioner are both part of the methodology and the means of constructing the dataset, which is very confusing. I don't think they fit as part of the methodology because they are just tools for engineering the dataset, and there is no innovation at the level of ideas, modeling, or anything else.

We would like to point out the following:

  1. GITA is the first to enable the use of MLLMs in graph reasoning. This is a significant advance because nearly all existing graph reasoning works do not utilize the visual modality, which could help improve the performance.
  2. GITA is general for almost all existing graph reasoning data because it only requires graph information in bare structures $G = \{V, E\}$.
  3. GITA solves technical problems including: how to use MLLMs for graph reasoning based on bare graph structure, how to maintain vision graph usability and manage context length in large-scale graphs, how to trade off the consistency and variability of vision graphs, and which special augmentations to apply to vision graphs and their impacts.
  4. GVLQA is just a by-product of our core contribution, GITA. Therefore, the methodologies inside GITA are not limited to any individual benchmark or scenario but are significant and valuable to graph reasoning. As a result, we think our work fits the main track instead of the dataset track.

W3. Table 1 compares only text-based LLMs and lacks a comparison with visually aware MLLMs such as GPT-4V/GPT-4O/LLaVa/Gemini.

We would like to point out that in our experiments, LLaVA has been utilized as the VLM reasoner for both the GITA and GITA (VO) configurations. As one of the components of GITA, LLaVA receives the visual and textual inputs from the other components of GITA (which we store as GVLQA), and the results are shown in the table as the GITA and GITA (VO) entries. Therefore, LLaVA does not need to be compared again as an individual baseline, because it is a special case of GITA. Similarly, GPT-4V serves as the VLM reasoner for the GITA-ZS and GITA-ZS (VO) configurations in Table 1 and is not considered a baseline. Following your suggestion, we utilized GPT-4o (not yet released at the time of our submission) and Gemini as the MLLM reasoner and recorded the results in Table 2 of our rebuttal supplement PDF, which shows that the benefits of incorporating vision with GITA are consistent across various VLM reasoner choices.

Comment

My concerns are addressed and I hope that the final version of the paper will be updated.

Author Response

Dear Reviewers,

Thank you for your valuable feedback and thoughtful comments on our manuscript. We have carefully reviewed each of your concerns and have addressed them individually in the responses below. We believe that these revisions and clarifications have strengthened our manuscript, and we hope that our responses meet your satisfaction.

Additionally, due to space constraints, we have included some of the larger tables in an attached PDF document for your convenience.

Please find our detailed replies to your specific points below.

Best regards,
Authors

Comment

Dear Reviewers,

We extend our heartfelt gratitude for your dedicated time and effort in reviewing our paper. Your constructive questions and valuable feedback have been immensely beneficial to our manuscript, and we will incorporate them into our revisions.

Thank you once again for your invaluable contributions to the review process.

Best regards,

Authors

Final Decision

The paper presents GITA, a framework that integrates visual graphs generated from text inputs to enhance graph reasoning tasks in LLMs. The creation of the GVLQA dataset, specifically designed for vision-language graph reasoning, is a valuable contribution that fills a gap in the current landscape of graph reasoning datasets. The paper demonstrates promising results across multiple datasets, showcasing GITA’s effectiveness, particularly in local structure recognition and layout-sensitive tasks. Although some concerns were raised regarding scalability and comparisons to GNNs, the authors effectively addressed these by providing additional experiments and detailed explanations in their rebuttal.

All the reviewers' comments have been addressed, and all scores recommend acceptance. Congratulations to the authors.