PaperHub
Average rating: 5.3 / 10
Rejected · 4 reviewers
Ratings: 5, 5, 6, 5 (min 5, max 6, std 0.4)
Confidence: 3.8
Correctness: 2.5
Contribution: 2.3
Presentation: 2.5
ICLR 2025

GL-Fusion: Rethinking the Combination of Graph Neural Network and Large Language model

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

A new architecture for combining GNN and LLM

Abstract

Keywords
GNN, LLM

Reviews & Discussion

Review (Rating: 5)

The paper proposes a new architecture to address the limitations of LLM-centered models in capturing structural information and GNN-centered models in capturing textual semantics. This architecture includes three components: (1) Structure-Aware transformer, which integrates GNN and LLM into the same transformer layer, (2) Graph-Text Cross-Attention, allowing direct aggregation of node textual information, and (3) GNN-LLM Twin Predictor, which adds an additional layer to the output of LLMs to adapt to downstream tasks.

Strengths

  • The motivation is interesting.

Weaknesses

  1. The paper is not well written.
  • The use of notation presents a significant problem, to the point that I could not understand the paper. In the preliminaries section, L_n occurs only once. I think that L, describing the length of the node text in Line 129, should always be L_n. In Section 3.2, Equation 3 and Equation 4 are very unclear. None of T_i, W_q, W_k, X_v is defined. It is not clear to me whether T_i in this context refers to the T_i of Equation 2. In line 238, the authors also admit to abusing notation, and I don't understand why the authors didn't make further corrections.
  • The order of Table 1 and Table 2 is also confusing.
  • I am suspicious of the time complexity of the cross-attention layer because of the confusion of the notation.
  • The data formats in the tables are not aligned.
  • There is a large amount of blank data in Table 3, with no explanation as to why.
  2. Lack of important ablation studies and hyperparameter sensitivity analysis.

The paper proposes three modules, but in the experiments, only 'w/o cross-attention' is included in the graph property prediction task. In general, we only use one kind of aggregator, such as max or sum. I am not sure what the motivation was for designing all three, or how they affect the final experimental results. I would like to understand the difference between them and the type of information each captures, especially the standard deviation. Moreover, the experiments did not indicate the need for a gating mechanism in Equation 2.

  3. Detailed descriptions of the dataset and baseline methods are missing.

This paper covers a large number of experimental tasks, but detailed information on the datasets and the baseline methods is not listed in the appendix.

  4. Lack of important baseline methods. The baseline methods should be divided into three parts: (1) LLM-centered models, (2) GNN-centered models, and (3) methods combining LLMs and GNNs.
  • [1] Huang, Qian, et al. "Prodigy: Enabling in-context learning over graphs." Advances in Neural Information Processing Systems 36 (2024).
  • [2] Tan, Yanchao, et al. "MuseGraph: Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining." arXiv preprint arXiv:2403.04780 (2024).
  • [3] Chien, Eli, et al. "Node feature extraction by self-supervised multi-scale neighborhood prediction." arXiv preprint arXiv:2111.00064 (2021).
  • [4] Huang, Xuanwen, et al. "Prompt-based node feature extractor for few-shot learning on text-attributed graphs." arXiv preprint arXiv:2309.02848 (2023).
  5. I also doubt the reasonableness of the proposed model. Consider that we want to do the node classification task only. Given a limited number of labeled nodes, the model has a significantly larger number of trainable parameters than standard GNNs. However, I do not think the proposed model can outperform GNNs on small labeled datasets. So is the model really needed? I do not think so.

Questions

Please see the comments above.

Comment

Thank you for acknowledging our motivation. We address your concerns as follows.

  • The paper is not well written.

We apologize for the writing issues. We have made the notation consistent and added experimental details in the revision. We hope the revision helps you understand and reevaluate our work.

  • The use of notation presents a significant problem, to the point that I could not understand the paper. In the preliminaries section, L_n occurs only once. I think that L, describing the length of the node text in Line 129, should always be L_n.

We apologize for the inconsistency. We have fixed it and converted L to L_n.

  • In Section 3.2, Equation 3 and Equation 4 are very unclear. None of T_i, W_q, W_k, X_v is defined. It is not clear to me whether T_i in this context refers to the T_i of Equation 2.

We agree that this was unclear. We have clarified and defined these notations at the beginning of Section 3 to ensure their proper usage throughout.

  • In line 238, the authors also admit to abusing notation, and I don't understand why the authors didn't make further corrections.

We appreciate your feedback and have addressed this issue. Previously, X and T were used for both input token sequences and their representations in the model. We now use lowercase x and t for representations to avoid confusion.

  • I am suspicious of the time complexity of the cross-attention layer because of the confusion of the notation.

The notation has been fixed. The time complexity of our cross-attention mechanism is O(n L_n L_t), as the query is the input sequence of length L_t, and the keys and values come from n nodes' L_n-length texts.
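To make the complexity concrete, here is a minimal sketch of such a cross-attention step (our illustration under the notation above, not the paper's actual code). The score matrix of shape (L_t, n·L_n) is the dominant cost, giving O(n L_n L_t):

```python
import torch

# Minimal sketch: the L_t text tokens (queries) attend to all n nodes'
# L_n-token texts (keys/values). W_q, W_k, W_v are (d, d) projections.
def graph_text_cross_attention(text_h, node_texts_h, W_q, W_k, W_v):
    # text_h:       (L_t, d)    hidden states of the input sequence
    # node_texts_h: (n, L_n, d) hidden states of every node's text
    n, L_n, d = node_texts_h.shape
    q = text_h @ W_q                          # (L_t, d)
    kv = node_texts_h.reshape(n * L_n, d)     # flatten all node tokens
    k, v = kv @ W_k, kv @ W_v                 # (n * L_n, d)
    scores = (q @ k.T) / d ** 0.5             # (L_t, n * L_n): dominant cost
    return torch.softmax(scores, dim=-1) @ v  # (L_t, d)
```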

  • The data formats in the tables are not aligned.

We have transposed Table 1 in the revised version and are happy to address any other formatting issues.

  • There is a large amount of blank data in Table 3, with no explanation as to why.

For the link prediction table, our baselines only report results for certain metrics or datasets. To provide a complete comparison, we include results for all metrics with our GL-Fusion model. Additionally, we have added baseline results for models such as GraiL and NBFNet. GL-Fusion consistently outperforms these baselines.

  • Lack of important ablation studies.
  • The paper proposes three modules, but in the experiments, only 'w/o cross-attention' is included in the graph property prediction task.

We have added detailed ablation studies on the ogbn-arxiv dataset, as shown below. Our GL-Fusion produces both GNN and LLM outputs simultaneously, so we report the test accuracy of both predictors and their ensemble accuracy (final results of the node classification task).

| Variant | GNN | LLM | Ensemble |
| --- | --- | --- | --- |
| Original | 77.09 | 76.43 | 78.2 |
| Low LoRA Rank | 76.61 | 75.97 | 77.06 |
| w/o cross attention | 75.5 | 74.97 | 76.35 |
| w/o gate | 75.88 | 73.33 | 76.2 |
| w/o mpnn | 75.4 | 75.57 | 76.48 |
| w/o gnn predictor | - | 72.45 | - |
| w/o text predictor | 75.8 | - | - |

Here, 'Low LoRA Rank' denotes the model with LoRA rank 8. 'w/o cross attention' removes the graph-text cross-attention. 'w/o gate' removes the gating mechanism. 'w/o multi-aggr' removes the multiple aggregators in the message-passing modules of the transformer layers and uses the mean aggregator only. 'w/o gnn predictor' removes the GNN prediction and its training loss. 'w/o text predictor' removes the text prediction and its training loss. Each module contributes to the overall performance, and removing any of them results in a performance drop.

Comment
  • Lack of hyperparameter sensitivity analysis.

Our model is based on a pretrained LLM, Llama-3-8B, and most hyperparameters follow its configuration (e.g., hidden dimensions, number of layers, attention heads, positional encoding, and fine-tuning learning rate). The dataset-specific hyperparameters are the Low-Rank Adaptation (LoRA) rank and the batch size. We choose the batch size as large as GPU memory allows. As shown in the ablation table, reducing the LoRA rank from 64 to 8 ('Low LoRA Rank') decreases performance by 1%, because smaller ranks reduce the number of trainable parameters, limiting the model's expressivity.

  • In general, we only use one kind of aggregator, such as max or sum. I am not sure what the motivation was for designing all three, or how they affect the final experimental results. I would like to understand the difference between them and the type of information each captures, especially the standard deviation.

We apologize for omitting a citation. We use multiple aggregators based on Principal Neighbourhood Aggregation (PNA) [1]. PNA suggests that multiple aggregators are necessary for encoding continuous features, as they capture different statistics of neighbors' features. In our ablation study (comparison between the original model and 'w/o multi-aggr'), multiple aggregators also perform better than the mean aggregator alone.

As noted in [1], the standard deviation aggregator quantifies the spread of neighboring nodes' features, so that a node can assess the diversity of the signals it receives.
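For readers unfamiliar with PNA, here is a minimal sketch of the multi-aggregator idea (our reconstruction following [1], not the paper's exact module): each node aggregates neighbor features with mean, max, and std, and the concatenated statistics are then projected by a linear layer.

```python
import torch

def multi_aggregate(x, edge_index, num_nodes):
    src, dst = edge_index                    # messages flow src -> dst
    msg = x[src]                             # (E, d) neighbor features
    deg = torch.zeros(num_nodes).index_add_(0, dst, torch.ones(src.numel()))
    deg = deg.clamp(min=1).unsqueeze(-1)     # avoid division by zero
    s = torch.zeros(num_nodes, x.size(1)).index_add_(0, dst, msg)
    s2 = torch.zeros(num_nodes, x.size(1)).index_add_(0, dst, msg ** 2)
    mean = s / deg
    std = (s2 / deg - mean ** 2).clamp(min=0).sqrt()  # spread of neighbors
    mx = torch.full((num_nodes, x.size(1)), float('-inf')).scatter_reduce(
        0, dst.unsqueeze(-1).expand_as(msg), msg, reduce='amax',
        include_self=True)
    mx = torch.where(torch.isinf(mx), torch.zeros_like(mx), mx)
    return torch.cat([mean, mx, std], dim=-1)  # (N, 3d), project afterwards
```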

[1] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, Petar Velickovic. Principal Neighbourhood Aggregation for Graph Nets. NeurIPS 2020.

  • Moreover, the experiments did not indicate the need for a gating mechanism in Equation 2.

This mechanism allows gradual integration of GNN outputs during training while stabilizing the model and mitigating knowledge forgetting. As shown in our ablation study (comparison between the original model and 'w/o gate'), the gating mechanism is useful: removing the gate significantly decreases the performance of the text output.
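A minimal sketch of this gating idea (our reading of Equation 2, not the authors' exact code): a learnable scalar gate, initialized at zero, scales the GNN branch, so training starts from the pretrained LLM's behavior and gradually blends in structural information.

```python
import torch
import torch.nn as nn

class GatedGNNBranch(nn.Module):
    def __init__(self, mpnn):
        super().__init__()
        self.mpnn = mpnn
        # zero init: no GNN signal at step 0, so the LLM output is unchanged
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, h, edge_index):
        return h + torch.tanh(self.gate) * self.mpnn(h, edge_index)
```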

  • Detailed descriptions of the dataset and baseline methods are missing.

  • This paper covers a large number of experimental tasks, but in the appendix, detailed information on the datasets as well as information on the baseline methods are not listed.

We have added detailed descriptions of datasets and baseline methods in Appendix B of the revised version.

Comment
  • Lack of important baseline methods. The baseline methods should be divided into three parts, (1) LLM-centered models, (2) GNN-centered models, and (3) combined LLMs and GNNs methods.

We appreciate this suggestion.

[1] is a GNN-centered model. It is pretrained on the MAG240M dataset, whose original text features are unavailable. Therefore, we cannot run exactly the same experiments as [1]. However, the OFA baseline in our experiments takes a similar architecture.

We have added [2, 3, 4] in the revision. Our GL-Fusion outperforms them significantly.

[2] is an LLM-centered model. On the ogbn-arxiv dataset, our model outperforms it significantly, achieving 77.80 micro-F1 and 61.40 macro-F1, while [2] only achieves 69.20 micro-F1 and 42.45 macro-F1.

[3] combines LLMs and GNNs. It achieves 76.12 accuracy on ogbn-arxiv, while our GL-Fusion achieves 78.20 accuracy.

[4] is a GNN-centered model. It achieves 52.91 and 62.12 test accuracy on ogbn-arxiv with 10 and 100 training examples per class, while our GL-Fusion achieves 54.66 and 68.18, respectively. The results are shown in the following table.

| #shots per class | 10 | 100 |
| --- | --- | --- |
| OGB-Feature | 45.76 | 58.75 |
| PLM+GAE | 50.16 | 58.1 |
| PLM+GAE+prompt | 51.89 | 60.63 |
| GIANT | 50.5 | 60.81 |
| GIANT+prompt | 51.4 | 61.26 |
| PLM-cls | 46.97 | 58.69 |
| PLM-Prompt-dense | 51.17 | 58.65 |
| PLM-Prompt-sparse | 52.01 | 60.85 |
| G-Prompt | 52.48 | 61.67 |
| G-Prompt w/o gate | 52.91 | 62.12 |
| G-Prompt w/o graph | 52.26 | 60.59 |
| G-Prompt w/o SSL | 52.1 | 60.92 |
| GL-Fusion | 56.44 | 68.18 |

[1] Huang, Qian, et al. "Prodigy: Enabling in-context learning over graphs." Advances in Neural Information Processing Systems 36 (2024).
[2] Tan, Yanchao, et al. "MuseGraph: Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining." arXiv preprint arXiv:2403.04780 (2024).
[3] Chien, Eli, et al. "Node feature extraction by self-supervised multi-scale neighborhood prediction." arXiv preprint arXiv:2111.00064 (2021).
[4] Huang, Xuanwen, et al. "Prompt-based node feature extractor for few-shot learning on text-attributed graphs." arXiv preprint arXiv:2309.02848 (2023).

  • I also doubt the reasonableness of the proposed model. Consider that we want to do the node classification task only. Given a limited number of labeled nodes, the model has a significantly larger number of trainable parameters than standard GNNs. However, I do not think the proposed model can outperform GNNs on small labeled datasets. So is the model really needed? I do not think so.

We appreciate your concern regarding the model's complexity and performance under low-data regimes. While it is true that merging an LLM introduces a larger hidden dimension (4096) and consequently a higher parameter count for the GNN component, this does not necessarily lead to overfitting. On the contrary, the inclusion of an LLM provides a strong inductive bias, enabling the model to generalize better even with limited labeled data.

To validate this claim, we conducted experiments on the ogbn-arxiv dataset with only a few labeled examples per class. As shown in the table below, our model, GL-Fusion, significantly outperforms previous GNN-based methods and other baselines in few-shot settings:

| #shots per class | 10 | 100 |
| --- | --- | --- |
| OGB-Feature | 45.76 | 58.75 |
| PLM+GAE | 50.16 | 58.1 |
| PLM+GAE+prompt | 51.89 | 60.63 |
| GIANT | 50.5 | 60.81 |
| GIANT+prompt | 51.4 | 61.26 |
| PLM-cls | 46.97 | 58.69 |
| PLM-Prompt-dense | 51.17 | 58.65 |
| PLM-Prompt-sparse | 52.01 | 60.85 |
| G-Prompt | 52.48 | 61.67 |
| G-Prompt w/o gate | 52.91 | 62.12 |
| G-Prompt w/o graph | 52.26 | 60.59 |
| G-Prompt w/o SSL | 52.1 | 60.92 |
| GL-Fusion | 56.44 | 68.18 |

Thank you for your valuable feedback. We hope these new results address your concerns. Let us know if further refinement is needed!

Comment

Dear Reviewer yaRE,

Thank you for acknowledging our efforts to improve the paper and for taking the time to participate in our discussion.

We believe that combining GNNs and LLMs is a promising direction, as demonstrated by previous work [1, 2, 3, 5]. This combination not only enhances the GNN's ability to process text (such as node text and task descriptions), but also enables LLMs to better understand graph data. By processing graphs from various domains in a unified way, using LLMs to handle the text features from these domains, the combination of GNNs and LLMs serves as an important step toward developing graph foundation models [1, 2, 5].

Our architecture outperforms existing combinations of GNN and LLMs, as well as standalone GNNs, across node, link, graph, and graph-to-text generation tasks, confirming the effectiveness of our approach. Specifically, we achieve state-of-the-art (SOTA) performance on the ogbn-arxiv dataset, where multiple GNN-LLM combinations and ensemble methods have been explored. Additionally, on the ogbg-code2 dataset, our model achieves a 40% F1 score, surpassing the previous SOTA of 20%.

We now address your concerns point by point:

  1. The merit of the paper is to combine the advantages of LLM and GNN to simultaneously perform text generation and node classification tasks.

Our contribution goes beyond simply producing two kinds of output for these tasks. The ability to generate both text and GNN outputs is one feature of our twin predictor, which is just one aspect of our architecture. Additionally, we propose a structure-aware architecture and node-text cross-attention, both of which are validated as effective in our ablation study (Table 8). The twin predictor not only produces two types of predictions; the supervision on each prediction also helps improve the other. Our ablation study shows that removing the loss on one prediction leads to a performance drop in the other.

Moreover, our model is not limited to node classification. The flexibility of text allows us to formulate a wide range of tasks, including node, link, and graph classification, question answering with graphs, and code summarization using code ASTs. Our model performs well across all these tasks.

  2. When we only want to do node classification tasks, I think simply performing GNN-centered models is enough. The proposed model will lead to significant overhead compared to plain GNN-centered models, which may overshadow the performance margin.

On node tasks, our GL-Fusion model still significantly outperforms both existing GNNs and GNN-LLM combinations. Text information plays a crucial role in improving performance, and traditional GNN-centered models cannot fully utilize this. For example, in our experiments on ogbn-arxiv, extracting uncompressed text information with our cross-attention module leads to a 3% performance improvement, highlighting that previous embedding methods in GNN models are insufficient. Additionally, node classification tasks like ogbn-arxiv have been dominated by combinations of GNN and LLMs [4].

The cost of our method is minimal compared to the cost of the LLM backbone. Only 10% of additional parameters and a small amount of extra computation are added to the LLM backbone. Despite this, our method significantly enhances the LLM’s ability to process graph data, allowing the LLM to better understand graphs. When compared to GNN-centered methods, the training process for GNN-centered models typically involves precomputing LLM embeddings and running only the GNNs. In contrast, our method requires running both the GNN and the LLM, making our approach slower during training. However, during inference, when new node text features must also pass through the LLM, the cost of GNN-centered methods becomes comparable to that of the LLM backbone.

  3. All the downstream graph-related tasks are not unified in the same template. In other words, when we train the model on node classification, we have to fine-tune the model for graph classification or link prediction tasks.

In our experiments, all downstream tasks are unified into the same input format. For each task, we combine the task description, input graph, and label into a mixed sequence of text and graph. The model is trained with standard LLM text cross-entropy loss to predict the label. Ideally, we can train our model for node, link, and graph tasks simultaneously and perform these tasks without additional fine-tuning.

In this case, the only task-specific component is the GNN predictor, which is a small MLP (approximately 10k parameters). Even without retraining for new tasks, our model can still generate text predictions and directly apply to unseen downstream tasks. While multi-task learning is our future goal, our current work already provides flexibility for adapting to multiple tasks without changes to the architecture.
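For illustration, a hypothetical helper showing the unified format described above; the special tokens follow the degree-prediction example later in this thread, while the function itself is ours:

```python
def build_sequence(task_description: str, num_nodes: int, label: str) -> str:
    # every task becomes "task description + graph token span + label",
    # so the same model and text loss apply to node, link, and graph tasks
    graph_span = "<graph_start>" + "<graph_node>" * num_nodes + "<graph_end>"
    return f"{task_description} Graph: {graph_span} {label}"

# e.g. build_sequence("What is the degree of node 8?", 20, "3")
```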

Comment
  4. Given graphs that do not contain labels or texts, how can the model be trained?

Most graph data contains textual information, though it may be compressed or removed in preprocessing due to limitations in previous GNN input formats (e.g., Cora). For example, in bioinformatics, textual data can be retrieved from biological databases; in social networks, user descriptions and posts can provide text; and in knowledge graphs, Wikipedia text can be leveraged.

For graphs with only structural features, we can use empty text and allow GNNs to learn from the graph structure. The task description text remains useful, as it can be processed by our model.

Regarding the concern about a large number of learnable parameters, we note that the strong inductive bias provided by LLMs allows our model to perform well with small amounts of data. As shown in Table 3, our model significantly outperforms existing models on few-shot learning tasks.

We hope our reply addresses your concerns. Let us know if further results or clarifications are needed!

[1] Liu et al. One For All: Towards Training One Graph Model For All Classification Tasks. ICLR 2024
[2] Tang et al. GraphGPT: Graph Instruction Tuning for Large Language Models. SIGIR 2024
[3] Xie et al. Graph-Aware Language Model Pre-Training on a Large Graph Corpus Can Help Multiple Graph Applications. KDD 2023
[4] Duan et al. SimTeG: A Frustratingly Simple Approach Improves Textual Graph Learning. CoRR abs/2308.02565
[5] Yan et al. A Comprehensive Study on Text-attributed Graphs: Benchmarking and Rethinking. NeurIPS 2023

Comment

Thank you for your reply. However, your clarification does not convince me in two respects.

First, when we fine-tune the model for graph classification, we have to train all the parameters in the model, which are numerous. This is unnecessary, because it has been widely shown that GNN architectures are more effective for node/graph classification. Why, then, do we need to fine-tune all the parameters in the model?

Second, graphs without text features are prevalent. At the very least, the authors should consider this case. Given non-textual graphs, I believe the model cannot be trained, because we have a significant number of parameters and only labels can be used to train the Transformer-based model. This is obviously infeasible.

I think the paper has merit in its extensive experiments verifying effectiveness. However, from the perspective of applicability, the model is overshadowed. I am very confident and will keep my score.

Comment

Dear Authors,

I appreciate all the efforts you have made to improve the paper. Meanwhile, I have carefully checked all the reviews. For me, the biggest issue of this paper remains unsolved. The so-called merit of the paper is to combine the advantages of LLM and GNN to simultaneously perform text generation and node classification tasks. However, in my eyes, this is unnecessary, for three reasons.

First, when we only want to do the node classification task, I think simply using a GNN-centered model is enough, because GNNs are good at classification tasks. The proposed model will lead to significant overhead compared with plain GNN-centered models, which will overshadow the performance margin. In this way, the merger of GNN and LLM as in the paper seems unattractive.

Second, all the downstream graph-related tasks are not unified in the same template. In other words, when we train the model on node classification, we have to fine-tune the model for graph classification task or link prediction task.

Further, a minor issue: given graphs that do not contain labels or texts, how can the model be trained? For example, when the graphs in the training data are non-textual, the model will rely only on labels to train. But the problem is that we have a huge number of learnable parameters, which makes the model inapplicable.

Comment

Thank you for your feedback.

First, we acknowledge that our model has a larger training overhead compared to GNN-centered models; its cost is similar to LLM-centered methods [1]. During inference on unseen data, the computational cost of both our model and GNN-centered models is similar to that of the LLM backbone. It is worth noting that the existing GNN-centered models among our baselines also need to fine-tune an LLM to enhance performance [2, 3]. Freezing the LLM would not allow a deep fusion of GNN and LLM features.

Second, one example from our degree prediction dataset (following [4]'s setting) is as follows:

Input Graph-Text Mixture Sequence: "What is the degree of node 8? Graph: <graph_start><graph_node>...<graph_node><graph_end>? 3"

Edge Index: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 10, 10, 11, 12, 12, 12, 12, 12, 13, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 16, 17, 17, 17, 18, 19, 19, 19, 19, 19, 19, 19, 19], [16, 17, 8, 1, 12, 3, 4, 10, 14, 15, 19, 0, 9, 6, 5, 11, 7, 13, 12, 3, 4, 10, 14, 15, 19, 18, 14, 0, 1, 4, 0, 1, 3, 1, 9, 13, 1, 10, 19, 1, 9, 13, 0, 17, 19, 1, 5, 7, 0, 1, 6, 12, 15, 14, 1, 0, 1, 10, 15, 19, 1, 5, 7, 0, 1, 2, 10, 15, 19, 0, 1, 10, 12, 14, 19, 0, 0, 8, 19, 2, 0, 1, 6, 8, 12, 14, 15, 17]]

Node Text: Each node is represented by its ID, such as "id: 0", "id: 1", ..., "id: 19".

Edge Text: All edges are simply labeled as "edge".

Node label for the GNN predictor: 0 for all nodes except node 8, which has label 4 (0 represents a node not queried in the problem, so label 4 represents degree 3).
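A small sketch of this label scheme (our reconstruction of the description above): label 0 marks nodes not queried, and the queried node gets degree + 1, so that 'degree 0' and 'not queried' remain distinguishable.

```python
import torch

def degree_labels(edge_index, num_nodes, queried_node):
    deg = torch.zeros(num_nodes, dtype=torch.long)
    deg.index_add_(0, edge_index[0], torch.ones_like(edge_index[0]))
    labels = torch.zeros(num_nodes, dtype=torch.long)
    labels[queried_node] = deg[queried_node] + 1  # degree 3 -> label 4
    return labels
```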

[1] Tang et al. GraphGPT: Graph Instruction Tuning for Large Language Models. SIGIR 2024.

[2] Zhao et al. Learning on Large-scale Text-attributed Graphs via Variational Inference. ICLR 2023.

[3] Chien et al. Node Feature Extraction by Self-Supervised Multi-scale Neighborhood Prediction. ICLR 2022.

[4] Guo et al. GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking. CoRR 2023.

Comment

Thank you for your clarification on fine-tuning with non-textual graphs. Considering your work in improving the notation and adding experiments, I would like to raise the score to 5 but still vote for rejecting the paper.

My main concern still lies in the additional training cost. While some of the listed reference papers also fine-tune LLMs, they were published around two years ago. At this stage, we should focus more on the balance between effectiveness and efficiency. In my eyes, the improvement of your proposed model over the baselines is overshadowed by the additional training cost, which makes the paper less attractive. This explains my final rating.

Comment

Thank you very much for taking the time to discuss our work and for reevaluating it. We are pleased that the improvements we made to the notations and the additional experiments have further validated the effectiveness of our method. We acknowledge your concerns regarding the balance between effectiveness and efficiency, particularly the additional training costs compared with current GNN-centered methods.

However, we still believe that the efficiency issue will not hinder the practical application of our work. Our GL-Fusion method has a training cost similar to LLM-centered approaches and significantly enhances the graph capabilities of LLMs. During the inference stage, the computational cost of the GNN is negligible compared to that of the LLM. Therefore, all methods that combine LLMs and GNNs will have an overall cost similar to an LLM backbone.

Comment

Thank you for taking the time to read our clarification and reevaluate our work. However, there are still some misunderstandings regarding the facts of our method.

First, the statement that "we have to train all the parameters in the model that are numerous" is incorrect. In our GL-Fusion model, the LLM parameters are fixed, and only the newly added GNN layers (31% of all trainable parameters), the cross-attention block, and the parameter-efficient adapter for the LLM are optimized. As a result, the memory requirement for training our model is relatively modest, only requiring about 30 GB of GPU memory.

Second, the statement that "Given non-textual graphs, the model cannot be trained because we have a significant number of parameters and only labels can be used to train Transformer-based models" is also incorrect. In our synthetic experiments with node-degree data, our model, even with empty node text features and trained solely on node degree labels, can predict degree values accurately with 100% test accuracy. Unlike traditional statistical learning models, it is widely known that a large number of parameters does not necessarily lead to overfitting in deep learning models [1], due to regularization, implicit bias of stochastic gradient descent, and inherent properties of deep neural networks.

Third, regarding the claim that "graphs without text features are prevalent": while many datasets preprocess features and may not explicitly expose the raw text, the text data is often still available. For instance, in the Open Graph Benchmark, nearly all datasets, whether node-, link-, or graph-level, include text in some form. The only exception is ogbl-vessel, which provides 3D coordinates (although these can be converted to text). Current LLMs may struggle with numerical data like coordinates, but the fact remains that most datasets contain some form of textual data. Consequently, many state-of-the-art methods in this domain incorporate LLMs for processing such text.

We appreciate your feedback and hope these clarifications address your concerns.

[1] Zhang, et al. "Understanding deep learning requires rethinking generalization." ICLR, 2017.

Comment

First, I think you have misunderstood my point. Compared with GNN-centered models and LLM-GNN models, your proposed method has to use LoRA for fine-tuning, while others only need to optimize GNN parameters, which are substantially negligible. In this way, the additional training time and cost could overshadow the performance improvement.

Second, I am very curious about how you train on non-textual graphs. Could you show me an example on the degree prediction task?

Review (Rating: 5)

The manuscript proposes a method that deeply integrates Large Language Models (LLMs) and Graph Neural Networks. The method features structure-aware transformers, graph-text cross attention, and GNN-LLM twin predictor, enabling it to perform tasks of various types related to graphs and language.

Strengths

  1. The method integrates LLMs and GNNs more deeply than previous methods by introducing three key components.
  2. The manuscript conducts experiments on downstream tasks of various types, including traditional node classification, knowledge-related tasks, and text generation, demonstrating the generalizability of the proposed method.

Weaknesses

  1. Experiments are not comprehensive. Baselines and benchmarks are missing. For example, for the node classification tasks, there are only two datasets considered. In Table 3, many entries in the table are missing, which makes the results less convincing. I suggest the author refer to [1] for benchmarking models on text-attributed graphs.

  2. The manuscript proposes a twin predictor for node classification. However, edge-level and graph-level tasks should also be considered given their importance in graph-related tasks. Regression tasks should also be considered.

[1] Yan, Hao, et al. "A comprehensive study on text-attributed graphs: Benchmarking and rethinking." Advances in Neural Information Processing Systems 36 (2023): 17238-17264.

Questions

In the Graph-text attention layer, the language tokens do not attend edges. Would the loss of the information affect performance?

Comment

Thank you for acknowledging our presentation, method design, and experiments on various graph tasks. We address your concerns as follows.

  • Experiments are not comprehensive. Baselines and benchmarks are missing. For example, for the node classification tasks, there are only two datasets considered. In Table 3, many entries in the table are missing, which makes the results less convincing. I suggest the author refer to [1] for benchmarking models on text-attributed graphs.

We appreciate this feedback. The Arxiv dataset in [1] corresponds to ogbn-arxiv, where our model achieves state-of-the-art results. In addition, we have incorporated more node classification datasets from [1], and the updated results are as follows:

| Dataset | GL-Fusion | Tiny | Base | T-GCN | B-GCN | T-SAGE | B-SAGE | GCN(T) | SAGE(T) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Children | 61.19 | 49.85 | 59.91 | 57.07 | 58.11 | 57.57 | 58.74 | 54.75 | 59.7 |
| History | 86.16 | 83.06 | 86.09 | 84.52 | 85.04 | 84.79 | 85.12 | 83.52 | 85.09 |
| Photo | 82.89 | 73.75 | 77.53 | 82.42 | 82.7 | 83.25 | 83.27 | 83.32 | 86.64 |
| Computers | 88.91 | 58.32 | 60.4 | 87.43 | 87.86 | 87.9 | 88.3 | 83.93 | 86.04 |
| Sports | 93.33 | 81.47 | 86.02 | 84.93 | 86.16 | 87.06 | 87.34 | 85.06 | 85.87 |

(Tiny and Base are PLM-based; T-GCN, B-GCN, T-SAGE, and B-SAGE are GNN-based; GCN(T) and SAGE(T) are co-training based.)

Our model, GL-Fusion, demonstrates significant improvements on the Children, Computers, and Sports datasets. For the Photo dataset, GL-Fusion outperforms most models except those based on GraphSage.

Regarding the link prediction table, our baselines conduct experiments on only some metrics or subsets of datasets. We have now added baseline results for models such as GraiL, NBFNet, and others. GL-Fusion still outperforms them across the majority of metrics and datasets.

  • The manuscript proposes a twin predictor for node classification. However, edge-level and graph-level tasks should also be considered given their importance in graph-related tasks. Regression tasks should also be considered.

In our paper, we propose the twin predictor as a technique to train and predict on both the GNN side and the LLM side for each sample in each task, as sketched below. For most GNN tasks, the GNN output is well designed, and the LLM output is translated from the original labels in the dataset. Specifically, in link prediction tasks (like FB15k237 in our paper), given a fixed head, the tail nodes are labeled 1 and all others 0, and this can be predicted by the GNN; at the same time, the LLM generates its prediction as usual. So the twin predictor works for link prediction tasks (though we only use the GNN output at test time, as we need predicted probabilities to compute the test metric). As for graph classification, the graph label is broadcast to each node during training, and the graph-level prediction can be read out by voting over nodes during inference (though we only use the text output at test time, as the ogbg-code2 task is to generate text). Our twin predictor is also suitable for regression: we only need to change the cross-entropy loss to an MSE loss, and the other settings can be kept the same. Other tasks and predictors can be added to boost the GNN, but our primary focus is to better combine the GNN and the LLM with a new architecture.
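A minimal sketch of the graph-level readout described above (our reconstruction, not the authors' code): the graph label is broadcast to every node for training, and inference votes over per-node logits.

```python
import torch

def broadcast_graph_label(graph_label: int, num_nodes: int) -> torch.Tensor:
    # training target: every node carries the graph's label
    return torch.full((num_nodes,), graph_label, dtype=torch.long)

def graph_readout_by_voting(node_logits: torch.Tensor) -> int:
    votes = node_logits.argmax(dim=-1)  # (num_nodes,) per-node class votes
    return int(votes.mode().values)     # majority vote as graph prediction
```

Switching the cross-entropy loss to MSE on the same head would adapt it to regression, as noted above.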

  • In the Graph-text attention layer, the language tokens do not attend edges. Would the loss of the information affect performance?

Although language tokens do not directly attend to edges, the node tokens—containing language information—are further combined with edge information in the message-passing layers. Additionally, the edge text used in our model already encapsulates task-relevant information. Therefore, there is no loss of information in this setup. This design ensures that both node and edge attributes contribute effectively to the overall task performance.

We hope these clarifications address your concerns. Thank you again for your valuable feedback, which has helped us refine and improve our work. Let us know if any further refinement is required!

[1] Yan, Hao, et al. "A comprehensive study on text-attributed graphs: Benchmarking and rethinking." Advances in Neural Information Processing Systems 36 (2023): 17238-17264.

Comment

Thanks for the authors' response. Although the authors have included more baselines, I still feel that there is a large amount of related work in the LLM + Graph field that has not been discussed in the manuscript or included in the performance tables. For example:

[1] UniGLM: Training One Unified Language Model for Text-Attributed Graphs

[2] LLaGA: Large Language and Graph Assistant

[4] Disentangled Representation Learning with Large Language Models for Text-Attributed Graphs

[5] Efficient Tuning and Inference for Large Language Models on Textual Graphs.

In addition, considering this an architecture-innovation paper, I think the proposed method lacks novelty.

Comment

Thank you for your feedback. We are glad to further clarify your questions:

  1. **More baselines**: Due to time limitations, we could not implement and compare all of these related works. Here, we present comparisons where baseline accuracy is available.
| Model | Computers | Photo | History | Arxiv |
| --- | --- | --- | --- | --- |
| UniGLM | 82.61±0.13 | 80.71±0.02 | 80.22±0.40 | - |
| LLaGA | - | - | - | 76.66 |
| DGTL | - | - | 55.70±0.89 | - |
| ENGINE | - | 83.75±0.08 | - | 76.02 |
| GL-Fusion | 88.91 | 82.89 | 86.16 | 78.2 |

These comparisons show that GL-Fusion has superior capacity. The only exception is Photo, where we observed that the text information is less relevant due to the dataset's structure-heavy nature; as evidence, GNN-based models outperform PLM-based models there. This property limits our model's advantage on this dataset, because our model tends to learn from both text and structure. Nevertheless, our model remains competitive with these baselines.

Discussion:

  1. UniGLM is a contrastive learning framework, while our work proposes a new architecture in which the LLM and GNN are deeply fused. In addition, UniGLM is an encoder for graphs and text, while ours can also generate text.

  2. LLaGA shares a similar idea in that a graph can be translated into a node sequence and fed into LLMs. However, its translation loses permutation invariance, a key property of graphs. Furthermore, LLaGA requires a root node, limiting its application to graph-level tasks. In contrast, our model preserves graph structure through specialized positional encodings (PEs) and attention layers, ensuring that information flow aligns with a GNN.

  3. DGTL is a three-step architecture in which node text is first converted to embeddings by a frozen LLM, a GNN is then trained to learn graph embeddings, and the graph embeddings are finally injected into an LLM as special PEs. Compared to our architecture, their graph embedding is task-agnostic, while ours is task-related, because nodes can attend to the task prompt tokens at the front of the sequence.

  4. ENGINE integrates a frozen LLM with a GNN by extracting hidden states after each LLM layer. The idea of learning layer by layer is similar to ours. However, it is still a GNN-centered model restricted to traditional GNN tasks, and it cannot generate text answers. Moreover, its text information is structure-agnostic because the LLM cannot see the graph structure, whereas we learn structure-related node information through the cross-attention layers.

  5. Novelty:
    Our model's innovation lies in its high degree of integration between text and structure. From the above analysis, we find that these baselines are either unable to incorporate structural information when learning node text representations, or unable to read task-related text when learning structural representations. This is because they tend to use one of the LLM or the GNN as the reasoner and the other as the encoder, even when trying to integrate the two. Our model solves this problem through specially designed attention and cross-attention layers, ensuring complete learning of graphs, texts, and the dependencies between them in any direction. This makes our architecture essentially different from prior work combining LLMs and GNNs, and sufficient to serve as the basic architecture of a future foundation model that adapts to various cases.

To summarize our innovations:

  1. By designing special PEs and attention layers, we realize the information interaction of the original LLM and the GNN simultaneously within one layer; in this process, the graph representation can read the task prompt, ensuring the graph embedding is task-related.
  2. Through cross-attention, we can read text information without compressing node information and extract representations that are both task- and structure-related, because the information is extracted conditioned on a query vector that is itself both task- and structure-related.
  3. Our twin predictor allows the model to perform traditional graph tasks as a graph predictor and to generate text answers as a generator.
Review (Rating: 6)

This paper presents GL-Fusion, an architecture integrating the capabilities of graph neural networks (GNNs) and large language models (LLMs). The proposed architecture attempts to address limitations in existing methods for GNN-LLM fusion by carefully embedding GNN structure within LLMs to leverage both structural graph data and rich textual information. Components of the architecure include structure-aware Transformer layers, cross-attention between graph and text, and GNN-LLM integrated predictor, which together tries to enable simultaneous handling of textual and structural features while providing flexibility in output. GL-Fusion shows improvement across various tasks, outperforming baseline GNN-centered, LLM-centered, and hybrid models from the recent literature.

Strengths

  • The design of the architecture builds upon recent research on LLMs for text-attributed graphs and addresses some limitations, especially of existing approaches that rely on either the GNN or the LLM alone as the main model.
  • The permutation invariance of graph node tokens is carefully modeled in the structure-aware transformer layer.
  • Both GNN-based and text-based tasks can conceptually be handled using the proposed method.

Weaknesses

Limitations and questions:

  1. If there is a common understanding that self-attention is a form of message passing, why is there a need to incorporate a separate MPNN in the structure-aware transformer layer? This aspect is unclear and has not been studied empirically.
  2. On page 5, the positional encoding that would be ideal for both text and graph tokens is discussed. I wonder whether it is possible to have a unified structure-based positional encoding that perceives both the sequential structure and the graph structure as 'arbitrary graph structure'. This is in the context of recent works in the graph transformer literature showing that Laplacian eigenvectors can generalize the text positional encoding of Transformers.
  3. Although the proposed method aims to reduce complexity, the integration of graph structures into LLM layers may still encounter scalability issues with large graphs or highly interconnected datasets. How does GL-Fusion handle memory usage and processing time for very large graph structures?

Questions

Included with the weaknesses.

Comment

Thank you for acknowledging our presentation and method design. We address your concerns as follows.

  • If there is a common understanding that self-attention is a form of message passing, why is there a need to incorporate a separate MPNN in the structure-aware transformer layer? This aspect is unclear and has not been studied empirically.

We agree that self-attention can be viewed as a form of message passing, where messages are exchanged between token pairs based on their attention scores. However, encoding graph structures specifically requires message passing constrained by the graph's edges. In a vanilla attention layer, there is no inherent mechanism to enforce this constraint.

In our initial trials, we experimented with directly incorporating the adjacency matrix into the attention mechanism. However, we observed that this approach fails to capture even basic graph properties like node degrees. This limitation arises because simply adding graph edge information to the attention matrix cannot balance the dense, fixed-magnitude token attention with the sparse, degree-dependent edge aggregation. To address this, we introduced a separate MPNN layer, which enables effective graph-specific message passing.

This hybrid architecture is also consistent with recent graph transformer designs, such as those discussed in [1]. We believe this integration provides a principled way to incorporate graph-specific inductive biases while leveraging the power of transformers.
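A schematic sketch of this hybrid layer (our simplified reading of Section 3.1, with an assumed single-batch layout): dense self-attention over all tokens, plus a separate edge-constrained MPNN over the graph-node tokens, instead of folding the adjacency matrix into the attention scores.

```python
import torch
import torch.nn as nn

class StructureAwareLayer(nn.Module):
    def __init__(self, d_model, n_heads, mpnn):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mpnn = mpnn  # e.g. a PNA-style message-passing module

    def forward(self, h, node_token_idx, edge_index, attn_mask=None):
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)  # text <-> graph tokens
        h = h + a
        # edge-constrained message passing among graph-node tokens only
        node_h = h[:, node_token_idx].squeeze(0)        # (n, d)
        update = torch.zeros_like(h)
        update[:, node_token_idx] = self.mpnn(node_h, edge_index)
        return h + update
```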

  • On page 5, the positional encoding that would be ideal for both text and graph tokens is discussed. I wonder whether it is possible to have a unified structure-based positional encoding that perceives both the sequential structure and the graph structure as 'arbitrary graph structure'. This is in the context of recent works in the graph transformer literature showing that Laplacian eigenvectors can generalize the text positional encoding of Transformers.

Thank you for pointing out this possible method. While eigenvector-based positional encodings (PEs) can theoretically preserve all graph structure information, incorporating an MPNN module remains critical for performance. As highlighted in [1], even when graph transformers inject structure information through PEs, the message-passing operator significantly enhances performance.

Moreover, [2] demonstrates that using Laplacian eigenvector-based PEs can lead to symmetry-breaking issues, resulting in different predictions for isomorphic graphs, which undermines generalization. Although [2] proposes modifications to mitigate this, these changes significantly alter the vanilla transformer architecture. In contrast, our structure-aware transformer layers retain seamless compatibility with vanilla transformer layers, ensuring broader applicability.

We also conducted an ablation study on the ogbn-arxiv dataset. Replacing MPNN layers with Laplacian eigenvector PEs resulted in a significant performance drop, with test accuracy decreasing from 78.2% to 72.78%. This highlights the importance of retaining the MPNN layer alongside other design choices in our framework.
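To illustrate the symmetry-breaking issue from [2]: Laplacian eigenvectors are only defined up to sign (and up to rotation for repeated eigenvalues), so two isomorphic graphs can receive different encodings. A minimal sketch:

```python
import torch

def laplacian_pe(edge_index, num_nodes, k):
    A = torch.zeros(num_nodes, num_nodes)
    A[edge_index[0], edge_index[1]] = 1.0
    L = torch.diag(A.sum(dim=1)) - A   # combinatorial Laplacian D - A
    eigvals, eigvecs = torch.linalg.eigh(L)
    return eigvecs[:, 1:k + 1]         # skip the trivial constant eigenvector

# Note: flipping the sign of any returned column gives an equally valid PE,
# so a model consuming these PEs may break isomorphism symmetry.
```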

  • Although the proposed method aims to reduce complexity, the integration of graph structures into LLM layers may still encounter scalability issues with large graphs or highly interconnected datasets. How does GL-Fusion handle memory usage and processing time for very large graph structures?

We acknowledge the scalability challenges associated with using LLMs for large graphs. As discussed in Section 3.2, our cross-attention block extracts information from node text features in a more scalable manner compared to directly incorporating all node text as input to the transformer.

Nonetheless, running GL-Fusion on very large graphs remains a challenge due to the memory and computational demands of LLMs. In our experiments, when the input graph exceeds the memory limits, we adopt an ego-subgraph sampling approach, focusing on the subgraph centered around the node (or link) to predict. This approach is a widely accepted and standard technique for handling large graphs in both academia and industry [3].
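As a concrete example of this sampling step, PyG's k_hop_subgraph can extract the ego-subgraph (the exact sampler used in the paper may differ; the toy graph and the 2-hop radius below are our assumptions):

```python
import torch
from torch_geometric.utils import k_hop_subgraph

edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])  # toy graph
target_node = 0                                          # node to predict

subset, sub_edge_index, mapping, _ = k_hop_subgraph(
    node_idx=target_node, num_hops=2, edge_index=edge_index,
    relabel_nodes=True)
# `subset` lists the retained node ids; `mapping` locates target_node in them
```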

Thank you again for your valuable feedback, which has helped us clarify and strengthen our methodology. Let us know if further refinements are needed!

[1] Ladislav Rampásek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, Dominique Beaini. Recipe for a General, Powerful, Scalable Graph Transformer. NeurIPS 2022.

[2] Xiyuan Wang, Pan Li, Muhan Zhang. Graph As Point Set. ICML 2024.

[3] Haoteng Yin, Muhan Zhang, Yanbang Wang, Jianguo Wang, Pan Li, Algorithm and System Co-design for Efficient Subgraph-based Graph Representation Learning, VLDB, 2022.

Review (Rating: 5)

This paper proposes a method that combines GNN and transformer architectures to learn graph tasks. Specifically, graph tokens and text tokens are learned together in one sequence, while graph tokens attend only to other graph tokens in the same graph. Graph structures are learned by integrating a GNN into the prediction layer, and graph tokens are learned via a cross-attention module over the text tokens.

Strengths

  1. The proposed architecture provides a feasible solution for combining transformer architectures and graph neural networks to learn graph tasks.

  2. The paper is well written and easy to follow.

  3. The reported experimental results are good.

Weaknesses

  1. The proposed method is complex, comprising three components: text and graph token transformers, a text-to-graph token attention module, and a GNN prediction module. The optimization process over these three components seems non-trivial, but the authors do not discuss any details of it.

  2. The method needs to train a new transformer architecture, which seems not easy to combine with existing pre-trained LLMs. Can the authors describe whether it is possible to leverage the power of pre-trained language models?

  3. The included datasets for node classification are limited (only two datasets). The reported results in the link prediction table lack most of the comparison results.

Questions

For optimization process, is this method using back propagation to directly optimize all parameters?

Comment

Thank you for acknowledging our presentation and method soundness. We address your concerns as follows.

  • The proposed method is complex, comprising three components: text and graph token transformers, a text-to-graph token attention module, and a GNN prediction module. The optimization process over these three components seems non-trivial, but the authors do not discuss any details of it.

We appreciate your observation and provide more details on our optimization process below:

  1. The entire model is trained end-to-end, with joint training of all components. However, not all parameters are trained from scratch.

  2. For the text and graph token transformers (referred to as the Structure-Aware Transformer Layer in our paper), we leverage the pretrained Llama-3-8B model as the backbone. As detailed in Section 3.1, we introduce new parameters through modifications to the positional encoding and attention masks. Additionally, for the <graph_node> token, we simply add an embedding to the existing layer. The newly added MPNN component is initialized randomly. We fine-tune the transformer parameters using the Low-Rank Adaptation (LoRA) method from the peft library; this approach freezes the original transformer layers and optimizes only the newly introduced low-rank layers (a sketch follows this list). The MPNN parameters, on the other hand, are trained entirely from scratch.

  3. For the text-to-graph token attention module (referred to as Graph-Text Cross-Attention layers in our paper), all parameters are initialized randomly and trained from scratch.

  4. For the GNN prediction modules, which are simple linear layers for generating outputs, parameters are initialized randomly and trained from scratch.

  5. Overall, most parameters are initialized from pretrained models and frozen. Only about 10% of the parameters in the entire model are optimized.
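A sketch of this parameter-efficient setup using the peft library (the configuration values and target modules below are illustrative, not the paper's exact recipe):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# freeze the pretrained Llama-3-8B weights and train only low-rank adapters;
# the new MPNN / cross-attention modules are trained from scratch separately
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora = LoraConfig(r=64, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the LoRA layers are trainable
```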

  • The method needs to train a new transformer architecture, which seems not easy to combine with existing pre-trained LLMs. Can the authors describe whether it is possible to leverage the power of pre-trained language models?

Thank you for raising this point. Our model is indeed designed to integrate seamlessly with existing pretrained LLMs. In fact, all our experiments were conducted using a modified version of the Llama-3-8B model.

As described in Section 3.1 of our paper, a standard transformer layer in the LLM is adapted to a structure-aware transformer layer by modifying the positional encoding, attention masks, and adding a separate MPNN layer. This modification ensures that, when the input consists of text only, the adapted transformer layer produces the same output as the original transformer layer. Thus, our method can fully harness the capabilities of pretrained LLMs.

  • The included datasets for node classification are limited (only two datasets). The reported results in the link prediction table lack most of the comparison results.

We acknowledge the initial limitations and have expanded our evaluation.

For node classification, we added more datasets from [1]. The results are summarized in the table below:

| Dataset | GL-Fusion | Tiny | Base | T-GCN | B-GCN | T-SAGE | B-SAGE | GCN(T) | SAGE(T) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Children | 61.19 | 49.85 | 59.91 | 57.07 | 58.11 | 57.57 | 58.74 | 54.75 | 59.7 |
| History | 86.16 | 83.06 | 86.09 | 84.52 | 85.04 | 84.79 | 85.12 | 83.52 | 85.09 |
| Photo | 82.89 | 73.75 | 77.53 | 82.42 | 82.7 | 83.25 | 83.27 | 83.32 | 86.64 |
| Computers | 88.91 | 58.32 | 60.4 | 87.43 | 87.86 | 87.9 | 88.3 | 83.93 | 86.04 |
| Sports | 93.33 | 81.47 | 86.02 | 84.93 | 86.16 | 87.06 | 87.34 | 85.06 | 85.87 |

(Tiny and Base are PLM-based; T-GCN, B-GCN, T-SAGE, and B-SAGE are GNN-based; GCN(T) and SAGE(T) are co-training based.)

Our model, GL-Fusion, achieves significant improvements on the Children, Computers, and Sports datasets. On the Photo dataset, GL-Fusion outperforms most models except those based on GraphSage.

For the link prediction table, our baselines only conduct experiments on a subset of metrics or datasets. We have added full baseline results for GraiL, NBFNet, and other models. Our GL-Fusion still outperforms them.

Comment

Questions:

  • For optimization process, is this method using back propagation to directly optimize all parameters?

Yes, we use backpropagation for optimization. However, not all parameters are optimized. Parameters from pretrained models are optimized using the Low-Rank Adaptation (LoRA) method, which freezes the original parameters and optimizes only the newly added low-rank linear layers. This allows us to efficiently fine-tune the model while preserving the knowledge in the pretrained parameters.

[1] Yan, Hao, et al. "A comprehensive study on text-attributed graphs: Benchmarking and rethinking." Advances in Neural Information Processing Systems 36 (2023): 17238-17264.

We hope this detailed response addresses your concerns. Thank you once again for your constructive feedback and thoughtful suggestions. Let us know if you'd like any further refinement!

Comment

Dear Reviewers,

Thank you for your thoughtful and constructive feedback on our work. We have carefully revised our paper to address your comments and made the following improvements:

  1. Add more node classification datasets.

We have incorporated five more node classification datasets from [1]. The results are as follows:

| Dataset | GL-Fusion | Tiny | Base | T-GCN | B-GCN | T-SAGE | B-SAGE | GCN(T) | SAGE(T) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Children | 61.19 | 49.85 | 59.91 | 57.07 | 58.11 | 57.57 | 58.74 | 54.75 | 59.7 |
| History | 86.16 | 83.06 | 86.09 | 84.52 | 85.04 | 84.79 | 85.12 | 83.52 | 85.09 |
| Photo | 82.89 | 73.75 | 77.53 | 82.42 | 82.7 | 83.25 | 83.27 | 83.32 | 86.64 |
| Computers | 88.91 | 58.32 | 60.4 | 87.43 | 87.86 | 87.9 | 88.3 | 83.93 | 86.04 |
| Sports | 93.33 | 81.47 | 86.02 | 84.93 | 86.16 | 87.06 | 87.34 | 85.06 | 85.87 |

(Tiny and Base are PLM-based; T-GCN, B-GCN, T-SAGE, and B-SAGE are GNN-based; GCN(T) and SAGE(T) are co-training based.)

Our model, GL-Fusion, shows significant improvements on the Children, Computers, and Sports datasets. For the Photo dataset, GL-Fusion outperforms most models except those based on GraphSage.

  2. Complete link prediction baseline metrics.

In the link prediction table, baselines had previously reported results on only some metrics or subsets of datasets, leading to incomplete comparisons. We have added baseline results for GraiL, NBFNet, and UniLP. Our GL-Fusion still outperforms all baselines across all metrics and datasets.

  3. Add more baseline models.

We have added [2, 3, 4] in revision. Our GL-Fusion outperforms them significantly.

[2] is an LLM-centered model. On the ogbn-arxiv dataset, our model outperforms it significantly, achieving 77.80 micro-F1 and 61.40 macro-F1, while [2] only achieves 69.20 micro-F1 and 42.45 macro-F1.

[3] combines LLMs and GNNs. It achieves 76.12 accuracy on ogbn-arxiv, while our GL-Fusion achieves 78.20 accuracy.

[4] is a GNN-centered model. It achieves 52.91 and 62.12 test accuracy on ogbn-arxiv with 10 and 100 training examples per class, while our GL-Fusion achieves 54.66 and 68.18, respectively. The results are shown in the following table.

| #shots per class | 10 | 100 |
| --- | --- | --- |
| GL-Fusion | 56.44 | 68.18 |
| OGB-Feature | 45.76 | 58.75 |
| PLM+GAE | 50.16 | 58.1 |
| PLM+GAE+prompt | 51.89 | 60.63 |
| GIANT | 50.5 | 60.81 |
| GIANT+prompt | 51.4 | 61.26 |
| PLM-cls | 46.97 | 58.69 |
| PLM-Prompt-dense | 51.17 | 58.65 |
| PLM-Prompt-sparse | 52.01 | 60.85 |
| G-Prompt | 52.48 | 61.67 |
Comment
  4. Add ablation study.

We have conducted an ablation study on the ogbn-arxiv dataset, as shown below. Since GL-Fusion outputs both GNN and LLM predictions, we report the test accuracy of both predictors and their ensemble performance:

| Variant | GNN | LLM | Ensemble |
| --- | --- | --- | --- |
| Original | 77.09 | 76.43 | 78.2 |
| Low LoRA Rank | 76.61 | 75.97 | 77.06 |
| w/o cross attention | 75.5 | 74.97 | 76.35 |
| w/o gate | 75.88 | 73.33 | 76.2 |
| w/o mpnn | 75.4 | 75.57 | 76.48 |
| w/o gnn predictor | - | 72.45 | - |
| w/o text predictor | 75.8 | - | - |

Here, 'Low LoRA Rank' denotes the model with LoRA rank 8. 'w/o cross attention' removes the graph-text cross-attention. 'w/o gate' removes the gating mechanism. 'w/o multi-aggr' removes the multiple aggregators in the message-passing modules of the transformer layers and uses the mean aggregator only. 'w/o gnn predictor' removes the GNN prediction and its training loss. 'w/o text predictor' removes the text prediction and its training loss. Each module contributes to the overall performance, and removing any of them results in a performance drop.

  5. Improved readability and notation consistency.

We have revised Section 3 to ensure consistent notation and improve readability.

  6. Expanded appendix.

We have added details about the datasets and implementation in Appendices A and B.

Your comments have helped us improve readability and evaluate our method more comprehensively. In addition to these revisions, we have addressed your specific questions in our individual responses. With the discussion period nearing its end, we sincerely appreciate your reevaluation of our work and hope our efforts address your concerns. Thank you once again for your valuable feedback and insightful suggestions.

We look forward to your response.

[1] Yan, Hao, et al. "A comprehensive study on text-attributed graphs: Benchmarking and rethinking." Advances in Neural Information Processing Systems 36 (2023): 17238-17264.

[2] Tan, Yanchao, et al. "MuseGraph: Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining." arXiv preprint arXiv:2403.04780 (2024).

[3] Chien, Eli, et al. "Node feature extraction by self-supervised multi-scale neighborhood prediction." arXiv preprint arXiv:2111.00064 (2021).

[4] Huang, Xuanwen, et al. "Prompt-based node feature extractor for few-shot learning on text-attributed graphs." arXiv preprint arXiv:2309.02848 (2023).

Comment

Dear Reviewers YrC1, b4Va, and xXbY:

Thank you for reviewing our work! Your constructive comments and concerns have been invaluable in helping us improve our manuscript.

During the rebuttal process, we have carefully addressed each of your points and conducted additional experiments. These include results on more node tasks, complete baseline results on link tasks, ablation studies, and comparisons with new baselines in the few-shot scenario. These results further verify the effectiveness of our method. Our revisions have addressed most of the concerns raised by reviewer yaRE, leading to a reevaluation of our work and an increase in the score.

As the deadline for discussion is approaching, we kindly request that you review our responses and let us know if they satisfactorily address your concerns.

Best regards,

The Authors

Comment

Dear Reviewers,

Thank you once again for dedicating your time to reviewing our work and for providing such detailed comments and concerns!

During the rebuttal process, we have carefully addressed each of your points and hope that our responses have resolved your concerns. If there is anything that remains unclear or if you have additional questions, we would be very happy to discuss further.

As the deadline for revision approaches, we kindly ask you to review our responses and let us know if they satisfactorily address your concerns.

We greatly appreciate your time and consideration. Wishing you a wonderful day!

Best regards,

The Authors

AC Meta-Review

(a) The paper proposes GL-Fusion, a new architecture integrating GNN and LLM to address limitations of existing methods. It includes Structure-Aware Transformers, Graph-Text Cross-Attention, and GNN-LLM Twin Predictor. Claims to outperform baselines on various tasks.

(b) Strengths: Novel architecture with potential to handle text and graph data jointly. Some reviewers noted good experimental results and well-written presentation.

(c) Weaknesses: Reviewers criticized the paper for complexity with unclear optimization, lack of details on handling large graphs, limited datasets initially, and notation issues. Some questioned the need for the proposed model compared to simpler GNN or LLM models.

(d) Reasons for rejection: Despite the authors' efforts in the rebuttal to address concerns like adding more datasets and improving notation, significant weaknesses remain. The model's complexity and training cost compared to simpler alternatives are major drawbacks. The need for such a complex combination of GNN and LLM for tasks that might be adequately handled by simpler models is not convincingly justified. There are also remaining concerns about the paper's clarity and the practicality of the proposed method.

Additional Comments on Reviewer Discussion

During the rebuttal, reviewers raised several points. One reviewer noted the complexity of the method and the lack of details on optimization, which the authors addressed by explaining the end-to-end training process and parameter initialization. Another concern was the limited datasets, and the authors added more node classification datasets. Notation issues were also raised, and the authors revised the paper for consistency. However, the reviewer who ultimately voted to reject was not convinced by the authors' responses regarding the training cost and criticized the marginal improvement over baselines from two years ago. In the final decision, these unresolved concerns, especially regarding the model's complexity and cost compared to simpler alternatives, weighed heavily in favor of rejection.

Final Decision

Reject