PaperHub
4.4 / 10
Poster · 4 reviewers
Reviewer ratings: 1, 3, 3, 3 (min 1, max 3, std 0.9)
ICML 2025

When Do LLMs Help With Node Classification? A Comprehensive Analysis

OpenReview · PDF
Submitted: 2025-01-17 · Updated: 2025-07-24
TL;DR

We release a testbed for LLM-based node classification algorithms, and provide 8 novel takeaways based on extensive experiments on this testbed.

Abstract

Keywords
Large Language Models · Graph Neural Networks · Node Classification

Reviews and Discussion

Official Review (Rating: 1)

The paper systematically analyzes LLM-based node classification, introducing LLMNodeBed, a benchmark with 10 datasets and 8 algorithms for fair comparisons. It finds that LLMs significantly outperform traditional methods in semi-supervised settings but offer marginal gains in supervised learning, with Graph Foundation Models trailing proprietary LLMs like GPT-4o. While LLM-as-Encoder methods are cost-effective and excel in heterophilic graphs, LLM-as-Reasoner works best when labels depend on textual data, and LLM-as-Predictor performs well only with abundant labeled data, though all LLM-based methods remain computationally expensive compared to GNNs.

Update after rebuttal

The following concerns remain:

  • misuse of terminology, e.g., LLM-as-Reasoner, supervised learning, semi-supervised learning.
  • wrong dataset with labeling bias
  • insufficient text classification baselines.

Questions for Authors

NA

Claims and Evidence

  1. The claim of reasoning is confusing, as reasoning involves applying logic to seek the truth, but it is unclear what logical reasoning means in a text classification task. Instead, the process aligns more with cognition, which involves acquiring knowledge through thought and experience.

  2. Consequently, the definitions of LLM-as-Reasoner and LLM-as-Predictor are unclear.

  3. The definition of Graph Foundation Models (GFM) is also problematic, as it is based on pre-training on graph corpora rather than the model’s ability to generalize across tasks and domains. The referenced papers do not involve graph pre-training but instead, fine-tune on a single graph dataset.

  4. Takeaways 2 and 3 are not novel findings but rather common knowledge.

  5. Takeaway 9, which claims LLM-as-Encoder performs better on heterophilic graphs, contradicts the fact that LLMs are structure-agnostic. This would imply that GNNs work better on heterophilic graphs than homophilic ones, which is counterintuitive.

Methods and Evaluation Criteria

See Experimental Designs and Analyses.

Theoretical Claims

The paper does not make any theoretical claims.

Experimental Designs and Analyses

  1. The paper lacks discussion on the cost of LLM API inference.
  2. The evaluation dataset is small, with no more than 100,000 nodes, and the arXiv dataset is only partially used.
  3. Experimental results on LLMs are unconvincing due to biased label annotation, as Ogbn-arxiv simplifies a naturally multi-label task into a single-label one, making LLM predictions seem incorrect when they may actually be reasonable. A comprehensive study, not selective examples, is needed to clarify this issue.
  4. The paper also lacks text classification baselines. Since many LLM methods ignore structural information, the task can be seen as text classification, where simple BERT with bidirectional attention and fine-tuning may perform better than LLMs, which focus on generative tasks rather than deep language understanding.

Supplementary Material

Yes. I reviewed Appendix C and found that the benchmark datasets are small in scale, which cannot reflect real-world practice.

Relation to Broader Scientific Literature

The findings do not make any contribution to the broader scientific literature.

Essential References Not Discussed

No

Other Strengths and Weaknesses

NA

Other Comments or Suggestions

NA

Author Response

We sincerely thank the reviewer for your valuable feedback.


Definition of Reasoning, LLM-as-Reasoner and LLM-as-Predictor, and Graph Foundation Models (GFM)

We apologize for any confusion regarding the terminology. The categorization of LLM-as-Enhancer, LLM-as-Predictor, and GFMs is adopted from the survey paper [Ref 1]. Specifically, our use of LLM-as-Reasoner corresponds to the explanation-based LLM-as-Enhancer paradigm described in that survey. In this paradigm, LLMs generate predictions accompanied by explanations, requiring reasoning capabilities and general knowledge, as shown in TAPE [Ref 2].

The key difference between LLM-as-Reasoner and LLM-as-Predictor lies in whether we directly use the predictions generated by the LLM [Ref 1]. In the LLM-as-Predictor paradigm, the prediction is the label directly generated by the LLM. In contrast, in the LLM-as-Reasoner paradigm, although the LLM provides an initial prediction, the focus is on its reasoning process, i.e., the explanation text it generates. This reasoning is then used for augmentation, and the final prediction depends on other components like GNNs.
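To make this distinction concrete, below is a minimal Python sketch of the two pipelines. Every helper in it (llm, llm_explain, lm_encode, gnn_classify) is an illustrative stub introduced here for exposition; none of them are part of LLMNodeBed or any specific baseline.

```python
# Minimal sketch contrasting the two paradigms. Every helper below is a
# hypothetical stub introduced for illustration, not code from LLMNodeBed.

def llm(prompt: str) -> str:
    """Stub for an LLM call; a real system would query an API or a local model."""
    return "cs.LG"  # pretend the model answers with a category name

def llm_explain(prompt: str) -> str:
    """Stub for an LLM call that returns a prediction together with an explanation."""
    return "Likely cs.LG, because the abstract centers on training objectives and benchmarks."

def lm_encode(text: str) -> list[float]:
    """Stub for a smaller LM encoder that turns text into a node feature vector."""
    return [0.0] * 384

def gnn_classify(features: list[float], graph) -> str:
    """Stub for a GNN classifier over node features and graph structure."""
    return "cs.LG"

node_text = "Title and abstract of the paper ..."
graph = None  # placeholder for edge_index / adjacency information

# LLM-as-Predictor: the label is whatever the LLM outputs.
predictor_label = llm(f"Classify this paper into an arXiv CS category: {node_text}")

# LLM-as-Reasoner: the LLM's explanation augments the node text, and the final
# label still comes from a downstream (LM encoder + GNN) classifier.
explanation = llm_explain(f"Classify this paper and explain your reasoning: {node_text}")
features = lm_encode(node_text + " " + explanation)
reasoner_label = gnn_classify(features, graph)
```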

For GFMs, according to existing survey papers [Ref 1, 3], they are defined as models pre-trained on extensive graph corpora. These models utilize large-scale training data along with specifically designed techniques to achieve strong generalization capabilities across tasks and domains.

Takeaways 2, 3, 9

We thank the reviewer for the opportunity to further clarify our contributions. While Takeaways 2 and 3 may seem intuitive, they have not been explicitly discussed or established in prior research on LLMs for graphs. Our work provides empirical evidence and theoretical understanding to support these observations, contributing novel insights to the field.

In our paper, we do not have a Takeaway 9, so we assume the reviewer is referring to Takeaway 8. Takeaway 8 does not state that LLMs perform better on a heterophilic graph than on a homophilic counterpart. Rather, it states that in the LLM-as-Encoder paradigm, LLMs perform significantly better than smaller LMs on heterophilic graphs, while achieving only similar performance on highly homophilic graphs. To further support Takeaway 8, we have added more explanations and four heterophilic datasets; these are given in our response to Question 3 of Reviewer HQNY.

LLM API inference costs

We thank the reviewer for the valuable suggestion. Due to space limits, we use the Photo dataset as an example, reporting the average token consumption and corresponding costs (in USD) for each prompt template across two LLMs. Completing four test datasets requires 59,259,894 tokens, costing approximately $19.5 for GPT-4o-mini and $35.6 for DeepSeek-V3. We will extend this discussion to all datasets in the revised manuscript.

Prompt   | Avg. Tokens | GPT-4o-mini Cost (USD) | DeepSeek-V3 Cost (USD)
Direct   | 282.6       | 0.2                    | 0.4
CoT      | 371.3       | 0.4                    | 0.8
ToT      | 505.0       | 0.7                    | 1.2
ReACT    | 514.8       | 0.7                    | 1.2
Neighbor | 2607.6      | 3.1                    | 5.6
Summary  | 2019.3      | 5.2                    | 9.5
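Estimates of this kind are easy to script. The sketch below multiplies a token count by an assumed per-million-token rate; the values in PRICE_PER_M_TOKENS are placeholders for illustration, not the actual provider pricing behind the figures above.

```python
# Back-of-the-envelope API cost estimate: usage is billed per million tokens.
# The per-million-token rates below are assumed placeholders; consult the
# providers' current pricing pages for real values.
PRICE_PER_M_TOKENS = {
    "gpt-4o-mini": 0.30,   # assumed blended input/output rate, USD per 1M tokens
    "deepseek-v3": 0.60,   # assumed blended input/output rate, USD per 1M tokens
}

def estimate_cost_usd(total_tokens: int, model: str) -> float:
    """Estimated USD cost of consuming `total_tokens` with `model`."""
    return total_tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]

total_tokens = 59_259_894  # token count reported above for the four test datasets
for model in PRICE_PER_M_TOKENS:
    print(f"{model}: ~${estimate_cost_usd(total_tokens, model):.1f}")
```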

Larger evaluation datasets

Thank you for raising the concern regarding the size of the evaluation datasets. We would like to clarify that the arXiv dataset consists of 169,343 nodes, exceeding the stated threshold of 100,000 nodes. Following standard practices like TAPE [Ref 2], we use the arXiv dataset primarily in the supervised setting.

We have conducted experiments on a larger dataset, ogbn-products, which contains 2,449,029 nodes and 123,718,152 edges. Due to space limitations, we invite you to refer to our response to Question 1 of Reviewer 4MPh for detailed results and discussion.

Biased label annotation on arXiv

While paper categorization could align with a multi-label classification task, existing research on node classification in Academic Networks consistently adopts a single-label classification approach. Specifically, the arXiv dataset [Ref 4] assigns each paper to its primary computer science sub-category, a setting widely accepted in prior works. Modeling this task as multi-label classification is promising but currently infeasible due to the lack of datasets with multi-label annotations. Therefore, our methodology and evaluation adhere to established practices and are appropriate and reasonable.

Lack of text classification baselines

We do include text classification baselines in Table 1, i.e., two fine-tuned LMs of different scales (SentenceBERT-66M and RoBERTa-355M).


[Ref 1] "A Survey of Graph Meets Large Language Model: Progress and Future Directions." In IJCAI, 2024.

[Ref 2] "Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning." In ICLR, 2024.

[Ref 3] "Graph Foundation Models: Concepts, Opportunities and Challenges." In TPAMI, 2024.

[Ref 4] "Open Graph Benchmark: Datasets for Machine Learning on Graphs." In arXiv preprint, 2020.

Reviewer Comment
  1. Regarding question 1, I believe the authors did not directly answer the question. I have provided a clear definition of the reasoning process and asked why the classification task is considered a reasoning process.

  2. For takeaway 3, I think other papers do not mention this because it is not presented as a novel research finding but rather as common sense. Since your model relies solely on text rather than structure, it is obvious that text is more informative, leading to better performance.

  3. The answer is somewhat misleading. I am not implying that a larger dataset is absent from the paper, but rather that many experiments, such as those shown in Tables 1-4, do not include tests on arXiv.

  4. There are several multi-label datasets available on citation networks that could be used for evaluations. However, the authors claim that the dataset they use does not have multiple labels, suggesting that they are confident in employing an incorrect experimental setting.

  5. The only baseline included is simple fine-tuning, as straightforward as the first tutorial on Hugging Face.

Author Comment

Thank you for taking the time to read our response and for your feedback. Your valuable comments greatly help us strengthen our experiments. We would first like to clarify a few possible misunderstandings:

  • LLM-based algorithms rely on both text and structure: The LLM-as-Reasoner paradigm utilizes LLMs to generate explanations from texts and uses the explanations to enhance node features, while still relying on GNNs for classification. LLM-as-Predictor processes both graph and node information through the LLM, which then generates the predictions.

  • Paper classification as single- or multi-label tasks: In prior works, both single- and multi-label classification are reasonable for paper categorization, as a paper may cover multiple topics while still having a primary field. We adopt single-label classification because it is the predominant setting in existing GNN research, and all LLM-based node classification algorithms have been developed and evaluated under this assumption. While we appreciate the availability of multi-label datasets like DBLP, we note that LLM-based algorithms are not supported or evaluated on them.


Classification as a Reasoning Process

In Reasoner approaches, the LLM generates a preliminary classification with detailed explanations. These explanations are used to enhance the node features, and a GNN with an LM encoder is utilized for classification. As discussed in the TAPE and GAugLLM papers, text classification requires reasoning to interpret context, resolve ambiguity, and distinguish primary focus from supporting details, which are tasks beyond simple keyword recognition. While cognition underpins classification, it is insufficient for fully handling complex academic texts.

We appreciate your feedback on "LLM-as-Reasoner" and your clear definition of reasoning, and we agree that in many scenarios only cognitive ability is required. We will rename this paradigm to LLM-as-Explainer in the revised manuscript to better reflect its purpose of providing explanations for downstream tasks.

Takeaway 3

We would first like to emphasize that LLM-as-Reasoner and LLM-as-Predictor do not rely solely on text, as discussed above. The trend described in Takeaway 3 is unique to Reasoner methods and does not apply to LLM-as-Encoder methods. Compared to Encoder, Reasoner first explains the node's texts, making the text easier to understand and better exploiting the text features. With Takeaway 3, we clearly differentiate the advantages of Reasoner and Encoder, offering practical guidelines for applying these methods in different contexts.

arXiv dataset

We initially interpreted your comment to mean that all used datasets are smaller than 100,000 nodes. We apologize for this misunderstanding.

The official release of the arXiv dataset only provides supervised splits, which have been consistently used in prior research, while Tables 2-4 follow the semi-supervised setting. To explore semi-supervised settings, we conducted an additional experiment on arXiv by allocating 10%, 10%, and 60% of the data for training, validation, and testing, respectively. The performance (%) is summarized below: LLM-based methods outperform classic methods by 3% in Accuracy and 8% in Macro-F1, validating Takeaway 2. We will include the arXiv dataset in semi-supervised settings in the revised manuscript.

Method          | Acc   | Macro-F1
GCN_ShallowEmb  | 69.51 | 48.41
SAGE_ShallowEmb | 67.75 | 46.65
GAT_ShallowEmb  | 69.40 | 49.16
SenBert-66M     | 67.38 | 35.80
RoBERTa-355M    | 71.49 | 47.43
GCN_LLMEmb      | 73.98 | 53.64
ENGINE          | 74.02 | 56.10
TAPE            | 74.88 | 57.72
LLM_IT          | 73.24 | 51.19
LLaGA           | 74.10 | 55.11
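For reference, the 10%/10%/60% split described above could be constructed along the following lines; this is a sketch under our own assumptions (a seeded random node-level split), not the paper's actual split code.

```python
import torch

# Sketch of a random 10% / 10% / 60% node split for the semi-supervised setting;
# the remaining 20% of nodes are simply left unused, mirroring the ratios above.
num_nodes = 169_343  # arXiv node count reported in the rebuttal
perm = torch.randperm(num_nodes, generator=torch.Generator().manual_seed(0))

n_train, n_val, n_test = (int(r * num_nodes) for r in (0.1, 0.1, 0.6))

train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask = torch.zeros(num_nodes, dtype=torch.bool)

train_mask[perm[:n_train]] = True
val_mask[perm[n_train:n_train + n_val]] = True
test_mask[perm[n_train + n_val:n_train + n_val + n_test]] = True
```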

Fine-tuned LMs as Baselines

We apologize for misunderstanding your comment, as we initially interpreted it to refer to the lack of fine-tuned LMs as baselines.

To further strengthen text classification baselines, we have added two more advanced methods: GLEM [Ref 1] and GRENADE [Ref 2]. These methods combine fine-tuned LMs with GNNs to better integrate textual and structural information.

We evaluated both methods using two LM backbones and report Accuracy (%) on the Cora, WikiCS, Photo, and arXiv datasets. Our results show that aligning LMs with GNNs boosts performance compared to standalone fine-tuned LMs. We hope these additional baselines address your concerns, and we will include this discussion in the revised manuscript.

Method             | Cora  | WikiCS | Photo | arXiv
SenBert-66M        | 66.66 | 77.77  | 73.89 | 72.66
RoBERTa-355M       | 72.24 | 76.81  | 74.79 | 74.12
GLEM (SenBert)     | 81.50 | 76.49  | 77.32 | 72.31
GLEM (RoBERTa)     | 81.30 | 76.43  | 76.93 | 73.55
GRENADE (SenBert)  | 83.97 | 80.99  | 85.73 | 73.82
GRENADE (RoBERTa)  | 83.40 | 81.09  | 85.77 | 73.58

[Ref 1] "Learning on Large-scale Text-attributed Graphs via Variational Inference." In ICLR, 2023.

[Ref 2] "GRENADE: Graph-Centric Language Model for Self-Supervised Representation Learning on Text-Attributed Graphs." In EMNLP, 2023.

Official Review (Rating: 3)

This paper establishes a benchmark for the fair evaluation of different categories of LLM-based methods for node classification and uncovers some insights through performance analysis and comparison.

Questions for Authors

  • As shown in Tables 1 and 2, classical methods achieve sufficiently strong performance on most node classification benchmarks, even in semi-supervised settings, while being significantly more efficient (milliseconds vs. several hours). This raises a key question: Under what circumstances should we use LLMs instead of pure GNNs, in terms of both usage scenarios and dataset characteristics? I believe the homophily ratio alone is not a sufficient metric to fully characterize a dataset, and there is still room to better address this question.

  • In Takeaway 2, you state: "In semi-supervised settings, the mutual information between structure and labels is relatively low, allowing LLMs to contribute more significantly to performance." Do you have any theoretical justification or intuitive explanation for this claim?

  • In Takeaway 8: LLM-as-Encoder significantly outperforms LMs in heterophilic graphs, you analyze the results using mutual information. However, both LLM-as-Encoder and LMs process only node features and are not influenced by graph structure. Thus, I believe the homophily ratio does not contribute to the observed performance differences.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

No proofs.

Experimental Designs and Analyses

The evaluated benchmarks are all homogeneous. I recommend testing on knowledge graphs such as FB15k-237, where GFM has demonstrated significant improvements over classic GNNs [1,2].

[1] Liu, Hao, et al. "One for All: Towards Training One Graph Model for All Classification Tasks." In ICLR, 2024.

[2] Kong, Lecheng, et al. "GOFA: A Generative One-for-All Model for Joint Graph Language Modeling." In ICLR, 2025.

Supplementary Material

Yes, the code.

Relation to Broader Scientific Literature

The paper provides a comprehensive evaluation of LLM-based methods for node classification and offers practical guidelines for their usage. Additionally, it introduces LLMNodeBed, a codebase designed to facilitate reproducible research in this field.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths

  • The paper presents a fair and comprehensive evaluation of LLM-based methods for node classification, a highly relevant and timely topic in the field, supported by commendable engineering efforts. Additionally, it provides practical guidelines for future use.

Weaknesses

  • This work is a direct extension of GLBench, the first comprehensive benchmark for evaluating GraphLLM methods in both supervised and zero-shot scenarios. This somewhat diminishes its originality.
  • Some of the insights provided are not particularly novel or compelling. Certain takeaways are intuitive (e.g., Takeaways 4 and 7), while others have already been established in prior GraphLLM research (e.g., Takeaway 1).
  • Some explanations for the findings lack sufficient supporting evidence—see Questions for Authors.

Other Comments or Suggestions

No.

Author Response

We sincerely thank the reviewer for your insightful and constructive feedback.


E1: Experiments on KGs

We thank the reviewer for suggesting the inclusion of Knowledge Graphs. However, datasets like FB15k-237 and WN18RR primarily focus on the Link Prediction task, whereas most LLM-based algorithms are designed for node classification. Adapting these algorithms to link prediction lies beyond the scope of LLMNodeBed, as our benchmark specifically targets node classification, and the baselines we evaluated are tailored for this.

Weakness 1: GLBench

Thank you for the opportunity to clarify the differences between GLBench and our LLMNodeBed. We acknowledge GLBench’s pioneering contribution as the first comprehensive benchmark for evaluating GraphLLM methods, and it inspired aspects of our work.

While GLBench focuses on providing an overall performance leaderboard, LLMNodeBed extends this by addressing the key question: When do LLMs help with the node classification task? To answer this, we consider numerous variables (e.g., learning paradigm, LLM functional roles, LLM type and scale, dataset homophily, prompt templates, etc.) and perform extensive fine-grained analyses. These analyses aim to provide deeper insights and practical guidelines for the community. In addition, LLMNodeBed offers a systematic implementation of all baselines to ensure fair comparisons and fast deployment. For a detailed comparison, we invite the reviewer to refer to our response to Question 2 of Reviewer yVQW.

Weakness 2: Some insights are not novel

Thank you for the feedback. While some takeaways may seem intuitive, they are empirically validated through extensive experiments, providing concrete evidence rather than a vague conclusion. For example, in Takeaway 7, we show that the summary prompt is better than other prompts when integrating neighbor information.

Question 1: Guidelines for using LLMs instead of GNNs

We thank the reviewer for this insightful question. First, it's crucial to note that in industrial applications, even a marginal accuracy improvement (e.g., 2%) can yield disproportionately large profit gains given the substantial capital investments involved.

We agree that homophily alone is insufficient to fully characterize when LLMs should be used instead of GNNs. Based on our extensive studies, we provide the following guidelines:

  • Zero-shot settings: Stronger LLMs can leverage direct inference to achieve satisfactory performance.
  • Supervised settings: Beyond homophily, other factors should be considered, such as the ratio of supervision (LLM-based algorithms provide a larger gain in semi-supervised settings) and task-feature alignment (when the classification label relies heavily on textual attributes, LLM-as-Reasoner is the best choice).

We acknowledge that this question remains open and are committed to uncovering more practical guidelines in the future.

Question 2: Explanation for mutual information between structure and labels

In semi-supervised learning, the available labels are a subset of those in supervised learning. Thus, we have $\mathcal{Y}^{semi} = g(\mathcal{Y}^{full})$, where $g(\cdot)$ is a pre-processing function to filter out some labels. Due to the data processing inequality, we have $I(\mathcal{E}, \mathcal{Y}^{semi}) \leq I(\mathcal{E}, \mathcal{Y}^{full})$.
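Written out, the step uses the fact that the semi-supervised labels are a deterministic function of the full labels, so the data processing inequality applies directly (a sketch of the argument, in the notation above):

```latex
% Sketch of the data-processing-inequality argument.
% Since \mathcal{Y}^{semi} = g(\mathcal{Y}^{full}), the chain
% \mathcal{E} \to \mathcal{Y}^{full} \to \mathcal{Y}^{semi} is Markov, hence
\[
  I(\mathcal{E}, \mathcal{Y}^{semi}) \;\leq\; I(\mathcal{E}, \mathcal{Y}^{full}),
\]
% i.e., the graph structure carries less information about the observed labels
% in the semi-supervised setting, leaving more room for text-based (LLM) signals.
```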

Question 3: LLM-as-Encoder compared with LM-as-Encoder

The strength of LLMs lies in their ability to better utilize textual information compared to LMs. In homophilic graphs, textual and structural information often overlap, as edges are typically formed between nodes with similar text. As a result, maximizing the use of textual information may yield limited performance gains, since this information is already captured and leveraged by GNNs through the graph structure. In contrast, heterophilic graphs exhibit less overlap between textual and structural information. Therefore, more effective utilization of textual features in these graphs can result in greater performance improvements. To support this claim, we present additional experiments on heterophilic graphs, which confirm Takeaway 8.
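As context, one common way to compute the homophily ratios quoted for the datasets below is edge homophily, the fraction of edges whose endpoints share a label (whether the paper uses exactly this variant is not stated in this thread). A minimal sketch, assuming a PyG-style edge_index and an illustrative helper of our own:

```python
import torch

def edge_homophily(edge_index: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of edges whose two endpoint nodes share the same label.

    edge_index: [2, num_edges] tensor of source/target indices (PyG convention).
    labels:     [num_nodes] tensor of integer class labels.
    """
    src, dst = edge_index
    return (labels[src] == labels[dst]).float().mean().item()

# Toy example: a 4-node graph where 1 of 3 edges joins same-label endpoints.
edge_index = torch.tensor([[0, 1, 2],
                           [1, 2, 3]])
labels = torch.tensor([0, 0, 1, 0])
print(edge_homophily(edge_index, labels))  # ~0.33
```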

We include the Cornell, Texas, Wisconsin, and Washington datasets, which have homophily ratios of 11.55%, 6.69%, 16.27%, and 17.07%, respectively. Using the SAGE method, we evaluate node features derived from various LMs and LLMs. The semi-supervised performance (Accuracy in %) is summarized below, revealing a performance gap of 5–17% between LLM-based encoders and their LM counterparts.

Encoder      | Cornell    | Texas      | Wisconsin  | Washington
SenBert-66M  | 52.55±1.58 | 61.73±1.37 | 70.47±1.75 | 65.54±2.44
RoBERTa-355M | 55.55±3.44 | 64.26±6.26 | 73.59±2.72 | 66.08±1.60
Qwen-3B      | 57.13±2.29 | 78.53±1.76 | 83.21±1.39 | 72.18±3.66
Mistral-7B   | 56.86±1.37 | 76.53±2.40 | 83.96±1.55 | 73.91±0.97
Reviewer Comment

Thanks a lot for your careful response.

For Q3, my point is that the relationship between structure and features is complicated and cannot be captured by heterophily/homophily alone. It would be clearer to claim "LLM-as-Encoder significantly outperforms LMs in graphs with less informative features" rather than, indirectly, "in heterophilic graphs," considering that LLMs/LMs see nothing of the graph structure.

Author Comment

Thank you for taking the time to read our response and for your thoughtful feedback.

We greatly appreciate your insightful and constructive comments on Takeaway 8. As you pointed out, the relationship between structure and features extends beyond the simplistic modeling of homophily. We will revise this takeaway in the manuscript, as per your suggestion, to state that "LLM-as-Encoder significantly outperforms LMs in graphs with less informative features," thereby eliminating potential confusion and providing greater clarity.

We are sincerely grateful for your valuable comments, which have greatly contributed to strengthening our experiments and enhancing the quality of our paper. Thank you again for your encouragement and support.

Official Review (Rating: 3)

In this paper, the authors conduct a fair and systematic comparison of LLM-based node classification algorithms. They develop LLMNodeBed, a comprehensive codebase and testbed for node classification using LLMs. They then conduct extensive experiments, training and evaluating over 2,200 models, to determine the key settings and components that affect performance. Based on the experimental results, they uncover several key insights.

Questions for Authors

  1. In the paper, the authors claim the experiments involve 2,200 models; it is unclear how these models were selected and how this number was determined.

  2. For LLMNodeBed, it would be better to include more discussion of the differences between the proposed benchmark and existing ones, e.g., in terms of datasets and baselines.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

NA

Experimental Designs and Analyses

Yes

Supplementary Material

Yes, Appendices A, B, C, D, E, and F.

Relation to Broader Scientific Literature

NA

Essential References Not Discussed

NO

Other Strengths and Weaknesses

NA

Other Comments or Suggestions

NA

Author Response

We sincerely thank the reviewer for your insightful and constructive feedback.


Question 1: Model selection and determination

Thank you for the opportunity to clarify the selection process and total number of models evaluated.

For each baseline on a specific dataset, hyper-parameters were determined based on:

  • Original papers and related benchmarks: We follow settings from original papers, official Github repos, and benchmark reproductions (if applicable) to ensure fair comparisons. Detailed configurations for LLMNodeBed are provided in Appendix C.2.
  • Recent research on hyper-parameter tuning: Inspired by studies (e.g., [Ref 1]), we expanded the hyper-parameter search space with techniques like selectable normalization layers and jumping knowledge for GNNs.

After performing the hyper-parameter search, we finalize the configuration for each model and fix it during subsequent experiments to ensure consistency and reproducibility.

The total number of 2,200 models is derived from all experiments described in the paper. Below is a detailed breakdown:

  • Main experiments (Table 1): Semi-supervised: 11 methods × 9 datasets × 4 runs = 396 models. Adding the supervised setting (10 datasets): 11 × (9+10) × 4 = 836 models.
  • LLM-as-Encoder (Table 3): 4 LM/LLM encoders × 3 methods × 9 datasets × 4 runs, excluding overlaps (e.g., GCN with Mistral-7B): (4×3−1) × 9 × 4 = 396 models. Adding the supervised setting: 396 + 440 = 836 models.
  • LLM-as-Reasoner (Tables 4 & 5): TAPE variant with GPT-4o: 9×4 (semi-supervised) + 10×4 (supervised) = 76 models.
  • LLM-as-Predictor (Table 6): 5 additional LLM backbones × 9 datasets × 4 runs (semi-supervised) + 5 × 10 × 4 (supervised) = 380 models.

Summing all parts mentioned above, we arrive at a total of ~ 2,200 models.
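As a quick sanity check, re-adding the four itemized parts gives 2,128 models, in line with the stated total of roughly 2,200 (a sketch of the arithmetic only):

```python
# Re-adding the per-experiment counts listed above.
main_experiments = 11 * 9 * 4 + 11 * 10 * 4                       # 396 + 440 = 836
encoder_experiments = (4 * 3 - 1) * 9 * 4 + (4 * 3 - 1) * 10 * 4  # 396 + 440 = 836
reasoner_experiments = 9 * 4 + 10 * 4                             # 76
predictor_experiments = 5 * 9 * 4 + 5 * 10 * 4                    # 180 + 200 = 380

total = (main_experiments + encoder_experiments
         + reasoner_experiments + predictor_experiments)
print(total)  # 2128, i.e., roughly 2,200 models
```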

Question 2: Differences with existing benchmarks

Thank you for bringing our attention to clarify the differences between LLMNodeBed and existing benchmarks. We provide a preliminary comparison in Appendix A.4 and C.3, and we expand upon it for further clarification.

The most relevant benchmarks that evaluate LLM-based methods on node classification include GLBench [Ref 2], GraphLLM [Ref 3], and CS-TAG [Ref 4]. We first summarize the motivation behind each benchmark to show the overall differences:

  • LLMNodeBed (insights on when LLMs help node classification): LLMNodeBed is designed to answer this research question. Therefore, we categorize existing LLM-based methods based on the LLM's functional role, incorporate a variety of variables (learning paradigms, LLM types and scales, dataset domain and homophily, prompt templates, etc.), and provide insights and guidelines for applying LLM-based methods in different contexts.
  • GLBench (overall performance comparison between classic and LLM-based methods): GLBench focuses on comparing the overall performance of both classic and LLM-based methods under consistent learning configurations and data splits.
  • GraphLLM (fine-grained analysis of LLM-as-Encoder and Reasoner methods): As an earlier benchmark in the LLM4Graph domain (released in July 2023), GraphLLM explores two specific LLM-based paradigms: LLM-as-Encoder (e.g., GNNs/MLPs with LLM-generated embeddings) and LLM-as-Reasoner (TAPE). It provides preliminary findings on the potential of these methods.
  • CS-TAG (benchmark for evaluating GNNs, LMs, and LM+GNN methods on text-attributed graphs): CS-TAG enhances the development of graph learning by introducing diverse datasets with text attributes. It systematically investigates the performance of GNNs, LMs, and hybrid LM+GNN methods.

The table below further provides a detailed comparison of LLMNodeBed and the aforementioned benchmarks across key dimensions, including datasets, evaluation scenarios, baselines, variables considered, and system implementation:

Benchmark  | Datasets                   | Scenarios                              | Baselines                                                        | Considered Variables     | Sys. Impl.
LLMNodeBed | 10 datasets from 4 domains | Semi-supervised, Supervised, Zero-shot | Classic, LLM-as-Encoder, LLM-as-Predictor, LLM-as-Reasoner, GFM  | 8 LLMs, 2 LMs, 6 Prompts | ✓
GLBench    | 7 datasets from 3 domains  | Semi-supervised, Zero-shot             | Classic, LLM-as-Encoder, LLM-as-Predictor, LLM-as-Reasoner, GFM  | 5 LLMs, 3 LMs            | ✗
GraphLLM   | 4 datasets from 2 domains  | Semi-supervised, Supervised, Zero-shot | Classic, LLM-as-Encoder, LLM-as-Reasoner                         | 1 LLM, 5 LMs, 6 Prompts  | ✗
CS-TAG     | 8 datasets from 2 domains  | Semi-supervised, Supervised            | Classic, LLM-as-Reasoner                                         | 4 LLMs, 11 LMs           | ✓

[Ref 1] "Classic GNNs are Strong Baselines: Reassessing GNNs for Node Classification." In NeurIPS, 2024.

[Ref 2] "GLBench: A Comprehensive Benchmark for Graph with Large Language Models." In NeurIPS D&B, 2024.

[Ref 3] "Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs." In KDD, 2024.

[Ref 4] "A Comprehensive Study on Text-attributed Graphs: Benchmarking and Rethinking." In NeurIPS D&B, 2023.

Official Review (Rating: 3)

In this paper, the authors provide guidelines for leveraging LLMs to enhance node classification tasks across diverse real-world applications. The authors introduce LLMNodeBed, a codebase and testbed for systematic comparisons, featuring ten datasets, eight LLM-based algorithms, and three learning paradigms. Through extensive experiments, the authors uncover key insights.

Questions for Authors

  1. The datasets selected in this paper are not large in scale. Do the authors have plans to conduct experiments on larger datasets (ogbn-products and ogbn-papers100M)?

Claims and Evidence

Yes, the claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

I believe that the method in this paper is relatively reasonable.

Theoretical Claims

The paper does not include theoretical proofs, so there is no need to verify the correctness of the theoretical claims in the paper.

Experimental Designs and Analyses

Yes, the experimental designs and analysis of the paper are reasonable.

Supplementary Material

Yes, I noticed that the author provided additional information in the supplementary materials.

Relation to Broader Scientific Literature

Please refer to the weaknesses section.

Essential References Not Discussed

Please refer to the weaknesses section.

Other Strengths and Weaknesses

Strengths

  1. The authors release LLMNodeBed, a PyG-based testbed designed to facilitate reproducible and rigorous research in LLM-based node classification algorithms.
  2. The paper conducts comprehensive experiments to analyze how the learning paradigm, homophily, language model type and size, and prompt design impact the performance of each algorithm category.
  3. Through detailed experiments, the authors provide intuitive explanations, practical tips, and insights about the strengths and limitations of each algorithm category.

Weaknesses

  1. Some references are missing. Under which of the three learning paradigms proposed in this paper should methods [1] that utilize LLMs as enhancers and approaches [2] that align graphs with LLMs be categorized?
  2. The authors have provided the time costs for training and inference in the appendix, which is commendable. It is recommended that the authors also present the memory costs of these methods during training and inference, as this would facilitate a more comprehensive evaluation of these methods.

[1] "GAugLLM: Improving Graph Contrastive Learning for Text-Attributed Graphs with Large Language Models." In KDD, 2024.

[2] "GRENADE: Graph-Centric Language Model for Self-Supervised Representation Learning on Text-Attributed Graphs." In EMNLP, 2023.

Other Comments or Suggestions

No

Author Response

We sincerely thank the reviewer for your insightful and constructive feedback.


Weakness 1: Missed references

Thank you for pointing out these additional related works. We have incorporated GAugLLM [Ref 1] and GRENADE [Ref 2] as baselines. Due to time constraints, we have completed experiments on the Cora, WikiCS, Instagram, and Photo datasets, and will expand the discussion to all datasets in the revised manuscript.

For GAugLLM, we use Mistral-7B as the backbone LLM. For GRENADE, we consider two different LM backbones. Based on our categorization, GRENADE falls under the Classic Methods category, as it involves aligning GNNs and LMs for joint optimization.

As shown below, GAugLLM, categorized as an LLM-as-Reasoner method, achieves performance comparable to TAPE, further supporting Takeaway 3. For GRENADE, the performance surpasses that of single fine-tuned LMs, demonstrating the effectiveness of aligning LMs with GNNs. However, it still falls short of outperforming SOTA LLM-based methods.

Semi-supervised Acc (%)  | Cora       | WikiCS     | Instagram  | Photo
GAugLLM                  | 84.31±0.76 | 81.13±0.08 | 66.96±0.19 | 85.74±0.19
GRENADE (SenBERT-66M)    | 83.97±0.23 | 80.99±0.24 | 66.10±0.24 | 85.73±0.06
GRENADE (RoBERTa-355M)   | 83.40±0.58 | 81.09±0.16 | 66.57±0.31 | 85.77±0.20

Supervised Acc (%)       | Cora       | WikiCS     | Instagram  | Photo
GAugLLM                  | 88.45±1.01 | 84.85±0.44 | 68.94±0.41 | 86.93±0.34
GRENADE (SenBERT-66M)    | 87.50±0.73 | 84.74±0.40 | 67.61±0.62 | 86.67±0.33
GRENADE (RoBERTa-355M)   | 87.82±0.87 | 84.84±0.41 | 68.20±0.32 | 87.04±0.39

Weakness 2: Memory costs

Thank you for highlighting the importance of including memory costs. We present the memory costs during both training and inference stages in the supervised setting for three datasets (Cora, WikiCS, and arXiv), scaling from 2,708 to 169,343 nodes. All memory usage was measured on a single H100-80GB GPU to ensure consistency and comparability.

Training Memory Costs (in GB)

Method         | Cora  | WikiCS | arXiv
GCN_ShallowEmb | 0.76  | 1.43   | 4.79
SenBert-66M    | 9.70  | 9.70   | 9.98
RoBERTa-355M   | 46.00 | 46.00  | 46.10
GCN_LLMEmb     | 0.85  | 1.62   | 7.36
ENGINE         | 1.10  | 4.52   | 7.08
TAPE           | 46.00 | 46.00  | 46.10
LLM_IT         | 72.12 | 67.74  | 69.13
GraphGPT       | 41.42 | 53.10  | 60.41
LLaGA          | 34.60 | 35.54  | 41.94

Inference Memory Costs (in GB)

Method         | Cora  | WikiCS | arXiv
GCN_ShallowEmb | 0.37  | 0.84   | 3.32
SenBert-66M    | 3.25  | 3.26   | 3.38
RoBERTa-355M   | 7.44  | 7.44   | 7.55
GCN_LLMEmb     | 0.40  | 1.02   | 5.79
ENGINE         | 1.29  | 1.81   | 2.12
TAPE           | 7.44  | 7.44   | 8.88
LLM_IT         | 32.08 | 31.53  | 33.12
GraphGPT       | 36.02 | 36.07  | 36.97
LLaGA          | 20.45 | 20.94  | 29.00

Based on the above statistics, we conclude the following: (i) LLM-as-Encoder methods have memory costs comparable to classic GNNs. (ii) Fine-tuning an LM is memory-intensive, especially for larger-scale LMs. (iii) LLM-as-Predictor methods are the most resource-intensive due to the high parameterization of LLMs.

These results highlight the importance of addressing efficiency challenges for deploying LLM-based methods in practice. We will include these findings in the revised manuscript.
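As an aside, peak GPU memory figures like those above are commonly obtained from PyTorch's CUDA memory counters; the sketch below shows one typical way to record them. This is our assumption about a standard measurement setup, not a description of LLMNodeBed's actual instrumentation.

```python
import torch

def measure_peak_memory_gb(run_fn) -> float:
    """Run `run_fn` once and return the peak allocated GPU memory in GB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_fn()                      # e.g., one training epoch or one full inference pass
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 3

# Illustrative usage (model and batch are hypothetical):
# peak_gb = measure_peak_memory_gb(lambda: model(batch))
```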

Question 1: Experiments on larger datasets

Thank you for raising the scalability concern regarding the datasets in our benchmark. We have supplemented experiments on the ogbn-products dataset, which contains 2,449,029 nodes and 123,718,152 edges.

Experimental Setup: To evaluate LLM-based methods within one week under limited GPU resources, we use the official data splits, which include 196,165 nodes (8.03%) for training, 39,323 nodes (1.61%) for validation, and a reduced test set of 221,309 nodes (9.04%), constituting a semi-supervised setting. The original dataset provides 100-dimensional node embeddings, which are directly adopted as shallow embeddings for classic GNNs.

Compared Methods: Given resource and time constraints, we compare classic methods, the LLM-as-Encoder method (GCN using Qwen-3B-derived embeddings), and the LLM-as-Predictor method LLaGA. Other methods, such as TAPE, are excluded due to the high cost of generating reasoning texts for the entire dataset via APIs (e.g., ~$500 for GPT-4o-mini). Similarly, LLM_IT is omitted as it would require ~4 days for instruction tuning.

Results (%): LLM-based methods outperform classic methods by 5%.

Method          | Accuracy | Macro-F1
GCN_ShallowEmb  | 55.71    | 20.61
SAGE_ShallowEmb | 56.04    | 20.41
GAT_ShallowEmb  | 56.15    | 20.35
SenBert-66M     | 69.06    | 24.89
RoBERTa-355M    | 75.12    | 30.29
GCN_LLMEmb      | 77.91    | 35.91
LLaGA           | 79.07    | 36.36

[Ref 1] "GAugLLM: Improving Graph Contrastive Learning for Text-Attributed Graphs with Large Language Models." In KDD, 2024.

[Ref 2] "GRENADE: Graph-Centric Language Model for Self-Supervised Representation Learning on Text-Attributed Graphs." In EMNLP, 2023.

Final Decision

The paper provides a benchmark analysis for node classification based on LLMs. Specifically, it covers five different ways of using LLMs for node classification as well as different learning paradigms. The comprehensive analysis is a key strength of the paper, highlighted by most reviewers. Nevertheless, the reviewers have also indicated several limitations, mainly related to the following points: (i) Size of the datasets used along with detailed experiments on large-scale graphs. (ii) Choice of baseline methods. (iii) Interpretation of the empirical claims made in the paper. Considering the effort of the authors to address the key points in the response, I'm in favor of accepting the paper, which overall makes a good contribution toward understanding how LLMs can be used in graph machine learning.