PaperHub
Overall rating: 7.5/10 (Poster; 5 reviewers; min 4, max 5, std 0.5)
Individual ratings: 4, 4, 5, 5, 5
Average confidence: 3.8
Novelty: 2.4 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.6
NeurIPS 2025

Dynamic Bundling with Large Language Models for Zero-Shot Inference on Text-Attributed Graphs

OpenReview · PDF
Submitted: 2025-05-07 · Updated: 2025-10-29
TL;DR

This paper tackles the problem of zero-shot inference on text-attributed graphs, and proposes a novel method that uses bundles to query LLMs and supervise GNNs.

Abstract

Keywords
Text-Attributed Graphs · Large Language Models · Zero-Shot Inference

Reviews and Discussion

Review
Rating: 4

Applying large language models to text-attributed graphs has drawn increasing attention. This paper proposes Dynamic Text Bundling Supervision, which queries large language models with bundles of texts. Bundle-level labels are obtained and used to supervise graph neural networks. The proposed approach mainly addresses two challenges: large language models receive only limited information on graph structure, and their responses can be unreliable. Theoretical analysis of the proposed method is provided. Experiments across ten datasets show improved performance over existing methods.

Strengths and Weaknesses

Strengths:

  • Large language models may serve as an effective knowledge base for supervised tasks in graph neural networks.

  • Theoretical analysis of the proposed method is provided.

  • Experiments across ten datasets are conducted and show improved performance over other LLM-based methods.

Weaknesses:

  1. The significance of this paper is limited. There are some technical concerns; please refer to the questions for details.

  2. The theoretical analysis focuses on bundles, which is quite limited, as the supervised tasks are mainly performed by large language models.

Questions

  1. Can a text bundle be interpreted as context window size? Any insight? If a bundle is interpreted as a proxy for this window—composed of graph nodes and their descriptions—this has important implications for scaling, quality of encoding, and model performance.

  2. Any other way to tell graphs with high homophily or heterophily other than graph type? What to do with medium homophily and heterophily? The design of the bundle sampling mechanism may need to be adaptive, possibly using homophily-aware selection or weighting strategies.

  3. The paper states: “During optimization of the graph neural network gθ, the bundle B may include nodes that do not belong to the category of ŷ_B.”

Given that this is a supervised learning setup and labels are available during training, it’s unclear why the bundles are not explicitly constructed to include only same-class nodes. Such a strategy could improve signal consistency in the text prompts and reduce label noise within the bundle. If there are limitations that prevent such a strategy—such as low label density or design constraints—these need to be explicitly stated. Otherwise, this decision appears to introduce noise and reduces the clarity of the supervision signal.

  4. "Bundle" seems to be the central component of the proposed method:

(1) Edges in a bundle are implicitly input to large language models as graph structure information.

(2) Multiple nodes in a bundle provide a larger context, which reduces hallucination.

However, more nodes are preferred in a bundle to incorporate more structural information, which also increases the number of "outliers". For such a core component, the analysis in Section 4.4 is not sufficient.

Structural information incorporated in this way is quite limited.

  5. It is better to include dataset split details directly in the paper instead of referring to other papers.

  6. Performance shown in the experiments is often far below traditional GNN methods (e.g., GCN achieves 81.5% accuracy on Cora; GraphSAGE achieves 83.67% on WikiCS). As the proposed LLM-based approach does not modify the LLMs in any way, simple prompt-level work may have limitations. Please justify the motivation for using LLMs: is it just because they are popular?

Limitations

Adequate.

Final Justification

The responses from authors addressed most of my questions. I raised the rating accordingly.

Formatting Issues

None.

Author Response

We are truly grateful for the time you have taken to review our paper and for your insightful review. We address your comments below.

Q1. The significance of this paper is limited. There are some technical concerns. Please refer to questions for details.

A1. Thank you for the comment. The contribution of this paper is three-fold:

  • The idea of bundling nodes/texts: We propose to bundle relevant nodes and text attributes, and use these bundles instead of individual nodes to query LLMs and supervise GNNs.
  • Unified Framework: We propose a novel framework consisting of bundle sampling, bundle query, bundle supervision, and bundle refinement. The proposed framework outperforms a number of baselines.
  • Theoretical Analysis: We provide rigorous theoretical analysis of our method, showing its tolerance to outlier nodes and the convergence properties.

We will address your technical concerns in the following.

Q2. The theoretical analysis focuses on bundle, which is quite limited as supervised tasks are mainly performed by Large language models.

A2. Thank you for the comment. The bundle is an important concept of this paper. Bundles are used for supervising the GNN, which is then used for the downstream task. The two theorems in this paper focus on bundle supervision, which is one of the core components of this paper.

Q3. Can a text bundle be interpreted as context window size? Any insight? If a bundle is interpreted as a proxy for this window—composed of graph nodes and their descriptions—this has important implications for scaling, quality of encoding, and model performance.

A3. Thank you for the question. A text bundle is not the context window size. In addition to providing extra information, text bundling has the following advantages. Firstly, text bundling reduces the task difficulty for LLMs, as predicting the overall property of a set of closely related text items is easier than predicting one of them. This makes the supervision signals more robust. Secondly, text bundling is more efficient in supervising GNNs, as all nodes in a bundle can provide supervision signals.

Q4. Any other way to tell graphs with high homophily or heterophily other than graph type? What to do with medium homophily and heterophily? The design of the bundle sampling mechanism may need to be adaptive, possibly using homophily-aware selection or weighting strategies.

A4. Thank you for the question. The homophily ratio can be estimated by sampling a set of connected node pairs and directly asking the LLM whether they belong to the same category. Below are the results for the ground-truth homophily and the predictions by GPT-4o-mini. The predicted results are close to the ground truth, which is enough to roughly decide which type of bundle sampling method to use.

Datasets    Cora  Citeseer  History  Children  Sportsfit  Wikics  Cornell  Texas  Wisconsin  Washington
GT          0.81  0.76      0.66     0.46      0.90       0.67    0.11     0.06   0.15       0.19
Predicted   0.70  0.81      0.73     0.35      0.81       0.52    0.05     0.04   0.06       0.02
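As a reference point, the ground-truth edge homophily that the LLM is asked to approximate can be computed directly whenever labels are available; the sketch below uses a hypothetical `edge_homophily` helper on a toy graph:

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# Toy graph: 4 nodes with labels [0, 0, 1, 1] and 3 edges.
edges = [(0, 1), (1, 2), (2, 3)]
labels = [0, 0, 1, 1]
print(edge_homophily(edges, labels))  # 2 of 3 edges are intra-class -> 0.666...
```

In the zero-shot setting discussed above, the `labels[u] == labels[v]` check would be replaced by the LLM's same-category judgment on each sampled pair.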

We also provide an adaptive approach of bundle sampling, where we take the union of topological sampling and semantic sampling, with the ratio of each determined by the predicted homophily. The results are shown below, where we can see that adaptive sampling also outperforms the baseline.

Datasets                  Citeseer  Children  Cornell
LLMBP                     69.51     24.81     83.28
DENSE, adaptive sampling  72.14     30.62     84.82

Q5. The paper states: “During optimization of the graph neural network gθ, the bundle B may include nodes that do not belong to the category of yˆB.” Given that this is a supervised learning setup and labels are available during training, it’s unclear why the bundles are not explicitly constructed to include only same-class nodes. Such a strategy could improve signal consistency in the text prompts and reduce label noise within the bundle. If there are limitations that prevent such a strategy—such as low label density or design constraints—these need to be explicitly stated. Otherwise, this decision appears to introduce noise and reduces the clarity of the supervision signal.

A5. Thank you for the comment. We want to clarify that the setting of this paper is zero-shot inference instead of supervised learning. In zero-shot inference, we do not have access to the ground truth labels.

Q6. "Bundle" seems to be the central component of the proposed method: (1) edges in a bundle are implicitly input to large language models as graph structure information; (2) multiple nodes in a bundle provide a large context, which will reduce hallucination. However, more nodes are preferred in a bundle to incorporate more structural information, which also increases "outliers". As the core component, the analysis in 4.4 is not sufficient. Structural information incorporated in this way is quite limited.

A6. Thank you for the comment. There is a balance between incorporating more information and excluding outliers. We empirically find that the bundle size of 5 is optimal in terms of overall accuracy. Additionally, by comparing bundle query and individual query in Section 4.3, we show that bundling improves performance. We also provide an additional experiment where we manually include one outlier into the bundles. The results are shown below, and we can see that our method is robust to outliers, which is also demonstrated by Theorem 3.1.

Datasets                                       Cora   History  Texas
Individual query                               71.96  63.95    84.49
Bundle query                                   75.09  67.31    92.51
Bundle query with a manually included outlier  74.72  66.59    91.44
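The intuition behind this robustness (a single outlier cannot change the bundle's mode category, which is the label the LLM is asked to produce) can be seen in a toy sketch; the category names are made up for illustration:

```python
from collections import Counter

def bundle_mode(labels):
    """Mode (most frequent) category of a bundle."""
    return Counter(labels).most_common(1)[0][0]

clean = ["ML", "ML", "ML", "ML", "DB"]
with_outlier = ["ML", "ML", "ML", "ML", "OS"]  # one node swapped for an outlier
print(bundle_mode(clean), bundle_mode(with_outlier))  # both are "ML"
```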

Q7. It is better to include dataset split details directly in the paper instead of referring to other papers.

A7. Thank you for your suggestion. As our setting is zero-shot inference, there is no training set, and we list the test set split details as follows.

Datasets    Cora  Citeseer  Wikics  History  Children  Sportsfit  Cornell  Texas  Wisconsin  Washington
Test Split  542   2566      5847    8311     15375     121139     191      187    265        229

Q8. Performance shown in experiments often is far below traditional GNN methods (e.g., GCN achieves 81.5% accuracy on Cora, GraphSage achieves 83.67% on Wiki CS). As the proposed LLMs-based approach does not modify LLMs in any way, simple prompt-level work may have limitations. Please justify the motivation using LLMs, just because they are popular?

A8. Thank you for your question. The setting of this paper is zero-shot inference, while the performance you mentioned (81.5% and 83.67%) is semi-supervised learning. LLMs have strong zero-shot generalization ability, and they can understand the semantics of text attributes and class labels, enabling preliminary classification, which is beyond the capabilities of traditional GNNs.

In light of these responses, we hope that we have addressed your concerns and that you will consider raising your score. If there are any additional notable points of concern that we have not yet addressed, please do not hesitate to share them, and we will promptly attend to those points.

Comment

Thanks for the clarification. I will raise the rating.

Comment

Thank you again for your feedback! We are pleased to know that you have decided to raise the rating. We really appreciate your efforts in reviewing our paper, your insightful comments, and your support.

Review
Rating: 4

This paper addresses zero-shot inference on text-attributed graphs (TAGs) using large language models (LLMs). The authors identify two key challenges: (1) LLMs receive limited information about graph structure, and (2) LLMs produce unreliable responses due to hallucination and information insufficiency. To address these issues, they propose Dynamic Text Bundling Supervision (DENSE), which queries LLMs with bundles of topologically or semantically similar texts rather than individual texts. The method obtains bundle-level labels from LLMs and uses these to supervise graph neural networks through specially designed entropy-based and ranking-based loss functions. The approach includes a bundle refinement mechanism that removes noisy nodes during training. Theoretical analysis is provided to justify the design choices, and extensive experiments on ten datasets demonstrate the effectiveness of the proposed method.

Strengths and Weaknesses

Strengths:

  1. The paper provides solid theoretical foundations with three theorems proving the tolerance to outliers and convergence properties of the proposed bundle supervision. The experimental evaluation is comprehensive, covering ten datasets across different domains and graph types (homophilic and heterophilic);

  2. The work addresses an important problem in graph learning and provides a solution that outperforms baselines across various datasets. The bundle-based approach offers a new perspective on leveraging LLMs for graph tasks;

  3. The paper is generally well-written with clear motivation, methodology description, and experimental setup. The figures effectively illustrate the key concepts, particularly Figure 1 showing the difference between individual and bundle queries;

  4. The method shows improvements across different types of graphs and is agnostic to specific GNN architectures, making it broadly applicable.

Weaknesses:

  1. The core idea of bundling similar items for better LLM performance is not particularly novel. The contribution is more incremental, combining existing techniques (LLM querying + GNN training) rather than introducing fundamentally new concepts;

  2. The approach treats LLMs and GNNs as separate components rather than achieving true integration. The LLM serves merely as a labeling tool, and there's limited synergy between the two components;

  3. The bundle refinement process is quite mechanical, removing only the least confident nodes at fixed epochs (300 and 400), which may not be optimal for all scenarios.

  4. Although the experimental results are comprehensive, I still have doubts about their reliability (as shown in Table 1 and Table 3). In Table 3, V2 and V5 correspond to single-node annotation and individual supervision, respectively. Their results are also superior to most baselines, which seems contrary to the core contribution of this paper.

Questions

  1. Regarding weakness 4, could you clarify whether the experimental settings are consistent across all comparisons? What specific configurations were used for the baseline methods?

  2. The method proposed in this paper is a form of LLMs for data augmentation, but the baselines do not include similar experimental settings (e.g., Chen, Z., Mao, H., Wen, H., Han, H., Jin, W., Zhang, H., Liu, H., and Tang, J. Label-free node classification on graphs with large language models (LLMs). ICLR, 2024).

  3. Why is the bundle refinement limited to only two fixed epochs (300 and 400) with removal of just the least confident node? Have you explored more adaptive refinement strategies or different refinement frequencies? What is the sensitivity to these hyperparameters?

  4. For heterophilic graphs, you use semantic proximity for sampling, but this essentially ignores the graph topology entirely. How does this approach truly address the claimed limitation of "limited information on graph structure"? Could you provide more analysis on how graph structure is actually utilized?

  5. The method requires multiple LLM queries (100 bundles × datasets). Could you provide analysis of the computational cost and scalability compared to baselines?

Limitations

Yes

Final Justification

I have reviewed the authors' responses, and they have addressed most of my concerns.

Formatting Issues

None

Author Response

We are truly grateful for the time you have taken to review our paper and for your insightful review. We address your comments below.

Q1. The core idea of bundling similar items for better LLM performance is not particularly novel. The contribution is more incremental, combining existing techniques (LLM querying + GNN training) rather than introducing fundamentally new concepts;

A1. Thank you for the comment. The contribution of this paper is three-fold:

  • The idea of bundling nodes/texts: We propose to bundle relevant nodes and text attributes, and use these bundles instead of individual nodes to query LLMs and supervise GNNs.
  • Unified Framework: We propose a novel framework consisting of bundle sampling, bundle query, bundle supervision, and bundle refinement. The proposed framework outperforms a number of baselines.
  • Theoretical Analysis: We provide rigorous theoretical analysis of our method, showing its tolerance to outlier nodes and the convergence properties.

Our method is not a simple combination of existing techniques. We introduce the new concept of bundles, while prior studies adopt individual supervision on GNNs.

Q2. The approach treats LLMs and GNNs as separate components rather than achieving true integration. The LLM serves merely as a labeling tool, and there's limited synergy between the two components;

A2. Thank you for the comment. This paper explores graph annotation from the perspective of bundles, which are used to facilitate downstream GNN training.

Q3. The bundle refinement process is quite mechanical, removing only the least confident nodes at fixed epochs (300 and 400), which may not be optimal for all scenarios.

A3. Thank you for the comment. As the bundle size in our paper is 5, removing one item at a time is a reasonable choice. We also provide results below for removing more items, as well as for alternative epoch configurations. The results suggest that the current default configuration achieves satisfactory performance.

Dataset                               Citeseer  History
Default                               72.37     67.31
Removing 2 items each time            72.06     67.12
Removing at 200, 300, and 400 epochs  71.94     67.21
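The default schedule described above can be sketched as follows; `refine_bundle` and the confidence dictionary are illustrative stand-ins for the GNN's softmax confidence on the bundle label, not the paper's implementation:

```python
REFINE_EPOCHS = {300, 400}  # default refinement points

def refine_bundle(bundle, confidence, epoch):
    """At a refinement epoch, drop the single least-confident node."""
    if epoch not in REFINE_EPOCHS or len(bundle) <= 1:
        return bundle
    weakest = min(bundle, key=lambda n: confidence[n])
    return [n for n in bundle if n != weakest]

bundle = [10, 11, 12, 13, 14]                  # bundle size 5
conf = {10: 0.9, 11: 0.8, 12: 0.2, 13: 0.7, 14: 0.85}
print(refine_bundle(bundle, conf, epoch=300))  # node 12 is removed
print(refine_bundle(bundle, conf, epoch=100))  # unchanged outside the schedule
```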

Q4. Although the experimental results were comprehensive and effective, I still have doubts about the reliability of the experimental results (as shown in table 1 and table 3). Table3 v2 and v5 correspond to single-node annotation and individual supervision respectively. Their results are also superior to most baselines, which seems to be contrary to the core contribution of this paper.

A4. Thank you for the comment. For V2, our method achieves significant improvement compared to individual querying. For V5, while it uses $\mathcal{L}_{IE}$ (defined in Theorem 3.1), it still adopts the overall framework of DENSE, including bundle sampling/query, ranking-based supervision, and bundle refinement. Many baseline methods suffer from poor generalization capability, leading to low accuracy under the zero-shot setting. For example, GOFA and ZeroG adopt joint training of LLM and graphs. Their performance worsens when the test distribution is different from the training distribution.

Q5. Regarding weaknesses 4, could you clarify if the experimental settings are consistent across all comparisons? What specific configurations were used for baseline methods?

A5. Thank you for the question. The experimental setting is zero-shot inference on TAGs without access to node labels, and it is consistent across all comparisons. We provide the details as follows:

  • Text encoders (SBERT, RoBERTa, TE-3-Large, LLM2Vec): The text attributes and class labels are encoded by the text encoders, and the nodes are assigned according to the cosine similarity between their text embeddings and class embeddings.
  • Generative LLMs (GPT-3.5-Turbo, GPT-4o): The class labels are obtained by directly prompting the LLMs.
  • Graph SSL methods (DGI, GraphMAE): The implementation follows the TSGFM [1] benchmark, and the models are trained on the ogbn-arxiv dataset.
  • OFA, ZeroG, GraphGPT, LLAGA are trained on the ogbn-arxiv dataset.
  • For GOFA, the pretrained weights are adopted, and the ogbn-arxiv dataset is used for finetuning.
  • For UniGLM, the pretrained model is directly adopted for inference.
  • For LLMBP, it is a pretraining-free solution, and we directly use their official implementation.

[1] Text-space graph foundation models: Comprehensive benchmarks and new insights. Chen et al. NeurIPS 2024.
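The text-encoder baselines above reduce zero-shot classification to nearest-class assignment in embedding space; below is a minimal sketch with placeholder embeddings (a real encoder such as SBERT would produce them):

```python
import numpy as np

def zero_shot_assign(node_emb, class_emb):
    """Assign each node to the class with the highest cosine similarity."""
    n = node_emb / np.linalg.norm(node_emb, axis=1, keepdims=True)
    c = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    return (n @ c.T).argmax(axis=1)

# Placeholder 4-dim embeddings: 3 node texts, 2 class labels.
nodes = np.array([[1.0, 0.1, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.2],
                  [0.9, 0.0, 0.1, 0.0]])
classes = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
print(zero_shot_assign(nodes, classes))  # [0 1 0]
```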

Q6. The method proposed in this paper is a kind of LLMs for Data Augmentation, but the baselines do not include similar experimental settings.(such as Chen, Z., Mao, H., Wen, H., Han, H., Jin, W., Zhang, H., Liu, H., and Tang, J. Label-free node classification on graphs with large language models (llms). ICLR, 2024e.)

A6. Thank you for bringing this excellent work to our attention! We will include this work in the revised version. We want to clarify that our method is not data augmentation. We provide a comparison between our method and Chen et al. as follows:

Datasets      CiteSeer  WikiCS  ogbn-products
DENSE (ours)  72.37     71.03   76.88
Chen et al.   68.64     64.72   74.54

Q7. Why is the bundle refinement limited to only two fixed epochs (300 and 400) with removal of just the least confident node? Have you explored more adaptive refinement strategies or different refinement frequencies? What is the sensitivity to these hyperparameters?

A7. Thank you for the question. For the time of refinement, the general principle is to perform refinement after the training converges. In practice, we observe that the bundle supervision converges within 300 epochs, and that after refinement, it converges within an additional 100 epochs. Additionally, since the optimal bundle size is 5 (Section 4.4), removing one item each time is reasonable. We also provide results of different configurations in A3. The model's performance is not very sensitive to the configuration.

Q8. For heterophilic graphs, you use semantic proximity for sampling, but this essentially ignores the graph topology entirely. How does this approach truly address the claimed limitation of "limited information on graph structure"? Could you provide more analysis on how graph structure is actually utilized?

A8. Thank you for the question. For the homophilic graphs, graph topology is important for understanding the semantics of nodes, and we incorporate the topological information in bundle sampling. For heterophilic graphs, the topology provides less information, and we find semantic proximity more effective in bundle sampling.
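The two sampling modes can be sketched as below; the function names and truncation details are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def topological_bundle(adj, seed, size):
    """Homophilic case: the seed node plus its graph neighbors."""
    return ([seed] + list(adj[seed]))[:size]

def semantic_bundle(emb, seed, size):
    """Heterophilic case: the seed node plus its nearest neighbors
    by cosine similarity of text embeddings (topology ignored)."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = e @ e[seed]
    return np.argsort(-sims)[:size].tolist()  # seed itself ranks first

adj = {0: [1, 2], 1: [0], 2: [0], 3: []}
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05]])
print(topological_bundle(adj, 0, 3))  # [0, 1, 2]
print(semantic_bundle(emb, 0, 3))     # [0, 3, 1]
```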

Q9. The method requires multiple LLM queries (100 bundles × datasets). Could you provide analysis of the computational cost and scalability compared to baselines?

A9. Thank you for the question. We provide the computational cost and scalability on the Cora dataset below. The results show that our method is better than the baseline. The computational cost is generally not a concern, as GNNs are lightweight models and our method maintains good performance even with cheaper LLMs (e.g. DeepSeek-V3 or Gemini-2.5-flash in Table 2).

Methods       Number of LLM queries  Total Inference Time (s)
LLMBP         140                    34
DENSE (ours)  100                    32

In light of these responses, we hope that we have addressed your concerns and that you will consider raising your score. If there are any additional notable points of concern that we have not yet addressed, please do not hesitate to share them, and we will promptly attend to those points.

Comment

Thank you again for your feedback! We are pleased to know that our responses have addressed most of your concerns and that you have decided to raise the score. We really appreciate your efforts in reviewing our paper, your insightful comments, and your support.

Comment

Thank you for your detailed rebuttal. I have reviewed the authors' responses, and they have addressed most of my concerns. Based on the authors' clarifications in the rebuttal, I will slightly increase my original rating.

Review
Rating: 5

This paper proposes DENSE (Dynamic Text Bundling Supervision), a novel framework for zero-shot inference on text-attributed graphs (TAGs). The method addresses two key challenges: (1) LLMs' limited access to graph structural information and (2) unreliability of LLM responses due to hallucination or insufficient context. DENSE samples node bundles (based on topological/semantic proximity), queries LLMs for bundle-level labels, uses these labels to supervise GNN training, and dynamically refines bundles to exclude noisy nodes. Theoretical analysis demonstrates tolerance to outliers and convergence guarantees. Extensive experiments on 10 diverse TAG datasets show state-of-the-art performance against 15 baselines.

Strengths and Weaknesses

Strengths

  1. Novel & impactful framework: DENSE innovatively bridges LLMs and GNNs via dynamic text bundling, solving LLMs' structural unawareness and unreliability for zero-shot TAG inference. The bundle-supervise-refine pipeline is elegant and practical.

  2. Strong empirical validation: Outperforms 15 baselines across 10 diverse datasets (homophilic/heterophilic). Adaptive sampling (topological/semantic) and ablation studies prove method versatility.

  3. Theoretical grounding: Formal proofs show bundle supervision’s outlier tolerance and convergence guarantees, significantly boosting credibility beyond empirical results.

  4. High reproducibility: Public code, detailed prompts, and exhaustive training specs set a commendable standard for transparency.

Weaknesses

  1. Scope expansion: Method currently requires textual attributes; exploring LLM-generated pseudo-descriptors for non-text-attributed graphs could broaden applicability.

  2. Heterophily refinement: While semantic sampling handles heterophily, explicitly prompting LLMs to resolve topological proximity vs. semantic conflicts (e.g., "identify dominant label in semantically divergent bundles") may further boost robustness.

DENSE's innovations address critical LLM-GNN integration gaps. Weaknesses reflect optional extensions, not flaws.

Questions

See Weaknesses.

Limitations

yes

Final Justification

Thank you for the response, and I will keep my positive score.

Formatting Issues

No

Author Response

We are truly grateful for the time you have taken to review our paper, your insightful comments and support. Your positive feedback is incredibly encouraging for us! In the following response, we would like to address your major concern and provide additional clarification.

Q1. Scope expansion: Method currently requires textual attributes; exploring LLM-generated pseudo-descriptors for non-text-attributed graphs could broaden applicability.

A1. Thank you for your suggestion! We will add the following discussion to the revised manuscript: the proposed DENSE framework can be extended to a more general setting where text attributes are not provided. Non-text-attributed graphs can be converted to text-attributed graphs, as explored by Wang et al. [1]. We will introduce these techniques to expand the scope of our method in future work.

[1] Can LLMs Convert Graphs to Text-Attributed Graphs? Wang et al. 2024

Q2. Heterophily refinement: While semantic sampling handles heterophily, explicitly prompting LLMs to resolve topological proximity vs. semantic conflicts (e.g., "identify dominant label in semantically divergent bundles") may further boost robustness.

A2. Thank you for your suggestion! We will extend our work by exploring the prompting techniques in bundle query when handling heterophilic graphs, explicitly informing the LLMs about the homophilic/heterophilic information of the current graph to improve the robustness of our method.

We will properly incorporate your suggestions into our revised version. Thanks again for appreciating our work and for your constructive suggestions. Please let us know if you have further questions.

Comment

Thank you again for acknowledging our rebuttal! We hope that our responses have addressed your concerns. We really appreciate your efforts in reviewing our paper, your insightful comments, and your support.

Comment

Thank you for the detailed and thoughtful rebuttal. I appreciate your clarifications and proposed future directions, especially regarding scope expansion and heterophily handling. Your work presents a novel and impactful contribution to zero-shot learning on text-attributed graphs, and I believe it will inspire further research in integrating LLMs and GNNs. I’m glad to support your paper.

Review
Rating: 5

This paper proposes dynamic text bundling supervision for zero-shot inference on text-attributed graphs. The authors propose to query the LLMs with text bundles to improve the robustness of label and subsequent supervision. The bundles are further refined to reduce noise and improve the performance. Additionally, the authors provide theoretical analysis of their method. Comprehensive experiments are also provided and the proposed method achieves state-of-the-art performance on many datasets.

Strengths and Weaknesses

Strengths:

  1. The proposed text bundling is novel and interesting. The idea of bundling text pieces to query LLMs is different from many existing methods. The writing of this paper is clear and easy to understand.
  2. The proposed method is supported by solid mathematical analysis, with theorems, detailed proofs and explanations of their implications.
  3. The authors perform extensive experiments on a wide range of graph benchmark datasets, and the proposed method outperforms many state-of-the-art baselines.

Weaknesses:

  1. The authors adopt two bundle sampling techniques: topological proximity and semantic proximity, according to graph homophily. In practice, it may be non-trivial to determine the homophily degree of graphs, and this paper lacks discussion on this. The authors should discuss how to decide graph homophily when it is not obvious from the graph's meta information.
  2. The motivation of the proposed ranking-based supervision is not very clear. Why do you use both entropy-based and ranking-based supervision? Why use a min operator? The authors should explain the rationale behind this technique.
  3. The authors only use one GNN architecture for homophilic graphs, while there are many GNN architectures available. The authors should perform experiments on different types of GNNs like GraphSAGE or GIN to show the influence of GNN architectures.

Questions

Please refer to the weakness section.

Limitations

yes

Final Justification

My concerns have been addressed, thus I vote for acceptance.

Formatting Issues

NA

Author Response

We are truly grateful for the time you have taken to review our paper, your insightful comments and support. Your positive feedback is incredibly encouraging for us! In the following response, we would like to address your major concern and provide additional clarification.

Q1. The authors adopt two bundle sampling techniques: topological proximity and semantic proximity, according to graph homophily. In practice, it may be non-trivial to determine the homophily degree of graphs, and this paper lacks discussion on this. The authors should discuss how to decide graph homophily when it is not obvious from the graph's meta information.

A1. Thank you for your suggestion! We will discuss graph homophily in the revised version. Specifically, one approach is to sample a set of connected node pairs and directly ask the LLM whether they belong to the same category. Below are the results for the ground-truth homophily and the predictions by GPT-4o-mini.

Datasets    Cora  Citeseer  History  Children  Sportsfit  Wikics  Cornell  Texas  Wisconsin  Washington
GT          0.81  0.76      0.66     0.46      0.90       0.67    0.11     0.06   0.15       0.19
Predicted   0.70  0.81      0.73     0.35      0.81       0.52    0.05     0.04   0.06       0.02

Q2. The motivation of the proposed ranking-based supervision is not very clear. Why do you use both entropy-based and ranking-based supervision? Why using a min operator? The authors should explain the rationale behind this technique.

A2. Thank you for your question! We use entropy-based supervision because nodes in the bundle are more likely to fall into the mode category of the bundle. Since there may be nodes from other categories, the ranking-based supervision is designed to penalize bundles in which the majority of nodes are not classified into the mode category. The min operator is used to achieve this functionality. We will provide more explanation of the rationale in the revised version.
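The exact loss definitions are in the paper; the sketch below is only one plausible instantiation of the rationale described above, with the mean-prediction cross-entropy and the top-half min both being assumptions for illustration:

```python
import numpy as np

def bundle_losses(probs, bundle_label, margin=0.1):
    """probs: (k, C) per-node class probabilities for one bundle.

    Entropy-style term: cross-entropy of the bundle's mean prediction
    against the bundle label, pulling nodes toward the mode category.
    Ranking-style term: a hinge on the smallest margin among the top
    half of nodes, so it fires only when the majority of nodes fail to
    rank the bundle label first (one way to realize a min operator).
    """
    k = probs.shape[0]
    entropy_loss = -np.log(probs.mean(axis=0)[bundle_label] + 1e-12)
    others = np.delete(probs, bundle_label, axis=1).max(axis=1)
    margins = np.sort(probs[:, bundle_label] - others)[::-1]
    ranking_loss = max(0.0, margin - margins[(k - 1) // 2])
    return entropy_loss, ranking_loss

# Majority (3 of 5) predicts the bundle label: the ranking term is 0.0 here.
probs = np.array([[0.9, 0.05, 0.05]] * 3 + [[0.1, 0.8, 0.1]] * 2)
print(bundle_losses(probs, bundle_label=0))
```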

Q3. The authors only use one GNN architecture for homophilic graphs, while there are many GNN architectures available. The authors should perform experiments on different types of GNNs like GraphSAGE or GIN to show the influence of GNN architectures.

A3. Thank you for your suggestion! We provide additional results using GraphSAGE as follows. The results suggest that our method achieves good performance with different GNNs.

| Datasets | Cora | CiteSeer | History |
|---|---|---|---|
| DENSE w/ GCN | 75.09 | 72.37 | 67.31 |
| DENSE w/ GraphSAGE | 76.38 | 74.86 | 66.97 |

We will properly incorporate your suggestions into our revised version. Thanks again for appreciating our work and for your constructive suggestions. Please let us know if you have further questions.

Comment

Thank you for the response. My concerns have been addressed, and I will raise my score.

Comment

Thank you again for your feedback! We are pleased to know that our responses have addressed your concerns. We really appreciate your efforts in reviewing our paper, your insightful comments, and your support.

Review
5

This paper addresses the challenge of zero-shot inference on text-attributed graphs (TAGs), where the goal is to classify nodes without any labeled data. The paper introduces a new method called Dynamic Text Bundling Supervision (DENSE). Instead of querying an LLM with individual node texts, DENSE groups nodes with close proximity (either topological or semantic) into "bundles." The LLM is then queried with the combined text of these bundles to determine a single "bundle label," which represents the most frequent category within that bundle. These bundle-level labels are then used to supervise the training of a graph neural network (GNN). The authors provide theoretical justification for their method, showing its tolerance to outliers and analyzing its convergence properties. The effectiveness of DENSE is demonstrated through extensive experiments on ten datasets, where it consistently outperforms fifteen baseline methods, including various text encoders, generative LLMs, and graph learning models.

Strengths and Weaknesses

Strengths

  1. The problem is novel. The paper studies a relatively new problem of zero-shot inference on text-attributed graphs using LLMs. Although there are some recent efforts using LLMs on graphs, applying them for zero-shot node classification without any labels is a challenging problem. The paper correctly identifies the key bottlenecks of information limitation and response unreliability, making the problem well-defined and highly relevant.

  2. The method is quite innovative. The core idea of "dynamic text bundling" is a novel and intuitive solution to the identified problem. By bundling texts, the method provides richer, contextualized information to the LLM, moving beyond isolated node attributes. The supervision and refinement mechanisms are well-reasoned; the bundle-level supervision is theoretically shown to be more tolerant to noisy labels than individual supervision, and the dynamic refinement process actively purges noise. This multi-faceted approach is a clear methodological innovation.

  3. The experiment is extensive. The authors test their method across ten diverse datasets, covering both homophilic and heterophilic graphs, as well as various domains like citation, e-commerce, and webpage networks. They compare DENSE against a wide spectrum of 15 recent and relevant baselines, including text encoders, generative LLMs, graph self-supervised methods, and dedicated graph foundation models. The consistent outperformance across all datasets strongly validates the method's effectiveness and generalizability. The inclusion of ablation studies and analysis of different LLM backbones further strengthens the empirical claims.

Weaknesses

  1. The exploration of heterophily is limited. While the paper includes heterophilic graphs (Cornell, Texas, Wisconsin, Washington), the mechanism for handling them relies solely on switching from "topological proximity" to "semantic proximity" for bundle sampling. This semantic sampling is based on the initial text embeddings. The paper could benefit from a deeper exploration of how the bundling strategy interacts with strong heterophily, where even semantically similar nodes might belong to different classes. The performance gains on heterophilic graphs, while strong, might stem more from the power of the base GNN designed for heterophily (GloGNN) rather than the bundling strategy itself being optimally adapted for this challenging scenario.

  2. The method depends on High-Quality Initial Embeddings: The "semantic proximity" sampling, crucial for heterophilic graphs, relies on the quality of initial node embeddings generated by a text encoder. The paper uses a task-adaptive embedding method from a prior work. If these initial embeddings are poor or do not capture the relevant semantic nuances for classification, the resulting bundles could be noisy from the outset, potentially hampering the entire process. The paper does not analyze the sensitivity of the method to the quality of these initial text representations.

  3. The experiment lacks sufficient large-scale datasets for evaluation. Most datasets used for evaluation are relatively small. Four of the ten datasets (Cornell, Texas, Wisconsin, Washington) have fewer than 300 nodes each, and two others (Cora, CiteSeer) are also considered small by modern standards. While the inclusion of datasets like Sportsfit (173k nodes) is commendable, the method's scalability and effectiveness on truly large-scale graphs with millions of nodes remain under-explored. The default setting uses 100 bundles for supervision, which may provide very sparse coverage on a much larger graph, potentially requiring a significant increase in costly LLM queries to achieve good results, as evidenced by the finding that more bundles yield better performance.

Questions

See the weaknesses above.

Limitations

See the weaknesses above.

Final Justification

This paper addresses the challenge of zero-shot inference on text-attributed graphs (TAGs), where the goal is to classify nodes without any labeled data. Overall, the problem is interesting and the proposed method is novel.

At the beginning, I had some concerns of the method and experiments, but during the rebuttal phase, the authors successfully addressed my concerns. Therefore, I will give an accept.

Formatting Concerns

No concerns.

Author Response

We are truly grateful for the time you have taken to review our paper and your insightful review. Here we address your comments in the following.

Q1. The exploration of heterophily is limited. While the paper includes heterophilic graphs (Cornell, Texas, Wisconsin, Washington), the mechanism for handling them relies solely on switching from "topological proximity" to "semantic proximity" for bundle sampling. This semantic sampling is based on the initial text embeddings. The paper could benefit from a deeper exploration of how the bundling strategy interacts with strong heterophily, where even semantically similar nodes might belong to different classes. The performance gains on heterophilic graphs, while strong, might stem more from the power of the base GNN designed for heterophily (GloGNN) rather than the bundling strategy itself being optimally adapted for this challenging scenario.

A1. Thank you for the comment. While some semantically similar nodes might belong to different classes, they are generally close in the label space. Additionally, the proposed bundle supervision is tolerant to outliers (Theorem 3.1). As for the performance gain, we compare the accuracy of individual queries and bundle queries below. The results suggest that the bundling strategy based on semantic proximity is effective.

| Datasets | Texas | Cornell | Wisconsin | Washington |
|---|---|---|---|---|
| Individual Query | 84.49 | 71.73 | 75.84 | 72.05 |
| Bundle Query | 92.51 | 84.82 | 87.17 | 81.66 |

Q2. The method depends on High-Quality Initial Embeddings: The "semantic proximity" sampling, crucial for heterophilic graphs, relies on the quality of initial node embeddings generated by a text encoder. The paper uses a task-adaptive embedding method from a prior work. If these initial embeddings are poor or do not capture the relevant semantic nuances for classification, the resulting bundles could be noisy from the outset, potentially hampering the entire process. The paper does not analyze the sensitivity of the method to the quality of these initial text representations.

A2. Thank you for the comment. We show that our method is robust to the quality of the initial text embeddings. Specifically, for an embedding $\mathbf{x}$, we introduce noise via $\mathbf{x} \leftarrow (1 + \beta \mathbf{n}) \odot \mathbf{x}$, where $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is Gaussian noise. We show the performance of our method under different noise ratios $\beta$ on the WikiCS dataset below. As can be seen from the results, our method remains robust as the quality of the text embeddings degrades.

| Noise Ratio | 0 | 0.05 | 0.1 | 0.15 |
|---|---|---|---|---|
| DENSE | 71.03 | 70.94 | 70.82 | 70.50 |
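The multiplicative perturbation used in this robustness study is straightforward to reproduce; the helper name `perturb_embeddings` below is our own, a minimal sketch of the stated formula.

```python
import numpy as np

def perturb_embeddings(X, beta, seed=0):
    """Apply the multiplicative Gaussian perturbation
    x <- (1 + beta * n) ⊙ x elementwise, with n ~ N(0, I).

    X: (n_nodes, dim) embedding matrix; beta: noise ratio.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(X.shape)
    return (1.0 + beta * noise) * X
```

With `beta = 0` the embeddings are returned unchanged; increasing `beta` degrades embedding quality, matching the noise-ratio sweep in the table above.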

Q3. The experiment lacks sufficient large-scale datasets for evaluation. Most datasets used for evaluation are relatively small. Four of the ten datasets (Cornell, Texas, Wisconsin, Washington) have fewer than 300 nodes each, and two others (Cora, CiteSeer) are also considered small by modern standards. While the inclusion of datasets like Sportsfit (173k nodes) is commendable, the method's scalability and effectiveness on truly large-scale graphs with millions of nodes remain under-explored. The default setting uses 100 bundles for supervision, which may provide very sparse coverage on a much larger graph, potentially requiring a significant increase in costly LLM queries to achieve good results, as evidenced by the finding that more bundles yield better performance.

A3. Thank you for your comment. We provide results on ogbn-products, which has 2,449,029 nodes and 61,859,140 edges. The baseline LLMBP uses 940 queries (its default setup), while the proposed DENSE uses 100 and 200 queries. The results suggest that DENSE outperforms the baseline with only 100 bundles (our default setup, far fewer queries than the baseline), and that performance further improves with additional bundles.

| Methods | LLMBP | DENSE (100 bundles) | DENSE (200 bundles) |
|---|---|---|---|
| ogbn-products | 74.71 | 76.88 | 79.91 |

In light of these responses, we hope that we have addressed your concerns and that you will consider raising your score. If there are any additional notable points of concern that we have not yet addressed, please do not hesitate to share them, and we will promptly attend to those points.

Comment

Thank you again for acknowledging our rebuttal! We hope that our responses have addressed your concerns. We sincerely thank you for your dedication and effort in evaluating our submission. Please do not hesitate to let us know if you need any clarification or have additional suggestions.

Final Decision

The paper is in the space of using LLMs for graphs. In particular, it investigates zero-shot node classification on text-attributed graphs (TAGs). Instead of querying LLMs with isolated node texts, the authors propose Dynamic Text Bundling Supervision (DENSE), which forms bundles of semantically or topologically proximate nodes, queries the LLM for a bundle-level label, and uses these labels to supervise a GNN, together with iterative bundle refinement. It is argued that this bundling approach helps because it provides the LLM with richer context from multiple related nodes rather than isolated text attributes, and because predicting the dominant category of a bundle is often an easier and more reliable task for an LLM than classifying individual, potentially ambiguous nodes. Theoretical results on convergence were presented, and empirical studies were conducted across ten TAG datasets with ablations. The reviewers' initial responses were mixed. While acknowledging the interesting setup and extensive experiments, they raised important concerns about the handling of heterophily, heavy dependence on initial embeddings, scalability to larger graphs, limited exploration of GNN types, the relevance of the theoretical results, and the overall significance of the work. We thank the authors and reviewers for engaging during the rebuttal period to improve the paper: the new experiments on the large-scale ogbn-products dataset addressed scalability, injecting noise into the initial embeddings demonstrated robustness, having the LLM estimate a graph's homophily ratio from sampled node pairs (along with adaptive sampling) addressed some of the heterophily concerns, and an additional experiment with the GraphSAGE architecture showed the generality of the method.
Thus the authors' rebuttal addressed many of the weaknesses pointed out by the reviewers, and while some limitations remain, the paper makes a concrete contribution of interest to the community.