Scale-Free Graph-Language Models
Reviews and Discussion
This paper proposes a novel scale-free graph-language (SFGL) model. There are two main components in the proposed framework: (1) Scale-free KNN graph construction; (2) LM Training augmented with KNN graph-based pseudo labels. Experiments demonstrate that SFGL shows comparable performance.
Strengths
- The use of scale-free properties for graph generation is clever and practical. The approximation via KNN graphs is a simple yet effective method.
- While the use of pseudo-labels is not a new concept in integrating graphs and language models, the way it is applied in this paper is particularly effective.
- The empirical results are compelling, with the SFGL model consistently outperforming baseline methods across various datasets and labeling rates.
Weaknesses
- The author could conduct more experiments with different labeling rates. Although the paper focuses on low labeling rates, it would be meaningful to extend the experiments to higher labeling rates (e.g., 50% to 60%) to better understand the model's scalability and performance under such conditions. Observing the model's behavior with more labeled data could also provide insights into whether the proposed improvements hold up when labeled data is no longer a significant limitation.
- Although the paper demonstrates that KNN graphs achieve performance comparable to the real graph (Table 6), the differences between KNN and real graph structures are not thoroughly explored. A more detailed case study or analysis comparing the structures of KNN graphs and real graphs could provide valuable insights.
- Some minor points: in Algorithm 1, "Textual feature X" should be "Text sequences X," consistent with line 265; in Algorithm 2, "replace X with E" should be "replace F with E" (X are input text/tokens; E and F are text embeddings).
- In the ablation studies, the author could use more powerful text encoders, especially recently used LLM-based encoders such as E5-Mistral-7B-Instruct (https://huggingface.co/intfloat/e5-mistral-7b-instruct). More powerful encoders can be found on the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard).
- In the ablation study section, the author could report results when using only KNN graph generation, without the LM fine-tuning.
Questions
See the Weaknesses part.
Dear Reviewer 7hx6,
We sincerely appreciate your recognition of the novelty and simplicity of our methods, as well as your helpful suggestions. Below, we provide detailed responses to your concerns.
W1: High labeling rates.
We further evaluate the performance of different models on the Cora and PubMed datasets under high labeling rates. Based on the standard data split, the training, evaluation, and testing sets for Cora include 140, 500, and 1,000 nodes, respectively. For PubMed, these sets contain 60, 500, and 1,000 nodes, respectively. Additionally, the Cora and PubMed datasets contain 1,068 and 18,157 further labeled nodes, respectively, which remain unused during standard training. To explore the impact of higher labeling rates, we incorporate these remaining labels into the training sets, resulting in 1,208 and 18,217 labeled training nodes for Cora and PubMed, respectively. The experimental results comparing the original data split and the complete label set for these two datasets are presented in Table 7 in Appendix A.4.
As observed, when labeled data is no longer a significant limitation, a pure DeBERTa model achieves satisfactory performance. In such cases, generating pseudo-labels becomes unnecessary, as LMs can produce high-quality text embeddings given sufficient labeled data. This outcome is reasonable, as LMs are typically larger and more powerful than GNNs for handling textual inputs. However, when working with the standard data split that includes only a few labeled nodes, generating pseudo-labels becomes crucial to enhance performance.
W2: comparison with real graphs.
We provide a case study comparing the graph structures of KNN graphs and real graphs, focusing on how well the edges in KNN graphs align with those in real graphs. In KNN graphs, the construction of edges between nodes is based on their feature similarity in a specific feature space. The feature similarity is determined by the chosen similarity metric, while the feature space is influenced by the text encoders used. Consequently, the accuracy of artificial edges matching real edges depends on the value of K, the similarity metric, and the text encoders. We conduct experiments on the Cora dataset using the same similarity metric and text encoders. Our results show that KNN graphs with K=25 and K=50 yield 2,700 and 3,422 edges, respectively, that precisely match the true edges in the real graph. While increasing K can improve accuracy by reconstructing more known edges, it also introduces a higher risk of generating noisy edges. Thus, a trade-off exists between edge accuracy and noise as K increases.
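For readers who want to reproduce this kind of overlap count, a minimal sketch is given below (hypothetical variable names; it assumes scikit-learn is available and that node embeddings plus a real edge list are already in memory):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_real_edge_overlap(emb, real_edges, k):
    """Count KNN-graph edges that also appear in the real graph (undirected)."""
    # Directed KNN graph from cosine similarity (sklearn uses cosine distance).
    A = kneighbors_graph(emb, n_neighbors=k, metric="cosine", mode="connectivity")
    rows, cols = A.nonzero()
    # Treat edges as unordered pairs for matching.
    knn_pairs = {tuple(sorted((int(i), int(j)))) for i, j in zip(rows, cols)}
    real_pairs = {tuple(sorted(map(int, e))) for e in real_edges}
    return len(knn_pairs & real_pairs)

# Random stand-ins for the Cora embeddings and citation edges.
emb = np.random.randn(100, 32)
real_edges = [(i, (i + 1) % 100) for i in range(100)]
print(knn_real_edge_overlap(emb, real_edges, k=25))
```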
Here, we would like to emphasize that our primary goal is to investigate a real structural prior (i.e., the scale-free property) for graph generation, rather than to approximate all true edges in the real graphs. We really appreciate the reviewer for proposing the intriguing idea of approximating a real graph and look forward to exploring it in our future work.
W3: minor typos.
Thank you for pointing out the minor typos in our algorithm tables. Yes, "Textual feature X" in Algorithm 1 should be "Text sequences X," and "replace X with E" in Algorithm 2 should be "replace F with E." We will ensure these typos are corrected in the final version of the manuscript.
W4: other LLM encoders.
We appreciate the reviewer’s valuable suggestion. Following your recommendation, we adopt the E5-Mistral-7B-Instruct model as the text encoder. Additionally, based on the MTEB leaderboard, we include the BAAI/bge-en-icl model in our experiments. Due to limitations in computational resources, we are unable to fine-tune the E5-Mistral-7B-Instruct and BAAI/bge-en-icl models on our datasets. For comparison, we utilize the pretrained models to extract node features and compare them against both a pretrained and a fine-tuned DeBERTa model.
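As a rough sketch of how such pretrained encoders can be queried for node features (assuming the sentence-transformers library; this is illustrative, not our exact extraction pipeline):

```python
from sentence_transformers import SentenceTransformer

# Hypothetical inputs: one text sequence per node (e.g., title plus abstract).
texts = ["Title and abstract of paper 1 ...", "Title and abstract of paper 2 ..."]

# Any MTEB-style encoder can be swapped in here.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
node_features = model.encode(texts, batch_size=8, normalize_embeddings=True)
print(node_features.shape)  # (num_nodes, 4096) for E5-Mistral-7B-Instruct
```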
The following two tables provide a detailed comparison of the models and their performance, respectively.
| Models | Model Size | Inference Time | Embedding Size |
|---|---|---|---|
| DeBERTa | 129M | 1.3 min | 768 |
| E5-Mistral-7B-Instruct | 7B | 14.84 min | 4096 |
| BAAI/bge-en-icl | 7B | 14.54 min | 4096 |
| Labels | 60 | 120 | 180 | 360 | 500 |
|---|---|---|---|---|---|
| Pretrained DeBERTa | 0.475 | 0.518 | 0.534 | 0.538 | 0.548 |
| Pretrained E5-Mistral-7B-Instruct | 0.550 | 0.614 | 0.648 | 0.686 | 0.690 |
| Pretrained BAAI/bge-en-icl | 0.596 | 0.636 | 0.648 | 0.652 | 0.667 |
| Finetuned DeBERTa | 0.743 | 0.815 | 0.856 | 0.877 | 0.876 |
As observed, both the pretrained E5-Mistral-7B-Instruct and BAAI/bge-en-icl models outperform the pretrained DeBERTa model but fall short of the well-finetuned DeBERTa model. However, it is reasonable to infer that a fine-tuned E5-Mistral-7B-Instruct or BAAI/bge-en-icl model would surpass the performance of the fine-tuned DeBERTa model.
W5: only KNN graph.
In Section 4.3, the experimental results presented in Table 6 are obtained using GNN models without involving LM fine-tuning. Specifically, the GNN models are trained on KNN graphs and real graphs, respectively.
We sincerely appreciate your valuable comments, which have provided us with new insights. We hope our response effectively addresses your concerns.
Thanks for your response. I appreciate the effort and time you have dedicated to addressing my concerns.
Dear Reviewer 7hx6,
We are very delighted to see that our response has addressed your concerns. We sincerely appreciate your earlier comment that the use of scale-free properties for graph generation is clever and practical. This has greatly fueled our enthusiasm for research.
Thank you once again for your time and acknowledgment!
Sincerely,
Authors
The paper introduces a new framework that combines graph generation and text embedding using the scale-free property of real-world networks. By approximating this structure with a KNN graph, the model improves graph generation and enhances language model fine-tuning. The paper uses a graph-based pseudo-labeling method to improve performance on semi-supervised tasks, validated through experiments on several citation network datasets.
Strengths
- The idea is novel. A unique contribution is the use of KNN for graph generation, which I think will be very inspiring for semi-supervised tasks on homogeneous networks, especially given the comparison with the real graph in Table 6, where the generated graph is even better than the real graph.
- The paper is well-written, providing a detailed description of the problem setting and research background. The authors clearly articulate the challenges in existing graph-language models and their motivation for the proposed approach, making the rationale behind their work easy to follow.
- The extensive ablation studies included in the paper significantly contribute to the credibility of the proposed method.
Weaknesses
- The paper primarily focuses on citation networks, which are inherently homophilic. The applicability of the proposed KNN-based graph generation method to other types of textual networks, such as academic or social networks, is not explored.
- The core contribution of the paper lies in using KNN to approximate graphs, but this approach seems better suited to homophilic networks. In more diverse scenarios, such as heterogeneous academic networks with different types of nodes, the use of a unified nearest-neighbor metric may introduce significant bias. This limits the method's applicability to broader graph structures. It is recommended to explore a broader extension of the KNN design.
- While the paper provides strong theoretical and experimental support, the experiments validating KNN could be more comprehensive. (See Questions)
Questions
- I believe validating KNN is crucial, and I suggest the authors include comparisons with other graph generation methods, such as random graph generation or rule-based approaches.
- I am curious how the specific implementation of the real graph in Table 6 is set up.
- I noticed that the real graph is not even as good as the graph generated by KNN; I hope the authors can give their thoughts on this. Such results suggest that graph generation is not about fitting the real network connections but about information transfer between nodes with high similarity. However, this ignores the information transmission along real edges, because nodes only trust points that are close to them, which can easily introduce errors for complex network mechanisms.
- I think a deeper experimental exploration of the above questions should involve trying LLM-only prediction approaches on the network, such as [1] and [2], which are missing from the paper's citations.
- The paper lacks some experimental details, such as the division of datasets into labeled and unlabeled nodes (m, n) and the specific settings of the iterative algorithm.
[1] Bi B, Liu S, Wang Y, et al. Lpnl: Scalable link prediction with large language models[C]//Findings of the Association for Computational Linguistics ACL 2024. 2024: 3615-3625.
[2] Ye R, Zhang C, Wang R, et al. Natural language is all a graph needs[J]. arXiv preprint arXiv:2308.07134, 2023, 4(5): 7.
Concise and powerful responses will help me understand better, and I will consider raising my score.
Dear Reviewer mKrX,
We really appreciate you recognizing our contributions and providing constructive suggestions to improve our manuscript. Below, we address your concerns in detail.
W1: other types of networks.
At the end of our submission, we provide a case study on social networks in Appendix A.5. Specifically, we explore the degree distribution of KNN graphs on three social networks: Deezer Europe, LastFM Asia, and Facebook Page-Page.
- Deezer Europe is a user-user network of European members of the music streaming service Deezer.
- LastFM Asia is an online social network of people who use the online music streaming site LastFM and live in Asia.
- Facebook is a page-page graph of verified Facebook sites.
For detailed descriptions of these social networks, please refer to [a1].
It is important to note that the node embeddings of citation networks and social networks differ significantly. For the citation networks, node embeddings are derived from textual sequences of nodes (i.e., natural language). For the social networks, node embeddings are generated based on the graph structure through the embedding model Diff2vec [a2]. More specifically, the embedding of each node in a social network is obtained by computing an Euler tour on the node, followed by additional post-processing [a2]. The degree distributions of KNN graphs on these social networks with different distance metrics are shown in Figure 9 in Appendix A.5.
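For illustration, here is a minimal sketch of this structure-based embedding pipeline, assuming the karateclub implementation of Diff2vec and a synthetic stand-in graph (not our exact configuration):

```python
import networkx as nx
from karateclub import Diff2Vec
from sklearn.neighbors import kneighbors_graph

# Stand-in for a social network; karateclub expects nodes indexed 0..n-1.
G = nx.connected_watts_strogatz_graph(n=500, k=6, p=0.1, seed=0)

# Structure-based node embeddings via diffusion (Euler-tour) sequences.
model = Diff2Vec()
model.fit(G)
emb = model.get_embedding()

# KNN graph on the structural embeddings; its in-degree distribution
# can then be inspected as in Figure 9.
A = kneighbors_graph(emb, n_neighbors=10, metric="cosine")
print(A.sum(axis=0))  # in-degrees
```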
W2: extension on heterogeneous networks.
First of all, we would like to emphasize that this paper follows existing graph-language models, including the baselines used for comparison, and focuses on homophilic networks with textual attributes.
We really appreciate the reviewer’s suggestion on how to extend our method on heterogeneous networks. To avoid the potential bias, a feasible solution is to build several subgraphs for nodes and features with various similarity metrics, such as feature-feature similarity, feature-semantic similarity, and semantic-semantic similarity. These subgraphs can then be integrated into a unified heterogeneous graph. Besides, we can also consider a novel method proposed in [a3] for edge construction. However, it is very challenging to incorporate the scale-free property into heterogeneous graph generation. We would like to explore this interesting problem in our future study.
W3 & Q1: validation on KNN.
It is a good suggestion to compare our method with other graph generation methods. Following the reviewer’s suggestion, we compare our method with random graph generation. For the random graph, we consider the Erdős–Rényi (ER) graph $G(n, p)$, which is a random graph with $n$ nodes where each pair of nodes is connected independently with probability $p$. In this graph, the degree of a node $v$ follows a binomial distribution:

$$P(\deg(v) = k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}.$$

When $n$ is large and $p$ is small, the degree distribution can be approximated by a Poisson distribution:

$$P(\deg(v) = k) \approx \frac{\lambda^k e^{-\lambda}}{k!}, \quad \text{where } \lambda = (n-1)p.$$

This is different from the power-law distribution of scale-free networks.
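This contrast is straightforward to verify numerically; the sketch below (arbitrary $n$ and $p$) compares an ER graph's empirical degree frequencies against the Poisson approximation:

```python
import networkx as nx
import numpy as np
from scipy.stats import poisson

n, p = 5000, 0.001
G = nx.gnp_random_graph(n, p, seed=0)
degrees = np.array([d for _, d in G.degree()])

lam = (n - 1) * p
for k in range(10):
    empirical = np.mean(degrees == k)
    print(k, round(empirical, 4), round(poisson.pmf(k, lam), 4))
```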
The performance comparison between random graph and our KNN graph on the arxiv23 dataset is shown below.
| Labeling Rates (labels) | 0.11% (50) | 0.22% (100) | 0.32% (150) | 0.65% (300) | 1.08% (500) |
|---|---|---|---|---|---|
| Real Graph | 0.3691 ± 0.0033 | 0.4124 ± 0.0185 | 0.4354 ± 0.0107 | 0.4764 ± 0.0116 | 0.5039 ± 0.0071 |
| Random Graph p=0.01 | 0.2265 ± 0.0123 | 0.2303 ± 0.0054 | 0.2377 ± 0.0241 | 0.2335 ± 0.0055 | 0.2446 ± 0.0217 |
| Random Graph p=0.001 | 0.2212 ± 0.0168 | 0.2374 ± 0.0271 | 0.2506 ± 0.0270 | 0.2579 ± 0.0209 | 0.2626 ± 0.0228 |
| KNN Graph k=10 | 0.3660 ± 0.0136 | 0.4031 ± 0.0054 | 0.4152 ± 0.0145 | 0.4541 ± 0.0114 | 0.4706 ± 0.0164 |
| KNN Graph k=20 | 0.3809 ± 0.0251 | 0.4229 ± 0.0137 | 0.4342 ± 0.0092 | 0.4771 ± 0.0076 | 0.4976 ± 0.0093 |
Q2: Real graph.
The real graphs used in our experiments are exactly the original graphs from the benchmark datasets. Specifically, the real graph used in Table 6 is the original graph of the arxiv23 dataset (which is an open-access dataset referenced in [a5]). This dataset contains 57,471 nodes, and the real graph contains 122,835 connections.
Q3: Our thoughts on results in Table 6.
This is a good question. As shown in Table 6, the performance of using the real graph is worse than that of using the KNN graph in cases with low labeling rates, such as 0.11% and 0.22%. We provide a potential explanation for this phenomenon.
In fact, beyond facilitating information transmission between node features (attributes), the graph also plays a crucial role in the propagation of label (semantic) information. Consequently, both the number of labeled nodes and the selection of specific nodes to label can significantly impact the final classification performance. Consider an extreme case: if all labeled nodes in a real graph are isolated (i.e., they have no connections with other nodes), the label information cannot propagate to the unlabeled nodes. In contrast, constructing an artificial graph for these nodes may establish connections between unlabeled nodes and these isolated labeled nodes, enabling label information propagation across the graph. Therefore, it is possible that artificial graphs outperform real graphs, particularly when the labeling rate is low.
Additionally, KNN graphs can generate a portion of the real edges. Our experimental results demonstrate that for the Cora dataset, KNN graphs generate 2,700 and 3,422 edges that precisely match the true edges in the real graphs when K=25 and K=50, respectively. Furthermore, it is important to note that real graphs may contain noisy edges. KNN reduces the risk of creating such noisy edges, as it constructs edges only between nodes with highly similar features.
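The propagation argument in our explanation above can be checked directly with a multi-source breadth-first search from the labeled set; the following sketch (toy adjacency lists, hypothetical names) measures how many nodes label information can reach in each graph:

```python
from collections import deque

def label_coverage(adj, labeled):
    """Fraction of nodes reachable from the labeled set.
    adj: dict mapping node -> list of neighbors (undirected)."""
    seen, queue = set(labeled), deque(labeled)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) / len(adj)

# Toy example: node 3 is labeled but isolated in the "real" graph.
real_adj = {0: [1], 1: [0, 2], 2: [1], 3: []}
knn_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(label_coverage(real_adj, [3]))  # 0.25: label information stays on the isolated node
print(label_coverage(knn_adj, [3]))   # 1.0: KNN edges restore propagation
```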
Q4: LLM-only prediction.
Thank you for the constructive suggestion. In [a3], the authors propose novel prompts for scalable link prediction on heterogeneous graphs by articulating graph details in natural language. This is a purely LLM-based approach that transforms graph details into descriptive texts. In [a4], the authors describe the multi-scale geometric structure of graphs using natural language and then fine-tune an LLM with instruction learning to perform various graph tasks. Since the authors of [a4] also focus on the node classification task, we conduct an empirical comparison between our method and theirs on the PubMed dataset.
Comparison under the same labeling rate using the standard data splitting (i.e., 60 labeled nodes):
| Models | Labeling rate | Accuracy |
|---|---|---|
|  | 0.30% | 0.851 ± 0.006 |
|  | 0.30% | 0.882 ± 0.003 |
|  | 0.30% | 0.896 ± 0.004 |
| (ours) | 0.30% | 0.724 ± 0.012 |
| (ours) | 0.30% | 0.864 ± 0.038 |
| (ours) | 0.30% | 0.914 ± 0.008 |
Comparison under different labeling rates:
| Models | Labeling rate | Accuracy |
|---|---|---|
|  | 60% | 0.938 ± 0.003 |
|  | 60% | 0.944 ± 0.001 |
|  | 60% | 0.946 ± 0.001 |
| (ours) | 2.54% | 0.839 ± 0.020 |
| (ours) | 2.54% | 0.912 ± 0.002 |
| (ours) | 2.54% | 0.937 ± 0.002 |
As we can see, our proposed method with iterative training outperforms the approach presented in [a4] under the standard data split and achieves comparable performance even at lower labeling rates.
Q5: experimental details.
We conduct experiments on different labeling rates. The Cora dataset contains 7 classes, so we adopt 70, 140, 210, and 420 labeled nodes for experiments, where we randomly select 10, 20, 30, 60 nodes in each class as the labeled nodes. Additionally, we experiment with 600 labeled nodes, where 420 nodes are consistent with the previous case, and an additional 180 nodes are randomly selected. This labeling strategy is also applied to the PubMed dataset. For the ogbn-arxiv and arxiv23 datasets, we randomly select 50, 100, 150, 300, and 500 nodes from the original labeled set in the standard data split, designating them as labeled nodes while treating the remaining nodes as unlabeled.
For all datasets and labeling rates, we conduct five independent experiments using different random seeds and report the average testing accuracy along with the standard deviation.
The iterative algorithm is employed exclusively for Table 2 in Section 4.3. This demonstrates that improved text embeddings facilitate the generation of higher-quality pseudo-labels, which, in turn, further enhance the discriminability of the text embeddings. Detailed descriptions of the algorithms can be found in Appendix A.3, as well as in Algorithms 1 and 2.
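For concreteness, here is a sketch of the per-class labeled-node sampling described above (hypothetical array names, not our exact script):

```python
import numpy as np

def sample_labeled_nodes(labels, per_class, seed=0):
    """Randomly select `per_class` labeled nodes from each class."""
    rng = np.random.default_rng(seed)
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        picked.append(rng.choice(idx, size=per_class, replace=False))
    return np.concatenate(picked)

labels = np.random.randint(0, 7, size=2708)             # stand-in for Cora's 7 classes
train_idx = sample_labeled_nodes(labels, per_class=20)  # 140 labeled nodes
print(train_idx.shape)
```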
We sincerely appreciate your valuable suggestions and insightful feedback, and we hope that our response addresses your concerns.
[a1] Multi-scale attributed node embedding. Rozemberczki, B., Allen, C., & Sarkar, R. Journal of Complex Networks, 2021.
[a2] Fast sequence-based embedding with diffusion graphs. Rozemberczki, B., & Sarkar, R. In Complex Networks IX: Proceedings of the 9th Conference on Complex Networks CompleNet, 2018.
[a3] Bi B, Liu S, Wang Y, et al. Lpnl: Scalable link prediction with large language models[C]//Findings of the Association for Computational Linguistics ACL 2024. 2024: 3615-3625.
[a4] Ye R, Zhang C, Wang R, et al. Natural language is all a graph needs[J]. arXiv preprint arXiv:2308.07134, 2023, 4(5): 7.
[a5] He et al, Harnessing explanations: Llm-to-lm interpreter for enhanced text-attributed graph representation learning. ICLR, 2024.
Thank you for your response. I will maintain my score. I believe that expanding the scope of this work could be an interesting direction for future research, particularly as I see potential in further exploring and extending the role of KNN.
Dear Reviewer mKrX,
Thank you very much for highlighting an exciting direction for our future research.
We would like to further point out that this paper primarily focuses on graph-language models for classifying nodes with textual attributes, specifically natural language sequences. In contrast, the attributes of nodes in social networks or heterogeneous networks may not necessarily be textual sequences. For example, in Appendix A.5, the embedding of each node in social networks is derived through an Euler tour, rather than a textual sequence. Therefore, applying graph-language models to these networks would not be appropriate.
We greatly appreciate you recognizing the novelty of our work and mentioning that you think our method is very inspiring. It would also inspire us if you could consider updating your score!
Thank you again for your time and thoughtful review!
Sincerely,
Authors
The paper focuses on node classification for text-attributed scale-free graphs. The authors discover that citation networks have the properties of scale-free graphs and propose a KNN graph based on text similarity distance. The authors theoretically and empirically demonstrate that the KNN graph has the properties of a scale-free graph and can serve as a good prior for scale-free graphs such as citation networks.
In the experiments, the authors show that with a small amount of labeled data, the model can outperform both GNN-based models and pretrained-language-model-based baselines. Ablation studies are performed to demonstrate the effectiveness of the KNN graph approximation for scale-free graphs.
Strengths
- The paper proposes a new KNN graph, which has the properties of a scale-free graph and can be used as a good prior for scale-free graphs such as citation graphs.
- The paper is the first to introduce the application of scale-free graphs to text-attributed citation networks.
- The proposed approach is carefully verified both empirically and theoretically.
- The proposed method outperforms strong baselines.
- The experiments conducted in the paper are solid and comprehensive.
- Detailed ablation studies analyze the impact of different components, including different text encoders and different graph neural networks.
- The paper is clear and easy to follow.
Weaknesses
- The explanation of the correlation between the theory of scale-free networks and KNN graphs is a bit insubstantial; the theoretical explanation does not provide sufficient intuition for the approach.
Questions
- Corresponding to Weakness 1: between lines 103-209, the paper discusses why the KNN graph is similar to a scale-free graph. The properties of homophily and directedness seem trivial. The upper bound on degree is proved using Propositions 1 and 2, but this cannot explain why the KNN graph's node degrees follow the power law, which is an important property of scale-free networks. Empirical curve fitting shows that the node-degree distribution of the KNN graph could follow the power law; is there any theoretical explanation behind this?
- I find that there are only experiments based on academic citation networks, but in the appendix some analysis of social network data is done. Is there any experiment for the task on social networks? It would be more convincing if experiments on other domains were performed.
Q2: extension on social networks.
In Section 6, we present the limitations of this work, which include how to extend our observations and strategies to other types of networks. We provide a case study in Appendix A.5 to examine the degree distribution of KNN graphs on different social networks [a5]. It is important to highlight the significant differences between the node embeddings of citation networks and social networks. In citation networks, node embeddings are derived from textual sequences (i.e., natural language) associated with the nodes. In contrast, for the social networks, node embeddings are generated based on the graph structure using embedding models like Diff2vec [a6]. Specifically, the embedding of each node in a social network is obtained by computing an Euler tour on the node, followed by additional post-processing steps [a6].
This paper primarily focuses on graph-language models for classifying nodes with textual attributes (natural language). Since the node attributes in these social networks are not textual sequences, applying language models to them would neither be appropriate nor reasonable. As a result, we provide the case study in Appendix A.5 to investigate the degree distribution on these social networks.
We sincerely appreciate your valuable comments and insightful suggestions. We hope our response effectively addresses your concerns.
[a1] Mathew D. Penrose. Random Geometric Graphs. 2003.
[a2] Albert-Laszlo Barabasi and Reka Albert. Emergence of scaling in random networks. Science, 286 (5439):509–512, 1999.
[a3] Albert-Laszlo Barabasi and Eric Bonabeau. Scale-free networks. Scientific American, 288(5):60–69, 2003.
[a4] Albert-Laszlo Barabasi and Marton Posfai. Network science. Cambridge University Press, Cambridge, 2016.
[a5] Rozemberczki, B., Allen, C., & Sarkar, R. Multi-scale attributed node embedding. Journal of Complex Networks, 2021.
[a6] Rozemberczki, B., & Sarkar, R. Fast sequence-based embedding with diffusion graphs. In Complex Networks IX: Proceedings of the 9th Conference on Complex Networks CompleNet, 2018.
Dear Reviewer dcB6,
Thank you very much for recognizing our contributions and for providing constructive comments. We address your concerns as below.
W1 and Q1: additional theoretical explanation.
Thank you for this constructive comment. We begin by discussing a simple case using a random geometric graph (RGG) and then elaborate on why analyzing the degree distribution of KNN graphs on real-world data from a theoretical perspective presents significant challenges.
Consider a random geometric graph with $n$ nodes $X_1, \dots, X_n$ sampled independently from a probability density $f$ in $\mathbb{R}^d$, where two nodes $X_i$ and $X_j$ are connected only if $\|X_i - X_j\| \le r_n$, and $r_n$ is the connection radius of the graph, which depends on $n$ and determines the graph's density.
We define the maximum degree of the graph as $\Delta_n = \max_{1 \le i \le n} \deg(X_i)$, where $\deg(X_i)$ is the degree of node $X_i$, i.e., the number of neighbors of $X_i$ within radius $r_n$, and define a binary indicator function $H_k$ on a subgraph $\Gamma_U$, where $U \subset \mathbb{R}^d$. Here, $H_k(\Gamma_U) = 1$ if the subgraph $\Gamma_U$ has at least one vertex with degree $k$, and $H_k(\Gamma_U) = 0$ otherwise. $\Gamma_U$ is the induced subgraph on the subset $U$ using a unit connection radius, where an edge exists between two vertices $x, y \in U$ if and only if the Euclidean distance $\|x - y\| \le 1$.
Based on the above definitions, we have the following theorem [a1]: if $r_n \to 0$ at an appropriate rate as $n \to \infty$, then the limiting value of $P(\Delta_n = k)$ is determined by $f$, the scaling of $r_n$, and the indicator $H_k$, where $P(\Delta_n = k)$ refers to the probability that the maximal degree equals $k$.
This theorem provides an asymptotic result for the distribution of $\Delta_n$ as the number of nodes $n \to \infty$, under the scaling condition $r_n \to 0$ as $n \to \infty$. As we can see, the maximum degree is affected by several factors, including:
- Distribution density $f$. The density impacts the likelihood of finding clusters of points near a given node.
- Connection radius $r_n$. The probability of achieving a maximum degree $k$ depends on how nodes cluster locally, which is controlled by the scaling of $r_n$.
- Local structure $H_k$. This determines whether the subgraph satisfies the condition of having at least one node of degree $k$.
Similarly, the degree distribution of KNN graphs is also primarily shaped by several critical factors:
- Node distribution (similar to the density $f$ in the RGG). Determining the node distribution of real-world data without prior knowledge is not feasible. Even if the original node distribution were known, the distribution of node embeddings remains indeterminate due to the reshaping effect of the text encoders used.
- The value of K (similar to the radius $r_n$ in the RGG). The selection of K influences the magnitude of degrees.
- The choice of similarity metric (similar to the Euclidean distance in the RGG). Even with assumptions about the embedding distribution of nodes, the degree distribution is further complicated by the choice of similarity metric, as different similarity metrics result in varying connections between nodes.
Consequently, we believe that determining the degree distribution in KNN graphs for different real-world datasets should primarily rely on empirical evidence.
Additionally, the scale-free property of networks can be effectively explained by the well-known Barabási–Albert (BA) model [a2, a3, a4]. We analyze the key differences between the BA model and KNN graphs. The BA model operates under the following conditions:
- Growth: the number of nodes continuously increases over time.
- Preferential attachment: starting with an initial set of edges, new nodes are more likely to connect to high-degree nodes, with edges directed from new nodes to existing nodes.
- No node attributes: The BA model does not take node features or attributes into account when constructing edges.
In contrast, KNN graphs have a fixed size, no initial edges, and allow edges to have arbitrary directions. Connections between nodes are determined solely based on their feature similarity. Notably, KNN graph edges can be determined in parallel, whereas edges in the BA model must be constructed sequentially. This distinction makes KNN graphs more practical and efficient for real-world applications.
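This construction contrast can be made concrete with a short sketch (illustrative parameters; random embeddings stand in for text features):

```python
import networkx as nx
import numpy as np
from sklearn.neighbors import kneighbors_graph

n = 2000
# BA model: sequential growth plus preferential attachment, no node attributes.
ba = nx.barabasi_albert_graph(n, m=5, seed=0)

# KNN graph: fixed node set, edges from feature similarity, computable in parallel.
emb = np.random.randn(n, 32)
knn = nx.from_scipy_sparse_array(kneighbors_graph(emb, n_neighbors=5, metric="cosine"))

print(max(d for _, d in ba.degree()), max(d for _, d in knn.degree()))
```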
Dear Reviewer dcB6,
Thank you very much for recognizing the contributions and novelty of our work. In Response Parts 1-2, we have provided additional theoretical explanations using random geometric graphs and clarified why applying graph-language models to social networks may not be appropriate. We would greatly appreciate it if you might consider updating the score after reading our response.
Thank you once again for your time and effort!
Sincerely,
Authors
Hi, I have read the rebuttal. Thanks for your explanation of the challenges behind my two concerns; I have decided to keep my score.
Dear Reviewer dcB6,
Thank you very much for keeping a positive score!
We will try our best to revise the manuscript accordingly. Again, many thanks for helping us improve the paper!
Best regards,
Authors
This paper proposes a graph-based method as a semantic complementary supervision for text-attributed graphs. The author first theoretically justifies that a kNN graph can approximately describe a scale-free network while fulfilling three constraints or properties. Based on this, the author proposes a kNN graph construction to initialize the document graph. Then, the author introduces a new iterative method that fuses embeddings from three sources: a pseudo-labeler, GCN embeddings, and text embeddings.
Strengths
- The authors have focused on a topic with wide applications, going beyond GCNs for text-attributed graphs. The proposed method can be applied to any document set without explicitly extracted links.
- This paper theoretically proves that KNN can approximate the degree distribution of a scale-free network.
- The paper presents an extensive set of experiments on several reference citation networks (Cora, PubMed, ogbn-arxiv, and arxiv23). Comprehensive experiments have been conducted using different labeling fractions under semi-supervision and different numbers of neighbors in the KNN model, with a detailed description of all hyperparameters used.
Weaknesses
- Contributions and achievements are not clearly introduced: "To the best of our knowledge, this work is the first attempt to integrate graph generation and text embedding into a unified GLM framework."
- Algorithm design: the introduction of the algorithm design should include the necessary details in the main paper, not just in the Appendix. The roles of the GCN, text embedding, and pseudo-labeler should be explained in the main paper. A brief complexity analysis of each step should be provided in the main paper too.
- Experimental results: clarify whether the results for TAPE and GLEM are taken directly from their respective papers. Additionally, there are other important baselines such as GraphFormers, Leading, SimTEG, and ENGINE. If possible, please discuss or compare your results with these models as well. You may also refer to related work in this paper: https://www.ijcai.org/proceedings/2024/0634.pdf.
Questions
Introduction: 0. Clarification of the focused downstream task: the specific downstream task of your proposed model is not clearly articulated in the introduction. In "When developing a graph language model (GLM) for classification tasks", it is important to clearly define the target task. Typically, downstream tasks are categorized at the node level, edge level, subgraph level, or graph level. In machine learning, these tasks include classification, regression, clustering, and other related objectives. Please specify the precise downstream task your work focuses on. If it is interdisciplinary, make sure to distinguish it from existing topics and clarify its unique contributions in comparison to related fields. If the graph is a directed graph, please be clear about that too and adapt the graph construction section accordingly. A directed graph should have +1/-1 entries rather than binary ones, for instance?
- "However, LGI models typically rely on artificial assumptions about the underlying edge distribution, which may not accurately reflect the real graph structure." Regarding the series of GCN-based methods, please be concrete about the underlying assumption and why it is too strong in practice or in theory. The artificial assumption should also be mentioned briefly with a reference.
- "However, finetuning a pretrained LM typically requires target datasets with extensive annotations (Brown et al., 2020), which is an unrealistic requirement for semi-supervised learning tasks and may make LM finetuning unreliable if this requirement is not satisfied." Please list several publications or provide an example of the computational demand. Please refer to https://arxiv.org/abs/2401.15569 and https://arxiv.org/abs/2305.19523.
- "We identify two key challenges in existing GLMs: artificial structural assumptions for graph generation and unreliable LM finetuning for text embedding." Please refer to this book (https://academic.oup.com/book/27303), especially the estimation of alpha, and provide an alpha estimation based on their method to identify the conditions for validity and effectiveness. Theoretically, any graph inhabits the hierarchy to some extent, but only $\alpha \in [2, 3)$ would be considered scale-free, is that right? Please be clear about the limitations of this assumption (https://www.nature.com/articles/s41467-019-08746-5).
"To the best of our knowledge, this work is the first attempt to integrate graph generation and text embedding into a unified GLM framework. " It would be better to shortly review previous assumption and their limitation. It is also necessary to prove whether such a prior is more complete than the previous one?Please also add some sentences about the contribution involved and why graph generation is crucial, from which perspective? performance improvement, identify the hypothesis gap?
Figure 1, if the discussion is about scale-free network, it would be better to plot in log-log sorted degree template?
Alpha could quite efficiently estimated https://academic.oup.com/book/27303, please also consider it to estimate k?
in Section 2.3 The degree distribution of KNN graphs is primarily shaped by three critical factors: ① node distribution, is it edge distribution or node?
Section 3.1 There is still some space to quantify such estimation of k, if the adjacency matrix is know, it should be possible to provide a approximation error?
- Regarding the graph's underlying assumptions, the homophily assumption is another important one. Would it not be necessary to compare your approach with this assumption in the related work? Please explain how your method relates to or differs from homophily-based approaches at various levels: application, dataset setting?
- If this work has outperformed the state-of-the-art (SOTA), please highlight it in the introduction and abstract together with the margin. If not, it would be interesting to mention the gap and your specific focus.
- If a GCN, an LM, and a KNN graph are involved, it would be necessary to provide an analytical time complexity analysis and, if possible, an empirical space complexity analysis, so that readers can tell whether this is a method with a better complexity/performance ratio, or one that refreshes the upper bound of current algorithms without taking complexity and scalability into consideration.
W3: discuss with other related work.
The primary distinction between our method and these approaches lies in the following aspects:
- Our approach does not rely on extensive labeled data for fine-tuning language models (LMs), making it more practical in scenarios with limited supervision.
- Our method does not require a given graph structure, allowing it to operate flexibly in situations where the graph needs to be generated.
- Unlike these methods, which focus either on graph generation or text embedding, we integrate both into a unified GLM framework, addressing them simultaneously rather than separately.
Q1: downstream task.
In Section 2.1, Problem Definition, we articulate our task as follows:
Given a set of documents denoted as $\mathcal{D} = \{X_L, X_U, Y_L\}$, where $X_L$ and $X_U$ represent the text sequences of labeled and unlabeled documents, respectively, and $Y_L$ is the label set for the labeled samples, with $|X_L| \ll |X_U|$ indicating a scarcity of labeled nodes compared to the abundance of unlabeled ones, we tackle a graph-based semi-supervised classification task that involves training a model on $\mathcal{D}$ to classify the unlabeled documents $X_U$.
The downstream task of our proposed model is node classification, i.e., training a model on $\mathcal{D}$ to classify the unlabeled nodes $X_U$. In Section 3.3, Classification, we further clarify that we solve the classification task by training a graph-based semi-supervised learning model on the text embeddings and using the available labels $Y_L$.
Q2: undirected graph.
In the GNN training stage, the graph used is undirected. As mentioned in line 269, we consider a binary graph, which aligns with the setup of the baselines we compare against. In Figure 5, we present the in-degree distribution of KNN graphs. This is because the out-degree of KNN graphs is constant for all nodes, as it is determined by the fixed parameter K.
Q3: Previous artificial assumption and our structure prior.
In Section 5, lines 507-512, we discuss the artificial assumptions made by previous LGI methods regarding the underlying edge distribution. For example, Zhang et al. (2010) impose a uniformity assumption on edge weights. Lu et al. (2018) introduce a block-diagonal prior on the graph Laplacian matrix. Jin et al. (2020) apply a sparsity and low-rankness assumption on the adjacency matrix. Fatemi et al. (2021) assume that a graph structure effective for predicting features should also be effective for predicting labels. These methods rely on specific optimizations and additional model training across the entire dataset, and the artificial assumptions they adopt may not accurately represent the true structural properties of real-world data.
In this paper, we focus on a fundamental characteristic of real edge distributions in citation networks: the scale-free property. This property is ubiquitous in real-world networks, particularly in citation networks, and reflects an inherent attribute of the data rather than an imposed assumption. Leveraging this scale-free property, we construct a graph-based pseudo-labeler that offers complementary supervision for fine-tuning language models (LMs).
Q4: requirement of LM finetuning.
When we mentioned the requirements for LM fine-tuning, we emphasized the need for extensive annotations rather than computational demand. We list out related publications in Section 5, lines 514-523.
Following your suggestion, we provide an overview of the computational demands of some commonly used LMs. According to [a3] (https://arxiv.org/abs/2401.15569), the computation demands are as follows:
| LMs | Parameters | Memory (GB) | Total Time |
|---|---|---|---|
| BERT | 114,535,896 | 7.5 | 29h |
| DeBERTa | 138,731,759 | 6.6 | 46h 27m |
| e5-large | 2,013,328 | 3.7 | 25h 22m |
| ENGINE | 3,909,032 | 14.9 | 4h 23m |
| ENGINE with caching | 3,909,032 | 3.3 | 21m |
Q5: alpha estimation.
In lines 251-253, we provide alpha estimations for the three tested datasets. The details of the estimation method are available in [a4]. Notably, the estimation method presented in the book https://academic.oup.com/book/27303 originates from the paper [a4]. The author of the book is also an author of [a4].
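For reference, the estimator of [a4] is implemented in the open-source powerlaw Python package; a minimal sketch of estimating alpha from a degree sequence (synthetic example graph):

```python
import networkx as nx
import powerlaw  # implements the Clauset et al. estimator from [a4]

# Hypothetical example: estimate alpha from the degree sequence of a graph.
G = nx.barabasi_albert_graph(10000, m=3, seed=0)
degrees = [d for _, d in G.degree() if d > 0]

fit = powerlaw.Fit(degrees, discrete=True)
print("alpha =", fit.power_law.alpha)  # power-law exponent
print("xmin  =", fit.power_law.xmin)   # estimated lower cutoff
# Likelihood-ratio test against a competing heavy-tailed distribution.
R, p = fit.distribution_compare("power_law", "lognormal")
print(R, p)
```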
Q10: performance.
Thank you for the suggestion. The comparison results between our method and the SOTA methods are mainly presented in Table 1. As shown, our proposed method outperforms the SOTA methods in most cases, particularly under very low labeling rates. Due to space constraints, we were unable to include the performance improvements in the Introduction and Abstract. However, we will consider moving some of the ablation study results to the Appendix to create space for highlighting these improvements in the Introduction and Abstract.
Q11: time/space complexity analysis of different modules.
The table below presents the time and space complexity analysis of different modules on the ogbn-arxiv dataset.
| Module | Parameters | Running Time |
|---|---|---|
| GCN | 103,848 | 6.46s |
| LMs | 138,632,488 | 1.65h |
| KNN graph | - | 3.18s |
As we can see, fine-tuning LMs requires the most computational time and memory resources.
We sincerely appreciate your valuable comments and insightful suggestions. We hope our response effectively addresses your concerns.
[a1] Yang et al. GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph. NeurIPS 21.
[a2] Duan et al. SimTeG: A Frustratingly Simple Approach Improves Textual Graph Learning. Arxiv 23.
[a3] Zhu et al. Efficient Tuning and Inference for Large Language Models on Textual Graphs. IJCAI 24.
[a4] Clauset et al. Power-Law Distributions in Empirical Data. SIAM 2009.
[a5] Clauset et al. Scale-free networks are rare. Nature Communications 2019.
[a6] Voitalov et al. Scale-Free Networks Well Done. Physical Review Research. 2019.
[a7] Albert-László Barabási. Love is All You Need – Clauset's fruitless search for scale-free networks. 2018.
[a8] Zhao et al. Heterogeneous Graph Structure Learning for Graph Neural Networks. AAAI 2021.
[a9] Bi B, Liu S, Wang Y, et al. Lpnl: Scalable link prediction with large language models[C]//Findings of the Association for Computational Linguistics ACL 2024. 2024: 3615-3625.
Q5: alpha estimation.
It is important to highlight that $\alpha \in [2, 3)$ unequivocally indicates a scale-free network, while $\alpha \ge 3$ still reflects a power-law distribution. The paper [a5] (https://www.nature.com/articles/s41467-019-08746-5) is very controversial. Contrary to its assertions, scale-free networks are, in fact, ubiquitous. As demonstrated in [a6], real-world scale-free networks are definitely not as rare as one would conclude based on the popular but unrealistic assumption that real-world data comes from power laws of pristine purity, void of noise and deviations. According to [a6], a network is scale-free if its degree distribution approaches a power law as the network keeps growing following the same mechanisms. Thus, fluctuations that make a network fail a statistical test would not matter, because if one let the network evolve, it could soon pass the test.
Furthermore, as shown in [a7] (https://uploads-ssl.webflow.com/58bcae2c9d6c401e73a26fed/5aa01d3e24eebb000199a0a2_loveisallyouneed.pdf) , Albert-László Barabási, who introduced the concept of scale-free networks in 1999, strongly disputes the claim that scale-free networks are rare. At the end of [a7], Albert-László Barabási said: “So before accepting the attention-grabbing claims of the BC paper [a5], you must peek under its hood. And if you take the time to do that, you will find that the study is oblivious to 18 years of knowledge accumulated in network science. You will also find a fictional criterion of scale-free networks. Most important, you will find that their central criterion fails the most elementary tests. And those are only the big problems - if you are willing to devote time to it, you will discover additional technical lapses - but this piece must have an end. What you will not find, is a careful and unbiased statistical study that upends a paradigm.”
Q6: log-log sorted degree.
Thank you for the suggestion. In Figure 1, our goal is to illustrate the heavy-tailed nature of the power-law distribution in scale-free networks. This figure emphasizes the characteristic distribution, where a few hubs with high degrees coexist with numerous small nodes with low degrees (which is a hallmark of scale-free networks). To better visualize this heavy-tailed structure, Figure 1 is not plotted using a log-log sorted degree template.
In contrast, Figure 5 presents the degree distribution using a log-log sorted degree template. In this format, the power-law distribution appears as a straight line, making it easier to confirm the scale-free property in the log-log space. These complementary visualizations help to illustrate different perspectives of the power-law distribution in scale-free networks.
Q7: node distribution.
It is the node distribution. Edges do not exist in advance; that is why we use KNN to establish connections between nodes.
Q8: K estimation.
The adjacency matrix is unavailable. As described in the Problem Definition (Section 2.1), we are only given a set of documents, denoted as $\mathcal{D} = \{X_L, X_U, Y_L\}$. Our task is to train a model on $\mathcal{D}$ to classify the unlabeled documents $X_U$. Since the graph structure (i.e., the adjacency matrix) is not provided, in the Introduction and Related Work we introduce various latent graph inference methods, which are specifically designed to generate a graph (with an adjacency matrix).
In addition, our approach aims to approximate the scale-free property rather than the true adjacency matrix. More importantly, a low approximation error to the real adjacency matrix does not necessarily translate to better performance. As demonstrated in Table 6, even when using the real graph, it may not achieve optimal performance.
Q9: homophily assumption and comparison with related approaches.
We would like to emphasize that this paper builds upon existing graph-language models, including the baselines used for comparison, with a specific focus on homophilic graphs with textual attributes. In the related work, all referenced approaches are also based on the assumption of homophily. The primary difference between our method and these approaches is listed in W3, Response Part 2.
In the literature, there are also heterogeneous latent graph inference methods [a8, a9], which could be considered for graph construction in heterogeneous settings. However, this falls outside the scope of our paper. Additionally, incorporating the scale-free property into heterogeneous graph generation is challenging. We plan to explore this problem in our future research.
Dear Reviewer NtXH,
Thank you very much for your valuable suggestions and detailed comments. We address your concerns as below.
W1: clarification on our contributions.
As listed at the end of Introduction, our key contributions are threefold:
- We identify two key challenges in existing GLMs: artificial structural assumptions for graph generation and unreliable LM finetuning for text embedding. We propose addressing these challenges sequentially by leveraging a well-grounded graph structural prior.
- We explore the scale-free edge distribution in real networks as our graph structural prior. Our empirical validation and analysis reveal that a KNN graph, constructed using cosine similarity with an appropriately chosen $K$, closely approximates a scale-free network.
- To the best of our knowledge, this work is the first attempt to integrate graph generation and text embedding into a unified GLM framework.
We would like to further clarify the third point. As mentioned in our Introduction section, existing graph-language models (GLMs) can be broadly categorized into latent graph inference (LGI) models and language-assisted graph (LAG) models (please refer to Section 5 for a more detailed introduction to these two types of models). LGI models focus on graph generation through graph neural networks (GNNs) and usually assume that node features are given. Conversely, LAG models aim to enhance text embeddings using language models (LMs) and assume that the graph structure is provided. In this paper, we simultaneously consider both graph generation and text embedding within a unified GLM framework, rather than focusing on just one aspect.
W2: algorithm design.
Due to the page limitation (10 pages), we had to include Algorithm Tables 1 and 2 in the Appendix. However, we will consider moving some ablation study results to the Appendix and replacing them with the Algorithms in the main paper. We provide time and space complexity analysis of each part of our framework in Q11.
We further explain the roles of different modules, which are also explained in the main paper:
- Pseudo-Labeler (Section 3.2, lines 274-282). The pseudo-labeler is utilized to generate high-quality pseudo-labels that enhance language model (LM) training. Specifically, we use a GNN as the pseudo-labeler, which is trained on the node features, the adjacency graph, and the available labels.
- Text Embedding (Section 3.2, lines 287-292). The pseudo-labels are used to fine-tune a language model. The fine-tuned LM is then employed to generate text embeddings.
- GNN for Final Classification (Section 3.3, lines 310-312). The text embeddings are subsequently used to train another GNN for the final node classification task.
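To make the data flow between these modules concrete, below is a heavily simplified, self-contained sketch that uses a one-step feature-propagation model as a stand-in for the GNN pseudo-labeler and omits LM fine-tuning entirely (all names are illustrative; this is not the paper's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import kneighbors_graph

def propagate(features, adj):
    """One step of symmetric-normalized neighborhood averaging (GCN-like)."""
    A = adj + np.eye(adj.shape[0])        # add self-loops
    d = A.sum(axis=1)
    A_norm = A / np.sqrt(np.outer(d, d))  # D^{-1/2} A D^{-1/2}
    return A_norm @ features

# Stand-ins: initial text embeddings and a few labeled nodes.
rng = np.random.default_rng(0)
F = rng.normal(size=(300, 16))
labels = rng.integers(0, 3, size=300)
labeled = rng.choice(300, size=15, replace=False)

# (1) Scale-free prior: KNN graph from feature similarity.
A = kneighbors_graph(F, n_neighbors=10, metric="cosine").toarray()
A = np.maximum(A, A.T)  # symmetrize

# (2) Pseudo-labeler: graph-smoothed features + classifier on the labeled set.
H = propagate(F, A)
clf = LogisticRegression(max_iter=1000).fit(H[labeled], labels[labeled])
pseudo_labels = clf.predict(H)  # supervision signal for LM fine-tuning

# (3) In the full framework, pseudo_labels fine-tune the LM, which yields refined
# embeddings; a final GNN is then trained on those embeddings for classification.
print((pseudo_labels == labels).mean())
```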
W3: discuss with other related work.
The results of compared methods presented in this paper are not directly taken from the original papers. Instead, we reran all the compared baselines using their open-source codes. Notably, TAPE performed its own train/validation/test splits for the Cora and PubMed datasets, allocating 60% of the data for training, 20% for validation, and 20% for testing. This differs from the standard dataset splits, where only 140 and 60 labeled nodes are used for the training sets of the Cora and PubMed datasets, respectively. Our experimental results reveal that TAPE suffers from limited supervision. This is because finetuning a language model requires a sufficient number of labeled samples, and the standard splits provide far fewer labeled nodes. We highlight the difficulties posed by limited supervision when fine-tuning language models (lines 43-45, 416-421, and 519-523).
We briefly discuss the differences between our method and GraphFormers [a1], SimTEG [a2], and ENGINE [a3] as follows.
- GraphFormers: This method incorporates GNNs nested alongside each transformer layer of the pretrained language model and focuses primarily on the link prediction task.
- SimTEG: As mentioned in lines 516-517, SimTEG performs supervised, parameter-efficient fine-tuning on a pretrained language model. It then generates node embeddings using the last hidden states of the fine-tuned LM for node classification.
- ENGINE: This is a parameter- and memory-efficient fine-tuning method designed for textual graphs with a large language model (LLM) encoder. It introduces a lightweight, tunable GNN-based side structure alongside each layer of the LLM.
Dear Reviewer NtXH,
Thank you very much for your time and effort in reviewing our paper. In response to your comments and concerns, we have submitted detailed answers in Response Parts 1-4.
As the discussion deadline approaches, we would greatly value your input on whether our responses have addressed your concerns and if they have helped in re-evaluating our work. Your feedback is incredibly important to us, and we would deeply appreciate it if you could share any additional thoughts or suggestions you may have.
Thank you once again for your time and consideration!
Sincerely,
Authors
Dear Reviewers,
Thank you very much for your thoughtful and valuable comments on our submission.
We have made every effort to provide detailed responses to your questions and concerns. As a gentle reminder, the discussion period is set to conclude on November 26th, leaving just one day remaining.
We would greatly appreciate it if you could take a few moments to review our responses.
We deeply value your time and effort.
Sincerely,
The authors
The paper introduces a novel unified framework that integrates graph generation and text embedding for graph-language models (GLMs). Unlike existing methods that focus on either graph generation or text embedding separately, this framework addresses both simultaneously. The authors use a kNN graph to approximate the scale-free network and successfully apply this technique to citation networks. The proposed approach does not require extensive annotations or a given graph structure, which makes it generally applicable in practice. According to the reviews, the main weakness of this paper is that it mainly demonstrates performance on citation networks without application results on other networks, for example, social networks. Also, some reviewers asked for more ablation results, such as different labeling rates and other graph generation methods, which the authors have provided in the rebuttal. Overall, the paper receives borderline scores of 5, 6, 6, 6. The reviewer who gave 5 mainly asked about novelty and clarification with respect to related works, but that reviewer did not participate in the discussion. I have checked the responses and think the authors addressed most of the issues. Overall, I lean towards acceptance of this paper.
Additional Comments on Reviewer Discussion
The reviewers mainly raise points on clarification compared to related works, experiments with other types of graph networks such as social networks, and some additional results. The authors have done a good job in the rebuttal and added many new results which the reviewers acknowledge. For the concern that the kNN graph generation approach may not apply easily to other types of networks, the authors made a reasonable argument to clarify the scope of this paper, and I basically agree that the method does not need to be widely applicable to different networks to be published, following previous papers.
Accept (Poster)