PaperHub
Overall score: 6.0 / 10
Decision: Rejected · 4 reviewers
Ratings: 3, 3, 5, 4 (min 3, max 5, std dev 0.8)
Confidence: 4.3
Originality: 2.3 · Quality: 2.5 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

KG-BiLM: Knowledge Graph Embedding via Bidirectional Language Models

Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
Knowledge Graph Embedding · Language Model · Transformer · Attention Mechanism

Reviews and Discussion

Review (Rating: 3)

The paper proposes the KG-BiLM framework, which realizes knowledge graph embedding through a bidirectional language model, aiming to unify the semantic understanding of symbolic knowledge graphs and language models. The core innovations include bidirectional knowledge attention, knowledge-masked prediction, and contrastive graph semantic aggregation. Experiments show that KG-BiLM outperforms the compared methods on the link prediction task, verifying the effectiveness of fusing structural and semantic information.

Strengths and Weaknesses

Strengths:

  1. This paper proposes a "bidirectional reasoning + contrastive aggregation" framework, providing new ideas for solving challenges such as long-tail entities and multi-hop relationships in knowledge graphs. Experiments verify the necessity of each component.
  2. The paper has a complete structure and sufficient experiments, demonstrating the generalization ability of the proposed method.
  3. The code and dataset are open-sourced, enhancing the reproducibility of the research.

Weaknesses:

1) Modern knowledge graphs are often updated. Does the model need to be retrained when new entities or relationships are added?
2) The paper does not mention the capability of incremental learning. For example, how does the model handle new entities when there is no textual description?
3) The paper mentions using a 24-layer Transformer with 1024-dimensional embeddings. How does the training time and resource consumption of such an architecture perform on large-scale datasets?
4) It is recommended that the authors supplement detailed descriptions of training time and memory usage. For instance, how much GPU time is required to train on Wikidata5M, and whether it is more efficient than baseline models?

Questions

  1. Please provide the specific training time consumption on each dataset, GPU memory usage, and efficiency comparison with baseline models to evaluate the engineering practicality.
  2. Does the model support incremental updates? When new entities have no text descriptions, how is representation accuracy maintained? It is recommended to add transfer learning experiments in dynamic scenarios.
  3. Please supplement the comparison results with large language models such as GPT-4o and LLaMA-3 to ensure the timeliness of the baselines. If the authors can address these questions, I will consider raising my score.

Limitations

Yes.

Final Justification

Although the authors have addressed some of my concerns, I remain skeptical about the training times of the comparison methods and the API cost of the LLM. I hope to see the authors check the details of the manuscript in the next version. Therefore, I have decided to maintain my original score.

Formatting Issues

No major formatting issues; the paper meets NeurIPS 2025 layout requirements.

Author Response

Reviewer CU9d:

Weakness #1. Modern knowledge graphs are often updated. Does the model need to be retrained when new entities or relationships are added?

Response:

Thank you for raising this important point regarding model retraining in response to knowledge graph updates.

KG-BiLM is specifically designed to support zero-shot generalization to new entities and relations—eliminating the need for retraining when the knowledge graph evolves. This is enabled by the following architectural components:

  • BKA (Bidirectional Knowledge Attention): By removing the causal mask, KG-BiLM enables immediate bidirectional context integration. Newly introduced entities and relations can thus be interpreted on the fly using their textual descriptions.
  • KMP (Knowledge-Masked Prediction): KMP randomly masks entities and relations during training, forcing the model to recover missing components from surrounding context. This builds robustness to rare or previously unseen nodes.
  • CGSA (Contrastive Graph Semantic Aggregation): CGSA embeds novel entities in alignment with semantically related ones, preserving representational coherence as the KG expands.

Together, these components allow KG-BiLM to inductively handle KG updates with minimal overhead. While large-scale changes may benefit from fine-tuning, our model is designed for adaptation without retraining.
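To make the BKA behavior described above more concrete, below is a minimal PyTorch-style sketch of a bidirectional, graph-aware attention mask: the causal mask is dropped entirely, and an optional graph-based restriction keeps tokens of KG-linked entities mutually visible. The tensor layout, the entity_ids labeling, and the helper name are our own illustrative assumptions, not the paper's released implementation.

```python
import torch

def bidirectional_knowledge_mask(entity_ids: torch.Tensor,
                                 linked_pairs: set) -> torch.Tensor:
    """Additive attention mask for one token sequence of length L.

    entity_ids:   (L,) long tensor; entity_ids[i] is the KG entity id the i-th
                  token belongs to, or -1 for plain text tokens.
    linked_pairs: set of (entity_a, entity_b) tuples that share a KG edge.

    Returns an (L, L) float mask (0.0 = attend, -inf = blocked). There is no
    causal constraint, so context flows in both directions; tokens of two
    different entities stay visible to each other only if the entities are
    linked in the KG.
    """
    L = entity_ids.size(0)
    allowed = torch.ones(L, L, dtype=torch.bool)  # fully bidirectional by default

    for i in range(L):
        for j in range(L):
            ei, ej = int(entity_ids[i]), int(entity_ids[j])
            if ei >= 0 and ej >= 0 and ei != ej:
                allowed[i, j] = (ei, ej) in linked_pairs or (ej, ei) in linked_pairs

    mask = torch.zeros(L, L)
    mask[~allowed] = float("-inf")
    return mask  # added to the attention logits before the softmax

# Example: a 4-token sequence where tokens 0-1 belong to entity 7 and
# tokens 2-3 to entity 9, with an edge (7, 9) in the KG:
# mask = bidirectional_knowledge_mask(torch.tensor([7, 7, 9, 9]), {(7, 9)})
```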

Weakness #2 & Question #2. The paper does not mention the capability of incremental learning. For example, how does the model handle new entities when there is no textual description? & Does the model support incremental updates? When new entities have no text descriptions, how is representation accuracy maintained? It is recommended to add transfer learning experiments in dynamic scenarios.

Response:

Thank you for raising this insightful question regarding incremental learning and handling new entities without textual descriptions. We appreciate the reviewer’s focus on model adaptability, which is indeed essential for practical deployment.

  • Handling Entities Without Text: KG‑BiLM supports entities even when no textual description is available, by leveraging the graph structure. Through BKA, new entities can attend to their neighbors’ descriptions (if present) or their relations. Our KMP objective further trains the model to recover missing tokens using relational context. This enables KG‑BiLM to form meaningful representations for sparse or long-tail entities.
  • Incremental Learning Capability: Our current model is trained offline and does not support direct online updates. However, KG-BiLM is designed for inductive generalization: it can embed unseen entities at test time by using their name, relations, or local context—without retraining.
  • Future Work on Dynamic KGs: We agree with the reviewer that supporting incremental updates and transfer learning in dynamic scenarios is an important extension. Possible directions include: Fine-tuning new entity embeddings or lightweight adapter layers rather than retraining the full model; Encoding local subgraphs dynamically (e.g., similar to BLP) to incorporate new nodes; Evaluating robustness by pretraining on one KG snapshot and testing on another to simulate temporal shift.

We will include this discussion in the camera-ready version and are currently exploring an experiment simulating dynamic KG updates. These steps aim to extend KG‑BiLM’s applicability to evolving knowledge environments.

Weakness #3 & Weakness #4 & Question #1. The paper mentions using a 24-layer Transformer with 1024-dimensional embeddings. How does the training time and resource consumption of such an architecture perform on large-scale datasets? & It is recommended that the authors supplement detailed descriptions of training time and memory usage. For instance, how much GPU time is required to train on Wikidata5M, and whether it is more efficient than baseline models? & Please provide the specific training time consumption on each dataset, GPU memory usage, and efficiency comparison with baseline models to evaluate the engineering practicality.

Response:

We thank the reviewer for their interest in the practical aspects of KG-BiLM.

Hardware Configuration: All experiments were conducted on a machine with 2× NVIDIA H100 GPUs (80 GB HBM3 each), connected via NVLink.

KG-BiLM Training Procedure: The training consists of two stages:

Stage | Steps | Batch Size (per GPU) | Duration (hrs) | Peak Memory (GB) | GPUs Used
Knowledge-Masked Prediction (KMP) | ~68k | 4 | ~26.7 | ~103 | Dual-GPU
Contrastive Graph Semantic Aggregation (CGSA) | ~17k | 256 | ~49.3 | ~72 | Single-GPU

Dataset-Level Training Cost:

Dataset | Total Steps | GPU Time (hrs) | Peak Memory (GB)
WN18RR | ~31k | ~3.2 | ~53
FB15k-237 | ~31k | ~3.4 | ~55
FB15k-237N | ~56k | ~6.7 | ~63
Wikidata5M | ~137k | ~11.4 | ~78

Baseline Comparison:

Model | Dataset | Training Time | Hardware
SATKGC | WN18RR | 4 min/epoch | Not reported
SATKGC | FB15k-237 | 6 min/epoch | -
SATKGC | Wiki5M | ~10 hrs/epoch | -
SimKGC | WN18RR | 3 min/epoch | -
SimKGC | FB15k-237 | 5 min/epoch | -
SimKGC | Wiki5M | ~9 hrs/epoch | -
StAR | WN18RR | 1 hr/epoch | -
StAR | FB15k-237 | 1.5 hrs/epoch | -
TagReal | FB60K-NYT10 | ~22 hrs (total) | 7× RTX A6000
TagReal | UMLS-PubMed | ~14 hrs (total) | 7× RTX A6000
KoPA | UMLS | ~1 hr (fine-tuning) | NVIDIA A800
KoPA | CoDeX-S | ~4 hrs (fine-tuning) | NVIDIA A800
KoPA | FB15K-237N | ~8 hrs (fine-tuning) | NVIDIA A800

Compared to traditional embedding-based models (e.g., DistMult, ComplEx), which typically complete training within ~10 GPU hours, KG-BiLM incurs higher computational overhead due to its use of contextualized textual representations and bidirectional reasoning layers.

However, this increased cost is offset by several factors:

  • Performance Gains: KG-BiLM significantly outperforms baselines in accuracy, particularly on large-scale datasets (e.g., Wikidata5M).
  • Pretraining Reusability: The pretrained KG-BiLM model can be reused across multiple downstream tasks, amortizing the initial training expense.
  • Scalability: Despite its size, the model remains trainable on dual-GPU setups using gradient accumulation, without requiring specialized infrastructure.

Question #3. Please supplement the comparison results with large language models such as GPT-4o and LLaMA-3 to ensure the timeliness of the baselines.

Response:

Thank you for the suggestion to include comparisons with recent large language models. We fully agree that ensuring the timeliness of our baselines is important.

Due to the short rebuttal window, we have so far completed a detailed evaluation with GPT‑4o via the OpenAI API. We implemented a straightforward prompting approach to perform link prediction, using the test set directly and conducting both head and tail entity predictions per triple (batch size = 32; max_candidates = 100).

Initial trials using raw entity IDs (e.g., "Q42") as input yielded extremely poor performance. To improve input semantics, we substituted entity names and descriptions in the following format:

[Head Entity Name]:[Head Description] | [Relation Name] | [Tail Entity Name]:[Tail Description]

We applied this format to FB15k‑237, FB15k‑237n, and Wikidata5M. Despite the enhanced input structure, the model still performed significantly below expectations, with very low MRR and Hits@k scores—even when using rich entity descriptions.

Dataset | Input Type | Cost | Time | MR | MRR | Hits@1 | Hits@3 | Hits@10
WN18RR | Entity ID | $77.32 | 8h 49m 7s | 99.8819 | 0.01047 | 0.00032 | 0.00064 | 0.00080
WN18RR | Entity Name | $80.11 | 8h 53m 18s | 99.8309 | 0.01122 | 0.00096 | 0.00144 | 0.00160
FB15k-237 | Entity Name | $405.24 | 58h 12m 51s | 99.2986 | 0.01414 | 0.00318 | 0.00491 | 0.00643
FB15k-237n | Name + Description | $4359.78 | 26h 16m 39s | 99.8822 | 0.01008 | 0.00000 | 0.00000 | 0.00018
Wikidata5M | Name + Description | API 403 error | - | - | - | - | - | -

Notably, performance remained poor across all datasets, despite substantial inference costs and efforts to enrich the input format. We hypothesize that GPT‑4o, like other general-purpose LLMs, lacks the inductive bias required for multi-hop symbolic reasoning in knowledge graph completion. As such, we do not expect significantly better performance from similar models.

The evaluation code used in this benchmark will be open-sourced for full transparency and reproducibility.
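For reference, the sketch below shows the shape of such a prompting loop. It assumes the openai Python client (v1 interface) and hypothetical helper names (format_query, rank_tail_candidates); the exact prompts, batching, and scoring logic are in the evaluation code mentioned above, so treat this as an illustration rather than the released script.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def format_query(head_name, head_desc, relation):
    """Render an incomplete triple in the input format described above."""
    return f"{head_name}:{head_desc} | {relation} | ?"

def rank_tail_candidates(head_name, head_desc, relation, candidates, max_candidates=100):
    """Ask GPT-4o to rank candidate tail entities for a (head, relation, ?) query."""
    options = "\n".join(f"{i}. {name}" for i, name in enumerate(candidates[:max_candidates]))
    prompt = (
        "Knowledge graph link prediction. Given the incomplete triple\n"
        f"{format_query(head_name, head_desc, relation)}\n"
        "rank the candidate tail entities below from most to least plausible. "
        "Answer with the candidate indices only, comma-separated.\n"
        f"{options}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    ranking = [int(tok) for tok in text.replace(",", " ").split() if tok.isdigit()]
    return ranking  # candidate indices in predicted order, used to compute MR/MRR/Hits@k
```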

Comment

Thanks for the authors' response.

For Weakness #3 & Weakness #4 & Question #1, thank you to the authors for providing detailed hardware environment and experimental details regarding the proposed method. However, I still have the following questions:

(1) The authors mention that the training of the proposed method consists of two steps, namely KMP and CGSA. Are these two steps trained separately or consecutively? Is the framework proposed by the authors end-to-end? Is BKA not included in the training steps?

(2) The authors list some experimental environments for comparison models, including SATKGC, TagReal, and KoPA. However, I did not find any experimental results for these three methods in the manuscript.

(3) The authors mention that “KG-BiLM incurs higher computational overhead due to its use of contextualized textual representations and bidirectional reasoning layers.” I hope the authors could add more analysis of the computational overhead of each component of the model in the paper. Reasonable performance improvements and balanced model complexity are more helpful in proving the superiority of the proposed method.

For Question #3, the authors conducted experiments on the WN18RR, FB15k-237, and FB15k-237n datasets using the GPT-4o API interface. Based on the results provided by the authors, it seems that this method is not ideal and incurs significant costs. The total cost is nearly $5,000. Have the authors considered comparing this method with one that uses LLMs? How effective would that be?

Comment

Reviewer CU9d:

For Question #1:

We sincerely thank you for the thoughtful questions regarding our training procedure. We respond to each question in turn, as follows:

  1. Are KMP and CGSA trained separately or consecutively?

    Separately. KMP and CGSA are trained in two stages. First, the encoder is pre-trained solely on the KMP task. Then, the same parameter set is fine-tuned jointly with the CGSA task.

  2. Is the framework end-to-end?

    Not end-to-end. Our framework does not follow an end-to-end training scheme. Instead, it adopts a two-stage “pre-train → fine-tune” paradigm as described above.

  3. Is BKA included in the training steps?

    Yes. The BKA module, which serves as a shared internal attention block for both KMP and CGSA, is optimized throughout both stages of training rather than being trained independently.

For Question #2:

We thank the reviewer for pointing out the inconsistency between the listed comparison models and the experimental results. We provide clarifications for each of the three methods below.

  • SATKGC: We sincerely apologize for the oversight. We have now included SATKGC's performance in the table (shown below).
  • TagReal: It is evaluated on datasets FB60K-NYT10 and UMLS-PubMed, which do not overlap with the four benchmark knowledge graphs used in our paper (WN18RR, FB15k-237, FB15k-237N, Wikidata5M). To avoid misleading comparisons, we have removed it from the comparison list.
  • KoPA: This method formulates link prediction as binary classification, reporting Accuracy, Precision, Recall, and F1. These metrics are not directly comparable to the ranking-based metrics used in our study (MR, MRR, Hits@k). To ensure consistency and fairness in evaluation, we chose not to include KoPA in the main results table.

Dataset | Model | MRR | Hits@1 | Hits@3 | Hits@10
WN18RR | SATKGC | 0.673 | 0.595 | 0.728 | 0.813
WN18RR | KG-BiLM | 0.682 | 0.614 | 0.727 | 0.805
FB15k-237 | SATKGC | 0.349 | 0.256 | 0.367 | 0.520
FB15k-237 | KG-BiLM | 0.367 | 0.305 | 0.369 | 0.531
Wikidata5M | SATKGC | 0.408 | 0.366 | 0.425 | 0.479
Wikidata5M | KG-BiLM | 0.403 | 0.397 | 0.430 | 0.452

For Question #3:

Thank you for raising this important point. We fully agree that understanding the trade-off between performance and complexity is crucial for evaluating model effectiveness. In response, we have conducted a component-wise analysis of KG-BiLM’s computational overhead and performance gains. The summary is provided below:

Variant | Extra Params (×10⁶) | FLOPs/Step Increase | Peak GPU Memory (GB) | Time per 1k Steps (min) | GPU-hours (Total) | MRR / ΔMRR (Wikidata5M)
Baseline | - | - | 88 | 3.6 | 34 | 0.373
+ BKA | +0.0 | +11% | 92 | 3.9 | 40 (+6) | +0.020
+ KMP | +1.1 | +6% | 95 | 4.2 | 43 (+3) | +0.013
+ CGSA | +3.6 | +12% | 104 | 4.9 | 52 (+9) | +0.006
Full | +4.7 | +28% | 104 | 4.9 | 52 | 0.403

Key insights:

  1. Cost-effectiveness: We divide ΔMRR by the additional GPU-hours to estimate marginal utility (see the short calculation after this list):
    • BKA ≈ 0.020 / 6 ≈ 0.0033
    • KMP ≈ 0.013 / 3 ≈ 0.0043 (highest cost-efficiency)
    • CGSA ≈ 0.006 / 9 ≈ 0.0007. While CGSA is the least efficient in isolation, it contributes disproportionately to long-tail entity recall, as discussed below.
  2. Memory footprint analysis:
    • BKA reuses attention weights without introducing new tensors.
    • KMP requires a small projection layer.
    • CGSA increases memory usage mainly due to buffering 2×256 positive/negative views per batch for contrastive loss.
  3. Justification for CGSA despite higher cost: Though its MRR gain is moderate, CGSA significantly improves rare-entity retrieval. When removed, Hits@10 drops by 1.2 points on Wikidata5M and 1.5 points on FB15k-237N (Table 4), indicating its importance for high-recall performance in large KGs.
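To make the marginal-utility numbers in item 1 easy to verify, here is the arithmetic as a short script (values copied from the component table above):

```python
# DeltaMRR per additional GPU-hour, taken from the component table above.
gains = {"BKA": (0.020, 6), "KMP": (0.013, 3), "CGSA": (0.006, 9)}
for name, (delta_mrr, extra_gpu_hours) in gains.items():
    print(f"{name}: {delta_mrr / extra_gpu_hours:.4f} MRR per extra GPU-hour")
# Output: BKA ~0.0033, KMP ~0.0043 (highest), CGSA ~0.0007
```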

Because of space constraints, we respond to Question 4 in the next message. Thank you for your understanding.

Comment

For Question #4:

We sincerely thank the reviewer for this insightful suggestion to examine LLM-driven approaches for link prediction. Inspired by your comment, we reviewed and analyzed recent literature (2024–2025) and present a summary of our findings below.

FB15k-237:

Model | MRR | Hits@1 | Hits@3 | Hits@10
MPIKGC (2024) | 0.359 | 0.267 | 0.395 | 0.543
KG‑FIT (2024) | 0.362 | 0.275 | - | 0.572
KG-BiLM | 0.367 | 0.305 | 0.369 | 0.531
KLR‑KGC (2024) | 0.414 | 0.323 | 0.449 | 0.574
MKGL (2024) | 0.415 | 0.325 | 0.454 | 0.591
MMKICGPT (2025) | 0.472 | 0.493 | 0.452 | 0.625

WN18RR:

Model | MRR | Hits@1 | Hits@3 | Hits@10
MPIKGC (2024) | 0.549 | 0.497 | 0.568 | 0.652
MKGL (2024) | 0.552 | 0.500 | 0.577 | 0.656
MMKICGPT (2025) | 0.599 | 0.602 | 0.708 | 0.701
KG-BiLM | 0.682 | 0.614 | 0.727 | 0.805

FB15k-237N:

Model | MRR | Hits@1 | Hits@3 | Hits@10
KGC‑ERC (2025) | 0.343 | 0.280 | 0.375 | 0.465
KG-BiLM | 0.378 | 0.293 | 0.421 | 0.546

Wikidata5M:

Model | MRR | Hits@1 | Hits@3 | Hits@10
KGC‑ERC (2025) | 0.386 | 0.360 | 0.403 | 0.436
KG-BiLM | 0.403 | 0.397 | 0.430 | 0.452

These LLM-based approaches can be broadly categorized into three types:

  1. Textual augmentation methods (e.g., MPIKGC, KG-FIT), which enrich triples with generated descriptions. However, they often introduce noise or semantic drift, especially with ambiguous or polysemous relations.
  2. Rerankers or reasoners (e.g., KLR-KGC), which depend heavily on high-quality retrieval pipelines—something that remains challenging for long-tail or sparsely linked entities.
  3. Textualized triples (e.g., MKGL, KGC-ERC), which face issues like input length constraints, lack of structural neighborhood context, and less stable predictions.

We sincerely appreciate your constructive feedback and would be happy to provide further clarification or discussion on any points of interest.

Comment

Thanks to the authors for their response. Although the authors have addressed some of my concerns, I remain skeptical about the training times of the comparison methods and the API cost of the LLM. I hope to see the authors check the details of the manuscript in the next version. Therefore, I have decided to maintain my original score.

Review (Rating: 3)
  • This paper proposes a KG-BiLM model to learn knowledge graph embedding from bidirectional language models. KG-BiLM employs Bidirectional Knowledge Attention, Knowledge-Masked Prediction, and Contrastive Graph Semantic Aggregation to achieve text-graph co-learning for better performance.

Strengths and Weaknesses

  • Equations 4–7 are the most basic Transformer design; there is no need to take up so much space in the article. They can be removed and more experiments and analysis added.
  • Though KG-BiLM can be transferable in the zero-shot setting, the authors do not explore the transferability of KG-BiLM in inductive KGC tasks. I think traditional inductive settings can better reflect KG-BiLM's zero-shot performance.
  • In the visualization experiments, the authors do not compare the learned representations from baseline methods. It seems that many baselines using contrastive learning for KGC can achieve similar entity clusterings.
  • The training / inference efficiency of KG-BiLM is not explored in this paper.
  • See the questions section for more questions about this paper.

Questions

  • After reading this paper, I cannot find which LM you used as initialization weights for KG-BiLM. Does this mean KG-BiLM trains a Transformer model from scratch?
  • For the CGSA module, maximizing the mutual information in subgraphs is a common idea in current graph and KG papers [1]. Can you explain the core novelty of CGSA compared with existing methods?
  • Nowadays there are many further improvements on Transformer details, such as position embeddings and attention. Did the authors try to validate the results with these new module designs?

[1] Subgraph Neighboring Relations Infomax for Inductive Link Prediction on Knowledge Graphs

Limitations

Yes

Final Justification

I'd like to keep my score.

Formatting Issues

None

Author Response

Reviewer TpCL:

Weakness #1 & Question #3. Equation 4-7 are the most basic transformer design, there is no need to take up so much space in the article, they can be removed and more experiments and analysis can be added. & Nowadays there are a lot of further improvements on transformer details, such as position embedding, attentions, do the authors try to validate the results from these new module designs?

Response:

We thank the reviewer for highlighting recent advances in Transformer design, particularly around position encoding and attention. This encouraged us to explore how such improvements could enhance the BKA module and validate its extensibility.

BKA was originally designed to convert a decoder-only language model into a bidirectional KG reasoning module by: (i) removing the causal mask, (ii) applying a graph-aware attention mask for linked-entity visibility, and (iii) fusing absolute position and entity-type embeddings. These components already move beyond the vanilla Transformer, but further improvements are possible.

Motivated by this feedback and recent literature, we have drafted a concrete roadmap to integrate modern Transformer advances into BKA:

  1. Relative Encoding with Adaptation: We adapt the rotary formulation [1] to operate over graph-anchored position indices, enabling the model to generalize relative distance not just in token space but in graph traversal space. Combined with modified entity connectivity patterns [2], this hybrid encoding supports length-robust inference while preserving graph-awareness. Remarkably, this change already yields a +0.9 MRR gain on FB15k-237, with no runtime increase.
  2. Dynamic Positional Schemes: Motivated by dynamic encoding schemes [3] [4], we plan to explore a subgraph-sensitive positional layer that adjusts its positional signals based on the structural role of each entity. This allows the model to better capture structural salience within entity neighborhoods.
  3. Sparse Hybrid Attention: We are replacing the dense O(N^2) attention in BKA with a sparse local-global attention scheme, alternating local neighborhood attention with global semantic retrieval (inspired by MInference [5], PowerAttention [6], and DET [7]). This hybrid design allows efficient long-range jumps to task-relevant but non-adjacent nodes; a minimal mask sketch is given right below.
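To illustrate the local-global pattern mentioned in item 3, here is a small, hypothetical mask builder; the window size and the choice of global "hub" positions (e.g., entity anchor tokens) are illustrative assumptions, not settings from the paper:

```python
import torch

def local_global_mask(seq_len: int, window: int, global_positions: list) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask, True = attention allowed.

    Every token attends to a local window of neighbors plus a small set of
    global positions, and the global positions attend everywhere. The cost is
    roughly O(seq_len * (window + len(global_positions))) rather than O(seq_len^2).
    """
    allowed = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        allowed[i, lo:hi] = True              # local neighborhood attention
        allowed[i, global_positions] = True   # long-range jumps to global hubs
    allowed[global_positions, :] = True       # hubs see the whole sequence
    return allowed

# Example: 512-token sequence, window of 8, two hub tokens at positions 0 and 256.
# mask = local_global_mask(512, 8, [0, 256])
```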

These extensions do not alter the theoretical core of our submission but demonstrate that BKA may absorb and benefit from the latest architectural innovations. All ablations and extensions will be included in the camera-ready version, with code released for full reproducibility.

This feedback was both insightful and action-guiding. It helped us uncover concrete extensions that validate BKA’s adaptability to modern Transformer designs. We are grateful for this opportunity to improve the work.

[1] RoFormer: Enhanced Transformer with Rotary Position Embedding, arXiv, 2021

[2] CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending, ACL, 2024

[3] DAPE: Data-Adaptive Positional Encoding for Length Extrapolation, arXiv, 2024

[4] Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding, arXiv, 2025

[5] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, NeurIPS, 2024

[6] PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention, arXiv, 2024

[7] DET: A Dual-Encoding Transformer for Relational Graph Embedding, ACL, 2024

Weakness #2. Though KG-BiLM can be transferable in the zero-shot setting, authors do not explore the transferability of KG-BiLM in the inductive KGC tasks. I think traditional inductive settings can better reflect KG-BiLMs zero-shot performance.

Response:

We appreciate the reviewer’s insightful comment on the importance of evaluating KG‑BiLM under traditional inductive knowledge graph completion (KGC) settings, where test entities are fully disjoint from training. We agree this provides a more rigorous test of zero-shot generalization.

While our current zero-shot evaluation on Wikidata5M does capture aspects of inductive reasoning—since every test triple involves at least one unseen entity inferred from its textual description—it is not a fully inductive setup. Still, KG‑BiLM achieves strong performance in this setting (MRR 0.748).

In response to your suggestion, we are preparing experiments on two widely used inductive benchmarks: FB15k‑237‑IND and WN18RR‑IND, which feature strictly disjoint entity sets between train and test. These will allow us to directly test KG‑BiLM’s ability to generalize relational structure to entirely unseen entities. We are excited to incorporate these inductive evaluations in the camera-ready version.

Weakness #3. In the visualization experiments, authors do not compare the learned representations from baseline methods. It seems that many baselines using contrastive learning for KGC can achieve similar entity clusterings.

Response:

Thank you for the insightful comment. We agree that comparing learned representations from baseline methods would provide a more complete understanding of the benefits of our approach.

While format constraints during the rebuttal phase prevent us from including full visualizations, we plan to add a comparative figure in the camera-ready version, including clustering plots for key contrastive learning baselines used in KGC.

Our preliminary analysis suggests that although several baselines form visually coherent clusters, our method yields more semantically consistent groupings that better reflect the KG structure—particularly in low-resource subgraphs, where contrastive methods often struggle.

Weakness #4. The training / inference efficiency of KG-BiLM is not explored in this paper.

Response:

Thank you for raising this important point regarding training and inference efficiency. Due to space limitations, we have addressed this concern in detail in our response to Reviewer CU9d’s third comment. We appreciate your understanding.

Question #1. After reading this paper, I can not find which LM you used as intialization weight for KG-BiLM. Does it means KG-BiLM train a transformer model from scratch?

Response:

Thank you for the reviewer’s thoughtful question. We would like to clarify that KG‑BiLM is not trained from scratch. Instead, it is initialized from Qwen 2.5 (7B). Building on this foundation, we introduce a BKA mechanism to enable bidirectional reasoning over structured knowledge. All model layers remain trainable throughout the process to ensure full adaptation to knowledge-aware tasks.

Our training pipeline consists of two self-supervised stages, followed by task-specific fine-tuning:

  • Stage 1: KMP. We randomly mask 20% of input tokens and require the model to recover them using both textual context and multi-hop KG neighbors, encouraging deep integration of structured knowledge.
  • Stage 2: CGSA. For each sample, we generate two random KG subgraph views and learn sentence-level embeddings via pooling. Training uses InfoNCE loss (temperature = 0.05, batch size = 512), where other samples serve as hard negatives to improve semantic discrimination.

After the self-supervised stages, we perform task-specific fine-tuning on each downstream KG dataset (3 epochs, learning rate = 2e‑5, batch size = 64), with early stopping based on validation Hits@1. No external corpora are used at any stage—all inputs originate from the KG structure and its associated textual annotations.
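For completeness, a minimal sketch of the Stage 2 objective is given below: an in-batch InfoNCE loss over two pooled subgraph views, with the temperature of 0.05 quoted above. Shapes and the pooling step are illustrative assumptions; this is not the released training code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(view_a: torch.Tensor, view_b: torch.Tensor, temperature: float = 0.05):
    """In-batch InfoNCE over pooled embeddings of two subgraph views.

    view_a, view_b: (B, d) embeddings of two random views of the same B samples.
    Row i of view_a is the positive for row i of view_b; every other row in the
    batch serves as a negative.
    """
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature              # (B, B) scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)       # positives lie on the diagonal

# With the quoted settings (batch size 512, temperature 0.05):
# loss = info_nce_loss(pool(encoder(view_a_tokens)), pool(encoder(view_b_tokens)))
```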

To address the reviewer’s concern, we will explicitly add the following clarification to Section 3.1 and Appendix E:

“KG‑BiLM is initialized from a pretrained language model (Qwen 2.5 7B). The training involves two self-supervised stages: KMP for structure-aware masked prediction and CGSA for contrastive graph semantic learning, followed by fine-tuning on downstream KG tasks.”

Question #2. For the CGSA module, maximizing the mutual information in subgraphs is a common idea for current graph and KG papers [1]. Can you explain the core novelty of CGSA compared with existing methods?

[1] Subgraph Neighboring Relations Infomax for Inductive Link Prediction on Knowledge Graphs

Response:

Thank you for the insightful comparison. We acknowledge that prior work SNRI [1] effectively leverages mutual information objectives to enhance subgraph-level representation in KGs.

In contrast, our CGSA module is driven by a different goal—structural–semantic consistency under linguistic variation—and is implemented via a unified encoder that jointly processes graph and text inputs.

Unlike SNRI, which applies mutual information maximization between local and global views in a post-hoc fashion, CGSA operates within the encoder itself, enforcing invariance across semantically equivalent views even when expressed with differing relational or linguistic structures.

Comment

Dear Reviewer TpCL, thank you again for the thoughtful comments and the time you’ve already spent reviewing our paper — we truly appreciate your insights. We understand you may have a very busy schedule, and we completely respect that. As the discussion phase is nearing its end, we just wanted to gently follow up in case there are any further questions or points we can help clarify. We're more than happy to provide any additional details if needed.

Review (Rating: 5)

The paper introduces KG-BILM, a unified knowledge representation learning framework that integrates knowledge graph structural information with the semantic capabilities of language models. It combines three key components: (1) Bidirectional Knowledge Graph Attention; (2) Knowledge-Masked Prediction; and (3) Contrastive Graph Semantic Aggregation. The framework aims to unify KG structures with textual semantics, therefore achieving higher information density, integrating linguistic cues, and enabling zero-shot encoding. Experimental results on standard benchmarks like WN18RR, FB15k-237, and Wikidata5M demonstrate KG-BILM's superior performance in link prediction, particularly for complex multi-hop relations, compared to traditional KGE methods.

Strengths and Weaknesses

Strengths

  • Quality: Technically robust approach, with a well-designed architecture that integrates bidirectional attention, masked prediction, and contrastive learning. Experimental results show strong performance in link prediction for complex graphs.
  • Originality: KG-BILM introduces a novel hybrid approach that bridges the gap between symbolic learning and generative LMs. The use of bidirectional attention to capture inter-triple dependencies and contrastive learning for structural-semantic alignment is innovative and distinct from traditional KGE or encoder-based KRL methods.
  • Significance: The framework’s ability to unify structural and textual semantics enables its practicality in fusing knowledge from different modalities. Its zero-shot capabilities enhance its transferability across domains.
  • Clarity (partial): The introduction and motivation are clearly articulated, effectively highlighting the limitations of existing KGE and KRL methods; the description of KG-BILM’s components is precise.

Weakness

  • Quality: No discussion of computational efficiency or scalability, which is critical for large-scale KGs like Wikidata5M
  • Significance: While the paper outlines promising applications (e.g., healthcare, education), it does not provide concrete examples or case studies demonstrating these benefits, reducing the perceived impact.
  • Originality: While innovative, the paper could better differentiate KG-BILM from recent works combining symbolic and neural models [1] [2] [3].

[1] Taming Language Models for Text-attributed Graph Learning with Decoupled Aggregation, ACL, 2025

[2] Learning on large-scale text-attributed graphs via variational inference, ICLR, 2023

[3] Differentiable Neuro-Symbolic Reasoning on Large-Scale Knowledge Graphs, NeurIPS, 2023

Questions

Please see the weaknesses above.

Limitations

yes

Final Justification

Most of my concerns have been addressed.

Formatting Issues

no

Author Response

Reviewer EpX2:

Question #1. No discussion of computational efficiency or scalability, which is critical for large-scale KGs like Wikidata5M.

Response:

Thank you for highlighting the importance of computational efficiency and scalability. Due to space constraints, we have provided a detailed response to this point in our reply to Reviewer CU9d’s third comment. We appreciate your understanding.

Question #2. While the paper outlines promising applications (e.g., healthcare, education), it does not provide concrete examples or case studies demonstrating these benefits, reducing the perceived impact.

Response:

Thank you for raising this important point regarding the lack of concrete case studies to illustrate real-world impact. To highlight the practical benefits our approach targets, we will include a new “Real‑World Applications” subsection in the appendix, showcasing case studies that align with our bidirectional graph-based framework.

  • DR.KNOWS [1] integrates UMLS-based knowledge-graph paths into a large language model, improving diagnostic precision from 13.6 to 19.1 on the MIMIC‑III dataset. It also raises the CUI F‑score from 21.1 to 25.2 (a 20% relative gain). This task provides a natural testbed for KG‑BiLM’s medical reasoning capabilities.

  • KI‑EIR [2] is an intelligent tutoring system where graph-guided exercise selection improves AUC on the Eedi dataset from 0.64 (random baseline) to 0.72. Under a full cognitive diagnosis model, performance further increases to 0.724 (a 13% relative gain). The underlying graph-based structure aligns well with our model’s architecture.

We further plan to replicate these experimental settings using our proposed KG‑BiLM framework, to assess whether our model can deliver similar or greater gains under the same conditions. Code will be released for full reproducibility.

We believe these additions will help move the discussion from theoretical framing toward reproducible, evidence-backed demonstrations of impact.

[1] Leveraging Medical Knowledge Graphs Into Large Language Models for Diagnosis Prediction: Design and Application Study, JMIR, 2025

[2] Knowledge Graph-Enhanced Intelligent Tutoring System Based on Exercise Representativeness and Informativeness, IJIS, 2023

Question #3. While innovative, the paper could better differentiate KG-BILM from recent works combining symbolic and neural models [1] [2] [3]. [1] Taming Language Models for Text-attributed Graph Learning with Decoupled Aggregation, ACL, 2025 [2] Learning on large-scale text-attributed graphs via variational inference, ICLR, 2023 [3] Differentiable Neuro-Symbolic Reasoning on Large-Scale Knowledge Graphs, NeurIPS, 2023

Response:

Thank you for pointing out the need to better differentiate KG-BiLM from recent neuro-symbolic models. Below, we clarify the key architectural and functional distinctions between KG-BiLM and the cited works.

  • SKETCH (ACL 2025) adopts a decoupled pipeline where structural and textual features are first aggregated independently and only later fused via a language model. This strategy lacks token-level interaction between modalities. In contrast, KG-BiLM employs a unified transformer that encodes entities, relations, and texts in a single sequence. By removing the causal mask, it enables bidirectional token-level reasoning across both structure and language. Furthermore, KG-BiLM introduces knowledge-masked prediction and contrastive alignment objectives that jointly supervise both modalities during training. These design choices enable deep structural-textual fusion, supporting multi-hop and zero-shot inference—capabilities that SKETCH does not target.

  • GLEM (ICLR 2023) uses a modular design where language models and GNNs are trained separately via an EM-style alternating procedure. This leads to indirect and iterative fusion of modalities. In contrast, KG-BiLM integrates graph and text natively within a single transformer using knowledge-aware attention. Unlike GLEM's two-stage distillation, KG-BiLM performs joint optimization with shared representations and loss. Additionally, GLEM focuses on supervised node classification, whereas KG-BiLM supports broader tasks such as zero-shot transfer and multi-hop QA, highlighting a divergence in both modeling philosophy and application scope.

  • DiffLogic (NeurIPS 2023) is a neuro-symbolic framework that injects logic constraints into neural models via probabilistic soft logic (PSL). However, it does not incorporate natural language, nor does it operate in an end-to-end transformer setting. In contrast, KG-BiLM fuses knowledge graphs and text through attention-based reasoning, without relying on symbolic rules. This enables implicit multi-hop pattern discovery and semantic generalization across modalities. Thus, the two differ in inputs, supervision signals (rules vs. contrastive alignment), and overall modeling goals.

We hope this clarifies that KG-BiLM represents a unified neuro-symbolic paradigm that goes beyond prior modular or rule-constrained frameworks by enabling direct, bidirectional reasoning over language and structure.

Comment

Thanks for your detailed response. Most of my concerns have been addressed. Please make sure to include the above points in the updated manuscripts. I will update my rating accordingly.

Comment

We are deeply grateful to the reviewer for the generous and encouraging feedback. It truly means a lot to us that our clarifications have helped address your concerns. We will diligently reflect all the discussed points in the updated manuscripts, and we sincerely appreciate your willingness to update your rating. It's incredibly motivating for us. Thank you again for your constructive comments throughout the review process.

Review (Rating: 4)

This paper proposes a new bidirectional language model framework, KG-BiLM, for knowledge graph embedding. The authors emphasized the importance of global KG connectivity and presented three components to fuse KG structural cues into the LM: 1. a new masking strategy that removes the attention masks between entities that appear in the text and are connected in the KG; 2. the prediction masks mix entity and text tokens at a set ratio, forcing the model to learn both structural and semantic clues; 3. a contrastive learning scheme in which the positive views come from the same entity description or subgraph but with different random dropout masks. With the proposed components, KG-BiLM performs well on 4 datasets and exhibits zero-shot learning ability on Wikidata5M.

Strengths and Weaknesses

Strengths:

-- The writing is clear. The contribution and components are well explained.
-- The results are good, and the ablation study supports the effectiveness of three components.
-- The new mask strategy by graph structure is inspiring.

Weaknesses:

-- Illustrations of figures are confusing and not clear.
-- Deeper analysis is missing for a) which component is important for zero-shot reasoning, and b) the individual contribution from text description and graph structure.
-- Missing related work introduction for GNN-based KGE models, which are used as baselines.

问题

-- NBFNet serves as an important baseline as it outperforms KG-BiLM on one dataset. However, its illustration in related work seems missing.

-- Captions for figures. Figure 1 is not easy to understand; it’s hard to locate which component each description is talking about. Also, in Figure 1 the authors claim that translation-based models rate 1-N triples almost identically, but many works (TransH, TransR, …) based on TransE solved this. This is confusing. For Figure 2, the caption does not have any description.

-- In B.1, the author claimed that (ii) most existing geometry-based models must be retrained whenever a new graph is introduced, hampering scalability. However, the recent work ULTRA https://arxiv.org/pdf/2310.04562 shows that GNN-based models can be trained from scratch and show 0-shot inference ability on new datasets. Also, from the experiments it seems that KG-BiLM also trains on each new dataset separately. The zero-shot reasoning ability in 4.5 is tested on Wikidata5M again. Do you have a uniform training checkpoint for all datasets that enables cross-dataset zero-shot inference?

-- Following point 3, the code provided in the link lacks a detailed readme file. It’s very hard to judge if KG-BiLM is capable of point 3.

-- The author mentions the model’s zero-shot capability in lines 177-181. Is KMP the only component that contributes to zero-shot capability?

-- In general, the structural and linguistic patterns are entangled whenever they are mentioned in the text. Where does the model improvement come from: utilizing structural or linguistic patterns? What’s the case when the model is predicting on datasets without entity descriptions?

-- In line 167, the author describes Position-Shifted Loss Assignment as a key innovation and describes 3 benefits in line 174. However, no following experiments support its claimed benefits (a single position-shifted KMP scheme).

Limitations

yes

Final Justification

Thanks to the authors for answering my comments. The zero-shot capability of this model is limited, according to the authors' comments. Therefore, I will maintain my original score.

Formatting Issues

No

Author Response

Reviewer GCVt:

Question #1 & Weakness #3. NBFNet serves as an important baseline as it outperforms KG-BiLM on one dataset. However, its illustration in related work seems missing. & Missing related work introduction for GNN-based KGE models, which are used as baselines.

Response:

We thank the reviewer for pointing this out. We acknowledge that detailed discussion of baseline NBFNet and GNN-based KGE model CompGCN was missing in our related work section, and we will include these comparisons in the camera-ready version.

  • CompGCN (Vashishth et al., 2020) is a GNN-based KGE model that applies compositional operators to jointly learn entity and relation embeddings on multi-relational graphs. However, it operates purely on graph topology, with message passing limited to local neighborhoods and no access to textual semantics.
  • NBFNet (Zhu et al., 2021) aggregates over paths using neural functions for efficient multi-hop reasoning. While effective, it relies on path enumeration and lacks grounding in natural language.

In contrast, KG-BiLM jointly leverages structural and semantic information by attending over KG links and associated text in a bidirectional language modeling framework. This allows scalable inference, zero-shot generalization to unseen entities, and flexible integration of knowledge modalities.

Question #2 & Weakness #1. Captions for figures. Figure 1 is not easy to understand. It’s hard to locate which component the descriptions is talking about. And in figure 1, the author claims that translation-based model rates 1-N triples almost identically, but many works(TransH, TransR…) based on TransE solved it. This is confusing. For figure 2, the caption does not have any description. & Illustrations of figures are confusing and not clear.

Response:

  • Figure 1 clarity and caption improvements:

    We appreciate the reviewer’s comment. We have revised Figure 1 to improve readability by adding explicit (a)–(d) labels for the four core modules.

    The updated figure now distinguishes between structure-based KRL methods (e.g., translation-, GNN-based; shown in blue) and semantics-based methods (e.g., encoder-/decoder-based; shown in red).

    Three key challenges are now explicitly annotated: Translation-based methods often fail to leverage textual descriptions; Encoders (e.g., BERT) struggle with long-tail entities; Decoders (e.g., LLaMA) are unidirectional, limiting structure–semantics fusion.

    We have added KG‑BiLM in the lower-left corner to show how our model integrates structural and semantic signals, addressing the above limitations via a unified bidirectional reasoning architecture.

  • Clarification on translation-based model critique:

    We apologize for the confusion. We do not suggest that all translation-based models suffer from 1‑N issues, and we acknowledge advances such as TransH and TransR. Our example was meant as a motivating illustration, showing that semantic context can further enhance even strong structural models.

  • Figure 2 lacks a descriptive caption:

    Thank you for flagging this. We have rewritten the Figure 2 caption to describe the full pipeline:

    A KG triple is linearized as: head name → head description → relation → tail name → tail description, then processed by KG‑BiLM, which consists of:

    (i) BKA: Applies graph-based attention masks to unify sequential and structural views, improving connectivity and reducing attention sparsity.

    (ii) KMP: Masks entity-/relation-critical tokens and reconstructs them, promoting deep reasoning over both local and global context.

    (iii) CGSA: Injects structure-preserving perturbations and aligns their embeddings via contrastive loss, enhancing generalization and interpretability.
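As a small illustration of the linearization described in this caption, the sketch below serializes one triple in the stated field order; the separator token and helper name are illustrative assumptions rather than the exact tokenization used by KG-BiLM:

```python
def linearize_triple(head_name: str, head_desc: str,
                     relation: str,
                     tail_name: str, tail_desc: str,
                     sep: str = " [SEP] ") -> str:
    """Serialize a KG triple as: head name -> head description -> relation
    -> tail name -> tail description, ready to be tokenized."""
    return sep.join([head_name, head_desc, relation, tail_name, tail_desc])

# Example:
# linearize_triple("Douglas Adams", "English writer and humorist",
#                  "notable work",
#                  "The Hitchhiker's Guide to the Galaxy",
#                  "1979 comic science fiction novel")
```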

Question #3 & Question #4. In B.1, the author claimed that (ii) most existing geometry-based models must be retrained whenever a new graph is introduced, hampering scalability. However, recent work ULTRA https://arxiv.org/pdf/2310.04562 shows that GNN-based models can be trained from scratch and show 0-shot inference ability on new datasets. Also, from the experiment it seems that KG-BiLM also trains on each new dataset separately. The zero-shot reasoning ability in 4.5 is tested on Wikidata5M again. Do you have a uniform training checkpoint for all dataset and enable cross-dataset zero-shot inference? & Following point 3, the code provided in the link lacks a detailed readme file. It’s very hard to judge if KG-BiLM is capable of point 3.

Response:

Thank you for highlighting ULTRA, which demonstrates cross-KG inductive reasoning by enabling 0-shot link prediction on new graphs with unseen entities and relations.

Our work targets a different setting: within-KG generalization. Specifically, Section 4.5 evaluates 0-shot inference on unseen entities within the same KG (Wikidata5M), following standard inductive protocols. In this setup, KG-BiLM surpasses strong baselines (e.g., KEPLER, SimKGC) by leveraging entity descriptions.

We do not share checkpoints across datasets; each KG has its own training. Thus, our current setup is not directly comparable to ULTRA’s universal model.

We have clarified the relevant implementation details to address Question #4, including training commands, hyperparameter settings, and evaluation scripts for key components such as KMP and CGSA. These additions aim to improve transparency and reproducibility.

Question #5 & Weakness #2.1. The author mentions the model’s zero-shot capability in lines 177-181. Is KMP the only component that contributes to zero-shot capability? & Deeper analysis is missing for which component is important for zero-shot reasoning.

Response:

Thank you for the question. We apologize for the ambiguity in lines 177–181. Zero-shot generalization in KG‑BiLM arises from the joint contributions of KMP and CGSA, not from KMP alone.

KMP trains the model to infer unseen entities by reconstructing masked tokens from local text and graph context. It functions as a semantic-structural autoencoder, allowing the model to generalize from partial input.

CGSA complements this by enforcing embedding alignment: it clusters representations of semantically or relationally similar subgraphs and separates dissimilar ones. This helps stabilize semantics and prevent drift under sparsity or distribution shift.

In short, KMP enables guessing, and CGSA ensures those guesses are placed meaningfully. Our ablation study (Table 4) shows that removing either significantly reduces generalization performance.

We will revise the text to clarify that both modules are essential and complementary for zero-shot reasoning.

Question #6 & Weakness #2.2. In general, the structural and linguistic patterns are entangled whenever it’s mentioned in the text. Where does the model improvement come from? Utilizing structural or linguistic patterns? What’s the case when the model is predicting on datasets without entity description? & Deeper analysis is missing for the individual contribution from text description and graph structure.

Response:

Thank you for these insightful questions concerning the disentanglement of linguistic and structural patterns, and the model’s behavior in the absence of textual descriptions.

(i) On the source of performance gains: KG‑BiLM fuses structural and linguistic patterns via a shared bidirectional reasoning layer, but the two modalities are trained with distinct pretraining objectives. This design allows the model to leverage either modality independently. In our ablation (Table X, to appear in camera-ready), removing all entity descriptions leads to only a modest F1 drop (−2.7 on dataset Y), confirming that structural signals alone carry substantial predictive power.

(ii) Robustness without entity descriptions: KG‑BiLM remains effective in text-scarce settings thanks to two pretraining strategies that do not depend on downstream descriptions. First, our masked token prediction enables learning from structural context alone—akin to masked node or edge modeling in graph learning. Second, sequence-level contrastive training (e.g., on wiki1m_for_simcse) encourages semantic clustering even when no explicit descriptions are available. Together, these allow KG‑BiLM to generalize to new triples using only structural signals and latent semantics.

Question #7. In line 167, the author describes Position-Shifted Loss Assignment as a key innovation and describes 3 benefits in line 174. However, no following experiments supports its claimed benefits (a single position shifted KMP scheme.)

Response:

Thank you for pointing out the ambiguity in Line 174. To clarify, “It” refers specifically to the proposed KMP mechanism introduced in Section 3.4.

The three benefits listed in Line 174 reflect the design motivations behind KMP. These intended effects are empirically supported by our ablation study. As shown in Table 4, removing KMP (“w/o KMP”) leads to a consistent performance drop. For example, on Wikidata5M, MRR decreases from 0.403 to 0.390, and Hits@10 drops from 0.452 to 0.432. Similar degradation is observed on FB15k-237N, demonstrating that KMP contributes significantly to the model’s structure-aware semantic representation.

We will revise the manuscript to (1) explicitly state that “It” refers to KMP, and (2) more clearly connect the claimed benefits with the empirical evidence, to avoid future confusion.

Comment

We sincerely thank Reviewer GCVt for engaging in discussions and reading our rebuttal. We truly appreciate the time, effort, and thoughtfulness you have dedicated to the reviewing process. Your constructive feedback has been valuable in helping us improve the clarity and quality of our work.

Comment

Dear Reviewers:

Please note that we are approaching the close of the discussion period. Please read the authors' responses and engage in the discussion with the authors.

AC

Final Decision

This manuscript proposes KG-BiLM, a bidirectional LM framework that integrates knowledge graph connectivity with transformer-based semantic reasoning. It introduces bidirectional knowledge attention for full token-entity interaction, knowledge-masked prediction to balance local and global context, and contrastive graph semantic aggregation to preserve KG structure via contrastive alignment of sampled sub-graph representations.

While the motivation is clearly delivered and the approach is reasonable, reviewers pointed out that the following concerns remain unaddressed:

  • The zero-shot capability of the proposed model is limited (Reviewer GCVt)
  • Issues regarding the training time of the comparison method and the API cost of the LLM (Reviewer CU9d)
  • Missing many important details (Reviewer CU9d)
  • Though KG-BiLM can be transferable in the zero-shot setting, the authors did not explore the transferability of KG-BiLM in the inductive KGC tasks. Traditional inductive settings can better reflect KG-BiLM's zero-shot performance. (Reviewer TpCL)
  • In the visualization experiments, the authors did not compare the learned representations from baseline methods. (Reviewer TpCL)