PaperHub
Overall rating: 5.5/10
Poster · 3 reviewers
Scores: 4, 3, 2 (min 2, max 4, std 0.8)
ICML 2025

Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose a novel framework called graph-constrained reasoning that eliminates hallucinations and ensures faithful reasoning by integrating the KG structure into the LLM decoding process.

Abstract

Keywords
Large Language Model · Knowledge Graph

Reviews and Discussion

Review
Rating: 4

This paper proposes the GCR framework, in which a trie-based index leverages structured knowledge from KGs to address knowledge gaps and hallucinations in LLM reasoning. GCR employs a lightweight KG-specialized LLM for graph-constrained reasoning and a general LLM for inductive reasoning. Experimental results on KGQA benchmarks show that GCR achieves state-of-the-art performance and demonstrates strong zero-shot generalizability to unseen KGs without additional training.

Update after rebuttal:

Thank the authors for their detailed responses. The clarifications provided have addressed my initial concerns regarding the method's efficiency and its handling of special cases. As a result, I see no need to revise my original assessment score and will maintain it as is.

Questions for the Authors

How does your method adapt when real-world KGs have missing entities or questionable reasoning paths?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

The motivation is clear and the proposed method is technically sound.

Theoretical Claims

N/A

Experimental Design and Analysis

Yes, extensive experiments have verified the effectiveness of the proposed method.

Supplementary Material

Yes, I reviewed Appendix B.

Relation to Existing Literature

Knowledge gaps and hallucinations are highly important issues in the current LLM area. The proposed faithful graph-constrained reasoning may offer the community a new technical perspective.

Missing Essential References

I do not have suggestions for additional references.

Other Strengths and Weaknesses

Please check the above suggestions.

Other Comments or Suggestions

In my opinion, the efficiency of this approach is vital, and I am curious about the practical time costs of the KG-Trie construction stages on real-world KGs.

Author Response

We sincerely appreciate your positive and constructive review of our paper. Your feedback is invaluable in helping us refine and clarify our work. Below, we address your comments and questions in detail.

Efficiency and Practical Time Costs of KG-Trie Construction

We appreciate your emphasis on the efficiency of the KG-Trie construction process. In our current implementation, we optimize KG-Trie construction by leveraging efficient graph traversal techniques and parallelized batch processing. To provide a concrete perspective, we conducted additional experiments measuring the construction time on a real-world KG (Freebase) in Appendix B.2, where the average time for KG-Trie construction is only 0.0133s.

We also present further optimizations to handle industrial-scale KGs by incorporating graph retrieval techniques to reduce the size of the KG used for Trie construction (Appendix B.3 and Figure 6). The breakdown of time consumption in Table 9 also shows the efficiency of KG-Trie construction when combined with graph retrieval techniques (0.2838s).

Future work could explore better trie structures or hierarchical indexing to improve efficiency on industrial-scale KGs.
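For intuition, a KG-Trie can be thought of as a nested dictionary keyed by token IDs. The following is a minimal sketch of that idea, not our actual implementation; the path surface format and the `paths`/`tokenizer` inputs are assumptions for illustration.

```python
# Minimal sketch of KG-Trie construction (illustrative, not the paper's code).
# Assumes `paths` is a list of KG paths, each a list of (head, relation, tail)
# triples, and `tokenizer` is the KG-specialized LLM's tokenizer.

def format_path(path):
    # Hypothetical surface form: "Joe Biden -> born_in -> Scranton -> city_of -> USA"
    parts = [path[0][0]]
    for _head, rel, tail in path:
        parts += [rel, tail]
    return " -> ".join(parts)

def build_kg_trie(paths, tokenizer):
    trie = {}
    for path in paths:
        node = trie
        for tok in tokenizer.encode(format_path(path)):
            node = node.setdefault(tok, {})
        node[None] = True  # end-of-path marker
    return trie
```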

Adaptability to Missing Entities and Questionable Reasoning Paths

Thank you for raising this important concern. Our method is designed to be robust to incomplete KGs in two ways:

Exploring Alternative Paths
When entities are missing, GCR can explore alternative paths to complete the reasoning. For example, for the question “What is the nationality of Joe Biden?”, one reasoning path could be:
Joe Biden -> born_in -> Scranton -> city_of -> USA.
If the entity Scranton is missing, disconnecting this path, GCR can explore other paths such as:
Joe Biden -> profession -> President -> work_in -> Washington D.C. -> city_of -> USA.
This path can also be used to deduce the correct answer: USA. In our implementation, we explore the top-10 paths for comprehensive reasoning.
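To make this concrete, here is a toy illustration on a hypothetical mini-KG using networkx. This is illustrative only: GCR itself generates such paths with a constrained LLM rather than explicit graph search, but the effect is similar.

```python
# Illustration only (hypothetical mini-KG): enumerating alternative
# reasoning paths between the question entity and a candidate answer.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Joe Biden", "Scranton", relation="born_in")
kg.add_edge("Scranton", "USA", relation="city_of")
kg.add_edge("Joe Biden", "President", relation="profession")
kg.add_edge("President", "Washington D.C.", relation="work_in")
kg.add_edge("Washington D.C.", "USA", relation="city_of")

# If Scranton were missing (kg.remove_node("Scranton")), the 3-hop
# profession path would still connect the question entity to the answer.
for p in list(nx.all_simple_paths(kg, "Joe Biden", "USA", cutoff=3))[:10]:
    print([(u, kg.edges[u, v]["relation"], v) for u, v in zip(p, p[1:])])
```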

Combining Internal Knowledge of Powerful General LLMs
Due to the noise and incompleteness of KGs, the generated reasoning paths might be unsatisfactory. To address this, in graph inductive reasoning (Section 4.4), we employ a powerful general LLM (e.g., GPT-4o-mini) to incorporate diverse paths for deliberate reasoning.

The powerful general LLM has extensive internal knowledge and strong inductive reasoning ability, so it can disregard questionable paths and deduce the final answers accurately.

Reviewer Comment

I appreciate the authors' thorough response, which has addressed several of the concerns I raised.

Regarding the adaptability to missing entities, the provided examples in the response utilize path lengths of 4-6 hops. However, it is also claimed in the Appendix that "When the hops are set to 3 or 4, the performance drops due to the increased complexity of the reasoning paths, which may introduce noise and make the reasoning less reliable."

Does the existing architecture support case-specific path exploration with variable-length parameters, and if so, through what mechanisms? If not, how would the authors address this issue in the future?

Author Comment

We are glad to hear that your concerns have been addressed, and we appreciate your insightful follow-up questions.

First, we would like to clarify that the provided path examples have hop lengths of 2 and 3, respectively. The listed relations—born_in, city_of, profession, work_in, city_of—are the individual relations along the paths; each (head, relation, tail) triple counts as one hop, rather than each path element counting separately. The reasoning steps for the two examples are as follows:

  • Example 1:

    • Hop 1: (Joe Biden, born_in, Scranton)

    • Hop 2: (Scranton, city_of, USA)

  • Example 2:

    • Hop 1: (Joe Biden, profession, President)

    • Hop 2: (President, work_in, Washington D.C.)

    • Hop 3: (Washington D.C., city_of, USA)

Regarding case-specific path length exploration, our current implementation does not support dynamic path length adjustments. However, this is an exciting direction for future work. One potential approach would be to predict the complexity (or hop distance) of a query and dynamically adjust the path length parameters accordingly. This could involve techniques such as adaptive path selection, uncertainty estimation, or reinforcement learning-based exploration to optimize reasoning reliability.

Review
Rating: 3

The paper introduces Graph-Constrained Reasoning (GCR), a novel framework to address hallucination and knowledge gaps in large language models (LLMs) when reasoning over knowledge graphs (KGs). GCR bridges structured KG knowledge with unstructured LLM reasoning by constructing a KG-Trie, a trie-based index that encodes valid reasoning paths from the KG. During decoding, the KG-Trie constrains the LLM’s output to generate only KG-grounded paths, ensuring faithfulness. GCR employs a two-stage process: (1) a lightweight KG-specialized LLM generates multiple hypothesis answers and constrained reasoning paths via graph-constrained decoding, and (2) a powerful general LLM performs inductive reasoning over these paths to produce final answers. Experiments on KGQA benchmarks (WebQSP, CWQ) show state-of-the-art performance with zero hallucination and strong zero-shot generalizability to unseen KGs (e.g., ConceptNet, medical KGs). Key results include 92.6% Hit@1 on WebQSP and 75.8% on CWQ, outperforming prior methods like RoG and ToG. The framework reduces latency by avoiding iterative KG traversal and leverages parallel decoding for efficiency.

Questions for the Authors

Could longer reasoning paths (L > 3) improve performance on complex questions, and what are the trade-offs?

Claims and Evidence

  1. GCR eliminates hallucinations by constraining decoding to KG-grounded paths. Evaluations show 100% faithful reasoning ratio (Table 5, Figure 5), with all generated paths verifiable in the KG. Ablation studies confirm that removing KG constraints leads to hallucinated paths (e.g., Case 1 in Table 5).

  2. GCR achieves SOTA performance on KGQA tasks. Results on WebQSP (92.6% Hit@1) and CWQ (75.8% Hit@1) surpass retrieval-based (RoG: 85.7%) and agent-based (ToG: 68.5%) methods (Table 1).

  3. GCR generalizes to unseen KGs without fine-tuning. Zero-shot tests on FreebaseQA (94% accuracy) and CSQA (91%) demonstrate adaptability (Table 6). Performance drops on MedQA (79%) are attributed to domain specificity.

Methods and Evaluation Criteria

  • Methods:

    • KG-Trie efficiently encodes KG paths as token sequences, enabling prefix-based constrained decoding.
    • Graph-constrained decoding uses a fine-tuned LLM (e.g., Llama-3-8B) to generate hypothesis answers and paths under Trie constraints.
    • Inductive reasoning with a general LLM (e.g., GPT-4) aggregates multiple paths for final answers.
  • Evaluation Criteria:

    • Datasets: Standard KGQA benchmarks (WebQSP, CWQ) and zero-shot tests on FreebaseQA, CSQA, MedQA.
    • Metrics: Hit@1, F1, accuracy. Faithfulness is measured via path grounding in KGs.
    • Baselines: Include LLM-only (ChatGPT+CoT), graph-based (ReaRev), and KG-enhanced methods (RoG, ToG).

Appropriate for assessing faithfulness, efficiency, and generalization.

Theoretical Claims

The paper lacks formal theoretical proofs but provides algorithmic formulations.

Experimental Design and Analysis

  • Ablations confirm the necessity of both KG-specialized and general LLMs (Table 3).
  • Beam size (K=10) and path hops (L=2) are optimized (Figure 4, Appendix F.1).
  • GCR achieves 3.6s runtime vs. 16.14s for ToG (Table 2), leveraging parallel decoding.
  • Zero-Shot Generalization: Tests on medical/commonsense KGs highlight domain limitations.
  • Soundness: Rigorous and reproducible.

Supplementary Material

No.

Relation to Existing Literature

N/A

Missing Essential References

N/A

Other Strengths and Weaknesses

  • Strengths:
    • Novel integration of KG structure into LLM decoding.
    • Practical efficiency (parallel decoding, pre-computable KG-Trie).
    • Strong empirical results and reproducibility.
  • Weaknesses:
    • Scalability of KG-Trie construction for billion-edge KGs.
    • Limited analysis of path diversity vs. answer correctness.

Other Comments or Suggestions

N/A

Author Response

We sincerely appreciate your positive and insightful review of our paper. Below, we address your comments and concerns point by point.

R1. Scalability of KG-Trie construction for billion-edge KGs.

Scalability is indeed a vital concern, especially for billion-edge KGs. In our current implementation, we optimize KG-Trie construction by leveraging efficient graph traversal techniques and parallelized batch processing. Detailed analysis can be found in Appendix B.2, where the average running time for KG-Trie construction is 0.0133s.

We also present further optimizations to handle industrial-scale KGs by incorporating graph retrieval techniques to reduce the size of the KG used for Trie construction (Appendix B.3 and Figure 6). The breakdown of time consumption in Table 9 also shows its efficiency when combined with graph retrieval techniques (0.2838s).

Future work could explore better trie structures or hierarchical indexing to improve efficiency on industrial-scale KGs.

R2. Analysis of Path Diversity vs. Answer Correctness

We appreciate this suggestion and agree that path diversity plays a crucial role in reasoning quality. GCR first adopts beam search to generate the top-K paths with the KG-specialized LLM (Sec. 4.3); path diversity therefore increases with K. Then, we conduct graph inductive reasoning with a powerful general LLM to incorporate the diverse paths for deliberate reasoning (Sec. 4.4).

As shown in Figure 4, we analyze the impact of different beam sizes K for graph-constrained decoding on the performance of GCR. We observe that the hit and recall of GCR increase with the beam size, because a larger beam size lets the LLM explore more diverse paths and find the correct answers. However, a larger K also increases decoding time, and overly diverse paths might introduce noise. Thus, we set K to 10 to balance exploration and exploitation of the reasoning paths.
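As a rough sketch of how such top-K constrained decoding can be realized: the code below uses HuggingFace's `prefix_allowed_tokens_fn` hook with a nested-dict KG-Trie (`kg_trie`, as in the earlier sketch). The model name, prompt format, and trie-walking helper are assumptions for illustration, not our exact code.

```python
# Sketch: top-K path generation via beam search under a trie constraint.
# `kg_trie` (nested dict keyed by token IDs) is assumed to exist already.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder for the fine-tuned model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def walk_trie(trie, token_ids):
    """Return the token IDs that can extend the current valid path prefix."""
    node = trie
    for tok in token_ids:
        node = node.get(tok)
        if node is None:
            return []
    # A full implementation would map the end-of-path marker to the EOS token.
    return [t for t in node if t is not None]

prompt = "Question: what is the nationality of Joe Biden?\nReasoning path:"
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

def allowed_tokens(batch_id, input_ids):
    # Only constrain the tokens generated after the prompt.
    return walk_trie(kg_trie, input_ids[prompt_len:].tolist())

outputs = model.generate(
    **inputs,
    num_beams=10,              # beam size K = 10, as chosen above
    num_return_sequences=10,   # keep all K candidate paths
    prefix_allowed_tokens_fn=allowed_tokens,
    max_new_tokens=64,
)
paths = tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)
```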

R3. Impact of Longer Reasoning Paths (L > 3) on Performance

We agree that extending the path length could potentially enhance performance on complex multi-hop queries. We evaluate the performance of GCR with L up to 4 in Section F.1. The results in Figure 7 show that performance on WebQSP slightly drops when L > 2. This could be because questions in WebQSP require at most 2 hops to solve [1].

Admittedly, a larger L could improve performance on complex questions that require longer-hop reasoning, but the trade-off is the size of the KG-Trie. As shown in Figure 7 in Appendix F.1, the KG-Trie grows from 0.5 MB to 7.5 MB as L increases from 1 to 4, which adds storage and reasoning overhead.

[1] Luo, Linhao, et al. "Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning." ICLR 2024.

Review
Rating: 2

This paper introduces a new approach called Graph-Constrained Reasoning (GCR), which integrates the structured reasoning capabilities of a KG-specialized LLM with the general reasoning abilities of a general-purpose LLM. GCR uses a KG-Trie to encode potential KG reasoning paths and constrain the KG-specialized LLM's decoding process, ensuring each step is grounded in the KG. The method outperforms previous methods on KGQA datasets and shows generalization ability across other KGQA benchmarks.

Questions for the Authors

N/A

Claims and Evidence

N/A

Methods and Evaluation Criteria

N/A

Theoretical Claims

N/A

Experimental Design and Analysis

N/A

Supplementary Material

N/A

Relation to Existing Literature

N/A

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths

The motivation is clear and addresses the hallucination problem in KGQA tasks, while the method is technically sound and shows strong empirical performance.

The conversion of the KG into a KG-Trie and the introduction of graph-constrained decoding are practical, as they add little time and space complexity to the reasoning process.

Weaknesses

Does the GCR method effectively handle filtering conditions, such as the query "Which state in the US has the largest population?" Or is it limited to relying solely on the knowledge of the powerful general LLM for such cases?

The main results show a significantly lower F1 score compared to Hit@1. In KGQA tasks, if the correct reasoning path is predicted, all answers are contained in the KG-Trie and should all be retrieved. Therefore, the F1 score should not be much lower than Hit@1. This discrepancy indicates that the KG-specialized LLM may be missing answers in the KG due to its inherent knowledge limitations.

In Section 5.3, the comparison between GCR and RoG only addresses the hallucination issue of whether the predicted paths exist in the KG, which is necessary but insufficient. A valid path should not only be present in the KG but also exhibit logical coherence. Would it be possible to include an evaluation of the quality of the paths generated by the KG-specialized LLM, assessing whether these paths truly represent logical reasoning processes for the given query?

Other Comments or Suggestions

As shown in Case 2 of Table 5, the answer generated by GCR’s reasoning path differs from the final answer, revealing that the KG-specialized LLM relies on its internal knowledge to answer KG-related questions. This creates conflicts between its internal knowledge and the KG’s knowledge, leading to inconsistencies or omissions in answers. These conflicts introduce new errors, which may affect the method’s overall performance.

As mentioned in Section 4.2, GCR employs the shortest paths found by BFS for training. Are these paths always logically correct? There are cases where the training paths may exhibit semantic or logical inconsistencies, as in the example below. To address this, could the authors compare these paths with the relationships specified in the original query to verify their accuracy and relevance?

Query: Who is the brother of A's grandfather?
Shortest path (2 hops): A -> gender -> male -> gender -> Answer
Reasonable path (3 hops): A -> father -> C -> father -> D -> brother -> Answer
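A small hypothetical reproduction of this concern with networkx (illustrative only, on a made-up graph):

```python
# BFS shortest paths can be semantically wrong even when a longer,
# logically correct path exists.
import networkx as nx

kg = nx.Graph()  # undirected, so BFS can traverse relations in either direction
kg.add_edge("A", "male", relation="gender")
kg.add_edge("Answer", "male", relation="gender")
kg.add_edge("A", "C", relation="father")
kg.add_edge("C", "D", relation="father")
kg.add_edge("D", "Answer", relation="brother")

# BFS returns the 2-hop path through the shared 'gender' node,
# not the logically correct 3-hop kinship path.
print(nx.shortest_path(kg, "A", "Answer"))  # ['A', 'male', 'Answer']
```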

The GCR method uses a small LLM to select paths within the KG. However, since LLMs are inherently language models, the decoding process can be viewed as identifying semantically related edges, a task that may not align with the core capabilities of LLMs. Is the use of an LLM truly necessary for this purpose?

Is the direction of edges within the KG considered in the GCR method? For example, given the query “What is the continent of the USA?” and the KG facts 1) USA -> contain -> New York and 2) USA <- contain <- North America, how does GCR handle the query? Will both New York and North America be contained in the output of the KG-specialized LLM?

The tokenization strategy may affect the GCR decoding process; perhaps a word-level strategy would be more reasonable?

Ethics Review Concerns

N/A

Author Response

We sincerely appreciate your time and effort in reviewing our submission. Below, we address your specific concerns:

R1. Handling Filtering Conditions in Queries

GCR can solve this type of query by combining the power of both the KG and a powerful general LLM. GCR first constrains the reasoning process within the KG to obtain the relevant knowledge, such as the populations of all the states in the US. Then, GCR leverages the general-purpose LLM to conduct inductive reasoning over this knowledge to solve queries that require numerical comparison.

R2. Discrepancy Between F1 Score and Hit@1

In experiments, we select the top-K paths generated by the KG-specialized LLM for answer generation, which might miss some answers and cause a discrepancy between F1 and Hit@1. The results in Fig. 4 show that answer recall increases with K, which reduces the discrepancy.

R3. Logical Coherence in KG Reasoning Paths

Thanks for bringing up this interesting point. Given the lack of ground truth and the large number of paths, we utilize LLMs to evaluate the logical coherence and semantic meaning of the generated paths. The prompt and evaluation results are shown below:

As an advanced reasoning evaluator, your task is to analyze whether the following reasoning path presents a **logically coherent connection from the question (subject entity) to the answer (target entity)**. You will assess whether each step in the path is valid and necessary, and whether the overall reasoning supports the final answer in a grounded and justified manner.

### Instructions:
1. Focus on whether the reasoning path makes logical sense from the question to the answer.
2. Check whether each relation contributes meaningfully and validly to reaching the final answer.
3. Penalize paths that make unjustified jumps, overly general connections, or weak associations.

### Rating Scale:
5 - Excellent: Every step is logically valid and contributes clearly toward the answer.  
4 - Good: Mostly coherent with minor assumptions or weak steps.  
3 - Moderate: Some steps are unclear, general, or weak, but the general direction is acceptable.  
2 - Poor: Contains major logical leaps or unclear connections.  
1 - Very Poor: Illogical or invalid path from question to answer.

### Output:
- Score: [1 to 5]  
- Explanation: [Brief explanation of the logical quality of the path from question to answer]

### Question:
{question}
### Answer:
{answer}
### Path:
{path}
Method | Score
GCR | 3.9

The results show that GCR achieves an average evaluation score of 3.9, which demonstrates the logical coherence of the generated paths. Moreover, the LLM-based evaluation can further be used to select meaningful paths for training. Some representative cases can be found in Table 5.
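For reference, here is a hypothetical sketch of how this evaluation can be run programmatically; the OpenAI client usage, the model name, and the "Score:" parsing convention are assumptions for illustration.

```python
# Sketch: scoring a reasoning path with the evaluation prompt shown above.
import re
from openai import OpenAI

client = OpenAI()

def score_path(question: str, answer: str, path: str, prompt_template: str):
    # Fill the {question}/{answer}/{path} slots of the evaluation prompt.
    prompt = prompt_template.format(question=question, answer=answer, path=path)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    # Parse the "Score: [1 to 5]" line from the model output.
    match = re.search(r"Score:\s*([1-5])", text)
    return int(match.group(1)) if match else None
```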

R4. Conflicts Between KG-Specialized LLM and KG Knowledge

We want to clarify that this difference only happens when no constraints are applied (GCR w/o constraint in Table 5), where LLMs purely rely on their internal knowledge for reasoning, which leads to hallucinations. This can be addressed by applying the KG-Trie constraints during reasoning (GCR) to ensure faithful and transparent LLM reasoning, which aligns with the motivation of our method.

R5. Soundness of BFS Shortest Paths in Training

We acknowledge that the shortest paths found via BFS may not always be the most semantically meaningful reasoning paths. However, in the absence of gold reasoning paths, BFS offers a practical, unsupervised way to obtain paths for training. We will enhance path selection by incorporating LLMs to judge path semantics (as discussed in R3), ensuring that training paths adhere to the logical coherence of the KG's relationships.

R6. Necessity of LLM for Path Selection

As discussed in Sec. 4.1, LLMs conduct reasoning by decoding reasoning steps one by one, relying purely on their internal knowledge, which leads to hallucinations. GCR adapts this reasoning behavior to KGs and decodes reasoning paths that are valid on the KG. This aligns well with the reasoning paradigm of LLMs and is faster than other methods that adopt LLMs for path selection (e.g., ToG in Table 2).

R7. Edge Direction

We consider the direction of relations during KG-Trie construction. Therefore, only USA -> is_contained_by -> North America will be generated for the continent query. Inverse relations can be added during KG construction, as sketched below.
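A minimal sketch of this augmentation; the inverse-relation naming convention is hypothetical.

```python
# Sketch: adding inverse triples during KG construction so relations
# become traversable in both directions.
def add_inverse_relations(triples):
    augmented = list(triples)
    for head, rel, tail in triples:
        augmented.append((tail, f"{rel}_inv", head))
    return augmented

# ("North America", "contain", "USA") also yields ("USA", "contain_inv",
# "North America"), so "What is the continent of the USA?" becomes answerable.
```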

R8. Word-level Tokenization

We adopt the token-level tokenizer of the original LLM to avoid retraining it on new KGs. The entities and relations in any KG can be tokenized for KG-Trie construction, which ensures the transferability of GCR to unseen KGs (Table 6).

Final Decision

This paper develops and evaluates a novel framework called Graph-Constrained Reasoning (GCR), which aims to eliminate hallucinations and ensure faithful reasoning by integrating the Knowledge Graph (KG) structure into the Large Language Model (LLM) decoding process. The GCR framework employs a two-tiered process involving a KG-specialized LLM for graph-constrained decoding and a powerful general LLM for inductive reasoning over multiple reasoning paths. This approach not only improves the reasoning capabilities of LLMs but also ensures that the reasoning paths are faithful to the KG, thus reducing the rate of hallucinations. The framework has demonstrated state-of-the-art performance on several knowledge graph question answering (KGQA) benchmarks and exhibits strong zero-shot generalizability to unseen KGs without additional training. The use of a KG-Trie to encode valid reasoning paths from the KG allows for efficient retrieval of knowledge and ensures that the reasoning process is grounded in the KG. The integration of the KG structure into the LLM decoding process is a practical and innovative approach to improving the performance of LLMs in reasoning tasks. The main concerns raised by the reviewers have been addressed by the authors in their responses.