Dependency Matters: Enhancing LLM Reasoning with Explicit Knowledge Grounding
We propose GRiD, a framework that grounds reasoning in a knowledge-enhanced dependency graph, ensures logical consistency, and improves reasoning accuracy across various benchmarks.
Abstract
Reviews and Discussion
This paper introduces GRiD (Grounded Reasoning in Dependency), a framework designed to improve large language model (LLM) reasoning by explicitly grounding reasoning steps in structured knowledge. The key innovation is representing reasoning as a graph with two types of nodes: knowledge extraction nodes and reasoning nodes, where each reasoning step explicitly depends on specific knowledge premises. The framework includes a lightweight verification module that validates reasoning steps against their dependencies.
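For concreteness, the two node types and the dependency structure described in this summary could be represented roughly as follows. This is a minimal illustrative sketch in Python; the class and field names are our own, not the paper's.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgeNode:
    """A knowledge-extraction step: a fact the model surfaces from its own parameters."""
    node_id: str   # e.g. "knowledge_1"
    content: str   # the extracted factual statement

@dataclass
class ReasoningNode:
    """A reasoning step that must cite the premises it depends on."""
    node_id: str                                          # e.g. "reason_1"
    content: str                                          # the inference drawn at this step
    depends_on: List[str] = field(default_factory=list)   # ids of knowledge/reasoning premises

@dataclass
class GRiDTrace:
    """A full trace: knowledge and reasoning nodes plus the final answer."""
    question: str
    nodes: List[object] = field(default_factory=list)
    answer: str = ""

    def premises(self, node: ReasoningNode) -> List[object]:
        """Resolve a reasoning node's dependency ids to the actual premise nodes."""
        index = {n.node_id: n for n in self.nodes}
        return [index[i] for i in node.depends_on if i in index]
```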
Strengths and Weaknesses
Strengths
- Novel and Intuitive Approach: The idea of explicitly grounding reasoning steps in knowledge and creating dependency graphs is well-motivated and addresses real problems in LLM reasoning (knowledge errors and dependency errors).
- Strong Empirical Results: Impressive improvements across multiple benchmarks
- Practical Utility: Framework works both for training and as a plug-and-play inference module, demonstrating versatility.
- Self-Sufficiency: Unlike methods requiring external knowledge bases, GRiD uses the model's internal knowledge.
Weaknesses
- While the dependency graph representation is intuitive, the core ideas (explicit knowledge extraction, step-wise verification) are relatively straightforward extensions of existing work.
- Limited analysis of when/why the method fails
- Highly dependent on the performance of LLMs
Questions
-
The GRiD framework begins by extracting relevant knowledge before reasoning. How does the method ensure that the generated knowledge is accurate and reliable? The paper mentions dependency verification—how effective is this verification in practice? If the verification process fails to identify incorrect knowledge, would the reasoning results still be reliable or potentially misleading?
-
The generation of relevant knowledge and the construction of the knowledge graph are influenced by prompts and the underlying model used (Section 3.1, Figure 3). To what extent is the process sensitive to prompt design and model variation? How robust is the approach across different models and prompting strategies?
-
The paper lacks ablation studies regarding the contribution of different components—specifically, the impact of extracting relevant knowledge versus the construction of the dependency graph. Which component contributes more significantly to the observed improvements in reasoning accuracy?
-
What is the computational complexity of the GRiD method, particularly regarding the dependency verification process? How does this impact runtime efficiency, especially in large-scale or real-time reasoning scenarios?
Limitations
yes
Final Justification
The authors addressed most of my concerns. I would like to accept it.
Formatting Issues
no
We appreciate Reviewer KJEB’s thoughtful evaluation and constructive feedback on our paper. Below, we address each of the identified weaknesses and provide clarifications and additional details to support our claims.
-
Q1: While the dependency graph representation is intuitive, the core ideas (explicit knowledge extraction, step-wise verification) are relatively straightforward extensions of existing work.
Response 1:
The motivation behind our GRiD method is to demonstrate that even relatively small LLMs have sufficient knowledge to answer certain questions. The key challenge is to effectively extract the relevant knowledge and present it as explicit context, while ensuring the correctness of dependencies between reasoning steps.
Our experimental results highlight that the combination of two core components—knowledge extraction and dependency verification—is crucial for enhancing the model's performance. The knowledge module provides adaptive knowledge to complement the reasoning context, while the step dependency verification module ensures that reasoning steps are correctly linked to the premise knowledge. This verification process aids in both data cleaning and runtime validation.
To achieve this, we adapt the GRiD method by placing the knowledge extraction step before each reasoning step that requires it, and explicitly presenting the related premise steps before the reasoning content. The adjusted verification mechanism in GRiD focuses on Dependency Satisfiability, Purpose Satisfiability, and Fact Satisfiability. This makes GRiD more adaptive and flexible, as it goes beyond simply extracting question-related information.
By using GRiD, we can effectively extract useful knowledge to assist reasoning, which we refer to as "squeezing" the model's internal knowledge. This also allows us to better identify the model's knowledge boundary, as discussed in Section 4.7 of the original paper. Overall, our experimental results show that the synergy of knowledge extraction and dependency verification significantly improves the model's reasoning capabilities.
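As a rough illustration of how the three checks named above (Dependency, Purpose, and Fact Satisfiability) could be applied to a single step, consider the sketch below. The prompt wording and the `call_verifier_llm` helper are placeholders of our own, not the paper's implementation.

```python
from typing import Dict, List

def call_verifier_llm(prompt: str) -> str:
    """Placeholder for a call to the fine-tuned verifier model (hypothetical helper)."""
    raise NotImplementedError

def verify_step(step_content: str, premises: List[str]) -> Dict[str, bool]:
    """Check one reasoning step against its cited premises along the three axes
    named in the rebuttal: dependency, purpose, and fact satisfiability."""
    verdicts = {}
    for axis in ("dependency", "purpose", "fact"):
        prompt = (
            "Premises:\n" + "\n".join(premises) +
            f"\n\nReasoning step:\n{step_content}\n\n"
            f"Does this step satisfy {axis} satisfiability? Answer yes or no."
        )
        verdicts[axis] = call_verifier_llm(prompt).strip().lower().startswith("yes")
    return verdicts
```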
-
Q2: Limited analysis of when/why the method fails
Response 2:
We have addressed the limitations of the GRiD method, particularly regarding generalization and the lack of a recovery mechanism for cases that fail verification, in Section 5. In addition, we want to discuss the Verification Scope and External Knowledge here.
While knowledge faithfulness checks can be easily performed using external knowledge bases (e.g., Wikipedia API), our paper focuses on the idea that a model's intrinsic knowledge is sufficient for multi-step reasoning and can be competitive with larger models. A key limitation of current models is the lack of an effective knowledge extraction strategy from the model’s latent knowledge space.
| Table 1 (Baseline ReAct by prompting without fine-tuning) | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|
| DeepSeek-v3 | 0.782 | 0.864 | 0.449 | 0.838 |
| DS-v3-original | 0.840 | 0.853 | 0.500 | 0.850 |
| GPT-4o | 0.839 | 0.837 | 0.429 | 0.788 |
| GPT-4o-original | 0.826 | 0.851 | 0.510 | 0.842 |

However, we also tested a ReAct agent framework with the Wikipedia API on the same tasks (see the results in Table 1 above, and refer to the discussion on Q3 with Reviewer 1SrY for implementation details). Interestingly, external knowledge did not consistently improve performance, and in some cases it even worsened results. This is likely due to noise from unreliable external sources. In contrast, applying web search to the knowledge steps in GRiD for cross-verification could enhance the faithfulness of the model's reasoning, which is an avenue for future work.
Additionally, we discuss the concept of the model's knowledge boundary in Section 4.7 of the original paper, which represents another scenario where the GRiD method may encounter failures.
-
Q3: Highly dependent on the performance of LLMs
Response 3:
Our experimental results, particularly those presented in Tables 1-3 of the original paper, show that the performance of different base models (e.g., Llama3 8b and Qwen 2.5 14b) varies significantly. Section 4.4, which discusses the scaling of model size, further highlights that larger and more powerful base models tend to benefit more from the GRiD method.
This aligns with our assumption about the model's knowledge space: if a model possesses the necessary knowledge for a given question, the GRiD method enhances its ability to answer. Conversely, if the model reaches its knowledge boundary, as discussed in Section 4.7, the method cannot overcome this limitation. Therefore, it is natural for GRiD's performance to be dependent on the original base LLM, as different models have varying knowledge scopes.
-
Q4: Concerns on the faithfulness and reliability of generated knowledge and the dependency verification
Response 4:
To validate the effectiveness of the GRiD method in improving the faithfulness of extracted knowledge and the consistency of reasoning traces, we conducted experiments to calculate the faithfulness and consistency scores of reasoning traces generated by the original model, the CoT-SFT model, and the GRiD model by prompting GPT-4.1 and DeepSeek-r1. These results, shown in Table 2 below, demonstrate that GRiD generates more faithful and consistent knowledge-reasoning traces.
Table 2:

| Benchmark | Strategy | Faithfulness ↑ | Consistency ↑ | Faithful Issue Rate ↓ | Consistency Issue Rate ↓ |
|---|---|---|---|---|---|
| StrategyQA | Original | 0.709 | 0.783 | 1.58 | 0.695 |
| StrategyQA | CoT-SFT | 0.758 | 0.856 | 1.143 | 0.406 |
| StrategyQA | GRiD | 0.766 | 0.89 | 1.057 | 0.314 |
| CommonsenseQA | Original | 0.854 | 0.854 | 0.472 | 0.477 |
| CommonsenseQA | CoT-SFT | 0.874 | 0.87 | 0.47 | 0.445 |
| CommonsenseQA | GRiD | 0.905 | 0.891 | 0.39 | 0.35 |
| GPQA | Original | 0.854 | 0.854 | 0.472 | 0.477 |
| GPQA | CoT-SFT | 0.874 | 0.87 | 0.47 | 0.445 |
| GPQA | GRiD | 0.905 | 0.891 | 0.39 | 0.35 |
| TruthfulQA | Original | 0.715 | 0.772 | 1.788 | 0.504 |
| TruthfulQA | CoT-SFT | 0.844 | 0.887 | 1.456 | 0.294 |
| TruthfulQA | GRiD | 0.835 | 0.887 | 1.175 | 0.275 |

Metric explanations: Faithfulness and Consistency refer to the correctness of factual knowledge and the consistency of reasoning steps, while Faithful Issue Rate and Consistency Issue Rate indicate the average number of steps with faithfulness and consistency issues, respectively.
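For reference, the judge-based scoring used to produce numbers of this kind can be set up roughly as follows; the judge prompt and the `judge` helper are hypothetical stand-ins rather than the exact prompts used for Table 2.

```python
import json
from statistics import mean
from typing import List

def judge(prompt: str) -> str:
    """Placeholder for a call to a judge model such as GPT-4.1 or DeepSeek-r1."""
    raise NotImplementedError

def score_trace(trace: str) -> dict:
    """Ask the judge to rate one reasoning trace and count problematic steps."""
    prompt = (
        "Rate the following reasoning trace.\n"
        "Return JSON with keys: faithfulness (0-1), consistency (0-1), "
        "faithful_issue_steps (int), consistency_issue_steps (int).\n\n" + trace
    )
    return json.loads(judge(prompt))

def aggregate(traces: List[str]) -> dict:
    """Average per-trace scores into benchmark-level metrics."""
    scores = [score_trace(t) for t in traces]
    return {k: mean(s[k] for s in scores) for k in scores[0]}
```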
Furthermore, we observed that failure cases are often associated with issues in knowledge faithfulness and step consistency, and vice versa. This is logical, as incorrect knowledge typically leads to incorrect results. On the other hand, while cases with correct results may still contain faithfulness and consistency issues, their severity is typically low, meaning these errors are not critical to the overall outcome.
-
Q5: Concerns on the robustness of the approach across different models and prompting strategies?
Response 5:
In Section 4.5 and Table 4 of the original paper, we present experiments that vary the models used as data creators. These experiments demonstrate that the GRiD method is robust across different base models. When applying the same benchmark, performance variations are primarily influenced by the base model’s capabilities rather than the data creator model.
Regarding prompting strategies, the prompt is designed to generate a dataset in a knowledge-enhanced, dependency-aware format. Regardless of how the prompt is structured, the objective remains the same: to produce a specially formatted dataset. As a result, we observe consistent performance across these datasets.
-
Q6: Concerns on ablation studies regarding extracting relevant knowledge versus the construction of the dependency graph.
Response 6:
We present the relevant ablation studies in Section 4.3 and Table 2 of the original paper, which highlight the impact of dependency verification. Both the data-filtering strategy before training and the verification strategy during inference contribute to further improving model performance compared to the vanilla GRiD method. It's important to note that the verification process relies on this specialized knowledge-reasoning data format. As a result, we cannot directly compare the performance of GRiD using the special data format alone with the performance using verification alone.
-
Q7: the computational complexity of the dependency verification process
Response 7:
| Table 3 | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|
| GRiD Main Trace | 317 | 557 | 794 | 450 |
| Verification | 615 | 518 | 1091 | 554 |
| CoT | 309 | 360 | 1069 | 396 |
| Think | 849 | 734 | 5012 | 837 |
| GRiD Accuracy | 0.829 | 0.860 | 0.424 | 0.888 |
| GRiD Accuracy + Verification | 0.880 | 0.904 | 0.475 | 0.931 |

The results presented in Figure 4 of Section 4.6 of the original paper refer to GRiD without the verification module, but with the "GRiD + data filtering" setting. In addition, we have included the average token consumption in Table 3 above. The results show that the average token consumption for verification is similar to that of the main reasoning trace in a single case. Moreover, by incorporating data filtering and verification, we observe an accuracy improvement of 2-7%. We will include additional experimental results and clarifications in the revised version.
Dear Reviewer KJEB,
As the deadline for author-reviewer discussion is approaching, could you please check the authors' rebuttal and post your response?
Thank you!
Best,
AC
Thanks for the authors' efforts. The experimental evaluation is thorough and rigorous. I have updated my score.
Thank you very much for your recognition of our work and your suggestions for additional experiments. We will revise our paper to improve its clarity and quality based on your constructive feedback. We sincerely appreciate your feedback and will strive to further improve GRiD and make it even more valuable to the community.
This paper introduces GRiD, a reasoning pipeline that enhances LLM reasoning by explicitly grounding each reasoning step in factual knowledge. It first transforms question-answer pairs into a knowledge-reasoning dependency graph, where each reasoning node is linked to specific knowledge nodes. A lightweight verifier is fine-tuned to check the correctness of each reasoning step based on its dependencies. Experiments across multiple QA benchmarks demonstrate that GRiD improves reasoning accuracy and consistency over baseline models.
Strengths and Weaknesses
Strengths:
-
Reduced Hallucinations and Inconsistencies: By explicitly prompting the model to extract and ground necessary knowledge before each reasoning step, GRiD mitigates common reasoning errors and factual inconsistencies.
-
Lightweight Verifier: The use of a step-wise verifier allows for both data filtering during training and runtime verification during inference, offering a practical and scalable enhancement module.
Weaknesses:
-
Domain Overfitting: The framework relies on supervised fine-tuning with GRiD-formatted data, which ties its effectiveness to the training domain and limits its generalization to unseen tasks.
-
Lack of Fallback for Verification Failures: When reasoning steps fail the verifier, the system lacks robust recovery mechanisms such as revision or fallback prompting. This leads to sharp performance degradation in such cases.
-
Verification Scope Limitation: Currently, the verifier only checks the logical validity of reasoning steps based on linked nodes. How can the framework be extended to verify the factual correctness of the knowledge nodes themselves, especially since hallucinated or outdated knowledge can compromise the entire reasoning chain?
Questions
Clarification Questions in Result Presentation:
In Table 1 and Table 2, are GRiD results obtained via fine-tuning or prompting? If fine-tuned, is there a comparable prompting-based version of GRiD?
In Table 2, what exactly do “Zero-Shot” and “Few-Shot” mean in the context of GRiD with verification?
Does the "+ Verification" setting imply filtering out test samples that fail verification? If so, are the same test samples used across the “Directly Answer”, “Zero-Shot”, and “Few-Shot” settings for fair comparison?
Limitations
NA
Final Justification
The rebuttal addressed most of my concerns.
Formatting Issues
NA
We appreciate Reviewer SQC5’s thoughtful evaluation and constructive feedback on our paper. Below, we address each of the identified weaknesses and provide clarifications and additional details to support our claims.
-
Q1: Concerns on Domain Overfitting, Lack of Fallback for Verification Failures, and Verification Scope Limitation:
Response 1:
Thank you for raising these important concerns. We have addressed the limitations related to generalization, recovery mechanisms for verification failures, and verification scope in the "Discussion on Limitations and Future Work" section of the original paper.
- Domain Overfitting and Generalization: We acknowledge that the SFT method has limited generalization ability to new domains. We believe that the overfitting issue arises because the SFT method biases the model toward each target data distribution, making it less flexible to new domains. We think this problem can be mitigated by using reinforcement learning (RL) methods, which would encourage the model to reason in a specific format without being overly focused on the training data distribution. While we use SFT in this paper to quickly test our assumption about extracting the model’s intrinsic knowledge to support reasoning, we plan to implement an RL-based version in future work to address this limitation.
- Recovery Mechanism for Verification Failures: The lack of a recovery mechanism for verification failures is indeed an open problem. We have experimented with the CoT self-consistency@9 and Best-of-N (N=9) strategies (a minimal voting sketch is given at the end of this response), but neither approach yielded better results after voting. We believe that GRiD, with its knowledge-reasoning format, has reached the model's knowledge boundary. Failures in verification are cases that the model struggles to handle on its own, which we discuss in Section 4.7 of the original paper. As such, we are actively exploring better strategies to handle these failures and welcome contributions from the community to address this challenge.
- Verification Scope and External Knowledge: While it is relatively easy to perform knowledge faithfulness checks for each knowledge extraction step using an external knowledge base (e.g., Wikipedia API), our paper focuses on the insight that a model’s intrinsic knowledge is sufficient for multi-step reasoning, and that it can be competitive with larger models. One of the primary limitations of current models is their lack of a strong knowledge extraction strategy to successfully extract and organize necessary knowledge from the model’s latent knowledge space. Therefore, our study primarily explores the model's intrinsic knowledge, without relying on external knowledge sources.
| Table 1 (Baseline ReAct by prompting without fine-tuning) | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|
| DeepSeek-v3 | 0.782 | 0.864 | 0.449 | 0.838 |
| DS-v3-original | 0.840 | 0.853 | 0.500 | 0.850 |
| GPT-4o | 0.839 | 0.837 | 0.429 | 0.788 |
| GPT-4o-original | 0.826 | 0.851 | 0.510 | 0.842 |

| Table 2 (Fine-tuned on Qwen-14b-Instruct) | | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|---|
| Data Creator | DeepSeek-v3 | 0.707 | 0.799 | 0.357 | 0.818 |
| | GPT-4o | 0.724 | 0.776 | 0.387 | 0.757 |
| Baseline | Qwen-14b-CoT | 0.760 | 0.840 | 0.364 | 0.812 |

However, we also implemented a ReAct agent framework with the Wikipedia API for the same tasks during this rebuttal period (we put the results in Table 1 and Table 2 above; please refer to the discussion on Q3 with Reviewer 1SrY for more details of the implementation). Interestingly, under the original ReAct setup, external knowledge did not always lead to performance improvements, and in some cases it even worsened the results. We believe this is due to the noise introduced by unreliable external sources: we cannot always guarantee the accuracy or completeness of the retrieved information. In fact, applying web search operations to the knowledge steps in GRiD could provide auxiliary support to the model's intrinsic knowledge, allowing for cross-verification without relying solely on external information. This approach may improve the faithfulness of the model's knowledge during runtime reasoning and is a promising direction for future work.
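Regarding the recovery strategies mentioned above, the CoT self-consistency and Best-of-N attempts amount to sampling several traces and voting on the extracted answers. A minimal sketch is shown below; the helper names are our own, not from the paper.

```python
from collections import Counter
from typing import Callable, List

def self_consistency(generate: Callable[[], str],
                     extract_answer: Callable[[str], str],
                     n: int = 9) -> str:
    """Sample n reasoning traces and return the majority-vote answer."""
    answers: List[str] = [extract_answer(generate()) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```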
-
Q2: In Table 1 and Table 2, are GRiD results obtained via fine-tuning or prompting? If fine-tuned, is there a comparable prompting-based version of GRiD?
Response 2:
In Table 1, the first two entries, “Llama” and “Qwen,” are based on prompting, while the remaining entries, “Disentangle,” “Fine-tune on QA Pairs,” and “GRiD (ours),” are based on fine-tuning.
In Table 2, the first three entries for large LLMs, “GPT-3.5-turbo,” “GPT-4o,” and “DeepSeek-R1,” are prompting-based, while the rest, labeled as “Ours,” are fine-tuning-based.
-
Q3: In Table 2, what exactly do “Zero-Shot” and “Few-Shot” mean in the context of GRiD with verification?
Response 3:
In the context of GRiD with verification, "Zero-Shot" and "Few-Shot" refer to different prompting strategies:
- Zero-Shot involves providing only an instruction, such as “Answer the following single-choice question step by step and put your final choice in \n###Answer: \boxed{{ }},” as used in our implementation.
- Few-Shot includes additional reasoning examples along with the instruction to guide the model in performing the task.
In the "+verification" mode, we also use the few-shot prompting method. The key difference is that the few-shot examples in this case are formatted similarly to the GRiD method, explicitly extracting knowledge ("reasoning" and "knowledge" steps) from the model’s intrinsic knowledge space, which facilitates dependency verification.
-
Q4: Does the "+ Verification" setting imply filtering out test samples that fail verification? If so, are the same test samples used across the “Directly Answer”, “Zero-Shot”, and “Few-Shot” settings for fair comparison?
Response 4:
In Table 2 of the original paper, the "+ Verification" setting focuses on the accuracy of test cases that pass verification, leading to an accuracy improvement of up to 7% for these cases. The primary goal of the "+ Verification" setting is to demonstrate the effectiveness of the verification process.
To ensure a fair comparison, we also conducted additional experiments on the test samples that passed verification, applied across the "Directly Answer," "Zero-Shot," and "Few-Shot" settings, as shown in Table 3 below. The results indicate a slight accuracy increase for these settings on the verification-passed samples, but their performance still lags behind the setting that incorporates verification, highlighting the effectiveness of the verification strategy.
| Table 3 | Data Scope | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|---|
| Directly Answer | all data | 0.840 | 0.855 | 0.707 | 0.813 |
| | passed data | 0.876 | 0.859 | 0.711 | 0.847 |
| Zero-Shot | all data | 0.855 | 0.870 | 0.712 | 0.819 |
| | passed data | 0.873 | 0.889 | 0.720 | 0.829 |
| Few-Shot | all data | 0.863 | 0.830 | 0.722 | 0.863 |
| | passed data | 0.869 | 0.861 | 0.719 | 0.894 |
| + Verification | passed data | 0.903 | 0.889 | 0.737 | 0.933 |
Dear Reviewer SQC5,
As the deadline for author-reviewer discussion is approaching, could you please check the authors' rebuttal and post your response?
Thank you!
Best,
AC
Dear Reviewer,
Could you please check the authors' rebuttal and post your response?
The new policy requires AC to flag insufficient review this year, including the non-participation in author-reviewer discussions.
Thanks,
AC
Dear Reviewer SQC5,
Thank you once again for your positive and insightful review of our paper. Your feedback has been instrumental in helping us strengthen our work.
As the author-reviewer discussion period approaches its deadline, we would like to share new experimental results. Following the suggestion from Reviewer 1SrY, we have added the ReAct method as a baseline for our paper. The new experimental results, along with Table 1 and Table 2 of the previous rebuttal, further clarify the differences between our GRiD method and the ReAct baseline, and demonstrate why a fine-tuned ReAct model is not ideal for activating a model's intrinsic knowledge to enhance performance.
1. Additional Experiments with ReAct Baseline
We extended our evaluation of the ReAct method (using the Wikipedia API) to include both GPT-4.1 and the newly released open-source GPT-oss-120b model from OpenAI. For GPT-oss-120b, we leveraged third-party API deployments for testing. Consistent with our earlier findings (see Table 1 of our rebuttal), ReAct in prompting mode produces only modest improvements and, in many cases, even leads to a decrease in model performance when using the Wikipedia API.
| Table 1 - Continued (Baseline ReAct by prompting without fine-tuning) | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|
| DeepSeek-v3 | 0.782 | 0.864 | 0.449 | 0.838 |
| DS-v3-original | 0.840 | 0.853 | 0.500 | 0.850 |
| GPT-4o | 0.839 | 0.837 | 0.429 | 0.788 |
| GPT-4o-original | 0.826 | 0.851 | 0.510 | 0.842 |
| GPT-4.1 | 0.877 | 0.829 | 0.590 | 0.815 |
| GPT-4.1-original | 0.851 | 0.845 | 0.646 | 0.887 |
| GPT-oss-120b | 0.853 | 0.645 | 0.763 | 0.738 |
| GPT-oss-120b-original | 0.823 | 0.825 | 0.786 | 0.869 |
2. Fine-tuning ReAct: Different Strategies and Results
We conducted three new sets of fine-tuning experiments for ReAct:
-
GPT-4.1 as data creator
-
GPT-4o as data creator, with optimized ReAct trace:
Here, based on prior discussions suggesting that noise may be introduced in ReAct traces, we refined the traces by removing the “Similar” segment after “Observation,” while retaining the essential segments: “Thought,” “Search,” and “Observation.” This adjustment aimed to reduce unnecessary noise from search operations and make the trace more coherent. Results (Table 2-Continued, row 4) indicate that this optimized trace leads to a modest increase in final accuracy.
-
GPT-oss-120b as data creator
| Table 2 - Continued (Fine-tuned on Qwen-14b-Instruct) | | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|---|
| Data Creator | DeepSeek-v3 | 0.707 | 0.799 | 0.357 | 0.818 |
| | GPT-4o | 0.724 | 0.776 | 0.387 | 0.757 |
| | GPT-4o-optimized | 0.729 | 0.785 | 0.393 | 0.750 |
| | GPT-4.1 | 0.737 | 0.785 | 0.419 | 0.823 |
| | GPT-oss-120b | - | - | 0.404 | - |
| Baseline | Qwen-14b-CoT | 0.760 | 0.840 | 0.364 | 0.812 |
Despite these optimizations, fine-tuned ReAct still falls short in fully activating the model’s intrinsic knowledge, especially compared to our GRiD method. The ReAct data format tends to prioritize external information, neglecting the extraction of intrinsic knowledge, and thus fails to unlock the model’s full reasoning potential.
Moreover, while GPT-oss-120b achieves strong performance on the GPQA benchmark (with or without ReAct), fine-tuning on ReAct traces generated by GPT-oss-120b yields only a modest improvement, and its performance drops by a significant 36% compared to ReAct in prompting mode.
In contrast, GRiD is explicitly designed to extract and organize intrinsic knowledge embedded in the model weights, presenting it as explicit context. This lightweight approach reduces the risk of noise from external knowledge bases. Furthermore, the explicit step dependency graph in GRiD enables step-by-step dependency verification, improving the consistency of the reasoning trace and leading to significant performance gains.
We hope these additional experiments further clarify the strengths and contributions of our method. Thank you again for your thoughtful feedback and valuable suggestions.
This paper proposes to enhance the logical consistency of CoT with Grounded Reasoning in Dependency (GRiD). The core idea of GRiD is to prompt a proposer LLM to explicitly construct a graph connecting all reasoning steps and specify their logical dependencies. In this way, we can easily verify whether each reasoning step is logically valid through a verifier LLM. Based on this idea, the authors first create training reasoning traces based on the proposer and the verifier LLMs, then bootstrap these two models by finetuning them on the created data. At test time, one can directly use the finetuned proposer to generate reasoning traces, or further leverage the finetuned verifier to reject incorrect steps. Experiments on 4 distinct QA benchmarks show that GRiD significantly improves the performance over the original instruct models and directly finetuned models. It is also observed that GRiD scales with model size, is robust to training data generated from various models, and achieves an excellent token consumption-accuracy trade-off.
Strengths and Weaknesses
Strengths:
- The idea of explicitly grounding reasoning steps and verifying them is neat. GRiD solves the logical inconsistency issue in XoT approaches by self-verifying each intermediate step, unlike Process Reward Models (PRMs), which heavily depend on step-wise annotations.
- GRiD achieves significant performance gain on 4 QA benchmarks, while being robust to training data creators (as long as the training data creators are not very weak models).
- GRiD doesn’t cost too many additional tokens compared to standard CoT, while being much better in performance.
Weaknesses:
- The idea of explicitly grounding reasoning steps, which is the core contribution of this paper, has been proposed in Natural Program[1]. This work appears to be more of a fine-tuning variant of Natural Program, evaluated on recent reasoning benchmarks. Although the authors differentiate their GRiD from Natural Program by emphasizing explicit knowledge extraction, this distinction seems minor. It largely amounts to prompting the model to write queries before knowledge extraction, rather than employing a separate model or session for retrieving knowledge.
- Among the datasets used in this paper, GPQA and TruthfulQA are pure test datasets and don't contain training splits. How do you generate training data for these two datasets? In my opinion, there are a couple of problems if you split them into training and test splits: 1) this introduces test distribution leakage, which makes the results not comparable to reported results on these datasets; 2) these datasets are so small (e.g. 198 for GPQA Diamond) that further splits will yield very noisy results.
- Some claims in this paper are not supported. For example, there are no experiments supporting the consistency and faithfulness claimed in the abstract, or how hallucination is resolved by explicit knowledge extraction claimed in the introduction.
[1] Ling and Fang et al. Deductive Verification of Chain-of-Thought Reasoning. NeurIPS 2023.
Questions
- Line 109-110: What’s the adaptive adjustment to the verification mechanism in GRiD?
- Do you use the same initial model for training the proposer and the verifier?
- In Algorithm 1, it looks like dependency verification doesn’t depend on the ground truth answer. Could you explain why models can benefit from such data bootstrapped from themselves without external supervision?
- In Table 2, why can GRiD itself without data filtering improve performance over the original instruct models? Is it just finetuning on unverified model-generated CoT traces? Why does GRiD without data filtering outperform finetuning on QA pairs in Table 1?
- It’s better to add base model numbers in Table 3 to show both final performance and performance gains. You may put two numbers in a cell to save space.
- Line 234: Which entries in Table 3 are finetuned with identical training data?
- For Sec 4.6, is the token consumption measured based on direct CoT generation or generation with a verifier? If you use a verifier, the tokens rejected by the verifier should also be considered.
Limitations
Yes. I appreciate that the authors include experiments in Sec 5 to show the limitation of GRiD in cross-domain generalization.
Final Justification
Given that the authors have addressed my concern on the setup of training splits, I think the results in this paper are solid. Therefore, I increased my rating to 5.
Formatting Issues
No formatting issues.
We appreciate Reviewer WaS3’s thoughtful evaluation and constructive feedback on our paper. Below, we address each of the identified weaknesses and provide clarifications and additional details to support our claims.
-
Q1: Concerns on the difference with Natural Program.
Response 1:
While we acknowledge that both GRiD and Natural Program aim to improve reasoning by grounding steps, the motivation and approach of our work are fundamentally different. Our primary goal with GRiD is to demonstrate that even relatively small LLMs contain enough information to answer certain questions. The key is not merely to retrieve knowledge, but to extract the relevant knowledge and explicitly present it as context whenever the next reasoning step is in need, while ensuring the correctness of the reasoning step dependencies.
Unlike Natural Program, which explicitly extracts only question-related information, GRiD is more flexible and adaptive. We extract knowledge dynamically, before each reasoning step when needed, which makes our approach more versatile. Additionally, our verification mechanism in GRiD is tailored to address Dependency Satisfiability, Purpose Satisfiability, and Fact Satisfiability, ensuring that the reasoning process remains consistent and correct.
By using GRiD, we can effectively extract the knowledge required to support reasoning, which we refer to as “squeezing” the model. This also enables us to identify the model's knowledge boundaries, as discussed in Section 4.7 in the original paper. The experimental results consistently show that the combination of knowledge extraction and step dependency verification significantly enhances the model’s reasoning performance.
-
Q2: Concerns on the data preparation.
Response 2:
For GPQA, we use the GPQA-Diamond subset exclusively for testing, while the GPQA-Extended subset, which has excluded the samples from GPQA-Diamond, is used for training. For TruthfulQA, we randomly split the dataset into training and testing sets using an 80:20 ratio, with 160 samples selected for the test set.
We understand your concerns about potential test distribution leakage and the small size of these datasets. To address this, we will include additional clarifications and details in the revised manuscript regarding our data split methodology and its implications.
-
Q3: Concerns on the claims of consistency and faithfulness.
Response 3:
Thank you for pointing out the need for more concrete support for our claims. To address this, we directly calculate the consistency and faithfulness scores as explicit metrics to demonstrate the effectiveness of GRiD for a more objective evaluation.
For these calculations, we prompted GPT-4.1 and DeepSeek-r1 to evaluate and score the reasoning traces of the original model, the SFT model using CoT data, and the GRiD model. The results, shown in Table 1 below, confirm that GRiD significantly improves the faithfulness and consistency of the generated knowledge-reasoning traces. Interestingly, on the GPQA benchmark, although our method performs slightly better than the other two, the improvement is modest. Additionally, the absolute Faithfulness and Consistency Scores are lower, and the Faithful and Consistency Issue Rates are higher compared to the other three benchmarks. This aligns with the experimental results presented in Tables 1 and 2 of the original paper, which suggest that GPQA is particularly challenging for smaller models. This also supports our discussion in Section 4.7 of the original paper regarding the model reaching its knowledge boundary on this benchmark.
Table 1:

| Benchmark | Strategy | Faithfulness ↑ | Consistency ↑ | Faithful Issue Rate ↓ | Consistency Issue Rate ↓ |
|---|---|---|---|---|---|
| StrategyQA | Original | 0.709 | 0.783 | 1.58 | 0.695 |
| StrategyQA | CoT-SFT | 0.758 | 0.856 | 1.143 | 0.406 |
| StrategyQA | GRiD | 0.766 | 0.89 | 1.057 | 0.314 |
| CommonsenseQA | Original | 0.854 | 0.854 | 0.472 | 0.477 |
| CommonsenseQA | CoT-SFT | 0.874 | 0.87 | 0.47 | 0.445 |
| CommonsenseQA | GRiD | 0.905 | 0.891 | 0.39 | 0.35 |
| GPQA | Original | 0.854 | 0.854 | 0.472 | 0.477 |
| GPQA | CoT-SFT | 0.874 | 0.87 | 0.47 | 0.445 |
| GPQA | GRiD | 0.905 | 0.891 | 0.39 | 0.35 |
| TruthfulQA | Original | 0.715 | 0.772 | 1.788 | 0.504 |
| TruthfulQA | CoT-SFT | 0.844 | 0.887 | 1.456 | 0.294 |
| TruthfulQA | GRiD | 0.835 | 0.887 | 1.175 | 0.275 |

Metric explanations: Faithfulness and Consistency refer to the correctness of factual knowledge and the consistency of reasoning steps, while Faithful Issue Rate and Consistency Issue Rate indicate the average number of steps with faithfulness and consistency issues, respectively.
-
Q4: Line 109-110: What’s the adaptive adjustment to the verification mechanism in GRiD?
Response 4:
The adaptive adjustment to the verification mechanism in GRiD occurs in two key ways:
- We extract relevant knowledge adaptively before each reasoning step, rather than solely focusing on question-related knowledge. This requires the verification mechanism to be adjusted to account for intermediate knowledge steps, ensuring consistency throughout the reasoning process.
- We also refine the verification mechanism to primarily focus on Dependency Satisfiability, Purpose Satisfiability, and Fact Satisfiability. This adjustment enables the verifier to identify issues related to dependency, faithfulness, and consistency as the reasoning progresses.
-
Q5: Do you use the same initial model for training the proposer and the verifier?
Response 5:
Yes
-
Q6: … Could you explain why models can benefit from such data bootstrapped from themselves without external supervision?
Response 6:
The motivation behind dependency verification is that if the intermediate reasoning steps are correct, they will lead to a correct answer. Inspired by this, our goal is to ground the reasoning process within the model's intrinsic knowledge space, ensuring that the knowledge extraction is accurate and reliable.
This approach offers several advantages:
- It explicitly requires reasoning steps to present their prerequisites before providing reasoning content, thereby enhancing the interpretability of the reasoning trace.
- It creates an anchor during the reasoning process, making the outcomes more reliable by ensuring that the model’s reasoning is grounded in known facts.
- It benefits data-filtering before training and run-time verification by providing an additional operation of consistency checking, reinforcing the dependency and ensuring that the reasoning trace remains coherent. A consistent reasoning trace improves the reliability of the final answer, instilling higher confidence in the model's conclusions.
-
Q7: In Table 2, why can GRiD itself without data filtering improve performance over the original instruct models? Is it just finetuning on unverified model-generated CoT traces? Why does GRiD without data filtering outperform finetuning on QA pairs in Table 1?
Response 7:
GRiD without data filtering is fine-tuned on unverified, model-generated knowledge-reasoning traces in GRiD format, rather than plain CoT traces. This approach enhances the model's performance for several reasons. As mentioned earlier, the knowledge-enhanced trace, along with reasoning steps where prerequisites are presented before the reasoning content, serves as an anchor during the reasoning process, making the model’s outcomes more reliable.
The explicit inclusion of premise steps (e.g., <knowledge_x>, <reason_y>) in GRiD further boosts the model's confidence in generating the current reasoning content. This structured approach improves reasoning consistency and clarity. The key advantage of GRiD lies in the explicit organization and verification of knowledge, which enhances the reasoning process more effectively than traditional fine-tuning on QA pairs.
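To make the premise tagging concrete, a GRiD-format trace might look roughly like the following. The tag syntax here is our own paraphrase of the <knowledge_x>/<reason_y> notation, not a verbatim sample from the training data.

```python
# A hypothetical GRiD-style trace (illustrative only; the real format may differ):
grid_trace = """\
<knowledge_1> The Statue of Liberty is located in New York Harbor.
<knowledge_2> New York Harbor opens onto the Atlantic Ocean.
<reason_1> [depends on: knowledge_1, knowledge_2] Therefore the Statue of Liberty
stands near the Atlantic Ocean.
<answer> Yes
"""
```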
-
Q8: It’s better to add base model numbers in Table 3 to show both final performance and performance gains. You may put two numbers in a cell to save space.
Response 8:
Thank you for this suggestion. We will add this content in the revised version.
-
Q9: Line 234: Which entries in Table 3 are finetuned with identical training data?
Response 9:
For each column in Table 3 of the original paper, we use the same GRiD-format training data to train the three models. However, different columns correspond to models trained on identical data from different benchmarks.
-
Q10: Concerns on the token consumption for verification.
Response 10:
| Table 2 | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|
| GRiD Main Trace | 317 | 557 | 794 | 450 |
| Verification | 615 | 518 | 1091 | 554 |
| CoT | 309 | 360 | 1069 | 396 |
| Think | 849 | 734 | 5012 | 837 |
| GRiD Accuracy | 0.829 | 0.860 | 0.424 | 0.888 |
| GRiD Accuracy + Verification | 0.880 | 0.904 | 0.475 | 0.931 |

The results presented in Figure 4 of Section 4.6 of the original paper refer to GRiD without the verification module, but with the "GRiD + data filtering" setting. In addition, we have included the average token consumption in Table 2 above. The results show that the average token consumption for verification is similar to that of the main reasoning trace in a single case. Moreover, by incorporating data filtering and verification, we observe an accuracy improvement of 2-7%. We will include additional experimental results and clarifications in the revised version.
Thanks to the authors for their response. The authors have addressed my concern on training data splits. Regarding the similarity between GRiD and Natural Program, I agree that Natural Program extracts only question-related information, which differs somewhat from GRiD in implementation details. However, the high-level ideas of the two works are still similar. Given that GRiD is more flexible and achieved solid results on recent reasoning benchmarks, I appreciate the authors' efforts and decide to increase my rating to 5.
Thank you very much for your constructive comments on our work and for your active participation in the rebuttal process. We are committed to addressing your suggestions to enhance the quality and impact of our work. We sincerely appreciate your feedback and will strive to further improve GRiD and make it even more valuable to the community.
The paper introduces GRiD, a two-stage framework to enhance the reliability of reasoning traces: (i) constructs a dependency graph containing both knowledge nodes and reasoning nodes; generate reasoning traces and verification data based on this dependency graph; and (ii) finetune the proposer and verifier based on the data generated.
Strengths and Weaknesses
Strengths
-
The paper is clearly written. The proposed framework is clearly explained with notation, figures, and pseudocode. It is easy to understand.
-
Sufficient ablation studies. The ablation studies clearly show the importance of verification / model size / training data creator.
Weaknesses
-
The generated data may be unreliable. Although the new reasoning format asks the LLM to give the knowledge before the reasoning, it is possible that the LLM produces well-reasoned hallucinations, since both the knowledge and the reasoning come from the LLM's internal knowledge. This well-reasoned hallucination may also happen in the verification step. The paper lacks a clear discussion and explanation of this issue.
-
The evaluation metric is not clear. What is the evaluation metric for Tables 1-5?
-
Baseline missing. How is the proposed method fundamentally different from ReAct prompting [1]? In the context of this paper, the knowledge node is the 'reason' and the reasoning node is the 'action'. In the experiments, the authors only consider direct answer, zero-shot, and few-shot prompts. How is the baseline performance with ReAct prompts (especially a finetuned LLM with the ReAct prompt)?
[1] Yao et al. React: Synergizing reasoning and acting in language models. ICLR 2023.
Questions
Please answer the questions in the weaknesses part.
Limitations
yes
Final Justification
The rebuttal and discussion from the authors address my concerns on the baseline and the novelty of the paper, so I raise my score to 4.
Formatting Issues
no paper formatting concerns
We appreciate Reviewer 1SrY’s thoughtful evaluation and constructive feedback on our paper. Below, we address each of the identified weaknesses and provide clarifications and additional details to support our claims.
-
Q1: The generated data may be unreliable. Although the new reasoning format asks the LLM to give the knowledge before the reasoning, it is possible that the LLM produces well-reasoned hallucinations, since both the knowledge and the reasoning come from the LLM's internal knowledge. This well-reasoned hallucination may also happen in the verification step. The paper lacks a clear discussion and explanation of this issue.
Response 1:
We agree that the statistical nature of LLMs makes unreliability and hallucination a fundamental challenge for practical applications. This is indeed an open problem in the community.
Our work explicitly mitigates this issue by proposing GRiD, which aims to extract the most accurate information from the model’s intrinsic knowledge in a structured reasoning format. While hallucination often arises during LLM inference due to probabilistic token generation, our approach attempts to mitigate this by constraining the reasoning process and making the knowledge extraction step explicit and verifiable.
Furthermore, the dependency verification module in GRiD enhances reliability by ensuring each reasoning step logically follows from the extracted knowledge and previous steps. This mechanism reduces the likelihood of compounding hallucinations and enforces greater consistency throughout the reasoning chain.
To empirically validate our approach, we conducted additional experiments to assess the faithfulness and consistency of the entire knowledge-reasoning trace. We directly compute the consistency and faithfulness scores to objectively demonstrate the effectiveness of GRiD. For automated evaluation, we prompted GPT-4.1 and DeepSeek-r1 to assess and score the reasoning traces of the original model, the SFT model using CoT data, and the GRiD model. As shown in Table 1 of this rebuttal, GRiD consistently outperforms both the original model and the supervised fine-tuned model using CoT data in terms of faithfulness and consistency. These results underscore the importance of the knowledge extraction and dependency verification modules in reducing hallucinations. Interestingly, on the GPQA benchmark, while our method shows a slight improvement over the other two, the absolute faithfulness and consistency scores are notably lower, and the issue rates are significantly higher compared to the other three benchmarks. This finding aligns with our earlier results presented in Tables 1 and 2 of the original paper, indicating that GPQA is particularly challenging for relatively small models. This supports our discussion in Section 4.7 of the original paper regarding the model reaching its knowledge boundary on this benchmark.
Table 1:

| Benchmark | Strategy | Faithfulness ↑ | Consistency ↑ | Faithful Issue Rate ↓ | Consistency Issue Rate ↓ |
|---|---|---|---|---|---|
| StrategyQA | Original | 0.709 | 0.783 | 1.58 | 0.695 |
| StrategyQA | CoT-SFT | 0.758 | 0.856 | 1.143 | 0.406 |
| StrategyQA | GRiD | 0.766 | 0.89 | 1.057 | 0.314 |
| CommonsenseQA | Original | 0.854 | 0.854 | 0.472 | 0.477 |
| CommonsenseQA | CoT-SFT | 0.874 | 0.87 | 0.47 | 0.445 |
| CommonsenseQA | GRiD | 0.905 | 0.891 | 0.39 | 0.35 |
| GPQA | Original | 0.854 | 0.854 | 0.472 | 0.477 |
| GPQA | CoT-SFT | 0.874 | 0.87 | 0.47 | 0.445 |
| GPQA | GRiD | 0.905 | 0.891 | 0.39 | 0.35 |
| TruthfulQA | Original | 0.715 | 0.772 | 1.788 | 0.504 |
| TruthfulQA | CoT-SFT | 0.844 | 0.887 | 1.456 | 0.294 |
| TruthfulQA | GRiD | 0.835 | 0.887 | 1.175 | 0.275 |

(Metric definitions: Faithfulness and Consistency refer to the factual correctness of knowledge and the consistency between reasoning steps; Faithful Issue Rate and Consistency Issue Rate represent the average number of steps with faithfulness and consistency issues for one example case, respectively.)
Finally, we observe a significant improvement in the model’s overall reasoning ability after adopting our reasoning format, further supporting the effectiveness of GRiD in addressing the reviewer’s concerns.
-
Q2: Evaluation metric is not clear. What is the evaluation metric for Table 1-5?
Response 2:
In this paper, we use the pass@1 accuracy score as the primary evaluation metric to assess the model's performance in Tables 1-5 of the original paper.
-
Q3: Baseline missing. How is the proposed method fundamentally different from ReAct prompting [1]? In the context of this paper, the knowledge node is the 'reason' and the reasoning node is the 'action'. In the experiments, the authors only consider direct answer, zero-shot, and few-shot prompts. How is the baseline performance with ReAct prompts (especially a finetuned LLM with the ReAct prompt)?
Response 3:
We would like to first clarify the key differences between GRiD and the ReAct method. GRiD is a model-only approach that focuses on extracting the model's intrinsic knowledge to enhance reasoning, while also verifying the dependencies between reasoning steps. In contrast, the ReAct method, especially in prompting mode, relies on an external knowledge base to assist the model's reasoning process. Even in the fine-tuning scenario, ReAct still requires external knowledge sources to construct the training data, rather than solely relying on the model's intrinsic knowledge.
| Table 2 (Baseline ReAct by prompting without fine-tuning) | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|
| DeepSeek-v3 | 0.782 | 0.864 | 0.449 | 0.838 |
| DS-v3-original | 0.840 | 0.853 | 0.500 | 0.850 |
| GPT-4o | 0.839 | 0.837 | 0.429 | 0.788 |
| GPT-4o-original | 0.826 | 0.851 | 0.510 | 0.842 |

| Table 3 (Fine-tuned on Qwen-14b-Instruct) | | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|---|
| Data Creator | DeepSeek-v3 | 0.707 | 0.799 | 0.357 | 0.818 |
| | GPT-4o | 0.724 | 0.776 | 0.387 | 0.757 |
| Baseline | Qwen-14b-CoT | 0.760 | 0.840 | 0.364 | 0.812 |

To provide a baseline, we implemented the ReAct method using the Wikipedia API as the external knowledge base on the four benchmarks in our paper. The results can be found in Tables 2 and 3 of this rebuttal. In the prompting mode, we observe some improvements with ReAct, such as for DeepSeek-v3 on CommonsenseQA and GPT-4o on StrategyQA. However, these improvements are modest. In many cases, ReAct with the Wikipedia API even results in a decrease in model performance.
In the fine-tuning mode, the performance decreases significantly compared to the model before fine-tuning. We believe this can be attributed to several factors: 1) the Wikipedia API does not consistently provide correct or useful information, introducing noise into the process; 2) the ReAct-format data prioritizes external information, neglecting the extraction of the model's intrinsic knowledge and failing to activate the model's potential for finding faithful and suitable intrinsic knowledge; and 3) the inaccurate external information recalled by the API can overwhelm the fine-tuning process, degrading the model's performance. Consequently, the fine-tuned ReAct model performs similarly to, or even worse than, a model fine-tuned on CoT question-answer pairs.
In contrast, our GRiD method focuses on extracting the model’s intrinsic knowledge, which is already embedded in the model weights, and presenting it as explicit context. This approach is lightweight and reduces the risk of introducing noise from external knowledge bases. Additionally, the explicit step dependency graph in GRiD facilitates step dependency verification, further ensuring the consistency of the reasoning trace and leading to significant improvements in model performance.
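For reference, the ReAct-style baseline discussed above follows a Thought/Search/Observation loop of this general shape. This is only a sketch: `llm` and `wiki_search` are hypothetical helpers, and the exact prompts and trace format follow Yao et al. rather than anything shown here.

```python
from typing import Callable

def react_loop(question: str, llm: Callable[[str], str],
               wiki_search: Callable[[str], str], max_steps: int = 6) -> str:
    """A generic Thought -> Search -> Observation loop, terminating on 'Finish[...]'."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")      # model proposes a thought and an action
        transcript += f"Thought:{step}\n"
        if "Finish[" in step:                    # model decided it can answer
            return step.split("Finish[", 1)[1].rstrip("]")
        if "Search[" in step:                    # model requested external knowledge
            query = step.split("Search[", 1)[1].split("]", 1)[0]
            transcript += f"Observation: {wiki_search(query)}\n"
    return llm(transcript + "Final answer:")     # fall back to answering directly
```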
Dear Reviewer 1SrY,
As the deadline for author-reviewer discussion is approaching, could you please check the authors' rebuttal and post your response?
Thank you!
Best,
AC
Dear Reviewer,
Could you please check the authors' rebuttal and post your response?
The new policy requires AC to flag insufficient review this year, including the non-participation in author-reviewer discussions.
Thanks,
AC
Dear Reviewer 1SrY,
Thank you once again for your positive and insightful review of our paper. Your feedback has been instrumental in helping us strengthen our work.
As the author-reviewer discussion period approaches its deadline, we would like to share new experimental results that further clarify the differences between our GRiD method and the ReAct baseline, and demonstrate why a fine-tuned ReAct model is not ideal for activating a model’s intrinsic knowledge to enhance performance.
1. Additional Experiments with ReAct Baseline
We extended our evaluation of the ReAct method (using the Wikipedia API) to include both GPT-4.1 and the newly released open-source GPT-oss-120b model from OpenAI. For GPT-oss-120b, we leveraged third-party API deployments for testing. Consistent with our earlier findings (see Table 2 of our rebuttal), ReAct in prompting mode produces only modest improvements and, in many cases, even leads to a decrease in model performance when using the Wikipedia API.
| Table 2 - Continued (Baseline ReAct by prompting without fine-tuning) | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|
| DeepSeek-v3 | 0.782 | 0.864 | 0.449 | 0.838 |
| DS-v3-original | 0.840 | 0.853 | 0.500 | 0.850 |
| GPT-4o | 0.839 | 0.837 | 0.429 | 0.788 |
| GPT-4o-original | 0.826 | 0.851 | 0.510 | 0.842 |
| GPT-4.1 | 0.877 | 0.829 | 0.590 | 0.815 |
| GPT-4.1-original | 0.851 | 0.845 | 0.646 | 0.887 |
| GPT-oss-120b | 0.853 | 0.645 | 0.763 | 0.738 |
| GPT-oss-120b-original | 0.823 | 0.825 | 0.786 | 0.869 |
2. Fine-tuning ReAct: Different Strategies and Results
We conducted three new sets of fine-tuning experiments for ReAct:
-
GPT-4.1 as data creator
-
GPT-4o as data creator, with optimized ReAct trace:
Here, based on prior discussions suggesting that noise may be introduced in ReAct traces, we refined the traces by removing the “Similar” segment after “Observation,” while retaining the essential segments: “Thought,” “Search,” and “Observation.” This adjustment aimed to reduce unnecessary noise from search operations and make the trace more coherent. Results (Table 3-Continued, row 4) indicate that this optimized trace leads to a modest increase in final accuracy.
-
GPT-oss-120b as data creator
| Table 3 - Continued (Fine-tuned on Qwen-14b-Instruct) | | StrategyQA | CommonsenseQA | GPQA | TruthfulQA |
|---|---|---|---|---|---|
| Data Creator | DeepSeek-v3 | 0.707 | 0.799 | 0.357 | 0.818 |
| | GPT-4o | 0.724 | 0.776 | 0.387 | 0.757 |
| | GPT-4o-optimized | 0.729 | 0.785 | 0.393 | 0.750 |
| | GPT-4.1 | 0.737 | 0.785 | 0.419 | 0.823 |
| | GPT-oss-120b | - | - | 0.404 | - |
| Baseline | Qwen-14b-CoT | 0.760 | 0.840 | 0.364 | 0.812 |
Despite these optimizations, fine-tuned ReAct still falls short in fully activating the model’s intrinsic knowledge, especially compared to our GRiD method. The ReAct data format tends to prioritize external information, neglecting the extraction of intrinsic knowledge, and thus fails to unlock the model’s full reasoning potential.
Moreover, while GPT-oss-120b achieves strong performance on the GPQA benchmark (with or without ReAct), fine-tuning on ReAct traces generated by GPT-oss-120b yields only a modest improvement, and its performance drops by a significant 36% compared to ReAct in prompting mode.
In contrast, GRiD is explicitly designed to extract and organize intrinsic knowledge embedded in the model weights, presenting it as explicit context. This lightweight approach reduces the risk of noise from external knowledge bases. Furthermore, the explicit step dependency graph in GRiD enables step-by-step dependency verification, improving the consistency of the reasoning trace and leading to significant performance gains.
We hope these additional experiments further clarify the strengths and contributions of our method. Thank you again for your thoughtful feedback and valuable suggestions.
This paper aims to enhance the logical consistency of CoT with grounded reasoning in dependency (GRiD). GRiD transforms question-answer pairs into a knowledge-reasoning dependency graph, where each reasoning node is linked to specific knowledge nodes. It also includes a lightweight verification module that validates reasoning steps against their dependencies. Experiments on four question-answering benchmarks demonstrate that GRiD significantly improves the performance over the original instruct models and directly finetuned models.
All of the reviewers recognized the novelty and technical contributions of this work. The idea of explicitly grounding reasoning steps and verifying them is quite interesting. Experiments are comprehensive and convincing, and ablation studies are sufficient. Also, the paper is clearly organized and well written.
Meanwhile, reviewers raised some questions regarding claims (on consistency and faithfulness), experimental settings, baselines, failure case analysis, etc. The authors have provided detailed responses with additional results, which have successfully addressed the previous concerns from reviewers. The authors are strongly encouraged to incorporate the new results and discussions into the final version of the paper.