PaperHub

Score: 7.2/10 · ICML 2025 Poster
4 reviewers · ratings 4, 5, 3, 3 (min 3, max 5, std 0.8)

KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search

Submitted: 2025-01-21 · Updated: 2025-07-24
TL;DR

We propose KBQA-o1, an agentic KBQA method with Monte Carlo Tree Search.

Abstract

Keywords

Knowledge Base Question Answering · Large Language Model · LLM Agents · Monte Carlo Tree Search

Reviews and Discussion

Review (Rating: 4)

This paper proposes KBQA-o1, which utilizes Monte Carlo Tree Search and a ReAct-based agent process to generate logical forms stepwise within a knowledge base environment. The incremental fine-tuning strategy on automatically labeled examples further enhances performance. According to the experimental results, KBQA-o1 outperforms previous few-shot KBQA methods with open-source LLMs such as Llama-3.1-8B.
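The stepwise, environment-aware generation described in this summary can be sketched roughly as follows. This is a minimal illustration, not the paper's code: `env`, `policy`, and the `FINISH` action are placeholder assumptions, and the actual KBQA-o1 agent additionally wraps this loop in MCTS.

```python
def agent_episode(question, env, policy, max_steps=10):
    """ReAct-style loop: observe the KB environment, then extend the
    logical form one action at a time until the policy emits FINISH."""
    logical_form = []
    for _ in range(max_steps):
        observation = env.observe(logical_form)   # KB-aware feedback
        thought, action = policy(question, observation)
        if action == "FINISH":
            break
        logical_form.append(action)               # stepwise extension
    return logical_form
```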

Questions for Authors

  1. In Table 7, I find that the reward threshold is higher for easier datasets like WebQSP; could you provide more insight into how you chose this parameter?
  2. Have you tried applying RL instead of SFT for optimization?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Design and Analyses

Yes

Supplementary Material

Yes, especially the proofs of the propositions, detailed parameter settings, case study, and error analysis.

Relation to Prior Literature

KBQA-o1 adapts the MCTS algorithm, inspired by o1, to KB-specific question answering tasks, and shows advantages over previous end-to-end and step-by-step methods, as it allows stepwise adjustment through KB environment awareness.

Essential References Not Discussed

NA

Other Strengths and Weaknesses

Strengths:

  1. This paper proposes KBQA-o1, which shows good potential for solving KBQA in low-resource settings with open-source models.
  2. The experiments and analysis are comprehensive.

Weaknesses:

  1. The efficiency analysis is reported as the number of queries per minute; just wondering how many queries need to be executed for one target question on average.

Other Comments or Suggestions

  1. There may be a typo before Equation (13): "Then, we discard the annotation by choosing if the answer set is not empty..." should read "...if the answer set is empty..."
  2. In Equation (10), the upper and lower bounds of the summation operator should be swapped.
Author Response

Thank you very much for your time and effort in reviewing our paper. We sincerely appreciate your feedback. Below, we respectfully provide our detailed responses to address your concerns.

W1: The efficiency analysis is reported as the number of queries per minute; just wondering how many queries need to be executed for one target question on average.

  • We thank the reviewer for raising this concern about efficiency. In Section 5.4 (Comparison Analysis) and Figures 4(c) and 5(c), we have already provided a detailed analysis of the trade-off between query frequency (queries per minute) and accuracy (F1 score).
  • It is important to emphasize that our KBQA-o1 method adopts a dedicated efficiency-oriented MCTS parameter setting θ_eff during the prediction phase (see Section 4.3 and Figure 5(c)), which significantly reduces the number of queries per question. Compared with the exploration phase that uses higher MCTS weights, the prediction phase achieves a substantial improvement in overall efficiency while sacrificing only a marginal amount of accuracy. Moreover, Figure 4(c) provides a comparative analysis with other baseline methods under the same evaluation protocol, showing that KBQA-o1 achieves higher accuracy while maintaining a competitive level of query efficiency.
  • Regarding the average number of queries per question, since KBQA-o1 adopts Monte Carlo Tree Search (MCTS), which is a tree-based heuristic search algorithm, the number of queries is not fixed, but dynamically determined by the search space, question complexity, and the model’s policy. Therefore, we argue that query frequency (queries per minute) is a more comprehensive and practical indicator of efficiency in real-world applications.

C1: There may be a typo before Equation (13): "Then, we discard the annotation by choosing if the answer set is not empty..." should read "...if the answer set is empty..."

  • Thank you for pointing out the typo. The sentence before Equation (13) should indeed read “if the answer set is empty” instead of “not empty.” This will be corrected in the revised version.

C2: In Equation (10), the upper and lower bounds of the summation operator should be swapped.

  • Thanks for noticing the issue in Equation (10). The summation bounds should indeed be reversed, and we will make the necessary correction in the updated paper.

Q1: In Table 7, I find that the reward threshold is higher for easier datasets like WebQSP; could you provide more insight into how you chose this parameter?

  • Thank you for the question. The reward threshold γ* is indeed a key parameter in filtering auto-labeled samples during incremental fine-tuning. While we adopt a unified reward model across all datasets, the distribution of reward scores varies due to differences in dataset difficulty, logical form complexity, and question types.
  • For relatively easier datasets like WebQSP, the generated logical forms are typically shorter and more confident, leading to overall higher reward scores. To ensure quality, we set a higher γ* to filter out over-confident but potentially incorrect samples. Conversely, in more complex datasets such as GrailQA and GraphQ, the model is more conservative, and the reward scores tend to be lower. Thus, a lower γ* is chosen to retain sufficient high-quality samples for fine-tuning.
  • To determine γ*, we follow a validation-based selection strategy:
    1. We first apply the reward model to score auto-labeled logical forms on a validation subset;
    2. We then plot the relationship between the reward threshold γ*, the proportion of selected samples, and their downstream F1 performance, as shown in Figure 5(b);
    3. Finally, we choose the γ* that optimizes the trade-off between data quality (reward score) and model improvement (F1 score).
  • To improve efficiency, in practice, we adopt a simple yet effective strategy: we set γ* such that approximately the top 90% of auto-labeled samples are retained, filtering out only the bottom 10% with the lowest reward scores.
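As a concrete illustration of this percentile rule, a sketch follows. This is not the paper's implementation; it only assumes reward scores are plain numbers and that γ* is set so the top ~90% of samples survive.

```python
def select_threshold(scores, keep_ratio=0.9):
    """Pick gamma* so that roughly the top `keep_ratio` fraction of
    auto-labeled samples clears the threshold."""
    ranked = sorted(scores)                    # ascending reward scores
    keep = round(len(ranked) * keep_ratio)     # number of samples to retain
    cut = min(len(ranked) - keep, len(ranked) - 1)
    return ranked[cut]

def filter_samples(samples, scores, gamma_star):
    """Retain only samples whose reward score reaches the threshold."""
    return [s for s, r in zip(samples, scores) if r >= gamma_star]
```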

Q2: Have you tried applying RL instead of SFT for optimization?

  • Thank you for your question. Due to the length limitation, please refer to our response to the last question of Reviewer 369N, which is the same as this one.

At last, we sincerely appreciate your valuable feedback, and we will carefully consider all your suggestions to further improve our paper. Thank you very much!

Reviewer Comment

Thanks for the clarification from the authors. I will maintain my score.

Review (Rating: 5)

The paper introduces KBQA-o1, a novel agentic Knowledge Base Question Answering (KBQA) method that leverages Monte Carlo Tree Search (MCTS) to address challenges in KBQA, such as weak KB awareness, the trade-off between effectiveness and efficiency, and high reliance on annotated data. The proposed method employs a ReAct-based agent process for stepwise logical form generation and uses MCTS to balance exploration performance and search space. Additionally, KBQA-o1 generates high-quality auto-annotated data through heuristic exploration, reducing the need for extensive human annotation.

Questions for Authors

Please see the weaknesses.

Claims and Evidence

The claims are well-motivated and largely supported by theoretical and empirical evidence.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are appropriate and well-aligned with the paper’s goals.

Theoretical Claims

I reviewed the theoretical claims, including proofs in the main paper.

Experimental Design and Analyses

The experimental designs are largely sound.

Supplementary Material

I reviewed all the supplementary material.

Relation to Prior Literature

N/A

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Weaknesses:

Limited Evaluation on Real-World Scenarios: While the paper demonstrates strong performance on benchmark datasets like GrailQA, WebQSP, and GraphQ, it lacks evaluation in real-world, noisy, or incomplete knowledge base scenarios. Real-world KBs often contain incomplete or inconsistent data, and the robustness of KBQA-o1 in such settings remains unclear. Including experiments on more diverse and noisy datasets would strengthen the paper's claims about the model's practical applicability.

Scalability Concerns: The paper does not thoroughly address the scalability of the proposed method, especially when dealing with extremely large knowledge bases. Although MCTS is designed to balance exploration and exploitation, the computational overhead of performing multiple rollouts on large-scale KBs could be significant. A more detailed analysis of the computational complexity and runtime performance on larger KBs would be beneficial.

Other Comments or Suggestions

Please see the weaknesses.

Author Response

Thank you very much for your time and effort in reviewing our paper. We sincerely appreciate your feedback. Below, we respectfully provide our detailed responses to address your concerns.

Q1: Limited Evaluation on Real-World Scenarios: While the paper demonstrates strong performance on benchmark datasets like GrailQA, WebQSP, and GraphQ, it lacks evaluation in real-world, noisy, or incomplete knowledge base scenarios. Real-world KBs often contain incomplete or inconsistent data, and the robustness of KBQA-o1 in such settings remains unclear. Including experiments on more diverse and noisy datasets would strengthen the paper's claims about the model's practical applicability.

  • Thank you for raising this important point. We fully agree that evaluation under real-world, noisy, or incomplete KB conditions is critical for practical applicability.
  • KBQA-o1 is inherently designed to address real-world challenges such as missing entities, non-standard relations, and schema inconsistency. Its environment-aware agent dynamically interacts with the KB during logical form construction, enabling stepwise adaptation to incomplete or unstable KB structures—an advantage over static or end-to-end approaches.
  • To improve robustness, KBQA-o1 uses SimCSE-based semantic matching in MCTS expansion, allowing flexible matching of noisy or ambiguous relations. This reduces reliance on rigid schema annotations and improves generalization to noisy KBs.
  • While our experiments are conducted on standard datasets, GraphQ in particular is widely recognized as a noisy and structurally diverse benchmark. KBQA-o1 achieves a 19.4 F1-point improvement over previous methods on GraphQ, showing strong performance under noisy conditions.
  • We further simulate real-world scenarios via incremental self-supervised learning: starting with minimal labeled data, KBQA-o1 explores unlabeled questions using MCTS and filters high-quality logical forms via a reward model for fine-tuning. This aligns with the practical need for robustness under low-annotation settings.
  • We also identify real-KB evaluation as a key future direction. Our Impact Statement and Appendix outline plans to test on domain-specific KBs (e.g., medicine, law), simulate KB incompleteness via subgraphs, and explore continual learning strategies like DPO.

Q2: Scalability Concerns: The paper does not thoroughly address the scalability of the proposed method, especially when dealing with extremely large knowledge bases. Although MCTS is designed to balance exploration and exploitation, the computational overhead of performing multiple rollouts on large-scale KBs could be significant. A more detailed analysis of the computational complexity and runtime performance on larger KBs would be beneficial.

  • Thank you for raising this important concern. We fully recognize that scalability is critical for practical deployment of KBQA on large-scale knowledge bases.
  • While KBQA-o1 adopts MCTS, it does not rely on exhaustive search. Instead, it integrates an environment-aware agent with policy-guided local exploration. At each step, we use SimCSE-based semantic retrieval (Eq. 6) to narrow the candidate actions to a small, relevant subset, significantly reducing the search space—even in large KBs.
  • To balance quality and efficiency, we apply stage-specific MCTS settings: a higher exploration weight w = 50 during training to ensure logical form quality, and a lightweight setting w = 10 during inference to reduce rollout cost. As shown in Figure 5(c), this effectively ensures both performance and scalability.
  • We also present empirical results on query throughput vs. accuracy (Figure 4(c)), showing that KBQA-o1 outperforms CoT and ToT variants in both accuracy and runtime. In the final version, we will further include a theoretical analysis of complexity, including rollout cost, SimCSE retrieval overhead, and search depth.
  • Importantly, the experiments in our paper are conducted on Freebase, which itself is a massive real-world KB with tens of millions of entities and triples. The fact that KBQA-o1 operates efficiently on Freebase already demonstrates its practical scalability under large-KB conditions.
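The candidate-narrowing step described above can be sketched as follows. This is an illustration only: it ranks KB relations by cosine similarity to the question and keeps the top k as expansion actions; the real system uses SimCSE embeddings, whereas the vectors and relation names here are stand-ins.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_actions(question_vec, candidates, k=2):
    """Keep only the k relations most similar to the question as MCTS
    expansion actions; `candidates` maps relation name -> embedding."""
    ranked = sorted(candidates,
                    key=lambda rel: cosine(question_vec, candidates[rel]),
                    reverse=True)
    return ranked[:k]
```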

At last, we sincerely appreciate your valuable feedback, and we will carefully consider all your suggestions to further improve our paper. Thank you very much!

Reviewer Comment

Thank you for your clarification. I apologize for missing the formal proofs in the appendix — after reading them carefully, I find the theoretical justification solid. Your explanation for choosing MCTS over standard RL also makes sense. I now believe the paper meets the accept standard and have updated my score.

I have a minor (non-blocking) question: could you elaborate a bit more on how KBQA-o1 might leverage recent advances such as GRPO in the future?

Author Comment

Thank you for your support. Regarding your new question, our response is as follows:

GRPO is a reinforcement learning algorithm proposed by DeepSeek for large reasoning models, designed to replicate the long chain-of-thought reasoning capabilities of OpenAI's o1 model. It has been shown to be highly effective for generating long reasoning trajectories. Recent works such as Search-R1 and R1-Searcher further extend this line of research by integrating reasoning-oriented reinforcement learning with external search engines.

For the KBQA task, we believe there is strong potential to adapt this approach by integrating reinforcement learning with a knowledge graph as the environment. In this context, KBQA-o1 serves as a solid foundational framework.

In addition, we can offer a comparison between GRPO and MCTS. Both share the characteristic of end-to-end reward-signal propagation. However, GRPO operates at the token level, making it more suitable for textual reasoning tasks such as chain-of-thought (CoT) generation. In contrast, MCTS functions at the step level, which may be more appropriate for structured query generation tasks that require explicit interaction with the environment. We plan to further investigate and analyze the differences between the two approaches in future work.

Review (Rating: 3)

This paper proposes a novel agentic KBQA framework that integrates Monte Carlo Tree Search (MCTS) with large language models (LLMs) to address challenges in low-resource and complex reasoning scenarios.

There are too many baselines not discussed or compared, which makes this paper far from technically sound, since both performance and efficiency are unsatisfactory.

Questions for Authors

Please check the comments and weaknesses.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

  1. MCTS requires multiple rollouts and tree expansions, leading to increased latency for complex queries.
  2. The reward model evaluates logical forms based on syntax and answer alignment but ignores semantic plausibility. Also, there is no exploration-exploitation trade-off: a high exploration weight w improves accuracy but slows down inference, while lower weights risk suboptimal performance.

Theoretical Claims

The methods and comparisons are theoretically analyzed.

Experimental Design and Analyses

  1. The efficiency is really bad. ReAct is already not suitable for QA, let alone the combination with MCTS. Six minutes for one query is not acceptable for either research or industrial scenarios.
  2. Too many advanced baselines are missing from the comparison, e.g., RoG, GNN-RAG, StructGPT.
  3. The performance is not satisfying, even though many baselines are not compared. For example, RoG achieves 70.8% F1 on WebQSP.

Supplementary Material

I reviewed the proofs roughly.

Relation to Prior Literature

Closely related.

Essential References Not Discussed

Too many baselines are missing from the discussion and comparison, e.g., RoG, GNN-RAG, StructGPT, etc.

Other Strengths and Weaknesses

NA

Other Comments or Suggestions

NA

Author Response

Thank you very much for your time and effort in reviewing our paper. We sincerely appreciate your feedback. We understand that your main concerns center around two key aspects: performance and efficiency. Below, we respectfully provide our detailed responses to address these points.

< Performance >

Q1: "Too many baselines are missing from the discussion and comparison, e.g., RoG, GNN-RAG, StructGPT, etc."

  • Thanks. Methods like RoG, GNN-RAG, and ChatKBQA rely on annotated data (e.g., WebQSP, CWQ) for tuning. However, as Gu et al. note, such data is often unavailable in practice, making current KBQA approaches overly dependent on supervision. This motivates few-shot evaluation (<=100 examples), which is the setting adopted by KBQA-o1 and all baselines in our work, as shown in Table 2.
  • We also evaluated KBQA-o1 under full supervision, where it also performs strongly. However, as leaderboard gains have plateaued, we focus on the more practical challenge of low-resource KBQA and omit these results from the paper. For reference, the fully supervised comparison is:
| Type | Method | WebQSP F1 | WebQSP Hits@1 | WebQSP Acc | CWQ F1 | CWQ Hits@1 | CWQ Acc |
|---|---|---|---|---|---|---|---|
| End-to-end | RoG | 70.8 | 85.7 | - | 56.2 | 62.6 | - |
| End-to-end | GNN-RAG | 73.5 | 82.8 | - | 60.4 | 62.8 | - |
| End-to-end | ChatKBQA | 83.5 | 86.4 | 77.8 | 81.3 | 86.0 | 76.8 |
| Step-by-step | Pangu | 79.6 | - | - | - | - | - |
| Step-by-step | StructGPT | 72.6 | - | - | - | - | - |
| Step-by-step | ToG | - | - | 82.6 | - | - | 69.5 |
| Step-by-step | KG-Agent | 81.0 | 83.3 | - | 69.8 | 72.2 | - |
| Heuristic | KBQA-o1 (Ours) | 85.7 | 88.3 | 81.7 | 83.9 | 89.5 | 80.7 |

Q2: "The performance is not satisfying, even though many baselines are not compared. For example, RoG achieves 70.8% F1 on WebQSP."

  • Thanks. Compared to RoG’s 70.8 F1 under full supervision, KBQA-o1 achieves 67.0 F1 with only 100 labels, showing strong low-resource potential. Under full supervision, the same setting as RoG’s, KBQA-o1 further reaches 85.7 F1, outperforming RoG and achieving SOTA. However, our focus remains on the more practical low-resource KBQA setting, not fully supervised ones.

< Efficiency >

Q3: "Six minutes for one query is not acceptable for either research or industrial scenarios."

  • Thanks. There is a factual misunderstanding. As shown in Figure 4(c), the x-axis represents Query per Minute, and KBQA-o1 achieves approximately 6 queries per minute, not “Six minutes for one query” as the review states.
  • With an average of ~10 seconds per query, KBQA-o1 offers efficient, high-quality KB reasoning for either research or industrial scenarios.

Q4: "There is no exploration and exploitation trade-off since high exploration w improves accuracy but slows down the inference, while lower weights risk suboptimal performance."

  • Thanks. Indeed, increasing w enhances accuracy while decreasing efficiency. However, this relationship is not linear. As shown in Figure 5(c), both effectiveness and efficiency stabilize after w reaches a certain threshold, indicating that a good trade-off can be achieved with a properly chosen w.
  • This trade-off arises from the reward mechanism and the UCT selection algorithm in MCTS, making MCTS inherently heuristic. A detailed proof is provided in Appendix B.2.
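For readers unfamiliar with the mechanism, the UCT selection rule mentioned above can be sketched roughly as follows. The node statistics and weight values are illustrative assumptions, not the paper's configuration; the point is only how `w` trades exploitation against exploration.

```python
import math

def uct_score(value_sum, visits, parent_visits, w):
    """Exploitation (mean value) plus a w-weighted exploration bonus."""
    if visits == 0:
        return float("inf")          # always try unvisited actions first
    exploit = value_sum / visits
    explore = math.sqrt(math.log(parent_visits) / visits)
    return exploit + w * explore

def select_child(children, parent_visits, w):
    """children: list of (value_sum, visit_count); returns chosen index."""
    return max(range(len(children)),
               key=lambda i: uct_score(*children[i], parent_visits, w))
```

A higher `w` pushes selection toward less-visited branches (more KB queries, broader search), while a lower `w` stays greedy and cheap, which mirrors the trade-off discussed in the response.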

< Other Comments >

Q5: "The reward model evaluates logical forms based on syntax and answer alignment but ignores semantic plausibility."

  • Thanks. Our reward is not solely based on syntax or answer alignment. As shown in Equation (9), it combines the policy model’s semantic score and the reward model’s syntax score via weighted fusion, enabling a more robust evaluation that accounts for both semantic plausibility and structural correctness.
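A toy sketch of such a weighted fusion follows. The weight `alpha` and the score ranges are placeholders, not the exact form of Equation (9) in the paper.

```python
def fused_reward(semantic_score, syntax_score, alpha=0.5):
    """Linearly blend the policy model's semantic plausibility score
    with the reward model's syntactic correctness score."""
    return alpha * semantic_score + (1.0 - alpha) * syntax_score
```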

Q6: "ReAct is already not suitable for QA, let alone the combination with MCTS. "

  • Thanks. In KBQA-o1, we only use ReAct as a standardized prompt format to formulate the agent process. These prompts are fixed and embedded via instruction tuning. Thus, whether ReAct is “suitable for QA” is irrelevant to our setting and does not impact our method’s effectiveness.

At last, we sincerely appreciate your valuable feedback, and we will carefully consider all your suggestions to further improve our paper. We would be deeply grateful if you could kindly reconsider raising the score to 3 or above. Thank you very much!

Reviewer Comment

Dear Authors,

Thanks for the rebuttal. I have increased my score to 3.

Author Comment

Thank you for your support. Once again, we sincerely appreciate your responsible and insightful review. We will continue to refine this work and guide our future efforts based on your valuable suggestions.

Review (Rating: 3)

The paper presents KBQA-o1, an agentic Knowledge Base Question Answering (KBQA) method that integrates Monte Carlo Tree Search (MCTS) for improved logical form generation. It addresses challenges in KB awareness, search efficiency, and reliance on annotated data by employing a ReAct-based agent process and incremental fine-tuning. Experiments on GrailQA, WebQSP, and GraphQ show that KBQA-o1 outperforms previous low-resource methods and approaches fully supervised performance, demonstrating strong generalization and adaptability across multiple LLMs.

Questions for Authors

Please refer to the above section.

Claims and Evidence

Yes. The paper provides substantial empirical evidence to support its main claims.

Methods and Evaluation Criteria

The methods and evaluation criteria in the paper are generally appropriate and well-aligned with the KBQA task. The authors evaluate KBQA-o1 on three widely used benchmark datasets—GrailQA, WebQSP, and GraphQ—which are standard for assessing KBQA models, particularly in low-resource settings. The use of F1 score and Exact Match (EM) as evaluation metrics is also consistent with prior work in this domain.

Theoretical Claims

The paper presents several theoretical claims related to the effectiveness of its agentic KBQA approach with Monte Carlo Tree Search (MCTS): Proposition 4.1, that the agent’s awareness of the KB environment improves logical form generation compared to end-to-end methods; Proposition 4.2, that the MCTS-based heuristic method balances search efficiency and effectiveness better than Chain-of-Thought (CoT) and Tree-of-Thought (ToT) methods; and Proposition 4.3, that there exists a reward threshold γ* that ensures incremental fine-tuning improves model performance. The correctness of these claims is primarily supported by empirical results rather than formal mathematical proofs. The paper references experimental findings (Section 5.4 and the appendices) as qualitative or quantitative justification but does not provide rigorous theoretical derivations.

Experimental Design and Analyses

The experimental design and analysis in the paper are generally sound and well-structured, providing strong empirical support for the proposed KBQA-o1 method. The authors evaluate their approach on three widely used KBQA benchmarks (GrailQA, WebQSP, GraphQ) under a low-resource setting, which aligns well with the paper’s focus on improving performance with limited annotated data.

Supplementary Material

Yes, all parts.

Relation to Prior Literature

The paper’s contributions are well-situated within the broader scientific literature on Knowledge Base Question Answering (KBQA), heuristic search methods, and large language models (LLMs) for reasoning. It builds upon existing techniques while introducing novel elements to improve logical form generation and exploration efficiency.

Novelty & Contribution to Literature

Agentic KBQA with MCTS: The combination of ReAct agents and MCTS for KBQA reasoning appears to be a novel approach that improves search efficiency while maintaining flexibility.

Incremental Fine-Tuning for KBQA: The method’s use of self-annotated logical forms aligns with semi-supervised learning approaches, providing a scalable alternative to purely supervised KBQA models.

Improved Low-Resource Performance: Unlike previous KBQA methods that depend heavily on large annotated datasets, KBQA-o1 achieves strong performance with limited supervision, making it more practical for real-world applications.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Strengths:

Well-Defined Problem Scope & Contribution: The paper clearly identifies key challenges in KBQA, such as poor KB awareness, large search spaces, and reliance on annotated data, and proposes a well-motivated solution with KBQA-o1. The integration of ReAct-based agent reasoning with Monte Carlo Tree Search (MCTS) is a creative and effective combination of existing ideas.

Strong Empirical Performance: The method outperforms state-of-the-art low-resource KBQA methods on standard benchmarks (GrailQA, WebQSP, GraphQ). It demonstrates competitive performance even against fully supervised methods, highlighting its effectiveness in low-data scenarios.

Weaknesses & Suggestions:

  1. A small-scale experiment on a different KB structure would strengthen claims of broad applicability.
  2. A discussion on why MCTS was chosen over standard RL for guiding logical form exploration would be beneficial.

Other Comments or Suggestions

Please refer to the above section.

Author Response

Thank you very much for your time and effort in reviewing our paper. We sincerely appreciate your feedback. Below, we respectfully provide our detailed responses to address your concerns.

Q1: Proposition 4.3 – There exists a reward threshold γ* that ensures incremental fine-tuning improves model performance. The correctness of these claims is primarily supported by empirical results, rather than formal mathematical proofs. The paper references experimental findings (Section 5.4 and Appendices) as qualitative or quantitative justification but does not provide rigorous theoretical derivations.

  • Thank you for your comments. The core components of KBQA-o1 in our paper include Agent Initialization, Heuristic Environment Exploration, and Incremental Fine-Tuning. We provide formal justifications for the effectiveness of these three key modules through Propositions 4.1, 4.2, and 4.3, respectively.
  • You mentioned that “these claims are primarily supported by empirical results, rather than formal mathematical proofs” and that the paper “does not provide rigorous theoretical derivations.” However, we would like to kindly clarify that in addition to the quantitative experimental results in Section 5, we have provided detailed theoretical derivations supporting these propositions in Appendices B.1, B.2, and B.3.
  • We are unsure whether this might have been an oversight or if you believe the theoretical proofs require further improvement. Please kindly let us know so we can better address your concerns.

Q2: A small-scale experiment on a different KB structure would strengthen claims of broad applicability.

  • Thank you for the suggestion. As current KBQA tasks are primarily based on large-scale RDF knowledge bases such as Freebase and Wikidata—each containing tens or hundreds of millions of nodes—the task remains relatively complex. We have validated the effectiveness of our method across three datasets with different distributions: GrailQA, WebQSP, and GraphQ. In particular, we evaluated our approach on the more comprehensive GrailQA dataset under various settings, including I.I.D., Compositional, and Zero-Shot, which aligns with the majority of existing KBQA benchmarks. This ensures both the solidity and applicability of our approach.
  • As future work, we plan to extend our experiments to different types of knowledge base structures on a broader scale. For instance, we aim to apply our method to custom-built knowledge graphs such as those used in GraphRAG tasks, as well as to property graphs or hypergraph-based knowledge bases, to further demonstrate the broader applicability of our approach.

Q3: A discussion on why MCTS was chosen over standard RL for guiding logical form exploration would be beneficial.

  • We initially experimented with the Direct Preference Optimization (DPO) algorithm as a standard RL-based approach for guiding logical form exploration. However, DPO requires high-quality negative samples to be effective. To this end, we attempted to construct negative samples by leveraging the erroneous branches generated during the MCTS search.
  • Nevertheless, we observed a critical challenge: due to the dense structure of large-scale knowledge bases, the differences between correct and incorrect logical forms are often very subtle. For instance, the correct relation might be film.actor.film, while a near-miss incorrect one could be tv.tv_actor.starring_roles. Despite their structural difference, these relations are semantically very close. DPO struggles to distinguish such fine-grained differences in structured outputs, resulting in subpar performance compared to supervised fine-tuning (SFT) with only positive samples.
  • Given these limitations, we opted to use MCTS for logical form exploration, which provides a more interpretable and controllable mechanism for searching over the space of structured queries. Furthermore, we note recent advances such as GRPO, used in DeepSeek-R1, which applies end-to-end reinforcement learning with reward signals to guide structured generation. Inspired by this, we plan to explore replacing the MCTS process with an end-to-end RL paradigm in future work to further enhance performance.
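The negative-sample construction attempted in the bullets above could look roughly like this. It is a hypothetical illustration: the tuple fields, the "hardest negatives first" ordering, and the logical-form strings are assumptions, not the authors' data (and the response notes this DPO setup ultimately underperformed SFT).

```python
def build_dpo_pairs(question, branches):
    """branches: (logical_form, reward, is_correct) triples collected
    from MCTS; pair the best correct form with each wrong branch,
    highest-reward (hardest) negatives first."""
    correct = [b for b in branches if b[2]]
    wrong = sorted((b for b in branches if not b[2]),
                   key=lambda b: b[1], reverse=True)
    if not correct or not wrong:
        return []
    chosen = max(correct, key=lambda b: b[1])[0]
    return [{"prompt": question, "chosen": chosen, "rejected": lf}
            for lf, _, _ in wrong]
```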

At last, we sincerely appreciate your valuable feedback, and we will carefully consider all your suggestions to further improve our paper. Thank you very much!

Final Decision

This paper introduces KBQA-o1, an agentic KBQA method that combines MCTS and LLMs to improve logical form generation in KBQA. The method shows strong performance on multiple datasets (GrailQA, WebQSP, and GraphQ), even in low-resource settings. It uses a ReAct-based process and MCTS to tackle challenges like weak KB awareness, inefficiency in search, and too much reliance on annotated data. The paper does a good job of backing up its claims with experimental evidence, and it proposes some interesting new ways to deal with complex reasoning tasks. However, there are a few weaknesses, like limited evaluation in real-world scenarios and scalability issues when working with very large knowledge bases.

The paper could improve by testing the method on more diverse and noisy KBs (real-world situations) and discussing scalability concerns more thoroughly. More experiments with large-scale KBs and practical, messy data would help show how well the method works outside controlled settings.