Is Complex Query Answering Really Complex?
We highlight how common benchmarks for complex query answering with neural models are skewed towards "simple" queries, and we propose new, more challenging benchmarks that address this issue.
Abstract
Reviews and Discussion
The authors carefully analysed the limitations of the benchmark datasets used by current SOTA neural KG query-answering methods, showing that these benchmarks mostly reflect the performance of predicting answers where only one link is truly missing, and that FB15k237 and NELL995 are therefore not suitable for precisely assessing the performance of QA systems. The authors proposed new benchmark datasets and tested the main SOTA QA systems, showing that the performance of all of them drops significantly.
Questions For Authors
If we make a decision by tossing a coin, we will have 50% accuracy. But we do not believe we are doing logical reasoning. Should there be a minimum value of accuracy, below which we should not say a system is doing logical reasoning (here, logical query answering)?
Claims And Evidence
The authors' claims are strongly supported by reproducing the experiments of existing SOTA systems and by conducting their own experiments with their new datasets.
Methods And Evaluation Criteria
The authors conducted their experiments very carefully and analysed existing benchmark datasets.
Theoretical Claims
The authors clearly list three takeaways:
1. Do not use current CQA benchmarks, as they essentially reflect only the performance of predicting answers where only one link is truly missing (for both hybrid and neural solvers).
2. FB15k237 and NELL995 are not suitable for precisely assessing the capability of CQA methods to answer complex queries, as most of their QA pairs can be answered by predicting a single missing link. This results in highly inflated performance that distorts the perception of progress in the field and pushes the community to improve performance only on the easiest QA pairs.
3. The new benchmarks have a balanced amount of partial-inference and full-inference QA pairs, depending on the query type, allowing CQA performance to be measured without distorting the aggregate performance across QA pairs of different hardness. Additionally, ICEWS18+H highlights that more realistic sampling strategies are more challenging for the current SoTA.
Experimental Designs Or Analyses
The experimental design is quite straightforward and easy to understand.
Supplementary Material
The authors provided substantial supplementary material to support their claims.
Relation To Broader Scientific Literature
The key contributions of this paper are related to the broader scientific literature.
Essential References Not Discussed
To my knowledge, the authors discussed the major SOTA papers in the field.
Other Strengths And Weaknesses
A piece of beautiful work! Weakness: the authors do not explain, at a methodological level, why these SOTA CQA systems fail for complex queries.
Other Comments Or Suggestions
no further comment.
We thank the reviewer for considering our paper a piece of beautiful work and for considering our experimental design easy to understand. We proceed by answering their questions.
the authors do not explain, at a methodological level, why these SOTA CQA systems fail for complex queries.
In Table 2 we showed that the performance of all SoTA methods consistently increases for QA pairs that have a higher number of existing links in their reasoning tree. This is evidence that most of the "good" performance of CQA methods can be explained by their memorization ability. We conclude that all methods lose performance on the new benchmarks because they overly rely on memorized facts, limiting their ability to predict answers when many links in the reasoning tree are missing. We will add this conclusion explicitly in the camera-ready.
If we make a decision by tossing a coin, we will have 50% accuracy. But we do not believe we are doing logical reasoning. Should there be a minimum value of accuracy, below which we should not say a system is doing logical reasoning (here, logical query answering)?
We see where the reviewer is leading, but for CQA, even if query answering under missing links is described using logical operators, the results it produces are derived in a mixed deductive-inductive process that returns a ranking of answer entities (measured by MRR) rather than a set of logical conclusions (which are usually meant to be derived by deduction and whose correctness might be measured in terms of accuracy). Additionally, since knowledge graphs contain thousands of entities, a randomly selected ranking would result in an MRR close to zero. Hence, it is less informative to set a hard MRR threshold to determine whether a model can be considered capable of reasoning.
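For intuition (not from the paper), here is a minimal sketch estimating the MRR of a uniformly random ranking over a KG-sized entity set; the entity count is an assumed, illustrative value roughly at FB15k-237 scale:

```python
import random

def random_ranking_mrr(num_entities: int, num_queries: int = 10_000, seed: int = 0) -> float:
    """Expected MRR when the correct answer lands at a uniformly random rank."""
    rng = random.Random(seed)
    # The rank of the single correct entity is uniform over 1..num_entities.
    ranks = (rng.randint(1, num_entities) for _ in range(num_queries))
    return sum(1.0 / r for r in ranks) / num_queries

# With ~14.5k entities (an assumed value, roughly FB15k-237 scale),
# a random ranking yields an MRR of about 0.0007 -- essentially zero.
print(random_ranking_mrr(14_500))
```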
This paper critically examines the validity of current benchmarks for complex query answering (CQA) on knowledge graphs (KGs). The authors argue that existing benchmarks (e.g., FB15k237, NELL995) overrepresent "easy" queries that reduce to simpler tasks like link prediction (1p), distorting perceived progress in CQA. They demonstrate that up to 98% of test queries in these benchmarks can be answered by predicting only one missing link, even for ostensibly complex query types (e.g., 2i1p). To address this, the authors propose new benchmarks (FB15k237+H, NELL995+H, ICEWS18+H) that balance query difficulty by ensuring answers require reasoning over multiple missing links. Experiments show state-of-the-art (SoTA) CQA methods perform significantly worse on these harder queries, revealing their reliance on memorizing training data. The paper introduces a hybrid solver (CQD-Hybrid) combining neural link prediction with graph traversal and highlights the need for benchmarks that better reflect real-world KG incompleteness.
Questions For Authors
Suggestion: Include error bars in Tables 2/A.5–A.6.
Claims And Evidence
- Current CQA benchmarks are skewed toward easy queries reducible to link prediction. Systematic analysis of FB15k237/NELL995 shows 86.8–98.3% of complex queries (e.g., 2i1p) reduce to 1p (Table 1).
- SOTA methods overperform due to memorization of training links. Performance drops sharply on full-inference queries (Table 2). Hybrid solvers (CQD-Hybrid, QTO) outperform neural models by leveraging training links.
- Union queries (2u, 2u1p) are artificially hard due to non-existing links. Filtering non-existing links improves MRR (Table 3).
Methods And Evaluation Criteria
- Proposed benchmarks: FB15k237+H and NELL995+H balance query types and ensure full-inference answers, while ICEWS18+H uses temporal splits for realism. This addresses the distribution bias in prior benchmarks, and the temporal splits better mimic real-world KG updates. However, there is no discussion of whether the new distributions reflect real-world query patterns (e.g., the frequency of 4p/4i queries).
- CQD-Hybrid: Combines neural link prediction with graph traversal. Demonstrates hybrid methods’ superiority (Table A.3). Limited comparison to other hybrid methods (e.g., FIT).
Theoretical Claims
The paper makes no explicit theoretical claims. The reasoning tree formalism (Sec. 3) is intuitive but lacks formal proofs.
Experimental Designs Or Analyses
- Re-evaluating SoTA models on reduced query types (Tables 2, A.5–A.6) is rigorous.
- Missing statistical significance tests for performance differences.
- Removing non-existing links (Table 3) is valid but lacks details on filtering criteria.
Supplementary Material
Reviewed Appendices A.1 (negation analysis), A.3 (CQD-Hybrid details), and A.5–A.6 (full results). The negation analysis (Table A.1) strengthens the main claims but lacks visual examples.
Relation To Broader Scientific Literature
- Builds on prior CQA frameworks (Hamilton et al., 2018; Ren et al., 2020) and neural models (Arakelyan et al., 2021; Zhu et al., 2022).
- Extends critique of benchmark validity similar to issues in link prediction (Kadlec et al., 2017).
- Hybrid solvers (CQD-Hybrid, QTO) align with recent trends in neuro-symbolic reasoning (Bai et al., 2023; Yin et al., 2024).
Essential References Not Discussed
- FIT (Yin et al., 2024): A concurrent hybrid solver with score calibration, which could contextualize CQD-Hybrid’s performance.
- BetaE (Ren & Leskovec, 2020): A model handling negation, not discussed in negation analysis.
Other Strengths And Weaknesses
N/A
Other Comments Or Suggestions
N/A
We thank the reviewer for providing a precise summary of the paper and for considering our analysis rigorous. We proceed by answering their questions; we believe all of them can be easily addressed.
no discussion of whether the new distributions reflect real-world query patterns (e.g., frequency of 4p/4i queries)
There is no ground truth for the frequency of query patterns in any particular real-world application against which either the new or the old benchmarks could be compared. We highlight that there is no indication that the over-represented frequency of 1p reductions in the old benchmarks (see Table 1) is representative, and it skews the perception of progress.
Limited comparison to other hybrid methods (e.g., FIT)
As we argue in Section 5 (lines 323-325, left column), and as stated in the FIT paper (Appendix G.2 and Table 5 of FIT), the FIT method works like QTO on the query types that we consider in our evaluations. Therefore, we do not include an evaluation of FIT.
BetaE not discussed in negation analysis.
We chose a baseline method for each relevant SoTA category. Concerning geometric embeddings (e.g., BetaE, Query2Box, ConE), we selected ConE as an established baseline that can handle negation, and it was shown to outperform BetaE.
no explicit theoretical claims. The reasoning tree formalism (Sec. 3) is intuitive but lacks formal proofs
In Sec. 3, we introduce the concept of "reasoning tree" in the running text to syntactically delineate trivial, inference-based, partial-inference, and full-inference QA pairs. We do not make formal claims in Section 3 that could be proven. In line 210, we should rather say "We hypothesize…" (rather than "We claim…"); this hypothesis is later empirically validated, but we do not see how it could be proven formally, since the hardness we refer to concerns the effectiveness of making correct predictions (and not a soundness or algorithmic-complexity statement).
Missing statistical significance tests for performance differences
By running a Mann-Whitney paired U test, we obtain low p-values when comparing the reciprocal-rank distributions of answers between models, confirming that our claims remain unchanged from those reported in our tables. For example, when executing the test on the reciprocal ranks of 2p QA pairs computed with CQD and CQD-Hybrid, we obtain a p-value of 1.5e-4. We will include the full results in the camera-ready.
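For reference, a minimal sketch of how such a test could be run, assuming per-query reciprocal ranks for both models are available as arrays (the arrays below are hypothetical placeholders generated for illustration, not the paper's results):

```python
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon

rng = np.random.default_rng(0)

# Hypothetical per-QA-pair reciprocal ranks for the same set of 2p test queries;
# in practice these would be exported from the evaluation of each model.
rr_cqd = rng.beta(0.5, 5.0, size=5000)                               # placeholder for CQD
rr_cqd_hybrid = 0.7 * rr_cqd + 0.3 * rng.beta(1.0, 4.0, size=5000)   # placeholder for CQD-Hybrid

# Rank-based test comparing the two reciprocal-rank distributions.
u_stat, p_value = mannwhitneyu(rr_cqd_hybrid, rr_cqd, alternative="greater")
print(f"Mann-Whitney U: p = {p_value:.1e}")

# Since both models score the same queries, a paired alternative is the
# Wilcoxon signed-rank test on the per-query differences.
w_stat, p_paired = wilcoxon(rr_cqd_hybrid, rr_cqd, alternative="greater")
print(f"Wilcoxon signed-rank: p = {p_paired:.1e}")
```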
Removing non-existing links (Table 3) is valid but lacks details on filtering criteria
We filter out all query-answer pairs that contain non-existing links in their reasoning tree. For example, for 2u QA pairs, if either one of the two links in the reasoning tree does not exist in the graph, we remove the QA pair (see Figure A.1, bottom, for an example). It follows that only QA pairs where both links in the reasoning tree exist in the graph are retained (see Figure A.1, top, for an example). We will include this description in the camera-ready.
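As a concrete illustration of this criterion, here is a minimal sketch (the data structures and toy triples are hypothetical, not taken from the benchmark):

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

def keep_qa_pair(tree_edges: Iterable[Triple], graph_edges: Set[Triple]) -> bool:
    """Retain a QA pair only if every link in its reasoning tree exists in the graph."""
    return all(edge in graph_edges for edge in tree_edges)

# Toy graph standing in for the full KG (train + valid + test edges merged).
graph = {("a", "r1", "b"), ("a", "r2", "c")}

# A 2u QA pair whose reasoning tree uses two links:
tree_ok = [("a", "r1", "b"), ("a", "r2", "c")]   # both links exist -> pair retained
tree_bad = [("a", "r1", "b"), ("a", "r3", "d")]  # one non-existing link -> pair filtered out

print(keep_qa_pair(tree_ok, graph))   # True
print(keep_qa_pair(tree_bad, graph))  # False
```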
The negation analysis (Table A.1) strengthens the main claims but lacks visual examples.
The visual example would be similar to the one we provided in Figure 1, which is why we did not include one. Is there a specific reason the reviewer thinks there should be one? We can easily add one in the camera-ready.
This paper investigates the problems in current benchmarks for complex query answering over knowledge graphs. The main points are: (1) previous benchmarks fail to effectively evaluate the capabilities of CQA methods because most test queries can be reduced to simpler queries, and (2) on the proposed benchmark, which alleviates this problem, no method can claim SOTA with a clear advantage.
Update after rebuttal
The rebuttal resolves some of my main concerns. I have adjusted my evaluation accordingly.
Questions For Authors
- Is there a specific reason to exclude the results and discussions on negation queries in the main content? It’s quite important because most recent CQA methods can deal with negations and use queries with negations for evaluation.
- A lot of CQA methods use the data splits in the BetaE data, where the training data may also have the same query reduction problem. Is the new benchmark really harder and more suitable for evaluating CQA methods, or is it out of the training data distribution and thus not that convincing to be used for evaluation?
Claims And Evidence
Please refer to the “Experimental Designs Or Analyses” section, because these points are mainly related to experiments.
Methods And Evaluation Criteria
The authors “balance each query type a QA pair can be reduced to”, which is confusing to me. Why is this a necessary choice? And how would the balancing coefficient affect the final results?
Theoretical Claims
N/A
Experimental Designs Or Analyses
The authors claim that no method can claim SOTA with a clear advantage on the proposed benchmark. However, a lot of important baselines are not included, ranging from well-established methods like Query2Box and BetaE to more recent strong baselines like LMPNN and CLMPT (and more). It’s crucial to report the performance of important methods on a new benchmark, instead of only choosing a few among them.
Supplementary Material
I had a quick look at most of the contents in the appendix, especially the results on negation queries and the experiment on the number of intermediate existing entities.
Relation To Broader Scientific Literature
To the best of my knowledge, this work focuses on CQA over KGs and does not explicitly relate to the broader scientific literature/community.
Essential References Not Discussed
This work can benefit from discussing more related works, especially more recent works, including but not limited to:
Other Strengths And Weaknesses
It would be great if the results of queries with negation can be included in the main content
Other Comments Or Suggestions
Section 6, numbers under subsection “Building new CQA benchmarks” should be 10,000 instead of 10.000.
We believe the main concerns of the reviewer can be fully addressed, as discussed in the comments below. We hope the score can be raised.
The authors “balance each query type a QA pair can be reduced to”, which is confusing to me,
We acknowledge that the usage of the term "balance" can lead to misunderstandings. We should rather say, "We compile the benchmark such that all query types (e.g., 1p, 2p, and 3p) that a QA pair of a given type (e.g., 3p) can be reduced to appear with the same frequency in it." We will clarify this in the camera-ready.
Why is this a necessary choice? And how would the balancing coefficient affect the final results?
This is necessary because sub-type reductions in the old benchmarks are not equally represented; in particular, 1p reductions are greatly over-represented (see Table 1). As shown in Fig. 3, by compiling the benchmark such that all query types a QA pair can be reduced to appear with the same frequency, the overall results decrease significantly. An explanation for this phenomenon is that in the new benchmarks we evaluate the full spectrum of hardness rather than only the easiest QA pairs (i.e., those reducible to 1p).
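For illustration, here is a minimal sketch of such a compilation strategy, assuming each candidate QA pair is annotated with the query type it reduces to (all names, numbers, and data structures are hypothetical and not the paper's actual pipeline):

```python
import random
from collections import defaultdict

def balance_by_reduction(qa_pairs, per_reduction: int, seed: int = 0):
    """Sample QA pairs so that every reduction type (e.g. 1p, 2p, 3p for a 3p query)
    appears with the same frequency in the compiled benchmark."""
    rng = random.Random(seed)
    by_reduction = defaultdict(list)
    for qa in qa_pairs:
        # qa["reduction"] is the simplest query type the pair reduces to,
        # given the links already present in the training graph.
        by_reduction[qa["reduction"]].append(qa)
    balanced = []
    for pairs in by_reduction.values():
        rng.shuffle(pairs)
        balanced.extend(pairs[:per_reduction])
    return balanced

# Hypothetical pool of 3p QA pairs, heavily skewed towards 1p reductions
# (mirroring the kind of skew reported in Table 1; numbers invented for illustration).
pool = [{"query": i, "reduction": r}
        for i, r in enumerate(["1p"] * 900 + ["2p"] * 80 + ["3p"] * 20)]
print(len(balance_by_reduction(pool, per_reduction=20)))  # 60 -> 20 of each reduction type
```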
ranging from well-established methods like Query2Box and BetaE to more recent strong baselines like LMPNN and CLMPT.
Over the last few years, dozens of methods for CQA have been released, and it is highly impractical to evaluate and compare all of them. For this reason, we have selected one method from each category. For geometric embedding methods, we selected ConE, as it is an established baseline that can handle negation and has been shown to outperform, e.g., BetaE and Query2Box. For GNN-based methods, we chose GNN-QE, as it provides competitive performance compared to LMPNN and CLMPT on NELL995, and outperforms them on FB15k-237. That said, we also ran additional experiments on the new benchmarks with CLMPT, the most recent baseline among the three, and our preliminary results confirm that our claims remain unchanged. In fact, similarly to the other baselines, CLMPT's performance drops on the new benchmark, but it remains competitive. For instance, below is the MRR comparison of CLMPT and the best-reported baseline for 2p, 3p, 2i, and 3i queries on FB15k-237+H:
| Method | 2p | 3p | 2i | 3i |
|---|---|---|---|---|
| CLMPT | 5.3 | 4.7 | 10.2 | 12.2 |
| best-baseline | 5.2 | 4.0 | 11.3 | 12.8 |
We will report the full results in the camera-ready.
This work can benefit from discussing more related works
We will extend App. C to include the methods pointed out by the reviewer in the SoTA discussion.
It would be great if the results of queries with negation can be included in the main content. Specific reason to exclude the results and discussions on negation queries in the main content?
We agree with the reviewer; we will use the additional page in the camera-ready to include results on negation queries in the main paper. We did not do so now only due to space constraints.
A lot of CQA methods use the data splits in the BetaE data, where the training data may also have the same query reduction problem.
Please note that we have not changed the data splits; rather, we have changed the way the benchmark is created (see Sec. 6 and App. G). Also note that the query reduction problem can only occur at valid/test time, as it happens when certain training data makes the prediction of a QA pair easier, effectively reducing the QA pair to a simpler type (see Fig. 1).
Is the new benchmark really harder and more suitable for evaluating CQA methods, or is it out of the training data distribution and thus not that convincing to be used for evaluation?
We remark that we only change the way the benchmark is created, to ensure that the distribution is not skewed towards the easiest QA pairs. Moreover, apart from 4p and 4i queries, we keep the same query structures used in the old benchmarks. Then, following [1], we could consider the query types not included in the training, i.e., ip, pi, 2u, and up, as well as 4p and 4i, to be OOD, while considering the rest of the query types in-distribution. We also remark that there is no ground truth for the frequency of query patterns in any particular real-world application against which either the new or the old benchmarks could be compared. We highlight that there is no indication that the over-represented frequency of 1p reductions in the old benchmarks (see Table 1) is representative, and it skews the perception of progress.
[1] Bai, Y., Lv, X., Li, J., & Hou, L. (2023). Answering complex logical queries on knowledge graphs via query computation tree optimization. In International Conference on Machine Learning (pp. 1472–1491). PMLR.
I thank the authors for their submission and the reviewers for their assessment. This is an important paper, demonstrating that the performance of SoTA complex QA methods is overestimated due to memorization of training links. The authors address this by creating new, balanced benchmark datasets. All reviewers unanimously agree that it meets the bar for publication at ICML.