PaperHub

Overall rating: 6.0 / 10 — Rejected (4 reviewers; min 6, max 6, std 0.0)
Individual ratings: 6, 6, 6, 6
Confidence 3.3, Correctness 2.8, Contribution 3.0, Presentation 2.5
ICLR 2025

$EFO_{k}$-CQA: Towards Knowledge Graph Complex Query Answering beyond Set Operation

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

This paper extends the task of complex query answering on knowledge graphs to a brand new frontier.

Abstract

To answer complex queries on knowledge graphs, logical reasoning over incomplete knowledge needs learning-based methods because they are capable of generalizing over unobserved knowledge. Therefore, an appropriate dataset is fundamental to both obtaining and evaluating such methods under this paradigm. In this paper, we propose a comprehensive framework for data generation, model training, and method evaluation that covers the combinatorial space of Existential First-order Queries with multiple variables ($EFO_{k}$). The combinatorial query space in our framework significantly extends those defined by set operations in the existing literature. Additionally, we construct a dataset, $EFO_{k}$-CQA, with 741 query types for empirical evaluation, and our benchmark results provide new insights into how query hardness affects the results. Furthermore, we demonstrate that the existing dataset construction process is systematically biased and hinders the appropriate development of query-answering methods, highlighting the importance of our work. Our code and data are provided at https://anonymous.4open.science/r/EFOK-CQA/README.md.
Keywords
complex query answering, knowledge graph

Reviews and Discussion

Review
Rating: 6

The paper proposes a benchmarking framework for complex query answering, specifically targeting existential first-order (EFO) queries with k independent variables. The paper proposes a new dataset, EFOk-CQA, comprising 741 diverse query graphs.

Strengths

  • The paper defines a new framework for CQA that extends existential first-order logic with a single variable (EFO1) to existential first-order logic with k independent variables (EFOk).
  • The paper defines three new metrics for evaluating EFOk queries.
  • The paper constructs new datasets of query-answer pairs from 741 different abstract query graphs. The queries are based on FB15k-237, FB15k, and NELL.
  • Overall, the originality, quality, clarity, and significance of the paper are good.

Weaknesses

  • Unusual structure of the paper: no section with title "introduction", related work in appendix, main section of the paper (Section 3) only gives a rough overview with important details in the appendix.
  • Presentation could be improved (some spelling and grammar issues)

Details

  • "KGs suffer from incompleteness during its construction" --> their construction (plural)
  • Figure 1, typo: presdient --> president
  • Figure 1, right: Born(y_1, x_1) appears twice. Should the second one be co-author?
  • Appendix C: "projections(Definition" space before parenthesis missing
  • Proposition 20: "it can not only" --> cannot
  • Section G: Knowledge graph with attributes: Two recent works are missing: NRN and LitCQD

Questions

  • Can you strengthen the motivation for the new query types (EFOk)? For example, can you provide additional real-world examples? Or statistics of real-world query logs? The paper claims that fraud detection would benefit from the proposed approach (Appendix "society impact"). However, no references or concrete examples are given.
  • What was the runtime to compute the results in Tables 1 and 2?
Comment

Dear reviewer, we would like to express our gratitude for your time and your insightful review. We have made efforts to improve our paper accordingly, and we would like to clarify some misunderstandings and address your concerns.

Unusual structure of the paper: no section with title "introduction", related work in appendix, main section of the paper (Section 3) only gives a rough overview with important details in the appendix.

Presentation could be improved (some spelling and grammar issues)

Thank you for your suggestions! We have made modifications according to your advice and re-checked our paper. Please do not hesitate to let us know if you come across other issues while reading the new version of the paper.

Can you strengthen the motivation for the new query types (EFOk)? For example, can you provide additional real-world examples? Or statistics of real-world query logs? The paper claims that fraud detection would benefit from the proposed approach (Appendix "society impact"). However, no references or concrete examples are given.

Sure, we would like to strengthen our motivation. We have now added more discussion of the fraud detection example, with related citations, to showcase why EFOk queries are important.

What was the runtime to compute the results in Tables 1 and 2?

In general, the running time of FIT is significantly longer than that of any other model. Taking FB15k-237 in Table 1 as an example, FIT takes 22 hours to finish, while the other models are comparable to each other, e.g., BetaE takes 1 hour and ConE 1.5 hours.

Thank you again for your patience and invaluable review!

Review
Rating: 6

This paper introduces $EFO_{k}$-CQA, a framework for generating datasets and evaluating models on complex query answering (CQA) tasks over knowledge graphs (KGs). Unlike previous datasets limited to set operations and tree-based structures, EFOk-CQA supports the broader family of Existential First-Order (EFO) queries, extending to multiple variables and complex graph structures. The authors propose an end-to-end pipeline to generate EFO queries, sample answers, train models, and evaluate performance. The dataset is shown to cover 741 query types, aiming to capture a wide combinatorial space of queries. Through extensive experiments, the paper highlights structural biases in existing datasets and evaluates six prominent CQA models, revealing the influence of query hardness and graph topology on model performance.

Strengths

  1. The EFOk-CQA framework is a meaningful extension to existing CQA benchmarks, expanding beyond single-variable, set operation-based queries to include multiple-variable and complex query graphs. It also addresses a notable gap in CQA evaluation, offering a comprehensive benchmark that reflects real-world query complexity.
  2. The paper demonstrates a high degree of rigor in constructing the EFOk-CQA dataset. The authors establish well-defined assumptions to exclude trivial cases, implement a pipeline for end-to-end evaluation, and provide empirical results with various CQA models across a comprehensive range of query structures and complexities.
  3. The paper is well-organized, with clear definitions and illustrations that help in understanding complex query structures. Concepts such as query graphs, grounding processes, and evaluation metrics are presented with detailed explanations and visual aids, facilitating comprehension.
  4. The EFOk-CQA dataset and framework have implications for advancing CQA research. By covering a broader range of query types, the benchmark could lead to the development of more generalized and capable CQA models. The analysis of model performance based on query topology and difficulty also provides useful insights for future CQA model improvements.

Weaknesses

  1. In section 3.1, while the framework supports complex queries, the combinatorial space of queries could grow exponentially as query parameters increase. This might make it challenging to scale up the dataset generation process efficiently or to apply the framework to larger, more complex knowledge graphs.
  2. The use of joint metrics to evaluate models with multiple free variables is valuable, but it is noted in the paper that joint rankings could lack reliability due to inherent difficulty. This limitation could affect the robustness of evaluation for complex queries, particularly when interpreting the rankings for CQA tasks.

Questions

  1. Given the vast combinatorial space, how scalable is the framework for generating queries with more complex conditions or additional free variables? Are there specific trade-offs in terms of time or computational resources that become prohibitive?
  2. How often do the assumptions used to generate non-trivial query graphs (e.g., no decomposition) align with real-world query requirements? Would relaxing these assumptions reveal additional complexities or insights about query-answering models?
  3. Can the authors provide an error analysis highlighting specific failure modes of the evaluated models? For example, are there certain query types (e.g., cyclic vs. acyclic) where models consistently underperform? This information could clarify which specific capabilities of current SOTA works require improvement.
Comment

Dear reviewer, we would like to express our gratitude for your time and your insightful review, and we would like to clarify some misunderstandings and address your concerns.

In section 3.1, while the framework supports complex queries, the combinatorial space of queries could grow exponentially as query parameters increase. This might make it challenging to scale up the dataset generation process efficiently or to apply the framework to larger, more complex knowledge graphs.

Given the vast combinatorial space, how scalable is the framework for generating queries with more complex conditions or additional free variables? Are there specific trade-offs in terms of time or computational resources that become prohibitive?

We thank you for your insightful comments. This is a critical challenge, as there is perhaps no way to cover the combinatorial space of queries when the query parameters keep increasing. We would have to resort to further constraints, such as allowing fewer existential variables or fewer edges in the graph. Another idea is to manually craft some abstract query types and avoid enumerating the combinatorial space. We believe this is inherent to the CQA problem: as we explain in Section 3.2, CQA is closely related to the constraint satisfaction problem (CSP), which is NP-complete.

Also, we believe that our coverage is challenging enough for current development, as it is rare in real life to query more than 4 objects simultaneously.
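To make the CSP connection above concrete, here is a hedged sketch in our own notation (not quoted from the paper): an $EFO_{k}$ query is an existentially quantified conjunction of possibly negated relation atoms,

$\phi(x_1, \dots, x_k) = \exists y_1 \cdots \exists y_m \bigwedge_{j} \ell_j, \qquad \ell_j \in \{\, r(u, v),\ \neg r(u, v) \,\},$

where each argument $u, v$ is a constant, an existential variable $y_i$, or a free variable $x_i$. Reading the query-graph nodes as CSP variables, the entity set as their common domain, and each atom $\ell_j$ as a constraint shows why answering such queries is as hard as general CSP, i.e., NP-complete, as noted above.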

The use of joint metrics to evaluate models with multiple free variables is valuable, but it is noted in the paper that joint rankings could lack reliability due to inherent difficulty. This limitation could affect the robustness of evaluation for complex queries, particularly when interpreting the rankings for CQA tasks.

Thank you for your observation. The reason we claim that joint rankings lack reliability is that all current models lack the ability to retrieve combinatorial answers. Therefore, this is not a drawback of the designed metric, but rather a mismatch between current models and the ambitious goal of answering EFOk queries.

How often do the assumptions used to generate non-trivial query graphs (e.g., no decomposition) align with real-world query requirements? Would relaxing these assumptions reveal additional complexities or insights about query-answering models?

We note that the assumptions we put forward are necessary but not sufficient conditions for the query graphs to align with real-world query requirements. Therefore, those assumptions should not be relaxed, but perhaps further assumptions should be made given further observations.

Can the authors provide an error analysis highlighting specific failure modes of the evaluated models? For example, are there certain query types (e.g., cyclic vs. acyclic) where models consistently underperform? This information could clarify which specific capabilities of current SOTA works require improvement.

Sure. We currently find that the majority of the top 10 hardest query types are acyclic and contain negation; we give an example below: (r1(s1,e1))&((r2(e1,f2))&((!(r3(e1,f1)))&(r4(f1,f2)))). We will provide a more detailed discussion in the camera-ready version.
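For readability, the example above can be rewritten in standard first-order notation (this rendering is ours; we read s1 as a constant, e1 as an existential variable, and f1, f2 as free variables, following the apparent naming convention): $\phi(f_1, f_2) = \exists e_1\, \big( r_1(s_1, e_1) \wedge r_2(e_1, f_2) \wedge \neg r_3(e_1, f_1) \wedge r_4(f_1, f_2) \big)$.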

Thank you again for your patience and invaluable review!

Review
Rating: 6

Existing benchmarks for complex query answering on KGs only consider specific (tree-shaped) queries. This is not sufficient to evaluate methods thoroughly. The current work suggests new datasets which 1) allow multiple variables and 2) remove many trivial query shapes.

Strengths

  • Tying this domain to CSP is helpful because that way we can reuse insights.
  • Section 2.1 and 2.2 propose a good approach to reduce the number of query shapes to those that are non-trivial. This is an important consideration.
  • A reasonable choice has been made on what to put in the appendix. For readers, parts might be considered essential for the understanding of the paper, though, especially what is in appendix D.1. and the explanation of the joint metric.

Weaknesses

  • The main difference between this and the work by Yin 2024 appears to be that multiple variables are allowed. This is a small difference, and it is also pointed out in the cited survey by Hongyu et al. (Neural Graph Reasoning), so it is not really new. That survey also has a more general notion of graph queries, also for different graph types.
  • The authors do not provide any insights into the limitations of what they propose.

Minor:

  • There are some inconsistencies in the math notation and symbols used in the paper.
  • In section 1, there are some grammar mistakes, mainly concerning the use of articles and plurals which make the text hard to read.

Questions

  • Could you elaborate on assumptions 13 and 14? Why are they necessary?
  • Assumption 16 is quite questionable. I am aware of the earlier work doing this, but also there it was not appropriately justified. Is there any foundational reason this assumption is needed?
  • I am confused about the benefit of including the marginal metric. As you correctly indicate, it is flawed when dealing with multiple free variables. What is the benefit of still including it?
  • I wonder how a method like mpqe would work on your dataset. It naturally obtains representations for all variables of the query.
  • For the results of CQD, did you use the calibrated version $CQD^A$, or the original one? The calibrated one should be much more robust when computing on different query shapes.
Comment

Dear reviewer, we would like to express our gratitude for your time and your insightful review, and we would like to clarify some misunderstandings and address your concerns.

The main difference between this and... so not really new...

Though the introduction of multiple free variables is the most obvious difference from the previous work, the complete coverage of the combinatorial space is also a strong merit, and it cannot be achieved without a careful design of assumptions (discussed in Section 4.2). Furthermore, in Section 5.5 we propose three new kinds of metrics designed for queries that contain multiple free variables, which has not been investigated before. In the cited survey [1], only a simple example of an EFOk query is put forward, whereas we build the whole pipeline for such queries, as explained in Section 3. To conclude, we believe our work is a solid and meaningful step towards the ambitious goal of more advanced complex query answering.

The authors do not provide any insights in the limitations of what they propose.

In Section 6, we conduct empirical experiments and discuss the different behaviors of different models: query embedding / LMPNN / CQD / FIT. We note that the complete coverage provides strong evidence for the training biases existing in the prevailing dataset; most prominently, LMPNN, which relies on message passing, falls behind other methods on non-SDAG queries. This finding also corroborates our discussion of the first issue, namely why the complete coverage of the combinatorial space is important. We also provide more findings in the current Sections 6.2/6.3.

There are some inconsistencies in the math notation and symbols used in the paper.

In section 1, there are some grammar mistakes....

Thank you for the suggestions. We have rechecked the usage of symbols, notation, and grammar, as you can see in our revised paper, and we welcome you to point out any remaining inconsistency or mistake you encounter.

Could you elaborate on assumptions 13 and 14? Why are they necessary?

Sure. We have already included an example showing the necessity of Assumptions 13/14 in Figure 2. These two assumptions combined help us obtain the abstract query graph in its simplest form, and thus ensure that the enumeration of the combinatorial space is appropriate.

Assumption 16 is quite questionable. ...

Thank you for this question. We think this is an intuitive assumption about real-world scenarios, and we leverage a justification from previous research [2] that we hope addresses your concern. If the query is "List all the entities in the KG that are not European countries.", then both "apple" and "computer" will be answers; in fact, nearly any randomly guessed entity can be a true answer. We consider this kind of query unrealistic, and such queries can also hinder fair evaluation.
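As a hedged illustration in our own notation (the relation name is hypothetical, not taken from the paper), the query above amounts to a single negated atom over one free variable, $\phi(x) = \neg \mathrm{IsA}(x, \text{EuropeanCountry})$, whose answer set is almost the entire entity vocabulary; this is the degenerate behavior that Assumption 16 rules out.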

I am confused about the benefit of including the marginal metric.

Thank you for raising these questions. All previous methods are designed to answer queries with only 1 free variable, and they fundamentally lack the ability to retrieve combinatorial answers. Therefore, in our implementation we have to let those models infer the answer for each free variable individually, as explained in Section 5.4 and Appendix E. In our evaluation, only the joint metric proposed in Section 6.5 serves the ultimate goal of answering tuples/combinatorial answers, while the others are designed as weaker metrics, as a compromise given that no current method can retrieve combinatorial answers. We believe the marginal and multiply metrics only serve as milestones towards the ultimate goal, the joint metric.
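As a hedged sketch of the distinction in our own notation (the paper's exact definitions in Sections 5.4–5.5 are not reproduced here): for a query with two free variables $x_1, x_2$ and a ground-truth answer tuple $(a_1, a_2)$, a marginal metric averages per-variable reciprocal ranks, while the joint metric ranks the tuple itself against all candidate tuples,

$\mathrm{MRR}_{\text{marginal}} = \frac{1}{2}\left(\frac{1}{\mathrm{rank}_{x_1}(a_1)} + \frac{1}{\mathrm{rank}_{x_2}(a_2)}\right), \qquad \mathrm{MRR}_{\text{joint}} = \frac{1}{\mathrm{rank}_{(x_1, x_2)}\big((a_1, a_2)\big)}.$

A model that is accurate on each variable in isolation can still rank the correct tuple poorly, which is exactly the gap the joint metric is meant to expose.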

I wonder how a method like mpqe would work on your dataset. It naturally obtains representations for all variables of the query.

Thank you for your suggestions. However, as mpqe is designed for queries without negation, we are afraid that it is not suitable for evaluation on EFOk-CQA. Meanwhile, LMPNN is also a model built on message passing that naturally obtains representations for all variables of the query, so we believe the discussion of LMPNN can address your concerns to some extent.

For the results of CQD, did you use the calibrated version $CQD^A$, or the original one? The calibrated one should be much more robust when computing on different query shapes.

Thank you for your suggestion. We did consider including CQD-A in our evaluation; however, the README on the GitHub page [3] does not include enough guidance for running the code, nor is a checkpoint provided.

[1] Ren, Hongyu, et al. "Neural graph reasoning: Complex logical query answering meets graph databases." arXiv preprint arXiv:2303.14617 (2023).

[2] Ren, Hongyu, and Jure Leskovec. "Beta embeddings for multi-hop logical reasoning in knowledge graphs." Advances in Neural Information Processing Systems 33 (2020): 19716-19726.

[3] https://github.com/EdinburghNLP/adaptive-cqd

Review
Rating: 6

This paper proposes a comprehensive framework for generating data, training models, and evaluating methods for complex query answering on knowledge graphs. The authors introduce the $EFO_{k}$-CQA dataset, which captures a significantly broader combinatorial space of existential first-order queries compared to previous datasets. They demonstrate the limitations of existing datasets and benchmarks, and present new evaluation metrics tailored to queries with multiple free variables. The experiments on six representative CQA models provide new insights into query hardness and model performance.

Strengths

  1. The authors formulate a new class of queries, $EFO_{k}$, that extends previous definitions and captures more complex query structures. This expands the scope of complex query answering on knowledge graphs.
  2. The proposed framework is rigorously defined, and the dataset generation process is carefully designed to ensure non-trivial queries. The code and data are made publicly available.

Weaknesses

  1. The paper is more appropriate for the audience of a database or database theory conference rather than for ICLR.
  2. The dataset is entirely synthetically created and lacks application in the real world.
  3. No clear trend or conclusion can be seen from Table 2. There is no clear relation between the number of existential variables and the evaluation types.
  4. The basic KG is incomplete. It will propagate errors as the query structure becomes more complex. Moreover, there is no method to guarantee the quality of the generated training and testing sets.

Questions

Why should we scale the dataset if there are no real-world use cases?

Comment

Dear reviewer, we would like to express our sincere gratitude for your time and review. However, there seem to be some misunderstandings that we would like to clarify.

The paper is more appropriate for the audience of a database or database theory conference rather than for ICLR.

Admittedly, our paper is cross-domain work; however, our core idea is to put forward new tasks/challenges for neural models and to evaluate them on these new tasks. The new empirical experiments help expose the bias existing in previous methodologies and datasets, and this kind of finding is helpful for the developers of neural models, including [1,2,3,4,5,6].

To conclude, this paper is aimed at the datasets and benchmarks track; our goal is to benchmark current neural models, which are all published at machine-learning conferences, for the benefit of their development.

The dataset is entirely synthetically created and lacks application in the real world.

Why should we scale the dataset if there are no real-world use cases.

Thank you for this comment. We briefly discussed applications in the Appendix, and we have now added more explanation in Appendix G; the content is as follows:

Nowadays, CQA has real-world applications, such as fact ranking [7] and explainable recommendation [8]. However, some important practical applications cannot be covered by existing CQA datasets, because their construction is biased and they do not consider queries with multiple free variables at all. We would like to introduce one example from fraud detection, where we need to detect a group of people with a cyclic money flow for anti-money-laundering applications [9]; we also note that this use case is shared by an open-source graph database [10,11].
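As a hedged illustration of such a query (our own example with a hypothetical relation name, not taken from the paper or the cited sources), a cyclic money flow between two suspects routed through one intermediary can be phrased as a query with two free variables, $\phi(x_1, x_2) = \exists y\, \big( \mathrm{Transfer}(x_1, y) \wedge \mathrm{Transfer}(y, x_2) \wedge \mathrm{Transfer}(x_2, x_1) \big)$; its query graph is cyclic and has two free variables, a shape that tree-form, single-free-variable benchmarks cannot express.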

No clear trend or conclusion can be seen from Table 2. There is no clear relation between the number of existential variables and the evaluation types.

Thank you for your suggestions. Previously, we mainly discussed the gradually increasing hardness across the three kinds of metrics in Section 6.3. Regarding your question about the effect of existential variables and query types, we have also mentioned that the conclusions are similar to those in Table 1: more existential variables make a query harder, a cyclic graph is easier than an SDAG, and an SDAG is easier than a multigraph, which rectifies the previous understanding. We have also added more discussion in Sections 6.2 and 6.3, which elaborate on the specific behaviors of different CQA models. We welcome you to check that part.

The basic KG is incomplete. It will propagate errors as the query structure becomes more complex. Moreover, there is no method to guarantee the quality of the generated training and testing sets.

Thank you for your opinion. However, the train/test splitting is essentially designed for evaluating the generalizability of neural models on KGs, namely whether a model can generalize to unseen facts. We note this is standard practice in related research [1].

Moreover, we believe a strong merit of our benchmark is that it helps discover how the propagated error becomes challenging for neural models to handle; this is again achieved by our complete coverage of the combinatorial space.

[1] Ren, Hongyu, and Jure Leskovec. "Beta embeddings for multi-hop logical reasoning in knowledge graphs." Advances in Neural Information Processing Systems 33 (2020): 19716-19726.

[2] Luus, Francois, et al. "Logic embeddings for complex query answering." arXiv preprint arXiv:2103.00418 (2021).

[3] Zhang, Zhanqiu, et al. "Cone: Cone embeddings for multi-hop reasoning over knowledge graphs." Advances in Neural Information Processing Systems 34 (2021): 19172-19183.

[4] Minervini, Pasquale, et al. "Complex query answering with neural link predictors." 31st International Joint Conference on Artificial Intelligence, IJCAI 2022. International Joint Conferences on Artificial Intelligence Organization, 2022.

[5] Wang, Zihao, et al. "Logical Message Passing Networks with One-hop Inference on Atomic Formulas." The Eleventh International Conference on Learning Representations.

[6] Yin, Hang, Zihao Wang, and Yangqiu Song. "Rethinking Complex Queries on Knowledge Graphs with Neural Link Predictors." The Twelfth International Conference on Learning Representations.

[7] Ren, Hongyu, et al. "Fact Ranking over Large-Scale Knowledge Graphs with Reasoning Embedding Models." Data Engineering: 124.

[8] Syed, Muzamil Hussain, Tran Quoc Bao Huy, and Sun-Tae Chung. "Context-aware explainable recommendation based on domain knowledge graph." Big Data and Cognitive Computing 6.1 (2022): 11.

[9] Priya, Jithin Mathews, et al. "A graph theoretical approach for identifying fraudulent transactions in circular trading." DATA ANALYTICS 2017 (2017): 36.

[10] https://www.nebula-graph.io/

[11] https://www.nebula-graph.io/posts/fraud-detection-using-knowledge-and-graph-database

Comment

Thanks for addressing my concerns. I would raise my rating.

Comment

We extend our deepest appreciation for dedicating your time to review our work and for providing us with invaluable suggestions and insights. We have diligently incorporated the suggested modifications into our paper, which are outlined below. We kindly invite you to review these changes.

  1. We have changed the structure of the paper, moving the related work section from the appendix into the main paper.

  2. We include more discussion about the experiment results in Section 6.2 and Section 6.3.

  3. We give an example in Appendix G to explain why the newly proposed EFOk queries are needed in real-world applications.

  4. We have corrected some grammar mistakes and typos and added new citations to related research.

AC Meta-Review

This manuscript studies a framework for data generation, model training, and method evaluation that covers the combinatorial space of Existential First-order Queries with multiple variables (EFOk). The EFOk-CQA dataset, with 741 query types, is introduced for empirical assessment, aiming to capture a wide combinatorial space of queries.

While reviewers agreed that the approach and the proposed framework seem reasonable, no one sufficiently championed accepting this work; all reviewers gave 6, indicating marginally above the acceptance threshold. Some of the concerns raised by the reviewers are critical and not entirely resolved by the authors' responses: (1) It is necessary to clarify the significance of this work compared to previous works. (2) Some assumptions (e.g., assumptions 13, 14, and 16) should be elaborated more about their necessity and meaning. (3) The motivation for the new query types (EFOk) should be discussed more broadly than just showing the fraud detection case. (4) The authors are advised to carefully review the reviewers' points and questions and incorporate their suggestions (e.g., rethinking the benefit of including the marginal metric, including the calibrated version CQDA in experiments, etc.).

Additional Comments from the Reviewer Discussion

Reviewers left detailed reviews pointing out some critical points the authors had to resolve. Since some reviewers did not explicitly leave additional comments during the discussion period, this meta-reviewer carefully read the reviewers' points and the authors' rebuttals and found that some of the authors' rebuttals were not enough or not sufficiently convincing. The meta-review contains detailed suggestions for the authors to revise their manuscript.

Final Decision

Reject