PaperHub
Score: 7.3/10 · Poster · 4 reviewers
Ratings: 4, 5, 4, 5 (min 4, max 5, std 0.5)
Confidence: 3.8
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We introduce the first graph foundation model specifically designed for retrieval-augmented generation in large language models.

Abstract

Keywords
Graph Foundation Model · Retrieval Augmented Generation · Graph-enhanced Retrieval Augmented Generation

Reviews and Discussion

Review
Rating: 4

The paper proposes a foundation model for graph retrieval for LLMs, called GFM-RAG. The method constructs a KG from a document corpus by using an LLM as an Open IE tool to extract entity-relation triples. Given a query, the method extracts the entities mentioned in the query and then runs a message-passing mechanism over the KG. DistMult is used to score the entities and select the top ones, after which the relevant documents are obtained from the entity-document mapping. These documents are then used along with the query to prompt the LLM. The GFM is trained with a KG completion objective in a self-supervised manner and in the document retrieval setting in a supervised manner. Results on multi-hop benchmarks and unsupervised datasets show strong performance compared to baselines.
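For concreteness, a minimal sketch of the pipeline as summarized above; every name here (build_kg_index, extract_query_entities, and so on) is a hypothetical placeholder, not the authors' API:

```python
# Hypothetical outline of the GFM-RAG retrieval flow described above.
# The callables passed in stand for components the paper describes;
# none of these names come from the actual implementation.

def gfm_rag_answer(query, documents, llm, gfm,
                   build_kg_index, extract_query_entities,
                   top_t=10, top_k=5):
    # 1. Offline: LLM-as-OpenIE extracts (head, relation, tail) triples,
    #    yielding a KG plus an entity -> source-document inverted index.
    kg, ent2docs = build_kg_index(documents, llm)

    # 2. Ground the query in the KG via its mentioned entities.
    query_entities = extract_query_entities(query, llm)

    # 3. A single GNN forward pass scores all KG entities for this query
    #    (DistMult is used inside the model to score entities).
    entity_scores = gfm(kg, query, query_entities)  # dict: entity -> score

    # 4. Top-T entities -> candidate documents via the inverted index.
    top_entities = sorted(entity_scores, key=entity_scores.get)[-top_t:]
    docs = {d for e in top_entities for d in ent2docs.get(e, [])}

    # 5. Prompt the LLM with the retrieved documents as context.
    context = [documents[d] for d in list(docs)[:top_k]]
    return llm.generate(query, context=context)
```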

Strengths and Weaknesses

Strengths

  1. The paper studies an important research area: graph foundation models for RAG.
  2. The method achieves competitive results

Weaknesses

  1. In the entity retrieval process, the paper does not mention how large KGs are handled.
  2. The current framework uses DistMult as the KG completion scoring method. However, previous works [1,2] have shown that DistMult has limitations in modelling antisymmetric, inverse, and compositional relations.
  3. The paper doesn't compare with contemporary methods that perform graph learning for RAG ([3,4]).

Missing citation

[1] Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. G-Retriever: Retrieval-augmented generation for textual graph understanding and question answering. arXiv preprint arXiv:2402.07630, 2024.

References

[1] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space. 2019.

[2] Zhang, S., Tay, Y., Yao, L., and Liu, Q. Quaternion Knowledge Graph Embeddings. 2019.

[3] Mufei Li, Siqi Miao, and Pan Li. Simple Is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation. 2025.

[4] Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. G-Retriever: Retrieval-augmented generation for textual graph understanding and question answering. arXiv preprint arXiv:2402.07630, 2024.

Questions

  1. In the entity retrieval process, the paper does not mention how large KGs are handled.
  2. The current framework uses DistMult as the KG completion scoring method. However, previous works [1,2] have shown that DistMult has limitations in modelling antisymmetric, inverse, and compositional relations. Why not use newer approaches for KG completion modelling?
  3. In the path interpretations of Sec. 4.8 (Eq. 19), is the top-k taken over the product of gradients (chain rule) along the path or over the individual triple scores? This could be clarified better.
  4. See the weakness section above.


Limitations

I don't see an explicit limitations section, but I feel the authors should specify the technical limitations of the method, especially how it would scale to or handle large KGs for RAG as the graph grows in size.

Final Justification

Overall, after discussion with the authors, I'm still not fully convinced by the results about the need for a GFM for RAG. The provided results only show recall of entities/documents, not the final results of the LLM system after using the retrieved docs. If the task is only retrieving entities and then fetching documents, I feel heuristics, standard retrieval systems, or semantic closeness followed by a query to the LLM would do well (and if, for example, entities are not connected in the same component, how is message passing going to retrieve the disconnected entity?). Also, the comparison with baselines may not be fair, as the proposed method is pretrained on a large dataset. Thus I keep my score, still leaning towards accept due to the demonstrated results and the importance of the problem.

Formatting Issues

NA

Author Response

W1/Q1: Scalability to large KGs

We agree that handling large KGs is essential for real-world applications. To support large-scale KGs, inspired by LLM training, we implement a Single Program Multiple Data (SPMD)-style message-passing mechanism that enables distributed training and inference across multiple GPUs.

Specifically, we partition the full KG into balanced subgraphs using the METIS algorithm, with each device storing only a subset of the graph in memory. During message passing, each device first aggregates information locally and then exchanges messages with other devices to finalize the node embedding updates. This design allows GFM-RAG to scale effectively by leveraging more hardware.

In addition, instead of performing message passing over the entire KG, we can conduct it only on the localized k-hop subgraph around the query. This significantly reduces computational and memory costs, further enhancing scalability on large KGs.
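A minimal sketch of the distributed step, assuming torch.distributed is initialized and each rank holds the edge list of one METIS partition (the entity table is replicated here for simplicity, whereas the actual design keeps only a subset per device):

```python
import torch
import torch.distributed as dist

def spmd_message_passing_step(local_edges, ent_emb, rel_emb):
    """One SPMD-style message-passing step: aggregate locally over this
    rank's partition of edges, then all-reduce so entities cut across
    partition boundaries receive messages from every device.
    Illustrative sketch only, not the paper's implementation.
    local_edges: [E_local, 3] rows of (head, relation, tail) indices."""
    h, r, t = local_edges[:, 0], local_edges[:, 1], local_edges[:, 2]
    msg = ent_emb[h] * rel_emb[r]                  # DistMult-style messages
    agg = torch.zeros_like(ent_emb)
    agg.index_add_(0, t, msg)                      # local sum aggregation
    dist.all_reduce(agg, op=dist.ReduceOp.SUM)     # exchange across devices
    return agg                                     # used to update ent_emb
```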

We will clarify this strategy in the revised version and add a discussion in the appendix.

W2/Q2: Use of DistMult and limitations

We appreciate the reviewer highlighting the limitations of DistMult in modeling antisymmetric or inverse relations. We chose DistMult as the message function for its simplicity and computational efficiency, as it has no training parameters and is easy to compute. That said, we agree that more expressive models like RotatE [1] or QuatE [2] could potentially enhance performance. In preliminary experiments, we found DistMult to offer a strong tradeoff between efficiency and effectiveness in our retrieval-focused setting, but exploring alternative KGE models in our GFM is a valuable direction that we plan to pursue in future work.
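For context (standard background on the limitation under discussion, in our notation): DistMult scores a triple as a trilinear product,

$$f(h, r, t) = \langle \mathbf{e}_h, \mathbf{w}_r, \mathbf{e}_t \rangle = \sum_i e_{h,i}\, w_{r,i}\, e_{t,i} = f(t, r, h),$$

so every relation it models is necessarily symmetric. RotatE instead represents relations as rotations in complex space, $\mathbf{e}_t \approx \mathbf{e}_h \circ \mathbf{w}_r$ with $|w_{r,i}| = 1$, which can express antisymmetric, inverse, and compositional relations.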

Q3: Clarification of Eq. 19 (Path interpretation)

We thank the reviewer for pointing this out. In Eq. 19, the top-k paths are selected based on the product of gradient scores over triples forming the path, which approximates the contribution of that path to the final prediction via the chain rule. This allows us to identify influential multi-hop reasoning chains and interpret the model’s behavior. We will clarify this in the main text and provide more details in the Appendix.
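In symbols, the selection rule described above can be sketched as (our notation, approximating the paper's Eq. 19, not a verbatim reproduction):

$$P^{*} = \operatorname{Top-}k_{P}\; \prod_{(h,r,t)\in P} \left| \frac{\partial\, s_q}{\partial\, s_{(h,r,t)}} \right|,$$

where $s_q$ is the final prediction score and $s_{(h,r,t)}$ is the intermediate score of each triple along path $P$, so the product approximates the path's contribution to the prediction via the chain rule.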

W3/Q4: Comparison with G-Retriever and SubgraphRAG

We thank the reviewer for highlighting recent related work. In response, we have conducted additional experiments comparing GFM-RAG with SubgraphRAG and G-Retriever across three benchmark datasets. The results are shown below:

Comparison with SubgraphRAG and G-Retriever.

| Method | HotpotQA R@2 | HotpotQA R@5 | MuSiQue R@2 | MuSiQue R@5 | 2Wiki R@2 | 2Wiki R@5 |
|---|---|---|---|---|---|---|
| SubgraphRAG | 61.5 | 73.0 | 42.1 | 49.3 | 70.7 | 85.5 |
| G-Retriever | 53.3 | 65.5 | 38.8 | 45.1 | 60.8 | 67.8 |
| GFM-RAG | 78.3 | 87.1 | 49.1 | 58.2 | 90.8 | 95.6 |

As the results show, GFM-RAG consistently outperforms both SubgraphRAG and G-Retriever across all datasets and recall settings. While SubgraphRAG and G-Retriever both employ GNN-based reasoning to enhance retrieval, they are primarily tailored to well-constructed KGs with domain-specific fine-tuning, and often operate on shallow, fixed subgraphs. Their architectures typically involve small GNNs with limited reasoning depth and generalization ability.

In contrast, GFM-RAG introduces a generalizable graph foundation model trained over diverse KGs and designed to operate on noisy, incomplete, and large-scale knowledge graphs. This allows GFM-RAG to capture transferable multi-hop reasoning patterns and maintain robust performance across domains without domain-specific adaptation.

We will incorporate these comparisons and insights into the final version to enhance our empirical and methodological contributions.

Comment

I would like to thank the authors for the rebuttal. It helps clarify some questions I had. I have a few more questions as below:

  1. You propose a Graph Foundation Model for RAG, but it seems that the method is trained using KG completion on the corpus concerned, after first constructing a graph using LLMs. My question is: is the message passing parametric, and if so, how do you see it generalizing to different domains? If training has to be done for every domain, is it actually right to call it a foundation model?
  2. Follow-up question to the above: do you actually need message passing for RAG? Can't we apply some simple heuristic, say selecting neighbors within a given radius, or finding semantic similarity followed by a small query to the LLM to pick the top-k entities and then find the documents? Have you studied any of these simple methods, or even a simple ablation applying DistMult without message passing?

I have a concern regarding the generalizability of the message passing method to various domains, and whether it is right to be called a GFM.

Comment

We are glad that our previous response helped address some of your earlier questions, and we sincerely thank you for raising these thoughtful follow-up points. They allow us to further clarify the design choices and generalization capabilities of GFM-RAG.

Q1: Whether the message passing is parametric and generalizable to different domains; appropriateness of “foundation model” terminology

The message-passing component in GFM-RAG is parametric, implemented as a multi-layer GNN that uses DistMult (or another KGE operator) as the message function (Eq. 8) and updates entity embeddings with neighbor aggregation (Eq. 9).
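A minimal PyTorch sketch of such a layer, assuming sum aggregation and a linear update (illustrative only; Eqs. 8-9 in the paper define the exact formulation):

```python
import torch
import torch.nn as nn

class DistMultMessagePassing(nn.Module):
    """One parametric message-passing layer: DistMult (element-wise
    product) as the message function over (head, relation, tail) edges,
    sum aggregation into tail entities, then a learned update."""
    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, ent_emb, rel_emb, edges):
        # ent_emb: [num_entities, dim]; rel_emb: [num_relations, dim]
        # edges:   [num_edges, 3] rows of (head, relation, tail) indices
        h, r, t = edges[:, 0], edges[:, 1], edges[:, 2]
        msg = ent_emb[h] * rel_emb[r]          # DistMult messages per edge
        agg = torch.zeros_like(ent_emb)
        agg.index_add_(0, t, msg)              # neighbor aggregation
        return torch.relu(self.update(torch.cat([ent_emb, agg], dim=-1)))
```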

Importantly, our training paradigm is not limited to the KG of a single document or domain. We train the model on a large corpus of diverse KGs (60 specifically) so that it can learn generalizable reasoning patterns that can be applied across different domains without requiring frequent retraining.

We discuss generalizability at Lines 198-202 and show zero-shot adaptation to new domains in Section 4.6. In practice, for a new domain, we only need to construct the KG from its corpus, and the trained GFM-RAG can operate directly on it. Fine-tuning is optional and can further boost performance, but it is not required for the model to function effectively.

Therefore, we consider "foundation model" to be an appropriate term, as the trained GFM-RAG can be directly applied to graph-based retrieval and reasoning tasks across various graphs and domains [1], analogous to how language foundation models operate across domains.

[1] Liu, Jiawei, et al. "Graph foundation models: Concepts, opportunities and challenges." IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).

Q2: Necessity of message passing for RAG and comparison to simpler heuristics

We agree that simple retrieval heuristics can be attractive for efficiency, but in practice, real-world KGs are often noisy and incomplete. Methods relying purely on graph structure or local similarity tend to miss relevant entities connected through multi-hop or implicit relations.

We have compared GFM-RAG against representative heuristic-based RAG methods in the paper and our previous responses to Q4:

  • HippoRAG adopts Personalized PageRank, a non-parametric algorithm, to select the top-k entities and uses them for document retrieval (a toy PPR sketch follows this list).
  • G-Retriever first selects neighbors within 2 hops and prunes them using semantic similarity.
  • SubgraphRAG directly retrieves triples from KGs by combining semantic similarity and graph embedding similarity.
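For reference, a toy power-iteration version of Personalized PageRank, the kind of non-parametric scorer HippoRAG uses (illustrative, not HippoRAG's implementation):

```python
import numpy as np

def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """Power-iteration PPR over an adjacency matrix, restarting at the
    query's seed entities. Higher score = more relevant to the seeds."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    # Row-normalize, leaving zero rows (dangling nodes) as zeros.
    P = np.divide(adj, deg, out=np.zeros_like(adj, dtype=float), where=deg > 0)
    restart = np.zeros(n)
    restart[seeds] = 1.0 / len(seeds)
    scores = restart.copy()
    for _ in range(iters):
        scores = alpha * restart + (1 - alpha) * P.T @ scores
    return scores
```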

Experiment results show that GFM-RAG outperforms existing heuristic methods, underscoring the necessity and effectiveness of learned message passing for robust, cross-domain graph-enhanced RAG. GFM-RAG with parametric message-passing can benefit from training on diverse KGs, enabling it to capture generalizable multi-hop reasoning strategies that go beyond local heuristics.

Comparison with simple heuristics RAG methods.

| Methods | HotpotQA R@2 | HotpotQA R@5 | MuSiQue R@2 | MuSiQue R@5 | 2Wiki R@2 | 2Wiki R@5 |
|---|---|---|---|---|---|---|
| HippoRAG | 60.5 | 77.7 | 40.9 | 51.9 | 70.7 | 89.1 |
| SubgraphRAG | 61.5 | 73.0 | 42.1 | 49.3 | 70.7 | 85.5 |
| G-Retriever | 53.3 | 65.5 | 38.8 | 45.1 | 60.8 | 67.8 |
| GFM-RAG | 78.3 | 87.1 | 49.1 | 58.2 | 90.8 | 95.6 |

Comment

Thanks for the response and for clarifying that the method was pre-trained over a large number of KGs.

But I'm still not convinced by the results you provide. The provided results only show recall of entities/documents, not the final results of the LLM system after using the retrieved docs. If the task is only retrieving entities and then fetching documents, I feel heuristics, standard retrieval systems, or semantic closeness followed by a query to the LLM would do well (and if, for example, entities are not connected in the same component, how is message passing going to retrieve the disconnected entity?). Also, the comparison with baselines may not be fair, as the proposed method is pretrained on a large dataset. Thus I keep my score (still leaning towards accept).

Best!

Review
Rating: 5

The proposed GFM-RAG essentially implements a GraphRAG paradigm by constructing graphs from documents and using a graph-enhanced retriever to search for relevant documents. It involves two stages of large-scale training to achieve strong generalization with 8M parameters and adheres to neural scaling laws.

优缺点分析

Strengths:

  1. The GFM-RAG pipeline is efficient, conducting multi-hop reasoning with single-step retrieval.
  2. The GFM-RAG pipeline achieves state-of-the-art performance across three multi-hop QA datasets and seven domain-specific datasets.
  3. GFM-RAG follows the neural scaling laws.

Weaknesses:

  1. The main novelty of GFM-RAG lies in the pipeline. But each module in the pipeline is fundamental and conventional, lacking novelty, such as the network architecture of the graph foundation model.
  2. The choice of network architectures and training methods should be considered in the ablation study.
  3. There is a gap between the best performance reported at https://hotpotqa.github.io/ and the performance of the proposed method in Table 2.

问题

  1. The network architecture of the graph foundation model is mainly composed of graph neural networks. The proposed version with 8M parameters is implemented with 6 GNN layers in the appendix. Why should a graph foundation model be backboned with graph neural networks?
  2. A graph foundation model assumes relatively stable knowledge graphs and may struggle with rapidly evolving domains without frequent retraining.

Limitations

yes

Formatting Issues

The paper formatting is OK.

Author Response

W1: Novelty of pipeline components

We understand the reviewer’s concern that individual modules in GFM-RAG may appear conventional. However, the core novelty lies in the integration and scaling of these components into a unified graph-based foundation model designed for retrieval-augmented generation (RAG).

To our knowledge, GFM-RAG is the first scalable foundation model for graph-based retrieval, enabling single-step multi-hop reasoning and adhering to neural scaling laws, with strong zero-shot generalization across domains. These contributions go beyond reusing standard modules, aiming to elevate the RAG paradigm using GNN-enabled foundation models and pretraining strategies.

W2: Architecture and training method ablation

We appreciate the reviewer’s suggestion to provide a deeper analysis of the architectural and training design choices. We have conducted a comprehensive set of ablation studies, with results summarized in Appendix E:

  • In Appendix E.1, we compare different sentence embedding models used in GFM-RAG.

  • In Appendix E.2, we evaluate the effectiveness of the two training tasks (KG completion and document retrieval) and their impact on downstream performance.

  • In Appendix E.3, we examine the sensitivity of GFM-RAG to different task loss weightings.

  • In Appendix E.5, we analyze the impact of model parameter size on performance.

To further investigate architectural design, we varied the number of GNN layers from 1 to 8 while keeping the hidden dimension fixed (512), and evaluated model performance across all datasets. The results are shown below:

Ablation study of different GNN layers.

| Layers (params), d=512 | Avg R@2 | Avg R@5 | HotpotQA R@2 | HotpotQA R@5 | MuSiQue R@2 | MuSiQue R@5 | 2Wiki R@2 | 2Wiki R@5 |
|---|---|---|---|---|---|---|---|---|
| 1-layer (3M) | 53.9 | 66.7 | 59.3 | 74.2 | 40.7 | 50.2 | 61.8 | 75.7 |
| 2-layer (4M) | 69.9 | 78.6 | 73.6 | 85.4 | 47.6 | 57.0 | 88.6 | 93.3 |
| 4-layer (6M) | 72.2 | 80.1 | 78.4 | 87.8 | 49.3 | 60.1 | 88.8 | 92.5 |
| 6-layer (8M) | 71.9 | 79.6 | 78.0 | 87.0 | 48.4 | 58.7 | 89.3 | 93.1 |
| 8-layer (10M) | 73.0 | 79.9 | 79.7 | 87.8 | 49.7 | 59.1 | 89.5 | 92.8 |

We observe that performance generally improves with deeper GNN layers, which we attribute to both the increased model sizes and the ability to capture more complex multi-hop associations. This trend aligns with the neural scaling laws observed in foundation models, where larger parameter counts typically yield better generalization.

Interestingly, we find that performance peaks around 4 layers in some cases. As discussed in Appendix A and Section 4.8, GFM-RAG is designed to capture logical associations from KGs through multi-hop message passing. However, since the maximum number of reasoning hops required by our datasets is 4, additional layers beyond this offer limited benefit—likely due to the absence of higher-hop training signals. This finding supports our hypothesis that GFM-RAG effectively learns query-relevant multi-hop reasoning paths, and that deeper architectures may not improve performance without datasets requiring more complex reasoning.

In summary, these results demonstrate the effectiveness and interpretability of the proposed GNN-based architecture, and confirm that both model capacity and logical expressibility contribute to GFM-RAG’s strong performance. We recognize the potential of other architectural designs and aim to explore them in the future, inspiring the community to do the same.

W3: Performance gap with HotpotQA leaderboard

We thank the reviewer for pointing out the discrepancy with the leaderboard performance. As stated in Section 4.1 and Appendix B, to ensure a fair comparison with recent methods such as HippoRAG and IRCoT, we adopt the same experimental setup, which uses the same 1,000 randomly sampled HotpotQA questions from the validation set rather than the full test set. This setting is widely adopted in prior LLM-based multi-hop QA works to reduce the computational costs associated with LLM-based generation while enabling reproducible and meaningful benchmarking.

Therefore, the performance reported in Table 2 is not directly comparable to the leaderboard numbers based on the full test set. We will clarify this in the final version and note this distinction more explicitly to avoid confusion.

Q1: Why use GNNs as the backbone for GFM?

We appreciate this excellent question. Currently, graph foundation models (GFMs) are still an emerging area without a fixed architectural definition. Our decision to use GNNs is grounded in both theory and practicality:

  • Scalability: Transformer-based graph models are often computationally prohibitive for large graphs. GNNs scale more gracefully with graph size.

  • Logical Reasoning: As discussed in Appendix A, GNNs possess strong logical expressiveness, making them well-suited for multi-hop and compositional reasoning, which is essential for retrieval tasks. Case studies can be found in Section 4.8.

Our implementation (8M parameters, 6-layer GNN) is a starting point toward scaling graph-based foundation models efficiently. We view this as a foundation for future architectural exploration. In our response to W2, we also analyze the impact of different GNN layers.

Q2: Generalization to evolving domains and KGs

We agree that rapidly evolving KGs present challenges. However, GFM-RAG is trained on diverse and large-scale KG corpora, enabling it to generalize to new or unseen domains and KGs without retraining.

As demonstrated in Figure 3, without retraining, GFM-RAG performs consistently well across seven out-of-domain datasets, all of which involve previously unseen domains and KGs. These results demonstrate that GFM-RAG is not overfitted to a fixed KG schema or domain but instead learns generalizable reasoning capabilities. This allows GFM-RAG to adapt to new and evolving KGs without requiring frequent retraining.

Review
Rating: 4

The paper introduces GFM-RAG, a graph foundation model for retrieval-augmented generation, which leverages a query-dependent graph neural network to enhance multi-hop reasoning over knowledge graphs. The model is pre-trained on large-scale datasets and demonstrates strong performance and generalizability across multiple benchmarks.

Strengths and Weaknesses

Strengths:

  • The proposed GFM-RAG effectively integrates graph reasoning with retrieval-augmented generation, addressing limitations of existing methods.
  • The model achieves state-of-the-art performance on multi-hop QA and domain-specific RAG datasets, demonstrating its versatility.
  • The paper provides extensive experiments and ablation studies, validating the design choices and scalability of the approach.

Weaknesses:

  • The proposed GFM occupies a somewhat awkward position. The core of this paper is the proposal of a GFM that can be directly applied to any dataset. However, according to Table 10, on one hand, without fine-tuning, its performance is poor; on the other hand, skipping pretraining has little impact on the results. This suggests that, beyond the framing presented in the paper, simply training and testing directly on the target dataset might be a more reasonable technical approach. The proposed GNN does not seem to have learned sufficient knowledge from pretraining to effectively support zero-shot QA retrieval.
  • Figure 5 reports results for different model sizes, but the experiment only changes the embedding dimension. Could the authors comment on how the number of layers in GFM affects the results?
  • The introduction mentions that KGs constructed by LLMs suffer from incompleteness and noise. Is there a corresponding solution proposed later in the paper? Pretraining on KGC does not address this issue. Moreover, the experiments should include further analysis to demonstrate that GFM can address or alleviate this problem.

Questions

  • Please refer to the Weaknesses section.
  • Could you clarify whether the fine-tuning is performed on the target dataset itself, or on a combined training set from multiple datasets? If it is the latter, is this approach reasonable and fair?
  • The results in Tables 10 and 13 appear to be inconsistent. Please provide further analysis and explanation.

Limitations

yes

Final Justification

Thank you to the authors for the detailed response — it has addressed some of my concerns. I will raise my score, as this is a substantial piece of work that is likely to contribute meaningfully to the field. However, I would like to reiterate a few points that I hope the authors can address:

  • Although the pre-training on KG completion improves KGC performance, this paper primarily focuses on QA. The pre-training does not bring a clear benefit to the QA task, yet it accounts for the majority of the computational cost.

  • As also noted by another reviewer (fCb4), the main experiments are conducted using the training sets of all three major QA benchmarks. This setup is uncommon in foundation model research and does not constitute a sound experimental design.

Overall, this is a solid piece of work. If the paper is accepted, I hope the authors will address these two issues to set a strong precedent for future work in this direction.

Formatting Issues

NA

Author Response

W1: Without fine-tuning, the performance of GFM-RAG is poor and does not support zero-shot QA retrieval. This suggests training directly on the target dataset may be more reasonable.

Clarification of the term “fine-tuning”

We appreciate the reviewer’s concerns and would like to clarify that the term “fine-tuning” in Table 10 refers to supervised document retrieval fine-tuning (Section 3.2.2)—not fine-tuning on downstream QA datasets.

As detailed in Section 3.2, GFM-RAG follows a two-stage training process:

  1. Self-supervised KG Completion Pretraining (stage 1): the GFM is trained to predict masked relations and entities from incomplete KGs, fostering robust multi-hop reasoning and tolerance to KG noise (a toy sketch of this objective follows the list).
  2. Supervised Document Retrieval Fine-tuning (stage 2): analogous to instruction tuning in LLMs, this stage trains GFM-RAG on diverse retrieval instances so it learns to understand users' queries and better capture the relevance between questions and knowledge for retrieval.
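A toy sketch of the stage-1 objective, assuming tail prediction with a cross-entropy loss over all entities (`score_tails` is a hypothetical interface; the actual loss may differ):

```python
import torch
import torch.nn.functional as F

def kgc_pretrain_loss(gfm, kg, triples: torch.Tensor) -> torch.Tensor:
    """Stage-1 self-supervised sketch: mask the tail of each training
    triple and train the model to rank the true tail over all entities.
    triples: [batch, 3] rows of (head, relation, tail) indices."""
    heads, rels, tails = triples[:, 0], triples[:, 1], triples[:, 2]
    # Score every candidate entity as the missing tail of (h, r, ?).
    logits = gfm.score_tails(kg, heads, rels)  # [batch, num_entities], hypothetical API
    return F.cross_entropy(logits, tails)
```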

We acknowledge the terminology may have caused confusion and will revise the paper to consistently use “supervised fine-tuning” for Stage 2 and “domain-specific fine-tuning” for adaptation to individual target datasets.

Table 10 Clarification:

The experiments are conducted on three multi-hop QA datasets (HotpotQA, Musique, 2Wiki). Removing supervised fine-tuning significantly reduces retrieval accuracy, underscoring its necessity for effective query relevance modeling. While skipping KG pretraining has limited impact on retrieval performance, it enhances the model's ability for other tasks, such as KG completion (Table 11), and improves overall robustness in handling incomplete KGs. A detailed explanation is also provided in our response to Q3.

Zero-shot Generalizability

Section 4.6 and Figure 3 demonstrate GFM-RAG’s strong zero-shot retrieval performance across seven unseen domain-specific datasets, surpassing HippoRAG and LightRAG. Table 13 and Appendix E.4 further show that optional domain-specific fine-tuning can further enhance performance when target data is available:

Comparison between the zero-shot performance and the performance of fine-tuning on target domains.

| Model | DelucionQA | EManual | ExpertQA | TechQA | MS Marco | HAGRID |
|---|---|---|---|---|---|---|
| HippoRAG (zero-shot) | 59.0 | 50.0 | 55.1 | 39.5 | 51.1 | 75.5 |
| LightRAG (zero-shot) | 46.1 | 46.2 | 59.4 | 36.8 | 48.3 | 75.9 |
| GFM-RAG (zero-shot) | 70.8 | 60.6 | 62.7 | 46.6 | 71.0 | 84.7 |
| GFM-RAG (domain-specific fine-tuning) | 82.7 | 75.9 | 60.8 | 49.5 | 77.5 | 86.6 |

Is training directly on the target dataset more effective?

To address this, we compare GFM-RAG with methods trained directly on the target multi-hop datasets. Specifically, we compare with the following baselines:

  • HippoRAG, a strong, non-training-required baseline.
  • SubgraphRAG [1] and G-retriever [2], two GNN-enhanced retrieval methods that are trained from scratch on the target datasets.
  • GFM-RAG (target), the proposed GFM retriever that is trained from scratch on the target dataset.
  • GFM-RAG (all), the proposed GFM retriever that is trained on multiple datasets (HotpotQA, MuSiQue, and 2Wiki).

Performance of directly training on target datasets.

| Methods | HotpotQA R@2 | HotpotQA R@5 | MuSiQue R@2 | MuSiQue R@5 | 2Wiki R@2 | 2Wiki R@5 |
|---|---|---|---|---|---|---|
| HippoRAG (no-train) | 60.5 | 77.7 | 40.9 | 51.9 | 70.7 | 89.1 |
| SubgraphRAG | 61.5 | 73.0 | 42.1 | 49.3 | 70.7 | 85.5 |
| G-Retriever | 53.3 | 65.5 | 38.8 | 45.1 | 60.8 | 67.8 |
| GFM-RAG (target) | 79.3 | 87.8 | 47.6 | 57.5 | 89.3 | 93.2 |
| GFM-RAG (all) | 78.3 | 87.1 | 49.1 | 58.2 | 90.8 | 95.6 |

The results show that even when directly fine-tuned on target datasets, GFM-RAG (target) outperforms SubgraphRAG and G-retriever, demonstrating the superiority of the GFM architecture and the effectiveness of two-stage training. Additionally, GFM-RAG (all) performs competitively across all datasets, confirming its ability to leverage diverse KGs and develop generalizable reasoning skills across domains.

We will revise the terminology and include these expanded analyses in the final version.

[1] Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation. ICLR 25

[2] G-retriever: Retrieval-augmented generation for textual graph understanding and question answering NeurIPS 24

W2: Impact of GNN depth

We conducted an ablation varying the number of GNN layers (1–8) while fixing the hidden dimension (512). The results are shown below:

Ablation study of different GNN layers.

| Layers (params), d=512 | Avg R@2 | Avg R@5 | HotpotQA R@2 | HotpotQA R@5 | MuSiQue R@2 | MuSiQue R@5 | 2Wiki R@2 | 2Wiki R@5 |
|---|---|---|---|---|---|---|---|---|
| 1-layer (3M) | 53.9 | 66.7 | 59.3 | 74.2 | 40.7 | 50.2 | 61.8 | 75.7 |
| 2-layer (4M) | 69.9 | 78.6 | 73.6 | 85.4 | 47.6 | 57.0 | 88.6 | 93.3 |
| 4-layer (6M) | 72.2 | 80.1 | 78.4 | 87.8 | 49.3 | 60.1 | 88.8 | 92.5 |
| 6-layer (8M) | 71.9 | 79.6 | 78.0 | 87.0 | 48.4 | 58.7 | 89.3 | 93.1 |
| 8-layer (10M) | 73.0 | 79.9 | 79.7 | 87.8 | 49.7 | 59.1 | 89.5 | 92.8 |

Performance improves with deeper GNNs due to larger size and better multi-hop reasoning, echoing neural scaling laws. Yet, performance sometimes peaks at 4 layers—matching the 4-hop max in training data—suggesting GFM-RAG captures relevant multi-hop paths, with deeper layers offering little gain without more complex signals. Detailed analysis can be found in our response to Reviewer 9ZFX W2.

W3: Addressing KG noise and incompleteness.

We appreciate the reviewer’s concern, which aligns with our use of GNNs for their robustness to KG incompleteness. Real-world KGs are inherently noisy. Unlike structure-only methods like HippoRAG, we adopt GNN and enhance its robustness through self-supervised KGC pretraining: masking relations and predicting links to foster multi-hop reasoning in incomplete settings. Table 11 confirms its effectiveness in handling incomplete graphs.

Experiment results on KG completion

| Method | MRR | Hits@1 | Hits@3 | Hits@10 |
|---|---|---|---|---|
| GFM-RAG | 0.304 | 0.234 | 0.323 | 0.451 |
| GFM-RAG w/o Self-supervised KG Completion Pre-training | 0.029 | 0.007 | 0.022 | 0.067 |

Q2: Ablation Study of Training Datasets

We want to confirm that GFM-RAG is trained on multiple datasets (HotpotQA, MuSiQue, 2Wiki), but we ensure strict separation between training and evaluation data, with no overlap in documents or entities, thus eliminating any risk of information leakage.

The motivation for this setup is rooted in the foundation model paradigm. Similar to how LLMs are trained on diverse corpora to learn generalizable linguistic patterns, GFM-RAG aims to learn universal graph reasoning capabilities from diverse KG-indexes. This design enables strong zero-shot transfer to new domains and datasets, eliminating the need for retraining or fine-tuning on every downstream domain. Results in Fig. 3 and our response to W1 demonstrate the strong generalizability of GFM-RAG on out-of-domain datasets without retraining.

We further conducted ablation studies where GFM-RAG is trained separately on each dataset, and we report performance across all three benchmarks. Results are shown below:

Ablation study of GFM-RAG trained on each dataset.

| Training Dataset | HotpotQA R@2 | HotpotQA R@5 | MuSiQue R@2 | MuSiQue R@5 | 2Wiki R@2 | 2Wiki R@5 |
|---|---|---|---|---|---|---|
| HotpotQA | 79.3 | 87.8 | 46.9 | 57.2 | 86.6 | 92.4 |
| MuSiQue | 68.8 | 81.8 | 47.6 | 57.5 | 84.4 | 89.6 |
| 2Wiki | 72.2 | 77.9 | 46.6 | 55.5 | 89.3 | 93.2 |
| All | 78.3 | 87.1 | 49.1 | 58.2 | 90.8 | 95.6 |

These results show that GFM-RAG not only performs well on the trained datasets, but also generalizes well to other datasets. More importantly, the model trained on multi-domain datasets performs competitively across all datasets—validating its ability to generalize effectively across domains and benefit from training on diverse KGs by learning generalizable reasoning ability across domains.

We will include this ablation study and further clarify the design rationale in the final version of the paper.

Q3: Inconsistency between Table 10 and Table 13.

Table 10 examines the impact of two-stage training strategies, while Table 13 evaluates zero-shot generalizability and domain-specific transferability; their findings are not contradictory.

Table 10 Clarification:

“Fine-tune” and “pre-train” refer to “Supervised Document Retrieval Fine-tuning” and “Self-supervised KG Completion Pre-training,” respectively, as explained in our response to W1.

Removing supervised fine-tuning significantly reduces retrieval accuracy, highlighting its importance for capturing query-document relevance. In contrast, removing self-supervised pretraining has a smaller effect on retrieval but improves generalization to tasks beyond retrieval and handling incomplete KGs.

Table 13 Clarification:

The “fine-tune” in Table 13 denotes the “fine-tuning on domain-specific QA datasets”.

Table 13 compares the zero-shot performance of HippoRAG, LightRAG, and GFM-RAG across seven domain-specific test sets without additional training. GFM-RAG, trained on multi-hop QA datasets, shows strong retrieval performance across all unseen datasets, demonstrating robust zero-shot generalizability.

When fine-tuned on these seven domain-specific training sets, GFM-RAG’s performance further improves, underscoring its strong transferability when target training data is available.

Comment

Thank you to the authors for the detailed response — it has addressed some of my concerns. I will raise my score, as this is a substantial piece of work that is likely to contribute meaningfully to the field. However, I would like to reiterate a few points that I hope the authors can address:

  • Although the pre-training on KG completion improves KGC performance, this paper primarily focuses on QA. The pre-training does not bring a clear benefit to the QA task, yet it accounts for the majority of the computational cost.

  • As also noted by another reviewer (fCb4), the main experiments are conducted using the training sets of all three major QA benchmarks. This setup is uncommon in foundation model research and does not constitute a sound experimental design.

Overall, this is a solid piece of work. If the paper is accepted, I hope the authors will address these two issues to set a strong precedent for future work in this direction.

Comment

Dear Reviewer b7Me,

We sincerely thank you for your thoughtful comments and for raising your score. We are deeply encouraged by your recognition of our work as a substantial contribution to the field. Your insightful evaluation and constructive feedback are invaluable, and we will carefully consider your suggestions as we continue to pursue this direction to advance research in this field.

Thank you again for your time and support.

Best regards, The Authors

Review
Rating: 5

The paper presents GFM-RAG, a novel graph foundation model designed to enhance retrieval-augmented generation (RAG) by leveraging a graph neural network to capture complex relationships within a knowledge graph. This approach enables more effective and efficient retrieval of relevant information, addressing the limitations of traditional RAG methods that struggle with multi-hop reasoning tasks. GFM-RAG constructs a knowledge graph index from documents, representing knowledge as interconnected triples, which facilitates multi-hop reasoning and integrates diverse knowledge sources. The model demonstrates state-of-the-art performance on multi-hop question answering datasets and shows strong generalizability across various domains. It maintains efficiency by performing multi-hop reasoning in a single step and adheres to neural scaling laws, indicating potential for further improvements. Overall, GFM-RAG offers a significant advancement in RAG systems, providing improved reasoning capabilities and generalizability.

Strengths and Weaknesses

Strengths

  1. Innovative Model Design The introduction of the GFM-Retriever model is a significant strength. By leveraging the capabilities of Graph Neural Networks (GNNs), the model effectively helps Large Language Models (LLMs) understand and navigate graph-structured knowledge. This innovative approach addresses the limitations of traditional RAG methods, which often fail to capture complex relationships between pieces of knowledge. The use of a query-dependent GNN allows for dynamic adjustment based on user queries, enhancing the model's ability to perform multi-hop reasoning efficiently.

  2. Robust Experimental Setup and Results The experimental setup is thorough and robust, covering a wide range of datasets and benchmarks. The authors provide extensive experimental results that validate the effectiveness and efficiency of GFM-RAG. The model demonstrates state-of-the-art performance on multi-hop question answering datasets and shows strong generalizability across various domains. This evidence strongly supports the claims made in the paper.

  3. Comprehensive Ablation Studies The paper includes comprehensive ablation studies that explore the impact of different components of the model, such as different sentence embedding models, pre-training strategies, loss weighting strategies, and the construction of the knowledge graph index. These studies provide valuable insights into the model's performance and the importance of each component, demonstrating the authors' commitment to ensuring the robustness of their findings.

Weakness

  1. Lack of Ablation Study on Ent-Document Inverted Index. GFM-RAG differs from methods like LightRAG, which directly retrieve knowledge to answer queries, by using an ent-document inverted index to map retrieved entities back to source documents, which are then used as context for the LLM. This additional step is crucial for the model's operation, yet the paper does not include an ablation study to evaluate its contribution and importance. Such a study would clarify the significance of the ent-document inverted index in the overall performance of GFM-RAG, providing insights into whether this component is essential or if alternative approaches could achieve similar results more efficiently.

This can also lead to increased computational overhead and longer inference times compared to methods that do not require this step. While Table 3 compares retrieval efficiency and performance, it does not include a direct comparison with methods like GraphRAG or LightRAG regarding inference times. Providing experimental data on the inference time differences would help assess whether the additional steps in GFM-RAG significantly increase computational costs and whether these costs are offset by performance gains.

  2. Training Strategy and KG-Index Construction. GFM-RAG constructs 60 KG-indices from 60,000 documents across three datasets to train a single GFM-Retriever, whereas methods like GraphRAG and LightRAG typically do not require training and instead use directly extracted knowledge triples, community reports, or other information. The paper does not explain why a single GFM-Retriever is trained across multiple datasets rather than training separate retrievers for each dataset. An ablation study or an explanation comparing these strategies would provide insight into the efficiency and effectiveness of the chosen approach. Additionally, providing data on the token cost of constructing the KG-index, similar to what LightRAG offers, would help in understanding the resource requirements and facilitate a comparison of the cost-effectiveness of GFM-RAG with other methods. This information would give readers a clearer picture of the trade-offs involved in the model's design.

Questions

See the summary of weaknesses above. For each corresponding weakness, we have provided the relevant concerns and suggestions to address them.

Limitations

Yes

Formatting Issues

No

Author Response

W1.1: Lack of ablation study on ent-document inverted index

We sincerely thank the reviewer for highlighting this point. We have conducted a detailed ablation study to compare several alternative inverted index strategies:

  • IDF+Top-T Pred: our proposed method (Eqs. 14-16), which maps the top-T entities predicted by the GFM to documents using inverse document frequency (IDF)-weighted scores (a toy sketch of this scoring follows the list).
  • IDF+All Pred: uses all predicted entities from the GFM and weights them by IDF (w/o Eq. 14).
  • Top-T Pred: uses only the top-T predicted entities without applying IDF weighting (w/o Eq. 15).
  • All Pred: uses all entity predictions and directly maps them to document scores (w/o Eqs. 14 and 15).
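A toy version of the IDF + Top-T mapping described in the first bullet (illustrative only; Eqs. 14-16 give the exact formulation):

```python
import math
from collections import defaultdict

def doc_scores_idf_topt(entity_scores, ent2docs, num_docs, top_t=20):
    """Keep the top-T entities predicted by the GFM, down-weight entities
    that appear in many documents via IDF, and sum the weighted scores
    per document. entity_scores: dict entity -> predicted score;
    ent2docs: dict entity -> list of containing document ids."""
    top = sorted(entity_scores.items(), key=lambda kv: -kv[1])[:top_t]
    scores = defaultdict(float)
    for ent, s in top:
        docs = ent2docs.get(ent, [])
        if not docs:
            continue
        idf = math.log(num_docs / len(docs))   # rarer entities weigh more
        for d in docs:
            scores[d] += idf * s
    return scores  # rank documents by descending score
```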

Ablation study of different inverted indices.

| Inverted Index | HotpotQA R@2 | HotpotQA R@5 | MuSiQue R@2 | MuSiQue R@5 | 2Wiki R@2 | 2Wiki R@5 |
|---|---|---|---|---|---|---|
| IDF+Top-T Pred (GFM-RAG) | 78.3 | 87.1 | 49.1 | 58.2 | 90.8 | 95.6 |
| IDF+All Pred | 68.1 | 71.4 | 35.8 | 41.2 | 86.0 | 87.5 |
| Top-T Pred | 71.6 | 78.6 | 46.3 | 52.5 | 74.7 | 78.1 |
| All Pred | 77.6 | 82.9 | 41.1 | 46.9 | 88.6 | 90.4 |

The results show that the proposed IDF+Top-T Pred performs the best. This indicates that the inverted index is a crucial component of GFM-RAG: it serves as a bridge between structured reasoning over KGs and the unstructured documents required by LLMs, necessitating careful design.

We acknowledge the reviewer’s concern regarding potential alternatives, and as a promising future direction, we plan to explore end-to-end models that can jointly reason over structured and unstructured knowledge without relying on an explicit inverted index.

W1.2: Inference time comparison

We conducted an empirical analysis comparing GFM-RAG, GraphRAG, and LightRAG in terms of both retrieval latency (Time in seconds) and retrieval quality (Recall@5). The results are shown below:

Inference time comparison.

| Method | HotpotQA Time (s) | HotpotQA R@5 | MuSiQue Time (s) | MuSiQue R@5 | 2Wiki Time (s) | 2Wiki R@5 |
|---|---|---|---|---|---|---|
| LightRAG | 0.861 | 54.7 | 1.109 | 34.7 | 0.911 | 59.1 |
| GraphRAG (MS) | 2.759 | 76.6 | 3.037 | 49.3 | 1.204 | 77.3 |
| GFM-RAG | 0.107 | 87.1 | 0.124 | 58.2 | 0.060 | 95.6 |

The results clearly demonstrate that GFM-RAG outperforms both LightRAG and GraphRAG in retrieval accuracy while also being significantly more efficient in inference time. Despite the additional inverted-index step, GFM-RAG achieves superior latency because the reasoning is executed in a single GNN forward pass over a pre-constructed knowledge graph index.

In addition, GFM-RAG benefits from a quick indexing process, as it does not construct a traditional vector database to store documents and entities:

Indexing time comparison.

| Method | Indexing time (s) |
|---|---|
| LightRAG | 1430.32 |
| GraphRAG (MS) | 1796.43 |
| GFM-RAG | 93.55 |

We will include these results in the final version to improve clarity and better highlight the advantages of GFM-RAG.

W2.1: Ablation study of training datasets

We thank the reviewer for raising this valuable point. We want to confirm that GFM-RAG is trained on multiple datasets (HotpotQA, MuSiQue, and 2Wiki), but we ensure strict separation between training and evaluation data, with no overlap in documents or entities, thus eliminating any risk of information leakage.

The motivation for this setup is rooted in the foundation model paradigm. Similar to how LLMs are trained on diverse corpora to learn generalizable linguistic patterns, GFM-RAG aims to learn universal graph reasoning capabilities from diverse KG-indexes. This design enables strong zero-shot transfer to new domains and datasets, eliminating the need for retraining or fine-tuning on every downstream domain. Results in Fig. 3 demonstrate the strong generalizability of GFM-RAG on out-of-domain datasets without retraining, compared to GraphRAG and LightRAG.

We further conducted ablation studies where GFM-RAG is trained separately on each dataset, and we report performance across all three benchmarks. Results are shown below:

Ablation study of GFM-RAG trained on each dataset.

| Training Dataset | HotpotQA R@2 | HotpotQA R@5 | MuSiQue R@2 | MuSiQue R@5 | 2Wiki R@2 | 2Wiki R@5 |
|---|---|---|---|---|---|---|
| HotpotQA | 79.3 | 87.8 | 46.9 | 57.2 | 86.6 | 92.4 |
| MuSiQue | 68.8 | 81.8 | 47.6 | 57.5 | 84.4 | 89.6 |
| 2Wiki | 72.2 | 77.9 | 46.6 | 55.5 | 89.3 | 93.2 |
| All | 78.3 | 87.1 | 49.1 | 58.2 | 90.8 | 95.6 |

These results show that GFM-RAG not only performs well on the trained datasets, but also generalizes well to other datasets. More importantly, the model trained on multi-domain datasets performs competitively across all datasets—validating its ability to generalize effectively across domains and benefit from training on diverse KGs by learning generalizable reasoning ability across domains.

We will include this ablation study and further clarify the design rationale in the final version of the paper.

W2.2: Token cost for index construction

In Appendix E.7, we report the token cost during index construction. Specifically, we find that constructing the KG-index requires approximately 4.8 M tokens per 1,000 documents, which costs around $2.6 using GPT-4o-mini. LightRAG and GraphRAG cost 5.7 M tokens and 7.6 M tokens, respectively. Compared to other methods, GFM-RAG is more cost-effective as it does not require generating community-level summaries.

Token cost comparison for index construction

| Method | # Tokens per 1,000 documents |
|---|---|
| GFM-RAG | 4.8M |
| LightRAG | 5.5M |
| GraphRAG | 7.6M |

Final Decision

The paper introduces GFM-RAG, a graph foundation model designed to advance retrieval-augmented generation (RAG) by incorporating graph neural networks to capture complex relationships within knowledge graphs. This design improves both the effectiveness and efficiency of retrieval, particularly for multi-hop reasoning tasks where traditional RAG approaches often fall short. GFM-RAG builds a knowledge graph index from documents, representing information as interconnected triples, which supports multi-hop reasoning and integrates heterogeneous knowledge sources.

Empirical results show that GFM-RAG achieves state-of-the-art performance on multi-hop question answering benchmarks and generalizes well across diverse domains. The model further maintains efficiency by enabling multi-hop reasoning in a single step and follows neural scaling laws, suggesting potential for continued gains with larger models. Overall, GFM-RAG represents a meaningful advancement in RAG systems, offering stronger reasoning capabilities and broader applicability.

The paper proposes an interesting idea, and the experiments are comprehensive. Initially, the reviewers had concerns about the ablation studies, training strategies, and so on. However, in the rebuttal, the authors carefully addressed them, and all reviewers gave the paper positive scores. I recommend acceptance.