From Complex to Atomic: Enhancing Augmented Generation via Knowledge-Aware Dual Rewriting and Reasoning
We present an advanced RAG system with knowledge-aware dual rewriting and reasoning capabilities, designed to improve knowledge extraction and rationale formulation within specialized datasets.
Abstract
Reviews and Discussion
The authors propose a knowledge-aware rewriting and reasoning framework, a variant of retrieval-augmented generation (RAG), which is suitable for multi-hop question answering tasks since it can aggregate knowledge from different documents. It consists of four steps: knowledge atomizer, query proposer, atomic retriever, and atomic selector. During knowledge atomizing, a chunk is given to an LLM, which is asked to produce as many questions as possible whose answers are contained in the given chunk; these are included in the knowledge base together with the original chunk. In the next step, an LLM is prompted to break the original question down into subquestions (atomic query proposals), and if there are any chunk/question pairs from the previous step in the knowledge base, these are passed in the context. The next step is the atomic retriever, which uses traditional similarity search with cosine similarity to retrieve the top-k atomic questions/chunks most similar to the query proposals. Finally, an LLM is again used to select which question/chunk is most useful for the atomic query proposal. The experiments section showcases that their method is competitive with other baselines (RAG, zero-shot CoT, Self-Ask, IRCoT, and ProbTree) with respect to F1 and an LLM evaluator.
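To make the summarized pipeline easier to follow, here is a minimal sketch of the loop as described above. It is only an illustration of this summary: every callable (atomize, propose, retrieve, select, generate) is a hypothetical placeholder supplied by the caller, not the authors' implementation.

```python
# Hedged sketch of the loop summarized above; all helper callables are
# hypothetical placeholders injected by the caller, not the authors' code.

def kar_style_answer(question, chunks, atomize, propose, retrieve,
                     select, generate, max_rounds=5):
    # Knowledge atomizing (offline): each chunk is tagged with the
    # atomic questions it can answer; both are kept in the index.
    index = [(aq, chunk) for chunk in chunks for aq in atomize(chunk)]

    context = []  # accumulated (atomic question, chunk) evidence
    for _ in range(max_rounds):
        # Query proposer: atomic query proposals conditioned on context.
        proposals = propose(question, context)
        # Atomic retriever: similarity search over atomic questions.
        candidates = [cand for p in proposals for cand in retrieve(index, p)]
        # Atomic selector: pick the most useful question/chunk, if any.
        chosen = select(question, context, candidates)
        if chosen is None:
            break
        context.append(chosen)

    return generate(question, context)
```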
Questions for Authors
- Which embedding model did you use for the retrieval step (or did you use BM25)?
- What is the chunk size and the number of chunks retrieved?
- You mentioned that there are RAG approaches that use question decomposition without considering available knowledge (L65 of page 2). Which work is this? Please include a reference.
Claims and Evidence
The claims made in the submission are supported by clear and convincing evidence.
Methods and Evaluation Criteria
The authors showcase their method's performance using three different multi-hop QA datasets: HotpotQA, MuSiQue, and 2Wiki. These benchmarks are relevant for assessing performance.
Theoretical Claims
There are no proofs in the paper and the claims are well supported.
Experimental Design and Analysis
I checked and the experimental design and analysis are solid. The only detail I couldn't find is how many tokens per chunk are included or whether the chunks are at sentence or paragraph level.
Supplementary Material
No
Relationship to Broader Scientific Literature
The paper's contributions relate to the broader scientific literature on improving vanilla retrieval-augmented generation when the relevant knowledge is spread across more than one chunk.
Essential References Not Discussed
The authors do a good job of including the relevant references to other works in this research area. Since retrieval-augmented generation is a very active area of research, I would note in the paper that the works cited cover only techniques where knowledge augmentation happens at inference time, not RAG techniques where it happens during training or fine-tuning.
Other Strengths and Weaknesses
The paper is well written and easy to follow with respect to all the algorithmic details. I would like to know which embedding model was used for the retriever step (or whether BM25 was used), as well as the chunk size and the number of chunks retrieved.
Other Comments or Suggestions
- It would be interesting to see an ablation of how the chunk size and the number of chunks retrieved affect the method's performance.
- I would rename the metric Acc to LLM evaluator or similar, since Acc typically refers to another metric and overloading the term is confusing.
- I would ask the authors to flesh out the other requirement, which is to have an LLM with a context window sufficiently long to include the increasing context they provide.
- When you reference Gao et al. 2023, since it is a survey, I would explicitly say "and references therein".
We appreciate the valuable feedback and insightful comments from the reviewer. Below, we address each concern point by point. All tables are accessible via hyperlinks.
Q1. Embedding model used for retrieval.
The text-embedding-ada-002 model is used across all experiments in the paper. For additional hyper-parameter settings, please refer to Appendix A.2 (lines 654-668).
Q2. Chunk size and number of chunks retrieved.
As described in lines 370-373 of the paper, we compile the context paragraphs without additional chunking, resulting in a chunk size of around 500 characters. The detailed statistics are listed in Table 1. In the retrieval phase, the retriever is configured to retrieve 4 atomic questions per atomic query with a relevance threshold of 0.5. However, the actual number of chunks retrieved varies based on the number of queries proposed and their associated relevance scores. Additionally, since an atomic selection step is incorporated in each decomposition round, at most one chunk is retained after each round. Given that the decomposition loop is constrained to at most 5 rounds (N = 5) in our main experimental setup, no more than 5 chunks are utilized during the final answer generation phase.
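As a concrete illustration of this configuration, below is a small, hedged sketch of such a retrieval step (cosine similarity, top-4 per atomic query, 0.5 relevance threshold) over precomputed embeddings; the function and variable names are our own placeholders, not the paper's code.

```python
import numpy as np

# Cosine-similarity retrieval over precomputed atomic-question embeddings
# (e.g., text-embedding-ada-002 vectors), keeping at most top_k=4 hits per
# atomic query that clear a 0.5 relevance threshold. Illustrative only.

def retrieve_atomic(query_vec, question_vecs, top_k=4, threshold=0.5):
    """Return (index, score) pairs of atomic questions similar to one query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    sims = m @ q                                  # cosine similarities
    order = np.argsort(-sims)[:top_k]             # best-first, capped at top_k
    return [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]

# Usage with toy vectors (real vectors would come from the embedding model):
rng = np.random.default_rng(0)
atomic_question_vecs = rng.normal(size=(10, 1536))   # ada-002 is 1536-dim
atomic_query_vec = atomic_question_vecs[3] + 0.1 * rng.normal(size=1536)
print(retrieve_atomic(atomic_query_vec, atomic_question_vecs))
```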
Q3. Which work is the RAG approach (mentioned in L65 of page 2) that uses question decomposition without considering available knowledge?
We thank the reviewer for pointing this out. One such example is Self-Ask ([Press et al., 2023]). We will clarify this in the revised version and explicitly refer to Self-Ask to ensure the statement is supported and precise.
Q4. The paper does not cover RAG techniques where knowledge augmentation happens during training or fine-tuning, only those where it happens at inference time.
Thanks for the suggestion regarding related work. In the revised version, we will include the key works that incorporate knowledge augmentation during training or fine-tuning, such as REALM: Retrieval-Augmented Language Model Pre-Training ([Guu et al., 2020]), and LLaMA-Adapter: Efficient Fine-Tuning of LLaMA for RAG ([Zhang et al., 2023]), among others.
Q5. Rename the metric Acc to LLM evaluator or similar.
Thanks for the suggestion! We understand how the term "Acc" might cause confusion. In the revised version, we will rename the metric to something more descriptive, such as "LLM Evaluator", to better reflect its purpose and avoid overloading the term.
Q6. Using an LLM with a context window sufficiently long to include the increasing context.
To explore the dependency of our approach on the context window size, we analyzed the token distribution on MuSiQue (using GPT-4), considering settings where N = 5 and N = 10 (N is the decomposition round limit, detailed in line 2 of Algorithm 1). The findings are presented in Tables 2 and 3. Table 2 shows that the maximum number of input tokens is around 6K, while the maximum number of output tokens is around 0.5K across all LLM interactions. Additionally, this maximum token consumption shows only a slight increase as N increases and remains within the same order of magnitude, suggesting that the maximum token requirement grows gradually. Furthermore, Table 3 demonstrates that over 99% of the LLM calls are accommodated by models with a token limit of 4096, a capacity commonly supported by existing LLMs. For models providing context windows of at least 8K tokens, all LLM calls can be handled without any token truncation.
Table 3: Prompt Token (per LLM call) Distribution of KAR³ on MuSiQue.
| Prompt Token | Percentage of Calls (N = 5) | Percentage of Calls (N = 10) |
|---|---|---|
| <= 512 | 49.38% | 34.11% |
| <= 1024 | 91.87% | 77.61% |
| <= 2048 | 97.16% | 97.10% |
| <= 4096 | 99.78% | 99.89% |
| <= 8192 | 100.00% | 100.00% |
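For reproducibility, a prompt-token distribution of this kind can be tallied with a tokenizer from the GPT-4 family. The short sketch below is ours, assuming the tiktoken package is installed; the `prompts` list is a stand-in for the actual logged LLM-call prompts.

```python
import tiktoken  # assumes the tiktoken package is installed

# Hedged sketch of how a distribution like Table 3 could be computed.
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer family

prompts = [
    "Decompose the question into atomic sub-questions ...",
    "Select the most useful atomic question for the query ...",
]  # illustrative stand-ins for the logged prompts

bins = [512, 1024, 2048, 4096, 8192]
lengths = [len(enc.encode(p)) for p in prompts]
for b in bins:
    share = sum(length <= b for length in lengths) / len(lengths)
    print(f"<= {b}: {share:.2%} of calls")
```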
This paper addresses the challenge of solving complex, multi-hop queries in domain-specific contexts by introducing a method called KAR^3-RAG. Traditional Retrieval-Augmented Generation (RAG) techniques often rely on straightforward text retrieval methods, which can struggle when queries require multiple steps or hops to reach an answer.
The key idea behind KAR^3-RAG is to reorganize or “atomize” the knowledge base (KB) into smaller units of knowledge in the form of atomic question–answer pairs. From a query standpoint, the method then uses:
- A knowledge atomizer to decompose the large KB into these atomic QA pairs.
- A query proposer to break down the original complex question into more granular sub-queries (or atomic queries).
- An atomic retriever to efficiently retrieve relevant QA pairs.
- An atomic selector that uses these retrieved QAs to generate either an answer or a next-step question.
Empirical results suggest that KAR^3-RAG outperforms various baselines, including on legal question-answering benchmarks and general multi-hop QA tasks.
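To ground the first component listed above, here is a hedged sketch of what a knowledge-atomizer call could look like. The prompt wording and function name are our own illustrative guesses (the paper's actual prompts are given in its appendix), and an OpenAI-compatible client with an API key in the environment is assumed.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python SDK and an API key

client = OpenAI()

def atomize_chunk(chunk: str, model: str = "gpt-4") -> list[str]:
    """Ask an LLM for the questions a chunk can answer (illustrative prompt)."""
    prompt = (
        "List, one per line, as many questions as possible that can be "
        "answered using only the following passage:\n\n" + chunk
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content or ""
    # Each returned line becomes one atomic question tied to this chunk.
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]
```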
Questions for Authors
- How large was the knowledge base in your experiments, and how computationally intensive was the process of generating atomic QA pairs? Do you foresee your approach being scalable to something the size of Wikipedia or large internal corporate/legal databases?
- Could you clarify whether baselines also employed any form of preprocessing or KB re-structuring? If not, how might that affect the reported performance differences?
- Are you aware of tree-based or partial KB re-structuring methods (e.g., [Sarthi et al., 24]) and how might those compare to your full-scale atomic approach, particularly in terms of complexity and performance?
- How can we ensure the quality of the knowledge atomizer? Given its importance, performing high-quality atomization is crucial.
Claims and Evidence
The authors claim that KAR^3-RAG:
- Improves multi-hop retrieval performance by decomposing the KB into atomic QA pairs.
- Outperforms competing retrieval-based methods on both general and domain-specific QA tasks.
These claims are generally supported by the experimental results reported. That said, the methodology for constructing the atomic QA pairs (the knowledge atomization process) provides KAR^3-RAG with an advantage that other methods may not share, so it would be helpful to see a discussion on this additional effort and how it compares to simpler or alternative preprocessing steps.
Methods and Evaluation Criteria
The proposed method makes sense for multi-hop retrieval because it directly tackles the need for stepwise reasoning by structuring the KB itself into smaller, more retrievable units. The evaluation spans both general multi-hop QA tasks and domain-specific (legal) QA, which aligns well with the claim that the approach is general yet especially useful for specialized domains.
However, additional clarity on how much preprocessing effort is involved—and to what extent alternative approaches could achieve similar results with less overhead—would be beneficial. The evaluation is done via standard QA metrics, which is appropriate for comparing question-answering performance.
Theoretical Claims
There is no theoretical claim in this paper.
Experimental Design and Analysis
From the experiments, the methodology seems sound: the paper compares KAR^3-RAG against strong retrieval baselines on multi-hop QA datasets and domain-specific data (e.g., legal documents). The results show improvements in retrieval accuracy and final QA accuracy.
However, one concern is the fairness of the comparison. Because KAR^3-RAG performs a knowledge atomization step, it is not entirely clear whether the other baselines had an equivalent chance to restructure their knowledge bases or if they simply used plain text as-is. The paper might benefit from more discussion on how to ensure each method has a similar knowledge “preparation” or from a deeper ablation study on the effect of atomization.
Supplementary Material
I did not review the supplementary material.
Relationship to Broader Scientific Literature
The paper contributes to the ongoing research in multi-hop retrieval and question answering by introducing the idea of an atomic knowledge base—turning all or most of the raw text segments into smaller question–answer pairs that can be more directly retrieved. This idea builds on previous multi-hop QA work but takes it further by making the entire knowledge base “query-friendly.”
This approach relates to existing multi-hop retrieval frameworks that attempt to break down complex questions into sub-questions. However, instead of focusing purely on query decomposition, KAR^3-RAG also restructures the knowledge base itself. This is a novel angle worth discussing in comparison with other knowledge-base transformation methods.
Essential References Not Discussed
One relevant line of work is the transformation or rearrangement of knowledge bases for tree-based or structured retrieval. For example, there is a method described in:
[Sarthi et al., 24] “RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval.”
Although not widespread yet, this work (or similar) might provide a useful point of comparison. It would be good to see how KAR^3-RAG stands relative to these approaches in terms of complexity, performance, and scalability.
Other Strengths and Weaknesses
Strengths
- Novel restructuring of the knowledge base into atomic QA pairs, which helps address the complexity of multi-hop queries.
- Demonstrates strong improvements in both general and specialized (legal) QA tasks, which suggests broad applicability.
- Brings a fresh perspective to retrieval by rethinking how knowledge is stored and accessed.
Weaknesses
- The cost and scalability of knowledge atomization are not explored in depth, leaving uncertainty about applying this method to large-scale knowledge resources (e.g., Wikipedia).
- It is not fully clear how to compare the results fairly, as the proposed approach benefits from pre-processing the KB, while baselines typically rely on unstructured plain text.
- Comparison to other potentially simpler or less costly knowledge-base transformation approaches is missing (e.g., [Sarthi et al., 24]).
[Sarthi et al., 24] RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval.
Other Comments or Suggestions
- Including more details on the size of the knowledge base used in the experiments (e.g., line 371) would help contextualize the results.
- A discussion on the computational cost and feasibility of creating atomic QA pairs for large-scale knowledge sources would be valuable.
- It might be instructive to conduct an ablation or pilot study on partial or dynamic atomization to see how much the approach relies on the full-blown transformation.
We appreciate the reviewer's feedback. All tables are accessible via hyperlinks.
Q1-1. How large was the knowledge base in the experiments?
Table 1 provides detailed statistics. All chunks are derived from the context paragraphs of the sampled QA, with chunk count varying by dataset. Additionally, the count info is included in Appendix A.1.
Table 1: Chunk Statistics
| Dataset | Avg. Length (chars) | Count |
|---|---|---|
| Hotpot | 546 | 4950 |
| 2Wiki | 422 | 3410 |
| MuSiQue | 484 | 7120 |
Q1-2. How computationally intensive was the process of generating atomic questions?
Table 2 shows the computational cost. Per Appendix A.3, atomization is a one-time step requiring a number of LLM calls equal to the chunk count, with cost scaling linearly with chunk size and the number of generated questions.
Table 2: Preprocessing Token Consumption
| Dataset | Avg. Tokens | Calls |
|---|---|---|
| Hotpot | 338 | 4950 |
| 2Wiki | 321 | 3410 |
| MuSiQue | 320 | 7120 |
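For a rough sense of scale, and assuming the "Avg. Tokens" column is per LLM call, these figures imply on the order of 338 × 4950 ≈ 1.7M, 321 × 3410 ≈ 1.1M, and 320 × 7120 ≈ 2.3M tokens of one-time preprocessing for Hotpot, 2Wiki, and MuSiQue, respectively.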
Q1-3. Discussion of scalability and simpler preprocessing method.
Our approach mitigates scalability concerns for larger datasets through three key features:
- Dynamic addition of atomic questions without structural changes
- Linear preprocessing cost scaling with corpus size
- Compatibility with open-source LLMs (Table 2 in the paper), reducing cost with only a 3% performance drop on MuSiQue
Following the valuable suggestion to employ simpler preprocessing steps, we tested plain-text sentence splitting via spaCy instead of LLM-based atomization, with each sentence serving as an atomic question. After revising the selection prompt (Appendix lines 1375-1396), the results in Table 3 show that while performance drops 7% on MuSiQue, this method still outperforms most baselines from Tables 1 and 2 in the paper. This demonstrates its effectiveness when lower-cost preprocessing is needed, offering a flexible performance-efficiency tradeoff alongside "dynamic addition".
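For reference, a minimal sketch of this kind of sentence-level preprocessing with spaCy is shown below; the function name and example text are ours, and the en_core_web_sm model is assumed to be installed.

```python
import spacy  # assumes spaCy and the en_core_web_sm model are installed

# Hedged sketch of the lower-cost alternative described above: splitting each
# chunk into sentences and treating every sentence as an atomic unit for
# retrieval, instead of generating atomic questions with an LLM.
nlp = spacy.load("en_core_web_sm")

def sentence_atoms(chunk: str) -> list[str]:
    return [sent.text.strip() for sent in nlp(chunk).sents]

chunk = ("The film Example Movie was released in 1920. "
         "It was directed by Jane Doe.")  # clearly fictional example text
print(sentence_atoms(chunk))
```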
Q2. Clarify whether baselines employed preprocessing or KB re-structuring? How might that affect the performance?
Most baselines use standard retrieval methods without preprocessing (details in Appendix Table 5, page 14). Only Self-Ask explicitly generates sub-questions that can query atomic questions using a similar retrieval path (sub-question -> atomic question -> chunk).
Testing atomic questions with Self-Ask, IRCoT, and Iter-RetGen shows a slight performance improvement for Self-Ask (1.6%) due to its natural retrieval path, but decreased performance for IRCoT and Iter-RetGen (see Table 4). This demonstrates that atomization alone does not contribute significantly to performance.
Table 4: Ablation study of baselines with atomic questions
| Variant | Hotpot Acc | 2Wiki Acc | MuSiQue Acc |
|---|---|---|---|
| Self-Ask w/ Atomic Question | 80.00 | 77.60 | 53.00 |
| IRCoT w/ Atomic Question | 77.80 | 65.20 | 47.20 |
| Iter-RetGen w/ Atomic Question | 82.20 | 63.60 | 46.80 |
| KAR³ | 88.00 | 82.20 | 62.60 |
Q3. Awareness of tree-based or partial KB re-structuring methods ([Sarthi et al., 24]). How might those compare to your approach?
We compared KAR³ with other KB-restructuring methods, namely GraphRAG and RAPTOR (Tables 6, 14, and 15 in the appendix, and Table 5 below). Both underperform in F1 scores, likely due to their focus on summarization, which can introduce redundancy. KAR³'s precise chunk atomization enables more accurate retrieval and reasoning, leading to better performance on the complex questions in 2Wiki and MuSiQue.
As for complexity, the preprocessing cost of both KAR³ and RAPTOR scales linearly with chunk size, while GraphRAG incurs significantly higher cost due to its hierarchical KG construction. This highlights that KAR³ achieves outstanding performance with reasonable computational cost compared to RAPTOR and GraphRAG.
Table 5: Comparison of KB re-structuring methods
| Method | Hotpot F1 | Hotpot Acc | 2Wiki F1 | 2Wiki Acc | MuSiQue F1 | MuSiQue Acc |
|---|---|---|---|---|---|---|
| RAPTOR | 12.46 | 81.40 | 10.03 | 69.80 | 6.86 | 55.00 |
| GraphRAG | 10.66 | 89.00 | 11.83 | 71.20 | 9.62 | 49.80 |
| KAR³ | 76.48 | 88.00 | 75.00 | 82.20 | 57.86 | 62.60 |
Q4. How can we ensure the quality of knowledge atomizer?
KAR³ reduces dependence on high-quality atomization in two ways:
- Atomic Questions Mainly as Multi-aspect Indexing: Generates multiple atomic questions from different perspectives, providing relatively comprehensive coverage without requiring perfect atomization.
- Separation of Retrieval and Reasoning: Uses atomic questions only for retrieval, with downstream LLMs handling reasoning, making the system robust to imperfect retrieval.
In Table 3 of the paper, using Llama3 instead of GPT-4 for all components reduces MuSiQue performance by only 2.9%, demonstrating KAR³'s robustness to atomizer quality. While better atomization would improve performance, KAR³ can achieve this incrementally, since it naturally supports adding new atomic questions to existing databases on the fly.
This paper proposes a framework for handling multi-hop questions that require complex reasoning. It has four main components: an atomizer, which generates atomic questions from document chunks; a query proposer, which iteratively generates atomic queries using the input question and the current context; an atomic retriever, which maps the atomic queries to the atomic questions generated by the atomizer (and thereby to their document chunks); and an atomic selector, which selects the atomic question-query-chunk triplet to either append to the context or answer the user question. The authors have evaluated their proposed method against six existing methods and showcase a 20.4% improvement over the second-best method. They have also discussed the main limitation of their method, namely its reliance on the reasoning ability of the underlying LLM, and have conducted an ablation study to understand the impact of the individual components in their pipeline.
Questions for Authors
This method demands a lot of resources. Do you have any thoughts on the trade-off between resource usage and efficiency, and how to effectively weigh the advantages?
Claims and Evidence
Yes. The authors have successfully showcased that their method is effective for multi-hop question answering, and it is especially effective when domain awareness is required for domain-specific complex question answering. For effectiveness in multi-hop question answering, they have demonstrated that the proposed method outperforms existing methods such as standard RAG, zero-shot CoT step-by-step question answering, Self-Ask w/ Retrieval, etc. However, for domain-specific complex question answering they evaluated their method on just two datasets from a single domain, where the method clearly has an advantage due to its inherent nature and how the tasks of those datasets are designed. Thus, a more holistic evaluation with a diverse set of tasks and domains should be done.
Methods and Evaluation Criteria
Yes. The methods discussed in the paper are mainly about multi-hop question answering, and the evaluation datasets and metrics used for experimentation are aligned with that.
Theoretical Claims
Yes, I verified the results shared in the paper and those are accurate.
Experimental Design and Analysis
Yes. The experimental design of the paper is sound with respect to multi-hop question answering evaluation.
Supplementary Material
No, I have not reviewed any supplementary materials.
Relationship to Broader Scientific Literature
The paper discusses how iterative context building and query to knowledge mapping improves the response generation of the RAG system. This is aligned with some of the other papers such as
Essential References Not Discussed
Nothing that I am aware of.
Other Strengths and Weaknesses
Strength: The paper highlights how a static query-to-document-chunk mapping, followed by dynamic mapping of queries to the context and user question, improves response generation in a multi-hop question answering scenario. The novelty is explained in a clear manner, with clear evaluation datasets and methods.
Weakness: The paper does not do a good job of providing evidence for the claim that the method is effective for cases where domain knowledge is required for complex problem solving. The paper also does not discuss an important limitation of the method: it is resource heavy and cannot be used practically when retrieving from dynamic data sources or large-volume data sources such as the web.
Other Comments or Suggestions
N/A
We appreciate the valuable feedback and insightful comments from the reviewer. Below, we address the raised concerns point by point.
Q1. Discussion of the important limitation that the method is resource heavy and cannot be used practically when retrieving from dynamic data sources or large-volume data sources such as the web.
While resource demands are an important consideration, our approach incorporates several key features and optimizations that mitigate these challenges and enhance practicality.
Dynamic Integration of Data Sources: Our method supports dynamic addition of data sources without requiring structural modifications. This contrasts with knowledge graph-based extraction methods, which necessitate updating the graph when expanding the corpus. This flexibility ensures seamless integration of new information.
Scalable Preprocessing Costs: The construction of our knowledge base involves a one-time preprocessing cost that scales linearly with the size of the corpus. This scalability ensures that our approach remains efficient and suitable for processing large-scale datasets. The detailed preprocessing costs for the three benchmarks are provided in Table 12 in the Appendix (page 18).
Optimizations for Large-Scale Data: We acknowledge the need to further optimize preprocessing for extremely large datasets, such as web-scale data, where resource efficiency becomes critical. To address this, we propose alternative atomization strategies to reduce preprocessing costs while maintaining competitive performance.
- Using Open-Source Models: We explore replacing computationally expensive LLMs like GPT-4 with more resource-efficient open-source models such as LLaMA 3 during the chunk atomization step. In additional experiments, this substitution significantly reduced preprocessing costs, with only a minor accuracy drop (~3% on the MuSiQue dataset), as shown in Table 1.
- Sentence-Level Segmentation: For scenarios requiring even lower-cost preprocessing, we propose using sentence-level segmentation to produce the atomic units for retrieval. Although this approach reduces performance (55.2% on MuSiQue), it still outperforms the majority of baselines presented in Table 1 of the main paper, demonstrating its practicality in resource-constrained settings.
Table 1: Ablation study on the preprocessing method on MuSiQue
| LLM Used | Variant | F1 | Acc |
|---|---|---|---|
| Llama 3 | KAR³ w/ plain-text | 45.88 | 54.20 |
| Llama 3 | KAR³ | 50.68 | 59.70 |
| GPT-4 | KAR³ w/ plain-text | 50.72 | 55.20 |
| GPT-4 | KAR³ | 57.86 | 62.60 |
Q2. Thoughts on the trade-off between resource usage and efficiency, and how to effectively weigh the advantages?
The trade-off between resource usage and efficiency is an important consideration for our method, and we have designed it to balance these aspects effectively while offering flexibility based on specific use cases. To this end, we propose alternative atomization strategies that allow users to tailor preprocessing costs to meet their needs.
For example, substituting computationally intensive models like GPT-4 with lighter open-source models such as LLaMA 3 significantly reduces preprocessing costs, with only a minor performance drop (~3% on MuSiQue). Additionally, for resource-constrained scenarios, sentence-level segmentation can be employed, reducing preprocessing overhead while still outperforming most baselines (55.2% on MuSiQue).
The choice of strategy depends on the specific application: high-accuracy requirements may justify the use of more powerful models, while dynamic or resource-limited settings can benefit from cost-effective alternatives. These considerations ensure that our method remains adaptable and effective across a range of scenarios, balancing resource usage and performance as needed. The detailed experiment results are provided in Table 2 of main paper and Table 1 in this response.
In the revised manuscript, we will incorporate a detailed discussion on the trade-off between resource consumption and efficiency, accompanied by experimental results of alternative preprocessing methods.
Q3. Evaluation with a diverse set of tasks and domains should be done.
Thanks for the constructive suggestion. We agree that evaluating our method on a broader set of tasks and domains is crucial for demonstrating its robustness and generalizability. To address this, we have included evaluations on two legal benchmarks in Appendix A.5, which further validate the effectiveness of our approach.
Furthermore, we have applied our method to develop RAG systems in specialized domains, such as manufacturing and healthcare, achieving consistent accuracy improvements of over 15% across a variety of tasks. Due to privacy restrictions, we are unable to release the data from these domains. To further validate and benchmark our method, we are actively exploring publicly available datasets in specialized domains to conduct more rigorous evaluations.
The authors present a new RAG framework suitable for addressing complex questions with a focus on multi-hop. The main idea is based on an iterative process of collecting evidence and generating follow-up questions as required. To devise this iterative process, the authors describe four main components: (1) Knowledge Atomizer: mapping doc chunks to set of answerable questions, (2) Query Proposer: mapping user question and intermediate context to reformulated question candidates, (3) Atomic Retriever: a high-recall lightweight ranking of atomic questions and their respective docs, and (4) Atomic Selector: LLM-based selector of top atomic question and its respective doc to be added to context.
The authors test their method on several datasets corresponding to specialized domains and compare it against several comparable yet differentiated baselines, demonstrating significant enhancement across benchmarks.
Questions for Authors
- The paper addresses and demonstrates KAR-RAG on complex questions that require iterative subtasks, but it is not clear how the system would fare on questions that require parallel subtasks, i.e., where the question can be divided into atomic subquestions that are not dependent on each other. I would like to hear the authors' thoughts on how the KAR-RAG system would fare on such examples.
- Although most of the pseudo-algorithm is clear, it is unclear what the stopping criterion is when no high-quality proposed questions (or possible answers) are available. In other words, it is clear how the system stops when hitting N, or when it deems the available context sufficient, but it is unclear how it would deem the set of proposed questions insufficient to find proper evidence/context.
Claims and Evidence
All claims in the paper are well supported, except for the following one:
The paper claims that LLMs struggle with specialized fields and positions RAG systems as a solution to this shortcoming, attributing it to reasons such as unawareness of technical terminology. Although the claims were supported by references, such a claim might be outdated given contrasting evidence such as: https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2023.1219326/full . The other issue with this claim is that it misses a main application of RAG, which is equipping LLMs with knowledge beyond their cutoff date.
Additionally, the authors motivated this work with specialized fields but later demonstrated the system on pop-culture questions, which reflects a disconnect between the original claim/motivation and the later support. With all that being said, the demonstrated enhancement on specialized datasets closes the gap.
Methods and Evaluation Criteria
Yes they do. Benchmarks and baselines are well represented.
Theoretical Claims
The paper does not make any theoretical claims but is rather a systems paper.
Experimental Design and Analysis
No issues found with experimental design.
Supplementary Material
Yes, I reviewed:
- Case 6(a) referred to in the main manuscript
- Prompts used
- Cost analysis
Supplementary materials were found to be helpful but not properly organized with misaligned tables and figures.
Relationship to Broader Scientific Literature
This work establishes the connection to the broader scientific literature mainly through comparing against similar, yet differentiated, RAG setups, most notably: Self-Ask and IterRetGen. KAR-RAG mainly distinguishes itself by using context to generate follow-up questions and by representing the KB as appended by atomic questions for easier atomic retrieval. The paper also demonstrates enhanced performance over said baselines.
Essential References Not Discussed
None that I can think of.
Other Strengths and Weaknesses
Strengths:
- comparison against a large set of representative baselines and across a diversified benchmark.
- enhanced performance on benchmarks.
- ablation study showing benefit of each component.
- representing document chunks as a set of answerable questions, simplifying retrieval
- helpful supplementary materials
Weaknesses:
- The paper can use more analysis of the results and anecdotes to drive home the advantage of the system presented over other baselines.
- The paper is harder to read than necessary. It can be simplified by reducing unnecessary jargon and the use of running examples.
Other Comments or Suggestions
- The authors can do a better job of differentiating between atomic questions coming from the original question and atomic questions coming from the knowledge base. This got confusing in the document.
- I highly recommend that the authors use a running example earlier in the document. The paper starts off with crowded, unexplained jargon; having a running example to explain the steps along the way would enhance the paper. The authors did provide examples later in the document, but I found them to arrive a bit too late.
We appreciate the valuable feedback from the reviewer. Below, we address each concern point by point.
Q1. How the KAR-RAG system would fare on questions that require parallel subtasks.
KAR³ is specifically designed to handle complex questions by decomposing them into multiple subqueries, enabling effective retrieval and iterative reasoning. This decomposition mechanism allows KAR³ to address both sequential and parallel subtasks. For instance, consider the parallel comparison question provided in Figure 5 of the Appendix (page 22): "Which film came out first, What Women Love or Ramudu Kadu Krishnudu?" KAR³ decomposes this question into atomic subqueries:
(a) "What is the release date of What Women Love?"
(b) "What is the release date of Ramudu Kadu Krishnudu?"
In the first iteration, KAR³ retrieves a chunk tagged with the atomic question "In what year was the film 'What Women Love' released?", relevant to subquery (a), and adds it to the context. In the second iteration, using the updated context and the original question, the system regenerates subquery (b) and retrieves the relevant chunk tagged with the atomic question "In what year was the film 'Ramudu Kadu Krishnudu' released?" Through this iterative decomposition and retrieval, KAR³ resolves parallel subtasks.
Q2. What is the stopping criteria for the system when no high quality proposed questions (or possible answers) are available?
When there are no high-quality proposed queries or no relevant atomic questions, the atomic selector may return an empty or out-of-range atomic question index after evaluating the provided context, which consists of the retrieved atomic questions. This indicates that no additional relevant chunks can meaningfully contribute to answering the original question. In such cases, the decomposition loop terminates, and the system generates a final answer based on the information already accumulated in the context.
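To illustrate this early-termination behavior, here is a tiny hedged sketch of what such a check could look like; the reply format (a bare index string) and the function name are our own assumptions, not the paper's implementation.

```python
# Hedged sketch of the termination check described above; names and the
# selector's reply format are illustrative assumptions, not the paper's code.

def parse_selection(reply: str, num_candidates: int):
    """Return a valid candidate index, or None if the selector picked nothing."""
    reply = reply.strip()
    if not reply.isdigit():
        return None                    # empty or non-numeric reply
    idx = int(reply)
    if idx >= num_candidates:
        return None                    # out-of-range index
    return idx

# In the decomposition loop, a None result ends the loop early (before N),
# and the final answer is generated from the context gathered so far.
assert parse_selection("2", 4) == 2
assert parse_selection("", 4) is None
assert parse_selection("7", 4) is None
```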
Q3. The paper can use more analysis of results and anecdotes to drive home the advantage of the system presented over other baselines.
Thank you for the constructive suggestion. We agree that incorporating additional analysis and case studies will help to emphasize the advantages of our method over the baselines. In the revised version, we will include more baseline analysis and case studies to highlight the strengths of our method.
Q4. The claim about LLMs' unawareness of technical terminology may be outdated given newer evidence, and it misses a key RAG application: equipping LLMs with knowledge beyond their cutoff date.
We appreciate the insightful feedback and the opportunity to clarify and strengthen our claims.
Potential outdated nature of our claim: While we acknowledge that recent advancements have improved LLMs' performance in specialized fields, challenges persist in areas requiring precise understanding of technical terminologies, particularly in dynamic domains with evolving jargon. For example, in OLED-related technologies, the term CSE is often misunderstood by LLMs as "Charge Spread Effect" or "Charge Sheet Effect" when it actually refers to "Channel-Shortening Effect". Such examples highlight the ongoing limitations of LLMs in accurately handling domain-specific acronyms and terminology, especially when context-specific disambiguation is required. We will refine our claim to acknowledge progress in LLMs while addressing areas where challenges remain.
RAG’s role in addressing knowledge cutoff issues: We agree that one of RAG’s critical advantages is mitigating knowledge cutoff limitations by retrieving up-to-date information. We regret the omission of this important point and will explicitly highlight it in the revised manuscript as an important benefit of RAG systems.
Q5. Differentiating between atomic questions from the original question and the atomic questions from the knowledge base.
Thanks for your valuable suggestions. Atomic query proposals are derived by decomposing the original query into subqueries that aid in answering it. In contrast, atomic questions generated from chunks (the knowledge base) are questions that are relevant to and can be answered by the given chunk. In the revised manuscript, we will replace the term "atomic questions" with "atomic tags" to clearly distinguish them from "atomic query proposals" and include illustrative examples.
Q6. Suggestions: a) reducing unnecessary jargon, b) early use of running examples, c) properly organizing supplementary materials.
Thanks for your valuable suggestions. In the revised version, we will a) simplify technical language by reducing unnecessary jargon and using clear, concise terminology, b) introduce running examples early on to illustrate key concepts step-by-step, making the explanations more accessible and easier to follow, c) carefully review and reorganize the supplementary materials to address misalignments in tables and figures.
The authors propose an advanced RAG-based system, Knowledge-Aware dual Rewriting and Reasoning (KAR^3)-RAG, for handling multi-hop questions requiring reasoning with a motivation of supporting highly-technical verticals (e.g., medicine, legal). The key innovation is to 'atomize' the corpus into a knowledge-base like structure consisting of question-answer pairs which are then incrementally retrieved and consolidated to put into the LLM context window for answer generation. Specifically, the key subcomponents include (see Figure 3):
- Knowledge Atomizer: decomposes the larger corpus/KB into atomic QA pairs
- Query Proposer: decomposes the original 'complex' question into 'atomic' (i.e., more granular) sub-queries, possibly considering previously retrieved context
- Atomic Retriever: retrieves atomic QA pairs given the atomic queries
- Atomic Selector: takes the retrieved QA pairs and determines if they should be added to the context
At this point, either additional 'knowledge decomposition' can occur or the context can be packed into the LLM context window for answer generation.
Experiments are conducted on widely used multi-hop QA datasets and specialized legal datasets relative to multiple recent strong baselines, showing solid empirical improvements in most settings with GPT-4 and Llama3 as the base LLM. Finally a sensitivity analysis is performed to address the search depth threshold and an ablation study is performed to show the relative contribution of various subcomponents.
Strengths of this work identified by the reviewers include:
- The proposed method is well-motivated, conceptually appealing, and shown to work well on widely-studied multi-hop QA datasets against strong recent baseline methods. The results are ostensibly the state-of-the-art for these datasets.
- The proposed method is particularly applicable to domain-specific knowledge RAG cases.
Conversely, limitations of this work identified by the reviewers include:
- A primary motivation for the proposed RAG method was reasoning within specialized domains. However, the primary evaluation was on widely used datasets, and there were no experiments that directly addressed this motivating aspect of the method. It seems that the machinery to index data by generated questions accounts for most of the empirical gains, while technical terms, etc. are not a key aspect (or at least this isn't demonstrated).
- It isn't clear that the proposed approach would scale to web-sized data for time-sensitive queries, etc. due to the preprocessing needed for chunking and question generation.
- The paper introduces a lot of technical terminology early in the paper (repeatedly) that isn't defined until later in the paper and isn't really well understood until presented in the context of the figures. As reviewer 9CDV pointed out, the paper would benefit from an earlier figure that introduced a running example. In my opinion, the first 2-3 pages could be tightened up to make room for presenting a subset of the legal results and/or addressing specific 'reasoning' experiments to support motivating aspects of the work.
- While not mentioned by the reviewers, it seems that this 'chunked' corpus may not be as useful for other LLM tasks (including non-multi-hop QA). Thus, it isn't clear if this is a solution specific to multi-hop, but may not be practical for the more general setting.
- There were several detailed questions that were largely answered during rebuttal.
In my assessment, the key technical innovation is the machinery introduced to support chunking of the data and query generation for associated indexing -- which is shown to work well for multi-hop QA requiring RAG operations. The empirical results in this setting are strong. However, my primary concerns are: (1) while this chunking machinery is useful for short-context snippets common to multi-hop QA, it may not be amenable to longer-form settings which is more common in general LLM QA traffic and (2) the paper is motivated by reasoning in vertical domains (e.g., legal, medical), but these experiments are limited (in Appendix A.5 with many results and little discussion). Thus, it seems to represent the state-of-the-art for the target setting, but it isn't clear it is generally deployable (which was discussed to some degree in rebuttal, but not yet incorporated in the paper) nor really solves a primary motivating case in the domain-specific case. This would be a stronger paper if these aspects were more clearly addressed.