LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs – No Silver Bullet for LC or RAG Routing
Abstract
Reviews and Discussion
The paper investigates whether Retrieval-Augmented Generation (RAG) or long-context (LC) generation is superior for answering questions with LLMs. To this end, it first identifies shortcomings of evaluations in current studies and then introduces its own dataset (LaRA) that aims to mitigate these shortcomings. In an extensive evaluation with different models, the authors find that choosing RAG over LC, or vice versa, depends on various factors.
Questions for Authors
As stated above:
- Could it be that the generation produces wrong answers, and the final LC or RAG answers are actually correct?
- Is it possible that GPT-4o creates more complex questions than GPT-4o can answer?
- What role does retrieval play in comparing LC vs. RAG?
- What is the stance of prior literature on your results?
Claims and Evidence
Yes, the majority of the claims are backed by clear evidence. The authors did a very good job of outlining the shortcomings of existing studies, and the experimental results are interpreted well. The paper is also well-written and easy to follow.
Methods and Evaluation Criteria
The paper uses a synthetic pipeline to generate QA pairs. The dataset itself is central to the paper. However, the authors remain vague when describing the exact generation process. For me, it is unclear by which criteria the in-context samples are selected and what the authors mean by refining the prompts when "the pass rate does not meet a predefined threshold" (lines 240-241). Also, while human evaluation is performed when evaluating the LLM's judgements of the answer quality, no such evaluation is performed in the actual question generation process. Thus, as a reader, I just need to "trust" that this went well. Could it be that the generation produces wrong answers, and the final LC or RAG answers are actually correct?
Also, it remains unclear what the synthetic question generation process means for the task complexities. Is it possible that GPT-4o creates more complex questions than GPT-4o can answer?
In my eyes, the one core contribution of the paper revolves around a good dataset. The authors describe very well that current literature has significant shortcomings, but are then not careful enough when defining their own data.
Theoretical Claims
This is not really a theory paper, so I have no comments.
Experimental Design and Analysis
In my eyes, the comparison between RAG and LC is a bit trickier than presented. Essentially, the investigations may amount to comparing two different retrieval strategies: if the model stays the same, the question becomes whether retrieval works better within LC (inside the model) or in RAG (with an external retriever). This perspective is not given space in the paper. While models are varied, retrievers are not. There is literature proposing advanced retrieval strategies (https://arxiv.org/abs/2406.14162, https://arxiv.org/abs/2311.09476), which could significantly enrich the observations.
Besides, there could be more alignment of the findings with prior literature. For instance, Leng et al. (https://arxiv.org/pdf/2411.03538) find similar results that longer contexts are harder. Schimanski et al. (https://arxiv.org/abs/2402.08277) find that open-source models are generally lacking QA capabilities. Li et al. (https://arxiv.org/abs/2407.16833) find that more chunks increase the performance. For me, the paper would become more credible if these results were reflected on.
Supplementary Material
Yes, I have read through the entire appendix. I didn't check any data.
Relation to Prior Literature
As stated, relating to existing insights in the results and broader literature in information retrieval may be helpful.
Missing Important References
I have pasted some examples above. I think nothing entirely critical is missing.
Other Strengths and Weaknesses
As stated above.
Other Comments or Suggestions
While I like the extensive motivation of the paper, I feel it takes up too much space overall. The appendix is relatively short; more experiments and human data investigations would serve the soundness of the paper well.
We appreciate the reviewer's constructive feedback, which helps enhance the clarity of our paper! Let us answer your questions below.
More details about the QA generation process and answer to question “Could it be that the generation produces wrong answers, and the final LC or RAG answers are actually correct?”
Thanks for this advice. We conduct human inspection after generation: after constructing prompts and example QAs, we sample 40 generated QAs for each context type and task and manually verify their validity. We only stop modifying the prompts and in-context examples for a given context type and task when the accuracy exceeds 90%. Additionally, larger models compared to smaller ones, and stronger proprietary models compared to open-source models, consistently demonstrate higher accuracy, which further validates the overall correctness of the QAs. If the correctness of the QAs were not guaranteed, we would likely observe random results. While this process cannot ensure that all QAs are 100% correct, it does guarantee a very high accuracy rate, making them effective for evaluation purposes. We will include these details in our revision.
The impact of retrieval strategies and answer to question “What role does retrieval play in comparing LC vs. RAG?”
We agree that comparing RAG and LC is influenced by many factors, and a complete advanced RAG system can be more complex and have more modules, including query rewrite, different retrieval strategies, reranking, summarization, etc. However, we would like to highlight that this does not affect the value of our work from two perspectives. First, [5] conduct a systematic analysis of RAG implementations, and we adopt their advice to use a hybrid search strategy combining vector search and BM25. We choose gte-1.5 [4], a very strong embedding model released in late 2024, for search and reranking, ensuring that our RAG implementation already employs a strong strategy. On the other hand, our benchmark itself is one of the core contributions, providing effective support for future systematic exploration of the impact of different RAG modules in long-context QA.
Below are the experimental results of replacing gte-1.5 with bge-m3 [6] and adding Recomp [7] as a summarization module (also suggested in [5]). As can be seen, further complicating the RAG pipeline does not bring significant additional gains, but instead makes retrieval a computationally intensive process.
| Model | ours (32k) | bge-m3 (32k) | Recomp (32k) | ours (128k) | bge-m3 (128k) | Recomp (128k) |
|---|---|---|---|---|---|---|
| Qwen-2.5-7B | 62.62 | 61.78 | 62.45 | 56.30 | 55.81 | 56.22 |
| Qwen-2.5-72B | 69.97 | 70.11 | 70.34 | 62.68 | 61.88 | 63.09 |
Is it possible that GPT-4o creates more complex questions than GPT-4o can answer?
Yes, this holds true under our generation process, as we employ a short-context generation method for creating QAs. At Line 244, we mention that generating QA pairs for long texts is inherently a long-context generation problem. To improve generation quality, we divide long contexts into several short segments and generate QA pairs based on these individual segments. This means that GPT-4o only needs to process a short context when generating QAs, but when answering, it must find answers in the complete context, which is far more challenging than generating them.
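To make the asymmetry between generation and answering concrete, below is a minimal sketch of this segment-wise generation process; `segment_context` and `generate_qa_pairs` are hypothetical helpers (the ~10k-token segment length and the actual prompts are described in the paper), so this is illustrative rather than our exact implementation:

```python
from typing import Callable, Dict, List

def segment_context(tokens: List[str], segment_len: int = 10_000) -> List[List[str]]:
    """Split a long token sequence into consecutive ~10k-token segments."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

def build_qa_dataset(tokens: List[str],
                     generate_qa_pairs: Callable[[str], List[Dict]]) -> List[Dict]:
    qa_pairs: List[Dict] = []
    for segment in segment_context(tokens):
        # At generation time, the generator only sees one short segment ...
        qa_pairs.extend(generate_qa_pairs(" ".join(segment)))
    # ... but at evaluation time the model must answer from the full context.
    return qa_pairs
```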
What is the stance of prior literature on your results?
While [1] observes that longer contexts pose challenges for RAG, our work provides a more nuanced analysis, demonstrating that RAG performance is comparable to LC LLMs on tasks such as location and hallucination detection. [2] states that open-source models perform poorly on QA, so we discuss the results of open-source and proprietary models separately in our paper. [3] finds that more chunks can increase the performance of RAG; we also conduct related experiments in Figure 2 to verify the impact of using more chunks. Detailed discussions of these connections to prior work will be added to the appendix in the final version.
References
[1] Leng, Quinn, et al. "Long Context RAG Performance of Large Language Models." arXiv 2024.
[2] Schimanski, Tobias, et al. "Towards Faithful and Robust LLM Specialists for Evidence-Based Question-Answering." arXiv 2024.
[3] Li, Zhuowan, et al. "Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach." EMNLP 2024.
[4] Zhang, Xin, et al. "mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval." ACL 2024.
[5] Wang, Xiaohua, et al. "Searching for Best Practices in Retrieval-Augmented Generation." EMNLP 2024.
[6] Chen, Jianlv, et al. "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation." arXiv 2024.
[7] Xu, Fangyuan, Weijia Shi, and Eunsol Choi. "RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation." ICLR 2024.
Thanks for the clarification. I think these comments make a lot of sense overall. The only point I'm still not 100% sure about is the data quality aspect: I'm sure the majority is right, but how many may be wrong is important for the benchmark in my eyes. However, I think the authors did a great job in responding. If the points mentioned in the response are included, I think this is sound work. Thus, I recommend accepting instead of rejecting! I change the score to 3.
We really appreciate your positive feedback and support! Over the past two days, the authors have conducted an additional human evaluation of the dataset's accuracy. We sample 10 cases from each context type and task, totaling 120 cases (3 context types * 4 tasks * 10 cases each). We only sample cases from the 128k context because the 32k data is obtained using the same pipeline. Out of these, 117 are completely correct, indicating an error rate of approximately 2.5% for LaRA.
We believe that this low error rate makes LaRA suitable for systematic evaluation and that the analysis based on the experimental results is reliable.
Sincerely, Authors of LaRA
The paper proposes LaRA, a benchmark that attempts to answer if RAG is still necessary compared with long-context LLMs.
The LaRA dataset is constructed from novels, academic papers, and financial statements with four tasks: locating specific information, comparing different parts of the text, reasoning about the content, and detecting hallucinations (questions that are not answerable from the provided context).
Results show that the choice between RAG and LC is not trivial, as it varies significantly depending on factors such as model size, query type, type of tasks, context length, context type, and number of retrieved chunks.
Notably, the proprietary LLMs tend to perform better in the long-context configuration, except for the hallucination category.
Questions for Authors
- Please elaborate on the RAG retrieval setup.
Claims and Evidence
This is a dataset/benchmarking paper. The main claims are in experiment findings and are supported by clear and convincing evidence.
Methods and Evaluation Criteria
Overall, I liked how the dataset is structured for evaluation, featuring recent data across various domains and question types, with both 32k and 128k context configurations.
However, the evaluation setup is somewhat restrictive, as all data is designed to fit within the LLM's context window. This raises a key question: what happens when content exceeds this limit? Should RAG be used, or is the context ensemble approach described in Section 2 more suitable? A clearer definition of the scope of long context may be needed.
Theoretical Claims
Not Applicable.
Experimental Design and Analysis
My main concern is the RAG retrieval setup. The paper mentions "5 chunks per document" and a "hybrid search combining embedding similarity and BM25." However, it’s unclear how many chunks are retrieved per question—possibly 5. If the RAG setup retrieves only a small number of chunks without further experimentation, it might be unfair, as important context could be missed due to retrieval inaccuracies.
Supplementary Material
No.
Relation to Prior Literature
Existing literature has provided conflicting evidence on whether RAG is still necessary given long-context LLMs. This paper attempts to answer this question by resolving limitations of existing approaches (insufficient context lengths, data leakage, inappropriate context handling). Findings in this paper show that the choice between RAG and LC is not trivial, as it varies significantly depending on factors such as model size, query type, task type, context length, context type, and number of retrieved chunks.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths
- The paper is well-written, easy to follow, and tackles an important question by comparing RAG with long-context LLMs.
- I liked how the dataset is structured for evaluation, featuring recent data across various domains and question types, with both 32k and 128k context configurations.
- The interpretation of results is well explained, even though there is no definitive answer, as there is "no silver bullet."
Weaknesses
- As noted in the Experimental Designs or Analyses section, my main concern is the potential unfairness in the RAG setup, as it relies on a limited number of chunks without further experimentation. This could lead to retrieval failures, affecting performance.
- The definition of "long context" needs further clarification, as the experiments are capped at 128k context. What happens beyond this limit, and does RAG remain relevant in such scenarios?
Other Comments or Suggestions
- I think the framing of the "Hallucination detection" task could be improved. Typically, "hallucination" refers to model outputs, whereas in this paper, it is used to describe a question that cannot be answered based on the given context. Referring to a question as a "hallucination" may be somewhat misleading.
- I think more analysis of the retrieval size could strengthen the claims of this paper.
We appreciate the reviewer's constructive feedback, and we are very happy that the reviewer liked how our dataset is structured! Let us answer your questions below.
RAG retrieval setup: the potential unfairness in the RAG setup, as it relies on a limited number of chunks without further experimentation
Yes, in the main results, each query retrieves 5 chunks. Specifically, we retrieve 5 chunks by similarity using the GTE embedding model and 5 chunks using BM25, then take their intersection. If the intersection contains fewer than 5 chunks, the remaining chunks are selected equally from the two retrieval methods. We adopt this hybrid search strategy with reference to [1], which is a strong method.
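For clarity, here is a minimal sketch of this selection rule; `dense_top_k` (the GTE embedding retriever) and `bm25_top_k` are hypothetical stand-ins that return chunk IDs ranked by score, so the snippet illustrates the intersection-then-fill logic rather than our exact implementation:

```python
from typing import Callable, List

def hybrid_retrieve(query: str,
                    dense_top_k: Callable[[str, int], List[int]],
                    bm25_top_k: Callable[[str, int], List[int]],
                    k: int = 5) -> List[int]:
    dense = dense_top_k(query, k)    # chunk IDs ranked by embedding similarity
    sparse = bm25_top_k(query, k)    # chunk IDs ranked by BM25 score
    # Start from the intersection of the two top-k lists.
    selected = [c for c in dense if c in sparse]
    # If the intersection holds fewer than k chunks, fill the remainder
    # by alternating between the two ranked lists.
    dense_rest = [c for c in dense if c not in selected]
    sparse_rest = [c for c in sparse if c not in selected]
    take_dense = True
    while len(selected) < k and (dense_rest or sparse_rest):
        pool = dense_rest if (take_dense and dense_rest) or not sparse_rest else sparse_rest
        selected.append(pool.pop(0))
        take_dense = not take_dense
    return selected[:k]
```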
In Figure 2 of the draft, we provide experimental results with different chunk numbers and sizes, comparing the impact of the amount of retrieved information on a smaller model (Qwen-2.5-7B) and a larger model (Qwen-2.5-72B). We gradually increase the number of chunks from 5 to 30. Since 30 retrieved chunks already correspond to a total input length of 30 * 600 = 18,000 tokens, which is on the same order of magnitude as the long-context input, further increasing the number of chunks would cause RAG to lose its significant efficiency advantage. We set 5 as the default number of chunks and 600 as the chunk size primarily by referencing previous works [1, 2].
We further explore the impact of increasing the number of chunks to 100 on these two models with 128k context, and the results are provided below. We find that too many chunks can cause RAG's performance to decrease rather than increase.
| Model / #Chunks | 5 | 10 | 15 | 20 | 25 | 30 | 40 | 50 | 80 | 100 | LC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-7B | 56.30 | 60.15 | 62.80 | 61.65 | 59.76 | 59.62 | 59.22 | 58.47 | 57.39 | 55.61 | 48.91 |
| Qwen-2.5-72B | 62.68 | 63.48 | 64.05 | 64.06 | 65.08 | 67.78 | 68.18 | 67.57 | 66.89 | 65.87 | 65.11 |
The definition of "long context" needs further clarification, as the experiments are capped at 128k context. What happens beyond this limit, and does RAG remain relevant in such scenarios?
This is one of the key designs of our benchmark. With extremely long contexts, RAG has an overwhelming advantage. An extreme example is knowledge-base QA, where an LC LLM cannot process the entire knowledge base and therefore struggles to answer correctly from external knowledge.
As emphasized in the paper, when collecting contexts, we choose texts that are as close as possible to the LLM's limit without exceeding it. This allows for a fair comparison of the actual capabilities of RAG and LC LLMs. If the context exceeded the LLM's input limit, we would need to employ tricks like truncation, which could result in the answer to a question not being present in the LLM's input, thus failing to reflect the LLM's true capabilities. In Table 1, we conduct relevant experiments to verify the impact of such excessively long contexts. In lines 127-149, we specifically analyze why these overly long contexts cannot be appropriately used to compare RAG and LC LLMs in the long-context QA scenario.
We will clarify this point more clearly in the final version.
The framing of the "Hallucination detection" task
Thanks for this suggestion. We will rename "hallucination detection" to the more appropriate term "hallucination occurrence" to express whether RAG and LC LLM produce hallucinations.
References
[1] Wang, Xiaohua, et al. "Searching for Best Practices in Retrieval-Augmented Generation." EMNLP 2024.
[2] Li, Zhuowan, et al. "Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach." EMNLP 2024.
I'd like to thank the authors for the response and additional empirical results. While I liked the contributions of this paper, I remain concerned about the 128k context window cap, which is a significant limitation of the work. Although some might argue that evaluating tasks larger than the model's context window is beyond the scope of this work, I believe it represents a crucial and unavoidable research problem that should not be overlooked, especially for RAG vs. long-context comparisons.
We'd like to thank the reviewer for the feedback on our response and the appreciation of LaRA's contribution. We'd like to further clarify why we intentionally excluded contexts exceeding 128k and chose contexts close to this limit.
Using context lengths exceeding 128k results in an unfair or excessively tricky comparison.
Many open-source and proprietary LLMs have a context window limit of 128k, including the ones we test in this work (Llama-3.1-8B, Llama-3.2-3B, Llama-3.3-70B, Qwen-2.5-7B, Qwen-2.5-72B, GPT-4o, etc.). Testing long-context performance on inputs exceeding 128k would necessitate truncation, making it difficult to fairly compare RAG and LC. We wouldn't know if LC's inability to answer is due to lacking long-context processing ability or information loss from truncation. In Section 2, we empirically verify this point. We find that with a 200k context length, which far exceeds the input limit of some LLMs, truncation can lead to the answer being absent from the LLM's input. This results in a low LC-LLM score, not because the model cannot answer long-text queries, but because the answer is not present in the input. Furthermore, even if the context exceeds 128k, the LLM ultimately processes only 128k due to its limit, making the comparison unfair.
Therefore, choosing 128k context is not a limitation, but a deliberate design for fair comparison between LC-LLM and RAG on current mainstream models.
If needed in the future, extending LaRA to longer context lengths will be very easy
Testing contexts beyond 128k is easy; we could simply include them in LaRA. However, this contradicts our goal as it forces 128k-limited LLMs to handle longer contexts using special treatments (truncation or other tricks), which goes beyond LC-LLM's inherent abilities and makes the comparison too tricky.
In addition to the existing context and QA pairs, we provide comprehensive details and procedures for generating new data. If the context limit of mainstream models increases in the future, allowing them to accept longer inputs, our method can be used to generate new testing data. We can also easily extend our benchmark to larger context lengths. However, our experimental designs and analysis are reasonable and effective for the tested LLMs with a 128k context limit.
We appreciate the reviewer's engagement in this crucial consideration of LaRA's design and are happy to discuss further if anything is unclear.
This paper introduces a new benchmark called LaRA, which is designed to systematically compare Retrieval-Augmented Generation (RAG) and long-context (LC) large language models (LLMs). It evaluates 11 models on 2,326 test cases across four key tasks (information retrieval, reasoning, comparison, and hallucination detection) using naturally occurring long texts. The study finds that neither approach is universally superior. The choice depends on model size, task type, and context length. RAG benefits weaker models and excels in hallucination detection, while LC performs better in reasoning and structured text processing, particularly for stronger models with extensive context capabilities. The findings offer practical guidelines for optimizing LLM applications through strategic use of RAG and LC.
Questions for Authors
- How many chunks were used in the setting of RAG experiments?
- Is hallucination detection a valid task? How do we know the performance change is not purely because of the change in context length?
- In Line 367, are the results for 128k and 32k contexts directly comparable? Do they share the same input?
- Are there task-specific scores for the results in Figure 2? Do different tasks share the same trend?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
N/A
Experimental Design and Analysis
Yes
Supplementary Material
Yes
Relation to Prior Literature
This paper contributes a new benchmark for RAG vs. LC comparison with four carefully designed key tasks. It also provides a comprehensive analysis of different aspects.
Missing Important References
No
Other Strengths and Weaknesses
Strengths:
- The authors provide a novel benchmark featuring four types of questions.
- Comprehensive experiments are performed on the benchmark with multiple LLMs.
- Practical insights are provided based on the experimental results.
Weaknesses:
- Some implementation details are missing from the main paper.
- Some parts of the benchmark (built on 2024 knowledge) may be less useful for future LLMs.
- More analyses could be done on the experiments (see Questions).
Other Comments or Suggestions
- The year information should be included when adding in-text citations.
We would like to thank you for your constructive review, as well as your positive feedback! Let us answer your questions below.
Some implementation details are missing from the main paper---How many chunks were used in the setting of RAG experiments?
In Section 4, in the paragraph "Implementation of RAG" (line 269), we provide the details: "Our evaluation employs a standardized configuration with a chunk size of 600 tokens, 5 chunks per document, and an overlap of 100 tokens between chunks." In the main results, we use a chunk size of 600 and 5 chunks. Additionally, we further explore the impact of more chunks in Figure 2.
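As an illustration of this configuration, a minimal chunking sketch is shown below; `tokens` is assumed to be a pre-tokenized document, and the snippet is illustrative rather than our exact implementation:

```python
from typing import List

def chunk_with_overlap(tokens: List[str],
                       chunk_size: int = 600,
                       overlap: int = 100) -> List[List[str]]:
    """Slide a chunk_size-token window over the document, sharing `overlap` tokens between neighbors."""
    stride = chunk_size - overlap  # each new chunk contributes 500 fresh tokens
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), stride)]
```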
Some parts of the benchmark (built on 2024 knowledge) may be less useful for future LLMs.
For future LLMs, all existing benchmarks may become outdated due to information leakage once they are used to train newer LLMs. However, on one hand, we already use nearly the most recent corpora in our data selection, with almost all contexts released in the second half of 2024, except for the novels. On the other hand, in addition to providing datasets for evaluation, we also provide a concrete pipeline for creating new data, which can be used to generate new datasets with more up-to-date contexts.
Is hallucination detection a valid task? How do we know the performance change is not purely because of the change in context length?
Hallucination remains a significant challenge for LLMs. While RAG can potentially mitigate this issue, we provide a quantitative assessment of RAG's effectiveness in reducing hallucinations specifically in long-context scenarios. Our evaluation spans models of various sizes, including both proprietary and open-source LLMs, comparing RAG against standard long-context input. Our experimental design focuses on measuring models' ability to abstain from answering when presented with unanswerable questions—defining abstention as correct behavior and hallucination as incorrect.
The experimental results yield three key findings: (1) LC LLMs are substantially more susceptible to hallucinations than RAG; (2) model strength does not correlate with reduced hallucination rates, i.e., stronger LLMs do not demonstrate fewer hallucinations; (3) increasing context length corresponds to a higher hallucination probability. Two of these three findings are independent of the increase in context length; therefore, we believe this is an effective task that provides strong support for the study of hallucinations in RAG and LC LLMs.
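For concreteness, the abstention-based scoring described above can be sketched as follows; `is_abstention` is a hypothetical judge of whether a response declines to answer (e.g., an LLM judge or keyword matching), not our exact implementation:

```python
from typing import Callable, List

def abstention_accuracy(responses: List[str],
                        is_abstention: Callable[[str], bool]) -> float:
    """Share of unanswerable questions on which the model correctly abstains;
    any non-abstaining answer counts as a hallucination."""
    if not responses:
        return 0.0
    return sum(is_abstention(r) for r in responses) / len(responses)
```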
In Line 367, are the results for 128k and 32k contexts directly comparable? Do they share the same input?
Yes, they are directly comparable. At Line 244, we mention that generating QA pairs for long texts is inherently a long-context generation problem. To improve generation quality, we divide long contexts into several shorter segments and generate QA pairs based on these individual segments. This means that for both 32k and 128k contexts, GPT-4o generates QA pairs using the same prompts and similarly sized segments. Therefore, the distribution of these QA pairs can be considered approximately equivalent across different context lengths.
Specifically, in lines 249-252 we write "we split the long context into multiple segments, each approximately 10k tokens in length, and input them individually into GPT-4o to generate QAs." In lines 267-273, we clarify that we use different segmentation strategies for different types of contexts. We will make this point clearer in the final version.
Are there task-specific scores for the results in Figure 2? Do different tasks share the same trend?
We find that the trends across different tasks are generally similar, but there are some exceptions. Below we provide the results of Qwen-2.5-72B at 32k length. We find that for the location task, novels perform worse than the other context types, possibly because novels contain more similar content, making it difficult to locate answers. For the reasoning task, papers perform best, which we speculate is because papers have a stronger logical structure and lower information redundancy, making them more conducive to reasoning. We will add a systematic analysis of performance across tasks and context types in the appendix.
| Context Type | Location | Reasoning | Comparison | Hallucination |
|---|---|---|---|---|
| Novel | 72.00 | 76.27 | 71.11 | 88.14 |
| Financial | 89.19 | 61.02 | 72.41 | 82.20 |
| Paper | 88.68 | 84.91 | 63.16 | 84.91 |
This paper studies the problem of benchmarking RAG and long-context LLMs. The authors first revisit the existing benchmarks used to compare RAG and long-context LLMs. They then construct a dataset called LaRA, which contains location-related questions, reasoning-related questions, comparison-related questions, and hallucination detection questions. They conduct experiments with seven open-source LLMs and four proprietary LLMs and systematically analyze the comparison between RAG and long-context LLMs.
Questions for Authors
NA
Claims and Evidence
I think the claims are well-supported.
Methods and Evaluation Criteria
NA
Theoretical Claims
NA
Experimental Design and Analysis
The experimental designs make sense to me.
Supplementary Material
Yes. All.
Relation to Prior Literature
Yes, this could be interesting to a broader community.
Missing Important References
I would encourage the author to discuss [1] which is also a paper comparing RAG and long-context LLMs.
[1] Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG. ICLR 2025.
Other Strengths and Weaknesses
- Line 68, “lanuage” typo
- Line 156, the link for Section "3" does not work.
- I would recommend the authors add another “RAG” and “LC” column in Table 2 to make things clearer.
Other Comments or Suggestions
NA
Thank you very much for your positive review! Please see our responses below.
Discuss with [1]
Thanks for pointing out this important related work. [1] mainly investigates the phenomenon that increasing the number of retrieved passages does not consistently improve the performance of LLMs: as the amount of retrieved information increases, performance first increases and then decreases. Some of our experimental results align with the conclusions in [1]. In Figure 2, we observe that as the number of retrieved chunks and the chunk size increase, LLM performance first improves and then declines. Furthermore, weaker models, such as Qwen-7B compared to Qwen-72B, begin to show performance degradation earlier, indicating that weaker models are more susceptible to large amounts of irrelevant noise in the retrieved information. We will add a discussion of [1] in the revision.
We further explore the impact of increasing the number of chunks to 100 for Qwen-2.5-7B-Instruct and Qwen-2.5-72B-Instruct with 128k context and find that too many chunks cause RAG's performance to decrease.
| Model / #Chunks | 5 | 10 | 15 | 20 | 25 | 30 | 40 | 50 | 80 | 100 | LC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-7B | 56.30 | 60.15 | 62.80 | 61.65 | 59.76 | 59.62 | 59.22 | 58.47 | 57.39 | 55.61 | 48.91 |
| Qwen-2.5-72B | 62.68 | 63.48 | 64.05 | 64.06 | 65.08 | 67.78 | 68.18 | 67.57 | 66.89 | 65.87 | 65.11 |
I would recommend the authors add another “RAG” and “LC” column in Table 2 to make things clearer.
Thanks for this advice, and we will change it in our revision!
Typos
Thanks for pointing out these typos. We have fixed them in the revision and will keep polishing our paper.
References
[1] Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG. ICLR 2025.
The paper introduces LaRA, a benchmark for RAG vs. LC (long context) comparison. They evaluated 11 different LLMs (7 open-source, 4 proprietary) on LaRA. Their key finding is that there's no universally superior approach; the optimal choice (RAG vs. LC) depends on a variety of factors including the LLM's parameter size, its inherent long-context processing ability, the specific context length used, the type of QA task, and the properties of the chunks retrieved by RAG. The paper aims to provide practical guidelines based on these findings and address limitations of prior work.
The reviewers raised several important points; however, overall, these were adequately addressed in an excellent rebuttal in which the authors' thorough responses significantly improved the rigor of the work. I would like to see these discussions make it into the next version of the paper.
These discussions primarily covered data quality, where the authors clarified their dataset generation process and filled in details about their human validation. They explained their method (sampling QAs, manually checking against a >90% accuracy target, refining prompts), and they also shared quantitative error data (~2.5%), building confidence in the benchmark quality. They also alleviated concerns about the fairness of their RAG setup, explaining the hybrid retrieval approach, referencing existing experiments, and importantly, providing new results that showed performance could actually drop with too many retrieved chunks or different RAG components. Finally, smaller issues around references, terminology, and specific implementation details were also handled well, with the authors agreeing to make the needed changes.