Understanding Retrieval Augmentation for Long-Form Question Answering
Abstract
Reviews and Discussion
This work analyzes two LLMs on the LFQA task using the RAG pattern. A superficial metric analysis reveals that RAG does change intrinsic text properties such as length and fluency but does not provide any sense of correctness. To investigate correctness and attribution, the authors collected a small dataset of labeled questions and answers with attributions. By labeling answer attributions, the authors intend to evaluate how effectively the LLMs attend to retrieved documents in the LFQA context. They arrive at conclusions that impact design choices for RAG-LLMs. However, some questions remain about the generality of the conclusions (due to the small dataset used and the limited set of experiments). There are also potential shortcomings of the collected dataset.
Strengths
This paper performs some standard analyses comparing various RAG approaches with different search algorithms and LLMs. The conclusions point to some interesting properties of the investigated algorithms. Beyond this, collecting a dataset with attribution annotations and performing the accompanying analysis across the set of RAG algorithms provides potentially useful information to adopters of the proposed solutions, and allows a deeper analysis of how well the generated answers match up to the retrieved documents.
Weaknesses
I do not think that the superficial-level statistics provide much meaningful information about the RAG pattern in general, or even about these specific versions of it. It is self-evident that generative language models exhibit different linguistic properties when provided with information to contextualize the answer; this has been studied before. Another weakness is that, if I understand it correctly, the dataset that was collected is specific to the algorithms used in this analysis. Since the answers are labeled, applying this dataset to a new algorithm (or even a new generation/inference run) will require some machinery to transfer those labels. This can be challenging, but I did not see a discussion of this process in the paper, so the dataset is limited to "answer attribution" methods only. However, many approaches to this problem couple the answer and citation/attribution generation. Besides that, it is rather on the small side among datasets on this topic. I also find it somewhat surprising that the authors did not analyze the superficial statistics in light of the dataset (e.g., by filtering on attribution accuracy).
Questions
In what ways can the collected dataset be used to improve RAG algorithms? It is not clear to me how to apply it to a specific algorithm; rather, it presents impressionistic suggestions about design patterns, some of which (ordering) are already common knowledge.
Wouldn't conditioning the SLS analysis on correct vs. incorrect results (possibly filtering with your dataset) provide more actionable information on the RAG for LFQA setup? In LFQA, having the answers (or some set of facts required to generate the answers) should make it possible to produce some normalized statistics; would this be available through your dataset?
Thank you for the review.
Re: take-aways from surface statistics
Regarding major takeaways from measuring the surface statistics, please refer to our response to all the reviewers. Our analysis presents higher-level patterns that hold across models (including GPT-4). We also compare prepended documents of varying degrees of relevance; no prior work has studied how the relevance of the prepended documents affects the generated answers.
Re: Specificity of our analysis
As the reviewer comments, the attribution annotations are collected for the outputs of the selected models / algorithms. We acknowledge that human judgments are the most accurate way to evaluate attribution, which is why we collect human annotations in the first place; however, as human evaluation is expensive and difficult to scale, we explore automatic attribution prediction as a cheaper proxy. Researchers can evaluate the attribution of their answers using the highest-performing NLI model, which we identify in Section 7.2. While it would be nontrivial to transfer our collected labels to the outputs of newer models, our annotated data can serve as a benchmark for developing high-performing NLI models that provide reliable estimates for newer models.
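To make this concrete, below is a minimal sketch of how such an NLI-based attribution check could be run with an off-the-shelf entailment model; the checkpoint name and the 0/1 output convention are assumptions for illustration rather than a specification of our exact setup.

```python
# Sketch: check whether an answer sentence is supported by the retrieved documents
# using a seq2seq NLI model (checkpoint name and output convention are assumed).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/t5_xxl_true_nli_mixture"  # assumed TRUE-style NLI checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def is_supported(evidence_docs: str, answer_sentence: str) -> bool:
    """Return True if the model predicts that the documents entail the sentence."""
    prompt = f"premise: {evidence_docs} hypothesis: {answer_sentence}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=2)
    # TRUE-style checkpoints emit "1" for entailed and "0" otherwise (assumed).
    return tokenizer.decode(output_ids[0], skip_special_tokens=True).strip() == "1"
```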
Re: size of dataset
We cover a relatively small number of questions (a total of 100 questions), but we collect sentence-level annotations across six different generation settings, ending up with around 4,000 (answer sentence, attribution label) pairs in total. We also have a significance test showing that the linear correlation between the locations of answer sentences and the locations of their supporting sentences in the documents (Figure 3(a)) is statistically significant, and we will include it in the paper.
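For clarity, the test is of the following form: a minimal sketch computing the Pearson correlation between the relative position of each answer sentence and that of its supporting sentence (the position values here are toy placeholders; the real ones come from our annotations).

```python
# Sketch: significance test for the location correlation in Figure 3(a).
# Each pair holds (relative position of an answer sentence, relative position of
# its supporting sentence in the documents), both normalized to [0, 1].
from scipy.stats import pearsonr

pairs = [(0.0, 0.1), (0.25, 0.2), (0.5, 0.55), (0.75, 0.6), (1.0, 0.9)]  # toy values

answer_pos = [a for a, _ in pairs]
support_pos = [s for _, s in pairs]

r, p_value = pearsonr(answer_pos, support_pos)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")  # small p => correlation is significant
```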
[Q1]: In what ways can the collected dataset be used to improve RAG algorithms?
The collected dataset could serve as a benchmark for evaluating automatic attribution prediction. Better attribution prediction models could be used to decide whether specific parts of the answers are supported by the contexts, and thus improve the attribution of RAG models. We identify novel patterns such as those presented in Figure 3: (1) the order of information presented in the documents roughly aligns with the order of information presented in the answers, and (2) the latter half of the generated answers is less well supported. We also provide numerous actionable insights through our analysis, as detailed in the response to all the reviewers.
[Q2] : Wouldn't conditioning the SLS analysis on correct vs. incorrect results (possibly filtering with your dataset) provide more actionable information on the RAG for LFQA setup?
Thank you for your suggestion! We can compare surface answer statistics between answers that are mostly “supported” by the documents and answers that are not. We will include this analysis in the revision.
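Concretely, the comparison we have in mind looks like the sketch below, which assumes each annotated answer carries the fraction of its sentences labeled as supported along with its surface statistics; the field names are hypothetical.

```python
# Sketch: compare surface statistics between answers that are mostly supported
# by the documents and those that are not (field names are hypothetical).
from statistics import mean

def split_by_attribution(answers, threshold=0.5):
    """Partition answers by the fraction of sentences labeled as supported."""
    supported = [a for a in answers if a["frac_supported"] >= threshold]
    unsupported = [a for a in answers if a["frac_supported"] < threshold]
    return supported, unsupported

def summarize(group, stat_keys=("num_words", "self_bleu", "perplexity")):
    """Average each surface statistic over a group of answers."""
    return {k: mean(a[k] for a in group) for k in stat_keys}

# Usage: supported, unsupported = split_by_attribution(annotated_answers)
#        print(summarize(supported)); print(summarize(unsupported))
```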
I acknowledge that the take-aways from the statistics are nontrivial, but while reading the paper it was hard to understand how significant these observations are (given that most of them are common sense...). I think that emphasizing some of the novelties here makes the paper's contributions clearer, although I don't think this moves the needle much on impact, because these are not actionable observations as I understand them. It seems that most of the significant contributions come from the attribution analysis and... My biggest issue with the paper is the usefulness of the collected dataset for a general audience. The attribution-based analysis may or may not transfer to new LLMs; we can't know, and this work doesn't provide a way to measure it (although it does highlight some interesting things to measure). This is a problem with this domain in general, and this is why automated metrics are necessary at this point. Until this problem has a consensus solution (like gpt-score) or attribution transfer is well solved, datasets like this seem to be one-shot datasets, and we won't have another data-centric breakthrough for evaluation.
I appreciate that this paper is pushing for understanding of recent developments, so I'll take all of these factors and all responses into account in the final review.
The paper studies how retrieval impacts answer generation for long-form question answering by presenting two controlled study settings: 1) fixing the LM and varying evidence documents; 2) fixing evidence documents and varying the LMs. Various attributes of generated answers and the attribution of generated answers to provided evidence documents are studied in this paper. A new dataset with human annotations to evaluate different answer attributions was created.
Strengths
- The authors provide an in-depth analysis of attribution with the newly annotated dataset.
- The story is well presented, and the motivation (Figure 1) is clear.
- The insights from the attribution annotation results are pretty interesting.
Weaknesses
- While the paper demonstrates good motivation and understanding of the so-called long-form question answering problem, I have a different interpretation of the term “long-form”. I thought the problem referred to “long length/width form” or “long structured/semi-structured tables”, which pose a greater challenge for current LLM-based retrieval systems. Therefore, I question whether “long-form” is an appropriate term to accurately define this problem.
- Since the tested dataset consists of a relatively small number of questions (271), it raises the question of why the entire dataset was not utilized for the experiments.
Questions
NA
Thank you for your positive review.
[W1] Re: “long-form” QA terminology
Thanks for letting us know about the confusion surrounding the terminology. The long-form question answering (LFQA) task was first proposed by Fan et al. (2019) [1], who define LFQA as “generating paragraph-length explanations in response to complex, diverse questions”. The term has been used consistently in a range of follow-up work. We will make this definition clear in the revision.
[W2] Re: evaluation dataset size
In the “Dataset” paragraph of Section 3, we wrote “We use the entire test set released by WebGPT (Nakano et al., 2021) (271 questions) for automatic evaluation (Section 4, 7.2), and randomly sample 100 questions to collect manual attribution annotations (Section 5).” As stated, we only use a subset of the test set to collect human annotations. The reason for using a subset is our limited budget: the total cost of the experiments is already 5,886.60 USD (which we will include in the paper), and annotating the whole test set would be prohibitively expensive. We therefore propose to approximate human annotation with NLI models in Section 7, and we report answer attribution predicted by the T5 model in Figure 4(b), which is evaluated on all examples in the test set.
Reference:
[1] Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long Form Question Answering. arXiv preprint arXiv:1907.09190, 2019.
This paper investigates how retrieval capabilities impact various models on long-form question answering tasks. It does so by:
- Investigating answer statistics of various (retrieval documents, models) pairs on ELI5.
- Collecting human annotations on the extent to which the answers are supported by retrieved evidence.
- Evaluating various methods for automatic attribution in the context of multi-document retrieval-augmented generation tasks. No method is at this time competitive with human annotation, but a T5-based attribution model shows the strongest scores among automated methods.
Strengths
- Clarity: the paper states its purpose clearly and gives a wide overview of related work. It is easy to follow and describes its experimental setup well.
- Quality: the research is well executed; code and data are available in the supplementary material. However, the paper does not seem to follow a strict scientific protocol: for instance, in Section 4, the authors make a number of observations on the text generated by various experimental setups without connecting them to higher-level hypotheses that they could then test methodically.
- Novelty: the annotated dataset as well as the evaluation of various models for multi-document attribution prediction are novel pieces of work.
- Significance: as it stands, the paper does not seem to serve a well-identified purpose, and may not attract wide interest from the community as its insights are somewhat disconnected from what matters: the end-to-end human-perceived quality of these long-form question answering systems.
Weaknesses
- The main findings of this work deserve to be stated more clearly. While the annotation of supporting sentences across multiple documents is of interest to the field, this is not the paper's main listed contribution. The paper makes a number of observations on how various retrieval-augmented LFQA systems behave, but without connecting them clearly to a consistent set of conclusions, or giving actionable guidance for researchers designing retrieval-augmented LFQA solutions.
Questions
- Why is Figure 4(a) a box plot? A common assumption for box plots is that they reflect independent, identically distributed samples. In this case, as each point reflects a different dataset and the datasets are the same across models, this assumption does not seem to hold.
- What are the main actionable conclusions from your work that any researcher working on multi-document retrieval-augmented long-form question answering systems should know? For instance,
- what does Figure 3.(a) imply in terms of optimal ordering of documents presented to the LFQA system?
- does retrieving and using longer documents imply an improvement of end-to-end quality?
- how do various models handle different degrees of noise (irrelevant documents) in their context?
Thank you for reviewing our work and identifying the novelty of our data collection and analysis. We address your questions and concerns below:
[W1] Re: Lacking high-level research hypothesis
Thank you for your suggestions on presenting our findings better! We will make the writing clearer by stating the hypothesis behind each of our analyses. For the answer surface statistics, our hypotheses were:
(1) Prepending random documents should impact the surface statistics less than prepending relevant documents. (2) Retrieval augmentation will make the generated answers more specific (e.g., mentioning specific named entities) and diverse, leading to higher perplexity and less repetition (lower Self-BLEU).
For Figure 3(a)(b), showing the location of supporting sentences, our hypothesis was that the order of information presented in the documents would align with the order of information presented in the generated answers. For Figure 2, which shows the similarity of answers between different generation settings, our hypothesis was that answers generated with more relevant documents would differ more from the answers generated without any document.
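For reference, the sketch below shows one common recipe for the Self-BLEU metric mentioned in hypothesis (2), applied to the sentences of a single answer; it assumes the sacrebleu package and is illustrative rather than our exact implementation.

```python
# Sketch: Self-BLEU over an answer's sentences, treating each sentence as the
# hypothesis and the remaining sentences as references (lower = less repetition).
import sacrebleu

def self_bleu(sentences):
    if len(sentences) < 2:
        return 0.0
    scores = []
    for i, hyp in enumerate(sentences):
        refs = [s for j, s in enumerate(sentences) if j != i]
        scores.append(sacrebleu.sentence_bleu(hyp, refs).score)
    return sum(scores) / len(scores)
```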
Re: Why not link analysis to quality of LFQA systems?
This is a great question! Ideally, we would like to analyze how various design choices in retrieval augmentation impact the performance of LFQA systems. However, this is incredibly hard. A recent ACL paper [1] showed that the “end-to-end human-perceived quality” of LFQA is elusive, as even domain experts disagree on ranking different answers. This is caused by the complexity of LFQA evaluation, where a wide range of factors, such as completeness and ease of understanding, are considered, and different people weigh these factors differently. They suggest that both human and automatic evaluation should focus on a single aspect. This is why we focus on the attribution of the answers, which is widely studied by previous and concurrent work for its more straightforward evaluation (Bohnet et al. (2022) [2], Yue et al. (2023) [3], Gao et al. (2023) [4], Liu et al. (2023) [5]).
Re: Actionable conclusions:
Please refer to the response to all the reviewers.
[Q1] Box plots:
We simply aim to visually show each model's performance across a wide range of datasets. If you have suggestions for alternative visualization, we would be happy to modify it! We also provide the full numbers in Table 7 in the appendix.
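One alternative we could adopt is a grouped bar chart with one bar per dataset for each model, which drops the distributional reading entirely; a minimal matplotlib sketch is below (dataset names and scores are placeholders, not our actual numbers).

```python
# Sketch: grouped bar chart as an alternative to the box plot in Figure 4(a)
# (dataset names and scores below are placeholders).
import numpy as np
import matplotlib.pyplot as plt

datasets = ["dataset A", "dataset B", "dataset C"]
scores = {"Model 1": [0.60, 0.70, 0.50], "Model 2": [0.65, 0.60, 0.55]}

x = np.arange(len(datasets))
width = 0.8 / len(scores)
for i, (model, vals) in enumerate(scores.items()):
    plt.bar(x + i * width, vals, width, label=model)
plt.xticks(x + width * (len(scores) - 1) / 2, datasets)
plt.ylabel("attribution prediction score")
plt.legend()
plt.show()
```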
[Q2.1] what does Figure 3.(a) imply in terms of optimal ordering of documents presented to the LFQA system?
Figure 3(a) implies that the ordering of information in the answer tends to be in line with that of the documents. The optimal ordering would depend on the type of question and how you would like to organize the answer. One possible ordering is to follow the discourse structure proposed in Xu et al. (2022) [6]: we could potentially identify which role the information in each document plays and arrange the documents in the desired order. Exploring the optimal ordering of information is out of scope for this paper, and we leave it to future research.
[Q2.2] does retrieving and using longer documents imply an improvement of end-to-end quality?
We reiterate the point that giving an “overall quality” score is difficult, and even experts disagree. We mainly focus on “attribution” in this work and leave the evaluation of other facets to future work.
[Q2.3] how do various models handle different degrees of noise (irrelevant documents) in their context?
We only compare GPT-3 and Alpaca when prepending irrelevant documents, as we do not have access to the WebGPT model. GPT-3 exhibits a more severe difference in the surface statistics, which we discuss in Section 4: “Prepending unrelated documents has little effect on the automatic metrics for Alpaca, but impacts the generation of GPT-3, especially in length and Self-BLEU. This might be related to instruction tuning that enables LMs (Alpaca in this case) to be more robust to irrelevant prompts”. Alpaca shows slightly more change in the percentage of supported answers (Figure 4(b)), but the changes are small for both models.
Reference:
[1] Fangyuan Xu, Yixiao Song, Mohit Iyyer, and Eunsol Choi. 2023. A Critical Evaluation of Evaluations for Long-form Question Answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3225–3245, Toronto, Canada. Association for Computational Linguistics.
[2] Bernd Bohnet, Vinh Q Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, et al. Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037, 2022.
[3] Xiang Yue, Boshi Wang, Kai Zhang, Ziru Chen, Yu Su, and Huan Sun. Automatic evaluation of attribution by large language models. arXiv preprint arXiv:2305.06311, 2023.
[4] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. arXiv preprint arXiv:2305.14627, 2023.
[5] Nelson F Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. arXiv preprint arXiv:2304.09848, 2023.
[6] Fangyuan Xu, Junyi Jessy Li, and Eunsol Choi. 2022. How Do We Answer Complex Questions: Discourse Structure of Long-form Answers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3556–3572, Dublin, Ireland. Association for Computational Linguistics.
Thank you for your answer. While I appreciate the work that has been done and the very clear explanations by the authors, as it stands, I think these conclusions are still slightly too general and / or not actionable enough to be of a wide interest to the community. I understand this is a subjective measurement of novelty, and as such I would be happy to be overruled if other reviewers find these findings are conclusive enough to be published in this venue.
The paper investigates retrieval-augmented language models (LMs) for long-form question answering (LFQA). By comparing answers from different LMs using the same evidence documents, the study analyzes the impact of retrieval augmentation. Emphasis is placed on how generated answers can be attributed to in-context evidence documents. The research provides insights into the behavior of LMs when using retrieval augmentation and reveals novel patterns in long text generation. The study uses questions from the ELI5 dataset and evaluates models like WebGPT, GPT-3, and Alpaca.
Strengths
- The research evaluates off-the-shelf models for detecting attributions, offering a comparative perspective on their performance.
- The research presents two controlled study settings to understand the impact of varying evidence documents and varying LMs, ensuring robustness in findings.
Weaknesses
- While qualitative insights are valuable, an over-reliance on them without sufficient quantitative backing might be a weakness.
- The off-the-shelf models that the authors compared are not comprehensive. I feel it's important to include GPT-4.
- It's unclear what are the nontrivial takeaways from this empirical study.
Questions
- We observed different behaviors across models like WebGPT, GPT-3, and Alpaca when provided with the same set of documents. What do you hypothesize as the underlying reasons for these differences?
Thank you for your careful review.
Re: [W1] qualitative vs. quantitative analysis
We argue that most of our analyses are quantitative, providing metrics to support our conclusions. For instance, our surface statistics analysis (Section 4) presents statistically significant patterns across multiple experimental settings. Similarly, our study of the percentage of supported sentences (Section 5) and their patterns is quantitative. In our understanding, qualitative analysis typically refers to analysis based on anecdotal evidence from a handful of examples, while our study covers more than five metrics on a wide range of examples (hundreds of examples across ten settings). If there are specific types of quantitative analysis that the reviewer thinks would be helpful, we'd be happy to include them if feasible.
Re: [W2] not including GPT-4
We focus our analysis on three carefully chosen models (GPT-3, WebGPT, Alpaca) that exhibit strong LFQA performance. We also include the surface statistics and example outputs of other models (GPT-J 6B, Flan-T5-XXL, Llama-(7B, 13B, 30B), Alpaca-7B, davinci-(001, 002)) in Appendix B4. At the time we conducted the experiments, GPT-4 had not been released yet. Collecting document-level human annotations is nontrivial and expensive (we spent a total of 5,886.60 USD on data collection), so we did not add GPT-4 results at the last minute. We agree with the reviewers that evaluating GPT-4 would be valuable to our analysis. Please refer to the response to all the reviewers for the GPT-4 results on surface statistics and answer attribution predicted by the T5 model.
Re: [W3] non-trivial takeaways
Please refer to the response to all the reviewers.
Re: [Q1] Potential reasons for different behaviors between models
The main difference between WebGPT and the other models is that WebGPT is further fine-tuned with a retrieved document prepended. We find this makes the WebGPT model more faithful to the prepended documents (Table 2).
We thank the reviewers for their reviews and feedback. We are encouraged to see that they found our study to be comprehensive and well-executed (Reviewers T5uy, 6bfG, o7Hh), and that it provides insight into an important problem (Reviewers T5uy, o7Hh).
We provide a response to commonly raised questions among reviewers here:
Reviewers kK3e, 6bfG, and T5uy asked about the major takeaways and actionable insights from our study. We reiterate the major findings here, each of which points to future work directions:
- (1) The level of attribution differs significantly between LMs trained without retrieval augmentation and LMs trained with retrieval augmentation (Table 2, Figure 4(b)). This points to future work on fine-tuning LMs with retrieval augmentation. Prior work [1] showed that retrieval augmentation improves factuality, and we further link retrieval-augmented fine-tuning to improved attribution.
- (2) Both the order and the content of the documents affect the generation. The order of information presented in the documents is reflected in the answers (Figure 3(a)), and prepending irrelevant documents changes the surface statistics of the answers nontrivially (Table 1). This indicates that evidence documents should be added to LMs carefully, since both the order and the content of the contexts matter.
- (3) Attribution errors are more common when the documents are less relevant. Furthermore, our error analysis on attribution shows that retrieval failure is the most common cause. These findings suggest that future work should improve retrieval quality before thinking about improving attribution.
- (4) Existing attribution prediction methods (trained on NLI datasets) show promising performance at detecting unsupported sentences in the answers (Figure 4(a)), but leave ample room for improvement. We provide an annotated benchmark dataset with nearly 4k examples, which can support future work in this direction.
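As an illustration of how the annotated pairs can serve as such a benchmark, the sketch below scores an automatic attribution predictor against the human labels; the field names and metric choices are assumptions for illustration.

```python
# Sketch: evaluate an attribution predictor against the human-annotated
# (answer sentence, attribution label) pairs (field names are hypothetical).
from sklearn.metrics import balanced_accuracy_score, f1_score

def evaluate_predictor(examples, predict_fn):
    """examples: dicts with 'documents', 'sentence', 'label' (1 = supported).
    predict_fn: returns 1 if it predicts the sentence is supported, else 0."""
    gold = [ex["label"] for ex in examples]
    pred = [predict_fn(ex["documents"], ex["sentence"]) for ex in examples]
    return {
        "balanced_accuracy": balanced_accuracy_score(gold, pred),
        "f1_unsupported": f1_score(gold, pred, pos_label=0),  # detect unsupported sentences
    }
```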
In terms of actionable guidance, we summarize as follows:
Finding (1) suggests that researchers and practitioners should design their training process with retrieval augmentation in mind to improve attribution. Findings (2) and (3) suggest that future work should focus on improving the retriever component in order to achieve better attribution of the generated answers, and on developing retrieval augmentation that is more robust to factors such as document ordering. Finding (4) suggests that future work is needed to correctly identify attribution to multiple evidence documents.
Reviewer T5uy raised the concern of not including GPT-4 results. We therefore generated answers with GPT-4 across five settings (without documents, +Human docs, +WebGPT docs, +Bing docs, +Random docs) and computed the surface statistics for GPT-4 in all prepended-document settings. The results show trends consistent with those of GPT-3 described above, except for generations with random documents, where GPT-4 tends to abstain from answering (responding with “The documents provided do not contain information on…”) because of the irrelevance of the documents. We will include the numbers on surface statistics in the revision.
| Model (+ evidence) | # Sentences | # Words | RankGen | Self-BLEU | Perplexity |
|---|---|---|---|---|---|
| GPT-4 | | | | | |
| +Human docs | | | | | |
| +WebGPT docs | | | | | |
| +Bing docs | | | | | |
| +Random docs | | | | | |
We also compute the attribution of GPT-4-generated answers to various types of documents using the T5 model. The results are shown below. The trends are consistent with GPT-3, although GPT-4 outputs obtain slightly higher attribution numbers. We will include the new results in the revision.
| Model (+ evidence) | Human Docs | WebGPT Docs | Bing Docs | Random Docs |
|---|---|---|---|---|
| GPT-4 | 28.94 | 33.97 | 24.59 | 5.69 |
| +Human docs | 74.97 | 42.02 | 22.17 | 4.34 |
| +WebGPT docs | 35.46 | 81.34 | 23.60 | 4.25 |
| +Bing docs | 28.79 | 35.73 | 56.99 | 4.54 |
| +Random docs | 8.58 | 10.38 | 11.06 | 9.02 |
Reference: [1] Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, et al. Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study. arXiv preprint arXiv:2304.06762, 2023.
This is a controversial paper. It would be nice to see responses to the authors' comments.
This paper analyzes aspects of retrieval-augmented long-form question answering systems. The analysis includes the effect of retrieval relevance and different approaches to attribution. However, the reviewers felt that this analysis did not provide sufficient guidance for improvement on this task, and a quantitative demonstration of improvement was lacking.
Why not a higher score
See the metareview.
Why not a lower score
There is no lower score.
Reject