PaperHub
Rating: 7.5/10 (Oral; 4 reviewers; min 6, max 8, std 0.9)
Individual ratings: 8, 8, 8, 6
Average confidence: 3.3
ICLR 2024

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-03-12
TL;DR

We introduce Self-RAG, a new training and inference framework to enable an LM to learn to retrieve, generate, and critique.

Abstract

Keywords
Retrieval-augmented Generation, Language Models, Retrieval-augmented LMs, Factuality

Reviews and Discussion

Official Review
Rating: 8

The paper studies retrieval-augmented generation, where it aims to train a language model (LM) that learns to retrieve external documents on demand and generate a better response. Especially, the model learns to output some reflection tokens that serve different purposes. First, the model would generate a retrieval token to indicate that the current continuation needs external documents. Then with each of the retrieved documents, the model would generate a special token to indicate whether the document is relevant. After generating some text, the model would further generate the critique tokens to indicate whether the generated text is grounded by the document and whether it is helpful for the overall generation. These tokens would then enable controllable text generation during inference. Experiments show that the proposed model outperforms several other competitive baselines that are augmented with a retrieval mechanism but do not have the self-reflection step.

Strengths

  1. The proposed method addresses two key problems including relevance and grounding by simply adding special tokens learned through fine-tuning.
  2. The fine-tuning does not rely heavily on human annotation. Rather, it makes use of GPT-4 to provide the training data.
  3. The reflection tokens allow users to have more control over the generation process to customize the expected response.

Weaknesses

  1. The reflection tokens might be useful to select more promising generations during the decoding time. But it seems that they do not affect (or guide) the generation process from the beginning and might not help if none of the candidates is good.
  2. If GPT-4 somehow can decently provide the labeled data required for each reflection step, it seems intuitive to just instruct GPT-4 to obtain the ideal response that is grounded by the input. Also, GPT-4 might make the decisions on whether to retrieve or not differently from the small LMs as GPT-4 memorizes a lot more world knowledge. Therefore, the annotation given by GPT-4 might not be suitable for small LMs. It would be great to have some discussion or clarification on this.

Questions

  1. In section 3.2.2, the authors mention that during the training time, they mask out the retrieved text chunks. What is the purpose of this masking step?
  2. Other than just providing post hoc feedback on how good the current generation is, do the reflection tokens also affect/guide the generation process somehow?
Comment

Generating training data using GPT-4

(Weaknesses 2) If GPT-4 somehow can decently provide the labeled data required for each reflection step, it seems intuitive to just instruct GPT-4 to obtain the ideal response that is grounded by the input.

Creating extensive training data from GPT-4 comes with a high cost, and how to automatically control the quality of data generated by a language model remains an open question (Wang et al., 2023).

  • Regarding cost, creating 150k new training instances by prompting GPT-4 with detailed instructions would be really expensive (e.g., roughly USD 10k assuming 2k tokens per instance, which can be even longer in practice once detailed instructions are included) and time-consuming due to API rate limits. Our method only requires collecting medium-scale training data for the Critic model from GPT-4, significantly reducing the costs.
  • Even if we add detailed instructions to obtain a better response, it is still unclear whether GPT-4 can always generate ideal outputs grounded in the given input, as prior work reports that even state-of-the-art models often generate outputs that are not supported by the input context (Liu et al., 2023; Gao et al., 2023). Also, one challenge in retrieval-augmented generation is that retrieved passages are not always relevant or helpful (Mallen et al., 2023), so a model needs the ability to identify which passages are relevant and how to incorporate them. This problem may not be fully addressed by generating and training models on ideal GPT-4 outputs. We added more discussion in Section 3.2.1.

Also, GPT-4 might make the decisions on whether to retrieve or not differently from the small LMs as GPT-4 memorizes a lot more world knowledge. Therefore, the annotation given by GPT-4 might not be suitable for small LMs. It would be great to have some discussion or clarification on this.

As shown in Tables 9 and 10, we ask GPT-4 to judge whether retrieving passages enhances generation quality in general or whether the continuation requires factual verification, rather than asking whether it (GPT-4) itself would benefit from retrieval, in order to obtain model-agnostic annotations for the retrieval tokens. We design few-shot demonstrations to make sure the model's predictions follow this intuition. By doing so, we believe the GPT-4 predictions become more model-agnostic. Also, the final reflection tokens for the generator training corpus are predicted by our Critic model, which is based on Llama2-7B due to the expensive API costs (see the response above). We believe that combining those two models for retrieval token predictions makes the predictions more model-agnostic.

Comment

We appreciate your strong support. In the following section, we offer our responses to both the questions raised and the identified weaknesses.

Passage masking during training

(Question 1) In section 3.2.2, the authors mention that during the training time, they mask out the retrieved text chunks. What is the purpose of this masking step?

Without masking, our model would learn to generate the retrieved passages themselves (since passage tokens would be included in the loss calculation), rather than learning to use the given retrieved passages at inference time. The main focus of this work is to train an LM to be better at leveraging retrieved passages to be more factual, controllable, and attributable, and the model does not need to memorize and learn to generate the evidence passages itself. Empirical findings also indicate that not masking out passages leads to a drop in model performance on certain tasks.
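As a rough illustration of this masking (a minimal sketch, not the authors' training code; the helper name and span format are hypothetical), the retrieved-passage tokens can be assigned the loss ignore index so they still condition generation but are excluded from the cross-entropy objective:

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels_with_passage_masking(input_ids, passage_spans):
    """Create next-token-prediction labels that exclude retrieved-passage tokens.

    input_ids:     1-D LongTensor of token ids with retrieved passages spliced in
    passage_spans: list of (start, end) index pairs covering each inserted passage
    """
    labels = input_ids.clone()
    for start, end in passage_spans:
        # Passage tokens remain visible as context via input_ids,
        # but contribute nothing to the cross-entropy loss.
        labels[start:end] = IGNORE_INDEX
    return labels

# Hypothetical usage with a Hugging Face-style causal LM:
#   labels = build_labels_with_passage_masking(input_ids, passage_spans)
#   loss = model(input_ids=input_ids.unsqueeze(0), labels=labels.unsqueeze(0)).loss
```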

Guiding generation using reflection tokens

(Question 2) Other than just providing post hoc feedback on how good the current generation is, do the reflection tokens also affect/guide the generation process somehow?

(Weaknesses 1) The reflection tokens might be useful to select more promising generations during the decoding time. But it seems that they do not affect (or guide) the generation process from the beginning and might not help if none of the candidates is good.

Reflection tokens help us to identify and choose better segments, similar to rejection-sampling / best-of-n sampling in RL, although we do not rely on external reward models at inference time. While we agree that self-reflective decoding itself does not make models generate a better set of initial samples, it enables us to recognize situations where none of the segments are deemed acceptable. This capability allows practitioners to adjust the inference pipeline for improved generations. For instance, if all passages are deemed irrelevant by Self-RAG, we can sample more passages from other retrieval systems, or if none of the model generations receive high scores/support, we can choose to abstain from answering.
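A minimal sketch of this selection-or-abstention step, assuming each candidate segment comes with its sequence log-probability and the probabilities of the desirable reflection tokens (the weights, field names, and abstention threshold below are illustrative, not the paper's exact formulation):

```python
def critique_score(cand, w_rel=1.0, w_sup=1.0, w_use=0.5):
    """Combine the sequence score with reflection-token probabilities.

    cand: dict with 'seq_logprob' and the probabilities of the desirable
    reflection tokens (relevance, support, usefulness). Weights are
    inference-time knobs rather than learned parameters.
    """
    return (cand["seq_logprob"]
            + w_rel * cand["p_relevant"]
            + w_sup * cand["p_supported"]
            + w_use * cand["p_useful"])

def select_or_abstain(candidates, min_score=None):
    """Keep the best-scoring candidate segment; return None when nothing is
    acceptable so the caller can retrieve more passages or abstain."""
    best = max(candidates, key=critique_score)
    if min_score is not None and critique_score(best) < min_score:
        return None
    return best["text"]
```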

In our preliminary experiments, we briefly explored the insertion of special tokens at the beginning of the generation, similar to CTRL (Keskar et al., 2019) or Quark (Lu et al., 2022) to guide generation. However, we found that inserting special tokens had limited effectiveness in controlling the quality of generations, such as completeness and factuality. Whether controlled token-based generation could enhance generation quality beyond styles or toxicity is still an open question, and we leave it for future work. Another promising direction to further enhance initial generations is to combine Self-RAG with fine-grained RLHF (Wu et al., 2023) using the Critic as a reward model.

Official Review
Rating: 8

This paper proposes a new framework, named Self-RAG, for training LMs to retrieve passages, generate text, and evaluate their own generations on demand. Specifically, the proposed framework first uses a critic model C, distilled from prompted GPT-4, to offline-augment existing instruction-finetuning data with control tokens (retrieval or critique tokens) as well as retrieved passages. The generator is then trained on the augmented data so that it also generates the target response along with control tokens. At inference time, the decoding algorithm is modified to take actions when a control token is decoded. The authors implemented the framework on top of Llama 2 7B and 13B and evaluated the models across various text generation tasks. Results show that the model can significantly improve Llama 2's factuality.

Strengths

  • The paper is well written and nicely presented.
  • The paper proposes a novel framework that enables dynamic, on-demand retrieval and self-reflection in LM decoding. It addresses a key limitation in most existing RAG frameworks, namely that the model does not know when to retrieve.
  • The proposed method provides an interesting alternative to RLHF for using critique models. I really like its controllability and interpretability that RLHF doesn't have.
  • Experiments and ablations are sound and convincing.

Weaknesses

  • There is some related research that prompts the model to decide when retrieval is needed and how to use retrieval during decoding, e.g., "Active retrieval augmented generation" by Jiang et al. [1]. It would be nice to study what benefits the offline data augmentation & training approach can bring compared to these prompting-based frameworks.
  • Since most experiments use automated evaluation focusing on accuracy and factuality, it is still unclear to me if such a training approach could hurt some aspects of the model, e.g., instruction following, creativity, or reasoning.
  • One limitation of the proposed Self-RAG is that the model sees each passage independently of the others. The generator cannot synthesize multiple passages.

[1] Jiang, Zhengbao, et al. "Active retrieval augmented generation." arXiv preprint arXiv:2305.06983 (2023).

Questions

  • How does Self-RAG's training affect the model in terms of aspects other than factuality? e.g., instruction following, creativity, or reasoning.
  • How good is the model at deciding when to retrieve? It would be nice to show the triggering rates of "retrieval" token on different types of tasks.
  • How much training data is needed? Does the training data need to be carefully sampled, e.g., focusing on fact-seeking slices?
Comment

Amount of training data

Sampling

For our current Self-RAG, we first sample many input-output pairs, and then run the Critic and retrieve passages. During the augmentation stage, we down-sample instances that do not require retrieval at all at a 50% rate. When we insert a single passage for an output segment, we prioritize passages that are relevant and provide full support for the succeeding output over irrelevant or contradictory passages, while we intentionally sample irrelevant or unsupported passages for an open QA dataset (NQ), which otherwise yields a much higher proportion of fully supported passages. We added a detailed description of this process in Appendix Section A.3, and our data creation script will be released.

Prompting-based frameworks

(Weakness 1) There is some related research that prompts the model to decide when retrieval is needed and how to use retrieval during decoding, e.g., "Active retrieval augmented generation" by Jiang et al. [1]. It would be nice to study what benefits the offline data augmentation & training approach can bring compared to these prompting-based frameworks.

This work investigates a learning-based approach to improve retrieval-augmented LMs, in contrast to prior work prompting strong black-box LMs. We found this is more suitable to achieve the Self-RAG style inference process for the following reasons.

  • Self-RAG requires careful multi-aspect fine-grained self-evaluations at inference time. As shown in Tables 8-12, to make an LM comprehend fine-grained aspects and scoring systems, precise and detailed instructions as well as few-shot demonstrations are necessary. This significantly increases the input sequence length, resulting in higher costs and latency.
  • Nevertheless, we briefly tried prompting-based approaches in our preliminary experiments and found it is nontrivial. When we combine all instructions for all aspects and feed them to the target pre-trained LMs (GPT-3 davinci-003/002, Llama2-13B-chat), all models struggle to precisely follow our evaluation scheme, often generating output formats that do not suit our scheme or reflections that show high disagreement with humans, due to confusion among the dozens of special tokens.
  • To make the most use of the Self-RAG potential, we need to use the token probabilities for the reflection tokens, which may not be always available for black box proprietary LM APIs. Note that at the time of submission, ChatGPT and GPT-4 do not provide log probability information via API, preventing us from applying the Self-RAG algorithm to such models. This limitation is also discussed in the Active Retrieval paper, which also requires access to token probabilities.

While several concurrent or recent works use proprietary LMs to conduct self-evaluation during inference, most work only considers one aspect during generation, and it is still an open question, and indeed nontrivial, how to make models conduct fine-grained multi-aspect evaluations. That being said, exploring Self-RAG-like inference algorithms on top of proprietary LLMs without training is interesting and a natural follow-up. We added these detailed discussions in Appendix Section A.2.

Applications to enhance creativity, reasoning, and instruction-following

(Question 1) How does Self-RAG's training affect the model in terms of aspects other than factuality? e.g., instruction following, creativity, or reasoning.

(Weakness 2) Since most experiments use automated evaluation focusing on accuracy and factuality, it is still unclear to me if such a training approach could hurt some aspects of the model, e.g., instruction following, creativity, or reasoning.

Our goal in this work is to enhance the factuality of generations with on-demand retrieval from a rich knowledge corpus (i.e., Wikipedia) and reflections, without hurting the original versatile capabilities of competitive language models, which prior work suggests retrieval augmentation can hurt (BehnamGhader et al., 2023; Gao et al., 2023). Among our evaluation tasks, ARC-Challenge is also considered a reasoning task, and we demonstrate Self-RAG's effectiveness on ARC. Investigating the effectiveness of retrieval-augmented generation for reasoning, creativity, or instruction following is a great avenue for future work. Some recent work shows that retrieval-augmented approaches can enhance reasoning tasks by finding closely relevant few-shot demonstrations from the training dataset. Applying Self-RAG to such use cases (e.g., retrieving and evaluating helpful demonstrations from a pool of training data) is a promising direction.

Comment

Abilities of handling multiple passages

(Weakness 3) One limitation of the proposed Self-RAG is that the model sees each passage independently of the others. The generator cannot synthesize multiple passages.

As Self-RAG inference processes multiple documents in parallel, it does not synthesize multiple passages for a single segment generation. Still, Self-RAG keeps generating based on multiple passages over the course of the whole response, which we believe partially addresses this concern. For instance, to compare two concepts, Self-RAG can retrieve a passage about Concept A and generate the first segment, then retrieve another passage about Concept B and generate conditioned on both the first segment and the passage about Concept B. While there are still cases that such multi-step generation may not address, we found that batch decoding conditioned on multiple passages independently gives us an optimal trade-off between efficiency and performance, as well as better attribution, i.e., we predict relevance and support tokens for each passage.
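A rough sketch of this segment-by-segment behavior (function names are placeholders, not the released implementation; the on-demand retrieval decision is folded into `retrieve` for brevity):

```python
def self_rag_answer(query, retrieve, generate_with_passage, select_or_abstain,
                    max_segments=5):
    """Hypothetical outline of segment-by-segment Self-RAG inference.

    retrieve(context)                 -> passages for the current step (e.g. about
                                         Concept A first, then Concept B)
    generate_with_passage(context, p) -> one candidate segment plus its
                                         reflection-token probabilities
    select_or_abstain(candidates)     -> best candidate's text, or None
    """
    context, segments = query, []
    for _ in range(max_segments):
        passages = retrieve(context)
        # Each passage is handled independently (batch decoding in practice).
        candidates = [generate_with_passage(context, p) for p in passages]
        segment = select_or_abstain(candidates)
        if segment is None:
            break  # nothing acceptable: stop, re-retrieve, or abstain
        segments.append(segment)
        context = f"{context} {segment}"  # later segments condition on earlier ones
    return " ".join(segments)
```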

Comment

Thank you so much for your strong support! We are delighted that you recognize the technical contributions of this work in addressing significant issues within the current RAG paradigm. Additionally, we appreciate your acknowledgment of the controllability and interpretability supported by Self-RAG. Below, we carefully address the questions and weaknesses you've raised.

Self-RAG's capability of deciding when to retrieve

(Question 2) How good is the model at deciding when to retrieve? It would be nice to show the triggering rates of the "retrieval" token on different types of tasks.

Figure 3 (C) shows how frequently retrieval is triggered across different thresholds. At the same threshold (0.5), PopQA triggers retrieval 70% of the time while PubHealth retrieves less than 20% of the time. This is because PopQA includes many long-tail, rare-entity questions that benefit more from retrieval.

We also analyze the Critic model's predictions across the different seed instruction-following datasets used during training. Below, we present the percentage of instances with [Retrieve]=Yes tokens for four representative subsets. FLAN and Stanford Alpaca consist of a diverse set of instruction-output pairs; FLAN data is collected from diverse existing benchmarks, while Alpaca data is generated by GPT-3 (davinci-003). NQ (short-form generation) and FEVER (classification) are knowledge-intensive tasks that often require factual knowledge.

Dataset | FLAN | Stanford Alpaca | NQ | FEVER
% of instances with [Retrieve] | 15.8% | 53.3% | 87.7% | 63.2%
  • We found that, particularly on FLAN, many instances are labeled as "no retrieval needed". Such data includes grammatical checking and replacing a character in a given sentence, which aligns with our intuition that such instances may not require retrieval.
  • Among those four datasets, NQ shows the highest percentage of instances requiring retrieval, much higher than FEVER. We found that FEVER contains many instances where, even without external evidence, the model can easily judge the truthfulness of the statement (e.g., "Gary Ridgway's last name is Leon", which is obviously false), while NQ, created from Google users' search queries, contains many information-seeking questions that require fine-grained world knowledge.

Those qualitative and quantitative results demonstrate that our Critic and Generator LMs are good at identifying which instances can benefit from retrieval, rather than merely depending on certain task instruction formats.

Amount of training data

  1. How much training data is needed? Does the training data need to be carefully sampled, e.g., focusing on fact-seeking slices?

We believe that task diversity during Self-RAG training is crucial to building a robust Self-RAG system that is not only factual and competitive on knowledge-intensive tasks but also retains the versatile capabilities of LLMs. We collect seed instruction-output data from 11 different seed datasets and then augment the training data using the Critic and Retriever. We heuristically sample the training data to balance the reflection token distributions. While even with 50K training instances Self-RAG outperforms other models, we further scaled up the training data to 150K and found this yields notable improvements. We will open-source our scripts so that follow-up work can explore other training sizes or combinations of training datasets. Below, we provide more details on those aspects.

Amount of training data

As we can leverage existing instruction-output pairs and do not rely on expensive external APIs such as GPT-4 after training the Critic model, scaling up the training data is fairly straightforward in the Self-RAG framework. Since the initial version using 50k training instances, we have created more training data using the same protocol and conducted ablations on model performance with different amounts of training data. While our initial model trained on 50k instances already shows strong performance, we found that increasing the training data to 150k instances gives large improvements, especially on some tasks; see Figure 4 in our updated draft. Note that widely used instruction-tuned LMs are often trained on more than 100k instances (e.g., Vicuna).

Official Review
Rating: 8

This paper introduces a novel framework known as Self-Reflective Retrieval-augmented Generation (SELF-RAG), designed to enhance the generation quality and factuality of Large Language Models (LLMs). During next token prediction, the SELF-RAG framework enables the decoding of reflection tokens from the LLM, allowing for control over the retrieval and self-reflection processes. To create training data for the reflection tokens, a critic model is trained using data generated by GPT-4. The authors also demonstrate how this framework can be used to control retrieval frequencies and guide generation towards specific critique types. Experimental results on multiple benchmarks show that the proposed SELF-RAG performs the best among non-proprietary LLMs on almost all tasks. The ablation study shows the importance of each component.

Strengths

  • The proposed method is novel in that it integrates the critic model's signals at training time so that the LLM can reuse these output signals to guide the next step.
  • The proposed method can be easily adapted to generate responses with certain properties.
  • SELF-RAG significantly outperforms baselines in most cases.

Weaknesses

  • The critic model plays a pivotal role within the framework, and the authors have reported its accuracy in the Appendix. However, the paper does not include an evaluation of the LLM's accuracy in predicting reflection tokens.
  • The retrieval threshold is predetermined, but the authors have not provided an analysis of how variations in the retrieval threshold might affect downstream task performance.
  • In the ablation study, the paper only investigates 'Retrieve top1,' while SELF-RAG utilizes top 5 or top 10. Furthermore, the study exclusively focuses on 'Remove [IsSup],' neglecting an examination of the other critique tokens.

Questions

  • Did the authors investigate the influence of reflection token distributions on the performance of the critic model?
  • Is there a possibility that the critic model may exhibit certain biases? Additionally, have they examined the distribution of retrieval tokens and its correlation with the retrieval threshold?
Comment

We are grateful for your insightful questions and suggestions! Below, we address your questions and weaknesses.

Reflection token distributions and their effects on training

(Question 1) Did the authors investigate the influence of reflection token distributions on the performance of the critic model?

In our initial training, we noticed that reflection token distributions can be biased towards certain classes in some datasets, which can make our Generator LM learn to exploit certain task formats. For instance, in open QA there is a much higher percentage of "relevant" and "fully supported" passages, while in long-form instruction-following datasets the outputs are often only partially supported by the retrieved passages.

To balance the distributions of the reflection tokens, we carefully design a sampling process to make them match the natural distributions encountered when a system is used in information-seeking, instruction-following scenarios. In particular, we introduce two heuristics that alter the reflection token distributions (a rough illustrative sketch follows below).

  • When we aggregate the augmented datasets from different input-output data sources, we downsample and discard 50% of the instances without any retrieval tokens, since large-scale instruction-following datasets (e.g., Alpaca) include many queries that do not require retrieval (e.g., simple and easy facts, or queries that are not knowledge-intensive).
  • For instances whose queries require retrieval, we first retrieve 5-10 passages for each segment and sample only one passage per segment, following the process described below:
  1. For all datasets except open-domain QA (NQ), if there is at least one passage labeled [relevant] and [fully supported], we randomly sample one such passage. For NQ, we randomly discard 25% of such cases, as NQ shows a significantly higher percentage of "relevant" and "fully supported" passages and the subset can easily be dominated by such "positive" samples.
  2. When no passage is both relevant and fully supported but some passages are partially supported, we take the highest-ranked passage from this category in 30% of the cases.
  3. When no passage is relevant or supported, we take the highest-ranked passage from this category in 75% of the cases; for the remaining cases, we randomly sample one from all irrelevant passages.

We added the detailed description in Section A.3 and will release the processing script upon acceptance. Due to the short rebuttal period, we were unable to re-train models on different distributions of the training data. However, we conducted brief experiments in the early stages and observed that without up- and down-sampling, model predictions can be biased towards certain categories (e.g., many passages marked as "Relevant" in a QA dataset).
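As a rough illustration of the per-segment passage sampling above (heuristic 2 only; label names, field names, and the handling of branches the description leaves unspecified are assumptions, and this is not the processing script mentioned above):

```python
import random

def sample_passage_for_segment(scored_passages, dataset):
    """Pick at most one Critic-scored passage to pair with an output segment.

    scored_passages: dicts with 'relevant' (bool), 'support'
                     ('full' | 'partial' | 'no'), and 'rank' (retriever rank).
    Returns a passage dict, or None (e.g. when an NQ case is discarded).
    """
    full = [p for p in scored_passages if p["relevant"] and p["support"] == "full"]
    partial = [p for p in scored_passages if p["relevant"] and p["support"] == "partial"]
    rest = [p for p in scored_passages if not p["relevant"]]

    if full:
        if dataset == "nq" and random.random() < 0.25:
            return None  # discard 25% of NQ's "relevant + fully supported" cases
        return random.choice(full)
    if partial and random.random() < 0.30:
        return min(partial, key=lambda p: p["rank"])  # highest-ranked partial support
    if rest:
        if random.random() < 0.75:
            return min(rest, key=lambda p: p["rank"])  # highest-ranked irrelevant passage
        return random.choice(rest)
    return None
```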

(Question 2) Is there a possibility that the critic model may exhibit certain biases?

Our careful sampling process reduces undesirable biases (e.g., over-predicting "relevant" on a QA-like evaluation dataset). We also noticed that the Critic model learns to make accurate predictions about when retrieval is helpful, resulting in different distributions across instances from different seed datasets. Below, we provide the percentage of instances with [Retrieve]=Yes tokens on four representative seed datasets: FLAN, Stanford Alpaca, Natural Questions, and FEVER.

Dataset | FLAN | Stanford Alpaca | NQ | FEVER
% of instances with [Retrieve] | 15.8% | 53.3% | 87.7% | 63.2%
  • We found that, particularly on FLAN, many instances are labeled as "no retrieval needed". Such data includes grammatical checking and replacing a character in a given sentence, which aligns with our intuition that such instances may not require retrieval.
  • Among those four datasets, NQ shows the highest percentage of instances requiring retrieval, much higher than FEVER. We found that FEVER contains many instances where, even without external evidence, the model can easily judge the truthfulness of the statement (e.g., "Gary Ridgway's last name is Leon", which is obviously false on its surface), while NQ, created from Google users' search queries, contains many information-seeking questions that require fine-grained world knowledge. This indicates the reliability of the Critic's judgments of when to retrieve.

Additionally, have they examined the distribution of retrieval tokens and its correlation with the retrieval threshold?

Figure 3 shows retrieval thresholds and retrieval frequencies on PubHealth (fact verification) and PopQA (open QA). As can be seen, PopQA requires more retrieval while PubHealth requires less. PopQA contains many rare entities that even strong LLMs often struggle to memorize (Mallen et al., 2023), and thus our model generates more retrieval tokens on this dataset.
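A minimal sketch of this thresholding behavior (the normalization here is written over just the Yes/No options for brevity; the paper defines it over the full [Retrieve] token vocabulary):

```python
def should_retrieve(p_yes, p_no, threshold=0.5):
    """Trigger retrieval when the normalized probability of [Retrieve]=Yes
    exceeds a user-chosen threshold; larger thresholds retrieve less often."""
    return p_yes / (p_yes + p_no) > threshold

# E.g., at threshold 0.5 most PopQA queries trigger retrieval,
# while most PubHealth queries do not (cf. Figure 3).
```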

Comment

Ablations on the number of passages and different types of reflections

(Weakness 3) In the ablation study, the paper only investigates 'Retrieve top1,' while SELF-RAG utilizes top 5 or top 10. Furthermore, the study exclusively focuses on 'Remove [IsSup],' neglecting an examination of the other critique tokens.

Thank you so much for your great suggestions. We conducted a set of new evaluations on one task and will incorporate the full results in our final version.

The number of passages

We ran performance evaluations with varying numbers of passages on 500 sampled PopQA instances, on top of our 7B model trained on 150K queries. The table below shows the results. Adding more passages helps our model increase performance up to 10 passages, while at N=15 and 20 we see minor performance drops, perhaps because the model gets confused by irrelevant documents, as has been reported in prior work.

Number of passages (N) | PopQA performance (acc.)
2 | 0.498
3 | 0.504
5 | 0.504
7 | 0.540
10 | 0.538
15 | 0.528
20 | 0.520

Removal of other special tokens

We also conduct ablation studies on PopQA by ablating different tokens and evaluating how this affects overall performance. Below, "x" indicates the special tokens used during inference, and "sequence" indicates the sequence score (p(y_t | x_t, p) in Eq. 3).

Removing IsRel alone gives only a minor deterioration, possibly because IsSup also implicitly evaluates how relevant the passage is (e.g., if the passage is not relevant, Self-RAG predicts no support or contradictory), while removing both IsRel and IsSup gives notable drops, indicating the importance of those two aspects. Completely removing all of the fine-grained aspects and using only sequence scores for ranking also causes significant deterioration. The biggest drop happens when we use IsUse alone for ranking outputs: IsUse only evaluates the overall quality of generations, regardless of their factuality or document relevance, so depending on this aspect alone may not be the most reliable axis for knowledge-intensive tasks such as QA.

IsRel | IsSup | IsUse | sequence | PopQA performance (acc.)
x | x | x | x | 0.538
- | x | x | x | 0.536
- | - | x | x | 0.512
- | - | x | - | 0.416
- | - | - | x | 0.512

Evaluations of reflection token prediction performance

The critic model plays a pivotal role within the framework, and the authors have reported its accuracy in the Appendix. However, the paper does not include an evaluation of the LLM's accuracy in predicting reflection tokens.

We moved the evaluation of Critic LM to the main pages in our updated version. We also conducted a small set of human evaluations on the quality of Generator LM in Appendix C.2. We will add a larger-scale analysis of the model’s reflection tokens and final outputs in our final version.

Comment

Thanks for the authors' responses!

Official Review
Rating: 6

The authors propose a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) to improve upon the vanilla RAG approach, which always incorporates a fixed number of retrieved passages. They first train an LM with an extended vocabulary that includes reflection tokens. These tokens, along with a reflective retrieval/decoding algorithm, are then used at inference time to generate responses to queries that are better informed by relevant passages.

To this end, they first generate data for and train a critic model, whose role is to evaluate and generate data for the generation model. The data for doing so is created using GPT-4 via specifically designed prompts; the reason is that using GPT-4 everywhere would be very costly and not reproducible. This model also provides the (usually human-generated) signal for instruction-tuning the generator.

For evaluation, the model is tested on 3 sets of datasets (closed-set tasks for fact verification, short-form generation from open-domain QA datasets, and long-form generation). The model is compared with a variety of open-source models as well as cloud-based models such as ChatGPT and Perplexity.

Strengths

  • The RAG method, which the authors set out to improve, has become highly used in industry while having obvious limitations. This is a hard problem, and any improvement on it will have great significance.
  • The framework tries to integrate two very useful LLM approaches, RAG and self-reflection, in an optimized manner.

Weaknesses

  • Even though there are a lot of benchmarks and ablations, I still find that many of my questions are not answered by these evaluations
    • Specifically, I'd like to separate out the contributions of the two main parts of the approach, the self-reflection and the RAG. I don't see much ablation or comparison on the self-reflection side. Only the IsSup token is ablated, and that's it.
    • The RAG also has many details that I'm not certain have been compared against, other than simply changing the underlying model, which is not that interesting IMO. I believe simple changes to the prompt, the number of top-k passages to retrieve, and the order of retrieved passages (top first or top last) can produce highly nuanced results.
    • I would have also liked to see at least an attempt at trying to combine self reflection with RAG without any pre-training
  • I find some of the numbers reported not convincing and in need of more investigation
    • Ret-ChatGPT in Table 2: in all but the Long-form generation tasks, ChatGPT performs better than the retrieval-augmented version. This is curious and tells me either the model has seen the data (which makes the dataset slightly not a good representative of the RAG task) or that RAG is not being done correctly
    • I have a similar (but less important) observation for the Alpaca models (specifically 13B). I can understand the reduced MAUVE score, but accuracy on PubMed has also been reduced which tells a similar story
    • Same is true in Figure 3.a (ablations) where Self-RAG without retrieval is already better than all the other baselines in Table 2. Is this because there has been a data leak? If yes, that completely invalidates these results

Questions

  • The authors claim in the abstract that "Self-RAG outperforms ChatGPT ... on Open-domain QA ... tasks". However, the results in Table 2 are dominated by ChatGPT and Ret-ChatGPT. They only get outperformed in PopQA. Given that, I'm surprised that the authors would make this claim

  • The authors claim that "GPT-4 reflection token predictions show high agreement with human evaluations". I find that hard to believe, given personal experience and reported results, e.g., PandaLM, which for their task reports an F1 accuracy of 66% when comparing the quality of different passages. I don't deny that GPT-4 evaluation predictions have a strong positive bias towards the truth, but I still wouldn't call it high agreement. Do you have data to support this? Or do you think the gap in observations is due to comparison on slightly different evaluation tasks?

  • The repetitive self-reflection methods of Self-RAG will likely have an outsized effect on the computation requirements and the latency of the model. As such, I would love to see some numbers and comparisons here, especially because it's known that given more time/tokens/compute, LLMs can improve their results.

Comment

Thank you so much for your insightful questions and suggestions. Below, we address your questions and concerns.

More ablations to understand Self-RAG effectiveness

(Weakness 1) Specifically, I'd like to separate out the contributions of the two main parts of the approach, the self-reflection and the RAG. I don't see much ablation or comparison on the self-reflection side. Only the IsSup token is ablated, and that's it. I believe simple changes to the prompt, the number of top-k passages to retrieve, and the order of retrieved passages (top first or top last) can produce highly nuanced results.

We appreciate your insightful suggestions. Following your questions, we conducted a set of new experiments and analyses. In our final version, we will expand this analysis to other datasets and incorporate it into the main pages. Also, we would like to clarify that, as shown in Figure 1 and Section 3, Self-RAG processes multiple documents in parallel via batch decoding, and thus there is no ordering of the retrieved documents.

Ablations on other special tokens

We conduct ablation studies of the different aspects on 500 sampled PopQA instances, on top of our 7B model trained on 150K queries.
Below, "x" indicates the special tokens used during inference, and "sequence" indicates the sequence score (p(y_t | x_t, p) in Eq. 3). Removing IsRel alone gives only a minor deterioration, possibly because IsSup also implicitly evaluates how relevant the passage is (e.g., if the passage is not relevant, Self-RAG predicts no support or contradictory), while removing both IsRel and IsSup gives notable drops, indicating the importance of those two aspects. Completely removing all of the fine-grained aspects and using only sequence scores for ranking also causes significant deterioration. The biggest drop happens when we use IsUse alone for ranking outputs: IsUse only evaluates the overall quality of generations, regardless of their factuality or document relevance, so depending on this aspect alone may not be suitable for knowledge-intensive tasks such as QA.

IsRel | IsSup | IsUse | sequence | PopQA performance (acc.)
x | x | x | x | 0.538
- | x | x | x | 0.536
- | - | x | x | 0.512
- | - | x | - | 0.416
- | - | - | x | 0.512

Ablations on the number of passages

We also conduct evaluations with varying numbers of passages on PopQA. The table below shows the results. Adding more passages helps our model increase performance up to 10 passages, while at N=15 and 20 we see minor performance drops, perhaps because the model gets confused by irrelevant documents, as has been reported in prior work.

Number of passages (N) | PopQA performance (acc.)
2 | 0.498
3 | 0.504
5 | 0.504
7 | 0.540
10 | 0.538
15 | 0.528
20 | 0.520

Prompts

Table 5 provides the individual task prompts. Note that we directly feed Open-domain QA queries without any task prompt. We briefly tried different task prompts for PubHealth on top of our 7B model, but we didn’t observe major performance changes on this dataset.

Performance deterioration with RAG in baselines

(Weakness 2) ChatGPT performs better than the retrieval-augmented version. This is curious and tells me either the model has seen the data (which makes the dataset slightly not a good representative of the RAG task) or that RAG is not being done correctly.

We note that ChatGPT is a closed LLM: we are unaware of its pre-training and instruction/RLHF data, and thus it is hard for us to make claims about which datasets are contaminated.

Regarding the performance deterioration on some tasks with retrieval, we suspect this comes from fundamental limitations of current RAG systems, and the findings align with recent and concurrent work. Recent work (Mallen et al., 2023; Kandpal et al., 2023) shows that strong LMs memorize a lot of popular factual knowledge, although they struggle with long-tail (rare-entity) knowledge and benefit more from retrieval augmentation in such cases. TriviaQA often includes queries about popular factual knowledge, leading to the strong ChatGPT performance.

However, existing LMs, including ChatGPT, can also be easily distracted when the prompted passages are not relevant. For popular entities, the retrieval system may fail to retrieve helpful passages and lead ChatGPT to generate erroneous outputs. This is also reported by Mallen et al. (2023) and Anonymous (2023).

Self-RAG addresses such limitations with novel methodologies, as acknowledged by reviewers 3ScW, 4MCy, and uRop. We will also make Self-RAG as well as all baselines available, so that the NLP community can test the methodology with different open LLMs for RAG on diverse tasks.

Comment

Question about Abstract

(Question 1) The authors claim in the abstract that "Self-RAG outperforms ChatGPT ... on Open-domain QA ... tasks". However, the results in Table 2 are dominated by ChatGPT and Ret-ChatGPT. They only get outperformed in PopQA.

Our abstract mentions that Self-RAG outperforms ChatGPT on open-domain QA and fact verification tasks, which is supported by the results on PopQA and PubHealth. We will make this more explicit in our updated abstract. We would also like to note that, following the suggestions we received, we explored the effectiveness of increasing the training data scale up to 150k (see Section 5.2, "Effects of training data") and show further gains across many tasks (Table 2).

Human evaluations on GPT-4 Self-reflection predictions

(Question 2) The authors claim that "GPT-4 reflection token predictions show high agreement with human evaluations". I find that hard to believe, given personal experience and reported results, e.g., PandaLM, which for their task reports an F1 accuracy of 66% when comparing the quality of different passages. I don't deny that GPT-4 evaluation predictions have a strong positive bias towards the truth, but I still wouldn't call it high agreement. Do you have data to support this? Or do you think the gap in observations is due to comparison on slightly different evaluation tasks?

Thanks for sharing the experience and insights from PandaLM. In our work, we sampled 50 examples per type (Retrieve, IsRel, IsSUP, isUse), manually labeled the reflection tokens (e.g., relevant or irrelevant for IsRel), and then compared the GPT-4 predictions on the same examples against human labels. We will release our human evaluation data upon acceptance.

We believe that the different outcomes primarily come from different evaluation setups.

To our knowledge, the PandaLM evaluation (PandaLM, Section 4) is a pairwise preference evaluation: given two LLM outputs, it measures accuracy, precision, recall, and F1 against human annotations. Pairwise overall preference evaluations between two responses often exhibit relatively high disagreement even among humans (Dubois et al., 2023), due to the subjective nature of this evaluation.

In our evaluation, we formulate the judgments (1) in a fine-grained way (multiple aspects, segment level) and (2) by asking GPT-4 to give an absolute class label/score, rather than to rank multiple responses. This allows the model to focus on a narrow yet precisely defined aspect of generation and evaluate it using a rubric, rather than dynamically comparing two given paragraphs. We found that this gives more stable and reliable evaluation results, reducing the subjectivity of evaluation. We added discussions in Appendix Section A.1 (Manual analysis of the GPT-4 predictions). The effectiveness of such fine-grained evaluations has also been reported by recent work (Wu et al., 2023). We will release our GPT-4 prompts as well as our Critic LM inference scripts to enable future reproducibility.
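As an illustration of this setup (the prompt wording below is invented for the sketch; the actual instructions and rubrics are reproduced in Tables 8-12 of the paper), a single-aspect, absolute-label query looks like the following rather than a pairwise comparison:

```python
# Hypothetical single-aspect prompt in the spirit described above;
# the paper's actual instructions and rubrics appear in Tables 8-12.
ISREL_INSTRUCTION = (
    "You will be given an instruction, an evidence passage, and an output segment.\n"
    "Judge ONLY relevance: does the evidence provide useful information for\n"
    "answering the instruction? Answer with exactly one label:\n"
    "[Relevant] or [Irrelevant]."
)

def build_isrel_prompt(instruction, evidence, segment):
    """Assemble one absolute-label query for a single aspect (IsRel)."""
    return (f"{ISREL_INSTRUCTION}\n\n"
            f"Instruction: {instruction}\n"
            f"Evidence: {evidence}\n"
            f"Segment: {segment}\n"
            f"Label:")
```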

Questions about Inference time cost

(Question 3) The repetitive self-reflection methods of Self-RAG will likely have an outsized effect on the computation requirements and the latency of the model

Regarding the generation stage, we introduced several techniques to reduce computational overhead and improve efficiency (parallel batch decoding of multiple passages, beam search, paged attention). As a result, we believe the latency introduced at each segment generation is limited, and we use the same computation budget as for other baselines (a single GPU with 24GB memory). For the retrieval part, Self-RAG can also skip retrieval and reduce retrieval frequency compared with standard RAG. While multi-step retrieval introduces an efficiency-performance trade-off, as discussed by Jiang et al. (2023), we would like to note that Self-RAG can continue using the same passages retrieved previously, and thus often retrieves fewer than three times for long-form generations.

We will add more discussions and analysis on those aspects in our final version.

Comment

I thank the authors for their detailed responses. I believe most of my questions have been more or less answered. As it stands with the ratings, the paper is in good condition, though I'll be amenable to changing my response in the discussion stage if need be.

Comment

Performance deterioration with RAG in baselines (cont.)

(Weakness 2) I have a similar (but less important) observation for the Alpaca models (specifically 13B). I can understand the reduced MAUVE score, but accuracy on PubMed has also been reduced which tells a similar story

Whether retrieval helps on non-extractive tasks such as multiple choice or classification is still an open question (BehnamGhader et al., 2023), unlike tasks like open-domain QA, where a system can simply extract or copy spans from the context. For PubMed or ARC-Challenge, LMs often need to reason over the input sequences and passages, and the retrieved evidence may not directly answer the question. Standard LMs such as Alpaca are not trained with retrieved passages, and we suspect they may struggle to leverage such passages in those tasks. To build versatile and reliable retrieval-augmented LMs, we believe training with retrieved passages and explicit feedback on such fine-grained aspects, as in Self-RAG, is important. Note that in the ablations, Self-RAG shows improvements from retrieval over its non-retrieval counterpart.

Same is true in Figure 3.a (ablations) where Self-RAG without retrieval is already better than all the other baselines in Table 2. Is this because there has been a data leak? If yes, that completely invalidates these results

First, our Llama2-FT baseline in the table is trained on the same input-output data as Self-RAG and uses retrieval at inference time, yet it still lags behind Self-RAG, which supports that the strong Self-RAG performance does not simply come from the data. The training data includes a diverse set of source datasets (Alpaca-Instruct, Open Assistant, as well as KILT FEVER, NQ, and WoW), and there is no dataset leakage. As for ChatGPT and Llama2-chat, we do not know their training details, and they may or may not have been trained on the evaluation datasets. Nevertheless, our proposed method outperforms Llama2-chat by a large margin, and matches or outperforms ChatGPT on the evaluated tasks.

Prompting-based methods for Self-RAG

(Weakness 1) I would have also liked to see at least an attempt at trying to combine self reflection with RAG without any pre-training

Thanks for the great suggestion! Although we believe developing a prompt-based approach to achieve Self-RAG-like inference is an exciting idea, we found a learning-based method to be more suitable for achieving the fine-grained, feedback-based inference process, for the following reasons.

  • Self-RAG requires careful multi-aspect fine-grained self-evaluations at inference time. As shown in Tables 8-12, to make an LM comprehend fine-grained aspects and scoring systems, precise and detailed instructions as well as few-shot demonstrations are necessary. This significantly increases the input sequence length, resulting in higher costs and latency.
  • Nevertheless, we briefly tried prompting-based approaches in our preliminary experiments and found it is nontrivial. When we combine all instructions for all aspects and feed them to the target pre-trained LMs (GPT-3 davinci-003/002, Llama2-13B-chat), all models struggle to precisely follow our evaluation scheme, often generating output formats that do not suit our scheme or reflections that show high disagreement with humans, due to confusion among the dozens of special tokens.
  • To make the most use of the Self-RAG potential, we need to use the token probabilities for the reflection tokens, which may not be always available for black box proprietary LM APIs. Note that at the time of submission, ChatGPT and GPT-4 do not provide log probability information via API, preventing us from applying the Self-RAG algorithm to such models. This limitation is also discussed in the Active Retrieval (Jiang et al., 2023) paper, which also requires access to token probabilities.

While several concurrent or recent works use proprietary LMs to conduct self-evaluation during inference, most work only considers one aspect during generation, and it is indeed nontrivial to make models conduct fine-grained multi-aspect evaluations. That being said, exploring Self-RAG-like inference algorithms on top of proprietary LLMs without training is interesting and a natural follow-up. We added these detailed discussions in Appendix Section A.2.

AC Meta-Review

This paper introduces Self-RAG, an LLM with "action tokens" (reflection tokens) that, when generated by the model, cause a "scaffolding program" to take certain actions. In particular, when the LLM generates a retrieval token, the scaffolding performs a retrieval, evaluates each retrieved document for relevance, and allows the model to produce a set of candidate responses. When the model generates a critique token, the scaffolding evaluates each candidate response and selects one. The architecture is novel and effective.

Why Not a Higher Score

There is no higher score.

Why Not a Lower Score

I am in agreement with the reviewers that this is an outstanding paper.

Final Decision

Accept (oral)