PaperHub
Rating: 6.4/10 (Poster; 5 reviewers; min 5, max 7, std 0.8)
Scores: 7, 7, 6, 5, 7
Average confidence: 3.8
COLM 2024

RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models

OpenReview | PDF
Submitted: 2024-03-23 · Updated: 2024-08-26
TL;DR

We investigate the in-context learning ability of retrieval-augmented encoder-decoder language models.

Abstract

Keywords
Retrieval-Augmented Language Models · In-Context Learning · Encoder-Decoder Language Models

Reviews and Discussion

Review
Rating: 7

This paper investigates the in-context learning ability of encoder-decoder models and further proposes an improved version of the studied model. RAVEN, the proposed model, is built on top of ATLAS by further pre-training with the suggested objectives, with the aim of narrowing the gap between masked language modeling during pre-training and in-context learning at inference time. In addition, the few-shot examples are key to in-context learning performance. The authors further propose using different examples for each retrieved document (Fusion-in-Context Learning) and using the retrieval component to obtain the demonstration examples (In-Context Example Retrieval). Results indicate that RAVEN outperforms ATLAS and even surpasses (or performs on par with) much larger models.

Reasons to Accept

  • The paper is well-written and organized.
  • Experiments are thorough, compared with a variety of baseline models.

Reasons to Reject

Novelty of the proposed methods is somewhat limited. Essentially, RAVEN is trained by further pre-training ATLAS with objectives that were discussed in the original ATLAS paper (the prefix LM used in the second stage can be seen as a variant of the standard MLM and standard prefix LM).

Questions for the Authors

What is the baseline for Table 3, and how should the numbers be interpreted? If the baseline is RAVEN, does it mean the baseline score, for example, 31.7% for RAVEN 3B 1-shot on NQ, can be improved by 9.1%, as reported in Table 3?

Author Response

Thank you for highlighting the writing and extensiveness of our experiments.

novelty

The research problem of “in-context learning with retrieval-augmented encoder-decoder LMs” is an important one that has been overlooked in existing studies, making the problem itself novel. Regarding techniques, it is crucial to emphasize that our methodologies stem from an in-depth understanding of the limitations of current models and a careful consideration of how to overcome these issues, as discussed in Section 3. The design is also non-trivial; it was based on the exploration of different strategies, as detailed in, e.g., Section 3.1 and Appendix C1-3.

In summary, our work offers contributions from both analytical and technological perspectives:

  • From an analytical standpoint, we provide a thorough analysis of the in-context learning capability of state-of-the-art retrieval-augmented encoder-decoder language models. Our insights also highlight potential avenues for improvement. This analysis serves as the foundation for our proposed RAVEN training strategy and Fusion-in-Context prompting approach.
  • From a technological perspective, we introduce a confluence of retrieval-augmented masked and prefix language modeling, coupled with our Fusion-in-Context Prompting and In-Context Example Retrieval strategies. These techniques not only enhance the base model's capabilities (by mitigating the mismatch between pretraining and testing) but also maximize the potential of in-context learning for these models (either by scaling the number of in-context examples with FiCL or by retrieving better in-context examples).

interpretation of Table 3

In this table, we aim to demonstrate how in-context example retrieval can further improve the performance of RAVEN. Yes, the baseline is RAVEN, and the baseline score for RAVEN 3B 1-shot on NQ is 31.7%. With in-context example retrieval, we achieve a further improvement of 9.1%, resulting in an accuracy of 40.8%.

Comment

Dear Reviewer k2PV, thank you again for reviewing our paper! We hope our response addresses your questions. Please let us know your thoughts, and we are more than happy to answer any further questions.

Comment

Thanks for providing the detailed explanation. This answers my questions and concerns. I increased my score from 6 to 7.

Comment

Thank you for your acknowledgment! We are happy that our response addressed your concerns.

Review
Rating: 7

This paper aims to improve the in-context learning ability of retrieval-augmented encoder-decoder models---to be more specific, Atlas. The authors first conduct an analysis of the in-context learning behavior of Atlas with different prompting strategies, which shows that Atlas can do in-context learning but is unstable and does not benefit from more in-context examples. The authors then propose three ways to improve it:

  1. (Training): add a second training stage for Atlas that uses prefix language modeling, to handle the training/inference mismatch (at inference it is almost always a continuation-generation task).

  2. (Inference): instead of using the same group of ICL examples in each fusion-in-decoder context, the authors propose to use different ones so that the model can use more ICL examples without hitting the limit of the encoder context window. This is called "Fusion-in-Context Learning (FiCL)".

  3. Select better ICL examples by using the Atlas retriever following previous work.

Those techniques combined form the new model, Raven. The authors mainly demonstrate its efficacy on open-domain QA tasks (Natural Questions, TriviaQA), as well as MMLU. The experiments show that Raven improves significantly over Atlas. Thorough ablations also demonstrate the efficacy of each of the above techniques.

Reasons to Accept

  1. The paper is well written and well motivated, and the experiments are very clear and clean.

  2. The analysis of Atlas' in-context learning performance is insightful and well motivates the proposed techniques. The proposed techniques are interesting, effective, and novel.

  3. The improvement on the selected tasks is significant. The ablation study is comprehensive.

Reasons to Reject

  1. This is more of a suggestion than a critique: I wish there were more tasks analyzed and evaluated. The authors only explored tasks that significantly benefit from retrieval augmentation, such as open-domain QA tasks and MMLU. To demonstrate the general ability of the language model, I believe people would also be interested in seeing how this suite of techniques works on tasks that may not require retrieval augmentation. This would show that FiD/encoder-decoder style LMs not only work for RAG, but also for general ICL.

  2. It would be great to have some analysis of how ICL behavior (w.r.t. the number of demonstrations) differs between FiD and decoder-only models. Even with FiCL, I imagine the improvement has a "ceiling" given there is no attention across different encoder context windows.

  3. How about comparing this to using retrieval augmentation in decoder-only LMs too? You could just stack all the demonstrations and all the passages in the decoder context. Also, what about Replug?

Questions for the Authors

Please see "reasons to reject".

Author Response

Thank you for your positive and constructive feedback. We are happy that the reviewer likes our writing, methods, and experiments.

W1: evaluation on tasks that may not require retrieval augmentation

Thank you for your suggestion! Our primary focus in this paper is on “in-context learning with retrieval-augmented encoder-decoder models”, hence our emphasis on tasks benefiting from retrieval augmentation. We recognize the value of evaluating RAVEN on tasks that do not heavily rely on retrieval augmentation. The MMLU benchmark includes a variety of tasks, some of which do not require retrieval augmentation, and we observe that RAVEN demonstrates performance improvements over ATLAS on them. We also observe performance gains when retrieval is turned off (i.e., zero passages are retrieved) during evaluation.

W2: analysis of ICL/FiCL behavior (w.r.t. the number of demonstrations)

Actually, we have already provided some analysis on this in our paper. For example, in Figure 2 and Figure 5, as well as in the results in the tables, we report the performance of RAVEN/ATLAS with different numbers of in-context examples. For decoder-only LMs, we may refer to the results reported in previous papers. The performance varies significantly; for example, on TriviaQA, 1-shot is optimal for PaLM, while GPT-3 uses a 64-shot setting.

It’s true that the improvement of FiCL may also have a "ceiling", but we do not observe a performance drop like standard ICL when continuing to increase the number of examples (as shown in Figure 5).

W3: comparison to retrieval-augmented decoder-only LMs

We have compared RAVEN with Replug, Retro, Retro++ in Table 2, which are retrieval-augmented decoder-only LMs. From our experiments, when the capability of the base model is close, RAVEN can significantly outperform retrieval-augmented decoder-only LMs.

For decoder-only LMs like GPT-4 (stacking all the demonstrations and passages), there is still a gap, but the comparison may not be meaningful since the capability of the base model differs significantly (T5 vs. GPT-4). We notice there is a family of high-performing encoder-decoder LMs recently developed by Reka (https://arxiv.org/abs/2404.12387). It would be very interesting to see the performance of RAVEN built on Reka (though it’s not open source so far). We feel this is a very promising area for future exploration.

Comment

Thanks for answering my question! I will keep the score unchanged.

Comment

Thank you for your acknowledgment! We are happy that our response answered your questions :)

Review
Rating: 6

This paper tackles the problem of in-context learning with retrieval-augmented encoder-decoder LMs. There are two main techniques introduced in this paper. One is the pre-training method called RAVEN, which combines retrieval-augmented masked language modeling and prefix language modeling for the training of the retriever and generator. The other is Fusion-in-Context Learning, which enables the model to leverage more in-context examples. The proposed method shows improved performance over ATLAS and other decoder-only LLMs like GPT-3 and PaLM on NQ and TQA.

Reasons to Accept

  • The proposed method is technically sound. The resulting LM improves upon the previous encoder-decoder LM.
  • The effectiveness of the proposed method is empirically verified on NQ and TQA.

Reasons to Reject

There are two serious concerns regarding the motivation of the paper.

  • It's unclear whether the proposed methods remain effective when they are applied to today's popular LLMs, given that most of them are based on decoder-only architectures.
  • There is no evidence whether in-context learning is still necessary when fine-tuned LLMs are applied for RAG.

Besides, efficiency is another concern with the proposed method, considering that in-context examples in the RAG settings will consume a great number of tokens.

Author Response

We thank the reviewer for highlighting the soundness and effectiveness of our methods.

“It's unclear whether the proposed methods remain effective when they are applied to today's popular LLMs, given that most of them are based on decoder-only architectures.”

While decoder-only LMs are currently more popular, our research aims to explore and highlight the potential of encoder-decoder architectures. From a research standpoint, demonstrating the efficacy of alternative approaches is crucial for driving innovation and may encourage the development of more powerful encoder-decoder LMs. Notably, recent Reka models (https://arxiv.org/abs/2404.12387) are based on encoder-decoder architectures and achieve performance comparable to GPT-4.

Additionally, it is also interesting to explore the combination of the Fusion-in-Decoder architecture with existing decoder-only LMs. In this way, we can use a bidirectional architecture to effectively encode retrieved passages for powerful decoder-only LLMs.

“There is no evidence whether in-context learning is still necessary when fine-tuned LLMs are applied for RAG.”

This comment seems confusing. In-context learning applies to both pretrained and fine-tuned models, with or without RAG. It remains important, as shown in leaderboards and technical reports of recent models like Gemini, where many results (of both pretrained and finetuned models) are still reported using in-context learning.

“Besides, efficiency is another concern with the proposed method, considering that in-context examples in the RAG settings will consume a great number of tokens.”

The additional cost of RAG applies to all existing LMs. And encoder-decoder LMs are actually more efficient than decoder-only LMs in RAG settings. As explained in Section 2: “For decoder-only LMs, the computational cost typically increases quadratically with the input length, as well as with the number of retrieval passages. In contrast, for encoder-decoder LMs with a Fusion-in-Decoder architecture, the computation cost grows linearly with the number of retrieved passages, as they only perform self-attention over one passage at a time (Izacard & Grave, 2021).”
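
A back-of-the-envelope sketch of this scaling argument (the token counts and the simplified cost model below are illustrative assumptions, not measurements from the paper):

```python
# Toy cost model for the quadratic vs. linear argument above (numbers are assumptions,
# and real costs depend heavily on the implementation).
passage_len = 250     # tokens per retrieved passage (assumed)
prompt_len = 50       # tokens for the query / in-context examples (assumed)
n_passages = 40

# Decoder-only: all passages share one context, so self-attention cost grows
# with the square of the total input length.
decoder_only_cost = (prompt_len + n_passages * passage_len) ** 2

# Fusion-in-Decoder encoder: each pass attends over (prompt + one passage),
# so the encoder cost grows linearly with the number of passages.
fid_encoder_cost = n_passages * (prompt_len + passage_len) ** 2

print(decoder_only_cost / fid_encoder_cost)  # roughly a 28x gap under these assumptions
```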

This efficiency also applies to FiCL. If we encode different in-context examples separately with FiCL, the computation cost also grows linearly with the number of in-context examples. Additionally, we don’t retrieve passages for in-context examples, which further improves efficiency and reduces token consumption.

Comment

Thank you for your feedback! We would like to further clarify and strengthen our response to your concerns.

Effectiveness of ICL in RAG:

The motivation for using ICL with retrieval is aligned with its application in models without retrieval. In most cases, retrieval just adds external knowledge to the model's context but does not fundamentally change the task being performed. If ICL can work on tasks for a model without retrieval, there is no reason why it should not be applied to a model with retrieval. Both ICL and RAG are important and common usages of LLMs; they are orthogonal and don't conflict with each other. Therefore, there must be many use cases of ICL with retrieval, such as answering a factual-seeking question with the help of examples, solving knowledge-intensive tasks with the help of examples, writing a passage (which may require retrieval to collect the knowledge) based on some examples, etc. And empirical evidence in our paper exactly demonstrates that ICL enhances performance in models with retrieval on popular tasks such as Open-domain QA and MMLU.

Running Cost of ICL:

While it's true that ICL introduces additional costs, we designed our approach to minimize these costs effectively. Here are the key points regarding cost management:

  • Efficiency in Prompt Design: In our implementation, we do not retrieve passages for in-context examples (i.e., only retrieve passages for the target query). Therefore, the additional tokens introduced by ICL in our setting are the same as traditional ICL.

  • Linear Cost Growth with FiCL: The computation cost of FiCL grows linearly with the number of in-context examples, as it performs self-attention over one (or m) example(s) at a time. This is more efficient than decoder-only LMs, where the time complexity is quadratic. This efficiency allows us to leverage the benefits of ICL without incurring prohibitive costs.

In summary, our paper provides strong empirical evidence and methodological innovations that support the use of ICL in RAG settings, demonstrating meaningful performance improvements without incurring unsustainable costs. We believe that these contributions are significant and address both the effectiveness and cost concerns you raised.

We hope our response addresses your concerns. Please let us know your thoughts, and we are more than happy to answer any further questions.

Comment

Thanks for the authors' response! I'm increasing my overall assessment to 6 after calibrating against the other papers in my batch. I think the paper basically presents a quality study in this field.

However, I'm still uncertain about the necessity of using ICL for RAG. The improvement can be very marginal with the latest LLMs whereas the additional cost from using ICL will be substantial. The authors may provide a comprehensive investigation in their future research.

Comment

Thank you for your feedback! From the developer's perspective, we do believe there are many use cases for ICL in RAG, and we have observed significant performance improvements in our experiments. The cost of ICL here is also not high (the FiCL design even makes it cheaper than traditional ICL).

Thank you for agreeing with the quality of our study! We think one interesting future exploration is to apply our methods to more advanced encoder-decoder LMs, such as Reka (though it’s not open source so far).

Comment

Thanks for the authors' response! I'm sorry for being confused. Let me try to clarify my questions. From my perspective, ICL is not a very common practice for RAG due to the following concerns:

  • Effectiveness: I haven't seen many cases where ICL contributes substantially to RAG, though it's useful in many other situations. If there are examples where it does, please remind me.
  • Running cost: Applying RAG is already costly due to the encoding of retrieved information. Introducing ICL will add considerable extra costs.

As a result, I still cannot agree with the motivation of the paper, though it has made meaningful progress in improving upon ATLAS.

Review
Rating: 5

The paper investigates combining retrieval augmentation, in-context learning, and encoder-decoder models. After establishing the primary grounding literature (e.g., Atlas), the paper begins by presenting two strategies for in-context learning + retrieval augmentation + encoder-decoder LMs. Provided sufficient in-context examples (4 or 5 or more, depending on the QA set), one strategy shows improvement over a zero-shot Atlas baseline.

Reasons to Accept

  1. The paper’s main strength is demonstrating that, with a couple of modifications to the train/inference procedure, it outperforms Atlas by a respectable margin.
  2. The paper proposes a method that does not (at least on the surface) seem overly compute intensive, making it easier to apply.
  3. The paper looks methodologically solid; the methods are clear enough to understand, and the experiments seem well executed.

Reasons to Reject

  1. The paper overall seems too incremental, and this isn’t helped by the lack of analysis beyond headline performance numbers. While there is a nuanced difference between simple yet powerful methods based on novel insight and incrementally building on existing work, I feel the paper falls into the latter category.
    1. In 3.2, the main difference from prior work seems to be possibly expanding the number of concatenated examples and, instead of repeating the same ones, shuffling them (compared to Izacard 2023, it looks like the KL divergence minimization objective is similar/the same modulo the attention difference).
    2. Strategy 1 in 3.1 seems like the simplest possible way (i.e., it would be a good baseline) to do retrieval-augmented QA with an encoder-decoder model (put QA pairs for in-context learning in decoder, Q + retrieved passage + blank answer in decoder).
  2. While there are a couple of demonstrative examples, the paper does not include much analysis beyond headline performance numbers (or performance as the number of IC examples varies). It would have been nice to see something like an error analysis that convincingly argues that the method fixes what it aims to fix.
  3. Along similar lines, although there was deeper discussion of two QA benchmarks, the discussion of MMLU was quite short. For example, I would be interested in knowing if the gains are uniform or from specific tasks in MMLU.

Questions for the Authors

  1. Did you try a variant of strategy 1 in Figure 1, where the passage either precedes the A0, or the passage is before all the QA pairs?
  2. The tables primarily compare to Atlas; are there other recent models that are worth comparing to?

Suggestions that didn’t influence my scores

  1. I realize that LLMs are relatively new, but aside from datasets, the literature cited is almost exclusively 2020 or after, surely there is relevant literature to draw from before that.
  2. I’m not certain if the COLM style is different, but at least for many other venues, figures (i.e., Figure 3, Table 4) should be at the top of the page (unless they are preceded by other tables/figures).
  3. Figure 2: The y-axis scale should show where the red line for Atlas 11B bottoms out; it looks truncated right now. Same with Figure 5.
  4. In 3.1, since you discard strategy 2 past 3.1, to make the paper more focused, I would suggest moving most/all the experiments from the main paper into the appendix and include a textual mention/footnote that notes another strategy was attempted, but it did not work as well as S1, with details in appendix.
  5. In Figure 2, you should indicate the results are for S1 in the caption.
  6. In Table 2, there is a mix of open-weights and API-only models. I would suggest distinguishing between the two somehow, as I don’t feel confident that it is a fair comparison to open-weights models whose training data is known to a larger degree. A similar argument would apply to making it easier to compare small-model results to large-model results.
  7. The paper spends quite a bit of time discussing details in Atlas, it would be stronger by discussing only the minimal necessary and focusing on the new work being introduced.
Author Response

Thank you for highlighting the solidity of our methods and experiments.

W1: novel insight

Our methodologies stem from an in-depth understanding of existing limitations and how to overcome them, as discussed in Section 3. The proposed methods are designed carefully based on novel insights and comprehensive analysis (see our response to Q1 below as a simple example).

  1. There might be a misunderstanding regarding Section 3.2. Our design differs from previous work by combining retrieval-augmented masked and prefix language modeling to enhance in-context learning. In Section 3.3, we also propose a novel prompting strategy named FiCL that further boosts performance.

  2. We don’t understand this comment. Strategy 1 puts all the input into the encoder. Should "decoder" in your comment be "encoder"?

W2: error analysis

Great suggestion! Briefly, for short-form QA, RAVEN achieves higher answer accuracy due to improved language modeling and better utilization of retrieved passages. For long-form QA, RAVEN is able to generate more reasonable and informative responses (examples in Appendix C4). We will provide more error analysis in the final version.

W3: MMLU

The gain is more significant for factual-seeking questions, where retrieval is crucial, but the improvement is relatively smaller for tasks such as answering math questions.

Q1: variant of strategy 1

We designed strategy 1 to align closely with the pretraining objective. The alternative prompting strategies suggested do not match this objective.

In retrieval-augmented prefix language modeling, the order is (prefix, <extra_id_0>, passages). Strategy 1 aligns with this order, i.e., (examples + query, <extra_id_0>, passages).

  • If the passage precedes q0 (a typo by the reviewer?), the order is (examples, passages, query, <extra_id_0>).

  • If the passage is before all the QA pairs, the order is (passages, examples + query, <extra_id_0>).

Both alternatives mismatch pretraining, since the passages no longer come after <extra_id_0>, and are thus unlikely to perform well.
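
To make these orderings concrete, here is a minimal sketch of the three prompt layouts discussed above (the "Question/Answer" labels, separators, and one-passage-per-encoder-input layout are illustrative assumptions, not the paper's exact templates):

```python
def strategy_1(examples, query, passage):
    # Matches the pretraining order (prefix, <extra_id_0>, passages): demonstrations
    # and the target query form the prefix, the sentinel follows, and the retrieved
    # passage comes last (one encoder input per passage in a FiD-style reader).
    demo = " ".join(f"Question: {q} Answer: {a}" for q, a in examples)
    return f"{demo} Question: {query} Answer:<extra_id_0> {passage}"

def passage_before_query(examples, query, passage):
    # (examples, passages, query, <extra_id_0>): the passage no longer follows the sentinel.
    demo = " ".join(f"Question: {q} Answer: {a}" for q, a in examples)
    return f"{demo} {passage} Question: {query} Answer:<extra_id_0>"

def passage_first(examples, query, passage):
    # (passages, examples + query, <extra_id_0>): same mismatch, passage moved to the front.
    demo = " ".join(f"Question: {q} Answer: {a}" for q, a in examples)
    return f"{passage} {demo} Question: {query} Answer:<extra_id_0>"
```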

Q2: comparison

As shown in Tables 2 and 4, we also compare our model to other retrieval-augmented LMs such as RePlug, Retro, and Retro++, as well as several popular LMs. Performance varies significantly depending on the base model's capability. But in general, when the capability of the base model is close, RAVEN can significantly outperform the baselines.

Thank you for your suggestions. We will take into account all of them in the final version.

Comment

Dear Reviewer SfUy, we hope our response addresses your concerns. Please let us know your thoughts, and we are more than happy to answer any further questions.

BTW, as suggested by Reviewer A7yJ, we plan to highlight more about the distinction between our work and ATLAS in the final version. One significant difference between them is the problem itself: finetuning (ATLAS) vs in-context learning (ours). Besides that, there are also several technical innovations (combined retrieval-augmented masked and prefix language modeling, Fusion-in-Context Learning, In-Context Example Retrieval) in our paper that improved the in-context learning performance of retrieval-augmented encoder-decoder LMs.

Comment

Dear Reviewer SfUy, since the deadline of the discussion period is approaching, we would like to follow up to see if our response addresses your comments and to ensure we have enough time to respond if you have any follow-up questions.

Thank you again for your review!

Review
Rating: 7

This paper studies the problem of retrieval and in-context learning for encoder-decoder language models. Specifically, they find limitations of prior methods that arise from being not truly few-shot, in that they require additional fine-tuning. This work then proposes a new model (RAVEN) that obviates the need for fine-tuning by modifying the prompt/encoding format. The retriever part is trained by a KL objective, while the reader is trained using masked language modeling for prefix-based LM. Additionally, this work introduces “Fusion-in-context learning” (FiCL), which distributes a high number of examples (e.g., 64) across multiple embeddings that get encoded. Both RAVEN and FiCL are shown to be effective on Natural Questions, TriviaQA, and MMLU compared to the prior state-of-the-art model.
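
For readers less familiar with this style of retriever training, here is a minimal sketch of one plausible form of such a KL objective (the temperature, the normalization over the top-K retrieved passages, and detaching the reader signal are assumptions for illustration, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def retriever_kl_loss(retriever_scores, reader_loglikes, tau=1.0):
    # retriever_scores: (K,) similarity scores s(query, d_k) from the dual encoder
    # reader_loglikes:  (K,) log p_reader(answer | query, d_k), one per retrieved passage
    # Passages that help the reader more receive higher target probability
    # (in the spirit of perplexity-distillation objectives; tau is an assumed temperature).
    target = F.softmax(reader_loglikes / tau, dim=-1).detach()   # no gradient to the reader
    log_pred = F.log_softmax(retriever_scores, dim=-1)           # retriever's distribution over docs
    return F.kl_div(log_pred, target, reduction="sum")           # KL(target || retriever distribution)

# toy usage with 3 retrieved passages
loss = retriever_kl_loss(torch.tensor([1.2, 0.3, -0.5]),
                         torch.tensor([-2.0, -5.0, -9.0]))
```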

The main strengths are that the method appears to work well and that the proposed techniques are well-suited solutions to the disadvantages of prior work (fine-tuning); in some sense, this work modernizes past work with more attention to prompts and ICL. The main weakness, which could be a result of my own misunderstanding, is that the comparison against the baseline, ATLAS, does not seem to correspond with the original paper (more explained below).

Reasons to Accept

  • The proposed method, RAVEN, appears to work well across multiple tasks and addresses the concern that ATLAS needs additional finetuning.
  • The additional method, FiCL, is a novel way to more efficiently bring the notion of in-context learning to encoder-decoder models. The fact that this method works so well is a cool result.
  • The experiments are extensive and cover lots of ablations like model size, prompting strategy, multiple shots, etc. Overall, they are convincing.

Reasons to Reject

  • The paper presentation is a little bit confusing; the first contribution of “thorough analysis” does not seem too conclusive.
  • Some technical details were confusing, especially as there is a strong reliance on the reader understanding all the technical details and hyperparameters of ATLAS first.
    • In particular, the ATLAS baselines reported here are lower than the numbers in the ATLAS paper.

Questions for the Authors

  • Section 3.1 feels out of place relative to the rest of the paper. The conclusions of 3.1 are that prompt 1 is better than prompt 2 and that going up to 8-shot is sufficient given the small context sizes (512). Both of these findings could belong in an appendix in favor of clearer explanation and analysis of the main method RAVEN and secondary method contribution FiCL, both of which are, in my opinion, more convincing and stronger results.

    • To nitpick a little more, strategies 1 and 2 are not the only strategies: C.1 actually outlines 3 more (first/last/random), and none of these consider putting the retrieved docs (maybe limited to the top-1 title only due to sequence size) into the prompt either -- which would be more akin to what RAG-like approaches would do to fully mimic the data point for ICL.
    • I stared at Figure 1 for a long time at the beginning trying to understand what it had to do with encoder/decoder architecture, expecting it to be a compare/contrast of ATLAS and RAVEN, but later on realized it was unrelated to 3.2.
  • This might be a naive question: how are masked spans determined? (i.e. what constitutes a span in this model?) If these decisions follow those of ATLAS, perhaps it should be made even clearer in Sec 3.2 that a lot of the modeling objectives/decisions come from ATLAS. e.g. How does p_reader differ from ATLAS's p_attn?

  • The dynamic retrieval confused me a lot. The retriever is retrieving text documents, and the embeddings are embeddings of the documents. The few-shot examples are (q, a) pairs, which is a different "type" than documents. How, or why, should we use the same retriever? And does this retriever get trained with an auxiliary objective somehow to retrieve good (q, a) pairs?

  • ATLAS's numbers in their paper come from fine-tuning the model further -- is that why the scores reported in this paper are lower than the ones in their paper? Since one of the objectives of this work was to fix some of the shortcomings of ATLAS, it should probably also be compared to models like ATLAS + finetuning. If it comes close but doesn't beat ATLAS + finetuning yet, then that would be useful to know, as practitioners can be informed about which model to use based on their compute budget, etc. If it surpasses ATLAS + finetuning, that would be a stronger result.

  • I would have liked to see a little more discussion around FiCL. The tables pick the best numbers based on the [k, m] hyperparameters.

    • Are [k, m] picked based on a validation set?
    • What are the tradeoffs between higher [k, m] and no FiCL? Is FiCL always strictly better than few-shot, or if k (or m) is small, would it still be preferable to do few-shot without FiCL?
  • "Encoder-decoder LMs are more efficient at multiple (e.g. 40) retrieved passages" -- this seems to be missing a citation?

Author Response

We thank the reviewer for highlighting the innovation and soundness of our methods, as well as the extensiveness of our experiments.

Q1: Section 3.1

Thank you for your feedback! We believe Section 3.1 is important as it is the first analysis of the in-context learning ability of Retrieval-Augmented Encoder-Decoder LMs. This analysis inspired the design of RAVEN and FiCL.

Yes, Appendix C1 (first/last/random) can also be treated as different prompting strategies. Regarding the design of “putting the retrieved docs into the prompt”, one reason we don’t apply it is exactly the limited context length. Another reason is that this prompting strategy (i.e., q1 + doc1 + q2 + doc2 + … + q0 + doc0 => masked tokens) does not align well with the objective of pretraining (i.e., corrupted text + doc => masked tokens). Designing a pretraining objective for this prompting approach (as well as extending the context length) is a promising future direction.

Q2: masked spans

In the first training stage, masked spans are determined similarly to Atlas, with 15% of tokens randomly masked. In the second training stage, we mask 10% of the tokens on average at the end of the sequence, based on preliminary experiments. The p_reader is related to the Likelihood Distillation described in previous literature. We will add more explanations.
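
A minimal sketch of the second-stage masking described above (the deterministic cut at the final 10% of tokens and the single-sentinel format are simplifications; the response says 10% of tokens are masked on average):

```python
def mask_suffix_for_prefix_lm(tokens, mask_ratio=0.10, sentinel="<extra_id_0>"):
    # Keep the first ~90% of tokens as the prefix and mask the trailing span,
    # which the model must reconstruct after the sentinel (T5-style target).
    cut = max(1, int(len(tokens) * (1 - mask_ratio)))
    prefix, masked_span = tokens[:cut], tokens[cut:]
    encoder_input = prefix + [sentinel]
    target = [sentinel] + masked_span
    return encoder_input, target

# e.g. a 20-token sequence yields an 18-token prefix and a 2-token masked suffix
enc, tgt = mask_suffix_for_prefix_lm([f"tok{i}" for i in range(20)])
assert len(enc) == 19 and len(tgt) == 3
```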

Q3: dynamic retrieval

Following ATLAS, we use Contriever (Izacard et al., 2022), where a single encoder embeds both queries and documents (different from DPR). Since we use a single encoder, embeddings for similar/related queries should be close. Therefore, we can simply use the same retriever.
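
A minimal sketch of how a shared query/passage encoder can double as an in-context example retriever (the `encode` function, unit-normalized embeddings, and cosine scoring are assumptions for illustration):

```python
import numpy as np

def retrieve_in_context_examples(query_text, candidate_qa_pairs, encode, k=5):
    """Pick the k (question, answer) demonstrations whose questions are closest
    to the target query under a shared Contriever-style encoder `encode`
    (unit-normalized embeddings are assumed, so the dot product is cosine similarity)."""
    q_vec = encode(query_text)                                        # shape (d,)
    cand_vecs = np.stack([encode(q) for q, _ in candidate_qa_pairs])  # shape (N, d)
    scores = cand_vecs @ q_vec
    top = np.argsort(-scores)[:k]
    return [candidate_qa_pairs[i] for i in top]
```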

Q4: prompting vs finetuning

Yes, we report the few-shot prompting results, not fine-tuning results. Actually, fine-tuning RAVEN can also yield better performance, but it is not the focus of this paper; we set it aside since fine-tuning is not the common few-shot usage nowadays.

Q5: FiCL

  • For standard ICL, we perform hyperparameter search in {2,3,4,5,8,16}; for FiCL, we search in {[40,1], [64,5], [64,8]}. We follow previous literature (e.g., GPT-3, PaLM) by evaluating on the dev/test sets directly.
  • According to our results, FiCL [k,m] is (almost) always better than m-shot without FiCL. Settings like k=16 or m=1 also show FiCL outperforms standard ICL.

Q6: reference

This is explained in the previous paragraph: “...the computation cost grows linearly with the number of retrieved passages…(Izacard & Grave, 2021).”

Comment

Hi! Thanks for the response. I still have a couple questions that I'm confused about around how ATLAS was used in this work, and would like to see if my understanding is right.

ATLAS is a modification of the Contriever/T5 architecture that was pretrained on various "pretext" tasks on general datasets. They released several models, and this work uses the 3B and 11B base models (not the ones fine-tuned on NQ). RAVEN then does additional (pre)training of the encoder/decoder with the modified objective, and no finetuning at all on the downstream datasets. Meanwhile, both the ATLAS and ATLAS + FiCL scores reported in your paper use those 3B/11B base models directly.

So the reason why the numbers reported in this work do not match any of the numbers in the original ATLAS paper is that almost all of the numbers in the ATLAS paper are FT. The only exception would be the 0-shot results (e.g., Table 7). For those, the numbers aren't exactly the same either (47.1 vs. 47.4), and I assume that's because you ran eval again and got slightly different numbers? Or is the dataset slightly different? Or the prompt/post-processing slightly different?

(As a general comment: while you stated you will add more explanations, I think the main thing missing right now is a clearer distinction between this work and ATLAS, which may reduce the impression of it being "too incremental" as reviewer SfUy wrote. The extra page you'll get would be helpful towards this, and can help lift some of the appendices and analysis into the main paper while keeping Sec 3.1 in if you really insist on it.)

Comment

Thank you for your follow-up questions!

Yes, your understanding is correct. All the few-shot results reported in the ATLAS paper are from finetuning (FT) performance. In our paper, all results are from in-context learning on the base models, as we want to disregard FT since FT is not the common few-shot usage nowadays. This is also a significant distinction between these two papers, i.e., finetuning vs in-context learning.

Regarding the 0-shot results of ATLAS, we ran the evaluation ourselves. There is a slight difference in the prompt. In their code, they add a space between "Answer:" and "<extra_id_0>", but we found there is no such space in their template for pretraining, so we removed it (i.e., "Answer:<extra_id_0>", see our prompt in Appendix B5). Therefore, our reported 0-shot results of ATLAS are slightly higher than the number in the original paper.

Thank you for your suggestion. The rebuttal is limited to 2500 characters, so we need to keep the above response precise to cover all of your questions. Feel free to let us know if you would like any additional explanation. We did a lot of work to shorten this paper to 9 pages, lol. We will definitely use the extra page to incorporate your suggestions in the final version.

Comment

Thank you for the response.

I'm still not convinced by the findings/strength of the current section 3.1 -- if it is intended to be the first analysis of ICL for these models, then it should definitely be expanded to additionally include C.1 and C.2, and there would be 3 (equally important) subquestions, which together inform the design of RAVEN. With extra space, this may be possible, but may also require condensing the existing 3.1 a little (perhaps the discussion of the "effect of number of in-context examples" can be reduced, since the findings there aren't too unexpected: that there's some threshold at which ICL plateaus).

For FiCL, I'm re-reading this section trying to recall what [k, m] referred to again, and I'm a little confused by Table 2. In Table 2, how many docs are retrieved (i.e., what is n)? As a comment, it would be helpful to illustrate or re-use the variables k, m in Figure 4 too.

(I'll leave this comment here for now so authors get a chance to respond, but will look more closely at this section later).

Comment

Thank you for your feedback and suggestions! Appendices C.1 and C.2 were initially part of the main text, but we moved them to the appendix to keep the paper within 9 pages. We plan to move them back to Section 3.1 in the final version. We will also take into account your suggestions for adjusting the space of other content.

For FiCL, [k, m] refers to [k-shot, m-fusion]. Using Figure 4 as an illustration, it's [j-shot, 2-fusion] (there are j in-context examples in total, and the encoder encodes 2 examples each time). We will make it more clear (e.g., replacing j with k to be consistent and adding more explanations in the caption).
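
A minimal sketch of the [k-shot, m-fusion] partitioning described above (how each group is concatenated with the query and fused by the decoder is omitted; only the grouping itself is illustrated):

```python
def ficl_groups(examples, m):
    # Split k demonstrations into groups of m; each group is encoded in its own
    # encoder pass, and the decoder attends over (fuses) all passes.
    return [examples[i:i + m] for i in range(0, len(examples), m)]

# [64, 5]: 64 demonstrations, 5 per encoder pass -> 13 passes (the last has 4)
groups = ficl_groups(list(range(64)), 5)
assert len(groups) == 13 and len(groups[-1]) == 4
```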

For the number of retrieved docs (i.e., n), as detailed in Appendix B.1, it was set to 40 by default.

Thank you for carefully reviewing our paper and considering our responses. We greatly appreciate it and are happy with the feedback. We are also more than happy to answer any additional questions you may have.

Comment

Thank you for answering the questions.

I think there is already enough substance in this paper; however, I still find FiCL an interesting approach, and this paper only scratches the surface of what is possible. It would be an even stronger paper with a more thorough study on FiCL specifically to learn why it is helpful: this study explores the three parameters [k, m, n], but to me, it feels like there are 4: [k, m, num_docs, num_encodings], where currently it is assumed each doc gets a single encoding. But if the argument is that fusion specifically helps with aggregating over more "shots"/examples, then something like [64, 5, 1, 40] should strongly improve over [1, 1, 1, 40] or more realistically, [1,1,1,1]. That is, 64-shot, 5 sampled randomly 40 times for a single document, then fused, should outperform one-shot with no fusion (still 1 doc) -- this hypothetical may be a little degenerate, but something like [64, 5, 10, 40] vs. [5, 5, 10, 40] would be a more reasonable setting to demonstrate and ablate whether it's the diversity of examples or the fusion mechanism itself which leads to performance gains.

Nonetheless, I am increasing my score slightly upon the authors' response which answered my confusion around the ATLAS eval, and their committal for clarifying the relation between this work and ATLAS, and on expanding Section 3.1.

Comment

Thank you for highlighting FiCL and offering your thoughts! This is a very interesting point for further analysis. We really appreciate the engagement and responsibility of the reviewer—this is incredible.

We are also happy that our responses addressed your confusion.

Final Decision

This paper investigates the combination of an encoder-decoder model, in-context examples, and retrieval augmentation, looking at ATLAS as the main comparison point (without considering finetuning of the model on downstream tasks).

The authors add a pretraining stage that reduces the train/test mismatch, use different examples at test time, and improve the prompting and the retriever. The reviewers mostly agree that the paper is well written, the experiments are convincing, and the empirical gains are sizable, so this is a good addition to the literature on retrieval-augmented language modeling.