PaperHub
COLM 2024 · Poster
Overall score: 6.8/10 (4 reviewers; min 6, max 8, std 0.8)
Reviewer ratings: 8, 7, 6, 6
Confidence: 3.5

Exploiting the Potential of Seq2Seq Models as Robust Few-Shot Learners

OpenReview · PDF
Submitted: 2024-03-21 · Updated: 2024-08-26
TL;DR

Seq2seq models, with objective-aligned prompting and fusion-based methods, exhibit promising few-shot learning capabilities across diverse tasks, surpassing larger decoder-only models.

Abstract

Keywords
Encoder-Decoder Model · In-context Learning · Few-shot Learning

Reviews & Discussion

Review (Rating: 8)

This paper explores the potential of sequence-to-sequence (seq2seq) models for few-shot in-context learning. While few-shot in-context learning has proven effective in decoder-only neural language models, its application in seq2seq models remains underexplored. The authors conducted ablation studies using various prompting techniques, including sentinel tokens and mode tags, yielding several key findings:

  • Incorporating the target input within the encoder significantly improves results, likely due to the bidirectional nature of seq2seq encoders.
  • The paper proposes two fusion methods: a late-fusion method reminiscent of Retrieval-Augmented Generation (RAG) and an early-fusion method that concatenates the encoder hidden states of the in-context examples for the decoder to cross-attend over. The early-fusion method consistently outperforms the late-fusion method across various metrics; the fact that seq2seq decoders can selectively attend over the encoder states might explain why. Both fusion methods are shown to be more effective than the vanilla approach.
  • Impressively, the few-shot seq2seq models demonstrated superior performance to decoder-only models six times their size, indicating a promising direction for future research.

Although the individual contributions may not appear novel when scrutinized in isolation, the collective advancements position this paper as a significant milestone. The clear writing, robust experimental setup, and promising results collectively suggest that seq2seq models deserve renewed attention in the wake of decoder-only models' popularity.

Reasons to Accept

The clear writing, robust experimental setup, and promising results collectively suggest that seq2seq models deserve renewed attention in the wake of decoder-only models' popularity.

Reasons to Reject

NA

Questions for Authors

I am really curious how your method would work on zero- and few-shot translation. Do you think there is a way to use your current setting to do that?

Author Response

We thank Reviewer TDzM for the positive assessment. We appreciate the recognition of our paper's clarity, robust experimental setup, and promising results. Below are our thoughts in response to your question.

I am really curious how your method would work on zero- and few-shot translation. Do you think there is a way to use your current setting to do that?

Certainly, the same method can be applied to translation tasks. However, our fusion-based approach requires in-context examples and is therefore not applicable in a zero-shot setting; it is designed for the few-shot setting. We will conduct additional experiments to include this aspect in the camera-ready version. Thank you for the suggestion.

Comment

In my opinion, the score from Reviewer TDzM is not well calibrated. There are issues with the novelty of the paper and some unsupported/unclear statements (e.g., that seq2seq models perform better than decoder-only models). Also, one-line feedback and no stated objections do not constitute a proper review. For these reasons, I believe the area chair should not weigh this score in the final evaluation unless the reviewer provides a more in-depth review and updates the score accordingly.

Comment

Thanks for your response.

Review (Rating: 7)

This paper studies the topic of few-shot in-context learning for encoder-decoder language models (e.g., T5). Its main contribution is empirical results detailing the impact of (1) different ways of formulating the prompt, and (2) approaches for dealing with context limitations.

On the first topic, the paper investigates whether it is better to include the target example in the encoder or decoder, as well as the impact of making the prompt resemble the masked language modeling objective. The results show a consistent benefit across four models (T5, T5-LM, T0, and UL2) to placing the target in the encoder and using mask tokens in the prompt. Interestingly, the benefit of the mask tokens is present even for models that are subsequently trained on maskless instructions (e.g., T0).

On the second topic, the paper investigates the impact of fusion-in-decoder- and RAG-style approaches for encoding the in-context examples. These approaches have two benefits: (1) allowing the model to leverage more examples than would effectively fit in its context window, and (2) permutation invariance. The results establish a consistent benefit to both approaches over naive concatenation of examples in the source prompt, with the fusion-in-decoder approach slightly edging out the RAG approach.

Analyses are performed on benchmarks including SuperGLUE, HellaSwag, and XSum. Models evaluated include:

  • Encoder-decoders: T5, T5-LM, T0, and UL2
  • Decoder-only LLMs: OPT, BLOOM, PaLM, and GPT-3

As stated in the previous paragraphs, the key results are that in-context learning consistently works best when:

  • The input example is included in the encoder,
  • The prompt uses <mask> tokens to more closely resemble the LM's pretraining objective, and
  • RAG/Fusion-in-decoder are used to perform inference.

Reasons to Accept

  • The empirical results will be useful for other practitioners investigating few-shot in-context learning with encoder-decoder LMs. Consistent benefits from including <mask> tokens in the prompt and from RAG/fusion-in-decoder were observed across four different models; this provides convincing evidence that future practitioners should adopt these approaches as a standard.
  • The topic of few-shot in-context learning with encoder-decoder LMs has been overlooked, and this paper includes some compelling results that suggest this line of research merits further investigation. In particular, I think the following benefits of RAG/fusion-in-decoder are compelling: (a) improving the efficiency of inference, and (b) eliminating sensitivity to the ordering of the few-shot examples.

Reasons to Reject

  • I'm somewhat critical of the claim that the approaches explored are being "proposed", as opposed to preexisting. "Objective-aligned prompting" (i.e., putting mask tokens in the input) is the default approach used in prior work on finetuning-based few-shot learning with encoder-only LMs [1], and has also been used in prior work on few-shot learning with encoder-decoder LMs [2]. Similarly, the fusion-based approaches are just direct applications of fusion-in-decoder and RAG.
  • In my opinion, the results are somewhat overstated. The paper uses the comparison of its best T5-11B-based approach to OPT-66B to justify the claim that "seq2seq models perform better than decoder-only models"; however, the comparison to PaLM 8B in Table 8B in the Appendix seems to indicate that there is no clearly better approach, and I suspect this claim would be further undermined had the paper benchmarked against Llama or Mistral. Similarly, in the conclusion the paper refers to its results as "remarkable" when, in my opinion, a phrase like "encouraging" would be more appropriate.

[1] Schick and Schütze, 2021: https://aclanthology.org/2021.eacl-main.20/
[2] Patel et al., 2022

Questions for Authors

Questions:

In Section 5.4 you state:

In the early-fusion method, there is a subtle variation in probability caused by the relative positional bias when the order of the examples is fused.

Could you provide a formal explanation of this?

Missing Citations:

  • This is concurrent work but very similar: https://arxiv.org/abs/2308.07922
  • I'd also recommend citing the Schick and Schütze paper above and discussing the parallels to earlier work on finetuning-based approaches for few-shot learning with encoder-only models.
Author Response

Thank you for your positive opinion and constructive feedback. We will add the two missing references to the camera-ready version, and discuss their relevance to our work.

I'm somewhat critical of the claim that the approaches explored are being "proposed", as opposed to preexisting.

We acknowledge that, because our methods are rooted in prior work, the term "proposed" is ambiguous. Our contribution lies in demonstrating the effectiveness of these techniques in seq2seq in-context learning, which had not been explored.

Seq2seq models offer various structuring options beyond inserting mask tokens. Our findings show that aligning prompts with the model’s pretraining objectives is most effective, an aspect often overlooked and not disclosed in previous studies.

In the fusion-based approach, we leveraged the structural advantage of seq2seq models for parallel encoding. By adopting methods from FiD and RAG and applying multiple encoders instead of traditional retrieval and scoring modules in few-shot scenarios, we achieved significant performance gains. We hope this innovation is recognized.

We will clearly articulate the lineage of the methods in the revised manuscript.


The results are somewhat overstated.

We acknowledge that our claim could seem broad based on limited data from two models alone.

In practice, comparable encoder-decoder and decoder-only models trained under perfectly matched conditions are exceedingly rare. Recent research has focused on enhancing decoder-only models with high-quality data, often neglecting seq2seq models in in-context learning.

Therefore, while our paper presents evidence supporting the competitive in-context learning capabilities of seq2seq models, such as T5, against contemporaneous decoder-only models like OPT and BLOOM, these findings are specific to the conditions and datasets used.

We will revise our paper to incorporate your feedback. Thank you.


“In the early-fusion method, there is a subtle variation in probability caused by the relative positional bias.” Could you provide an explanation?

In the early fusion method, each of the examples is processed through the encoder, and their hidden states are concatenated. This is followed by cross-attention with queries from the decoder block. The order of concatenation affects their relative distance to the queries, leading to differences in the relative position bias and resulting in subtle variations.

Comment

Thanks for the response and explanation. I maintain my generally positive opinion on this work.

Review (Rating: 6)

First, the paper conducted the first-ever extensive experiment comparing the in-context few-shot learning capabilities of decoder-only and encoder-decoder models on a broad range of tasks. Second, the paper proposes two methods to more effectively elicit in-context learning ability in seq2seq models: objective-aligned prompting and a fusion-based approach.

Reasons to Accept

The paper conducted the first-ever extensive experiment comparing the in-context few-shot learning capabilities of decoder-only and encoder-decoder models on a broad range of tasks.

Reasons to Reject

This paper reads more like an internal technical report than a research paper, since the scientific innovation of the proposed methods is very limited. The paper presents many factual results from experiments, but the discussion of these results is shallow.

The idea of early fusion and late fusion is quite straightforward and simple; however, the authors tried to make it look more complicated. The objective-aligned prompting is also quite straightforward.

Questions for Authors

Section 5.1 states that seq2seq models perform better than decoder-only models, which is a very big claim. Do you feel you have enough data points to validate this?

Author Response

Thank you for the detailed review and helpful feedback. Below, we address your questions.

The scientific innovation of the proposed methods is very limited. The idea of early fusion, late fusion, and objective-aligned prompting is quite straightforward and simple; however, the authors tried to make it look more complicated.

Our primary contribution is the rediscovery of the inherent capabilities of seq2seq models as few-shot learners, rather than the introduction of entirely novel methodologies. Therefore, we focus on applying straightforward and universally applicable methods that align with the natural structure and objectives of seq2seq models. The effectiveness of our objective-aligned prompting and fusion-based approach, while seemingly straightforward, is attributed to their simple yet carefully crafted design, and has been validated through extensive experiments for generalizability within the seq2seq framework.

We acknowledge that the manuscript may overcomplicate these methods. We appreciate this observation and will simplify our descriptions in the camera-ready version.


Section 5.1 states that seq2seq models perform better than decoder-only models, which is a very big claim. Do you feel you have enough data points to validate this?

We utilized the T5-11B model and compared its performance with that of the OPT-66B model, providing preliminary evidence that seq2seq models can, in certain contexts, outperform decoder-only models. However, we acknowledge that our claim could be seen as broad based on the limited data from these two models alone.

Given the complexity of fair and comprehensive comparisons, which would require equating training datasets, model sizes, and hyperparameters across diverse pretrained models, our study's findings should be seen as initial rather than conclusive. In practice, comparable encoder-decoder and decoder-only models trained under perfectly matched conditions are exceedingly rare. Additionally, the focus in recent research has predominantly been on enhancing decoder-only models with high-quality data to develop advanced skills, leaving seq2seq models less represented in recent in-context learning studies.

Therefore, while our paper presents evidence supporting the competitive in-context learning capabilities of seq2seq models, such as T5, against contemporaneous decoder-only models like OPT, GPT-NeoX, PaLM, and BLOOM, these findings are specific to the conditions and datasets used.

Comment

Thanks for the response. I will keep my score the same for this paper.

Review (Rating: 6)

The paper presents a study on the in-context learning capabilities of seq2seq models (e.g., T5, UL2, etc.). The authors delve into various prompting strategies for seq2seq models and introduce two distinct fusion-based approaches for few-shot learning, drawing from previous work such as RAG and FiD. Finally, the paper compares the permutation bias between fusion-based seq2seq and decoder-only models.

Reasons to Accept

  • The paper studies an exciting topic: the difference in performance between encoder-decoder and decoder-only models.
  • The paper shows a good number of ablations.
  • The paper is generally well-written and organized, but could be slightly improved for better readability.

Reasons to Reject

Main:

  • Novelty: The paper's novelty regarding the fusion-based approach is not too great. This seems more like an extension of the application of previous work, such as RAG and FiD.
  • The authors state in the abstract: "Our approach outperforms a decoder-only model that is six times larger and exhibits significant performance improvements compared to conventional seq2seq models." This statement is incorrect, since the comparison does not include Llama 2/3, recent decoder-only architectures that perform better than BLOOM or OPT. Also, the subtitle in Section 5.1 highlights this claim without clarifying why the most recent models are not considered.
  • Table 1 is not very informative in terms of few-shot capabilities. Even though we see a significant performance improvement when the target input is in the encoder, I do not see any tendency for more samples to help performance. The authors put these numbers in bold, but they do not agree with the paper's story.

Minor:

  • Table 1 should mention the dataset on which the ablations are done. This would improve the readability of the paper.
  • Figure 2 could be improved a lot. Adding the terms from the equation on page 5 would help readers understand more easily and quickly.

Questions for Authors

Questions:

  • Is there any reason the authors did not include Llama 2/3 results in their comparison? I would like an explanation, if any, since we have seen advancements in the performance of LLMs. Comparing your numbers only with BLOOM and OPT does not make the paper's story look as good as it could be.
  • According to Table 1, are seq2seq models few-shot learners if the target input is placed in the encoder? A clarification of these experiments would be ideal in the paper.
  • What is the novelty presented in the paper? It seems that the most relevant results are those using the fusion-based approaches (since these are the only ones where I could see a tendency of more shots helping performance). However, this is more of an extension of previous work.
  • Could you share the dev-set standard deviation numbers for Figure 3? I would like to see the deltas more clearly.
Author Response

We appreciate the reviewer's feedback on our work. Here’s our response to your inquiries.

What is the novelty presented in the paper? It seems that the most relevant results are those using fusion-based approaches.

We acknowledge that the fusion-based approach adapts previous work with slight modifications. Our contribution is rediscovering seq2seq models as few-shot learners, focusing on applying straightforward and universally applicable methods that align with their structure and objectives. The effectiveness of our objective-aligned prompting and fusion-based approach, while straightforward, is attributed to their simple yet carefully crafted design, and has been validated through extensive experiments for generalizability. Further details on objective-aligned prompting are in the next response.


According to Table 1, are seq2seq models few-shot learners if the target input is placed in the encoder?

Deciding where to position the target input in in-context learning is fundamental and precedes other prompting strategies. Existing research has generally overlooked this, making its consideration a prerequisite in our experiments. The results in Table 1 underscore the importance of prompt structuring for in-context learning in seq2seq models. Therefore, we positioned this component at the outset of our experimental approach, allowing all subsequent experiments to be based on these results.


Is there any reason the authors did not include Llama 2/3 results in their comparison?

We excluded comparisons with models like Llama 2 and Llama 3 because comparable encoder-decoder models are unavailable. Demonstrating superiority over state-of-the-art decoder-only models like Llama 2 would require fair comparisons between similarly trained encoder-decoder and decoder-only models, which are rare. Recent research has focused on decoder-only models, often overlooking seq2seq models. Additionally, Llama 3 was released after our submission, preventing a direct comparison. Thus, our paper shows the seq2seq model's in-context learning capabilities by comparing them with contemporaneous decoder-only models like OPT and BLOOM.


Minor issues

For clarity on Figure 3's deltas, see the standard deviations in the last column of Table 11, Appendix D.

We will improve Table 1 and Figure 2's readability in the camera-ready version. Thank you.

Comment

Thank you for responding to each of my questions. The paper addresses an important question about encoder-decoder architectures, and the execution is also good. However, as Reviewer KQR9 also mentioned, this paper looks more like a report and suffers from the novelty perspective. Considering all these aspects, this is a weak acceptance, in my opinion. I will increase my score to 6.

Final Decision

This paper studies in-context learning (ICL) in sequence-to-sequence models. ICL remains under-explored and poorly understood for sequence-to-sequence architectures compared to decoder-only models. The study is deemed valuable by the reviewers. The authors are highly encouraged to incorporate the reviewers' feedback to improve the paper.