PaperHub
Overall Rating: 7.5 / 10 (Spotlight; 4 reviewers, min 7, max 9, std dev 0.9)
Individual Ratings: 7, 9, 7, 7
Confidence: 3.5
Correctness: 3.3
Contribution: 3.3
Presentation: 3.5
TL;DR

We investigate the many-shot in-context learning regime -- prompting large language models with hundreds or thousands of examples -- for a wide range of tasks.

Abstract

Keywords
large language models, in-context learning, long-context models

Reviews and Discussion

Official Review
Rating: 7

This paper explores the effectiveness of in-context learning with hundreds to thousands of examples, bringing the number of examples closer to the range one might use for supervised training methods. Experiments are performed on a large number of tasks and benchmarks using Gemini 1.5 as the LLM, in each case giving it a prompt containing a varying number of dataset examples and observing its performance as the number of examples increases. These robustly demonstrate the effectiveness of using more examples, at times beating fully-supervised models on the same data. In addition to using the ground truth examples, two additional prompting methods are evaluated: Unsupervised ICL, which only adds additional inputs to the prompts, and Reinforced ICL, which adds inputs along with machine-generated responses filtered by correctness of result; both techniques limit the number of "ground truth" responses required for adding prompt examples, and are also found to perform well for all tasks.

Strengths

  • These experiments provide a very good overview of the effects on performance of including these numbers of examples in prompts, surveying the effects across a wide set of tasks.

  • Additional studies, particularly sec 4, are interesting, finding some behaviors that parallel those seen in supervised training, in the context of ICL with substantive benchmark tasks.

  • Unsupervised ICL and Reinforced ICL are useful techniques to limit the amount of ground-truth responses required, and are shown to be effective in these evaluations.

Weaknesses

  • Only Gemini 1.5 was explored. Though the authors address this in the discussion and limitations section, I think it's a significant weakness, as it limits the findings on many-shot ICL to describing the behavior of this particular model rather than the typical range of behaviors across different systems, and leaves open whether the gains come automatically simply from extending context, or whether there are model differences that can impact this technique.

  • I didn't see anything on the computational costs or runtime differences when varying K, which would provide a fuller picture of the behavior of many-shot ICL beyond just its end performance.

  • It's a lot of information to cram into the allotted space, and many sections seem terse, with the connections between experiments and how they contribute to the overall understanding often left hanging.

Questions

While reading this paper I ended up with lots of comments and questions, which I've listed below. Aside from profiling computation increases as mentioned above, there are a lot of particulars around the individual experiments that I had questions about, as well as questions on what the causes and effects might be in addition to the raw performance numbers.

  • Including more examples has two effects: One of increasing information and distinct samples used, but another of increasing the context length. When viewed analogously to a training loop, a supervised optimization loop will repeat examples between epochs if there are fewer samples than steps. It would be interesting to start to separate these two effects, perhaps by extending context length by repeating the same examples instead of adding new examples. For example when going from 20 to 200, repeating the same 20 examples 10 times in the prompt. Is the resulting performance closer to using 20 or 200, or in between, and for which tasks?

  • The supervised finetuning comparison in 4.3 is only on the machine translation task; while this is a good experiment that was on my mind from the start, the fact that it was evaluated on only one task limits the conclusion.

  • The Reinforced ICL description could have more details, particularly since it is highlighted as one of the main contributions. The description "we select rationales that obtain the correct final answer" makes sense as a general filtering procedure, but I'm not sure how many rationales are generated or selected per problem (though I assume only one is used for each k-shot prompt), and it's not clear what happens if no generated rationale contains the correct answer --- are these inputs thrown away entirely, and if so does this also have a beneficial or detrimental filtering effect on the inputs used?

  • sec 2.3 planning logistics: I don't agree that the figure demonstrates significant increase for many-shot as stated in this section. Near-maximum performance of just under 35% success is achieved by 10 examples, with a possible uptick at 800, as described in the caption. But this uptick is about as large as the downtick at 20, and is towards the end of the plot without enough points to establish a clear trend. It's unclear to me if this is due to the larger number of examples or other effects (e.g. which examples are included, as 800 will have a higher chance of including the most useful examples, or random variation).

  • Related: I'm not sure what error bars indicate; other figures' captions say it is stdev, but I don't see this described in the text.

  • sec 3.1 Fig 5: I'm a little unclear on what K=4 corresponds to in the ICL Ground Truth case. Since the Unsupervised ICL prompt also includes 4 gt examples at the end, does the ICL Ground Truth prompt contain 4 examples in addition to the final 4 (for 8 total), or only these final 4? That is, are these 4 at the end of the prompt counted in the K or not, and is this the same between all three prompt methods? If these final 4 are included in K, then the three methods should coincide at K=4. If they are not included, then ICL Ground Truth should have 8 total gt examples (K=4 + the four at the end of the prompt) -- is this the case? I also wasn't able to find example prompts for reinforced or gt ICL for tasks corresponding to the ones exploring unsupervised prompts.

  • fig 6: reinforced ICL at K=250 is missing from this figure

  • sec 3.3 l.194 "to reduce the impact of false positives, ..." --- why is this important? Does it obscure trends in the results by having too much noise from chance, for example?

  • sec 4.1 Fig 8: it's really interesting that the shape of these curves as a function of K is basically what one would expect to see for a supervised model preinitialized with the original task labels, as a function of the training step number. Not sure if this would be worth mentioning in the text, maybe also showing a supervised training run exhibiting similar behavior for qualitative comparison?

  • sec 4.3 comparing to SFT: this section lacks details on how supervised fine tuning was implemented: in particular, what was the base model, which weights/adapters are trained or frozen, and how many epochs over the training data were performed? And why were these choices made in particular? There is also a description of trade-offs between training and inference costs of the two approaches, but no estimates on what the computation resources for each are.

  • sec 4.4 on NLL: these are some interesting behaviors, though I agree it's unclear what to make of them exactly; I appreciate the authors presenting the observations. As for why GT ICL may be lower NLL than Reinforced ICL: could this be that Reinforced ICL specifically rejects some high-prob model-generated responses that don't end up with the right answers, raising its NLL from this filtering step? Whereas in contrast, if the gt or similar same-source examples were present in the pretraining data, they would be expected to have relatively low NLL due to being drawn from that data source?

  • xsum weird dates after K=50: xsum was taken from webarchive articles, basically mapping html body text to title --- I wonder if somehow the model learned that the task was to recover the webarchive page title and confused it with other header data, including the webarchive last-updated time, from the pretraining data? The 2016 timestamps are roughly around the document snapshot times in the xsum urls, e.g. if webarchive processing included writing a header with the timestamps, then the title, then the rest of the page. This is also compatible with the discussion on finding the relevant parts of pretraining for the task, in the Unsupervised ICL intro at sec 3 l.144.

  • xsum could also use unsupervised ICL -- what is its performance just adding articles without summaries? For that matter, machine translation could also use unsupervised ICL just with the English phrases. I'd imagine it wouldn't help at all for MT but it's not impossible, and evaluating it might still be interesting just to verify this. In general using all techniques for all tasks would help round out the study (only reinforced ICL wouldn't make sense for tasks where there is no succinct way to do correctness filtering).

Limitations

With the possibility of including more data, it is now more likely to include noisy or incorrect data (either accidentally or adversarially). This might also be mentioned and discussed in the limitations section.

Comment

Details about Reinforced ICL: (1) how many rationales are generated, (2) how many are selected per problem, and (3) what happens when no generated rationale is correct, and is that detrimental?

  • (1) Number of rationales generated was based on available problems for ICL. MATH has 7.5K problems and we generated one rationale per problem, resulting in correct rationales for about 3.5K problems (but we see plateauing with about 500-shots). For tasks with a much smaller number of inputs (250 for GPQA, 150 for BBH), we generated 10 rationales per problem at a temperature of 1, to maximize having at least one correct rationale per problem.

  • (2) For each problem, we randomly pick one of the correct rationales for the K-shot prompts.

  • (3) Inputs without any correct rationales are thrown away entirely, which is a limitation of Reinforced ICL to be unable to use such inputs. Moreover, as these inputs correspond to the harder problems which the model cannot currently solve, we might be throwing away valuable information. As shown in Figure A.17, doing another iteration of Reinforced ICL using only these “harder” inputs can further improve many-shot performance.
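
To make the procedure above concrete, here is a minimal sketch of the generate-filter-sample pipeline; `sample_rationale` and `extract_final_answer` are hypothetical helpers standing in for a model call and an answer parser, and the prompt formatting is only indicative, not the exact format used in the paper.

```python
import random

def build_reinforced_icl_pool(problems, answers, sample_rationale,
                              extract_final_answer, n_samples=1, temperature=1.0):
    """Keep only model-generated rationales whose final answer matches the reference."""
    pool = {}
    for problem, gold in zip(problems, answers):
        correct = []
        for _ in range(n_samples):
            rationale = sample_rationale(problem, temperature=temperature)
            if extract_final_answer(rationale) == gold:
                correct.append(rationale)
        if correct:  # problems with no correct rationale are dropped entirely
            pool[problem] = correct
    return pool

def sample_k_shot_prompt(pool, k, seed=0):
    """Randomly pick one correct rationale per problem for a K-shot prompt."""
    rng = random.Random(seed)
    shots = rng.sample(list(pool.items()), k)
    return "\n\n".join(f"Problem: {p}\nSolution: {rng.choice(rationales)}"
                       for p, rationales in shots)
```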

l.194 "to reduce impact of false positives" --- why is this important?

Typically, we can only evaluate final answer correctness and not verify the CoT rationale. As such, BBH tasks with binary choices can result in rationales that obtain the right answer “by chance” with wrong reasoning (“false positives”) – our manual inspection indicated that model-generated rationales on such tasks were of poor quality.

This is an inherent limitation of methods that rely on model-generated rationales, including the Reinforced Self-Training method, which inspired Reinforced ICL. We discussed this limitation in L134-137, and will clarify it in the revision.

fig 6: reinforced ICL at K=250 missing

We generated rationales with correct answers for only 129 problems (this was mentioned in L180-181, will further clarify).

Unable to find prompts for reinforced or gt ICL for unsupervised ICL tasks

Figure A.9 shows the zero-shot GT prompt for GPQA and Figure A.11 shows the 4-shot GT prompt for MATH and GSM8K. Reinforced ICL prompts contain the same problems as GT prompts but use model-generated solutions – we’ll add an example to show the solution differences.

Sec 4.4: why GT ICL may be lower NLL than Reinforced ICL .. rejects some high-prob model generated responses .. raising NLL from filtering step?

Certainly, the filtering step in Reinforced ICL can result in solutions that the base model considers as high NLL. Another hypothesis is that model-generated solutions can look very different from human-written ones, resulting in higher NLL on GT solutions. Notably, Reinforced ICL has lower NLL than GT ICL on model-generated solutions on test problems.

sec 3.1 Fig 5: what K=4 corresponds to in ICL GT. Are these 4 at the end of the prompt counted in the K, and is this same for all three prompt methods?

The 4-shot GT ICL prompt uses only 4 examples, which correspond to the same examples as used by the 4-shot formatting preamble of Unsupervised ICL (Figure A.11). However, the Unsupervised ICL prompt also uses an instruction preamble (“You will be provided Problems similar to the ones below:”), leading to differences in results from the 4-shot GT ICL prompt.

Reinforced ICL uses the same problems but model-generated solutions instead of GT solutions, resulting in different performance than GT ICL, even in the 4-shot setting.
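
For readers comparing the two prompt variants, a rough sketch of their structure follows; the preamble string is the one quoted above, but the exact ordering, preambles, and formatting follow Figures A.9/A.11 in the paper and differ in detail from this illustration.

```python
def gt_icl_prompt(solved_examples, query):
    """Standard few-shot ICL: K solved examples followed by the test problem."""
    shots = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in solved_examples)
    return f"{shots}\n\nProblem: {query}\nSolution:"

def unsupervised_icl_prompt(unsolved_problems, formatting_examples, query):
    """Unsupervised ICL: instruction preamble, unsolved inputs, then a few solved formatting examples."""
    preamble = "You will be provided Problems similar to the ones below:"
    inputs_only = "\n\n".join(f"Problem: {p}" for p in unsolved_problems)
    formatting = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in formatting_examples)
    return f"{preamble}\n\n{inputs_only}\n\n{formatting}\n\nProblem: {query}\nSolution:"
```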

Machine translation could use unsupervised ICL with the English phrases .. it wouldn't help at all for MT .. and evaluating it just to verify this. Also, unsupervised ICL on xsum?

As expected, Unsupervised ICL on low-resource MT doesn’t help, as shown in Figure A.16. This was mentioned on L146-147 about limitations of unsupervised ICL.

We also ran Unsupervised ICL on XSum but only observed a maximum rouge-L score of about 23.95 (with 250-shot unsupervised prompt), which is slightly higher than just using the 1-shot prompt with an article and summary. Looking at the generated summaries, they were more verbose than the abstractive target summaries. This might be mitigated to some extent with a better zero-shot instruction, but would be unfair as no such instruction was used for many-shot ICL with ground-truth examples.

sec 4.1 Fig 8: the shape of these curves as a function of K is what one would expect for a model preinitialized with the original task labels, as a function of training step

Agreed, this is a nice connection and we'll mention it in the text.

With more shots .. more likely to include noisy or incorrect data

If many-shot examples in the prompt contain biases (e.g., stereotypes, unfair representations), the model can possibly amplify these biases. Moreover, many-shot ICL can be used for overriding safety-training biases, manipulating LLMs to behave in unintended or harmful ways. We’ll include this in the discussion.

Author Response

Thanks for the detailed comments and questions, which we address below and which will strengthen our work.

In the rebuttal pdf, we have added results on runtime differences and on the many-shot performance of 1.5 Flash and frontier LLMs, ablated the impact of new information vs context length, re-evaluated results on Logistics, and added an analysis of hallucination on XSum. Our detailed response follows:

Only Gemini 1.5 was explored .. are there model differences that impact many-shot

Indeed, our work serves as an existence proof for the huge potential of many-shot ICL. Nevertheless, we provided preliminary results for GPT-4-Turbo and Claude-3-Opus in Figure A.2, indicating that different models benefit from many-shot ICL to varying degrees.

We also added Figure 1 in the rebuttal pdf to evaluate many-shot ICL performance for Gemini 1.5 Flash, a smaller LLM than 1.5 Pro, and show that it can match or surpass Claude-3-Opus and GPT-4-Turbo with enough shots, despite having worse few-shot performance. We’ll move this to the main paper.

Recently, follow-ups [1, 2, 3] have exhibited many-shot ICL with other open-weights and closed-source models on different tasks. Our work adds to this growing body of evidence and contributes several analyses of the phenomenon.

[1] Many-Shot ICL in Multimodal Foundation Models. Jiang et al, 2024.

[2] Many-Shot ICL for Molecular Inverse Design. Moayedpour et al, 2024.

[3] ICL with Long-Context Models: An In-Depth Exploration. Bertsch et al, 2024.

Runtime differences varying K to provide a fuller picture of many-shot ICL .. profiling computation increases

Great suggestion! To show runtime differences, we added Figure 2 in the rebuttal pdf showing per-single generation runtime, averaged across the test set and multiple seeds, for many-shot ICL on summarization (500-shot) and sequential parity prediction (8192-shot).

With KV caching enabled (default for long-context servers), runtime increases linearly with a large number of shots, as opposed to quadratically for self-attention: doubling the number of shots nearly doubles the runtime. However, for a small number of shots, runtime is nearly constant.

Explanation: When computing the next token, we still have to attend to the fixed many-shot prompt, even if the KV is cached. When the number of generated tokens is much smaller than the many-shot prompt, generating each new token is still linear in the prompt length, which explains the observed runtime for a large number of shots. We hypothesize that up to a token length of 32K, the entire KV cache fits into TPU HBM, which roughly means that next tokens are computed with O(1) memory loads.
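
A toy cost model illustrates this behavior (the constants are arbitrary and this is not how the reported runtimes were measured): with a cached prompt, each generated token attends to all prompt tokens once, so the per-generation cost grows linearly in the number of shots once the prompt dominates any fixed overhead.

```python
def decode_cost(prompt_tokens, gen_tokens, overhead=1_000):
    """Toy model: with a cached prompt, each new token attends to all prompt tokens once."""
    return gen_tokens * (prompt_tokens + overhead)

# Doubling the prompt (number of shots) roughly doubles decode cost once the
# prompt dominates the fixed overhead; for short prompts the cost is nearly flat.
print(decode_cost(100_000, 256) / decode_cost(50_000, 256))  # ~2.0
print(decode_cost(2_000, 256) / decode_cost(1_000, 256))     # ~1.5, closer to constant
```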

xsum weird dates after K=50 .. if somehow the model learned the task was to recover the webarchive page title

Our analysis suggests that this hypothesis is likely to be true! Specifically, we extracted the hallucinated years from XSum summaries and plotted their histogram density in Figure 3. Remarkably, more than 95% of these dates indeed lie within the range 2014-2017, suggesting that the model might indeed be retrieving additional information about webarchive last updated time.
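
One way such an extraction could be implemented (a sketch; the paper does not specify the exact analysis code):

```python
import re
from collections import Counter

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def year_counts(summaries):
    """Count four-digit years mentioned in model-generated summaries."""
    counts = Counter()
    for summary in summaries:
        counts.update(int(y) for y in YEAR_RE.findall(summary))
    return counts

def fraction_in_window(counts, lo=2014, hi=2017):
    """Fraction of extracted years that fall in the webarchive snapshot window."""
    total = sum(counts.values())
    return sum(c for y, c in counts.items() if lo <= y <= hi) / max(total, 1)
```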

Including more distinct examples increases information, but also context length .. separate these effects by extending context length by repeating same examples

We ran this experiment on low-resource MT by repeating 25 examples several times to create many-shot prompts with up to 1000 examples (shuffled ordering) and added the results in Figure 4 in the rebuttal pdf.

The performance with repeated examples stays nearly the same and significantly lags behind many-shot performance with distinct examples. On this task, the benefit of many-shot ICL mainly stems from adding new information as opposed to increasing context length.
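
A sketch of how such repeated-example prompts can be constructed (the helper name and prompt formatting are hypothetical; the actual experiment used 25 distinct MT pairs):

```python
import random

def repeated_shot_prompt(examples, target_shots, seed=0):
    """Tile a small pool of distinct examples up to target_shots, then shuffle the order."""
    reps = -(-target_shots // len(examples))  # ceiling division
    shots = (list(examples) * reps)[:target_shots]
    random.Random(seed).shuffle(shots)
    return "\n\n".join(shots)

# e.g. 25 distinct source/translation pairs (as formatted strings) tiled into a 1000-shot prompt
# prompt = repeated_shot_prompt(mt_pairs, target_shots=1000)
```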

Planning logistics: Is there a significant increase for many-shot?

To be certain, we re-evaluated many-shot on Logistics with the latest public version of Gemini 1.5 Pro, and added the result in Figure 5 in the rebuttal pdf. Many-shot accuracy improves uniformly for this version – interestingly, few-shot performance already starts quite high, around 40%, improves to 62.8% with 400 shots, and plateaus at 63.8% with 800 shots.

Supervised fine tuning: base model, and how many epochs? And why were these choices made? Estimates on the computation resources?

We performed “full” fine-tuning (no adapters) on the same base model that was used for many-shot ICL (Gemini 1.5 Pro). We performed 5 epochs of training, picking the intermediate checkpoint with the lowest validation loss (often from the first few epochs). These choices were made to ensure that we obtain quite strong results for SFT. We’ll add these details in Sec 4.3.

Since Gemini 1.5 Pro is closed-source, we used Vertex API for SFT and cannot provide estimates of computation resources.

Supervised finetuning comparison only on machine translation task .. limited conclusion

Correct, our results only demonstrate that many-shot ICL can be competitive with fine-tuning on some tasks. The high dollar cost of “fine-tuning” limited our experimentation to 4 runs (2 tasks x 2 data sizes). We also performed a comparison to SFT on parity prediction, where we find that many-shot ICL requires 20x fewer samples compared to fine-tuning GPT-2 to reach the same performance on this synthetic task (Appendix A.13). A more thorough comparison would be interesting for future work.

what error bars indicate .. say it is stdev

Yes, error bars indicate stdev of test performance across multiple random seeds, where K-shot prompts are sampled randomly for each seed (Lines 68-70). We’ll update the text to clarify this.
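
A minimal sketch of this evaluation protocol, assuming a hypothetical `run_eval` that builds a prompt from the sampled shots and scores it on the test set:

```python
import random
import statistics

def eval_with_seeds(example_pool, k, n_seeds, run_eval):
    """Resample the K-shot prompt per seed and report mean and stdev of test performance."""
    scores = []
    for seed in range(n_seeds):
        shots = random.Random(seed).sample(example_pool, k)
        scores.append(run_eval(shots))
    return statistics.mean(scores), statistics.stdev(scores)
```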


We hope most of the reviewer's concerns have been addressed and if so, they would reconsider their assessment.

Official Review
Rating: 9

In this work, the Authors investigate the performance of large language models on in-context learning (ICL) tasks when provided with a large number of examples - on the order of hundreds or thousands (the many-shot ICL regime) - enabled by recent increases in context window sizes.

The Authors demonstrate significant improvements across various tasks when moving from few-shots to many-shots. They also introduce two methods, called Reinforced ICL and Unsupervised ICL, to mitigate the need for human-generated examples.

The paper analyzes how many-shot ICL affects model behavior, including overcoming pre-training biases and learning high-dimensional functions.

Strengths

The paper examines many-shot ICL across a wide range of tasks including translation, summarization, planning, mathematical problem-solving. This broad scope illustrates very well the benefits of many-shots.

The Authors introduce novel methods (Reinforced ICL and Unsupervised ICL) to address limitations of many-shot ICL, namely, the need for large amounts of human-generated examples, and show their superiority.

An in-depth analysis is conducted of how many-shot ICL affects model behavior. For example, the Authors demonstrate that many-shot ICL can overcome pre-training biases (as shown, for example, in Figure 10, where performance on flipped and abstract labels approaches that of default labels with increasing shots). The paper presents sound evidence for the benefits of many-shot ICL. For instance, Figure 1 shows a consistent performance improvement across various tasks.

By comparing many-shot ICL to fine-tuning, it is shown that a comparable performance is reached in some cases, suggesting that many-shot ICL could be a viable alternative to fine-tuning.

Weaknesses

As the Authors recognize, the study is limited to a single model (Gemini 1.5 Pro). While the Authors do include some results with GPT-4-Turbo and Claude-3-Opus, a more comprehensive comparison across different models would strengthen the generalizability of the findings.

While the paper provides extensive empirical results, it lacks a theoretical framework to explain why many-shot ICL works so well. A theoretical analysis could provide insights into the mechanisms behind observed improvements.

Another point on which a theoretical analysis would be highly desirable regards the following case. The Authors note that performance can in some cases degrade with more examples (e.g., in the case of MATH), but don't fully explain this phenomenon. They indeed state: "Our analysis found that negative log-likelihood trends are insufficient to explain this degradation, and future work should investigate new directions to shed light on the matter and improving many-shot ICL capabilities." However, this point is correctly raised while discussing limitations.

Finally, a more explicit discussion of the potential drawbacks and risks associated with many-shot ICL would be helpful to raise awareness.

Questions

I would like to ask the Authors what they think about the possible application of many-shot ICL to alignment problems.

The ability to include much larger contexts could - if I am not mistaken - be leveraged for improving alignment to specific ethical standards or legal frameworks (this seems plausible if one takes into account your findings about overcoming pre-training biases). By providing a large number of examples that demonstrate the desired ethical reasoning or decision-making process, it might be possible to steer the model's behavior more effectively than with few-shot prompting or fine-tuning.

Do you think this is a potential application of many-shots ICL or you can already see limitations?

For example, it occurs to me that, since the performance can sometimes degrade with too many examples, and the ordering of examples can affect results, applying many-shot ICL to ethical alignment would require understanding how to structure and present examples effectively within the context window.

Limitations

While the paper addresses correctly the technical limitations, the potential downsides or risks associated with many-shot ICL are not discussed. I think that a brief remark on potential misuses would be appropriate.

Author Response

We thank the reviewer for their detailed comments and questions. Our detailed response follows:

While authors include some results with GPT-4-Turbo and Claude-3-Opus, a more comprehensive comparison across models would strengthen the generalizability

While we do not fully address this limitation, in the rebuttal pdf we’ve added results for Gemini 1.5 Flash in Figure 1, a smaller LLM than Gemini 1.5 Pro. We find that even 1.5 Flash can match or surpass Claude and GPT with many-shot ICL, despite having worse few-shot performance. This demonstrates that even small LLMs with long contexts might be capable of many-shot ICL.

While the paper provides extensive empirical results, it lacks a theoretical framework to explain why many-shot ICL works so well

Related to this, an ICML'24 paper [1] argues that ICL operates under two possible modes, task recognition and task learning – activating the task learning mode requires a large enough number of shots (however they only empirically studied up to 128-shots), which is highly task dependent. It is possible that the success of many-shot ICL is partially explained by activating this task learning mode. We’ll include this in discussion.

Theoretical Analysis for why performance degrades

A very recent submission under review [2] argues that performance drop in many-shot ICL might be due to more demonstrations diverting the model attention from the query, hindering its understanding of the query.

Another reason might be that really long many-shot prompts might be highly out-of-distribution (OOD) as LLMs might not have seen such prompts during pre-training or post-training (current mixtures might be optimized for needle-in-a-haystack tests). Furthermore, LLM pre-training has a maximum sequence length, followed by continued pre-training for context lengthening, for example, LLaMa 3.1 uses 8k and 128k while Apple FM uses 8k and 32k. We’ll update the discussion to include these hypotheses.

An explicit discussion of drawbacks and risks with many-shot ICL

We’ll update the discussion to include the following drawbacks and risks.

  • Computational Cost: Many-shot ICL can be computationally expensive, especially with a large number of examples. This can be mitigated with context caching and KV caching.
  • Bias Amplification: If the many-shot examples in the prompt contain biases (e.g., stereotypes, unfair representations), the model can amplify these biases, which raises ethical concerns.
  • Lack of Transparency: The inner workings of how ICL works is not well understood. This makes it difficult to pinpoint exactly why a model generates a specific output with many-shot ICL. This can be problematic to assure safety and alignment of LLMs [3].
  • Jailbreaking: Many-shot ICL can be used for overriding safety-training biases, manipulating LLMs to behave in unintended or harmful ways.

Is improving ethical alignment a potential application of many-shot?

We agree that inference time steering of LLMs with many-shot ICL has more flexibility compared to fine-tuning (e.g. different alignment criteria for different use cases), could also allow for faster adaptation to changing criteria, and fast "patching" for newly discovered legal or ethical issues. As such, many-shot ICL for ethical alignment seems to be a very promising application.

Since the performance can sometimes degrade with too many examples, and the ordering of examples can affect results, applying many-shot ICL to ethical alignment could be challenging.

In our work, sensitivity to example ordering seems to be highly task dependent – we do not see much impact of ordering on tasks like low-resource MT or summarization, but a larger impact on other tasks. Nevertheless, a simple approach might be to tune the ordering itself based on a held-out validation set, and off-the-shelf libraries such as DSPy can easily be used for this purpose.

Regarding the optimal number of examples, this is likely a more critical parameter and would also matter for ethical alignment. As a rule of thumb, we swept across the number of shots on a logarithmic scale. Interestingly, our results on overriding pretraining biases show that performance only plateaued rather than degrading when adding too many shots.


[1] Dual Operating Modes of In-Context Learning. Lin and Lee, 2024.

[2] Focused Large Language Models are Stable Many-Shot Learners. Anonymous, 2024.

[3] Foundational Challenges in Assuring Alignment and Safety of LLMs. Anwar et al, 2024.

Official Review
Rating: 7

Owing to the significant increases in context window lengths, the paper analyzes the efficacy of the Gemini 1.5 Pro LLM in the many-shot in-context learning (ICL) setting, where hundreds to thousands of exemplars can be provided to the model at inference time. ICL has generally been restricted to the few-shot learning setting, where only a small number of demonstrations are provided to the LLM. As this expansion to the many-shot setting can pose issues relating to large scale data collection, the authors propose two simple approaches: (1) Reinforced ICL, which switches human written rationales for demonstrations with chain-of-thought model generated rationales, and (2) Unsupervised ICL, where rationales are not provided in the ICL task. The authors conduct extensive experiments across a number of problem domains, ranging from summarization, machine translation, logistics planning, question answering, algorithmic reasoning, among many others, showcasing the performance benefits obtained by using a larger number of exemplars in ICL.

Strengths

  • I believe the paper is a significant contribution to the field of ICL, as it analyzes the efficacy of LLMs in the not yet studied many-shot regime. The authors conduct a number of extensive experiments ranging from a diverse set of tasks and benchmarks, showcasing the benefits of many-shot ICL. The biggest takeaway from the paper would be that many-shot ICL could be a suitable alternative to supervised fine-tuning which would tune the entire set of model weights, albeit at the cost of increased inference time (which can be reduced via KV caching).
  • The paper is very well-written and I appreciate the large scale of experiments conducted on Google Gemini 1.5 Pro.
  • The two annotation-free many-shot ICL strategies (Reinforced ICL and Unsupervised ICL) proposed are simple, but clearly demonstrate improved performance on a wide variety of tasks.
  • Findings relating to overcoming pre-training biases, learning higher-order functions, and many-shot ICL vs supervised fine-tuning are also important results that strengthen the contributions of this work.

Weaknesses

  • As this is an empirical analysis paper which spans many different tasks and benchmarks, can the authors confirm that for all the tasks, the full test splits were used for evaluation as in line with community standards and past work? If there are any exceptions, these should be listed. For instance, measuring summarization performance on XSum/XLSum for only 150 test articles seems far less in size than the actual test set for this dataset (~11k articles). Were the articles randomly sampled?
  • While the ablations are carried out with respect to the number of ICL exemplars provided to the LLM, I am somewhat unsure of what role the model size plays here. As shown in the cited Wei et al paper (https://arxiv.org/pdf/2303.03846), larger LMs might do in-context learning differently and can learn input-output mappings better than smaller LMs, which, for example, might not be able to reject pretraining biases. Thus, it is not clear if some of the performance improvements are just a direct result of using Gemini 1.5 Pro (which is a model on the larger end of the size spectrum). Conversely, would a smaller LLM with a larger context also benefit from many-shot ICL (possibly with not as many exemplars as Gemini 1.5 Pro)? Do the authors have any thoughts on this, and can they draw a distinction between gains attained due to an increase in the size of the model versus the many-shot ICL setting?
  • Do the authors have any intuition for why at times performance reduces as more exemplars are added, somewhat akin to overfitting? I noted that the authors discussed this as an open question in the paper but I think it would benefit readers if some more insight could be provided.

Questions

Please see the weaknesses listed above.

Limitations

Yes, the limitations have been addressed. More details can be provided in a revision.

Author Response

We thank the reviewer for their detailed comments, which we try to address below. We have added a many-shot ICL comparison for Gemini 1.5 Flash with frontier LLMs to understand the role of model size in the rebuttal pdf.

unsure of what role the model size plays here .. would a smaller LLM with a larger context also benefit from many-shot ICL? Gains from increase in the model size versus many-shot ICL

To understand the role of model size, we’ve evaluated the many-shot performance of Gemini 1.5 Flash, a smaller long-context LLM than Gemini 1.5 Pro. We used low-resource MT to compare against LLMs at the larger end of the size spectrum, namely 1.5 Pro, Claude-3-Opus, and GPT-4. Our results suggest that even smaller LLMs can benefit from many-shot ICL and, with enough shots, outperform LLMs with stronger few-shot performance. We report these results in Figure 1 in the rebuttal pdf.

On the English → Bemba task, we find that 1.5 Flash matches Claude-3-Opus and outperforms GPT-4 with 997 shots, despite having much worse few-shot performance than Claude and GPT. On English → Tamil MT, 1.5 Flash performs comparably to 1.5 Pro and Claude in terms of few-shot performance. However, 1.5 Flash outperforms Claude-3 in terms of many-shot performance, while lagging behind 1.5 Pro.

Were the test splits in line with community standards? If there are any exceptions, these should be listed .. XSum/XLSum for only 150 test articles seems small. Were the articles randomly sampled?

Yes, for almost all of the tasks, we used the standard test sets used for evaluation (e.g., MATH500 and the GPQA Diamond split). The only exceptions were summarization and low-resource MT, where we randomly subsampled 150 examples from the entire test set to reduce the cost of evaluating many-shot ICL across multiple seeds. We'll update the text to specify this clearly.

Note that for XSum, we actually used the GEM-XSUM [1], which is a cleaner version of XSum with 1.2K test articles.

Do the authors have any intuition for why at times performance reduces as more exemplars are added?

A very recent submission under review [2] argues that performance drop in many-shot ICL might be due to more demonstrations diverting the model attention from the query, hindering its understanding of the query.

Another reason might be that really long many-shot prompts might be highly out-of-distribution (OOD) as LLMs might not have seen such prompts during pre-training or post-training (current mixtures might be optimized for needle-in-a-haystack tests). Furthermore, LLM pre-training has a maximum sequence length, followed by continued pre-training for context lengthening, for example, LLaMa 3.1 uses 8k and 128k while Apple FM uses 8k and 32k. We’ll update the discussion to include these hypotheses.

[1] The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics. Gehrmann et. al, 2021.

[2] Focused Large Language Models are Stable Many-Shot Learners. Anonymous, 2024.


We hope most of the reviewer's concerns have been addressed and if so, they would reconsider their assessment

Comment

I would like to thank the authors for their rebuttal and additional experiments. I believe these augment the work further and should be included in the revision. Finally, based on the merits I had listed in my original review, I believe the paper's contributions still currently constitute a "technically solid paper, with high impact on at least one sub-area, or moderate-to-high impact on more than one areas, with good-to-excellent evaluation, resources, reproducibility". I will hence keep my score.

Official Review
Rating: 7

This work conducts a comprehensive study on in-context learning (ICL). The experiments range from few-shot to many-shot scenarios, with up to 2048 in-context examples. It was observed that as the number of examples increases, the performance generally improves, even matching the performance of fine-tuned models. One limitation of many-shot in-context learning noted by the authors is the difficulty in obtaining high-quality in-context example pairs. To address this issue, the authors proposed Reinforced and Unsupervised ICL, which achieved results comparable to those using ground truth examples. Additionally, the authors explored many-shot ICL in the context of pre-training bias (distribution shift settings) and high-dimensional numerical settings, providing explanations for the results.

Strengths

The experiments are comprehensive, covering a wide range of tasks and comparisons. The paper is well-written with a clear structure. The problem itself is interesting and has broad applications.

Weaknesses

The experiments are limited to a single model, Gemini 1.5 Pro, as mentioned by the authors. This may narrow the scope of the results, as different models have varying pre-training data and biases. The performance of the LLM under ICL shows significant variation (as seen in Figure 6). It would be beneficial to include more seeds or utilize better statistical metrics to represent model performance, beyond just average accuracy. Additionally, as the number of shots increases, so does the number of tokens, making the computation budget (RAM of the machine) a potential bottleneck for further exploration and application. Given that many in-context examples have varying lengths, future work could focus on how to select better in-context examples that capture the task's nature while also being concise in terms of token length.

Questions

  1. It's surprising that Reinforced ICL and Unsupervised ICL outperform the ground truth. Do you have any explanations for this?
  2. The performance shows significant variations when different random seeds are used to select in-context examples. Could you explain why this happens?
  3. Flipped labels and abstract labels seem to achieve the worst performance when K=8 or K=16. Is this a universal result across different model sizes, as related to [1][2]?
  4. I am curious about what model is used for Figure 11 in sec 4.3. Did you perform supervised fine-tuning on the same model and report its results on the dataset?

Limitations

Please see weakness and questions for limitations. There is no potential negative societal impact of their work.

Author Response

We thank the reviewer for their comments and questions, which we address below. We have added new results on 1.5 Flash and inference time of many-shot in the rebuttal pdf.

Experiments are mostly limited to 1.5 Pro .. different models have varying pre-training data and biases.

Indeed, our work serves as an existence proof for the huge potential of many-shot ICL. That said, we provided preliminary results for GPT-4-Turbo and Claude-3-Opus in Figure A.2, indicating that they can also benefit from many-shot ICL to varying degrees.

We added Figure 1 in rebuttal pdf to report many-shot performance for Gemini 1.5 Flash, a smaller LLM than 1.5 Pro, and show that it can match or surpass Claude-3-Opus and GPT-4-Turbo with enough shots, despite having worse few-shot performance. We’ll move this to the main paper.

Recently, follow-ups [1, 2, 3] have exhibited many-shot ICL with open-weights and closed-source models on different tasks. Our work adds to this growing body of evidence and contributes several analyses of the phenomenon.

ICL shows significant variation on GPQA (Fig 6) .. more seeds or better metrics

Our paper reports the standard deviation across seeds on most of the tasks. We’d emphasize that GPQA is an exception rather than the norm, as even noted by the Anthropic report, due to its extreme difficulty and small size (198 questions only). To show this variability, we directly report the individual performance on 5 seeds on both GPQA and MATH.

As the number of shots increases, so does tokens, making computation budget a bottleneck

Indeed, many-shot ICL does increase inference computation time, but can allow for quick prototyping and experimentation using just an inference API. That said, it can be sped-up with KV caching and context caching [7], which is default for long-context servers.

To empirically measure this inference time, we’ve added Figure 2 in the rebuttal pdf showing per-output generation runtime for many-shot ICL, averaged across the test set and multiple seeds, on summarization (500-shot) and sequential parity prediction (8192-shot). With caching enabled, runtime increases linearly with a large number of shots, as opposed to quadratically for self-attention: doubling the number of shots nearly doubles the runtime. However, for a small number of shots (less than 32k tokens), we see that runtime is nearly constant.

It's surprising that Reinforced and Unsupervised ICL outperform ground truth. Do you have any explanations?

  • Reinforced ICL: The Reinforced Self-Training work [4] showed that fine-tuning using model-generated rationales can be more effective than human-generated ground-truth outputs. We show that a similar finding holds true for many-shot ICL. Since model-generated outputs utilize only the skills / knowledge possessed by the LLM, it might make such outputs easier to learn from.

  • Unsupervised ICL: This is harder to explain as unsupervised ICL does not always work well: it outperforms ground truth for MATH, but is substantially worse for low-resource MT (Appendix A.10). Our hypothesis is that it works well when the LLM already has all the required knowledge to solve a task. As such, ground-truth outputs might bias the model and the model prefers to utilize its underlying knowledge by just relying on inputs.

The performance shows significant variations when different random seeds are used to select in-context examples. Could you explain why this happens?

Impact of random seed for ICL example selection seems to be highly task dependent – on low-resource MT and summarization, it has a minimal effect, as can be noticed from small error bars that show standard deviation of mean performance across seeds. However, on other tasks, such as MATH and GPQA, it has a higher impact. Prior work has found the following factors impact few-shot performance when selecting examples:

  • If chosen examples are semantically similar to test examples, it leads to improved performance [5]
  • Increased diversity in chosen examples leads to improved performance [6]

When we select different example subsets by setting different random seeds, we perturb these factors, leading to differences in downstream performance. Our general trends hold even taking into consideration this variation in performance (shown by standard deviation bars).

Flipped and abstract labels .. achieve worst performance when K=8 / K=16. Is this universal across different model sizes?

We do not know if this is a universal result across different model sizes. That said, we also evaluated Gemini 1.5 Flash on flipped labels and observed similar accuracy trends: 56% for K=4, 34% at K=8, 40% for K=16, 76% for K=32, and 86.5% for K=64.

Did you supervise fine-tune the same model and report on the dataset?

Yes, we performed “full” fine-tuning on the same model (Gemini 1.5 Pro) on the same examples that were used for many-shot ICL. We performed 5 epochs of training, picking the intermediate checkpoint with the lowest validation loss to ensure strong SFT results.

Future work could focus on how to select better ICL examples that capture the task's nature while being concise

Agreed, we will add this to future work.


[1] Many-Shot In-Context Learning in Multimodal Foundation Models. Jiang et al, 2024.

[2] Many-Shot In-Context Learning for Molecular Inverse Design. Moayedpour et al, 2024.

[3] In-Context Learning with Long-Context Models: An In-Depth Exploration. Bertsch et al, 2024.

[4] Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. Singh et al, 2024.

[5] What Makes Good In-Context Examples for GPT. Liu et al, 2021.

[6] Selective Annotation Makes LMs Better Few-Shot Learners. Su et al, 2022.

[7] Context caching. ai.google.dev/gemini-api/docs/caching


We hope most of the reviewer's concerns have been addressed and if so, they would reconsider their assessment.

Author Response

We thank all the reviewers R1 (XEKv), R2 (mj1R), R3 (kkH9), and R4 (2oDD) for their valuable feedback! All reviewers are in favor of acceptance and found the paper to be comprehensive and very well-written, a significant contribution to ICL with broad scope and applications, novel annotation-free ICL methods, and interesting analyses. This paper also generated a lot of discussion of our observations as well as questions about future work, which we highly appreciated. Here, we address the common concerns raised by the reviewers:

Experiments mostly limited to Gemini 1.5 Pro

Our work serves as an existence proof for the huge potential of many-shot ICL across a variety of tasks. That said, we provided preliminary results for GPT-4-Turbo and Claude-3-Opus in Figure A.2, indicating that they can also benefit from many-shot ICL to varying degrees. Several follow-ups have exhibited many-shot ICL with other open-weights and closed-source models on different tasks. Our work adds to this growing body of evidence and contributes several analyses of the phenomenon.

We also added Figure 1 in the rebuttal pdf to report many-shot ICL performance for Gemini 1.5 Flash, a smaller LLM than 1.5 Pro, and show that it can match or surpass Claude-3-Opus and GPT-4-Turbo with enough shots, despite having worse few-shot performance. We’ll move this to the main paper.

Runtime and inference compute increase for many-shot

While many-shot ICL increases inference computation time, it can allow for quick prototyping and experimentation using just an inference API. That said, it can be sped-up with KV caching and context caching, which is default for long-context servers. Moreover, being able to spend additional inference-time compute to obtain better performance is a useful feature to have.

We added Figure 2 in the rebuttal pdf showing per-single generation runtime, averaged across the test set and multiple seeds, for many-shot ICL on summarization (500-shot) and sequential parity prediction (8192-shot). With KV caching enabled (default for long-context servers), runtime increases linearly with a large number of shots, as opposed to quadratically for self-attention: doubling the number of shots nearly doubles the runtime. However, for a small number of shots, runtime is nearly constant. We'll add this result to the paper.

Additional results in rebuttal pdf for R4

Other results in the rebuttal pdf address the questions and concerns raised by R4 about the impact of context length vs. new information in many-shot ICL (Figure 4), hallucination on XSum (Figure 3), and re-evaluation on the Planning Logistics task with the latest Gemini API (Figure 5).

Details about SFT comparisons, empirical evaluations, and Reinforced ICL

We'll update the revision to include these details, which we discuss in the individual rebuttals below.

Final Decision

The paper originally received high scores, with reviewers praising the significance of the findings and the potential positive impact of the proposed techniques for improving ICL example quality. Yet the reviewers also raised some concerns on:

  1. limited experimentation - only on Gemini-1.5 Pro
  2. potential efficiency overhead with enlarged input context due to ICL examples
  3. lack of theoretical grounding
  4. insufficient discussion of limitations

Following authors' rebuttal, most concerns seem to have been addressed: additional (smaller) models tested, limitations discussed, some links to existing theoretical work made, efficiency impacts admitted and discussed.

In light of the above, the AC recommends accepting the paper, yet urges the authors to incorporate the reviewers' suggestions and additional discussions and experiments into the final version of the paper.