PaperHub
Overall Rating: 7.5 / 10 (Spotlight; 4 reviewers, min 7, max 9, std dev 0.9)
Individual Ratings: 7, 9, 7, 7
Confidence: 3.5
Correctness: 3.3
Contribution: 3.3
Presentation: 3.5
TL;DR

We investigate the many-shot in-context learning regime -- prompting large language models with hundreds or thousands of examples -- for a wide range of tasks.

Abstract

Keywords
large language models, in-context learning, long-context models

Reviews and Discussion

Official Review
Rating: 7

This paper explores the effectiveness of in-context learning with hundreds to thousands of examples, bringing the number of examples closer to the range one might use for supervised training methods. Experiments are performed on a large number of tasks and benchmarks using Gemini 1.5 as the LLM, in each case giving it a prompt containing a varying number of dataset examples and observing its performance as the number of examples increases. These robustly demonstrate the effectiveness of using more examples, at times beating fully-supervised models on the same data. In addition to using the ground truth examples, two additional prompting methods are evaluated: Unsupervised ICL, which only adds additional inputs to the prompts, and Reinforced ICL, which adds inputs along with machine-generated responses filtered by correctness of result; both techniques limit the number of "ground truth" responses required for adding prompt examples, and are also found to perform well for all tasks.

Strengths

  • These experiments provide a very good overview of the effects on performance of including these numbers of examples in prompts, surveying the effects across a wide set of tasks.

  • Additional studies, particularly sec 4, are interesting, finding some behaviors that parallel those seen in supervised training, in the context of ICL with substantive benchmark tasks.

  • Unsupervised ICL and Reinforced ICL are useful techniques to limit the amount of ground-truth responses required, and are shown to be effective in these evaluations.

Weaknesses

  • Only Gemini 1.5 was explored. Though the authors address this in the discussion and limitations section, I think it's a significant weakness, as it limits the findings on many-shot ICL to describing the behavior of this particular model rather than the typical range of behaviors across different systems, and leaves open whether the gains come automatically simply from extending context, or whether there are model differences that can impact this technique.

  • I didn't see anything on the computational costs or runtime differences when varying K, which would provide a fuller picture of the behavior of many-shot ICL beyond just its end performance.

  • It's a lot of information to cram into the allotted space, and many sections seem terse, with the connections between experiments and how they contribute to the overall understanding often left hanging.

Questions

While reading this paper I ended up with lots of comments and questions, which I've listed below. Aside from profiling computation increases as mentioned above, there are a lot of particulars around the individual experiments that I had questions about, as well as questions on what the causes and effects might be in addition to the raw performance numbers.

  • Including more examples has two effects: One of increasing information and distinct samples used, but another of increasing the context length. When viewed analogously to a training loop, a supervised optimization loop will repeat examples between epochs if there are fewer samples than steps. It would be interesting to start to separate these two effects, perhaps by extending context length by repeating the same examples instead of adding new examples. For example when going from 20 to 200, repeating the same 20 examples 10 times in the prompt. Is the resulting performance closer to using 20 or 200, or in between, and for which tasks?

  • The supervised finetuning comparison in 4.3 is only on the machine translation task; while this is a good experiment that was on my mind from the start, the fact that it was evaluated on only one task limits the conclusion.

  • The Reinforced ICL description could have more details, particularly since it is highlighted as one of the main contributions. The description "we select rationales that obtain the correct final answer" makes sense as a general filtering procedure, but I'm not sure how many rationales are generated or selected per problem (though I assume only one is used for each k-shot prompt), and it's not clear what happens if no generated rationale contains the correct answer --- are these inputs thrown away entirely, and if so does this also have a beneficial or detrimental filtering effect on the inputs used?

  • sec 2.3 planning logistics: I don't agree that the figure demonstrates significant increase for many-shot as stated in this section. Near-maximum performance of just under 35% success is achieved by 10 examples, with a possible uptick at 800, as described in the caption. But this uptick is about as large as the downtick at 20, and is towards the end of the plot without enough points to establish a clear trend. It's unclear to me if this is due to the larger number of examples or other effects (e.g. which examples are included, as 800 will have a higher chance of including the most useful examples, or random variation).

  • Related: I'm not sure what error bars indicate; other figures' captions say it is stdev, but I don't see this described in the text.

  • sec 3.1 Fig 5: I'm a little unclear on what K=4 corresponds to in the ICL Ground Truth case. Since the Unsupervised ICL prompt also includes 4 gt examples at the end, does the ICL Ground Truth prompt contain 4 examples in addition to the final 4 (for 8 total), or only these final 4? That is, are these 4 at the end of the prompt counted in the K or not, and is this the same between all three prompt methods? If these final 4 are included in K, then the three methods should coincide at K=4. If they are not included, then ICL Ground Truth should have 8 total gt examples (K=4 + the four at the end of the prompt) -- is this the case? I also wasn't able to find example prompts for reinforced or gt ICL for tasks corresponding to the ones exploring unsupervised prompts.

  • fig 6: reinforced ICL at K=250 is missing from this figure

  • sec 3.3 l.194 "to reduce the impact of false positives, ..." --- why is this important? Does it obscure trends in the results by having too much noise from chance, for example?

  • sec 4.1 Fig 8: it's really interesting that the shape of these curves as a function of K is basically what one would expect to see for a supervised model preinitialized with the original task labels, as a function of the training step number. Not sure if this would be worth mentioning in the text, maybe also showing a supervised training run exhibiting similar behavior for qualitative comparison?

  • sec 4.3 comparing to SFT: this section lacks details on how supervised fine tuning was implemented: in particular, what was the base model, which weights/adapters are trained or frozen, and how many epochs over the training data were performed? And why were these choices made in particular? There is also a description of trade-offs between training and inference costs of the two approaches, but no estimates on what the computation resources for each are.

  • sec 4.4 on NLL: these are some interesting behaviors, though I agree it's unclear what to make of them exactly; I appreciate the authors presenting the observations. As for why GT ICL may be lower NLL than Reinforced ICL: could this be that Reinforced ICL specifically rejects some high-prob model-generated responses that don't end up with the right answers, raising its NLL from this filtering step? Whereas in contrast, if the gt or similar same-source examples were present in the pretraining data, they would be expected to have relatively low NLL due to being drawn from that data source?

  • xsum weird dates after K=50: xsum was taken from webarchive articles, basically mapping html body text to title --- I wonder if somehow the model learned that the task was to recover the webarchive page title and confused it with other header data, including the webarchive last-updated time, from the pretraining data? The 2016 timestamps are roughly around the document snapshot times in the xsum urls, e.g. if webarchive processing included writing a header with the timestamps, then the title, then the rest of the page. This is also compatible with the discussion on finding the relevant parts of pretraining for the task, in the Unsupervised ICL intro at sec 3 l.144.

  • xsum could also use unsupervised ICL -- what is its performance just adding articles without summaries? For that matter, machine translation could also use unsupervised ICL just with the English phrases. I'd imagine it wouldn't help at all for MT but it's not impossible, and evaluating it might still be interesting just to verify this. In general using all techniques for all tasks would help round out the study (only reinforced ICL wouldn't make sense for tasks where there is no succinct way to do correctness filtering).

Limitations

With the possibility of including more data, it is now more likely to include noisy or incorrect data (either accidentally or adversarially). This might also be mentioned and discussed in the limitations section.

Comment

Details about Reinforced ICL: (1) how many rationales are generated, (2) how many are selected per problem, and (3) what happens when no generated rationale is correct, and is that detrimental?

  • (1) Number of rationales generated was based on available problems for ICL. MATH has 7.5K problems and we generated one rationale per problem, resulting in correct rationales for about 3.5K problems (but we see plateauing with about 500-shots). For tasks with a much smaller number of inputs (250 for GPQA, 150 for BBH), we generated 10 rationales per problem at a temperature of 1, to maximize having at least one correct rationale per problem.

  • (2) For each problem, we randomly pick one of the correct rationales for the K-shot prompts.

  • (3) Inputs without any correct rationales are thrown away entirely, which is a limitation of Reinforced ICL to be unable to use such inputs. Moreover, as these inputs correspond to the harder problems which the model cannot currently solve, we might be throwing away valuable information. As shown in Figure A.17, doing another iteration of Reinforced ICL using only these “harder” inputs can further improve many-shot performance.
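
To make the procedure above concrete, here is a minimal sketch of the generate-filter-sample pipeline; `sample_rationale` and `extract_final_answer` are hypothetical helpers standing in for a model call and an answer parser, and the prompt formatting is only indicative, not the exact format used in the paper.

```python
import random

def build_reinforced_icl_pool(problems, answers, sample_rationale,
                              extract_final_answer, n_samples=1, temperature=1.0):
    """Keep only model-generated rationales whose final answer matches the reference."""
    pool = {}
    for problem, gold in zip(problems, answers):
        correct = []
        for _ in range(n_samples):
            rationale = sample_rationale(problem, temperature=temperature)
            if extract_final_answer(rationale) == gold:
                correct.append(rationale)
        if correct:  # problems with no correct rationale are dropped entirely
            pool[problem] = correct
    return pool

def sample_k_shot_prompt(pool, k, seed=0):
    """Randomly pick one correct rationale per problem for a K-shot prompt."""
    rng = random.Random(seed)
    shots = rng.sample(list(pool.items()), k)
    return "\n\n".join(f"Problem: {p}\nSolution: {rng.choice(rationales)}"
                       for p, rationales in shots)
```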

l.194 "to reduce impact of false positives" --- why is this important?

Typically, we can only evaluate final answer correctness and not verify the CoT rationale. As such, BBH tasks with binary choices can result in rationales that obtain the right answer “by chance” with wrong reasoning (“false positives”) – our manual inspection indicated that model-generated rationales on such tasks were of poor quality.

This is an inherent limitation of methods that rely on model-generated rationales, including the Reinforced Self-Training method, which inspired Reinforced ICL. We discussed this limitation in L134-137, and will clarify it in the revision.

fig 6: reinforced ICL at K=250 missing

We generated rationales with correct answers for only 129 problems (this was mentioned in L180-181, will further clarify).

Unable to find prompts for reinforced or gt ICL for unsupervised ICL tasks

Figure A.9 shows the zero-shot GT prompt for GPQA and Figure A.11 shows the 4-shot GT prompt for MATH and GSM8K. Reinforced ICL prompts contain the same problems as GT prompts but use model-generated solutions – we’ll add an example to show the solution differences.

Sec 4.4: why GT ICL may be lower NLL than Reinforced ICL .. rejects some high-prob model generated responses .. raising NLL from filtering step?

Certainly, the filtering step in Reinforced ICL can result in solutions that the base model considers as high NLL. Another hypothesis is that model-generated solutions can look very different from human-written ones, resulting in higher NLL on GT solutions. Notably, Reinforced ICL has lower NLL than GT ICL on model-generated solutions on test problems.

sec 3.1 Fig 5: what K=4 corresponds to in ICL GT. Are these 4 at the end of the prompt counted in the K, and is this same for all three prompt methods?

The 4-shot GT ICL prompt uses only 4 examples, which correspond to the same examples as used by the 4-shot formatting preamble of Unsupervised ICL (Figure A.11). However, the Unsupervised ICL prompt also uses an instruction preamble (“You will be provided Problems similar to the ones below:”), leading to differences in results from the 4-shot GT ICL prompt.

Reinforced ICL uses the same problems but model-generated solutions instead of GT solutions, resulting in different performance than GT ICL, even in the 4-shot setting.
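
For readers comparing the two prompt variants, a rough sketch of their structure follows; the preamble string is the one quoted above, but the exact ordering, preambles, and formatting follow Figures A.9/A.11 in the paper and differ in detail from this illustration.

```python
def gt_icl_prompt(solved_examples, query):
    """Standard few-shot ICL: K solved examples followed by the test problem."""
    shots = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in solved_examples)
    return f"{shots}\n\nProblem: {query}\nSolution:"

def unsupervised_icl_prompt(unsolved_problems, formatting_examples, query):
    """Unsupervised ICL: instruction preamble, unsolved inputs, then a few solved formatting examples."""
    preamble = "You will be provided Problems similar to the ones below:"
    inputs_only = "\n\n".join(f"Problem: {p}" for p in unsolved_problems)
    formatting = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in formatting_examples)
    return f"{preamble}\n\n{inputs_only}\n\n{formatting}\n\nProblem: {query}\nSolution:"
```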

Machine translation could use unsupervised ICL with the English phrases .. it wouldn't help at all for MT .. and evaluating it just to verify this. Also, unsupervised ICL on xsum?

As expected, Unsupervised ICL on low-resource MT doesn’t help, as shown in Figure A.16. This was mentioned on L146-147 about limitations of unsupervised ICL.

We also ran Unsupervised ICL on XSum but only observed a maximum rouge-L score of about 23.95 (with 250-shot unsupervised prompt), which is slightly higher than just using the 1-shot prompt with an article and summary. Looking at the generated summaries, they were more verbose than the abstractive target summaries. This might be mitigated to some extent with a better zero-shot instruction, but would be unfair as no such instruction was used for many-shot ICL with ground-truth examples.

sec 4.1 Fig 8: the shape of these curves as a function of K is what one would expect for a model preinitialized with the original task labels, as a function of training step

Agreed, this is a nice connection and we'll mention it in the text.

With more shots .. more likely to include noisy or incorrect data

If many-shot examples in the prompt contain biases (e.g., stereotypes, unfair representations), the model can possibly amplify these biases. Moreover, many-shot ICL can be used for overriding safety-training biases, manipulating LLMs to behave in unintended or harmful ways. We’ll include this in the discussion.

Author Response

Thanks for the detailed comments and questions, which we address below and which will strengthen our work.

In the rebuttal pdf, we have added results on runtime differences and on the many-shot performance of 1.5 Flash and frontier LLMs, ablated the impact of new information vs context length, re-evaluated results on Logistics, and added an analysis of hallucination on XSum. Our detailed response follows:

Only Gemini 1.5 was explored .. are there model differences that impact many-shot

Indeed, our work serves as an existence proof for the huge potential of many-shot ICL. Nevertheless, we provided preliminary results for GPT-4-Turbo and Claude-3-Opus in Figure A.2, indicating that different models benefit from many-shot ICL to varying degrees.

We also added Figure 1 in the rebuttal pdf to evaluate many-shot ICL performance for Gemini 1.5 Flash, a smaller LLM than 1.5 Pro, and show that it can match or surpass Claude-3-Opus and GPT-4-Turbo with enough shots, despite having worse few-shot performance. We’ll move this to the main paper.

Recently, follow-ups [1, 2, 3] have exhibited many-shot ICL with other open-weights and closed-source models on different tasks. Our work adds to this growing body of evidence and contributes several analyses of the phenomenon.

[1] Many-Shot ICL in Multimodal Foundation Models. Jiang et al, 2024.

[2] Many-Shot ICL for Molecular Inverse Design. Moayedpour et al, 2024.

[3] ICL with Long-Context Models: An In-Depth Exploration. Bertsch et al, 2024.

Runtime differences varying K to provide a fuller picture of many-shot ICL .. profiling computation increases

Great suggestion! To show runtime differences, we added Figure 2 in the rebuttal pdf showing per-single generation runtime, averaged across the test set and multiple seeds, for many-shot ICL on summarization (500-shot) and sequential parity prediction (8192-shot).

With KV caching enabled (default for long-context servers), runtime increases linearly with a large number of shots, as opposed to quadratically for self-attention: doubling the number of shots nearly doubles the runtime. However, for a small number of shots, runtime is nearly constant.

Explanation: When computing the next token, we still have to attend to the fixed many-shot prompt, even if the KV is cached. When the number of generated tokens is much smaller than the many-shot prompt, generating each new token is still linear in the prompt length, which explains the observed runtime for a large number of shots. We hypothesize that up to a token length of 32K, the entire KV cache fits into TPU HBM, which roughly means that next tokens are computed with O(1) memory loads.
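
A toy cost model illustrates this behavior (the constants are arbitrary and this is not how the reported runtimes were measured): with a cached prompt, each generated token attends to all prompt tokens once, so the per-generation cost grows linearly in the number of shots once the prompt dominates any fixed overhead.

```python
def decode_cost(prompt_tokens, gen_tokens, overhead=1_000):
    """Toy model: with a cached prompt, each new token attends to all prompt tokens once."""
    return gen_tokens * (prompt_tokens + overhead)

# Doubling the prompt (number of shots) roughly doubles decode cost once the
# prompt dominates the fixed overhead; for short prompts the cost is nearly flat.
print(decode_cost(100_000, 256) / decode_cost(50_000, 256))  # ~2.0
print(decode_cost(2_000, 256) / decode_cost(1_000, 256))     # ~1.5, closer to constant
```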

xsum weird dates after K=50 .. if somehow the model learned the task was to recover the webarchive page title

Our analysis suggests that this hypothesis is likely to be true! Specifically, we extracted the hallucinated years from XSum summaries and plotted their histogram density in Figure 3. Remarkably, more than 95% of these dates indeed lie within the range 2014-2017, suggesting that the model might indeed be retrieving additional information about webarchive last updated time.
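
One way such an extraction could be implemented (a sketch; the paper does not specify the exact analysis code):

```python
import re
from collections import Counter

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def year_counts(summaries):
    """Count four-digit years mentioned in model-generated summaries."""
    counts = Counter()
    for summary in summaries:
        counts.update(int(y) for y in YEAR_RE.findall(summary))
    return counts

def fraction_in_window(counts, lo=2014, hi=2017):
    """Fraction of extracted years that fall in the webarchive snapshot window."""
    total = sum(counts.values())
    return sum(c for y, c in counts.items() if lo <= y <= hi) / max(total, 1)
```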

Including more distinct examples increases information, but also context length .. separate these effects by extending context length by repeating same examples

We ran this experiment on low-resource MT by repeating 25 examples several times to create many-shot prompts with up to 1000 examples (shuffled ordering) and added the results in Figure 4 in the rebuttal pdf.

The performance with repeated examples stays nearly the same and significantly lags behind many-shot performance with distinct examples. On this task, the benefit of many-shot ICL mainly stems from adding new information as opposed to increasing context length.
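
A sketch of how such repeated-example prompts can be constructed (the helper name and prompt formatting are hypothetical; the actual experiment used 25 distinct MT pairs):

```python
import random

def repeated_shot_prompt(examples, target_shots, seed=0):
    """Tile a small pool of distinct examples up to target_shots, then shuffle the order."""
    reps = -(-target_shots // len(examples))  # ceiling division
    shots = (list(examples) * reps)[:target_shots]
    random.Random(seed).shuffle(shots)
    return "\n\n".join(shots)

# e.g. 25 distinct source/translation pairs (as formatted strings) tiled into a 1000-shot prompt
# prompt = repeated_shot_prompt(mt_pairs, target_shots=1000)
```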

Planning logistics: Is there a significant increase for many-shot?

To be certain, we re-evaluated many-shot on Logistics with the latest public version of Gemini 1.5 Pro, and added the result in Figure 5 in the rebuttal pdf. Many-shot accuracy improves uniformly for this version – interestingly, few-shot performance already starts quite high, around 40%, improves to 62.8% with 400 shots, and plateaus at 63.8% with 800 shots.

Supervised fine tuning: base model, and how many epochs? And why were these choices made? Estimates on the computation resources?

We performed “full” fine-tuning (no adapters) on the same base model that was used for many-shot ICL (Gemini 1.5 Pro). We performed 5 epochs of training, picking the intermediate checkpoint with the lowest validation loss (often from the first few epochs). These choices were made to ensure that we obtain quite strong results for SFT. We’ll add these details in Sec 4.3.

Since Gemini 1.5 Pro is closed-source, we used Vertex API for SFT and cannot provide estimates of computation resources.

Supervised finetuning comparison only on machine translation task .. limited conclusion

Correct, our results only demonstrate that many-shot ICL can be competitive with fine-tuning on some tasks. The high dollar cost of “fine-tuning” limited our experimentation to 4 runs (2 tasks x 2 data sizes). We also performed a comparison to SFT on parity prediction, where we find that many-shot ICL requires 20x fewer samples compared to fine-tuning GPT-2 to reach the same performance on this synthetic task (Appendix A.13). A more thorough comparison would be interesting for future work.

what error bars indicate .. say it is stdev

Yes, error bars indicate stdev of test performance across multiple random seeds, where K-shot prompts are sampled randomly for each seed (Lines 68-70). We’ll update the text to clarify this.
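
A minimal sketch of this evaluation protocol, assuming a hypothetical `run_eval` that builds a prompt from the sampled shots and scores it on the test set:

```python
import random
import statistics

def eval_with_seeds(example_pool, k, n_seeds, run_eval):
    """Resample the K-shot prompt per seed and report mean and stdev of test performance."""
    scores = []
    for seed in range(n_seeds):
        shots = random.Random(seed).sample(example_pool, k)
        scores.append(run_eval(shots))
    return statistics.mean(scores), statistics.stdev(scores)
```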


We hope most of the reviewer's concerns have been addressed and if so, they would reconsider their assessment.

Official Review
Rating: 9

In this work, the Authors investigate the performance of large language models on in-context learning (ICL) tasks when provided with a large number of examples - on the order of hundreds or thousands (the many-shot ICL regime) - enabled by recent increases in context window sizes.

The Authors demonstrate significant improvements across various tasks when moving from few-shots to many-shots. They also introduce two methods, called Reinforced ICL and Unsupervised ICL, to mitigate the need for human-generated examples.

The paper analyzes how many-shot ICL affects model behavior, including overcoming pre-training biases and learning high-dimensional functions.

Strengths

The paper examines many-shot ICL across a wide range of tasks including translation, summarization, planning, mathematical problem-solving. This broad scope illustrates very well the benefits of many-shots.

The Authors introduce novel methods (Reinforced ICL and Unsupervised ICL) to address limitations of many-shot ICL, namely, the need for large amounts of human-generated examples, and show their superiority.

An in-depth analysis is conducted of how many-shot ICL affects model behavior. For example, the Authors demonstrate that many-shot ICL can overcome pre-training biases (as shown, for example, in Figure 10, where performance on flipped and abstract labels approaches that of default labels with increasing shots). The paper presents sound evidence for the benefits of many-shot ICL. For instance, Figure 1 shows a consistent performance improvement across various tasks.

By comparing many-shot ICL to fine-tuning, it is shown that a comparable performance is reached in some cases, suggesting that many-shot ICL could be a viable alternative to fine-tuning.

Weaknesses

As the Authors recognize, the study is limited to a single model (Gemini 1.5 Pro). While the Authors do include some results with GPT-4-Turbo and Claude-3-Opus, a more comprehensive comparison across different models would strengthen the generalizability of the findings.

While the paper provides extensive empirical results, it lacks a theoretical framework to explain why many-shot ICL works so well. A theoretical analysis could provide insights into the mechanisms behind observed improvements.

Another point on which a theoretical analysis would be highly desirable regards the following case. The Authors note that performance can in some cases degrade with more examples (e.g., in the case of MATH), but don't fully explain this phenomenon. They indeed state: "Our analysis found that negative log-likelihood trends are insufficient to explain this degradation, and future work should investigate new directions to shed light on the matter and improving many-shot ICL capabilities." However, this point is correctly raised while discussing limitations.

Finally, a more explicit discussion of the potential drawbacks and risks associated with many-shot ICL would be helpful to raise awareness.

Questions

I would like to ask the Authors what they think about the possible application of many-shot ICL to alignment problems.

The ability to include much larger contexts could - if I am not mistaken - be leveraged for improving alignment to specific ethical standards or legal frameworks (this seems plausible if one takes into account your findings about overcoming pre-training biases). By providing a large number of examples that demonstrate the desired ethical reasoning or decision-making process, it might be possible to steer the model's behavior more effectively than with few-shot prompting or fine-tuning.

Do you think this is a potential application of many-shots ICL or you can already see limitations?

For example, it occurs to me that, since the performance can sometimes degrade with too many examples, and the ordering of examples can affect results, applying many-shot ICL to ethical alignment would require understanding how to structure and present examples effectively within the context window.

Limitations

While the paper addresses correctly the technical limitations, the potential downsides or risks associated with many-shot ICL are not discussed. I think that a brief remark on potential misuses would be appropriate.

Author Response

We thank the reviewer for their detailed comments and questions. Our detailed response follows:

While authors include some results with GPT-4-Turbo and Claude-3-Opus, a more comprehensive comparison across models would strengthen the generalizability

While we do not fully address this limitation, in the rebuttal pdf we’ve added results for Gemini 1.5 Flash in Figure 1, a smaller LLM than Gemini 1.5 Pro. We find that even 1.5 Flash can match or surpass Claude and GPT with many-shot ICL, despite having worse few-shot performance. This demonstrates that even small LLMs with long contexts might be capable of many-shot ICL.

While the paper provides extensive empirical results, it lacks a theoretical framework to explain why many-shot ICL works so well

Related to this, an ICML'24 paper [1] argues that ICL operates under two possible modes, task recognition and task learning – activating the task learning mode requires a large enough number of shots (however they only empirically studied up to 128-shots), which is highly task dependent. It is possible that the success of many-shot ICL is partially explained by activating this task learning mode. We’ll include this in discussion.

Theoretical Analysis for why performance degrades

A very recent submission under review [2] argues that performance drop in many-shot ICL might be due to more demonstrations diverting the model attention from the query, hindering its understanding of the query.

Another reason might be that really long many-shot prompts might be highly out-of-distribution (OOD) as LLMs might not have seen such prompts during pre-training or post-training (current mixtures might be optimized for needle-in-a-haystack tests). Furthermore, LLM pre-training has a maximum sequence length, followed by continued pre-training for context lengthening, for example, LLaMa 3.1 uses 8k and 128k while Apple FM uses 8k and 32k. We’ll update the discussion to include these hypotheses.

An explicit discussion of drawbacks and risks with many-shot ICL

We’ll update the discussion to include the following drawbacks and risks.

  • Computational Cost: Many-shot ICL can be computationally expensive, especially with a large number of examples. This can be mitigated with context caching and KV caching.
  • Bias Amplification: If the many-shot examples in the prompt contain biases (e.g., stereotypes, unfair representations), the model can amplify these biases, which raises ethical concerns.
  • Lack of Transparency: The inner workings of how ICL works is not well understood. This makes it difficult to pinpoint exactly why a model generates a specific output with many-shot ICL. This can be problematic to assure safety and alignment of LLMs [3].
  • Jailbreaking: Many-shot ICL can be used for overriding safety-training biases, manipulating LLMs to behave in unintended or harmful ways.

Is improving ethical alignment a potential application of many-shot?

We agree that inference time steering of LLMs with many-shot ICL has more flexibility compared to fine-tuning (e.g. different alignment criteria for different use cases), could also allow for faster adaptation to changing criteria, and fast "patching" for newly discovered legal or ethical issues. As such, many-shot ICL for ethical alignment seems to be a very promising application.

Since the performance can sometimes degrade with too many examples, and the ordering of examples can affect results, applying many-shot ICL to ethical alignment could be challenging.

In our work, sensitivity to example ordering seems to be highly task dependent – we do not see much impact of ordering on tasks like low-resource MT or summarization, but a larger impact on other tasks. Nevertheless, a simple approach might be to tune the ordering itself based on a held-out validation set, and off-the-shelf libraries such as DSPy can easily be used for this purpose.

Regarding the optimal number of examples, this is likely a more critical parameter and would also matter for ethical alignment. As a rule of thumb, we swept across the number of shots on a logarithmic scale. Interestingly, our results on overriding pretraining biases show that performance only plateaued rather than degrading when adding too many shots.


[1] Dual Operating Modes of In-Context Learning. Lin and Lee, 2024.

[2] Focused Large Language Models are Stable Many-Shot Learners. Anonymous, 2024.

[3] Foundational Challenges in Assuring Alignment and Safety of LLMs. Anwar et al, 2024.

Official Review
Rating: 7

Owing to the significant increases in context window lengths, the paper analyzes the efficacy of the Gemini 1.5 Pro LLM in the many-shot in-context learning (ICL) setting, where hundreds to thousands of exemplars can be provided to the model at inference time. ICL has generally been restricted to the few-shot learning setting, where only a small number of demonstrations are provided to the LLM. As this expansion to the many-shot setting can pose issues relating to large scale data collection, the authors propose two simple approaches: (1) Reinforced ICL, which switches human written rationales for demonstrations with chain-of-thought model generated rationales, and (2) Unsupervised ICL, where rationales are not provided in the ICL task. The authors conduct extensive experiments across a number of problem domains, ranging from summarization, machine translation, logistics planning, question answering, algorithmic reasoning, among many others, showcasing the performance benefits obtained by using a larger number of exemplars in ICL.

Strengths

  • I believe the paper is a significant contribution to the field of ICL, as it analyzes the efficacy of LLMs in the not yet studied many-shot regime. The authors conduct a number of extensive experiments ranging from a diverse set of tasks and benchmarks, showcasing the benefits of many-shot ICL. The biggest takeaway from the paper would be that many-shot ICL could be a suitable alternative to supervised fine-tuning which would tune the entire set of model weights, albeit at the cost of increased inference time (which can be reduced via KV caching).
  • The paper is very well-written and I appreciate the large scale of experiments conducted on Google Gemini 1.5 Pro.
  • The two annotation-free many-shot ICL strategies (Reinforced ICL and Unsupervised ICL) proposed are simple, but clearly demonstrate improved performance on a wide variety of tasks.
  • Findings relating to overcoming pre-training biases, learning higher-order functions, and many-shot ICL vs supervised fine-tuning are also important results that strengthen the contributions of this work.

Weaknesses

  • As this is an empirical analysis paper which spans many different tasks and benchmarks, can the authors confirm that for all the tasks, the full test splits were used for evaluation as in line with community standards and past work? If there are any exceptions, these should be listed. For instance, measuring summarization performance on XSum/XLSum for only 150 test articles seems far less in size than the actual test set for this dataset (~11k articles). Were the articles randomly sampled?
  • While the ablations are carried out with respect to the number of ICL exemplars provided to the LLM, I am somewhat unsure of what role the model size plays here. As shown in the cited Wei et al paper (https://arxiv.org/pdf/2303.03846), larger LMs might do in-context learning differently and can learn input-output mappings better than smaller LMs, which, for example, might not be able to reject pretraining biases. Thus, it is not clear if some of the performance improvements are just a direct result of using Gemini 1.5 Pro (which is a model on the larger end of the size spectrum). Conversely, would a smaller LLM with a larger context also benefit from many-shot ICL (possibly with not as many exemplars as Gemini 1.5 Pro)? Do the authors have any thoughts on this, and can they draw a distinction between gains attained due to an increase in the size of the model versus the many-shot ICL setting?
  • Do the authors have any intuition for why at times performance reduces as more exemplars are added, somewhat akin to overfitting? I noted that the authors discussed this as an open question in the paper but I think it would benefit readers if some more insight could be provided.

Questions

Please see the weaknesses listed above.

Limitations

Yes, the limitations have been addressed. More details can be provided in a revision.

Author Response

We thank the reviewer for their detailed comments, which we try to address below. We have added a many-shot ICL comparison for Gemini 1.5 Flash with frontier LLMs to understand the role of model size in the rebuttal pdf.

unsure of what role the model size plays here .. would a smaller LLM with a larger context also benefit from many-shot ICL? Gains from increase in the model size versus many-shot ICL

To understand the role of model size, we’ve evaluated the many-shot performance of Gemini 1.5 Flash, a smaller long-context LLM than Gemini 1.5 Pro. We used low-resource MT to compare against LLMs at the larger end of the size spectrum, namely 1.5 Pro, Claude-3-Opus, and GPT-4. Our results suggest that even smaller LLMs can benefit from many-shot ICL and, with enough shots, outperform LLMs with stronger few-shot performance. We report these results in Figure 1 in the rebuttal pdf.

On the English → Bemba task, we find that 1.5 Flash matches Claude-3-Opus and outperforms GPT-4 with 997 shots, despite having much worse few-shot performance than Claude and GPT. On English → Tamil MT, 1.5 Flash performs comparably to 1.5 Pro and Claude in terms of few-shot performance. However, 1.5 Flash outperforms Claude-3 in terms of many-shot performance, while lagging behind 1.5 Pro.

Were the test splits in line with community standards? If there are any exceptions, these should be listed .. XSum/XLSum for only 150 test articles seems small. Were the articles randomly sampled?

Yes, for almost all of the tasks, we used the standard test sets used for evaluation (e.g., MATH500 and the GPQA Diamond split). The only exceptions were summarization and low-resource MT, where we randomly subsampled 150 examples from the entire test set to reduce the cost of evaluating many-shot ICL across multiple seeds. We'll update the text to specify this clearly.

Note that for XSum, we actually used the GEM-XSUM [1], which is a cleaner version of XSum with 1.2K test articles.

Do the authors have any intuition for why at times performance reduces as more exemplars are added?

A very recent submission under review [2] argues that performance drop in many-shot ICL might be due to more demonstrations diverting the model attention from the query, hindering its understanding of the query.

Another reason might be that really long many-shot prompts might be highly out-of-distribution (OOD) as LLMs might not have seen such prompts during pre-training or post-training (current mixtures might be optimized for needle-in-a-haystack tests). Furthermore, LLM pre-training has a maximum sequence length, followed by continued pre-training for context lengthening, for example, LLaMa 3.1 uses 8k and 128k while Apple FM uses 8k and 32k. We’ll update the discussion to include these hypotheses.

[1] The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics. Gehrmann et. al, 2021.

[2] Focused Large Language Models are Stable Many-Shot Learners. Anonymous, 2024.


We hope most of the reviewer's concerns have been addressed and if so, they would reconsider their assessment

Comment

I would like to thank the authors for their rebuttal and additional experiments. I believe these augment the work further and should be included in the revision. Finally, based on the merits I had listed in my original review, I believe the paper's contributions still currently constitute a "technically solid paper, with high impact on at least one sub-area, or moderate-to-high impact on more than one areas, with good-to-excellent evaluation, resources, reproducibility". I will hence keep my score.

Official Review
Rating: 7

This work conducts a comprehensive study on in-context learning (ICL). The experiments range from few-shot to many-shot scenarios, with up to 2048 in-context examples. It was observed that as the number of examples increases, the performance generally improves, even matching the performance of fine-tuned models. One limitation of many-shot in-context learning noted by the authors is the difficulty in obtaining high-quality in-context example pairs. To address this issue, the authors proposed Reinforced and Unsupervised ICL, which achieved results comparable to those using ground truth examples. Additionally, the authors explored many-shot ICL in the context of pre-training bias (distribution shift settings) and high-dimensional numerical settings, providing explanations for the results.

Strengths

The experiments are comprehensive, covering a wide range of tasks and comparisons. The paper is well-written with a clear structure. The problem itself is interesting and has broad applications.

Weaknesses

The experiments are limited to a single model, Gemini 1.5 Pro, as mentioned by the authors. This may narrow the scope of the results, as different models have varying pre-training data and biases. The performance of the LLM under ICL shows significant variation (as seen in Figure 6). It would be beneficial to include more seeds or utilize better statistical metrics to represent model performance, beyond just average accuracy. Additionally, as the number of shots increases, so does the number of tokens, making the computation budget (RAM of the machine) a potential bottleneck for further exploration and application. Given that many in-context examples have varying lengths, future work could focus on how to select better in-context examples that capture the task's nature while also being concise in terms of token length.

Questions

  1. It's surprising that Reinforced ICL and Unsupervised ICL outperform the ground truth. Do you have any explanations for this?
  2. The performance shows significant variations when different random seeds are used to select in-context examples. Could you explain why this happens?
  3. Flipped labels and abstract labels seem to achieve the worst performance when K=8 or K=16. Is this a universal result across different model sizes, as related to [1][2]?
  4. I am curious about what model is used for Figure 11 in sec 4.3. Did you perform supervised fine-tuning on the same model and report its results on the dataset?

Limitations

Please see weakness and questions for limitations. There is no potential negative societal impact of their work.

Author Response

We thank the reviewer for their comments and questions, which we address below. We have added new results on 1.5 Flash and inference time of many-shot in the rebuttal pdf.

Experiments are mostly limited to 1.5 Pro .. different models have varying pre-training data and biases.

Indeed, our work serves as an existence proof for the huge potential of many-shot ICL. That said, we provided preliminary results for GPT-4-Turbo and Claude-3-Opus in Figure A.2, indicating that they can also benefit from many-shot ICL to varying degrees.

We added Figure 1 in rebuttal pdf to report many-shot performance for Gemini 1.5 Flash, a smaller LLM than 1.5 Pro, and show that it can match or surpass Claude-3-Opus and GPT-4-Turbo with enough shots, despite having worse few-shot performance. We’ll move this to the main paper.

Recently, follow-ups [1, 2, 3] have exhibited many-shot ICL with open-weights and closed-source models on different tasks. Our work adds to this growing body of evidence and contributes several analyses of the phenomenon.

ICL shows significant variation on GPQA (Fig 6) .. more seeds or better metrics

Our paper reports the standard deviation across seeds on most of the tasks. We’d emphasize that GPQA is an exception rather than the norm, as even noted by the Anthropic report, due to its extreme difficulty and small size (198 questions only). To show this variability, we directly report the individual performance on 5 seeds on both GPQA and MATH.

As the number of shots increases, so does tokens, making computation budget a bottleneck

Indeed, many-shot ICL does increase inference computation time, but can allow for quick prototyping and experimentation using just an inference API. That said, it can be sped-up with KV caching and context caching [7], which is default for long-context servers.

To empirically measure this inference time, we’ve added Figure 2 in the rebuttal pdf showing per-output generation runtime for many-shot ICL, averaged across the test set and multiple seeds, on summarization (500-shot) and sequential parity prediction (8192-shot). With caching enabled, runtime increases linearly with a large number of shots, as opposed to quadratically for self-attention: doubling the number of shots nearly doubles the runtime. However, for a small number of shots (less than 32k tokens), we see that runtime is nearly constant.

It's surprising that Reinforced and Unsupervised ICL outperform ground truth. Do you have any explanations?

  • Reinforced ICL: The Reinforced Self-Training work [4] showed that fine-tuning using model-generated rationales can be more effective than human-generated ground-truth outputs. We show that a similar finding holds true for many-shot ICL. Since model-generated outputs utilize only the skills / knowledge possessed by the LLM, it might make such outputs easier to learn from.

  • Unsupervised ICL: This is harder to explain as unsupervised ICL does not always work well: it outperforms ground truth for MATH, but is substantially worse for low-resource MT (Appendix A.10). Our hypothesis is that it works well when the LLM already has all the required knowledge to solve a task. As such, ground-truth outputs might bias the model and the model prefers to utilize its underlying knowledge by just relying on inputs.

The performance shows significant variations when different random seeds are used to select in-context examples. Could you explain why this happens?

Impact of random seed for ICL example selection seems to be highly task dependent – on low-resource MT and summarization, it has a minimal effect, as can be noticed from small error bars that show standard deviation of mean performance across seeds. However, on other tasks, such as MATH and GPQA, it has a higher impact. Prior work has found the following factors impact few-shot performance when selecting examples:

  • If chosen examples are semantically similar to test examples, it leads to improved performance [5]
  • Increased diversity in chosen examples leads to improved performance [6]

When we select different example subsets by setting different random seeds, we perturb these factors, leading to differences in downstream performance. Our general trends hold even taking into consideration this variation in performance (shown by standard deviation bars).

Flipped and abstract labels .. achieve worst performance when K=8 / K=16. Is this universal across different model sizes?

We do not know if this is a universal result across different model sizes. That said, we also evaluated Gemini 1.5 Flash on flipped labels and observed similar accuracy trends: 56% for K=4, 34% at K=8, 40% for K=16, 76% for K=32, and 86.5% for K=64.

Did you supervise fine-tune the same model and report on the dataset?

Yes, we performed “full” fine-tuning on the same model (Gemini 1.5 Pro) on the same examples that were used for many-shot ICL. We performed 5 epochs of training, picking the intermediate checkpoint with the lowest validation loss to ensure strong SFT results.

Future work could focus on how to select better ICL examples that capture the task's nature while being concise

Agreed, we will add this to future work.


[1] Many-Shot In-Context Learning in Multimodal Foundation Models. Jiang et al, 2024.

[2] Many-Shot In-Context Learning for Molecular Inverse Design. Moayedpour et al, 2024.

[3] In-Context Learning with Long-Context Models: An In-Depth Exploration. Bertsch et al, 2024.

[4] Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. Singh et al, 2024.

[5] What Makes Good In-Context Examples for GPT. Liu et al, 2021.

[6] Selective Annotation Makes LMs Better Few-Shot Learners. Su et al, 2022.

[7] Context caching. ai.google.dev/gemini-api/docs/caching


We hope most of the reviewer's concerns have been addressed and if so, they would reconsider their assessment.

Author Response

We thank all the reviewers R1 (XEKv), R2 (mj1R), R3 (kkH9), and R4 (2oDD) for their valuable feedback! All reviewers are in favor of acceptance and found the paper to be comprehensive and very well-written, a significant contribution to ICL with broad scope and applications, novel annotation-free ICL methods, and interesting analyses. This paper also generated a lot of discussion of our observations as well as questions about future work, which we highly appreciated. Here, we address the common concerns raised by the reviewers:

Experiments mostly limited to Gemini 1.5 Pro

Our work serves as an existence proof for the huge potential of many-shot ICL across a variety of tasks. That said, we provided preliminary results for GPT-4-Turbo and Claude-3-Opus in Figure A.2, indicating that they can also benefit from many-shot ICL to varying degrees. Several follow-ups have exhibited many-shot ICL with other open-weights and closed-source models on different tasks. Our work adds to this growing body of evidence and contributes several analyses of the phenomenon.

We also added Figure 1 in the rebuttal pdf to report many-shot ICL performance for Gemini 1.5 Flash, a smaller LLM than 1.5 Pro, and show that it can match or surpass Claude-3-Opus and GPT-4-Turbo with enough shots, despite having worse few-shot performance. We’ll move this to the main paper.

Runtime and inference compute increase for many-shot

While many-shot ICL increases inference computation time, it can allow for quick prototyping and experimentation using just an inference API. That said, it can be sped-up with KV caching and context caching, which is default for long-context servers. Moreover, being able to spend additional inference-time compute to obtain better performance is a useful feature to have.

We added Figure 2 in the rebuttal pdf showing per-single generation runtime, averaged across the test set and multiple seeds, for many-shot ICL on summarization (500-shot) and sequential parity prediction (8192-shot). With KV caching enabled (default for long-context servers), runtime increases linearly with a large number of shots, as opposed to quadratically for self-attention: doubling the number of shots nearly doubles the runtime. However, for a small number of shots, runtime is nearly constant. We'll add this result to the paper.

Additional results in rebuttal pdf for R4

Other results in the rebuttal pdf address the questions and concerns raised by R4 about the impact of context length vs. new information in many-shot ICL (Figure 4), hallucination on XSum (Figure 3), and re-evaluation on the Planning Logistics task with the latest Gemini API (Figure 5).

Details about SFT comparisons, empirical evaluations, and Reinforced ICL

We'll update the revision to include these details, which we discuss in the individual rebuttals below.

Final Decision

The paper originally received high scores, with reviewers praising the significance of the findings and the potential positive impact of the proposed techniques for improving ICL example quality. Yet the reviewers also raised some concerns on:

  1. limited experimentation - only on Gemini-1.5 Pro
  2. potential efficiency overhead with enlarged input context due to ICL examples
  3. lack of theoretical grounding
  4. insufficient discussion of limitations

Following authors' rebuttal, most concerns seem to have been addressed: additional (smaller) models tested, limitations discussed, some links to existing theoretical work made, efficiency impacts admitted and discussed.

In light of the above, the AC recommends accepting the paper, yet urges the authors to incorporate the reviewers' suggestions and additional discussions and experiments into the final version of the paper.