PaperHub
Overall rating: 5.9 / 10
Poster · 6 reviewers
Individual ratings: 3, 2, 3, 4, 4, 3 (min 2, max 4, std 0.7)
ICML 2025

Which Attention Heads Matter for In-Context Learning?

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

We compare the roles of induction heads and function vector (FV) heads in in-context learning (ICL) and find that FV heads drive most of few-shot ICL, which challenges the popular belief that induction heads implement ICL.

Abstract

Keywords
interpretability · in-context learning · large language models · mechanistic interpretability · induction heads

Reviews and Discussion

Review
Rating: 3

The authors investigate the mechanisms behind in-context learning (ICL). Specifically, they study two special types of attention heads called "induction heads" and "function vector heads" (FV heads). By detecting these two types of heads, they find that: 1) induction heads and FV heads are distinct; 2) FV heads are mainly responsible for ICL; 3) some FV heads evolve from induction heads.

Update after rebuttal

I think the most important concerns have been addressed, if all the mentioned modifications are in fact applied. I now think the paper is acceptable, but I feel a little unconfident because of the large difference between the two versions (and I cannot see the complete change log due to the ICML policy --- honestly, I am somewhat disappointed about restricting the depth and format of rebuttals, especially when both authors and reviewers are willing to engage in more thorough discussion). I will raise my score to 3.

Questions for the Authors

  1. My main concern is the poor structure of the article, especially the repetitive expressions and the mixing of related work, background, and methods, which seriously affected my understanding of the content. I approve of the content of the article, but the organization in which it is currently presented makes it not ready for publication.
  2. How can the results be generalized to newer and larger models? At least some Llama-3 and Qwen-2.5 models should be included, and it would be better to also include some MoE models or models with GQA. As for parameter size, I believe head localization can be run on a 70B model on one A100 GPU. (If I am wrong, please correct me.)
  3. How can the conclusions be applied, especially considering that few-shot ICL has been replaced by other methods to some extent?

If question 1 is addressed, I will raise my score to 2 or 3. If question 2 is addressed, I will raise my score to 3 or 4. If question 3 can be explained well, I will raise my score to 4 or 5.

Claims and Evidence

This paper focuses on the roles of, and the relations between, FV heads and induction heads. The claims are well expressed in Table 1 and are interesting. However, I am not sure whether they are well supported:

  1. Although the experiments are run on 14 models, the selection of these models is odd. 1) The Pythia models (released in 2023) and GPT-2 models (released in 2019) are not the newest models, and there is only one model from the Llama family. 2) Only Pythia-6.9B and Llama-2-7B are large models. A common problem in interpretability is that phenomena that originally seemed correct may disappear as the model size increases.
  2. The conclusion that "induction heads and FV heads are mostly distinct" is based on the chosen threshold of 2%. I wonder whether the results are robust to the threshold, and I hope the authors can provide an analysis of it.
  3. In Figure 19, the locations of FV heads and induction heads are significantly distinct only for the small models. For GPT-2 Large, GPT-2 XL, Pythia 6.9B, and Llama 2-7B, the p-value is large and one cannot conclude that the FV heads and induction heads are distinct. The authors should explain why they chose these models and how the results generalize to other models.
  4. Not a drawback but a suggestion: the current section 5 is only about the heads. I believe the conclusions would be clearer if ICL accuracy were also tracked during training. I wonder whether ICL accuracy is related to the emergence of FV heads. Such experiments may also help verify some of the conjectures in section 6.

Additionally, in section 4.2, the authors state that the token-loss difference and few-shot ICL are different things. I am curious about what the token-loss difference actually measures, and why the induction heads are most responsible for it. Even an intuitive explanation would be helpful.

Methods and Evaluation Criteria

The main method in the paper is locating the FV heads and induction heads, using probing metrics proposed in previous works. The authors also use mean ablation of heads, which is a typical method to test the function of heads. These methods are not innovative but are reliable. One of their contributions is identifying the difference between the previous metrics, "token-loss difference" and "ICL accuracy". I believe this is a good contribution, but I still hope the authors can provide more explanation of what the token-loss difference actually reflects.
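To make the mean-ablation procedure mentioned above concrete, here is a minimal sketch (not the authors' code) of ablating selected attention heads by overwriting their output with a precomputed dataset-mean activation via a PyTorch forward hook. The hook point and the assumed output shape (batch, seq, n_heads, head_dim) are assumptions; the exact module to hook depends on the model implementation.

```python
# Minimal sketch of mean-ablating attention heads (assumptions noted in comments).
import torch

def make_mean_ablation_hook(heads_to_ablate, mean_activations):
    """Build a forward hook that overwrites selected heads with their mean activation.

    heads_to_ablate: iterable of head indices to ablate in this layer.
    mean_activations: tensor of shape (n_heads, head_dim), precomputed over a
        reference dataset (the "mean ablation" baseline).
    """
    def hook(module, inputs, output):
        # Assumption: `output` has shape (batch, seq, n_heads, head_dim),
        # i.e. the per-head attention output before the output projection.
        patched = output.clone()
        for h in heads_to_ablate:
            patched[:, :, h, :] = mean_activations[h]  # broadcast over batch and seq
        return patched  # returning a value replaces the module's output
    return hook

# Hypothetical usage (module path depends on the architecture):
# handle = model.layers[5].attn.head_output.register_forward_hook(
#     make_mean_ablation_hook([2, 7], mean_acts_layer5))
# ... run the ICL evaluation ...
# handle.remove()
```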

Theoretical Claims

No theoretical claims

Experimental Design and Analysis

The experiments are well designed and the logic is clear. First, the authors detect the two types of heads, then analyze their roles, and finally discuss their evolution. Some concerns about the analysis are mentioned in the "Claims and Evidence" section.

Supplementary Material

I read A.1, A.2, A.3, and A.9. The supplementary material helps in understanding the paper better.

Relation to Broader Literature

I have no idea about the relation to broader scientific literature. It is a work about interpretability of ICL ability of LLMs. The explanatory work may shed some light on related research, but I am not currently aware of any direct applications of its conclusions.

I think that work explaining mechanisms has its own value, and it may not be directly applicable. But some people may worry about the application value, especially considering that the capabilities of few-shot ICL have been replaced to some extent by instruction tuning, zero-shot ICL, or RLHF. For me, the lack of direct application value is not a disadvantage, but if some more practical applications could be pointed out (outside the interpretability field), it might be more inspiring for other readers.

Essential References Not Discussed

Essential references are discussed, but the discussion is highly mixed with the authors' own methods. This is the main shortcoming of the article.

Some closely related works could be included in the related work. For example, there are works explaining the ICL ability from the perspective of Bayesian inference [1] or gradient descent [2,3]. These would be good additions to the related work.

[1] S. Xie, et al., An Explanation of In-context Learning as Implicit Bayesian Inference

[2] R. Ren and Y. Liu, Towards Understanding How Transformers Learn In-context Through a Representation Learning Lens

[3] H. Sun, et al., In-Context Learning of Polynomial Kernel Regression in Transformers with GLU Layers

Other Strengths and Weaknesses

  1. This paper further investigates the training dynamics of the models. The conclusion about how the functions of heads evolve (or change) during training is very interesting and important.
  2. Their discussion in section 6 is very interesting and inspiring. I hope the authors can further investigate them in the future.

Other Comments or Suggestions

I believe the organization of the paper needs to be improved. The current article is difficult to read.

  1. There are many repeated expressions:
    1. At the beginning of section 4, the authors state that "we control for the correlation between induction and FV heads by only ablating induction heads that do not have high FV scores, and vice versa". However, they repeat this point in lines 263-267: "we take the top n heads by FV score that do not appear in the top 2% heads". If the specific statement appears later, the earlier one can be removed.
    2. Figure 1 is copied from Figures 3 and 4. Although the authors may think this helps the reader grasp the content more quickly, it is still a form of repetition that should be avoided.
    3. Figure 5 is just the average of Figure 6; Figure 6 alone is enough to convey the conclusion.
  2. I recommend splitting section 2, background & related work, into two sections. The "related work" part should introduce the ideas of related works and their relation to this work; overly specific discussion should not be included in that section. On the other hand, preliminaries such as how the detectors are defined can be stated in a preliminary subsection. Further, all statements about your methods (even if these methods are not proposed in this paper) should be placed in a method section. For example, the statement in lines 125-150 about how the authors use causal mediation analysis to identify FV heads and the statement in lines 100-114 about how to compute the induction score should be moved to the method section. As for section 2.3, it could go in the introduction or after your experiments.
  3. Table 2 can be moved to the appendix. The authors can keep only the "parameters" (and maybe the "|L|") columns in the main text.
  4. Currently, page 7 has only three figures, and the text explaining them is far from them. Please arrange the size and placement of the images so that they appear in the most reasonable positions to ensure readability. For example, in Figure 5 you could combine the three models into one figure and use different colors or markers to distinguish them.
  5. Please add a limitation section.
  6. The figures could be further improved. For example, markers should be added to the line charts, the upper and right borders of the figures can be removed, and some grids can be added. In Figure 4, sub-captions for each row would help in understanding the figures.
  7. The caption of Figure 10 should include more explanation.
  8. Figure 11 has an empty image.
Author Response

Thank you for the detailed feedback, and we especially appreciate the clear questions to resolve reviewer concerns!

We first address reviewer concerns on the claims:

  1. We used Pythia models mostly to facilitate the analysis over training dynamics, which other open-source models do not enable since they do not release intermediate checkpoints. We will add Llama 3 8B and Qwen 2.5 7B to our paper - as these models were not yet released during the development of this paper.
  2. In many cases, we had to decide on a threshold to differentiate meaningful FV/induction heads from the long tail of other heads. The 2% threshold was carefully chosen following previous work by Todd et al. (2024), and after we verified that this threshold does meaningfully separate important heads (Figure 7 in the appendix shows the 2% threshold selects heads outside the cluster of other low-scoring heads - taking a higher percentage would select heads with FV scores close to 0). If we take the top 5% of heads, there would be a 20-40% overlap (except for our smallest model, with a 67% overlap) - which is still small enough to motivate our subsequent experiments comparing the two sets of heads. The main goal of the claim on overlap is to establish that the two sets of heads we're studying are not identical, and furthermore, our later ablation and training experiments, which show different trends for FV vs. induction heads, reinforce that the two are distinct types. Beyond a 5% threshold, we would be selecting so many heads with very low FV and induction scores that it would not be meaningful to consider them FV/induction heads.
  3. We agree that the observation on layer depths is not statistically backed up - we will clarify that this observation is speculative.
  4. Great suggestion! We have computed and added the evolution of ICL accuracy to Figure 5 of section 5: we observe that in all models, few-shot ICL accuracy begins to improve around the same time as when induction heads appear, and continues to gradually increase throughout training until the end. Since ICL accuracy continues to improve even after the formation of induction heads, we speculate that this suggests the sharp emergence of induction heads contributes to an initial rise in ICL performance, but the emergence of the FV mechanisms contributes to further improvements in ICL (which reinforces our conjecture 1).

We also address the questions at the end:

  1. Thank you very much for pointing this out and for the thorough, actionable feedback under "Other comments or suggestions"! We have incorporated your feedback: we removed repetitions; reorganized the structure of related work / background / methods; added the suggested references to related work; edited Table 2 in the main body to show the model, parameters and |L| only (while putting the rest in the Appendix); moved the figures on page 7 closer to their text; added a limitation section (where we discuss limitations on the complexity of the ICL tasks studied and the scale of the models in our analysis); added additional markers, grids, and subcaptions to figures; and removed unnecessary borders and empty images.
  2. We are currently repeating our experiments on Llama 3 8B and Qwen 2.5 7B. We will also try to include models in the 12-13B parameter range that might fit in A100s. However, we cannot perform experiments on larger models since, to compute the FV score, we need to store several caches of all the attention head activations - we will suggest experiments on larger models for future work.
  3. Our conclusion mainly serves to clarify misunderstandings in the current interpretability literature, which attributes few-shot ICL to induction heads, as well as to provide general lessons for interpretability, such as how interpretability conclusions change with model scale and how the definition of the metric used to measure ICL can affect conclusions. Since few-shot ICL is still an active field of study in interpretability, we believe our conclusions are very important to share with the field, and we hope that our general lessons will also apply to future interpretability research on other methods!

Thank you again for your feedback, please let us know if any questions or concerns remain unsolved!

Reviewer Comment

I am very grateful for the author's careful response, and I believe that the author can improve this work based on it. I would like to emphasize again that I highly recognize the impact and value that this work may have if some of the problems (which I think are easy to solve, such as adding newer models and improving the article structure) are solved. However, given that the new experiments have not been completed and the updated article structure has not been confirmed, I will temporarily keep my score unchanged. If the author can show the results of their improvements to this work as much as possible in the remaining time, I will increase my score. (Considering that the PDF cannot be updated in this rebuttal, I think the author can partially show the results of their modifications, such as listing the modified related work chapter in the form of an outline or summary, and the specific content can be temporarily omitted. At the same time, modifications to some images and tables can be provided in the form of anonymous links.)

p.s. In my understanding, the process of calculating FV can be completed by recording the activation values on disk. There is no need to keep the complete activation values on the GPU because it only involves forward propagation. At the same time, model quantization may help reduce GPU memory. Although this may introduce some error, an approximate experiment is better than none.

Author Comment

Thank you for acknowledging our rebuttal and for your additional comments requesting the results of our improvements! We also appreciate you emphasizing that our work would be impactful and valuable once we add new models and improve the article structure! We apologize for not sharing the results in our original rebuttal, and for the delay (since we had to wait for results on Llama 3 8B and Qwen 2.5 7B).


First, we provide the plot for the result we described in point #4 of our previous rebuttal comment here. This plot shows the evolution of ICL accuracy during training, as suggested by the reviewer in point 4 under “claims and evidence”.


Second, we outline the structure of the background and related work sections we have adopted in the current revision following suggestion #2 by the reviewer here.

We have also moved the previous section 2.3 on reconciling divergent findings into section 4.3 after our experiments, and revised the text in this subsection to remove repetitions.


Third, we show further results for other suggestions on the paper presentation the reviewer has proposed under “Other comments or suggestions”:

  • Suggestion #3: updated Table 2 (screenshot here)
  • Suggestion #5: we added a limitations section. Outline: limitation on the complexity of ICL tasks studied, limitation on the scale of models studied, enumeration of observations made in the paper that are not yet empirically verified with statistical significance.
  • Suggestion #6: we have added additional markers and grids for all line plots. For example, the figures here and here show the markers and grids we have added.
  • Suggestion #7: we modified the caption of figure 10 to “Few-shot ICL accuracy after ablating induction and FV heads, using either random ablation method (rows 1 and 3) or zero ablation method (rows 2 and 4). Overall, the observation that ablating FV heads decreases ICL accuracy more than induction heads, is robust against different methods of ablation.”
  • Suggestion #8: we removed the empty image in Figure 11 here.

As for the remaining suggestions, we have removed repetitions in suggestion #1 and rearranged figure positions as suggested in #4 (difficult to show without the context of the full paper).


Fourth, we provide plots for the ablation experiments with exclusion for Llama 3 8B here and Qwen 2.5 7B here. Both plots show similar trends as the other models of similar parameter size that we have studied, and strengthen our claim that FV heads contribute more to few-shot ICL accuracy than induction heads.


Finally, in our previous rebuttal, we missed your question at the end of “claims and evidence” on the intuitive explanation for the token-loss difference, apologies for that! Intuitively, the TL difference measures how much more accurately the model predicts the 500th token relative to the 50th token. This has been used as a proxy for the model’s context utilization (since, if the model is leveraging context, it should become better at predicting tokens at later positions than at earlier ones). Induction heads may be important for the TL difference since, by definition, they retrieve information from the context using pattern-matching and copying mechanisms.
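As a concrete illustration of this metric, here is a minimal sketch (not the paper's code), assuming per-token cross-entropy losses over natural text are already available as a tensor: the TL difference compares the loss at an early context position with the loss at a late one. The sign convention below (early-position loss minus late-position loss, so higher means better context use) is a choice; some papers report the negated quantity.

```python
# Sketch of the token-loss (TL) difference, under the assumptions stated above.
import torch

def token_loss_difference(per_token_loss: torch.Tensor,
                          early_pos: int = 49, late_pos: int = 499) -> float:
    """per_token_loss: (n_sequences, seq_len) cross-entropy loss at each position.

    Returns the average loss reduction between the early position (50th token)
    and the late position (500th token): positive if later tokens are predicted
    better, i.e. if the model benefits from the extra context.
    """
    return (per_token_loss[:, early_pos] - per_token_loss[:, late_pos]).mean().item()

# Illustrative call with dummy losses (real usage would pass model losses):
# tl_diff = token_loss_difference(torch.rand(8, 512))
```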


Thank you once again for your careful review and suggestions! We hope that the additional results have addressed your questions 1 and 2 at the end of your review, and we have also responded to question 3 at the end of our initial rebuttal comment!

Review
Rating: 2

This paper studies the functionality of attention heads in in-context learning. Specifically, two functionalities are investigated, namely induction heads and function vector heads. By using the established metrics to locate induction heads and function vector heads across layers, several observations have been made. For example, induction heads and FV heads are distinct but correlated; FV heads are more influential for ICL performance compared to induction heads; FV heads appear in higher layers while induction heads appear in lower layers; FV heads evolve from induction heads. Experiments on language models of different scales across a number of ICL tasks reveal these patterns, together with detailed analysis.

Questions for the Authors

Refer to the above.

Claims and Evidence

The paper proposes several claims. Most of them are drawn from empirical results, but some are not strongly supported.

  • The authors claim that "induction heads appear in early-middle layers and FV heads appear in slightly deeper layers". However, this is not so evident from Figure 2. The average layer may not sufficiently reflect the distribution of heads. It is also not clear to me why the authors choose 2% as the selection criterion.
  • The authors claim "The induction heads and FV heads are distinct", but "they are correlated". Figure 3 shows that FV (induction) heads fall around the 90-95th percentile of induction (FV) scores. That means whether induction heads and FV heads overlap or not depends on how you set the threshold (in this paper, the authors set it as the top 2%). In that sense, the above claim does not seem rigorous to me.
  • The conjecture that "FV heads implement more complex or abstract computations than induction heads" is not empirically validated or analyzed.
  • The conjecture about the evolution of induction heads to FV heads does not seem rigorous to me.

Methods and Evaluation Criteria

The method and evaluation criteria make sense.

Theoretical Claims

There is no theoretical claim.

Experimental Design and Analysis

I have checked the validity of the experimental designs and analyses.

Supplementary Material

I have reviewed the supplementary material.

Relation to Broader Literature

The relationship between induction heads and FV heads and their distributions are relevant to LLM interpretability and the broad area of ICL behavior analysis. The findings could be useful for future research in this area.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths:

  • The investigation between induction heads and FV heads is interesting and meaningful to future research.
  • Extensive controlled experiments are made to uncover the characteristics of the two heads and their relationships.
  • A number of interesting findings are provided which could potentially guide ICL design and LLM training.

Weaknesses:

  • The ICL tasks being tested are a bit synthetic, such as "Capitalize first letter", "choose first of list". Incorporating more realistic tasks could strengthen the impact of the analysis.
  • Some claims are not very strongly supported. Please refer to the "Claims And Evidence" section for more details.
  • The discussion of token loss is not clear to me. More explanation should be given on why token loss can be used as a metric and why it shows an opposite pattern compared to ICL accuracy.
  • The evaluated LLMs are mostly Pythia and GPT-2. More advanced and up-to-date models could be incorporated to make the conclusions general and relevant to current developments.
  • The paper lacks insightful suggestions for future development of ICL or LLMs. Given the findings in this paper, what are critical takeaways when developing future ICL or LLM training techniques?

Other Comments or Suggestions

NA

Author Response

Thank you for your feedback! We will first address the concerns around claims and evidence:

  • On the layer distribution, we will clarify that the claim on the layer depth difference is more of a speculative observation than a claim sufficiently backed up statistically.
  • On the selection of 2%: the 2% threshold was carefully chosen following previous work by Todd et al. (2024), and after we verified this threshold does seem to meaningfully separate important heads. Figure 7 in appendix shows the 2% threshold does select heads outside the cluster of other low-scoring heads - taking a higher percentage would select heads with FV and induction scores close to 0, which would prevent meaningful analysis. We also repeated ablation experiments taking 5% and 10% as thresholds, and find that our claims are robust to the thresholds.
  • On head overlap: the main goal of the claim on overlap is to establish that the two sets of heads we're studying are not identical, to motivate subsequent experiments comparing the two sets. Our later ablation and training experiments, which show different trends for FV vs. induction heads, further reinforce that the two are distinct types. If we take the top 5% or top 10% of heads, there would be a 20-40% overlap (except for our smallest model, with a 67% overlap for the top 5% and an 80% overlap for the top 10%) - while taking 5% or 10% as the threshold does not make as much sense since it includes heads with FV and induction scores close to 0, it still shows that the overlap is low enough to justify treating the two sets of heads as distinct for study.
  • On the rigor of conjectures: we would like to clarify that these are conjectures, inspired by observations in our study, that remain to be rigorously proven; we are not positing them as empirically verified claims. We leave these conjectures as discussions and interpretations of our empirical findings, and to inspire future analyses.

We also address weaknesses raised by the reviewer:

  • On ICL tasks: our study combines ICL tasks that are synthetic but require reasoning achievable by the models we study with realistic tasks (such as translation, commonsense question answering, news topic classification, etc.), so we believe that the selection of ICL tasks is sufficient for our analysis. We do appreciate the suggestion for realistic tasks, and will consider adding further tasks that are feasible for our models (such as sentiment classification).
  • On token loss: the difference between TL difference and few-shot ICL accuracy was first described in Olsson et al. (2022), but we will include a discussion in our revision as well. In short, TL difference approximates how much the model is using context by measuring how much the model's loss decreases on tokens further in context than earlier. When we observe high ICL accuracy with low TL difference (which is what we observe when we ablate random heads and induction heads for example), this means the model preserves its ICL abilities but loses other abilities associated with learning important signals from context. These other signals could capture non-ICL signals such as remembering entities, tracking style from context, etc. It would be great for future work to analyze what is being lost when we see these drops in TL difference!
  • On newer models: we are currently reproducing all experiments except section 5 on training dynamics on Llama 3 8b and Qwen 2.5 7B. Since our conclusions have applied to three families (GPT-2, Pythia, Llama 2) that were released in different years (2019, 2022, 2023), we would expect our conclusions to generalize to the models released in 2024 as well. If surprisingly they don't, we would still like to share the interesting negative results!
  • On suggestions for future development: while papers in the interpretability literature are often not expected to provide this (e.g. the Olsson et al. (2022) and Todd et al. (2024) papers that we compare the mechanisms for do not) as the interpretability contributions are standalone from model development, we will include a discussion on how evidence of the higher influence of FV heads may suggest that this is a more effective mechanism than induction heads. This suggests that ICL can be optimized in small models by using training methods that promote the formation of FV heads.

Thank you again for your feedback, please let us know if any questions or concerns remain unsolved!

Reviewer Comment

I appreciate the authors' efforts in responding to my questions. I have a better understanding about the difference between TL and ICL accuracy. Given that there are still some claims without statistical or empirical evidence (more like speculation and observation), I will keep my original score.

Author Comment

Thank you for your acknowledgement of our rebuttal, and we are glad to hear that our explanation helps clarify the difference between TL and ICL accuracy!

We would also like to clarify that the main claims of the paper (findings 3 and 4 in Table 1) are all rigorously backed up with empirical evidence.

On the other hand, the first two claims you brought up under “Claims and Evidence” serve to set the context for our subsequent experiments (findings 1 and 2 in Table 1). The last two points you brought up under “Claims and Evidence” are, as you mentioned, conjectures - speculative ideas that our results may suggest but that are not empirically proven yet; we include these conjectures in our paper for a high-level discussion of our work and its implications for future research in interpretability.

We hope this clarifies that the claims we do make are all sufficiently backed up by evidence, and the others are made clear in our paper as being either context-setting or speculative. Please let us know if there is wording in our paper that misleadingly suggests otherwise - we will adjust the wording accordingly!

Review
Rating: 3

This paper presents a comprehensive study on attention heads and their impact on in-context learning. Specifically, it examines two types of attention heads: induction heads and function vector heads. Through ablation studies, the authors found that function vector heads play a more significant role in in-context learning performance, while induction heads are more crucial to the token-loss difference. Interestingly, the two types of heads also seem to be correlated.

Update after rebuttal

Given that the authors provided additional results to address my concerns, I increase my score to 3.

Questions for the Authors

Please see the weaknesses above.

Claims and Evidence

Most of the claims are well supported by the evidence. However, I have some concerns regarding the choice of percentile for identifying the different types of attention heads. It appears that you use the top 2% as a threshold to classify heads as induction heads, function vector heads, or neither. How is this percentile determined? As shown in Figure 3, while most of the top 2% induction heads do not have top 2% FV scores, they still exhibit relatively high FV scores. Consequently, the ablation study in Section 4.1 is not entirely convincing to me. The heads being ablated, though not in the top 2% of the other category, may still be in the top 5% or 10%, which is still very high to me. The experiments with “exclusion” seem not so exclusive.
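To make the requested sensitivity check concrete, here is a minimal sketch (not from the paper) that measures how much the top-k% sets by FV score and by induction score overlap as the threshold k varies; the score arrays are hypothetical stand-ins with one entry per (layer, head) pair.

```python
# Sketch of a threshold-sensitivity check for head overlap (illustrative only).
import numpy as np

def top_k_overlap(fv_scores: np.ndarray, ind_scores: np.ndarray, pct: float) -> float:
    """Fraction of heads shared between the top-pct% by FV score and by induction score."""
    n = max(1, int(round(len(fv_scores) * pct / 100)))
    top_fv = set(np.argsort(fv_scores)[-n:])    # indices of the n highest FV scores
    top_ind = set(np.argsort(ind_scores)[-n:])  # indices of the n highest induction scores
    return len(top_fv & top_ind) / n

# Example with random placeholder scores for a model with 1024 heads:
rng = np.random.default_rng(0)
fv, ind = rng.normal(size=1024), rng.normal(size=1024)
for pct in (2, 5, 10):
    print(f"top {pct}% overlap: {top_k_overlap(fv, ind, pct):.2f}")
```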

Methods and Evaluation Criteria

They make sense.

Theoretical Claims

No theoretical claims.

Experimental Design and Analysis

I checked all the experiments. One issue is pointed out in the "Claims and Evidence" section.

Supplementary Material

No supplementary material provided.

Relation to Broader Literature

It helps us to understand the underlying rationale of large language models.

Essential References Not Discussed

None.

Other Strengths and Weaknesses

Strengths

  • Comprehensive empirical studies on induction heads and function vector heads and their impact on in-context learning.
  • Interesting finding that function vector heads contribute more to performance than induction heads.
  • Several models including different sizes are studied.

Weaknesses

  • The selected models are relatively old. I understand that part of the reason is the need to train the models. However, for the experiments that do not involve training, I would like to see some more advanced models to know whether the conclusions apply to them as well.
  • As mentioned in the "Claims and Evidence" section, the way the threshold is decided needs more discussion. For example, the current conclusion is that there is no significant overlap among the top 2% heads. How about the top 5% or top 10%? How strong is the correlation between the two types of heads? In particular, you argue that previous work wrongly interpreted the impact of induction heads because some induction heads also behave like FV heads. However, in the following section, you argue that there is no significant overlap, which sounds a bit contradictory to me. I suggest more discussion, perhaps along with more analysis, is needed.
  • Similarly, when you ablate one type of head, it would be very helpful if the average score/percentile of the other type were reported as well. I do not think the current design separates the effects of the two types of heads well.

Other Comments or Suggestions

  • It would be great if the performance on the 11 out of 37 tasks and the 8 new tasks could be displayed separately. The performance gap helps in understanding whether the conclusions really generalize to new tasks.
Author Response

Thank you for your feedback! We will first respond to the questions about claims and evidence:

  • Choice of percentile: In many cases, we had to decide on a threshold to differentiate meaningful FV/induction heads from the long tail of other heads. The 2% threshold was carefully chosen following previous work by Todd et al. (2024), and after we verified this threshold does seem to meaningfully separate important heads (Figure 7 in appendix shows the 2% threshold does select heads outside the cluster of other low-scoring heads - taking a higher percentage would select heads with FV scores close to 0).
  • Ablations with exclusion: we repeated our ablation-with-exclusion experiments on Pythia-6.9B taking 5% and 10% as the threshold. Overall, we observe the same trends as with 2%: ablating induction heads does not hurt ICL much more than random, while ablating FV heads quickly and significantly degrades ICL performance. Our ablations are robust to the threshold chosen for exclusion. We will repeat these experiments with other thresholds for our other models, and add them to the Appendix.

We will also address weaknesses:

  • Newer models: we are currently reproducing all experiments except section 5 on training dynamics on Llama 3 8b and Qwen 2.5 7B. Since our conclusions have applied to three families (GPT-2, Pythia, Llama 2) that were released in different years (2019, 2022, 2023), we would expect our conclusions to generalize to the models released in 2024 as well. If surprisingly they don't, we would still like to share the interesting negative results!
  • Threshold selection: we will elaborate in our revision on the rationale behind choosing 2% - that in addition to following prior work, we also empirically verified in Figure 7 that selecting a higher threshold such as the top 5% or 10% would end up selecting attention heads with FV and induction scores that are close to 0, which would prevent meaningful analysis. We have also previously computed the correlation between the two scores, but the correlations are not informative due to the long tail of heads that are high on neither FV nor induction. As for the question about contradiction, we would like to clarify that we first establish a lack of significant overlap to motivate our subsequent experiments that study FV and induction as two separate phenomena (if the two sets of heads significantly overlapped and corresponded to the same set of heads, we would no longer have a reason to compare them). However, while the two sets of heads are mostly distinct, there are some heads that have both induction and FV properties, which may have confounded earlier studies (e.g. heads in purple in Figure 7). Thank you for raising this confusion, we will clarify this in our revision!
  • For our ablations, we have also computed and plotted the average scores and percentiles of the heads preserved. We have added these plots to our appendix.

As for your last suggestion on separating the performance, we have plotted the ablations for each task in Figures 15-18 of the Appendix. Following your suggestion, we have added markers for the 8 new tasks to better differentiate them from the other 11 tasks.

Thank you again for your feedback, please let us know if any questions or concerns remain unsolved!

Reviewer Comment

Thanks for your response.

Since this time ICML allows anonymous external links for supplementary tables and figures during the rebuttal, could you share the suggested results you have so far? If they seem reasonable and adequately address my concerns, I will consider adjusting my score.

  • Ablation studies with exclusion experiments at 5% and 10%
  • Experiments on Llama 3 8B and Qwen 2.5 7B (I personally think this is important)
  • Figures including averages and percentiles
  • Results that separate 8 new tasks

Regarding the contradiction, I still feel a bit confused. If the two sets of heads are mostly distinct, that means there are only a few overlapping heads. Can those few heads (way less than 2%) have such a big impact in the previous studies? I don't think the currently provided evidence supports the claim.

Author Comment

Thank you for acknowledging our rebuttal and for your additional comments! We apologize for not sharing the results in our original rebuttal, and for the delay (since we had to wait to get results on Llama 3 8B and Qwen 2.5 7B).

  • Please find ablations excluding 5% and 10% of heads in Pythia 6.9B as hyperlinks on the percentages. As stated earlier, both plots show similar trends to the ablation with exclusion experiments in our paper taking 2% as the threshold.

  • We provide plots for the ablation experiments with exclusion for Llama 3 8B here and Qwen 2.5 7B here. Both plots show similar trends as the other models of similar parameter size that we have studied, and strengthen our claim that FV heads contribute more to few-shot ICL accuracy than induction heads.

  • We plotted the average induction scores and percentiles for the remaining heads after ablating induction and FV heads here. The plots show that when we ablate induction heads, the induction scores of the remaining heads steadily decrease as we increase the ablation percentage, but when we ablate FV heads, the induction scores of the remaining heads do not change much with the ablation percentage.

Similarly, we plotted the average FV scores and percentiles for the remaining heads after ablating induction and FV heads here.

  • To separate the 8 new tasks, we re-generated the plots in Figures 15-18 while bolding the name of the tasks that are new. Here is an example of the new plot for Pythia 6.9B.

Regarding the contradiction: the impact of the overlapping heads is best shown in the difference between our ablations without and with exclusion (Figure 4, rows 1 and 2). Most importantly, the induction plot (in blue) behaves differently in large models between ablations with and without exclusion. The only difference is that in the ablation with exclusion, we preserve FV heads that overlap with the top n% induction heads (n being the ablation percentage), so this set of heads does meaningfully impact the conclusions of ablation studies.

In addition, although the two sets of heads are distinct (which motivates our comparative study), the top 2% of FV heads have induction scores at around the 90-95th percentile, and vice versa. This means that once we start ablating more than 5% of heads, there would be a higher overlap between the two sets of heads. This is more clearly shown in Figure 21 of our Appendix, where we plot the overlap percentage of the two types of heads according to the ablation percentage chosen.

We hope that the supplementary results and the additional clarification on the overlapping heads alleviate your concerns!

Review
Rating: 4

This paper investigates the mechanisms behind ICL. Specifically, it focuses on analyzing two popular explanations for the mechanism of ICL: induction heads (token-copying) and function vector (FV) heads (task-encoding). Through extensive experiments on 12 LLMs and 45 tasks, multiple interesting findings are presented:

  • Induction heads and function vector heads are distinct but correlated.
  • FV heads matter the most for few-shot ICL, especially in larger LLMs. Particularly, the effect of ablating induction heads while preserving top FV heads is close to random, which offers new insight into the effect of induction heads.
  • Many induction heads transition to FV heads during training. Induction heads are simpler and learned earlier, whereas FV heads are more complex.

Questions for the Authors

N/A

Claims and Evidence

The claims are well supported by the extensive experiments presented in this work.

Methods and Evaluation Criteria

The evaluation criteria for the three major findings are valid. The measures of token-loss difference and ICL accuracy effectively support the discrepancy between the findings of this work and those of Olsson et al. (2022), demonstrating that FV heads significantly contribute to ICL accuracy but not to token-loss difference.

Theoretical Claims

There are no proofs involved in this paper.

Experimental Design and Analysis

In general, the experiments are rigorous and well-designed, thoroughly supporting the findings of this paper.

Some minor issues include:

  • ICL is usually conducted on larger LLMs, ranging from a few billion to hundreds of billions of parameters. The largest LLM evaluated in this paper is 7B. Scaling evaluations to include larger LLMs in the range of 13B to 70B could enhance the robustness of the findings.
  • The tasks evaluated focus on simple classification tasks. Extending this to more challenging tasks (e.g., math reasoning, graduate-level MCQs) and diverse prompt formats (ICL with intermediate CoT explanations) would enhance the generalizability of the findings.
  • Most experiments are based on the top 2% of attention heads. Exploring different ratios can reveal whether the observed trends are robust. A sensitivity analysis would strengthen the findings.

Supplementary Material

There is no supplementary material provided.

Relation to Broader Literature

This work builds on two previous explanations of the ICL mechanism: induction heads and FV heads. It presents multiple novel and interesting insights into these two distinct explanations.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Additional Strengths:

  • This work provides valuable insights into the mechanism of ICL, and can potentially inspire future works on ICL mechanism.
  • The paper resolves contradictions between its findings and prior studies on induction heads by highlighting the difference in metrics. It convincingly demonstrates that ICL is largely driven by FV heads rather than induction heads.

Additional Weaknesses:

  • It would be better to discuss potential practical applications of the findings, such as how these insights could be leveraged for model training and ICL optimization.

Other Comments or Suggestions

N/A

Author Response

Thank you for the positive assessment! We also address minor concerns:

  • Larger models: we will consider experiments on models in the 13B range for our camera-ready. However, computational constraints on our end prevent testing on larger models due to memory limits, since the FV score computation and ablation experiments require caching several copies of model activations. We would love to see future work extend our studies to bigger models! The models we study are also much bigger than those in earlier mechanistic analyses of ICL (e.g. Olsson et al. 2022 on toy models with 1-6 attention-only layers or MLPs), which shows promise for scaling our methods to bigger models given enough compute.
  • Task difficulty: while small models limit complex reasoning tasks, our findings hold on tasks such as translation, binding, named entity recognition, commonsense QA... It would be a great idea to also include the more challenging tasks you suggest for future work studying bigger models that have a reasonable base accuracy for those tasks!
  • Thank you for the suggestion! We have repeated the ablation experiments and training dynamics analysis using 5% as the threshold, and results confirm the robustness of the observed trends.
  • Practical applications of findings: we will include a discussion on how evidence of the higher influence of FV heads may suggest that this is a more effective mechanism than induction heads. This suggests that ICL can be optimized in small models by using training methods that promote the formation of FV heads.
Review
Rating: 4

The authors study Induction Heads and Function Vector (FV) Heads in relation to In-Context Learning in a variety of small and medium-sized models. Their experiments suggest that induction heads and FV heads are distinct, though there is some overlap in their behaviours. They also show that removing FV heads greatly impacts accuracy, while this is not the case for induction heads, and that removing induction heads has a greater impact on the token-loss difference (the difference in loss between the 50th and the 500th token of the prompt) than removing FV heads. Finally, they show that FV heads tend to be found slightly deeper within the network, and that during training some heads turn into induction heads before later evolving into FV heads.

The authors attribute some differences between their findings and earlier literature to their carefully distinguishing between the metrics of interest (token-loss difference and ICL accuracy), as well as to their controlling for the behavioural overlap between FV heads and induction heads.

Update after rebuttal

The authors have answered many of my concerns, and have even brought forth new interesting experimental results. Consequently, I have raised my recommendation to a 4.

Questions for the Authors

I would appreciate additional details regarding the claims I have mentioned in “Claims and evidence” (points 1) to 5)). I would be happy to improve my evaluation of the paper if presented with satisfactory answers.

Other questions:

  1. By the very definition of the function vector score, heads whose FV score is high are necessarily heads such that their activation can have a large impact on ICL accuracy (since the score measures the difference in accuracy for two distinct activations). As such, is it not somewhat tautological, or at least not very surprising, that ablating FV heads greatly hurts accuracy (one of the authors’ main contributions)?
  2. The authors emphasise the importance of distinguishing between few-shot ICL accuracy and token-loss difference, but make no attempt at describing the difference between the two metrics. Consider a prompt composed of input-output pairs $(x_i, y_i)$ where the 50th and 500th tokens are the correct answers $y_l$ and $y_r$ to some inputs $x_l$ and $x_r$; in that case, isn’t token-loss difference essentially a renormalized version of few-shot ICL accuracy (renormalised by the accuracy at $y_l$ given the shorter context of the first 50 tokens)? In general, if few-shot ICL accuracy is high but token-loss difference is low, doesn’t that mean that there is no improvement in the model's predictions as the context increases in length, hence that the model is not actually using the context?

Claims and Evidence

I am currently not convinced of the soundness of the following claims:

  1. One of the paper’s main claims is that induction heads contribute less to ICL accuracy than FV heads. I suspect that this is very task dependent: e.g., if the context describes a dictionary of tokens of sorts, i.e. is composed of pairs “$t_{1,1} t_{1,2}, t_{2,1} t_{2,2}, \ldots$” where $t_{i,2}$ is the output token associated with the input token $t_{i,1}$, then I would expect induction heads to contribute greatly to the accuracy. I did not read in detail the references where the various tasks studied are described to see whether such tasks are in fact included.

  2. The authors claim that there is little overlap between FV heads and induction heads. However, they also state that “In most models, FV heads are at around the 90-95th percentile of induction scores, and vice versa.” Unless I misunderstood that sentence, does it not mean that there is little overlap mainly because the authors arbitrarily defined FV heads as the heads with the top 2% of FV scores (and same for induction heads), but that there would be a lot of overlap if FV heads were defined as the top 5% of heads for FV scores (and same for induction heads)?

  3. In relation to the same claim: FV heads are by definition heads such that their activation can have a large impact on ICL accuracy for the tasks considered (since the score measures the difference in accuracy for two distinct activations). If the tasks are such that induction heads are not needed to solve them (which the authors demonstrate themselves), then won’t that choice of tasks necessarily reduce the overlap between induction heads and FV heads? In other words, isn’t this potential lack of overlap an almost direct consequence of the choice of tasks, rather than a general phenomenon?

  4. The authors claim that “Here, we find that in smaller models (less than 160M parameters), ablating induction or FV heads does not influence token-loss difference more than random. In models with over 345M parameters, ablating induction heads has a larger effect token-loss difference than ablating FV heads. The gap between the effect of ablating induction and FV heads decreases with model scale.” However, the behaviours of the “FV heads ablated”, “Induction heads ablated” and “random heads ablated” curves with respect to each other seem to vary so much between models (see Figure 4 and Figure 9 - note in particular that removing induction heads is sometimes less impactful than removing random heads) that I find it difficult to conclude with so few data points that there really is a trend with respect to model size, rather than simply a lot of model-dependent variance. In particular, the results for the largest model in Figure 9 in the Appendix (Llama 2 7B) go strongly against the claim.

  5. The authors claim that “In general, induction heads appear in early-middle layers and FV heads appear in slightly deeper layers than induction heads.” However, looking at the results in Figure 19 in the Appendix, we find that the p-values are often quite large and that the phenomenon, if it exists, might again be model-dependent.

Note: I recognise that it is very hard to draw rigorous conclusions regarding such empirical matters, where confounding factors abound. Though I do not fully agree with many of the authors’ claims, I do not question the seriousness of their work or their scientific rigor.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

No theoretical claims.

Experimental Design and Analysis

Experiments are sound overall.

Supplementary Material

I have skimmed through it (see remarks above).

Relation to Broader Literature

I feel like the findings fit well within a series of articles investigating similar topics and phenomena. One of the findings (induction heads are not that important for ICL accuracy) contradicts conventional wisdom, and the authors offer an explanation for it (which does not entirely convince me, see above).

Essential References Not Discussed

As far as I can tell, the relevant literature is adequately cited by the authors. A few additional references on either ICL for transformers or the mechanistic study of subcircuits that might further enrich the Related Works section:

The evolution of statistical induction heads: In-context learning Markov chains (Edelman et al.)

Iteration Head: A Mechanistic Study of Chain-of-Thought (Cabannes et al.)

An explanation of in-context learning as implicit bayesian inference (Xie et al.)

What can transformers learn in-context? A case study of simple function classes (Garg et al.)

How do transformers learn in-context beyond simple functions? A case study on learning with representations (Guo et al.)

Dissecting recall of factual associations in auto-regressive language models (Geva et al.)

Other Strengths and Weaknesses

The paper is well-written and clear. The authors ask relevant questions and investigate them thoroughly (though I am not entirely convinced by some of their conclusions). They clearly distinguish between what their experiments prove (insofar as anything can be proven regarding such empirical phenomena) and what they merely suggest.

Other Comments or Suggestions

The following sentence is unclear to me, and additional explanations might be needed: “However, C2 would predict that ablating monosemantic FV heads would not hurt ICL performance, […]” Why is that ?

Subsection 2.2 is perhaps too concise; a slightly more detailed exposition (as in Todd et al.) would increase clarity.

A few typos:

"two distinct sets of heads To do so, we" (missing ".") "task tTt\in T defined by a dataset PtP_t" (missing "is")

Author Response

We thank the reviewer for their thoughtful feedback! We address concerns below:

  1. Our task suite does include dictionary-style tasks as you mentioned, and we plot the task-specific ablations in Figure 18 of the appendix - for the dictionary tasks (e.g. national_parks, park-country, person-occupation, person-instrument, english-german, french-english) we still consistently find that FV heads contribute significantly more than induction heads.
  2. In many cases, we had to decide on a threshold to differentiate meaningful FV/induction heads from the long tail of other heads. The 2% threshold was carefully chosen following previous work by Todd et al. (2024), and after we verified this threshold does seem to meaningfully separate important heads (Figure 7 in appendix shows the 2% threshold does select heads outside the cluster of other low-scoring heads - taking a higher percentage would select heads with FV scores close to 0). If we take the top 5% of heads, there would be between a 20-40% overlap (except for our smallest model with a 67% overlap).
  3. While we used a held-out set of tasks to identify FV heads, the FV heads are task-agnostic (as shown in our ablations, where they contribute to tasks that were not used to identify them) - so regardless of the choice of ICL task, with enough tasks the same set of heads should be identified as FV heads. In addition, the tasks we used to compute FV heads are ones that were previously often associated with induction heads, but that our paper shows to be less influenced by induction heads than by FV heads. Therefore, the lack of overlap is a general phenomenon. Also, the main goal of the claim on overlap is to establish that the two sets of heads we're studying are not identical, to motivate subsequent experiments comparing the two sets. Our later ablation and training experiments, which show different trends for FV vs. induction heads, further reinforce that the two are distinct types.
  4. We agree token-loss difference trends are noisy and we will remove the claim about scale-dependence. The key finding we'd like to emphasize here is that token-loss difference and ICL accuracy measure distinct phenomena.
  5. We agree that the observation on layer depths is not statistically backed up - we will clarify that this observation is speculative.
  6. Great point! We used a different set of tasks to compute FV scores than the tasks used in our ablation experiments, to specifically ensure that our main claim is not tautological. What's interesting about FV heads is that their contribution to ICL seems to generalize across different ICL tasks.
  7. Interesting question! The difference between the two metrics was first described in Olsson et al. (2022), but we will include a discussion in our revision as well. In the prompt you described, it would be a normalized version of few-shot ICL, but in practice, TL difference is not computed using such prompts (in previous work and ours, it is taken over natural sentences). And yes, high ICL accuracy with low TL difference (which is what we observe when we ablate random heads and induction heads for example) means the model preserves its ICL abilities but loses other abilities associated with learning important signals from context. These other signals could capture non-ICL signals such as remembering entities, tracking style from context, etc. (it would be great for future work to analyze what is being lost when we see these drops in TL difference!)

Thank you for providing the additional references, suggesting better clarity in 2.2, and pointing out the typos - we have incorporated these in our revision!

“However, C2 would predict that ablating monosemantic FV heads would not hurt ICL performance, […]” Why is that ?

C2 posits that FV heads are polysemantic heads that implement both induction and the FV mechanism. Then, if we ablate monosemantic FV heads but preserve the polysemantic FV heads, models should still be able to perform ICL - we will clarify this in our revision!

Thank you again for your feedback, please let us know if any questions or concerns remain unsolved!

Reviewer Comment

Thank you for your detailed answer. Some additional comments:

  1. Note that I mentioned a very specific task where $t_{i,1}$ and $t_{i,2}$ are tokens and not words (i.e. sequences of tokens); this makes quite a difference, as induction heads are precisely tasked with outputting B after A if AB was present earlier in the text and A, B are tokens (as opposed to collections of sequences of tokens).
  2. This makes sense. You could maybe emphasize in the main text (a single sentence should be enough) that the choice of threshold is informed by the shape of the distribution of scores.
  3. I am not 100% convinced, but I think that the fact that the set of tasks used to designate FV heads is distinct from the set of tasks used for later experiments is an interesting point that should be slightly more emphasized. Among the reasons why I am still not 100% convinced: the tasks chosen might be quite similar to each other (at least "from a transformer's point of view").
  4. Good.
  5. Good.
  6. Same as for 3).
  7. Such a discussion would indeed be welcome. Regarding "high ICL accuracy with low TL difference (which is what we observe when we ablate random heads and induction heads for example) means the model preserves its ICL abilities": I don't entirely agree, as this could also be a sign of the model simply being good at 0-shot accuracy (and not really doing any in-context learning).

As some of my concerns have been answered, I have raised my recommendation to a 3.

Author Comment

Thank you for acknowledging our rebuttal and for raising your recommendation, we appreciate it! We will also follow your suggestions on points to emphasize in 2 and 3 for our revision!

Regarding your concerns in 3: we believe that because FV heads are task-general (using different sets of tasks to identify FV heads leads to the same set of heads), the choice of tasks used to compute FV heads does not influence their overlap with induction heads. Also, Figure 8 shows that in small models, when we don’t perform the exclusion, induction and FV heads seem to contribute comparably to our ICL tasks, and yet there is no overlap between these heads in the small models. For these reasons, we’re not concerned that the lack of overlap is a consequence of the choice of tasks (we find no overlap even when induction and FV heads both contribute to the tasks used to compute FV heads). Please let us know if you have remaining questions!

We also ran additional experiments to look into your questions in 1 and 7.

  • For 1, we ran experiments with the dictionary task you described: we sampled random pairs of tokens $t_{i,1}, t_{i,2}$, which we provide as demonstration examples for ICL. Then, we randomly choose one input from the context, $i^*$, and use $t_{i^*,1}$ as our query to verify whether the model correctly retrieves $t_{i^*,2}$ (a minimal construction sketch of this task follows after this list). You can find the ablation results for this particular task without exclusion here, and with exclusion here. In both settings, and especially with exclusion, FV heads seem to contribute primarily to this task, which aligns with the general claim of our paper that induction heads contribute less to ICL. We also note that induction heads seem to contribute more to this task than to the other tasks we evaluated: in the ablation with exclusion for the dictionary task, ablating 20% of induction heads leads to a ~0.2 accuracy decrease, whereas on other tasks (Fig. 1) it leads to less than a 0.1 decrease. We hope this alleviates your concerns in 1.
  • For 7, we computed ablations on 0-shot ICL accuracy here. The model’s clean 0-shot accuracy is around 0.15 (a large drop from its clean 10-shot accuracy of around 0.5, and similar to the 10-shot accuracy with 20% of FV heads ablated), and the ablations do not decrease 0-shot accuracy much further. This suggests that the model cannot perform the ICL tasks we study well 0-shot, so it must be doing ICL to achieve high few-shot accuracy.
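
For readers who want to reproduce the token-pair ("dictionary") setup from point 1, below is a minimal construction sketch: sample random token pairs $t_{i,1}, t_{i,2}$ as demonstrations, query with one $t_{i^*,1}$, and check whether the model predicts $t_{i^*,2}$. The prompt format, separator tokens, and model choice are illustrative assumptions rather than the exact experimental protocol.

```python
# Minimal sketch of the token-pair "dictionary" ICL task described above:
# sample random (t_{i,1}, t_{i,2}) token pairs as demonstrations, query with
# one t_{i*,1}, and check whether the model's next-token prediction is
# t_{i*,2}. Prompt format, separator tokens, and model choice are
# illustrative assumptions (token collisions across pairs are ignored).
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"  # hypothetical model choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def dictionary_trial(n_pairs: int = 10) -> bool:
    vocab = list(range(tok.vocab_size))
    pairs = [(random.choice(vocab), random.choice(vocab)) for _ in range(n_pairs)]
    query_idx = random.randrange(n_pairs)

    # demonstrations "t1 -> t2\n", followed by the query "t1 ->"
    arrow, newline = tok(" ->").input_ids, tok("\n").input_ids
    prompt_ids = []
    for t1, t2 in pairs:
        prompt_ids += [t1] + arrow + [t2] + newline
    q1, q2 = pairs[query_idx]
    prompt_ids += [q1] + arrow

    pred = model(torch.tensor([prompt_ids])).logits[0, -1].argmax().item()
    return pred == q2

accuracy = sum(dictionary_trial() for _ in range(100)) / 100
print(f"dictionary-task accuracy: {accuracy:.2f}")
```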

We appreciate the reviewer’s additional questions that allowed us to further verify the robustness of our claims!

Review
3

This paper explores the mechanisms behind in-context learning (ICL) in large language models (LLMs), specifically examining two types of attention heads: induction heads and function vector (FV) heads. The authors conduct experiments to determine which of these mechanisms is primarily responsible for ICL. Through various ablations, the paper argues that FV heads play a more significant role than induction heads, especially in larger models, while also showing that induction heads evolve into FV heads during training.

Questions for Authors

No

Claims and Evidence

  1. Induction and FV heads are distinct but correlated. - Ablations and training dynamics support this claim, showing minimal overlap but some correlation in the functionality of these heads.
  2. FV heads are primarily responsible for ICL in LLMs. - Evidence: Ablating FV heads results in a significant drop in ICL performance, whereas ablating induction heads has a smaller impact.
  3. Induction heads evolve into FV heads during training. - Evidence: The authors observe that some heads that initially exhibit induction-like behavior transition to performing more abstract FV functions later in training.

Methods and Evaluation Criteria

The study uses ablation experiments to measure the effect of removing induction and FV heads on ICL performance, focusing on few-shot learning tasks. The paper also tracks the evolution of these heads during training to understand their transition from simpler induction-based operations to more complex FV-based mechanisms. Additionally, they explore the overlap between induction and FV heads and provide a detailed breakdown of the results for models ranging in size from 70M to 6.9B parameters.
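
As a concrete illustration of this kind of head-ablation experiment, the sketch below zeroes the outputs of a chosen set of attention heads in a GPT-2-style HuggingFace model by hooking the input of the attention output projection. The hook placement, the use of zero-ablation rather than mean-ablation, and the example head indices are assumptions for illustration and may differ from the paper's actual implementation.

```python
# Minimal sketch of ablating a chosen set of attention heads in a GPT-2-style
# HuggingFace model. Each head's contribution is zeroed just before the
# attention output projection (c_proj), where per-head outputs are still
# concatenated. Zero-ablation, the hook placement, and the example head
# indices are illustrative assumptions; the paper's ablation scheme (e.g.
# mean-ablation) may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")
head_dim = model.config.n_embd // model.config.n_head

def ablate_heads(heads_to_ablate):
    """heads_to_ablate: iterable of (layer, head) pairs. Returns hook handles."""
    by_layer = {}
    for layer, head in heads_to_ablate:
        by_layer.setdefault(layer, []).append(head)

    handles = []
    for layer, heads in by_layer.items():
        c_proj = model.transformer.h[layer].attn.c_proj

        def pre_hook(module, args, heads=heads):
            (x,) = args  # (batch, seq, n_head * head_dim), heads concatenated
            x = x.clone()
            for h in heads:
                x[..., h * head_dim : (h + 1) * head_dim] = 0.0
            return (x,)

        handles.append(c_proj.register_forward_pre_hook(pre_hook))
    return handles

# usage: ablate two hypothetical heads, run a few-shot prompt, and compare
# the prediction (or task accuracy) against the clean model
handles = ablate_heads([(5, 1), (9, 6)])
ids = tok("Paris -> France\nRome -> Italy\nBerlin ->", return_tensors="pt").input_ids
with torch.no_grad():
    pred = model(ids).logits[0, -1].argmax().item()
print(tok.decode([pred]))
for h in handles:
    h.remove()  # restore the clean model
```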

Theoretical Claims

The paper introduces a key theoretical claim that induction heads may serve as an initial mechanism that helps the model learn the more complex FV heads. This hypothesis is supported by the observation that FV heads often start with high induction scores during training but diverge as training progresses.

Experimental Design and Analysis

Ablation experiments were conducted on 12 different language models of varying sizes. Causal mediation analysis was used to assess the contribution of specific heads to ICL performance. Training dynamics analysis helped track the evolution of attention heads during the learning process.

Supplementary Material

The authors provide supplementary material that includes detailed ablation results across multiple ICL tasks, layer-specific analyses, and evolution of head scores over training steps. Additionally, appendices contain data on model architecture, ablation methods, and task descriptions.

Relation to Prior Work

The paper contributes to ongoing research on attention-head interpretability and in-context learning. It may be related to https://arxiv.org/abs/2404.15574 and https://www.anthropic.com/research/in-context-learning-and-induction-heads; the second one in particular deserves a more direct discussion from the authors.

Missing Essential References

While the paper cites foundational works, it could benefit from incorporating studies that investigate the broader implications of these findings for neural architecture optimization and future model scaling.

Other Strengths and Weaknesses

Strengths: Comprehensive ablation studies across models of various sizes. Clear presentation of results and robust statistical analysis. Thought-provoking theoretical contribution regarding the evolution of induction heads.

Weaknesses: The paper could expand on why the shift from induction to FV heads is necessary for larger models and whether other mechanisms could play a role. The conjectures about the relationship between induction and FV heads could benefit from further validation with real-time model adjustments. The applications of FV heads could be discussed further.

Other Comments or Suggestions

The findings present a compelling case for rethinking the role of induction heads in ICL, but further exploration of the interaction between FV heads and other components of the model might provide a more holistic view of in-context learning dynamics.

Author Response

We thank the reviewer for their feedback!

To address your weaknesses:

  • On the shift from induction to FV: we will elaborate on why this shift might be necessary for larger models. We hypothesize that induction is a lightweight mechanism (based simply on attending to the token that follows a previous copy of the current token) that offers a crude way for models to retrieve information from context; however, it is not sufficient for more complex ICL problems. FV heads, on the other hand, offer a compact representation of the ICL task given the examples; their exact mechanism is still unknown, and plausibly more complex, but it allows larger models to achieve higher training accuracy. We also discussed the possibility that other mechanisms play a role in ICL; we leave this investigation for future work.
  • On the conjectured relationship between induction and FV heads: we performed an additional experiment measuring the evolution of ICL accuracy during training. We observe that in all models, few-shot ICL accuracy begins to improve around the time induction heads appear, and continues to increase gradually until the end of training. Since ICL accuracy keeps improving after the formation of induction heads, we speculate that the sharp emergence of induction heads contributes to an initial rise in ICL performance, while the emergence of the FV mechanism drives further improvements. This reinforces our conjecture that induction heads may serve as a stepping stone for FV heads, and we leave a rigorous verification of this conjecture to future work.
  • We will also add a further discussion on the applications of our findings: the greater influence of FV heads suggests that they implement a more effective mechanism for ICL than induction heads, which implies that ICL in small models could be improved by training methods that promote the formation of FV heads.

Thank you for suggesting the relevant work by Wu et al.! We also extensively discussed and compared with the Anthropic work suggested (which we cite as Olsson et al. 2022).

Thank you again for your feedback, please let us know if any questions or concerns remain unsolved!

Final Decision

All reviewers unanimously appreciate that the paper tackles an interesting and important problem concerning the mechanism behind in-context learning in language models. All reviewers praise the paper's results as insightful, compelling, and as having the potential to inspire interesting future work. All reviewers also note the comprehensive, controlled experiments across a large number of models.

There were multiple concerns brought up in the initial round of reviews, which the reviewers and I agree, have been addressed well.

Summary of concerns and how they were addressed:

  • Reviewer joQW makes an interesting point: the qualitative results here intuitively must be task-dependent while the claims suggest that they will generalize to all tasks. The reviewer suggested some alternative tasks to test.
    • The authors have clarified this well with new results.
  • Some sub-claims (like the layers where some heads appear) seem to be model- or model-size- dependent (2akh, joQW).
    • The authors acknowledge this is speculative. I encourage the authors to edit these sub-claims accordingly.
  • The seemingly-arbitrary choice of a 2% threshold to determine the FV heads (joQW, wNzV, Eo6g, jdwn).
    • The authors clarify how this is based on the score distributions, and how a 5% threshold would imply greater overlap between FV and induction heads. Multiple reviewers later agreed that the choice is not arbitrary (see the sketch below).
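
For readers who want to reproduce this kind of robustness check, the sketch below shows how the overlap between the top-k% FV heads and the top-k% induction heads could be measured as the threshold varies. The score arrays are random placeholders and the exact overlap definition is an illustrative assumption, not the paper's.

```python
# Minimal sketch of measuring overlap between the top-k% FV heads and the
# top-k% induction heads as the threshold k varies. The score arrays below
# are random placeholders standing in for the per-head FV and induction
# scores computed in the paper; the overlap definition (shared heads divided
# by top-k set size) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n_heads = 12 * 12                       # e.g. a GPT-2-small-sized model
fv_scores = rng.random(n_heads)         # placeholder per-head FV scores
induction_scores = rng.random(n_heads)  # placeholder per-head induction scores

def top_k_percent(scores, k):
    """Indices of the heads with the highest scores (top k percent)."""
    n = max(1, round(len(scores) * k / 100))
    return set(np.argsort(scores)[-n:].tolist())

for k in (2, 5, 10):
    fv = top_k_percent(fv_scores, k)
    ind = top_k_percent(induction_scores, k)
    shared = fv & ind
    print(f"top {k:2d}%: {len(shared)} shared heads, overlap = {len(shared) / len(fv):.2f}")
```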

Onto other concerns that I believe were reasonably addressed to begin with:

  • Lack of practical applications (2akh, wNzV).
    • However as Reviewer jdwn nicely puts it, this should not be penalized. Work that understands model behavior is insightful in its own right.
  • Model choices are old (Eo6g, 2akh) or strange (jdwn).
    • Although an understandable concern, I agree with the authors that these are model families from three different years. If findings that replicate within these families do not generalize to other families, that would in itself make for interesting future work.

I hope the authors found the reviews helpful in polishing their paper.