PaperHub
Average rating: 6.2 / 10 — Rejected (5 reviewers)
Individual ratings: 5, 6, 8, 6, 6 (min 5, max 8, std 1.0)
Confidence: 3.4 · Correctness: 3.2 · Contribution: 2.4 · Presentation: 3.0
ICLR 2025

How Transformers Implement Induction Heads: Approximation and Optimization Analysis

OpenReview | PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Transformers exhibit exceptional in-context learning capabilities, yet the theoretical understanding of the underlying mechanisms remains limited. A recent work (Elhage et al., 2021) identified a "rich" in-context mechanism known as induction head, contrasting with "lazy" $n$-gram models that overlook long-range dependencies. In this work, we provide both approximation and optimization analyses of how transformers implement induction heads. In the approximation analysis, we formalize both standard and generalized induction head mechanisms, and examine whether two-layer single- or multi-head transformers can efficiently implement them, with an emphasis on the distinct role of each transformer submodule. For the optimization analysis, we study the training dynamics on a synthetic mixed target, composed of a 4-gram and an in-context 2-gram component. This setting enables us to precisely characterize the entire training process and uncover an *abrupt transition* from lazy (4-gram) to rich (induction head) mechanisms as training progresses.
Keywords
Transformer, mechanisms, approximation, training dynamics, abrupt transition

Reviews and Discussion

Official Review
Rating: 5

The paper studies theoretical aspects of induction heads, which have been argued to play a key role in the skills that transformers demonstrate, including in-context learning.

The first set of results has to do with expressivity. Specifically, they show upper bounds on the size of transformers needed to implement variants of induction heads. The result here is not very surprising, as it uses basic transformer components to push tokens several steps ahead to facilitate the induction. It also seems to require increasing the dimension of the transformer as longer induction memory is requested, and again this seems intuitively natural, since this is what’s required for pushing n tokens forward in time, so they can be used in the induction.

The second part asks about the learning process of induction heads. Towards this end, it considers a data generation mechanism that has two components: an n-gram and an induction head. It shows that the learning process goes through several stages where different components are learned. This part is potentially interesting, but I think it could benefit from more work before publication. Some points are:

  a. I like the idea that there’s a type of inductive bias towards n-grams (also appearing in Bietti, 2023), but it would be more interesting to see if this is a true inductive bias if you had a setting where both n-gram and induction fit the data and learning chooses one over the other. In your current setup they both must be learned, because the output contains both, so I’m not sure what we learn from the fact that one is learned before the other, but eventually they are all learned.
  b. If I understand correctly, the model has six learnable parameters. It’s possible that some observations (e.g., single fixed point) are due to this simplification. I would have liked to see at least simulations that show which phenomena appear in larger models.
  c. Generally, I am missing more explanation and intuition why this particular model was chosen, and what are the implications for learning in real models.
  d. I am missing more discussion of the relation to Bietti, who also propose some theoretical analysis of the learning dynamics.
  e. Please also discuss relation to https://arxiv.org/abs/2402.11004

Strengths

Since induction heads are a key capability of transformers, it is certainly interesting to understand how they can be implemented and learned. There is some prior work on this, and the current work adds to that, especially in terms of analyzing optimization dynamics (though see points above).

Weaknesses

As mentioned above, the expressiveness part is somewhat unsurprising. The dynamics is potentially interesting, but small scale in terms of parameters optimized, and the choice of model and lack of clear implications are also a concern.

Questions

  1. Figure 1, the caption doesn’t sufficiently explain what is seen in the figure. E.g., that highlights correspond to tokens with high attention weights for the induction head (I assume).
  2. Following Equation 4, it would be good to highlight that this is not “just” the standard attention head because you have x_{s-1} instead of x_s.
  3. Can you comment on the relation between your results and Bietti (2023) equation 7, which also shows a two layer implementation?
  4. Your induction head averages over all previous appearances of the context. However, it isn’t clear that this is indeed the behavior of induction heads, or that it is even desirable. Wouldn’t we want to output, say, the most common element, or some encoding of the distribution over the next element?
  5. The dependence on parameter q in Theorem 4.3 is confusing because we don’t see the dependence of C_{n,q} on q. Can you present a bound that optimizes over q?
  6. It seems like the dimension of the transformer in Theorem 4.3 needs to scale with n, “the induction length”. This seems rather undesirable, and it is not clear that it is necessary. Can you provide corresponding lower bounds, or provide empirical support for this need?
  7. Typo: “the loss convergences”
  8. In the definition of f*_{G4}, shouldn't it be L-2 (not T-2)?
  9. It’s a bit confusing that in this problem you have X1,..,XL be one dimensional, but X_{L+1} is two dimensional, and you then map X1,...,XL to two dimensions. Would have been better to just say it’s all in two dimensions.
Comment

We appreciate the reviewer’s thorough review of our manuscript and recognition of the value of our analysis on optimization dynamics.

If you have any further concerns or feel that the current responses do not fully address your concerns, please let us know, and we will promptly provide an updated response.

Weakness & Summary (a)(b)(c). "As mentioned above, the expressiveness part is somewhat unsurprising. The dynamics is potentially interesting, but small scale in terms of parameters optimized, and the choice of model and lack of clear implications are also a concern." ... "If I understand correctly, the model has six learnable parameters. It’s possible that some observations (eg single fixed point) are due to this simplification. I would have liked to see at least simulations that show which phenomena appear in larger models. c. Generally, I am missing more explanation and intuition why this particular model was chosen, and what are the implications for learning in real models."

Response: We thank the reviewer for the constructive suggestion. In response, we have conducted three new experiments to further validate the applicability and robustness of our optimization results:

  • Experiment with real-world Transformers on natural language dataset. To validate the application of our results in real-world settings, we train a two-layer two-head standard transformer (without any simplification used in our theory) on the wikitext-2 dataset. The numerical results are shown in Figure 3 (please click this link to check the figure). Notably, these results closely mirror the behavior observed in our toy model (Figure 2): the loss exhibits a clear plateau, the positional encodings $p$ are learned first, and the dot-product structures $W_K, W_Q$ are learned slowly at the beginning, resembling an exponential increase. As $W_K, W_Q$ are learned, the loss escapes the plateau. This experiment provides strong empirical support for our theoretical insights regarding the time-scale separation between the learning of positional encoding and the dot-product structures.

  • Experiment on Adam in high-dimensional toy setting. Since our theoretical analysis is based on Gradient Flow (GF), we extended our experiments to the widely-used Adam optimizer in a high-dimensional toy setting. The results, shown in Figure 5 (please click this link to check the figure), reveal that while Adam eventually transitions from the lazy to the rich regime, this transition is challenging, and Adam exhibits multiple plateaus during the learning process, which aligns with our theory and the previous experiment on GF.

  • Experiment on discrete token distribution in toy setting. Recognizing that real-world inputs are often discrete (e.g., tokens), we conducted a new experiment using discrete inputs to further validate our results. In this experiment, we replace the Gaussian inputs of the experiment in Figure 2 with Boolean inputs ($x_i \in \{\pm 1\}$), and the results are presented in Figure 4 (please click this link to check the figure). The findings reveal extremely similar behavior to that observed with Gaussian inputs in Figure 2, including the four-phase dynamics and the transition from the 4-gram to the induction head mechanism. This experiment highlights the robustness of our theoretical results to input distribution changes.

Additionally, we fully agree with the reviewer’s suggestion to explore how transformers select solutions when both n-gram and induction head mechanisms fit the data. This would involve testing whether a transformer first learns a memorized solution and later transitions to a generalizable one. However, constructing such scenarios is nontrivial, and we leave it as future work.

Comment

Summary (d) and Q3. discussion of [1]. "I am missing more discussion of the relation to [1], who also propose some theoretical analysis of the learning dynamics"; "Can you comment on the relation between your results and equation 7 in [1], which also shows a two layer implementation?"

Response: Thank you for your suggestion. We have incorporated the following discussion into our revised manuscript (Appendix F):

  • Approximation analysis:

    • [1] focuses primarily on the implementation of the vanilla induction head. In contrast, our study extends this analysis by investigating not only how two-layer transformers achieve vanilla induction heads (Eq. (4)) but also how they implement generalized induction heads, i.e., in-context n-grams (Eqs. (6) and (7)).

    • Furthermore, our work provides explicit approximation rate results, offering insights into the distinct roles of multiple heads, positional encoding, dot-product structure, and FFNs in implementing these induction heads.

  • Optimization analysis:

    • Study objectives. While [1] examines the transition from 2-gram to induction head, our work focuses on the transition from 4-gram to induction head.

    • Study methods: [1] conducts extensive experiments supported by partial theoretical properties but does not fully characterize the training dynamics theoretically. In contrast, our study provides a precise theoretical analysis of the entire training process in a toy model, uncovering the sharp transition from 4-gram to induction head.

    • Main insights: [1] emphasizes the role of weight matrices as associative memories and the impact of data distributional properties. Our analysis, on the other hand, identifies two primary drivers of the transition: (1) the time-scale separation due to low- and high-order parameter dependencies in self-attention; (2) the speed differences caused by the relative proportions of the two components in the mixed target.

Summary (e). discussion of [2]. "Please also discuss relation to [2]."

Response: Thank you for this suggestion. We have added the following discussion to the revised manuscript (Appendix F).

The primary connection between [2] and our work lies in the optimization analysis. Specifically, [2] focuses on the transition from uni-gram to bi-gram mechanisms in Markov Chain data. In contrast, our study investigates the transition from 4-gram to in-context 2-gram mechanisms (induction head). Additionally, we theoretically identify two primary drivers of the transition: (1) the time-scale separation due to low- and high-order parameter dependencies in self-attention; (2) the speed differences caused by the relative proportions of the two components in the mixed target.

Q2. Question on Eq. (4). "Following Equation 4, it would be good to highlight that this is not “just” the standard attention head because you have x_{s-1} instead of x_s."

Response: Thank you for this inquiry. We would like to clarify that our formulation corresponds to the standard induction head, which involves both $x_{s-1}$ and $x_s$. Specifically, for the current token $x_L$, if a previous token $x_{s-1}$ is similar to $x_L$, the induction head outputs the next token of $x_{s-1}$, which is $x_s$. Figure 1 provides a detailed example illustrating this behavior. We hope this clarifies your question.

Q4. Question on average over all previous tokens. "Your induction head averages over all previous appearances of the context. However, it isn’t clear that this is indeed the behavior of induction heads, or that it is even desirable. Wouldn’t we want to output, say, the most common element, or some encoding of the distribution over the next element?"

Response: Thank you for raising this question. We'd like to clarify that our induction head (Eq. (4)) does not compute a direct average over all previous tokens. Instead, it uses a softmax operation to compute a "weighted average" based on token relevance. Notably, when $W^\star$ is large, the softmax behaves almost like a hardmax, effectively focusing only on the most relevant tokens. Moreover, our formulation aligns well with the self-attention structure, which also uses a softmax to compute a "weighted average" of all previous tokens and pays attention mainly to the most relevant ones.
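For concreteness, here is a minimal NumPy sketch of the mechanism described above (our own paraphrase of Eq. (4); the function and variable names are ours, not the paper's):

```python
import numpy as np

def induction_head(X, W, beta=1.0):
    """Softmax-weighted retrieval over matching contexts.

    X is an (L, d) array of tokens x_1, ..., x_L. The head compares the query
    x_L with every earlier token x_{s-1} (s = 2, ..., L-1) via the bilinear
    score beta * x_L^T W x_{s-1} and returns the softmax-weighted average of
    the corresponding next tokens x_s. For large beta (or a large W), the
    softmax approaches a hardmax, so only the best-matching context survives.
    """
    L, _ = X.shape
    x_L = X[-1]
    scores = np.array([beta * x_L @ W @ X[s - 2] for s in range(2, L)])  # x_{s-1}, s = 2..L-1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ X[1:L - 1]                                          # average of x_s, s = 2..L-1

# Example: the last token repeats x_4, so the head should retrieve roughly x_5.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
X[-1] = X[3]
out = induction_head(X, np.eye(4), beta=8.0)
```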

Comment

Q5. Question on C_{n,q}. "The dependence on parameter q in Theorem 4.3 is confusing because we don’t see the dependence of C_{n,q} on q. Can you present a bound that optimizes over q?"

Response: Thank you for pointing out this issue. In Theorem 4.3, the constant satisfies $C_{n,q}\leq C (ne^{1+0.01n})^q$, where $C$ is an absolute constant. Therefore, the upper bound on the approximation error becomes $\frac{C_{n,q}}{H^q}\leq C\big(\frac{ne^{1+0.01n}}{H}\big)^q$. Thus, $H\gtrsim ne^{1+0.01n}$ is sufficient to ensure a good approximation. Notably, $n$ is typically small when extracting local semantics. For example, in the vanilla induction head, $n=2$. In the revised version of our manuscript, we have made this clarification in both the main text and the proof section in the Appendix as suggested.
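As a purely illustrative aid (our own arithmetic, ignoring the absolute constant $C$), the threshold $n e^{1+0.01n}$ stays modest for the small $n$ relevant to local semantics, e.g., roughly half a dozen heads for the vanilla case $n=2$:

```python
import math

# Evaluate the head-count threshold n * e^(1 + 0.01 n) from the bound
# C_{n,q} <= C (n e^(1+0.01n))^q, ignoring the absolute constant C.
for n in (2, 3, 4, 8):
    print(f"n = {n}: H of order {n * math.exp(1 + 0.01 * n):.1f} suffices (up to constants)")
```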

Q6. Empirical support for the required dimension. "It seems like the dimension of the transformer in Theorem 4.3 needs to scale with n, “the induction length”. This seems rather undesirable, and it is not clear that it is necessary. Can you provide corresponding lower bounds, or provide empirical support for this need?"

Response: We thank the reviewer for the insightful question.

  • Intuitively, for the generalized induction head defined in Eq. (6), the model needs to process $n-1$ $d$-dimensional tokens $X_{s-n+2:s}$, which represent the local semantics near the token $x_s$. To ensure complete information transfer, the required model dimension naturally scales at least with $nd$. Additionally, it is worth noting that $n$ is usually small in practice (e.g., $n=2$ for the vanilla induction head).

  • To substantiate this understanding, we have conducted a new experiment using two-layer transformers with dimension $2d$ (smaller than the $nd$ in our theory) to learn the GIH defined in Eq. (6) with $n=4$. The results, illustrated in Figure 6 (please click this link to check the figure), demonstrate that two-layer transformers with insufficient dimension fail to represent the generalized induction head.

Q1 & Q7 & Q8 & Q9. Writing problems.

Response: Thank you for your careful review. We have carefully revised the paper to address the writing problems you highlighted.

References:

[1] Bietti et al. Birth of a transformer: A memory viewpoint. NeurIPS 2023.

[2] Edelman et al. The evolution of statistical induction heads: In-context learning markov chains. NeurIPS 2024.

Thank you again for your valuable comments. We hope that our responses have resolved your concerns. We sincerely hope that you could give our work a reconsideration.

Comment

Dear Reviewer A68m,

We hope that our responses could adequately address your concerns. We warmly welcome further discussion on any additional concerns you may have, and we sincerely hope you might reconsider the rating accordingly.

Thank you once again for the time and effort that you have dedicated to our work.

Best regards,

Authors of submission 9368

Comment

Dear Reviewer A68m,

Thank you once again for your time and effort in reviewing our work! As the author-reviewer discussion period will end soon, we would like to confirm whether our further responses have addressed your main concerns. If so, we kindly hope you might reconsider your rating accordingly. Certainly, we are more than happy to answer your further questions.

Best regards,

Authors of submission 9368

Official Review
Rating: 6

This paper studies Transformers expressing and learning induction heads. On the expression side, three settings, two-layer single-head Transformer without FFN, two-layer multi-head Transformer without FFN, and two-layer multi-head Transformer with FFN, are considered and shown to be able to express induction head based on a single token, induction head based on several tokens, and induction head based on several tokens with arbitrary distance function, respectively. On the optimization side, the paper studies the training dynamics for learning a simple task which requires induction heads in a simplified training setting.

Strengths

  • The study of mechanisms behind attention heads is of both theoretical and practical interest. It's nice to see how different components can help Transformers implement induction heads of varying complexities.
  • Generally, the insights and intuitions provided into the network's representation and optimization dynamics are informative and helpful.

Weaknesses

  • The expressivity results are just sufficiency conditions. There are no necessity results presented in the paper. E.g., it's not shown that 2-layer single-head Transformers cannot implement the general induction head based on several tokens (although this seems to be true and intuitive).
  • The optimization results correspond to a very simplified setting from the task side (working with a Gaussian distribution instead of discrete token distribution), model parametrization side, and optimization side (layer-wise gradient flow). I believe experiments supporting the claims in more realistic settings (e.g., more natural Transformer + GD + discrete token prediction task) could have been provided. Also the paper should be more upfront with the limitation of their optimization results throughout the paper in my opinion.
  • The writing can be improved significantly (see the questions for more details).

Questions

  • In equation 4 and the following lines, I think the softmax notation is used improperly. Currently the input to the softmax function is a scalar. Also it's not clear what exactly $\{(x_s, x_L)\}$ means (maybe there's a typo?). The same problem appears again in equation 6, with the additional complexity that the dimensions of the matrix inside the softmax (and the dimensions of $X_{L-n+2:L}$) are not specified. I think in general we have dimension mismatch problems here based on the current notation. I would appreciate clarification of this issue.
  • The final value of $C_{q,n}$ is not specified clearly in the proof. I think this value is actually quite important as it determines the minimum number of heads required (I think this actually could be included in the main text.)
  • Can you elaborate on the meaning of the notation $\mathbb{I}(z=x_{L-2})$ (and similar ones) on page 7?
  • In Figure 2, experiments setup is not mentioned. Also x-axis is not specified. More importantly, we can see a drop in the loss of IH term in the first phase of the training, this is in contrast to the theoretical results. Can you elaborate please?
  • In Theorem 5, could you explain in what variables the asymptotics are?
  • I think the writing of the insights around lines 243-250 can be improved.
  • The paper focuses on a specific format of relative positional embedding. I think the paper could be more upfront on this and state this more clearly throughout the paper.
  • Line 451, "without loss of ambiguity" seems to be a typo.
  • Regarding the second weakness, I'd appreciate any evidence/explanation that the proven optimization dynamics would also show up in more natural setting.
Comment

We thank the reviewer for their thorough review of our manuscript and for the appreciation of our contributions and insightful comments.

In particular, for your core concern (W2), we have conducted new real-world experiments on the optimization results. Please refer to our Response to W2.

If you have any further concerns or feel that the current responses do not fully address your concerns, please let us know, and we will promptly provide an updated response.

W1. Necessity results for approximation. "The expressivity results are just sufficiency conditions. There are no necessity results presented in the paper. E.g., it's not shown that 2-layer single-head transformers cannot implement the general induction head based on several tokens (although this seems to be true and intuitive)."

Response: We appreciate the reviewer's constructive suggestion. We fully agree that two-layer single-head transformers cannot perform a generalized induction head (GIH) as defined in Eq. (6) for large $n$. Intuitively, each head can accurately extract at most one neighboring token, whereas a GIH requires $n$ neighboring tokens. To substantiate this understanding, we have conducted a new experiment using two-layer single-head transformers to learn the GIH defined in Eq. (6) with $n=4$. The experimental results, shown in Figure 6 (please click this link to check the figure), confirm that two-layer single-head transformers are indeed incapable of representing the GIH for large $n$.

W2. Real-world experiments on optimization results. "The optimization results correspond to a very simplified setting from the task side (working with a Gaussian distribution instead of discrete token distribution), model parametrization side, and optimization side (layer-wise gradient flow). I believe experiments supporting the claims in more realistic settings (e.g., more natural transformer + GD + discrete token prediction task) could have been provided. Also the paper should be more upfront with the limitation of their optimization results throughout the paper in my opinion." "Regarding the second weakness, I'd appreciate any evidence/explanation that the proven optimization dynamics would also show up in more natural setting."

Response: We thank the reviewer for the constructive suggestion. In response, we have conducted three new experiments to further validate the applicability and robustness of our optimization results:

  • Experiment on discrete token distribution in toy setting. As the reviewer mentioned, real-world inputs are often discrete (e.g., tokens), so we conducted a new experiment using discrete inputs to further validate our results. In this experiment, we replace the Gaussian inputs of the experiment in Figure 2 with Boolean inputs ($x_i \in \{\pm 1\}$), and the results are presented in Figure 4 (please click this link to check the figure). The findings reveal extremely similar behavior to that observed with Gaussian inputs in Figure 2, including the four-phase dynamics and the transition from the 4-gram to the induction head mechanism. This experiment highlights the robustness of our theoretical results to input distribution changes.

  • Experiment with real-world transformers on natural language dataset. To validate the application of our results in real-world settings, we train a two-layer two-head standard transformer (without any simplification used in our theory) on the wikitext-2 dataset. The numerical results are shown in Figure 3 (please click this link to check the figure). Notably, these results closely mirror the behavior observed in our toy model (Figure 2): the loss exhibits a clear plateau, the positional encodings $p$ are learned first, and the dot-product structures $W_K, W_Q$ are learned slowly at the beginning, resembling an exponential increase. As $W_K, W_Q$ are learned, the loss escapes the plateau. This experiment provides strong empirical support for our theoretical insights regarding the time-scale separation between the learning of positional encoding and dot-product structures.

  • Experiment on Adam in high-dimensional toy setting. Since our theoretical analysis is based on Gradient Flow (GF), we extended our experiments to the widely-used Adam optimizer in a high-dimensional toy setting. The results, shown in Figure 5 (please click this link to check the figure), reveal that while Adam eventually transitions from the lazy to the rich regime, this transition is challenging, and Adam exhibits multiple plateaus during the learning process, which aligns with our theory and the previous experiment on GF.

Comment

Q1. Concern on the softmax notation. "In equation 4 and the following lines, I think the softmax notation is used improperly. Currently the input to the softmax function is a scalar. Also it's not clear what exactly $(x_s,x_L)$ means (maybe there's a typo?). The same problem appears again in equation 6, with the additional complexity that the dimensions of the matrix inside the softmax (and the dimensions of $X_{L-n+2:L}$) are not specified. I think in general we have dimension mismatch problems here based on the current notation. I would appreciate clarification of this issue."

Response: We thank the reviewer for pointing out these issues. In our revised version, we have addressed these problems and clarified the notations. Below is a brief explanation of the corrections:

First, as the reviewer commented, softmax should act on vectors. For example, in Eq. (4), the corrected notation is $(x_{s})_{s=2}^{L-1}\,{\rm softmax}\big((x_L^\top W^\star x_{s-1})_{s=2}^{L-1}\big)^\top$. Secondly, for $X_{L-n+2:L}$ used in Eq. (6), we have restated its size in the Notation section to avoid ambiguity. Finally, the pairs $\{(x_s,x_L)\}_{s=1}^{L-2}$ mentioned below Eq. (4) refer to the induction head checking all $(x_s)_{s=1}^{L-2}$ in the preceding text for similarity with $x_L$.

Q2. Question on $C_{q,n}$. "The final value of $C_{q,n}$ is not specified clearly in the proof. I think this value is actually quite important as it determines the minimum number of heads required (I think this actually could be included in the main text.)"

Response: Thank you for pointing out this issue. In Theorem 4.3, the constant satisfies $C_{n,q}\leq C (ne^{1+0.01n})^q$, where $C$ is an absolute constant. Therefore, the upper bound on the approximation error becomes $\frac{C_{n,q}}{H^q}\leq C\big(\frac{ne^{1+0.01n}}{H}\big)^q$. Thus, $H\gtrsim ne^{1+0.01n}$ is sufficient to ensure a good approximation. Notably, $n$ is typically small when extracting local semantics. For example, in the vanilla induction head, $n=2$. In the revised version of our manuscript, we have made this clarification in both the main text and the proof section in the Appendix as suggested.

Q3. Question on Figure 2. "In Figure 2, experiments setup is not mentioned. Also x-axis is not specified. More importantly, we can see a drop in the loss of IH term in the first phase of the training, this is in contrast to the theoretical results. Can you elaborate please?"

Response: We thank the reviewer for this insightful comment on Figure 2. Below, we address the points raised:

  • x-axis: First, the x-axis represents the training steps. We have added a detailed description of the experimental setup in the revised version of the paper, including specifics about model architecture, dataset, and hyperparameters.
  • Drop in ${\rm IH}_2$ loss in Phase I: We appreciate the reviewer's careful observation that the ${\rm IH}_2$ loss decreases noticeably in the figure. The Phase I boundary in the figure was drawn manually, and we accidentally placed it too late, making Phase I appear longer than it should. In the new version, we have revised this figure.
  • Theoretical explanation for the ${\rm IH}_2$ loss in Phase I: We would like to further explain why the ${\rm IH}_2$ loss should hardly decrease in Phase I according to our analysis (see also the heuristic integration below). Recall that the two parameters responsible for ${\rm IH}_2$ are $w_{KQ}, w_{V_2}$. At the beginning of training, $|\dot{w}_{KQ}|\sim\frac{w_{KQ}}{L} \ll 1$ and $|\dot{w}_{V_2}|\sim\frac{1}{L}\ll 1$. Since Phase I has a duration of $O(1)$ (which is sufficient for the 4-gram mechanism to be nearly learned), $w_{KQ}, w_{V_2}$ remain almost unchanged, so the ${\rm IH}_2$ loss hardly decreases in this phase.
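For intuition, integrating the quoted rate heuristically (our own back-of-the-envelope step, not a statement from the paper's proof) shows why escaping the plateau takes time of order $L\log(1/\epsilon)$ when training starts from a small initialization $\epsilon$, consistent with the scaling $T_{\rm II}, T_{\rm III}\sim L\log(1/\epsilon)$ quoted elsewhere in this discussion:

```latex
\dot{w}_{KQ} \sim \frac{w_{KQ}}{L}
\;\Longrightarrow\;
w_{KQ}(t) \approx w_{KQ}(0)\, e^{t/L},
\qquad
w_{KQ}(0)=\epsilon
\;\Rightarrow\;
t \approx L \log(1/\epsilon) \text{ to reach an } O(1) \text{ value.}
```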
Comment

Q4. Explanation of some notations. (1) "Can you elaborate on the meaning of the notation $\mathbb{I}(z=x_{L-2})$ (and similar ones) on page 7?" (2) "In Theorem 5, could you explain in what variables the asymptotics are?"

Response: Thank you for your careful review.

  • The notation $\mathbb{I}\{S\}$ is an indicator function that takes the value $1$ if the event $S$ is true and $0$ otherwise. We have explicitly included this definition in the Notation Section.

  • Regarding Theorem 5.5 (assuming that by "Theorem 5" you mean Theorem 5.5—please correct us if this is not the case), we would like to clarify that it is a non-asymptotic result, rather than asymptotic. Specifically, the theorem holds as long as $L, \alpha^\star, 1/\sigma_{\rm init}$ are sufficiently large, meaning that when these parameters exceed certain absolute constants, the result is assured.

Q5. Some writing problems. (1) "I think the writing of the insights around lines 243-250 can be improved." (2) "The paper focuses on a specific format of relative positional embedding. I think the paper could be more upfront on this and state this more clearly throughout the paper." (3) "Line 451, "without loss of ambiguity" seems to be a typo."

Response: Thank you for your careful review. We have carefully revised the paper to address the writing problems you highlighted.

Thank you again for your valuable comments. We hope that our responses have resolved your concerns. We sincerely hope that you could give our work a reconsideration.

Comment

Dear Reviewer fF9s,

We hope that our responses could adequately address your concerns. We warmly welcome further discussion on any additional concerns you may have, and we sincerely hope you might reconsider the rating accordingly.

Thank you once again for the time and effort that you have dedicated to our work.

Best regards,

Authors of submission 9368

Comment

Thank you very much for the clarifications and the new experiments. I have raised my rating accordingly.

Comment

We would like to reiterate our gratitude for your valuable recommendation and positive feedback. We are glad that we have addressed your main concern. Thank you for raising the score!

Official Review
Rating: 8

This paper studies the theoretical analysis of the transformer mechanisms behind induction heads from two perspectives: one is approximation theory and the other is optimization. In the approximation part, the authors show how to use transformers to approximate the induction head and the generalized induction heads. In the optimization part, they investigate the abrupt transition from n-gram to induction heads.

Strengths

(1) The paper provides a very sound and complete theoretical analysis of the transformer mechanisms behind induction heads, which are often used to explain the important emergent ability of in-context learning for LLMs. (2) The paper is well-written and well organized. In order to motivate the work, the paper gives a very clear streamline of the related works and formulates the research objectives very clearly. Also they provide very clear definitions of the key notions in the paper. I have not checked all the technical proofs but I believe that they are correct. The paper focuses on the main objectives, makes the contributions explicit, and discusses different possible scenarios.

Weaknesses

(1) Since the induction head is used to explain ICL, it might be more interesting to explain how the theory in this paper helps in-context learning (ICL), especially with empirical results for ICL. (2) Although the paper is meant to be theoretical, it would be helpful to provide some empirical experiments to support the theoretical analysis.

Questions

In Line 307, the authors mentioned that “the use of general similarity g enables the model to recognize not only synonymous but also antonymic semantics, thereby improving both the accuracy and diversity of in-context retrievals.” Why do we need to use general similarity g to recognize antonymic semantics? Why does this recognition improve both the accuracy and diversity of in-context retrievals?

Comment

We thank the reviewer for the great efforts on the review of our manuscript and for appreciating our novelty and contributions.

If you have any further concerns or feel that the current responses do not fully address your concerns, please let us know, and we will promptly provide an updated response.

W1. Connection with ICL. "Since induction head is used to explain ICL, it might be more interesting to explain how the theory in this paper helps in-context learning(ICL), especially empirical results for ICL."

Response: As the reviewer noted, induction heads are widely recognized to play a crucial role in enabling ICL capability. Particularly, [1] shows that the emergence of the induction head mechanism is nearly synchronized with the development of ICL capability, and removing induction heads significantly diminishes the ICL ability. Since our paper provides a theoretical understanding for the formation of induction heads, our insights are directly applicable to understand ICL.

  • From the approximation perspective, our work sheds light on the network size required to efficiently express induction heads, which directly correlates to the network size needed to support ICL.

  • From the optimization perspective, we present the first theoretical analysis of the phase transition from the n-gram mechanism (lazy regime) to the induction head mechanism (rich regime). This transition aligns with experimental findings in [1], which demonstrate a significant increase in ICL scores following this phase transition.

W2. Suggestion on empirical experiments. "Although the paper is meant to be theoretical, it would be helpful to provide some empirical experiments to support the theoretical analysis."

Response: We appreciate the reviewer's constructive suggestion. To further validate the applicability and robustness of our theoretical results, we have conducted five new experiments that align with our approximation and optimization analyses. These experiments include testing on real-world models, natural language datasets, and alternative algorithms. Please refer to our Global Response for details.

Q. Question on recognizing antonymic semantics. "In Line 307, the authors mentioned that “the use of general similarity g enables the model to recognize not only synonymous but also antonymic semantics, thereby improving both the accuracy and diversity of in-context retrievals.” Why do we need to use general similarity g to recognize antonymic semantics? Why does this recognition improve both the accuracy and diversity of in-context retrievals?"

Response: Thank you for the interesting question. Intuitively, relevant contextual information exists not only near synonymous semantics but also near antonymic semantics in the text. For example, consider the in-context reasoning task: "a is middle in a,b,c. If a>b, then a<?". Here, the symbols ">" and "<" act as antonyms, and effective reasoning requires the retrieval of context related to ">" when encountering "<". Additionally, during pre-training, it is crucial to increase the diversity and accuracy of in-context retrieval capacity of the models. This broader capacity equips the model to handle a wider range of downstream tasks.
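To make the intuition concrete, here is one toy instantiation (our own illustration; the paper's $g$ is a general similarity function and need not be this particular choice) of how a non-dot-product similarity can retrieve antonymic as well as synonymous context:

```python
import numpy as np

def dot_similarity(x, y):
    # Standard dot-product similarity: roughly opposite embeddings (antonyms) score low.
    return x @ y

def general_similarity(x, y):
    # One toy choice of a general similarity g: the absolute dot product also treats
    # strongly opposed directions as relevant, so retrieval can attend to antonymic
    # context (e.g., ">" when the query is "<") in addition to synonymous context.
    return abs(x @ y)

hot, cold = np.array([1.0, 0.0]), np.array([-0.9, 0.1])
print(dot_similarity(hot, cold))      # ~ -0.9: the antonym is ignored
print(general_similarity(hot, cold))  # ~  0.9: the antonym is retrieved
```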

References

[1] Olsson et al. In-context Learning and Induction Heads. arXiv preprint arXiv:2209.11895, 2022.

Thank you again for your valuable comments. We hope that our responses have resolved your concerns.

Comment

Dear Reviewer 9XQk,

We hope that our responses could adequately address your concerns. We warmly welcome further discussion on any additional concerns you may have.

Thank you once again for the time and effort that you have dedicated to our work.

Best regards,

Authors of submission 9368

Official Review
Rating: 6

The paper comprises two parts. In the first part, the authors show through representation results that simple transformer architectures with two layers and several heads are able to correctly represent different variations of induction head mechanisms. In the second part, the authors prove through gradient flow that a simplified two-layer architecture can learn a mixed target function comprising a 4-gram component and a vanilla in-context 2-gram component.

Strengths

The paper is generally well written and the proofs seem correct to me. Even if the paper considers heavily simplified models, the theory on this topic is scarce and difficult, so any theoretical insight on the dynamics of these models is welcome.

Weaknesses

I think that the major drawback of the paper is that it mostly lacks experimental results that corroborate the theory. In particular, it would be interesting to see if the constructions used in Theorems 4.3 and 4.4 on in-context induction heads are actually learned by the considered transformer models. Additionally, I found the statements of some theorems to be rather vague (see the questions below).

Questions

I have the following questions for the authors:

  1. I find the statements of Theorems 4.3 and 4.4 vague. In particular, the way they are currently stated seems to imply that such results hold for any number of heads $H$. In the proofs, however, it seems that $H$ cannot be arbitrary, and actually has to be large enough and possibly dependent on $n$, unless I misunderstood something. It would be helpful to clarify this further.
  2. In the gradient flow analysis of Section 5.1.3, you consider a layer-wise training paradigm in which you first train only the first layer, and then you fix it to train the second one. I was wondering if the experimental results of Figure 2 are also obtained using this paradigm. I was also wondering if this assumption is insightful in practice, i.e., whether, when training the two layers at the same time, you could see experimentally that the first layer is learned before the second layer, or if in general the two layers are learned together in practice.
  3. Minor concern: in the main text you put some amount of emphasis on a novel Lyapunov function that is used in the proof, but then this function never appears in the text and is relegated at the end of the appendix. In the final version, I would either give more space to it in the main text, explaining why this function is important/novel, or put less emphasis on it.
Comment

We appreciate the reviewer's recognition of our work and helpful comments. We will try our best to address your questions.

If you have any further concerns or feel that the current responses do not fully address your concerns, please let us know, and we will promptly provide an updated response.

W1: Experimental supports for approximation results. "I think that the major drawback of the paper is that it mostly lacks experimental results that corroborate the theory. In particular, it would be interesting to see if the constructions used in Theorems 4.3 and 4.4 on in-context induction heads are actually learned by the considered transformer models."

Response: Thank you for your constructive suggestion.

  • Supporting Theorem 4.1: For the vanilla induction head, our theoretical construction presented in Theorem 4.1 aligns closely with the experimental results reported in the seminal study of induction heads [1]. For further details, please refer to Remark 4.1.

  • New experiment supporting the construction in Theorem 4.3/4.4: Our key construction in Theorem 4.3 is that the first layer is responsible for extracting the local semantic information $X_{s-n+2:s}$ near each $x_s$, and the second layer produces the final output. To validate this, we have conducted a new experiment. The results, presented in Figure 7 (please click this link to check the figure), confirm the theoretical roles of the first and second layers in implementing the generalized induction head.

  • Additionally, we have conducted three more new experiments to further support our approximation and optimization results. Please refer to our Global Response for more details.

Q1. Clarification on $H$ in Theorems 4.3 and 4.4. "I find the statements of Theorems 4.3 and 4.4 vague. In particular, the way they are currently stated seems to imply that such results hold for any number of heads $H$. In the proofs, however, it seems that $H$ cannot be arbitrary, and actually has to be large enough and possibly dependent on $n$, unless I misunderstood something. It would be helpful to clarify this further."

Response: Thank you for this careful question. We would like to clarify that Theorems 4.3 and 4.4 do hold for all $H\geq 1$. However, as the reviewer correctly noted, our proof primarily focuses on the case of $H\geq n$. For the case of $H<n$, the approximation error can be trivially bounded by a constant. Then, the two cases can be unified by selecting appropriate constants. For example, in Theorem 4.3, choosing $C_{n,q}\geq Cn^q$ ensures that the upper bound in Theorem 4.3 still holds even when $H<n$. This point was not explicitly stated in the original version of the proof, but we have now included this clarification at the beginning of the proof for each theorem to improve readability.

Q2. Question on the layerwise training. "In the gradient flow analysis of Section 5.1.3, you consider a layer-wise training paradigm in which you first train only the first-layer, and then you fix it to train the second one. I was wondering if the experimental results of Figure 2 are also obtained using this paradigm. I was wondering if this assumption is also insightful in practice, i.e., if when training the two layers at the same time, you could see experimentally that the first layer is learned before the second layer, or if in general the two layers are learned together in practice."

Response: Thank you for your thoughtful question. The experiment in Figure 2 was indeed conducted using the layerwise training paradigm.

  • While the layerwise training paradigm is widely used in theoretical analyses of the training dynamics of two-layer transformers [2][3], it is often not the case in practice that the first layer is learned before the second when training the layers together. This phenomenon is discussed in more detail in Section 5.2 in [2].
  • However, it is important to note that this technical simplification still preserves the essential characteristics of the learning dynamics we are interested in. To further support this insight, we have conducted a new experiment in which the two layers of the transformer are trained jointly on the wikitext-2 dataset. The results, shown in Figure 3 (please click this link to check the figure), are highly consistent with the layerwise training dynamics observed in our toy model: the positional encoding is learned first, followed by the dot-product structure. (A schematic of the layer-wise paradigm itself is sketched below.)
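For readers unfamiliar with the layer-wise paradigm referenced above, here is a minimal PyTorch-style sketch (our own illustration; the `layer1`/`layer2` attribute names and hyperparameters are hypothetical, not from the paper):

```python
import torch

def train_layerwise(model, loss_fn, data, steps_per_stage=1000, lr=1e-2):
    """Train layer 1 with layer 2 frozen, then freeze layer 1 and train layer 2."""
    stages = [(model.layer1, model.layer2), (model.layer2, model.layer1)]
    for trainable, frozen in stages:
        for p in frozen.parameters():
            p.requires_grad_(False)
        for p in trainable.parameters():
            p.requires_grad_(True)
        opt = torch.optim.SGD(trainable.parameters(), lr=lr)
        for _, (x, y) in zip(range(steps_per_stage), data):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
```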
Comment

Q3. Regarding the novel Lyapunov function. "Minor concern: in the main text you put some amount of emphasis on a novel Lyapunov function that is used in the proof, but then this function never appears in the text and is relegated at the end of the appendix. In the final version, I would either give more space to it in the main text, explaining why this function is important/novel, or put less emphasis on it."

Response: Thank you for your suggestion. While the construction of the Lyapunov function is indeed a novel and crucial aspect in our proof, we acknowledge that its technical complexity might divert attention from the main insights. In the revised version, we have reduced its emphasis in the main text to maintain focus on the key insights.

References:

[1] Elhage, Nelson, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell et al. A mathematical framework for transformer circuits. Transformer Circuits Thread 1, no. 1 (2021): 12.

[2] Chen, Siyu, Heejune Sheen, Tianhao Wang, and Zhuoran Yang. Unveiling induction heads: Provable training dynamics and feature learning in transformers. arXiv preprint arXiv:2409.10559, 2024.

[3] Eshaan Nichani, Alex Damian, and Jason D Lee. How transformers learn causal structure with gradient descent. arXiv preprint arXiv:2402.14735, 2024.

Thank you again for your valuable comments. We hope that our responses have resolved your concerns. We sincerely hope that you could give our work a reconsideration.

Comment

Dear Reviewer KJ49,

We hope that our responses could adequately address your concerns. We warmly welcome further discussion on any additional concerns you may have, and we sincerely hope you might reconsider the rating accordingly.

Thank you once again for the time and effort that you have dedicated to our work.

Best regards,

Authors of submission 9368

Comment

Thank you for the clarifications and the new experiments provided. I am still a bit confused about Theorem 4.3 and the fact that it holds for any number of heads $H$. For example, in line 934 of the revised version, you say "We choose H large enough so that $x_s^{(1)} \in [-2,2]^D$." Can you be more precise about how large it has to be and if this has any implications on the number of heads $H$ required for the theorem to hold? Also, I think there is a multiplicative factor $H$ missing in the equation on line 920.

Comment

We thank the reviewer for their interest and insightful comments on the proof details. We are glad to address your additional question.

Intuitively, we divide the proof into two regimes for HH:

  • Large $H$ regime ($H \gtrsim n e^{1+0.01 n}$): As mentioned in our earlier response, our proof primarily focuses on this regime. We need sufficiently large $H$ to satisfy $x_s^{(1)}\in[-2,2]^D$ and $H \geq n$. Specifically, $\|x_s^{(1)}\|_{\infty}\leq\|x_s^{(1)}-X_{s-n+1:s}\|_{\infty}+\|X_{s-n+1:s}\|_{\infty}\leq \epsilon_{\rm SA}^{(1)} +1\leq C\big(\frac{n e^{1+0.01 n}}{H}\big)^q + 1$. Then, for all $H \gtrsim n e^{1+0.01n}$, it follows that $C\big(\frac{n e^{1+0.01 n}}{H}\big)^q\leq 1$, ensuring $x_s^{(1)}\in[-2,2]^D$. Consequently, our main proof guarantees the theorem holds in this regime.

  • Small $H$ regime (the counterpart): For the remaining case, where $H \lesssim n e^{1+0.01 n}$, the number of cases is finite. Hence, the approximation error can be trivially bounded by a constant by selecting the maximum value among these finite cases.

Therefore, by appropriately choosing constants, these two regimes can be unified.

In the revised version, we will include this clarification at the beginning of the proof to improve readability. (Unfortunately, we are currently unable to update the submitted version.) Additionally, we thank the reviewer for pointing out the typo on line 920, which we will correct in the revised manuscript.

Again, we sincerely appreciate your careful review and valuable feedback. We hope our response could adequately address your concerns. If you have any further concerns, please let us know, and we will promptly provide an updated response.

Comment

Dear Reviewer KJ49,

Thank you once again for your time and effort in reviewing our work! As the author-reviewer discussion period will end soon, we would like to confirm whether our further responses have addressed your main concerns. If so, we kindly hope you might reconsider your rating accordingly. Certainly, we are more than happy to answer your further questions.

Best regards,

Authors of submission 9368

Comment

Thank you for the clarification. I increased my score.

Comment

We would like to reiterate our sincere gratitude for your valuable recommendation and positive feedback. We are pleased to have addressed your main concern and appreciate your decision to raise the score. Thank you!

Official Review
Rating: 6

The reviewed paper explores how transformers implement ''induction heads'' to perform in-context learning (ICL) by analyzing a simplified two-layer transformer model. The analyses include both approximation and optimization parts. The approximation analysis examines transformer submodules, while the optimization analysis tracks phase transitions in training dynamics as transformers develop induction mechanisms.

Strengths

A key strength of this work is its rigorous theoretical approach within the chosen framework. The results and proofs are clearly delivered and effectively presented. The authors provide a comprehensive investigation from both approximation and optimization perspectives, which might potentially deepen our understanding of transformer models.

Weaknesses

  • This work provides an analysis of how transformers implement induction heads, approaching the problem from both approximation and optimization perspectives. The theoretical results appear rigorous; however, the model setup seems overly simplified. Specifically, the study is limited to a two-layer transformer model, with and without feed-forward networks (FFNs), a framework initially explored in a seminal paper [1] and subsequently developed by numerous follow-up studies. Given this extensive literature, the contribution here appears somewhat incremental, with limited novelty in the analytical approach and the techniques remaining relatively standard. Expanding the analysis to a more sophisticated and realistic setting, as seen in recent work [2], would significantly strengthen the contribution. Without this, the impact of the results may be constrained, and it is unclear if they meet the high standards for significance.

  • Additionally, given the simplified setup and use of synthetic toy examples, I have reservations about the generalizability of these findings for interpreting real-world transformers. I would suggest that the authors conduct extensive empirical experiments on widely-used models to validate the applicability and robustness of their theoretical results.

  • It would be valuable if the theoretical insights could yield practical implications. Specifically, can the approximation results inform new methods, or could the optimization insights benefit transformer training? If so, this would meaningfully enhance the contribution. Otherwise, the practical significance of the work may remain limited.

[1] Elhage, Nelson, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell et al. "A mathematical framework for transformer circuits." Transformer Circuits Thread 1, no. 1 (2021): 12.

[2] Chen, Siyu, Heejune Sheen, Tianhao Wang, and Zhuoran Yang. "Unveiling induction heads: Provable training dynamics and feature learning in transformers." arXiv preprint arXiv:2409.10559 (2024).

Questions

See Weaknesses.

Comment

W3. Potential practical implications. "It would be valuable if the theoretical insights could yield practical implications. Specifically, can the approximation results inform new methods, or could the optimization insights benefit transformer training? If so, this would meaningfully enhance the contribution. Otherwise, the practical significance of the work may remain limited."

A3. Thank you for this inquiry.

  • Positioning of the paper: We would like to re-clarify that our primary focus is on the theoretical aspects of transformers, about approximation and optimization. We believe that theoretical research is a crucial first step in deepening our understanding of transformers. While providing detailed practical guidance is an ultimate goal, it falls outside the scope of this paper.

  • Potential practical implications: Nevertheless, we discuss some potential practical implications of our theory, as follows.

    • Approximation: Our results suggest that to implement general induction heads as defined in Eq. (6) and Eq. (7), one can increase the number of attention heads and incorporate FFN layers without needing to increase the model depth. Moreover, our analysis indicates that the dot-product structure in the first layer and the positional encoding in the second layer are unnecessary. This finding could help reduce the model size while preserving expressiveness, which may be useful for designing more efficient transformer architectures.

    • Optimization:

      • Common beliefs. Smaller initializations are believed to promote better generalization: in traditional deep learning theory, small initialization facilitates feature learning [4]; and in LLMs, it has been found to encourage generalization over memorization [5].
      • However, our theoretical results indicate that using a sufficiently large context length $L$ and a small initialization $\epsilon$ significantly delays the transition from the lazy to the rich regime, due to $T_{\rm II},T_{\rm III}\sim L\log(1/\epsilon)$. Our insight provides a valuable perspective on the trade-off between achieving high performance and maintaining efficient training, encouraging a reconsideration of the relationship between initialization size, context length, and training efficiency.

References:

[1] Elhage, Nelson, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell et al. A mathematical framework for transformer circuits. Transformer Circuits Thread 1, no. 1 (2021): 12.

[2] Chen, Siyu, Heejune Sheen, Tianhao Wang, and Zhuoran Yang. Unveiling induction heads: Provable training dynamics and feature learning in transformers. arXiv preprint arXiv:2409.10559, 2024.

[3] Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. arXiv preprint arXiv:2310.05249, 2023.

[4] Woodworth et al. Kernel and Rich Regimes in Overparametrized Models. COLT 2020.

[5] Zhang et al. Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing. NeurIPS 2024.

Thank you again for your valuable comments. We hope that our responses have resolved your concerns. We sincerely hope that you could give our work a reconsideration.

Comment

W2. Suggestion on the experiments on widely-used models. "Given the simplified setup and use of synthetic toy examples, I have reservations about the generalizability of these findings for interpreting real-world transformers. I would suggest that the authors conduct extensive empirical experiments on widely-used models to validate the applicability and robustness of their theoretical results."

A2. Thank you for your constructive suggestion. In response, we have conducted three new experiments to further validate the applicability and robustness of our theoretical results regarding optimization dynamics in real-world settings:

  • Experiment with real-world transformers on natural language dataset. Following the reviewer's suggestion, we train a two-layer two-head standard transformer (without any simplification used in our theory) on the wikitext-2 dataset. The numerical results are shown in Figure 3 (please click this link to check the figure). Notably, these results closely mirror the behavior observed in our toy model (Figure 2): the loss exhibits a clear plateau, the positional encodings $p$ are learned first, and the dot-product structures $W_K, W_Q$ are learned slowly at the beginning, resembling an exponential increase. As $W_K, W_Q$ are learned, the loss escapes the plateau. This experiment provides strong empirical support for our theoretical insights regarding the time-scale separation between the learning of positional encoding and the dot-product structures.

  • Experiment on Adam in high-dimensional toy setting. Since our theoretical analysis is based on Gradient Flow (GF), we extended our experiments to the widely-used Adam optimizer in a high-dimensional toy setting. The results, shown in Figure 5 (please click this link to check the figure), reveal that while Adam eventually transitions from the lazy to the rich regime, this transition is challenging, and Adam exhibits multiple plateaus during the learning process, which aligns with our theory and the previous experiment on GF.

  • Experiment on discrete token distribution in toy setting. Recognizing that real-world inputs are often discrete (e.g., tokens), we conducted a new experiment using discrete inputs to further validate our results. In this experiment, we replace the Gaussian inputs of the experiment in Figure 2 with Boolean inputs ($x_i\in\{\pm 1\}$), and the results are presented in Figure 4 (please click this link to check the figure). The findings reveal extremely similar behavior to that observed with Gaussian inputs in Figure 2, including the four-phase dynamics and the transition from the 4-gram to the induction head mechanism. This experiment highlights the robustness of our theoretical results to input distribution changes.

Comment

We thank the reviewer for their great effort in reviewing our manuscript and for appreciating our novelty and contributions.

If you have any remaining questions or feel that our responses do not fully address your concerns, please let us know, and we will promptly provide an updated response.

W1: Concerns about the simplified model setup. "This work provides an analysis of how transformers implement induction heads, approaching the problem from both approximation and optimization perspectives. The theoretical results appear rigorous; however, the model setup seems overly simplified...."

A1: Thank you for your insightful question.

  • Clarification on approximation results. We would like to clarify that the results in Section 4 regarding approximation are not based on simplified models. Instead, by examining how transformers implement both vanilla and generalized induction heads, we uncover the distinct roles of key modules in standard Transformers, including multi-head attention, positional encoding, the dot-product structure, and FFNs (as mentioned by the reviewer).

  • Clarification on optimization results.

    • Reasonable simplification of the FFN to reveal the transition (from the lazy to the rich regime). In our optimization analysis, our primary goal is to address the open theoretical question of why learning undergoes a sharp transition from the $n$-gram to the (vanilla) induction head mechanism. As demonstrated in the seminal work [1] and our Theorem 4.1, expressing the vanilla induction head does not require FFNs, which is why we did not include them in our model. Notably, the optimization dynamics of Transformers are highly complex, and this reasonable simplification lets us focus on our central problem: the transition from the $n$-gram to the (vanilla) induction head mechanism. Moreover, while [2] studies how Transformers learn a generalized induction head and therefore does not simplify the FFNs (in order to maintain the necessary nonlinearity), our study has a different focus, which allows us to simplify the FFNs in our analysis.

    • Crucially, no simplification of the quadratic structure $W_K^\top W_Q$: In contrast to previous works such as [2] (and others, e.g., [3]), which simplify the quadratic structure $W_K^\top W_Q$ by reparameterizing it as a single parameter $a = W_K^\top W_Q$, we retain the quadratic structure in our model. It is important to note that the quadratic structure $W_K^\top W_Q$ leads to a clear time-scale separation from the linear structure of the positional encoding, as detailed on line 505. This separation is crucial for explaining the sharp transition from the lazy regime ($n$-gram) to the rich regime (induction head). Furthermore, if we also simplified the quadratic structure to $a = W_K^\top W_Q$ as in [2], this time-scale separation would no longer exist. (A schematic gradient-flow calculation illustrating this separation is given after this list.)
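
To make this point concrete, here is a schematic one-dimensional gradient-flow calculation. It is a deliberate caricature of our actual analysis: the scalars $w_K, w_Q, p$ stand in for the corresponding matrices and positional-encoding vectors, and we pretend that near the small initialization the loss gradient with respect to the attention logit $a = w_K w_Q$ is a constant $-c$ (with $c > 0$) and the gradient with respect to $p$ is a constant $-g_p$. Gradient flow then reads

$$
\dot{w}_K = c\, w_Q, \qquad \dot{w}_Q = c\, w_K, \qquad \dot{p} = g_p .
$$

With a symmetric small initialization $w_K(0) = w_Q(0) = \epsilon$, we get $w_K(t) = w_Q(t) = \epsilon\, e^{c t}$ and hence

$$
a(t) = w_K(t)\, w_Q(t) = \epsilon^2 e^{2 c t},
$$

so the logit stays close to zero for a time of order $\tfrac{1}{c}\log\tfrac{1}{\epsilon}$ before growing rapidly (a plateau followed by a sharp escape), whereas $p(t) = p(0) + g_p t$ moves on an $O(1)$ time scale from the very start. If one instead treats $a$ itself as a single trainable parameter, then $\dot{a} = c$ is also $O(1)$, and this time-scale separation disappears.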

Comment

Dear Reviewer QPkT,

We hope that our responses adequately address your concerns. We warmly welcome further discussion on any remaining points, and we sincerely hope you will reconsider the rating accordingly.

Thank you once again for the time and effort that you have dedicated to our work.

Best regards,

Authors of submission 9368

Comment

Thanks for your clarifications. I've raised my score.

Comment

We would like to express our sincere gratitude for your valuable recommendation and positive feedback. We are pleased to have addressed your main concern and appreciate your decision to raise the score. Thank you!

Comment

Dear AC and reviewers,

We have finalized the revision of our paper and uploaded it to the OpenReview system. All changes are highlighted in red in the revised manuscript. We sincerely appreciate the reviewers’ valuable feedback and constructive suggestions, which have significantly improved the quality and presentation of our work. We are happy to provide further clarifications or address additional concerns.

Best regards,

The Authors of Submission 9368

Comment

Dear AC and reviewers,

We appreciate the great efforts of each reviewer in the reviewing process. We are encouraged that all reviewers acknowledged the soundness (4, 3, 3, 3, 3) of our theoretical results.

Specifically, the significance of studying this topic was affirmed by Reviewers [QPkT, KJ49, fF9s, A68m]; the rigor and clarity of our theoretical analysis were commended by Reviewers [QPkT, KJ49, 9XQk]; the theoretical insights provided by the approximation analysis were acknowledged by Reviewers [QPkT, fF9s]; and the novelty and interest of our optimization analysis of the multi-phase dynamics and sharp phase transition were appreciated by Reviewers [KJ49, 9XQk, A68m].

We have individually responded to each reviewer, trying our best to address their concerns. Here, we provide a summary of our responses to a common concern of all the reviewers -- the need for more experiments.

New experiments: To further support our optimization and approximation results, as suggested by the reviewers, we have conducted five new experiments during the discussion phase:

  • Supporting optimization dynamics:

    1. Experiment with real-world transformers on a natural language dataset. We train a two-layer, two-head standard transformer (without any of the simplifications used in our theory) on the wikitext-2 dataset. The numerical results are shown in Figure 3 (please click this link to check the figure). Notably, these results closely mirror the behavior observed in our toy model (Figure 2): the loss exhibits a clear plateau, the positional encodings $p$ are learned first, and the dot-product structures $W_K, W_Q$ are learned slowly at the beginning, resembling an exponential increase. As $W_K, W_Q$ are learned, the loss escapes that plateau. This experiment provides strong empirical support for our theoretical insight regarding the time-scale separation between the learning of the positional encoding and of the dot-product structure.

    2. Experiment with a discrete token distribution in the toy setting. Recognizing that real-world inputs are often discrete (e.g., tokens), we conducted a new experiment using discrete inputs to further validate our results. In this experiment, we replace the Gaussian inputs of the experiment in Figure 2 with Boolean inputs ($x_i \in \{\pm 1\}$), and the results are presented in Figure 4 (please click this link to check the figure). The learning process behaves extremely similarly to that observed with Gaussian inputs in Figure 2, including the four-phase dynamics and the sharp transition from the 4-gram to the induction head mechanism. This experiment highlights the robustness of our theoretical results to changes in the input distribution.

    3. Experiment with Adam in a high-dimensional toy setting. Since our theoretical analysis is based on gradient flow (GF), we extended our experiments to the widely-used Adam optimizer in a high-dimensional toy setting. The results, shown in Figure 5 (please click this link to check the figure), reveal that while Adam eventually transitions from the lazy to the rich regime, the transition is challenging, and Adam exhibits multiple plateaus while learning induction heads, which aligns with our theory and the previous GF experiment.

  • Supporting approximation rates:

    4. Experiment supporting the construction in Theorem 4.3. We have conducted a new experiment, and the results are shown in Figure 7 (please click this link to check the figure). These results support our key construction in Theorem 4.3: the first layer is responsible for extracting the local semantic information $X_{s-n+2:s}$ near each $x_s$, and the second layer produces the final output. (An informal schematic of this construction is given after this list.)

    5. Experiment supporting the required $H$ and $D$ in Theorem 4.3. The results, shown in Figure 6 (please click this link to check the figure), demonstrate that the sufficient conditions on $H$ and $D$ in Theorem 4.3 are also nearly necessary.
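
For convenience, here is an informal schematic of the two-layer construction referred to in item 4 (the precise statement, including how $H$ and $D$ enter, is Theorem 4.3 in the paper; the display below is only meant to convey the roles of the two layers).

Layer 1 (positional attention) gathers the local window around each position:
$$
h_s \;\approx\; \big(x_{s-n+2}, \dots, x_{s-1}, x_s\big) \;=\; X_{s-n+2:s}.
$$

Layer 2 (dot-product attention) compares these windows across positions: the query–key score $\langle W_Q h_s, W_K h_t \rangle$ is large precisely when the past window $X_{t-n+2:t}$ matches the current window $X_{s-n+2:s}$, so attention concentrates on such positions $t$, and the value pathway outputs (a weighted average of) the subsequent tokens $x_{t+1}$, which is the induction-head prediction.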

We have included all additional experimental results in our revised manuscript.

Best regards,

Authors of submission 9368

AC Meta-Review

The paper analyzes the implementation and learning of induction heads in simplified two-layer transformer architectures. The authors study both the expressivity of transformers in implementing generalized induction heads and the optimization dynamics under gradient flow.

Strengths:

  • The paper provides rigorous theoretical results with clear proofs, offering insights into the expressivity and optimization of transformers.
  • The analysis of learning dynamics highlights key phases in learning n-grams and induction mechanisms.

Weaknesses:

  • The expressivity results are viewed as unsurprising, as the findings align with existing intuition and prior work (e.g., Bietti 2024, Rajaraman et al., 2024).
  • The optimization analysis is conducted on an overly simplified model with limited parameters, raising concerns about its relevance to real-world transformers.
  • The results lack empirical validation on larger models, making it unclear whether the findings generalize beyond toy settings.
  • The paper does not provide practical implications or connections to in-context learning, limiting its significance.

Following detailed discussions among reviewers and the AC, a consensus was reached to reject the paper due to these limitations in novelty, scope, and generalization.

Additional Comments from Reviewer Discussion

During the discussion, reviewers raised concerns about the unsurprising nature of the expressivity results and the relevance of the optimization analysis, given its highly simplified setting. The authors clarified aspects of their bounds and results but acknowledged limitations such as exponential dependencies and the lack of a lower bound. While the paper offers sound theoretical insights, the consensus after thorough discussions between reviewers and the Area Chair was to reject, primarily due to limited novelty and generalization of findings.

Final Decision

Reject