PaperHub

Rating: 6.0/10 · Rejected · 4 reviewers
Scores: 5, 6, 5, 8 (min 5, max 8, std 1.2)
Confidence: 4.3 · Correctness: 3.3 · Contribution: 2.3 · Presentation: 2.8
ICLR 2025

On the Training Convergence of Transformers for In-Context Classification

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

Transformers trained via gradient descent can provably perform in-context classification of Gaussian mixtures.

Keywords

In-context learning, Transformer

Reviews and Discussion

Official Review (Rating: 5)

Yet another ICL theory paper that does not study actual ICL.

This paper, like many previous works, studies the training dynamics of a simple (single-layer, linear) transformer model trained using an ICL objective. They work with Gaussian data to study the training convergence rates and the impact of prompt length on error. With these settings, they find the rather unamusing results of linear convergence rates, and Bayes-optimal behavior with asymptotic prompt lengths. No connection is made to emergent ICL in real LLMs whatsoever. The authors should look at this recent ICML position paper to find the distinction between the two.

Disclaimer: I have not read the proofs in detail to verify that they are correct (hence this review is rated at confidence level 4). My review is based on the assumption that the proofs are correct.

Strengths

  • The relation between prompt length and error is somewhat interesting because, as far as I know, previous works studying the meta-learning capabilities of the transformer model (misnamed as ICL) do not talk about this learnability constraint for open-ended problems. However, other works have looked at something similar [link].
  • The other claim about being the first to study multi-class classification may be true; its significance is unclear.

Weaknesses

  • The setting is too unrealistic to say anything about real ICL. For example,
    • Training on hard-coded ICL prompts, when LLMs are trained on next-word prediction (ICL structure is generally not present in the pretraining corpus). This is a major setup difference which makes them incompatible.
    • Studying single-layer transformers with no non-linear activation functions. This is a good intellectual curiosity, but its relevance and usefulness in understanding ICL remain unclear (even classic deep learning theory struggles to present useful insights by studying 2-layer networks). In this paper itself, we see a deviation from expectations when a 3-layer GPT-2 architecture with softmax is tested (Section 5.2).
    • Gaussian data that presents fixed one-token length inputs and outputs. I don't have a problem with Gaussian data, but the framework should be flexible enough to even somewhat resemble real ICL (where the inputs and outputs both can be variable lengths).

Questions

  • Binary classification is a special case of the multi-class classification. Why write up both?
  • ICL in LLMs is a type of domain adaptation setting, where knowledge about a particular task is scarce in the pre-training corpus. The model needs to be "kindled" with ICL demos to get it to perform better on this task. In contrast, the presented theory first trains the transformers on samples of these tasks, and then requires this data to be "properly distributed" over the space of variations. It is not surprising that the presented results hold in this setting. How do you think the setup needs to change to reflect that realistic ICL in LLMs setting?
  • With respect to the following:

probably because our transformer models were only trained with a small prompt length of N = 100.

A 3-layer GPT-2 seems like a small model; why not test with a higher N?

Comment

**References:**

[1] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. arXiv preprint arXiv:2306.09927, 2023a.

[2] Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. arXiv preprint arXiv:2310.05249, 2023.

[3] Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, and Pin-Yu Chen. Training nonlinear transformers for efficient in-context learning: A theoretical learning and generalization analysis. arXiv preprint arXiv:2402.15607, 2024.

[4] Siyu Chen, Heejune Sheen, Tianhao Wang, and Zhuoran Yang. Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality. arXiv preprint arXiv:2402.19442, 2024.

[5] David Samuel. BERTs are Generative In-Context Learners. arXiv preprint arXiv:2406.04823, 2024.

[6] Ivan Lee, Nan Jiang, and Taylor Berg-Kirkpatrick. Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability. arXiv preprint arXiv:2310.08049, 2023.

[7] Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Advances in neural information processing systems, 36, 2024.

Comment

My central point remains:

There is no experimental evidence from real LLMs that aligns with this theory. Neither does the theory elicit any experiments that could be tested on real LLMs. The ICL used in LLMs is different from this meta-learning capability of the transformer architecture, which has been studied by many prior works (as referenced above) and this work. Moreover, the authors failed to recognize the cited reference of Hahn et al., who also formulated a learnability bound on learning from demonstrations, very similar to this work.

For me to justify this paper, I need to see something that this theory of 'ICL' predicts about real LLMs that can be tested and verified on them. Due to its absence, I will keep my score.

Comment

**General response:** We thank the reviewer for the comment, and feel that further clarification is needed. This is not a paper about LLMs. In our main paper, we mentioned LLMs only once, and the purpose there was to highlight the importance of the transformer architecture. Your suggestion requires us to change the subject of this work to focus on ICL for LLMs, which itself is, admittedly, an extremely important research area but is not what we focused on in this paper.

Additionally, we would like to clarify that we are indeed studying ICL for transformers. We have provided a rigorous definition of ICL in Section 2.2, which is also the widely accepted definition in the academic community [1-4,6-8]. Finally, in the experiments, we also consciously chose to focus on single-layer and multi-layer transformers, not LLMs. Our experimental results clearly corroborated our theoretical claims and showed that some of the insights we obtained from the single-layer model also hold for more complex real-world multi-layer transformers. To summarize, our paper is not about LLMs -- it is about the theoretical understanding of the transformer architecture. As a result, we believe that lacking "real LLMs" should not be viewed as a shortcoming of this theoretical work.

The ICL used in LLMs is different from this meta-learning capability of the transformer architecture, which has been studied by many prior works (as referenced above) and this work.

**Reply:** The reviewer seems to suggest that what is studied in our paper should be called "meta-learning" instead of "ICL" for transformers. (In the previous comment, you also mentioned "previous works studying the meta-learning capabilities of the transformer model (misnamed as ICL)".) If our understanding is correct, we respectfully disagree with this viewpoint. The studied ICL in our paper has a clear and rigorous definition in Section 2.2. To the best of our knowledge, meta-learning focuses on "learning to learn" by training a model to quickly adapt to new tasks across different domains, while in-context learning focuses on adapting a model to a specific task by providing relevant context within the input itself, without explicit retraining. Of course, we understand that the definition of these concepts may vary, which is why we gave a clear definition of ICL for transformers in Section 2.2, to establish a common ground for understanding. Also, this definition of ICL for transformers has been widely used and accepted in the research community; see [1-4,6-8].

Moreover, the authors failed to recognize the cited reference of Hahn et al, who also formulated a learnability bound on learning from demonstrations, very similar to this work.

**Reply:** Thank you for providing this interesting paper [9]. We added the citation and a discussion of [9] in Section B of our revised paper. However, we need to clarify that our work and our results are very different from those in [9]. First, [9] only provided ICL guarantees for an idealized predictor, which is not a predictor of actual transformers or LLMs, and they also did not mention how an actual transformer or LLM can be trained to represent this idealized predictor. In contrast, in our paper, we study the training dynamics of a single-layer transformer, show that this transformer can be trained to an optimal model, and establish the relation between the inference error and the training and test prompt lengths. Second, [9] studied ICL with data generated by a Compositional Attribute Grammar (CAG), while we studied the ICL of classification of Gaussian mixtures. The contexts, tasks, and proof techniques in these two papers are totally different. Thus, our paper has its own independent contributions and intellectual merit.

**References:**

[8] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.

[9] Michael Hahn, Navin Goyal. A Theory of Emergent In-Context Learning as Implicit Structure Induction. arXiv, 2023

Comment

Dear Reviewer SwXS,

We are eager to know whether our latest response has properly addressed your central concern regarding transformers versus real LLMs. If so, could you please kindly consider increasing your initial score accordingly? Certainly, we are more than happy to answer your further questions.

Thank you for your time and effort in reviewing our work!

Best Regards,

Authors

Comment

I am completely aware of the following:

  1. This paper is not about LLMs.
  2. The widely accepted definition of ICL established in current literature.

But my concern is also regarding the same points. The widely accepted definition does not mean it helps understand ICL in LLMs. The abstract of this paper starts with:

"While transformers have demonstrated impressive capacities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism enabling transformers to perform ICL is still in its infant stage."

This "impressive capability in practice" refers to ICL in LLMs and the second part of the sentence implies that somehow this theoretical study on simple transformers trained with ICL objective will help us explain that.

I am all for theoretical insights and the original few works in the domain that studied properties of transformers trained in this manner were interesting. But the premise of this line of work is that it will somehow help us understand the ICL in LLMs. I ask the authors and Reviewer RKkZ:

What is the end goal of this theoretical work? Is it unreasonable for me to expect that after more than 2 years of this line of work, there would be progress on how to link it to real LLMs? Is it justifiable to build on this "widely accepted" definition of ICL and support theories around it even when there is a clear distinction from the training setup of LLMs? If you train transformers to perform "ICL", it is not that surprising to see them perform "ICL". Is it unreasonable to expect some predictions from this theory that align with LLMs or some experiments that can be verified on LLMs, when the authors motivate this work using ICL in LLMs?

I would have strongly supported this work had it made any effort to align the training setup of its transformers with LLMs and then analyzed it, even with small single-layer transformers; or even an effort to address my concern about the incompatibility of the training setup with technical arguments instead of citing it as "widely accepted". I appreciate the hard work that went into this paper and the rebuttal, but my rating reflects my opinion about this line of work, which may be wasting the ML community's research efforts. I cannot justify the blind acceptance of advancement of theories that are unable to make verifiable predictions in the real world.

Comment

**Weaknesses:** Gaussian data that presents fixed one-token length inputs and outputs. I don't have a problem with Gaussian data, but the framework should be flexible enough to even somewhat resemble real ICL (where the inputs and outputs both can be variable lengths).

**Reply:** In our setting, the prompt length (number of in-context examples) is flexible, and the length of the query and the corresponding output is fixed at one. Most previous papers theoretically studying the ICL of transformers use this setting, e.g., [1-4, 6, 7]. Considering flexible lengths for queries and the corresponding outputs is an interesting problem.

**Question:** Binary classification is a special case of the multi-class classification. Why write up both?

**Reply:** Because binary classification is a relatively simpler case, its analysis is more concise. We use it as an example to better highlight the theoretical results. Moreover, binary classification, as a special case, has a different structure compared to the case of $c=2$ in the multi-class section. In the multi-class section, when $c=2$, the dimension of the embedding matrix is $(d+2)\times(N+1)$, whereas in the binary section it is $(d+1)\times(N+1)$, which is more concise. Therefore, we wrote a separate section dedicated to binary classification.
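For intuition only: a common embedding layout in this line of work, consistent with the $(d+1)\times(N+1)$ shape stated above, stacks the $N$ labeled examples and the query column-wise. This is a sketch of a standard construction, not necessarily the paper's exact one:

$$
E=\begin{pmatrix} x_1 & x_2 & \cdots & x_N & x_{query}\\ y_1 & y_2 & \cdots & y_N & 0 \end{pmatrix}\in\mathbb{R}^{(d+1)\times(N+1)},
$$

with scalar labels in the binary case; replacing the label row by a $c$-dimensional one-hot block would give a $(d+c)\times(N+1)$ matrix, matching the $(d+2)\times(N+1)$ shape quoted for $c=2$.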

**Question:** ICL in LLMs is a type of domain adaptation setting, where knowledge about a particular task is scarce in the pre-training corpus. The model needs to be "kindled" with ICL demos to get it to perform better on this task. In contrast, the presented theory first trains the transformers on samples of these tasks, and then requires this data to be "properly distributed" over the space of variations. It is not surprising that the presented results hold in this setting. How do you think the setup needs to change to reflect that realistic ICL in LLMs setting?

**Reply:** We agree that ICL in LLMs can be a type of domain adaptation setting, and the knowledge about a particular task can be scarce during the pre-training of LLMs. However, in this paper, we mainly focus on the theoretical study of the ICL capability of transformers, which is not necessarily a type of domain adaptation setting. For example, many previous papers studying the ICL of transformers [1-4, 6, 7] all considered pre-training transformers on ICL tasks and then testing transformers on the same set of tasks. Moreover, in our setting, the data distributions during pre-training and testing are different. For example, in the binary case, during pre-training, $\mu_{\tau,0}$, $\mu_{\tau,1}$, and $x_{\tau,query}$ are sampled according to the specific distributions $P^b_\Omega(\Lambda)$ and $P_x^b(\mu_{\tau,0}, \mu_{\tau,1}, \Lambda)$. However, when testing, $\mu_0$ and $\mu_1$ can be arbitrary vectors that satisfy Assumption 3.2, and our $x_{query}$ in testing can be an arbitrary $d$-dimensional vector. Thus, for a particular task with $\mu_0$, $\mu_1$, and $x_{query}$, the corresponding probability during training can be arbitrarily small, which reflects the property -- as you put it -- in realistic ICL that the "knowledge about a particular task is scarce in the pre-training corpus". One interesting change to the setting is pre-training transformers with different types of tasks, such as pre-training with both in-context linear regression tasks and in-context classification tasks. We think it is an interesting direction for future research.
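To make this sampling setup concrete, here is a minimal NumPy sketch of how one pre-training prompt could be drawn for the binary case. The function name, dimensions, identity covariance, and the standard-normal draw of the task means are illustrative assumptions, not the paper's exact $P^b_\Omega(\Lambda)$ and $P_x^b$:

```python
import numpy as np

def sample_binary_icl_prompt(d=8, N=100, rng=None):
    """Draw one illustrative prompt for in-context binary classification of a
    two-component Gaussian mixture (a sketch, not the paper's exact setup)."""
    rng = np.random.default_rng() if rng is None else rng
    Lam = np.eye(d)  # shared covariance, fixed across tasks in this sketch

    # Task-specific means, rescaled so both have the same Lambda^{-1}-weighted
    # norm (with Lam = I this is the Euclidean norm), mimicking condition (2).
    mu0 = rng.standard_normal(d)
    mu1 = rng.standard_normal(d)
    mu1 *= np.linalg.norm(mu0) / np.linalg.norm(mu1)

    # N labeled in-context examples drawn from the mixture.
    labels = rng.integers(0, 2, size=N)
    centers = np.where(labels[:, None] == 1, mu1, mu0)
    X = centers + rng.multivariate_normal(np.zeros(d), Lam, size=N)

    # The query comes from the same mixture; its label is what the model must infer.
    y_query = int(rng.integers(0, 2))
    x_query = (mu1 if y_query == 1 else mu0) + rng.multivariate_normal(np.zeros(d), Lam)
    return X, labels, x_query, y_query

if __name__ == "__main__":
    X, y, xq, yq = sample_binary_icl_prompt()
    print(X.shape, y.shape, xq.shape, yq)  # (100, 8) (100,) (8,) 0 or 1
```

At test time, the same routine could be reused with arbitrary means satisfying Assumption 3.2, which is where the train/test distribution mismatch described in the reply comes from.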

**Question:** With respect to the following: "probably because our transformer models were only trained with a small prompt length of N = 100." A 3-layer GPT-2 seems like a small model; why not test with a higher N?

**Reply:** In our revised paper, we have conducted experiments on a 3-layer encoder-only transformer with softmax attention and without positional encoding. This setting is close to our theoretical analysis, and as we can see from Figures 1, 2, and 3, the performance of this 3-layer encoder-only transformer is very similar to the single-layer transformer we theoretically studied. As for the original question, we have explained in the original version of the paper right after the sentence you quote: "Similar declined performance when the training prompt length is smaller than the test prompt length has also been observed for in-context linear regression tasks; see e.g. [1]". Similar situations have been widely observed in many places. For example, you can find similar and significant performance degradations in Figure 1 in [1] and Figures 1, 5, 6 in [6]. In our revised paper, we used an encoder-only transformer without positional encoding in the experiment. We note that such performance degradation does not happen, and the inference error of this model decreases as the test prompt length ($M$) increases, proving that some of the insights we obtained from the simplified models also hold for more complex multi-layer non-linear transformers.

Comment

**General response:** We need to first clarify that the main focus of this paper is to study the ICL capability of transformers, not the ICL capability of LLMs. We agree that the ICL capability of LLMs is remarkable and that studying the ICL ability of LLMs is an interesting and important problem. However, since the theoretical study of the ICL ability of LLMs is much more complex than that of the transformer and is still at a preliminary stage, in this paper we focus on studying the ICL ability of (simple) transformers, which usually serve as the foundational architectures of most LLMs. In fact, even the theoretical study of the ICL of basic transformers is at an infant stage, in the sense that most existing papers, including ours, have to consider some simplified models to make progress. For example, in this paper, we focus on an encoder-only single-layer transformer and, to the best of our knowledge, all existing theoretical studies on the training dynamics of transformers focus only on encoder-only 1-layer transformers [1-4]. Even though we have studied a simplified model in this paper, our newly derived results in the revised paper demonstrate that multi-layer non-linear transformers can exhibit many behaviors similar to the simplified model (see Figures 1, 2, and 3 of the revised paper). Some of the insights we obtained from this simplified model also hold for more complex multi-layer non-linear transformers, which indicates that studying this simplified model can help us better understand the ICL abilities of transformers adopted in the real world.

**Weaknesses:** Training on hard-coded ICL prompts, when LLMs are trained on next-word prediction (ICL structure is generally not present in the pretraining corpus). This is a major setup difference which makes them incompatible.

**Reply:** Thank you for your question. Yes, most LLMs are based on decoder-only transformers and are trained on next-word prediction. However, there are also language models, such as BERT, that are based on encoder-only transformers, and many prior papers [5-7] also showed that encoder-only transformers can exhibit remarkable ICL abilities. Moreover, in [6], for many ICL tasks tested in their paper, encoder-only and decoder-only transformers exhibit similar performance. Thus, to simplify the analysis, in this paper, we focus on encoder-only transformers. Moreover, to the best of our knowledge, all existing theoretical studies on the training dynamics of transformers focus only on encoder-only transformers, e.g., [1-4]. We agree that studying the ICL abilities of decoder-only transformers trained on next-word prediction is also an interesting and important problem. We leave it for future research.

**Weaknesses:** Studying single-layer transformers with no non-linear activation functions. This is a good intellectual curiosity, but its relevance and usefulness in understanding ICL remain unclear (even classic deep learning theory struggles to present useful insights by studying 2-layer networks). In this paper itself, we see a deviation from expectations when a 3-layer GPT-2 architecture with softmax is tested (Section 5.2).

**Reply:** We just added and revised our experimental results in the revised paper. We conducted experiments on a 3-layer encoder-only transformer with softmax attention. You can find Figures 1, 2, and 3 in our revised version. From Figure 1, we can see that the real-world multi-layer transformers and the single-layer transformers we study actually exhibit many similarities in performance. For example, from Figure 1, we can see that both models' ICL inference errors decrease as training prompt length ($N$) and test prompt length ($M$) increase, and increase as the number of Gaussian mixtures ($c$) increases. This indicates that some of our insights obtained from studying this simplified model may hold for transformers with more complex structures, and studying this simplified model can help us have a better understanding of the ICL abilities of complex transformers. Moreover, to the best of our knowledge, all existing theoretical studies on the training dynamics of transformers focus only on single-layer transformers, e.g., [1-4]. We agree that studying the ICL abilities of multi-layer transformers is also an interesting and important problem.

Comment

Dear Reviewer SwXS,

As a fellow reviewer for this paper, I disagree with your central point about the "real LLMs". As the authors claim and write in the title, abstract, and main text, this work studies Transformers, not LLMs. In this regard, it does not even need to study language, so it is reasonable that the usage of ICL differs from that in LLMs. I would argue for acceptance of the work on its own merit and contributions to the interpretability and theory community, beyond its closeness to "real LLMs". I hope you could also reconsider your rating. Thanks!

Reviewer RKkZ

Comment

re: "impressive capability in practice" refers to ICL in LLMs

I would argue that it does not necessarily refer only to LLMs. For example, Garg et al. (2022) studied transformers' ICL capabilities to fit function classes, such as linear regression, 2-layer ReLU networks, random forests, etc. This work sparked a line of research on the ICL capabilities of Transformers on abstract mathematical/statistical tasks that are no longer related to language.

re: What is the end goal of this theoretical work

I think this is a common question to ask of every theory work, and I'm glad Reviewer SwXS also pointed it out. In my own opinion, studying how Transformers learn in context, and how they achieve such abilities during training and optimization, is crucial for understanding the trustworthiness of how the Transformer architecture could be used as a universal computer. For example, Giannou et al. (2023) showed that a Transformer variant with looping could be used as a programmable computer, in the expressivity sense. Overall, people want to understand the algorithmic abilities of transformers and how training can lead to these expressivity results. Similarly, many real-world LLM works also build on top of the assumption that Transformers are able to optimize context. For example, one line of research argues that transformers allow in-context reinforcement learning (Monea et al., 2024). This also assumes that transformers could implicitly implement some algorithm or heuristics to solve a task. Again, it's scientific to understand how.

Admittedly, theory and practice sometimes branch away from each other, and the gap could be getting larger, but I still think these studies are meaningful for combating the overhype in the field and for finding scientific explanations and solutions for security and AI safety. I would argue that, in order to build truly safe AI systems, we need to understand how they work and how they gain their abilities. I personally feel this work could be beneficial for that. Let me know what you think.

References

Garg, Shivam, Dimitris Tsipras, Percy S. Liang, and Gregory Valiant. "What can transformers learn in-context? a case study of simple function classes." Advances in Neural Information Processing Systems 35 (2022): 30583-30598.

Giannou, Angeliki, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. "Looped transformers as programmable computers." In International Conference on Machine Learning, pp. 11398-11442. PMLR, 2023.

Monea, Giovanni, Antoine Bosselut, Kianté Brantley, and Yoav Artzi. "LLMs Are In-Context Reinforcement Learners." arXiv preprint arXiv:2410.05362 (2024).

Comment

I agree that this work can have influence on architecture design and hence be meaningful. (reference)

Nonetheless, it should not be posed as anything remotely related to the real ICL which emerges in LLMs without training for it.

I disagree that this “impressive capability in practice” could mean the ICL studied by Garg et al. “In practice” cannot mean small transformers used to fit linear models, as they are useless by themselves. That paper itself says that they are studying toy transformers with this different setup, mistakenly calling it ICL and presenting it as a way to understand ICL in LLMs. Even in the related work section of this paper, ICL from the GPT-3 paper (the original definition of ICL) is discussed. The current framing of the contribution very much implies a relationship between the overloaded uses of the ICL term and should have been avoided in the first place.

I will increase my score to a 5 to reflect the potential architectural benefits that may stem from this theoretical study. However, I will still reject this paper on the basis of misleading framing of contribution (understanding ICL) and no real effort in justifying it technically apart from saying that it is widely accepted.

Comment

Thank you for increasing the score and recognizing our contributions! We would like to add further clarification regarding the comment that the "impressive capability in practice" refers to ICL in LLMs. As Reviewer RKkZ mentioned, ICL capabilities do not necessarily refer to LLMs. For example, Reference [8] empirically showed that transformers have the ICL capability to fit function classes. Reference [6] empirically examined and compared the ICL capabilities of different models (CNN, RNN, transformer models, etc.) for various tasks, including linear regression, multiclass classification of Gaussian mixtures, image classification, and language modeling. Both empirical works [6,8] studied the specific ICL abilities of transformers trained with the corresponding tasks, and the primary motivation of our paper is to provide theoretical explanations for those empirical observations of the ICL abilities of transformers (in [6, 8] and also in the experimental results of our paper). Our experimental results on single/multi-layer transformers also corroborate our theoretical claims. We hope this paper can provide valuable insights into the theoretical understanding of the ICL mechanisms of transformers. Those insights may be helpful for potential architectural design (as Reviewer SwXS suggested) and building safe AI systems (as Reviewer RKkZ indicated).

Thanks again to Reviewers SwXS and RKkZ for providing these valuable discussions and feedback.

**References:**

[6] Ivan Lee, Nan Jiang, and Taylor Berg-Kirkpatrick. Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability. ICLR 2024

[8] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.

Official Review (Rating: 6)

This work studied the training dynamics of a one-layer linear transformer trained via GD for in-context multi-class classification tasks. They established convergence guarantees for in-context training and also provided an in-context inference error bound, which scales with the length of both training and testing prompts.

Strengths

  1. This work is the first to examine transformers for in-context multi-class classification from a training dynamics perspective.
  2. The end-to-end connection between in-context inference and training offers a novel insight.
  3. Experimental results are provided to support theoretical statements.

Weaknesses

My major concern lies in the analytical novelty of this paper compared to prior work on in-context regression [1]. While this study focuses on the multi-class classification problem, its model and analytical approach appear to share many similarities with [1]. It remains unclear how technically straightforward it is to generalize the results of [1] to multi-classification. Additionally, this paper restricts itself to the linear attention setting, simplifying the analysis and making it somewhat less impactful than [2], which addresses binary classification with strict data assumptions but in the more realistic softmax attention setting. Therefore, a thorough discussion clarifying the technical distinctions and contributions of this work relative to these previous studies would be helpful.

[1] Trained Transformers Learn Linear Models In-Context. Zhang et al., 2023

[2] Training nonlinear transformers for efficient in-context learning: A theoretical learning and generalization analysis. Li et al., 2024

Questions

  1. For data distribution, why is it essential to preserve the inner product of vectors in the $\Lambda^{-1}$-weighted norm? Is this primarily a technical consideration? It would be helpful if the authors could provide further clarification on the role of data distribution in the analysis.

  2. For the inference stage, while $\mu_0$ and $\mu_1$ are not subject to additional constraints, $\Lambda$ remains fixed, imposing a strong assumption on the underlying structure of the data distribution. Do the authors have insights on how these results might extend to scenarios with a varying $\Lambda$ during inference?

  3. The derived inference bound scales with $N$ and $M$ similarly to in-context regression [1]. Could the authors clarify the distinctive aspects of the multi-classification setting in this context? (This also points to the weakness.)

  4. For the multi-classification setting, what is the order of the number of classes $c$? On line 431, the authors mention that $c$ is treated as a constant coefficient; would a larger order of $c$ impact the analysis?

[1] Trained Transformers Learn Linear Models In-Context. Zhang et al., 2023

Comment

**Weakness:** My major concern lies in the analytical novelty of this paper compared to prior work on in-context regression [1]. While this study focuses on the multi-class classification problem, its model and analytical approach appear to share many similarities with [1]. It remains unclear how technically straightforward it is to generalize the results of [1] to multi-classification. Additionally, this paper restricts itself to the linear attention setting, simplifying the analysis and making it somewhat less impactful than [2], which addresses binary classification with strict data assumptions but in the more realistic softmax attention setting. Therefore, a thorough discussion clarifying the technical distinctions and contributions of this work relative to these previous studies would be helpful.

**Reply:** Our technique is different from those used in [1, 2]. In [1], the globally optimal solution (i.e., the parameters of the transformer) has a closed-form expression, and they proved that the 1-layer transformer optimized via gradient flow can converge to this closed-form globally optimal solution. However, in our setting, due to the high non-linearity of our loss function, the global minimizer does not have a closed-form expression. Instead, by analyzing the Taylor expansion near the global minimizer, we prove that the global minimizer consists of a constant plus an error term that is induced by the finite training prompt length ($N$). We further show that the max norm of this error term is bounded and converges to zero at a rate of $O(1/N)$. Our technical approach to addressing this challenge is new and might be useful in other settings. Moreover, we considered the more practical gradient descent rather than the gradient flow in [1]. In [2], they only studied binary classification tasks with finitely many pairwise orthogonal patterns. They generated their data as $x=\mu_j+\kappa \nu_k$, where $\{\mu_j\}, j=1, 2, ..., M_1$ are in-domain-relevant patterns and $\{\nu_k\}, k=1, 2, ..., M_2$ are in-domain-irrelevant patterns, $M_1\geq M_2$, and these patterns are all pairwise orthogonal. Thus, the possible distributions of their data are finite and highly limited. In contrast, in our work, the data is drawn according to $P^b(\mu_0,\mu_1,\Lambda)$ or $P^m(\mu, \Lambda)$, and the range and possible distributions of our data are infinite. Thus, we considered more general in-context multi-class classification tasks with infinite patterns while [2] only considered in-context classification tasks with finite patterns, thereby highlighting the distinct contributions and independent interest of our work.

**Question:** For data distribution, why is it essential to preserve the inner product of vectors in the $\Lambda^{-1}$-weighted norm? Is this primarily a technical consideration? It would be helpful if the authors could provide further clarification on the role of data distribution in the analysis.

**Reply:** The primary role of condition (2) in Assumption 3.1 is to ensure that $\mu_{\tau,1}$ and $\mu_{\tau,0}$ have the same $\Lambda^{-1}$-weighted norm. If $\mu_{\tau,1}$ and $\mu_{\tau,0}$ have different $\Lambda^{-1}$-weighted norms, then the probability of the ground truth label $y_{\tau,query}$ is $\mathbb{P}(y_{\tau,query}=1)=\sigma\big((\mu_{\tau,1}-\mu_{\tau,0})^\top \Lambda^{-1} x_{\tau,query}+ (\mu_{\tau,1}^\top\Lambda^{-1}\mu_{\tau,1}-\mu_{\tau,0}^\top\Lambda^{-1}\mu_{\tau,0})/2\big)$, and we find it is hard for a 1-layer transformer with linear attention to compute $\mu_{\tau,1}^\top\Lambda^{-1}\mu_{\tau,1}-\mu_{\tau,0}^\top\Lambda^{-1}\mu_{\tau,0}$ in context. However, we found that a 1-layer transformer with linear attention can approximately compute $(\mu_{\tau,1}-\mu_{\tau,0})^\top \Lambda^{-1} x_{\tau,query}$ in context. Thus, we add condition (2) in Assumption 3.1. Moreover, the newly added experimental results (Figure 2) in our revised paper also show the necessity of condition (2). The results in Figure 2 also indicate that transformers with more complex structures are more robust without condition (2). Thus, it is an interesting question whether we can eliminate the need for condition (2) for more complex transformers. We leave it for future research.
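Spelled out with the same quantities as in the reply (nothing new is assumed here): when condition (2) holds, $\mu_{\tau,1}^\top\Lambda^{-1}\mu_{\tau,1}=\mu_{\tau,0}^\top\Lambda^{-1}\mu_{\tau,0}$, so the quadratic term cancels and

$$
\mathbb{P}(y_{\tau,query}=1)=\sigma\big((\mu_{\tau,1}-\mu_{\tau,0})^{\top}\Lambda^{-1}x_{\tau,query}\big),
$$

which is exactly the bilinear form the reply says a 1-layer linear-attention transformer can approximately compute in context.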

Comment

**Question:** For the inference stage, while $\mu_0$ and $\mu_1$ are not subject to additional constraints, $\Lambda$ remains fixed, imposing a strong assumption on the underlying structure of the data distribution. Do the authors have insights on how these results might extend to scenarios with a varying $\Lambda$ during inference?

**Reply:** This is a good question. We discuss the situations when Assumption 3.2 does not hold, i.e., a varying $\Lambda$ during inference, in Remarks F.1 and H.1. However, we found that in these situations the 1-layer transformer with sparse-form parameters and linear attention cannot correctly perform the in-context classification. Similar behaviors have also been reported in [1] for in-context linear regression. Moreover, the newly added experimental results (Figure 2) in our revised paper also show the necessity of keeping $\Lambda$ consistent during training and inference. Experimental results in Figure 2 indicate that transformers with more complex structures are more robust to varying covariances. It is an interesting problem for future investigation whether more complex Transformer structures can perform in-context classification with a varying $\Lambda$.

**Question:** The derived inference bound scales with $N$ and $M$ similarly to in-context regression [1]. Could the authors clarify the distinctive aspects of the multi-classification setting in this context? (This also points to the weakness.)

**Reply:** Yes, we derived similar inference bounds to those in [1]. However, compared to [1], we studied different problems in different settings. Moreover, we considered the more practical gradient descent rather than the gradient flow in [1]. The similarity in how the inference bound scales with $N$ and $M$ is intuitive, since in both linear regression and classification of Gaussian mixtures, having more examples generally leads to more accurate results.

**Question:** For the multi-classification setting, what is the order of the number of classes $c$? On line 431, the authors mention that $c$ is treated as a constant coefficient; would a larger order of $c$ impact the analysis?

**Reply:** For the multi-classification setting, we considered $c$ as a fixed constant. The inference error regarding $c$ is $O(c^2N^{-1}+c^{3/2}M^{-1/2})$. Thus, if the number of classes $c$ is large, the models may require a larger $N$ to converge and a larger $M$ to have good inference results. Our experiments in Figure 1(b) and Figure 4(b) also verified our theoretical claims.
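As a back-of-the-envelope illustration of this scaling, treating all hidden constants as 1 (an assumption for illustration only, not the paper's actual constants):

```python
# Illustrative only: the stated O(c^2 N^{-1} + c^{3/2} M^{-1/2}) rate with all
# hidden constants set to 1, to see how a larger c demands larger N and M.
def bound(c: int, N: int, M: int) -> float:
    return c**2 / N + c**1.5 / M**0.5

for c in (2, 5, 10):
    print(f"c={c:2d}: bound(N=100, M=100) = {bound(c, 100, 100):.3f}, "
          f"bound(N=1000, M=10000) = {bound(c, 1000, 10000):.3f}")
```

The first term shrinks linearly in the training prompt length, while the second shrinks only with the square root of the test prompt length, which is why large $M$ matters most when $c$ grows.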

Comment

I appreciate the author's efforts in their response. I have no further concerns and will maintain my score.

Comment

Dear Reviewer kPhs,

We are glad to know that our responses have addressed your concerns. Thank you again for the valuable comments and suggestions!

Best Regards,

Authors

Official Review (Rating: 5)

This paper studies the learning dynamics of transformers for in-context classification of Gaussian mixtures and proves the training convergence of in-context multi-class classification. The authors present three key findings: 1) a proof that a single-layer transformer trained via gradient descent converges to a globally optimal solution at a linear rate; 2) an analysis of how the lengths of training and testing prompts influence the inference error in in-context learning; and 3) evidence that, with sufficiently long training and testing prompts, the predictions of the trained transformer approach those of the Bayes-optimal classifier. Some of these results are validated through experiments.

Strengths

  1. The structure of this paper is very clear.
  2. Analyzing the training dynamics of in-context learning is crucial.
  3. The findings regarding infinite lengths of training prompts (N) and testing prompts (M) are interesting.

Weaknesses

  1. This paper claims to be the first to explore the learning dynamics of transformers for in-context classification of Gaussian mixtures and to prove the convergence of training in multi-class classification. However, I find the significance of this assertion unclear, as the paper lacks sufficient detail. Specifically: 1) many prior works have analyzed in-context learning assuming $x$ comes from Gaussian distributions; what additional insights do the results on Gaussian mixtures provide? 2) Why is extending results from binary to multi-class classification considered essential and non-trivial?

  2. Additionally, I have concerns regarding the techniques used:

  • The introduction of $\tilde{L}$ appears to be a key element in proving Theorem 3.1, but its intuition is unclear, and I'm uncertain how it addresses the challenges posed by the non-linear loss function.
  • The paper heavily relies on Taylor expansion in its proofs, and I question whether this expansion can accurately approximate the original function. More detail is needed on this aspect.

Questions

1. The condition (2) in Assumption 3.1 seems unusual to me. Could the authors provide more clarification on this assumption?

2. Some papers [1,2,3] have highlighted emergent behaviors in the training dynamics of in-context learning. However, this paper asserts that the transformer will converge to its global minimizer at a linear rate, which appears to contradict those findings. Can the authors discuss this further?

[1] In-context learning and induction heads

[2] Breaking through the learning plateaus of in-context learning in Transformer

[3] Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality.

Comment

**Question:** The condition (2) in Assumption 3.1 seems unusual to me. Could the authors provide more clarification on this assumption?

**Reply:** The primary role of condition (2) in Assumption 3.1 is to ensure that $\mu_{\tau,1}$ and $\mu_{\tau,0}$ have the same $\Lambda^{-1}$-weighted norm. If $\mu_{\tau,1}$ and $\mu_{\tau,0}$ have different $\Lambda^{-1}$-weighted norms, then the probability of the ground truth label $y_{\tau,query}$ is $\mathbb{P}(y_{\tau,query}=1)=\sigma\big((\mu_{\tau,1}-\mu_{\tau,0})^\top \Lambda^{-1} x_{\tau,query}+ (\mu_{\tau,1}^\top\Lambda^{-1}\mu_{\tau,1}-\mu_{\tau,0}^\top\Lambda^{-1}\mu_{\tau,0})/2\big)$, and we find it is hard for a 1-layer transformer with linear attention to compute $\mu_{\tau,1}^\top\Lambda^{-1}\mu_{\tau,1}-\mu_{\tau,0}^\top\Lambda^{-1}\mu_{\tau,0}$ in context. However, we found that a 1-layer transformer with linear attention can approximately compute $(\mu_{\tau,1}-\mu_{\tau,0})^\top \Lambda^{-1} x_{\tau,query}$ in context. Thus, we add condition (2) in Assumption 3.1. Moreover, the newly added experimental results (Figure 2) in our revised paper also show the necessity of condition (2). The results in Figure 2 also indicate that transformers with more complex structures are more robust without condition (2). Thus, it is an interesting question whether we can eliminate the need for condition (2) for more complex transformers. We leave it for future research.

**Question:** Some papers [1,2,3] have highlighted emergent behaviors in the training dynamics of in-context learning. However, this paper asserts that the transformer will converge to its global minimizer at a linear rate, which appears to contradict those findings. Can the authors discuss this further?

**Reply:** This is not a contradiction: [1,2,3] studied the in-context learning of transformers with structures and problems different from ours, and the training dynamics of transformers with different structures and for different tasks can be different. Some other papers, such as [5], also proved the linear convergence of transformers for some specific problems. However, all existing theoretical studies on the training dynamics of transformers focus only on single-layer transformers [3-7]. Theoretical understanding of the training dynamics of multi-layer transformers for more complex real-world problems is still unclear and is an interesting direction for future research.

**References:**

[4] Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, and Pin-Yu Chen. Training nonlinear transformers for efficient in-context learning: A theoretical learning and generalization analysis. arXiv preprint arXiv:2402.15607, 2024.

[5] Tong Yang, Yu Huang, Yingbin Liang, and Yuejie Chi. In-context learning with representations: Contextual generalization of trained transformers. arXiv preprint arXiv:2408.10147, 2024

[6] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. arXiv preprint arXiv:2306.09927, 2023a.

[7] Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. arXiv preprint arXiv:2310.05249, 2023.

Comment

Thank you to the authors for their comprehensive response. I appreciate that they have addressed my concerns regarding the Taylor expansion. However, the current version of the paper still does not clearly articulate its value and scope. As a result, I am not yet confident in supporting its acceptance. Therefore, I will maintain my current score.

Comment

Dear Reviewer wwNp,

We sincerely appreciate your helpful comments, and are happy that we have addressed your concerns regarding the Taylor expansion. As for the value and scope, we focus on the theoretical understanding of the training dynamics of transformers for in-context classification. We have the following main results:

  • We proved that a single-layer transformer trained via gradient descent can converge to a globally optimal model at a linear rate for in-context classification of Gaussian mixtures, under certain assumptions.

  • We quantified the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer.

  • Another important result is that, when the lengths of training and testing prompts are sufficiently large, we proved that the trained transformer approaches the Bayes-optimal classifier.

We are more than happy to answer any further questions you may have regarding our paper. Thank you again for the helpful comments!

Best Regards,

Authors

Comment

**Weakness:** This paper claims to be the first to explore the learning dynamics of transformers for in-context classification of Gaussian mixtures and to prove the convergence of training in multi-class classification. However, I find the significance of this assertion unclear, as the paper lacks sufficient detail. Specifically: 1) many prior works have analyzed in-context learning assuming $x$ comes from Gaussian distributions; what additional insights do the results on Gaussian mixtures provide? 2) Why is extending results from binary to multi-class classification considered essential and non-trivial?

**Reply:** 1) Previous studies that assume $x$ is drawn from Gaussian distributions are all focused on the in-context linear regression problem. To the best of our knowledge, we are the first to explore the in-context classification problem under the assumption that $x$ is drawn from Gaussian mixtures. Prior work [4] that studied the in-context classification of transformers assumes the data to be pairwise orthogonal. They generated their data as $x=\mu_j+\kappa \nu_k$, where $\{\mu_j\}, j=1, 2, ..., M_1$ are in-domain-relevant patterns and $\{\nu_k\}, k=1, 2, ..., M_2$ are in-domain-irrelevant patterns, $M_1\geq M_2$, and these patterns are all pairwise orthogonal. Thus, the possible distributions of their data are finite and highly limited. In contrast, in our work, the data is drawn according to $P^b(\mu_0,\mu_1,\Lambda)$ or $P^m(\mu, \Lambda)$, and the range and possible distributions of our data are infinite. Hence, we considered a more general situation in which our data can have infinitely many patterns, while [4] only considered in-context classification tasks with finite patterns. Thus, we provide the additional insight that transformers can perform in-context classification tasks with infinite patterns. 2) Moreover, [4] only considered binary classification. We also provide the additional insight that transformers can perform in-context multi-class classification. This is essential because many real-world classification problems are not binary but multi-class. Therefore, explaining how transformers can handle multi-class classification problems in context is an essential question. It is non-trivial because, technically, when extending results from binary to multi-class classification, more complicated cross terms in the Taylor expansions of the softmax functions, which are due to the nature of multi-class classification, bring new challenges to the analysis. To address these issues, we derived new bounds on the expected errors of the cross terms in Lemmas G.1 and G.2, which may be of independent interest for other similar problems.

**Weakness:** The introduction of $\widetilde{L}$ appears to be a key element in proving Theorem 3.1, but its intuition is unclear, and I'm uncertain how it addresses the challenges posed by the non-linear loss function.

**Reply:** In Lemma E.3, we show that as $N\to \infty$, $L(W)$ converges pointwise to $\widetilde{L}(W)$. Since we can easily find that the global minimizer of $\widetilde{L}(W)$ is $2\Lambda^{-1}$, with the help of $\widetilde{L}(W)$ we can show that as $N\to \infty$, the global minimizer $W^*$ of $L(W)$ converges to $2\Lambda^{-1}$. Thus, we can write $W^*=2(\Lambda^{-1}+G)$. In Lemma E.4, by analyzing the Taylor expansion of the equation $\nabla L(W^*)=0$ at the point $2\Lambda^{-1}$, we address the challenges posed by the non-linear loss function and establish the bound $\|G\|_{\max}=O(N^{-1})$.

**Weakness:** The paper heavily relies on Taylor expansion in its proofs, and I question whether this expansion can accurately approximate the original function. More detail is needed on this aspect.

**Reply:** Yes, we used Taylor expansions in many places in our proofs. However, every time we use a Taylor expansion, we use the Lagrange form of the remainder to express and bound the approximation error. For example, in the proof of Theorem 3.2, we used the equation

$$
\sigma(a+b)=\sigma(a)+\sigma'(a)b+\frac{\sigma''(\xi(a,b))}{2}b^2,
$$

where $\xi(a,b)$ is a real number between $a$ and $a+b$. Since $|\sigma''(\xi(a,b))|\leq 1$ and, in the proof of Theorem 3.2, we can prove $E[b^2]=o(1/N+1/\sqrt{M})$, we can bound the approximation error by $o(1/N+1/\sqrt{M})$. Similarly, in any other place where we use a Taylor expansion, we always express and bound the approximation error.
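Written out, the remainder-control step described above simply combines the two stated bounds:

$$
\Big|\mathbb{E}\Big[\tfrac{\sigma''(\xi(a,b))}{2}\,b^{2}\Big]\Big|\;\le\;\tfrac{1}{2}\,\mathbb{E}\big[b^{2}\big]\;=\;o\big(1/N+1/\sqrt{M}\big).
$$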

**References:**

[4] Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, and Pin-Yu Chen. Training nonlinear transformers for efficient in-context learning: A theoretical learning and generalization analysis. arXiv preprint arXiv:2402.15607, 2024.

Official Review (Rating: 8)

The paper investigates the training convergence of transformers for in-context classification tasks. It demonstrates that a single-layer transformer trained by gradient descent converges to a globally optimal model at a linear rate for in-context classification of Gaussian mixtures. Experimental results confirm the theoretical findings, showing that the trained transformers perform well in binary and multi-class classification tasks.

Strengths

  1. Rigorous theory: The paper provides a detailed theoretical analysis of the convergence properties of transformers for in-context classification tasks, demonstrating that under certain conditions, a single-layer transformer trained by gradient descent achieves global optimality at a linear rate.

  2. Experimental validation: The theoretical claims are corroborated by experimental results. The paper's experiments on binary and multi-class classification tasks with Gaussian mixtures verify the theoretical predictions, showing that the transformers' prediction accuracy improves as the training and testing prompt lengths increase.

Weaknesses

The paper primarily focuses on single-layer transformers with simplified linear attention mechanisms. While it provides valuable insights into the convergence properties and error bounds for these models, the findings may not fully extend to more complex multi-layer transformers with softmax or ReLU attention mechanisms. And the training dynamics of multi-layer transformers could be different from the single-layer case -- just as a multi-layer MLP can be different from a linear model.

Nonetheless, I don't regard this as a strong weakness since the field is evolving and to my best knowledge, most studies are still on one-layer transformers.

Questions

Do the same training dynamics results apply to tasks beyond linear regression/classification? Would they be different for other mathematical/statistical tasks, for example, time series, etc.?

Comment

**Weakness:** The paper primarily focuses on single-layer transformers with simplified linear attention mechanisms. While it provides valuable insights into the convergence properties and error bounds for these models, the findings may not fully extend to more complex multi-layer transformers with softmax or ReLU attention mechanisms. And the training dynamics of multi-layer transformers could be different from the single-layer case -- just as a multi-layer MLP can be different from a linear model.

**Reply:** Thank you for your recognition of our contributions and for raising the question about the extension to multi-layer transformers with softmax or ReLU attention mechanisms. Yes, we agree that the training dynamics and many other properties of multi-layer, non-linear transformers can be different from the single-layer linear transformers we study. However, from the newly added experimental results (Figure 1) in our revised paper, we can see that the real-world multi-layer transformers and the single-layer transformers we studied actually exhibit many similarities in performance. For example, from Figure 1, we can see that both models' ICL inference errors decrease as training prompt length ($N$) and test prompt length ($M$) increase, and increase as the number of Gaussian mixtures ($c$) increases. This indicates that some of our insights obtained from studying this simplified model may still be valuable for transformers with more complex structures, and studying this simplified model can actually help us have a better understanding of the ICL abilities of transformers adopted in the real world. Moreover, the research community is still at the preliminary stage of the theoretical investigation of in-context learning of transformers. To the best of our knowledge, most existing theoretical studies on the convergence behavior focus only on single-layer transformers, e.g., [1-6]. We agree that studying the ICL abilities of multi-layer transformers is also an interesting and important problem. We leave it for future work.

**Question:** Do the same training dynamics results apply to tasks beyond linear regression/classification? Would they be different for other mathematical/statistical tasks, for example, time series, etc.?

**Reply:** It is a good question. The training dynamics of transformers with different structures and for different tasks can be different. However, some insights we obtained from linear regression/classification may also hold for other mathematical/statistical tasks. For example, we find that for the in-context classification of Gaussian mixtures, the ICL inference errors are affected by the training and testing prompt lengths. We suspect similar behaviors may also hold for other mathematical/statistical tasks. Nevertheless, the training dynamics for other mathematical/statistical tasks remain an interesting open question for future research.

**References:**

[1] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. arXiv preprint arXiv:2306.09927, 2023a.

[2] Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. arXiv preprint arXiv:2310.05249, 2023.

[3] Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, and Pin-Yu Chen. Training nonlinear transformers for efficient in-context learning: A theoretical learning and generalization analysis. arXiv preprint arXiv:2402.15607, 2024.

[4] Arvind Mahankali, Tatsunori B Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. arXiv preprint arXiv:2307.03576, 2023.

[5] Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Peter L Bartlett. How many pretraining tasks are needed for in-context learning of linear regression? arXiv preprint arXiv:2310.08391, 2023.

[6] Siyu Chen, Heejune Sheen, Tianhao Wang, and Zhuoran Yang. Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality. arXiv preprint arXiv:2402.19442, 2024.

Comment

Your response resolves my concerns and I'm raising my score. I vote strongly for acceptance of this work because of its contributions to helping the community better understand the training dynamics of Transformers, and in particular in-context learning. I also strongly disagree with Reviewer SwXS and align with the authors of this paper: this is not an LLM work, and a gap between the theoretical understanding of LLMs and empirical LLMs is acceptable and widely adopted by the community.

Comment

Dear Reviewer RKkZ,

Thank you for recognizing our contributions and your strong support of our paper! We sincerely appreciate your helpful comments and valuable input.

Best regards,

Authors

AC Meta-Review

This paper provides a theoretical analysis of in-context learning for classification tasks. The authors use a fairly standard setting of 1-layer linear attention. However, the distinction of the work arises from its focus on binary and multiclass classification tasks and the associated novel analysis. The authors study these problems under a Gaussian mixture dataset model. There was an insightful discussion between the reviewers and the authors on whether this work captures the in-context capabilities of modern LLMs. While the AC acknowledges that the 1-layer linear attention model of this work (and many other in-context learning theory works) is simplistic compared to modern LLMs, purely theoretical contributions are valuable and welcome at ICLR and can provide a stepping stone toward more sophisticated models.

While most reviewers found the technical contribution of the paper to be decent, the final recommendation is reject for the following reasons:

  1. Technical concerns: There are some basic flaws that require a second review. Firstly, the definition of the Bayes-optimal classifier in Line 294 is incorrect. The Bayes-optimal classifier is deterministic given the input, so the correct classifier is not a probability but is obtained by applying the sign function. Related to this, Theorem 3.2 states a total variation distance guarantee between $y_{query}$ and $\hat{y}_{query}$ which goes to zero as $N,M\rightarrow\infty$. However, the authors use a less strict TV distance definition rather than the conventional one. Namely, it does not mean that $y_{query}$ and $\hat{y}_{query}$ are the same random variables or that $\hat{y}_{query}$ is the optimal decision; they just have identical distributions. By the authors' definition as stated in Line 270, $\hat{y}_{query}$ is not the Bayes-optimal decision because it introduces noise on the optimal classifier when sampling from $\hat{y}_{out}$. (A short formal restatement of this distinction is given after this list.)

  2. Finite-sample Bayes optimality of GMMs: The authors discuss Bayes optimality only in the asymptotic sense, as the prompt length goes to infinity. In reality, even for finite prompt length, under suitable assumptions, one-step gradient descent can be a finite-sample Bayes-optimal estimator for binary GMMs. For instance, see Section 2.2 of Mignacco et al., ICML 2020. This means one-layer attention can do optimal classification under finite prompt length. This work does not discuss or capture this important aspect.

  3. Related work: The second point above brings me to the related work section which needs substantial improvement.

  • The authors do not reference the literature on Gaussian mixture models or classification with GMMs, even though their results rely heavily on the Gaussian mixture assumption. This even includes work on in-context learning that similarly utilizes GMM data, such as Dual Operating Modes of In-Context Learning (ICML'24). I find it a bit unfortunate that most citations are to ICL/LLM papers from the last 2-3 years and not much to the classical ML literature. Note that, once we make the assumption in Eq (14), we end up with the statistical properties of a one-step gradient estimator on Gaussian mixture data (Line 151), which is essentially the plug-in estimator in Section 2.2 of Mignacco et al., ICML 2020. In-context learning essentially constitutes a proxy for this fundamental model. I recommend that the authors consider providing a thorough discussion of the GMM literature (prior works on meta-learning with GMMs, finite-sample learning, Bayes-optimal rates, the multiclass case, etc.). This would also provide better motivation for their assumptions.

  • There is also no related work section in the main body. I would advise inserting 0.5 page of (shortened) related work in the final manuscript. I believe 10 pages provide enough space to do so.

  4. Technical clarity (minor concern): Some of the notation should be introduced more clearly. For instance, if I am not mistaken, the $G$ matrix in Theorem 3.1 is essentially defined in terms of $W^*$; that is, its context is missing. Please go over the manuscript carefully to ensure technical clarity throughout.
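As referenced in item 1 above, the distinction the AC draws can be restated compactly (this uses the binary notation from the rebuttal and is an illustrative restatement, not text from the paper): the Bayes-optimal classifier is the deterministic decision rule

$$
\hat{y}^{\mathrm{Bayes}}(x)=\mathbb{1}\{\mathbb{P}(y=1\mid x)>1/2\},
$$

whereas sampling a label $\hat{y}_{query}\sim\mathrm{Bernoulli}(\hat{y}_{out})$ only guarantees that the label's distribution is close to that of $y_{query}$ in total variation; a small TV distance does not imply the sampled label coincides with the Bayes decision.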

Additional Comments from Reviewer Discussion

There was an insightful discussion between the reviewers and the authors on whether this work captures the in-context capabilities of modern LLMs. While the 1-layer linear attention model of this work (and many other in-context learning theory works) is simplistic, purely theoretical contributions are valuable and welcome at ICLR and can provide a stepping stone toward more sophisticated models.

Final Decision

Reject