Transformers are almost optimal metalearners for linear classification
We prove that, under appropriate conditions, linear attention is an almost optimal metalearner for linear classification.
Abstract
Reviews and Discussion
This paper theoretically investigates whether ICL with a transformer can match the optimal meta-learner on linear classification tasks with a class-conditional Gaussian mixture model, where all tasks share the same input space.
Strengths and Weaknesses
The main conclusion of this paper, that with sufficient training ICL with a transformer acts like an optimal meta-learner, is reasonable. However, it seems that [1] also gives such a result, in more general settings, empirically and without theoretical guarantees, yet [1] is not mentioned at all.
Though the reviewer generally agrees with the main conclusion, there are some critical concerns:
- It seems that the results have nothing to do with ICL with a transformer. Without Sec 2.2, i.e., replacing it with any model parameterized by W, the remainder still seems to hold, since the remainder relies only on the setting/assumptions rather than on the model.
- The formal result (Theorem 3.1) is confusing. The reviewer understands the decomposition of the model into task-specific and shared components, which is common in multitask learning and meta-learning [2]. In the particular setting (Sec 2.1), where both the problem and the shared representation are linear (steps 1&2 in Assumptions 2.1/2.2 amount to replacing an input drawn uniformly from the unit sphere in the k-dimensional space with its embedding in the d-dimensional ambient space), as far as the reviewer can tell, useful information can only be brought in when d<k, since the noise would then have fewer dimensions for a fixed signal. However, the main result (Theorem 3.1) shows that the meta-learning risk increases as d decreases, which is contradictory.
The reviewer would consider raising the score if the related works and the above concerns are discussed properly.
[1] Wu S, Wang Y, Yao Q. Why In-Context Learning Models are Good Few-Shot Learners? In The Thirteenth International Conference on Learning Representations. [2] Maurer A, Pontil M, Romera-Paredes B. The benefit of multitask representation learning. Journal of Machine Learning Research, 2016, 17(81): 1-32.
Questions
Please refer to the weaknesses.
Limitations
NA
Final Justification
After the rebuttal, my second concern was addressed, so I raised my score. But my first concern remains critical, and it seems unlikely to be addressed by discussion or a minor revision.
Formatting Issues
NA
We thank the reviewer for their thoughtful feedback. Below we respond to each concern and question in order.
Missing Citation to [1] (Wu et al.): Wu et al. (2025) present an empirical study suggesting that transformers can approximate optimal meta-learners in few-shot classification tasks. While their results are encouraging, the work is primarily empirical and does not offer formal guarantees. The only theoretical result in the paper is Theorem 2.1, which establishes an expressivity claim, that is, the existence of transformer weights that can simulate a metalearning algorithm. However, they do not analyze whether such a solution can be obtained through optimization, which is a central focus of our work.
In addition, their analysis relies on the universal approximation property of multi-layer perceptrons (MLPs), without specifying the required network size. In contrast, our model does not include an MLP component. Finally, it is unclear whether the algorithms studied in their paper achieve optimality in our Gaussian mixture setting.
Our contribution complements this line of work by providing the first formal guarantee that a simple transformer trained by gradient descent can achieve near-optimal meta-learning performance under well-defined assumptions.
We agree that Wu et al. is a relevant and timely reference, and we will cite their work in the revised version.
W1. Is the Result Specific to Transformers?
The answer is yes. While our theoretical guarantee in Theorem 3.1 is expressed in terms of the max-margin solution, this solution is directly tied to the specific transformer architecture presented in Eq. (3). Section 2.2 introduces a linear attention-based transformer model, which has been studied in several prior works, and our analysis closely follows the behavior of this model when trained via gradient descent.
Theorem 2.3 defines the max-margin solution in terms of this attention model, as noted in line 215. Specifically, the constraint in Eq. (4) reflects the prediction structure of the model described in Eq. (3). If a different model were used, the constraints would differ, leading to a different optimization problem and, in general, a different solution. As a result, our analysis would no longer apply in that case.
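For readers following this thread, here is a minimal sketch of the kind of single-layer linear-attention in-context predictor being discussed. Eq. (3) itself is not reproduced in this discussion, so the merged key-query form below (the query scored against the label-weighted average of the context examples through a single matrix W) is an assumption based on the form used in much of the cited literature; the function name and array conventions are illustrative.

```python
import numpy as np

def linear_attention_predict(W, X_ctx, y_ctx, x_query):
    """Single-layer linear-attention ICL prediction (illustrative sketch).

    Assumes the merged key-query form common in the cited literature: the
    context is summarized by (1/N) * sum_i y_i x_i, and the query is scored
    against this summary through a single trainable matrix W.
    """
    n = len(y_ctx)
    context_summary = (X_ctx * y_ctx[:, None]).sum(axis=0) / n  # (1/N) sum_i y_i x_i
    score = x_query @ W @ context_summary                       # bilinear score
    return np.sign(score)
```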
W2. Clarification on Theorem 3.1 and the Role of d and k:
We believe there may be a misunderstanding:
Clarification of the setting: In our setting, each task's signal vector lies in a k-dimensional subspace of the d-dimensional ambient space, and the noise is isotropic in the ambient space. Therefore, the data is only defined for d ≥ k (see line 162). Indeed, Assumption 2.1 requires that the matrix P, whose columns span the shared subspace, is semi-orthogonal; this is only possible when d ≥ k, as stated in the assumption.
Each task's signal is generated by embedding a k-dimensional vector into the ambient space via the subspace matrix P. Importantly, the signal is not uniformly distributed over the unit sphere in the ambient space; with probability 1, it lies entirely within the shared subspace and has no component in directions orthogonal to it.
In standard metalearning terminology, the optimal shared representation in our setting is the projection onto the shared subspace given by P, where P is semi-orthogonal. Applying this projection to a signal vector recovers the original signal in the k-dimensional subspace.
Given a task whose signal vector is constructed from the shared subspace as above, the task-specific classifier is the corresponding linear classifier acting on the projected, k-dimensional representation.
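To make the subspace structure concrete, here is a minimal numerical sketch under the notation assumed in this reply (a semi-orthogonal matrix P spanning the shared subspace); the dimensions are illustrative and the signal-strength scaling is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 5                        # illustrative ambient / subspace dimensions

# Semi-orthogonal P spanning the shared subspace: P.T @ P = I_k.
P, _ = np.linalg.qr(rng.standard_normal((d, k)))

v = rng.standard_normal(k)
v /= np.linalg.norm(v)              # task-specific direction on the unit sphere in R^k
mu = P @ v                          # task signal embedded in R^d (signal strength omitted)

# The shared representation is the projection onto the subspace; it recovers v.
assert np.allclose(P.T @ mu, v)
```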
Risk would increase as d decreases: Our result in Theorem 3.1 shows that the meta-learning error decreases as the ambient dimension d decreases, not the other way around. In particular, as d becomes smaller, the SNR during training increases, which facilitates the learning process and leads to better generalization.
Thanks for the authors' response. The rebuttal has addressed my second concern, which arose from my neglecting the important semi-orthogonality of P and the minus sign in Theorem 3.1. Thus I will raise my rating.
However, the first concern remains critical. Note that the whole paper is bridged to ICL with a transformer only through Theorem 2.3, which the reviewer considers a relatively general solution that applies to a wide range of sufficiently simplified models.
We clarify that Theorem 2.3 explicitly characterizes the maximum-margin solution (precisely the solutions to which gradient descent converges) for the transformer-based attention model defined concretely in Eq. (3). The left-hand side of the constraints in Theorem 2.3 corresponds exactly to the model in Eq. (3) where it is applied to the training tasks. Thus, the scope and applicability of our results are explicitly tied to the specific model we study, rather than to a broad or overly simplified abstraction.
Importantly, the model we analyze in Eq. (3) is identical to the model previously studied in multiple prominent theoretical analyses of in-context learning, particularly for linear classification and linear regression ([1-4], all appearing in top-tier ML conferences). Furthermore, other variants of single-layer linear-attention models have been widely used in the literature to facilitate theoretical analysis of in-context learning, as shown in [5–8] and many other studies. Thus, our model from Eq. (3) has become a common test bed for theoretical analyses of in-context learning in transformers.
Given this extensive precedent, we believe that our choice of model is well-justified. Therefore, we respectfully argue that our model choice does not constitute a critical limitation; rather, it places our work firmly within a well-established research direction.
[1] Juno Kim, Tai Nakamaki, and Taiji Suzuki. Transformers are minimax optimal nonparametric in-context learners. NeurIPS 2024.
[2] Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Peter L Bartlett. How many pretraining tasks are needed for in-context learning of linear regression? ICLR 2024.
[3] Spencer Frei and Gal Vardi. Trained transformer classifiers generalize and exhibit benign overfitting in-context. ICLR 2025.
[4] Wei Shen, Ruida Zhou, Jing Yang, and Cong Shen. On the training convergence of transformers for in-context classification. ICML 25.
[5] Ruiqi Zhang, Spencer Frei, Peter L. Bartlett . Trained Transformers Learn Linear Models In-Context. JMLR 2024.
[6] Von Oswald et al. Transformers Learn In-Context by Gradient Descent. ICML 23.
[7] Ahn et al. Transformers learn to implement preconditioned gradient descent for in-context learning. NeurIPS 23.
[8] Mahankali et al. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention.
We would like to confirm whether our response addressed the reviewer's concern. If not, could the reviewer please clarify whether the concern is about Theorem 2.3 or the model presented in Eq. (3)?
This paper analyzes in-context learning (ICL) with a single-layer linear attention model on Gaussian mixture tasks. It provides theoretical results showing that such an architecture is an almost optimal metalearner for binary classification. The analysis carefully treats the finite-sample setting, which makes it distinct from many prior theoretical ICL works that assume a population training loss.
Strengths and Weaknesses
Strengths:
- The paper is clearly written and provides rigorous results in a finite-sample regime, addressing a realistic scenario where the number of tasks and samples per task is finite, which is an important distinction from many previous ICL theory works that assume infinite training data.
- The main results, particularly Theorem 3.1, are accompanied by detailed discussion and intuition, helping the reader understand the key contributions and their implications.
Weaknesses:
- Theorem 2.3 holds under the assumption that the data is linearly separable so that an SVM solution exists. In real in-context learning, demonstration samples are typically sampled randomly, which may not always yield separable sets. The paper would benefit from clarifying how the max-margin solution in Theorem 2.3 relates to practical training, as well as the feasibility of the max-margin solution for Theorem 3.1.
- The result in Theorem 2.3 aligns with optimizing a single weight matrix as in Eq. (3). However, the actual attention architecture (Eq. (2)) is more complex: it involves separate key and query matrices, which can lead to different optimization behaviors. Prior work has shown that separately optimizing the key and query matrices can implicitly minimize the nuclear norm of the combined matrix in Eq. (4) rather than the Frobenius norm. Thus, the paper's theoretical focus on Eq. (3) does not fully capture the behavior of the true Transformer attention mechanism.
- The paper lacks a concrete comparison between the bound in Theorem 3.1 and the Bayes optimal loss. Additionally, Remarks 2.1 and 2.2 are not consistent with Theorem 3.1, since they do not appear to account for Assumption A, which seems critical for making the comparison meaningful. It would be better to state the main results as the excess risk between the "single-task optimal learner" and the "optimal learner with access to the ground-truth representation". Additionally, simply showing that each error is a constant in [0,1] is less informative; it would be more compelling to show whether the error decays exponentially with increasing samples, context length, or dimension.
Questions
- Do Remarks 2.1 and 2.2 rely on Assumption A? If so, could the authors clarify this connection?
- A recent paper shows that a single-layer linear attention layer can indeed learn the optimal binary classifier. Could the authors comment on this?
- Li, Yingcong, et al. When and How Unlabeled Data Provably Improve In-Context Learning. arXiv:2506.15329 (2025).
- What specific model architecture is used in the experiments? Are the authors implementing Eq. (2) by replacing softmax with linear attention, or are they directly simulating Eq. (3)?
- The experimental curves are quite noisy. It would be helpful to run more trials and report the averaged results with confidence intervals to ensure the trends are robust.
- Have the authors tried comparing to the simplest baseline: estimating the task mean directly from the labeled in-context examples? How does its performance compare to the linear attention model?
Limitations
yes
Final Justification
The authors have addressed most of my concerns. I suggest incorporating more discussion on the comparison to the Bayes error. I have decided to raise my score.
Formatting Issues
N/A
We thank the reviewer for their thoughtful feedback. Below we respond to each concern and question in order.
W1:
- Feasibility of the max-margin solution: We note that this is explicitly addressed in our analysis. See Lemma C.4 and the proof sketch (line 325) for details, where we show a specific construction (i.e., the projection matrix) that separates the data (with high probability over the data distribution). While extending the analysis to non-separable data is an important direction for future work, we believe that studying the separable setting is a meaningful and informative starting point.
- Data Setting Justification: We believe our assumptions are reasonable for a theoretical work that gives a precise analysis of the effects of all of the core ingredients of the transformer training pipeline: the implicit regularization effect of gradient descent in pre-training, the number of pre-training tasks, the number of examples per pre-training task, the signal-to-noise ratio of the pre-training tasks, the number of in-context examples at test time, and the signal-to-noise ratio of the test-time tasks. Moreover, Gaussian mixture models are a well-studied setting in the literature, which enables us to derive a tight lower bound on the performance of an optimal metalearner (see Remark 2.2). Perhaps surprisingly, our analysis shows that the trained transformer nearly matches this optimal performance. We also note that Frei and Vardi [3] studied in-context learning in transformers using a very similar setting. To the best of our knowledge, there are currently no theoretical works on transformer-based in-context learning for classification that use significantly more realistic or complex datasets and provide such comprehensive results.
[3] Spencer Frei and Gal Vardi. Trained transformer classifiers generalize and exhibit benign overfitting in-context. ICLR 2025.
- Relation between the max-margin solution and practical training: Formally, the model is guaranteed to converge to the max-margin solution asymptotically as training time tends to infinity. In our experiments, we find that gradient descent trained for a finite number of steps already achieves strong generalization, closely matching the behavior predicted by the max-margin solution. We note that analyzing the asymptotic solution rather than a finite-time solution is common in many theoretical papers.
W2: We agree that analyzing a more general transformer architecture with separate key and query matrices is an important and interesting direction for future work. In our work, we adopt the simplified attention model with a merged key-query matrix, which is commonly used in theoretical studies of transformers and in-context learning (e.g. [3-7], and many more). This setting allows a tractable analysis of the optimization dynamics and implicit bias.
Regarding the reviewer’s comment on nuclear norm minimization, we believe this refers to [2]. As we mentioned in the paper, their analysis relies on strong and somewhat artificial assumptions, such as the existence of a single “optimal” token, and all other tokens receiving identical “scores”. These assumptions are not satisfied in our setting and, in fact, are violated by almost all data distributions with more than two tokens drawn randomly.
While our model does not capture the full complexity of attention with separate key and query matrices, we believe it serves as a meaningful and interpretable proxy that sheds light on how gradient descent interacts with attention in metalearning tasks. Also, as we already mentioned, this is a common setting in theoretical works on transformers.
[2] Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, and Samet Oymak. Transformers as support vector machines.
[3] Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. ICML 23.
[4] Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. NeurIPS 23.
[5] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. JMLR 2024.
[6] Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, and Samet Oymak. Max-margin token selection in the attention mechanism.
[7] Bhavya Vasudeva, Puneesh Deora, and Christos Thrampoulidis. Implicit bias and fast convergence rates for self-attention.
W3:
Error is not constant: We are not sure why the reviewer mentioned that the error in our result is constant. Theorem 3.1 shows that the error decays exponentially when several parameters increase simultaneously, specifically, when the subspace dimension k, the test-time signal strength $\tilde{R}$, and the quantity $\sqrt{M\tilde{R}^4/k}$ all increase, assuming the number of training tasks B is large enough to recover the shared representation.
Comparison with Bayes optimal:
- In lines 260-268, we compare the transformer against a Bayes optimal algorithm that has access only to the samples from the new task. While the latter requires $\Omega(d/\tilde{R}^4)$ examples to achieve a small constant error (d is the ambient dimension and $\tilde{R}$ is the signal strength at test time), the transformer needs only $O(k/\tilde{R}^4)$ in-context examples, as long as the number of tasks is large enough.
- In lines 269-275, we compare the transformer against a Bayes optimal algorithm that also has full access to the shared subspace, which is a very strong baseline, since metalearners do not have access to the subspace and need to learn it during training. We showed that both have the same in-context sample complexity, up to polylogarithmic factors.
Relation of Assumption A to Remarks 2.1 and 2.2: We clarify that Remarks 2.1 and 2.2 do not rely on Assumption A. These remarks present information-theoretic lower bounds that hold independently of any specific algorithm or training setup. In contrast, Assumption A is used specifically to prove our main result (Theorem 3.1) and ensures that the max-margin solution generalizes with high probability.
As discussed in lines 248-253:
- Assumption A1 requires that the test-time signal strength is sufficiently large. This is the interesting regime, since Remarks 2.1 and 2.2 show that learning becomes impossible when the test-time signal is too small.
- Assumption A2 requires that the ambient dimension is at least polylogarithmic in the number of tasks. This modest assumption, while not needed in Remarks 2.1 and 2.2, is standard in high-dimensional analysis and ensures concentration bounds for sub-exponential random variables.
- Assumption A3 requires the number of examples per task during training to be sufficiently large. This is necessary for learning the shared representation during pretraining. However, Remarks 2.1 and 2.2 analyze test-time learners that either do not have access to the shared subspace (Remark 2.1) or have oracle access to it (Remark 2.2), so their bounds are independent of the number of training examples per task.
Q1: See the previous part, i.e., the relation of Assumption A to Remarks 2.1 and 2.2.
Q2. Comparison to Li et al.: We thank the reviewer for pointing us to this relevant reference. While the setting in Li et al. differs from ours in important ways, it is still conceptually related and warrants citation in our paper.
- First, their work does not address the metalearning setting. In particular, the tasks in their framework do not share a common representation (e.g., a low-dimensional subspace), which makes their setting fundamentally different and not directly comparable to ours. In fact, their focus is closer to semi-supervised learning, and naturally their bounds depend on the ambient dimension d, while our work avoids that.
- Second, they analyze the squared loss, which is more natural for regression and allows for closed-form solutions. In contrast, we study the logistic loss, which is standard for classification and does not admit a closed-form solution. To handle this, we leverage the implicit bias of GD to characterize the behavior of the trained model.
- Third, their results do not clearly describe how key parameters influence generalization, for example, how the number of tasks during training affects the generalization error. In our work, we provide explicit relationships between the generalization error and quantities such as the number of training tasks, the number of samples per task, the ambient dimension, the shared subspace dimension, and the SNR in both training and test.
Q3: We directly simulate Eq. (3).
Q4. We have included this in the experimental section of the appendix (see the supplementary materials, last section).
Q5. Yes, this is precisely the Bayes optimal algorithm, that is, the maximum likelihood estimator (MLE) that we use as a baseline in our experiments (see the discussion at the beginning of Section 5 in the main paper, as well as the experiments presented both in the main text and in the appendix). We evaluate two versions of MLE: one that only uses the average of the in-context test examples, and another that has access to the ground-truth subspace matrix P, which first projects the data using P and then applies MLE. Our results show that the linear transformer significantly outperforms the MLE baseline that lacks access to P, and closely matches the performance of the MLE with projection.
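For concreteness, here is a sketch of the two mean-based baselines described above, with and without access to the ground-truth subspace matrix P; the function name and array conventions are illustrative, not the paper's implementation:

```python
import numpy as np

def mle_baseline(X_ctx, y_ctx, x_query, P=None):
    """Mean-based (MLE-style) baseline sketch for the Gaussian mixture tasks.

    Estimates the task mean from the labeled in-context examples and classifies
    the query by the sign of its inner product with that estimate. If the
    ground-truth subspace matrix P is given, data are first projected onto the
    shared subspace (the stronger oracle baseline discussed above).
    """
    if P is not None:                                   # oracle access to the subspace
        X_ctx, x_query = X_ctx @ P, x_query @ P
    mu_hat = (X_ctx * y_ctx[:, None]).mean(axis=0)      # label-weighted mean estimate
    return np.sign(x_query @ mu_hat)
```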
Thank you for your response.
Given the signal radius $\tilde{R}$ and noise level 1, the Bayes error is $\Phi(-\tilde{R})$, where $\Phi$ is the cdf of the standard normal. It is a constant. With infinite data, all effective classification algorithms are expected to converge to this bound. Therefore, it would be more meaningful to study the gap to the Bayes error rather than the raw classification error.
Overall, the authors have addressed most of my concerns. I suggest incorporating a comparison to the Bayes error in both the theoretical discussion and the experiments. I have decided to raise my score.
We thank the reviewer for their valuable feedback and suggestions. While our current comparison focuses on a finite number of in-context examples, in the final version, we will include a comparison with the asymptotic case, where the number of in-context examples tends to infinity.
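As a reference point for that asymptotic comparison, the Bayes error of the symmetric unit-noise Gaussian model discussed above can be evaluated directly; a minimal sketch, assuming classes centered at plus/minus a mean of norm r (the function name is illustrative):

```python
from math import erfc, sqrt

def bayes_error(r):
    """Bayes error Phi(-r) for symmetric classes at +/- mu with ||mu|| = r and unit noise."""
    return 0.5 * erfc(r / sqrt(2))   # Phi(-r) via the complementary error function

print(bayes_error(1.0))  # ~0.159
print(bayes_error(3.0))  # ~0.00135
```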
The paper considers the ICL ability of linear transformers on a Gaussian mixture of linear classification. It shows that a trained transformer is nearly an optimal metalearner. The result also shows the number of pretraining tasks needed to generalize at test time, where the number is not dependent on the ambient dimension.
Strengths and Weaknesses
This is a technically solid paper that studies a rather important problem in ICL theory. It considers a mixture of classification tasks, which is an important theoretical setting that I do not think has been properly addressed in previous work. The theoretical result is informative and demonstrates the ICL ability of a linear transformer in such a setting.
I will detail my observation as follows.
- Methods And Evaluation Criteria:
The data distribution setting, namely the Gaussian mixture of linear classification tasks, is a rather novel setting and an important step toward understanding the ICL ability of linear transformers. ICL on single-distribution linear regression has been investigated in, e.g., [1], and a potential setting of a Gaussian mixture of linear regressions has been proposed in a recent ICML submission [2]. The Gaussian mixture of linear classification tasks is a step forward. Since it is a novel setting, I would like to further check with the authors whether the setting is reasonable, i.e., that it does not reduce to some degenerate case or to cases where metalearning is unachievable. The definition itself in Section 2.1 is mathematically rigorous, but I would like to seek some clarification on its implications. Therefore, I will specify my questions on the problem setting in later sections.
- Theoretical Claims:
I skimmed through the proofs, including the proof sketch in Section 4 and the Appendix. The proof uses several standard tools for analyzing linear transformers, namely Eq. (3), which has appeared in [1, 2]. The proof involves some standard concentration lemmas, which are well established in textbooks, so it looks to me that the proved result stands as it is.
The main result reduces the bound on in-context examples so that it no longer depends on the ambient dimension d, which is a good improvement.
- Experimental Designs Or Analyses:
The numerical experiment replicates and justifies the theoretical findings. It is not the main contribution in this paper anyway, so it does not affect my evaluation.
- Relation To Broader Scientific Literature: It is closely related to the theoretical in-context learning literature. It studies the in-context learning ability on a Gaussian mixture of linear classification models, which is a step forward from the non-mixed linear regression/classification function class that is present in the literature.
If the mixture of linear classification setting makes sense, this is surely a good paper for this venue. However, I have some doubts about the setting, as stated before. Therefore, I have a single question on the setting, listed below, which will greatly impact whether my score is accept or reject.
[1] Zhang, R., Wu, J., & Bartlett, P. (2024). In-context learning of a linear transformer block: Benefits of the mlp component and one-step gd initialization. Advances in Neural Information Processing Systems, 37, 18310-18361.
[2] Jin, Y., Balasubramanian, K., & Lai, L. (2024). In-context Learning for Mixture of Linear Regressions: Existence, Generalization and Training Dynamics. arXiv preprint arXiv:2410.14183.
Questions
I have a single question on the validity of the data generation setting, which will determine whether my score is an accept:
In Assumption 2.2 (and similarly in Assumption 2.1), is it true that x with y = +1 has the same marginal distribution as x with y = -1? Since the task-specific vector is uniform on the sphere and the embedding is isometric, it looks to me that the answer is yes. In that case, how is classification possible?
This question is particularly intriguing to me since in Line 261 and Line 272, the paper is claiming that the algorithm can achieve "an arbitrarily small constant" test error. I am not exactly sure how it is possible in the current setting, and would be eager to hear from the authors.
Limitations
N/A on societal impact.
Final Justification
I thank the authors for their rebuttal; it addresses my concern. I believe this work is quite important for the community, since it theoretically demonstrates a very important and strong practical feature (meta-learning) of a very influential model (the linear transformer). I will raise the score from 4 to 6 and recommend this paper for wider exposure, such as an oral or spotlight, depending on how the venue is organized.
I recommend that the authors reflect their response in the paper, in Assumptions 2.1, 2.2, and Ln 186-188. It looks to me that my initial misunderstanding is quite natural, since it is currently not explicitly mentioned in Assumptions 2.1 and 2.2 that they describe the task distribution rather than the data distribution. For example:
Assumption 2.1' (training-time task distribution). ... For any task, the input and label pairs in that task satisfy the following:
- Let the task-specific vector be sampled from a prior distribution, i.e., the uniform distribution on the sphere of radius $\tilde{R}$. ...
This and something similar for Assumption 2.2 should work. Also, in Ln 186, you might want to specify that "In our setting, each example is a task that consists of a sequence of (input, label) pairs".
I believe something like these edits can prevent other people from suffering the same misunderstanding as mine. This can also keep the notation consistent with Ln 257 in Theorem 3.1.
Please feel free to comment if this suggestion makes sense.
Also, I disagree with Reviewer uuw9's evaluation of Eq. (3). As mentioned in my review, it is a standard tool that has appeared in prominent papers in this field.
Formatting Issues
Some minor typos:
- In Line 305, "Gausian"
We thank the reviewer for their thoughtful feedback. Regarding your question, we believe you are referring to Assumptions 2.1 and 2.2 (there is no Assumption 3.1 or 3.2). The answer is no, but it is essential to also condition on the task-specific vector. That is, the distributions of x conditioned on y = +1 and on y = -1 are not the same once we also condition on the task-specific vector.
In both assumptions, the task-specific vector $\mu$ is sampled once for each task and fixed. Then, each labeled example in that task is generated as $x = y\mu + \xi$, where $\xi \sim \mathcal{N}(0, I_d)$. Thus, conditioned on $\mu$, the positive and negative classes are Gaussian distributions centered at $\mu$ and $-\mu$, respectively. This separation makes classification possible despite the symmetry in the prior over $\mu$. Since our in-context sequence consists of many examples drawn from the Gaussian mixture corresponding to the same $\mu$, given enough in-context examples we can correctly classify fresh examples drawn using this $\mu$.
We hope this clarifies the data generation process and addresses the concern about class distinguishability.
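For concreteness, here is a minimal sketch of this generation process under the notation assumed in this reply (a task-specific signal fixed in a shared k-dimensional subspace and unit-variance Gaussian noise in the ambient space); the sizes and signal strength are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_ctx, signal = 50, 5, 40, 3.0        # illustrative sizes and signal strength

P, _ = np.linalg.qr(rng.standard_normal((d, k)))   # shared subspace, P.T @ P = I_k

def sample_task():
    """One task: a fixed signal vector lying in the shared subspace."""
    v = rng.standard_normal(k)
    return signal * (P @ (v / np.linalg.norm(v)))

def sample_examples(mu, n):
    """Class-conditional Gaussian mixture: x = y * mu + standard Gaussian noise."""
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * mu + rng.standard_normal((n, d))
    return X, y

mu = sample_task()                  # sampled once per task and then fixed
X_ctx, y_ctx = sample_examples(mu, n_ctx)
# Conditioned on mu, the two classes are Gaussians centered at +mu and -mu.
```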
I thank the authors for their rebuttal; it addresses my concern. I believe this work is quite important for the community, since it theoretically demonstrates a very important and strong practical feature (meta-learning) of a very influential model (the linear transformer). Also, it investigates a Gaussian mixture of logistic regression setting, which is a first in the ICL theory field; this is an advance in theoretical technique. I will raise the score from 4 to 6 and recommend this paper for wider exposure, such as an oral or spotlight, depending on how the venue is organized.
I recommend that the authors reflect their response in the paper, in Assumptions 2.1, 2.2, and Ln 186-188. It looks to me that my initial misunderstanding is quite natural, since it is currently not explicitly mentioned in Assumptions 2.1 and 2.2 that they describe the task distribution rather than the data distribution. For example:
Assumption 2.1' (training-time task distribution). ... For any task, the input and label pairs in that task satisfy the following:
- Let the task-specific vector be sampled from a prior distribution, i.e., the uniform distribution on the sphere of radius $\tilde{R}$. ...
This and something similar for Assumption 2.2 should work. Also, in Ln 186, you might want to specify that "In our setting, each example is a task that consists of a sequence of (input, label) pairs".
I believe something like these edits can prevent other people from suffering the same misunderstanding as mine. This can also keep the notation consistent with Ln 257 in Theorem 3.1.
Please feel free to comment if this suggestion makes sense.
Also, I disagree with Reviewer uuw9's evaluation of Eq. (3). As mentioned in my review, it is a standard tool that has appeared in prominent papers in this field.
We thank the reviewer for the positive feedback and are glad our rebuttal addressed the concern. We agree that the distinction between task distributions and data distributions within the same task should be made clearer. We will revise Assumptions 2.1, 2.2, and Line 186 in the final version to reflect this, following the reviewer’s helpful suggestions.
For Assumptions 2.1 and 2.2, one can also follow the writing of Assumption 1 in [1].
[1] Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Peter L Bartlett. How many pretraining tasks are needed for in-context learning of linear regression? ICLR 2024.
This paper studied the in-context learning of one-layer linear self-attention models for binary classification on a Gaussian mixture dataset. After training, such a transformer has learned the ground-truth projection, so it requires fewer test samples to achieve a small test error. It reduces the sample complexity compared with the optimal binary classifier on the test dataset. This result improves on previous results of [11] by Spencer Frei and Gal Vardi. This paper, employing a straightforward linear model, theoretically demonstrates that transformers can function as metalearners, capable of adapting to novel tasks using only a limited number of in-context examples, without the need for additional training.
Strengths and Weaknesses
Strengths: This paper is well-organized and states the theorems very clearly. It is technically sound, with sufficient discussion of the results, and the proof is clear.
Weaknesses: The main concern is the assumptions on the dataset and attention model in the theory. This paper only considers linearly separable binary classification and a convex parameterization of the linear transformer. These may lose many crucial properties of the transformer in both the pretraining and test processes. Additionally, it would be more impactful if the authors could extend the results to in-context multi-class classification as in reference [12]. And although obtaining theory for more general settings may be hard, the paper could present more general simulations to show that its result extends to such settings. Finally, it is not clear what kind of message a machine learning practitioner can obtain from this theoretical work.
Questions
- Line 179: the index there should not appear. Line 305: "Gausian".
- In Section 2.2, I understand the tokenization is standard in many theory papers, but could we consider an embedding different from (1), for instance one that arranges the example pairs into tokens differently?
- In Theorem 2.3, if we consider nonlinear softmax attention, can we still get a similar solution for gradient descent training?
- In Theorem 3.1, what does the dataset symbol refer to? Is it the training or the test dataset?
- In the last remark in Section 3, the authors can eliminate the dependence on the ambient dimension in the analysis. Can you compare such results with the classical PCA method for Gaussian mixture models, e.g., [1]? In [1], the threshold for exact recovery is also governed by sample size and SNR. Also, it is worth mentioning [2] as another direction for studying transformers using Gaussian mixture models.
[1] An $\ell_p$ theory of PCA and spectral clustering
[2] Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures
Limitations
The paper should have a detailed discussion of the limitations in the main text. The limitations of the theory may come from the assumptions of the Gaussian mixture model and the linear self-attention model.
Final Justification
The authors partially addressed my concerns. I greatly appreciate their effort and detailed discussion. However, I cannot agree with their first response, namely that Eq. (3) serves as a standard toy model for theoretical analyses of in-context learning, so one can focus on this toy model without making the analysis more realistic for practical transformer models. I believe Eq. (3) does not fully capture the behavior of the genuine Transformer attention mechanism. As for the impact on practitioners, I think this paper uses this toy model to elucidate some ideas that may already be familiar to practitioners. Based on the novelty of the mathematical methods and the final results presented in the paper, I will maintain my current score.
Formatting Issues
No
We thank the reviewer for their feedback. Below, we respond to each point raised.
Simplified Architecture: While we analyze a 1-layer linear transformer, this simplification allows us to rigorously characterize fundamental properties of in-context learning that may extend to more complex architectures. Similar to how early theoretical works on neural networks focused on simplified architectures to establish foundational results, our work provides a theoretical framework that can guide future analyses of more complex transformers.
Moreover, our architecture is exactly identical to the architecture considered in at least four prior works on in-context learning for linear classification or linear regression ([1-4] below, all of which appeared in top-tier ML conferences), and other variants of 1-layer linear attention models have appeared in many other works on in-context learning for linear regression tasks ([5-8], and many more). Thus, this model has become a standard toy model for theoretical analyses of in-context learning, and it clearly attracts much interest in the theoretical deep-learning community. Hence, we feel that recommending rejection on this basis would be unfair.
[1] Juno Kim, Tai Nakamaki, and Taiji Suzuki. Transformers are minimax optimal nonparametric in-context learners. NeurIPS 2024.
[2] Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Peter L Bartlett. How many pretraining tasks are needed for in-context learning of linear regression? ICLR 2024.
[3] Spencer Frei and Gal Vardi. Trained transformer classifiers generalize and exhibit benign overfitting in-context. ICLR 2025.
[4] Wei Shen, Ruida Zhou, Jing Yang, and Cong Shen. On the training convergence of transformers for in-context classification. ICML 25.
[5] Ruiqi Zhang, Spencer Frei, Peter L. Bartlett . Trained Transformers Learn Linear Models In-Context. JMLR 2024.
[6] Von Oswald et al. Transformers Learn In-Context by Gradient Descent. ICML 23.
[7] Ahn et al. Transformers learn to implement preconditioned gradient descent for in-context learning. NeurIPS 23.
[8] Mahankali et al. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention.
Data Setting: We believe our assumptions are reasonable for a theoretical work that gives a precise analysis of the effects of all of the core ingredients of the transformer training pipeline: the implicit regularization effect of gradient descent in pre-training, the number of pre-training tasks, the number of examples per pre-training task, the signal-to-noise ratio of the pre-training tasks, the number of in-context examples at test time, and the signal-to-noise ratio of the test-time tasks. Moreover, Gaussian mixture models are a well-studied setting in the literature, which enables us to derive a tight lower bound on the performance of an optimal metalearner (see Remark 2.2). Perhaps surprisingly, our analysis shows that the trained transformer nearly matches this optimal performance. We also note that Frei and Vardi [3] studied in-context learning in transformers using a very similar setting. To the best of our knowledge, there are currently no theoretical works on transformer-based in-context learning for classification that use significantly more realistic or complex datasets and provide such comprehensive results.
[3] Spencer Frei and Gal Vardi. Trained transformer classifiers generalize and exhibit benign overfitting in-context. ICLR 2025.
Impact for Practitioners: We showed that simple linear attention can efficiently encode task structure when the tasks lie in a low-dimensional subspace, and that transformers can approach optimal performance with substantially fewer in-context examples, assuming sufficient pretraining. These insights may guide architectural decisions in real-world few-shot applications. That said, the primary goal of this work, like many theoretical studies, is to deepen our understanding of how transformers perform metalearning and to clarify the relationships between key factors, such as the number of pretraining tasks, the number of examples per task, and the SNR, that enable metalearning. We are aware that theoretical lines of research often take time to translate into significant practical impact.
Responses to Questions:
Q1: Thank you for pointing this out. We will fix that in the revised version.
Q2: We are not entirely sure we understand the proposed tokenization. In our setup, each column of the token matrix corresponds to a different token. As a general principle, each token fed into the transformer must have the same dimension. If the idea is to tokenize each pair as two separate tokens, this raises several issues. First, it would require padding to match the dimension of the input tokens, which adds unnecessary complexity. More importantly, this design leads to attention interactions between label tokens and input tokens within the same example, rather than between different examples. We therefore suspect that such a tokenization would impair generalization and increase the in-context sample complexity.
We agree that analyzing different tokenization is an interesting question.
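For illustration, here is a sketch of the per-example tokenization described above, in which each context pair occupies one column and the query is appended with a zero label slot; this follows the form commonly used in the cited literature and is an assumption about Eq. (1), not a reproduction of it:

```python
import numpy as np

def tokenize(X_ctx, y_ctx, x_query):
    """Stack each (x_i, y_i) pair as one column; the query gets a zero label slot.

    Illustrative only; the exact Eq. (1) in the paper may differ in details
    such as scaling or token ordering.
    """
    ctx_tokens = np.vstack([X_ctx.T, y_ctx[None, :]])         # (d+1) x N
    query_token = np.concatenate([x_query, [0.0]])[:, None]   # (d+1) x 1
    return np.hstack([ctx_tokens, query_token])               # (d+1) x (N+1)
```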
Q3: As noted in the discussion of our “simplified architecture”, our analysis relies on the linear attention model, where the implicit bias of gradient descent is well understood. We believe it should be possible to extend our analysis to non-convex linear transformer architectures which are homogeneous in the parameters, since the KKT condition analysis approach can work in that setting as well. Unfortunately, the implicit bias of gradient descent in the softmax attention setting is not yet fully characterized (see references in line 151), which makes it challenging to extend our proof techniques to that case. Developing such a theory for softmax-based transformers remains an important and open direction, as we mentioned in Section 6.
Q4: Thank you for pointing this out. It refers to the test data; specifically, the tokenization of the in-context test sequence used to predict the query label. We agree that this was not clearly stated, and we will add an explicit clarification in the final version.
Q5: In PCA, the data matrix is given, and the goal is to compute a rank-k approximation that minimizes the Frobenius norm error. The rank is known and fixed, and the objective is purely reconstruction-based.
In contrast, in our setting, the rank is not provided in advance, and the transformer must implicitly infer a relevant subspace through the attention mechanism in the course of solving a different task: namely, in-context linear classification. That is, the attention head learns to separate labeled data points according to class, rather than to recover the underlying data geometry. This difference in objective makes a direct comparison between the two settings nontrivial.
Moreover, the guarantees in [9] rely on an asymptotic regime where the number of samples tends to infinity (see Assumption 2.1 therein). Translating to our setting, this would require the number of tasks to tend to infinity. In contrast, our bounds hold for a finite number of tasks and, notably, are independent of the ambient dimension d.
We will add a citation to [10], which takes a complementary approach by showing that transformers are capable of performing unsupervised learning in the Gaussian mixture model (GMM) setting.
[9] An $\ell_p$ theory of PCA and spectral clustering
[10] Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures
Thanks for the authors' explanations and response. In the response on Impact for Practitioners, can you further explain what kind of insights we can get from the current results to guide architectural decisions in real-world few-shot applications? Is it possible to provide any direct practical impact?
Additionally, could you further explain the challenges associated with extending the current results to in-context multi-class classification? In the current paper, you focused on binary classification for Gaussian mixture data.
Impact for Practitioners: Here we highlight some practical insights that practitioners may find useful, or that future empirical work could explore further using real-world data:
- Our analysis demonstrates that the number of in-context examples needed to ensure generalization is very small, and in fact smaller than what would be required by an optimal algorithm that only sees the in-context examples. This highlights the efficiency of transformer-based models in leveraging shared task structure during in-context learning. This insight might be leveraged when considering in-context learning in practical scenarios.
- Our results show that the generalization improvement from increasing the number of tasks saturates after a certain point. This is captured by the corresponding term in Thm. 3.1. Intuitively, once the model has learned the shared representation, adding more tasks does not yield additional benefit. In such cases, it may be more advantageous to invest in increasing the number of in-context examples at inference time, if possible.
We would like to emphasize, however, that the main contribution of our work is theoretical. Our goal is to develop a rigorous understanding of how transformers can perform metalearning in simplified and mathematically tractable settings. Like most other theoretical works on transformers, and deep learning theory more broadly, our results do not aim to provide immediate practical impact. Instead, the value lies in clarifying fundamental principles that may guide and inform future work. More specifically, theoretical work on in-context learning in simplified one-layer linear attention models has attracted significant interest within the theoretical machine learning community in recent years, even without demonstrating direct or immediate practical implications.
Multi-class classification: We believe that our technique may be extended to in-context multi-class classification using the cross-entropy loss, which is the natural multi-class generalization of the logistic loss. In particular, Lyu and Li [1] establish an implicit bias result for gradient descent with the cross-entropy loss in the multi-class setting (see Appendix G in their paper), which suggests that similar analytical tools could be applied. We note that most of the theoretical papers on classification with neural networks consider binary classification. We believe that extending our results to the multi-class setting is an intriguing direction for further research.
[1] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
Thank you for your comments. We noticed your current score was submitted shortly after our response and wanted to check whether our rebuttal addressed your concerns. Please let us know if any points remain unclear; we would be happy to clarify.
This submission studies in-context learning in a transformer as meta-learning, for the restricted case of linear classification. The submission presents theoretical contributions that reviewers found well-motivated and interesting. The primary concern raised across reviews centers on the restrictive nature of the theoretical assumptions. While these assumptions may limit the immediate applicability of the results, the reviewers assessed that the motivation and theoretical insights are sufficiently compelling to warrant acceptance.