PaperHub
ICML 2025 · Poster
Average rating: 6.1/10 from 4 reviewers (scores: 3, 2, 4, 4; min 2, max 4, std 0.8)

In-Context Deep Learning via Transformer Models

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We investigate the transformer's capability for in-context learning (ICL) to simulate the training process of deep models.

Abstract

We investigate the transformer's capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is providing a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL. We also give the theoretical guarantees for the approximation within any given error and the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting using Softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.
Keywords
foundation model · transformer · in-context learning

Reviews and Discussion

Review (Rating: 3)

This paper studies the representation power of the transformer model in performing in-context learning (ICL). In particular, this paper focuses on the implementation of in-context gradient descent with respect to a general deep neural network. The proposed construction is flexible in that it can use either a ReLU attention or the more common softmax attention, and it is efficient in that it uses $O(N)$ layers for each gradient update step of an $N$-layer network.

In my opinion, the contributions of this paper are significant because the construction works for a neural network of arbitrary depth and can be implemented using softmax attention. The latter point is especially valuable because much of the literature on ICL cannot handle the softmax activation.

Questions for Authors

see above

Claims and Evidence

Although this work is a substantial addition to the literature, it is plagued by very poor readability. I believe that some major restructuring would be necessary before publication.

  1. Several parts of the Appendix absolutely should be in the main body. The paper in its current form does not have a self-contained main body. The definitions of the transformer and attention have to be part of the intro. The results on softmax attention should also be discussed in the main text because this is, in my opinion, the most interesting result of the paper. And some of the experiments need to be in the main body because they are mentioned in the abstract. My understanding is that readers are expected to understand the key arguments of the paper without checking the appendix, and this paper currently falls short.
  2. The paper is overly verbose and many of the details are disruptive to the flow without contributing much to the exposition. For example, "Problem 1" and "Problem 2" are basically the same thing. Definition 1 is just an over-complicated description of an MLP. And the "proof sketch" in Section 3.3 doesn't need this level of precision for a general audience -- it is simply too dense to parse. I think the details of Section 3.3 (especially the constants) can be stripped down and moved to the appendix.
  3. I think an illustration would be very helpful. For example, one can make a diagram marking all of the partial derivatives in back-propagation of the neural network and then highlight which layer of the transformer model implements each of these partial derivatives.

Given the amount of "fat" in this paper, I think my proposed changes can fit into the page limit without too much effort.

And some more technical concerns:

  1. While my first impression is that the proposed construction is for a standard transformer architecture, this is actually not the case. The authors introduced a "piecewise multiplication layer," which is fine because it can be implemented efficiently in code. Also, the attention and MLP layers are not stacked in alternating order. The authors should highlight these differences rather than burying them under a wall of text.
  2. In Lemmas 2-4, the constants for the ReLU approximations are not explicitly instantiated. I think it would be very helpful if the authors could give the values of these parameters for a simple example such as a ReLU network with quadratic loss.

Lastly, I really like that Appendix B.3 clearly discussed the limitations of this paper. I appreciate the authors' honesty.

In light of these observations, I do not recommend acceptance of this paper as-is. However, I do think the overall ideas in this paper can be valuable to the community and I hope the authors can take my feedback into account to improve this paper. In particular, I want to see concrete plans from the authors on how to address my concerns so the revised paper is more accessible, but I fully understand that I am asking for a lot of changes and it is okay if the authors do not have enough time during the rebuttal period to finish the revision. I am happy to raise my score if my concerns are adequately addressed.

Methods and Evaluation Criteria

n/a

Theoretical Claims

The theoretical derivations of the paper seem to be precise and the authors devoted significant effort to tracking the various algebraic blobs. So I have confidence that the results do not have any fatal flaws.

I have one question: the second point of Section B.2 seems to imply that the intermediate computations on the partial derivatives are passed down through the forward pass of the transformer layers, but Section 2.i implies that the tokens' embedding dimension is only of order $\Theta(d)$. Can you clarify the difference and, if possible, clearly state those values in the revision? Furthermore, it would be useful to discuss what values are written into the rows containing the $q_i$'s, which are initialized to 0.

Experimental Design and Analysis

The experiments are clear and validate the main theoretical claims of this paper. I have no complaints about them.

Supplementary Material

I read Appendices A-C, a bit of F and took a quick look at G and H.

Relation to Broader Literature

see above

Essential References Not Discussed

The citations in this paper are sufficiently comprehensive.

Other Strengths and Weaknesses

see above

Other Comments or Suggestions

see above

Author Response

Thanks for your detailed review. We have revised our draft (changes marked in BLUE) in this anonymous Dropbox folder.

C1: Reconstruction of the Manuscript.

Response

Thank you for the suggestions. We agree with this. In response, we have made the following modifications in the revised version:

  • We have moved the definitions of transformers and attention into the introduction while keeping the full standard definition in Appendix D due to space constraints.

  • The results on softmax attention and the experimental results have been incorporated into the main body—softmax results in Section 3.4 and experimental results in Section 4.

  • To improve clarity, we have simplified Problem 2 and clearly state that it describes an MLP with $N$ layers.

  • The proof sketches in Section 3.3 have been moved to Appendix D, just before the detailed proof of the lemmas.

  • The complex constants in Lemma 6 have been shifted to Appendix D to improve readability.

  • While we acknowledge that Definition 1 may appear verbose, it establishes the notation used throughout the paper, particularly in the proofs, ensuring precision and consistency. Therefore, we have retained it as is.

Additionally, we appreciate the suggestion to include a visualization of the backpropagation process. We have included it in Appendix A.2 of the revised version. We also redisplay it in this anonymous figure.

C2: Difference with Standard Transformer

Response

Thank you for your suggestions. We have added this to the last point of limitations (lines 749-754) and restate it here:

There are two minor differences between the transformer used in the theoretical analysis and a standard transformer: (i) The transformer used in the theoretical analysis incorporates an element-wise multiplication layer, a specialized variant of self-attention that retains only the diagonal score and allows efficient implementation. (ii) It does not alternate self-attention and MLP layers. We emphasize that this also qualifies as a standard transformer because we view either an attention or an MLP layer as equivalent to an attention plus MLP layer due to the residual connections.
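To make point (i) concrete, a minimal numpy sketch of the distinction is given below, assuming unnormalized attention scores; the element-wise multiplication layer keeps only the per-token diagonal score $\langle q_i, k_i \rangle$. The shapes and function names are illustrative rather than our actual construction.

```python
import numpy as np

def standard_attention(H, Wq, Wk, Wv):
    """Plain (unnormalized) self-attention: every token attends to every token, O(n^2) scores."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T                                    # full (n, n) score matrix
    return scores @ V

def elementwise_multiplication_layer(H, Wq, Wk, Wv):
    """Keep only the diagonal score <q_i, k_i>: token i interacts only with itself,
    so the output is a per-token (Hadamard-style) rescaling computable in O(n)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    diag_scores = np.sum(Q * K, axis=1, keepdims=True)  # (n, 1); the i-th entry is <q_i, k_i>
    return diag_scores * V

# Toy inputs: the diagonal variant mixes no information across tokens,
# while standard attention does.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out_full = standard_attention(H, Wq, Wk, Wv)
out_diag = elementwise_multiplication_layer(H, Wq, Wk, Wv)
```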

C3: Example of Explicit Construction of ReLU

Response

Thank you for your insightful comments. Here, we provide clarifications on why the constants can be explicitly instantiated and illustrate this using a simple example.

The key reason is that the function approximated by the sum of ReLUs is relatively simple in our context, such as the Sigmoid activation function. For such simple functions, it is straightforward to derive an explicit construction.

Here, we take the Sigmoid activation function as an example and propose one explicit construction method. Let $r(z)$ denote the Sigmoid function.

  1. Segment the input domain. For example, divide the domain $[-10, 10]$ into smaller intervals such as $[-10, -9], [-9, -8], \dots, [9, 10]$.

  2. Approximate each segment locally using a linear function via linear interpolation. For instance, in the domain $[9,10]$, approximate $r(z)$ using a linear function $a_1 z + c_1$, where $a_1$ and $c_1$ are calculated as follows: (i) $a_1 = (r(10)-r(9)) / (10-9)$. (ii) $c_1 = r(9) - 9 a_1$.

  3. Approximate the linear function $a_1 z + c_1$ ($z \in [9,10]$) using a sum of ReLU terms. This step involves two substeps, which are straightforward to implement: (i) Approximate the indicator function for $z \in [9,10]$ using a sum of ReLU terms. (ii) Approximate the constant $c_1$ using a sum of ReLU terms. This is because bias terms are not included in the sum of ReLU terms in Definition 4; the bias term $c_1$ must be approximated using an additional sum of ReLU terms.

  4. Combine the sum-of-ReLU approximators across all segments. Finally, integrate the approximations for all segments to construct the complete approximation.

  5. Estimation of the parameters in Definition 4: $\epsilon_{approx} = 0.625$, $R=10$, $H=80$, and $C=25$.

Furthermore, to achieve higher precision in the approximation, it is sufficient to use finer segmentations.
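To make steps 1-4 concrete, here is a minimal numpy sketch of the interpolation idea. It uses the standard identity that a continuous piecewise-linear interpolant equals an affine term plus a weighted sum of ReLU "kinks"; the exact parameterization of Definition 4 (where the bias part is itself re-expressed with ReLU terms) is omitted, and all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

# Step 1: segment [-10, 10] into unit intervals (knots at the integers).
knots = np.arange(-10.0, 11.0, 1.0)
vals = sigmoid(knots)

# Step 2: slope of the linear interpolant on each segment [knots[k], knots[k+1]].
slopes = np.diff(vals) / np.diff(knots)

# Steps 3-4: the piecewise-linear interpolant equals an affine start plus ReLU kinks
# that switch the slope at each interior knot:
#   f(z) = vals[0] + slopes[0]*(z - knots[0]) + sum_k (slopes[k] - slopes[k-1]) * relu(z - knots[k])
def relu_sum_approx(z):
    out = vals[0] + slopes[0] * (z - knots[0])
    for k in range(1, len(slopes)):
        out = out + (slopes[k] - slopes[k - 1]) * relu(z - knots[k])
    return out

# Step 5 analogue: check the approximation error; finer segmentation drives it down further.
z = np.linspace(-10.0, 10.0, 2001)
max_err = np.max(np.abs(relu_sum_approx(z) - sigmoid(z)))   # roughly 1e-2 with unit intervals
```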

Q1: Difference of Dimensions

Response

Thank you for your question. We apologize for any confusion caused. Here are some clarifications:

  • In Section 2.(i), the value of $D$ is derived from prior research on in-context learning (ICL) for functions, indicated as $D = \Theta(d)$. That setting does not pertain to our focus on models such as an $N$-layer neural network. We have removed these parts in the revised version to make things easier to read.

  • Regarding the zero values in the rows of $q_i$: these primarily consist of intermediate terms such as $\bar{p}_i(j), \bar{r}'_i(j), \bar{s}_t(j)$, which are essential for calculating partial derivatives and contain gradient information. These rows are initially set to zero.

Reviewer Comment

I thank the authors for their responses. While the revision is a little minimalist in my opinion, it addressed most of my key concerns. I will upgrade my score from 2 to 3 and encourage the authors to further improve their manuscript for the camera-ready version.

Also, I like your example of approximating sigmoid with ReLU, can you add this to the appendix?

Author Comment

Thank you for your thorough review and for the improvement in the score. We are pleased to have addressed many of your concerns and truly value your feedback. We are committed to further refining our manuscript for the camera-ready version.

We have updated our draft to include the example of approximating the sigmoid function with ReLU in the "Further Discussion" section (Appendix B.3). You can find the revised version, with changes highlighted in blue, in this anonymous Dropbox folder.

Review (Rating: 2)

This paper shows that one can construct weights of a ReLU-activation transformer that can simulate L steps of gradient descent on an N-layer ReLU network using in-context examples.

update after rebuttal

While I have concerns about the clarity and usefulness of the construction, I appreciate the authors' honesty and thus will stick to my raised score of 2.

Questions for Authors

Do trained transformers actually achieve the construction?

Claims and Evidence

The theory appears sound but the empirics are lacking.

"Experiments. We support our theory with numerical validations in Appendix G."

As far as I can tell, the experiments in Appendix G have nothing to do with constructing a transformer? Instead, they follow the more classical setup (e.g., of Garg et al.) of seeing if a transformer can infer the function f in context. No analogy is drawn to whether or not the solution found by the transformer is similar to the construction shown in the main text, beyond the fact that on certain evaluation data the performance matches doing GD on a ReLU network (but this could also just be the performance being "good" rather than the "same").

Methods and Evaluation Criteria

See above.

Theoretical Claims

See above.

Experimental Design and Analysis

See above.

Supplementary Material

Appendix G

Relation to Broader Literature

The paper appears to cite relevant literature. I'm not sure how relevant the finding is/if such a construction is actually of interest to the broader community.

Essential References Not Discussed

Not to my knowledge.

Other Strengths and Weaknesses

I think the paper suffers a lot from reduced clarity. It seems that most of the content of the paper is in the appendix, including empirical comparisons as well as discussion, which to me makes for a less clear read. I would suggest the authors re-write the main body of text to include more intuitions and takeaways (e.g. when some assumptions may hold in practice rather than just in theory).

Other Comments or Suggestions

N/A

Author Response

Thanks for your detailed review. We have revised our draft and addressed all concerns. The revised version (changes marked in BLUE) is available in this anonymous Dropbox folder.

Q1: Do trained transformers actually achieve the construction?

Response:

Thank you for the question. The trained transformers do not always achieve the construction. We apologize for any confusion caused, and acknowledge the limitation you mentioned. However, this limitation does not affect the primary contributions of our work. Here are some clarifications.

The main contribution of this paper is to provide an explicit construction demonstrating the existence of a transformer capable of simulating gradient descent (GD) for $N$-layer neural networks via in-context learning (ICL). The experimental design empirically validates that the trained transformer can indeed simulate GD steps for $N$-layer neural networks, supporting our theoretical results on the existence of such a transformer. Although there is a discrepancy between the theoretically constructed transformer and the empirically trained one, this difference does not weaken the central point of our work, which is establishing the theoretical existence of such a transformer.

We acknowledge the limitation you mentioned and have incorporated it into the revised version (lines 744-748).

C1: I think the paper suffers a lot from reduced clarity. It seems that most of the content of the paper is in the appendix, including empirical comparisons as well as discussion, which to me makes for a less clear read. I would suggest the authors rewrite the main body of text to include more intuitions and takeaways (e.g. when some assumptions may hold in practice rather than just in theory).

Response:

Thank you for your thoughtful suggestions. We agree that the main text can be better organized to improve readability and self-containment. In response, we have made the following modifications in the revised version:

  • The results on softmax attention and the experimental results have been incorporated into the main body—softmax results in Section 3.4 and experimental results in Section 4.

  • The proof sketches have been moved to Appendix D, just before the detailed proof of the lemmas.

  • We include a remark concerning the practicality of our assumptions (Remark 4, lines 1577-1579). Our assumptions remain modest. For example, we require that the loss function $l$, the activation function $r$, and its derivative be $C^4$-smooth. This condition is met by numerous network architectures, including those using the sigmoid activation function $r$ and the squared loss function $l$.


We hope our responses address your concerns and look forward to further feedback.

Reviewer Comment

I appreciate the authors' honesty. I've increased my score to a 2, assuming that the authors will factor in the clarity suggestions and will be correspondingly upfront in their paper:

Specifically,

"The trained transformers do not always achieve the construction."

This comment should appear in the intro and main text of the paper, not just the appendix. When providing theoretical constructions, as this paper does, it is important to be clear about how they differ from practice so as to not add noise -- for example, the results show that while a construction exists, it isn't typically found through training (i.e., by the outer loop, if ICL is viewed as the inner loop).

Author Comment

Thank you for your careful consideration of the responses and for increasing the score. We really appreciate your thorough evaluation.

We have revised our draft by adding comments at lines 056-058 (column 2) and lines 408-410 (column 2). The revised version (changes marked in BLUE) is available in this anonymous Dropbox folder.

Thanks very much!

Review (Rating: 4)

The paper introduces an approach that harnesses the transformer's in-context learning ability to emulate the training process of deep models. Its key contribution is the demonstration of a practical instance where a transformer is used to simulate the training process of a deep neural network. Furthermore, the paper appears to extend the work presented in "Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection," which provides a solid theoretical analysis.

Questions for Authors

N/A

Claims and Evidence

The paper poses the question, "Is it possible to train one deep model using the in-context learning (ICL) capability of another foundation model?" and answers it affirmatively. The paper offers theoretical guarantees for achieving approximation within any desired error margin and for the convergence of ICL-based gradient descent. The analysis centers on a practical setting involving Softmax-based transformers. The method is evaluated on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks, with results showing that ICL performance is on par with that of direct training.

Methods and Evaluation Criteria

The method was primarily motivated by theoretical considerations. The authors systematically formulate in-context learning (ICL), carefully defining the neural network's mathematical model. Their approach relies on a recursive application of the chain rule, approximating derivatives sequentially. The authors suggest optimizing a transformer to mimic in-context gradient descent across different configurations.
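To make the "recursive application of the chain rule" concrete, the following is a minimal numpy sketch of one gradient descent step on an $N$-layer ReLU network with squared loss, written as a generic textbook backward pass rather than the paper's token-level transformer construction; all names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gd_step(Ws, x, y, lr=0.1):
    """One gradient descent step on an N-layer ReLU network with squared loss
    0.5 * ||f(x) - y||^2, computed by recursively applying the chain rule."""
    # Forward pass: cache pre-activations and activations layer by layer.
    acts, pres = [x], []
    h = x
    for W in Ws[:-1]:
        z = W @ h
        pres.append(z)
        h = relu(z)
        acts.append(h)
    out = Ws[-1] @ h                        # linear output layer

    # Backward pass: propagate the error signal delta back through each layer.
    delta = out - y                         # dL/d(out) for squared loss
    grads = [None] * len(Ws)
    grads[-1] = np.outer(delta, acts[-1])
    delta = Ws[-1].T @ delta
    for i in range(len(Ws) - 2, -1, -1):
        delta = delta * (pres[i] > 0)       # chain rule through ReLU
        grads[i] = np.outer(delta, acts[i])
        delta = Ws[i].T @ delta             # chain rule through the linear map
    return [W - lr * g for W, g in zip(Ws, grads)]

# Toy usage: a 3-layer network on a single in-context example.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(1, 8))]
x, y = rng.normal(size=4), np.array([1.0])
Ws = gd_step(Ws, x, y)
```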

Theoretical Claims

The paper's contribution mainly lies in its mathematical formulation and theoretical analysis. The ultimate goal is to demonstrate that bounds can be established for the approximation.

Experimental Design and Analysis

The experiments primarily aim to validate the correctness of the theorems. Although they are not conducted on large neural networks, they are sufficient in my view.

Supplementary Material

I reviewed the experimental sections but did not go through the theoretical proofs in all the details.

Relation to Broader Literature

N/A

Essential References Not Discussed

From my perspective the paper provides adequate citations.

Other Strengths and Weaknesses

Overall, the paper is clearly written and easy to follow. The proposed method has been validated through both theoretical analysis and experimental results.

Other Comments or Suggestions

Same as previous comments.

Author Response

Thank you for your review! We greatly appreciate your attention to detail and recognition of our theoretical and experimental contributions! Your constructive comments and encouraging words are also highly appreciated!

Review (Rating: 4)

This paper studies the expressive power of transformer models to simulate gradient descent on other architectures like N layer feedforward networks using in-context learning. The authors corroborate their study with experiments on synthetic datasets that show that the in-context learning performance of Transformers matches direct training of deep networks.

Questions for Authors

Please find my questions below.

  • In Corollary 1.1, the error bound is exponential in the number of steps $L$. How would the bound vary if the transformer's size is allowed to increase beyond $(2N+4)L$? Can it be made polynomial by making the transformer size grow exponentially?

  • Theorem 1 requires Element-wise Multiplication Layers (EWMLs) to approximate Hadamard products. Can some form of approximate self-attention or MLP layers be used in their place? How would the size of the constructed model vary then?

Claims and Evidence

The authors present clean theoretical claims demonstrating the expressivity of transformer models. The authors extensively discuss the proof strategies for each theoretical claim. Furthermore, they conduct experiments in the appendix to support their theoretical claims.

Methods and Evaluation Criteria

The experiments in appendix have been conducted on toy datasets to corroborate their theoretical claims.

Theoretical Claims

I glanced through the proofs of each theoretical claim. Even though I didn't go through specific details, the theoretical proofs and statements are reasonable. Furthermore, the authors clearly corroborate their theoretical observations with clean empirical experiments.

Experimental Design and Analysis

The experiments in appendix have been conducted on toy datasets to corroborate their theoretical claims.

Supplementary Material

I checked the empirical results in the supplementary material and simply glanced over the proofs of the theoretical statements.

Relation to Broader Literature

The paper is relevant to the current interests of the scientific community. Exploring the strengths of in-context learning of transformer models has been a topic of interest for the last few years and this paper takes an important step towards this direction.

Essential References Not Discussed

The authors have extensively discussed the important references.

Other Strengths and Weaknesses

The main strength of this paper lies in its motivation to study the strengths of in-context learning abilities of real world transformer models. The paper first discusses the pros and cons of existing studies on similar topics and attempts to specialize the framework to understand expressive power for simulating multi-step gradient descent of N-layer feedforward networks.

Other Comments or Suggestions

I don't see any typos.

Author Response

Q1: How would the bound vary if the transformer's size is allowed to increase beyond $(2N+4)L$? Can it be made polynomial by making the transformer size grow exponentially?

Response:

Thank you for your question. The bound does not change if the transformer's size exceeds $(2N+4)L$, and the error cannot be reduced to a polynomial level. Here are some clarifications:

  • Each $(2N+4)$-layer transformer simulates one gradient descent (GD) step, and we stack such transformers $L$ times.

  • In non-convex optimization, each step's trajectory depends on the previous steps. This dependency causes the exponential accumulation of the error (see the sketch below).
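For intuition, the exponential dependence on $L$ follows from a standard error-propagation argument (a generic sketch under a per-step Lipschitz assumption; the precise constants differ from those in our analysis): let $w_{l+1} = G(w_l)$ denote the exact GD update and $\hat{w}_{l+1} = \hat{G}(\hat{w}_l)$ its simulation by one $(2N+4)$-layer block, with per-step approximation error $\|\hat{G}(w) - G(w)\| \le \epsilon$ and Lipschitz bound $\|G(w) - G(w')\| \le C \|w - w'\|$. Then the simulation error $e_l = \|\hat{w}_l - w_l\|$ satisfies $e_{l+1} \le C e_l + \epsilon$, so $e_L \le \epsilon (C^L - 1)/(C - 1)$, which grows exponentially in $L$ whenever $C > 1$; enlarging the transformer shrinks $\epsilon$ but not the factor $C^L$.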

Q2: Can some form of approximate self-attention or MLP layers be used in their place? How would the size of the constructed model vary then?

Response:

Thank you for your insightful question. The answer is No. Here are some clarifications.

  • Explicit construction of transformer: We aim to provide an explicit transformer construction that simulates $L$ gradient descent steps. Although self-attention or MLP layers can approximate Hadamard products through their universal approximation capabilities, such approximations would result in non-explicit constructions.
  • Specialized variant of self-attention: EWML can be viewed as a specialized variant of self-attention that retains only the diagonal scores.

We hope our responses address your concerns and look forward to further feedback.

Final Decision

The main contribution of the paper is to show that transformers can simulate gradient descent on MLPs within a single forward pass. The idea of using one deep neural network to simulate the training dynamics of another is an interesting and timely topic. I read the paper carefully to better understand the contribution, but I believe the writing can be significantly improved. For example, I found the proof of Theorem 6—which presents one of the core results—difficult to follow. In particular, it is unclear why the transformer parameters are assumed to be independent of the input when invoking the universal approximation theorem. One of my central questions is why the authors do not simply invoke the universal approximation theorem to show that the transformer can represent global minimizers of the target MLP. Despite these concerns, and in light of the overall positive reviews, I vote for acceptance. However, I strongly encourage the authors to improve the clarity of their presentation and elaborate on the issues mentioned above.