5.0

/10

Rejected3 位审稿人

最低5最高5标准差0.0

2.0

置信度

ICLR 2024

The Closeness of In-Context Learning and Weight Shifting for Softmax Regression

Shuai Li,Zhao Song,Yu Xia,Tong Yu,Tianyi Zhou

OpenReview PDF

提交: 2023-09-23更新: 2024-02-11

摘要

Large language models (LLMs) are known for their exceptional performance in natural language processing, making them highly effective in many human life-related tasks. The attention mechanism in the Transformer architecture is a critical component of LLMs, as it allows the model to selectively focus on specific input parts. The softmax unit, which is a key part of the attention mechanism, normalizes the attention scores. Hence, the performance of LLMs in various NLP tasks depends significantly on the crucial role played by the attention mechanism with the softmax unit. In-context learning is one of the celebrated abilities of recent LLMs. Without further parameter updates, Transformers can learn to predict based on few in-context examples. However, the reason why Transformers becomes in-context learners is not well understood. Recently, in-context learning has been studied from a mathematical perspective with simplified linear self-attention without softmax unit. Based on a linear regression formulation $ \min_x \| Ax - b \|_2 $, existing works show linear Transformers' capability of learning linear functions in context. The capability of Transformers with softmax unit approaching full Transformers, however, remains unexplored. In this work, we study the in-context learning based on a softmax regression formulation $ \min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2 $. We show the upper bounds of the data transformations induced by a single self-attention layer with softmax unit and by gradient-descent on a $ \ell_2 $ regression loss for softmax prediction function. Our theoretical results imply that when training self-attention-only Transformers for fundamental regression tasks, the models learned by gradient-descent and Transformers show great similarity.

关键词

In-Context LearningSoftmax RegressionAttention Computation

评审与讨论

审稿意见

评分: 5置信度: 22023-10-27

The authors aim at improving our understanding of in-context learning from a theoretical perspective. Previous work has proved that a simplified self-attention layer can "in-context learn" the gradient step of a linear regression. The authors propose to show the same in the case of a softmax regression, which they propose as an intermediate step between the linear regression and the actual operation done by self-attention. The appendix contains empirical results that compare self-attention to softmax regression and that corroborate the theoretical findings.

优点

The problem considered, i.e. a theoretical understanding of in-context learning is significant with the rise of LLMs.
The abstract is well-written.
The review of previous work positions the paper well and makes clear what are the novel contributions.
In the appendix there is an empirical verification that a softmax regression model and a single SA layer are similar on one gradient descent step. The empirical approach seems sound as it follows previous work.

缺点

I had quite a bit of trouble reading the paper. I was unable to fill quite a few logical steps that I deem significant. I have left questions regarding them.

问题

My initial rating inclines towards rejection (with low confidence in my assessment): upon reading the paper I am missing a few logical steps that I deem significant. I have left questions regarding these; if you could clarify them it would greatly help me to improve my assessment of the paper.

Major Questions:

Definition 1.3: I miss the motivation for why this would advance the understanding of in-context learning for the Transformer. It seems that the problem solved by Self-Attention would involve the matrix A quadratically in the exponential, while here A appears linearly in the exponential. Could you elaborate on why this intermediate step is useful for analyzing what would happen in the Transformer?
Upon reading the text a few times, I still do not understand why the bounds of Thm 5.1 and Thm 5.2 would imply that the transformation induced by the layer would approximate the gradient step, or if I understood Oswald et. al correctly, that at least there is a choice of layer parameters that would make it approximate the gradient step.

Minor Questions:

(page 2, first equation): I do not understand why one needs to introduce a generalized attention formulation if the considered problem is quite simplified.

评论- Rebuttal by Authors

2023-11-22

Thank the reviewer for the feedback. We provide our responses as follows:

Response to Question 1:

The motivation behind Definition 1.3 (Softmax Regression) is to bridge the gap between the simple linear models studied in prior work [1] and the more complex softmax-based models used in actual LLMs. While in a full Transformer model, matrix 'A' appears quadratically in the exponential, our setting of 'A' in the exponential allows for a tractable analysis while still capturing essential characteristics of the softmax function. This formulation makes our analysis more relevant to actual Transformer models in LLMs than purely linear models, yet more manageable than full quadratic models.

Response to Question 2:

Thanks for your careful reading. Let us emphasize the high level idea of our results here. A self-attention layer update can be treated as an update to the tokenized document $A$ [1]. The gradient descent can be treated as an update to the model parameter $x$ here. In the softmax regression loss function, these two update can be treated as the change of the value $b$ here. Hence, by showing the similarity between the change of $b$ , we show the closeness between them. In [1], they show the change of some specified parameters by gradient descent approximate the ability of in-context learning. This does not contradict the results stated in [1] but rather adds a new dimension to understanding the interplay between layer transformations and gradient steps.

Response to the Minor Question:

Our approach acknowledges the value of simplicity as an initial foundation, drawing from a range of theoretical papers [2,3,4,5]. Starting with a simpler version allows us to build a solid starting point for understanding the broader context.

[1] Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., & Vladymyrov, M. (2023, July). Transformers learn in-context by gradient descent. In International Conference on Machine Learning (pp. 35151-35174). PMLR.

[2] Alman, Josh, and Zhao Song. "Fast attention requires bounded entries." arXiv preprint arXiv:2302.13214 (2023).

[3] Deng, Yichuan, Sridhar Mahadevan, and Zhao Song. "Randomized and deterministic attention sparsification algorithms for over-parameterized feature dimension." arXiv preprint arXiv:2304.04397 (2023).

[4] Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. "Reformer: The efficient transformer." arXiv preprint arXiv:2001.04451 (2020).

[5] Katharopoulos, Angelos, et al. "Transformers are rnns: Fast autoregressive transformers with linear attention." International conference on machine learning. PMLR, 2020.

评论- Reviewer Reply

2023-11-22

Thanks for the reply. I decided to keep my score after reading the other reviews.

审稿意见

评分: 5置信度: 22023-10-31

This work studies the in-context learning based on a softmax regression mostly approaching the vanilla self-attention and gives the upper bounds of the data transformations driven by gradient descent for a single self-attention layer. Nevertheless, the paper's structure appears to lack the necessary depth to fully elucidate critical findings, such as the significance of their contributions to advancing our understanding of in-context learning beyond existing literature.

优点

This work examined the in-context learning process based on a softmax regression, aiming to illustrating Transformer’s attention mechanism.

缺点

The structure appears insufficient to fully elucidate the internal mechanisms of in-context learning relying on a single self-attention layer.
The significance of the findings, such as the upper bounds of data transformation, is somewhat understated, rendering them supplementary to prior research.
Certain mathematical proofs and deductions may benefit from a more concise presentation, potentially relocating them to the appendix, while experimental results could find a more prominent place in the main paper.

问题

Could you consider reorganizing the paper to enhance its comprehensiveness and clarity? One suggestion is to separate the model definitions and theorems from the introductory section for better clarity.

评论- Rebuttal by Authors

2023-11-22

Thank the reviewer for the feedback. We provide our responses as follows:

Response to Weakness 1:

We would like to clarify that our approach follows previous work [1] to study in-context learning in the simplified setting of a single self-attention layer, which allows for a more controlled analysis and clearer insights. Importantly, our work goes a step further by incorporating the crucial softmax unit, a component that was further simplified for the linear setting in the previous work. This inclusion is significant as the softmax unit is integral to the functioning of LLMs, particularly in the context of natural language processing tasks. We believe that this approach, while simplified, provides valuable insights into the in-context learning mechanisms of LLMs and lays the groundwork for more complex analyses in the future.

Response to Weakness 2:

We appreciate the feedback and will take steps to highlight the importance and novelty of these findings more clearly in our revised manuscript. Our work provides a significant extension to the existing body of research, and we aim to ensure that this is adequately reflected and understood by our readers.

Response to Weakness 3 and Question 1:

Thanks for the feedback on the presentation of the paper. We will definitely consider your suggestions to improve the clarity and flow of the paper, allowing readers to more easily grasp the theoretical framework before delving into the specific results and their implications.

2023-11-22

Thanks very much for your reply!

审稿意见

评分: 5置信度: 22023-11-01

This paper is about clarifying the relationship between in-context learning in LLMs and weight shifting for softmax regression. The paper tries to understand in-context learning of Transformer models, specifically self-attention, in the perspective of softmax regression. The optimization of the attention module could be seen as the following softmax regression problem: $\min_{X \in \mathbb{R}^{d\times d}} \lVert D^{-1} \exp ( AXA^\top ) -B \rVert_F$ where $A \in \mathbb{R}^{n \times d}$ is a matrix for document having length $n$ and embedding size $d$ , $X$ a weight matrix, and $B$ the target distribution for the probabilities resulting from softmax. Beyond prior work that simplified the above definition by $\min_x \lVert Ax - b \rVert_2$ s.t. $A \in \mathbb{R}^{n\times d}, b \in \mathbb{R}^n$ , it uses the following more formulation which is proposed in Deng et al. (2023b) and argued to be more close to the definition: $\min_{x \in \mathbb{R}^d} \lVert \langle \exp(Ax), \mathbf{1}_n \rangle^{-1} \exp(Ax) - b \rVert_2.$

From the above formulation, the loss function following Deng et al. (2023b), which can further be simplified by the shorthand form $L_{\exp}(x) = 0.5 \lVert c(x) \rVert_2^2$ where $c(x):=f(x)-b, f(x)=\alpha(x)^{-1} \exp(Ax), \alpha(x):=\langle \exp(Ax), \mathbf{1}_n \rangle.$

Then, Lipschitz bounds for $\lVert f(x_{t+1}) - f(x_t) \rVert_2$ and $\lVert f(A_{t+1}) - f(A_t) \rVert_2$ are used to bound $\lVert \tilde{b} - b \rVert_2$ with respect to $\lVert f(A_{t+1}) - f(A_t) \rVert_2,$ which reveals the relationship between softmax weight shifting and in-context learning.

优点

Rich explanation on preliminaries
Mathematical notations are defined thoroughly

缺点

More comparison with the work from Deng et al. (2023b) is needed, which seemingly to be the work most closely related to this work.
It is hard to distinguish this work's contribution and other prior work's contribution. For example, some important definitions and theorems are already proven in Deng et al. (2023b). I believe this could be made more clear.
Organization of the content is preferred to be more focused on what to be proven. i.e. "why bounding the single step of $x$ and $A$ relates to clarifying the relationship between in-context learning and softmax weight shift.

Typo: Lipschtiz → Lipschitz

问题

See above

评论- Rebuttal by Authors

2023-11-22

Thank the reviewer for the feedback. We provide our responses as follows.

Response to Weakness 1:

We believe our contributions are distinct and significant compared to Deng et al. (2023b). Concretely, Deng et al. (2023b) focuses primarily on softmax regression in a broader context, whereas our work specifically studies the in-context learning capabilities of transformer-based large language models from a softmax regression perspective. Our main results in Section 5, including the Lipschitz bounds and the bounded shift for in-context learning (Theorems 5.1 and 5.2), are unique contributions that extend beyond the scope of Deng et al. (2023b). We will add a more detailed comparison in the revised paper to clarify these differences.

Response to Weakness 2:

Our paper builds upon the understanding of in-context learning in LLMs, specifically addressing how the self-attention computation with the softmax unit impacts this learning process. We have established bounds on data transformations in the transformer architecture, which was not explicitly covered by Deng et al. (2023b). Moreover, our experimental validation supports our theoretical findings, further distinguishing our work. We will emphasize these points more clearly in the revised manuscript to highlight our unique contributions.

Response to Weakness 3:

We appreciate the feedback on the organization of our paper. In the revision, we will restructure the content to focus more on our primary contributions.

Thank you for pointing out the typo. We will correct "Lipschtiz" to "Lipschitz" in the revised paper.

AC 元评审

2023-12-07

This paper investigates the in-context learning of large language models with softmax units, providing a theoretical analysis of the data transformations induced by single self-attention layers and gradient descent on a softmax regression loss, revealing the similarities between models learned by gradient descent and Transformers for fundamental regression tasks. However, the consensus among reviewers is that the paper is not ready for acceptance in its current form, and I don't think the authors provided a solid rebuttal to address the concerns raised. As a result, I recommend not to accept the paper at this stage, and further improvements (including the writing) are needed before it can be considered for acceptance.

为何不给更高分

n/a

为何不给更低分

n/a

最终决定Reject

2024-01-16

Reject