5.8

/10

Rejected4 位审稿人

最低5最高6标准差0.4

3.3

置信度

正确性2.8

贡献度2.5

表达2.5

ICLR 2025

Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

Yingyu Liang,Zhizhou Sha,Zhenmei Shi,Zhao Song,Yufa Zhou

OpenReview PDF

提交: 2024-09-19更新: 2025-02-05

摘要

The computational complexity of the self-attention mechanism in popular transformer architectures poses significant challenges for training and inference, and becomes the bottleneck for long inputs. Is it possible to significantly reduce the quadratic time complexity of computing the gradients in multi-layer transformer models? This paper proves that a novel fast approximation method can calculate the gradients in almost linear time $n^{1+o(1)}$ where $n$ is the input sequence length, while it maintains a polynomially small approximation error $1 / \mathrm{poly}(n)$ across the entire model. Our theory holds for general loss functions and when the multi-layer transformer model contains many practical sub-modules, such as residual connection, casual mask, and multi-head attention. By improving the efficiency of gradient computation, we hope that this work will facilitate more effective training and deployment of long-context language models based on our theoretical results.

关键词

Multi-Layer TransformersAttention AccelerationGradient Computation

评审与讨论

审稿意见

评分: 5置信度: 32024-10-28

This paper presents a novel method to approximate the gradient computation in multi-layer transformer models with almost linear time complexity. Traditionally, the self-attention mechanism in transformers has a quadratic time complexity with respect to the input sequence length, which becomes a bottleneck during training, especially for models with long input contexts. The authors propose a fast approximation algorithm that calculates gradients in time proportional to $𝑛^{1+o(1)}$ , where n is the input length, and achieves a polynomially small approximation error. This method is applicable to general loss functions and transformers with practical components such as residual connections, multi-head attention, and causal masks. The results hold significant potential for improving the training efficiency of large language models without sacrificing much accuracy.

优点

The paper introduces a method to approximate gradient computation in multi-layer transformers in almost linear time, breaking the quadratic time complexity barrier. This efficiency boost is crucial for training large language models, especially for long-context tasks, and could reduce both computational and energy costs.
The proposed algorithm is versatile, supporting general loss functions and working with commonly used transformer components such as residual connections, multi-head attention, and causal masks. This ensures that the method can be applied to a wide range of practical transformer architectures and tasks without major adjustments.

缺点

While the paper provides a strong theoretical foundation for the proposed method, there is no empirical evidence or experimental results presented to validate its performance in real-world transformer models. This leaves a gap in understanding how well the method performs in practice compared to theoretical expectations.
Although the approximation error is stated to be polynomially small, the paper does not deeply explore how this approximation might affect model performance in different tasks. There could be edge cases where the accuracy trade-offs are more significant, especially in critical applications requiring precise outcomes.
The paper acknowledges that practical implementation, especially on GPUs, may face coding and optimization challenges, such as needing to redefine tensor operations and re-implement certain backpropagation functions. This could limit the immediate adoption of the method by practitioners looking for a ready-to-use solution.

问题

Could the authors provide empirical results or benchmarks to demonstrate the practical efficiency and effectiveness of the proposed gradient approximation method on real-world large language models? How does it compare to state-of-the-art methods in terms of both speed and accuracy?
The paper claims that the approximation error is polynomially small, but how does this error affect downstream tasks such as text generation or classification? Are there scenarios where this approximation might lead to a noticeable degradation in performance, especially in sensitive applications?
The authors mention potential coding challenges for implementing this method, particularly on GPUs. Have the authors considered these challenges in detail, and can they provide more concrete guidance or solutions for implementing this algorithm efficiently in real-world systems, especially for large-scale models like GPT-4 or LLaMA?

2024-11-17

We thank the constructive comments from the reviewer. We provide some clarification as follows.

W1 & Q1: Lack of empirical results and comparison with state-of-art methods in terms of speed and accuracy.

Thanks for your valuable suggestion. Firstly, we would like to point out that the main focus of our work is theoretical analysis. Nevertheless, to demonstrate the effectiveness of our proposed method, we have also conducted experimental evaluations. See Global Response Part 2: Empirical evaluation.

W2 & Q2: How does the approximation error affect the performance of the downstream tasks?

Thanks for your insightful comments. For downstream tasks, particularly those that demand high precision, we recommend

Replacing some part of the approximated attention module to standard attention computation and fine-tune on specific datasets to bypass the slight approximation error.
Also, the approximation error diminishes quickly with the input length $n$ . Furthermore, existing empirical work shows [1, 2, 3] that the approximation error of the attention matrix doesn't affect the downstream performance that much, e.g., approximating the softmax attention doesn't change the output probabilities that much.

W3 & Q3: What are the detailed challenges for implementing this method on GPUs? Can the author provide more concrete guidance on how to implement this method?

Thanks for pointing this out. In Part 2: Empirical evaluation. in the Global Response, we present preliminary results demonstrating the efficiency of our method. However, as outlined in Part 3: Potential implementation challenges on GPUs., implementing the full gradient computation involves several challenges.

[1] Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., ... & Chen, B. H2o: Heavy-hitter oracle for efficient generative inference of large language models. NeurIPS’23.

[2] Jiang, H., Li, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., ... & Qiu, L. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. NeurIPS’24.

[3] Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., ... & Chen, D. Snapkv: Llm knows what you are looking for before generation. NeurIPS’24.

评论- Re:

2024-11-25

Thanks for addressing my concerns. I have increased my score based on the authors' responses.

评论- Thank you

2024-11-25

We are glad that our response addressed your concerns and thank you for your thoughtful feedback. We appreciate your valuable time and increasing score!

审稿意见

评分: 6置信度: 32024-11-04

In this paper, the authors provide a number of algorithms that enable one to approximately compute the gradient (and forward pass) of a multi-layer Transformer in time that is nearly-linear in $n$ , the input sequence length. The key challenge is in dealing with the self-similarity matrix, which has $O(n^2)$ entries. The dimension $d$ is assumed to be $O(log(n))$ .

In Theorem 4.1, authors establish nearly-linear time guarantees for computing the gradient of a single-layer Transformer.

In Theorem 4.2, the authors generalize this result to a Transformer with up to $m=n^{o(1)}$ layers.

In both Theorem 4.1 and 4.2, the guarantee is roughly $1/poly(n)$ error in $n^{1+o(1)}$ time.

In section 6, the authors discuss further extensions of their results to various settings, including multi-headed attention, causal mask, and prompt tuning.

优点

Overall, I think being able to compute the attention (both forward and backward) is a question of great theoretical and practical interest.

I think the extensions to a $m$ -layer Transformer (Theorem 4.2) is interesting, well-motivated, and proof is non-trivial. (However, I do have trouble verifying some parts of the proof, see below).

缺点

My main criticism of the paper is that, while the results are interesting and important, I am uncertain about the contributions of the present paper, given the existing papers on this topic.

Theorem 4.1 shows fast gradient computation for a single layer. This result seems identical (at least in big-O notation) to the result in Theorem 1.6 of [1]. On line 446, authors claim that they generalize the results in [1] in two ways: (a) accomodate general loss function L(X), and (b) generalize their result to both $W_i$ and $W_{V_i}$ . However, I do not see why (a) is meaningful -- for the purpose of computing gradients, one can wlog consider a linear $L(X)$ , which is exactly handled by [1]. I also do not see why (b) is meaningful -- [1] already handles derivative wrt $W_{i}$ , whereas the derivative of $W_{V_i}$ is just a minor extension of existing results (see Remark 1.3 of [1]).

I feel that the generalization of 4.1 to the multi-layer setting is potentially quite nontrivial -- the error from a single-layer's grradient computation can blow up over layers, and thus handling this is non-trivial. However, I had a bit of trouble verifying the part of the proof that bounds this error propagation. The main result seems to be Lemma 5.5 (Theorem H.4). In this proof, the key seems to be bounding error accumulation in the gradient computation over $m$ layers, whose proof I have some trouble verifying. (see item 2 under questions below)

Finally, the authors discuss a number of extensions of their work to handling things like multi-headed attention, and causal masks. I think the handling of causal mask is also interesting, but the result seem to be almost immediate following [2].

问题

In Theorem 1.6 of [1], authors show that there is a $n^{1+o(1)}$ time algorithm that computes the gradient up to 1/poly(n) accuracy. Can the authors please comment on the difference between this result and Theorem 4.1? Both in terms of guarantees, as well as in terms of proof technique. Does the present paper's proof use Theorem 1.6 in any way?
Can the authors elaborate on the proof of Theorem H.4 for the multi-layer analysis? Specifically, I refer to line 3646-3650: how did $n/poly(n)$ end up becoming $1/poly(n)$ ? Since the authors are using an inductive proof, I don't think one can rigorously prove the inductive step with explicitly writing down the degree in $poly(n)$ . Related to this problem, I don't see where the authors use the assumed bound on $m$ , surely, the inductive proof has to use this fact?
Can the authors please comment on what are the key challenges and theoretical contributions in proving their result on causal masks? In particular, please highlight what additional work was done on top of the results by [2].

minor suggestions that did not affect score

I suggest to remove mention of $m =n^{o(1)}$ in Theorem 4.1 if you are going to assume $m=1$ in the same theorem.
line 401-410 seems to be a duplication of 409-416.

[1] The Fine-Grained Complexity of Gradient Computation for Training Large Language Models, Alman and Song 2024

[2] Conv-basis: A new paradigm for efficient attention inference and gradient computation in transformers. Liang et al 2024a

2024-11-17

We thank the intuitive comments from the reviewer. Here are some clarifications from us.

W1 & Q1.1: What is the difference between Theorem 4.1 and Theorem 1.6 of [1]?

Theorem 4.1 presented in our paper offers two distinct advantages over Theorem 1.6 from [1]:

Support gradient computation for three variables: Theorem 4.1 demonstrates our ability to compute the gradient with respect to three variables—intermediate variable $T_i$ , weight matrix $W_i$ , and value matrix $W_{V_i}$ —in almost linear time, $n^{1+o(1)}$ . This contrasts with Theorem 1.6 in [1], which only facilitates the almost linear time calculation of the gradient for the weight matrix $W_i$ . Note that analyzing the gradient for $T_i$ is crucial and non-trivial since it is the core component that supports gradient backpropagation between multiple layers of the transformer.
Compatibility with any differentiable loss function: Our theorem (Theorem 4.1) extends support to any differentiable loss function, thereby accommodating the commonly employed cross-entropy loss function. In contrast, Theorem 1.6 in [1] is limited to the $\ell_2$ loss for the attention matrix, precluding its direct application to the training of contemporary large language models (LLMs) that necessitate the use of cross-entropy loss.

Q1.2: How is Theorem 1.6 of [1] used in your proof?

We leverage Theorem 1.6 of [1] in our proof for the gradient computation of the weight matrix $W_i$ , with a detailed explanation provided in Section F of the Appendix.

W2 & Q2: How did $n/poly(n)$ end up becoming $1/poly(n)$ ? Can the author write down the degree of $poly(n)$ ? Where is the $m$ used in the induction?

Rigorously speaking, we acknowledge that the polynomial term $poly(n)$ in the expressions $n/poly(n)$ and $1/poly(n)$ does not correspond to the polynomial with the same degree. Given that we treat the number of layers $m$ as a constant in our study, the $poly(n)$ term will indeed be asymptotically larger than $n^m$ . Consequently, in our proof, we treat $n/poly(n)$ as effectively equivalent to $1/poly(n)$ .

The number of transformer layers, denoted as $m$ , is indeed a factor in the inductive process. When accounting for $m$ , the computational time required to calculate the gradient across an $m$ -layer transformer is $m n^{1+o(1)}$ . However, in practical scenarios, the sequence length $n$ is exceedingly large, whereas the number of layers $m$ remains relatively small and can be treated as a constant. This is the rationale behind the absence of $m$ in our final runtime analysis, as its impact is negligible compared to the sequence length.

W3 & Q3: What are the key challenges and theoretical contributions in proving the results of the Causal attention mask?

Causal masking poses a challenge for low-rank approximations because it results in a full-rank attention matrix. We used the algorithm from [2] that separately computes entries of the low-rank representations of the attention matrix. However, we face a new challenge of a new structure of gradient compared to [2]. Due to the complicated form of the gradient, we first categorize gradient components into dot product and Hardamad product while [2] did not have this complexity. In detail, in Lemma I.6 Line 3758, [2] only addresses the dot product form of Part 1 while in our case we have to deal with the Hadamard form Part 2. Moreover, we utilize row-wise Kronecker product $\oslash$ (Line 3784) to solve Part 2 and [2] did not analyze this.

Minor suggestions

Thanks for pointing out the typo in Theorem 4.1. We have removed $m =n^{o(1)}$ in Theorem 4.1. And thanks for pointing out the duplication of line 401-410 and 409-416. We have fixed this duplication, and have uploaded the revised paper.

[1] Alman, J., & Song, Z. The fine-grained complexity of gradient computation for training large language models. NeurIPS’24.

[2] Liang, Y., Liu, H., Shi, Z., Song, Z., Xu, Z., & Yin, J. Conv-basis: A new paradigm for efficient attention inference and gradient computation in transformers. arXiv preprint arXiv:2405.05219.

评论- follow-up clarifications

2024-11-19

Note that analyzing the gradient for $T_i$ is crucial and non-trivial since it is the core component that supports gradient backpropagation between multiple layers of the transformer.

Can the authors please comment on Remark 1.3 of [1]: "...since the final matrix computed in the norm in L depends only linearly on Y, it is straightforward to incorporate it into either an algorithm or lower bound"? Does this remark not simply apply per-layer with $T_i=Y$ ? (I refer to the version of [1] in https://arxiv.org/pdf/2402.04497)

Our theorem (Theorem 4.1) extends support to any differentiable loss function, thereby accommodating the commonly employed cross-entropy loss function. In contrast, Theorem 1.6 in [1] is limited to the $\ell_2$ loss for the attention matrix.

I think I was a bit unclear in my previous comment, so let me reiterate here: say we let $V$ denote the output of the Transformer. [1] computes the gradient with respect to the objective $||V - E||^2$ , for any arbitrary $E$ .

Your paper shows that one can compute gradient for $F(V)$ for any function $F$ .

However, at fixed input $X$ (and thus at fixed $V$ ), the gradient of $F(V)$ is the same as the gradient of $<E, V> \qquad \text{where $E=\nabla F(V)$ is a constant vector}$ the above is the same as the gradient of $0.5* ||E - V||^2 - 0.5 * ||V||^2. \qquad \text{(again $E=\nabla F(V)$ is a constant vector)}$

The last objective is exactly handled by [1] with $E=\nabla F(V)$ . Therefore it seems to me that [1]'s setup already enables gradient computation for arbitrary $F$ 's, including cross-entropy. Can the authors please comment on whether this reasoning is correct?

the $poly(n)$ term will indeed be asymptotically larger than $n^m$ .

Can the authors please add a note, when discussing your main results, that the $m$ dependence is $n^m$ ? I know that you assume $m$ to be constant, and already have the $n^{1+o(1)}$ bound. However I feel that exponential dependence on $m$ is something that people will care about (despite your assumptions), and the $n^m$ dependence is easy to miss as the paper is written currently.

Causal masking poses a challenge for low-rank approximations because it results in a full-rank attention matrix...

Thank you for your explanation. I think results for handling the mask is quite valuable.

评论- Thank You and Authors Reply to Follow-up Questions

2024-11-20

New Q1.1: Comment on Remark 1.3 of [1].

Thanks for your insightful observations. Firstly, we would like to point out that the notation system used by [1] is different from our notation system. As discussed in the sentence above Definition 1.2 in [1], their $X \in \mathbb{R}^{d \times d}$ is defined by $X = Q K^\top$ and their $Y \in \mathbb{R}^{d \times d}$ is defined by $Y = V$ .

We provide a table for the notations used in our paper and [1] as follows:

Variable	Notation in our paper	Notation used in [1]
$W_Q W_K^\top \in \mathbb{R}^{d \times d}$	$W_i$	$X$
$W_V \in \mathbb{R}^{d \times d}$	$W_{V_i}$	$Y$
input data $\in \mathbb{R}^{n \times d}$	$X$	$A_1, A_2, A_3$
intermediate variable $\in \mathbb{R}^{n \times d}$	$T_i$	None

Then, we will discuss the Remark 1.3 of [1]. For simplicity, the following discussion is based on our notation system. Firstly, the contribution of [1] (Theorem 1.6 of [1]) is that they accelerate the gradient computation for $W_i$ , whereas in our paper (Theorem 4.1 in our paper), we proposed the acceleration for the gradient on $X$ , $W_i$ and $W_{V_i}$ , which means the result of [1] is only a subset of our Theorem 4.1.

New Q1.2 Does this remark (Remark 1.3 of [1]) simply apply per-layer with $T_i = Y$ ?

We would like to point out that our $T_i$ and their $Y$ are not comparable. They do not have the same dimension. Please refer to the new Q1.1, where we have pointed out that our $T_i \in \mathbb{R}^{n \times d}$ is the intermediate variable/hidden state for each layer, and their $Y \in \mathbb{R}^{d \times d}$ in [1] only represents value weight matrix $V \in \mathbb{R}^{d \times d}$ in [1].

New Q2: The relationship between any differentiable loss function $F$ and the $\ell_2$ loss.

Thank you for pointing this out! We acknowledge your derivation. We agree that any differentiable loss function is inherently equivalent to $\ell_2$ loss. Thus, we edited the draft based on your valuable comments, in the Line 130 & 355-356 & 361-362 & 400-404.

New Q3: The $poly(n)$ term.

Thank you for your suggestion. We added the dependency of $m$ is $n^m$ is our revised paper. Please refer to Line 344-347.

[1] Alman, J., & Song, Z. The fine-grained complexity of gradient computation for training large language models. NeurIPS’24.

2024-11-20

Thank you for your explanation. on X and Y, I was mistaken in my original understanding of remark 1.3 of [1].

I have increased my score to 6 for now.

评论- Thank you

2024-11-20

We are glad that our response clarifies your concern. We sincerely thank you for your valuable suggestions and time. We appreciate your improving score.

审稿意见

评分: 6置信度: 32024-11-11

The paper addresses the important problem of reducing the quadratic computational complexity of gradient calculation in transformers, proposing an approximation algorithm that computes gradients in almost linear time with a bounded polynomial error. The work builds on recent work in reducing the quadratic complexity of the forward pass and gradient computation for single-layer transformers to linear complexity.

优点

The work focuses on a highly relevant topic, given the growing context length in the training of LLMs, and highlights the need for algorithms with solid theoretical backing to improve the quadratic complexity of gradient computations in the attention mechanism.

The results seem broadly applicable, as they hold across different loss functions (unlike previous studies) and can incorporate important elements of multi-layer transformers, such as MLPs, residual connections, and causal masking.

缺点

The work is very interesting, but I have a few questions and points to raise.

In general, I feel that there are missing intuitions about the approach in the main body. For example, the technical overview section mentions that the attention matrix can be approximated by low-rank matrices. The related work section (line 157) states that the work uses polynomial approximation techniques (Aggarwal and Alman, 2022), but there is no further discussion about these techniques in the main body. I strongly believe that a small discussion section should be included, as this addresses the heart of the challenge.

Similarly, for Algorithm 1, all the steps reference lemmas in the appendix, but there is no discussion of these steps in Section 5. While Section 5 discusses accelerating the gradient for individual variables, there should be a connection and reference to Algorithm 1 in this discussion. Furthermore, I strongly suggest including a small proof sketch for the key lemma, Lemma 5.1, which shows the linear time complexity of the gradient computation with respect to $T_i$ .

Lastly, paragraphs 401-407 and 409-413 essentially say the same thing with minor rewording—please address this repetition.

Some questions:

—What assumptions do the results of Theorem 4.2 (linear time complexity and $1/poly(n)$ error) require to hold?

—The results operate in a very practical regime of $d≪n$ , but $d=O(log⁡ (n))$ seems too small, considering practical values like 1024–4096. I understand that such assumptions are often necessary for theoretical results, but how sensitive do you think the results are to this choice? The same question applies to the sub-polynomial choice for the number of layers $m$ and the logarithmic bit precision. Does this imply a practical limit on the number of layers for the algorithm to remain efficient?

—It would be useful to see the algorithm in practice and compare it to quadratic attention. However, I feel that the benefits will only be apparent when n is very large (e.g., at least 100k, which requires significant computational resources and may be difficult to run. Have the authors had a chance to experiment with this?

问题

See Weaknesses section.

2024-11-17

We extend our gratitude to the reviewer for their meticulous feedback. We offer the following elucidations:

W1: Lack of intuition regarding low-rank approximation techniques in the main text

Thanks for your suggestions. We have revised our paper; for further details, please see the paragraph from Line 161-168. The revised paragraph articulates the intuition behind the polynomial approximation technique, which is used for the low-rank approximation of the attention matrix. Also, we provide a short summary below.

As outlined in Section 3 of [1], the polynomial approximation for low-rank approximation can be succinctly summarized as follows:

Firstly, the authors of [1] demonstrate that for any real number $x \in \mathbb{R}$ , if $x$ is constrained within the interval $[-B, B]$ for some constant $B$ , then $e^x$ can be effectively approximated by a polynomial of degree $g$ .
Additionally, the author establishes that a matrix with bounded entries, following the application of an exponential function, can be low-rank approximated by matrices $U_1$ and $U_2$ , both belonging to $\mathbb{R}^{n \times r}$ , where $r$ is on the order of $n^{o(1)}$ .

W2: Lack of explanation for the steps of Algorithm 1 in Section 5.

Thank you for pointing out this point. We have added an explanation for Algorithm 1 in Line 279-285. We believe this newly added explanation can help readers have a better understanding of our algorithm.

W3: Lack of proof sketch for Lemma 5.1.

Thanks for your valuable suggestion. We revised our paper and have incorporated a proof sketch for Lemma 5.1 (Line 437-445). This sketch is intended to provide a concise overview of the proof’s intuition. This addition will enhance the reader’s understanding by offering a clearer outline of our proof strategy.

W4: Paragraph 401-407 and 409-413 are duplicated

Thanks for pointing this out. We have fixed this duplication, and have uploaded the revised paper.

Q1: What assumptions does Theorem 4.2 require?

There are three assumptions required by Theorem 4.2. We list them as follows:

The number of layers of the multi-layer transformer $m$ is constant.
The embedding dimension $d$ satisfies $d = O(\log (n))$ .
Each entry in the matrix is represented by at most log(n) bits ( $\log(n)$ bits model). This assumption is well-accepted and widely used in the computational complexity community, e.g., [2,3,4]. It aligns well with practical scenarios, where machine precision (e.g., 16 or 32 bits) is typically far smaller than the input context length (e.g., 128k).

The above three assumptions are also listed as the conditions of Theorem 4.2.

Q2: Practical regime of dimension $d$ and number of layers $m$ .

Indeed, the embedding dimensions employed in contemporary LLMs typically range from $1024$ to $4096$ . However, we contend that the multi-head attention mechanism effectively reduces the effective embedding dimension during computation. Consider the Llama3 model as a case in point. Although Llama3 has an embedding dimension of $4096$ , it is divided among $32$ attention heads. Consequently, the computational dimension per head is only $4096 / 32 = 128$ , which is relatively modest in comparison to the length of the input sequence. Moreover, for LLama 3.2 1B, they have $d=64$ . Regarding the number of layers, $m$ , which is $16$ for LLama 3.2 1B, we treat it as a constant in our analysis.

Q3: I feel that the benefits will only be apparent when n is very large (e.g., at least 100k), which requires significant computational resources and may be difficult to run. Have the authors had a chance to experiment with this?

We appreciate the reviewer's suggestion to conduct experiments with a long context length. Regrettably, our current computational resources do not permit us to carry out extensive experiments. However, we perform preliminary experiments to show some promising results, suggesting a potential fast almost linear time training method bypassing the quadratic complexity bottleneck of LLMs. See Global Response Part 2: Empirical evaluation for more details.

[1] Alman, J., & Song, Z. Fast attention requires bounded entries. NeurIPS’23.

[2] Feng, G., Zhang, B., Gu, Y., Ye, H., He, D., & Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. NeurIPS’23.

[3] Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., & Zhang, C. Transformers Learn Shortcuts to Automata. ICLR’23.

[4] Merrill, W., & Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics (ACL’23).

审稿意见

评分: 6置信度: 42024-11-11

This paper proposes a method to compute gradients in multi-layer transformers in almost linear time, addressing the quadratic complexity bottleneck of traditional self-attention mechanisms. By leveraging a low-rank approximation technique, the approach aims to make large language models more efficient, particularly for long-context tasks, while maintaining a small, polynomial approximation error. The algorithm is versatile, supporting general loss functions and widely used transformer components such as residual connections, multi-head attention, and causal masks, making it theoretically applicable to a broad range of transformer-based architectures.

优点

Strengths:

1.Innovation in Complexity Reduction: The primary contribution of an almost linear time approximation method for gradient computation addresses a significant bottleneck in transformer models, advancing efficiency in training large language models. 2.Comprehensive Theoretical Support: The paper offers a well-developed theoretical foundation, demonstrating that the proposed method maintains accuracy within a polynomially small approximation error. 3.General Applicability: The approach is compatible with various sub-modules (e.g., residual connections, multi-head attention) and loss functions, enhancing its utility across diverse transformer-based applications. 4.Potential Impact on Long-Context Models: Given the paper’s focus on scaling efficiency for long input sequences, the findings are likely to benefit future developments in large language models that need to handle extensive contexts effectively.

缺点

1.Lack of Empirical Validation: Although the paper establishes a strong theoretical foundation, it lacks empirical evaluation to validate the proposed method's performance in real-world applications. While there is discussion of related methods, a deeper quantitative or empirical comparison (e.g., speed, memory usage) with current acceleration techniques could further support the claims of the paper. Without experiments or benchmarks, it is difficult to assess the practical effectiveness of the method, especially in comparison with other acceleration techniques. 2.Unclear Approximation Error Impact: While the approximation error is stated to be polynomially small, the paper does not explore how this error might affect model performance in downstream tasks. In applications requiring high precision, even small errors could be significant; this is particularly relevant in domains where precision is critical, such as healthcare or finance. 3.Limitations of Low-Rank Approximation: The proposed approach relies on low-rank approximation of the attention matrix. However, if the attention matrix is effectively full-rank, this assumption may not hold, potentially negating the efficiency gains of the method. Further clarification on the handling of full-rank attention maps would improve the paper. 4.Potential Implementation Challenges: The paper notes that implementing the approach, particularly on GPUs, may involve re-defining tensor operations and re-implementing specific backpropagation steps. This could pose a barrier to immediate adoption by practitioners, as it requires significant engineering effort. This is a good paper and I would like to raise my score if the author can provide more empirical supports.

问题

Could the authors provide examples or scenarios where this method outperforms existing quadratic complexity methods in practical applications?
For cases where the multi-layer approximation results in slight errors, how do these impact tasks with high sensitivity to precision (e.g., in medical or financial text processing)?
Can the authors clarify the compatibility of this method with common pre-training setups in large language models? Specifically, how would this method perform in tandem with models like LLaMA or GPT-4?
What happens if the attention matrix is full-rank? Does the proposed method lose its advantage in such cases?
A more detailed, quantitative comparison with recent works on transformer efficiency, such as FlashAttention[1], hyperattention[2], or tensor-based approximation methods, would enhance the positioning of the proposed method.

[1] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [2] HyperAttention: Long-context Attention in Near-Linear Time

2024-11-17

We thank the reviewer for the helpful comments. We provide some additional information below.

W1 & Q1: Lack of practical scenarios where our method outperforms other quadratic complexity methods?

Thanks for your valuable suggestions. We would like to point out that the main focus of our work is theoretical analysis. We provide rigorous theoretical analysis about the reduction of computation complexity of multi-layer transformers. However, we perform preliminary experiments to show some promising results. See Global Response Part 2: Empirical evaluation.

W2 & Q2: How does the slight approximation error affect the performance in tasks which require high precision?

Thank you for your insightful comments. For downstream tasks, particularly those that demand high precision, we recommend replacing some parts of the approximated attention module to standard attention computation and fine-tune on specific datasets to bypass the slight approximation error.

W3 & Q4: What will happen when the attention matrix is full rank?

Thank you for your suggestions. We would like to highlight that our method uses low-rank computation techniques for approximating the attention, but is NOT assuming the attention is close to low rank. Indeed, our method works for full-rank attention.

For an in-depth examination of how to deal with full rank attention matrix, please refer to Line 161-168 in our revised paper and also Part 1: Can our low-rank approximation deal with the attention matrix with full-rank? in Global Response.

W4: Potential implementation challenges on GPUs

Many thanks to your suggestions. As our study centers on the backpropagation process of multi-layer transformers, applying our method to GPUs would necessitate modifying the gradient computation functions within PyTorch and might even require writing custom CUDA code. For more details, please refer to Part 3 of the global rebuttal. And we provide some empirical results in Global Response Part 2: Empirical evaluation.

Q3: How does this method perform when in tandem with pre-training setups?

Thank you for pointing this out. Our method indeed is compatible for tandeming with pre-training. This can be achieved by “hacking” the backpropagation function of pytorch, and replacing the backpropagation of transformer layers with the algorithm provided in our work.

Q5: Quantitative comparison with recent works, FlashAttention[2], Hyperattention[3].

Thanks for your suggestion. We offer a qualitative comparison with FlashAttention and Hyperattention.

FlashAttention is an empirical study that leverages GPU caching to enhance the speed of attention computations. In comparison, our work centers on the acceleration of attention computations from a theoretical perspective.
In relation to Hyperattention, our research diverges in two key respects: (1) we concentrate on expediting gradient computations across multi-layer transformers, whereas Hyperattention is concerned solely with the forward pass; (2) Hyperattention employees Locality Sensitive Hashing (LSH) to approximate the attention matrix, which only ensures a small error with high probability. In contrast, our method does not employ hashing techniques, allowing us to guarantee a minimal approximation error with certainty. We are willing to discuss more about LSH per the reviewer’s request.

[1] Alman, J., & Song, Z. Fast attention requires bounded entries. NeurIPS’23.

[2] Dao, T., Fu, D., Ermon, S., Rudra, A., & Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. NeurIPS’22.

[3] Han, I., Jayaram, R., Karbasi, A., Mirrokni, V., Woodruff, D. P., & Zandieh, A. Hyperattention: Long-context attention in near-linear time. ICLR’24.

2024-11-22

Thanks for addressing my concerns. I have changed my score.

评论- Thank you

2024-11-22

We are glad our response addressed your concerns. We appreciate your efforts and time. Thank you for your helpful suggestions and score raising!

评论- Global Rebuttal - 1

2024-11-17

We gratefully thank all reviewers for their valuable and constructive feedback.

We appreciate the reviewers FvhV, hH9h, o3CW, and zJzg for recognizing the significance of our work in addressing the quadratic complexity bottleneck of gradient computations in attention mechanisms, particularly for long-context tasks in LLMs. They commend our almost linear time approximation method for its innovation, versatility across loss functions and transformer components (e.g., residual connections, multi-head attention, causal masking), and potential to reduce computational and energy costs. The reviewers also value the comprehensive theoretical foundation, with hH9h and o3CW noting the accuracy guarantees and the non-trivial extension to multi-layer transformers (Theorem 4.2). Overall, they highlight the practical and theoretical impact of our contributions to scaling efficiency in LLM training.

We have updated a revision for our draft. We also update the code about experiments in the supplemental material. We summarize all the updates (in brown color) we made in the revision. All line numbers in the rebuttal correspond to the revised version.

Line 161-168: Add a discussion for the intuition behind the low-rank approximation method used in our paper.
Line 279-285: Add an explanation for our main algorithm (Algorithm 1). We believe this explanation can help readers to understand our intuition more easily.
Line 437-445: Add a proof sketch for Lemma 5.1, which displays the high-level idea of our proof.
Line 130 & 355-356 & 361-362 & 400-404: Modify statements of general loss function.

Then, we will cover some questions that reviewers commonly ask.

Part 1: Can our low-rank approximation deal with the attention matrix with full-rank?

The full rank nature of the attention matrix poses no detriment to our approach.

The fundamental of our low-rank approximation approach for the attention matrix is the polynomial approximation technique introduced in Section 3 of [1]. Put another way, rather than directly decomposing the attention matrix in its full rank nature, our method involves constructing low-rank matrices to serve as the “approximation’’ for the original matrix $\exp(QK^\top) \in \mathbb{R}^{n \times n}$ . For an in-depth examination of this process, please refer to Line 161-168 in our revised paper.

Part 2: Empirical evaluation.

Thank you for the reviewers’ valuable suggestions to provide empirical evidence for our LLM training speedup method. However, we would like to emphasize that our work is intended to be positioned as a theoretical study, focusing primarily on theoretical analysis. Nevertheless, to demonstrate the effectiveness of our method, we conducted preliminary experiments within the constraints of time and resources. These experiments were performed across various embedding dimensions, denoted as $d$ , and input sequence lengths, denoted as $n$ .

We focus on the key step of approximating the attention output, $\mathsf{Softmax}(Q K^\top / d) V$ , where $Q, K, V \in \mathbb{R}^{n \times d}$ are random matrices, using the conventional attention computation and our approximation technique. As indicated in the subsequent table, we have maintained a low relative error, which intriguingly diminishes as the input sequence length increases. Notably, our method exhibits a significant acceleration in computation time when contrasted with the standard attention mechanism.

Feature dim d	Input token length n	Relative Error (%)	Original Running Time (ms)	Our Running Time (ms)	Speedup
4	102*d=160	4.673	1.010	0.710	1.423
5	102*d=320	2.505	1.090	0.630	1.730
6	102*d=640	2.846	2.190	0.850	2.576
7	102*d=1280	1.838	3.780	0.970	3.897
8	102*d=2560	2.005	12.160	1.290	9.426
9	102*d=5120	1.620	47.300	1.790	26.425
10	102*d=10240	1.286	192.330	7.200	26.713
11	102*d=20480	1.558	807.740	20.030	40.327
12	102*d=40960	0.773	3297.410	41.750	78.980

In the above Table, we use $n=10 \times 2^d$ as $d = O(\log(n))$ in our assumptions. And the approximation error diminishes rapidly as the input sequence length $n$ increases. This aligns with our theoretical analysis that the approximation error of the entire model can be bounded by $1 / poly(n)$ .

The code is submitted as supplementary material in the revised version of this paper. We hope this additional experiment can address reviewers’ concerns about the practical efficacy of our method.

评论- Global Rebuttal - 2

2024-11-17

On the other hand, we kindly wish to emphasize that the main focus of our work is theoretical analysis. The importance of theoretical contributions is widely acknowledged in esteemed conferences such as NeurIPS, ICLR, and ICML. For example, papers [1, 2, 3, 4] accepted at ICLR are solely based on theoretical analysis without any experimental data. Likewise, studies [4, 5, 6, 7, 8] that focus on designing efficient algorithms for low-rank and attention approximations also bypass empirical results.

Moreover, our work tackles the quadratic time complexity associated with the backpropagation in multi-layer transformers. Our analytical framework accommodates various elements of attention mechanisms, including multi-head attention, residual connections, and causal masking. We are confident that our theoretical findings will contribute to the advancement of practical algorithm design in the future.

Part 3: Potential implementation challenges on GPUs.

As discussed in Part 2, we give a simplified demo of our method implemented on attention computation. We have proved the effectiveness of our method through preliminary experiments. There are several potential challenges:

As our study centers on the backpropagation process of multi-layer transformers, applying our method to GPUs would necessitate modifying the gradient computation functions within PyTorch and might even require writing custom CUDA code.
As shown in our mathematical proof, the backpropagation gradient is very complicated. Therefore, we can only implement part of the gradient computation and cannot fully implement it during the rebuttal phase.
The low rank approximation algorithm we use involves many for loop operations. Parallelizing these operations on the GPU, including SM scheduling and shared memory allocation, will take lots of time and effort, and this part may be not realistic to complete during the rebuttal.

Some possible solutions in our mind: By harnessing the extensibility of PyTorch, we would customize the backpropagation algorithm and devise custom CUDA kernels. Subsequently, we would enhance the performance of these kernels by fine-tuning the Streaming Multiprocessor (SM) scheduling and adeptly managing shared memory, thereby maximizing hardware utilization efficiency. Therefore, our method can achieve optimal acceleration in performance.

Due to limited time and resources, we are not able to do the full implementation. However, we have provided the preliminary experimental as above to show the potential of practical implementation

[1] Zhan, W., Uehara, M., Kallus, N., Lee, J. D., & Sun, W. Provable Offline Preference-Based Reinforcement Learning. ICLR’24.

[2] Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., & Zhang, A. R. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. ICLR’23.

[3] Wen, K., Ma, T., & Li, Z. How Sharpness-Aware Minimization Minimizes Sharpness?. ICLR’23.

[4] Alman, J., & Song, Z. How to capture higher-order correlations? generalizing matrix softmax attention to kronecker computation. ICLR’24.

[5] Alman, J., & Song, Z. Fast attention requires bounded entries. NeurIPS’23.

[6] Alman, J., & Song, Z. The fine-grained complexity of gradient computation for training large language models. NeurIPS’24.

[7] Sarlos, T., Song, X., Woodruff, D., & Zhang, R. Hardness of low rank approximation of entrywise transformed matrix products. NeurIPS’24.

[8] Dexter, G., Drineas, P., Woodruff, D., & Yasuda, T. Sketching algorithms for sparse dictionary learning: PTAS and turnstile streaming. NeurIPS’24.

AC 元评审

2024-12-20

The reviewers and AC found this work to be an interesting contribution. In terms of rating, this paper was borderline, slightly below the acceptance threshold. However, the final decision is rejection for the following reasons:

As initially raised by Reviewer o3CW, this work appears somewhat incremental over [Alman & Song, 2024a]. The authors state: "Fast gradient computation. The prior study in Alman & Song (2024a) demonstrated that the gradient of $W_i$ can be computed in almost linear time. We extend their findings by adapting their approach to accommodate general loss function $L(X)$ (as defined in Definition 3.1) and further generalize their results to include the gradient computation for both $W_i$ and $W_{V_i}$ in each transformer layer (Lemma 5.2 and 5.3)". First, I am not sure how much complexity $W_{V_i}$ brings. Second, I don't see either how loss function itself make a major difference in gradient compute complexity. At the minimum, the authors should have done a much better job distinguishing novelty and proof techniques over [Alman & Song, 2024a].
As pointed out by Reviewer FvhV, the constraint on $d=log(n)$ is not very realistic and not fully in line with practical LLMs. There should have been comprehensive discussions of practical considerations: What the typical scaling of $d$ and the number of heads are, and how the results change if $d$ grows polynomially rather than logarithmic.
Additional drawbacks of this work include the lack of thorough empirical validation, which is partly addressed during the author response, and connections to practical LLM training.

Overall, this submission could be strengthened by addressing the above points and will represent a strong contribution to an upcoming venue. However, current weaknesses prevent its acceptance as is.

审稿人讨论附加意见

Discussed in the above meta review.

最终决定Reject

2025-01-22

Reject

Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

摘要

评审与讨论

优点

缺点

问题

W1 & Q1: Lack of empirical results and comparison with state-of-art methods in terms of speed and accuracy.

W2 & Q2: How does the approximation error affect the performance of the downstream tasks?

W3 & Q3: What are the detailed challenges for implementing this method on GPUs? Can the author provide more concrete guidance on how to implement this method?

优点

缺点

问题

W1 & Q1.1: What is the difference between Theorem 4.1 and Theorem 1.6 of [1]?

Q1.2: How is Theorem 1.6 of [1] used in your proof?

W2 & Q2: How did n/poly(n)n/poly(n)n/poly(n) end up becoming 1/poly(n)1/poly(n)1/poly(n)? Can the author write down the degree of poly(n)poly(n)poly(n)? Where is the mmm used in the induction?

W3 & Q3: What are the key challenges and theoretical contributions in proving the results of the Causal attention mask?

Minor suggestions

New Q1.1: Comment on Remark 1.3 of [1].

New Q1.2 Does this remark (Remark 1.3 of [1]) simply apply per-layer with Ti=YT_i = YTi​=Y?

New Q2: The relationship between any differentiable loss function FFF and the ℓ2\ell_2ℓ2​ loss.

New Q3: The poly(n)poly(n)poly(n) term.

优点

缺点

问题

W1: Lack of intuition regarding low-rank approximation techniques in the main text

W2: Lack of explanation for the steps of Algorithm 1 in Section 5.

W3: Lack of proof sketch for Lemma 5.1.

W4: Paragraph 401-407 and 409-413 are duplicated

Q1: What assumptions does Theorem 4.2 require?

Q2: Practical regime of dimension ddd and number of layers mmm.

Q3: I feel that the benefits will only be apparent when n is very large (e.g., at least 100k), which requires significant computational resources and may be difficult to run. Have the authors had a chance to experiment with this?

优点

缺点

问题

W1 & Q1: Lack of practical scenarios where our method outperforms other quadratic complexity methods?

W2 & Q2: How does the slight approximation error affect the performance in tasks which require high precision?

W3 & Q4: What will happen when the attention matrix is full rank?

W4: Potential implementation challenges on GPUs

Q3: How does this method perform when in tandem with pre-training setups?

Q5: Quantitative comparison with recent works, FlashAttention[2], Hyperattention[3].

Part 1: Can our low-rank approximation deal with the attention matrix with full-rank?

Part 2: Empirical evaluation.

Part 3: Potential implementation challenges on GPUs.

审稿人讨论附加意见

W2 & Q2: How did $n/poly(n)$ end up becoming $1/poly(n)$ ? Can the author write down the degree of $poly(n)$ ? Where is the $m$ used in the induction?

New Q1.2 Does this remark (Remark 1.3 of [1]) simply apply per-layer with $T_i = Y$ ?

New Q2: The relationship between any differentiable loss function $F$ and the $\ell_2$ loss.

New Q3: The $poly(n)$ term.

Q2: Practical regime of dimension $d$ and number of layers $m$ .