One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space
Abstract
Reviews and Discussion
The paper presents a one-pass streaming algorithm for computing attention in sublinear space when the dimension d is logarithmic in the sequence length.
Strengths
Addressing LLMs with very large context length appears important and timely.
Weaknesses
- The writing is poor: the paper is littered with typos, and the general structure is very hard to follow. I found the paper very hard to read. Other comments aside, it cannot be published in its current state.
- I find it hard to evaluate the novelty of this paper. The related work section seems to be just a lump of citations, and it is not clear what the actual contribution is. It looks like the authors simply combined existing techniques to get the results.
- I am not sure about the motivation for a streaming computation. In what scenario will the computation be executed in a streaming fashion (i.e., the matrices need to be stored somewhere, right?)
- There is no experimental evaluation.
Questions
See weaknesses.
In this work, the authors study improving the time and space complexity of the attention mechanism in the non-causal setting. They give a single-pass streaming algorithm that, given the query, key, and value vectors, approximates the attention output using tools from randomized linear algebra and compressed sensing. At a high level, the algorithm computes sketches of low-rank matrices so that sparse approximations of the columns of the attention output can be recovered from those sketches. The space required by the algorithm is sublinear in the sequence length when d is logarithmic, and the paper also bounds the time needed to process the stream.
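For intuition, the following is a minimal, self-contained sketch of the kind of one-pass accumulation this summary describes; it is not the paper's algorithm. The exponential kernel is replaced by a Performer-style random feature map (the names phi, W and the feature dimension m are mine), and the compressed-sensing recovery step and the sublinear-space bookkeeping are omitted.

```python
# Toy one-pass streaming approximation of (non-causal) softmax attention.
# A random feature map stands in for the low-rank factors of Alman-Song;
# S and z are the only state kept while streaming over the (k_i, v_i) pairs.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1024, 16, 64                 # sequence length, head dim, feature dim

Q = rng.standard_normal((n, d)) / d**0.25
K = rng.standard_normal((n, d)) / d**0.25
V = rng.standard_normal((n, d))

W = rng.standard_normal((d, m))        # shared random projection

def phi(x):
    # Positive random features with E[phi(q) . phi(k)] = exp(<q, k>).
    return np.exp(x @ W - np.sum(x**2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

# --- single pass over the key/value stream: state is O(m*d), independent of n ---
S = np.zeros((m, d))                   # accumulates sum_i phi(k_i) v_i^T
z = np.zeros(m)                        # accumulates sum_i phi(k_i) (normalizer)
for i in range(n):
    f = phi(K[i:i + 1])[0]
    S += np.outer(f, V[i])
    z += f

# --- answer all queries from the accumulated statistics ---
Phi_q = phi(Q)
approx = (Phi_q @ S) / (Phi_q @ z)[:, None]

exact = np.exp(Q @ K.T)
exact = (exact / exact.sum(axis=1, keepdims=True)) @ V
print("relative Frobenius error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```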
Strengths
--
Weaknesses
- Attention computation can already be done without O(n^2) space; see Rabe and Staats, "Self-Attention Does Not Need O(n^2) Memory". (A minimal sketch of that chunked computation is given after this list.)
- The version of the problem being studied, i.e., without causal masking, is not very relevant to current LLMs, which all use causal masking.
- As stated, the results only output a small, prescribed number of nonzero entries for the entire output matrix, which has n rows. Since we expect the n rows of the full attention output to have similar norms, many useful rows are returned as entirely zero by this algorithm, unless the sparsity budget is taken to be linear in n, at which point the sketches stored by the algorithm are linear in size, the claim of a streaming algorithm no longer makes sense, and the algorithm is essentially no different from Alman and Song.
- No experimental verification of the ideas. At least showing how the algorithm works on small instances, even without implementing the additional sparse recovery steps, would have been useful.
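As referenced in the first point above, here is a minimal sketch of the chunked computation behind Rabe and Staats' observation (the chunk size, the use of NumPy, and the absence of a causal mask are my choices): the n x n score matrix is never materialized, yet the result matches exact softmax attention up to floating-point error.

```python
# Chunked softmax attention with a running max and running normalizer,
# in the spirit of "Self-Attention Does Not Need O(n^2) Memory".
import numpy as np

def chunked_attention(Q, K, V, B=128):
    n, d = Q.shape
    out = np.zeros_like(V, dtype=float)      # running (unnormalized) numerator
    running_max = np.full(n, -np.inf)        # running max score per query
    running_den = np.zeros(n)                # running softmax denominator
    for start in range(0, K.shape[0], B):
        Kb, Vb = K[start:start + B], V[start:start + B]
        scores = Q @ Kb.T / np.sqrt(d)       # n x B block, never n x n
        new_max = np.maximum(running_max, scores.max(axis=1))
        # rescale previously accumulated quantities to the new max
        scale = np.exp(running_max - new_max)
        out *= scale[:, None]
        running_den *= scale
        w = np.exp(scores - new_max[:, None])
        out += w @ Vb
        running_den += w.sum(axis=1)
        running_max = new_max
    return out / running_den[:, None]

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((512, 32)) for _ in range(3))
ref = np.exp(Q @ K.T / np.sqrt(32))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
print(np.allclose(chunked_attention(Q, K, V), ref))   # True
```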
Questions
--
The paper proposes a new framework: a one-pass streaming algorithm in place of quadratic attention. This is useful when the sequence length is much larger than the dimension (i.e., n ≫ d). The method relies heavily on the polynomial method proposed by Alman and Song, which shows how to compute low-rank matrices whose product is approximately equal to the full attention matrix. The authors effectively employ a strategy whereby they sketch the low-rank matrix U and then use sparse recovery tools to recover the original vector. Since this can be done in a streaming fashion, one gets an algorithm with substantially smaller space and very few caveats.
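For concreteness, the low-rank structure from Alman and Song that this summary leans on can be written as follows (the notation U_1, U_2, D is mine):

```latex
% Attention with A = exp(QK^T) (any 1/sqrt(d) scaling absorbed into Q, K):
\[
  \mathrm{Att}(Q,K,V) = D^{-1} A V, \qquad D = \mathrm{diag}(A \mathbf{1}_n).
\]
% Alman--Song: for bounded entries and d = O(\log n), there exist factors
% U_1, U_2 \in \mathbb{R}^{n \times k} with k = n^{o(1)} such that
% U_1 U_2^\top \approx A, so
\[
  \mathrm{Att}(Q,K,V) \approx \widetilde{D}^{-1}\, U_1 \bigl(U_2^\top V\bigr),
  \qquad \widetilde{D} = \mathrm{diag}\bigl(U_1 (U_2^\top \mathbf{1}_n)\bigr).
\]
% The small products U_2^\top V and U_2^\top \mathbf{1}_n are what a one-pass
% algorithm can maintain via sketches, with sparse recovery applied afterwards
% to the columns of the output.
```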
Strengths
The authors address a fundamental and important problem. They produce an algorithm with clear guarantees. Furthermore, most of the assumptions are clearly stated, the proofs are easy to follow, and the pseudocode is clean and easy to understand.
Weaknesses
The authors don't introduce any new techniques; the result seems to be an application of existing sketching machinery on top of the algorithm of Alman and Song.
The resulting algorithm has small asymptotic complexity but seems to be extremely impractical in several ways. First, it relies on the sequence length n being enormous relative to the dimension d (so that d = O(log n) holds), whereas in practice d far exceeds log n even for small transformers; as a result this assumption is quite strong. Second, the types of operations required are not easy to implement efficiently on modern hardware.
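To illustrate how restrictive this regime is, the following back-of-the-envelope calculation uses numbers of my own choosing (hidden constant set to 1, per-head dimension 64), not the reviewer's:

```latex
% d = O(log n) means n = 2^{\Omega(d)}; with the hidden constant taken to be 1
% and a typical per-head dimension d = 64,
\[
  n \;\gtrsim\; 2^{d} = 2^{64} \approx 1.8 \times 10^{19} \ \text{tokens},
\]
% far beyond any context length used in practice.
```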
Questions
N/A
The paper proposes a streaming algorithm for attention computation. As mentioned by the reviewers, the novelty is low and the paper cannot be accepted at this point.
Additional Comments from Reviewer Discussion
The rebuttal does not address the concerns.
Reject