PaperHub
Overall rating: 7.0/10 (Spotlight) · 3 reviewers · lowest 6, highest 8, std 0.8
Individual ratings: 7, 8, 6
Confidence: 4.0 · Correctness: 3.7 · Contribution: 3.7 · Presentation: 3.0
NeurIPS 2024

Metric Transforms and Low Rank Representations of Kernels for Fast Attention

Submitted: 2024-05-04 · Updated: 2024-11-06
TL;DR

This paper studies possibility and impossibility results for kernel methods and metric transforms.

Abstract

We introduce a new linear-algebraic tool based on group representation theory, and use it to address three key problems in machine learning.

  1. Past researchers have proposed fast attention algorithms for LLMs by approximating or replacing softmax attention with other functions, such as low-degree polynomials. The key property of these functions is that, when applied entry-wise to the matrix $QK^{\top}$, the result is a low rank matrix when $Q$ and $K$ are $n \times d$ matrices and $n \gg d$. This suggests a natural question: what are all functions $f$ with this property? If other $f$ exist and are quickly computable, they can be used in place of softmax for fast subquadratic attention algorithms. It was previously known that low-degree polynomials have this property. We prove that low-degree polynomials are the only piecewise continuous functions with this property. This suggests that low-rank fast attention only works for functions approximable by polynomials. Our work gives a converse to the polynomial method in algorithm design.
  2. We prove the first full classification of all positive definite kernels that are functions of Manhattan or $\ell_1$ distance. Our work generalizes an existing theorem at the heart of all kernel methods in machine learning: the classification of all positive definite kernels that are functions of Euclidean distance.
  3. The key problem in metric transforms, a mathematical theory used in geometry and machine learning, asks what functions transform pairwise distances in semi-metric space $M$ to semi-metric space $N$ for specified $M$ and $N$. We provide the first full classification of functions that transform Manhattan distances to Manhattan distances. Our work generalizes the foundational work of Schoenberg, which fully classifies functions that transform Euclidean to Euclidean distances.

We additionally prove results about stable-rank preserving functions that are potentially useful in algorithmic design, and more. Our core new tool is called the representation theory of the hyperrectangle.
Keywords
hardness, impossibility, low rank transform, kernel method, LLM, attention

Reviews and Discussion

Review (Rating: 7)

This paper studies three separate problems relating to metric transformations of kernel matrices via a new mathematical technique which the authors refer to as the representation theory of the hyperrectangle.

The first problem is about characterising the entrywise transformations which, when applied to a low-rank matrix, preserve its low-rank structure. The problem is considered in the context of speeding up attention computations in LLMs. They

  1. show that polynomials are the only class of functions (without essential discontinuities of the first kind) which, when applied entrywise to a low-rank matrix, preserve the low-rankness of the matrix (as defined in Def. 2.2). This result generalises an existing result in the literature for a more restrictive class of functions.
  2. show that the result is still true even if "low-rank" is relaxed to "approximately low-rank" (matrices with small inverse condition number), and "polynomial" is relaxed to "approximately polynomial" (functions with small finite differences).
  3. show that entrywise "functions that do not grow too quickly" preserve a relaxation of low-rankness known as "stable rank". This includes non-polynomial functions. However, most applications of matrices with low stable rank require the whole matrix to be computed, which does not achieve the goal of fast attention computation.
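A small numerical sketch of this contrast (my own illustration, not an experiment from the paper; the use of ReLU, the dimensions, and the seed are arbitrary choices), comparing exact rank with stable rank $\Vert A \Vert_F^2 / \Vert A \Vert_2^2$ under an entrywise non-polynomial map:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
M = rng.standard_normal((n, d)) @ rng.standard_normal((d, n))   # rank <= 3

def stable_rank(A):
    """||A||_F^2 / ||A||_2^2, a spectral relaxation of rank."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

R = np.maximum(M, 0.0)   # entrywise ReLU: non-polynomial, but does not grow too quickly

print(np.linalg.matrix_rank(M), stable_rank(M))   # exact rank 3, stable rank at most 3
print(np.linalg.matrix_rank(R), stable_rank(R))   # exact rank jumps to ~n, stable rank stays O(1)
```

This matches the dichotomy above: the non-polynomial map destroys the exact rank, while the stable rank remains modest.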

The second problem is a complete categorisation of all functions of l_1 norm distance which result in positive semi-definite kernel matrices. These are shown to be the class of completely monotone functions.
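The "easy" direction of this classification is simple to check numerically. A minimal sketch (my own, not from the paper; the sample size, dimension, and the choice $f(t) = e^{-t/4}$ are illustrative assumptions) verifying that a completely monotone function of Manhattan distance gives a positive semi-definite kernel matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5))                           # 60 points in R^5
D1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)    # pairwise l_1 (Manhattan) distances
K = np.exp(-D1 / 4)                                        # f(t) = exp(-t/4) is completely monotone

print(np.linalg.eigvalsh(K).min())   # >= 0 (up to round-off): K is positive semi-definite
```

The paper's contribution is the converse: only completely monotone functions produce positive semi-definite kernels for all such point sets.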

The third problem is to characterise the set of functions which transform one semi-metric space into another. They show the equivalence between a function being Bernstein, its transforming l_1 distances to l_1 distances, and its transforming l_1 distances to squared l_2 distances.
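One direction of this equivalence can also be checked numerically. A sketch (my own; $f(t) = \sqrt{t}$ is used as one concrete Bernstein function): by Schoenberg's criterion, a zero-diagonal symmetric matrix $A$ consists of squared Euclidean distances iff $-\tfrac{1}{2} J A J$ is positive semi-definite, where $J$ is the centering matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 6))
D1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)   # pairwise l_1 distances
A = np.sqrt(D1)                                           # Bernstein function applied entrywise

n = A.shape[0]
J = np.eye(n) - np.ones((n, n)) / n                       # centering matrix
G = -0.5 * J @ A @ J                                      # Gram matrix if A is squared-Euclidean

print(np.linalg.eigvalsh(G).min())   # >= 0 (up to round-off): sqrt of l_1 distances behaves as squared l_2
```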

Strengths

This paper, including the proofs, is accurate and well-written. The first section answers some important questions about the rank of matrices after entrywise transformations. I believe this section of the work will be of great interest to those working on attention in LLMs. The section essentially closes the question of which transformations preserve low-rankness by proving that it is only functions in the class of polynomials (bar functions with essential discontinuities of the first kind, which the authors conjecture do not preserve low rank anyway, and I am inclined to believe them). The exploration of how the answer to this question changes when the notion of "low-rank" is relaxed to "approximately low-rank" and "of low stable rank" is informative and complements the first result nicely.

The characterisation of positive semi-definite kernel matrices which are functions of l_1 distance also closes an open question with an elegant result. This result will likely be of interest to researchers in the theory of kernel methods.

The theoretical work in Section 4 is also interesting; however, its significance to the machine-learning community is perhaps slightly less than that of the previous sections.

The mathematical tools developed to study the spectral properties of matrices whose entries are the l_1 distances between the vertices of a hyperrectangle do not receive much attention in the main text, but may well be of independent interest to mathematics research in many disciplines.

Weaknesses

One might argue that this paper tries to do too much for a conference paper, and as a result much of it feels rushed or unduly relegates details to the appendix. The sections sometimes feel unrelated, and the paper lacks a clear narrative. Empirical experiments to demonstrate the results in Section 2 would be very welcome, and I think would enhance the message of the paper; however, there is clearly not space for this as it stands. I can imagine a more focused version of this paper which moves the results in Sections 3 and 4 to another paper and focuses only on the results in Section 2, which I believe have the most value to the machine learning community. This would give more space to the mathematical tools in Section 5, and to experiments.

That said, all of the individual contributions are strong, and I have no criticisms of any of the content of this work.

Questions

  • I was surprised by the lack of mention of Mercer's theorem on the existence of the feature maps F in Section 3. Is this covered by other references?
  • I would be interested to see simulated experiments to demonstrate Theorems 2.6 and 2.7.
  • Could you please explain the implications of Theorem 4.4 for a problem in machine learning?

Limitations

The authors are upfront about limitations, particularly in relation to stable rank, which I think opens up interesting directions for future research.

Author Response

Thank you for the thoughtful review.

Re: “the sections sometimes feel unrelated, and the paper lacks a clear narrative”: The sections are related through the unifying technique: the representation theory of the real hyperrectangle. Indeed, all our results (from LLMs to kernels) follow from this one new technique. We plan to emphasize this early in the introduction in the final version of the paper.

To answer your questions:

  1. Indeed, Mercer’s Theorem is covered by our other references. For instance, it is heavily referenced in our citation [SOW01]. But you’re right that it’s important and we implicitly use it; we will cite it directly in the final version.

  2. We agree that experiments relating to Section 2 would be exciting. Some prior empirical papers have performed experiments showing that polynomials in attention are effective. See, for instance, reference [KMZ23] (which appeared in ICML 2024). More experiments on our results, including a search for optimal low-degree polynomials to use in attention, would be an exciting area for future work.
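To make the connection concrete, here is a minimal sketch of why entrywise-polynomial attention is fast (my own illustration, not the method of [KMZ23] or an experiment from the paper; the sizes, the degree $p$, and the feature map below are assumptions for the example): for $f(x) = x^p$ one has $f(QK^{\top}) = \Phi(Q)\Phi(K)^{\top}$ for an explicit degree-$p$ tensor feature map $\Phi$, so $f(QK^{\top})V$ can be computed in $O(n d^p)$ time without ever forming the $n \times n$ attention matrix.

```python
import numpy as np
from itertools import product

def phi(X, p):
    """Degree-p tensor-product features: one column per index tuple (i_1, ..., i_p)."""
    n, d = X.shape
    cols = [np.prod(X[:, list(idx)], axis=1) for idx in product(range(d), repeat=p)]
    return np.stack(cols, axis=1)                      # shape (n, d**p)

rng = np.random.default_rng(3)
n, d, p = 500, 4, 2
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

direct  = (Q @ K.T) ** p @ V                           # O(n^2 d) reference computation
via_map = phi(Q, p) @ (phi(K, p).T @ V)                # O(n d^p): never forms the n x n matrix
print(np.allclose(direct, via_map))                    # True
```

Per the abstract, low-degree polynomials are the only piecewise continuous functions admitting this kind of low-rank factorization, so this trick is essentially the only one of its type.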

  3. The areas of metric transforms have numerous applications to machine learning (see lines 67-77 on page 2), and our Theorem 4.4 gives a new structural result about what is possible in metric transforms.
    The known statement using Euclidean instead of Manhattan distances (which our Theorem 4.4 extends) is widely used in machine learning. Some specific examples include the converse to Theorem 2 in our citation [RYW+19] about interpretability (“Visualizing and measuring the geometry of BERT” [NeurIPS ‘19]), and Theorems 4 and 5 of our citation [FLH 15] giving a framework for kernels on popular Riemannian data manifolds in computer vision (“Geodesic Exponential Kernels: When Curvature and Linearity Conflict” [CVPR ‘15]). Again, lines 67-77 on page 2 of our paper give a longer list of applications.
    We are confident that our Theorem 4.4 will be useful for proving future results similarly. Perhaps the most striking part of our Theorem 4.4 is that it is proved using the exact same technique as our other results.

Comment

I thank the authors for their response. I agree that the thing that ties this paper together is the new mathematical tool, the representation theory of the real hyperrectangle, and that this should be emphasised earlier in the paper. While I strongly support this work, I am still unsure whether a venue with a 10-page limit is suitable for it. I will increase my score to 7, and I encourage the authors to think carefully about the presentation of their results within the space constraints.

Review (Rating: 8)

The paper achieves three main results: Firstly, it demonstrates that polynomials are the only piecewise continuous functions for which applying them entrywise to a low-rank matrix results in another low-rank matrix. Secondly, it shows that if f is a Manhattan kernel, then it must be a completely monotone function. Thirdly, it shows that Bernstein functions are the only ones that transform the Manhattan metric to the Manhattan metric.

Strengths

  1. I found the paper well-written and highly readable, with a smooth flow. The related works are properly discussed, providing clear motivation for each problem. The proofs are divided into steps, and the ideas are explained clearly, allowing the reader to engage with the paper's storyline. I believe this paper will certainly be of interest to the NeurIPS community.

  2. The paper is mathematically solid. The results and the techniques used to obtain them are interesting and mathematically rigorous and effectively address the problems outlined in the summary.

  3. The technical core of all three main results involves a simple application of Schur’s lemma, which characterizes the eigenvectors of a 'Gram-like matrix' derived from the vertices of a hyper-rectangle. Another interesting aspect of the paper is the author's observation that this technical tool can be employed to address the aforementioned problems.

Weaknesses

None.

Questions

To what extent can your results for metric transformation and kernel classification be extended beyond the Manhattan metric?

Limitations

The authors have discussed the limitations of their work.

Author Response

Thank you for the thoughtful review.

Great question: one can show with some work that the same theorem statements hold for a broader class of metrics, including effective resistance distances from electrical network theory, or the class of graph metrics arising from all star graphs. Exploring this type of result beyond Manhattan metrics is a rich area of exploration for future work, and we expect the landscape and necessary techniques to vary for different types of metrics.

Comment

Dear Authors,

Thank you for addressing my questions and for your excellent work. I have reviewed the rebuttals and other reviews, and I would like to maintain my original score.

Best regards, Reviewer

Review (Rating: 6)

This paper studies kernel functions with a particular interest to Manhattan distance kernels. The following three questions are studied:

  1. For which functions f is the entrywise transformation f(M) guaranteed to not be full rank whenever M is a sufficiently low-rank matrix? The authors show that if f does not have essential discontinuities of the first type and entrywise f(M) has rank $< n$ whenever M has rank $< \log_2(n) + 2$, then f is necessarily a polynomial of degree at most $\log_2(n) + 1$.
    The claim in the paper seems to be made for all n, though the proof seems to require n to be a power of 2.
  2. What functions are positive definite Manhattan kernels? The authors show that f is a positive definite Manhattan kernel iff f is completely monotone (Theorem 3.4).
  3. What functions transform Manhattan distances to Manhattan distances? The authors show that f transforms Manhattan distances to Manhattan distances iff f transforms Manhattan distances to squared Euclidean distances iff f is Bernstein.

The key technique used in this paper is connecting all three problems to the study of the eigenvalues of the $2^d \times 2^d$ matrices $D_f$ whose entries are $f(\Vert x_i - x_j \Vert)$, where $x_i, x_j$ are vertices of a $d$-dimensional hyperrectangle. The key observation is that such matrices are preserved by a permutation group induced by reflections in $\mathbb{R}^d$, which by Schur's lemma implies that their eigenvectors are eigenvectors of the Hadamard matrix $H_d$. This allows one to explicitly compute and analyze the eigenvalues of $D_f$ (Lemma B1), and this analysis makes it possible to make progress on all three problems above.
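This structural fact is easy to verify numerically. A small sketch (my own, not code from the paper; the side lengths and the test function f are arbitrary): since the $\ell_1$ distance between two vertices depends only on which coordinates differ, $D_f$ is diagonalized by the $\pm 1$ character vectors of $(\mathbb{Z}/2\mathbb{Z})^d$, i.e. the columns of the Hadamard matrix $H_d$.

```python
import numpy as np
from itertools import product

d = 3
sides = np.array([1.0, 2.5, 4.0])                        # arbitrary side lengths of the hyperrectangle
V = np.array(list(product([0, 1], repeat=d))) * sides    # its 2^d vertices
f = lambda t: np.sqrt(t + 1.0)                           # any test function of the l_1 distance
D = f(np.abs(V[:, None, :] - V[None, :, :]).sum(-1))     # D_f, a 2^d x 2^d matrix

subsets = np.array(list(product([0, 1], repeat=d)))      # subsets of {1, ..., d} as 0/1 vectors
H = (-1.0) ** (subsets @ subsets.T)                      # Hadamard matrix: H[S, T] = (-1)^{|S ∩ T|}

for h in H.T:                                            # each Hadamard column should be an eigenvector
    w = D @ h
    lam = (h @ w) / (h @ h)                              # its eigenvalue, via the Rayleigh quotient
    assert np.allclose(w, lam * h)
print("all 2^d Hadamard columns are eigenvectors of D_f")
```

With the eigenvectors fixed in this way, the eigenvalues of $D_f$ become explicit sums, which is what the analysis in Lemma B1 exploits.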

Strengths

The problems studied in this paper are important, natural, and in some sense fundamental (e.g., the classification of positive definite Manhattan kernels). Hence, I think this is a very nice paper, which is well-motivated. A similar classification for Euclidean distances was obtained almost a century ago, and the Manhattan distance is one of the "next" natural metrics for which this question can be asked. It was known for a long time (in one direction) that Bernstein functions preserve Manhattan distances and completely monotone functions are Manhattan distance positive definite. However, this paper appears to be the first to show that these are the only functions with such properties, proving the opposite direction of the classification.

I am not familiar with whether the idea of studying eigenspaces and eigenvalues of entrywise transforms of hyperrectangles has appeared before in the context of these problems, but the idea is very interesting and productive. The proofs are sufficiently rigorous and appear to be correct (I verified some but not all in detail).

Weaknesses

I have mainly two concerns:

  1. I think the presentation of the results can be improved substantially, and I hope that the authors will polish the paper more for the final version if the paper gets accepted. In particular:
    • I think it is important to include more insights about the proof technique and main novelties of the paper in the main text. At this point, only lines 365-380 (Section 5) provide substantial insight into the proof. Some paragraphs of other sections can be shortened without losing much content, which would allow for more insight into the technique; this is, in my opinion, important to do in the main text. In particular, I find lines 999-1005 very insightful, and I think moving them up would help the reader a lot.
    • Some statements and notation are imprecise or confusing (see below).
    • There are a lot of forward references, which make the proofs somewhat hard to read. For example, proofs in Section E talk about "eigenvalues corresponding to $\xi$" (e.g., line 1407), while this only makes sense after reading Section I, which comes much later. There are many forward references like this.
    • There is a non-negligible number of typos.

  2. I am not sure that this is the right venue for these results. The results proven in this paper are great, but they seem to belong to the field of metric geometry/analysis. While there is a connection to learning theory, as the authors indicate in the introduction, in my opinion, the connection is somewhat weak and tangential.

Questions

  1. Doesn't Lemma B1, and hence any lemma that depends on the calculation in this lemma, require n to be a power of 2? In particular, Lemmas C4, C5, and C6, Theorem C11, Theorem 2.5, etc. all seem to require n to be a power of 2. If not, how does the proof work when n is not a power of 2? If yes, can you please make it explicit in all those statements that n should be a power of 2? I think the results have a substantially different flavor if they hold only for powers of 2 instead of for all integers n.
  2. Theorem C11 has conditions missing, even though it is marked as "formal statement". At least it should be mentioned that f is assumed to not have essential discontinuities.
  3. Line 1299 talks about matrix A which is only introduced in line 1303.
  4. The statement of Theorem 2.4 is somewhat confusing, more specifically the claim that Fact 2.1 is "tight".
  5. nitpicking: in the statement of Lemma C15, the rank of M can be less than 5 if n < 5.
  6. nitpicking: in line 1163 (and other places), if one wants to be pedantic, $\Delta_{\epsilon}^{d}$ is an operator applied to functions defined on $\mathbb{R}^d$, while f is defined over $\mathbb{R}$. I think it should be applied to $f(\langle \cdot , 1\rangle)$ at $a$.
  7. In L1191 a closing ")" is missing.
  8. $S_i$ and $T_i$ are mixed up in L1245-1246.
  9. The computation in Lemma C9 is simple, so it is a bit strange to keep it so far from Lemma C4, where it is used; this requires the reader to jump back and forth.

Limitations

N/A

Author Response

Thank you for the thoughtful review.

To address weaknesses:

  1. Thank you for pointing this out. We will take your suggestions for a final draft, expand on the proof technique and main novelties in the introduction, correct typos, and make statements (such as the ones you point to) more precise. We will also remove forward references to improve readability.
  2. Our techniques are based on metric geometry, and they have strong applications to machine learning, including attention computation in LLMs, kernels, and more. For example, we show that low-rank fast attention (the theoretical state of the art for fast attention computation) must rely on polynomial approximations of, or replacements for, the softmax function in the definition of attention. Such a result is more relevant to machine learning than to metric geometry. The application of metric geometry techniques to a variety of machine learning fields such as LLMs and kernels should be considered a strength of our paper, and we are particularly excited about disseminating our results to the ML community.

To answer your questions:

  1. The lemma as stated has $n$ as a power of 2, but the statement for general $n$ follows directly from the statement for $n$ as a power of 2. This is because of the fact from linear algebra that if an $N \times N$ matrix $M$ has full rank, then for any fixed $n \leq N$ there exists an $n \times n$ minor of $M$ with full rank. (This follows, for instance, from expansion by minors.) Suppose $n$ is not a power of $2$, and let $2^d$ be the smallest power of $2$ bigger than $n$. If $f$ is not a low-degree polynomial, our work as stated proves that there exists a low rank $2^d \times 2^d$ matrix $M$ where $M^f$ is full rank. By the above, there exists an $n \times n$ minor $M_n$ of $M$ (with low rank, since its rank is less than that of $M$) where $M_n^f$ is full rank. Therefore, $f$ cannot preserve low rank matrices of dimension $n \times n$ if $f$ is not a low degree polynomial.
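A toy numerical version of this reduction (my own sketch under the rebuttal's assumptions; the sizes, the seed, the use of entrywise exp as a stand-in for a non-polynomial $f$, and the pivoted-QR selection are illustrative choices): from a full-rank $2^d \times 2^d$ matrix $M^f$ one can select $n$ rows and $n$ columns whose $n \times n$ submatrix is still full rank, while the corresponding submatrix of the low-rank $M$ keeps rank at most $\operatorname{rank}(M)$.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(4)
N, r, n = 8, 2, 5                                                # N = 2^d; target size n is not a power of 2
M = rng.standard_normal((N, r)) @ rng.standard_normal((r, N))    # low rank: rank(M) <= 2
Mf = np.exp(M)                                                   # entrywise non-polynomial f; full rank here

_, _, cols = qr(Mf, pivoting=True)                 # pick n linearly independent columns of M^f
_, _, rows = qr(Mf[:, cols[:n]].T, pivoting=True)  # then n rows making the n x n submatrix nonsingular
sub = np.ix_(rows[:n], cols[:n])

print(np.linalg.matrix_rank(Mf), np.linalg.matrix_rank(Mf[sub]))   # 8 and 5: both full rank
print(np.linalg.matrix_rank(M[sub]))                               # still at most 2
```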

  2. Thank you for pointing this out! Theorem 2.5 is actually our full formal statement as-is, and Theorem C11 is missing the essential discontinuities of the first kind (but nothing else). We will correct this in the final version.

  3. Thank you for noticing this; we will correct it in the final version.

  4. In Theorem 2.4, when we wrote “Fact 2.1 is tight”, we meant to say that the converse of Fact 2.1 is true when $f$ is $(n-1)$-times differentiable. To be precise, if $f$ is $(n-1)$-times differentiable and $f$ preserves low rank, then Theorem 2.4 states that $f$ must be a low degree polynomial. We will clarify this and remove the word “tight” from Theorem 2.4. (Just to clarify: our new results show that this converse is true even without the $(n-1)$-times differentiability, and is true for all piecewise continuous functions. As mentioned in Section 2, this set of functions is considerably more expansive than $(n-1)$-times differentiable functions, and includes popular non-second-differentiable functions like ReLU, ELU, and SELU, as well as analytic oddities like the nowhere-differentiable, everywhere-continuous Weierstrass function. Our new result also shows that approximate low rank preservation is impossible for piecewise continuous functions that aren’t approximations of low-degree polynomials, whereas Theorem 2.4 didn’t have any results about approximability.)

Other questions: Thank you for noting these typos and other minor issues, we will correct them all in the final version.

Comment

Thank you for addressing my questions! The generalization to general n makes sense. I will keep my score unchanged due to presentation issues, and I hope the authors will address them in the final version.

Final Decision

This paper introduces a new linear-algebraic tool based on Abelian group representation theory, and uses it to address three key problems in machine learning, leading to three new results:

  • It demonstrates that polynomials are the only piecewise continuous functions for which applying them entrywise to a low-rank matrix results in another low-rank matrix.

  • It shows that if $f$ is a Manhattan kernel, then it must be a completely monotone function.

  • It establishes that Bernstein functions are the only ones that transform the Manhattan metric into the Manhattan metric.

All reviewers acknowledge the novelty and significant contributions of these results, especially the potential application to the attention mechanism in LLMs. Therefore, I recommend accepting this paper.