Attention-based clustering
Abstract
Reviews and Discussion
This work provides a theoretical analysis of the clustering capability that the attention mechanism confers on Transformers. Taking a two-head attention layer as an example, it shows that Transformers can capture the clustering structure underlying the input data. Experimental results also support the theoretical conclusions.
Strengths and Weaknesses
- Strengths:
- It is interesting to study the clustering capability of the Transformer or attention mechanism. While the conclusions are not surprising, the detailed theoretical analyses are still of good value.
- Beyond theoretical analysis, the authors also conducted numerical experiments to show that the conclusions hold under different initialization regimes, mixture separability levels, and problem dimensionalities.
- Weaknesses:
- I suggest that authors provide an illustration to demonstrate the relationship between inputs, cluster centers, and attention layers. This might help the readers to understand the theorem.
- The current proofs and derivations are based on the simplified example. Could the derivation extend to more complicated scenarios?
- Since the attention mechanism is proven to be effective in clustering, is it possible to evaluate its actual clustering performance?
Questions
Please refer to the weaknesses section.
Limitations
Yes.
Justification for Final Rating
The authors' response has addressed my concerns. I would like to keep my original score.
Formatting Concerns
NA
We would like to thank you for your feedback. In the following, we provide answers to the weaknesses you raised.
W1 (Illustration). We appreciate the suggestion. In the revised version, we will include a 2-dimensional illustration showing the input data---two point clouds around their cluster centers---and the corresponding outputs of the trained attention layers, which form flattened point clouds, illustrating how the attention model performs approximate quantization.
W2 (Towards more complicated scenarios). Our theoretical analysis is conducted under three main assumptions: (i) a two-component Gaussian mixture model, (ii) centroids constrained to lie on the unit sphere, and (iii) mutual orthogonality of these centroids.
Regarding (i): Our framework could be extended to handle more than two clusters with orthogonal centroids. The analysis would proceed along similar lines by employing linear attention heads. While technically feasible, such an extension would significantly increase the complexity and reduce the readability of the mathematical exposition, which motivated our choice to focus on the two-cluster case for clarity. That being said, we ran new numerical experiments with 3 clusters, 3 heads, and random initializations, which we will add to the revised version.
Regarding (ii): Constraining centroids to lie on the sphere enables the use of linear attention heads in place of softmax-based mechanisms. This assumption effectively normalizes the queries and keys, a role typically played by explicit normalization layers or softmax operations in standard architectures.
Regarding (iii): The attention-based model relies on directional similarity to perform clustering. If the clusters are aligned (or anti-aligned), the model is not able to distinguish them. For intermediate cases (neither orthogonal nor aligned), additional cross terms complicate the analysis. The orthogonality assumption remains a mild condition in high dimensions. We ran additional experiments with randomly initialized centroids on the unit sphere in dimension 10 to support this claim, obtaining similar results to the orthogonal case, and will add these results to the revised version.
In summary, while these assumptions may seem restrictive, they were chosen deliberately to keep the analysis tractable and transparent. We believe they still capture essential aspects of the mechanism and provide useful theoretical insights.
W3 (Using attention layers as clustering methods). This paper studies attention layers when the input data follows a probabilistic model commonly used in clustering contexts. We show that, when attention heads are trained in an unsupervised setting, they learn to encode the parameters of the underlying data distribution—here, the centroids. The transformation performed by the attention layer can further be interpreted as an estimation of the centroid associated with each input token. While this is not clustering in the strict sense (no discrete assignments are made), cluster labels can be derived post hoc. For instance, one may assign token $X_l$ to cluster 1 if $\langle X_l, \mu_1 \rangle \geq \langle X_l, \mu_2 \rangle$ (and to cluster 2 otherwise), where $\mu_1$ and $\mu_2$ are the learned centroids.
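As a minimal sketch of this post-hoc assignment (array names are illustrative and not the notation of the paper; `X` stores the tokens row-wise, and `mu1`, `mu2` stand for the learned head parameters):

```python
import numpy as np

def assign_clusters(X, mu1, mu2):
    """Assign each token to the learned centroid it is most aligned with.

    X   : (n, d) array of tokens
    mu1 : (d,) learned centroid estimate of the first head
    mu2 : (d,) learned centroid estimate of the second head
    Returns an (n,) array of labels in {1, 2}.
    """
    scores = np.stack([X @ mu1, X @ mu2], axis=1)  # directional similarities
    return 1 + np.argmax(scores, axis=1)           # cluster 1 iff <X_l, mu1> >= <X_l, mu2>
```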
In the case where the attention parameters converge to the true centroids, one can compute the risk of the resulting clustering rule in closed form. We will add a discussion on this topic to the next version.
Thanks for the detailed explanations, which have addressed my concerns. I encourage the authors to include the discussions and potential limitations in the revised version.
The paper shows, both theoretically and empirically, that learning the weights in a very specific two-headed attention mechanism recovers the cluster means of a very specific two-component Gaussian mixture (up to signs), given specific initialization and/or regularization procedures, when the transformed instances are enforced to be close to the instances in terms of their squared distance. The Gaussian mixture centroids are assumed to lie on the unit sphere and to be orthogonal. The attention mechanism is based on linear dot-product attention (i.e. there is no softmax applied) with keys and queries projected to only one dimension by a shared weight vector.
Strengths and Weaknesses
Strength:
- The investigation of how attention relates to mixture models is interesting.
- Theoretical results are backed by proofs (which I have not checked) and experiments on synthetic data.
- The relation of the attention mechanism under consideration and variance reduction is interesting (but not really explored/explained).
Weaknesses:
- The constraints on the mixture model (i.e. two components, orthogonal centroids with norm = 1) are very harsh, reducing the significance of the results strongly.
- Due to one-dimensional keys and queries, the specific form of attention (Eq. 2) can be rewritten as a product of a scaling factor (lambda X_l^T mu) that only depends on the query and a weighted sum of values (here = instances, since there is no projection), where the weighting depends only on the keys (see the display after this list). As such, the specific simplification of attention results in something that does not involve pairwise dependencies between tokens (an important / essential property of attention, also stated in the introduction), but rather attention heads that are learned means of instances and query weights that weight the attention heads.
- The above point is not discussed. In my understanding, the attention mechanism is thus simplified to the point where it naturally resembles a mixture model. This would make the theoretical results very much less interesting.
- The suggested initialization requires knowledge of the solution and is thus not applicable in practice.
- The regularization looks like it might enforce approximate orthogonality of attention weights mu to data instances. This might have strong effects on optimization and further hinders the generalization of the paper's results to more general attention models. This is not discussed.
- That the problem statement (and or proofs) seemingly requires Riemannian gradient descent complicates the analysis further.
- The propositions are proved in the paper, but proof sketches and intuition in the main text would help readability (I have only skimmed the Appendix).
- There are some minor typos and inconsistent nomenclature (e.g. Z_l = 0 and Z_k = 0 in line 92, mu_2 in line 626, naming projection weights keys and queries in line 73) that do not help, since the paper is very math-heavy and concise.
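For reference, the factorization described in the second weakness above can be written schematically as follows (omitting any normalization present in Eq. 2 and writing $\mu$ for the shared key/query weight vector; this display is illustrative, not taken from the paper):
\[
\lambda\,(X_l^\top \mu)\;\cdot\;\sum_{k} (X_k^\top \mu)\, X_k ,
\]
i.e. a query-dependent scalar times a key-weighted sum of the instances.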
Questions
- Why do you need the constraints on the mixture model (see above) + Riemannian GD?
- How can you argue that results on the simplified attention in Eq. 3 are relevant to more general (or: proper) attention models (see notes on factorization above)?
- Can you elaborate on the results in Section 5, i.e. the relation to variance reduction? How much is this reliant on lambda?
Limitations
Yes.
Justification for Final Rating
I still think the paper simplifies attention to essentially a mixture model with two components, which renders the theoretical results rather uninteresting. However, the other reviewers convinced me that there is something in the paper and I don't object to acceptance. I'll stick to my score though. (Let me know in case this is a problem.)
Formatting Concerns
None
Thank you for the time you dedicated to reviewing our paper. We would like to begin by addressing and putting into perspective some of your concerns about our work. More specifically, there is generally a tradeoff in theoretical works between analytical tractability and resemblance to practice. Despite some simplifications, we believe our results provide valuable insights into both the diverse learning capabilities of transformers and a novel perspective on attention mechanisms for unsupervised learning.
More detailed responses to your questions follow below.
Q1. Our theoretical analysis is conducted under three main assumptions: (i) a two-component Gaussian mixture model, (ii) centroids constrained to lie on the unit sphere, and (iii) mutual orthogonality of these centroids.
Regarding (i): Our framework could be extended to handle more than two clusters with orthogonal centroids. The analysis would proceed along similar lines by employing linear attention heads. While technically feasible, such an extension would significantly increase the complexity and reduce the readability of the mathematical exposition, which motivated our choice to focus on the two-cluster case for clarity. That being said, we ran new numerical experiments with 3 clusters, 3 heads, and random initializations. We first use a natural extension of the previously introduced regularization: in this specific case, it consists in adding 3 ($= \binom{3}{2}$, one per pair of heads) regularization terms to enforce pairwise orthogonality between the head parameters. We also observe that using a single, simpler regularization term based on a product of the pairwise inner products between the head parameters achieves almost comparable performance while significantly reducing the number of regularization terms. We will add these experiments to the revised version.
Regarding (ii): Constraining centroids to lie on the sphere enables the use of linear attention heads in place of softmax-based mechanisms. This assumption effectively normalizes the queries and keys, a role typically played by explicit normalization layers or softmax operations in standard architectures.
Regarding (iii): The attention-based model relies on directional similarity to perform clustering. If the clusters are aligned (or anti-aligned), the model is not able to distinguish them. For intermediate cases (neither orthogonal nor aligned), additional cross terms complicate the analysis. The orthogonality assumption remains a mild condition in high dimensions. We ran additional experiments with randomly initialized centroids on the unit sphere in dimension 10 to support this claim, obtaining similar results to the orthogonal case, and will add these results to the revised version.
In summary, while these assumptions may seem restrictive, they were chosen deliberately to keep the analysis tractable and transparent. We believe they still capture essential aspects of the mechanism and provide useful theoretical insights.
Using Riemannian gradient descent is the most natural and mathematically sound approach, given that we can only compute the risk for parameters that lie on the sphere. As such, we are not in a position to compute a standard Euclidean gradient and must instead project each update onto the tangent space of the sphere. Moreover, this choice does not appear to be restrictive in practice: numerical experiments using standard SGD strategies behave similarly to the theory.
We thank you for having raised these concerns, and will discuss them in the revised version.
Q2. We emphasize that linear attention is a standard and widely studied setting (see among many others [1, 2] and references therein). Importantly, any linear attention head can be decomposed into a sum of heads with one-dimensional keys and queries. Therefore, we respectfully disagree with your remark regarding the factorization. In this setting, the attention score between tokens $X_l$ and $X_k$ is $\lambda\,(X_l^\top \mu)(X_k^\top \mu)$. It measures a specific type of dependency between tokens, namely, their joint alignment with a fixed direction $\mu$, determined by the attention head. Exploring extensions to nonlinear attention mechanisms is an interesting direction left for future work.
[1] Trained Transformers Learn Linear Models In-Context, Zhang, Frei, Bartlett, JMLR 24.
[2] Linear attention is (maybe) all you need (to understand transformer optimization), Ahn et al., ICLR 24.
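To make the decomposition claim above concrete, here is a schematic argument (our notation, not necessarily the paper's): for a linear head with query and key matrices $W_Q$ and $W_K$, a singular value decomposition $W_Q W_K^\top = \sum_r \sigma_r\, u_r v_r^\top$ yields
\[
X_l^\top W_Q W_K^\top X_k \;=\; \sum_{r} \sigma_r\,(X_l^\top u_r)\,(v_r^\top X_k),
\]
so the bilinear score of any linear head is a sum of scores of heads whose queries and keys are one-dimensional projections.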
Q3. Thank you for raising this question. We used the phrase ``variance reduction'' to mean that the attention-based predictor achieves lower reconstruction error than predicting the mean of the cluster---for which the reconstruction error is exactly the variance of the cluster. We acknowledge that this formulation was unclear, and will clarify in the revised version.
Regarding the influence of $\lambda$: for simplicity, Proposition 5.2 reports the result for a specific value of $\lambda$ (corresponding to the minimal reconstruction error). However, our computations are carried out for arbitrary values of $\lambda$, and show that the variance-reduction effect holds over a range of values and not just at a single point. For instance, in the setting considered there, this effect occurs for $\lambda$ approximately between 0.3 and 0.4.
Finally, note that we have derived additional results on the approximate quantization phenomenon in a more challenging in-context setting, where mixture parameters vary for each prompt. We consider prompts composed of tokens drawn from a mixture of two Gaussians, each parameterized by prompt-dependent centroids that are randomly sampled on the unit sphere and mutually orthogonal. While the original architecture studied in the paper appears too limited for this scenario, a straightforward solution is to increase the number of heads. Since the head parameters in our work are assumed to be orthogonal, this leads us to consider at most $d$ heads, where $d$ is the ambient dimension.
Interestingly, stacking the $d$ orthogonal heads in this setting results in an architecture with no parameters to learn. Yet, it can be shown that this architecture enjoys quantization properties similar to those established in the standard clustering framework, and furthermore successfully adapts to the underlying mixture structure. We will include these new developments in the revised version.
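To see why no learnable parameters remain, note that if the head parameters $\mu_1,\dots,\mu_d$ form an orthonormal basis of $\mathbb{R}^d$, then
\[
\sum_{h=1}^{d} \mu_h \mu_h^\top \;=\; I_d ,
\qquad\text{hence}\qquad
\sum_{h=1}^{d} (X_l^\top \mu_h)(\mu_h^\top X_k) \;=\; X_l^\top X_k ,
\]
so the stacked attention scores no longer depend on the particular choice of orthonormal heads.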
Thank you for the response. I am happy with most of the remarks but have to admit that I am now under the impression that there is possibly more time needed to come up with a concise writeup. First of all, I still find it important to generalise the writeup to K components and perhaps motivate the case K=2 as an example or special case that is then studied theoretically in detail. According to your response, some parts like the pairwise regularisation constraints can apparently be replaced by much simpler alternatives. Given your response, this seems to hold for the orthogonality constraints as well. My conclusion is that the paper is perhaps not yet matured enough for publication. At the current stage, there seem to be (too) many loose ends. I will take the discussion to the PC after the rebuttal period but feel free to correct me if I am mistaken or took the wrong turn at some point.
We are glad our rebuttal helped clarify your questions.
Regarding the suggestion to ``generalize the writeup'', if the goal is to present the full $K$-component model up front before narrowing to $K=2$ for the analysis, we worry this might give a misleading impression of the paper's theoretical scope. Our choice to focus on $K=2$ and orthogonal centroids is deliberate: it captures the core phenomena with minimal technical overhead, and it enables us to derive nontrivial theoretical results that would become less readable for larger $K$.
Therefore, we propose to retain the current structure by focusing the analysis on $K=2$, and adding a more detailed discussion on generalizations (e.g., arbitrary $K$ and relaxed orthogonality) with supporting numerical results. We believe these additions fit well within the scope of a post-rebuttal revision and do not affect the main contributions of the paper.
Thank you and sorry for the confusion, I understand that the math is already involved for k=2 and I would not consider changing the entire writeup to all k at this stage, as this would lead to a very different paper. However, the need for fixing k=2 is not really motivated in the paper yet and could perhaps (as also proposed in your reply) be better explained. Similarly, it would be interesting for the reader to learn what is missing / why it is impossible at the moment to derive the generalization that holds for all k. This does not at all affect the contribution of the submission but could be a way to make the paper stronger.
Thank you very much for the clarifying follow-up. We fully agree with your suggestions, and will add these explanations to the revised version. Since the rebuttal addressed your questions, we kindly ask you to reconsider your score.
This paper provides a theoretical analysis of the ability of attention mechanisms to perform unsupervised clustering. The authors investigate whether a simplified Transformer layer can learn the latent structure of data drawn from a mixture model without access to any labels. To do this, they analyze a simplified two-head linear attention layer where the value matrices are fixed to the identity. The learning task is to recover the two centroids of a Gaussian Mixture Model (GMM) from which the input tokens are sampled. The core theoretical contributions are:
- A proof that, for a simplified Dirac mixture model, projected gradient descent (PGD) on the risk function drives the attention head parameters to align with the true centroids, provided they are initialized on a specific manifold.
- An extension of this result to the more general Gaussian mixture model, again showing convergence to the true centroids under similar initialization constraints.
- A proposed regularization scheme to overcome the impractical initialization requirements, which is shown empirically to promote disentanglement between the heads and enable learning from a random initialization on the sphere.
- An analysis of the oracle regime (where head parameters equal the true centroids), showing the attention layer acts as an approximate quantizer and can achieve a lower risk than a single optimal quantizer due to a variance-reduction effect from aggregating multiple tokens.
The theoretical findings are supported by numerical experiments on synthetic data that validate the convergence dynamics and the effectiveness of the proposed regularizer.
Strengths and Weaknesses
Strengths
Novel Theoretical Framework: This work provides a novel and rigorous proof that a simplified attention mechanism can perform fully unsupervised clustering on a simple Gaussian Mixture Model. This is a significant and complex theoretical step in understanding the unsupervised capabilities of Transformers.
Practical Regularization Method: The paper proposes an effective regularization term that empirically overcomes the restrictive initialization assumptions of the main theorems, enabling the model to learn the correct centroids from a random start.
Clarity: Numerical experiments, together with cohesive presentation, make the work pleasant to read.
Weaknesses
Impractical Initialization: The main convergence theorems (3.4 and 4.4) are a central contribution, yet they rely on an impractical initialization on a manifold that depends on the unknown, true cluster centroids. This creates a significant gap between the formal theory and practical application.
Limiting Simplifying Assumptions: The analysis relies on strong simplifications that limit its generality, including a highly simplified linear attention model and a two-component, orthogonal GMM. It is unclear how these findings translate to more complex data, even GMM with more components.
Questions
The choice of 2 heads is connected with the fact that the mixture has two clusters. However, architectures are usually restricted to a reasonably low number of heads. Do you think that a similar approach can be generalized to N clusters, but with a smaller number of heads?
Many theoretical papers on Transformers focus on their ability to perform supervised, in-context learning, for instance, by implementing known algorithms. Does your analysis of unsupervised clustering reveal a fundamentally different capability of attention, or could it be seen as another form of in-context algorithm implementation where the "algorithm" is centroid estimation?
Limitations
Limitations are addressed in the work.
Justification for Final Rating
The work has an interesting theoretical contribution.
Authors answered my questions and even expanded some of their results.
Formatting Concerns
No concerns
We thank you for your feedback and for acknowledging the relevance of our work.
Initialization. We agree that restricting initialization to the manifold is a limitation. Relaxing this would require proving the stability of the manifold, so that initializations sufficiently close to it still lead the PGD iterates to converge toward the manifold. Unfortunately, we have not yet been able to establish this result theoretically. To address this limitation in practice, we introduce a regularization term that empirically promotes global convergence.
Simplified architecture. We acknowledge that using a linear head is a simplification compared to practice. This choice was guided by two considerations: (i) this simplified head appears to be sufficient for the structure of the data at hand, and (ii) the mathematical analysis of softmax attention heads (especially when combined with a quadratic loss) is complex and lacks a conclusive theoretical framework to date.
More mixing components. Thank you for this question. Our framework could be extended to handle more than two clusters with orthogonal centroids. The analysis would proceed along similar lines by employing linear attention heads. While technically feasible, such an extension would significantly increase the complexity and reduce the readability of the mathematical exposition, which motivated our choice to focus on the two-cluster case for clarity.
It is also worth noting that stacking attention heads can be reinterpreted as a single attention head with matrix-valued parameters subject to a rank constraint. Specifically, the transformation applied to the $l$-th token in a two-head architecture can be expressed as a single-head attention whose parameter matrix is $M = \mu_1\mu_1^\top + \mu_2\mu_2^\top$, so that $M$ is a rank-2 matrix. Therefore, our work can be seen as an analysis of a non-convex attention-based strategy for implementing spectral-like methods. This perspective also sheds light on what happens when using fewer heads than cluster components: the resulting performance should be comparable to that observed in spectral clustering methods under similar conditions. We leave a formal analysis of this phenomenon for future work.
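Schematically, with the bilinear scores of the simplified linear heads, the two-head sum collapses into a single bilinear form:
\[
\sum_{h=1}^{2} (X_l^\top \mu_h)(\mu_h^\top X_k) \;=\; X_l^\top M X_k ,
\qquad
M \;=\; \mu_1\mu_1^\top + \mu_2\mu_2^\top ,
\]
so the two heads are equivalent to a single head whose parameter matrix has rank at most 2.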
In-context setting. We thank you for raising this question, as it led us to new results for an extended version of the architecture, within an in-context learning framework. We consider prompts composed of tokens drawn from a mixture of two Gaussians, each parameterized by prompt-dependent centroids that are randomly sampled on the unit sphere and mutually orthogonal.
While the original architecture studied in the paper appears too limited for this scenario, a straightforward solution is to increase the number of heads. Since the head parameters in our work are assumed to be orthogonal, this leads us to consider at most $d$ heads, where $d$ is the ambient dimension.
Interestingly, stacking the $d$ orthogonal heads in this setting results in an architecture with no parameters to learn. Yet, it can be shown that this architecture enjoys quantization properties similar to those established in the standard clustering framework, and furthermore successfully adapts to the underlying mixture structure. We will include these new developments in the revised version.
Thank you to the authors for answering my questions. Given the theoretical scope of the work, which focuses on a specific basic case (a GMM with 2 orthogonal clusters), I encourage them to include a discussion of existing results and of their expectations regarding generalization within this framework.
This theoretical paper investigates the clustering capabilities of transformers. Specifically, a two-class Gaussian mixture model is analyzed. First, the degenerate case, where the Gaussians become Dirac functions, is examined, followed by regular Gaussians with σ larger than zero. The paper presents a theoretical convergence analysis, which is subsequently empirically demonstrated through simulations.
Strengths and Weaknesses
Strengths
- The paper provides a complete theoretical analysis of transformers as an unsupervised clustering mechanism. This offers new insights into the properties of transformers.
- The paper is well-written. Theorems are clearly stated, while their proofs are given in the appendices.
- The convergence curves, with and without regularization, are insightful.
Weaknesses
- Only a simplified scenario is analyzed (two classes, orthogonal centroids).
- Only a brief discussion on the potential applications of the findings of the paper.
- Only linear activations are discussed in the main paper. Appendix G discusses the softmax activation function. It would be beneficial to add a brief description to the main article as well.
Questions
- It would be interesting to simulate a problem with more classes. I assume that the theoretical analysis would be very challenging; however, a short simulation study may further strengthen the contribution.
- Can the authors comment on the relation of the proposed method and the standard expectation-maximization method?
- How important are the assumptions on orthogonal centroids? Are they mandatory for the convergence of the method, or only for the analysis?
- Can you determine the number of classes from the attention mechanism?
Limitations
Yes.
Justification for Final Rating
The paper analyzes transformers as an unsupervised clustering mechanism. The authors have addressed all my comments and suggestions. I will stay with my 'Accept' decision as the paper proposes a fresh perspective on clustering, but the method has some limitations, as explained adequately by the authors. Overall, I think that the paper should be accepted.
Formatting Concerns
NA
Thank you for taking the time to review our paper, and for acknowledging the relevance of our work. We provide some answers to your questions below.
Q1. Our framework could be extended to handle more than two clusters with orthogonal centroids. The analysis would proceed along similar lines by employing linear attention heads. While technically feasible, such an extension would significantly increase the complexity and reduce the readability of the mathematical exposition, which motivated our choice to focus on the two-cluster case for clarity. That being said, we ran new numerical experiments with 3 clusters, 3 heads, and random initializations. We first use a natural extension of the previously introduced regularization: in this specific case, it consists in adding 3 ($= \binom{3}{2}$, one per pair of heads) regularization terms to enforce pairwise orthogonality between the head parameters. We also observe that using a single, simpler regularization term based on a product of the pairwise inner products between the head parameters achieves almost comparable performance while significantly reducing the number of regularization terms. We will add these experiments to the revised version.
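For illustration, a minimal sketch of the pairwise-orthogonality regularizer used in these 3-head experiments (the function name and the exact scaling are illustrative, not the paper's notation):

```python
import numpy as np

def pairwise_orthogonality_penalty(heads):
    """Sum of squared pairwise inner products between head parameters.

    heads : (H, d) array of head parameters (H = 3 in the experiments above).
    For H = 3 this yields 3 penalty terms, one per pair of heads.
    """
    penalty = 0.0
    H = heads.shape[0]
    for i in range(H):
        for j in range(i + 1, H):
            penalty += (heads[i] @ heads[j]) ** 2
    return penalty

# Example: three random unit vectors in dimension 10 are nearly, but not exactly, orthogonal
rng = np.random.default_rng(0)
mus = rng.standard_normal((3, 10))
mus /= np.linalg.norm(mus, axis=1, keepdims=True)
print(pairwise_orthogonality_penalty(mus))
```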
Q2. While the two methods are fundamentally different and originate from distinct principles, the comparison is insightful. The EM algorithm is a likelihood-based approach that heavily relies on the prior assumption of a Gaussian mixture model. In contrast, the attention layer considered here operates in a more geometric fashion, aligning tokens based on directional similarity.
That said, some interesting parallels can be drawn. In the EM algorithm for Gaussian mixtures, the steps are explicit, so that the centroids are estimated by aggregating data points weighted by soft-assignment scores, which reflect the estimated probability of membership in each cluster. While the attention-based model lacks an underlying probabilistic framework, it also performs a weighted aggregation of tokens, this time using alignment scores that quantify directional similarity with the head parameters.
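As a toy illustration of this parallel, the two weighting schemes can be contrasted as follows (an isotropic, equal-weight Gaussian mixture is assumed; the code is purely illustrative and not the paper's notation):

```python
import numpy as np

def em_responsibilities(X, centroids, sigma2):
    """E-step soft assignments for an isotropic, equal-weight Gaussian mixture."""
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    logits = -sq_dist / (2.0 * sigma2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def attention_alignment_scores(X, directions):
    """Directional similarity of each token with each head parameter."""
    return X @ directions.T  # (n, K), unnormalized alignment scores
```

Both produce an (n, K) array of weights used to aggregate data points; EM normalizes them as posterior probabilities under a Gaussian likelihood, whereas the attention mechanism uses raw alignments with the learned directions.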
We thank you for prompting us to formalize this comparison, and we will include this discussion in the revised version of the paper.
Q3. The attention-based model relies on directional similarity to perform clustering. If the clusters are aligned (or anti-aligned), the model is not able to distinguish them. For intermediate cases (neither orthogonal nor aligned), additional cross terms complicate the analysis. It is nevertheless worth noting that as the dimension increases, even without explicitly enforcing orthogonality between the centroids, randomly choosing them on the sphere results in centroids that are nearly orthogonal. Thus, using the algorithms introduced in the paper, we obtain similar numerical results with centroids randomly chosen on the sphere, even though they are not strictly orthogonal.
The formal analysis in this more general case is an interesting direction for future work.
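As a quick numerical check of the near-orthogonality claim (illustrative only), the average absolute inner product between two independent uniform unit vectors in R^d decays roughly like 1/sqrt(d):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_inner_product(d, trials=10_000):
    """Average |<u, v>| for independent uniform random unit vectors in R^d."""
    u = rng.standard_normal((trials, d))
    v = rng.standard_normal((trials, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return np.abs((u * v).sum(axis=1)).mean()

for d in (2, 10, 100, 1000):
    print(d, round(mean_abs_inner_product(d), 3))
```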
Q4. We conducted new numerical experiments where the inputs were drawn from a two-component mixture, but the model used three attention heads, meaning there were more heads than necessary. In the low-noise regime, one head successfully recovers a centroid, while the remaining two heads align with each other, but the recovery of the second centroid is partial. However, as the noise level increases, the attention parameters struggle to align with the true centroids, suggesting the existence of bad local minima in the optimization landscape under such conditions. Notably, this setting offers a natural diagnostic: when too many heads are used, redundant heads tend to converge to the same parameters, effectively signaling the model's overcapacity. We leave further investigation for future work.
The authors have addressed my concerns. I am looking forward to reading their detailed description of the relation to the EM algorithm.
This paper provides a theoretical analysis of attention layers as clustering mechanisms under Gaussian mixture models, complemented by a simple regularizer and supporting experiments. While the theory is developed in a simplified setting (two components, orthogonal centroids, linear attention), the work nonetheless offers new insights into the unsupervised capabilities of Transformers. Reviewers agree the paper is well written, technically solid, and addresses concerns adequately in the rebuttal, though they note limitations in generality and scope. Reviewer g8nw maintained a low rating (2), yet after discussion he indicated no objection to acceptance. Overall, the consensus is that the contribution is novel and meaningful despite its restricted setting, and I recommend acceptance.