PaperHub
Overall score: 5.5/10 · Poster · 4 reviewers
Ratings: 3, 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Consensus Is All You Get: The Role of Attention in Transformers

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

We provide a rigorous, mathematical analysis of the asymptotic properties of attention in transformers, and also empirical results, showing that under certain assumptions all tokens asymptotically converge to one cluster.

Abstract

Keywords
Attention, Transformers, Consensus

Reviews and Discussion

Official Review
Rating: 3

In this paper, the authors study the asymptotic properties of the attention mechanism of transformers. To do so, the authors introduce a continuous differential equation emulating the evolution of tokens across an increasing number of layers, and they show that asymptotically -- i.e., as the number of layers goes to infinity -- all the tokens tend to collapse to a single point under some assumptions on the initial configuration of the tokens and on the values of the Q, K and V matrices across layers.
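As an illustration of the dynamics being analyzed, here is a minimal numerical sketch (my own, not the authors'): it assumes a simplified form of the token evolution in which tokens live on the unit sphere and move under softmax attention followed by a tangential projection; the matrices Q, K, U below are random/identity placeholders, not the paper's trained weights. Under these assumptions the token diameter typically shrinks toward a single cluster, which is the collapse phenomenon described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8                                   # number of tokens, embedding dimension

# Placeholder weights (the paper's Q, K, V matrices are not reproduced in this review).
Q = rng.normal(size=(d, d)) / np.sqrt(d)
K = rng.normal(size=(d, d)) / np.sqrt(d)
U = np.eye(d)                                  # value matrix; identity as in the simplest setting

x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # tokens initialized on the unit sphere

def step(x, dt=0.05):
    """One forward-Euler step of a sphere-constrained attention ODE."""
    logits = (x @ Q.T) @ (x @ K.T).T                       # <Q x_i, K x_j>
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                      # row-wise softmax (full attention)
    v = a @ (x @ U.T)                                      # attention-weighted values
    v -= np.sum(v * x, axis=1, keepdims=True) * x          # project onto the tangent space at x_i
    x = x + dt * v
    return x / np.linalg.norm(x, axis=1, keepdims=True)    # re-normalize back onto the sphere

for _ in range(4000):
    x = step(x)

# The maximum pairwise distance (the "diameter" of the token cloud) typically
# shrinks toward zero, i.e., the tokens approach a single cluster.
diam = np.max(np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1))
print(f"token diameter after integration: {diam:.4f}")
```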

Questions for Authors

The paper is generally well written, and the theory is complemented with many experiments. I have a couple of concerns, though:

  1. I think the paper lacks motivation: why are the results important for the community? Is there any implication for real-world applications?
  2. While experiments are important, I believe that the paper wastes too much space on the experimental side, when the main contributions seem to be on the theoretical side. In particular, it seems that the main novelty of the paper is to use different techniques (borrowed from control theory) to prove their results, compared to previous works in the same area. This novelty is completely lost in the paper, as there is no mention of it in the main paper after the introduction. I would have rather used more space to explain the intuition behind the proofs instead of including the random weights (e.g., first column of p. 6) or the random inputs (e.g., Table 2 or second column of p. 7) used in the experiments, which could be moved to the appendix.

Claims and Evidence

The proofs of the theorems seem solid to me, and the experimental results provided are satisfactory and seem to back the theoretical results.

Methods and Evaluation Criteria

The proof techniques employed are reasonable and the experimental setup is sensible.

Theoretical Claims

I checked the proofs of Theorems 3.2, 4.2 and 4.3 without going into the details, and I could not spot any serious mistake.

Experimental Design and Analysis

The experimental setup used for Section 5 is sensible. However, the code for the experiments was not provided, so that it is not possible to independently verify the experimental results presented in the paper.

Supplementary Material

N/A

Relation to Broader Literature

The main contribution of the paper is to generalize a series of results in the same setting (asymptotic evolution of tokens across attention layers) to broader configurations of the Q, K and V matrices and to multiple heads.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for the time devoted to reading our paper and providing feedback.

Why are the results important for the community? Is there any implication for real-world applications?

The paper answers the following question: what is the role of attention? The role is to bring the tokens together, i.e., to achieve consensus. In practice, our results show that transformers cannot be too deep since the information contained in the tokens disappears as they converge to a location that is independent of where the tokens start. Our results also show that a transformer has to be deep enough since, otherwise, the last token (used to predict the next token) is not sufficiently influenced by the tokens that precede it, which would lead to a next-token prediction that is independent of the query/context. Our results then raise the following question: what is the optimal depth of a transformer? This is a question we are trying to answer.

While experiments are important, I believe that the paper wastes too much space on the experimental side.

As can be seen by the different reviews and our replies, there are different opinions about experiments. We hope to reach a compromise by reducing the space devoted to experiments while providing more informative experiments (e.g., randomization over multiple prompts as suggested by other reviewers).

This novelty is completely lost in the paper, as there is no mention of it in the main paper after the introduction.

We will address the reviewer's concern by using the space freed by reducing the number of experimental plots to succinctly describe the techniques used in each proof and highlight their control theoretic origin.

Reviewer Comment

I thank the authors for their comments. I will maintain my positive score.

Author Comment

Thank you for the positive score.

Official Review
Rating: 3

This paper theoretically demonstrates that a large language model using Transformers may collapse. The model collapse is analyzed through a mathematical analysis of the asymptotic properties of attention in Transformers. The authors claim that all tokens asymptotically converge to each other. This claim is supported by a simulation experiment, where GPT-2 XL is used iteratively by feeding its output back as the input.
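For concreteness, the iterative setup described above can be sketched as follows; this is a rough illustration only, with a placeholder prompt and an ad-hoc spread metric (not the paper's E from Eq. (18)): GPT-2 XL's greedy continuation is fed back in as the next prompt, and the spread of last-layer token representations is tracked across iterations.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

def pairwise_spread(h):
    """Mean pairwise distance between normalized token representations of one layer."""
    h = h / h.norm(dim=-1, keepdim=True)
    return torch.cdist(h, h).mean().item()

text = "The role of attention in transformers is"     # placeholder prompt, not from the paper
for it in range(5):                                    # feed the output back as the next input
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=32, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        hidden = model(out, output_hidden_states=True).hidden_states
    spread = pairwise_spread(hidden[-1][0])            # spread at the last layer
    print(f"iteration {it}: last-layer token spread = {spread:.4f}")
    text = tok.decode(out[0])                          # generated text becomes the new prompt
```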

Update after rebuttal

Thank you for the clarification. I revised the final score to 3. This reviewer encourages the authors to include experimental support using larger language models to back up the theory, and believes acceptance should be considered based on the to-be-conducted experiments and some writing clarifications (including the reviewers’ suggestions and the authors’ own suggested to-do list) in the next revision.

Questions for Authors

  • This reviewer could not understand exactly why the continuous concept is necessary for deriving equation (5) from (4). Could the authors clarify this reasoning?
  • Why do the trained weights converge faster than the random ones?
  • A traditional element-wise layer normalization in Transformers does not follow the norm-scaling formula described in Eq. (2).
  • Lemma 4.2 does not exist in the reference Geshkovski et al., 2023a.
  • Should y_1 in Eq. (18) be y_0?

Claims and Evidence

The claim appears reasonable under certain assumptions; however, the practical evidence supporting it may be insufficient.

Methods and Evaluation Criteria

The paper presents a theoretical analysis suggesting that the model could eventually collapse under certain conditions.

Theoretical Claims

Since my focus was primarily on the practical aspects of this paper, the theorems appear mostly correct under given assumptions.

Experimental Design and Analysis

Experiments based on simulations are provided, but some aspects are not practical and problematic.

Supplementary Material

I was unable to follow all the materials but some appeared to be particularly important.

Relation to Broader Literature

The theoretical backup related to model collapse provided in this paper is an important contribution.

Essential References Not Discussed

There is no related work section, despite the existence of many continuous Transformer-based papers that could be cited, none of which are referenced. While there may be differences in the terminology of "continuous" used in this paper compared to others, the authors should explicitly discuss how this work differs from previous studies.

Other Strengths and Weaknesses

  • The theoretical approaches in this paper give a sense of direction, but the exact goal is not clearly presented. From my perspective, the theories are limited by the provided assumptions and seem to hold only under specific conditions. As a result, this reviewer feels that the experimental support is limited and not clearly aligned with the theory.
  • In that regard, the presentation of this paper should be more refined to make it easier to grasp the authors' claims. Furthermore, all materials should be self-contained; for instance, it took several attempts to find that E (which appears on the y-axis of graphs such as Figures 6 and 7) was defined in the Appendix.
  • The experiment is designed as a simulation, as the authors stated, by using generated tokens as input for the next iteration of the generation to mimic model collapse. However, this setup does not accurately simulate the behavior of a large language model. It is unclear why the authors did not use larger models, given the availability of many such models at up to 70B scale, for example.
  • This reviewer speculates that the outdated GPT-2 XL, trained on a limited corpus with fixed token lengths, may easily generate meaningless tokens when iteratively fed its own outputs. This may not accurately reflect the model collapse described in the theory. Looking forward to seeing similar results with larger models, up to 32B (which I believe would be sufficient for testing).

Other Comments or Suggestions

This paper should be further refined to use more precise terminology and notations. This reviewer believes this paper can be improved by presenting its motivation and goals better. Additionally, scaling up and modernizing the experiments instead of just using GPT2-XL would support the claims effectively. Although this reviewer is not a theorist and partially follows the derivations, the attempt to address model collapse theoretically is valuable. If the authors address the concerns, I am willing to raise my score.

Author Response

We thank the reviewer for reading our paper and providing feedback.

While there is no section titled "related work", after each theorem there is a remark titled "closest result available in the literature" where we provide a detailed comparison with the most relevant work. We would be happy to consider any paper the reviewer finds relevant.

The reviewer wrote "The theoretical approaches in this paper give a sense of direction, but the exact goal is not clearly presented. From my perspective, the theories are limited by the provided assumptions and seem to hold only under specific conditions. As a result, this reviewer feels that the experimental support is limited and not clearly aligned with the theory."

In the subsection "contributions of the paper" we wrote "The main contribution of this work is to provide a number of results (...) showing that all tokens converge to a single cluster thereby leading to a collapse of the model." This is the goal of the paper. We are happy to expand the goal's description.

We interpret the second sentence as: the stated assumptions are too strong. Could the reviewer tell us which assumptions are too strong so that we can address them? Please see the reply to reviewer aNPn where we explain how specific assumptions can be relaxed.

All the experiments support the conclusions of the theorems, i.e., they show evidence of consensus. The experiments either satisfy the theory's assumptions or do not. When the assumptions are satisfied, there is perfect agreement with the theory. When they are not, the experiments show that consensus still holds. This does not mean the results are wrong; it means that consensus can be proved under weaker assumptions.

Having the definition of E in the appendix was an oversight that will be corrected.

The reviewer wrote "The experiment is designed as a simulation (...) However, this setup does not accurately simulate the behavior of a large language model. It is unclear why the authors did not use larger models (...)"

The experiments in Figures 3 and 5-7 were performed on the GPT-2 XL model and are not simulations. Since this is a theoretical paper, experiments do not require large models as we do not seek to compare performance metrics. The experiments serve two purposes: 1) illustrate the theorems proved in the paper; 2) show that the conclusions hold under weaker assumptions.

The reviewer wrote "This reviewer speculates that the outdated GPT-2 XL, trained on a limited corpus with fixed token lengths, may easily generate meaningless tokens when iteratively fed its own outputs. This may not accurately reflect the model collapse described in the theory."

The experiments do not show GPT-2 XL providing meaningless tokens, they show that all the tokens converge to the same token as predicted by the theory. The function E is only zero when all the tokens are the same. In Figures 3, 5, 6, 7, and 8 we see the function E converging to zero which indicates that all the tokens converge to consensus, i.e., perfect agreement between experiments and theory.
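The exact definition of E (Eq. (18) in the appendix) is not quoted in this discussion. As a stand-in for readers, any nonnegative spread measure that vanishes exactly at consensus conveys the same idea, for example:

```latex
% Illustrative stand-in, not the paper's Eq. (18):
E(x_1,\dots,x_n) \;=\; \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lVert x_i - x_j\rVert^2,
\qquad
E = 0 \iff x_1 = x_2 = \dots = x_n .
```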

The reviewer wrote "This reviewer believes this paper can be improved by presenting its motivation and goals better. Additionally, scaling up and modernizing the experiments instead of just using GPT2-XL would support the claims effectively (...) If the authors address the concerns, I am willing to raise my score."

We will revise the paper to provide more intuitive explanations for the theoretical results and better describe the objectives. We hope to have convinced the reviewer that: 1) theoretical claims are proved theoretically, not experimentally; 2) there is perfect alignment of theory with experiments. We will also provide more experiments to address the concerns raised by multiple reviewers and would be delighted by a raised score.

Why the continuous concept is necessary for deriving the equation from (4) to (5). We can interpret (4) as a discrete-time dynamical system. Equation (5) is a differential equation, i.e., a continuous-time dynamical system. Continuous-time models enable the use of tools developed in physics, mathematics, and control theory. Corresponding tools for discrete-time either do not exist or are much more difficult to use.
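Schematically (the paper's equations (4) and (5) are not reproduced here, so the right-hand side f below is a generic placeholder for the attention update), the passage from discrete layers to continuous time is the usual forward-Euler correspondence:

```latex
% Schematic only: f stands in for the attention update of (4)-(5).
x_i^{k+1} \;=\; x_i^{k} + \Delta t\, f\!\big(x_1^{k},\dots,x_n^{k}\big)
\qquad\xrightarrow{\;\Delta t \to 0\;}\qquad
\dot{x}_i(t) \;=\; f\big(x_1(t),\dots,x_n(t)\big),
```

so the layer index k plays the role of continuous time t, which is what makes Lyapunov- and ISS-style arguments applicable.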

Why do the trained weights converge faster than the random ones? Reviewer aNPn had a similar question, please see the answer we provided.

A traditional element-wise layer normalization in Transformers does not follow the norm-scaling formula described in Eq. (2). Eq. (2) has been used in concrete transformers, see the citation in the paper. See also the reply to reviewer aNPn who had a similar question.

Lemma 4.2 does not exist in the reference Geshkovski et al., 2023a. Lemma 4.2 appears in versions 1-3 of the arXiv preprint, but in the latest version it is now Lemma 6.4. This will be corrected.

Should y_1 in Eq. (18) be y_0? We will rectify this typo in the final version.

Reviewer Comment

Thank you for your detailed responses. Overall, I think the to-do actions for my concerns presented in the rebuttal look great. For the concerns related to the strong assumptions, let me clarify my points with two examples: 1) U_\eta is the identity in Assumptions 3.1 and 4.1; 2) there is only one head in Assumption 4.4. Could these be too strong for the theory to ultimately match practice in Transformers, where these assumptions are not satisfied? I might be wrong, so please let me know if I misunderstood any points.

Furthermore, regarding the GPT-2-based experiments: although this paper is mainly theoretical, since the authors chose to include experimental support, the experimental setup should be more practical. Repeating GPT-2 with the generated tokens may not be a sufficient experiment to support the theory, since the authors seem to aim at showing a case where practical Transformer-based models may collapse. I mean that the experiments should involve deeper models (likely easier to collapse) to show that tokens converge to the same outputs, rather than just repeating a smaller model. Is the theory mainly formulated for this repetitive setup? From my reading of the theory, I do not see that.

Author Comment

To better understand the impact of the different assumptions it is convenient to refer to Table 1.

The third column refers to the case where U does not need to be the identity. In this case, tokens still converge to consensus. But under the stronger assumption that U is the identity, convergence is guaranteed for almost all initial configurations, whereas when U is not the identity we have the additional requirement that the initial tokens belong to a hemisphere. We note that this is the first paper proving convergence to consensus when U is not the identity. We will also include experiments with tokens not starting in a hemisphere to illustrate that convergence still occurs in this case.
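For concreteness (the paper's precise statement is not quoted here), "belonging to a hemisphere" is usually formalized as the existence of a common direction making a positive inner product with every token:

```latex
\exists\, w \in \mathbb{R}^{d},\ \lVert w\rVert = 1 :\qquad
\langle w,\, x_i(0) \rangle > 0 \quad \text{for all } i = 1,\dots,n .
```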

The results in the last column can be easily extended to multiple heads when all the heads have the same U matrix (but different attention matrices). We will discuss this observation in the revised version of the paper. When the heads have different U matrices, the aggregated effect is a U matrix that is both time and state dependent and this is much more difficult to analyze. The empirical results show that consensus still occurs in this case although we do not know yet how to establish this theoretically.

We take your point regarding larger models. In the revised version we will include experiments on the largest model that we can run on our lab's machines in the amount of time we have to prepare the final version. If the reviewer is satisfied with the proposed changes we would appreciate if the score could be raised.

Official Review
Rating: 3

This paper tries to theoretically analyze the phenomenon that, under an auto-regressive attention model, all tokens asymptotically converge to the "consensus set". The authors construct discrete/continuous-time attention models and show that, under full/auto-regressive attention, tokens converge in some special cases. The authors also support their claim with some experiments.

Questions for Authors

What's the key difference in technique between your work and previous works such that you only need relaxed conditions?

Claims and Evidence

I think the claims are supported by both the previous literature and the toy experiments in Section 5.

Methods and Evaluation Criteria

There are no criteria for the experimental parts of this paper. For the proposed theoretical model, I think it generally makes sense for the problem and may follow a framework from previous theoretical papers.

Theoretical Claims

I haven't checked the proofs of the main theorems in detail, but I haven't seen critical problems in the claims and lemmas (like App. B or Lemma C.2) in the appendix.

Experimental Design and Analysis

I checked the validity of the experimental designs. I think they are toy experiments, which is acceptable for a theory paper. But in the GPT2-XL experiment, the authors should not just try one well-designed prompt; they need to try more diverse prompts randomly sampled from a natural language dataset and report some average performance.

Supplementary Material

I briefly looked at Sections B and F.

Relation to Broader Literature

I think the main contribution of the paper is to prove the "consensus convergence" phenomenon under more relaxed conditions (like not requiring Q^T K to be the identity). Previous works like Karagodin et al. 2024 (L230) or Geshkovski et al. 2023a (L172) have shown similar conclusions, but require stronger conditions (like P = Q^T K needing to be time-invariant or the identity). It's nice that the authors clearly discuss their contribution in the paper.

Essential References Not Discussed

I think the main technique of this paper is analyzing the training dynamics of transformers and examining the properties of the tokens, but there are many recent transformer-dynamics-related works that may be worth citing/discussing, although they may use different kinds of modeling [1,2,3,4,5].

[1] Nichani, E., Damian, A., Lee, J. D. How transformers learn causal structure with gradient descent. arXiv preprint arXiv:2402.14735, 2024.

[2] Cheng, X., Chen, Y., Sra, S. Transformers implement functional gradient descent to learn non-linear functions in context. arXiv preprint arXiv:2312.06528, 2023.

[3] Huang, Y., Cheng, Y., Liang, Y. In-context convergence of transformers. arXiv preprint arXiv:2310.05249, 2023.

[4] Tian, Y., Wang, Y., Chen, B., et al. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer. Advances in Neural Information Processing Systems, 2023, 36: 71911-71947.

[5] Li, Y., Li, Y., Risteski, A. How do transformers learn topic structure: Towards a mechanistic understanding. International Conference on Machine Learning, PMLR, 2023: 19689-19729.

Other Strengths and Weaknesses

Strengths:

+: the paper is in general easy to follow

+: toy experiments are interesting and support the theoretical findings.

Weaknesses:

-: I think one key problem is that the contribution is somewhat too incremental. As noted in the "Relation to Broader Scientific Literature" part, similar claims have been shown in previous papers, although under somewhat stronger conditions. However, this paper still requires some strong conditions that are far from real-world cases (e.g., U still needs to be the identity; this concerns the value matrix, and in the real world we do need to update the value matrix). And there seem to be no additional new theoretical findings in this paper (for example, a convergence rate for the consensus phenomenon).

-: The authors haven't compared in detail the techniques between their work and the previous literature to show why they can achieve relaxed conditions. This could be done by adding some discussion or a proof sketch.

-: The title seems confusing to me: consensus is an asymptotic phenomenon that, in practice, does not happen much for SOTA LLMs, so it cannot be 'all you get', and the role of attention can be much more complex (as in the previous works mentioned in the additional references above).

-: In the GPT2-XL experiment, the authors should not just try one well-designed prompt; they need to try more diverse prompts randomly sampled from a natural language dataset and report some average performance.

Other Comments or Suggestions

  1. I think if the authors don't explain V_\eta(t) in detail in the main paper, (5) had better not use it (like using V_\eta(t)) (L166)
  2. Seems all the "[0, \infty]" are "[0, \infty["
  3. I think it's better to include the metrics (like the definition of 'E' in Figures 3/5, i.e., (18))
  4. I think it's necessary to define 'attractive' or 'attractivity' in Theorem 3.2 in the main paper.

Author Response

We thank the reviewer for the time devoted to reading our paper and providing feedback.

We were not aware of the referenced papers. All but the second paper discuss training dynamics whereas we study already trained transformers. Hence, they do not seem relevant. The second paper shows that the evolution of tokens along a transformer can be interpreted as an instantiation of gradient descent. We cannot see any direct connection to our work but we will be studying this paper out of intellectual curiosity.

The reviewer states that the contribution is "somewhat too incremental" and supports this opinion with "similar claims have been shown in previous papers" and "still requires some strong conditions (...) U still needs to be the identity". The latter statement is factually incorrect. Section 4.2 discusses the case where the matrix U is not the identity. The reviewer can find in Assumption 4.4 that we only require U to be symmetric and time-invariant. See also our reply to reviewer aNPn where we explain that the time-invariance assumption can be relaxed.

We would also like to politely disagree with the incremental characterization of our work and with "similar claims have been shown in previous papers". Theorems are written as implications, say a implies b, where a is called the antecedent and b the consequent. Two theorems are not similar because they have the same consequent. The strength of two theorems offering the same consequent is measured by how weak the antecedent is. A weaker antecedent means the implication applies to a larger class of systems and is thus a stronger result. Here is one example: before people could travel by aircraft, the US could be reached from Europe by boat in several weeks. Once air travel began, the US could be reached from Europe in several hours. In both cases the consequent is the same, the US can be reached from Europe, but these scenarios cannot be characterized as being "similar".

The reviewer writes "The authors haven't detailly compared the techniques between them and previous literature to show why they can achieve relaxed conditions." It is not clear to the authors what is meant by a detailed comparison of techniques. An intuitive comparison is provided in the 3rd paragraph of the introduction. Intuitively speaking, some prior work used a mean-field model that resulted in a partial differential equation whose solution was interpreted as a distribution. We used ordinary differential equations that enabled us to use control theoretic techniques such as Input-to-State Stability as well as existing results on consensus on spheres. Existing work that also used ordinary differential equations did not use control theoretic techniques. We would be happy to expand this intuitive comparison in the final version of the paper and, in particular, highlight which control theoretic techniques were used in each proof, if this is what the reviewer has in mind.

We are more than happy to change the paper's title.

We agree with the criticism related to the need to use multiple prompts. We will provide such experiments in the final version.

The matrix V_\eta is defined in the second paragraph of Section 2.2 as the value matrix. Please let us know if additional explanations regarding this matrix are needed.

We could not understand the remark «Seems all the "[0, \infty]" are "[0, \infty["».

We should have made this clear in the paper. We follow Bourbaki's notation (introduced in Éléments de mathématique) and write [a,b[ to denote the interval including the point a and excluding the point b. This is done so that there is no confusion between the ordered pair (a,b) and the interval ]a,b[ that excludes the points a and b. We did find a few instances of sets in the appendix using the notation (a,b) and we will correct them.

Not including the definition of E in the main part of the paper was an oversight on our part. It will be rectified.

Not defining attractivity was an oversight, it will be corrected.

The reviewer asks about the "Key difference in technique". This is the first paper that uses control theoretic techniques such as Input-to-State Stability or draws inspiration from results available in the control literature on consensus on spheres. As we mention in the reply to the last reviewer, we propose to comment on which control theoretic techniques are used in each proof so as to provide deeper insights into our approach. We hope this change addresses the concerns of both reviewers.

Reviewer Comment

Thanks for the clarification, and sorry for my late reply. The authors' reply (together with the discussion with reviewer aNPn) addresses most of my problems, and I am sorry I misunderstood some of the key contributions of this paper before. I agree that a more intuitive comparison will be helpful for readers, and I am looking forward to the modifications the authors mentioned. I will increase my score.

Author Comment

Thank you for increasing the score.

Official Review
Rating: 3

This paper theoretically investigates the phenomenon of token representation collapse in transformers as the number of layers blows up. The authors analyze a continuous-time differential equation of the attention model and show that all tokens converge asymptotically. They present results under different assumptions on the key, query, and value matrices, improving upon previous work that required stricter conditions on the composed key-query matrix, the value matrix, and the number of attention heads. Finally, they provide small-scale experiments to validate their theory.

Update after rebuttal

I thank the authors for engaging during the rebuttal period. They addressed most of my key questions and agreed to make appropriate additions to the paper including some additional experiments. Overall, I believe that it is a good paper that improves upon the previous theory on representation collapse, and I keep my positive rating.

Questions for Authors

  1. The theory is developed for the ellipsoid projection which is playing the role of layer-normalization in a standard transformer. Do you think it's possible to extend these results to the standard layer-norm?

  2. General question for all results: the value matrix U is time-independent (which also appears to be the case in previous works). I understand the assumptions are less strict than in previous works, but it seems a little strange. Do you believe the results still hold if U is time-dependent, similar to the composed key-query matrix P?

  3. A few questions on the assumptions: For Thm 3.2, what happens if the initial positions of the tokens do not lie in some hemisphere? Related to this, for the experiment in Figure 2, do you still see consensus if the tokens do not start in a hemisphere?

  4. Re the experiments: The evaluation metric used in all plots should be discussed in the main body, it is just referred to as some equation in the appendix. I like the toy experiments in general, but I think the results should be averaged over multiple prompts (for example figure 5 with random prompts), and an average with some confidence intervals should be reported. This is especially important given the variance we see for different prompts in figure 7.

5a. Why do you think collapse is less prominent for random weights in almost all the figures? The theory does not seem to distinguish between different sets of weights.

5b. Comparing Figure 6 to Figure 5, removing periodicity seems to reduce collapse. Why do you think this happens? Again, the theory does not appear to separate these cases.

Claims and Evidence

All claims and evidence seem to be well-supported. The key contribution—demonstrating that the continuous-time differential equation model collapses asymptotically under different sets of assumptions on the model parameters—is backed by theoretical results. While limited, the toy experiments provided support for these claims.

Methods and Evaluation Criteria

It’s mainly a theory paper. I don’t think this question is very valid, but whatever toy experiments they have make sense.

Theoretical Claims

No, I did not read the proofs of the theorems.

Experimental Design and Analysis

Since this is a theory paper, the experiments are primarily toy examples designed to validate the theory. They involve prompting a pre-trained GPT-2 XL and analyzing representation collapse, which the results confirm and align with previous findings. From this perspective, the experiments appear sound, though I have some questions and suggestions (see the questions section).

Supplementary Material

No, I did not check the supplementary material.

Relation to Broader Literature

The paper theoretically examines representation collapse in tokens as the number of layers in transformers increases. While previous work has studied this using mean-field techniques, this paper appears to apply ideas from control theory to analyze the asymptotic behavior. Additionally, the theoretical insights developed here could potentially be used in the future to identify architectural components that contribute to representation collapse.

Essential References Not Discussed

I believe the recent work by Wu et al. (2024) is highly relevant and missing from the discussion. They also investigate representation collapse, focusing on the role of layer normalization and attention masks. While their approach differs, it should be addressed in this paper, along with a comparison of their theoretical results. For instance, Wu et al.'s results appear to be finite-time, whereas this paper's findings are asymptotic. A comparison of their respective assumptions would also be valuable.

Wu et al. (2024) On the role of attention masks and layer norm in transformers.

Other Strengths and Weaknesses

The paper studies an important problem from a theoretical point of view, is generally well written, and is interesting to read.

Other Comments or Suggestions

Please add a small conclusion/discussion section with open questions and drawbacks of the results presented in the paper.

Author Response

We thank the reviewer for the time devoted to reading our paper and providing feedback.

We were not aware of the paper Wu et al. (2024). Upon reading it we found that it confirms the results in our paper since it provides several scenarios of rank degeneration, i.e., consensus. It also provides an example where consensus does not occur but this example requires the query and key matrices to be constant and equal to zero which does not happen in practice.

We will be happy to add a conclusion/discussion section as suggested.

Layer normalization

Thank you for this question, we should have included a discussion about this in the paper. At an intuitive level, any normalization technique has the objective of restricting the tokens to a compact space, hence the results should not depend on how the normalization is done. At the technical level, we can show that layer normalization (removing the mean, dividing by standard deviation, multiplying by a learned scale, and adding a shift) corresponds to projecting on a suitably translated sphere when the learned scale parameters are all equal and non-zero. Hence, our results apply to this case. We suspect the same holds for arbitrary scale parameters but we need to perform a more careful analysis to ensure the topology/geometry of the resulting compact space is homeomorphic/diffeomorphic to that of a sphere.
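A quick numerical check of the equal-scale case described above (a sketch with arbitrary placeholder values for the scale and shift, not the paper's parameters): with eps set to zero and all scale parameters equal to c, LayerNorm outputs sit at constant distance c*sqrt(d) from the shift vector, i.e., on a translated sphere.

```python
import torch

d = 16
ln = torch.nn.LayerNorm(d, eps=0.0)        # eps = 0 so the geometry is exact
with torch.no_grad():
    ln.weight.fill_(0.5)                   # equal, non-zero scale parameters (c = 0.5)
    ln.bias.fill_(0.3)                     # arbitrary shift

x = torch.randn(1000, d)
y = ln(x)
# Distance of every output from the shift vector equals c * sqrt(d) = 0.5 * 4 = 2.
radii = (y - 0.3).norm(dim=-1)
print(radii.min().item(), radii.max().item())   # both ~2.0 up to floating-point error
```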

Matrix U time-varying

Our proof technique is based on projecting the dynamics along the eigenvector of U corresponding to the largest eigenvalue to obtain a scalar differential equation that is easier to analyze. When U is time varying, the eigenvectors of U are time varying. This has two consequences: 1) the consensus point is now time-varying; 2) the scalar ODE has one additional term coming from the time derivative of the eigenvector (this derivative is zero in the time-invariant case). Provided the eigenvector changes slowly, all the tokens will asymptotically converge to the time-varying eigenvector and consensus is still reached. We are happy to discuss this extension in the final version if the reviewer finds it useful.

Theorem 3.2, tokens starting outside a hemisphere

The conclusions of Theorem 3.2 still hold when the tokens don't start in a hemisphere but enter a hemisphere at some future time. We proved that hemispheres are invariant (once the tokens enter one, they cannot leave) and thus the conclusion follows. In our experiments we always observed consensus, independently of the initial condition.

Experiments

Defining the function E in the appendix was an oversight that we will correct. We will also be happy to average the results and provide confidence intervals in the final version.
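As a minimal sketch of the promised aggregation (the E-values below are random placeholders, not the paper's measurements), one can average the per-layer curves over prompts and attach a normal-approximation 95% confidence band:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.random((20, 48))          # placeholder: one E-curve per prompt (rows) across layers (columns)

mean = E.mean(axis=0)
sem = E.std(axis=0, ddof=1) / np.sqrt(E.shape[0])
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem   # ~95% confidence band per layer
print(mean[:3], lower[:3], upper[:3])
```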

Collapse is less prominent for random weights/Removing periodicity seems to reduce collapse.

This is an interesting question. Our educated guess is that convergence is exponential when the matrices are constant (this is supported, under additional assumptions, by Theorem 6.3 in version 4 of https://arxiv.org/abs/2312.10794). One may speculate that the rate of convergence is related to the time derivative of the matrices: constant matrices have zero derivative and the fastest convergence, while random matrices have the largest derivative and the slowest convergence. The periodic case seems to lie in between the constant and random cases, but we do not have a good explanation for this.

Reviewer Comment

Thanks for the response to my questions, I will keep my positive score. I think it would be useful to discuss the time-varying \mathbf{U} scenario in the paper, and regarding that, how do you ensure that the eigenvectors change slowly through the layers? In general, it is obviously not true, but perhaps you can make some argument at initialization (for standard initialization techniques), using RMT. Other than that, maybe it is possible to track it across different layers at different training steps, empirically.

It would also be useful to add the LN comment to the paper; the appendix is fine, but the connection will be useful for the reader. Lastly, do you have any experiments where you see consensus when not starting in a hemisphere, as you say? I only see Fig. 2, where you do start inside; the other case should be in the paper.

Author Comment

Thank you for the positive score. We will make the discussed changes, including an empirical analysis of the eigenvalues and experiments with tokens starting outside a hemisphere.

Final Decision

The paper focuses on developing rigorous results for token representation collapse in Transformer networks as one scales the number of layers. Unlike the existing works that rely on mean-field and/or stochastic techniques, this work leverages control theoretic tools to establish token representation collapse for both full attention and causal attention. Interestingly, the results in the paper also relax various assumptions considered in the prior works.

Overall, the reviewers acknowledged that the submission made important contributions by expanding the current understanding of the representation collapse in the most widely popular architecture. The reviewers raised some valid questions/concerns about deviations from the real-life transformer, e.g., identity or time-invariant value matrices and layer normalization. The authors are encouraged to incorporate promised discussions on extending their analysis to more realistic settings or highlight the main challenges preventing the realization of such desirable extensions.

The submission also presents empirical results that corroborate the theoretical analysis and show asymptotic representation collapse in synthetic settings. That said, the AC and one of the reviewers feel that the experimental section can be significantly compressed to make room for proof sketches and a slightly more self-contained theoretical treatment of the underlying problem. For instance, is the empirical demonstration of token collapse under weaker assumptions (i.e., for realistic transformers) a novel contribution of this submission? If not, many of these results can be compressed in the main text and relegated to the appendix. The authors may also want to cite the work on looped transformers, as the main experimental setup in Section 5.2 is closer to looped transformers.

Furthermore, the authors may want to highlight the exact architecture they study early in the paper, e.g., in the Introduction. Currently, the authors briefly mention that, similar to Greshkovski et al. 2023a, they don't consider the feed-forward layer (in lines 136-140) and standard layer normalization (in lines 137R-139R). It would benefit the readers if these assumptions/restrictions were mentioned much earlier in the paper along with the fact that prior theoretical studies have also made such assumptions.