Uncovering hidden geometry in Transformers via disentangling position and context
An analysis of transformer embeddings via interpretable decompositions.
Abstract
Reviews and Discussion
This paper investigates the intermediate representations of Transformers by decomposing each token embedding into (i) position-wise information and (ii) sequence-wise information. Three main findings --- (a) (i) forms spiral curves in a low-dimensional space, (b) (ii) contains a cluster structure, (c) (i) and (ii) are almost orthogonal --- are observed on pre-trained language models.
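For concreteness, here is a minimal numpy sketch of this kind of mean-based decomposition; the array layout (contexts × positions × dimensions) and the variable names are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Hypothetical embeddings: C contexts (sequences), T positions, d dimensions.
C, T, d = 8, 16, 32
h = np.random.randn(C, T, d)

mu = h.mean(axis=(0, 1))                             # global mean vector, shape (d,)
pos = h.mean(axis=0) - mu                            # position-wise means, shape (T, d)
ctx = h.mean(axis=1) - mu                            # sequence-wise means, shape (C, d)
resid = h - mu - pos[None, :, :] - ctx[:, None, :]   # what is left over, shape (C, T, d)

# Sanity check: the four parts add back up to the original embeddings.
assert np.allclose(h, mu + pos[None, :, :] + ctx[:, None, :] + resid)
```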
Strengths
The analysis based on the decomposition is original. I haven't seen this type of decomposition of token embeddings.
The third finding (c) is an interesting property, which might open a new research direction.
Weaknesses
The first two findings (a and b) sound relatively trivial. I think the behavior of (a) mainly comes from the sinusoidal positional embedding. The effect of the positional embedding propagates to later layers via skip connections, which would explain why the spiral patterns are consistently observed across layers. For (b), since the ctx vector is computed by averaging token embeddings over each sequence, it is natural for it to contain topic-like information. More precisely, each token embedding should contain context- (or topic-) related information in order to predict the next word. Taking the average emphasizes this context information, which should be distinguishable from the context information obtained from a different document.
The paper analyzes Transformer models from many aspects. However, the individual analyses are not tightly connected, and it is hard to identify concrete outcomes.
Questions
Why does Equation (7) not include the residual term?
Section 4.2 starts with the following question: "positional information is passed from earlier layers to later layers … How does the transformer layer TFLayer enable this information flow?" Isn't this simply because of the skip connections? Also, why do you consider the low-rank plus diagonal form in Equation (8)? Don't you observe the alignment with the positional basis with svd(W) instead of svd(W - diag(W))?
We thank the reviewer for reading this submission and providing feedback. We first present our replies to common concerns, and then respond to specific issues.
Common concerns
- Significance of positional basis. We emphasize that the positional embeddings in many transformers, including GPT-2, are trained from scratch during pretraining---which is different from the originally proposed sinusoidal positional embedding. We find this nontrivial because:
  - As far as we know, the low-rank and low-frequency structure has not been examined extensively before.
  - The magnitude of the positional basis can change across layers (for aesthetic reasons, Figure 1 is presented as scale-free). We observed that the positional basis increases significantly from layer 0 to layer 1, so the residual connection is not the sole reason for the consistent positional basis.
  - Our discovery helps elucidate confusing artifacts (e.g., spiral clustering in embeddings) that people have observed; see Section 6.
  - Smoothness is key to Transformers. Our analysis of an arithmetic task in Section 5 shows a failure example where transformers exhibit discontinuous patterns that result in a failure of length generalization.
- Induction head. One of the most surprising phenomena about LLMs is their emergent abilities, and the induction head---which performs copying even on OOD data---is one simple example. Still, much is unknown about induction heads.
  - Our decomposition-inspired analysis identifies 3 interpretable components. The first two (attention to the self-token, attention to neighboring tokens) correspond exactly to the positional basis or cvec.
  - This "abstract ability" learned by transformers may allow people to understand the inner workings of LLMs and lead to practical value.
  - It provides a new visualization idea---which decouples positional effects from non-positional effects.
Specific issues
- Q1: "The first two findings (a and b) sound relatively trivial". We disagree that finding (a) is trivial. See our response to common concern 1. Moreover:
  - We hope to draw the reviewer's attention to the fact that many Transformers use positional embeddings trained entirely from scratch, so the sinusoidal form is not known in advance. Even so, the low-rank and low-frequency structure is not expected.
  - Note that the positional basis lies approximately in a low-dimensional subspace. As far as we know, this was not known before, and it explains many artifacts people have observed in embedding visualization.
- Q2: "Why does Equation (7) not include the residual term?" In this decomposition, we are only interested in separating positional effects from non-positional effects, so each embedding is decomposed into its positional part (pos) plus everything else (ctx plus resid). Note that the second part already includes the residual term.
- Q3: "How does transformer layer TFLayer enable this information flow?" Our goal is to understand how the information flow is processed in the attention component when passing through the self-attention of each layer.
  - The structural decomposition of W in Equation (8) shows that there is a positional low-rank part corresponding exactly to the positional subspace of the embeddings. As far as we know, this gives the first consistent identification of interpretable components of pretrained weights.
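As an illustration only (not the code used in the paper), a diagnostic of this kind could look as follows, assuming a square weight matrix W and a positional basis pos of shape (T, d); the function name and shapes are hypothetical.

```python
import numpy as np

def positional_alignment(W, pos, k=10):
    """How much of the top-k singular directions of W - diag(W) lies in the positional subspace."""
    U, s, Vt = np.linalg.svd(W - np.diag(np.diag(W)))
    Q, _ = np.linalg.qr(pos.T)          # orthonormal basis spanning the positional vectors
    top = U[:, :k]                      # top-k left singular vectors of the off-diagonal part
    return np.linalg.norm(Q.T @ top, axis=0) ** 2   # per-direction energy inside the subspace
```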
I appreciate the feedback from the authors. I realized that I had been mistaken about the positional embedding (I assumed it was not trainable). Since I now believe the results are more nontrivial, I will raise my score to 5. However, my concern about the concrete outcomes still remains (which aligns with reviewer Jv3k), and I still feel the manuscript is not ready for acceptance.
The paper attempts an investigation of the geometric structures of embeddings learned by transformers. It first proposes a decomposition of the embeddings into a positional component (mean vectors across contexts) and a contextual component (mean vectors across positions). Then, it studies the geometry of each of these components. For the positional component, they find that it is low-dimensional and smoothly varying; concretely, the Gram matrix of positions is low-rank in the Fourier domain. For the contextual component, they identify clustering structures. Finally, they find that the contextual component is incoherent with (almost orthogonal to) the positional component.
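As a rough, hypothetical illustration of what "low-rank in the Fourier domain" could mean operationally (my sketch, not the paper's diagnostic): form the Gram matrix of a positional basis pos of shape (T, d) and measure how much of its energy sits in the lowest 2-D DFT frequencies.

```python
import numpy as np

def low_freq_energy(pos, k=8):
    """Fraction of the Gram matrix's spectral energy in the lowest k frequencies."""
    G = pos @ pos.T                              # (T, T) Gram matrix of positional vectors
    F = np.fft.fftshift(np.fft.fft2(G))          # 2-D DFT, zero frequency moved to the center
    power = np.abs(F) ** 2
    c = G.shape[0] // 2
    low = power[c - k:c + k, c - k:c + k].sum()  # centered (2k x 2k) low-frequency block
    return low / power.sum()

# A smooth, low-frequency positional basis should give a ratio close to 1.
```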
Strengths
-- The investigation of the geometry of token embeddings is, in my opinion, interesting and could shed light on the operation of LLMs.
-- I like the proposed decomposition of embeddings into their global mean, positional, contextual, and residual parts. It is simple, but interesting.
-- The authors have conducted a rather thorough investigation with multiple experiments.
-- The discussion on the low-rankness of positional embeddings and its connection to smoothness via Fourier analysis is interesting.
Weaknesses
I am torn on my decision about the paper. I like the investigation, and there are ideas in the paper which I find nice. At the same time, though, I believe the paper could benefit from an attempt to better discern and articulate the messages of the findings. Moreover, my opinion is that by discussing too many (and often incomplete) topics, the main (and potentially interesting) messages get "lost".
-- Several topics discussed feel incomplete, such as (1) the clustering of contexts in Sec. 3; (2) the content of Sec. 4.2; (3) the last paragraph of Sec. 5.2 (there doesn't seem to be anything informative said here, including App. E, other than a report of figures).
-- The discussion on induction heads is distracting and I don't see the relevance to the rest of the paper
-- I find the presentation of the paper, particularly after Sec. 2, confusing. There is no clear coherence between sections/subsections; e.g., it is not made clear how Sec. 4.2 and 4.3 fit within the story. Overall, the paper would benefit from a careful read.
AFTER REBUTTAL
I continue thinking that the paper can greatly benefit from an attempt to better discern and articulate the key messages of the findings. The responses did not shed particular light on that. That said, I am raising my score as I believe the approach taken by the authors is interesting and (although not immediately clear now) could lead to new ideas towards better understanding the mechanisms of transformers.
Questions
-- Do you have an intuition/interpretation for the spiral shape? I believe I understand the point you are making about smoothness, but what does the particular shape tell us (if anything)? If nothing, then why is it emphasized?
-- Any explanations on the non-spiral trends in Figs 9-11 in the appendix?
-- Last paragraph "Why smoothness" of Sec. 2: Can you please elaborate on why smoothness allows attention to neighboring tokens easily? Also, you discuss QK scores there, but those involve the WQ, WK matrices, which, from what I understand, are not considered in Sec. 2 (only the Gram matrix of the positional embeddings).
-- How is the clustering property of the contextual part of the embeddings on a per-document basis informative? Also, how would the results change based on the four sampled documents?
We thank the reviewer for reading this submission and providing feedback. We first present our replies to common concerns, and then respond to specific issues.
Common concerns
- Significance of positional basis. We emphasize that the positional embeddings in many transformers, including GPT-2, are trained from scratch during pretraining---which is different from the originally proposed sinusoidal positional embedding. We find this nontrivial because:
  - As far as we know, the low-rank and low-frequency structure has not been examined extensively before.
  - The magnitude of the positional basis can change across layers (for aesthetic reasons, Figure 1 is presented as scale-free). We observed that the positional basis increases significantly from layer 0 to layer 1, so the residual connection is not the sole reason for the consistent positional basis.
  - Our discovery helps elucidate confusing artifacts (e.g., spiral clustering in embeddings) that people have observed; see Section 6.
  - Smoothness is key to Transformers. Our analysis of an arithmetic task in Section 5 shows a failure example where transformers exhibit discontinuous patterns that result in a failure of length generalization.
- Induction head. One of the most surprising phenomena about LLMs is their emergent abilities, and the induction head---which performs copying even on OOD data---is one simple example. Still, much is unknown about induction heads.
  - Our decomposition-inspired analysis identifies 3 interpretable components. The first two (attention to the self-token, attention to neighboring tokens) correspond exactly to the positional basis or cvec.
  - This "abstract ability" learned by transformers may allow people to understand the inner workings of LLMs and lead to practical value.
  - It provides a new visualization idea---which decouples positional effects from non-positional effects.
Specific issues.
- Q1: "Several topics discussed feel incomplete". We agree that some supporting arguments are not presented in the main paper; however, we feel that the appendix does contain enough details and supporting figures.
- Q2: "The discussion on induction heads is distracting". Please refer to our main response on "Induction head".
- Q3: "I find the presentation of the paper particularly after Sec 2 confusing".
  - We provide further analyses based on the decomposition and illustrate why the nearly orthogonal (incoherence) structure is useful. We decided to include Section 4.1 to show that the pos-cvec decomposition can lead to better visualization.
  - Section 4.2 shows that the weights exhibit a low-rank + diagonal structure. As far as we know, this provides the first interpretable decomposition of weights that is consistent across pretrained models.
  - Section 4.3 shows that the incoherence structure is closely related to the classical compressed sensing and dictionary learning literature. We provide an analysis through this lens, showing that incoherence allows the cross-interactions in self-attention to be easily captured.
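For intuition, a toy way to quantify this incoherence (an illustrative sketch; the names and shapes are ours, not the paper's): compute the largest absolute inner product between normalized positional and context vectors, which is small when the two families of directions are nearly orthogonal.

```python
import numpy as np

def mutual_coherence(pos, ctx):
    """Max |cosine similarity| between positional vectors (T, d) and context vectors (C, d)."""
    P = pos / np.linalg.norm(pos, axis=1, keepdims=True)
    C = ctx / np.linalg.norm(ctx, axis=1, keepdims=True)
    return np.abs(P @ C.T).max()
```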
Additional questions.
- "Do you have an intuition/interpretation for the spiral shape?" The spiral shape is related to the low-frequency structure and the smoothness property. In the top-left part of each subplot in Figure 3, we see that the Gram matrix is smooth. If we simply take the query-key matrix to be an identity matrix (which is closely related to the empirical observation in Section 4.2), then the QK score between two positions is given by the corresponding entry of the Gram matrix, and it is high when the query and key positions are close to the diagonal part of the Gram matrix. In other words, the attention values are large if and only if the distance between the two positions is small---this is exactly attention to neighboring tokens.
  Note that the Gram matrix has been used to analyze positional embeddings before [1], but only for the 0-th layer. Our analysis shows (both empirically and theoretically) a connection between the Gram matrix and self-attention.
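A toy numerical illustration of this argument (our sketch under the stated simplification that the query-key matrix is the identity; not the paper's code): build a smooth, low-frequency positional basis, use its Gram matrix as the QK scores, and check that each softmax row peaks on the diagonal and decays with distance.

```python
import numpy as np

T, k = 64, 4
t = np.arange(T)[:, None]
freqs = np.arange(1, k + 1)[None, :]
# Smooth, low-frequency positional basis (sinusoids are just a stand-in here).
pos = np.concatenate([np.cos(np.pi * freqs * t / T),
                      np.sin(np.pi * freqs * t / T)], axis=1)   # shape (T, 2k)

G = pos @ pos.T                                   # Gram matrix = QK scores under identity W
attn = np.exp(G - G.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)           # row-wise softmax

# Each row attains its maximum at its own position, i.e., attention concentrates
# near the diagonal (self and neighboring tokens).
print(np.all(attn.argmax(axis=1) == np.arange(T)))   # True
```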
- "Any explanations on the non-spiral trends in Figs 9-11 in the appendix?" We checked the frequency coefficients, and the low-frequency components are also the dominant ones. The specific linear combination of the Fourier basis may differ from GPT-2, but the low-frequency and smoothness properties still hold.
- "Last paragraph `Why smoothness' of Sec 2". See the first reply in "Additional questions."
- "How the clustering property of the contextual part of embeddings on per document basis is informative?" Our context basis essentially ignores the order of the input words. In this light, it is similar to the bag-of-words assumption used in NLP before deep learning; see [2]. The clustering encodes topic information.
[1] Benyou Wang, et al. On position embeddings in BERT. In International Conference on Learning Representations, 2020.
[2] Sanjeev Arora, et al. Learning topic models---going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science. IEEE, 2012.
The paper aims to demystify the internal workings of transformer models by presenting a novel decomposition of the hidden states (or embeddings) into interpretable components. For a layer's embedding vectors, a decomposition is achieved which distinguishes the mean effects of global, positional, and contextual vectors, as well as residual vectors, providing insights into the input formats and learning mechanisms of transformers.
Strengths
S1. The paper introduces a novel decomposition method that separates the mean effects of global, positional, and contextual vectors within transformer embeddings. This approach offers a fresh perspective on understanding the internal mechanisms of transformers, revealing insights into how they encode and process information.
S2. The paper provides extensive numerical experiments on a variety of models, including Llama2-7B and BLOOM, to validate the proposed decomposition approach. These experiments include token randomization and arithmetic tasks, which demonstrate the ability of the decomposition to capture different aspects of transformer embeddings.
Weaknesses
Please see the Questions below.
Questions
Section 1:
The significance of Transformers in research is well-known, but the paper's introduction does not clarify the purpose of the proposed method. Could the authors detail how this new decomposition relates to ANOVA and previous work on positional embeddings and induction heads? Moreover, what practical benefits does this decomposition provide? The main outcomes of the experiments in both the paper and appendix also need clarification.
Section 2:
- Please define 'smoothness' in the context of your paper. It is essential to relate this to the DFT and IDFT for better understanding.
- The term ||·||_op is used but not defined.
- In Equation (6) (LHS) of Theorem 1, there appears to be a dimension mismatch between the first and second terms.
Sections 3 and 4:
- The writing in these sections needs improvement. Starting with a summary of the key findings before referring to figures would enhance clarity. What are the main points of these sections?
- What does O(1)-sparse representation by bases mean in Theorem 2?
Section 6:
The claim of providing a "complete picture" is too broad. How does this research stand apart from earlier studies on positional embeddings and induction heads?
Details of Ethics Concerns
N.A.
We thank the reviewer for reading this submission and providing feedback. We first present our replies to common concerns, and then respond to specific issues.
Common concerns
- Significance of positional basis. We emphasize that the positional embeddings in many transformers, including GPT-2, are trained from scratch during pretraining---which is different from the originally proposed sinusoidal positional embedding. We find this nontrivial because:
  - As far as we know, the low-rank and low-frequency structure has not been examined extensively before.
  - The magnitude of the positional basis can change across layers (for aesthetic reasons, Figure 1 is presented as scale-free). We observed that the positional basis increases significantly from layer 0 to layer 1, so the residual connection is not the sole reason for the consistent positional basis.
  - Our discovery helps elucidate confusing artifacts (e.g., spiral clustering in embeddings) that people have observed; see Section 6.
  - Smoothness is key to Transformers. Our analysis of an arithmetic task in Section 5 shows a failure example where transformers exhibit discontinuous patterns that result in a failure of length generalization.
- Induction head. One of the most surprising phenomena about LLMs is their emergent abilities, and the induction head---which performs copying even on OOD data---is one simple example. Still, much is unknown about induction heads.
  - Our decomposition-inspired analysis identifies 3 interpretable components. The first two (attention to the self-token, attention to neighboring tokens) correspond exactly to the positional basis or cvec.
  - This "abstract ability" learned by transformers may allow people to understand the inner workings of LLMs and lead to practical value.
  - It provides a new visualization idea---which decouples positional effects from non-positional effects.
Specific issues.
- Q1: "Could the authors detail how this new decomposition relates to ANOVA and previous work on positional embeddings and induction heads?" As far as we know, the use of an ANOVA-style decomposition in the analysis of transformers is new.
- Q2: "Moreover, what practical benefits does this decomposition provide?" We offer improved interpretations of transformers---which is important considering the potential risks of LLMs. More specifically,
  - We highlight that smoothness is key to Transformers by analyzing the positional basis and through a failure study on addition tasks, where a discontinuous structure prevents length generalization.
  - We identify a consistent geometric structure, which is connected to the classical literature on compressed sensing and dictionary learning.
  - We also propose a new visualization idea based on our decomposition and apply it to the induction head.
Additional questions.
- O(1)-sparse representation by bases is defined in Section 4.3, Equation (11) of the paper.
- The operator norm (usually denoted by "op") is commonly defined as ||W||_op = max over unit vectors x of ||Wx||.
- Thank you for pointing out the typo in Equation (6); it will be corrected in the revision.
This paper proposes a decomposition of transformer token embeddings across sequences (contexts) and positions (within each sequence). These vectors are decomposed into a global mean, a mean effect for each position (averaged over contexts), a mean effect for each context (averaged over positions), and a residual, i.e., embedding = mean + pos + ctx + resid (I am using my own notation). After presenting this decomposition in Section 1 (intro), the geometry of the positional component is studied in Section 2, while Section 3 focuses on the contextual component. In Section 4 the authors study more specifically the contributions of these separate components to the attention mechanism, while Section 5 is a simple experiment showing that some of the findings of Section 2 are indeed invalidated when the transformers are trained on randomly permuted text.
Reviewers started with fairly low scores and bumped them up a little, but there is still no strong push for acceptance, and a few criticisms / reservations remain.
Overall, I share the opinion of all reviewers. This looks like great exploratory work, but it feels unfinished and does not provide a clear message. The idea is great, but the execution will probably require a few more iterations. For instance, Section 3 (context) is oddly short (two short paragraphs). Section 2 talks about the geometry of positional embeddings, but is mostly (already) studying the geometric properties of pos vs. context. I also missed, upon a first reading of the paper, a more careful study of the relative scales of these embeddings. How big are these residuals? I can see the SVD analysis of Fig. 2 (left), but I feel this might be a bit misleading, since only the residuals are analysed (maybe provide the same norm analysis as in Table 1?).
Overall, I really encourage the authors to take the reviews into account and rewrite the paper, with better organization and a clearer presentation of the findings.
Why not a higher score
I am on the fence about this paper. The approach is interesting, but the execution is lagging a bit. Getting something publishable will require some effort; not impossible, but quite significant (e.g., Section 3 is a few sentences long, and Section 5 is just a toy experiment).
Why not a lower score
NA
Reject