Learning positional encodings in transformers depends on initialization
Learning accurate and interpretable positional encodings that improve generalization depends on their initialization
Abstract
Reviews and Discussion
This paper investigates learning positional information directly from data by training transformers with learnable positional encodings (PE) for two distinct tasks: (1) the Latin Squares task, a 2D spatial reasoning task, and (2) predicting resting-state fMRI data for masked brain regions based on activity in other regions. The authors report that (1) initializing learnable PEs with a small norm improves generalization, and (2) the learned PEs closely align with the 2D grid structure in the Latin Squares task and the known brain connectivity patterns in the neuroscience task.
Strengths
- Originality: The paper explores the impact of PE initialization on novel tasks, specifically brain activity prediction, which differs from typical applications.
- Quality: The empirical evidence is strong, with various experimental settings and adequate seeding.
- Clarity: The writing is clear.
- Significance: The findings reveal that learned PEs capture interpretable context structures.
Weaknesses
- The motivation for 1D tasks is limited, as it does not clarify why the positional encoding (PE) approach for 2D or higher-dimensional tasks would differ fundamentally (see question). It would be helpful to explain what unique considerations are required for PE in 2D spaces.
- The finding that small-norm PE yields better test performance is unsurprising. The alternative hypothesis—that, unlike in feedforward networks, large-norm PE initialization might improve task performance—is weak.
- Using only a single attention head is uncommon in practice, as it restricts the model to a single processing path. Discussing how results might change with varying numbers of attention heads, or more broadly, how PE effects might differ with multiple heads, would enhance this aspect.
- Learning PE is expected to outperform predefined PE, as it is optimized specifically for the task. This result would be more meaningful if learned PEs provided consistent benefits across a wider range of untrained tasks. For example, for the neuroscience task, can the PE learned from resting-state data be useful for predicting brain activity in other states?
Questions
- Could you clarify the gap you've identified, specifically regarding prior work focusing mainly on 1D input sequences? It seems that once the input is flattened, the tasks considered here also operate in a 1D manner.
- Could you explain what you mean by "ground-truth" position information? It's unclear why you consider a 2D "ground-truth" position encoding (PE) based on sines and cosines. Are you suggesting that all tasks on a 2D grid should share the same "ground-truth" PE? If we're training on a single task, shouldn't the PE, or the contextual information, reflect specific geometric constraints imposed by that task? Is there a singular "ground-truth" PE? For instance, in Figure 2C, the attention map accuracy seems to plateau around a cosine similarity of 0.7, implying that this "ground-truth" PE may not be optimal for the task. The entire concept of "ground-truth" PE needs a more substantial rationale.
Some minor points:
- 2D Reasoning Task Explanation: The explanation of the 2D reasoning task could be clearer. For example, why is the solution in Figure 1B specifically a circle? This introduces a new shape, but does it mean that any shape (e.g., a star) would also be a valid answer?
- Optimizer Differences: The differences in behavior between SGD and Adam optimizers raise concerns about the robustness of the results.
- Color Shading in Figures: The meaning of the color shading in Figures 2BE and 4CD was not explained and could use clarification.
- The authors report that PEs initialized with a small norm have the highest overall network modularity and segregation (Figure 4CD). Is higher modularity considered beneficial, and if so, why?
- The training loss for the learnable PE is lower. How was the total parameter count controlled when comparing learnable PE to fixed PE, particularly given the small model size?
Weakness 1 / Question 1: Could you clarify the gap you've identified, specifically regarding prior work focusing mainly on 1D input sequences? It seems that once the input is flattened, the tasks considered here also operate in a 1D manner.
Reply 4A: We thank the Reviewer for their time reviewing our manuscript. We apologize that the fundamental gap/pain point we’ve identified was not made clear. We have taken steps to clarify this by including an additional conceptual figure, which illustrates the problem of flattening data into a 1D sequence when it is originally formatted in 2D (or higher dimensions) (please see the new Fig. 1 and Response 0A to all Reviewers above).
An intuitive analogy may help: When performing a Sudoku puzzle (very similar to the LST), the puzzle would be made significantly more difficult if it were presented in 1D vs. 2D. Similarly, imagine playing chess with the board flattened into 1D – moves like those of rooks or knights would become far less intuitive.
Please also note that this new figure helps to clarify what exactly we mean by a “ground truth” PE. In brief, a “ground truth” PE is any PE that preserves (approximately) the distance information between token elements in the LST grid (e.g., see Fig. 1E). We have revised relevant text to help contextualize why this is important, e.g., Line 147:
“Importantly, performing the LST task requires positional information of the rows and columns. Flattening the grid to 1D without preserving the position information would significantly increase the difficulty of the LST task (Fig. 1E).”
An argument can be made that Transformers would not need to reason in 2D (or be privy to the 2D organization of an LST grid or a chess board), because how they interpret tokens is fundamentally different. However, our findings suggest the opposite: When not supplied with the 2D position information (for data that is subsequently flattened to 1D to be processed by the Transformer), generalization becomes much more challenging. This suggests that Transformers are also susceptible to this limitation.
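To make the flattening argument concrete, here is a minimal numpy sketch (illustrative only, not the paper's code) showing that pairwise distances between cells of a 4 x 4 grid are distorted once positions are reduced to flattened 1D indices:

```python
import numpy as np

n = 4  # 4 x 4 LST grid (illustrative)

# 2D "ground truth" positions: (row, col) for each of the 16 tokens
coords_2d = np.array([(r, c) for r in range(n) for c in range(n)], dtype=float)

# 1D positions after naive flattening: just the token index 0..15
coords_1d = np.arange(n * n, dtype=float)[:, None]

def pairwise_dist(x):
    # Euclidean distance between every pair of position vectors
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

d2, d1 = pairwise_dist(coords_2d), pairwise_dist(coords_1d)

# Tokens 0 and 4 are vertically adjacent in the grid (distance 1),
# but end up 4 steps apart once the grid is flattened to a 1D sequence.
print(d2[0, 4], d1[0, 4])   # 1.0 vs 4.0
print(d2[0, 3], d1[0, 3])   # 3.0 vs 3.0 (same-row distances survive flattening)
```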
Weakness 2: The finding that small-norm PE yields better test performance is unsurprising. The alternative hypothesis—that, unlike in feedforward networks, large-norm PE initialization might improve task performance—is weak.
Reply 4B: We thank the Reviewer for highlighting this. First, we wanted to test the influence of positional encoding in input sequences that are originally organized in 2 dimensions and higher (prior to being flattened for transformer processing). We believe that this is an important problem to characterize and solve, since transformers have been predominantly applied to 1D language data. We believe that our results clearly demonstrate the importance of using alternative PE schemes for such data. Regarding the specific finding that small-norm PEs show better generalization performance: Indeed, while this may be unsurprising in feedforward networks due to theoretical analysis in the NTK regime, we wanted to empirically test whether a similar behavior would emerge in the positional encoding layer of a transformer. In fact, another Reviewer (ggRj) for this manuscript suggested that there was not much theory supporting the relationship between NTK theory and transformers. Thus, we think that our results are an important contribution that verify the feasibility and applicability of rich and lazy learning to transformers.
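For readers unfamiliar with the rich/lazy terminology, the manipulation being tested amounts to scaling the initialization of an additive learnable PE table. A minimal PyTorch sketch; the module name and the sigma values are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

class LearnablePE(nn.Module):
    """Additive learnable positional encoding (illustrative sketch)."""
    def __init__(self, seq_len: int, d_model: int, sigma: float = 0.1):
        super().__init__()
        # sigma controls the initialization norm: a small sigma (e.g., 0.1) pushes the
        # PE layer toward the "rich" feature-learning regime; a large sigma toward "lazy".
        self.pe = nn.Parameter(sigma * torch.randn(seq_len, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the PE is broadcast over the batch dimension
        return x + self.pe

x = torch.randn(8, 16, 64)                       # e.g., 16 tokens of a flattened 4x4 grid
print(LearnablePE(16, 64, sigma=0.1)(x).shape)   # torch.Size([8, 16, 64])
```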
Weakness 3: Using only a single attention head is uncommon in practice, as it restricts the model to a single processing path. Discussing how results might change with varying numbers of attention heads, or more broadly, how PE effects might differ with multiple heads, would enhance this aspect.
Reply 4C: We thank the Reviewer for emphasizing this. Since multiple Reviewers gave this feedback, we now included additional experiments that evaluate generalization performance using transformers with multiple attention heads (please see Response 0B above to all reviewers for additional details).
Specifically, we have performed additional experiments using 2 and 4 attention heads for all models (15 seeds per model; Fig. A14). The primary result shows that adding attention heads tends to reduce generalization performance (but not training set performance / model convergence) across the board for the LST task. In fact, the highest performing model (excluding the 2DPE “ground truth” model) is still the small-norm initialized learnable PE model with 1 attention head.
Weakness 4: Learning PE is expected to outperform predefined PE, as it is optimized specifically for the task. This result would be more meaningful if learned PE provided consistent benefits across a wider range of untrained tasks. For example, for the neuroscience task, can the PE learned from resting state data be useful for predicting the brain in other states?
Reply 4D: The reviewer provides an interesting suggestion – to test generalization on fMRI data from additional states (i.e., not just resting-state fMRI). However, network partitions generally tend to remain consistent across resting and task states. In line with feedback from other reviewers, we asked a related but different question using a simple toy model: Can a network model that is explicitly nonlinear still learn the underlying ‘network modules’ through their positional encodings? We ran this additional experiment using network timeseries generated from a nonlinear multivariate autoregressive model (NMAR), with the aim of assessing the feasibility of recovering ‘ground truth’ network modules (which specify interaction strengths) (see Fig. A15 and Response 0C to all Reviewers). Corroborating the fMRI analysis, we find that low-norm learnable PEs are best able to recover the network modules via learned PE embeddings, even in nonlinear models.
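For intuition, here is a minimal sketch of the kind of nonlinear multivariate autoregressive generator described above: modular interaction weights plus private noise at each node. The module sizes, coefficients, and tanh nonlinearity are illustrative assumptions, not the exact NMAR model of Fig. A15:

```python
import numpy as np

rng = np.random.default_rng(0)

n_nodes, n_modules, T = 20, 4, 1000
labels = np.repeat(np.arange(n_modules), n_nodes // n_modules)  # ground-truth module labels

# "Ground truth" modular interaction weights: stronger within modules than between
W = 0.02 * rng.standard_normal((n_nodes, n_nodes))
W[labels[:, None] == labels[None, :]] += 0.15
np.fill_diagonal(W, 0.0)

# Nonlinear multivariate autoregressive dynamics with private node noise
x = np.zeros((T, n_nodes))
for t in range(1, T):
    x[t] = np.tanh(W @ x[t - 1]) + 0.5 * rng.standard_normal(n_nodes)

print(x.shape)  # (1000, 20) simulated network timeseries
```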
Question 2: Could you explain what you mean by "ground-truth" position information? It's unclear why you consider a 2D "ground-truth" position encoding (PE) based on sines and cosines. Are you suggesting that all tasks on a 2D grid should share the same "ground-truth" PE? If we're training on a single task, shouldn't the PE, or the contextual information, reflect specific geometric constraints imposed by that task? Is there a singular "ground-truth" PE? For instance, in Figure 2C, the attention map accuracy seems to plateau around a cosine similarity of 0.7, implying that this "ground-truth" PE may not be optimal for the task. The entire concept of "ground-truth" PE needs a more substantial rationale.
Reply 4E: (For a summary addressing this issue, please see Response 0A to all Reviewers.) We apologize that the notion of “ground truth” was not made clear in the original paper. Indeed, a ground truth is specified with respect to a specific task or input sequence. There is no singular ‘ground truth’ PE that is the same across all tasks. (For example, the ground truth in the LST is a 2D grid; the putative ground truth in the fMRI data is the network partitioning scheme.) Reviewer ggRj similarly pointed out that for the 2D sinusoidal positional encoding, any rotation of that PE would produce similarly good ‘ground truths’. We agree with that, and have qualified that in our manuscript (Line 259):
“Finally, we included a “ground truth” PE – an absolute 2D PE based on sines and cosines (2d-fixed) – to compare how similarly the various PEs produced attention mechanisms to this ground truth model (see Appendix A.1). (Note that the term “ground truth” would apply to any rotation of this 2D fixed encoding.)”
Reviewer wNeT also points out that in Fig. 2C the attention map accuracy seems to plateau around a cosine similarity of 0.7, implying that this ground-truth PE may not be optimal for the task. However, in the LST task, the ground truth is determined by the original organization of the LST input sequence – a 4 x 4 grid that maintains row and column information. Any positional encoding that contains row and column information with respect to this original grid (or is isomorphic to this input organization) would be considered a valid ‘ground truth’ encoding (see Fig. 1C). We hope that the new conceptual panels in Figure 1 help to clarify this confusion. (We note that we used a 2D sinusoidal-based encoding due to its widespread usage in the literature (e.g., Vaswani et al., 2017); it also approximates the discrete case well (e.g., compare Fig. 1E vs. 1H).) Finally, in Table 1, we demonstrate that transformers that have a 2d-fixed positional encoding layer (rather than a learned positional encoding layer) yield the best generalization performance.
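As a point of reference, one common way to construct such an absolute 2D sinusoidal PE is to encode the row and column indices with standard 1D sinusoids and concatenate them. This is an illustrative sketch only; the paper's exact construction is specified in its Appendix A.1:

```python
import numpy as np

def sinusoid_1d(positions: np.ndarray, d: int) -> np.ndarray:
    """Standard 1D sine/cosine encoding (Vaswani et al., 2017) for integer positions."""
    i = np.arange(d // 2)
    freqs = 1.0 / (10000 ** (2 * i / d))
    angles = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def sinusoid_2d(n_rows: int, n_cols: int, d_model: int) -> np.ndarray:
    """One common 2D variant: half the channels encode the row, half the column."""
    rows, cols = np.meshgrid(np.arange(n_rows), np.arange(n_cols), indexing="ij")
    pe_row = sinusoid_1d(rows.ravel().astype(float), d_model // 2)
    pe_col = sinusoid_1d(cols.ravel().astype(float), d_model // 2)
    return np.concatenate([pe_row, pe_col], axis=-1)  # (n_rows * n_cols, d_model)

pe = sinusoid_2d(4, 4, 64)
print(pe.shape)  # (16, 64): one encoding per flattened grid cell, row/column info preserved
```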
Minor Question 1: 2D Reasoning Task Explanation: The explanation of the 2D reasoning task could be clearer. For example, why is the solution in Figure 1B specifically a circle? This introduces a new shape, but does it mean that any shape (e.g., a star) would also be a valid answer?
Reply 4F: We apologize for the confusion – as in Sudoku, there are a fixed/predetermined set of symbols that can be chosen from. We have now included a new panel (Fig. 1A) to illustrate the four possible symbols that can be chosen from.
Minor Q 2: Optimizer Differences: The differences in behavior between SGD and Adam optimizers raise concerns about the robustness of the results.
Reply 4G: The overall behaviors of SGD and Adam are similar – the only case in which they diverge is when the initialization is exaggeratedly small, and this can be explained by increasing momentum in the Adam optimizer under small gradients (which result from extremely small initializations). We also emphasize that our results are averages across 15 different seeds, a further indication of their robustness. Finally, we hope that the additional results with multiheaded attention (Response 0B), which further support our findings, provide added confidence in their robustness.
Minor Q 3: Color Shading in Figures: The meaning of the color shading in Figures 2BE and 4CD was not explained and could use clarification.
Reply 4H: We apologize for the confusion. The color changes in the bar/box plots in those Figures (now Fig. 3 and 5 after revision) simply correspond to differences in the x-axis. We have now clarified this in the captions: “(Note that bar plot colors correspond to differences in x-axis values.)”
Minor Q 4: The authors report that PEs initialized with a small norm have the highest overall network modularity and segregation (Figure 4CD). Is higher modularity considered beneficial, and if so, why?
Reply 4I: We thank the Reviewer for the opportunity to clarify this result. Here, modularity (and segregation) are computed with respect to the brain’s known networks (or modules). This means that the higher the modularity, the more closely the learned positional encodings align with known biological function (i.e., the putative ground truth), and therefore the more interpretable they are. We have added a sentence that clarifies this (Line 450):
“We found that small-norm initialized PEs (learn-0.1 and learn-0.2) had the highest overall network modularity and segregation relative to other learnable PEs. This implies that the small-norm initialized models learned interpretable PEs with respect to the brain's known biological networks.”
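To clarify what "modularity with respect to the brain's known networks" means operationally, here is a minimal sketch that scores a PE-derived similarity matrix against a fixed, known partition using the standard Newman modularity formula. The cosine-similarity construction, the clipping of negative weights, and the toy data are illustrative assumptions; the paper's exact metric is described in its Methods:

```python
import numpy as np

def modularity(A: np.ndarray, labels: np.ndarray) -> float:
    """Newman modularity Q of a weighted, undirected, nonnegative matrix A
    with respect to a fixed partition (illustrative sketch)."""
    k = A.sum(axis=1)
    two_m = A.sum()
    expected = np.outer(k, k) / two_m
    same = labels[:, None] == labels[None, :]
    return ((A - expected) * same).sum() / two_m

# Toy example: PE cosine-similarity matrix for 6 "regions" in 2 known networks
rng = np.random.default_rng(1)
pe = rng.standard_normal((6, 8))
pe[:3] += 2.0                                   # give the first network a shared direction
norms = np.linalg.norm(pe, axis=1)
sim = pe @ pe.T / (norms[:, None] * norms[None, :])
sim = np.clip(sim, 0, None)                     # keep weights nonnegative for this sketch
np.fill_diagonal(sim, 0.0)
labels = np.array([0, 0, 0, 1, 1, 1])
print(round(modularity(sim, labels), 3))        # higher Q: PEs cluster by the known partition
```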
Minor Q 5: The training loss for the learnable PE is lower. How was the total parameter count controlled when comparing learnable PE to fixed PE, particularly given the small model size?
Reply 4J: This is true – training loss for the learnable PE is lower. In principle, the number of total model parameters is the same across learnable and fixed PEs. (Each PE has the same number of embedding dimensions, whether they are pre-specified with sinusoids or learnable variables.) However, the number of free (learnable) parameters is not. (Naturally, learnable PE models have more ‘free parameters’ by definition.) We did not explicitly control for this – the motivation for this paper was to show that pre-specified PEs are a strong inductive bias in transformers for learning and generalizability, and that learnable PEs can outperform standard PEs. Instead, what we sought to emphasize is that -- despite no change in model size -- learnable PEs can both generalize better and still learn interpretable PEs.
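A small sketch of the parameter-count distinction drawn above: a fixed PE of the same shape contributes zero trainable parameters (e.g., stored as a buffer), whereas a learnable PE table adds seq_len x d_model free parameters. The class names and shapes are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

seq_len, d_model = 16, 64

class FixedPE(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-specified PE stored as a buffer: same size, zero trainable parameters
        self.register_buffer("pe", torch.randn(seq_len, d_model))  # stand-in for sinusoids

class LearnedPE(nn.Module):
    def __init__(self):
        super().__init__()
        # Same shape, but every entry is a free (trainable) parameter
        self.pe = nn.Parameter(0.1 * torch.randn(seq_len, d_model))

count = lambda m: sum(p.numel() for p in m.parameters() if p.requires_grad)
print(count(FixedPE()), count(LearnedPE()))  # 0 vs 1024
```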
Dear Reviewer wNeT,
Notion of ground truth.
Thank you for your response. We agree that the concept of ground truth is task-dependent. In 2 out of 3 of our experiments, we have now provided an unambiguous notion of ground truth (with accompanying figures) that is unique to each task: in the LST task, the notion of rows and columns in a 2D grid, as depicted in Figure 1E (and the corresponding distance matrix in Figure 1F); in the nonlinear multivariate autoregressive model experiment, the notion of ground truth networks, as depicted in Figure A15A. (Please also note that, in response to a follow-up suggestion from Reviewer ggRj, we have just included visualizations of learned PEs in the toy NMAR experiment (Figure A15F), which provide an intuitive illustration of how small-norm learnable PEs can recover a notion of ground truth network structure that we think may be particularly helpful.)
In addition, we have now included a dedicated passage in the final paragraph of the Introduction that makes clear the task-dependent nature of ground truth (Line 081):
"Note that the notion of ground truth positions is task-dependent and data-dependent. In some cases, such as in biological datasets, ground truth spatial information may be difficult to know or ambiguous. In this study, we focus on tasks in which either a ground truth is unambiguous (e.g., synthetic tasks) or in which there exists a putative ground truth (e.g., biological data with known properties)."
We also have more specifically referenced the importance of preserving pairwise distances between token PEs when learning or choosing a PE in the revised caption for Figure 1F. We think the explanation is most effective when accompanied by the visualization (Line 141):
"F) The pairwise distances between token positions according to rows and columns provides the ``ground truth'' of how tokens relate to each other in 2D space. A successful PE would approximate the pairwise distance relationships of the ground truth encoding."
We hope these revisions -- particularly the ground truth visualizations in Figure 1E,F and Figure A15 -- provide the necessary framing of what we explicitly mean by ground truth.
Regarding the counterintuitive relationship between attention heads and generalization performance.
Thank you for your comments. While we are willing to run another round of additional experiments with suggested changes (e.g., increasing regularization), we would like to clarify that from our perspective, we do not believe the results are counterintuitive. From our perspective, these results are consistent with the idea that inductive bias is critical to generalization. Specifically, the inductive bias of PE supersedes the benefit of adding attention heads for these tasks. First, the ground truth 2D PE model performed at ceiling across all configurations, irrespective of the number of attention heads (Table 1; Figure A14). This is due to the fact that the correct inductive bias (a 2D positioning) is embedded in the model, making it easier to learn the key representations across multiple attention heads. All other models did not have the correct inductive PE bias. Thus, adding attention heads would merely increase the number of trainable parameters of the model, leading to overfitting of the training dataset, and harming generalization. This is similarly the case for the learnable models -- adding free parameters (through more attention heads) makes it more difficult for the model to learn and approximate the ground truth positional relationships.
However, if the Reviewer feels strongly that this is an important possibility to rule out, then we would be happy to run additional experiments with increased weight decay in the optimizer.
Thank you for the detailed rebuttal. However, I think the concept of “ground truth” needs a more rigorous definition—perhaps reframing it in terms of task-dependent biases, such as specific distance measures or properties that positional encodings should preserve. Furthermore, the new observation that models with multiple attention heads reduce generalization performance is counterintuitive. Could this result stem from insufficient regularization or some other underlying factor? Overall, while I appreciate the effort, I will maintain my original score.
The authors study Transformer position encodings in tasks beyond language where the dimensionality of the input sequence is no longer inherently one-dimensional, and the correlation structure may be more complex. They find that initialization of the position encoding is important in these more complicated contexts, and demonstrate the benefits of small-norm initialization empirically, taking inspiration from the feature learning regime of theoretical work. They ultimately study the generalization properties of learned position encodings, implying that by initializing weights to small values, transformers can learn 'rich' features in position encodings, allowing them to generalize to novel situations as opposed to overfitting on the training data distribution.
In part, one of the core contributions of this paper is the empirical analysis of the idea that the rich feature learning regime of neural networks may also apply to positional encodings. The insight that this specific modification can have dramatic improvements to generalization, especially on complex relational tasks where position is crucial, is both clever and relevant to the machine learning community broadly. It is therefore a valuable contribution in my opinion.
Strengths
- S1: The paper is well written and very well organized, making contributions clear.
- S2: While the 'method' of the paper is as simple as it could get (a single hyperparameter setting), the authors do perform a rigorous set of experiments across different positional encoding types, making the scientific contribution sound.
- Examples of this rigorous analysis include the results for Figures 2C&F, comparing the generalization accuracy of the learned position encodings as a function of their similarity to the 'optimal' 2D grid encoding, known a priori.
- S3: The discussion of the performance impacts of the optimizer choice is a valuable nuance that the authors aptly address.
Weaknesses
- W1: The number of tasks which this initialization is evaluated on is limited to 2, which makes some of the conclusions tentative (although the analysis is rigorous).
- W2: Figure 3D seems to be in conflict with some of the results in the appendix (Figs A10, A11 & A12). See Question Q3. If the authors could comment on this, that would help significantly.
Questions
- Q1: Can you explain the significant difference in performance of the relative embedding method between the first and second tasks?
- Q2: Can you explain the reason why you don't see the 'generalization gap' in the second set of experiments (fmri), between train and validation, like you do see in the first set of results (LST)?
- Q3: (Related to Q2) Can you explain why in Figures A10, A11 & A12, the random position encoding now seems to work equivalently well to the learned methods? This seems to be quite different than the result presented in Figure 3D and makes that result look a bit cherry picked.
Weakness 1 / Question 1: The number of tasks which this initialization is evaluated on is limited to 2, which makes some of the conclusions tentative (although the analysis is rigorous).
Reply 3A: We thank the Reviewer for their thoughtful review, and their overall positive assessment. We have now included an additional toy experiment: recovery of network clustering in a simple nonlinear stochastic model (using nonlinear multivariate autoregressive timeseries). We intend to move this to the main text of the manuscript, space permitting. (For additional details, please see Response 0C in the response to all reviewers above.)
Weakness 2 / Question 3: Can you explain why in Figures A10, A11 & A12, the random position encoding now seems to work equivalently well to the learned methods? This seems to be quite different than the result presented in Figure 3D and makes that result look a bit cherry picked.
Reply 3B: We thank the Reviewer for pointing out the discrepancy regarding random positional encodings in the neural data, and for going through the results in the Appendix. In general, prior research studies have indicated that randomly generated PEs can often be a good choice, particularly on algorithmic tasks (e.g., Ruoss et al., 2023). Thus, we do not necessarily think that random positional encodings are bad – in some cases (particularly when a ground truth PE is not structured or not available) randomly assigning PEs can be good. Our main point is that when there is a structured, ground truth position that can be learned, using a learned parameter with a small initialization can often be beneficial. Overall, our results corroborate that (i.e., learnable PEs initialized with small norm achieve the lowest MSE). In this specific case, the difference between Figure 3 (now Figure 4, due to revisions) and Figs. A10-13 is likely attributable to different training masks (15%, 50%, 75%, 90% masking) and the evaluation mask (90%). After all, we use the same exact random seeds across all of these training models, implying that the PEs in the random PE models were exactly the same across these experiments. The choice of seeds and models can be found in our supplementary code file.
Ruoss et al. “Randomized Positional Encodings Boost Length Generalization of Transformers.” https://doi.org/10.18653/v1/2023.acl-short.161.
Question 1: Can you explain the significant difference in performance of the relative embedding method between the first and second tasks?
Reply 3C: This is an interesting question, although we first would like to point out inherent differences in the evaluations of the first and second tasks. In the LST task, we measure the performance accuracy of the models, whereas for the neural data, we report the raw MSE loss. (The LST task uses a cross-entropy loss.) It is difficult to directly compare these two metrics.
As the Reviewer is likely aware, a second primary difference between relative PEs and absolute PEs is that relative PEs are computed within the attention mechanism, whereas absolute PEs are computed before it. In the case of absolute PEs, this explicit separation likely makes it easier for the self-attention mechanism to compensate for incorrect biases of the PE. Since relative PEs are directly computed with attention at every attention block layer, it is more difficult for the model to compensate for the inductive biases that are introduced with an ill-suited PE (in this case the relative component). (This is in contrast to absolute PEs, which are typically introduced only in the first layer.)
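To illustrate the architectural distinction, here is a minimal single-head sketch contrasting an absolute PE added to the token embeddings once with a relative PE entering the attention logits (using a T5-style additive bias as a generic stand-in for relative schemes). Names and shapes are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, L = 16, 9  # embedding dim, sequence length (illustrative)

def attention(q, k, v, bias=None):
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    if bias is not None:            # relative PE enters *inside* attention as a logit bias
        logits = logits + bias
    return F.softmax(logits, dim=-1) @ v

x = torch.randn(1, L, d)

# Absolute PE: added to token embeddings once, before the first attention block
abs_pe = 0.1 * torch.randn(L, d)
q = k = v = x + abs_pe
out_abs = attention(q, k, v)

# Relative PE (T5-style additive bias as a stand-in): a learned bias per relative
# offset, injected into the attention logits of every layer
rel_bias_table = nn.Parameter(torch.zeros(2 * L - 1))
offsets = torch.arange(L)[:, None] - torch.arange(L)[None, :] + (L - 1)
out_rel = attention(x, x, x, bias=rel_bias_table[offsets])

print(out_abs.shape, out_rel.shape)  # both (1, 9, 16)
```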
Question 2: Can you explain the reason why you don't see the 'generalization gap' in the second set of experiments (fmri), between train and validation, like you do see in the first set of results (LST)?
Reply 3D: This is another interesting observation, and again likely boils down to the difference in the model evaluations. In the LST task – which is a classification task – we explicitly hold out unique puzzles for the model to generalize to. This requires the models to effectively learn the strategy to produce the correct symbols on a fundamentally new puzzle. (It is also easier to see differences in train/test gaps in a classification task, where chance is known to be 25%.)
In contrast, with the fMRI experiments, the task is to effectively learn statistical relationships between brain regions in order to infer the activity of masked brain activity on new data. (New data in this case is data from new human participants.) If the statistical patterns are sufficiently generalizable to new participants, the model will do well (there is no notion of ‘new puzzle to classify’).
Thank you for directly addressing my concerns.
The point about random position encodings makes sense and I am willing to accept that answer, although it would be helpful if the authors addressed this directly in the main text to be clear about their results and avoid 'overselling' their method.
With respect to the difference in overall interpretation of the results between the two tasks, I appreciate the clarification that they are using two separate loss metrics and that this could make a direct comparison challenging. However, I also am struggling to see how this would yield such a significant difference. I also do not see how the noted difference between relative and absolute PEs could yield a discrepancy between tasks, since I assume the same architectural structure of the PEs is used for both tasks. In the end both tasks are composed of statistical patterns, and while classification may have more 'discontinuous jumps' in performance when a sample moves from correct to incorrect, I am not sure that this explains why the generalization performance changes. My guess would be that this is a difference in what the train-test splits of the tasks are actually measuring, and how this requires use of the PE (which seems to be what the authors are alluding to when they say 'a fundamentally new puzzle' vs. 'statistical relationships between brain regions'). If the authors have any additional insights on when precisely learned PEs will help generalization this would be helpful for the paper.
Overall I thank the authors for their response, I believe the paper is a valuable contribution to the community, and I will keep my score for now.
Thank you for engaging with us in this discussion, and for helping to improve the manuscript. We recognize the concern about overselling the approach/method, and have placed an additional passage in the Limitations and future directions section of the Discussion to make it clear that learnable PEs are not always optimal (revisions in italics; Line 507):
“First, though we consider the use of the LST (with a 2D organization) a strength of this study due to the visual interpretability of the paradigm's positional information, it is unclear how well this approach will generalize to tasks with an arbitrary number of elements, tasks in which there are dynamic changes in the number of elements (e.g., length generalization problems), or tasks in which there are specific distribution shifts.
In addition, due to the task-dependent nature of utilizing (or learning) optimal PEs, for some tasks and training objectives, such as generic next-token prediction or arithmetic (which is order invariant under addition), standard PE choices may be most appropriate (e.g., 1d-fixed or rndpe).”
At its core, we think the gaps in generalization between the tasks are due to fundamental differences in the difficulty of those tasks (in addition to the loss functions). Brain activity prediction can be achieved with linear models with reasonable performance (though worse than nonlinear models). In contrast, linear models cannot perform the LST, which we consider to be more difficult than brain activity prediction, since it requires learning multiple reasoning steps. (As an interesting aside, vanilla feedforward networks and RNNs generalize extremely poorly on the LST, around ~50%.)
Regardless, we agree that mentioning specific scenarios in which learnable PEs are useful would be helpful to situate this manuscript in relation to other papers which exploit PE variations for improving task optimization. We have added an additional sentence in the Limitations section, directly following the above passage (Line 514):
“However, for tasks in which establishing an underlying ordering and relation of tokens is crucial — such as reasoning tasks in 2D or tasks with complex network structures, as is common in biology — our results show that using small-norm initialized learnable PEs can be highly beneficial.”
The changes have been uploaded to the PDF. Thanks again for your constructive feedback.
Thank you for the updates. I think this is a significant improvement which highlights the benefits of the paper while being straightforward about the limitations.
This paper studies how initialization of learnable positional encodings (PEs) in transformers affects their ability to learn meaningful position information and generalize. The paper tests different PE initialization schemes on two tasks: a 2D spatial reasoning task called the Latin Square Task (LST) and a 3D brain activity (fMRI) self-supervised prediction task. On the Latin Square Task, the paper finds that initializing learnable PEs with small-norm distributions leads to better generalization and more interpretable learned encodings (measured via similarity to 2D cosine embeddings) compared to larger norm initializations. On the fMRI prediction task, the paper also finds that learned positional encodings with small initialization work best, and leads to more structured encodings (measured by computing similarity matrices across tokens aka brain regions, and comparing this similarity matrix to one estimated using functional connectivity). Overall, the paper highlights the importance of learned positional encodings along with proper initialization.
[score updated after rebuttal]
Strengths
- The paper is well motivated.
- The experimental setup with the 2D spatial task (LST) is a nice way to evaluate whether learned PEs capture 2D spatial relationships.
Weaknesses
Motivation / assumptions:
- The paper appears to mischaracterize the role of traditional sinusoidal PEs, which are meant to provide general spatial information rather than encode specific task-relevant patterns.
- The paper's argument that standard PEs are mainly suited for tasks with local dependencies overlooks their successful use in capturing long-range dependencies in language models.
- The connection to NTK theory feels tenuous and not rigorously established for transformers. Have the cited theoretical results on the lazy and rich regimes been shown to apply to transformers? or even specifically to the positional encoding parameters in transformers?
LST Task
- I disagree with the characterization of the 2D sinusoidal patterns as "ground truth". I can think of multiple methods for encoding 2D spatial information (for example, any rotation of the 2D sinusoidal basis). This matters because the paper uses similarity to the 2D fixed encoding as a measure of how "optimal" the learned embeddings are.
fMRI task
- The paper states that there isn't a ground truth positional structure available for the fMRI task, but then treats functional connectivity as ground truth when interpreting the learned positional encodings' similarity matrices.
- Overall, I'm not sure what I'm supposed to take away from the fMRI results. Functional connectivity is estimated using correlations across brain regions. In this paper, the task is to predict masked fMRI, and one would expect the most relevant regions to pay attention to when predicting a masked region would be the regions that are highly correlated to the masked region. So in that sense, the positional encodings are really just a roundabout way of estimating the correlational structure of the data? What have we really learned here?
Missing experiments that limit the impact of the findings
- No investigation of how PE initialization interacts with other architectural choices
- Limited exploration of optimization hyperparameters or different optimization algorithms beyond Adam/SGD (which are very likely to matter when changing things like the norm of the weight init)
Interpretations / conclusions
- I feel like claims about "discovering" position information are overstated –– the model may just be finding a convenient internal representation. Perhaps if we had a stronger understanding of how a particular positional encoding allowed the network to generalize better on the LST task, or some analysis of why building in the brain's functional connectivity into the PE for the fMRI task is beneficial.
- Put another way, the connection between PE learning and downstream task performance is correlational, not causal
Questions
- If I'm reading the results correctly, the effect of the weight init scale (σ) across the two tasks is slightly different. For the LST task, it mainly affects generalization, but training is unaffected (all scales train to 100% accuracy). For the fMRI task, it looks like both train and test performance vary with the weight init value. How should I interpret this?
- What is your viewpoint on the role of positional encodings? Are they supposed to provide generic spatial information to the network (but then subsequent layers in the network can learn to use this regardless of the task)? Or are they supposed to provide a particular inductive bias about the spatial information required to solve the task? (I feel like the paper is arguing the latter, but I want to confirm).
- I noticed you trained networks with a single attention head. What happens if you scale up the network (in particular, add additional attention heads)? I'm wondering if these findings might be in part due to the fact that the network only has access to limited attention mechanisms.
- Are there other toy tasks that you could design and train on that would help shed light on the role of positional encodings?
Weakness 1 & 2: The paper appears to mischaracterize the role of traditional sinusoidal PEs, which are meant to provide general spatial information rather than encode specific task-relevant patterns. The paper's argument that standard PEs are mainly suited for tasks with local dependencies overlooks their successful use in capturing long-range dependencies in language models.
Reply 1A: First, we thank the Reviewer for their comments and review. We agree that traditional sinusoidal PEs are meant to provide general spatial information that is particularly useful for language (Vaswani et al., 2017). The primary gap we aimed to address within this paper is scenarios in which sequentially ordered tokens in 1D with a general autocorrelation structure are suboptimal, as seen in 2D or 3D datasets. To highlight this discrepancy, we have included a new schematic Figure 1, which contrasts 1D vs. 2D sequence data, and illustrates why 1D positional encodings are suboptimal (despite their general suitability for many data types). Flattening 2D sequences into 1D inputs for a transformer, without preserving the original 2D information (via positional encoding), biases the model against interpreting the proportionate distances between rows and columns (please see the new Figure 1E-H). Furthermore, we have clarified this in the introductory paragraph and added a sentence referencing the Reviewer’s suggestion that the original sinusoidal PEs were intended to provide generic spatial information in 1D (Line 39):
“For many common forms of data, such as natural language, text, and audio, the labeling of ground truth positional information is straightforward, since tokens are ordered sequences in 1D. This led to the original design of 1D sinusoidal PEs, which were successfully applied to natural language data, and provided general spatial information about language tokens (rather than data-specific information; Vaswani et al., 2017).”
In the following passage (also in the introductory paragraph), we also clarify how various (absolute and relative) PEs have been successful in capturing long-range dependencies in language (Line 43):
“Recent investigations into the role of PE in transformers have led to a proliferation of PE schemes, each specially designed for 1D text with different properties (Su et al., 2022; Shaw et al., 2018; Vaswani et al., 2017; Raffel et al., 2020; Li et al., 2024; Kazemnejad et al., 2023; Shen et al., 2023; Golovneva et al., 2024; Press et al., 2022). Many of these PE schemes have been particularly successful at capturing long-range dependencies in natural language. However, many interesting problems require input sequences that are not in 1D (e.g., image datasets), or where position information is non-trivial or not known (e.g., biological data).”
Weakness 3: The connection to NTK theory feels tenuous and not rigorously established for transformers. Have the cited theoretical results on the lazy and rich regimes been shown to apply to transformers? or even specifically to the positional encoding parameters in transformers?
Reply 1B: We agree that there is limited NTK theory for transformers. This likely stems from the mathematical difficulty of analyzing transformers in an ‘infinitely wide regime’, which is a typical requirement for NTK analysis. Nevertheless, there are a few recent papers that have similarly demonstrated the relevance of rich/lazy training in transformers, though on different tasks, and not specifically for positional encoding (Zhang et al. (2024); Kunin et al. (2024)). We have now included reference to these papers to strengthen this connection, e.g., Line 63:
“Although this theoretical framework was initially developed for simple neural networks (e.g., feed-forward networks with few hidden layers), the insights drawn from it should apply to various model architectures, including transformers (Zhang et al., 2024; Kunin et al., 2024).”
References:
- Kunin, Daniel, Allan Raventós, Clémentine Dominé, Feng Chen, David Klindt, Andrew Saxe, and Surya Ganguli. “Get Rich Quick: Exact Solutions Reveal How Unbalanced Initializations Promote Rapid Feature Learning.” arXiv, June 10, 2024. https://doi.org/10.48550/arXiv.2406.06158.
- Zhang, Zhongwang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, and Zhi-Qin John Xu. “Initialization Is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing.” arXiv, May 8, 2024. https://doi.org/10.48550/arXiv.2405.05409.
Question 3: I noticed you trained networks with a single attention head. What happens if you scale up the network (in particular, add additional attention heads)? I'm wondering if these findings might be in part due to the fact that the network only has access to limited attention mechanisms.
Reply 1I: Thank you for raising this question. In response to this question, we have run additional experiments that we have mentioned in Revision 0B for all Reviewers above (also see Fig. A14).
Question 4: Are there other toy tasks that you could design and train on that would help shed light on the role of positional encodings?
Reply 1J: As requested (and as referenced above in Reply 1D), we have included an additional toy experiment. Please see the text in the summary above (Revision 0C for all Reviewers) and refer to the new Fig. A15.
Weakness 4 (LST): I disagree with the characterization of the 2D sinusoidal patterns as "ground truth". I can think of multiple methods for encoding 2D spatial information (for example, any rotation of the 2D sinusoidal basis). This matters because the paper uses similarity to the 2D fixed encoding as a measure of how "optimal" the learned embeddings are.
Reply 1C: We completely agree with the Reviewer that multiple ‘ground truths’ can be theoretically used for this 2D LST task. We hope that the new Figure 1 (referenced above in Revision 0A) helps to intuit the rationale (and the class of PEs) we consider to be appropriate ground truths. To be explicit, appropriate ground truths would be those PEs that preserve proportionate distances in the native dimensions the input sequence was presented (i.e., any PE that preserves the pairwise distances between rows and columns in Fig. 1F).
In addition, when measuring the distance between learned and ‘ground truth’ embeddings in our experiments/analyses, we measure the distance after applying an orthogonal Procrustes transformation (which includes rotations/reflections of the embedding dimensions). Thus, while we agree that any 2D spatial encoding (and their rotations/reflections) would likely provide a viable ‘ground truth’, our analysis accounts for this possibility. We have clarified this in the text, with reference to the new Figure 1 (Line 259):
“Finally, we included a “ground truth” PE – an absolute 2D PE based on sines and cosines (2d-fixed) – to compare how similarly the various PEs produced attention mechanisms to this ground truth model (see Appendix A.1). (Note that the term “ground truth” applies to any rotation of the absolute 2D PE, or any PE that preserves the original 2D row and column LST information, as depicted in Fig. 1E,F.)"
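For concreteness, the alignment step described above can be sketched as follows: the learned PE is compared to a reference encoding only after an orthogonal Procrustes transform, so that any rotation/reflection of the reference counts as equivalent. The similarity measure here (correlation after alignment) and the toy data are illustrative; the paper's exact analysis is in its Methods:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)

ref = rng.standard_normal((16, 8))          # stand-in for a 2D "ground truth" PE (16 tokens)
R_true, _ = np.linalg.qr(rng.standard_normal((8, 8)))
learned = ref @ R_true + 0.05 * rng.standard_normal((16, 8))  # a rotated, noisy copy

# Align the learned PE to the reference up to rotation/reflection, then compare
R, _ = orthogonal_procrustes(learned, ref)
aligned = learned @ R
similarity = np.corrcoef(aligned.ravel(), ref.ravel())[0, 1]
print(round(similarity, 3))  # close to 1: equivalent up to an orthogonal transform
```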
Weakness 5 (fMRI): The paper states that there isn't a ground truth positional structure available for the fMRI task, but then treats functional connectivity as ground truth when interpreting the learned positional encodings' similarity matrices.
Overall, I'm not sure what I'm supposed to take away from the fMRI results. Functional connectivity is estimated using correlations across brain regions... What have we really learned here?
Reply 1D: We thank the Reviewer for raising this point. The motivation of the fMRI portion of this study was not to learn something about the brain per se, but instead to learn how to build and apply transformer models on biological data (for future applications of machine learning). In other words, the motivating question for this experiment was: Can we improve performance and generalization in transformers by learning positional encodings, rather than choosing off-the-shelf choices, like the 1D sinusoidal encoding?
With fMRI data, we fortunately have access to the network partitions to “approximate” a ground truth, which provides some ability to verify what we have learned: That 1) transformers can improve generalization by stripping away the inductive bias of standard absolute PEs, and 2) that the learned PE is interpretable. We importantly think this is relevant for problems in which a ground truth PE does not exist, or no off-the-shelf PE is ideal (e.g., as is common in biological data). We have modified the corresponding results section to emphasize this distinction (Line 452):
“These findings support the hypothesis that learnable PEs (as opposed to off-the-shelf PEs) in the rich training regime can improve generalization, while successfully learning nontrivial and interpretable position information (Figure 6C,D).”
Also, we have run additional experiments (mentioned in Revision 0C to all Reviewers) to demonstrate the feasibility of learning PEs from scratch in a toy network model with nonlinear interactions (nonlinear multivariate autoregressive model). The main finding here is demonstrating that ground truth network modules can accurately be recovered from nonlinear systems using learnable PE models initialized with small norm (Fig. A15, Page 25). We intend to place this figure/analysis in the main text, if space permits.
Weakness 7 (Missing experiments): No investigation of how PE initialization interacts with other architectural choices
Reply 1E: In response to this reviewer request, we have now performed additional experiments on the LST task with multiheaded attention (15 seeds per model) (Fig. A14). The outcome of these experiments is explained above in the response to all Reviewers (Revision 0B). We have referenced these new results in the manuscript (Line 181):
“However, we have included results for models with multiheaded attention (2 and 4 heads) on the LST task in Fig. A14, which overall reduce generalization performance.”
Weakness 8 (Discovering PE is overstated): I feel like claims about "discovering" position information are overstated –– the model may just be finding a convenient internal representation. Perhaps if we had a stronger understanding of how a particular positional encoding allowed the network to generalize better on the LST task, or some analysis of why building in the brain's functional connectivity into the PE for the fMRI task is beneficial.
Put another way, the connection between PE learning and downstream task performance is correlational, not causal
Reply 1F: We thank the Reviewer for the opportunity to clarify our findings. In response to this comment, we have provided a more intuitive visualization in the Figure 1 schematic. We hope this visualization makes it clear why having an interpretable PE that preserves the original 2D positional information would make it easier to generalize the LST task, which is native to 2D. (For humans, trying to do the LST task in 1D – without a notion of rows and columns (see the new Fig. 1E) – would be extremely challenging.)
We also clarify in our interpretability analysis in Fig. 3D that we explicitly demonstrate that the learned PEs approximate the “ground truth PE”, which considers arbitrary rotations of the same 2D basis set. Moreover, we show that the similarity of these learned PEs to the ground truth set (up to an orthogonal Procrustes transform) is correlated with generalization performance. Since the only manipulation in our experiments was our choice of PE parameterization, it is implied that the learned PE representations are likely fundamental to this generalization, rather than a convenient representation. However, to address the Reviewer’s concerns on making overstatements, we have tempered the language of “discovering PE” to “learning an interpretable PE”. For example, in the Abstract: “Together, our results illustrate the feasibility of learning identifiable and interpretable PEs for enhanced generalization.”
Question 1: If I'm reading the results correctly, the effect of the weight init scale (sigma) across the two tasks is slightly different. For the LST task, it mainly affects generalization, but training is unaffected (all scales train to 100% accuracy). For the fMRI task, it looks like both train and test performance vary with the weight init value. How should I interpret this?
Reply 1G: We thank the Reviewer for their question. The primary reason for the discrepancy in train/test performance between the fMRI and LST tasks is that for the LST, we report performance accuracies, which saturate at 100%. For the fMRI data, however, we report the loss in terms of MSE. (Note that the LST task is trained using a cross-entropy loss.)
Indeed, in the fMRI experiments we do see differences in MSE loss across choices of PE in both the train and test sets. The first, most prominent difference is between learnable PEs and traditional sinusoidal and relative PEs. However, as noted by the Reviewer, we also see marginal differences in MSE loss across different initializations of learnable PEs. The way to interpret this is that richly learned PEs are better able to learn/match the data, in both training optimization and generalization performance.
Question 2: What is your viewpoint on the role of positional encodings? Are they supposed to provide generic spatial information to the network (but then subsequent layers in the network can learn to use this regardless of the task)? Or are they supposed to provide a particular inductive bias about the spatial information required to solve the task? (I feel like the paper is arguing the latter, but I want to confirm).
Reply 1H: Yes, as the Reviewer suspects, our results provide evidence for the latter view: that positional encodings provide a particular inductive bias about the spatial (or sequential) information required to solve the task. In LLMs, these minor distinctions may matter less, since the objective function defined in LLMs is ambiguous (i.e., next-token prediction of natural language, which is typically a 1D ordered language sequence). However, in the context of a relational reasoning task in multiple dimensions, the organization of tokens matters for learning the strategy to perform the task (e.g., see the new schematic figure that shows the difficulty of solving the LST task if organized in 1D). We hope this new schematic figure demonstrates the importance of understanding the impact of PEs in tasks whose input sequences originate in higher dimensions.
Thanks for the thorough response, clarifications, and additional experiments!
Regarding the toy task:
- Do the networks achieve the same train/test error on the task? or are the networks with learnable PE better? Does performance correlate with modularity?
- In addition to modularity in Fig A15(D), can you just show what the learned PE looks like? Do they look similar to the ground truth connectivity?
If I understand the response correctly, a key contribution of the paper is pointing out that 1D PEs aren't well suited for tasks where the data are naturally 2D (e.g. spatial tasks). But this has been pointed out before, e.g. in Li et al 2021 (https://proceedings.neurips.cc/paper_files/paper/2021/file/84c2d4860a0fc27bcf854c444fb8b400-Paper.pdf). I'm curious how the authors compare/contrast their work to this prior work.
After reading the author's response, I have increased my score.
We thank the Reviewer for their great suggestions, and for raising their score. We have included some minor additions based on the Reviewer's feedback, which we found to be quite helpful.
Regarding the toy task...
Do the networks achieve the same train/test error on the task? or are the networks with learnable PE better? Does performance correlate with modularity?
- Overall, we did not see any difference in the MSE errors on the toy task across PE models. All models appeared to saturate on performance on IID generated test activity, which likely has to do with the fact that the NMAR model has a large amount of private noise (assigned to each node), making it hard to perfectly optimize predictions of a node via network interactions. Nevertheless, the core take-home of this particular analysis is to demonstrate that small-norm learnable PEs are able to recover PEs that are more interpretable, despite performance saturation across all models (Fig. A15F). Finally, to establish that there was no difference amongst MSE in IID samples, we computed an n-way F-test (n = number of total models), finding no difference in the distributions across model architectures (F=0.24; p=0.97). We have specified this in the corresponding Appendix section (A.4, line 883):
"While we found that learnable PE models with a small-norm were the models that were most capable of learning the ground truth network organization, all models converged to the same MSE (average MSE on IID samples = 0.049). This is due to the fact that there is a substantial amount of private random noise associated with each node, which provides a noise ceiling on their predictions. To ensure there were no differences in performance across each of the models, we performed an -way F-test ( for each of the model variants; , .)"
In addition to modularity in Fig A15(D), can you just show what the learned PE looks like? Do they look similar to the ground truth connectivity?
- We have now included additional visualizations of the learnable and random PEs that provide a concrete visualization in Figure A15(F) (the revised manuscript is now uploaded). We computed the cosine similarity between pairs of positions (within a model) across all models in that experiment. This resulted in a node x node matrix depicting the similarity of PEs. The figure shows that small-norm PE models can approximate the ground truth (exhibiting PE clustering remarkably similar to the ground truth), while learnable models with larger-norm initializations cannot (despite converging to similar performance). Thanks for making this suggestion -- we find that the visualizations are quite striking. We still anticipate including Figure A15 in the main text, if space permits.
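The visualization described above can be sketched as a node x node cosine-similarity matrix of the learned position embeddings, reordered by the ground-truth module labels so that any clustering is visible. Array shapes and labels here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
pe = rng.standard_normal((20, 32))       # learned PE: one 32-d embedding per node (illustrative)

# Node x node cosine similarity between learned position embeddings
unit = pe / np.linalg.norm(pe, axis=1, keepdims=True)
sim = unit @ unit.T                       # (20, 20); block structure => recovered modules

# Sorting rows/columns by ground-truth module labels makes any clustering visible
labels = np.repeat(np.arange(4), 5)
order = np.argsort(labels)
print(sim[np.ix_(order, order)].shape)    # (20, 20)
```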
If I understand the response correctly, a key contribution of the paper is pointing out that 1D PEs aren't well suited for tasks where the data are naturally 2D (e.g. spatial tasks). But this has been pointed out before, e.g. in Li et al 2021...
- Thank you for bringing to our attention this highly relevant paper by Li et al. that we had previously missed. We have now included this citation in our manuscript in both the Introduction and Discussion. Indeed, the Reviewer is correct that it is related in important ways. However, in brief, there is a key distinction between our study and Li et al. (2021). Specifically, Li et al. provide a learnable PE scheme that can learn to interpolate PEs within a specified dimensional space (e.g., a 2D image) that varies in the number of pixels (e.g., learn to interpolate between images with different numbers of pixels). (This scheme is useful for datasets where not all positions are equally included/represented in the training set, such as 2D image datasets with sparse spatial structures.) In equation 2 of their paper, the learnable parameter requires a priori knowledge of the number of dimensions. (In our study, this would correspond to knowing the number of networks in the NMAR model a priori.) Here we study the scenario in which we make no assumptions about the underlying dimensional space. However, we recognize that the approach used in the present paper is limited to settings in which every position is represented in the space (rather than a sparse dataset). In this sense the two approaches are complementary -- one can learn the space (ours), and the other paper shows how one can interpolate a pre-specified space. We have included a sentence on this in the Discussion (Line 494):
"In contrast, many important problems require the encoding of sequences that are not in 1D (e.g., Li et al. (2021)), and where position information is non-trivial or not known, which we investigate here. (Note that the present work is complementary to Li et al. (2021); Li et al. provide learnable PEs to interpolate positions within specified dimensions (e.g., in a 2D image), while we focus on learning PEs in unspecified dimensions."
The authors apply rich vs. lazy learning theory to positional encodings (PEs) to show that learned PEs with "rich" (small-norm) initializations generalize better than many other relevant PE variants on a sudoku-like task and on fMRI data. They also show that the learned PEs are most similar to "ground truth" 2D PEs in the sudoku-like task, and that the learned PEs for the fMRI data capture relevant aspects of the organization of the brain.
Strengths
- successfully apply rich vs lazy learning theory to improve PEs in a reasoning task and in neural data
- they provide some deeper analyses into what structures the PEs are learning when they are successful
- the figures are aesthetically pleasing while also being successful in their informational goals
- the paper is a positive contribution to better understanding PEs which are an extremely important aspect of transformers
Weaknesses
- authors are unclear about the contents of their testing data. More specificity is needed and I couldn't find it in the methods or section 3.
- please provide details into how the samples differed from the training data in both LST and the neural data
- Generalization could refer to unseen data within the same distribution, or it could refer to completely held out combinations of shapes, or expansions of the grid size
- paper would be improved with qualitative visualizations of the learned PEs in the LST experiments
- surprisingly low results with RoPE with no further explanation as to why RoPE would fail in the presented cases
- RoPE should be able to at least overfit the training data due to its absolute PE component.
- Or maybe its poor performance was due to the inability to adapt the relative component to the multiple dimensions of the tasks?
- issue with small norm claim in introductory contributions. there are other ways to adjust learning regime.
- Consider increasing/decreasing the gain of the output as a way to adjust the learning regime
- The broader impact of this work is perhaps diminished by the fact that it is an application of lazy vs rich learning theory
Questions
- It seems like there could be more ways than a 2d variant of the sinusoidal PEs introduced in Vaswani et al. 2017 to encode two dimensional positional information that would still lead to good generalization in the LST. Maybe I'm simply wrong about that intuition, but in the case that there are more ways, why do your models learn an encoding scheme that is similar to the 2d baseline? In the case that my intuition is wrong and there is only one way to properly do the PE, how could you prove that this is only one way? Do you think it would be affected by different forms of regularization? Or from training on different types of generalized versions of the LST (like expansions of the grid)?
- Expanding on the previous point, the very notion of "ground truth" PEs seems limiting in scope. It is possible that there are a variety of performance-equivalent ways for any given task that 2d information can be encoded. Of course, within these possibilities, there might be encoding schemes that are more likely to be learned for any given dataset/training objective/update rule. And there might be encoding schemes that generalize better to various distribution shifts. A greater treatment of this idea would improve the work.
- It's really surprising that RoPE is performing so poorly as RoPE provides absolute positional information in addition to relative. Thus, you would expect that the transformer could use the absolute positional information and ignore the relative information if that was useful for the task. In order to trust the results from RoPE, you should provide a further analysis of why the RoPE models are performing so poorly. Are they overfitting in some way? Is there some reason that the absolute PE component in RoPE is worse than the 1d baseline? Otherwise, the simpler explanation is that you implemented RoPE incorrectly.
- I don't understand why you would do a procrustes alignment in section 3.4/figure 4. The model operates on unaligned versions of the PEs in the same context window, no?
Question 3: It's really surprising that RoPE is performing so poorly as RoPE provides absolute positional information in addition to relative. Thus, you would expect that the transformer could use the absolute positional information and ignore the relative information if that was useful for the task. In order to trust the results from RoPE, you should provide a further analysis of why the RoPE models are performing so poorly. Are they overfitting in some way? Is there some reason that the absolute PE component in RoPE is worse than the 1d baseline? Otherwise, the simpler explanation is that you implemented RoPE incorrectly.
Reply 2H: In the LST task, RoPE achieves 100% accuracy on the training data (Table 1) but fails to generalize, achieving only 80.5% accuracy on the test set. Our new experiments with 2 and 4 attention heads show that RoPE models improve their performance (in fact, RoPE models are the only models whose generalization performance increases with additional attention heads). However, this improvement plateaus at 90.2%, which is comparable to relative positional encoding performance (92%). This is somewhat unsurprising, given that RoPE has a relative component similar to the relative positional encoding. Notably, outside of the 2DPE (i.e., ‘ground truth’) model, the highest-performing model is still the single-attention-head learnable PE model with a small-norm initialization (95.6% generalization performance).
We hope that this new analysis/explanation, in addition to the new schematic figure (which makes it easier to intuit why a relative PE in 1D would not necessarily capture row and column information in a 2D structured grid), addresses the Reviewer’s concern. Note that while we are confident in our implementation of the RoPE encoding layer, we have also included our code (including our RoPE implementation) in the supplementary material, should the Reviewer wish to verify it. It can be found in lstnn/positionalencoding/rotary_positionalencoding.py.
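For completeness, the sketch below illustrates one common rotary-encoding formulation (pairing feature dimension i with dimension i + d/2). It is provided only as an illustration of the mechanism and is not copied from the lstnn module referenced above; the tensor shapes are hypothetical.

```python
import torch

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Rotate query/key features by position; x has shape (seq_len, d) with d even."""
    seq_len, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]  # pair dimension i with dimension i + half
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(36, 64)   # hypothetical: 36 tokens (e.g., a 6x6 grid), 64-dim head
q_rot = apply_rope(q)     # applied to queries and keys before computing attention
```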
Question 4: I don't understand why you would do a procrustes alignment in section 3.4/figure 4. The model operates on unaligned versions of the PEs in the same context window, no?
Reply 2I: Thank you for the opportunity to improve the clarity of this analysis. Note that the motivation for the Procrustes alignment in Fig. 4 (for fMRI data) is conceptually different from the reason for doing it in the LST task. Here the goal is to compare the distances between tokens, rather than to align PE embeddings to some ground truth. Comparing distances between tokens requires alignment because the embedding dimensions of learned PEs do not necessarily correspond to one another: since learned PEs are randomly initialized, the embedding dimensions of token 1's PE do not necessarily correspond to the embedding dimensions of token 2's PE. This is in contrast to models that are all trained with the same PE, e.g., the 1D sinusoidal PE, where the sinusoidal frequency assigned to each embedding dimension is fixed/predetermined across all models. Thus, to appropriately measure the similarity/distances of learned PE embeddings, we must first align these embeddings and then compute the distances. We have now clarified this in the manuscript (Line 439):
“The reason this is necessary is that learned PEs are randomly initialized, so the embedding dimensions of each token's PE are not necessarily aligned (e.g., position 1's embedding dimensions do not necessarily correspond to position 2's). Thus, after aligning the embedding dimensions across tokens, we computed the distance for every pair of tokens, yielding a 2D distance matrix.”
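A minimal sketch of the alignment-then-distance procedure is shown below (with assumed array shapes and random placeholder inputs; this is not our exact analysis code):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.spatial.distance import pdist, squareform

def aligned_distance_matrix(pe_learned: np.ndarray, pe_reference: np.ndarray) -> np.ndarray:
    """Align one PE table to a reference (rotation/reflection only), then compute
    pairwise distances between token PEs; both inputs are (n_tokens, d_model)."""
    R, _ = orthogonal_procrustes(pe_learned, pe_reference)
    aligned = pe_learned @ R
    return squareform(pdist(aligned))  # (n_tokens, n_tokens) distance matrix

# Hypothetical usage with random placeholders:
pe_a = np.random.randn(360, 64)
pe_b = np.random.randn(360, 64)
dist = aligned_distance_matrix(pe_a, pe_b)
```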
References
- Liu et al. “How Connectivity Structure Shapes Rich and Lazy Learning in Neural Circuits.” ICLR 2024. https://doi.org/10.48550/arXiv.2310.08513.
- Golovneva et al. “Contextual Position Encoding: Learning to Count What’s Important.” arXiv, May 28, 2024. http://arxiv.org/abs/2405.18719.
- Ruoss et al. “Randomized Positional Encodings Boost Length Generalization of Transformers.” ACL 2023. https://doi.org/10.18653/v1/2023.acl-short.161.
- McLeish et al. “Transformers Can Do Arithmetic with the Right Embeddings.” arXiv, May 27, 2024. https://doi.org/10.48550/arXiv.2405.17399.
- Kazemnejad et al. “The Impact of Positional Encoding on Length Generalization in Transformers.” Advances in Neural Information Processing Systems 36 (December 15, 2023): 24892–928.
Thank you for all the updates. I will upgrade the soundness from a 2 to a 4. The core of the work, however, still rests on a relatively simple adjustment that lacks novelty, namely the adjustment of the norm of the initialization. For that reason, I will leave the rest of my review unchanged.
Thank you for your time and helpful feedback/questions!
Weakness 1: Authors are unclear about the contents of their testing data. More specificity is needed and I couldn't find it in the methods or section 3. 1) please provide details into how the samples differed from the training data in both LST and the neural data. 2) Generalization could refer to unseen data within the same distribution, or it could refer to completely held out combinations of shapes, or expansions of the grid size
Reply 2A: We thank the Reviewer for their careful review, and their overall positive evaluation of our manuscript. Thank you for the opportunity to clarify how we measured generalization (test data) in both tasks.
In the LST task, the test set was a held-out set of LST puzzles of the same size. In general, these were IID generated; however, we took steps to ensure that the test set puzzles were distinct from the training puzzles. Specifically, we measured the Jaccard dissimilarity between pairs of puzzles and required that each test set puzzle have a Jaccard dissimilarity greater than 0.8 to every individual training puzzle. We have now more prominently clarified this in the manuscript’s text, e.g., (Line 151):
“We ensured that every generated training set puzzle was distinct from the generalization (validation) set of puzzles (i.e., the Jaccard dissimilarity was > 0.8 between each test puzzle and any individual training puzzle).”
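The filtering criterion can be sketched as follows. The set-of-triples puzzle representation here is a simplifying assumption for illustration, not our exact data format:

```python
def jaccard_dissimilarity(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B| for two puzzles represented as sets."""
    return 1.0 - len(a & b) / len(a | b)

def is_valid_test_puzzle(candidate: set, training_puzzles: list) -> bool:
    """Keep a candidate only if it is sufficiently dissimilar to EVERY training puzzle."""
    return all(jaccard_dissimilarity(candidate, p) > 0.8 for p in training_puzzles)

# Hypothetical puzzles as sets of (row, col, symbol) triples:
train = [{(0, 0, "square"), (0, 1, "circle")}, {(1, 1, "star"), (2, 0, "circle")}]
test_candidate = {(2, 2, "square"), (1, 0, "star")}
print(is_valid_test_puzzle(test_candidate, train))
```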
For the neural dataset, generalization was evaluated as the MSE of predicting masked brain activity from a separate human participant. We have now clarified this in the text (Line 198):
“Generalization in the human dataset was evaluated as the MSE of predicting masked brain activity data from a separate human participant.”
Weakness 2: paper would be improved with qualitative visualizations of the learned PEs in the LST experiments
Reply 2B: First, to address this and related comments, we have now provided a new Figure 1 which provides intuitive visualizations of 1D and 2D PEs, and their difference (see Response 0A to all Reviewers above). Importantly, when we compare the similarity of learned PEs to the ground truth PE, we are essentially comparing the similarity of the learned PE to the visualization in Fig. 1H (since our measure of similarity via Procrustes alignment is invariant to rotations/reflections).
We have also included visualizations of the learned PEs in our toy NMAR dataset (Fig. A15F). In that figure, we demonstrate that PEs with small-norm initializations qualitatively recover the ground truth network structure. Moreover, as the initialization norm is increased, the ability to recover the ground truth network structure is diminished. We anticipate moving this figure to the main text, space permitting.
Weakness 3: surprisingly low results with RoPE with no further explanation as to why RoPE would fail in the presented cases...
Reply 2C: Please see our response to Question 3 (Reply 2H) below to address this weakness.
Weakness 4: issue with small norm claim in introductory contributions. there are other ways to adjust learning regime... Consider increasing/decreasing the gain of the output as a way to adjust the learning regime
Reply 2D: Thank you to the reviewer for suggesting this. We agree – for example, changing the rank of the initialization can also change the learning regime (Liu et al., 2024). We are unfamiliar with increasing/decreasing the gain as a way to adjust the learning regime. Could the reviewer kindly provide a reference that we can include?
We have also included a sentence in the Introduction regarding low-rank initializations (Line 062): "This is referred to as the rich or feature learning regime (Woodworth et al., 2020; Chizat et al., 2020). (We note that choosing the initialization rank can also induce rich versus lazy learning; Liu et al. (2024).)"
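As a concrete illustration, the two initialization choices mentioned above can be written as follows. This is a minimal sketch with assumed hyperparameters: the 0.2 scale mirrors the small-norm setting referenced in our LST experiments (applied here to a standard Gaussian init as an assumption about the exact parameterization), while the low-rank variant follows the idea in Liu et al. (2024).

```python
import torch
import torch.nn as nn

n_tokens, d_model, rank = 36, 128, 2   # hypothetical sizes

# Small-norm initialization: scale down a standard Gaussian init.
pe_small_norm = nn.Parameter(0.2 * torch.randn(n_tokens, d_model))

# Low-rank initialization: the PE table starts constrained to a rank-2 subspace,
# another way of inducing the rich learning regime.
pe_low_rank = nn.Parameter(
    torch.randn(n_tokens, rank) @ torch.randn(rank, d_model) / d_model ** 0.5
)
```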
Weakness 5: The broader impact of this work is perhaps diminished by the fact that it is an application of lazy vs rich learning theory
Reply 2E: While we agree that this work hinges on the theory of rich and lazy learning, we believe that applying it to learn accurate positional encodings is a novel and valuable insight. Our approach is particularly relevant given the growing interest in PEs due to the strong inductive bias they impose on transformers, particularly on algorithmic/mathematical tasks (e.g., Kazemnejad et al., 2023; Golovneva et al., 2024; Ruoss et al., 2023; McLeish et al., 2024, among others).
Question 1: It seems like there could be more ways than a 2d variant of the sinusoidal PEs.. Maybe I'm simply wrong about that intuition, but in the case that there are more ways, why do your models learn an encoding scheme that is similar to the 2d baseline? In the case that my intuition is wrong and there is only one way to properly do the PE, how could you prove that this is only one way? Do you think it would be affected by different forms of regularization? Or from training on different types of generalized versions of the LST (like expansions of the grid)?
Reply 2F: Thank you for this question – in general, your intuition is correct. When we used the term “ground truth”, we did not intend to refer to the specific implementation of Vaswani’s 2D sinusoidal encoding. For example, any rotation/reflection of this base encoding would suffice, as would any encoding that uses a discrete representation in 2D (rather than sinusoids). In other words, any PE that encodes a 2D grid of the correct size can serve as an appropriate ground truth. (Put another way, any PE that encodes the pairwise distances between rows and columns, as visualized in the new Fig. 1F panel, would suffice.) Our motivation for using the 2D sinusoidal PE is its common use within the literature (via Vaswani et al., 2017; Fig. 1H). The fact that the PE encodings of Fig. 1F and 1H are nearly identical indicates that this choice of PE was sufficient.
In our analysis that maps the learned PEs to the ‘ground truth’ 2D sinusoidal PE, we allow for any orthogonal Procrustes transform prior to computing the match between the learned and ground truth PEs, thereby accounting for the range of ‘possible’ learned ground truths. We have revised the sentence to mention these possibilities when introducing the notion of a ‘ground truth PE’ (Line 259):
“Finally, we included a “ground truth” PE – an absolute 2D PE based on sines and cosines (2d-fixed) – to compare how similarly the various PEs produced attention mechanisms to this ground truth model (see Appendix A.1). (Note that the term “ground truth” would apply to any rotation of a 2D fixed encoding that matches the size of the LST puzzle.)”
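For concreteness, one common way to construct such a 2D fixed encoding is sketched below (assumed to be representative of the 2d-fixed PE, not necessarily our exact implementation): row and column indices each receive a 1D sinusoidal code, and the two codes are concatenated per grid cell.

```python
import numpy as np

def sinusoid_1d(pos: np.ndarray, dim: int) -> np.ndarray:
    """Standard sinusoidal code for integer positions; `dim` assumed even."""
    i = np.arange(dim // 2)
    angles = pos[:, None] / (10000 ** (2 * i / dim))[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def pe_2d(n: int, d_model: int) -> np.ndarray:
    """2D fixed PE for an n x n grid; `d_model` assumed divisible by 4."""
    rows, cols = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    row_enc = sinusoid_1d(rows.ravel().astype(float), d_model // 2)
    col_enc = sinusoid_1d(cols.ravel().astype(float), d_model // 2)
    return np.concatenate([row_enc, col_enc], axis=-1)  # shape (n * n, d_model)

pe = pe_2d(n=6, d_model=64)  # hypothetical grid size and embedding dimension
```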
Question 2: Expanding on the previous point, the very notion of "ground truth" PEs seems limiting in scope. It is possible that there are a variety of performance-equivalent ways for any given task that 2d information can be encoded. Of course, within these possibilities, there might be encoding schemes that are more likely to be learned for any given dataset/training objective/update rule. And there might be encoding schemes that generalize better to various distribution shifts. A greater treatement of this idea would improve the work.
Reply 2G: In addition to the new Figure 1, we have now added these ideas in a brief discussion of present limitations in the Discussion (Line 508):
“First, though we consider the use of the LST (with a 2D organization) a strength of this study due to the visual interpretability of the paradigm's positional information, it is unclear how well this approach will generalize to tasks with an arbitrary number of elements, tasks in which there are dynamic changes in the number of elements (e.g., length generalization problems), or tasks in which there are specific distribution shifts.”
We thank all four reviewers for their thoughtful and careful reviews. After conducting additional experiments based on the reviewers’ suggestions, we have obtained new results and figures and made substantial revisions to the manuscript. These changes significantly improve both the clarity and substance of the manuscript. We have uploaded the revised PDF of the manuscript and provide an overview of the results and revisions below. We have also included individual responses to each reviewer.
Overview of revisions and new results:
Revision 0A: We have added a new Figure 1 (Page 2), which provides an intuitive and conceptual overview of the research gap/pain point addressed in this manuscript (Reviewers ggRj, riBV, wNeT). This Figure provides an intuition as to why incorporating only 1D sequence information would make the LST task extremely challenging to perform (for both humans and machines), as it fails to provide critical row and column information between elements in a 2D grid. Moreover, this new figure helps to clarify what exactly we mean by “ground truth” PE. In brief, a “ground truth” PE for the LST task is any PE that (approximately) preserves the distance information between rows and columns in the LST grid. This does not exclusively refer to the specific 2D implementation of the sines and cosines PE we used, and we have now clarified this throughout the manuscript.
Revision 0B: We have performed additional experiments using multiheaded attention for all transformers in the LST task (2 and 4 attention heads, 15 seeds per model; Fig. A14, Page 24). The primary result shows that adding attention heads tends to reduce generalization performance (but not training set performance/convergence) across the board for the LST task. The only model/PE that tends to improve generalization is the model with RoPE encoding, although this improvement appears to reach a plateau that is comparable to the relative PE (relative PE, 4 heads: 92.0% generalization; RoPE, 4 heads: 90.2%). This is generally unsurprising, since RoPE has both absolute and relative PE components. Note, however, that the learnable PE model (with 0.2 initialization) achieved 95.6% performance with a single attention head, which exceeds any model with any number of attention heads, excluding the 2DPE (“ground truth”) model. (Also note that all models converged on the training set, despite these added parameters.)
Revision 0C: As requested by Reviewers 8rhP and ggRj, we have included an additional toy experiment (Figure A15, Page 25). Specifically, one limitation that Reviewer ggRj referenced was that ‘brain networks’ are often estimated using linear correlations in empirical data. We have now included the PE estimation of “ground truth networks” in simulated timeseries generated from nonlinear multivariate autoregressive models that contain pre-designed network clusters of nodes. (The nonlinearity here was to apply sin(x) to the inputs to a node.) What we show in this toy example is that the PEs of small-norm initialized models can accurately recover the ground truth modules in this toy, nonlinear stochastic system. We have currently placed this Figure in the Appendix (Fig. A15), but hope to move it to the main text after freeing up space. A description of this model and experiment can be found in Appendix section A.4.
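A minimal sketch of the kind of generative process used is shown below; the parameter values and exact parameterization are assumptions for illustration, while the actual model is specified in Appendix A.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_modules, T = 40, 4, 2000          # hypothetical sizes

# Block-structured ("modular") coupling: stronger weights within each module.
module = np.repeat(np.arange(n_modules), n_nodes // n_modules)
within = module[:, None] == module[None, :]
W = rng.normal(0.0, 0.02, size=(n_nodes, n_nodes))
W[within] += 0.15
np.fill_diagonal(W, 0.0)

# Nonlinear multivariate autoregressive dynamics with private noise per node.
x = np.zeros((T, n_nodes))
for t in range(1, T):
    net_input = np.sin(x[t - 1] @ W.T)       # sin() applied to each node's network input
    x[t] = 0.5 * x[t - 1] + net_input + rng.normal(0.0, 1.0, size=n_nodes)
```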
Below, we address comments to each Reviewer.
This is an interesting paper where the Authors propose the study of various initialisations of learned positional embeddings (PEs) in Transformers, particularly how they influence generalisation in multi-dimensional spatial data. In my opinion, it is a very valuable direction!
While the majority of the reviewers were in support of acceptance, I found that the support expressed was lukewarm at best, and even the Reviewers with a positive rating have called out clear issues with the work in its present form, which the Authors did not oppose.
To name two specific examples:
- [Reviewer riBV]: "The core of the work, however, still rests on a relatively simple adjustment that lacks novelty, namely the adjustment of the norm of the initialization. For that reason, I will leave the rest of my review unchanged."
- I think that this is not an issue to be brushed aside. Initialisation is a rich topic, and a paper calling out the dependence on initialisation should dedicate its core contribution to many aspects of initialisation across multiple schemes, not just a single hyperparameter.
- [Authors' response to Reviewer ggRj]: "However, we recognize that the approach used in the present paper is limited to settings in which every position is represented in the space (rather than a sparse dataset)."
- I also think this is not doing the paper in its present form any favours. Requiring inputs that completely and meaningfully tile multi-dimensional space is limiting, especially considering the memory implications at higher dimensionalities. Deploying some of the findings in this paper (even if simplified) to point-cloud style inputs would be highly valuable and vastly expand the applicability of the work.
It is my opinion that it would be highly valuable for the Authors to address some of these issues before the work is published, and have chosen to recommend rejection. The Reviewers did not oppose this.
Additional reviewer discussion comments
No additional comments beyond the main meta-review.
Reject