“I'm not sure why linearity and exponential growth are discussed and considered as a goodness metric.”, “If we make Transformer blocks, which just scale their inputs by large positive constants, it seems to satisfy the linearity and exponential growth but would fail to achieve the high performance”

We thank the reviewer for the point of clarification. In response, the following discussion is now included in Section 6.2.

Coupling suggests that the representations follow the linearized equation:

Expanding across layers, the entire system can be approximated by the product

Assuming that , this equation predicts that the norm of would exhibit exponential growth layer by layer. Expanding and ,

where and represent the eigenvectors and eigenvalues of the Jacobian, respectively. The above equation predicts that the norm of would exhibit exponential growth layer by layer. In general, trajectories are not expected to be perfectly linear unless aligns with an eigenvector of . However, in our experiments, we observe a notable tendency towards linearity, suggesting that the representations align progressively during training with the eigenvectors of the coupled Jacobians.

Linearity and exponential growth are not intended as direct performance metrics, but rather they are structural properties observed in the model's trajectories. These properties help characterize how inputs evolve through the model, and are noteworthy because they occur while the model still maintains high performance. As the reviewer highlights, blocks that simply scale inputs by large constants may exhibit linear trajectories and exponential growth, but such transformations would not lead to meaningful outputs or good performance. The key distinction is that LLMs tend towards these properties alongside effective learning and performance.

Could you provide more discussion and reasons for choosing or designing the metrics for analysis?

Appendix A4 is now updated with a discrete formulation, with enhanced discussion of the relationship to our metrics. Our motivation to measure the linearity of the trajectories is drawn from [11], [12], and to quantify the properties of visualizations in Figures (12, 13, 14). Linearity results in LLMs are in analogy with those previously observed in ResNets. Exponential growth distinctly emerges in transformers, in contrast to ResNets, in which equispaced trajectories are observed. This observation is justified in our discussion on Section 6.2 below.

[11] Gai et al. A Mathematical Principle of Deep Learning: Learn the Geodesic Curve in the Wasserstein Space (2021)

[12] Li et al. Residual Alignment: Uncovering the Mechanisms of Residual Networks (2024).

Figure 3 (c). Does this show that Linearity disappears at the end of training?

According to our results, as displayed in Figure 3(c) (now Figure 28c), the linearity of trajectories tends to decrease slightly in the deeper layers of the network by the end of training, while still remaining more linear than earlier in training. This does not suggest that the trajectories become entirely unstructured.

Overall, LSS is a measure of whether the trajectories exhibit regularity, specifically in terms of their tendency towards linearity. As shown in Figure 3, the degree of linearity varies with depth and evolves throughout training. This variation highlights the nuanced behavior of trajectories across layers and different training stages, rather than indicating a complete disappearance of linearity at any point.

Figure 6. MPT -> Mistral?

MPT is a model by Mosaic ML, and is cited as such.

Can we see Figure 6-style results of untrained models?

A PCA of the trajectories for the untrained models appears in Figure 13 of the updated document.

Figure 8. Can we see the results of untrained models?

In response to the question, we have measured exponential growth of representations in the untrained models. Figure 7b of the updated manuscript now contains both the trained and untrained exponential distancing results, with trained models being consistently more exponentially spaced than the untrained ones. We thank the reviewer for this suggestion since it led to notable increases in the depth of our analyses.

In closing, we'd like to express again our gratitude to the reviewer for their feedback which not only improved the current state of our manuscript but also set the stage for intriguing future investigations. In light of these refinements, should the reviewer find it fitting, we would be most grateful for any potential increase in our score.