PaperHub
Score: 7.3/10
Decision: Poster · 4 reviewers
Ratings: 4, 5, 4, 5 (min 4, max 5, std 0.5)
Confidence: 3.8
Novelty: 2.3 · Quality: 3.3 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Geometry of Decision Making in Language Models

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

Transformer-based language models learn low-dimensional task manifolds across layers, with similar patterns/trends in intrinsic dimensions revealing similar compression strategies despite varying architectures/sizes.

Abstract

Keywords
Large Language Models · Intrinsic Dimension · Interpretability · Decision-Making · Reasoning

Reviews and Discussion

Review
Rating: 4

This paper discusses intrinsic dimension (ID) in LLMs, i.e., the minimal number of degrees of freedom needed to describe a distribution in the model's embedding space. Specifically, the work studies how ID correlates with LLMs' intermediate prediction probability (obtained by projecting hidden states with LogitLens) in a multiple-choice question answering (MCQA) setup. The work shows a repeated pattern between ID and models' intermediate prediction accuracy: ID drops suddenly exactly when the accuracy jumps. This examination is extended to 28 LLMs and 7 MCQA tasks.
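For readers unfamiliar with the LogitLens readout described above, here is a minimal sketch of the idea in Python, using GPT-2 via Hugging Face transformers as a stand-in; the model choice, prompt format, and answer-letter scoring are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal LogitLens-style intermediate readout (sketch): project each layer's
# final-token hidden state through the final layer norm and the unembedding head
# to see which answer letter the model "prefers" at that depth.
# The gpt2 model and the A/B/C/D prompt format below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"
choice_ids = [tok.encode(" " + c)[0] for c in "ABCD"]   # token id of each option letter

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

ln_f, unembed = model.transformer.ln_f, model.lm_head   # GPT-2 module names
for layer, h in enumerate(out.hidden_states):           # embeddings + one entry per block
    logits = unembed(ln_f(h[:, -1, :]))                 # last token only
    p = torch.softmax(logits, dim=-1)[0, choice_ids]
    print(f"layer {layer:2d}: choice {'ABCD'[int(p.argmax())]} (p={float(p.max()):.3f})")
```

Tracking the layer at which the preferred choice stabilizes gives the per-layer intermediate-accuracy curve the review refers to.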

Strengths and Weaknesses

Strengths:

  1. The ID examination is done across layers, across training steps (for Pythia), and across 7 MCQA tasks.
  2. The paper uses 28 open-weight models, which is very impressive.
  3. The presentation of the paper, both writing and figures, is very good and professional.

Weaknesses:

  1. Limited contribution: The paper's Related Work section refers to the works of Cheng et al. [1][2] and Valeriani et al. [3] that also discuss ID in LLMs. As noted by the authors, the difference from the previous ones is that this "study shifts focus from abstraction alone to how LLMs geometrically transition from context encoding to decision formation" (line 102), while also (i) focusing on MCQA and (ii) scaling the observation to 28 LLMs. Although this work is well written, as a reader it reads like a specific case study of ID relative to the previous works, and therefore I have difficulty deriving its novelty and contributions. Please try to highlight this work's new findings compared to the previous ones.
  2. Relevant works: although the paper has a well-written Related Work section, mostly about ID and the geometry of models, it lacks papers that study how MCQA is answered by LLMs, like Wiegreffe et al. [4], or how the prediction grows from layer to layer, as in Haviv et al. [5] (regarding the LogitLens projection across layers).

[1] Emergence of a High-Dimensional Abstraction Phase in Language Transformers

[2] Bridging Information-Theoretic and Geometric Compression in Language Models

[3] The geometry of hidden representations of large transformer models

[4] Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

[5] Understanding Transformer Memorization Recall Through Idioms

Questions

  1. Is it possible to use ID to alter model predictions? Can we make models better using the ID observations?
  2. Is ID related to the entropy of hidden states? Entropy was previously hypothesized to be related to predictions' confidence levels, as in Stolfo et al. [6].
  3. The drop in ID we see in Figures 2 and 3 was interpreted as an "information compression phase" (line 254). Previous work discussed what information each layer outputs and how it affects the final prediction, like Geva et al. [7][8]. Just as a thought exercise, is there a connection between the ID compression phase and these works?

[6] Confidence Regulation Neurons in Language Models.

[7] Dissecting Recall of Factual Associations in Auto-Regressive Language Models

[8] Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

Limitations

yes

Justification for Final Rating

My primary concern is the overlap between this work’s contributions and those of prior interpretability studies. However, with additional comparisons to the existing literature, I believe this work could serve as a valuable complement to other methods. Accordingly, I am raising my score by one point (changing 3 to 4), hoping that the revised version will better address previous methods from mechanistic interpretability research.

Formatting Issues

no

Author Response

Thank you for your detailed review and for recognizing the scope and presentation quality of the work. You’ve brought up an important point regarding the novelty and positioning of our contribution, and your suggestions give us the opportunity to clarify both our framing and connection to existing literature.

  • W1 You're absolutely right to note that prior work, particularly [Cheng et al., 2025], [Valeriani et al., 2023], and others, has explored ID trends in LLMs. However, our study departs from these in both focus and intent (which comes down to a different methodology). We're not just tracking representational geometry as an end in itself; we aim to study how these geometric patterns reflect decision-making dynamics, especially in symbolic reasoning tasks like MCQA. The core of our contribution lies in identifying a predictive alignment between ID fluctuations and model decisiveness, i.e., the point at which the model internally commits to a specific answer. This is something we empirically connect via mid-layer vocabulary projections (using the LogitLens method) and corroborate across 28 models and 7 reasoning datasets. That consistency is what transforms this from a case study into a broader interpretability signal. Moreover, the prior studies take the representation of the entire prompt (the model's input sentence) rather than the final-token state that produces the next prediction, so the decision-making aspect is somewhat missing there. In essence, we're studying how geometry tells us not just about abstraction, but about how reasoning unfolds over time within the network. Our ID peaks don't merely indicate compression; they mark commitment points in the model's internal state space. That decision-focused framing is distinct from the abstraction-centric perspective of earlier work.

  • We thank you for highlighting relevant studies that broaden this work's alignment with the existing literature. We'll definitely integrate those references ([Wiegreffe et al., 2021], [Haviv et al., 2024], etc.) into our related work section. These works, especially those using layer-wise probing of MCQA behavior or vocabulary-space projections, provide valuable framing for the model behavior we're studying from a geometric perspective. We would like to highlight that our choice of MCQA stems from its usefulness as a controlled reasoning setup, one where a model must choose, not generate, and where decisiveness becomes a tractable signal to align with geometry. So while the goal wasn't to analyze MCQA specifically, it gives us a clean framework to study symbolic reasoning, internal commitment, and task-specific abstraction. On another note, for the tasks that use template-based reasoning (including the greater-than and arithmetic datasets), the setup does cover a generative aspect as well (see page 6, subsection Synthetic Tasks: Controlled Learning Trajectories, and line 284, page 9 of the main paper). In the revision, we'll explicitly clarify this connection between our ID-based decision lens and the behavioral studies from the MCQA literature. Your suggestion has helped sharpen how we articulate the interdisciplinary scope of our work.

Response to the Questions:

  • Can ID alter model predictions or improve models? At this stage, we focus on understanding (a foundational study) rather than modifying model behavior, but we do believe there are clear paths forward where ID can inform model improvement. Some straightforward next steps include: 1) Compression & Adaptation: the fact that we observe intrinsic dimension (ID) being much smaller than the embedding space (across layers, models, and tasks) suggests an underutilization of available capacity, pointing toward compression opportunities. Weight pruning, low-rank adaptation (LoRA), or knowledge distillation could benefit from understanding where and when representational redundancy emerges during reasoning. 2) In-Context Learning Optimization: we also observe that adding more in-context examples tends to reduce the ID of hidden states, a sign of collapsing manifolds or reduced ambiguity in decision pathways. With further study, ID might be leveraged to select better exemplars or monitor adaptation dynamics during few-shot learning, potentially improving generalization. 3) Text Quality and Task Complexity: preliminary findings (e.g., Tulchinskii et al., 2024) suggest that human-written and LLM-generated text differ in their ID profile. This opens the door for unsupervised comparisons of generation quality, model alignment, or task complexity, each grounded in geometry, not only token-level metrics. In short, while our current work doesn't intervene directly, we see this as a first step (analyzing 28 open-weight models) toward diagnostic and control tools built on ID.

  • Q2: Is ID related to entropy? This is a very insightful connection. Entropy, especially as studied in [Stolfo et al., 2024], has been linked to confidence at the output layer, and while their focus is on the final logits, we believe there's room to explore a deeper relationship between entropy and ID across the entire forward pass. We already observe that in many models, the ID of MLP outputs fluctuates more abruptly than that of the residual stream, suggesting a form of information shaping or filtering (see Figs. 2–3 and 8–13). This resonates with your point about entropy possibly tracking these changes. In fact, very recent work (e.g., [1]) has started quantifying entropy across hidden layers. Bridging that with ID would be a compelling direction, potentially surfacing a joint geometric and information-theoretic signal that tracks reasoning clarity or confidence (a minimal sketch of such a layer-wise entropy computation appears after the reference below). We're excited about the possibility that ID could serve as a layer-wise proxy or complement to entropy, especially in scenarios where logits are unavailable or when exploring internal dynamics across prompts and contexts. We thank you for bringing in this insight.

  • Q3: Compression Phase & Relation to Concept Promotion (Geva et al.) Yes, this is exactly the type of connection we hope our work inspires. When we describe an “information compression phase” (see Line 254), we’re observing not just a drop in ID, but a reduction in representational degrees of freedom, a geometric narrowing that coincides with clearer model preference in the vocabulary space. This aligns conceptually with [Geva et al., 2023], where feedforward layers promote specific concepts to the top of the distribution. Our ID-based lens provides a complementary geometric perspective, i.e., instead of looking at token probability shifts, we are examining how the shape of the representation itself tightens as the model “commits” to a specific concept (decision). While we don’t yet dissect those concepts explicitly, your suggestion points toward a valuable synthesis of mechanistic interpretability and geometric diagnostics. We will be happy to add these relations to the discussion section and point towards future scope/directions.

[1] Skean, O., Arefin, M.R., Zhao, D., Patel, N., Naghiyev, J., LeCun, Y., & Shwartz-Ziv, R. (2025). Layer by Layer: Uncovering Hidden Representations in Language Models. ArXiv, abs/2502.02013.
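Regarding Q2 above, a minimal sketch of what a layer-wise entropy computation could look like; the tensor shapes and the pairing with an ID curve are illustrative assumptions.

```python
# Entropy of the LogitLens distribution at every layer for one prompt; a quantity
# that could be plotted or correlated against the layer-wise ID curve.
# Shapes are assumptions: per_layer_logits is [n_layers, vocab_size].
import torch
import numpy as np

def layerwise_entropy(per_layer_logits: torch.Tensor) -> torch.Tensor:
    log_p = torch.log_softmax(per_layer_logits, dim=-1)   # [n_layers, vocab]
    return -(log_p.exp() * log_p).sum(dim=-1)             # entropy in nats, one per layer

# Hypothetical usage: compare against an ID curve of the same length.
per_layer_logits = torch.randn(24, 50257)                 # stand-in logits
id_per_layer = np.random.rand(24)                         # stand-in ID curve
ent = layerwise_entropy(per_layer_logits).numpy()
print(np.corrcoef(ent, id_per_layer)[0, 1])               # Pearson correlation
```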

Comment

I thank authors for the detailed response.

Upon revisiting my review, I still find that the paper's contribution overlaps substantially with previous ID studies and existing analyses of MCQA tasks. This overlap is especially evident in the related works that were absent from the original version. Moreover, because the work is primarily descriptive, mostly highlighting phenomena without offering new insights into improving LLMs or a better understanding of their mechanisms, it remains difficult for me to significantly change my view.

That said, I believe the authors have done solid work, and the rebuttal suggests that the revised version will include more comparisons to prior interpretability research on MCQA. I hope that this addition would strengthen the work and make it more appealing to researchers in the field.

I recommend that the authors try to identify a concrete application of ID that departs from existing interpretability studies. While they make many proposals, like using ID for weight pruning, these currently feel somewhat speculative and would benefit from actual evidence. Therefore, I suggest focusing on a smaller number of examples that can be easily demonstrated.

In summary, although my primary concerns about the paper's limited novelty still exist to some extent, I appreciate the work and the proposed revisions. Accordingly, I have raised my overall rating by a point.

Review
Rating: 5

This paper studies the decision-making process inside LLMs by analyzing the topological properties of ensembles of hidden-state trajectories. Specifically, the authors use three complementary estimators (MLE, TwoNN, and GRIDE) to characterize the intrinsic dimensions (ID) of these trajectories, each representing the inference process for a multiple-choice question. They observe a consistent trend as the ensembles evolve across layers: the IDs first expand, reach a peak in the middle, then drop. By projecting the mid-layer embeddings of the final token through the unembedding (logit-lens) head, they show that the ID peaks coincide with where the LLMs converge to a decision. An abundance of experiments across various transformer models makes this a thorough and cogent investigation.
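For context on the estimators the review names, a minimal sketch of the TwoNN idea in its simplest maximum-likelihood form; this is a generic illustration on synthetic data, not necessarily the exact variant or preprocessing used in the paper.

```python
# TwoNN intrinsic-dimension estimate (maximum-likelihood form): for each point,
# take the ratio mu = r2/r1 of second- to first-nearest-neighbor distances;
# under the TwoNN model, d_hat = N / sum(log mu_i). Simplest variant, for illustration.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X: np.ndarray) -> float:
    # X: [n_points, n_features], e.g. final-token hidden states at one layer
    nn = NearestNeighbors(n_neighbors=3).fit(X)     # self + 2 neighbors
    dist, _ = nn.kneighbors(X)                      # dist[:, 0] == 0 (the point itself)
    mu = dist[:, 2] / dist[:, 1]                    # r2 / r1 per point
    mu = mu[np.isfinite(mu) & (mu > 1.0)]           # guard against duplicate points
    return len(mu) / np.sum(np.log(mu))

# Example: a 2-D Gaussian cloud linearly embedded in 50 dimensions with tiny noise.
rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 2))
X = Z @ rng.normal(size=(2, 50)) + 1e-3 * rng.normal(size=(2000, 50))
print(round(twonn_id(X), 2))   # should come out near 2
```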

Strengths and Weaknesses

Strengths: The paper features thorough investigations of 28 LLMs using 3 methods for characterizing the intrinsic dimensions of trajectory ensembles. The application of the logit-lens is quite novel in that it provides a perspective complementary to purely topological analysis such as ID.

Weaknesses: Causality vs. correlation: While the ID peaks reliably coincide with decision emergence, it remains unclear whether the drop in intrinsic dimension actually drives the decision or simply signals it. To establish a causal relationship, one would need to identify instances where the model commits to an answer prior to the ID collapse, or conversely, where ID collapses without a correct decision. Without such counterexamples, the link between dimension reduction and decision-making remains correlational.

Questions

  1. Uncertainty quantification: What is the variance in the reported accuracy? Instead of looking only at the most probable answer’s accuracy, what if we plot the average log-likelihood assigned to the correct answer by the LLM? I surmise this might be a more fine-grained statistic than accuracy, and one can naturally define uncertainty as the variance of the log-likelihood assigned to the correct answer.
  2. On a related note, is there any way to quantify the variance in the estimated IDs?
  3. Per-example interpretation: This is perhaps my biggest reservation. I understand that intrinsic dimension, as defined in this paper, is a property of an ensemble and not of individual sentences. However, intuitively, “decision making” is a phenomenon at the level of individual prompts—at some point during inference, the model “decides” on the answer by eliminating other options. How should one reconcile this conceptual mismatch between intrinsic dimensions, which is a global geometric property of ensembles of hidden-state trajectories, and decision making, which is a mechanistic action the LLM undergoes when answering each specific question? I do not expect the authors to perform new experiments for this question. I am merely suggesting a more refined discussion on this matter.
  4. Comparisons with other investigations of trajectory ensembles. Several papers (e.g., [1], [2]) have shown that ensembles of leading‐token representations evolve through LLM layers via nearly linear transforms—that is, token‐cloud distributions between adjacent layers align almost perfectly under a linear map. If that is true, how can those same token clouds undergo complex geometric changes such as rising and falling intrinsic dimensions? Perhaps the multiple‐choice prompts used here induce markedly different distributions that defy the linear frameworks of [1] and [2]. I think further comparisons with this line of literature would help better situate the paper.

[1] Sarfati, Raphaël, et al. "Lines of Thought in Large Language Models." arXiv preprint arXiv:2410.01545 (2024).

[2] Razzhigaev, Anton, et al. "Your transformer is secretly linear." arXiv preprint arXiv:2405.12250 (2024).

Limitations

Yes.

Justification for Final Rating

I thank the authors for having addressed my questions.

Formatting Issues

The formatting looks fine to me.

Author Response

Thank you for your thoughtful and constructive review. We're especially pleased to see that the use of logit-lens alongside intrinsic dimension (ID) came across as a novel and useful perspective. Your recognition of the thoroughness of the experiments and your detailed questions about causality and interpretability are deeply appreciated and help refine the framing of our contribution.

  • W1) You're absolutely right to raise the issue of causality. We agree that our current study cannot claim a causal role for the ID drop; it acts as a diagnostic correlate, not a mechanistic trigger. As you noted, interventions would be required to probe this more rigorously, possibly via counterfactual modifications of intermediate representations or layer-skipping ablations (which remains an open research challenge when connecting decision making to the geometry learned by these large models). While outside the current scope, we believe this work lays the groundwork for future studies to explore whether decisiveness can be perturbed by altering local geometry. It is interesting to note that the models inherently learn to compress the representation space, and the actual decision making also happens in alignment with this learned geometry. One encouraging sign is that the ID peak tends to precede or coincide with high-confidence predictions, rather than lag behind them. This ordering provides a weak (non-causal) signal that something geometric (in the topological space learned by these models) is coalescing right before the model locks into its decision. We thank you for prompting this clarification. We'll state this limitation more explicitly and avoid causal phrasing in the revision to improve the presentation quality of this work. Please let us know if any specific places in the paper require stronger changes in the writing to make this point clearer for the reader.

Response to the Questions:

  • Q1: Uncertainty Quantification We thank you for this nice observation. In addition to the top-1 accuracy plots, we could compute the average log-likelihood assigned to the correct answer, which may provide a smoother correlation with ID trends, especially across earlier layers. Since decision making was the primary focus, we used the actual accuracy to pinpoint the shifts; likelihood computation over the entire vocabulary (~50k tokens) often yields only minor changes when a single token is considered, making it harder to analyse (another direction could be computing the rank of the actual answer). In the current version, we followed Robinson et al.'s approach to quantify decision-making. We also agree that variance (both in prediction and ID) is an important direction. In fact, GRIDE includes built-in estimation uncertainty based on neighborhood consistency, and we plan to surface that in an extended appendix. Your suggestion about using log-likelihood variance as a proxy for uncertainty is insightful and will be incorporated into follow-up work (if necessary). We believe this work lays a detailed foundation for all these studies with 28 open-weight models and will open up new directions for understanding the representations learned by these models.

  • Q3: Per-Example Interpretation We agree that decision-making is inherently a per-example phenomenon, and your question captures a genuine tension in our use of ensemble-level metrics like ID. Our current ID estimates reflect population-level structure, and while that gives us interpretability at scale, it does obscure fine-grained variation, specifically in pinpointing the example-specific perspective. To partially bridge this, in an initial study, we’ve explored (though did not report in the main paper) MLE-based local ID estimates, which can provide per-point dimensionality. These show promising signal variation across samples and could be used to align ID more closely with per-example uncertainty or decisiveness. That said, such methods are noisier and computationally expensive (as reported by prior art). For future work, we envision combining local ID + token-level vocab projections + token-wise entropy to move toward finer-grained interpretability. Your suggestion helps point us directly in this direction, and we’ll revise the discussion section to explicitly address this tension. We would also like to highlight that the overall goal of this work was to study the topology of space learned by these large models (information compression/abstractions) and see if these inherently learned spaces align with the decision-making happening inside these models. It was interesting to see that the inherent topology learned by these models also tells something about the decision-making. In general, this work aims towards establishing a foundational study (with 28 open-weight models) that can be a base for future explorations on how these spaces are learned during training.

  • Q4: Linear Token Cloud Alignment We've read and found the results in [Sarfati et al., 2024] and [Razzhigaev et al., 2024] very insightful. Our view is that these seemingly contradictory observations actually reflect different granularities of analysis. The cited works show that layer-wise representations often evolve through low-rank linear maps, especially when tracking fixed-token trajectories or averaged class centroids. In contrast, we focus on the full cloud of final-token representations across MCQA answer choices, which introduces both semantic diversity and structural complexity, especially for reasoning datasets with compositional prompts. In general, we aim to study the entire topology of the space onto which the signal is projected by the mid-layers (where we discover abstraction and compression), and then align it with the decision-making happening inside the model. Each stacked transformer layer projects one space onto another (we study both the individual spaces [MLP layers] and the continuously updated space [residual stream]); these mappings are often not linear, and quantifying the information compression that takes place is the primary goal of this study. In future revisions, we'll explicitly compare our ID findings to the trajectory-linearity literature and spell out the differences between the studies. We thank you for pointing this out, as it helps situate our study within the broader literature on representations learned by these models.
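To make the comparison with the linearity literature concrete, a sketch of the kind of check one could run; the variable names, shapes, and the use of ordinary least squares are assumptions, not an analysis from the paper.

```python
# How well does a single linear map explain the transition between two adjacent
# layers' final-token representations? H_l and H_next are assumed to be arrays of
# shape [n_prompts, hidden_dim], with n_prompts comfortably larger than hidden_dim.
import numpy as np

def linear_fit_r2(H_l: np.ndarray, H_next: np.ndarray) -> float:
    W, *_ = np.linalg.lstsq(H_l, H_next, rcond=None)   # least-squares map H_l @ W ~ H_next
    resid = H_next - H_l @ W
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((H_next - H_next.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot                       # variance explained by the linear map

# A high R^2 would support near-linear layer-to-layer evolution; note that even a
# linear map can stretch or collapse directions unevenly, so nearest-neighbor ID
# estimates can still rise and fall across layers.
```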

Comment

I thank the authors for their timely and thorough reply. All my questions are adequately addressed, and I will raise my score by one point.

A follow-up question: have the authors looked at the "local ID"s of different prompts? I am curious how much the intrinsic dimensions vary throughout the global space. This also ties to my questions about uncertainty quantification.

Comment

Thank you for your feedback. We are pleased that we could answer all the questions adequately/satisfactorily.

Regarding the local IDs of different prompts: yes, we did look at the local ID spread across the datasets. Local ID estimates (like MLE) view a specific datapoint through the lens of its local neighborhood, i.e., what the local manifold looks like around that point, and a global ID is generally estimated by aggregating these local values with an arithmetic mean. This way of computing the global intrinsic dimension comes with some limitations, as discussed by [MacKay and Ghahramani, 2005] in a short note. They suggest improving the aggregation of local intrinsic-dimension estimates: instead of directly averaging the local estimates, one should average their inverses (equivalently, take a harmonic mean), which behaves better empirically on real-world datasets. These two versions are named MLE (arithmetic mean) and MLE modified (harmonic mean) in the submission, respectively. Regarding the suggestion of uncertainty quantification, we agree that it would be good to add an inspection of the distribution (maybe a histogram) showing the spread of local ID estimates for a given set of prompts.
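A minimal sketch of the local MLE and the two aggregation schemes described above; the neighbor count k and the synthetic stand-in data are illustrative assumptions.

```python
# Levina–Bickel local MLE of intrinsic dimension, aggregated two ways:
# arithmetic mean of the local estimates ("MLE") vs. the MacKay–Ghahramani
# correction that averages the inverse estimates, i.e. a harmonic mean ("MLE modified").
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_mle_id(X: np.ndarray, k: int = 20) -> np.ndarray:
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)                 # column 0 is the point itself
    Tk = dist[:, k][:, None]                   # distance to the k-th neighbor
    Tj = dist[:, 1:k]                          # distances to neighbors 1..k-1
    # m_hat(x) = [ (1/(k-1)) * sum_j log(T_k(x) / T_j(x)) ]^{-1}
    return 1.0 / np.log(Tk / Tj).mean(axis=1)

X = np.random.default_rng(0).normal(size=(1500, 64))   # stand-in for one layer's hidden states
local = local_mle_id(X)                                 # one local estimate per prompt
id_arithmetic = local.mean()                            # "MLE" in the submission
id_harmonic = 1.0 / np.mean(1.0 / local)                # "MLE modified" (harmonic mean)
print(round(float(id_arithmetic), 1), round(float(id_harmonic), 1))
```

A histogram of `local` is exactly the kind of spread inspection mentioned above.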

As an additional note on the variance of these estimators, we would like to mention that TwoNN can be seen as a special case of GRIDE and resolves to a closed-form solution, as highlighted in the remarks section of Denti et al. (equations 12 to 17 are a good read for this). Moreover, they specifically show that the estimator in equation 16 (a special case of GRIDE) has a variance smaller than that of the MLE estimator. These results point towards GRIDE being a more robust estimator that is asymptotically unbiased, making it a good choice for global ID estimates.

Thank you again for your insightful questions, more specifically, a new direction of uncertainty quantification that was not discussed in detail in the submitted version. We would be happy to take these into account for the camera-ready version.

[1] David J.C. MacKay and Zoubin Ghahramani. Comments on 'Maximum Likelihood Estimation of Intrinsic Dimension' by E. Levina and P. Bickel (2004). 2005.

[2] Denti, F., Doimo, D., Laio, A. et al. The generalized ratios intrinsic dimension estimator. Sci Rep 12, 20005 (2022).

Review
Rating: 4

This paper investigates the decision-making process of LLMs through the lens of intrinsic dimensionality (ID). The authors estimate how the ID of hidden states evolves layer by layer in 28 open-weight transformer models. They show a hunchback-shaped ID profile corresponding to the model's internal commitment to a final answer. By comparing different tasks, models, training stages, and few-shot prompts, the paper studies how geometry relates to decision confidence, suggesting that ID can serve as a complementary tool to interpretability methods like probing and circuit analysis.

Strengths and Weaknesses

Strengths

• The empirical study is thorough and systematic, covering a large suite of LLMs, diverse benchmarks, and multiple ID estimators. The message is mostly clear and supported by figures and experiments. The study's main contribution, tying layer-wise ID to decision commitment, is novel.

Weaknesses

• The study does not consider linear dimensionality (e.g., PCA-derived), leaving open whether linear methods might capture the same geometric patterns.

• The relationship between ID and accuracy is not fully consistent across figures. For example, the claims in the main text are based mainly on Fig. 2 and do not always align with the trends seen in Fig. 5 (to name one, "For most models, the peak in intrinsic dimension aligns closely with the onset of confident predictions"). Additionally, causal evidence is lacking.

• The reported difference between MLP outputs and residual streams is asserted but not strongly supported in the main plots, which appear very similar visually.

• The analysis focuses solely on the final token’s residual stream. This overlooks decision-making distributed across earlier tokens’ residual streams, especially given attention’s copying behavior. The potential role of previous token residuals and attention heads is not tested.

Questions

  1. Could the authors test whether PCA or related linear methods, if applied, would show similar layer-wise trends?

  2. Could the authors reframe the claims to be more consistent with all results, and provide/propose ablation studies to clarify whether the ID peak truly underlies commitment, beyond spurious correlations?

  3. Could the authors quantify this difference in “sharper ID transitions” between MLP and residual post more rigorously, perhaps with summary statistics or additional visualizations?

  4. Since only the final token is analyzed, how do the authors rule out that the decisive computation happens earlier and is simply passed forward by the attention heads? Could the author consider extending the ID estimation to preceding token positions or analyzing attention outputs directly?

Limitations

Yes.

Justification for Final Rating

Most of my concerns have been addressed. The remaining issues are primarily related to writing. I trust the authors will revise the visualizations and corresponding text accordingly in the final version.

Formatting Issues

None

Author Response

Thank you for your thoughtful and generous review and for your close engagement with the work. We are happy to see that our central intent came across well, namely that ID can provide a complementary geometric lens on model behavior, bridging the gap between representational topology and decision confidence. We're especially pleased that you found the message clear, the empirical scope comprehensive, and the use of ID to infer decision commitment to be a novel interpretability signal. Your feedback on methodology, interpretation, and presentation is helpful for refining both the claims and the framing of the work, and we respond to each of your points in detail below.

  • PCA and Limitations of Linear Methods You're absolutely right in noting that PCA, and by extension, other linear techniques like LLE, assume a stable, global eigenstructure. But in practice, the data manifolds arising inside deep models are nonlinear and high-dimensional, and their local curvature can change across layers and tokens. These properties make PCA unreliable for capturing how internal decisions unfold. The prior deep learning and intrinsic dimension literature supports this, that ID estimates based on PCA (or LLE’s local eigenvalue spectrum) are prone to failure when the curvature is high or when sampling is limited ([Bruske & Sommer, 1998]; [Camastra, 2003]). That’s precisely why we moved beyond PCA and incorporated more recent estimators like GRIDE and TwoNN. These estimators better account for local distortions and provide a more faithful estimate of intrinsic structure. We would like to rehighlight that what sets our work apart is the use of ID as a lens into the process of decision formation, not just as a tool for representation analysis. Rather than testing ID accuracy per se, we explore how ID changes align with model decisiveness, which allows us to study representation evolution over the forward pass.

  • We understand and agree with your concern around the lack of causal evidence behind the alignment of intrinsic dimension (ID) peaks with the model’s decisiveness. You're absolutely right that intervention-based setups (e.g., causal tracing, input perturbations, ablation) would be required to establish a strong causal relationship. However, designing such controlled experiments, especially within the opaque, compositional dynamics of LLMs, is an important direction, but also a significantly more complex task that goes beyond the current scope. In this work, our intent is not to claim causality, but rather to study the geometric topology of the hidden representations and how they change with decision making. We highlight a consistent geometric correlation, i.e., the peaks in ID estimates align with sudden increases in model confidence (as shown by projection to vocabulary space and per-layer performance curves). What makes this alignment interesting is that it appears repeatedly across model families, sizes, and tasks, suggesting that something meaningful (abstract compression) is happening at those mid-to-late layers, where representational complexity briefly expands and then contracts. More broadly, we see this as a step toward a geometric perspective on model behavior, one that complements causal circuit tracing and information-theoretic analysis. If ID transitions consistently line up with shifts in semantic commitment, they could serve as proxies for points of decision solidification within the network. But we agree that this is more of a foundational finding, and more work is needed to test and validate that hypothesis under tighter experimental control.

  • We are happy for your close reading of our figures and agree that at first glance, the MLP-out and Residual-post ID curves may appear visually similar, particularly in the early layers. That said, our motivation for analyzing both signals was to capture how information is composed and stabilized throughout the network, especially in the later layers where MLP-out tends to add/subtract residual information in more targeted, high-leverage ways. In early layers, as you noted, there’s limited semantic structure present in either component. But as the model progresses, we consistently observe a divergence; MLP-out tends to exhibit more abrupt shifts and sharper transitions in ID, while Residual-post evolves more gradually. This pattern reflects the nature of the two streams, MLP-out contains a more “surgical” or sparse transformation, while Residual-post reflects the accumulated state the model carries forward. Figures 2–3 and 8–13 consistently show this across models and tasks, though we agree the current presentation could do more to make these contrasts easier to interpret. Lastly, regarding predictive accuracy, we deliberately report accuracy from Residual-post representations because they are the actual carriers of information forward across transformer blocks. In contrast, MLP-out alone is not directly predictive; it reflects the delta applied to Residual-post and is therefore more disconnected from decision-making in nature. When we attempted classification directly from MLP-out, we found its standalone accuracy to be very low, which further supports the idea that it encodes transient, complementary signals. We’ll be happy to incorporate clearer visual comparisons in the final version to better support this distinction and make this contrast more compelling. We thank you for this observation/suggestion.

  • Focus on Final Token This is a nice observation, and you're right to highlight the potential of analyzing intermediate token positions. Indeed, a more holistic view across all token positions could provide richer insights into distributed decision-making in transformers. However, our current focus on the final token's residual stream is very intentional. In autoregressive models, due to the strictly causal attention, the last token is the only one that has access to the full prompt and is solely responsible for producing the next-token prediction. From a decision-making perspective, this position is where the model must commit to a choice, so if we want to understand how and when a decision takes shape, the last token’s representation is the most direct and informative place to look. This approach is also consistent with prior work in mechanistic interpretability, where causal tracing, circuit discovery, and attribution studies typically operate over the final-token residual stream [Elhage et al., 2021; Wang et al., 2023]. We follow this well-established practice to ensure alignment with the broader interpretability literature. That said, we completely agree that expanding this analysis to all token positions could yield a more granular view of how information accumulates, diffuses, or transforms across a sequence. The main limitations are computational (as ID estimation is non-trivial and scales linearly with token count) and methodological, given the current lack of concrete frameworks for attributing decision-making to earlier token positions.
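As a companion to the discussion above, a minimal sketch of how the final-token hidden state could be collected at every layer across a prompt set, forming the per-layer point clouds that ID estimators consume; the model choice and toy prompts are illustrative assumptions.

```python
# Collect the final-token hidden state at every layer for a set of prompts, yielding
# one point cloud per layer; each cloud is what an ID estimator (TwoNN, GRIDE, ...)
# would then summarize. Model choice and the toy prompts are illustrative assumptions.
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompts = [
    "Question: Is the sky blue?\nA. Yes\nB. No\nAnswer:",
    "Question: Is fire cold?\nA. Yes\nB. No\nAnswer:",
]  # stand-ins for an MCQA dataset

clouds = None
with torch.no_grad():
    for p in prompts:
        hs = model(**tok(p, return_tensors="pt"), output_hidden_states=True).hidden_states
        vecs = [h[0, -1, :].numpy() for h in hs]        # final token, every layer
        if clouds is None:
            clouds = [[] for _ in vecs]
        for layer, v in enumerate(vecs):
            clouds[layer].append(v)

clouds = [np.stack(c) for c in clouds]   # each entry: [n_prompts, hidden_dim]
# clouds[l] is the layer-l ensemble whose intrinsic dimension can now be estimated.
```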

Response to the Questions:

  • Q1: On PCA and linear methods, we agree that this is a worthwhile baseline to examine. However, PCA and related linear techniques, while computationally appealing, are known to falter when applied to nonlinear manifolds typical of deep model representations (as answered above). This limitation is well-documented. That said, to address your suggestion directly, we are currently running PCA-based analyses across selected checkpoints. Early indications show that PCA does capture some coarse trends in layer-wise expansion and compression, but it lacks the resolution and robustness of nonlinear estimators like TwoNN and GRIDE, particularly around the decisive transition layers. We'll be sure to include these comparative results in the updated appendix if space allows.
  • Q2: In this work, our goal is not to claim that ID causes decisiveness, but rather that ID changes are geometrically aligned with decisiveness, which we probe via projections to vocabulary space. In other words, ID acts as a proxy signal, and we frame it as such throughout (e.g., Line 284). We would be happy to make the suggested writing changes for this initial foundational exploration and point towards causal studies that could be done in the future.
  • Q3: As answered above, the visual similarity between MLP-out and ResidPost curves can be difficult to parse at a glance, especially in early layers where they naturally track closely. However, in deeper layers, we consistently find sharper drops and higher ID volatility in MLP-out. This reflects its role in selectively refining or filtering the residual stream. Since we use ResidPost for reporting accuracy, the gray highlight is shown on the ResidPost curves (MLP-out does not tend to show good accuracy).
  • Q4: As stated above, in principle, the decisive computations can begin earlier in the sequence and propagate forward. However, in causal transformers, the final token is where all upstream information converges and where the model generates its prediction. It is thus the most immediate locus of decision commitment.

[1] Bruske, J. & Sommer, G. (2006). An algorithm for intrinsic dimensionality estimation. doi:10.1007/3-540-63460-6_94.

[2] Francesco Camastra, Data dimensionality estimation methods: a survey, Pattern Recognition, Volume 36, Issue 12, 2003, Pages 2945-2954, ISSN 0031-3203,

[3] Elhage, et al., "A Mathematical Framework for Transformer Circuits", Transformer Circuits Thread, 2021.

[4] Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2022). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. ArXiv, abs/2211.00593.

Comment

Thank you for your kind and thoughtful responses. Most of my concerns have been addressed. I will update my score.

Q1: My concerns are addressed.

Q2: My question regarding the consistency between Fig. 2 and Fig. 5 appears unresolved. Specifically, in Fig. 2, the peak in intrinsic dimension aligns with the onset of confident predictions, whereas in Fig. 5, it is the valley (inverse peak) in intrinsic dimension that aligns with this transition.

Q3: I find it difficult to observe that the trend in MLP-out is more abrupt than in Residual-post. That said, I believe this may be clarified through improved visualization, and I encourage the authors to consider refining the relevant figure.

Q4: Your point regarding computational limitations is well-taken. However, the claim that “this position is where the model must commit to a choice” is too strong. Prior work has shown that models can make decisions earlier and that specific heads (e.g., the "correct answer head" in "Tracing Attention Computation Through Feature Interactions") can move that decision information to the final token. It would be appropriate to acknowledge this nuance in the discussion.

Comment

Thank you again for your engagement in the discussion phase and for agreeing to update your score. We would like to answer the remaining concerns below:

  • [Q2] In Figure 2 (which focuses on the MCQA setup), we notice a sudden shift in accuracy over a few layers (highlighted in the gray region), where the accuracy goes from near random to the model's actual accuracy. We also note that the ID estimates of these layers show a specific trend (reflecting the topology of all the information at that layer), which is then refined further in the later layers. Moreover, the restricted setup of having to choose among a given set of options makes the problem a little different from what is observed in Figure 5. In Figure 5, the plots represent a different set of reasoning tasks (specifically, the arithmetic dataset), which has a much more open answer domain compared to the MCQA setup with 4 choices. Another point to note is that the performance curves in Figure 5 show a gradual increase rather than a sudden shift (as in the MCQA setting of Figure 2), and an interesting observation is that the point where the accuracy starts to rise coincides with the compressed state (in terms of geometry/topology, i.e., ID). The primary motivation of this plot was to study the evolution of ID throughout training; since the Pythia series provides all training checkpoints (not available for the other open-weight models), it was possible to evaluate them and study ID trends across checkpoints. In summary, the trends in Figure 2 come with a sudden shift in accuracy for MCQA (the decision being concretized in a few layers, gray area), whereas in Figure 5 we see a gradual increase in accuracy (the ID trend denotes where decision concretization starts).

  • [Q3] When we say the trend in MLP-out is more abrupt than in Residual-post, we mean the noise observed in the MLP-out plots after the decision has been made; this is most clearly visible after layer 20 in Figures 2 and 3. More specifically, for Llama2 13B chat [middle plot] in Figure 3, after layer 20 the noise in the ID estimates is well captured for MLP-out, whereas ResidPost is quite smooth. This pattern reflects the nature of the two streams: MLP-out contains a more "surgical" or sparse transformation, while Residual-post reflects the accumulated state the model carries forward. Figures 2–3 and 8–13 consistently show this across models and tasks, though we agree the current presentation could do more to make these contrasts easier to interpret. More importantly, we specifically highlight the gray area in ResidPost since the accuracy is computed using the ResidPost signal/vector projected back to the vocabulary space; MLP-out, on the other hand, does not show good accuracy when projected to the vocabulary space. We are happy that you pointed this out. We understand that it may be a little difficult to observe in the current presentation style; as you suggested, we will update the presentation (probably zooming in on the later layers, or plotting the curves together) with a more detailed explanation to make this clearer (a small sketch of one possible summary statistic for this contrast appears after this list). We hope the above clarifies what we were trying to convey, and we welcome any further suggestions for improving the presentation quality.

  • [Q4] Thank you for clarifying your question further. We will modify the writing to make this clearer. We agree that saying "this position is where the model must commit to a choice" may not be completely adequate. What we wanted to note is that in autoregressive models, the signal (residual stream) corresponding to the last token is updated by each transformer block (capturing attention over all previous tokens) to finally predict the next token. When projecting back to the vocabulary space, this specific signal (the residual stream of the last token) is used to predict the next token, whereas representations corresponding to other tokens do not fully capture the information needed for the future token. Hence, we wrote that this is the position where the model commits to a choice. We will modify this to make it more understandable for readers.
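Returning to the [Q3] point above, one simple summary statistic that could quantify the MLP-out vs. Residual-post contrast is sketched below; the statistic, the layer cutoff, and the array names are assumptions, not what the paper currently reports.

```python
# Mean absolute layer-to-layer change of an ID curve, restricted to the layers after
# the decision point; larger values indicate a noisier / more abrupt curve.
# The cutoff layer (20) and the array names are illustrative assumptions.
import numpy as np

def id_volatility(id_curve: np.ndarray, start_layer: int = 20) -> float:
    tail = np.asarray(id_curve)[start_layer:]
    return float(np.mean(np.abs(np.diff(tail))))

# Hypothetical usage, one value per stream:
# vol_mlp = id_volatility(id_mlp_out)        # expected larger if MLP-out is noisier
# vol_res = id_volatility(id_resid_post)
```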

We would also like to mention that the PCA results we were running for Figures 2 and 3 are now ready. Specifically, we used two ways of computing ID estimates with PCA, both of which use the covariance matrix of the given data to find a subspace on which the data projections have maximum variance. Please refer to Section 3 of the paper by Fan et al. [1] for more details on the formulations, where equations (1) and (2) are the two methods, respectively (one such covariance-based rule is sketched after the reference below). If allowed in the limited space, we would be happy to post those tables in the next response (and update the paper as well).

[1] Fan, M., Gu, N., Qiao, H., & Zhang, B. (2010). Intrinsic dimension estimation of data by principal component analysis. ArXiv, abs/1002.2050.
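For reference, one common covariance-based PCA rule is sketched below; whether it matches equations (1) and (2) of Fan et al. exactly, and the 0.95 threshold, are assumptions for illustration.

```python
# PCA-based intrinsic-dimension proxy: the smallest number of principal components
# whose eigenvalues explain a chosen fraction of the total variance.
# The 0.95 threshold is an illustrative assumption, not the paper's setting.
import numpy as np

def pca_id(X: np.ndarray, threshold: float = 0.95) -> int:
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]   # descending order
    explained = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(explained, threshold) + 1)

# Applied layer by layer to the final-token point clouds, this gives a linear
# counterpart to the nonlinear ID curves for comparison.
```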

Review
Rating: 5

This paper seeks to explore the geometry of the hidden representation in LLMs with the goal of developing a better understanding of the internal decision-making processes.

They present a study on a suite of open-weight transformer models (a mix of model sizes, architectures, and prompting strategies): estimating an intrinsic dimension (ID) across layers and quantifying performance per layer in a multiple-choice Q&A setting. The authors observe a consistent pattern: as information progresses through the layers, the ID exhibits a "hunchback" shape. The main contribution is this geometric interpretation of reasoning.

Strengths and Weaknesses

Strengths:

  1. The connection of geometry to reasoning in LLMs is intuitive and well explained.
  2. Extensive experimental setup and a well-justified decision to use multiple-choice Q&A.
  3. Good discussion of the limitations too.

Weakness:

  1. I agree with the limitations already discussed in the paper; the most concerning is the weak-proxy issue for the last layer's ID.

Questions

Limitations

yes

Formatting Issues

Author Response

Thank you for your thoughtful and generous review. We're pleased that the core ideas, especially the connection between geometry and reasoning, resonated with you. Your comments validate the intent behind our methodological choices, including the use of MCQA as a controlled testbed and the emphasis on layer-wise dynamics.

On your note regarding the weakness of the last-layer ID as a proxy, we completely agree that ID near the output layer must be interpreted with care (though some of the works in the vision community have shown its relevance, our large-scale study provides a much more detailed overview to make future works more concrete). The final layer is influenced not just by decision formation but also by architectural constraints (e.g., projection to vocabulary space) that may distort geometric signals. That’s why we center much of our analysis on the intermediate layers, where the ID peak consistently aligns with decisiveness, as shown by LogitLens projections. We plan to clarify this limitation more explicitly in the final version, making sure readers don’t over-interpret the last-layer ID signal in isolation.

Thank you again for highlighting the strengths of the work. Your feedback has been genuinely motivating and has helped us frame the contribution more clearly.

Final Decision

This work focuses on the analysis of LLMs' hidden representation geometry through the lens of intrinsic dimension, especially the dynamics of decision making in a multiple-choice question answering setting. The authors conduct a comprehensive analysis of 28 open-source LLMs using multiple estimators. Their findings include a consistent intrinsic dimension pattern across different LLMs, and that linguistic inputs are mapped onto structured, low-dimensional manifolds shaped by task-specific decisions. These findings provide novel insight into these LLMs from a geometric viewpoint.

The main strengths of this work include: (1) a novel idea to explain geometry of internal LLM layers using the intrinsic dimension method; (2) comprehensive experiments to verify their ideas and assumptions; (3) a consistent finding across different LLM families brings potential impact to the interpretability community.

The main weakness of this work lies in the unclear discussion of and comparison with existing work, which should be addressed in detail in the camera-ready version. In addition, it would improve this work's impact if the authors could demonstrate a concrete example and application of the intrinsic dimension method.

To summarise, this work is well documented and the experiments are solid. I would recommend accepting this work for NeurIPS.