PaperHub
Rating: 7.3/10 · Poster · 4 reviewers (scores 4, 4, 5, 5; min 4, max 5, std 0.5)
Confidence: 3.8
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

Less is More: Local Intrinsic Dimensions of Contextual Language Models

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

Lower local intrinsic embedding dimension signals better performance, detecting when LLMs improve, overfit, or grok.

Abstract

Keywords

latent space geometry, contextual embeddings, LLMs, intrinsic dimension, training dynamics, generalization, overfitting detection, grokking

Reviews and Discussion

Review (Rating: 4)

The authors study the local intrinsic dimension of language models using a localized version of two-NN and study representation changes during training/fine-tuning. They offer an explanation for many behaviors seen in learning dynamics, such as overfitting, grokking, etc.
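For readers unfamiliar with the estimator, the localized two-NN procedure summarized above can be sketched in a few lines. This is a minimal illustration of the general technique, not the authors' implementation; the function names, the anchor count `n_anchors`, and the neighborhood size `L` are placeholders.

```python
import numpy as np

def twonn_id(X):
    # TwoNN estimator (Facco et al., 2017): the ratio mu = r2/r1 of each
    # point's 2nd- to 1st-nearest-neighbor distance is Pareto-distributed
    # with exponent d under locally uniform density, giving the MLE below.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    r = np.sqrt(np.sort(d2, axis=1)[:, :2])   # r1, r2 for every point
    mu = r[:, 1] / r[:, 0]
    mu = mu[np.isfinite(mu) & (mu > 1.0)]     # guard against duplicate points
    return len(mu) / np.sum(np.log(mu))

def local_ids(X, n_anchors=32, L=100, seed=0):
    # Local variant: apply TwoNN inside the L-nearest-neighbor
    # neighborhood of each randomly subsampled anchor token embedding.
    rng = np.random.default_rng(seed)
    anchors = rng.choice(len(X), size=min(n_anchors, len(X)), replace=False)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.array([twonn_id(X[np.argsort(d2[a])[:L]]) for a in anchors])
```

In practice one would feed in token embeddings from a model checkpoint and track the mean of `local_ids` over training; the brute-force pairwise distances here would be replaced by a k-NN index at scale.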

The authors study four setups: change in ID from fine-tuning, grokking, “exhausting training capabilities”, and classification/overfitting.

Fine-tuning: the change in local geometry is dataset-dependent; fine-tuning on dataset X does not alter the local intrinsic dimension measured on dataset Y.

Grokking: the change in local intrinsic dimension coincides with when grokking occurs in a toy arithmetic setup.

Exhausting training capabilities: monitoring the change in intrinsic dimension over the course of training can inform when training is done. Notably, the change in intrinsic dimension looks very similar across train/validation/test splits, indicating that practitioners can rely on intrinsic dimension to know when training is done, without needing a validation split as most typical training regimes currently rely on.
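A stopping rule of this kind could look like the following hypothetical helper (not from the paper; `patience` and `tol` are made-up parameters), which declares convergence once the mean local intrinsic dimension stops decreasing:

```python
def id_converged(lid_history, patience=3, tol=0.05):
    # True once the mean local ID has failed to drop by more than `tol`
    # over each of the last `patience` recorded checkpoints.
    if len(lid_history) <= patience:
        return False
    recent = lid_history[-(patience + 1):]
    return all(prev - cur <= tol for prev, cur in zip(recent, recent[1:]))
```

Because the ID trace can be computed on training data alone, such a rule would not require holding out a validation split.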

Overfitting: in a classification setting, an increase in intrinsic dimension coincides with when the model begins to overfit.

Strengths and Weaknesses

I am a big fan of studying the representations or geometry of latent states to better understand learning dynamics, and I am happy to see contributions in this space.

The paper’s biggest strength is perhaps also its weakness: the authors use their approach to study multiple phenomena commonly seen during training (fine-tuning, grokking, monitoring training progress, overfitting) through multiple setups, which is great. However, per phenomenon they study only a single setting, making it difficult to know whether their claims generalize and to draw firm conclusions per setting.

It is also unclear how the authors went about choosing their models, tasks, and datasets per setup. The choice of models seemed rather arbitrary, and I am curious why they did not study more recent models. Even for models of smaller scale, I believe there are more recent models that the authors could have studied. The same applies to datasets - many of the datasets seemed to be very specific (ex: EmoWoz). At the very least, an explanation for the design decision per setup would have been helpful.

There also seem to be minor cases where their claims don’t hold. For instance, Figure 13 in the appendix seems to go against the authors’ claim that intrinsic dimension is only affected by data within the domain of the fine-tuning data - this doesn’t seem to be the case in Figure 13?

In Section 4.4, when discussing classification/overfitting, the authors consider an increase in validation loss to be overfitting. However, if I understand correctly, the authors do not use this logic in Section 4.3, and claim that even though validation loss is increasing, their intrinsic dimension measure continues to decrease and is a better measure for training progress. This discrepancy is a bit confusing to me.

In Section 4.2, when discussing grokking, because grokking can be thought of as a phase transition, it is not surprising that the intrinsic dimension and change in validation accuracy coincide. In fact I do not think the intrinsic dimension “predicts” when grokking will occur, and suggest that the authors curtail this claim.

In summary, while I am excited to see this work, I am not quite convinced that intrinsic dimension provides a useful measure to monitor learning dynamics for practitioners, for the reasons listed above. If the paper showed the same general pattern in multiple experiments per setting, I would be a lot more comfortable drawing concrete conclusions.

Questions

How were the models and datasets chosen for each setup?

Not a question but a comment - I think the authors will enjoy these works as well: https://arxiv.org/pdf/2410.01444, https://arxiv.org/abs/2503.21073, which also study the intrinsic dimension of hidden states/token embeddings.

Limitations

See “Weaknesses” – also, the authors list some limitations which I agree with: high computational cost (which, if I understand correctly, is why the authors do not study >1B models), and that their intrinsic dimension measure is a “relative” measure and not an absolute measure, making it difficult to know how to compare the measure across models.

Final Justification

My main concern was generalizability of the experiments. The authors have conducted additional analyses, demonstrating that their findings still hold. All of my other concerns have been promised to be addressed in their next revision (claims about predicting grokking, better transparency around how they selected their models and datasets, etc.).

Formatting Concerns

N/A

Author Response

We thank Reviewer oyKS for their feedback and for highlighting the potential of our work to advance the study of geometric representations in language models. We are particularly grateful for the appreciation of our effort to connect intrinsic dimensionality with diverse learning dynamics such as fine-tuning, grokking, and overfitting, and for acknowledging the value of studying the representation geometry of LLMs. Below, we address each concern in detail.

Addressing the Weaknesses

(1) Limited number of experiments per phenomenon

A broader empirical basis per phenomenon would certainly strengthen the generality of our claims. While we focused the main text on one representative setup per phenomenon for clarity and space, we do have additional results that support the generality of our findings.

For example, in the context of overfitting, we conducted further fine-tuning experiments on language modeling, where continued training past optimal validation performance led to an increase in local dimension after the initial drop. This further confirms that a dimension increase is a useful indicator of overfitting. We will include these results in the appendix of the revised version.

(2) Clarification of dataset and model choices

Thank you for requesting more transparency here. We will clarify these design decisions in the revision. Briefly:

  • Our dataset selection was guided by the goal of interpreting training and fine-tuning dynamics in controlled, well-understood domains. SGD and MultiWOZ remain the largest openly available task-oriented dialogue datasets, and they come with rich annotations (for example, EmoWOZ, an augmented version of MultiWOZ with additional emotion labels per user utterance). Dialogue provides a test-bed for many natural language processing phenomena, allowing us to study discourse-related challenges like natural language understanding, state tracking, and emotion classification in a realistic yet compact setting.

  • The fine-grained and high-quality annotation of MultiWOZ/EmoWOZ allows us to train models with different goals on the same underlying dataset. We decided to cover sequence tagging and tracking via dialogue state tracking in the Trippy-R model, and sequence classification via emotion classification in the ERToD model. This, in turn, allowed us to isolate specific learning signals in fine-tuning and overfitting settings.

  • For models, we selected representative examples of transformer-based encoder-only (BERT/RoBERTa) and a decoder-only model (GPT-2) to capture both sides of the architectural spectrum. These models are well-established and allow for interpretability across training regimes, since, in particular, RoBERTa-style models [a] are still widely applied as encoder models.

  • To support our claims of wider applicability, we are currently running additional experiments on more recent models. Preliminary results on Phi-3.5-mini-instruct [b], a decoder model with 3.82B parameters and a hidden dimension of 3072, can be found in the table below. These results demonstrate that the TwoNN estimator, applied locally within our subsampling method, yields meaningful estimates even for latent spaces whose ambient dimension numbers in the thousands, a typical setting for modern LLM architectures.

In the case of the Phi-3.5-mini-instruct model, we implement the fine-tuning via LoRA with rank constraint r=16 on a portion of the MultiWOZ and Reddit training datasets, with other hyperparameters as described in the paper and taking a model checkpoint after 800 batches. During this fine-tuning, we observe the shifts in mean local intrinsic dimension on a subsample of the validation set of various datasets. In addition to the datasets in the paper, we present results for a self-collected dataset of ICLR 2024 titles and abstracts, for the sake of including a more recent dataset.

The following table contains the (Mean / Median / Std) of the local intrinsic dimension for base and fine-tuned (FT) versions of the Phi-3.5-mini-instruct model.

Dataset      Base Phi-3.5-mini-instruct    MultiWOZ-FT             Reddit-FT
ICLR 2024    12.98 / 12.67 / 4.37          13.89 / 13.99 / 5.20    13.55 / 13.42 / 4.67
MultiWOZ      7.48 /  7.91 / 3.77           6.91 /  6.82 / 3.64     7.78 /  7.74 / 3.94
Reddit       14.55 / 14.83 / 5.47          16.42 / 16.60 / 6.99    13.81 / 13.60 / 5.33
SGD           6.14 /  6.81 / 4.01           5.74 /  5.73 / 3.97     6.18 /  6.34 / 4.26
Wikipedia    15.18 / 14.05 / 5.51          18.84 / 18.14 / 5.32    18.65 / 17.73 / 5.06

Note that our qualitative observations on the behaviour of the autoregressive model’s latent spaces under fine-tuning hold for this larger model as well: The dimension drops on the datasets that are related to the fine-tuning data (e.g., the dialogue datasets MultiWOZ and SGD become lower-dimensional in the latent space when tuning the model on MultiWOZ). At the same time, the dimension of unrelated datasets can be seen to increase or stay unchanged, similar to what happens when fine-tuning the notably smaller GPT-2 model. Moreover, we are currently working on an extension to even more recent models (e.g., from the Llama family), and we aim to include the respective results in our final manuscript.

(3) Appendix Figure 13 and fine-tuning generalization

The result in Figure 13 actually reinforces our core claim: it shows that the drop in local dimension primarily occurs within the domain of the fine-tuning data, while unrelated inputs remain largely unaffected or even increase in their mean local dimension. This is what we expect if the model adapts its representation geometry in a task-specific way. We will clarify this interpretation in the caption and appendix discussion to avoid confusion. In particular, we will make it clear that the discussion in Section 4.1 applies primarily to masked language models, and that decoder-only models appear to exhibit the phenomenon of increasing local dimension on non-training-related data.

(4) Discrepancy between Section 4.3 and 4.4, re: overfitting criteria

This is an important observation, and we appreciate the opportunity to clarify. In Section 4.3, we observe that validation loss increases while the non-differentiable downstream metric (Joint Goal Accuracy) still improves, indicating that, when JGA is the target metric, validation loss is not a reliable stopping criterion in this case.

In contrast, in Section 4.4, the validation loss and accuracy (weighted F1) are tightly aligned; the respective maximum and minimum occur around the same training step. Hence, the loss serves as a valid stopping signal there, and the local dimension measures support this point. The difference in evaluation criteria is task-specific, and we will clarify this point more explicitly in the text.

(5) Interpretation of grokking as a phase transition

We agree that grokking is often interpreted as a phase transition. Our novel contribution here is to make this intuitive notion quantifiable: our local intrinsic dimension estimates provide a smooth, quantitative trace of the internal phase transition dynamics.

Crucially, our method enables this measurement using training data alone, and still anticipates the grokking event observed in validation accuracy. To the best of our knowledge, this is the first time such alignment can be observed at this granularity. We will revise the text to avoid overclaiming about prediction and emphasize the interpretability advantage.

Questions

(Q1) Choice of models and datasets

Please see our earlier explanation [see (2) above]. We will add a dedicated section explaining our design choices per setup in the revised paper.

(Q2) Suggested related work

We thank the reviewer for these additional references, and we will incorporate them into our Related Work section. While these works also explore intrinsic dimensions, they differ in key aspects: e.g., one focuses on sequence-level embeddings, while the other does not track dynamics during training.

Our method, in contrast, enables fine-grained, contextualized, local estimates over the course of optimization, which is essential for studying grokking or overfitting behavior. We are happy to discuss how our findings relate to and complement these efforts.

Limitations

Limitations and computational cost

Our approach is computationally intensive, which we discuss in our Limitations section, but at the same time, we provide a partial solution to the computational overhead via our sampling procedure. Moreover, we extensively analyse the stability of our estimates under sampling in our sensitivity analysis in Appendix A1.

The preliminary experiments with the Phi-3.5-mini-instruct model (3.82B parameters, 3072-dimensional latent space, for details see our rebuttal to Reviewer 67F6) demonstrate that this sampling approach applies to models with >1B parameters and embedding dimension in the thousands.

So while we largely agree with the reviewer’s assessment of the limitations, we believe that with minor modifications in the camera-ready version of the paper, these will be adequately addressed.

Conclusion

We thank the reviewer again for their time and engagement with our work. We hope that our clarifications and small additions to the revised manuscript address your concerns. Should these revisions meet your expectations, we would kindly ask you to consider increasing your score.

[a] Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (arXiv:2412.13663)

[b] Phi-3 Technical Report (arXiv:2404.14219)

Comment

Thank you for the additional experiments. This has addressed my main concern, which is that the findings may not generalize. The addition of a 3B model is in particular good to see. With the additional promised changes, I am happy to raise my score to a 4.

Comment

Thank you for your continued engagement with our work!

We would also like to point you to the additional results for both 3B and 8B Llama models, which we have shared in our reply to Reviewer 67F6.

Review (Rating: 5)

This paper introduces a novel approach for analyzing large language model (LLM) behavior by examining the geometric structure of their embedding spaces through local intrinsic dimension analysis. Using a localized version of the TwoNN estimator, the researchers demonstrate that while LLM embeddings exist in very high-dimensional spaces (hundreds to thousands of dimensions), they actually occupy lower-dimensional manifold structures that vary across different regions. This geometric analysis provides unsupervised insights into model behavior without requiring labeled data or task-specific supervision, addressing the central question of whether structural changes in embedding spaces can reveal model behavior across language modeling tasks.

The paper presents four key applications of this framework: (1) detecting fine-tuning effects and training data overlap through dataset-specific dimensional reshaping, (2) predicting the onset of “grokking” (sudden generalization) in synthetic arithmetic tasks before it occurs, (3) monitoring training convergence through stabilizing dimensions in dialogue state tracking, and (4) identifying overfitting patterns through characteristic dimensional changes that reflect the transition from generalization to memorization. These results suggest that local intrinsic dimensions serve as valuable unsupervised signals for practitioners seeking to interpret and monitor LLM training dynamics, potentially complementing traditional evaluation methods and informing future model design approaches.

Strengths and Weaknesses

Strengths:

  • Significance & Originality: The paper addresses fundamental, important questions about LLM training dynamics (fine-tuning, overfitting, grokking, convergence) from a novel and promising angle. Shifting the focus from task-specific metrics to the intrinsic geometry of the latent space is a creative and insightful contribution. The approach of using a local intrinsic dimension is a key strength, as it captures the heterogeneity of the embedding space, which a single global dimension estimate would miss.
  • Quality: The experimental setup is rigorous and well-designed. The authors validate their method across a diverse set of tasks (dialogue, arithmetic, classification) and models (RoBERTa, GPT-2), demonstrating the generalizability of their findings. The inclusion of a thorough sensitivity analysis in the appendix (Appendix A) adds significant credibility to the stability and robustness of the proposed measurement technique.
  • Clarity: The paper is exceptionally well-written and structured. It clearly articulates its central question, methods, and contributions. The visualizations, particularly the comparative plots showing model metrics alongside mean local dimension (Figures 2, 3, 4), are intuitive and effectively communicate the core results.
  • Practical Implications: The conclusion that a drop in mean local dimension correlates with improved generalization provides a powerful, unsupervised heuristic. This could be immensely valuable for practitioners, offering a way to monitor training, detect overfitting, and potentially implement early stopping or other interventions without relying on a labeled validation set, which is often a practical bottleneck.

Weaknesses:

  • Computational Overhead: A primary weakness, acknowledged by the authors, is the computational cost of the method. Calculating neighborhoods for tens of thousands of high-dimensional vectors for each checkpoint can be prohibitive, especially for large-scale models or frequent evaluations. This may limit its practical adoption in fast-paced or resource-constrained development cycles.
  • Relativity of Measurements: The exact value of the mean local dimension is dependent on hyperparameters (N, L) and model architecture. This makes it difficult to compare absolute dimension values between different models or different hyperparameter settings. The paper’s insights are based on observing the relative change (the “delta”) during a single training run, which is valid but limits the scope of cross-model analysis.
  • Causality vs. Correlation: While the paper demonstrates a strong correlation between dimensional shifts and model behavior, the causal link remains an open question. Does the dimensional reduction cause generalization (e.g., by forcing the model to find a more compressed, efficient representation), or is it merely an artifact of the optimization process that leads to generalization? While a full causal explanation is beyond the scope of this paper, a deeper discussion of this point would be welcome.

Questions

  • Implications for Fine-Tuning: The paper compellingly shows that local dimensions can diagnose training phenomena. Beyond diagnosis, what are the direct, actionable implications for improving fine-tuning? For instance, could the mean local dimension be used as a regularization term in the loss function to explicitly encourage dimensional compression and potentially accelerate generalization or prevent overfitting?
  • Exploring Local Heterogeneity: The core method computes a mean local dimension across the token sample. However, the motivation for using a local estimator is the heterogeneity of the embedding space. Have the authors explored analyzing the distribution of local dimensions further? For example, do certain token types (e.g., punctuation, stop-words vs. content-specific nouns) consistently occupy higher or lower dimensional regions, and how do these distributions shift during fine-tuning? This could provide even more granular insights than the mean alone.
  • Connection to Parameter-Efficient Fine-Tuning (PEFT): Methods like LoRA are known to be effective by operating in a low-rank (i.e., low-dimensional) update space. How do you hypothesize your findings on latent space dimension relate to the intrinsic dimension of the parameter updates in PEFT? Could a similar analysis be used to dynamically set the rank r in LoRA based on the geometric needs of the task, as hinted at by the related work from Ed-dib et al. (2024)?
  • Generative vs. Encoder Models: The experiments include an autoregressive model (GPT-2) but the analysis focuses heavily on encoders (RoBERTa, BERT). Are there any observed qualitative differences in how the local dimensions behave between encoder-only and decoder-only architectures during fine-tuning? For example, does the causal attention mechanism in decoders lead to a different geometric signature compared to the bidirectional context of encoders?

Limitations

Yes, the authors have adequately addressed the limitations of their work in Section 4.6. They are transparent about the computational complexity, the dependency of the absolute dimensional values on hyperparameters (which makes it a “relative” measure), and the assumptions of the TwoNN estimator. This upfront discussion is commendable.

One constructive suggestion for improvement would be to briefly touch upon the missing causal link as a limitation. While they demonstrate a powerful correlation, explicitly stating that the causal mechanism (i.e., whether dimensional reduction drives generalization or is merely a symptom of it) is a topic for future work would further strengthen this already solid section.

Final Justification

I will keep the score the same and really appreciate the authors’ detailed responses to my questions.

Formatting Concerns

NO or VERY MINOR formatting concerns only

Author Response

We sincerely thank Reviewer rATb for the thoughtful and constructive review. We are grateful for the recognition of the strengths of our work, particularly your appreciation of the originality of our framing, the rigorous experimental design, and the clarity of the presentation. We are pleased the reviewer found practical utility in our method and robustness in our analysis. We would like to take this opportunity to address the specific points you raised:

Addressing the Weaknesses

Computational Overhead

We agree that computational cost is a potential barrier to large-scale adoption. However, we would like to emphasize that we already mitigate this issue through random subsampling of tokens, which significantly reduces overhead while maintaining stable estimates. Our sensitivity analysis (in Appendix A) empirically supports this approach, demonstrating that local dimension estimates are robust under sampling. We will further clarify this point in the revised manuscript to better guide practitioners on balancing accuracy and efficiency.

Relativity of Measurements

We acknowledge that absolute local dimension values depend on the choice of hyperparameters and model architecture. However, our extensive sensitivity analysis (Appendix A) shows that our findings are robust across a wide range of neighborhood sizes and subsampling rates. We will make this robustness more prominent in the main text.

Causality vs. Correlation

Regarding the distinction between causation and correlation in dimensionality reduction and generalization, while disentangling this relationship is beyond the scope of our current work, we agree that the question is critical. We will revise Section 4.6 to explicitly state this limitation and outline directions for future causal investigations. We are also happy to include a brief speculative discussion on possible mechanisms (e.g., information compression) by which reduced dimensionality might encourage better generalization.

Questions

Implications for Fine-Tuning / Regularization via LID:

This is an insightful question. While our current dimension estimation method is not differentiable (due to nearest-neighbor graph construction and the subsequent TwoNN estimator), we see this as an exciting opportunity for future work. Designing a differentiable surrogate or proxy loss to encourage local compression during training could open up new avenues for model regularization. We will expand our discussion of this direction in Section 5.

Heterogeneity of Local Dimensions Across Token Types

We have indeed explored the heterogeneity of local dimension distributions by grouping tokens according to POS tags. For example, we observed that on average, punctuation and stop words tend to occupy higher-dimensional regions, while content-bearing words such as nouns and verbs exhibit lower local dimensions. We will include a discussion of this phenomenon in the revised manuscript and add a supporting figure to the Appendix.
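This kind of per-POS summary can be computed with a simple grouping step. The following is a hypothetical sketch with made-up tag names and values, not the authors' code:

```python
from collections import defaultdict
from statistics import mean

def mean_lid_by_pos(pos_tags, lids):
    # Group per-token local intrinsic dimension estimates by POS tag,
    # then report the mean per group.
    groups = defaultdict(list)
    for tag, d in zip(pos_tags, lids):
        groups[tag].append(d)
    return {tag: mean(vals) for tag, vals in groups.items()}

tags = ["PUNCT", "NOUN", "NOUN", "VERB", "PUNCT"]
lids = [14.2, 8.1, 7.9, 9.0, 13.8]
summary = mean_lid_by_pos(tags, lids)
```

Comparing such summaries before and after fine-tuning would make the reported shift visible per token type rather than only in the global mean.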

Connection to Parameter-Efficient Fine-Tuning (PEFT)

We appreciate this connection and have explored it preliminarily. The drop in mean local dimension on the fine-tuning set can be detected on LoRA-tuned models as well, in line with our observations in Section 4.1. Interestingly, we did not observe consistent differences in local dimensional dynamics between full and LoRA-based fine-tuning.

However, we agree that adapting the LoRA rank dynamically based on local geometry is a promising idea. We will add a note on this to the discussion section, citing the relevant work by Ed-dib et al. (2024), and marking it as an exciting direction for future research.

Encoder vs. Decoder Architectures:

While much of our analysis focuses on encoder models, we also included decoder-only models (Appendix C, Figures 12 and 13) and observed comparable dimensional drops on the fine-tuning set. We will clarify this in the main text and ensure that the decoder results in Appendix C.1 are better connected to the core narrative.

When comparing the distribution plots of a given dataset between the encoder-only RoBERTa and decoder-only GPT-2 model, one observes stark differences. For instance, the distribution of local estimates on the Reddit dataset in the GPT-2 latent space appears to be bimodal, which is less pronounced in the RoBERTa latent space. Investigating the reasons for these differences is an interesting direction for future research.

Additionally, we are currently working on an extension to even more recent decoder models (e.g., the Phi and LLaMA model family), and we aim to include the respective results in our final manuscript.

Limitations

Causal Limitation Not Explicitly Stated

We will explicitly note this limitation regarding causality between dimensionality and generalization in the revised limitations section.

We thank Reviewer rATb again for the detailed and supportive review.

Comment

Dear reviewer rATb,

This is a reminder to post your first response, as the deadline of the author-reviewer discussion period is approaching. The authors have responded to your review, as well as to the other reviews. Please discuss openly with the authors regarding your review and the questions they have addressed.

Review (Rating: 4)

This paper investigates the internal mechanisms of large language models (LLMs) by analyzing the geometric properties of contextual latent embeddings, focusing on how fine-tuning alters model behavior. Motivated by the lack of unsupervised methods to understand LLM embedding spaces, the authors introduce a framework to measure local intrinsic dimensions using the TwoNN estimator, which captures the low-dimensional manifold structure of token embeddings. Their approach involves sampling token embeddings, computing local neighborhood dimensions, and analyzing shifts during training and fine-tuning. Experiments across dialogue state tracking, emotion recognition, and arithmetic tasks show that changes in local intrinsic dimensions predict training convergence, overfitting, and grokking, with reductions in mean local intrinsic dimensions correlating with performance gains.

Strengths and Weaknesses

Strengths

  1. This paper is easy to follow. Its main contributions are clear.

  2. The geometric perspective is interesting. The use of local intrinsic dimension to analyze LLM dynamics offers a fresh and unsupervised approach to interpret model behavior. The method bridges intrinsic model mechanisms and geometric properties, providing insights beyond traditional supervised diagnostics.

Weaknesses

Although the paper's technical aspects are quite interesting, for a study on understanding language models I believe the most important aspect is to test its effectiveness on modern models and key issues, which is currently lacking in this paper.

  1. The model evaluation is not comprehensive, especially regarding its relevance to recent modern LLM architectures, which remains questionable.

  2. As a rapidly evolving field, LLM evaluation is crucial. The datasets tested in this paper are limited, particularly lacking datasets from the past two years.

  3. The high cost of metric computation limits its practical applicability.

  4. The method's underlying assumption is the TwoNN assumption of locally uniform sampling distributions. The authors do not detail how LLM embedding spaces satisfy it. Are there theoretical justifications for applying TwoNN to LLM latents?
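For context, the TwoNN assumption (Facco et al., 2017) can be stated compactly: if the density is approximately constant out to each point's second nearest neighbor, the distance ratio $\mu_i = r_{2,i}/r_{1,i}$ is Pareto-distributed with shape parameter $d$, independent of the density itself, which yields the maximum-likelihood estimator:

```latex
f(\mu) = d\,\mu^{-d-1}, \quad \mu \ge 1,
\qquad
\hat{d} = \frac{N}{\sum_{i=1}^{N} \ln \mu_i}
```

The open question the reviewer raises is whether this local-uniformity premise holds on the scale of nearest-neighbor distances in LLM embedding spaces.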

Questions

  1. Intuitively, the local dimension might be related to token entropy. I hope the authors can add some discussion on this, especially considering the recent surge in papers relating entropy and LLM reasoning, e.g., [1-2]

[1] Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning.

[2] The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models.

  2. The paper connects drops in local intrinsic dimension to grokking, but the causal relationship is not explained. Why does a decrease in mean LID indicate generalization beyond training data? Similarly, it would be helpful to add more discussion of the other main findings.

Limitations

Yes

Final Justification

After reading the response and other reviewers' comments, I decide to weakly accept this paper.

Formatting Concerns

No.

Author Response

We thank Reviewer PQs3 for their constructive feedback. We especially appreciate the recognition of our paper’s clarity, the novelty of the geometric perspective, and our effort to develop an unsupervised approach to interpreting LLM behavior via local intrinsic dimension (LID). We are encouraged that they found our main contributions clear and the methodological framing on understanding language models interesting. Below, we address each of the reviewer’s concerns in detail.

Addressing the Weaknesses

(1) Evaluation of modern LLM architectures

We would like to address the reviewer’s concern regarding the relevance to modern LLM architectures. We chose RoBERTa and its fine-tuned variants because our focus lies in understanding fine-tuning dynamics in contextual embedding spaces, settings where pretrained models are adapted to downstream tasks. These architectures remain widely used in practice, especially encoder models [a], and in settings where compute or data constraints prohibit larger model training.

To demonstrate the feasibility of our approach on higher-dimensional latent spaces, we compute our local intrinsic dimension on the Phi-3.5-mini-instruct model [b], which has a hidden dimension of 3072.

The following table contains the (Mean / Median / Std) of the local intrinsic dimension for base and fine-tuned (FT) versions of the Phi-3.5-mini-instruct model. This demonstrates that our method provides reasonable estimates for higher-dimensional latent spaces produced from modern LLMs.

| Dataset | Base "Phi-3.5-mini-instruct" | MultiWOZ-FT | Reddit-FT |
|---|---|---|---|
| ICLR 2024 | 12.98 / 12.67 / 4.37 | 13.89 / 13.99 / 5.20 | 13.55 / 13.42 / 4.67 |
| MultiWOZ | 7.48 / 7.91 / 3.77 | 6.91 / 6.82 / 3.64 | 7.78 / 7.74 / 3.94 |
| Reddit | 14.55 / 14.83 / 5.47 | 16.42 / 16.60 / 6.99 | 13.81 / 13.60 / 5.33 |
| SGD | 6.14 / 6.81 / 4.01 | 5.74 / 5.73 / 3.97 | 6.18 / 6.34 / 4.26 |
| Wikipedia | 15.18 / 14.05 / 5.51 | 18.84 / 18.14 / 5.32 | 18.65 / 17.73 / 5.06 |

Moreover, we are currently working on an extension to even more recent models (e.g., Llama), and we aim to include the respective results in our final manuscript.

Our code is available as part of the supplemental material, and our local estimates computation is compatible with models from the HuggingFace transformers library. Thus, running our dimension estimation pipeline on larger models is mainly a matter of having compute available. According to Reviewers 67F6 and rATb, our experiments are comprehensive, and our model selection supports the conclusions drawn in our paper. Users of our method can always supply their own, larger model if applicable.

(2) Dataset selection and recency

We would like to address the reviewer’s concern regarding the selection of our datasets. Our goal was to capture contrasting behaviors in embedding space evolution, so we deliberately selected two complementary types of datasets:

(1) a dataset that was part of the original pretraining corpus (Wikipedia), and (2) a dataset that was released after the pretraining of the model concluded (Reddit) and hence not seen during training. This contrast allows us to probe how local intrinsic dimensions behave in both seen and unseen data regimes.

Our code is released as part of the supplemental material, and our local estimates computation pipeline is compatible with the HuggingFace datasets package. Thus, practitioners can easily extend the evaluation to their datasets. We demonstrate this above with our self-collected ICLR 2024 dataset containing recently released machine learning paper titles and abstracts.

(3) Computational cost of TwoNN estimation

We address the concern about computational cost in the limitations section. While TwoNN involves pairwise distance calculations, we emphasize that our method employs lightweight sampling strategies that drastically reduce computational demands. In practice, we use random token-level sampling, which makes the method feasible for large-scale experiments, as demonstrated across all tasks. Our new results on models from the Phi-family show that our method can be practically applied to modern LLM architectures.
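For concreteness, the TwoNN step at the core of this pipeline can be sketched as follows. This is a minimal numpy illustration of the maximum-likelihood form of the estimator applied within a single token neighborhood; the helper names, the neighborhood size `k`, and the sampling are simplified stand-ins, not the paper's exact settings:

```python
import numpy as np

def twonn_id(X):
    """TwoNN intrinsic dimension estimate (Facco et al., 2017), MLE form.

    For each point, mu = r2 / r1 is the ratio of the distances to its
    second and first nearest neighbors; under the TwoNN model the
    maximum-likelihood estimate of the dimension is N / sum(log mu).
    """
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)   # exclude self-distances
    d.sort(axis=1)
    mu = d[:, 1] / d[:, 0]        # second-NN over first-NN distance
    return len(X) / np.log(mu).sum()

def local_id(embeddings, anchor, k=128):
    """Local estimate: TwoNN restricted to the k nearest neighbors of one token."""
    X = np.asarray(embeddings, dtype=float)
    dists = np.linalg.norm(X - X[anchor], axis=1)
    neighborhood = X[np.argsort(dists)[:k]]  # includes the anchor itself
    return twonn_id(neighborhood)
```

On synthetic data lying on a 2-plane embedded in a higher-dimensional ambient space, both estimators recover a value close to 2, independent of the ambient dimension.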

(4) Validity of the TwoNN assumption in LLM embedding spaces

We interpret the value of the TwoNN estimator as a local dimension. That this interpretation hinges on certain assumptions is explicitly noted in the limitations section of the paper. However, the experimental results and their practical implications do not depend on this interpretation. This is why we view this limitation as relatively unimportant.

Still, we are happy to underpin our particular choice of local dimension estimator with theoretical findings:

  • Locally uniform density of the embedding space: In Appendix B of “The geometry of hidden representations of large transformer models” (Valeriani et al., NeurIPS 2023), the authors demonstrate empirically that hidden states of Transformer models exhibit locally constant density, satisfying this condition.
  • Poisson point process assumption: As noted in “The Geometry of Tokens in Internal Representations of Large Language Models” (Viswanathan et al., arXiv:2501.10573), this condition is usually satisfied in our setup.

We will make these justifications explicit in the updated manuscript and add references and discussion.

Questions

(Q1) On the relationship between local dimension and token entropy

We thank the reviewer for raising this potential connection. Indeed, there may be a relationship between token-level entropy and local intrinsic dimension, and this is a promising direction for future work.

However, the recent studies on entropy and reinforcement learning (RLVR) cited ([1], [2]) focus on reward optimization and exploration behavior in instruction-tuned or RL-finetuned models, scenarios that differ fundamentally from our setting. These papers were also released after our submission deadline. We agree this connection is worth exploring and have added the general setting of RL-tuning of LLMs to our future work section.
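For reference, the token-level entropy the reviewer alludes to is the Shannon entropy of the model's next-token distribution, H(p) = -Σ_i p_i log p_i. A minimal, numerically stable sketch (illustrative only, not tied to any particular model or to our pipeline):

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of each row of next-token logits."""
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # guard against log(0) for numerically zero probabilities
    return -np.sum(np.where(p > 0, p * np.log(np.clip(p, 1e-300, None)), 0.0), axis=-1)
```

A uniform distribution over a vocabulary of size V attains the maximum entropy log V, while a sharply peaked distribution is near 0; relating these per-token values to per-token local dimensions would be the natural starting point for the suggested analysis.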

(Q2) Causal link between LID and generalization/grokking

We agree that a clearer discussion of why reduced LID corresponds to generalization would improve the paper. Our working hypothesis is that high-dimensional neighborhoods indicate overparameterized, poorly localized behavior, while very low LID reflects overcompression or memorization. There exists a “sweet spot” in LID, where the representation captures the essential degrees of freedom required for solving the task. Our experiments show that this intermediate range correlates with peak generalization. We will make this intuition more explicit, but one should note that investigating causality is beyond the scope of this paper, which is also acknowledged by Reviewer rATb.

Conclusion

We believe that we have adequately addressed the concerns regarding evaluation scope, dataset choices, computational feasibility, theoretical grounding, and interpretation of our results. We are committed to incorporating all necessary clarifications and extensions in the final revision and appreciate the reviewer’s suggestions.

If you find that these responses and planned improvements sufficiently address your concerns, we would be very grateful if you would consider raising your score.

[a] Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (arXiv:2412.13663)

[b] Phi-3 Technical Report (arXiv:2404.14219)

Comment

Thanks for your detailed response. My major concerns have been addressed. I have raised my score to support it.

Comment

We sincerely appreciate your updated score and your engagement with the paper!

If of interest, we have also included results for additional larger-scale models (3B and 8B) in our response to Reviewer 67F6.

Review
5

This paper looks at model fine-tuning, overfitting, and grokking from a geometric perspective. By computing local intrinsic dimensionality estimates, the authors show that fine-tuning reduces dimensionality specifically for the dataset on which the model is fine-tuned. They also show that grokking behaviour is related to reductions in local dimensionality, and that the point at which reductions saturate indicates when training on a given dataset is exhausted.

Strengths and Weaknesses

This paper presents an interesting perspective on an important topic. Unlike a lot of interpretability work, which uses behavioural or probing evidence, the work here links the geometry of model embedding spaces to model behaviours. Overall the paper is clearly written, and the formalisations are easy to follow. The paper has clear scientific merit and novelty.

With this being said, there are a few weaknesses. First, it's worth noting that the results broadly include no statistical testing. For example, with the results in Figure 2, the body text states "the training mean local dimension starts dropping significantly for those runs that exhibit grokking. Observe that the timing of this drop coincides with the start of the increasing validation accuracy in Figure 2b." Directly correlating local dimension with performance would make this claim more precise and attach a measure of significance to it. Similarly, the first set of results asserts a significant difference between local dimension for the three different datasets, and the last set of results relates local dimension over time to overfitting behaviours. These are all places where support from statistical testing would be welcome.

Second, the results in the main paper are largely on RoBERTa, which is comparatively quite a small model by today's standards. The grokking results are also on a two-layer, 128-dimensional transformer. Given this, and the computational complexity of the methods used, it's unclear to me how generalisable the methods here are. Despite this, the introduction stresses the importance of understanding large language models and cites work introducing models with billions of parameters. I think the paper would benefit from claims more focussed on the scope of the results.

Questions

As a formatting point, the visuals are currently quite small. In future versions I would appreciate if they could be made larger and floated to the top of the page.

Limitations

Yes

Final Justification

Overall the paper is well written and of clear scientific merit. The analysis is currently quite computationally expensive, and largely applied to smaller models. The authors acknowledge this and so I think it is worth seeing discussed at the conference.

Formatting Issues

None

Author Response

We sincerely thank Reviewer 67F6 for their thoughtful and constructive feedback. We are especially grateful for the positive assessment of the paper’s clarity, novelty, and scientific merit, as well as for highlighting the value of linking the geometry of embedding spaces to model behavior.

Statistical testing

Regarding the point about statistical testing, we emphasize that the results presented in the grokking, training dynamics, and overfitting sections (4.2, 4.3, 4.4) are averages over multiple training runs, accompanied by their respective standard deviations. The relatively small magnitude of these deviations compared to the observed changes supports our characterization of differences as “significant”. However, we agree that formal statistical tests would underline our claims. In the revised version of the paper, we will include appropriate statistical analyses to complement our main findings, particularly in the sections concerning grokking behavior, training, and overfitting dynamics.
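As one concrete form such a test could take, the correlation between the mean-LID trajectory and validation accuracy can be paired with a permutation p-value. The sketch below uses synthetic trajectories as stand-ins for actual training runs; the `perm_corr_test` helper and its parameters are illustrative, not part of our pipeline:

```python
import numpy as np

def perm_corr_test(x, y, n_perm=2000, seed=0):
    """Pearson correlation with a two-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]
    # null distribution: correlation after shuffling one series
    null = np.array([np.corrcoef(rng.permutation(x), y)[0, 1] for _ in range(n_perm)])
    p = (np.abs(null) >= abs(r)).mean()
    return r, p

# synthetic stand-ins: mean LID drifts down while validation accuracy rises
rng = np.random.default_rng(1)
steps = np.arange(50)
mean_lid = 10.0 - 0.1 * steps + rng.normal(0, 0.2, steps.size)
val_acc = 0.5 + 0.008 * steps + rng.normal(0, 0.02, steps.size)
r, p = perm_corr_test(mean_lid, val_acc)  # strongly negative r, small p
```

A permutation test of this kind makes no distributional assumptions, which fits our setting, where LID trajectories need not be Gaussian across runs.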

Dimension measures for larger language models

While our experiments focus on RoBERTa, GPT-2 and a small transformer used for grokking studies, the geometric perspective and methodology we propose are broadly applicable across architectures.

To support this claim, we are currently running additional experiments on more recent models. Preliminary results on Phi-3.5-mini-instruct [b], a decoder model with 3.82B parameters and a hidden dimension of 3072, can be found in the table below. These results demonstrate that the TwoNN estimator, applied locally within our subsampling method, yields meaningful estimates even for latent spaces whose ambient dimension numbers in the thousands, a typical setting for modern LLM architectures. Note that, for example, models in the 70B-parameter Llama family, such as "meta-llama/Llama-3.1-70B-Instruct", have a hidden dimension of 8192, which is within reach of our methods.

In the case of the Phi-3.5-mini-instruct model, we implement the fine-tuning via LoRA with rank constraint r=16 on a portion of the MultiWOZ and Reddit training datasets, with other hyperparameters as described in the paper and taking a model checkpoint after 800 batches. During this fine-tuning, we observe the shifts in mean local intrinsic dimension on a subsample of the validation set of various datasets. In addition to the datasets in the paper, we present results for a self-collected dataset of ICLR 2024 titles and abstracts, for the sake of including a more recent dataset.

The following table contains the (Mean / Median / Std) of the local intrinsic dimension for base and fine-tuned (FT) versions of the Phi-3.5-mini-instruct model.

| Dataset | Base "Phi-3.5-mini-instruct" | MultiWOZ-FT | Reddit-FT |
|---|---|---|---|
| ICLR 2024 | 12.98 / 12.67 / 4.37 | 13.89 / 13.99 / 5.20 | 13.55 / 13.42 / 4.67 |
| MultiWOZ | 7.48 / 7.91 / 3.77 | 6.91 / 6.82 / 3.64 | 7.78 / 7.74 / 3.94 |
| Reddit | 14.55 / 14.83 / 5.47 | 16.42 / 16.60 / 6.99 | 13.81 / 13.60 / 5.33 |
| SGD | 6.14 / 6.81 / 4.01 | 5.74 / 5.73 / 3.97 | 6.18 / 6.34 / 4.26 |
| Wikipedia | 15.18 / 14.05 / 5.51 | 18.84 / 18.14 / 5.32 | 18.65 / 17.73 / 5.06 |

Note that our qualitative observations on the behaviour of the autoregressive model’s latent spaces under fine-tuning hold for this larger model as well: The dimension drops on the datasets that are related to the fine-tuning data (e.g., the dialogue datasets MultiWOZ and SGD become lower-dimensional in the latent space when tuning the model on MultiWOZ). At the same time, the dimension of unrelated datasets can be seen to increase or stay unchanged, similar to what happens when fine-tuning the notably smaller GPT-2 model.

These results and the discussion on larger models will be added in the camera-ready version of our paper. We are currently extending our analysis to more recent models (e.g., from the Llama family), and intend to include these results in the final version of our manuscript.

Our choice of RoBERTa was motivated by its continued use in research and applications where encoder-based models are needed [a] and autoregressive alternatives are less suitable, such as in information retrieval, which is an important component in Retrieval Augmented Generation. We will clarify this rationale and better situate the scope of our contributions in the revised manuscript.

Layout

Finally, we thank the Reviewer for the suggestion regarding figure layout and formatting. We will enlarge the visuals and adjust their placement to improve readability in the revised manuscript.

Once again, we are grateful for the constructive feedback and the encouraging evaluation of our work.

[a] Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (arXiv:2412.13663)

[b] Phi-3 Technical Report (arXiv:2404.14219)

Comment

Thank you for the thorough response to my review. I appreciate the inclusion of further statistical testing and the Phi-3.5-mini-instruct results (I'm slightly concerned about promising to include camera-ready results for experiments that have not yet concluded, e.g., the Llama models). A clear correlation between representational structure and behaviour would be welcome.

Comment

Thank you once again for your encouraging follow-up comment!

Our results (Mean / Median / Std) for two sizes (3B and 8B parameters) of the Llama models can be found in the tables below. The fine-tuning parameters match those described in our previous response (LoRA with rank constraint r=16, on a portion of the MultiWOZ and Reddit training datasets, after 800 batches).

To clarify, while we do not plan to include these additional results in the main body of the camera-ready version, we believe that including them in the appendix will be a valuable addition. These results further support the applicability of our local dimension estimation pipeline to modern LLM architectures and larger model sizes.

Llama-3.2-3B and fine-tuned variants

| Dataset | Base "Llama-3.2-3B" | MultiWOZ-FT | Reddit-FT |
|---|---|---|---|
| ICLR | 11.99 / 11.57 / 4.05 | 12.80 / 12.48 / 4.48 | 12.11 / 11.74 / 4.17 |
| MultiWOZ | 8.14 / 7.89 / 2.08 | 6.60 / 6.15 / 2.21 | 7.61 / 7.27 / 2.31 |
| Reddit | 14.51 / 14.58 / 5.54 | 15.93 / 16.20 / 6.46 | 13.16 / 12.87 / 5.32 |
| SGD | 7.47 / 7.25 / 1.78 | 5.92 / 5.58 / 1.83 | 6.67 / 6.36 / 1.91 |
| Wikipedia | 17.20 / 16.13 / 5.16 | 18.87 / 17.97 / 5.18 | 17.70 / 16.56 / 5.18 |

Llama-3.1-8B and fine-tuned variants

| Dataset | Base "Llama-3.1-8B" | MultiWOZ-FT | Reddit-FT |
|---|---|---|---|
| ICLR | 11.86 / 11.30 / 4.11 | 12.50 / 12.03 / 4.28 | 12.01 / 11.46 / 4.23 |
| MultiWOZ | 8.26 / 8.03 / 2.13 | 6.56 / 6.18 / 2.09 | 7.67 / 7.36 / 2.31 |
| Reddit | 14.08 / 14.24 / 5.53 | 14.97 / 14.98 / 6.14 | 12.62 / 12.45 / 5.15 |
| SGD | 7.49 / 7.32 / 1.77 | 5.82 / 5.47 / 1.77 | 6.75 / 6.43 / 2.00 |
| Wikipedia | 17.45 / 16.87 / 4.28 | 17.82 / 16.93 / 4.61 | 17.87 / 17.22 / 4.36 |
Comment

Dear reviewers,

This is a reminder that the end of author-reviewer discussion period is near. Please do carefully read all other reviews and the author responses; and discuss openly with the authors, especially on your own questions that the authors addressed.

Thank you.

Final Decision

The paper investigates the internal mechanisms of LLMs by analyzing the geometry of contextual latent embeddings via local intrinsic dimension (LID). The authors introduce a framework to measure LID using the TwoNN estimator, which approximates the token embeddings’ intrinsic dimension, and further analyze shifts during training. Experiments across dialogue state tracking, emotion recognition, and arithmetic tasks show that drops in mean LID accompany improved generalization and characterize: training exhaustion, overfitting, and grokking, respectively.

The paper offers a clear, generally applicable geometric signal that is unsupervised and tracks important training phenomena. The rebuttal materially improved the case: the authors provided support for the generality of their claims, presented new results on larger and more recent LLMs, and clarified dataset/model choices and the theoretical assumptions behind TwoNN. The authors also committed to adding statistical tests and will further revise the paper for improved clarity.

With expanded results and the promised analyses, I believe that the contribution is solid and of interest to the communities.

Strengths

  • The bridging of intrinsic model mechanisms and geometric properties of latent states is interesting.
  • The work has clear scientific merit and novelty.
  • The paper is clearly written and the formalisations are easy to follow.

Weaknesses

  • The main weakness pointed out by reviewers, whether the claims will generalize, has mostly been addressed in the rebuttal.
  • The prediction of grokking events is seen as an over-strong framing; however, the authors will revise the paper to focus on the interpretability advantage.