Measuring In-Context Computation Complexity via Hidden State Prediction
Hidden state unpredictability in sequence models is a meaningful measure for in-context reasoning complexity
Abstract
Reviews and Discussion
The paper introduces the Prediction of Hidden States (PHi) loss to measure the complexity of computation in neural sequence models. The authors argue that traditional next-token prediction loss does not adequately capture task complexity. To address this, they propose evaluating the model's ability to predict its own future hidden states. The PHi layer is introduced to measure the unpredictability of hidden states, which correlates with task complexity. The method is tested across different tasks and architectures, demonstrating that PHi loss distinguishes complex from simple computations.
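To make the mechanism concrete, the following is a minimal sketch of how a PHi-style layer could compute its per-token loss, assuming a diagonal-Gaussian encoder over hidden states and an autoregressive prior over past latents. This is an illustration under stated assumptions, not the authors' implementation: all module and variable names are made up, and a GRU stands in for the causal self-attention prior described in the discussion below.

```python
import torch
import torch.nn as nn

class PHiLayerSketch(nn.Module):
    """Illustrative sketch of a PHi-style layer (not the paper's exact code).

    Encoder q(z_t | h_t): diagonal Gaussian over latents from the current hidden state.
    Prior   p(z_t | z_<t): diagonal Gaussian predicted autoregressively from past latents.
    PHi loss at step t = KL(q || p), i.e. the information in z_t not predicted from z_<t.
    """
    def __init__(self, d_hidden: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_hidden, 2 * d_latent)           # -> (mu_q, logvar_q)
        self.prior = nn.GRU(d_latent, d_latent, batch_first=True)  # stand-in for causal self-attention
        self.prior_head = nn.Linear(d_latent, 2 * d_latent)        # -> (mu_p, logvar_p)
        self.decoder = nn.Linear(d_latent, d_hidden)                # reconstructed hidden state

    def forward(self, h):                        # h: (batch, seq, d_hidden)
        mu_q, logvar_q = self.encoder(h).chunk(2, dim=-1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterized sample

        # The prior only sees *previous* latents: shift z right by one step.
        z_prev = torch.cat([torch.zeros_like(z[:, :1]), z[:, :-1]], dim=1)
        mu_p, logvar_p = self.prior_head(self.prior(z_prev)[0]).chunk(2, dim=-1)

        # Closed-form KL between the two diagonal Gaussians, summed over latent dimensions.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1)
        phi_loss = kl.sum(-1)                    # (batch, seq): per-token PHi loss

        h_rec = self.decoder(z)                  # passed on to the upper layers
        return h_rec, phi_loss
```

In end-to-end training (as in Section 3.1 of the paper), this per-token KL term would be added to the usual next-token NLL; for a frozen pre-trained LLM, only the PHi layer's parameters would be trained.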
Questions for Authors
- How sensitive is PHi loss to model size and dataset scale?
- Is it possible to combine the PHi loss with other model analysis tools to pinpoint where in-context learning occurs?
Claims and Evidence
- Next-token prediction loss is an unreliable indicator of computational complexity.
- Evidence: Tasks involving random token sequences yield high next-token prediction loss but involve no meaningful computation.
- Hidden-state predictability (PHi loss) is a better metric for measuring in-context computation complexity.
- Evidence: The authors show that PHi loss increases for complex reasoning tasks, in-context language learning, and step-by-step mathematical reasoning.
- PHi loss correlates with the description length of formal languages learned in-context.
- Evidence: The study demonstrates that the PHi loss reflects the complexity of probabilistic finite automata (PFA)-based tasks.
Methods and Evaluation Criteria
Theoretical Claims
No theoretical claim was made in this work.
Experimental Design and Analysis
- The PHi layer is inserted into different models for evaluation:
- Models trained from scratch: Transformers and LSTMs
- Pre-trained LLMs: a frozen Llama 3B model
Supplementary Material
No supplementary material.
Relation to Broader Scientific Literature
No direct relation to broader scientific literature.
Essential References Not Discussed
The paper extensively cites prior work in in-context learning, VAEs, information bottlenecks, and neural sequence models.
Other Strengths and Weaknesses
Strengths
- PHi loss gives a new way to evaluate task complexity.
- The method is tested on multiple architectures and datasets.
Weaknesses
- The PHi layer's effectiveness depends on where it is inserted in the model.
- Although promising for small models, computing the PHi loss requires adding a PHi layer and training the model. This could make it difficult to apply to larger-scale LLMs.
Other Comments or Suggestions
- Testing PHi loss on encoder-only models (e.g., BERT) and encoder-decoder models (e.g., T5) could provide further insights.
Document with additional figures: https://tinyurl.com/yp4ucedn
1. Location of the PHi Layer in the Model
The PHi layer's effectiveness depends on where it is inserted in the model.
We want to emphasize that in a fully trained model—such as the transformer in Section 3.1—the results are very robust across different placements of the PHi layer. This is supported by new experiments (Figure 5 in the linked document).
With regard to pre-trained LLMs, the reviewer is correct that the effectiveness of the PHi layer in measuring in-context computation complexity depends on roughly where in the model it is placed. Based on known findings from LLM interpretability research and other empirical considerations, it is not surprising that PHi layers placed about two-thirds of the way through the model tend to perform best. Below we detail our reasoning for choosing to place the PHi layer after layers 18, 20, 22, and 24 of the Llama 3B model in the subsequent experiments.
- https://arxiv.org/abs/2407.09298 shows that the early and final layers in an LLM serve distinct purposes, making them unsuitable for measuring in-context computation.
- https://arxiv.org/abs/2410.10912 demonstrates that early layers in LLMs are more brittle and harder to prune—making them more vulnerable to the noise introduced by the PHi layer’s information bottleneck.
- PHi layers placed early in the model show clear signs of posterior collapse and thus give no clear results.
- We selected positions based on where the variance of the PHi loss across tasks is highest—not where interesting tasks produce the largest losses (although these two happen to coincide).
These choices are independent of any specific findings about which tasks are associated with high or low PHi loss. We therefore do not believe our selection introduces bias in the experimental outcomes. We will elaborate on this reasoning in the updated paper. A practical benefit of later PHi layer placement is that it reduces computational cost, since we backpropagate through fewer layers of the pre-trained LLM.
2. Further Points
Although promising for small models, computing the PHi loss requires adding a PHi layer and training the model. This could make it difficult to apply to larger-scale LLMs.
In pre-trained LLMs, only the PHi layer is trained—this requires very modest compute. Training the PHi layer for 10,000 steps in the Llama 3B model, as done for all experiments in Section 3.2, takes between 2.5 and 6 hours on a single consumer-grade GPU (NVIDIA RTX 3090), depending on the PHi layer’s position. We see no reason this approach couldn't scale to larger models.
Testing PHi loss on encoder-only models (e.g., BERT) and encoder-decoder models (e.g., T5) could provide further insights.
We agree—measuring in-context computation complexity in these models would be interesting. However, it likely requires conceptual changes to the PHi layer. In decoder-only models, we have a clear compression scheme for the token and latent sequences. In encoder-only models, hidden states can incorporate information from both past and future tokens, complicating the setup.
How sensitive is PHi loss to model size and dataset scale?
There is a dependency between PHi loss and model size: the KL divergence between two diagonal Gaussians is a sum over dimensions, so the PHi loss scales with the dimensionality of the hidden states. For this reason, PHi loss (in its current form) should only be used to compare tasks within the same model, not across different models. We never claim to compare models in the paper. We also do not interpret the absolute magnitude of the PHi loss. Addressing this limitation is a goal for future work—for example, through the use of quantized latent states.
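To see why the loss scales with hidden-state dimensionality, note that the closed-form KL between two diagonal Gaussians is a sum of per-dimension terms, so for otherwise comparable distributions a larger latent size yields a proportionally larger loss. A small illustrative snippet (shapes and values are hypothetical):

```python
import torch

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ), summed over dims."""
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1).sum(-1)

torch.manual_seed(0)
for dim in (256, 512, 1024):                      # hypothetical latent sizes
    mu_q = 0.1 * torch.randn(dim)                 # posterior slightly offset from a unit prior
    kl = kl_diag_gaussians(mu_q, torch.zeros(dim), torch.zeros(dim), torch.zeros(dim))
    print(dim, float(kl))                         # grows roughly linearly with dim
```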
As for the PHi layer’s capacity and training data needs, our method is flexible. In the Llama 3B experiments, we trained the PHi layer on roughly 40 million tokens from a standard natural language dataset. Figure 7 in the linked document shows the training curves. The PHi layer's architecture is designed to match the scale of a single LLM layer. If the reviewer finds it helpful, we are happy to provide additional experiments comparing different PHi layer capacities and training data volumes in a future reply.
Is it possible to combine the PHi loss with other model analysis tools to pinpoint where in-context learning occurs?
Where in the sequence or dataset? Although we focus on the aggregated PHi loss in the paper, the method supports token-wise PHi loss measurement. This allows us to pinpoint where in a sequence non-trivial in-context computation occurs (e.g., see Figure 13 in the Appendix).
Where in the model? Figures 6&7 in the paper offer some preliminary insights into where interesting computation may be happening within a pre-trained LLM. However, further research is needed to confirm and extend these findings.
This paper proposes a novel method for probing the hidden representations of neural sequence models: the prediction of hidden states (PHi) layer. This layer combines an encoder that generates latent variables from the hidden states and an autoregressive LM that generates latent variables from previous latent variables. The corresponding parameters are trained to minimise the KL divergence between the encoder and the autoregressive LM, together with the standard autoregressive objective on the output of the whole model. The authors test their new measure on several tasks, including both cases where the neural network is trained together with the PHi layer from scratch and cases where the PHi layer is inserted into a pre-trained LLM. The tests show how the PHi loss correlates with the computational complexity of the task while the standard next-token prediction loss does not.
Update after rebuttal
The rebuttal cleared the confusion expressed in the "Other Strengths and Weaknesses" section, and I have increased the score accordingly.
Questions for Authors
I have no further questions. My 'weak accept' evaluation stems from the confusion expressed in the "strengths and weaknesses" section: if cleared, I would raise the evaluation. --- updated after response.
Claims and Evidence
It is far from obvious why the proposed procedure should measure "a model's ability to predict its own hidden states", as stated in the abstract. Nevertheless, experimental results agree with the trends claimed by the authors at the end of section 1. The statistical significance of these trends remains open for debate, especially for experiments with LLMs (figures 6 and 7).
Methods and Evaluation Criteria
Yes.
Theoretical Claims
There are no theoretical claims.
Experimental Design and Analysis
Experiments seem sound although there are aspects of the PHi layer design that I do not understand:
1: is the autoregressive part of the PHi layer used to compute the output, or does it just enter to train the encoder?
2: is eq. (4) implying that the first part of the networks does not receive gradients from the negative log-likelihood loss?
3: how is the encoder incentivised to have the latents depend on the previous latents, given that the autoregressive component of PHi is also initialised randomly and trained from scratch? (see also Strengths and Weaknesses section)
Supplementary Material
Only part of Appendix A.
Relation to Broader Scientific Literature
This paper could have a significant impact on the literature devoted to understanding the operations performed by LLMs, including mechanistic interpretability and other studies of hidden representations.
Essential References Not Discussed
I am not aware of any.
Other Strengths and Weaknesses
The problem considered by the paper is timely and relevant and the solution based on the idea of measuring "how well the model can predict its own future hidden states" is both interesting and original. However, I cannot completely follow the rationale behind the design of the PHi layer. I would understand the comment after eq. (5), "the model is incentivized to maintain ...", if the prior were some fixed distribution. However, the fact that the prior is trained from scratch via Eq. (2) confuses me, as the loss only forces it to agree with the encoder distribution and not with the distribution of post-activations from the training data. For instance, why not train the prior with a standard next-token prediction objective but in the latent (instead of token) space?
Other Comments or Suggestions
Typo in section 2.3, third row ("we need to ensure that ensures").
Typo in the caption of Figure 5 ("because of a posterior collapse, most likely due to posterior collapse").
Document with additional figures: https://tinyurl.com/yp4ucedn
1. Questions about the PHi Layer
1: is the autoregressive part of the PHi layer used to compute the output, or does it just enter to train the encoder?
The purpose of the autoregressive part (i.e., the causal self-attention layer) is to make the prior more powerful. It plays no role in computing the next token prediction of the model. More details below.
2: is eq. (4) implying that the first part of the networks does not receive gradients from the negative log-likelihood loss?
No—the gradients flow back through the reconstructed hidden state, to the encoder $q_\psi$, the original hidden states, and to the bottom layers $B_\beta$ of the model. No gradients are blocked in the architecture. This ensures that when trained end-to-end, as in Section 3.1, the model is encouraged not only to predict its own next hidden states, but also to make those next hidden states more predictable.
3: how is the encoder incentivised to have the latents depend on the previous latents, given that the autoregressive component of PHi is also initialised randomly and trained from scratch? (see also Strengths and Weaknesses section)
The latents are sampled from the posterior $q_\psi$. They have to contain the information necessary to predict the next token, otherwise the NLL loss increases. The prior $p_\chi$ has to predict the posterior distribution but does not have access to the most recent input token $x_t$. However, we make it autoregressive and give it access to the previous latents, since their information content has already been accounted for by previous PHi losses.
In short: the autoregressive prior is trained to predict the posterior. The posterior is trained to allow accurate next token prediction from the reconstructed hidden state, while staying as close as possible to what the prior predicts.
2. Two Simpler Approaches and Why They Do Not Work
...if the prior were some fixed distribution.
...why not train the prior with a standard next-token prediction objective but in the latent (instead of token) space?
The reviewer raises two interesting questions which we also explored during the development of our method:
- Can we simply use an information bottleneck with a fixed, uninformative prior and no hidden state prediction (similar to a vanilla VAE, no autoregressive prior)?
- Can we use a simpler next hidden state prediction objective? Since the latent or hidden state is continuous, it cannot be the usual categorical cross-entropy, but maybe mean squared error (MSE)?
Each of these simpler approaches has shortcomings that our method avoids: If we use a fixed prior in a pre-trained LLM, it tends to overestimate the information in the hidden state sequence. For example, consider a sequence where the hidden states all lie in the same small subspace. With a fixed prior, we would have to pay the same large price in terms of KL-divergence for every single one of the hidden states. An autoregressive prior, in contrast, quickly adapts to the subspace, reducing the KL for subsequent steps. Our new results include an ablation showing that an information bottleneck without hidden state prediction gives no meaningful results in practice (see Figure 6 in the linked document).
As for using a straightforward next hidden state prediction loss such as MSE: this encourages the model to scale down the norm of the hidden states to near-zero, potentially leading to machine precision issues. Even if we fix the norm, the model could encode information in tiny perturbations from a default vector, which lead to low MSE but can carry arbitrary information. Without a noisy information bottleneck, the approach lacks a clear information-theoretic interpretation.
Our PHi layer addresses both issues:
- The autoregressive prior ensures we pay only for genuinely new information not already contained in previous tokens.
- The noisy information bottleneck prevents hidden state collapse and provides a clear, principled measure of the information in the latent states.
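As a toy illustration of the norm-shrinking failure mode described above (purely hypothetical numbers): scaling the hidden states by a small constant drives the MSE toward zero even though the information is fully recoverable by rescaling, which is why a plain MSE objective has no information-theoretic interpretation.

```python
import torch

torch.manual_seed(0)
h = torch.randn(1000, 512)             # hypothetical hidden states
pred = torch.zeros_like(h)             # a trivial "predictor" that always outputs zero

for scale in (1.0, 1e-2, 1e-4):
    mse = ((scale * h - pred) ** 2).mean()
    # The scaled states still encode h exactly (divide by `scale` to recover it),
    # yet the prediction error collapses with the scale.
    print(f"scale={scale:g}  mse={mse:.3e}")
```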
3. Statistical Significance of the Results
We emphasize that we report 95% confidence intervals or p-values for all experiments, and find all reported results to be statistically highly significant. See Figure 2 in the linked document for additional confidence intervals for the partial correlation in the MATH rationale experiment. These were missing in the submitted paper. For a discussion of how we selected the PHi layer location, please refer to our response to reviewer Uw5E, paragraph 1.
If there are any remaining questions, we are very happy to answer them in a future response.
Thanks for the clarification: the response cleared my confusion, thus I raised my mark to 4. In particular, it could be useful to include part of the response to 1. in the revised manuscript, e.g. in the discussion immediately after Eq. (5).
Thank you for the suggestions, we will incorporate them into the updated paper. We would also like to take this opportunity to thank all reviewers for their comments and for reconsidering their initial scores in light of our responses. The reviews have led to valuable clarifications in the updated version of the paper.
I think this paper introduces the PHi (Prediction of Hidden States) layer as a novel way to measure the complexity of computation performed by neural sequence models by examining how predictable their hidden states are. The authors show that this metric correlates better with intuitively "interesting" computation than next-token prediction loss.
Questions for Authors
- How sensitive are the results to the specific architectural choices in the PHi layer?
- Is there a principled way to determine optimal layer placement in pre-trained models?
Claims and Evidence
The evidence provided through experiments on both smaller models and LLMs is adequate but not completely convincing, as the connection to formal notions of complexity remains mostly theoretical.
Methods and Evaluation Criteria
I think the proposed PHi layer is an elegant approach, but its placement in pre-trained models seems somewhat arbitrary, raising questions about the robustness of the method.
Theoretical Claims
n/a
Experimental Design and Analysis
I think the experiments cover a good range of scenarios, but the evaluation of "interestingness" remains subjective, and the baseline comparisons are limited.
Supplementary Material
Yes, I checked.
Relation to Broader Scientific Literature
The paper makes interesting connections to information theory and mechanistic interpretability, but could better situate itself within the growing body of work on understanding LLM computation.
Essential References Not Discussed
n/a
Other Strengths and Weaknesses
I think the proposed metric provides a novel lens for quantifying when models perform meaningful computation. The correlations between PHi loss and task complexity across different domains are intriguing. The application to correctness prediction in self-generated reasoning chains demonstrates practical utility.
However, the paper lacks rigorous justification for where to place the PHi layer in pre-trained models, with results varying significantly based on layer position. The evaluation relies heavily on intuitive notions of complexity rather than formal definitions. The method seems sensitive to hyperparameters and architectural choices, raising questions about generalization.
The major concern is that it is only tested on small models. Can we test this on Llama 7B/8B or larger models?
Other Comments or Suggestions
n/a
Document with additional figures: https://tinyurl.com/yp4ucedn
1. Robustness towards Hyperparameters and PHi Layer Placement
...the paper lacks rigorous justification for where to place the PHi layer in pre-trained models
While it is true that the properties of the PHi layer vary depending on its placement in a pre-trained LLM, our choices are not arbitrary. In our response to Uw5E, paragraph 1, we give several principled reasons—independent of any specific results on interestingness or computational complexity—why the chosen positions are sensible for the PHi layer. We highlight that we selected layers that maximize the variance of PHi loss across different tasks. A more detailed account of these considerations will be included in the updated paper.
How sensitive are the results to the specific architectural choices in the PHi layer?
In general, our method is quite robust to the choice of hyperparameters and the exact architecture of the PHi layer components. For example, the sizes of the MLPs and the attention module are chosen simply to align with the rest of the model architecture. We show in the paper that our method works with transformers, LSTMs, and pre-trained LLMs.
We did not perform extensive hyperparameter tuning, but selected reasonable and stable settings and used them consistently. In response to this review, we tested varying the weight of the PHi loss (relative to the NLL loss) from 1 to 10, and found that higher weights actually perform better than the one reported in the paper; see Figure 5 in the linked document. If the reviewer is interested, we would be happy to provide a more detailed robustness analysis for specific hyperparameters in a future response.
2. Subjectivity of Interestingness
The evaluation relies heavily on intuitive notions of complexity rather than formal definitions.
The reviewer raises a valid point: Notions of complexity and interestingness are inherently difficult to formalize. Many definitions—such as Kolmogorov Complexity and Sophistication—are non-computable and cannot be directly used in empirical studies. Some appeal to intuition is therefore difficult to avoid.
However, we have attempted to ground our evaluation in formal and objective measures wherever possible. For the PFA-based tasks, we use the description length of the automaton, as described in Section 3.1, to quantify complexity. For the non-ICLL tasks, we can also give a concrete description length: the complexity of memorized sequences is $\log_2 N$, where $N$ is the number of memorized sequences; for memorized programs, it is $\log_2 M$, where $M$ is the number of memorized programs. For random sequences, the complexity is 0, as no inference is possible.
We agree that the natural language tasks in Section 3.2.1 are categorized in a less formal way, but believe our reasoning is principled and goes beyond intuition. In the MATH dataset experiments, we rely on dataset-provided difficulty levels, which range from simple to difficult. In this context, we take “difficult” to be synonymous with “complex.”
We also want to emphasize that we never claim to measure the absolute or objective interestingness of a sequence—only the relative differences between sequences or tasks for a given model.
We will further clarify these points in the updated paper.
3. Misunderstanding about Experiments with Large Models
The major concern is that it is only tested on small model, can we test this on a BERT/GPT2 or even bigger model such as Llama1B etc,
The paper includes extensive experiments with large models—specifically, the Llama 3.2 3B model with 3 billion parameters, which is significantly larger than BERT, GPT-2, or Llama 1B. Please see Sections 3.2.1, 3.2.2, and 3.2.3.
Our method is also fairly scalable. Training the PHi layer in the pre-trained LLaMA 3B model takes approximately 3 hours on a single consumer-grade GPU (NVIDIA RTX 3090). Upon acceptance, we will release our code, which requires only minor modifications to work with other open-source LLMs.
4. Additional Points
...baseline comparisons are limited.
We are not aware of other methods that directly measure the complexity of in-context computation in sequence models. Does the reviewer have any specific baselines in mind? The two alternative approaches discussed in our response to rCDj, paragraph 2, may be of interest.
The application to correctness prediction in self-generated reasoning chains demonstrates practical utility.
We are glad the reviewer appreciates this. The updated paper will include an expanded experiment that confirms and extends this result on the MATH dataset (see response to 6ZwR, paragraph 1, and Figures 3&4 in the linked document).
...could better situate itself within the growing body of work on understanding LLM computation
We have aimed to highlight the major relevant work and would be grateful for any suggestions of related literature we may have missed.
The paper proposes a "prediction of hidden states" (PHi) layer, which can be used to quantify the complexity of the computation being performed in a neural model. The layer is inserted between the layers of a sequence model such as a Transformer and maps the activations to latent variables. It then computes the KL between the encoding of the next symbol and a learned "prior" that attempts to predict this encoding given the encodings of previous symbols. This KL divergence therefore quantifies the new information in the next symbol that is not predicted by the prior.
The experiments show that this KL divergence metric, termed the "PHi loss", correlates with various other measures of interest to a greater degree than the overall model's negative log likelihood (NLL) of the next token. For example, while the NLL is high for predicting random strings, "PHi loss" is relatively low. Intuitively, this is because the "PHi loss" is computed over the latent representations, which are implicitly encouraged to only contain information relevant towards predicting future symbols. Random sequences have high NLL because they have high conditional entropy. Experiments show that "PHi loss" correlates with the complexity of in-context learning tasks. Finally, experiments show that "PHi loss" over chain of thoughts correlates with accuracy of the final answer for GSM-8k, suggesting this is a desirable property for answer rationales.
Update after rebuttal
Thank you for your response and clarifying the relationship between the information that the posterior and prior condition on. I would recommend seeking to clarify the presentation and intuition around this if possible, in the revised version. I also appreciate further exploring the application of the method to select for rationales leading to correct answers. I will raise my score from 2 to 3.
Questions for Authors
See "Weaknesses" above.
Claims and Evidence
TBD (see questions)
Methods and Evaluation Criteria
Yes
Theoretical Claims
No theoretical claims.
Experimental Design and Analysis
The experimental settings and analysis seemed reasonable.
Supplementary Material
No
Relation to Broader Scientific Literature
To the best of my knowledge, the proposed metric is novel. The paper discusses relevant prior work in Section 4.
Essential References Not Discussed
Not that I am aware of
Other Strengths and Weaknesses
Strengths
- The proposed measure offers a new perspective on quantifying the complexity of the behavior of a neural model such as a Transformer.
- The proposed measure correlates with description length of tasks in an intuitive way.
- Perhaps most intriguingly, the proposed measure when applied to rationales for mathematical reasoning appears to correlate with answer accuracy.
Weaknesses
- I may be confused, but I was unclear regarding some aspects of the learned prior. Presumably, if the model used to represent the learned prior was sufficiently expressive, would it be possible for "PHi loss" to go to zero during training? Even at inference time, could the prior have equivalent complexity to the overall model, and therefore we would expect low "PHi loss" even on complex tasks, since both the overall model predicting the next token and the prior predicting the next hidden state can capture all of the same information (up to the limit of the mutual information between previous tokens and the next token)? In other words, in theory, does the method implicitly require some capacity constraint on the prior in order to be non-vacuous?
- I worry that some of the results may be contingent on the specific decisions and hyperparameters used for implementing the encoder, prior, etc. As there is no "train and test split" for most of the experiments, it is difficult to assess the degree to which some conclusions may be contingent on these choices.
- It would be helpful to have a bit more clarity on what precisely the authors mean by quantifying "an upper bound on the complexity of this implicit model... that is generated in-context to predict next tokens". The connection to work on description lengths, inspired by the MDL principle and Kolmogorov complexity, seemed potentially helpful but I didn't fully understand the connection. The role of the prior, the complexity of which is not accounted for in "PHi loss", seems to make the connection unclear. It seems like clarifying this would be quite helpful for the paper and its motivation. It could also help identify relevant methods from prior work to compare the proposed method against.
I will be open to revisiting my judgements if the authors can help clarify these points for me.
Other Comments or Suggestions
Just some minor nits:
- Section 2.3 typo - "to ensure that ensures that"
- Section 6 typo - "a powerful objective for in applications"
Document with additional figures: https://tinyurl.com/yp4ucedn
1. Rationales for Mathematical Reasoning
Perhaps most intriguingly, the proposed measure when applied to rationales for mathematical reasoning appears to correlate with answer accuracy.
We agree that this is an intriguing finding. Since the submission, we were able to further confirm and extend it; please see Figures 3&4 in the linked document. We repeated the experiment described in Section 3.2.1 on the MATH dataset and observed very similar results: Here, too—across all tested layers—correct rationales are associated with high PHi loss, both across all question pairs and within the subset of counterintuitive pairs. The MATH dataset further allows us to break down the questions by difficulty. This reveals an interesting pattern within the counterintuitive subset: for easy questions, high PHi loss does not correlate with correct rationales, whereas for difficult questions, there is a strong positive correlation between high PHi loss and correctness. We will include these results in the updated version of the paper.
2. Learned Prior and Capacity Constraints
The autoregressive prior lacks access to the most recent input token $x_t$. Hence, unless $x_t$ is completely uninformative, the PHi loss will not go to zero. We can see this by re-writing Equation 2 in the paper. Recall that the hidden state $h_t$ is a function of $x_1, \dots, x_t$, and the latent $z_t$ is a (stochastic) function of $h_t$. Plugging this in, we can express the PHi loss at time step $t$ as:
$$L_{\text{PHi}}(t) = D_{\mathrm{KL}}\Bigl(q_\psi(\cdot \mid x_1, \dots, x_t)\,\big\|\,p_\chi(\cdot \mid x_1, \dots, x_{t-1})\Bigr).$$
This makes it clear that, in general, the prior and posterior are not the same, even if the prior is highly expressive. That said, the reviewer is right that in practice, the prior module $p_\chi$ not only lacks access to the most recent token, but also has less capacity than the part $B_\beta$ of the model that computes the target hidden state (the posterior distribution). In this sense, we are measuring the information gain resulting both from the new token and from the 'irreducible' computation the model performs in response to previous tokens. Disentangling these two components is an interesting challenge for future work.
3. Train/Test Split and Methodology
As there is no "train and test split" for most of the experiments...
It is important to clarify that there is a strict split between training and test/evaluation data for all experiments in the paper. The data used to train the models and PHi layers is distinct from the data used in the evaluations (with the exception of the memorization tasks, of course). We will emphasize this more clearly in the paper and provide additional information about the train/test split in the appendix.
As we describe in responses to other reviews, our method is fairly robust towards different hyperparameters (see response to u77y, paragraph 1, and Figure 5 in the linked document). We agree with the reviewer that the results of Section 3.2 are somewhat contingent on the placement of the PHi layer in the Llama 3B model. However, we did our best to be transparent and methodically sound: Experiment 3.2.1 serves as an exploration and allows us to select sensible placements of the PHi layer. In our response to Uw5E, paragraph 1, we detail the practical and theoretical considerations behind this selection, which we cannot repeat here due to lack of space. All subsequent experiments are conditional only on this one selection process. We will make this clearer in the revised paper.
4. PHi Loss and the Minimum Description Length Principle
Unfortunately, we do not have the space in this reply for a full explanation, but we will clarify this connection in the updated paper. Here, for brevity, we adopt the notation of https://arxiv.org/abs/2410.14086, where $p_\theta$ is the in-context learned model of some data $D$. The MDL principle says we should minimize $K(p_\theta) + K(D \mid p_\theta)$. The synthesized model $p_\theta$ has to pass through the information bottleneck in order to be effective. The cumulative PHi loss quantifies the amount of information crossing that bottleneck (i.e., the information contained in the latent sequence), and is thus an upper bound on $K(p_\theta)$. Our comment above about the prior, together with Section 2.4 in the paper, which shows the existence of a compression scheme for the latent sequence, should make it clear that the autoregressive prior is accounted for in the PHi loss. Meanwhile, the cumulative NLL loss is an upper bound on $K(p_\theta) + K(D \mid p_\theta)$, and therefore also an upper bound on $K(p_\theta)$. However, as we argue throughout the paper, it is a significantly weaker bound than the PHi loss.
If the reviewer would like further clarification, we are very happy to give more details and answer all remaining questions in a future response.
This paper proposes a new approach, prediction of hidden states (PHi), to measure the interestingness of neural sequence models' computations. After discussion, reviewers unanimously agreed that this is a novel and interesting tool that can provide new insights.
Some concerns were only partially resolved, such as the dependence on where to place the PHi layer and the evaluation on bigger models, but the paper is already a valuable contribution as-is. I recommend acceptance.