PaperHub
ICML 2025 · Poster
Overall rating: 5.5/10 · 4 reviewers (scores: 2, 4, 4, 2; min 2, max 4, std 1.0)

Looking Beyond the Top-1: Transformers Determine Top Tokens in Order

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
mechanistic interpretability, transformer, large language model, efficient inference

Reviews And Discussion

Official Review (Rating: 2)

This paper studies the "saturation event" in transformers, where the model’s top-k predictions are determined in the early layers and remain fixed in the later layers. The authors show empirically that the top-ranked tokens saturate in order, with higher-ranked tokens saturating in earlier layers. They also demonstrate that probing the embeddings at each layer allows for predicting the transition to the next saturation event. Based on these findings, they propose an early-exit strategy for text generation to improve model efficiency. Their experiments across different modalities (text, image, and audio) using various transformer-based architectures support their results.
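
For concreteness, here is a minimal sketch of how a saturation event can be detected with the logit lens, i.e., by projecting each layer's hidden state through the unembedding matrix. The model choice and helper function are illustrative assumptions, not the paper's actual code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; the paper's experiments use GPT2-XL, Llama3-8B, ViT, Whisper, etc.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()

def saturation_layer(text, k=1):
    """Earliest layer from which the top-k tokens at the last position stay
    fixed through all remaining layers (the 'saturation event')."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states            # (n_layers + 1) tensors of [1, T, d]
    W_U = model.get_output_embeddings().weight       # unembedding matrix [V, d]
    ln_f = model.transformer.ln_f                    # GPT-2-specific final layer norm;
    # applying ln_f at every layer is the usual logit-lens convention (the final
    # state is already normalized, so the last projection is approximate).
    tops = [torch.topk(ln_f(h[0, -1]) @ W_U.T, k).indices.tolist() for h in hidden]
    final = tops[-1]
    for layer in range(len(tops)):
        if all(t == final for t in tops[layer:]):
            return layer
    return len(tops) - 1

print(saturation_layer("The capital of France is", k=1))
```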

Questions For Authors

The questions below are for clarifying the methodology used in the paper:

  1. In section 2.3, for the training phase of the probing task, how do you label the embeddings if the saturation layers are not in order? The Kendall's τ coefficients reported are around 1.2, which corresponds to only 60% of inputs having in-order saturation layers. Will this cause an issue?

  2. Same question above for the early exiting strategy. Don't you need a guarantee that the second saturation layer is later than the first one?

  3. In Table 1, why is chance level accuracy different in different experiments? Are you predicting a different number of top tokens in each experiment?

  4. Based on Table 4, it appears that in many cases, top-1 saturation occurs in fewer than 50% of inputs (with two models even showing only 10%).

    i. In this context, would the saturation event for other ranks still be meaningful?

    ii. A Kendall coefficient of 1.2 (Table 5) suggests that 60% of inputs show higher rank saturation events in order. Could you clarify how to reconcile this with the observation that less than 50% of inputs show top-1 saturation?

  5. In the early exit strategy (results in Table 2), do you measure top-1 accuracy on all positions in the context window? In that case, in order to skip the computations in a layer, all the tokens in the context window need to reach their first saturation event before this layer. Is there high variation in the first saturation layer across different positions for a given input sequence?

Claims And Evidence

The authors test their claims on datasets from text, image, and audio domains, covering a wide range of modalities, which strengthens the support for their findings.

In the current format, some claims and results, when considered together, may convey mixed or incomplete messages that need further clarification. See Question 4 and "Other Strengths And Weaknesses".

Methods And Evaluation Criteria

  1. Since the key point is that top tokens saturate in order for each input, the average rank of the saturation layers over a large number of inputs may not effectively capture this pattern. The Kendall metric reported in the appendix seems more aligned with the message. A simpler and more direct option could be reporting the percentage of inputs where the order of the saturation layers matches the order of the tokens (which should correlate with the Kendall metric).

  2. As also acknowledged by the authors in the limitations section, it’s unclear whether the data used in the experiments was part of the pre-training set. This could slightly affect the results depending on whether the model was directly optimized on these inputs or if they were unseen. Using more diverse input sequences could have reduced this issue.

  3. For the early-exit strategy and the comments on improved language modeling, it may be more appropriate to use a metric that assesses the model's alignment with the language distribution rather than just the top token prediction/accuracy. Top-1 accuracy alone is not a perfect measure of a text generator. Nonetheless, the observations in Table 3 are still interesting.

Theoretical Claims

(No theoretical claims)

Experimental Design And Analyses

See Questions

Supplementary Material

I checked some of the additional experiment results in the supplementary materials.

Relation To Broader Scientific Literature

The paper could contribute to the extensive body of interpretability research aiming to understand the inner workings of language models. It also suggests initial methods for leveraging these findings to improve the efficiency of language models during inference, offering an alternative to pruning techniques.

Essential References Not Discussed

None that I'm aware of.

Other Strengths And Weaknesses

The observations reported in the paper are interesting and are supported by experiments in different setups. They also suggest practical implications for these observations.

However, the observation that randomly initialized models also exhibit this in-order saturation event makes the message and implications of this observation a bit vague. If this implies that the architecture plays a major role in unlocking this phenomenon (line 289), then it would be useful to verify this observation across other architectures.

Also see the clarification issues raised in the "Questions" field.

Other Comments Or Suggestions

Minor typos:

  1. In Def 2.2: l^2 = 29.
  2. The running header of the pages doesn't reflect the title of the paper.
Author Response

We appreciate the reviewer's helpful feedback.

Methods And Evaluation Criteria

A simpler and more direct option could be reporting the percentage of inputs where the order of the saturation layers matches the order of the tokens.

We calculated this percentage for Llama3-8B on 1K MMLU questions for top-k tokens, where top-1 refers to cases where the 1st tokens reach saturation in order, top-2 includes the 1st and 2nd tokens, and so on:

| Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|-------|-------|-------|-------|-------|
| 99.7% | 82.4% | 77.0% | 76.0% | 75.8% |

These results further validate the ordered saturation phenomenon, and we will include this analysis in the camera-ready version.
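
For concreteness, here is a minimal sketch of how such an in-order percentage and a per-input Kendall statistic can be computed; `sat_layers` is hypothetical stand-in data for the per-token saturation layers extracted from the model, and standard Kendall's τ is used rather than the paper's τ_strict variant:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical data: sat_layers[i, k] = saturation layer of the (k+1)-th ranked
# token for input i; real values would come from a logit-lens pass per input.
sat_layers = np.array([[5, 9, 12, 14, 15],
                       [4, 8, 11, 11, 13],
                       [6, 5, 10, 12, 16]], dtype=float)

def in_order_pct(sat, upto):
    """% of inputs whose first `upto` saturation layers are non-decreasing."""
    ok = np.all(np.diff(sat[:, :upto], axis=1) >= 0, axis=1)
    return 100.0 * ok.mean()

for k in range(2, 6):
    print(f"top-{k}: {in_order_pct(sat_layers, k):.1f}% of inputs in order")

# Per-input Kendall's tau between token rank and saturation layer, then averaged.
taus = [kendalltau(np.arange(sat_layers.shape[1]), row)[0] for row in sat_layers]
print("mean Kendall's tau:", np.mean(taus))
```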

it’s unclear whether the data used in the experiments was part of the pre-training set.

To mitigate concerns of data contamination, we reproduced our findings on datasets released after pretraining, which are hence less likely to be contaminated. For text, in Appendix A.4, we report results for the GPT2-XL model on the MMLU dataset, a benchmark introduced after the model's release.

For vision, we reproduced our results using ViT on 1K images from ImageNet-D [1], a newer synthetic benchmark. The average rank of the k-th saturation layer increases monotonically (up to k=3), with statistically significant differences between consecutive token ranks (p < 0.001, average ranks: 1.0, 2.0, 2.39).

For speech, we reproduced our results using Whisper on 200 English audio samples from the newer Emilia [2] benchmark. Here too, the average k-th saturation layer rank increases with k (up to k=3), with statistically significant differences (average saturation ranks in order: 1.01, 1.94, 2.42). Furthermore, since we demonstrate this ordering phenomenon in a randomly initialized Transformer, our findings suggest robustness to both training and dataset choice.

For the early-exit strategy .. it may be more appropriate to use a metric that assesses the model's alignment with the language distribution rather than just the top token prediction/accuracy.

Following your suggestion, we calculated the perplexity metric for each early-exit strategy using probabilities from the early-exit layer. For Llama3-8B and 500 CNN/DM texts, our method significantly outperforms Softmax Response in the speedup-performance tradeoff and performs similarly to State Saturation, both of which are considered SOTA. See the results in the table below:

| Strategy | Speedup | Perplexity |
|----------|---------|------------|
| Oracle | 1.09 | 31.13 |
| Ours | 1.12 | 100.2 |
| State Saturation | 1.12 | 107.93 |
| Softmax Response | 1.15 | 153.96 |
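
For reference, a minimal sketch of how perplexity under an early-exit policy can be computed; the `early_exit_perplexity` helper and its inputs are illustrative assumptions, not the paper's actual pipeline:

```python
import torch
import torch.nn.functional as F

def early_exit_perplexity(exit_logits, targets):
    """exit_logits: [T, V] logits read off at each position's chosen exit layer
    (projected through the unembedding, as in the logit lens);
    targets: [T] gold next-token ids. Returns exp(mean negative log-likelihood)."""
    log_probs = F.log_softmax(exit_logits, dim=-1)
    nll = -log_probs[torch.arange(targets.numel()), targets]
    return torch.exp(nll.mean()).item()

# Toy example: vocabulary of 10 tokens, 4 positions.
logits = torch.randn(4, 10)
targets = torch.tensor([1, 3, 5, 7])
print(early_exit_perplexity(logits, targets))
```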

Weakness

If this implies that the architecture plays a major role in unlocking this phenomenon (line 289), then it would be useful to verify this observation across other architectures.

In this work we are the first to notice this phenomenon, and we focus on developing measures to quantify its extent. We agree that other architectures could also lead to ordered saturation. We will refine this claim in the paper to make it clear that this is a hypothesis raised by our findings.

Questions:

  1. We extract embeddings for training only when saturation layers are in order, as our goal is to explain this mechanism.
  2. When evaluating our early-exit strategy we also used cases where this order doesn’t hold, i.e., we exit when the first token saturates in all cases. This demonstrates the robustness of our probing classifier and method.
  3. Chance level varies as it is inversely proportional to the number of tasks/classes, which is set to ensure data balancing. We will clarify this in the table caption.
  4. i. Saturation events occur fairly often for other ranks. In LLaMA3-8B on 1,000 MMLU questions: top-1 saturates in 65.7% of tokens (77K), top-2 in 28.3% (33K), top-3 in 15.2% (18K), top-4 in 9.9% (12K), and top-5 in 7.2% (8.5K). ii. The strict Kendall’s coefficient values reported in the paper, suggesting moderate agreement between token rank and saturation order, are calculated only on the subset of samples where top-1 reaches saturation.
  5. As noted in Section 5.1, dynamic decoding is not our focus, so we propagate states from layers after early-exit as in regular inference. This allows later tokens to use these computations if needed, and the effect of a state-copying mechanism on performance is negligible [3].
  6. Yes, in LLaMA3-8B on 500 MMLU questions, the first saturation layer ranges from 2 to 30. Analyzing its correlation with token position, we found a weak but significant negative correlation (r = -0.1, p < 0.001), suggesting later tokens tend to saturate earlier. We will add this to the camera-ready version.

We welcome further discussion if needed. If our clarifications are satisfactory, we kindly ask the reviewer to consider raising their rating.

[1] Zhang et al. 2024

[2] He et al. 2024

[3] Schuster et al. 2022

Reviewer Comment

Thanks for the clarification.

Could you please clarify the contents of Table 4 again? In the paper, you report that 31% of tokens reach top-1 accuracy with the LLaMA3-8B model, but here you mention that "top-1 saturates in 65.7% of tokens," which seems inconsistent.

Also, the saturation percentage for Whisper—used in the paper to extend results to the speech modality—is quite low (10%). I'm asking this to better clarify how general the reported phenomenon is.

Author Comment

Thank you again for your thoughtful feedback and for taking the time to follow up on our response. We're glad that you found the observations interesting and that the experimental support came through clearly. We're especially grateful for your recognition of the paper’s potential contribution to interpretability research, as well as the promise of leveraging these findings for improving inference efficiency. Your comments have been both encouraging and constructive, and they’ve helped us further clarify key aspects of our work.

Could you please clarify the contents of Table 4 again? In the paper, you report that 31% of tokens reach top-1 accuracy with the LLaMA3-8B model, but here you mention that "top-1 saturates in 65.7% of tokens," which seems inconsistent.

Thank you for the question—this is an important distinction. The difference between the two numbers, 31.2% and 65.7%, comes from the fact that they are based on different subsets of the data. The 31.2% in Table 4 specifically refers to the percentage of input tokens that reach top-1 saturation within the first 85% of the model’s layers. We focus on this subset to ensure that enough layers remain after saturation for meaningful analysis.

On the other hand, the 65.7% figure represents the proportion of all input tokens that reach top-1 saturation by the penultimate layer, which follows the definition commonly used in prior work. This number is provided to show how widespread saturation events are across the entire dataset, without the constraints imposed in our experimental setup. We will clarify this point in the camera ready version of the paper.

Also, the saturation percentage for Whisper—used in the paper to extend results to the speech modality—is quite low (10%). I'm asking this to better clarify how general the reported phenomenon is.

Indeed, the percentage of tokens that reach saturation in Whisper is lower than in the other models we tested. We believe this might be due to architectural differences rather than the modality: in Whisper, all decoder layers attend to the final encoder layer, which may dilute task-specific information and blur the boundaries between tasks—especially in later tokens.

In contrast, in Qwen-Audio, a different speech model, the encoder's role is limited to providing input features, and the decoder processes them autoregressively. In this setup, a much higher percentage of tokens reach top-1 saturation (26.15%), and we observe more robust ordered saturation: the average rank of the k-th saturation layer increases monotonically with k (up to k=4), and the difference between consecutive token ranks is statistically significant (p < 0.001, pairwise independent-samples t-test; average saturation ranks in order: 1.01, 1.85, 2.73, 2.93).

To further validate this, we ran the Qwen-Audio model on 500 samples from LibriSpeech and trained a classifier to predict the task identity from hidden layer representations. The classifier achieved an average accuracy of 85.5% across 4 classes, indicating that task information remains decodable and structured in this modality.

Official Review (Rating: 4)

This paper studies the computation that the Transformer's hidden layers perform after the top-1 prediction is complete, and finds that the model determines the subsequent tokens (top-2, top-3, etc.) in order of token ranking.

Through experiments across modalities (text, vision, speech) and model variants (pre-trained/untrained Transformers), the authors propose that this phenomenon originates from a discrete task-transition mechanism.

Finally, the authors introduce a token-level early exit strategy that can improve efficiency while maintaining performance.

Questions For Authors

See Other Strengths And Weaknesses.

Claims And Evidence

Yes

Methods And Evaluation Criteria

Yes

Theoretical Claims

The paper does not involve theoretical proof, but mainly relies on experimental verification.

Experimental Design And Analyses

Yes. The experimental design of this paper is clear, verifying the universality of saturation events through cross-modal (text, vision, speech) and multi-model (pretrained/untrained Transformer) experiments.

Supplementary Material

No supplementary material, but a Colab notebook reproducing a major experiment is provided.

Relation To Broader Scientific Literature

This paper mainly builds on the saturation-event research of [1] and extends it to top-k tokens. It also echoes the logit-lens analyses of [2,3], and methodologically compares against the early-exit strategy of [4].

[1] Geva, M., Caciularu, A., Wang, K., and Goldberg, Y. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Goldberg, Y., Kozareva, Z., and Zhang, Y. (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 30–45, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics.

[2] Nostalgebraist. Interpreting gpt: the logit lens, 2020

[3] Yang, S., Gribovskaya, E., Kassner, N., Geva, M., and Riedel, S. Do large language models latently perform multi-hop reasoning? ArXiv preprint, abs/2402.16837, 2024

[4] Schuster, T., Fisch, A., Gupta, J., Dehghani, M., Bahri, D., Tran, V., Tay, Y., and Metzler, D. Confident adaptive language modeling. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022

Essential References Not Discussed

No.

Other Strengths And Weaknesses

This paper systematically reveals, for the first time, the "ordered saturation" phenomenon, in which the Transformer determines its top-k predictions in order of ranking, and verifies its universality through cross-modal experiments, providing a new perspective for understanding the model's internal mechanisms.

In addition, practical applications, such as an early-exit strategy that accelerates next-token prediction without introducing too many additional computations, are also explored, striking a balance between speed and accuracy.

However, the applicability to complex tasks/models, such as more advanced LLMs, the Qwen-Audio series, and the LLaVA series, is still unclear; it is not clear whether the findings are universal for these complex tasks/models.

Overall, given the paper's strong innovation and practical value, I suggest acceptance.

Other Comments Or Suggestions

No.

Author Response

Thank you for your thoughtful and positive assessment of our work! We appreciate your recognition of the novelty of the ordered saturation phenomenon and its implications for understanding Transformer models. Your acknowledgment of both the theoretical contributions and practical applications, such as our early exit strategy, is encouraging.

Other Strengths And Weaknesses:

However, the applicability to complex tasks/models, such as more advanced LLMs, the Qwen-Audio series, and the LLaVA series, is still unclear; it is not clear whether the findings are universal for these complex tasks/models.

We agree that exploring the applicability of our findings to more complex tasks and advanced models is an exciting and important direction. Following your suggestion, we replicated our results using the LLaVA-1.5-7B model on 1K questions from the MMLU dataset, treating it as a language model. We found that the average rank of the k-th saturation layer increases monotonically with k (up to k=5), and that the difference between consecutive token ranks is statistically significant (p < 0.001, pairwise independent-samples t-test; average saturation ranks in order: 1.01, 2.14, 2.66, 2.87, 3.00). We also trained a classifier to predict the task number from hidden-layer representations, achieving 82.4% average accuracy over 5 classes (chance-level accuracy being 20%).

We then re-ran the analysis on 500 questions from the MMMU dataset [1], utilizing this model's multi-modal capabilities on the more complex task of Visual Question Answering, where the input contains both image and text. The average rank of the k-th saturation layer again increases monotonically with k (up to k=5), and the difference between consecutive token ranks is statistically significant (p < 0.001, pairwise independent-samples t-test; average saturation ranks in order: 1.01, 2.38, 2.91, 3.14, 3.33). A classifier trained to predict the task number from hidden-layer representations achieved 82.4% average accuracy over 5 classes (chance-level accuracy being 20%).

Additionally, we ran the Qwen-Audio model on 500 audio samples from the LibriSpeech dataset and found that the average rank of the k-th saturation layer increases monotonically with k (up to k=4), with the difference between consecutive token ranks statistically significant (p < 0.001, pairwise independent-samples t-test; average saturation ranks in order: 1.01, 1.85, 2.73, 2.93). A classifier trained to predict the task number from hidden-layer representations achieved 85.5% average accuracy over 4 classes.

These results further emphasize the robustness of the ordered saturation phenomenon across models, tasks, and modalities.

Thank you again for taking the time to review the paper and providing helpful feedback! Do the above actions address your concerns with the paper? If not, what further clarification or modifications could we make to improve your score?


[1] Yue, Xiang, et al. "MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI." IEEE/CVF CVPR 2024.

Official Review (Rating: 4)

This paper shows that at the intermediate layers, the model sequentially determines the top-k tokens, rather than just the top-1 token, and that this observation may be a special property of the transformer architecture. The authors call this task transition, and show that it is a general phenomenon across different kinds of scenarios (ViT, Speech Transformer). They also design a new early-exit strategy for accelerating inference.

Questions For Authors

I think this phenomenon is likely related to the existence of residual connections, because they mean that every layer makes only a minor modification to the previous layer's output: for x_{n+1} = x_n + f(x_n), many works show that ||x_n|| >> ||f(x_n)||. How do the authors see this, and how could one test whether it holds or not?

Claims And Evidence

I think the claim of sequential determination of top-k tokens is convincing and supported by various experiments on different downstream tasks, and the claim itself is interesting.

However, I'm confused about the claim regarding "task transition". It sounds like a model for explaining the sequential determination of top tokens, but the later experiments (like the logistic regression classifier) don't really support this claim; they only re-confirm that embeddings from different layers can correspond to different stages of token determination.

I believe the observed order of token determination doesn't necessarily imply a task transition. There is another possible model: if the weights of all top tokens increase together but at different scales, we would also observe sequential saturation of top tokens starting from random weights. Considering the existence of residual connections, the weight modification for each token can be minor across layers; the probabilities of all these top tokens can then increase slightly but at different scales, and the tokens with a larger scale of increase will appear in the top-k earlier. The modeling is thus not unique; this needs more investigation, or the claim should be revised for soundness.

Methods And Evaluation Criteria

I think the evaluation metrics (like Kendall's tau) are reasonable and we can see some statistical trend of the phenomenon.

Theoretical Claims

This paper doesn't have theoretical claims.

Experimental Design And Analyses

I think the experiments supporting the sequential saturation phenomenon are relatively complete, considering different downstream tasks and different models. The improvement from the additional algorithm (like Table 2) looks incremental to me, but this should not be the main contribution of this paper.

Supplementary Material

I briefly checked some experimental results/designs in the supplementary material, such as Figure 7, Table 5, and A.3.

Relation To Broader Scientific Literature

I think the broader scientific literature shows that the top-1 token is determined at the very last layers of the transformer (like [1]), and this paper shows that this phenomenon also happens, sequentially, for top-k (k = 4 or 5) tokens.

[1] Nostalgebraist. Interpreting gpt: the logit lens, 2020.

Essential References Not Discussed

I did not find any missing key references.

Other Strengths And Weaknesses

Strengths:

+: The observation is interesting, and the experimental result for this phenomenon is supportive and sufficient.

Weaknesses:

-: I think the main concern is the claim of "task transition" mentioned above. It makes it more difficult for readers to understand the main takeaway of this paper. Following this, Figure 2 may convey the authors' idea more clearly than Figure 1, and should be placed earlier.

Other Comments Or Suggestions

Could the authors introduce more background on the logit lens and previous works in the main paper? For example, how the projection for intermediate layers is obtained and used.

Author Response

Thank you for your insightful comments and helpful suggestions. We appreciate that you found the sequential determination of top-k tokens to be an interesting and well-supported phenomenon. We're also glad that you recognized the broader relevance of our findings in relation to existing literature on token determination in Transformers.

Claims And Evidence:

I believe the observed order of token determination doesn't necessarily imply a task transition. There is another possible model: if the weights of all top tokens increase together but at different scales, we would also observe sequential saturation of top tokens starting from random weights.

We agree that the sequential saturation of top tokens could emerge from a different underlying mechanism, such as the one suggested, where all top-token probabilities increase together but at different scales due to residual connections. This is an interesting alternative perspective, which we will add to the paper.

To clarify, our proposed interpretation is not that the first task corresponds strictly to processing only the first token, the second to the second, and so on. Rather, what we can say with confidence is that by the time the model transitions to the second task, it is no longer focusing on the first token in the same way. This distinction is supported by the probing classifier results, which show high accuracy in distinguishing the tasks as we define them, indicating that these transitions are not simply a gradual rescaling but rather discrete shifts in model behavior.

Additionally, the intervention experiments provide further support for our interpretation. Specifically, when we inject layer activations from later stages, the model abruptly shifts its behavior, ceasing to focus on the top-1 token and instead progressing to the next stage. This causal evidence suggests that the observed transitions are more than just a continuous adjustment of weights. That said, we agree that further investigation is needed to rigorously rule out alternative explanations, and we appreciate the reviewer’s suggestion to refine the claim for greater clarity.

Given the opportunity, we will incorporate these points into the discussion section in the camera-ready version.

Other Comments Or Suggestions:

We will follow your suggestion to expand on previous work on extracting predictions from intermediate layers in the introduction section of the camera-ready version of our paper.

Questions For Authors:

I think this phenomenon is likely related to the existence of residual connections, because they mean that every layer makes only a minor modification to the previous layer's output: for x_{n+1} = x_n + f(x_n), many works show that ||x_n|| >> ||f(x_n)||. I think this may be related?

You raise a very interesting question about the relevance of residual connections to the ordered saturation phenomenon. We address this indirectly in our Related Work section when considering the iterative inference hypothesis. This approach interprets each layer as one step of an iterative, convergent process [1], suggesting that each layer incrementally refines the hidden representation by gradually shaping the next-token prediction [2-4].

At first glance, the existence of residual connections should make it more difficult to distinguish layer-wise task transitions, as they enforce strong similarity between consecutive layers. However, our results show that saturation events still occur in a structured and discrete manner, rather than as a continuous drift. This suggests a phase transition-like dynamic: while each layer makes small refinements, at a certain point, a critical threshold is reached, leading to a sharp shift in representation and task focus.

More work is needed to determine which components of the Transformer architecture give rise to this phenomenon, and the residual connections are certainly one component to examine alongside the self-attention and MLP layers. One possible way to do so is through ablation studies.
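
One concrete way to examine the reviewer's hypothesis is to measure the update-to-state norm ratio per layer. A minimal sketch follows, with GPT-2 as an illustrative model; note that the last hidden state returned by Hugging Face includes the model's final layer norm, so the last ratio is approximate:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()

ids = tok("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    hs = model(ids).hidden_states        # (n_layers + 1) tensors of shape [1, T, d]

# For x_{n+1} = x_n + f(x_n), the update norm is ||x_{n+1} - x_n||.
for n in range(len(hs) - 1):
    x, x_next = hs[n][0, -1], hs[n + 1][0, -1]
    ratio = (x_next - x).norm() / x.norm()
    print(f"layer {n:2d}: ||f(x_n)|| / ||x_n|| = {ratio:.3f}")
```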

Thank you again for taking the time to review the paper and providing helpful feedback! Do the above actions address your concerns with the paper? If not, what further clarification or modifications could we make to improve your score?


[1] Simoulin, Antoine, and Benoit Crabbé. "How many layers and why? An analysis of the model depth in transformers."

[2] Geva, Mor, et al. "Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space."

[3] Belrose, Nora, et al. "Eliciting latent predictions from transformers with the tuned lens."

[4] Rushing, Cody, and Neel Nanda. "Explorations of self-repair in language models."

Reviewer Comment

Thanks for your clarification, and sorry for my late reply. I agree that the intervention study is a helpful and interesting observation. Although I don't think it strictly excludes all other possible models, the obvious difference between different injection layers shows that some discrete events happen around the saturation layers, which supports the authors' claim and to some extent resolves my main concern. I will increase my score to 4, and suggest the authors include a clearer description of the main method/takeaway of the intervention experiment in the introduction, which may be critical for the core conclusion. I am also looking forward to the future studies the authors mentioned in the rebuttal.

Author Comment

We're glad we were able to address your main concern, and appreciate you raising your score as a result. We thank you again for your feedback, and will revise the introduction to provide a clearer description of the main method and takeaway from the intervention experiment.

Official Review (Rating: 2)

This work analyzes the saturation event observed in transformer layers, where a layer's top-1 prediction for the next token remains the same from some layer onwards. This paper expands the analysis beyond the top-1, illustrating how the argument continues for top-k, and analyzes a specific component in the activations that causes the top-k switch in the embeddings, from which point onwards all layers' top-k predictions remain the same. The main idea of the paper revolves around moving from the top-1 layer-saturation task to the top-k layer saturation.

The paper's empirical results include: 1) illustrating how the top-k token ranks for LLM output mostly follow a trend similar to the positions of their respective saturation layers in the network; 2) classifying this observation as a property of transformers rather than task-specific, as it appears in untrained, randomly initialized models; 3) training a linear regression model to classify the k-th ranking for each layer; 4) using this discovery to improve top-k token prediction for LLMs (compared to previous methods) by implementing an early-exit strategy based on the linear classifier.

Questions For Authors

Mentioned above.

Claims And Evidence

While I do believe the main claim of the paper, namely that layer saturation happens beyond just the top-1 token, I don't believe the results provided in Figure 3, regarding the rank of the k-th top token following its order in the network, are convincing enough beyond the 3rd top token. Even though the rankings look quite monotonic for the top-1, 2, 3 tokens, from 3 onwards I find it alarming that the average is not close to the actual layer rank. Further, I would like more information on the way the p-values were calculated, and I would like some more discussion regarding the τ_strict definition. I will ask further questions in the appropriate section.

With respect to the results on classifier accuracy, while I find the results on language and text models acceptable, I'm afraid the same can't be said about the Image and speech transformers. I think the accuracy there is somewhat low for me to consider reliably using the classifier for further practical applications. I will discuss my questions on this topic as well.

Additionally, when considering Table 1, I believe the model's accuracy on each specific top-k layer is probably more informative than the overall accuracy. It could be that the classifier is doing a great job at detecting the first top-1, 2 tasks, as they are evidently more distinct from Fig 3, but fails on the layer ranks top-3, 4, 5. Looking at Table 8 in the appendix, I see that this happens for tasks 4-5 on some of the models. To be clear, I am not saying the argument doesn't hold; I am saying this should be discussed and is more important than the total accuracy.

With regards to the layer activation injection, I have some questions about what the activation layer is; regardless, I see that visuals were only provided for probing the top-1 layer. Again, I see Table 9 provides some accuracies for the probing task, but I would like a clearer understanding of how these numbers were generated.

With regards to the speedup and early exit, I am not as familiar with other early-exit methods and am taking the authors' results at face value. However, just extending the results to 2 tokens might not be significant enough for the paper. I don't find in the appendix any results on accuracy for tasks above the 2nd top token. I believe it's important for them to be reported even if they don't have a positive outcome. I understand, however, that this could be hard given the constraints of the experimental setup (the top-2 tokens have to be wrong for the LLM for the task-3 evaluation).

Finally I am not sure I understand the conclusion drawn from Table 3. I will have a question regarding it below.

Methods And Evaluation Criteria

I believe most of my concerns regarding this have been discussed above. I would like more clarity on the evaluation of the p-values for statistical significance. I do think more results could be provided for the speed-up argument of the paper; however, I understand that this is not the main focus of the paper.

Theoretical Claims

Does not apply to this paper.

Experimental Design And Analyses

Here I will ask the main questions I have regarding the paper as it is mostly reliant on empirical observations:

Q1: Why do the authors believe the rank accuracy drops noticeably on non-text transformers? Could it have something to do with certain discoveries regarding the sparsity of attention activations or weights when dealing with LLMs?

Q2: Have the authors attempted to use a non-linear model? I think it could help.

Q3: For Figure 3, I don't understand why the error bar is so small. I assume every rank is an integer, and I would expect an integer to exist within the range of mean +/- error bar. Am I missing something?

Q4: Can the authors explain the missing values in Tables 8-9? I think I missed the explanation for them.

Q5: Is the task classifier transferable from one model to another, assuming the dimensions across transformers are the same? For example, for the same architecture trained on different datasets.

Q6: Could the authors please expand on "k=5 as the highest value token which reaches 5% of the input tokens"? I am not sure I understand.

Q7: Could the authors expand on the calculation of the p-values using τ_strict?

Q8: Why do the accuracies of the second token change beyond the task-2 saturation layer in Table 3? I assume all these layers are post task-2 saturation.

Supplementary Material

I have read through the supplementary material and provided questions regarding some of the results above.

Relation To Broader Scientific Literature

This paper extends the top-1 saturation argument to further top-k tokens. Assuming this is the first paper to extend the idea, especially the discovery w.r.t. untrained transformers, it could have an impact on a better understanding of the transformer's functions.

Essential References Not Discussed

I am not aware of any. I am not as familiar with early exit strategies and it is possible that works from that field are excluded in the paper's discussion, however as the authors mention, this is not the main target of the paper.

Other Strengths And Weaknesses

All discussed above.

Other Comments Or Suggestions

I believe there might be a typo on lines 681-682 for the discordant.

I am open to change my score depending on the future discussion.

Author Response

Thank you for your thorough review.

Claims And Evidence

I don't believe results provided in Figure 3 .. are convincing enough beyond the 3rd top token.

Beyond the top-3 tokens, saturation order is indeed less consistent in some models. However, our results are statistically significant across models and modalities. This variability may be partly due to model depth: fewer layers may limit lower-ranked tokens from reaching saturation. For instance, GPT-2 XL (48 layers) maintains order up to the 5th token, unlike Llama3-8B (32 layers) (Appendix A.4). We will add this observation to the paper.

With respect to the results on classifier accuracy… I'm afraid the same can't be said about the Image and speech transformers.

While the classifier accuracy in Vision and Speech Transformers is lower than in text models, it remains statistically significant (Table 1) and more than twice the chance level. Though this may limit practical applications, the experiment validates the task-transition mechanism.

I believe the model's accuracy on specific k-top layer is probably more informative than the overall accuracy.

Table 8 shows no significant accuracy differences across tasks, with ViT even performing slightly better for Task 5 than Task 1. The classifier was trained on balanced data representing all tasks. We didn't investigate these differences, but agree this is an interesting avenue for future research.

With regards to the layer activation injection, I have some questions on what the activation layer is.

We inject the first saturation layer’s output from sample s1 into sample s2’s subsequent layer and observe its effect on s2’s first saturation layer. As a control, we repeat this with activations from three layers before and after. Please see Appendix A.8 for details. We will clarify this in the camera-ready version.
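
For readers unfamiliar with activation patching, here is a minimal sketch of this kind of cross-sample injection using forward hooks. The model, `layer_idx`, and the choice to patch the full last-position state are illustrative assumptions; the paper's exact procedure is described in Appendix A.8:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()

def hidden_states(ids):
    with torch.no_grad():
        return model(ids).hidden_states

ids1 = tok("Sample s1 goes here", return_tensors="pt").input_ids
ids2 = tok("Sample s2 goes here", return_tensors="pt").input_ids

layer_idx = 10                                        # s1's first saturation layer (illustrative)
donor = hidden_states(ids1)[layer_idx + 1][0, -1].clone()  # output of block layer_idx for s1

def inject(module, inputs, output):
    # Overwrite the last position of this block's output with s1's activation.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, -1, :] = donor
    return output

handle = model.transformer.h[layer_idx].register_forward_hook(inject)
with torch.no_grad():
    patched = model(ids2)                             # s2 forward pass with injection
handle.remove()
```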

However, just extending the results to 2-tokens might not be significant enough for the paper.

Following this comment, we analyzed the 3rd and 4th top tokens using Llama3-8B on 500 CNN/DM texts, assuming higher-ranked tokens were incorrect, and found that the results generalize.

For the 3rd top token, accuracy increased from 17.01% (no saturation) to 23.3%, 26.9%, 28.9%, 29.2%, and 31.0% when saturation occurred 2 to 6 layers before output, respectively.

For the 4th top token, accuracy increased from 13.4% (no saturation) to 16.8%, 19.7%, 20.45%, 19.5%, and 18.3% when saturation occurred 2 to 6 layers before output.

A two-proportion z-test confirms statistical significance (p < 0.001) for 2 ≤ i ≤ 6 in the 3rd token and 2 ≤ i ≤ 5 in the 4th token. We will include this analysis in the camera-ready version.
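
For reference, a minimal sketch of the two-proportion z-test used here, with illustrative counts rather than the paper's raw data:

```python
from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts only: correct predictions / totals for the 3rd top token,
# comparing a 'saturated i layers early' group vs. a 'no saturation' group.
successes = [233, 1701]        # e.g., 23.3% of 1,000 vs. 17.01% of 10,000
totals    = [1000, 10000]

z, p = proportions_ztest(successes, totals)
print(f"z = {z:.2f}, p = {p:.2g}")
```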

Questions

Q1: We report in our response to Reviewer J6pR that other Vision and Speech models achieve high accuracy, comparable to text models. In Whisper, we speculate that order deteriorates in later tokens because all decoder layers are conditioned on the encoder’s last layer, potentially blurring task boundaries. In contrast, in Qwen-Audio, where the encoder only generates input features for the decoder, our task classifier achieves high accuracy.

Q2: Non-linear probing could hurt our claim, as powerful probes may learn arbitrary mappings rather than reveal the model’s inherent representations [1, 2].

Q3: We computed the standard error as std/sqrt(n) [3]; a large n results in small values.

Q4: Omitted values indicate that the probing classifier was trained on a different number of tasks per model. For Llama3-8B for example, it was trained on three tasks, so accuracies for Tasks 4 and 5 are not reported. We will clarify this in the caption.

Q5: We did not explicitly test task classifier transferability across models, but we indirectly demonstrate cross-dataset transferability by applying a classifier trained on MMLU to CNN/DM in our early-exit experiment (Section 5.1).

Q6: In our analysis of the ordered saturation of top-k tokens, we set k = 5 to ensure consistency across models while maintaining sufficient examples (>4% of inputs for all models). As k increases, saturation cases decline. The 5% value was a typo. We will clarify this and add detailed statistics in the appendix.

Q7: As noted in Appendix A.3, stricter Kendall’s tau p-value was computed via a permutation test of saturation layer sequences (1K random shuffles per instance). In all cases, real tau values exceeded shuffled results (p < 0.001). We will further clarify this in the paper.
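
A minimal sketch of such a permutation test follows; standard Kendall's τ stands in for the paper's τ_strict variant, and the input sequence is hypothetical:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

def permutation_p(sat_layers, n_shuffles=1000):
    """sat_layers: saturation layer of the k-th ranked token (k = 1..K) for a
    single input. Returns an empirical one-sided p-value for the ordering."""
    ranks = np.arange(len(sat_layers))
    observed = kendalltau(ranks, sat_layers)[0]
    null = [kendalltau(ranks, rng.permutation(sat_layers))[0]
            for _ in range(n_shuffles)]
    # Fraction of shuffles with tau at least as large as observed (+1 smoothing).
    return (np.sum(np.array(null) >= observed) + 1) / (n_shuffles + 1)

print(permutation_p([5, 8, 11, 12, 14]))
```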

Q8: Table 3 analyzes the second token’s accuracy when the top-1 token is incorrect, depending on its saturation layer (2 ≤ i ≤ 6 layers before output). Variability in saturation layer is an open question, but our main goal was to show a statistically significant difference between saturation and no-saturation cases.

We welcome further discussion if needed. If our clarifications are satisfactory, we kindly ask the reviewer to consider raising their rating.


[1] Hewitt et al., 2019

[2] Belinkov, 2022

[3] Casella, G., & Berger, R. L. 2002

Reviewer Comment

I thank the authors for their response. Most of my comments have been addressed. For now, I would like to ask the following:

Q1: Regarding the missing values in Table 8, could the authors explain why the classifier was trained on only three tasks? Is this related to the implications concerning model dependence—for example, that smaller models with fewer layers don't exhibit the same number of saturation tasks?

Q2: Could the authors elaborate on how introducing non-linearity undermines the claims of the paper? I would appreciate further clarification on this point.

Q3: Could the authors please reiterate the context of Table 3? Is the top-2 token (Task 2) changing after saturation, and is that the reason for the observed accuracy shift? I feel I may have misunderstood the table entirely.

Author Comment

Thank you for your follow-up and for letting us know that most of your comments have been addressed—we really appreciate the time and care you’ve put into reviewing our work. We're grateful for your positive remarks about the paper’s contributions. It's encouraging to hear that the extension of the top-1 saturation framework to top-k tokens, and the observations on untrained transformers, may support a deeper understanding of how these models function. Your insights and feedback have been very helpful in refining the presentation of our findings.

Q1: Regarding the missing values in Table 8, could the authors explain why the classifier was trained on only three tasks? Is this related to the implications concerning model dependence—for example, that smaller models with fewer layers don't exhibit the same number of saturation tasks?

The classifier for the LLaMA3-8B model in Table 8 was trained on only three tasks due to data balancing constraints. After extracting hidden representations and task labels, we selected the number of tasks such that each had at least 10 embeddings per layer and that at least 4 layers were represented for each task. This filtering ensures that the classifier is trained on a representative and balanced dataset across tasks and layers.

We agree that comparing the number of tasks that can be reliably extracted from different models is an interesting avenue for future research, particularly in understanding model-specific dependencies. While it’s tempting to attribute the limitation in LLaMA3-8B to architectural factors such as model depth, the relationship does not appear to be as straightforward. For example, both Mistral-7B and Falcon-8B have the same number of layers (32) as LLaMA3-8B, yet we were able to train a classifier on four tasks for these models using the same underlying dataset (see Table 6).

This suggests that factors beyond layer count—such as differences in pre-training data, tokenization, or implementation details of the self-attention mechanism—may play a role in shaping how task information is represented in different models. We see this as a promising direction for further investigation.

Q2: Could the authors elaborate on how introducing non-linearity undermines the claims of the paper? I would appreciate further clarification on this point.

Our concern with introducing non-linear probes is that they may obscure the interpretability of the results. The goal of our probing analysis is to assess whether the model’s internal representations already contain linearly accessible information about task number. Using a more expressive, non-linear probe introduces the risk that the probe itself is learning complex transformations that construct the desired signal, rather than simply extracting it from the representations. For example, in part-of-speech (POS) tagging, a complex non-linear probe might achieve high accuracy by effectively solving the task itself—even if the model's embeddings don’t linearly encode POS information.

In our previous response, we cited two papers [1, 2] that elaborate on this argument in favor of using simple probing methods. This approach has also been common in recent interpretability work [3–5].
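
For concreteness, a minimal sketch of the kind of simple linear probe argued for here; the data below is a randomly generated stand-in for hidden-state embeddings and their task labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical probe data: 600 hidden states of dimension 768, each labeled
# with the task (1..3) active at the layer where it was extracted.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 768))
y = rng.integers(1, 4, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # ~chance level on random data
```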

Q3: Could the authors please reiterate the context of Table 3? Is the top-2 token (Task 2) changing after saturation, and is that the reason for the observed accuracy shift?

Thank you for raising this question; happy to clarify the intent behind Table 3. The goal of Table 3 is to evaluate the predictive ability of the second-ranked token under two different conditions. Specifically, we compare its accuracy in cases where:

  1. The second token saturates early, i.e., its rank stabilizes i layers before the final output layer (with 2 ≤ i ≤ 6).
  2. The second token does not saturate early, i.e., its rank is only determined in the final layer.

To isolate meaningful differences, we only include examples where the top-1 token is incorrect, and thus the second token has a chance to be the correct next word.

Importantly, the observed accuracy shift is not due to the second-ranked token changing after saturation—in condition (1), its rank is already stable. Instead, the key difference lies in when the rank becomes stable: our results show that when the second-ranked token saturates earlier, it is significantly more likely to be correct. This supports our hypothesis that saturation reflects meaningful intermediate computation performed by the model—computation which we interpret as corresponding to a distinct task in our proposed mechanism. These findings also suggest that saturation-aware decoding strategies could improve next-word prediction by incorporating top-k candidates that stabilize early, even when they are not the model’s top-1 choice.


[1] Hewitt et al., 2019

[2] Belinkov, 2022

[3] Abdelnabi, Sahar, et al. 2024

[4] Levy, A. A., & Geva, M. 2024

[5] Simhi, Adi, et al. 2024

Final Decision

After considering all reviewer comments, author responses and the internal reviewer discussion, I recommend conditional acceptance.

Key Points: Reviewer petj found unclear descriptions of measurements and inconsistencies in reported numbers. While these were mostly addressed, they raised a serious concern (detailed below) but had no strong objections to acceptance provided clarifications are made.

Reviewer 2s7d's concerns were mostly addressed. Their remaining concern about accuracy drops in top-4 and beyond was balanced by their recognition of the paper's contributions: 1) identification of beyond top-1 saturation and 2) introduction of top-k saturation injection. They agreed to raise their score to 3, but share the concern below.

Shared Concern: Reviewers petj and 2s7d noted inconsistencies in reported results for Llama3-8B on 1K MMLU questions:

  • 31% of tokens reach top-1 accuracy (in Table 4 of the paper)
  • top-1 saturates in 65.7% of tokens (response to Q4 of review petj)
  • 99.7% of tokens reach top-1 saturation (the first table in the authors' response to reviewer petj)

This concern was brought up after the author response deadline, during the internal discussion.

Recommendation: Since the authors addressed most concerns and two reviewers recommend acceptance, I recommend acceptance while giving the authors the benefit of the doubt regarding the reporting inconsistencies. This acceptance is conditioned on the following:

  • Authors must clarify the numerical inconsistencies noted above
  • Authors should use the extra page to include all promised clarifications: better descriptions of metric settings, more detailed explanations of the injection method, and the additional reported experimental results and discussions.