Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference
We propose an efficient, training-free prompt compression method that retains key information within long inputs using the evaluator heads we identified in transformer-based LLMs.
Abstract
Reviews and Discussion
This paper introduces EHPC (Evaluator Head-based Prompt Compression), a fast, training-free method for prompt compression and long-context inference acceleration. EHPC is built upon the pre-filling stage of LLMs: it reduces the number of tokens in both the pre-filling and decoding stages as well as the KV cache size, thereby accelerating inference. EHPC identifies a subset of attention heads in transformer-based LLMs (i.e., evaluator heads) that can select the tokens in long inputs that are most significant for inference. This allows LLMs to efficiently "skim" prompts by utilizing only the initial layers containing evaluator heads during the pre-filling stage, retaining only the crucial tokens for downstream inference. EHPC achieves state-of-the-art performance on both prompt compression and long-context inference acceleration benchmarks.
Strengths and Weaknesses
Strengths
- The paper is well-written and easy to follow
- The proposed method applies in the pre-filling stage, which compresses the input prompt before inference
- The proposed model is evaluated on various benchmarks and LLMs, and compared with SOTA methods across multiple metrics, including latency. Results show competitive or superior performance compared with SOTA methods.
Weaknesses
- While the empirical evidence is strong, the paper lacks a deeper theoretical analysis of the evaluator heads.
Questions
EHPC selects heads from the highest-scoring layer; would combining heads from multiple layers help?
Limitations
Yes
Final Justification
After reading the rebuttal responses and other reviews, I keep my original rating.
Formatting Concerns
NA
We thank the reviewer for their insightful and positive feedback. We are encouraged that they found our paper well written and our method effective and adequately evaluated. We answer the questions below.
W1: While the empirical evidence is strong, the paper lacks deeper theoretical analysis of the evaluator heads.
We characterize the properties of the evaluator heads in Section 4, which provides an empirical explanation of the identified heads. We hope these discussions help clarify why our method works.
Q1: EHPC selects heads from the highest-scoring layer; would combining heads from multiple layers help?
The ablation study in Table 4 below shows that selecting the top heads from multiple layers does not improve performance. Moreover, it forces us to compute attention for every chosen layer, which sharply increases latency. We therefore simply select heads from the highest-scoring layer (a brief selection sketch follows the table).
Table 4: Ablation study on single- vs. multi-layer selection. Single layer selects the heads from the layer with the highest score; multi-layer selects the top heads across all layers.
| Dataset | NarrativeQA | qasper | MultiFieldQA-en | HotpotQA | 2WikiMultihopQA | MuSiQue | TREC | TriviaQA | SAMSum | PCount | PRe |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Single layer | 20.98 | 14.38 | 30.5 | 19.35 | 16.23 | 13.02 | 71 | 90.47 | 42.24 | 9.35 | 97.94 |
| Multi layer | 22.05 | 12.93 | 27.47 | 18.39 | 16.73 | 12.54 | 69.78 | 90.65 | 41.58 | 7.25 | 97.75 |
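To make the single- vs. multi-layer choice concrete, here is a small sketch of the two selection strategies (our own illustration, not the authors' code), assuming `head_scores` is a (num_layers, num_heads) matrix of evidence scores from the pilot experiments:

```python
import numpy as np

def select_single_layer(head_scores: np.ndarray, k: int = 8):
    """Top-k heads inside the single highest-scoring layer (only one layer to run)."""
    best_layer = int(head_scores.sum(axis=1).argmax())
    heads = np.argsort(head_scores[best_layer])[-k:]
    return [(best_layer, int(h)) for h in heads]

def select_multi_layer(head_scores: np.ndarray, k: int = 8):
    """Globally top-k (layer, head) pairs; every distinct layer must then be run
    with full attention at compression time, which increases latency."""
    flat = np.argsort(head_scores, axis=None)[-k:]
    return [divmod(int(i), head_scores.shape[1]) for i in flat]
```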
Thank you for the response and for addressing the concerns raised in my review.
Dear Reviewer 1UAM:
We want to express our gratitude for your support and insightful comments.
We will include all the changes in the final revision based on your suggestions. We believe they have greatly increased the quality of our paper.
Best regards,
The Authors of Submission 8522
This paper presents EHPC (Evaluator Head-based Prompt Compression), a novel approach for LLM prompt compression that preserves only the tokens that receive high attention from “evaluator heads” - a set of specific attention heads, usually found in the earlier model layers, that are crucial for focusing on important information in long-context tasks. The method employs a two-pass prefill approach for compression. The first pre-fill pass runs a few initial layers of the LLM (containing the evaluator heads) to compute the importance of each prompt token. This is followed by a second prefill pass over all the layers of the model, but with a subset of the prompt tokens. Extending previous research, the authors show the existence of evaluator heads across models and their generalizability to downstream tasks (when identified using synthetic data). The authors evaluate EHPC in two modes: (1) EMI: using a separate model for prompt compression than the target inference model, (2) NMI: using the same locally deployed model for compression and target inference. Their results show that EHPC outperforms existing prompt compression methods (in both accuracy and system performance) in the EMI mode and is competitive with KV-Cache compression methods in NMI mode.
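For concreteness, here is a minimal Python sketch of the two-pass scheme summarized above. It assumes a Hugging Face-style causal LM with eager attention (so per-layer attention maps are available); `evaluator_layer`, `evaluator_heads`, and `keep_k` are illustrative placeholders rather than the paper's exact configuration, and a real implementation would stop the first pass at the evaluator layer instead of running the full model.

```python
import torch

@torch.no_grad()
def compress_prompt(model, tokenizer, prompt, evaluator_layer, evaluator_heads, keep_k):
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    # Pass 1 ("skim"): read the attention maps of the evaluator heads.
    attn = model(ids, output_attentions=True).attentions[evaluator_layer]   # (1, H, N, N)
    # Token importance = attention received from the final query position,
    # accumulated over the evaluator heads (a simplified stand-in for Eq. 1).
    scores = attn[0, evaluator_heads, -1, :].sum(dim=0)                     # (N,)
    keep = torch.topk(scores, k=min(keep_k, ids.shape[1])).indices.sort().values
    return tokenizer.decode(ids[0, keep], skip_special_tokens=True)

@torch.no_grad()
def ehpc_generate(model, tokenizer, prompt, max_new_tokens=256, **compress_kwargs):
    # Pass 2: ordinary pre-fill + decoding, but over the much shorter compressed prompt.
    short = compress_prompt(model, tokenizer, prompt, **compress_kwargs)
    ids = tokenizer(short, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```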
Strengths and Weaknesses
Strengths
- The results for EHPC in EMI mode are impressive, as it consistently outperforms existing prompt compression methods. Specifically, EHPC is far more efficient with its two-pass approach for inference than other methods, as shown in Table 5
- The authors evaluate their method against strong and plentiful baselines on a variety of tasks for both EMI and NMI evaluations
- The method is original and definitely pushes the field of prompt compression further
Weaknesses
- The analysis around generalizability and robustness of evaluator heads is limited and not convincing enough:
- I would implore the authors to evaluate on more models covering different sizes and architectures, like 30B or 70B models, if feasible, and architectures like MoE models
- The generalizability study in Section 4 does not provide details on how the downstream performance is calculated on the given set of tasks with a subset of the heads. Additionally, all the subsets of tasks shown are from LongBench. It would be nice to see some tasks from RULER as well
- The robustness study does not tell me whether the actual heads identified using different tasks are the same or different. Even if downstream performance does not change, the actual heads identified might be different. This is a crucial distinction between saying the evaluator heads are task agnostic versus evaluator heads identified using task X have a good overlap with evaluator heads of task Y.
- While the downstream results in the EMI mode are impressive, the NMI mode evaluation shows that EHPC might not be the best approach to use when compared to KV-Cache compression methods like SnapKV. While the authors acknowledge this, a discussion on whether this is an inherent issue with prompt compression methods or ways to improve this performance in future work would be helpful.
- EHPC does not apply to decode heavy workloads
Questions
- I understand that the purpose of evaluator and retrieval heads is different, but is there an overlap between them - i.e. are most evaluator heads the same as retrieval heads?
- When identifying evaluator heads, what is the rationale behind extracting only the last row of the attention matrix?
- In the robustness study, can you confirm if the actual evaluator head indices identified over different tasks are the same or have significant overlap?
- Can you provide details on how the downstream performance was measured in the generalizability study with a limited number of attention heads?
- I think the authors should use LongBench v2, which is an updated version of the LongBench dataset with much larger context lengths to show the effectiveness of their method further.
- A small discussion on whether and how the NMI performance can be improved would be helpful.
- In Table 5, the inference time (second column) is different for different prompt compression methods. Why is that the case?
Limitations
Yes
Final Justification
The authors have addressed most of my major concerns:
- Robustness and Generalizability Evaluation: The authors provided an evaluation on Llama2-13b-64k, details on the task evaluation in the generalizability study, and results showing that there is overlap in the evaluator heads identified over different tasks. While it is debatable whether the overlap is significant in certain cases, and I would like to see a deeper evaluation in the final version, this suffices for now.
- EHPC's applicability to decode heavy workloads: The authors provide results on a few generative tasks (not very decode-heavy). I think it is fine, as the main focus of the paper is long prompt compression.
Apart from this, I am satisfied with the answer to all other minor questions I had. I maintain my score of accept.
Formatting Concerns
N/A
We thank the reviewer for their constructive and positive feedback. We are encouraged that they found our method efficient, adequately evaluated, and original. We answer specific questions in the following and will incorporate all feedback in the final version.
W1.1: I would implore the authors to evaluate on more models covering different sizes and architectures, like 30B or 70B models, if feasible, and architectures like MoE models
Thank you for the valuable feedback. We include the properties of the evaluator heads on Llama2-13b-64k below. If we obtain more computational resources, we will include larger models and MoE models.
- Existence
The score ratio of the selected layer is 70%, and the score ratio of the selected heads within that layer is 55%. This shows that our selected evaluator heads have markedly high scores.
- Generalizability
We present the results of evaluator heads on real-world tasks in Table 2 below. The findings indicate that NMI (Llama2 13B 64K) effectively understands the compressed data. However, due to limited memory, we had to truncate the context length to 8K, which resulted in relatively poor performance.
Table 2: Results on real-world tasks from LongBench. For Llama2-13b-64k, we truncate the context length to 8k.
| Dataset | narrativeqa | musique | TREC | repobench-p |
|---|---|---|---|---|
| Llama2-13b-64k | 14.2 | 7.1 | 70 | 44.75 |
- Robustness
Llama2-13b-64k:
- Layer of evaluator heads: layer 13
- Needle QA: [23, 12, 13, 14, 24, 21, 26, 17]
- Variable Tracking (reasoning): [26, 21, 17, 23, 24, 14, 36, 11]
- Common heads: [14, 17, 21, 23, 24, 26]

Due to limited memory, we only ran Needle QA and Variable Tracking. The results show that the heads found by these two tasks share 6/8 common heads.
W1.2: The generalizability study in Section 4 does not provide details on how the downstream performance is calculated on the given set of tasks with a subset of the heads. Additionally, all the subsets of tasks shown are from LongBench. It would be nice to see some tasks from RULER as well.
- Sure. Below we spell out the exact pipeline we used to evaluate downstream performance (a minimal code sketch follows this response):
  - Given a prompt and a set of heads, we first compute the attention scores of these heads.
  - Following Eq. 1 in the paper, we compute a score for every token in the prompt, then select the top-k tokens to form the compressed prompt.
  - The compressed prompt is fed to the LLM for inference, and task-specific metrics are computed on the model's output. These baseline tasks are discussed in Appendix H.
- Regarding RULER: for the generalizability study we aim to evaluate on real-world tasks, whereas the tasks in RULER are synthetic. We also include tasks from Infinity Bench [2] in Table 4 (Section 4, main paper).
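A minimal sketch of this evaluation loop is below; `compress_with_heads`, `dataset`, and `metric_fn` are hypothetical names standing in for the Eq. 1 scoring plus top-k token selection and the task harness described above, not the paper's actual code.

```python
def evaluate_head_subset(model, tokenizer, layer, heads, dataset, metric_fn, keep_k=2048):
    preds, refs = [], []
    for example in dataset:                              # e.g., one LongBench subtask
        short = compress_with_heads(model, tokenizer, example["prompt"],
                                    layer=layer, heads=heads, keep_k=keep_k)
        ids = tokenizer(short, return_tensors="pt").input_ids.to(model.device)
        out = model.generate(ids, max_new_tokens=64)
        preds.append(tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
        refs.append(example["answer"])
    return metric_fn(preds, refs)                        # task-specific metric (F1, ROUGE, ...)
```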
W1.3: The robustness study does not tell me whether the actual heads identified using different tasks are the same or different. Even if downstream performance does not change, the actual heads identified might be different. This is a crucial distinction between saying the evaluator heads are task agnostic versus evaluator heads identified using task X have a good overlap with evaluator heads of task Y.
Llama 3.1 8B Instruct:
- Layer of evaluator heads: layer 13
- Needle QA: [1, 4, 8, 11, 13, 17, 18, 21]
- Variable Tracking (reasoning): [1, 2, 8, 9, 11, 13, 18, 21]
- Code completion: [3, 4, 5, 6, 13, 18, 21, 27]
- Common heads: [8, 13, 18, 21]

Mistral-Nemo-Instruct-2407:
- Layer of evaluator heads: layer 14
- Needle QA: [10, 11, 25, 8, 29, 24, 17, 9]
- Variable Tracking (reasoning): [17, 25, 11, 10, 16, 0, 30, 29]
- Code completion: [29, 25, 17, 30, 10, 11, 8, 28]
- Common heads: [10, 11, 17, 25, 29]
Thank you for the question. The above results show that:
- For Llama 3.1 8B Instruct, 4 of the 8 selected heads are common across the three tasks.
- For Mistral-Nemo-Instruct-2407, 5 of the 8 selected heads are common across the three tasks.
W2: A small discussion on whether and how the NMI performance can be improved would be helpful.
Thank you for the valuable suggestion. We provide the discussion below and have included this as future work in the revised manuscript.
Table 6 (main paper) demonstrates that NMI struggles with few-shot learning and code tasks, which demand a high level of context integrity. Existing compression methods prioritize semantic salience at the expense of fluency and structure, leading to downstream accuracy drops. Preserving both structure and fluency during compression is therefore a critical direction for future work.
W3: EHPC does not apply to decode heavy workloads
We provide the results on Infinity Bench (Table 3 below), which covers 32k-input summarization and long-form generation tasks. Tasks involving heavier decode workloads are left for future work.
Q1: I understand that the purpose of evaluator and retrieval heads is different, but is there an overlap between them - i.e. are most evaluator heads the same as retrieval heads?
Our evaluator heads are all selected from a single layer, whereas retrieval heads lie in different layers. Consequently, the two sets are almost disjoint; on Llama-2-7B-80K and Llama2-13b-64k we observe zero overlap between the heads reported in [1] and the ones we identified as evaluators.
Q2: When identifying evaluator heads, what is the rationale behind extracting only the last row of the attention matrix?
We follow common practice and use the last row of the attention matrix as the scoring vector for each token. This row contains the attention scores that the final (query) token assigns to all earlier positions, which is exactly the distribution the model itself uses for next-token prediction. Treating these scores as the token’s “importance” therefore aligns our evaluator with the model’s own generative signal.
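As a toy illustration of this point (our own example with a single head and random tensors, not the paper's code): the last row of a causal attention map is simply the softmax of the final token's query against all keys, i.e., the distribution the model consults for next-token prediction.

```python
import torch

seq_len, d = 5, 16
q_last = torch.randn(1, d)        # query of the final prompt token
K = torch.randn(seq_len, d)       # keys of all prompt tokens
# Last row of the attention matrix: attention the final query pays to every position.
last_row = torch.softmax(q_last @ K.T / d ** 0.5, dim=-1)   # shape (1, seq_len)
# Equivalent to attn[head, -1, :] of the full (seq_len, seq_len) map; this row is
# read off and treated as a per-token importance signal.
```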
Q3: In the robustness study, can you confirm if the actual evaluator head indices identified over different tasks are the same or have significant overlap?
Sure, see response to W1.3.
Q4: Can you provide details on how the downstream performance was measured in the generalizability study with a limited number of attention heads?
See the response to W1.2.
Q5: I think the authors should use LongBench v2, which is an updated version of the LongBench dataset with much larger context lengths to show the effectiveness of their method further.
Thank you for the valuable suggestion. We are still working on evaluating EHPC on LongBench v2 and hope to include more results in the final version. To validate our performance at larger context lengths, we provide 32k-context results on Infinity Bench. The results in Table 3 show that our NMI setting achieves consistently strong results compared with SnapKV.
Table 3: Results on Infinity Bench. We evaluate SnapKV and our NMI setting at a 32k token length.
| Model | Truncate | Method | Passkey | NumStr | LongBook Sum | CodeDebug | LongDialog QA | LongBook QA | LongBook Choice | MathFind | KV Retrieval | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama | 32k | SnapKV | 27.1 | 26.6 | 26.5 | 2.3 | 7.0 | 14.8 | 56.3 | 21.3 | 5.0 | 20.8 |
| | | EHPC | 27.1 | 27.1 | 18.8 | 27.4 | 3.0 | 16.9 | 45.9 | 22.3 | 17.6 | 22.9 |
| Phi | 32k | SnapKV | 27.1 | 10.3 | 23.4 | 15.5 | 7.3 | 9.5 | 43.2 | 30.6 | 0.4 | 18.6 |
| | | EHPC | 27.1 | 27.1 | 24.6 | 16.2 | 8.0 | 11.2 | 55.5 | 34.0 | 25.8 | 25.5 |
Q6: A small discussion on whether and how the NMI performance can be improved would be helpful.
See response to W2.
Q7: In Table 5, the inference time (second column) is different for different prompt compression methods. Why is that the case?
In the second column, 2048 tokens is the target prompt length. However, the prompts produced by these baseline compression methods often do not contain exactly 2048 tokens, resulting in varying inference times.
[1] Retrieval Head Mechanistically Explains Long-Context Factuality, ICLR, 2025.
[2] ∞Bench: Extending Long Context Evaluation Beyond 100K Tokens, ACL, 2024.
I thank the authors for responding to my concerns. Most of my concerns have been addressed. I maintain that EHPC is an impressive step forward in the prompt compression space, with the EMI mode outperforming existing prompt compression methods. However, there is still work to be done to improve the NMI mode, and the rebuttal results showing that evaluator heads may not be task-agnostic demand further study into the generalizability of the method (4/8 common selected heads is not, in my opinion, a high enough ratio to call them task-agnostic). With all these considerations in mind, I will maintain my score of accept.
Dear Reviewer Dg6c:
We want to express our gratitude for your support and insightful comments.
We will include all the changes in the final revision based on your suggestions. We believe they have greatly increased the quality of our paper. Regarding the generalizability of the evaluator heads, we will treat it as a direction for further study.
Best regards,
The Authors of Submission 8522
This paper introduces EHPC (Evaluator Head-based Prompt Compression), a novel training-free method for compressing long input prompts. EHPC leverages specific attention heads in transformer-based large language models (LLMs), termed evaluator heads, that identify and retain the most important tokens for inference. EHPC accelerates long-context processing by skimming input prompts during the pre-filling stage, passing only critical tokens to subsequent layers. The authors analyze the complexity of EHPC and investigate the existence, generalizability and robustness of EHPC. Experimental results show that EHPC achieves state-of-the-art performance in both prompt compression and long-context inference acceleration while reducing computational costs and API call expenses.
Strengths and Weaknesses
Strengths
- The identification of evaluator heads as key components for token selection is a compelling contribution.
- Unlike methods requiring fine-tuning or auxiliary models, EHPC works out-of-the-box, making it easily applicable to existing LLMs without additional overhead.
- The paper demonstrates SOTA performance on standard benchmarks, showcasing EHPC’s effectiveness in both compression quality and inference acceleration.
Weaknesses
- The methodology part is somewhat confusing. For instance, when detecting evaluator heads in Section 3.4, the authors use only the last row of the attention matrix, but all the rows are included in the calculation of utility scores (Equation 1). Moreover, some notations contain mistakes (e.g., in line 157).
- The differences between EHPC and existing attention-based compression methods (e.g., SnapKV) are not clearly discussed. Therefore, the novelty of the method is not very clear.
Questions
- Does NMI perform better than EMI when the computation budget is controlled the same?
Limitations
yes
Final Justification
I will increase the score as my major concerns have been addressed.
Formatting Concerns
No concern.
We thank the reviewer for their insightful and positive feedback. We are encouraged that they found our method compelling, easily applicable, and effective. We answer specific questions below and will incorporate all feedback in the final version.
Q1: Does NMI perform better than EMI when the computation budget is controlled the same?
Thank you for the insightful question. We conducted experiments in which each model performs inference on the compressed prompts, reported in Table 1 below. The diagonal entries correspond to NMI and the off-diagonal entries to EMI. On average, NMI is better than EMI, since the compressing and inferring model share the same attention patterns.
Table 1: Averaged results with prompts compressed to 2048 tokens under different compression/inference pairings.
| Model | Llama3 inference | Phi inference |
|---|---|---|
| Llama3 compress | 37.9 (NMI) | 36.9 (EMI) |
| Phi compress | 35.7 (EMI) | 39.0 (NMI) |
W1: The methodology part is somewhat confusing. For instance, when detecting evaluator heads in Section 3.4, the authors use only the last row of the attention matrix, but all the rows are included in the calculation of utility scores (Equation 1). Moreover, some notations contain mistakes (e.g., in line 157).
- Thank you for catching this; the typo in Line 157 has been fixed in the revised manuscript.
- In Equation 1, we select the last few rows, from N_r to N, where N is the prompt length and N_r is a parameter controlling the number of selected rows. Selecting the last few rows generalizes selecting only the last row; previous research [1] has shown that this captures more information.
- When detecting evaluator heads in the pilot experiments, we select only the last row (Section 3.4) for convenience. When computing scores with the evaluator heads, we select the last few rows as in Equation 1 for stronger performance (sketched below).
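A brief sketch of this pooled variant (our illustration; max pooling over the last N_r rows and a sum over the evaluator heads are assumptions consistent with the description above, not necessarily the exact operators of Eq. 1):

```python
import torch

def token_scores(attn, evaluator_heads, n_rows):
    # attn: (num_heads, N, N) attention map of the evaluator layer
    rows = attn[evaluator_heads, -n_rows:, :]   # queries N_r..N attending to all tokens
    pooled = rows.max(dim=1).values             # pool over the selected rows
    return pooled.sum(dim=0)                    # accumulate over evaluator heads -> (N,)
```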
W2: The differences between EHPC and existing attention-based compression methods (e.g., SnapKV) are not clearly discussed. Therefore, the novelty of the method is not very clear.
Thank you for highlighting the need for a clearer distinction. We provide two key differences below and have included them in the revised manuscript.
- SnapKV and other implicit methods retain the original prompt length and only sparsify the KV cache after computing the full attention map. EHPC, in contrast, removes tokens to shorten the prompt itself; the compressed prompts can therefore be reused with other language models, which SnapKV and other implicit methods cannot do.
- We design pilot experiments to find the evaluator heads, improving compression quality and reducing compression time. In Table 6, we also empirically compare our method with acceleration methods under the same KV cache memory usage.
[1] SnapKV: LLM Knows What You Are Looking for Before Generation, NeurIPS, 2024
Dear Reviewer iPeJ:
We want to express our gratitude for your support and insightful comments.
We will include all the changes in the final revision based on your suggestions. We believe they have greatly increased the quality of our paper.
Best regards,
The Authors of Submission 8522
Thank you for your reply. I will increase the score as my major concerns have been addressed.
The paper proposes Efficient Prompt Compression (EPC), a method to reduce the cost and latency of large language model (LLM) inference when dealing with long prompts. EPC trains a small prompt compressor network to transform long input prompts into compact representations. Experiments on multiple benchmarks (QA, summarization, reasoning) show reduced latency and token usage with minimal accuracy drop.
Strengths and Weaknesses
Strengths
- Addresses the real and growing cost problem in LLM applications that use long context prompts.
- Demonstrates large token savings and latency reduction without heavy retraining of the base LLM.
- Applicable across tasks (QA, summarization, reasoning) without task-specific engineering.
Weaknesses
- While EPC is compared with some baselines, the evaluation omits certain recent context compression techniques like retrieval-augmented methods or structured summarization models.
- In high-compression regimes, some performance drop is visible, but the paper doesn't deeply analyze what kinds of information get lost.
- The cost and data requirements for training the compressor are not sufficiently discussed.
Questions
- How well does the compressor work on tasks or domains different from the training set?
- How sensitive is EPC performance to the size of the compressor model?
- Results show a promising trade-off between compression ratio and accuracy, but more variety in datasets and ablations on compressor architecture would strengthen the claims.
Limitations
Yes
Final Justification
I appreciated the paper's contributions and authors' responses.
Formatting Concerns
None
We thank the reviewer for their overall positive feedback. After carefully reading the comments, we noticed that some of the raised points (e.g., Seer’s effect on end-task performance) do not seem to apply to our submission. We suspect a possible mix-up with another paper’s review. Nevertheless, we have addressed every question below to the best of our understanding. We would be grateful if the reviewer could share any clarifying remarks during the discussion period, and we are happy to provide further details as needed.
- Adaptation to vision-language models or audio-augmented transformers where KV size and reuse patterns.
Reducing the token load for multi-modal models is an important and pressing issue. While the current paper focuses on LLMs, we have not yet had the opportunity to conduct experiments on models such as vision-language and audio-augmented transformers. We are optimistic that the proposed method has potential for such multi-modal models, and we hope to report more definite results in future research.
- Adaptive weights in the score function.
Thank you for the question; however, we are not sure whether the reviewer is referring to the utility scores in Eq. (1). The score is computed with a pooling operation, and the set of evaluator heads is chosen according to the accumulated evidence score in Eq. (2). We select the top-ranked heads as described at line 228, where we bound the number of selected heads.
- End-task performance (e.g., summarization ROUGE scores, QA accuracy) beyond token-level match.
Thank you for the question. In line with previous research, we have conducted experiments on the LongBench benchmark, which involves six tasks: single-document QA, multi-document QA, summarization, few-shot learning, code completion, and synthetic tasks for retrieval and counting. The results with respect to task performance can be found in Tables 4 and 5 (main paper).
- Incorporation of existing inference accelerations like speculative decoding, FlashAttention, and prefill skipping.
Thank you for the insightful suggestion. Prompt compression is orthogonal to existing inference-acceleration techniques, so the two can be readily combined: after our method shortens the prompt, any subsequent speed-up (e.g., speculative decoding, FlashAttention, prefill skipping) can be applied whether the serving model is the same or different, and whether it runs in the cloud or on the edge. Table 6 already shows that our approach yields the best average performance under KV cache constraints; coupling it with these orthogonal optimizations will therefore amplify the gains at inference time. We will incorporate this discussion in the final version.
- Performance across model sizes and the relationship between cache threshold and model size.
Thank you for the valuable suggestions. However, supporting larger models involves substantial work and time; we are still working on this. If we obtain additional results, we will include them in our revised manuscript.
Thank you for revising your review and completing the required acknowledgement.
Upon reading your comments, we noticed that you now refer to “LoRAPrompt,” “a method for compressing soft prompts using LoRA to reduce memory and storage overhead in parameter-efficient tuning.” This technique does not appear in our paper, so we believe these remarks may have been intended for a different submission.
If you could kindly confirm whether this is the case or indicate which part of our manuscript needs clarification, we will gladly address any remaining concerns.
This paper introduces Evaluator Head-based Prompt Compression (EHPC), a training-free method for compressing long input prompts in large language models (LLMs). EHPC identifies "evaluator heads" — specific attention heads in early transformer layers that focus on crucial information — and uses them to retain only important tokens. This strategy reduces computational costs, accelerates inference, and lowers API usage expenses. The idea of leveraging evaluator heads for token selection is novel, pushing the frontier of prompt compression research. EHPC is training-free and model-agnostic, making it easy to apply to existing LLMs without extra overhead. Extensive experiments demonstrate strong and often state-of-the-art performance across multiple tasks.
All reviewers agree that the paper addresses an important and timely problem: efficient LLM inference with long prompts. The contributions are novel, the empirical results are strong, and the method is practical and impactful. The consensus is that this paper makes a valuable contribution to the area of efficient LLM deployment. I therefore recommend acceptance.