PaperHub
Overall score: 7.3/10
Poster · 4 reviewers
Ratings: 5, 4, 4, 5 (min 4, max 5, std 0.5)
Average confidence: 3.8
Novelty: 3.0 · Quality: 3.3 · Clarity: 3.3 · Significance: 3.0
NeurIPS 2025

KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords

Large Language Models (LLM) · Key-Value Cache (KV Cache) · KV Cache Eviction · Token Eviction · Efficient LLM Inference

Reviews and Discussion

Official Review
Rating: 5

This paper proposes KEYDIFF, a training-free policy for evicting KV-cache entries when running large language models under limited memory and resource budgets. The key insight is that keys that are geometrically distinct—those with low cosine similarity to the rest of the cache—tend to receive higher attention weights, so preferentially retaining these keys preserves model performance. Algorithmically, the method computes the mean of all current keys as an anchor, scores each key by its negative cosine similarity to this anchor, and keeps the top-N least-similar keys, yielding an efficient O(N) procedure. Extensive experiments demonstrate that KEYDIFF outperforms existing baselines in both latency and memory consumption while maintaining accuracy.
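The scoring rule described in this summary can be sketched in a few lines (an illustrative NumPy sketch of the idea, not the authors' code; the array shapes and the `budget` parameter are assumptions):

```python
import numpy as np

def keydiff_keep_indices(keys: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` keys to retain for one attention head.

    keys: (N, d) array of cached key vectors.
    Each key is scored by its negative cosine similarity to the mean-key
    anchor; the least-similar (most geometrically distinct) keys are kept.
    """
    anchor = keys.mean(axis=0)                              # mean-key anchor
    anchor = anchor / np.linalg.norm(anchor)
    unit_keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    scores = -(unit_keys @ anchor)          # higher = more distinct; O(N)
    keep = np.argsort(scores)[-budget:]     # top-`budget` scoring keys
    return np.sort(keep)                    # restore temporal order
```

Because the score depends only on the keys themselves, it never touches attention weights and therefore composes with FlashAttention, as the review notes.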

Strengths and Weaknesses

Strengths:

  1. The idea is novel. It requires only cosine distance and is independent of the attention implementation, so it works with FlashAttention.
  2. Clear proof supporting the key insight.
  3. The benchmarks cover the essential aspects.
  4. Linear-time solution that is easy to apply.

Weaknesses:

  1. Limited architecture coverage; there is no evidence of effectiveness on MoE or encoder-decoder models.
  2. The method needs testing in resource-constrained environments.
  3. The anchor sensitivity at low cache budgets should be shown.

Questions

  1. For the anchor update schedule, do you recompute the anchor after each eviction? What is the incremental overhead?

  2. How will KEYDIFF perform if we use quantized keys? Does cosine similarity remain reliable?

  3. In chat mode, how does sliding-window KEYDIFF choose the window size?

  4. Could KEYDIFF combine with RAG? In this case, the cache contains both prompt and retrieved chunks.

  5. How does KEYDIFF perform on mobile devices, given that they are resource-constrained environments?

Limitations

  1. Only decoder-only models are tested.
  2. Benchmarks assume ample GPU bandwidth; edge-class DRAM and cache behavior are not tested.
  3. KEYDIFF ranks keys on a per-layer basis but not per head; if one head relies on a subset of highly self-similar keys, those keys might be evicted.

Final Justification

The rebuttal addresses part of my concerns; however, some experiments are difficult to conduct within the short timeframe, so I will retain my score.

Formatting Concerns

No concerns.

Author Response

Thank you for your insightful feedback. We have responded to each of your comments below:

Regarding the incremental overhead of anchor updating: a naive approach would simply recompute the anchor from the current KV cache on each pass through the model. As you suggest, an incremental approach is also possible. If we choose the key mean as the anchor, receive a single new key $K_c$, and evict a single key $K_e$, we can rescale the new and evicted keys and then add and subtract them appropriately:

$$K_A' = K_A + \frac{K_c}{N} - \frac{K_e}{N}$$

A similar relation holds during block prompt processing. This means the incremental anchor update complexity is proportional to the number of new keys plus the number of evicted keys.
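The incremental update above can be checked numerically with a short sketch (illustrative only; it assumes the cache is held at a fixed size N and that the anchor is the mean of all cached keys, as in the reply):

```python
import numpy as np

def update_anchor(anchor: np.ndarray, k_new: np.ndarray,
                  k_evicted: np.ndarray, n: int) -> np.ndarray:
    """Incremental mean-key anchor update:
    K_A' = K_A + K_c / n - K_e / n, for a cache held at fixed size n."""
    return anchor + k_new / n - k_evicted / n

# Sanity check against recomputing the mean from scratch.
rng = np.random.default_rng(0)
cache = rng.normal(size=(8, 4))           # 8 cached keys of dimension 4
anchor = cache.mean(axis=0)
k_new, k_out = rng.normal(size=4), cache[3]
new_cache = np.vstack([np.delete(cache, 3, axis=0), k_new])
assert np.allclose(update_anchor(anchor, k_new, k_out, 8),
                   new_cache.mean(axis=0))
```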

Regarding KeyDiff performance with quantized keys: preliminary internal analysis does show that cosine similarity remains reliable in downstream benchmarks. However, we do not have a comprehensive analysis of quantized eviction performance at this time.

Regarding the sliding window size appropriate for chat mode: our goal was to address the lack of recency bias in KeyDiff that is present in attention-based methods and show that the sliding window corrects for this. We believe the optimal window size would be application-dependent and likely needs domain-specific experimentation and benchmarking to determine the correct window size.

Regarding RAG integration: yes, KeyDiff can work with RAG. Prompt and RAG tokens could be naively combined and processed as a single long input prompt. KeyDiff could potentially be amended to retain prompt tokens and only evict RAG tokens to improve instruction-following.

Regarding on-device performance: we were unable to properly benchmark a full LLM inference pipeline comparable to the existing NVIDIA GPU TTFT measurements during the rebuttal window. However, we anticipate that performance will behave similarly, assuming each KV cache eviction scheme introduces no additional overhead on-device. To measure key-scoring overhead on an edge device, we present the latency of computing the eviction scores of KeyDiff, TOVA, SnapKV, and H2O on a recent Samsung Android device, relative to the KeyDiff scoring latency with a cache budget of 512 KVs. Slight runtime variations can be attributed to kernel optimizations and proprietary hardware specifications.

| Method \ Cache budget (KVs) | 512 | 1024 | 2048 | 4096 | 8192 |
|---|---|---|---|---|---|
| SnapKV | 1.92 | 2.46 | 4.50 | 8.94 | 11.42 |
| TOVA | 1.02 | 0.94 | 1.25 | 1.95 | 3.06 |
| KeyDiff | 1.00 | 1.03 | 1.01 | 1.07 | 1.37 |
| H2O | 1.10 | 0.99 | 1.52 | 2.32 | 4.23 |

We can see that KeyDiff matches competing methods at small cache sizes and outperforms them at larger cache sizes by a wide margin.

Regarding per-layer vs. per-head eviction: KeyDiff does in fact rank and evict keys on a per-head basis. This allows KeyDiff to attain excellent benchmark performance.

Comment

Thank you for the clarifications. I understand that some of the requested experiments (e.g., full on-device benchmarks, quantized evaluations, RAG integration) are difficult to complete within the rebuttal period. I will retain my original score.

Official Review
Rating: 4

To address the challenges of existing attention-based KV cache eviction methods, this work proposes a KV cache eviction method based solely on key similarity, named KeyDiff. KeyDiff minimizes the pairwise cosine similarity among retained keys, thereby maximizing the diversity of the compressed KV cache. Extensive experiments validate that KeyDiff achieves significant advantages in both accuracy and inference efficiency.

Strengths and Weaknesses

Strengths:

  1. The motivation of this work is demonstrated clearly and is easy to follow.
  2. The experiments are extensive and demonstrate the effectiveness and efficiency of the proposed method.

Weaknesses:

  1. The complete evaluation combining different KV eviction methods with the 'sliding window' strategy is lacking. KeyDiff with a 'sliding window' achieves better performance on reasoning and code tasks, while the effects of combining other methods (TOVA, Sink, SnapKV) with a 'sliding window' remain unclear.
  2. The discussion on the ‘Attention-Based Token Eviction Challenges’ in Section 2.4 is more like a general challenge associated with block-wise processing strategies. How does replacing attention with cosine similarity specifically address this challenge?
  3. The referencing of experimental figures and tables is poorly organized, compromising the article's coherence and readability.

Questions

See above.

Limitations

Yes

Final Justification

In the rebuttal phase, the authors further clarified my remaining questions, particularly regarding paper formatting and the superiority of the cosine similarity-based eviction scheme. I suggest the authors supplement the final version with the experimental results promised in the reply. I will maintain my rating of 'Borderline accept'.

Formatting Concerns

No

Author Response

Thank you for your helpful comments. We have replied to each of them below.

Regarding a more comprehensive analysis of the hybrid attention + sliding window eviction strategy: we do not have a comprehensive analysis of each strategy on LongBench at this time, but we can include such results in the final version of the paper. Attention-based eviction methods tend to have an implicit recency bias [2], which gives them an advantage compared to KeyDiff, which strictly looks at key similarity and ignores temporality. We can also state this motivation for adding the sliding window more clearly in the main text of the final version.

Regarding replacing attention-based eviction with cosine similarity-based eviction: attention-based eviction is limited to the tokens present in the current block when evicting; it cannot know future attention scores based solely on the current block. By using cosine similarity instead, we obtain an eviction criterion that is independent of future tokens (i.e., key similarity) while remaining strongly correlated with future attention scores. Section 3.1 describes more precisely how we address the challenge presented in Section 2.4.

Regarding the paper organization: we will do our best to reorganize the paper layout for the final submission to reduce the number of forward references and disruptions to the reading flow.

[2] Oren, Matanel, et al. "Transformers are multi-state rnns." arXiv preprint arXiv:2401.06104 (2024).

Comment

I appreciate the author’s comprehensive reply. All of my concerns have been addressed, and I do not have any remaining questions. I will retain my initial score (Borderline accept).

Official Review
Rating: 4

The authors propose a training-free KV-cache eviction policy that scores tokens purely by how geometrically distinctive their keys are (low mean cosine similarity to other cached keys), which shows advantages in the chunk-prefill setting.

This observation that distinctive keys tend to attract high attention regardless of the current query lets KeyDiff reduce peak memory of prefilling with higher accuracy, which is critical for serving/edge deployments where the peak memory of a full-prompt prefill would exceed RAM budgets.

Strengths and Weaknesses

Strengths:

  • Well-motivated setting: Chunked prefill under a hard KV budget is important for both serving and memory-constrained on-device LLM inference.
  • Simple but novel insight: Linking low pairwise cosine similarity to high long-range importance is intuitive yet, to my knowledge, new and is backed by both visualization and a neat geometric bound.

Weaknesses:

  • NIAH still lags the no-evict baseline: Figure 6 shows recall tapering for deeper needles; although KeyDiff beats other evictors, it is still materially below full-cache accuracy for long docs.
  • Missing harder retrieval tasks: The paper lacks harder benchmarks such as RULER, which stress retrieval of very rare tokens; this raises concern about whether the claimed property always holds.

Questions

  • How does KeyDiff perform on RULER or similarly harder long-context datasets?
  • The authors mention extending KeyDiff to MLA. In MLA, all heads share the same low-rank token embedding E, with different Keys computed through different K projections. Dropping a token means discarding E and thus removing that token’s keys from every head, even if it is distinctive for only some heads. How will your method handle this head-coupling constraint?
  • The authors' “distinct keys ⇒ high future attention” assumption is intriguing, but will it hold on tasks where most tokens are already mutually dissimilar? When many keys score as “distinct,” KeyDiff may have no reliable ranking signal and could evict truly important tokens. Have you evaluated such scenarios, and how does the method safeguard against this failure mode?

Limitations

Yes.

Final Justification

The paper proposes a simple but novel insight: distinctive keys (those with low cosine similarity to other keys) tend to attract high attention regardless of the current query, which is useful for chunked prefilling. I appreciate the authors' comprehensive reply. All of my concerns have been addressed. I will raise my score to 4 (Borderline accept).

Formatting Concerns

No.

Author Response

Thank you for your helpful feedback. We have responded to each of your comments below.

Regarding NIAH performance of KeyDiff: Thank you for your helpful comment; it prompted us to revisit the NIAH results. We re-ran our tests over more trials (increasing from the average of five random NIAH trials to fifty) to smooth out any noise in the tests. KeyDiff performs significantly better, and we plan to update the final paper with these results.

To summarize our findings here: below, we show the difference in NIAH scores between the eviction-free baseline and KeyDiff with a KV cache of 6k (averaged over fifty samples), i.e., $\mathrm{NIAH}_{\mathrm{KeyDiff}} - \mathrm{NIAH}_{\mathrm{Full}}$. KeyDiff performs similarly to the eviction-free baseline for context lengths 1000 and 4222, where no eviction occurs, and performs roughly on par with the baseline at longer contexts. A key takeaway from this table is that the eviction-free baseline also cannot solve NIAH problems with 100% accuracy, so matching the model's NIAH performance is the best any eviction method can hope to achieve.

| Depth \ Context Length | 1000 | 4222 | 7444 | 10667 | 13889 | 17111 | 20333 | 23556 | 26778 | 30000 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.02 | -0.06 | -0.06 | 0.2 | 0.16 | -0.16 | 0.18 | 0.2 | 0.06 | 0.04 |
| 11.0 | 0.1 | -0.04 | -0.04 | 0.3 | 0.28 | -0.18 | 0.26 | 0.28 | 0.1 | -0.08 |
| 22.0 | 0.04 | -0.12 | 0.0 | 0.32 | 0.3 | -0.18 | 0.16 | 0.28 | 0.14 | -0.04 |
| 33.0 | 0.02 | 0.06 | -0.2 | 0.34 | 0.08 | -0.16 | 0.26 | 0.26 | 0.2 | -0.16 |
| 44.0 | -0.06 | -0.08 | -0.16 | 0.3 | 0.26 | -0.12 | 0.24 | 0.22 | -0.1 | 0.04 |
| 56.0 | -0.06 | 0.0 | -0.24 | 0.26 | 0.26 | -0.2 | 0.18 | 0.14 | 0.04 | -0.02 |
| 67.0 | 0.02 | 0.02 | -0.14 | 0.36 | 0.28 | -0.26 | 0.24 | 0.16 | 0.04 | -0.08 |
| 78.0 | -0.1 | 0.08 | -0.26 | 0.26 | 0.3 | -0.26 | 0.2 | 0.3 | 0.02 | -0.04 |
| 89.0 | 0.02 | 0.08 | -0.06 | 0.3 | 0.26 | -0.32 | 0.3 | 0.22 | 0.06 | -0.12 |
| 100.0 | 0.0 | -0.06 | -0.26 | 0.0 | -0.06 | -0.02 | 0.0 | -0.04 | 0.14 | -0.04 |

Regarding evaluation on the RULER benchmark: while we don't have our own results comparing KeyDiff with competing baselines on the RULER benchmark, we would like to highlight that NVIDIA's KVPress package on GitHub (and the corresponding kvpress-leaderboard on Huggingface) has recently adopted KeyDiff. KVPress demonstrates KeyDiff's competitive performance on RULER and thus provides independent, third-party validation of our method. Note that we are not affiliated with NVIDIA or KVPress in any way.

According to their evaluation -- including RULER as a subset -- KeyDiff achieves higher leaderboard scores than other eviction methods such as TOVA, SnapKV, QFilter, and Knorm, with a clear margin for Llama-3.1-8B-instruct and Qwen3-8B. Although KVPress benchmarking is done without applying the proposed block prompt processing (BPP), KeyDiff still outperforms competing methods. According to our results, KeyDiff performs best under the memory-constrained BPP setup; therefore, we expect a greater performance gap in KeyDiff's favor when the methods are evaluated with BPP enabled.

We also want to note that KeyDiff appears ranked fourth in the plot of KVPress. However, the methods ranked above KeyDiff are meta-heuristic approaches (e.g., DuoAttention, AdaKV, or SnapKV with Compressed Question) built upon an underlying score-based eviction scheme. KeyDiff is fully compatible with these meta-heuristics. Consequently, as a base scoring method, KeyDiff remains a strong KV cache eviction method.

Regarding the extension of KeyDiff to MLA: we believe that, with the exception of the first decoder layer, the geometric observations underpinning KeyDiff are intrinsic to each layer's token embeddings, since $W_K$ is just a linear projection. The challenge is how to analyze the RoPE operator, since it is clear that RoPE necessarily introduces some temporal bias while allowing keys to cluster together on a per-head basis. Evicting token embeddings per-layer without performance degradation is the main challenge for future work.

Regarding adversarial evaluation: we would like to emphasize that KeyDiff operates on keys directly rather than on tokens, i.e., after we have applied RoPE and the learned projection $W_K$ to the input of a decoder block. This, of course, makes it tricky to construct an input prompt that will cause KeyDiff to fail: as one might infer from Figure 4, the keys and queries tend to cluster together regardless of the input prompt. Lemma 3.1 shows that attention scores are bounded by a function of the corresponding cosine similarity values, assuming a bounded key norm. If one can construct an input prompt that forces certain keys to have a massive norm relative to those already in the cache, KeyDiff will likely evict incorrect keys. However, we are not aware of a reliable method to construct such a prompt across all heads and layers.

Comment

I appreciate the author’s comprehensive reply. All of my concerns have been addressed. I will raise my score to 4 (Borderline accept).

Official Review
Rating: 5

Long-context inference has become increasingly common. Due to resource constraints, the prefill must be chunked into small segments, which requires sparse attention algorithms to make decisions without access to the full context. This paper proposes a KV cache eviction policy based on the cosine similarity of Key vectors. By identifying Key vectors that significantly deviate from the mean, KeyDiff effectively retains important tokens. Evaluation shows that KeyDiff achieves high accuracy while also improving performance.

Strengths and Weaknesses

Strengths

  • Detailed evaluation across different models and datasets.
  • The proposed method is easy to implement and introduces limited overhead.
  • KeyDiff supports chunked prefill and GQA.

Weaknesses

  • More evidence is needed to demonstrate the correlation between Key cosine similarity and the attention score.

Questions

Thanks for submitting to NeurIPS. This paper claims that key similarity can determine which keys are important, which is simple yet effective. I enjoyed reading the paper and would advocate for it if the proposed solution is widely applicable. I appreciate the effort to provide an intuitive explanation in the appendix. However, I kindly ask the authors to provide some additional evidence to further strengthen the argument:

  1. Correlation Analysis
    I would like to see the correlation between Key vector cosine similarity and attention scores. I understand that figures cannot be submitted during the rebuttal, so a table showing cosine similarity and attention scores sampled from request prompts would be helpful. Please include the Spearman rank correlation between cosine similarity and attention scores.

  2. Partial Cache Attention Scores
    What percentage of the total attention score is retained when only a subset of the KV cache (e.g., 4K tokens) is used compared to full attention?

  3. Performance on Retrieval-Critical Tasks
    I would like to see the performance of KeyDiff on tasks that require retrieving all keys to get the correct answer — for example, the Counting Star dataset. Due to time constraints, evaluating one model (e.g., LLaMA3.1 8B) is sufficient.

  4. RULER Benchmark
    Could you include results on the RULER benchmark, which is considered more challenging than LongBench?

  5. NIAH Recall Saturation
    For the NIAH dataset, while the token budget is 6K, none of the policies achieve 100% recall even when the token limit is 1000 or 4222. Is this due to model limitations? What about larger models?

  6. PCA Analysis Interpretation
    The paper shows multiple PCA analysis figures. While it is true that KeyDiff selects tokens more uniformly, it is unclear why this leads to better performance. Could you elaborate on how to interpret these graphs?

  7. Key Vector Norm Variations
    While the paper claims that Key vectors have similar norms, the beginning-of-text token tends to be an outlier with a smaller norm. Does KeyDiff account for this?

Limitations

Yes

Final Justification

I do believe the keys on the border of the key distribution are more important than the keys inside, especially given the strong correlation. However, keeping only the border keys still does not sound fully robust. The results on complex tasks are not fully given, and the cumulative attention score is still missing. Thus, I would like to keep the score.

Formatting Concerns

No concerns.

Author Response

Thank you for your thoughtful and considered feedback. We have replied to each of your comments below.

Regarding the correlation between negative key cosine similarity and attention scores: below we report the Spearman rank correlation for each layer of the Llama 3.2 3B Instruct model, averaged over each head. We averaged the correlations over several randomly selected samples from the musique task of LongBench. We observe high correlation across layers, validating our primary claim. We can also infer from this correlation that a large percentage of attention scores will be retained if we evict KVs according to key cosine similarity. We can include this table in the final version of the paper, along with similar data for other models.

| Layer Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Spearman Correlation | 0.8997 | 0.9561 | 0.9332 | 0.9496 | 0.9517 | 0.9484 | 0.9587 | 0.9578 | 0.9534 | 0.9570 | 0.9628 | 0.9554 | 0.9658 | 0.9578 |

| Layer Index | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Spearman Correlation | 0.9477 | 0.9538 | 0.9280 | 0.9373 | 0.9328 | 0.9340 | 0.9311 | 0.9140 | 0.9265 | 0.9284 | 0.9258 | 0.9394 | 0.9419 | 0.9448 |
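A per-head correlation of the kind reported above can be reproduced with a short script (an illustrative sketch under assumed array shapes, not the authors' evaluation code; ties in the scores are ignored):

```python
import numpy as np

def ranks(x: np.ndarray) -> np.ndarray:
    # 0-based ranks; ties are not handled, which is acceptable for
    # continuous similarity and attention values.
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def keydiff_attention_correlation(keys: np.ndarray,
                                  attn_mass: np.ndarray) -> float:
    """Correlate the KeyDiff score (negative cosine similarity to the
    mean-key anchor) with the observed attention mass per key.

    keys: (N, d) cached keys for one head; attn_mass: (N,) per-key mass.
    """
    anchor = keys.mean(axis=0)
    anchor = anchor / np.linalg.norm(anchor)
    unit_keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return spearman(-(unit_keys @ anchor), attn_mass)
```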

Regarding retrieval-critical benchmarks: while we were unable to implement the Counting Star benchmark during the rebuttal window, we have initial results on the Phonebook Lookup task, which requires exact recall of the phone number corresponding to a target name from a list of names and numbers [1]. We report results averaged over five random phonebook queries for KeyDiff and TOVA with a 6k cache budget. We see that KeyDiff generally outperforms TOVA but still lags behind the baseline once it begins evicting KVs from the cache.

| Method \ Phonebook Length | 100 | 478 | 856 | 1233 | 1611 | 1989 | 2367 | 2744 | 3122 | 3500 |
|---|---|---|---|---|---|---|---|---|---|---|
| Dense | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 0.8 | 0.6 | 0.6 | 0.8 | 0.6 |
| KeyDiff | 1.0 | 1.0 | 1.0 | 0.6 | 0.4 | 0.2 | 0.0 | 0.0 | 0.2 | 0.2 |
| TOVA | 1.0 | 1.0 | 1.0 | 0.4 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Regarding evaluation on the RULER benchmark: while we don't have our own results comparing KeyDiff with competing baselines on the RULER benchmark, we would like to highlight that NVIDIA's KVPress package on GitHub (and the corresponding kvpress-leaderboard on Huggingface) has recently adopted KeyDiff. KVPress demonstrates KeyDiff's competitive performance on RULER and thus provides independent, third-party validation of our method. Note that we are not affiliated with NVIDIA or KVPress in any way.

According to their evaluation -- including RULER as a subset -- KeyDiff achieves higher leaderboard scores than other eviction methods such as TOVA, SnapKV, QFilter, and Knorm, with a clear margin for Llama-3.1-8B-instruct and Qwen3-8B. Although KVPress benchmarking is done without applying the proposed block prompt processing (BPP), KeyDiff still outperforms competing methods. According to our results, KeyDiff performs best under the memory-constrained BPP setup; therefore, we expect a greater performance gap in KeyDiff's favor when the methods are evaluated with BPP enabled.

We also want to note that KeyDiff appears ranked fourth in the plot of KVPress. However, the methods ranked above KeyDiff are meta-heuristic approaches (e.g., DuoAttention, AdaKV, or SnapKV with Compressed Question) built upon an underlying score-based eviction scheme. KeyDiff is fully compatible with these meta-heuristics. Consequently, as a base scoring method, KeyDiff remains a strong KV cache eviction method.

Regarding the NIAH recall saturation: we have noted this behavior as well and also attribute it to intrinsic model limitations. We will add a baseline NIAH heatmap without cache eviction to validate this in the final version. We have not yet experimented with larger models, but anticipate the NIAH scores to improve. Below, we show the difference in NIAH scores between the eviction-free baseline and KeyDiff with a KV cache of 6k (averaged over fifty samples), i.e., $\mathrm{NIAH}_{\mathrm{KeyDiff}} - \mathrm{NIAH}_{\mathrm{Full}}$. We can see that KeyDiff performs similarly to the eviction-free baseline for context lengths 1000 and 4222 and is roughly on par with the baseline at longer contexts.

| Depth \ Context Length | 1000 | 4222 | 7444 | 10667 | 13889 | 17111 | 20333 | 23556 | 26778 | 30000 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.02 | -0.06 | -0.06 | 0.2 | 0.16 | -0.16 | 0.18 | 0.2 | 0.06 | 0.04 |
| 11.0 | 0.1 | -0.04 | -0.04 | 0.3 | 0.28 | -0.18 | 0.26 | 0.28 | 0.1 | -0.08 |
| 22.0 | 0.04 | -0.12 | 0.0 | 0.32 | 0.3 | -0.18 | 0.16 | 0.28 | 0.14 | -0.04 |
| 33.0 | 0.02 | 0.06 | -0.2 | 0.34 | 0.08 | -0.16 | 0.26 | 0.26 | 0.2 | -0.16 |
| 44.0 | -0.06 | -0.08 | -0.16 | 0.3 | 0.26 | -0.12 | 0.24 | 0.22 | -0.1 | 0.04 |
| 56.0 | -0.06 | 0.0 | -0.24 | 0.26 | 0.26 | -0.2 | 0.18 | 0.14 | 0.04 | -0.02 |
| 67.0 | 0.02 | 0.02 | -0.14 | 0.36 | 0.28 | -0.26 | 0.24 | 0.16 | 0.04 | -0.08 |
| 78.0 | -0.1 | 0.08 | -0.26 | 0.26 | 0.3 | -0.26 | 0.2 | 0.3 | 0.02 | -0.04 |
| 89.0 | 0.02 | 0.08 | -0.06 | 0.3 | 0.26 | -0.32 | 0.3 | 0.22 | 0.06 | -0.12 |
| 100.0 | 0.0 | -0.06 | -0.26 | 0.0 | -0.06 | -0.02 | 0.0 | -0.04 | 0.14 | -0.04 |

Regarding PCA analysis interpretation: the full KeyDiff scoring criterion (without an anchor, labeled "Pairwise" in the ablation study) is the sum of negative cosine similarities among keys. For a key to score highly, it must have very low cosine similarity with all other keys in the KV cache. This occurs when keys are more uniformly distributed in key space rather than tightly clustered in a single region. A set of more uniformly distributed keys therefore leads to higher KeyDiff scores and ultimately higher attention, as validated by the correlation analysis above. The PCA plots are an effort to demonstrate that the keys retained by KeyDiff are more uniformly distributed than those retained by competing approaches, while evicted keys remain similarly uniform (i.e., eviction does not introduce bias).

Regarding the relation between key norms and the beginning-of-sequence (BOS) token: this is a great point with a slightly nuanced answer. The BOS or sink token indeed has a smaller norm than other keys, due to its role in computing an approximate no-op in each attention head to avoid over-mixing. We discuss this phenomenon in more detail in Section C.3. The smaller norm affects the inequality of Lemma 3.1, tightening the gap between cosine similarity and attention. However, the BOS token is typically not aligned with the majority of keys. Figure 16 shows that the key least similar to the key mean (which is typically the sink/BOS token) tends to have a high cosine similarity with the query mean. KeyDiff will always select this token due to its geometric irregularity, despite its lower norm relative to the remaining keys. This can also be seen in Figure 4, where the key with the highest KeyDiff score (the BOS token) appears as an outlier.

[1] Jelassi, Samy, et al. "Repeat after me: Transformers are better than state space models at copying." arXiv preprint arXiv:2402.01032 (2024).

Comment

Dear Reviewers, the author-rebuttal phase for NeurIPS 2025 ends in 48 hours (Aug 8, 23:59 AoE). To safeguard the integrity of the conference, and to give every author the fair hearing you would want for your own work, please complete the following steps by the deadline: read the author rebuttal, engage in the discussions, fill in the "Final Justification" text box, and update your "Rating" accordingly. After these, one important step is to click the "Mandatory Acknowledgement" button. Otherwise, NeurIPS may impose penalties on reviewers who skip the acknowledgement step, such as desk rejection of their own submissions. I know your schedules are demanding, yet your participation in the discussion makes a world of difference to the authors who have spent months on their research.

Thank you for your time.

Comment

Thank you for your response.

  • It's encouraging to see that the correlation between cosine similarity and attention score is close to 1.
  • However, I would still appreciate seeing the cumulative attention score achieved by your methods, as this was not included in the rebuttal.
  • Additionally, I remain somewhat concerned about the feasibility of KeyDiff on more complex tasks. The gap relative to dense attention on Phonebook is still significant, and the RULER benchmark only shows results for a 4K context length.

Final Decision

This paper proposes KeyDiff, a training-free KV cache eviction method for LLM inference. Its core claim is that geometrically distinctive keys correlate with high attention scores, enabling efficient retention of important tokens for long-context LLMs under resource constraints. Strengths include its novelty, simplicity, low overhead, and compatibility with FlashAttention. It provides a strong theoretical basis and extensive evaluations.

Weaknesses noted were the need for more direct correlation evidence between key similarity and attention scores, and performance lagging baselines on some extreme retrieval tasks like Phonebook Lookup for very long contexts. Limited architecture coverage was also mentioned.

I recommend accepting this paper. Its novel insight into KV cache optimization offers a highly efficient and training-free solution. The authors' rebuttal provided compelling Spearman correlation data, cited independent RULER benchmark validation by KVPress, and presented on-device latency results, effectively addressing most major concerns. Reviewers acknowledged the comprehensive responses, largely expressing satisfaction.