PaperHub

Rating: 6.4/10 · Poster · 4 reviewers
Scores: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Average confidence: 3.3
Novelty: 3.0 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.8

NeurIPS 2025

Value-Guided KV Compression for LLMs via Approximated CUR Decomposition

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Key-value (KV) cache compression has emerged as a critical technique for reducing the memory and latency overhead of autoregressive language models during inference. Prior approaches predominantly rely on query-key attention scores to rank and evict cached tokens, assuming that attention intensity correlates with semantic importance. However, this heuristic overlooks the contribution of value vectors, which directly influence the attention output. In this paper, we propose CurDKV, a novel, value-centric KV compression method that selects keys and values based on leverage scores computed from CUR matrix decomposition. Our approach approximates the dominant subspace of the attention output $\mathrm{softmax}(QK^\top)V$, ensuring that the retained tokens best preserve the model’s predictive behavior. Theoretically, we show that attention score approximation does not guarantee output preservation, and demonstrate that CUR-based selection minimizes end-to-end attention reconstruction loss. Empirically, CurDKV achieves up to $9.6$% higher accuracy than state-of-the-art methods like SnapKV and ChunkKV under aggressive compression budgets on LLaMA and Mistral, while maintaining compatibility with FlashAttention and Grouped Query Attention. In addition to improved accuracy, CurDKV reduces generation latency by up to 40% at high compression, offering a practical speed-accuracy tradeoff.
Keywords
Large Language Models · KV Compression · Attention Compression · Needle-in-a-haystack

Reviews and Discussion

Review
Rating: 4

This paper introduces CurDKV, a method that compresses the KV cache to reduce inference-time memory consumption for transformer-based models. CurDKV takes into account the importance of both K and V simultaneously and introduces a novel KV compression method based on CUR decomposition, in order to retain the most important KV-cache entries. In addition, CurDKV introduces Gaussian projections to compute approximate scores, further improving the efficiency of the KV-cache compression process.

Strengths and Weaknesses

Strengths

  • This paper leverages CUR decomposition to approximate the principal subspace of the attention output, which theoretically better preserves the LLM capacity. It is a good idea to use CUR decomposition to select KV cache based on the leverage score, considering both key and value.
  • CurDKV is compatible with existing techniques such as FlashAttention and GQA.

Weaknesses

  • In Section 3.1, the motivation for value-guided compression is not clear enough, as keys can be equally important to the attention output. Moreover, the sentence "for higher compression levels, the second term becomes negligible" is a bit hard to follow.
  • In Section 4.4, the paper analyzes the computational efficiency of CurDKV but lacks a comprehensive comparison with baseline methods. Since CurDKV introduces extra computation such as the Gaussian projection and score combination, it would be better to compare end-to-end efficiency with other KV cache compression baselines.
  • This paper uses Gaussian projections with r = 20, but lacks further analysis of the Gaussian projections, such as how different r impact the accuracy-efficiency trade-off. Such an ablation study would be very useful; this is a minor weakness.

Questions

  • In Algorithm 1, CurDKV uses the product of the K and V leverage scores, which assigns equal importance to K and V; this seems inconsistent with the earlier motivation that V is more significant. Can the authors elaborate on this point?
  • In Figure 4(b) and Figure 5(b), where is the data line for 128K tokens?

Limitations

My questions are listed above, and there are no other major limitations.

Final Justification

The authors addressed most of my concerns in detail during the rebuttal period. Hence, I will keep my positive score of "borderline accept".

Formatting Issues

There are no major formatting issues in this paper.

Author Response

We thank the reviewer for giving us a comprehensive analysis of our work and highlighting key weaknesses of our manuscript. We address some of the concerns below and hope that the discussion resolves all of them.

Clarification of the claims in Section 3.1 and a motivation for value-guided compression.

Here, we'll try to give a simpler motivation behind our approach of "value-guided compression" of KV caches. Recall that the standard attention output is $\mathrm{softmax}(QK^\top/C)V$, where $C$ is a suitable constant involving the dimension of the vector spaces in question. Most of the current SOTA KV-cache compression methods (SnapKV, H2O, AdaKV, to name a few) estimate the importance of a cache entry using only attention weights, which are computed from the corresponding query and key vectors. However, as the attention equation suggests, value vectors may also contribute significantly to the final attention output. Lemma 3.1 of the paper bounds the true Frobenius distance between the attention outputs of the compressed and uncompressed KV caches; as we see, the upper bound depends on how well the value matrix is approximated. This intuition suggests that a good KV cache eviction policy should also incorporate the value vectors (and not just the attention weights). Having said this, we want to emphasize that this does not undermine the importance of key vectors. In fact, we provide ablations to show that a combination of the leverage scores of both the key and value vectors provides the best generation results (Figure 3 and Table 17 of our paper). Our ablation results also show that the next best choice is to use only the leverage scores of the value vectors, which further highlights the importance of value vectors in designing key-value pair eviction policies.

Regarding the sentence, "for higher compression levels, the second term becomes negligible": note that the matrix $V'$ is the compressed version of $V$, where dropped value vectors are simply zeroed out. At higher compression levels, say above 70%, most of the rows of $V'$ will be zeros. Inevitably, the Frobenius norm of the resulting matrix will decrease, and it slowly becomes insignificant compared to the first term of the upper bound of Lemma 3.1.
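
To make the intuition behind Lemma 3.1 concrete, here is a small self-contained NumPy sketch (our illustration, not the paper's code): a token that receives little attention can still dominate the attention output if its value vector is large, so an attention-only eviction order and the output-error-minimizing order can disagree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
V[3] *= 10.0  # token 3 carries a dominant value vector

A = np.exp(Q @ K.T / np.sqrt(d))
A /= A.sum(axis=1, keepdims=True)          # softmax(QK^T / C) attention weights

def output_error_if_dropped(i):
    """Frobenius error of the attention output when value row i is zeroed out."""
    Vc = V.copy()
    Vc[i] = 0.0
    return np.linalg.norm(A @ V - A @ Vc)

attn_mass = A.sum(axis=0)                  # total attention each token receives
true_cost = np.array([output_error_if_dropped(i) for i in range(n)])

print("eviction order by attention mass:", np.argsort(attn_mass))
print("eviction order by true output error:", np.argsort(true_cost))
# Token 3 is always among the costliest to drop (large value norm), even though
# its attention mass alone says nothing special about it.
```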

End-to-end computational efficiency comparison with other baselines. As requested by the reviewer, we run experiments to compare the end-to-end runtimes of CurDKV against other baselines.

We compare four methods: exact CurDKV (where exact leverage scores are computed), random CurDKV (where a random Gaussian projection is used to project the key-value vectors), SnapKV, and ChunkKV. We run comparisons at a context length of 128K tokens and a generation length of 100 tokens, and we use compression ratios of 30% and 90%. We report the total runtime, prefill time, and generation time for all methods. Here are the detailed results:

Table 1: Runtime results for a context length of 128K tokens and a compression ratio of 30%.

| Method | Length | Total time (in sec) | Prefill time (in sec) | Generation time (in sec) |
|---|---|---|---|---|
| Exact CurDKV | 128K | 55.39 | 46.51 | 8.88 |
| Random CurDKV | 128K | 44.76 | 35.17 | 9.59 |
| SnapKV | 128K | 44.97 | 33.37 | 11.59 |
| ChunkKV | 128K | 54.36 | 50.53 | 3.83 |

As we see from the above table, CurDKV with a random Gaussian projection outperforms the baselines for a compression ratio of 30%. This is also intuitive, since the Gaussian projection reduces the effective dimension of the key-value pairs. This, combined with the computationally efficient leverage scores, leads to the overall smallest total time (prefill + generation) for random CurDKV.

Table 2: Runtime results for a context length of 128K tokens and a compression ratio of 90%.

| Method | Length | Total time (in sec) | Prefill time (in sec) | Generation time (in sec) |
|---|---|---|---|---|
| Exact CurDKV | 128K | 51.36 | 46.90 | 4.46 |
| Random CurDKV | 128K | 42.82 | 35.41 | 7.41 |
| SnapKV | 128K | 44.50 | 33.59 | 10.91 |
| ChunkKV | 128K | 42.74 | 35.83 | 6.91 |

For a high compression ratio of 90%, random CurDKV still performs approximately the same as the best baseline (ChunkKV for this case). This, combined with the performance gains of CurDKV, makes it a competitive choice.

An ablation study on the dimension of the Gaussian projection. As rightly requested by the reviewer, we run an ablation study to determine the effect of the output dimension $r$ of the Gaussian projection matrix. Intuitively, we expect an overall increase in generation quality as $r$ increases; this is due to the simple fact that as the dimension increases, the leverage scores of the projected vectors better approximate the leverage scores of the original vectors. However, we still need to verify this claim empirically, and we do so below. We carry out these ablation studies for compression ratios of 30% and 90%, and we run evaluations on tasks from the LongBench benchmark.
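
As a standalone check of this intuition, the following NumPy sketch (our illustration; the paper's estimator uses the projected row norms directly) compares exact leverage scores, computed via a thin SVD, with leverage scores computed after a Gaussian projection to $r$ dimensions. The agreement improves as $r$ grows, matching the trend in Tables 3 and 4 below.

```python
import numpy as np

def leverage_scores(M):
    """Leverage scores = squared row norms of the left singular vectors of M."""
    U = np.linalg.svd(M, full_matrices=False)[0]
    return (U ** 2).sum(axis=1)

rng = np.random.default_rng(0)
n, d = 2048, 128
# Correlated rows, loosely mimicking real key/value matrices
V = rng.normal(size=(n, 16)) @ rng.normal(size=(16, d)) + 0.1 * rng.normal(size=(n, d))

exact = leverage_scores(V)
for r in (5, 10, 50, 100):
    G = rng.normal(size=(d, r)) / np.sqrt(r)   # Gaussian projection matrix
    approx = leverage_scores(V @ G)            # scores of the projected rows
    print(f"r={r:3d}  correlation with exact scores: "
          f"{np.corrcoef(exact, approx)[0, 1]:.3f}")
```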

Table 3: Ablation study on the output dimension $r$ of the Gaussian projection matrix for a compression ratio of 30%.

| $r$ | LongBench/2wikimqa | LongBench/hotpotqa | LongBench/multifieldqa_en | LongBench/qasper | Average |
|---|---|---|---|---|---|
| 5 | 50.99 | 58.77 | 55.48 | 47.97 | 53.30 |
| 10 | 50.71 | 59.21 | 55.71 | 47.33 | 53.24 |
| 50 | 49.31 | 57.85 | 54.78 | 47.83 | 52.44 |
| 100 | 52.07 | 58.23 | 56.41 | 49.52 | 54.06 |

Table 4: Ablation study on the output dimension $r$ of the Gaussian projection matrix for a compression ratio of 90%.

| $r$ | LongBench/2wikimqa | LongBench/hotpotqa | LongBench/multifieldqa_en | LongBench/qasper | Average |
|---|---|---|---|---|---|
| 5 | 28.45 | 41.8 | 29.29 | 25.5 | 31.26 |
| 10 | 35.04 | 41.98 | 29.94 | 26.02 | 33.24 |
| 50 | 34.25 | 45.48 | 32.3 | 28.64 | 35.17 |
| 100 | 35.3 | 46.52 | 33.15 | 28.13 | 35.77 |

For both compression ratios, we observe that increasing $r$ improves performance. In Figure 3 in Section 4.2, we discussed that CurDKV without random projection (exact CUR) usually performs better than CurDKV with random projection. A large $r$ in the approximated leverage score computation approximates the exact leverage score computation, so the performance approaches that of exact CUR. Moreover, this behavior is more prominent at larger compression ratios.

128K data line for Figures 4(b) and 5(b).

We thank the reviewer for pointing this out. Due to a limit on the y-axis (the maximum y value for prefill time is set to 30 sec in the plots, but for 128K it exceeds 40 sec), the 128K curves were omitted from both figures. We will correct this in the revised version of the paper.

Comment

Thanks for your rebuttal and results. I don't have further questions, and I'll keep my score.

Review
Rating: 4

The authors propose CurDKV, a value-centric KV compression method that selects keys and values based on leverage scores computed from CUR matrix decomposition. CurDKV achieves up to 9.6% higher accuracy than state-of-the-art methods like SnapKV and ChunkKV under aggressive compression budgets on LLaMA and Mistral, while maintaining compatibility with FlashAttention and Grouped Query Attention. In addition to improved accuracy, CurDKV reduces generation latency by up to 40% at high compression, offering a practical speed-accuracy tradeoff.

Strengths and Weaknesses

Strengths:

  1. The experiments are solid. The proposed method is better than strong baselines.
  2. CurDKV is simple and effective without further training.
  3. The paper is well-written and easy to read.

Weaknesses:

  1. The proposed method is only tested on long-context tasks. It is not clear how the method works on tasks that need to generate longer sequences, such as story writing, code generation, etc. How does the method process the KV of generated tokens?
  2. The paper claims a value-guided method, while both keys and values are used to compute scores in Algorithm 1. This needs clarification. It would also be better to have an ablation study on which of the two is more important for guiding the top-k indices.

Questions

How does the method process the KV of generated tokens? Does the method support vLLM?

Limitations

Yes

Formatting Issues

N/A

Author Response

We thank the reviewer for raising important questions about our claim to develop "value-guided" KV compression methods. We are glad to know that the reviewer found our work strong, easy to understand, and practically effective. We address the concerns raised by the reviewer below and hope that this discussion resolves their queries.

How CurDKV works at inference time for longer generation tasks

Below, we detail how our method processes the key-value (KV) cache for tokens generated during inference; a simplified code sketch follows the list:

  1. At each generation step, the model generates a token and computes the corresponding query, key, and value vectors.
  2. The newly generated token's key and value vectors are appended to the KV cache of the prefilled (context) tokens.
  3. After appending the new token, the method periodically recomputes leverage scores for the cached keys and values (optionally using the approximation via random Gaussian projections to ensure computational efficiency).
  4. Tokens with the lowest combined leverage scores are evicted from the cache to maintain the predefined memory budget. The most informative tokens, identified via high leverage scores, are retained.
  5. CurDKV explicitly preserves the first few tokens (attention sinks) as they typically carry critical global context relevant for subsequent generation steps.
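
The sketch below makes this loop concrete. It is schematic, not the paper's implementation: the helper names are ours, it assumes Hugging Face's legacy tuple cache format and greedy decoding, and the key/value row-norm product stands in for the leverage scores of Algorithm 1.

```python
import torch

def evict(cache, budget, sink=4):
    """Keep the `sink` leading tokens plus the highest-scoring tokens up to `budget`.
    Assumes the legacy tuple cache: one (k, v) pair per layer, each [B, H, T, Dh]."""
    pruned = []
    for k, v in cache:
        if k.shape[2] <= budget:
            pruned.append((k, v))
            continue
        score = (k.norm(dim=-1) * v.norm(dim=-1)).mean(dim=1)  # [B, T] combined score
        score[:, :sink] = float("inf")                         # always keep attention sinks
        keep = score.topk(budget, dim=-1).indices.sort(-1).values
        idx = keep[:, None, :, None].expand(-1, k.shape[1], -1, k.shape[-1])
        pruned.append((k.gather(2, idx), v.gather(2, idx)))
    return tuple(pruned)

@torch.no_grad()
def generate(model, input_ids, max_new_tokens, budget, period=256):
    out = model(input_ids, use_cache=True)                 # prefill
    cache, token = out.past_key_values, out.logits[:, -1:].argmax(-1)
    tokens = [token]
    for step in range(1, max_new_tokens):
        out = model(token, past_key_values=cache, use_cache=True)
        cache = out.past_key_values                        # step 2: new K/V appended
        if step % period == 0:                             # step 3: periodic re-scoring
            cache = evict(cache, budget)                   # steps 4-5: drop low scores
        token = out.logits[:, -1:].argmax(-1)
        tokens.append(token)
    return torch.cat(tokens, dim=-1)
```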

We further compare four KV compression methods: exact CurDKV (where exact leverage scores are computed), random CurDKV (where a random Gaussian projection is used to project the key-value vectors), SnapKV, and ChunkKV. We run comparisons at a context length of 128K tokens and a generation length of 100 tokens, and we use compression ratios of 30% and 90%. We report the total runtime, prefill time, and generation time for all methods. Here are the detailed results:

Table 1: Runtime results for a context length of 128K tokens and a compression ratio of 30%.

| Method | Length | Total time (in sec) | Prefill time (in sec) | Generation time (in sec) |
|---|---|---|---|---|
| Exact CurDKV | 128K | 55.39 | 46.51 | 8.88 |
| Random CurDKV | 128K | 44.76 | 35.17 | 9.59 |
| SnapKV | 128K | 44.97 | 33.37 | 11.59 |
| ChunkKV | 128K | 54.36 | 50.53 | 3.83 |

As we see from the above table, CurDKV with a random Gaussian projection outperforms the baselines for a compression ratio of 30%. This is also intuitive, since the Gaussian projection reduces the effective dimension of the key-value pairs. This, combined with the computationally efficient leverage scores, leads to the overall smallest total time (prefill + generation) for random CurDKV.

Table 2: Runtime results for a context length of 128K tokens and a compression ratio of 90%.

| Method | Length | Total time (in sec) | Prefill time (in sec) | Generation time (in sec) |
|---|---|---|---|---|
| Exact CurDKV | 128K | 51.36 | 46.90 | 4.46 |
| Random CurDKV | 128K | 42.82 | 35.41 | 7.41 |
| SnapKV | 128K | 44.50 | 33.59 | 10.91 |
| ChunkKV | 128K | 42.74 | 35.83 | 6.91 |

For a high compression ratio of 90%, random CurDKV still performs approximately the same as the best baseline (ChunkKV for this case). This, combined with the performance gains of CurDKV, makes it a competitive choice.

When we increase the generation length to 4K, the generation time increases for all compression methods due to periodic eviction/retention of key-value pairs.

Table 3: Generation time results for a context length of 128K tokens and a generation length of 4K tokens.

| Method | Compression ratio | Generation time (in sec) |
|---|---|---|
| Exact CurDKV | 30% | 464.9 |
| Random CurDKV | 30% | 516.9 |
| SnapKV | 30% | 553.2 |
| ChunkKV | 30% | 539.3 |
| Exact CurDKV | 90% | 656.8 |
| Random CurDKV | 90% | 536.9 |
| SnapKV | 90% | 553.4 |
| ChunkKV | 90% | 545.6 |

At a 30% compression ratio, exact CurDKV exhibits the fastest generation, whereas for a 90% compression ratio, random CurDKV achieves the fastest generation. At moderate compression ratios, exact CurDKV precisely identifies and retains highly informative keys and values through exact CUR decomposition, thereby reducing the need to repeatedly decompose the key-value pairs of generated tokens. At a higher compression rate, key-value eviction must be recomputed more often; under those circumstances, the effectiveness of low-rank projections increases, leading to lower generation time.

On the usage of both the key and value matrices, and the claim of the paper. We appreciate the reviewer seeking clarification of the main claim of the paper, as we believe it is the main contribution of this work. Most of the existing SOTA literature on KV cache compression (SnapKV, H2O, AdaKV, to name a few) estimates the importance of a cache entry using attention weights, which rely only on the corresponding query and key vectors. However, as we show in Lemma 3.1 of our paper, the true Frobenius distance between the exact attention output and the compressed attention output also depends on how well the compressed value matrix approximates the original value matrix; intuitively, this happens because the query and key vectors are absorbed by the softmax operator. This motivates the design of eviction policies that also incorporate value vectors. At the same time, we point out that this analysis does not undermine the importance of the corresponding key vectors. In fact, we provide ablation studies to empirically show that the combined scores of both the key and value vectors produce the best results (Figure 3 and Table 17 of our paper). As our ablations also show, the next best choice is to use the leverage scores of only the value vectors. Overall, through our paper, we aim to establish the importance of incorporating information from value vectors in developing key-value eviction policies.

vLLM support

We thank the reviewer for raising this important point regarding practical compatibility and integration. CurDKV is fully supported by NVIDIA’s KVPress library, which is specifically designed to implement and benchmark KV cache pruning strategies within Hugging Face Transformers-based pipelines. However, at present, KVPress (and thus CurDKV) does not natively support vLLM. The vLLM inference backend manages KV caches differently and currently has its own separate mechanisms (e.g., KV-Compress, LeanKV) for cache compression. While direct compatibility between CurDKV and vLLM would require additional engineering effort, methodologically, CurDKV can be adapted to work with vLLM by modifying vLLM’s cache eviction and retention mechanisms accordingly.
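
For context, the snippet below follows KVPress's documented pipeline pattern for applying a press. We use the existing SnapKVPress class here because we have not verified the name under which the CurDKV press is registered; treat the press choice, model name, and placeholder inputs as assumptions.

```python
from transformers import pipeline
from kvpress import SnapKVPress  # an existing press; a CurDKV press would slot in here

# Placeholder inputs for illustration
context = "..."   # long document whose KV cache will be compressed
question = "..."

pipe = pipeline(
    "kv-press-text-generation",   # pipeline registered by kvpress on import
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda",
)
press = SnapKVPress(compression_ratio=0.9)  # fraction of KV pairs to prune
answer = pipe(context, question=question, press=press)["answer"]
```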

Comment

Dear Reviewer NWDK,

Thank you for your valuable and insightful comments. We have carefully addressed each of your points and made every effort to incorporate your suggestions into the revised version of our paper. We are confident that our responses have resolved the concerns you raised.

We would greatly appreciate it if you could kindly review our responses and consider reassessing our submission. As the discussion phase is concluding soon, we hope to engage in a productive dialogue with you before it ends.

Thank you once again for your time and consideration.

Thanks

Authors

Comment

Dear Reviewer,

As only 24 hours are left till the discussion period ends, we once again request you to kindly check our responses. We have carefully addressed each of your points and made every effort to incorporate your suggestions into the revised version of our paper. We are confident that our responses have resolved the concerns you raised.

We would greatly appreciate it if you could kindly review our responses and consider reassessing our submission.

Comment

Dear Reviewer NWDK,

The discussion period is now within 24 hours. Please check out the authors' response ASAP!

Best,

AC

Review
Rating: 4

The paper proposes CurDKV, a value-centric strategy for compressing the key-value (KV) cache. The authors estimate each token’s leverage score via a (fast, projected) CUR matrix decomposition of the attention output and keep only the most influential keys and values.

Strengths and Weaknesses

Strengths

  1. CUR-based leverage scores plus a Gaussian projection yield a lightweight $O(ndr)$ scoring pass; the algorithm integrates with FlashAttention and Grouped-Query Attention without kernel changes.

  2. Memory scales predictably; although prefill cost rises by ~3–5 s for 128K contexts, generation is faster for long sequences.

Weaknesses

  1. Static compression only. CurDKV freezes the cache after prefill; it cannot react if later tokens shift relevance, and it adds noticeable prefill overhead for very long inputs.

  2. Limited model coverage. It would be better to provide more results on bigger models. The results use 7B/8B LLaMA/Mistral models; scalability to 34B–70B or MoE models is untested.

  3. Hyper-parameter sensitivity. The adaptive safeguard α (0.20) is fixed; a sensitivity analysis would help.

Questions

See weaknesses.

Limitations

Yes, the paper discusses the pertinent limitations and outlines directions for future work.

Final Justification

After reading the authors’ rebuttal and the comments from other reviewers, most of my concerns, including some experimental justifications involving larger models, have been addressed.

Therefore, I am raising my score by one point.

Formatting Issues

None

Author Response

We thank the reviewer for giving us valuable insights into the potential weaknesses of our work, and we aim to address some of those concerns below. We hope the following discussion clarifies the concerns raised by the reviewer.

CurDKV as a static compression method. It is true that CurDKV is a static compression method, in the sense that once a key-value pair of the context is evicted from the cache, CurDKV never recovers it (even if future tokens shift relevance). However, we'd like to point out that some of the best-performing baselines we compare against are also static in this sense: H2O, SnapKV, StreamingLLM, and AdaKV, to name a few. In fact, any reasonable KV cache compression technique must be static in this sense, because recovering an evicted key-value pair must inevitably require either 1) storing the key-value pair in GPU memory or 2) recomputing the key-value pair, both of which may incur significant overhead for larger context lengths.

Results on bigger models. At the reviewer's request, we tested CurDKV on larger models; in particular, we chose the Qwen3 14B and Qwen2.5 32B models and tested the efficacy of CurDKV on these models against other strong baselines. As in the paper, we report results on tasks from the LongBench benchmark, for compression ratios of 30% and 90% respectively.

Table 1: LongBench evaluation results for the Qwen 14B model for a compression ratio of 30%.

| Method | LongBench/2wikimqa | LongBench/hotpotqa | LongBench/multifieldqa_en | LongBench/qasper | Average |
|---|---|---|---|---|---|
| ChunkKV | 28.71 | 33.53 | 30.41 | 24.84 | 29.37 |
| KNorm | 26.78 | 31.44 | 29.27 | 21.89 | 27.34 |
| SnapKV | 28.45 | 34.56 | 32.11 | 25.38 | 30.12 |
| StreamingLLM | 20.56 | 26.67 | 22.33 | 22.08 | 22.91 |
| CurDKV | 27.97 | 31.58 | 31.95 | 27.04 | 29.63 |

As we see from the above table, SnapKV performs best on average on the LongBench tasks. However, CurDKV is not far behind and consistently performs close to the best baseline.

Table 2: LongBench evaluation results for the Qwen 14B model for a compression ratio of 90%.

| Method | LongBench/2wikimqa | LongBench/hotpotqa | LongBench/multifieldqa_en | LongBench/qasper | Average |
|---|---|---|---|---|---|
| ChunkKV | 12.39 | 21.09 | 17.03 | 9.25 | 14.94 |
| KNorm | 10.04 | 9.73 | 13.06 | 7.2 | 10.00 |
| SnapKV | 13.37 | 21.34 | 18.25 | 9.56 | 15.63 |
| StreamingLLM | 9.56 | 14.51 | 14.66 | 10.22 | 12.24 |
| CurDKV | 20.88 | 21.54 | 22.08 | 16.23 | 20.18 |

The effect of CurDKV is more prominent in the high compression regime. For instance, in the above table (90% compression), CurDKV outperforms the other baselines by a significant margin. Below, we see an even more prominent effect for a larger model (a quantized 32B model).

Table 3: LongBench evaluation results for the Qwen 32B model for a compression ratio of 30%.

| Method | LongBench/2wikimqa | LongBench/hotpotqa | LongBench/multifieldqa_en | LongBench/qasper | Average |
|---|---|---|---|---|---|
| ChunkKV | 51.64 | 58.64 | 45.41 | 42.46 | 49.54 |
| KNorm | 40.14 | 46.94 | 43.39 | 37.73 | 42.05 |
| SnapKV | 53.85 | 57.52 | 48.56 | 41.86 | 50.45 |
| StreamingLLM | 34.1 | 47.15 | 32.61 | 43.11 | 39.24 |
| CurDKV | 58.07 | 62.6 | 50.42 | 47.75 | 54.71 |

Here again, we see that CurDKV consistently outperforms other baselines by a significant margin.

Table 4: LongBench evaluation results for the Qwen 32B model for a compression ratio of 90%.

| Method | LongBench/2wikimqa | LongBench/hotpotqa | LongBench/multifieldqa_en | LongBench/qasper | Average |
|---|---|---|---|---|---|
| ChunkKV | 24.24 | 37.85 | 24.31 | 13.62 | 25.00 |
| KNorm | 6.14 | 3.8 | 13.12 | 7.02 | 7.52 |
| SnapKV | 22.8 | 36.29 | 23.04 | 14.29 | 24.10 |
| StreamingLLM | 18.54 | 30.38 | 22.03 | 14.56 | 21.38 |
| CurDKV | 40.59 | 43.51 | 31.24 | 16.1 | 32.86 |

And yet again, we see that for the quantized 32B model, the gains of CurDKV in the high compression regime are even more prominent. This raises a further interesting question: do value-guided KV caching methods perform better for quantized models? We leave this question for future work.

On the sensitivity of the safeguard parameter $\alpha$. As requested by the reviewer, we run a sensitivity analysis on the safeguard parameter $\alpha$. To recall, for KV compression methods that distribute the per-head budget adaptively (i.e., the AdaKV technique [1]), the parameter $\alpha$ controls the proportion of KV pairs to preserve in each head (see line 28 of the KVPress implementation at https://github.com/NVIDIA/kvpress/blob/main/kvpress/presses/adakv_press.py). The default value of $\alpha$ proposed by the authors of AdaKV is 0.20, which we also use in our implementation of AdaCurDKV. Tables 5 and 6 below contain the results of our ablations.
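
Before the tables, for intuition, here is a simplified NumPy sketch of safeguarded adaptive allocation (our reading of the AdaKV scheme, not the KVPress implementation linked above): each head is guaranteed a fraction $\alpha$ of the uniform per-head budget, and the remaining slots go to the heads holding the globally highest-scoring tokens.

```python
import numpy as np

def adaptive_head_budgets(scores, total_budget, alpha=0.2):
    """scores: [H, T] per-head token importance; returns how many KV pairs each
    head may keep. Simplification: per-head budgets are not capped at T."""
    H, T = scores.shape
    floor = int(alpha * total_budget / H)            # safeguarded minimum per head
    extra = total_budget - H * floor                 # slots allocated adaptively
    top = np.argsort(scores, axis=None)[::-1][:extra]  # globally best (head, token) pairs
    return floor + np.bincount(top // T, minlength=H)

scores = np.random.default_rng(0).random((8, 1024))  # 8 heads, 1024 cached tokens
print(adaptive_head_budgets(scores, total_budget=1024, alpha=0.2))
```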

Table 5: Sensitivity analysis of the safeguard parameter $\alpha$ for LongBench tasks at a compression ratio of 30%. We compare the two adaptive techniques: AdaSnapKV and AdaCurDKV.

| $\alpha$ | Method | LongBench/2wikimqa | LongBench/hotpotqa | LongBench/multifieldqa_en | LongBench/qasper | Average |
|---|---|---|---|---|---|---|
| 0.1 | AdaCurDKV | 49.13 | 58.74 | 55.82 | 48.36 | 53.01 |
| 0.1 | AdaSnapKV | 50.58 | 59.93 | 55.28 | 45.74 | 52.88 |
| 0.5 | AdaCurDKV | 49.13 | 58.74 | 55.82 | 48.36 | 53.01 |
| 0.5 | AdaSnapKV | 51.34 | 59.55 | 55.61 | 45.84 | 53.08 |
| 0.9 | AdaCurDKV | 48.99 | 58.67 | 56.33 | 48.46 | 53.11 |
| 0.9 | AdaSnapKV | 51.06 | 57.95 | - | 45.72 | 52.17 |

Table 6: Sensitivity analysis of the safeguard parameter $\alpha$ for LongBench tasks at a compression ratio of 90%. We compare the two adaptive techniques: AdaSnapKV and AdaCurDKV.

| $\alpha$ | Method | LongBench/2wikimqa | LongBench/hotpotqa | LongBench/multifieldqa_en | LongBench/qasper | Average |
|---|---|---|---|---|---|---|
| 0.1 | AdaCurDKV | 34.5 | 45.91 | 30.8 | 28.75 | 34.99 |
| 0.1 | AdaSnapKV | 26.7 | 46.18 | 26.86 | 21.53 | 30.32 |
| 0.5 | AdaCurDKV | 32.73 | 46.61 | 30.15 | 29.83 | 34.83 |
| 0.5 | AdaSnapKV | 27.15 | 46.07 | 26.18 | 20.39 | 29.95 |
| 0.9 | AdaCurDKV | 32.27 | 46.44 | 31.37 | 28.99 | 34.77 |
| 0.9 | AdaSnapKV | 23.48 | 45.27 | 26.46 | 21.65 | 29.21 |

For both 30% and 90% compression ratios, we observe only a marginal performance difference for different values of $\alpha$, indicating that the adaptive KV compression methods (including AdaCurDKV) are very robust to the selection of this parameter.

References

[1] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference, Feng et al., 2025.

Comment

Dear Reviewer 1huD,

Thank you for your valuable and insightful comments. We have carefully addressed each of your points and made every effort to incorporate your suggestions into the revised version of our paper. We are confident that our responses have resolved the concerns you raised.

We would greatly appreciate it if you could kindly review our responses and consider reassessing our submission. As the discussion phase is concluding soon, we hope to engage in a productive dialogue with you before it ends.

Thank you once again for your time and consideration.

Thanks

Authors

Comment

Dear Reviewer,

As only 24 hours are left till the discussion period ends, we once again request you to kindly check our responses. We have carefully addressed each of your points and made every effort to incorporate your suggestions into the revised version of our paper. We are confident that our responses have resolved the concerns you raised.

We would greatly appreciate it if you could kindly review our responses and consider reassessing our submission.

Comment

Dear Reviewer 1huD,

The discussion period is now within 24 hours. Please check out the authors' response ASAP!

Best, AC

Comment

After reading the rebuttal and other review information, most of my concerns have been addressed, and I will maintain my positive score.

Review
Rating: 4

This paper introduces CurDKV, which rethinks KV cache compression by considering not only keys but also values via a leverage score metric. Specifically, CurDKV uses CUR decomposition to identify the more important individual keys and values via leverage scores, where importance is measured over rows and columns of the key and value matrices. In practice, this process is done at the group level to be compatible with GQA, and to avoid an SVD it uses a random projection for the decomposition, reducing overhead. Finally, an adaptive-budget version of CurDKV is proposed.

Strengths and Weaknesses

Strengths

  1. The proposed KV cache compression process includes values, which were overlooked previously. This is well motivated by Figure 1 and the introduction.

  2. The proposed use of CUR decomposition is solid for identifying the important rows and columns.

  3. Using a random projection for leverage score computation seems a good trick to avoid the SVD, and the extra cost involved is acceptable. Moreover, this importance evaluation process can fit inside FlashAttention, which is important in real applications.

  4. Solid overall performance on LongBench and the NIAH test.

Weaknesses

  1. Algorithm 1 is not very clear to me. How is the leverage score used for the decomposition?

  2. The maximum length in the experiments is rather limited. I would expect tests up to 128K on RULER.

  3. Why not include a latency comparison with and without CurDKV, to show the exact overhead of computing leverage scores and of the ranking process?

Questions

No

Limitations

No

Final Justification

I'll keep my score.

Formatting Issues

No

Author Response

We thank the reviewer for giving us a comprehensive analysis of our work. We are grateful that the reviewer found our approach of using values to compress KV caches well-motivated. Below, we would like to address all concerns of the reviewer and hope that they answer all their questions.

A clearer description of Algorithm 1. Here we give a clearer description of the algorithm, with a particular focus on how leverage scores are used to compute the decomposition; a minimal code sketch follows the steps below.

  1. Consider a set of key and value matrices $K \in \mathbb{R}^{k\times d}$ and $V \in \mathbb{R}^{k\times d}$, i.e., $k$ key-value pairs, each of dimension $d$.
  2. Project each key and value vector to a vector of dimension $r$ (we use $r = 20$) via a Gaussian projection matrix $G \in \mathbb{R}^{d\times r}$ (lines 2-3 of the algorithm). The significance of this step is to reduce the overall run-time complexity (computing exact leverage scores requires an SVD).
  3. After this, we compute leverage scores using the new key and value vectors (line 4 of the algorithm). In simple words, these scores are just row norms of the projected key-value matrices.
  4. The key and value leverage scores are then combined (line 5) and normalized (line 6).
  5. Finally, the indices with the top-$(k - s)$ scores are computed, and the keys and values at these indices are retained in the KV cache (lines 8-10 of the algorithm).
  6. Note that we preserve the first $s$ tokens in the context due to the prevalence of attention sinks [1].
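
Putting the six steps together, here is a minimal NumPy sketch of the selection routine (our paraphrase of Algorithm 1; variable names and the exact normalization are ours):

```python
import numpy as np

def curdkv_select(K, V, budget, s=4, r=20, seed=0):
    """Indices of KV pairs to retain: the first s sink tokens plus the
    top-(budget - s) remaining tokens by combined leverage score.
    K, V: [k, d] key/value matrices of one head (or one GQA group)."""
    rng = np.random.default_rng(seed)
    k, d = K.shape
    G = rng.normal(size=(d, r)) / np.sqrt(r)        # step 2: Gaussian projection
    lev_K = ((K @ G) ** 2).sum(axis=1)              # step 3: projected row norms
    lev_V = ((V @ G) ** 2).sum(axis=1)
    score = lev_K * lev_V                           # step 4: combine key/value scores
    score /= score.sum()                            # step 4: normalize
    top = s + np.argsort(score[s:])[::-1][: budget - s]   # step 5: top non-sink tokens
    return np.sort(np.concatenate([np.arange(s), top]))   # step 6: keep sinks too
```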

Note: We've further provided an ablation study involving different values of $r$ for reviewer hDao. TL;DR: we observe only marginal differences when using different values of $r$, and the Gaussian approximation improves as $r$ increases (which is intuitive).

Regarding the maximum length in experiments for the RULER benchmark. As per the reviewer's request, we ran NIAH experiments on the RULER benchmark with a context length of 128K tokens. We report the results in the tables given below. As in the main paper, we compare CurDKV with the ChunkKV, KNorm, SnapKV, and StreamingLLM baselines. We carry out the experiments at compression ratios of 30% and 50%.

Table 1: Results on the RULER benchmark with 128K context length and a compression ratio of 30%.

| Task / Method | ChunkKV | KNorm | SnapKV | StreamingLLM | CurDKV |
|---|---|---|---|---|---|
| Accuracy.niah_multikey_1.string_match | 92.31 | 69.23 | 100 | 69.23 | 84.62 |
| Accuracy.niah_multikey_2.string_match | 50 | 0 | 50 | 78.57 | 92.86 |
| Accuracy.niah_multikey_3.string_match | 20 | 10 | 10 | 20 | 10 |
| Accuracy.niah_multiquery.string_match | 95 | 82.5 | 100 | 70 | 97.5 |
| Accuracy.niah_multivalue.string_match | 92.86 | 82.14 | 92.86 | 67.86 | 96.43 |
| Accuracy.niah_single_1.string_match | 100 | 100 | 100 | 55.56 | 88.89 |
| Accuracy.niah_single_2.string_match | 87.5 | 87.5 | 100 | 62.5 | 87.5 |
| Accuracy.niah_single_3.string_match | 90.91 | 81.82 | 63.64 | 54.55 | 81.82 |
| Average | 78.57 | 64.15 | 77.06 | 59.78 | 79.95 |

Table 2: Results on the RULER benchmark with 128K context length and a compression ratio of 50%.

| Task / Method | ChunkKV | KNorm | SnapKV | StreamingLLM | CurDKV |
|---|---|---|---|---|---|
| Accuracy.niah_multikey_1.string_match | 84.62 | 23.08 | 92.31 | 53.85 | 84.62 |
| Accuracy.niah_multikey_2.string_match | 35.71 | 0 | 42.86 | 50.00 | 57.14 |
| Accuracy.niah_multikey_3.string_match | 20.00 | 0 | 10.00 | 10.00 | 10.00 |
| Accuracy.niah_multiquery.string_match | 85.00 | 45.00 | 95.00 | 40.00 | 90.00 |
| Accuracy.niah_multivalue.string_match | 82.14 | 46.43 | 92.86 | 39.29 | 96.43 |
| Accuracy.niah_single_1.string_match | 100.00 | 100.0 | 100.00 | 44.44 | 100.00 |
| Accuracy.niah_single_2.string_match | 62.50 | 62.50 | 100.00 | 50.00 | 75.00 |
| Accuracy.niah_single_3.string_match | 72.73 | 36.36 | 9.09 | 36.36 | 36.36 |
| Average | 67.84 | 39.17 | 67.77 | 40.49 | 68.69 |

As we see from the above results, CurDKV consistently either outperforms the other methods or performs comparably to the best baseline for a given task, leading to the overall best average score.

Note: For the 128K NIAH experiments, we chose 30% and 50% compression ratios for comparing the KV compression methods. At higher compression ratios (>50%), the performance of most methods drops sharply (almost to 0 for most tasks) due to the lack of sufficient retained context. This behavior is more prevalent at 128K than in the shorter-context NIAH tasks (reported in the paper).

A latency comparison with and without CurDKV. At the reviewer's request, we provide a latency comparison of CurDKV against other baselines. We compare four methods: exact CurDKV (where exact leverage scores are computed), random CurDKV (where a random Gaussian projection is used to project the key-value vectors), SnapKV, and ChunkKV. We run comparisons at a context length of 128K tokens and a generation length of 100 tokens, and we use compression ratios of 30% and 90%. We report the total runtime, prefill time, and generation time for all methods. Here are the detailed results:

Table 3: Runtime results for a context length of 128K tokens and a compression ratio of 30%.

| Method | Length | Total time (in sec) | Prefill time (in sec) | Generation time (in sec) |
|---|---|---|---|---|
| Exact CurDKV | 128K | 55.39 | 46.51 | 8.88 |
| Random CurDKV | 128K | 44.76 | 35.17 | 9.59 |
| SnapKV | 128K | 44.97 | 33.37 | 11.59 |
| ChunkKV | 128K | 54.36 | 50.53 | 3.83 |

As we see from the above table, CurDKV with a random Gaussian projection outperforms the baselines for a compression ratio of 30%. This is also intuitive, since the Gaussian projection reduces the effective dimension of the key-value pairs. This, combined with the computationally efficient leverage scores, leads to the overall smallest total time (prefill + generation) for random CurDKV.

Table 4: Runtime results for a context length of 128K tokens and a compression ratio of 90%.

| Method | Length | Total time (in sec) | Prefill time (in sec) | Generation time (in sec) |
|---|---|---|---|---|
| Exact CurDKV | 128K | 51.36 | 46.90 | 4.46 |
| Random CurDKV | 128K | 42.82 | 35.41 | 7.41 |
| SnapKV | 128K | 44.50 | 33.59 | 10.91 |
| ChunkKV | 128K | 42.74 | 35.83 | 6.91 |

For a high compression ratio of 90%, random CurDKV still performs approximately the same as the best baseline (ChunkKV for this case). This, combined with the performance gains of CurDKV, makes it a competitive choice.

References

[1] Efficient Streaming Language Models with Attention Sinks, Xiao et al., 2024.

Comment

Dear Reviewer kZ3Z,

Thank you for your valuable and insightful comments. We have carefully addressed each of your points and made every effort to incorporate your suggestions into the revised version of our paper. We are confident that our responses have resolved the concerns you raised.

We would greatly appreciate it if you could kindly review our responses and consider reassessing our submission. As the discussion phase is concluding soon, we hope to engage in a productive dialogue with you before it ends.

Thank you once again for your time and consideration.

Thanks

Authors

Comment

Dear Reviewer,

As only 24 hours are left till the discussion period ends, we once again request you to kindly check our responses. We have carefully addressed each of your points and made every effort to incorporate your suggestions into the revised version of our paper. We are confident that our responses have resolved the concerns you raised.

We would greatly appreciate it if you could kindly review our responses and consider reassessing our submission.

Comment

Dear Reviewer kZ3Z,

The discussion period is now within 24 hours. Please check out the authors' response ASAP!

Best, AC

Comment

Hi everyone,

This is a reminder that the discussion phase is between July 31 – Aug 6.

Please read the author responses, especially where you are mentioned, and post your reply as soon as possible. This helps ensure there's time for meaningful back-and-forth.

Thanks for your engagement!

AC

Comment

We sincerely thank all the reviewers for their thoughtful, constructive, and detailed feedback, which helped us improve the clarity and scope of our work. We use this opportunity to highlight the core contributions of our paper.

Importance of value-guided KV compression

We observe that most existing KV compression methods, such as SnapKV and H2O, use only attention keys to identify the key-value pairs to evict during prefill or generation. Over-reliance on key tensors alone leads to high post-eviction (reconstruction) loss for these methods. We theoretically justified how value-guided methods can ensure lower reconstruction loss and therefore better preservation of critical key-value pairs. Based on this observation, we designed CurDKV, a novel method that compresses the KV cache based on leverage scores computed from a CUR matrix decomposition.

Strong empirical results

Experimental results on 16 LongBench tasks and 8 needle-in-a-haystack (NIAH) tasks highlight the strong performance of CurDKV with LLaMA and Mistral models. CurDKV consistently outperforms contemporary baselines like SnapKV and ChunkKV by an average margin of 9%.

During the rebuttal phase, we conducted additional experiments with Qwen 14B and Qwen 32B models, where CurDKV demonstrated superior performance, outperforming the baselines. Even on 128K-context NIAH tasks, CurDKV remains performant in the high compression regime, often outperforming existing KV compression methods.

The ablation studies further illustrate the importance of CurDKV's design choices. Our method is also compatible with adaptive budget allocation, further extending its utility for different compression needs.

Computational benefits of CurDKV

We evaluated CurDKV and other baselines in terms of prefill and generation time at different context and generation lengths and various compression ratios. Although the (static) prefill time of CurDKV is marginally higher than that of SnapKV, its generation time is significantly smaller than that of SnapKV or ChunkKV. The difference is even more striking for longer generation lengths, highlighting the practical usefulness of our method. Moreover, CurDKV is compatible with FlashAttention and Grouped Query Attention, and offers seamless integration with popular libraries like NVIDIA's KVPress.

Comment

Dear reviewers,

As the discussion deadline is approaching, please read the authors' response, check whether it addresses the points you raised, and provide feedback accordingly. Many thanks for your cooperation.

Best, AC

Final Decision

The paper proposes CurDKV, a value-guided KV-cache compression method for LLMs that uses an approximated CUR decomposition. Major concerns, such as evaluation on longer sequences and latencies, were successfully addressed by the authors. All reviewers are positive toward accepting this paper. Hence, I vote to accept this paper.