CommVQ: Commutative Vector Quantization for KV Cache Compression
We propose CommVQ, a novel KV cache quantization method that reduces FP16 KV cache size by 87.5% while maintaining high accuracy through additive quantization and a RoPE-commutative codebook.
Abstract
Reviews and Discussion
This paper introduces CommVQ, which significantly reduces KV cache memory in long-context LLMs while preserving accuracy. It uses additive quantization with a lightweight encoder and a RoPE-commutative codebook for efficient self-attention integration. In practice, CommVQ enables 1-bit KV cache quantization with minimal accuracy loss.
Questions for the Authors
NA
Claims and Evidence
As listed at the end of the Introduction section, the main claims made in the submission are:
(1) maintain performance with per-token quantization without using small groups;
(2) realize real-world efficiency gains through the commutative property of the RoPE matrix and the characteristics of self-attention;
(3) enable 1-bit KV cache quantization.
All three points are justified in the experimental section.
Methods and Evaluation Criteria
This paper follows the common practice of KV cache compression work. The selected benchmarks, i.e., LongBench and InfiniteBench, and the evaluation criteria make sense for long-context scenarios.
However, I believe LongBench has many more subtasks beyond the reported 8 tasks. It would also be great to see the Needle-in-a-Haystack test.
Theoretical Claims
NA
Experimental Design and Analysis
Yes, it follows the default settings of LongBench and InfiniteBench.
Supplementary Material
Yes, I double-checked Table 10 in Appendix A.5.
Relation to Existing Literature
It is nice to see that 1-bit KV cache quantization can still maintain performance in the long-context scenario!
Missing Important References
NA
Other Strengths and Weaknesses
There are no details on the Triton implementation of the proposed method. How is the kernel implemented, and why is it faster than the baseline?
Other Comments or Suggestions
NA
We thank the reviewer for the positive feedback and thoughtful suggestions.
1. Full LongBench and Needle-in-a-Haystack results
The full LongBench consists of 21 tasks in total. Following prior works such as KIVI and KVQuant, we report results on the same eight representative tasks for fair comparison and consistency. To address your comment, we now provide results on the full LongBench benchmark, including all 21 tasks, using the LLaMA-3.1 8B model.
The results below show that CommVQ-2 maintains accuracy across nearly all subtasks, while CommVQ-1 achieves competitive performance even under extreme 1-bit compression.
| | FP16 | KIVI-2 | CommVQ-2 | KIVI-1 | CommVQ-1 |
|---|---|---|---|---|---|
| Avg. bit | 16 | 3.00 | 2.00 | 2.00 | 1.03 |
| Qasper | 25.19 | 22.71 | 24.67 | 4.99 | 18.86 |
| QMSum | 23.31 | 24.33 | 24.36 | 9.57 | 23.02 |
| MultiNews | 26.82 | 27.29 | 26.48 | 9.20 | 24.34 |
| TREC | 72.50 | 72.50 | 72.50 | 38.75 | 69.00 |
| TriviaQA | 91.65 | 92.06 | 91.92 | 25.07 | 91.61 |
| SAMSum | 43.49 | 43.26 | 43.98 | 11.93 | 41.83 |
| LCC | 52.47 | 51.32 | 53.02 | 17.67 | 48.78 |
| RepoBench-P | 49.01 | 47.53 | 46.92 | 16.40 | 42.08 |
| NarrativeQA | 31.69 | 31.47 | 32.20 | 2.87 | 29.80 |
| MultifieldqaEN | 29.16 | 27.50 | 29.47 | 7.11 | 24.93 |
| MultifieldqaZH | 19.95 | 19.92 | 19.47 | 4.83 | 18.97 |
| HotpotQA | 17.18 | 20.14 | 19.97 | 7.88 | 16.48 |
| 2wikimQA | 16.36 | 17.14 | 16.98 | 5.85 | 14.11 |
| Musique | 11.64 | 11.99 | 12.43 | 3.77 | 9.40 |
| Dureader | 29.66 | 27.76 | 26.59 | 7.79 | 22.06 |
| GovReport | 34.54 | 34.12 | 31.87 | 9.16 | 26.41 |
| Vcsum | 16.15 | 15.94 | 16.39 | 9.40 | 15.45 |
| Lsht | 46.00 | 45.00 | 45.50 | 17.25 | 31.50 |
| PassageCount | 6.02 | 8.19 | 8.04 | 5.83 | 8.29 |
| PassageRetrievalEN | 98.45 | 97.25 | 98.55 | 23.02 | 97.28 |
| PassageRetrievalZH | 77.72 | 67.45 | 84.25 | 2.76 | 89.00 |
| Average | 39.00 | 38.33 | 39.31 | 11.48 | 36.34 |
These results further validate CommVQ’s generalization across a wide range of long-context tasks and domains.
We also provide the results for the Needle-in-a-Haystack test using the LLaMA-3.1 8B model. The NIAH result figures are included in this link: https://github.com/commvq/CommVQ/blob/main/fig1.png
We can see that CommVQ-2 preserves the full retrieval capability of the FP16 baseline, while our 1-bit quantized version, CommVQ-1, significantly outperforms its counterpart, KIVI-1.
2. Triton kernel implementation details
Our Triton kernels implement the techniques described in Section 4 of the main paper. Specifically, they include:
- A kernel that fuses RoPE application with commutative codebook decoding, reducing intermediate memory operations;
- A mixed-precision batched matrix multiplication for efficient computation;
- Loading low-bit representations on the fly, avoiding the need to upcast them to higher precision as required in native PyTorch implementations (a simplified packing/unpacking illustration is sketched below).
Together, these optimizations reduce memory access overhead and improve compute utilization by fully leveraging Triton’s capabilities. This implementation is under active development and will continue to be optimized. We will release the Triton kernels, along with code and implementation details, upon acceptance.
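To make the last bullet point concrete, below is a simplified, PyTorch-level sketch of packing 1-bit codes into bytes and unpacking them on the fly. This is only an illustration of the general idea under our own naming and layout assumptions, not the authors' Triton kernel, which operates on their specific code layout and fuses this step with the decoding matmul.

```python
import torch

def pack_1bit(codes: torch.Tensor) -> torch.Tensor:
    """Pack {0, 1} codes (last dim divisible by 8) into uint8, 8 codes per byte."""
    codes = codes.to(torch.uint8).reshape(*codes.shape[:-1], -1, 8)
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.uint8, device=codes.device)
    return (codes * weights).sum(dim=-1).to(torch.uint8)

def unpack_1bit(packed: torch.Tensor) -> torch.Tensor:
    """Recover the {0, 1} codes from the packed uint8 representation."""
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.uint8, device=packed.device)
    return ((packed.unsqueeze(-1) & weights) > 0).to(torch.uint8).flatten(-2)

codes = torch.randint(0, 2, (4, 64))   # toy 1-bit codes for 4 tokens
packed = pack_1bit(codes)              # 8x less memory to store and to read per decode step
assert torch.equal(unpack_1bit(packed), codes.to(torch.uint8))
```

A fused kernel can perform this kind of unpacking inside its inner loop, so the upcast codes never need to be materialized in GPU memory.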
This paper proposes a novel method, CommVQ, for compressing the KV cache in LLMs. The core innovation lies in using additive vector quantization—treating each token’s key/value vector as a unit rather than quantizing individual scalars—and designing a “commutative” codebook that allows efficient integration with RoPE. Experimental results on multiple long-context benchmarks and reasoning benchmarks show that CommVQ reduces memory usage while maintaining high accuracy relative to other KV cache compression baselines. The authors also provide an implementation that demonstrates real memory savings, enabling longer context sizes and larger batch sizes on a single GPU.
Update after rebuttal
The authors have addressed my questions. I maintain my opinion that this paper leans toward acceptance.
Questions for the Authors
See weaknesses.
Claims and Evidence
The major claims are supported by experiments (e.g., reduced KV cache size and better performance).
Methods and Evaluation Criteria
Overall, they make sense.
The authors could consider more datasets with long generation, such as MATH/AIME.
Theoretical Claims
The complexity computation is correct.
The RoPE part does not contain a rigorous proof, but no issues stand out.
Experimental Design and Analysis
The experimental design is valid.
Throughput/latency should be reported to demonstrate efficiency.
Supplementary Material
Yes, the ablation experiments.
Relation to Existing Literature
This work is about KV cache compression and is related to literature on quantization, token eviction, etc.
Missing Important References
N/A
Other Strengths and Weaknesses
Strengths:
- The commutative codebook idea is elegant and likely generalizable to many LLMs using RoPE-based positional encoding.
- The experiments show strong performance on multiple tasks at extremely low bit-rates (1–2 bits).
Weaknesses:
- No throughput or latency comparison to existing compression baselines is provided. Demonstrating real-world decode speed on multiple tasks or system loads would bolster confidence in the claimed advantages.
- The method does not seem to combine well with other KV cache compression or retrieval-based methods.
- Lack of exploration of how domain shifts (in the data used to learn the dictionary vs. real downstream data) might affect compression quality.
- For benchmark datasets, consider LongBench v2, MATH, or AIME.
Other Comments or Suggestions
- Some details on how to incorporate the commutative constraints during codebook training could be expanded.
- The paper might include best practices for calibrating the dictionary if the user’s data distribution significantly shifts from the calibration domain.
- It would be good to release code/implementation details in the supplementary material/appendix.
We thank the reviewer for the detailed and constructive feedback. Below, we address the key concerns raised.
1. Latency comparison
Please see our response to Reviewer 86AS (2. Latency comparison with the FP16 baseline and prior methods such as KIVI) for a detailed latency comparison. In summary, CommVQ-1 is consistently faster than KIVI-1, especially as context length increases, as measured on the LLaMA-3.1 8B model using a single NVIDIA H100 80G GPU.
2. Compatibility with other compression methods
CommVQ is compatible with other KV cache compression techniques like token eviction and quantization.
For token eviction or retrieval-based methods, which retain only key tokens during decoding, CommVQ integrates naturally by quantizing just those essential tokens. This allows further compression as only selected tokens are encoded and decoded using the learned codebook.
CommVQ can also benefit from codebook quantization, reducing storage and speeding up decoding via low-bit matrix operations.
In short, CommVQ is orthogonal to other methods and can be combined with them for greater compression. Exploring such combinations is a promising direction for future work.
3. Robustness under domain shift
Thanks for pointing this out. In practice, we find that our codebooks and encoder trained on general pre-training datasets (e.g., FineWeb-Edu) transfer reasonably well across tasks and domains. To validate this point, we conducted an ablation study on the LLaMA-3.1 8B model to show CommVQ-2's perplexity changes compared to the FP16 baseline on 4 datasets:
- FineWeb-Edu: a general dataset.
- GSM-8K: a math benchmark.
- Repobench-p: a code retrieval and completion benchmark.
- KV_Retrieval in InfiniteBench: a synthetic UUID key-value retrieval benchmark.
The first dataset represents in-domain evaluation, while the last three represent evaluations with domain shifts—i.e., the codebooks and encoder are trained on general text and tested on math, code, and synthetic UUID data.
| | FineWeb-Edu | GSM-8K | Repobench-p | KV_Retrieval |
|---|---|---|---|---|
| FP16 | 10.17 | 5.67 | 2.20 | 31.93 |
| CommVQ-2 | 11.54 | 6.14 | 2.78 | 32.72 |
| PPL Diff | 1.37 | 0.47 | 0.58 | 0.79 |
We find no significant increase in perplexity (PPL) due to domain shifts when compared to in-domain evaluations. This suggests that our method performs consistently well across domains that differ from the calibration data, including synthetic UUID data, which is unlikely to appear in the calibration set. Overall, we conclude that our method is robust and generalizable under domain shifts.
Finally, if a significant domain shift is encountered, we recommend further fine-tuning the encoder and codebook on domain-specific data—similar to best practices in other calibration-based quantization methods.
4. Additional benchmark (e.g., LongBench v2)
Thank you for the suggestion. Due to the limited time available during the rebuttal phase, we conducted additional evaluations on LongBench v2 using the LLaMA-3.1 8B model and compared them against KIVI, KVQuant, and VQLLM. KIVI and KVQuant fail to produce meaningful output with 1-bit quantization, so their results are omitted from the table. As shown below, CommVQ continues to outperform the baseline methods at comparable average quantization bit levels.
| Method | Avg. bit | Easy | Hard | Short | Medium | Long | Overall |
|---|---|---|---|---|---|---|---|
| FP16 | 16 | 27.1 | 25.4 | 30.6 | 24.2 | 22.2 | 26.0 |
| KIVI-2 | 3.00 | 25.7 | 24.8 | 25.6 | 25.7 | 23.1 | 25.1 |
| KVQuant-2 | 2.33 | 26.2 | 21.0 | 26.7 | 20.1 | 22.4 | 23.0 |
| VQLLM-2 | 2.00 | 15.6 | 17.7 | 21.1 | 15.3 | 13.0 | 16.9 |
| CommVQ-2 | 2.00 | 24.5 | 26.0 | 28.3 | 23.7 | 24.1 | 25.4 |
| VQLLM-1 | 1.00 | 7.8 | 6.8 | 8.9 | 8.4 | 1.9 | 7.2 |
| CommVQ-1 | 1.03 | 25.5 | 22.2 | 26.7 | 21.4 | 22.2 | 23.5 |
5. Explanation on how to incorporate the commutative constraints during codebook training
Thank you for pointing this out. The commutativity constraint is enforced by restricting each 2D subvector in the codebook to the form [[x, y], [-y, x]], which ensures commutativity with RoPE rotations. In other words, for each 2D subvector, we only learn two scalars—x and y—and use them to construct the subvector for computation. We will provide a more detailed explanation in our future revision.
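To make this concrete, here is a minimal NumPy sketch of the constrained 2D block and a numerical check that it commutes with a RoPE rotation. The function names and toy values are ours, for illustration only.

```python
import numpy as np

def commutative_block(x: float, y: float) -> np.ndarray:
    """Codebook 2D sub-block of the constrained form [[x, y], [-y, x]]."""
    return np.array([[x, y], [-y, x]])

def rope_rotation(m: int, theta_i: float) -> np.ndarray:
    """RoPE 2x2 rotation for position m and frequency theta_i."""
    c, s = np.cos(m * theta_i), np.sin(m * theta_i)
    return np.array([[c, -s], [s, c]])

# Only the two scalars x and y are learned per 2D sub-vector.
C = commutative_block(x=0.7, y=-1.2)
R = rope_rotation(m=5, theta_i=0.01)

# The constrained form guarantees R @ C == C @ R for every position m and
# frequency theta_i, which is the property used in the attention reformulation.
assert np.allclose(R @ C, C @ R)
```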
6. Code and Implementation
Upon acceptance, we will open-source our training pipeline, model weights, and optimized Triton kernels to facilitate reproducibility and further research. Please refer to the response to Reviewer 9BoD (2. Triton kernel implementation details) for explanations of the Triton kernel implementation.
Thanks for sharing the additional results. Could you provide a throughput comparison under different batch sizes and sequence length settings? This information would be particularly useful because KV cache compression is especially beneficial when serving large batch sizes and longer sequences.
Thank you for your response. We present a latency comparison with KIVI under various batch sizes and sequence length configurations, using the LLaMA-3.1 8B model. The batch sizes (BS) are set to 1, 2, 4, and 8, while the sequence lengths are varied across 8K, 16K, 32K, 64K, and 128K (the model’s maximum) tokens.
We start with the smallest batch size and gradually increase the sequence length until KIVI encounters an out-of-memory (OOM) error. Once that occurs, we move on to the next batch size. For each configuration, we record the total latency for generating one next token, measured in seconds per token (s/token). The results are summarized in the table below.
Across all tested settings, our method consistently achieves lower latency compared to KIVI. In general, the advantage becomes more pronounced as the sequence length increases across different batch sizes. We attribute this to our method’s higher compression rate and efficiency-oriented design.
| | Method | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|
| BS=1 | KIVI-1 | 0.045 | 0.057 | 0.102 | 0.190 | 0.297 |
| BS=1 | CommVQ-1 | 0.031 | 0.032 | 0.050 | 0.085 | 0.152 |
| BS=2 | KIVI-1 | 0.053 | 0.090 | 0.162 | OOM | - |
| BS=2 | CommVQ-1 | 0.035 | 0.053 | 0.088 | 0.158 | - |
| BS=4 | KIVI-1 | 0.088 | 0.159 | OOM | - | - |
| BS=4 | CommVQ-1 | 0.056 | 0.091 | 0.161 | - | - |
| BS=8 | KIVI-1 | 0.154 | OOM | - | - | - |
| BS=8 | CommVQ-1 | 0.117 | 0.206 | - | - | - |
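For reference, seconds-per-token numbers of this kind are typically obtained with a prefill-then-decode timing loop such as the generic sketch below. This is our own illustration for a HuggingFace-style causal LM, not the authors' benchmarking script.

```python
import time
import torch

@torch.no_grad()
def seconds_per_token(model, input_ids, n_new_tokens=32):
    """Average decode latency (s/token) with greedy decoding and a KV cache."""
    out = model(input_ids, use_cache=True)              # prefill the prompt once
    past = out.past_key_values
    next_token = out.logits[:, -1:].argmax(dim=-1)

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_new_tokens):                        # timed decode steps
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_new_tokens
```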
- This paper leverages additive quantization by introducing a lightweight encoder and codebook to compress the KV cache, which can then be decoded with a simple matrix multiplication.
- The authors design a codebook that is commutative with Rotary Position Embedding (RoPE) and utilize an Expectation-Maximization (EM) algorithm to learn the codebook, which allows for efficient integration of decoding into the self-attention mechanism to reduce computation overhead.
- The paper shows an impressive 87.5% reduction in FP16 KV cache size while maintaining accuracy on standard long-context datasets.
Questions for the Authors
- I'd encourage the authors to provide more insight into whether 1-bit quantization has actual practical benefits. At first glance, it seems a little strange that 16 bits can be compressed to 1 bit with a win on both efficiency and quality.
Claims and Evidence
- "Existing quantization techniques treat each scalar in the KV cache independently, CommVQ performs quantization at the vector level. The method uses additive quantization" This seems like a very promising insight.
- "We refine our codebook to be RoPE-commutative. This refinement enables us to reformulate the self-attention computation to incorporate the decoding process more efficiently. " Great to see focus on efficiently achieving the required quantization, making deployment easier.
- The codebook is learned using a simple neural network by minimizing the reconstruction error, and the technique used is EM. It would have been great to see some more insight behind the choice of this technique.
Methods and Evaluation Criteria
- Neural network-based learnt codebook with EM algorithm to minimize the reconstruction error. Overall this approach makes sense and is intuitive.
- Datasets used are standard for long-context evaluation for LLMs.
- "We provide an ablation study on how to choose Nc′, R and g in Appendix A.4." These seem to be crucial for CommVQ's performance, and it would have been good to have at least the insights behind the choices in the main body of the paper.
Theoretical Claims
- Equations in Section 4.2 seem convincing and make sense. However, a lot of the key details seem to be moved to the appendix. It would be helpful to at least have the main insights from them in the main body of the paper.
Experimental Design and Analysis
- The experimental section is well set up and the baselines are appropriate. I was surprised to see the evaluation with only models of 8B parameters, where the KV cache is still relatively small. Given that the KV caches of larger models would greatly benefit from quantization, it would be helpful to see CommVQ's performance on 70B or larger models.
- Performance gains of CommVQ-2 seem to be marginal. This seems to suggest that CommVQ performs well only for 1-bit quantization. My concern here is how useful gains on 1-bit quantization are; do operators typically use 1-bit quantized models? It would be great to see some evidence for this.
- I am missing where the initial claim of an 87.5% reduction comes from. Is this for reducing from 16-bit to 2-bit? In that case, the 2-bit quantization seems to have marginal gains. Most of the gains seem to be in 1-bit, but then the reduction should be larger.
- Figure 2 has impressive results, I liked the focus on reduced cost and overhead and the extensive demonstration of the same.
Supplementary Material
I skimmed over the appendices.
Relation to Existing Literature
- This is a very promising direction, and quantization of the KV cache has extremely practical consequences. I liked the way the problem was structured for both maintaining quality and having efficient inference.
Missing Important References
References are adequate.
Other Strengths and Weaknesses
- Promising overall direction with a dual focus on quality and cost of compression.
- The insight on the commutativity of RoPE is very useful.
Other Comments or Suggestions
N/A
We thank the reviewer for the thoughtful feedback and positive assessment of our direction and formulation. Below, we address the key concerns raised.
1. Experiments on larger models (e.g., 70B)
We focused on 8B models (LLaMA-2, Mistral, and LLaMA-3.1) due to their popularity and our resource constraints. However, to address your suggestion, we have now evaluated CommVQ on the LLaMA-3.1 70B model under 1-bit quantization using LongBench. Results are shown below:
| Method | Avg. bit | Qasper | QMSum | MultiNews | TREC | TriviaQA | SAMSum | LCC | RepoBench-P | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| FP16 | 16 | 45.46 | 23.75 | 27.76 | 75.50 | 93.18 | 46.23 | 37.62 | 55.24 | 50.59 |
| KIVI-1 | 2.00 | 7.90 | 6.92 | 11.13 | 37.50 | 52.21 | 8.99 | 24.02 | 21.26 | 21.24 |
| CommVQ-1 | 1.03 | 37.96 | 22.03 | 22.17 | 58.50 | 92.41 | 38.63 | 33.16 | 44.93 | 43.72 |
These results show that CommVQ generalizes well to the 70B model, achieving strong accuracy using a 1-bit KV cache. This confirms CommVQ’s scalability and practical utility in larger models.
2. Actual practical benefits for 1-bit quantization
Compressing the KV cache to 1-bit offers several practical benefits. The most significant benefit is a 16× memory reduction compared to the standard FP16 KV cache. These memory savings are applicable regardless of the model's precision, meaning the model doesn't need to operate at 1-bit to take advantage of 1-bit KV cache quantization.
Such a substantial memory reduction not only allows a single GPU to handle longer context lengths and larger batch sizes (as shown in Figure 2 of our main paper), but it also significantly reduces data transfer time when offloading the KV cache to non-GPU memory. This benefit is especially notable in offloading scenarios, which are common when serving large models or deploying on edge devices. In these cases, limited PCIe bandwidth often makes KV cache transfer a major bottleneck. Reducing the size of the KV cache helps mitigate this issue, leading to faster overall inference.
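As a back-of-the-envelope illustration of the memory argument, the snippet below computes approximate KV cache sizes assuming a LLaMA-3.1-8B-like configuration (32 layers, 8 KV heads, head dimension 128). These configuration numbers are our assumption for the example, and CommVQ-1's reported average is 1.03 bits rather than exactly 1 bit.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bits=16):
    """Approximate KV cache size: keys + values over all layers and KV heads."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits / 8

ctx = 128 * 1024                                 # 128K-token context
print(kv_cache_bytes(ctx, bits=16) / 2**30)      # ~16 GiB at FP16
print(kv_cache_bytes(ctx, bits=1) / 2**30)       # ~1 GiB at an ideal 1 bit
```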
3. Clarification on "87.5% reduction"
We clarify that the 87.5% reduction refers to the reduction from FP16 (16-bit) KV cache to 2-bit, i.e., (1 - 2 / 16) = 87.5%. This applies to CommVQ-2. For CommVQ-1 (1-bit), the reduction reaches 93.75%, enabling even more aggressive compression.
4. More insights behind the codebook, encoder and EM algorithm
Thank you for your suggestions. We chose to use a simple neural network as the encoder, as it proved to be lightweight yet effective in our preliminary experiments. We selected the EM algorithm because it offers two advantages:
- The EM algorithm converges very quickly, requiring far fewer iterations than gradient-based approaches.
- The EM algorithm incorporates principled techniques—such as soft assignment and annealing—to prevent mode collapse, where some centroids become obsolete after a few iterations. This has been one of the major challenges in quantization algorithms (a generic sketch of such a soft-assignment update is shown below).
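For intuition, here is a generic sketch of a soft-assignment EM-style codebook update with temperature annealing. It is not the authors' training code; the codebook size, iteration count, and annealing schedule are arbitrary illustrative choices.

```python
import numpy as np

def soft_em_codebook(X, codebook_size=16, iters=20, temp=1.0, anneal=0.9, seed=0):
    """Learn a codebook for vectors X of shape (n_samples, dim) by soft-assignment EM."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), codebook_size, replace=False)]   # initialize centroids from data
    for _ in range(iters):
        # E-step: soft responsibilities via a softmax over negative squared distances.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # (n, K) pairwise distances
        logits = -d2 / max(temp, 1e-6)
        logits -= logits.max(axis=1, keepdims=True)           # numerical stability
        R = np.exp(logits)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: centroids become responsibility-weighted means, so every centroid
        # keeps receiving some pull from the data, which helps avoid collapse.
        C = (R.T @ X) / (R.sum(axis=0)[:, None] + 1e-8)
        temp *= anneal                                         # anneal toward hard assignment
    return C

codebook = soft_em_codebook(np.random.default_rng(1).normal(size=(1024, 8)))
print(codebook.shape)  # (16, 8)
```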
Regarding the codebook configurations, we included the ablation study in the Appendix primarily due to page limitations. As suggested, we will provide more insights in the main paper in future revisions.
In general, this paper quantizes the KV cache into a 1-bit representation and then uses that representation to combine a set of basis vectors in the attention process. Intuitively, it is equivalent to decomposing the KV cache into a combination of a finite number of basis vectors to speed up the overall calculation process. It performs well on long-sequence processing benchmarks such as LongBench and InfiniteBench, demonstrating the effectiveness and efficiency of the solution.
Questions for the Authors
Please refer to the 'Other Strengths And Weaknesses' part.
Claims and Evidence
The paper supports the authors' claims well.
Methods and Evaluation Criteria
This paper does not build its own benchmarks and mainly uses the typical benchmarks (LongBench and InfiniteBench) to evaluate the effectiveness and efficiency of models. In addition, we argue that the RULER benchmark should be included in this paper.
Theoretical Claims
Yes.
Experimental Design and Analysis
The typical benchmarks (LongBench and InfiniteBench) are used to evaluate the effectiveness and efficiency of models. We argue that this paper should add the benchmark RULER to evaluate the context retrieval ability.
Supplementary Material
Yes. The appendix mainly supplements the method details and experimental details.
Relation to Existing Literature
This paper is related to current sparse attention mechanisms, including KV cache compression and eviction.
Missing Important References
No
Other Strengths and Weaknesses
This method can compress the KV cache into a 1-bit form and performs well on typical long-sequence benchmarks. However, I still have some concerns.
At the experimental level, the paper's method performs poorly on retrieval tasks. However, RULER and other retrieval tasks are now core evaluation benchmarks for long-sequence processing. It is recommended that the quantization method in this paper be analyzed to determine whether it can be used for retrieval tasks.
At the model level, decomposing the KV cache into a combination of basis vectors can achieve 1-bit cache quantization, but I think it would be better to perform low-rank decomposition directly. Low-rank decomposition can preserve the model's performance and be easily accelerated on the underlying hardware. It is recommended that low-rank decomposition be included as a baseline to prove the necessity and advantages of 1-bit quantization.
Other Comments or Suggestions
Please refer to the 'Other Strengths And Weaknesses' part.
We thank the reviewer for the thoughtful feedback and helpful suggestions. Below, we address your concerns.
1. Applicability to retrieval tasks such as RULER
We appreciate the suggestion to evaluate additional retrieval-specific tasks. We have included results on the RULER benchmark using the LLaMA-3.1 8B model. We set the context length to 128K. As shown below, CommVQ-2 achieves the highest average score among the methods that can achieve an average quantization bit of 2. CommVQ-1 also retains competitive retrieval ability under the extreme 1-bit quantization.
| Method | Avg. bit | Niah1 | Niah2 | Niah3 | MKey1 | MKey2 | MKey3 | MValue | MQuery | VT | CWE | FWE | QA1 | QA2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FP16 | 16 | 99.4 | 96.6 | 99.6 | 97.4 | 68.9 | 55.7 | 89.3 | 97.3 | 59.0 | 0.1 | 75.0 | 71.4 | 41.4 | 73.2 |
| KVQuant | 2.33 | 89.0 | 36.0 | 40.8 | 37.0 | 1.0 | 0.0 | 24.3 | 26.4 | 36.7 | 0.3 | 66.0 | 25.4 | 24.4 | 31.3 |
| KIVI | 2.00 | 31.0 | 0.6 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.16 | 0.4 | 30.8 | 12.2 | 8.0 | 6.5 |
| CommVQ-2 | 2.00 | 97.2 | 91.4 | 92.4 | 95.0 | 61.0 | 4.8 | 78.0 | 88.4 | 49.1 | 0.26 | 72.9 | 68.4 | 39.6 | 64.5 |
| CommVQ-1 | 1.03 | 63.0 | 53.6 | 11.6 | 64.4 | 18.8 | 0.0 | 23.6 | 24.0 | 28.6 | 0.18 | 67.7 | 63.4 | 37.0 | 35.1 |
These results demonstrate that CommVQ can preserve retrieval capability even under aggressive compression while performing better than other methods under the same compression rate.
Apart from RULER, we also conducted experiments on the Needle-in-a-Haystack test, which also focuses on retrieval; please refer to our response to Reviewer 9BoD (1. Full LongBench and Needle-in-a-Haystack results) for details. The NIAH result figures are available at the link provided in that response.
In summary, CommVQ under 2-bit quantization can preserve the full retrieval capability of the FP16 model, while our 1-bit quantization version performs better on the NIAH test than KIVI's 1-bit quantization version. Moreover, from Table 2 in our main paper, CommVQ achieves good performance on retrieval tasks (namely R.PK, R.Num, and R.KV), especially in the extreme 1-bit quantization case where CommVQ significantly outperforms the baselines. Our comprehensive results demonstrate our method's effectiveness on retrieval tasks.
2. Comparison with low-rank decomposition
Thanks for the suggestion. We chose to compare CommVQ with Palu, a state-of-the-art low-rank decomposition KV cache compression method published at ICLR 2025. We conducted experiments on LongBench using the LLaMA-3.1 8B model, with the context length set to 128K. As shown below, CommVQ consistently outperforms Palu at both 2-bit and 1-bit quantization levels, showing that our method is more effective than low-rank decomposition methods under various compression rates.
| Method | Avg. bit | Qasper | QMSum | MultiNews | TREC | TriviaQA | SAMSum | LCC | RepoBench-P | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| FP16 | 16 | 25.19 | 23.31 | 26.82 | 72.50 | 91.65 | 43.49 | 52.47 | 49.01 | 48.05 |
| Palu-30% (3 bits) | 2.10 | 11.71 | 24.25 | 26.16 | 66.50 | 86.73 | 42.99 | 50.14 | 51.13 | 44.95 |
| CommVQ-2 | 2.00 | 24.67 | 24.36 | 26.48 | 72.50 | 91.92 | 43.98 | 53.02 | 46.92 | 47.98 |
| Palu-60% (3 bits) | 1.20 | 2.80 | 16.37 | 11.06 | 51.08 | 3.43 | 6.56 | 18.66 | 21.36 | 16.42 |
| CommVQ-1 | 1.03 | 18.86 | 23.02 | 24.34 | 69.00 | 91.61 | 41.83 | 48.78 | 42.08 | 44.94 |
This paper introduces CommVQ, a novel approach to compress the KV cache during inference, particularly when processing long contexts. Unlike existing scalar-based quantization methods, CommVQ employs vector quantization at the token level using a learned encoder and codebook approach. CommVQ makes two key innovations: (1) leveraging additive quantization to compress each vector in the KV cache into a low-bit representation and (2) designing a codebook that commutes with RoPE, allowing for efficient integration with the self-attention mechanism. This approach achieves impressive compression rates while maintaining high accuracy compared to baseline methods, even enabling effective 1-bit quantization with minimal performance degradation.
Questions for the Authors
NA
Claims and Evidence
The motivation for using vector quantization is clear, while the proposed method further addresses the computational cost of naive VQ.
Methods and Evaluation Criteria
The design of CommVQ makes sense; it addresses the critical efficiency bottleneck of combining VQ with RoPE embeddings. Moreover, the experimental results show the effectiveness of the proposed method.
Theoretical Claims
The paper's claims are correct. However, it lacks some detailed proofs, such as why the RoPE embedding is commutative in Property 1. Therefore, I would expect the authors to focus more on this part, as understanding it is very important for the design of CommVQ.
Experimental Design and Analysis
The experimental designs include various long-context benchmarks; however, only LLaMA-3.1-8B-Instruct, LLaMA-2-7B, and Mistral-7B are included. Therefore, I would expect more experiments to be conducted on the latest models, such as Qwen-2.5.
Moreover, I would expect a latency comparison with FP16 attention and prior quantization methods such as KIVI. The current results show the speedup of the new codebook; however, it is still unclear how it compares to other methods.
Supplementary Material
NA
Relation to Existing Literature
NA
Missing Important References
NA
Other Strengths and Weaknesses
NA
Other Comments or Suggestions
NA
We appreciate your positive feedback and insightful suggestions. Below, we address your concerns.
1. CommVQ applied to latest models such as Qwen-2.5
To evaluate our method's generalization to the latest models, we applied CommVQ to the Qwen-2.5 7B model and evaluated it on LongBench. Due to the limited time for rebuttal and compatibility issues (e.g., KIVI and KVQuant do not officially support Qwen in their open-sourced code), we used KV-4, the built-in 4-bit KV cache quantization method (HQQ) in HuggingFace Transformers v4.46.2, as our baseline.
| Method | Avg. bit | Qasper | QMSum | MultiNews | TREC | TriviaQA | SAMSum | LCC | RepoBench-P | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| FP16 | 16 | 13.04 | 20.70 | 22.47 | 72.50 | 89.47 | 46.16 | 58.97 | 64.51 | 48.48 |
| KV-4 (HQQ) | 4 | 4.59 | 12.58 | 7.85 | 35.92 | 25.70 | 10.63 | 15.00 | 15.12 | 15.92 |
| CommVQ-2 | 2 | 14.58 | 22.57 | 24.05 | 68.00 | 87.04 | 45.34 | 55.46 | 61.48 | 47.31 |
CommVQ achieves strong performance while compressing the KV cache to an average of 2 bits, significantly outperforming the default 4-bit quantization baseline. This indicates that our method can be effectively applied to the latest models. We also demonstrate that CommVQ can be applied to much larger models, such as the LLaMA-3.1 70B model, in our response to Reviewer yREq (1. Experiments on larger models). All these results show the broad applicability and effectiveness of CommVQ.
2. Latency comparison with the FP16 baseline and prior methods such as KIVI
We report latency per generated token (in seconds) using the LLaMA-3.1 8B model on a single NVIDIA H100 80G GPU. Both KIVI and our method are optimized with Triton kernels. We set the batch size to 1 and vary the context length. As shown below, CommVQ-1 is consistently faster than KIVI-1, especially as the context length increases.
| Context Length | FP16 | KIVI-1 | CommVQ-1 |
|---|---|---|---|
| 8K | 0.024 | 0.045 | 0.031 |
| 16K | 0.026 | 0.058 | 0.033 |
| 32K | 0.031 | 0.102 | 0.051 |
| 64K | 0.037 | 0.190 | 0.085 |
| 128K | 0.051 | 0.297 | 0.152 |
While both CommVQ and KIVI appear slower than the FP16 baseline in our current measurements, we believe this is primarily due to practical factors such as the usage of flash attention (the FP16 model uses flash attention by default, while KIVI and our method currently do not support flash attention during the generation stage), Triton kernel launch overhead, and the hardware and model we chose. Importantly, as the context length grows, CommVQ achieves better efficiency than KIVI. We are actively optimizing our Triton implementation and expect further latency improvements in future versions.
3. Lack of some detailed proofs, such as why RoPE embedding is commutative in Property 1
We thank the reviewer for pointing this out. Due to the page limit, we have omitted some proofs in the main paper. We will add detailed proofs to the main paper in our future revisions.
As for why the RoPE embedding is commutative in Property 1: since

$$R_m^i C = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x & y \\ -y & x \end{pmatrix} = \begin{pmatrix} x\cos m\theta_i + y\sin m\theta_i & y\cos m\theta_i - x\sin m\theta_i \\ x\sin m\theta_i - y\cos m\theta_i & y\sin m\theta_i + x\cos m\theta_i \end{pmatrix}$$

and

$$C R_m^i = \begin{pmatrix} x & y \\ -y & x \end{pmatrix} \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} = \begin{pmatrix} x\cos m\theta_i + y\sin m\theta_i & y\cos m\theta_i - x\sin m\theta_i \\ x\sin m\theta_i - y\cos m\theta_i & y\sin m\theta_i + x\cos m\theta_i \end{pmatrix},$$

the two products are identical, which confirms Property 1.
Thank you for the authors' rebuttal. Based on the current results, it establishes a new state of the art for KV cache compression. However, as the context length increases, it gradually becomes much slower than FP16, which is not caused only by the Triton overhead. Based on these observations, I keep my score for this work. I think it should be accepted.
Dear Reviewer 86AS,
Thank you for your thoughtful and positive feedback. We appreciate your recognition of our contributions and your insights. Your observations will guide our future optimizations and research directions. Thank you again for your review and for supporting the acceptance of our work.
The reviewers are consistent in their opinion that this is a nice result that could be accepted.
Overall, the reviewers view the paper positively while indicating several possible improvements, including additional proofs and adding low-rank decomposition as a baseline.