SQuat: Subspace-orthogonal KV Cache Quantization
We propose a KV cache quantization method that preserves task-critical information throughout the quantization process.
Abstract
Reviews and Discussion
This paper proposes a method that considers potential future output errors during the quantization of the current key. The authors first demonstrate that these errors are upper-bounded by terms related to the product of queries and the quantization residuals of keys. Consequently, they propose regularizing the quantization process to encourage orthogonality between the quantization residuals and the subspaces spanned by queries. The proposed method is evaluated against several existing methods on a range of benchmarks. Overall, the motivation is compelling and supported by insightful analysis, suggesting strong potential for the proposed method. However, it should be noted that the quality of the experiments currently lags behind that of the other high-quality components of the work.
Reasons to Accept
- The motivation is compelling and supported by insightful analysis that could inform future work.
- The proposed method is targeted and rigorous. Its derivation is easy to follow, and the solution is elegant.
- Empirical results demonstrate strong performance and introduce only negligible overhead.
Reasons to Reject
- Two new hyperparameters are introduced, and the paper lacks a discussion of how to set them effectively. See Q1.
- Despite Llama 3.1's support for a 128k context length and the use of LongBench (a benchmark for long-context evaluation), the authors limited their experiments to a maximum sequence length of 8192 tokens. This is considerably shorter than the contexts explored in recent long-context modeling studies. Since KV quantization is primarily beneficial in long-context scenarios, evaluation on genuinely long sequences (e.g., >64k tokens) is necessary to accurately assess its performance.
- The analysis lacks a detailed speed profile or computational cost breakdown for SQuat. Detailed, separate profiling of the overheads in prefilling and generation is crucial for assessing practical performance and identifying bottlenecks. Refer to Q5 for a follow-up.
Questions for Authors
- I find the results in Table 3 puzzling. Intuitively, a larger r defines a higher-rank subspace, which would presumably make the orthogonality constraints more difficult to satisfy. Conversely, a smaller value diminishes the regularization effect. Therefore, the observation that reducing one while increasing the other generally leads to better performance appears paradoxical. I would appreciate a more detailed analysis and discussion regarding the specific influence of r on SQuat's behavior, especially in light of these seemingly conflicting effects.
- Why do SQuat and SQuat_pre use different values for the two hyperparameters on the LongBench benchmark?
- What is the length of the input sequence used to generate Figure 2? Does the result reported in Sec 3.2, i.e., that the prompt alone is a good enough approximation, hold for substantially longer sequences, for example, those with 100,000 tokens?
- What is the role of the extra multiplication in Alg. 1, line 6? Why is the subspace basis alone not enough to represent the subspace?
- If SVD is fast enough, can the query subspace be updated (periodically) using the residual buffer? If so, would this approach enhance performance, and what would be the consequent slowdown?
Thanks for your careful reading of our paper and thoughtful comments!
R1 + Q1. There are two new hyperparameters introduced. This paper lacks discussion on how to set them effectively. A more detailed analysis regarding the specific influence of r on SQuat's behavior.
A1. We recommend a single configuration for these two hyperparameters and use it consistently across all LLMs and tasks in LM-eval. As shown in our paper, this choice consistently outperforms all other baselines (see our response to your Q2 for LongBench results). Below, we provide the intuition behind this selection.
For r, we recommend a small value. As you pointed out, a larger r defines a higher-rank subspace, which makes satisfying the orthogonality constraints more difficult and can degrade model performance. This aligns with our ablation study (Table 3), where, across the regularization coefficients we tested, a small r achieves the highest scores among all configurations. One exception occurs when the regularization coefficient is very small; in this case, the orthogonality constraint is relaxed, and a slightly larger r yields better performance. As for the regularization coefficient, we recommend the reported value or an even smaller one, based on the observation that the largest singular value of the query subspace typically falls between 100 and 300. This choice ensures the regularization is effective without being too restrictive.
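For concreteness, the singular value mentioned above can be inspected with a few lines of PyTorch. This is an illustrative sketch rather than code from the paper, and the tensor shape is an arbitrary placeholder:

```python
import torch

# Hypothetical query tensor collected from one attention head during prefill:
# one row per prompt token. The shape (160, 128) is an arbitrary placeholder.
prompt_queries = torch.randn(160, 128)

# Singular values of the stacked prompt queries, in descending order.
# The largest one indicates how strong the dominant query direction is,
# which is the quantity referenced above when choosing the regularization
# coefficient.
singular_values = torch.linalg.svdvals(prompt_queries)
print(f"largest singular value: {singular_values[0].item():.1f}")
```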
While we do not claim that this configuration is universally optimal (in fact, the best hyperparameters will vary depending on the specific task), our recommended configuration delivers consistent performance (outperforming all baselines tested) across different LLMs and evaluation tasks.
R2. Despite Llama 3.1's support for a 128k context length and the use of LongBench (a benchmark for long-context evaluation), the authors limited their experiments to a maximum sequence length of 8192 tokens. This is considerably shorter than the contexts explored in recent long-context modeling studies.
A2. To address your question, we increased the context length to 30,000 tokens. On this larger window, our results remain consistent with those reported in the original paper: our method achieves near-lossless performance even when quantizing the KV cache to INT2. Specifically, the Baseline (16-bit) achieves a score of 53.206, while SQuat (2-bit) achieves 52.554.
We chose 30,000 tokens because nearly all sequences from the selected LongBench tasks fall below this threshold. We acknowledge that other benchmarks contain much longer sequences; however, 30,000 tokens is also the maximum context length we can experiment with given our limited computational resources.
R3. The analysis lacks a detailed speed profile or computational cost breakdown for SQuat. Detailed and separated profiling of overheads in prefilling and generation is crucial for assessing practical performance and identifying bottlenecks.
A3. Please refer to our global response.
Q2. Why use different hyperparameter values on LongBench?
A4. In this paper, we explore various hyperparameter settings to evaluate the optimal performance of SQuat on LongBench, which consists of relatively simple tasks. Nonetheless, the default values we recommend can also achieve near-lossless accuracy on LongBench and consistently outperform all baseline methods (see below):
Llama-2-7B-hf
| SQuat | KIVI | GEAR | ZipCache | FP16 |
|---|---|---|---|---|
| 44.37% | 44.18% | 44.17% | 43.06% | 44.50% |
Llama-3.1-8B-Instruct
| SQuat | KIVI | GEAR | ZipCache | FP16 |
|---|---|---|---|---|
| 51.88% | 51.87% | 51.35% | 51.48% | 52.38% |
Mistral-7B-Instruct-v0.3
| SQuat | KIVI | FP16 |
|---|---|---|
| 51.74% | 51.66% | 52.27% |
Q3. What is the length of the input sequence used to generate Figure 2? Does the result reported in Sec 3.2, i.e., that the prompt alone is a good enough approximation, hold for substantially longer sequences?
A5. The input sequence is 203 tokens long. We believe the observation that "prompt-only is a good enough approximation" holds even for much longer sequences (we tested 8k tokens and observed consistent results). This is likely due to the low-rank structure of the projection matrix that maps attention inputs to query vectors, which constrains the query subspace. In Figure 2 (left, red curve), even when the query subspace is derived from a different task (document QA instead of math), it still provides a reasonable approximation for the math-500 dataset's token query tensors.
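As a minimal illustration (not the paper's code) of how this approximation can be quantified, the sketch below measures how much of the later queries' energy falls outside the subspace spanned by the prompt queries; the tensor shapes and the rank r=16 are arbitrary placeholders.

```python
import torch

def out_of_subspace_ratio(prompt_queries, later_queries, r):
    """Fraction of the energy of later_queries that falls outside the rank-r
    subspace spanned by the prompt queries' top right singular vectors."""
    _, _, Vh = torch.linalg.svd(prompt_queries, full_matrices=False)
    basis = Vh[:r]                               # (r, d), orthonormal rows
    proj = later_queries @ basis.T @ basis       # projection onto the subspace
    return ((later_queries - proj).norm() / later_queries.norm()).item()

# Toy usage with random stand-ins for real query activations; in practice the
# tensors would be per-head queries gathered during prefill and decoding.
ratio = out_of_subspace_ratio(torch.randn(203, 128), torch.randn(8000, 128), r=16)
```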
To be continued...
Q4. What is the role of the extra multiplication in Alg. 1, line 6? Why is the subspace basis alone not enough to represent the subspace?
A6. You are completely correct that the orthonormal basis alone is enough to represent the subspace. The role of the extra multiplication is to assign finer-grained scaling factors to each direction in the subspace. This enables more precise control over quantization error by reducing its impact along directions that carry more important task-related information, rather than distributing the error uniformly across all directions in the subspace.
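A small sketch of this distinction, with names and the rank r chosen by us rather than taken from Algorithm 1: the orthonormal basis alone identifies the subspace, while weighting each basis direction by its singular value lets the penalty emphasize the directions that carry more query energy.

```python
import torch

prompt_queries = torch.randn(160, 128)               # assumed (tokens, head_dim)
_, S, Vh = torch.linalg.svd(prompt_queries, full_matrices=False)

r = 4                                                 # placeholder subspace rank
basis = Vh[:r]                                        # enough to *represent* the subspace
weighted_basis = S[:r, None] * basis                  # weights each direction by importance

# A key-quantization residual e can be penalized by its overlap with the subspace.
# The unweighted penalty treats all r directions equally; the weighted one
# penalizes error more along higher-energy query directions.
e = torch.randn(128)
uniform_penalty = (basis @ e).pow(2).sum()
weighted_penalty = (weighted_basis @ e).pow(2).sum()
```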
Q5. If SVD is fast enough, can the query subspace be updated (periodically) using the residual buffer? If so, would this approach enhance performance, and what would be the consequent slowdown?
A7. Indeed, this is an excellent suggestion and it works really well!!
We dynamically update the query subspace using the residual buffer, rather than estimating it solely from the prompt tokens (as you suggested). We evaluate this method on the deepseek-distill-llama-8B model using 100 samples from the math-500 dataset, updating the query subspace every time 512 new tokens are generated. The resulting accuracy is 77%, matching the FP16 baseline and indicating no loss from quantization! For comparison, our original implementation (static subspace) achieves 72% accuracy, while FP16 (no quantization) achieves 77%.
To further understand the effect of update frequency, we conduct an ablation study by varying how often the query subspace is updated. The table below summarizes the results:
| Update interval (#tokens) | 128 | 256 | 512 | 1024 | 2048 |
|---|---|---|---|---|---|
| Accuracy | 70% | 72% | 77% | 75% | 74% |
We found that updating too frequently (e.g., every 128 tokens) leads to lower accuracy due to redundancy in the recent token queries. Updating too infrequently (e.g., after 1024 or 2048 tokens) results in diminished performance because the query subspace becomes too coarse to capture local semantic nuances.
While these results are promising, we note the caveat that enabling dynamic updates requires storing the query tensors of the most recent tokens in memory until the next update. For example, achieving the best performance requires saving 512 tokens' query tensors, which introduces a significant memory overhead. In contrast, our original approach (estimating the subspace from prompt tokens during the prefill stage) incurs no additional memory cost, since all prompt query tensors are already computed during prefill. Exploring memory-efficient strategies for dynamic query subspace updates would be an interesting direction for future work.
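For readers who want the periodic-update logic spelled out, here is a rough sketch; the class name, buffering scheme, and defaults are our own placeholders rather than the actual implementation.

```python
import torch

class DynamicQuerySubspace:
    """Rank-r query subspace refreshed from recently generated queries every
    `update_every` tokens. All names and defaults here are placeholders."""

    def __init__(self, prompt_queries, r=4, update_every=512):
        self.r, self.update_every = r, update_every
        self.recent = []                       # buffered query vectors
        self.basis = self._fit(prompt_queries)

    def _fit(self, queries):
        _, _, Vh = torch.linalg.svd(queries, full_matrices=False)
        return Vh[: self.r]

    def observe(self, new_query):
        """Call once per generated token; refresh the basis when enough
        new queries have accumulated, then drop the buffered queries."""
        self.recent.append(new_query)
        if len(self.recent) >= self.update_every:
            self.basis = self._fit(torch.stack(self.recent))
            self.recent.clear()
```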
Dear Reviewer hLUL,
Thank you for reading our paper and providing thoughtful feedback. We hope our responses have addressed all your questions. As the discussion period comes to an end, please feel free to let us know if you have any additional suggestions or feedback that could help further improve the paper. Thank you once again!
Thanks for your response, I will maintain my score.
This paper introduces SQuat, a KV cache quantization method that preserves task-relevant information by ensuring quantization errors remain orthogonal to the query subspace. Unlike traditional approaches that focus solely on minimizing numerical error, SQuat recognizes that preserving inner products between key and query tensors is critical for maintaining model performance. Experiments show memory reduction (2.17×-2.82×) and throughput improvement (2.45×-3.60×) while outperforming existing quantization methods across multiple LLMs and tasks.
Reasons to Accept
- It introduces a novel conceptual framing for KV cache quantization that addresses fundamental limitations of current approaches.
- It provides rigorous mathematical derivation with proofs and an efficient implementation algorithm.
- It demonstrates consistent performance improvements across diverse models (Llama-2, Llama-3.1, Mistral, DeepSeek) and tasks (reasoning, long-context understanding).
Reasons to Reject
- Insufficient guidance on hyperparameter selection beyond ablation studies.
- Limited analysis of computational overhead during inference, particularly for initial tokens.
We thank the reviewer for the feedback and kind comments!
Q1. Insufficient guidance on hyperparameter selection beyond ablation studies.
A1. The main hyperparameters of our method are the subspace dimension r and the regularization coefficient. We recommend a single configuration for these two hyperparameters and use it consistently across all LLMs and tasks in LM-eval. As shown in our paper, this choice consistently outperforms all other baselines (see our response to Reviewer hLUL's Q2 for LongBench results). Below, we provide the intuition behind this recommendation.
For r, we recommend a small value. A larger r defines a higher-rank subspace, which makes satisfying the orthogonality constraints more difficult and can degrade model performance. This aligns with our ablation study (Table 3), where, across the regularization coefficients we tested, a small r achieves the highest scores among all configurations. One exception occurs when the regularization coefficient is very small; in this case, the orthogonality constraint is relaxed, and a slightly larger r yields better performance. As for the regularization coefficient, we recommend the reported value or an even smaller one, based on the observation that the largest singular value of the query subspace typically falls between 100 and 300. This choice ensures the regularization is effective without being too restrictive.
While we do not claim that this configuration is universally optimal (in fact, the best hyperparameters will vary depending on the specific task), our recommended configuration delivers consistent performance (outperforming all baselines tested) across different LLMs and evaluation tasks.
In addition to and , our method includes two other hyperparameters: residual buffer length and group size. Throughout the paper, we set both of them to 32 tokens, following the recommendations from prior work [1, 2].
[1] Liu, Zirui, et al. "Kivi: A tuning-free asymmetric 2bit quantization for kv cache." ICML, 2024.
[2] Sheng, Ying, et al. "Flexgen: High-throughput generative inference of large language models with a single gpu", ICML, 2023.
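To illustrate what a group size of 32 means in this context, here is a generic asymmetric low-bit group quantizer in the spirit of [1]; it is a sketch for exposition, not the authors' fused kernels, and all names are our own.

```python
import torch

def fake_quantize_groups(x, n_bits=2, group_size=32):
    """Asymmetric round-trip (quantize + dequantize) with one scale and
    zero-point per contiguous group along the last dimension. Assumes the
    last dimension is a multiple of group_size."""
    g = x.reshape(*x.shape[:-1], -1, group_size)
    lo = g.min(dim=-1, keepdim=True).values
    hi = g.max(dim=-1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / (2 ** n_bits - 1)
    q = ((g - lo) / scale).round().clamp(0, 2 ** n_bits - 1)
    return (q * scale + lo).reshape(x.shape)

x = torch.randn(4, 128)                       # e.g., a few key/value vectors
mean_abs_err = (fake_quantize_groups(x) - x).abs().mean()
```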
Q2. Limited analysis of computational overhead during inference, particularly for initial tokens.
A2. Please refer to our global response for a detailed discussion.
Dear Reviewer hftq,
Thank you for reading our paper and providing thoughtful feedback. We hope our responses have addressed all your questions. As the discussion period comes to an end, please feel free to let us know if you have any additional suggestions or feedback that could help further improve the paper. Thank you once again!
Thank you for the response. I will maintain my ratings.
This paper introduces a tuning-free method SQuat for quantizing the KV cache in large language models (LLMs) by enforcing subspace-orthogonality between quantization errors and a task-relevant query subspace. Instead of minimizing raw quantization error, SQuat aims to preserve the inner products between queries and keys, which are critical to attention scores. It constructs this subspace using prompt query tensors via SVD and iteratively quantizes key tensors while constraining residuals to be orthogonal to it. This theoretically grounded approach improves inference efficiency without retraining or calibration data.
Reasons to Accept
- Unlike prior work that minimizes element-wise reconstruction error, SQuat introduces a theoretically grounded objective that minimizes the effect of quantization on attention outputs by ensuring key residuals are orthogonal to a task-relevant query subspace.
- The method requires neither model fine-tuning nor additional calibration data, making it practical for deployment in existing LLM inference pipelines with minimal overhead.
- SQuat is orthogonal to other cache management strategies (e.g., pruning, eviction) and compatible with state-of-the-art compression techniques, suggesting wide applicability in both research and production systems.
Reasons to Reject
- The query subspace is estimated solely from prompt tokens, assuming it generalizes well to future tokens. However, this assumption may not hold across diverse tasks or dialog settings where user responses shift semantics mid-generation. This could compromise the orthogonality constraint’s effectiveness in practice.
- While the algorithm reduces the theoretical complexity, it still introduces a non-trivial overhead compared to simpler methods like per-channel quantization. The practical cost of SVD, subspace projection, and iterative updates, especially for longer sequences and large-scale deployment, needs clearer profiling in the main paper.
- The evaluation is focused on academic benchmarks and assumes full control of the inference stack. Real-world deployment scenarios, where key tensors may be quantized asynchronously or under hardware constraints (e.g., mobile inference), are not addressed, limiting the practical utility claims.
We thank the reviewer for the thoughtful review and for appreciating the merits of the work!
Q1. The query subspace is estimated solely from prompt tokens, assuming it generalizes well to future tokens.
A1. To address your concern, we conducted two experiments.
Experiment 1: dynamic query subspace update
In this experiment, we dynamically update the query subspace using the residual buffer, rather than estimating it solely from the prompt tokens (an idea suggested by Reviewer hLUL). This approach is particularly useful in dialog scenarios where user responses may shift semantics during generation. We evaluate this method on the deepseek-distill-llama-8B model using 100 samples from the math-500 dataset. We update the query subspace every time 512 new tokens are generated. The resulting accuracy is 77%, matching the FP16 baseline and indicating no loss from quantization. For comparison, our original implementation (static subspace) has accuracy 72% and FP16 (no quantization) has accuracy 77%.
(Additional results can be found in our response to Reviewer hLUL.)
Experiment 2: robustness of query subspace construction
We test the robustness of our method by constructing the query subspace from easier or even unrelated tasks. Using the llama-3.1-8B-instruct model on 100 samples from the math-500 dataset, we compare performance under three different subspaces:
- Current setup (subspace constructed from the first prompt in each batch): 47% accuracy.
- GSM8K subspace (constructed from a prompt in the GSM8K dataset, an easier math dataset): 46% accuracy.
- QuALITY subspace (constructed from a prompt in the QuALITY dataset, a document QA dataset unrelated to math): 44% accuracy.
For reference, KIVI achieves 44% accuracy on the same set of math-500 samples.
These results are consistent with our observations in Figure 2 (left) from Section 3.2, suggesting that query vectors lie in a low-dimensional subspace. Even when constructed from different tasks, this subspace remains a reasonable approximation, likely due to the low-rank nature of the projection matrix mapping attention inputs to query vectors.
Q2. The practical cost of SVD, subspace projection, and iterative updates, especially for longer sequences and large-scale deployment, needs clearer profiling in the main paper.
A2. Please refer to our global response. In addition, we conducted an experiment using longer sequences as per your recommendation. Specifically, we used the deepseek-distill-llama-8B model with 100 prompt tokens followed by 8,000 response tokens. This setup is designed to better reflect modern reasoning workloads, which often require extended responses for multi-step reasoning.
SQuat (our method):
| Batch Size (bsz) | SVD (ms) | Iter (prefill) (ms) | Prefill (ms) | Iter (decode) (ms) | Decode (ms) | Total Time (ms) |
|---|---|---|---|---|---|---|
| 16 | 249.430 | 937.041 | 1235.624 | 3931.533 | 328183.072 | 329418.696 |
FP16:
| Batch Size (bsz) | Prefill (ms) | Decode (ms) | Total Time (ms) |
|---|---|---|---|
| 16 | 42.755 | 634690.410 | 634733.165 |
Q3. Real-world deployment scenarios, where key tensors may be quantized asynchronously or under hardware constraints (e.g., mobile inference), are not addressed.
A3. This is a great suggestion! We believe our method can indeed benefit from performing quantization asynchronously to further reduce latency. Our approach maintains a residual buffer that stores the most recent tokens’ KV cache in FP16. Once this buffer reaches a predefined length, the cached KV tensors are quantized together. This design naturally supports asynchronous quantization, as it avoids the need to quantize the KV cache immediately with each new token. Instead, decoding can continue using the FP16 KV cache in the residual buffer, while quantization of the buffer proceeds in parallel.
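A rough sketch of this asynchronous design is given below; the class, the `quantize_fn` callable, and the threading scheme are illustrative assumptions, not the actual implementation.

```python
import threading
import torch

class AsyncResidualBuffer:
    """FP16 KV residual buffer whose contents are quantized by a background
    thread once the buffer fills, so decoding is not blocked. `quantize_fn`
    is an assumed callable (e.g., a low-bit group quantizer)."""

    def __init__(self, quantize_fn, capacity=32):
        self.quantize_fn, self.capacity = quantize_fn, capacity
        self.buffer, self.quantized_chunks = [], []
        self.lock = threading.Lock()

    def append(self, kv_fp16):
        with self.lock:
            self.buffer.append(kv_fp16)
            if len(self.buffer) < self.capacity:
                return
            chunk, self.buffer = torch.stack(self.buffer), []
        # Quantize the full chunk off the decoding thread.
        threading.Thread(target=self._flush, args=(chunk,), daemon=True).start()

    def _flush(self, chunk):
        quantized = self.quantize_fn(chunk)
        with self.lock:
            self.quantized_chunks.append(quantized)
```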
To address hardware constraints, one potential solution is to reimplement our algorithm using the MLX framework, which supports Apple silicon. For example, our current implementation for the Llama model (which builds on Hugging Face code) could be ported to MLX starting from mlx-examples/llms/llama/llama.py. The main operations in our algorithm, such as SVD, are already supported in MLX (e.g., mlx.core.linalg.svd).
Dear Reviewer fNFy,
Thank you for reading our paper and providing thoughtful feedback. We hope our responses have addressed all your questions. As the discussion period comes to an end, please feel free to let us know if you have any additional suggestions or feedback that could help further improve the paper. Thank you once again!
I appreciate the authors' clarifications, which address several of my initial concerns. Based on this response, I am increasing my score.
We thank all reviewers for their time and effort. We understand that reviewers are often managing multiple submissions, so we especially appreciate the thoughtful and constructive feedback provided. A common question raised by all reviewers is the computational overhead of our method during inference. In this global response, we provide a detailed breakdown of the inference-time cost for each major component of our method (specifically, SVD and iterative updates) as well as a report on prefill and decoding times.
[Setup] We adopt the same evaluation setup as in Figure 3, using the LLaMA-2-7B model with a prompt length of 160 tokens and a response length of 338 tokens. This setup is designed to model user-shared conversations, synthesizing workloads based on the ShareGPT dataset.
[Results] We report both the prefill time and decode time, which together constitute the total runtime of our algorithm. Additionally, we provide the runtime for the two main components of our method: SVD and iterative updates. The SVD operation is performed only once at the end of the prefill stage, so we report its total time cost. In contrast, iterative updates are executed during both the prefill and decode stages, and we break down their runtimes accordingly into Iter (prefill) and Iter (decode). We vary the batch size incrementally, with each row in the table corresponding to a specific batch size. For comparison, we also report the runtime of KIVI (a baseline that performs simple per-channel/per-token INT2 quantization without SVD or iterative updates) as well as FP16.
[Observations] SVD accounts for 5.8% of the total runtime at a small batch size (bsz = 1), and this percentage drops to 2.4% at a large batch size (bsz = 500). This reduction is due to our strategy of applying SVD only to the first sample in the batch to extract the query subspace, which is then shared across the batch. The runtime of the iterative update remains consistently around 2.5%, regardless of batch size.
SQuat (our method):
| Batch Size (bsz) | SVD (ms) | Iter (prefill) (ms) | Prefill (ms) | Iter (decode) (ms) | Decode (ms) |
|---|---|---|---|---|---|
| 1 | 805.530 | 89.316 | 913.535 | 175.240 | 12959.089 |
| 10 | 825.425 | 62.293 | 927.533 | 88.764 | 13130.239 |
| 100 | 862.325 | 93.071 | 1796.261 | 96.803 | 13048.520 |
| 200 | 961.580 | 88.378 | 2680.430 | 368.207 | 17200.172 |
| 300 | 959.962 | 159.359 | 3508.182 | 437.083 | 22711.346 |
| 400 | 1156.329 | 141.021 | 4389.848 | 636.830 | 28995.633 |
| 500 | 948.548 | 321.158 | 5111.547 | 668.478 | 33734.119 |
KIVI:
| Batch Size (bsz) | Prefill (ms) | Decoding (ms) |
|---|---|---|
| 1 | 40.910 | 13847.371 |
| 10 | 87.783 | 14423.615 |
| 100 | 747.433 | 14416.714 |
| 200 | 1512.203 | 16996.911 |
| 300 | 2291.508 | 22316.557 |
| 400 | 3031.987 | 28460.454 |
| 500 | 3847.747 | 33343.813 |
FP16:
| Batch Size (bsz) | Prefill (ms) | Decoding (ms) |
|---|---|---|
| 1 | 24.790 | 8574.580 |
| 10 | 66.958 | 9786.906 |
| 100 | 560.386 | 30451.646 |
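For reference, stage-by-stage timings of this kind can be collected with CUDA events; the sketch below shows the generic pattern, and `model.prefill` / `model.decode_step` are hypothetical placeholders rather than real APIs from our codebase.

```python
import torch

def timed_ms(fn, *args, **kwargs):
    """Return (result, elapsed milliseconds) for a GPU-side call, using CUDA
    events so asynchronously launched kernels are measured correctly."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fn(*args, **kwargs)
    end.record()
    torch.cuda.synchronize()
    return out, start.elapsed_time(end)

# Hypothetical usage; `model.prefill` and `model.decode_step` stand in for the
# actual prefill and per-token decode calls:
# _, prefill_ms = timed_ms(model.prefill, prompt_ids)
# decode_ms = sum(timed_ms(model.decode_step)[1] for _ in range(num_new_tokens))
```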
Additionally, in our paper Figure 3, we compared our method with KIVI and FP16 in terms of memory usage (left), runtime latency (middle), and throughput (right).
Below, we respond to the main points and questions raised by each reviewer. We welcome any additional suggestions or feedback that could further improve the paper. Thank you!
The authors introduce a new method for KV cache compression that relies on constructing a subspace spanned by query tensors to capture the most critical task-related information, which appears to yield improvements over state-of-the-art methods.
The paper received positive scores (6, 6, 7) from three reviewers, all acknowledging the novelty and technical merit of the proposed approach. The Reviewers appreciated the method's ability to preserve task-relevant information via orthogonal constraints to a query subspace, distinguishing it from existing compression-based quantization techniques (putting a small asterisk here, discussed below).
Strengths:
- Introduces a theoretical framework beyond simple compression-based quantization methods.
- Demonstrates empirical results across multiple benchmarks and models.
- Requires no fine-tuning.
- The authors addressed most of the reviewer concerns with additional experiments and clarifications on the computational cost breakdown.
Weaknesses:
- Evaluation on very long sequences was limited (I suppose that's overall a harder eval, but it was capped at 30k tokens due to computational constraints).
- Hyperparameter selection of new tuning knobs adds extra overhead (this is a complaint that applies to any novel compression method, to be honest).
- Practical deployment aspects were mentioned by a reviewer. I suppose this is always quite tricky, and we have to rely on our belief that some of the findings transfer to a more practical setting.
expanding on the asterisk:
- Many methods developed in the past can be used for task-specific quantization; e.g., Lexico by Kim et al. is VERY similar to what the authors proposed. However, the authors did not compare against these methods, which can be tuned to the task at hand, in their benchmarks.