SQuat: Subspace-orthogonal KV Cache Quantization
We propose a KV cache quantization method that preserves task-critical information throughout the quantization process.
Abstract
Reviews and Discussion
This paper proposes a method that considers potential future output errors during the quantization of the current key. The authors first demonstrate that these errors are upper-bounded by terms related to the product of queries and the quantization residuals of keys. Consequently, they propose regularizing the quantization process to encourage orthogonality between the quantization residuals and the subspaces spanned by queries. The proposed method is evaluated against several existing methods on a range of benchmarks. Overall, the motivation is compelling and supported by insightful analysis, suggesting strong potential for the proposed method. However, it should be noted that the quality of the experiments currently lags behind that of the other high-quality components of the work.
Reasons to Accept
- The motivation is compelling and supported by insightful analysis that could inform future work.
- The proposed method is targeted and rigorous. Its derivation is easy to follow, and the solution is elegant.
- Empirical results demonstrate strong performance and introduce only negligible overhead.
Reasons to Reject
- Two new hyperparameters are introduced, and the paper lacks a discussion of how to set them effectively. See Q1.
- Despite Llama 3.1's support for a 128k context length and the use of LongBench (a benchmark for long-context evaluation), the authors limited their experiments to a maximum sequence length of 8192 tokens. This is considerably shorter than the contexts explored in recent long-context modeling studies. Since KV quantization is primarily beneficial in long-context scenarios, evaluation on genuinely long sequences (e.g., >64k tokens) is necessary to accurately assess its performance.
- The analysis lacks a detailed speed profile or computational cost breakdown for SQuat. Detailed, separate profiling of the overheads in prefilling and generation is crucial for assessing practical performance and identifying bottlenecks. Refer to Q5 for a follow-up.
Questions for Authors
- I find the results in Table 3 puzzling. Intuitively, a larger r defines a higher-rank subspace, which would presumably make the orthogonality constraints more difficult to satisfy. Conversely, a smaller value diminishes the regularization effect. Therefore, the observation that reducing one while increasing the other generally leads to better performance appears paradoxical. I would appreciate a more detailed analysis and discussion regarding the specific influence of r on SQuat's behavior, especially in light of these seemingly conflicting effects.
- Why do SQuat and SQuat_pre use different values for the two hyperparameters on the LongBench benchmark?
- What is the length of the input sequence used to generate Figure 2? Does the result reported in Sec 3.2, i.e., that the prompt alone is a good enough approximation, hold for substantially longer sequences, for example, those with 100,000 tokens?
- What is the role of the extra multiplication in Alg. 1, line 6? Why is the subspace basis alone not enough to represent the subspace?
- If SVD is fast enough, can the query subspace be updated (periodically) using the residual buffer? If so, would this approach enhance performance, and what would be the consequent slowdown?
Thanks for your careful reading of our paper and thoughtful comments!
R1 + Q1. There are two new hyperparameters introduced. This paper lacks discussion on how to set them effectively. A more detailed analysis regarding the specific influence of r on SQuat's behavior.
A1. We recommend a single configuration for these two hyperparameters and use it consistently across all LLMs and tasks in LM-eval. As shown in our paper, this choice consistently outperforms all other baselines (see our response to your Q2 for LongBench results). Below, we provide the intuition behind this selection.
For r, we recommend a small value. As you pointed out, a larger r defines a higher-rank subspace, which makes satisfying the orthogonality constraints more difficult and can degrade model performance. This aligns with our ablation study (Table 3), where, across the regularization coefficients we tested, a small r achieves the highest scores among all configurations. One exception occurs when the regularization coefficient is very small; in this case, the orthogonality constraint is relaxed, and a slightly larger r yields better performance. As for the regularization coefficient, we recommend the reported value or an even smaller one, based on the observation that the largest singular value of the query subspace typically falls between 100 and 300. This choice ensures the regularization is effective without being too restrictive.
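For concreteness, the singular value mentioned above can be inspected with a few lines of PyTorch. This is an illustrative sketch rather than code from the paper, and the tensor shape is an arbitrary placeholder:

```python
import torch

# Hypothetical query tensor collected from one attention head during prefill:
# one row per prompt token. The shape (160, 128) is an arbitrary placeholder.
prompt_queries = torch.randn(160, 128)

# Singular values of the stacked prompt queries, in descending order.
# The largest one indicates how strong the dominant query direction is,
# which is the quantity referenced above when choosing the regularization
# coefficient.
singular_values = torch.linalg.svdvals(prompt_queries)
print(f"largest singular value: {singular_values[0].item():.1f}")
```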
While we do not claim that this configuration is universally optimal (in fact, the best hyperparameters will vary depending on the specific task), our recommended configuration delivers consistent performance (outperforming all baselines tested) across different LLMs and evaluation tasks.
R2. Despite Llama 3.1's support for a 128k context length and the use of LongBench (a benchmark for long-context evaluation), the authors limited their experiments to a maximum sequence length of 8192 tokens. This is considerably shorter than the contexts explored in recent long-context modeling studies.
A2. To address your question, we increased the context length to 30,000 tokens. On this larger window, our results remain consistent with those reported in the original paper: our method achieves near-lossless performance even when quantizing the KV cache to INT2. Specifically, the Baseline (16-bit) achieves a score of 53.206, while SQuat (2-bit) achieves 52.554.
We chose 30,000 tokens because nearly all sequences from the selected LongBench tasks fall below this threshold. We acknowledge that other benchmarks contain much longer sequences; however, 30,000 tokens is also the maximum context length we can experiment with given our limited computational resources.
R3. The analysis lacks a detailed speed profile or computational cost breakdown for SQuat. Detailed and separated profiling of overheads in prefilling and generation is crucial for assessing practical performance and identifying bottlenecks.
A3. Please refer to our global response.
Q2. Why use different hyperparameter values on LongBench?
A4. In this paper, we explore various hyperparameter settings to evaluate the optimal performance of SQuat on LongBench, which consists of relatively simple tasks. Nonetheless, the default values we recommend can also achieve near-lossless accuracy on LongBench and consistently outperform all baseline methods (see below):
Llama-2-7B-hf
| SQuat | KIVI | GEAR | ZipCache | FP16 |
|---|---|---|---|---|
| 44.37% | 44.18% | 44.17% | 43.06% | 44.50% |
Llama-3.1-8B-Instruct
| SQuat | KIVI | GEAR | ZipCache | FP16 |
|---|---|---|---|---|
| 51.88% | 51.87% | 51.35% | 51.48% | 52.38% |
Mistral-7B-Instruct-v0.3
| SQuat | KIVI | FP16 |
|---|---|---|
| 51.74% | 51.66% | 52.27% |
Q3. What is the length of the input sequence used to generate Figure 2? Does the result reported in Sec 3.2, i.e., that the prompt alone is a good enough approximation, hold for substantially longer sequences?
A5. The input sequence is 203 tokens long. We believe the observation that "prompt-only is a good enough approximation" holds even for much longer sequences (we tested 8k tokens and observed consistent results). This is likely due to the low-rank structure of the projection matrix that maps attention inputs to query vectors, which constrains the query subspace. In Figure 2 (left, red curve), even when the query subspace is derived from a different task (document QA instead of math), it still provides a reasonable approximation for the math-500 dataset's token query tensors.
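As a minimal illustration (not the paper's code) of how this approximation can be quantified, the sketch below measures how much of the later queries' energy falls outside the subspace spanned by the prompt queries; the tensor shapes and the rank r=16 are arbitrary placeholders.

```python
import torch

def out_of_subspace_ratio(prompt_queries, later_queries, r):
    """Fraction of the energy of later_queries that falls outside the rank-r
    subspace spanned by the prompt queries' top right singular vectors."""
    _, _, Vh = torch.linalg.svd(prompt_queries, full_matrices=False)
    basis = Vh[:r]                               # (r, d), orthonormal rows
    proj = later_queries @ basis.T @ basis       # projection onto the subspace
    return ((later_queries - proj).norm() / later_queries.norm()).item()

# Toy usage with random stand-ins for real query activations; in practice the
# tensors would be per-head queries gathered during prefill and decoding.
ratio = out_of_subspace_ratio(torch.randn(203, 128), torch.randn(8000, 128), r=16)
```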
To be continued...
Q4. What is the role of the extra multiplication in Alg. 1, line 6? Why is the subspace basis alone not enough to represent the subspace?
A6. You are completely correct that the orthonormal basis alone is enough to represent the subspace. The role of the extra multiplication is to assign finer-grained scaling factors to each direction in the subspace. This enables more precise control over quantization error by reducing its impact along directions that carry more important task-related information, rather than distributing the error uniformly across all directions in the subspace.
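A small sketch of this distinction, with names and the rank r chosen by us rather than taken from Algorithm 1: the orthonormal basis alone identifies the subspace, while weighting each basis direction by its singular value lets the penalty emphasize the directions that carry more query energy.

```python
import torch

prompt_queries = torch.randn(160, 128)               # assumed (tokens, head_dim)
_, S, Vh = torch.linalg.svd(prompt_queries, full_matrices=False)

r = 4                                                 # placeholder subspace rank
basis = Vh[:r]                                        # enough to *represent* the subspace
weighted_basis = S[:r, None] * basis                  # weights each direction by importance

# A key-quantization residual e can be penalized by its overlap with the subspace.
# The unweighted penalty treats all r directions equally; the weighted one
# penalizes error more along higher-energy query directions.
e = torch.randn(128)
uniform_penalty = (basis @ e).pow(2).sum()
weighted_penalty = (weighted_basis @ e).pow(2).sum()
```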
Q5. If SVD is fast enough, can the query subspace be updated (periodically) using the residual buffer? If so, would this approach enhance performance, and what would be the consequent slowdown?
A7. Indeed, this is an excellent suggestion and it works really well!!
We dynamically update the query subspace using the residual buffer, rather than estimating it solely from the prompt tokens (as you suggested). We evaluate this method on the deepseek-distill-llama-8B model using 100 samples from the math-500 dataset, updating the query subspace every time 512 new tokens are generated. The resulting accuracy is 77%, matching the FP16 baseline and indicating no loss from quantization! For comparison, our original implementation (static subspace) achieves 72% accuracy, while FP16 (no quantization) achieves 77%.
To further understand the effect of update frequency, we conduct an ablation study by varying how often the query subspace is updated. The table below summarizes the results:
| Update interval (#tokens) | 128 | 256 | 512 | 1024 | 2048 |
|---|---|---|---|---|---|
| Accuracy | 70% | 72% | 77% | 75% | 74% |
We found that updating too frequently (e.g., every 128 tokens) leads to lower accuracy due to redundancy in the recent token queries. Updating too infrequently (e.g., after 1024 or 2048 tokens) results in diminished performance because the query subspace becomes too coarse to capture local semantic nuances.
While these results are promising, we note the caveat that enabling dynamic updates requires storing the query tensors of the most recent tokens in memory until the next update. For example, achieving the best performance requires saving 512 tokens' query tensors, which introduces a significant memory overhead. In contrast, our original approach (estimating the subspace from prompt tokens during the prefill stage) incurs no additional memory cost, since all prompt query tensors are already computed during prefill. Exploring memory-efficient strategies for dynamic query subspace updates would be an interesting direction for future work.
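For readers who want the periodic-update logic spelled out, here is a rough sketch; the class name, buffering scheme, and defaults are our own placeholders rather than the actual implementation.

```python
import torch

class DynamicQuerySubspace:
    """Rank-r query subspace refreshed from recently generated queries every
    `update_every` tokens. All names and defaults here are placeholders."""

    def __init__(self, prompt_queries, r=4, update_every=512):
        self.r, self.update_every = r, update_every
        self.recent = []                       # buffered query vectors
        self.basis = self._fit(prompt_queries)

    def _fit(self, queries):
        _, _, Vh = torch.linalg.svd(queries, full_matrices=False)
        return Vh[: self.r]

    def observe(self, new_query):
        """Call once per generated token; refresh the basis when enough
        new queries have accumulated, then drop the buffered queries."""
        self.recent.append(new_query)
        if len(self.recent) >= self.update_every:
            self.basis = self._fit(torch.stack(self.recent))
            self.recent.clear()
```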
Dear Reviewer hLUL,
Thank you for reading our paper and providing thoughtful feedback. We hope our responses have addressed all your questions. As the discussion period comes to an end, please feel free to let us know if you have any additional suggestions or feedback that could help further improve the paper. Thank you once again!
Thanks for your response, I will maintain my score.
This paper introduces SQuat, a KV cache quantization method that preserves task-relevant information by ensuring quantization errors remain orthogonal to the query subspace. Unlike traditional approaches that focus solely on minimizing numerical error, SQuat recognizes that preserving inner products between key and query tensors is critical for maintaining model performance. Experiments show memory reduction (2.17×-2.82×) and throughput improvement (2.45×-3.60×) while outperforming existing quantization methods across multiple LLMs and tasks.
Reasons to Accept
- It introduces a novel conceptual framing for KV cache quantization that addresses fundamental limitations of current approaches.
- It provides rigorous mathematical derivation with proofs and an efficient implementation algorithm.
- It demonstrates consistent performance improvements across diverse models (Llama-2, Llama-3.1, Mistral, DeepSeek) and tasks (reasoning, long-context understanding).
Reasons to Reject
- Insufficient guidance on hyperparameter selection beyond ablation studies.
- Limited analysis of computational overhead during inference, particularly for initial tokens.
We thank the reviewer for the feedback and kind comments!
Q1. Insufficient guidance on hyperparameter selection beyond ablation studies.
A1. The main hyperparameters of our method are the subspace dimension r and the regularization coefficient. We recommend a single configuration for these two hyperparameters and use it consistently across all LLMs and tasks in LM-eval. As shown in our paper, this choice consistently outperforms all other baselines (see our response to Reviewer hLUL's Q2 for LongBench results). Below, we provide the intuition behind this recommendation.
For r, we recommend a small value. A larger r defines a higher-rank subspace, which makes satisfying the orthogonality constraints more difficult and can degrade model performance. This aligns with our ablation study (Table 3), where, across the regularization coefficients we tested, a small r achieves the highest scores among all configurations. One exception occurs when the regularization coefficient is very small; in this case, the orthogonality constraint is relaxed, and a slightly larger r yields better performance. As for the regularization coefficient, we recommend the reported value or an even smaller one, based on the observation that the largest singular value of the query subspace typically falls between 100 and 300. This choice ensures the regularization is effective without being too restrictive.
While we do not claim that this configuration is universally optimal (in fact, the best hyperparameters will vary depending on the specific task), our recommended configuration delivers consistent performance (outperforming all baselines tested) across different LLMs and evaluation tasks.
In addition to and , our method includes two other hyperparameters: residual buffer length and group size. Throughout the paper, we set both of them to 32 tokens, following the recommendations from prior work [1, 2].
[1] Liu, Zirui, et al. "Kivi: A tuning-free asymmetric 2bit quantization for kv cache." ICML, 2024.
[2] Sheng, Ying, et al. "Flexgen: High-throughput generative inference of large language models with a single gpu", ICML, 2023.
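To illustrate what a group size of 32 means in this context, here is a generic asymmetric low-bit group quantizer in the spirit of [1]; it is a sketch for exposition, not the authors' fused kernels, and all names are our own.

```python
import torch

def fake_quantize_groups(x, n_bits=2, group_size=32):
    """Asymmetric round-trip (quantize + dequantize) with one scale and
    zero-point per contiguous group along the last dimension. Assumes the
    last dimension is a multiple of group_size."""
    g = x.reshape(*x.shape[:-1], -1, group_size)
    lo = g.min(dim=-1, keepdim=True).values
    hi = g.max(dim=-1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / (2 ** n_bits - 1)
    q = ((g - lo) / scale).round().clamp(0, 2 ** n_bits - 1)
    return (q * scale + lo).reshape(x.shape)

x = torch.randn(4, 128)                       # e.g., a few key/value vectors
mean_abs_err = (fake_quantize_groups(x) - x).abs().mean()
```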
Q2. Limited analysis of computational overhead during inference, particularly for initial tokens.
A2. Please refer to our global response for a detailed discussion.
Dear Reviewer hftq,
Thank you for reading our paper and providing thoughtful feedback. We hope our responses have addressed all your questions. As the discussion period comes to an end, please feel free to let us know if you have any additional suggestions or feedback that could help further improve the paper. Thank you once again!
Thank you for the response. I will maintain my ratings.
This paper introduces a tuning-free method SQuat for quantizing the KV cache in large language models (LLMs) by enforcing subspace-orthogonality between quantization errors and a task-relevant query subspace. Instead of minimizing raw quantization error, SQuat aims to preserve the inner products between queries and keys, which are critical to attention scores. It constructs this subspace using prompt query tensors via SVD and iteratively quantizes key tensors while constraining residuals to be orthogonal to it. This theoretically grounded approach improves inference efficiency without retraining or calibration data.
Reasons to Accept
- Unlike prior work that minimizes element-wise reconstruction error, SQuat introduces a theoretically grounded objective that minimizes the effect of quantization on attention outputs by ensuring key residuals are orthogonal to a task-relevant query subspace.
- The method requires neither model fine-tuning nor additional calibration data, making it practical for deployment in existing LLM inference pipelines with minimal overhead.
- SQuat is orthogonal to other cache management strategies (e.g., pruning, eviction) and compatible with state-of-the-art compression techniques, suggesting wide applicability in both research and production systems.
Reasons to Reject
- The query subspace is estimated solely from prompt tokens, assuming it generalizes well to future tokens. However, this assumption may not hold across diverse tasks or dialog settings where user responses shift semantics mid-generation. This could compromise the orthogonality constraint’s effectiveness in practice.
- While the algorithm reduces the theoretical complexity, it still introduces a non-trivial overhead compared to simpler methods like per-channel quantization. The practical cost of SVD, subspace projection, and iterative updates, especially for longer sequences and large-scale deployment, needs clearer profiling in the main paper.
- The evaluation is focused on academic benchmarks and assumes full control of the inference stack. Real-world deployment scenarios, where key tensors may be quantized asynchronously or under hardware constraints (e.g., mobile inference), are not addressed, limiting the practical utility claims.
We thank the reviewer for the thoughtful review and for appreciating the merits of the work!
Q1. The query subspace is estimated solely from prompt tokens, assuming it generalizes well to future tokens.
A1. To address your concern, we conducted two experiments.
Experiment 1: dynamic query subspace update
In this experiment, we dynamically update the query subspace using the residual buffer, rather than estimating it solely from the prompt tokens (an idea suggested by Reviewer hLUL). This approach is particularly useful in dialog scenarios where user responses may shift semantics during generation. We evaluate this method on the deepseek-distill-llama-8B model using 100 samples from the math-500 dataset. We update the query subspace every time 512 new tokens are generated. The resulting accuracy is 77%, matching the FP16 baseline and indicating no loss from quantization. For comparison, our original implementation (static subspace) has accuracy 72% and FP16 (no quantization) has accuracy 77%.
(Additional results can be found in our response to Reviewer hLUL.)
Experiment 2: robustness of query subspace construction
We test the robustness of our method by constructing the query subspace from easier or even unrelated tasks. Using the llama-3.1-8B-instruct model on 100 samples from the math-500 dataset, we compare performance under three different subspaces:
- Current setup (subspace constructed from the first prompt in each batch): 47% accuracy.
- GSM8K subspace (constructed from a prompt in the GSM8K dataset, an easier math dataset): 46% accuracy.
- QuALITY subspace (constructed from a prompt in the QuALITY dataset, a document QA dataset unrelated to math): 44% accuracy.
For reference, KIVI achieves 44% accuracy on the same set of math-500 samples.
These results are consistent with our observations in Figure 2 (left) from Section 3.2, suggesting that query vectors lie in a low-dimensional subspace. Even when constructed from different tasks, this subspace remains a reasonable approximation, likely due to the low-rank nature of the projection matrix mapping attention inputs to query vectors.
Q2. The practical cost of SVD, subspace projection, and iterative updates, especially for longer sequences and large-scale deployment, needs clearer profiling in the main paper.
A2. Please refer to our global response. In addition, we conducted an experiment using longer sequences as per your recommendation. Specifically, we used the deepseek-distill-llama-8B model with 100 prompt tokens followed by 8,000 response tokens. This setup is designed to better reflect modern reasoning workloads, which often require extended responses for multi-step reasoning.
SQuat (our method):
| Batch Size (bsz) | SVD (ms) | Iter (prefill) (ms) | Prefill (ms) | Iter (decode) (ms) | Decode (ms) | Total Time (ms) |
|---|---|---|---|---|---|---|
| 16 | 249.430 | 937.041 | 1235.624 | 3931.533 | 328183.072 | 329418.696 |
FP16:
| Batch Size (bsz) | Prefill (ms) | Decode (ms) | Total Time (ms) |
|---|---|---|---|
| 16 | 42.755 | 634690.410 | 634733.165 |
Q3. Real-world deployment scenarios, where key tensors may be quantized asynchronously or under hardware constraints (e.g., mobile inference), are not addressed.
A3. This is a great suggestion! We believe our method can indeed benefit from performing quantization asynchronously to further reduce latency. Our approach maintains a residual buffer that stores the most recent tokens’ KV cache in FP16. Once this buffer reaches a predefined length, the cached KV tensors are quantized together. This design naturally supports asynchronous quantization, as it avoids the need to quantize the KV cache immediately with each new token. Instead, decoding can continue using the FP16 KV cache in the residual buffer, while quantization of the buffer proceeds in parallel.
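A rough sketch of this asynchronous design is given below; the class, the `quantize_fn` callable, and the threading scheme are illustrative assumptions, not the actual implementation.

```python
import threading
import torch

class AsyncResidualBuffer:
    """FP16 KV residual buffer whose contents are quantized by a background
    thread once the buffer fills, so decoding is not blocked. `quantize_fn`
    is an assumed callable (e.g., a low-bit group quantizer)."""

    def __init__(self, quantize_fn, capacity=32):
        self.quantize_fn, self.capacity = quantize_fn, capacity
        self.buffer, self.quantized_chunks = [], []
        self.lock = threading.Lock()

    def append(self, kv_fp16):
        with self.lock:
            self.buffer.append(kv_fp16)
            if len(self.buffer) < self.capacity:
                return
            chunk, self.buffer = torch.stack(self.buffer), []
        # Quantize the full chunk off the decoding thread.
        threading.Thread(target=self._flush, args=(chunk,), daemon=True).start()

    def _flush(self, chunk):
        quantized = self.quantize_fn(chunk)
        with self.lock:
            self.quantized_chunks.append(quantized)
```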
To address hardware constraints, one potential solution is to reimplement our algorithm using the MLX framework, which supports Apple silicon. For example, our current implementation for the Llama model (which builds on Hugging Face code) could be ported to MLX starting from mlx-examples/llms/llama/llama.py. The main operations in our algorithm, such as SVD, are already supported in MLX (e.g., mlx.core.linalg.svd).
Dear Reviewer fNFy,
Thank you for reading our paper and providing thoughtful feedback. We hope our responses have addressed all your questions. As the discussion period comes to an end, please feel free to let us know if you have any additional suggestions or feedback that could help further improve the paper. Thank you once again!
I appreciate the authors' clarifications, which address several of my initial concerns. Based on this response, I am increasing my score.
We thank all reviewers for their time and effort. We understand that reviewers are often managing multiple submissions, so we especially appreciate the thoughtful and constructive feedback provided. A common question raised by all reviewers is the computational overhead of our method during inference. In this global response, we provide a detailed breakdown of the inference-time cost for each major component of our method (specifically, SVD and iterative updates) as well as a report on prefill and decoding times.
[Setup] We adopt the same evaluation setup as in Figure 3, using the LLaMA-2-7B model with a prompt length of 160 tokens and a response length of 338 tokens. This setup is designed to model user-shared conversations, synthesizing workloads based on the ShareGPT dataset.
[Results] We report both the prefill time and decode time, which together constitute the total runtime of our algorithm. Additionally, we provide the runtime for the two main components of our method: SVD and iterative updates. The SVD operation is performed only once at the end of the prefill stage, so we report its total time cost. In contrast, iterative updates are executed during both the prefill and decode stages, and we break down their runtimes accordingly into Iter (prefill) and Iter (decode). We vary the batch size incrementally, with each row in the table corresponding to a specific batch size. For comparison, we also report the runtime of KIVI (a baseline that performs simple per-channel/per-token INT2 quantization without SVD or iterative updates) as well as FP16.
[Observations] SVD accounts for 5.8% of the total runtime at a small batch size (bsz = 1), and this percentage drops to 2.4% at a large batch size (bsz = 500). This reduction is due to our strategy of applying SVD only to the first sample in the batch to extract the query subspace, which is then shared across the batch. The runtime of the iterative update remains consistently around 2.5%, regardless of batch size.
SQuat (our method):
| Batch Size (bsz) | SVD (ms) | Iter (prefill) (ms) | Prefill (ms) | Iter (decode) (ms) | Decode (ms) |
|---|---|---|---|---|---|
| 1 | 805.530 | 89.316 | 913.535 | 175.240 | 12959.089 |
| 10 | 825.425 | 62.293 | 927.533 | 88.764 | 13130.239 |
| 100 | 862.325 | 93.071 | 1796.261 | 96.803 | 13048.520 |
| 200 | 961.580 | 88.378 | 2680.430 | 368.207 | 17200.172 |
| 300 | 959.962 | 159.359 | 3508.182 | 437.083 | 22711.346 |
| 400 | 1156.329 | 141.021 | 4389.848 | 636.830 | 28995.633 |
| 500 | 948.548 | 321.158 | 5111.547 | 668.478 | 33734.119 |
KIVI:
| Batch Size (bsz) | Prefill (ms) | Decoding (ms) |
|---|---|---|
| 1 | 40.910 | 13847.371 |
| 10 | 87.783 | 14423.615 |
| 100 | 747.433 | 14416.714 |
| 200 | 1512.203 | 16996.911 |
| 300 | 2291.508 | 22316.557 |
| 400 | 3031.987 | 28460.454 |
| 500 | 3847.747 | 33343.813 |
FP16:
| Batch Size (bsz) | Prefill (ms) | Decoding (ms) |
|---|---|---|
| 1 | 24.790 | 8574.580 |
| 10 | 66.958 | 9786.906 |
| 100 | 560.386 | 30451.646 |
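For reference, stage-by-stage timings of this kind can be collected with CUDA events; the sketch below shows the generic pattern, and `model.prefill` / `model.decode_step` are hypothetical placeholders rather than real APIs from our codebase.

```python
import torch

def timed_ms(fn, *args, **kwargs):
    """Return (result, elapsed milliseconds) for a GPU-side call, using CUDA
    events so asynchronously launched kernels are measured correctly."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fn(*args, **kwargs)
    end.record()
    torch.cuda.synchronize()
    return out, start.elapsed_time(end)

# Hypothetical usage; `model.prefill` and `model.decode_step` stand in for the
# actual prefill and per-token decode calls:
# _, prefill_ms = timed_ms(model.prefill, prompt_ids)
# decode_ms = sum(timed_ms(model.decode_step)[1] for _ in range(num_new_tokens))
```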
Additionally, in our paper Figure 3, we compared our method with KIVI and FP16 in terms of memory usage (left), runtime latency (middle), and throughput (right).
Below, we respond to the main points and questions raised by each reviewer. We welcome any additional suggestions or feedback that could further improve the paper. Thank you!
The authors introduce a new method for KV cache compression that relies on constructing a subspace spanned by query tensors to capture the most critical task-related information, which appears to yield improvements over state-of-the-art methods.
The paper received positive scores (6, 6, 7) from three reviewers, all acknowledging the novelty and technical merit of the proposed approach. The Reviewers appreciated the method's ability to preserve task-relevant information via orthogonal constraints to a query subspace, distinguishing it from existing compression-based quantization techniques (putting a small asterisk here, discussed below).
Strengths:
- Introduces a theoretical framework beyond simple compression-based quantization methods.
- Demonstrates empirical results across multiple benchmarks and models.
- Requires no fine-tuning.
- The authors addressed most of the reviewer concerns with additional experiments and clarifications on the computational cost breakdown.
Weaknesses:
- Evaluation on very long sequences was limited (I suppose that's overall a harder eval, but it was capped at 30k tokens due to computational constraints).
- Hyperparameter selection of new tuning knobs adds extra overhead (this is a complaint that applies to any novel compression method, to be honest).
- Practical deployment aspects were mentioned by a reviewer. I suppose this is always quite tricky, and we have to rely on our belief that some of the findings transfer to a more practical setting.
expanding on the asterisk:
- Many methods developed in the past can be used for task-specific quantization; e.g., Lexico by Kim et al. is VERY similar to what the authors proposed. However, the authors did not compare against these methods, which can be tuned to the task at hand, in their benchmarks.