PaperHub
Average rating: 6.8 / 10 (4 reviewers; min 5, max 8, std 1.3)
Decision: Rejected
Individual ratings: 8, 5, 6, 8
Confidence: 3.5 · Correctness: 2.5 · Contribution: 3.0 · Presentation: 2.5
ICLR 2025

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

OpenReview · PDF
Submitted: 2024-09-22 · Updated: 2025-02-05
TL;DR

High-Throughput Long-Context LLM Inference System

Abstract

With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for each token generation both result in low throughput when serving long-context LLMs. While various dynamic sparse attention methods have been proposed to speed up inference while maintaining generation quality, they either fail to sufficiently reduce GPU memory consumption or introduce significant decoding latency by offloading the KV cache to the CPU. We present ShadowKV, a high-throughput long-context LLM inference system that stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences. To minimize decoding latency, ShadowKV employs an accurate KV selection strategy that reconstructs minimal sparse KV pairs on-the-fly. By evaluating ShadowKV on a broad range of benchmarks, including RULER, LongBench, and Needle In A Haystack, and models like Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and Qwen2-7B-128K, we demonstrate that it can support up to 6$\times$ larger batch sizes and boost throughput by up to 3.04$\times$ on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory.
Keywords
Long-Context LLM Inference, KV Cache Optimization

Reviews and Discussion

Review (Rating: 8)

The paper introduces SHADOWKV, a novel system that improves inference throughput for long-context LLMs. The key challenge addressed is the increasing memory footprint of KV caches as sequence length increases, which slows down inference. SHADOWKV offers a solution by offloading the value cache to the CPU while keeping a low-rank key cache on the GPU, reducing memory consumption without sacrificing performance. By employing a method of KV selection for sparse attention, it boosts throughput and supports larger batch sizes, showing improvements of up to 3.04× throughput on the A100 GPU.

Strengths

The paper tackles a significant problem in LLM inference by reducing GPU memory usage through a hybrid CPU-GPU cache system, enabling long-context LLMs to operate more efficiently without sacrificing accuracy.

The system shows impressive throughput gains, handling up to six times larger batch sizes compared to baseline methods, which could substantially impact real-world LLM deployment.

SHADOWKV is tested on multiple models and benchmarks (e.g., Llama, GLM) across various tasks, demonstrating consistent performance improvements across all settings.

Weaknesses

The main issue is that, as far as I know, SVD is precision-sensitive, but I didn't find any discussion about precision in the paper. My main question is what precision is used for ShadowKV and the baselines. If you are using a precision like FP16/FP32, how does ShadowKV work at FP8 (or mixed FP8 & FP16) precision? If the precision is FP8, how does ShadowKV cope with the precision-sensitive SVD?

For the speed evaluation, I only found the throughput (tokens/s); are there any experiments that measure the time of each operation separately?

Questions

See weaknesses.

Comment

Q2: Detailed Time Analysis of Operations

Thank you for the valuable suggestion. We present a detailed latency breakdown in the tables below to illustrate the efficiency of each operation under various context lengths for both the prefilling and decoding stages (mentioned in Section 5.2, Page 9 and detailed in Appendix A.6, Page 18).

  • Scalability for Longer Sequences. As shown in the table below, the overhead of the SVD, reduce, cosine similarity, topK, and gather computations is very low and tends to decrease as the sequence scales, demonstrating ShadowKV's scalability to longer sequences.

Latency breakdown (ms) of a Transformer block of Llama-3-8B-1M during prefilling:

| Context | Attention | FFN | SVD | Reduce | Cosine Similarity | TopK | Gather | Cost |
|---|---|---|---|---|---|---|---|---|
| 64K | 186.23 | 96.47 | 17.19 | 0.10 | 1.41 | 0.08 | 0.01 | 6.65% |
| 128K | 721.13 | 193.32 | 26.62 | 0.20 | 2.77 | 0.14 | 0.02 | 3.25% |
| 256K | 2880.21 | 392.77 | 50.56 | 0.42 | 6.11 | 0.11 | 0.03 | 1.75% |
| 512K | 11720.30 | 789.23 | 108.38 | 0.84 | 12.19 | 0.15 | 0.06 | 0.97% |
  • Overlapping Operations for Latency Reduction. In the table below, we demonstrate how overlapping the recomputation of the key cache with value cache fetching from the CPU significantly reduces decoding latency. This concurrent processing approach ensures that ShadowKV minimizes overhead when handling long-context models.

Latency breakdown (ms) of a Transformer block of Llama-3-8B-1M during decoding:

| Batch × Context | GEMM + Softmax | Max | TopK | Recompute K (Overlapped) | Fetch V | Attention | FFN | QKV |
|---|---|---|---|---|---|---|---|---|
| 48 × 64K | 0.56 | 0.07 | 0.14 | 1.25 | 1.84 | 0.23 | 0.33 | 0.05 |
| 24 × 128K | 0.58 | 0.07 | 0.15 | 1.36 | 1.66 | 0.21 | 0.29 | 0.05 |
| 12 × 256K | 0.65 | 0.07 | 0.16 | 1.49 | 1.75 | 0.19 | 0.25 | 0.05 |
| 6 × 512K | 0.71 | 0.07 | 0.17 | 1.51 | 1.69 | 0.18 | 0.23 | 0.05 |
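To make the overlap concrete, below is a minimal PyTorch sketch of the pattern described above (tensor names, shapes, and the rank are illustrative assumptions, not the released ShadowKV implementation): the selected key rows are rebuilt from their low-rank factors on the default CUDA stream while the selected value rows are copied from pinned CPU memory on a side stream.

```python
import torch

# Illustrative shapes and names (assumptions). Requires a CUDA device.
device = "cuda"
rank, seq_len, head_dim, budget = 160, 131072, 128, 2048
key_U = torch.randn(seq_len, rank, dtype=torch.bfloat16, device=device)        # low-rank K factor (GPU)
key_V = torch.randn(rank, head_dim, dtype=torch.bfloat16, device=device)       # low-rank K factor (GPU)
value_cpu = torch.randn(seq_len, head_dim, dtype=torch.bfloat16).pin_memory()  # offloaded V cache
v_stage = torch.empty(budget, head_dim, dtype=torch.bfloat16).pin_memory()     # pinned staging buffer
copy_stream = torch.cuda.Stream()

def gather_sparse_kv(idx_gpu: torch.Tensor, idx_cpu: torch.Tensor):
    """idx_gpu / idx_cpu hold the same selected token indices on GPU / CPU."""
    # 1) Launch key reconstruction on the default stream (asynchronous GPU matmul).
    k_sel = key_U[idx_gpu] @ key_V
    # 2) Meanwhile, gather the selected value rows on the host and copy them on a side stream.
    torch.index_select(value_cpu, 0, idx_cpu, out=v_stage)
    with torch.cuda.stream(copy_stream):
        v_sel = v_stage.to(device, non_blocking=True)
    # 3) Make the default stream wait for the copy before attention uses v_sel.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return k_sel, v_sel

idx = torch.randint(0, seq_len, (budget,))
k, v = gather_sparse_kv(idx.to(device), idx)
```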
Comment

Dear Reviewer iRxd,

We sincerely appreciate the time and effort you have dedicated to reviewing our work and for sharing your insightful feedback. Your comments have been immensely valuable in improving the quality of our manuscript.

We have carefully addressed your concerns and revised the manuscript accordingly. As the discussion period is ending soon, we wanted to kindly ask if there are any further questions or points we could clarify to better address your feedback. If our responses have adequately addressed your concerns, we would be deeply grateful if you could consider reflecting this in your score.

Thank you again for your constructive review and for contributing your valuable expertise. We deeply appreciate your input and support throughout this process.

Comment

Thank you for your encouraging feedback and insightful remarks. We have thoroughly addressed each of your questions and hope our responses will lead you to consider raising your score.

Q1: Precision Sensitivity in SVD Computations

We appreciate the suggestion to examine the precision sensitivity of ShadowKV. In the main experiments, we used BF16 for both model weights and KV cache. To further investigate the impact of precision on ShadowKV's performance, we conducted additional experiments using FP8 precision (torch.float8_e5m2), showing that ShadowKV can retain its accuracy at this lower precision, addressing concerns about precision sensitivity, particularly in SVD computations (mentioned in Section 5.3, Page 10 and detailed in Section A.4, Page 16).

As detailed in the tables below, ShadowKV and baseline methods were evaluated using FP8. Results show that ShadowKV maintains accuracy and achieves consistently high performance even with FP8 precision. This robustness, despite FP8’s reduced numerical range, confirms that ShadowKV can continue to deliver efficiency gains without compromising accuracy.

  • Results on RULER
| Methods | N-S1 | N-S2 | N-MK1 | N-MK2 | N-MQ | N-MV | QA-1 | QA-2 | VT | FWE | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-1M | 100.00 | 100.00 | 98.96 | 95.83 | 97.40 | 95.57 | 63.54 | 48.96 | 75.83 | 73.26 | 84.94 |
| Loki | 5.21 | 1.04 | 0.00 | 0.00 | 0.78 | 0.26 | 5.21 | 13.54 | 28.33 | 28.82 | 8.32 |
| Loki (Vonly) | 36.46 | 9.38 | 31.25 | 0.00 | 6.25 | 21.09 | 11.46 | 15.63 | 57.08 | 35.76 | 22.44 |
| Quest | 100.00 | 98.96 | 98.96 | 71.88 | 96.61 | 93.49 | 63.54 | 45.83 | 78.13 | 67.01 | 81.44 |
| Quest (Vonly) | 100.00 | 100.00 | 98.96 | 85.42 | 97.40 | 93.49 | 70.83 | 48.96 | 78.13 | 65.63 | 83.88 |
| ShadowKV | 100.00 | 100.00 | 97.92 | 94.79 | 95.31 | 93.49 | 75.00 | 48.96 | 80.42 | 73.61 | 85.95 |
  • Results on LongBench
| Methods | NarratQA | MultiFQA | HotpotQA | MuSiQue | DuRead | GovRep | SAMSum | PassRetr | LCC | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-1M | 18.69 | 41.21 | 35.76 | 21.59 | 31.81 | 33.77 | 35.29 | 80.50 | 56.77 | 39.49 |
| Loki | 2.21 | 11.12 | 5.70 | 1.84 | 15.42 | 28.59 | 11.41 | 41.91 | 33.99 | 16.91 |
| Loki (Vonly) | 2.68 | 22.33 | 12.69 | 3.35 | 21.43 | 30.57 | 16.32 | 47.68 | 36.64 | 21.52 |
| Quest | 19.41 | 38.92 | 34.02 | 19.64 | 23.13 | 26.40 | 28.04 | 78.50 | 49.81 | 35.32 |
| Quest (Vonly) | 16.19 | 36.73 | 36.64 | 19.59 | 25.57 | 29.46 | 27.14 | 79.50 | 60.05 | 36.76 |
| ShadowKV | 18.29 | 39.39 | 36.06 | 21.04 | 30.47 | 31.87 | 35.56 | 78.50 | 62.11 | 39.25 |
Comment

Dear Reviewer iRxd,

We are truly grateful for the time and effort you have taken to review our manuscript and for providing such thoughtful feedback. Your insights have been instrumental in helping us refine our work.

As today marks the final day for revising the PDF, we wanted to kindly follow up to see if you have any further questions or points of clarification regarding our responses. We are happy to address any remaining concerns you might have.

Thank you again for your invaluable support and guidance throughout this process.

With our deepest appreciation,

Authors

Comment

What is the precision for SVD? PyTorch's SVD does not seem to support FP16 or lower precision. See link.

Comment

Thank you for your thoughtful question and for pointing this out. The SVD operation is conducted in FP32, as PyTorch currently does not support lower-precision kernels for SVD. To evaluate the robustness of ShadowKV, we cast the decomposition results to FP8, optimizing KV cache storage while ensuring performance remains robust. During decoding, the K reconstruction (i.e., matrix multiplication) is performed directly in FP8 or BF16, just like other FP8 or BF16 LLM decoding operations. Moreover, this SVD part can be executed on the CPU during prefilling, which can be overlapped, mitigating the overhead.

Additionally, we explored PyTorch's approximate SVD driver, gesvda, and found that it does not degrade ShadowKV's performance, further demonstrating its precision robustness. Once PyTorch introduces lower-precision SVD kernels, we would be happy to evaluate their impact on our method.
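As a concrete illustration of this precision flow, here is a small sketch (the shapes, rank, and error check are our own assumptions, not the ShadowKV code): the SVD runs in FP32, the truncated factors are stored in a lower precision, and reconstruction at decode time is an ordinary BF16 matmul.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
seq_len, kv_dim, rank = 8192, 1024, 160            # illustrative sizes (assumptions)
pre_rope_keys = torch.randn(seq_len, kv_dim, dtype=torch.bfloat16, device=device)

# The SVD itself runs in FP32, since PyTorch offers no lower-precision SVD kernels.
U, S, Vh = torch.linalg.svd(pre_rope_keys.float(), full_matrices=False)

# Keep the top-r components; fold the singular values into the left factor.
A = U[:, :rank] * S[:rank]        # [seq_len, rank]
B = Vh[:rank, :]                  # [rank, kv_dim]

# Store the factors in a lower precision (BF16 here; the FP8 variant discussed above
# would store float8_e5m2 and cast back to BF16 right before the matmul).
A16, B16 = A.to(torch.bfloat16), B.to(torch.bfloat16)

# Decode-time reconstruction of selected key rows is a plain BF16 matmul.
idx = torch.randint(0, seq_len, (256,), device=device)
k_rebuilt = A16[idx] @ B16
rel_err = (k_rebuilt.float() - pre_rope_keys[idx].float()).norm() / pre_rope_keys[idx].float().norm()
print(f"relative reconstruction error at rank {rank}: {rel_err.item():.3f}")
```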

If our responses have adequately addressed your concerns, we would be deeply grateful if you could consider reflecting this in your score. Please let me know if you have any further questions or suggestions.

Comment

Dear Reviewer iRxd,

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript and for sharing your thoughtful feedback. Your insights have been invaluable in helping us improve our work.

As today marks the final day for reviewer replies, we wanted to kindly follow up to check if you have any additional questions or require further clarification regarding our responses. We would be more than happy to address any remaining concerns you may have.

Thank you once again for your invaluable support and guidance throughout this process.

With our deepest gratitude,

The Authors

Review (Rating: 5)

This paper introduces SHADOWKV, a CPU-offloading-based system for long-context LLM inference. SHADOWKV addresses GPU memory constraints by leveraging the low-rank property of the key cache, storing a low-rank pre-ROPE key cache on the GPU while offloading the value cache to the CPU. Additionally, SHADOWKV stores landmarks—mean values of post-ROPE keys for adjacent tokens—along with low-rank pre-ROPE key cache in GPU memory. During decoding, SHADOWKV utilizes these landmark keys to identify significant tokens, selectively recovering their keys and retrieving their values from the CPU. Evaluations on long-context benchmarks demonstrate that SHADOWKV maintains high accuracy while significantly reducing GPU memory consumption and inference latency.

Strengths

  • In-depth analysis on low-rank nature of key cache in comparison to value cache as well as weights.
  • Leveraging spatial locality of post-ROPE key cache for dynamic sparse attention appears both novel and effective.
  • Empirical results are impressive.

Weaknesses

  • Incomplete Descriptions

The term "sparse budget" is not clearly defined in the paper, which may lead to confusion. Additionally, while SHADOWKV claims to leverage the temporal locality of the KV cache to reduce computation and communication (by approximately 60%), it lacks a detailed explanation of what that property is.

  • Handling Newly Generated Tokens

While the paper says that it excludes the handling of newly generated tokens for simplicity, this issue is quite significant and should not be ignored. If not addressed, the KV cache for newly generated tokens could negate SHADOWKV’s key benefits of reduced GPU memory usage and lower inference latency, especially with long output sequences. Incorporating mechanisms to handle these tokens within SHADOWKV is essential, and the authors should evaluate and report on its impact on accuracy.

  • Lack of Comparison with Infinigen

The paper does not sufficiently compare SHADOWKV with Infinigen [1], a closely related work that similarly stores low-rank key cache in GPU memory, offloads the value cache to the CPU, and selectively fetches important values based on approximate attention scores. Although the paper briefly discusses Infinigen, given the significant similarities, a more in-depth comparison with Infinigen should be made in order to highlight the main differentiator of SHADOWKV.

[1] Lee et al., "InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management", OSDI'24

Questions

  • What exactly is the “sparse budget”?

  • How does SHADOWKV leverage the temporal locality of the KV cache?

  • What are the exact GPU memory savings of SHADOWKV? Including a quantitative discussion on GPU memory savings in the paper would be helpful.

  • How can SHADOWKV handle the KV cache for newly generated tokens, and what would be the impact of this?

Comment

Thank you sincerely for your thoughtful review and invaluable feedback. We have addressed each of your questions with care and hope that our responses will inspire you to consider raising your score.

Q1: Definition of “sparse budget”

Thanks for pointing it out. The term "sparse budget" refers to the number of selected tokens, i.e., the K of TopK. This budget dictates the portion that needs to be fetched from the CPU or reconstructed from the low-rank space. We have added this clarification to the revised paper (Section 1, Page 2).


Q2: Explanation of “temporal locality of the KV cache”

Temporal locality of the KV cache refers to the observation that, during decoding, the KV cache pairs selected by the queries of two adjacent decoding steps have a repetition rate of approximately 60%. This means that we don't need to perform low-rank reconstruction or CPU data fetching for the repeated portions, only for the non-repeated parts. This reduces the overall decoding overhead. We have added this clarification to the revised paper (Section 3.2, Page 5).
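A minimal sketch of how such chunk-level reuse could look (the class and buffer layout are our own illustration under the ~60% repetition observation, not the released code): the chunks selected at the current step are compared with the previous step, and only chunks that are not already resident need low-rank reconstruction or a CPU fetch.

```python
import torch

class ChunkReuseCache:
    """Reuses KV chunks that stay selected across adjacent decoding steps."""

    def __init__(self):
        self.resident = torch.empty(0, dtype=torch.long)  # chunk ids currently held in the buffer

    def step(self, selected: torch.Tensor):
        """Split this step's selected chunk ids into cache hits and misses."""
        hit_mask = torch.isin(selected, self.resident)
        hits, misses = selected[hit_mask], selected[~hit_mask]
        self.resident = selected.clone()
        return hits, misses

cache = ChunkReuseCache()
cache.step(torch.tensor([3, 7, 9, 12, 40, 41]))                  # previous decoding step
hits, misses = cache.step(torch.tensor([3, 7, 9, 15, 40, 52]))   # current decoding step
print(f"reused {hits.numel()}/6 chunks; only {misses.tolist()} need reconstruction/fetching")
```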

Comment

Q3: Handling of Newly Generated Tokens

We appreciate the suggestion to evaluate the impact of handling newly generated tokens. We present extensive experiments on RULER and LongBench across different long-context models (mentioned in Section 4.1, Page 6 and detailed in Appendix A.1, Page 15). The results demonstrate that ShadowKV effectively handles newly generated tokens while maintaining accuracy.

To address the handling of newly generated tokens, we project these tokens' key cache into a low-rank space using the same projections applied during the prefilling phase (see the sketch after the tables below). This approach preserves the benefits of reduced GPU memory usage, particularly for long output sequences.

As shown in tables below, we refer to this extension as ShadowKV+. Our evaluation across various models demonstrates that ShadowKV+ effectively maintains accuracy and manages newly generated tokens as well.

  • Results on RULER:
| Methods | N-S1 | N-S2 | N-MK1 | N-MK2 | N-MQ | N-MV | QA-1 | QA-2 | VT | FWE | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-1M | 100.00 | 100.00 | 98.96 | 98.96 | 98.96 | 95.57 | 75.00 | 48.96 | 78.54 | 71.85 | 86.68 |
| ShadowKV | 100.00 | 100.00 | 97.92 | 98.96 | 96.88 | 95.83 | 72.92 | 52.08 | 81.67 | 72.57 | 86.88 |
| ShadowKV+ | 100.00 | 100.00 | 98.96 | 100.00 | 95.83 | 93.49 | 71.88 | 50.00 | 80.21 | 71.88 | 86.23 |
| GLM-4-9B-1M | 100.00 | 100.00 | 94.79 | 87.50 | 99.74 | 93.75 | 67.71 | 55.21 | 97.29 | 72.22 | 86.82 |
| ShadowKV | 100.00 | 100.00 | 95.83 | 83.33 | 98.70 | 87.76 | 69.79 | 55.21 | 97.50 | 68.06 | 85.62 |
| ShadowKV+ | 100.00 | 100.00 | 95.83 | 85.42 | 98.17 | 85.16 | 69.79 | 56.25 | 97.92 | 67.71 | 85.63 |
| Llama-3.1-8B | 100.00 | 100.00 | 98.96 | 91.67 | 98.96 | 95.31 | 82.29 | 47.92 | 68.96 | 71.18 | 85.53 |
| ShadowKV | 100.00 | 100.00 | 100.00 | 83.33 | 97.92 | 92.19 | 81.25 | 48.96 | 67.08 | 64.93 | 83.57 |
| ShadowKV+ | 100.00 | 100.00 | 100.00 | 84.38 | 96.88 | 91.67 | 81.25 | 52.08 | 65.63 | 62.85 | 83.47 |
| Yi-9B-200K | 100.00 | 100.00 | 86.46 | 62.50 | 64.58 | 32.55 | 44.79 | 39.58 | 36.87 | 89.93 | 65.73 |
| ShadowKV | 100.00 | 100.00 | 82.29 | 67.71 | 63.28 | 31.51 | 43.75 | 38.54 | 56.04 | 72.22 | 65.53 |
| ShadowKV+ | 100.00 | 100.00 | 81.25 | 67.71 | 61.72 | 31.51 | 46.88 | 38.54 | 53.54 | 72.92 | 65.41 |
  • Results on LongBench:
| Methods | NarratQA | MultiFQA | HotpotQA | MuSiQue | DuRead | GovRep | SAMSum | PassRetr | LCC | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-1M | 18.98 | 41.84 | 36.79 | 21.47 | 31.93 | 34.18 | 35.96 | 81.50 | 56.07 | 39.86 |
| ShadowKV | 17.17 | 39.73 | 38.29 | 21.08 | 31.77 | 31.62 | 35.87 | 80.00 | 63.93 | 39.94 |
| ShadowKV+ | 20.42 | 41.16 | 37.22 | 21.03 | 31.77 | 31.98 | 35.80 | 80.00 | 63.89 | 40.36 |
| GLM-4-9B-1M | 25.44 | 51.09 | 58.67 | 39.61 | 32.04 | 29.97 | 40.31 | 99.00 | 58.02 | 48.24 |
| ShadowKV | 26.50 | 51.31 | 59.09 | 38.87 | 32.92 | 28.54 | 38.70 | 96.50 | 58.55 | 47.89 |
| ShadowKV+ | 27.59 | 51.31 | 59.17 | 38.34 | 33.55 | 31.25 | 39.46 | 96.50 | 55.86 | 48.11 |
| Llama-3.1-8B | 31.56 | 55.10 | 57.65 | 29.46 | 35.26 | 34.45 | 29.84 | 100.00 | 67.31 | 48.96 |
| ShadowKV | 30.93 | 55.20 | 57.32 | 29.13 | 31.85 | 32.79 | 30.40 | 99.50 | 66.03 | 48.13 |
| ShadowKV+ | 32.25 | 54.29 | 57.75 | 28.37 | 31.07 | 32.89 | 28.73 | 98.75 | 67.59 | 47.97 |
| Yi-9B-200K | 13.88 | 30.02 | 52.46 | 28.20 | 22.29 | 30.25 | 19.08 | 67.00 | 73.50 | 37.41 |
| ShadowKV | 12.44 | 30.82 | 52.43 | 27.73 | 20.79 | 29.83 | 20.73 | 64.00 | 72.89 | 36.85 |
| ShadowKV+ | 14.08 | 30.94 | 51.16 | 27.00 | 19.50 | 29.34 | 21.16 | 66.00 | 73.47 | 36.96 |
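As referenced above, here is a small sketch of the projection step for newly generated tokens (shapes and names are illustrative assumptions, not the released code): new pre-RoPE keys are projected onto the basis obtained from the prefill-time SVD, so they share the same low-rank storage as the prompt keys.

```python
import torch

# Illustrative shapes (assumptions): B is the basis from the prefill-time SVD of the
# pre-RoPE key cache; rows of Vh are orthonormal, so projection is a single matmul.
seq_len, kv_dim, rank = 8192, 1024, 160
prompt_keys = torch.randn(seq_len, kv_dim)
_, _, Vh = torch.linalg.svd(prompt_keys, full_matrices=False)
B = Vh[:rank]                                   # [rank, kv_dim], fixed after prefilling

coeff_cache = prompt_keys @ B.T                 # low-rank coefficients for the prompt keys

def append_generated_keys(coeff_cache: torch.Tensor, new_keys: torch.Tensor) -> torch.Tensor:
    """Project newly generated pre-RoPE keys onto the prefill basis and append them."""
    return torch.cat([coeff_cache, new_keys @ B.T], dim=0)

coeff_cache = append_generated_keys(coeff_cache, torch.randn(1, kv_dim))  # one decoded token
approx_new_key = coeff_cache[-1:] @ B           # reconstruction of the appended key row
```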
Comment

Q4: Detailed Comparison with InfiniGen

We provide further clarification on the key distinctions between ShadowKV and InfiniGen and conduct additional comparative experiments (mentioned in Section 2, Page 3 and detailed in Section 5.1, Page 7-8). These experiments show that ShadowKV significantly outperforms InfiniGen across a wide range of downstream tasks.

  • Differences in SVD Usage. InfiniGen performs an offline SVD to obtain a projection matrix, which is applied to post-RoPE key and query states for KV selection, while ShadowKV applies an online, prompt-dependent SVD directly to the pre-RoPE key cache for compression, not for KV selection.

  • Methodological Differences. While InfiniGen uses SVD for KV selection, it requires fetching selected, exact KV pairs from the CPU. In contrast, ShadowKV only fetches the value cache from the CPU, reconstructing the key cache from its low-rank storage on the GPU. By overlapping these processes, ShadowKV reduces data-fetch overhead and achieves improved efficiency in KV cache management.

  • Accuracy Comparison. To empirically validate ShadowKV’s advantages, we conducted accuracy evaluations. Results, presented in the tables below, confirm ShadowKV’s effectiveness in maintaining accuracy while optimizing memory usage. Although InfiniGen performs well on simpler tasks like RULER-N-S1, it shows significant accuracy drops on more complex tasks, such as RULER-N-MK2, RULER-FWE, LongBench-LCC, and others, where ShadowKV maintains consistently high accuracy.

  • Results on RULER:

| Methods | N-S1 | N-S2 | N-MK1 | N-MK2 | N-MQ | N-MV | QA-1 | QA-2 | VT | FWE | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-1M | 100.00 | 100.00 | 98.96 | 98.96 | 98.96 | 95.57 | 75.00 | 48.96 | 78.54 | 71.85 | 86.68 |
| InfiniGen | 100.00 | 98.96 | 84.38 | 53.13 | 63.28 | 54.95 | 65.63 | 48.96 | 81.67 | 50.35 | 70.13 |
| InfiniGen (Vonly) | 100.00 | 98.96 | 96.88 | 76.04 | 81.25 | 77.08 | 67.71 | 50.00 | 81.67 | 53.47 | 78.31 |
| ShadowKV | 100.00 | 100.00 | 97.92 | 98.96 | 96.88 | 95.83 | 72.92 | 52.08 | 81.67 | 72.57 | 86.88 |
| GLM-4-9B-1M | 100.00 | 100.00 | 94.79 | 87.50 | 99.74 | 93.75 | 67.71 | 55.21 | 97.29 | 72.22 | 86.82 |
| InfiniGen | 100.00 | 93.75 | 82.29 | 0.00 | 79.43 | 60.16 | 57.29 | 53.13 | 92.71 | 57.29 | 67.60 |
| InfiniGen (Vonly) | 100.00 | 96.88 | 87.50 | 7.29 | 95.31 | 75.26 | 56.25 | 54.17 | 95.63 | 60.76 | 72.91 |
| ShadowKV | 100.00 | 100.00 | 95.83 | 83.33 | 98.70 | 87.76 | 69.79 | 55.21 | 97.50 | 68.06 | 85.62 |
| Llama-3.1-8B | 100.00 | 100.00 | 98.96 | 91.67 | 98.96 | 95.31 | 82.29 | 47.92 | 68.96 | 71.18 | 85.53 |
| InfiniGen | 100.00 | 77.08 | 78.13 | 13.54 | 58.07 | 47.40 | 65.63 | 41.67 | 60.83 | 50.35 | 59.27 |
| InfiniGen (Vonly) | 100.00 | 88.54 | 87.50 | 26.04 | 79.43 | 77.08 | 72.92 | 43.75 | 57.08 | 55.21 | 68.76 |
| ShadowKV | 100.00 | 100.00 | 100.00 | 83.33 | 97.92 | 92.19 | 81.25 | 48.96 | 67.08 | 64.93 | 83.57 |
| Yi-9B-200K | 100.00 | 100.00 | 86.46 | 62.50 | 64.58 | 32.55 | 44.79 | 39.58 | 36.87 | 89.93 | 65.73 |
| InfiniGen | 100.00 | 94.79 | 77.08 | 1.04 | 40.10 | 20.57 | 37.50 | 34.38 | 41.46 | 46.18 | 49.31 |
| InfiniGen (Vonly) | 100.00 | 98.96 | 78.13 | 2.08 | 58.33 | 24.48 | 40.63 | 35.42 | 52.92 | 55.90 | 54.69 |
| ShadowKV | 100.00 | 100.00 | 82.29 | 67.71 | 63.28 | 31.51 | 43.75 | 38.54 | 56.04 | 72.22 | 65.53 |
  • Results on LongBench:
| Methods | NarratQA | MultiFQA | HotpotQA | MuSiQue | DuRead | GovRep | SAMSum | PassRetr | LCC | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-1M | 18.98 | 41.84 | 36.79 | 21.47 | 31.93 | 34.18 | 35.96 | 81.50 | 56.07 | 39.86 |
| InfiniGen | 14.39 | 31.46 | 33.63 | 17.94 | 26.65 | 27.38 | 21.97 | 74.30 | 38.58 | 31.81 |
| InfiniGen (Vonly) | 17.83 | 36.08 | 35.28 | 19.64 | 28.39 | 29.28 | 28.12 | 74.85 | 45.53 | 35.00 |
| ShadowKV | 17.17 | 39.73 | 38.29 | 21.08 | 31.77 | 31.62 | 35.87 | 80.00 | 63.93 | 39.94 |
| GLM-4-9B-1M | 25.44 | 51.09 | 58.67 | 39.61 | 32.04 | 29.97 | 40.31 | 99.00 | 58.02 | 48.24 |
| InfiniGen | 23.67 | 46.31 | 55.69 | 33.91 | 27.49 | 25.44 | 33.48 | 91.83 | 36.96 | 41.64 |
| InfiniGen (Vonly) | 25.63 | 48.44 | 57.23 | 36.94 | 29.77 | 26.67 | 36.64 | 93.58 | 46.69 | 44.62 |
| ShadowKV | 26.50 | 51.31 | 59.09 | 38.87 | 32.92 | 28.54 | 38.70 | 96.50 | 58.55 | 47.89 |
| Llama-3.1-8B | 31.56 | 55.10 | 57.65 | 29.46 | 35.26 | 34.45 | 29.84 | 100.00 | 67.31 | 48.96 |
| InfiniGen | 27.23 | 52.72 | 53.89 | 26.81 | 27.72 | 29.61 | 24.42 | 98.93 | 56.89 | 44.25 |
| InfiniGen (Vonly) | 29.73 | 53.47 | 55.11 | 28.72 | 28.55 | 31.42 | 26.76 | 99.17 | 62.66 | 46.18 |
| ShadowKV | 30.93 | 55.20 | 57.32 | 29.13 | 31.85 | 32.79 | 30.40 | 99.50 | 66.03 | 48.13 |
| Yi-9B-200K | 13.88 | 30.02 | 52.46 | 28.20 | 22.29 | 30.25 | 19.08 | 67.00 | 73.50 | 37.41 |
| InfiniGen | 10.01 | 23.61 | 50.47 | 25.91 | 15.11 | 27.96 | 18.97 | 30.00 | 56.46 | 28.72 |
| InfiniGen (Vonly) | 11.31 | 26.46 | 51.13 | 26.77 | 16.09 | 28.67 | 19.33 | 34.00 | 62.07 | 30.65 |
| ShadowKV | 12.44 | 30.82 | 52.43 | 27.73 | 20.79 | 29.83 | 20.73 | 64.00 | 72.89 | 36.85 |
Comment

Q5: Quantitative Analysis of GPU Memory Savings

Thank you for the valuable suggestion. The GPU memory savings provided by ShadowKV can be quantitatively analyzed as follows (mentioned in Section 4.2, Page 6 and detailed in Section A.2, Page 16). Let each K or V vector have a size of M bytes, with a sequence length S, a chunk size C, a selected chunk budget K, O outliers, and a pre-RoPE key cache rank r. The GPU memory savings of ShadowKV can then be expressed as:

$$\text{Memory Savings} = \frac{2SM}{SM/C + 2(K+O)C + Sr + rM}$$

For example, assuming M=1024, C=8, S=128K, K=256, O=48, r=160, the memory savings of ShadowKV is 7.08×. This result demonstrates that ShadowKV can theoretically reduce the KV cache memory footprint on the GPU by 7.08× for longer sequences and larger batch sizes.
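A few lines of Python reproduce the number in the example above (the parameter values are taken directly from that example; nothing here comes from the released code):

```python
def shadowkv_memory_savings(S, M, C, K, O, r):
    """Memory savings = 2SM / (SM/C + 2(K+O)C + S*r + r*M), as defined above."""
    full_kv = 2 * S * M                                    # full K and V caches on the GPU
    shadow = S * M / C + 2 * (K + O) * C + S * r + r * M   # denominator terms from the formula above
    return full_kv / shadow

# Example from the text: M=1024, C=8, S=128K, K=256, O=48, r=160  ->  ~7.08x
print(round(shadowkv_memory_savings(S=128 * 1024, M=1024, C=8, K=256, O=48, r=160), 2))
```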

Comment

Thank you for the responses. Before taking a closer look into the additional experimental data you’ve provided, I have a quick question regarding the GPU memory savings. Does your quantitative analysis of GPU memory savings account for the way temporal locality is leveraged? It might increase GPU memory consumption.

Comment

Thank you for the additional experimental data. I really appreciate that you were able to provide all this information within such a short time frame.

By the way, while you compare ShadowKV and InfiniGen in terms of accuracy, I believe there should be a discussion about latency and GPU memory savings for a proper comparison between the two methods. Specifically, I am curious whether the latency and GPU memory savings for these methods were set to the same level.

It would be great if you could provide some details on this aspect.

Comment

Thanks for your response. The sparse budget is calculated as the product of the selected chunk budget (K) and the chunk size (C), i.e., sparse budget = K × C.

For LongBench, when we say "KV cache budget is set to 256," it means the sparse budget is 256. If the chunk size is 8, then K = 256/8 = 32.

Feel free to reach out if you have any additional questions. Thanks!

Comment

Thank you for the clarification. Just to confirm, the 7.08× memory savings apply specifically to the example setting used in your quantitative analysis, and GPU memory usage varies across benchmarks in the accuracy evaluation due to the different sparse budgets set for each benchmark, correct? The way you set the sparse budget differently for each benchmark makes it a bit confusing to grasp the exact memory-saving/accuracy tradeoff of your scheme.

That said, I now have a much clearer understanding of this point. Thank you again for the explanation.

Comment

Thank you for your feedback! I'm very glad the explanation clarified things for you. Yes, you are correct. In the efficiency test, we demonstrate that ShadowKV achieves 6× larger batch sizes (refer to Table 4), which aligns well with our theoretical analysis.

For RULER-128K, we use a budget of K = 256. For LongBench, since the context lengths are shorter (≤32K), we opt for a smaller K to better match the benchmark's requirements.

Feel free to reach out if you have any additional questions. Thanks again for your engagement!

Comment

Hello, thank you for your response. Our cache implementation is actually quite straightforward: we avoid complex mechanisms like LRU, ensuring no additional memory consumption. Specifically, during each decoding step, we overwrite the KV pairs from the previous step using the same buffer (the 2(K+O)C term in the formula), effectively reusing memory without allocating extra space.

If more complex caching policies were used, the hit rate could improve, but this would introduce some additional memory overhead.

Comment

Thank you for the prompt response. I understand your point, but I have one more quick question. I am a bit confused about the notation. What exactly is the selected chunk budget (K)? Is it equivalent to the KV cache budget or the sparse budget mentioned in your evaluation?

For instance, in the description of the settings for the LongBench evaluation in the paper, it is stated that the KV cache budget is set to 256. I wonder if the KV cache budget in this context corresponds to the selected chunk budget in the GPU memory savings analysis.

Comment

Thank you for your thoughtful feedback and for pointing out your concerns. Let me clarify the points you raised.

The term "KV buffers" refers to the 2KC term, i.e., the sparse budget component in this context (for LongBench, we set O = 1 for ShadowKV). To implement a caching mechanism, both InfiniGen and ShadowKV must retain at least this portion of the memory, which reduces the overall memory savings. Therefore, your concern about the effect of K on memory savings applies equally to InfiniGen. Moreover, InfiniGen introduces additional memory overhead by saving the low-rank projection matrix, which is also not accounted for in its 6.67× memory savings.

Our primary goal with ShadowKV is to optimize for long-context scenarios (S > 16K, benchmarks like RULER), where its scalability becomes more evident. To provide additional clarity, we have included a table showing memory savings for ShadowKV across different sequence lengths S with K = 32. It's worth noting that, for the LongBench dataset, we only evaluate samples with sequence lengths greater than 4K.

As shown in the table below, ShadowKV matches InfiniGen's memory savings (which is overestimated) at a sequence length of 8K and surpasses it as the sequence length increases. We believe that ShadowKV demonstrates better performance in long-context scenarios compared to InfiniGen and offers superior scalability.

| (S, K) | S = 4K, K = 32 | S = 8K, K = 32 | S = 16K, K = 32 | S = 32K, K = 32 |
|---|---|---|---|---|
| Memory Savings | 6.24× | 6.65× | 6.87× | 6.99× |

We hope this table and explanation address your concerns and provide further insight into ShadowKV's scalability and performance. Please feel free to reach out with any additional questions or clarifications. We are looking forward to your feedback.

Comment

I'm now clear with this point. Thank you for the detailed explanations.

Comment

Thank you so much for your kind and detailed feedback. We are deeply grateful for the time and effort you’ve devoted to reviewing our work and for providing us with such valuable comments.

It means a lot to us that our explanations addressed your concerns, and we sincerely hope that you might consider raising your score to reflect your clarified understanding.

Your support and constructive feedback are truly appreciated, and please don’t hesitate to let us know if there’s anything more we can do. Thank you again for your generosity and guidance.

Comment

Dear Reviewer 5wSX,

We are truly grateful for the time and effort you have taken to review our manuscript and for providing such thoughtful feedback. Your insights have been instrumental in helping us refine our work.

As today marks the final day for revising the PDF, we wanted to kindly follow up to see if you have any further questions or points of clarification regarding our responses. We are happy to address any remaining concerns you might have.

Thank you again for your invaluable support and guidance throughout this process.

With our deepest appreciation,

Authors

Comment

Thank you for following up and for your thoughtful comments. We appreciate the opportunity to provide additional details regarding the comparison between ShadowKV and InfiniGen in terms of GPU memory savings and latency. Below, we address your questions and provide a detailed analysis, demonstrating that ShadowKV outperforms InfiniGen in terms of accuracy, memory savings, and latency.

GPU memory savings: InfiniGen does not provide a quantitative analysis of GPU memory savings in its paper, so we performed an estimation to highlight ShadowKV's advantages. Following the configuration described in the InfiniGen paper, where the partial weight ratio is set to 0.3, the KV cache size is reduced to 15%, equating to a 6.67× memory savings. In comparison, ShadowKV achieves a 7.08× reduction. It should be noted that InfiniGen's 15% memory savings does not account for additional memory overheads (e.g., KV buffers). If these overheads are included, the memory savings for InfiniGen would be further diminished.

Latency Analysis: From a latency perspective, ShadowKV is more scalable to long sequences than InfiniGen. In our accuracy evaluation, we maintained the same sparse budget for both methods. Below is the latency breakdown (in ms) for a single Transformer block of Llama-3-8B-1M during decoding, using ShadowKV.

| Batch × Context | GEMM + Softmax | Max | TopK | Recompute K (Overlapped) | Fetch V | Attention | FFN | QKV |
|---|---|---|---|---|---|---|---|---|
| 48 × 64K | 0.56 | 0.07 | 0.14 | 1.25 | 1.84 | 0.23 | 0.33 | 0.05 |
| 24 × 128K | 0.58 | 0.07 | 0.15 | 1.36 | 1.66 | 0.21 | 0.29 | 0.05 |
| 12 × 256K | 0.65 | 0.07 | 0.16 | 1.49 | 1.75 | 0.19 | 0.25 | 0.05 |
| 6 × 512K | 0.71 | 0.07 | 0.17 | 1.51 | 1.69 | 0.18 | 0.23 | 0.05 |

ShadowKV overlaps the recomputation of the K cache with the retrieval of the V cache from the CPU. In contrast, InfiniGen overlaps the KV cache fetching time with other computations (GEMM, Softmax, Max, TopK, Attention, FFN, and QKV operations). However, as the sequence length increases, InfiniGen's ability to hide data-fetching costs diminishes. For instance, at the 12 × 256K setting, the cost of the other computations is 1.37 ms (0.65 + 0.07 + 0.16 + 0.19 + 0.25 + 0.05):

  • For ShadowKV, the latency is 1.75 + 1.37 = 3.12 ms, as the K cache recomputation is effectively overlapped.
  • For InfiniGen, the KV cache fetching time alone is already 1.75 × 2 = 3.5 ms (InfiniGen fetches both K and V, while ShadowKV only fetches V from the CPU). Thus, at longer sequence lengths, the other computations in InfiniGen cannot sufficiently hide the data-fetching cost.

Therefore, under the same sparse budget, ShadowKV not only outperforms InfiniGen in latency but also achieves significantly better accuracy on complex tasks, where InfiniGen suffers from a notable accuracy drop.

We hope this analysis addresses your concerns and provides a comprehensive understanding of ShadowKV's effectiveness. Thank you for your insightful feedback, and we remain open to any further questions or discussions.

Comment

Thank you for your prompt response.

I have some reservations about the claim that ShadowKV achieves a 7.08x memory savings when you compare it with InfiniGen. As I mentioned earlier, this figure seems specific to the particular setting used for its derivation (K=256, sequence length=128K). Meanwhile, for LongBench, you use K=32, which implies that a sequence length should be somewhere around 16K to achieve that level of GPU memory savings. However, as far as I know, the sequence lengths of LongBench are generally much shorter than that, which could result in lower GPU memory savings.

From this perspective, I feel the comparison with InfiniGen may not be fully fair. If I've misunderstood something, I'd appreciate any clarification.

By the way, I’m also having trouble understanding the statement: “InfiniGen’s 15% memory savings does not account for additional memory overheads (e.g., KV buffers).” Could you elaborate on what “KV buffers” refers to in this context?

Review (Rating: 6)

The increasing KV cache poses great challenges to long-context LLM inference. This work presents a long-context LLM inference system, SHADOWKV. It decreases the memory footprint by storing a low-rank key cache and offloading the value cache to the CPU, and reduces decoding latency by reconstructing the sparse KV pairs on-the-fly. Evaluations show that SHADOWKV supports up to 6× larger batch sizes and improves throughput by up to 3× on an A100 GPU while maintaining accuracy.

Strengths

  • This work presents a very interesting observation on the KV cache: pre-RoPE key cache is exceptionally low-rank compared to post-RoPE key cache, value cache and KV projection weights. Built on this observation, the proposed SHADOWKV significantly reduces the memory footprint of the Key cache.
  • The work also improves the previous sparse attention work including QUEST by introducing outlier KV cache.
  • This work also implements the inference system and shows actual throughput improvements on real-world A100 GPUs.

Weaknesses

  • The proposed method is complex, including low-rank K cache, CPU offloaded V cache, outlier KV cache, and dynamic sparse attention. However, the ablation study on each component is missing.
    • In terms of model accuracy:
      • it is unclear how much accuracy improvement an extra outlier KV cache will bring.
      • previous work Quest uses min-max values as the landmark cache, while ShadowKV adopts the mean as the landmark cache. It is unclear how much accuracy improvement this change will bring.
    • In terms of efficiency:
      • Authors only show a rough prefilling latency breakdown in Figure 1(c). It is unclear how long it takes to compute the outlier cache (i.e., reduce, cosine-similarity, top-k, gather) in the prefilling stage, how long it takes for KV cache chunk selection (i.e., MatMul, Softmax, Max, TopK) in the decoding stage, and how long it takes to recompute the K cache from the low-rank cache. These overheads seem to increase linearly with the context length. It would be better to see the efficiency breakdown of every part of the system under different context lengths (e.g., 128K, 256K, 512K).
      • it is unclear how SHADOWKV performs for extremely long context. For example, the authors evaluated on Llama-3-8B-1M but only with up to 128K context length.
      • this work lacks an efficiency comparison against the previous works Loki, Quest, and MInference.

Questions

My questions are listed in the weakness section.

Comment

Thank you for the detailed review and valuable feedback. We appreciate the reviewer's recognition of the novelty and effectiveness of our method. Below, we address the concerns regarding the missing ablation results. We hope you will consider raising your score in light of our response.

Q1: Accuracy Contribution of Outlier KV Cache and Mean Landmarks

Our additional experiments demonstrate that outliers play a critical role in capturing essential information, even in small numbers (0.049%), significantly enhancing the performance of mean-based landmarks and outperforming the min-max landmarks used in Quest.

We conduct experiments using different numbers of outlier chunks for Llama-3-8B-1M on the RULER benchmark with 128K context length (mentioned in Section 5.3, Page 10 and detailed in Appendix A.8, Page 19). As presented in the table below, our findings indicate that outliers play a crucial role. For instance, the first chunk, a significant outlier, has previously been shown to act as an attention sink [1], underscoring its importance in maintaining model accuracy.

| # Outliers | N-S1 | N-S2 | N-MK1 | N-MK2 | N-MQ | N-MV | QA-1 | QA-2 | VT | FWE | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 (0.000%) | 100.00 | 100.00 | 96.88 | 85.42 | 73.18 | 70.83 | 43.75 | 39.58 | 73.54 | 57.29 | 74.05 |
| 1 (0.006%) | 100.00 | 100.00 | 97.92 | 98.96 | 95.83 | 94.79 | 70.83 | 51.04 | 70.63 | 70.14 | 85.01 |
| 2 (0.012%) | 100.00 | 100.00 | 97.92 | 98.96 | 95.57 | 95.57 | 70.83 | 51.04 | 72.08 | 70.49 | 85.25 |
| 4 (0.024%) | 100.00 | 100.00 | 97.92 | 98.96 | 95.83 | 95.57 | 71.88 | 51.04 | 74.38 | 71.18 | 85.68 |
| 8 (0.049%) | 100.00 | 100.00 | 97.92 | 98.96 | 95.57 | 95.05 | 72.92 | 51.04 | 78.13 | 72.57 | 86.22 |
| 16 (0.098%) | 100.00 | 100.00 | 97.92 | 98.96 | 96.09 | 95.31 | 72.92 | 51.04 | 80.42 | 71.53 | 86.42 |
| 32 (0.195%) | 100.00 | 100.00 | 97.92 | 98.96 | 96.35 | 95.57 | 72.92 | 52.08 | 81.25 | 72.22 | 86.73 |
| 48 (0.293%) | 100.00 | 100.00 | 97.92 | 98.96 | 96.88 | 95.83 | 72.92 | 52.08 | 81.67 | 72.57 | 86.88 |
| Quest (Ref.) | 100.00 | 100.00 | 98.96 | 77.08 | 97.65 | 93.49 | 60.42 | 50.00 | 77.08 | 65.63 | 82.03 |
| Full Attn (Ref.) | 100.00 | 100.00 | 98.96 | 98.96 | 98.96 | 95.57 | 75.00 | 48.96 | 78.54 | 71.85 | 86.68 |

The results demonstrate that increasing the number of outlier chunks has a positive impact on accuracy, especially in complex tasks. This indicates that even a small number of outliers can effectively capture essential information, reducing the need for full attention. Remarkably, with just 8 outliers (0.049%), ShadowKV outperforms the Quest baseline and nearly matches the accuracy achieved by full attention.

However, when outliers are not adequately managed, the performance of the mean-based landmarks in ShadowKV may fall below the min-max approach used by Quest, underscoring the importance of handling outliers properly.
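To make the landmark/outlier construction above concrete, here is a small sketch (chunk size, counts, and the exact outlier criterion are our own illustrative assumptions, not the released code): each chunk's post-RoPE keys are averaged into a mean landmark, and the chunks whose keys agree least with their own landmark (lowest cosine similarity) are kept as outliers and exempted from the landmark approximation.

```python
import torch
import torch.nn.functional as F

n_chunks, chunk_size, head_dim, n_outliers = 1024, 8, 128, 8   # illustrative sizes (assumptions)
post_rope_keys = torch.randn(n_chunks * chunk_size, head_dim)

# Mean landmark per chunk (Quest instead keeps element-wise min/max per chunk).
chunks = post_rope_keys.view(n_chunks, chunk_size, head_dim)
landmarks = chunks.mean(dim=1)                                  # [n_chunks, head_dim]

# Outlier criterion (assumption): chunks whose tokens agree least with their own landmark.
cos = F.cosine_similarity(chunks, landmarks.unsqueeze(1), dim=-1)   # [n_chunks, chunk_size]
outlier_ids = (-cos.min(dim=-1).values).topk(n_outliers).indices

print("outlier chunks kept exactly (not approximated by landmarks):", outlier_ids.tolist())
```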

[1] Efficient Streaming Language Models with Attention Sinks

Comment

Q2: Detailed Efficiency Breakdown Across System Components

Thank you for the valuable suggestion. We present a detailed latency breakdown in the tables below to illustrate the efficiency of each operation under various context lengths for both the prefilling and decoding stages (mentioned in Section 5.2, Page 9 and detailed in Appendix A.6, Page 18).

  • Scalability for Longer Sequences. As shown in the table below, the overhead of the SVD, reduce, cosine similarity, topK, and gather computations is very low and tends to decrease as the sequence scales, demonstrating ShadowKV's scalability to longer sequences.

Latency breakdown (ms) of a Transformer block of Llama-3-8B-1M during prefilling:

| Context | Attention | FFN | SVD | Reduce | Cosine Similarity | TopK | Gather | Cost |
|---|---|---|---|---|---|---|---|---|
| 64K | 186.23 | 96.47 | 17.19 | 0.10 | 1.41 | 0.08 | 0.01 | 6.65% |
| 128K | 721.13 | 193.32 | 26.62 | 0.20 | 2.77 | 0.14 | 0.02 | 3.25% |
| 256K | 2880.21 | 392.77 | 50.56 | 0.42 | 6.11 | 0.11 | 0.03 | 1.75% |
| 512K | 11720.30 | 789.23 | 108.38 | 0.84 | 12.19 | 0.15 | 0.06 | 0.97% |
  • Overlapping Operations for Latency Reduction. In the table below, we demonstrate how overlapping the recomputation of the key cache with value cache fetching from the CPU significantly reduces decoding latency. This concurrent processing approach ensures that ShadowKV minimizes overhead when handling long-context models.

Latency breakdown (ms) of a Transformer block of Llama-3-8B-1M during decoding:

| Batch × Context | GEMM + Softmax | Max | TopK | Recompute K (Overlapped) | Fetch V | Attention | FFN | QKV |
|---|---|---|---|---|---|---|---|---|
| 48 × 64K | 0.56 | 0.07 | 0.14 | 1.25 | 1.84 | 0.23 | 0.33 | 0.05 |
| 24 × 128K | 0.58 | 0.07 | 0.15 | 1.36 | 1.66 | 0.21 | 0.29 | 0.05 |
| 12 × 256K | 0.65 | 0.07 | 0.16 | 1.49 | 1.75 | 0.19 | 0.25 | 0.05 |
| 6 × 512K | 0.71 | 0.07 | 0.17 | 1.51 | 1.69 | 0.18 | 0.23 | 0.05 |
Comment

Q3: Performance on Extremely Long Context Lengths

We clarify that our paper includes some experiments with extremely long contexts. We tested Llama-3-8B-1M on the NIAH dataset with up to 1M tokens (Figure 6, Page 8).

Here we present our additional results, showing that ShadowKV matches the performance of full attention while outperforming other sparse methods on 1M contexts with Llama-3-8B-1M and 512K contexts with Llama-3-70B-1M, as demonstrated on the RULER benchmark (detailed in Section 5.1, Page 7 and Appendix A.5, Page 17). Additionally, ShadowKV achieves perfect needle retrieval accuracy with Llama-3-70B-1M, evaluated across context lengths ranging from 16K to 1M tokens (Figure 10, Page 17).

| Methods | N-S1 | N-S2 | N-MK1 | N-MK2 | N-MQ | N-MV | QA-1 | QA-2 | VT | FWE | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-70B-1M | 100.00 | 82.29 | 90.63 | 54.17 | 85.16 | 96.61 | 69.79 | 35.42 | 68.75 | 69.44 | 75.23 |
| Loki | 100.00 | 1.04 | 0.00 | 0.00 | 0.00 | 0.00 | 13.54 | 11.46 | 34.30 | 22.92 | 18.33 |
| Loki (Vonly) | 100.00 | 15.63 | 26.04 | 0.00 | 0.00 | 0.00 | 25.00 | 19.79 | 40.00 | 31.94 | 25.84 |
| Quest | 100.00 | 76.04 | 78.13 | 35.42 | 85.47 | 92.19 | 53.21 | 34.38 | 38.33 | 58.33 | 65.15 |
| Quest (Vonly) | 100.00 | 77.08 | 79.17 | 36.49 | 86.19 | 95.31 | 54.17 | 36.58 | 47.70 | 58.68 | 67.14 |
| ShadowKV | 100.00 | 82.29 | 88.54 | 53.04 | 88.02 | 94.79 | 67.71 | 37.50 | 68.54 | 68.25 | 74.87 |
| Llama-3-8B-1M | 96.88 | 100.00 | 96.88 | 69.79 | 91.15 | 85.68 | 64.58 | 42.71 | 25.00 | 56.25 | 72.89 |
| Loki | 9.38 | 1.04 | 10.42 | 0.00 | 2.60 | 4.43 | 38.54 | 11.46 | 1.67 | 0.69 | 8.02 |
| Loki (Vonly) | 68.75 | 29.17 | 60.42 | 1.04 | 26.56 | 43.23 | 59.38 | 15.63 | 6.46 | 0.69 | 31.13 |
| Quest | 94.79 | 92.71 | 80.21 | 4.17 | 76.30 | 69.27 | 57.29 | 28.13 | 25.67 | 30.56 | 55.91 |
| Quest (Vonly) | 94.79 | 93.75 | 81.25 | 4.17 | 79.69 | 69.27 | 62.50 | 31.25 | 26.00 | 32.99 | 57.57 |
| ShadowKV | 96.88 | 100.00 | 96.88 | 65.63 | 89.38 | 83.16 | 69.79 | 42.71 | 26.04 | 59.38 | 72.98 |

These findings underline ShadowKV’s ability to handle large-scale inputs effectively, offering robust performance across increasing context lengths and model sizes. This scalability ensures its suitability for real-world applications that require extensive sequence handling and larger model capabilities.

Comment

Q4: Efficiency Comparison with Prior Works

Thank you for the suggestion. We first clarify why we use sparse budgets in the original draft to compare the efficiency and accuracy of ShadowKV against the baselines, followed by an explanation of MInference’s orthogonal relationship to our method.

  • Since Loki and Quest lack CPU offloading and optimized CUDA kernels in their open-source repositories, we ensured a fair comparison by standardizing the computational cost for KV selection and using the same sparse KV cache budget across methods. This budget represents the theoretical efficiency of each method, allowing for a fair comparison of accuracy on downstream tasks.
  • For MInference, we clarify that it is intended to accelerate prefilling rather than decoding, and we have demonstrated its compatibility with our approach in Table 3 in the paper.

In order to compare the efficiency of ShadowKV against Quest, we did our best to implement Quest with CPU offloading and present the results (Appendix A.7, Page 18), showing that ShadowKV achieves up to 4.85× faster performance than Quest.

| Batch × Context | Full Attention | Full Attention (CPU-offload) | Quest | Quest (CPU-offload) | ShadowKV |
|---|---|---|---|---|---|
| 3 × 1M | OOM | 0.21 tokens/s | OOM | 9.34 tokens/s | 45.32 tokens/s |

As shown in the table, ShadowKV significantly outperforms both Full Attention and Quest under the same sparse budget. The efficiency advantage of ShadowKV over Quest is due to two key factors:

  • ShadowKV only fetches the value cache from the CPU, rather than the entire KV pair, minimizing data transfer and reducing latency
  • ShadowKV integrates a cache mechanism that leverages the temporal locality of the KV cache.
Comment

Dear Reviewer 4R5o,

We sincerely appreciate the time and effort you have dedicated to reviewing our work and for sharing your insightful feedback. Your comments have been immensely valuable in improving the quality of our manuscript.

We have carefully addressed your concerns and revised the manuscript accordingly. As the discussion period is ending soon, we wanted to kindly ask if there are any further questions or points we could clarify to better address your feedback. If our responses have adequately addressed your concerns, we would be deeply grateful if you could consider reflecting this in your score.

Thank you again for your constructive review and for contributing your valuable expertise. We deeply appreciate your input and support throughout this process.

Comment

Thank you for your extensive experiments, which have addressed most of my concerns. I have increased my score accordingly.

Review (Rating: 8)

This paper presents a high-throughput LLM inference system for long sequence lengths. It proposes a compression method that leverages the low-rank property of the key cache and improves the sparse attention method through accurate KV selection.

Strengths

This paper is well-organized and clearly written. The proposed method is well-motivated, addressing relevant challenges, and is supported by thorough analysis. The evaluation is comprehensive and robust, effectively substantiating the claims and demonstrating thoughtful considerations. Overall, this is a strong submission for ICLR.

Weaknesses

  1. As shown in Fig. 8, some downstream tasks, such as 'Frequent Words Extraction,' perform significantly worse with sparse KV enabled. A brief analysis of why this approach underperforms for these types of tasks would be helpful, as well as any potential solutions to address these limitations.
  2. The proposed solution is currently evaluated on an 8B model with a 128K sequence length. It would strengthen the paper to include an analysis of whether this approach scales effectively for larger models, such as a 70B model with an extremely long sequence length of 1M.

Questions

As listed in the weakness part.

Comment

Thank you for the supportive comments and for recognizing the novelty of our method and the thoroughness of our evaluations. We hope our detailed clarifications and additional experimental results will address the concerns regarding our work.

Q1: Analysis and Solutions for Sparse KV Underperformance in Certain Tasks

For certain downstream tasks, attention distributions can exhibit a long tail effect, where a larger portion of tokens receive small yet non-negligible attention, making exact top-K selection less effective. As observed in Differential Transformer [1], this issue arises due to the nature of the softmax function [2], which tends to produce noise and disperse attention scores.

Since Top-K serves as the theoretical upper bound for sparse attention methods aiming to approximate it, poor performance of Top-K inherently limits the performance of sparse attention methods.

We include TopK results for ‘Frequent Words Extraction’ as a reference below. As shown in the table, sparse attention underperforms due to limitations in TopK’s performance.

| Method | Full (128K) | 4096 | 2048 | 1024 | 512 | 256 | 128 |
|---|---|---|---|---|---|---|---|
| TopK | 72.22 | 69.21 | 68.37 | 67.32 | 67.01 | 66.59 | 62.85 |
| ShadowKV | 72.22 | 68.06 | 68.06 | 68.40 | 66.32 | 65.28 | 59.72 |
| Quest | 72.22 | 66.67 | 65.97 | 64.58 | 62.50 | 61.81 | 50.69 |

A possible solution is to promote sparsity during the pre-training or SFT phases, co-designing it with the training process. By encouraging sparsity throughout training, the model may learn to manage long tail effects within the constraints of a sparse budget and reduce noise, allowing for a more reliable sparse attention mechanism across diverse downstream tasks.

[1] Differential Transformer

[2] softmax is not enough (for sharp out-of-distribution)

Comment

Q2: Scalability Analysis for Larger Models and Longer Sequence Lengths

We appreciate the suggestion to evaluate ShadowKV on larger models with extended sequence lengths, such as 1M. Our results show that ShadowKV matches the performance of full attention while outperforming other sparse methods on 1M contexts with Llama-3-8B-1M and 512K contexts with Llama-3-70B-1M, as demonstrated on the RULER benchmark (mentioned in Section 5.1, Page 7 and detailed in Appendix A.5, Page 17). Additionally, ShadowKV achieves perfect needle retrieval accuracy with Llama-3-70B-1M, evaluated across context lengths ranging from 16K to 1M tokens (Figure 10, Page 17).

| Methods | N-S1 | N-S2 | N-MK1 | N-MK2 | N-MQ | N-MV | QA-1 | QA-2 | VT | FWE | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-70B-1M | 100.00 | 82.29 | 90.63 | 54.17 | 85.16 | 96.61 | 69.79 | 35.42 | 68.75 | 69.44 | 75.23 |
| Loki | 100.00 | 1.04 | 0.00 | 0.00 | 0.00 | 0.00 | 13.54 | 11.46 | 34.30 | 22.92 | 18.33 |
| Loki (Vonly) | 100.00 | 15.63 | 26.04 | 0.00 | 0.00 | 0.00 | 25.00 | 19.79 | 40.00 | 31.94 | 25.84 |
| Quest | 100.00 | 76.04 | 78.13 | 35.42 | 85.47 | 92.19 | 53.21 | 34.38 | 38.33 | 58.33 | 65.15 |
| Quest (Vonly) | 100.00 | 77.08 | 79.17 | 36.49 | 86.19 | 95.31 | 54.17 | 36.58 | 47.70 | 58.68 | 67.14 |
| ShadowKV | 100.00 | 82.29 | 88.54 | 53.04 | 88.02 | 94.79 | 67.71 | 37.50 | 68.54 | 68.25 | 74.87 |
| Llama-3-8B-1M | 96.88 | 100.00 | 96.88 | 69.79 | 91.15 | 85.68 | 64.58 | 42.71 | 25.00 | 56.25 | 72.89 |
| Loki | 9.38 | 1.04 | 10.42 | 0.00 | 2.60 | 4.43 | 38.54 | 11.46 | 1.67 | 0.69 | 8.02 |
| Loki (Vonly) | 68.75 | 29.17 | 60.42 | 1.04 | 26.56 | 43.23 | 59.38 | 15.63 | 6.46 | 0.69 | 31.13 |
| Quest | 94.79 | 92.71 | 80.21 | 4.17 | 76.30 | 69.27 | 57.29 | 28.13 | 25.67 | 30.56 | 55.91 |
| Quest (Vonly) | 94.79 | 93.75 | 81.25 | 4.17 | 79.69 | 69.27 | 62.50 | 31.25 | 26.00 | 32.99 | 57.57 |
| ShadowKV | 96.88 | 100.00 | 96.88 | 65.63 | 89.38 | 83.16 | 69.79 | 42.71 | 26.04 | 59.38 | 72.98 |

These findings underline ShadowKV’s ability to handle large-scale inputs effectively, offering robust performance across increasing context lengths and model sizes. This scalability ensures its suitability for real-world applications that require extensive sequence handling and larger model capabilities.

Comment

Thank you for your responses. The rebuttal has clearly addressed my concerns about the underperformance on certain tasks and the scalability problem.

Comment

We thank reviewers [R1(sEj1), R2(4R5o), R3(5wSX), R4(iRxd)] for their thoughtful and highly supportive feedback! We were glad that the reviewers found the problem significant and interesting [R2, R4], the observations and analysis insightful and highly valuable [R1, R2, R3], the method novel and well-motivated [R1, R3, R4], the presentation well-organized and easy to follow [R1], and the experimental results strong and impressive [R1, R3, R4].

We have updated the paper to incorporate the constructive suggestions, reflected in the revised version. We use blue color for all newly added content. Here is a summary of the major changes:

  • [R1, R2, R3] Scalability to larger models, longer contexts, and longer outputs:
    • Added experiments for Llama-3-70B-1M on RULER with 512K contexts and Needle In A Haystack with up to 1M contexts, demonstrating ShadowKV's scalability with larger models (mentioned in Section 5.1, Page 7 and detailed in Appendix A.5, Page 17).
    • Added experiments for Llama-3-8B-1M on RULER with 1M contexts to show ShadowKV's ability to scale with longer sequences (mentioned in Section 5.1, Page 7 and detailed in Appendix A.5, Page 17).
    • Added experiments for handling newly generated tokens to validate ShadowKV's scalability for longer outputs (mentioned in Section 4.1, Page 6 and detailed in Appendix A.1, Page 15).
  • [R3] Comparison with InfiniGen
    • Added a discussion of the differences between InfiniGen and ShadowKV (Section 2, Page 3).
    • Included comparative experiments with InfiniGen, showing that ShadowKV outperforms InfiniGen on various downstream tasks. Notably, on the RULER benchmark, ShadowKV achieves an accuracy improvement of 10% over InfiniGen (Section 5.1, Page 7-8).
  • [R2, R4] Latency breakdown
    • Added a latency breakdown for the prefilling stage, demonstrating that ShadowKV introduces minimal overhead, which decreases as sequence length increases, affirming ShadowKV's scalability with longer sequences (mentioned in Section 5.2, Page 9 and detailed in Appendix A.6, Page 18).
    • Added a latency breakdown for the decoding stage, showing how overlapping the recomputation of the key cache with value cache fetching from the CPU effectively reduces decoding latency (mentioned in Section 5.2, Page 9 and detailed in Appendix A.6, Page 18).
  • [R4] Numerical stability
    • Added experiments to show that ShadowKV can retain its accuracy at lower precision, e.g., FP8 (mentioned in Section 5.3, Page 10 and detailed in Section A.4, Page 16).
  • [R2] Extra ablations
    • Added experiments demonstrating the effectiveness of outliers and mean landmarks (compared to Quest's min-max landmarks), highlighting the improved performance of our method (mentioned in Section 5.3, Page 10 and detailed in Appendix A.8, Page 19).
    • Included an efficiency comparison with Quest, showing that our method outperforms Quest in both accuracy and efficiency. Specifically, ShadowKV is up to 4.85× faster than Quest while also achieving superior accuracy (Appendix A.7, Page 18).
AC Meta-Review

After reviewing the entire discussion and considering input from all reviewers, I recommend the rejection of this paper. While the submission has merits, the unresolved concerns outweigh the strengths.

Strengths:

  • The paper presents a novel observation about efficiently compressing pre-RoPE Key caches, reducing CPU-GPU communication costs. This is a potentially useful insight that has practical implications for offloading methods.
  • The revised methodology demonstrates improved scalability for long sequences while maintaining accuracy.

Weaknesses:

  1. Incremental Contribution:
    The proposed work builds on existing methods of CPU offloading and low-rank approximation. However, the enhancements, while effective, appear to be incremental rather than groundbreaking.

  2. Presentation and Positioning:
    The paper struggles with positioning itself clearly within the existing literature. The unique contributions remain insufficiently distinguished from prior work, leaving ambiguity about its originality. This issue undermines the perceived value of the paper's contributions.

Given these mixed reviews, the lack of consensus among reviewers, and the remaining issues of novelty and presentation, I believe the paper does not meet the threshold for acceptance in its current form. While the work demonstrates promise, a more substantial contribution and clearer articulation of its impact are necessary to justify publication.

Additional Comments from the Reviewer Discussion

  • Reviewer sEj1 supported the paper, citing its strong experimental results and the practical implications of its key cache compression technique. However, even this reviewer acknowledged the need for better articulation of the contributions.
  • Reviewer iRxd expressed moderate support but did not strongly champion the paper and provided limited commentary on its novelty or positioning.
  • Reviewer 5wSX maintained a neutral stance, citing the incremental nature of the contributions and unclear positioning as significant concerns.
Final Decision

Reject