PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
We developed PyramidKV, a novel and effective KV cache compression method.
Abstract
Reviews and Discussion
In long-context models, the KV-cache consumes a large amount of GPU memory, and attending over a long context requires a great deal of compute. As a result, a number of research groups have recently focused on the task of KV-cache compression, and an obvious way to reduce the size of the cache is to prune it, by keeping only the most important tokens, rather than all tokens.
The authors make an empirical observation, which is that in pre-trained models, the number of "important tokens" differs from layer to layer. In particular, attention in lower layers seems to be widely dispersed, while attention in higher layers concentrates on only a few key tokens. Thus, an optimal pruning scheme can prune more tokens from higher layers, while retaining more tokens in lower layers.
The authors design and test a KV-cache pruning scheme following this principle. A block of tokens at the end of the prompt are designated as "instruction tokens". These tokens attend to the rest of the prompt, and in each layer, the attention scores for all heads are summed to get an "importance score" for each token. The most important tokens are then retained, following a fixed pyramid-shaped schedule which allocates more resources to the lower layers.
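A minimal sketch of this scheme as I understand it (the function names, the arithmetic budget schedule, and the head-summing details below are my own illustration, not the authors' code):

```python
import torch

def pyramid_budgets(avg_budget, num_layers, delta):
    # Arithmetic schedule: lower layers receive more KV slots, higher layers
    # fewer, while the average across layers stays at avg_budget.
    return [avg_budget + delta * ((num_layers - 1) / 2 - l) for l in range(num_layers)]

def select_tokens(attn, budget, num_instr):
    # attn: [num_heads, num_instr, seq_len] attention weights of the trailing
    # "instruction" tokens over the whole prompt, for a single layer.
    scores = attn.sum(dim=(0, 1))          # per-token importance, summed over heads
    scores[-num_instr:] = float("inf")     # always retain the instruction tokens
    keep = torch.topk(scores, k=int(budget)).indices
    return torch.sort(keep).values         # indices of KV entries to keep
```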
The authors test their technique on a number of long-context datasets using the llama and mistral pre-trained models, and show that it outperforms other pruning strategies on at least some datasets.
Reasons to Accept
Using decreasing KV-cache sizes in different layers is a compelling idea to me, because it somewhat resembles the convolution + pooling strategy that has long been used in convnets for vision, and has thus far been absent from language models. The authors' analysis of attention patterns in pre-trained models is especially valuable, and they provide detailed and convincing experimental results in both the paper and the appendix. This seems like it could be a worthwhile contribution to the field.
Update: I have raised my score following discussion.
Reasons to Reject
The biggest problem with this paper is that the authors do not adequately describe how to interpret their results.
In particular, Figure 3 and Table 1 give results in terms of "KV-cache size". What does "size" mean in this case? Is it the full KV-cache size (i.e. the total context length) before pruning, or is it the size of the KV-cache after pruning? If it is the total context length before pruning, then those sizes are laughably small -- 256 tokens is hardly any context at all. However, if it is the KV cache size after pruning, then what was the original (Full KV) context length? Moreover, PyramidKV allocates a different KV-cache size to each layer, so does the number refer to the average size per layer, or the total size summed over all layers? The authors claim that PyramidKV has been tested at different compression rates, but the actual compression rate in the result tables is not given, nor can it be calculated from the information that is given.
PyramidKV works best at small cache sizes, e.g. 64. However, no LLM uses a context length as small as 64; realistic KV-cache sizes even for small (2B) models will be at least 2K or higher. The authors should discuss this.
A second potential issue that needs to be discussed in more detail is the role of the instruction tokens. For Q/A tasks, do the instruction tokens include the query? Figure 1 indicates that the answer might be "yes". But if so, then that's cheating -- you are using the query to do full attention over the entire sequence, and then keeping only those tokens which have the correct answer, so of course accuracy will be high.
If the instruction tokens do not include the query, they still seem to play an outsized role in determining which tokens are "important"; if the instruction tokens are poorly chosen or irrelevant to the task, then the wrong tokens will be kept in the KV-cache. I am particularly concerned about the authors' claim that fewer instruction tokens yield more accurate results; I would expect more instruction tokens to yield a more accurate estimate of which tokens are important. This should be discussed.
I would also like to see a more detailed empirical analysis of attention patterns in pre-trained models. It would be nice to have a chart that shows a layer-by-layer breakdown of how sharply attention is focused on just a few tokens. For example, you could compute a number K for each layer, where K is the number of top-K tokens that need to be retained to keep the result of attention within certain bounds. Such a chart would be more objective than eyeballing hand-picked attention matrices.
Questions for the Authors
See above. My score of "reject" is based on the fact that I can't interpret the results. If those questions are resolved I will increase my score.
One further question: how does the idea of "most important tokens" combine with "most recent tokens / sliding window attention"? Intuitively, it is not surprising to me that attention in lower layers is less focused, but I would also expect it to be more local, which means that StreamingLLM sliding window attention could be employed at the lower levels, thus providing more token budget for the higher levels. A more comprehensive analysis of attention patterns would be valuable here (see above).
We thank the reviewer for the insightful suggestions and the opportunity to clarify a few misunderstandings.
Q1: Figure 3 and Table 1 give results in terms of "KV-cache size". What does "size" mean in this case? Is it the full KV-cache size (i.e., the total context length) before pruning, or is it the size of the KV-cache after pruning? If it is the total context length before pruning, then those sizes are laughably small -- 256 tokens is hardly any context at all. However, if it is the KV cache size after pruning, then what was the original (Full KV) context length?
A1: In Figure 3 and Table 1, “KV-cache size” means the average number of KV cache entries retained per layer and per head after pruning. For example, a cache size of 128 means that each attention head in each layer retains the keys and values of 128 tokens. Therefore, the total number of retained KV tokens across the entire model is 128 × #layers × #heads. The original (unpruned) context length corresponds to the model’s maximum sequence length: 32K tokens for Mistral models and 8K tokens for LLaMA-3 models. At inference time, the actual context length depends on the input sequence length, which can vary across datasets.
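For illustration, a rough back-of-the-envelope accounting of this definition; the architecture numbers below are assumptions for a LLaMA-3-8B-like model, not values from the paper:

```python
# Accounting for the definition above, assuming a LLaMA-3-8B-like shape
# (32 layers, 8 KV heads under GQA, head_dim 128, fp16 cache) -- these
# architecture numbers are illustrative assumptions.
cache_size = 128                     # reported "KV cache size" per layer and head
num_layers, num_kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2                  # fp16

retained_entries = cache_size * num_layers * num_kv_heads
retained_bytes = retained_entries * head_dim * bytes_per_value * 2   # keys + values
print(retained_entries, retained_bytes / 2**20, "MiB")               # 32768, 16.0 MiB
```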
Q2: Moreover, PyramidKV allocates a different KV-cache size to each layer, so does the number refer to the average size per layer, or the total size summed over all layers?
A2: The reported KV-cache size refers to the average number of tokens retained per layer and per head. While PyramidKV supports adaptive token allocation across layers, we report the average size for consistency and fair comparison with existing baselines such as SnapKV and H2O, which use uniform or non-adaptive budgets. This helps isolate the benefits of our compression strategy under comparable conditions.
Q3: The authors claim that PyramidKV has been tested at different compression rates, but the actual compression rate in the result tables is not given, nor can it be calculated from the information that is given.
A3: Thank you for pointing this out. In Figure 3 and Table 1, the compression rate is computed as the ratio of the retained KV cache size (e.g., 64, 96, 128, 256, or 2048) to the average input sequence length of each dataset, listed in the third row of Table 1 (e.g., 18,409 for NrtvQA, 11,214 for Musique, 10,614 for QMSum, and 11,141 for PCount). We agree that the current presentation lacks clarity. In the revised version, we will explicitly report the actual compression rates alongside the corresponding KV cache sizes in both Table 1 and Figure 3. Additionally, we will clearly describe how the compression rate is computed. In the meantime, please refer to the supplemental table below for the detailed compression rates used in our experiments.
| KV Cache Size \ Dataset | NrtvQA | Qasper | MF-en | HotpotQA | 2WikiMQA | Musique | GovReport | QMSum | MultiNews | TREC | TriviaQA | SAMSum | PCount | PRe | Lcc | RB-P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Avg. Length | 18409 | 3619 | 4559 | 9151 | 4887 | 11214 | 8734 | 10614 | 2113 | 5177 | 8209 | 6258 | 11141 | 9289 | 1235 | 4206 |
| 64 | 0.00347656 | 0.01768444 | 0.01403817 | 0.00699377 | 0.01309597 | 0.00570715 | 0.00732768 | 0.00602977 | 0.03028869 | 0.01236237 | 0.00779632 | 0.01022691 | 0.00574455 | 0.00688987 | 0.05182186 | 0.01521636 |
| 96 | 0.00521484 | 0.02652666 | 0.02105725 | 0.01049066 | 0.01964395 | 0.00856073 | 0.01099153 | 0.00904466 | 0.04543303 | 0.01854356 | 0.01169448 | 0.01534036 | 0.00861682 | 0.0103348 | 0.07773279 | 0.02282454 |
| 128 | 0.00695312 | 0.03536889 | 0.02807633 | 0.01398754 | 0.02619194 | 0.0114143 | 0.01465537 | 0.01205954 | 0.06057738 | 0.02472474 | 0.01559264 | 0.02045382 | 0.01148909 | 0.01377974 | 0.10364372 | 0.03043272 |
| 256 | 0.01390624 | 0.07073777 | 0.05615267 | 0.02797508 | 0.05238388 | 0.02282861 | 0.02931074 | 0.02411909 | 0.12115476 | 0.04944949 | 0.03118528 | 0.04090764 | 0.02297819 | 0.02755948 | 0.20728745 | 0.06086543 |
| 2048 | 0.11124993 | 0.56590218 | 0.44922132 | 0.22380068 | 0.419071 | 0.18262886 | 0.23448592 | 0.1929527 | 0.96923805 | 0.3955959 | 0.24948228 | 0.32726111 | 0.18382551 | 0.22047583 | 1.6582996 | 0.48692344 |
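For reference, a minimal sketch of how the entries above are computed (the dictionary below lists only a few datasets for brevity):

```python
# Compression rate = retained KV cache size per layer/head divided by the
# dataset's average input length (third row of Table 1).
avg_length = {"NrtvQA": 18409, "Musique": 11214, "QMSum": 10614, "PCount": 11141}

def compression_rate(cache_size, dataset):
    return cache_size / avg_length[dataset]

print(compression_rate(64, "NrtvQA"))   # ~0.0035, matching the first row
```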
Q4: PyramidKV works best at small cache sizes, e.g. 64. However, no LLM uses a context length as small as 64; realistic KV-cache sizes even for small (2B) models will be at least 2K or higher. The authors should discuss this.
A4: We appreciate the reviewer’s concern and would like to clarify that the KV cache size we refer to (e.g., 64, 96, 128, 256) denotes the number of retained key-value pairs after compression, not the model’s context length. The input context length remains unchanged in our experiments and matches the maximum sequence length supported by the model (e.g., 32K for Mistral, 8K for LLaMA-3). PyramidKV operates after full attention over the entire input, aiming to reduce the KV cache used in the decoding stage. This compression is especially relevant for long-context inference, where storing all key-value pairs becomes a bottleneck.
Q5: A second potential issue that needs to be discussed in more detail is the role of the instruction tokens. For Q/A tasks, do the instruction tokens include the query? Figure 1 indicates that the answer might be "yes". But if so, then that's cheating -- you are using the query to do full attention over the entire sequence, and then keeping only those tokens which have the correct answer, so of course accuracy will be high.
A5: Thank you for raising this important concern. In our approach, the term “instruction tokens” refers to the final tokens in the input sequence, which are retained in the KV cache across all layers. Our prompt format typically consists of a system message, a long input document, and a query; as a result, the instruction tokens may include the tail portion of the query. In our experiments, we fix the number of instruction tokens across all datasets, and this window is usually much shorter than the full query length. These instruction tokens are used to compute attention scores over the full input sequence, guiding the selection of tokens to retain in the KV cache. This design is in line with prior work, such as SnapKV [1], which uses a similar mechanism often referred to as a “local window.” It’s important to note that this mechanism does not allow us to "cheat" by selectively preserving only tokens that contain the correct answer—there is no guarantee that such tokens are retained. Instead, the goal is to preserve tokens that are deemed important by the model’s attention mechanism. Rather than undermining the integrity of the evaluation, this strategy embodies the core motivation of KV cache compression: leveraging attention signals to identify and retain a small subset of key tokens that approximate the performance of full-context inference.
Q6: If the instruction tokens do not include the query, they still seem to play an outsized role in determining which tokens are "important"; if the instruction tokens are poorly chosen or irrelevant to the task, then the wrong tokens will be kept in the KV-cache. I am particularly concerned about the author's claim that fewer instruction tokens yield more accurate results; I would expect more instruction tokens to yield a more accurate estimate of which tokens are important. This should be discussed.
A6: We agree with the reviewer that the selection of instruction tokens significantly affects which tokens are retained. To keep the experimental setup consistent with existing baselines such as SnapKV [1], we currently follow their practices to use a simple heuristic: selecting the last few tokens as instruction tokens. We acknowledge that more principled selection strategies could improve performance and represent a valuable direction for future work. Regarding the claim that fewer instruction tokens yield better results: this arises because we always retain the instruction tokens in the final KV cache. Increasing the number of instruction tokens reduces the number of remaining slots for important tokens selected via attention, thus limiting compression flexibility and often degrading performance. This is supported by our empirical finding that using 8 or 16 instruction tokens outperforms longer configurations. That said, we agree that more instruction tokens could lead to a more accurate estimation of token importance. To balance both concerns, we are considering separating the roles of instruction tokens: one set for computing attention, and another (potentially smaller) set for forced retention. We believe this refinement could improve overall performance.
Q7: I would also like to see a more detailed empirical analysis of attention patterns in pre-trained models. It would be nice to have a chart that shows a layer-by-layer breakdown of how sharply attention is focused on just a few tokens. For example, you could compute a number K for each layer, where K is the number of top-K tokens that need to be retained to keep the result of attention within certain bounds. Such a chart would be more objective than eyeballing hand-picked attention matrices.
A7: We appreciate this suggestion and agree that a quantitative, layer-wise analysis of attention patterns would provide valuable insights. In the revised version, we will include such an analysis by computing, for each layer, the number of top-K tokens required to approximate full attention within a fixed threshold. This approach offers a more objective alternative to visual inspection and supports a deeper understanding of token importance across layers.
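As one possible operationalization of this analysis (a sketch only; the 95% coverage threshold and the averaging over heads and query positions are illustrative choices, not settings from the paper):

```python
import torch

def min_topk_per_layer(attn_per_layer, coverage=0.95):
    # attn_per_layer: one [num_heads, num_queries, seq_len] attention map per
    # layer. For each layer, return the smallest K such that the top-K keys
    # capture `coverage` of the attention mass, averaged over heads/queries.
    ks = []
    for attn in attn_per_layer:
        probs = attn.mean(dim=(0, 1))
        probs = probs / probs.sum()
        cum_mass = probs.sort(descending=True).values.cumsum(0)
        k = int((cum_mass < coverage).sum().item()) + 1
        ks.append(k)
    return ks
```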
Q8: One further question: how does the idea of "most important tokens" combine with "most recent tokens / sliding window attention"? Intuitively, it is not surprising to me that attention in lower layers is less focused, but I would also expect it to be more local, which means that StreamingLLM sliding window attention could be employed at the lower levels, thus providing more token budget for the higher levels. A more comprehensive analysis of attention patterns would be valuable here (see above).
A8: Thank you for the insightful question. We agree that combining the notion of “most important tokens” with the locality assumption underlying sliding window attention is a promising direction. In principle, if empirical attention analysis consistently showed that lower-layer attention is indeed more local, then applying StreamingLLM-style sliding window attention to those layers could be beneficial. However, our current attention analyses do not strongly support the hypothesis that lower-layer attention is predominantly local. In contrast, PyramidKV achieves performance gains by allocating more KV cache budget to lower layers, suggesting that these layers still attend to a broader range of tokens than might be expected under a purely local assumption. This design aligns with the Pyramidal Information Funneling perspective proposed in our work. We agree that a more comprehensive analysis of attention distributions across layers and positions would further clarify this distinction. We will include such analysis in the revised version to better motivate the design choices behind PyramidKV.
[1] SnapKV: LLM Knows What You are Looking for Before Generation https://arxiv.org/abs/2404.14469
It’s important to note that this mechanism does not allow us to "cheat" by selectively preserving only tokens that contain the correct answer—there is no guarantee that such tokens are retained.
Let me explain what I meant here. Assuming the following:
- Context: wikipedia article on the Eiffel Tower.
- Query: "What is the height of the Eiffel Tower?" (last 8 tokens in bold)
- Answer: 330m
Here, the query will attend to precisely those tokens that contain the height, thus allowing the model to answer the question correctly. Perhaps calling this "cheating" is too strong a word -- after all, if you can reduce the KV-cache to only the required tokens, then the model is operating as intended, not "cheating". I called it "cheating" because this mechanism will prune the cache in such a way that the model still performs well on Q/A benchmarks, but Q/A results may not actually be representative of overall long-context performance. E.g.
- Context: wikipedia article on the Eiffel Tower
- Query: "Please write a lengthy report on the history of the Eiffel Tower, its importance to French culture, and its importance internationally in influencing the way in which other countries perceive France. Don't forget to mention the height of the Eiffel Tower." (last 8 tokens in bold)
- Answer: ...
Here, the model will still lookup the height, but it will not have enough context to answer the query with only 128 retained tokens.
PyramidKV does not reduce prefill costs; it only reduces generation costs. The generation cost for a 4-token answer like "330m" is trivial; the only reason why PyramidKV might be useful is when generating a lengthy response, which is precisely the area in which the technique is likely to fail.
Thank you for the detailed answers to my questions regarding the meaning of "KV-cache size" and sequence lengths. Will you be updating the text of the paper to include this information?
Thank you for your response. Yes, we will update the text of the paper to include this information and the clarifications we provided during the rebuttal. The detailed illustration of the meaning of "KV-cache size" and sequence lengths will be included. We agree that these details are essential for readers to fully understand and reproduce our method. Your comments have been invaluable in improving the clarity and presentation of our work, and we appreciate the opportunity to address them.
Also, be sure to label the 3rd row of Table 1 as "average sequence length." :-)
We appreciate the reviewer’s insightful comments.
Flexibility of Budget and Last Tokens
Our method allows flexible adjustment of both the number of last tokens and the retained KV cache budget to suit different tasks. For instance, in tasks that require extensive long-context reasoning, the retained KV cache budget can be increased. Similarly, for tasks with longer queries, more last tokens can be used to ensure sufficient context coverage.
Long-Context Generation Capability
Regarding the provided example where the method may fail, we agree that such failure is possible when the number of last tokens is set too low. However, this reflects a configuration issue rather than a fundamental limitation of the method. As demonstrated in our summarization experiments (i.e., GovReport generation), our method is capable of handling long-context generation effectively. Specifically, we could increase the KV budget in cases where 128 tokens are insufficient, and increase the number of last tokens to allow full query coverage.
Use of Attention from Alpha Tokens
We would like to clarify that using attention scores from the most recent tokens to determine which keys and values to retain is not the main contribution of our work. This strategy is commonly adopted in KV cache compression approaches such as SnapKV [1] and Ada-KV [2].
Future Work – Adaptive Parameter Selection
We agree that designing methods to adaptively select the KV cache budget and alpha based on task characteristics is a promising direction. We thank the reviewer for highlighting this point, and we believe it could inspire valuable future research in long-context model inference. We will add and discuss this future direction in the future work section of our paper.
We will update the paper and present this information clearly in the revised version.
[1] SnapKV: LLM Knows What You are Looking for Before Generation https://arxiv.org/abs/2404.14469
[2] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference https://arxiv.org/abs/2407.11550
I understand that the model can be reconfigured. I am still concerned that (1) as an optimization technique, PyramidKV is mainly useful for long-form generation, but (2) Q/A tasks in particular are not a good way of testing that, because PyramidKV may give artificially high results on Q/A benchmarks. To be fair, this criticism applies not only to PyramidKV, but also to SnapKV and related techniques.
That being said, the work on understanding attention patterns is still valuable, and LongBench includes some benchmarks that are not Q/A. My main reason for recommending rejection was the confusing presentation, which you have promised to fix. I am raising my score.
Thank you for your response. We are happy that our response addresses your concern. We will label the 3rd row of Table 1 as "average sequence length". We will fix the presentation issues in the revised version.
Overview
This paper proposes a cute technique for dynamic KV cache quantization. Despite the method being very empirical, the paper does a good job of justifying it through careful analysis of attention patterns as well as strong empirical results. The main paper itself is a complete product, but extensive additional content is included in appendices. The appendices could do with some cleaning up, but this does not detract from the paper.
Notes
Introduction, RW, and abstract: Good analysis and motivation of the problem. My only quibble is that the diagram is slightly misleading. The most recent tokens are always included in the forward pass. I do understand that depicting them would make the image less pyramid-shaped.
Attention Analysis: Insightful, concise, clear, and well supported. Figure 2 is stellar. Strongly motivates the design.
Methods: I think 4.1 and 4.2.1 are clear. 4.2.2 I found a bit confusing. How often do we reset the KV cache? Does every forward pass involve every head discarding one token, computed from the existing long term cache and the most recent token to leave the instruction cache? Better clarification or a visual would help.
Experiments:
- Models: Llama3 and Mistral are fine models for comparison. I would like to see mistral replaced by slightly more recent models if possible.
- Datasets: I think LongBench is the appropriate benchmark for this kind of work.
- Baselines: I think these choices of baselines are satisfactory.
- Hyperparameters: I was happy to see the extensive sweeps of KV cache size and resulting accuracy. In addition, I think that the results themselves are strong. I think Figure 4 could use some formatting tweaks to become a vectorized graphic.
Reasons to Accept
Good analysis of important problems at the frontier of LLMs, a clever method to solve them, and rigorous experiments. Extensive appendices and good visualizations to boot.
Reasons to Reject
KV Cache compression is a massive field and not my primary area of research. The existing baselines seem fine to me if a bit out of date.
Questions for the Authors
In 4.2.2, how often do we reset the KV cache? Does every forward pass involve every head discarding one token, computed from the existing long term cache and the most recent token to leave the instruction cache?
We thank the reviewer for their thoughtful feedback and appreciation of our work. We address the reviewer’s questions below.
Q1: KV Cache compression is a massive field and not my primary area of research. The existing baselines seem fine to me if a bit out of date.
A1: We appreciate the reviewer’s comment. In the revised version of the paper, we will include more recent and relevant baselines in the experiments and discussions to better reflect the current state of the field.
Q2: In 4.2.2, how often do we reset the KV cache? Does every forward pass involve every head discarding one token, computed from the existing long term cache and the most recent token to leave the instruction cache?
A2: Thank you for this insightful question. In our current setting, we focus on scenarios with long input sequences and short output sequences. Accordingly, we reset the KV cache only after the prefill phase. As a result, during the first forward pass, each attention head may discard tokens based on attention scores computed from the full input sequence and the most recent token. While our implementation resets the cache only after the prefill, it is indeed possible to update the cache more frequently—for instance, by discarding one token per forward pass or discarding multiple tokens after a set number of steps. Discarding multiple tokens after a fixed number of forward passes tends to be more efficient, as it reduces the overhead associated with frequent token selection and eviction.
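As an illustrative sketch only (not our released implementation), such a batched eviction policy could look as follows; the cache API used here is hypothetical:

```python
# Evict in batches every `evict_every` decoding steps instead of once after
# prefill. The `kv_cache.size()` / `kv_cache.evict(n)` calls are hypothetical,
# standing in for "count retained entries" and "drop the n lowest-scoring entries".
def maybe_evict(kv_cache, step, budget, evict_every=64, evict_chunk=64):
    if step % evict_every == 0 and kv_cache.size() > budget:
        n = min(evict_chunk, kv_cache.size() - budget)
        kv_cache.evict(n)
```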
Thanks for the answers to the questions. I think the paper is sufficient as is, but would like to see exploration of eviction policies for longer sequences (e.g. for reasoning queries) in future work.
Thank you so much for your response. We will highlight the exploration of eviction policies for longer sequences in the future work section in the revised version of our work.
This paper introduces PyramidKV, a novel Key-Value (KV) cache compression method for Large Language Models (LLMs) designed to improve efficiency during long context processing. The core idea is based on the observation of a "Pyramidal Information Funneling" pattern in LLMs. This pattern shows that attention is widely scattered in the lower layers, progressively consolidates in the middle layers, and finally focuses on a few critical tokens (massive activation or attention sinks) in the higher layers. Leveraging this insight, PyramidKV dynamically adjusts the KV cache size across different transformer layers. It allocates a larger cache budget to the lower layers (where information is dispersed) and progressively smaller budgets to the higher layers (where information is concentrated). The authors conduct extensive experiments using the LongBench benchmark with LLaMa-3-8B-Instruct, LLaMa-3-70B-Instruct, and Mistral-7B-Instruct models. The results demonstrate that PyramidKV can match the performance of models using a full KV cache while retaining only a small fraction (e.g., 12%) of the cache, thereby significantly reducing memory usage.
Reasons to Accept
- The "Pyramidal Information Funneling" observation offers a clear and intuitive motivation for the proposed dynamic KV cache allocation strategy.
- The proposed idea is novel and reasonable.
- This paper has conducted a thorough evaluation using the LongBench benchmark, covering 17 datasets across various tasks, three different LLM architectures (LLaMa-3 8B & 70B, Mistral-7B), and multiple KV cache size settings.
Reasons to Reject
See questions below
Questions for the Authors
- While you've shown the pyramidal pattern in LLaMa, Mistral, and Mixtral, how do you foresee PyramidKV performing on models with significantly different layer counts or attention mechanisms (e.g., models with recurrent layers or different types of sparse attention)? Would the arithmetic sequence parameters need significant retuning?
We thank the reviewer for recognizing our work, and we address the question below.
Q1: While you've shown the pyramidal pattern in LLaMa, Mistral, and Mixtral, how do you foresee PyramidKV performing on models with significantly different layer counts or attention mechanisms (e.g., models with recurrent layers or different types of sparse attention)? Would the arithmetic sequence parameters need significant retuning?
A1: PyramidKV performs better on larger models with more layers. For instance, PyramidKV performs better on LLaMA-3-70B with 80 layers than on LLaMA-3-8B with 32 layers. For models with sparse attention, we anticipate that PyramidKV will still perform reasonably well without significant retuning. This is because sparse attention mechanisms still rely on self-attention transformers, and the increasing sparsity patterns of self-attention from lower to higher layers are widely observed [1,2,3]. Thus, PyramidKV may remain effective without significant changes. However, for models with recurrent layers, it is hard to foresee PyramidKV’s performance, which deserves further investigation in future research.
[1] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows https://arxiv.org/abs/2103.14030
[2] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification https://arxiv.org/abs/2106.02034
[3] Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals https://arxiv.org/abs/2503.06473
Dear reviewers,
We sincerely appreciate all reviewers' insightful feedback and recognition of our contributions. We are particularly grateful for your acknowledgment of the novelty and thoroughness of our work.
We have summarized key points from each reviewer and addressed concerns with significant clarifications and revisions as outlined below:
Strengths recognized by reviewers
- Novelty and Motivation
  - Reviewer DCU1 acknowledges the "Pyramidal Information Funneling" observation as a clear, intuitive motivation and highlights the novelty and reasonability of our dynamic KV cache allocation strategy.
  - Reviewer dYNL appreciates our careful analysis of attention patterns, robust empirical support, and identifies our method as a clever and justified approach.
  - Reviewer KRVU considers the layer-wise decreasing KV-cache sizes compelling, likening it to the convolution and pooling strategy in CNNs, and values our detailed attention pattern analysis.
- Comprehensive and Rigorous Evaluation
  - Reviewer DCU1 praises the thorough evaluation across multiple architectures (LLaMa-3 8B & 70B, Mistral-7B) using extensive datasets from LongBench.
  - Reviewer dYNL highlights the extensive hyperparameter sweeps and rigorous benchmarking, deeming our experimental validation robust and convincing.
  - Reviewer KRVU appreciates our detailed and convincing experimental results.
Key clarifications
- Clarification on KV-cache Size Definition (Reviewer KRVU): We will clearly define KV-cache size as the average number of tokens retained per layer and per head post-pruning, rather than context length, with detailed explanations provided.
- Instruction Tokens Usage (Reviewer KRVU): We will explicitly clarify that instruction tokens represent the tail end of input sequences, including partial queries. We agree that improving strategies for selecting instruction tokens is a valuable future research direction.
- Representation and Compression Rates (Reviewer KRVU): We will explicitly include detailed compression rate calculations and clarifications in our revised manuscript to resolve the ambiguities.
- Compatibility with Diverse Model Architectures (Reviewer DCU1): We will discuss anticipated performance across models with varying layer counts and sparse attention mechanisms. We will acknowledge uncertainty regarding recurrent layers as an interesting future research avenue.
- KV Cache Reset Frequency Clarification (Reviewer dYNL): We will explicitly clarify the cache reset frequency in the post-prefill phase, highlighting possibilities for more frequent cache updates and their trade-offs.
- Attention Pattern Analysis Enhancements (Reviewer KRVU): We will include a quantitative layer-by-layer analysis of attention patterns, supporting deeper insights into token importance across layers.
- Combination with Sliding Window Attention (Reviewer KRVU): We will discuss the relationship and distinctions between PyramidKV and sliding window attention methods. We will provide an analysis of attention distributions across layers and positions to clarify the distinction.
Manuscript Revisions Summary
- Enhance presentation clarity and add detailed clarifications on experimental settings and definitions (Reviewers KRVU, dYNL).
- Add attention pattern analysis across layers and positions (Reviewer KRVU).
- Add discussion on adaptive parameter selection and integration as a future direction (Reviewer KRVU).
- Add discussion on model architecture compatibility (Reviewer DCU1).
We believe these revisions and clarifications significantly enhance the clarity and robustness of our manuscript and sincerely thank the reviewers for their constructive and detailed feedback.
Best,
Authors
This paper introduces PyramidKV, a KV cache compression method based on the observation of "Pyramidal Information Funneling", where attention patterns become increasingly focused from lower to higher layers. The method dynamically allocates larger cache budgets to lower layers and smaller budgets to higher layers. Reviewers had initial concerns about presentation, which the authors addressed in the discussion phase. Reviewers were enthusiastic about the novelty of this work; experimental evaluation was exhaustive, covering two model families across 17 datasets from LongBench. This method achieves some strong results, matching full KV cache performance while only retaining ~12% of the cache.