PaperHub
Average rating: 5.5/10 · Poster · 4 reviewers (scores 2, 4, 3, 3; min 2, max 4, std dev 0.7)
ICML 2025

EPIC: Efficient Position-Independent Caching for Serving Large Language Models

OpenReview | PDF
Submitted: 2025-01-18 · Updated: 2025-07-24

Abstract

Keywords
Machine Learning System, Context Caching, KV Cache, Sparsity

Reviews and Discussion

Review (Rating: 2)

Existing work accelerates LLM workloads by reusing the KV cache of a text chunk when it is the prefix of the request. Prior work breaks this "prefix" limitation by dynamically finding and recomputing the attention for a subset of tokens. This work further accelerates that approach by performing recomputation on only a small subset of tokens. Experimental results show that this work can reduce TTFT by up to 8x.

Questions for the Authors

Thanks for submitting to ICML. The problem that this paper tackles is definitely important and the proposed approach is pretty effective.

My major confusion is about the attention sink assumption. It would be really helpful to see more empirical evidence or rationale. This assumption sounds to me as if the cross-attention is mainly contributed by the first token in the text chunk, which seems to imply that other tokens don't really matter in terms of understanding the meaning of the text chunk.

Several questions regarding the technical details: how do you handle positional encoding, and how do you handle sliding window layers?

Regarding the evaluation: I would like to see an ablation study --- does your improvement mainly come from better token selection, or from the static token selection scheme?

A side question: will your solution remain effective when the text chunk size is large (say, tens of thousands of tokens, e.g., a research paper)?

Claims and Evidence

The key assumption of this paper is the attention sink, meaning that the first token in each text chunk will "absorb" most of the attention. This assumption is not well supported with empirical evidence or rationale (since this is the key assumption and it is not very well known, I would expect some empirical evidence showing that it holds).

Methods and Evaluation Criteria

The datasets (5 datasets from LongBench) and evaluation metrics (TTFT, Accuracy) are pretty standard and make sense.

Theoretical Claims

There is no formal proof involved in the paper.

Experimental Design and Analysis

The evaluation involves 3 models (Mistral, Llama 3.1, and Yi). I am not familiar with Yi, but Mistral and Llama 3.1 both contain sliding window layers. I am not sure how the authors handle sliding window layers.

I am not sure whether the proposed method works well when the chunk size is large. But since 512 is a typical chunk size in RAG applications, I would say evaluating this is just a nice-to-have.

Supplementary Material

Yes. The runtime breakdown makes sense because CacheBlend does full recomputation on the first layer.

Relation to Prior Work

This paper can be hugely beneficial to RAG applications, which are key to many modern LLM applications (e.g., search result summarization, code generation), and it will benefit a broad audience.

Missing Important References

Necessary references are included.

Other Strengths and Weaknesses

Strengths:

  • Super fast and easy to implement in real production.

Weaknesses:

  • Cannot smoothly trade off between full recomputation and no recomputation given a fixed chunk size.
  • The key assumptions (attention sink) need more empirical backup and more rationales.

Other Comments or Suggestions

No.

Author Response

Thank you for your valuable feedback. We address the questions and confusion below.

Key assumptions — attention sink.

The attention sink phenomenon is well studied in attention sparsity research. For example, StreamingLLM [1] and DuoAttention [3] observe that only a few useful tokens receive meaningful attention, while the remaining attention mass (the scores sum to 1.0) is absorbed by the first few tokens (called sink tokens). If the sink tokens are numerous, they overshadow meaningful ones. In addition, MInference [4] describes this phenomenon as an A-shape pattern, while more recently LightTransformer identifies lazy layers dominated by attention sink heads.

A key clarification is that too many sink tokens are actually harmful, rather than representing important information for each chunk.
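
As an illustration of how this phenomenon is typically measured (a minimal sketch added for this discussion, not the paper's experiments; the model name and text below are placeholders), one can average the attention each position receives and check the share absorbed by the first token:

```python
# Minimal sketch (added for illustration): measure how much attention mass the
# first token absorbs. The model name and text are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")

text = "Some chunk of a retrieved document ..."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one [batch, heads, q_len, k_len] tensor per layer.
# Average the attention each key position receives over layers, heads, and queries.
received = torch.stack(out.attentions).mean(dim=(0, 2, 3)).squeeze(0)   # -> [k_len]
print("share absorbed by token 0:", (received[0] / received.sum()).item())
```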

Relevant related work is provided below to clarify the phenomenon of attention sink further. We hope this addresses your concern.

[1] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, "Efficient streaming language models with attention sinks," in The Twelfth International Conference on Learning Representations (ICLR), 2024.

[2] J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han, "QUEST: Query-aware sparsity for efficient long-context LLM inference," in the Forty-first International Conference on Machine Learning (ICML), 2024.

[3] G. Xiao, J. Tang, J. Zuo, J. Guo, S. Yang, H. Tang, Y. Fu, and S. Han, "DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads," in The Thirteenth International Conference on Learning Representations (ICLR), 2025.

[4] H. Jiang, Y. Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. Abdi, D. Li, C. Lin, Y. Yang, and L. Qiu, "MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention," in Advances in Neural Information Processing Systems 38 (NeurIPS), 2024.

[5] R. Chen, Z. Wang, B. Cao, T. Wu, S. Zheng, X. Li, X. Wei, S. Yan, M. Li, and Y. Liang, "ArkVale: Efficient generative LLM inference with recallable key-value eviction," in Advances in Neural Information Processing Systems 38 (NeurIPS), 2024.

Positional embedding (PE)

PICI prefills each document, whose first token is given position 0, to generate its KV cache. PICI then uses the cache directly without modifying the PE. In contrast, new queries receive position IDs based on their real positions.

Again, too many tokens with PE 0 at the beginning of each chunk absorb excess attention without contributing useful information.

We will add this detail in future versions.
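
To make the convention concrete, here is a minimal sketch of the position-ID assignment described above (the helper names are illustrative, not PICI's API):

```python
# Minimal sketch of the position-ID convention described above: every cached chunk
# is prefilled as if it started at position 0, while the live query gets its real
# positions after the concatenated chunks. Names are illustrative, not PICI's API.
from typing import List

def cached_chunk_positions(chunk_len: int) -> List[int]:
    # The KV cache for a chunk is generated with positions 0..chunk_len-1 and reused as-is.
    return list(range(chunk_len))

def query_positions(chunk_lens: List[int], query_len: int) -> List[int]:
    # New query tokens receive position IDs based on where they really sit.
    offset = sum(chunk_lens)
    return list(range(offset, offset + query_len))

print(cached_chunk_positions(4))      # [0, 1, 2, 3] -- identical for every cached chunk
print(query_positions([4, 4], 3))     # [8, 9, 10]   -- the query's real positions
```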

Sliding window

We disabled the optional sliding window for the Mistral model.

Large chunk size

Currently, we chunk large documents into mid-sized segments (e.g., 512 tokens). We will add an analysis of the trade-off between chunking and processing long documents as a whole in future versions.

Source of improvement: from better token selection, or static token selection?

LegoLink’s improvement comes from static token selection: the first few tokens of each chunk (except the first chunk) are preselected for recomputation, offline before execution.

The recomputation gives these "first few tokens" their actual position IDs, so they no longer behave as the first tokens of a sequence (they now sit in the middle of the context), which stops them from acting as sink tokens and absorbing too much attention.
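
A minimal sketch of this static selection (the value of k and the function name are illustrative placeholders; in the real system these indices are chosen offline):

```python
# Minimal sketch of static token selection: for every chunk except the first, mark
# its first k tokens (at their true global positions) for recomputation.
from typing import List

def static_recompute_indices(chunk_lens: List[int], k: int) -> List[int]:
    indices, offset = [], 0
    for i, length in enumerate(chunk_lens):
        if i > 0:                     # the first chunk's cache is reused untouched
            indices.extend(range(offset, offset + min(k, length)))
        offset += length
    return indices

# Three 512-token chunks, recomputing the first 16 tokens of chunks 2 and 3:
print(static_recompute_indices([512, 512, 512], k=16)[:4])   # [512, 513, 514, 515]
```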

Review (Rating: 4)

The paper introduces a context caching method with limited recomputation called LegoLink. Prior work either fully recomputes KV caches for new inference calls (e.g. full re-encoding) or uses dynamic recomputation to recompute a fraction of total cached values (i.e. CacheBlend). To further reduce cost, LegoLink has a static set of values to recompute when reusing cache (primarily the first few tokens in each block, i.e. the attention sink). The authors demonstrate that this decreases time to first token and maintains strong performance.

Questions for the Authors

Q1. In Figure 3, the full recomputation setting still shows stronger attention at the "chunk boundaries"-- is this solely because the chunks are separate documents? If you use the regular softmax instead of your min-max scaling, is this a really noticeable difference? (In general I am a fan of using softmax instead of min-max for this figure, or at least including the softmax version in the appendix.)

Q2. Do you consider your main contribution to be the framing of the PICI problem (using the metaphor of software linkage), the presentation of LegoLink as an improvement on CacheBlend, or the proposal of LegoLink-0 as a 0-linking alternative?

Claims and Evidence

I think the claims are generally supported by the evidence.

A nitpick: I think the framing in the very first paragraph-- that LLMs have advanced the progress of AGI-- is not generally agreed upon and doesn't add much to the paper. This did not factor into my review scores, but personally I don't think it adds anything to use the term AGI here when it is still a contested idea.

Methods and Evaluation Criteria

Generally yes, I think measuring time-to-first-token is reasonable for a method focused on improving efficiency in the prefill stage. The model sizes and choice of models seems reasonable.

The authors substitute Needle in a Haystack for multi-doc retrieval (in line 260), saying that it is a more suitable benchmark for retrieval. I disagree-- I think this is actually generally a worse test of capabilities, as it is a much more synthetic task where the needle will be very distinct from the context. I don't think this really has an impact on the paper's claims, but I would consider revising the comment about suitability to be a comment about ease of evaluation or setup, or adding further justification for why you truly believe this is more suitable for assessing retrieval.

Theoretical Claims

No-- no theoretical claims.

Experimental Design and Analysis

I looked at the main experimental setup, comparing LegoLink to CacheBlend variants and full recomputation on performance and efficiency. I think the setup seems reasonable and well-documented.

Supplementary Material

I read the appendices.

Relation to Prior Work

The paper brings attention to an interesting area of active study-- position-independent caching for efficient model serving-- and proposes a new method in this space. LegoLink has the advantage of being very simple conceptually and much faster than the prior state of the art (CacheBlend); the LegoLink-0 setting is also an interesting alternative to recomputation. I think these methods have the potential for adoption because of their low impact on performance, improvement on efficiency, and ease of integration into existing setups (strengthened by the implementation of the method to integrate with vllm workflows).

Missing Important References

I think more careful discussion is warranted around the idea of caching context. Line 293 says that context caching was introduced in late June 2024, but this is only the time that Gemini started supporting this feature. The related work on page 8 actually does a much better job of tracing the line of work here from caching KV values for a single inference example, to prefix caching, to modular/position-independent caching. I think this is still missing some prior works that talked about PIC in terms of individual applications, including for RAG and long-context ICL; both feature encoding in local-position-only/global-position-agnostic blocks and then potentially retrieving blocks to reuse.

Other Strengths and Weaknesses

I've addressed strengths elsewhere in the review; I think my main concern not mentioned elsewhere is with the clarity of the framing. The paper starts as if it is fully introducing a new task of PIC and introduces a (to my knowledge) new view of PIC as analogous to the way code is pre-compiled. However, the paper then acknowledges that this problem has been studied before and instead begins framing itself as an improvement on an existing method (using LegoLink). Then, near the end, LegoLink-0 is introduced as a way to sidestep the need to do linking altogether. I think it's a good paper with a reasonable contribution, but it seems that the paper is not quite sure what story it is telling, and a revision of the narrative of the work would greatly enhance its readability and impact.

Other Comments or Suggestions

I would not describe tokens as "text-based words" (in line 26)-- I think something like "small units of text" is more accurate.

The notation introduced in line 69 for the computational complexity of CacheBlend is strange and makes it a bit hard to follow. Generally we exclude constant factors in big-O notation, but here I understand you are trying to distinguish complexity between full recomputation, your method, and CacheBlend; I think writing CacheBlend as O([0.15N]·N) and yours as O(kN) would make it clearer that your contribution is reducing the "15% of N" recomputation to a static k. In general I think that section could use an additional pass for clarity of text.
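
For concreteness, the suggested notation could be typeset roughly as follows (a sketch of the notation only, with N the context length and k the static per-chunk recompute budget):

```latex
% Prefill-attention cost, keeping the constant factor to make the contrast explicit:
% full recomputation vs. CacheBlend vs. LegoLink.
\mathcal{O}(N \cdot N)
\quad\text{vs.}\quad \mathcal{O}\bigl(0.15N \cdot N\bigr)
\quad\text{vs.}\quad \mathcal{O}(k \cdot N), \qquad k \ll N .
```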

Describing the models in 4.1.3 as "base" because they are not finetuned for these tasks is a bit confusing-- generally, I think of "base" as a distinction between instruct-tuned and non-instruct models, and you are using instruction-tuned models here.

Figures are not always placed on the page where they are primarily referred to.

Author Response

Thank you for your valuable feedback. We address the review questions below.

Missing related work.

Thank you for pointing this out. We will include a subsection discussing RAG applications and their connection to context caching. This addition will clarify PICI's positioning without modifying its intended scope.

AGI.

We acknowledge that "AGI" is controversial and requires a precise definition when discussing it, especially in academic contexts. We will revise the first paragraph.

Needle in a Haystack

Initially, we evaluated PICI on the same four datasets as CacheBlend (WikimQA, SAMSum, MultiNews, and Musique). To further demonstrate LegoLink’s generalizability, we selected the Needle in a Haystack dataset based on its popularity rather than ease of evaluation or improved results. We acknowledge its limitations as a synthetic task where the "needle" is highly distinct from the context. In the revised version, we will clarify our rationale for this dataset selection (popularity instead of methodological suitability) and include passage retrieval results from LongBench, where we expect similar findings.

Storyline and contribution.

The revised story (or the original story we plan to tell) should go as follows: 1) We are the first to formally define the PIC problem and decompose its solution into a four-step framework: document chunking, cache generation, retrieval, and linking. This structured approach establishes a diverse solution space (contribution 1). 2) CacheBlend, the first PIC work, fits in our framework and mainly designs the linking step, while LegoLink enhances its efficiency (contribution 2). 3) LegoLink-0 further improves performance by mainly designing the cache generation step, thereby simplifying the linking step (contribution 3).

This framework establishes a diverse solution space for future research. The key consideration is the distribution of computational costs across different steps. Allocating cost to the chunking or cache generation step resembles a compile-time expense—incurred once but benefiting all subsequent queries. Conversely, placing cost in the retrieval or linking step resembles link-time expense—introducing runtime overhead but allowing for more adaptive and potentially more accurate solutions. In the appendix, we call on future research to find more efficient and accurate solutions utilizing our framework.

The above details are currently in the appendix and will be integrated into the main text.
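
A minimal structural sketch of the four-step decomposition (all names and the toy cache representation are placeholders added for illustration, not the system's API):

```python
# Hedged structural sketch of the four-step PIC framework: chunking, cache
# generation, retrieval, and linking. Names and the toy cache are illustrative.

def chunk(doc_tokens: list, size: int = 512) -> list:
    # 1) Chunking: split a document into position-independent chunks.
    return [doc_tokens[i:i + size] for i in range(0, len(doc_tokens), size)]

def generate_cache(chunks: list) -> dict:
    # 2) Cache generation: prefill each chunk starting at position 0 (compile-time cost).
    return {cid: f"kv({len(c)} tokens)" for cid, c in enumerate(chunks)}

def retrieve(cache_store: dict, cache_ids: list) -> list:
    # 3) Retrieval: fetch the caches the request names (explicit cache_id lookup).
    return [cache_store[cid] for cid in cache_ids]

def link(caches: list, query_tokens: list, k: int = 16) -> str:
    # 4) Linking: stitch caches together and recompute a small token set per chunk
    #    (CacheBlend picks ~15% dynamically; LegoLink recomputes the first k statically).
    return f"linked {len(caches)} caches, recomputed {k} tokens each, query len {len(query_tokens)}"

store = generate_cache(chunk(list(range(1200))))
print(link(retrieve(store, [0, 1]), query_tokens=list(range(20))))
```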

Text clarity.

We appreciate the feedback and will improve clarity on terms like "token," O-notation, "base" model, and figure positions.

Still strong attention on chunk boundaries (Figure 3)

  1. Experiments show that the first token retains relatively strong attention absorption due to being a begin-of-sentence token ("<s>" in LLaMA). This insight motivates LegoLink-0, where we discard "<s>" tokens (and dummy tokens which are also "<s>" tokens), except for the first chunk.
  2. In the softmax map, we could still observe that recomputation reduces boundary attention, but attention-score differences on mid-content are harder to observe. We will add a softmax map for cross-comparison.
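
For reference, the two normalizations being contrasted can be sketched as follows (illustrative only, not the paper's plotting code; the random matrix stands in for raw attention scores):

```python
# Hedged sketch contrasting row-wise softmax and min-max scaling for visualizing
# attention maps. Illustrative only.
import numpy as np

def softmax_rows(scores: np.ndarray) -> np.ndarray:
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)      # each row sums to 1

def minmax_rows(scores: np.ndarray) -> np.ndarray:
    lo = scores.min(axis=-1, keepdims=True)
    hi = scores.max(axis=-1, keepdims=True)
    return (scores - lo) / (hi - lo + 1e-9)       # each row is stretched to [0, 1]

scores = np.random.randn(8, 8)                # stand-in for one head's raw attention scores
print(softmax_rows(scores).sum(axis=-1))      # ~1.0 per row
print(minmax_rows(scores).max(axis=-1))       # ~1.0 per row
```
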
Reviewer Comment

Thank you for the detailed response. I believe the proposed changes will improve the clarity and I think the work is a good contribution, so I have raised my score 3 -> 4.

Author Comment

We sincerely appreciate the reviewer's recognition of the value of our paper and the increased score. Your constructive feedback has greatly helped us refine and clarify key aspects of our work. We remain committed to further improving the paper and addressing any additional suggestions or concerns. Thank you for your thoughtful evaluation and support.

Review (Rating: 3)

The authors propose a context caching system that can store pre-computed key-value (KV) caches for multiple documents. The appropriate caches are then retrieved given a user query. The key insight is that only a fixed number of initial tokens needs to be recomputed in order to avoid shifting the attention score distribution due to the sink-token phenomenon.

Questions for the Authors

  • In the current setup, what is the algorithm which decides a cache hit?
  • How does the current setup decide what to evict from the HBM cache in the GPU?
  • Are the caches only stored on GPU, or are they offloaded to CPU memory or hard disk space as well?
  • I assume that position encodings for the model are re-used by the chunks of the encoded prefix documents such that the whole process is permutation invariant?

Post rebuttal

Thanks for your responses to my review. I have decided to maintain my current score.

Claims and Evidence

The claims are supported by evidence.

Methods and Evaluation Criteria

The evaluation is sufficient for the problem setting.

Theoretical Claims

N/A

Experimental Design and Analysis

The design of the experiments is sound.

Supplementary Material

I did not view the supplementary material, as there was no reference to it nor perceived need to view it.

Relation to Prior Work

The contributions of this paper build on recent works which highlighted the appearance of sink tokens in transformers.

Missing Important References

The cited references are sufficient.

Other Strengths and Weaknesses

Weaknesses

  • The pitfalls of prefix based caching are mentioned multiple times in the work, such as L138C1. But doesn't the presented algorithm suffer from the same limitation in that it is "limited to cases with exact prefix matches across requests?"
    • From what I can see, there must be a query to a database which has precomputed KV values, and your method proposes minimally recomputing the retrieved tokens, but it does nothing to address the above limitation, correct?

Other Comments or Suggestions

L227C2: Rough-L score --> Rouge-L

I would suggest adding more details of the overall algorithm for cache retrieval to the appendix, as I am left with many questions regarding how this is done after fully reading the work.

Author Response

Thank you for your valuable feedback. We address the review questions below.

More details of the PICI system (overall workflow, cache-hit definition, indexing, cache swap in/out, too-large caches) in the main text instead of the appendix.

Due to page limits, we prioritized the attention sink effect in PIC and our simple yet effective algorithm, placing system details in the appendix. However, we acknowledge this omission and will move system details back to the main text. Below is a summary (from the appendix) related to review questions.

PICI workflow: 1) Users cache frequently used documents via generate_cache API, receiving a cache_id. 2) Users pass queries and cache_ids to PICI in any order (position-independent), and PICI retrieves the cache with minimal overhead (simple cache_id → cache address mapping) and finishes linking as described in the paper.
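
A hedged usage sketch of this explicit-caching workflow (only the generate_cache API and the cache_id concept come from the response; the PICIClient stub and the other names are placeholders added for illustration):

```python
# Hedged sketch of the explicit caching workflow described above. The PICIClient
# stub below is illustrative only; it is not the real system's implementation.
import uuid

class PICIClient:
    def __init__(self):
        self._caches = {}                       # cache_id -> (stand-in for a) KV-cache address

    def generate_cache(self, document: str) -> str:
        cache_id = str(uuid.uuid4())            # user keeps this id; no implicit eviction
        self._caches[cache_id] = f"kv-cache({len(document)} chars)"
        return cache_id

    def generate(self, cache_ids, query: str) -> str:
        caches = [self._caches[c] for c in cache_ids]   # minimal cache_id -> address lookup
        return f"linked {len(caches)} cached docs (any order), answered: {query}"

client = PICIClient()
cid = client.generate_cache("A frequently reused document ...")
print(client.generate(cache_ids=[cid], query="Summarize the document."))
```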

PICI design philosophy (similar to Gemini): users (or a RAG system) explicitly control caching. 1) Users decide what doc (big or small, frequently used or not) is worth caching and for how long at known costs (xx $/token/hour). PICI never actively deletes users’ PIC cache (more in the next paragraph). 2) Users explicitly pass cache_id to define cache hits. 3) PIC cache resides in GPU memory (no swap in/out).

In contrast, systems like vLLM and SGLang use implicit caching, where cache hits rely on token prefix-matching and swap in/out for memory management. Future work will explore integrating explicit PIC caching with implicit prefix-based caching and optimizing their placement in the memory hierarchy.

Rough-L score --> Rouge-L

Yes, thank you for pointing this out. We will fix it.

I assume that position encodings (PE) for the model are re-used by the chunks of the encoded prefix documents such that the whole process is permutation invariant?

Yes. PICI prefills each document, whose first token is given position 0, to generate its KV cache. PICI then uses the cache directly without modifying the PE. In contrast, new queries receive position IDs based on their real positions.

Review (Rating: 3)

This paper introduces PICI, an efficient position-independent context caching system for serving large language models. The system pre-computes the KV caches of unchanged contents and splits them into blocks. The incorporated method, LegoLink, leverages the static attention sparsity of each block, eliminating the influence of the attention sink phenomenon in each chunk to minimize recomputation for accuracy recovery when using position-independent context caching.

Comprehensive experiments validate the effectiveness and efficiency of the proposed method, providing additional insight into efficient LLM inference.

Questions for the Authors

See Other Strengths And Weaknesses

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The proposed methods and/or evaluation criteria (e.g., benchmark datasets) make sense.

More details of the proposed PICI system should be introduced in the main paper instead of in the appendix.

Theoretical Claims

n/a

Experimental Design and Analysis

I have checked the soundness/validity of any experimental designs or analyses.

Supplementary Material

no

Relation to Prior Work

This paper's motivation is related to the KVLink algorithm, e.g., CacheBlend, and its idea is related to the KV cache eviction method, e.g., StreamingLLM, H2O.

Missing Important References

See Other Strengths And Weaknesses.

Other Strengths and Weaknesses

Strengths:

  1. This paper is well-written and easy to follow.
  2. The motivation of this paper is clear. Position-Independent Context Caching is practical in real-world applications.
  3. Using a static attention sparsity that determines the tokens to recompute is both effective and efficient.
  4. Comprehensive experiments show the effectiveness of LegoLink, and give some interesting insight (e.g., Section 4.4) into the PIC area.

Weaknesses:

  1. In the proposed PICI, how to retrieve the KV cache from a set of KV caches remains unexplored. The authors treat the implementation of the PICI system as a contribution, but the details of PICI are in the appendix.
  2. Pre-computing and saving the KV caches for long documents requires far more storage than saving the raw text.
  3. Missing related work on Position-Independent Caching (PIC) in Section 2.2; only a Wikipedia page is cited.
  4. Line 220: O=AVW_O, should be O=AV_{exp}W_O ?

Other Comments or Suggestions

See Other Strengths And Weaknesses

Author Response

Thank you for your valuable feedback. We address the review questions below.

More details of the PICI system (overall workflow, cache-hit definition, too-large caches) in the main text instead of the appendix.

Due to page limits, we prioritized the attention sink effect in PIC and our simple yet effective algorithm, placing system details in the appendix. However, we acknowledge this omission and will move system details back to the main text. Below is a summary (from the appendix) related to review questions.

PICI workflow: 1) Users cache frequently used documents via generate_cache API, receiving a cache_id. 2) Users pass queries and cache_ids to PICI in any order (position-independent), and PICI retrieves the cache with minimal overhead (simple cache_id → cache address mapping) and finishes linking as described in the paper.

PICI design philosophy (similar to Gemini): users (or a RAG system) explicitly control caching. 1) Users decide what doc (big or small, frequently used or not) is worth caching and for how long at known costs (xx $/token/hour). For now, PICI never actively deletes users’ PIC cache (more in the next paragraph). 2) Users explicitly pass cache_id to define cache hits. 3) For now, the PIC cache resides in GPU memory (no swap in/out).

In contrast, systems like vLLM and SGLang use implicit caching, where cache hits rely on token prefix-matching and swap in/out mechanisms are used for memory management. Future work will explore integrating explicit PIC caching with implicit prefix-based caching and optimizing their placement in the memory hierarchy.

Pre-computing and saving the KV caches for long documents requires far more storage than saving the raw text.

Yes, it is always a hard trade-off between memory efficiency and computational savings. However, in the PICI design philosophy, users retain the flexibility to enable or disable caching based on their specific needs. Given their deeper understanding of application semantics and intent, users are often better positioned than the system itself to make optimal PIC caching decisions. This flexibility helps prevent unnecessary memory consumption by avoiding the storage of large, unused caches.
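
To make the storage gap concrete, here is a rough back-of-the-envelope estimate (the model configuration is an assumption for a Llama-3.1-8B-like model with GQA; the numbers are illustrative, not measurements from the paper):

```python
# Hedged back-of-the-envelope estimate of KV-cache size vs. raw text size.
# Assumes a Llama-3.1-8B-like configuration: 32 layers, 8 KV heads, head dim 128, FP16.
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # K and V
text_bytes_per_token = 4                          # roughly 4 bytes of UTF-8 text per token

print(kv_bytes_per_token)                         # 131072 bytes, i.e. ~128 KiB per cached token
print(kv_bytes_per_token // text_bytes_per_token) # ~32768x larger than storing the string
```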

Missing citations in Section 2.2.

We will correct the citation problems in Section 2.2, making them as complete as in the Related Work section.

Line 220: O=AVW_O, should be O=AV_{exp}W_O?

Yes.

Final Decision

This paper presents PICI, a serving system for LLM that overcomes the constraints of traditional prefix-based context caching by enabling position-independent caching (PIC). At its core, PICI features LegoLink, a simple yet effective linking algorithm that exploits static attention sparsity and eliminates the influence of the attention sink phenomenon in each chunk. It minimizes recomputation for accuracy recovery when using PIC.

The reviewers generally appreciate the contributions of this work, and the authors have adequately addressed the questions and concerns during the rebuttal. One major concern relates to the assumption of the "attention sink" phenomenon, which has been previously noted and studied (e.g., in Streaming LLM). The authors are encouraged to provide additional context and a more detailed explanation of this concept to enhance the paper’s clarity and readability.