Tail-Optimized Caching for LLM Inference
Summary
Reviews and Discussion
Prompt caching is an important issue for LLM serving: it can reduce latency and improve user experience on one hand, and reduce costs for service providers on the other. The paper introduces a new metric, tail latency, to evaluate the quality of LLM service. To optimize tail latency, a novel stochastic model for multi-turn conversations is introduced, characterizing the uncertainties and temporal locality of LLM workloads, and a novel policy, Tail-Optimized LRU, is proposed to optimize the objective. The new policy's dominance over LRU is soundly proven, and several other theoretical results are also given. The experiments show significant improvements of the new policy over existing LRU on tail latency.
Strengths and Weaknesses
Strengths:
- The major strengths of the paper are its topic significance and novelty of its stochastic modeling and policy.
- The topic of prompt KV caching is a significant issue for LLM serving systems. It’s strongly correlated to service quality and cost.
- This work is the first in the community to model and analyze prompt caching to this extent, while proposing a policy tailored to LLM serving rather than traditional OS caching/paging.
Weakness:
- The importance of tail latency relative to median latency and hit rate is not clarified. In addition, the hardware settings require more specification.
- As tail latency is a new metric to the community, adequate references and data should be given to support the metric, which are absent from the paper.
- The authors only mention that the experiments run on an A100, without specifying the number of GPUs, inter-machine communication settings, network settings, and other settings that are crucial for reproduction and real-world usage.
Questions
Q1: You mention that your new metric, tail latency, is an important issue for LLM inference QoS. Do you have any references or data to support its importance? Please clarify this assertion, since it is a brand-new metric to the community.
Q2: How well does your stochastic model capture real-world characteristics? Do you have any references or data to verify it?
Q3: How many A100s did you use? Can your hardware settings adequately simulate real-world clusters for LLM serving, e.g., inter-machine communication and network settings?
Q4: How do you recover the evicted KV cache blocks? Are they stored in CPU memory, on disk, or simply discarded? Recovering by loading from disk versus recomputation can make a large difference.
Q5: Your algorithm performs worse than LRU at the 50th percentile. Will that be a problem, since the median quality seems to be worsened?
Q6: ShareGPT results could be presented in the main text. I suspect you moved them to the appendix due to the page limit.
Limitations
yes
Final Justification
The concerns about the latency metric and some of the evaluation results have been clarified. I will maintain my score.
Formatting Issues
no
We thank the reviewer for thoughtful feedback and constructive suggestions. We are grateful that the reviewer appreciated the importance of the problem we try to tackle and the novelty of our approach. Below we address the concerns raised.
Importance of tail latency.
Tail latency has been important for large-scale services: it is critical for a data center operator’s revenue, see “Understanding and Leveraging the Impact of Response Latency on User Behaviour in Web Search” (Xiao et al., TOIS 2017). “The Tail at Scale” (Dean & Barroso, 2013) shows that a tiny fraction of slow requests dominates user-perceived experience. These findings naturally transfer to interactive LLM applications, and modern LLM stacks have already internalized the need for optimizing tail latency: “Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve” targets p99 tail latency, and “DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving” focuses on both p90 TTFT latency and time-per-output-token latency. We will add the above references in the introduction to underline the importance of tail latency.
Fidelity of the stochastic conversation model.
To the best of our knowledge, WildChat is the only publicly available dataset that includes real timestamps. Our analysis of this dataset shows that:
- new-conversation arrivals are close to a Poisson process;
- the probability of a conversation continuing decreases as the time since the last turn increases;
- inter-turn arrivals exhibit three regimes: uniform intervals (B = -1), Poisson (B = 0), and bursty (B = 1), with the Poisson regime dominating. Here B is the burstiness index computed as B = (σ - μ) / (σ + μ), where σ denotes the standard deviation of inter-arrival times and μ denotes their mean; a short computation sketch follows this list.
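For concreteness, the burstiness index can be computed from raw conversation timestamps as in the sketch below (an illustrative snippet; the function and variable names are ours, not from the paper):

```python
import numpy as np

def burstiness_index(timestamps):
    """Burstiness index B = (sigma - mu) / (sigma + mu) of inter-arrival times:
    B = -1 for perfectly regular intervals, B = 0 for a Poisson process,
    and B -> 1 for highly bursty arrivals."""
    inter_arrivals = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    mu = inter_arrivals.mean()
    sigma = inter_arrivals.std()
    return (sigma - mu) / (sigma + mu)

# Example: inter-turn timestamps (in seconds) of one hypothetical conversation.
print(burstiness_index([0.0, 30.0, 61.0, 89.0, 120.0]))  # close to -1 (regular)
```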
We will add these empirical results in the revised paper. Our formulation is the first stochastic model to capture both uncertainty and temporal locality in LLM workloads, which we believe will interest the broader modeling community.
Crucially, the policy design itself does not rely on the stochastic conversation model, and is guided by a general monotonicity intuition: higher latency requests deserve more cache.
Hardware settings.
We used one A100 GPU in a Colab environment to simulate a typical single-node deployment. We agree that full-cluster results with inter-node communication would be valuable and plan to explore this in future work.
Evicted KV cache blocks.
We discarded evicted KV cache blocks. In Section 7 of the submitted paper, we pointed out that our focus is on a single storage layer, and that exploring hierarchical KV caching architectures such as in “Trading more storage for less computation—a kvcache-centric architecture for serving llm chatbot” (Qin et al., 2025) could provide insights into better managing multi-layer resources.
Worse median latency.
T-LRU deliberately trades a slightly higher median latency (a few milliseconds) for larger tail latency reductions (tens to hundreds of milliseconds), which is a more favorable trade-off for user experience. In our experiments, T-LRU does slightly increase p50 latency, which is expected since we trade off median performance against worst-case experience (p95/p99). The median latency degradation (+10 milliseconds) is minimal and users will likely not notice it: for context, the duration of a blink averages 100–400 milliseconds according to the Harvard Database of Useful Biological Numbers (BNID 100706). In contrast, reducing p99 latency from 500 ms to 400 ms noticeably improves user-perceived responsiveness.
ShareGPT results.
We put the ShareGPT results in the appendix for two reasons: 1) the page limit, and 2) ShareGPT lacks arrival timestamp data, so we simulated request arrivals using our stochastic multi-turn conversation model, making those results supportive rather than primary evidence.
Thanks for the authors' further clarification! I will keep my positive score.
Thanks again for your thoughtful feedback!
Summary: The paper introduces Tail-Optimized LRU, a caching strategy designed to minimize tail latency in large language model (LLM) inference. By reallocating KV cache capacity to prioritize high-latency conversations and evicting less impactful cache entries, the authors aim to address the limitations of traditional LRU policies, which are not optimized for heterogeneity in conversation lengths. The approach is theoretically grounded with proofs of optimality under certain conditions and empirically validated through experiments on real-world datasets. Significant reductions in P90 and P95 tail latencies, as well as SLO violations, are reported compared to standard LRU and Threshold-LRU policies.
Strengths and Weaknesses
Strengths:
-
The authors provide rigorous theoretical analysis and prove the optimality of their proposed policy under a natural stochastic model, adding depth to the understanding of caching strategies in LLM inference.
-
The simplicity of the implementation, requiring only minor modifications to existing LRU systems, makes it an attractive option for practitioners looking to optimize their LLM deployments without extensive changes to infrastructure.
Weaknesses:
-
Performance and Trade-off Concerns
- Inference speed impact: T-LRU's cache removal strategy might negatively affect overall inference speed and system throughput
- Cache hit rate reduction: optimizing for tail latency could decrease the overall cache hit rate
- Lack of comprehensive evaluation: Missing experiments on full inference latency and potential negative impacts
- Modeling Assumptions and Limitations
- Poisson distribution assumption: The assumption that conversations die according to a Poisson distribution may not hold across different user habits and conversation types, limiting real-world applicability
- Static Q parameter: The next-turn length estimate Q remains fixed during inference, while in practice it likely fluctuates over time. Dynamic adjustment based on recent conversation history might improve performance
- Oversimplified TTFT proxy: Using the number of uncached blocks to approximate Time to First Token overlooks other real-world factors affecting latency
- Innovation and Scope
- Limited innovation: merely adding priority handling for tail excess without fundamentally addressing the root causes of tail latency
- Fairness concerns: The method doesn't address fairness across users, potentially creating inequities in service quality
- Practical implementation challenges: The accurate estimation of Q in real-world deployments is difficult
Questions
- The estimated Q is static during inference, but in engineering applications it may fluctuate over time. If we compute the average length of the last N questions and make Q dynamic, will the tail latency become smaller or not?
- Could Tail-Optimized LRU lead to a decrease in overall cache hit rate? Might it also reduce the system's overall throughput? It would be valuable to evaluate such potential negative impacts of Tail-Optimized LRU in the experiments, in order to better understand its trade-offs.
Limitations
Not available
Final Justification
While I appreciate the work's rigor and practical merits, my revised perspective still leans negative for NeurIPS. This research appears more aligned with AI systems and architecture focus—its core contributions lie in LLM inference optimization, caching strategies, and system-level performance tuning, which are more central to specialized venues in AI infrastructure.
NeurIPS, despite including systems research, tends to prioritize ML theory, algorithmic innovations, or foundational model breakthroughs. The audience here likely has limited focus on the architectural details and engineering optimizations that form the heart of this work.
For broader impact and more targeted feedback, this paper would be better suited for conferences or journals specializing in AI systems (e.g., MLSys, ASPLOS) or architecture, where the community’s expertise and interest in such system-level optimizations are far more pronounced.
Formatting Issues
N/A
We thank the reviewer for thoughtful feedback and constructive suggestions. We are grateful that the reviewer appreciated our rigorous theoretical analysis, and that our policy is simple to implement in practice. Below we address the concerns raised.
Inference speed impact.
Prior works have shown that the decode phase is the primary bottleneck in inference serving (“LLM Inference Performance Engineering – Best Practices”, Databricks 2024; “Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve”). For example, “Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference” shows that prefill time remains below 5% of total execution time at large batch sizes (e.g., 256). The T-LRU policy only touches KV cache blocks and does not sit on the critical path of the attention mechanism or matrix multiplications. We are working to implement our policy in vLLM and will report results on inference speed in the revised paper.
Cache hit rate impact.
A binary cache hit/miss metric is not informative when objects vary in size and can be cached partially. One can achieve a 100% cache hit rate by saving one cache block for each conversation (assuming capacity allows), yet the TTFT would still be large. Our objective is thus the number of uncached blocks, effectively a size-weighted cache-hit metric, which has a more direct and monotonic relationship with TTFT latency (see Figure 3 in the appendix of the submission).
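To make the distinction concrete, here is a toy illustration (the numbers are hypothetical, not from our experiments) contrasting the binary hit rate with the uncached-block count that our objective uses:

```python
# Three conversations with 10, 200, and 400 prompt blocks; the cache holds only
# one block per conversation, so every request "hits" under a binary metric.
prompt_blocks = [10, 200, 400]
cached_blocks = [1, 1, 1]

binary_hit_rate = sum(c > 0 for c in cached_blocks) / len(prompt_blocks)
uncached_blocks = [p - c for p, c in zip(prompt_blocks, cached_blocks)]

print(binary_hit_rate)  # 1.0 -- looks perfect
print(uncached_blocks)  # [9, 199, 399] -- the prefill work driving TTFT is still large
```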
Scope of evaluation.
Our paper focuses on the tail latency of time to first token, and we consider decode latency and end-to-end latency as orthogonal optimizations. With chunked prefill, decodes and prefills are interleaved in one batch, so improving the tail TTFT can also positively impact the tail end-to-end latency. In prefill/decoding-disaggregated engines, TTFT and decode latency are independent, so a tail TTFT improvement strictly lowers overall tail latency. We agree that a comprehensive evaluation of our policy is valuable, and we provide some preliminary experimental analyses here (Tables 1–5 in this rebuttal).
Poisson distribution assumption.
Our policy does not rely on the Poisson distribution assumption and is guided by a general monotonicity intuition: higher-latency requests deserve more cache. Our empirical results on the WildChat dataset use real timestamps. Only the theoretical optimality in Section 4 requires the Poisson assumption, for analytic tractability, which is a common approach in the caching and queueing literature. In real conversational data, we do observe that the probability of a conversation continuing decreases as the time since the last request increases, i.e., a monotone-decreasing hazard, which is what our model captures.
Static Q parameter.
We agree that dynamic Q estimation may yield better performance, and our framework gives users the flexibility to choose a rolling average or a learned predictor. The tables below report the relative TTFT improvement of T-LRU with a dynamic Q over T-LRU with a static Q. Positive numbers mean the dynamic version lowers latency; negative numbers mean it increases it. The sample size is the number of most recent requests used to compute the dynamic average user-prompt length.
Table 6: T-LRU (dynamic Q) vs T-LRU (static Q) TTFT Improvement (%), Sample Size = 30
| Capacity | P90 ξ=100ms | P90 ξ=150ms | P95 ξ=100ms | P95 ξ=150ms | P99 ξ=100ms | P99 ξ=150ms |
|---|---|---|---|---|---|---|
| 1000 | 0.59 | 0.49 | 0.00 | 0.22 | 0.00 | 0.00 |
| 2000 | 0.00 | 0.29 | 0.71 | 0.01 | 0.00 | 0.00 |
| 4000 | 1.05 | 0.23 | 0.00 | 0.58 | –0.70 | –2.75 |
| 6000 | 0.89 | 0.15 | 0.98 | 0.29 | –2.75 | –4.85 |
| 8000 | 1.27 | 0.34 | 1.11 | 0.00 | –4.94 | –3.37 |
| 10000 | 1.35 | –3.09 | 0.06 | –0.72 | –6.03 | –3.23 |
Table 7: T-LRU (dynamic Q) vs T-LRU (static Q) TTFT Improvement (%), Sample Size = 50
| Capacity | P90 ξ=100ms | P90 ξ=150ms | P95 ξ=100ms | P95 ξ=150ms | P99 ξ=100ms | P99 ξ=150ms |
|---|---|---|---|---|---|---|
| 1000 | 0.59 | 0.49 | 0.00 | 0.22 | 0.00 | 0.00 |
| 2000 | 0.00 | 0.00 | 0.71 | 0.01 | 0.00 | 0.00 |
| 4000 | 0.81 | 0.03 | 0.00 | 0.58 | –0.70 | –2.75 |
| 6000 | 0.94 | 0.11 | 0.98 | 0.29 | –2.75 | –4.85 |
| 8000 | 1.39 | 1.03 | 1.56 | 0.00 | –6.24 | –3.37 |
| 10000 | 1.35 | –3.17 | 0.06 | –0.72 | –6.03 | –3.23 |
From these experiments, we find small to moderate gains at P90/P95 but an inconsistent effect at P99, and window size matters only modestly for the WildChat data. In practice, one can tune the sample size to match the time scale over which prompt lengths change. A data-driven rule of thumb is to pick the sample size whose mean-squared prediction error for Q is lowest on recent history. We think dynamic Q is a very interesting future direction and will add this discussion in the revised paper.
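For concreteness, a minimal sketch of such a dynamic-Q estimator is given below (the class, the window-selection helper, and all names are illustrative assumptions rather than code from our implementation):

```python
from collections import deque

class RollingQEstimator:
    """Estimate Q as the average user-prompt length over the last N requests."""

    def __init__(self, window: int, initial_q: float):
        self.window = deque(maxlen=window)
        self.initial_q = initial_q

    def update(self, prompt_length: int) -> None:
        self.window.append(prompt_length)

    def estimate(self) -> float:
        # Fall back to the static estimate until the window has data.
        return sum(self.window) / len(self.window) if self.window else self.initial_q


def pick_window(history, candidate_windows=(10, 30, 50)):
    """Pick the window whose rolling average best predicts the next prompt length,
    i.e., has the lowest mean-squared prediction error on recent history."""
    def mse(n):
        errors = [(sum(history[max(0, t - n):t]) / min(t, n) - history[t]) ** 2
                  for t in range(1, len(history))]
        return sum(errors) / len(errors)
    return min(candidate_windows, key=mse)


# Hypothetical usage with made-up prompt lengths.
history = [120, 95, 300, 80, 110, 500, 90, 105]
q = RollingQEstimator(window=pick_window(history, candidate_windows=(2, 4)), initial_q=200.0)
for length in history:
    q.update(length)
print(q.estimate())
```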
TTFT proxy.
We choose the number of uncached blocks as a proxy for two reasons.
- Empirically such a linear approximation holds well when the uncached length is not excessively long and the prefill of the request is not batched (see Figure 3 in the appendix submission). Thus the proxy captures the dominant driver of TTFT while allowing us to do closed-form analyses. We want to emphasize that we report the actual TTFT in the experiments, and the proxy is used only in the theory.
- Focusing on the number of uncached blocks decouples the cache policy from other serving optimization. The number of uncached blocks signals how much work must be done, and the monotonic relationship between it and TTFT remains generally applicable regardless of what optimization stack is deployed. This allows the insights from our paper to remain valid.
Limited innovation.
We think the root causes of tail latency are heterogeneous request sizes, contention for limited KV cache capacity, and stochasticity in arrivals and departures that causes queueing. One can certainly mitigate tail latency by increasing capacity or decreasing demand, but that does not eliminate the root causes.
T-LRU is request-size-aware and explicitly aligns KV cache capacity with a tail-latency objective. To the best of our knowledge, ours is the first paper to optimize tail latency via prompt caching (Sarathi-Serve used chunked prefills and stall-free scheduling), and we also show the optimality of our policy under a novel stochastic multi-turn conversation model. The simplicity of our policy is a feature: it can be dropped into existing systems that already default to LRU, and it complements orthogonal techniques such as batching.
Fairness concerns.
Our policy tries to equalize per-request latency by giving more cache memory to requests with potentially high TTFT latency, agnostic to which user this request is coming from. Therefore we promote fairness at the request level, which could also benefit user-level fairness (e.g., proportional or max-min fairness across tenants), and we leave this as an interesting future direction.
Estimating Q in practice.
In our paper, Q denotes the average user-prompt length, which is readily measurable from logs (e.g., Q is 295.58 across all conversations in the WildChat dataset, and 94.46 for ShareGPT). We also use the average user-prompt length in our experiments.
The authors modify the LRU cache eviction algorithm used in online serving systems like vLLM, introducing an LRU variant that explicitly optimizes for tail latency. The motivation appears well-justified, and the technical approach is both simple and effective. Experiments demonstrate that the proposed algorithm slightly increases median latency but significantly reduces tail latency, leading to improved SLO compliance.
Strengths and Weaknesses
Strengths:
-
The motivation of the paper is well-founded. This strategy of sacrificing median performance to optimize for worst-case scenarios is widely adopted in many fields.
-
The method remains simple yet effective, and experiments show that it can significantly improve tail latency compared to standard LRU.
Weaknesses:
I have some concerns regarding the experiments. It appears that the study does not account for several widely used optimizations present in state-of-the-art serving systems. First, mixed or continuous batching was disabled in the experiments, which could significantly increase latency. I am unsure how the proposed algorithm would perform with continuous batching enabled in a realistic system. Second, under cache-reuse settings, it is common practice to use prefix caching.
When evicting blocks with LRU, should blocks involved in prefix caching be treated differently? Moreover, has the impact of this widely adopted prefix caching feature been considered in the experiments?
Overall:
Overall, I believe this paper effectively optimizes tail latency by adopting the strategy of sacrificing median performance to improve worst-case scenarios. This solution is novel in my opinion, and it opens up a promising direction for future studies. Although the paper does not integrate some popular optimizations into its experimental setting—which somewhat limits the practical significance of the results for real-world deployment—I think this is acceptable in academic research. After all, similar issues are also present in the evaluations of other academic work. Therefore, I believe the strengths of this paper outweigh its weaknesses, and I am inclined to give a decision of Weak Accept. If the authors could further address the above limitations, I would be willing to raise my score.
Questions
I have placed my questions in the above weaknesses part. I recommend that the authors, if possible, further evaluate the improvements brought by their method when combined with various practical serving optimizations, such as continuous batching and prefix caching. While I understand that implementing these features can be challenging, doing so would strengthen the significance of the paper in real-world settings. Additionally, there is a typo in Section 5.1: “without mixed bathing” should be corrected to “without mixed batching.”
Limitations
Yes
Final Justification
Although the benefits of the proposed method are reduced when combined with modern optimization techniques, it still yields positive results. Therefore, I will maintain my positive score.
Formatting Issues
None
We thank the reviewer for thoughtful feedback and constructive suggestions. We are grateful that the reviewer found our work well motivated, our policy simple and easy to implement, and our empirical results strong. Below we address the concerns raised.
Clarification on prefix caching.
We do assume the standard “prefix-reuse” plus “suffix-eviction” design adopted by vLLM and OpenAI; see Figure 1 (a) for an illustration of our cache eviction policy. We used the umbrella term prompt caching in the submission, which may have obscured this point. In the revised paper, we will use “prefix caching” explicitly.
Combination with other serving optimization.
We really appreciate this suggestion. Optimization techniques such as mixed batching, chunked prefill, and different parallelism strategies are orthogonal to cache eviction, so we can integrate them with our policy T-LRU. We tested these features using Sarathi (budget size of 4096 tokens, with a batch forming with probability 0.1) and set the tail-excess latency threshold ξ to 200 ms. The table below reports the relative TTFT latency improvement of T-LRU over LRU and Thre-LRU with these optimizations enabled. Positive numbers mean improvement (decreased latency); negative numbers mean increased latency.
Table 5: TTFT Latency Improvement with Batching and Chunked-Prefill (%)
| Capacity | p90 LRU | p90 Thre-LRU | p95 LRU | p95 Thre-LRU | p99 LRU | p99 Thre-LRU |
|---|---|---|---|---|---|---|
| 1 000 | 2.73 % | 2.73 % | 0.30 % | 0.36 % | 0.00 % | 0.00 % |
| 2 000 | 6.61 % | 6.58 % | 1.81 % | 2.69 % | 0.03 % | 0.03 % |
| 4 000 | 13.50 % | 12.52 % | 8.95 % | 9.82 % | 6.43 % | 6.43 % |
| 6 000 | 18.30 % | 15.34 % | 12.66 % | 12.50 % | 6.43 % | 6.43 % |
| 8 000 | 13.24 % | 11.92 % | 14.60 % | 15.17 % | 6.25 % | 6.23 % |
| 10 000 | 9.55 % | 8.53 % | 15.45 % | 12.38 % | 9.40 % | 8.65 % |
With batching and chunked prefill enabled, T-LRU consistently outperforms LRU and Thre-LRU, though the margin narrows compared with the no-scheduler baseline (as reported in Tables 1 and 2 of our submitted paper). These results confirm that our tail-optimized prompt caching policy remains beneficial and fully compatible when advanced serving optimizations are enabled.
Typo “mixed bathing”.
Fixed, thank you for spotting it!
Thank you for your detailed response!
I noticed that when combined with other serving optimizations, the algorithm's benefit appears to decrease significantly, narrowing to about one-third to one-half of the original gain. This result also seems to be based on the optimal threshold setting. If different threshold values are used, would the benefit decrease further, or could there even be negative effects?
Thank you for the question! We evaluated additional scenarios combining mixed-batching and chunked-prefill strategies. When varying the threshold parameter ξ, we found that the performance margin was generally narrower compared to Tables 1 and 2; however, in some cases (such as P90 latency at certain capacity and ξ settings) the relative improvement ratio actually increased. Increasing the sample size also led to a greater performance margin. These results indicate that T-LRU is robust overall, though we acknowledge that the specific numbers may vary depending on the experimental setting. Additionally, we did not optimize the hyperparameter ξ in all configurations, which could potentially yield further improvements.
Latency improvement of T-LRU at an alternative threshold ξ
| Capacity | p90 LRU | p90 Thre-LRU | p95 LRU | p95 Thre-LRU |
|---|---|---|---|---|
| 1000 | 0.52 % | 0.52 % | 0.00 % | 0.06 % |
| 2000 | 3.31 % | 3.29 % | 1.15 % | 2.02 % |
| 4000 | 9.07 % | 8.03 % | 4.37 % | 5.29 % |
| 6000 | 16.28 % | 13.26 % | 10.60 % | 10.42 % |
| 8000 | 15.65 % | 14.36 % | 14.60 % | 15.17 % |
| 10000 | 16.39 % | 15.44 % | 14.57 % | 11.46 % |
Latency improvement of T-LRU at another alternative threshold ξ
| Capacity | p90 LRU | p90 Thre-LRU | p95 LRU | p95 Thre-LRU |
|---|---|---|---|---|
| 1000 | 0.29 % | 0.29 % | 0.00 % | 0.06 % |
| 2000 | 1.98 % | 1.98 % | 0.00 % | 0.89 % |
| 4000 | 4.32 % | 3.27 % | 0.19 % | 1.15 % |
| 6000 | 11.09 % | 7.88 % | 4.31 % | 4.12 % |
| 8000 | 12.75 % | 11.44 % | 8.02 % | 8.64 % |
| 10000 | 13.77 % | 12.80 % | 14.57 % | 11.46 % |
Latency improvement of T-LRU when increasing the sample size from the first 500 arrivals (used in previous experiments) to the first 1,000 arrivals of WildChat, at a fixed threshold ξ.
| Capacity | p90 LRU | p90 Thre-LRU | p95 LRU | p95 Thre-LRU |
|---|---|---|---|---|
| 1000 | 3.14 % | 3.03 % | 2.49 % | 2.13 % |
| 2000 | 7.10 % | 6.47 % | 5.86 % | 5.86 % |
| 4000 | 3.53 % | 1.22 % | 9.71 % | 9.47 % |
| 6000 | 5.02 % | 5.02 % | 13.03 % | 13.03 % |
| 8000 | 11.32 % | 11.32 % | 17.53 % | 17.18 % |
| 10000 | 19.12 % | 17.62 % | 20.42 % | 19.00 % |
Thank you for your response. Although the benefits of the proposed method are reduced when combined with modern optimization techniques, it still yields positive results. Therefore, I will maintain my positive score.
This paper proposes a KV caching policy for LLMs with a focus on optimizing tail latency. In particular, the authors introduce Tail-Optimized LRU, which prioritizes high-latency conversations and achieves optimal tail latency under a natural stochastic model of conversation dynamics. They present a novel stochastic model for multi-turn conversations to capture uncertainties and the temporal locality of LLM workloads. The authors empirically validate the effectiveness of the proposed method using real multi-turn chat traces, including ShareGPT and WildChat. Experimental results show that Tail-Optimized LRU reduces p95 tail latencies by up to 23.9% compared to the baseline LRU method.
Strengths and Weaknesses
Strengths
- The paper is theoretically sound. The authors prove the optimality of the proposed Tail-Optimized LRU under a reasonable stochastic conversational model.
- Optimizing tail latency in LLM-based conversations is a novel and relevant problem.
- The proposed method achieves substantial empirical reductions in tail latency cases.
Weaknesses
- The paper lacks illustrative figures that describe the proposed method and compare it to existing approaches. This makes it difficult to fully grasp the core mechanisms and workflow of the method.
- The proposed method involves a performance trade-off: while the method improves tail latency, the method underperforms in non-tail cases compared to the baseline LRU. LLM serving systems often involve multiple users, concurrent queries, KV cache limitations, and constrained computational resources. In such scenarios, performance degradation in non-tail cases may impact overall system efficiency. It would be helpful if the authors provided evaluation results that reflect a more comprehensive view of performance, such as average or maximum latency per token across the entire dataset, irrespective of percentiles.
Questions
- Could you provide additional details about the datasets used (e.g., average token length, number of conversation turns, and topic coverage)?
Limitations
- While the authors mention future directions in the conclusion, it would be beneficial to include a more detailed analysis of the current method's limitations.
Final Justification
This paper provides a solid theoretical analysis. However, the paper lacks illustrative figures that would help readers understand the proposed method and its key differences from prior work. For these reasons, I assign a "weak accept". I also note that I am unfamiliar with the paper's field and related works.
Formatting Issues
.
We thank the reviewer for thoughtful feedback and constructive suggestions. We appreciate that the reviewer found our work theoretically sound and the problem we tackle novel and relevant. Below we address the concerns raised.
Lack of illustrative figures.
Thanks for the suggestion. In our submitted paper, Figure 1 (a) illustrates what our policy caches for one conversation across turns, and Figure 1 (b) compares it with LRU using an example. We also state our policy explicitly in Algorithm 1. We recognize that these figures may not convey the low-level workflow clearly enough, and we will bridge this gap by adding diagrams illustrating which blocks are cached and evicted.
Trade-off with non-tail latency.
Below we list the average and maximum TTFT latencies and the relative improvement of T-LRU over LRU and Thre-LRU. For Tables 3 and 4, positive numbers mean improvement (decreased latency); negative numbers mean increased latency.
Table 1: Mean TTFT Latency (seconds)
| Capacity | ξ=50ms T-LRU | ξ=50ms LRU | ξ=50ms Thre-LRU | ξ=100ms T-LRU | ξ=100ms LRU | ξ=100ms Thre-LRU | ξ=150ms T-LRU | ξ=150ms LRU | ξ=150ms Thre-LRU | ξ=200ms T-LRU | ξ=200ms LRU | ξ=200ms Thre-LRU | ξ=500ms T-LRU | ξ=500ms LRU | ξ=500ms Thre-LRU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 0.0853s | 0.0849s | 0.0852s | 0.0858s | 0.0849s | 0.0852s | 0.0862s | 0.0849s | 0.0852s | 0.0862s | 0.0849s | 0.0852s | 0.0872s | 0.0849s | 0.0852s |
| 2000 | 0.0829s | 0.0821s | 0.0826s | 0.0839s | 0.0821s | 0.0826s | 0.0846s | 0.0821s | 0.0826s | 0.0847s | 0.0821s | 0.0826s | 0.0863s | 0.0821s | 0.0826s |
| 4000 | 0.0788s | 0.0773s | 0.0782s | 0.0809s | 0.0773s | 0.0782s | 0.0816s | 0.0773s | 0.0782s | 0.0820s | 0.0773s | 0.0782s | 0.0845s | 0.0773s | 0.0782s |
| 6000 | 0.0752s | 0.0730s | 0.0746s | 0.0777s | 0.0730s | 0.0746s | 0.0788s | 0.0730s | 0.0746s | 0.0795s | 0.0730s | 0.0746s | 0.0826s | 0.0730s | 0.0746s |
| 8000 | 0.0720s | 0.0691s | 0.0712s | 0.0745s | 0.0691s | 0.0712s | 0.0756s | 0.0691s | 0.0712s | 0.0778s | 0.0691s | 0.0712s | 0.0810s | 0.0691s | 0.0712s |
| 10000 | 0.0682s | 0.0652s | 0.0677s | 0.0712s | 0.0652s | 0.0677s | 0.0737s | 0.0652s | 0.0677s | 0.0755s | 0.0652s | 0.0677s | 0.0784s | 0.0652s | 0.0677s |
Table 2: Max TTFT Latency (seconds)
| Capacity | ξ=50ms T-LRU | ξ=50ms LRU | ξ=50ms Thre-LRU | ξ=100ms T-LRU | ξ=100ms LRU | ξ=100ms Thre-LRU | ξ=150ms T-LRU | ξ=150ms LRU | ξ=150ms Thre-LRU | ξ=200ms T-LRU | ξ=200ms LRU | ξ=200ms Thre-LRU | ξ=500ms T-LRU | ξ=500ms LRU | ξ=500ms Thre-LRU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7041s | 0.7070s | 0.7070s | 0.6449s | 0.7070s | 0.7070s |
| 2000 | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.6885s | 0.7070s | 0.7070s | 0.6263s | 0.7070s | 0.7070s |
| 4000 | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.7070s | 0.6885s | 0.7070s | 0.7070s | 0.6885s | 0.7070s | 0.7070s | 0.6263s | 0.7070s | 0.7070s |
| 6000 | 0.7070s | 0.7070s | 0.7070s | 0.6885s | 0.7070s | 0.7070s | 0.6885s | 0.7070s | 0.7070s | 0.6885s | 0.7070s | 0.7070s | 0.6263s | 0.7070s | 0.7070s |
| 8000 | 0.6885s | 0.7070s | 0.6885s | 0.6885s | 0.7070s | 0.6885s | 0.6885s | 0.7070s | 0.6885s | 0.6698s | 0.7070s | 0.6885s | 0.6263s | 0.7070s | 0.6885s |
| 10000 | 0.6885s | 0.6885s | 0.6885s | 0.6885s | 0.6885s | 0.6885s | 0.6698s | 0.6885s | 0.6885s | 0.6698s | 0.6885s | 0.6885s | 0.6263s | 0.6885s | 0.6885s |
Table 3: T-LRU vs LRU - TTFT Latency Improvement (%)
| Capacity | ξ=50ms Mean | ξ=50ms Max | ξ=100ms Mean | ξ=100ms Max | ξ=150ms Mean | ξ=150ms Max | ξ=200ms Mean | ξ=200ms Max | ξ=500ms Mean | ξ=500ms Max |
|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | -0.44% | 0.00% | -0.96% | 0.00% | -1.52% | 0.00% | -1.46% | 0.42% | -2.63% | 8.79% |
| 2000 | -0.92% | 0.00% | -2.17% | 0.00% | -3.02% | 0.00% | -3.19% | 2.62% | -5.11% | 11.42% |
| 4000 | -2.00% | 0.00% | -4.73% | 0.00% | -5.63% | 2.62% | -6.11% | 2.62% | -9.40% | 11.42% |
| 6000 | -3.00% | 0.00% | -6.50% | 2.62% | -7.98% | 2.62% | -8.98% | 2.62% | -13.23% | 11.42% |
| 8000 | -4.09% | 2.62% | -7.78% | 2.62% | -9.29% | 2.62% | -12.46% | 5.27% | -17.12% | 11.42% |
| 10000 | -4.61% | 0.00% | -9.12% | 0.00% | -12.94% | 2.72% | -15.68% | 2.72% | -20.17% | 9.04% |
Table 4: T-LRU vs Thre-LRU - TTFT Latency Improvement (%)
| Capacity | ξ=50ms Mean | ξ=50ms Max | ξ=100ms Mean | ξ=100ms Max | ξ=150ms Mean | ξ=150ms Max | ξ=200ms Mean | ξ=200ms Max | ξ=500ms Mean | ξ=500ms Max |
|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | -0.12% | 0.00% | -0.64% | 0.00% | -1.20% | 0.00% | -1.14% | 0.42% | -2.31% | 8.79% |
| 2000 | -0.39% | 0.00% | -1.63% | 0.00% | -2.48% | 0.00% | -2.65% | 2.62% | -4.55% | 11.42% |
| 4000 | -0.77% | 0.00% | -3.47% | 0.00% | -4.36% | 2.62% | -4.83% | 2.62% | -8.08% | 11.42% |
| 6000 | -0.70% | 0.00% | -4.13% | 2.62% | -5.57% | 2.62% | -6.55% | 2.62% | -10.71% | 11.42% |
| 8000 | -1.02% | 0.00% | -4.60% | 0.00% | -6.07% | 0.00% | -9.14% | 2.72% | -13.67% | 9.04% |
| 10000 | -0.75% | 0.00% | -5.09% | 0.00% | -8.76% | 2.72% | -11.41% | 2.72% | -15.73% | 9.04% |
Here we use the same experimental setup as in the submission: the first 1,306 turns from WildChat with average user-prompt length Q = 200, using Vicuna-7B on one Colab A100 GPU.
We observe in our experiments that T-LRU does slightly increase p50 and average latency, which is expected as we trade off median/mean performance against worst-case experience (p95/p99). However, the median/average latency degradation (+1/+10 milliseconds) is minimal and users will likely not notice it: for context, the duration of a blink averages 100–400 milliseconds according to the Harvard Database of Useful Biological Numbers (BNID 100706). In contrast, reducing p99 latency from 500 ms to 400 ms noticeably improves user-perceived responsiveness. T-LRU deliberately trades slightly higher median latency (a few milliseconds) for larger tail latency reductions (tens to hundreds of milliseconds), which is more favorable for user experience.
Dataset statistics.
We provided distributions of turns and tokens of ShareGPT and WildChat datasets in the appendix (Figure 4). Below we list the statistics:
- Average number of turns in a conversation: 3.51 (ShareGPT) and 2.54 (WildChat)
- Average number of user tokens: 94.46±626.39 (ShareGPT) and 295.58±1609.18 (WildChat)
- Average number of Chatbot tokens: 348.45±269.93 (ShareGPT) and 441.34±410.91 (WildChat)
- Number of languages: 41(ShareGPT) and 68 (WildChat)
Here the token statistics are computed with the Llama-2 tokenizer. For WildChat English conversations, topics include assisting/creative writing, analysis/decision explanation, coding, factual information, and math reasoning (“WildChat: 1M ChatGPT Interaction Logs in the Wild”). ShareGPT covers topics including technology, education, business, legal, relationships, literature, science, and language (“The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions”). We will add these statistics in the revised paper to provide more context.
Discussion on limitations.
Thanks for the suggestions. We will add the following discussions in the revised paper, which are also raised by other reviewers.
- Tail-vs-median trade-off: T-LRU intentionally sacrifices a few milliseconds at the median to cut tens to hundreds of milliseconds at the tail. Practitioners might prefer a hybrid policy that blends the two objectives, which is an interesting direction for future work.
- Automatic threshold calibration: T-LRU relies on a user-chosen TTFT threshold, which penalizes latency above the threshold but does not directly translate to a service-level objective. It would be practically relevant to tune this threshold on live traffic for a given service-level objective such as “95% of requests complete within 200 ms”.
Thank you for the detailed clarification. As mentioned, the manuscript would indeed become even more valuable with comprehensive figures and these additional experiments. I maintain my current positive score.
Thanks again for your constructive feedback; we will add the figures and additional experiments.
Prompt caching is an important way to reduce TTFT by storing the KV cache of prior requests, but it’s generally impossible to cache all prior requests due to limited memory. This paper proposes a new eviction algorithm—Tail-Optimized LRU—for prompt caching, which minimizes “Tail Excess Latency” (i.e., the total latency above some threshold, across all requests). It proves that this algorithm is optimal under a stochastic model of the incoming requests, and shows empirically that it effectively reduces tail latency—p90 and p95 TTFT—on datasets of real conversation data (e.g., WildChat) by 27.5% and 23.9%, respectively, relative to LRU.
Strengths and Weaknesses
Strengths
- The goal of minimizing “Tail Excess Latency” is well motivated, given that inference providers often want to reduce the percentage of requests that exceed some TTFT threshold (where the user experience starts degrading).
- The proposed algorithm—”Tail-Optimized LRU”—is simple, easy to implement in existing LRU systems, and is a natural modification of the eviction algorithm that is optimal in hindsight (i.e., with full knowledge of all future requests), to deal with the fact that in real settings the exact lengths and arrival order of future requests are not known. It is also shown to be optimal under a stochastic model of the arriving requests.
- Empirically, the paper demonstrates convincingly that tail-optimized LRU reduces the p90 and p95 TTFT values, as well as the percent of requests with TTFT greater than a threshold.
Weaknesses
- The technical discussion could be made clearer, especially Section 4. Giving better introductions to the different sections, where the main ideas/intuitions/take-aways are explained clearly, would be quite helpful before diving into the low-level technical details, notation, and theoretical results.
Note: I do not actively work on caching algorithms, so my confidence is medium.
Questions
- What existing caching literature has continuous metrics (like TTFT) as opposed to discrete metrics like cache hit vs miss? Also, what examples in the caching literature require varying amounts of storage for different elements of the cache? Has any prior work introduced tail-excess latency?
- I believe the first while loop in Algorithm 1 should be "while \sum X_i > C and there exists i such that X_i >= L_i + \hat{Q}_i - \xi". Is this correct?
Limitations
Yes
Final Justification
I continue to believe this is a nice contribution to the prompt caching literature. Method is simple, and improves an important metric (amount of TTFT time above threshold). I will keep my score.
Formatting Issues
None
We thank the reviewer for thoughtful feedback and constructive suggestions. We are grateful that the reviewer found our work well motivated, and our policy simple, easy to implement and has strong empirical results. Below we address the concerns raised.
Related work.
All titles are given in full because links are not allowed in the rebuttal.
-
Continuous metrics: “Darwin: Flexible Learning-Based CDN Caching” (Chen et al., 2023) optimizes for both hit rate and disk writes, while “RobinHood: Tail Latency-Aware Caching — Dynamically Reallocating Cache Resources from the Cache-Rich to the Cache-Poor” (Berger et al., 2018) and “SNC-Meister: Admitting More Tenants with Tail Latency SLOs” (Zhu et al., 2016) target tail latency.
-
Caching work with variable-size objects: prompt caching works certainly fall in this category: “Prompt cache: Modular attention reuse for low-latency inference” (Gim et al., 2023), “On Optimal Caching and Model Multiplexing for Large Model Inference” (Zhu et al., 2023), and “SGLang: Efficient Execution of Structured Language Model Programs” (Zheng et al., 2023); see more in the related work section of our paper. Content delivery networks also require varying amounts of storage for different cached objects: “AdaptSize: Orchestrating the Hot Object Memory Cache in a Content Delivery Network” (Berger et al., 2017). In the theoretical computer science literature, letting cost(p) denote the cost of a miss on object p and size(p) denote the size of object p, the following types of caching models have been studied:
- uniform model (cost(p) = size(p) = 1): “Competitive paging algorithms” (Fiat et al., 1991); “A study of replacement algorithms for a virtual-storage computer” (Belady 1966);
- cost model (size(p) = 1): “A primal-dual randomized algorithm for weighted paging” (Bansal et al., 2012); “New results on server problems” (Chrobak et al., 1991);
- fault model (cost(p) = 1): “Caching is hard—even in the fault model” (Chrobak et al., 2012); “Practical bounds on optimal caching with variable object sizes” (Berger et al., 2018);
- bit model (cost(p) = size(p)): “Page replacement with multi-size pages and applications to web caching” (Irani 1996);
- general model: “Page Replacement for General Caching Problems” (Albers et al., 1999); “Randomized competitive algorithms for generalized caching” (Bansal et al., 2008).
In these works, the goal is to study the computational complexity of offline caching and the competitive ratio of online policies. The fault, bit, and general models also allow variable-size objects.
-
Tail-excess latency: To our knowledge, tail-excess latency is a novel metric, inspired by conditional value at risk and (P90/P99) percentile tail latency. The closest works, “Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve” (via chunked prefill and stall-free scheduling) and “DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving”, target percentile tail latency rather than our above-threshold metric.
To sum up, existing works have studied continuous metrics beyond binary cache hit/miss as well as variable-size objects, but we are the first to use prompt caching as the primary lever to optimize TTFT tail latency, and the first to provide theoretical guarantees for tail-excess latency. We will add the above discussion in the revised paper to clarify the gap our work fills.
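For clarity, the tail-excess latency we refer to throughout can be written as

TEL(\xi) = \sum_{r \in R} \max(0, TTFT_r - \xi),

where \xi is the TTFT threshold and R is the set of requests. This is a formalization consistent with the description above (total TTFT above the threshold, summed across requests), though the exact notation in the paper may differ slightly.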
Algorithm 1 loop condition.
Thanks for pointing this out. We have edited this to avoid infinite looping.
Improve technical discussion.
We appreciate the suggestions. We will add the following high-level intuitions before diving into the model and proofs:
To model multi-turn conversations, we need to specify:
1) when does a new conversation start?
2) when will a conversation speak again?
3) how many tokens does the prompt/response have?
4) when will a conversation end?
For 1), we assume new conversations are “born” according to a Poisson process. For 2), we assume each conversation generates prompts following another Poisson process. For 3), we assume prompt and answer lengths follow fixed distributions. For 4), we give each conversation an exponential death clock: the longer it stays quiet, the less likely it is to come back. This matches what we observe empirically in WildChat traces. Under our model, least recently used implies least likely to return, so when a conversation “dies”, its cached blocks can be safely discarded. Our policy T-LRU keeps the LRU ordering but prioritizes conversations whose next arrival would breach the TTFT latency threshold. In this section, we will show that a generalized version of T-LRU that minimizes expected TEL is optimal under our stochastic conversation model.
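For intuition, a schematic version of the T-LRU eviction step might look like the sketch below (this is an illustrative paraphrase only, not a verbatim copy of Algorithm 1 in the paper; the `Conversation` fields, the linear per-block latency proxy `alpha`, and the tie-breaking details are assumptions made for the example):

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    cached_blocks: int             # blocks currently held in the KV cache
    expected_uncached_blocks: int  # estimated uncached work of the next request

def evict(conversations, capacity, xi, alpha):
    """Free cache space, evicting last from conversations whose next request
    is predicted to breach the TTFT threshold xi.

    `conversations` is ordered from least- to most-recently used;
    `alpha` is an assumed per-block prefill latency used as a TTFT proxy.
    """
    def predicted_ttft(conv):
        return alpha * conv.expected_uncached_blocks

    # Plain LRU order, except that "tail" conversations (predicted TTFT above
    # xi) are protected: evict from the non-tail ones first.
    eviction_order = sorted(
        range(len(conversations)),
        key=lambda i: (predicted_ttft(conversations[i]) > xi, i),
    )

    used = sum(c.cached_blocks for c in conversations)
    for i in eviction_order:
        if used <= capacity:
            break
        conv = conversations[i]
        freed = min(conv.cached_blocks, used - capacity)
        conv.cached_blocks -= freed
        conv.expected_uncached_blocks += freed
        used -= freed

# Example: three conversations, least- to most-recently used; 30 blocks must go.
convs = [Conversation(40, 5), Conversation(40, 80), Conversation(40, 10)]
evict(convs, capacity=90, xi=0.2, alpha=0.01)
print([c.cached_blocks for c in convs])  # [10, 40, 40]: the LRU non-tail conversation is evicted first
```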
Thank you very much for the response! No further questions at the moment.
Thanks again for your thoughtful comments!
This paper proposes Tail-Optimized LRU (T-LRU), a novel caching policy for LLM inference, specifically designed to reduce tail latency. Prompt caching is crucial for reducing latency and cost in LLM inference, but existing policies like Least Recently Used (LRU) can perform poorly on tail latency due to the conversation length heterogeneity. The paper presents a theoretically sound and empirically effective solution to an important problem in LLM inference. The proposed Tail-Optimized LRU policy is a simple modification to existing LRU systems that significantly improves tail latency and SLO compliance. The introduction of Tail Excess Latency as a novel metric and the proof of optimality under a new stochastic conversation model demonstrate meaningful theoretical contributions, which are valuable for the NeurIPS community.
While some reviewers initially raised concerns about the clarity of technical discussions, experimental scope, and modeling assumptions, the authors provided thorough rebuttals. They clarified existing figures, provided extensive new experimental data, detailed dataset statistics, and explained their modeling choices and the practical implications of parameters like Q. The discussion of the median-vs-tail latency trade-off was also well-articulated and justified from a user experience perspective.
The concern about the paper's fit for NeurIPS, while acknowledged, is mitigated by the depth of theoretical analysis and the generalizable insights into caching policy design for modern interactive AI systems. The work bridges theoretical foundations with practical system optimization, making it an interesting contribution. The overall positive sentiment from the majority of reviewers, combined with the satisfactory author responses, leads to a recommendation of Accept.