MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
Abstract
Reviews and Discussion
The paper proposes MPCache, a novel KV cache eviction strategy for LLM inference performed via MPC. The authors thoroughly review and compare the state of the art and justify why their strategy works.
Strengths and Weaknesses
It's a good paper, with a clear framework for why it is designed the way it is, and an explanation of how speed is improved via the KV eviction strategy.
Strengths:
- Clear explanation of why the strategy works
- The extensive experiments demonstrate significant improvements over existing methods: MPCache achieves up to a 2.01× reduction in latency and an 8.37× reduction in communication.
Weaknesses:
- No explicit limitations section: The paper does not have a dedicated "Limitations" section, and the authors state in the checklist that the paper has no limitations. MPC fundamentally has limitations due to communication complexity and is not widely used in AI deployments for that reason. More generally, even with these speedups, it is unclear whether anyone would be willing to accept the complexity of MPC inference.
- Assumption of Architectural Knowledge: The security model assumes that all parties are aware of the model architecture and the number of pruned tokens. This is fine and aligns with past literature, but it forgoes significant privacy.
Questions
Please address how the speed tradeoffs here fit into the limitations of the broader approach.
Limitations
No, see above.
Final Justification
The authors have vigorously and thoroughly responded to all the reviewer comments.
While there are still challenges with the underlying pragmatic utility of this scope of research, they have argued for its novelty and soundness.
I am happy to increase my rating to accept (from borderline accept) and retain my confidence.
Formatting Issues
No
We sincerely thank Reviewer 5USt for your thoughtful feedback!
Q1: No explicit limitations section and address how the speed tradeoffs fit into the limitations of the broader approach.
A1: Thanks for your valuable advice! We agree that MPC incurs more communication overhead and is still an active research area, but we believe MPC has strong potential for the following reasons: 1) In the era of LLMs, data privacy concerns have become more severe, and there is strong demand for privacy protection in LLM services. For example, the General Data Protection Regulation (GDPR) provides policy-level protection for data privacy; 2) MPC provides cryptographically strong privacy protection for both data and model parameters and is independent of the hardware platform; 3) In recent years, there has been a significant amount of research on MPC inference, especially for LLMs [1-10].
Through our optimizations, MPCache achieves up to 2.01× latency and 8.37× communication reduction. In addition, MPCache is orthogonal to both the protocol and the hardware, and thus can be integrated with future advanced protocols and hardware to further enhance usability. For example, according to Piranha [11], GPUs can accelerate MPC by 16-48×. Therefore, we believe this research has important academic value and can promote the practical adoption of MPC.
In our revised version, we will add a limitations section to elaborate on the gap between private inference and its plaintext counterpart.
Q2: Assumption of architectural knowledge.
A2: Thanks for your thoughtful consideration. Our threat model is consistent with previous studies, in which the model architecture and pruning ratio are public. We believe this is reasonable in real-world applications, where users typically know the model architecture and pruning ratio (or speedup) when using the service. For example, Gemini [12] discloses the model architecture and optimizations while keeping the model parameters private.
Importantly, our adopted threat model does not compromise the client's data, nor does it enable the client to access the model's parameters, which is aligned with the security goal of MPC.
References:
[1] Zhang, Yancheng, et al. "CipherPrune: Efficient and Scalable Private Transformer Inference." ICLR 2025.
[2] Lu, Wen-jie, et al. "Bumblebee: Secure two-party inference framework for large transformers." NDSS 2025.
[3] Li, Zhengyi, et al. "Nimbus: Secure and Efficient Two-Party Inference for Transformers." NeurIPS 2024.
[4] Pang, Qi, et al. "Bolt: Privacy-preserving, accurate and efficient inference for transformers." 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 2024.
[5] Rathee, Deevashwer, et al. "MPC-Minimized Secure LLM Inference." arXiv preprint arXiv:2408.03561 (2024).
[6] Zheng, Fei, et al. "PermLLM: Private Inference of Large Language Models within 3 Seconds under WAN." arXiv preprint arXiv:2405.18744 (2024).
[7] Li, Dacheng, et al. "Mpcformer: fast, performant and private transformer inference with mpc." ICLR. 2023.
[8] Zeng, Wenxuan, et al. "Mpcvit: Searching for accurate and efficient mpc-friendly vision transformer with heterogeneous attention." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[9] Hou, Xiaoyang, et al. "Ciphergpt: Secure two-party gpt inference." Cryptology ePrint Archive (2023).
[10] Dong, Ye, et al. "Puma: Secure inference of llama-7b in five minutes." arXiv preprint arXiv:2307.12533 (2023).
[11] Watson, Jean-Luc, Sameer Wagh, and Raluca Ada Popa. "Piranha: A GPU platform for secure computation." 31st USENIX Security Symposium (USENIX Security 22). 2022.
[12] Comanici, Gheorghe, et al. "Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities." arXiv preprint arXiv:2507.06261 (2025).
Dear Reviewer 5USt,
I hope this message finds you well.
We sincerely appreciate your recognition of our work. As the discussion period is nearing its end, with less than 6 hours remaining, we want to ensure we have addressed all your concerns satisfactorily. If there is any additional feedback you would like us to consider, please let us know.
Thank you once again for your time and effort in reviewing our paper.
Sincerely,
The Authors of Paper 10762
The authors mainly address the high latency issue in securely performing KV cache eviction. To reduce the cost of similarity computation, they propose cache clustering, where similarity is calculated at the cluster level and refined progressively in a coarse-to-fine manner. To improve effectiveness, instead of using the mean of each cluster, they measure similarity using the maximum and minimum key values within the cluster. Additionally, they further enhance efficiency by exploiting the similarity of eviction patterns across adjacent layers, allowing similarity computation to be performed only on selected layers. Experiments conducted on various datasets and architectures demonstrate that the proposed method is both more efficient and effective compared to existing methods.
Strengths and Weaknesses
Strengths
- The authors propose a novel and efficient method using cache clustering.
- They conduct extensive experiments to demonstrate the effectiveness of the proposed method.
Weaknesses
- The proposed method appears to achieve approximately a 2× speedup in Figure 10. However, according to Figure 3(a), static eviction also shows no performance degradation up to 60%. Therefore, when compared to static eviction methods applied at a 60% rate, the proposed method might offer similar performance but little to no advantage in terms of speed improvement, since static eviction approaches are more efficient than dynamic eviction. Therefore, a comparison with static eviction at a 50–60% ratio in terms of both latency and performance is necessary.
- Since the method requires additionally storing the minimum and maximum values for each cluster, it is expected to incur higher memory usage. In large models with long sequence lengths, this overhead could become significant. It would be helpful to include a comparison of memory consumption to assess this impact.
- While the improvement in latency over existing methods is understandable, the reason for the performance improvement is not convincing at a high level. It would be helpful if the authors could provide a high-level analysis explaining why the proposed method outperforms the dynamic KV cache baseline in terms of effectiveness.
Questions
- The reduction in communication cost is significant, but the corresponding latency acceleration appears to be relatively limited. Could the authors provide an analysis explaining this discrepancy?
- Figures 7 and 8 appear after Figure 9 and are placed far from the text where Figure 7 is first mentioned. It would be better to relocate them closer to their first reference for improved readability.
Limitations
The paper does not explicitly address its limitations.
Final Justification
While the latency improvement is somewhat limited, the method is novel and meaningful in that it reduces both latency and memory usage. Therefore, I have raised my score to 4: borderline accept.
Formatting Issues
No formatting issues were found.
We sincerely thank Reviewer kcBh for your thoughtful feedback!
Q1: Comparison with static eviction at a 50–60% ratio in terms of latency and performance.
A1: Thank you for the insightful comment. First, we would like to clarify that the latency of a Transformer consists of attention and MLP, and KV cache eviction only optimizes attention. The 2× latency reduction shown in Figure 10 is an end-to-end result, and thus it is not equivalent to 50% static eviction.
In fact, achieving the ~2× latency reduction requires compressing the KV cache to less than 10%. As shown in the table below, solely applying 60% static eviction leads to limited efficiency improvement (i.e., only 1.21× latency and 1.83× communication reduction), which is significantly lower than MPCache. Meanwhile, using only static eviction to reduce latency by ~2× would result in substantial accuracy degradation (this can also be observed for SnapKV in Figure 9).
| Method (Reserved KV Cache) | Latency (s) | Communication (GB) | Accuracy (%) |
|---|---|---|---|
| Full KV Cache (100%) | 75.52 | 1.82 | 26.54 |
| Pure Static Eviction (40%) | 62.12 (1.21×) | 0.99 (1.83×) | 25.56 |
| Pure Static Eviction (8%) | 40.32 (1.87×) | 0.36 (5.06×) | 16.80 (Significant Drop) |
| MPCache (8%) | 44.39 (1.70×) | 0.41 (4.44×) | 23.86 |
In contrast, MPCache builds upon static eviction and further performs dynamic selection to identify the most important KV cache entries. This allows for more aggressive compression, especially of the communication cost (i.e., 1.70× latency and 4.44× communication reduction), while maintaining accuracy, achieving significantly higher accuracy than static eviction alone.
We will include these clarifications and the comparison results in our revised version.
Q2: This work requires additionally storing the min and max for each cluster, it is expected to incur higher memory usage.
A2: Thanks for your thoughtful consideration. We would like to clarify that the memory usage introduced by storing min/max values is negligible compared to the KV cache. Specifically, the KV cache has a dimension of $2 \times h \times n \times d$ (keys and values), where $h$, $n$, and $d$ are the head number, token number, and hidden dimension per head, respectively. In contrast, the min/max key values are stored per cluster, resulting in a much smaller dimension of $2 \times h \times (n/c) \times d$, where $c$ denotes the cluster size. This means the min/max values require only $1/c$ of the memory of the KV cache. In our experiments, the cluster size is relatively large, i.e., $c = 32$, resulting in only ~3% of the KV cache size.
For example, for a Llama-2-7B model and a sequence length of 16K, storing the KV cache requires 7.8 GB, while storing the min and max values requires only 250 MB (in FP16 precision), which is ~3% of the KV cache size.
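For concreteness, here is a minimal back-of-the-envelope sketch of this estimate. It assumes standard Llama-2-7B dimensions (32 layers, 32 KV heads, head dimension 128) and our cluster size of 32; the script is purely illustrative and not part of MPCache.

```python
# Back-of-the-envelope storage estimate (FP16 = 2 bytes per element).
# Assumed Llama-2-7B dimensions: 32 layers, 32 KV heads, head dim 128, 16K context.
layers, kv_heads, head_dim, seq_len, bytes_fp16 = 32, 32, 128, 16 * 1024, 2
cluster_size = 32

# Full KV cache: keys AND values for every token in every layer/head.
kv_cache_bytes = 2 * layers * kv_heads * seq_len * head_dim * bytes_fp16

# Cluster metadata: element-wise min AND max key vectors, one pair per cluster.
minmax_bytes = 2 * layers * kv_heads * (seq_len // cluster_size) * head_dim * bytes_fp16

print(f"KV cache: {kv_cache_bytes / 2**30:.1f} GiB")          # ~8.0 GiB
print(f"min/max : {minmax_bytes / 2**20:.0f} MiB "            # ~256 MiB
      f"({100 * minmax_bytes / kv_cache_bytes:.1f}% of the KV cache)")
```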
Since the min/max value requires minimal storage, and communication and running time are the dominant bottleneck in current MPC protocols, we prioritize communication and latency reduction in this work. We will add more explanations regarding the memory usage in our revised version.
Q3: The reason for the performance improvement is not convincing at a high level.
A3: We analyze the effectiveness of MPCache at a high level as follows:
- Comparison with average-based clustering methods. The widely adopted dynamic KV cache approach represents each cluster by the average of all tokens within it. However, averaging smooths out token importance, making it difficult to retain highly informative tokens under a given KV cache budget. In contrast, MPCache is guided by the principle that similarity approximation should preserve the influence of important tokens within each cluster. We therefore build a novel similarity approximation that uses the maximum dot product between the query and key clusters, rather than the average, to better approximate token importance (a plaintext sketch is given below). This intuition is empirically validated across different models and datasets, showing that our design improves the selection of important tokens under a constrained KV cache budget.
- Comparison with non-clustering (token-wise) methods. In long-context scenarios, prior studies have observed attention noise [1-3], where the dot product between the query and key becomes less reliable in reflecting token importance when the sequence is very long. Hence, compared to these token-wise similarity approximation methods, MPCache uses cluster-wise granularity to reduce the approximation complexity and mitigate attention noise to some extent, enabling more robust and accurate token selection, especially for long sequences.
In conclusion, MPCache improves performance by using a novel similarity approximation that effectively preserves clusters with influential tokens, which is especially useful for long-context scenarios. We will clarify these motivations and high-level insights in the revised version.
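To make the high-level argument concrete, below is a minimal plaintext PyTorch sketch of a bounding-box style score of the kind described above: for each cluster, the element-wise min/max key vectors bound the largest dot product any key in that cluster can achieve with the query. The function name, shapes, and the plaintext `topk` call are illustrative only; MPCache evaluates the corresponding steps under MPC protocols.

```python
import torch

def cluster_score(q, k_min, k_max):
    """Upper bound on q·k for any key k with k_min <= k <= k_max (element-wise).

    q:            (d,)            query vector
    k_min, k_max: (n_clusters, d) per-cluster element-wise key min / max

    For each coordinate, the largest possible contribution is
    max(q_i * k_min_i, q_i * k_max_i); summing over coordinates gives a
    bounding-box style upper bound on the dot product with any key in the cluster.
    """
    contrib = torch.maximum(q * k_min, q * k_max)   # (n_clusters, d)
    return contrib.sum(dim=-1)                      # (n_clusters,)

# Toy usage: rank clusters by the bound and keep the most promising ones.
d, n_clusters = 8, 16
q = torch.randn(d)
k_min = torch.randn(n_clusters, d) - 1.0
k_max = k_min + torch.rand(n_clusters, d)           # ensure k_max >= k_min
scores = cluster_score(q, k_min, k_max)
top_clusters = torch.topk(scores, k=4).indices      # indices of retained clusters
```

Unlike an average-based score, this bound cannot hide a single highly influential key inside an otherwise unimportant cluster, which is the intuition behind preserving important tokens.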
Q4: The reduction in communication cost is significant, but the corresponding latency acceleration is relatively limited.
A4: The primary reason for this discrepancy is that the communication scales proportionally with the input dimension, resulting in a significant decrease when the KV cache is compressed. In contrast, latency is influenced by multiple factors, including computation time, the number of communication rounds in MPC, and network setups such as bandwidth and transmission delay. These components do not scale linearly with input size, thereby limiting the overall latency reduction.
Furthermore, as our current implementation runs on CPU, leveraging GPU to accelerate computation in future work could yield higher latency gains.
We will clarify these points and include this discussion in our revised version.
Q5: It would be better to relocate the figures closer to the first reference for improved readability.
A5: Thanks for your kind advice! We will adjust the locations of figures and texts carefully in our revised version.
Lastly, thanks so much for helping us improve this work through your valuable feedback. If you have any additional questions or anything you would like to discuss further, please feel free to let us know. If you find that we have addressed your concerns, please kindly consider re-evaluating our work. Thank you once again for your time and effort!
References:
[1] Ye, Tianzhu, et al. "Differential transformer." arXiv preprint arXiv:2410.05258 (2024).
[2] Adnan, Muhammad, et al. "Keyformer: Kv cache reduction through key tokens selection for efficient generative inference." Proceedings of Machine Learning and Systems 6 (2024): 114-127.
[3] Wang, Xue, et al. "S3attention: Improving long sequence attention with smoothed skeleton sketching." IEEE Journal of Selected Topics in Signal Processing (2024).
I appreciate the detailed response. All of my concerns have been addressed, and I am happy to raise my score to 'borderline accept'.
Dear Reviewer kcBh,
Thank you very much for your support and for your willingness to raise the score! We truly appreciate your thoughtful engagement throughout the review process, and we will incorporate all the discussions and data into our final version.
Sincerely,
The Authors of Paper 10762
This paper presents an MPC-friendly KV cache eviction framework. MPC is a cryptographic technique that allows multiple parties to jointly perform a computation without leaking the plaintext. However, some operations are inefficient in MPC in terms of computation latency and communication, such as comparisons and non-linear operations. Existing KV cache eviction methods designed for plaintext computation are not efficient in MPC. This paper studies KV cache eviction in the context of MPC. The proposed scheme combines static eviction and dynamic selection. It reduces the number of expensive max and similarity approximation operations, and also includes optimizations like hierarchical clustering and a cross-layer index-sharing strategy. The evaluation shows the proposed eviction outperforms baseline KV cache eviction mechanisms in both model performance and latency.
Strengths and Weaknesses
Strengths
- (Significance and Originality) The first study on KV cache eviction in the MPC framework. This work will make the decoder models more feasible in MPC. I like the proposed schemes that reduce the number of max operations.
- (Quality) The proposed optimizations are derived from observations in detailed performance profiling. The paper is well-written and supported by solid experiments.
Weaknesses
- (Quality) The paper can provide more details on the performance improvement due to each of the optimizations. The current results do not give insights into the impact of each optimization.
- (Quality) The evaluation metrics and how they are measured are not stated.
- (Quality) The evaluation does not test under different network setups. If the authors cannot evaluate the framework under different network settings, they should at least provide the computation and communication breakdown (i.e., the percentage of time spent in communication vs. computation, or whether they are pipelined).
- (Clarity) Some figures miss necessary legends.
-- For example, Figure 1 does not indicate what each color means in the breakdown.
-- In Figure 1(d), the KV cache choice of the eviction baseline [47] can be biased. Table 3 shows that baseline [47] has worse latency than the baseline InfLLM.
-- It should also be stated clearly that the F1 score is used as the model performance metric.
Questions
- Please explain the color coding in Figure 1.
- What is the implementation like in Figure 1(b) and (c)? Is a KV cache used? If so, what KV cache eviction algorithm is used?
- Is the latency in Table 3 per inference or for X number of queries? How the numbers are measured is not clearly stated in the paper or in the evaluation.
- Can you provide a further breakdown of "model" in Figure 11? Earlier in the paper, you mentioned that softmax accounts for the majority of the cost. How does this change after your KV cache optimizations?
- Can you provide insights on what the performance would be like if the network has longer latency or lower bandwidth?
Limitations
I agree with the authors that the proposed scheme is general. However, the resulting MPC is still slow for practical use. A discussion on MPC for decoder models can be added.
Final Justification
Thanks for the response. My clarification questions are addressed. For practical use, the concerns remain, and I will keep my score. In general, I still support the paper.
Formatting Issues
no
We sincerely thank Reviewer suRu for your thoughtful feedback!
Q1: Provide more details on the performance improvement due to each of the optimizations.
A1: Thanks for your advice! We summarize the impact of each optimization below from the perspective of accuracy and efficiency.
- Static eviction algorithm: unimportant KV cache entries are pruned based on attention scores with negligible overhead. It does not affect the prefill stage and retains only a partial KV cache after prefill, improving overall efficiency by ~60% without degrading accuracy.
- Dynamic selection algorithm: 1) similarity approximation optimization: a clustering-based optimization that preserves the impact of important tokens within a cluster as much as possible, making our selection both accurate and efficient (further improving overall efficiency by ~60%); 2) linearization & reordering and hierarchical clustering optimizations: these improve the efficiency of similarity approximation by removing MPC-expensive operators and enabling progressive selection (reducing the KV cache selection overhead by ~80%) without significantly affecting accuracy; 3) cross-layer index sharing optimization: due to the high similarity between adjacent layers observed in Section 3, this optimization directly reduces the overhead of KV cache selection by 50% without degrading accuracy (a toy sketch of this step is given after the next paragraph).
In conclusion, our optimizations are supported either by theoretical analysis or by empirical observations. These optimizations work together to improve the efficiency while preserving accuracy. Experimental analysis can be found in our step-by-step ablation studies in Section 5.4. We will add more discussions in our revised version.
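As referenced in the third bullet, below is a toy plaintext sketch of cross-layer index sharing. The helper name and the `share_every` parameter are hypothetical and only illustrate why the top-k ranking cost (the MPC-expensive part) is roughly halved.

```python
import torch

def select_with_index_sharing(scores_per_layer, k, share_every=2):
    """Toy illustration of cross-layer index sharing (hypothetical helper).

    scores_per_layer: list of (n_tokens,) importance scores, one entry per layer.
    Top-k selection (expensive in MPC) is only computed on every `share_every`-th
    layer; the adjacent layers simply reuse the previously computed indices.
    """
    shared = None
    indices_per_layer = []
    for layer, scores in enumerate(scores_per_layer):
        if layer % share_every == 0 or shared is None:
            shared = torch.topk(scores, k).indices   # pay the top-k cost here
        indices_per_layer.append(shared)             # reuse on the adjacent layer
    return indices_per_layer

# With share_every=2, only half of the layers perform top-k ranking,
# which is what reduces the KV cache selection overhead by ~50%.
layers = [torch.randn(1024) for _ in range(32)]
indices = select_with_index_sharing(layers, k=128)
```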
Q2: The evaluation metrics and how it is measured are not stated.
A2: Thanks for your suggestion! The F1 score used in LLM generation tasks is calculated as the harmonic mean of precision and recall at the token level: $\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, where precision measures the proportion of tokens in the generated output that also appear in the answer, and recall measures the proportion of tokens in the answer that are correctly included in the generated output.
In this paper, we follow LongBench [11] to measure the metrics, and we will state the metric and measurement details more clearly in our revised version.
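For reference, a simplified sketch of the token-level F1 described above (whitespace tokenization only; the actual LongBench evaluation additionally normalizes case and punctuation):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 = harmonic mean of precision and recall (simplified)."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)   # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "the cat is on the mat"))  # ~0.83
```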
Q3: The evaluation does not test under different network setups.
A3: Thanks for your valuable advice! We provide the breakdown of computation vs. communication latency to better understand the protocols. We evaluate the 3PC protocol on Llama-2-7B, generating 1 token with a sequence length of 512, as shown in the table below:
| Network | Total Lat. (s) | Comm. (GB) | Comm. Lat. (s) | Comput. Lat. (s) | + MPCache |
|---|---|---|---|---|---|
| LAN | 51.2 | 1.01 | 2.679 (5.23%) | 48.521 (94.77%) | 25.5s (2.0×) / 0.24GB |
| WAN | 73.8 | 1.01 | 25.25 (34.2%) | 48.521 (65.8%) | 31.2s (2.3×) / 0.24GB |
As observed, the communication improvement remains unchanged under different network setups. In contrast, the latency improvement can vary depending on the network setup. For WAN with lower bandwidth, the improvement can be slightly higher, i.e., 2.3×, since KV cache eviction directly reduces the communication volume.
We will include more comprehensive experiments and analysis about the network setups in our revised version.
Q4: Some figures miss necessary legends; please explain the color coding in Figure 1.
A4: Thanks for pointing out our oversights! We have carefully revised the writing to make our paper clearer. In Figure 1, the blue bar denotes attention, the orange bar MLP, and the green bar LayerNorm; the hatched purple, yellow, and green bars denote similarity approximation, top-k ranking, and token gathering, respectively.
Q5: What is the implementation in Figure 1(b) and (c)? Is a KV cache used, and what KV cache eviction algorithm is used?
A5: Figure 1(b) uses both the 2PC and 3PC protocols on GPT-2 with the Secretflow framework. Figure 1(c) uses the 3PC protocol to evaluate Softmax. A KV cache is used here to avoid redundant re-computation (refer to Appendix A.3 for details). However, no KV cache eviction algorithm is used in Figures 1(b) and (c), since these figures profile the efficiency bottleneck with the full cache, which motivates us to compress the KV cache.
We will carefully improve our writing to make the paper clearer in our revised version.
Q6: Latency evaluation in Table 3.
A6: The latency in Table 3 is measured by generating one token on Llama-2 given one input with a sequence length of 1024 tokens. We will clarify all the measurements in our revised version.
Q7: Provide a further breakdown of "model" in Figure 11.
A7: We show the model-level communication breakdown with a sequence length of 1024 in the table below. As observed, before optimization, Softmax dominates the total cost. After optimization, Softmax overhead becomes very small, leading to significant overall efficiency improvement. We will include these experiments in our revised version.
| Method | Softmax (MB) | MLP (MB) | Others (MB) | Total (MB) |
|---|---|---|---|---|
| Full Cache | 1242.7 | 184.32 | 436.66 | 1863.68 |
| + MPCache | 68.19 | 184.32 | 23.97 | 276.48 |
Q8: The resulting MPC is still slow for practical use. A discussion on MPC for decoder models can be added.
A8: Thanks for your valuable advice! We agree that MPC incurs more communication overhead and is still an active research area, but we believe MPC has strong potential. MPC has been used in practical applications such as healthcare [12] and finance [13]. In recent years, there has been a significant amount of research on MPC inference for LLMs, which is even more complex and expensive [1-10]. Through our optimizations, MPCache achieves up to 2.01× latency and 8.37× communication reduction.
In addition, MPCache is orthogonal to both the protocol and the hardware, and thus can be integrated with future advanced protocols and hardware to further enhance usability. For example, according to Piranha [14], GPUs can accelerate MPC by 16-48×. Therefore, we believe this research has important academic value and can promote the practical adoption of MPC.
In our revised version, we will add more discussion and a limitations section regarding MPC-based decoder models.
References:
[1] Zhang, Yancheng, et al. "CipherPrune: Efficient and Scalable Private Transformer Inference." ICLR 2025.
[2] Lu, Wen-jie, et al. "Bumblebee: Secure two-party inference framework for large transformers." NDSS 2025.
[3] Li, Zhengyi, et al. "Nimbus: Secure and Efficient Two-Party Inference for Transformers." NeurIPS 2024.
[4] Pang, Qi, et al. "Bolt: Privacy-preserving, accurate and efficient inference for transformers." 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 2024.
[5] Rathee, Deevashwer, et al. "MPC-Minimized Secure LLM Inference." arXiv preprint arXiv:2408.03561 (2024).
[6] Zheng, Fei, et al. "PermLLM: Private Inference of Large Language Models within 3 Seconds under WAN." arXiv preprint arXiv:2405.18744 (2024).
[7] Li, Dacheng, et al. "Mpcformer: fast, performant and private transformer inference with mpc." ICLR. 2023.
[8] Zeng, Wenxuan, et al. "Mpcvit: Searching for accurate and efficient mpc-friendly vision transformer with heterogeneous attention." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[9] Hou, Xiaoyang, et al. "Ciphergpt: Secure two-party gpt inference." Cryptology ePrint Archive (2023).
[10] Dong, Ye, et al. "Puma: Secure inference of llama-7b in five minutes." arXiv preprint arXiv:2307.12533 (2023).
[11] Bai, Yushi, et al. "Longbench: A bilingual, multitask benchmark for long context understanding." arXiv preprint arXiv:2308.14508 (2023).
[12] Ahammed, Md Fahim, and Md Rasheduzzaman Labu. "Privacy-preserving data sharing in healthcare: advances in secure multiparty computation." Journal of Medical and Health Studies 5.2 (2024): 37-47.
[13] Salako, Ademola Oluwaseun, et al. "Securing confidentiality in distributed ledger systems with secure multi-party computation for financial data protection." Journal of Engineering Research and Reports 27.3 (2025): 352-373.
[14] Watson, Jean-Luc, Sameer Wagh, and Raluca Ada Popa. "Piranha: A GPU platform for secure computation." 31st USENIX Security Symposium (USENIX Security 22). 2022.
Thanks for the response. My clarification questions are addressed. For practical use, the concerns remain, and I will keep my score. In general, I still support the paper.
This work proposes a framework for properly integrating KV caching into MPC-based LLM inference to reduce inference costs. To do this, they use an MPC-friendly similarity approximation, hierarchical KV cache clustering, and a cross-layer index-sharing strategy to reduce the overhead of dynamic token selection. Their experimental results demonstrate strong performance and efficiency results compared to existing KV cache schemes.
Strengths and Weaknesses
Strengths:
- Tackles an important problem of revising the KV cache scheme to be more MPC-friendly
- The writing style of the paper is great, with clear descriptions about the problem setup, motivational observations, and the proposed method
- Lots of figures are utilized in the main text that help provide insight into important claims
- The ideas proposed in the MPC-friendly dynamic KV cache selection section of the paper seem very interesting and innovative
- Code is open-source and available for reproducibility
- Experimental evaluations are comprehensive, on both performance and efficiency, with the proposed method comparable or outperforming existing static and dynamic KV cache methods
Weaknesses: I have no major complaints about the work. I only have some minor complaints that may need to be addressed:
- A legend is needed for Figure 1, as there are colors used in some of the graphs, and I do not know what they correspond to.
- The descriptions for the figures and tables should be more comprehensive and self-contained. For example, for Figure 9 and Table 3, I do not know what models are being evaluated on and what performance metrics are being used.
- For the tables, it would be visually easier to bold face the numbers that are best amongst the others
- It would be insightful to include a non-KV cache baseline for both the performance and efficiency baselines
Questions
- Is Observation 1 a general observation, not specific to the MPC setting?
- Is this work the first to address optimization in MPC-based LLM inference with KV cache?
Limitations
yes
Final Justification
After reading through the other reviewers' and rebuttals, I will still keep my score. The paper presentation is easy to follow, and the core ideas/contributions are interesting. Their rebuttal addressed many of the concerns raised by other reviewers and should be incorporated into their final version. Hence, I am generally supportive of the work.
Formatting Issues
None.
We sincerely thank Reviewer mW9c for your appreciation of our work and thoughtful feedback!
Q1: Legend missing in Figure 1.
A1: Thanks for pointing this out! In Figure 1, the blue bar denotes attention, the orange bar MLP, and the green bar LayerNorm; the hatched purple, yellow, and green bars denote similarity approximation, top-k ranking, and token gathering, respectively. We have added the legend in our revised version.
Q2: The descriptions for the figures and tables should be more comprehensive.
A2: Thanks for your kind advice! Figure 9 and Table 3 are evaluated on LongChat-7B-V1.5-32K on the LongBench [1] benchmark. We will carefully improve the descriptions for the figures and tables in our revised version.
Q3: For the tables, it would be visually easier to bold face the numbers that are best.
A3: Thanks for your suggestion! We will improve the tables to make the results clearer.
Q4: It would be insightful to include a non-KV cache baseline for both the performance and efficiency baselines.
A4: Thank you for your valuable advice! For performance (accuracy), the non-KV cache baseline is identical to the full cache, since the purpose of the KV cache is to avoid repeated computation (please refer to Appendix A.3 for details).
For efficiency, compared with using a KV cache, the non-KV cache baseline incurs hundreds of times more latency, communication, and memory. As an example, the non-KV cache baseline requires more than 800 GB of communication with an input sequence length of 2048. This comparison demonstrates that the KV cache drastically reduces the overhead and further highlights the efficiency gains achieved by MPCache.
| Method | Communication (GB) | Latency (s) |
|---|---|---|
| Non-KV Cache Baseline | 829.4 | 18144 |
| + KV Cache (Full Cache) | 3.460 | 111.4 |
| + MPCache | 0.413 | 61.86 |
We promise to add comprehensive results of the non-KV cache baseline in our revised version.
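To illustrate the recomputation the non-KV cache baseline pays for, here is a minimal plaintext sketch; the weights and shapes are toy placeholders, not the MPC implementation.

```python
import torch

d, T = 64, 8
Wk, Wv = torch.randn(d, d), torch.randn(d, d)   # toy key/value projections
x = torch.randn(T, d)                           # hidden states of the prefix

# Without a KV cache: every decoding step re-projects the ENTIRE prefix,
# so the projection cost grows with the sequence length at every step.
K_full, V_full = x @ Wk, x @ Wv

# With a KV cache: only the newly generated token is projected per step,
# and its key/value vectors are appended to the stored cache.
K_cache, V_cache = K_full.clone(), V_full.clone()
x_new = torch.randn(1, d)                       # hidden state of the new token
K_cache = torch.cat([K_cache, x_new @ Wk], dim=0)
V_cache = torch.cat([V_cache, x_new @ Wv], dim=0)
```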
Q5: Is observation 1 a general observation, not specific to the MPC setting?
A5: Yes. Observation 1 focuses on the KV cache sparsity analysis and classification, which motivates us to combine static and dynamic KV cache compression for LLM efficiency improvement.
Q6: Is this work the first to address optimization in MPC-based LLM inference with KV cache?
A6: Yes. This work is the first research to introduce KV cache for MPC-based LLM inference. Also, this is the first to study long-context inference in MPC. In this paper, to alleviate the heavy overhead introduced by long contexts (i.e., large KV cache size), we further propose a series of MPC-friendly optimizations for KV cache compression to improve the overall efficiency while preserving the accuracy.
References:
[1] Bai, Yushi, et al. "Longbench: A bilingual, multitask benchmark for long context understanding." arXiv preprint arXiv:2308.14508 (2023).
I thank the authors for their response. They have addressed all of my concerns, hence I shall maintain my score.
The paper presents MPCache, a KV cache eviction framework optimized for multi-party computation (MPC)-based private LLM inference. It combines static eviction during the prefill stage, which prunes unimportant tokens, with dynamic, query-aware cluster-based selection at each decoding step to further reduce the KV cache.
Strengths and Weaknesses
Strengths:
- The paper introduces a hybrid static + dynamic KV eviction strategy tailored to MPC constraints
- This is a training-free model that can be directly applied on existing LLMs
Weaknesses:
- Optimizations are largely heuristic and empirical, lacking formal approximation bounds or convergence analysis.
- Generally speaking, the experimental setup is not sufficient. For example, in the efficiency section (Fig. 10), only MHA models like Llama-2 are used, which gives a limited demonstration of how efficient MPCache is on SOTA models.
- In the section discussing index sharing across adjacent layers, the authors missed a citation. This finding has been extensively discussed in https://arxiv.org/abs/2410.05076.
Questions
See weaknesses.
Limitations
See weaknesses.
Final Justification
My concern has been addressed. I have raised my score. Please include the updated experiments in the camera-ready version to ensure the effectiveness of the approach.
Formatting Issues
No.
We sincerely thank Reviewer AHwV for your thoughtful feedback!
Q1: Optimizations are largely heuristic and empirical, lacking formal approximation bounds or convergence analysis.
A1: Thank you for your insightful comment. For the similarity approximation, we have provided rigorous derivation in Eq. (1)-(2) and theoretical analysis in Appendix G, which demonstrate that MPCache better preserves the clusters with large values, i.e., important tokens, and thus, achieves higher accuracy. This analysis is grounded in the concept of bounding volumes from computer graphics [1-4], which we adapt to formally analyze and approximate token importance under constrained KV cache budgets.
For other optimizations proposed by MPCache:
- Cross-layer index sharing is motivated by inter-layer consistency patterns, as observed in Figures 3 and 8.
- Hierarchical clustering utilizes a tree structure to effectively balance efficiency and accuracy (a toy sketch is given below). We provide a computational complexity analysis: let $n$ be the total token number, $m_1$ and $m_2$ be the cluster numbers at the 1st and 2nd levels, and $k$ be the preserved token number. The selection complexity over the full cache is $O(n)$, while hierarchical clustering only requires $O(m_1 + m_2)$ to select the top-$k$ tokens, where $m_1 + m_2 \ll n$ in our experiments.
In conclusion, while some of our optimizations are not derived from formal theory, they are grounded in empirical observations which are consistently verified across a wide range of datasets and models, leading to strong practical credibility. We will incorporate more formal analysis in the revised version to further strengthen our work.
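As referenced above, here is a toy plaintext sketch of the two-level coarse-to-fine selection; the function name and shapes are illustrative, and the real pipeline runs the corresponding steps under MPC.

```python
import torch

def hierarchical_select(scores_l1, scores_l2, k1, k2):
    """Toy two-level coarse-to-fine selection (plaintext illustration).

    scores_l1: (m1,)   importance scores of the 1st-level (coarse) clusters
    scores_l2: (m1, c) scores of the 2nd-level sub-clusters within each coarse cluster
    Only sub-clusters inside the top-k1 coarse clusters are ranked at the fine level,
    so the ranking cost scales with m1 + k1 * c rather than the total token number.
    """
    top_coarse = torch.topk(scores_l1, k1).indices    # coarse ranking over m1 candidates
    fine_scores = scores_l2[top_coarse]               # (k1, c): refine only the winners
    top_fine = torch.topk(fine_scores.flatten(), k2).indices
    return top_coarse, top_fine

coarse, fine = hierarchical_select(torch.randn(64), torch.randn(64, 8), k1=8, k2=16)
```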
Q2: The experiment setup is not sufficient. In Figure 10, only MHA models like Llama-2 are used, which gives a limited demonstration of how efficient MPCache is on SOTA models.
A2: Thank you for the suggestion. MPCache is generally applicable to different model architectures such as Grouped-Query Attention (GQA), used in recent SOTA models, e.g., Llama-3 and Qwen-3. GQA reduces the KV storage and the number of linear projections (e.g., the key/value projections), but the attention computation, i.e., $\mathrm{Softmax}(QK^\top/\sqrt{d})V$, remains the same as in MHA, since each query still attends to all keys and values within its group.
Since our method is designed for MPC, only the computation reduction leads to lower communication and latency, whereas storage savings alone do not. As a result, the benefits of MPCache naturally extend to both MHA and GQA. In the table below, we add results on Llama-3-8B with GQA at different sequence lengths. The results show significant latency and communication improvements, consistent with MHA.
We will supplement our experimental setup and incorporate these results into the revised version.
| Sequence Length | 512 (Lat. /Comm.) | 1024 (Lat. /Comm.) | 2048 (Lat. /Comm.) |
|---|---|---|---|
| Full Cache | 50.1s / 0.98GB | 73.1s / 1.76GB | 108.2s / 3.46GB |
| LongCache (5% Cache) | 56.5s / 0.97GB | 86.3s / 2.03GB | 154.9s / 4.94GB |
| ArkVale (5% Cache) | 37.3s / 0.42GB | 55.6s / 0.54GB | 90.02s / 0.73GB |
| MPCache (5% Cache) | 26.4s / 0.24GB | 38.9s / 0.30GB | 62.54s / 0.42GB |
| Improvement over Full Cache | 1.89× / 4.08× | 1.87× / 5.86× | 1.73× / 8.23× |
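For intuition on why the gains carry over, below is a minimal plaintext GQA sketch (head counts and shapes are illustrative): several query heads share one cached K/V head, so the cache shrinks, but each query still attends over all cached tokens, which is exactly the Softmax(QK^T)V cost that KV eviction reduces.

```python
import torch
import torch.nn.functional as F

B, T, d = 1, 1024, 128
n_q_heads, n_kv_heads = 32, 8              # GQA: several query heads share one KV head
group = n_q_heads // n_kv_heads

q = torch.randn(B, n_q_heads, 1, d)        # one decoding step
k = torch.randn(B, n_kv_heads, T, d)       # KV cache is stored only for n_kv_heads
v = torch.randn(B, n_kv_heads, T, d)

# Each query head still attends over all T cached tokens of its group's K/V,
# so Softmax(QK^T / sqrt(d)) V has the same per-token cost as MHA; only storage shrinks.
k_rep = k.repeat_interleave(group, dim=1)  # (B, n_q_heads, T, d)
v_rep = v.repeat_interleave(group, dim=1)
attn = F.softmax(q @ k_rep.transpose(-1, -2) / d**0.5, dim=-1) @ v_rep
```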
Q3: The authors missed a citation.
A3: Thank you for your kind suggestion! We will include the related paper in our revised version.
Lastly, thanks so much for helping us improve this work through your valuable feedback. If you have any additional questions or anything you would like to discuss further, please feel free to let us know. If you find that we have addressed your concerns, please kindly consider re-evaluating our work. Thank you once again for your time and effort!
References:
[1] Efficient Collision Detection of Complex Deformable Models using AABB Trees. Journal of Graphics Tools, 1997.
[2] Collision Detection Using Axis Aligned Bounding Boxes. Simulations, Serious Games and Their Applications. Springer Singapore, Singapore, 2014.
[3] Bounding volume. Wikipedia, April 2024.
[4] Efficient collision detection using bounding volume hierarchies of k-DOPs. IEEE transactions on Visualization and Computer Graphics 4.1 (1998)
Thanks for authors' replies to my review. My concern has been addressed. I have raised my score. Please include the updated experiments in the camera-ready version to ensure the effectiveness of the approach.
Dear Reviewer AHwV,
Thanks so much for your support and willingness to raise the score! We are glad to know that we have addressed your concerns.
We truly appreciate your thoughtful engagement throughout the review process, and we will incorporate all the discussions and experiments into our final version.
Sincerely,
The Authors of Paper 10762
The paper introduces a framework called MPCache, designed to make private (in the MPC sense) large language model (LLM) inference faster. While secure multi-party computation (MPC) protects user data, it is known to be slow. MPCache speeds up MPC inference through a novel approach to key-value (KV) caching: it combines a static method to discard unimportant cache data with a dynamic method that selects only a small, relevant subset of the cache at each step.
Reviewers found the hybrid approach of MPCache to be a novel and important contribution, as it is the first to address KV cache eviction within the MPC framework. The method is also training-free, meaning it can be applied directly to existing LLMs. Initially, the reviewers raised several concerns. These included an insufficient experimental setup, such as a lack of newer models and tests under different network conditions. They also asked for more details on memory usage, clearer figures, and a better explanation of the performance gains. The authors' responses addressed most of these points, leading several reviewers to raise their scores. Concerns remain about the practicality of the approach, but overall the reviewers find the paper a step forward toward making MPC-based LLM inference efficient. The authors should also include the new experiments and discussions in their revision.