PaperHub
Overall rating: 5.0 / 10 (Poster, 3 reviewers; min 4, max 6, std. dev. 0.8)
Individual ratings: 4, 5, 6
Confidence: 4.0
COLM 2025

X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression

Submitted: 2025-03-20 · Updated: 2025-08-26
TL;DR

Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression

Abstract

Keywords
MLA, Multi-head Attention, LLM, Efficient

Reviews and Discussion

Review (Rating: 4)

This paper presents X-EcoMLA, a post-training distillation framework that enables Multi-head Latent Attention (MLA) in already pre-trained models for KV cache compression. Experiments show that X-EcoMLA can compress KV cache by up to 10.6× with less than 0.1% performance drop, using only a few billion tokens and ~100 GPU hours, thereby offering a more practical path to MLA adoption.
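
For reference when reading the KV-size percentages quoted throughout the reviews and rebuttals, the compression factor and the retained fraction of the original KV cache are simply reciprocals; an illustrative conversion:

```latex
% Compression factor c vs. retained KV-cache fraction f (illustrative):
\[
  f = \frac{1}{c}: \qquad
  c = 10.6\times \;\Rightarrow\; f \approx 9.4\%, \qquad
  c = 16\times \;\Rightarrow\; f = 6.25\%, \qquad
  f = 15.6\% \;\Rightarrow\; c \approx 6.4\times .
\]
```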

Reasons to Accept

  • The post-training alternative to MLA avoids pre-training costs, which makes it more accessible and scalable.
  • The empirical results look good -- the method maintains accuracy at high compression ratios (e.g., ~6.25% of the original KV cache).

Reasons to Reject

  • There are already numerous post-training methods that leverage similar low-rank approximations for MLA or related modules. The paper does not clearly distinguish how its approach differs from or improves upon these existing techniques.
  • The paper does not include comparisons with strong and relevant post-training KV compression methods.
  • Despite the paper’s focus on efficient deployment, it lacks system-level metrics such as latency, throughput, or wall-clock inference time. These are essential to substantiate claims about the practical benefits of KV cache compression and to understand trade-offs beyond accuracy.
Comment

[Answer 2] Comparison with other KV Cache compression techniques

We thank the reviewer for suggesting a comparison with other KV cache compression approaches. We evaluated X-EcoMLA against the popular H2O method using the same base model (LLaMA3.2-1B-Instruct) and identical KV sizes. We report results on the lm-eval-harness benchmark:

| Method | KV size | Average Accuracy | ARC | ARE | HS | MM | OBQA | PIQ | PM | RA | WG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| H2O | 15.6% | 50.30 | 37.71 | 57.41 | 59.91 | 40.83 | 31.6 | 71.11 | 60.4 | 37.99 | 55.8 |
| X-EcoMLA (ours) | 15.6% | 51.97 | 40.1 | 62.88 | 58.17 | 39.7 | 37.8 | 73.5 | 56.6 | 39.33 | 59.67 |
| H2O | 9.4% | 45.05 | 30.03 | 43.01 | 57.79 | 33.25 | 29.6 | 64.96 | 58.8 | 36.08 | 51.93 |
| X-EcoMLA (ours) | 9.4% | 50.49 | 39.16 | 62.63 | 56.04 | 34.9 | 36.4 | 72.85 | 56.4 | 37.7 | 58.33 |
| H2O | 6.25% | 41.30 | 26.54 | 34.68 | 52.75 | 26.95 | 28.6 | 59.03 | 58.6 | 34.26 | 50.28 |
| X-EcoMLA (ours) | 6.25% | 49.74 | 38.48 | 61.66 | 55.32 | 30.62 | 35.2 | 72.36 | 56.6 | 37.99 | 59.43 |

As the results show, across all compression levels, X-EcoMLA consistently outperforms H2O in average accuracy and most individual tasks. For instance, at a KV size of 9.4%, X-EcoMLA achieves 50.49% vs. H2O’s 45.05%, with strong gains in ARC, ARE, and PIQA. These results demonstrate that X-EcoMLA offers superior accuracy under the same memory constraints.

We will include this comparison in the final version of the paper and discuss the implications for memory-efficient inference.
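
For readers who want to reproduce this style of evaluation, a minimal sketch using EleutherAI's lm-evaluation-harness is shown below; the model id, task subset, and batch size are placeholders rather than the exact configuration used above, and the API follows lm-eval >= 0.4, so names may differ across versions.

```python
# Illustrative only: scoring a Hugging Face checkpoint on a subset of
# lm-evaluation-harness tasks similar to those reported in the table above.
# Requires `pip install lm-eval`; the model id below is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",
    tasks=["arc_challenge", "arc_easy", "hellaswag",
           "openbookqa", "piqa", "winogrande"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```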

Comment

[Answer 3] System-Level Inference Metrics

We appreciate the reviewer’s request for system-level metrics such as latency, throughput, and peak memory usage. Below, we report throughput (sequences/sec) and peak GPU memory (GB) for both Llama3.1-8B and X-EcoMLA-8B (kv_rank=128, 10.67× compression) across a range of batch sizes. All measurements were taken on the same hardware (8×AMD-MI300 GPUs) under identical settings.

1. Throughput Comparison

| Batch Size | Llama3.1-8B (seq/s) | X-EcoMLA-8B (seq/s) |
|---|---|---|
| 8 | 368 | 691 |
| 16 | 726 | 1270 |
| 32 | 1329 | 2173 |
| 64 | 1623 | 3160 |
| 128 | 1760 | 3912 |
| 256 | OOM | 4598 |
| 512 | OOM | 5071 |
| 1024 | OOM | 5267 |
  • X-EcoMLA-8B achieves roughly 1.7–2× higher throughput than the baseline at all batch sizes.
  • The baseline model (Llama3.1-8B) runs out of memory (OOM) beyond batch 128, whereas X-EcoMLA continues to scale to batch 1024 without OOM.

2. Peak Memory Usage

| Batch Size | Llama3.1-8B (GB) | X-EcoMLA-8B (GB) |
|---|---|---|
| 8 | 23 | 17 |
| 16 | 31 | 18 |
| 32 | 47 | 19 |
| 64 | 79 | 22 |
| 128 | 143 | 28 |
| 256 | OOM | 40 |
| 512 | OOM | 65 |
| 1024 | OOM | 114 |
  • X-EcoMLA-8B reduces peak memory consumption significantly at small-to-medium batch sizes, enabling much larger batches.
  • For example, at batch=128, Llama3.1-8B uses 143 GB (OOM beyond that), while X-EcoMLA-8B uses only 28 GB, a 5× reduction.

These results demonstrate that X-EcoMLA not only preserves model accuracy but also delivers substantial system-level gains in throughput and memory footprint, making it highly suitable for latency- and memory-constrained deployment scenarios.
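
As an illustration of how such numbers can be collected (not the authors' benchmarking harness), the sketch below times batched greedy generation and reads peak device memory with PyTorch; the model id, prompt/generation lengths, and batch sizes are placeholder assumptions.

```python
# Illustrative benchmarking sketch: throughput in sequences/sec and peak GPU
# memory for batched greedy generation. Model id, prompt/generation lengths,
# and batch sizes are placeholders, not the authors' exact settings.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

def benchmark(batch_size, prompt_len=128, gen_len=128):
    prompts = ["hello " * prompt_len] * batch_size  # identical-length prompts
    inputs = tok(prompts, return_tensors="pt",
                 truncation=True, max_length=prompt_len).to("cuda")
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    model.generate(**inputs, max_new_tokens=gen_len,
                   do_sample=False, pad_token_id=tok.eos_token_id)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    return batch_size / elapsed, peak_gb  # sequences/sec, peak memory (GB)

for bs in (8, 16, 32):
    throughput, peak = benchmark(bs)
    print(f"batch={bs}: {throughput:.1f} seq/s, peak {peak:.1f} GB")
```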

Comment

We appreciate the reviewer’s remarks regarding comparison with existing post-training low-rank methods, comparisons with other KV cache compression techniques, and system-level inference metrics. We have addressed all comments in the rebuttal with detailed analysis and supporting results. We kindly ask the reviewer to consider re-evaluating our work in light of these clarifications and improvements.

[Answer 1] Distinction from Existing Post‐Training Low-Rank Methods

We conducted experiments to compare our solution with two of the most recent SOTA methods, MHA2MLA [1] (released concurrently with our work) and PALU [2], and to clarify how X-EcoMLA differs from and improves upon these approaches.

1. Comparison with MHA2MLA [1]

  • Baseline Setup

    • Student / Teacher: SmolLM 1.7B-Instruct
    • MHA2MLA: We use the joint-SVD + $S_{\text{higher}}$ variant (as provided in their code) on the same SFT data.
  • Key Differences

    • MHA2MLA stores separate RoPE vectors per head. Given a fixed storage budget (e.g., 32 dimensions total for Key-RoPE per token across all heads), each head's RoPE dimension in MHA2MLA becomes $\frac{32}{n_{\text{heads}}}$.
    • X-EcoMLA follows the DeepSeek MLA structure, where all heads share one single Key-RoPE. Thus, with 32 storage units, each head in X-EcoMLA uses the full 32-dimensional RoPE, which is $8\times$ larger (for an 8-head MLA) than MHA2MLA's per-head RoPE. This larger shared RoPE capacity preserves richer positional encoding under compression; a worked example follows below.
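
A worked instance of this budget (our illustration, assuming the 8-head MLA configuration and the 32-dimension Key-RoPE budget mentioned above):

```latex
% Key-RoPE budget of 32 dimensions per token, n_heads = 8:
% MHA2MLA splits the budget across heads, X-EcoMLA shares it.
\[
  d^{\mathrm{MHA2MLA}}_{\mathrm{RoPE/head}}
    = \frac{32}{n_{\mathrm{heads}}} = \frac{32}{8} = 4,
  \qquad
  d^{\text{X-EcoMLA}}_{\mathrm{RoPE/head}}
    = 32 = 8 \times d^{\mathrm{MHA2MLA}}_{\mathrm{RoPE/head}}.
\]
```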
SFT Performance Comparison (SmolLM 1.7B-Instruct)
| Method | KV Size | Avg. Acc. | ARC | ARE | HS | MM | OBQA | PIQA | PM | RA | WG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SmolLM 1.7B-Ins (Target Model) | 100% | 50.64 | 38.48 | 62.96 | 60.89 | 25.88 | 39.80 | 73.56 | 60.80 | 36.27 | 57.14 |
| MHA2MLA (SFT) | 12.5% | 48.19 | 35.07 | 60.94 | 56.18 | 23.36 | 36.40 | 72.36 | 57.20 | 34.83 | 57.38 |
| X-EcoMLA (SFT, Ours) | 12.5% | 49.34 | 37.29 | 62.96 | 59.05 | 23.68 | 38.00 | 72.52 | 59.80 | 34.64 | 56.12 |
| MHA2MLA (SFT) | 50% | 49.79 | 37.63 | 62.29 | 59.60 | 24.09 | 38.20 | 74.05 | 60.40 | 34.35 | 57.46 |
| X-EcoMLA (SFT, Ours) | 50% | 50.15 | 37.97 | 64.06 | 60.23 | 24.89 | 39.00 | 73.34 | 60.20 | 34.93 | 56.75 |
  • Observation: At 12.5% compression, X-EcoMLA (49.34) outperforms MHA2MLA (48.19) by +1.15 avg.
  • At 50% compression, X-EcoMLA (50.15) also outperforms MHA2MLA (49.79) by +0.36 avg.
Continual Pretraining Comparison (SmolLM 1.7B Base)
| Method | KV Size | Avg. Acc. | ARC | ARE | HS | MM | OBQA | PIQA | PM | RA | WG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SmolLM 1.7B (Target Model) | 100% | 54.67 | 46.42 | 73.48 | 65.74 | 27.73 | 42.00 | 76.06 | 62.60 | 37.03 | 60.93 |
| MHA2MLA (Pre-train, ckpt) | 12.5% | 51.69 | 41.55 | 69.57 | 61.43 | 24.63 | 39.00 | 74.70 | 60.20 | 35.69 | 58.41 |
| X-EcoMLA (Pre-train, Ours) | 12.5% | 52.85 | 41.13 | 70.03 | 63.33 | 25.84 | 41.80 | 75.08 | 62.00 | 35.98 | 60.46 |
  • Observation: Under 12.5% compression, X-EcoMLA (52.85) outperforms the MHA2MLA pre-trained model (51.69) by +1.16 avg.
  • X-EcoMLA closes nearly the entire gap to the full baseline, despite using only 12.5% KV cache.

2. Comparison with PALU (ICLR 2025)

  • PALU [2] is a SOTA low-rank projection approach for KV-cache compression. PALU decomposes the linear layers into low-rank matrices, caches compressed intermediate states, and reconstructs the full keys and values on the fly. We compare X-EcoMLA with PALU on the Llama3-8B model.

  • Result: At ~15.6% KV size, X-EcoMLA (67.34) nearly matches the full model (67.87) and outperforms PALU-based compression (66.19 / 64.45) by +1.15 / +2.89 avg. despite using ~3× less KV storage.

| Model | KV Size | Avg. Acc. | ARC | ARE | HS | OBQA | PIQ | WG |
|---|---|---|---|---|---|---|---|---|
| Llama3-8B-Inst (Target) | 100% | 67.87 | 56.66 | 81.61 | 75.81 | 42.60 | 78.62 | 71.90 |
| PALU-J-LRD (SOTA-Baseline) | 50% | 66.19 | 51.96 | 79.63 | 73.20 | 43.40 | 76.50 | 72.45 |
| PALU-G-LRD (SOTA-Baseline) | 50% | 64.45 | 48.99 | 76.30 | 70.36 | 42.60 | 76.06 | 72.38 |
| X-EcoMLA (Ours) | 15.63% | 67.34 | 54.69 | 81.02 | 75.69 | 44.40 | 77.53 | 70.72 |
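
To make the shared mechanism concrete, the sketch below illustrates the low-rank latent KV-caching idea that underlies both PALU-style decomposition and MLA: cache a small per-token latent and rebuild keys/values on the fly. All dimensions and weight names are placeholders, and the decoupled RoPE component is omitted, so the printed ratio is indicative only rather than the KV sizes reported above.

```python
# Illustrative sketch of low-rank latent KV caching (the common idea behind
# PALU-style decomposition and MLA): cache a small per-token latent and
# reconstruct keys/values on the fly. Shapes and names are placeholders.
import torch

d_model, n_heads, d_head, r_kv = 2048, 16, 128, 128  # r_kv << n_heads * d_head

W_down = torch.randn(d_model, r_kv) / d_model ** 0.5        # compress to latent
W_up_k = torch.randn(r_kv, n_heads * d_head) / r_kv ** 0.5  # rebuild keys
W_up_v = torch.randn(r_kv, n_heads * d_head) / r_kv ** 0.5  # rebuild values

x = torch.randn(1, 512, d_model)   # hidden states for 512 tokens
latent = x @ W_down                # (1, 512, r_kv) -- this is what gets cached

k = (latent @ W_up_k).view(1, 512, n_heads, d_head)  # rebuilt at attention time
v = (latent @ W_up_v).view(1, 512, n_heads, d_head)

full_cache = 2 * n_heads * d_head  # floats per token for standard K+V caching
latent_cache = r_kv                # floats per token for the latent
print(f"KV size: {latent_cache / full_cache:.1%} of full cache")  # ~3.1%
```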

References

[1] Lin, Yi-Sheng, et al. "MHA2MLA: Upcycling Multi-Head Attention into Multi-head Latent Attention." arXiv:2502.14837 (2025).
[2] Chang, Chi-Chih, et al. “PALU: KV-Cache Compression with Low-Rank Projection.” ICLR 2025.

Review (Rating: 5)

This paper introduces a post-training method X-EcoMLA to transform pre-trained Transformer attention into Multi-head Latent Attention (MLA), enabling extreme KV cache compression without retraining from scratch. By leveraging SVD-based initialization and knowledge distillation from a larger teacher model, X-EcoMLA reduces memory requirements (up to 10.6×) while preserving or even improving model performance on the LM Harness benchmark. The method supports both static and dynamic rank selection and omits LayerNorms for better convergence.
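
A minimal sketch of SVD-based low-rank initialization with an energy-based ("dynamic") rank choice is given below; it is our illustration under simplified assumptions, not the paper's exact recipe (which defines specific SVD variants for the MLA projections).

```python
# Illustrative sketch of SVD-based low-rank initialization (not the authors'
# exact recipe): factor a pre-trained projection W into down/up matrices of
# rank r, optionally choosing r from the retained singular-value energy
# ("dynamic" rank selection).
import torch

def svd_init(W, rank=None, energy=0.95):
    """Return (W_down, W_up) such that W_down @ W_up approximates W."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    if rank is None:  # dynamic: smallest rank keeping `energy` of the spectrum
        cum = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
        rank = int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1
    W_down = U[:, :rank] * S[:rank].sqrt()            # (d_in, r)
    W_up = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (r, d_out)
    return W_down, W_up

W = torch.randn(2048, 2048)  # stand-in for a pre-trained K/V projection matrix
W_down, W_up = svd_init(W, rank=128)
print((W - W_down @ W_up).norm() / W.norm())  # relative approximation error
```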

Reasons to Accept

  • The paper introduces a lightweight post-training adaptation mechanism. Unlike prior MLA work that requires expensive pre-training, X-EcoMLA enables efficient KV cache compression on existing pre-trained models with minimal compute (as low as 70 GPU hours), making it highly practical.

  • Across multiple model sizes (e.g., SmolLM, Llama3.2) and benchmarks (LM Harness with 9 diverse tasks), X-EcoMLA achieves up to 10.6× KV compression with negligible performance loss (<0.1%).

  • The method democratizes MLA adoption by requiring only post-training on modest datasets (3.4–7B tokens) and affordable compute. This greatly lowers the barrier for academia and industry groups with limited GPU resources to adopt memory-efficient LLMs.

Reasons to Reject

  • The approach heavily relies on large teacher models (e.g., LLaMA3.1-8B) to recover performance under extreme compression. This raises a concern about practical viability in low-resource settings, where such teachers may be inaccessible. The performance degradation without large teachers (e.g., at 6.25% KV size) is notable and suggests limited standalone utility.

  • While the paper demonstrates results on SmolLM and Llama3.2 models, all target models use GQA attention and relatively small scales (≤3B). It remains unclear whether X-EcoMLA generalizes to diverse architectures (e.g., standard MHA, MQA, or non-GQA models) and larger-scale LLMs, which limits the current scope of its applicability.

Questions for Authors

  • The method relies heavily on large teacher models for extreme compression. Could you quantify the minimum teacher quality/size necessary to recover performance at various KV compression ratios?
Comment

We appreciate the reviewer’s detailed feedback on teacher model requirements, scalability across model sizes and attention types, and the impact of teacher size on KV cache compression. We have addressed these concerns in our rebuttal and supported our response with evidence of the method’s effectiveness. We respectfully encourage the reviewer to revisit the evaluation based on these updates.


[Answer 1] Dependency on Large Teacher Models in Low-Resource Settings

We agree that accessibility to large teachers may be constrained in low-resource settings, and we have explicitly addressed this scenario in Section B.2.1 and Figure 1 of the paper:

  • Table 5 empirically examines two efficiency strategies: increasing training tokens versus employing larger teacher models. While larger teachers (e.g., LLaMA3.1-8B) yield stronger performance (e.g., 55.13 vs. 53.42), they achieve this using fewer tokens and less total training time compared to the 1B teacher trained on double the data.
  • However, when such models are inaccessible, scaling up training data remains a viable alternative. For instance, a 1B model trained with 7B tokens and a 1B teacher nearly matches the performance of training with a 3B teacher on 3.5B tokens.
  • Similarly, training with a 3B teacher and 7B tokens approaches the performance of using an 8B teacher with half the data.

In sum, while stronger teachers offer the most cost-effective improvements, our method also scales well with data when larger teachers are not feasible—preserving its practical utility across varying resource regimes.


[Answer 2] Generalizing to larger model sizes and other attention architectures

We emphasize that X-EcoMLA is architecture-agnostic and scales effectively to larger models. We support this claim with results on both non-GQA (MHA) and 8B-scale models:

✅ Applicability to MHA — SmolLM 1.7B-Instruct

We applied X-EcoMLA to SmolLM, which uses MHA. The updated results below show that, even at 12.5% KV size, our model retains >99% of the original accuracy, and at 50% KV size, it is nearly identical to the baseline.

| Model | KV Size | Avg. Acc. | ARC | ARE | HS | MM | OBQA | PIQ | PM | RA | WG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SmolLM 1.7B-Instruct (Base) | 100% | 50.64 | 38.48 | 62.96 | 60.89 | 25.88 | 39.80 | 73.56 | 60.80 | 36.27 | 57.14 |
| X-EcoMLA (Ours) | 50% | 50.51 | 37.63 | 64.1 | 61.35 | 27.6 | 39.00 | 74.27 | 58.4 | 36.56 | 55.64 |
| X-EcoMLA (Ours) | 12.5% | 50.29 | 37.97 | 62.92 | 60.39 | 27.26 | 38.80 | 73.61 | 59.40 | 35.89 | 56.35 |

✅ Scalability to Larger Models — Llama 3-8B and Llama3.1-8B

X-EcoMLA maintains high performance even at extreme KV compression when applied to larger-scale models (8B). Notably, at a KV size of 10.94%, performance remains nearly on par with the full model.

| Model | KV Size | Avg. Acc. | ARC | ARE | HS | MM | OBQA | PIQ | PM | RA | WG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3-8B-Inst | 100% | 65.78 | 56.66 | 81.61 | 75.81 | 63.82 | 42.60 | 78.62 | 75.00 | 46.03 | 71.90 |
| X-EcoMLA ($r_{kv}=256$) | 15.63% | 65.16 | 54.69 | 81.02 | 75.69 | 59.20 | 44.40 | 77.91 | 74.80 | 48.04 | 70.72 |
| Llama3.1-8B-Inst | 100% | 66.63 | 54.86 | 79.55 | 79.23 | 68.13 | 43.00 | 80.90 | 75.40 | 44.69 | 73.88 |
| X-EcoMLA ($r_{kv}=160$) | 10.94% | 66.35 | 58.11 | 80.68 | 77.08 | 60.70 | 44.80 | 79.38 | 75.80 | 48.13 | 72.45 |

These results confirm that X-EcoMLA generalizes well to both standard attention architectures (MHA) and larger-scale models, retaining competitive performance with aggressive KV compression.


[Answer 3] Analysis of Teacher Size for Different KV Cache Compression

Thank you for your question. Based on the results from Table 2 and Figure 1, we provide a rule-of-thumb guideline for selecting the minimum teacher size required to recover the base model’s accuracy (~52.77) under different KV cache compression ratios and training data budgets for the 1B model. We evaluate this trade-off under two data budgets: ~7B tokens and ~3.6B tokens.

| Compression Ratio | Data Budget (tokens) | Minimum Teacher Size to Hit ≈52.8 |
|---|---|---|
| 16× | 7B | — (even an 8B teacher falls below) |
| 10.6× | 7B | 8B |
| 6.4× | 7B | 3B |
| 6.4× | 3.6B | 8B |
| 3.6× | 3.6B | 3B |
| 1.9× | 3.6B | 1B |
  • Extreme compression (≥ 10×) demands an 8 B teacher if you want to recover full-model performance—otherwise you lose 1–2 points.
  • Moderate compression (∼ 6×) can be handled by a 3 B teacher (for large data budgets) but needs 8 B when data is scarcer.
  • Mild compression (≤ 2×) is already supported by a 1 B teacher, even with only 3.6 B tokens.
Review (Rating: 6)

The paper proposes to “up-cycle” LLMs pre-trained with vanilla MHA attention into MLA attention, with the goal of improving memory efficiency at test time without performance degradation.

They start with instruct-tuned checkpoints and replace a subset (or all) of the MHA blocks with MLA, initializing the MLA weights via SVD on the pre-trained attention weights. They propose variants of the SVD initialization, as well as random initialization for the MLA up-cycling. Full and hybrid MLA models are considered. The model is then trained with KL-Distillation from a (usually) larger model using an SFT corpus, and finally aligned with DPO. They report KV cache compression with minimal loss in performance on LM benchmarks, training with a modest compute budget.
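
For concreteness, a minimal sketch of the distillation objective described above is given below; the direction of the KL, the temperature, and the toy shapes are assumptions rather than the paper's exact training code.

```python
# Illustrative sketch of the KL-distillation step (not the exact training
# code): match the up-cycled student's next-token distribution to a frozen
# teacher's over an SFT batch. KL direction and temperature are assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student), averaged over tokens."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1).flatten(0, -2)
    p_teacher = F.softmax(teacher_logits / t, dim=-1).flatten(0, -2)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Toy shapes: (batch, sequence length, vocabulary size).
student_logits = torch.randn(2, 16, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 32000)

loss = kd_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```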

Reasons to Accept

Though up-cycling vanilla Transformers into more efficient variants isn't novel (MHA -> GQA, softmax attention -> linear in T2R and others), the proposed solution for MLA up-cycling is well-described, and the proposed initialization of the MLA weights by SVD on the pre-trained attention weights is elegant. The design decisions are well-ablated (compared to random initialization) and show a clear boost, within a modest compute budget.

Reasons to Reject

The biggest drawback of this study is that the results are on small models (135M, 360M, 1B, 3B) where the benefits of KV compression are minimal; even for relatively long sequences the memory requirements are fairly modest for these models. There may be some benefits for embedded devices, but this issue justifies discussion in a dedicated limitations section.

For larger models where MLA would start to make a bigger difference (7B+), it's quite possible that up-cycling the models in this way would lead to a performance hit, and this remains untested. Such an investigation would require more computation resources which may not be feasible, but the limitation should be noted.

Although the authors mention the advantages of MLA in the long-context setting, the evaluations presented are standard LM harness tasks which do not evaluate long-sequence performance; proper long-sequence benchmarks (e.g. SCROLLS [1]) would reveal potential degradation/lack of degradation in this setting. This may be a limitation of the up-cycling procedure however, as the SFT data used might not be of sufficient context length to keep long-context performance.

[1] Shaham, Uri, et al. "Scrolls: Standardized comparison over long language sequences." EMNLP 2022

Questions for Authors

The authors apply X-EcoMLA exclusively to instruct-tuned models with SFT data, leaving open the question of the effectiveness of the approach on base models when continually pre-training on minimal pre-training data (in addition to the potential limitation for long-context performance raised above).

Have you tried the technique with minimal continued training on pre-training data? Up-cycling works typically used pre-training data, e.g. [2], but [3] analyzed up-cycling with both pre-training and SFT data, and found advantages with each approach. (Some citations of recent up-cycling works are missing, including the original work that introduced up-cycling transformers into RNNs (T2R [1]) and recent extensions to LLMs [2][3].)

Can you comment on the preservation of long-context performance, given the advantage of MLA on long sequences?

[1] Kasai, Jungo, et al. "Finetuning pretrained transformers into rnns." EMNLP 2021

[2] Mercat, Jean, et al. "Linearizing large language models." COLM 2024

[3] Zhang, Michael, et al. "LoLCATs: On Low-Rank Linearizing of Large Language Models." ICLR 2025

Comment

We thank the reviewer for the constructive comments on Scalability to Larger Models, Long Context Evaluations, and Applicability to Base Models and Pretraining. We have addressed nearly all of these points and demonstrated the promising effectiveness of our solution. We would greatly appreciate it if the respected reviewer would consider re-evaluating our work and updating the score accordingly.


[Answer 1] Scalability to Larger Models — Llama 3-8B and Llama3.1-8B

We have extended X-EcoMLA to two 8B-parameter models (Llama3-8B and Llama3.1-8B). The results below demonstrate that even under aggressive KV cache compression, X-EcoMLA maintains performance very close to the full-scale baseline. Notably, at a KV size of 10.94%, performance remains nearly on par with the full model.

| Model | KV Size | Avg. Accuracy | ARC | ARE | HS | MM | OBQA | PIQ | PM | RA | WG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3-8B-Inst | 100% | 65.78 | 56.66 | 81.61 | 75.81 | 63.82 | 42.60 | 78.62 | 75.00 | 46.03 | 71.90 |
| X-EcoMLA ($r_{kv}=256$) | 15.63% | 65.16 | 54.69 | 81.02 | 75.69 | 59.20 | 44.40 | 77.91 | 74.80 | 48.04 | 70.72 |
| Llama3.1-8B-Inst | 100% | 66.63 | 54.86 | 79.55 | 79.23 | 68.13 | 43.00 | 80.90 | 75.40 | 44.69 | 73.88 |
| X-EcoMLA ($r_{kv}=160$) | 10.94% | 66.35 | 58.11 | 80.68 | 77.08 | 60.70 | 44.80 | 79.38 | 75.80 | 48.13 | 72.45 |

In the final version of our paper, we will present additional results of X-EcoMLA on 8 B-parameter LLaMA models.


[Answer 2] Long Context Evaluations

We evaluated our MLA-optimized models on the LongBench benchmark, which covers a range of long-context understanding tasks such as LCC, Qasper, QMSum, Multi-News, and SamSum. We report results under various KV-cache sizes for both llama3.2-1B and llama3.2-3B models:

| Model | KV Size | Avg. Acc. | lcc | repobench-p | qasper | qmsum | multi_news | samsum |
|---|---|---|---|---|---|---|---|---|
| llama3.2-1B-instruct | 100.00% | 30.805 | 35.47 | 40.12 | 22.92 | 21.65 | 25.68 | 38.99 |
| X-EcoMLA-1B-model-A (ours) | 53.13% | 30.77 | 38.73 | 40.36 | 21.13 | 20.5 | 25.76 | 38.11 |
| X-EcoMLA-1B-MLA-model-B (ours) | 28.13% | 30.66 | 38.74 | 40.54 | 21.21 | 20.61 | 25.62 | 37.26 |
| llama3.2-3B-instruct | 100.00% | 40.01 | 52.11 | 54.16 | 40.42 | 23.63 | 26.51 | 43.21 |
| X-EcoMLA-3B-MLA-model-A (ours) | 42.91% | 39.29 | 60.03 | 56.24 | 29.94 | 21.08 | 27.54 | 40.93 |
| X-EcoMLA-3B-MLA-model-B (ours) | 25.00% | 39.11 | 59.59 | 53.94 | 31.75 | 20.93 | 27.19 | 41.26 |

Notably:

  • X-EcoMLA-3B-A achieves 60.03 on LCC, outperforming the full-size LLaMA3.2-3B (52.11), despite using only 43% of the KV cache.
  • Across tasks like Qasper, Multi-News, and SamSum, our compressed models match or slightly improve upon baseline performance, demonstrating strong long-context understanding even under aggressive memory constraints.

These results indicate that our method scales well to long-sequence scenarios and is particularly effective in memory-constrained environments.


[Answer 3] Applicability to Base Models and Pretraining Setting

While our primary evaluations focused on instruct-tuned models, based on your comment, we have also investigated the effectiveness of X-EcoMLA when applied to base models during continual pretraining.

We report results on SmolLM 1.7B in a continual pretraining setup, and compare with the following baselines:

  1. The target full-attention model (uncompressed),
  2. A new baseline: a 12.5% KV-compressed model using the prior MHA2MLA checkpoint, trained on 6B tokens (Hugging Face model link), and
  3. Our X-EcoMLA model, continually pre-trained on the same 6B tokens (from the MHA2MLA setup), also with a 12.5% compressed KV-cache.
| Model | KV Size | Avg. Acc. | ARC | ARE | HS | MM | OBQA | PIQ | PM | RA | WG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SmolLM 1.7B (Target Model) | 100% | 54.67 | 46.42 | 73.48 | 65.74 | 27.73 | 42.00 | 76.06 | 62.60 | 37.03 | 60.93 |
| MHA2MLA | 12.5% | 51.69 | 41.55 | 69.57 | 61.43 | 24.63 | 39.00 | 74.70 | 60.20 | 35.69 | 58.41 |
| X-EcoMLA (ours) | 12.5% | 52.85 | 41.13 | 70.03 | 63.33 | 25.84 | 41.80 | 75.08 | 62.00 | 35.98 | 60.46 |
  • Our X-EcoMLA base model, pretrained with 12.5% KV size, outperforms the MHA2MLA baseline by +1.16 average accuracy, and closes most of the gap to the full baseline model.

These results confirm that X-EcoMLA can be effectively applied to base models.


[Answer 4] Missing Citations and Use of Pretraining Data for Up-Cycling

We will include the suggested citations in the final version to better position our work within the up-cycling literature. Moreover, we explored minimal continued pretraining (see Answer #3), showing that X-EcoMLA supports both regimes.

Comment

Thank you for the extensive clarifications and updates! It's good to see that, in an apples-to-apples comparison with MHA2MLA, X-EcoMLA outperforms at the same compression levels, and that the long-context evals indicate retention of performance (though there does seem to be a hit to performance that increases with scale).

Comment

We thank the reviewer for the encouraging feedback. We hope that the demonstrated results and clarifications provided merit a positive re-evaluation of the paper. Thanks!

Comment

Dear Reviewing Committee,

We hope this message finds you well.

We are writing to kindly follow up on our rebuttal submission for our paper X-EcoMLA: Upcycling Pre-Trained Attention into MLA. We have made a dedicated effort to address all the comments and concerns raised during the review process.

We would greatly appreciate it if the respected reviewers could take a moment to review our rebuttal and consider our clarifications in their final evaluations. We firmly believe that our approach is highly competitive with the current state of the art, and we hope our efforts are reflected in the final assessment.

Thank you for your time and attention.

Comment

Dear Reviewing Committee,

We kindly encourage the reviewers who have not yet revisited our rebuttal to do so. We believe that the additional results and clarifications more clearly highlight the value of our submission, and we sincerely hope they will be taken into consideration in the final evaluation. We would be more than happy to address any further questions or feedback.

Thank you again for your time and consideration.

Final Decision

This paper introduces X-EcoMLA, a method for post-training conversion of pre-trained attention mechanisms to Multi-head Latent Attention (MLA) for KV cache compression. In this way, models can be retrofitted to use MLA without the need to train new models from scratch.

Authors engaged with reviewers, addressing their concerns: they demonstrated scalability to 8B models (Llama3/3.1-8B), showed effectiveness on long-context tasks (LongBench), validated applicability to base models and standard MHA architectures (SmolLM), and provided comprehensive comparisons with recent methods (MHA2MLA, PALU, H2O). Experiments show significant throughput gains (1.7–2×) and memory reduction (5× at batch size 128). However, reviewer engagement was limited: only YRgt acknowledged the authors' responses.

This paper can have a significant practical impact, and it is technically and experimentally strong.