Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
We design heterogeneous elastic rules for the sliding-window lengths of attention heads to enable efficient large language model inference
Abstract
Reviews and Discussion
The authors propose a new technique for optimizing the memory and computation requirements of pre-trained models at inference time. The optimization is based on two important observations. First (as noted by many others), attention is often local -- meaning that the output of an attention head depends mainly on local context. Second, whether attention is local or global may differ for each head and layer in the model; this second factor is the basis for the paper.
Instead of keeping a global KV cache for all heads and layers, the authors optimize the size of the KV-cache for each attention head. The result is a form of sliding-window attention, where each head may use a sliding window of a different length. By reducing the size of the KV-cache for heads that do not require global attention, the authors reduce both the memory cost of the KV-cache and the cost of computing attention.
In order to discover the optimum length of the windows, the authors use a gradient-based optimization technique. For each entry in the attention matrix, the authors calculate the change in loss if that entry were to be masked out. The optimization algorithm then finds the shortest windows that keep the loss from increasing too much. The calibration dataset for this process must be carefully chosen -- the authors use a long-range dataset to ensure that windows are not cropped too much.
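In rough math, the first-order estimate I understand them to be using looks something like the following (my notation, not the paper's; A_{i,j} is a post-softmax attention value and L the loss on the calibration data):

```latex
% Approximate loss change from masking out attention entry (i, j),
% via a first-order Taylor expansion around the unmasked model.
\Delta L_{i,j} \;\approx\; \frac{\partial L}{\partial A_{i,j}}\,\bigl(0 - A_{i,j}\bigr)
             \;=\; -\,A_{i,j}\,\frac{\partial L}{\partial A_{i,j}} .
```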
Reasons to Accept
The optimization technique is well thought out, seems to be effective, and the authors use an appropriate calibration dataset. This technique could reduce the inference costs of serving many existing pre-trained models, and thus has potentially high impact.
Reasons to Reject
Although I hate to reject an otherwise promising paper on the basis of terminology, I strongly object to the terms that the authors use to describe their work -- in particular the words "sparse" and "compression", which are right in the title. I have seen this misuse a couple of times just recently in other papers as well, and it needs to stop.
Sparse. First and foremost, sliding-window attention is not "sparse". When used to describe attention, or matrix multiplication in general, the word "sparse" means that the non-zero values in the matrix are not contiguous in memory; they are scattered or distributed throughout the matrix. There are several forms of attention which are actually "sparse": Top-K attention is sparse, because the top K matching keys are scattered along the context length. Other proposals, such as hierarchical attention, hash-based attention, and routing attention may be sparse in various ways, and I will grudgingly accept proposals such as strided attention as being "sparse", although trivially so. Sparse attention generally requires clever algorithms, and is usually difficult to implement on modern hardware, which is optimized for dense matrix multiplication.
Sliding-window attention is useful precisely because it is NOT sparse. The most recent N values are all contiguous in memory, which means that the matrix multiplication is fully dense, and thus can be implemented simply and efficiently on modern hardware without any special tricks.
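To make that concrete, a decode step with a sliding window is just a dense matmul over a contiguous slice of the cache. A rough PyTorch sketch (mine, not the authors'):

```python
import torch.nn.functional as F

def sliding_window_decode(q, k_cache, v_cache, W):
    """One decode step with a sliding window of length W (illustrative only).

    q: [batch, heads, 1, dim]; k_cache, v_cache: [batch, heads, seq, dim].
    Slicing the most recent W entries keeps the operands contiguous, so no
    sparse indexing or gather kernels are needed.
    """
    k = k_cache[:, :, -W:, :]
    v = v_cache[:, :, -W:, :]
    return F.scaled_dot_product_attention(q, k, v)  # plain dense attention
```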
Compression. Second, as near as I understand it, the authors are not doing anything that I would consider to be "compression". Truncating the context to a shorter length is not "compression" in any meaningful sense. Once again, there are other works in the literature which do attempt to do meaningful forms of compression -- either by compressing model weights into a smaller number of bits, or using sophisticated pooling techniques to compress long context lengths into a smaller number of soft tokens. As noted in my summary, the technique described in this paper should be referred to as a form of "optimization," not "compression."
The reason I object so strongly to the misuse of these terms is because "sparse attention" and "context length compression" are active areas of research. By misusing these terms (especially by putting them right in the title), the authors are mischaracterizing their work, and pretending that they are doing something that they are not. I consider this to be a form of academic dishonesty or false advertising, and it makes it difficult for other researchers to find relevant papers in the field when doing a literature review. As a result, I am forced to reject this paper, unless the authors are willing to correct their use of terminology.
An appropriate title for this work would be something along the lines of "Optimizing LLM inference by sliding-window attention with heterogeneous window lengths."
If the authors change the title, abstract, and body of the paper to remove the words "sparse" and "compression" where they are misused, I will consider changing my decision to "accept".
EDIT: changing decision to "accept".
Questions to Authors
N/A
W: Terminology
I strongly object to the terms that the authors use to describe their work -- in particular the words "sparse" and "compression", which are right in the title. I have seen this misuse a couple of times just recently in other papers as well, and it needs to stop.
We thank the reviewer for pointing out the misleading use of sparse and compression. We will fully adopt your recommendation and revise the manuscript as follows:
- New technical wording.
- We will describe our method simply as heterogeneous sliding-window attention and as an efficiency optimization technique.
- We will reserve sparse attention exclusively for truly irregular, fine-grained sparsity. For block-pattern methods such as Sparse Transformers [1], we will explicitly write block-sparse attention.
- New title.
- Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
- Global edits.
- All occurrences of sparse and compression in the abstract, main text, and figures will be replaced with the above terminology.
We appreciate your careful feedback. Our original terminology was never intended as false advertising; we viewed sliding-window attention as a structured, block-sparse variant of dense attention. Nevertheless, we recognise that its contiguous, coarse-grained pattern differs from the fine-grained sparsity typically implied by sparse attention, and we have adopted the revised phrasing to avoid any ambiguity. We welcome any further suggestions. The modified abstract is attached below.
[1] Child et al., “Generating Long Sequences with Sparse Transformers,” NeurIPS 2019.
The modified abstract
Sliding-window attention is a hardware-friendly way to mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a single window length across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention Spans (MoA), which automatically tailors distinct sliding-window length configurations to different heads and layers. MoA constructs and navigates a search space of various window lengths and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal length for each head. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9x with the same average sliding-window length, boosting retrieval accuracy by 1.5-7.1x over the uniform window baseline across Vicuna-{7B,13B} and Llama3-{8B,70B} models. Moreover, MoA narrows the capability gaps to full attention, reducing the maximum relative performance drop from 9%-36% to within 5% across three long-context understanding benchmarks. MoA achieves a 1.2-1.4x GPU memory reduction, boosting decode throughput by 6.6-8.2x and 1.7-1.9x over FlashAttention2 and vLLM, with minimal performance impact.
We hope these revisions address your concern and are happy to refine further if needed.
Thank you for taking my comments on board! The new abstract looks great, and I am changing my score to "accept." I like the "mixture of attention spans" terminology -- I think it is a good way of describing what you are doing.
I also agree that "block sparsity" is a reasonable way to describe the "Generating Long Sequences with Sparse Transformers". That particular paper also uses strided attention patterns, which as I mentioned, I will grudgingly accept as being "sparse". :-)
Thank you for your insightful feedback and positive recognition! We're delighted that our revisions effectively addressed your concerns. Your comments have greatly improved the clarity of our contribution.
The paper presents a well-executed study with rigorous experiments on their MoA method across multiple models (Vicuna, Llama3) and benchmarks (LongEval, LV-Eval). The methodology is technically sound, leveraging a profiling-and-optimization pipeline to tailor sparse attention configurations. The ablation studies and comparisons against strong baselines (StreamingLLM, H2O) validate the design choices. The work has high practical impact, demonstrating 3.9× longer effective context lengths, 6.6–8.2× throughput improvements, and minimal performance degradation. These results are critical for deploying LLMs in long-context applications efficiently.
Reasons to Accept
- The experimental analysis demonstrates exceptional rigor, systematically evaluating MoA across a spectrum of model scales (Vicuna-7B to Llama3-70B) and challenging benchmarks (LongEval, LV-Eval, LongBench).
- Section 3.1 quantifies attention value impacts via gradient-based influence metrics, ensuring that mask configurations preserve critical attention pathways.
- The results are compelling, achieving a 3.9× extension in effective context length with the same average attention span, 6.6–8.2× throughput gains over FlashAttention2, and limiting performance degradation to <5% across all tasks—a marked improvement over prior sparse methods (e.g., 7.1× higher retrieval accuracy than StreamingLLM at 50% density). These advances position MoA as a practical solution for deploying LLMs in long-context scenarios without sacrificing capability.
Reasons to Reject
- While the paper compares MoA against StreamingLLM and H2O, it overlooks critical baselines that define state-of-the-art sparse attention (e.g., [1] SnapKV, [2] PyramidKV). These methods address similar efficiency-accuracy trade-offs but employ distinct strategies (e.g., token prioritization, hybrid sparse-dense blocks). Without benchmarking against them, MoA’s superiority remains incompletely justified, particularly in scenarios requiring dynamic adaptation (e.g., streaming inputs) or hybrid computation. A direct comparison would clarify whether MoA’s static heterogeneous masking generalizes better than dynamic or hybrid approaches.
- Sections 2–3 suffer from disjointed exposition:
- Section 2 introduces heterogeneous patterns but lacks a visual roadmap (e.g., a flowchart) of how elastic rules are derived from attention matrices.
- Section 3.1’s attention influence derivation (Eq. 2) is abrupt; a step-by-step intuition (e.g., "masking token j redistributes attention mass to k ≠ j") would aid readability.
- The interplay between profiling (Section 3.1) and optimization (Section 3.2) is under-explained. A unified algorithmic pseudocode (combining both stages) would clarify the pipeline.
- The reliance on custom CUDA kernels introduces practical barriers:
- Compatibility issues with existing frameworks (e.g., HuggingFace, vLLM) requiring significant engineering effort.
- No discussion of energy efficiency—a critical metric for real-world deployment. For instance, does MoA’s sparsity reduce FLOPs sufficiently to offset kernel overhead?
Questions to Authors
Please address Reasons To Reject
W1: SnapKV & PyramidKV comparison
While the paper compares MoA against StreamingLLM and H2O, it overlooks critical baselines that define state-of-the-art sparse attention (e.g., [1] SnapKV, [2] PyramidKV). These methods address similar efficiency-accuracy trade-offs but employ distinct strategies (e.g., token prioritization, hybrid sparse-dense blocks).
We provide a direct comparison with the official SnapKV and PyramidKV implementations in Fig. 4. The table below reports retrieval accuracy on LongEval using Vicuna-7B with an 8k-token input on a single A100-80GB GPU, with all methods using 25% average attention density:
| Method | Retrieval Acc. | Throughput (token/s) |
|---|---|---|
| SnapKV | 0.79 | 207 |
| PyramidKV | 0.89 | 129 |
| MoA (ours) | 0.95 | 928 |
MoA achieves both higher accuracy and significantly higher throughput, owing to its effective offline optimization and the static, lightweight sliding-window attention it uses during inference.
W2: Exposition of Sections 2-3
Sections 2–3 suffer from disjointed exposition:
- Section 2 introduces heterogeneous patterns but lacks a visual roadmap (e.g., a flowchart) of how elastic rules are derived from attention matrices.
- Section 3.1’s attention influence derivation (Eq. 2) is abrupt; a step-by-step intuition (e.g., "masking token j redistributes attention mass to k ≠ j") would aid readability.
- The interplay between profiling (Section 3.1) and optimization (Section 3.2) is under-explained. A unified algorithmic pseudocode (combining both stages) would clarify the pipeline.
We thank the reviewer for the valuable suggestions. We plan to streamline as follows:
- Visual roadmap: Revise Figs. 2 and 3(a) to link the attention matrix, desired attention mask, and elastic rule. Update Fig. 3(b) to a clearer flowchart illustrating the derivation of elastic rules.
- Intuition before formalism: Add a brief intuitive explanation before Eq. 2 as suggested, explaining that masking token j redistributes its attention mass to the remaining tokens k ≠ j (see the sketch at the end of this response).
- Pseudocode: Introduce a pseudocode block and cross-reference it from both Sections 3.1 and 3.2, clearly connecting both stages, as well as Figure 3.
We believe these changes will create a more coherent visual and algorithmic pipeline. We are happy to further refine the writing if there are additional suggestions.
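For reference, a minimal sketch of the intuition we plan to add (illustrative notation; the precise statement will follow the paper's Eq. 2): masking out token j in row i of the post-softmax attention renormalizes the remaining probability mass,

```latex
A'_{i,k} \;=\; \frac{A_{i,k}}{1 - A_{i,j}} \quad \text{for } k \neq j,
\qquad A'_{i,j} \;=\; 0 ,
```

so the influence of entry (i, j) can be read off from how this redistribution perturbs the prediction loss.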
W3.1: Barriers of custom CUDA kernel
The reliance on custom CUDA kernels introduces practical barriers:
Compatibility issues with existing frameworks (e.g., HuggingFace, vLLM) requiring significant engineering effort.
Our CUDA kernel is exposed through a lightweight Python wrapper that mirrors torch.nn.functional.scaled_dot_product_attention, adding only a single tensor to specify each head’s attention span. Integrating MoA into HuggingFace Transformers requires only a few modifications to the AttentionModule, after which it can be enabled with just two lines of user-level code (see Appendix D.4). Since MoA employs fixed sliding-window masks at inference, integrating the kernel into frameworks like vLLM should involve only moderate engineering effort.
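For concreteness, a hypothetical sketch of what such a wrapper could look like (the name moa_attention, the attention_spans argument, and the sink size are illustrative assumptions, not the released API); it emulates the per-head windows with a boolean mask on top of PyTorch's scaled_dot_product_attention rather than the custom CUDA kernel:

```python
import torch
import torch.nn.functional as F

def moa_attention(q, k, v, attention_spans, sink=64):
    """Emulate per-head sliding-window attention (illustrative only).

    q, k, v: [batch, heads, seq, dim]; attention_spans: [heads] per-head
    sliding-window lengths. The real kernel consumes the span tensor directly;
    here we build an explicit boolean mask per head for clarity.
    """
    _, H, N, _ = q.shape
    pos = torch.arange(N, device=q.device)
    offset = pos[None, :, None] - pos[None, None, :]          # query index minus key index
    in_window = (offset >= 0) & (offset < attention_spans.view(H, 1, 1))
    is_sink = (offset >= 0) & (pos[None, None, :] < sink)     # always-visible initial tokens
    keep = in_window | is_sink                                # causal window plus sink
    return F.scaled_dot_product_attention(q, k, v, attn_mask=keep.unsqueeze(0))
```

Swapping the attention call inside the model's attention module for such a wrapper is essentially the two-line change referred to above.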
W3.2: Energy efficiency and FLOPs
No discussion of energy efficiency—a critical metric for real-world deployment. For instance, does MoA’s sparsity reduce FLOPs sufficiently to offset kernel overhead?
We thank the reviewer for the suggestion. We have added energy efficiency results under the same setup as Table 5, using Vicuna-7B on a single A100-80GB GPU:
| Method | Energy (J/token) @4k/8k/16k | Avg. GPU Power (W) @4k/8k/16k |
|---|---|---|
| FlashAttention2 | 2.98 / 5.93 / 12.07 | 350 / 354 / 359 |
| MoA (ours) | 0.34 / 0.62 / 1.21 | 330 / 322 / 315 |
Energy: MoA achieves an 8.7–10x reduction in energy per output token, driven by slightly lower GPU power and, more importantly, much higher throughput. We will add these results to the paper.
FLOPs: During decoding, each head accesses only its (shortened) KV cache, with an O(#heads) offset computation that is negligible compared to sequence length. Since attention FLOPs scale linearly with KV length, MoA reduces attention FLOPs per decode step by approximately 50%, much more than offsetting the minimal addressing overhead.
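As a rough accounting (our notation, with h heads, head dimension d, and the average per-head KV length written as L_kv below; a multiply-add counted as two FLOPs):

```latex
% QK^T scoring and AV aggregation each cost about 2 h d \bar{L}_{kv} FLOPs
% per decode step, so
\mathrm{FLOPs}_{\text{attn}} \;\approx\; 4\, h\, d\, \bar{L}_{kv},
\qquad
\frac{\mathrm{FLOPs}^{\text{MoA}}_{\text{attn}}}{\mathrm{FLOPs}^{\text{dense}}_{\text{attn}}}
\;\approx\; \frac{\bar{L}^{\text{MoA}}_{kv}}{N} \;\approx\; 0.5 .
```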
Thank you for your response, I will keep my score at 7.
Thank you for your positive evaluation and continued support for our work. We’re glad to have addressed your initial concerns and sincerely appreciate your insightful feedback.
The paper proposes MoA, a training-free sparse attention method for compressing LLMs by assigning heterogeneous, per-head attention spans that adapt with input length. Unlike prior uniform sparse attention approaches, MoA uses gradient-based profiling on a long-context calibration dataset to measure the influence of each attention value on prediction loss and optimizes attention masks accordingly. This automatic compression improves memory and compute efficiency without retraining. Evaluated on Vicuna and Llama3 models, MoA achieves up to 3.9× longer effective context, 6.6–8.2× faster decoding than FlashAttention2, and maintains accuracy within 1% of dense models at 50% KV-cache density, outperforming existing sparse baselines across long-context tasks.
Reasons to Accept
- MoA introduces a unique, training-free method that assigns per-head, per-layer sparse attention masks based on elastic span functions.
- MoA requires no retraining or fine-tuning.
- MoA is simple and lightweight to implement.
- MoA achieves impressive results across long-context benchmarks.
Reasons to Reject
- Limited evaluation: the paper doesn't provide evaluation on tasks other than retrieval, such as coding, math Q&A, etc.
- Lack of dynamic sparse attention mechanisms: it doesn’t incorporate dynamic sparse attention mechanisms that adapt to varying input patterns in real time; the masks are static rather than context-dependent.
- Limited novelty: similar work like MInference already addresses the prefill inefficiency in long-context LLM inference using a more practical and implementation-ready approach. It dynamically constructs sparse indices and leverages optimized GPU kernels for real-time acceleration. The paper doesn't show its advantages over existing methods like MInference and SpargeAttn, which solve a similar problem.
Questions to Authors
- It is unclear how the attention span is explored by the pipeline. It would be better to provide more details on the relationship between the attention span function and the elastic rules. The current explanation is too limited to fully understand what the elastic rules are and how they are explored.
- It would be better to give more details on Figure 3(a) and tie it back to the explanation in the paragraph. How is the attention span related to the masks, by colors or by length? Are the masks all overlapped on the same input sequence or applied sequentially? Do elastic rules decide the range of the masks? How are the sizes of the attention span and the attention windows selected, and how are they related?
- For the experiments, could you please explain why perplexity is only reported for lengths of 8k–12k?
- What is the data efficiency of finding the Pareto point, and is this task-agnostic?
- Do you observe any shared pattern between heads?
- Is the accuracy for LongBench averaged across tasks?
W1: Limited evaluation to retrieval tasks
Limited evaluation: the paper doesn't provide evaluation on tasks other than retrieval, such as coding, math Q&A, etc.
We evaluate MoA on a broad range of tasks beyond retrieval, including code completion, single- and multi-document QA, summarization, and few-shot learning, using 11 and 13 sub-datasets from LV-Eval and LongBench (L626). Averaged results are reported in Table 4, with detailed per-task results provided in Appendix B.1.3 (Table 8 and Figure 10). MoA consistently matches the original dense model across tasks, whereas StreamingLLM and InfLLM show greater variance: sometimes exceeding the original model, but with significant drops in other tasks. We will further emphasize the comprehensiveness of our evaluation in the revised manuscript.
W2: Dynamic vs. static attention mask
Lack of dynamic sparse attention mechanisms: it doesn’t incorporate dynamic sparse attention mechanisms that adapt to varying input patterns in real time; the masks are static rather than context-dependent.
Dynamic sparsity can, in theory, remove more FLOPs, yet on commodity GPUs its fine-grained control flow often reduces those gains. MoA chooses a different point on this trade-off curve:
- Design Choice. We adopt static, per-head heterogeneous windows because they execute in a single branch-free kernel on standard GPUs.
- Effectiveness. Although static at run time, the window lengths are optimized from gradients and differ across heads, layers, and input lengths. At 50% density, MoA is 6–8x faster than FlashAttention-2 (Tab. 5) while showing smaller accuracy drops than dynamic-sparsity baselines (Tab. 4).
- Complementarity. MoA can simply replace any homogeneous local window inside a dynamic scheme. As a demonstration, we swap the 0.15N uniform local window in H2O [1] for MoA heterogeneous spans with the same average density, keeping the 0.15N dynamic heavy-hitter budget unchanged. At the same 30% density, our version better restores dense-level accuracy:
| | Dense | H2 + Uniform | H2 + MoA |
|---|---|---|---|
| Llama3-8B LongEval 8K Acc. | 99% | 71% | 98% |
- Future work. The MoA pipeline can also be adapted to optimize hyper-parameters inside dynamic attention methods (e.g., the heavy-hitter rate of H2O). Exploring these combinations, especially on custom accelerator hardware, is promising future work.
W3: Novelty over MInference and SpargeAttn
Limited novelty: similar work like MInference already addresses the prefill inefficiency in long-context LLM inference using a more practical and implementation-ready approach. It dynamically constructs sparse indices and leverages optimized GPU kernels for real-time acceleration. The paper doesn't show its advantages over existing methods like MInference and SpargeAttn, which solve a similar problem.
MInference and SpargeAttn reduce prefill cost with dynamic sparsity, but during decoding they still access the full KV cache. MoA tackles the complementary bottlenecks of KV-cache memory and decode speed.
| Dimension | MInference / SpargeAttn | MoA (ours) |
|---|---|---|
| Optimised stage | Prefill only | Prefill + Decode |
| Length adaptation | — | Per-head elastic span that scales differently with sequence length |
| Mask acquisition | Online: dynamic block sparse / Offline: per-head recall maximization | Offline end-to-end optimization: minimize prediction loss via gradient-based influence quantification + efficient mixed-integer programming |
| Calibration data | — | Emphasises long-range dependencies and LLM-alignment |
MoA is also practical and implementation-ready, providing a two-line Hugging Face interface (Appendix D.4). Empirically, it advances the accuracy-throughput frontier over six decoding baselines (Fig. 4).
We will cite and discuss MInference and SpargeAttn in the final paper.
Q1: Clarifying attention span and elastic rule
It is unclear how the attention span is explored by the pipeline. It would be better to provide more details on the relationship between the attention span function and the elastic rules. The current explanation is too limited to fully understand what the elastic rules are and how they are explored.
The attention span is the sliding-window size plus the first 64 always-visible tokens (L129). The elastic rule specifies how the span scales with the input length N.
Logic
- Eq. (1): elastic rule → attention span
- Sec. 2.2: attention span → attention mask
- Eq. (2–3): attention mask → prediction loss
Optimization Pipeline
Eq. 4 then minimizes the prediction loss across input lengths by searching over the elastic rules of each head with multi-objective mixed-integer programming (MIP).
We will add this step-by-step explanation and refine the terminology in the final version.
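To make the chain concrete, a minimal sketch under an assumed linear form of the rule (the actual parameterization is defined by Eq. 1 in the paper; the names base, slope, and sink are illustrative):

```python
def attention_span(N, base, slope, sink=64):
    """Hypothetical elastic rule: the window grows linearly with input length N."""
    window = max(1, int(round(base + slope * N)))   # per-head sliding-window length
    return min(N, window + sink)                    # span = window + always-visible sink tokens

# Eq. 4's mixed-integer program then picks one rule (here: a (base, slope) pair)
# per head, trading the profiled loss increase against the average-span budget.
```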
Q2: Reading Figure 3(a)
Span to mask
It would be better to give more details on Figure 3(a) and tie it back to the explanation in the paragraph. How is the attention span related to the masks, by colors or by length?
Figure 3(a) shows examples of attention masks from the search space. Each grid is a candidate attention mask for a single head.
| Visual element | Meaning | Contribution to span |
|---|---|---|
| vertical purple stripe | initial unmasked tokens | 64 (always-visible tokens) |
| diagonal purple band | sliding window | window length |
| white cells | pruned tokens | 0 |
Two vertically placed masks: same length, two elastic rules → two masks.
Two horizontally placed masks: same elastic rule, two lengths → two masks.
How masks are applied
Are the masks all overlapped on the same input sequence or applied sequentially?
All heads apply their masks in parallel; they are neither overlapped nor sequential. Each mask gates both the attention computation and the KV Cache entries for its head.
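As an illustration of how each head's mask gates its KV cache in parallel (names and the dense keep-mask are assumptions for clarity; an actual implementation would size each head's cache directly from its span):

```python
import torch

def kv_keep_mask(spans, N, sink=64):
    """Boolean [heads, N] mask of which cached positions each head retains.

    spans: [heads] attention spans (sliding window plus sink, as in the paper).
    Every head keeps the first `sink` tokens and its most recent window.
    """
    pos = torch.arange(N)
    window = (spans - sink).clamp(min=1)[:, None]   # per-head window lengths
    recent = pos[None, :] >= (N - window)           # most recent tokens, per head
    return recent | (pos[None, :] < sink)
```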
Elastic rule to mask
Do elastic rules decide the range of the masks?
Yes. An elastic rule, together with the input length N, determines the attention span (Eq. 1); the resulting window width fixes the mask.
Selecting span/window
How are the sizes of the attention span and the attention windows selected, and how are they related?
MoA searches the elastic-rule parameters for each head offline. At inference time, the input length N determines each head's span and window via Eq. 1.
We will incorporate these explanations in the camera-ready version.
Q3: Perplexity length range
For the experiments, could you please explain why perplexity is only reported for lengths of 8k–12k?
Apologies for the confusion. Perplexity is calculated over all response tokens for every test sample. The label “8k–12k” is not a cut-off we imposed—it simply marks the prompt + response length of data items. Detailed perplexity setups are discussed in Appendix A.1. We will clarify this more explicitly in the final version.
Q4: Data efficiency & task agnosticism
What is the data efficiency of finding the Pareto point, and is this task-agnostic?
- Data efficiency. The entire optimization uses only 150 calibration sequences (2k–8k tokens each) for profiling and 250 sequences (12k tokens each) for validation (Appendix A.1).
- Task-agnostic rule. The optimized elastic rules are used across all downstream benchmarks—code, QA, retrieval, etc.—without any per-task re-profiling or tuning.
- Empirical generalization. As discussed in Section 5.3, MoA's offline-optimized elastic rules generalize well to different tasks and unseen lengths.
Q5: Shared patterns across heads
Do you observe any shared pattern between heads?
Yes, layer position strongly influences spans:
- Early layers: heads keep a broad, almost uniform view of the sequence, matching the wide-span example in Fig. 2 (right).
- Last layers: most heads focus on the first few tokens and the diagonal, as in Fig. 2 (left); only a few still use wide spans.
MoA naturally recovers this trend. As detailed in Appendix C.2 and Figure 11:
- Early-layer heads are assigned larger spans,
- Last-layer heads mostly receive shorter spans.
This aligns with known attention behaviors. We will add examples in the Appendix to fully visualize how MoA’s learned masks correspond to existing attention patterns at different layers.
Q6: LongBench accuracy averaged across tasks?
Is the accuracy for LongBench averaged across tasks?
Yes. The reported accuracy is averaged over all data items across LongBench. Per-task scores are provided in Figure 10 (Appendix B.1.3) for full transparency.
I appreciate the authors' answers to my questions and further clarifications. I will raise my score by 1.
Thank you very much for recognizing our response and raising your score! We sincerely appreciate your detailed and insightful feedback. If you have any remaining concerns or points that require further discussion, we would be more than happy to address them.
We sincerely thank all reviewers for their insightful feedback and valuable recognition of our work. We appreciate reviewers highlighting the technical soundness, compelling results, and high potential impact of MoA. We have carefully addressed all raised concerns and questions through additional experimental results, discussions, and manuscript revisions. We warmly welcome any further suggestions or discussions.
The paper proposes MoA, a training-free sparse attention method that assigns heterogeneous, per-head attention spans that adapt with input length. Unlike prior uniform sparse attention approaches, MoA uses gradient-based profiling to measure the influence of each attention value on prediction loss and optimizes attention masks accordingly. Evaluated on Vicuna and Llama3 models, MoA significantly improves inference efficiency with marginal accuracy loss.
Reviewers praise the training-free method and the strong results. The main concern is the lack of comparisons against SnapKV and PyramidKV, which has been addressed in the rebuttal. The authors are encouraged to tighten their language in the revision following R HcQe's feedback.
This one is ready for COLM as is.