MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention
Abstract
Reviews and Discussion
The paper addresses the computational bottleneck in long-context Vision Language Models (VLMs) during the pre-filling stage. The authors observe that attention in VLMs exhibits unique sparse patterns, particularly a "Grid" pattern in video inputs due to spatiotemporal locality. They also identify distinct modality boundaries in attention (No-Boundary, K-Boundary, Q-Boundary, and 2D-Boundary).
Based on these observations, the authors propose MAPSparse, a permutation-based dynamic sparse attention approach that significantly reduces computation while maintaining performance. MAPSparse consists of three main components: 1) the Grid sparse attention pattern for intra-modality attention, 2) Q-Boundary and 2D-Boundary patterns for handling mixed-modality inputs, and 3) a Modality-Aware Sparse Attention Search Algorithm to optimize patterns for each attention head.
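To make the permutation idea concrete, the following toy sketch (our own illustration, not the authors' kernel; the frame-major token layout, block size, and `keep_ratio` heuristic are assumed purely for illustration) shows how regrouping video tokens by spatial position turns a strided Grid pattern into contiguous tiles that a block-sparse kernel can keep or skip wholesale.

```python
import numpy as np

def grid_permutation(num_frames, tokens_per_frame):
    # Frame-major video tokens (index = t * tokens_per_frame + s) are regrouped
    # spatial-major, so keys at the same spatial slot across frames become contiguous.
    return np.arange(num_frames * tokens_per_frame) \
             .reshape(num_frames, tokens_per_frame).T.reshape(-1)

def permuted_block_sparse_attention(q, k, v, perm, block=64, keep_ratio=0.1):
    """Toy dense reference of permutation-based block-sparse attention (single head).
    Tiles are scored by their mean attention logit, and only the top `keep_ratio`
    fraction of key blocks per query block is kept; a real kernel would skip the rest."""
    qp, kp, vp = q[perm], k[perm], v[perm]
    n, d = qp.shape
    assert n % block == 0
    nb = n // block
    scores = qp @ kp.T / np.sqrt(d)                       # dense, for illustration only
    tile_mass = scores.reshape(nb, block, nb, block).mean(axis=(1, 3))
    keep = max(1, int(round(nb * keep_ratio)))
    top_tiles = np.argsort(-tile_mass, axis=1)[:, :keep]  # kept key blocks per query block
    tile_mask = np.zeros((nb, nb), dtype=bool)
    tile_mask[np.arange(nb)[:, None], top_tiles] = True
    mask = tile_mask.repeat(block, axis=0).repeat(block, axis=1)
    scores = np.where(mask, scores, -np.inf)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = probs @ vp
    return out[np.argsort(perm)]                          # undo the permutation

# Example: 8 frames of 16 tokens each, 32-dim head
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((8 * 16, 32))
perm = grid_permutation(num_frames=8, tokens_per_frame=16)
out = permuted_block_sparse_attention(q, k, v, perm, block=16, keep_ratio=0.25)
```

The permutation itself is only a gather/scatter; the paper's search algorithm additionally selects a sparse pattern per attention head, which this sketch does not attempt.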
Experiments on video understanding tasks, Video Needle in a Haystack (V-NIAH), and Mix Modality Needle in a Haystack (MM-NIAH) show that MAPSparse achieves up to 8.3× speedup over FlashAttention-2 and 1.7× over MInference at 1M token length while maintaining competitive performance.
Questions to Authors
How stable are the identified patterns across different models and datasets? Are there additional attention patterns beyond those identified?
Claims and Evidence
- VLM attention exhibits unique sparse patterns compared to LLMs, including a Grid pattern.
  - Evidence: Visualizations in Figures 2 and 5 demonstrate these patterns clearly. Section 2 provides quantitative analysis showing VLMs require only 5.78% of attention weights to recall 95% of total attention.
- Modality boundaries create distinct attention patterns that require specialized handling.
  - Evidence: Figures 2b, 2c, and 3 visualize these boundaries, and the authors provide detailed analysis of how they affect attention.
- MAPSparse accelerates the pre-filling stage by up to 8.3× at 1M tokens while maintaining performance.
  - Evidence: Comprehensive experiments on video understanding tasks (Table 1) and NIAH tasks (Figure 4) show strong performance. Figure 6 demonstrates the claimed speedups.
Methods and Evaluation Criteria
Methods:
- The analysis of attention patterns in VLMs (Section 2) establishes the foundation for the approach
- The MAPSparse framework (Section 3) is described algorithmically with pseudocode for each component
- Implementation details are provided, including how permutation is applied and how dynamic sparse indices are constructed
Evaluation Criteria:
- Accuracy preservation: The authors compare performance on diverse video understanding tasks including video captioning, QA, and retrieval
- Computational efficiency: Both theoretical FLOPs reduction and actual runtime speedups are measured
- Scalability: Tests at different context lengths (from ~20K to 1M tokens)
- Comparison against baselines: The method is compared against multiple sparse attention approaches and token compression methods
Theoretical Claims
The paper does not make significant novel theoretical claims. It's primarily an empirical investigation of attention patterns in VLMs and an engineering solution to accelerate computation. However, the observations about Grid patterns and modality boundaries add to the theoretical understanding of how VLMs process multi-modal information.
Experimental Design and Analysis
The experimental design is thorough and well-executed.
- Two state-of-the-art long-context VLMs are used (Llava-Video and LongVila)
- Multiple video understanding benchmarks plus two NIAH tasks
- Several sparse attention methods and a visual token compression method
- The contribution of each component is evaluated, especially in the MM-NIAH task
One minor limitation is that the paper focuses primarily on accuracy and retrieval performance. Additional evaluations like robustness would strengthen the analysis further.
Supplementary Material
Appendix A (Modality-Aware Sparse Attention Search Algorithm) and Appendix C (Experiment Details) were reviewed.
Relation to Prior Literature
The paper positions its work well within the broader literature on VLMs and attention efficiency.
Missing Essential References
N/A
Other Strengths and Weaknesses
The visualizations are outstanding and provide clear insights into attention patterns. While the paper mentions implementation details of the attention pattern visualization, more detail could be provided in the main text. In addition, are there additional attention patterns beyond those identified?
Other Comments or Suggestions
"Spars Transformer" on line 319 should be "Sparse Transformer".
We greatly appreciate the reviewer's recognition and thoughtful, constructive feedback. Below, we address each of the comments and concerns in detail.
- "How stable are the identified patterns across different models and datasets"
Thank you for the suggestion. We have tested these patterns across various VLMs and a wide range of datasets including open-domain question answering, multiple-choice question answering, video captioning, and video information retrieval. Additionally, we provide further pattern analysis on two recent VLMs, Qwen2.5-VL [1] and VideoChat-Flash [2], to demonstrate that the attention patterns leveraged in our method are consistently observed across different models and tasks, see https://anonymous.4open.science/r/complementary-D5B2/. While we also observe other patterns, such as local window attention, these are in fact special cases of our existing patterns and can be effectively captured by our method.
- "Typo issue"
Thank you for your careful review. We will fix this issue in the next version.
[1] Qwen2.5-VL Technical Report, 2024.
[2] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling, 2024.
Thank you for your response! This is a strong paper, and after considering your clarifications, I am still leaning towards acceptance.
Thank you for your effort in reviewing and recognizing our work. Your feedback has been very helpful, and we will incorporate these suggestions in the next version of the paper.
This paper proposes MAPSparse, a modality-aware permutation sparse attention method that accelerates long-context VLMs. It features permutation-based Grid sparse attention, Q-Boundary/2D-Boundary patterns for mixed-modality boundaries, and a Modality-Aware Sparse Attention Search Algorithm. Experiments demonstrate its effectiveness.
Update After Rebuttal
I'd like to thank the authors for the clear rebuttal, which solves most of my concern. Therefore, I maintain the original rating towards acceptance.
Questions to Authors
My questions mainly focus on the experiments and definitions, as listed in the weaknesses section.
- Because this work optimizes an operation in VLMs, it would be better to report average latency and memory consumption in Table 1 for better comparison with other methods.
- An efficient operation in VLMs should enable more frames (or higher FPS), which is essential for video understanding. The authors are recommended to provide a table with results and latency for increasing numbers of frames.
- It would be better to explain the meaning of the recall rates and how they are calculated in Section 2.1.
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes, it makes sense.
Theoretical Claims
Yes.
Experimental Design and Analysis
Yes
Supplementary Material
Yes
Relation to Prior Literature
This work accelerates attention computation on top of existing vision-language models.
Missing Essential References
No
Other Strengths and Weaknesses
Strength
- This work aims to optimize long-sequence processing for vision-language models, which is promising and essential in current frameworks.
- The proposed method is shown to work well with two different models for long video understanding.
- The problem is well analyzed with good comparisons.
Weakness
- Because this work optimizes an operation in VLMs, it would be better to report average latency and memory consumption in Table 1 for better comparison with other methods.
- An efficient operation in VLMs should enable more frames (or higher FPS), which is essential for video understanding. The authors are recommended to provide a table with results and latency for increasing numbers of frames.
- It would be better to explain the meaning of the recall rates and how they are calculated in Section 2.1.
Other Comments or Suggestions
No
We greatly appreciate the reviewer’s thoughtful and constructive feedback. We respond to each of the comments and concerns below.
- "it's better to report average latency and memory consumption in Table 1"
Thank you for the suggestion. We have updated Table 1 to include a dedicated column for end-to-end latency. Regarding memory usage, since all baselines consume similar memory except for VisionZip, we did not include a separate memory column. Below, we provide a table comparing the latency and average performance of MAPSparse with other baselines. The number of frames is fixed at 256, and all tests are conducted on a single A100 GPU. As shown in Section 4.5, our method achieves greater speedup with longer video inputs and more frames.
| Method | Prefill Latency (s) | Avg. Performance |
|---|---|---|
| Llava-Video (Full Attention) | 17.3 | 55.5 |
| A-Shape | 12.0 | 53.1 |
| Tri-Shape | 11.6 | 54.7 |
| VisionZip | OOM | OOM |
| MInference | 15.4 | 55.2 |
| MAPSparse | 14.2 | 57.6 |
RTable 1. Prefill latency and average performance (%) of different methods on video understanding tasks using Llava-Video.
- "longer frames"
Thanks for the suggestion. Here we show the results of MAPSparse and the full-attention baseline with various numbers of frames, from 32 to 256. The results show that more frames consistently improve performance on video understanding tasks. MAPSparse improves system throughput by supporting more frames and higher-QPS video inputs within the same latency constraints.
| Method | Num. Frames | VideoMME (w/o sub) | VideoMME (w/ sub) |
|---|---|---|---|
| LongVILA (Full Attention) | 32 | 55.2 | 58.1 |
| MAPSparse | 32 | 56.5 | 58.1 |
| Full Attention | 64 | 59.1 | 62.0 |
| MAPSparse | 64 | 58.4 | 62.4 |
| Full Attention | 128 | 59.4 | 64.7 |
| MAPSparse | 128 | 59.0 | 64.8 |
| Full Attention | 256 | 60.1 | 65.1 |
| MAPSparse | 256 | 60.0 | 65.5 |
RTable 2. Performance (%) with different numbers of frames on video understanding tasks using LongVILA.
- "meaning and how the recall rates are calculated"
The recall rate reflects how much of the original attention score is retained in the top-k sparse attention. It is computed by summing the softmax scores over the top-k key vectors retrieved for each query vector, i.e., $\text{recall}@k(q_i) = \sum_{j \in \text{Top-}k(q_i)} \operatorname{softmax}\!\left(q_i K^\top / \sqrt{d}\right)_j$. We will update the paper to include a formal definition of the recall rate in the next revision.
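For concreteness, here is a minimal single-head sketch of this computation (illustrative only; the function name and the averaging over queries are not tied to our exact evaluation script):

```python
import numpy as np

def topk_attention_recall(q, k, top_k):
    """Average fraction of each query's softmax attention mass that falls on
    its top-k keys (toy single-head version of the recall rate described above)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    topk_mass = np.sort(probs, axis=-1)[:, -top_k:].sum(axis=-1)  # mass on top-k keys
    return float(topk_mass.mean())
```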
Thanks for the rebuttal. It solves most of my concern. I maintain the original rating towards acceptance.
Thank you for taking the time to review our paper. We will incorporate the updated content into the next version.
MAPSparse provides an innovative and effective solution for accelerating the pre-filling stage of long-context VLMs.
Questions to Authors
See the weaknesses.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
N/A
Experimental Design and Analysis
Yes.
Supplementary Material
Yes. Code.
Relation to Prior Literature
N/A
Missing Essential References
No.
Other Strengths and Weaknesses
Strengths
- Innovative Approach: The paper introduces MAPSparse, a method that accelerates the pre-filling stage of long-context Vision-Language Models (VLMs) using Modality-Aware Permutation Sparse Attention. This approach significantly improves processing speed, achieving up to 8.3x and 1.7x speedups over FlashAttention-2 and MInference, respectively, at a 1M-token context length.
- Integration of Theory and Practice: The paper provides a detailed analysis of different attention head patterns (such as No-Boundary, K-Boundary, Q-Boundary, and 2D-Boundary) and introduces a modality-aware sparse attention search algorithm to optimize cross-modal and intra-modal sparse patterns. These improvements are not only theoretical breakthroughs but also show excellent performance in practical applications.
- Comprehensive Experimental Validation: The research utilizes two state-of-the-art long-context VLMs (Llava-Video and LongVila) and tests them on various video understanding tasks, including video captioning, video question answering, and video information retrieval. Additionally, a mixed-modal "needle in a haystack" task is designed to evaluate multimodal input performance, demonstrating that the method significantly enhances efficiency while maintaining high accuracy.
- Strong Adaptability: MAPSparse effectively addresses boundary issues between different modalities and maintains the continuity of sparse distribution across modalities, supporting more complex scenarios and larger datasets.
Weaknesses
- Generalization ability of the method: Although MAPSparse shows significant performance improvements, its implementation requires an offline search for the optimal sparse pattern of each head. The authors did not show any experiments on the generalization of these patterns across datasets.
- Generalization of the patterns to ViT: Every MLLM contains a ViT before tokens are fed into the LLM. Although the ViT contributes little to the latency, it determines how many tokens are fed into the LLM. Would optimizing tokens in the ViT further reduce the pre-filling latency?
- Specific Application Scenarios: The paper primarily focuses on applications in video understanding and multimodal learning. Its applicability and effectiveness for other types of data or tasks, such as pure text or non-visual modalities, have yet to be fully explored.
- Impact Statement: "There are many potential societal consequences of our work, none which we feel must be specifically highlighted here." The authors could perhaps expand on this.
Other Comments or Suggestions
See the weaknesses.
We greatly appreciate the reviewer’s thoughtful and constructive feedback. We respond to each of the comments and concerns below.
- "Generalization of the patterns"
We evaluate our method across a broad range of multimodal tasks and VLMs, including open-domain question answering, multiple-choice question answering, video captioning, and video information retrieval. The models tested include Llava-Video and LongVILA, and our method consistently demonstrates strong performance across all these scenarios.
We additionally provide pattern analysis (shown in https://anonymous.4open.science/r/complementary-D5B2/) and experimental results (RTable 1) on the latest VLMs, Qwen2.5-VL [1] and VideoChat-Flash [2], further demonstrating the generalizability of both the observed patterns and our method across different models and tasks.
- "Patterns in ViT"
Optimizing tokens in ViT can help reduce prefilling latency, but ViT is not the primary bottleneck in long video processing for VLMs. As shown in Figure 1(a), ViT accounts for less than 2% of the total latency, while attention operations in VLMs dominate with over 95%. Nevertheless, MAPSparse is orthogonal to ViT optimization techniques such as VisionZip (shown in Table 1) and VideoChat-Flash [2]. We include an additional experiment based on VideoChat-Flash, and the results indicate that MAPSparse integrates well with token compression methods.
| Model | VideoDC | ActNet-QA | EgoSchema | Next-QA | PerceptionTest | VideoMME (w/o sub) | VideoMME (w/ sub) | Avg. |
|---|---|---|---|---|---|---|---|---|
| VideoChat-Flash | 3.21 | 53.6 | 57.0 | 81.2 | 69.1 | 63.2 | 70.5 | 56.8 |
| w/ MAPSparse | 3.19 | 54.3 | 57.3 | 79.8 | 69.1 | 63.0 | 70.2 | 56.7 |
RTable 1. Performance of different methods on video understanding tasks using VideoChat-Flash.
- "Generalization to other modalities"
Thank you for the suggestion. We conducted additional experiments on the text-only long-context benchmark SCBench [3], comparing MAPSparse with full attention, as shown in RTable 2. The results demonstrate that MAPSparse generalizes well across modalities and tasks. We will include the corresponding results in the next version.
| Method | Retr.kv | En.Sum | En.QA | En.MC | ICL | Avg. |
|---|---|---|---|---|---|---|
| Full Attention | 52.0 | 38.3 | 25.1 | 65.9 | 54.1 | 47.1 |
| MAPSparse | 47.0 | 36.6 | 23.6 | 62.6 | 62.9 | 46.5 |
RTable 2. Performance of different methods on long-context benchmark SCBench using Llama-3.1-8B in 128K.
- "Impact Statement"
We apologize for the misunderstanding. We followed the ICML [4] guidelines when writing the impact statement. Since our method does not alter the original VLM's outputs, it does not introduce any additional risks.
[1] Qwen2.5-VL Technical Report, 2024.
[2] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling, 2024.
[3] SCBench: A KV Cache-Centric Analysis of Long-Context Methods, ICLR 2025.
[4] https://icml.cc/Conferences/2025/CallForPapers
MAPSparse offers a good solution to a key bottleneck in long-context VLMs by introducing modality-aware sparse attention with strong empirical grounding. The paper is well-motivated and technically sound, demonstrating up to 8.3× speedup without compromising accuracy across diverse tasks and models. The rebuttal thoroughly addressed concerns on generalization, latency, and applicability, strengthening the paper's contribution. Overall, this work presents a solid advancement for efficient VLM inference, and the AC and reviewers recommend acceptance.