MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention
Abstract
Reviews and Discussion
The paper addresses the computational bottleneck in long-context Vision Language Models (VLMs) during the pre-filling stage. The authors observe that attention in VLMs exhibits unique sparse patterns, particularly a "Grid" pattern in video inputs due to spatiotemporal locality. They also identify distinct modality boundaries in attention (No-Boundary, K-Boundary, Q-Boundary, and 2D-Boundary).
Based on these observations, the authors propose MAPSparse, a permutation-based dynamic sparse attention approach that significantly reduces computation while maintaining performance. MAPSparse consists of three main components: 1) the Grid sparse attention pattern for intra-modality attention, 2) Q-Boundary and 2D-Boundary patterns for handling mixed-modality inputs, and 3) a Modality-Aware Sparse Attention Search Algorithm to optimize patterns for each attention head.
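To make the permutation idea concrete, the following toy sketch (our own illustration, not the authors' kernel; the frame-major token layout, block size, and `keep_ratio` heuristic are assumed purely for illustration) shows how regrouping video tokens by spatial position turns a strided Grid pattern into contiguous tiles that a block-sparse kernel can keep or skip wholesale.

```python
import numpy as np

def grid_permutation(num_frames, tokens_per_frame):
    # Frame-major video tokens (index = t * tokens_per_frame + s) are regrouped
    # spatial-major, so keys at the same spatial slot across frames become contiguous.
    return np.arange(num_frames * tokens_per_frame) \
             .reshape(num_frames, tokens_per_frame).T.reshape(-1)

def permuted_block_sparse_attention(q, k, v, perm, block=64, keep_ratio=0.1):
    """Toy dense reference of permutation-based block-sparse attention (single head).
    Tiles are scored by their mean attention logit, and only the top `keep_ratio`
    fraction of key blocks per query block is kept; a real kernel would skip the rest."""
    qp, kp, vp = q[perm], k[perm], v[perm]
    n, d = qp.shape
    assert n % block == 0
    nb = n // block
    scores = qp @ kp.T / np.sqrt(d)                       # dense, for illustration only
    tile_mass = scores.reshape(nb, block, nb, block).mean(axis=(1, 3))
    keep = max(1, int(round(nb * keep_ratio)))
    top_tiles = np.argsort(-tile_mass, axis=1)[:, :keep]  # kept key blocks per query block
    tile_mask = np.zeros((nb, nb), dtype=bool)
    tile_mask[np.arange(nb)[:, None], top_tiles] = True
    mask = tile_mask.repeat(block, axis=0).repeat(block, axis=1)
    scores = np.where(mask, scores, -np.inf)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = probs @ vp
    return out[np.argsort(perm)]                          # undo the permutation

# Example: 8 frames of 16 tokens each, 32-dim head
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((8 * 16, 32))
perm = grid_permutation(num_frames=8, tokens_per_frame=16)
out = permuted_block_sparse_attention(q, k, v, perm, block=16, keep_ratio=0.25)
```

The permutation itself is only a gather/scatter; the paper's search algorithm additionally selects a sparse pattern per attention head, which this sketch does not attempt.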
Experiments on video understanding tasks, Video Needle in a Haystack (V-NIAH), and Mix Modality Needle in a Haystack (MM-NIAH) show that MAPSparse achieves up to 8.3× speedup over FlashAttention-2 and 1.7× over MInference at 1M token length while maintaining competitive performance.
Questions to Authors
How stable are the identified patterns across different models and datasets? Are there additional attention patterns beyond those identified?
Claims and Evidence
- VLM attention exhibits unique sparse patterns compared to LLMs, including a Grid pattern.
  - Evidence: Visualizations in Figures 2 and 5 demonstrate these patterns clearly. Section 2 provides quantitative analysis showing VLMs require only 5.78% of attention weights to recall 95% of total attention.
- Modality boundaries create distinct attention patterns that require specialized handling.
  - Evidence: Figures 2b, 2c, and 3 visualize these boundaries, and the authors provide detailed analysis of how they affect attention.
- MAPSparse accelerates the pre-filling stage by up to 8.3× at 1M tokens while maintaining performance.
  - Evidence: Comprehensive experiments on video understanding tasks (Table 1) and NIAH tasks (Figure 4) show strong performance. Figure 6 demonstrates the claimed speedups.
Methods and Evaluation Criteria
Methods:
- The analysis of attention patterns in VLMs (Section 2) establishes the foundation for the approach
- The MAPSparse framework (Section 3) is described algorithmically with pseudocode for each component
- Implementation details are provided, including how permutation is applied and how dynamic sparse indices are constructed
Evaluation Criteria:
- Accuracy preservation: The authors compare performance on diverse video understanding tasks including video captioning, QA, and retrieval
- Computational efficiency: Both theoretical FLOPs reduction and actual runtime speedups are measured
- Scalability: Tests at different context lengths (from ~20K to 1M tokens)
- Comparison against baselines: The method is compared against multiple sparse attention approaches and token compression methods
Theoretical Claims
The paper does not make significant novel theoretical claims. It's primarily an empirical investigation of attention patterns in VLMs and an engineering solution to accelerate computation. However, the observations about Grid patterns and modality boundaries add to the theoretical understanding of how VLMs process multi-modal information.
Experimental Design and Analysis
The experimental design is thorough and well-executed.
- Two state-of-the-art long-context VLMs are used (Llava-Video and LongVila)
- Multiple video understanding benchmarks plus two NIAH tasks
- Several sparse attention methods and a visual token compression method
- The contribution of each component is evaluated, especially in the MM-NIAH task
One minor limitation is that the paper focuses primarily on accuracy and retrieval performance. Additional evaluations like robustness would strengthen the analysis further.
Supplementary Material
Appendix A (Modality-Aware Sparse Attention Search Algorithm) and Appendix C (Experiment Details) were reviewed.
Relation to Prior Literature
The paper positions its work well within the broader literature on VLMs and attention efficiency.
Missing Essential References
N/A
Other Strengths and Weaknesses
The visualizations are outstanding and provide clear insights into attention patterns. While the paper mentions implementation details of the attention pattern visualization, more detail could be provided in the main text. In addition, are there additional attention patterns beyond those identified?
Other Comments or Suggestions
"Spars Transformer" on line 319 should be "Sparse Transformer".
We greatly appreciate the reviewer's recognition and thoughtful, constructive feedback. Below, we address each of the comments and concerns in detail.
- "How stable are the identified patterns across different models and datasets"
Thank you for the suggestion. We have tested these patterns across various VLMs and a wide range of datasets including open-domain question answering, multiple-choice question answering, video captioning, and video information retrieval. Additionally, we provide further pattern analysis on two recent VLMs, Qwen2.5-VL [1] and VideoChat-Flash [2], to demonstrate that the attention patterns leveraged in our method are consistently observed across different models and tasks, see https://anonymous.4open.science/r/complementary-D5B2/. While we also observe other patterns, such as local window attention, these are in fact special cases of our existing patterns and can be effectively captured by our method.
- "Typo issue"
Thank you for your careful review. We will fix this issue in the next version.
[1] Qwen2.5-VL Technical Report, 2024.
[2] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling, 2024.
Thank you for your response! This is a strong paper, and after considering your clarifications, I am still leaning towards acceptance.
Thank you for your effort in reviewing and recognizing our work. Your feedback has been very helpful, and we will incorporate these suggestions in the next version of the paper.
This paper proposes MAPSparse, a modality-aware permutation sparse attention method that accelerates long-context VLMs. It features permutation-based Grid sparse attention, Q-Boundary/2D-Boundary patterns for mixed-modality boundaries, and a Modality-Aware Sparse Attention Search Algorithm. Experiments demonstrate its effectiveness.
Update After Rebuttal
I'd like to thank the authors for the clear rebuttal, which solves most of my concern. Therefore, I maintain the original rating towards acceptance.
Questions to Authors
My questions mainly focus on the experiments and definitions, as listed in the weaknesses section.
- Because this work optimizes an operation in VLMs, it would be better to report average latency and memory consumption in Table 1 for better comparison with other methods.
- An efficient operation in VLMs should enable more frames (or higher FPS), which is essential for video understanding. The authors are recommended to provide a table with results and latency for increasing numbers of frames.
- It would be better to explain the meaning of the recall rates and how they are calculated in Section 2.1.
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes, it makes sense.
Theoretical Claims
Yes.
Experimental Design and Analysis
Yes
Supplementary Material
Yes
Relation to Prior Literature
This work accelerates attention computation on top of existing vision-language models.
Missing Essential References
No
Other Strengths and Weaknesses
Strength
- This work aims to optimize long-sequence processing for vision-language models, which is promising and essential in current frameworks.
- The proposed method is shown to work well with two different models for long video understanding.
- The problem is well analyzed with good comparisons.
Weakness
- Because this work optimizes an operation in VLMs, it would be better to report average latency and memory consumption in Table 1 for better comparison with other methods.
- An efficient operation in VLMs should enable more frames (or higher FPS), which is essential for video understanding. The authors are recommended to provide a table with results and latency for increasing numbers of frames.
- It would be better to explain the meaning of the recall rates and how they are calculated in Section 2.1.
Other Comments or Suggestions
No
We greatly appreciate the reviewer’s thoughtful and constructive feedback. We respond to each of the comments and concerns below.
- "it's better to report average latency and memory consumption in Table 1"
Thank you for the suggestion. We have updated Table 1 to include a dedicated column for end-to-end latency. Regarding memory usage, since all baselines consume similar memory except for VisionZip, we did not include a separate memory column. Below, we provide a table comparing the latency and average performance of MAPSparse with other baselines. The number of frames is fixed at 256, and all tests are conducted on a single A100 GPU. As shown in Section 4.5, our method achieves greater speedup with longer video inputs and more frames.
| Method | Prefill Latency (s) | Avg. Performance |
|---|---|---|
| Llava-Video (Full Attention) | 17.3 | 55.5 |
| A-Shape | 12.0 | 53.1 |
| Tri-Shape | 11.6 | 54.7 |
| VisionZip | OOM | OOM |
| MInference | 15.4 | 55.2 |
| MAPSparse | 14.2 | 57.6 |
RTable 1. Prefill latency and average performance (%) of different methods on video understanding tasks using Llava-Video.
- "longer frames"
Thanks for the suggestion. Here we show the results of MAPSparse and the full-attention baseline with various numbers of frames, from 32 to 256. The results show that more frames consistently improve performance on video understanding tasks. MAPSparse improves system throughput by supporting more frames and higher-QPS video inputs within the same latency constraints.
| Method | Num. Frames | VideoMME (w/o sub) | VideoMME (w/ sub) |
|---|---|---|---|
| LongVILA (Full Attention) | 32 | 55.2 | 58.1 |
| MAPSparse | 32 | 56.5 | 58.1 |
| Full Attention | 64 | 59.1 | 62.0 |
| MAPSparse | 64 | 58.4 | 62.4 |
| Full Attention | 128 | 59.4 | 64.7 |
| MAPSparse | 128 | 59.0 | 64.8 |
| Full Attention | 256 | 60.1 | 65.1 |
| MAPSparse | 256 | 60.0 | 65.5 |
RTable 2. Performance (%) with different numbers of frames on video understanding tasks using LongVILA.
- "meaning and how the recall rates are calculated"
The recall rate reflects how much of the original attention score is retained in the top-k sparse attention. It is computed by summing the softmax scores over the top-k key vectors retrieved for each query vector, i.e., $\text{recall}@k(q_i) = \sum_{j \in \text{Top-}k(q_i)} \operatorname{softmax}\!\left(q_i K^\top / \sqrt{d}\right)_j$. We will update the paper to include a formal definition of the recall rate in the next revision.
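For concreteness, here is a minimal single-head sketch of this computation (illustrative only; the function name and the averaging over queries are not tied to our exact evaluation script):

```python
import numpy as np

def topk_attention_recall(q, k, top_k):
    """Average fraction of each query's softmax attention mass that falls on
    its top-k keys (toy single-head version of the recall rate described above)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    topk_mass = np.sort(probs, axis=-1)[:, -top_k:].sum(axis=-1)  # mass on top-k keys
    return float(topk_mass.mean())
```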
Thanks for the rebuttal. It solves most of my concern. I maintain the original rating towards acceptance.
Thank you for taking the time to review our paper. We will incorporate the updated content into the next version.
MAPSparse provides an innovative and effective solution for accelerating the pre-filling stage of long-context VLMs.
Questions to Authors
See the weaknesses.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
N/A
Experimental Design and Analysis
Yes.
Supplementary Material
Yes. Code.
Relation to Prior Literature
N/A
Missing Essential References
No.
Other Strengths and Weaknesses
Strengths
- Innovative Approach: The paper introduces MAPSparse, a method that accelerates the pre-filling stage of long-context Vision-Language Models (VLMs) using Modality-Aware Permutation Sparse Attention. This approach significantly improves processing speed, achieving up to 8.3x and 1.7x speedups over FlashAttention-2 and MInference, respectively, at a 1M-token context length.
- Integration of Theory and Practice: The paper provides a detailed analysis of different attention head patterns (such as No-Boundary, K-Boundary, Q-Boundary, and 2D-Boundary) and introduces a modality-aware sparse attention search algorithm to optimize cross-modal and intra-modal sparse patterns. These improvements are not only theoretical breakthroughs but also show excellent performance in practical applications.
- Comprehensive Experimental Validation: The research utilizes two state-of-the-art long-context VLMs (Llava-Video and LongVila) and tests them on various video understanding tasks, including video captioning, video question answering, and video information retrieval. Additionally, a mixed-modal "needle in a haystack" task is designed to evaluate multimodal input performance, demonstrating that the method significantly enhances efficiency while maintaining high accuracy.
- Strong Adaptability: MAPSparse effectively addresses boundary issues between different modalities and maintains the continuity of sparse distribution across modalities, supporting more complex scenarios and larger datasets.
Weaknesses
- Generalization ability of the method: Although MAPSparse shows significant performance improvements, its implementation requires an offline search for the optimal sparse pattern of each head. The authors did not show any experiments on the generalization of these patterns across datasets.
- Generalization of the patterns to ViT: Every MLLM contains a ViT before tokens are fed into the LLM. Although the ViT contributes little to the latency, it determines how many tokens are fed into the LLM. Would optimizing tokens in the ViT further reduce the pre-filling latency?
- Specific Application Scenarios: The paper primarily focuses on applications in video understanding and multimodal learning. Its applicability and effectiveness for other types of data or tasks, such as pure text or non-visual modalities, have yet to be fully explored.
- Impact Statement: "There are many potential societal consequences of our work, none which we feel must be specifically highlighted here." The authors could perhaps expand on this.
Other Comments or Suggestions
See the weaknesses.
We greatly appreciate the reviewer’s thoughtful and constructive feedback. We respond to each of the comments and concerns below.
- "Generalization of the patterns"
We evaluate our method across a broad range of multimodal tasks and VLMs, including open-domain question answering, multiple-choice question answering, video captioning, and video information retrieval. The models tested include Llava-Video and LongVILA, and our method consistently demonstrates strong performance across all these scenarios.
We additionally provide pattern analysis (shown in https://anonymous.4open.science/r/complementary-D5B2/) and experimental results (RTable 1) on the latest VLMs, Qwen2.5-VL [1] and VideoChat-Flash [2], further demonstrating the generalizability of both the observed patterns and our method across different models and tasks.
- "Patterns in ViT"
Optimizing tokens in ViT can help reduce prefilling latency, but ViT is not the primary bottleneck in long video processing for VLMs. As shown in Figure 1(a), ViT accounts for less than 2% of the total latency, while attention operations in VLMs dominate with over 95%. Nevertheless, MAPSparse is orthogonal to ViT optimization techniques such as VisionZip (shown in Table 1) and VideoChat-Flash [2]. We include an additional experiment based on VideoChat-Flash, and the results indicate that MAPSparse integrates well with token compression methods.
| Model | VideoDC | ActNet-QA | EgoSchema | Next-QA | PerceptionTest | VideoMME (w/o sub) | VideoMME (w/ sub) | Avg. |
|---|---|---|---|---|---|---|---|---|
| VideoChat-Flash | 3.21 | 53.6 | 57.0 | 81.2 | 69.1 | 63.2 | 70.5 | 56.8 |
| w/ MAPSparse | 3.19 | 54.3 | 57.3 | 79.8 | 69.1 | 63.0 | 70.2 | 56.7 |
RTable 1. Performance of different methods on video understanding tasks using VideoChat-Flash.
- "Generalization to other modalities"
Thank you for the suggestion. We conducted additional experiments on the text-only long-context benchmark SCBench [3], comparing MAPSparse with full attention, as shown in RTable 2. The results demonstrate that MAPSparse generalizes well across modalities and tasks. We will include the corresponding results in the next version.
| Method | Retr.kv | En.Sum | En.QA | En.MC | ICL | Avg. |
|---|---|---|---|---|---|---|
| Full Attention | 52.0 | 38.3 | 25.1 | 65.9 | 54.1 | 47.1 |
| MAPSparse | 47.0 | 36.6 | 23.6 | 62.6 | 62.9 | 46.5 |
RTable 2. Performance of different methods on long-context benchmark SCBench using Llama-3.1-8B in 128K.
- "Impact Statement"
We apologize for the misunderstanding. We followed the ICML [4] guidelines when writing the impact statement. Since our method does not alter the original VLM's outputs, it does not introduce any additional risks.
[1] Qwen2.5-VL Technical Report, 2024.
[2] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling, 2024.
[3] SCBench: A KV Cache-Centric Analysis of Long-Context Methods, ICLR 2025.
[4] https://icml.cc/Conferences/2025/CallForPapers
MAPSparse offers a good solution to a key bottleneck in long-context VLMs by introducing modality-aware sparse attention with strong empirical grounding. The paper is well-motivated and technically sound, demonstrating up to 8.3× speedup without compromising accuracy across diverse tasks and models. The rebuttal thoroughly addressed concerns on generalization, latency, and applicability, strengthening the paper's contribution. Overall, this work presents a solid advancement for efficient VLM inference, and the AC and reviewers recommend acceptance.