SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
We propose an efficient, text-aware, training-free visual token optimization mechanism called SparseVLM.
Abstract
Reviews and Discussion
SparseVLM is a training-free framework that optimizes VLMs by reducing computational load through selective pruning of visual tokens based on relevance to text tokens. Using self-attention matrices, it introduces a rank-based adaptive sparsification strategy and a token recycling mechanism to retain essential information. Compatible with models like LLaVA, Mini-Gemini, and Qwen2-VL, it reduces latency and costs while preserving accuracy.
Update after rebuttal
I thank the authors for the rebuttal. I will keep my original rating, which was already positive.
Questions for Authors
Can SparseVLM be generalized to work with other transformer-based architectures beyond the ones tested in the paper?
Claims and Evidence
SparseVLM can reduce computational overhead in VLMs without sacrificing performance.
- Experimental results demonstrate that SparseVLM can reduce FLOPs by up to 77.8%, decrease CUDA latency by up to 37%, and maintain 97% of the original accuracy in image tasks like VQA.
SparseVLM improves the efficiency of the video question answering task.
- In video understanding tasks, SparseVLM outperforms FastV by 34.4% in accuracy and achieves a smaller decrease in performance (0.17 GPT score drop vs. 1.02 for FastV).
SparseVLM provides a trade-off between efficiency and performance.
- The method shows significant improvement in efficiency (e.g., 62.8% reduction in FLOPs) with minimal accuracy loss, validated through extensive experiments on multiple benchmarks.
Methods and Evaluation Criteria
- Methods: SparseVLM uses a two-step process for pruning and refining visual tokens. First, visual tokens are rated based on their correlation with relevant text tokens (text raters). Then, the tokens are sparsified using an adaptive rank-based strategy. Pruned tokens are not simply discarded: a recycling mechanism aggregates and compresses them into more compact forms. The method also integrates well with FlashAttention for efficient computation (a simplified sketch of the scoring-and-pruning step follows this list).
- Evaluation Criteria: The evaluation focuses on accuracy, FLOPs, and latency reduction across multiple vision-language benchmarks, such as GQA, VQAv2, and MSRVTT, as well as video benchmarks like TGIF-QA and MSVD-QA. Trade-offs between computational savings and performance are carefully analyzed.
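As a rough illustration of this two-step idea, the sketch below scores visual tokens by the attention they receive from selected text raters, keeps the highest-scoring ones, and folds the rest into a single recycled token. The variable names and the fixed `keep_ratio` are illustrative assumptions; the actual method uses a rank-based, per-layer budget and a more elaborate recycling rule.

```python
import torch

def prune_visual_tokens(hidden, attn, visual_idx, text_idx, keep_ratio=0.3):
    """Score visual tokens by the attention they receive from text 'rater' tokens,
    keep the top fraction, and recycle the rest into one aggregated token.

    hidden:     (seq_len, dim) hidden states at one decoder layer
    attn:       (seq_len, seq_len) head-averaged self-attention matrix
    visual_idx: LongTensor of visual-token positions
    text_idx:   LongTensor of selected text-rater positions
    """
    # Relevance of each visual token = mean attention it receives from text raters.
    relevance = attn[text_idx][:, visual_idx].mean(dim=0)        # (num_visual,)

    num_keep = max(1, int(keep_ratio * visual_idx.numel()))
    keep = relevance.topk(num_keep).indices                      # indices into visual_idx
    drop = torch.ones_like(relevance, dtype=torch.bool)
    drop[keep] = False

    kept = hidden[visual_idx[keep]]                              # (num_keep, dim)

    # Token recycling: compress pruned tokens into a single relevance-weighted
    # token instead of discarding their information entirely.
    if drop.any():
        w = torch.softmax(relevance[drop], dim=0).unsqueeze(-1)  # (num_drop, 1)
        recycled = (w * hidden[visual_idx[drop]]).sum(dim=0, keepdim=True)
        kept = torch.cat([kept, recycled], dim=0)
    return kept
```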
Theoretical Claims
- Visual tokens can be sparsified adaptively based on text tokens to optimize VLM performance: The authors argue that not all visual tokens are equally relevant to a given task and that adaptive sparsification based on textual context can improve both efficiency and accuracy. This claim is supported by the results showing the effectiveness of the rank-based strategy and text-aware pruning.
- Token recycling helps preserve important visual information during sparsification: The authors suggest that recycling pruned tokens minimizes information loss, which is crucial for maintaining performance in downstream tasks. This is backed by experimental improvements in performance when token recycling is used.
Experimental Design and Analysis
- Design: SparseVLM was tested on multiple VLMs (LLaVA, Mini-Gemini, Qwen2-VL) across image and video understanding tasks. Three token configurations (576, 192, 64 tokens) were used to assess the impact of sparsification. Comparative experiments with other methods (e.g., FastV, ToMe) were conducted to evaluate improvements in efficiency and accuracy.
- Analysis: The primary analysis focuses on comparing the reduction in FLOPs and latency with minimal loss in accuracy. The method was also analyzed on its ability to handle video tasks, where temporal and spatial dependencies are critical.
Supplementary Material
- Several figures illustrate how SparseVLM selectively prunes visual tokens, demonstrating how the model retains critical information for tasks like VQA while discarding redundant tokens.
- Additional supplementary material includes detailed analyses of the latency vs. accuracy and FLOPs vs. accuracy trade-offs, showing how SparseVLM achieves an efficient balance between computation and performance.
Relation to Prior Literature
- SparseVLM is an important contribution to the ongoing research on making vision-language models more efficient. It addresses the challenge of visual token redundancy that is prevalent in high-resolution image and video tasks, complementing other recent works that focus on reducing the computational burden in VLMs.
- The method is compared with other pruning and sparsification strategies (e.g., FastV, ToMe, PDrop) and outperforms them in both computational efficiency and accuracy.
Missing Essential References
N/A
Other Strengths and Weaknesses
Weaknesses: The method might struggle with extremely complex tasks where minimal information loss is critical, as the adaptive sparsification might lead to dropping crucial visual tokens in certain edge cases.
Other Comments or Suggestions
It would be useful to see a deeper exploration of the impact of token recycling in more complex, multi-step reasoning tasks where preserving information is more crucial.
We sincerely thank the reviewer SmbH for the effort in reviewing our paper. Our responses to the reviewer's comments are summarized as follows.
1. How effectively SparseVLM performs on highly complex tasks.
Thank you for your attention to our performance on more complex tasks. First, thanks to adaptive sparsification, our method can determine whether the vision tokens in the current layer are redundant for the prompt, which avoids losing critical information. In addition, our token recycling mechanism retains as much information as possible in a cost-effective manner. To validate this, we perform experiments on MMMU Pro, MMBench (Attribute Reasoning, AR), and MMBench (Logical Reasoning, LR) with LLaVA-7B. Our performance on MMMU Pro consistently approaches the average accuracy reported in Table 1 across all token settings, demonstrating the method's robustness on highly complex tasks. Furthermore, even with a severe reduction in the number of tokens from 576 to 192, the MMBench (LR) performance surpasses that of the baseline. The reason is that SparseVLM progressively sparsifies tokens, letting the model gradually shift its focus from the image background to the objects in the image and then to the details of those objects, which yields strong performance. In summary, SparseVLM handles highly complex tasks well while maintaining vital information.
| Settings | MMMU Pro | MMBench (AR) | MMBench (LR) | Avg. (Loss) |
|---|---|---|---|---|
| Upper bound | 30.3 | 73.3 | 30.5 | 44.7 (0) |
| 192 tokens | 30.0 | 71.9 | 33.9 | 45.3 (+0.6) |
| 128 tokens | 29.4 | 70.4 | 30.5 | 43.4 (-1.3) |
| 64 tokens | 26.4 | 67.2 | 26.7 | 40.1 (-4.6) |
2. The analysis of the impact of token recycling in more complex, multi-step reasoning tasks.
We appreciate the insightful suggestion. To further explore the effectiveness of our token recycling (TR) strategy in more complex, multi-step reasoning tasks, we conduct ablation experiments on MMBench (LR) and MMMU Pro. Across multiple sparsity settings (64, 128, and 192 retained tokens), token recycling yields an average improvement of 0.8% on MMBench (LR) and 1.0% on MMMU Pro. Notably, as more vision tokens are pruned, the benefit brought by our recycling method increases. For instance, under the 64-token setting, pruned token recycling boosts MMMU Pro accuracy from 24.8% to 26.4%. In short, compared to common tasks (e.g., GQA in Table 4), our token recycling mechanism proves even more valuable in complex, multi-step reasoning tasks.
| Benchmark | 64 tokens | 128 tokens | 192 tokens | Avg. |
|---|---|---|---|---|
| MMBench (LR) | 25.2 | 30.0 | 33.6 | 29.6 |
| + TR | 26.7 | 30.5 | 33.9 | 30.4 |
| MMMU Pro | 24.8 | 29.2 | 28.8 | 27.6 |
| + TR | 26.4 | 29.4 | 30.0 | 28.6 |
3. Further experiments on other transformer-based VLM architectures.
We sincerely thank you for your advice. We have conducted experiments across various transformer-based architectures, including different vision encoders (e.g., CLIP in LLaVA and CLIP+ConvNeXt in Mini-Gemini) and different LLM decoders (e.g., LLaMA in LLaVA and Qwen in Qwen2-VL). To further validate compatibility, we also tested our approach on Cambrian-1-13B (576-token setting), another transformer-based VLM architecture. As shown in the table below, under the 192-token setting, our method incurs a 2.3% smaller performance drop than PDrop; meanwhile, under the 64-token setting, our approach keeps the accuracy loss below 8%. Therefore, our method is fully compatible with other transformer-based architectures.
| Method | GQA | MMB | SQA | SEED | TextVQA | MMMU | Avg. (Loss) |
|---|---|---|---|---|---|---|---|
| Cambrian-1-13B | 64.0 | 75.5 | 79.3 | 74.4 | 72.8 | 40.0 | 67.7 (0) |
| Retain 192 Tokens | |||||||
| PDrop [CVPR 2025] | 59.5 | 72.5 | 75.3 | 68.4 | 70.0 | 38.0 | 63.9 (-3.8) |
| SparseVLM | 61.6 | 74.2 | 79.0 | 72.0 | 71.4 | 38.8 | 66.2 (-1.5) |
| Retain 128 Tokens | |||||||
| PDrop [CVPR 2025] | 56.8 | 70.9 | 74.0 | 65.5 | 68.8 | 35.4 | 61.9 (-5.8) |
| SparseVLM | 60.4 | 73.0 | 78.2 | 70.6 | 70.2 | 38.0 | 65.1 (-2.6) |
| Retain 64 Tokens | |||||||
| PDrop [CVPR 2025] | 46.2 | 61.6 | 71.7 | 56.4 | 58.0 | 26.1 | 53.3 (-14.4) |
| SparseVLM | 54.4 | 68.7 | 78.0 | 64.0 | 62.2 | 31.2 | 59.8 (-7.9) |
We sincerely appreciate your thorough review and valuable suggestions to improve our work. If you find our responses satisfactory, we would be grateful for your reconsideration of the rating.
SparseVLM introduces a text-guided visual token sparsification framework for efficient VLM inference without significant performance loss. The key idea is to use the textual input to identify which image regions (visual tokens) are most relevant and prune away the rest.
Questions for Authors
- As noted, SparseVLM currently assumes a guiding text query. It would be interesting to see how it could be applied to tasks like image captioning or general visual dialogue, where the prompt might be generic (e.g., "Describe the image"). I'm curious about experimental results on more general tasks, such as Table 7 or Appendix Figures 1 and 2 from the original LLaVA paper [1], with the LLaVA-Bench (In-the-Wild) and LLaVA-Bench (COCO) datasets.
Claims and Evidence
Based on the motivation that "visual tokens should be sparsified adaptively based on the question prompt", the authors demonstrate the performance of SparseVLM through experiments. The motivation is clear and the experiments are well done, although the overall paper is based on empirical results.
Methods and Evaluation Criteria
See above.
Theoretical Claims
Experimental Design and Analysis
See above.
Supplementary Material
The authors provide supplementary material containing additional analyses and technical details that complement the main paper.
Relation to Prior Literature
The paper proposed SparseVLM, an efficient inference method for VLM without additional finetuning.
Missing Essential References
Other Strengths and Weaknesses
The method's applicability might be somewhat limited to scenarios where a guiding text is present. For tasks such as captioning (without a specific query), SparseVLM in its current form might not know which tokens to prune because there is no external query to focus it. The authors do not explicitly address how their method might extend to, for example, caption generation or more complex tasks. Since the proposed method aims for efficient VLM inference, it should demonstrate generalizability to other general VLM tasks, rather than just simple QA with classification-style answers.
Other Comments or Suggestions
We sincerely thank the reviewer bpD8 for the effort in reviewing our paper. Our responses to the reviewer's comments are summarized as follows.
1. The analysis of the generality of SparseVLM to other general multimodal tasks.
We sincerely appreciate your valuable suggestion. To investigate how SparseVLM sparsifies visual tokens when handling questions such as "Describe this photo in detail.", we visualized the token sparsification results from different layers of SparseVLM on these questions. We were surprised to find that, when confronted with such questions, SparseVLM first shifts the model's attention to the background information of the image. It then sparsifies the corresponding tokens and focuses on the main objects in the image (e.g., the people in the image). Finally, it attends to the detailed information of those objects (e.g., a person's nose, mouth, and eyes), enabling the model to describe the image smoothly and progressively. Therefore, our method follows a global-to-local understanding process, yet still effectively captures the key information in captioning tasks. For a visualization example, please refer to the anonymous link: https://imgur.com/a/lbp1qkI.
2. The performance of SparseVLM on other general multimodal tasks.
(1) Qualitative analysis
Here we take the captioning task as an example and conduct a qualitative analysis. Specifically, we query the model with question_id 1 in LLaVA-Bench (In-the-Wild), where the prompt is "Describe this photo in detail".
The GPT-4 answer is: This photo features an aerial view of Diamond Head, located in the Hawaiian Islands. The volcanic tuff cone appears prominent and impressive, with its ridges and slopes covered in lush green vegetation. The crater of Diamond Head is clearly visible, surrounded by the steep walls of the structure. In the background, you can see the vast expanse of the ocean with varying shades of blue, showcasing the pristine waters surrounding the island.
The original LLaVA answer is: The image showcases a breathtaking view of a large, rocky island with a mountainous terrain. The island is surrounded by a beautiful blue ocean, and it appears to be a tropical paradise. The island is covered in lush greenery, and there are several houses scattered across the landscape. In the distance, a city can be seen, adding to the picturesque scene. The combination of the island's natural beauty and the city's presence creates a unique and captivating view.
Our method (192 tokens) answer is: The image showcases a breathtaking aerial view of a large, rocky island with mountainous terrain. The island is surrounded by a beautiful blue ocean, and the shoreline is dotted with houses scattered across the landscape. The houses vary in size and are positioned at different distances from the shore, creating a lively and picturesque scene.
From the answers, it can be seen that our method successfully preserves the general style, atmosphere, and scene composition of the image. The observed minor omissions of specific details (such as 'lush greenery') represent a conscious trade-off, as we intentionally retain only 33.3% (192/576) of the visual tokens to optimize efficiency.
(2) Quantitative analysis
We further performed experiments on LLaVA-Bench (In-the-Wild) and LLaVA-Bench (COCO); the results are shown in the table below. SparseVLM achieves strong performance on general visual dialogue. When the number of tokens is sparsified to 192, it even yields a slight performance improvement, which demonstrates the effectiveness of our algorithm. The reason is that SparseVLM progressively sparsifies tokens, letting the model's attention gradually narrow from the image background to the objects within the image and then to the details of those objects. This allows the model to understand the image step by step and achieve strong performance on tasks such as general visual dialogue.
| Settings | In-the-Wild (Loss) | COCO (Loss) | Avg. (Loss) |
|---|---|---|---|
| LLaVA-7B | 62.7 (0) | 72.3 (0) | 67.5 (0) |
| 192 tokens | 63.7 (+1.0) | 71.9 (-0.4) | 67.8 (+0.3) |
| 128 tokens | 61.7 (-1.0) | 70.4 (-1.9) | 66.1 (-1.4) |
| 64 tokens | 60.4 (-2.3) | 66.2 (-6.1) | 63.3 (-4.2) |
We sincerely appreciate your thorough review and valuable suggestions to improve our work. If you find our responses satisfactory, we would be grateful for your reconsideration of the rating.
This paper presents SparseVLM, a text-guided, training-free token optimization mechanism that improves the efficiency of vision-language models (VLMs). SparseVLM selects relevant text tokens to evaluate the significance of visual tokens using self-attention matrices and then progressively prunes irrelevant tokens. To enhance sparsity, it employs a rank-based strategy and a token recycling method. Experimental results demonstrate that SparseVLM improves VLM efficiency across various image and video understanding tasks. For example, integrating SparseVLM with LLaVA reduces FLOPs by 54% and CUDA latency by 37%, while retaining 97% of the original accuracy.
Questions for Authors
Please see above.
Claims and Evidence
The claims in this paper—"visual tokens should be sparsified adaptively based on the question prompt" and "not all prompt tokens should be considered"—are intuitively reasonable and supported by experimental results.
Methods and Evaluation Criteria
Method
- L202-203: When first encountering the term "rank," it is unclear how it is computed. Although the supplementary material provides details, please add a reference here for clarity.
- Rank(P) Selection: Is Rank(P) too tricky? How is the "appropriate threshold" selected? Additionally, how can you ensure that this threshold remains valid across all LLM layers?
- Deletions: Where is the ablation study on determining the number of deletions? Please compare this approach with a simpler min-k method.
- RoPE Position IDs: What are the position IDs of RoPE for the recycled token and the maintained vision tokens?
Evaluation
Overall, the evaluation setup is well-designed. If possible, could you provide results on more challenging benchmarks, such as VideoMME, MMMU Pro, or LVBench?
Theoretical Claims
I have checked Section 3.4, and it looks fine.
Experimental Design and Analysis
The experimental results are indeed impressive, significantly surpassing previous methods. However, as mentioned earlier, please provide results on more challenging benchmarks if possible.
Supplementary Material
- I have checked the appendix.
- The authors provided a code repository; however, due to the lack of instructions, I did not thoroughly review it.
Relation to Prior Literature
n/a
Missing Essential References
This paper is not the first to propose that "visual tokens should be sparsified adaptively based on the question prompt." Please include a discussion of related works, such as [a] VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation (NeurIPS 2024) and [b] Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs (arXiv 2410.10441).
Other Strengths and Weaknesses
Could you also provide some failure cases for Figure 6? This would help in understanding the scenarios where the method does not work.
Other Comments or Suggestions
Please see above.
1. References on the computation of the rank of the matrix.
The rank of a matrix is the maximum number of linearly independent rows (equivalently, columns); we determine it via singular value decomposition. We will add a reference to the corresponding part of the supplementary material here.
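For concreteness, a small sketch of how the numerical rank is typically obtained from singular values; the tolerance rule shown is the standard library convention rather than necessarily the exact setting used in our released code.

```python
import torch

def numerical_rank(P: torch.Tensor) -> int:
    """Rank = number of singular values above a tolerance tied to the largest one
    (the standard numerical-rank criterion)."""
    s = torch.linalg.svdvals(P)                      # singular values in descending order
    tol = s[0] * max(P.shape) * torch.finfo(P.dtype).eps
    return int((s > tol).sum())

# torch.linalg.matrix_rank(P) implements the same idea as a built-in.
```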
2. Explanation of the selection via the rank method and its threshold.
Mathematically, matrix rank quantifies non-redundant information through the number of linearly independent vectors. The motivation of Rank(P) is to adaptively determine a dynamic threshold per layer for identifying redundant visual tokens to sparsify, eliminating the heuristic sparsity ratios that require manual tuning in other methods. In our experiments, Rank(P) is scaled only to enable fair comparisons under matching sparsity and to demonstrate the algorithm's superiority; practical deployment requires no scaling adjustments or tuning.
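To make the adaptive idea concrete, here is a hedged sketch of one way a per-layer deletion budget can be tied to the rank deficiency of the text-to-visual attention sub-matrix; the scaling factor `lam` and the exact mapping are illustrative assumptions, not the paper's precise rule.

```python
import torch

def adaptive_prune_budget(P: torch.Tensor, lam: float = 1.0) -> int:
    """P: (num_text_raters, num_visual) attention sub-matrix at one decoder layer.
    The rank deficiency of P is used as a proxy for how many visual tokens are
    redundant at this layer; a (near-)full-rank P yields little or no pruning."""
    num_visual = P.shape[1]
    deficiency = num_visual - int(torch.linalg.matrix_rank(P))
    return max(0, min(int(lam * deficiency), num_visual - 1))  # always keep at least one token
```

A fixed min-k baseline, by contrast, would hardcode the number of retained tokens regardless of how redundant the current layer actually is.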
3. Ablation study on determining the number of deletions.
We conducted comparative experiments with the min-k (top-k) selection strategy. Compared to the min-k method, our rank-based method performs better across all token settings. The results suggest that the rank-based method is more advantageous in information-constrained scenarios.
| Method | GQA | MMB | SQA | SEED | TextVQA | Avg. |
|---|---|---|---|---|---|---|
| Retain 192 Tokens | ||||||
| min-k | 57.2 | 62.7 | 70.1 | 56.6 | 56.7 | 60.7 |
| rank | 59.5 | 64.1 | 68.7 | 58.7 | 57.8 | 61.8 (+0.9) |
| Retain 128 Tokens | ||||||
| min-k | 55.2 | 60.9 | 69.7 | 54.0 | 55.6 | 59.1 |
| rank | 58.4 | 64.5 | 68.6 | 58.2 | 56.7 | 61.3 (+2.2) |
| Retain 64 Tokens | ||||||
| min-k | 48.5 | 52.5 | 70.1 | 45.4 | 50.1 | 53.3 |
| rank | 53.8 | 60.1 | 69.8 | 52.2 | 53.4 | 57.9 (+4.6) |
4. Adjustment of the position IDs.
Token pruning in SparseVLM creates layer-varying KV token counts, typically below the original sequence length. Using unpruned position IDs misaligns RoPE, as IDs mismatch the pruned KV cache. We solve this via dynamic position ID recalibration: (1) Track retained tokens from prior steps via LLaVA’s KV cache. (2) Reassign position IDs to preserved/recycled tokens. (3) Apply RoPE with updated IDs. Our method updates positional encodings during token additions/removals, ensuring alignment with pruned sequences and spatial consistency across encoding schemes.
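A minimal sketch of step (2) above, assuming a boolean `keep_mask` over the current sequence; how recycled tokens are slotted in and the KV cache bookkeeping are simplified here relative to the actual implementation.

```python
import torch

def recalibrate_position_ids(keep_mask: torch.Tensor) -> torch.Tensor:
    """keep_mask: (seq_len,) bool, True for tokens that survive pruning (including
    recycled tokens written back into the sequence). Returns fresh, contiguous
    position IDs for the surviving tokens, so the indices fed to RoPE match the
    length of the pruned KV cache."""
    return torch.arange(int(keep_mask.sum()), device=keep_mask.device)

# Example: a 6-token layer input where positions 2 and 4 were pruned.
mask = torch.tensor([True, True, False, True, False, True])
print(recalibrate_position_ids(mask))   # tensor([0, 1, 2, 3])
```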
5. Performance of SparseVLM on more challenging benchmarks.
(1) Image tasks
We tested on MMBench (LR) and MMMU Pro. Our performance on MMMU Pro approaches the average accuracy reported in Table 1 across all token settings. Besides, SparseVLM's progressive token sparsification even improves MMBench (LR) performance at 192 tokens by shifting focus from global to local.
| Settings | MMMU Pro (Loss) | MMBench (LR) (Loss) |
|---|---|---|
| LLaVA-7B | 30.3 (0) | 30.5 (0) |
| 192 tokens | 30.0 (-0.3) | 33.9 (+3.4) |
| 128 tokens | 29.4 (-0.9) | 30.5 (0) |
| 64 tokens | 26.4 (-3.9) | 26.7 (-3.8) |
(2) Video tasks
We tested our method on VideoMME and LVBench. Our method prunes the vast majority of vision tokens, retaining only 227, while preserving around 99% of the original accuracy. It outperforms FastV by 6.8% on VideoMME and 2.2% on LVBench.
| Method | VideoMME (Loss) | LVBench (Loss) |
|---|---|---|
| VideoLLaVA | 39.9 (0) | 26.3 (0) |
| FastV | 32.9 (-7.0) | 23.8 (-2.5) |
| SparseVLM | 39.7 (-0.2) | 26.0 (-0.3) |
6. Claim about "first attempt to explore text-aware guidance" in vision token sparsification.
(1) VideoLLM-MoD uses a trained linear projection to predict the importance of vision tokens. The basis is from page 5 of their paper: "The LayerExpert determines the importance score μ for a given vision token using a linear projection." Reference: arXiv:2408.16730.
(2) FreeVideoLLM uses the prompt feature for temporal and spatial sampling. The evidence is from page 5 of their paper: "We calculate the relation score (i.e., cosine similarity) between the frame features and the prompt feature." Reference: arXiv:2410.10441. We consider it concurrent work with ours since it has not been accepted yet. Still, we will include it in our related work.
(3) Our approach builds solely on text tokens and filters the visual-relevant text raters to improve performance, which has a fundamental difference.
7. Failure cases for sparsification.
"n281241.jpg" in GQA. Q: "What is the picture hanging above?" The label is "Chair," but the VLM predicts "Wall." The question asks what is above the object, not where the picture is hung. Misguided attention to background tokens causes spatial misinterpretation, resulting in incorrect answers.
"n494677.jpg" in GQA. Q: "Is the weather cloudy?" The label is "Yes," but the VLM predicts "No." Despite sparse vision tokens accurately capturing the sky, the model still errs, indicating reasoning limitations rather than sparsification flaws.
For visualizations, please refer to the anonymous link: https://imgur.com/a/Hq0qYYa.
We appreciate your thorough suggestions. If you find our responses satisfactory, we would be grateful for your reconsideration of the rating.
This paper presents a novel framework for sparsifying visual tokens to enhance the efficiency of Vision-Language Models (VLMs) in a training-free manner. It proposes a strategy to select relevant text tokens as evaluators of visual tokens, followed by pruning redundant visual tokens with a recycling mechanism to minimize the loss of information. LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of the original performance.
Update after rebuttal
I thank the authors for the rebuttal. I will keep my original rating, which is weak accept.
Questions for Authors
N/A
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
No theoretical claims.
Experimental Design and Analysis
Yes.
Supplementary Material
No.
Relation to Prior Literature
The paper contributes to the area of efficient multi-modal large language models.
Missing Essential References
No.
Other Strengths and Weaknesses
Strengths:
- The paper is well-written and easy to follow, allowing readers to readily grasp the design principles behind SparseVLM.
- Experiments in both image and video settings prove the effectiveness of SparseVLM.
- The performance has been validated across three distinct MLLM architectures, demonstrating strong generalization capability.
Weaknesses:
- Lack of discussion of similar works like LLaVA-PruMerge [1] and FastV. In my view, these are also methods that use text-aware guidance for visual token sparsification, so the claim of being the 'first attempt' needs further support.
- Questionable method for calculating visual redundancy: The approach uses Rank(P) based on attention maps to determine visual token redundancy. However, the correlation between attention vectors and visual redundancy is unclear, with no supporting evidence or references provided to justify why this approach is valid.
[1] Shang, Y., Cai, M., Xu, B., Lee, Y. J., & Yan, Y. (2024). Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388.
Other Comments or Suggestions
N/A
We sincerely thank the reviewer TyrH for the effort in reviewing our paper. Our responses to the reviewer's comments are summarized as follows.
1. Further explanation of our claim about the "first attempt to explore text-aware guidance" in vision token sparsification.
Our method is the first to explicitly utilize the guidance of text for the sparsification of VLMs. The detailed discussions are as follows.
(1) LLaVA-PruMerge only leverages the class tokens to predict the importance of spatial visual tokens. The basis for this derives from the statement on page 5 of their paper: "Note that we use the class attention value from the penultimate layer for this calculation." Reference: https://arxiv.org/pdf/2403.15388.
(2) FastV directly utilizes all the tokens, including text tokens, vision tokens themselves, and system tokens to evaluate the significance of vision tokens. The evidence is derived from the following statement on page 8 of their paper: "We simply compute the average attention-score one token received from all other tokens as the criteria in our experiment." Reference: https://arxiv.org/pdf/2403.06764.
(3) In contrast, our approach relies solely on text tokens and further filters for visually relevant text raters to improve performance, which is a fundamental difference from LLaVA-PruMerge and FastV. Its effectiveness is also validated in Table 1, where our method shows clear superiority.
2. Further explanation on how the correlation among attention vectors relates to visual redundancy.
We identified several relevant studies that also investigated the correlation between matrix rank, model training, and the informativeness of visual tokens.
The first study [1] reveals a positive correlation between the rank of attention matrices and the performance of Transformer models, suggesting that the rank of the attention matrix influences both the representational capacity and learning effectiveness of the attention mechanism. A higher rank generally indicates a greater ability to encode diverse and informative token interactions. On the other hand, lower ranks often indicate linearly dependent attention vectors, which correspond to overlapping or redundant visual information.
Moreover, some methods leverage singular value decomposition (SVD) to prune the attention output matrix, illustrating that low-rank structures often correspond to redundant information. For example, in [2], the authors use SVD to quantify each token’s information contribution. Combined with attention scores, this allows them to prioritize and retain the most informative tokens. This approach closely aligns with our use of rank, as both rely on analyzing the redundancy of visual tokens in attention matrices.
Mathematically, the rank captures the number of linearly independent directions in the attention space: a lower rank implies that multiple tokens share similar attention patterns, justifying the removal of redundant ones. Based on these findings and this definition, we use Rank(P) to quantify redundancy among visual tokens.
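As a toy illustration of this point (constructed for this response, not taken from the paper), duplicated attention patterns immediately lower the numerical rank of the attention sub-matrix:

```python
import torch

# Toy illustration: when several visual tokens attract (near-)identical attention
# patterns, the rows of the attention sub-matrix become linearly dependent and
# its numerical rank falls below the token count, signalling redundancy.
torch.manual_seed(0)
distinct  = torch.softmax(torch.randn(4, 16), dim=-1)   # 4 genuinely different attention rows
redundant = distinct[:1].repeat(4, 1)                   # 4 duplicates of the first row
P = torch.cat([distinct, redundant], dim=0)             # 8 "visual token" rows in total

print(P.shape[0], int(torch.linalg.matrix_rank(P)))     # 8 tokens, rank 4 -> half are redundant
```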
We also validate our approach experimentally by analyzing attention maps in various vision tasks. Results confirm that attention matrices often contain redundant structure, and pruning based on Rank(P) effectively reduces computational overhead with minimal impact on performance.
[1] Min, Zeping, and Zhong Li. "On the Limitation and Redundancy of Transformers: A Rank Perspective."
[2] Tan, Xudong, et al. "TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models."
We sincerely appreciate your thorough review and valuable suggestions to improve our work. If you find our responses satisfactory, we would be grateful for your reconsideration of the rating.
The paper introduces SparseVLM, a training-free technique to optimize the number of visual tokens in a vision-language model. The method consists of two main components: identifying relevant text tokens to rate visual tokens, and pruning and recycling visual tokens based on their significance. Furthermore, the authors explain how the method can be integrated with FlashAttention for efficient inference and present latency benchmarks.
Questions for Authors
Based on the recent trends in the VLM literature, it would be good to have the following comparisons in the paper.
- SparseVLM applied to visual tokens from hierarchical backbones like ConvNeXt/FastViTHD used in ConvLLaVA[1]/FastVLM[2] respectively.
- SparseVLM applied to VLMs with multiple vision encoders with different pre-trainings, for example Cambrian-1[3].
- For VLMs trained to support dynamic image resolutions like InternVL2, Qwen2-VL, comparisons with appropriate input sizes to match SparseVLM token lengths.
[1] - ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [2] - FastVLM: Efficient Vision Encoding for Vision Language Models [3] - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Claims and Evidence
The authors claim it is the first training-free approach that explores text guidance to prune visual tokens. After reviewing relevant literature, it might be accurate to claim that the proposed method is the first training-free approach to rely on explicit text-guidance, as opposed to implicit guidance used in methods like FastV. Claims on efficiency is backed by actual latency estimates, which adds credibility to the proposed method.
Methods and Evaluation Criteria
The authors evaluate the proposed method on the popular LLaVA-1.5 model over typical VLM benchmarks like GQA, SQA, MMB, POPE, SEED-Bench, etc., which is common for most works in this domain. They also show the performance of SparseVLM applied to Qwen2-VL, a more recent VLM trained to support dynamic image resolution.
Theoretical Claims
The theoretical claims seem accurate; I do not find any major issues with the proposed method or its complexity analysis.
Experimental Design and Analysis
- Recent VLMs like Qwen2-VL support dynamic image resolutions; an accurate baseline for Table 2 would be to simply resize the input image to produce the same number of visual tokens as listed in rows 2, 3, and 4.
- The method also focuses primarily on vision encoders that are ViT-based, for completeness, analysis of the method applied to visual tokens from hierarchical backbones like ConvNeXt and FastViTHD used in ConvLLaVA[1] and FastVLM[2] respectively would be good to have.
- The method does not discuss any cases when multiple vision encoders with different pre-trainings are used in a VLM, for example Cambrian-1[3].
[1] - ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [2] - FastVLM: Efficient Vision Encoding for Vision Language Models [3] - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Supplementary Material
The supplementary material contains code for the proposed method and extensive discussion of FlashAttention support.
Relation to Prior Literature
The introduced method is highly relevant to the broader community working on optimizing VLM inference. More recent VLMs like Qwen2-VL or InternVL2 support resolutions as high as 4-7 MP, which can lead to a significant increase in visual tokens. Training-free techniques like SparseVLM (introduced in this paper) enable cost-effective optimization of these models for inference.
Missing Essential References
I believe the paper discusses most of the relevant works that are published and not concurrent works.
Other Strengths and Weaknesses
One potential weakness of the method could arise in a multi-turn QA setup. From an initial assessment, it seems the prefilled KV cache may not be the most accurate basis for answering a follow-up question that pertains to a different region of the image. Token recycling may not be sufficient to alleviate this issue.
Other Comments or Suggestions
It would be good to discuss potential implications of the method in a multi-turn QA setup.
We sincerely thank the reviewer PtfS for the effort in reviewing our paper. Our responses to the reviewer's comments are summarized as follows.
1. Analysis of the efficiency of input image resizing versus vision token pruning in VLMs that support dynamic image resolutions.
Firstly, while dynamic-resolution capability in VLMs requires additional training compared to fixed-resolution models, our method is entirely training-free. We nevertheless conducted comparative experiments on Qwen2-VL to evaluate our effectiveness. The table below shows that SparseVLM incurs a comparatively smaller performance degradation than the resizing approach. Secondly, our method is orthogonal to dynamic resolution and can further reduce the number of tokens on resized images while maintaining nearly lossless accuracy.
| Method | MMB | POPE | TextVQA | Avg. (Loss) |
|---|---|---|---|---|
| Qwen2-VL | 80.5 (1323) | 86.4 (1311) | 84.3 (1326) | 83.7 (0) |
| Retain 600 tokens | ||||
| Resize | 78.1 | 85.3 | 78.0 | 80.5 (-3.2) |
| SparseVLM | 79.6 | 86.5 | 80.3 | 82.1 (-1.6) |
| Retain 500 tokens | ||||
| Resize | 78.1 | 85.3 | 75.3 | 79.6 (-4.1) |
| SparseVLM | 78.8 | 86.3 | 79.0 | 81.4 (-2.3) |
| Retain 400 tokens | ||||
| Resize | 77.0 | 85.1 | 73.4 | 78.5 (-5.2) |
| SparseVLM | 79.0 | 85.8 | 77.1 | 80.7 (-3.0) |
2. Extending SparseVLM to VLM architectures with hierarchical vision encoders.
Since FastVLM has not yet open-sourced its weights and model architecture, we conducted experiments on ConvLLaVA-7B (256-token setting). The table below shows that our method achieves near-lossless accuracy (a 0.6% drop) with the ConvNeXt vision encoder under the 192-token setting. Besides, under extreme compression (the 64-token setting), the performance of PDrop drops by 19.0%, while SparseVLM drops by only 8.8%.
| Method | MMB | POPE | SEED | TextVQA | MMVet | MMMU | Avg. (Loss) |
|---|---|---|---|---|---|---|---|
| ConvLLaVA-7B | 68.8 | 87.6 | 69.3 | 62.5 | 44.4 | 35.1 | 61.3 (0) |
| Retain 192 Tokens | |||||||
| PDrop [CVPR 2025] | 67.3 | 84.0 | 62.5 | 60.0 | 42.9 | 33.7 | 58.4 (-2.9) |
| SparseVLM | 68.0 | 87.0 | 68.7 | 62.0 | 44.5 | 34.5 | 60.7 (-0.6) |
| Retain 128 Tokens | |||||||
| PDrop [CVPR 2025] | 65.6 | 83.9 | 61.0 | 59.2 | 42.5 | 30.2 | 57.1 (-4.2) |
| SparseVLM | 67.0 | 86.3 | 66.2 | 60.8 | 43.2 | 32.7 | 59.3 (-2.0) |
| Retain 64 Tokens | |||||||
| PDrop [CVPR 2025] | 35.8 | 57.0 | 46.0 | 49.4 | 42.0 | 24.8 | 42.3 (-19.0) |
| SparseVLM | 63.2 | 76.1 | 59.7 | 54.4 | 35.0 | 26.6 | 52.5 (-8.8) |
3. Extending SparseVLM to VLM architectures with multiple vision encoders.
Our existing experiments (Figure 4) already cover Mini-Gemini, a VLM architecture employing multiple vision encoders (CLIP and ConvNeXt). Following the reviewer's advice, we further conducted experiments on Cambrian-1-13B. As shown in the table below, our method is well suited to multiple vision encoders. Under the 192-token setting, our method incurs a 2.3% smaller performance drop than PDrop; meanwhile, under the 64-token setting, our approach keeps the accuracy loss below 8%.
| Method | GQA | MMB | SQA | SEED | TextVQA | MMMU | Avg. (Loss) |
|---|---|---|---|---|---|---|---|
| Cambrian-1-13B | 64.0 | 75.5 | 79.3 | 74.4 | 72.8 | 40.0 | 67.7 (0) |
| Retain 192 Tokens | |||||||
| PDrop [CVPR 2025] | 59.5 | 72.5 | 75.3 | 68.4 | 70.0 | 38.0 | 63.9 (-3.8) |
| SparseVLM | 61.6 | 74.2 | 79.0 | 72.0 | 71.4 | 38.8 | 66.2 (-1.5) |
| Retain 128 Tokens | |||||||
| PDrop [CVPR 2025] | 56.8 | 70.9 | 74.0 | 65.5 | 68.8 | 35.4 | 61.9 (-5.8) |
| SparseVLM | 60.4 | 73.0 | 78.2 | 70.6 | 70.2 | 38.0 | 65.1 (-2.6) |
| Retain 64 Tokens | |||||||
| PDrop [CVPR 2025] | 46.2 | 61.6 | 71.7 | 56.4 | 58.0 | 26.1 | 53.3 (-14.4) |
| SparseVLM | 54.4 | 68.7 | 78.0 | 64.0 | 62.2 | 31.2 | 59.8 (-7.9) |
In summary, the experiments on Mini-Gemini in the paper, along with the Cambrian-1 and ConvLLaVA experiments above, demonstrate that our method works well with various visual encoder architectures. The reason is that our method is applied to the decoder of the VLM, where the visual features have already been aligned by the projection layers, ensuring the robustness of selecting appropriate text raters.
4. The potential implications of SparseVLM in multi-turn conversations.
Currently, most vision token pruning methods (e.g., FastV [ECCV 2024] and PDrop [CVPR 2025]) fail to effectively maintain compatibility with multi-turn conversations. Here, we propose a potentially viable approach, where we can integrate DyCoke-like [CVPR 2025] dynamic pruning: (1) KV Cache Preservation: Pruned tokens' KV Cache is stored via dynamic pruning (DP-KV Cache). (2) On-demand Updates: New prompts trigger DP-KV Cache retrieval for retained tokens and updates for pruned ones. This ensures multi-round consistency by maintaining historical context while adapting to evolving queries.
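To make this direction concrete, here is a hedged sketch of the bookkeeping such a DP-KV cache could perform; the class and method names are hypothetical, and this is only one possible realization of the scheme outlined above, not an implemented component of SparseVLM.

```python
import torch

class DPKVCache:
    """Sketch of a dynamic-pruning KV cache: pruned tokens' key/value pairs are
    parked on the side so that a later turn attending to different image regions
    can restore them instead of re-encoding the image."""

    def __init__(self):
        self.active = {}   # layer_id -> (K, V) currently used for decoding
        self.pruned = {}   # layer_id -> (K, V) of tokens removed by sparsification

    def prune(self, layer_id: int, K: torch.Tensor, V: torch.Tensor, keep: torch.Tensor):
        """keep: LongTensor of token indices retained at this layer."""
        drop = torch.ones(K.shape[0], dtype=torch.bool, device=K.device)
        drop[keep] = False
        self.active[layer_id] = (K[keep], V[keep])
        self.pruned[layer_id] = (K[drop], V[drop])

    def restore(self, layer_id: int, idx: torch.Tensor):
        """On a new prompt, bring back a subset of previously pruned tokens."""
        Kp, Vp = self.pruned[layer_id]
        Ka, Va = self.active[layer_id]
        self.active[layer_id] = (torch.cat([Ka, Kp[idx]], dim=0),
                                 torch.cat([Va, Vp[idx]], dim=0))
```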
We sincerely appreciate your valuable review and suggestions. If you find our responses satisfactory, we would be grateful for your reconsideration of the rating.
I thank the authors for the extended comparisons and the discussion of support for multi-turn conversations. I am still on the fence about the realizable efficiency gains, especially in multi-turn scenarios. In view of the latest results, I will be upgrading my rating to weak accept.
Dear Reviewer PtfS,
Thank you for your thoughtful feedback and for recognizing our efforts in the revised comparisons and discussion. We think it is an important future direction for existing vision token pruning methods to support multi-turn scenarios. It is currently an open problem. Still, we think the dynamic KV Cache retrieval method discussed in #4 of our rebuttal is a promising solution and will be our next step. We appreciate your helpful discussions and valuable insights.
We are grateful for your updated rating and glad the additional results helped address some of your concerns. We are happy to answer any further questions you may have.
The paper proposes a training-free token optimization method for vision-language models, which has become a popular research area. All reviewers agree on the novelty and on the effectiveness demonstrated by the experimental results. More than one reviewer raised the issue of missing related works. Please include all of them in the revision.
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation, NeurIPS 2024
FastVLM: Efficient Vision Encoding for Vision Language Models, CVPR 2025
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, ICLR 2025
Llava-prumerge: Adaptive token reduction for efficient large multimodal models, 2024
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models, 2024
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs, 2024