Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering
This study reveals and quantifies key factors in visual token pruning from a geometric covering perspective, and proposes Multi-Objective Balanced Covering, which significantly accelerates MLLM reasoning with negligible performance loss.
Abstract
Reviews and Discussion
This paper provides a systematic analysis of visual token pruning in Multimodal Large Language Models (MLLMs). The authors identify that existing methods often combine the two objectives of Visual Preservation (VP) and Prompt Alignment (PA) in a static manner, neglecting their varying importance across different tasks, which results in inconsistent performance. To address this, the authors derive the first closed-form error bound for visual token pruning based on the Hausdorff distance, and employ ε-covering theory to quantify the intrinsic trade-off between VP and PA under a fixed pruning budget. Building upon this theoretical insight, the paper proposes a novel training-free method—Multi-Objective Balanced Covering (MoB)—which reformulates the pruning process as a bi-objective covering problem. MoB translates the trade-off into a budget allocation problem via greedy radius trading. Extensive experiments demonstrate that MoB outperforms existing methods across LLaVA, Video-LLaVA, and Qwen2-VL, maintaining high performance even under aggressive pruning.
Strengths and Weaknesses
Strengths:
- Theoretical Contribution: The paper is the first to provide a closed-form error bound for visual token pruning, with rigorous theoretical analysis supported by a geometric interpretation using the Hausdorff distance.
- Novel Methodology: The proposed MoB method addresses the bi-objective trade-off through a covering-based perspective, reformulating pruning as a budget allocation problem and offering provable performance guarantees.
- Strong Empirical Results: The method is evaluated across 14 benchmarks and three MLLMs, demonstrating superior performance. Notably, it retains 96.4% performance for LLaVA-1.5-7B with only 11.1% of tokens.
- Practicality: MoB is training-free and scales linearly with input size, making it applicable to high-resolution or multi-frame scenarios.
Weaknesses:
- Hyperparameter Sensitivity: MoB requires selecting the prompt center budget K_p, which may vary across tasks. Although the paper categorizes scenarios based on prompt-visual coupling strength (η), practical deployment may incur tuning overhead.
- Assumption Validity: The theoretical guarantees depend on the assumption of Lipschitz continuity of the model in the Hausdorff metric space. While reasonable in practice, this may not hold universally.
- Limited Discussion on Supervision Integration: While MoB is fully training-free, the paper does not explore the potential of incorporating supervised signals (e.g., task labels) to improve token selection relevance.
Questions
- Automatic Estimation of K_p: Currently, MoB requires manual selection of K_p. Could the authors consider estimating K_p from prompt length or coupling strength η? Would meta-learning or lightweight heuristic predictors help automate this?
- Computational Cost of Hausdorff Distance: While theoretically elegant, the computation of the Hausdorff distance in high-dimensional spaces can be expensive. Have the authors considered approximations or alternatives that maintain performance?
- Performance with Fine-tuning: As MoB is training-free, have the authors tested its combination with lightweight fine-tuning post-pruning? Would such a setup improve performance under aggressive compression?
- Applicability to Language Token Pruning: While MoB focuses on visual tokens, could the same methodology be adapted to prune language tokens, especially for long-context LLMs?
Limitations
The authors explicitly discuss the reliance on theoretical assumptions and the potential cost of manually tuning Kp. They also suggest future work involving adaptive estimation of η. The discussion is honest and constructive.
Final Justification
Appreciate the thorough response. With most concerns addressed, I decide to keep my score.
Formatting Issues
None.
Sincerely appreciate your responsible work and constructive questions! We provide point-by-point feedback on the weaknesses (W) and questions (Q) below.
W1, Q1: Automatic Estimation of K_p: Currently, MoB requires manual selection of K_p. Could the authors consider estimating K_p from prompt length or coupling strength η? Would meta-learning or lightweight heuristic predictors help automate this?
A: Thank you for this forward-thinking question, which is highly aligned with our ongoing research. In fact, deriving an analytical expression for the optimal K_p from prompt length and coupling strength η is highly challenging at present. Therefore, to make the automatic estimation of K_p more tractable, we have developed an elegant and practical framework in our follow-up work.
Instead of solving for a precise value, this framework learns to adaptively classify the coupling pattern of incoming samples and applies a pre-configured budget K_p. It operates as follows:
- During inference, the framework maintains a running statistic of an approximated η for the processed samples. We observe that as more samples are processed, this statistic naturally reveals the same bimodal distribution pattern identified in our paper (Figs. 1b & 9).
- From this observed distribution, the framework dynamically determines a classification threshold that separates the two patterns.
- Each new sample is then classified into 'weak' or 'strong' coupling using this live, data-driven threshold, and the corresponding optimal K_p is applied.
This approach directly leverages the coupling strength η, as you suggested, in a more practical way. It can be viewed as a form of lightweight, online meta-learning that adapts to the specific data distribution encountered during deployment.
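For concreteness, below is a minimal Python sketch of such an online classifier. The routine that would produce the per-sample approximated η is not shown, and the warm-up length, the largest-gap thresholding, and the assumption that smaller η corresponds to weak coupling are illustrative choices of ours rather than the exact mechanism of the follow-up work.

```python
import numpy as np

class OnlineCouplingClassifier:
    """Minimal sketch of online coupling-pattern classification (assumptions noted above)."""

    def __init__(self, kp_weak: int, kp_strong: int, warmup: int = 64):
        self.kp = {"weak": kp_weak, "strong": kp_strong}  # pre-configured budgets
        self.etas = []                                    # running statistic of observed eta
        self.warmup = warmup

    def _threshold(self) -> float:
        # Split the sorted eta values at their largest gap, a crude proxy for the
        # valley between the two modes of a bimodal distribution.
        x = np.sort(np.asarray(self.etas))
        i = int(np.argmax(np.diff(x)))
        return float((x[i] + x[i + 1]) / 2)

    def budget(self, eta: float) -> int:
        """Classify one sample's coupling pattern and return its budget K_p."""
        self.etas.append(eta)
        if len(self.etas) < self.warmup:
            return self.kp["strong"]                      # conservative default during warm-up
        pattern = "weak" if eta < self._threshold() else "strong"
        return self.kp[pattern]
```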
W1, Q2: Computational Cost of Hausdorff Distance: While theoretically elegant, the computation of Hausdorff distance in high-dimensional spaces can be expensive. Have the authors considered approximations or alternatives that maintain performance?
A2: Thank you for this very important and forward-thinking question. We agree that the computational cost of the exact Hausdorff distance is a critical consideration. We address this with a quantitative analysis of the practical costs, followed by a discussion of efficient alternatives.
Quantitative Analysis: We provide a detailed cost breakdown for LLaVA-1.5-7B and LLaVA-Next-7B below:
Computational cost (TFLOPs) in LLaVA-1.5-7B under different retained-token budgets.
| Model | Vanilla | | | |
|---|---|---|---|---|
| Forward | 8.2 | 1.0 | 1.9 | 2.8 |
| Compute η | 2.3×10 | 2.3×10 | 2.3×10 | 2.3×10 |
| MoB | - | 1.7×10 | 3.3×10 | 4.8×10 |
Computational cost (TFLOPs) in LLaVA-Next-7B under different retained-token budgets.
| Model | Vanilla | | | |
|---|---|---|---|---|
| Forward | 40.5 | 4.6 | 9.1 | 13.6 |
| Compute η | 1.2×10 | 1.2×10 | 1.2×10 | 1.2×10 |
| MoB | - | 3.9×10 | 7.6×10 | 1.1×10 |
As the tables clearly show, while the complexity of computing the Hausdorff distance, i.e., O(n_v·n_p·d) for n_v visual tokens, n_p prompt tokens, and feature dimension d, is intuitively heavy in high-dimensional spaces, it remains orders of magnitude smaller than both MoB itself and the model's forward propagation. It is therefore not a practical bottleneck of MoB. Considering your concern, we also present two powerful approximation techniques:
- Heuristic Sampling: compute the distance on smaller support sets of the tokens, constructed via random sampling [1] or more advanced heuristics such as key-norm selection [2, 3]. This reduces the complexity in proportion to the sizes of the support sets.
- Random Projections: for a more theoretically grounded approach, the Johnson-Lindenstrauss (JL) lemma [4] allows us to project embeddings to a much lower dimension while preserving geometric structure, reducing the dependence of the complexity on the feature dimension d.
In summary, the cost of computing Hausdorff distance is already negligible in practice, as shown by our quantitative analysis, and can be further optimized via well-established approximation methods if needed. This ensures the broad applicability and efficiency of our framework.
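To make the two approximations concrete, here is a minimal PyTorch sketch; the exact-distance helper, the sampling ratio, and the projection dimension are illustrative choices of ours and do not correspond to Algorithm 2 of the paper.

```python
import torch

def exact_hausdorff(V: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """Symmetric Hausdorff distance between token sets V (n_v, d) and P (n_p, d)."""
    dist = torch.cdist(V, P)                        # (n_v, n_p) pairwise Euclidean distances
    return torch.max(dist.min(dim=1).values.max(),  # sup over V of inf over P
                     dist.min(dim=0).values.max())  # sup over P of inf over V

def sampled_hausdorff(V, P, ratio: float = 0.25) -> torch.Tensor:
    """Heuristic sampling: evaluate the distance on random support subsets of the tokens."""
    idx_v = torch.randperm(V.shape[0])[: max(1, int(ratio * V.shape[0]))]
    idx_p = torch.randperm(P.shape[0])[: max(1, int(ratio * P.shape[0]))]
    return exact_hausdorff(V[idx_v], P[idx_p])

def jl_hausdorff(V, P, d_proj: int = 64) -> torch.Tensor:
    """Random projection: compress the feature dimension with a Gaussian JL matrix first."""
    R = torch.randn(V.shape[1], d_proj, device=V.device) / d_proj ** 0.5
    return exact_hausdorff(V @ R, P @ R)
```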
References
[1] Wen Z, et al. Stop looking for important tokens in multimodal language models: Duplication matters more. preprint, 2025.
[2] Akhauri Y, AbouElhamayed A F, Gao Y, et al. Tokenbutler: Token importance is predictable. preprint, 2025.
[3] Guo Z, et al. Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters. CoRR, 2024.
[4] Larsen K G, Nelson J. The Johnson-Lindenstrauss Lemma Is Optimal for Linear Dimensionality Reduction. LIPIcs, 2016.
W2: Assumption Validity: The theoretical guarantees depend on the assumption of Lipschitz continuity of the model in the Hausdorff metric space. While reasonable in practice, this may not hold universally.
A: Thanks for your constructive concern; we justify Assumption 1 in three aspects:
Theoretical Grounding: Prior work [1, 2] has shown that Transformers are approximately Lipschitz continuous w.r.t. standard matrix norms, e.g., the Frobenius norm, under the finite and compact input conditions we consider (Line 98). A standard inequality relates the Frobenius norm to the Hausdorff distance, which implies that any model Lipschitz continuous w.r.t. the Frobenius norm is also Lipschitz continuous w.r.t. the Hausdorff distance.
Empirical Evidence: Vignac et al. [3] have measured the empirical local Lipschitz constant of Transformers on real-world data and found it to be 100-1000x smaller than the theoretical worst-case bound. This indicates that practical models already lie in a well-behaved, near-Lipschitz regime. Besides, Kim et al. [4] have shown that the successful training of deep Transformers relies on controlling this constant. Architectures without such implicit or explicit constraints often fail to converge, suggesting that the well-trained VLLMs most likely adhere to a finite Lipschitz constant.
Consistency with Prior Work: We wish to clarify that this assumption is not novel to our work but is a common theoretical premise in the visual token pruning community, having been adopted by prior methods such as DivPrune [5] and DART [6]. Our work builds upon this established foundation to derive deeper insights.
References
[1] Dasoulas G, et al. Lipschitz normalization for self-attention layers with application to graph neural networks. ICML, 2021.
[2] Yudin N, et al. Pay Attention to Attention Distribution: A New Local Lipschitz Bound for Transformers. preprint, 2025.
[3] Vignac C, et al. Understanding the regularity of self-attention with optimal transport. preprint, 2023.
[4] Kim H, et al. The lipschitz constant of self-attention. ICML, 2021.
[5] Alvar S R, et al. Divprune: Diversity-based visual token pruning for large multimodal models. CVPR, 2025.
[6] Wen Z, et al. Stop looking for important tokens in multimodal language models: Duplication matters more. preprint, 2025.
W3, Q3: Performance with Fine-tuning: As MoB is training-free, have the authors tested its combination with lightweight fine-tuning post-pruning? Would such a setup improve performance under aggressive compression?
A: Thank you for this insightful suggestion. While our work focused on the training-free aspect, we agree this is a promising direction. A key advantage of MoB is that it provides a principled, informative sparse input to the LLM. A subsequent lightweight fine-tuning (e.g., LoRA) would then allow the model, originally trained on dense data, to adapt its attention mechanisms to this new distribution. We hypothesize this would significantly improve performance under aggressive compression and have added it as a key direction for future work.
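As an illustration of what such a setup might look like, the sketch below attaches LoRA adapters with the Hugging Face peft library; the targeted module names and hyperparameters are assumptions for a LLaMA-style backbone, and `prune_visual_tokens_with_mob` is a hypothetical hook standing in for MoB rather than an existing function.

```python
from peft import LoraConfig, get_peft_model

def attach_lora(mllm):
    # Wrap the MLLM so that only low-rank adapters on the attention projections are trainable.
    config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed LLaMA-style names
    )
    return get_peft_model(mllm, config)

# Training-loop idea (pseudocode): prune visual tokens with MoB inside the forward pass,
# then optimize only the LoRA parameters so the model adapts to the sparse visual input.
# for batch in loader:
#     pruned = prune_visual_tokens_with_mob(batch)      # hypothetical MoB hook
#     loss = peft_model(**pruned, labels=batch["labels"]).loss
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```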
Q4: Applicability to Language Token Pruning: While MoB focuses on visual tokens, could the same methodology be adapted to prune language tokens, especially for long-context LLMs?
A: Thank you for this constructive question. While naively applying MoB to language tokens might not yield the same degree of success, our theoretical framework is fundamentally generalizable to the language domain.
In the language domain, we can create a powerful analogy:
- Visual Preservation (VP) is equivalent to preserving the broader context of long language representations.
- Prompt Alignment (PA) is equivalent to extracting the golden evidence—the most critical sentences directly relevant to the facts.
The trade-off between VP and PA, a key insight of our theory, is therefore fundamentally the trade-off between Context and Golden Evidence. We believe this principle has significant implications for several typical applications in long-text LLMs:
- RAG: Balancing recall (enough Context to reason) vs. precision (the Golden Evidence that answers the query).
- Summarization & LLM Memory: Balancing coherence (the Context/flow) vs. conciseness (the Golden Evidence/key facts).
Our framework provides a novel, principled view for analyzing and optimizing these fundamental trade-offs, highlighting its significance for the community of long-context LLMs.
This paper investigates the performance differences between single-objective and multi-objective approaches in visual token pruning, with the following main contributions:
The study finds that multi-objective methods do not significantly outperform single-objective methods in visual token pruning. By introducing the Hausdorff distance, the authors derive the first closed-form error bound, quantifying the contribution of the Visual Preservation (VP) and Prompt Alignment (PA) objectives.
Based on the ε-covering theory, the authors reveal an inherent trade-off between VP and PA, showing how to balance these two objectives under different pruning budgets to optimize performance.
The authors propose a Multi-Objective Balanced Covering (MoB) method, reformulating visual token pruning as a bi-objective covering problem. MoB employs a greedy radius-trading strategy for budget allocation, offering provable performance bounds and linear scalability.
Experimental results show that MoB maintains 96.4% of the original performance in LLaVA-1.5-7B using only 11.1% of the original visual tokens, and achieves 1.3–1.5× speedup in LLaVA-Next-7B with negligible performance loss. Moreover, evaluations on Qwen2-VL and Video-LLaVA confirm MoB’s seamless integration into advanced multimodal large language models and diverse vision-language tasks.
In summary, this paper presents a novel visual token pruning method through theoretical analysis and experimental validation, emphasizing the importance of balancing different objectives during the pruning process and demonstrating its effectiveness in practical applications.
Strengths and Weaknesses
Strengths:
- The paper proposes the Multi-Objective Balanced Covering (MoB) method, systematically addressing the objective trade-off problem in visual token pruning. It provides provable performance bounds and linear scalability.
- By introducing the Hausdorff distance, the authors derive the first closed-form error bound, quantifying the contribution of Visual Preservation (VP) and Prompt Alignment (PA) objectives, laying a solid theoretical foundation for future research.
- The paper is well-structured and logically rigorous, guiding readers step-by-step through complex concepts and methods. Especially in the introduction of the MoB method, the working principle and implementation steps are explained in detail.
- By reformulating visual token pruning as a bi-objective covering problem, the paper offers a novel perspective and approach, demonstrating high innovativeness.
- Experimental results show that MoB performs exceptionally well on multiple benchmarks, achieving significant token reduction while maintaining high performance, indicating its practical applicability.
Weaknesses:
- Although a new method is proposed, the performance improvement of MoB is relatively modest in some experimental settings, which may fall short of expectations in certain scenarios.
- The interpretation of some experimental results could be more in-depth. The paper does not fully explore the applicability and limitations of different pruning strategies.
- While the method shows improvements, existing approaches such as CrossGET, LLaVA-PruMerge, TokenPacker, Turbo, and AIM demonstrate similar effectiveness, and the paper lacks comparison and discussion with these methods.
Questions
See weaknesses.
Limitations
See weaknesses.
Final Justification
The authors partially provide comparative experiments in their manuscript. In the final version, they will give an explicit discussion of the key differences between the proposed method and existing approaches.
Formatting Issues
Fair
Thanks for your comments. To address your concerns, we provide our point-by-point clarifications as follows.
W1 & Q1: Although a new method is proposed, the performance improvement of MoB is relatively modest in some experimental settings, which may fall short of expectations in certain scenarios.
A1: Thank you for your feedback. We respectfully argue that the performance improvement of MoB is not modest, and we would like to highlight two key aspects of our contribution: its significant and robust empirical performance and its foundational theoretical insights.
1. Significant and Robust Empirical Performance: We believe the term "modest" may overlook two critical patterns in our results.
- Consistent Superiority: Across all tested model families (LLaVA-1.5, LLaVA-Next, Qwen2-VL, and Video-LLaVA) and all pruning rates, MoB consistently outperforms the strongest competing baselines in average performance. For example, MoB achieves average performance gains of 2.7% (LLaVA-1.5), 1.5% (LLaVA-Next), and 1.2% (Qwen2-VL) in the image domain, and achieves 97.9% average performance in the video domain at a very aggressive 93.4% token reduction rate, surpassing the next-best method by 1.6%.
- Widening Lead Under Aggressive Pruning: Crucially, MoB's performance advantage becomes more pronounced under more challenging, high-ratio pruning scenarios. On LLaVA-1.5-7B, for instance, our lead over the strongest baseline grows from +1.4% at a 77.8% reduction rate to a more substantial +2.7% at an 88.9% reduction rate. This demonstrates the superior robustness of our method when efficiency demands are highest.
We contend that this consistent and robust outperformance, especially under aggressive compression, is a significant achievement.
2. Foundational Theoretical Contribution: Beyond the performance numbers, we encourage the reviewer to consider the value of our theoretical contributions, which we believe are a core part of our work.
- Provable Guarantees: Unlike most heuristic-based pruning methods, MoB is grounded in a formal theoretical framework that provides a provable performance bound (Theorem 2). This ensures its reliability.
- Novel Insights for the Community: More importantly, this paper presents the first in-depth analysis of the key factors in visual token pruning—Visual Preservation (VP), Prompt Alignment (PA), prompt-visual coupling (η), and the pruning budget—and formalizes their interplay. MoB is a direct and effective application of these insights. We believe these foundational theories will motivate the community and inspire more powerful, principled pruning methods in the future, representing a profound potential impact.
W2 & Q2: The interpretation of some experimental results could be more in-depth. The paper does not fully explore the applicability and limitations of different pruning strategies.
A2: Thank you for your comment. We respectfully clarify that a central motivation and contribution of our work is precisely to provide an in-depth analysis of the applicability and limitations of different pruning strategies, and how to incorporate them to derive a robust visual token pruning method. We have demonstrated this extensively in our manuscript from theoretical, empirical, and ablation perspectives.
1. Theoretical Foundation: Our key theoretical insight is that the effectiveness of a pruning strategy is contingent on the prompt-visual coupling pattern (η). We explicitly analyze this in our discussions of Lemma 2 (Lines 143-147) and Theorem 2 (Lines 232-234). In these sections, we formally reveal that Prompt Alignment (PA) is crucial for weak-coupling tasks, while Visual Preservation (VP) is favored under strong-coupling conditions.
2. Empirical Validation in Main Results: This theoretical prediction is directly validated by the trends in our main experimental results in Tables 1 and 4. As we explicitly state in our analysis in Lines 252-255: "...single-objective baselines exhibit complementary strengths under different coupling patterns...". This discussion confirms that PA-based methods excel on weak-coupling benchmarks, while VP-based methods perform better on strong-coupling ones.
3. Targeted Ablation Studies: Furthermore, we conducted a targeted ablation study (Lines 284-299), with the results presented in Figure 4, to isolate and verify this exact relationship. By systematically varying the budget allocation (considered as the quality of prompt alignment), we empirically confirmed that PA is more critical for weak-coupling tasks and VP for strong-coupling ones, thus validating our core theoretical premise.
In summary, we believe our manuscript provides a comprehensive, multi-faceted analysis of the applicability and limitations of different pruning strategies. We are confident that these detailed discussions, located throughout the paper, fully address your concern.
W3 & Q3: While the method shows improvements, compared to existing approaches such as CrossGET, LLaVA-PruMerge, TokenPacker, Turbo, and AIM, which also demonstrate similar effectiveness, the paper lacks comparison and discussion with these methods.
A3: Thank you for your suggestion to compare with a broader set of baselines. We would first like to clarify that our original manuscript already contains a strong suite of 13 baselines (Appx. C.3), including the mentioned LLaVA-PruMerge [2], where MoB consistently demonstrates state-of-the-art performance (Tables 1-4). Thank you for suggesting these excellent works. To further address your concern, we have conducted additional experiments to provide a direct comparison against CrossGET [1] and AIM [5]. The comparison results are presented below:
| Setting | MMB | SQA | VizWiz | GQA | MME | POPE | VQA-T | VQA-V2 | Avg |
|---|---|---|---|---|---|---|---|---|---|
| FLOPs ↓31% | |||||||||
| CrossGET | 64.7 | 66.7 | 47.7 | 61.4 | 1510 | 83.9 | 54.9 | 77.3 | 95.2% |
| MoB | 64.5 | 70.0 | 52.6 | 61.7 | 1862 | 85.3 | 58.6 | 78.5 | 100.6 % |
| FLOPs ↓73% | |||||||||
| LLaVA-PruMerge | 58.1 | 67.1 | 50.3 | 53.3 | 1554 | 67.2 | 54.3 | 68.8 | 88.8% |
| AIM | 58.8 | 68.4 | - | 58.6 | 1772 | 85.7 | 53.8 | 75.4 | 95.3% |
| MoB | 63.5 | 69.6 | 52.7 | 60.9 | 1845 | 82.1 | 57.8 | 77.5 | 99.4 % |
| FLOPs ↓88% | |||||||||
| LLaVA-PruMerge | 55.3 | 68.1 | 50.1 | 51.9 | 1549 | 65.3 | 54.0 | 67.4 | 87.4% |
| AIM | 60.9 | 67.1 | - | 54.6 | 1573 | 79.5 | 48.4 | 69.0 | 89.6% |
| MoB | 62.1 | 69.8 | 52.1 | 59.0 | 1806 | 77.2 | 57.0 | 75.5 | 96.4 % |
As the new results clearly demonstrate, MoB's effectiveness is not "similar" but substantially superior. MoB achieves a 5.4% average performance gain over CrossGET under a ~31% FLOPs reduction; at a ~73% FLOPs reduction, it achieves 4.1% and 10.6% average performance gains over AIM and LLaVA-PruMerge, respectively; and at a more aggressive ~88% reduction, it still achieves 6.8% and 9.0% average performance gains over AIM and LLaVA-PruMerge. We believe these significant margins validate the effectiveness of our approach.
We did not include Turbo [4] and TokenPacker [3] for the following principled reasons:
- Turbo's code repository is still unavailable, and its paper does not report results on any of the standard benchmarks used by the community, making a fair and reproducible comparison impracticable.
- Although TokenPacker is also designed to accelerate VLLMs, it belongs to a different research category: it is a novel efficient VLM rather than a visual token pruning method. TokenPacker requires extensive training with extra data, not only pre-training (1.2M samples) but also SFT (1.5M samples). Comparing it directly with a training-free, inference-time visual token pruning method like MoB would be unfair. Crucially, these approaches are orthogonal; a pruning method like MoB could potentially be applied on top of an efficient VLM like TokenPacker for even more aggressive acceleration.
Finally, we will add these new comparisons and a detailed discussion to the revised manuscript to provide readers with a more comprehensive understanding of MoB's contribution. We hope this response alleviates your concerns regarding MoB's performance.
[1] Shi D, et al. Crossget: Cross-guided ensemble of tokens for accelerating vision-language transformers. preprint, 2023.
[2] Shang Y, et al. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. preprint, 2024.
[3] Li W, et al. Tokenpacker: Efficient visual projector for multimodal llm. IJCV, 2025.
[4] Ju C, et al. Turbo: Informativity-driven acceleration plug-in for vision-language large models. ECCV, 2024.
[5] Zhong Y, et al. Aim: Adaptive inference of multi-modal llms via token merging and pruning. ICCV, 2025.
Thank you for your detailed response to the review comments. I have revised the review feedback based on your input and updated the final decision.
We are very grateful that our responses have addressed your concerns. Sincerely, thank you for your feedback and support! It is very important to our work.
Best regards,
This paper proposes MoB (Multi-objective Balanced covering), a novel pruning method for vision-language models that balances visual preservation (VP) and prompt alignment (PA) from a geometric perspective. The key idea is to leverage prompt-visual coupling (η) to statically allocate token budgets without requiring training. MoB achieves significant speedups and performance retention across diverse tasks, outperforming existing training-free and sparse pruning methods.
Strengths and Weaknesses
Strengths:
1. The method demonstrates strong empirical performance across a wide range of vision-language tasks, often outperforming previous training-free pruning approaches.
2. The experiments are relatively comprehensive, covering diverse datasets and settings (though lacking evaluation on larger models such as LLaVA-13B or QwenVL-32B).
3. The paper is well-written and clearly structured, making a technically dense topic accessible.
4. Importantly, the authors attempt to analyze the token pruning problem from a geometric and theoretical perspective, introducing prompt-visual coupling (η) and proving formal upper bounds on pruning error. This provides a novel and insightful angle to understand the pruning trade-off.

Weaknesses:
1. While theoretical analysis is a commendably ambitious direction, it is also risky—there is a concern that the theory may not be entirely complete. In particular, the paper assumes that all pruning-relevant information can be captured by VP and PA, which may overlook more complex token interactions, reasoning steps, or latent semantics.
2. The design and usage of η raise several questions. It is treated as a dataset-level prior, but not explicitly estimated during inference. Its dependency on model architecture, computational cost, and lack of deployable estimation strategy (especially in unseen, open-domain settings) limit the practicality of the approach.
Questions
- The proposed MoB method hinges on the prompt-visual coupling strength (η) to allocate token budgets between visual preservation (VP) and prompt alignment (PA). However, η is not directly observable in real-world settings. How can η be inferred when the task type is unknown, such as in open-domain or user-driven inference? Does MoB require pre-defined task labels or dataset-specific heuristics to function, and how does this constrain its applicability in deployment?
- While η plays a central role in MoB's theoretical guarantees and pruning strategy, it is computed via Hausdorff distance between prompt and visual tokens—an operation that can be computationally expensive and dataset-dependent. The authors assume η is known a priori based on dataset labels, but this introduces additional supervision and cost that methods like FastV or SparseVLM do not require. Can the authors justify this additional cost, and is it fair to compare MoB against zero-cost or training-free baselines under this assumption? Furthermore, could η be efficiently approximated via lightweight dataset sampling (e.g., estimating η from a few examples), and if so, how robust would such estimation be across domains?
- Given that η reflects the geometric distance between prompt and visual token embeddings, it may vary across models of different sizes and architectures (e.g., LLaVA-7B vs. QwenVL-32B). Does η need to be re-estimated or recalibrated when switching to a different model? How transferable is the chosen Kp configuration across models, and to what extent does MoB’s pruning effectiveness rely on model-specific representations?
- The MoB framework assumes that all pruning-relevant information can be captured through the dual objectives of visual preservation (VP) and prompt alignment (PA). However, is this decomposition theoretically complete? Could there exist tokens whose value arises from higher-order interactions, global reasoning, or latent semantics not directly reflected in either VP or PA distances.
Limitations
Yes.
Final Justification
Most of the concerns I raised have been satisfactorily addressed in the rebuttal. In particular:
- The authors clarified key theoretical assumptions, such as the use of normalized embedding space and the model-invariance of their pruning strategy.
- They also provided further explanation on the practicality of their method, including the online estimation of coupling strength and its overhead.
- The methodology is solid, and the experimental results are comprehensive and convincing across multiple VLA models and benchmarks. While there are still a few open questions (e.g., performance on larger models, more intuitive hyperparameter analysis), I view them as future directions rather than critical flaws. Overall, I find this work to be technically sound, well-executed, and a valuable contribution to the field. I maintain my positive recommendation.
Formatting Issues
The vertical axis label (performance) is missing for the VizWiz plot in Figure 3.
We sincerely appreciate your responsible, careful work and insightful questions! We have corrected the formatting problem in Fig. 3 and provide point-by-point feedback as follows.
W2 & Q1: η is not directly observable in real-world settings. How can η be inferred when the task type is unknown, such as in open-domain or user-driven inference? Does MoB require pre-defined task labels or dataset-specific heuristics to function, and how does this constrain its applicability in deployment?
A1: Thank you for this critical question about real-world applicability. We provide an explanation in two aspects:
A Practical Strategy for Open-Domain Inference: MoB does not require pre-defined "task type" labels. Instead, it can perform online classification of a sample's coupling pattern. The practical strategy is a two-stage process:
- Offline Calibration: For a given target model, we first analyze the η distributions on a set of representative benchmarks to establish a robust classification threshold that distinguishes between weak and strong coupling patterns.
- Online Classification: For each incoming query, we perform an online computation of its Hausdorff distance (Alg. 2, Appx. B), which has a tractable bilinear complexity in the numbers of visual and prompt tokens. The sample is then classified based on this value and the pre-calibrated threshold, and the corresponding K_p is applied.
We acknowledge the online computation introduces a manageable cost, but it does not fundamentally alter MoB's superior performance-efficiency trade-off.
An Advanced, Fully-Adaptive Framework: Furthermore, we have explored this direction deeply in our subsequent work and developed a more elegant, fully-adaptive framework. It operates as follows:
- During inference, it maintains a running statistic of an approximated η (e.g., computed from shallow-layer tokens) for the samples it has processed.
- We observe that as more samples are seen, this statistic naturally reveals the same bimodal distribution pattern identified in our paper (Fig. 1b & 9).
- The framework then uses this emerging distribution to dynamically adjust its own classification threshold for subsequent samples.
The theoretical framework presented in this paper is the essential foundation that enables and validates such future advancements.
W2 & Q2: The authors assume η is known a priori based on dataset labels, but this introduces additional supervision and cost that methods like FastV or SparseVLM do not require. Can the authors justify this additional cost, and is it fair to compare MoB against zero-cost or training-free baselines under this assumption? Furthermore, could η be efficiently approximated via lightweight dataset sampling (e.g., estimating η from a few examples), and if so, how robust would such estimation be across domains?
A2: Considering your concern, we offer the following points in response:
Justification of Experimental Settings: We first argue that prior methods did not intentionally avoid using η, but were fundamentally unaware of the critical role of prompt-visual coupling. A core contribution of our paper is to first identify and formalize this factor. To experimentally validate our theory, it was necessary to introduce the benchmark's η pattern as a prior.
We acknowledge that this could potentially leverage information unavailable to other methods. To ensure a fair comparison, we did not delicately fine-tune the budget configuration for each of the 14 benchmarks. Instead, we use a coarse-grained approach: we classify benchmarks into one of two general coupling patterns and apply one of two corresponding strategies. We believe this setup is sufficiently fair and practical for real-world applications.
η-Agnostic Evaluation: In this experiment, we apply the same budget configuration to all benchmarks, regardless of their coupling pattern.
Table 1: Comparative experiments on image understanding with LLaVA-1.5-7B under a static budget configuration.
| Setting | MMB | MMB-CN | SQA | VizWiz | GQA | MME | POPE | VQA-T | VQA-V2 | OCR |
|---|---|---|---|---|---|---|---|---|---|---|
| FastV | 61.2 | 57.0 | 67.3 | 50.8 | 52.7 | 1612 | 64.8 | 52.5 | 67.1 | 291 |
| SparseVLM | 62.5 | 53.7 | 69.1 | 50.5 | 57.6 | 1721 | 83.6 | 56.1 | 75.6 | 292 |
| MustDrop | 62.3 | 55.8 | 69.2 | 51.4 | 58.2 | 1787 | 82.6 | 56.5 | 76.0 | 289 |
| DART | 63.6 | 57.0 | 69.8 | 51.2 | 60.0 | 1856 | 82.8 | 57.4 | 76.7 | 296 |
| MoB | 63.8 | 57.5 | 70.0 | 52.4 | 61.2 | 1858 | 84.5 | 58.2 | 77.9 | 304 |
| FastV | 56.1 | 56.4 | 60.2 | 51.3 | 49.6 | 1490 | 59.6 | 50.6 | 61.8 | 285 |
| SparseVLM | 60.0 | 51.1 | 67.1 | 51.4 | 56.0 | 1696 | 80.5 | 54.9 | 73.8 | 280 |
| MustDrop | 61.1 | 55.2 | 68.5 | 52.1 | 56.9 | 1745 | 78.7 | 56.3 | 74.6 | 281 |
| DART | 63.2 | 57.5 | 69.1 | 51.7 | 58.7 | 1840 | 80.1 | 56.4 | 75.9 | 296 |
| MoB | 63.2 | 57.3 | 69.3 | 52.8 | 60.7 | 1842 | 81.7 | 57.5 | 77.2 | 299 |
| FastV | 48.0 | 52.7 | 51.1 | 50.8 | 46.1 | 1256 | 48.0 | 47.8 | 55.0 | 245 |
| SparseVLM | 56.2 | 46.1 | 62.2 | 50.1 | 52.7 | 1505 | 75.1 | 51.8 | 68.2 | 180 |
| MustDrop | 60.0 | 53.1 | 63.4 | 51.2 | 53.1 | 1612 | 68.0 | 54.2 | 69.3 | 267 |
| DART | 60.6 | 53.2 | 69.8 | 51.6 | 55.9 | 1765 | 73.9 | 54.4 | 72.4 | 270 |
| MoB | 61.7 | 54.2 | 69.7 | 52.0 | 59.0 | 1806 | 77.2 | 57.0 | 75.5 | 277 |
As the results demonstrate, even without the η-based prior, MoB consistently outperforms all baselines in most cases. This highlights the intrinsic advantage of our approach, independent of the η-based prior, and further strengthens the experimental results presented in our manuscript.
On Lightweight Estimation: While sampling a few examples is more lightweight, it still uses prior information from the test set and risks sampling bias. In our subsequent work, we have developed a more elegant and fair solution, which maintains a running statistic of an approximated η for the samples it has processed and adaptively identifies the coupling pattern of subsequent inputs based on this statistic.
W2 & Q3: Does η need to be re-estimated or recalibrated when switching to a different model? How transferable is the chosen K_p configuration across models, and to what extent does MoB's pruning effectiveness rely on model-specific representations?
A3: We want to clarify that the absolute numerical values of η can vary with model representations; however, this does not significantly affect the optimal K_p configuration, which remains highly transferable. We justify this from both a theoretical and an empirical standpoint:
Theoretical Rationale: As stated in our manuscript (Lines 204-205), we apply L2 normalization to all token embeddings before pruning. Thus, MoB operates in a normalized space, where all tokens are projected onto a unit sphere, meaning our pruning decisions are based on the relative angular structure of the tokens, not their absolute Euclidean distances or magnitudes, which can vary across models.
Empirical Validation: In our settings, the K_p configuration is not tuned to the specific absolute numerical value of η for a benchmark. Instead, it is set based on the coupling pattern—that is, the relative difference between η distributions for different task types within a single model's embedding space. This is why we were able to apply the exact same configurations across four distinct VLLMs: LLaVA-1.5-7B, LLaVA-Next-7B, Qwen2-VL-7B, and Video-LLaVA-7B. The results (Tables 1-4) show this fixed configuration is highly effective for all models. For instance, at an 88.9% reduction rate, MoB achieved average performance gains of 2.7% (LLaVA-1.5), 1.5% (LLaVA-Next), and 1.2% (Qwen2-VL) over the strongest baselines.
This provides strong empirical evidence that the K_p configuration is robust across models, and that MoB's effectiveness stems primarily from these universal, task-driven geometric principles rather than relying heavily on model-specific representations.
W1 & Q4: Is MoB's framework theoretically complete? Could there exist tokens whose value arises from higher-order interactions, global reasoning, or latent semantics not directly reflected in either VP or PA distances?
A4: Thank you for this profound question that addresses the core of our work. We agree that the theoretical completeness is a critical aspect. Our response is threefold:
Formulation Setting: We clarify that higher-order interactions may influence multi-stage pruning pipelines, which prune tokens across multiple layers simultaneously. To keep the problem tractable and theoretically rigorous, our analysis and methodology explicitly focus on the single-layer pruning setting (L. 94–111), where only first-order interactions between the input tokens and their pruned counterparts exist. Within this well-defined scope, we argue that Visual Preservation (VP) and Prompt Alignment (PA) are sufficient to rigorously characterize the performance bounds of a pruning algorithm.
The Advantage of a Tractable Framework: The challenge of balancing objectives in multi-stage methods, e.g., SparseVLM and MustDrop, highlights a key advantage of MoB's design. By focusing on a single, decisive pruning stage, MoB makes the pruning problem practicable and analyzable. This allows for an efficient and principled balance between VP and PA, thereby enabling us to derive a strict performance guarantee for MoB, as presented in Theorem 2.
Empirical Evidence: From an empirical perspective, our solid experimental results validate the correctness of our theoretical analysis. By solely considering the trade-off between these two objectives, MoB consistently and significantly outperforms a comprehensive suite of both single-stage and multi-stage methods across various benchmarks (Tables 1, 4). We therefore believe this provides powerful evidence that VP and PA are the decisive factors for visual token pruning.
Thank you for your detailed and thoughtful response. Most of my concerns have been addressed, and I appreciate the rigorous effort put into both the theoretical analysis and experimental validation. I would like to leave a few additional comments and suggestions for further improvement:
1. Online Estimation Overhead
The proposed online estimation of coupling strength for each input is very interesting and reasonable, as different tasks indeed exhibit different coupling patterns. However, I’m curious about the computational cost introduced by this estimation step. Is there a measurable or objective overhead, such as latency or FLOPs, for computing the Hausdorff distance and applying the token selection strategy during inference? A more concrete analysis or metric would help clarify its practical impact in deployment.
2. FastV Baseline Discrepancy
While reviewing Table 1 and Table 2, I noticed a large performance gap for FastV between LLaVA and Qwen under the same token budgets—for instance, 86.4% vs. 91.0% at 77.8%, and 77.3% vs. 84.4% at 88.9%. This was not initially raised in my review, but I'd like to ask: are these results directly reproduced by the authors, or taken from original papers? Some clarification on the source and consistency of these baselines would be helpful.
3. Normalized Space Assumption in Theory
As I mentioned in my initial review, theoretical analysis is an ambitious but delicate direction. Regarding the statement in the manuscript that all token embeddings are L2-normalized to lie on a unit sphere—thereby allowing pruning decisions based on angular structure rather than magnitudes or Euclidean distances—do you have any formal justification or references supporting this assumption across different models? Given that model-specific representations can vary significantly, further elaboration or empirical support would strengthen this argument.
Once again, I sincerely appreciate the authors' careful and thoughtful response. I understand that larger-scale model experiments may have been omitted due to time constraints, but I hope to see these extended results, along with more intuitive hyperparameter analyses, in a future version.
Overall, I find this work solid and impactful, and I am happy to maintain my positive evaluation.
Sincerely, thank you for carefully reviewing our rebuttal and for your further constructive comments! We are very grateful that our explanations have helped clarify your concerns. In response to the three additional valuable insights you raised, we provide the following point-by-point feedback:
Q1: Online Estimation Overhead
A1: Thank you for this important and forward-looking comment. We agree that the computational cost of the exact Hausdorff distance is a crucial concern. To address this, we provide a quantitative analysis of its practical cost.
Specifically, we use Algorithm 2, detailed in Appendix B, to compute the Hausdorff distance between tokens, with a complexity of O(n_v·n_p·d), where n_v, n_p, and d denote the number of visual tokens, the number of prompt tokens, and the feature dimension, respectively. We include a detailed breakdown of the computational cost for LLaVA-1.5-7B and LLaVA-Next-7B below, showing that the practical cost (measured in TFLOPs) of computing the exact Hausdorff distance is orders of magnitude lower than that of MoB itself and of the model's forward pass, resulting in negligible overhead.
Computation cost (TFLOPs) in LLaVA-1.5-7B under different retained-token budgets.
| Model | Vanilla | | | |
|---|---|---|---|---|
| Forward | 8.2 | 1.0 | 1.9 | 2.8 |
| Compute η | 2.3×10 | 2.3×10 | 2.3×10 | 2.3×10 |
| MoB | - | 1.7×10 | 3.3×10 | 4.8×10 |
Computation cost (TFLOPs) in LLaVA-Next-7B under different retained-token budgets.
| Model | Vanilla | | | |
|---|---|---|---|---|
| Forward | 40.5 | 4.6 | 9.1 | 13.6 |
| Compute η | 1.2×10 | 1.2×10 | 1.2×10 | 1.2×10 |
| MoB | - | 3.9×10 | 7.6×10 | 1.1×10 |
As the tables clearly show, the cost of computing η is insignificant compared to the cost of the pruned forward pass and, more importantly, to the massive savings obtained from pruning. This demonstrates that the exact computation required for online estimation is not a practical bottleneck, and its cost is trivial compared to the efficiency gains from our method.
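To give a concrete sense of scale, a back-of-the-envelope FLOP estimate of the dominant pairwise-distance step can be written as follows; the token counts and hidden size are illustrative values for a LLaVA-style model, not figures taken from the paper.

```python
def hausdorff_flops(n_v: int, n_p: int, d: int) -> float:
    # Roughly 3*d operations (subtract, square, accumulate) per (visual, prompt) token pair.
    return 3.0 * n_v * n_p * d

# Illustrative numbers (assumed): 576 visual tokens, 64 prompt tokens, hidden size 4096.
tflops = hausdorff_flops(576, 64, 4096) / 1e12
print(f"~{tflops:.5f} TFLOPs")  # on the order of 1e-4 TFLOPs, versus several TFLOPs per forward pass
```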
Q2. FastV Baseline Discrepancy
A2: Thank you for this careful observation. Firstly, we clarify that all baseline results reported in our paper were reproduced by us using the official codebases of the respective methods, ensuring a fair and consistent evaluation.
Then, we attribute the performance discrepancy to the significant difference in the absolute number of retained visual tokens between the LLaVA and Qwen architectures, even at the same relative reduction rate. LLaVA-1.5-7B processes a fixed-resolution image, resulting in a modest 576 initial visual tokens. With an aggressive 88.9% reduction, LLaVA retains only 64 visual tokens, which can be insufficient to capture all necessary visual features. On the other hand, Qwen2-VL supports Naive Dynamic Resolution, leading to thousands of initial visual tokens. Consequently, an 88.9% reduction on Qwen2-VL still retains a much larger absolute number of tokens compared to the 64 tokens on LLaVA. This larger absolute budget provides the model with more visual information, naturally leading to better performance.
Q3. Normalized Space Assumption in Theory
A3: Thanks for your comment. We explain that the L2 normalization makes MoB's geometric objectives robust to model-specific vector magnitudes, which is formally justified: for unit-length vectors, the Euclidean distance is a strictly monotonic function of the cosine similarity. This makes our covering algorithm invariant to the original embedding scales, a standard principle in metric learning.
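For completeness, the monotonic relation invoked above follows from expanding the squared Euclidean distance on the unit sphere:

$$
\|u - v\|_2^2 = \|u\|_2^2 + \|v\|_2^2 - 2\,u^\top v = 2\bigl(1 - \cos(u, v)\bigr), \qquad \|u\|_2 = \|v\|_2 = 1,
$$

so ranking token pairs by Euclidean distance after L2 normalization is equivalent to ranking them by cosine similarity, which is the sense in which the covering objective depends only on angular structure.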
This principle is particularly well-suited to VLM embeddings for two reasons:
- Architectural Consistency: Transformers' native LayerNorm/RMSNorm already heavily constrain token norm variance (empirically, the coefficient of variation is found to be <3%). Our normalization is thus a mild final step consistent with the model's design.
- Precedent in Representation Learning: Using cosine similarity with L2 normalization is a well-established practice for measuring semantic distance, as demonstrated by seminal works such as FaceNet (Schroff et al., CVPR 2015) for face recognition and SimCSE (Gao et al., EMNLP 2021) for sentence embeddings. Moreover, recent VLM pruning methods like SAINT (Jeddi et al., 2025) also successfully adopt this approach. The consistent success of these methods across diverse architectures strongly indicates that angular structure encodes task-relevant semantic information.
We would like to express our gratitude for your careful consideration of our manuscript. Sincerely, we hope our responses have fully addressed your concerns. Your support is indeed very important to us.
Best regards,
Thank you for the further clarifications and thoughtful explanations. I truly appreciate the careful and detailed responses to my concerns. Your work presents a meaningful contribution to the field, and I look forward to seeing the updated version of the paper, as well as the release of your code in the future.
This paper revisits the problem of visual token pruning in Multimodal Large Language Models, which is essential for reducing computational overhead from high-resolution visual input. While prior methods often integrate the two main pruning objectives, visual preservation (VP) and prompt alignment (PA), using static or heuristic strategies, this work introduces a principled, theoretically grounded framework for their integration. The authors propose a closed-form error bound for pruning based on the Hausdorff distance (Lemma 1), which ties pruning performance to the balance between VP and PA. Also, a new characterization of the trade-off between these two objectives under varying prompt-visual coupling conditions (Theorem 1) is demonstrated. Finally, a novel algorithm, Multi-objective Balanced Covering (MoB) is proposed which translates the trade-off into a budget allocation problem between prompt and visual token retention and is implemented with efficient greedy heuristics.
Strengths and Weaknesses
Strengths:
1 - The paper derives an elegant, novel error bound for token pruning (Lemma 1 and 2), tying model output error to geometric distances in token embedding space. The application of covering theory and trade-off quantification (Theorem 1) is a significant conceptual leap over prior work.
2 - The paper convincingly identifies that prior multi-objective token pruning methods do not adapt to variable prompt-visual coupling across tasks, explaining the failure of static integration schemes (e.g., MustDrop).
3 - Extensive evaluations (Tables 1–3) across multiple MLLMs, datasets, and pruning budgets show consistent improvements, including outperforming strong baselines like FastV and MustDrop. MoB retains >95% of task performance with up to 88.9% token reduction.
Weaknesses:
1 - Despite having a solid technical background, the paper is extremely challenging to parse. The derivations, especially in Section 3.2 (covering regularity, trade-off derivations), are packed with notation and require significant effort to unpack. For a NeurIPS audience, clearer conceptual exposition with diagrams or more intuitive examples would greatly improve readability. The notation choices (e.g., reuse of "X", "S", "K", “η”) are overloaded and can be confusing. More thoughtful typographic separation or a notation table would help.
2 - Assumption 1 (Lipschitz continuity of transformers w.r.t. Hausdorff distance) is plausible but unvalidated. While claimed to hold in practice, no empirical evidence (e.g., via probing or ablation) is shown. Similarly, Assumption 2 (bounded prompt-visual coupling η) might not hold in open-ended real-world prompts (e.g., in generative tasks). The robustness of MoB under such violations is not discussed.
3- While the evaluation is extensive, it misses standard datasets like COCO, VQAv2 in full, or long-form visual reasoning tasks like ScienceQA. Evaluating on such benchmarks would further strengthen the case for generality. Some ablations, such as the role of pruning depth (Fig. 6), are shallow. For example, pruning at different layers beyond ℓ = 2 is not deeply explored, though it's likely to affect performance.
Questions
1 - Could the authors provide more intuitive explanations or visual illustrations—particularly for the key theoretical insights in Section 3.2 (e.g., the trade-off in Theorem 1)? This would help clarify the core contributions for a broader NeurIPS audience.
2 - Has a notation table or summary been considered to address the overloaded variables like "X", "S", "K", and “η”? This might significantly improve the clarity of mathematical sections.
3 - What empirical evidence supports Assumption 1 (Lipschitz continuity of transformers with respect to Hausdorff distance)? Would it be possible to validate this assumption through ablation or probing experiments?
4 - The experiments omit widely used benchmarks such as COCO and full VQAv2. Is there a reason for this exclusion, and could results on these datasets be included to better demonstrate generalizability?
5 - How does pruning at deeper or multiple layers beyond ℓ = 2 impact performance and trade-off dynamics? Can additional ablations be provided to justify the pruning depth choice?
6 - Given the dependency of MoB on the estimation of η, what method is used to compute or approximate η in practice? Is this estimation stable and efficient?
7 - To what extent is MoB sensitive to the choice of covering fold parameter k, especially under weak coupling conditions with long prompts? Would an adaptive mechanism be feasible here?
Limitations
Yes.
Final Justification
After reading the authors rebuttal, my recommendation remains a borderline accept. The following points outline my final justification:
- Primary Weakness Addressed: My main criticism was the paper's poor readability, stemming from the dense theoretical derivations and notation. The authors' clarifications and commitment to improving the manuscript's clarity have satisfactorily resolved this issue.
- Methodological Concerns Clarified: The authors provided convincing justifications for their experimental design and theoretical assumptions. This includes their rationale for dataset selection, the choice of pruning at layer l=2, and the clarification that the coupling strength (η) is determined via an efficient offline analysis rather than a costly online estimation.
- Core Contribution is Significant: The paper's primary strength is its novel theoretical framework, which derives a closed-form error bound for visual token pruning and quantifies the trade-off between visual preservation and prompt alignment. This represents a good conceptual advance over prior work.
The authors' rebuttal successfully clarified the paper's contributions and addressed its weaknesses. With the promised improvements to readability, the paper's technical strengths now more clearly justify its acceptance.
Formatting Issues
None
We sincerely acknowledge your responsible work and constructive comments. We provide our point-by-point feedback as follows.
W1 & Q1: Could the authors provide more intuitive explanations or visual illustrations—particularly for the key theoretical insights in Section 3.2?
A1: Following your suggestion, we offer a more direct explanation of the theoretical insights in Section 3.2 here and will incorporate a new visual illustration and this discussion in the appendix of the revised manuscript.
Our starting point is that optimal pruning must balance two fundamental objectives: Visual Preservation (VP) and Prompt Alignment (PA), which are responsible for preserving the global context and the local evidence, respectively. Just as humans need both context and specific evidence to reason accurately, a VLM without global context (VP) risks hallucination, while one without local evidence (PA) cannot answer the specific question.
Section 3.2 reformulates token selection as a geometric covering problem, thereby establishing a relationship between the preservation quality of the global context and of the local evidence (measured by the Hausdorff distance). Finally, Theorem 1 provides a principle for how much effort is separately required for a pruning method to preserve global context and local evidence with a fixed budget under a specific task type (characterized by the coupling strength η).
W1 & Q2: Has a notation table or summary been considered to address the overloaded variables like "X", "S", "K", and “η”?
A2: Considering your constructive suggestion, we have compiled a notation table and added it to Appendix E of the revised manuscript to improve the readability and mathematical clarity of our manuscript. We believe this revision will help readers better understand our work.
W2 & Q3: What empirical evidence supports Assumption 1? Would it be possible to validate this assumption through ablation or probing experiments?
A3: We justify Assumption 1 in three aspects:
Theoretical Grounding: Prior work [1, 2] has shown that Transformers are approximately Lipschitz continuous w.r.t. standard matrix norms, e.g., the Frobenius norm, under the finite and compact input conditions we consider (Line 98). A standard inequality relates the Frobenius norm to the Hausdorff distance, which implies that any model Lipschitz continuous w.r.t. the Frobenius norm is also Lipschitz continuous w.r.t. the Hausdorff distance.
Empirical Evidence: Vignac et al. [3] have measured the empirical local Lipschitz constant of Transformers on real-world data and found it to be 100-1000x smaller than the theoretical worst-case bound. This indicates that practical models already lie in a well-behaved, near-Lipschitz regime. Besides, Kim et al. [4] have shown that the successful training of deep Transformers relies on controlling this constant. Architectures without such implicit or explicit constraints often fail to converge, suggesting that the well-trained VLLMs most likely adhere to a finite Lipschitz constant.
Consistency with Prior Work: We wish to clarify that this assumption is not novel to our work but is a common theoretical premise in the visual token pruning community, having been adopted by prior methods such as DivPrune [5] and DART [6]. Our work builds upon this established foundation to derive deeper insights.
Regarding probing experiments: We agree this is a valid consideration. However, computing the global Lipschitz constant is computationally challenging. The most feasible alternative, measuring local constants, has already been thoroughly executed by Vignac et al. [3] and Kim et al. [4]. Their work provides strong empirical validation for Assumption 1, and we believe further experiments would likely reinforce their conclusion without fundamentally altering our analysis.
References
[1] Dasoulas G, et al. Lipschitz normalization for self-attention layers with application to graph neural networks. ICML, 2021.
[2] Yudin N, et al. Pay Attention to Attention Distribution: A New Local Lipschitz Bound for Transformers. preprint, 2025.
[3] Vignac C, et al. Understanding the regularity of self-attention with optimal transport. preprint, 2023.
[4] Kim H, et al. The lipschitz constant of self-attention. ICML, 2021.
[5] Alvar S R, et al. Divprune: Diversity-based visual token pruning for large multimodal models. CVPR, 2025.
[6] Wen Z, et al. Stop looking for important tokens in multimodal language models: Duplication matters more. preprint, 2025.
W3 & Q4: The experiments omit widely used benchmarks such as COCO and full VQAv2. Is there a reason for this exclusion, and could results on these datasets be included to better demonstrate generalizability?
A4: We want to explain that we have evaluated MoB in ScienceQA (SQA) (Appx. C.1), while the exclusion of COCO and the full VQAv2 benchmark is a deliberate choice based on their technical limitations for visual token pruning. We attribute it to the fact that COCO's language-based metrics (e.g., CIDEr) and VQAv2's strong language priors make VLLMs insensitive to the loss of key visual details. These limitations are reflected in current community practice: our evaluation protocol mirrors that of 13 recent SOTA pruning methods, none of which report results on COCO or full VQAv2. Hence, for a comprehensive comparison with prior works, we include VQAv2's validation set in our suite.
W3 & Q5: How does pruning at deeper or multiple layers beyond ℓ = 2 impact performance and trade-off dynamics? Can additional ablations be provided to justify the pruning depth choice?
A5: Thank you for the suggestion. We clarify that a comprehensive ablation on pruning depth (ℓ) is presented in Fig. 6. The results show a clear trade-off: pruning at deeper layers offers marginal performance returns with notable latency increments. We chose ℓ = 2 as the default setting for two primary reasons: First, it offers a highly competitive performance-efficiency trade-off. More importantly, it aligns with the community standard for single-stage pruning methods (e.g., FastV, DART), ensuring that the comparison reflects algorithmic superiority rather than conflating results with higher computational cost. Finally, multi-layer pruning may introduce intractable higher-order interactions beyond VP and PA, which are outside the scope of our theoretical analysis. Hence, our ablations focus exclusively on the single-stage scenario.
Q6: Given the dependency of MoB on the estimation of η, what method is used to compute or approximate η in practice? Is this estimation stable and efficient?
A6: We clarify that MoB relies on an identification of the coupling pattern (i.e., 'weak' vs. 'strong') rather than a precise estimation of η, which is much more tractable in practice. Specifically, we perform an offline analysis of a benchmark's η distribution using Alg. 2 (Appx. B) to identify its coupling pattern. During inference, we simply apply the predetermined K_p based on the coupling pattern, resulting in near-zero computational overhead at runtime.
Besides, as reported in Figs. 1(b) & 9 (Appx. D.2), the η distributions for the two coupling patterns exhibit some overlap, which means our classification will inevitably be imperfect for samples in this overlapping region. Even in this case, our experimental results (Tables 1-4) show that MoB still consistently outperforms all baselines, highlighting the tolerance and stability of MoB with respect to the estimation of η.
Q7: To what extent is MoB sensitive to the choice of covering fold parameter k, especially under weak coupling conditions with long prompts? Would an adaptive mechanism be feasible here?
A7: To address your concern, we have reorganized a portion of our ablation results, focusing specifically on the sensitivity to the covering fold k under weak coupling conditions with long prompts:
Table 1: Ablation on the covering fold k in GQA
| 0 | 2 | 4 | 6 | 8 | 12 | 16 | 24 | 32 | 48 | 64 | 96 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58.3 | 58.8 | 59.0 | 58.7 | 58.2 | 57.4 | |||||||
| 60.2 | 60.5 | 60.7 | 60.6 | 60.0 | 59.5 | |||||||
| 60.6 | 61.1 | 61.2 | 60.9 | 60.7 | 60.5 |
Table 2: Ablation on the covering fold k in TextVQA
| 0 | 2 | 4 | 6 | 8 | 12 | 16 | 24 | 32 | 48 | 64 | 96 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56.5 | 56.9 | 57.0 | 56.8 | 56.5 | 56.2 | |||||||
| 57.1 | 57.5 | 57.7 | 57.7 | 57.2 | 56.8 | |||||||
| 57.8 | 58.2 | 58.2 | 58.1 | 57.7 | 57.5 |
As the tables demonstrate, MoB is not overly sensitive to the choice of k, particularly within a clear optimal range: in both benchmarks and under every setting, performance varies only marginally across a broad middle range of k values.
There is also a principled, theoretical reason for this robustness, which stems from the relationship between the covering fold k, the allocated budget, and the length of the prompt tokens. From covering theory, every prompt token is covered by at least one retained visual token once the covering fold and the allocated budget are jointly large enough relative to the prompt length, thereby ensuring the performance guarantee of MoB. Therefore, as long as the selected k satisfies this condition, the performance remains stable.
The above analysis provides a simple yet effective heuristic: one can easily determine a robust range for k based on the prompt length, as illustrated in the sketch below. Based on this analysis, we believe a more delicate mechanism that adaptively searches for a fine-grained value of k for each sample would bring limited additional gains.
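A purely illustrative sketch of this heuristic follows; it assumes the covering condition takes the form k·K_p ≳ n_p (our reading of the argument above), so the lower bound and the slack factor are assumptions rather than the paper's exact rule.

```python
import math

def robust_fold_range(n_p: int, k_p: int, slack: float = 4.0):
    """Pick the covering fold k so that fold times budget comfortably exceeds the prompt length.
    Assumes the condition k * K_p >= n_p holds; the slack factor is an illustrative choice."""
    k_min = math.ceil(n_p / max(k_p, 1))
    return k_min, int(slack * k_min)

print(robust_fold_range(n_p=48, k_p=8))  # e.g. (6, 24): a broad, stable range for k
```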
I appreciate the authors’ detailed rebuttal and encourage them to incorporate these clarifications into the revised manuscript. Most of my concerns have been satisfactorily addressed, and I maintain my positive evaluation.
Sincerely, thank you for your feedback and support, which are very important for our work. According to your suggestions, we will supply these clarifications in our revised manuscript.
Best regards,
This paper introduces a solid theoretical framework for visual token pruning, quantifying the trade-off between visual preservation and prompt alignment. The methodology is sound, with comprehensive experiments and convincing clarifications provided in the original submission and rebuttal. As raised by the reviewers, the authors could further consider improving the readability of the original paper, the scalability to larger models, and the limited hyperparameter analysis. Overall, the work is technically strong, well-executed, and represents a meaningful contribution, meriting acceptance.