DyMU: Dynamic Merging and Virtual Unmerging for Efficient Variable-Length VLMs
Abstract
Reviews and Discussion
This paper presents DyMU, a training-free token compression framework for vision transformers. DyMU introduces two key contributions: (1) a dynamic merging strategy that adaptively controls token reduction based on image complexity, and (2) a novel virtual token unmerging mechanism that mitigates the performance degradation caused by aggressive token merging. The method is evaluated on various models across multiple downstream tasks and benchmarks, demonstrating consistent improvements over prior training-free approaches, particularly ToMe. The work is well-motivated, clearly written, and practically relevant.
Strengths and Weaknesses
Strengths
-
Well-motivated design: The paper offers a clear rationale for adapting token budgets to image complexity, as opposed to using a fixed schedule. The proposed dynamic merging strategy is intuitive and practically motivated.
-
Robust empirical evaluation: DyMU is validated on a wide range of benchmarks and models, showing strong generalization and consistent improvements.
-
Virtual Token Unmerging: This is a novel contribution not present in prior methods like ToMe. It enables accurate RoPE computation with merged tokens, and empirical results confirm that VTU substantially reduces accuracy degradation.
-
Cost-accuracy trade-off: DyMU supports controllable granularity, letting users dynamically balance efficiency and accuracy. It is also compatible with domain-specific preprocessing tools (e.g., OCR-based cropping), enabling flexible deployment.
Weaknesses
-
Limited baselines: While ToMe comparisons are thorough, recent token-merging or pruning methods such as DiffRate[r1] and PiToMe[r2] are not quantitatively reported. Including them (even if training-required) would contextualise DyMU's gains more convincingly.
-
Missing qualitative visualizations: ToMe originally visualized merged-token regions; similar maps for DyMU would help readers see how token allocation differs for simple vs. complex images, beyond giving only the number of tokens after merging in Figs. 1 and 7.
References
[r1] Mengzhao Chen, et al. DiffRate: Differentiable Compression Rate for Efficient Vision Transformers. In ICCV 2023.
[r2] Hoai-Chau Tran, et al. Accelerating Transformers with Spectrum-Preserving Token Merging. In NeurIPS 2024.
Questions
-
Could you report DyMU's performance against DiffRate or PiToMe under comparable token budgets? Even partial results would help position the method relative to the latest merging approaches.
-
Would you consider adding merged-token maps (akin to ToMe's Fig. 4) to illustrate where tokens are retained or merged in simple vs. complex scenes?
Limitations
Yes.
Final Justification
I appreciate the authors' response and the additional experiments, which effectively address my concern regarding comparisons with more recent baselines. I acknowledge that DyMU's performance is competitive relative to these newly added baselines, and as a dynamic approach, it also holds potential advantages in terms of efficiency. Taking into account the authors' responses to the other reviewers as well, I find the current experiments sufficiently solid, and consider the proposed method to be meaningful and of practical value. I will raise my score to absolute acceptance, and I hope the authors incorporate their additional clarifications and discussions with the other reviewers into the final version of the paper.
Formatting Concerns
No formatting concerns.
We thank Reviewer 1uHD for the detailed and constructive comments! We appreciate the reviewer’s recognition of our strengths in well-motivated design, comprehensive evaluations, novel VTU algorithm, and more controllable tradeoff in efficiency and accuracy. We hope the following response addresses the remaining concerns and fosters meaningful discussion.
W1,Q1: Comparison to additional merging/pruning methods
We thank the reviewer for pointing out additional baselines to compare with. Below we add the comparison with DiffRate[1] and PiToMe[2], together with VisionZip[3] which is suggested by reviewer iv6c. We show that DyMU achieves competitive overall performance while being both training-free and dynamic in length.
| Method | #Token | Dynamic? | Training-Free? | GQA | MMB | MME (prcp,all) | POPE | SQA-I | SEED-I | VQA-T | MMVet | LLaVA-W | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DiffRate[1] | ~57 | ✓ | ✗ | 57.9 | - | 1341, - | - | 66.4 | - | 30.6 | - | - | - |
| PiToMe[2] | ~57 | ✗ | ✓ | 59.9 | - | 1448, - | - | 69.0 | - | 43.0 | - | - | - |
| VisionZip[3] | 64 | ✗ | ✓ | 55.1 | 60.1 | -, 1690 | 77.0 | 69.0 | - | 55.5 | 31.7 | 62.9 | - |
| VisionZip[3] | 128 | ✗ | ✓ | 57.6 | 62.0 | -, 1762 | 83.2 | 68.9 | - | 56.8 | 32.6 | 64.8 | - |
| VisionZip[3] | 192 | ✗ | ✓ | 59.3 | 63.0 | -, 1783 | 85.3 | 68.9 | - | 57.3 | 31.7 | 67.7 | - |
| DyMU-low | 89±27 | ✓ | ✓ | 60.8 | 62.1 | 1438, 1787 | 86.3 | 69.3 | 65.0 | 53.1 | 30.0 | 62.9 | 61.5 |
| DyMU-mid | 195±47 | ✓ | ✓ | 61.7 | 62.8 | 1483, 1862 | 86.6 | 69.2 | 65.9 | 55.1 | 30.9 | 65.1 | 62.6 |
| DyMU-high | 394±57 | ✓ | ✓ | 61.9 | 64.3 | 1498, 1846 | 86.8 | 69.9 | 66.1 | 58.0 | 31.5 | 64.5 | 63.2 |
Reference:
- [1] Chen, Mengzhao, et al. "Diffrate: Differentiable compression rate for efficient vision transformers."
- [2] Tran, Chau, et al. "Accelerating transformers with spectrum-preserving token merging."
- [3] Yang, Senqiao, et al. "Visionzip: Longer is better but not necessary in vision language models."
W2,Q2: Visualization of the merged token regions
This is a great suggestion! Specifically, such a visualization can help further validate the dynamic-length merging by showing that less complex areas are more aggressively merged. Since OpenReview does not allow figures or external links in the rebuttal, we are unable to include it at this stage. We will add this visualization in our next revision.
I appreciate the authors' response and the additional experiments, which effectively address my concern regarding comparisons with more recent baselines. I acknowledge that DyMU's performance is competitive relative to these newly added baselines, and as a dynamic approach, it also holds potential advantages in terms of efficiency. Taking into account the authors' responses to the other reviewers as well, I find the current experiments sufficiently solid, and consider the proposed method to be meaningful and of practical value. I will raise my score to absolute acceptance, and I hope the authors incorporate their additional clarifications and discussions with the other reviewers into the final version of the paper.
We thank reviewer 1uHD once again for the insightful feedback and constructive discussion, which have made this work a stronger submission. We will incorporate the additional results and visualizations as discussed during the rebuttal.
This paper proposes a training-free token reduction method for MLLMs that adaptively adjusts the number of vision tokens based on the complexity of the image. The proposed DYMU framework comprises two modules, Dynamic Token Merging and Virtual Token Unmerging. The dynamic token merging module progressively reduces the number of vision tokens by merging similar ones, based on image complexity, across the different layers of the vision encoder. In the LLM, the Virtual Token Unmerging module approximates the full vision-sequence operations using the merged tokens. Experiments on several general MLLM benchmarks show the efficiency gains brought by the proposed framework.
Strengths and Weaknesses
Strength: The idea of adaptively reducing the number of tokens according to image complexity is important.
Weakness:
- DYMU emphasizes adaptive per-image token reduction. When comparing with other token reduction methods, the proper experimental setup would be to measure the actual (average) inference time on the different benchmarks and to compare against other token reduction methods configured to a similar inference time.
- Many more recent token reduction methods are missing, e.g. VisionZip (CVPR2025) and PyramidDrop (CVPR2025).
- DYMU merges tokens at every vision-encoder layer, but the paper lacks any quantitative analysis of the computational benefits this provides over a single post-encoder merge. It remains questionable whether early-layer merging—when token similarities may be noisier—yields net gains.
- Token merging thresholds rely on key-vector similarity as a proxy for image complexity, yet no rigorous validation is provided that this reliably reflects the semantic detail needed by the downstream LLM. For instance, a dense document image (all text) may exhibit high key similarity and thus undergo over-merging, potentially losing critical OCR information.
- Although broad VQA and grounding benchmarks are included, the paper omits fine-grained tasks (e.g., DocVQA, InfoVQA, MME-RealWorld) where small details matter most. Without these, the claim of “preserving core information” under aggressive compression is unsubstantiated.
- Besides FLOPs measurements, the inference time should also be included in all experiments.
Questions
- Is the proposed VTU module compatible with FlashAttention?
- Do you think designing a metric that measures how well the original visual information can be reconstructed from the merged tokens (or from directly down-sampled ones) could represent image complexity better than the similarity-based one?
Limitations
Yes
Final Justification
I appreciate the authors' detailed and thoughtful rebuttal. However, my main concerns remain:
Inference Time vs. FLOPs: While I understand the challenges in aligning FLOP reductions with end-to-end latency, actual inference time is the most critical metric for real-world deployment. The current method shows limited or no gains in end-to-end latency, especially when VTU is enabled, which undermines the practical efficiency benefits claimed by the paper.
Evaluation on Fine-Grained / Information-Dense Tasks: The lack of evaluation on tasks such as DocVQA, InfoVQA, or MME-RealWorld limits confidence in the method's robustness for semantically dense images.
Therefore, I will keep my original rating.
Formatting Concerns
NA
We thank Reviewer iv6c for the detailed and constructive comments! We appreciate the reviewer’s recognition of our novelty in complexity-conditional token reduction. We hope the following response addresses the remaining concerns and fosters meaningful discussion.
W1, W6: Inference Time Comparison
W1.1: Additional Analysis on Wall-clock Inference Times
We thank the reviewer for the suggestion! We provide additional analysis on wall-clock inference times, summarizing our findings below. We continue to see speedups in inference time in most settings; however, we find that the inference time reduction tracks token count and FLOPs less consistently, particularly when measuring end-to-end duration. This discrepancy has also been observed in prior works, such as Table 5 in [1] and Table 7 in [2]. We provide a detailed discussion in the next section on the underlying reasons and practical challenges.
| Method | Avg N_un/N | Attn MFLOPs | Attn Inference Time (ms) | End-to-End Inference Duration on MME (s) |
|---|---|---|---|---|
| Full | 576/576 | 1359.0 | 9.17 | 131 |
| DyMU-low | ||||
| DyMU-low w/o VTU | 89/576 | 32.4 | 1.26 | 115 |
| DyMU-low w/ VTU | 89/576 | 64.9 | 7.20 | 131 |
| DyMU-mid | ||||
| DyMU-mid w/o VTU | 195/576 | 155.8 | 1.29 | 121 |
| DyMU-mid w/ VTU | 195/576 | 311.5 | 7.49 | 123 |
| DyMU-high | ||||
| DyMU-high w/o VTU | 394/576 | 635.9 | 2.92 | 132 |
| DyMU-high w/ VTU | 394/576 | 1272.0 | 7.60 | 132 |
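For reference, the attention MFLOPs above follow the usual quadratic dependence of self-attention cost on sequence length; the back-of-the-envelope check below (our own arithmetic, not an additional measurement) reproduces the reported numbers:

```latex
\text{Attn FLOPs}(N_{\mathrm{un}}) \;\approx\; \left(\tfrac{N_{\mathrm{un}}}{N}\right)^{2} \cdot \text{Attn FLOPs}(N),
\qquad \text{e.g.}\quad \left(\tfrac{89}{576}\right)^{2} \times 1359.0 \approx 32.4 \ \text{MFLOPs}.
```

With VTU enabled, the decomposed computation costs roughly twice the merged-length attention (e.g., 64.9 ≈ 2 × 32.4 MFLOPs), which remains well below the full 1359.0 MFLOPs in the low and mid settings.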
W1.2: Deep Dive into the Gap between FLOPs and Inference Time
In this section, we elaborate on our findings about why there is a non-trivial gap between FLOP reduction and actual inference time gains, especially for algorithms like DyMU that fundamentally alter the computation of attention. We also explain why we believe FLOPs is a more robust metric for comparing theoretical/algorithmic efficiency, as is also standard practice in prior works [3][4][5][6].
-
Controllability: By definition, FLOPs quantify algorithmic compute independent of hardware, batching, parallelism, and system load. Wall-clock inference time can vary significantly with deployment configuration and with parallel jobs running on the same node, which are practically difficult to control on shared academic servers.
-
Implementation Gap: In practice, wall-clock inference time is also heavily dependent on the actual implementations and low-level optimizations of the computation functions. This may lead to significant discrepancies between FLOPs and inference time.
-
Here we show a concrete example, where we compare two implementations of computing an N×N attention matrix from two (N, D) matrices. In Version 1, we simply do one matrix multiplication using `torch.matmul` to get the N×N matrix. In Version 2, we first split each input into 4 row-chunks of size N/4, do matrix multiplication on each pair of chunks to obtain the (N/4, N/4) blocks, and assemble them back into the full matrix. As shown below, this results in identical FLOPs but very different wall-clock times. This is because `torch.matmul` is heavily optimized for large matrix multiplications, so multiple `torch.matmul` calls on smaller matrices lead to longer inference times despite having the same total FLOPs. (A runnable sketch of this comparison is included after this list.)

| | # `torch.matmul` calls | Matrix size per multiplication | MFLOPs | Wall-Clock Time (ms) |
|---|---|---|---|---|
| Version 1 | 1 | 576 | 339.74 | 1.374 |
| Version 2 | 16 | 144 | 339.74 | 2.311 |

-
In our case, the VTU operation requires decomposing the matrix multiplication into smaller sub-matrices, which inflates the inference time despite the better theoretical efficiency in terms of FLOPs. (See the “Implementation Notes” section in the supplementary code for more details.)
-
This demonstrates that fully translating theoretical FLOPs improvements into actual inference time improvements may require non-trivial engineering, such as implementing new kernels, which can itself constitute a separate paper, as with FlashAttention [7].
-
Therefore, in this work, we focus on formally deriving the theoretical efficiency gains and presenting empirical results on FLOPs, token length and end-task performance. But we do acknowledge that optimizing wall-clock inference time of DyMU through dedicated low-level kernel development is an important follow-up work.
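To make the example above concrete, below is a minimal timing sketch of the two implementations (our own illustrative harness, not the code used for the reported numbers; D = 512 is assumed so that 2·N²·D matches the reported 339.74 MFLOPs, and absolute timings will vary with hardware and PyTorch version):

```python
import time
import torch

N, D = 576, 512  # D = 512 assumed: 2 * N^2 * D ≈ 339.74 MFLOPs
device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(N, D, device=device)
k = torch.randn(N, D, device=device)

def bench(fn, iters=100):
    # Warm up, then time with device synchronization for a fair wall-clock measurement.
    for _ in range(10):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per call

# Version 1: a single large matmul producing the full N x N score matrix.
def v1():
    return q @ k.T

# Version 2: identical FLOPs, but each input is split into 4 row-chunks of size N/4;
# the 16 (N/4, N/4) blocks are computed separately and assembled back into N x N.
def v2():
    qs, ks = q.chunk(4, dim=0), k.chunk(4, dim=0)
    rows = [torch.cat([qi @ kj.T for kj in ks], dim=1) for qi in qs]
    return torch.cat(rows, dim=0)

assert torch.allclose(v1(), v2(), atol=1e-4)  # same result, same total FLOPs
print(f"Version 1: {bench(v1):.3f} ms   Version 2: {bench(v2):.3f} ms")
```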
Reference:
- [1] Xing, Long, et al. "Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction."
- [2] Wen, Zichen, et al. "Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?."
- [3] Bolya, Daniel, et al. "Token merging: Your vit but faster."
- [4] Chen, Mengzhao, et al. "Diffrate: Differentiable compression rate for efficient vision transformers."
- [5] Chen, Liang, et al. "An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models."
- [6] Shang, Yuzhang, et al. "Llava-prumerge: Adaptive token reduction for efficient large multimodal models."
- [7] Dao, Tri, et al. "Flashattention: Fast and memory-efficient exact attention with io-awareness."
W2: Additional Comparison Baselines
We thank the reviewer for pointing out additional baselines for comparison!
- A kind reminder that we have already included PyramidDrop in Table 2, Row 11, labeled as “PDrop.”
- In the following table, we show additional comparison for the following baselines: VisionZip[1], DiffRate[2], and PiToMe[3]. We show that DyMU achieves competitive overall performance while being both training-free and dynamic in length.
| Method | #Token | Dynamic? | Training-Free? | GQA | MMB | MME (prcp,all) | POPE | SQA-I | SEED-I | VQA-T | MMVet | LLaVA-W | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VisionZip[1] | 64 | ✗ | ✓ | 55.1 | 60.1 | -,1690 | 77.0 | 69.0 | - | 55.5 | 31.7 | 62.9 | - |
| VisionZip[1] | 128 | ✗ | ✓ | 57.6 | 62.0 | -,1762 | 83.2 | 68.9 | - | 56.8 | 32.6 | 64.8 | - |
| VisionZip[1] | 192 | ✗ | ✓ | 59.3 | 63.0 | -,1783 | 85.3 | 68.9 | - | 57.3 | 31.7 | 67.7 | - |
| DiffRate[2] | ~57 | ✓ | ✗ | 57.9 | - | 1341,- | - | 66.4 | - | 30.6 | - | - | - |
| PiToMe[3] | ~57 | ✗ | ✓ | 59.9 | - | 1448,- | - | 69.0 | - | 43.0 | - | - | - |
| DyMU-low | 89±27 | ✓ | ✓ | 60.8 | 62.1 | 1438,1787 | 86.3 | 69.3 | 65.0 | 53.1 | 30.0 | 62.9 | 61.5 |
| DyMU-mid | 195±47 | ✓ | ✓ | 61.7 | 62.8 | 1483,1862 | 86.6 | 69.2 | 65.9 | 55.1 | 30.9 | 65.1 | 62.6 |
| DyMU-high | 394±57 | ✓ | ✓ | 61.9 | 64.3 | 1498,1846 | 86.8 | 69.9 | 66.1 | 58.0 | 31.5 | 64.5 | 63.2 |
Reference:
- [1] Yang, Senqiao, et al. "Visionzip: Longer is better but not necessary in vision language models."
- [2] Chen, Mengzhao, et al. "Diffrate: Differentiable compression rate for efficient vision transformers."
- [3] Tran, Chau, et al. "Accelerating transformers with spectrum-preserving token merging."
W3: Merging Per-Layer vs Merging Last-Layer
Compared to post-encoder merging, per-layer merging offers two key benefits:
- (1) Additional efficiency gains within the ViT encoder (as demonstrated in the ToMe paper);
- (2) Progressively merging a small number of tokens per layer preserves visual concepts better than merging a large number of tokens at the final layer.
To further validate (2), we provide an additional ablation study as follows, in which we introduce a new DyMU variant that performs merging in the last three layers. Note: Due to the nature of bipartite matching, each layer can merge at most half of all tokens. Therefore, merging only in the last layer is insufficient to meet the target number of remaining tokens. Instead, we merge across the last three layers to reach the same average target token count, i.e., 94 for DyMU-low. This setting can be considered a “less extreme” version of post-encoder merging. Even under this configuration, as shown in the table below, we observe a notable performance drop compared to per-layer merging, particularly on the hallucination task (POPE).
| Method | GQA | MMB | MME (prcp,all) | POPE | SQA-I | SEED-I | VQA-T | MMVet | LLaVA-W | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| DyMU-low w/ Per-Layer Merge | 60.8 | 62.1 | 1438,1787 | 86.3 | 69.3 | 65.0 | 53.1 | 30.0 | 62.9 | 61.5 |
| DyMU-low w/ Last-Layer Merge | 58.3 | 63.2 | 1421,1755 | 81.1 | 69.4 | 62.9 | 50.3 | 25.1 | 58.2 | 59.0 (↓2.5) |
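To make the per-layer merging step concrete, here is a simplified sketch of a single threshold-based bipartite merge (illustrative only: the function name, the exact matching rule, and the size-weighted averaging are simplifications; the supplementary code remains the reference):

```python
import torch
import torch.nn.functional as F

def per_layer_threshold_merge(x, keys, sizes, threshold):
    """Simplified single-layer merge step (sketch only).

    x:         (N, D) token features at the current ViT layer
    keys:      (N, Dk) attention keys used as the similarity measure
    sizes:     (N,) number of original patches already merged into each token
    threshold: per-layer similarity threshold found offline
    """
    # Bipartite split into alternating groups A and B, so at most half of the
    # tokens can be merged in a single layer.
    a_idx = torch.arange(0, x.shape[0], 2)
    b_idx = torch.arange(1, x.shape[0], 2)
    sim = F.normalize(keys[a_idx], dim=-1) @ F.normalize(keys[b_idx], dim=-1).T

    best_sim, best_b = sim.max(dim=-1)   # most similar B partner for each A token
    merge = best_sim > threshold         # merge only pairs above the layer threshold

    x, sizes = x.clone(), sizes.clone()
    for i, a in enumerate(a_idx.tolist()):
        if not merge[i]:
            continue
        b = b_idx[best_b[i]].item()
        # Size-weighted average so a token merged repeatedly across layers stays
        # the mean of all original patches it covers.
        total = sizes[a] + sizes[b]
        x[b] = (x[a] * sizes[a] + x[b] * sizes[b]) / total
        sizes[b] = total

    keep = torch.ones(x.shape[0], dtype=torch.bool)
    keep[a_idx[merge]] = False           # drop the A tokens that were merged away
    return x[keep], sizes[keep]
```

The number of surviving tokens then varies per image: visually redundant images cross the threshold for many pairs and shrink quickly, while complex images retain more tokens, which is the behavior exercised in the ablation above.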
W4,W5: Limitations on Semantically-Dense Images
We are fully aware of this limitation and have acknowledged it in Section 5, where we refer to such cases as token-sensitive tasks, exemplified by TextVQA. In fact, as shown in Table 2, we find this to be a common limitation across all training-free methods. We also agree that incorporating additional benchmarks, such as DocVQA, could further characterize this limitation, and future work is needed to explore complexity-aware token merging for semantically dense images, such as multimodal documents.
Nevertheless, we would like to highlight a unique advantage of DyMU not present in previous work: its compatibility with external tools. As illustrated in Figure 6, DyMU can be used in conjunction with OCR tools to address text-dense tasks, while still introducing efficiency gains through dynamic-length token merging. Incorporating agentic tool use into the token merging framework may offer a promising direction for handling the diverse merging needs across different tasks.
Q1: Compatibility with FlashAttention
We currently provide two implementations of VTU: V1 uses the exact matrix decomposition, as derived in Section 3.2, which is more efficient in terms of FLOPs but empirically slower in wall-clock time; V2 uses a direct matrix expansion approximation, which has higher FLOPs but achieves faster wall-clock performance. Among the two, V2 is compatible with FlashAttention. More details can be found in the “Implementation Notes” section of the supplementary code.
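For intuition, below is a simplified sketch of the expand-attend-re-merge idea behind V2 (pseudocode-level: the names, the single-head layout, and the omission of causal masking and text tokens are simplifications of ours; the supplementary code remains the reference):

```python
import torch
import torch.nn.functional as F

def vtu_v2_attention(x_merged, assign, rope_cos, rope_sin, wq, wk, wv):
    """Sketch of the direct-expansion VTU variant (V2), single head for clarity.

    x_merged: (M, D)    merged visual tokens from the encoder
    assign:   (N,) long index of the merged token covering each original position
    rope_cos, rope_sin: (N, Dh) RoPE tables at the original (unmerged) positions
    wq, wk, wv: (D, Dh) projection weights of the attention layer
    """
    # 1. Virtually unmerge: copy each merged token to every position it covers.
    x_full = x_merged[assign]                                   # (N, D)

    def rope(t):
        # Interleaved RoPE rotation applied at the original positions.
        t1, t2 = t[..., 0::2], t[..., 1::2]
        rot = torch.stack((-t2, t1), dim=-1).flatten(-2)
        return t * rope_cos + rot * rope_sin

    # 2. Standard RoPE attention on the expanded sequence; because this is a stock
    #    computation, fused kernels such as FlashAttention apply unchanged.
    q, k = rope(x_full @ wq), rope(x_full @ wk)
    v = x_full @ wv
    out_full = F.scaled_dot_product_attention(
        q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0)
    ).squeeze(0)                                                # (N, Dh)

    # 3. Re-merge: average the outputs over all positions covered by the same
    #    merged token (simple mean here; Eq. 10 defines the actual re-merging).
    M, Dh = x_merged.shape[0], out_full.shape[-1]
    out = torch.zeros(M, Dh, dtype=out_full.dtype, device=out_full.device)
    counts = torch.zeros(M, dtype=out_full.dtype, device=out_full.device)
    out.index_add_(0, assign, out_full)
    counts.index_add_(0, assign, torch.ones(assign.shape[0], dtype=out_full.dtype, device=out_full.device))
    return out / counts.unsqueeze(-1)
```

Because step 2 runs unmodified attention at the full length N, this variant trades extra FLOPs for compatibility with existing fused kernels, which is exactly the trade-off discussed under W1 above.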
Q2: Reconstruction Metric for Image Complexity
We believe this is an interesting idea and a very promising future direction! Specifically, a reconstruction-based metric captures more low-level visual details, which could be particularly useful for identifying complexity in semantically dense images, as discussed above. Moreover, we believe that combining similarity-based metrics (which focus on high-level semantics) with reconstruction-based metrics (which emphasize low-level details) could result in a more general and effective metric for diverse tasks.
I appreciate the authors' detailed and thoughtful rebuttal. However, my main concerns remain:
Inference Time vs. FLOPs: While I understand the challenges in aligning FLOP reductions with end-to-end latency, actual inference time is the most critical metric for real-world deployment. The current method shows limited or no gains in end-to-end latency, especially when VTU is enabled, which undermines the practical efficiency benefits claimed by the paper.
Evaluation on Fine-Grained / Information-Dense Tasks: The lack of evaluation on tasks such as DocVQA, InfoVQA, or MME-RealWorld limits confidence in the method's robustness for semantically dense images.
We thank Reviewer iv6c for the thoughtful and constructive feedback!
-
We understand the concern regarding the gap between efficiency gains in FLOPs and improvements in inference time. We acknowledge that addressing this gap is an important next step to further enhance the practical applicability of the proposed algorithm. We have been very careful not to overclaim improvements in wall-clock time, as we explicitly discussed this limitation in Lines 226–229, and we believe it should not entirely undermine the novelty of our contribution, as the method demonstrates strong theoretical efficiency benefits while achieving competitive performance with dynamic token lengths.
-
For the evaluation tasks, we adopted the widely used benchmarks following prior work to ensure a comprehensive comparison. However, we agree with the reviewer that incorporating more information-dense tasks could provide additional insights into model robustness, as also discussed in Section 5. We believe this is a fundamental challenge for most existing token merging/reduction methods. Due to the limited rebuttal time, we were unable to include all requested analyses but will incorporate them in the next revision.
We appreciate your engagement in the discussion and welcome any further comments or concerns you may have. Thank you!
In this paper, the author presents DYMU, an efficient, training-free framework that can dynamically reduce the computation of VLMs while maintaining performance, via two novel modules, DToMe and VTU. A thorough evaluation has validated the framework's effectiveness.
Strengths and Weaknesses
Strengths:
1. The author logically described the current problems in this field and the purpose of the proposed method.
2. The author provided a detailed explanation of the functions and contributions of each module.
3. Sufficient experiments have fully verified the effectiveness of this method.
Weakness: I don't completely understand what the red and blue squares in Figure 1 represent. If they represent the same characteristic with different values, it is recommended that the author not use only two colors but instead use one/multiple colors. If they represent different types of variables, it is hoped that corresponding explanations can be added beside the figure.
Questions
I don't completely understand what the red and blue squares in Figure 1 represent. If they represent the same characteristic with different values, it is recommended that the author not use only two colors but instead use one/multiple colors. If they represent different types of variables, it is hoped that corresponding explanations can be added beside the figure.
Limitations
I don't completely understand what the red and blue squares in Figure 1 represent. If they represent the same characteristic with different values, it is recommended that the author not use only two colors but instead use one/multiple colors. If they represent different types of variables, it is hoped that corresponding explanations can be added beside the figure.
Formatting Concerns
None
We thank Reviewer 2zYg for the constructive and encouraging feedback! We appreciate the reviewer’s recognition of our detailed explanations and comprehensive experiments on the proposed methods. We hope the following response addresses the remaining concerns and questions.
W1: Clarification on the token illustration in the method figure
We thank the reviewer for the suggestion. We believe the reviewer is referring to the red and blue squares in the method overview figure (Figure 2), which indicate the “visual tokens” in the transformer layers. These tokens indeed represent the same characteristics but with different values. We use two colors to better illustrate the bipartite matching algorithm, which divides the tokens into two groups for matching. The red and blue tokens indicate tokens in different groups which will potentially be merged together. We will add additional explanation in the figure caption to clarify this in the next revision.
Dear Reviewer 2zYg,
This is a gentle reminder to participate in the author discussion for your assigned paper(s). Engaging with authors is required before submitting the Mandatory Acknowledgement.
The discussion deadline is August 8, 11:59 PM AoE. Please ensure you post at least one response in the discussion thread.
Let me know if you encounter any issues.
Best, Area Chair, NeurIPS 2025
The paper introduces DYMU (Dynamic Merging and Virtual Unmerging), a novel framework to enhance vision-language models (VLMs) by adaptively reducing visual tokens based on image complexity. It addresses inefficiencies in fixed-length token outputs of existing VLMs. Key components are:
1. Dynamic Token Merging (DToMe): selectively merges similar tokens depending on image complexity (not resolution), reducing tokens by 32%-85% for simple images while preserving semantic details for complex ones.
2. Virtual Token Unmerging (VTU): reconstructs full-sequence representations without retraining, maintaining performance and optimizing attention efficiency.
Strengths and Weaknesses
Strengths:
The paper provides comprehensive quantitative and qualitative experiments across multiple VLM architectures (LLaVA-1.5, LLaVA-OneVision), visual encoders (CLIP, SigLIP), and evaluation benchmarks (GQA, MMBench, MME, etc.), enhancing the credibility of the proposed method.
The proposed DToMe and DYMU are plug-and-play, requiring no retraining of the original model, making them suitable for resource-constrained scenarios or when training data is unavailable.
DToMe extends the previous ToMe by dynamically adjusting the token count via threshold control (instead of pre-selecting a token budget), effectively balancing computational efficiency and performance.
VTU addresses the mismatch between variable-length visual token sequences and fixed-length pre-trained LLMs through Virtual Token Unmerging, efficiently computing attention matrices without explicitly reconstructing the full sequence.
Weaknesses:
Overall computational cost: Although Table 1 details the attention computation cost, the lack of discussion on end-to-end costs makes it unclear whether DYMU outperforms other methods in efficiency. For the DYMU-high w/ VTU configuration, the MFLOPs (1272) show limited reduction versus Full Attention (1359). In low/mid settings, VTU nearly doubles MFLOPs, potentially limiting its applicability.
Architectural limitations: DToMe targets ViT-based encoders, and VTU is designed for RoPE-based LLMs. Their generalizability to other architectures (e.g., CNNs or non-RoPE models) remains unverified.
Hyperparameter search cost: DToMe requires offline threshold calculation (using large image sets), incurring non-negligible preprocessing costs. Appendix C shows that merging strategies significantly affect results, and fine-tuning layer-specific r_i for different models further increases preprocessing overhead.
VTU approximation: Equation 10 uses embedding averages to reconstruct attention outputs—an approximate approach. The paper claims minimal empirical impact but provides no ablation studies or quantitative error analysis.
Questions
No
Limitations
Yes
Final Justification
The rebuttal mostly addressed my concerns. I raised my score.
Formatting Concerns
No
We thank Reviewer TS2b for the detailed and constructive comments! We appreciate the reviewer’s recognition of the strengths of our comprehensive experiments and the novelty of both DToMe and VTU in enabling training-free dynamic-length token merging. We hope the following response addresses the remaining concerns and fosters meaningful discussion.
W1: Overall Computational Cost
W1.1: Additional Analysis on Wall-clock Inference Times
We thank the reviewer for the suggestion! We provide additional analysis on wall-clock inference times, summarizing our findings below. We continue to see speedups in inference time in most settings; however, we find that the inference time reduction is less consistently correlated with token count and FLOPs, particularly when measuring end-to-end duration. This discrepancy has also been observed in prior works, such as Table 5 in [1] and Table 7 in [2]. We provide a detailed discussion in the next section on the underlying reasons and practical challenges.
| Method | Avg N_un / N | Attn MFLOPs | Attn Inference Time (ms) | End-to-End Inference Duration on MME (s) |
|---|---|---|---|---|
| Full | 576/576 | 1359.0 | 9.17 | 131 |
| DyMU-low | ||||
| DyMU-low w/o VTU | 89/576 | 32.4 | 1.26 | 115 |
| DyMU-low w/ VTU | 89/576 | 64.9 | 7.20 | 131 |
| DyMU-mid | ||||
| DyMU-mid w/o VTU | 195/576 | 155.8 | 1.29 | 121 |
| DyMU-mid w/ VTU | 195/576 | 311.5 | 7.49 | 123 |
| DyMU-high | ||||
| DyMU-high w/o VTU | 394/576 | 635.9 | 2.92 | 132 |
| DyMU-high w/ VTU | 394/576 | 1272.0 | 7.60 | 132 |
W1.2: Deep Dive into the Gap between FLOPs and Inference Time
In this section, we elaborate on our findings about why there is a non-trivial gap between FLOP reduction and actual inference time gains, especially for algorithms like DyMU that fundamentally alter the computation of attention. We also explain why we believe FLOPs is a more robust metric for comparing theoretical/algorithmic efficiency, as is also standard practice in prior works [3][4][5][6].
-
Controllability: By definition, FLOPs quantify algorithmic compute independent of hardware, batching, parallelism, and system load. Wall-clock inference time can vary significantly with deployment configuration and with parallel jobs running on the same node, which are practically difficult to control on shared academic servers.
-
Implementation Gap: In practice, wall-clock inference time is also heavily dependent on the actual implementations and low-level optimizations of the computation functions. This may lead to significant discrepancies between FLOPs and inference time.
-
Here we show a concrete example, where we compare two implementations of computing an N×N attention matrix from two (N, D) matrices. In Version 1, we simply do one matrix multiplication using `torch.matmul` to get the N×N matrix. In Version 2, we first split each input into 4 row-chunks of size N/4, do matrix multiplication on each pair of chunks to obtain the (N/4, N/4) blocks, and assemble them back into the full matrix. As shown below, this results in identical FLOPs but very different wall-clock times. This is because `torch.matmul` is heavily optimized for large matrix multiplications, so multiple `torch.matmul` calls on smaller matrices lead to longer inference times despite having the same total FLOPs.

| | # `torch.matmul` calls | Matrix size per multiplication | MFLOPs | Wall-Clock Time (ms) |
|---|---|---|---|---|
| Version 1 | 1 | 576 | 339.74 | 1.374 |
| Version 2 | 16 | 144 | 339.74 | 2.311 |

-
In our case, the VTU operation requires decomposing the matrix multiplication into smaller sub-matrices, which inflates the inference time despite the better theoretical efficiency in terms of FLOPs. (See the “Implementation Notes” section in the supplementary code for more details.)
-
This demonstrates that fully translating theoretical FLOPs improvements into actual inference time improvements may require non-trivial engineering, such as implementing new kernels, which can itself constitute a separate paper, as with FlashAttention [7].
-
Therefore, in this work, we focus on formally deriving the theoretical efficiency gains and presenting empirical results on FLOPs, token length and end-task performance. But we do acknowledge that optimizing wall-clock inference time of DyMU through dedicated low-level kernel development is an important follow-up work.
Reference:
- [1] Xing, Long, et al. "Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction."
- [2] Wen, Zichen, et al. "Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?."
- [3] Bolya, Daniel, et al. "Token merging: Your vit but faster."
- [4] Chen, Mengzhao, et al. "Diffrate: Differentiable compression rate for efficient vision transformers."
- [5] Chen, Liang, et al. "An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models."
- [6] Shang, Yuzhang, et al. "Llava-prumerge: Adaptive token reduction for efficient large multimodal models."
- [7] Dao, Tri, et al. "Flashattention: Fast and memory-efficient exact attention with io-awareness."
W1.3: How VTU can achieve better tradeoff in performance and efficiency
We elaborate on how VTU can effectively help achieve a better trade-off in both DyMU-low and DyMU-high settings.
- As shown in the table below, in DyMU-low settings (where token reduction is large), VTU yields a more significant performance improvement while introducing only a moderate increase in absolute FLOPs (although it nearly doubles them);
- On the other hand, in DyMU-high settings, removing VTU results in a much smaller performance drop, so it is safer to remove VTU under this setting to gain more efficiency.
| Method | GQA | MMB | MME (prcp,all) | POPE | SQA-I | SEED-I | VQA-T | MMVet | LLaVA-W | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| DyMU-low w/o VTU | 58.2 | 56.0 | 1346,1639 | 86.9 | 67.7 | 60.9 | 51.3 | 25.2 | 58.8 | 58.2 |
| DyMU-low w/ VTU | 60.8 | 62.1 | 1438,1787 | 86.3 | 69.3 | 65.0 | 53.1 | 30.0 | 62.9 | 61.5 (↑3.3) |
| DyMU-high w/o VTU | 60.1 | 64.7 | 1460,1798 | 86.8 | 68.7 | 65.2 | 55.8 | 30.4 | 65.2 | 62.3 |
| DyMU-high w/ VTU | 61.9 | 64.3 | 1498,1846 | 86.8 | 69.9 | 66.1 | 58.0 | 31.5 | 64.5 | 63.2 (↑0.9) |
W2: Applicability to CNN and Non-RoPE models
We agree that this would be an interesting extension: to explore whether the high-level idea can be applied to CNNs and non-RoPE models. For DToMe, since the fundamental similarity measurement relies on the key vectors of visual tokens, it is not directly applicable to CNN architectures. For VTU, it is not directly applicable to non-RoPE layers due to the use of absolute positional embeddings. Nevertheless, within the claims of this paper, we emphasize that the proposed method remains broadly applicable to most mainstream VLMs, where ViT-based vision encoders and RoPE-based LLM backbones are widely used.
W3: Hyperparameter Search Cost is Minimal, No Finetuning Needed
We would like to clarify that:
- The offline threshold calculation is inference-only; no fine-tuning is performed to find the layer-specific r_i (Line 122).
- In practice, threshold finding on a CLIP ViT-L encoder using 250K images can be completed within 1 hour on 2 H100 GPUs.
- There is no strict restriction on the size or type of dataset. Any sufficiently diverse dataset is acceptable, and only images (without annotations) are required.
W4: VTU Approximation in Equation 10
The reviewer raised an interesting discussion on the approximation of the “re-merging” part in the VTU algorithm. To clarify, this approximation is required to obtain the expected efficiency derived in Section 3.2, so no direct alternatives can be compared under the same computational complexity. We show empirically in Figure 5 and Table 6 that this VTU operation successfully leads to performance gains. However, we agree that exploring more sophisticated re-merging methods, such as similarity-based weighted averaging, would be an interesting direction for future work, albeit at the cost of some additional computation.
Dear Reviewer TS2b,
This is a gentle reminder to participate in the author discussion for your assigned paper(s). Engaging with authors is required before submitting the Mandatory Acknowledgement.
The discussion deadline is August 8, 11:59 PM AoE. Please ensure you post at least one response in the discussion thread.
Let me know if you encounter any issues.
Best, Area Chair, NeurIPS 2025
I thank the detailed response to my concerns, especially the end-to-end inference time comparison and its associated analysis between the flops and inference time. It would be better to include these comparisons in the final version. I will raise my score.
We thank Reviewer TS2b for the thoughtful feedback and constructive input! We will incorporate the analyses and additional comparisons in the final version. We appreciate your engagement throughout the review and discussion period.
Dear Authors and Reviewers,
Thank you to the authors for the detailed rebuttal.
Reviewers, please read the responses carefully and post your reply as soon as possible to allow for meaningful discussion. Ideally, all reviewers should respond so the authors know their feedback has been considered.
Best regards, AC
Dear AC and Reviewers,
Thank you for facilitating the discussion and for your time and effort throughout the review process!
We greatly appreciate the reviewers’ engagement and look forward to any further feedback and discussions.
Best regards, DyMU Authors
This paper introduces DyMU, which combines dynamic token merging with virtual unmerging to improve the efficiency of vision–language models in a training-free and plug-and-play manner. The method adaptively adjusts token counts based on image complexity while preserving compatibility with RoPE, achieving substantial token reductions with minimal accuracy loss across multiple benchmarks. Reviewers appreciated the novelty, clarity, and broad validation, though they raised concerns about missing end-to-end latency results, the limited task scope (e.g., lack of document-heavy data), architecture dependence, and the absence of certain recent baselines and visualizations. In rebuttal, the authors added stronger comparisons (DiffRate, PiToMe, VisionZip), clarified FLOPs versus runtime discrepancies, and acknowledged limitations in text-heavy scenarios, which addressed most concerns. Overall, the contribution is clear, original, and practically useful, with strengths outweighing the remaining limitations, so I recommend acceptance.