DyMU: Dynamic Merging and Virtual Unmerging for Efficient Variable-Length VLMs
Abstract
Reviews and Discussion
This paper presents DyMU, a training-free token compression framework for vision transformers. DyMU introduces two key contributions: (1) a dynamic merging strategy that adaptively controls token reduction based on image complexity, and (2) a novel virtual token unmerging mechanism that mitigates the performance degradation caused by aggressive token merging. The method is evaluated on various models across multiple downstream tasks and benchmarks, demonstrating consistent improvements over prior training-free approaches, particularly ToMe. The work is well-motivated, clearly written, and practically relevant.
Strengths and Weaknesses
Strengths
-
Well-motivated design: The paper offers a clear rationale for adapting token budgets to image complexity, as opposed to using a fixed schedule. The proposed dynamic merging strategy is intuitive and practically motivated.
-
Robust empirical evaluation: DyMU is validated on a wide range of benchmarks and models, showing strong generalization and consistent improvements.
-
Virtual Token Unmerging: This is a novel contribution not present in prior methods like ToMe. It enables accurate RoPE computation with merged tokens, and empirical results confirm that VTU substantially reduces accuracy degradation.
-
Cost-accuracy trade-off: DyMU supports controllable granularity, letting users dynamically balance efficiency and accuracy. It is also compatible with domain-specific preprocessing tools (e.g., OCR-based cropping), enabling flexible deployment.
Weaknesses
-
Limited baselines: While ToMe comparisons are thorough, recent token-merging or pruning methods such as DiffRate[r1] and PiToMe[r2] are not quantitatively reported. Including them (even if training-required) would contextualise DyMU's gains more convincingly.
-
Missing qualitative visualizations: ToMe originally visualized merged-token regions; similar maps for DyMU would help readers see how token allocation differs for simple vs. complex images, beyond giving only the number of tokens after merging in Figs. 1 and 7.
References
[r1] Mengzhao Chen, et al. DiffRate: Differentiable Compression Rate for Efficient Vision Transformers. In ICCV 2023.
[r2] Hoai-Chau Tran, et al. Accelerating Transformers with Spectrum-Preserving Token Merging. In NeurIPS 2024.
Questions
-
Could you report DyMU's performance against DiffRate or PiToMe under comparable token budgets? Even partial results would help position the method relative to the latest merging approaches.
-
Would you consider adding merged-token maps (akin to ToMe's Fig. 4) to illustrate where tokens are retained or merged in simple vs. complex scenes?
Limitations
Yes.
Final Justification
I appreciate the authors' response and the additional experiments, which effectively address my concern regarding comparisons with more recent baselines. I acknowledge that DyMU's performance is competitive relative to these newly added baselines, and as a dynamic approach, it also holds potential advantages in terms of efficiency. Taking into account the authors' responses to the other reviewers as well, I find the current experiments sufficiently solid, and consider the proposed method to be meaningful and of practical value. I will raise my score to absolute acceptance, and I hope the authors incorporate their additional clarifications and discussions with the other reviewers into the final version of the paper.
Formatting Concerns
No formatting concerns.
We thank Reviewer 1uHD for the detailed and constructive comments! We appreciate the reviewer’s recognition of our strengths in well-motivated design, comprehensive evaluations, novel VTU algorithm, and more controllable tradeoff in efficiency and accuracy. We hope the following response addresses the remaining concerns and fosters meaningful discussion.
W1,Q1: Comparison to additional merging/pruning methods
We thank the reviewer for pointing out additional baselines to compare with. Below we add the comparison with DiffRate[1] and PiToMe[2], together with VisionZip[3] which is suggested by reviewer iv6c. We show that DyMU achieves competitive overall performance while being both training-free and dynamic in length.
| Method | #Token | Dynamic? | Training-Free? | GQA | MMB | MME (prcp,all) | POPE | SQA-I | SEED-I | VQA-T | MMVet | LLaVA-W | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DiffRate[1] | ~57 | ✓ | ✗ | 57.9 | - | 1341, - | - | 66.4 | - | 30.6 | - | - | - |
| PiToMe[2] | ~57 | ✗ | ✓ | 59.9 | - | 1448, - | - | 69.0 | - | 43.0 | - | - | - |
| VisionZip[3] | 64 | ✗ | ✓ | 55.1 | 60.1 | -, 1690 | 77.0 | 69.0 | - | 55.5 | 31.7 | 62.9 | - |
| VisionZip[3] | 128 | ✗ | ✓ | 57.6 | 62.0 | -, 1762 | 83.2 | 68.9 | - | 56.8 | 32.6 | 64.8 | - |
| VisionZip[3] | 192 | ✗ | ✓ | 59.3 | 63.0 | -, 1783 | 85.3 | 68.9 | - | 57.3 | 31.7 | 67.7 | - |
| DyMU-low | 89±27 | ✓ | ✓ | 60.8 | 62.1 | 1438, 1787 | 86.3 | 69.3 | 65.0 | 53.1 | 30.0 | 62.9 | 61.5 |
| DyMU-mid | 195±47 | ✓ | ✓ | 61.7 | 62.8 | 1483, 1862 | 86.6 | 69.2 | 65.9 | 55.1 | 30.9 | 65.1 | 62.6 |
| DyMU-high | 394±57 | ✓ | ✓ | 61.9 | 64.3 | 1498, 1846 | 86.8 | 69.9 | 66.1 | 58.0 | 31.5 | 64.5 | 63.2 |
Reference:
- [1] Chen, Mengzhao, et al. "Diffrate: Differentiable compression rate for efficient vision transformers."
- [2] Tran, Chau, et al. "Accelerating transformers with spectrum-preserving token merging."
- [3] Yang, Senqiao, et al. "Visionzip: Longer is better but not necessary in vision language models."
W2,Q2: Visualization of the merged token regions
This is a great suggestion! Specifically, such a visualization can help further validate the dynamic-length merging by showing that less complex areas are more aggressively merged. Since OpenReview does not allow figures or external links in the rebuttal, we are unable to include it at this stage. We will add this visualization in our next revision.
I appreciate the authors' response and the additional experiments, which effectively address my concern regarding comparisons with more recent baselines. I acknowledge that DyMU's performance is competitive relative to these newly added baselines, and as a dynamic approach, it also holds potential advantages in terms of efficiency. Taking into account the authors' responses to the other reviewers as well, I find the current experiments sufficiently solid, and consider the proposed method to be meaningful and of practical value. I will raise my score to absolute acceptance, and I hope the authors incorporate their additional clarifications and discussions with the other reviewers into the final version of the paper.
We thank reviewer 1uHD once again for the insightful feedback and constructive discussion, which have made this work a stronger submission. We will incorporate the additional results and visualizations as discussed during the rebuttal.
This paper proposes a training-free token reduction method for MLLMs that adaptively adjusts the number of vision tokens based on the complexity of the image. The proposed DYMU framework comprises two modules, Dynamic Token Merging and Virtual Token Unmerging. The dynamic token merging module progressively reduces the number of vision tokens by merging similar ones, based on image complexity, across the different layers of the vision encoder. In the LLM, the Virtual Token Unmerging module approximates the full vision-sequence operations using the merged tokens. Experiments on several general MLLM benchmarks show the efficiency gains brought by the proposed framework.
Strengths and Weaknesses
Strength: The idea of adaptively reducing the number of tokens according to image complexity is important.
Weakness:
- DYMU emphasizes adaptive per-image token reduction. When comparing with other token reduction methods, the proper experimental setup would be to measure the actual (average) inference time on the different benchmarks and to compare against other token reduction methods configured to a similar inference time.
- Many more recent token reduction methods are missing, e.g. VisionZip (CVPR2025) and PyramidDrop (CVPR2025).
- DYMU merges tokens at every vision-encoder layer, but the paper lacks any quantitative analysis of the computational benefits this provides over a single post-encoder merge. It remains questionable whether early-layer merging—when token similarities may be noisier—yields net gains.
- Token merging thresholds rely on key-vector similarity as a proxy for image complexity, yet no rigorous validation is provided that this reliably reflects the semantic detail needed by the downstream LLM. For instance, a dense document image (all text) may exhibit high key similarity and thus undergo over-merging, potentially losing critical OCR information.
- Although broad VQA and grounding benchmarks are included, the paper omits fine-grained tasks (e.g., DocVQA, InfoVQA, MME-RealWorld) where small details matter most. Without these, the claim of “preserving core information” under aggressive compression is unsubstantiated.
- Besides FLOPs measurements, the inference time should also be included in all experiments.
Questions
- Is the proposed VTU module compatible with FlashAttention?
- Do you think designing a metric that measures how well the original visual information can be reconstructed from the merged tokens (or from directly down-sampled ones) could represent image complexity better than the similarity-based one?
Limitations
Yes
Final Justification
I appreciate the authors' detailed and thoughtful rebuttal. However, my main concerns remain:
Inference Time vs. FLOPs: While I understand the challenges in aligning FLOP reductions with end-to-end latency, actual inference time is the most critical metric for real-world deployment. The current method shows limited or no gains in end-to-end latency, especially when VTU is enabled, which undermines the practical efficiency benefits claimed by the paper.
Evaluation on Fine-Grained / Information-Dense Tasks: The lack of evaluation on tasks such as DocVQA, InfoVQA, or MME-RealWorld limits confidence in the method's robustness for semantically dense images.
Therefore, I will keep my original rating.
Formatting Concerns
NA
We thank Reviewer iv6c for the detailed and constructive comments! We appreciate the reviewer’s recognition of our novelty in complexity-conditional token reduction. We hope the following response addresses the remaining concerns and fosters meaningful discussion.
W1, W6: Inference Time Comparison
W1.1: Additional Analysis on Wall-clock Inference Times
We thank the reviewer for the suggestion! We provide additional analysis on wall-clock inference times, summarizing our findings below. We continue to see speedups in inference time in most settings; however, we find that the inference time reduction tracks token count and FLOPs less consistently, particularly when measuring end-to-end duration. This discrepancy has also been observed in prior works, such as Table 5 in [1] and Table 7 in [2]. We provide a detailed discussion in the next section on the underlying reasons and practical challenges.
| Method | Avg N_un/N | Attn MFLOPs | Attn Inference Time (ms) | End-to-End Inference Duration on MME (s) |
|---|---|---|---|---|
| Full | 576/576 | 1359.0 | 9.17 | 131 |
| DyMU-low | ||||
| DyMU-low w/o VTU | 89/576 | 32.4 | 1.26 | 115 |
| DyMU-low w/ VTU | 89/576 | 64.9 | 7.20 | 131 |
| DyMU-mid | ||||
| DyMU-mid w/o VTU | 195/576 | 155.8 | 1.29 | 121 |
| DyMU-mid w/ VTU | 195/576 | 311.5 | 7.49 | 123 |
| DyMU-high | ||||
| DyMU-high w/o VTU | 394/576 | 635.9 | 2.92 | 132 |
| DyMU-high w/ VTU | 394/576 | 1272.0 | 7.60 | 132 |
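For reference, the attention MFLOPs above follow the usual quadratic dependence of self-attention cost on sequence length; the back-of-the-envelope check below (our own arithmetic, not an additional measurement) reproduces the reported numbers:

```latex
\text{Attn FLOPs}(N_{\mathrm{un}}) \;\approx\; \left(\tfrac{N_{\mathrm{un}}}{N}\right)^{2} \cdot \text{Attn FLOPs}(N),
\qquad \text{e.g.}\quad \left(\tfrac{89}{576}\right)^{2} \times 1359.0 \approx 32.4 \ \text{MFLOPs}.
```

With VTU enabled, the decomposed computation costs roughly twice the merged-length attention (e.g., 64.9 ≈ 2 × 32.4 MFLOPs), which remains well below the full 1359.0 MFLOPs in the low and mid settings.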
W1.2: Deep Dive into the Gap between FLOPs and Inference Time
In this section, we elaborate on our findings about why there is a non-trivial gap between FLOP reduction and actual inference time gains, especially for algorithms like DyMU that fundamentally alter the computation of attention. We also explain why we believe FLOPs is a more robust metric for comparing theoretical/algorithmic efficiency, as is also standard practice in prior works [3][4][5][6].
-
Controllability: By definition, FLOPs quantify algorithmic compute independent of hardware, batching, parallelism, and system load. Wall-clock inference time can vary significantly with deployment configuration and with parallel jobs running on the same node, which are practically difficult to control on shared academic servers.
-
Implementation Gap: In practice, wall-clock inference time is also heavily dependent on the actual implementations and low-level optimizations of the computation functions. This may lead to significant discrepancies between FLOPs and inference time.
-
Here we show a concrete example, where we compare two implementations of computing an N×N attention matrix from two (N, D) matrices. In Version 1, we simply do one matrix multiplication using `torch.matmul` to get the N×N matrix. In Version 2, we first split each input into 4 row-chunks of size N/4, do matrix multiplication on each pair of chunks to obtain the (N/4, N/4) blocks, and assemble them back into the full matrix. As shown below, this results in identical FLOPs but very different wall-clock times. This is because `torch.matmul` is heavily optimized for large matrix multiplications, so multiple `torch.matmul` calls on smaller matrices lead to longer inference times despite having the same total FLOPs. (A runnable sketch of this comparison is included after this list.)

| | # `torch.matmul` calls | Matrix size per multiplication | MFLOPs | Wall-Clock Time (ms) |
|---|---|---|---|---|
| Version 1 | 1 | 576 | 339.74 | 1.374 |
| Version 2 | 16 | 144 | 339.74 | 2.311 |

-
In our case, the VTU operation requires decomposing the matrix multiplication into smaller sub-matrices, which inflates the inference time despite the better theoretical efficiency in terms of FLOPs. (See the “Implementation Notes” section in the supplementary code for more details.)
-
This demonstrates that fully translating theoretical FLOPs improvements into actual inference time improvements may require non-trivial engineering, such as implementing new kernels, which can itself constitute a separate paper, as with FlashAttention [7].
-
Therefore, in this work, we focus on formally deriving the theoretical efficiency gains and presenting empirical results on FLOPs, token length and end-task performance. But we do acknowledge that optimizing wall-clock inference time of DyMU through dedicated low-level kernel development is an important follow-up work.
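To make the example above concrete, below is a minimal timing sketch of the two implementations (our own illustrative harness, not the code used for the reported numbers; D = 512 is assumed so that 2·N²·D matches the reported 339.74 MFLOPs, and absolute timings will vary with hardware and PyTorch version):

```python
import time
import torch

N, D = 576, 512  # D = 512 assumed: 2 * N^2 * D ≈ 339.74 MFLOPs
device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(N, D, device=device)
k = torch.randn(N, D, device=device)

def bench(fn, iters=100):
    # Warm up, then time with device synchronization for a fair wall-clock measurement.
    for _ in range(10):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per call

# Version 1: a single large matmul producing the full N x N score matrix.
def v1():
    return q @ k.T

# Version 2: identical FLOPs, but each input is split into 4 row-chunks of size N/4;
# the 16 (N/4, N/4) blocks are computed separately and assembled back into N x N.
def v2():
    qs, ks = q.chunk(4, dim=0), k.chunk(4, dim=0)
    rows = [torch.cat([qi @ kj.T for kj in ks], dim=1) for qi in qs]
    return torch.cat(rows, dim=0)

assert torch.allclose(v1(), v2(), atol=1e-4)  # same result, same total FLOPs
print(f"Version 1: {bench(v1):.3f} ms   Version 2: {bench(v2):.3f} ms")
```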
Reference:
- [1] Xing, Long, et al. "Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction."
- [2] Wen, Zichen, et al. "Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?."
- [3] Bolya, Daniel, et al. "Token merging: Your vit but faster."
- [4] Chen, Mengzhao, et al. "Diffrate: Differentiable compression rate for efficient vision transformers."
- [5] Chen, Liang, et al. "An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models."
- [6] Shang, Yuzhang, et al. "Llava-prumerge: Adaptive token reduction for efficient large multimodal models."
- [7] Dao, Tri, et al. "Flashattention: Fast and memory-efficient exact attention with io-awareness."
W2: Additional Comparison Baselines
We thank the reviewer for pointing out additional baselines for comparison!
- A kind reminder that we have already included PyramidDrop in Table 2, Row 11, labeled as “PDrop.”
- In the following table, we show additional comparison for the following baselines: VisionZip[1], DiffRate[2], and PiToMe[3]. We show that DyMU achieves competitive overall performance while being both training-free and dynamic in length.
| Method | #Token | Dynamic? | Training-Free? | GQA | MMB | MME (prcp,all) | POPE | SQA-I | SEED-I | VQA-T | MMVet | LLaVA-W | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VisionZip[1] | 64 | ✗ | ✓ | 55.1 | 60.1 | -,1690 | 77.0 | 69.0 | - | 55.5 | 31.7 | 62.9 | - |
| VisionZip[1] | 128 | ✗ | ✓ | 57.6 | 62.0 | -,1762 | 83.2 | 68.9 | - | 56.8 | 32.6 | 64.8 | - |
| VisionZip[1] | 192 | ✗ | ✓ | 59.3 | 63.0 | -,1783 | 85.3 | 68.9 | - | 57.3 | 31.7 | 67.7 | - |
| DiffRate[2] | ~57 | ✓ | ✗ | 57.9 | - | 1341,- | - | 66.4 | - | 30.6 | - | - | - |
| PiToMe[3] | ~57 | ✗ | ✓ | 59.9 | - | 1448,- | - | 69.0 | - | 43.0 | - | - | - |
| DyMU-low | 89±27 | ✓ | ✓ | 60.8 | 62.1 | 1438,1787 | 86.3 | 69.3 | 65.0 | 53.1 | 30.0 | 62.9 | 61.5 |
| DyMU-mid | 195±47 | ✓ | ✓ | 61.7 | 62.8 | 1483,1862 | 86.6 | 69.2 | 65.9 | 55.1 | 30.9 | 65.1 | 62.6 |
| DyMU-high | 394±57 | ✓ | ✓ | 61.9 | 64.3 | 1498,1846 | 86.8 | 69.9 | 66.1 | 58.0 | 31.5 | 64.5 | 63.2 |
Reference:
- [1] Yang, Senqiao, et al. "Visionzip: Longer is better but not necessary in vision language models."
- [2] Chen, Mengzhao, et al. "Diffrate: Differentiable compression rate for efficient vision transformers."
- [3] Tran, Chau, et al. "Accelerating transformers with spectrum-preserving token merging."
W3: Merging Per-Layer vs Merging Last-Layer
Compared to post-encoder merging, per-layer merging offers two key benefits:
- (1) Additional efficiency gains within the ViT encoder (as demonstrated in the ToMe paper);
- (2) Progressively merging a small number of tokens per layer preserves visual concepts better than merging a large number of tokens at the final layer.
To further validate (2), we provide an additional ablation study as follows, in which we introduce a new DyMU variant that performs merging in the last three layers. Note: Due to the nature of bipartite matching, each layer can merge at most half of all tokens. Therefore, merging only in the last layer is insufficient to meet the target number of remaining tokens. Instead, we merge across the last three layers to reach the same average target token count, i.e., 94 for DyMU-low. This setting can be considered a “less extreme” version of post-encoder merging. Even under this configuration, as shown in the table below, we observe a notable performance drop compared to per-layer merging, particularly on the hallucination task (POPE).
| Method | GQA | MMB | MME (prcp,all) | POPE | SQA-I | SEED-I | VQA-T | MMVet | LLaVA-W | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| DyMU-low w/ Per-Layer Merge | 60.8 | 62.1 | 1438,1787 | 86.3 | 69.3 | 65.0 | 53.1 | 30.0 | 62.9 | 61.5 |
| DyMU-low w/ Last-Layer Merge | 58.3 | 63.2 | 1421,1755 | 81.1 | 69.4 | 62.9 | 50.3 | 25.1 | 58.2 | 59.0 (↓2.5) |
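To make the per-layer merging step concrete, here is a simplified sketch of a single threshold-based bipartite merge (illustrative only: the function name, the exact matching rule, and the size-weighted averaging are simplifications; the supplementary code remains the reference):

```python
import torch
import torch.nn.functional as F

def per_layer_threshold_merge(x, keys, sizes, threshold):
    """Simplified single-layer merge step (sketch only).

    x:         (N, D) token features at the current ViT layer
    keys:      (N, Dk) attention keys used as the similarity measure
    sizes:     (N,) number of original patches already merged into each token
    threshold: per-layer similarity threshold found offline
    """
    # Bipartite split into alternating groups A and B, so at most half of the
    # tokens can be merged in a single layer.
    a_idx = torch.arange(0, x.shape[0], 2)
    b_idx = torch.arange(1, x.shape[0], 2)
    sim = F.normalize(keys[a_idx], dim=-1) @ F.normalize(keys[b_idx], dim=-1).T

    best_sim, best_b = sim.max(dim=-1)   # most similar B partner for each A token
    merge = best_sim > threshold         # merge only pairs above the layer threshold

    x, sizes = x.clone(), sizes.clone()
    for i, a in enumerate(a_idx.tolist()):
        if not merge[i]:
            continue
        b = b_idx[best_b[i]].item()
        # Size-weighted average so a token merged repeatedly across layers stays
        # the mean of all original patches it covers.
        total = sizes[a] + sizes[b]
        x[b] = (x[a] * sizes[a] + x[b] * sizes[b]) / total
        sizes[b] = total

    keep = torch.ones(x.shape[0], dtype=torch.bool)
    keep[a_idx[merge]] = False           # drop the A tokens that were merged away
    return x[keep], sizes[keep]
```

The number of surviving tokens then varies per image: visually redundant images cross the threshold for many pairs and shrink quickly, while complex images retain more tokens, which is the behavior exercised in the ablation above.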
W4,W5: Limitations on Semantically-Dense Images
We are fully aware of this limitation and have acknowledged it in Section 5, where we refer to such cases as token-sensitive tasks, exemplified by TextVQA. In fact, as shown in Table 2, we find this to be a common limitation across all training-free methods. We also agree that incorporating additional benchmarks, such as DocVQA, could further characterize this limitation, and future work is needed to explore complexity-aware token merging for semantically dense images, such as multimodal documents.
Nevertheless, we would like to highlight a unique advantage of DyMU not present in previous work: its compatibility with external tools. As illustrated in Figure 6, DyMU can be used in conjunction with OCR tools to address text-dense tasks, while still introducing efficiency gains through dynamic-length token merging. Incorporating agentic tool use into the token merging framework may offer a promising direction for handling the diverse merging needs across different tasks.
Q1: Compatibility with FlashAttention
We currently provide two implementations of VTU: V1 uses the exact matrix decomposition, as derived in Section 3.2, which is more efficient in terms of FLOPs but empirically slower in wall-clock time; V2 uses a direct matrix expansion approximation, which has higher FLOPs but achieves faster wall-clock performance. Among the two, V2 is compatible with FlashAttention. More details can be found in the “Implementation Notes” section of the supplementary code.
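For intuition, below is a simplified sketch of the expand-attend-re-merge idea behind V2 (pseudocode-level: the names, the single-head layout, and the omission of causal masking and text tokens are simplifications of ours; the supplementary code remains the reference):

```python
import torch
import torch.nn.functional as F

def vtu_v2_attention(x_merged, assign, rope_cos, rope_sin, wq, wk, wv):
    """Sketch of the direct-expansion VTU variant (V2), single head for clarity.

    x_merged: (M, D)    merged visual tokens from the encoder
    assign:   (N,) long index of the merged token covering each original position
    rope_cos, rope_sin: (N, Dh) RoPE tables at the original (unmerged) positions
    wq, wk, wv: (D, Dh) projection weights of the attention layer
    """
    # 1. Virtually unmerge: copy each merged token to every position it covers.
    x_full = x_merged[assign]                                   # (N, D)

    def rope(t):
        # Interleaved RoPE rotation applied at the original positions.
        t1, t2 = t[..., 0::2], t[..., 1::2]
        rot = torch.stack((-t2, t1), dim=-1).flatten(-2)
        return t * rope_cos + rot * rope_sin

    # 2. Standard RoPE attention on the expanded sequence; because this is a stock
    #    computation, fused kernels such as FlashAttention apply unchanged.
    q, k = rope(x_full @ wq), rope(x_full @ wk)
    v = x_full @ wv
    out_full = F.scaled_dot_product_attention(
        q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0)
    ).squeeze(0)                                                # (N, Dh)

    # 3. Re-merge: average the outputs over all positions covered by the same
    #    merged token (simple mean here; Eq. 10 defines the actual re-merging).
    M, Dh = x_merged.shape[0], out_full.shape[-1]
    out = torch.zeros(M, Dh, dtype=out_full.dtype, device=out_full.device)
    counts = torch.zeros(M, dtype=out_full.dtype, device=out_full.device)
    out.index_add_(0, assign, out_full)
    counts.index_add_(0, assign, torch.ones(assign.shape[0], dtype=out_full.dtype, device=out_full.device))
    return out / counts.unsqueeze(-1)
```

Because step 2 runs unmodified attention at the full length N, this variant trades extra FLOPs for compatibility with existing fused kernels, which is exactly the trade-off discussed under W1 above.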
Q2: Reconstruction Metric for Image Complexity
We believe this is an interesting idea and a very promising future direction! Specifically, a reconstruction-based metric captures more low-level visual details, which could be particularly useful for identifying complexity in semantically dense images, as discussed above. Moreover, we believe that combining similarity-based metrics (which focus on high-level semantics) with reconstruction-based metrics (which emphasize low-level details) could result in a more general and effective metric for diverse tasks.
I appreciate the authors' detailed and thoughtful rebuttal. However, my main concerns remain:
Inference Time vs. FLOPs: While I understand the challenges in aligning FLOP reductions with end-to-end latency, actual inference time is the most critical metric for real-world deployment. The current method shows limited or no gains in end-to-end latency, especially when VTU is enabled, which undermines the practical efficiency benefits claimed by the paper.
Evaluation on Fine-Grained / Information-Dense Tasks: The lack of evaluation on tasks such as DocVQA, InfoVQA, or MME-RealWorld limits confidence in the method's robustness for semantically dense images.
We thank Reviewer iv6c for the thoughtful and constructive feedback!
-
We understand the concern regarding the gap between efficiency gains in FLOPs and improvements in inference time. We acknowledge that addressing this gap is an important next step to further enhance the practical applicability of the proposed algorithm. We have been very careful not to overclaim improvements in wall-clock time, as we explicitly discussed this limitation in Lines 226–229, and we believe it should not entirely undermine the novelty of our contribution, as the method demonstrates strong theoretical efficiency benefits while achieving competitive performance with dynamic token lengths.
-
For the evaluation tasks, we adopted the widely used benchmarks following prior work to ensure a comprehensive comparison. However, we agree with the reviewer that incorporating more information-dense tasks could provide additional insights into model robustness, as also discussed in Section 5. We believe this is a fundamental challenge for most existing token merging/reduction methods. Due to the limited rebuttal time, we were unable to include all requested analyses but will incorporate them in the next revision.
We appreciate your engagement in the discussion and welcome any further comments or concerns you may have. Thank you!
In this paper, the author presents DYMU, an efficient, training-free framework that can dynamically reduce the computation of VLMs while maintaining performance, via two novel modules, DToMe and VTU. A thorough evaluation has validated the framework's effectiveness.
Strengths and Weaknesses
Strengths:
1. The author logically described the current problems in this field and the purpose of the proposed method.
2. The author provided a detailed explanation of the functions and contributions of each module.
3. Sufficient experiments have fully verified the effectiveness of this method.
Weakness: I don't completely understand what the red and blue squares in Figure 1 represent. If they represent the same characteristic with different values, it is recommended that the author not use only two colors but instead use one/multiple colors. If they represent different types of variables, it is hoped that corresponding explanations can be added beside the figure.
Questions
I don't completely understand what the red and blue squares in Figure 1 represent. If they represent the same characteristic with different values, it is recommended that the author not use only two colors but instead use one/multiple colors. If they represent different types of variables, it is hoped that corresponding explanations can be added beside the figure.
Limitations
I don't completely understand what the red and blue squares in Figure 1 represent. If they represent the same characteristic with different values, it is recommended that the author not use only two colors but instead use one/multiple colors. If they represent different types of variables, it is hoped that corresponding explanations can be added beside the figure.
Formatting Concerns
None
We thank Reviewer 2zYg for the constructive and encouraging feedback! We appreciate the reviewer’s recognition of our detailed explanations and comprehensive experiments on the proposed methods. We hope the following response addresses the remaining concerns and questions.
W1: Clarification on the token illustration in the method figure
We thank the reviewer for the suggestion. We believe the reviewer is referring to the red and blue squares in the method overview figure (Figure 2), which indicate the “visual tokens” in the transformer layers. These tokens indeed represent the same characteristics but with different values. We use two colors to better illustrate the bipartite matching algorithm, which divides the tokens into two groups for matching. The red and blue tokens indicate tokens in different groups which will potentially be merged together. We will add additional explanation in the figure caption to clarify this in the next revision.
Dear Reviewer 2zYg,
This is a gentle reminder to participate in the author discussion for your assigned paper(s). Engaging with authors is required before submitting the Mandatory Acknowledgement.
The discussion deadline is August 8, 11:59 PM AoE. Please ensure you post at least one response in the discussion thread.
Let me know if you encounter any issues.
Best, Area Chair, NeurIPS 2025
The paper introduces DYMU (Dynamic Merging and Virtual Unmerging), a novel framework to enhance vision-language models (VLMs) by adaptively reducing visual tokens based on image complexity. It addresses inefficiencies in fixed-length token outputs of existing VLMs. Key components are:
1. Dynamic Token Merging (DToMe): selectively merges similar tokens depending on image complexity (not resolution), reducing tokens by 32%-85% for simple images while preserving semantic details for complex ones.
2. Virtual Token Unmerging (VTU): reconstructs full-sequence representations without retraining, maintaining performance and optimizing attention efficiency.
Strengths and Weaknesses
Strengths:
The paper provides comprehensive quantitative and qualitative experiments across multiple VLM architectures (LLaVA-1.5, LLaVA-OneVision), visual encoders (CLIP, SigLIP), and evaluation benchmarks (GQA, MMBench, MME, etc.), enhancing the credibility of the proposed method.
The proposed DToMe and DYMU are plug-and-play, requiring no retraining of the original model, making them suitable for resource-constrained scenarios or when training data is unavailable.
DToMe extends the previous ToMe by dynamically adjusting the token count via threshold control (instead of pre-selecting a token budget), effectively balancing computational efficiency and performance.
VTU addresses the mismatch between variable-length visual token sequences and fixed-length pre-trained LLMs through Virtual Token Unmerging, efficiently computing attention matrices without explicitly reconstructing the full sequence.
Weaknesses:
Overall computational cost: Although Table 1 details the attention computation cost, the lack of discussion on end-to-end costs makes it unclear whether DYMU outperforms other methods in efficiency. For the DYMU-high w/ VTU configuration, the MFLOPs (1272) show limited reduction versus Full Attention (1359). In low/mid settings, VTU nearly doubles MFLOPs, potentially limiting its applicability.
Architectural limitations: DToMe targets ViT-based encoders, and VTU is designed for RoPE-based LLMs. Their generalizability to other architectures (e.g., CNNs or non-RoPE models) remains unverified.
Hyperparameter search cost: DToMe requires offline threshold calculation (using large image sets), incurring non-negligible preprocessing costs. Appendix C shows that merging strategies significantly affect results, and fine-tuning layer-specific r_i for different models further increases preprocessing overhead.
VTU approximation: Equation 10 uses embedding averages to reconstruct attention outputs—an approximate approach. The paper claims minimal empirical impact but provides no ablation studies or quantitative error analysis.
Questions
No
Limitations
Yes
Final Justification
The rebuttal mostly addressed my concerns. I raised my score.
Formatting Concerns
No
We thank Reviewer TS2b for the detailed and constructive comments! We appreciate the reviewer’s recognition of the strengths of our comprehensive experiments and the novelty of both DToMe and VTU in enabling training-free dynamic-length token merging. We hope the following response addresses the remaining concerns and fosters meaningful discussion.
W1: Overall Computational Cost
W1.1: Additional Analysis on Wall-clock Inference Times
We thank the reviewer for the suggestion! We provide additional analysis on wall-clock inference times, summarizing our findings below. We continue to see speedups in inference time in most settings; however, we find that the inference time reduction is less consistently correlated with token count and FLOPs, particularly when measuring end-to-end duration. This discrepancy has also been observed in prior works, such as Table 5 in [1] and Table 7 in [2]. We provide a detailed discussion in the next section on the underlying reasons and practical challenges.
| Method | Avg N_un / N | Attn MFLOPs | Attn Inference Time (ms) | End-to-End Inference Duration on MME (s) |
|---|---|---|---|---|
| Full | 576/576 | 1359.0 | 9.17 | 131 |
| DyMU-low | ||||
| DyMU-low w/o VTU | 89/576 | 32.4 | 1.26 | 115 |
| DyMU-low w/ VTU | 89/576 | 64.9 | 7.20 | 131 |
| DyMU-mid | ||||
| DyMU-mid w/o VTU | 195/576 | 155.8 | 1.29 | 121 |
| DyMU-mid w/ VTU | 195/576 | 311.5 | 7.49 | 123 |
| DyMU-high | ||||
| DyMU-high w/o VTU | 394/576 | 635.9 | 2.92 | 132 |
| DyMU-high w/ VTU | 394/576 | 1272.0 | 7.60 | 132 |
W1.2: Deep Dive into the Gap between FLOPs and Inference Time
In this section, we elaborate on our findings about why there is a non-trivial gap between FLOP reduction and actual inference time gains, especially for algorithms like DyMU that fundamentally alter the computation of attention. We also explain why we believe FLOPs is a more robust metric for comparing theoretical/algorithmic efficiency, as is also standard practice in prior works [3][4][5][6].
-
Controllability: By definition, FLOPs quantify algorithmic compute independent of hardware, batching, parallelism, and system load. Wall-clock inference time can vary significantly with deployment configuration and with parallel jobs running on the same node, which are practically difficult to control on shared academic servers.
-
Implementation Gap: In practice, wall-clock inference time is also heavily dependent on the actual implementations and low-level optimizations of the computation functions. This may lead to significant discrepancies between FLOPs and inference time.
-
Here we show a concrete example, where we compare two implementations of computing an N×N attention matrix from two (N, D) matrices. In Version 1, we simply do one matrix multiplication using `torch.matmul` to get the N×N matrix. In Version 2, we first split each input into 4 row-chunks of size N/4, do matrix multiplication on each pair of chunks to obtain the (N/4, N/4) blocks, and assemble them back into the full matrix. As shown below, this results in identical FLOPs but very different wall-clock times. This is because `torch.matmul` is heavily optimized for large matrix multiplications, so multiple `torch.matmul` calls on smaller matrices lead to longer inference times despite having the same total FLOPs.

| | # `torch.matmul` calls | Matrix size per multiplication | MFLOPs | Wall-Clock Time (ms) |
|---|---|---|---|---|
| Version 1 | 1 | 576 | 339.74 | 1.374 |
| Version 2 | 16 | 144 | 339.74 | 2.311 |

-
In our case, the VTU operation requires decomposing the matrix multiplication into smaller sub-matrices, which inflates the inference time despite the better theoretical efficiency in terms of FLOPs. (See the “Implementation Notes” section in the supplementary code for more details.)
-
This demonstrates that fully translating theoretical FLOPs improvements into actual inference time improvements may require non-trivial engineering, such as implementing new kernels, which can itself constitute a separate paper, as with FlashAttention [7].
-
Therefore, in this work, we focus on formally deriving the theoretical efficiency gains and presenting empirical results on FLOPs, token length and end-task performance. But we do acknowledge that optimizing wall-clock inference time of DyMU through dedicated low-level kernel development is an important follow-up work.
Reference:
- [1] Xing, Long, et al. "Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction."
- [2] Wen, Zichen, et al. "Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?."
- [3] Bolya, Daniel, et al. "Token merging: Your vit but faster."
- [4] Chen, Mengzhao, et al. "Diffrate: Differentiable compression rate for efficient vision transformers."
- [5] Chen, Liang, et al. "An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models."
- [6] Shang, Yuzhang, et al. "Llava-prumerge: Adaptive token reduction for efficient large multimodal models."
- [7] Dao, Tri, et al. "Flashattention: Fast and memory-efficient exact attention with io-awareness."
W1.3: How VTU can achieve better tradeoff in performance and efficiency
We elaborate on how VTU can effectively help achieve a better trade-off in both DyMU-low and DyMU-high settings.
- As shown in the table below, in DyMU-low settings (where token reduction is large), VTU yields a more significant performance improvement while introducing only a moderate increase in absolute FLOPs (although it nearly doubles them);
- On the other hand, in DyMU-high settings, removing VTU results in a much smaller performance drop, so it is safer to remove VTU under this setting to gain more efficiency.
| Method | GQA | MMB | MME (prcp,all) | POPE | SQA-I | SEED-I | VQA-T | MMVet | LLaVA-W | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| DyMU-low w/o VTU | 58.2 | 56.0 | 1346,1639 | 86.9 | 67.7 | 60.9 | 51.3 | 25.2 | 58.8 | 58.2 |
| DyMU-low w/ VTU | 60.8 | 62.1 | 1438,1787 | 86.3 | 69.3 | 65.0 | 53.1 | 30.0 | 62.9 | 61.5 (↑3.3) |
| DyMU-high w/o VTU | 60.1 | 64.7 | 1460,1798 | 86.8 | 68.7 | 65.2 | 55.8 | 30.4 | 65.2 | 62.3 |
| DyMU-high w/ VTU | 61.9 | 64.3 | 1498,1846 | 86.8 | 69.9 | 66.1 | 58.0 | 31.5 | 64.5 | 63.2 (↑0.9) |
W2: Applicability to CNN and Non-RoPE models
We agree that this would be an interesting extension: to explore whether the high-level idea can be applied to CNNs and non-RoPE models. For DToMe, since the fundamental similarity measurement relies on the key vectors of visual tokens, it is not directly applicable to CNN architectures. For VTU, it is not directly applicable to non-RoPE layers due to the use of absolute positional embeddings. Nevertheless, within the claims of this paper, we emphasize that the proposed method remains broadly applicable to most mainstream VLMs, where ViT-based vision encoders and RoPE-based LLM backbones are widely used.
W3: Hyperparameter Search Cost is Minimal, No Finetuning Needed
We would like to clarify that:
- The offline threshold calculation is inference-only; no fine-tuning is performed to find the layer-specific r_i (Line 122).
- In practice, threshold finding on a CLIP ViT-L encoder using 250K images can be completed within 1 hour on 2 H100 GPUs.
- There is no strict restriction on the size or type of dataset. Any sufficiently diverse dataset is acceptable, and only images (without annotations) are required.
W4: VTU Approximation in Equation 10
The reviewer raised an interesting discussion on the approximation of the “re-merging” part in the VTU algorithm. To clarify, this approximation is required to obtain the expected efficiency derived in Section 3.2, so no direct alternatives can be compared under the same computational complexity. We show empirically in Figure 5 and Table 6 that this VTU operation successfully leads to performance gains. However, we agree that exploring more sophisticated re-merging methods, such as similarity-based weighted averaging, would be an interesting direction for future work, albeit at the cost of some additional computation.
Dear Reviewer TS2b,
This is a gentle reminder to participate in the author discussion for your assigned paper(s). Engaging with authors is required before submitting the Mandatory Acknowledgement.
The discussion deadline is August 8, 11:59 PM AoE. Please ensure you post at least one response in the discussion thread.
Let me know if you encounter any issues.
Best, Area Chair, NeurIPS 2025
I thank the detailed response to my concerns, especially the end-to-end inference time comparison and its associated analysis between the flops and inference time. It would be better to include these comparisons in the final version. I will raise my score.
We thank Reviewer TS2b for the thoughtful feedback and constructive input! We will incorporate the analyses and additional comparisons in the final version. We appreciate your engagement throughout the review and discussion period.
Dear Authors and Reviewers,
Thank you to the authors for the detailed rebuttal.
Reviewers, please read the responses carefully and post your reply as soon as possible to allow for meaningful discussion. Ideally, all reviewers should respond so the authors know their feedback has been considered.
Best regards, AC
Dear AC and Reviewers,
Thank you for facilitating the discussion and for your time and effort throughout the review process!
We greatly appreciate the reviewers’ engagement and look forward to any further feedback and discussions.
Best regards, DyMU Authors
This paper introduces DyMU, which combines dynamic token merging with virtual unmerging to improve the efficiency of vision–language models in a training-free and plug-and-play manner. The method adaptively adjusts token counts based on image complexity while preserving compatibility with RoPE, achieving substantial token reductions with minimal accuracy loss across multiple benchmarks. Reviewers appreciated the novelty, clarity, and broad validation, though they raised concerns about missing end-to-end latency results, the limited task scope (e.g., lack of document-heavy data), architecture dependence, and the absence of certain recent baselines and visualizations. In rebuttal, the authors added stronger comparisons (DiffRate, PiToMe, VisionZip), clarified FLOPs versus runtime discrepancies, and acknowledged limitations in text-heavy scenarios, which addressed most concerns. Overall, the contribution is clear, original, and practically useful, with strengths outweighing the remaining limitations, so I recommend acceptance.