PaperHub
7.0 / 10
Poster · 3 reviewers
Ratings: 4, 4, 3 (min 3, max 4, std 0.5)
ICML 2025

ToMA: Token Merge with Attention for Diffusion Models

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose an improved token merging algorithm to speed up diffusion for image generation; it is the only working method on DiT.

Abstract

Diffusion models excel in high-fidelity image generation but face scalability limits due to transformers’ quadratic attention complexity. Plug-and-play token reduction methods like ToMeSD and ToFu reduce FLOPs by merging redundant tokens in generated images but rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that negate theoretical speedups when paired with optimized attention implementations (e.g., FlashAttention). To bridge this gap, we propose **To**ken **M**erge with **A**ttention (ToMA), an off-the-shelf method that redesigns token reduction for GPU-aligned efficiency, with three key contributions: 1) a reformulation of token merging as a submodular optimization problem to select diverse tokens; 2) merge/unmerge as an attention-like linear transformation via GPU-friendly matrix operations; and 3) exploiting latent locality and sequential redundancy (pattern reuse) to minimize overhead. ToMA reduces SDXL/Flux generation latency by 24%/23% (DINO $\Delta <$ 0.07), outperforming prior methods. This work bridges the gap between theoretical and practical efficiency for transformers in diffusion.
Keywords
Diffusion, Token Merge, Attention, Submodular Optimization

Reviews and Discussion

Review
Rating: 4

The paper investigates token reduction via submodular optimization.

Questions for Authors

  1. The authors claim to implement the naive greedy submodular maximization algorithm for the facility location problem in a GPU-applicable way. Since the algorithm in itself is iterative, it cannot be parallelized. However, did the authors mean to parallelize the similarity matrix computation? In addition, did the authors parallelize the search for the item with the maximal marginal gain?

  2. What are the values of $N$ and $d$ throughout the experiments? This aims to clarify why 75% does not lead to a higher reduction in inference time.

  3. The authors hint that, due to diversity in the sampled set (a given byproduct of the submodular optimization process), the resulting $\tilde{A}$ is somewhat orthogonal. Is this claim true? If so, why would it be orthogonal?

  4. A follow-up question on the previous: usually with submodular optimization, we would have a set that contains up to $D$ items. Is this the case here as well? This would explain whether the set is forced to have exactly $D$ items to ensure the token reduction ratio -- in such a case, the items in the set may not be independent (from a linear point of view), which would then lead to numerical issues with the pseudo-inverse.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

The proposed methods are valid; however, a 75% token reduction leads only to approximately 1.4× faster inference than the baseline. Please check the questions section.

Theoretical Claims

The paper relies on the known plain greedy algorithm for maximizing a submodular function. However, I find some information concerning the unmerging part lacking; see the questions section.

Experimental Design and Analysis

The experimental designs and corresponding analyses are sound; however, as mentioned earlier, when reducing a large number of tokens, the model does not benefit much in terms of decreased inference time; see the questions section.

Supplementary Material

I went over the entire appendices.

Relation to Prior Work

The idea of token reduction via submodular optimization is quite innovative and fitting. Such methods could complement standard techniques in deep learning, becoming indispensable tools that enable faster performance with a theoretical backbone (one that works across a vast range of models rather than being tailored to a specific set of diffusion models).

Essential References Not Discussed

The paper references the right papers that are essential to understand the contribution of the paper.

Other Strengths and Weaknesses

The strengths of the paper are:

  1. Formulating the token reduction as a submodular function maximization problem is fitting.
  2. Utilizing the fact that submodular optimization aims to ensure a diverse subset of tokens for the merging and unmerging steps is quite fascinating; however, I do have some questions on this matter.
  3. The architecture changes made to realize the full potential of the token reduction phase, merging, and unmerging are necessary, innovative, and efficient from a practical point of view.

The weaknesses, however, are:

  1. The submodular phase takes $O(N^2 d)$ time (as mentioned in Section C of the appendix). This might explain why the method, while faster than the competitors, is not much faster than the baseline model; see the questions section.
  2. The paper needs polishing; see the comments section.

Other Comments or Suggestions

In what follows, some typos will be listed:

  1. In the appendix, please replace line 1071 "Tab ??" with Table 4.
  2. In the appendix, please replace line 1076 "Tab ??" with Table 5.
  3. Remove line 452.
  4. Line 370, second column, I think the authors meant to mention Table 1.
  5. Table 2 was never referenced in the paper.
Author Response

Dear Reviewer,

Thank you for your review. We appreciate your feedback and your pointing out several typos that need polishing. Due to ICML 2025 regulations, we are currently unable to modify the submitted PDF. However, we will certainly address and correct these issues in the updated version.

Regarding your other concerns about efficiency and submodularity, please find our detailed responses below.

Response:

Parallelization of Submodular Optimization: Thank you for pointing this out. You're correct—the iterative nature of submodular optimization is inherently unavoidable, as each selection step depends explicitly on the previously selected items. However, to maximize parallelism, we parallelize the computation by performing token selection independently across multiple local regions, each selecting the tokens with maximal marginal gains. Since the submodular optimization in each local region is independent, this approach significantly reduces iteration time, as the optimization is now executed over smaller subsets with fewer tokens to select at each step.
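For concreteness, here is a minimal sketch (not the authors' implementation; the function name, tensor shapes, and cosine normalization are illustrative assumptions) of how per-region greedy facility-location selection can be batched so that only the loop over selection steps remains sequential:

```python
import torch

def greedy_facility_location(tokens, num_select):
    """Batched greedy facility-location selection over R local regions.

    tokens: (R, n, d) -- R regions of n tokens each.
    Returns (R, num_select) indices. Only the loop over the num_select
    greedy steps is sequential; all regions, and all candidate gains
    within a region, are evaluated in parallel.
    """
    x = torch.nn.functional.normalize(tokens, dim=-1)
    sim = x @ x.transpose(-1, -2)                          # (R, n, n) cosine similarities
    best = torch.zeros(sim.shape[:2], device=sim.device)   # current coverage per source token
    selected = []
    for _ in range(num_select):
        # marginal gain of candidate j: sum_i max(sim[i, j] - best[i], 0)
        gain = (sim - best.unsqueeze(-1)).clamp(min=0).sum(dim=1)   # (R, n)
        idx = gain.argmax(dim=-1)                                   # (R,) best candidate per region
        selected.append(idx)
        # update coverage with the newly selected facility (already-selected
        # tokens have zero gain; a production version would mask them out)
        chosen = sim.gather(2, idx[:, None, None].expand(-1, sim.shape[1], 1)).squeeze(-1)
        best = torch.maximum(best, chosen)
    return torch.stack(selected, dim=-1)
```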

Values of N and d: In our formulation, N denotes the sequence length and d the channel dimension. In SDXL, as the UNet goes deeper, N decreases while d increases—for 1024×1024 image generation, typical configurations include [4096, 640] and [1024, 1280]. In FLUX, N is composed of both text (512) and image (4096) tokens, totaling 4608, with a fixed channel dimension of 3072.

At 75%, ToMA achieves a 1.4× speed-up rather than a larger one because of its own overhead. Though ToMA introduces significantly less overhead than other token merging methods—as demonstrated in Appendix Section F—its operations still incur a non-negligible runtime cost. This overhead can be further reduced through engineering efforts such as kernel fusion and custom CUDA optimization. It is also worth noting that the speed-up in FLUX is slightly lower than in SDXL, mainly due to the additional computation required for RoPE reconstruction: gathering the original RoPE positions and recomputing positional encodings for the merged token set to maintain image quality.

Orthogonality: Thank you for your question. To clarify, $\tilde{A}$ refers to the merge matrix computed from attention scores, not the token feature matrix. Since $\tilde{A}$ is a non-negative matrix, strict orthogonality would mean that each source token attends to exactly one destination token, with no overlap. In the approximate case, the overlap across different rows is small—i.e., each source token mainly attends to a unique destination token. This naturally requires the destination tokens to be diverse; otherwise, a source token may attend to two non-diverse destination tokens and produce overlap. Therefore, diversity is a necessary condition for (approximate) orthogonality in this setting. Moreover, this connects closely to the facility location objective used in our submodular selection: it ensures that every source token has at least one highly similar (i.e., representative) token in the selected subset, which means that every source token mostly attends to its representative destination token.

Submodular Optimization Related: Thank you for your question. Indeed, we select a subset of up to D tokens. However, linear dependency issues are unlikely in our setting. Specifically, our pseudo-inverse operation is applied to the merge matrix $\tilde{A}$, which is obtained by multiplying the selected token subset with the entire token set. This merge matrix is non-negative and naturally has its largest values along the diagonal (or, more precisely, every row has a distinct index of its largest value), ensuring that the selected tokens remain linearly independent and thus preventing numerical instability during the pseudo-inverse computation.
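As an illustration only (one plausible reading of the discussion above, not the paper's exact formulation; the shapes and the softmax direction are assumptions), the merge/unmerge step described here could look like:

```python
import torch

def merge_unmerge(x, dst_idx):
    """Merge N tokens into D destinations with an attention-like linear map,
    then unmerge with the Moore-Penrose pseudo-inverse of that map."""
    dst = x[dst_idx]                                              # (D, d) selected destination tokens
    a_tilde = torch.softmax(dst @ x.T / x.shape[-1] ** 0.5, -1)   # (D, N) non-negative merge matrix
    merged = a_tilde @ x                                          # (D, d) merged tokens
    unmerged = torch.linalg.pinv(a_tilde) @ merged                # (N, d) approximate reconstruction
    return merged, unmerged

# e.g. 16 tokens of dimension 8, keeping 4 destinations
merged, unmerged = merge_unmerge(torch.randn(16, 8), torch.tensor([0, 5, 10, 15]))
```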

Reviewer Comment

Thanks for the detailed answers. I, however, have a follow-up question:

  • Concerning "Parallelization of Submodular Optimization": generating the submodular coreset in parallel on different localities (I assume different subsets of tokens) would lose some cross-similarities between these points. How do you justify this? Are the sets different from each other in terms of vector similarities?
Author Comment

Thank you for raising this thoughtful question. We appreciate your engagement with our methodology. Below, we clarify the rationale and address your concerns about cross-similarities:

Local Smoothness in Image Latents: In vision latents, tokens corresponding to nearby image patches show stronger similarities than those that are farther apart—a property confirmed by our empirical analysis (Figures 3 & 9). This leads to weaker cross-region similarities, allowing us to process local regions in parallel while still preserving most of the important relationships.

Positional Embeddings: Positional embeddings (i.e., RoPE) further reinforce this locality prior by explicitly decaying similarity scores with increasing token geometric distance. This means the cross-region similarities we ignore are already heavily attenuated by the model's architecture itself. Our parallelization approach simply aligns with this built-in inductive bias.

Regularization Effect: It is worth noting that the locality constraint also provides beneficial regularization. By enforcing balanced token selection across different image regions (≤ $k$ tokens per local region), we prevent the coreset from over-concentrating on just the most globally salient features. This leads to more diverse and representative token selections in practice.

Theoretical Justification: At its core, our approach approximates the full similarity matrix by focusing on local-region similarities only, ignoring the aforementioned small cross-region terms. Formally, this is solved through submodular optimization under a partition matroid constraint, where at most $k$ tokens are selected from each region and the regions form a partition of the ground set.
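In symbols (notation here is illustrative: $P_1,\dots,P_R$ partition the token set $V$, and at most $k$ tokens are selected per region), the constrained selection problem takes the form:

```latex
\max_{S \subseteq V} \; f(S) \;=\; \sum_{i \in V} \max_{j \in S} \operatorname{sim}(x_i, x_j)
\qquad \text{s.t.} \qquad |S \cap P_r| \le k \quad \text{for } r = 1, \dots, R
```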

Review
Rating: 4

The authors present a token merging algorithm for image diffusion models, based on the theory of submodular optimization, operations that are more GPU-friendly, and some extra tricks to further enhance the speedup. The proposed method first selects the most representative destination tokens and then merges tokens into them, considering attention and locality. The experimental results show speedups over the unoptimized baseline and several other token merging baselines. The absolute speedup on image generation is around 30% when reducing 75% of the tokens, which is a notable improvement over the compared baselines.

Update after rebuttal

The rebuttal solved all my concerns, therefore, I raised the score to 4.

Questions for Authors

Please refer to weaknesses.

Claims and Evidence

Most of the claims are supported. For details, see the weaknesses.

Methods and Evaluation Criteria

The method and benchmark make sense.

Theoretical Claims

Yes, they are correct.

Experimental Design and Analysis

Yes.

Supplementary Material

Reviewed the submitted source code.

Relation to Prior Work

N/A

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

The overall quality of the paper is good, with a noticeable improvement over the baseline together with sufficient qualitative comparison. However, the paper can be polished by addressing the following weaknesses:

  1. Table 2 appears to have been included in a rush, with significantly less content than Table 1. A thorough evaluation and comparison on a DiT-based diffusion model is crucial for readers to see the merit of this paper.
  2. ToMA stripe usually performs worse in terms of the metrics despite the improvements in runtime, and ToMA tile provides comparable quality but at the cost of overhead, which leaves the reader wondering what the real benefit of these two variants is. And there is ToMA*, which leads to significant degradation in quality.
  3. In Table 3, the proposed method is only compared with ToDo at 75% token reduction.
  4. The authors are suggested to include a runtime breakdown and comparison to properly demonstrate that the designed method is GPU-friendly and superior to prior works.

Other Comments or Suggestions

N/A

Author Response

Dear Reviewer,

Thank you for taking the time to review our work. We have addressed your concerns in detail in the responses below.

Response:

Lack of Comparison on DiT-Based Diffusion Model:

Thank you for raising this point. Table 2 includes fewer comparisons with other methods in the DiT-based setting because existing approaches often produce mosaics or noise—they are not compatible with both RoPE positional embeddings and the specialized DiT architecture (Joint and Single Transformers). As a result, we focus Table 2 on analyzing our ToMA variants. To address your concern, we include comparisons with the ToMA_tile implementation below. We omit ToMA_stripe, as the stripe layout is incompatible with 2D RoPE. Additionally, we report results on the RTX 8000, excluding the V100 due to its insufficient memory for the FLUX model.

| Ratio | Method | FID↓ | CLIP↑ | DINO↓ | Sec/img (RTX 8000)↓ | Sec/img (RTX 6000)↓ |
|---|---|---|---|---|---|---|
| 0.00 | Baseline | 31.56 | 29.03 | 0 | 59.20 (0%) | 21.03 (0%) |
| 0.25 | ToMA | 30.80 | 29.07 | 0.043 | 56.70 (-4.2%) | 20.14 (-4.2%) |
| 0.25 | ToMA_tile | 31.49 | 29.05 | 0.021 | 57.47 (-2.9%) | 20.78 (-1.2%) |
| 0.50 | ToMA | 31.70 | 29.09 | 0.051 | 51.44 (-13.1%) | 18.58 (-11.6%) |
| 0.50 | ToMA_tile | 32.95 | 29.19 | 0.032 | 53.61 (-9.4%) | 19.61 (-6.8%) |
| 0.75 | ToMA | 33.39 | 28.98 | 0.064 | 49.83 (-15.9%) | 16.12 (-23.4%) |
| 0.75 | ToMA_tile | 33.88 | 29.34 | 0.045 | 49.86 (-15.8%) | 18.30 (-12.9%) |

ToMA Stripe, ToMA Tile, and ToMA*: Thank you for pointing this out. We ultimately selected the tile method because it aligns well with the inherent locality of the hidden states and, in FLUX scenarios, complements the 2D RoPE structure. We believe ToMA_tile can be further optimized—e.g., via contiguous tile-shaped memory access and kernel fusion of the (un)merging operation with the attention computation—and that, with such optimization, it could achieve acceleration comparable to ToMA_stripe. However, such optimization poses a significant engineering challenge, and to keep the comparison with baselines fair, we defer it to future work. We therefore introduced ToMA_stripe as a practical fallback due to its lower overhead and immediate applicability, despite slightly weaker quality metrics. As for ToMA*, it is an exploratory variant aimed at reducing overhead by applying (un)merging once at the model level rather than per transformer block. While this significantly improves runtime, it currently poses challenges in preserving generation quality, suggesting a promising direction for future architectural or training improvements.

ToDo: The reason is that the implementation of ToDo only supports merging exactly 4 tokens into 1, inherently limiting the merge ratio to 75%. Therefore, we report results for ToDo only at the 75% merge ratio.

Runtime Breakdown: Before presenting our results, we would like to clarify that the reported runtime does not fully reflect the actual time consumed, as we explored several timing measurement methods, each with its limitations. Torch Profiler introduces overhead, while using cuda.Event to measure individual components adds extra synchronization, potentially disrupting concurrency. Both inflate the measured runtime. To address this, we adopted a method in which we selectively disabled specific components via commenting and measured the time difference. This approach yields results that are more consistent with theoretical expectations and offers a more reasonable estimate of the runtime impact.
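To illustrate the timing approach described above (a hedged sketch, not our exact harness; `pipeline_with_merge` and `pipeline_without_merge` are placeholder callables), the cost of a component is estimated as the difference between two end-to-end timings, synchronizing only around whole runs:

```python
import time
import torch

def mean_latency(fn, warmup=3, iters=10):
    """End-to-end latency of fn(), synchronizing only before and after the
    timed loop so per-op concurrency is not disturbed."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# merge_cost = mean_latency(pipeline_with_merge) - mean_latency(pipeline_without_merge)
```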

We have included a comparison of merge and unmerge operation runtimes for both ToMeSD and ToMA in Section F of the Appendix. Additionally, the table below provides a detailed runtime breakdown. As shown, ToMA introduces significantly less operational overhead compared to ToMeSD. However, when the computation time is significantly reduced, the relative impact of overhead becomes more pronounced, resulting in a speedup that is smaller than the theoretical gain suggested by FLOPs. The runtime breakdown for FLUX closely aligns with that of SDXL.

| Time Category | Subcomponent | SDXL + ToMA (s, %) | SDXL + ToMe (s, %) |
|---|---|---|---|
| Computation Time | — | 0.550 (45.32%) | 0.550 (24.54%) |
| | SelfAttention | 0.151 (12.45%) | 0.151 (6.74%) |
| | CrossAttention | 0.107 (8.80%) | 0.107 (4.77%) |
| | FeedForward | 0.292 (24.07%) | 0.292 (13.03%) |
| Token Merge Operation | — | 0.191 (15.73%) | 1.168 (52.11%) |
| | ComputeMerge | 0.015 (1.21%) | 0.419 (18.67%) |
| | Merge | 0.090 (7.44%) | 0.431 (19.25%) |
| | Unmerge | 0.086 (7.10%) | 0.318 (14.19%) |
| Other Overhead | — | 0.472 (38.91%) | 0.523 (23.35%) |
| Total Block Time | — | 1.214 (100%) | 2.241 (100%) |
Review
Rating: 3

This paper introduces ToMA (Token Merge with Attention), a GPU-optimized token merging method for transformer-based diffusion models, addressing inefficiencies in existing token merging techniques such as ToMeSD and ToFu. ToMA achieves its efficiency improvements through submodular optimization for token selection and attention-based linear projections for merging and unmerging. Experiments demonstrate that ToMA achieves 24% speedup on SDXL and 23% on Flux.1-dev while maintaining image quality, outperforming previous work like ToMeSD and ToFu.

Questions for Authors

  • Would ToMA benefit from sparse attention techniques like Longformer?

  • What is the computational overhead of ToMA at extreme merge ratios (>90%)? Does merging too aggressively introduce additional inefficiencies or degrade quality significantly?

Claims and Evidence

Yes, most of the claims are supported by experiments; for example, ToMA achieves speedups by leveraging GPU-friendly operations, and submodular token selection works reasonably well.

Methods and Evaluation Criteria

Yes. The paper evaluates the proposed method on both SDXL and Flux against ToFu and ToMeSD.

Theoretical Claims

N/A.

Experimental Design and Analysis

Yes. Overall, the experimental designs are sound. However, the paper lacks a detailed computational cost breakdown, such as memory usage, FLOP reductions, or additional latency savings beyond overall inference time.

Supplementary Material

Yes. I reviewed the qualitative results of the paper.

Relation to Prior Work

The paper follows the idea of previous token reduction methods such as ToMeSD and ToFu but optimizes for GPU execution. It also extends submodular optimization to token selection for generative models.

Essential References Not Discussed

The paper misses a discussion of the sparse transformer [1], which also leverages locality for token reduction.

[1] Child et al. Generating Long Sequences with Sparse Transformers. arXiv 2019.

Other Strengths and Weaknesses

Strengths:

  • ToMA is shown to work efficiently with modern GPU optimizations like FlashAttention2, avoiding memory bottlenecks.

  • Empirical results confirm the speed improvements of ToMA over prior methods on SDXL and Flux.

Weaknesses:

  • At 0.25 merge ratio, the speedup is minimal, raising concerns about whether ToMA is beneficial in scenarios requiring high image quality.

  • The paper reports inference speed but lacks FLOP breakdowns and memory usage analysis. It remains unclear where efficiency gains come from.

Other Comments or Suggestions

N/A.

Author Response

Dear Reviewer,

We appreciate your insightful comments and the time you've taken to evaluate our work. Below are our responses to your concerns:

Concerns about High Image Quality and Limited Speed-Up: At a ratio of 0.25, the acceleration is limited primarily because the reduction in sequence length is not substantial, which naturally results in limited speedup. Additionally, some unavoidable overhead remains, including operations within the transformer block that are not accelerated, as well as the overhead introduced by our own method (relatively small, but it still contributes). That said, further speedup could be achieved through engineering optimizations, such as low-level implementations; for instance, fusing the merge kernel directly into the attention computation could significantly improve efficiency. However, to maintain a fair comparison with the baseline and other existing methods (which do not achieve the same level of speedup as ToMA), we defer such engineering enhancements to future work.

FLOP, Memory, and Efficiency: The efficiency primarily stems from the reduced sequence length in attention inputs achieved by merging, which significantly lowers the computational load of the attention mechanism.

FLOP: Our FLOP analysis focuses on the transformer blocks, specifically the projection and attention computations. In both FLUX and SDXL, ToMA achieves a substantial FLOP reduction, with a maximum improvement of approximately 3.4×. The overhead introduced by ToMA operations—including submodular token selection, merge token computation, and (un)merge—is negligible compared to the overall FLOP savings.

FLOP Comparison Table

| Model | Example Layer (Seq Len × Dim) | Original FLOPs (GFLOPs) | ToMA (0.5) FLOPs (GFLOPs) | ToMA Operation FLOPs (GFLOPs) | Reduction |
|---|---|---|---|---|---|
| FLUX | 4608 × 3072 | 520 | 225 | 1.01 | ~2.3× |
| SDXL | 4096 × 640 | 106 | 31.54 | 0.42 | ~3.4× |
| SDXL | 1024 × 1280 | 30 | 12.76 | 0.06 | ~2.4× |
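To make the scaling concrete, the following back-of-the-envelope count (a simplified sketch counting only the QKV/output projections and the two attention matmuls, so it will not reproduce the exact table values) shows why merging half the tokens cuts the quadratic attention term roughly four-fold:

```python
def attention_block_gflops(n, d, merge_ratio=0.0):
    """Approximate FLOPs of one attention block at effective sequence length
    n * (1 - merge_ratio): projections scale linearly in the kept tokens,
    while the QK^T and attention-times-V matmuls scale quadratically."""
    n_eff = int(n * (1 - merge_ratio))
    proj = 4 * 2 * n_eff * d * d       # Q, K, V, O projections
    attn = 2 * 2 * n_eff * n_eff * d   # QK^T and softmax(QK^T)V
    return (proj + attn) / 1e9

# e.g. an SDXL layer with N = 4096, d = 640 at merge ratios 0.0 and 0.5
print(attention_block_gflops(4096, 640, 0.0), attention_block_gflops(4096, 640, 0.5))
```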

Memory: Regarding memory usage, please see the tables below, which report the maximum memory allocated and reserved during inference. Experiments in both the FLUX and SDXL settings indicate minimal impact on allocated or reserved memory, suggesting our method incurs only minor additional memory allocation.

Combined Memory Usage Table (FLUX vs ToMA vs ToMA_tile)

| Metric | FLUX | ToMA 0.25 | ToMA 0.5 | ToMA 0.75 | ToMA_tile 0.25 | ToMA_tile 0.5 | ToMA_tile 0.75 |
|---|---|---|---|---|---|---|---|
| Max Allocated (MB) | 34640.07 | 34743.86 | 34709.85 | 34675.21 | 34647.18 | 34646.66 | 34642.15 |
| Max Reserved (MB) | 37002.00 | 37050.00 | 36976.00 | 36954.00 | 37054.00 | 37006.00 | 36950.00 |

Combined Memory Usage Table (SDXL vs ToMA vs ToMA_stripe vs ToMA_tile)

| Metric | SDXL | ToMA 0.25 | ToMA 0.5 | ToMA 0.75 | ToMA_stripe 0.25 | ToMA_stripe 0.5 | ToMA_stripe 0.75 | ToMA_tile 0.25 | ToMA_tile 0.5 | ToMA_tile 0.75 |
|---|---|---|---|---|---|---|---|---|---|---|
| Max Allocated (MB) | 10720.92 | 10930.77 | 10856.75 | 10796.86 | 10722.08 | 10718.62 | 10718.23 | 10725.21 | 10720.06 | 10718.67 |
| Max Reserved (MB) | 14150.00 | 14460.00 | 14260.00 | 14130.00 | 14114.00 | 14188.00 | 14222.00 | 14158.00 | 14158.00 | 14182.00 |
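For reference, numbers like those above can be collected with PyTorch's built-in peak-memory counters (a minimal sketch; `generate` stands for one image-generation call and is a placeholder):

```python
import torch

def peak_memory_mb(generate):
    """Run one generation and return peak allocated / reserved memory in MB."""
    torch.cuda.reset_peak_memory_stats()
    generate()
    allocated = torch.cuda.max_memory_allocated() / 2**20
    reserved = torch.cuda.max_memory_reserved() / 2**20
    return allocated, reserved
```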

The efficiency comes from the shortened attention sequence length enabled by token merging, allowing ToMA to significantly reduce FLOPs while introducing negligible additional memory overhead compared to the baseline.

Sparse Attention: The sparse attention mechanism naturally complements token merging. The underlying strategies of sparse attention and token merging are fundamentally orthogonal: token merging explicitly reduces the input length of the attention computation, whereas sparse attention selectively attends to subsets of tokens. As these methods operate independently, they can easily be combined—sparse attention can directly replace FlashAttention within ToMA to further improve efficiency. Moreover, many sparse attention methods like Longformer employ a sliding window that aligns with the local structure exploited by ToMA and can be used directly to restrict both merging and attention to sliding-window regions, further enhancing computational efficiency.
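As a small illustration of the sliding-window locality mentioned above (a hedged sketch, not the Longformer implementation; the window size and mask convention are assumptions), a band mask like the following could restrict attention (and, analogously, merging) to local neighborhoods:

```python
import torch

def sliding_window_mask(n, window):
    """Boolean (n, n) mask that is True where token i may attend to token j,
    i.e. when |i - j| <= window // 2; usable as attn_mask in
    torch.nn.functional.scaled_dot_product_attention."""
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= window // 2
```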

Overhead and Image Quality at Extreme Merge Ratios: In general, further speedup can be achieved by merging more tokens. Specifically, at a merge ratio above 90%, ToMA achieves a generation time of approximately 4.1 seconds on SDXL and 15.05 seconds on FLUX. However, at such high ratios, we observe noticeable degradation in image quality, as the number of tokens processed by the transformer becomes very limited, directly impacting the output. That said, even at these extreme ratios, we did not observe any inefficiencies significant enough to offset the overall speedup in generation time.

Final Decision

The paper proposes a method for token merging in diffusion models. This is an area of significant interest at this time. The method attacks the problem from both the optimization and the GPU-utilization viewpoints, achieving significant improvements. All reviewers recommend acceptance, and the AC concurs.