Diff-MoE: Diffusion Transformer with Time-Aware and Space-Adaptive Experts
Abstract
Reviews and Discussion
The paper introduces Diff-MoE, a framework integrating DiT with MoE to enhance scalability and performance in generative modeling. The proposed modules in Diff-MoE are specially designed for diffusion models, including the spatial-temporal adaptive experts and global feature recalibration. Extensive experiments show Diff-MoE significantly outperforms existing dense DiTs and prior MoE-based methods.
Update after rebuttal
My concerns have been well addressed, and I would like to recommend acceptance.
Questions for Authors
Why fix the number of experts to 8? Was this choice empirically validated against other configurations?
Claims and Evidence
The claims are supported by extensive experiments. Diff-MoE consistently outperforms dense DiTs and MoE variants (Tables 2-4) with a similar number of parameters.
Methods and Evaluation Criteria
Combining MoE with DiT for spatiotemporal adaptation is novel and well-motivated. The global recalibration mechanism addresses MoE's local bias. Class-conditional image generation on the ImageNet dataset is a widely used benchmark; generative model architectures are commonly validated on it, and the metrics reported in the manuscript are persuasive.
Theoretical Claims
The paper focuses on empirical contributions.
Experimental Design and Analysis
Comprehensive ablations isolate the contribution of each component, including the basic architecture design, the spatial-temporal adaptive experts, and the global feature recalibration. The comparison with existing methods is also fair, as the authors designed models at several sizes to enable parameter-matched comparisons.
Supplementary Material
In Appendix C, the discussion about decoupling temporal adaptation from spatial specialization is interesting; it explains the performance improvement from a novel perspective.
Relation to Existing Literature
This paper builds on Diffusion Transformers (Peebles & Xie, 2023) and MoE architectures (Shazeer et al., 2017). It advances prior diffusion-based MoE works (e.g., DiT-MoE, DTR) by unifying temporal and spatial adaptation.
Essential References Not Discussed
The related work discussed in this paper is fairly comprehensive.
Other Strengths and Weaknesses
Strengths:
- I fully agree that current diffusion-based methods do not consider temporal and spatial flexibility simultaneously. The integration of temporal and spatial MoE for diffusion models is well-motivated, and the integration is cleverer than the conventional MoE approach.
- The introduction of low-rank decomposition to reduce the number of parameters is natural given the design of this paper. The combined use of LoRA and AdaLN incurs no significant performance loss in this method.
- The experiments in this paper are very comprehensive, including comparisons with both dense and expert-based diffusion models. Compared with existing methods across different scales, the proposed method achieves substantial performance improvements.
Weaknesses:
- The FID value of +GLU in Tab. 5 is inconsistent with the text in Sec. 5.3, which could be a typo.
- The design motivation for the depthwise convolution in the Basic Architecture Design section is unclear. Why add convolution to a pure transformer architecture? Based on Sec. 5.3, this component yields a significant decrease in FID, and the reasons should be explained in further detail.
Other Comments or Suggestions
I'm positive about the paper. There are several contributions, which are well motivated, executed and ablated. The evaluation - especially the quantitative results - is convincing.
Q1: Typo in Tab. 5.
We apologize for the typo in Table 5. The correct FID and IS scores for "+GLU" should be 38.20 and 41.43, respectively. The ability of GLU to enhance model capacity compared to a simple MLP has been discussed in works such as Llama [1] and StarNet [2]. We will fix this typo in the revised version.
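For concreteness, below is a minimal sketch contrasting a plain MLP block with a GLU-style gated block (illustrative only; the hidden width, activation, and module names are assumptions, not the exact configuration used in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlainMLP(nn.Module):
    """Baseline two-layer feed-forward block."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class GLUFeedForward(nn.Module):
    """GLU-style block: a second linear branch gates the hidden activation,
    increasing expressivity; in practice the hidden width is often shrunk
    (e.g., by 2/3) to keep the parameter count comparable to a plain MLP."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.value = nn.Linear(dim, hidden_dim)
        self.gate = nn.Linear(dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        # Element-wise gating of the value branch by the activated gate branch.
        return self.out(F.silu(self.gate(x)) * self.value(x))
```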
Q2: Motivation of Depthwise Convolution.
We apologize for the unclear discussion of the motivation behind this design choice. The integration of depthwise convolution prior to the MoE module draws inspiration from hybrid vision architectures like CMT [3] and LocalViT [4], which strategically combine the spatial locality of CNNs with the global receptive fields of transformers. Our design addresses two critical requirements:
● Spatial Locality Preservation: Depthwise convolution injects inductive biases for local feature extraction (edges, textures).
● Parameter Efficiency: With per-position computational complexity reduced from $\mathcal{O}(K^2 C^2)$ for standard convolution to $\mathcal{O}(K^2 C)$ for depthwise convolution (kernel size $K$, $C$ channels), depthwise operations contribute only 0.1% additional parameters in Diff-MoE-S while maintaining spatial fidelity. In our ablation studies (Table 5), we observed that removing depthwise convolutions led to a 6.5% increase in FID (35.85 → 38.20), highlighting their importance.
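As a rough illustration of this design pattern (a generic sketch, not the paper's exact module; the 3×3 kernel, residual placement, and square token grid are assumptions):

```python
import torch
import torch.nn as nn

class LocalityInjection(nn.Module):
    """Sketch: depthwise 3x3 convolution applied to the token grid before a
    (MoE) feed-forward stage, injecting local inductive bias at roughly
    O(K^2 * C) cost per position instead of O(K^2 * C^2) for standard conv."""
    def __init__(self, dim, grid_size):
        super().__init__()
        self.grid_size = grid_size
        # groups=dim makes the convolution depthwise (one filter per channel).
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens):                  # tokens: (B, N, C) with N = H * W
        B, N, C = tokens.shape
        H = W = self.grid_size
        x = tokens.transpose(1, 2).reshape(B, C, H, W)
        x = x + self.dwconv(x)                  # residual keeps the transformer path intact
        return x.reshape(B, C, N).transpose(1, 2)
```

Because each filter touches only its own channel within a 3×3 neighborhood, the extra parameter and FLOP cost stays marginal relative to the attention and feed-forward blocks.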
Q3: Why fix the number of experts to 8?
The selection of 8 experts balances empirical performance gains against computational and architectural constraints, following DiT-MoE [5]. We report the results of the ablation experiments for the number of experts in the following table:
| CFG=1.5, ImageNet 256×256 | Params (activated / total) | FID↓ | IS↑ |
|---|---|---|---|
| Diff-MoE-S-4E1A | 36M / 66M | 33.45 | 46.95 |
| Diff-MoE-S-8E1A | 36M / 107M | 31.18 | 50.55 |
| Diff-MoE-S-16E1A | 36M / 187M | 30.08 | 51.76 |
While larger expert pools may benefit extreme-scale models, our focus on parameter-efficient diffusion training prioritizes balanced specialization over brute-force scaling. Our open-sourced implementation will support flexible expert configurations for future explorations as more compute becomes available.
[1] Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
[2] Ma, Xu, et al. "Rewrite the stars." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[3] Guo, Jianyuan, et al. "Cmt: Convolutional neural networks meet vision transformers." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
[4] Li, Yawei, et al. "Localvit: Analyzing locality in vision transformers." 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023.
[5] Fei, Zhengcong, et al. "Scaling diffusion transformers to 16 billion parameters." arXiv preprint arXiv:2407.11633 (2024).
Diff-MoE introduces a novel integration of temporal and spatial adaptation in MoE for diffusion models. The modules proposed in Diff-MoE take the characteristics of diffusion models into account. The experimental results are also impressive, demonstrating the effectiveness and scalability of Diff-MoE.
Questions for Authors
- There is a typo in Tab. 5. The FID of +GLU should be 38.20, as described in L433.
- Why is the CFG value in Tab. 2 set to 1.0, while the others use 1.5?
- Do the authors plan to open-source the code? Open-sourcing would be very valuable to the community, since MoE training usually involves a number of tricks.
Claims and Evidence
The quantitative and qualitative evaluations are overall adequate and support the claims of the paper. Strong baselines from prior work are considered for comparison. The authors trained models of different sizes, and the proposed method is clearly and consistently superior to the baselines across these sizes.
Methods and Evaluation Criteria
The paper is technically sound. All the proposed components are well motivated and serve a clear purpose: MoE for spatial dynamic computation, expert-specific timestep conditioning for temporal adaptation. The evaluation is in line with what is used in previous work: standard ImageNet benchmark and FID metrics.
Theoretical Claims
This paper verifies the effectiveness of architecture design through qualitative and quantitative experiments.
Experimental Design and Analysis
The training strategy is aligned with existing methods. Moreover, several models of different sizes are designed to compare with existing methods under matched parameter counts. The comparison baselines are comprehensive, including dense and sparse (temporal or spatial) DiTs. The proposed method consistently outperforms the different kinds of comparison methods across scales.
Supplementary Material
Appendix validates convergence (Fig. 6) and expert routing dynamics (Fig. 7).
Relation to Existing Literature
MoE-based DiT methods are still in their early days. Most methods consider only space or time; this work is the first to consider both.
Essential References Not Discussed
To the best of my knowledge the paper offers a good coverage of related works. The papers listed are all relevant, are properly organized and accurately discussed.
Other Strengths and Weaknesses
Strengths:
- The experimental setup is solid and the results are impressive. The performance of Diff-MoE is far superior to the existing methods, and the convergence speed is fast.
- MoE is a good way to scale diffusion models to larger sizes, but this direction is still in the early stages of development. The motivation of this paper is clear, considering both spatial and temporal adaptability. The proposed modules are specifically designed for diffusion models, rather than naively importing MoE designs from LLMs.
- The writing is good and well organized. The design motivation for each module is clear and easy to follow.
Weaknesses: See questions. I recommend the paper for acceptance owing to the merits of the work and the clear motivation behind each of the modules used. My final rating depends on how the authors address the concerns in the Questions section.
Other Comments or Suggestions
See questions.
Thanks for the thorough reviews. Below we address the issues one by one.
Q1: Typo in Tab. 5.
We apologize for the typo in Table 5. The correct FID and IS scores for "+GLU" should be 38.20 and 41.43, respectively. The ability of GLU to enhance model capacity compared to a simple MLP has been discussed in works such as Llama [1] and StarNet [2]. We will fix this typo in the revised version.
[1] Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
[2] Ma, Xu, et al. "Rewrite the stars." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Q2: Why is the CFG value in Tab. 2 set to 1.0, while the others use 1.5?
The variation in classifier-free guidance (CFG) scales across Table 2 stems from our commitment to fair, methodologically consistent comparisons. For SiT-LLaMA, we directly adopt the reported CFG=1.0 results from its original paper, as no CFG=1.5 benchmarks were available. Conversely, other baselines (spatial/temporal MoE) are evaluated at CFG=1.5 following their established protocols.
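For reference, under the usual classifier-free guidance formulation the guidance scale $w$ enters the prediction as

$$
\hat{\epsilon}_\theta(x_t, c) \;=\; \epsilon_\theta(x_t, \varnothing) \;+\; w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr),
$$

so $w = 1.0$ reduces to the purely conditional prediction (no guidance), while $w = 1.5$ applies mild guidance. Results obtained at the two scales are therefore not directly comparable, which is why we match each baseline's reported setting.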
Q3: Code.
We sincerely appreciate the recognition of this work. To ensure reproducibility and foster community progress, we commit to open-sourcing all training frameworks, inference pipelines, and architectural implementations upon publication, to facilitate future research in scalable diffusion models.
This paper introduces Diff-MoE, which is a novel framework combining Diffusion Transformers with Mixture-of-Experts to enhance scalability and flexibility in generative modeling. It achieves better FID scores across different model sizes compared to standard DiT models.
Questions for Authors
While the paper extensively discusses parameter efficiency, it doesn't clearly address the training and inference computational costs (FLOPs/throughput) of Diff-MoE compared to baseline models. Could you provide quantitative comparisons of computational overhead and wall-clock time for training and inference? This would help evaluate the practical applicability of the approach beyond parameter efficiency.
Were there any specific failure cases or training instabilities (e.g., dead experts, high variance in activation patterns) encountered during training?
Claims and Evidence
Yes, most of the claims made have convincing evidence.
Methods and Evaluation Criteria
Yes, proposed methods and evaluation criteria make sense.
Theoretical Claims
Does not contain an explicit formal mathematical proof that needs to be verified for correctness.
Experimental Design and Analysis
Limited evaluation at extreme scales (7M training steps) due to computational constraints.
Supplementary Material
Yes, I have read all parts in supplementary material.
Relation to Existing Literature
Diff-MoE integrates the Mixture-of-Experts paradigm, which has been effectively used to scale large models in Natural Language Processing, into diffusion models to achieve computational efficiency through dynamic parameter activation.
Diff-MoE distinguishes itself from prior work on MoE in diffusion models by jointly optimizing for temporal adaptation and spatial specialization. Previous approaches often focused on either temporal partitioning of experts across denoising stages (e.g., DTR, Switch-DiT, MEME) or spatial routing of tokens to experts (e.g., DiT-MoE, EC-DiT). Diff-MoE proposes a more unified approach.
Essential References Not Discussed
This paper only compares DiT-based models and lacks a comparison with state-of-the-art (SOTA) image generation models.
Other Strengths and Weaknesses
Strengths: The paper introduces several novel ideas, including joint spatiotemporal expert coordination, expert-specific timestep conditioning, and a globally-aware feature recalibration mechanism. The use of low-rank decomposition to reduce parameter overhead is also a notable contribution.
Weaknesses: The paper acknowledges that due to computational constraints, a full evaluation of the largest models (e.g., Diff-MoE-XL) at extended training durations was not performed.
Other Comments or Suggestions
Please refer to Strengths And Weaknesses.
Thanks for the constructive comments and the recognition of novelty.
Q1: Limited evaluation at extreme scales due to computational constraints.
We acknowledge the limitation in fully characterizing Diff-MoE's scaling laws and commit to conducting large-scale evaluations once additional computational resources are secured. Our contribution lies in synthesizing existing MoE methods in diffusion models, combining the benefits of timestep-based and spatial MoEs. This perspective offers a novel approach for scaling diffusion models, and extensive experiments validate its effectiveness. Furthermore, we will open-source the code to enable researchers with sufficient computational resources to explore and build upon our work.
Q2: Comparison with more SOTA models.
Thank you for this constructive comment. State-of-the-art AR/diffusion methods can achieve an FID below 1.5 on ImageNet 256 generation. As discussed in Q1, current hardware limitations constrain our ability to explore larger scales or more iterations, which impacts our ability to achieve state-of-the-art performance. For reference, DiT-MoE-XL[1] achieves 1.72 FID after training for 7M iterations and 3.42 FID after 400K iterations. In comparison, Diff-MoE-XL achieves 2.69 FID after training for 400K iterations, which can serve as a benchmark. We will open-source all implementations to facilitate community-driven scaling efforts and will pursue large-scale training once expanded infrastructure becomes available. These steps aim to bridge the gap between methodological innovation and SOTA performance benchmarks in future work.
Q3: Training and Inference computational costs.
We thank the reviewer for raising this critical aspect of MoE system design. Below we detail computational comparisons between Diff-MoE and the DiT-MoE baseline under identical hardware (V100 GPU) and framework conditions.
Inference Efficiency: Compared with the expert-based baseline, Diff-MoE-S+ incurs only a 6% FLOPs increase (16.05G → 17.01G) but a 19% throughput reduction (278 → 225 samples/sec) relative to DiT-MoE-S, which may be because the GPU optimizes different operators to different degrees. When compared to dense DiT models, despite having similar parameter counts and theoretical FLOPs, our current implementation exhibits slower throughput due to the sequential computation of experts in a for-loop, an inherent challenge in MoE architectures. Nevertheless, Diff-MoE-S/2 achieves competitive FID (44.27 vs. DiT-B/2's 42.84) with about 73% fewer activated parameters (36M vs. 131M), demonstrating superior memory efficiency critical for large-scale deployment.
Training Optimization: Building on fast-DiT[2] optimizations, we implement mixed precision training and pre-extracted VAE features. These adjustments yield 1.2 iterations/sec for Diff-MoE-S+ (211M params) vs. DiT-MoE-S’s 1.0 iter/sec (199M params), despite our model’s increased capacity.
While our current implementation prioritizes architectural innovation over low-level optimizations, the sequential computation of experts in a for-loop further exacerbates our speed disadvantage relative to sparse and dense baselines as the number of experts grows. We recognize the need for low-level optimizations and identify clear pathways to mitigate throughput costs, such as parallel expert execution (DeepSeek-V3 [3] and DeepSpeed-MoE [4]). Following these strategies, we believe there is significant room to improve inference speed. We will prioritize these optimizations after the code is open-sourced, to bridge the efficiency gap while retaining the architectural advantages of Diff-MoE. Thanks again for this valuable question.
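To make the bottleneck concrete, the dispatch pattern in question looks roughly like the following (a generic top-1 MoE sketch, not our actual implementation; the class name and routing rule are illustrative):

```python
import torch
import torch.nn as nn

class NaiveMoE(nn.Module):
    """Illustrative top-1 MoE layer with a sequential per-expert for-loop.
    The Python loop is the throughput bottleneck discussed above; fused or
    grouped dispatch kernels avoid it."""
    def __init__(self, dim, hidden_dim, num_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                           # x: (num_tokens, dim)
        probs = self.router(x).softmax(dim=-1)      # (num_tokens, num_experts)
        weight, expert_idx = probs.max(dim=-1)      # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):   # sequential: one pass per expert
            mask = expert_idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out
```

Because the loop visits every expert in turn and launches separate kernels for each, wall-clock time grows with the number of experts even though each token activates only one of them.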
Q4: Training Stability.
Diff-MoE exhibits robust convergence behavior throughout the training process. As shown in Fig. 4, the FID consistently decreases as training advances. We also repeated training for different model sizes multiple times. For Diff-MoE-S, the FID fluctuates by no more than 0.5 (44 ± 0.5); for Diff-MoE-XL, by no more than 0.05 (2.69 ± 0.05). Load balancing remains a persistent challenge in MoE architectures and is only partially addressed by conventional auxiliary losses; our expert-specific timestep conditioning mechanism introduces a critical refinement. By decoupling temporal adaptation (denoising-stage dynamics) from spatial routing (token-level feature complexity), the framework redistributes computational load more effectively, as discussed in Supplementary Material Section C.
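For context, the conventional auxiliary loss mentioned above is typically the Switch-Transformer-style load-balancing term; a sketch of that standard formulation (not necessarily the exact loss used in our training) is:

```python
import torch

def load_balancing_loss(router_logits, expert_idx, num_experts):
    """Standard load-balancing auxiliary loss: E * sum_e f_e * p_e, where f_e is
    the fraction of tokens routed (top-1) to expert e and p_e is the mean router
    probability for expert e. It penalizes configurations where a few experts
    receive most tokens, pushing both quantities toward the uniform value 1/E."""
    probs = router_logits.softmax(dim=-1)                          # (num_tokens, E)
    f = torch.bincount(expert_idx, minlength=num_experts).to(probs.dtype)
    f = f / expert_idx.numel()                                     # token fraction per expert
    p = probs.mean(dim=0)                                          # mean routing probability per expert
    return num_experts * torch.sum(f * p)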
[1] Fei, Zhengcong, et al. "Scaling diffusion transformers to 16 billion parameters." arXiv preprint arXiv:2407.11633 (2024).
[2] https://github.com/chuanyangjin/fast-DiT
[3] Liu, Aixin, et al. "DeepSeek-V3 technical report." arXiv preprint arXiv:2412.19437 (2024).
[4] Rajbhandari, Samyam, et al. "DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale." International Conference on Machine Learning. PMLR, 2022.
This paper proposes Diff-MoE, a method that captures both timestep and spatial contexts for expert routing. The approach consists of 1) Expert-Specific Timestep Conditioning – Unlike previous spatial MoE approaches, this enables each expert to adapt its operations based on the timestep, improving adaptability to different noise levels and 2) Feature Recalibration with Global Contexts – This enhances feature representations by incorporating global spatial information, leading to better expert specialization. These two techniques improve the model’s expert capabilities and global context awareness. Additionally, a parameter reduction technique using low-rank decomposition is employed to improve efficiency. Experimental results demonstrate that Diff-MoE outperforms both timestep-dependent routing methods and previous spatial MoE approaches.
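As a rough mental model of the first mechanism, a hypothetical sketch is given below, under the assumption that each expert derives its own AdaLN-style scale and shift from the timestep embedding through a low-rank (LoRA-like) projection; the paper's exact parameterization may differ:

```python
import torch
import torch.nn as nn

class TimestepConditionedExpert(nn.Module):
    """Hypothetical sketch: one expert whose pre-FFN modulation (scale/shift)
    is produced from the timestep embedding via a low-rank projection, so each
    expert adapts to the denoising stage with few extra parameters."""
    def __init__(self, dim, hidden_dim, rank=16):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Low-rank map: timestep embedding -> per-expert (scale, shift)
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, 2 * dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))

    def forward(self, x, t_emb):                 # x: (tokens, dim), t_emb: (dim,)
        scale, shift = self.up(self.down(t_emb)).chunk(2, dim=-1)
        return self.ffn(self.norm(x) * (1 + scale) + shift)
```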
Questions for Authors
N/A
Claims and Evidence
I find most of their claims reasonable. Their primary claim is that their method combines previous temporal and spatial MoE approaches. By demonstrating that their approach outperforms both prior temporal and spatial MoE methods, they effectively validate this claim.
Methods and Evaluation Criteria
They follow the standard evaluation criteria used in the DiT-series evaluations.
Theoretical Claims
N/A
Experimental Design and Analysis
The ablations are well-conducted. However, one minor drawback is that the method does not achieve state-of-the-art (SOTA) performance. That said, I don’t find this to be a critical issue.
Supplementary Material
Reviewed. They provide details of experiments and implementation.
Relation to Existing Literature
MoE in Diffusion Models
To enable efficient scaling of diffusion models, several works have explored MoE-based approaches. I agree with this paper's categorization, which classifies MoE methods into timestep-based and spatial MoEs. This work effectively combines both strategies to enhance the MoE architecture.
Additionally, training MoE models in diffusion frameworks is notoriously unstable, yet this paper appears to achieve a stable training process across various scales of base experts, which is a significant strength. If the code is made publicly available, it would greatly contribute to the field by providing insights into stabilizing MoE training in diffusion models.
Timestep-Aware Designs
This work aligns well with prior research advocating timestep-aware network operations in diffusion models. However, previous works have already demonstrated why timestep-aware design is necessary, and citing them would strengthen the paper’s argument. Including such references would reinforce the motivation behind their approach.
Diffusion Model Architectures
Since DiT (Diffusion Transformers), most diffusion models have followed a transformer-based architecture, which is also prevalent in large-scale video diffusion models. Although this work does not achieve state-of-the-art (SOTA) performance, it explores an efficient scaling mechanism for MoE in DiTs.
It would be interesting to see how well this method scales to even larger models, though this is not a critical issue.
Overall Contribution
If the code is released, I believe this work would make a substantial contribution to MoE-based diffusion models, particularly by stabilizing training and providing an efficient scaling mechanism.
Essential References Not Discussed
Well discussed.
Other Strengths and Weaknesses
Strengths
- This paper is well-written.
- Proposed MoE architecture seems effective.
Weaknesses
- Including references on why timestep awareness of networks is necessary would be beneficial.
- Including results on scaling beyond the XL size would be beneficial.
Other Comments or Suggestions
Line 50: Please correct the strange text.
Thanks for the thorough reviews. Below we address the issues one by one.
Q1: Code release.
We sincerely appreciate the insightful feedback and recognition of this work. As astutely noted, training stability remains a critical challenge in MoE architectures, where conventional load-balancing losses offer partial mitigation. Our proposed expert-specific timestep conditioning mechanism further alleviates this issue by disentangling temporal adaptation from spatial routing, as analyzed in Supplementary Material Section C. To ensure reproducibility and foster community progress, we commit to open-sourcing all training frameworks, inference pipelines, and architectural implementations upon publication, to facilitate future research in scalable diffusion models.
Q2: Including results on scaling over XL-size will be beneficial.
We fully agree that scaling up MoE to larger sizes and training more iterations can further validate the framework’s efficacy. However, current hardware limitations bound our exploration. Training configurations exceeding the XL scale (4.5B parameters) on our 8-GPU node infrastructure risk exceeding GPU memory capacity or incurring impractical training durations. We acknowledge this limitation in fully characterizing Diff-MoE’s scaling laws and commit to future large-scale evaluations upon securing expanded computational resources. Moreover, we will open-source the code to enable researchers with sufficient computing resources in the community to explore and build upon our work.
Q3: Including the references about why the timestep awareness of networks is necessary will be beneficial.
Thanks for this valuable suggestion. We will discuss the following papers in the revision:
[1] Hatamizadeh, Ali, et al. "Diffit: Diffusion vision transformers for image generation." ECCV, 2024.
[2] Liu, Qihao, et al. "Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization." NeurIPS, 2024.
[3] Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." NeurIPS, 2022.
Q4: Typos.
Thanks for pointing that out. We will fix all the typos in the revised version.
Thank you for your response. My concerns have been well addressed.
This paper introduces Diff-MoE, a novel framework combining Diffusion Transformers (DiT) with Mixture-of-Experts (MoE) to improve scalability and efficiency in generative modeling. The key contributions include expert-specific timestep conditioning, which enables dynamic adaptation to different diffusion stages, and a globally-aware feature recalibration mechanism that enhances the model's representational capacity. Extensive experiments demonstrate that Diff-MoE outperforms existing methods.
The strengths of this work lie in its innovative combination of spatial and temporal MoEs, which enhances both model adaptability and efficiency. The paper provides a solid experimental foundation with comprehensive ablation studies. The method is well-motivated, addressing key challenges in scaling generative models.