Experts on Demand: Dynamic Routing for Personalized Diffusion Models
Abstract
Reviews and Discussion
This paper introduces a personalized mixture-of-experts (MoE) structure for diffusion models, named MoEDM, for text-to-image generation. The proposed MoEDM improves inference efficiency on designated tasks while keeping task-specific performance metrics intact. The parameter-sparsification strategy effectively navigates the trade-off between efficiency and capability, establishing it as a feasible optimization technique for diffusion models. The experimental results show the good performance of the proposed MoEDM model.
Strengths
This paper introduces a sparse mixture-of-experts structure for diffusion models to improve inference efficiency on text-to-image generation tasks. The experimental results show the good inference performance of the proposed models.
Weaknesses
- Novelty is limited. The MoE structure is widely applied in AIGC models, especially large foundation models. For diffusion models, MoE is not introduced here for the first time in text-to-image diffusion models, e.g., [RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths]. Sparse parameter pruning with MoE is also widely used for large dense models, e.g., [Task-Specific Expert Pruning for Sparse Mixture-of-Experts]. Although this paper introduces a sparse MoE structure to prune the large number of parameters of dense diffusion models, the novelty of the proposed method is limited.
- Experiments are probably not sufficient. It is not easy to evaluate the performance of an image generation method. The evaluation metrics, KID and FID, are relatively insensitive to generated image quality and cannot assess image quality well.
Questions
- Please highlight the novelty and contribution of the proposed sparsification strategy for diffusion models.
- Please clarify more details of the proposed MoEDM, e.g., the expert balancing operation.
- Please add more image quality comparisons between the proposed MoEDM and the baselines.
Thank you very much for the thoughtful and detailed review. We reply point-by-point here, to begin the discussion.
- W1: The novelty we wish to emphasize is that, while pursuing general, large-scale models, it is imperative to consider the specific needs of users. Long-term use of models for specialized purposes is a common occurrence, yet it is often overlooked by the community. Methods such as layer removal and MoE are merely tools for achieving this objective.
- W2: Thank you for pointing out the issues with the evaluation metrics. We have provided numerous visualizations of generated images, along with comparisons to baseline models, in the appendix of our paper. We will place greater emphasis on this aspect in the updated version of our paper.
- Q1: Same as W1: the novelty we wish to emphasize is that, while pursuing general, large-scale models, it is imperative to consider the specific needs of users. Long-term use of models for specialized purposes is a common occurrence, yet it is often overlooked by the community. Methods such as layer removal and MoE are merely tools for achieving this objective.
- Q2: In Section 3.2 of our paper, we clearly state: "Fortunately, in diffusion models, the time step t is always known, allowing for targeted activation based on t. This makes the gated mechanism of MoEDM a training-free approach." This implies that we can directly select experts with smaller ids for smaller values of t (a routing sketch is given at the end of this response). It is also crucial to emphasize that the speed improvement of MoEDM is not related to the number of experts, because the number of experts does not affect the volume of parameters involved in the computation at each step of the diffusion sampling process.
- Q3: We have provided numerous visualizations of generated images, along with comparisons to baseline models, in the appendix of our paper. We will place greater emphasis on this aspect in the updated version of our paper.
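For illustration only (this is our own minimal sketch under assumptions, not the exact implementation in the paper: the class name, the expert interface, and the equal-interval timestep boundaries are all hypothetical), the training-free timestep-based routing could look like the following:

```python
# Minimal sketch of training-free, timestep-based expert routing (hypothetical).
# Assumptions: experts are ordered by the timestep range they serve, and the
# diffusion schedule [0, num_timesteps) is split into equal intervals.
import torch
import torch.nn as nn


class TimestepRoutedMoE(nn.Module):
    def __init__(self, experts, num_timesteps=1000):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.num_timesteps = num_timesteps

    def route(self, t: int) -> int:
        # Smaller t -> expert with a smaller id; no gating network is trained.
        idx = int(t * len(self.experts) / self.num_timesteps)
        return min(idx, len(self.experts) - 1)

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        # Only one expert runs per sampling step, so per-step compute is
        # independent of the total number of experts.
        return self.experts[self.route(t)](x, t)
```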
This paper introduces MoEDM, a novel approach for building a mixture-of-experts tailored for personalized large-scale diffusion models. MoEDM starts by layer-wise pruning a dense diffusion model, which reduces memory requirements. Subsequently, it constructs a mixture of the pruned models via dynamic routing, achieving faster sampling speeds without requiring additional training. The proposed approach is thoroughly validated through experiments conducted on image datasets to demonstrate its effectiveness.
Strengths
- The paper tackles a timely and practically relevant problem, supported by a fair amount of experiments, and stands as a pioneering study in attempting to build mixtures of diffusion expert models.
Weaknesses
- In general, the writing is difficult to follow and lacks technical details.
- How is the channel sensitivity metric calculated and what is its computational overhead?
- It's unclear whether the gating vector is trainable or how it's determined.
- The paper could benefit from providing an explicit pruning algorithm to enhance understanding.
- There's a lack of clarity regarding when and how dynamic gating is utilized to construct the mixture of experts. MoE typically involves training with load balancing loss, but there are no details about dynamic gating in this context.
- The performance for high-resolution images appears to be only marginally improved, which could be discussed in more detail.
- Ablation study regarding the number of experts should be included, as most experimental results consistently show a 50% boost in inference speed.
- The experiments are limited to U-Net-based models, and it's uncertain whether the proposed method is applicable to various architectures such as DiT [1].
- The paper lacks baselines. For example, [2] can serve as a pruning baseline.
Questions
- How many random seeds are used throughout the experiments?
[1] Peebles et al., “Scalable Diffusion Models with Transformers.” 2022.
[2] Fang et al., “Structural Pruning for Diffusion Models.” 2023.
Thank you very much for the thoughtful and detailed review. We reply point-by-point here, to begin the discussion.
- W1:
- As described in Equation 1, after zeroing out a specific channel we run a diffusion sampling pass and measure the difference between the generated image and the reference image (a sketch of this measurement is given at the end of this response). This process is indeed time-consuming, but we have observed a consistent pattern across different categories and models: the layers at both ends are absolutely crucial, while the importance of the middle layers is much lower. Consequently, in our MoEDM approach we adopt the strategy of directly removing the middle layers, eliminating the need for further assessment of parameter importance.
- In Section 3.2 of our paper, we clearly state: "Fortunately, in diffusion models, the time step t is always known, allowing for targeted activation based on t. This makes the gated mechanism of MoEDM a training-free approach." This implies that we can directly select experts with smaller ids for smaller values of t.
- As mentioned above, our specific pruning algorithm involves removing the layers located in the middle of the model.
- It is important to emphasize that our study focuses not on improving the model's performance, but on enhancing efficiency while maintaining existing performance levels. As shown in Table 3 of our paper, we have significantly increased the model's operating speed while maintaining the performance of high-resolution models on specific tasks.
- Thank you for pointing out the need for more ablation studies. We will supplement our studies with these additional ablation experiments. However, it is crucial to emphasize that the speed improvement of MoEDM is not related to the number of experts. This is because the number of experts does not affect the volume of parameters involved in the computation at each step of the diffusion sampling process.
- Thank you for pointing out the need for experiments on DiT, and we will supplement our studies with these additional experiments on DiT.
- We emphasize that this paper (Structural Pruning for Diffusion Models) was made public after the submission deadline for ICLR 2024, and we will include this work as a baseline in our supplementary materials.
- Regarding randomness, we did not deliberately control it in our study; random seeds were chosen arbitrarily during the sampling process.
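As a purely illustrative sketch (Equation 1 in the paper defines the actual metric; the `sample` function, the L2 distance, and all names here are our own assumptions), the channel-zeroing sensitivity measurement could be implemented roughly as follows:

```python
# Hypothetical sketch of channel-sensitivity scoring by ablation.
# Assumptions (not from the paper): `sample(model, seed)` runs the full
# diffusion sampling loop and returns an image tensor; sensitivity is the
# L2 distance to the image produced by the unmodified model.
import copy
import torch


@torch.no_grad()
def channel_sensitivity(model, layer_name, channel, sample, seed=0):
    reference = sample(model, seed)          # image from the intact model

    ablated = copy.deepcopy(model)
    layer = dict(ablated.named_modules())[layer_name]
    layer.weight[channel].zero_()            # zero out one output channel
    if layer.bias is not None:
        layer.bias[channel].zero_()

    perturbed = sample(ablated, seed)        # image with the channel removed
    return torch.norm(reference - perturbed).item()
```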
The paper proposes a personalization method based on diffusion models, named MoEDM. This method aims to reduce computational cost and inference time while preserving performance metrics across various datasets. Experiments are conducted on ImageNet and FFHQ with FID and KID as evaluation metrics.
Strengths
1. The paper tackles a relevant application (i.e., personalization) of the prominently used diffusion models.
2. The proposed method can incorporate model acceleration techniques, e.g., DPM-Solver.
Weaknesses
1. The first and perhaps the main weakness of the paper is its poor presentation.
   1) The term "an all-encompassing arsenal" is introduced in the second paragraph of the Introduction but is not explained or mentioned later, leading to confusion regarding this concept.
   2) In the Introduction, the paper states "…, deploying a general-purpose diffusion model is not just inefficient but egregiously wasteful". However, the paper does not provide a comprehensive explanation of why it is inefficient and wasteful. Furthermore, Figure 1 also fails to illustrate this point.
   3) In the Introduction, the paper states "…often fall short in preserving the performance attributes of diffusion models". The paper also does not provide a comprehensive explanation of why these methods "fall short in preserving the performance…". It is therefore recommended to further provide qualitative and quantitative experiments. The same issue applies to the remaining sections.
2. The term "minimal computational cost" is used in the paper, but it lacks a clear definition, making it a concept that may lead to confusion.
3. In Section 3.1, the paper mentions that the convolutional layers "constitute approximately 80% of the model's parameters." However, the paper does not provide the calculation method used to obtain this value.
4. Why does setting parts of the diffusion model to zero work?
5. It would be better to provide experiments comparing with model acceleration methods (DPM-Solver, DPM-Solver++, DDIM, ToMe) to prove that MoEDM enhances inference efficiency. DDIM: Denoising Diffusion Implicit Models; ToMe: Token Merging for Fast Stable Diffusion; DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps; DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models.
The experimental validation is not convincing:
1. The paper only provides 64x64 and 256x256 resolution outputs, while diffusion-based models can generate high-resolution results of 512x512 (e.g., Stable Diffusion).
2. The proposed method aims to reduce computational costs, but its effectiveness may not be fully convincing: training experiments require 8 NVIDIA A100 GPUs, and even sampling needs a single NVIDIA A100 with 80 GB of memory.
3. It would be more valuable to see whether this method can be built on more diffusion models, such as Stable Diffusion and DeepFloyd-IF. It would also be better to compare the proposed method with current personalization methods based on diffusion models, such as CustomDiffusion and DreamBooth. CustomDiffusion: Multi-Concept Customization of Text-to-Image Diffusion; DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation.
Questions
See Weaknesses.
Thank you very much for the thoughtful and detailed review. We reply point-by-point here, to begin the discussion.
- Poor Presentation
- "All-encompassing arsenal" means that the model is capable of accommodating a wide variety of user requirements.
- As we mentioned at the beginning of the Introduction, large-scale, general-purpose diffusion models contain 3.5 billion or even more parameters. Such a vast number of parameters inevitably leads to inefficiency, which is a consensus in the academic community. Additionally, many users are likely to require only specific functionalities of diffusion models for extended periods. For example, a pet store may only need to create images of pets, so deploying an ultra-large-scale model would be wasteful.
- In Figure 2 of our paper, we explicitly demonstrate the results of using traditional compression algorithms. Due to the inability to identify crucial parameters, these methods make it more challenging to restore model performance.
- The term "minimal computational cost" refers to our aim to significantly reduce the runtime of diffusion models by decreasing the number of parameters, particularly for user-specific tasks. We will also provide detailed and clear explanations in the updated version of our paper.
- We can determine the proportion of parameters in convolutional layers within diffusion models by directly traversing the checkpoints of these models (a sketch is given at the end of this response).
- This is because, in our study, the image synthesis task is specific rather than general-purpose. At the same time, owing to the integration of the Mixture of Experts (MoE) mechanism, our approach can achieve performance equivalent to that of the original model.
- We wish to emphasize that our study is not aimed at comparing the capabilities of acceleration algorithms for diffusion models. Instead, it highlights how our proposed method, MoEDM, can seamlessly integrate with existing acceleration techniques.
Experiment 1. Thank you for pointing out the need for experiments targeting higher resolutions. We will supplement our studies with these additional experiments.
Experiment 2. In Figure 6 of the appendix, we provide details on the memory usage of MoEDM, which does not exceed 10,000 MB and is slightly lower than that of a full-size model. We emphasize that, although our experiments were conducted using an A100 GPU, our method remains effective when sampling with other GPUs.
Experiment 3. Thank you for pointing out the need for experiments on other personalized diffusion models. We will supplement our studies with these additional experiments.
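As a minimal sketch under assumptions (the checkpoint filename is a placeholder, and we use the heuristic that convolution weights are the 4-D tensors in a PyTorch state dict), the proportion can be obtained by traversing the checkpoint:

```python
# Hypothetical sketch: estimate the share of parameters held by convolutional
# layers by walking a diffusion model checkpoint (PyTorch state dict assumed).
import torch

state_dict = torch.load("unet_checkpoint.pt", map_location="cpu")

total = sum(t.numel() for t in state_dict.values())
conv = sum(t.numel() for t in state_dict.values() if t.dim() == 4)

print(f"conv layers hold {conv / total:.1%} of {total:,} parameters")
```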
The authors of this paper introduced a series of techniques aimed at enhancing the efficiency of diffusion models. Summarized briefly, their approach includes:
- Trimming the bottommost layers of the UNET structure, which serves to decrease the total number of parameters.
- Duplicating the remaining layers and then activating these copies selectively, based on the specific timestep in the diffusion process.
- Employing knowledge distillation from alternative diffusion models, which lessens the dependency on high-quality data.
When these strategies are applied together, the refined model exhibits a modest boost in both speed and precision compared to the original baseline, especially after being fine-tuned on a narrow and specific dataset.
Strengths
S1: The challenge of expediting diffusion models is critically vital and possesses a broad spectrum of applications across numerous fields.
S2: The concept of integrating Mixture of Experts (MOE) within diffusion models presents an intriguing avenue for investigation and merits further exploration.
Weaknesses
W1: While the paper contributes to ongoing discussions in the field, the technical novelty could be further strengthened. The concept of layer removal has precedents, such as in the design of SDXL [1]. Additionally, the current application of Mixture of Experts (MOE) seems to follow familiar patterns, similar to those seen in EDiff-I [2], which might benefit from a more rigorous comparative analysis. The use of distillation to enhance data efficiency is an interesting approach, though its integration with the other proposed methods (aiming for efficiency improvements) appears tangential and warrants a clearer rationale.
W2: The experimental validation presented could be more robust. The majority of efficiency gains appear attributable to the removal of intermediate layers, raising questions about the relative contribution of other proposed innovations. The focus on a singular class from ImageNet may not sufficiently demonstrate the model's generalizability. For the text-to-image results, the paper only present a few curated images, and it would benefit from a broader set of comparisons to fully ascertain the model's effectiveness.
W3: The clarity and structure of the paper would greatly benefit from revision. For instance, the processes following the removal of intermediate layers, including whether the model undergoes fine-tuning with the original dataset, are not clearly outlined. A more detailed discussion in the technical sections, perhaps with an expanded explanation of the loss functions used, would enhance the reader's understanding.
In conclusion, the core ideas of the paper are interesting, yet the manuscript would benefit from substantial revisions to meet the acceptance bar. I would encourage the authors to consider this feedback constructively, make the necessary improvements, and consider resubmission to an alternative venue for a future iteration of their work.
[1] Podell, Dustin, et al. "Sdxl: Improving latent diffusion models for high-resolution image synthesis." arXiv preprint arXiv:2307.01952 (2023).
[2] Balaji, Yogesh, et al. "ediffi: Text-to-image diffusion models with an ensemble of expert denoisers." arXiv preprint arXiv:2211.01324 (2022).
Questions
Please find my comments in the Weaknesses section.
Thank you very much for the thoughtful and detailed review. We reply point-by-point here, to begin the discussion.
- W1: We wish to emphasize that while pursuing general, large-scale models, it's imperative to consider the specific needs of users. The long-term use of models for specialized purposes is a common occurrence, yet it is often overlooked by the community. Employing methods like layer removal and MoE are merely tools for achieving this objective.
- W2: Indeed, the removal of intermediate layers significantly enhances efficiency, but the resultant performance loss is not negligible. To address this, we introduce the MoE mechanism to compensate for these performance deficits, as emphasized in our ablation experiments. Furthermore, we will conduct additional experiments on various datasets and present the results in the updated version of our paper.
- W3: Thank you for your comments on the clarity and structure of our paper. In fact, in our ablation experiments we have demonstrated the results of fine-tuning the original model, and MoEDM is trained using the same loss function. We will improve the clarity of the paper's presentation in the updated version.
The paper presents an approach for enhancing the efficiency of diffusion models in image generation, named Mixture of Expert Diffusion Models (MoEDM). Specifically, dynamic routing is employed in MoEDM to activate only indispensable neurons. Hence it achieves faster inference for specialized tasks and reduces computational cost.
Strengths: This work is well motivated, in that improving the efficiency of diffusion models is crucial and could benefit broad applications.
Weaknesses: Novelty is limited, in that several AIGC methods with MoE have been proposed and the differences should be further explained. The ablation study is insufficient to examine the effectiveness of the proposed components. The presentation of this work is poor and should be greatly improved.
Why not a higher score
The novelty is limited, the presentation is unclear, and the ablation study is insufficient.
Why not a lower score
This paper is recommended to be rejected.
Reject