PaperHub
Rating: 4.8/10 · Rejected · 4 reviewers
Individual ratings: 5, 6, 5, 3 (min 3, max 6, std 1.1)
Confidence: 3.0 · Correctness: 2.5 · Contribution: 1.3 · Presentation: 3.0
ICLR 2025

Sparse-to-Sparse Training of Diffusion Models

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

We introduce sparse-to-sparse training to Diffusion Models, and obtain sparse DMs that are able to match and sometimes outperform the dense versions.

Abstract

Keywords
Diffusion Models · Sparse-to-Sparse Training · Static Sparse Training · Dynamic Sparse Training

Reviews and Discussion

Official Review
Rating: 5

The paper proposes a weight pruning-based sparse-to-sparse Diffusion Model (DM) training method using both static and dynamic sparse pruning techniques. Through experiments on Latent Diffusion and ChiroDiff models, the paper demonstrates that sparse training can achieve similar or improved performance compared to dense training methods while reducing the number of parameters and FLOPs.

Strengths

  1. The motivation for the paper is very clear: the high computational cost of training DMs, which drives the proposal of a prune-based DM training method.
  2. Sparse training is applied across a diverse range of datasets and models, showcasing its versatility.
  3. The experimental results are clearly presented, showing how network sparsity and pruning ratio affect the performance, providing valuable insights into hyperparameter tuning.

Weaknesses

  1. Lack of experiments on full dataset training
    • This paper only uses a portion of the CelebA-HQ and LSUN-Bedrooms datasets for the experiments. However, I believe that the performance of sparse training may decrease on larger datasets due to the reduced expressive power of the model caused by pruning. To fully evaluate the effectiveness of sparse training methods, experiments on larger datasets such as the full ImageNet or the entire LSUN-Bedrooms dataset are needed. Although Appendix C presents results from training on the full CelebA-HQ dataset, with only 30,000 images in total, CelebA-HQ is not large enough to alleviate these concerns.
  2. Lack of evaluation metrics
    • The authors have presented the FID score as the evaluation metric, but relying solely on the FID score to evaluate a Diffusion Model (DM) seems risky. It would be better to additionally present metrics such as the Inception Score (IS) proposed in the Latent Diffusion Model paper.
  3. Dependence on dataset, method, and hyperparameters
    • In Figure 2, only a few methods and sparse rates outperform dense training in CelebA-HQ and Imagenette. Due to the long search time for optimal settings, the reduction in training time mentioned by the authors seems insignificant.
    • Although ChiroDiff shows performance improvements with the QuickDraw dataset, it is hard to say that there are meaningful improvements in performance for KanjiVG and VMNIST. Sparse training lacks robustness across different datasets.
  4. Lack of analysis
    • In Section 4.1, line 376, it is mentioned that, unlike existing supervised learning and GAN models, the DM using the SST method outperforms the DST method. Additional analysis is needed to explain why this different trend is observed.
    • In Table 2, performance is good for QuickDraw but poor for KanjiVG and VMNIST. An analysis of the reasons behind this discrepancy would be useful.
  5. Lack of novelty
    • Without introducing new concepts or ideas, the paper applies the existing sparse-to-sparse training method from supervised learning to Diffusion Models. It would be better to propose a new method optimized for Diffusion Models.
    • The variance of FID scores in Table 1 is overall too large, and the reduction in FLOPS is not significant for Bedrooms and Imagenette.
    • The efficiency gained from reducing inference speed via FLOPS reduction is dependent on hardware.
    • Overall, the time taken to search for methods and hyperparameters seems too long compared to the performance improvements. Proposing methods to reduce the search time would be helpful.
    • In Section 4.3, line 515, a speed-up of 0.57x is mentioned, but it is unclear whether GPU inference time is improved.

Questions

  • Is sparse training effective on larger datasets such as the full LSUN-Bedrooms dataset or ImageNet1k, which are larger than CelebA-HQ?
  • Is there a specific reason for using only the FID score as the evaluation metric? If not, it would be helpful to also include the Inception Score (IS).
  • Could you explain why performance is strong only for QuickDraw in Table 2, but not for KanjiVG and VMNIST? Is there a particular characteristic of the datasets that leads to this?
  • Have you tried using structured sparsity, which removes entire layers, to reduce inference time?
  • In Section 4.3, line 515, could you clarify whether GPU inference speed actually improves by 0.57x as mentioned? Could you provide papers or resources that demonstrate that reducing FLOPS leads to improved inference speed on hardware?
  • For Table 1, would it be possible to conduct experiments that reduce the standard deviation to below 3.0 through hyperparameter tuning on the Bedrooms and Imagenette datasets? The mean + standard deviation for sparse training (for example, 28.79 + 12.65 = 41.44 for Bedrooms Static-DM) is consistently higher than the mean for dense training (31.09 for Bedrooms Dense).
  • Do you have any insights on how to effectively tune hyperparameters such as network sparsity, exploration frequency, pruning rate, and sparse method, beyond random search or grid search?
Comment

Overall, the time taken to search for methods and hyperparameters seems too long compared to the performance improvements. Proposing methods to reduce the search time would be helpful.

Do you have any insights on how to effectively tune hyperparameters such as network sparsity, exploration frequency, pruning rate, and sparse method, beyond random search or grid search?

To further clarify, we are not suggesting that our study should be performed as a hyperparameter search, every time, for each task or dataset. Rather, our aim is to identify safe and effective values that could be used to perform sparse-to-sparse training of DMs. For high performance, the optimal sparsity level seems to be around 25–50%, and we suggest utilizing dynamic sparse training methods with conservative pruning rates, such as p=0.05. In this way, researchers and practitioners can benefit from our findings, without having to invest extensive computational power themselves, which could indeed defeat the purpose of our research.

In Section 4.3, line 515, a speed-up of 0.57x is mentioned, but it is unclear whether GPU inference time is improved.

In Section 4.3, line 515, could you clarify whether GPU inference speed actually improves by 0.57x as mentioned? Could you provide papers or resources that demonstrate that reducing FLOPS leads to improved inference speed on hardware?

Thanks for pointing this out. That sentence was misleading: GPU inference time is not improved. Since GPUs are not yet optimized to support unstructured sparsity at various S ratios, we mimic sparsity using sparse masks. In practice, this means that there are no actual speedups; therefore, in the updated paper we refer to ‘theoretical speedup’. As mentioned in Appendix A, the hardware industry is catching up, so it is only a matter of time until sparse operations can be truly leveraged.
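As an illustration of how unstructured sparsity is typically simulated with masks in PyTorch, a minimal sketch is given below; the class name and the random mask initialization are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """A linear layer whose weights are element-wise masked to simulate
    unstructured sparsity. On current GPUs the masked weights are still stored
    and multiplied as a dense tensor, so the FLOP savings (roughly a factor of
    (1 - S) for sparsity S) remain theoretical rather than wall-clock speedups.
    """

    def __init__(self, in_features: int, out_features: int, sparsity: float = 0.9):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Fixed random mask: keep a (1 - sparsity) fraction of the connections.
        mask = (torch.rand_like(self.linear.weight) > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-apply the mask at every forward pass so pruned weights stay inactive.
        return F.linear(x, self.linear.weight * self.mask, self.linear.bias)
```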

Have you tried using structured sparsity, which removes entire layers, to reduce inference time?

Our focus was on unstructured sparsity due to its ability to maintain high performance even at very high levels of sparsity, as reported in previous work such as [1] and [2].

[1] Evci, U. et al. Rigging the Lottery: Making All Tickets Winners. ArXiv, abs/1911.11134. (2019)

[2] Frankle, J., and Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. arXiv: Learning. (2018)

For Table 1, would it be possible to conduct experiments that reduce the standard deviation to below 3.0 through hyperparameter tuning on the Bedrooms and Imagenette datasets? The mean + standard deviation for sparse training (for example, 28.79 + 12.65 = 41.44 for Bedrooms Static-DM) is consistently higher than the mean for dense training (31.09 for Bedrooms Dense).

We appreciate this suggestion; however, we believe that it is beyond the scope of our study. As previously discussed, the variance observed in the models is similar when comparing Dense and Sparse versions in all cases for a given dataset, suggesting that this variability comes from factors not related to sparsity. We have mentioned this in Section 4.1 of the revised paper.

Comment

Thank you for the detailed response. The authors provided guidelines for optimal sparsity and pruning rates, and they emphasized that there had been no prior attempts to combine these with sparse training. From this perspective, I will raise the score to 5. However, the improvement in test FLOPS remains dependent on the dataset. Additionally, the experiments were conducted only on small-scale datasets, leading to a high standard deviation and raising concerns about the stability of the experiments. Therefore, I will maintain my negative opinion.

Comment

In Section 4.1, line 376, it is mentioned that, unlike existing supervised learning and GAN models, the DM using the SST method outperforms the DST method. Additional analysis is needed to explain why this different trend is observed.

In the paper, we ask whether the observation that DST models consistently outperform SST models, provided there is appropriate parameter exploration, also applies to DMs. To provide further insights into this, we repeated the main experiments for RigL-DM and MagRan-DM using a pruning rate of 0.05 for both Latent Diffusion and ChiroDiff. We have updated the paper accordingly (updated Appendix E).

The results for Latent Diffusion, presented in Table 1, show that reducing the pruning rate leads to considerable improvements in performance for RigL-DM and MagRan-DM. With the previous pruning rate, p=0.5, neither method is able to outperform Static-DM on any of the datasets. With the new pruning rate, p=0.05, both methods outperform it on CelebA-HQ, but only MagRan-DM does so on LSUN-Bedrooms. On Imagenette, Static-DM remains the top method.

Table 1

Dataset | Static-DM, S=0.9 | RigL-DM, S=0.9, p=0.5 | RigL-DM, S=0.9, p=0.05 | MagRan-DM, S=0.9, p=0.5 | MagRan-DM, S=0.9, p=0.05
CelebA-HQ | 52.48 ± 4.88 | 65.65 ± 4.32 | 46.07 ± 11.08 | 60.77 ± 6.58 | 48.39 ± 14.05
LSUN-Bedrooms | 46.18 ± 13.42 | 71.45 ± 18.84 | 58.64 ± 22.88 | 46.22 ± 10.11 | 33.80 ± 3.98
Imagenette | 147.47 ± 7.74 | 168.48 ± 15.15 | 148.93 ± 12.03 | 167.19 ± 8.20 | 159.08 ± 14.68

The results for ChiroDiff, presented in Table 2, show a similar pattern of improvement when using the lower pruning rate, albeit not as significant. Using the new pruning rate, all DST methods were able to surpass the performance of the respective SST model, even if this was previously not the case.

Table 2

Dataset | Static-DM, S=0.9 | RigL-DM, S=0.9, p=0.5 | RigL-DM, S=0.9, p=0.05 | MagRan-DM, S=0.9, p=0.5 | MagRan-DM, S=0.9, p=0.05
Quickdraw | 30.25 ± 0.43 | 30.26 ± 0.63 | 28.84 ± 0.37 | 29.45 ± 0.39 | 28.60 ± 0.37
KanjiVG | 30.75 ± 2.16 | 28.54 ± 0.74 | 29.12 ± 0.57 | 33.02 ± 3.28 | 29.01 ± 1.48
VMNIST | 52.35 ± 0.84 | 54.08 ± 1.57 | 52.25 ± 0.20 | 53.65 ± 0.69 | 51.94 ± 1.12

These findings suggest that, with appropriate parameter exploration, DST methods are able to outperform SST in high sparsity regimes.

Without introducing new concepts or ideas, the paper applies the existing sparse-to-sparse training method from supervised learning to Diffusion Models. It would be better to propose a new method optimized for Diffusion Models.

We would like to restate that the application of existing sparse-to-sparse training methods serves as a foundation to understand how sparsity interacts with DMs. Prior to this application, no information was known about optimal sparsity ratios, algorithms or pruning rate values. This work is a valuable research contribution by itself that can foster follow-up work like proposing new methods that are specifically tailored to DMs.

The variance of FID scores in Table 1 is overall too large, and the reduction in FLOPS is not significant for Bedrooms and Imagenette.

Actually, the variance of all sparse models is similar to the variance observed in their dense versions. The main objective of this paper is not to achieve the best reduction in FLOPS or the lowest FID scores, but to uncover new findings and insights about sparse-to-sparse training of DMs.

The efficiency gained from reducing inference speed via FLOPS reduction is dependent on hardware.

We completely agree. However, we consider this a ubiquitous challenge of sparsity research, rather than a specific weakness of our work. As the interest in unstructured sparsity research has grown in recent years, so has the development of hardware, with companies such as NVIDIA or Cerebras putting time and resources into developing sparsity-friendly hardware. We believe that the hardware industry will catch up to these research advancements.

Comment

Dear reviewer rNhu,

Thank you for your comments and constructive review. Below we address the concerns and questions mentioned.

This paper only uses a portion of the CelebA-HQ and LSUN-Bedrooms datasets for the experiments. However, I believe that the performance of sparse training may decrease on larger datasets due to the reduced expressive power of the model caused by pruning. (...) experiments on larger datasets such as the full ImageNet or the entire LSUN-Bedrooms dataset are needed.

Is sparse training effective on larger datasets such as the full LSUN-Bedrooms dataset or ImageNet1k, which are larger than CelebA-HQ?

As mentioned in Section 3.3.2, due to computing limitations, we could not consider the full versions of CelebA-HQ and LSUN-Bedrooms. However, for subsets thereof, if they are randomly sampled (as we did), then model performance on these sets can provide a reliable indication of its performance on a larger set. This is actually corroborated by our experiments in Appendix C.

Is there a specific reason for using only the FID score as the evaluation metric? If not, it would be helpful to also include the Inception Score (IS).

We decided to base our analysis on FID, as it was the metric that the original Latent Diffusion and ChiroDiff papers employed. We also thought of including Inception Score, but ultimately decided to exclude it, due to it being superseded by FID and also it being better suited to evaluate ImageNet generators [1]. Nonetheless, if the reviewer feels strongly about the inclusion of Inception Scores, we would be happy to do it.

[1] Barratt, S.T. and Sharma, R. A Note on the Inception Score. ArXiv, abs/1801.01973 (2018)

In Figure 2, only a few methods and sparse rates outperform dense training in CelebA-HQ and Imagenette. Due to the long search time for optimal settings, the reduction in training time mentioned by the authors seems insignificant.

We believe there might be some misunderstanding here. We performed several runs using different sparsity levels and algorithms on the same model and dataset, as a way of studying whether stable “safe” sparsity levels and algorithms emerged. We don’t expect/recommend practitioners to do a similar study every time they train a new diffusion model.

In general, we have found safe values that work, independent of specific characteristics of the dataset. Optimal sparsity level for high performance seems to be around 25–50%, and we recommend the utilization of dynamic sparse training methods with conservative pruning rates, such as 0.05. It is our hope that researchers will use these guidelines while training their own sparsified DMs.
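To make the recommended recipe concrete, the sketch below shows roughly what one dynamic sparse training update could look like (magnitude pruning followed by random regrowth, in the spirit of MagRan); the function name and defaults are illustrative assumptions, not the authors' implementation.

```python
import torch

def dst_update(weight: torch.Tensor, mask: torch.Tensor, prune_rate: float = 0.05) -> torch.Tensor:
    """One dynamic sparse training step: magnitude pruning + random regrowth.

    Schematic of a MagRan-style update; the default prune_rate of 0.05 mirrors
    the conservative value recommended above. Returns the updated binary mask.
    """
    flat_mask = mask.clone().flatten()
    flat_weight = weight.detach().abs().flatten()

    active = flat_mask.bool()
    n_update = int(prune_rate * active.sum().item())
    if n_update == 0:
        return mask

    # Prune: deactivate the n_update active connections with the smallest magnitude.
    scores = torch.where(active, flat_weight, torch.full_like(flat_weight, float("inf")))
    drop_idx = torch.topk(scores, n_update, largest=False).indices
    flat_mask[drop_idx] = 0.0

    # Regrow: activate n_update connections chosen at random among the
    # previously inactive ones, keeping the overall sparsity level constant.
    inactive_idx = (~active).nonzero(as_tuple=True)[0]
    grow_idx = inactive_idx[torch.randperm(inactive_idx.numel())[:n_update]]
    flat_mask[grow_idx] = 1.0

    return flat_mask.view_as(mask)
```

In a training loop, an update of this kind would be applied to each sparse layer's mask at a fixed interval, so the sparsity level stays constant throughout training.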

Although ChiroDiff shows performance improvements with the QuickDraw dataset, it is hard to say that there are meaningful improvements in performance for KanjiVG and VMNIST. Sparse training lacks robustness across different datasets.

Could you explain why performance is strong only for QuickDraw in Table 2, but not for KanjiVG and VMNIST? Is there a particular characteristic of the datasets that leads to this?

It is true that, for ChiroDiff, the most notable improvements in FID scores are observed on the QuickDraw dataset. However, we should note that beating the base model’s performance is not the main goal of this work. Still, for all datasets, most of the models with sparsity ≤ 50% obtain FID values comparable to their Dense version, while using much less memory and computation.

In Table 2, performance is good for QuickDraw but poor for KanjiVG and VMNIST. An analysis of the reasons behind this discrepancy would be useful.

Although in the main experiments the sparse models were only able to outperform the Dense version in QuickDraw and KanjiVG, in VMNIST the performance of the best sparse model is comparable to that of the Dense model. In the new experiments using pruning rate=0.05 (mentioned previously, and included in the updated Appendix E), sparse models are also able to outperform the dense in KanjiVG. The reasons behind the difference in performance are likely due to the difference in model size and dataset difficulty.

Comment

We appreciate the time and effort spent reviewing our work and our rebuttal. We thank you for raising your score.

However, the improvement in test FLOPS remains dependent on the dataset.

We want to clarify that improvement in test FLOPs is dependent on the hardware and the architecture of the model, not the dataset.

Official Review
Rating: 6

To enhance the efficiency of both training and sampling of DMs, the paper employs a sparse-to-sparse training technique to develop a lightweight model backbone that can achieve performance comparable to its denser counterpart. Since previous methods primarily focus on the efficiency of sampling in DMs, this paper demonstrates significant advantages by optimizing the diffusion framework for both fast training and sampling speeds. To achieve this goal, this paper proposes two strategies—static and dynamic training—to optimize two state-of-the-art models. Experimental results demonstrate the effectiveness of the proposed sparse-to-sparse training method.

Strengths

  1. This paper investigates a challenging problem in the diffusion framework, as current state-of-the-art methods all require large model backbones to maintain significant generative performance. Therefore, using a lightweight model to achieve comparable modeling ability is meaningful for the generative community.

  2. The experimental results are compelling, as the proposed framework employs a model with small capacity parameters to achieve slightly better performance, highlighting its great potential for reducing sampling latency.

  3. The proposed two training strategies are effective for training the lightweight model backbone, as models optimized with these strategies can match or even surpass the performance of their dense counterparts.

  4. This paper is easy to follow, and the conceptual idea of the main framework is clearly presented.

Weaknesses

  1. The proposed framework appears to be an incremental application with limited novelty. Furthermore, this paper seems to rely on established sparse-to-sparse strategies for optimization without any careful design.

  2. The proposed method is not theoretically guaranteed, which may result in performance variability.

  3. The ablation studies are lacking. The validity of the model would be better established with more experimental results provided.

  4. It is suggested that the format of the references be made uniform, as there are discrepancies between different sections.

Questions

  1. Is there any explanation regarding the design of the sparsity rates for training diffusion models, which appear to be predefined without any intuitive understanding based on specific concepts related to the models? Is it possible to design an adaptive sparsity schedule?

  2. How many GPUs were used in the training process for different DMs? Providing more details about the training settings would greatly enhance the confidence in the proposed framework.

  3. For a given DM, how should the decision be made regarding the training strategy—whether to use static sparsity or dynamic sparsity?

  4. Can you provide some experimental results on the text-to-image task, which is one of the most important practical applications of diffusion models?

Comment

It is suggested that the format of the references be made uniform, as there are discrepancies between different sections.

Thank you for the suggestion. We have revised all references in the updated version.

Is there any explanation regarding the design of the sparsity rates for training diffusion models, which appear to be predefined without any intuitive understanding based on specific concepts related to the models? Is it possible to design an adaptive sparsity schedule?

The application of predefined sparsity ratios is a common practice in the sparse-to-sparse training literature. They served as a solid start for our study, allowing us to observe the baseline behavior of a sparsified DM. Adaptive sparsity schedules have been applied in the context of Time Series Forecasting [1] and could be an interesting direction for a follow-up study, although we should mention that it cannot be straightforwardly applied because [1] was done in the context of supervised learning.

[1] Atashgahi, Z. et al., Adaptive Sparsity Level During Training for Efficient Time Series Forecasting with Transformers. Proc. ECML PKDD (2024)

How many GPUs were used in the training process for different DMs? Providing more details about the training settings would greatly enhance the confidence in the proposed framework.

We have trained each DM using only one GPU. We have used NVIDIA A100 and V100 SXM2 GPUs for training all models. We have added this information to Section 3.3.2 (Experimental Details).

For a given DM, how should the decision be made regarding the training strategy—whether to use static sparsity or dynamic sparsity?

Looking at our findings, dynamic sparse training is a safe choice for DMs. The new experiments using a pruning ratio of 0.05 further corroborate this (detailed in a response above and in the updated Appendix E).

Can you provide some experimental results on the text-to-image task, which is one of the most important practical applications of diffusion models?

We agree that including text-to-image experiments would be an interesting addition to the paper. However, our focus is on unconditional generation, and we hope that the reviewer can understand that the limited space precludes expanding the paper in this way.

Comment

Dear Reviewer 41FH,

Thank you for your comments and constructive review. Below we address the concerns and questions mentioned.

The proposed framework appears to be an incremental application with limited novelty. Furthermore, this paper seems to rely on established sparse-to-sparse strategies for optimization without any careful design.

We understand that the use of established sparse-to-sparse strategies may seem incremental, but we believe that a study like ours was necessary. Previous work could not assess whether sparse-to-sparse training would be beneficial for DMs. Our findings are proof that training DMs in a sparse-to-sparse way is feasible and beneficial, and provides valuable insights for future researchers.

Being the first paper in studying sparse-to-sparse training of DMs, we utilized already established algorithms in order to properly contextualize our study. We actually did carefully design a set of experiments, with the goal of providing insights into the performance of different sparsity levels and SoTA algorithms, and their generalization ability across different backbones and datasets.

The proposed method is not theoretically guaranteed, which may result in performance variability.

Sparse-to-sparse training has been applied to supervised learning, reinforcement learning, and continual learning settings, and it has been proven to be both stable and beneficial. We do, of course, acknowledge the empirical nature of our study, as many other DL works published at ICLR and related venues, and agree on the need for more theoretical guarantees. However, our experiments are comprehensive enough to provide a primer on sparse-to-sparse training of SoTA DMs, which can be very valuable for researchers and practitioners.

The ablation studies are lacking. The validity of the model would be better established with more experimental results provided.

One question that arose from the initial submission was the impact of the chosen pruning rate p=0.5. Thus, we decided to repeat the main experiments for RigL-DM and MagRan-DM, using a pruning rate of 0.05, for all datasets and models.

The results for Latent Diffusion, presented in Table 1, show that reducing the pruning rate leads to considerable improvements in performance for RigL-DM and MagRan-DM, in the high sparsity regime of S=0.9. With the original pruning rate, p=0.5, neither method was able to outperform Static-DM on any of the datasets. With the new pruning rate, p=0.05, both methods outperform it on CelebA-HQ, but only MagRan-DM does so on LSUN-Bedrooms. On Imagenette, Static-DM remains the top method.

Table 1

Dataset | Static-DM, S=0.9 | RigL-DM, S=0.9, p=0.5 | RigL-DM, S=0.9, p=0.05 | MagRan-DM, S=0.9, p=0.5 | MagRan-DM, S=0.9, p=0.05
CelebA-HQ | 52.48 ± 4.88 | 65.65 ± 4.32 | 46.07 ± 11.08 | 60.77 ± 6.58 | 48.39 ± 14.05
LSUN-Bedrooms | 46.18 ± 13.42 | 71.45 ± 18.84 | 58.64 ± 22.88 | 46.22 ± 10.11 | 33.80 ± 3.98
Imagenette | 147.47 ± 7.74 | 168.48 ± 15.15 | 148.93 ± 12.03 | 167.19 ± 8.20 | 159.08 ± 14.68

The results for ChiroDiff, presented in Table 2, show a similar pattern of improvement when using the lower pruning rate, albeit not as significant. The only exception is RigL-DM on KanjiVG, although the difference is very small.

Table 2

Dataset | Static-DM, S=0.9 | RigL-DM, S=0.9, p=0.5 | RigL-DM, S=0.9, p=0.05 | MagRan-DM, S=0.9, p=0.5 | MagRan-DM, S=0.9, p=0.05
Quickdraw | 30.25 ± 0.43 | 30.26 ± 0.63 | 28.84 ± 0.37 | 29.45 ± 0.39 | 28.60 ± 0.37
KanjiVG | 30.75 ± 2.16 | 28.54 ± 0.74 | 29.12 ± 0.57 | 33.02 ± 3.28 | 29.01 ± 1.48
VMNIST | 52.35 ± 0.84 | 54.08 ± 1.57 | 52.25 ± 0.20 | 53.65 ± 0.69 | 51.94 ± 1.12

We have updated the paper with these additional experiments (updated Appendix E).

Comment

Dear Reviewer 41FH,

We would like to thank you for your positive review and detailed comments. As the end of the discussion period is approaching, we would kindly ask if our response has sufficiently addressed your concerns. We would be glad to provide further clarification to any other questions you may have.

Best regards,

Authors

Comment

Thank you for the response!

I still maintain the view that the novelty of the proposed method is limited. In light of the considerable efforts made by the authors, I am keeping the original score.

Comment

Thank you for the time and effort spent on reviewing our work, and for your feedback.

Official Review
Rating: 5

This paper integrated 2 Diffusion Models with 3 Sparse Training methods respectively, with experiments on many datasets to verify the combination of these two things is OK (reducing FLOPs while maintaining good performance, some even outperforming the dense models). This may be helpful for training time, memory, and computational savings of DMs in the future.

Strengths

  1. The combination of Sparse Training and DMs is proven to be effective, which can be used in the future efficient training of regular DMs without affecting other components of training and inference.
  2. Experiments are conducted on many datasets together with extensive analysis, making the methods convincing.
  3. The writing logic is great from my point of view, making readers easy to follow.
  4. The content is rigorous, e.g., good to point out the hardware limitation for sparse matrix operation (Line 56).

Weaknesses

Majors:

  1. The biggest issue is that sparse training and DMs seem not to be coupled: there's no strong desire for me to think the combination of these two is fantastic or compatible naturally, and I also didn't see any apparent problems that would prevent the two from combining easily. It seems like this paper simply uses "Sparse Training + DMs = Sparse DMs", in which both Sparse Training and DMs are ready-made without innovation and without extra tricks in the combination process. As a result, although the paper has some contributions (of experiments and verification), it has NO core novelty.
  2. I don't think the methods take advantage of the unique characteristics of DMs themselves. After all, the denoising phase of DMs parameterizes a neural network $p_\theta$ to approximate the denoising process $q(x_{t-1} \mid x_t)$ (the standard parameterization is recalled after the references below), so DMs can be regarded as "noising process + network backbone (for fitting the denoising process)". The paper applies Sparse Training to denoising backbones; however, the combination of such backbones with Sparse Training or pruning may already have been verified [1] [2].

Minors:

  1. Refs (hyperlinks) can be changed to a different color or use a box, just like most other articles did. It's hard for me to follow the real contents with all the black letters.
  2. It seems that the page number on the first page may be incorrectly hyperlinked.
  3. More introduction should be made to Latent Diffusion and ChiroDiff.

[1] Narang, S., Elsen, E., Diamos, G., & Sengupta, S. (2017). Exploring Sparsity in Recurrent Neural Networks. ArXiv. https://arxiv.org/abs/1704.05119

[2] Rao, K., Chatterjee, S., & Sharma, S. (2022). Weight Pruning-UNet: Weight Pruning UNet with Depth-wise Separable Convolutions for Semantic Segmentation of Kidney Tumors. Journal of Medical Signals and Sensors, 12, 108-113. doi:10.4103/jmss.jmss_108_21
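For context, the standard DDPM parameterization referred to in the second major weakness above can be written as follows (generic notation, not specific to this paper):

```latex
% Forward (noising) step and learned reverse (denoising) step:
q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),
\qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)
% Sparse-to-sparse training prunes the network that parameterizes the denoising
% backbone (\mu_\theta, \Sigma_\theta), while the noising process q is unchanged.
```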

Questions

  1. Why choose these two DMs? AFAIK, ChiroDiff is not a well-known model.
  2. Have experiments been done on other DMs (backbones) to test the generalization? What affects the combination of the two may not be different datasets or different generative tasks, but different network backbone architecture (e.g., U-Net v.s. Bidirectional GRU encoder). More analysis into this?
  3. For Tables 1, 2, and 4, are there criteria or reasons for choosing these specific sparsity ratios $S$? It may be necessary to supplement the ablation study of sparsity ratios $S$ and pruning rate $p$ of the three methods.
Comment

Dear Reviewer mJ8a,

Thank you for your comments and constructive review. Below we address the concerns and questions mentioned.

The biggest issue is that sparse training and DMs seem not to be coupled: there's no strong desire for me to think the combination of these two is fantastic or compatible naturally, and I also didn't see any apparent problems that would prevent the two from combining easily. (...) As a result, although the paper has some contributions (of experiments and verification), it has NO core novelty.

Our paper highlights how sparse-to-sparse methods can be successfully applied to DMs, and reveals interesting insights that can inform researchers and practitioners. Critically, until now it was not clear how sparsity would impact DM training. For example, researchers have suggested that it is safe to work with high pruning rates for other models [1]; however, we have found that this is not the case for DMs, where it is better to be more conservative, especially in high sparsity regimes. We have conducted a new experiment (included in the updated Appendix E) to underline the importance of this finding.

We would like to emphasize that this is the first-of-its-kind study, supported by a comprehensive set of experiments. In sum, our work creates a solid foundation that researchers and practitioners can build upon.

[1] Nowak, A. I. et al., Fantastic weights and how to find them: where to prune in dynamic sparse training. In Advances on Neural Information Processing Systems (2023)

I don't think the methods take advantage of the unique characteristics of DMs itself. (...) the backbones may have been verified of the combination with Sparse Training or pruning [1] [2].

Our work specifically focuses on the application of sparse-to-sparse methods for denoising backbones of DMs. The iterative nature of the denoising process is a characteristic that, to the best of our knowledge, has not been studied before in the literature. We would like to point out that sparse-to-sparse training is a different paradigm from what is proposed in [1], as there, the model starts out as dense, and is increasingly sparsified while training. In [2], a dense model is fully trained and then pruned, which again, is different from sparse-to-sparse training. We also should note that pruning methods are bounded by the performance of the original dense model, whereas with sparse training we can get better performance than their dense counterparts, as shown in Tables 1 and 2.

Refs (hyperlinks) can be changed to a different color (...)

We have now used a blue font color for all hyperref links.

It seems that the page number of the first page can be incorrectly hyperlinked.

Thank you for pointing this out. We have fixed it.

More introduction should be made to Latent Diffusion and ChiroDiff.

We appreciate the feedback and have expanded the description for each model.

Why choose these two DMs? AFAIK, ChiroDiff is not a well-known model.

First, LatentDiffusion is a SoTA and widely used model for image generation [1, 2]. Then, to study the generalization of sparse-to-sparse training, we decided to use a DM with a different backbone that can handle non-image generation. ChiroDiff, which was published at ICLR’23 and provided an open-source implementation, fitted that purpose well.

[1] Blattmann, A. et al., Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

[2] Takagi, Y. and Nishimoto, S., High-resolution image reconstruction with latent diffusion models from human brain activity, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

Have experiments been done on other DMs (backbones) to test the generalization? What affects the combination of the two may not be different datasets or different generative tasks, but different network backbone architecture (e.g., U-Net v.s. Bidirectional GRU encoder). More analysis into this?

Indeed, we have conducted experiments with two different backbone architectures to test the generalization of sparse-to-sparse training of DMs. Specifically, as pointed out by the reviewer, we have tested UNet (LatentDiffusion) and Bidirectional GRU (ChiroDiff). We agree that it may be interesting to analyze more backbones, but we left this exercise as an opportunity for future work.

Comment

For Tables 1, 2, and 4, are there criteria or reasons for choosing these specific sparsity ratios $S$? It may be necessary to supplement the ablation study of sparsity ratios $S$ and pruning rate $p$ of the three methods.

Choosing sparsity rates that cover a comprehensive range of values is common in the literature, see e.g. [1, 2]. We opted to include the extreme cases (0.9 and 0.1), the middle point (0.5), and two intermediate values (0.25 and 0.75). This selection allowed us to provide a detailed overview of the impact and performance of different sparsity scenarios.

Similar to the reviewer, we also questioned the impact of the initially chosen pruning rate (p=0.5). So, after submission, we repeated the main experiments for RigL-DM and MagRan-DM, using a pruning rate of 0.05, for all datasets and models. In general, decreasing the pruning rate has a positive effect on performance. We have updated the paper with these additional experiments (updated Appendix E). A snapshot of the results for Latent Diffusion, presented in Table 1, shows that reducing the pruning rate leads to considerable improvements in performance for RigL-DM and MagRan-DM, in the low sparsity regime of S=0.1, with more models being able to outperform the Dense baseline.

Table 1

Dataset | Dense | Static-DM, S=0.1 | RigL-DM, S=0.1, p=0.5 | RigL-DM, S=0.1, p=0.05 | MagRan-DM, S=0.1, p=0.5 | MagRan-DM, S=0.1, p=0.05
CelebA-HQ | 32.74 ± 3.68 | 34.49 ± 2.12 | 33.00 ± 2.12 | 23.88 ± 16.92 | 35.08 ± 4.66 | 21.28 ± 15.05
LSUN-Bedrooms | 31.09 ± 12.42 | 32.05 ± 16.45 | 37.80 ± 16.45 | 26.20 ± 21.76 | 37.17 ± 7.67 | 18.88 ± 14.37
Imagenette | 123.42 ± 4.25 | 119.92 ± 5.94 | 121.59 ± 6.91 | 125.78 ± 2.61 | 127.29 ± 7.70 | 118.25 ± 8.47

[1] Liu, S. et al., Don’t Be So Dense: Sparse-to-Sparse GAN Training Without Sacrificing Performance. International Journal of Computer Vision (2023)

[2] Sokar, G. et al., Dynamic Sparse Training for Deep Reinforcement Learning. ArXiv, abs/2106.04217 (2021)

Comment

Dear Reviewer mJ8a,

We would like to thank you for your review and valuable comments. As the discussion period is ending soon, we would kindly ask if our response has addressed your concerns. If you have any other questions, we would be glad to provide further clarification.

Best regards,

Authors

Comment

Thanks for the detailed response from the authors. I appreciate the thorough explanations provided for our queries, along with the additional experimental data that has lent more credibility to some of the conclusions. Your significant efforts have inclined me to raise my rating (from 3 to 5). However, concerning the novelty at the core, in alignment with reviewer xPfm, I still maintain my rather non-affirmative evaluation overall.

As all four reviewers have recognized, the primary concern remains that the combination of sparse-to-sparse training with Diffusion Models is incremental, merely merging two existing models, conducting some experiments, and yielding entirely expected results. I believe this may not merit significant attention and could potentially offer limited contributions to the research community in these two domains. In other words, if this work had discovered that sparse-to-sparse training fails entirely due to certain properties of DMs different from other models, that would certainly be a noteworthy conclusion.

In responding to my concerns, the authors highlighted two points that I believe require further elaboration. If these issues can be clarified more effectively (perhaps even becoming the most crucial points of the entire paper), I would kindly request the AC to consider our perspectives collectively and provide a higher evaluation for this work:

  1. The nature of DMs themselves: Why does the iterative nature of the denoising process in DMs pose challenges when applying sparse training to them compared to backbones without such an "iterative nature"? This fundamental difference between DMs and other network architectures was mentioned by the authors in their response to me, but I believe it warrants a more focused analysis.

  2. The properties resulting from the combination of DMs with this sparse training framework: Delve deeper into comparing and analyzing the differences between DMs and other backbones when incorporating the sparse training framework. This comparison can highlight the uniqueness of DMs within this framework, thereby making a more substantial contribution.

Comment

Dear Reviewer mJ8a,

Thank you for the effort and time you have devoted to our work. Below, we respond to the posed questions.

The nature of DMs themselves: Why does the iterative nature of the denoising process in DMs pose challenges when applying sparse training to them compared to backbones without such an "iterative nature"? This fundamental difference between DMs and other network architectures was mentioned by the authors in their response to me, but I believe it warrants a more focused analysis

In our responses, we wanted to highlight that this characteristic is inherent to DMs, as they usually require a slow iterative sampling process (many denoising steps) as compared e.g. to GANs, and that it had never been studied before, as mentioned in the Introduction section. We believe that this characteristic merited further study within the sparse-to-sparse training framework. One interesting finding in our work is that, while performance is stable when comparing models sampled with the same number of timesteps, it is possible to sample from an even sparser model using more timesteps, and obtain images of similar quality (Fig. 4). This finding could be extremely valuable for researchers working in low-budget regimes.

The properties resulting from the combination of DMs with this sparse training framework: Delve deeper into comparing and analyzing the differences between DMs and other backbones when incorporating the sparse training framework. This comparison can highlight the uniqueness of DMs within this framework, thereby making a more substantial contribution.

By presenting results on both Latent Diffusion and ChiroDiff, our goal was to do exactly this, i.e., to analyze how the sparse training framework works in state-of-the-art DMs, independent of the backbone architecture.

We should note that the literature about sparse-to-sparse training of vision-based generative models is not extensive. In fact, we are only aware of two papers [1,2] at this intersection, and they both study GANs. Both GAN papers show that a delicate balance between the sparsity of the generator and the discriminator is needed to achieve good performance. In contrast, our work shows that sparse-to-sparse training for DMs seems to be robust, and obtains performance comparable to the dense versions at sparsity ratios under 75%. In addition, we have found that it is better to use a conservative pruning rate (0.05).

Overall, we believe that our work is a solid, comprehensive study on how sparse-to-sparse training can affect DMs, revealing insights that were not previously known, and providing valuable and practical information for others to build upon our work.

[1] Liu, S. et al., Don’t Be So Dense: Sparse-to-Sparse GAN Training Without Sacrificing Performance. International Journal of Computer Vision (2023).

[2] Wang, Y. et al., Balanced Training for Sparse GANs, Advances in Neural Information Processing Systems (2023)

Official Review
Rating: 3

This paper proposes the use of sparse-to-sparse pretraining for diffusion models. These techniques (specifically those known as unstructured sparsity, where the vertices remain fixed but only edges/connections/weights between neurons are taken to be a subset of a dense network) have shown in prior work that they can boost the performance of a wide variety of deep learning models while theoretically resulting in less FLOPs for both training and inference. This paper applies three different sparse-to-sparse pretraining methods to various diffusion models, showing a slight boost in FID scores on various image datasets while reducing the number of FLOPs.

Strengths

  • The paper is well-written and is easy to follow.
  • The results presented improve over the dense baselines in the majority of datasets/models chosen for experiments.
  • Important explorations are included, such as studying the effect of different percentages of network sparsity and different numbers of denoising steps for inference.
  • Experiments are conducted on various models and datasets, improving confidence on the results.

Weaknesses

The biggest weakness of this paper is that there is virtually nothing new happening. As the paper itself observes in its literature review, prior work has already shown that the sparsity methods explored have already been shown to achieve similar results in generative models, so the results are not surprising either. The contribution in this paper therefore feels very limited: it is showing that using this on diffusion models can result in a small quality boost and (theoretical / hardware-dependent) FLOP reduction. The techniques explored are all from prior work, with seemingly no additional technical challenges on the way to apply them to diffusion models. Please correct me if I am wrong on this (and if so, this would definitely be an important discussion to include in the paper).

It should also be noted that other methods exist where the goal is also FLOP reduction without compromising quality. For example, masked autoencoders (MAE), and more recent work like MicroDiT applying the ideas from MAE to diffusion models, explore dropping out sequence elements entirely from transformer architectures, which can result in immense computational savings in practice with current hardware. The paper needs to better motivate why exploring these specific methods is important, given that the motivation and goals are the same as other methods that can better take advantage / live up to the constraints of modern hardware. In particular, sequence dropout has proven to virtually sacrifice no quality with very drastic dropout rates on image and video domains.

While the improvement of FID scores is certainly a strength of the work given that connections are being pruned, this is insufficient to demonstrate the effectiveness of any method: qualitative comparisons are key, given that the connection between FID and sample quality is not a guarantee (especially when differences are very small). This is an easy fix; the authors can provide many more samples, side-by-side with the baseline models. It is even possible to obtain extremely similar samples, simply via deterministic training and sampling with the same random seed to study the actual results more carefully.

Finally, while the quantity of experiments, datasets and models is appreciated by the reviewer, a less fatal but nonetheless real weakness is that the datasets utilized are of very narrow domain and the results may not transfer to larger settings, which are of key interest to the community. One potential way to improve this would be to show positive results on a traditional dataset that is much more diverse and challenging, such as ImageNet (as opposed to the much smaller Imagenette used in the paper). It is more typical for positive findings on challenging benchmarks like ImageNet to transfer to larger-scale tasks and models, while it is very common for results on small, narrow datasets like CIFAR10 and the datasets used in this work to not carry over to more interesting settings.

Questions

Why are the authors specifically interested in these sparsity methods compared to other existing techniques in the literature that can actually reduce FLOP count and properly utilize hardware? The fixation on these specific sparse-to-sparse methods seems very poorly motivated, but I would welcome clarification on this.

Comment

Dear Reviewer xPfm,

Thank you for your comments and constructive review. Below we address the concerns and questions mentioned.

The biggest weakness of this paper is that there is virtually nothing new happening.

We respectfully disagree. While this paper does not present novel algorithms for sparse-to-sparse training, it is, as stated in the paper, the first study in the context of DMs, presenting novel insights that were not previously known. For example, the existence of literature that analyzes sparse-to-sparse training for GANs does not make this paper less noteworthy, since it cannot be assumed that findings in the GAN context translate directly into DMs, which have different training dynamics and challenges [1].

Our paper shows, through extensive empirical studies, that sparse-to-sparse training of DMs produces models that can match or outperform the dense counterparts, and provides new insights into optimal sparsity level, model performance, and impact of diffusion timesteps. Our results are consistent for two state-of-the-art (SoTA) DMs and over six datasets.

[1] Yang, L. et al. Diffusion Models: A Comprehensive Survey of Methods and Applications. ACM Computing Surveys (2023)

The paper needs to better motivate why exploring these specific methods is important, given that the motivation and goals are the same as other methods that can better take advantage / live up to the constraints of modern hardware.

Why are the authors specifically interested in these sparsity methods compared to other existing techniques in the literature that can actually reduce FLOP count and properly utilize hardware? The fixation on these specific sparse-to-sparse methods seems very poorly motivated, but I would welcome clarification on this.

In a nutshell, sparse models are more efficient yet equally performant as their dense counterparts. Our paper shows the theoretical advantages of sparse-to-sparse training of DMs, as the studied algorithms are currently ahead of their time in terms of hardware implementation. The industry is catching up, though, as we mentioned in Appendix A, with companies like NVIDIA and Cerebras allocating more and more resources to studying sparsity and developing sparsity-friendly hardware. By showing that sparse-to-sparse training of DMs is possible, and that it can lead to superior metrics with stability at specific sparsity levels, we hope to continue to “push” for more investment of time and resources in this topic. Most of the popular image generators available today are based on DMs, which means that plenty of resources are being poured into training these models. Showing that we can reduce these costs can be highly impactful for industry and academia.

Our work therefore looks ahead to what the future of Deep Learning could look like. We believe we can all agree that training large (and growing) models in production is expensive. But it does not need to be that way.

While we acknowledge that there are several other methods that attempt to solve the same problem, the main characteristic of sparse-to-sparse training is that it does not require specific architectural changes, and can potentially be combined with other strategies, such as the ones mentioned by the reviewer.

While the improvement of FID scores is certainly a strength of the work given that connections are being pruned, this is insufficient to demonstrate the effectiveness of any method: qualitative comparisons are key, given that the connection between FID and sample quality is not a guarantee (especially when differences are very small). This is an easy fix; the authors can provide many more samples, side-by-side with the baseline models.

Thank you for the suggestion. We have updated the paper accordingly (see new Appendix G).

Finally, while the quantity of experiments, datasets and models is appreciated by the reviewer, a less fatal but nonetheless real weakness is that the datasets utilized are of very narrow domain and results may not transfer to larger settings, which are of key interest to the community.

We understand this criticism and agree that it would be interesting to test a larger dataset such as ImageNet. We plan to do it (and mention it in Section 4.4), however we believe that the experiments we report in the paper already uncover important insights that can be useful to researchers, such as the trade-off between sparsity rate and performance, and the impact of diffusion timesteps. None of this was previously known in the research literature.

Comment

Dear Reviewer xPfm,

We thank you for your time and constructive comments. As the end of the discussion period approaches, we would kindly ask if our rebuttal has adequately addressed your concerns. We would also be happy to provide further information to answer any other questions that you might have.

Best regards,

Authors

Comment

Thank you for the detailed response and the effort poured into the reviews, including new results and samples. After carefully reading all the reviews and the authors’ responses and revised manuscript, I maintain my negative recommendation.

Some of the concerns raised by this reviewer remain unresolved and are additionally shared by other reviewers. E.g., the lack of core novelty remains a core issue. The authors explain (in their response to reviewer mJ8a) that the work indeed includes novel elements, such as the findings around the levels of sparsity potentially hurting DMs specifically more v.s. other kinds of models, and (explained in the joint response) that with more denoising steps it is possible to slightly improve FID scores. However, as reviewer mJ8a explains better than I did in my review, a core issue is that there do not seem to be any particular challenges or problems to solve in order to just combine sparse training techniques with DMs. If there were significant challenges that the work would solve, it would make the work much more interesting and worth publishing.

Because this does not seem to be the case, I believe that my feedback on broadening the scope a bit to e.g. compare against methods that drop out sequence elements (MicroDiT [1] was my original suggestion; Patch Diffusion [2] is cited in the manuscript) could be an alternative route to significantly improve this paper. In the rebuttal, the authors reply to this point I raised by stating that the main characteristic of sparse-to-sparse training is that it does not require specific architectural changes; but the same could be said about these other methods: the only change is to drop out a fixed number of sequence elements at some point in the architecture, all the way to not including them in the loss function. This does not entail any changes in architecture design. Again, the main reason why exploring these methods could significantly strengthen this work v.s. the particular FLOP-reduction methods considered is that they are much more practically useful (as they can actually reduce FLOPS in practice regardless of hardware used). The goal of both strategies, i.e., reducing training compute requirements, is the same. One caveat here is, it could be argued that there is value in sparse-to-sparse because it can reduce inference FLOPS, but there is a myriad of literature on reducing inference-time compute of DMs with extremely competitive results to date. And as the authors report, sparse inference may need an increased budget in denoising steps to achieve the best results when compared to their dense counterparts, i.e., an increase in FLOPS and also a decrease in parallelizability, so it remains unclear whether sparse inference will be practically useful even with appropriate hardware.

Another unresolved concern (though as pointed in my review, less critical): the findings reported on very small and narrow image datasets with low diversity such as those included in the paper may not translate to more interesting settings. Results on a more diverse image dataset such as ImageNet was additionally suggested by reviewer rNhu as a potential (albeit not the only) way to strengthen the robustness of the findings. This is partially compensated for via results on non-image data, but the core issue of no results on a challenging benchmark remains. Omitting the more common benchmarks (CIFAR10, ImageNet, etc.) also makes comparisons to prior work more challenging (e.g., see my next paragraph on SparseDM, which reports results on exactly these datasets).

Finally, some very closely related work acknowledged in the manuscript but discussed at insufficient length is SparseDM [3]. I understand that this paper employs completely different sparsity methods; however, this puts into question some of the responses to the reviews stating that this is a first-of-its-kind study on DM + sparsity methods. This is also general feedback to improve writing; it is the burden of the authors to explain how they differ from the most similar prior work, and a comparison / finding improved results would have strengthened this work.

[1] Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget. Sehwag et al., 2024

[2] Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models. Wang et al., 2023

[3] SparseDM: Toward Sparse Efficient Diffusion Models. Wang et al., 2024

Comment

Dear Reviewer xPfm,

We really appreciate the time you have spent reviewing our work, and the detailed answer to our rebuttal.

And as the authors report, sparse inference may need an increased budget in denoising steps to achieve the best results when compared to their dense counterparts, i.e., an increase in FLOPS and also a decrease in parallelizability, so it remains unclear whether sparse inference will be practically useful even with appropriate hardware.

We want to clarify that for models with sparsity levels of 25-50%, no increase in diffusion timesteps is required to achieve comparable or better results than their dense version. An increase in timesteps might be necessary only for sparser models (75% and above) to outperform the dense versions.

Omitting the more common benchmarks (CIFAR10, ImageNet, etc.) also makes comparisons to prior work more challenging (e.g., see my next paragraph on SparseDM, which reports results on exactly these datasets).

We have already started a training run on ImageNet using MagRan-DM with 50% sparsity and pruning rate 0.05. The results should be ready in 3 weeks. We will incorporate them into the Appendix.

Finally, some very closely related work acknowledged in the manuscript but discussed at insufficient length is SparseDM [3]. I understand that this paper employs completely different sparsity methods; however, this puts into question some of the responses to the reviews stating that this is a first-of-its-kind study on DM + sparsity methods.

We want to emphasize that, to the best of our knowledge, we are the first doing sparse-to-sparse training of DMs. Please note that SparseDM does not investigate sparse-to-sparse training; instead it uses a pre-trained model (which was trained fully dense), and applies sparsity in a finetuning process. We do value the suggestion of expanding the discussion about SparseDM, and will further highlight the difference in the paper.

Comment

Dear Reviewers,

We want to thank you all for the time spent on reviewing our paper and for the constructive comments and feedback provided. We are pleased that our paper was found to be well-written (Reviewers xPfm, mJ8a), easy to read (Reviewer 41FH), and containing rigorous content (Reviewer mJ8a). We also appreciate that our research has been recognized as well motivated (Reviewer rNhu) and meaningful to the generative modeling community (Reviewer 41FH). We are delighted that our experimental results are considered compelling (Reviewer mJ8a, 41FH) and valuable (Reviewer xPfm, mJ8a, 41FH, rNhu). Finally, we are pleased to read that our paper provides important insights into how to better apply sparse-to-sparse training to DMs (Reviewer rNhu, xPfm).

We have noticed that the most common concern is about the novelty of our work. We recognise that no new methods for sparse-to-sparse training are proposed, however we would like to remark that novelty in scientific research may occur in several ways. As recognized by the reviewers, our paper presents a comprehensive and convincing study that connects two fields that had never been considered in tandem, which is, by itself, novel since there is no previous paper published in the research literature. We provide interesting insights about the trade-off between performance and sparsity levels: in most cases, at least 50% of the model connections can be removed with minimal performance loss. Our paper also explores the iterative aspect of inference in DMs: there is now evidence that sparser models sampled using an increased amount of timesteps can produce higher quality samples than a dense model. Taken together, these findings can be extremely valuable for researchers and practitioners, especially those working on a low computational budget.

Overall, we consider that our comprehensive set of experiments is a solid first step into training sparse DMs from scratch. We hope that reviewers will recognize that scaling this study is challenging, as this is a resource-intensive paper, due to the nature of DMs (mostly Latent Diffusion). We also hope that reviewers will reconsider raising their scores, as we truly believe that our results provide meaningful and valuable information to the community.

Below we discuss the individual concerns and indicate how we have addressed them in the revised version of the paper, which you can find attached. We have highlighted the changes in magenta font color to facilitate re-review.

The Authors

AC Meta-Review

This paper utilizes sparse-to-sparse training for diffusion models, which improves both their training and inference efficiency. Both static and dynamic sparse training were examined, in which the neuron connections are kept fixed or updated during training, respectively. An interesting observation is that, at certain sparsity levels, a sparse model can slightly outperform a dense model in terms of FID scores. But the improvement is marginal, and in general a sparser network leads to inferior models.

The biggest concern among reviewers is the lack of novelty, as this paper simply utilizes existing methods of sparse-to-sparse training in the context of diffusion models. There are no insights or theoretical analysis on the tradeoff between efficiency and image fidelity. Even the only reviewer who gave a positive rating considered the novelty limited. All reviewers raised major concerns about the technical novelty of this work.

The method is still technically sound. This paper is better suited for venues emphasizing technical correctness rather than technical novelty, such as Transactions on Machine Learning Research. The authors are encouraged to explore other strategies for publishing this paper.

Additional Comments from the Reviewer Discussion

All four reviewers raised concerns about novelty and were not convinced by the rebuttal. Reviewer xPfm suggested a comparison to another line of work using dropout for diffusion models, which was not provided in the submission or the rebuttal. Reviewer rNhu suggested experiments on full-scale datasets and also more evaluation metrics beyond FID.

Final Decision

Reject