PaperHub
Rating: 5.8 / 10 · Poster · 4 reviewers (min 4, max 7, std 1.1)
Individual ratings: 6, 4, 7, 6
Confidence: 4.0
Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.0
NeurIPS 2024

DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization

OpenReview | PDF
Submitted: 2024-05-06 · Updated: 2024-11-06
TL;DR

A diffusion pruner via few-step gradient optimization without retraining the diffusion model.

Abstract

Keywords
Diffusion, Pruning, Speedup, Gradient optimization, SuperNet

Reviews and Discussion

Review (Rating: 6)

The paper introduces Diffusion Pruning via Few-step Gradient Optimization (DiP-GO), a new pruning method for diffusion models. The method addresses the high computational cost of diffusion models' multi-step denoising process, which hinders their practical use. Traditional pruning methods require resource-intensive, inefficient retraining with large datasets, whereas DiP-GO performs intelligent, dynamic pruning without retraining. The paper builds a SuperNet from a standard diffusion model by adding backup connections between similar features, which turns pruning into a SubNet search and eliminates retraining.

A plugin pruner network is designed to optimize pruning under the given constraints while preserving synthesis quality. The pruner finds an optimal SubNet by identifying redundant computation with carefully designed optimization losses.

A post-processing method ensures the pruned SubNet meets specific requirements, improving pruning efficiency and effectiveness.

Strengths

  1. DiP-GO efficiently predicts the importance scores of computational blocks, allowing for dynamic and intelligent pruning without the need for retraining the diffusion models.
  2. The pruner network employs consistency and sparsity optimization losses to ensure generation quality while minimizing computational usage. This balance between accuracy and efficiency is crucial for effective model pruning.
  3. The authors validated DiP-GO across different diffusion models, including the Stable Diffusion series and DiTs, showing its versatility and robustness. The extensive experiments show that DiP-GO achieves a substantial speedup (up to 4.4× on Stable Diffusion 1.5) without loss of accuracy, outperforming previous state-of-the-art methods.

Weaknesses

  1. The vast search space (due to the large number of blocks and timesteps) may pose challenges, as the method must efficiently navigate and optimize within this space to identify the optimal SubNet.
  2. At higher pruning ratios, there is a risk of performance degradation where the quality of generated images might be compromised. This is particularly concerning for applications requiring high fidelity and detail. As the pruning ratio increases, some patterns in the image content may deviate from those in the original images. Although the main objects typically adhere to the textual conditions, subtle changes in background details can lead to noticeable artifacts.

Questions

  1. How exactly are the additional backup connections in the SuperNet constructed? Are they predefined or learned during the training of the pruner network?
  2. Can you provide detailed computational resources required to train the pruner network? How does this compare to the computational resources required for retraining the diffusion models in traditional pruning methods?
  3. Does the threshold τ vary across different datasets?

Limitations

The authors adequately addressed the limitations.

Author Response

We would like to express our sincere gratitude for the detailed and professional attention you have given to our work during the review process.

Re: Weakness #1

    Yes, the vast search space does introduce challenges to learning an optimal SubNet. We therefore tackle this challenge with a gradient-based optimization method instead of traditional search methods, and design the pruner network with dedicated optimization losses. Table 6 of the main submission shows that our method surpasses search-based methods in both accuracy and efficiency, and Tables 1, 2, 3 and 4 of the main submission show that our method obtains strong SubNets in this search space, which we attribute to the gradient-based optimization. The search space could be reduced by lowering the number of pruning steps or prunable blocks, but doing so may prevent the pruner from reaching the SubNet it would otherwise find and thus sacrifice the accuracy of the pruning result.

Re: Weakness #2

    Yes, although our method can maintain the main object content of images under very high compression ratios, subtle changes in the background still occur. This is a common issue in model acceleration tasks, and it is essential to consider the accuracy-speed trade-off for the deployment scenario when using our method. We also show this phenomenon in Figure 3 of the main submission, and our method can maintain the generation quality with a significant acceleration on SD1.5 with 50-step sampler.

Re: Question #1

    The additional backup connections are predefined in the dense SuperNet before training the pruner network. As shown in Figure 1b of the main submission, a block builds backup connections for its dependent blocks, and each backup connection comes from the corresponding dependent block at the previous timestep. During both training and testing, the pruner network selects either the original connection or the backup connection, but never both simultaneously, based on the importance score it predicts.
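(For illustration only: a minimal PyTorch-style sketch of how such a gated block with a backup connection might be wired. The class and variable names are placeholders, the soft blending during training is an assumption, and this is not the authors' implementation.)

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Hypothetical wrapper around one prunable block of the SuperNet.

    When the pruner keeps the block (gate = 1) the original computation runs;
    when the block is pruned (gate = 0) the backup connection reuses the
    feature this block produced at the previous timestep.
    """

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.cached = None  # feature kept from the previous timestep

    def forward(self, x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        if self.cached is None:
            out = self.block(x)                     # first timestep: nothing to reuse yet
        elif not self.training and float(gate) < 0.5:
            out = self.cached                       # pruned at inference: skip the computation
        else:
            # during pruner training the gate is soft, so both paths are blended
            out = gate * self.block(x) + (1.0 - gate) * self.cached
        self.cached = out
        return out
```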

Re: Question #2

   The training cost of our method is not high. For example, for SD-1.5 with a 50-step sampler, our method requires approximately 2.3 hours of training on a single MI250 GPU, as shown in Table 6. This is significantly less than the cost of traditional pruning methods, which typically require 10% to 20% of the pre-training cost [1], while SD-1.5 pre-training takes about 150K GPU hours [2]. Training a pruner network for DiT with a 250-step sampler takes about 4 hours, which is also efficient.

Re: Question #3

    For different datasets, the threshold τ is kept at 0.2; its value is tied to the required pruning ratio. τ should be set smaller for higher pruning ratios and larger for lower pruning ratios.

[1] Structural Pruning for Diffusion Models. NeurIPS 2023.

[2] Stable Diffusion v1.5 model card on Hugging Face.

Comment

Dear Reviewer JBdU,

Thank you again for your valuable comments. We have tried our best to address your questions (see rebuttal PDF and above), and will carefully revise the manuscript by following suggestions from all reviewers. Please kindly let us know if you have any follow-up questions.

Your insights are crucial for enhancing the quality of our paper, and we would greatly appreciate your response to the issues we have discussed.

Thank you for your time and consideration.

Comment

Dear Reviewer JBdU,

Thank you for reviewing our paper and providing thoughtful feedback. We have addressed your concerns in our rebuttal, which are summarized as follows:

  1. Vast Search Space Challenge: We detailed how our gradient-based optimization method overcomes this challenge and outperforms existing methods, such as GA Search.

  2. Image Quality under Higher Pruning Ratios: We provided a more detailed analysis of our method under high pruning ratios and clarified that considering the accuracy-speed trade-off is crucial for deployment scenarios when using our method.

  3. Clarification: We explained the concept of backup connections in the SuperNet and provided more details about the threshold τ.

  4. Computational Resources: We detailed the computational resources required to train the pruner network, demonstrating that our method is significantly more efficient than traditional pruning methods.

Given these clarifications and the positive aspects of our work that you have acknowledged, we kindly request that you reconsider your evaluation. We believe our research contributes valuable insights to the field and addresses key challenges in diffusion model acceleration.

With approximately 15 hours remaining in the discussion phase, we hope our rebuttal has resolved your concerns. If so, we would appreciate if you could consider raising your score.

If you have any further questions or need additional information, please let us know. We are committed to providing any further clarification needed within the remaining time.

Thank you again for your time and feedback.

Best regards,

Authors

Review (Rating: 4)

This paper proposes a novel differentiable pruner for diffusion models. The core of the approach involves transforming the model pruning process into a SubNet search process.

Strengths

  1. The main idea is to transfer the model pruning process into a SubNet search process, eliminating the need to retrain pretrained diffusion models.
  2. Compared to traditional search methods, this differentiable approach is much more efficient.
  3. Extensive experiments demonstrate the superiority of this method.

Weaknesses

  1. In Figure 2(a), the dimension of the prune queries is T × N × D. The meaning of the dimension D is not discussed earlier in the paper. Could you clarify what D represents?
  2. DiP-GO is tested with 50 steps for SD and 250 steps for DiT, while 20 or 25 steps are more common nowadays. How does your algorithm perform under these more typical step scenarios?
  3. The paper only provides theoretical speedup measurements in terms of MACs or Speedup. What is the actual inference latency?
  4. In Table 3, the faster sampler method is compared under pruned-0.75 with 70 steps. However, under pruned-0.6, the results for the fast sampler method are not shown. Could you explain this omission?
  5. The usual step count for DPM-Solver is 20 or 25 [1,2,3]. Why did the author choose 50 steps in Table 4?
  6. According to Table 4, DiP-GO does not seem to be friendly to LCM.

[1] Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models, arXiv.

[2] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis, ICLR 2024.

[3] Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models, arXiv.

Questions

Please refer to the weaknesses part.

Limitations

Yes. The authors have addressed the limitations and social impacts.

Author Response

We would like to express our sincere gratitude for the detailed and professional attention you have given to our work during the review process.

Re: Weakness #1

    D is the embedding dimension of a learnable query in the pruner network, and the dimension of the prune queries is T × N × D because there are N × T queries, as mentioned in the caption of Figure 2. We clarify its meaning and value in the experimental section on Lines 267-268 of the paper. There is, however, a typo that writes D as d; we will fix it in the revised version.
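(For illustration: a minimal sketch of a query-based pruner head with T × N learnable queries of dimension D. The transformer depth, head count, and score head are illustrative assumptions, not the paper's exact architecture.)

```python
import torch
import torch.nn as nn

class QueryPruner(nn.Module):
    """Predicts one importance score per (timestep, block) pair.

    T: number of sampling steps, N: number of prunable blocks,
    D: embedding dimension of each learnable query (the D in Figure 2a).
    """

    def __init__(self, T: int, N: int, D: int = 256, num_layers: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(T * N, D))    # T x N x D, flattened
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(D, 1)                            # per-query score head
        self.T, self.N = T, N

    def forward(self) -> torch.Tensor:
        h = self.encoder(self.queries.unsqueeze(0))            # (1, T*N, D)
        scores = torch.sigmoid(self.head(h)).view(self.T, self.N)
        return scores                                           # importance scores in (0, 1)

# Hypothetical usage: scores above a threshold keep the block, the rest are pruned.
pruner = QueryPruner(T=50, N=60)
gates = pruner()    # shape (50, 60)
```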

Re: Weakness #2

    We tested our method on SD-1.5 with 50 steps to align with the configuration of DeepCache [1], the main SOTA method for comparison. Additionally, we tested our method on DiT with 250 steps, as specified in the official DiT paper [2]. Our method achieved significant acceleration with minimal loss on DiT with the 250-step sampler, demonstrating the effectiveness of our gradient-based optimization method. For models with fewer steps, it is easier to obtain the optimal SubNet and reduce redundant computation due to the smaller search space. We also tested our method on SD-2.1 with a 25-step DPM-Solver sampler and PixArt-α with a 20-step DPM-Solver sampler, and achieved excellent pruning results, as shown in Table 1 and Table 2 below.

| Model | MACs | Speedup | CLIP Score |
| --- | --- | --- | --- |
| SD-2.1 with 25-step DPM | 19.02 T | 1.0× | 31.59 |
| SD-2.1 Pruned-0.5 (Ours) | 9.51 T | 1.8× | 31.52 |

Table 1: Comparison with a 25-step DPM-Solver sampler for SD model. We evaluate the effectiveness of our methods on COCO2017 validation set.

| Model | MACs | Speedup | CLIP Score |
| --- | --- | --- | --- |
| PixArt-α with 20-step DPM | 85.65 T | 1.0× | 30.43 |
| PixArt-α Pruned-0.4 (Ours) | 51.39 T | 1.6× | 30.41 |

Table 2: Comparison with a 20-step DPM-Solver sampler for diffusion transformer model. We evaluate the effectiveness of our methods on COCO2017 validation set.

Re: Weakness #3

   The speedup reported in the paper is based on actual measurements, not theoretical estimates. Detailed descriptions of the environment and platform used for these measurements can be found on Line 272 of the main submission. The testing protocol follows DeepCache for a fair comparison, and the actual inference latency is presented in Table 3 of the main submission. We will include further details about the measurement method in the revised version.

Re: Weakness #4

We apologize for missing this data. We have now included two additional few-step results in Table 3, and will include these updates in the revised version of the paper.

| Method | Pruning Type | MACs | FID-50K | Speedup |
| --- | --- | --- | --- | --- |
| DiT-XL/2*-250 steps | Baseline | 29.66 T | 2.97 | 1.00× |
| DiT-XL/2*-110 steps | Fast Sampler | 13.05 T | 3.06 | 2.13× |
| DiT-XL/2*-100 steps | Fast Sampler | 11.86 T | 3.17 | 2.46× |
| Ours (DiT-XL/2* w/ Pruned-0.6) | Structured Pruning | 11.86 T | 3.01 | 2.43× |

Table 3: Comparison of fast sampler methods under pruned-0.6.

Re: Weakness #5

    We first chose the 50-step DPM sampler because more sampling steps yield higher quality, and a 50-step DPM sampler is also used in Table 2 of the reference paper [3].

Furthermore, we provide the pruning result on SD-2.1 with a 25-step sampler in Table 1 above. As shown there, our method can prune 50% of the computation with almost no loss.

Re: Weakness #6

   Yes, you are right; Line 298 in the paper explains this. Our method benefits from information redundancy across the multi-step denoising process, whereas the LCM model has less feature redundancy across adjacent timesteps because of its efficiency. For better pruning results in this setting, we think it would be necessary to train the original diffusion model.

[1] DeepCache: Accelerating Diffusion Models for Free, CVPR 2024.

[2] Scalable Diffusion Models with Transformers, ICCV 2023.

[3] Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models, arXiv.

[4] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis, ICLR 2024.

Comment

Dear Reviewer A73E,

We appreciate your thoughtful evaluation and the opportunity to clarify and expand upon key aspects of our work. Based on the detailed responses and additional data provided, which directly address the concerns raised:

  1. We have added more clarification regarding the meaning of the dimension D and the inference latency.
  2. We conducted comparison experiments with SD-2.1 (25 steps) and PixArt-α (20 steps), demonstrating that our model performs effectively in these scenarios.
  3. We included an additional comparison of faster samplers under pruned-0.6 in Table 3, further verifying that our method surpasses these faster samplers.

We hope our responses have addressed your questions satisfactorily. We believe that the additional explanations and data substantiate a higher impact in these areas, reflecting the rigor and potential impact of our work. If our explanations have resolved your concerns, we would be grateful if you could reconsider your rating.

Should you have any further questions regarding our responses, we would be happy to provide additional clarification.

Comment

Dear Reviewer A73E,

Thank you for your careful review of our paper. With approximately 15 hours remaining in the discussion phase, we sincerely hope that our rebuttal has addressed your concerns.

If our responses have clarified the issues you raised, we kindly request that you consider raising your score.

We believe our research makes a valuable contribution to the field and addresses important challenges in diffusion model acceleration.

We greatly value your feedback and have made every effort to provide thorough responses to each of your points. If you have any unresolved questions or require further clarification, please do not hesitate to let us know. We are committed to providing additional information within the remaining time.

Thank you once again for your valuable time and expert opinion. Your feedback is crucial for enhancing the quality of our research.

Best regards,
Authors

Comment

We suspect that there might have been a system issue that caused our previous response to not be sent successfully. Therefore, we are resending our response and hope it hasn’t caused you any inconvenience. Below is the main content:

We would like to express our sincere gratitude for your detailed review.

Re: Weakness #1

D is the embedding dimension of a learnable query in the pruner network, and the dimension of the prune queries is T × N × D because there are N × T queries, as mentioned in the caption of Figure 2. We clarify its meaning and value in the experimental section on Lines 267-268 of the paper. There is, however, a typo that writes D as d; we will fix it in the revised version.

Re: Weakness #2

We tested our method on SD-1.5 with 50 steps to align with the configuration of DeepCache [1], the main SOTA method for comparison. Additionally, we tested our method on DiT with 250 steps, as specified in the official DiT paper [2]. Our method achieved significant acceleration with minimal loss on DiT with the 250-step sampler, demonstrating the effectiveness of our gradient-based optimization method. For models with fewer steps, it is easier to obtain the optimal SubNet and reduce redundant computation due to the smaller search space. We also tested our method on SD-2.1 with a 25-step DPM-Solver sampler and PixArt-α with a 20-step DPM-Solver sampler, and achieved excellent pruning results, as shown in Table 1 and Table 2 below.

| Model | MACs | Speedup | CLIP Score |
| --- | --- | --- | --- |
| SD-2.1 with 25-step DPM | 19.02 T | 1.0× | 31.59 |
| SD-2.1 Pruned-0.5 (Ours) | 9.51 T | 1.8× | 31.52 |

Table 1: Comparison with a 25-step DPM-Solver sampler for SD model. We evaluate the effectiveness of our methods on COCO2017 validation set.

| Model | MACs | Speedup | CLIP Score |
| --- | --- | --- | --- |
| PixArt-α with 20-step DPM | 85.65 T | 1.0× | 30.43 |
| PixArt-α Pruned-0.4 (Ours) | 51.39 T | 1.6× | 30.41 |

Table 2: Comparison with a 20-step DPM-Solver sampler for diffusion transformer model. We evaluate the effectiveness of our methods on COCO2017 validation set.

Re: Weakness #3

The speedup reported in the paper is based on actual measurements, not theoretical estimates. Detailed descriptions of the environment and platform used for these measurements can be found on Line 272 of the main submission. The testing protocol follows DeepCache for a fair comparison, and the actual inference latency is presented in Table 3 of the main submission. We will include further details about the measurement method in the revised version.

Re: Weakness #4

We apologize for missing this data. We have now included two additional few-step results in Table 3, and will include these updates in the revised version of the paper.

| Method | Pruning Type | MACs | FID-50K | Speedup |
| --- | --- | --- | --- | --- |
| DiT-XL/2*-250 steps | Baseline | 29.66 T | 2.97 | 1.00× |
| DiT-XL/2*-110 steps | Fast Sampler | 13.05 T | 3.06 | 2.13× |
| DiT-XL/2*-100 steps | Fast Sampler | 11.86 T | 3.17 | 2.46× |
| Ours (DiT-XL/2* w/ Pruned-0.6) | Structured Pruning | 11.86 T | 3.01 | 2.43× |

Table 3: Comparison of fast sampler methods under pruned-0.6.

Re: Weakness #5

We first chose the 50-step DPM sampler because more sampling steps yield higher quality, and a 50-step DPM sampler is also used in Table 2 of the reference paper [3].

Furthermore, we provide the pruning result on SD-2.1 with a 25-step sampler in Table 1 above. As shown there, our method can prune 50% of the computation with almost no loss.

Re: Weakness #6

Yes, you are right; Line 298 in the paper explains this. Our method benefits from information redundancy across the multi-step denoising process, whereas the LCM model has less feature redundancy across adjacent timesteps because of its efficiency. For better pruning results in this setting, we think it would be necessary to train the original diffusion model.

[1] DeepCache: Accelerating Diffusion Models for Free, CVPR 2024.

[2] Scalable Diffusion Models with Transformers, ICCV 2023.

[3] Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models, arXiv.

[4] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis, ICLR 2024.

We believe that the additional explanations and data substantiate a higher impact in these areas, reflecting the rigor and potential impact of our work. If our explanations have resolved your concerns, we would be grateful if you could reconsider your rating.

Once again, thank you for your valuable time and expert opinion.

Best regards, Authors

Review (Rating: 7)

This paper proposes DiP-GO, a novel pruning method to address the computational cost of diffusion models during inference. The key innovation lies in the creation of a SuperNet, which includes backup connections based on similar features across adjacent timesteps, and a plugin pruner network optimized through gradient optimization to identify whether to keep or remove each computational block. This approach eliminates the need to retrain the diffusion model, significantly reducing computational costs. Extensive experiments on various diffusion models, including the Stable Diffusion series and DiTs, demonstrate that DiP-GO achieves substantial speedups without compromising accuracy.

Strengths

  • The creation of a SuperNet and the transformation of network pruning into a SubNet search, despite its simplicity, significantly improve performance in terms of efficiency and evaluation metrics such as FID and CLIP score.
  • The method avoids retraining diffusion models, saving substantial computational resources and time.
  • The method has been applied to both traditional U-Net-based diffusion models and transformer-based DiTs, validating its effectiveness across different architectures.

Weaknesses

  • As noted by the authors, memory overhead is a potential issue in this work. Although techniques like gradient checkpointing and half-precision floating-point representation are used to mitigate it, they may sacrifice performance.

Questions

  • Feature similarity in fast samplers: The approach relies heavily on the similarity of features across adjacent timesteps. For diffusion models with fast samplers, only a few steps are required, which might result in dissimilar features across adjacent timesteps. A more rigorous demonstration of this similarity would make the claim more convincing.
  • How do different choices of α in Equation (3) impact the overall performance?

Limitations

DiP-GO's limitations lie in the cost of training the pruner and in maintaining performance at extremely high pruning ratios. The authors address the pruner-training limitation, as the training time is still relatively short compared to retraining a model. For extremely high pruning ratios, it is possible that the pruned model is no longer over-parameterized. A theoretical analysis of the upper bound of the pruning ratio would be helpful.

Author Response

We sincerely appreciate the time and effort you invested in reviewing our paper. Your acknowledgment of the significant improvements in efficiency and effectiveness across different architectures is highly valued. We are also grateful for your recognition that our method avoids retraining diffusion models, thereby saving substantial computational resources and time.

We address the raised concerns as follows:

Re Weaknesses #1:

Gradient checkpointing and FP16 mixed precision are standard practices for training diffusion models, as recommended by Hugging Face. These techniques are crucial for long-timestep diffusion models (e.g., DiT, which requires 250 timesteps for inference). However, for models with fewer timesteps (e.g., SD-1.5 with 50 timesteps), these techniques are not necessary.

Additionally, we apply FP16 precision only to the diffusion model during the forward pass to reduce memory overhead, while keeping the pruner network and backward gradients in FP32 to maintain performance. Furthermore, additional acceleration techniques such as DeepSpeed [1] and ZeRO [2] can be employed to further reduce memory usage.
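(Illustrative only: a toy PyTorch snippet showing the kind of precision split described above, with a frozen backbone run under FP16 autocast while the trainable pruner and its gradients stay in FP32. The module names and sizes are placeholders, not the paper's models.)

```python
import torch
import torch.nn as nn

# Toy stand-ins; sizes and names are placeholders, not the paper's models.
backbone = nn.Sequential(nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 64)).cuda().eval()
for p in backbone.parameters():
    p.requires_grad_(False)            # frozen "diffusion" backbone

pruner = nn.Linear(16, 1).cuda()       # trainable pruner head, kept in FP32
query = torch.randn(1, 16, device="cuda")
x = torch.randn(8, 64, device="cuda")

gate = torch.sigmoid(pruner(query))    # FP32 soft gate predicted by the pruner

# The frozen backbone forward runs under FP16 autocast to cut activation memory;
# the pruner's parameters and their gradients remain FP32.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = gate * backbone(x)           # gate modulates the frozen computation

loss = out.float().pow(2).mean()       # placeholder loss
loss.backward()                        # gradients reach only the FP32 pruner
```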

Re Questions #1:

Feature similarity across adjacent timesteps in fast samplers has been confirmed in recent works, whose conclusions we cite in Lines 156-157 of the manuscript. We also analyzed the feature similarity between adjacent steps in the fast sampler. Specifically, we sampled 200 samples from the COCO2017 validation set and calculated the average cosine similarity between the features of the penultimate upsample block across all T steps for two typical fast samplers, creating a T × T similarity matrix shown in Figure 2 (refer to the rebuttal PDF). The heat map in Figure 2 illustrates the high degree of similarity between features of consecutive timesteps.
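(A sketch of how such a T × T cosine-similarity matrix can be computed from features captured at each denoising step; tensor shapes and names are placeholders, not the authors' measurement code.)

```python
import torch
import torch.nn.functional as F

def step_similarity_matrix(features: torch.Tensor) -> torch.Tensor:
    """features: (T, ...) tensor with one captured block feature per sampling step.

    Returns the T x T matrix of pairwise cosine similarities; high values just
    off the diagonal indicate redundancy between adjacent timesteps.
    """
    flat = features.flatten(start_dim=1)      # (T, C*H*W) or (T, C)
    flat = F.normalize(flat, dim=-1)
    return flat @ flat.t()

# Hypothetical usage: features of the penultimate upsample block hooked at each
# of T denoising steps for one prompt; matrices are then averaged over prompts.
T, C, H, W = 25, 1280, 8, 8
feats = torch.randn(T, C, H, W)               # placeholder captured features
sim = step_similarity_matrix(feats)           # (T, T), entries in [-1, 1]
```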

Re Questions #2:

We conducted an ablation study on α and present the results below; our method achieves the best performance when α = 1.0. We will include this additional comparison (Table 1 below) in the next version of the manuscript.

| α | 0.1 | 0.5 | 1.0 | 2.0 |
| --- | --- | --- | --- | --- |
| CLIP Score | 29.77 | 29.93 | 30.29 | 30.17 |

Table 1: Comparison of different α values. Pruning experiments with 80% pruning ratio were conducted on COCO2017 validation using SD-1.5.

[1] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters, KDD 2020, Tutorial.

[2] ZeRO: memory optimizations toward training trillion parameter models, SC 2020.

Comment

Dear Reviewer zGQU,

Thank you again for your valuable comments. We have tried our best to address your questions (see rebuttal PDF and above), and will carefully revise the manuscript by following suggestions from all reviewers. Please kindly let us know if you have any follow-up questions.

Your insights are crucial for enhancing the quality of our paper, and we would greatly appreciate your response to the issues we have discussed.

Thank you for your time and consideration.

Comment

Dear Reviewer zGQU,

We sincerely appreciate the time and effort you have invested in reviewing our paper. Your recognition of the significant improvements in efficiency, evaluation metrics, and effectiveness across diverse architectures has been particularly encouraging.

We have carefully addressed the concerns you raised in our rebuttal, which can be summarized as follows:

  1. Memory Overhead: We provided a more detailed analysis of the memory reduction strategies used in our work and proposed promising solutions for reducing memory overhead while maintaining performance.

  2. Feature Similarity Analysis: We analyzed the feature similarity in faster samplers and demonstrated the high similarity of features across adjacent time steps in these samplers.

  3. More Ablation of α: We conducted additional ablation experiments on the hyperparameter α.

Additionally, we will take your advice and address the theoretical work on the upper bound of the pruning ratio, including it in the final version of our paper.

With approximately 16 hours remaining in the discussion phase, we sincerely hope that our responses have provided the clarity needed and resolved your questions satisfactorily. If our explanations have successfully addressed your concerns, we would be grateful if you could reconsider your evaluation.

Specifically, we kindly request that you consider raising our score if you feel that:

  1. Our rebuttal has adequately addressed your concerns.

  2. The novelty and contributions of our work have been further clarified.

  3. The value and potential impact of our research in the field of diffusion model acceleration have been demonstrated.

Your expert opinion is crucial in the evaluation process, and we truly value your input. If you have any remaining questions or require further information, please do not hesitate to let us know. We are more than willing to provide additional clarification.

Thank you once again for your time and consideration. We look forward to your response.

Best regards,

Authors

Review (Rating: 6)

The paper introduces DiP-GO, a novel pruning method for diffusion models. Unlike the majority of existing methods that require extensive retraining with large datasets, DiP-GO employs (1) a SuperNet with backup connections, and (2) a plugin pruner network to identify redundant computations. The authors formulate a subnetwork searching problem which requires few-step gradient optimization instead of expensive retraining. Extensive experiments demonstrate DiP-GO’s effectiveness, outperforming state-of-the-art methods. The paper also explores compatibility with fast samplers and presents a fair amount of ablation studies.

Strengths

  • The paper tackles a timely and practically-relevant problem supported by a fair amount of experiments. Diffusion model pruning is an area with limited prior research, making this work particularly valuable.
  • The proposed method demonstrates superior performance compared to the baselines, and the paper provides a comprehensive review of relevant previous works.
  • Overall, the paper is well-written and easy to follow.

Weaknesses

  • I wonder why the pruner network takes T × N random learnable queries and prediction heads. If the goal is to leverage the similarity between feature representations from adjacent timesteps, rather than distant relationships such as between timesteps 0 and 1, a more intuitive design might involve using 2N learnable queries and a timestep embedding vector (say, t_emb). The final score could then be averaged or normalized, since there will be duplicated score outputs; e.g., s_t can be calculated from the outputs (s_{t-1}, s_t) and (s_t, s_{t+1}). This approach could result in a smaller network size and faster training for the pruner network. Please correct me for any misunderstanding.
  • Qualitative comparison with respect to baselines such as Diff-Pruning may help to improve the paper. Currently, there are only two figures (Figures 3 and 4).
  • I wonder if there is any pattern in the pruning ratio with respect to the timesteps. For instance, DiP-GO may aggressively prune blocks near t = 0, or there may exist a repeating pruned-block pattern across timesteps. This leads to the question of the design of γ in Equation (3). Did the authors ablate results concerning γ?
  • As the proposed method offers 2-4X speedup, is this method better than naively skipping time-steps, say for every two time-steps?

Questions

  • The covariance constant β in Equation (2) should be defined.
  • In Equation (3), is the FLOPs ratio γ in [0, 1]?
  • In line 250, how is the threshold for the sparsity loss set to 0.2? Is this merely a hyperparameter?

Limitations

The paper provided limitations in Appendix D.

Author Response

We are deeply thankful for the thorough and expert review of our work. Your acknowledgment of the timely and practically-relevant problem our research tackles is greatly appreciated. Your feedback highlights the superior performance and the extensive experiments that were conducted.

Below are our detailed responses to the weaknesses and questions raised:

Re: Weakness #1

We sincerely appreciate your valuable feedback. We believe your concerns primarily stem from two aspects:

  1. Our pruner network uses learnable queries to predict the importance scores of T × N blocks, rather than directly predicting the similarity between adjacent steps, although this importance score is influenced by the similarity between timesteps during training.

  2. The idea of using a time embedding as an alternative to the T dimension is a good design choice for the queries. We incorporated the time embedding as a condition, added it to the N queries, and then conducted experiments with this pruner network. On SD-1.5, it achieved a 29.49 CLIP score at a pruning rate of 80%, which is lower than our DiP-GO with a 30.29 CLIP score. This design helps reduce the number of parameters when T is very large, requiring N × D + D_t × D parameters (where D_t is the time-embedding dimension), while the queries in our method require T × N × D parameters. However, since T is usually a constant, this design does not result in a significantly smaller network size or faster training; the small numeric sketch below illustrates the comparison.
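(To make the parameter comparison concrete, a small numeric sketch with hypothetical sizes; the actual N, D, and D_t used in the paper may differ, and these counts cover only the query tables, not the rest of the pruner network.)

```python
# Hypothetical sizes chosen only to illustrate the two formulas above.
T, N, D, D_t = 50, 60, 256, 256         # steps, prunable blocks, query dim, time-embedding dim

per_step_queries = T * N * D            # T x N x D learnable queries (design in the paper)
timestep_conditioned = N * D + D_t * D  # N shared queries plus a time-embedding projection

print(per_step_queries)                 # 768000
print(timestep_conditioned)             # 80896
```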

Re: Weakness #2

Thank you for the suggestion. We provide qualitative comparisons with the baseline DeepCache, since Diff-Pruning lacks pruning results for SD and DiT, as shown in Figure 3 (refer to the rebuttal PDF). We will add them to the revised version.

Re: Weakness #3

  1. Our method exhibits a specific pattern of pruning ratios with respect to the timesteps. As shown in Figure 1 (refer to the rebuttal pdf), fewer blocks are pruned during the middle denoising stage (approximately between steps 65 and 150), as this is when the image content is rapidly being generated. Conversely, the pruning ratio in the latter stage is higher since the content has already taken shape.

  2. Yes, we conducted an ablation study on γ. Without γ, pruning 80% on SD-1.5 resulted in a CLIP score of 29.50 (with γ: 30.29).

Re: Weakness #4

We also tried simply skipping time steps for every N steps. When N = 2, the CLIP score on COCO was 19.74 (Ours: 30.29), which resulted in much lower speedup and accuracy compared to our method. This indicates that simply skipping steps leads to significant performance loss, while our method of searching for the pruner gate provides a more effective pruning strategy.

Re: Question #1

Thank you for pointing this out. We will refine the definition of the covariance constant β in the final version.

Re: Question #2

Yes. The FLOPs ratio γ is in the range [0, 1].

Re: Question #3

The sparsity threshold 0.2 is a hyperparameter used to prevent the network from overly optimizing the sparsity loss, which could lead to all-zero gate predictions during training.
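(One plausible form of such a thresholded sparsity term, written as a sketch; the exact shape of the loss is an assumption on our part, not the paper's Equation (3).)

```python
import torch

def sparsity_loss(gates: torch.Tensor, keep_ratio: float, tau: float = 0.2) -> torch.Tensor:
    """gates: (T, N) soft gates in (0, 1); keep_ratio: fraction of blocks to keep.

    The hinge at tau stops the optimizer from pushing the mean gate further once
    it is close enough to the target, preventing all-zero gate collapse.
    """
    gap = (gates.mean() - keep_ratio).abs()
    return torch.clamp(gap - tau, min=0.0)

# Example: aiming for a 0.2 keep ratio (80% pruning) on a 50-step, 60-block SuperNet.
gates = torch.sigmoid(torch.randn(50, 60))
loss = sparsity_loss(gates, keep_ratio=0.2)
```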

Comment

Dear Reviewer 1Yei,

Thank you again for your valuable comments. We have tried our best to address your questions (see rebuttal PDF and above), and will carefully revise the manuscript by following suggestions from all reviewers. Please kindly let us know if you have any follow-up questions.

Your insights are crucial for enhancing the quality of our paper, and we would greatly appreciate your response to the issues we have discussed.

Thank you for your time and consideration.

Comment

Dear Reviewer 1Yei,

We appreciate the time and effort you have invested in reviewing our paper. We have provided a comprehensive rebuttal addressing your concerns, which can be summarized as follows:

  1. Alternative Design: We explained the motivation behind our learnable queries design and incorporated your advice by providing an additional comparison with time embedding conditioned queries. This supports the rationale of our design both theoretically and experimentally.

  2. More Qualitative Comparison: We included a more detailed qualitative comparison with respect to baselines (see the rebuttal PDF), demonstrating that our method generates higher fidelity samples with better content consistency compared to the baselines.

  3. Pruning Patterns: We presented the pruning gates of our learned pruner network and analyzed the pruning ratios with respect to the timesteps.

  4. Hyperparameters Ablation: We conducted an additional ablation study on the parameter γ, demonstrating its effectiveness.

  5. Direct Skip Steps Comparison: We evaluated a direct skip step design and found that our method achieves better performance.

  6. Clarity: We provided additional details on the covariance constant β, the FLOPs ratio γ, and the sparsity threshold used in our paper.

Given the comprehensive nature of our response, we kindly request that you review our rebuttal and provide your feedback. Your insights are crucial for a fair evaluation of our work, and we would greatly appreciate your response to the points we've addressed.

With approximately 16 hours remaining in the discussion phase, we sincerely hope our rebuttal has addressed your concerns. If our responses have clarified the issues you raised, we kindly ask you to consider revising your score.

Once again, thank you for your valuable time and expert opinion. Your feedback is crucial in improving the quality of our research.

Best regards

Comment

We suspect that there might have been a system issue that caused our previous response to not be sent successfully. Therefore, we are resending our response and hope it hasn’t caused you any inconvenience. Below is the main content:

We are deeply thankful for the thorough and expert review of our work. Your acknowledgment of the timely and practically-relevant problem our research tackles is greatly appreciated. Your feedback highlights the superior performance and the extensive experiments that were conducted.

Below are our detailed responses to the weaknesses and questions raised:

Re: Weakness #1

We sincerely appreciate your valuable feedback. We believe your concerns primarily stem from two aspects:

  1. Our pruner network uses learnable queries to predict the importance scores of T × N blocks, rather than directly predicting the similarity between adjacent steps, although this importance score is influenced by the similarity between timesteps during training.

  2. The idea of using a time embedding as an alternative to the T dimension is a good design choice for the queries. We incorporated the time embedding as a condition, added it to the N queries, and then conducted experiments with this pruner network. On SD-1.5, it achieved a 29.49 CLIP score at a pruning rate of 80%, which is lower than our DiP-GO with a 30.29 CLIP score. This design helps reduce the number of parameters when T is very large, requiring N × D + D_t × D parameters (where D_t is the time-embedding dimension), while the queries in our method require T × N × D parameters. However, since T is usually a constant, this design does not result in a significantly smaller network size or faster training.

Re: Weakness #2

Thank you for the suggestion. We provide qualitative comparisons with the baseline DeepCache, since Diff-Pruning lacks pruning results for SD and DiT, as shown in Figure 3 (refer to the rebuttal PDF). We will add them to the revised version.

Re: Weakness #3

  1. Our method exhibits a specific pattern of pruning ratios with respect to the timesteps. As shown in Figure 1 (refer to the rebuttal pdf), fewer blocks are pruned during the middle denoising stage (approximately between steps 65 and 150), as this is when the image content is rapidly being generated. Conversely, the pruning ratio in the latter stage is higher since the content has already taken shape.

  2. Yes, we conducted an ablation study on γ. Without γ, pruning 80% on SD-1.5 resulted in a CLIP score of 29.50 (with γ: 30.29).

Re: Weakness #4

We also tried simply skipping time steps for every N steps. When N = 2, the CLIP score on COCO was 19.74 (Ours: 30.29), which resulted in much lower speedup and accuracy compared to our method. This indicates that simply skipping steps leads to significant performance loss, while our method of searching for the pruner gate provides a more effective pruning strategy.

Re: Question #1

Thank you for pointing this out. We will refine the definition of the covariance constant β in the final version.

Re: Question #2

Yes. The FLOPs ratio γ is in the range [0, 1].

Re: Question #3

The sparsity threshold 0.2 is a hyperparameter used to prevent the network from overly optimizing the sparsity loss, which could lead to all-zero gate predictions during training.

We appreciate the time and effort you have invested in reviewing our paper. We have provided a comprehensive rebuttal addressing your concerns, which can be summarized as follows:

  1. Alternative Design: We explained the motivation behind our learnable queries design and incorporated your advice by providing an additional comparison with time embedding conditioned queries. This supports the rationale of our design both theoretically and experimentally.

  2. More Qualitative Comparison: We included a more detailed qualitative comparison with respect to baselines (see the rebuttal PDF), demonstrating that our method generates higher fidelity samples with better content consistency compared to the baselines.

  3. Pruning Patterns: We presented the pruning gates of our learned pruner network and analyzed the pruning ratios with respect to the timesteps.

  4. Hyperparameters Ablation: We conducted an additional ablation study on the parameter γ, demonstrating its effectiveness.

  5. Direct Skip Steps Comparison: We evaluated a direct skip step design and found that our method achieves better performance.

  6. Clarity: We provided additional details on the covariance constant β, the FLOPs ratio γ, and the sparsity threshold used in our paper.

With remaining time in the discussion phase, we sincerely hope our rebuttal has addressed your concerns. If our responses have clarified the issues you raised, we kindly request that you consider raising your score.

Once again, thank you for your valuable time and expert opinion.

Best regards, Authors

Author Response

Thank you to all the reviewers. We mainly upload the images needed for the rebuttal here. Detailed rebuttal responses and tables have already been sent to each reviewer separately.

Final Decision

This paper proposes DiP-GO, a novel differentiable pruner for diffusion models. DiP-GO involves transforming the model pruning process into a search process. Specifically, DiP-GO employs a SuperNet with backup connections, and a pruner network to identify redundant computations. The subnetwork searching problem involves few-step gradient optimization instead of expensive retraining. Extensive experiments demonstrate the effectiveness of DiP-GO. It tackles a practically-important problem. The descriptions of many parts of the paper are not clear and need more details, as suggested by several reviewers. Some of the information from the rebuttal should be included in the final version of the paper.