Physics Informed Distillation for Diffusion Models
We introduce a novel distillation approach for diffusion models, heavily inspired by Physics Informed Neural Networks (PINNs), enabling single-step image generation.
Abstract
Reviews and Discussion
This paper proposes a knowledge distillation method for score-based generative models, using a PINN method for ODE systems. As indicated in the original score-based model paper, for each generative SDE there is a corresponding probability flow ODE. While it is expensive and difficult to distill an SDE model, it is much easier to distill an ODE model. The authors use existing methods from the PINN literature to distill a probability flow trained with score matching. The empirical results are promising, but some theoretical assumptions, such as the score function being Lipschitz continuous, are a bit far-fetched.
Strengths
The empirical results from this paper look promising for a single-step generation model. The presentation is clear.
Weaknesses
The literature of PINNs on ODE systems is very mature, and the paper did not propose innovations on how to better perform PINNs on ODE systems. Rather, it is simply applying the PINN techniques to ODE systems, limiting the contribution to developing a technique for distilling score based generative models.
This would be fine if the paper achieved state-of-the-art performance (meaning it beats all previous benchmarks); however, as shown in Tables 1, 2, and 3 and Figure 7, this is not the case. The existing methods, consistency model and EDM (which is heavily referenced when developing this paper), perform much better than the proposed method.
That being said, it would be great if the authors could demonstrate qualitatively and quantitatively how the proposed method is better than EDM and/or the consistency model (such as distillation efficiency, inference time, etc.).
Questions
I am concerned with the validity of the assumption that the score model is Lipschitz continuous. How often is this the case? If we do enforce this property, how much will it hurt the performance?
Details of Ethics Concerns
None
We appreciate your time in reviewing our paper. Below, we address your concerns:
Questions: The assumption that the score model is Lipschitz continuous.
The main use of the Lipschitz continuity assumption is to guarantee the existence and uniqueness of the ODE solution modelled by the diffusion model, as well as the convergence of numerical ODE solvers. More specifically, the global truncation error proofs of many ODE solvers used in the diffusion literature rely on such Lipschitz continuity assumptions [1]. As such, this assumption appears throughout the diffusion model literature, either explicitly, as seen in Theorem 3.2 of DPM-Solver [2], or implicitly whenever ODE solvers are used to generate solutions and their convergence is taken for granted. Note that unlike GAN approaches, this assumption is not strong: we only require the model to be Lipschitz continuous so that valid ODE solutions exist, and do not need to restrict the Lipschitz constant of the model.
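To make the role of this assumption concrete, here is the standard form of the global truncation error bound for the explicit Euler method (a textbook result, see e.g. Butcher [1]; the constants are generic and not specific to our paper):

$$
\|x(t_N) - x_N\| \;\le\; \frac{hM}{2L}\left(e^{L(t_N - t_0)} - 1\right),
$$

where $h$ is the step size, $L$ is the Lipschitz constant of the ODE drift, and $M$ bounds $\|x''(t)\|$ on the interval. The bound, and hence convergence as $h \to 0$, only holds when $L$ is finite, which is exactly what the Lipschitz assumption provides.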
Addressing the weaknesses raised
Probability flow ODEs can be viewed as a special type of ODE whose Lipschitz constant explodes near the origin [2]. To train a PINN for such an ODE system, two key innovations were used. First, the change of variable from Equation 8 to 9 prevents the loss value from exploding from one iteration to the next when points near the singularity are sampled. Second, we adopt a time discretization similar to that of EDM [3]. In black-box ODE solvers such as the Euler method, the truncation error depends on the Lipschitz constant of the ODE system, so this error grows as the Lipschitz constant increases near the singularity point. To account for this, the EDM time discretization was used, which samples more points near the singularity. To the best of our knowledge, solving ODE systems on intervals near such a Lipschitz singularity has never been considered in the PINN literature. As such, we believe our approach could also contribute to the PINN literature when such ODE systems are encountered.
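As an illustration of the second point, below is a minimal sketch of the EDM-style time discretization [3]; the default `sigma_min`, `sigma_max`, and `rho` values are the ones reported by Karras et al., and the function name is ours:

```python
import numpy as np

def edm_time_steps(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """EDM-style noise-level schedule (Karras et al., 2022).

    Interpolating linearly in sigma**(1/rho) and raising back to the power rho
    concentrates the n sampled time points near sigma_min, i.e. near the
    Lipschitz singularity of the probability flow ODE.
    """
    i = np.arange(n)
    return (sigma_max ** (1 / rho)
            + i / (n - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

# Example: 18 levels from sigma_max down to sigma_min, densest near sigma_min.
print(edm_time_steps(18))
```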
With regard to the performance of PID, while our approach does not beat the current SOTA method, it is important to note that our method achieves competitive performance, surpassing several recent works while resolving issues inherent in prior methodologies. Note that the EDM model is the teacher diffusion model used in many recent distillation works, as seen in Tables 1 and 2, and thus exhibits a significantly slower sampling speed. Interestingly, our method exhibits a predictable trend with respect to hyperparameters. As demonstrated in Figure 3 of our main paper, PID performance consistently improves with increasing discretization, saturating once the discretization number is sufficiently large. In contrast, CM [4] is sensitive to the choice of discretization number, where values higher or lower than the optimal point result in performance deterioration. As such, our approach does not need to search for hyperparameters such as discretization numbers from dataset to dataset, allowing us to set it at a sufficiently high number.
[1] Butcher, John Charles. Numerical methods for ordinary differential equations. John Wiley & Sons, 2016.
[2] Yang, Zhantao, et al. "Eliminating Lipschitz Singularities in Diffusion Models." arXiv:2306.11251 (2023).
[3] Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." NeurIPS (2022): 26565-26577.
[4] Song, Yang, et al. "Consistency models." ICML 2023.
Thanks for the response.
As the authors claimed, the main innovation w.r.t. the PINN techniques is two-fold: modifying the loss function and adopting the EDM discretization, both to account for the singularities. It is true that this has not been considered before, but these contributions are not adequate, as they mainly apply existing techniques rather than developing new ones.
As for the performance comparison with existing works, it is interesting to see that the approach does not need to tune the number of discretizations. However, this problem has been extensively studied by the Neural ODE community before, and can be solved by simply training with a high discretization number [1, 2].
Therefore, I will keep my score as is.
[1] Ott, Katharina, et al. "ResNet After All? Neural ODEs and Their Numerical Solution." (2020).
[2] Sander, Michael E., et al. "Do Residual Neural Networks discretize Neural Ordinary Differential Equations?" arXiv:2205.14612 (2022).
Thank you very much for your response. I am slightly confused as to how the topic of Neural ODEs relates to the topic of distillation. To my understanding, diffusion models during sampling can be viewed as a type of Neural ODE (Score SDE [1]). As such, just as increasing the NFE in a diffusion model increases performance, increasing discretization in a Neural ODE context increases the NFE and thereby improves performance. While this removes the need to choose a suitable discretization, since increasing it always produces improvement, to my understanding a Neural ODE, due to its iterative nature, cannot be viewed as single-step sampling. Therefore, while I comprehend the relationship between diffusion models and Neural ODEs, I struggle to see how the iterative nature of Neural ODEs aligns with single-step sampling distillation approaches, where increasing discretization has no bearing on sampling costs as it does not impact the NFE. Could you elaborate further on this aspect to aid my understanding of your concerns?
Thank you for your assistance and understanding in clarifying this matter.
[1] Song, Yang, et al. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).
Thank you for the response.
It is true that during the sampling phase, the proposed method is a single-step method, whereas Neural ODEs take many steps. However, during the training phase, as pointed out in Algorithm 1, it still needs to discretize the dynamics with discretization number N, hence the connection to Neural ODEs.
Hope this clarifies the matter.
Thank you for your clarification on this matter. I would like to elaborate on the distinction between the discretization approaches employed in Neural ODEs and our proposed method. In Neural ODE discretization, the utilisation of Neural ODE as a vector field involves sampling more steps, akin to diffusion. Thus, increasing discretization has been properly explored and has been shown to improve performance. However, in the context of knowledge distillation approaches for diffusion models, such as Consistency Models, it is noteworthy that increased discretization does not necessarily yield enhanced performance and necessitates careful tuning. In our approach, we learn the entire trajectory function, with the vector field being implicitly acquired. As such, increased discretization corresponds to querying points on this trajectory function, incurring no additional cost during training or inference. As our work is in the field of single step sampling, sampling efficiency is of great importance as iterative methods would defeat the purpose of distillation. Therefore, in the context of fast sampling methods, the comparison between discretization in the iterative Neural ODE and discretization in our method cannot be made on an apples-to-apples basis.
The authors propose Physics Informed Distillation (PID), a method for distillation of a teacher diffusion model up to single-step image generation. Inspired by models from the Physics Informed Neural Network (PINN) architecture, the method trains a student model to approximately satisfy the probability flow ODE induced by the teacher diffusion model. To speed up the training process, the residual loss is approximated using a first-order Euler discretization step, instead of having to apply backpropagation. A theoretical analysis of the discretization error is also provided.
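As a rough illustration of the training objective described above, the sketch below assembles a finite-difference PINN residual for the EDM-parameterized probability flow ODE dx/dt = (x - D(x, t)) / t. The function names, the L2 metric (the paper reports LPIPS), and the stop-gradient placement are assumptions made for illustration, not the authors' exact implementation:

```python
import torch

def pid_residual_loss(student, teacher, z, t, dt):
    """Finite-difference PINN residual for the probability flow ODE
    dx/dt = (x - D(x, t)) / t  (EDM parameterization, assumed here).

    student(z, t): predicts the ODE trajectory value x(t) for noise sample z.
    teacher(x, t): pre-trained EDM-style denoiser D(x, t).
    t, dt:         broadcastable noise-level tensors (dt is a small step).
    """
    x_t = student(z, t)                          # point on the learned trajectory
    with torch.no_grad():                        # target branch: no gradient (assumed)
        drift = (x_t - teacher(x_t, t)) / t      # teacher-defined ODE drift at (x_t, t)
        target = x_t + dt * drift                # one explicit Euler step of size dt
    x_next = student(z, t + dt)                  # trajectory value a small step later
    # Forward-difference residual: (x_next - x_t)/dt should match the drift,
    # which is equivalent to matching x_next against the Euler-step target.
    return torch.mean((x_next - target) ** 2)
```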
Strengths
- The authors make an interesting connection between distillation of diffusion models and PINNs via enforcement of the probability flow ODE.
- The authors propose PID, a relatively simple method for distillation, which shows results comparable to state-of-the-art single-step image generation for CIFAR10 and ImageNet64.
- The paper is generally well-written and clear.
- The PID distillation method achieves results comparable to current state-of-the-art single-step image generation methods (1) using an arguably simpler method involving fewer hyperparameters/training tricks.
- Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models.
Weaknesses
- The specific parameterization of the PID model (Eqn. 7) seems somewhat undermotivated to me. The authors mention they take inspiration from the two common approaches to enforcing boundary conditions with PINNs, soft and strict conditions. However, beyond this high-level explanation, the parameterization is not justified and no ablations are performed.
- Similarly, a first-order numerical approximation of the residual loss is proposed for the sake of efficient training, but no ablations are performed as to how much this discretization affects the performance.
- A bit more background about PINNs, especially the soft enforcement of boundary conditions, could be helpful to better motivate the authors' choice of model parameterization.
Questions
- Is there a reason why a combination of soft and hard enforcement of boundary conditions is necessary?
- Once a first-order discretization scheme for the residual loss is chosen, the student model training looks very similar in form to existing distillation techniques (1, 2). How is PID related to these techniques, e.g. can they be described as a special case of PID given a specific choice of time discretization of the probability flow ODE?
- How does the first-order approximation of the residual error loss compare to training directly using the ODE (backpropagation), or with using a higher-order approximation?
- What is the benefit of enforcing the probability flow ODE in a PINN-inspired way, as opposed to an operator learning method such as (3)?
- Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models.
- Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.
- Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning.
Question 1: Why is a combination of soft and hard enforcement of boundary conditions necessary?
In our approach, we only use hard conditions to satisfy the boundary conditions. In detail, we use skip connections to automatically satisfy the boundary conditions and thus do not explicitly need to train boundary conditions through soft conditioning. It is important to note however that despite not training for boundary conditions, the gradients at the boundary are still trained. To elaborate, we train our student model such that the gradient at the boundary corresponds well with the probability flow ODE.
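For concreteness, here is a minimal sketch of this kind of hard boundary enforcement; the specific coefficient functions are assumptions for illustration, and the paper's exact parameterization is given in its Eq. 7:

$$
\mathbf{x}_\theta(\mathbf{z}, t) \;=\; c_{\text{skip}}(t)\,\mathbf{z} \;+\; c_{\text{out}}(t)\,F_\theta(\mathbf{z}, t),
\qquad c_{\text{skip}}(T) = 1,\quad c_{\text{out}}(T) = 0,
$$

so that $\mathbf{x}_\theta(\mathbf{z}, T) = \mathbf{z}$ holds identically for any network output $F_\theta$, and no separate boundary loss term is needed.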
Question 2: The possible connections between PID and existing distillation[2,3].
While the loss after first-order discretization looks somewhat similar, one fundamental difference is present. Notably, in existing distillation techniques [2, 3], the teacher model imparts knowledge to the student model without direct interaction. Specifically, the teacher model is fed with noised data, and the output of the student model is not incorporated back into the teacher model. In contrast, our approach, due to its physics-informed formulation, uses the output of the student model as the input to the teacher model instead of feeding in data directly. This distinction allows our method to learn in a data-free manner, as commonly observed in the Physics-Informed Neural Network (PINN) literature. However, due to this difference, PID and existing distillation approaches unfortunately cannot be connected through considering different discretization schemes.
Question 3: Compare the first-order approximation with forward mode auto differentiation or a higher-order approximation.
In Section 4.2, we mention that in the PINN literature, it has been observed that exact gradients computed with forward-mode automatic differentiation lead to convergence to unphysical solutions [1]. This is primarily because automatic differentiation relies only on a single point and can thus cause a form of overfitting in the PINN training paradigm, whereas numerical methods rely on local points and do not exhibit this issue. Similarly, we observe that exact gradients perform poorly for training and therefore resort to first-order gradient approximations. As for higher-order numerical approximations, methods such as the central difference provide a further performance gain, as they better approximate the gradients while still using local evaluation points. We present these results in the table below.
| Method | FID |
|---|---|
| PID (1st order) | 3.92 |
| PID (2nd order - Central difference) | 3.68 |
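For reference, the standard accuracy orders behind this comparison (a textbook fact, not specific to our setup) are

$$
\frac{\mathbf{x}_\theta(t+h) - \mathbf{x}_\theta(t)}{h} = \frac{d\mathbf{x}_\theta}{dt} + \mathcal{O}(h),
\qquad
\frac{\mathbf{x}_\theta(t+h) - \mathbf{x}_\theta(t-h)}{2h} = \frac{d\mathbf{x}_\theta}{dt} + \mathcal{O}(h^2),
$$

and both use only local evaluation points, preserving the property that motivated numerical gradients in the first place.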
Question 4: PINN-based ODE solutions vs operator learning methods such as DSNO
In DSNO [4], operator learning is defined as learning a mapping between functions. Formulated in this manner, operator learning in a diffusion context involves mapping the constant function of noise to the trajectory function. To learn this mapping, DSNO requires multiple function evaluations at different time steps in a parallel manner instead of the recursive one in diffusion models. As such, despite being able to perform single-step inference to generate images, the parallel evaluations at different time steps cause DSNO to be slower than standard single-step inference, as seen in Table 3 of their paper. In addition, the use of multiple time evaluations per element in the batch also incurs a significantly higher cost during training than posing the problem in a physics-informed fashion, where only single-point evaluations are used. It is important to note, however, that the numerical gradients used for training in a physics-informed fashion do incur extra cost, although not as much as the multiple parallel evaluations in DSNO. Finally, operator learning as described in DSNO requires a trajectory dataset. This can be very memory-consuming and expensive to produce, as also mentioned in Consistency Models [2]. In contrast, posing the problem in a physics-informed fashion is data-free, as only noise samples are needed, alleviating these problems entirely.
[1] Pao-Hsiung Chiu, Jian Cheng Wong, Chinchun Ooi, My Ha Dao, and Yew-Soon Ong. Can-pinn: A fast physics-informed neural network based on coupled-automatic–numerical differentiation method. Computer Methods in Applied Mechanics and Engineering, 2022
[2] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. ICML 2023.
[3] Salimans, Tim, and Jonathan Ho. "Progressive distillation for fast sampling of diffusion models." arXiv preprint arXiv:2202.00512 (2022).
[4] Zheng, Hongkai, et al. "Fast sampling of diffusion models via operator learning." International Conference on Machine Learning. PMLR, 2023.
Thanks for the response. Most of my concerns/questions have been addressed and the discussion has been clarifying. I appreciate the data-free distillation as a key contribution of this work, but due to a similar framework as the prior work BOOT [1] as Reviewer sGVD points out, I will keep my score as is.
- Gu, Jiatao, et al. "Boot: Data-free distillation of denoising diffusion models with bootstrapping."
Thank you for taking the time to read and review our paper. Please see our responses below:
Weakness 1: Motivation of the parameterization used by PID (Eq.7).
We apologize for the lack of explanation on the parameterization used in PID. We have added a detailed reasoning of this scheme in Appendix A.2 of the revised manuscript.
Weakness 2: Effect of numerical differentiation on performance.
In section 4.2, we mention a study on PINNs that uses exact vs numerical approximations of the gradient during training[1]. One key observation in that paper is that with exact gradients, the model is vulnerable to converging to unphysical solutions, having low loss but modeling a significantly different solution than the exact solution. As such, the authors of that paper argue that leveraging local points via numerical approximations can alleviate this issue. Similarly, we observe a similar situation where using exact gradients performs poorly in practice and resort to using numerical gradients instead.
Weakness 3: Background for the soft enforcement of boundary conditions
In Section 3.2, we explain that PINN methods often use either soft or hard conditions to satisfy the boundary condition. However, as our method exclusively uses hard conditions through skip connections, we did not include detailed explanations on how soft conditions in PINNs are enforced.
[1] Pao-Hsiung Chiu, Jian Cheng Wong, Chinchun Ooi, My Ha Dao, and Yew-Soon Ong. Can-pinn: A fast physics-informed neural network based on coupled-automatic–numerical differentiation method. Computer Methods in Applied Mechanics and Engineering, 2022
Thank you for your encouraging response. We would like to clarify that our submission, as it is within four months of the BOOT [1] paper's publication at the ICML workshop on June 20, can be regarded as concurrent work, which aligns with the ICLR reviewer guidelines. We kindly request the reviewer's consideration of this timing and sincerely appreciate your understanding in this regard.
[1] Gu, Jiatao, et al. "Boot: Data-free distillation of denoising diffusion models with bootstrapping." ICML 2023 Workshop on Structured Probabilistic Inference {&} Generative Modeling. 2023.
The proposed Physics-Informed Distillation (PID) distills a student model from a teacher diffusion model for single-step image generation. PID aims to solve the probability flow ODE by reducing the equation residual loss, which is approximated by a finite difference method. The theoretical results show that PID matches the Euler method if it achieves zero loss. PID shows competitive single-step image generation performance on CIFAR10 and ImageNet64.
Strengths
- The single-step generation ability of PID is competitive on CIFAR10.
- The training cost per step of PID is smaller compared to PD and CD.
- The training of PID does not require any extra data.
Weaknesses
- PID cannot further improve the sample quality by using more NFEs. It is limited to single-step generation, where the performance is not that impressive.
- Equations 8 and 9 are equivalent up to a scaling factor for the L2 metric, but not for an arbitrary distance metric such as LPIPS, which is used for the main results. Changing from Equation 8 to 9 will change the loss landscape. However, this step is not justified or explained in the paper. Why not use the original PINN loss given by Equation 8?
- The authors choose the LPIPS metric in the paragraph before Theorem 1. However, Theorem 1 is fully based on the L2 metric, which is a bit confusing. I do not see how Theorem 1 can extend to the LPIPS metric.
Questions
- Theorem 1 shows that PID will be equivalent to the Euler method with the same number of discretization steps N if the PID loss is zero. Can you also report the FID of the corresponding Euler method with the same N? This may help us understand the underlying gap between the learned model and the actual Euler method.
- The central difference method is more accurate than the forward difference that is used in the paper. Would it be beneficial to use the central difference for PID?
We thank the reviewer for their detailed feedback. We address your concerns as below:
Weakness 2: Changing the loss from Equation 8 to 9 will change the loss landscape. However, this step is not justified or explained in the paper. Why not use the original PINN loss given by Equation 8?
We apologize for the lack of clarification when moving from Equation 8 to 9. We have added a detailed explanation in the revised manuscript. The primary motivation for this change is the Lipschitz explosion of the probability flow ODE, where the ODE system diverges for small time values. Since training in a physics-informed fashion matches the gradients of the network to the values of the ODE system, directly training on an exploding system leads to poor performance. To resolve this issue, we transform the loss into a more stable system whose values do not explode near this singularity point.
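Since Equations 8 and 9 are not reproduced in this thread, the following is only an illustration of the kind of rescaling described, written for the EDM-parameterized probability flow ODE (an assumption about the exact form used in the paper). The raw residual

$$
r(t) \;=\; \frac{d\mathbf{x}_\theta}{dt} \;-\; \frac{\mathbf{x}_\theta - D_\phi(\mathbf{x}_\theta, t)}{t}
$$

inherits the $1/t$ factor and blows up as $t \to 0$, whereas the rescaled residual $t\,r(t) = t\,\frac{d\mathbf{x}_\theta}{dt} - \big(\mathbf{x}_\theta - D_\phi(\mathbf{x}_\theta, t)\big)$ stays bounded near the singularity while sharing the same zero set for $t > 0$.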
Weakness 3: Theorem 1 is entirely based on L2 metric
In Theorem 1, we assumed that the metric used was a valid distance metric, thus possessing specific properties. As such, the proof in Theorem 1 is not specific to the L2 metric but applies to any valid distance metric. It is important to note that, regardless of the distance metric used for training, the truncation error is still expressed in L2. This is not due to the training metric but rather to the assumption of Lipschitz continuity of the diffusion model with respect to the L2 metric.
Question 1: FID using the Euler method with the same discretization setup.
The FID value of EDM[2] when sampling with an Euler solver in 250 time steps is given in the following table. Additionally, we have also incorporated this table in the updated manuscript.
| Sampler | CIFAR-10 FID | ImageNet 64x64 FID |
|---|---|---|
| EDM - Base | 2.04 | 2.44 |
| EDM - Euler | 2.10 | 2.41 |
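For context, here is a minimal sketch of what the "EDM - Euler" row entails, namely deterministic Euler integration of the EDM probability flow ODE; the helper name and the appended final level of 0 are assumptions:

```python
import torch

def euler_sample(teacher, z, sigmas):
    """Deterministic Euler sampling of dx/dsigma = (x - D(x, sigma)) / sigma.

    teacher(x, sigma): EDM-style denoiser D(x, sigma).
    z:                 standard Gaussian noise with the image shape.
    sigmas:            decreasing noise levels (e.g. 250 of them), ending at 0.
    """
    x = z * sigmas[0]                              # start from pure noise at sigma_max
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - teacher(x, s_cur)) / s_cur        # ODE drift at the current level
        x = x + (s_next - s_cur) * d               # explicit Euler step to s_next
    return x
```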
Question 2: Central difference for PID.
The table below presents the results of using the central difference approach instead of the first-order approach proposed in the main paper. Indeed, since second-order gradient approximations such as the central difference method provide better gradient approximations, this leads to a further gain in performance.
| Method | FID |
|---|---|
| PID (1st order) | 3.92 |
| PID (2nd order - Central difference) | 3.68 |
[1] Yang, Zhantao, et al. "Eliminating Lipschitz Singularities in Diffusion Models." arXiv:2306.11251 (2023).
[2] Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." NeurIPS (2022): 26565-26577.
Thanks for the authors' response. Most concerns in my review have been addressed. Considering its similarity to the BOOT [1] paper, as pointed out by another reviewer, I will lower the score on novelty. However, I think the proposed PID implements the idea in a better way that leads to better performance. I keep my rating of 6 but with slightly less confidence.
[1]: Gu, Jiatao, et al. "Boot: Data-free distillation of denoising diffusion models with bootstrapping." ICML 2023 Workshop on Structured Probabilistic Inference {&} Generative Modeling. 2023.
Thank you for your response. We hope to clarify that our submission, falling within four months of the BOOT [1] paper's publication at the ICML workshop on June 20, can be considered concurrent work in line with the ICLR reviewer guidelines. We kindly request the reviewer's consideration of this matter and thank you in advance for your understanding.
[1] Gu, Jiatao, et al. "Boot: Data-free distillation of denoising diffusion models with bootstrapping." ICML 2023 Workshop on Structured Probabilistic Inference {&} Generative Modeling. 2023.
This work proposed a PINN-based distillation technique for single-step diffusion sampling. The output of the trained network is equivalent to the integral of the diffusion ODE, and thus the sampling procedure can be equivalently rewritten to an ODE-solving problem with PINN methods. Experiments show that the proposed method can achieve comparable sampling results to other distillation methods such as consistency distillation.
Strengths
- The proposed method and the corresponding numerical methods can achieve comparable results to other distillation methods such as consistency distillation.
- The presentation is easy to follow and the algorithms are quite neat.
Weaknesses
- Major:
  - Lack of an important related work: BOOT [1]. The proposed method seems almost exactly the same as BOOT, because they both distill the integral from time to time , with the same integral and numerical differentiation method. Please compare with BOOT in detail and discuss the paper's own contributions more thoroughly.
- Minor:
  - The results in Table 1 are unfair. Some of the results are based on the checkpoint of the VPSDE in ScoreSDE [2], while others are based on the checkpoint of EDM [3]. The authors should at least split them into two parts.
  - The results of DPM-Solver in Table 1 do not seem to be the best results; e.g., please see Table 1 in [4], which includes "dpm-solver-fast".
  - A small typo: please use instead of .
[1] Gu, Jiatao, et al. "Boot: Data-free distillation of denoising diffusion models with bootstrapping." ICML 2023 Workshop on Structured Probabilistic Inference {&} Generative Modeling. 2023.
[2] Song, Yang, et al. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).
[3] Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in Neural Information Processing Systems 35 (2022): 26565-26577.
[4] Song, Yang, et al. "Consistency models." (2023).
Questions
Please discuss BOOT in detail and highlight your own contributions and differences.
========================
I've read the authors' rebuttal and I think the comparison between BOOT and the proposed method is fair. However, I still think there are many similarities between BOOT and the proposed work because BOOT is the first work which combines the PINN loss with diffusion distillation. So I raised my score to 5.
========================
Thanks for pointing out the definition of concurrent work in the review guide. I think although they can be considered concurrent work, the two works are too similar, and the authors need to discuss them more in detail, instead of only comparing the FID results. But I respect and follow the review guide because it is not a reason for rejection. However, the final results of this work are still not promising. So I consider raising my score to 6.
We are grateful for taking the time to review our paper. We present our response to your concerns below.
Question: Similarities and Differences with concurrent work BOOT[1]
We apologize sincerely for not including BOOT, as we were previously unaware of its existence. We have since added it to Tables 1 and 2 of the revised manuscript for comparison. It is notable that both our proposed approach and BOOT [1] represent concurrent works on distillation of diffusion models for fast sampling. We present the similarities and differences in detail as follows.
Similarities:
- Same starting motivation: Both works recognize that distillation can be achieved by reducing the loss associated with the differential equation of the probability flow ODE system.
- Same use of first-order numerical gradient approximations: Both methods resort to first-order numerical gradient approximations instead of forward-mode automatic differentiation. It is important to highlight that, unlike BOOT, our decision is not solely based on ease of implementation. Rather, our motivation stems from insights in the Physics-Informed Neural Network (PINN) literature: numerical gradient approximations have been shown to prevent convergence to unphysical solutions in ODE systems, in contrast to auto-differentiation approaches [2], by relying on local points rather than a single point, as discussed in Section 4.2 of our main paper.
Differences:
- Teacher model training (signal ODE distillation vs. EDM-based approach): A key distinction lies in the absence of the signal ODE for distillation in our method. Notably, BOOT performs distillation using the signal ODE system instead of the conventional probability flow ODE system. This difference is significant because it requires them to train a new signal-prediction teacher model (refer to Table 3 in their appendix, where a signal-prediction model is employed as the teacher), rather than relying on pre-trained teacher models, as in our case. As such, they incur the additional training cost associated with such teacher models.
- Divergent boundary approaches (our method vs. BOOT): A key discrepancy between the two methods is the treatment of boundary conditions. Because BOOT opts to solve the signal ODE, it cannot easily satisfy the boundary conditions through simple skip connections and consequently resorts to training with soft boundary conditions. In contrast, because we use the base ODE system modelled by the diffusion model, the boundary conditions can easily be satisfied through skip connections. The benefit of this is twofold. First, the soft conditions in BOOT incur additional training cost through the introduction of a boundary-condition loss. Second, because soft conditions are used, the boundary conditions are not always satisfied; as a result, reducing the loss associated with the differential equation may not lead to valid solutions. We observe this in the revised Tables 1 and 2 of our manuscript, where we outperform BOOT on both datasets.
In summary, while the starting motivations of our work and BOOT are similar, there are notable differences in the training schemes that provide the following benefits:
- We are able to use pre-trained teacher models for distillation, while BOOT trains its own signal-prediction teacher model.
- We do not need to train for boundary conditions, while BOOT has to due to its soft conditions, incurring a slight additional training cost.
- Our boundary condition is always satisfied, unlike in BOOT, so reducing the PINN loss yields valid solutions to the probability flow ODE, resulting in improved FID values across all datasets.
Weaknesses: Manuscript update and minor typos.
Thank you for meticulously reviewing our paper. We have partitioned the result comparison in the main result table based on the distillation teacher model in our revised manuscript. Additionally, we have incorporated the results from DPM-Solver-fast into the main result table and fixed the typos accordingly.
[1] Gu, Jiatao, et al. "Boot: Data-free distillation of denoising diffusion models with bootstrapping." ICML 2023 Workshop.
[2] Pao-Hsiung Chiu, Jian Cheng Wong, Chinchun Ooi, My Ha Dao, and Yew-Soon Ong. Can-pinn: A fast physics-informed neural network based on coupled-automatic–numerical differentiation method. Computer Methods in Applied Mechanics and Engineering, 2022
[3] Lu, Cheng, et al. "DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps." NeurIPS (2022): 5775-5787.
I've read the authors' rebuttal and I think the comparison between BOOT and the proposed method is fair. However, I still think there are many similarities between BOOT and the proposed work, and BOOT is the first work which combines the PINN loss with diffusion distillation. So I raised my score to 5.
Thank you for your kind response. We would like to draw your attention to a point in the ICLR reviewer guidelines: works published only at peer-reviewed venues such as online conferences and workshops within four months of submission are considered concurrent work. As the BOOT [1] paper was published at the ICML workshop on June 20, this places it within the four-month timeframe leading up to our submission on September 28. It is our understanding that, in accordance with this timeline, our work should be regarded as concurrent. We kindly request the reviewer's consideration of this matter and thank you in advance for your understanding.
[1] Gu, Jiatao, et al. "Boot: Data-free distillation of denoising diffusion models with bootstrapping." ICML 2023 Workshop on Structured Probabilistic Inference {&} Generative Modeling. 2023.
Thanks for pointing out the definition of concurrent work in the review guide. I think although they can be considered concurrent work, the two works are too similar, and the authors need to discuss them more in detail, instead of only comparing the FID results. But I respect and follow the review guide because it is not a reason for rejection. However, the final results of this work are still not promising. So I consider raising my score to 6.
This paper proposes a "knowledge distillation" method for score-based generative models, in which the student model learns to generate the trajectory of the teacher model (a diffusion ODE) using PINNs in one shot, which reduces inference time. The paper proposes two techniques to address numerical issues. One is to change the loss to avoid exploding at 0, and another is to use a finite-difference approximation to replace the automatic differentiation shown by previous work to converge to unphysical solutions. The paper provides a theoretical result to justify the distillation loss, and it compares the proposed method to many diffusion generative models, including several single-step ones.
Why Not a Higher Score
Some issues pointed out by the reviewers:
- Limited performance improvement compared to other single-step generation methods.
- The main novelty comes from adapting existing techniques.
- The proposed method is limited to single-step generation and cannot improve quality even if more steps are allowed.
- Limitations due to theoretical assumptions.
- Slightly worse performance when compared with Consistency Model.
Also, while the related work by Gu et al. cannot be used as the reason for rejection, reviewers suggested better discussion due to the apparent similarity between the two works.
Why Not a Lower Score
N/A
Reject