PaperHub
Overall: 5.0/10 · Poster · 3 reviewers
Ratings: 5, 6, 4 (lowest 4, highest 6, std 0.8)
Confidence: 3.7
Correctness: 2.7 · Contribution: 2.7 · Presentation: 2.0
NeurIPS 2024

UDPM: Upsampling Diffusion Probabilistic Models

Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We propose an efficient diffusion model that is based on adding noise + upsampling and a novel training strategy, which leads to state-of-the-art generation results on CIFAR-10 and other datasets

Abstract

Keywords
diffusion models, generative models

Reviews and Discussion

Review (Rating: 5)

This paper introduces a novel generative model called the Upsampling Diffusion Probabilistic Model (UDPM). UDPM aims to decrease the number of diffusion steps needed to generate high-quality images, resulting in significantly improved efficiency compared to previous methods.

Strengths

  1. This paper is well-written, and the organization is great.
  2. The motivation is clear enough.

Weaknesses

  1. Some symbols are not fully explained when they are first used.
  2. The datasets might not be sufficient to fully validate the effectiveness of your method.
  3. The compared methods are relatively out-of-date.
  4. The comparison metric is only FID.
  5. Some commas and labels in several equations are missing.

Questions

  1. In ColdDiffusion, blur and noise can be utilized to train a diffusion model. In your method, downsampling and noise are utilized at the same time to train a diffusion model. Please re-clarify your main contribution except for this.
  2. Please re-clarify the details of how your network handles images with different resolutions.
  3. Can you explain how you balance the weights of $L_{simple}$, $L_{per}$, and $L_{adv}$? Please provide more ablation studies.
  4. Please compare the generation speed with other methods that speed up DDPM.
  5. Please provide more details about your network.
  6. How did the authors verify that 3 steps obtain the best performance?

Limitations

The authors have addressed the limitations and potential negative societal impact of their work.

Author Response

We would like to thank the reviewer for the valuable comments. Our response is detailed below.

Weaknesses:

  1. Will be fixed in the revised version. We thank the reviewer for the effort.
  2. Although the three datasets we evaluated UDPM on are diverse (with CIFAR10 and AFHQv2 being multi-class datasets), we additionally trained UDPM on the LSUN Horses dataset at 128x128 resolution, as shown in the attached rebuttal PDF file.
  3. The methods shown in the paper are state-of-the-art accelerated diffusion-based generative models. We will be glad to compare our method to other works we might have missed. There is a recent class of generative models derived from diffusion named consistency models [1], which can generate images with a single denoising step and an FID score of 8.70 on CIFAR10, which is still inferior to UDPM in both generation quality and efficiency.
    Note that we do not compare UDPM to distillation-based approaches, and specifically to the distillation version of the consistency models, as distillation is complementary to our technique. Thus, we compare only to direct training. Note also that even if distillation-based approaches are taken into account, they require at least a single full denoising step, leading to much longer runtimes than UDPM ($\sim 300\%$).
  4. Following the reviewer's comment, we have the following inception score comparison:
    Denoising student: 8.36
    TDPM: 8.65
    UDPM: 9.01
    This shows that UDPM outperforms current SOTA efficient diffusion models also in the Inception Score measure. In the revised version of the paper we will make sure to add this comparison to Table 1.
  5. We will make sure to polish the paper in the revised version.

Questions:

  1. Indeed, ColdDiffusion proposes multiple approaches for defining the forward diffusion process. However, ColdDiffusion does not address the existence of the reverse diffusion process in its formulation, whereas in UDPM, Lemma 1 allows explicit access to the reverse process defined in Equations (10)-(12).
  2. UDPM needs a network architecture that upsamples its input while remaining aligned with other DDPM works to allow a direct comparison. Therefore we use the popular SongUNet [2] used in many diffusion works, while increasing the number of output channels from $3$ to $3\times\gamma^2$, followed by a depth-to-space layer that upsamples the output by rearranging the pixels (https://pytorch.org/docs/stable/generated/torch.nn.PixelShuffle.html). This architecture makes minimal changes to the network, ensuring minimal additional latency over the baseline, which is reflected in the runtimes we measured.
  3. The practical objective function we use is built from three terms (a minimal sketch of how they are combined appears after this list):
    $\ell_1$: Promotes high fidelity and agreement between the diffusion steps.
    $\ell_{per}$: Guides the network toward more perceptually pleasing estimates.
    $\ell_{adv}$: Complements $\ell_{per}$ by making sure the statistics of the reconstructed variable match the true ones.
    We found the combination of these three terms crucial for obtaining sharp and detailed generations. For instance, using only $\ell_1$ when training UDPM on CIFAR10 we obtained an FID score of $\sim 60$, and adding $\ell_{per}$ brought it to $\sim 30$. In the attached PDF file we show the effect of each loss term on the generation results. We will make sure this ablation study is presented in the revised version of the paper.
  4. For the completeness of the comparison, below we report the average runtimes on CIFAR10 of UDPM with 3 steps, and of DDPM with a single denoising step when a similar network is used. Benchmarked on a single NVIDIA RTX A6000 GPU and averaged over 100K image generations to eliminate any unwanted overhead:
    UDPM: 2008.21 FPS
    DDPM: 735.68 FPS
    Speedup = 2.73x
    The runtimes will be added to Table 1 in the revised version.
  5. We use the popular SongUNet used in many diffusion works, particularly the implementation of EDM [3] (https://github.com/NVlabs/edm/). The specific hyperparameters used are detailed in Table 4 in the supplementary material.
  6. The number of diffusion steps is determined by the resolution of the smallest (noisiest) latent variable the process starts from; once that is chosen, everything else is fixed. From our experiments, we found that going beyond 3 diffusion steps did not benefit the generation quality, hence we use 3 diffusion steps. We will make sure to clarify this in the revised version.
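For illustration, the following minimal PyTorch sketch shows one way the three loss terms could be combined into a single training objective. It is a sketch under stated assumptions, not the implementation from the paper: the weights `lambda_per` and `lambda_adv`, the LPIPS-style perceptual loss, and the `discriminator` interface are placeholders introduced here for demonstration only.

```python
import torch
import torch.nn.functional as F

def udpm_training_loss(x_hat, x_true, discriminator,
                       lambda_per=1.0, lambda_adv=0.1, lpips_fn=None):
    """Hedged sketch of a combined objective: l1 + perceptual + adversarial.

    x_hat:  the network's upsampled/denoised estimate, shape (B, 3, H, W)
    x_true: the corresponding ground-truth diffusion variable
    The weights and the LPIPS/discriminator choices are illustrative only;
    the paper reports its own fixed weights.
    """
    # Fidelity term: promotes agreement between the diffusion steps.
    l1 = F.l1_loss(x_hat, x_true)

    # Perceptual term: guides the network toward perceptually pleasing estimates.
    # lpips_fn could be, e.g., lpips.LPIPS(net="vgg") from the `lpips` package.
    l_per = lpips_fn(x_hat, x_true).mean() if lpips_fn is not None else x_hat.new_zeros(())

    # Adversarial term (non-saturating generator loss): pushes the statistics
    # of the reconstruction toward those of the true variable.
    logits_fake = discriminator(x_hat)
    l_adv = F.softplus(-logits_fake).mean()

    return l1 + lambda_per * l_per + lambda_adv * l_adv
```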

[1] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency Models. ICML 2023.
[2] Yang Song et al. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.
[3] Tero Karras et al. Elucidating the Design Space of Diffusion-Based Generative Models (EDM). NeurIPS 2022.

Given the substantial improvements made in response to your feedback, we kindly request you to reconsider the score you initially assigned to our submission. We believe that the revised version of our paper now better aligns with the high standards of the NeurIPS conference.

Comment

I've thoroughly reviewed the authors' responses and appreciate their thoughtful engagement. I will stay in touch for further discussion as we approach the final rating.

Comment

Dear Reviewer,

Thank you very much for your thoughtful feedback and for acknowledging that we have addressed the concerns you raised in your reviews. We appreciate the time and effort you put into evaluating our work and the constructive comments you provided.

If there are any additional questions or if further clarification is needed, please feel free to let us know. We are happy to provide any further information, and we hope you will consider re-evaluating your score, as for acceptance at NeurIPS the score typically needs to be around 6.

Thank you once again for your valuable input.

Best regards,
The Authors

Review (Rating: 6)

The paper discusses Gaussian diffusion modeling at different dimensionalities by incorporating downsampling in the forward process. As a solution, the authors propose a new model called the Upsampling Diffusion Probabilistic Model (UDPM), which reduces the latent variable dimension before adding noise. The reverse process then gradually denoises and upsamples the latent variable to produce the final image, tackling the computational inefficiency of previous diffusion models such as DDPM. In the experiments, UDPM generates images within 3 stages, which together are computationally cheaper than a single DDPM step, and achieves better results.

Strengths

  • The method is novel and explores a diffusion process across variance scale and dimensionality. The proposed solution is technically sound.

  • The proposed method is computationally efficient compared to previous diffusion models, which were designed for a fixed dimensionality and relied on a subsequent cascade of upsampling models to reach higher dimensions.

  • The paper is overall easy to read and well organized.

Weaknesses

  • The authors do not elaborate on how to determine the number of upsampling stages needed. Moreover, how should the resolutions used in training be chosen to balance performance and computation cost? The authors may need to provide heuristics, theoretical analysis, or empirical studies to guide these choices.

  • The expression "steps $<1$" is not rigorous. Using NFEs (numbers of function evaluations) and GPU time (or FLOPs) at the different resolution stages may be more informative to readers.

  • Some important related works are missing and lack discussion. For example, Simple Diffusion studies the diffusion schedule in terms of the image dimensionality; LEGO Diffusion and Matryoshka Diffusion also discuss diffusion modeling with variable dimensionality, and their solutions are closely related.

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. "simple diffusion: End-to-end diffusion for high resolution images." In International Conference on Machine Learning, pp. 13213-13232. PMLR, 2023.

Huangjie Zheng, Zhendong Wang, Jianbo Yuan, Guanghan Ning, Pengcheng He, Quanzeng You, Hongxia Yang, and Mingyuan Zhou. "Learning stackable and skippable LEGO bricks for efficient, reconfigurable, and variable-resolution diffusion modeling." In The Twelfth International Conference on Learning Representations, 2024.

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M. Susskind, and Navdeep Jaitly. "Matryoshka diffusion models." In The Twelfth International Conference on Learning Representations, 2024.


Some minor issues:

  • Some references are not precisely cited. For example, Adir [1] and Wavelet SGM [12] are missing the conference/journal title; score-sde [36] was published in ICLR 2021, DDGAN [40] was published in ICLR 2022, and TDPM [42] was published in ICLR 2023.

Questions

Please see the Weaknesses.

Limitations

The authors have discussed the limitations and potential negative societal impact of their work.

Author Response

We appreciate the reviewer's valuable points. Our responses to the reviewer's comments are given below.

Weaknesses:

  1. The number of diffusion steps is determined by the resolution of the smallest (noisiest) latent variable the process starts from; once that is chosen, everything else is fixed. From our experiments, we found that going beyond 3 diffusion steps did not benefit the generation quality, hence we use 3 diffusion steps. We will make sure to clarify this in the revised version.

  2. Because the size of the diffusion variables changes throughout the diffusion process in UDPM, it is not possible to directly compare it to the denoising steps of DDPM. Therefore, we compare the total computations required by each algorithm, from which we obtain that UDPM uses $\sim 30\%$ of the computations used in a single denoising step with the same network, or equivalently $\sim 0.3\times$ the time of a single diffusion denoising step.
    Yet, for the completeness of the comparison, below we report the average runtimes on CIFAR10 of UDPM with 3 steps and of DDPM with a single denoising step when a similar network is used. Benchmarked on a single NVIDIA RTX A6000 GPU and averaged over 100K image generations to eliminate any unwanted overhead (a minimal timing sketch illustrating the measurement appears after this list):
    UDPM: 2008.21 FPS
    DDPM: 735.68 FPS
    ⇒ Speedup: 2.73x
    The runtimes will be added to Table 1 in the revised version.

  3. Simple Diffusion: This work investigates the noise scheduling and the network architecture, and their relation to the image's resolution. However, this approach complements our method, since it does not modify the diffusion structure itself and only optimizes the empirical setup.
    LEGO diffusion: This approach proposes a LEGO bricks architecture for improving the training, efficiency, and resolution generalization of diffusion models. This approach is also complementary to UDPM since it can be used alongside our approach for further improvements.
    Matryoshka diffusion: This method proposes to train a diffusion model on different resolutions simultaneously. The effectiveness of this approach shines when the desired generation resolution is relatively high, where it can generate images with resolutions similar to what Stable-Diffusion can produce but with no need for a separate image encoder. This approach however is very different from UDPM, since it does not modify the formulation of the diffusion model itself.
    All the missing references will be added to the revised version and discussed accordingly. We thank the reviewer for the effort.
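As a rough illustration of how such an FPS figure can be measured, here is a minimal timing sketch. The `sample_batch` callable and the batch size are hypothetical placeholders; only the warm-up and GPU-synchronization pattern is the point.

```python
import time
import torch

@torch.no_grad()
def measure_fps(sample_batch, batch_size=500, n_batches=200, device="cuda"):
    """Average images-per-second of a sampler, with proper CUDA synchronization.

    sample_batch(batch_size) is a hypothetical callable that runs the full
    (3-step) UDPM sampler and returns a batch of generated images.
    200 batches of 500 images correspond to the 100K generations mentioned above.
    """
    # Warm-up to exclude one-time kernel/initialization overhead from the timing.
    sample_batch(batch_size)
    torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(n_batches):
        sample_batch(batch_size)
    torch.cuda.synchronize(device)  # wait for all GPU work before stopping the clock
    elapsed = time.perf_counter() - start

    return (batch_size * n_batches) / elapsed  # images (frames) per second
```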

Given the substantial improvements made in response to your feedback, we kindly request you to reconsider the score you initially assigned to our submission. We believe that the revised version of our paper now better aligns with the high standards of the NeurIPS conference.

Comment

I appreciate the authors' efforts in addressing my concerns and questions. After reading the rebuttal, I will keep my positive recommendation, and I suggest the authors incorporate the discussions and additional content of the rebuttal into the final revision.

Comment

Dear Reviewer,

Thank you very much for your thoughtful feedback and for acknowledging that we have addressed the concerns you raised in your reviews. We appreciate the time and effort you put into evaluating our work and the constructive comments you provided.

If there are any additional questions or if further clarification is needed, please feel free to let us know. We are happy to provide any further information, and we hope you will consider re-evaluating your score, as for acceptance at NeurIPS the score typically needs to be around 6.

Thank you once again for your valuable input.

Best regards,
The Authors

Review (Rating: 4)

This paper proposes a new training and sampling scheme for a diffusion model. The motivation is to enhance the effectiveness and interpretability of the diffusion model. Building upon the methods of DDPM, this paper introduces an upsampling operation into the Markov process, enabling the model to denoise and upsample simultaneously. Furthermore, through mathematical derivations, the reliability of this process is demonstrated. Experimental results ultimately show that in certain specific scenarios, the model outperforms existing alternatives.

Strengths

  • The paper is well-written with a clear organizational structure, making it easy to follow.
  • The paper proposes a new diffusion model framework, complete with mathematical derivations, resulting in a loss function analogous to that used in DDPM.
  • The discussion and comparison with related work, such as cold diffusion and soft diffusion, are clearly articulated, effectively highlighting the technical contributions of this paper.

Weaknesses

  • The motivation behind the study is not sufficiently clear, and the interpretability of the model has not been well demonstrated.
  • There is a lack of ablation studies on the loss function. The complexity of the loss, especially with adversarial training, may lead to instability during training.
  • The experiments were conducted only at a 64x64 resolution, leaving the scalability of the method unclear.

Questions

  • What is the computational logic behind steps less than 1 in Table 1? Could you please provide a clear explanation?
  • It appears that this method involves special design considerations for the network structure. What is the actual inference latency of this model compared to baselines?
  • Traditional diffusion models, such as EDM, achieve significantly better results with more sampling steps due to their scalability. How scalable is the proposed method, and how does its performance compare?

Limitations

The authors have adequately addressed the limitations and potential negative societal impact of their work.

Author Response

We are grateful for the reviewer's insightful comments. Our response is provided below.

Weaknesses:

  1. It is well known that diffusion models suffer severely from heavy computations when producing pleasing-looking images, due to two factors: (i) the large number of diffusion steps, and (ii) the large dimensions of the latent variables at each step. Compared to GANs, where a single inference step is used with a latent space much smaller than the generated image size, DDPMs have significant drawbacks. In this paper, we narrow this gap significantly, reducing the computations with the same network by $\sim 65\%$, while outperforming the fastest SOTA diffusion-based models.
    Additionally, it has been shown in many GAN papers that the latent space of the generative model is interpolatable and interpretable, which is not the case for DDPMs, where even a single-step DDPM has a latent space with the same dimensions as the input image, making it very redundant. In UDPM, however, the whole latent space is much smaller than the generated image, making it smooth for interpolation (as shown in Figures 5, 7, and 8) and interpretable (as shown in Figures 6, 9, and 10). To further demonstrate the interpretability of the model, we present an ablation study where we generate an image, then fix two of its three noise maps and perturb/re-randomize the third noise map 128 times to produce 128 different images. We then analyze these images by examining the principal components of their covariance matrix to understand how each diffusion step affects the generated image (see the attached rebuttal PDF file; a minimal sketch of this analysis appears after this list). As can be seen in Figures 1 and 2 of the attached PDF, the noisiest latent variable controls the semantics of the image (i.e., class, pose, background, etc.), while the initial and middle noise levels control the fine details of the generation. As a result, one may obtain behavior similar to what has been shown for StyleGAN simply by modifying the last diffusion variable, as demonstrated in Figure 3.

  2. Complexity: The reverse process in UDPM requires super-resolving the latent variable from the previous step, which necessitates the use of a sophisticated loss term, as shown by previous super-resolution literature [1, 2, 3], which we find crucial in our case for obtaining sharp and detailed results.
    Stability: In contrast to the pure adversarial loss used in GANs, using it as a regularization term that guides the network toward sharp solutions is in fact very stable. The only part that needs tuning is the weight of each term, which, from our experiments, remained fixed to the values reported in the paper for all datasets. To demonstrate the effect of each loss term, in Figure 4 of the attached rebuttal PDF we show how each term contributes to the generation results. This ablation will be added and discussed in the revised version.

  3. As mentioned in the limitations section, training diffusion models requires heavy computational resources, particularly when increasing the image resolution; therefore, due to our limited resources, we leave such exploration to future research. Yet, for completeness, we trained our approach on the LSUN Horses dataset at 128x128 resolution and report the results in the attached rebuttal PDF file.
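The following is a minimal sketch of the fix-two/perturb-one principal-component analysis described in point 1 above. The `generate` interface (a sampler that accepts the three UDPM noise maps explicitly) is a hypothetical placeholder; only the perturbation pattern and the eigen-decomposition are the point.

```python
import torch

@torch.no_grad()
def noise_map_pca(generate, noise_maps, vary_idx, n_samples=128):
    """Analyze how one diffusion step affects the generated image.

    generate(n0, n1, n2) is a hypothetical sampler mapping the three noise maps
    to an image of shape (3, H, W). Two maps are kept fixed, the map at
    `vary_idx` is re-randomized n_samples times, and the principal components
    of the resulting image covariance are returned.
    """
    images = []
    for _ in range(n_samples):
        maps = list(noise_maps)
        maps[vary_idx] = torch.randn_like(maps[vary_idx])  # perturb only one map
        images.append(generate(*maps).flatten())
    X = torch.stack(images)                 # (n_samples, 3*H*W)
    X = X - X.mean(dim=0, keepdim=True)     # center before computing covariance
    # Principal components via SVD of the centered data matrix.
    _, S, Vh = torch.linalg.svd(X, full_matrices=False)
    explained_var = S**2 / (n_samples - 1)
    return explained_var, Vh                # eigenvalues and principal directions
```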

Questions:

  1. Because the size of the diffusion variables changes throughout the diffusion process in UDPM, it is not possible to directly compare it to the denoising steps of DDPM. Therefore, we compare the total computations required by each algorithm, from which we obtain that UDPM uses $\sim 30\%$ of the computations used in a single denoising step with the same network, or equivalently $\sim 0.3\times$ the time of a single diffusion denoising step.

  2. UDPM needs a network architecture that upsamples its input while remaining aligned with other DDPM works to allow a direct comparison. Therefore, we use the popular SongUNet [4] used in many diffusion works, while increasing the number of output channels from $3$ to $3\times\gamma^2$, followed by a depth-to-space layer that upsamples the output by rearranging the pixels (https://pytorch.org/docs/stable/generated/torch.nn.PixelShuffle.html). This architecture makes minimal changes to the network, ensuring minimal additional latency over the baseline, which is reflected in the runtimes we measured (a minimal sketch of this output head appears after this list).

  3. UDPM generalizes traditional denoising diffusion schemes; therefore, one can omit $\mathcal{H}$ in part of the diffusion steps (e.g., the last ones), such that some diffusion steps become denoising without upsampling. This may enable increasing the number of diffusion steps to an arbitrary choice, similar to traditional diffusion. However, we did not examine such an approach as we focused on efficiency, and we leave it for future research.
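To make the architectural change in point 2 concrete, here is a minimal sketch of a depth-to-space output head. The `UpsamplingHead` class and the plain convolution standing in for the SongUNet backbone are assumptions for illustration; only the 3·γ²-channel output followed by `nn.PixelShuffle(γ)` reflects the described design.

```python
import torch
import torch.nn as nn

class UpsamplingHead(nn.Module):
    """Sketch of an output head: predict 3*gamma^2 channels at the input
    resolution, then rearrange them into a gamma-times-larger RGB image."""

    def __init__(self, in_channels: int, gamma: int = 2):
        super().__init__()
        # Stand-in for the last layer of the backbone (SongUNet in the paper).
        self.to_channels = nn.Conv2d(in_channels, 3 * gamma**2, kernel_size=3, padding=1)
        # Depth-to-space: (B, 3*gamma^2, H, W) -> (B, 3, gamma*H, gamma*W).
        self.depth_to_space = nn.PixelShuffle(gamma)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.depth_to_space(self.to_channels(features))

# Example: a 16x16 feature map is mapped to a 32x32 RGB prediction.
head = UpsamplingHead(in_channels=128, gamma=2)
out = head(torch.randn(1, 128, 16, 16))
assert out.shape == (1, 3, 32, 32)
```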

[1] Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
[2] ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks
[3] Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data
[4] Score-Based Generative Modeling through Stochastic Differential Equations

Given the substantial improvements made in response to your feedback, we kindly request you to reconsider the score you initially assigned to our submission. We believe that the revised version of our paper now better aligns with the high standards of the NeurIPS conference.

Comment

Dear Reviewer,

Thank you for taking the time to review our work.

As the discussion period is approaching its conclusion (on August 13, AOE), we kindly ask if you could review our detailed responses to your concerns. We would be happy to address any further questions you might have, and we hope you will consider re-evaluating your score, as for acceptance at NeurIPS the score typically needs to be around 6.

Thank you again for your efforts.

Best regards,
The Authors

Author Response

In the attached PDF file, we present the following ablation studies and results:

  1. Additional demonstration of the interpretability of the model.
  2. Ablation study on the contribution of each loss term.
  3. Additional results on a more diverse dataset with higher resolution.

We hope that these significant improvements will make the reviewers reconsider their initial scores.

Best regards

Final Decision

The paper introduces a generalized approach to the denoising diffusion process called the Upsampling Diffusion Probabilistic Model. The forward process reduces the dimensionality of the latent variable through downsampling, followed by the addition of noise. The reverse process progressively denoises and upsamples the latent variable to generate a sample from the data distribution. The reviewers were rather positive about this work, with many noting the novelty of the proposed approach, its theoretical grounding, and its computational efficiency. Several concerns raised by the reviewers were adequately addressed during the discussion, including the scalability concern of testing only on a small-resolution dataset.