PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
We train a one-step text-to-image generator whose resolution grows progressively. For that, we only need low-resolution diffusion models.
Abstract
Reviews and Discussion
The paper proposes PaGoDA, an adversarial distillation method to support single-step generation of image resolutions higher than the teacher diffusion model. PaGoDA first solves the forward PF-ODE (data to noise) to collect noise from the dataset. Then, it gradually adds upsampling layers to the model and trains only those layers. The training objectives include reconstruction loss and GAN loss, and for text-to-image experiments, additional distillation loss and CLIP regularization are introduced to support classifier-free guidance. PaGoDA demonstrates strong one-step generation performance on ImageNet and COCO benchmarks and shows its superiority through various ablation studies.
Strengths
Originality
- This is the first study to support a higher resolution than the teacher diffusion model when distilling a diffusion model into a single-step generator.
- It is also the first study to use the forward PF-ODE to collect noise-data pairs for distillation.
Significance
- Despite using a GAN loss, the training is stable due to the progressive growing strategy, as shown by the good performance at various resolutions in Table 2.
- By using the forward PF-ODE for distillation, it learns the true data distribution and achieves higher sample diversity compared to GANs, as evidenced by the recall performance in Table 2.
- PaGoDA is shown to be applicable to the pixel-space text-to-image model, DeepFloyd.
Quality
- The paper is well-written and includes detailed implementation details in the appendix for reproduction.
- Various ablation studies demonstrate the superiority of each loss and of the proposed upsampling method compared to the VAE decoder of latent diffusion models.
Weaknesses
- PaGoDA is claimed to have a faster sampling speed compared to LCM, but wouldn't the sampling speed be faster with a smaller VAE [A]? Wouldn't LCM be more efficient than PaGoDA if the VAE decoder has fewer layers than PaGoDA's additional upsampling layers?
- Additionally, a smaller VAE only requires training the VAE itself, while PaGoDA involves sequential training stages depending on the resolution, making the training pipeline somewhat inconvenient.
- In Table 4, PaGoDA performs better than the diffusion teacher, and in Table 1, PaGoDA outperforms EDM2-XXL. This may be because the discriminator uses EfficientNet and DeiT pre-trained on ImageNet, leading to better FID scores, which use an Inception network also trained on ImageNet. This issue has been discussed in [B]. There are two potential solutions to this problem. The first is to compare Fréchet distances using DINOv2, following the EDM2 paper. The second is to train the discriminator for the GAN loss from scratch. If PaGoDA's GAN loss is truly stable, it should be possible to train without a pre-trained network.
[A] Tiny AutoEncoder for Stable Diffusion: https://github.com/madebyollin/taesd
[B] Kynkäänniemi et al., The Role of ImageNet Classes in Fréchet Inception Distance, ICLR 2023.
Questions
- What classifier-free guidance scale did PaGoDA use in Table 6? Is the point that achieved the best performance in Figure 14 reported in Table 6?
- Why were different datasets used when collecting latents (CC12M) and during GAN training (COYO-700M)?
- Does the DDIM inversion in Section 5.4 need a diffusion teacher for inversion?
Limitations
Appendix C explains well the gap between PaGoDA's theoretical assumptions and its practice.
We greatly appreciate the reviewer’s constructive feedback.
Weakness 1. Could LCM with a lighter VAE be more efficient than PaGoDA's upsampling blocks?
Ans.
[Lightweight PaGoDA] The smaller VAE [A] decoder design can be integrated into PaGoDA's super-resolution framework. If LCM's VAE is made more lightweight, PaGoDA's network could be made similarly lightweight. The degree to which a heavy decoder (or upsampling network) can be simplified is determined by the complexity of the function it approximates. As PaGoDA's upsampling function is simpler than the VAE's nonlinear transformation, PaGoDA is more suitable for model distillation. Also, a lightweight decoder sometimes fails at high-quality generation, as illustrated in Figure C of the attached PDF.
Weakness 2. A smaller VAE requires single-stage training, while PaGoDA involves multiple sequential stages.
Ans.
[Smaller VAE is Sequential Training] The smaller VAE [A] separately distills a heavy encoder and decoder into lightweight counterparts that match the latent distributions. In knowledge distillation, lightweight models distilled from heavy teachers tend to perform better than those trained without a teacher. Likewise, the smaller VAE [A] requires the original VAE to minimize the performance sacrifice, which makes it a sequential training pipeline as well.
[PaGoDA Skips Intermediate Stages] PaGoDA's progressive upsampling can be compressed. Instead of doubling the resolution at each stage, the jumps can be increased to factors of 4x or 8x, allowing the super-resolution network to be trained all at once. We observed that even with an 8x jump, the model remained stable and performance was retained.
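To make the staging concrete, below is a minimal PyTorch-style sketch of a generator that grows by appending upsampling blocks, where the jump factor (2x, 4x, or 8x) controls how many blocks are added in one stage; the module names and layer choices are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative PyTorch-style sketch (not the paper's actual architecture):
# a generator that grows by appending 2x upsampling blocks, where a jump
# factor of 4 or 8 appends several blocks in a single training stage.
import math
import torch.nn as nn

class UpBlock(nn.Module):
    """A single 2x upsampling block."""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
        )

    def forward(self, x):
        return self.net(x)

class ProgressiveGenerator(nn.Module):
    """A pretrained base-resolution generator plus a growing stack of upsamplers."""
    def __init__(self, base: nn.Module, ch: int):
        super().__init__()
        self.base = base                    # one-step generator at the base resolution
        self.ups = nn.ModuleList()          # grows across training stages
        self.to_rgb = nn.Conv2d(ch, 3, 1)

    def grow(self, factor: int, ch: int):
        """Append log2(factor) blocks; factor=8 skips two intermediate stages."""
        for p in self.parameters():         # freeze everything trained so far
            p.requires_grad_(False)
        for _ in range(int(math.log2(factor))):
            self.ups.append(UpBlock(ch))    # only the new blocks receive gradients
        self.to_rgb = nn.Conv2d(ch, 3, 1)   # fresh output head at the new resolution

    def forward(self, z):
        h = self.base(z)                    # feature map at the base resolution
        for blk in self.ups:
            h = blk(h)
        return self.to_rgb(h)

# Example: generator.grow(factor=8, ch=128) trains 64x64 -> 512x512 in one stage.
```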
Weakness 3. FID evaluation could be biased due to the use of EfficientNet and DeiT pre-trained on ImageNet.
Ans.
[Fréchet Distance] Following the reviewer's suggestion, we additionally report the Fréchet Distance (FD) using DINOv2 features in Tables A and B of the attached PDF. On ImageNet 64x64, we draw three implications:
- vs. EDM & CTM: All three models (PaGoDA, EDM, CTM) share the same architecture. CTM performs better than teacher EDM only in FID, a signal of FID bias. However, PaGoDA consistently outperforms teacher EDM in both FID and FD.
- vs. StyleGAN-XL: Direct comparison between PaGoDA and StyleGAN-XL is challenging due to differences in architectures and training details. Nonetheless, PaGoDA outperforms StyleGAN-XL in all metrics.
- vs. Validation Data: Given that PaGoDA's FID is better than the validation data while its FD is worse, FID may no longer be a reliable metric for model evaluation.
On ImageNet 512x512, we compare PaGoDA as follows:
- vs. EDM: PaGoDA significantly surpasses EDM in both FID and FD, even with the same architecture (EDM is trained on latent space).
- vs. EDM2: EDM2 outperforms PaGoDA in FD, regardless of the number of parameters used. However, please note that EDM2 requires steps to achieve good FDs, while PaGoDA requires only a single step.
Based on these findings, we will focus on comparing PaGoDA using FD in our paper revision. We plan to explore whether applying PaGoDA to the EDM2 architecture would bring better quality in future work.
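For reference, the Fréchet distance itself is computed the same way regardless of the feature extractor; below is a minimal sketch over precomputed feature sets (e.g., DINOv2 embeddings), not the exact evaluation code behind Tables A and B.

```python
# Minimal sketch of a Frechet distance computation between two feature sets
# (e.g., DINOv2 embeddings of generated vs. reference images).
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """feats_*: arrays of shape (N, D) holding one feature vector per image."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)       # matrix square root of the product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                  # drop tiny imaginary numerical noise
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```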
Question 4. CFG scale used in Table 6 for PaGoDA: is the best-performance point in Figure 14 reported in Table 6?
Ans.
It is correct that we used the best performance in Figure 14, with a CFG scale of 1.15, to report the score in Table 6. Adopting reconstruction loss reduced the optimal CFG scale from 2-3 to 1-2.
Question 5. Why were different datasets used when collecting latents (CC12M) and during GAN training (COYO-700M)?
Ans.
[Prevent Overfitting] We used different datasets to mitigate overfitting and memorization issues. Suppose z is the latent of a data point x from CC12M. Then, the reconstruction loss pushes the generator output G(z) towards x. If we also used CC12M for GAN training, the discriminator would push G(z) towards a neighborhood of x in CC12M. Therefore, to alleviate G(z) being too close to x in CC12M, we provide a diverging effect away from CC12M by training the discriminator with COYO-700M.
Question 6. Does the DDIM inversion in Section 5.4 need a diffusion teacher for inversion?
Ans.
Yes, we utilized teacher diffusion in Section 5.4 for controllable generation.
<Final comments to Reviewer 3>
We respectfully request the reviewer to re-evaluate our paper with an emphasis on its upsampling capabilities. The components, though seemingly separate, are integrated to achieve efficient super-resolution generation.
[Upsampling: DDIM Inversion for Recon Loss] DDIM inversion allows obtaining a latent z from high-dimensional input data x by first downsampling x to the DM's resolution. By training a map from z back to the high-resolution x, PaGoDA provides an upsampling capability alternative to SD's VAE and Cascaded Diffusion Models. When using DDIM, we cannot train such an upsampling map.
[Upsampling: Classifier-Free GAN for Adversarial Loss] Previous diffusion distillation methods used standard GANs with w-CFG teacher samples, limiting student performance and resolution extension. An easy fix could be feeding appropriately resized real-world data into the GAN's real part, but this is incompatible with T2I because the optimal student would then follow the data distribution p(x|c), not the w-guided distribution. Classifier-Free GAN solves this by predicting w from the real data and feeding it into the student network, ensuring alignment with the desired distribution (see also L157-L159).
[Upsampling = Progressive Growing + DDIM Inversion + Classifier-Free GAN] By integrating three proposed components altogether, we achieved efficient super-resolution generation in both training and sampling, providing a new and effective alternative to Stable Diffusion and Cascaded Diffusion Models. We kindly ask the reviewer to reconsider our paper with this perspective in mind.
I thank the authors for their thorough rebuttal. Tables A and B in the attached PDF sufficiently address my concerns regarding evaluation. The results confirm that PaGoDA outperforms all models except for EDM2.
Meanwhile, I have further questions regarding the smaller VAE:
- While training a new VAE requires sequential training, PaGoDA also necessitates an additional data generation process, and for conditional generation, the model p(w∣x,c) must also be trained. In this regard, I still doubt whether PaGoDA truly offers a more convenient training pipeline compared to training a smaller VAE.
- From Figure C in the attached PDF, I understand that a lightweight VAE might not be the best choice. However, doesn't PaGoDA also experience reconstruction error for the input image in Figure C? One could downsample that input image to the base resolution, obtain noise through DDIM inversion, and then check the reconstruction performance using PaGoDA.
We sincerely thank the reviewer for allowing us to compare PaGoDA with SD more deeply. We believe the review is highly helpful, and we will revise the paper based on our discussion.
Q1-1. While training a new VAE requires sequential training, PaGoDA also necessitates an additional data generation process.
Ans.
[Minimal Overhead] It takes about one day with 8xH100 GPUs to converge in T2I at base resolution if training data pairs are synthesized online. The data collection overhead is minimal compared to the overall DM pretraining cost. For higher-resolution training, the data pairs collected at base resolution are reused, significantly reducing the need for additional DDIM inversion and compute resources.
Q1-2. For conditional generation, the model p(w∣x,c) must also be trained.
Ans.
[Use Released Checkpoint] We recognize that our approach involves higher initial costs due to the need to train the classifier. However, once the classifier is publicly available, the training pipeline will be equivalent to training a smaller VAE.
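For illustration, our reading is that such a classifier can be trained on teacher samples generated with known guidance scales; the sketch below is a hypothetical example of that idea, and the helpers `teacher_sample`, `encode_image`, and `encode_prompt` (as well as the bin range) are assumptions rather than functions from our released code.

```python
# Hypothetical sketch of training the guidance-scale predictor p(w | x, c):
# teacher samples generated with known CFG scales serve as labeled data.
import torch
import torch.nn as nn
import torch.nn.functional as F

W_BINS = torch.linspace(1.0, 8.0, steps=15)        # discretized guidance scales (assumed range)

class GuidancePredictor(nn.Module):
    def __init__(self, feat_dim: int, n_bins: int = len(W_BINS)):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.SiLU(), nn.Linear(512, n_bins))

    def forward(self, img_feat, txt_feat):
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))   # logits over w bins

def train_step(predictor, optimizer, prompts, teacher_sample, encode_image, encode_prompt):
    idx = torch.randint(len(W_BINS), (len(prompts),))     # sample a scale per prompt
    x = teacher_sample(prompts, W_BINS[idx])               # multi-step teacher sampling with CFG
    logits = predictor(encode_image(x), encode_prompt(prompts))
    loss = F.cross_entropy(logits, idx)                    # classify which w produced the sample
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```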
Q1-3. I still doubt whether PaGoDA truly offers a more convenient training pipeline compared to training a smaller VAE.
Ans.
In terms of the number of training stages, PaGoDA is no different from SD. However, we would like to further discuss the discrepancy between SD and PaGoDA when we want to increase the sample resolution, e.g., from 512x512 to 1024x1024.
[SD Needs Entire Retraining] In Stable Diffusion, the entire pipeline needs to be retrained from scratch, following these steps:
- Train the VAE at 1024x1024 from scratch
- Train the latent DM from scratch
- (Optional) Distill the latent DM into a one-step generator
- (Optional) Distill the VAE into a smaller VAE
[PaGoDA Recycles Pretrained Networks] In contrast, PaGoDA can reuse the 512x512 generator and only requires training an upsampling network from 512x512 to 1024x1024, making it significantly more cost-effective than SD, as follows:
- Reuse the pixel DM at the base resolution
- Reuse PaGoDA at the base resolution
- Reuse PaGoDA at super-resolutions up to 512x512
- Train PaGoDA for upsampling from 512x512 to 1024x1024
Q2. Doesn't PaGoDA also experience reconstruction error for the input image in Figure C?
Ans.
PaGoDA's reconstruction is less accurate than its generation quality, because PaGoDA's reconstruction quality depends solely on the reconstruction loss. We provide details below.
[Different GANs for Different Purposes] We kindly remind the reviewer that the model objectives differ: the VAE in SD is for data compression, while PaGoDA is for generation. Accordingly, the VAE and PaGoDA apply GANs for two different purposes:
- (I) GAN for reconstruction
  - Real: a real image
  - Fake: its corresponding reconstructed image
- (II) GAN for generation
  - Real: a real image
  - Fake: a randomly generated image from random noise
PaGoDA prioritizes generation quality over reconstruction quality, thus using the type (II) GAN. Therefore, PaGoDA's reconstruction relies solely on the LPIPS reconstruction loss. In contrast, SD's VAE uses option (I), which is specifically tailored for better reconstruction quality.
[Applying Option (I) in PaGoDA for Better Reconstruction] We observed that additionally using (I) in PaGoDA on top of (II) improves reconstruction quality, and we will include a figure in our final revision to illustrate this. Note that a direct comparison with Figure C is not feasible for PaGoDA, as DDIM inversion relies on the text prompt, which is not provided in Figure C.
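To make the distinction concrete, a hedged sketch of the two pairings is given below; `G`, `D`, and `ddim_invert` are placeholders rather than our released implementation.

```python
# Illustrative contrast between the two discriminator pairings discussed above.
import torch
import torch.nn.functional as F

def gan_logits_option_I(G, D, x_real, ddim_invert):
    """(I) GAN for reconstruction: the fake is a reconstruction of the same image."""
    z = ddim_invert(x_real)          # latent of the real image
    return D(x_real), D(G(z))

def gan_logits_option_II(G, D, x_real, prior_shape):
    """(II) GAN for generation: the fake is decoded from fresh prior noise."""
    z = torch.randn(x_real.size(0), *prior_shape, device=x_real.device)
    return D(x_real), D(G(z))

def d_loss(real_logits, fake_logits):
    """Non-saturating discriminator loss (the generator loss is analogous)."""
    return (F.softplus(-real_logits) + F.softplus(fake_logits)).mean()
```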
(Side Remarks on Q2)
We presume the reviewer may have raised concerns about PaGoDA's upsampling capability due to the loss of high-frequency signal when the encoder downsamples. At first glance, this information loss seems to fundamentally block higher-dimensional generation. Below, we discuss why this is not necessarily true in PaGoDA.
[PaGoDA: Lossless Compression] Although our encoder may seem to perform lossy compression, it can in principle achieve lossless compression, i.e., perfect reconstruction. If the downsampled versions of two different images x1 and x2 remain distinguishable, DDIM inversion, as an injective function, will map them to distinct vectors in the latent space. Therefore, the generator can map these distinct latent vectors back to the original signals x1 and x2, respectively, given a sufficiently flexible generator and a well-designed loss function. However, this lossless compression is only achievable when two different data points remain distinguishable after downsampling. Hence, the downsampling factor should be carefully selected to ensure that the downsampled images are distinguishable. At resolutions like 64x64 (or even 32x32), natural images remain sufficiently distinguishable.
Thank you for the detailed discussion regarding my questions. Your responses have addressed my concerns about the advantages of the PaGoDA pipeline over the existing latent diffusion pipeline. I also understand that comparing the reconstruction error between PaGoDA and the VAE is not feasible.
Given that this work demonstrates strong generation performance and has the advantage of reusing pre-trained generative models when increasing resolution, I believe it will be as beneficial to the generative AI community as latent diffusion models. Therefore, I am raising my score to 7.
We sincerely appreciate the reviewer helping us to enhance the quality of our paper with valuable discussions. In the final revision, we will thoroughly and faithfully reflect the discussions addressing the reviewer's concerns.
The paper introduces a novel GAN-based diffusion distillation approach that leverages ideas from PG GAN for effective high-resolution synthesis.
Strengths
- PaGoDA demonstrates state-of-the-art single-step generation performance on ImageNet, competing with or outperforming more expensive alternatives, including the teacher DMs.
- The proposed classifier-free GAN is an interesting modification for better conditional generation.
- The authors justify most components of the proposed method by providing valuable discussions and ablation studies.
- The paper includes thorough theoretical analysis, as well as many experimental and implementation details.
Weaknesses
- PaGoDA is designed for pixel-based DMs only and hence is not applicable to most state-of-the-art T2I models, e.g., SDXL, SD3. Also, the motivation behind this (L42) seems a bit arguable. For example, one could try to adapt PaGoDA to LDM by replacing the AE decoder with a shallow one mapping a 64x64 latent variable to a 64x64 image. Then, PaGoDA can be applied as is. Do the authors have any thoughts on this?
- The method requires collecting a dataset of a large number of (x, z) and (x̂, z) pairs. Likely, this significantly increases the training costs compared to sampling-free alternatives. Could the authors provide the data collection vs. training costs? How does the overall training time compare to other distillation approaches?
- The KD objective is used for the guided generators (i.e., using the reverse process for sampling (x̂, z) pairs). This contradicts the motivation behind the reconstruction loss (L89). While the ablation study in Tab. 7 aims to address this, could the authors clarify the inference CFG scale used in this experiment? How does the role of the reconstruction loss change with different CFG scales?
- PaGoDA uses CLIP regularization in the T2I setup. I believe this makes the CLIP score evaluation incorrect. Moreover, other distillation methods, e.g., LCM, DMD, or ADD, do not use it. This raises concerns about the performance of the core approach in the T2I setting.
- For T2I evaluation, automated metrics have been shown to correlate poorly with human preference (e.g., [1,2]). It would be highly beneficial to conduct a human study comparing PaGoDA to other distillation approaches and T2I diffusion models. If a reliable human evaluation setup is unavailable, I would suggest evaluating a metric learned to mimic human preference, e.g., [3,4,5].
- The CFG details for T2I inference are missing. What scale is used in the final T2I results? How does the model perform with different CFG scales?
- The paper does not provide qualitative T2I results demonstrating the diversity of generated images for a given prompt.
- The ablation study lacks evaluation of the proposed approach without the adversarial objective.
[1] Podell et al. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.
[2] Li et al. Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation.
[3] Xu et al. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
[4] Kirstain et al. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation.
[5] Wu et al. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis.
[6] Sauer et al. Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
Questions
- Please address the questions and concerns raised in "Weaknesses".
- Intuitively, the reconstruction and distillation losses are most beneficial for the base-resolution training. Did the authors try to deactivate these losses for the SR stages?
- Did the authors consider using a pretrained DM as a discriminator, following the ideas from [6]? It would be interesting to explore whether noise-level feedback might be valuable for learning better generators.
- PaGoDA approaches very low FID values on ImageNet. Could the performance gains be a result of overfitting the training data?
Limitations
The authors addressed the limitations and potential negative social impact in the appendix.
We express our gratitude to the reviewer for the helpful review. Below, we answer the raised concerns, and we would be happy to provide additional experiments upon the reviewer's request in our final revision.
Weakness 1. Can we integrate PaGoDA with SD?
Ans.
Yes, we can integrate PaGoDA in the latent space of SD. If d (say 512) is the latent resolution of SD (the data resolution could be up to 2K or 4K), we can train a DM at a lower dimension (say 64), which is much cheaper than training a DM at the full 512 latent dimension, and then distill/upsample with PaGoDA up to d. Although we focus on the core algorithm with a pixel DM to pursue a fundamental understanding of the core components, we are interested in integrating PaGoDA with SD in the future to examine the degree of compression this integration can offer.
Weakness 2. Discuss the training costs compared to sampling-free alternatives.
Ans.
[Data Collection is Optional] While we utilized collected data-latent pairs mainly to save budget for model building, once training configurations are set, training data pairs can be obtained online during training.
[Minimal Overhead] In T2I, convergence is achieved around 20k iterations, requiring about 1 day with 8xH100, including latent calculation. This overhead is minimal compared to the total DM pretraining cost.
[Safer Training] We further note that sampling-based methods offer safety benefits. During training, a classifier can filter out problematic samples related to privacy, NSFW content, and copyright. Moreover, maintaining clear records of the training data aids post-mortem analysis.
Weakness 3. Adopting the KD objective in T2I contradicts the motivation of the reconstruction loss.
Ans.
[KD for High CFG] Figure 4 highlights that the latents from DDIM inversion are out-of-prior at high CFG scales. Therefore, a model trained only with the reconstruction loss performs poorly at these scales. To resolve this problem, we impose the KD loss. However, using KD alone limits the student to be upper-bounded by the teacher.
[Effect of Recon+KD] When we add the reconstruction loss, the FID-CLIP curve shifts right (higher CLIP) and down (better FID), and the optimal CFG scale (w.r.t. FID) falls from 2.5 to 1.15. We also report a human evaluation in Figure A of the attached PDF, showing better image quality and prompt alignment.
[Eliminate KD] We are interested in eliminating the KD objective by substituting CFG with CFG++ [A], a recent method that improves DDIM and DDIM inversion at guidance scales below 1. At these scales, the DDIM latents remain in-distribution, which facilitates model training without KD. We originally planned to explore this as future work but would be happy to investigate it in our final revision if the reviewer requests.
[A] CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models
Weakness 4. Using CLIP regularization makes the CLIP score evaluation incorrect.
Ans.
Figure A in the attached PDF shows that CLIP regularization is advantageous for prompt alignment under human judgement. To minimize potential bias in evaluation, we utilized different CLIP models for training (ViT-L/14 trained on YFCC100M) and evaluation (ViT-g/14 trained on LAION-2B).
Weakness 5. Clarify the CFG details for inference.
Ans.
The CFG scale in Tables 5 and 6 is 1.15. As shown in Figure 14, CFG in PaGoDA exhibits a similar trend to the teacher diffusion.
Weakness 6. Does PaGoDA work without adversarial objective?
Ans.
[Existence of Prior Holes] PaGoDA requires a GAN loss. Since PaGoDA's encoder is a deterministic mapping, it avoids posterior collapse but suffers from prior holes. These holes are unseen during training with the reconstruction-only loss, causing the decoder to struggle to create high-quality samples in those areas. Therefore, we need the GAN loss to cover the entire prior manifold during training.
[Uniform Encoding] DDIM inversion encodes data uniformly into the prior manifold, as illustrated in Figure B of the attached PDF. This uniformity constrains the generator, preventing mode collapse by densely regulating its output over the prior space through the reconstruction loss.
[Well-Grounded GAN] Moreover, Theorem 3.1 proves that the optimal generator of a GAN-augmented distillation model matches the data distribution, unlike previous works where the optimal generator is heavily influenced by the pretrained DM. Theorem 3.2 confirms the stability of combining the GAN loss with the reconstruction loss.
Question 7. Can we deactivate reconstruction/distillation losses for SR stages?
Ans.
Deactivating the reconstruction/distillation losses causes the model to depend solely on the GAN loss. As seen in Figure 6, relying solely on the GAN loss for SR results in object locations shifting across resolutions. This unwanted shift hinders the ability to control samples at the base resolution. With the reconstruction loss activated, the object's position remains consistent across resolution changes.
Question 8. Can we use a pretrained DM as a discriminator feature extractor?
Ans.
Yes, but note that feature extractors behave differently. According to [6], discriminative features prioritize local details over global shape, while generative features do the opposite. Since the reconstruction loss strongly penalizes errors in the global context, we used features from discriminative models so that the GAN loss prioritizes capturing local details.
Question 9. Could the ImageNet performance be a result of overfitting?
Ans.
Table A of the attached PDF shows the Fréchet Distance in the DINOv2 feature space, comparing 50k samples with 50k validation data. The degree of overfitting (FD_val - FD_train) in PaGoDA is comparable to the baselines, implying that PaGoDA is as robust to overfitting as the baselines.
I thank the authors for their detailed and insightful discussions and additional evaluation results. I appreciate that other reviewers noticed the use of the IN pretrained discriminator for training and agree that it's important to eliminate this source of bias in evaluation. I'm glad the authors addressed most evaluation concerns and provided a human study for the t2i setting. In addition to updating the main results, I also encourage revising Section 5.2, considering potential bias, e.g., Fig.5(b) needs to be validated.
Other questions and concerns were also well addressed, except for the qualitative results concerning t2i diversity. I would appreciate seeing t2i samples given one prompt and different seeds compared with the teacher model in the revision (please see figures 3 and 8 in ADD as examples).
Regarding CFG++, I won't insist on adding these results but would still be excited and interested to see them in the revision if available. In my opinion, it could make the approach more elegant and complete.
Overall, I believe the paper provides valuable contributions, and I do not have objections to its acceptance if the authors carefully address evaluation concerns in the revision. I have increased my score to 6.
We express our gratitude for the reviewer's constructive feedback. Following the reviewer's suggestion, we will revise Section 5.2 by adding a discussion of the potential bias of FID, evaluating all applicable FD values, and comparing PaGoDA with baselines based on FD, including Figure 5-(b). Also, in our final revision, we will add T2I samples generated from one prompt with different seeds, compared against the teacher, to check that our model generates diverse images.
This paper introduces a method to distill a diffusion model into a one-step generator. The training process combines several loss functions. First, the authors use DDIM inversion to transform real images into latent noise, which is then fed into the generator. The generated images are supervised with a reconstruction loss. Second, an additional GAN loss is applied to the generated images, distinguishing them from real images. For text-to-image generation, the authors present a novel approach to enable classifier-free guidance in the GAN loss by using an auxiliary guidance predictor. Other auxiliary losses, such as a knowledge distillation loss and a CLIP loss, are also employed to enhance performance. The final method is evaluated across various benchmarks.
Strengths
S1. The presented approach performs well on common benchmarks while achieving good inference efficiency.
S2. The authors plan to open-source the method, which will be beneficial for further research.
Weaknesses
W1. Some loss components are questionable and may mislead readers about the actual performance of the proposed approach. For instance, it is well-known that using a GAN loss with a discriminator pretrained on ImageNet significantly biases the FID metric [1]. This issue also applies to the CLIP loss, which biases the evaluation of FID/CLIP scores when used during training. A substantial portion of the strong results presented might be attributed to these misleading numbers, making it difficult to draw accurate conclusions about effectiveness. This is a common issue in some previous works but needs to be addressed now. For example, using GANs is good, but it might be better to avoid using a pretrained classifier as the discriminator, as suggested by recent explorations in diffusion GAN literature [2]. For text-to-image results, it is recommended to conduct human evaluations to assess real image quality, prompt alignment, and diversity.
W2. Many of the proposed components have been well explored in previous literature, but the connections to these works are not sufficiently discussed in the related works section. For instance, the combination of a regression loss and a GAN loss has been explored for text-to-image generation in several studies, including [3, 4]. Additionally, the use of distillation loss and CLIP loss are common practices [5, 6]. This reduces the perceived novelty of the paper to a moderate level.
W3. The introduction of a classifier-free GAN objective might overcomplicate the problem. Previously, GAN loss with classifier-free guidance (CFG) was enabled by generating fake samples using an original diffusion model with guidance, as applied in [2], and then utilizing the standard GAN loss. This method is simpler than training an auxiliary classifier and has a similar computational cost, since the current classifier training also requires generating samples with CFG using the original diffusion model.
[1] Kynkäänniemi, Tuomas, et al. "The role of imagenet classes in fr'echet inception distance." arXiv preprint arXiv:2203.06026 (2022).
[2] Sauer, Axel, et al. "Fast high-resolution image synthesis with latent adversarial diffusion distillation." arXiv preprint arXiv:2403.12015 (2024).
[3] Lin, Shanchuan, Anran Wang, and Xiao Yang. "Sdxl-lightning: Progressive adversarial diffusion distillation." arXiv preprint arXiv:2402.13929 (2024).
[4] Song, Yuda, Zehao Sun, and Xuanwu Yin. "SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions." arXiv preprint arXiv:2403.16627 (2024).
[5] Sauer, Axel, et al. "Adversarial diffusion distillation." arXiv preprint arXiv:2311.17042 (2023).
[6] Kang, Minguk, et al. "Scaling up gans for text-to-image synthesis." CVPR 2023
Questions
Most of my concerns are outlined in the weaknesses section. To reiterate, I am particularly concerned about the utilization of a pretrained classifier as a discriminator and the CLIP loss, which bias the quantitative evaluation. Additionally, the overall novelty is moderate, and some components might be unnecessarily complex.
Limitations
yes.
Weakness 1. To address the concerns about fair evaluation, we provide additional results below.
W1-1. GAN's discriminator pretrained on ImageNet biases the FID metric.
Ans.
[Fréchet Distance (FD)] We agree with the reviewer that the use of an ImageNet-pretrained discriminator could bias the model towards achieving a good FID. Following EDM2's practice, we evaluate FD using DINOv2 features. The tables in the attached PDF imply that FID and FD trends are consistent on ImageNet 64x64 but not at high resolution, which may indicate FID bias. Despite this, PaGoDA outperforms its teacher EDM at both resolutions on FD. We will update the manuscript to include a full set of results and discussion. Please also refer to R2-W9 and R3-W3.
W1-2. CLIP regularizer may also bias the evaluation.
Ans.
[Human Evaluation] We conduct a human evaluation to fairly measure the effect of the CLIP loss in Figure A of the attached PDF. According to the figure, adopting CLIP regularization has a positive influence on prompt alignment. Also, to minimize undesirable bias in the FID/CLIP scores, we utilized different CLIP models for training (ViT-L/14) and evaluation (ViT-g/14). These models were trained on different datasets: ViT-L/14 on YFCC100M and ViT-g/14 on LAION-2B.
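For concreteness, a minimal sketch of such a CLIP alignment regularizer is given below; `clip_model` is any frozen CLIP-style model exposing `encode_image`/`encode_text`, images are assumed to be already resized and normalized for that encoder, and the specific ViT-L/14 (YFCC100M) checkpoint is not hard-coded.

```python
# Minimal sketch of a CLIP alignment regularizer over generated images.
import torch
import torch.nn.functional as F

def clip_regularizer(clip_model, images, token_ids):
    """Mean (1 - cosine similarity) between generated images and their prompts."""
    img = F.normalize(clip_model.encode_image(images), dim=-1)   # gradients flow back to images
    with torch.no_grad():                                        # text branch stays frozen
        txt = F.normalize(clip_model.encode_text(token_ids), dim=-1)
    return (1.0 - (img * txt).sum(dim=-1)).mean()
```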
Weakness 2. To address the novelty concerns, we clarify our contributions below.
W2-1. The combination of a regression loss and a GAN loss has been explored for T2I generation in several studies.
Ans.
[DDIM vs. DDIM Inversion] Although the reviewer classifies both the reconstruction loss (from DDIM inversion) and the distillation loss (from DDIM) as regression losses, they function differently. Previous distillation models [3, 4] adopt DDIM for distillation, starting from a prior latent and obtaining synthetic data, which may harm model performance if the synthetic data is of low quality. In contrast, DDIM inversion in PaGoDA starts from the real data and obtains its latent, so it benefits from the high quality of real data.
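To illustrate the difference, the sketch below contrasts the two pairing strategies, assuming an epsilon-prediction teacher `eps_model(x, t, cond)` and a precomputed alpha-bar schedule; it is not our exact solver configuration.

```python
# Sketch of collecting pairs via DDIM (prior -> synthetic data) versus
# DDIM inversion (real data -> latent), using the same deterministic update.
import torch

@torch.no_grad()
def ddim_step(eps_model, x, t_from, t_to, alpha_bar, cond=None):
    """One deterministic DDIM step from t_from to t_to (either time direction)."""
    a_from, a_to = alpha_bar[t_from], alpha_bar[t_to]
    eps = eps_model(x, t_from, cond)
    x0_pred = (x - (1 - a_from).sqrt() * eps) / a_from.sqrt()
    return a_to.sqrt() * x0_pred + (1 - a_to).sqrt() * eps

@torch.no_grad()
def pairs_from_ddim(eps_model, z, timesteps, alpha_bar, cond=None):
    """DDIM sampling: the pair is (x_synthetic, z)."""
    x = z
    for t_from, t_to in zip(timesteps[:-1], timesteps[1:]):   # decreasing timesteps
        x = ddim_step(eps_model, x, t_from, t_to, alpha_bar, cond)
    return x, z

@torch.no_grad()
def pairs_from_ddim_inversion(eps_model, x_real, timesteps, alpha_bar, cond=None):
    """DDIM inversion: the pair is (x_real, z)."""
    x = x_real
    for t_from, t_to in zip(timesteps[:-1], timesteps[1:]):   # increasing timesteps
        x = ddim_step(eps_model, x, t_from, t_to, alpha_bar, cond)
    return x_real, x
```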
[Theoretically Grounded GAN Integration] Additionally, we show that existing methods combining the GAN loss and the DDIM-distillation loss do not optimally learn the data distribution, as noted in Theorem B.4 (extended Theorem 3.1) in the appendix. In contrast, Theorem 3.1 demonstrates that PaGoDA's objective recovers the data distribution at optimality, making it the first diffusion distillation approach to guarantee this theoretically. This advantage stems from using real input data to construct pairs for distillation. Empirically, Figure 5-(d) shows that PaGoDA's loss (recon+GAN) maintains performance, while the KD+GAN loss degrades as the teacher model worsens (imperfect teacher model).
[GAN Stability] Moreover, previous works have not adequately addressed GAN stability concerns. Our Theorem 3.2 provides the first guarantee that the GAN loss can be stable when combined with DDIM inversion. Addressing these questions on optimality and stability offers new insights to the distillation community.
W2-2. The use of distillation loss and CLIP loss are common practices [5, 6].
Ans.
Please note that we are not claiming CLIP regularization as our contribution. Instead, we advocate using the CLIP regularizer as an option to achieve better text alignment.
W2-3. Proposed components explored in literature reduce the novelty of the paper.
Ans.
[DDIM Inversion + Progressive Growing = Upsampling] We respectfully request the reviewer to re-evaluate our contribution. We are not complicating matters by adding well-established losses. Instead, we argue that DDIM inversion not only guarantees learning the data distribution (see Theorem 3.1) but, when combined with our progressive growing technique, also offers an alternative to the upsampling methods used by Stable Diffusion and Cascaded Diffusion Models.
Weakness 3. We provide explanations to address concerns about methodological complexity.
W3-1. Using the standard GAN loss is simpler than the proposed classifier-free GAN, which complicates the objective.
Ans.
[Upsampling] We would like to address the reviewer's concern by connecting the GAN loss with upsampling. Suppose d is the base resolution at which the DM has been trained. To double the generation resolution, the resolution of the GAN's real/fake data should also be doubled. Otherwise, the GAN cannot generate images with full details.
[Problem of Standard GAN] A problem arises when using the standard approach of putting the w-CFG teacher sample into the GAN's real part. Since the teacher sample is of dimension d, there is no efficient way to upsample it to the doubled resolution without learning independent upsampling models like Cascaded Diffusion Models.
[Use Real-world Data] Our proposal is to use high-resolution real-world data. We put real-world data (downsized to the current training resolution) into the GAN's real part instead of using teacher samples. We compare the standard approach and our proposal as follows (see also the sketch after this comparison):
- Standard way: generate a w-CFG teacher sample and feed it into the GAN's real part.
  - (-) Student resolution is not extendable
  - (-) Reaches only teacher quality at the optimum
- Our proposal: retrieve real data, downsize it, predict its guidance scale w, and put it into the GAN's real part.
  - (+) Student resolution is extendable
  - (+) Reaches data quality at the optimum
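A hedged sketch of the two ways of populating the GAN's real branch is given below; `teacher_cfg_sample`, `w_predictor`, and `student` are assumed components, not our released implementation.

```python
# Standard distillation GAN vs. the proposed classifier-free GAN pairing.
import torch
import torch.nn.functional as F

def standard_gan_batch(teacher_cfg_sample, student, z, c, w):
    """Standard way: real = w-CFG teacher sample (synthetic, base resolution only)."""
    real = teacher_cfg_sample(z, c, w)
    fake = student(z, c, w)
    return real, fake

def classifier_free_gan_batch(student, w_predictor, x_real_hr, c, res, z):
    """Proposal: real = real-world data downsized to the current training resolution,
    with the guidance scale predicted from that real data fed to the student."""
    real = F.interpolate(x_real_hr, size=res, mode="bilinear", antialias=True)
    w_hat = w_predictor(real, c)      # predicted guidance scale, as in p(w | x, c)
    fake = student(z, c, w_hat)
    return real, fake
```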
[Use Released Checkpoint] We recognize that our approach may involve higher initial costs due to the need to train the classifier. However, once the classifier is publicly available, the training cost will be similar to the standard approach. We kindly ask the reviewer to focus on the qualitative benefits of our proposal.
[Concurrent Works] The literature [2-6], released publicly within 2-3 months before the NeurIPS submission, should generally be considered concurrent work to ours. We could add a section in our revision, but we think it is unfair to evaluate our method based on them.
Thank you to the authors for their response. I have increased my rating to 5 and am willing to accept the paper if other reviewers strongly support it. However, I still consider it borderline due to several issues.
- The new experiments using DINO metrics demonstrate that the results are not completely biased, which is appreciated. However, the method performs worse than EDM2 and possibly other untested methods. For the revision, it is crucial to correct the numerous inaccurate claims regarding state-of-the-art performance, initially attributed to a biased FID evaluation. I recommend conducting additional experiments using unbiased GAN discriminators, such as non-initialized or diffusion GANs.
- I acknowledge the distinction between DDIM and DDIM inversion for reconstruction training. However, the current method also needs to employ the DDIM distillation loss for text-to-image. Additionally, the improvement between these two may not be significant once the biased GAN issue is addressed. I suggest rerunning the experiment shown in Figure 5(c) for the revision.
- I appreciate the clarifications on the benefits of certain components in the upsampling process. However, it remains unclear how much more advantageous the classifier-free GAN is in a broader setting compared to the simpler, well-tested GAN loss with generated images as real data. Upsampling may not be necessary in a real-world setting. For generation up to 1K resolution, LDM is sufficient; for higher resolutions, training a separate, smaller super-resolution network is often more straightforward.
- Concerning related works, most references are old enough according to NeurIPS policy. Additionally, it is important to discuss any literature that has influenced our understanding. The papers I mentioned, along with others, are closely related and warrant a more thorough discussion.
We deeply appreciate the reviewer's valuable and insightful feedback. We will revise the paper to thoroughly reflect all discussions. Below, we further address the raised concerns.
Questions 1.
Ans. We would like to express our deepest gratitude to the reviewer for pointing out the bias in the FID. We fully agree with the reviewer's opinion on this matter and plan to conduct additional experiments using the unbiased GAN discriminator suggested by the reviewer, reflecting this in the final revision.
Question 2.
Ans. To further assess the effectiveness of the reconstruction (DDIM inversion) loss, we present our findings in Figure A. PaGoDA with the reconstruction loss shows stronger prompt alignment than PaGoDA without it, highlighting the benefit of incorporating the reconstruction loss. Nevertheless, we plan to rerun the experiments presented in Figure 5-(c) for the revision, as suggested by the reviewer.
Question 3.
Ans. We agree with the reviewer. In a more practical scenario, as the reviewer said, it could be easier to generate higher-resolution samples by first generating a 1K-resolution sample with an LDM and then upscaling it to a higher resolution with a separate module. Even so, we believe that PaGoDA could work effectively in the following cases.
- [Inference Speed] Due to the upsampling capability, PaGoDA can replace LDM for 1K generation with halved cost compared to the 1-step LCM, as illustrated in Figure 1. In the LDM framework, regardless of the number of steps taken to synthesize the latent representation, the generated latent must be decoded back to the pixel space. This additional decoder evaluation incurs computational costs nearly equivalent to those of U-Net evaluation, accounting for about 50% of the total computation in 1-step LCM generation. This inherent and fundamental limitation of the LDM framework is completely overcome in PaGoDA, which directly trains on the pixel space, enabling sampling at half the cost compared to the 1-step LCM.
- [Latent PaGoDA] If we want to keep the LDM framework for 1K generation, we can apply PaGoDA in the latent space of the LDM to lighten the training load. Given that diffusion training accounts for most of the computational cost, PaGoDA's resource demands are nearly the same as those of diffusion training at a resolution lower than the full latent dimension. In contrast, LDM requires training the diffusion model at the full latent resolution, which is considerably more costly. Therefore, PaGoDA offers a substantially reduced training budget compared to the conventional LDM.
Question 4.
Ans. We agree with the reviewer and we will discuss related literature thoroughly in our paper revision.
[Final Comments from Authors] We are planning experiments with the aim of thoroughly addressing all of the reviewer's concerns by diligently fulfilling the requests. With these efforts in mind, we respectfully ask the reviewer whether our paper still remains borderline.
We sincerely appreciate all the reviewers for their constructive and helpful feedback. For a clearer evaluation, we would like to highlight a high-level overview of the contributions in this paper.
[Switch DDIM to DDIM Inversion] PaGoDA proposes to utilize DDIM inversion for distillation. Unlike DDIM, which starts from the Gaussian prior and synthesizes fake data for generator training, DDIM inversion begins from real data and maps it to its latent representation. This approach allows PaGoDA to benefit from the high quality of real data, whereas distillation from DDIM depends on the quality of synthetic data, which can be less reliable.
[DDIM Inversion for Upsampling] DDIM inversion, starting from real data, enables the generator to learn up to the resolution of the real data. In contrast, DDIM alone, without additional upsamplers like Cascaded Diffusion Models, cannot easily extend the generator's resolution to higher dimensions.
[Switch Standard GAN to Classifier-Free GAN] PaGoDA introduces a GAN loss tailored for super-resolution generation and compatible with CFG. While standard GAN distillation puts w-CFG teacher samples into the GAN's real part, this approach, similarly to DDIM, limits resolution due to its reliance on synthetic data. Instead, we use real data in the GAN's real part, aligning with the motivation for using DDIM inversion. To ensure the student learns the w-conditioned data distribution rather than the unguided one, we predict w for the real data and use this prediction in the student network evaluation, completing the design of the Classifier-Free GAN.
[Upsampling = Progressive Growing + DDIM Inversion + Classifier-Free GAN] A high-dimensional generator is achieved by incorporating the three proposed core components together: progressive growing (decoder architecture), DDIM inversion (reconstruction loss), and Classifier-Free GAN (adversarial loss). Removing any single component prevents PaGoDA from working successfully, so all three components are essential. This high-dimensional generator, trained from a low-dimensional teacher DM, provides a new and effective alternative to Stable Diffusion and Cascaded Diffusion Models. We therefore kindly ask the reviewers to reconsider our paper with this perspective in mind.
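For the reviewers' convenience, a compact and purely illustrative sketch of how these components could fit into a single training step is given below; every helper (`ddim_invert`, `w_predictor`, `lpips`, the discriminator) is a placeholder rather than our implementation, and the discriminator update, KD loss, and CLIP regularizer are omitted for brevity.

```python
# Illustrative one-step training sketch combining the three components above.
import torch
import torch.nn.functional as F

def training_step(generator, discriminator, w_predictor, ddim_invert, lpips,
                  x_real, c, base_res, opt_g, lambda_gan=0.1):
    # (1) DDIM inversion of the downsampled real image gives the pair (x_real, z).
    x_base = F.interpolate(x_real, size=base_res, mode="bilinear", antialias=True)
    with torch.no_grad():
        z = ddim_invert(x_base, c)

    # (2) Reconstruction loss: the progressively grown generator maps z back to
    #     the high-resolution real image; generator(z, c, w=None) is assumed.
    loss_rec = lpips(generator(z, c), x_real).mean()

    # (3) Classifier-free GAN loss: fresh prior noise on the fake branch, with
    #     the guidance scale predicted from real data.
    with torch.no_grad():
        w_hat = w_predictor(x_real, c)
    x_fake = generator(torch.randn_like(z), c, w_hat)
    loss_gan = F.softplus(-discriminator(x_fake, c)).mean()    # non-saturating G loss

    loss = loss_rec + lambda_gan * loss_gan
    opt_g.zero_grad(); loss.backward(); opt_g.step()
    return loss.item()
```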
We attach a one-page additional PDF here for the reviewers' information.
The paper introduces an adversarial distillation method for single-step image generation. The proposed method is inspired by progressive growing of GANs and combines the benefits of GAN-style upscaling and diffusion models. It addresses the essential problem of efficient conditional image generation in the visual domain. The authors highlight the benefits of the solution compared to other efficient approaches like LCM.
The reviewers expressed quite a few concerns. The major issues focused on using CLIP both as a regularizer and as an evaluator, human evaluation for T2I scenarios, applicability to existing models such as SD, novelty, and the relation to the most recent works. After the rebuttal stage, the major concerns were addressed, and a consensus among the reviewers was reached regarding accepting the paper. In my opinion, the proposed approach is novel and represents a promising new direction in the field of generative model distillation. Therefore, I recommend this paper for acceptance.