PaperHub

NeurIPS 2024 · Poster · 4 reviewers
Overall rating: 5.3/10 (individual scores 5, 6, 5, 5; min 5, max 6, std 0.4)
Confidence: 3.5 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.3

LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We introduce an efficiency-adapted variational autoencoder design for latent diffusion models with improved scalability and computational efficiency.

Keywords

variational autoencoders, latent diffusion models

Reviews and Discussion

Review (Rating: 5)

This paper proposes the LiteVAE structure as a replacement for the original VAE, aiming to reduce computation when training on large-scale datasets, which can boost the performance of latent diffusion models by allowing more augmentations during training.

Strengths

  1. The new encoder uses fewer parameters and less GPU memory than the original one.

  2. The paper provides some interesting analysis of the feature map.

  3. The first half of the paper is easy to read.

Weaknesses

  • There is little information about the decoder. Without a lite version of the decoder, the innovation of the work may be limited, because the encoder is not used during inference. What is the structure of the decoder? Is the decoder trained together with the encoder? Did the authors implement a new lite decoder?

  • Have the authors verified their structure in known latent diffusion models with fine-tuning strategies? Without such validation, this work may not contribute effectively to modern latent diffusion models. (There is no need to provide new experiments to verify this.)

  • Does the paper include an ablation study on the structural modules in Fig. 1?

  • There is little visual comparison between LiteVAE and the standard VAE to verify its effectiveness. (Fig. 6 and Fig. 9 do not show results from the standard VAE.)

Questions

  • Did the authors try to expand the bottleneck channel from 4 to other numbers, such as 16, 32, or more?

  • Could the authors provide some distribution analysis about the latent features and the differences between LiteVAE and VAE?

  • Could the authors provide the values of LiteVAE-S/M/L in Table 3?

  • The format of the Checklist does not fully meet the requirements.

Limitations

The discussion of the limitations of the work and any potential negative societal impacts should be strengthened.

Author Response

We appreciate the reviewer's helpful comments and the positive reaction to our work. Please find our responses to the individual comments below.

Information about the decoder

As mentioned in line 149, we use the same decoder architecture as SD-VAE, and the encoder and decoder networks are trained together. While the decoder is the component mainly used during inference, our work targets the efficiency of LDM training. During each training step, only the encoder is used; therefore, the complexity of the encoder directly affects the training efficiency of LDMs. Additionally, the encoder is used in other LDM applications, such as image editing and score distillation (SDS) [1]. Thus, improving the efficiency of the encoder will also enhance performance in those applications. Please also note that the encoder is the component that must be fixed before training the diffusion part. The decoder can be distilled into a more lightweight network after training the diffusion and the autoencoder, as it does not change the latent space of the VAE.

[1] Poole B, Jain A, Barron JT, Mildenhall B. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.

Question about compatibility with latent diffusion models

We thank the reviewer for mentioning this interesting question. As mentioned in the general comment, the focus of this work is offering a new encoder architecture for latent diffusion models that results in faster training of the first stage and more efficient training of the second stage of LDMs. As changing the encoder changes the latent space of the VAE, the model cannot be directly used with pretrained diffusion models such as Stable Diffusion. We show that it is possible to train latent diffusion models in the latent space of LiteVAE, but training another Stable Diffusion is unfortunately not in the scope of our compute budget. As the Stable Diffusion models also train a new autoencoder for each version (e.g., SD 2.1, SDXL, and SD3 all have different VAEs), we hope that our findings will be useful when developing new SD models. We did not try direct fine-tuning of existing SD models due to limited computational resources, but [1] demonstrates that it is possible to fine-tune the diffusion UNet with a new VAE to adapt existing pretrained models.

[1] Chen J, Ge C, Xie E, Wu Y, Yao L, Ren X, Wang Z, Luo P, Lu H, Li Z. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692, 2024.

Ablation studies regarding LiteVAE structure

Besides the scaling experiments mentioned in the main text, we provided several ablation studies in Appendix D regarding the feature-extraction network architecture (D.3), sharing feature-extraction modules (D.4), using ViT for feature aggregation (D.5), and the importance of using all wavelet levels (D.6). We would be happy to include more ablations in the final version if requested.

More visual comparisons for the VAE-based LDMs

Figures 6 and 9 include generated images from our latent diffusion model. Hence, there is no direct correspondence between these images and images generated by another LDM trained on the same data. As the FIDs of the two models are close to each other, we expect the VAE generations to have similar characteristics as well. We would be happy to also provide generations from the VAE-based LDM models in the final version.

Question about the bottleneck channel

We have included experiments for n_z = 4 and n_z = 12 in the paper, and the conclusions are similar. Our internal experiments also showed that similar results hold for n_z = 16 and n_z = 48. Hence, we conclude that our findings are independent of the number of channels used in the encoder bottleneck layer.

Distribution analysis

The submission includes a distribution analysis between SD-VAE and LiteVAE latent spaces in Table 6, where we concluded that the latent space of LiteVAE is closer to a standard Gaussian distribution in terms of MMD metrics.
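
To make the comparison concrete, the kind of statistic Table 6 reports, a kernel MMD between latent samples and draws from a standard Gaussian, can be estimated roughly as follows. This is a generic RBF-kernel sketch in numpy; the kernel choice, bandwidth, and sample sizes here are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def mmd_rbf(x, y, sigma=1.0):
    """Biased estimate of MMD^2 between sample sets x and y with an RBF kernel."""
    def k(a, b):
        # Pairwise squared distances, then Gaussian kernel values.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 4))            # stand-in for encoder latents
gauss = rng.normal(size=(256, 4))              # reference N(0, I) samples
shifted = rng.normal(loc=2.0, size=(256, 4))   # a clearly non-standard latent
```

A latent distribution closer to N(0, I) produces a smaller MMD against the Gaussian reference; here `mmd_rbf(latents, gauss)` comes out much smaller than `mmd_rbf(shifted, gauss)`, which is how a "closer to a standard Gaussian" claim can be quantified.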

Throughput for other LiteVAEs

We thank the reviewer for pointing out this missing detail. Please find the throughputs in the following table, and we will include these results in the final version of the paper.

Model         Throughput (img/sec)    GPU Memory (MB)
LiteVAE-S     384                     1324
LiteVAE-M     42.24                   12130
LiteVAE-L     41.6                    12170

Comment about the checklist

We would be happy to double check the checklist in the final version to make sure that it is fully compatible with NeurIPS requirements.

Comment

I thank the authors for the response. However, since this work is mainly designed for LDMs, the missing verification on an LDM and the lack of a light version of the decoder limit its innovation. Therefore, I maintain my score.

Comment

We thank the reviewer for considering our rebuttal, but we would like to call the reviewer’s attention to our LDM results in Section 5 (pp. 8–9), quoted here:

Lastly, we trained two diffusion models on the FFHQ and CelebA-HQ datasets and compared their performance with standard VAE-based LDMs. The diffusion model architecture used for this experiment is a UNet identical to the original model from Rombach et al. [55]. Table 7 shows that the diffusion models trained in the latent space of LiteVAE perform similarly to (or slightly better than) the standard LDMs. Additionally, Figure 6 includes some generated examples from our FFHQ model. These results suggest that diffusion models are also capable of modeling the latent space of LiteVAE.

Additional generated results are given in Appendix E (p. 20).

We respectfully disagree with the reviewer that our not developing a light version of the decoder limits the innovativeness of our work. Our paper aims to improve the training efficiency of these models, which has received comparatively little attention in the literature, whereas the decoder addresses sampling efficiency, which has received far more attention in recent work. We hope that with this point clarified, the reviewer will reconsider their score.

Comment

Thanks for the response, but I maintain my opinion. However, if you can change the opinion of Reviewer rH5p, I will follow them.

Comment

We appreciate the reviewer’s participation in the discussion and respect the reviewer’s position. We are also pleased to note the increase in Reviewer rH5p’s score.

Comment

I partially agree with the other reviewers regarding the innovation in training efficiency, so I raise my score to 5. However, I still hope that the authors can fine-tune an existing LDM (such as SD v1.5) in the final version to prove that it can indeed work on SOTA LDMs.

Review (Rating: 6)

This paper presents LiteVAE, an efficient and lightweight modification to latent diffusion models (LDMs) that incorporates the 2D wavelet transform into the encoding structure. It then uses a feature-aggregation model (a UNet-based architecture) to fuse multiscale wavelet coefficients into a unified latent code, and a decoder to transform the latent code back into an image. The work also provides other modifications, such as self-modulated convolution, pixel-wise discrimination, removing the adaptive weight for the adversarial loss, and additional loss functions. These modifications further improve the training dynamics and reconstruction quality of LiteVAE.
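
As background for the 2D wavelet transform described above, one level of a 2D Haar DWT can be written in a few lines. This is an illustrative numpy sketch, not the authors' implementation (the rebuttal mentions the fbcotter/pytorch_wavelets library), and the Haar filters and normalization here are assumptions chosen for exposition.

```python
import numpy as np

def haar_dwt2(x):
    """One level of a 2D Haar DWT: split an (H, W) array into four
    (H/2, W/2) subbands: LL (approximation) plus LH/HL/HH (details)."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical difference
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

img = np.arange(16.0).reshape(4, 4)
ll, lh, hl, hh = haar_dwt2(img)
# Applying haar_dwt2 to ll again yields the next, coarser wavelet level;
# this recursion is what makes the representation naturally multiscale.
```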

These modifications lead to a considerable reduction in computational cost compared to the standard VAE encoder without sacrificing image reconstruction quality. The base LiteVAE model, with a six-fold reduction in encoder parameter count, matches the reconstruction quality of standard VAE encoders, and larger LiteVAE models provide better reconstruction quality than the standard VAE.

Finally, this paper presents experimental results to support its claims. The result shows that large LiteVAE models outperform standard VAEs of similar size based on the following performance metrics: rFID, LPIPS, PSNR, and SSIM.

Strengths

ORIGINALITY: The main focus of this work is to use a traditional signal processing method (the 2D wavelet transform) to improve the performance and reduce the computational cost of deep learning methods. It exploits the fact that the latent code of the Stable Diffusion VAE (SD-VAE) is itself image-like, and so opts for a traditional transformation that preserves the image-like structure of the latent code. Although there are prior works that use the 2D wavelet transform to improve the performance of generative models, this work incorporates it together with several other useful modifications.

PRESENTATION: This paper is well organized and provides sufficient background information to understand its central claim. For example, the background section provides relevant information for understanding the main components of the new LiteVAE. The inclusion of relevant figures further improves the readability of the work.

QUALITY: This is a high-quality paper. It clearly explains the motivation for the work and provides a detailed explanation and justification for the modifications presented. It also explains in detail how the modifications improve efficiency and scalability. Finally, it provides sufficient experiments to support its main claims.

SIGNIFICANCE: The reduction in computational cost with no tradeoff in reconstruction quality points to the significance of this work.

Weaknesses

Table 3 only compares the throughput of VAE and LiteVAE-B. Please provide the throughput for LiteVAE-S and LiteVAE-L.

Also, the model did not discuss the additional computational cost for the 2D wavelet transform. Please provide some information about this.

There is a typographical error in line 207. I think "Table 3" should be "Table 2".

Questions

Table 3 only compares the throughput of VAE and LiteVAE-B. Please provide the throughput for LiteVAE-S and LiteVAE-L.

Also, the model did not discuss the additional computational cost for the 2D wavelet transform. Please provide some information about this.

There is a typographical error in line 207. I think "Table 3" should be "Table 2".

Limitations

The checklist points to Section 7 for the limitations, but that section only contains the conclusion. Please address this.

Author Response

We thank the reviewer for providing constructive comments and for recognizing our paper as high-quality with numerous strengths and significant contribution. Please find our answers to the comments below.

Throughput of other LiteVAE models

We thank the reviewer for pointing out this question. Please find the throughput of other LiteVAE models in the following table, and we will include this updated result in the final version of the paper.

Model         Throughput (img/sec)    GPU Memory (MB)
LiteVAE-S     384                     1324
LiteVAE-M     42.24                   12130
LiteVAE-L     41.6                    12170

Compute cost for wavelet transforms

The computational cost of the wavelet transform is linear in the number of pixels and is negligible compared to querying the neural network. For a tensor of shape (32, 3, 256, 256), computing the wavelets takes 829 microseconds on an RTX 3090 GPU, while querying the LiteVAE-B encoder takes 55.1 milliseconds for the same data.
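
The linearity claim above can be sanity-checked by counting coefficients: each DWT level maps an H x W array to four H/2 x W/2 subbands (the same total size), and only the LL band is decomposed further, so the total work over all levels is bounded by a geometric series in the pixel count. The following back-of-the-envelope sketch is an illustrative accounting only, not the authors' GPU measurement:

```python
def dwt_level_coeffs(h, w):
    # One 2D DWT level produces four (h//2, w//2) subbands, so the
    # coefficient count (a proxy for the work) at each level is h * w.
    return 4 * (h // 2) * (w // 2)

def total_dwt_coeffs(h, w, levels):
    """Total coefficients produced by a multi-level 2D DWT."""
    total = 0
    for _ in range(levels):
        total += dwt_level_coeffs(h, w)
        h, w = h // 2, w // 2  # only the LL band is decomposed further
    return total

# For a 256x256 image: 65536 + 16384 + 4096 = 86016 coefficients over
# three levels, and the total stays below (4/3) * 65536 for any number
# of levels, i.e. linear in the pixel count.
```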

Error in line 207

We thank the reviewer for mentioning this error. The text indeed means Table 2, and we will fix this issue in the final version.

Comment

Dear Author(s):

Thank you for your response. I have gone through your answers to my questions.

I would also advise that you provide more information about the implementation of the wavelet feature extraction. This is an important aspect of your work and will greatly help in reproducing your results, especially in a case like this where you cannot share your code due to internal copyright policies.

Review (Rating: 5)

This paper introduces LiteVAE, a novel approach that combines multi-scale VAE and discrete wavelet transform to reduce computational cost and enhance reconstruction capabilities. Both components are well-grounded and supported by experimental results. Additionally, the paper provides a detailed pipeline and ablation studies on training VAE for diffusion models, which will also benefit readers.

Strengths

  1. This paper integrates multi-scale VAE and discrete wavelet transform to reduce computational cost and boost reconstruction performance.
  2. The paper offers a detailed training pipeline for VAE in diffusion models, along with ablation studies that readers will find beneficial.
  3. The paper presents several interesting tricks for improving VAE training, including (1) removing group normalization from the decoder, (2) using a U-Net-based discriminator, and (3) eliminating the adaptive weight λ_reg.

Weaknesses

  1. The multi-scale VAE and discrete wavelet transform are two relatively independent improvements for VAE, so I recommend that the authors conduct an ablation study on these two components.
  2. The new VAE training pipeline is a valuable resource for the community, so I suggest that the authors release not only the VAE checkpoint but also the entire training code.

Questions

See the weaknesses part.

Limitations

None

Author Response

We greatly appreciate the reviewer's helpful suggestions, as well as the positive assessment of the influence and quality of our work. Below, we provide detailed responses to the reviewer’s comments.

Question about multiscale VAE and wavelets

We would like to note that the multiscale structure of our model and the wavelet transforms are in fact closely coupled, as the multi-scale part arises exactly because wavelet transforms are multiscale operations by nature. We are happy to make this relationship clearer in the paper. We also performed extra ablations in Appendix D on the importance of using all wavelet levels, as well as on sharing the feature-extraction networks for each part and other properties of this specific architecture. We would be happy to include additional ablations in the final version if requested.

Code availability

As pointed out in our general response, due to internal copyright policies, we are unfortunately unable to share the source code of this work. However, we will do our best to make sure that the results are reproducible by providing more implementation details and detailed pseudocode in the final version.

Review (Rating: 5)

The authors propose LiteVAE, a novel architecture for the VAE decoding step of latent diffusion. They show that LiteVAE can achieve comparable performance to SD-VAE, the default latent diffusion decoder, while using fewer parameters. The efficiency gain comes from using a more lightweight network and a wavelet feature representation. The paper provides an evaluation showing that LiteVAE outperforms the naïve approach of simply scaling down SD-VAE, which gives evidence that LiteVAE offers an architectural improvement.

Strengths

  1. The paper provides an extensive experimental investigation of how to improve the efficiency of the VAE upscaling step, and is clearly presented. To me the contribution is significant, due to the dominance of diffusion modelling for image generation. A performance gain in even one step can improve the efficiency of many real-world uses.

  2. The use of wavelets is relatively underexplored in the literature. Using them to improve perceptual quality in generated images is novel to the best of my knowledge.

  3. All claims are supported by experimentation and ablation studies.

Weaknesses

  1. My main concern is with the impact of this work. The motivation does not state how much more efficient these improvements would make the full diffusion pipeline. Is the VAE step such a significant performance bottleneck?

  2. No code is available for a paper whose main contribution is experimental. I would have liked to see how the wavelet features are handled, as PyTorch does not provide official modules for wavelet decompositions. I am a bit concerned about how easy this model is to build and deploy.

  3. I have some concerns about whether LiteVAE can offer a performance boost in domains other than natural images. If this is the case, maybe it should be mentioned as a limitation. My concern stems from the wavelet decomposition allowing to explicitly target high frequency content reconstruction, which is especially important for natural image upscaling.

  4. Table 4 (difference between group norm and SMC) should ideally have error bars, as it claims an improvement in favour of SMC against group norm.

Questions

  1. In line 24 the computational burden of the VAE is given and compared to the diffusion Unet. Is the 86 GFLOP figure for the latter given per diffusion step, or for the entire diffusion process? In general, how much of a performance gain would we expect for the full image generation pipeline when using LiteVAE instead of SD-VAE?

  2. Can LiteVAE be used as a drop-in replacement of SD-VAE in real world applications?

  3. In line 287 it is mentioned that LiteVAE could be applied to different scenarios. Wavelets are known to offer a good basis for representing natural images. What is the motivation behind hypothesising that this basis will yield good performance in other domains?

  4. Could you comment on whether using the wavelet basis and reconstruction loss with the original SD-VAE might yield similar performance?

Limitations

The limitations are not discussed.

Author Response

We wish to thank the reviewer for the helpful comments and for finding our work novel with detailed evaluations, good presentation, and significant contribution. Please find our answers to the comments below.

Impact of the work

The VAE component in latent diffusion models is responsible for processing high-resolution images, which can be computationally intensive. We observed that replacing the standard VAE with LiteVAE during the training of DiT models results in an approximately 30-35% increase in the speed of each training step. Additionally, as highlighted in the introduction of our submission, the GFLOPs required for querying the VAE encoder exceed those required by the Stable Diffusion UNet. Consequently, enhancing the performance of the VAE significantly improves the training efficiency of latent diffusion models. Moreover, in applications such as score distillation (SDS) [1], the algorithm requires backpropagation through the encoder, meaning that optimizing the efficiency of the encoder can have a substantial impact on performance in these contexts as well.

[1] Poole B, Jain A, Barron JT, Mildenhall B. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988. 2022 Sep 29.

Code availability

We agree with the reviewer that making the work reproducible is an important aspect of experimental papers. Unfortunately, due to internal copyright policies, we are unable to share the source code of this work. For computing wavelets, we used the fbcotter/pytorch_wavelets library. We have also included detailed implementation details in the appendix and would be happy to add more information and detailed pseudocode to ensure that the results are easily reproducible.

Application to other scenarios

We thank the reviewer for pointing out this somewhat ambiguous terminology. Our work focuses on the application of latent diffusion and autoencoders to natural images, and we make no claims for other domains. What we mean in line 287 is that the use of wavelets could be explored further in other autoencoder-based generative models for natural images (such as vector-quantized models); the focus of the paper remains solely on the natural image domain. That being said, we agree with the reviewer that our method assumes wavelets are well-suited to the particular application in which LiteVAE is being used. We thank the reviewer for pointing this out and would be happy to acknowledge this as a limitation or assumption of our method.

Error bars for Table 4

We agree with the reviewer that including error bars in Table 4 would make the results more convincing. However, doing so would require multiple training runs of different autoencoders, which is outside our compute budget. We should also note that, beyond the slight improvements in reconstruction quality, removing feature imbalances improves training stability, as observed in previous works (e.g., [1, 2, 3]).

[1] Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8110-8119.

[2] Karras T, Aittala M, Lehtinen J, Hellsten J, Aila T, Laine S. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24174-24184.

[3] Salimans T, Kingma DP. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 2016, 29.

Question about the GFLOPs

The GFLOPs are reported for a single forward pass through the encoder and the diffusion UNet; these are the forward calls that must be made at each training step of an LDM. Because this step is executed at every training iteration, a noticeable improvement can be made by switching to a more efficient VAE, since the encoder's computational complexity is even larger than that of the UNet.

Question about using LiteVAE instead of SD-VAE

LiteVAE can be used as a drop-in replacement for SD-VAE in applications that only utilize the autoencoder from Stable Diffusion. However, because changing the autoencoder alters the latent space, the Stable Diffusion model itself is not compatible with LiteVAE out of the box. That said, as pointed out in [1], it is possible to fine-tune the diffusion UNet with a new VAE for rapid adaptation of existing pretrained models.

[1] Chen J, Ge C, Xie E, Wu Y, Yao L, Ren X, Wang Z, Luo P, Lu H, Li Z. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692, 2024.

Question about using wavelets with SD-VAE

We believe that using wavelets with SD-VAE will also enhance quality and performance. However, please note that doing so deviates from the original SD-VAE model and makes the setup more similar to LiteVAE. In case we misunderstood the question, we would be happy to provide more discussion on this.

Comment

Thank you for your response.

I was confused about the comparison of Unet and LiteVAE GFLOPs, because I thought the argument was made in the context of image generation, where Unet is applied repeatedly during denoising. I now see that the argument is made for training only, where the encoder and Unet are generally applied once each (please correct me if I'm wrong). The authors did mention this in the text, my apologies for missing it. I also appreciate the point about score distillation.

It is always helpful for the community to have an official software implementation, but I understand if the authors have limitations that preclude them from releasing code. Code availability will not influence my score. I would urge the authors to instead provide ample information to guide the construction of this model, and would like to remind them that this would also ease the adoption of their method.

Overall, I will revise my score up one, because I agree with the authors that efficiency of training is important. I am also of the opinion that efficiency of training should be explored more, because of the abysmal power requirements of training large models. However, unless the authors can convince me of the wider scientific significance of their work, I will keep my score in the borderline region.

Comment

We thank the reviewer for the thoughtful response to our rebuttal. We will certainly refine the description of our method to ensure that it is straightforward for others to implement.

We believe that the reviewer makes an excellent case for the broader significance of our work given the rising concern over the resources consumed by AI. But we also recognize that evaluating scientific significance has a subjective component to it, and we sincerely appreciate the reviewer’s positive consideration.

Author Response

We thank all reviewers for recognizing our paper as well-structured and easy to read, and for highlighting its interesting ideas and detailed evaluations.

We would like to clarify that the primary goal of LiteVAE is to study the efficiency and reconstruction capabilities of the autoencoder, as we believe this is an important area that has received comparatively less attention from the LDM community than the diffusion model itself. Our main contribution is a more efficient encoder that achieves the same reconstruction quality with significantly fewer parameters. We demonstrate that our method achieves reconstruction quality comparable to that of a standard VAE while requiring significantly less compute. This leads to faster (>2x) training of the first stage and higher throughput (up to 35%) in the second stage of LDMs, and it also has the potential to improve efficiency in LDM-based applications such as score distillation (SDS) [1].

Also, since some reviewers asked for the code of our work, we should unfortunately mention that due to internal copyright policies, we are not able to share the full code of the paper. However, we would be happy to add more information regarding the implementation details and detailed pseudocode for different LiteVAE components to ensure reproducibility. In response to the reviewers' suggestions, we are happy to also expand the limitation section of the paper in the final version.

We have also prepared individual responses to each reviewer and welcome any follow-up discussions. Given that there are no major concerns with our work and reviewers agree that it is a novel and well-presented paper with strong motivation/evaluation and relatively significant contribution, we hope that our rebuttal motivates the reviewers to adjust their scores accordingly.

[1] Poole B, Jain A, Barron JT, Mildenhall B. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988. 2022 Sep 29.

Final Decision

The paper proposes a new lightweight multiscale encoder by incorporating 2D DWTs, thereby significantly decreasing the encoder parameters leading to faster training and lower GPU memory requirements as compared to traditional VAEs. A set of model sizes are presented leading to desirable GPU memory, throughput and reconstruction quality tradeoffs.

The paper received four reviews with unanimity in their (post-discussion) ratings. The reviewers carefully engaged with the paper and participated in subsequent discussions. The paper is readable and well organized, the key ideas are well motivated, and the reviewers liked the reasoning accompanying the main approach. Though the use of wavelets is not entirely new, applying them to address the computational costs of training LDMs is seen as potentially having significant impact, and the proposed approach is considered novel. The training details and the "tricks" shared are expected to be valuable for improving VAE training. The experiments and ablations are found to be sufficient to support the main claims of the work.

The initial reviews were split around the borderline. The primary concerns were about the actual computational and quality impact when LiteVAE is incorporated into LDMs, the lack of error bars where the statistical results are close, reproducibility, ablation studies and sensitivity analysis, and some typos and missing details that needed clarification. The authors addressed many of these concerns during the discussion, resulting in an upward revision of ratings by two reviewers and ultimately unanimity.

While concerns around the actual impact when LiteVAE is incorporated into LDMs, statistical significance, and reproducibility persist, the paper can be significantly improved by incorporating the clarifications and material provided by the authors in their rebuttal and responses.