Improving the Diffusability of Autoencoders
We explore the spectral properties of modern autoencoders used for image/video latent diffusion training and find that a simple downsampling regularization can substantially boost their downstream LDM performance.
Abstract
Reviews and Discussion
This paper finds that pre-trained VAEs for visual generation exhibit larger high-frequency components than the original RGB images, through the lens of spectral analysis using the 2D DCT.
To improve latent diffusion generative modeling, the authors propose aligning the spectral properties of image latents with those of RGB images by enhancing the VAE's reconstruction of low-frequency signals.
Results on image generation and video generation show that the proposed Scale Equivariance regularization improves the generation quality.
Questions for Authors
Q1: In Equation 1, the normalized amplitudes should be ≤ 1 after normalizing by D_{0,0}, but why are some components still > 1 in Figure 2 and Figure 3?
Q2: Why 'As the number of channels in the autoencoder’s bottleneck increases, high-frequency components become more pronounced'?
Q3: What is the number of sampling steps used in Table 1? What are the results when using different sampling steps?
Claims and Evidence
The claim below lacks verification or literature support:
We also hypothesize that higher frequencies components are harder to model than lower frequency components for the following reasons and thus should be avoided: (i) they have higher dimensionality; (iii) they are more susceptible to error accumulation over time
Reasons:
- High-frequency components of images usually have lower entropy than the low-frequency components, so image high-frequency signals should be easier to model since they are intuitively close to 0 (this is also why JPEG removes high-frequency signals for compression).
- It is not clear to me why high-frequency signals have higher dimensionality.
- It is not clear why high-frequency signals are more susceptible to error accumulation.
Methods and Evaluation Criteria
The authors propose a regularization term (Scale Equivariance) on top of the vanilla VAE loss, which is easy to understand. However, different from the explanation in the paper, I think the principle of SE is that it explicitly enhances the reconstruction of the image's low-frequency signals.
The evaluation criteria FID, FVD, PSNR, SSIM, and LPIPS are commonly used in the community.
Theoretical Claims
No theoretical claims.
Experimental Design and Analyses
The experimental designs and analyses are overall sound and convincing. The authors verify their methods on image generation, video generation, and autoencoder reconstruction.
One suggestion: it would be better for the authors to add an experiment showing the effect of Scale Equivariance when training from scratch. (The authors only verified SE by fine-tuning pre-trained VAEs.)
Supplementary Material
I have reviewed Appendix B (Additional exploration), where the authors explore a fine-grained version of Scale Equivariance.
Relation to Broader Scientific Literature
The key contribution of this paper is improving the VAE used in image generation and video generation.
Specifically, SD-VAE, FluxAE, and CogVideoX-AE are basic components for high-resolution image generation and video generation.
Essential References Not Discussed
The references are well discussed.
Other Strengths and Weaknesses
Strengths:
- The paper is easy to follow. The analysis, motivation, and method are overall sound and well-organized.
- Experiments are well-designed, and the results are promising.
- The proposed regularization method is potentially useful for the community.
Weaknesses:
- It is not clear why reducing the high frequencies of the latent space can improve generative modeling. The authors provide a hypothesis but lack sufficient verification. I encourage the authors to explore it via the entropy of the high-frequency coefficients in both RGB space and DCT space, where the distribution of high-frequency components in the RGB space should have low entropy.
- It would be better to show the effect of Scale Equivariance when training the VAE from scratch, as a supplement to fine-tuning.
- It is not obvious from Figure 5 that Scale Equivariance preserves more content compared to the baseline.
- Presentation-wise: (1) it would be better to explain in detail why stronger KL regularization leads to larger high-frequency latent components (random noise in the latent codes); (2) CosmosTokenizer employs the wavelet transform mainly for compression, according to their paper; (3) Equation 2 should have a scalar weight on the Scale Equivariance term.
Other Comments or Suggestions
Already stated above.
We deeply appreciate the reviewer’s valuable feedback and constructive recommendations. Below, we systematically respond to each issue highlighted. We will ensure comprehensive incorporation of all suggestions into our manuscript.
It is not clear why high-frequency components have higher dimensionality.
Reviewer JqrQ has raised the same concern, and due to the limited space, we politely refer to our argumentation in that other response.
It is not clear why high-frequency components are more susceptible to error accumulation.
Similarly to the previous question, we kindly refer the reviewer to our response to Reviewer JqrQ on the same matter.
Can SE be helpful just because it improves the reconstruction of low-frequency components?
Not quite. PSNR/SSIM scores are extremely sensitive to low-frequency reconstructions, and all the autoencoders (vanilla, fine-tuned and fine-tuned+SE) perform on par in terms of these metrics.
From-scratch training.
Due to the space limit, we are again forced to refer the reviewer to our response on the same question to Reviewer MRKn. We apologize for this inconvenience.
The improved performance can be due to a different reason: high-frequency components of images have a lower entropy and should be easier to model.
In this [plot], we visualize the entropy of the latents for regularized/non-regularized FluxAE at various frequencies, alongside RGB. Entropy is computed by building a 100-bin histogram for each frequency and taking its corresponding density. While in image space high frequencies indeed have lower entropy, for the latents the high-frequency entropy exhibits a much flatter, or even slightly increasing, profile, indicating that they might be harder to model. We note that while higher entropy should intuitively make a distribution harder to model, it does not necessarily affect the quality of the final samples obtained by passing the latents through the decoder: higher frequencies of smaller scale can still exhibit high entropy, while our SE reduces the dependence on these frequencies.
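The measurement recipe above (a 100-bin histogram per frequency) can be sketched as follows. This is an illustrative reconstruction using synthetic Gaussian data in place of the real latents, so the numbers are not the paper's; function and variable names are our own.

```python
import numpy as np
from scipy.fft import dctn

def per_frequency_entropy(batch, bins=100):
    """batch: (B, H, W) array. Returns an (H, W) map of per-frequency entropies."""
    # 2D DCT of each sample, then a histogram per spatial frequency.
    coeffs = np.stack([dctn(x, norm="ortho") for x in batch])
    ent = np.zeros(batch.shape[1:])
    for i in range(batch.shape[1]):
        for j in range(batch.shape[2]):
            hist, _ = np.histogram(coeffs[:, i, j], bins=bins)
            p = hist / hist.sum()
            p = p[p > 0]
            ent[i, j] = -np.sum(p * np.log(p))  # Shannon entropy in nats
    return ent

rng = np.random.default_rng(0)
ent = per_frequency_entropy(rng.standard_normal((512, 8, 8)))
print(ent.shape)  # (8, 8)
```

For real latents one would replace the synthetic batch with encoder outputs and compare the resulting entropy maps across frequencies.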
Figure 5 does not show that SE preserves more content compared to the baseline.
We deliberately keep the SE regularization gentle so that it does not affect reconstruction quality. Figure 5 shows that our regularized AE does not introduce spurious high frequencies when they are chopped off in the latents. This also results in meaningful improvements in the corresponding reconstruction metrics, as confirmed in Figure 8. Other examples (e.g., [this one]) show this effect more noticeably.
[3 writing issues]
We fully agree with the remarks and will incorporate them in the manuscript.
Why are some amplitudes greater than 1 after normalization by D_{0,0} in Figures 2 and 3?
D_{0,0} corresponds to the lowest frequency, which is essentially the mean of all the values. It does not necessarily have the largest amplitude among all frequencies (though this rarely happens in natural signals). For example, [this figure] shows an image with zero amplitude everywhere except at the [0,1] spatial frequency.
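A toy construction of this effect: a signal dominated by a single non-DC frequency with a small mean. Its (0,1) DCT amplitude then exceeds D_{0,0}, so normalizing by D_{0,0} yields values greater than 1 (parameters are arbitrary illustrative choices).

```python
import numpy as np
from scipy.fft import dctn

N = 32
n = np.arange(N)
# Small DC offset plus one DCT-II basis function along the rows.
row = 0.1 + np.cos(np.pi * (n + 0.5) / N)
img = np.tile(row, (N, 1))

D = dctn(img, norm="ortho")
ratio = np.abs(D[0, 1]) / np.abs(D[0, 0])
print(ratio > 1)  # True: the amplitude normalized by D_{0,0} exceeds 1
```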
Why do high-frequency components become more pronounced as bottleneck channels increase?
This is an interesting question. We hypothesize that increasing the autoencoder's bottleneck channels enables the model to better capture finer, high-frequency details. Initially, with limited capacity, the encoder prioritizes smoother, low-frequency information. As capacity grows, it encodes additional high-frequency details, which are unique and informative. However, without explicit regularization promoting frequency-based disentanglement, these high-frequency components distribute across channels in an unstructured manner. Thus, higher-dimensional bottlenecks enhance high-frequency representations but do not yield systematic frequency-specific disentanglement per channel. We will clarify this point in the revised manuscript.
Results for various numbers of steps.
We generated the Table 1 results using 256 steps (L#318). Additional plots for FluxAE/CogVideoX-AE/LTX-AE across step counts (16,32,64,128,256) for FID/DinoFID and FVD (images: 50K samples, videos: 10K samples) are provided [here (see the neighboring folders as well)]. Regularized autoencoders consistently improve diffusability. We will include these results (+ DiT-XL/2 for CogVideoX-AE) in the final paper.
Extended KL influence discussion.
We provided an expanded KL discussion (noise injection, relevant literature, RGB+noise spectrum analysis) in response to Reviewer tijt.
We welcome any additional suggestions from the reviewer.
Thanks for the responses. Some of my doubts and concerns have been resolved. Regarding the claims in the paper, I strongly suggest that the authors improve them by adding rigorous verification or rephrasing with proper citations, because these can be potentially important and interesting insights. Considering the contribution of this paper to the community, I maintain my rating and lean toward accepting the paper.
The paper observes higher-frequency components in the VAE's latent space than in normal RGB images, and these high-frequency components have greater magnitude with more channels and stronger KL regularization. Therefore, it proposes a novel regularization technique, scale equivariance (SE), to improve the diffusability of the VAE. Specifically, SE suppresses the high-frequency components in latent space by introducing an additional loss between the ground-truth downsampled image and the reconstruction from the correspondingly downsampled latent vector. Extensive experiments have been conducted on different VAEs to demonstrate the effectiveness of the proposed method.
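The reconstruction-plus-downsampled-reconstruction objective summarized above can be sketched as a minimal numpy toy. Identity functions stand in for the encoder/decoder (in practice these are the VAE networks), and all names here are illustrative, not the paper's.

```python
import numpy as np

def downsample(x, k=2):
    """Average-pool an (H, W) array by a factor of k per spatial dimension."""
    H, W = x.shape
    return x.reshape(H // k, k, W // k, k).mean(axis=(1, 3))

def se_objective(x, encode, decode, k=2):
    z = encode(x)
    loss_rec = np.mean((decode(z) - x) ** 2)  # usual reconstruction term
    # SE term: decoding the downsampled latent should match the downsampled image.
    loss_se = np.mean((decode(downsample(z, k)) - downsample(x, k)) ** 2)
    return loss_rec + loss_se  # the paper additionally weights the SE term

# Sanity check: with identity encoder/decoder both terms vanish.
x = np.random.default_rng(0).standard_normal((8, 8))
total = se_objective(x, lambda v: v, lambda v: v)
print(total)  # 0.0
```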
Questions for Authors
I have listed all my questions in each part. While the proposed method does improve performance, my major concerns are: (1) whether the motivation of the method (suppressing high-frequency components) is what leads to the improved performance; (2) whether it is fair to compare with the original baseline at the same iteration count, given the possibly additional computation budget from the new loss term; (3) whether it is possible to simplify the method using a reconstruction loss with a dynamic downsampling ratio, or to further improve generation performance with weights balancing the two reconstruction losses.
If the authors can resolve some of my concerns (mostly about 1 and 2), I am happy to increase my score.
Claims and Evidence
The effectiveness of the proposed method is well supported in the experiment section. Nevertheless, I have the following questions:
- For hypothesis (i) in line 210 (right column), how do the results in Figure 4 imply that high-frequency components are higher-dimensional? Adding a new loss does not necessarily increase the maximum dimensionality the VAE can model.
- For hypothesis (ii), how does Figure 5 show that higher frequencies are generated only in the final steps of sampling? I thought Figure 5 only considers images reconstructed with the VAE, and there is no sampling from the diffusion model.
- For hypothesis (iii), how does Figure 6 demonstrate that higher frequencies are more susceptible to error accumulation over time? Is it possible to provide more quantitative evidence for this claim (rather than sampled noisy images during the diffusion denoising process)?
- Empirically, it is observed that using scale equivariance improves the performance of the diffusion model on different metrics. But I do not find any quantitative results illustrating that generation performance, or the diffusability of the VAE, is negatively correlated with the presence of high-frequency components. It is possible that scale equivariance interacts with other factors in the VAE's latent space and inadvertently improves the generation performance of the final generative model.
Methods and Evaluation Criteria
The paper adopts standard metrics for VAEs and diffusion models. But I have some important concerns regarding the computation budget of the baseline versus scale equivariance.
- Regarding Equation (2), if I understand correctly, the additional loss term for the downsampling introduces extra GPU memory cost during the forward and backward passes. Can you provide the FLOPs of each VAE training/fine-tuning iteration? I am wondering whether it is fair to compare with the baseline without scale equivariance, given that they might use different compute budgets at the same number of iterations.
- Also, is it possible to adopt dynamically downsampled latent vectors and images during training rather than introducing a new loss term? In other words, we do not use the first term in Equation 2, and instead vary the downsampling ratio for each batch of images and latent vectors (of course, we still need one forward pass to get the latent z from the original x). Can we still achieve better performance than the original baseline?
- Can we introduce a coefficient to balance the weight between the original reconstruction objective and the downsampled one? What is the optimal coefficient (empirically)?
Theoretical Claims
This paper does not introduce any new theory or proof.
Experimental Design and Analyses
The paper has conducted extensive evaluations using various VAE models, including FluxAE, CogVideoX-AE, and LTX-AE, and across different datasets, including ImageNet-1K and Kinetics-700. In all cases, the dataset (in-the-wild data) used to train the VAE differs from the one used to train the diffusion model. What if they are the same? Does scale equivariance still help filter out higher frequencies and improve generation performance?
Supplementary Material
The authors do not provide any supplementary material, which is not an issue for me.
Relation to Broader Scientific Literature
The proposed method is both simple and novel. I am not aware of any prior work using downsampled images to reduce high-frequency components in VAEs.
Essential References Not Discussed
I do not find any important yet missing cited works.
Other Strengths and Weaknesses
See above for each section.
Other Comments or Suggestions
I do not have major comments and suggestions other than questions listed above.
We thank the reviewer for their valuable remarks. Below, we address each raised concern.
Why do high-frequency components have higher dimensionality?
We agree this could be clearer. By "low-frequency components," we mean the DCT coefficients required to reconstruct a feature map downsampled by a factor of k per spatial dimension. High-frequency components are the remaining coefficients needed to fully reconstruct the original map. Since the number of DCT coefficients scales quadratically with spatial resolution, we have:
Number of high-frequency components = (k² − 1) × Number of low-frequency components
For k=2 (our experiments), high-frequency components thus have three times higher dimensionality. We will state this clearly in the final version.
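The counting argument above can be checked directly; for an H×W feature map and downsampling factor k, the low-frequency DCT block is (H/k)×(W/k), so high-frequency coefficients outnumber low-frequency ones (k² − 1) : 1. The map size is an arbitrary example.

```python
# Worked count: high-frequency vs. low-frequency DCT coefficients.
H = W = 32
for k in (2, 4):
    low = (H // k) * (W // k)   # coefficients of the downsampled map
    high = H * W - low          # remaining coefficients
    print(k, high // low)       # k=2 -> 3, k=4 -> 15
```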
It is unclear why higher frequencies are only generated in the final steps.
This claim originates from prior literature (Rissanen et al., "Generative Modelling With Inverse Heat Dissipation"; see also "Diffusion is spectral autoregression" and DCTDiff). Intuitively, high-frequency components are too easily erased by noise in the early denoising steps, which is why the model can only pick them up at smaller noise levels, i.e., in the final denoising steps. We will extend the discussion of this topic in the revised version.
It is unclear why higher frequencies are more prone to error accumulation.
This observation is also an already-explored phenomenon in the broader diffusion literature. A solid reference we can point to is Li et al., "On Error Propagation of Diffusion Models" (ICLR 2024), with Figure 2 of their paper serving as a clear illustration of this behaviour. Intuitively, the denoising process of a diffusion model can be represented as an ODE trajectory, and ODEs are known to accumulate errors quickly even for simple processes (e.g., even for the simplest equation y' = y, a forward Euler solver incurs a local error proportional to h² at each step of size h, and these errors compound exponentially along the trajectory). Then, since diffusion models generate high frequencies later in the trajectory (per our previous response), the error accumulation is amplified specifically for them.
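The y' = y example above can be reproduced in a few lines; the step size and horizons are arbitrary demo values.

```python
import math

# Forward Euler on y' = y, y(0) = 1 (exact solution e^t). The global
# error grows along the trajectory even though each step is accurate.
def euler(h, t_end):
    y = 1.0
    for _ in range(round(t_end / h)):
        y += h * y
    return y

h = 0.01
errors = [math.exp(t) - euler(h, t) for t in (1.0, 2.0, 4.0)]
print(errors)  # strictly increasing: the error compounds over time
```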
Could SE interact with other AE properties, rather than spectral properties alone?
In Table 6 of the appendix, we provide an ablation with a direct high-frequency chop-off: it only influences the spectral properties (and this influence is the same as for SE), eliminating the interaction with other factors. It noticeably improves diffusability, but we opt for the equivalent SE regularization since it is much simpler and less error-prone to implement and should be easier for the community to adopt.
Computational cost and potentially unfair AE training budget comparison.
We measured the FLOPs of FluxAE (for a batch size of 1 and a resolution of 256²) using [fvcore]. The entire encoder-decoder pass takes 447 GFLOPs, split between the encoder and decoder as 136 vs. 311 GFLOPs. Our regularization reuses the encoder pass and only runs the decoder at 2× or 4× reduced resolution (the scale is sampled randomly during training). This results in 77.6 or 19.4 extra GFLOPs for the decoder, which is almost exactly 1/4 or 1/16 of the decoder compute, or +17% or +4.5% of the total forward pass. Since we sample the 2× or 4× downsampling factor with equal probability, this results in ~10.75% total FLOPs overhead for our regularization.
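The overhead accounting above can be reproduced as a back-of-envelope calculation from the reported 136/311 GFLOPs encoder/decoder split (the measured values may differ slightly from these idealized 1/4 and 1/16 scalings).

```python
# Back-of-envelope FLOPs overhead of the SE decoder pass.
encoder, decoder = 136.0, 311.0
total = encoder + decoder                   # 447 GFLOPs per full pass
extra = {2: decoder / 4, 4: decoder / 16}   # decoder at reduced resolution
for k, flops in extra.items():
    print(k, round(flops, 1), round(100 * flops / total, 1))
avg = 0.5 * (extra[2] + extra[4]) / total   # 2x/4x sampled with equal prob.
print(round(100 * avg, 1))                  # ~10.9% average overhead
```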
To strengthen our point even further, we ran an experiment where the baseline FluxAE was fine-tuned for exactly 2 times longer (20K instead of 10K iterations). The resulting DiT-B/2 model achieved FID@5k and DinoFID@5k of only 33.99 and 642.7, vs. the corresponding metrics of 25.87 and 551.27 for our FluxAE+FT-SE, fine-tuned for only 10K steps.
Balancing reconstruction and SE regularization.
We appreciate the raised concern, and we provide an ablation over the SE strength in the table [here]. We remain cautious about reducing the influence of the main reconstruction term, so as not to lose reconstruction quality.
Training AE and LDM on the same dataset.
We conducted from-scratch FluxAE and CogVideoAE training on ImageNet/Kinetics datasets, followed by DiT-B/2 training, as detailed in our response to Reviewer MRKn. In both cases, SE regularization consistently improved results.
Unclear whether suppressing high frequencies is what drives the improved performance.
We emphasize that merely suppressing high frequencies is insufficient; it is also crucial to prevent the decoder from arbitrarily amplifying them (we provide a detailed discussion of this in Sec. 3.3). This aligns the diffusion process's strength, namely its ability to generate low-frequency components, with human perception, which is more tolerant of errors in high frequencies. Our experiments with the direct frequency chop-off (described above) articulate this more explicitly.
I appreciate the rebuttal from the authors. My main concerns about the additional computation budget introduced by SE and other questions regarding the validity of the claims in the paper have been addressed, so I will update my score to 3.
I encourage the authors to move the results of Table 6 in the Appendix (or part of them) to the main text. These results motivate the paper more intuitively, as the chop-off only removes high frequencies without introducing any additional confounding factors (that could inadvertently improve the VAE's performance). I also think it might be worth trying to randomly downsample the latent vector z during training instead of introducing an additional loss in SE.
This paper explores the latent spaces of autoencoders within latent diffusion models (LDMs), specifically examining spectral discrepancies between latent and RGB spaces. The authors introduce the concept of diffusability, which quantifies how effectively a distribution can be modeled by a diffusion process. They hypothesize that high-frequency components in the latent space degrade diffusability, thereby reducing both the efficiency and generation quality of LDMs. To mitigate this issue, they propose a scale equivariance regularization strategy, which enforces spectral alignment between the latent and RGB spaces by removing high-frequency components. Empirical evaluations demonstrate that this approach improves image generation performance by 19% and video generation by 44% compared to existing LDMs.
Questions for Authors
- As the goal of this paper is to improve diffusability, could you clarify whether the term diffusability encapsulates both the efficiency and generation quality of diffusion models? Specifically, does an increase in diffusability directly imply an improvement in generation quality, as suggested in your results? If not, are there any proposed methods or metrics to quantitatively assess diffusability in the context of latent diffusion models?
- As I mentioned above, in your analysis the hypothesis regarding the impact of high-frequency components on diffusability is derived from spectral properties. Could you provide a more detailed explanation of how the spectral analysis led to this hypothesis? Specifically, what were the key observations or trends that informed this conclusion?
Claims and Evidence
The authors’ hypothesis is grounded in the spectral analysis of latent and RGB spaces, as presented in Figures 2 and 3. However, certain aspects of their analysis remain unclear.
Figure 2: The caption lacks clarity regarding whether the spectra of various channels in FluxAE correspond to reconstructions or latent representations. The distinction between “comparison between the reconstructions” and “the latent space of an autoencoder” is not explicitly addressed, leaving ambiguity in the interpretation of the results. Additionally, given the limited scope of the analysis (only applied on FluxAE), this claim appears insufficiently substantiated. The absence of empirical verification across multiple architectures weakens the argument that this trend is a fundamental characteristic of diffusion models in general.
Figure 3: The authors claim in Section 3.2 that “higher KL regularization introduces more high frequencies”. However, the figure does not provide clear evidence supporting this statement. No explicit trend is observed between the scale of KL regularization and the power of high-frequency components, contradicting the claim made in the text. This lack of alignment between theoretical justification and empirical results raises concerns about the robustness of the proposed hypothesis.
Given these limitations in the spectral analysis, the motivation for the proposed regularization technique appears to be based on a weak foundation. The authors do not provide sufficiently clear or convincing evidence to establish a strong causal link between their observations and the claimed effects on diffusion model performance. A more rigorous analysis, including evaluations across multiple architectures and additional spectral studies, would be necessary to substantiate their claims.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria make sense. However, a more comprehensive evaluation across multiple architectures for image generation would be necessary to substantiate the claim that their method effectively improves diffusability in LDMs more broadly.
As the strength of the regularization is controlled (stated to be 0.25 in the paper), the scale factor for the scale equivariance term should be included in Equation 2.
Theoretical Claims
NA
Experimental Design and Analyses
The experimental design seems to be well-constructed and appropriate for the research objectives. The authors effectively evaluate their method on both image and video generation tasks, which demonstrates the broader applicability of their proposed regularization technique. Furthermore, the ablation study provides valuable insights by comparing the autoencoder's reconstruction quality with and without the regularization term, highlighting the impact of their approach on the model’s performance. However, there is still an issue with the unclear trend in Table 3, where the relationship between the KL regularization scale factor and high-frequency components is not clearly demonstrated.
Supplementary Material
NA
Relation to Broader Scientific Literature
The authors propose a regularization method that truncates these high frequencies, drawing from earlier work on spectral regularization techniques. By aligning the spectral properties of latent and RGB spaces, their approach improves generation quality, offering a novel perspective on latent space manipulation in generative models.
Essential References Not Discussed
NA
Other Strengths and Weaknesses
NA
Other Comments or Suggestions
NA
We thank the reviewer for their insightful comments, which have greatly improved our work. Below, we address each concern.
Ambiguous caption in Figure 2
Figure 2 shows spectra of latent codes from real images encoded with from-scratch trained FluxAE with varying bottleneck sizes. We clarified the caption accordingly.
Exploring influence of bottleneck dimensionality on high-frequency components with more architectures
Upon closer inspection, we discovered a broadcasting issue in our spectrum-computation pipeline, which led to distorted plots in the original submission. Updated FluxAE results are [here]. We further added analyses for two additional architectures: [WanAE] (which has the same bottleneck dimensionalities as CogVideoX-AE, but is considerably faster to train) and [LTX-AE], trained with increasing bottleneck channel sizes. As one can see from these figures, autoencoders with larger channel sizes tend to exhibit high frequencies of larger relative magnitude.
Fig. 3 does not show clear KL-high-frequency trend
The FluxAE KL plot previously had the same broadcasting issue; corrected results are [here]. We included similar analyses for [WanAE] and [LTX-AE]. These demonstrate a clearer relationship between KL and high-frequency energy.
The influence of KL on the high-frequency spectrum can be attributed to its role in injecting random noise during the encoding stage, an effect present both during training and inference. As KL regularization increases, so does the level of injected noise. Since random Gaussian noise has a uniform power spectrum, this flattens the frequency distribution, disproportionately inflating the high-frequency tail. We illustrate this in [this plot], where progressively increasing Gaussian noise (added to RGB signals normalized to [−1, 1]) results in a visible elevation of the high-frequency content. A similar effect was also independently observed in the concurrent work on [SwD distillation (Figure 1)].
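The noise-injection argument can be illustrated on synthetic data: white Gaussian noise has a flat power spectrum, so adding it to a smooth (low-frequency) signal lifts the relative energy of the high-frequency tail. Signal shape and noise scale here are arbitrary demo choices, not the paper's setup.

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
N = 64
n = np.arange(N)
# Smooth "image": a mixture of two low-frequency cosines.
img = np.add.outer(np.cos(2 * np.pi * n / N), np.cos(4 * np.pi * n / N))

def high_freq_fraction(im):
    """Fraction of DCT energy outside the low-frequency quadrant."""
    D = dctn(im, norm="ortho")
    total = np.sum(D ** 2)
    low = np.sum(D[: N // 2, : N // 2] ** 2)
    return (total - low) / total

clean = high_freq_fraction(img)
noisy = high_freq_fraction(img + 0.5 * rng.standard_normal((N, N)))
print(clean < noisy)  # True: noise inflates the high-frequency share
```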
This reveals a nuanced trade-off: KL regularization aligns latents to the standard normal prior, facilitating downstream diffusion (as noted by LSGM), yet increases high-frequency energy, potentially hindering diffusability. We believe this trade-off warrants further attention.
Eq. 2 missing SE regularization loss weight
We thank the reviewer for pointing this out and corrected the equation.
Table 3 unclear relationship between KL strength and high-frequency components
There may be a misunderstanding: Table 3 illustrates KL's negative effect on diffusability, whereas its impact on high-frequency energy appears in Figure 3. To strengthen this analysis, we added results from DiT-L [here].
This table shows increased KL generally boosts small-model LDM performance, but at the expense of poorer reconstruction and stability, limiting scalability (consistent with SD3 findings). In contrast, our regularization improves LDM performance without harming reconstruction, scaling well to larger models.
Clarify “diffusability”: does it include efficiency and quality, and are there metrics?
We use "diffusability" strictly for generation quality, independent of efficiency. All else being equal, increased diffusability should enhance LDM generation quality. Unfortunately, we found no reliable quantitative metrics that correlate consistently with diffusability, including the diffusion loss magnitude or decoder Lipschitz constants (as theoretically connected via Eq. 28 in [LFM]). We believe it is still important to introduce such a term to the community to bring attention to this property, since it remains largely ignored.
Clarify how spectral analysis suggested high-frequency components impact diffusability
We initially sought high-compression autoencoders but quickly found that increasing bottleneck size significantly reduces diffusability, keeping other factors constant. Then, we concurrently worked on cascaded latent diffusion pipelines, which required autoencoders to support downsampling, motivating us to explore their spectral characteristics. This revealed how increased bottleneck dimensionality relates to diminished diffusability, informing our central hypothesis.
We would greatly welcome any further comments the reviewer might have.
I appreciate the author's constructive response.
My primary concern was the unclear trends in the spectral plots, which limited the credibility of the paper's main claims. The authors identified an issue and provided corrected plots, which improved the reliability and clarity of the results. In addition, I acknowledge the authors' effort to strengthen the generalizability of their findings by incorporating results from additional autoencoder architectures (WanAE and LTX-AE). Since the ambiguous parts of the paper have been appropriately clarified, I am raising my score to 3.
The authors analyze the latent spaces of autoencoders widely used for latent diffusion models and identify that the spectrum of autoencoder latents typically deviates from that of natural images. In particular, latent spaces have stronger high-frequency components compared to RGB images. These high-frequency components are challenging for the diffusion model to learn and can impede performance. The authors propose a simple regularization to align the spectrum of the latent space with that of RGB images: the autoencoder is trained to reconstruct a downsampled version of the RGB image from a downsampled latent code. This removes the high-frequency components from both the RGB image and the latent, enforcing scale equivariance. The authors present results across image and video generation and demonstrate that this autoencoder regularization improves the downstream performance of diffusion models. Extensive comparisons with KL regularization are presented, as well as additional spectral regularization methods in the appendix.
Questions for Authors
- How exactly is the self-conditioning implemented? Is it the latent self-conditioning from the RIN paper, or just self-conditioning on the current data prediction?
- The behavior of the KL regularization seems somewhat non-monotonic (Table 3). I would expect increased regularization to monotonically harm reconstruction, but this does not appear to strictly be the case. For instance, 10e−6 achieves worse reconstruction than 10e−3. Any insight into why this is the case? Did you observe training instabilities?
Claims and Evidence
The claims are supported by clear and convincing evidence. The spectral analysis, in particular, is insightful and clearly motivates the proposed approach. The authors present a comprehensive analysis of the effect of their regularization on both the autoencoder and the downstream generative models. The comparison against KL regularization, the current standard, is comprehensive.
Methods and Evaluation Criteria
The authors are concerned with improving the suitability of autoencoders for downstream generative modeling. The authors evaluate primarily on ImageNet 256 and Kinetics 700 which are suitable benchmarks for latent diffusion modeling. Their evaluation of their autoencoders (reconstruction metrics, spectral analysis) and generative models (FID) is convincing.
Theoretical Claims
The authors do not present any proofs.
Experimental Design and Analyses
The experimental design is sound and the experimental comparisons are fair. One limitation is that the authors always start from an existing high-quality autoencoder and then fine-tune it with their additional regularization. While much more computationally feasible, this raises the question of whether their regularizer conveys the same benefit when training an autoencoder from scratch. I do not think that this is a big limitation, as their method can always be introduced towards the end of training to improve the alignment of the latent spectrum. However, it does limit the ramifications of their findings somewhat. When training a new autoencoder, it is not entirely clear what the optimal procedure is.
Supplementary Material
I did review the supplementary material. I appreciated the additional discussion of more sophisticated spectral regularization techniques.
Relation to Existing Literature
While diffusion models, and latent diffusion models in particular, have exploded in popularity, I think there has been comparatively less focus on what makes a "good" autoencoder for latent diffusion. People often use publicly available autoencoders for which the training decisions may not be entirely transparent. I think that this area is currently under-explored, and this paper is a welcome remedy to that.
Missing Essential References
The discussion of related work is comprehensive.
Other Strengths and Weaknesses
Strengths:
- This work focuses on an under-explored problem: What makes a "good" autoencoder for latent diffusion? I think that studying this problem is challenging in part because evaluation requires both training an autoencoder and a downstream diffusion model in its latent space. I welcome work in this area.
- The proposed regularization is well-motivated. I find the "spectral autoregression" interpretation of diffusion models to be intuitive and it's nice to see this intuition motivating a technique that seems to work well in practice. The spectral analysis throughout is insightful.
- The proposed regularization is simple to implement, increasing the likelihood of adoption. I appreciated the discussion of more complex alternatives in the appendix.
- The comparison with KL regularization is comprehensive.
Weaknesses:
- The work focuses only on fine-tuning pre-trained autoencoders. The regularization could behave differently when learning a model from scratch with additional losses (e.g. adversarial). This limits the takeaways from their work somewhat.
- The autoencoders are fine-tuned on private internal datasets which harms reproducibility. The authors do verify (with their fine-tuning only ablation) that the dataset shift doesn't contribute to the performance boost.
- Some of the implementation details are a bit under-explained. The authors mention that they incorporate self-conditioning in the DiT, but do not provide precise implementation details.
Other Comments or Suggestions
The DiT training details section in appendix A ends on a trailing sentence.
We sincerely thank the reviewer for their thorough feedback. In what follows, we carefully respond to each of the points raised. All comments and suggestions will be fully reflected in the revised manuscript.
From-scratch training
Our main motivation for fine-tuning instead of training from scratch is three-fold: (i) it lets us compare against strong, established benchmarks; (ii) it eliminates the possibility that our regularization is merely rectifying some "bug" in our own training pipeline (please note that none of the explored SotA AEs release their training code); and (iii) improving popular autoencoders carries more value for the community.
That being said, we launched multiple from-scratch training experiments for FluxAE and CogVideoAE. Moreover, we opted for public datasets to make the experiments fully self-sufficient. Namely, we trained FluxAE from scratch on ImageNet for 200K steps and CogVideoAE from scratch on Kinetics-700 for 60K steps (it is a heavyweight AE and slower to train). Due to the limited rebuttal period, we only have DiT-B/2 results up to 300K steps on ImageNet and up to 250K steps on Kinetics. For both datasets, the regularized AEs lead to better LDM convergence. DiT-B/2 for non-regularized vs regularized AEs performs as follows:
- For ImageNet:
  - DinoFID@5k: 569.93 vs 561.4
  - FID@5k: 27.94 vs 28.79
- For Kinetics:
  - DinoFID@5k: 652.5 vs 561.21
  - FID@5k: 21.54 vs 19.66
  - FVD@5k: 265.94 vs 379.88
In all cases, our regularization improves the diffusability. We will include the full training results (400K steps) in the revised version of the paper.
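The exact form of the Scale Equivariance objective is defined in the paper; as a rough sketch of the idea (tying the latent's low frequencies to the image's low frequencies by requiring that a downsampled latent decodes to a downsampled image), here is a minimal numpy illustration with a toy identity autoencoder — the function names and the 2x average-pool downsampling are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def downsample2(x):
    """2x average-pool downsampling over the last two axes (H, W even)."""
    h, w = x.shape[-2], x.shape[-1]
    return x.reshape(*x.shape[:-2], h // 2, 2, w // 2, 2).mean(axis=(-3, -1))

def scale_equivariance_loss(encode, decode, x):
    """Hypothetical scale-equivariance penalty: decoding a downsampled
    latent should match the downsampled image, anchoring the latent's
    low frequencies to the image's low frequencies."""
    z = encode(x)
    return np.mean((decode(downsample2(z)) - downsample2(x)) ** 2)

# Toy identity autoencoder as a stand-in for a real encoder/decoder pair:
encode = decode = lambda a: a
x = np.random.default_rng(0).normal(size=(3, 32, 32))
print(scale_equivariance_loss(encode, decode, x))  # 0.0 for an identity AE
```

In practice this term would be added (with some weight) to the usual reconstruction and adversarial losses during fine-tuning.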
Training on public data
We thank the reviewer for pointing this out; the results on public data are included together with the from-scratch training experiments in the previous message.
Some training details are under-explained (e.g. self-cond)
Our self-conditioning mechanism follows prior work (i.e., RIN, FIT, WALT) without any modifications. Namely, during training, with 90% probability we run an auxiliary forward pass with the DiT model, take its activations from the last block (i.e., right before the "unpatchify" projection), project them with a linear layer, and add them as residuals to the input tokens after "patchification" in the main training forward pass. For that auxiliary forward pass, following RIN, we use the same noise level and a "no-grad" context (i.e., we do not backpropagate through the auxiliary forward pass).
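The mechanism above can be sketched in a framework-free way. This is a toy numpy illustration only: the real model is a DiT with a stop-gradient on the auxiliary pass, whereas here the "blocks", projections, dimensions, and zero-initialized self-conditioning weights are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                                          # token dimension (toy)
W_blocks = rng.normal(scale=0.1, size=(D, D))  # stands in for the DiT blocks
W_out = rng.normal(scale=0.1, size=(D, D))     # "unpatchify" projection
W_sc = np.zeros((D, D))                        # self-cond projection, zero-init

def blocks(tokens):
    # placeholder for the transformer blocks
    return tokens @ W_blocks

def dit_forward(tokens, self_cond=None):
    if self_cond is not None:
        # add projected last-block activations as residuals to input tokens
        tokens = tokens + self_cond @ W_sc
    return blocks(tokens)  # activations right before "unpatchify"

def training_step(tokens, p_self_cond=0.9):
    if rng.random() < p_self_cond:
        # auxiliary pass at the same noise level; gradients would be
        # stopped here in a real implementation (no-grad context)
        aux = dit_forward(tokens)
        out = dit_forward(tokens, self_cond=aux)
    else:
        out = dit_forward(tokens)
    return out @ W_out

pred = training_step(rng.normal(size=(16, D)))
print(pred.shape)  # (16, 8)
```

The zero-initialized projection means self-conditioning starts as a no-op and is learned gradually, a common choice in the RIN line of work.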
We have added the self-conditioning discussion to the "Implementation details" appendix. We would be grateful if the reviewer could point out any further missing implementation details, and we will happily include them in the submission.
The DiT training details section in appendix A ends on a trailing sentence.
We thank the reviewer for pointing out that writing mistake. The sentence was intended to convey: “In essence, this reduces the total dataset size, but since we do the same procedure for the entire CogVideoX-AE family, the models are comparable between each other.” We fixed the error.
Explanation of non-monotonic KL influence on reconstruction quality (Table 3).
Yes, FluxAE was indeed unstable when fine-tuned with KL regularization for some of the KL β weights: it was stable for β of 0, 1e-7, 1e-4, and 1e-3, but not for 1e-6, 1e-5, 1e-2, and 1e-1. For from-scratch training (which we were doing for Figure 3), it was stable for the entire range (from 0.0 to 0.1), but below the threshold of β ≤ 1e-4 there was almost no difference in PSNR or FID. Our intuition is that high-capacity autoencoders (like FluxAE) can accommodate a high KL penalty on their latents (for from-scratch training, we started noticing degradation only for β ≥ 1e-3), and their reconstruction quality is governed by other factors, which makes the influence of KL less predictable up to a certain threshold.
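For reference, the β being swept here weights the standard closed-form KL between a diagonal Gaussian posterior and the unit Gaussian prior in the VAE objective (a minimal numpy sketch; the `recon_mse` placeholder and the shapes are illustrative assumptions):

```python
import numpy as np

def vae_kl(mu, logvar):
    """KL(q(z|x) || N(0, I)) per sample, closed form for diagonal Gaussians:
    0.5 * sum(exp(logvar) + mu^2 - 1 - logvar)."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def vae_loss(recon_mse, mu, logvar, beta=1e-6):
    """Reconstruction term plus beta-weighted KL, as swept over beta above."""
    return recon_mse + beta * np.mean(vae_kl(mu, logvar))

# Sanity check: with mu = 0 and logvar = 0 the KL term vanishes exactly,
# so the loss reduces to the reconstruction term regardless of beta:
mu = np.zeros((4, 16))
logvar = np.zeros((4, 16))
print(vae_loss(0.25, mu, logvar, beta=1e-3))  # 0.25
```

This makes concrete why tiny β values (1e-7 to 1e-3) barely move reconstruction metrics for a high-capacity AE: the KL term is a small additive penalty on the latent statistics, not on the pixels.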
Should the reviewer have further comments, we would be glad to incorporate them fully into our manuscript.
The paper initially received mixed reviews, but 3/4 reviewers increased their scores after the rebuttal, while the one who did not had recommended acceptance from the start.
Reviewers agree that the paper makes a novel contribution to an important but under-explored problem with potential practical applications.
There are a few concerns about whether the hypothesized relation between the latent space spectrum and generative performance is valid, but the paper is nonetheless an important step in investigating the connection between latent space properties and generative model behavior.