PaperHub
7.2/10 · Spotlight · 4 reviewers
Ratings: 3, 4, 4, 4 (min 3, max 4, std 0.4)
ICML 2025

Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator

Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

an efficient and effective finetuning method for enhancing diffusion models and visual autoregressive models

Abstract

While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective, which minimizes the forward KL divergence, inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that integrates likelihood-based generative training and GAN-type discrimination to bypass this fundamental constraint by exploiting reverse KL and self-generated negative signals. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58/1.96 to new records of 1.30/0.97/1.26 on CIFAR-10/ImageNet-64/ImageNet 512$\times$512 datasets without any guidance mechanisms, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256$\times$256.
Keywords
Diffusion Models · Visual Autoregressive Models · GAN · Generation Quality

Reviews and Discussion

Review
Rating: 3

The paper proposes fine-tuning a likelihood-based generative model to steer the generated distribution towards the real distribution from a new perspective. Specifically, the authors parameterize pretrained likelihood-based generative models as the discriminator within the GAN framework. Using a reference generative model, they unify the discriminator and generator as variants of the same generative model with different weights. The paper provides theoretical derivations and analyses of the method's design, leading to a simple yet effective formulation for parameterizing the discriminator. The effectiveness of the proposed approach is demonstrated by fine-tuning EDM and VAR on the CIFAR-10 and ImageNet datasets. Additionally, ablation studies are conducted to analyze the choice of hyperparameters $\alpha$ and $\beta$, and the image generation quality progressively improves during the refinement iterations.

Update after Rebuttal

The authors have addressed most of my concerns. Although the added hyperparameters require additional tuning, I maintain my weak accept to support the paper’s acceptance given its novelty.

Questions To Authors

N/A

Claims And Evidence

Please see the weaknesses.

Methods And Evaluation Criteria

Yes, that makes sense.

Theoretical Claims

The proofs are logical.

Experimental Design And Analyses

The experiments are well designed.

Supplementary Material

I have checked all the content in the supplementary material.

Relation To Broader Literature

The paper offers a new perspective on parameterizing a likelihood-based visual generative model as a discriminator in the GAN framework and may provide insights for fine-tuning likelihood-based generative models beyond vision.

Essential References Not Discussed

N/A

Other Strengths And Weaknesses

Strengths:

  1. The idea of parameterizing a likelihood-based generative model as a discriminator is novel.

  2. The authors provide a solid theoretical analysis and derivation of the method's design.

  3. The authors validate the proposed method on existing pre-trained diffusion models, demonstrating its effectiveness.

  4. The paper is well-written and easy to follow.

Weaknesses:

  1. The introduced $\alpha$ and $\beta$ require additional effort for hyperparameter search.

  2. The motivation for introducing $\alpha$ and $\beta$ to address gradient-vanishing/numerical issues is not well-supported. It would be helpful if the authors could provide experimental evidence to show the problem and demonstrate how $\alpha$ and $\beta$ help mitigate these issues.

  3. The ablation study for Multi-Round Refinement is missing. Could the authors provide experiments where the reference model remains fixed throughout the entire fine-tuning process? Meanwhile, could the authors provide ablation studies on the number of iterations per round? Is a smaller or larger number of iterations more effective?

  4. Could the authors provide experiments on fine-tuning the EDM, EDM2, and VAR using a standard fine-tuning method to assess how much the proposed approach improves over it?

  5. It would be useful to show whether the proposed method also benefits training the generative models from scratch.

Other Comments Or Suggestions

N/A

Author Response

Thank you for your positive comments. Below, we provide detailed responses to your concerns.

The motivation for introducing $\alpha$ and $\beta$ to address gradient-vanishing/numerical issues is not well-supported. It would be helpful if the authors could provide experimental evidence to show the problem and demonstrate how $\alpha$ and $\beta$ help mitigate these issues.

The training will simply collapse without introducing a small $\beta$. For example, on CIFAR-10, the data dimension is 3072, and the bits per dimension (BPD) is around 2.5, meaning the log-likelihood $\log p_\theta(x)$ is around $3072 \times 2.5 \times \log 2 \approx 5000$. The log-likelihood ratio $\log\frac{p_\theta}{p_{\theta_{\text{ref}}}}$ also scales linearly with the data dimension and can be as large as hundreds, making $\sigma(\log\frac{p_\theta}{p_{\theta_{\text{ref}}}})$ suffer from gradient vanishing. If we use $\beta=1$, the gradient will vanish when $p_\theta$ deviates only slightly from $p_{\theta_{\text{ref}}}$, and the finetuning does almost nothing to the base model. We also observe in our experiments that the FID remains unchanged with a large $\beta$. As for $\alpha$, it is an empirical hyperparameter that opposes the fake samples and accelerates the FID decrease, which we have already ablated in Figure 5(c).
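The gradient-vanishing argument can be checked numerically. The following toy sketch (our own illustration, not from the paper; the log-ratio magnitude of 100 is an assumed round number) shows that the logistic sigmoid's derivative underflows at the magnitudes quoted above, while a small $\beta$ rescales the input into a usable range:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Suppose the log-likelihood ratio log(p_theta / p_ref) is around 100,
# a plausible magnitude since it scales linearly with the data dimension.
log_ratio = 100.0

# With beta = 1, the sigmoid saturates and its gradient underflows to 0
# in float64: no learning signal reaches the model.
grad_beta_1 = sigmoid_grad(1.0 * log_ratio)

# With a small beta (0.05, the value quoted for CIFAR-10), the input is
# rescaled into a region where the gradient is non-negligible.
grad_beta_small = sigmoid_grad(0.05 * log_ratio)

print(grad_beta_1)      # effectively zero
print(grad_beta_small)  # ~6.6e-3
```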

The ablation study for Multi-Round Refinement is missing. Could the authors provide experiments where the reference model remains fixed throughout the entire fine-tuning process? Meanwhile, could the authors provide ablation studies on the number of iterations per round? Is a smaller or larger number of iterations more effective?

We would like to clarify that (1) we already provided the ablation for multi-round refinement in Figure 5(a) and (2) Figure 5(b)(c) already provides ablation studies on the number of iterations per round, as we continuously monitor the FID score during the whole finetuning process in a single round (with the reference model remaining fixed), demonstrating performance improvement at different numbers of iterations.

Could the authors provide experiments on fine-tuning the EDM, EDM2, and VAR using a standard fine-tuning method to assess how much the proposed approach improves over it?

Good question! Reviewer GoSA also raised this question, though he/she also says "I have few reasonable doubts as most baselines are already extremely optimized". We first show our empirical results and then explain the fundamental reason behind them.

The "standard fine-tuning method" of likelihood-based models is to continue training with MLE, which can be seen as extending the pretraining period. On unconditional CIFAR-10 with the EDM base model:

| Iterations | 0 | 500 | 750 | 1000 | 1250 | 1500 |
| --- | --- | --- | --- | --- | --- | --- |
| MLE Continued | 1.97 | 1.961 | 1.969 | 1.972 | 1.988 | 2.011 |
| DDO ($\alpha=6.0, \beta=0.05$) | 1.97 | 1.896 | 1.852 | 1.809 | 1.733 | 1.769 |

By continuing training with the MLE objective, the FID just fluctuates within a small range and never advances the performance like DDO. This phenomenon is to be expected, not only because the baselines are already extremely optimized, but also because of traditional MLE's inherent nature. As stressed in the paper's abstract and introduction, MLE is mode-covering and is penalized severely if the model underestimates the likelihood of any training samples. This limits the upper-bound performance of the MLE objective under limited model capacity. Therefore, we can never expect significant generation-quality improvement by continuing training with MLE, while DDO is a fundamentally different approach that breaks this limitation.
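The mode-covering behaviour of MLE invoked above can be illustrated with a toy experiment (our own sketch, not from the paper): fitting a single Gaussian by MLE to a bimodal mixture stretches the model to cover both modes, placing density where the data has almost none.

```python
import random
import statistics

random.seed(0)

# Toy bimodal "data distribution": an even mixture of N(-3, 1) and N(3, 1).
data = [random.gauss(-3.0 if random.random() < 0.5 else 3.0, 1.0)
        for _ in range(20000)]

# The MLE fit of a single Gaussian (a capacity-limited model) is just the
# sample mean and standard deviation -- the forward-KL optimum.
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

# The fitted sigma is ~sqrt(3^2 + 1) ~ 3.16: the model stretches to cover
# both modes and puts substantial density at x = 0, where the data has
# almost none. A reverse-KL (mode-seeking) fit would instead concentrate
# near one mode with sigma ~ 1.
print(mu, sigma)
```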

It would be useful to show whether the proposed method also benefits training the generative models from scratch.

We do not expect the method to benefit training generative models from scratch. As stated in Sections 3.1 and 3.2, DDO relies on good initializations for stable optimization. Intuitively (e.g., as seen in Figure 1), DDO mainly refines a well-trained density and concentrates it on the main modes. Moreover, the MLE objective has proven to be a stable and scalable approach for pretraining likelihood-based generative models. Therefore, the best practice is pretraining with MLE and refining with DDO.

Additionally, we would like to provide some new results on higher-resolution datasets. After submission, we further used DDO to finetune EDM2-L on ImageNet 512x512, successfully advancing the FID from 2.11 to 1.36 without any guidance. Some visualizations are provided in the anonymous link. We find that DDO significantly enhances the image quality/details without affecting the overall content/diversity. We will release all model checkpoints and code that reproduce the reported results upon acceptance.

Thank you again for your consideration and for giving a positive rating. We hope that our response can resolve your concerns, and we are happy to answer further questions.

Reviewer Comment

I thank the authors for addressing most of my concerns. However, introducing $\alpha$ and $\beta$ requires additional effort for hyperparameter tuning. Nevertheless, I maintain my weak accept score to support the paper's acceptance, given its novel perspective on parameterization.

Author Comment

We acknowledge that DDO's performance can be affected by hyperparameters, which seems to be the only weakness noted by the reviewer. However, no method is flawless, and no machine learning paper is free from hyperparameter tuning, especially when DDO's advantage in fundamentally breaking the limitation of traditional MLE is clear and insensitive to hyperparameters, as we carefully ablated, and DDO has been proven to set SOTA FID records on standard academic datasets as large as ImageNet 512x512. Nevertheless, we appreciate the reviewer's continued support of our paper.

Review
Rating: 4

This paper proposes a new optimizing method for likelihood-based models. The idea comes from GANs, but the method is not to introduce a new discriminator, which is more efficient and easy to apply. Theoretical analysis and experiments prove the effectiveness.

Questions To Authors

None.

Claims And Evidence

The authors claim that their method will not complicate the training or increase inference costs.

In my opinion, we could also introduce a traditional GAN discriminator for optimizing likelihood-based models. Thus, we need a direct comparison showing that the proposed method achieves better, or at least comparable, performance compared with methods that utilize a traditional discriminator. Given such results, the proposed method would be valuable for being efficient and easy to apply.

Methods And Evaluation Criteria

The proposed method makes strong sense for the current generative models.

Theoretical Claims

I did not check all theoretical results carefully but took a quick look. I did not find any obvious errors.

Experimental Design And Analyses

I have checked the experiments. The experiments involve proper baselines, benchmarks, and evaluation metrics. I think there should be an additional ablation, as mentioned in "Claims And Evidence."

Supplementary Material

No. I can get all the critical information in the main paper.

Relation To Broader Literature

This paper essentially reports a work combining GANs and likelihood-based models. Though it is not the first to do so, the idea of implicitly defining the discriminator is new; at least, to my knowledge, it has never been utilized for improving likelihood-based models.

Essential References Not Discussed

No.

Other Strengths And Weaknesses

None.

Other Comments Or Suggestions

None.

Author Response

Thank you for your positive comments. Though you do not have notable concerns, we would like to provide some new results on higher-resolution datasets.

After submission, we further used DDO to finetune EDM2-L on ImageNet 512x512, successfully advancing the FID from 2.11 to 1.36 without any guidance. Some visualizations are provided in the anonymous link. We find that DDO significantly enhances the image quality/details without affecting the overall content/diversity. We will release all model checkpoints and code that reproduce the reported results upon acceptance.

Thank you again for your consideration and for giving a positive rating. We hope that our additional results can further support our method, and we are happy to answer further questions.

Review
Rating: 4

This paper introduces Direct Discriminative Optimization (DDO), a novel finetuning framework designed to enhance the generation quality of likelihood-based generative models, such as diffusion and autoregressive models. Likelihood-based generative models are inherently limited by the mode-covering tendency of maximum likelihood estimation (MLE), which restricts their performance under limited model capacity. The key innovation of DDO lies in its parameterization of a discriminator using the likelihood ratio between a learnable target model and a fixed reference model, drawing inspiration from Direct Preference Optimization (DPO).

Update After Rebuttal

I have reviewed the authors' rebuttal. I appreciate their response addressing my questions regarding the lack of Recall metric analysis and the specific hyperparameter settings (large alpha, small beta) required for DDO.

Overall, the authors have satisfactorily addressed my concerns. Therefore, I maintain my score of 4 (Accept).

Questions To Authors

The theoretical framework of DDO is elegant, but the paper mentions that a relatively large alpha value (e.g., alpha = 4.0) and a small beta value are required for practical effectiveness. Why does DDO require such specific hyperparameter settings (large alpha and small beta) to weight fake samples, while GANs do not seem to need this additional weighting?

Claims And Evidence

The claims made in the submission are supported by clear and convincing evidence. The authors demonstrate the effectiveness of Direct Discriminative Optimization (DDO) through extensive experiments, showing significant improvements in generation quality across multiple benchmarks. Additionally, they provide consistent improvements in FID scores for both guidance-free and CFG-enhanced autoregressive models on ImageNet 256×256.

Methods And Evaluation Criteria

The proposed methods and evaluation criteria are well-suited for the problem and application at hand. The use of benchmark datasets such as CIFAR-10, ImageNet-64, and ImageNet 256×256 is appropriate, as these are widely recognized and challenging benchmarks for evaluating generative models. The evaluation metric, Fréchet Inception Distance (FID), is a standard and reliable measure for assessing the quality and diversity of generated images.

Theoretical Claims

I have reviewed the theoretical proofs in the paper to a reasonable extent, and while I did not perform a 100% detailed verification, the proofs appear to be correct and well-justified.

Experimental Design And Analyses

The experimental setups are well-structured and comprehensive. The authors evaluate their proposed Direct Discriminative Optimization (DDO) method on standard image benchmarks, including CIFAR-10, ImageNet-64, and ImageNet 256×256, which are widely used and respected in the field of generative modeling. They apply DDO to finetune state-of-the-art models such as EDM, EDM2, and VAR, and compare their results against advanced generative baselines, including GAN-based approaches. This ensures a fair and rigorous evaluation of DDO's effectiveness.

Supplementary Material

I reviewed the supplementary material, including the theoretical proofs, sample visualizations, and detailed experimental settings.

Relation To Broader Literature

The key contributions of this paper, particularly the introduction of Direct Discriminative Optimization (DDO), are closely related to broader advancements in generative modeling literature. DDO bridges the gap between likelihood-based generative models (e.g., diffusion and autoregressive models) and adversarial training frameworks like GANs, addressing the mode-covering limitation of maximum likelihood estimation (MLE).

Essential References Not Discussed

After reviewing the paper, I did not find any particularly critical or essential references that were missing from the discussion.

Other Strengths And Weaknesses

Strength: The proposed Direct Discriminative Optimization (DDO) creatively combines ideas from likelihood-based generative models (e.g., diffusion and autoregressive models) and adversarial training frameworks (e.g., GANs), addressing the fundamental limitation of maximum likelihood estimation (MLE) in mode-covering.

Weakness: One potential weakness of the paper is the lack of discussion of the diversity metric Recall during the DDO finetuning process. Analyzing how Recall evolves during the DDO finetuning process could provide deeper insights into whether the method maintains or improves diversity while enhancing sample quality.

Other Comments Or Suggestions

I do not have any additional comments or suggestions for the paper.

Author Response

Thank you for your positive comments. Below, we provide detailed responses to your concerns.

One potential weakness of the paper is the lack of discussion of the diversity metric Recall during the DDO finetuning process. Analyzing how Recall evolves during the DDO finetuning process could provide deeper insights into whether the method maintains or improves diversity while enhancing sample quality.

We did not report the Recall metric because (1) it is measured only on relatively large-scale datasets like ImageNet 256x256, not on CIFAR-10/ImageNet-64, and (2) it is not as important as the FID/IS metrics, and good generative models often have indistinguishable Recall scores (as reported in the VAR paper). Moreover, the FID metric itself is sensitive to generation diversity: a model with low diversity will yield degraded FID scores.

Additionally, we would like to provide some new results on higher-resolution datasets. After submission, we further used DDO to finetune EDM2-L on ImageNet 512x512, successfully advancing the FID from 2.11 to 1.36 without any guidance. Some visualizations are provided in the anonymous link. We find that DDO significantly enhances the image quality/details without affecting the overall content/diversity. We will release all model checkpoints and code that reproduce the reported results upon acceptance.

Why does DDO require such specific hyperparameter settings (large alpha and small beta) to weight fake samples, while GANs do not seem to need this additional weighting?

Though we show that a discriminator can be implicitly parameterized by a generative model to utilize the GAN loss, an implicit discriminator and an explicit discriminator (used in GANs) can have different expressiveness and inherent output-space structure. As mentioned in the paper, the log-likelihood of a generative model scales linearly with the data dimension, and so does the implicit discriminator's output. Therefore, we need a small beta to manually scale it. A large alpha is an empirical choice to oppose the fake samples and accelerate the FID decrease. For an explicit discriminator, the output is directly produced by a separate network (more expressive) and can be freely scaled internally, so fewer manual adjustments are needed.
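The implicit-discriminator parameterization discussed above can be made concrete with a schematic 1D sketch (our own illustration, not the paper's implementation; the Gaussian "models", the toy samples, and beta = 0.5 are all assumed for demonstration), where log-likelihoods are analytic:

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Analytic log-density of N(mu, sigma^2) at x."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def implicit_discriminator(x, theta, theta_ref, beta):
    """DDO-style implicit discriminator: a sigmoid of the beta-scaled
    log-likelihood ratio between the learnable target model (theta)
    and the frozen reference model (theta_ref)."""
    log_ratio = gaussian_logpdf(x, *theta) - gaussian_logpdf(x, *theta_ref)
    return sigmoid(beta * log_ratio)

def ddo_loss(real_xs, fake_xs, theta, theta_ref, beta, alpha=1.0):
    """GAN discriminator (BCE) loss on the implicit discriminator:
    push D -> 1 on real data and D -> 0 on reference-model samples;
    alpha re-weights the fake-sample term."""
    loss_real = -sum(math.log(implicit_discriminator(x, theta, theta_ref, beta))
                     for x in real_xs) / len(real_xs)
    loss_fake = -sum(math.log(1.0 - implicit_discriminator(x, theta, theta_ref, beta))
                     for x in fake_xs) / len(fake_xs)
    return loss_real + alpha * loss_fake

# Toy usage: only theta would be optimized (via autograd in practice),
# while theta_ref stays frozen.
loss = ddo_loss(real_xs=[0.4, 0.6], fake_xs=[-0.6, -0.4],
                theta=(0.1, 1.0), theta_ref=(0.0, 1.0), beta=0.5)
print(loss)
```

Note that no separate discriminator network appears anywhere: the discriminator's output space is entirely determined by the two models' log-likelihoods, which is why its scale must be controlled externally via beta.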

Thank you again for your consideration and for giving a positive rating. We hope that our response can resolve your concerns, and we are happy to answer further questions.

Review
Rating: 4

This paper tackles the issue of the predisposition of likelihood-based generative models to cover modes of the data distribution, yielding unrealistic or blurry results. This paper introduces a solution to this problem that is complementary to the usual but impractical guidance methods via a fine-tuning method that integrates self-guidance-like mechanisms during training.

This fine-tuning, named DDO, relies on parameterizing a discriminator by the log probability ratio between current and reference generated distributions, that is optimized to fit the true discriminator between the data and reference distributions. This provably makes the current generated distribution converge towards the data distribution by improving density concentration on the modes.

By adopting tricks to make this method tractable, notably by estimating log likelihoods with Monte-Carlo samples of the ELBO and applying it iteratively on the reference model, the authors manage to show its significant benefits and obtain state-of-the-art generation performance on standard image datasets with and without guidance.

Questions To Authors

The paper successfully presents a novel, high-potential, simple, and effective fine-tuning method for likelihood-based models. As such, it is valuable and I think it should be accepted. However, I only recommend a "Weak accept" as the paper is lacking in a few aspects, detailed above and summarized below. As I believe all of them can be addressed in the discussion period, I am willing to increase my score depending on the authors' answers.

  1. Can the few debatable claims be reformulated along with the title?
  2. How do the proposed fine-tuning compares to further regular training of the base model?
  3. Can the authors promise to release the code upon publication?

I still expect other, more minor, weaknesses of the paper mentioned in my review to be addressed either in the author's response or in future versions of the paper.


Post-rebuttal update

Given the satisfying answer provided by the authors below, I now recommend an "Accept" provided that the paper is improved accordingly.

Claims And Evidence

The central claims of the paper, summarized above, are clearly presented and well supported either by theoretical or empirical evidence.

  • The theoretical analysis is particularly appreciated as it provides informative insights into the method and helps design its practical implementation. It also makes the method simple to understand and facilitates its reusability. Up to my limited assessment (cf. below), theoretical results appear to be sound.
  • Experimental results clearly show the benefits of the proposed fine-tuning compared to the base model and usual guidance methods, notably the one of Chen et al. (2024) who also attempted to bypass the need for extra network/inference in guidance. The icing on the cake is that it enables the proposed method to surpass the state of the art in image generation.
  • To my knowledge, the idea of parameterizing a discriminator with the log likelihood of the model itself and training the latter by optimizing the same discriminator w.r.t. a reference, non-data, generated distribution, is novel and has the potential to be widely reused.

Nonetheless, a few secondary claims need to be amended.

  • I disagree with the title and the paper stating that "your likelihood-based [...] model is secretly a GAN discriminator": it is not a discriminator per se, but can be used to parameterize a discriminator. I would advise the authors to reformulate this claim and the title to a more factual statement.
  • The paper states in the abstract and the introduction that DDO is a "unified framework that bridges likelihood-based generative training and the GAN objective". This is misleading as it is rather a framework integrating (not bridging) likelihood-based models in a GAN discriminator training (not overall GAN objective, i.e. without the generator objective).
  • In Section 3.1, it is stated that "stable convergence requires stronger initial conditions" based on Section 3.2. While the choice of strong model initialization is generally well motivated, this assertion is too strong. Section 3.2, particularly in Theorem 3.2's assumptions, simply shows that strong model initialization is sufficient for stable training. Therefore, this claim should be toned down.

Methods And Evaluation Criteria

The evaluation of the proposed method is in line with the standards of the literature, both in terms of datasets (CIFAR-10 and two resolutions of ImageNet) and criteria (FID and IS). While other, more modern and less biased evaluation metrics could be considered, this is a shared shortcoming with almost all papers in this area of research.

Comparisons include most recent state-of-the-art methods. However, like in most other papers again, it is not clear whether the reported results were reproduced by the authors or simply reported from the original papers. I would recommend this to be specified somewhere in the paper.

Theoretical Claims

I checked the correctness of all theoretical results by skimming through proofs in the appendix. Up to my limited assessment, these theoretical results are sound. This should be sufficient given that most derivations use standard tools from the literature.

Experimental Design And Analyses

Conducted experiments are also done within the standards of the literature and based on the reference implementation of score-based diffusion models (EDM). There are, however, two important misses.

  • The evaluation misses a comparison with an extended regular training of the baseline (EDM) with comparable execution time to definitely decide on experimental relevance, although I have few reasonable doubts as most baselines are already extremely optimized.
  • No code is provided in the submission and no promise is made to release it for publication. This greatly hinders the reproducibility of the proposed method.

Supplementary Material

I reviewed the whole appendix.

Relation To Broader Literature

The literature is overall well covered and discussed in the paper. It is a clear motivation for the introduction of the method as a means to alleviate mode covering behaviors of likelihood models. Nonetheless, two minor improvements should be considered.

ELBOs for diffusion models. The authors state that "alternative choices [of $p(t)$ and $w(t)$] share the same optimum as the true ELBO and can serve as surrogate objectives (Kingma & Gao, 2024)". Can they elaborate on this claim? I am not sure this is the result outlined in the referenced paper (which shows that alternative choices result in ELBOs with data augmentation).

GANs. The paper discusses characteristics of GANs from the common knowledge in the literature, typically that their training is "unstable" or that their optimization is based on the Jensen-Shannon divergence or Wasserstein distance. While all this is supported by literature from the last decade, it should be tempered. More stable GAN settings have been found, e.g. with StyleGAN (Karras et al., 2019) or FastGAN (Liu et al., 2021), and more comprehensive analyses of GAN optimization now deviate from the Jensen-Shannon / Wasserstein paradigm (Franceschi et al., 2022; Yi et al., 2023).

Karras et al. A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR 2019.
Liu et al. Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis. ICLR 2021.
Franceschi et al. A Neural Tangent Kernel Perspective of GANs. ICML 2022.
Yi et al. MonoFlow: Rethinking Divergence GANs via the Perspective of Wasserstein Gradient Flows. ICML 2023.

Essential References Not Discussed

All strongly related works were discussed in the paper. Still, I would advise the authors to develop the discussion of Chen et al. (2024), which is only quickly mentioned in the appendix. While it is a concurrent work by ICML's guidelines, it does share similar techniques to the ones introduced in the submission. It would be beneficial for the community to acknowledge more precisely the similarities between both papers.

Other Strengths And Weaknesses

The method is simple and effective. The paper is overall very clear and well written. Overall, I believe this submission may have a substantial impact on this area of research both for the proposed fine-tuning and the underlying techniques.

A small caveat is the need to tune additional hyperparameters, as is the case of most new techniques. Nevertheless, the sensitivity analysis included in the paper suggests that their optimization is not a limiting factor, especially compared to the greater complexity of direct competitors.

Other Comments Or Suggestions

  • I would like to ask the authors whether they considered other $f$-divergence GAN discriminator losses (Nowozin et al., 2016). To my understanding, the proposed method could be adapted to these losses, as their optimal discriminators involve density ratios. Some of them may even solve gradient-vanishing issues more directly than the generalized objective introduced in Section 3.3.
  • The paper states that "autoregressive models [...] learn discrete data distributions" (Section 2.1), but they can also model continuous ones as shown e.g. by Tschannen et al. (2024).
  • The paper has a number of formatting issues that need to be solved for future versions.
    • Many equations go over the columns' edges.
    • Proper space should be left between captions and figures/tables.
    • The dataset corresponding to experiments in Figure 5(b)(c) is not specified.
    • The notation $s$ for the guidance scale is confusing, as it shares the same symbol as the score network.
    • Authors should check whether all figures are colorblind-friendly.

Nowozin et al. $f$-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. NIPS 2016.
Tschannen et al. GIVT: Generative Infinite-Vocabulary Transformers. ECCV 2024.

Author Response

Thank you for your positive comments. Below, we provide detailed responses to your concerns.

Can the few debatable claims be reformulated along with the title?

  • I disagree with the title...it is not a discriminator per se, but can be used to parameterize a discriminator.

We agree that "it is not a discriminator per se, but a parameterization". Our use of the expression "your xx is secretly xx" mirrors the famous DPO [1]. In DPO, a language model is also not a reward model per se, but can be used to parameterize a reward model. We think the expression is a convention in the community and will not be misleading. Otherwise, we can also change the title to "your ... can implicitly parameterize ...".

[1] Direct preference optimization: Your language model is secretly a reward model

Other improper claims in abstract/intro/Section 3.1... Formatting issues...

We partially agree with your comments but have to post the discussion later due to the character limit of the rebuttal stage.

comparison to further regular training of the base model?

On unconditional CIFAR-10 with the EDM base model:

| Iterations | 0 | 500 | 750 | 1000 | 1250 | 1500 |
| --- | --- | --- | --- | --- | --- | --- |
| MLE Continued | 1.97 | 1.961 | 1.969 | 1.972 | 1.988 | 2.011 |
| DDO ($\alpha=6.0, \beta=0.05$) | 1.97 | 1.896 | 1.852 | 1.809 | 1.733 | 1.769 |

By continuing training with the MLE objective, the FID just fluctuates within a small range and never advances the performance like DDO. This phenomenon is to be expected, not only because the baselines are already extremely optimized, but also because of traditional MLE's inherent nature. As stressed in the paper's abstract and introduction, MLE is mode-covering and is penalized severely if the model underestimates the likelihood of any training samples. This limits the upper-bound performance of the MLE objective under limited model capacity. Therefore, we can never expect significant generation-quality improvement by continuing training with MLE, while DDO is a fundamentally different approach. We will add this comparison in the revised paper.

Improvements on related work

  • (Kingma & Gao, 2024)

We think this is common sense for diffusion models: the minimizer of denoising score matching is the ground-truth score function at every $t$, regardless of the specific choices of $p(t)$ and $w(t)$. We cite (Kingma & Gao, 2024) because it is a systematic discussion of how to understand different $p(t), w(t)$ and how they are theoretically equivalent.

  • More stable GAN settings have been found

We will temper the "unstable" statements about GANs. Instability is not the only reason we favor diffusion/autoregressive models over GANs: diffusion/autoregressive models are proven to scale to high-dimensional data, and their iterative generation process intuitively provides greater potential generation capability compared to GANs.

  • discussion of Chen et al. (2024)

We acknowledge the similarities and will add a section in the appendix to discuss the relationship. In our opinion, DDO is philosophically distinguished from CCA. At a high level, DDO's utilization of self-generated data can be seen as a form of on-policy optimization that is deeply rooted in the reinforcement learning (RL) of today's language models. Similar to RL, DDO can fundamentally improve the base model's ability, while CCA cannot.

  • $f$-divergence GAN

Thank you for introducing the $f$-divergence GANs! We think these generalizations are applicable to DDO, though they cannot directly address the numerical issues, as the likelihood ratio $\frac{p(x)}{q(x)}=\exp\left(\log\frac{p(x)}{q(x)}\right)$ can be explosive ($\log\frac{p(x)}{q(x)}$ scales with the data dimension). Nevertheless, we will add a section in the appendix to discuss this possible extension and leave more detailed exploration to future work.
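To illustrate the scale of the problem (the per-dimension gap of 0.1 below is a hypothetical number, not from our experiments), a log-likelihood ratio that grows linearly with dimension overflows the raw ratio almost immediately at image resolutions:

```python
import math

# log p(x)/q(x) scales roughly linearly with data dimension; at even a
# modest image size (3 x 64 x 64 dims), a tiny per-dimension gap makes
# the raw ratio exp(log p/q) overflow in float64 (threshold ~ exp(709)).
log_ratio = 0.1 * 3 * 64 * 64  # hypothetical per-dimension gap of 0.1
try:
    ratio = math.exp(log_ratio)
except OverflowError:
    ratio = float("inf")
```

This is why objectives should be expressed in terms of the log-ratio directly, as DDO's parameterization does, rather than through the ratio itself.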

No code is provided

DDO is simple to implement. To demonstrate reproducibility, we have now released some code for the EDM2 codebase at the anonymous link. The loss function is simple (DDO_EDMLoss in training_loop.py), and we only need to add an additional dataloader to load model-generated fake samples.
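As an illustration only (not the released DDO_EDMLoss), here is a minimal sketch of a DDO-style objective. It assumes per-sample denoising losses act as negative log-likelihood surrogates, so the implicit discriminator logit is $\beta(\log p_\theta - \log p_{\mathrm{ref}})\approx\beta(\ell_{\mathrm{ref}}-\ell_\theta)$; the function name `ddo_loss`, its arguments, and the exact form of the $\alpha$ weighting are our assumptions:

```python
import math

def ddo_loss(theta_loss_real, ref_loss_real, theta_loss_fake, ref_loss_fake,
             alpha=6.0, beta=0.05):
    """Sketch of a DDO-style objective on one real and one fake sample.

    Each *_loss argument is a per-sample denoising (MSE) loss; under the
    ELBO view, lower loss ~ higher log-likelihood, so the discriminator
    logit is beta * (ref_loss - theta_loss).
    """
    def log_sigmoid(x):
        # numerically stable log(sigmoid(x))
        return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

    logit_real = beta * (ref_loss_real - theta_loss_real)
    logit_fake = beta * (ref_loss_fake - theta_loss_fake)
    # GAN-type discrimination: classify real data as positive and
    # self-generated (fake) samples as negative; alpha weights the fake term.
    return -(log_sigmoid(logit_real) + alpha * log_sigmoid(-logit_fake))
```

The loss decreases when the target model fits real data better than the reference while assigning lower likelihood to its own generated samples, which is the self-generated negative signal the paper describes.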

Additionally, we provide new results on higher-resolution datasets. After submission, we further finetuned EDM2-L on ImageNet 512x512, successfully advancing the FID from 2.11 to 1.36 without any guidance. Visualizations are provided at the anonymous link. We find that DDO significantly enhances image quality and details without affecting the overall content or diversity. We will release all model checkpoints and code that reproduce the reported results upon acceptance.

Thank you again for your consideration and for giving a positive rating. We appreciate your diligence in carefully reviewing our work and providing very constructive suggestions that help us improve. We hope that our response resolves your concerns, and we are happy to answer further questions.

Reviewer Comment

I would like to thank the authors for their comprehensive answer which successfully addressed almost all my comments. Even though this will not further influence my rating, I am still curious to read the authors' thoughts on the claims I highlighted.

Provided that the authors' response is used as a basis to improve the paper, I am happy to now fully recommend acceptance.

Author Comment

We are glad that our response helped address most of your concerns! We would like to thank you again for your responsible review and constructive suggestions, which will be incorporated into the revised paper. Below, we provide responses to the other claims you mentioned, which we mostly agree with but could not post initially due to the character limit of the rebuttal.

Other improper claims in abstract/intro/Section 3.1... Formatting issues...

  • Abstract and introduction "unified framework that bridges ... the GAN objective": it is rather a framework integrating (not bridging) likelihood-based models in a GAN discriminator training (not overall GAN objective).

We agree and will revise all sentences involving "bridge" "GAN objective" for more accurate expression.

  • In Section 3.1 "stable convergence requires stronger initial conditions" based on Section 3.2 is too strong.

We agree this strong assertion is improper and will weaken it to "strong initial conditions facilitate the optimization".

  • It is not clear whether the reported results were reproduced by the authors or simply taken from the original papers

In the tables, only results annotated with "retested" are reproduced by us; the rest are taken from the original papers. We will state this explicitly.

  • The paper states that "autoregressive models [...] learn discrete data distributions" (Section 2.1), but they can also model continuous ones as shown e.g. by Tschannen et al. (2024).

We agree. We will add a footnote to state "They can also model continuous data [...], while in this paper we only consider the more common discrete case."

  • Many equations go over the columns' edges. Proper space should be left between captions and figures/tables.

We compressed the spacing aggressively due to the 8-page limit at submission. We will fix these issues upon publication and add our new results on ImageNet 512x512 to the main text.

  • The dataset corresponding to experiments in Figure 5(b)(c) is not specified.

They correspond to class-conditional CIFAR-10, as stated in the last paragraph of Section 5.2.

  • The notation $s$ for the guidance scale is confusing, as it shares the same notation as the score network.

We will use $w$ to represent the guidance scale in the revised paper.

It seems that the ICML discussion can only proceed for one round, but please feel free to edit your rebuttal comment if you have further thoughts. We will check it regularly and edit this reply to continue the discussion.

Final Decision

This paper proposes a fine-tuning method for generative models where a discriminator is parameterized through the likelihood ratio of the target model and a reference model. The reviewers praised the clarity of writing, novelty of the method, and empirical results. This paper is a clear strong accept.