Your Diffusion Model is Secretly a Noise Classifier and Benefits from Contrastive Training
Abstract
Reviews and Discussion
The paper presents a method to improve parallel sampling of diffusion models by improving the denoising network under out-of-distribution evaluation. The paper proposes to fine-tune a trained model using the log-likelihood ratio of a sample at two different noise scales, which is obtained by integrating the denoising error over all noise scales. The proposed method performs better for parallel sampling and comparably for sequential sampling on the CIFAR-10, AFHQ, and FFHQ datasets.
Strengths
- The paper is overall well-written and easy to follow.
- As the contrastive diffusion loss (CDL) is expensive to compute, the authors show that fine-tuning a pretrained model for a few epochs is sufficient, instead of training with CDL (or a linear combination of CDL and the denoising error) throughout training.
Weaknesses
- The contrastive diffusion loss (CDL) contains an integral of the denoising error over all noise scales. In practice, I assume this integral is replaced by numerical integration; therefore, evaluating the proposed CDL loss is expensive. Given that training/fine-tuning with the CDL loss is expensive, the empirical benefits do not seem significant. Additionally, few or no details are provided on how the integration is approximated in the experiments/in practice.
- Some discussion of related work needs to be included. For example, another work on improving parallel sampling [1] is not mentioned, discussed, or compared against. Additionally, the derivation of the CDL loss is somewhat similar to that of some diffusion-classifier papers (for example, [2, 3, 4]), so they should be cited and discussed.
[1] Accelerating Parallel Sampling of Diffusion Models
[2] Your Diffusion Model is Secretly a Zero-Shot Classifier
[3] Your Diffusion Model is Secretly a Certifiably Robust Classifier
[4] Robust Classification via a Single Diffusion Model
Questions
- It is not clear why only 5k samples are used for calculating FID for parallel sampling. The standard practice is to use 50k samples. As FID depends on the number of samples, it is hard to assess the significance of the improvement.
- Can you experimentally show/quantify the improvement of the denoiser for out-of-distribution evaluation when trained using contrastive diffusion loss (CDL)?
Limitations
Yes.
Please see our top-level comments for clarification on FID scores, where we show that we consistently produce SOTA results across a wide variety of settings on this metric.
We address the two weaknesses pointed out by the reviewer below.
- Training cost: The numerical integration in our loss is performed using importance sampling. Details of this standard approach are discussed in many papers, e.g., Kingma et al., Variational Diffusion Models, NeurIPS 2021. As Kingma et al. point out, the standard diffusion loss can itself be interpreted as a weighted integral evaluated with importance sampling, so this is not unique to our method.
However, it is true that the form of our loss does incur additional training cost, and we pointed this out explicitly as a limitation. (A minimal sketch of the importance-sampled estimator appears after the points below.)
- Trading off extra computation at training time for faster or better-quality sampling at inference time (as we do) is considered advantageous for most applications. For instance, the entire area of score distillation trains two models rather than one, yet such methods are popular in practice.
- The reviewer guidelines state “authors should be rewarded rather than punished for being up front about the limitations of their work”. We have tried to honor this guideline in our own reviews, and ask the reviewer to consider taking this guideline into consideration when deciding on a final score.
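To make the training-cost point concrete, here is a minimal sketch of an importance-sampled estimate of an integral of a per-noise-level denoising error over log-SNR values. It is an illustration under stated assumptions, not the paper's implementation: the `denoise_err` interface, the uniform proposal, and the truncated log-SNR range are all assumptions.

```python
import torch

def integrated_denoise_error(denoise_err, x, n_mc=8, logsnr_range=(-10.0, 10.0)):
    """Monte Carlo / importance-sampling estimate of an integral of a
    per-noise-level denoising error over log-SNR.

    denoise_err(x, logsnr) -> per-batch error at the given noise levels
    (hypothetical interface). A uniform proposal over a truncated log-SNR
    range is assumed; a non-uniform proposal would instead weight each
    term by p(logsnr) / q(logsnr).
    """
    lo, hi = logsnr_range
    total = 0.0
    for _ in range(n_mc):
        # Draw one noise level per example from the proposal q (uniform here).
        logsnr = torch.empty(x.shape[0], device=x.device).uniform_(lo, hi)
        # Importance weight: for a uniform proposal this is the interval length.
        total = total + (hi - lo) * denoise_err(x, logsnr)
    return total / n_mc
```

As in Kingma et al. (2021), the proposal distribution can be tuned to reduce the variance of this estimate, so the per-step cost is roughly that of the standard diffusion loss multiplied by the number of Monte Carlo samples.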
- Related work
- The references [2,3,4] from XUix's review also have "diffusion" and "classification" in their titles. However, these methods all consider classifying objects in images (e.g., "cat"/"dog"). They rely on the longstanding observation that conditional generative models can be used as classifiers via Bayes rule. In contrast, we introduce an interpretation of diffusion models as noise classifiers - classifiers that can distinguish the amount of noise added to an image. Our derivation is nontrivial and entirely different, as it goes through I-MMSE results rather than Bayes rule (the classical I-MMSE identity is recalled after this list for context). While the task of classifying noise may be unfamiliar to some, it comes from a classic line of work started by Gutmann & Hyvärinen on noise contrastive estimation (JMLR 2012) as an alternative foundation for machine learning (see our response to Y79y for more detail). We will add to related work a discussion of these papers and the distinction between noise classification and traditional classification with diffusion.
- Reference [1], like most attempts to accelerate diffusion, relies on changes to the sampler. Our approach, on the other hand, changes the training procedure and can be combined with any sampler. We tested our approach with a representative sample of deterministic and stochastic samplers, and sequential and parallel samplers. We will add this reference to the list of other recent sampling approaches that our method could be combined with.
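For context, below is the classical I-MMSE identity (Guo, Shamai & Verdú, 2005) that this line of derivation builds on; the paper's pointwise, diffusion-specific version may differ in parameterization and weighting.

```latex
% Classical I-MMSE identity: the derivative of the mutual information
% between X and its noisy observation with respect to the SNR equals
% half the minimum mean-squared error of estimating X from that observation.
\frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}}\,
  I\!\left(X;\ \sqrt{\mathrm{snr}}\,X + N\right)
  = \tfrac{1}{2}\,\operatorname{mmse}(\mathrm{snr}),
\qquad N \sim \mathcal{N}(0, I).
```

Integrating relations of this kind over noise levels is what connects (log-)likelihood and likelihood-ratio estimates to accumulated denoising errors, in contrast to the Bayes-rule argument used by [2, 3, 4].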
Can you experimentally show/quantify the improvement of the denoiser for out-of-distribution evaluation when trained using contrastive diffusion loss (CDL)?
Yes, a direct metric is the score matching loss (L2 distance between the true and estimated score) at OOD points; we will add this to the supplementary material. This requires access to a ground-truth denoiser, so we can only show it for simple cases like the one shown in Fig. 1. (A minimal sketch of such a metric for a toy Gaussian case is given below.)
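As an illustration of what such a metric could look like, here is a hedged sketch for a toy case where the data distribution is a single Gaussian, so the true score of the noisy marginal is available in closed form. The noise parameterization, the `score_model(x, sigma)` interface, and the choice of OOD points are assumptions, not the paper's setup.

```python
import torch

def ood_score_error(score_model, x_ood, sigma, data_mean, data_std):
    """L2 error between a learned score and the analytic score of the
    noisy marginal, evaluated at (possibly off-trajectory) points x_ood.

    Toy assumption: data ~ N(data_mean, data_std^2 * I) and
    x_t = x_0 + sigma * eps, so the noisy marginal is
    N(data_mean, (data_std^2 + sigma^2) * I), whose score is known.
    """
    var_t = data_std ** 2 + sigma ** 2
    true_score = -(x_ood - data_mean) / var_t      # analytic score of the noisy marginal
    est_score = score_model(x_ood, sigma)          # hypothetical model interface
    return ((est_score - true_score) ** 2).sum(dim=-1).mean()
```

Averaging this error over points drawn slightly off the forward trajectories, versus on them, would quantify the OOD improvement illustrated in Fig. 1.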
Dear Reviewer XUix,
We greatly appreciate the time you took to review our paper. Due to the short duration of the author-reviewer discussion phase, we would appreciate your feedback on whether your main concerns have been adequately addressed. We are ready and willing to provide further explanations and clarifications if necessary.
Thank you very much!
Building on the denoising score matching loss, the paper introduces a novel contrastive learning loss. This loss estimates the log-likelihood ratio between mixture densities with varying noise levels. The contrastive learning loss is then employed to fine-tune the diffusion model, enhancing its ability to estimate out-of-distribution (OOD) data. To validate this approach, the paper demonstrates experimentally that the fine-tuned diffusion model exhibits improved performance in both sequential and parallel sampling, which typically degrade with poor OOD estimates.
Strengths
The derivation of the contrastive diffusion loss is solid and novel. CDL could potentially inspire further contrastive losses and noise-level data augmentation methods for diffusion model training. Additionally, CDL introduces asynchronous data pairs for training, improving robustness in out-of-distribution regions, especially during sampling. Overall, the idea is very sound and novel.
Weaknesses
The experimental results are not particularly impressive:
- Although CDL fine-tuning consistently enhances parallel sampling performance on both synthetic and real-world datasets, the improvements are marginal. Additionally, the baseline chosen for comparison is not state-of-the-art. For instance, in the unconditional generation in Table 2, even with CDL fine-tuning, an FID of around 7.0 on CIFAR-10 falls short of the state-of-the-art FID, which is less than 2.0. The improvement in sequential generation is even less substantial.
- I believe the proposed CDL has significant potential. For instance, when addressing inverse problems, the diffusion model also encounters OOD issues. The paper could benefit from including more promising experiments to demonstrate how CDL enhances the diffusion model's OOD estimation capabilities.
Questions
- I think there are issues with the chosen baseline in the paper. In the original parallel sampling method [27], parallel sampling achieves similar performance to sequential sampling in significantly less time. However, the results reported in Table 2 show that unconditional parallel sampling performs much worse than sequential sampling (around 3.0 for the original DDPM, VP, and VE). Could the authors provide more clarification on this discrepancy?
- According to [1], the denoising score matching loss serves as an upper bound on the negative log-likelihood, so minimizing the denoising score matching loss amounts to maximizing a lower bound on the log-likelihood. Are there any insights or proofs provided on why minimizing the CDL would also lead to meaningful generation?
[1] Song, Yang, Conor Durkan, Iain Murray, and Stefano Ermon. "Maximum likelihood training of score-based diffusion models." Advances in neural information processing systems 34 (2021): 1415-1428.
Limitations
The authors have covered the limitations of their work.
Thank you so much for taking the time to read and review our paper. Please see our top-level comments for clarification on FID scores, where we show that we consistently produce SOTA results across a wide variety of settings on this metric.
We address the two weaknesses pointed out by the reviewer below.
- FID Score Discrepancy
The original Table 2 reported FID scores using 5k samples, following the settings of our baseline paper [4]. FID scores depend on the number of samples, so our previous results were not directly comparable to results reported using 50k samples. In response to your feedback, we have updated Table 2 to include FID scores calculated with 50k samples; the results are in the top-level rebuttal response.
The updated results show that our numbers are actually very good and in line with SOTA. With the same parallel sampler, CDL-finetuned models consistently outperform models trained with the original diffusion losses (DDPM, VP, and VE). For example, compared to all baseline losses under FID calculated with 50k samples, CDL achieves an FID improvement of ~0.7 on unconditional CIFAR-10 and ~0.5 on conditional CIFAR-10.
- Minimizing CDL and Meaningful Generation
Are there any insights or proofs provided on why minimizing the CDL would also lead to meaningful generation?
Our method is based on Density Ratio Estimation (DRE) rather than maximum likelihood. DRE techniques transform the task of learning a data distribution into learning to classify between data samples and samples from a reference distribution [1,2,3]. Our approach bridges diffusion-based generative methods and density ratio estimation; therefore, minimizing CDL effectively learns the underlying data distribution, which leads to meaningful generation.
For more detailed information about DRE, we refer to [2], Section 2.1, Density Estimation by Comparison. (A minimal sketch of the classification-to-density-ratio principle is included below, before the references.)
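To make the DRE idea concrete, here is a minimal sketch of density estimation by comparison in the spirit of [2]: a logistic classifier trained to distinguish data samples from samples of a known reference distribution recovers the log density ratio through its logit. This illustrates the general principle only; it is not the paper's CDL objective, and the distributions and names below are toy choices.

```python
import torch
import torch.nn as nn

# Toy comparison-based density estimation: classify data vs. reference samples.
# With balanced classes, the optimal classifier's logit equals
# log p_data(x) - log p_ref(x), i.e., the log density ratio.
dim = 2
data = torch.randn(4096, dim) * 0.5 + 2.0   # samples from the unknown data distribution
ref = torch.randn(4096, dim) * 3.0          # samples from a known reference distribution

logit_net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(logit_net.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for _ in range(500):
    x = torch.cat([data, ref])
    y = torch.cat([torch.ones(len(data), 1), torch.zeros(len(ref), 1)])
    loss = bce(logit_net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# logit_net(x) now approximates log p_data(x) - log p_ref(x); adding the known
# log p_ref(x) recovers an estimate of log p_data(x).
```

Since the reference density is known, the trained classifier implicitly yields an estimate of the data density, which is the sense in which minimizing a contrastive objective "learns the underlying data distribution".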
[1] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012
[2] Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of machine learning research, 13(2), 2012
[3] Benjamin Rhodes, Kai Xu, and Michael U Gutmann. Telescoping density-ratio estimation. Advances in neural information processing systems, 33:4905–4916, 2020.
[4] Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, and Nima Anari. Parallel sampling of diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
Thanks for the authors' response; I have increased my score to 5.
Thank you very much for raising the score. We appreciate your careful review once again!
The paper establishes a connection between the diffusion model denoiser and noise classifier through an examination of log-likelihood ratio estimation. The authors introduce a novel loss function, termed Contrastive Diffusion Loss (CDL), designed to encourage diffusion models to explore OOD regions in noise levels. Empirical results indicate that the proposed loss function enhances performance in parallel sampling and demonstrates robustness to different hyperparameters in sequential sampling.
Strengths
- The paper draws an interesting connection between diffusion model denoiser and noise classifier and introduces a novel loss function based upon.
- The paper is well-organized and most results are clearly presented.
Weaknesses
- The clarity of the paper could be further enhanced. For instance, in line 86 the authors introduce a notation to indicate the noise level, while a different notation is used during the derivation in Appendix A.2. It seems the authors want to differentiate between the noise level and the SNR, but the message is not entirely clear here.
- The empirical section of the paper is not as strong as the analytical section. In the parallel sampling example, the method shows good performance, but even with the performance improvement, the results are still far from optimal compared to EDM with the default settings. The experiment on deterministic samplers aims to demonstrate that CDL-regularized models maintain a more stable FID score with changes in NFE, but the authors do not explain why CDL helps in this context. The same issue applies to the experiment on stochastic samplers.
Overall, I am on the positive side for this paper. I would appreciate it if the authors could provide more intuitive explanations to better connect the analytical and empirical sections of the paper.
Questions
Why did the authors choose fine-tuning with the CDL over pre-training and using CDL as a regularization?
Limitations
The authors addressed the limitations and potential negative societal impact of their work.
Thank you so much for taking the time to read and review our paper. We’re glad that you are positive about our paper. Please see our top-level comments for clarification on FID scores, where we show that we consistently produce SOTA results across a wide variety of settings on this metric.
We address the two weaknesses and two questions pointed out by the reviewer below.
- Clarity on Notation
The clarity of the paper could be further enhanced
To clarify the distinction between the log-SNR variable used in the integral in Eq. 3 and the noise level of the noisy data distribution (a mixture of data and noise), we used a separate symbol to represent the amount of noise in the noisy distribution. The inconsistent notation in Appendix A.2 was a typographical error and will be corrected in the revised manuscript.
- Empirical Results: FID Scores
The difference in FID scores compared to EDM's default settings arises because the original Table 2 reports FID scores computed with 5k samples, as per the baseline paper, whereas SOTA methods typically report FID with 50k samples.
In the parallel sampling example, the method shows good performance, but even with the performance improvement, the results are still far from optimal compared to EDM with the default settings.
Table 2 reports FID scores for samples generated using parallel samplers from EDM-pretrained checkpoints, while EDM's default FID scores are for samples from sequential EDM samplers. This difference in sampling methods explains the difference in FID scores.
We have updated Table 2 to include FID scores calculated with 50k samples. For example, compared to all other baseline losses, CDL achieves an FID of 2.38 (0.62 better than the best baseline, which has an FID of 3.00) for unconditional CIFAR-10. These updates are included in the top-level comments.
- Why CDL Helps Stabilize FID Scores Across Hyperparameters
According to EDM [1], samplers introduce local discretization errors at each step, which accumulate as global errors. Although sequential samplers are designed to stay close to the forward (training) trajectories, local errors cause deviations from these paths. By training on asynchronous data pairs, CDL exposes the model to such deviations during training and thus lets it correct these local errors. This "correction" capability contributes to stability across hyperparameters in both deterministic and stochastic sampling settings. More generally, our approach can be viewed as noise contrastive density ratio estimation, an alternative to maximum likelihood for density estimation (Gutmann & Hyvärinen, 2012) [2]. By combining maximum-likelihood and density-ratio-based estimates, we improve over either one individually. (A purely illustrative sketch of such asynchronous pairs is given below.)
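Purely as an illustration of what asynchronous data pairs could look like (this is our shorthand reading of the rebuttal text, not necessarily the paper's exact training construction), the sketch below contrasts a synchronized pair, which lies on a single forward trajectory, with an asynchronous pair, whose independent noise draws produce the kind of off-trajectory states a sampler's discretization error can visit.

```python
import torch

def make_noisy_pair(x0, sigma_s, sigma_t, asynchronous=True):
    """Build two noisy views of x0 at noise levels sigma_s and sigma_t.

    Synchronous pair: both views reuse the same noise draw, so they lie on
    one forward trajectory. Asynchronous pair: independent noise draws, so
    the pair covers states slightly off any single trajectory.
    (Illustrative reading only, not the paper's exact construction.)
    """
    eps_s = torch.randn_like(x0)
    eps_t = torch.randn_like(x0) if asynchronous else eps_s
    x_s = x0 + sigma_s * eps_s
    x_t = x0 + sigma_t * eps_t
    return x_s, x_t
```

Under this reading, training on such pairs means the model is evaluated at points it would otherwise only encounter at inference time, which matches the correction mechanism described above.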
- Why Fine-Tuning Instead of Pretraining?
Fine-tuning pretrained models is preferred because training diffusion models from scratch is computationally expensive; it is common practice to fine-tune pretrained models to balance efficiency and performance. The models for the toy datasets were trained from scratch, with excellent results.
[1] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
[2] Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of machine learning research, 13(2), 2012
Dear Reviewer M92h,
We greatly appreciate the time you took to review our paper. Due to the short duration of the author-reviewer discussion phase, we would appreciate your feedback on whether your main concerns have been adequately addressed. We are ready and willing to provide further explanations and clarifications if necessary.
Thank you very much!
Thanks to the authors for the additional experiments and clarifications. I hope that the authors would include these additional results and discussions when they update the paper.
My concerns are addressed, thus raising 6 -> 7.
We are grateful for the time the reviewers have invested in reviewing our paper and for the insightful feedback provided. In this top-level comment, we address concerns about the FID results and baseline comparisons, as these were the subject of discussion among the reviewers. In particular, our choice of baseline metric obscured the fact that our method produces results competitive with SOTA. We will respond to other specific points in comments replying to each reviewer. We are posting these comments now to allow time for discussion, with a revised version coming in a few days.
We have carefully considered your comments and made the following updates:
The FID scores reported in the original Table 2 were calculated using 5k samples, following the settings of our baseline paper [1]. This choice was made to maintain consistency with the baseline; additionally, we previously did not have sufficient computational resources to generate a larger number of samples for FID. FID scores depend on the number of samples, so our results were not directly comparable to results reported using 50k samples.
We have updated Table 2 with FID scores calculated using 50k samples. Our best FID results on all datasets are comparable with the SOTA FID results (using 50k samples) reported for DDPM, VP, VE, and EDM.
|  | Cond CIFAR-10 (VP) | Cond CIFAR-10 (VE) | Uncond CIFAR-10 (VP) | Uncond CIFAR-10 (VE) |
|---|---|---|---|---|
| EDM loss (baselines) | 2.93 | 2.76 | 3.24 | 3.00 |
| CDL loss (ours) | 2.41 | 2.25 | 2.51 | 2.38 |

Table 2: Parallel sampler results from EDM-pretrained checkpoints; FID evaluated on 50k samples.
Below is a summary of changes in the revised version of the PDF and supplementary material.
- Update Table 2 to show FID calculated with 50k samples, where CDL consistently outperforms the baselines and produces near-SOTA FID scores. All results will be replaced with FIDs calculated on 50k samples, and FIDs computed with 5k samples will be reported in the appendix for direct comparison with the baseline paper. For example, on unconditional CIFAR-10 with the same sampler, the CDL-finetuned model achieves an FID of 2.51, whereas the EDM-loss-trained model achieves 3.23, for a total improvement of ~0.7 FID.
- Will make wording changes and clarifications throughout to address reviewer comments.
- Will add details about the implementation of the numerical integration in our loss.
- Will add a discussion of diffusion-classifier papers and the distinction between noise classification and traditional classification with diffusion.
- Will add a direct metric, the L2 loss between the true and estimated denoiser (score matching loss), which quantifies the score/denoiser improvement at OOD points.
We have taken great care to address all the concerns raised and hope that our revisions will meet your approval.
Dear Reviewers,
I hope this message finds you well. I would like to take this opportunity to express my gratitude for the valuable feedback and insights you have provided during the review process. Your expertise and dedication are essential to maintaining the high standards of our community.
The authors have devoted considerable time and effort to addressing the concerns and suggestions you raised in your reviews. They have carefully crafted a detailed rebuttal, aiming to clarify any misunderstandings and incorporate your feedback into their manuscript.
I kindly ask that you take a thorough and thoughtful look at the authors' responses. Your assessment of their rebuttal is crucial in determining whether they have satisfactorily addressed the issues and concerns highlighted in your initial review.
Thank you once again for your hard work and commitment to advancing the quality of our scholarly community. Your contributions are greatly appreciated.
Best regards,
AC
Building on the denoising score matching loss, the paper presents a new contrastive learning loss. This loss estimates the log-likelihood ratio between mixture densities with different noise levels. The contrastive learning loss is subsequently used to fine-tune the diffusion model, improving its capability to estimate out-of-distribution (OOD) data. Experimental results show that this approach leads to enhanced performance of the fine-tuned diffusion model in both sequential and parallel sampling, areas that often suffer from poor OOD estimates.
Most reviewers find the paper well-presented and well-written; it draws an interesting connection between the diffusion model denoiser and a noise classifier, introduces a novel contrastive loss, and the derivation is solid and insightful. In the camera-ready version, we suggest the authors take the reviewers' feedback into account and revise accordingly, for example with a more comprehensive discussion of related work and a more convincing evaluation of the proposed methods.