/10

Poster4 位审稿人

最低3最高4标准差0.4

ICML 2025

Stochastic Forward–Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets

Haoye Lu,Qifan Wu,Yaoliang Yu

OpenReview PDF

提交: 2025-01-19更新: 2025-07-24

摘要

关键词

diffusion modelsdeconvolutionambient diffusion

评审与讨论

审稿意见

评分: 32025-03-13

This work put forward a new training framework, dubbed Stochastic Forward–Backward Deconvolution (SFBD), a method for training diffusion models on noisy datasets while mitigating data memorization and copyright concerns. Moreover, the theoretical analysis demonstrated that training solely on corrupted data is inefficient, whereas pretraining on a small fraction of clean data significantly improves performance. Importantly, experimental results showed state-of-the-art sample quality with as little as 4% clean data in certain settings. By bridging the gap between deconvolution theory and practical generative modeling, the proposed SFBD provides a scalable solution for privacy-preserving training.

给作者的问题

How many training iterations are required for the proposed method?
can you explain more about the $K$ in $\mathcal{O}(\log n )^{-2}$ ?
Since the training process still requires some clean images, does it truly help resolve the data leakage issue?

If the authors can address most of my concerns, I will improve the initial score.

论据与证据

The paper claims that training solely on noisy data is inefficient due to slow convergence rates but can be significantly improved by pretraining on a small fraction of clean data. While the theoretical analysis supports this claim, the extent to which pretraining on limited clean data generalizes to diverse datasets or real-world scenarios is not entirely explored. Additional experiments on different datasets and noise levels could further strengthen this conclusion.

方法与评估标准

This paper suggests that the proposed method may help address potential copyright issues, as it enables the recovery of noisy images without accessing their complete clean information.

理论论述

For Theorem 1 & 2:

The theorem claims that the optimal convergence rate for estimating the data density from noisy samples is $\mathcal{O}(\log n )^{-2}$ , which implies that training diffusion models solely on corrupted data is nearly infeasible. However, it derived that the lower bound of the discrepancy between the modeling distribution and the target data distribution is $\mathcal{O}(\log n )^{-2} \cdot K$ . Since this is a lower bound, then how can restrain the modeling discrepancy the modeling distribution and the target data distribution? Moreover, how $K$ determines the bound? is this very sensitive or just a constant.

实验设计与分析

The comparisons with the baseline models are all based on the same model backbone? or otherwise it if unfair. It seems two datasets are insufficient verification of model performance.

补充材料

The supplementary part includes the theoretical analysis and the visualizations. I try my best to check it.

与现有文献的关系

The key contributions of this paper build on existing research in diffusion models, deconvolution theory, and generative modeling with noisy data. By connecting ideas from these fields, it offers new theoretical insights and practical techniques to enhance training efficiency, improve sample quality, and support privacy-preserving generative modeling.

遗漏的重要参考文献

N/A

其他优缺点

weaknesses:

While the method improves sample quality, the paper does not provide a detailed analysis of training efficiency, computational overhead, or scalability compared to standard diffusion model training.
Although the paper claims that training on noisy data can mitigate copyright risks, it does not provide a formal analysis or guarantees regarding information leakage or potential reconstruction of copyrighted content.
The theoretical analysis seems no help for the model convergence, especially the derived lower bound.

其他意见或建议

Evaluating the proposed method on a broader range of datasets would significantly enhance the assessment of its robustness. While CIFAR-10 and CelebA provide useful benchmarks, incorporating larger and more diverse datasets, such as ImageNet for high-resolution images or LSUN for complex scenes, would better demonstrate the model’s generalizability.

作者回复

2025-04-01

Thank you very much for reviewing our paper. Below are our responses to your comments:

Q1. How many training iterations are required?

A1. In most of our experimental settings, SFBD converges within four iterations. (see [link] for results from additional iterations.) This is why we originally report the FIDs over the first four iterations.

When the noise level is fixed, SFBD converges faster with more clean samples, as the pretrained model starts closer to the true data distribution (see also Sec 5: The Importance of Pretraining). Conversely, with a fixed number of clean samples, higher noise levels also speed up convergence. We hypothesize this is because higher noise levels obscure more information in the noisy samples, prompting SFBD to rely more on the clean data and thus more rapidly adapt by combining complementary features from the noisy sources.

Q2. Regarding "training efficiency, computational overhead, or scalability compared to standard DM"

A2. While SFBD involves multiple fine-tuning steps, each step has a similar cost to standard diffusion training, as it minimizes the regular conditional score-matching loss.

Most training time is spent on pretraining and the first fine-tune, together matching the cost of training a regular diffusion model. Later fine-tuning steps are much faster—each taking less than 1/4 of standard training time in our setup. Overall, using SFBD with 4 fine-tunes takes about 1.75× the time of training a regular diffusion model.

Backward sampling adds cost but is parallelizable. With 8 RTX 6000 GPUs, each sampling step takes under 30 minutes.

Q3. Baseline model selections

A3. Our implementation follows EDM [1] with identical hyperparameters (using the FFHQ‑64×64 config for CelebA). As noted in EDM Appx F.3, the UNet backbones are nearly identical to those in DDPM and DDIM. AmbDiff, EMDiff, and TweedieDiff also use EDM backbones, so their architectures match ours. For SureScore, we use results from [2], which employs a slightly different UNet but with a similar parameter count.

Q4. The theoretical analysis seems no help for the model convergence, especially the derived lower bound.

A4. Thms 1 and 2 are not meant to describe SFBD’s convergence, but to illustrate the difficulty of estimating clean distributions from noisy data. Based on density deconvolution theory, they provide matching upper and lower bounds, implying a sample complexity of $\mathcal{O}((\log n)^{-2})$ . This poor rate shows that training solely on noisy data is impractical, justifying SFBD’s use of limited clean samples for pretraining.

SFBD’s actual convergence guarantee is given in Prop 2, which shows a rate of $\mathcal{O}(1 / \sqrt{K})$ , where $K$ is the number of iterations in Algorithm 1.

Note: The $K$ in Prop 2 and Alg 1 refers to the same quantity. In contrast, the $K$ in Thm 2 is an unrelated constant. To avoid confusion, we’ll revise the notation in Thm 2 to use $C’$ in the updated version.

Q5. Can you explain more about the $K$ in $\mathcal{O}(\log n)^{-2}$ ?

A5. In Thm 2, given $p_{\rm data}$ and the noise level $\sigma_\tau$ , the constant $K$ is a fixed positive value. (Note: This quantity will be renamed to $C'$ in the revised version to avoid confusion, as it is only used in Thm 2 and is unrelated to the iteration count $K$ in Alg 1 and Prop 2.)

Q6. Regarding the data leakage issue.

A6. The SFBD algorithm does not address potential leakage from the clean images used in pretraining. However, since these images are public and copyright-free, such leakage is not a concern in our setting. We also note that:

Some clean data is essential, as the problem is otherwise intractable;
Clean images can be copyright-free or obtained with user consent;
Pretrained models may be sourced from public datasets, though with potential quality trade-offs (as shown in ablation studies).

SFBD is specifically designed to prevent leakage from sensitive data by exposing the model to only one corrupted version of each sensitive sample during training. This inherently limits memorization and reconstruction of private or copyrighted content. Notably, SFBD does not require clean images or the pretrained model to be released or shared through secure channels.

To further support our privacy claims, we follow [3] by computing similarity scores between generated and sensitive samples [link]. Results show no reconstruction of sensitive content. These findings will be included in the revision.

Q7. Regarding addtional results on bigger datasets

A7. See Reviewer BKmX - A2.

[1] T. Karras et al. Elucidating the Design Space of Diffusion-Based Generative Models. NeurIPS 2023.

[2] An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations. Bai et al. 2024.

[3] Consistent Diffusion Meets Tweedie. Daras et al., 2024.

审稿意见

评分: 42025-03-13

This paper tackles an important practical issue: how to train diffusion models without directly accessing large volumes of clean (and potentially copyrighted) data. The authors propose a novel method called Stochastic Forward–Backward Deconvolution (SFBD). The approach begins with pretraining on a small set of clean images and then leverages a large noisy dataset through a forward–backward iterative process. The work is grounded in a solid theoretical framework based on density deconvolution, where the authors derive that the optimal convergence rate for learning from noisy samples is $O((\log n)⁻²)$ . This theoretical insight motivates the use of even a modest amount of clean data to guide the deconvolution process. Extensive experiments on CIFAR-10 and CelebA demonstrate that SFBD outperforms several baselines, including methods such as TweedieDiffusion and SURE-Score, in terms of FID scores.

给作者的问题

论据与证据

Most claims are supported by either theoretical analysis and experimental results well. However, I have one question:

Proposition 2 was presented in terms of the infinity norm between the chracteristic function between the data distribution and the iterative distribution. The results show that the norm not only depend on the iteration numbers K but also the l2 norm of the $u$ as well, which implies there are always non-trivial gap when $\|u\|$ is large enough no matter how large K is. In this sense, the theorem should be augmented by how the original distance/KL-divergence of the two distributions would be if the gap between the chracteristic function are only large when $\|u\|$ is large.

方法与评估标准

Experimental results for larger dataset, like imagenet, better to be demonstrated.

理论论述

Correct.

实验设计与分析

The results are only shown for at most 4 iterations, it is better to show more iterations and specifically, how fast the results would converge to the optimal state.

补充材料

与现有文献的关系

遗漏的重要参考文献

其他优缺点

The derivation of the convergence rate $O((\log n)⁻²)$ under a Gaussian noise model provides deep insight into the limitations of learning from solely corrupted data. This analysis justifies the necessity of even a small amount of clean data.
The experimental results, though limited, provide good insights.

其他意见或建议

作者回复

2025-04-01

Thank you very much for the review. Below are our responses to your comments:

Q1. Regarding the non-trivial gap when $|\mathbf{u}|$ is large in Prop 2.

A1. Thank you for this insightful comment. We agree that the bound appears to grow with $\|\mathbf{u}\|$ , and we would like to clarify that in practice, the behavior of characteristic functions for large $\|\mathbf{u}\|$ is typically negligible. Specifically, characteristic functions tend to decay rapidly—often at an exponential or super-polynomial rate—for distributions with smooth and bounded densities.

To make the discussion concrete, consider the 1D case: the characteristic function $\phi(u)$ is the Fourier transform of the probability density function. When the density is k-times differentiable, it is well known (e.g., Lemma 4 on page 514 of [2]) that the characteristic function satisfies $|\phi(u)| = o(|u|^{-k})$ . This implies that for sufficiently large $|u|$ , the magnitude of the characteristic function becomes negligible.

Therefore, under the assumption that both $p_{\rm data}$ and $p_0^{(k)}$ are smooth and have bounded support, it suffices to ensure that their characteristic functions are close within a compact domain $|u| < U$ for some $U > 0$ . This local closeness in the Fourier domain translates to closeness in the original densities, and hence the distributions. We will include this clarification in the final version of the manuscript.

Q2. Regarding addtional results on bigger datasets

A2. While we acknowledge that further experiments on larger datasets would strengthen the empirical study, our ability to scale up is constrained by the limited computational resources available at our academic institution. For instance, according to EDM’s official repository [1], training a diffusion model on ImageNet requires 32×A100 GPUs for 13 days—resources that are currently beyond our reach.

To partially address your comment, we conducted an experiment on the Tiny ImageNet dataset, which contains 100,000 images. We selected 1,000 clean images as copyright-free samples for pretraining and set the noise level to 0.2. Using a reduced batch size of 256 (compared to the recommended 4096 in EDM), the pretrained model achieved an FID of 36.74. After the first fine-tuning iteration, the FID dropped to 20.41. This result reflects a trend consistent with our findings on CIFAR-10 and CelebA, further corroborating our claims.

Q3. The results are only shown for at most 4 iterations, it is better to show more iterations and specifically, how fast the results would converge to the optimal state.

A3. In most of our experimental settings, SFBD converges within four iterations. (see [link] for results from additional iterations.) This is why we focus on reporting the FID trajectories over the four few iterations in our original submission.

Moreover, when the noise level is fixed, SFBD tends to converge more quickly as the number of clean samples used for pretraining and fine-tuning increases. A larger clean dataset allows the model to start closer to the true data distribution (see also Section 5 – The Importance of Pretraining). Conversely, when the number of clean samples is fixed, increasing the noise level also accelerates convergence. We hypothesize this is because higher noise levels obscure more information in the noisy data, prompting SFBD to rely more on the clean samples, enabling faster adaptation through the fusion of complementary information from noisy samples. We will incoperate this discussion in the revision.

[1] T. Karras et al. Elucidating the Design Space of Diffusion-Based Generative Models. NeurIPS 2023.

[2] Feller, W. An Introduction to Probability Theory and Its Applications, Vol. 2. Wiley. 1971

审稿意见

评分: 42025-03-15

This paper addresses the challenge of training diffusion-based generative models using datasets that are intentionally corrupted with noise to mitigate concerns around memorization and copyright infringement.
However, the authors show that in practice, the convergence rate for learning from noisy samples is extremely poor—on the order of $O((\log n)^−2)$ —making effective training from only noisy data infeasible.
The key insight is to pretrain a diffusion model on a small set of clean (copyright-free) data and then iteratively refine it using the large noisy dataset. The algorithm alternates between backward sampling (a denoising step using the current model) and updating the denoiser with these generated samples. Over time, the model’s outputs converge to the true data distribution even though it only sees a tiny fraction of clean data.
The authors theoretically validate SFBD’s convergence and demonstrate its practicality via empirical studies on CIFAR-10 and CelebA. Impressively, SFBD achieves competitive image quality with only 4% clean images (FID of 6.31 on CIFAR-10), outperforming other methods designed to train on noisy data.

给作者的问题

How does the method perform when you use a off-the-shelf denoiser trained on generic data?
What will happen in the blind case when the noise variance is unknown?
How does using a denoiser trained on multiple noise levels (blindly) affect the performance?

论据与证据

Yes.
The approach is justified using theoretical guarantees, specifically the convergence rate to the true clean distribution is analyzed and bounded.
The claims are supported by the empirical studies, showing that the approach is effective also in practice.

方法与评估标准

Almost.

The approach assumes that the noise level of the noisy data is within the trajectory of the diffusion process, which in many practical cases does not hold.

理论论述

Yes. They seem correct.

Proposition 1
Theorem 1
Theorem 2

实验设计与分析

Yes, the method is compared to other methods designed to generate images given access only to noisy images. This comparison is not fair. The other methods do not require having a set of clean images, for the comparison to be fair one needs to test the methods after denoising the dataset with a denoiser network trained using the given clean images.

补充材料

Yes. A, B, D, and E.

与现有文献的关系

The task discussed in the paper is very important, specifically when one does not have access to the true data and only the noisy version of the data is accessible. The approach shows that even in this case one can still learn the true data distribution in a very effective way.

遗漏的重要参考文献

其他优缺点

Strengths:

The paper is well written.
The approach is novel and interesting.
Compared to other methods this approach shows great improvement.
The method is supported by theoretical justification.

Weaknesses:

The approach assumes that the noise level of the noisy data is within the trajectory of the diffusion process, which in many practical cases does not hold.
The method requires access to clean data at the initial stage.
The method is compared to other approaches but not fairly. The other methods do not require having a set of clean images, for the comparison to be fair one needs to test the methods after denoising the dataset with a denoiser network trained using the given clean images.
The performance of the approach degrade significantly when the clean data is out of the distribution (FIgure 2C, when the model is pretrained on truck images, it performs poorly on horse images).

其他意见或建议

N/A

作者回复

2025-04-01

Thank you very much for your comments. We will first clarify the distinction between our framework and standard denoising methods, and then address your comments point by point.

How our method differs from a standard denoising algorithm. SFBD alternates between denoising samples and fine-tuning the denoiser, allowing it to blend high-frequency details from clean data with global structure from noisy samples (see Section 6.3). Unlike standard denoising methods that reconstruct exact clean inputs, SFBD trains the denoiser to recover the full data distribution, resulting in more realistic samples.

To highlight SFBD’s advantages, we compare the FID of its denoised samples at each iteration ( $\mathcal{E}_k$ in Algorithm 1) with the full training set, alongside results from a state-of-the-art off-the-shelf denoiser, Restormer [1].

CIFAR-10 (4% clean images)

$\sigma$	Model	Iter 1	Iter 2	Iter 3	Iter 4
0.30	SFBD	6.16	3.42	2.68	2.35
	Restormer	53.87
0.59	SFBD	10.23	7.47	6.31	6.54
	Restormer	99.99
1.09	SFBD	12.68	9.39	9.08	10.14
	Restormer	132.69

CelebA

Setting	Model	Iter 1	Iter 2	Iter 3	Iter 4
50 clean images, $\sigma = 0.2$	SFBD	47.69	10.05	5.63	3.93
	Restormer	18.90
1,500 clean images, $\sigma = 1.38$	SFBD	9.05	5.76	4.56	3.98
	Restormer	227.91

The results show that SFBD can produce significantly more realistic samples than regular denoisers.

Below, we provide point-by-point responses to your comments.

Q1. How does the method perform when you use a off-the-shelf denoiser trained on generic data?

A1. As shown above, Restormer-denoised images consistently yield much higher FIDs than SFBD (reported in the paper). Since generative models trained on these images can’t surpass the FID of their targets, using off-the-shelf denoisers like Restormer results in significantly worse performance. We will include this discussion in the revision.

Q2. Regarding Experimental Designs Or Analyses

A2. The upper block of Table 1 assumes full access to both clean copyright-free and sensitive samples, so the results naturally outperform those from models trained on denoised data—making our method appear weaker, not stronger.

For the algorithms in the lower block, their original designs either assume only noisy data or use clean samples in their own specific way. Adapting them to leverage denoised samples would require substantial changes and could also raise fairness concerns. To maintain a consistent and fair comparison, we used available clean data for pretraining when applicable.

As noted in A1, the reported results in the above tables serve as upper bounds for models trained on denoised samples—whether from a pretrained denoiser (SFBD-Iter 1) or an off-the-shelf one (Restormer). Importantly, final SFBD models (after multiple training iterations) still outperform these upper bounds. We will clarify this in the revision to avoid potential concern.

Q3. Regarding the case that noise variance is unknown.

A3. In our setting where noise is intentionally added to protect sensitive samples, it is reasonable to assume known noise variance (either directly controlled by the model developer or communicated to it securely).

To make the proposed framework compatible with scenarios where the noise level is unknown, it could be extended by incorporating noise level estimation techniques-such as those used in blind denoising methods. We consider this an interesting direction for future work.

Q4. How does using a denoiser trained on multiple noise levels (blindly) affect the performance?

A4. The denoiser Restormer, discussed in the tables above as well as in A1 and A2, is designed to handle multiple noise levels in a blind manner. As shown in our results and discussions, generative models trained on images denoised by Restormer cannot achieve performance comparable to those trained using SFBD.

Q5. Regarding "The approach assumes that the noise level of the noisy data is within the trajectory of the diffusion process, ..."

A5. In theory, diffusion models can handle noise levels ranging from zero to infinity. We would appreciate it if you could provide more details or clarification on this concern so we can better understand and address the issue.

Q6. Regarding "The performance of the approach degrade significantly when the clean data is out of the distribution"

A6. We included these ablation studies to support our theoretical claims and offer additional insights into the framework’s behavior. While we agree that using in-distribution clean data yields better results, the main takeaway is that pretraining on out-of-distribution data still provides a clear benefit compared to no pretraining at all.

[1] Restormer: Efficient Transformer for High-Resolution Image Restoration. S. W. Zamir et al. CVPR 2022

审稿意见

评分: 42025-03-16

The authors consider the problem of training diffusion models with a small set of clean data and a large set of noisy data. This follows a line of recent works on developing techniques for training diffusion models under corruption in the training set. The main finding of this work is that without clean data performance is fundamentally limited for datasets of finite sizes. The authors develop a technique that significantly improves the performance by leveraging a small set of clean samples.

给作者的问题

Please include a discussion of the missed work and if possible, benchmark against it. Please also clarify the questions I asked regarding how you implemented the baselines and if possible update the Tables using the (significantly improved) numbers from the related work. I understand that the rebuttal time is limited, but if you could also provide a comparison with the baseline of first training on everything and then fine-tuning on only the clean data points, that would be really useful.

I would also like to ask about the computational requirements for the method, as it seems to require K finetunings.

If my concerns are properly addressed, I will raise my score since I like the idea, the paper, and the analysis.

论据与证据

Yes, the claims in this work are supported by clear evidence.

方法与评估标准

I find the benchmarking of this work unfair/incomplete.

Missing baselines:
- The work How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion (ICLR 2025) proposes the same idea (leveraging a few clean examples) and reaches the same finding (performance with noisy data only is limited but a few clean samples can lead to a dramatic increase in performance). Despite that, the paper is not discussed/cited and there is no benchmarking against it. I believe the authors were probably unaware of this work, but it is super relevant and should be extensively discussed and benchmarked against.
- Another useful baseline to compare against would be the training on noisy data first and then fine-tuning on clean data. This is in some sense what is happening post-training in the foundational models (e.g. see the EMU paper).
Unfair implementation of baselines:
- Ambient Diffusion is a method developed for training diffusion models with linear corruptions (e.g. random inpainting). It is unclear how the authors use this method for the denoising case.
- The authors report that TweedieDiff achieves FID of 167.23 without clean data and FID of 65.21 with some clean data points on CIFAR-10 for $\sigma=0.2$ . However, the work "How much is a noisy image worth" reports: a) FID 12.12 without clean data and without consistency, b) FID 11.93 without clean data and consistency, and iii) FID 60.73 with naive full sampling on a model that was just trained with times $t: \sigma_t>=0.2$ . Using 10% of clean data further drives FID to 2.50. The reported FIDs of 167.53 and 65.21 in the submission paint an unfair and very pessimistic picture for the baselines.

I also believe that for the evaluation to be more rigorous, the pair of (FID, memorization) needs to be reported.

理论论述

I checked the proofs about MISE, Proposition 1 and the proposed method. The MISE proofs seem to be heavily based on prior work. I also have a question regarding MISE; wouldn't it make more sense if we weight the integral with p_{data}(x)? Why are errors in very unlikely datapoints (according to $p_{data}(x)$ ) important?

实验设计与分析

I have several concerns regarding the experimental validation, as listed above. Repeating here:

I find the benchmarking of this work unfair/incomplete.

Missing baselines:
- The work How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion (ICLR 2025) proposes the same idea (leveraging a few clean examples) and reaches the same finding (performance with noisy data only is limited but a few clean samples can lead to a dramatic increase in performance). Despite that, the paper is not discussed/cited and there is no benchmarking against it. I believe the authors were probably unaware of this work, but it is super relevant and should be extensively discussed and benchmarked against.
- Another useful baseline to compare against would be the training on noisy data first and then fine-tuning on clean data. This is in some sense what is happening post-training in the foundational models (e.g. see the EMU paper).
Unfair implementation of baselines:
- Ambient Diffusion is a method developed for training diffusion models with linear corruptions (e.g. random inpainting). It is unclear how the authors use this method for the denoising case.
- The authors report that TweedieDiff achieves FID of 167.23 without clean data and FID of 65.21 with some clean data points on CIFAR-10 for $\sigma=0.2$ . However, the work "How much is a noisy image worth" reports: a) FID 12.12 without clean data and without consistency, b) FID 11.93 without clean data and consistency, and iii) FID 60.73 with naive full sampling on a model that was just trained with times $t: \sigma_t>=0.2$ . Using 10% of clean data further drives FID to 2.50. The reported FIDs of 167.53 and 65.21 in the submission paint an unfair and very pessimistic picture for the baselines.

I also believe that for the evaluation to be more rigorous, the pair of (FID, memorization) needs to be reported.

补充材料

I went over all the Supplementary Material. Section E (Experiment Configurations) could benefit from explaining how the baselines are trained/used.

与现有文献的关系

This paper follows a line of recent works on training diffusion models with corrupted data and its implications for performance, copyright and memorization. The authors do a very good job in describing the state of the literature in Section 2 (Related Work). The main finding of this work is very interesting, and it is that a small set of clean data is essential for performance when the assumption of infinite data is violated (i.e. in all practical settings).

遗漏的重要参考文献

As mentioned above, the authors miss the work How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion (ICLR 2025) which proposes the same idea (leveraging a few clean examples) and reaches the same finding (performance with noisy data only is limited but a few clean samples can lead to a dramatic increase in performance).

其他优缺点

I have already listed the main Weaknesses of the work. On the strengths side:

The paper is clearly written.
The topic is very interesting and the findings very intuitive.
The finding confirms a recent finding from another work, which is in some sense, a positive thing since it shows its importance and its reproducibility.
The connection to density deconvolution is novel and the sample-complexity results are interesting.
The proposed framework, a sort of Expectation-Maximization applied to diffusion modeling, is very niche and it shows that there are other ways beyond consistency to extrapolate beyond the training data.

其他意见或建议

Please update the running title as it is currently listed as "Submission and Formatting Instructions for ICML 2025".

作者回复

2025-04-01

Thank you for reviewing our paper and for the detailed feedback.

Before addressing specific points, we would like to clarify that the remark stating "the benchmarking of this work is unfair/incomplete" is, in our view, inaccurate. Specifically,

We were not aware of the missing baseline [1] before our submission. We hope you would understand since [1] is very recent and there is no way we could have compared to it. We did our best to discuss its relation to our method below (in A1), which will also appear in our revision.
Regarding the claim of unfair implementation: our version of TweedieDiff is not the same as in Daras25 [1] but faithfully follows the original paper (see A4 for details). While we’re happy to include results from [1] and discuss the differences, we do not have any intention to “paints an unfair and very pessimistic picture of the baselines.”

Q1. Regarding a related work [1] published in ICLR 2025

A1. We agree that this very recent work [1] is relevant. We will include a detailed discussion and benchmark comparison in the revision. In particular:

Both papers show that learning the true data distribution from only noisy samples is theoretically possible but practically requires an infeasible number of samples. To address this, both incorporate a small amount of clean data to guide training.

However, they differ in approach. Our method uses the density deconvolution, while Daras et al. build on Gaussian Mixture Models (GMMs), providing more precise modeling under stronger distributional assumptions. We believe the fact that both methods reach similar conclusions—despite their different foundations—reinforces each other.

Methodologically, [1] applies Tweedie’s formula (consistency constraint) to recover the clean distribution. In contrast, ours introduces a novel forward-backward deconvolution strategy, offering a fresh perspective without the heavy computational cost of enforcing consistency.

(Benchmark Comparison) We will include the following CIFAR-10 FID results in the revision:

50 clean images, σ = 0.2 (Table 1 setting): Daras25: 8.05; SFBD: 13.53
10% clean images, σ = 0.2: Daras25: 2.81; SFBD: 2.58
4% clean images, σ = 0.59: Daras25: 6.75; SFBD: 6.31

Daras25 performs better with very limited clean data, likely due to overfitting during our model’s pretraining on small datasets. This overfitting is lessened as more clean data becomes available, allowing SFBD to outperform. We attribute this to SFBD’s stable fine-tuning via score-matching loss, which avoids the extra constraints of consistency-based methods.

Q2. Regarding training models in a way similar to EMU

A2. We illustrate this on CIFAR-10 by pretraining a diffusion model on noisy images (σ = 0.2), then finetuning on either 50 or 4% clean images. FID trajectories [link] show an initial drop followed by a rise—sharper with 50 clean images, slower with 4%. This is expected: finetuning on clean data improves performance at first but eventually causes the model to forget useful noisy features, leading to degradation.

Q3. Regarding ambient diffusion

A3. The reported value is from the arXiv version of [2], which states that "AmbientDiffusion is trained with the standard setting." Since it was omitted from the NeurIPS final version and its reliability is uncertain, we will remove it in the revision.

Q4. Regarding TweedieDiff

A4. Our implementation closely follows the original TweedieDiff paper. Comparing Daras25 [1] and Daras24 [3], we think the difference might be caused by the new implementation in Daras25 [link], which we were not aware of before our submission. We will include results from both versions and clarify their differences in the revision. We are happy to include the new result of Daras25 in our revision and add explanation on the different results. We want to assure the reviewer we tried our best to compare all baselines fairly (to be best of our knowledge).

Q5. Regarding memorization

A5. See Reviewer kYMA-A6.

Q6. Regarding MISE

A6. MISE is the standard metric in density estimation that evaluates error uniformly across all x, encouraging agreement both within and outside the support of $p_{\rm data}(x)$ . Weighting the error by $p_{\rm data}(x)$ instead reduces the influence of low-likelihood regions, under-penalizing high-density estimates in areas where the true density is low. This can make models that generate unlikely or absurd samples appear to perform well, contradicting our goal of accurately modeling the full distribution.

Q7. Regarding computation requirement

A7. See Reviewer kYMA-A2.

[1] How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion. Daras et al., 2025.

[2] An EM Algorithm for Training Clean Diffusion Models from Corrupted Observations. Bai et al., 2024.

[3] Consistent Diffusion Meets Tweedie. Daras et al., 2024.

审稿人评论

2025-04-04

Thank you for your rebuttal. I expect that the authors will update their comparisons and baseline results, as promised. I agree that the overlap with the other work reinforces the validity of the approach. I am raising my score to 4.

作者评论

2025-04-05

Thank you very much for your thoughtful comments and engagement throughout the review process. We’re pleased that our responses have addressed your concerns. As noted, we will incorporate the promised comparisons and baseline results in the revised version to strengthen the final submission.

Sincerely,

The Authors

最终决定Accept (poster)

2025-05-01

This is a very interesting paper on how to train diffusions when given both clean data and noisy data. There is an interesting result, that even with a small amount of clean samples, the noisy data can help in learning the true (unknown) distribution. Overall, the paper is very interesting, studies an important problem and positions itself well compared to recent work in this very active space.

In the discussions the authors also agreed to compare with recent related work which seems very closely related but indeed different. All additional concerns by the reviewers were sufficiently addressed by the author's rebuttals. Overall, this work seems satisfactory and ready for publication.