Ambient Diffusion Omni: Training Good Models with Bad Data
We improve the quality of generative models by using low-quality, corrupted, and out-of-distribution data
Abstract
Reviews and Discussion
Ambient Diffusion Omni (Ambient-o) proposes a two-stage strategy for training diffusion models on mixed-quality datasets. At large diffusion noise, it adds Gaussian noise until a time-aware classifier can no longer tell clean from degraded images and then trains on every sample; at small noise, it trains only on local crops whose statistics, selected by a crop classifier, match the target distribution. A total-variation contraction proof shows that noise shrinks distribution gaps, giving a formal bias–variance trade-off.
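In pseudocode, this data-usage rule amounts to roughly the following (a minimal sketch with hypothetical names; the actual training targets and loss weighting follow the ambient objectives described in the paper):

```python
import torch

def training_step(ambient_loss, batch, sigma):
    """Hypothetical sketch of the per-sample time gating used during training.

    Each sample carries two annotations produced by the classifiers:
      t_min -- smallest diffusion noise level at which added Gaussian noise
               hides the sample's degradation (0 for trusted clean data);
      t_max -- largest noise level at which the sample's local statistics are
               indistinguishable from the target distribution
               (+inf for in-distribution data).
    The actual targets/weights follow the ambient objectives in the paper and
    are abstracted into `ambient_loss` here.
    """
    x, t_min, t_max = batch["image"], batch["t_min"], batch["t_max"]
    allowed = (sigma >= t_min) | (sigma <= t_max)   # the two usable regimes
    if not allowed.any():
        return None                                  # skip a fully gated batch
    return ambient_loss(x[allowed], sigma[allowed])
```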
Strengths and Weaknesses
S1. The paper formalizes how additive Gaussian noise contracts total variation distance between clean and corrupted distributions, quantifying the bias–variance trade-off when including low-quality data.
S2. Unlike prior work on Ambient Diffusion or EM-based approaches that assume access to the exact corruption operator, Ambient-o only requires (1) a small "clean" subset and (2) the ability to train classifiers to distinguish clean from corrupted samples (globally or at the patch level).
S3. The authors supply ablation studies of Ambient weight and buffer clipping and show how a sample-dependent classifier threshold improves performance over a fixed σ.
S4. Because quality decisions rely only on learned classifiers, the method works with blur, JPEG, motion blur, and even out-of-distribution examples such as using cat images to help a dog model.
W1. Adding noise cannot fully eliminate low-frequency degradations (e.g., color shifts, masking), so the approach may revert to simple filtering in those cases.
W2. Performance hinges on the accuracy and thresholding of the quality and crop classifiers; mis-calibration could hurt results, yet sensitivity analyses are limited.
W3. Sample-wise annotations, crop-level decisions, and dual noise schedules introduce extra engineering steps that may slow adoption in production settings.
W4. Ambient-o assumes access to a small set S_G of known high-quality images and a set S_B of known corrupted/out-of-distribution images or a way to label them. In domains lacking any semi-trusted clean subset, it is unclear how to obtain S_G or S_B without manual curation.
Questions
- How sensitive are the final generative results to small variations in the threshold τ used for determining t? Have you tested different τ values (e.g., 0.4 or 0.45) on the same dataset to measure stability?
- For ImageNet, you used CLIP-IQA to split into high or low quality. In domains where no pre-trained quality estimator exists, what is a practical approach to form S_G or S_B? Could you use a small human-labeled "gold" set, or would self-supervised heuristics suffice?
- You note that degradations affecting low-frequency channels (fog, color shifts) are "ill-suited" since Gaussian noise does not sufficiently confuse them. Have you tried combining Gaussian noise with other synthetic corruptions to push corrupted samples into a "usable" regime at lower σ?
- While Ambient-o improves global image quality and diversity, have you quantified whether it preserves very fine-grained details better or worse than training on fully curated data? For instance, does the patch classifier inadvertently retain tiny out-of-distribution artifacts?
Limitations
The authors discuss the limitations in the methodology section.
Final Justification
Thank you for the reply, which addressed most of my concerns. I have determined my final rating based on this.
Formatting Issues
- The checklist has been placed above the reference list.
- Part of the checklist content appeared in the supplementary materials.
We would like to thank the Reviewer for their thoughtful feedback. The Reviewer appreciated our formalization of the problem, our theoretical results, the generality of our method, and the breadth of our experiments. The Reviewer has concerns on three fronts:
- Missing ablation studies on the sensitivity of the classifiers and the size of the ground truth set needed for classification.
- Missing evaluations to measure how much we can preserve fine details.
- The applicability of our method to low-frequency corruption types.
We present numerous new ablation results addressing each one of the raised issues.
Sensitivity Results
Reviewer: "How sensitive are the final generative results to small variations in the threshold?"
Great question! Short answer: the results are not very sensitive to miscalibration. To test this, we report FID results for JPEG compression at an 18% compression rate (Table 3 of the paper) for a fixed classifier annotating at different thresholds. The results are shown below:
| Threshold | FID | Notes |
|---|---|---|
| – | 8.79 | Only clean data |
| 0.40 | 6.67 | |
| 0.43 | 6.73 | |
| 0.45 | 6.43 | Paper result |
| 0.47 | 6.21 | Best result |
| 0.48 | 6.28 |
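For completeness, the annotation rule behind these thresholds can be sketched as follows (hypothetical function names; we assume the classifier outputs the probability that a noised input comes from the clean set, and the exact convention in our implementation may differ slightly):

```python
import torch

@torch.no_grad()
def annotate_t_min(classifier, x, sigma_grid, tau=0.45, n_draws=8):
    """Hypothetical sketch of threshold-based annotation.

    classifier(x_noisy, sigma) is assumed to return the probability that the
    noised input came from the clean/target set. For a corrupted sample this
    probability starts low and approaches chance (0.5) as sigma grows; the
    sample is annotated with the smallest sigma at which it exceeds tau,
    i.e. the classifier can no longer confidently reject it.
    """
    for sigma in sorted(sigma_grid):                       # low -> high noise
        p_clean = torch.stack([
            classifier(x + sigma * torch.randn_like(x), sigma)
            for _ in range(n_draws)                        # average over noise draws
        ]).mean()
        if p_clean > tau:
            return sigma
    return max(sigma_grid)        # never confused: only usable at the highest noise
```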
Reviewer: "Performance hinges on the accuracy and thresholding of the quality and crop classifiers; mis-calibration could hurt results, yet sensitivity analyses are limited."
Beyond the threshold ablations we presented above, we further ablate the size of the dataset needed for classification and the effect of training iterations.
Training Samples Ablation
We train two CIFAR-10 classifiers for the Gaussian blur corruption. The first classifier is trained using 25K clean samples and 25K blurry samples, and the second is trained with 5K clean samples and 45K corrupted samples. The resulting FID scores are shown below:
| Classifier Training Data | FID |
|---|---|
| 25K clean and 25K blurry | 5.34 |
| 5K clean and 45K blurry | 6.04 |
As shown, the number of samples has an effect on the performance of the classifier, but the performance is still much better than the 8.79 FID we get by training only on clean samples. Hence, miscalibration due to a small training set can hurt performance a bit, but not to a catastrophic extent.
Training Iterations Ablation
Another way to measure sensitivity to miscalibration is to check what happens when we undertrain or overtrain the classifier. The results are shown below for JPEG at an 18% compression rate (Table 3 of the paper):
| Classifier Training Duration | FID |
|---|---|
| 5M images (100 epochs) | 6.50 |
| 10M images (200 epochs) | 6.58 |
| 15M images (300 epochs) | 6.49 |
We believe the results above demonstrate the robustness of our classifier to various types of miscalibration.
Removing the classifier
Reviewer: "Sample-wise annotations, crop-level decisions, and dual noise schedules introduce extra engineering steps that may slow adoption in production settings"
We agree with this point; however, the classifier might not be needed at all.
At its core, our method is very simple: bad data should only be used in a subset of diffusion times. Blurry images, for example, are typically useful in the high-noise regime, while out-of-distribution images might be useful in the low-noise regime. The training of the classifiers provides a rigorous way of finding the “allowed training times” (as they estimate the TV distance), but this is not necessary. For example, for our controlled experiments on blurry images, annotating all the corrupted samples with the same value already yields pretty strong results. We show these results below (annotated with Fixed Sigma):
| Setting | Blur 1 | Blur 0.8 |
|---|---|---|
| Only clean data | 8.79 | 8.79 |
| No annotations | 45.32 | 28.26 |
| Classifier | 6.16 | 6.00 |
| Fixed sigma | 6.95 | 6.66 |
Our paper has similar results (Fixed Sigma) for the out-of-distribution data experiment (Table 2b), too. Further, for our text-to-image results, we similarly annotated all the data from DiffDB at the same noise level without requiring the training of a classifier.
Reviewer: "In domains where no pre-trained quality estimator exists, what is a practical approach to form S_G or S_B? Could you use a small human-labeled "gold" set, or would self-supervised heuristics suffice?"
Excellent question! The problem of sampling from the "uncorrupted" distribution is not identifiable if the corruption model is unknown (as in our setting) and we don't have samples from the ground truth. Hence, a "gold" set needs to be formed, either with human supervision or with some other domain-specific knowledge. For example, in proteins, there are corrupted data in the Protein Data Bank, but they come with some "confidence level" (resolution of the protein). In robotics, there are ground truth data from humans and "corrupted data" from simulations. If we are not in such a setting, then human annotation is required to form a "gold" set. The good news is that our ablations above suggest that the "gold set" can be small without hurting the performance too much (see sensitivity analysis).
Low-frequency corruptions
Reviewer: "Adding noise cannot fully eliminate low-frequency degradations (e.g., color shifts, masking), so the approach may revert to simple filtering in those cases. Have you tried combining Gaussian noise with other synthetic corruptions to push corrupted samples into a "usable" regime at lower sigma?"
Excellent question! It is true that for certain cases, classifying under Gaussian noise corruption will essentially revert to filtering. Hence, there are two options. Option 1 is to do some preprocessing of the data, as the Reviewer proposes. This can actually take us a long way in certain cases. We experimented with two types of corruptions to show this:
- salt and pepper noise, where we set 5% of the pixels independently to either 0 or 1.
- random masking, where we delete 70% of the pixels independently.
For both of these cases, we required a very high sigma to mask the corruptions -- essentially reverting to filtering. However, a simple preprocessing of the data dramatically increased performance. For salt and pepper, we used a 3x3 median filter and obtained an FID of 5.70 (compared to the 8.79 FID of the filtering baseline) using a fixed annotation of t_min = 3. Similarly, for the random masking, we take a local average of the non-masked pixels (simulating the effect of blurring), which also leads to a significantly improved FID. These experiments show that with some simple preprocessing we can actually extract more signal out of the available data.
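A minimal sketch of the two preprocessing steps (using scipy; the masked-average window size is an illustrative choice, not taken from the paper):

```python
import numpy as np
from scipy.ndimage import median_filter, uniform_filter

def preprocess_salt_and_pepper(img):
    """3x3 median filter: turns unbounded impulse noise into a mild,
    blur-like corruption that Gaussian noise can mask at a lower sigma.
    `img` is assumed to be a float HWC array."""
    return median_filter(img, size=(3, 3, 1))

def preprocess_random_mask(img, mask, window=5):
    """Fill deleted pixels with a local average of the observed ones
    (roughly simulating a blur). `mask` has the same shape as `img`
    and is 1 where a pixel is observed, 0 where it was deleted."""
    num = uniform_filter(img * mask, size=(window, window, 1))
    den = uniform_filter(mask.astype(img.dtype), size=(window, window, 1))
    filled = num / np.clip(den, 1e-8, None)
    return np.where(mask > 0, img, filled)
```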
However, this is not always possible. For some corruptions, there aren't really easy preprocessing steps we can take to make them amenable to Gaussian noising. If the dataset has a lot of samples with such corruptions, the best way forward in our opinion is to train an "exotic" diffusion model (similar to Soft Diffusion / Cold Diffusion and certain flow matching models). These diffusion generalizations learn to connect arbitrary source and terminal measures -- e.g., the fully fogged distribution to the distribution of clean images. If we are willing to train such a model, we can corrupt the samples (not by adding noise, but by, e.g., adding fog) to bring them along that exotic-diffusion trajectory. That said, we think this approach would require a lot of engineering effort, and hence we did not experiment with it in the paper.
Evaluation regarding preservation of very fine features
Reviewer: "have you quantified whether Ambient-o preserves very fine-grained details better or worse than training on fully curated data? For instance, does the patch classifier inadvertently retain tiny out-of-distribution artifacts?"
Super interesting question -- we thank the Reviewer for raising this. As in the low-quality data case, using data from any distribution other than the target will have to introduce some artifacts since convergence (even for small crop sizes) is not perfect. However, these artifacts do not seem to be detectable, at least using the tests we performed during the rebuttal.
Let's take the dogs-and-cats example, which is the most extreme case of using out-of-distribution data. To measure to what extent we preserve fine details, we measure a per-patch-size FID: i.e., we compare the distribution of patches of the generated samples with the distribution of patches from the training set, computing the FID at different patch sizes. We compare two models: the baseline model trained only on dogs vs. our Ambient-o model trained on the same number of dogs + crops from cats. The results for the per-patch FID are shown below:
| Patch Size | Baseline FID | Ours FID |
|---|---|---|
| 1 | 0.3861 | 0.1825 |
| 2 | 1.1113 | 0.9761 |
| 4 | 2.9895 | 1.2498 |
| 8 | 5.1462 | 2.7415 |
| 16 | 9.4396 | 5.8144 |
| 32 | 9.7527 | 6.5243 |
| 64 | 12.0800 | 8.9214 |
As shown, our method not only achieves superior FID on the global images, but also achieves better FID when looking at patches. This is evidence that we have better fine-grained details compared to the model trained only on clean data.
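For reference, the per-patch FID above can be sketched as follows (for illustration we use raw pixels as patch features and hypothetical array shapes; this is a simplified illustration, not our exact evaluation code):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussian fits of two feature sets of shape (N, D)."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b).real
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2 * covmean))

def patches(images, k):
    """Non-overlapping k x k crops from an (N, H, W, C) array, flattened."""
    n, h, w, c = images.shape
    crops = images[:, : h // k * k, : w // k * k].reshape(n, h // k, k, w // k, k, c)
    return crops.transpose(0, 1, 3, 2, 4, 5).reshape(-1, k * k * c)

def per_patch_fid(real, fake, sizes=(1, 2, 4, 8, 16, 32, 64)):
    """Per-patch-size FID; raw pixels serve as features here, which is a
    simplifying assumption (a patch-level feature extractor could be used)."""
    return {k: frechet_distance(patches(real, k), patches(fake, k)) for k in sizes}
```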
To further assess whether this makes sense, we also computed the per-patch-size FID between (1) two disjoint sets of training images of dogs and (2) training images of dogs and training images of cats. Results are shown below:
| Patch Size | FID (Dogs 1 vs Dogs 2) | FID (Dogs 1 vs Cats 1) |
|---|---|---|
| 64 | 29.40 | 176.85 |
| 32 | 42.94 | 170.83 |
| 16 | 58.08 | 83.66 |
| 8 | 26.44 | 37.99 |
| 4 | 13.88 | 12.37 |
| 2 | 5.10 | 5.43 |
| 1 | 0.86 | 1.60 |
As shown, for crop sizes 4 and lower the FIDs are comparable. Our crop classifier assigns an average annotation at crop size 6 (between 4 and 8), which is consistent with this experiment.
We believe the provided experiments and clarifications strongly address the Reviewer's concerns. If so, we would appreciate if the Reviewer considers increasing their rating to reflect this.
Dear Reviewer, Thank you for acknowledging our rebuttal and for your feedback. We would like to know if our additional experiments and replies addressed your concerns. We remain at your availability if further clarifications are needed.
Thank you for your response. Your previous comment has addressed most of my concerns. Taking into account the comments from other reviewers, your clarifications, and my overall assessment of the paper, I will maintain my original score, which reflects a positive signal. Thank you again.
The paper proposes a novel method, Ambient Diffusion Omni, to learn from bad data. The method can learn clean distributions from bad data without access to the forward operator. Results on Gaussian blur, JPEG, and motion blur verify the effectiveness of the proposed method. Theoretical justification is provided for understanding the tradeoff encountered when learning from corrupted data.
Strengths and Weaknesses
Strengths:
- The paper proposes an interesting setup where the forward operator is unknown.
- The paper provides theoretical justification for the tradeoff encountered when learning from bad data.
- The paper provides experiments on relatively large-scale datasets like ImageNet, which increases the impact of this line of work.
Weaknesses:
- The overall storytelling is interesting, but the method proposed is a combination of techniques.
- The lack of an examination of the contribution of each component could hurt the originality of the proposed method.
- The lack of a fair comparison with possible solutions to the proposed setting weakens the paper.
Questions
- It would be interesting to see the results of the current method on y = Ax + w, where A is an identity matrix. In this case, the estimated noise level directly reflects the noise added to the original image. However, the idea of learning a classifier for noise is not novel, as previous works such as [4] also use GANs at different diffusion steps. The estimated noise level can also indicate the strength of the proposed classifier.
- Recent works [1], [3] suggest that, using Eq. 2.2 in the paper with the estimated noise level, diffusion models can effectively learn clean distributions by ablating various sampling methods, e.g., truncated sampling. [1] also shows that, even without knowing the true noise level, ablating over noise levels can still yield reasonable performance.
- The paper lacks ablation studies on the two regimes—high-quality and low-quality. What happens if either regime is removed during training? It would be insightful to see which regime contributes more to performance. Notably, [2] demonstrates that using only Eq. 2.2 and truncated sampling in the noisy setting, or standard diffusion objectives in the noise-free setting, already achieves fairly good results.
- It would also be interesting to extend the results to y = Ax + w, where w is noise following an unknown distribution, aligning with the setup in the paper.
- The paper claims it is learning from bad data, but in the experiments 10% clean data is necessary, right? It would also be interesting to see the performance without clean data. This requirement of clean data would hurt the novelty of this setup.
Suggestions for clarification:
- Provide an ablation on the thresholds used in the paper.
- Discuss the importance of estimating the noise level. What if the noise level were treated as a hyperparameter and optimized via grid search? This might be more effective than training a noise classifier and selecting a threshold, which is labor-intensive.
- Clarify the effectiveness of learning in the low-noise regime. What are the results when training in the high-noise regime with truncated sampling? Are there other techniques beyond cropping that could improve performance?
Overall, the paper presents an interesting setup and approach. However, the lack of ablation studies on key components weakens its empirical foundation. My current rating reflects this limitation.
[1] Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation
[2] Restoration Score Distillation: From Corrupted Diffusion Pretraining to One-Step High-Quality Generation
[3] How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion
[4] Diffusion-GAN: Training GANs with Diffusion
Limitations
yes
Final Justification
The authors have addressed my concerns through thoughtful discussions. I thus increase the rating to 4.
Formatting Issues
None
We thank the Reviewer for their time and feedback! The Reviewer appreciated our experimental setup, our theoretical proofs and the strength of our experimental results. There are some concerns regarding missing ablations. We ran numerous ablations to address these concerns:
- We show results without training classifiers.
- We provide results on additive Gaussian noise.
- We show results with truncated sampling.
- We provide baseline results for only learning with clean data on ImageNet.
- We experiment with other corruption processes with additive noise.
- We test the robustness of the thresholds.
Reviewer: "[...] the method proposed is a combination of techniques. What if the noise level were treated as a hyperparameter and optimized via grid search?"
At its core, our method is very simple: bad data should only be used in a subset of diffusion times. Blurry images, for example, are typically useful in the high-noise regime, while out-of-distribution images might be useful in the low-noise regime. The training of the classifiers provides a rigorous way of finding the “allowed training times” (as they estimate the TV distance), but this is not necessary. For example, for our controlled experiments on blurry images, annotating all the corrupted samples with the same value already yields pretty strong results. We show these results below (annotated with Fixed Sigma):
| Setting | Blur 1 | Blur 0.8 |
|---|---|---|
| Only clean data | 8.79 | 8.79 |
| No annotations | 45.32 | 28.26 |
| Classifier | 6.16 | 6.00 |
| Fixed sigma | 6.95 | 6.66 |
Our paper has similar results (Fixed Sigma) for the out-of-distribution data experiment (Table 2b), too. Further, for our text-to-image results, we similarly annotated all the data from DiffDB at the same noise level.
Reviewer: "It would be interesting to see the results of the current method on the additive Gaussian noise case. It can also indicate the strength of the classifier."
We agree with the Reviewer that this is a good sanity check! We corrupt our data with noise levels σ ∈ {0.05, 0.1, 0.2}, as in the paper "How much is a noisy image worth", and we annotate each datapoint with the classifier. The estimated noise levels are shown below:
| Gaussian Noise Stddev | Average Annotations |
|---|---|
| 0.05 | 0.058 |
| 0.10 | 0.102 |
| 0.20 | 0.179 |
As shown, the classifier does a pretty reasonable job of estimating the noise level. We trained models using the estimated noise levels and we reproduced the FID result reported in the paper "How much is a noisy image worth". Our approach generalizes this idea to arbitrary corruptions.
Reviewer: "Recent works [1], [3] suggest that [...] diffusion models can effectively learn clean distributions by ablating various sampling methods, e.g., truncated sampling."
We believe the Reviewer has a critical misunderstanding here. In these works, it is known that the samples have been corrupted by Gaussian noise (and the noise level is given or can be estimated). However, for unknown degradation models (like the ones we deal with in the paper) it is information-theoretically impossible to recover the uncorrupted distribution without access to clean samples from it. That might be better understood with an example. If we are just given blurry samples, it is impossible to know whether the underlying distribution has blurry samples or the underlying distribution is a clean distribution and we just happened to observe blurry observations from it. Recovery guarantees for the underlying distribution without access to clean samples are only possible if the degradation model is known – hence, the distillation method of [1, 2] wouldn’t work for our settings where the corruption model is unknown. That said, we sincerely thank the Reviewer for bringing them to our attention, and will ensure they are properly cited in the camera-ready version.
Reviewer: "The paper claimed it is learning from bad data, but in the experiments 10% data are necessary, right? It is also interesting to see the performance without clean data. This requirement of clean data would hurt the novelty of this setup."
Given only access to blurry samples and no prior on the degradation model, the best our method can achieve is to recover the distribution of the blurred samples. The FID of the blurred training set is given below:
| Blur σ | Blurred Data FID | Only Clean Data FID | Our Method FID |
|---|---|---|---|
| 1.0 | 53.11 | 45.32 | 6.16 |
| 0.8 | 32.09 | 28.26 | 6.00 |
| 0.6 | 11.84 | 11.42 | 5.34 |
Our method achieves significantly better FIDs (6.16, 6.00 and 5.34 respectively), with the benefit coming from the usage of the clean data. On the other hand, just using clean data does not suffice (see Table 2a) as it gives FIDs 45.32, 28.26 and 11.42 respectively. Hence, for the success of our method the combination of a small set of clean data and a large set of corrupted data is essential. These results are similar in spirit to the results obtained by the paper How much is a Noisy Image Worth that the reviewer mentions. Our work generalizes these results to arbitrary degradations (beyond additive Gaussian noise).
Reviewer: "The paper lacks ablation studies on the two regimes—high-quality and low-quality. What happens if either regime is removed during training? It would be insightful to see which regime contributes more to performance."
See above for results when only low-quality data is available. As we explained above, the underlying distribution is not identifiable if the degradation model is not known and the best one can hope for is recovering the corrupted distribution – this leads to very poor FID scores.
For the regime where we are only given high-quality data, we actually have results in the paper and we think that the Reviewer might have missed them. This is the Only Clean baseline in Table 2a and the Additional Data = None rows in Table 2b. We further have results for fine-tuning the text-to-image model only on a curated high-quality subset in lines 318-322 of the paper, showing that our method achieves a comparable FID score with a very significant increase in diversity (>13%).
Based on the Reviewer’s recommendation, we further report results on ImageNet for the Only Clean data baseline. We report results when 10% and 50% of ImageNet is available. We show test FID below:
| Method | FID | FD (DINO) |
|---|---|---|
| Baseline XS | 3.77 | 115.16 |
| Ambient XS | 3.69 | 115.02 |
| Only Clean Data (10%) | 58.09 | 1001.79 |
| Only Clean Data (50%) | 9.77 | 364.44 |
As shown, having access to low-quality data is critical: if there are not enough high-quality samples, performance is quite poor. We will include these results in the Camera Ready of our work – we thank the Reviewer for suggesting this.
Reviewer: "It would also be interesting to extend the results to y=Ax + w [...]"
We thank the Reviewer for their suggestion, and include results on CIFAR-10 for A = Gaussian blur and w = zero-mean Poisson noise, as well as for salt-and-pepper corruption. As in the paper, we compare results using 10% clean data to using 10% clean data + 90% corrupted data.
| Setting | FID |
|---|---|
| Only clean data | 8.79 |
| Gaussian blur + Poisson noise, classifier annotations | 5.78 |
| Gaussian blur + Poisson noise, fixed annotation (best) | 5.72 |
| └─ Fixed annotation, value 1 | 5.94 |
| └─ Fixed annotation, value 2 | 5.72 |
For the salt-and-pepper corruption, we pre-processed the data with a 3x3 median filter to convert the unbounded impulse noise into a blurring corruption, which is more easily confused by Gaussian noise. This allowed us to obtain a best FID of 5.70 using a fixed annotation t_min = 3.
Reviewer: "Provide an ablation on the thresholds"
We show FID vs. threshold results for the JPEG 18% case, keeping the data and classifier constant and changing only the threshold:
| Threshold | FID | Notes |
|---|---|---|
| 0.40 | 6.67 | |
| 0.43 | 6.73 | |
| 0.45 | 6.43 | Baseline (in paper) |
| 0.47 | 6.21 | Best FID |
| 0.48 | 6.28 |
Reviewer: "What are the results when training in the high-noise regime with truncated sampling? Are there other techniques beyond cropping that could improve performance?"
Truncated sampling is known to deteriorate results, as the samples get more blurry. We reproduce the same finding. On ImageNet, the results with truncated sampling are shown below:
| Truncation σ | FID |
|---|---|
| 0.0 | 2.53 |
| 0.1 | 2.86 |
| 0.2 | 4.50 |
| 0.4 | 6.87 |
| 0.8 | 19.89 |
| 1.0 | 31.41 |
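For clarity, here is a minimal sketch of what we mean by truncated sampling (an EDM-style deterministic sampler that stops at a truncation noise level and outputs the denoiser's posterior-mean estimate; names are illustrative):

```python
import torch

@torch.no_grad()
def truncated_sample(denoiser, sigmas, shape, sigma_trunc=0.0):
    """Sketch of truncated sampling with a first-order (Euler) ODE sampler.

    `sigmas` is a decreasing noise schedule; instead of integrating all the
    way to sigma = 0, we stop once sigma <= sigma_trunc and return the
    denoiser's posterior-mean estimate E[x0 | x_t] at that point.
    `denoiser(x, sigma)` is assumed to output that estimate (EDM convention).
    """
    x = torch.randn(shape) * sigmas[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoiser(x, sigma)
        if sigma <= sigma_trunc:
            return x0_hat                                        # truncate here
        x = x + (sigma_next - sigma) * (x - x0_hat) / sigma      # Euler step
    return denoiser(x, sigmas[-1])
```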
Another technique that could improve results would be the consistency loss of the CDM paper. However, this loss is expensive to run as it requires sampling trajectories during training.
We believe our numerous ablations and clarifications strongly address the Reviewer’s concerns. If so, we would be grateful if the Reviewer considers increasing their score (which is currently the lowest among all Reviewers).
I appreciate the authors' effort in the rebuttal. The paper is overall interesting. I also appreciate the author's effort in scaling this setting to ImageNet and T2I generation. However, I have several concerns about the overall setup and the effectiveness of each component, which I’d like to discuss with the authors.
- For example, for our controlled experiments on blurry images, annotating all the corrupted samples with the same value already yields pretty strong results.
I think this demonstrates that learning in the high-noise regime is useful for blurry data, and that using a fixed sigma can already yield strong results. While the classifier improves performance slightly, from my perspective, the overall setup of the paper is solid, but the classifier itself, as a component of the method, is not particularly compelling. Training a classifier and annotating each image requires additional time and computation, whereas using a fixed sigma would be more efficient. Do you provide any discussion on the computational efficiency of these methods?
- The Reviewer has a critical misunderstanding here – hence, the distillation method of [1, 2] wouldn’t work for our settings where the corruption model is unknown.
To clarify, my point is that truncated sampling can improve performance over full sampling when there is no access to clean data. Papers [1], [2], and [3] show that when a diffusion model is trained on corrupted data [2] or on noisy data [1, 3] without access to clean data, truncated sampling performs better than full sampling.
In the first table provided in the rebuttal:
| Setting | Blur 1 | Blur 0.8 |
|---|---|---|
| Only clean data = 10% clean data | 8.79 | 8.79 |
| No annotations = 10% clean data + 90% corrupted data | 45.32 | 28.26 |
| Classifier = 10% clean data + 90% corrupted data | 6.16 | 6.00 |
| Fixed sigma = 10% clean data + 90% corrupted data | 6.95 | 6.66 |
Is the setting I highlighted above correct?
I understand the authors use full sampling since there is always 10% clean data available. However, my question is: if the classifier or fixed sigma is effective, what is the performance of the fixed sigma when only the 90% corrupted data is available? In that sense, the fixed sigma could serve as the sigma for truncated sampling. Then, truncated sampling + fixed sigma with only the 90% corrupted data would reflect the real performance of the corrupted data.
The purpose of this experiment, in my view, should be to demonstrate the maximum performance that corrupted data alone can achieve, without access to clean data. The current ablation does not fully capture that gap.
Another related question is: If the 10% clean data is so critical, wouldn't it be important to ablate on this percentage to better understand its impact?
I understand the discussion period is limited, and I am not suggesting that the authors add new experiments. My goal is simply to engage in a deeper discussion around the setup and implications of the paper.
Reviewer: “While the classifier improves performance slightly, the classifier itself, as a component of the method, is not particularly compelling. [...] fixed sigma can already yield strong results”
We agree that when all bad data is corrupted with the same degradation operator (e.g. blur) and degradation strength, then fixed sigma yields comparable performance to the classifier without training a new model.
However, in cases with heterogeneous corruptions in the dataset, using a fixed sigma is not a great idea, as we would have to pick a large value even for mildly corrupted datapoints. For example, for the stronger blur level the optimal fixed sigma is 2.3, while for the milder one it is 1.9. On a dataset that contains data corrupted at both blur levels, we would have to go with the higher fixed sigma value, leading to information loss and suboptimal performance.
That said, fixed sigma is still a reasonable approach that improves over the i) filtering baseline and ii) the no annotations baseline. For example, for our text-to-image results, annotating all of DiffDB at a fixed noise level improves COCO FID from 12.7 (Stable Diffusion v1 level) to 10.6 (DALLE-2 level).
Reviewer: “whereas using a fixed sigma would be more efficient”
That’s only true if the optimal fixed value is known beforehand. Usually it is unknown and has to be found with hyperparameter tuning, leading to multiple diffusion training runs.
Reviewer: “Training a classifier and annotating each image requires additional time and computation. Do you provide any discussion on the computational efficiency of these methods?”
Great point! Training a classifier and annotating is actually much faster than training a diffusion model. For example, in our controlled corruption experiments (e.g. JPEG), diffusion model training takes ~150min on 8xV100 GPUs, while classifier training and annotation takes ~28min or ~18% overhead. So unless we can guess the optimal fixed sigma in one shot, the classifier approach will always be more efficient in GPU-hours.
Reviewer: “Is the setting I highlighted above correct?”
Yes, that’s correct!
Reviewer: “The purpose of this experiment, in my view, should be to demonstrate the maximum performance that corrupted data alone can achieve [...] The current ablation does not fully capture that gap”
In the rebuttal we showed the optimal results you can get with only corrupted data. When the corruption model is unknown and only corrupted data is available, the best you can achieve is to learn the blurred distribution, whose FIDs were given in the rebuttal and are repeated below:
| Blur σ | Blurry Data FID |
|---|---|
| 1.0 | 53.1 |
| 0.8 | 32.1 |
| 0.6 | 11.8 |
Truncated sampling cannot yield better results. The reason is that it outputs the conditional expectation of the underlying distribution given the current iterate – which in this case is the blurry distribution (not the clean one).
Reviewer: “My question is: If the classifier or fixed sigma is effective, what is the performance of the fixed sigma when only 90% corrupted data is available? In that sense, the fixed sigma could serve as the sigma for truncated sampling”
To further validate our point above, we ran exactly the experiment that the Reviewer suggests, i.e., truncated sampling after training on only corrupted data. Training on Gaussian-blurred data with a fixed/truncation sigma of 1.83, we obtain an FID of 69.9.
Reviewer: “Another related question is: If the 10% clean data is so critical, wouldn't it be important to ablate this percentage to better understand its impact?”
We agree with the Reviewer. Please find below FID results for training with x = 1%, 5%, 10%, 30%, and 50% clean data and (100−x)% blurry data at a fixed blur level.
| Clean Data % | FID |
|---|---|
| 1% | 21.9 |
| 5% | 12.9 |
| 10% | 6.2 |
| 30% | 2.8 |
| 50% | 2.4 |
We will include this ablation in the Camera Ready.
Reviewer: “My goal is simply to engage in a deeper discussion around the setup and implications of the paper”
We appreciate the Reviewer’s efforts to better understand the paper and its implications.
Our paper unlocks a new capability: using bad data without any information or access to the degradation. This setting has not been studied before: Consistent Diffusion Meets Tweedie and [3] only work for additive Gaussian noise corruption, and [1]/[2] assume black-box access to the degradation operator (see [2]: Corruption Aware Training as well as Algorithm 6, L13/L17, Appendix page 10 where the degradation function is used during distillation). None of these works can handle the general (and more realistic) case of only having sample access to the clean and corrupted distributions.
To solve this more general case, any method needs some samples from the clean distribution (otherwise, the clean distribution is not identifiable). We hope this addresses the Reviewer’s concerns. We'll clarify these points in the Camera Ready and include all ablation studies from the Rebuttal and the Author-Reviewer discussions.
Thank you for your detailed and thoughtful responses. The authors have largely addressed my concerns. I will increase the rating accordingly.
This paper presents a method to enhance diffusion models using low-quality, synthetic, and out-of-distribution images. It introduces the Ambient Diffusion Omni framework, which leverages the spectral power-law decay and locality of natural images. Validated on images with Gaussian blur, JPEG compression, and motion blur, it achieves state-of-the-art ImageNet FID and improves text-to-image generation in quality and diversity.
Strengths and Weaknesses
Strengths:
- Traditional methods that filter high-quality data before training have a significant advantage in training efficiency. While adding bad data may not always be cost-effective, it can be meaningful if it yields results that cannot be achieved by training solely on high-quality datasets. I believe the most important benefits are the generative diversity and semantic understanding ability gained from retaining more samples.
Weaknesses:
- The authors' experiments lack diversity metrics such as the commonly used precision and recall on ImageNet.
- The concept of "bad data" presented in this paper is commendable. However, in practical applications, data are mostly in a curated form, and in real-world scenarios, "bad data" characterized by low motion, poor aesthetic metrics, limb distortion, incorrect spatial relationships, misinterpretations, etc., are far more critical. Additionally, the method section predominantly focuses on data corruption rather than these real-world manifestations of bad data. Moreover, with the requirement to train multiple classifiers, the paper fails to clearly demonstrate how the proposed approach can be extended to broader application scenarios.
Questions
- Specify the weakness: detail how to train a classifier for text-to-image tasks without class labels.
- Precision and recall metrics are lacking to demonstrate the improvement in diversity.
- Why is the FID improvement rather small or even worse? More data causes higher GPU resource consumption, and thus yielding worse results is unacceptable.
Limitations
yes
Final Justification
The rebuttal has addressed most of my concerns, though I remain confused about the classifier. Given the intriguing motivation, as well as the solid analysis and experiments presented, I will maintain my original score.
Formatting Issues
No
We thank the Reviewer for their time and feedback!
Reviewer: "Traditional filtering methods have a significant advantage in training efficiency."
We want to point out that this is not necessarily true. For many applications, practitioners run multiple epochs during training. Hence, the important quantity is the number of training updates, not the number of datapoints. All the baseline models used in the paper obtain their peak performance after multiple epochs of training. For example, for the ImageNet generative models, the EDM-2 XS baseline sees 2,147,483,000 (2B) images during training from a dataset of 1.2M images, so well over 1,000 epochs are run in total. Similarly, our text-to-image baseline (micro-diffusion) sees 686,080,000 (700M) images during training, but the dataset only has 37M images (~19 epochs).
Methods that leverage low-quality data can run the same number of training updates with fewer repetitions of the same data, leading to less memorization and increased diversity. For all our experiments, we trained our models for the same number of training steps as the baseline models.
Reviewer: "experiments lack diversity metrics like the commonly used precision and recall in Imagenet"
We thank the Reviewer for their recommendation! For our text-to-image results, we already report Vendi Diversity (Line 320) and we show a greater than 13% improvement compared to a baseline trained on a curated dataset. We further show qualitative results for diversity in Figure 29 of the Appendix.
For our ImageNet results, we initially did not report these metrics because we followed exactly the evaluation protocol from the EDM and the EDM-2 papers that only report FID. Following the Reviewer’s suggestion, we further report the following metrics:
- Precision (higher is better): measures the expected likelihood of fake samples against the real data manifold
- Recall (higher is better): measures the expected likelihood of real samples against the fake data manifold.
- KD (lower is better): measures the kernel distance between the real and the generated distribution
- CT (lower is better): A memorization statistic (see A Non-Parametric Test to Detect Data-Copying in Generative Models).
For all these metrics, we use the dgm-eval library that provides implementations of metrics used in the paper: Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models
We report the results for our best model (Ambient-o-XXL+crops) trained on ImageNet (Table 1 in the paper) and the baseline model trained by the EDM2 authors. We use the same training parameters, training updates and inference parameters (EMA=0.015, guidance strength=1.2 – values taken from the EDM2 paper) for both models. The results are shown below:
| Metric | Ambient-o | Baseline |
|---|---|---|
| FID | 2.53 | 2.73 |
| KD | 0.04 | 0.04 |
| CT | 27.81 | 29.07 |
| Precision | 93% | 92% |
| Recall | 90% | 89% |
As shown, our method improves the FID from 2.73 to 2.53 and it also achieves a reduction in memorization (lower CT score). We further achieve mild improvements in Precision and Recall.
We further want to clarify that our ImageNet models did not use extra data -- we just treat some of the existing data as noisy, and this leads to quality improvements. The improvement in Precision/Recall is mild since these are primarily diversity metrics, and we did not have access to extra data for this experiment.
We will report all these results in the Camera Ready version; we thank the Reviewer for the recommendation.
Reviewer: "[...] the method section predominantly focuses on data corruption rather than these real-world manifestations of bad data"
We believe the Reviewer has a critical misconception about our work here. We indeed perform experiments on real-world datasets on which real corruptions (such as poor aesthetics, low lighting, and other bad data forms) appear. In Figure 4 (right) we show examples of such “bad data” on ImageNet. Similarly, in Figures 24, 25, 26, 27 in the Appendix, we show examples of “bad data” that appear on the datasets we used for our text-to-image results: CC12M, SA1B, DiffDB, and JDB, respectively.
Our framework treats these real-world manifestations of bad data as noisy samples, and by doing so, it leads to quality and diversity improvements. For example, in the text-to-image setting, we manage to significantly outperform our micro-diffusion baseline in various benchmarks, such as DrawBench and Partiprompts (Figure 6), as well as in COCO FID and GenEval (Figure 7).
We only performed synthetic corruptions for the first part of our experimental validation to verify the effectiveness of each component of our approach, prior to moving to real corruption manifestations.
Reviewer: "However, in practical applications, data are mostly in a curated form"
Our paper questions this curation/filtering process. As we show in our text-to-image results section, filtering/curation leads to a greater than 13% drop in DINO Vendi Diversity. Qualitative results of this drop in diversity are further shown in Appendix Fig. 29. Further, such filtering pipelines are often driven by heuristics and arbitrary quality cutoffs. Instead of throwing away completely some samples, we propose using the “bad” samples only for certain noise times. This paradigm leads to a significant increase in diversity while still producing samples of high-quality (see lines 314-322 in the paper).
Filtering out datapoints entirely makes sense when we are bound by compute, not by data. That is, if we can only run K training updates and we have at least K high-quality samples, it makes sense to use only those.
Reviewer: "[...] the requirement to train multiple classifiers, the paper fails to clearly demonstrate how the proposed approach can be extended to broader application scenarios."
We want to highlight that, in its simplest form, our idea does not require the training of any classifiers. The proposal is the following: instead of filtering out the “bad data”, just use them for very high noise. Similarly, for out-of-distribution data, we propose using them only for “low-noise”. The amount of “high-noise” or “low-noise” can be a single hyperparameter. Already, this very simple idea achieves significant improvements – for the dog-cats experiment, FID improves from 12.08 to 9.85 and for our text-to-image results COCO FID improves from 12.73 (Stable Diffusion v1 level) to 10.61 (DALLE-2 level).
We provide further evidence for this by showing FID results without a classifier (fixed sigma, found by hyperparameter tuning) for blurry CIFAR-10 images:
| Setting | Blur 1 | Blur 0.8 |
|---|---|---|
| Only clean data | 8.79 | 8.79 |
| No annotations | 45.32 | 28.26 |
| Classifier | 6.16 | 6.00 |
| Fixed sigma | 6.95 | 6.66 |
As shown, "Fixed sigma" approaches the performance of the trained classifier.
The classifiers provide a more principled way of selecting a different noise level per sample, as they estimate the time at which the TV distance goes below a threshold. The most principled version of our framework requires the training of classifiers over two things: noise levels and crops. The training of the classifiers is computationally much cheaper compared to the training of the diffusion model. However, we acknowledge that this might add engineering complexity, and we encourage practitioners to start with the simplest version of our framework (fixed sigma) using hyperparameter tuning or domain expertise to select the allowed noise levels for each sample.
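For reference, the link between classifier performance and TV distance is the standard identity below, where p_t and q_t denote the target and corrupted distributions after adding noise at level t (a worked statement of the fact we rely on, not a new result):

```latex
% Optimal accuracy of any classifier separating p_t from q_t (balanced classes):
\mathrm{acc}^{\star}(t) \;=\; \tfrac{1}{2}\bigl(1 + \mathrm{TV}(p_t, q_t)\bigr)
\quad\Longleftrightarrow\quad
\mathrm{TV}(p_t, q_t) \;=\; 2\,\mathrm{acc}^{\star}(t) - 1 .
% A classifier operating near chance level at time t therefore certifies that
% TV(p_t, q_t) is small, i.e. the corrupted samples are safe to use at that time.
```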
Reviewer: "detail how to train a classifier for text-to-image tasks without class label."
The training of our classifier only requires the existence of a small set of “top-tier” samples. It is impossible for any method (not just ours) to work without it. The reason is that if the degradation model is unknown and there is no ground-truth sample access, the underlying distribution of pristine samples is not identifiable.
Reviewer: "Why is the FID improvement rather small or even worse? More data causes higher GPU resource consumption [...]."
We always train for the same number of training steps as the baseline, despite the fact that we have more data. Hence, our method is as computationally intensive as the baseline.
Regarding FID: for our text-to-image results, the FID improvement on COCO is massive (from 12.37 to 10.61). That’s the equivalent of going from a Stable Diffusion v1 model to a DALLE-2 model – and this is achieved only by changing the way we use the data (no extra compute + same hyperparameters).
Further, for ImageNet, we are exposed to the same number of samples as the baseline – we just use some of the available data as low-quality. Despite being exposed to the same data and being trained for the same GPU hours, Ambient-o-XXL+crops achieves state-of-the-art FID across all evaluation metrics and guidance settings.
We also want to highlight that the test images of ImageNet might also have imperfections – so if our model only produces pristine quality images, it might get penalized in FID. To account for this, we further report the mean CLIP IQA score for our Ambient-o-XXL+crops and the baseline XXL model (EDM2) below:
- Baseline XXL model average CLIP IQA: 0.69
- Ambient-o-XXL+crops average CLIP IQA: 0.71
The increase in the average CLIP IQA corroborates our findings that our model produces samples of higher quality.
We hope that the additional experiments and the clarifications addressed the remaining concerns. If so, we would highly appreciate it if the Reviewer considers raising their score to signify stronger support for our paper.
The rebuttal has addressed most of my concerns, though I remain confused about the classifier. Given the intriguing motivation, as well as the solid analysis and experiments presented, I will maintain my original score.
Dear Reviewer, Thank you for acknowledging our rebuttal, and we are glad that most of your concerns have been addressed.
We have put very significant efforts during the rebuttal period: we ran additional evaluation metrics, we provided results without the training of classifiers, and we thoroughly addressed your questions regarding computational efficiency and real-world manifestations of bad data. Hence, we would like to ask if there is anything else that we could do to strengthen your support of our work.
You mention in your comment that you remain confused regarding the classifier. What is the remaining confusion about? In your initial review, you write: "Specify the weakness: detail how to train a classifier for text-to-image tasks without a class label."
Our method assumes a small set of high-quality samples. Is that what you mean by labels? If this set is not given, we need to either a) form it (using human annotations or domain knowledge) or b) estimate it (e.g. by asking an SSL model, like CLIP, to find the top-quality samples). For our text-to-image experiments, we use a) as we put the label "low-quality" to all the samples that are coming from DiffDB (synthetic data). For our ImageNet experiments, we do b), as we use CLIP to find the high and low-quality set.
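As an illustration of option b), here is a rough sketch of a CLIP prompt-pair quality score (this approximates the idea behind CLIP-IQA; the prompts and model choice are illustrative, not our exact pipeline):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

@torch.no_grad()
def clip_quality_scores(images, device="cpu"):
    """Score each image by how much closer it is to a 'high quality' prompt
    than to a 'low quality' prompt; thresholding the scores yields the
    high- and low-quality subsets."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    prompts = ["a high quality, sharp photo", "a low quality, blurry photo"]
    inputs = proc(text=prompts, images=images, return_tensors="pt", padding=True).to(device)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)   # shape (N, 2)
    return probs[:, 0]   # probability mass on the "high quality" prompt
```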
We emphasize that having access to a small high-quality set is not a weakness of our method. For unknown degradation models (like the ones we deal with in the paper) it is information-theoretically impossible to recover the uncorrupted distribution without access to clean samples from it. That might be better understood with an example. If we are just given blurry samples, it is impossible to know whether the underlying distribution has blurry samples or the underlying distribution is a clean distribution and we just happened to observe blurry observations from it. Recovery guarantees for the underlying distribution without access to clean samples are only possible if the degradation model is known.
If the confusion is about something else, we would be happy to engage further with the Reviewer.
This paper proposes a novel framework for training diffusion models that can leverage low-quality, synthetic, and out-of-distribution data, which were previously often discarded as useless, to improve generation. The method can learn clean distributions from bad data without access to the forward operator. Focusing on the task of 2D image generation, the paper mainly proposes two principles for deciding which low-quality data can be used: (1) low-quality data at a very high noise level benefit the training since their distribution is close enough to that of the high-quality data; (2) low-quality data at a very small noise level benefit the training after cropping since some local regions within them provide as much information as high-quality data. A specific classifier is trained for each technique to determine the threshold of the noise level above or below which the low-quality data can be used. The paper provides both rigorous theoretical derivation and extensive experiments to show the soundness of (1) and (2). The paper also provides results for training an image generative model on the ImageNet dataset, demonstrating that their method benefits practical applications.
Strengths and Weaknesses
Strengths:
- The problem this paper focuses on is meaningful across broad fields of generative AI, as almost all datasets contain a large proportion of low-quality data that do not benefit training. Data filtering and curation is a topic that cannot be avoided when training large generative models, and it often takes great effort. Meanwhile, there is a trend that scaling up large generative models is becoming more and more difficult because all the high-quality data in the world is close to being exhausted. To this point, even though this paper doesn’t show a significant improvement in practical applications with their method, it is inspiring in terms of how to use data efficiently.
- The proposed principles are simple and the method is generalizable enough (it only relies on training specific classifiers). Though the paper focuses on the 2D image generation task, principle (1) mentioned in the summary can be easily generalized to any diffusion model. Principle (2) is a bit stricter but should also generalize to data distributions with the property of locality.
- The paper demonstrates the effectiveness of the proposed method well from different perspectives. For example, to prove the principle (1), the paper not only provides a clear and rigorous theoretical derivation to show that mixing low-quality data in high-quality dataset results in a lower upper-bound error, but also shows it with multiple small experiments on various types of corrupted data. Besides, extensive experiments are conducted to demonstrate the points that support the principle (2) (e.g. illuminating the relationship between noise level and context size with multiple plots from different experiments).
Weaknesses: I don’t see major weaknesses in this paper. The following are several minor questions:
- I didn’t catch how cropping is applied in detail when training a diffusion model. Take sample-dependent annotation as an example: from my understanding, the annotated noise level for each sample is different but is determined by the same crop classifier. Is the crop size fixed for training such a classifier (i.e., does the classifier decide the max noise level of each low-quality data sample below which the sample cannot be distinguished from high-quality data samples under a specific crop size)? If so, what crop size is used in the experiments, and what are the pros and cons of using large or small crop sizes? More importantly, how is the image cropped when training the classifier and the diffusion model? Is every crop of an image used with the outputs averaged, or are the crops randomly picked?
- A small typo: Figure 7 should be Table 7
Questions
The authors should clarify all the unclear points mentioned in the strengths and weaknesses part.
Limitations
Yes.
Final Justification
The rebuttal answered my questions about cropping. Given the overall quality of this paper and the authors' detailed rebuttal, I will maintain my support for this paper to be accepted.
Formatting Issues
N.A.
We thank the Reviewer for their time and very encouraging feedback! We are very pleased to see that the Reviewer appreciated the novelty of our approach, the breadth of our experimental evaluations, the theoretical proofs and the generality of the method. We most certainly agree with the Reviewer that as we are running out of Internet data, methods for efficient usage of data will become increasingly relevant. The same of course holds for scientific applications where we don't have a lot of data to start with.
Regarding the question about cropping: great question, and apologies for not making this clear in the paper. During training, the classifier takes as input a crop (which can be of any size) and tries to detect whether the crop came from the high-quality distribution or the low-quality distribution.
At inference time, we split an image into crops of a specific size, let’s call it C, and then we see if, on average, they confuse the classifier. The bigger the C, the harder it is to confuse the classifier. If there is no confusion for the initial C, we decrease it and we try again until we find a crop size for which the classifier is (on average) confused. We start with C=64 (the whole image), and work our way down. We underline that, in principle, we could do the annotation separately for each crop of the same image, effectively leading to different diffusion times per crop. That said, for the crop experiments we did in the paper, we used an implementation that averages across crops for simplicity.
We finally map from the crop size that confused the classifier to a diffusion time, using Figures 15/17. Higher noise levels require bigger receptive fields (bigger crops) for optimal denoising. So if an image only manages to confuse the classifier at a small crop, then it is only used for a small set of diffusion times.
We clarify that the diffusion model is never trained on crops of images. We always train using the entire image, even the out-of-distribution images, but for diffusion times small enough (less than t_max) such that for the required receptive field size at that noise level, the target and out-of-distribution data is indistinguishable.
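In pseudocode, the annotation procedure described above looks roughly as follows (function names, the confusion threshold, and the crop-size grid are illustrative):

```python
import torch

@torch.no_grad()
def annotate_t_max(crop_classifier, image, crop_to_time,
                   sizes=(64, 32, 16, 8, 4, 2, 1), tau=0.45):
    """Hypothetical sketch of the crop-based annotation.

    Starting from the full image (C = 64) and moving to smaller crops, find
    the largest crop size whose crops confuse the crop classifier on average,
    then map that crop size to a maximum diffusion time via the
    receptive-field correspondence (Figures 15/17 in the paper).
    `crop_classifier(crop)` is assumed to output the probability that the
    crop came from the target distribution; `crop_to_time` is the lookup.
    """
    c, h, w = image.shape
    for size in sizes:                                   # large -> small crops
        crops = image.unfold(1, size, size).unfold(2, size, size)
        crops = crops.permute(1, 2, 0, 3, 4).reshape(-1, c, size, size)
        p_target = torch.stack([crop_classifier(cr) for cr in crops]).mean()
        if p_target > tau:                               # confused on average
            return crop_to_time(size)
    return 0.0                                           # never confused: unusable
```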
We hope this clarifies things, and we will definitely make it clearer in the camera-ready version of our work.
Regarding the typo: We thank the Reviewer for careful reading! We will fix it in the camera-ready version.
If there is anything else we can do to further strengthen the Reviewer’s support for the paper, we would be happy to do so. Once again, we thank the Reviewer for their time and feedback.
Thank you for the detailed clarification! My questions about cropping have been addressed. Given the overall quality of this paper and the authors' detailed rebuttal, I will maintain my support for this paper to be accepted.
Meta-Review
A common practice in training large-scale foundational generative models is to rely on high-quality data, typically by filtering out low-quality or "bad" samples early in the pipeline. In image generation, such samples are often noisy or corrupted in some way.
This paper proposes two techniques for handling noisy data, depending on the degree of corruption:
(i) leveraging the diffusion process itself, since it already relies on a noise-to-denoising trajectory to benefit from highly noisy images during high-noise steps of diffusion training;
(ii) using an external classifier to crop and retain high-quality regions from partially noisy images, allowing the model to still benefit from otherwise discarded data.
Weaknesses:
Some concerns were raised regarding the use of a classifier in the cropping-based technique. I agree that this second technique is less novel; leveraging classifiers for data filtering is relatively common, and it still involves discarding parts of the data. Additionally, it depends on an external model. I strongly recommend that the authors include the relevant rebuttal discussions in the final version, clarifying how the classifier is trained and including ablation studies that isolate the performance of the first technique alone, as suggested by the reviewers.
Recommendation: Accept (highlight)
I find the approach, especially the use of the diffusion process itself to handle noisy examples, both interesting and significant. As high-quality data becomes increasingly scarce, methods that employ imperfect data are of high relevance to the community.
Discussion Summary:
- gDjV raised questions about cropping and classifier usage. The authors clarified the procedure, explained that training always uses full images, and added ablations without classifiers. Concerns were resolved, and support remained strong.
- JDMp questioned diversity metrics, practical “bad data” handling, classifier efficiency, and small FID gains. The authors added precision/recall, CT, and KD results, clarified real-world corruptions, and compared fixed-sigma vs. classifier annotations. Most points were addressed, though some skepticism about the classifier persisted.
- qqMU noted missing ablations and reliance on classifiers vs. fixed sigma, and asked about corrupted-only data. The authors provided extensive ablations (thresholds, clean-data proportion, Gaussian noise, and truncated sampling) and demonstrated that corrupted-only methods cannot recover clean distributions. The rating increased after the rebuttal.
- whhj raised concerns about classifier sensitivity, forming clean subsets, low-frequency corruptions, and fine-detail preservation. The authors presented sensitivity studies, practical routes for forming clean subsets, preprocessing experiments, and patch-level FID for fine details. Most issues were resolved, and the score remained positive.