Distribution-Aware Data Expansion with Diffusion Models
Summary
Reviews and Discussion
This paper proposes a training-free data augmentation method based on diffusion models. To alleviate the poison phenomenon of the diffusion model, i.e., that the distribution of generated images deviates from the natural distribution, the paper proposes a simple yet effective prototype-based method that introduces additional constraints to mitigate the deviation. The experimental results show that the proposed method achieves SOTA performance compared to the baselines in this paper.
Strengths
- Leveraging prototypes as a constraint to guide the generation process is novel. Directly using the diffusion model to enrich the dataset causes the poison phenomenon, which harms the classifier's performance since the distribution of generated images deviates from the natural image distribution. In this context, the paper's cluster-based approach, i.e., the prototype technique, is sound and interesting.
- This paper is easy to follow.
- The proposed method achieves SOTA performance compared with various baselines.
Weaknesses
- The baselines lack the latest works, such as Brandon et al. [1] and Khawar et al. [2]. For Brandon et al., although the method requires fine-tuning the pre-trained diffusion models, the time cost of fine-tuning is very low and close to the time cost reported in this paper. Given the comparable computational cost, it should be included as a baseline.
[1] Effective Data Augmentation With Diffusion Models. Brandon et al.. ICLR 2024.
[2] DIFFUSEMIX: Label-Preserving Data Augmentation with Diffusion Models. Khawar et al. Arxiv:2405.14881.
- The datasets used in this paper may not be enough. Since this paper does not focus only on small datasets, ImageNet should be considered as a dataset to test the performance of the proposed method, similar to DIFFUSEMIX [2]. ImageNette is insufficient as a replacement for ImageNet since it contains only 10 classes.
Questions
- How does the proposed method keep such a low time cost? The motivation for this question is the authors' claim: "Stable Diffusion generates per sample in 12.65 seconds on average, while our DistDiff achieves the same in 13.13 seconds." The weakness of guidance methods (i.e., the energy-function guidance used in this paper) is that they dramatically increase the time cost, especially for a Stable Diffusion model, because a guidance method needs to calculate the gradient of the diffusion model. Concretely, please see Eq. 6. Eq. 6 needs first to calculate . By Eq. 5 and Algorithm 1, . In this condition, needs to calculate , and is the pre-trained diffusion model. Therefore, the overall process needs to calculate the gradient of the diffusion model at least 50 times, while the time cost increases by less than 1 second, which is hard to understand.
- Could the authors offer an ablation study for K=1, i.e., the single-group situation? The motivation for this question is that Table 6 shows that increasing K has a limited influence on performance, which is confusing. In theory, following the storyline of this paper, more groups should increase the diversity of the generated images, which should have a positive influence. Based on Table 6, it seems that the group strategy is redundant.
Limitations
No additional limitations, including societal impact, need to be discussed. All my concerns are listed in the Weaknesses and Questions. If the authors can clarify these concerns, I am willing to increase my score.
We sincerely appreciate the time and effort you have dedicated to reviewing our paper. We value your feedback on our method, noting that the prototype technology is sound and interesting and that it is easy to follow. We also thank you for recognizing that our method achieves SOTA performance compared to various baselines.
Re Weakness #1:
We apologize for missing this comparison. We have now included the baseline results in Table 1 and will update the revised paper accordingly. Our method outperforms both baselines. DA-Fusion modifies images while respecting semantic attributes but lacks fine-grained control and incurs additional time costs (around 14 minutes per class on a 3090 GPU). Our training-free method uses hierarchical guidance for better alignment with the real distribution. DiffuseMix, which uses bespoke prompts and fractal images for augmentation, treats all datasets equally and may not handle distribution shifts well. Our method shows superior performance compared to these approaches.
| Method | Accuracy |
|---|---|
| DA-Fusion | 79.14 |
| DiffuseMix | 82.33 |
| Ours | 83.09 |
Table 1: Comparison of accuracy in expanding Caltech-101 by .
Re Weakness #2:
We follow previous work, such as GIF-SD and DA-Fusion, in conducting experiments on small datasets, as data augmentation strategies are often necessary in data-scarcity scenarios. Taking your advice into account, we have included results on ImageNet with expanding ratios, resulting in approximately 256K generated samples. As shown in Table 2, our method demonstrates improvements on large-scale datasets.
| Method | Accuracy |
|---|---|
| Original | 69.30 |
| Ours | 69.95 |
Table 2: Comparison of accuracy in expanding ImageNet by . We trained ResNet18 with a resolution for 90 epochs.
Re Question #1:
You might have some misunderstanding regarding our method's principles. The energy function guidance used in our paper does not dramatically increase the time cost. As illustrated in Lines 310-315 (More Optimization Steps) of the manuscript, our method only introduces two additional optimization steps, which means calculating the gradient 2 times rather than 50 times. Therefore, our method does not dramatically increase the time cost as you suggested.
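For intuition, restricting the costly energy-gradient computation to a few selected denoising steps (rather than every step) can be sketched as follows. This is illustrative Python, not the paper's exact algorithm; `denoise_fn` and `energy_grad` are hypothetical callables standing in for the diffusion model's denoising update and the gradient of the energy score:

```python
import numpy as np

def guided_denoise(z, steps, optimize_at, denoise_fn, energy_grad, lr=0.1):
    """Sketch of energy-guided sampling where the expensive energy gradient
    is applied only at a few selected timesteps rather than at every step.

    denoise_fn(z, t) -> next latent (one denoising update)
    energy_grad(z)   -> gradient of the energy score w.r.t. z
    optimize_at      -> set of timesteps at which guidance is applied
    """
    for t in reversed(range(steps)):
        if t in optimize_at:  # e.g. only 2 of 50 steps incur a backward pass
            z = z - lr * energy_grad(z)
        z = denoise_fn(z, t)
    return z
```

Because the backward pass through the diffusion model runs only at the timesteps in `optimize_at`, the overhead stays a small fraction of the total sampling time.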
Re Question #2:
When , the group-level prototype becomes the class-level prototype, leading to results similar to those with class-level prototypes. In Table 6 of our manuscript, performance changes are not apparent, as the differences between and are minimal in the small class variance dataset Caltech-101. Our method shows more benefits with fine-grained datasets with greater class variance, as illustrated in Table 3 with StanfordCars. Additionally, having more groups is not always beneficial; an excessive number of groups may cause prototypes to fit noise points or outliers, which could degrade performance.
| Method | Accuracy |
|---|---|
| Stable Diffusion Baseline | 88.45 |
| | 89.55 |
| | 89.69 |
| | 90.36 |
| | 90.69 |
| | 90.62 |
Table 3: Prototype comparison of accuracy in expanding StanfordCars by . We trained ResNet50 with a 448 × 448 resolution for 128 epochs.
Thanks for the authors' rebuttal. I have carefully read all the contents. The new experimental results show the improvement of the proposed method. Although the improvement on ImageNet may be modest, I think it is due to the expansion, and the time limitation of the rebuttal period does not allow trying a (5x) expansion. Most of my concerns have been addressed. Thus, I will consider increasing my score to borderline accept. Additionally, about the group (), based on the rebuttal, it seems there is a trade-off, since cannot be small and be large. Meanwhile, will be influenced by the dataset. A natural question arises: is there any way to choose in an adaptive way?
Thank you for recognizing that "most of the concerns have been addressed" and for considering an increase in the score. We truly appreciate your positive feedback!
- Regarding the ImageNet experiment, we further applied the Stable Diffusion (SD) baseline to expand ImageNet by and conducted experiments. The baseline method achieved accuracy. Our method surpasses the SD baseline by and the accuracy of training on the original dataset by . This confirms that our method has an advantage in data expansion compared to the original SD method. In addition, training accuracy typically increases with a larger expansion ratio, and the performance gap between our method and existing methods tends to grow. This phenomenon has been validated across multiple datasets, as shown in Figure 4 of the manuscript. We are adding more experiments with ImageNet expansion ratios and will include these in the final version.
- Thank you also for your suggestion to choose adaptively. We have explored this idea using classical adaptive clustering strategies. However, this introduces another parameter to tune, such as the neighborhood radius or cluster distance, which can be more challenging to adjust than . Additionally, this poses challenges for parallel computation since may vary within each batch. Nevertheless, this is a valuable idea, and we will consider your suggestion for further exploration and optimization of adaptive selection methods.
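For reference, one classical adaptive strategy is to pick the per-class number of clusters by silhouette score. The sketch below is illustrative only, not the strategy evaluated in this rebuttal; `k_max` is an assumed search bound:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def choose_k(feats, k_max=6):
    """Pick the number of groups for one class's features by silhouette
    score: try k = 2 .. k_max and keep the best-scoring clustering."""
    best_k, best_s = 2, -1.0
    for k in range(2, min(k_max, len(feats) - 1) + 1):
        assign = AgglomerativeClustering(n_clusters=k).fit_predict(feats)
        s = silhouette_score(feats, assign)
        if s > best_s:
            best_k, best_s = k, s
    return best_k
```

Note that this simply trades the original hyperparameter for another knob (the search range) plus extra clustering passes per class, which matches the tuning and parallelization difficulties discussed above.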
Thank you once again for your valuable feedback. If you have any further questions regarding our rebuttal, we would be happy to provide additional clarification.
Thanks for the authors' further clarification regarding and my concerns about the modest improvement on ImageNet. Considering the novelty of this paper, I think the strengths outweigh the weaknesses ( should be chosen based on the dataset case by case), since there are enough ablation studies for to improve the reproducibility of this paper. Therefore, I have increased my score to borderline accept.
Thank you for your recognition and for raising the score. We really appreciate your positive feedback! We will consider your suggestion to further explore adaptive selection methods.
The authors present DistDiff, a training-free data expansion framework based on a distribution-aware diffusion model. DistDiff constructs hierarchical prototypes to approximate the real data distribution, optimizing latent data points within diffusion models through hierarchical energy guidance. The framework demonstrates its capability to generate distribution-consistent samples, significantly improving data expansion tasks. DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data.
Strengths
- The authors claim high efficiency for DistDiff.
- The framework effectively generates samples that align with the original distribution, markedly enhancing data augmentation tasks.
- DistDiff consistently improves accuracy across a wide variety of datasets when compared to models trained only on the original data.
Weaknesses
- The manuscript introduces the concept of hierarchical prototypes but falls short of sufficiently explaining how these prototypes are selected or generated from the dataset. A more detailed description or examples of the prototype generation process would greatly enhance the reader's understanding and bolster the credibility of the proposed method.
- The authors claim high efficiency for DistDiff but do not provide supporting data. It is recommended to add a comparison table detailing DistDiff's computational time, resource usage, and scalability against other methods to substantiate its efficiency claims.
- The paper claims that "these two scores reinforce each other and are indispensable mutually," yet in Table 4 the observed impact appears minimal without P_g. Additionally, the k value for P_g seems to have little effect on the results. Further explanation and clarification are needed to substantiate these claims.
Questions
- In Figure 6, it is unclear why the optimal number of prototypes is set at K=3, as the difference in the figure is not readily apparent and the quantitative metrics only differ by 0.01. Could this be further elaborated to justify the selection of K=3 as the optimal prototype count?
- It is suggested that the captions be further elaborated to explain the entire pipeline process in detail. Currently, the process terms used in the caption do not correspond directly to those labeled in the figure, leading to potential confusion. A clearer alignment between the text and graphical elements would improve comprehension and the overall effectiveness of the figure.
- It is recommended to revise Figure 1. While it depicts the complex process of DistDiff, it does not provide clear benefits. Including performance comparisons or showcasing significant differences in generated results would enhance the figure's utility and informative value.
Limitations
As stated above.
We sincerely appreciate the time and effort you invested in reviewing our paper. We address the raised concerns as follows:
Re Weakness #1:
Thank you for your valuable feedback. We appreciate your suggestion to provide a clearer explanation of the hierarchical prototypes. We recognize that the manuscript would benefit from a more detailed description of the prototype generation process. In Lines 135-136 and Lines 139-141, we describe how prototypes are selected or generated from the dataset. Specifically, our method first extracts feature vectors using a pre-trained image feature extractor. Class-level prototypes are then obtained by averaging feature vectors within each class. To refine these prototypes further, we apply the agglomerative hierarchical clustering algorithm [1] to group samples from the same class into clusters. The group prototypes are computed by averaging feature vectors within each cluster.
In the final version of the manuscript, we will include a more comprehensive explanation of the prototype generation process and provide illustrative examples to clarify this methodology further.
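As a concrete illustration, the two-level prototype construction described above can be sketched as follows. This is a minimal sketch assuming features have already been extracted by a pre-trained encoder; function and variable names are ours, not the paper's:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def build_prototypes(features, labels, k):
    """Build class-level and group-level prototypes.

    features: (N, D) array of image embeddings from a pre-trained encoder
    labels:   (N,) integer class labels
    k:        number of groups per class
    Returns dicts mapping class -> (D,) class prototype and
    class -> (k, D) group prototypes.
    """
    class_protos, group_protos = {}, {}
    for c in np.unique(labels):
        feats = features[labels == c]
        # Class-level prototype: mean of all feature vectors in the class.
        class_protos[c] = feats.mean(axis=0)
        # Group-level prototypes: agglomerative clustering within the class,
        # then a mean feature vector per cluster.
        n_clusters = min(k, len(feats))
        assign = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(feats)
        group_protos[c] = np.stack(
            [feats[assign == g].mean(axis=0) for g in range(n_clusters)]
        )
    return class_protos, group_protos
```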
Re Weakness #2:
The high efficiency of DistDiff stems from its direct guidance on the pre-trained model without the need to retrain the diffusion models. We provide computational time in Lines 334-339 and resource usage in Line 617. Compared to the stable diffusion baseline model, we only introduce an extra 0.48 seconds of inference time per sample, which is only 4% of the original inference time.
We included an efficiency analysis in Table 1. Our method incurs minimal additional time costs and generates samples directly, unlike LECF, which needs extra post-processing to filter low-confidence samples. This makes LECF slower in generating the same number of samples. We are incorporating this cost analysis into the paper.
| Method | Inference Time (s) |
|---|---|
| SD | 12.65 |
| LECF | 12.73 (17.20 *) |
| GID-SD | 13.47 |
| Ours | 13.13 |
Table 1: Inference efficiency comparison with existing methods on the Caltech-101 dataset. * denotes the actual time required by LECF to derive one sample after filtering post-processing. Evaluation is conducted on a single GeForce RTX 3090 GPU.
Re Weakness #3 and Question #1:
As shown in Table 4, both scores significantly improve performance, and further improvements can be achieved by combining them. However, we did not provide a clear analysis of their relationship, and the claim that these two scores are "indispensable mutually" lacks sufficient evidence and may not be appropriate. This is primarily due to the small class variance in the Caltech-101 dataset, where the differences between and are minimal. In contrast, when we applied hierarchical prototypes to a dataset with larger class variance, such as StanfordCars, the differences between hierarchical prototypes were more pronounced. This led to a more significant combined improvement, as shown in Table 2.
Additionally, we selected as the optimal prototype since it shows the best quantitative performance, although changes in the value are not apparent in Table 6 of the manuscript. The influence of is more pronounced for the fine-grained StanfordCars dataset, highlighting its importance, as shown in Table 2.
| Method | Accuracy |
|---|---|
| Stable Diffusion Baseline | 88.45 |
| | 89.55 |
| | 89.69 |
| | 90.36 |
| | 90.69 |
| | 90.62 |
Table 2: Prototype comparison of accuracy in expanding StanfordCars. We trained ResNet50 with a 448 × 448 resolution for 128 epochs.
Re Question #2:
Thank you for the valuable suggestion. We will revise the captions to provide a more detailed explanation of the entire pipeline process and ensure that the terms used in the captions align clearly with those labeled in the figure. This will help improve comprehension and the overall effectiveness of the figure.
Re Question #3:
We appreciate your insightful recommendation. We will revise Figure 1 to include performance comparisons and highlight significant differences in the generated results. This will enhance the figure's utility and informative value, making the complex process of DistDiff clearer and more impactful.
[1] Hierarchical clustering. Introduction to HPC with MPI for Data Science 2016.
Dear Reviewer 4Bhc,
We appreciate your thoughtful evaluation and the opportunity to clarify and expand upon key aspects of our work. Based on the detailed responses and additional data provided, which directly address the concerns raised:
- We have provided a detailed explanation of the generation process for our hierarchical prototypes.
- We have included a comprehensive analysis of time complexity compared to existing methods, substantiating our claims of high efficiency.
- We have supported the effectiveness of our hierarchical prototypes with additional ablation results on downstream tasks, confirming the efficacy of these prototypes.
We hope that our responses have clarified and addressed your questions satisfactorily. We will carefully revise the manuscript in accordance with the suggestions from all reviewers. If our explanations have resolved your concerns, we would be grateful if you could reconsider your rating. We are eager to make any further improvements necessary to meet the conference's standards.
Given the comprehensive nature of our response, we kindly request that you review our rebuttal and provide your feedback. If you have any further questions regarding our responses, we would be happy to provide additional clarification.
Thank you for your time and consideration.
Dear Reviewer 4Bhc,
Thank you for your careful review of our paper. With approximately 20 hours remaining in the discussion phase, we sincerely hope our rebuttal has addressed your concerns. If our responses have clarified the issues you raised, we kindly request that you consider raising your score.
We greatly value your feedback and have tried to provide thorough responses to each point. If you have any unresolved questions or need further clarification, please don't hesitate to let us know. We will do our utmost to provide additional information within the remaining time.
Once again, thank you for your valuable time and expert opinion. Your feedback is crucial in improving the quality of our research.
Best regards
This paper focuses on data augmentation or expansion by generating synthetic data from pre-trained large-scale diffusion models. To ground the samples from these large-scale diffusion models, the paper proposes an energy-based guidance approach where the energy function depends on hierarchical prototypes. In the paper, hierarchical prototypes are essentially feature vectors that define the object classes. The hierarchical prototypes are obtained as follows: first, the features for each class are aggregated to obtain a class-level representation; second, the features of each class are clustered into K clusters to obtain sub-group representations within each class.
The paper shows clear improvement for classification tasks on many standard datasets, when classifiers are trained from scratch on the augmented dataset.
优点
The paper addresses a very important problem: how to perform effective data augmentation using synthetic data generation models. The approach does not require any training or fine-tuning of the diffusion models to adapt the generation to the required data distribution. The paper has detailed ablation studies for each design choice. Also, the experimental results suggest considerable improvement over prior data expansion approaches.
缺点
- There seems to be some confusion in the explanation. Section 3.3 (Transform Data Points) seems to suggest that the approach always starts with a sample from the dataset. However, the algorithm in the appendix does not include any sample from the dataset.
- Is the approach extendable to other supervised learning tasks like segmentation, detection, etc.? It looks to me like the augmentation approach is tailored to classification tasks only, given the usage of class-specific hierarchical prototypes, whereas traditional augmentation approaches like random cropping, rotation, etc., are generic.
- One of the main contributions of the paper is the residual multiplicative transformation. The paper does not give a clear answer to "why should I not adjust the latent directly to optimize for the energy function?" The paper shows empirical explanations in the ablation, but there is a lack of concrete reasoning as to why this approach works.
问题
Can the authors please address questions 2 and 3?
Additionally, how is the distribution across classes chosen for the synthetic data generation process? Are all classes equally sampled? I couldn't find this information in the paper.
局限性
Yes, the authors have addressed the limitations.
We would like to express our sincere gratitude for the detailed and professional attention you have given to our work during the review process. We greatly appreciate your recognition of the very important problem our work addresses, as well as your acknowledgment of our detailed ablation studies and the considerable improvements made.
Below are our detailed responses to the weaknesses and questions:
Re: Weakness #1
In Section 3.3, we introduce our method starting with a sample . However, in the algorithm presented in the appendix, we begin with the latent point , which omits the image encoding process from to for simplicity. Thank you for pointing this out. We recognize that this may cause confusion and will revise it in the final version.
Re: Weakness #2
Our method effectively generates classification data and has the potential to be extended to detection and segmentation tasks. This extension requires incorporating more advanced foundational models, such as ControlNet [1], which can use layout maps or segmentation masks as conditions to control the spatial positioning of targets. An intuitive approach to extend our method might involve cropping features from relevant target regions, averaging them to obtain feature vectors, and then constructing prototypes to guide the generation process.
Re: Weakness #3
The residual multiplicative transformation applies channel-level transformation to the original latent point. Compared to directly optimizing the latent point, this transformation constrains the optimization space, preventing out-of-control guidance and making the optimization process easier.
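For intuition, a channel-level residual transform of this kind can be sketched as follows. This is an illustrative parameterization (the paper's exact form may differ); the key property is that with zero parameters the transform is the identity, so optimization starts at the original latent and can only perturb it per channel, a much smaller search space than optimizing every element of the latent directly:

```python
import numpy as np

def residual_multiplicative_transform(z, gamma, beta):
    """Channel-wise residual transform of a latent z of shape (C, H, W).

    gamma, beta: per-channel parameters of shape (C,).
    With gamma = beta = 0 the transform reduces to the identity, so the
    guided optimization only adjusts 2*C scalars instead of C*H*W values.
    (Illustrative parameterization, not necessarily the paper's exact one.)
    """
    return (1.0 + gamma)[:, None, None] * z + beta[:, None, None]
```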
Re: Question #1
We apologize for missing this detail. We follow previous data expansion methods by expanding each sample by certain ratios. The sampling category distribution is consistent with the original dataset's category distribution. We will add a description in the final version.
[1] Adding Conditional Control to Text-to-Image Diffusion Models, ICCV2023.
I agree with the author's rebuttal. Though approaches like ControlNet can be used for generating synthetic data augmentations, the energy function used for guidance, built using hierarchical prototypes, looks very much tailored for classification tasks alone. It is very unclear how this approach can be extended to tasks like detection where prototypes have to be constructed not only for object classes but also for object locations, which is not so straightforward.
Dear Reviewer ghiL,
Thank you for your thorough review and for agreeing with our rebuttal. We appreciate your responsible evaluation.
For the extension to detection and segmentation tasks, we need to guide latent points at the instance level. Based on the segmentation mask conditioned ControlNet model, a potential guiding design is as follows:
(a) Deriving Hierarchical Prototypes:
Given a sample with instances and its annotation mask { }, we first derive each instance image by suppressing its background pixels to zero and taking the minimum bounding box of the foreground region. Then, all instance images within a class are resized and fed to the pre-trained image encoder to derive their feature embeddings. We construct hierarchical prototypes for each class using a feature clustering strategy as mentioned in Section 3.2.
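The instance-derivation step above can be sketched as a small helper (illustrative only, not the authors' code; `image` is assumed to be a (C, H, W) array and `mask` a binary (H, W) array):

```python
import numpy as np

def crop_instance(image, mask):
    """Derive an instance image from a mask: suppress background pixels to
    zero, then take the minimal bounding box of the foreground region."""
    fg = image * mask[None]  # zero out background across all channels
    ys, xs = np.nonzero(mask)
    return fg[:, ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```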
(b) Guiding in the Denoising Process:
During the denoising process, we apply instance-level energy guidance as follows: First, we transform each instance in the latent point with a residual multiplicative transformation similar to Section 3.3. We then predict a clean data point at -step and derive instances by suppressing the corresponding background pixels in using predefined mask conditions. Finally, we calculate the energy score and apply our energy guidance based on these predicted instances and their corresponding hierarchical prototypes, similar to Equation 6 in our manuscript.
Additionally, considering that guiding instances during denoising may introduce extra computational load, we further propose a more efficient strategy: Guide Multiple Instances Once-for-All. Unlike the previous design, this method directly inputs the predicted into the image encoder and performs energy guidance at the final layer of the image encoder's feature map, thus forwarding the image encoder only once. Compared to the first design, this approach is more efficient but may result in instance feature disturbances due to the convolutional nature of the encoder.
These designs theoretically extend the application of energy guidance to detection and segmentation tasks. We are conducting additional experiments to evaluate this extension for detection and segmentation data augmentation tasks, which will be included in future versions of our work.
Given these clarifications and the positive aspects of our work that you've previously acknowledged, we kindly ask if you would reconsider your evaluation. We believe our research makes a valuable contribution to the field and addresses important challenges in classification data augmentation.
With approximately 18 hours remaining in the discussion phase, we sincerely hope our rebuttal has addressed your concerns. If our responses have clarified the issues you raised, we kindly request that you consider raising your score.
We greatly value your feedback and have tried to provide thorough responses to each point. If you have any unresolved questions or need further clarification, please don't hesitate to let us know. We will do our utmost to provide additional information within the remaining time.
Once again, thank you for your valuable time and expert opinion. Your feedback is crucial in improving the quality of our research.
Best regards
I thank the authors for clarifying this. There was a small misconception from my side with respect to the previous explanation regarding the extension of the proposed approach to other CV tasks. I am satisfied with the response. I will increase the score to 7.
We greatly appreciate your valuable feedback and the improved scores. Your recognition is highly encouraging.
Best Regards!
Dear Reviewer ghiL,
Thank you again for your valuable comments. We have tried our best to address your questions (see rebuttal above), and will carefully revise the manuscript by following suggestions from all reviewers. Please kindly let us know if you have any follow-up questions.
Your insights are crucial for enhancing the quality of our paper, and we would greatly appreciate your response to the issues we have discussed.
Thank you for your time and consideration.
This work is focused on efficiently utilizing diffusion models for data augmentation, aiming to be useful primarily for visual classification tasks. The strength of the work lies in its ability to utilize pre-trained diffusion models, the minimal additional latency required at runtime, and the novel formulation of the algorithm. I recommend acceptance based on the contribution as well as the discussions that took place during the rebuttal window.