Will the Inclusion of Generated Data Amplify Bias Across Generations in Future Image Classification Models?
Abstract
Reviews and Discussion
Great: The authors investigated an important question on a selected dataset. The future work can be extended to other hierarchical sub-groups and more datasets. The benchmark results, though limited, are clear and convincing.
Missing: More extensive experiments and analysis would make it easier to accept the claim. In my opinion, citations to the original works are missing. The final section could be elaborated to cover the important findings that support the main problem.
Without these additions: a strong acceptance for a poster, a weak acceptance for the main track.
Explanation:
- The experiments: If possible, I would like to see the results from other standard datasets or more subgroups for the dataset used.
- Citations: Some original works in the introduction section and later are missing citations. If the authors think they have cited all the necessary work, please put that in a rebuttal.
- Conclusion and future work: I think it can be improved or extended to connect with the problem and story.
Strengths
The authors investigated an important question on a selected dataset. The future work can be extended to other hierarchical sub-groups and more datasets. The benchmark results, though limited, are clear and convincing.
Weaknesses
More extensive experiments and analysis would make it easier to accept the claim. In my opinion, citations to the original works are missing. The final section could be elaborated to cover the important findings that support the main problem.
Questions
See the Weaknesses section for what to improve.
Questions:
[1] The experiments: If possible, I would like to see the results from other standard datasets or more subgroups for the dataset used.
A: As you suggested, we conducted experiments on the Living-17 dataset, which consists of 17 superclasses, each containing 4 subclasses. We utilized the Stable Diffusion 1.5 model as the base generative framework to study the inheritance phenomenon of bias across generations. The results are presented in Appendix A.
[2] Citations: Some original works in the introduction section and later are missing citations. If the authors think they have cited all the necessary work, please put that in a rebuttal.
A: Thanks for your suggestion. In our revision, we have cited more work on using generated data to help model training. We would appreciate it if you could suggest additional works for further discussion.
[3] Conclusion and future work: I think it can be improved or extended to connect with the problem and story.
A: Thank you for your suggestion. Indeed, there are several models that involve self-consumption loops, such as Stable Diffusion, LLaMA, LLaVA, and Nemotron. Notably, Nemotron is trained on a dataset consisting of more than 98% synthetic data. While the inclusion of synthetic data in the training process has shown some benefits, it is important to consider the potential harms, especially concerning model biases. This concern has prompted us to investigate the impact of generated data on model performance and bias, particularly as the number of self-consumption loops increases.
We plan to expand on this in our future work by further exploring how the inclusion of synthetic data in iterative training cycles influences not only model performance but also the ethical implications of model biases. This line of inquiry could provide valuable insights for future AI model development.
The paper attempts to analyze whether the inclusion of generated data alleviates the model bias. In the paper, the authors repeatedly train generative models on the generated augmented data and study its impact on multiple fairness metrics.
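To make the studied pipeline concrete, here is a minimal sketch of a self-consuming loop of the kind described above. The callables and the defaults (five generations, a 10% augmentation ratio) are illustrative placeholders, not the authors' released code.

```python
def self_consuming_loop(clean_data, train_generator, filter_samples,
                        train_classifier, evaluate_fairness,
                        n_generations=5, p=0.10):
    """Generate -> filter -> augment -> retrain, tracking bias each generation.

    Placeholder callables (assumptions, injected so the sketch stays model-agnostic):
      train_generator(data)        -> generator with a .sample(n) method (GAN or diffusion)
      filter_samples(images)       -> quality-filtered images (e.g. CLIP-score filtering)
      train_classifier(data)       -> downstream classifier (e.g. a ResNet-50)
      evaluate_fairness(clf, data) -> dict of accuracy and bias metrics
    """
    train_pool = list(clean_data)
    history = []
    for _ in range(n_generations):
        generator = train_generator(train_pool)
        synthetic = filter_samples(generator.sample(int(p * len(clean_data))))
        train_pool = list(clean_data) + list(synthetic)   # augmented training set
        classifier = train_classifier(train_pool)
        history.append(evaluate_fairness(classifier, clean_data))
    return history
```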
Strengths
The paper is studying a new problem. In the era of generative AI, more generated content is available on the Internet. Training on the generated data may influence model performance. In this sense, the paper is studying a valid and important problem.
Weaknesses
The paper mostly conducts experiments on three datasets. Among the three datasets, Colourized-MNIST and CIFAR20/100 are very small datasets in terms of the resolutions and the number of data and classes compared to the existing image data. Moreover, the largest model used in the paper is ResNet50, which is a relatively small network when compared to SOTA models like ViT. This raises the concern of whether the observations from the experiments are still valid in realistic scenarios with large datasets and models.
Some experimental results are hard to understand.
- baseline performance. To my understanding, CIFAR20/100 has a smaller number of classes and should be an easier dataset to classify than CIFAR100. However, the baseline performance of ResNet50 is around 80% on CIFAR100 but around 50% on CIFAR20/100.
- The trends are inconsistent between models. For example, in Fig. 6(a), ResNet50 and LeNet show an opposite trend in Equal Opportunity and Disparate Impact but the same trend in average accuracy and Maximal Disparity; VGG-19 remains unchanged for the tested metrics.
- The trends are inconsistent between datasets. For example, comparing the ResNet50 baseline between Fig 5(a) on CIFAR-20/100 and 6(a) on CIFAR-100. Even though the datasets are similar, the ResNet50 baseline shows a totally different trend for the Equal Opportunity, Disparate Impact, and Maximal Disparity metrics.
- Similar discrepancies are noted in other sections of the paper. The experimental findings do not appear to explain how the generated data affects model bias in general.
The conclusion is weak. Based on the observations, the authors conjectured that the model bias is affected by multiple factors, including the datasets, models, and data quality across generations. However, the authors did not provide clear experiment evidence or solid theory explaining how these factors influence the model bias.
Questions
The authors reported the FID score for the augmented data from multiple generations in Table 1. It seems that the FID scores for unbiased colorized MNIST show a decreasing trend; biased colorized MNIST is more or less the same; CIFAR-20/100 decreases first and then increases; Hard ImageNet shows a sudden increase from 50.9 to 186.4. Can the author explain the inconsistent changes here? Are there any implications from the observations? Also, it would be better if the author could visualize the generated images from different generations to visually see the changes across generations.
The authors mentioned, “We manually partition the original dataset into multiple subgroups, where subgroups within the same class share similar semantics.” Can the authors explain more clearly how they defined and constructed the subgroups for each dataset?
Weaknesses:
[1] The paper mostly conducts experiments on three datasets. Among the three datasets, Colourized-MNIST and CIFAR20/100 are very small datasets in terms of the resolutions and the number of data and classes compared to the existing image data. Moreover, the largest model used in the paper is ResNet50, which is a relatively small network when compared to SOTA models like ViT. This raises the concern of whether the observations from the experiments are still valid in realistic scenarios with large datasets and models.
A: On small datasets, CNN-based models are more suitable as ViT-based models tend to suffer from severe overfitting issues without large-scale datasets. For this reason, our primary analysis focuses on CNN-based models.
As you suggested, we conducted additional experiments on the Breeds dataset, which is organized based on subgroup connections identified by WordNet. Additionally, we evaluated the performance of ViT models. The results are presented in Appendix A.
[2] Some experimental results are hard to understand.
- baseline performance. To my understanding, CIFAR20/100 has a smaller number of classes and should be an easier dataset to classify than CIFAR100. However, the baseline performance of ResNet50 is around 80% on CIFAR100 but around 50% on CIFAR20/100.
- The trends are inconsistent between models. For example, in Fig. 6(a), ResNet50 and LeNet show an opposite trend in Equal Opportunity and Disparate Impact but the same trend in average accuracy and Maximal Disparity; VGG-19 remains unchanged for the tested metrics.
- The trends are inconsistent between datasets. For example, comparing the ResNet50 baseline between Fig 5(a) on CIFAR-20/100 and 6(a) on CIFAR-100. Even though the datasets are similar, the ResNet50 baseline shows a totally different trend for the Equal Opportunity, Disparate Impact, and Maximal Disparity metrics.
- Similar discrepancies are noted in other sections of the paper. The experimental findings do not appear to explain how the generated data affects model bias in general.
A:
- Baseline performance: The CIFAR20/100 dataset consists of 20 superclasses, each containing 5 subclasses. Due to its low resolution and significant distribution disparity between subclasses within a superclass, it is challenging for models to learn generalized representations from this dataset. This results in lower classification accuracy compared to the original CIFAR100 dataset. Similar experimental results can be found in [a].
- Inconsistent trends between models: The impact of generated data varies across models due to differences in architecture and model capacity. On one hand, generated data can enhance model training and improve performance. On the other hand, inherent explicit biases in generated data may harm model performance. Learning from data augmented with generated samples requires balancing these two effects. Variations in model architecture and capacity result in inconsistent trends across models and datasets.
- Inconsistent trends between datasets: The trends differ between datasets like CIFAR20/100 and CIFAR100 due to differences in data distribution and subclass structures. These disparities influence how models learn and respond to generated data, leading to different observed trends.
- Explanation of experimental findings: Detailed explanations of the experimental findings are provided at the end of each experiment and in Section 5.
[a] Zhang et al. "Discover and Mitigate Multiple Biased Subgroups in Image Classifiers." CVPR 2024.
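For reference, the sketch below shows one common way to compute the kind of subgroup fairness metrics discussed in this thread (average accuracy, Maximal Disparity, Disparate Impact as a ratio, and an Equal Opportunity gap based on per-subgroup recall). These definitions are assumptions for illustration; the paper's exact formulations may differ.

```python
import numpy as np

def subgroup_metrics(y_true, y_pred, group):
    """y_true/y_pred: class labels; group: subgroup label for each sample."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    accs = np.array([(y_pred[group == g] == y_true[group == g]).mean()
                     for g in np.unique(group)])
    # Per-subgroup recall (true-positive rate) for every class present in that subgroup.
    recalls = [(y_pred[(group == g) & (y_true == c)] == c).mean()
               for g in np.unique(group) for c in np.unique(y_true)
               if ((group == g) & (y_true == c)).any()]
    return {
        "average_accuracy": accs.mean(),
        "maximal_disparity": accs.max() - accs.min(),          # largest subgroup accuracy gap
        "disparate_impact": accs.min() / accs.max(),           # ratio form of the same gap
        "equal_opportunity_gap": max(recalls) - min(recalls),  # worst recall gap
    }
```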
[3] The conclusion is weak. Based on the observations, the authors conjectured that the model bias is affected by multiple factors, including the datasets, models, and data quality across generations. However, the authors did not provide clear experiment evidence or solid theory explaining how these factors influence the model bias.
A: This work is an empirical study aimed at understanding the impact of generated data on the bias of image classification models during the self-consuming loop. Our conjectures are based on experimental observations and serve as a preliminary explanation of the results. We hope our research will inspire future studies to investigate the underlying mechanisms in more depth. To facilitate further research, we will open-source all the code and data used in this project.
Questions: [1] The authors reported the FID score for the augmented data from multiple generations in Table 1. It seems that the FID scores for unbiased colorized MNIST show a decreasing trend; biased colorized MNIST is more or less the same; CIFAR-20/100 decreases first and then increases; Hard ImageNet shows a sudden increase from 50.9 to 186.4. Can the author explain the inconsistent changes here? Are there any implications from the observations? Also, it would be better if the author could visualize the generated images from different generations to visually see the changes across generations.
A: It is important to note that the generated data in the self-consuming loop not only influences the image classification model but also impacts the generative model itself.
For MNIST and CIFAR, we use a GAN model that is trained from scratch. In this case, the consistent inclusion of generated data does not drastically affect the generative model, as the clean data still constitutes the majority of the training dataset. This explains why the changes in FID scores for MNIST and CIFAR are relatively small across generations.
In contrast, for Hard ImageNet, we utilize the pre-trained Stable Diffusion model, where both the clean data and the generated data across generations differ significantly from the pre-training dataset. With an increasing number of generations, the impact of these distribution differences accumulates, leading to a significant increase in the FID score. This phenomenon highlights the sensitivity of pre-trained generative models to distribution shifts when operating in a self-consuming loop.
We believe this explains why the FID score changes for MNIST and CIFAR are small, while for Hard ImageNet, the changes are substantial.
As you suggested, we have visualized examples of generated images across different generations in Appendix B to provide a clearer understanding of the changes.
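For context on the FID numbers in Table 1, the sketch below computes the standard Fréchet Inception Distance from pre-extracted Inception features; the feature-extraction step is omitted, and this is not the authors' evaluation code.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    """feats_*: (N, D) Inception activations for real and generated images."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(cov_r + cov_f - 2 * covmean))
```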
[2] The authors mentioned, “We manually partition the original dataset into multiple subgroups, where subgroups within the same class share similar semantics.” Can the authors explain more clearly how they defined and constructed the subgroups for each dataset?
A:
- For the MNIST dataset, we randomly color each image with one of three colors. The original digit serves as the classification label, while the assigned color is treated as the subgroup label (a small sketch of this colorization is given below).
- For the CIFAR-20/100 dataset, we refer to the CIFAR-100 dataset, as introduced by [a]. The 100 classes in CIFAR-100 are grouped into 20 superclasses, with each superclass containing 5 subclasses. The group information is provided in the CIFAR-100 section of [a]. In our study, we use the superclass as the classification label and the subclass as the subgroup label.
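A minimal sketch of the colorization step described above, assuming 28x28 grayscale images scaled to [0, 1]; the three-color palette and random seed are illustrative assumptions.

```python
import numpy as np

PALETTE = np.array([[1.0, 0.2, 0.2],   # red-ish
                    [0.2, 1.0, 0.2],   # green-ish
                    [0.2, 0.2, 1.0]])  # blue-ish

def colorize_mnist(images, seed=0):
    """images: (N, 28, 28) grayscale in [0, 1] -> (N, 28, 28, 3) RGB plus subgroup labels."""
    rng = np.random.default_rng(seed)
    colors = rng.integers(0, len(PALETTE), size=len(images))  # subgroup label per image
    colored = images[..., None] * PALETTE[colors][:, None, None, :]
    return colored, colors
```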
Thanks for the response. The author clarified the construction of subgroups for the tested dataset and added a new dataset and the ViT baseline. I am increasing the score to 5 based on the improved clarity and experiments. As mentioned in the weakness, there is a different trend between the tested datasets, and the authors attributed it to the differences in data distribution and subclass structures. It appears that some of the observations are dataset-dependent and lack a conclusive, significant finding. I understand that the computation power may be a concern. However, I still believe the authors should extend their experiments to a richer set of datasets and generative models in order to make a more decisive conclusion from the results.
Thank you very much for recognizing the value of our work and for improving the score.
In the past few weeks, we have continued to run more experiments with additional datasets and models. However, due to limited computational resources, achieving the scale we aim for remains challenging.
We hope this work will draw the community's attention and inspire more work to explore this setup further and investigate the pros and cons of generated data in self-consumption loops. By doing so, we can better understand its implications for fairness and contribute to the integration of generative models into real-world applications.
This paper investigates the bias implications of incorporating generated data from generative models into training downstream tasks. The authors propose an iterative pipeline that repeatedly uses new generators to create additional images that enhance training. Their analysis spans three image datasets of varying complexity, exploring both performance improvements and bias impacts on downstream classification tasks. Bias is examined across different subgroups, covering both single and multiple bias interactions. Through this setup, they observe mixed trends in performance and bias effects across datasets of different complexities and model capacities. The paper concludes with a high-level discussion on potential root causes behind these varied results.
Strengths
- This paper addresses a highly relevant topic by examining the implications of generated data on bias, which is essential for advancing our understanding of the gaps between generated and real data.
- The iterative pipeline for incorporating generated data closely resembles real-world applications. By using datasets of varying complexity and models with different capacities, the study effectively explores different aspects of the problem, enhancing the generalizability of the findings.
- The study provides noteworthy observations with good experimentation support, such as the low correlation between dataset bias and resulting bias effects, the higher susceptibility of pre-trained models to integration bias, and insights into how different factors affect bias across datasets and models.
Weaknesses
- The paper presents mixed findings across datasets and models but does not provide in-depth explanations for these variations. While section 5 includes some discussion on the root causes of observed behaviors, this analysis remains at a high-level and is not well-supported or directly connected to the experiments in earlier sections. The analysis would be more convincing with clearer connections to the results, reinforcing the paper’s claims with evidence from the experiments.
- In Table 1, FID fluctuates for Color-MNIST and CIFAR10 after several rounds of data generation, while it increases substantially for HARD-ImageNet starting from the second iteration. This trend suggests a marked difference in data quality for HARD-ImageNet compared to the other datasets. However, the subsequent experiments focus primarily on how generated data impacts downstream performance and bias without addressing how this observed FID trend might influence these results. A discussion of how data quality (assessed by FID) could affect interpretations across the three datasets would enhance the clarity of the findings.
- Some methodology details are lacking, making it challenging to fully understand and replicate the study. For example, in section 3.2, there is limited information on the design of the human study, the impact of expert-guided filtering on image quality, and the specific r% used.
- The paper would benefit from recommendations on the use of generated data in generative models or downstream tasks, drawing on the insights from the experiments.
Questions
- Can you provide more detail on the human study in section 3.2, and the r% used?
- What is the impact of image quality from the expert-guided filtering in section 3.2?
- How well do the findings generalize to other datasets? For example, Section 4.2 showed that dataset bias does not amplify model bias. Do the authors expect that to hold for other datasets, too?
Weaknesses: [1] The paper presents mixed findings across datasets and models but does not provide in-depth explanations for these variations. While section 5 includes some discussion on the root causes of observed behaviors, this analysis remains at a high-level and is not well-supported or directly connected to the experiments in earlier sections. The analysis would be more convincing with clearer connections to the results, reinforcing the paper’s claims with evidence from the experiments.
A: Thank you for your suggestion. This work serves as an empirical study to translate real-world practices into a case study, investigating whether generated data can influence model bias across generations.
We include result analysis after each experiment and observe that there is no consistent pattern across different models and datasets. For this, we propose a hypothesis in the final section, considering multiple factors that may contribute to these findings. We hope our work inspires future research to delve deeper into this phenomenon.
[2] In Table 1, FID fluctuates for Color-MNIST and CIFAR10 after several rounds of data generation, while it increases substantially for HARD-ImageNet starting from the second iteration. This trend suggests a marked difference in data quality for HARD-ImageNet compared to the other datasets. However, the subsequent experiments focus primarily on how generated data impacts downstream performance and bias without addressing how this observed FID trend might influence these results. A discussion of how data quality (assessed by FID) could affect interpretations across the three datasets would enhance the clarity of the findings.
A: It is important to note that the generated data in the self-consuming loop not only influences the image classification model but also affects the generative model itself.
For MNIST and CIFAR, we employ a GAN model trained from scratch. In this scenario, the consistent inclusion of generated data does not drastically affect the generative model, as clean data still constitutes the majority of the training dataset. This explains why the FID scores for MNIST and CIFAR remain relatively stable across generations.
In contrast, for Hard ImageNet, we use the pre-trained Stable Diffusion model, where both the clean data and the generated data across generations differ significantly from the pre-training dataset. As the number of generations increases, these distribution differences accumulate, resulting in a substantial increase in the FID score. This phenomenon underscores the sensitivity of pre-trained generative models to distribution shifts in a self-consuming loop.
The impact of generated data on model bias across generations depends on several factors, including the generative model's performance and the learning capacity of downstream models. While the generative model may produce more data for certain subgroups, the quality of that data, as reflected in the FID score, ultimately determines whether the performance of these subgroups improves or degrades.
[3] Some methodology details are lacking, making it challenging to fully understand and replicate the study. For example, in section 3.2, there is limited information on the design of the human study, the impact of expert-guided filtering on image quality, and the specific r% used. The paper would benefit from recommendations on the use of generated data in generative models or downstream tasks, drawing on the insights from the experiments.
A: We will open-source all code, models, and data to facilitate future research. The data filtering process is detailed in the questions section. In this study, we observe that the ratio of generated data used for data augmentation must be carefully considered to mitigate model crashes caused by distribution disparities between generated and real-world data. Additionally, it is worth noting that for a given model and dataset, the trend in model bias changes is usually consistent. This consistency suggests that we can adjust the mixup ratio by observing a small number of iterations and dynamically adapting our augmentation strategy.
Questions: [1] Can you provide more detail on the human study in section 3.2, and the r% used?
A: First, we manually review the generated samples and discard images with low quality.
Second, we calculate the CLIP score for each image, where the paired text is the class name. Images are then grouped into bins based on their CLIP scores, with each bin covering a 10% range of CLIP scores; this results in 10 bins.
Then, we randomly sample 10 images from each bin and evaluate the quality of each bin. Based on this evaluation, we determine the fraction of lowest-scoring images (denoted as r%) to discard, retaining the remainder for training.
- For MNIST, we find that retaining the top 90% of images (r = 10%) is optimal.
- For CIFAR-20/100, retaining the top 70% of images (r = 30%) works best.
- For the ImageNet dataset, retaining the top 40% of images (r = 60%) yields the best results.
We include these details in Appendix C; a minimal sketch of this filtering procedure is given below.
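The sketch below illustrates the procedure. The CLIP checkpoint, quantile-based binning, and automatic thresholding are simplifying assumptions; in the paper, the bins are score ranges and the choice of r% relies on manual inspection of the sampled images.

```python
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(images, class_names):
    """CLIP similarity between images[i] and its paired class name class_names[i]."""
    inputs = processor(text=class_names, images=images, return_tensors="pt", padding=True)
    return model(**inputs).logits_per_image.diagonal().cpu().numpy()

def sample_for_inspection(scores, n_bins=10, per_bin=10, seed=0):
    """Group images into 10 score bins and pick up to 10 indices per bin for manual review."""
    rng = np.random.default_rng(seed)
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
    bin_ids = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    return {b: rng.choice(np.flatnonzero(bin_ids == b),
                          size=min(per_bin, int((bin_ids == b).sum())), replace=False)
            for b in range(n_bins) if (bin_ids == b).any()}

def keep_top_images(images, scores, r=0.30):
    """Discard the lowest-scoring r fraction (e.g. r = 0.30 for CIFAR-20/100)."""
    threshold = np.quantile(scores, r)
    return [img for img, s in zip(images, scores) if s >= threshold]
```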
[2] What is the impact of image quality from the expert-guided filtering in section 3.2?
A: In practice, we observe that the model performs poorly without expert-guided filtering, resulting in low classification accuracy. As the number of generations increases, the performance degradation becomes more pronounced. This is primarily due to the significant distribution shift between the original images and the low-quality generated images, which negatively impacts the model's ability to generalize effectively.
[3] How well does the findings generalize to other dataset? For example, section 4.2 showed dataset bias does not amplify model bias. Do authors expect that to hold for other datasets, too?
A:
- [Why this happens?] While the dataset is initialized with bias, it can influence both the generative model and the image classification model. The generative model may unintentionally learn the bias from the dataset, but the quality of the generated data can directly affect the image classification model. Specifically, high-quality data sampled from high-density regions of the original distribution can make it harder for the model to learn a representation on the biased subclass, thereby alleviating the bias.
- [Generalization to other datasets] The impact of generated data on model bias across generations within the self-consuming loop depends on multiple factors, including model architecture, dataset characteristics, the difficulty of learning the dataset, and the nature of the bias itself. While it is challenging to predict whether the findings will hold for other datasets, our results consistently show similar trends for specific models on certain datasets. This consistency suggests that models can be trained over a few generations, and the observed performance during these initial generations can be used to predict whether the current training strategy is effective. This approach provides practical guidance for real-world model development.
We hope our research will inspire future studies to conduct more fine-grained analyses of the impact of generated data on model bias across generations.
With the growing prevalence of generative models, the authors raised the concern regarding model bias, particularly in the context of generation and retraining cycles. Then the authors developed a simulation environment and conducted extensive experiments on image classification datasets to reveal how fairness metrics evolve across successive generations.
Strengths
S1. The scenario proposed by the authors is highly relevant, as synthetic data is increasingly shared online and integrated into various domains. More studies on how synthetic data affects model training across generations will be beneficial for the research community.
S2. The experiments on the proposed dataset are extensive, e.g., w/ or w/o biased initialization, different base models.
Weaknesses
W1. Lack of experiments on the choice of generative models. Various generative models can differ in behavior; the choice of model likely impacts sample quality and influences the outcomes of subsequent studies.
W2. The motivation of the paper concerns the role synthetic data will play in future image classification. With foundational models playing a dominant role, integrating settings such as synthetic data for transfer learning in classification would strengthen the paper, going beyond the current base case, which may lack scalability.
W3. The experiments are mainly targeting the model bias within a self-consuming loop in image classification domain. However, the conclusions/observations are not significant.
Questions
Q1. How much synthetic data is added, i.e., what is the ratio p? Is there a rationale behind the choice of p?
Q2. Are there any experiments or preliminary results on tasks with a larger number of classes?
Q3. How is CLIP used in filtering? Specifically, is it based on the similarity score between label texts and images?
Q4. Would different losses lead to varying results?
Q5. Any insights on how the conclusions could generalize to other tasks or other modalities?
Weaknesses: W1. Lack of experiments on the choice of generative models. Various generative models can differ in behavior; the choice of model likely impacts sample quality and influences the outcomes of subsequent studies.
A: Thank you for your suggestion. We agree with your observation that different generative models can have varying impacts on downstream tasks.
Criteria for selecting generative models: In our study, we utilized three datasets: MNIST, CIFAR, and an ImageNet-like dataset. MNIST and CIFAR are small-scale datasets with low resolution, which led us to select GAN models for learning their distributions, as diffusion models often struggle with limited-size data. In contrast, the ImageNet-like dataset, with its higher resolution, is more suitable for diffusion models.
We also considered the point you raised in our study. However, due to limited computational resources, we conducted experiments on each dataset using only one type of generative model.
Our study aims to shed light on the issue of continuously using generated data in real-world applications, serving as a case study. We hope this work will inspire further investigations into how various generative models impact bias in downstream tasks.
W2. The motivation of the paper concerns the role synthetic data will play in future image classification. With foundational models playing a dominant role, integrating settings such as synthetic data for transfer learning in classification would strengthen the paper, going beyond the current base case, which may lack scalability.
A: Thanks for your suggestion. We will include this discussion in our paper.
W3. The experiments are mainly targeting the model bias within a self-consuming loop in image classification domain. However, the conclusions/observations are not significant.
A: We acknowledge that there is no universal rule for the impact of generated data within a self-consuming loop on downstream image classification models. However, it is worth noting that the observed trends for each model on the same dataset remain consistent, providing guidance for real-world model development practices.
This indicates that while this study shows generated data can influence model bias across generations, in practice the process can be simplified: train the model for a few generations and observe the trend. If the bias of interest continues to worsen across the observed generations, this suggests the need to incorporate additional real-world samples to mitigate the adverse effects of generated data on the model.
Questions:
Q1. How much synthetic data is added, i.e., what is the ratio p? Is there a rationale behind the choice of p?
A: We maintain the size of the generated data at 10% of the original clean dataset. Previous studies have highlighted that the ratio between synthetic data and clean data should be carefully considered. A large ratio can lead to model collapse, while a small ratio is expected to enhance performance as desired.
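A minimal sketch of this mixing step, assuming the clean and generated samples are available as PyTorch datasets; the function name and subsampling strategy are illustrative assumptions.

```python
import random
from torch.utils.data import ConcatDataset, Subset

def build_augmented_set(clean_ds, generated_ds, p=0.10, seed=0):
    """Keep generated data at p * |clean| and concatenate it with the clean set."""
    n_gen = min(int(p * len(clean_ds)), len(generated_ds))
    idx = random.Random(seed).sample(range(len(generated_ds)), k=n_gen)
    return ConcatDataset([clean_ds, Subset(generated_ds, idx)])
```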
Q2. Are there any experiments or preliminary results on tasks with a larger number of classes?
A: We further conduct experiments on the Breeds dataset (a sub-ImageNet dataset) [a], which is organized by the subgroup connections established in WordNet. The results are shown in Appendix A.
[a] Santurkar et al. "Breeds: Benchmarks for subpopulation shift." ICLR 2021.
Q3. How is CLIP used in filtering? Specifically, is it based on the similarity score between label texts and images?
A: Yes, we use the similarity score between the label text and the images.
Q4. Would different losses lead to varying results?
A: In our study, we work on the task of image classification, where cross-entropy is the most widely used loss. In our view, changing the loss would not lead to substantially different results, because the bias comes from the distribution of the generated data, which in turn stems from the imbalanced generation of the generative model. Thanks.
Q5. Any insights on how the conclusions could generalize to other tasks or other modalities?
A: Model bias originates from the data it learns. Due to the imbalanced data generation by generative models, the distribution of augmented data often exhibits unevenness, introducing bias into downstream tasks. This phenomenon not only affects classification tasks, as studied in our work, but also extends to other domains such as robotic control, visual question answering, and more.
Thanks to the authors for the response and for addressing the questions. I agree with and appreciate the effort to draw the community's attention to the influence of generated data, which is indeed an important and under-explored topic. The idea of constructing the loop and testing iteratively is simple and reasonable. My primary concern, however, lies in the scalability and generality of the conclusions and experimental settings, which are also main parts of the paper. Given the practical nature of the idea, exploring more applicable and advanced settings would significantly enhance its impact. So I will maintain my score.
Thank you for your response and suggestions! We completely agree that adding experiments on fine-grained settings (e.g., loss functions, downstream tasks) would further enrich the paper.
However, within the limited scope of a conference paper, we have chosen to focus on a more general setting on the most widely used task for concept verification: the standard image classification task on standard image datasets using the cross-entropy loss. Specifically, we employ popular generative models (GAN and Stable Diffusion) to enhance the training process during the self-consuming loop. This approach allows us to examine the impact of generated data on model performance in general, aligning closely with the central theme of our paper's title.
It is undoubtedly valuable to explore scalability across other downstream tasks, such as transfer learning, domain adaptation, and the impact on model robustness. We leave this broader investigation to future work, aiming to inspire more researchers to bridge the gap between industry practices and academic research while fostering mutual advancements.
We hope this addresses your concern further.
To all reviewers:
We sincerely thank all reviewers for their valuable feedback and for recognizing the merits of our work:
- Our studied problem and the proposed framework are novel, important, and practical. (QLxJ, Bwfa, hfuo, m2oW).
- We conduct extensive experiments to answer the studied problem and support our claim. (QLxJ, Bwfa)
In our revision, we have made the following updates to address the reviewers' major concerns:
- More experiments on ImageNet: Details are provided in Appendix A.
- Examples of generated images across generations: Details are provided in Appendix B.
- Detailed expert-guided filtering: Details are provided in Appendix C.
- Improved Clarity: We revise both illustration figures and their captions for better clarity.
- Typos Corrected: We fix all identified typographical errors.
During the discussion, we sincerely appreciate the reviewers' suggestion to further broaden the scope of the studied problem. However, within the limited scope of a conference paper, we have chosen to focus on the most general and widely used task under a standard setting for concept verification. To ensure reliable verification, all results are conducted multiple times to minimize randomness.
We agree that it would be valuable to explore the impact of generated data on additional tasks, such as transfer learning, domain adaptation, and model robustness. We leave these investigations for future work, aiming to inspire more researchers to delve into this problem and pave the way for leveraging academic advancements to drive progress in industry practices.
We hope these updates address your concerns and further strengthen the contributions of our work. Thank you again for your thoughtful reviews and support!
This paper explores the implications of generated data being used to train generative models, which is a realistic setting given how data is often shared and scraped on the internet. Specific focus is given to how biases can emerge or be exacerbated across subgroups in the data.
Reviewers noted that trends in the data and results were inconsistent across settings, with no clear explanations given as to the mechanisms that could be at play. The datasets used were small scale, focused only on the image modality, and used a small set of model architectures. These choices greatly limit the generality of the study, and its applicability to real world cases (which was the original motivation). Reviewers noted inconsistent results between experiments that were not explained by the authors, and these points were not adequately addressed during the discussion period. For these reasons, I am recommending rejection.
Additional Comments from Reviewer Discussion
The main points of improvement raised by reviewers were: small scale models and datasets, as well as lack of variety in modalities and model types; inconsistent behaviours across experiments which are not sufficiently explained; broad conclusions that are not sufficiently supported by evidence.
The authors did not manage to satisfy these criticisms in the discussion. They did not scale up their experiments or diversify the modalities and model types, which is admittedly a substantial ask for a rebuttal period, but still it points to a need to revise the work on a more holistic scale. The inconsistent results between experiments were not explained.
I want to encourage the authors that the topic they are working on is important and can have impact on the field, but the current work is not ready for publication. Please take a close look at the feedback from the reviewers and revisit the overall structure and objectives of your experiments to find a path forward in revising your work.
Reject