PaperHub
Score: 7.3/10
Decision: Rejected (4 reviewers)
Ratings: 5, 4, 5, 4 (min 4, max 5, std. dev. 0.5)
Confidence: 3.8
Novelty: 2.3 · Quality: 2.8 · Clarity: 3.3 · Significance: 2.5
NeurIPS 2025

A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords: gender bias, social bias, text-to-image generation

Reviews and Discussion

Review (Rating: 5)

The authors of the paper perform a large-scale analysis of gender biases in T2I generation. They focus on 5 state-of-the-art models and perform the analysis by considering more than 3,000 prompts about activities, contexts, objects, and occupations. Overall, they consider a set of more than 2 million images. Through their analyses, they show that T2I models reinforce gender stereotypes, and particularly the association between women and household roles.

Strengths and Weaknesses

The presented experiments are extensive and technically sound. The claims of the authors are well supported by the obtained results. In addition, the authors provide a profound analysis of the limitations of this work, and it is particularly positive how they mention, multiple times, the utilization of binary genders as a limitation, as well as the issues related to inferring gender from images. The work is written clearly and the structure is easy to follow. The extensive literature review shows that the authors have a profound knowledge of the related work and have made a notable effort in showing (in the last part of the appendix) how they compare to major existing publications in this field. The obtained results are significant and interesting. However, while the effort of providing a larger-scale evaluation, as well as the focus on more dimensions (rather than just occupations), is notable, the obtained results are not groundbreaking. In this sense, the paper, while being strong, sound and interesting, is not particularly innovative and does not present unique or creative ideas. That being said, this weakness does not constitute a reason for rejecting the paper. The field of T2I generation advances very fast, and a large-scale evaluation such as the one provided in this paper, while not being groundbreaking, still provides interesting and more nuanced insights on the different dimensions (beyond occupation) where gender biases can be evaluated.

Questions

My current evaluation is 5 (accept). I would not go for a 6 (strong accept) even after the rebuttal because I do appreciate this paper and the effort put by the authors, but I do not believe that this is a groundbreaking piece of work. Nevertheless, here are some minor suggestions:

  1. Lines 224-225 are a bit confusing, since it is not clear why the parent-son prompts are not gender-neutralized (e.g., parent-child).
  2. For clarity reasons, it would be great if the authors could briefly explain the difference between "bias analysis" and "open bias detection" at the beginning of Section 2.

Limitations

Yes, the authors have addressed the limitations of their work extensively. In the part in which they mention the utilization of generated images for training other models, they might consider adding some reflections on the impact, considering the results shared in the following paper:

Chen, T., Hirota, Y., Otani, M., Garcia, N., & Nakashima, Y. (2024). Would Deep Generative Models Amplify Bias in Future Models?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10833-10843).

Final Justification

As previously stated in my review, I am keeping my score as it is (5), which is already rather positive. While I do not feel confident raising it to a 6, I am grateful to the authors for having addressed my minor doubts, and I wish them the best of luck with this submission.

Formatting Issues

There are no major formatting issues related to this paper.

I would just give a suggestion to the authors. Throughout the paper, there are several implicit references, for example a sentence like: "In their seminal work, [14] discovered intersectional gender and racial biases in image recognition systems." This sentence would be much easier to read if the authors directly wrote "Buolamwini and Gebru". Especially in cases like this, in which a very famous piece of work is mentioned, reporting the names of the authors would help the reader (who likely knows this work) understand the context better, without having to check the references. In general, if the authors made an effort to switch from implicit to explicit references, this would definitely improve the flow of the manuscript.

Author Response

We thank the reviewer for the positive assessment and great support of our work. We are particularly happy that the reviewer finds our experiments “extensive and technically sound” and that our results are “significant and interesting” to the reviewer. We also appreciate the suggestions provided in the review to improve the clarity and reading flow, as well as the interesting reference to work on synthetic data and bias amplification.

We included these in our updated version, as detailed below. Additionally, we highlight again our main contributions, and we are happy to further discuss any remaining unclear aspects of our paper.

The following are our individual responses.


The paper, while being strong, sound and interesting, is not particularly innovative and does not present unique or creative ideas.

We greatly appreciate that the reviewer finds our paper significant, and we respect their judgment regarding the novelty of our ideas. We would still like to highlight our innovations:

Our study features an improved experimental methodology, and we expect that future work will adopt this improved analysis pipeline. Furthermore, we investigate gender bias beyond occupations and link it to bias that exists in real-world contexts.

In addition to the empirical contribution this provides, researchers can use our prompts and categories for their analyses. We also link results from human studies in social science with statistical results on gender distribution in generated images. We think that this combination of research areas is novel and will facilitate further interdisciplinary studies.

Furthermore, we illustrate how biases are amplified beyond real-world contexts, which may be linked to the models (as opposed to training data). For this, we use gender-concept cooccurrence frequencies from LAION-400m. This is another valuable contribution of our work that will inspire future research, because it allows much more detailed analyses regarding the relation between bias in pretraining data and model bias. It also reduces the reliance of bias analysis on demographic workforce statistics. This contribution enables the establishment of new baselines for bias amplification, and it enables research on better understanding how dataset bias translates to model bias.
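As an illustration of how such a caption-based baseline can be computed, here is a minimal sketch under simplifying assumptions: the gendered word lists and the single-token concept matching are placeholders, not the exact lexicon or matching procedure used in the paper.

```python
import re

# Simplified gendered word lists; the lexicon actually used in the paper may differ.
FEMALE_TERMS = {"woman", "women", "girl", "girls", "female", "she", "her", "mother", "mom"}
MALE_TERMS = {"man", "men", "boy", "boys", "male", "he", "his", "father", "dad"}

def caption_tokens(caption: str) -> set:
    """Lowercased word tokens of a caption."""
    return set(re.findall(r"[a-z']+", caption.lower()))

def cooccurrence_female_ratio(captions, concept: str):
    """Baseline female ratio for `concept` from gender-word co-occurrences in captions."""
    female = male = 0
    for cap in captions:
        toks = caption_tokens(cap)
        if concept.lower() not in toks:
            continue
        has_f, has_m = bool(toks & FEMALE_TERMS), bool(toks & MALE_TERMS)
        if has_f and not has_m:
            female += 1
        elif has_m and not has_f:
            male += 1
    total = female + male
    return female / total if total else None  # undefined if no gendered co-occurrences

# Toy example (not real LAION captions):
captions = ["a woman doing laundry", "a man doing laundry", "a woman folding laundry"]
print(cooccurrence_female_ratio(captions, "laundry"))  # -> 0.666...
```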


Lines 224-225 are a bit confusing, since it is not clear why the parent-son prompts are not gender-neutralized (e.g., parent-child)

Thank you for pointing this out. We decided to keep the activities in their original formulation as provided by Wilson and Mihalcea [A], save for syntactical adaptation to our prompt format. However, a number of prompts indeed contain gendered expressions, such as “son” or “mom”. Our rationale was that this could allow for an analysis of gender bias in interpersonal interactions. For example, would the prompt “attend a meeting at one's son's school” mainly generate male-gendered images, while the prompt “go shopping with one's son” (both are actual examples from our prompts) would generate images that show women? However, we didn’t include experiments on this in the current paper and leave this for future work. We hope that our released data, including images, will inspire future work on this.

[A] Wilson and Mihalcea: Measuring semantic relations between human activities. In IJCNLP 2017


For clarity reasons, it would be great if the authors could briefly explain the difference between "bias analysis" and "open bias detection" at the beginning of Section 2.

We agree that this point should be clarified. We have added the following explanation, which will be included in our updated version:

“Here, bias analysis means understanding and quantifying known biases (e.g., gender bias) of T2I models, while open bias detection means discovering biases that the T2I model exhibits for a given set of prompts, without any prior information about what kind of bias (e.g., gender/race/age bias or other attribute correlations) we are looking for.”


In the part in which they mention the utilization of generated images for training other models, they might consider adding some reflections on the impact, considering the results shared in the following paper: Chen, T., Hirota, Y., Otani, M., Garcia, N., & Nakashima, Y. (2024). Would Deep Generative Models Amplify Bias in Future Models?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10833-10843).

Thank you for bringing this very interesting and relevant paper to our attention! While the paper does not find any conclusive trends regarding the relation of synthetic data and gender bias amplification, it shows that synthetic data can have effects on various bias axes, such as age or skin color. We agree with the conclusions mentioned in the paper that these findings suggest dataset audits and possibly filtering are necessary to avoid introducing bias through synthetic data, which in turn justifies the need for large-scale analyses such as ours to better understand possible bias in synthetic data.

Therefore, we have added the following sentence to the suggested section:

“Synthetic images are not only used in everyday applications such as advertisements [61] and presentation slides [76] but are also increasingly used as training data for other foundation models [35,59,96,97]. Here, Chen et al. (2024) observed age and skin tone bias amplification at increased levels of synthetic images in the pretraining data and therefore recommend filtering web data to mitigate bias.”


In general, if the authors made an effort to switch from implicit to explicit references, this would definitely improve the flow of the manuscript.

We also think that explicit references are preferable, and we resorted to numbered references mainly because of space constraints. In our updated version, which we unfortunately cannot share, we have made references explicit as much as possible within the space constraints, in particular in all cases where the reference is the grammatical subject of a sentence. We hope that these changes will increase the reading flow of the paper, and we thank the reviewer once again for the suggestion.

Comment

Thank you for your kind answer. I am happy to know that my positive review is encouraging. Thank you for clarifying my minor doubts.

Review (Rating: 4)

The authors present an in-depth and comprehensive analysis of gender bias in text-to-image (TTI) models. While prior studies have explored gender bias in highly limited scenarios using small sets of prompts, this work investigates biases across a wide range of real-world contexts, including everyday activities, occupations, objects, and broader situational prompts.

In their experimental setup, the authors generate images from each prompt using five widely adopted TTI models. The generated images are first filtered based on whether they contain a person, after which an automated binary gender classification is performed using an object detector and a multimodal large language model (MLLM).

The results reveal the presence of gender bias in specific categories, with further evidence showing that male-oriented prompts in the dataset not only exhibit bias but that these biases are amplified by the TTI models.

Strengths and Weaknesses

Strengths

1. The first large-scale, in-depth exploration of gender bias in recent TTI models

This paper presents the first large-scale, systematic analysis of gender bias in TTI models, using a diverse set of prompts derived from real-world scenarios. Compared to previous studies that focused on narrow prompt sets, this work's comprehensive design—covering four distinct categories—provides a significant advantage for investigating bias in TTI models. The large-scale nature of the study adds credibility to the findings, providing stronger empirical support for the existence of gender bias in pre-trained TTI models.

2. Fully automated framework for large-scale gender bias evaluation

The authors propose a fully automated framework to evaluate gender bias in TTI models at scale. This framework eliminates the need for human intervention, making it a highly practical and reproducible tool for bias assessment.

Weaknesses

1. Impact of Classifier-Free Guidance (CFG) on gender bias

Classifier-Free Guidance (CFG) is known to enhance the quality of generated images by sacrificing diversity, effectively narrowing the sampling distribution toward high-quality, representative images aligned with the text prompt. As noted in the appendix, all models except Flux-Schnell employ CFG. This raises concerns that CFG may unintentionally amplify gender bias by favoring stereotypical or dominant gender representations within specific categories. The potential role of CFG in reinforcing or amplifying gender bias warrants further investigation, ideally through controlled ablation studies.

2. Ambiguity in the source of observed gender biases — prompt design or model bias?

In the Bias Amplification in Activities experiment (Table 3), it is unclear why male-majority and female-majority groups exhibit opposite patterns. The current explanation, attributing these results to inherent male-gender bias in the TTI models, seems insufficient. One possible alternative is that the seemingly neutral prefixes (e.g., “a person”) used in the prompts may themselves introduce unintended gender bias. While such prefixes appear gender-neutral from a human perspective, they may encode bias when processed by TTI models. It is recommended that the authors explicitly investigate potential gender bias in prompt prefixes through controlled experiments. This would strengthen the fairness and reliability of the overall bias evaluation framework.

Moreover, if the model itself exhibits an inherent male-gender bias, some of the results shown in Figure 2 may primarily reflect model-level bias rather than category-specific biases. This possibility casts doubt on the interpretation of male-dominated outcomes reported in Sections 4.1 and 4.2. A clear justification distinguishing between model-inherent bias and category-specific biases is necessary to ensure a fair and accurate analysis.

Questions

It would be more informative if this figure were separated by model and provided in the supplementary material. For instance, in the Object Categories analysis, it is difficult to accurately discern the effects specific to Flux-Schnell, making model-wise comparisons less clear. Presenting model-specific results would facilitate a more precise understanding of the bias patterns across different TTI models.

Limitations

yes

Final Justification

Initially, I had some concerns, which are mostly resolved through the authors' rebuttal. After considering the other reviewers’ comments and the discussion, I have decided to keep my initial score.

Formatting Issues

no formatting concerns

Author Response

We thank the reviewer for the helpful feedback, and we are happy that the reviewer mentions the scale and setup of our study as a strength of our work. We appreciate that the reviewer finds our comprehensive design offers significant advantages for investigating bias in TTI models, and that our findings are credible.

In our rebuttal we address the mentioned concerns about the roles of classifier-free guidance and prompt prefixes. For this, we ablate the effect of classifier-free guidance on female ratios, finding that disabling CFG results in poor image quality, and that the impact on bias is mixed for models that are more robust to different guidance scales. We also highlight our experiments regarding different prompt prefixes, where we find that variations do not systematically affect the measured bias.

We hope these provide the requested clarifications. Let us know if there are any remaining uncertainties.

In the following, we address each concern mentioned in the review individually.


The potential role of CFG in reinforcing or amplifying gender bias warrants further investigation, ideally through controlled ablation studies.

We thank the reviewer for this interesting suggestion. The role of CFG in the quality-diversity trade-off makes it a potential candidate for influencing bias. To investigate this, we conducted the requested ablation study. We sampled 50 “activity” prompts, stratified by their average female ratio, and re-generated images using two settings: (1) CFG turned off, and (2) a low CFG scale of 2.0.

As the reviewer anticipated, turning off CFG or lowering the guidance scale significantly degrades image quality for SD-3.5-Medium and SD-3-Medium, rendering most images unusable for analysis. We therefore proceed with the three models that produced coherent images (Flux, Flux-Schnell, SD-3.5-Large). The table below shows the mean absolute deviation of the female ratio $R_f$ from the $R_f$ values in our original study, where we use recommended CFG values.

| | Flux | Flux-Schnell | SD-3.5-Large |
|---|---|---|---|
| CFG off | 0.077 | -0.003 | 0.004 |
| Guidance scale = 2 | 0.024 | -0.001 | 0.004 |

From this, we derive the following key findings:

  1. For models where CFG is a factor (Flux, SD-3.5-Large), changing the guidance scale has small or inconsistent effects on gender bias.
  2. Lowering or disabling CFG severely impacts the basic image quality of several models (SD-3.5-Medium, SD-3-Medium).

Therefore, we conclude that deviating from the recommended CFG settings introduces a strong confounding variable. Any observed changes in bias could be artifacts of the model's failure to generate coherent subjects. Therefore, to ensure a fair and meaningful evaluation of bias on high-quality, in-distribution generations, we think that using the recommended guidance scale values is the most methodologically sound approach. We will add this ablation study to the supplementary material.
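For reference, here is a minimal sketch of how the female ratio and the deviation values in the table above can be computed; the exact aggregation (per prompt vs. per image) is our assumption, and the variable names are illustrative.

```python
import numpy as np

def female_ratio(gender_labels):
    """R_f: fraction of female labels among persons with an unambiguous binary label."""
    kept = [g for g in gender_labels if g in ("female", "male")]
    return sum(g == "female" for g in kept) / len(kept)

def mean_deviation(rf_reference, rf_ablation, absolute=True):
    """Average per-prompt deviation of R_f between the recommended-CFG run and an ablation.

    rf_reference, rf_ablation: dicts mapping prompt -> R_f.
    Set absolute=False for a signed mean deviation.
    """
    prompts = rf_reference.keys() & rf_ablation.keys()
    diffs = np.array([rf_ablation[p] - rf_reference[p] for p in prompts])
    return float(np.mean(np.abs(diffs) if absolute else diffs))
```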

Furthermore, we check if the different treatment of CFG in Flux-Schnell has an impact on image diversity, and if diversity has an impact on bias. For this, we use DINOv2 image embeddings and calculate the average pairwise cosine similarity for all images generated for a given prompt group. We use this as an inverse proxy for diversity, where lower scores indicate higher diversity. The results are summarized below:

| | Activities | Contexts | Objects | Occupations |
|---|---|---|---|---|
| Flux | 0.71 | 0.66 | 0.58 | 0.72 |
| Flux-Schnell | 0.71 | 0.69 | 0.60 | 0.75 |
| SD-3.5-Large | 0.72 | 0.71 | 0.61 | 0.75 |
| SD-3.5-Medium | 0.73 | 0.72 | 0.61 | 0.77 |
| SD-3-Medium | 0.76 | 0.76 | 0.63 | 0.79 |

The diversity scores of Flux-Schnell are in line with the other models (i.e., no outlier), so we conclude that the different treatment of CFG has no impact on image diversity in our case. Nonetheless, models generate images with varying levels of diversity, and their diversity ranking is consistent across prompt groups. For example, the Flux models generate the most diverse images, while SD-3-Medium generates the least.

Combining these findings with our results on bias strength presented to Reviewer PQHE suggests that higher image diversity may not directly translate to less bias. Although models can be ranked clearly by their diversity and skew (see the response to Reviewer PQHE), which are naturally related, this ranking does not correspond to a clear ranking by bias strength. This suggests that improving general capabilities like image diversity may not be a solution for mitigating bias. We believe this is an important finding, and we will add this analysis and discussion to the paper.
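For concreteness, a minimal sketch of this diversity proxy, assuming the DINOv2 embeddings have already been extracted (the embedding step itself is omitted):

```python
import numpy as np

def avg_pairwise_cosine_similarity(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all unordered image pairs in a prompt group.

    embeddings: (n_images, dim) array, e.g. DINOv2 features.
    Higher values mean more similar images, i.e. lower diversity.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                     # (n, n) cosine similarity matrix
    iu = np.triu_indices(len(embeddings), k=1)   # unique pairs, excluding self-pairs
    return float(sims[iu].mean())

# Example with random vectors standing in for image embeddings:
rng = np.random.default_rng(0)
print(avg_pairwise_cosine_similarity(rng.normal(size=(40, 768))))
```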


It is recommended that the authors explicitly investigate potential gender bias in prompt prefixes through controlled experiments.

We thank the reviewer for this critical question about disentangling the sources of bias. Our experimental design was constructed to isolate category-specific bias from other potential confounders. We can break down the sources as follows:

  1. Inherent Model Bias: We agree that models are inherently biased toward generating male-gendered images, as noted in line 194, and also confirmed by other studies [A, B]. This likely explains the bias amplification trends: the ratio of female-gendered images for female-dominated activities is reduced because models pull the gender distribution toward male. Conversely, this may also cause the tendency to emphasize male skew.
  2. Prompt Prefix Bias: To control for bias introduced by prompt wording, we intentionally used 5 distinct, neutral prefixes (“a person”, “someone”, etc.). Most importantly, we performed an ablation study on these prefixes (Appendix D.3, Fig. 7), which confirms that no single prefix introduces a significant gender skew. By averaging our results across these five validated prefixes, we neutralize the prompt prefix as a confounding variable.

We will clarify this in the main paper to assure that our methodology isolates and measures category-specific biases fairly.

[A] Ghosh and Caliskan: 'Person'== Light-skinned, Western Man, and Sexualization of Women of Color: Stereotypes in Stable Diffusion. In EMNLP Findings 2023
[B] Wu et al.: Stable diffusion exposed: Gender bias from prompt to image. In AIES 2024


Presenting model-specific results would facilitate a more precise understanding of the bias patterns across different TTI models.

This is a great suggestion. We will add figures showing model-specific results to the supplementary material for our updated version.

Comment

Thank you for your constructive feedback and the additional experiments. My concerns are well addressed.

Comment

Thank you for your positive reply! We are very glad that our rebuttal addressed your concerns. We would be grateful if you would consider reflecting this in your final rating. Thank you again for your time and valuable feedback!

Review (Rating: 5)

This paper investigates social biases in image generation models by moving beyond prior analyses that primarily focused on limited prompt templates and narrow gender-occupation associations. Instead, the authors conduct a large-scale analysis across a broader range of bias dimensions—including activity, context, object, and occupation—using a diverse set of prompts. The study covers several recent state-of-the-art image generation models and examines which wordings exhibit a majority bias toward male or female representations.

Strengths and Weaknesses

Strengths

  • The authors rightly identify a key limitation in prior work: many earlier studies rely on narrow prompt sets when analyzing gender-occupation bias. In contrast, this work conducts a broader, large-scale investigation.
  • The paper provides a comparative analysis across multiple state-of-the-art image generation models, offering valuable insight into how widespread the issue is.
  • The authors improve the trustworthiness of their findings by introducing filtering techniques to ensure cleaner and more meaningful prompt-image pairings.

Weaknesses

  • The paper primarily presents an empirical analysis but falls short of delivering a strong message beyond confirming the existence of known biases. Since the presence of gender bias in generative models is already well recognized, the paper would benefit from articulating what new insights are derived from the analysis. Furthermore, it would improve the completeness of the work to include recommendations or implications for future bias mitigation strategies based on the findings.
  • Given the analysis spans multiple generative models, a discussion on model-wise differences would be valuable. Are some models more biased than others? What factors (e.g., model size, training objective) contribute to the differences, if any? These aspects are not sufficiently addressed in the current version.
  • While the authors justify analyzing LAION-400M (as a proxy for training data) to study majority class reduction and amplification, most of the evaluated models are trained on datasets much larger and more diverse than LAION-400M. This raises concerns about how much trust can be placed in the LAION-based conclusions when extrapolating to those models.
  • The analysis is limited to binary gender bias, while recent studies have also explored broader social dimensions such as race and age.

Questions

  • What novel findings can we extract from this analysis beyond reaffirming that bias exists?
  • Given that many of the evaluated models are trained on larger, more diverse datasets than LAION-400M, how reliable are the comparisons made using LAION-400M?
  • A discussion or acknowledgment of the additional axes of bias would have made the work more comprehensive and aligned with current trends in fairness research.

Limitations

Yes

Final Justification

My major concerns were addressed by the authors' responses.

Formatting Issues

No concern

Author Response

We thank the reviewer for the positive evaluation of our submission and for the great suggestions to improve it. We appreciate that the reviewer finds that we provide “valuable insight” and that we improve the trustworthiness of our findings. In our rebuttal, we clarify our contributions, and we describe additional results regarding model comparisons, which we will include in our updated version. Furthermore, we clarify implications for debiasing resulting from our work, which we will also add in our updated version. Finally, we justify using LAION-400M as a proxy for web-scale pretraining data.

We hope these clarifications address the concerns mentioned in the review. Kindly let us know of any further questions or uncertainties. We are happy to discuss additional changes to our submission or any remaining doubts. In the following, we respond to all comments and concerns individually.


What novel findings can we extract from this analysis beyond reaffirming that bias exists?

The novelty of our work lies in three key areas that establish a new, more rigorous standard for bias analysis in T2I models. These contributions are:

  1. Improved experimental methodology. Instead of classifying images holistically as male or female, we detect all people in images, label their perceived gender, and remove images with ambiguous gender or no people. We also exclude entire prompts that generate too few images to yield statistically significant conclusions. These are innovations of the experimental setup as commonly used for bias research of T2I models, and thus enable more valid analysis. We expect that future work will adopt our improved analysis pipeline.
  2. Broader coverage of concepts. We systematically investigate gender bias at scale beyond the well-studied domain of occupations. By curating 3,217 prompts, we provide the resources to analyze bias in more subtle, everyday scenarios (e.g., household chores). This enables a more holistic understanding of the stereotypes T2I models perpetuate. Our work is also novel in its interdisciplinary approach, which links findings from social science with statistical results on gender distribution in generated images. We expect that this will inspire further interdisciplinary studies.
  3. Using web-scale data as bias baselines. We establish gender-concept co-occurrence frequencies in LAION-400M as a baseline for gender bias. This is a significant contribution because it reduces reliance on demographic statistics. This method allows for a more direct analysis of how biases are amplified beyond real-world contexts, which may be linked to the models (as opposed to training data).

We will revise the paper to make our contributions clearer.


it would improve the completeness of the work to include recommendations or implications for future bias mitigation strategies based on the findings

We thank the reviewer for this important suggestion. While we have a preliminary discussion on debiasing in Appendix I, we agree that the implications of our findings for bias mitigation can be made more explicit and actionable.

Based on our analysis, we will add the following key recommendations to the paper:

  1. Targeted Data Curation: Our fine-grained analysis identifies specific concepts and contexts (e.g., “laundry”, occupations involving physical work) where gender bias is most severe. This can inform mitigation efforts like targeted data collection or strategic rebalancing of existing training sets.
  2. Hybrid Mitigation Strategies (Data + Model): Our finding that models can amplify bias present in web-scale data (LAION-400m) is a crucial insight. It implies that data-centric approaches alone may be insufficient. Future work should investigate hybrid strategies that address bias in both the training data and the training/inference process of the model.

We will add a dedicated subsection to the discussion (Sec. 5) in the main paper, and expand on it in Appendix I.


Are some models more biased than others? What factors (e.g., model size, training objective) contribute to the differences, if any?

This is an excellent question. To provide a detailed comparison of the models, we performed a new analysis examining model-wise differences along two axes: bias direction (which gender a model favors relative to the average) and bias intensity (how skewed, i.e., close to 0 or 1, the gender ratios are).

1. Bias Direction: To see if any model consistently leans more male or female, we calculated the deviation of each model's female ratio $R_f$ from the average $R_f$ across all models for each prompt. As shown in the table below, the results are mixed.

| | Activities | Contexts | Objects | Occupations |
|---|---|---|---|---|
| Flux | 0.031 | -0.045 | 0.002 | -0.040 |
| Flux-Schnell | 0.032 | -0.078 | -0.014 | -0.028 |
| SD-3.5-Large | -0.004 | 0.038 | 0.066 | -0.005 |
| SD-3.5-Medium | 0.001 | 0.089 | -0.028 | 0.029 |
| SD-3-Medium | -0.063 | -0.008 | -0.033 | 0.045 |

In a few cases, models significantly deviate from the model mean, for example SD-3.5-Large generates more women than the average for object prompts, and SD-3.5-Medium for context prompts. Conversely, Flux-Schnell generates fewer women for context prompts, and SD-3-Medium for activity prompts. However, no model consistently shows a positive or negative deviation across all categories.

2. Bias Intensity (Skew): Next, we measure the intensity of bias by calculating the entropy of the gender distribution for each prompt group. Lower entropy indicates a more skewed distribution (i.e., generations are heavily skewed towards one gender), signifying more intense bias.

| | Activities | Contexts | Objects | Occupations |
|---|---|---|---|---|
| Flux | 0.673 | 0.664 | 0.730 | 0.433 |
| Flux-Schnell | 0.676 | 0.636 | 0.806 | 0.422 |
| SD-3-Medium | 0.485 | 0.532 | 0.673 | 0.342 |
| SD-3.5-Large | 0.618 | 0.730 | 0.802 | 0.440 |
| SD-3.5-Medium | 0.591 | 0.733 | 0.756 | 0.418 |

This analysis shows a clear and consistent trend. SD-3-Medium consistently produces the lowest entropy (most skewed) outputs, while SD-3.5-Large is the most balanced (highest entropy). The Flux models fall in between. This allows us to rank the models by the intensity of their gender bias: SD-3.5-Large (most balanced) > Flux / Flux-Schnell > SD-3.5-Medium > SD-3-Medium (most skewed).

This ranking suggests that larger, more recent models may produce more balanced gender representations. However, this conclusion should be taken with caution, since information on training data and objectives is not public. We thank the reviewer for suggesting this interesting analysis, which we will add to the main paper.
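To make the two statistics concrete, the following sketch computes deviation-from-mean and entropy values from per-prompt female ratios; binary entropy in bits is our assumption, and the paper's exact aggregation may differ.

```python
import numpy as np

def bias_direction(rf_by_model):
    """Mean deviation of each model's per-prompt R_f from the cross-model average R_f."""
    models = list(rf_by_model)
    stacked = np.stack([rf_by_model[m] for m in models])  # (n_models, n_prompts)
    prompt_mean = stacked.mean(axis=0)                     # average R_f per prompt
    return {m: float((rf_by_model[m] - prompt_mean).mean()) for m in models}

def bias_intensity(rf, eps=1e-12):
    """Mean binary entropy of the per-prompt gender distribution; lower = more skewed."""
    p = np.clip(np.asarray(rf, dtype=float), eps, 1 - eps)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return float(entropy.mean())
```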


Given that many of the evaluated models are trained on larger, more diverse datasets than LAION-400M, how reliable are the comparisons made using LAION-400M?

Thank you for raising this critical point regarding the limitations of proprietary models. While we do not have access to the exact training data, our use of LAION-400M as a proxy is justified and, we argue, makes our findings on bias amplification more robust.

Our reasoning is twofold:

  1. As noted by Udandarao et al. [A], concept frequencies and scaling trends are consistent across different web-scale datasets. This suggests LAION-400M is a reasonable, publicly available proxy for the type of data these models are trained on.
  2. LAION-400M is likely a conservative baseline. Assuming the reviewer's premise is true, that the proprietary training sets are larger and more balanced than LAION-400M, it would mean our baseline is more biased than the models' actual training data. In this scenario, observing that models still amplify bias relative to our LAION-400M baseline makes our conclusion even stronger. The models are not only failing to mitigate the bias in comparison to our proxy dataset, they are amplifying it.

Therefore, we argue our conclusions are reliable because LAION-400M serves as a reasonable and, importantly, conservative baseline for measuring bias amplification.

[A] Udandarao et al.: No “zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance. In NeurIPS 2024


A discussion or acknowledgment of the additional axes of bias would have made the work more comprehensive and aligned with current trends in fairness research.

We agree that a comprehensive analysis has to consider intersectional biases beyond binary gender. We conducted experiments on race and age, which are detailed in Appendix I (line 1319).

There, our analysis finds a significant form of representational bias, i.e., models demonstrate a strong tendency to generate people who appear to be White and young-to-middle-aged when the prompt does not specify any attributes related to gender or age. This finding is also an important result about the apparent lack of (demographic) diversity in model outputs. This representational homogeneity, however, makes a deeper statistical analysis of intersectional bias (like the one we performed for gender) currently infeasible. There is simply not enough variation in the generated racial and age attributes across most prompts to draw statistically meaningful conclusions. The strong "default" of the models to a single demographic is a major limitation that we will discuss more explicitly in the main paper, as it is a critical form of bias and therefore should be a priority for future work.

Comment

Thank you for the detailed response. The authors’ clarifications have addressed many of my initial concerns.

Regarding novelty, I acknowledge the contributions summarized by the authors. While I agree that the expanded scale and rigor of the analysis are valuable, I still wonder whether the paper presents new and interesting findings that emerge from this analysis.

The comparative results across models were particularly impressive. In the case of Bias Direction, the deviations do not appear substantial relative to the mean. However, in Bias Intensity, the differences are more clearly distinguishable across models. I would be interested in the authors' perspective on what might cause SD-3-Medium to exhibit the strongest bias. In connection with the earlier point on novelty, uncovering and discussing such model-specific behaviors could form the basis of a more insightful and novel contribution.

The explanation regarding the dataset (LAION-400M) was clear and satisfactory.

Overall, I believe that incorporating the key points raised in the rebuttal into the final version of the paper would significantly strengthen its clarity and impact, making it more compelling and insightful for readers.

Comment

Thank you very much for the positive answer! We are happy that we could address many of your concerns in our rebuttal.

Regarding novelty, I acknowledge the contributions summarized by the authors. While I agree that the expanded scale and rigor of the analysis are valuable, I still wonder whether the paper presents new and interesting findings that emerge from this analysis.

In our analysis, we focus on several relevant areas where we systematically find bias that was previously not reported in other published work. For example, the analysis of household chores in Sec. 4.3 (line 281) is entirely novel. We conducted similar analyses for Work/Money-Making activities and for Work-related places (see Appendix H.1 and H.3), which expand perspectives on occupation-bias. There, we showcase how our activity and context prompt sets can yield insights even in relatively well-researched domains, such as gender-occupation bias. Finally, we find that in our activity prompts, women are strongly associated with care-related activities (line 202). This finding was to some extent known in the literature on NLP applications (Kiritchenko and Mohammad, 2018), but to our knowledge, we are the first to demonstrate it for T2I models. The analyses detailed in Appendix G are also all original. Therefore, we think our analyses provide novel insight and valuable contributions to better understanding and documenting gender bias of T2I models. As the reviewer pointed out, these insights can inform debiasing methods and dataset curation. Furthermore, a large-scale analysis such as ours enables a better understanding of relationships of model properties, such as the relation between diversity and bias, as we will discuss in the next point.

Kiritchenko and Mohammad: Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. In *SEM 2018.

The comparative results across models were particularly impressive. In the case of Bias Direction, the deviations do not appear substantial relative to the mean. However, in Bias Intensity, the differences are more clearly distinguishable across models. I would be interested in the authors' perspective on what might cause SD-3-Medium to exhibit the strongest bias. In connection with the earlier point on novelty, uncovering and discussing such model-specific behaviors could form the basis of a more insightful and novel contribution.

This is a really great point, and we're happy to elaborate further on this. As the reviewer pointed out, SD-3-Medium exhibits more skewed gender distributions, but overall, not more or less gender bias than other models. We think this points to an important insight regarding the relationship between gender bias and general model performance. The observation regarding skew is inherently related to diversity, which we analysed in our response to reviewer oE3s. Less diversity directly translates to higher skew, as images generated for a prompt tend to be more similar to each other, and also tend to show people of the same gender. SD-3-Medium is the oldest and weakest of the included models, followed by SD-3.5-Medium, while SD-3.5-Large and the Flux models are considered state-of-the-art. Therefore, these stronger models generate more diverse images, leading to less skew. But overall, this does not lead to less gender bias. This motivates research on gender bias as an important and largely orthogonal direction with respect to general capabilities such as diversity.

Overall, I believe that incorporating the key points raised in the rebuttal into the final version of the paper would significantly strengthen its clarity and impact, making it more compelling and insightful for readers.

We agree with these points and will revise the paper to include all clarifications. Thank you again very much for your great suggestions to improve our paper, and we are happy to discuss any remaining doubts further.

Comment

I appreciate the authors' thoughtful responses. I agree with the additional clarifications they provided, which I believe offer valuable insights to future readers of the paper. I trust that these points will be reflected in the final version of the paper, and I will raise my score accordingly.

Review (Rating: 4)

The authors present a large-scale analysis of gender bias in images generated using T2I models. This analysis considers not only occupations, but different activities, contexts and objects that could influence the gender in the generated images. Furthermore, it studies how these models amplify biases w.r.t. the internet data that they are trained on. Detailed results indicate how these models follow traditional gender roles.

Strengths and Weaknesses

Strengths

  • The diversity of prompts is better than previous works that only consider occupation prompts. Additionally, the large number of images is certainly useful for the analysis.
  • Positive correlations between models shows that most models learn similar biased representations. All analyses are reported with model-specific statistics, enabling comparisons between models.
  • I liked the decision to consider bias amplification with respect to LAION-400m. Results in Table 3 highlight some interesting findings.
  • The paper is well written and very easy to read, and considers ethical issues related to their analysis.

Weaknesses

  1. [Major] Novelty: The authors highlight two areas of novelty - (1) the total number of images they used for the analysis, and (2) the diversity of prompts (going beyond occupations) on which they do this analysis. However, I find there is limited novel technical contribution or analysis methodology in this paper. The pipeline consists of VQA (used previously by StableBias and several other works cited in the paper) for gender detection and a fairly simple metric (Rf) that measures the ratio of females in the image sets, and is limited to the select open-source models (SD & Flux families) that the authors have chosen. It is also unclear in the paper whether the ~2.3M images will be released to the public (abstract only mentions code and prompts) - which could be a useful contribution for further research on biases in T2I models.

  2. [Major] Justification for the need for large image sets: While I agree that using larger image sets is generally a good idea for bias detection, there needs to be some justification for spending the additional GPU hours and time. One way to evaluate this would be to sample smaller subsets of images (10, 25, 40, 100, 200 images per set - as an example) and evaluate how much Rf changes if smaller subsets of images were used. Does increasing the number of images change your analysis? What should be the ideal number of images per prompt? How much variance is present from prompt cluster to cluster?

  3. [Minor] Quality of VQA model: I see that VisoGender is used to justify the use of InternVL for gender detection. However, VisoGender only contains real world images rather than synthetically generated images. It is possible that images containing artifacts from generation (esp. from older diffusion models) may make the VQA model fail to accurately detect gender. It would be valuable to conduct a small user study to test VQA performance on synthetic images, or at least identify any failure cases.

  4. [Minor] Qualitative examples: The paper (or appendix) does not contain any qualitative examples of prompt clusters and their images, or any examples of VQA. Given how many images were generated, it would be valuable to observe how different models represent certain clusters & what kinds of prompts show the most variation. For example, in Fig 5, the laundry cluster seems to show that all models except SD3.5-Large are female leaning. This would be an interesting example to show qualitative results on.

Other comments:

  • Lines 306-308: I agree that LAION-400m may be representative, but is there any citation to justify this?
  • Lines 5-7 (abstract): the wording used here seems to suggest 200 images per prompt per model, but it is 40 images per prompt over 5 models or over 5 prompt variations. Please make this clear. Also, some prior works do consider more images per prompt per model (e.g. TIBET used 48 images per prompt, on any given model), so it would be important to clarify this.

Questions

My rating is primarily based on the aforementioned weaknesses and the questions associated with those, which I encourage the authors to address in their rebuttal. To summarize the main weaknesses:

  1. Novelty: Is there any novel technical contribution I missed, that could be highlighted? Will all images be released? How will this work enable future works in bias analysis?

  2. Image set size: How much do analysis results change when image set size changes? Is 200 a sufficient number of images, or do we need less or more for drawing similar conclusions?

  3. VQA: can you evaluate how well the VQA model does in detecting genders in synthetic images, even in cases where there may be artifacts?

Limitations

Yes - a detailed limitations section is provided in the appendix, in addition to discussions about ethical considerations on gender classification. I find this sufficient.

Final Justification

I have updated my score to reflect the strong rebuttal by the authors, and considering their responses to other reviews. My main concern about novelty remains, but I believe that this is a good analysis paper, especially given its scale.

Formatting Issues

Nothing of note.

Author Response

We thank the reviewer for the clear, constructive, and helpful feedback. We appreciate that the reviewer recognizes the broadness of our analysis as a strength. We also agree that our findings regarding bias amplification with respect to LAION-400m are an important contribution of our study.

In our rebuttal, we address the concerns regarding novelty, and we provide justification for the need for large datasets to study gender bias. We also describe new results regarding the accuracy of our VQA model on synthetic data. Furthermore, we confirm that we will make our data, including all images and gender labels, available after publication.

Based on these clarifications, we hope the reviewer will support us in making our research known to the community, and we are looking forward to a constructive discussion.

Below, we reply to the reviewer's concerns individually.


Is there any novel technical contribution I missed, that could be highlighted? Will all images be released? How will this work enable future works in bias analysis?

While our work does not introduce a new technical method, its novelty lies in three key areas that establish a new, more rigorous standard for bias analysis in T2I models. These contributions are:

  1. Improved experimental methodology. Instead of classifying images holistically as male or female, we detect all people in images, label their perceived gender, and remove images with ambiguous gender or no people. We also exclude entire prompts that generate too few images to yield statistically significant conclusions. These are innovations of the experimental setup as commonly used for bias research of T2I models, and thus enable more valid analysis. We expect that future work will adopt our improved analysis pipeline.
  2. Broader coverage of concepts. We systematically investigate gender bias at scale beyond the well-studied domain of occupations. By curating 3,217 prompts, we provide the resources to analyze bias in more subtle, everyday scenarios (e.g., household chores). This enables a more holistic understanding of the stereotypes T2I models perpetuate. Our work is also novel in its interdisciplinary approach, which links findings from social science with statistical results on gender distribution in generated images.
  3. Using web-scale data as bias baselines. We establish gender-concept co-occurrence frequencies in LAION-400M as a baseline for gender bias. This is a significant contribution because it reduces reliance on demographic statistics. This method allows for a more direct analysis of how biases are amplified beyond real-world contexts, which may be linked to the models (as opposed to training data).

Finally, to directly answer the reviewer's questions:

  • Data Release: We confirm that we will release all data on huggingface, i.e. the 2.3 million filtered images, the full set of 3.2 million generated images, the 3,217 prompts, all detected bounding boxes, and all perceived gender labels. We believe this dataset will be a great resource for the community.
  • Enabling Future Work: Our contributions directly enable future work by providing (a) a more rigorous and reliable audit protocol, (b) a comprehensive set of prompts and a massive, labeled dataset for studying bias in under-explored contexts, and (c) a new baseline methodology for analyzing bias amplification.

We will revise the paper to make our contributions clearer.
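To illustrate contribution 1, here is a highly simplified sketch of the per-image labeling and filtering logic. The `detect_people` and `classify_gender` callables stand in for the object detector and the MLLM, and the `min_images` threshold is illustrative (motivated by the subsampling analysis in the next response), not the exact cutoff used in the paper.

```python
from dataclasses import dataclass

@dataclass
class PersonLabel:
    bbox: tuple          # bounding box of a detected person
    gender: str          # "female", "male", or "unclear" (perceived label)

def label_image(image, detect_people, classify_gender):
    """Detect every person, then query the MLLM for the perceived gender of each crop."""
    return [PersonLabel(box, classify_gender(crop)) for box, crop in detect_people(image)]

def keep_image(person_labels):
    """Keep only images that contain people and no person with an ambiguous label."""
    return bool(person_labels) and all(p.gender in ("female", "male") for p in person_labels)

def prompt_female_ratio(images, detect_people, classify_gender, min_images=100):
    """Per-prompt R_f over retained images; prompts with too few usable images are dropped."""
    genders, n_kept = [], 0
    for image in images:
        labels = label_image(image, detect_people, classify_gender)
        if keep_image(labels):
            n_kept += 1
            genders.extend(p.gender for p in labels)
    if n_kept < min_images:
        return None  # not enough images for a statistically meaningful estimate
    return sum(g == "female" for g in genders) / len(genders)
```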


How much do analysis results change when image set size changes?

Our large number of images per prompt is required for statistical precision. As the reviewer suggests, we conduct a subsampling experiment to quantify the stability of $R_f$ with a varying number of images. For each prompt, we create 1,000 bootstrap samples for sample sizes $n \in [20, 25, \dots, 200]$ and calculate the mean absolute deviation of the sample $R_f$ from the $R_f$ calculated on our full data. The resulting mean absolute deviation values in the following table demonstrate that smaller sample sizes lead to considerable measurement error:

| Model | 20 | 25 | 30 | 40 | 50 | 100 | 150 | 200 |
|---|---|---|---|---|---|---|---|---|
| Flux | .060 | .052 | .050 | .044 | .039 | .029 | .023 | .022 |
| Flux-Schnell | .061 | .055 | .051 | .045 | .041 | .031 | .026 | .025 |
| SD-3.5-Large | .061 | .054 | .050 | .043 | .039 | .028 | .022 | .021 |
| SD-3.5-Medium | .058 | .052 | .048 | .042 | .038 | .028 | .022 | .021 |
| SD-3-Medium | .047 | .043 | .039 | .035 | .032 | .024 | .020 | .019 |

Only with at least 100 images per prompt does the average deviation reliably drop below 3 percentage points. For smaller samples, the noise could easily obscure the effects we aim to measure. For instance, a 6 percentage point average deviation for a sample size of 20 is too high for precise analysis across thousands of prompts.

These findings strongly support the scale of our study. While smaller samples might be sufficient to merely show that some bias exists, they are inadequate for the goal of our work, which is to precisely quantify and compare gender biases across different models and a broad range of concepts. We will add this analysis and justification to the supplementary material.
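A compact sketch of this bootstrap procedure, assuming the per-image gender labels for one prompt are available as a 0/1 array (1 = female):

```python
import numpy as np

rng = np.random.default_rng(0)

def rf_stability(is_female: np.ndarray, sample_sizes, n_boot=1000):
    """Mean absolute deviation of subsampled R_f from the full-sample R_f, per sample size."""
    full_rf = is_female.mean()
    result = {}
    for n in sample_sizes:
        samples = rng.choice(is_female, size=(n_boot, n), replace=True)
        result[n] = float(np.abs(samples.mean(axis=1) - full_rf).mean())
    return result

# Toy example: one prompt with a full-data R_f of roughly 0.3 over 200 images
labels = (rng.random(200) < 0.3).astype(float)
print(rf_stability(labels, [20, 50, 100, 200]))
```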


can you evaluate how well the VQA model does in detecting genders in synthetic images, even in cases where there may be artifacts?

We thank the reviewer for this suggestion. To address this, we conducted two new experiments as requested:

Quantitative Validation on Synthetic Data: We evaluate InternVL on the SocialCounterfactuals dataset [A], which contains over 170,000 synthetic images from an older Stable Diffusion model. On this dataset, InternVL achieves 99.7% accuracy and a Cohen's kappa of 0.994 against the ground truth labels, which means near-perfect performance on synthetic images with known attributes.

Human Agreement Study: We also conducted a study on our own generated images. We randomly sampled 120 bounding boxes (40 labeled male, 40 female, 40 unclear by InternVL) and had them labeled by two human annotators unaffiliated with this submission. We then compared the MLLM labels to the human labels.

The model-to-human agreement (Annotator 1: $\kappa$ = 0.78 and Annotator 2: $\kappa$ = 0.71) is substantial and nearly identical to the human-to-human inter-annotator agreement ($\kappa$ = 0.775). Here, it is critical to note that InternVL performs on par with a human annotator for this task, as its agreement with a human is bounded by the agreement between two humans. This validates our use of InternVL for labeling perceived gender in our generated images. We will add a detailed description of these experiments to the supplementary material.

[A] Howard et al.: Socialcounterfactuals: Probing and mitigating intersectional social biases in vision-language models with counterfactual examples. In CVPR 2024
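For reference, such agreement values can be computed with scikit-learn; the label arrays below are placeholders, not the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder labels for the 120 sampled bounding boxes.
mllm_labels = ["female", "male", "unclear"] * 40
annotator_1 = ["female", "male", "unclear"] * 39 + ["male", "female", "unclear"]

# Model-to-human agreement; the same call is used for human-to-human agreement.
print(f"Cohen's kappa: {cohen_kappa_score(mllm_labels, annotator_1):.3f}")
```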


it would be valuable to observe how different models represent certain clusters & what kinds of prompts show the most variation

We agree with the reviewer that qualitative examples would strengthen the paper. We will add new figures providing a detailed qualitative analysis of the "laundry" cluster, as suggested. We will show representative images for each of the five models, including examples of female-labeled, male-labeled, unclear, and multi-person generations to provide a direct visual comparison. These examples illustrate the gender difference for SD-3.5-Large, and show that people are detected well and that the gender labels appear accurate.

To inspect prompt variance, we calculate the standard deviation across models for all prompts and show the 10 prompts with highest standard deviation in the following table:

| Prompt | Std. Dev. | Prompt | Std. Dev. |
|---|---|---|---|
| play volleyball | .42 | take one's kids to soccer practice in the rain | .40 |
| do a kickboxing lesson | .42 | play lacrosse | .39 |
| nurse anesthetists | .40 | volleyball court | .38 |
| do house cleaning | .40 | play fighting games | .38 |
| go to the gym and run a three miles on the treadmill | .40 | jog 5 miles at the health center | .37 |

Interestingly, 8 of the 10 prompts with highest variance are related to sports, which shows that across models, associations between gender and sports are unstable. Prompts in the laundry cluster have an average std. dev. of 0.22, which ranks them in the 56th percentile (from lowest to highest).


I agree that LAION-400m may be representative, but is there any citation to justify this?

Indeed, there is evidence that LAION-400m is a representative sample of web-scale data. First, it was sourced from CommonCrawl, which mirrors web content at large scale. Second, a study by Udandarao et al. [B] showed similar concept frequency scaling trends in LAION-400m and smaller datasets, so these trends will likely extrapolate to larger datasets. Therefore, we consider LAION-400m to be a good proxy for multimodal datasets used to train foundation models, even if they are in reality larger. We will include this reference in the paper.

[B] Udandarao et al.: No “zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance. In NeurIPS 2024


Lines 5-7 (abstract): the wording used here seems to suggest 200 images per prompt per model

Thank you for pointing out that this statement is imprecise. In our updated version, this sentence reads: “We create a dataset of 3,217 gender-neutral prompts and generate 200 images per prompt (40 images over 5 prompt variations) from five leading T2I models”. However, we would like to clarify that we generate 200 images per prompt per model: 5 models x 3217 prompts x 5 prompt variations x 40 images per variation = 5 models x 3217 prompts x 200 images.


some prior works do consider more images per prompt per model, e.g. TIBET

We invite the reviewer to check if our overview of related work in Appendix J (especially in Tab. 12) already addresses this concern. There, we list the total number of images and the number of gender-neutral prompts used in the respective work. We will add TIBET to this table in our updated version.

Comment

Thank you for the clear answers and clarifications. I really appreciate the effort.

Regarding number of images and VQA, the new experiments make it clear that there is a need for the larger image sets, and that given your filtering strategy (removing ambiguous images), the gender classification is accurate.

Regarding novelty, I share the same thoughts as Reviewer PQHE - that while the scale and analysis is valuable, the takeaways from this work did not stand out as novel or especially interesting. Now having read all of the rebuttals, I believe that the new experiments shown add value and can make this paper stronger. Additionally, the clarifications around the contributions (& data release) and the inclusion of some qualitative results will strengthen the paper further.

Comment

Thank you very much for the detailed answer! We are happy that we could address almost all of your concerns. We will include all new experiments and clarifications in our revised version.

Regarding your concerns about novelty, our work features improved experimental methodology, broader coverage of concepts, and uses web-scale data as bias baselines. We think that all of these are relevant contributions and improve over previous works on gender bias. We would be happy to address any remaining concerns regarding this point that could help with your final decision.

Additionally, we would like to invite you to check reviewer 1phu's assessment that our work is significant and our focus on more dimensions (rather than just occupations) is notable, so they argue that novelty in this case is not grounds for rejection.

Again, thank you very much for your support and constructive feedback!

Final Decision

The paper proposes a methodology and a dataset for large-scale analysis of gender bias in images generated using T2I models. The paper received 2 x Weak Accepts and 2 x Borderline Accepts. Despite the generally positive outlook of the reviewers, a number of concerns have also been raised, mainly focusing on (1) lacking novelty, and (2) justification for some design choices. The rebuttal addressed some of these concerns; however, (1) remains.

Specifically, [6rQH] states "my main concern about novelty remains". This is also echoed by [PQHE]: "falls short of delivering a strong message beyond confirming the existence of known biases", which further comments on the significance of the findings.

The AC has carefully read the reviews, rebuttal and discussion that followed, as well as the paper itself. The AC feels that, despite the overall positive sentiment of the reviewers, the remaining concerns about novelty and significance of findings are important to consider. Specifically, the methodology employed by the paper is largely in line with prior work, and simply differs in scale (more diverse prompts and a higher number of images). Further, the current experimental design, while thorough, focuses on only one form of bias (gender), making the findings less general and significant overall. Finally, as [PQHE] notes, while the paper presents a thorough quantification of some T2I models regarding their gender bias amplification, these sorts of amplification effects have been observed in the past and the paper does not provide any significant novel insights or ways of mitigating such issues.