Reliable Image Quality Evaluation and Mitigating Quality Bias in Generative Models
Abstract
Reviews and Discussion
This paper proposes an evaluation score to quantify the reliability of existing evaluation metrics for image quality assessment in the scenario of quality discrepancy given certain protected attributes (e.g., gender, race, etc.). This score, called DQA, mainly focuses on evaluating bias in image encoders. The authors also use this score to guide the sampling process in text-to-image generative models, in order to generate images with less quality discrepancy across protected attributes.
Strengths and Weaknesses
Strengths:
- This paper includes a good amount of analysis and experiments, including synthetic distribution analysis, image encoder analysis, and guidance-based generation.
- The paper tests several image encoders with different architectures.
Weaknesses:
- Unclear definition of quality bias. Quality bias lacks a clear definition in the paper, and several different interpretations appear. A) The quality of generated images given different genders should be equal; B) The quality shift of generated images from reference images should be equal across genders. These two definitions are not the same and lead to different consequences. In the experiments, the authors appear to adopt definition B): they use images generated from T2I models (baseline) as reference images, whose quality may not be the same across protected attributes (e.g., gender), and the goal becomes ensuring the same quality shift across protected attributes. I do not understand why we should care about the quality shift being the same.
- The evaluation dataset construction is not convincing. Generating images using the same degraded sampling procedure does not mean that the generated images will exhibit the same level of quality degradation across protected attributes. To ensure the same level of degradation, adding the same level of noise, distortion, color jittering, or blur could be a better option.
- The determination of which group a generated image belongs to may itself be biased. The authors categorize generated images based on the prompt used for generation. However, it is well known that text-to-image models do not always faithfully follow prompts, so a generated image may depict a different gender than the prompt specifies. Adding a consistency score to verify that the generated images follow the prompt would help.
- In addition, the synthetic data analysis proving the unreliability of FID also relies heavily on this miscategorization assumption (point 3 above). As shown in Fig. 3, Group B is shifted toward Group A, which could be understood as the generated images (shifted distribution) having a different gender than the referenced one. Using this miscategorized group to obtain group-wise quality discrepancy does not align with reality.
- The setup of the guidance experiments has some concerning aspects. It seems that the authors use the same reference set for guidance and evaluation. It is not surprising that the metrics improve when the gradient of the metric itself is used as guidance. It would be better to use a tailored dataset, e.g., FairFace, as the reference set for evaluation. It would also be interesting to see other metrics for quality measurement, e.g., aesthetic score.
Questions
Besides the questions mentioned in the weaknesses, I have some follow-up questions.
- As I mentioned in weakness item 1, why should we care about the quality shift being the same, rather than the quality itself being the same across protected attributes?
- In Figure 4b), where do the real images come from? What are the criteria for determining "well-generated" and "poorly-generated"? Furthermore, "reference" in Figure 4 refers to real images, whereas "reference" refers to baseline-model generations in the experiments; this causes some confusion when reading the paper. What are the poorly-generated images in the blue circle? Are these images perceptually similar to the other perceived gender?
- Could the authors double-check the values in Figure 2 and other related figures and tables? In general, FD with DINO features is larger than with Inception features [1-2]; however, this is not the case in this manuscript.
[1] Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models https://arxiv.org/abs/2306.04675
[2] Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples https://arxiv.org/pdf/2302.04440
Limitations
yes
Final Justification
I appreciate the authors' responses during the rebuttal. One of my major concerns is the definition of quality bias and how the empirical setting corresponds to this definition. During the rebuttal, the authors provided more details on how they generate the reference images by following previous work [1]. This pipeline somewhat alleviates my concern, but it also relies heavily on VQA models to judge the images, and many previous works have noted potential bias in VQA-model judgments. There is no guarantee that the generated images across different demographic groups will have the same quality.
Since the reference images may not have the same quality across demographic groups, the soundness of the subsequent encoder analysis remains somewhat questionable.
I raise my rating from 2 -> 3.
[1] N. Lui et al. "Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision." AAAI 2024.
Formatting Issues
no
We thank the reviewer for the insightful comments and suggestions.
W1, Q1: Definition of Quality Bias, Clarifying Our Fairness Objective
The quality evaluation is based on the distance between representations of generated and reference images. We clarify that the notion of quality bias in our paper follows definition A, not B.
- A: The quality of generated images should be equal across genders, as described in Section 3.
- B: The shift in quality of generated images from their reference images should be equal across genders.
To measure fairness in the sense of A, FID is widely used, where the image encoder is InceptionV3 and the distance measure is the Fréchet distance. However, we argue that FID can fail to reliably capture bias A. As illustrated in Figure 2, an evaluation metric may incorrectly rate a poorly generated female nurse image as being of higher quality than a well-generated male image. This creates a trustworthiness issue in quality evaluation and motivates our key research question (see Lines 56, 72, and 148): "Which evaluation metric is reliable for fairness in image quality?"
To answer this, we construct a controlled dataset with known quality shifts and examine how different image encoders respond across groups with equally degraded inputs. This setup is not a redefinition of quality bias, but a means to assess whether a metric reflects the true underlying quality gap fairly across groups.
In summary:
- Our definition of quality bias is A, not B.
- We do not aim to ensure the same quality shift across groups as a fairness goal.
- Rather, we utilize controlled quality shifts as a diagnostic tool to assess the reliability of evaluation metrics in detecting quality bias.
- Quality shift is used only to identify reliable image encoders for evaluation by isolating the bias inherent in the encoder.
We hope this distinction clarifies the rationale behind our experimental setup.
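For concreteness, one possible formalization of the two definitions is sketched below; the notation (f for the encoder, D for the distance, and the group and superscript labels) is ours and illustrative, not the paper's exact equations:

```latex
% f = image encoder, D = distance (e.g., Frechet distance or MMD),
% g in {A, B} = demographic groups (e.g., male, female).
\begin{align*}
\text{(A)}\quad & D\!\left(f(G^{\mathrm{gen}}_{A}),\, f(G^{\mathrm{ref}}_{A})\right)
                 = D\!\left(f(G^{\mathrm{gen}}_{B}),\, f(G^{\mathrm{ref}}_{B})\right) \\[4pt]
\text{(B)}\quad & \Delta_{A} = \Delta_{B},
  \qquad \Delta_{g} := D\!\left(f(G^{\mathrm{deg}}_{g}),\, f(G^{\mathrm{base}}_{g})\right)
\end{align*}
```
Under this reading, (A) is the fairness objective, while the equal, controlled shifts in (B) are only the diagnostic lever used to probe the encoder.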
W2: Constructing Dataset
While traditional image degradation techniques (noise addition, color jittering, or blurring) are commonly used in image quality research, they do not reflect the nature of degradation observed in generative models. Our objective is to simulate the quality failures specific to image generation, rather than conventional distortions.
Our controlled dataset construction is grounded in prior work. We follow the degradation protocols used in [1] and [2], both of which analyze generation failures that emerge from sampling dynamics (e.g., truncated denoising or reduced steps). These reflect real-world failure modes of diffusion models.
Moreover, [3] draws a clear distinction between traditional distortions and generative artifacts. It shows that defect classifiers trained on conventional degradation types fail to generalize to generation-induced artifacts, further supporting our decision to simulate degradation through the generative process itself.
In summary, our degradation strategy:
- Accurately reflects generation-specific quality failure modes.
- Is aligned with prior literature focused on evaluating and analyzing diffusion model failures.
This approach provides a more valid basis for assessing the fairness and reliability of evaluation metrics in generative models.
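As a rough illustration of this kind of generation-induced degradation, here is a minimal sketch using the diffusers library; the model ID, step counts, and the omission of a refiner are illustrative assumptions, not the exact T1–T6 configurations from Appendix D:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "A photo of the face of a female firefighter"

# "Well-generated" reference: full sampling budget, fixed seed shared across groups.
generator = torch.Generator("cuda").manual_seed(0)
ref_image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]

# "Poorly-generated" counterpart: truncated denoising (fewer steps), no refiner pass.
generator = torch.Generator("cuda").manual_seed(0)
degraded_image = pipe(prompt, num_inference_steps=8, generator=generator).images[0]
```
The same prompt, seed, and pipeline are reused for each demographic group so that only the sampling budget changes between the reference and degraded sets.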
[1] A. Borji. Qualitative failures of image generation models and their application in detecting deepfakes, 2023.
[2] Kim et al. Model-Agnostic Human Preference Inversion in Diffusion Models, 2024.
[3] Wang et al. "CNN-generated images are surprisingly easy to spot... for now." CVPR 2020.
W3: Prompt-Related Bias
We appreciate the reviewer’s concern regarding the possible mismatch between prompts and actual attributes in generated images. To address this, we adopt the generation pipeline proposed by [4], which was designed specifically to mitigate such inconsistencies.
As noted in Line 175, [4] employs a carefully controlled prompt structure with explicit attribute descriptions (e.g., gender, race) and performs post-processing using an inpainting model to correct any deviations. We directly reuse both the prompt format and pipeline from [4], ensuring alignment between prompt and image.
Moreover, we manually reviewed all generated images in our dataset. No instances of attribute mismatch were observed. Although our current pipeline addresses this concern effectively, we agree that integrating automatic consistency scoring could provide additional rigor in future work.
[4] N. Lui, et. al. Leveraging diffusion perturbations for measuring fairness in computer vision. AAAI 2024.
W4: No Miscategorization Assumption
We respectfully clarify that our work does not rely on any miscategorization assumption, nor does our dataset contain such issues. Figure 3 is just a synthetic illustration (see Appendix A) designed solely to highlight the potential pitfalls of group-agnostic evaluation metrics like FID. The figure does not assume or imply any misclassification of group identity in real or generated samples. Rather, it demonstrates that even under idealized and controlled group separation, a group-agnostic score may yield misleading interpretations.
Thus, our method is not predicated on any miscategorization scenario, and the evaluation is built on valid, well-annotated data.
W5-1: On Reference Set for DQA-Guidance
We agree with the reviewer that it is reasonable to use a desired-quality dataset as the reference for guidance. However, using the FairFace dataset introduces two concerns:
- Quality Bias in Real Images: As acknowledged in the FairFace paper, the dataset contains quality imbalance, e.g., some photos were captured by professional photographers, introducing group-level quality disparities.
- Limited Coverage: FairFace includes only facial images, whereas failures in generative models often occur beyond the face, e.g., in limbs, which would not be captured by facial datasets alone.
We prioritize real datasets whenever feasible. As noted in Appendix G.1 (line 624), we use a real chest X-ray dataset for medical image generation. This dataset provides consistent quality across genders, with quality control ensured by human annotators.
W5-2: Different image quality assessment
Thank you for the suggestion. In Appendix I, we also assessed DQA for alternative image quality assessment (IQA) methods. These include VQA-based evaluations and general-purpose IQA models.
We summarize the alternative evaluation strategies below:
- VQA-based methods: Use visual question answering models (e.g., PaliGemma) to detect unrealistic artifacts by asking questions such as: "Is this image real or fake?"
- General IQA models: Evaluate perceptual quality using models like TOPIQ, which are trained to detect visual degradation (e.g., blur, noise, color issues).
| Method | PaliGemma | PaliGemma GAP | TOPIQ | TOPIQ GAP |
|---|---|---|---|---|
| No Guidance | 0.995 | 0.02 | 0.752960 | 0.008874 |
| DQA-Guidance | 1.000 | 0.00 | 0.752189 | 0.006692 |
These results indicate that DQA-Guidance does not compromise overall image quality, while effectively improving fairness across demographic groups under other image quality assessment methods as well.
Q2: Clarifying Real Images and Terminology in Figure 4
Thank you for pointing out the potential confusion. The real images shown in Figure 4b are included as an illustrative example to help interpret how different image types are positioned in the embedding space. These were not used in any experiment. As noted, they were sourced from publicly available Google Image Search, and we will clarify this explicitly in the revised version.
Regarding the criteria for well- and poorly-generated images, we follow a consistent convention throughout the paper. The "reference" set refers to well-generated images (as described in Figure 5), generated with full sampling steps, lower noise, and a refinement stage. The "poor-generated" images (e.g., in T6) are created under a degraded setting with fewer diffusion steps, stronger noise, and no refinement module, resulting in visibly lower-quality outputs.
To address your question about the gender in the poor-quality set (T6): as discussed in W3, all images maintain their intended gender and profession. As shown in Figure 8, these images remain identifiable in terms of gender and profession but exhibit clear distortions in face, limbs, and background, which reduce overall perceptual quality.
We will revise the text and figure caption to ensure that the usage of "reference" and the nature of the real and generated images are unambiguous.
Q3: Scale of FD with DINO
Thank you for raising this point and for citing helpful references. The DINO model we used is DINOv1 [6] with ResNet-50, not DINOv2 [7] with ViT-L/14 as discussed in the cited papers. This difference explains the discrepancy in FD values, as the training objectives, regularization strategies, and architectures (CNN vs transformer) differ significantly.
Additionally, reference [8] does not make a general claim that FD values are always higher with DINO than with Inception. For example, [9] (Fig. 2) mentions that FD with DINOv2 is higher in their setup but attributes this to differences in sample count. Thus, direct comparison of absolute FD values across models, datasets, and evaluation settings is not meaningful, and we suggest interpreting FD within-model rather than across different architectures.
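For reference, the Fréchet distance itself is encoder-agnostic; only the feature extractor changes, so its absolute scale depends on the feature space. A minimal sketch of the standard FD computation over pre-extracted features (the helper below is ours, written with numpy/scipy):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_gen: np.ndarray, feats_ref: np.ndarray) -> float:
    """FD between Gaussians fitted to two feature sets of shape (N, dim)."""
    mu1, mu2 = feats_gen.mean(axis=0), feats_ref.mean(axis=0)
    sigma1 = np.cov(feats_gen, rowvar=False)
    sigma2 = np.cov(feats_ref, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```
Because the means and covariances live in the chosen encoder's feature space, FD values are only comparable within a fixed encoder, which is the point made above.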
[6] Emerging properties in self-supervised vision transformers.
[7] Dinov2: Learning robust visual features without supervision.
[8] Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models.
[9] Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples.
Thank you very much for your responses and clarifications. Given the current rebuttal, many of my concerns are not effectively addressed.
- W1, Q1. I was aware of all the referenced lines the authors provided during my original review. The authors also acknowledge the mismatch between their definition of quality bias (as in Section 3) and the one used in the empirical settings (as in the validation of the encoders and the DQA guidance, Line 264). In summary, while the authors claim that their definition of quality bias is A, they use model generations (baseline) as the reference set and degraded model generations as the generated set in all their empirical evaluation settings, leading to the quality-bias-shift definition B from my original review.
- W3. Thank you for pointing to the referenced literature. [1] first generates images using prompts with unspecified gender, for example, "A firefighter" as described in Section 3.1, followed by sophisticated VQA filtering, inpainting, and FairFace filtering steps to create images with the desired genders or races. However, if I understand Line 175 correctly, the authors seem to use prompts that already contain gender information, e.g., "A male firefighter", to generate images. I do not see supporting evidence from [1] that this one-step generation ensures alignment between prompts and images regarding gender. As such, adding some consistency score (e.g., VQA evaluation as in [1]) is necessary.
[1] N. Lui, et. al. Leveraging diffusion perturbations for measuring fairness in computer vision. AAAI 2024.
- W5-1. I do not agree with your statement "As acknowledged in the FairFace paper, the dataset contains quality imbalance, e.g., some photos captured by professional photographers, introducing group-level quality disparities." This is cited from the first paragraph of Section 3.2 of [2], where the authors criticize quality bias as a shortcoming of previous datasets. The FairFace dataset itself, however, was built "to minimize the selection bias introduced by such filtering and maximize the diversity and coverage of the dataset", as stated in the second paragraph of the same section (3.2 of [2]).
[2] Kärkkäinen et al. FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age. 2019
- Q3. Thank you for clarifying which image encoder model you are using. I have a follow-up question: why did you choose not to use DINOv2, a better and more expressive image encoder? [3] suggests that "DINOv2-ViT-L/14 allows for much richer evaluation of generative models".
[3] Stein et al. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. 2023
Thank you again for your rebuttal!
Thank you for your valuable and insightful questions! We appreciate the opportunity to clarify the points you've raised, which will undoubtedly enhance our paper. We're confident that addressing your concerns through minor revisions will suffice to answer your additional questions.
W1, Q1: Definition of Quality Bias
Thank you for the thoughtful follow-up. As the reviewer correctly noted, our fairness objective is grounded in:
A: the generation quality gap across groups, i.e., whether the quality of generated images is equal across genders.
However, in our controlled diagnostic setting, we approximate this using:
B: the shift in quality of each group's generated images from its (generated) reference set, compared across groups.
This is not a redefinition of quality bias, but a diagnostic tool to assess whether an encoder can reliably detect the same quality shift across groups.
We have carefully considered using real-world images as references for (A) but found them unsuitable for controlled evaluation. The key challenge is: Do the original images for two groups have the same quality?
In practice, real images often differ in background, lighting, pose, and dataset quality. This makes it difficult to determine whether embedding differences stem from encoder bias or from inherent image variation. Such ambiguity undermines the purpose of isolating encoder sensitivity to subtle quality shifts.
Our controlled degradation framework addresses this by using synthetic but high-quality references with matched distributions across groups. By introducing the same known degradations to both groups, we can more effectively assess whether the metric behaves consistently and fairly.
That said, as emphasized in Appendix G.1 (Line 624), we do use real-world datasets when reference quality is reliably balanced e.g., in the medical image setting where radiologist-approved Chest X-rays ensure consistent reference quality across gender. In these cases, the fairness objective (A) is directly applicable.
In summary:
- The use of controlled degradation (B) serves as a proxy for reliable metric assessment, not as a fairness goal.
- Real references introduce confounds; our synthetic setup allows clean, fair comparisons.
- When real references are suitable (e.g., Chest X-rays), we use them.
W3: Consistency in the Controlled Dataset
We appreciate the reviewer’s follow-up and the clarification regarding our use of [1].
1. Clarifying Our Use of [1]
As noted in Line 566 and Appendix D, we follow a two-stage generation pipeline similar to [1]. While Figure 1, 2, and Section 3.1 of [1] describe initial generation using gender-neutral prompts, their fairness analysis in Section 5.2 explicitly adopts gender-specific prompts for both base image generation and inpainting. Quoting directly from [1]:
“We do so by specifying a gender in the prompt used for image generation and inpainting (e.g., ‘A photo of the face of a [male|female] firefighter’).”
Furthermore, [1] states:
“Given that we specify the same gender in the prompts used for base image generation and inpainting, it is much less likely that the presenting gender in the inpainted image will differ from that of the original base image.”
Our pipeline aligns with this setting.
2. Two-Stage Generation with VQA Filtering
Following [1], we generate images using gender-specific prompts and apply a VQA-based filtering step, as recommended in their official GitHub repository. The VQA filtering uses the following three prompts:
- a) Is there a {GENDER} {PROFESSION} in this image?
- b) Are this person's limbs distorted?
- c) Is this image real or fake?
We collect only images that pass all three questions, and retain 250 images per (gender, profession) combination after this filtering. While VQA consistency may not be perfect in general, the filtering process ensures that only verified samples are kept, making our dataset 100% consistent with respect to the filtering criteria.
Following VQA filtering, we perform inpainting with explicitly specified gender, race, and profession, again, mirroring the full setup from [1].
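For illustration, a minimal sketch of the filtering step, with an off-the-shelf BLIP VQA model from Hugging Face as a stand-in for the VQA model used in [1]'s repository (the model choice and the simple answer parsing are our assumptions, not the exact pipeline):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def passes_filter(image: Image.Image, gender: str, profession: str) -> bool:
    """Return True only if all three VQA checks give the expected answer."""
    checks = [
        (f"Is there a {gender} {profession} in this image?", "yes"),
        ("Are this person's limbs distorted?", "no"),
        ("Is this image real or fake?", "real"),
    ]
    for question, expected in checks:
        inputs = processor(image, question, return_tensors="pt")
        answer = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)
        if expected not in answer.lower():
            return False
    return True

# Images that pass all three checks are retained, up to 250 per (gender, profession).
```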
3. Planned Revision
We agree that our current manuscript does not sufficiently explain this multi-step pipeline. In the revision, we will clarify that:
- We use gender-specific prompts for both stages;
- We apply VQA-based filtering;
- We ensure 100% prompt-image consistency based on these filters.
Thank you again for encouraging us to elaborate on this aspect.
[1] N. Lui et al. "Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision." AAAI 2024.
W5: Regarding Quality Bias in the FairFace Dataset
We appreciate the reviewer's clarification and agree that the FairFace paper aims to reduce quality bias compared to previous datasets. However, we would like to clarify the basis of our concern.
While FairFace attempts to minimize the impact of selection bias and improve diversity, it does not fully eliminate group-level quality disparities. Collecting a broader and more diverse dataset does not, by itself, ensure consistent image quality across groups. Specifically, FairFace images are sourced from YFCC100M [2], which is known to contain highly heterogeneous content. As reported in the YFCC100M paper (Table 2), images come from more than 25 different camera models (from phone cameras to professional/high-end cameras) and are uploaded by users to Flickr, a public photography platform. This variation in camera quality, lighting conditions, and image composition introduces uncontrolled variability that may persist as quality bias across demographic groups.
Our concern lies in the inherent quality variation within the source dataset, and sampling from multiple datasets would not resolve this issue. This variation may affect the reliability of FairFace as a quality reference set for evaluation and guidance.
To ensure controlled and interpretable evaluation, our approach uses a synthetic dataset with known degradations. This allows more accurate assessment of how encoders respond to quality variations across groups.
We hope this explanation clarifies our rationale.
[2] Thomee, Bart, et al. "YFCC100M: The New Data in Multimedia Research." Communications of the ACM, 2016.
Q3: Results with DINOv2 and Clarification on Encoder Choice
Thank you for the thoughtful follow-up. We agree that DINOv2 is a state-of-the-art image encoder with rich representations, and as mentioned in our response to Reviewer bZa5, incorporating DINOv2 results further strengthens our paper.
There are key differences between DINO and DINOv2. DINO is trained on the ImageNet-1K dataset and supports a variety of backbone architectures, including CNNs like ResNet-50. In contrast, DINOv2 is trained on a much larger dataset (LVD-142M) and is restricted to ViT-like architectures due to its patch-level training objectives. This architectural constraint limits its flexibility in encoder selection.
To address the reviewer’s question, we conducted additional experiments with DINOv2 and BYOL-RN50. The results confirm our main findings:
- CNN-based encoders tend to exhibit more reliable behavior than ViT-based ones in our fairness evaluation setting.
- Training on a larger dataset does not necessarily lead to greater reliability in measuring fairness.
| Method | Backbone | Training Dataset | Gender DQA | Race DQA |
|---|---|---|---|---|
| DINO | RN50 | ImageNet-1K | 0.1684 | 0.1764 |
| DINOv2 | ViT-B14 | LVD-142M | 0.3240 | 0.3328 |
| BYOL | RN50 | ImageNet-1K | 0.2253 | 0.2089 |
In [3], DINOv2 provides rich representations and aligns well with human perception. However, it demonstrates lower consistency across demographic groups under our proposed DQA metric.
This difference arises from two distinct fairness perspectives:
- [3] focuses on perceptual quality assessment, addressing how image encoders such as Inception V3 may unfairly under-credit the visual quality of generated images. DINOv2 is shown to better reflect human judgments and is therefore considered fairer in this perceptual alignment sense.
- In contrast, we propose a demographic fairness perspective, which asks whether an encoder evaluates generated images consistently across different demographic groups.
Our work moves the discussion forward in the following ways:
- We introduce DQA, a metric to quantify demographic consistency of image encoders. This dimension is not considered in [3].
- We show that perceptually strong encoders such as DINOv2 may still exhibit demographic disparity, revealing a new axis of representational bias.
- We offer an automatic diagnostic tool to identify and mitigate such disparities, supporting fairness-aware selection and use of encoders in generative pipelines.
In short, while [3] focuses on alignment with human perception, our work contributes a new fairness lens centered on demographic parity and provides actionable tools for its assessment and improvement.
[3] Stein et al. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. 2023
I appreciate the detailed responses from the authors. I believe that my major concerns are addressed after the discussion.
I strongly suggest that the authors clearly explain the dataset construction pipeline in the revised version, as this is crucial to the soundness of the empirical results. This also alleviates, to some extent, my concern about the definition of quality bias / quality bias shift, since the controlled generated reference images across different demographic groups are likely to have the same level of image quality. In this sense, it can serve as a valid proxy for the quality bias defined by the authors.
I have a follow-up question that I would like to discuss with the authors:
- You contrast human perception with fairness evaluation in your answer for not using DINOv2. In my opinion, at least in the context of text-to-image models, fairness evaluation should be aligned with human perception. Is the fairer result from less powerful models an indication that some bias is ignored by these image encoders, thus leading to a "fake" fairness?
Human Perception Analysis
We appreciate the reviewer’s comment regarding the lack of human evaluation.
In this work, we focused on designing a scalable and automated evaluation framework using DQA, which captures relative group-wise disparities in image quality without relying on costly and potentially subjective human annotation. Our motivation stems from the inherent difficulty of obtaining consistent and fair human evaluations across large-scale datasets and demographic conditions. Specifically, we have 120k samples, as mentioned in our paper:
250 images for each combination of profession, gender, and race, resulting in 20,000 images per scenario (10 professions, 2 genders, and 4 races),
where a scenario refers to each degradation type, T1 to T6.
We acknowledge the importance of grounding DQA in human perception and have noted this as a limitation in Section 5.4. The observed alignment between DQA values and perceptual differences across demographic groups suggests its potential as a reliable proxy. We consider incorporating structured human evaluations a valuable direction for future work to further reinforce the practical relevance of DQA.
Fairer but Less Powerful Model
Thank you for the thoughtful follow-up question.
We clarify that in our study, the generated images are controlled to have comparable quality across demographic groups. Therefore, the encoder is not "ignoring bias" in the input, because the input itself contains no systematic bias by design.
This controlled setup ensures that any disparity measured by DQA arises from the encoder’s behavior rather than differences in input quality. Our analysis shows that a higher DQA indicates group-specific sensitivity in the encoder's output, which reveals fairness issues. In contrast, a lower DQA reflects more equal treatment across demographic groups by the encoder, given that the input quality is controlled via synthetic degradations.
The concern of "fake fairness", where an encoder assigns the same embedding regardless of input quality, indicates a poor representation power. Such encoders might yield low DQA, but at the cost of ignoring all variation, not just group differences.
Hence, we emphasize that fairness (measured by DQA) and representation ability (like [3]) are parallel and complementary dimensions, not contradictory. Ideally, a good encoder should achieve both: low DQA (fairness) and high sensitivity to meaningful differences (representation), treating all the demographic groups' degradation equally.
[3] Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models.
This paper aims to address quality bias in image generative models (e.g., diffusion models), where certain demographic groups (by gender or race) receive systematically lower-quality outputs. It identifies two flaws in standard metrics like FID:
a) Inappropriate reference sets: combined references mask group-specific quality gaps;
b) Biased encoders: models like InceptionV3/CLIP produce inconsistent embeddings across groups.
To address this, the authors propose DQA (Difference in Quality Assessment), a metric quantifying an encoder's reliability in fairly assessing quality across quality-comparable groups. They also propose DQA-Guidance, a sampling-time method for diffusion models that uses DQA gradients to reduce quality disparities without retraining.
Strengths and Weaknesses
Strengths:
[S1] A meaningful work to systematically diagnose and mitigate quality bias in generative models.
[S2] DQA isolates encoder bias via controlled datasets and relative group comparisons, and the DQA-Guidance integrates fairness into diffusion sampling, enhancing quality while reducing bias.
[S3] The experiments are solid and methodologically rigorous.
Weakness:
[W1] Ambiguity in Image Quality Definitions. The criteria for categorizing images as well-generated or poorly-generated remain somewhat vague. It is unclear whether quality distinctions derive from human evaluation or are solely determined by hyperparameter adjustments in the generative model. Crucially, the methodology for ensuring strict quality equivalence between images across demographic groups, a foundational premise for DQA, requires further elaboration. Besides, terms such as "comparable quality" (Line 154) and "consistent quality" (Line 160) similarly lack precise operational definitions.
[W2] Rationale for DQA Term.
[W2.1] The group-specific disparity metric effectively isolates encoder-induced bias when true quality is matched across groups. However, as noted (Line 137), the embeddings of poorly generated images might fall within the in-distribution regions of the reference images. If the poorly generated images of both groups exhibit similar proximity to their references, the numerator of DQA approaches zero. Consequently, DQA may fail to detect evaluation bias in such cases. Could the authors provide insights into this?
[W2.2] The inclusion of the denominator (global distribution shift) warrants clarification: How does this term improve sensitivity to relative bias versus absolute differences? Could case analyses illustrate scenarios where the denominator adjusts misinterpretations (e.g., when overall low quality masks group-wise disparities)?
[W3] The study employs external encoders (e.g., DINO-RN50) to compute DQA, then applies this to enhance diffusion models. However, an alternative approach might leverage the generative model's internal representations (e.g., tokens, latent features) directly for bias quantification.
[W4] Limited Human Evaluation. The paper relies on proxy metrics (MMD/DQA), while "human perception" of quality fairness is not directly validated.
[W5] Scope Constraints. The paper primarily focuses on gender/race/profession; other attributes (age, disability) appear unexplored.
Questions
please refer to the weaknesses.
Limitations
yes
Final Justification
The authors have addressed my concerns well in their rebuttal, and despite the issues of vague definitions in the original manuscript, I maintain a positive evaluation and hope the reviewers' comments will help the authors better present the paper.
Formatting Issues
nan
We thank the reviewer for the insightful comments and suggestions.
W1: Clarification on Image Quality Definitions and Equivalence
- Thank you for raising this important point. We clarify that the terms "well-generated" and "poorly-generated" do not reflect human evaluation but are conceptual labels referring to controlled generation settings. Specifically, they correspond to predefined corruption levels (e.g., T1 and T6 in Figure 5), which are described in Section 4.2 and detailed in Appendix D. These terms are only used to explain the intuition behind quality variations, but all experiments use the explicit degradation labels (T1 through T6) rather than subjective descriptions.
- To ensure strict quality equivalence across demographic groups, we apply the same generation pipeline, prompt, and parameters with a fixed seed. Moreover, to maintain more robust consistency across demographic groups, we adopt an in-painting strategy, following [1], and detailed in Appendix D. We also follow the degradation protocols used in Borji [2] and Kim et al. [3], both of which analyze generation failures (e.g., truncated denoising or reduced steps). This setup is described in Section 4.2 and illustrated in Figure 8, with further configuration details in Appendix D. By generating 20,000 images, we minimize the influence of outlier generations and ensure robust group-level comparisons.
- The terms "comparable quality" and "consistent quality" both refer to images being generated under the same degradation type. We will clarify this language in the revision to avoid ambiguity and reinforce the operational definition used in our setup.
[1] Lui et al. Leveraging diffusion perturbations for measuring fairness in computer vision. AAAI 2024.
[2] Borji et al. Qualitative failures of image generation models and their application in detecting deepfakes. Image and Vision Computing, 2023.
[3] Kim et al. Model-Agnostic Human Preference Inversion in Diffusion Models, 2024.
W2: Rationale for DQA Term
We appreciate the reviewer’s detailed observations and the opportunity to clarify the rationale behind our DQA formulation.
The strength of the DQA metric lies in the joint use of the numerator and denominator. The numerator captures group-wise disparity and isolates potential encoder-induced bias under the assumption of equivalent quality. However, as noted by the reviewer in W2.1, when the embeddings of both groups are similarly close to their respective references, the numerator will be near zero. This is expected behavior and indicates fairness under poor generation conditions if both groups are similarly affected.
On the other hand, the denominator plays a critical role: it captures the global distributional shift caused by the generation process. By normalizing the group disparity by this global degradation, DQA focuses on relative bias rather than raw absolute differences.
This design improves sensitivity in multiple ways, as requested in W2.2:
- The normalization is necessary because different encoders operate at different scales. Without it, a large numerator may reflect scale differences rather than fairness issues.
- When both groups are equally degraded (i.e., poor generations), and the encoder maps them similarly, the numerator is small and the denominator is large, resulting in low DQA. This is a desirable outcome, as there is no unfair treatment relative to the severity of the degradation.
- When image quality is high (i.e., small denominator), but the encoder introduces a significant disparity between the groups (i.e., large numerator), DQA becomes large. This flags unfairness that could otherwise be masked in absolute terms.
- When overall quality is poor (e.g., blur, distortion), but one group is consistently treated more favorably in the embedding space, DQA remains sensitive to such relative disparities, rather than dismissing them due to uniformly bad generation.
In this way, DQA provides a scale-aware measure that allows for fair comparison across varying quality levels and highlights relative inconsistency in evaluation, even when absolute degradation exists.
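To make this structure concrete, here is a minimal sketch of one plausible reading of DQA as described above, with MMD as the distance; the helper names, kernel choice, and the exact form of Eq. (1) in the paper may differ:

```python
import numpy as np

def mmd_rbf(x: np.ndarray, y: np.ndarray, gamma: float = 1e-3) -> float:
    """Biased RBF-kernel MMD^2 estimate between feature sets of shape (N, dim)."""
    def k(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def dqa(gen_a, ref_a, gen_b, ref_b, dist=mmd_rbf):
    """Group-wise disparity normalized by the global generated-vs-reference shift."""
    numerator = abs(dist(gen_a, ref_a) - dist(gen_b, ref_b))   # encoder-induced gap
    gen_all = np.concatenate([gen_a, gen_b])
    ref_all = np.concatenate([ref_a, ref_b])
    denominator = dist(gen_all, ref_all)                        # overall degradation
    return numerator / denominator
```
In this reading, the numerator vanishes when both groups are treated alike, and the denominator keeps the ratio comparable across encoders operating at different feature scales.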
W3: Alternative approaches?
Thank you for suggesting this thoughtful direction. Our method employs an external encoder to compute DQA, which is then used to guide the diffusion model during the denoising process. This serves as a form of regularization in the noise sampling stage, encouraging the generation of outputs that are both visually plausible and fair in quality across demographic groups.
There do exist methods for quantifying bias in internal representations of generative models [4, 5]. However, these primarily focus on distributional bias. For example, ensuring that both genders appear with similar frequency in the generated outputs. Such approaches generally assess the probability or count of demographic categories rather than the visual fidelity of the outputs. In contrast, our focus is on quality bias in the generated images. To the best of our knowledge, this is the first attempt to mitigate such bias using a quality-aware fairness regularization.
Importantly, we believe debiasing internal representations would not mitigate quality bias, because internal tokens or latent features are not suitable for directly measuring perceptual quality at the image level. The final output is heavily influenced by the stochastic decoding and sampling process, which introduces significant variability that cannot be captured by inspecting internal representations alone.
Therefore, we argue that using an external encoder is necessary to robustly quantify and correct quality discrepancies across groups.
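As a rough sketch of how such external-encoder guidance can be injected into a sampling loop (our simplification; the sampler step, x0 predictor, encoder, and the two loss callables are stand-ins for the components specified in the paper):

```python
import torch

def dqa_guided_update(x_t, denoise_step, predict_x0, encoder,
                      dqa_loss, quality_loss, lambda1=0.5, lambda2=0.5):
    """One sampling step with a gradient correction from fairness + quality losses.

    All callables (denoise_step, predict_x0, encoder, dqa_loss, quality_loss)
    are assumed stand-ins; only the structure of the update is illustrated.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = predict_x0(x_t)               # current estimate of the clean image
    feats = encoder(x0_hat)                # external, differentiable image encoder
    loss = lambda1 * dqa_loss(feats) + lambda2 * quality_loss(feats)
    grad = torch.autograd.grad(loss, x_t)[0]
    # Nudge the next latent toward lower group disparity and preserved quality.
    return denoise_step(x_t.detach()) - grad
```
Because the correction is applied only at sampling time, the base generative model itself is left untouched.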
[4] Jung et. al. A unified debiasing approach for vision-language models across modalities and tasks. NeurIPS 2024.
[5] Seth et. al. DEAR: Debiasing vision-language models with additive residuals. CVPR 2023.
W4: Limited Human Evaluation
We appreciate the reviewer’s comment regarding the lack of human evaluation.
In this work, we focused on designing a scalable and automated evaluation framework using DQA, which captures relative group-wise disparities in image quality without relying on costly and potentially subjective human annotation. Our motivation stems from the inherent difficulty of obtaining consistent and fair human evaluations across large-scale datasets and demographic conditions. DQA reflects relative differences in encoder activations across groups under controlled quality shifts, as described in Section 4.2 and Appendix D.
We acknowledge the importance of grounding DQA in human perception and have noted this as a limitation in Section 5.4. The observed alignment between DQA values and perceptual differences across demographic groups suggests its promise as a reliable proxy. We consider incorporating structured human evaluations a valuable direction for future work to further reinforce the practical relevance of DQA.
W5: Scope Constraints
We appreciate the reviewer’s observation. Our focus on gender, race, and profession stems from their well-established relevance in the fairness literature and the accessibility of existing works generating gender-profession datasets [1] enabling controlled evaluation. We believe this setup offers a meaningful and representative testbed for assessing group-based disparities in generative quality.
In fact, to demonstrate the generalizability of our method beyond common fairness benchmarks, we additionally applied DQA and DQA-Guidance to the medical domain, as detailed in Appendix F-H. The results show consistent behavior of our proposed metrics and guidance mechanism across this distinct and sensitive setting. This provides evidence that our framework is not limited to a specific set of attributes, and we plan to expand to other underexplored groups in future work.
[1] Leveraging diffusion perturbations for measuring fairness in computer vision. AAAI 2024.
Thanks very much for your detailed responses. I also went through the authors' rebuttal to other reviewers, which addressed many of my concerns. I have a follow-up question I would like to discuss:
[W1] Without the intervention of human subjective judgment, it is difficult to guarantee that images generated under strict parameter control have the same level of quality, because image quality is a complex concept. Even if the same type and degree of distortion appears on the salient object and on the background, the perceptual quality will be completely different [R1]. Therefore, I suggest that the authors state a hypothesis explicitly in this study: the quality of images generated with the same parameters is considered to be the same.
[R1] Li D, Jiang T, Lin W, et al. Which has better visual quality: The clear blue sky or a blurry animal?[J]. IEEE Transactions on Multimedia, 2018, 21(5): 1221-1234.
Thank you for suggesting a valuable reference!
We fully agree that human evaluation is the gold standard for assessing perceptual image quality. However, conducting large-scale annotation over our full dataset of 120,000 images is infeasible. To directly address the reviewer’s concern, we conducted an additional statistical validation using No-Reference Image Quality Assessment (NR-IQA) metrics, similar to those suggested by the reviewer [1].
The goal of this analysis is to verify two properties:
- Each degradation level significantly reduces perceptual quality compared to the clean baseline.
- The level of degradation is statistically indistinguishable across demographic groups (e.g., gender).
We use four state-of-the-art NR-IQA models from pyiqa, all of which are trained neural networks designed to estimate perceptual image quality without reference images:
TOPIQ-FLIVE, TOPIQ-FPAQ [2], WaDIQaM [3], and MUSIQ [4] (higher = better).
Hypothesis tests
H1 (quality drop):
Each metric's score at level T1 (clean) equals its score at levels T2–T6. (This hypothesis is expected to be rejected.)
H2 (group parity):
For each level T2–T6, the degradation gap between the gender groups is zero. (This hypothesis is expected to be retained.)
We use 10,000-sample bootstrapping.
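A minimal sketch of this procedure, assuming per-image scores are computed with pyiqa and the null distribution is built by resampling from the pooled scores (details of our actual test may differ):

```python
import numpy as np
import pyiqa  # pip install pyiqa

rng = np.random.default_rng(0)

# Score images with one NR-IQA model (higher = better); path lists are hypothetical.
musiq = pyiqa.create_metric("musiq")
# scores_t1 = np.array([musiq(p).item() for p in t1_paths])
# scores_t6 = np.array([musiq(p).item() for p in t6_paths])

def bootstrap_pvalue(scores_x, scores_y, n_boot=10_000):
    """Two-sided bootstrap test of equal means, resampling from the pooled scores."""
    scores_x, scores_y = np.asarray(scores_x), np.asarray(scores_y)
    observed = scores_x.mean() - scores_y.mean()
    pooled = np.concatenate([scores_x, scores_y])
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(pooled, size=pooled.size, replace=True)
        diffs[i] = resample[: scores_x.size].mean() - resample[scores_x.size:].mean()
    return float((np.abs(diffs) >= abs(observed)).mean())

# H1: p-value for T1 vs Tk scores (expected small); H2: male vs female gaps at each Tk (expected large).
```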
Results: H1 - degradation is significant
| Degradation | TOPIQ-FLIVE | TOPIQ-FPAQ | WaDIQaM | MUSIQ |
|---|---|---|---|---|
| T1 (Base) | 0.7536 | 0.5399 | -0.1906 | 72.5817 |
| T2 | 0.7502 (p=0.0002) | 0.5304 (p=0.0035) | -0.2067 (p=0.0102) | 72.2058 (p=0.0317) |
| T3 | 0.7434 (p=0.0000) | 0.5089 (p=0.0000) | -0.2282 (p=0.0000) | 70.8131 (p=0.0000) |
| T4 | 0.7395 (p=0.0000) | 0.5017 (p=0.0000) | -0.2277 (p=0.0000) | 70.0091 (p=0.0000) |
| T5 | 0.7400 (p=0.0000) | 0.5013 (p=0.0000) | -0.2268 (p=0.0000) | 70.2069 (p=0.0000) |
| T6 | 0.7400 (p=0.0000) | 0.5012 (p=0.0000) | -0.2265 (p=0.0000) | 70.1266 (p=0.0000) |
Interpretation: All p-values are below 0.05, confirming that each degradation meaningfully reduces perceptual quality compared to the clean baseline (H1 rejected).
Results: H2 - degradation is consistent across gender
| Degradation | TOPIQ-FLIVE | TOPIQ-FPAQ | WaDIQaM | MUSIQ |
|---|---|---|---|---|
| T2 | 0.0006 (p=0.7064) | 0.0017 (p=0.7888) | 0.0105 (p=0.3994) | 0.1470 (p=0.6576) |
| T3 | 0.0015 (p=0.4056) | 0.0001 (p=0.9858) | 0.0183 (p=0.2180) | 0.5022 (p=0.1898) |
| T4 | 0.0008 (p=0.6732) | 0.0013 (p=0.8490) | 0.0136 (p=0.3700) | 0.4770 (p=0.2584) |
| T5 | 0.0001 (p=0.9646) | 0.0041 (p=0.5424) | 0.0149 (p=0.3352) | 0.0982 (p=0.8120) |
| T6 | 0.0005 (p=0.8278) | 0.0033 (p=0.6264) | 0.0198 (p=0.2060) | 0.5927 (p=0.1332) |
Interpretation: All p-values are above 0.05, meaning the degradation level is statistically the same across male and female groups (H2 not rejected).
We agree that human evaluation is important, and we will include it as future work. However, this statistical test supports our claim that the synthetic dataset is well controlled: each degradation level reduces image quality, and the degradation effect is consistent across demographic groups.
[1] Li D, Jiang T, Lin W, et al. Which has better visual quality: The clear blue sky or a blurry animal? IEEE Transactions on Multimedia, 2018, 21(5): 1221-1234.
[2] Chen, Chaofeng, et al. "Topiq: A top-down approach from semantics to distortions for image quality assessment." IEEE Transactions on Image Processing 33 (2024): 2404-2418.
[3] Bosse S, Maniry D, Müller K R, et al. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing, 2018, 27(1): 206-219.
[4] Ke, Junjie, et al. "Musiq: Multi-scale image quality transformer." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
The FID score is a commonly used metric for evaluating image quality in generative models, but it raises concerns about the reliability of the evaluation process. This paper proposes a novel approach called the Difference in Quality Assessment (DQA) score, which quantifies the reliability of existing evaluation metrics. The authors argue that a lower DQA score indicates a more reliable metric, similar to FD assessment.
Next, they introduce DQA guidance for diffusion models, incorporating two parameters, λ1 and λ2, which regulate DQA guidance and quality regularization. The authors conduct experiments using Stable Diffusion alongside the proposed DQA guidance and report significant improvements in image quality, in addition to an enhanced fairness evaluation metric.
Strengths and Weaknesses
Strengths
- The argument that the FID score is not always useful is valid. While some works have made this claim, this paper goes further by asserting that FID is not a valid metric for evaluating fairness.
- The use of DQA (Difference in Quality Assessment) to evaluate quality bias across different demographic groups is noteworthy.
- This paper is pioneering in its examination of fairness in image generation quality, which is of significant interest to the community.
- The claims regarding the bias in the image encoder used for evaluation are valid, as are the metrics employed to measure any shifts in this bias.
Weaknesses
- The paper lacks sufficient experimental evidence. Since it is an empirical study, I am particularly interested in whether DQA-guided sampling can be applied to other methods. Could this framework be used with different diffusion models to enhance the sampling process? Despite the interesting direction of the paper, the lack of sufficient experimental data makes it challenging to form a convincing argument and reach solid conclusions.
- The lines in the methodology section (lines 96-101) are repetitive compared to the introduction.
- I recommend using consistent notations in both Section 3.1 and Figure 3.
- I agree with the authors' statement in line 126 that D(A_gen, I_ref) - D(B_gen, I_ref) seems invalid. Could the authors provide more clarification on why this metric has been employed in previous works?
- In Figure 4, it's helpful to see both high-quality and low-quality images. However, I suggest clearly separating the images produced by each encoder, as this would support the arguments made in Section 3.2. Additionally, it may be beneficial to include other encoders that are fine-tuned with CLIP to mitigate bias.
- Regarding the evaluation dataset, I highly recommend that the authors provide detailed information on how to regenerate the dataset, including a few samples for each degradation type and the combinations of profession, gender, and race.
- It seems that λ1 and λ2 are hyperparameters that have been empirically fixed to improve the score.
- For all tables and figures, it would be helpful for the authors to indicate whether a lower or higher value for the used metrics is preferable.
Questions
- How can the lambda values be determined? Are they purely experimental and evidence-based? While I understand it may be challenging to estimate these values, I would appreciate further insight into this process.
- What are the failure cases for Equation 1? When image quality is degraded, the numerator is penalized even if the images address the bias, leading to a lower score. This ideally should not be the case, especially for generative models that are not as stable as diffusion models, or those fine-tuned on a small dataset, where image quality can be subpar. Therefore, I recommend considering a weighted average for the denominator.
Limitations
The DQA metric should be validated extensively before it serves as the new metric to study fairness in image generation
Final Justification
The authors have responded to the questions I had regarding DQA guidance, evaluation of the method. I am satisfied with their response. Hence, I have increased my score. I urge the authors to include the discussions from the rebuttal in the main paper or appendix.
Formatting Issues
No.
We thank the reviewer for the insightful comments and suggestions.
W1: DQA Guidance for Different Diffusion Models
Thank you for raising this important point. While the main experiments used Stable Diffusion, we additionally applied DQA-Guidance to DeepFloyd, which consists of three stages: an initial generation stage followed by two refinement stages. Since it follows a denoising diffusion process, DQA-Guidance can be integrated without architectural changes.
The results demonstrate that DQA-Guidance improves both image quality and fairness in DeepFloyd, suggesting its adaptability across diffusion frameworks. Exploring additional models will be part of future work.
| Model | Avg.MMD | MMD Gap |
|---|---|---|
| Baseline (no Guidance) | 196.86 | 17.80 |
| DQA-Guidance () | 193.72 | 15.01 |
| DQA-Guidance () | 176.90 | 13.01 |
W2: Repetitive Contents
Thank you for pointing this out. We acknowledge that lines 96 to 101 in the methodology section repeat points made in the introduction. This repetition was intentional, as we are addressing a relatively unexplored challenge, the reliability of image quality evaluation metrics, and wanted to ensure that the motivation and problem setup are clearly communicated for readers who may need reinforcement of the key ideas.
That said, we agree that improving the flow and reducing redundancy can enhance clarity. In the revised version, we will refine the content to avoid unnecessary repetition while preserving the clarity of the problem statement.
W3: Consistent Notation
In Section 3.1, D is used as a general notation for distance measures such as the Fréchet distance or MMD. We chose not to use this mathematical notation in Figure 3 because it would require additional definitions and might reduce clarity for readers unfamiliar with the formal setup. Instead, we focused on using accessible language in the figure for intuitive understanding, while maintaining precise and formal notation in Section 3.1. We will clarify this design choice in the revised version.
W4: On why previous works used all references
As noted in line 126, using a shared reference set when computing group-specific distances can lead to misleading results. Nevertheless, this approach has been adopted in prior studies such as [41] and [44], likely due to its simplicity and compatibility with standard FID computation pipelines, which often assume a single reference distribution.
These works primarily focused on identifying the presence of demographic disparities in generated images and used FID as a convenient, widely accepted metric for quantifying quality. However, they did not explicitly account for the fact that reference distributions differ across groups. As a result, the fairness conclusions drawn from these evaluations may have been influenced by reference mismatch rather than actual quality differences.
Our work addresses this limitation by recommending group-specific reference sets to ensure a more valid comparison. We also introduce DQA to diagnose such inconsistencies in encoder-based evaluation and to guide more reliable measurement practices going forward.
W5-1: Clarifying Figure 4(b) and Use of t-SNE
In Figure 4(a), we include example images from both high-quality and low-quality generations for visual reference. However, the t-SNE in Figure 4(b) exclusively visualizes embeddings from the low-quality image set on purpose, as this condition better reveals how different encoders respond to degraded content. The t-SNE for high-quality images is separately included in Figure 7 in Appendix C, and we will clarify this in the main text.
Regarding the suggestion to visualize different encoders in Figure 4(b), we appreciate this insight and explored several options. However, because our visualization is class-conditional (in-class t-SNE, i.e., only images from the same class e.g., "nurse"), certain encoders produced clusters based on background, clothing, or color. As a result, the global embedding trend became less interpretable, even if local gender mis-embeddings were still present. We therefore chose to present the CLIP-based visualization in Figure 4(b), which offers the clearest global separation aligned with demographic attributes and best supports our discussion in Section 3.2.
We will add clarifications to the figure caption and appendix to avoid potential confusion.
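For reproducibility, a minimal sketch of an in-class t-SNE visualization of CLIP embeddings; open_clip and scikit-learn here are illustrative tooling choices rather than the exact code used for Figure 4(b):

```python
import numpy as np
import torch
import open_clip
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_in_class_tsne(images_by_group, model_name="ViT-B-32", pretrained="openai"):
    """t-SNE of CLIP image embeddings for a single class (e.g., "nurse"),
    colored by demographic group. `images_by_group` maps group -> list of PIL images."""
    model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained)
    model.eval()
    feats, labels = [], []
    for group, images in images_by_group.items():
        batch = torch.stack([preprocess(im) for im in images])
        with torch.no_grad():
            feats.append(model.encode_image(batch).cpu().numpy())
        labels += [group] * len(images)
    xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(np.concatenate(feats))
    for group in images_by_group:
        idx = [i for i, g in enumerate(labels) if g == group]
        plt.scatter(xy[idx, 0], xy[idx, 1], s=8, label=group)
    plt.legend()
    plt.show()
```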
W5-2: Using CLIP in DQA-Guidance
We ablated the effect of different encoders in the DQA-Guidance procedure by replacing DINO-RN50 with CLIP-ViT. This test confirms that DQA-Guidance is model-agnostic and can be applied with any image encoder, though the effectiveness varies depending on encoder reliability.
As shown below, DINO-RN50 yields stronger improvements in both quality and fairness. While our study highlights FID’s limitations in capturing fairness, we report FID-based results for reference. These results demonstrate that while DQA-Guidance is compatible with multiple encoders, selecting a more reliable encoder leads to stronger improvements in fairness.
| Encoder | DINO-MMD | DINO-MMD Gap | FID | FID Gap |
|---|---|---|---|---|
| No Guidance | 109.93 | 12.57 | 137.19 | 12.85 |
| DINO-RN50 | 96.38 | 9.12 | 138.97 | 11.22 |
| CLIP-VIT | 110.23 | 10.68 | 138.06 | 10.85 |
W6: Details of Evaluation Dataset
We provide detailed information on the construction of the evaluation dataset in Appendix D, including the exact generation settings for each degradation type. The combinations of profession, gender, and race are also explicitly listed, and the dataset includes equal-sized sets for each combination to ensure balance.
In addition, we include visual examples in Figure 8 to illustrate the six degradation types used in the main experiments, and in Figure 10 for the medical image setting. These examples offer concrete guidance for reproducing the dataset. We will further highlight this information in the revised version to make it easier to locate.
W7, Q1: On the Selection of λ1 and λ2
We confirm that both λ1 and λ2 are determined empirically, based on performance trends observed in Figure 9.
- λ1 controls the strength of fairness guidance. It is selected to minimize quality discrepancy across groups, reflecting our primary goal of mitigating bias in evaluation.
- λ2 serves as a regularization term to maintain overall generation quality. We select its value near the elbow point in Figure 9(b), where further increases begin to degrade quality more noticeably.
While we do not provide an analytical rule for choosing these values, the selection is not arbitrary. It is guided by consistent trends in the fairness–quality trade-off curves. In this sense, the choice is both evidence-based and repeatable.
Importantly, because our method is post-hoc and does not require training or fine-tuning the generation model, tuning $\lambda_1$ and $\lambda_2$ in practice is relatively lightweight. This makes the method easy to deploy in real-world scenarios without additional computational overhead. We will clarify this in the revised version.
W8: Clarifying Metric Directionality in Tables and Figures
Thank you for the helpful suggestion. In the main body of the paper, all reported metrics (image quality, e.g., MMD and FID; fairness, e.g., inter-group discrepancy; and reliability, i.e., DQA) are such that lower values are better.
The only exception is in Appendix J, where we report AUC for downstream tasks. In that case, higher is better. We will revise the captions and figure legends to make this directionality explicit in all cases.
Q2: On Eq.(1) and the Denominator in Failure Cases
We understand the concern regarding the potential influence of low-quality images on the DQA score, particularly when the denominator becomes large.
However, this behavior is actually desirable and aligned with the purpose of DQA. If both groups experience similar degradation (i.e., $D_A \approx D_B$), then the numerator will be small, and the resulting low DQA score correctly indicates reliable and fair treatment, even if the image quality itself is poor (large denominator). This means the encoder handles degradation consistently across groups, which is exactly what we want to measure.
By contrast, if the image is low quality, but the encoder produces smaller distance values (D) for one group, the numerator becomes large relative to the denominator. In this case, DQA becomes high, which correctly flags unreliable or biased behavior under degradation.
Therefore, the normalization by the overall distance acts as a safeguard to ensure that fairness is not claimed at the cost of universally degraded output. We believe this design supports the interpretability and robustness of DQA, even in low-quality image regimes.
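To make this behavior concrete, here is a minimal numeric sketch (the `dqa` function below is an illustrative stand-in for Eq.(1), with `d_a`, `d_b` as group-specific distances and `d_union` as the overall distance; the numbers are hypothetical):

```python
def dqa(d_a: float, d_b: float, d_union: float) -> float:
    """Illustrative form of the DQA score: group discrepancy normalized by overall distance."""
    return abs(d_a - d_b) / d_union

# Scenario 1: both groups degrade similarly -> low DQA despite poor overall quality.
print(dqa(d_a=50.0, d_b=52.0, d_union=51.0))   # ~0.04: encoder treats groups consistently

# Scenario 2: same overall degradation, but one group scores much better -> high DQA.
print(dqa(d_a=30.0, d_b=72.0, d_union=51.0))   # ~0.82: flags biased behavior under degradation
```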
Extensive Experiments
We have conducted extensive experiments to validate DQA as a reliability assessment for encoders and to demonstrate the effectiveness of DQA-Guidance across different tasks (human images and medical images), supported by solid ablation studies.
- We generate 20,000 controlled samples and evaluate 13 encoders to identify factors affecting the reliability of image quality evaluation. The number of samples is comparable to that used in the literature (20k–25k), while the number of evaluation comparisons is significantly greater than the one or two typically reported [1,2].
- DQA’s validity is supported by downstream tasks (Sec 4.5); datasets with lower DQA (using a fixed encoder) show improved fairness (Appendix B).
- The proposed mitigation strategy, DQA-Guidance, demonstrates consistent effectiveness (Table 1, 3, 4; Fig. 6, 9, 11, 12).
If the reviewer has any additional experimental suggestions, we are happy to conduct them.
[1] Rethinking FID: Towards a Better Evaluation Metric for Image Generation, CVPR 2024
[2] DEMYSTIFYING MMD GANS, ICLR 2018
Thank you for your efforts and response, and for noting that some of these points will be addressed in the revised version.
The authors assert that selecting lambda values is feasible due to lightweight inference, which seems reasonable.
While recent work from 2024 [1] has conducted fewer evaluations than the authors’ studies, it focuses on innovation related to the FID metric. In contrast, the authors' work centers on addressing quality bias through DQA, which evaluates discrepancies in the quality of different demographic groups. This is why I raised the concern about the limited empirical evaluation. I would appreciate understanding the author's reasoning and justification behind this. Additionally, could you provide information about other works, besides [1, 2], 'in the literature' that have also used 20k-25k samples for evaluation?
Justification for Sample Size
Thank you for your thoughtful questions and kind acknowledgment of our previous response. We sincerely appreciate the opportunity to further clarify our contributions.
We would like to explain the rationale behind our sample size and how it compares to prior works. While recent works such as [1] focus on improving FID in general quality evaluation settings, our study uniquely investigates discrepancy induced by image encoders across demographic groups, a distinct problem that requires a different experimental design. To accurately isolate the encoder's impact, we must control for variation in the input images. This necessitates:
- Balanced quantity and quality across demographic groups
- A sufficient number of samples per group to ensure statistical stability
We construct 20,000 samples per scenario by generating 250 images for each combination of 10 professions, 2 genders, and 4 races (in line 181). This balanced structure ensures demographic parity and enables reliable DQA computation. When including degradation variants (T1–T6), the total number of images used becomes 120,000.
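For illustration only, a minimal sketch of how such a balanced prompt grid can be enumerated (the profession and race lists below are partial placeholders, not the exact lists used in the paper):

```python
from itertools import product

professions = ["nurse", "firefighter", "flight attendant", "pilot", "secretary"]  # 5 of the 10 used
genders = ["male", "female"]
races = ["Asian", "Black", "Hispanic", "White"]  # illustrative race categories
images_per_cell = 250

prompts = [
    (f"a photo of a {race} {gender} {profession}", profession, gender, race)
    for profession, gender, race in product(professions, genders, races)
]

# With the full 10 professions: 10 * 2 * 4 * 250 = 20,000 balanced samples per scenario.
print(len(prompts) * images_per_cell)  # 10,000 here with the 5 placeholder professions
```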
To support our decision, we reference two additional works:
- [3] states that using 20k samples yields a relative error comparable to using 100k samples in FID estimation.
- [4] recommends a minimum of 10k samples for reliable evaluation and confirms that 10k is sufficient.
Although our dataset construction uses 20k reference images per scenario, we prioritize using as many samples as possible within computational limits. Our setup (1024x1024 resolution, controlled generation pipeline) requires substantial storage and multi-day processing, which limits further expansion. However, we are confident that our sample size is sufficient and exceeds that of most prior works focused on metric reliability and bias.
We acknowledge that a sensitivity analysis of DQA with respect to sample size could provide further insight and plan to explore this in future work.
[1] Rethinking FID: Towards a Better Evaluation Metric for Image Generation, CVPR 2024
[2] DEMYSTIFYING MMD GANS, ICLR 2018
[3] Exposing Flaws of Generative Model Evaluation Metrics and Their Unfair Treatment of Diffusion Models
[4] Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples
Thank you for your effort and response to my question. I now have more insights into your rationale regarding the sample size and the samples used in existing works.
The paper analyzes and finds demographic biases in current image quality evaluation strategies due to the selection of the reference dataset and due to biases in the image encoders. To better quantify such biases, the paper introduces the difference in quality assessment (DQA) score and evaluates various backbones. This score is subsequently applied to guide a diffusion model, which mitigates the demographic biases and improves the image quality of the generated images.
Strengths and Weaknesses
Strengths
While distributional fairness has been studied in the past, fairness in generation quality is underexplored. The paper points out that evaluation metrics are heavily influenced by data biases, which results in unfair evaluation metrics; this is a notable finding. Such bias impacts image quality across demographic groups, eventually also affecting the quality of models that are trained using generated images. Code is provided for reproducibility, and limitations are transparently discussed in the main paper.
Weaknesses
- The title frames the contribution as improving the reliability of image quality evaluation, which goes beyond fairness. I would tend to replace this adjective with one that more specifically describes fairness.
- Generally speaking, this approach is applicable to any condition that occurs less often in the training dataset of the image encoder, which goes beyond race or demographics. Framing the paper in a broader way would maybe allow a more general statement regarding the reliability of evaluations.
- The analysis of biases of various encoders presents a significant part of the paper. More elaborately working out the relation to previous works that discussed biases (such as (Qizhou Wang et al., Do CLIP Models Always Generalize Better than ImageNet Models?) for biases in CLIP) would be helpful to more clearly state the new findings of the work compared to previous analyses.
- The motivation clearly relies on the data generation for training a model in the medical domain and is referenced multiple times. However, all the related experiments are in the supplementary. The main paper should either rely less on referencing this part of the work or integrate it in the main paper. The authors should clarify its importance in the rebuttal.
- Regarding the experiments: The image quality is deteriorated via different strategies. A more systematic variation of one parameter would allow computing the sensitivity with respect to a specific factor that reduces image quality.
- Compute resources are listed but time etc. is not listed although the Answer is [Yes].
- The controlled dataset for the DQA evaluation is not deteriorated in a controlled manner, i.e., the generated artifacts are not clear. An alternative would also be to collect a dataset from real-world samples (by filtering text labels from LAION, for example, and applying stronger and stronger corruptions). This would have enabled a more controlled dataset. To improve the quality of the base images, standard image quality filtering approaches could have been applied.
Questions
- Why is MMD / FD for A larger than for B? In the visualization, it does not seem to be the case, as both points A and B should be distributed rather equally, or am I wrong? An intuitive explanation would help me understand this.
- Regarding the DQA metric: Why is the metric normalized with the reference and not, for example, with A+B? This mixes overall distribution quality and fairness, or am I wrong? A clarification of this would be helpful.
- The appendix also includes the analysis when lambda1 is set to 0. However, this plot is missing the global quality metric. The case of lambda1=0 should be presented in Tab. 1 as well in my opinion since it better disentangles the impact of the fairness consideration (and not just guiding with FID score).
- To avoid the need for the regularization term, an alternative would be to only use the numerator of the DQA metric as the guidance term, no? Maybe the authors could run such an experiment or explain why this would not be reasonable.
- Do the authors have any thoughts to what extent the training data of the text encoder or the training data of the diffusion model explain the missing fairness?
- According to paper, more works addressed distribution fairness. Is the paper the first that discusses “fairness in generation quality” or the first that discusses “fairness in generation quality across demographic groups” (l. 24).
- Why does setting lambda1 too high result in excessive noise (l.601)? Because of the potentially larger denominator?
- Did the authors ablate the effect of the image encoder for the DQA guidance? How would the image quality gain deteriorate if a less reliable encoder would be used? Related to this: How much do the DQA scores differ for different attributes? Do the average rankings typically also represent the rankings for individual rankings.
- The work finds DINO-RN50 to perform the best. What was the training dataset used for this? How does a standard DINOv2 model (maybe with different model sizes) compare to it?
I consider increasing my score
- if the authors can convincingly discuss that their fairness evaluations are indeed a new perspective and could be straight-forwardly applied to also evaluate the impact of other biases for image quality metrics,
- if the authors discuss the denominator or potentially run experiments without the denominator,
- if the authors also present the main results for lambda1=0,
- if the authors discuss the image quality of the controlled dataset using a collection of recent image quality datasets,
- if the authors also add the standard DINOv2 model in the comparisons.
Limitations
Yes, discussed in the main paper.
Final Justification
The paper addresses a relevant topic in current generative models and the experimentation is solid. The authors addressed the main criticism of the reviewers, which makes me more confident in my initial positive rating. While some minor concerns remain, I am supportive for acceptance.
Formatting Issues
No major formatting issues noticed.
We thank the reviewer for the insightful comments and suggestions.
W1: Title and Scope
Our work proposes a reliability measure for image quality evaluation with a focus on fairness. We will consider revising the title to "Assessing the Fairness Reliability of Image Quality Evaluation..." to better reflect this focus.
W2: Generalizability Beyond Demographics
The core idea of evaluating the reliability of image quality metrics extends beyond demographic fairness. While our main experiments focus on demographic attributes on facial images, we demonstrate generalizability by applying DQA to medical images (Chest X-rays) in Appendix F. More broadly, the proposed framework can be applied to any underrepresented group or condition in generative tasks, such as remote sensing or manufacturing.
W3: Comparison to the Referenced Paper
The paper referenced by the reviewer shows that CLIP is affected by spurious correlations, mainly in classification tasks using high-quality images while filtering out low-quality ones. In contrast, ours is a new perspective focusing on how image encoders behave with low-quality images across underrepresented groups, which leads to unreliable evaluations. DQA aims to quantify this reliability, which has not been explored in the literature.
W4: Paper Organization
Due to the page limit, we placed several experiments in the appendix, though they remain integral to the main narrative. We clarify their roles as follows:
- Appendix B and J demonstrate the impact of quality bias on downstream performance, supporting the motivation that fairness in generative quality affects fairness in classification.
- Appendix C provides additional evidence for Figure 4, reinforcing the claim that encoder bias becomes more pronounced for low-quality images.
- Appendix E offers a detailed ablation study that complements Table 1 and Figure 6 in the main paper.
- Appendices F, G, and H replicate the core experiments using medical images, demonstrating the generalizability of our method beyond general images.
We will consider moving key elements into the main paper to improve clarity.
W5: Systematic Variation in Synthetic Dataset
Here, we generate systematically varied images by changing one hyperparameter at a time.
We vary the Classifier-Free Guidance (CFG) scale and the number of diffusion sampling steps, respectively, both of which directly impact image quality. The gradual degradation shows that the synthetic dataset is well controlled to produce the intended poor quality; a minimal sketch of such a sweep is given after the tables below.
| CFG scale (default 7.5) | DINO-MMD |
|---|---|
| 5 | 17.94 |
| 2 | 23.31 |
| 1 | 25.82 |
| Sampling steps (default 40) | DINO-MMD |
|---|---|
| 30 | 12.18 |
| 20 | 12.21 |
| 10 | 13.42 |
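A minimal sketch of such a single-hyperparameter sweep, assuming an SDXL-style pipeline from the diffusers library (the model ID, prompt, and defaults shown are placeholders; the actual generation settings are those listed in Appendix D):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Placeholder model ID; the paper's exact pipeline and settings are those in Appendix D.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of a nurse"  # illustrative prompt

# Sweep one factor at a time, holding the other at its default and reusing the same seed.
for cfg in [7.5, 5.0, 2.0, 1.0]:
    gen = torch.Generator("cuda").manual_seed(0)
    image = pipe(prompt, guidance_scale=cfg, num_inference_steps=40, generator=gen).images[0]
    image.save(f"cfg_{cfg}.png")

for steps in [40, 30, 20, 10]:
    gen = torch.Generator("cuda").manual_seed(0)
    image = pipe(prompt, guidance_scale=7.5, num_inference_steps=steps, generator=gen).images[0]
    image.save(f"steps_{steps}.png")
```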
W6: Computational Cost in time
We acknowledge the oversight in not including explicit timing details. The average generation time was 2.49 seconds per image for the baseline and 4.83 seconds per image with DQA-Guidance, reflecting the additional cost introduced by the guidance.
W7: On Controlled Dataset Design
We considered using real-world datasets as references, but found them unsuitable for controlled evaluation. This raises a question: do the original images for the two groups have the same quality? In practice, real images from different groups differ in background, lighting, and pose, which blurs whether embedding differences stem from encoder bias or image content.
Even if we were to use real-world samples as references and regard their corruption as quality degradation, the types of degradation would be limited. Traditional image corruptions such as jittering and blurring do not reflect the degradation produced by generative models [1]. Our synthetic setup avoids this issue by using shared seeds and controlled degradation.
As a special case, for medical images where real data can be assumed to have consistent quality, we do use real datasets as references, as shown in Appendix F and G.
[1] CNN-generated images are surprisingly easy to spot... for now. CVPR 2020.
Q1: Analyzing Figure 3
The reviewer's observation is correct. In Figure 3(a), both groups are shifted similarly from their references, so the measured distances should be comparable. However, using All Ref. incorrectly shows a larger shift for Group A, which is misleading. This becomes clearer in Figure 3(b), where Group B is farther from its reference, yet All Ref. still assigns a larger distance to Group A.
This demonstrates that All Ref. can produce unreliable results when group distributions are not symmetric. In contrast, group-specific references provide a more accurate assessment of each group’s distribution shift.
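To illustrate this effect numerically, consider a toy sketch with 2-D Gaussian groups (purely synthetic, not the paper's data): when group B drifts toward group A, a pooled reference can assign the larger distance to the group that actually shifted less.

```python
import numpy as np
from scipy import linalg

def frechet_distance(x, y):
    """Frechet distance between Gaussian fits of two sample sets."""
    mu_x, mu_y = x.mean(0), y.mean(0)
    cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    covmean = linalg.sqrtm(cov_x @ cov_y).real
    return float(((mu_x - mu_y) ** 2).sum() + np.trace(cov_x + cov_y - 2 * covmean))

rng = np.random.default_rng(0)
ref_a = rng.normal([0, 0], 1.0, (5000, 2))    # reference for group A
ref_b = rng.normal([6, 0], 1.0, (5000, 2))    # reference for group B
gen_a = rng.normal([0.5, 0], 1.0, (5000, 2))  # A shifts slightly
gen_b = rng.normal([4.0, 0], 1.0, (5000, 2))  # B shifts more, and toward A

ref_all = np.concatenate([ref_a, ref_b])
print("group-specific:", frechet_distance(gen_a, ref_a), frechet_distance(gen_b, ref_b))
print("pooled ref:    ", frechet_distance(gen_a, ref_all), frechet_distance(gen_b, ref_all))
# Group-specific references correctly rank B as more shifted; the pooled reference
# reports a larger distance for A, mirroring the failure mode in Figure 3(b).
```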
Q2: Regarding Denominator
Let the group-specific distances be $D_A$ and $D_B$, and let the global quality measured on the union set be $D_{A\cup B}$. The question can then be rewritten: why use $D_{A\cup B}$, and not $D_A + D_B$, as the denominator? While $D_A + D_B$ could potentially be used, we intentionally chose $D_{A\cup B}$ for global quality to capture cross-group interaction. Consider two scenarios: 1) group B is embedded close to group A; 2) group B is embedded far from group A, while in both cases each group's distance to its own reference remains the same. In these scenarios, $D_A + D_B$ is identical, so it cannot distinguish between them. In contrast, $D_{A\cup B}$ reflects the difference. Thus, while $D_A + D_B$ is not incorrect, $D_{A\cup B}$ better aligns with our philosophy of normalizing by global quality.
Q3: Clarification on $\lambda_1 = 0$
The $\lambda_1 = 0$ case is already included in Table 1 as the “Baseline,” which corresponds to generation without DQA-Guidance. In contrast, all rows labeled “+ DQA-Guidance” use $\lambda_1 > 0$. The global quality metric is reported as Avg.MMD.
Q4, Q7: Without Denominator & Excessive $\lambda_1$
The denominator in Eq(1) captures the overall distributional shift, while the numerator reflects group-specific disparity. This normalization is necessary because different encoders operate at different scales. Without it, a large numerator may reflect scale differences rather than fairness issues.
In Eq.(4), using only the numerator as a guidance term could reduce group disparity but allow overall image quality to degrade. To prevent this, we adopt DQA as the guidance and also add the denominator as a separate regularization term, controlled by $\lambda_2$. The reviewer's question can be framed as:
- $\lambda_1 \cdot \frac{|D_A - D_B|}{D_{A\cup B}} + \lambda_2 \cdot D_{A\cup B}$ ← Why this,
- $\lambda_1 \cdot |D_A - D_B| + \lambda_2 \cdot D_{A\cup B}$ ← Not this?
Case 2 depends on the balance between $\lambda_1$ and $\lambda_2$. However, ours (case 1) offers adaptive behavior. When overall quality is poor (large $D_{A\cup B}$), the first term is relatively small and the second term is large, increasing the weight on improving global quality. Conversely, when the overall quality is already high (small $D_{A\cup B}$), the guidance can focus more on the first term, the group disparity.
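A schematic sketch of the two candidate objectives above (here `d_a`, `d_b`, and `d_union` stand for differentiable distance estimates; this illustrates the trade-off rather than the exact implementation of Eq.(4)):

```python
import torch

def guidance_case1(d_a: torch.Tensor, d_b: torch.Tensor, d_union: torch.Tensor,
                   lambda1: float, lambda2: float) -> torch.Tensor:
    # Case 1: normalized disparity (DQA-style) plus a global-quality regularizer.
    # When d_union is large (poor quality), the first term shrinks and the second dominates.
    return lambda1 * (d_a - d_b).abs() / d_union + lambda2 * d_union

def guidance_case2(d_a: torch.Tensor, d_b: torch.Tensor, d_union: torch.Tensor,
                   lambda1: float, lambda2: float) -> torch.Tensor:
    # Case 2: raw (unnormalized) disparity; the balance is fixed by lambda1/lambda2 alone.
    return lambda1 * (d_a - d_b).abs() + lambda2 * d_union
```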
Here we report the experimental results for the numerator-only case.
| Method | DINO-MMD | DINO-MMD Gap | FID | FID Gap |
|---|---|---|---|---|
| No Guidance | 109.93 | 12.57 | 137.19 | 12.85 |
| DQA-Guidance | 96.38 | 9.12 | 138.97 | 11.22 |
| DQA-Guidance (no denominator) | 104.79 | 9.52 | 137.00 | 11.56 |
However, setting an excessively high $\lambda_1$ overly emphasizes fairness by aggressively minimizing the DQA score, which can unintentionally increase the denominator while ignoring the impact of $\lambda_2$, resulting in degraded global image quality. This is consistent with the reviewer’s interpretation (also noted in line 254).
Q5, Q6: On the Source of Missing Fairness and Novelty of Our Contribution
The unfairness in image quality stems from biases in both the text encoder and the diffusion model's training data.
The references in line 35 support this. Naik et al. show that, even with explicitly attributed prompts, diffusion models produce lower-quality images for underrepresented groups, suggesting that the text encoder struggles with rare concepts due to biased training data. Perera et al. demonstrate that diffusion models generate lower-quality images even when trained on balanced datasets, as they struggle to learn balanced representations due to variations in data such as lighting, hair, and makeup.
While we are not the first to highlight quality disparities across demographic groups, our work makes two novel contributions:
- We show that widely used evaluation metrics (e.g., FID) can mislead fairness assessments across groups. We introduce a diagnostic tool (DQA) to quantify metric reliability across demographic attributes (line 72).
- DQA-Guidance, a method to mitigate quality bias at the sampling stage of diffusion models without retraining (lines 73, 93, 279).
To the best of our knowledge, these aspects, evaluation metric reliability and mitigation at inference, have not been addressed in prior work.
Q8: Ablation study with different image encoder.
We ablate the effect of the encoder in DQA-Guidance by replacing DINO-RN50 with CLIP-ViT. The results demonstrate that while DQA-Guidance is compatible with any encoder, selecting a more reliable encoder leads to stronger improvements in fairness.
| Encoder | DINO-MMD | DINO-MMD Gap | FID | FID Gap |
|---|---|---|---|---|
| No Guidance | 109.93 | 12.57 | 137.19 | 12.85 |
| DINO-RN50 | 96.38 | 9.12 | 138.97 | 11.22 |
| CLIP-VIT | 110.23 | 10.68 | 138.06 | 10.85 |
Q8-2: DQA score for different attributes?
DQA is not computed separately for each group. In Figure 5, DQA is computed for each attribute category (race or gender) by averaging over each degradation type (T2–T5), as mentioned in the caption.
Q9: Result with DINOv2 DQA
DINO is trained on the ImageNet 1K dataset, while DINOv2 uses the LVD-142M dataset. We conducted additional experiments with DINOv2 and BYOL-RN50, and the results align with our analysis:
- CNN-based models tend to be more reliable than ViT-based models
- Larger training datasets do not necessarily improve reliability
| Method | Backbone | Training Dataset | Gender DQA | Race DQA |
|---|---|---|---|---|
| DINO | RN50 | IN-1K | 0.1684 | 0.1764 |
| DINOv2 | ViT-B14 | LVD-142M | 0.3240 | 0.3328 |
| BYOL | RN50 | IN-1K | 0.2253 | 0.2089 |
Thanks a lot for your detailed rebuttal, which cleared many of my questions! I would have a few follow up comments:
[W4] Which experiments do you consider moving? I am still a bit concerned because an integral experiment (referenced in the intro) should not only come in the introduction. In my opinion, this reference should be not in such a prominent position of the paper.
[W7] "The traditional image corruption such as jittering and blurring doesn't reflect the degradation in generative model [1]." Could you elaborate this statement a bit more? At first glance, the paper does not even mention jitter? While I am not arguing that this is not necessarily required (for this work), I would still argue that a more controlled degradation of the image quality (which is probably still getting OOD for the encoder?), would allow a more systematic study.
[Q3] My question was related to lambda1=0 but lambda2>0.
[Q8-2] I saw that. However, I was still wondering how it looks for the individual results. Do you also have the individual results? It would help getting a better sense of variation / sensitivity than when just aggregating.
W4: Relocating Experiments
Thank you for the thoughtful follow-up. We agree that experiments cited in the introduction should not be confined solely to the appendix. We plan to revise the structure accordingly:
- Appendix B will be moved into the main body, potentially as a "Problem Definition" or "Motivation" section. This section shows how our framework extends beyond general image domains and motivates the importance of addressing quality bias in generative models.
- Appendix C will be merged with Figure 4 to present results from both well-generated and poorly-generated images in a single figure while maintaining clarity.
- Appendix E (Figure 9) supports the ablation analysis in Table 1 and Figure 6. We plan to merge the quantitative results into this figure and move it into the main paper.
Please let us know if you have suggestions beyond this proposed restructuring.
W7: Regarding Traditional Image Degradation
Thank you for your insightful follow-up and clarification. We acknowledge that our earlier mention of "jittering" was inaccurate, and we appreciate the opportunity to elaborate further.
We argue that traditional degradation (e.g., blur, JPEG) and generative degradation (e.g., insufficient denoising, early termination) are inherently different types of perturbations. This distinction is supported by two observations from [1]:
- In Figure 5 of [1], classifiers trained to detect generative artifacts fail to generalize to traditionally degraded images (e.g., blur, JPEG) unless proper augmentations are introduced. This indicates that generative and traditional degradations manifest differently in image space.
- The authors also note:
“DeepFake images do not contain obvious artifacts. We note that DeepFake images have gone through various pre- and post-processing, where the synthesized face region is resized, blended, and compressed with MPEG.”
This suggests that generative degradations result in more subtle and complex artifacts that are not easily captured by traditional corruption.
However, we fully agree that using a more controlled degradation (e.g., applying symmetric traditional corruption across groups) would support a more systematic study of encoder robustness. While our current design is tailored to study degradation induced within the generative process (which reflects practical failure cases in text-to-image models), we see traditional degradation as a complementary axis to explore.
We will retain this direction as promising future work, and we plan to include a note or preliminary analysis in the appendix of the revised version; a minimal sketch of such a symmetric corruption appears after the references below.
[1] CNN-generated images are surprisingly easy to spot... for now
[2] Exposing photo manipulation with inconsistent reflections
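For reference, a minimal sketch of the kind of symmetric traditional corruption discussed above, assuming torchvision; the severity ladder is illustrative:

```python
from PIL import Image
from torchvision.transforms import functional as TF

BLUR_SIGMAS = [0.5, 1.0, 2.0, 4.0, 8.0]  # illustrative severity levels

def corrupt(img: Image.Image, severity: int) -> Image.Image:
    """Apply the same Gaussian blur level to every image, regardless of group."""
    sigma = BLUR_SIGMAS[severity - 1]
    kernel = 2 * int(3 * sigma) + 1  # odd kernel size covering roughly 3 sigma
    return TF.gaussian_blur(img, kernel_size=kernel, sigma=sigma)

# Example: corrupted = corrupt(Image.open("sample.png"), severity=3)
```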
Q3: Experimental Results for $\lambda_1 = 0$, $\lambda_2 > 0$
Thank you for the clarification. We initially presented results with both coefficients at zero and ablation studies with either $\lambda_1$ or $\lambda_2$ fixed. Below, we provide new results for $\lambda_1 = 0$ and $\lambda_2 > 0$:
| Method | DINO-MMD | DINO-MMD Gap |
|---|---|---|
| No Guidance | 109.93 | 12.57 |
| , | 109.85 | 11.55 |
| , | 105.24 | 10.08 |
| , | 103.89 | 6.21 |
The results show that using $\lambda_2$ alone improves overall performance, but combining both $\lambda_1$ and $\lambda_2$ yields the strongest improvement. We will update Table 1 in the revised version accordingly.
Q8-2: Individual Ranking
Thank you for the clarification. We now understand that you are requesting per-condition results rather than aggregated metrics.
In our study, the DQA metric has three axes of analysis: (1) the encoder, (2) the type of degradation (T2 to T6), and (3) the profession. For each target attribute (e.g., gender or race), we compute DQA for every (degradation type × profession) pair. This results in a detailed DQA matrix per encoder.
Below we present the full breakdown of Gender DQA scores for three representative encoders, BYOL-RN50, DINO2-ViT, and DINO-RN50, across degradation types and professions.
| BYOL-RN50 | Nurse | Firefighter | Flight Attendant | Pilot | Secretary | Average |
|---|---|---|---|---|---|---|
| T2 | 0.1643 | 0.1679 | 0.1494 | 0.1807 | 0.1307 | 0.1586 |
| T3 | 0.1072 | 0.1894 | 0.2727 | 0.5811 | 0.2077 | 0.2716 |
| T4 | 0.1574 | 0.3162 | 0.3749 | 0.2398 | 0.1348 | 0.2446 |
| T5 | 0.0331 | 0.0242 | 0.0355 | 0.0571 | 0.1083 | 0.0516 |
| T6 | 0.0343 | 0.1769 | 0.0805 | 0.2331 | 0.0647 | 0.1179 |
| Average | 0.0993 | 0.1749 | 0.1826 | 0.2584 | 0.1292 | 0.1689 |
| DINO2-ViT | Nurse | Firefighter | Flight Attendant | Pilot | Secretary | Average |
|---|---|---|---|---|---|---|
| T2 | 0.2426 | 0.1958 | 0.5121 | 0.2680 | 0.4288 | 0.3295 |
| T3 | 0.2720 | 0.2337 | 0.2926 | 0.2512 | 0.2659 | 0.2631 |
| T4 | 0.4403 | 0.2540 | 0.1797 | 0.3352 | 0.1988 | 0.2816 |
| T5 | 0.1625 | 0.0487 | 0.1867 | 0.2600 | 0.0679 | 0.1452 |
| T6 | 0.2353 | 0.0573 | 0.4496 | 0.2069 | 0.4148 | 0.2728 |
| Average | 0.2705 | 0.1579 | 0.3241 | 0.2643 | 0.2752 | 0.2584 |
| DINO-RN50 | Nurse | Firefighter | Flight Attendant | Pilot | Secretary | Average |
|---|---|---|---|---|---|---|
| T2 | 0.1833 | 0.1330 | 0.1871 | 0.0306 | 0.1467 | 0.1361 |
| T3 | 0.1829 | 0.1764 | 0.1495 | 0.0856 | 0.2600 | 0.1709 |
| T4 | 0.1724 | 0.1698 | 0.2301 | 0.0478 | 0.2006 | 0.1641 |
| T5 | 0.1147 | 0.0878 | 0.1008 | 0.0560 | 0.0884 | 0.0895 |
| T6 | 0.1031 | 0.0880 | 0.1672 | 0.0440 | 0.0934 | 0.0991 |
| Average | 0.1513 | 0.1310 | 0.1669 | 0.0528 | 0.1578 | 0.1320 |
These granular results provide several key insights:
- T3 (fewer generation steps) and T4 (strong noise influence) often result in higher DQA values across encoders, in contrast to T5 (no refiner), which consistently shows lower DQA. This indicates that encoders are more vulnerable to demographic inconsistency under T3 and T4.
- Each encoder behaves differently across profession-attribute combinations. For instance:
- DINO-RN50 handles the "pilot" profession fairly (low DQA) but shows more bias for "flight attendant".
- In contrast, BYOL-RN50 treats "nurse" profession fairly, but performs poorly on "pilot".
- DINO2-ViT consistently shows high DQA, indicating it may have stronger demographic disparities despite its perceptual strength.
Overall, while the individual scores slightly deviate from aggregate trends, the relative ranking of encoders remains broadly consistent, supporting our main claims. These detailed results will be included in the revised appendix, and we will expand our discussion to reflect the profession- and degradation-specific insights.
Thank you for your efforts and follow up.
I encourage to include the Q8-2 discussion in the paper and add the results to the suppl.
I still think that a simple controlled corruption would be important to verify the effectiveness of the approach in a systematic manner (and then follow up with the generative perturbation). I would like to request that the authors clearly mention in the main paper that the generative perturbation is not as systematic as a traditional perturbation method, which might result in confounders or unpredicted variations in the results.
Despite this remaining concern I will keep my positive rating, since I do not believe that it is a fundamental flaw.
Thank you for your follow-up and thoughtful suggestion.
We will include the Q8-2 discussion in the main paper and add the corresponding results to the supplementary material. As you suggested, we also plan to run additional experiments using traditional, controlled corruptions to complement our generative degradation strategy. In the revised paper, we will explicitly mention that generative perturbations may be less systematic than conventional corruption methods and could introduce unintended variations.
Moreover, to address this concern, we also conducted a statistical test using several trained No-Reference Image Quality Assessment (NR-IQA) models, as part of our additional response to Reviewer TPL5. The results confirm that our generative degradations reliably reduce perceptual quality and do so consistently across demographic groups, helping mitigate potential confounders. We believe this additional analysis strengthens the soundness of our current approach.
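For completeness, a rough sketch of the kind of check described above, assuming the pyiqa package for NR-IQA scoring and a Welch t-test across groups (the metric name, file layout, and group labels are placeholders, not the authors' setup):

```python
import glob
import pyiqa
import torch
from scipy.stats import ttest_ind

device = "cuda" if torch.cuda.is_available() else "cpu"
metric = pyiqa.create_metric("musiq", device=device)  # any trained NR-IQA model

def scores(pattern: str) -> list:
    # Score every image matching the glob pattern with the NR-IQA model.
    return [metric(path).item() for path in glob.glob(pattern)]

# Placeholder directory layout: degraded images split by demographic group.
group_a = scores("degraded/groupA/*.png")
group_b = scores("degraded/groupB/*.png")

# If the degradation is applied consistently, the group means should not differ significantly.
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)
print(f"mean A={sum(group_a)/len(group_a):.2f}, "
      f"mean B={sum(group_b)/len(group_b):.2f}, p={p_value:.3f}")
```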
We will incorporate all of these additions into the revised paper and sincerely appreciate your constructive feedback and positive evaluation.
This paper was reviewed by four knowledgeable reviewers. The reviewers raised concerns about:
- the presentation of the paper (title, scope, notations, dataset details) [bZa5, hSoy]
- the positioning of the contribution w.r.t. prior art [bZa5], the ambiguity of the contribution and its motivation [TPL5, Us7e]
- the limited experimental evidence [bZa5, hSoy], with no human evaluations [TPL5]
- the evaluation dataset construction [Us7e]
- the ablations (lambda hyper-parameters) [hSoy]
- the discussion of failure cases [hSoy]
The rebuttal partially addressed the reviewers' concerns and clarified the presentation issues as well as the positioning of the contribution w.r.t. prior art. The rebuttal also introduced sensitivity analyses and stratification of results, justified the sample size choices by referencing prior works, and presented additional results for DINOv2 and DeepFloyd. The authors engaged in discussion with the reviewers on the point related to evaluation dataset construction, acknowledged the limitation of the lack of user studies, and presented a statistical test supporting their claims. The suggested user study was left as future work, and the rebuttal did not convincingly address the points raised w.r.t. additional evaluation metrics such as aesthetics scores. After the reviewer-author discussion, this paper appears borderline. During the reviewer-AC discussion, one reviewer appears convinced by the authors' rebuttal, while another reviewer remains skeptical and raises concerns about the quality of the reference images.
The AC read the paper and acknowledges the importance of the problem tackled and the potential of the contribution. However, the AC remains unconvinced about some of the claims in the paper and the presented experimental evidence. In particular, the AC would like to point out that other metrics (beyond FID, which entangles both quality and diversity) have been introduced in the literature to measure quality (see precision [a] and coverage [b] as alternatives, which also require reference data). These metrics have been used in the literature with group-specific references in the context of geodiversity [c], similarly to intra-FID [d]. The AC agrees with the reviewer that considering non-reference-based metrics such as the aesthetics score would be valuable. Finally, the AC finds the comparisons in Table 1 limited in scope, and suggests that including diversity-inducing guidance functions as baseline comparisons would strengthen the contribution. In particular, the authors might want to consider CADS [e], Interval Guidance [f], contextualized Vendi Score guidance [g], auto-guidance [h], and Fair Diffusion with semantic guidance [i], among others. The AC also agrees that adding a user study would significantly strengthen the contribution. As a minor point, the AC would also suggest considering DINOv2 for the motivating Figures 3 and 4, since DINOv2 has been shown in the literature to correlate well with human judgement when assessing image similarity. For all the above-mentioned reasons, the AC recommends rejection and encourages the authors to consider the feedback received to improve future iterations of their work.
[a] https://arxiv.org/pdf/1904.06991
[b] https://arxiv.org/abs/2002.09797
[c] https://arxiv.org/abs/2308.06198
[d] https://arxiv.org/abs/1802.05637
[e] https://arxiv.org/abs/2310.17347
[f] https://arxiv.org/abs/2404.07724
[g] https://arxiv.org/abs/2406.04551
[h] https://arxiv.org/html/2406.02507v1
[i] https://arxiv.org/abs/2302.10893