PaperHub

Overall rating: 6.0 / 10 (Rejected; 4 reviewers)
Ratings: 6, 6, 6, 6 (min 6, max 6, std 0.0)
Confidence: 4.0
Correctness: 2.8 | Contribution: 2.5 | Presentation: 2.8
ICLR 2025

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

OpenReview | PDF
Submitted: 2024-09-27 | Updated: 2025-02-05
TL;DR

A novel benchmark using a comprehensive preference dataset to evaluate multimodal judges across multiple key perspectives

Abstract

Keywords
Multimodal Reward Models, Foundation Models Alignment, Reinforcement Learning from Human Feedback

Reviews and Discussion

Official Review (Rating: 6)

The paper organizes a new preference dataset called MJ-Bench, which captures four different aspects of text-to-image generation: text-image alignment, safety, image quality, and bias.

The authors propose two methods to obtain feedback from multi-modal judges: single-input and multi-input judging. The paper also conducts detailed analyses, such as the consistency of the judges' preferences w.r.t. different image modes.

Strengths

  1. The paper tackles a crucial issue in evaluating text-to-image models.
  2. The paper includes detailed ablation studies with some insightful findings, such as the analysis of the consistency of the judges' preferences w.r.t. different image modes.
  3. The proposed dataset offers many formats (e.g., ranking and voting), which can enable a wider variety of preference modeling.

Weaknesses

  1. The paper's novelty seems limited, as the whole pipeline has been proposed in the previous method. Besides, the selection of these four aspects lacks in-depth analysis.
  2. The paper employs human evaluators for experimental evaluation multiple times, yet it fails to report the number of human evaluators involved. Since human evaluators may introduce bias, it is recommended to report this metric. If the number is small, it is advised to increase the scale of human evaluators.
  3. The scale of the dataset was not compared with other existing datasets. As a result, the application scope of the dataset may be limited.
  4. Since this work is a benchmark study, the quality of the samples within the benchmark was not evaluated. It is recommended to supplement the study with an experiment to assess the quality of the samples in the dataset.
  5. The dataset only involved feedback from six judges to train a model (as indicated on page 7). It is suggested to supplement the study with an experiment where feedback is directly constructed from the benchmark data to train a model, in order to observe the outcomes.

Questions

Please refer to the weaknesses.

Details of Ethics Concerns

N/A

Comment

To further clarify the human verification process, we provide a more detailed point-by-point explanation of the data curation below, along with a statistical summary in Table-r. 2.

  • VLM pre-process: Specifically, as described in Appendix A in the paper, we first gather corresponding image pairs for each perspective through different algorithms we propose. This results in a substantial number of samples, with each perspective containing a similar quantity. Our first step for quality control is then to adopt a powerful VLM (LLaVa-NeXT-34B) to pre-process the data and filter out the wrong preference pairs (e.g., for the alignment subset, we only include those image pairs where the positive sample completely aligns with the prompt and the negative sample includes hallucinated entities or relationships). In this step, we aim to ensure the overall correctness of the image pairs, without yet considering whether they are challenging enough or of high quality. The samples we obtain in this process are 6260, 4852, and 5964 pairs for the alignment, safety, and quality perspectives, and 140 groups for the bias perspective.

  • Human verification: Next, we engage human verifiers to evaluate each preference pair, considering both images alongside the corresponding prompt. Specifically, we ask 12 humans split into four groups to annotate for each subset. In this step, the verifiers are tasked not only with confirming the correctness of the pair (e.g., ensuring the chosen image in the alignment subset fully aligns with the prompt) but also with assigning a difficulty rating from 0 to 5. This rating reflects how challenging they perceive the pair to be, based on the premise that the reason for the preference is clear and verifiable. The greater the difficulty for the model to distinguish between the images, the higher the rating. This process results in 2,489, 2,271, and 1,680 validated pairs for the alignment, safety, and quality perspectives, respectively, as well as 105 groups for the bias perspective. All pairs are verified for accuracy by human evaluators, with each accompanied by a difficulty rating.

  • Benchmark Confidence Sampling: Although the current dataset is verified and ready for use, its size poses significant computational and time-related challenges. To address this, we draw inspiration from [7], which suggests that a subset of the benchmark samples is usually sufficient to provide a certified and reliable evaluation for each model. To implement this, we use three surrogate models (MiniGPT4-v1, InternVL-Chat-V1.2, and LLaVA-V1.2) to run inferences on the dataset, progressing from higher-difficulty samples to lower-difficulty ones. We then calculate the confidence interval (variance) of each model's performance on the dataset. Using a threshold of 0.1, we ensure that each subset contains enough samples to provide a confident estimate of model performance within this interval. This approach not only ensures that the more diverse and challenging samples are prioritized, but also guarantees an efficient and sufficient sample size for evaluation while maintaining statistical reliability. As a result, we obtain 724, 574, and 1,121 validated pairs for the alignment, safety, and quality perspectives, respectively, as well as 18 groups for the bias perspective.

We then compile these samples to form the final evaluation set for each perspective in MJ-Bench. This rigorous quality control pipeline ensures that the collected samples and resulting evaluations are reliable, challenging, and efficient.
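For concreteness, the confidence sampling step can be sketched in a few lines of Python. This is a minimal illustration rather than our exact implementation: the function and field names are ours, and we use the standard error of each surrogate model's mean accuracy as the stopping criterion, which plays the role of the variance-based confidence interval described above.

```python
import numpy as np

def confidence_subsample(samples, surrogate_preds, threshold=0.1, min_size=50):
    """Illustrative sketch: grow the evaluation subset, hardest samples first,
    until every surrogate model's accuracy estimate is stable within `threshold`.

    samples: list of dicts, each with a human-assigned "difficulty" in [0, 5]
    surrogate_preds: {model_name: list of 0/1 correctness flags, aligned with samples}
    """
    # Sort sample indices from highest to lowest human-rated difficulty.
    order = sorted(range(len(samples)),
                   key=lambda i: samples[i]["difficulty"], reverse=True)

    selected = []
    for idx in order:
        selected.append(idx)
        if len(selected) < min_size:
            continue
        # Standard error of the mean accuracy for each surrogate model.
        errors = []
        for preds in surrogate_preds.values():
            correctness = np.array([preds[i] for i in selected], dtype=float)
            errors.append(correctness.std(ddof=1) / np.sqrt(len(correctness)))
        # Stop once every surrogate's estimate is confident enough.
        if max(errors) < threshold:
            break
    return [samples[i] for i in selected]
```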

To demonstrate the quality of our dataset, we fine-tuned a text-to-image model (SD-1.5) directly using the preference pairs from MJ-Bench, showcasing the value of the data samples in our dataset. We compared this model with the SD-1.5 base model and the SD-1.5 model fine-tuned using GPT-4o feedback, with the results presented in Table-r. 3. Based on human judge feedback, the model fine-tuned with MJ-Bench significantly outperforms the one fine-tuned with GPT-4o feedback in alignment, safety, and bias perspectives, while achieving comparable performance in the quality perspective. This demonstrates the high quality and reliability of our dataset. Additional case studies and comparisons of the outputs from the three models are provided in Figures 7, 8, and 9 in the appendix.

(continued in our next response)

Comment

Table-r. 2 Statistics of the data curation procedure and quality control.

|  | Alignment | Safety | Quality | Bias (group) |
|---|---|---|---|---|
| Total | 6260 | 4852 | 5964 | 140 |
| Human Selected | 2489 | 2271 | 1680 | 105 |
| Confidence Selected | 724 | 574 | 1121 | 18 |

Table-r. 3 Human evaluation result on the generated images from three models, i.e. SD-1.5 base model, SD-1.5 fine-tuned with the feedback provided by GPT-4o, and SD-1.5 fine-tuned directly on MJ-Bench via DPO. Specifically, we consider the average ranking of the image generated by each model as the metric. The best performance is in bold.

| Dataset Configuration | Alignment | Safety | Quality | Bias |
|---|---|---|---|---|
| SD-1.5 Base | 2.47 | 2.70 | 2.23 | 2.63 |
| SD-1.5 + GPT-4o | 1.95 | 1.91 | 1.87 | 2.11 |
| SD-1.5 + MJ-Bench | 1.58 | 1.39 | 1.90 | 1.26 |

We hope these additional details clarify our methodology and address the reviewer’s concerns about potential bias and the number of human evaluators involved.

[7] Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., & Yurochkin, M. (2024). tinyBenchmarks: evaluating LLMs with fewer examples. arXiv preprint arXiv:2402.14992.

W3: The scale of the dataset was not compared with other existing datasets. As a result, the application scope of the dataset may be limited.

We sincerely appreciate the reviewer’s concern regarding the scale and comparison of our dataset with existing datasets. In response, we have provided a detailed comparison of our dataset with prior datasets in our response to W1 and an explanation of the data curation procedure in W2. Following the reviewer’s suggestion, we have also included an expanded discussion on this topic in the related works section of the revised paper.

Specifically, while previous works, such as VisionPrefer, offer larger datasets in terms of total scale, they primarily focus on the purpose of training reward models and rely solely on GPT-4v for annotation, with limited quality assurance. These datasets lack rigorous quality control and do not guarantee challenging samples for benchmarking purposes. In contrast, our benchmark undergoes a rigorous human selection and confidence interval sampling process to ensure both diversity and high difficulty, making the dataset more reliable for evaluating multimodal judges.

Additionally, previous works often limit their focus to a small number of broad categories. In comparison, our benchmark investigates a much broader and more comprehensive set of issues in text-to-image feedback, with over 20 fine-grained subcategories. Therefore, as a benchmark, our dataset stands out compared to existing works due to its large scale in terms of evaluation scenarios, high quality, and challenging nature of the data samples, making it uniquely suited for its purpose. Each sample is meticulously curated to provide meaningful evaluation scenarios, validated through a rigorous data selection pipeline. As shown in Table-r. 2, our dataset offers a sufficient number of samples for each category, ensuring the statistical confidence needed to evaluate judges' feedback effectively. While a larger dataset might seem advantageous, we adhere to the principle of efficient benchmarking outlined in [7]. By employing a statistically rigorous approach, we narrowed down the sample size to one that is both manageable and sufficient to derive reliable estimates, making our benchmark perfectly tailored for accurate and efficient evaluations. This level of quality and reliability is often lacking in other datasets, which frequently suffer from poor-quality samples and inadequate quality control.

Moreover, despite having a smaller sample size than some training-oriented datasets, our benchmark’s high-quality samples are sufficient to achieve superior performance when fine-tuning models. As shown in Table-r. 3 above, models fine-tuned on our dataset consistently outperform those trained on lower-quality, larger-scale datasets, validating the effectiveness and impact of our approach. In summary, our dataset offers a balanced trade-off between scale, quality, and comprehensiveness, making it a valuable resource for benchmarking text-to-image generation feedback.

Comment

W4: Since this work is a benchmark study, the quality of the samples within the benchmark was not evaluated. It is recommended to supplement the study with an experiment to assess the quality of the samples in the dataset.

We sincerely appreciate the reviewer for raising this important point. In response, we have provided a more detailed explanation of our quality control process in our response to W2, where we outline each stage of our rigorous curation process to ensure the high quality, difficulty, and efficiency of the data samples included in the benchmark. Additionally, we conducted a supplementary experiment to quantify the quality of our dataset. Specifically, we fine-tuned a model directly using the original data from MJ-Bench and compared its performance against the SD-1.5 base model and the SD-1.5 model fine-tuned with GPT-4o feedback. The results in Table-r. 3 above demonstrate that the model trained on MJ-Bench data significantly outperforms the others in alignment, safety, and bias perspectives while achieving performance comparable to GPT-4o in the quality perspective, further validating the high quality of our dataset.

To provide additional insights, we have included case studies comparing the performance of the three models across the four perspectives in Figures 7, 8, and 9 in the appendix, further showcasing the robustness and quality of our dataset.

W5: The dataset only involved feedback from six judges to train a model (as indicated on page 7). It is suggested to supplement the study with an experiment where feedback is directly constructed from the benchmark data to train a model, in order to observe the outcomes.

We sincerely appreciate the reviewer’s insightful observation. In response, we conducted an additional experiment where we directly trained a model using the MJ-Bench dataset via DPO and compared its performance with the base SD-1.5 model and the model fine-tuned using feedback from GPT-4o. The results shown in Table-r. 3 demonstrate the superior quality of the data samples in MJ-Bench, which were curated through a rigorous human verification and confidence sampling process that we proposed. We provide a detailed analysis of this experiment in our responses to W2 and W4, further highlighting the robustness and effectiveness of our benchmark.

We hope the additional experiments regarding the comparison with related works and quality control for our proposed dataset, as well as the expanded explanations regarding the training procedure have addressed your concerns. Please don’t hesitate to let us know if there’s anything else you’d like to discuss or if further clarification is needed.

If all your concerns have been resolved, we would sincerely appreciate it if you could consider raising your score to help us share this work with a broader community.

Thank you again for your time and thoughtful feedback!

Best regards,

Submission #9037 Authors

Comment

Dear reviewer Rmzq,

As the discussion period nears its end, we wanted to follow up to see if our response, which provides an in-depth analysis of the unique contribution of our work and its distinction from related work, an analysis of a new MoE-based judge model trained on our dataset, a more thorough explanation and experimental verification of the quality control and human verification process, as well as a text-to-image model trained directly using our dataset, has addressed your concerns. Please let us know if there are any additional questions or areas we can clarify further.

We really look forward to your feedback and we will try our best to improve our work based on your suggestions. Thank you again!!

Comment

We hope the above additional clarifications of the related works and the unique contributions of our work have addressed the reviewer's concern.

[1] Chen, D., Chen, R., Zhang, S., Liu, Y., Wang, Y., Zhou, H., ... & Sun, L. (2024). Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. arXiv preprint arXiv:2402.04788.

[2] Wu, X., Huang, S., & Wei, F. (2024). Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation. arXiv preprint arXiv:2404.15100.

[3] Wang, H., Xiong, W., Xie, T., Zhao, H., & Zhang, T. (2024). Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts. arXiv preprint arXiv:2406.12845.

[4] Jiao, Q., Chen, D., Huang, Y., Li, Y., & Shen, Y. (2024). Img-diff: Contrastive data synthesis for multimodal large language models. arXiv preprint arXiv:2408.04594.

[5] Zhou, Y., Cui, C., Rafailov, R., Finn, C., & Yao, H. (2024). Aligning modalities in vision large language models via preference fine-tuning. arXiv preprint arXiv:2402.11411.

[6] Lee, T., Yasunaga, M., Meng, C., Mai, Y., Park, J. S., Gupta, A., ... & Liang, P. S. (2024). Holistic evaluation of text-to-image models. Advances in Neural Information Processing Systems, 36.

W2: The paper employs human evaluators for experimental evaluation multiple times, yet it fails to report the number of human evaluators involved. Since human evaluators may introduce bias, it is recommended to report this metric. If the number is small, it is advised to increase the scale of human evaluators.

We sincerely appreciate the reviewer for raising this concern and have updated the paper to include detailed information about the usage of human evaluators. To provide a direct answer, we employed human evaluators at two key stages:

  1. Human Verification During Annotation: For data annotation, we engaged 12 human verifiers divided into four groups. Each verifier was assigned a specific portion of the images to verify, with no overlap between individuals, ensuring efficiency in the annotation process. While we acknowledge the reviewer's concern that relying on a single annotator per image might introduce bias, the large number of samples requiring verification (as detailed in Table-r. 2 below) makes this approach the most practical and efficient. Furthermore, we conducted a quantitative analysis by fine-tuning a model on our curated dataset, which also demonstrated the high quality and reliability of the dataset. We present the results in Table-r. 3 below.

  2. Human Evaluation of End-to-End Fine-Tuned Models:
    For evaluating the end-to-end fine-tuned models, we again engaged the same 12 human evaluators. Each evaluator ranked six images generated by SD-1.5 models fine-tuned on feedback from six different judges. To reduce bias in this stage, each set of images was reviewed by at least three evaluators, and their rankings were averaged to produce the final results. This approach ensures the reliability of the evaluation while maintaining efficiency.

(continued in our next response)

Comment

To further expand the contribution and novelty of our paper, inspired by the VisionPrefer method which trains a reward model on their curated preference dataset, we designed an additional experiment where 80% of the MJ-Bench data was randomly split off (except for bias, where we use 64 groups of the data filtered out from the confidence filtering stage specified below) to train an MoE-based judge model, following the method in [3]. The model incorporates four experts, each responsible for a specific perspective, with a gating layer to aggregate scores across perspectives, trained via the Bradley-Terry (BT) objective. We then use the remaining 20% of the data as a test set. Results are reported in Table-r. 1.
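For clarity, a minimal sketch of this MoE-based judge and the BT training objective is given below. It is illustrative only: the embedding backbone, hidden sizes, and variable names are assumptions and do not reflect the exact configuration we used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEJudge(nn.Module):
    """Illustrative four-expert reward head over a frozen image-text embedding
    (one expert per perspective: alignment, safety, quality, bias)."""
    def __init__(self, emb_dim=768, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 1))
             for _ in range(num_experts)])
        # Gating layer mixes the per-perspective scores into a single reward.
        self.gate = nn.Linear(emb_dim, num_experts)

    def forward(self, z):                                          # z: (B, emb_dim)
        scores = torch.cat([e(z) for e in self.experts], dim=-1)   # (B, 4)
        weights = F.softmax(self.gate(z), dim=-1)                  # (B, 4)
        return (weights * scores).sum(dim=-1)                      # (B,)

def bt_loss(judge, z_chosen, z_rejected):
    """Bradley-Terry objective: the chosen image should receive a higher score."""
    return -F.logsigmoid(judge(z_chosen) - judge(z_rejected)).mean()
```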

From Table-r. 1, we observe that the MoE-based judge trained on MJ-Bench outperforms other models in alignment, safety, and bias perspectives in terms of w/ tie scores while being very close to GPT-4o on the quality subset. These findings highlight the advantages of MoE structures for handling multi-objective feedback and underscore the high quality of MJ-Bench data samples. However, the results also suggest that scaling up MJ-Bench, particularly in the quality subset, could further enhance performance, potentially surpassing GPT-4o. Due to time constraints, we plan to train our reward model on a larger held-out training set and evaluate it on the full MJ-Bench test set to compare against more models. Unfortunately, since the authors of VisionPrefer have not open-sourced the VP-Score model, we were unable to include it in our benchmark evaluation.

Table-r. 1 Additional evaluation results of our MoE-based judge model trained on a split from MJ-Bench. We evaluate and compare a subset of the models with the best performance from Table 2 in the paper, using the rest of the data as the test set. The best performance is in bold.

| Model | Alignment Avg w/ Tie | Alignment Avg w/o Tie | Safety Avg w/ Tie | Safety Avg w/o Tie | Quality Avg w/ Tie | Quality Avg w/o Tie | Bias ACC | Bias NDS | Bias GES |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 58.7 | 63.0 | 43.2 | 97.3 | 93.5 | 95.2 | 66.3 | 84.9 | 91.2 |
| LLaMA-3.2-11B-Vision | 60.2 | 64.2 | 38.1 | 80.0 | 68.5 | 74.3 | 83.0 | 84.5 | 89.5 |
| HPS-v2.1 | 42.2 | 64.3 | 18.6 | 40.0 | 68.3 | 88.4 | 57.4 | 74.1 | 86.6 |
| MJ-Bench | 71.2 | 72.0 | 77.0 | 80.2 | 90.6 | 94.2 | 86.1 | 84.7 | 90.1 |

Specifically, regarding the reviewer's concern about "an in-depth analysis of the selection of these four aspects," we would like to provide the following justifications:

  • Focus on Critical Challenges in Image Generation Models:
    Our selection of the four aspects is rooted in the most critical challenges faced by image generation models, as identified in recent studies and benchmarks, such as [1]. These challenges can be broadly summarized into the following four key issues:

    1. Text-Image Misalignment, where models generate plausible entities that contradict the instructions (commonly referred to as hallucination).
    2. Unsafe Content, where outputs include harmful, toxic, sexual, or violent concepts.
    3. Low-Quality Generation, characterized by blurry or unnatural artifacts in the images.
    4. Bias and Stereotypical Outputs, where models produce biased results favoring or opposing certain demographic groups.

    Each of the "aspects" in our dataset is carefully curated to address these challenges, recognizing the critical role of feedback in evaluating and improving current image generation models.

  • Comparison with Concurrent Works:
    Concurrent works, such as VisionPrefer, also focus on subsets of the aspects we address. For example, VisionPrefer studies "prompt-following," "harmlessness," "fidelity," and "aesthetics," which correspond to the "alignment," "safety," and "quality" perspectives in our benchmark. However, MJ-Bench uniquely includes an additional bias perspective, addressing a crucial issue that remains underexplored in current image generation models. Furthermore, unlike VisionPrefer, which focuses on four broad categories, MJ-Bench offers a more granular and exhaustive evaluation. Each perspective is further decomposed into multiple subcategories, providing a deeper and more comprehensive assessment of multimodal feedback.

(continued in our next response)

Comment

Thanks for the response which has resolved most of my concerns. I suggest adding a table to compare the statistics of the proposed dataset with existing datasets.

Comment

Thank you for your prompt feedback! We totally agree with the reviewer that adding a detailed table to compare the statistics of different datasets would better highlight the novelty and unique contribution of our datasets. Following the reviewer's suggestion, we have updated the revision and added the following table to Appendix A, as well as providing it here in Table-r. 4 below.

Table-r. 4 Statistics of existing preference datasets for text-to-image generative models. Specifically, #Sample indicates the number of images in each dataset to ensure a fair comparison. In terms of metric, Automatic indicates preference accuracy, and end-to-end indicates human evaluation of the trained text-to-image models using the dataset. We also demonstrate the distribution of categories and fine-grained sub-categories, as well as the different feedback formats in each dataset.

| Dataset | Annotator | Overall #Sample | Benchmark #Sample | Fine-grained Categories |
|---|---|---|---|---|
| HPD v1 | Discord users | 98K | 5K | 1 |
| HPD v2 | Human Expert | 434K | 4K | 4 |
| ImageRewardDB | Human Expert | 137K | 6K | 1 |
| Pick-a-Pic v2 | Web users | 851K | 500 | 1 |
| VisionPrefer | GPT-4v | 1.2M | 0 | 4 |
| MJ-Bench | Human Expert | 220K | 8K | 22 |

Thanks to the reviewer's suggestion, this table better demonstrates the key advantages of MJ-Bench over existing datasets: we offer a larger number of high-quality samples (all designed and verified by human experts) to ensure reliability, whereas other datasets that rely on automated annotations (e.g., GPT-4v) to achieve larger scale may compromise data quality. Besides, we carefully designed a benchmark set covering 22 fine-grained categories to ensure a diverse and complete comparison of different models, while existing datasets simply split testing samples from training sets, introducing biases and limiting evaluation comprehensiveness.

We sincerely appreciate the reviewer for the valuable feedback and continuing effort to help us improve our work! We hope these updates align with your expectations and would be grateful if you could kindly consider further raising the score to help us share this work with a broader audience! Please let us know if there are any remaining concerns or further suggestions. Thank you again!!

Comment

Thanks for the authors' feedback. Most of my concerns have been addressed. I choose to maintain my positive rating.

Comment

Dear Reviewer Rmzq,

Thank you very much for your response! We’re glad to know that we have addressed your concerns and we sincerely appreciate your support in sharing our work with a wider audience!

Comment

Dear Reviewer Rmzq,

Thank you so much for your valuable suggestions to help improve our paper! We deeply appreciate your recognition of the novelty of our work and its critical contribution to advance research in RLAIF for text-to-image generation. Below, we have provided a detailed, point-by-point response to your questions, and hope these could help address your concerns.

W1: The paper's novelty seems limited, as the whole pipeline has been proposed in the previous method. Besides, the selection of these four aspects lacks in-depth analysis.

We sincerely appreciate the reviewer’s concern and would like to provide a more comprehensive explanation to highlight the novel contributions of our work. First, we emphasize that while many studies have explored reward models and judge feedback, MJ-Bench is the first to deeply investigate and comprehensively benchmark these models’ feedback for image generation by introducing a large-scale, high-quality dataset with fine-grained categories—addressing a critical yet previously unexplored topic. To further clarify our unique contributions, we have addressed the perceived overlap with prior methods and provided a detailed response to the question of novelty and the rationale behind the selection of the four evaluation aspects.

While various benchmarks have been developed to assess the capabilities and limitations of general MLLMs, most of these primarily evaluate the generative capabilities of multimodal foundation models themselves rather than their capacity to serve as evaluative judges. Prior works have highlighted that FMs may perform significantly differently in generative tasks compared to classification or feedback tasks, such as providing reward signals. This distinction complicates the direct application of generative benchmarks to evaluative roles as a judge.

Several preliminary efforts, such as [1], have explored using FMs as judges, but these works primarily focus on textual responses from LLMs and VLMs, failing to account for their multimodal feedback in the context of image generation models. Similarly, concurrent work like VisionPrefer [2] investigates reward models for image generation but focuses solely on curating a dataset with four broad subsets, lacking the fine-grained granularity required for detailed multimodal feedback assessment. Other works, such as [4] and [5], attempt to improve text-image alignment with MLLM feedback but rely on preference datasets generated through simple heuristics, without ensuring the diversity and quality of the datasets.

In contrast, MJ-Bench introduces several novel contributions that distinguish it from previous methods:

  • First Benchmark on Multimodal Judge Feedback for Image Generation:
    While prior works have primarily focused on textual feedback for tasks such as text generation or image captioning, MJ-Bench uniquely evaluates feedback from multimodal judges in the context of image generation—an important yet underexplored task for the RLAIF community.

  • Comprehensive and Fine-Grained Categorization:
    Unlike earlier benchmarks that only provide large-scale datasets with broad categories, MJ-Bench offers a large dataset with over 20 subcategories. Each subcategory captures a specific aspect of feedback from multimodal judges, enabling a more comprehensive and detailed evaluation. This granularity provides valuable insights for future research.

  • High-Quality Annotation and Data Selection:
    MJ-Bench employs a rigorous data sampling and quality control pipeline, ensuring that every data sample is not only of high quality but also sufficiently challenging to serve as a benchmark. Our process includes VLM pre-processing, human verification, and confidence interval sampling to maintain both diversity and reliability.

  • Comprehensive Evaluation Framework:
    As noted by the reviewer, we have conducted extensive evaluations, including:

    • Automatic metrics based on a large-scale fine-grained dataset of chosen-rejected triplets.
    • End-to-end human evaluations on models fine-tuned with such judge feedback via DPO/DDPO.
    • Feedback quality assessments at various feedback scales.
    • Consistency evaluation of model preferences across different input modes.
    • Human evaluations through multiple formats, including voting and ranking.
      In addition, we have included numerous case studies in the appendix to provide intuitive insights into our evaluation results.

(continued in our next response)

Official Review (Rating: 6)

This paper proposes a benchmark for evaluating LMMs' judgment capability for T2I generation. The proposed benchmark set contains a few thousand (win, lose, prompt) triplets covering T2I evaluation aspects: alignment, safety, quality, and bias. To show that the proposed benchmark set can provide a fair platform to compare LMMs, this paper uses LMMs as the reward model and uses DPO/DDPO to finetune SD1.5. The results show that the model finetuned with a better LMM (evaluated with the proposed benchmark) is preferred by humans.

Strengths

  • Evaluating and benchmarking the capability of LMMs is an important research direction of the community nowadays. This paper proposes a comparably comprehensive benchmark set and provides a comparison of a large number of LMMs, which is valuable to the community.
  • I like the design of the benchmark questions based on (win, lose, prompt) triplets, which, in my opinion, can make the human annotation easier compared with direct quality rating.

Weaknesses

  • In terms of the presentation, I am unsure whether claiming to evaluate "reward models" is a better idea than claiming to evaluate "LMMs". Closed-source LMMs such as GPT-4o/4v/Claude are not widely adopted T2I reward models due to their high cost and low throughput. In addition, Table 10 summarizes the number of evaluation questions for each category of the benchmark, which is of pivotal importance and should be put in the main paper.
  • In Table 10, how is the number of evaluation questions for each category determined? I found that most categories, scenarios, and subsets have random numbers of evaluation questions. If so, does this evaluation benchmark introduce bias by itself? For example, the number of object questions (250) is more than 4 times that of counting questions (55). Since all LMMs are evaluated by the averaged metrics, is the proposed benchmark biased towards LMMs that are better at object questions?
  • Tables 2 and 3 show the results of SD1.5 finetuned with DPO and DDPO, which, according to my understanding, is the "evaluation" of the proposed benchmark. However, I found that the LMM that works best as the reward (Table 2) is not aligned with the LMM having the highest evaluation scores based on the benchmark (Table 1). Does this mean the proposed benchmark is not good for evaluating T2I reward models? In addition, in Table 3, GPT-4o and GPT-4v achieve the best results with DPO and DDPO, respectively. I think this result suggests that the evaluation of reward models should be performed for certain RLHF methods. I suggest the authors provide more discussion about this.

Questions

  • I found that very limited details about how to use GPT-4o/4v/Gemini/Claude as the reward model for DPO/DDPO are provided in the paper. Do you use online or offline RL? If online, how do you make use of these APIs and obtain the reward for DPO/DDPO fast? What if the API calls fail? If offline, what is the dataset used to finetune the model with DPO/DDPO?
Comment

W3: Tables 2 and 3 show the results of SD1.5 finetuned with DPO and DDPO, which, according to my understanding, is the "evaluation" of the proposed benchmark. However, I found that the LMM that works best as the reward (Table 2) is not aligned with the LMM having the highest evaluation scores based on the benchmark (Table 1). Does this mean the proposed benchmark is not good for evaluating T2I reward models? In addition, in Table 3, GPT-4o and GPT-4v achieve the best results with DPO and DDPO, respectively. I think this result suggests that the evaluation of reward models should be performed for certain RLHF methods. I suggest the authors provide more discussion about this.

We thank the reviewer for raising this important concern. As the reviewer noted, Tables 2 and 3 indeed provide an end-to-end evaluation of our benchmark. We greatly appreciate the reviewer’s observations regarding the results in Tables 2 and 1, and we would like to provide further details about the evaluation pipeline to explain why the results are reasonable and demonstrate the reliability of our benchmark. Specifically, we attribute the discrepancies between the results in Tables 1 and 2 to the following reasons:

  • Only pairs with explicit preference are being used: For fine-tuning the model via DPO, we only use pairs with an explicit preference (i.e., those without ties). For instance, the discrepancy observed by the reviewer, particularly in the safety perspective, stems from models like Gemini Ultra having a high rejection rate for unsafe inputs. This results in a low accuracy for safety w/ tie (13.1), as many outputs are classified as ties, despite having a high accuracy w/o tie (95.1). Since we rely on the w/o tie results as the guideline for fine-tuning via DPO, the results in Table 2 are more aligned with the w/o tie column in Table 1.

  • Human bias during evaluation: A notable trend in Table 2 is that GPT-4o tends to be ranked higher by human evaluators across all perspectives, whereas the performance of other models generally follows the trends observed in Table 1. We suspect this bias arises because, during the ranking task (illustrated in Figure 16 of the appendix), some annotators were informed about which model generated each image (e.g., the top-right image corresponds to GPT-4o). Since GPT-4o is widely recognized as a SOTA model for providing judge feedback, this knowledge may have influenced some annotators to favor GPT-4o's outputs, leading them to consistently rank images provided by GPT-4o higher. Consequently, this introduces a bias in the evaluation, skewing results in GPT-4o's favor.

  • Bias of the score-based evaluation: Another factor contributing to the discrepancy is the method used for scoring. Human evaluators were asked to assess six images simultaneously and assign a score to each within a [0, 10] scale, which was later used to calculate rankings. However, this approach may also introduce bias, as individual annotators may interpret and apply the numerical scale differently based on their personal perspectives. Additionally, evaluating multiple images at once can increase the cognitive load on annotators, potentially affecting the consistency and reliability of their evaluations, and further contributing to the biased results.

Following the reviewer’s suggestions, we aim to address the bias and annotation difficulty by adopting a simpler yet effective evaluation metric. Instead of scoring multiple images simultaneously, human annotators are asked to compare only a pair of images: one generated by the fine-tuned model and the other by the base SD-1.5 model (consistent across all evaluations of different models). We then calculate a win rate against the SD-1.5 for each model, with the results presented in Table-r. 3 below. This approach is more intuitive for annotators, reduces cognitive load, and minimizes bias introduced by individual interpretations of numerical scales. The results shown in Table-r. 3 align more closely with those in Table 1, with HPS-v2.1 and Gemini Ultra providing the most accurate feedback for the alignment perspective, GPT-4o excelling in Safety and Quality, and LLaMA-3.2-11B-Vision performing best in Bias. These additional results have been included in the paper revisions, and we hope they better demonstrate the effectiveness of our dataset and address the reviewer’s concerns.

(continued in our next response)

Comment

Table-r. 3 Win rate of the human evaluation on the images generated by various models fine-tuned via DPO, measured against the SD-1.5 base model. The best performance is in bold.

| Dataset Configuration | Alignment | Safety | Quality | Bias |
|---|---|---|---|---|
| SD-1.5 Base | 50.0 | 50.0 | 50.0 | 50.0 |
| HPS-v2.1 | 72.0 | 45.6 | 68.0 | 48.9 |
| InternVL-chat-v1-5 | 62.3 | 57.3 | 58.2 | 43.0 |
| LLaMA-3.2-11B-Vision | 71.0 | 66.8 | 61.7 | 77.4 |
| Claude 3 Opus | 60.3 | 62.4 | 56.5 | 66.7 |
| Gemini Ultra | 72.0 | 68.3 | 69.4 | 61.0 |
| GPT-4v | 70.3 | 67.4 | 71.2 | 69.8 |
| GPT-4o | 68.0 | 72.0 | 74.9 | 67.2 |

Q1: I found that very limited details about how to use GPT-4o/4v/Gemini/Claude as the reward model for DPO/DDPO are provided in the paper. Do you use online or offline RL? If online, how do you make use of these APIs and obtain the reward for DPO/DDPO fast? What if the API calls fail? If offline, what is the dataset used to finetune the model with DPO/DDPO?

We thank the reviewer for this insightful question. Below, we provide additional details regarding the training pipelines for DPO and DDPO, as used in the results presented in Tables 2 and 3 of the paper.

DDPO Training Pipeline: We follow the DDPO implementation outlined in [4], running the algorithm in an online manner. Specifically, the model generates a batch of 32 images in each iteration, and the corresponding judge model provides feedback for each image. The policy parameters are then updated based on the feedback using DDPO, resulting in a new model policy for sampling the next on-policy batch. This process is iterated until the model converges, typically after 30–40 iterations. We use default hyperparameters from [4] for the DDPO training. To address potential API failures during online RL, we employ a try-and-catch mechanism and rotate among multiple candidate APIs to ensure smooth training. Although we did not encounter training collapse due to API failures, we save model checkpoints at each training step, enabling efficient recovery by resuming training from the latest checkpoint if issues arise. While our current focus is on feedback quality rather than API efficiency, we assume the batch-request feature could be leveraged to request feedback for all images in a batch simultaneously from closed-source APIs.
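To make the fault-tolerance mechanism concrete, below is a minimal sketch of the try-and-catch wrapper with API rotation described above. The `judge_clients` objects and their `score()` method are hypothetical placeholders, not actual API signatures of any specific provider.

```python
import time

def robust_judge_reward(image, prompt, judge_clients, max_rounds=3, backoff=2.0):
    """Query a judge API with retries and rotation among fallback clients
    (illustrative sketch; each client is assumed to expose score(image, prompt))."""
    last_err = None
    for attempt in range(max_rounds):
        for client in judge_clients:            # rotate among candidate APIs
            try:
                return client.score(image, prompt)
            except Exception as err:            # on failure, try the next client
                last_err = err
        time.sleep(backoff * (attempt + 1))     # back off before the next round
    raise RuntimeError(f"All judge APIs failed: {last_err}")
```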

DPO Training Pipeline: For DPO, we adopt an offline approach. First, we collect 1,280 images for each model through offline sampling. The corresponding judge model then provides preferences for each image. The base SD-1.5 model is fine-tuned on the preference data using DPO. Furthermore, for both DPO and DDPO, the dataset used for fine-tuning is derived from textual prompts sampled from each perspective of MJ-Bench. This dataset has been curated to ensure diverse and comprehensive coverage of tasks and scenarios through VLM pre-processing and human filtering. Additionally, the dataset includes a large pool of unused prompts (approximately three times the size of the final dataset) filtered out during the confidence interval sampling process, ensuring a sufficient supply of prompts for training.
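A rough sketch of how the offline preference pairs are assembled for DPO is shown below. The callables `generate` and `judge_score` are hypothetical stand-ins for the SD-1.5 sampler and the judge model, and details such as batching and the exact 1,280-image budget are omitted.

```python
def build_dpo_pairs(prompts, generate, judge_score, samples_per_prompt=2):
    """Illustrative sketch: sample images per prompt, score them with the judge,
    and keep (chosen, rejected) pairs with an explicit preference."""
    pairs = []
    for prompt in prompts:
        images = [generate(prompt) for _ in range(samples_per_prompt)]
        scores = [judge_score(img, prompt) for img in images]
        best = max(range(len(images)), key=scores.__getitem__)
        worst = min(range(len(images)), key=scores.__getitem__)
        if scores[best] == scores[worst]:
            continue                     # ties carry no explicit preference
        pairs.append({"prompt": prompt,
                      "chosen": images[best],
                      "rejected": images[worst]})
    return pairs
```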

We hope these clarifications address the reviewer’s concerns and provide a detailed understanding of our training pipelines for DPO and DDPO.

[4] Black, K., Janner, M., Du, Y., Kostrikov, I., & Levine, S. (2023). Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301.

We hope the additional experiments regarding a more fair human evaluation and quality control for our proposed dataset, and the expanded explanations regarding the training procedure have addressed your concerns. Please don’t hesitate to let us know if there’s anything else you’d like to discuss or if further clarification is needed.

If all your concerns have been resolved, we would sincerely appreciate it if you could consider raising your score to help us share this work with a broader community.

Thank you again for your time and thoughtful feedback!

Best regards,

Submission #9037 Authors

Comment

Dear reviewer eath,

As the discussion period is ending soon, we wanted to follow up to see if our above response, which provides a more in-depth discussion of "reward models", a more thorough experimental verification and explanation of our data curation process, as well as an additional experiment involving a broader range of judges evaluated by humans in an end-to-end manner using win rate as a fairer metric, has addressed your concerns.

Please let us know if you have any further questions or if there are areas we can clarify. We truly value your feedback and are committed to improving our work based on your suggestions. Thank you very much!!

Comment

Dear Reviewer eath,

Thank you once again for your insightful feedback! We greatly value your guidance and have worked diligently to address your concerns and improve the paper. As the paper revision deadline approaches, we kindly ask if the changes we’ve made resolve your concerns.

We understand you have a busy schedule, but any additional comments or updated feedback you could provide would mean a great deal to us. Your expertise is invaluable in helping us refine our work further, and we warmly welcome any continued discussion.

Thank you for your time and thoughtful review!

Comment

Dear Reviewer eath,

We hope this message finds you well, and happy Thanksgiving! As the discussion period is nearing its conclusion, we wanted to follow up to ensure our latest response has addressed your concerns. Specifically, we have (1) provided a more in-depth discussion of "reward models" and "LMMs"; (2) offered a detailed explanation of our data curation process and especially explained how we determined the sample size for each subcategory in our dataset; (3) conducted thorough experiments to verify the quality control process; (4) designed an additional experiment where human experts evaluated a broader range of multimodal judges end-to-end using win rate as a fairer metric to obtain a more comprehensive and debiased comparison of different judges.

We deeply value your feedback and are committed to improving our work based on your suggestions. Please let us know if you have any further questions or if there are specific areas requiring clarification. Thank you for your time and thoughtful input!

Best regards,

Submission #9037 Authors

Comment

First of all, I would like to thank all authors for the very detailed response. I appreciate your effort to address my concerns. Most of my concerns have been addressed. The only remaining concern is that the benchmark questions in Tab. 10 are unbalanced between categories. Since the proposed benchmark, upon paper acceptance, will be used to evaluate the work of others, it is of pivotal importance to make sure the benchmark itself is balanced and unbiased. Could you please concisely respond to:

  1. Whether this is an issue in your proposed benchmark?

  2. How this issue can be addressed?

If these issues can be addressed well, I will raise my score.

Comment

Dear Reviewer eath,

Thank you so much for your valuable suggestions to help improve our paper! We deeply appreciate your recognition of the novelty of our work and its critical contribution to the multimodal AI feedback community. Below, we have provided a detailed, point-by-point response to your questions, and hope these could help address your concerns.

W1: In terms of the presentation, I am unsure whether claiming to evaluate "reward models" is a better idea than claiming to evaluate "LMMs". Closed-source LMMs such as GPT-4o/4v/Claude are not widely adopted T2I reward models due to their high cost and low throughput. In addition, Table 10 summarizes the number of evaluation questions for each category of the benchmark, which is of pivotal importance and should be put in the main paper.

We sincerely appreciate the reviewer’s thoughtful feedback. We agree that since our work focuses on a specific capability of "LMMs" to function as "reward models," both claims are valid. Specifically, if we adopt a broader definition of "reward models" that emphasizes their functional role in providing feedback, as characterized by prior works [1, 2], this can encompass both closed-source LLMs and LMMs like GPT-4o/4v, as well as open-sourced models such as LLaMA-3.2-11B-Vision-Instruct, despite the high inference costs and low throughput noted by the reviewer. While online RL algorithms like DDPO demand timely and responsive feedback, the low throughput issue highlighted by the reviewer can be mitigated using offline RL approaches, such as pre-collecting a dataset of trajectories and applying the reward models for preference annotation and then fine-tuning via DPO. However, we also acknowledge the reviewer’s point that, while we evaluate their ability to serve as reward models, we are ultimately assessing the capabilities of "LMMs" themselves. We have updated the manuscript to clarify this distinction in line with the reviewer’s suggestions.

Regarding the second suggestion, we completely agree that Table 10, which summarizes the dataset, is critical for effectively conveying key details of our dataset. We have moved this table to the main paper, specifically to Section 2.3, as suggested by the reviewer.

We hope these detailed explanations address the reviewer’s concerns, and we would be delighted to engage in further discussions should the reviewer have additional questions or feedback.

[1] Lambert, N., Pyatkin, V., Morrison, J., Miranda, L. J., Lin, B. Y., Chandu, K., ... & Hajishirzi, H. (2024). Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787.

[2] Mahan, D., Van Phung, D., Rafailov, R., Blagden, C., Lile, N., Castricato, L., ... & Albalak, A. (2024). Generative reward models. arXiv preprint arXiv:2410.12832.

Comment

W2: In Table 10, how is the number of evaluation questions for each category determined? I found that most categories, scenarios, and subsets have random numbers of evaluation questions. If so, does this evaluation benchmark introduce bias by itself? For example, the number of object questions (250) is more than 4 times that of counting questions (55). Since all LMMs are evaluated by the averaged metrics, is the proposed benchmark biased towards LMMs that are better at object questions?

We sincerely appreciate the reviewer for raising this question! Rather than randomly determining the number of evaluation questions, we employ a rigorous three-stage curation process: (1) VLM pre-processing, (2) human verification, and (3) confidence interval-efficient sampling. Each perspective subset begins with a large, roughly equal number of samples and is systematically refined to a high-quality subset, with the final counts presented in the paper.

To further clarify this process, we provide a detailed point-by-point explanation of the data curation and human verification stages below, along with a statistical summary in Table-r. 1.

  • VLM pre-process: Specifically, as described in Appendix A in the paper, we first gather corresponding image pairs for each perspective through the different algorithms we propose. This results in a substantial number of samples, with each perspective containing a similar quantity. Our first step for quality control is then to adopt a powerful VLM (LLaVa-NeXT-34B) to pre-process the data and filter out the wrong preference pairs (e.g., for the alignment subset, we only include those image pairs where the positive sample completely aligns with the prompt and the negative sample includes hallucinated entities or relationships). In this step, we aim to ensure the overall correctness of the image pairs, without yet considering whether they are challenging enough or of high quality. The samples we obtain in this process are 6260, 4852, and 5964 pairs for the alignment, safety, and quality perspectives, and 140 groups for the bias perspective.

  • Human verification: Next, we engage human verifiers to evaluate each preference pair, considering both images alongside the corresponding prompt. In this step, the verifiers are tasked not only with confirming the correctness of the pair (e.g., ensuring the chosen image in the alignment subset fully aligns with the prompt) but also with assigning a difficulty rating from 0 to 5. This rating reflects how challenging they perceive the pair to be, based on the premise that the reason for the preference is clear and verifiable. The greater the difficulty for the model to distinguish between the images, the higher the rating. This process results in 2,489, 2,271, and 1,680 validated pairs for the alignment, safety, and quality perspectives, respectively, as well as 105 groups for the bias perspective. All pairs are verified for accuracy by human evaluators, with each accompanied by a difficulty rating.

  • Benchmark Confidence Sampling: Although the current dataset is verified and ready for use, its size poses significant computational and time-related challenges. To address this, we draw inspiration from [3], which suggests that a subset of the benchmark samples is usually sufficient to provide a certified and reliable evaluation for each model. To implement this, we use three surrogate models (MiniGPT4-v1, InternVL-Chat-V1.2, and LLaVA-V1.2) to run inferences on the dataset, progressing from higher-difficulty samples to lower-difficulty ones. We then calculate the confidence interval (variance) of each model's performance on the dataset. Using a threshold of 0.1, we ensure that each subset contains enough samples to provide a confident estimate of model performance within this interval. This approach not only ensures that the more diverse and challenging samples are prioritized, but also guarantees an efficient and sufficient sample size for evaluation while maintaining statistical reliability. As a result, we obtain 724, 574, and 1,121 validated pairs for the alignment, safety, and quality perspectives, respectively, as well as 18 groups for the bias perspective.

We then compile these samples to form the final evaluation set for each perspective in MJ-Bench. This rigorous quality control pipeline ensures that the collected samples and resulting evaluations are reliable, challenging, and efficient.

(continued in our next response)

Comment

To demonstrate the quality of our dataset, we fine-tuned a text-to-image model (SD-1.5) directly using the preference pairs from MJ-Bench, showcasing the value of the data samples in our dataset. We compared this model with the SD-1.5 base model and the SD-1.5 model fine-tuned using GPT-4o feedback, with the results presented in Table-r. 4. Based on human judge feedback, the model fine-tuned with MJ-Bench significantly outperforms the one fine-tuned with GPT-4o feedback in alignment, safety, and bias perspectives, while achieving comparable performance in the quality perspective. This demonstrates the high quality and reliability of our dataset. Additional case studies and comparisons of the outputs from the three models are provided in Figures 7, 8, and 9 in the appendix.

To more concretely address the reviewer's question, we conclude that the uneven number of samples across categories and subcategories primarily reflects the rigorous human filtering and confidence interval sampling selection process. For example, the larger number of samples in the object subset compared to the counting subset indicates that evaluating multimodal models' capability as judges for object alignment requires more diverse and challenging samples to reach a conclusion with sufficient confidence, while for the counting task fewer samples suffice to reach the same confidence threshold. This ensures that performance estimates for such tasks can be made with statistically certified confidence. Therefore, instead of randomly selecting samples, the sample sizes in each subset are carefully determined through human verification and statistical certification, ensuring that the performance estimations for the evaluated models fall within a reliable trust region defined by the surrogate models.

Furthermore, while we use averaged metrics to evaluate LMMs, we calculate the final performance by averaging the individual metrics from each subcategory, rather than weighting them by sample size. For instance, if the average accuracy for the object and counting tasks is 0.65 and 0.23 respectively, the overall performance across the two subsets would be calculated as (0.65 + 0.23)/ 2 = 0.44 rather than (0.65 × 250 + 0.23 × 55) / 305 = 0.57. This approach aligns with standard practices in benchmark evaluations and effectively mitigates the bias issue raised by the reviewer where we provide a transparent evaluation of the overall performance of the judge model.
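The two averaging schemes can be contrasted with a short sketch that reproduces the worked example above (illustrative code; the function names are ours):

```python
def macro_average(per_category_acc):
    """Equal-weight average over subcategories, as used for MJ-Bench results."""
    return sum(per_category_acc.values()) / len(per_category_acc)

def sample_weighted_average(per_category_acc, per_category_count):
    """Sample-size-weighted alternative, shown only to illustrate the bias it
    would introduce toward larger subcategories; not used in our evaluation."""
    total = sum(per_category_count.values())
    return sum(per_category_acc[c] * per_category_count[c]
               for c in per_category_acc) / total

accs = {"object": 0.65, "counting": 0.23}
counts = {"object": 250, "counting": 55}
print(round(macro_average(accs), 2))                    # 0.44
print(round(sample_weighted_average(accs, counts), 2))  # 0.57
```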

Table-r. 1 Statistics of the data curation procedure and quality control.

|  | Alignment | Safety | Quality | Bias (group) |
|---|---|---|---|---|
| Total | 6260 | 4852 | 5964 | 140 |
| Human Selected | 2489 | 2271 | 1680 | 105 |
| Confidence Selected | 724 | 574 | 1121 | 18 |

Table-r. 2 Human evaluation result on the generated images from three models, i.e. SD-1.5 base model, SD-1.5 fine-tuned with the feedback provided by GPT-4o, and SD-1.5 fine-tuned directly on MJ-Bench via DPO. Specifically, we consider the average ranking of the image generated by each model as the metric. The best performance is in bold.

| Dataset Configuration | Alignment | Safety | Quality | Bias |
|---|---|---|---|---|
| SD-1.5 Base | 2.47 | 2.70 | 2.23 | 2.63 |
| SD-1.5 + GPT-4o | 1.95 | 1.91 | 1.87 | 2.11 |
| SD-1.5 + MJ-Bench | 1.58 | 1.39 | 1.90 | 1.26 |

We hope the above additional experiments and clarifications have better conveyed and demonstrated the quality control procedure of our dataset and addressed the reviewer's concern.

[3] Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., & Yurochkin, M. (2024). tinyBenchmarks: evaluating LLMs with fewer examples. arXiv preprint arXiv:2402.14992.

Comment

Thank you for your thoughtful response and for raising this important concern! We totally agree with the reviewer that ensuring a balanced and unbiased dataset distribution is crucial for fair and reliable evaluations, and we have indeed carefully considered this when we developed the dataset. Our detailed responses are as follows:

Is this an issue in the proposed benchmark?

The unbalanced sample size in Table 10 is not an inherent issue of the benchmark but a specific design choice aimed at achieving efficient evaluation without compromising accuracy. As detailed in Table-r. 1 above in response part 3, our dataset curation process involves three stages, where human verifiers first select 8K samples with guaranteed high quality and a balanced distribution across subcategories to ensure fairness in evaluation and comparability across models. However, due to the high cost associated with evaluating such a large dataset, we aim to further enhance sample efficiency to facilitate the broader usability of our benchmark. Specifically, we adopt a Confidence Subsampling process, where we use three surrogate models to search for a lower bound on the sample size for each subcategory such that the variance in performance estimation in each subset remains within a small statistical confidence threshold, guaranteeing both accuracy and efficiency in evaluation. While this process results in an unbalanced distribution across subcategories, it guarantees sufficient sample sizes in each subcategory to provide confident performance estimates efficiently. To validate this approach, we conducted empirical experiments with top-performing models, which showed a very small performance gap between the efficient subset and the full dataset, confirming its robustness. Based on these findings, we use the efficient subset to report the results in the paper.

How can this issue be addressed?

To address the concern, we will release both the full balanced dataset version and the efficient subset to accommodate users' diverse needs. Besides, we will include evaluations on the full balanced dataset in the updated revision of the paper to provide a comprehensive performance analysis, ensuring transparency and trust in the benchmark's fairness and applicability.

We sincerely hope that these actions will address the reviewer’s concerns while maintaining the benchmark’s utility for diverse use cases. Thank you again for your valuable feedback, and we hope these clarifications and proposed measures meet your expectations!

Comment

Dear Reviewer eath,

We are very glad to hear that most of your concerns have been addressed! We deeply appreciate your valuable suggestions and thoughtful feedback, which have been instrumental in helping us improve our work. As today is the last day that reviewers can post a message, we wanted to kindly follow up to ensure that our responses to your remaining concerns have adequately addressed them.

We sincerely hope our explanations have resolved your questions, but please don’t hesitate to let us know if you have any additional questions or suggestions. We are more than happy to provide further clarifications if needed.

Thank you once again for your valuable time and effort in helping us improve our work!

Comment

Thank you for answering my question. I agree that providing a balanced dataset alongside the filtered dataset might be a solution. Therefore, I slightly raise my score. Good luck.

Comment

Thank you very much for your response!! We’re glad to have addressed your concerns and truly appreciate your support in sharing our work with a wider audience!

Review
6
  • The paper introduces MJ-BENCH, a benchmark to improve feedback for text-to-image models on alignment, safety, image quality, and bias.
  • Some interesting points are found: e.g., Closed-source VLMs like GPT-4o outperform others overall, while smaller models excel in alignment and quality, and VLMs perform better on safety and bias. Human evaluations confirm MJ-BENCH’s effectiveness for model fine-tuning.

Strengths

  • The writing and presentation of the entire paper are good.
  • A comprehensive benchmark is provided, which can further advance research in the related field.
  • It further promotes research on RLAIF (Reinforcement Learning from AI Feedbacks) and provides solid evidence for its effectiveness.

Weaknesses

  • Please provide a comprehensive discussion that covers the related work. For example, some studies have also made efforts in aligning text-to-image models with feedback from MLLMs, such as VisionPrefer [1], which also discusses the varying effectiveness of different MLLMs as annotators for alignment data in text-to-image tasks.

  • It would be beneficial to discuss further the situations in which human judges and MLLM judges disagree, as this could provide valuable insights for future work.

  • Besides, to better demonstrate the effectiveness of MJ-BENCH, the authors are recommended to present some visualization cases of MJ-BENCH to help offer a clearer and more comprehensive understanding of the dataset's quality.

Reference: [1] Wu, X., Huang, S., & Wei, F. (2024). Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation. arXiv preprint arXiv:2404.15100.

Questions

Please see in Weaknesses

Comment

Table-r. 2 Additional evaluation results of our MoE-based judge model trained on a split of MJ-Bench. We evaluate and compare a subset of the best-performing models from Table 2 in the paper, using the remaining data as the test set. The best performance is in bold.

| Model | Alignment Avg w/ Tie | Alignment Avg w/o Tie | Safety Avg w/ Tie | Safety Avg w/o Tie | Quality Avg w/ Tie | Quality Avg w/o Tie | Bias ACC | Bias NDS | Bias GES |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 58.7 | 63.0 | 43.2 | 97.3 | 93.5 | 95.2 | 66.3 | 84.9 | 91.2 |
| LLaMA-3.2-11B-Vision | 60.2 | 64.2 | 38.1 | 80.0 | 68.5 | 74.3 | 83.0 | 84.5 | 89.5 |
| HPS-v2.1 | 42.2 | 64.3 | 18.6 | 40.0 | 68.3 | 88.4 | 57.4 | 74.1 | 86.6 |
| MJ-Bench | 71.2 | 72.0 | 77.0 | 80.2 | 90.6 | 94.2 | 86.1 | 84.7 | 90.1 |

Here we provide some additional related works. [3] and [4] also explore improving text-image alignment with MLLM feedback but rely on preference datasets curated through simple heuristics, without ensuring data diversity or maintaining high-quality standards. Furthermore, they only leverage the preference data to train a text-to-image model without proposing a reward model.

Another line of research focuses on providing more grounded scores for text-image alignment through decomposition [5,6], which involves breaking down complex prompts into multiple atomic predicates and verifying each individually, thereby enhancing the robustness of the feedback. Additionally, some probability-based methods [7] find that templating the prompt into binary questions and evaluating the likelihood of answering yes can result in a more stable scoring. We have also included the evaluation of these models in Table-r. 3.

Specifically, we observe that the methods suggested by the reviewers have outperformed the previous methods on certain perspectives while still struggling with certain tasks. We detail our analysis and findings of the updated results point-by-point.

  • Decomposition significantly boosts judge performance on text-image alignment: Table-r. 3 indicates that decomposition-based methods DSG and T2I-CompBench have superior performance on alignment and image quality perspectives, which require a fine-grained judgment of the elements and their corresponding attributes and relationships in the image. By decomposing a complex prompt into multiple atomic predicates and verifying them individually, these methods are highly interpretable and guaranteed to have better performance in providing feedback for alignment accuracy. Furthermore, we observe that DSG w/ entailment dependency also outperforms the non-dependent case, demonstrating the significance of logical reasoning in providing grounded feedback for text-image alignment. The fact that w/ dependency case has better performance also validates the high-quality and challenging nature of our dataset, as our images are verified by human annotators so that in each image the logical constraints in the prompt are satisfied. However, while the feedback is more accurate, the decomposition-based methods typically incur higher inference costs as they require extra decomposition and multiple MLLM-prompting, and thus may not be suitable to serve as online judges.

  • Probability-based methods reduce the numerical sensitivity of the feedback score: While VQAScore [3] can effectively address the "bag-of-words" issue in CLIP-based models, our evaluation in Table-r. 3 indicates that the performance w/o tie is much higher than w/ tie, indicating VQAScore results in a much larger tie rate than the other models (we set a threshold of 0.1 to determine tie). This indicates that by formatting the prompt into a Does this figure show {text}? question, while it increases the evaluation consistency of a single image, it inherently reduces the gap between two similar images that are specifically curated to be hard to distinguish in our challenging benchmark, thus resulting in a higher tie rate. These results indicate that VQAScore might not serve as a good judge for preference-based text-to-image post-training but can provide an adequately accurate evaluation of a single image.
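To make the tie handling concrete, below is a small illustrative sketch of how a score-based judge's accuracy w/ tie and w/o tie can be computed from per-pair scores with a tie threshold such as 0.1. It uses one plausible convention (ties counted as failures in the w/-tie number and dropped from the w/o-tie number) and is our own simplification rather than the exact metric implementation.

```python
def preference_accuracy(scores_chosen, scores_rejected, tie_threshold=0.1):
    """Accuracy w/ tie and w/o tie for a score-based judge on preference pairs.

    A pair counts as a tie when the absolute score gap is below `tie_threshold`;
    ties are treated as failures for the w/-tie metric and excluded from the
    w/o-tie metric (one possible convention)."""
    correct, ties, total = 0, 0, len(scores_chosen)
    for s_c, s_r in zip(scores_chosen, scores_rejected):
        if abs(s_c - s_r) < tie_threshold:
            ties += 1
        elif s_c > s_r:
            correct += 1
    acc_with_tie = correct / total
    acc_without_tie = correct / (total - ties) if total > ties else 0.0
    return acc_with_tie, acc_without_tie, ties / total

# A judge whose scores barely separate the two images of many pairs has a high
# tie rate, so its w/o-tie accuracy can greatly exceed its w/-tie accuracy.
print(preference_accuracy([0.82, 0.55, 0.61, 0.90], [0.80, 0.40, 0.58, 0.30]))
# -> (0.5, 1.0, 0.5)
```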

(continued in our next response)

Comment
  • Decomposition-based methods increase safety risks and do not have much effect on bias: (1) Safety: we further observe in Table-r. 3 that decomposition-based methods have much lower performance from a safety perspective than other one-time inference models such as LLaMA-3.2-11B-Vision-Instruct and GPT-4o. This is because decomposing an unsafe prompt into several atomic descriptions will potentially reduce the harmfulness of the prompt and instead make it more stealthy, leading such models to provide a higher overall score for unsafe images that align better with the prompts. However, one-time inference MLLMs can evaluate the unsafe prompt in its entirety and can thus provide more accurate feedback toward a safer alignment. (2) Bias: we find that decomposition-based methods have similar performance to their base models such as InstructBLIP and BLIP and do not show any significant improvement. This is primarily because we evaluated the bias perspective with straightforward prompts such as a portrait of a doctor, which cannot be further decomposed, thus degrading these decomposition-based methods to the same performance as the base models they use for individual predicate evaluation.

  • MLLMs with more advanced vision encoders have significantly higher performance on alignment: While the evaluation results on the four most recent MLLMs indicate they have better performance on almost all perspectives, we find that models with more advanced vision encoders, such as LLaMA-3.2-11B-Vision-Instruct with a specifically fine-tuned image encoder and InternVL2-26B with an extra-large 6B vision encoder (compared to the 300M vision encoder of the 8B model), consistently yield better performance on the text-image alignment and image quality perspectives, demonstrating the importance of such vision encoders in providing grounded information for the subsequent judge reasoning. We are also interested in evaluating the most recent MLLMs, such as Transfusion, which encodes images through diffusion mechanisms in an interleaved manner, once they are open-sourced.

  • Instruct-tuned models provide more accurate feedback on safety and bias: We also notice that Instruct-tuned models and post-aligned models such as LLaMA-3.2-11B-Vision-Instruct have better performance in terms of safety and bias, indicating they have embedded such values that align better with human perspectives and can provide better feedback towards a safer and debiased generation.

(continued in our next response)

Comment

Table-r. 3 Additional evaluation results of two decomposition-based models (DSG, T2I-CompBench), a probability-based model (VQAScore), and four of the most recent MLLMs (LLaMA-3.2-11B-Vision-Instruct, InternVL2-8B, InternVL2-26B, MiniCPM-V-2_6) on each fine-grained category of MJ-Bench. The top-2 results are shown in bold.

| Model | Alignment Avg w/ Tie | Alignment Avg w/o Tie | Safety Avg w/ Tie | Safety Avg w/o Tie | Quality Avg w/ Tie | Quality Avg w/o Tie | Bias ACC | Bias NDS | Bias GES |
|---|---|---|---|---|---|---|---|---|---|
| DSG w/ Dependency | 66.1 | 68.6 | 23.8 | 61.2 | 81.2 | 84.6 | 54.6 | 80.9 | 92.0 |
| DSG w/o Dependency | 62.4 | 67.3 | 25.0 | 57.3 | 78.6 | 87.2 | 52.4 | 77.7 | 89.1 |
| VQAScore | 51.4 | 63.2 | 33.7 | 74.0 | 61.5 | 64.7 | 53.0 | 74.5 | 87.2 |
| T2I-CompBench | 62.2 | 67.3 | 17.6 | 36.0 | 73.0 | 81.8 | 63.9 | 82.1 | 90.7 |
| LLaMA-3.2-11B-Vision-Instruct | 65.9 | 67.0 | 43.5 | 82.0 | 71.3 | 74.1 | 84.9 | 82.9 | 90.2 |
| MiniCPM-V-2_6 | 58.7 | 63.1 | 31.7 | 58.9 | 59.5 | 67.4 | 44.2 | 71.5 | 88.7 |
| InternVL2-8B | 61.8 | 65.5 | 33.3 | 45.2 | 69.6 | 82.4 | 56.0 | 74.9 | 83.4 |
| InternVL2-26B | 68.0 | 69.7 | 35.0 | 68.3 | 84.6 | 92.0 | 50.3 | 71.4 | 82.9 |
| GPT-4o | 61.5 | 62.5 | 35.3 | 100.0 | 97.6 | 98.7 | 65.8 | 82.5 | 92.8 |

[1] Wu, X., Huang, S., & Wei, F. (2024). Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation. arXiv preprint arXiv:2404.15100.

[2] Wang, H., Xiong, W., Xie, T., Zhao, H., & Zhang, T. (2024). Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts. arXiv preprint arXiv:2406.12845.

[3] Jiao, Q., Chen, D., Huang, Y., Li, Y., & Shen, Y. (2024). Img-diff: Contrastive data synthesis for multimodal large language models. arXiv preprint arXiv:2408.04594.

[4] Zhou, Y., Cui, C., Rafailov, R., Finn, C., & Yao, H. (2024). Aligning modalities in vision large language models via preference fine-tuning. arXiv preprint arXiv:2402.11411.

[5] Cho, J., Hu, Y., Garg, R., Anderson, P., Krishna, R., Baldridge, J., ... & Wang, S. (2023). Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-image generation. arXiv preprint arXiv:2310.18235.

[6] Huang, K., Sun, K., Xie, E., Li, Z., & Liu, X. (2023). T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36, 78723-78747.

[7] Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., ... & Ramanan, D. (2025). Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision (pp. 366-384). Springer, Cham.

Comment

It would be beneficial to discuss further the situations in which human judges and MLLM judges disagree, as this could provide valuable insights for future work.

We sincerely appreciate this insightful question from the reviewer. To address it, we would like to reference a point made by Reviewer eath, who commended the design of our benchmark based on (win, lose, prompt) triplets, which simplifies embedding the perspectives of human judges compared to direct quality ratings. Since our data curation process ensures that preference pairs are filtered by human judges to align with their preferences, examining the disagreements between human judges and MLLM judges essentially amounts to identifying the failure patterns of MLLM judges on our dataset, which is already calibrated by human judges. Based on our results, we have summarized the following key limitations of current MLLM judges and how their judgments deviate from those of human judges:

  • MLLMs (especially open-sourced ones) generally perform worse than smaller scoring models in providing accurate feedback regarding text-image alignment and image quality. We speculate two reasons: 1) the generative judging task is less accurate than classification and thus cannot fully unleash the capability of the vision encoder; 2) while training on instruction-following tasks enhances the performance of MLLM judges on safety- and bias-related tasks, it somewhat degrades the capability gained during vision-language pretraining, thus resulting in worse performance on alignment and quality.

  • CLIP-based scoring models suffer significantly on the safety and bias perspectives. Since they are trained on a large vision-language alignment corpus with contrastive objectives, their outputs directly reflect the training data distribution, which may contain unsafe and biased content. In contrast, MLLMs provide more accurate feedback regarding safety and bias due to their stronger reasoning capabilities.

  • While CLIP-based scoring models perform better from an alignment perspective, they generally exhibit much larger variance due to their contrastive training objective. In contrast, MLLMs are more consistent thanks to their chain-of-thought reasoning ability and the few-shot examples provided.

  • Decomposition-based methods can significantly improve the accuracy of judge feedback for text-image alignment and quality by verifying individual predicates. However, they also inherently increase safety risks, as breaking harmful prompts into smaller components can make them more subtle and harder to detect. Additionally, these methods have minimal impact on bias, as the straightforward prompts used in the evaluation cannot be further decomposed, resulting in performance similar to that of their underlying base models.

  • We find that MLLM judges are inconsistent and could provide completely different preferences when the images are input in different orders. Such bias also hampers their trustworthiness when providing feedback for other models.

  • Open-source MLLMs suffer significantly when providing feedback on a numeric scale while being more consistent on a Likert scale, since they have been trained far more extensively on natural language than on numerical corpora. Besides, compared with closed-source MLLMs, open-source MLLMs are less sensitive to the policies and scoring levels provided in the rubrics (i.e., they may provide the same score even if the rubrics have been greatly changed), due to weaker instruction-following capability.

We hope these further discussions have addressed the reviewer's concern. We have also updated these findings in the paper revision and hope they can help better advance future works.

Comment

W3: Besides, to better demonstrate the effectiveness of MJ-BENCH, the authors are recommended to present some visualization cases of MJ-BENCH to help offer a clearer and more comprehensive understanding of the dataset's quality.

We sincerely appreciate the reviewer’s suggestion and fully agree that additional visualizations can provide a clearer and more comprehensive understanding of our dataset. In response, we have added Figures 7, 8, and 9 in the Appendix to illustrate the quality of our dataset. In these figures, we present the results of fine-tuning a text-to-image model (SD-1.5) directly using the preference pairs from MJ-Bench and compare it with the SD-1.5 base model and the SD-1.5 model fine-tuned with GPT-4o feedback. The results in Figures 7, 8, 9, and Table-r. 4 demonstrate the high quality and reliability of our dataset. Additionally, we have enhanced the visualizations throughout the paper to further address the reviewer’s concerns and effectively convey the significance of our work.

  • Figure 1: We visualize an overall comparison of the multimodal judges' performance under different scales, including the numerical scales [0, 5] and [0, 10] and the Likert scale.
  • Figure 2: We visualize the overview structure and distribution of MJ-Bench, showcasing an example for each subcategory.
  • Figure 3: We visualize the two typical modes of multimodal judges in providing feedback.
  • Figure 4: We visualize the dataset distribution of MJ-Bench, and detail each individual subcategory to demonstrate the comprehensiveness of our dataset.
  • Figure 5: We visualize the distribution of the toxicity scores in the safety subset of MJ-Bench.
  • Figure 6: We visualize the examples from each type and scenario of bias in the bias subset.
  • Figure 7: We visualize six case studies of the text-to-image generation models in generating images from the text-image alignment, safety, and quality perspective, and compare among the SD-1.5 base model, the SD-1.5 fine-tuned from MJ-Bench, and the SD-1.5 fine-tuned from the feedback of GPT-4o.
  • Figure 8: We visualize six case studies of the text-to-image generation models in generating images from the bias perspective, and compare the SD-1.5 base model, the SD-1.5 fine-tuned from MJ-Bench, and the SD-1.5 fine-tuned from the feedback of GPT-4o. The prompt we investigated is a portrait of a soldier.
  • Figure 9: We visualize six case studies of the text-to-image generation models in generating images from the bias perspective, and compare the SD-1.5 base model, the SD-1.5 fine-tuned from MJ-Bench, and the SD-1.5 fine-tuned from the feedback of GPT-4o. The prompt we investigated is a portrait of a nurse.
  • Figure 10: We visualize the accuracy of score models on text-image alignment with different tie thresholds.
  • Figure 11: We visualize the tie-clean accuracy of score models on text-image alignment with different tie thresholds.
  • Figure 12: We visualize a case study of InternVL-Chat-v1-5 with different input modes.
  • Figure 13-15: We visualize a case study of six fine-tuned models using feedback regarding each of the four different objectives from different multimodal judges.
  • Figure 16: We visualize the human user interface used for annotation and evaluation in MJ-Bench.

Table-r. 4 Human evaluation results on the generated images from three models, i.e., the SD-1.5 base model, SD-1.5 fine-tuned with feedback provided by GPT-4o, and SD-1.5 fine-tuned directly on MJ-Bench via DPO. Specifically, we use the average ranking of the images generated by each model as the metric (lower is better). The best performance is in bold.

| Dataset Configuration | Alignment | Safety | Quality | Bias |
|---|---|---|---|---|
| SD-1.5 Base | 2.47 | 2.70 | 2.23 | 2.63 |
| SD-1.5 + GPT-4o | 1.95 | 1.91 | 1.87 | 2.11 |
| SD-1.5 + MJ-Bench | 1.58 | 1.39 | 1.90 | 1.26 |

We hope the additional experiments and discussions regarding the related works and our findings, as well as the expanded explanations regarding the quality of our dataset, have addressed your concerns. Please don’t hesitate to let us know if there’s anything else you’d like to discuss or if further clarification is needed.

If all your concerns have been resolved, we would sincerely appreciate it if you could consider raising your score to help us share this work with a broader community.

Thank you again for your time and thoughtful feedback!

Best regards,

Submission #9037 Authors

Comment

Dear reviewer 3DFZ,

As the discussion period is ending soon, we wanted to follow up to see if our above response which provides a more thorough analysis and experimental comparison with related works, analyzes a MoE-based judge model trained on our dataset, and examines failure patterns of MLLM judges compared with human judges, with additional visualizations and empirical validations to demonstrate the quality of our dataset, has addressed your concerns.

Please let us know if you have any further questions or if there are areas we can clarify. We truly value your feedback and are committed to improving our work based on your suggestions. Thank you very much!!

Comment

Thanks for your response, I will maintain my score and recommend this paper to be accepted.

Comment

Dear Reviewer 3DFZ,

Thank you so much for your response and for recommending our paper for acceptance. We’re delighted to see that our response has addressed your concerns!

Comment

Dear Reviewer 3DFZ,

Thank you so much for your valuable suggestions to help improve our paper! We deeply appreciate your recognition of the novelty of our work and its critical contribution to advanced research in RLAIF for text-to-image generation. Below, we have provided a detailed, point-by-point response to your questions, and hope these could help address your concerns.

W1: Please provide a comprehensive discussion that covers the related work. For example, some studies have also made efforts in aligning text-to-image models with feedback from MLLMs, such as VisionPrefer [1], which also discusses the varying effectiveness of different MLLMs as annotators for alignment data in text-to-image tasks.

Thank you for this thoughtful suggestion. We sincerely appreciate the reviewer for bringing this valuable work to our attention and we fully agree that a more comprehensive discussion of related works would better highlight the significance of our contributions. In response, we have updated the related works section in the paper revision to include additional recent studies, and we summarize the newly incorporated works below.

Regarding datasets for providing feedback on text-to-image generation, VisionPrefer [1] is a concurrent work that introduces a large preference dataset spanning four aspects, which is subsequently used to train a CLIP-based reward model, VP-Score, to provide feedback for image generation. While VisionPrefer is related to our work, our contribution differs in several key ways. First, VisionPrefer primarily provides a large preference dataset with four aspects, lacking the fine-grained subcategories needed for a comprehensive evaluation of multimodal judges' feedback. In contrast, MJ-Bench not only covers these perspectives but also decomposes each into over 20 scenarios, ensuring a more exhaustive evaluation framework. Second, VisionPrefer relies solely on GPT-4V for data annotation, which cannot guarantee data quality, whereas our dataset undergoes a rigorous three-stage curation process: (1) VLM pre-processing, (2) human verification, and (3) confidence-interval-based efficient sampling. Detailed statistics of the data split in each stage are provided in Table-r. 1 below. This pipeline ensures the reliability, diversity, and difficulty of the dataset, making it a more robust resource for fine-grained multimodal evaluation. Inspired by their approach of training a reward model, VP-Score, on a curated preference dataset, we designed an additional experiment in which 80% of the MJ-Bench data is randomly split off (except for Bias, where we use 64 groups of the data filtered out at the confidence-filtering stage) to train an MoE-based judge model, following the method in [2]. The model incorporates four experts, each responsible for a specific perspective, with a gating layer, trained via the Bradley-Terry (BT) objective, that aggregates the scores across perspectives. We then use the remaining 20% of the data as the test set. Results are reported in Table-r. 2.

From Table-r. 2, we observe that the MoE-based judge trained on MJ-Bench outperforms other models in alignment, safety, and bias perspectives in terms of w/ tie scores while being very close to GPT-4o on the quality subset. These findings highlight the advantages of MoE structures for handling multi-objective feedback and underscore the high quality of MJ-Bench data samples. However, the results also suggest that scaling up MJ-Bench, particularly in the quality subset, could further enhance performance, potentially surpassing GPT-4o. Due to time constraints, we plan to train our reward model on a larger held-out training set and evaluate it on the full MJ-Bench test set to compare against more models. Unfortunately, since the authors of VisionPrefer have not open-sourced the VP-Score model, we were unable to include it in our benchmark evaluation.
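For illustration, here is a minimal PyTorch-style sketch of the MoE reward head described above: four per-perspective experts whose scalar scores are mixed by a gating layer and trained with the Bradley-Terry objective. The backbone, feature dimension, and expert widths are placeholder assumptions, not the exact configuration used in our experiment or in [2].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoERewardHead(nn.Module):
    """Four per-perspective experts (alignment, safety, quality, bias) plus a
    gating layer that mixes their scalar scores into a single reward."""

    def __init__(self, feat_dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        expert_scores = torch.cat([e(feats) for e in self.experts], dim=-1)  # (B, 4)
        weights = F.softmax(self.gate(feats), dim=-1)                        # (B, 4)
        return (weights * expert_scores).sum(dim=-1)                         # (B,)

def bt_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry preference objective: maximize P(chosen preferred over rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical training step: `feats_*` stand in for frozen vision-language
# embeddings of the (prompt, image) pairs; the backbone is not shown here.
head = MoERewardHead(feat_dim=1024)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
feats_chosen, feats_rejected = torch.randn(8, 1024), torch.randn(8, 1024)
loss = bt_loss(head(feats_chosen), head(feats_rejected))
loss.backward()
optimizer.step()
```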

Table-r. 1 Statistics of the data curation procedure and quality control.

| Stage | Alignment | Safety | Quality | Bias (group) |
|---|---|---|---|---|
| Total | 6260 | 4852 | 5964 | 140 |
| Human Selected | 2489 | 2271 | 1680 | 105 |
| Confidence Selected | 724 | 574 | 1121 | 18 |

(continued in our next response)

Review
6

This paper introduces MJ-BENCH, a novel benchmark that includes a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. Detailed experimental analyses were conducted on CLIP-based scoring models, open-source VLMs, and closed-source VLMs.

Strengths

The paper is well-written and logically coherent. The proposed benchmark is comprehensive, covering aspects such as alignment, safety, image quality, and bias.

Weaknesses

I would consider increasing my score if these concerns are addressed or resolved.

1. Comprehensiveness and fairness of the evaluation: The paper evaluates models like CLIP, LLaVA and GPT-4, but lacks some popular alternative alignment evaluation models such as the Decompose method (Davidsonian Scene Graph) and the Answer Probability Method (VQAScore). Additionally, the paper claims that the alignment dataset was collected from Pick-a-pic, HPDv2, and ImageRewardDB, so evaluating PickScore-v1, HPS-v2.1, and ImageReward in the experiments is unfair because these models have already been trained on similar data or dataset formats.

Questions

  1. Could you please explain how quality control is conducted in the construction of data across different evaluation dimensions? For example, how is a human verification process conducted, and what is the proportion of data that gets filtered?

  2. Given the criticisms of CLIP-based methods as 'bag-of-words' and the lack of interpretability and reproducibility when directly asking VLMs to output scores, evaluating and comparing alternative evaluation methods is crucial. It's important to include popular alignment approaches that avoid these weaknesses, such as decomposition methods from T2i-compbench and Davidsonian Scene Graph, as well as answer probability methods like VQAScore.

Comment
  • Decomposition-based methods increase safety risks and do not have much effect on bias: (1) Safety: we further observe in Table-r. 1 that decomposition-based methods have much lower performance from a safety perspective than other one-time inference models such as LLaMA-3.2-11B-Vision-Instruct and GPT-4o. This is because decomposing an unsafe prompt into several atomic descriptions will potentially reduce the harmfulness of the prompt and instead make the prompt more stealthy, leading to such models providing a higher overall score for unsafe images that align better with the prompts. However, one-time inference MLLMs can evaluate the unsafe prompt in its entirety and can thus provide more accurate feedback toward a safer alignment. (2) Bias: we find that decomposition-based methods have similar performance to their base models such as InstructBLIP and BLIP and do not show any significant improvement. This is primarily because we evaluated the bias perspective with straightforward prompts such as a portrait of a doctor which cannot be further decomposed, thus resulting in these decomposition-based methods being degraded to have the same performance as their base models used for individual predicate evaluation.

  • MLLMs with more advanced vision encoders have significantly higher performance on alignment: While the evaluation results on the four most recent MLLMs indicate they have better performance on almost all perspectives, we find that models with more advanced vision encoders, such as LLaMA-3.2-11B-Vision-Instruct with a specifically fine-tuned image encoder and InternVL2-26B with an extra-large 6B vision encoder (compared to the 300M vision encoder of the 8B model), consistently yield better performance on the text-image alignment and image quality perspectives, demonstrating the importance of such vision encoders in providing grounded information for the subsequent judge reasoning. We are also interested in evaluating the most recent MLLMs such as Transfusion [7], which encodes images through diffusion mechanisms in an interleaved manner, once they are open-sourced.

  • Instruct-tuned models provide more accurate feedback on safety and bias: We also notice that Instruct-tuned models and post-aligned models such as LLaMA-3.2-11B-Vision-Instruct have better performance in terms of safety and bias, indicating they have embedded such values that align better with human perspectives and can provide better feedback towards a safer and debiased generation.

(continued in our next response)

Comment

Table-r. 1 Additional evaluation results of two decomposition-based models (DSG, T2I-CompBench), a probability-based model (VQAScore), and four of the most recent MLLMs (LLaMA-3.2-11B-Vision-Instruct, InternVL2-8B, InternVL2-26B, MiniCPM-V-2_6) on each fine-grained category of MJ-Bench. The top-2 results are shown in bold.

| Model | Alignment Avg w/ Tie | Alignment Avg w/o Tie | Safety Avg w/ Tie | Safety Avg w/o Tie | Quality Avg w/ Tie | Quality Avg w/o Tie | Bias ACC | Bias NDS | Bias GES |
|---|---|---|---|---|---|---|---|---|---|
| DSG w/ Dependency | 66.1 | 68.6 | 23.8 | 61.2 | 81.2 | 84.6 | 54.6 | 80.9 | 92.0 |
| DSG w/o Dependency | 62.4 | 67.3 | 25.0 | 57.3 | 78.6 | 87.2 | 52.4 | 77.7 | 89.1 |
| VQAScore | 51.4 | 63.2 | 33.7 | 74.0 | 61.5 | 64.7 | 53.0 | 74.5 | 87.2 |
| T2I-CompBench | 62.2 | 67.3 | 17.6 | 36.0 | 73.0 | 81.8 | 63.9 | 82.1 | 90.7 |
| LLaMA-3.2-11B-Vision-Instruct | 65.9 | 67.0 | 43.5 | 82.0 | 71.3 | 74.1 | 84.9 | 82.9 | 90.2 |
| MiniCPM-V-2_6 | 58.7 | 63.1 | 31.7 | 58.9 | 59.5 | 67.4 | 44.2 | 71.5 | 88.7 |
| InternVL2-8B | 61.8 | 65.5 | 33.3 | 45.2 | 69.6 | 82.4 | 56.0 | 74.9 | 83.4 |
| InternVL2-26B | 68.0 | 69.7 | 35.0 | 68.3 | 84.6 | 92.0 | 50.3 | 71.4 | 82.9 |
| GPT-4o | 61.5 | 62.5 | 35.3 | 100.0 | 97.6 | 98.7 | 65.8 | 82.5 | 92.8 |

We sincerely hope the additional experiment results and our findings provided above have addressed the reviewer's concerns and we would love to engage in more in-depth discussion with the reviewer.

[1] Cho, J., Hu, Y., Garg, R., Anderson, P., Krishna, R., Baldridge, J., ... & Wang, S. (2023). Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-image generation. arXiv preprint arXiv:2310.18235.

[2] Huang, K., Sun, K., Xie, E., Li, Z., & Liu, X. (2023). T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36, 78723-78747.

[3] Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., ... & Ramanan, D. (2025). Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision (pp. 366-384). Springer, Cham.

[4] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., ... & Ganapathy, R. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[5] Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., ... & Wang, W. (2024). How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821.

[6] Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., ... & Sun, M. (2024). Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800.

[7] Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., ... & Levy, O. (2024). Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039.

Comment

W2: Additionally, the paper claims that the alignment dataset was collected from Pick-a-pic, HPDv2, and ImageRewardDB, so evaluating PickScore-v1, HPS-v2.1, and ImageReward in the experiments is unfair because these models have already been trained on similar data or dataset formats.

We sincerely appreciate the reviewer's constructive suggestion. We completely agree that directly evaluating PickScore-v1, HPS-v2.1, and ImageReward on an alignment set that contains pairs curated from their training sets might be unfair to other models. In response, we expand the alignment subset with an additional 680 image pairs that do not contain any image samples from existing datasets. Specifically, to curate such data, we first manually select sufficient prompts from each of the five scenarios, i.e., object, attribute, action, counting, and spatial, and ensure that they are diverse and challenging. Then, to further improve diversity and avoid data contamination, we adopt GPT-4o to augment them into different prompts with certain descriptors shifted (the prompt we use is simply "Please provide me a prompt for a text-to-image model in a similar style by changing the {subject}. Prompt: {prompt}", where the subject corresponds to the scenario of the prompt). For each prompt, we leverage SDXL and DALLE3 to generate 2-4 images, and we then adopt the procedure described below in our response to Q1 to filter these pairs, finally resulting in 680 high-quality image preference pairs spanning the five scenarios, curated by ourselves and independent of existing datasets. We keep all other procedures and metrics the same as for the other subsets in MJ-Bench. We provide the additional evaluation results of the models on this subset in Table-r. 2. Due to the time limit, we only evaluate the subset of models with better average performance according to Table 2 in the paper, but we will update the complete evaluation of all models in later revisions.
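As a concrete illustration of the augmentation step, the sketch below shows the intended loop; `call_gpt4o` is a placeholder for whichever GPT-4o client is used, and the seed prompts are hypothetical examples rather than our actual curation script.

```python
AUGMENT_TEMPLATE = (
    "Please provide me a prompt for a text-to-image model in a similar style "
    "by changing the {subject}. Prompt: {prompt}"
)

def augment_prompts(seed_prompts, call_gpt4o, n_variants=3):
    """Expand manually selected seed prompts into diverse variants.

    seed_prompts: list of (scenario, prompt) pairs, where scenario is one of
    object / attribute / action / counting / spatial and is substituted for
    {subject}. `call_gpt4o` is a placeholder callable that sends a text query
    to GPT-4o and returns the generated prompt."""
    augmented = []
    for scenario, prompt in seed_prompts:
        for _ in range(n_variants):
            query = AUGMENT_TEMPLATE.format(subject=scenario, prompt=prompt)
            augmented.append((scenario, call_gpt4o(query).strip()))
    return augmented

# Hypothetical seed prompts; each augmented variant is later rendered 2-4 times
# with SDXL / DALLE3, and the images are filtered into preference pairs.
seeds = [
    ("counting", "three red apples on a wooden table"),
    ("spatial", "a cat sitting to the left of a blue bicycle"),
]
```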

Specifically, from Table-r. 2 we can see that while PickScore-v1 and ImageReward show slightly worse performance on this new evaluation set, the general trend is similar to what we observe in Table 2 in the paper, so our previous findings still hold. We attribute this to the fact that (1) we only select image pairs from the test sets of the existing datasets, preventing potential contamination from their training data; and (2) our data curation pipeline ensures that only the most challenging pairs satisfying the corresponding criteria for each scenario are selected, which results in a data distribution essentially different from the training distribution of these models, further preventing such data contamination issues. We have merged the new data split into the original samples in MJ-Bench and provide them to users for a more comprehensive evaluation.

We hope our additional experiment and explanations have helped address the reviewer's concern. We would be willing to engage in a more in-depth discussion if the reviewer has further questions.

Table-r. 2 Additional evaluation results of a subset of models on a held-out set of preference pairs that are not drawn from Pick-a-pic, HPDv2, or ImageRewardDB. The top-2 results are shown in bold.

| Model | Alignment Avg w/ Tie | Alignment Avg w/o Tie |
|---|---|---|
| CLIP-v1 | 35.4 | 46.7 |
| PickScore-v1 | 48.2 | 60.0 |
| HPS-v2.1 | 50.2 | 57.4 |
| ImageReward | 47.0 | 55.7 |
| LLaVA-1.6-mistral-7b | 33.8 | 51.0 |
| LLaMA-3.2-11B-Vision | 63.1 | 67.3 |
| InternVL2-26B | 65.4 | 71.2 |
| DSG w/ Dependency | 63.2 | 66.7 |
| VQAScore | 48.6 | 60.3 |
| T2I-CompBench | 61.2 | 65.4 |
| GPT-4o | 67.2 | 70.0 |
Comment

Q1: Could you please explain how quality control is conducted in the construction of data across different evaluation dimensions? For example, how is a human verification process conducted, and what is the proportion of data that gets filtered?

We really appreciate the reviewer for raising this question! We also agree that the procedure for quality control is critical for ensuring a meaningful and fair evaluation and comparison of the models. Specifically, we detail the data curation and human verification process below point-by-point and provide a statistics report in Table-r. 3.

  • VLM pre-process: Specifically, as described in Appendix A in the paper, we first gather corresponding image pairs for each perspective through the different algorithms we propose. This results in a substantial number of samples, with each perspective containing a similar quantity. Then our first step for quality control is to adopt a powerful VLM (LLaVa-NeXT-34B) to pre-process the data and filter out the wrong preference pairs (e.g. for the alignment subset, we only include those image pairs where the positive sample completely aligns with the prompt and the negative sample includes hallucinated entities or relationships). In this step, we aim to ensure the overall correctness of the image pairs, while not considering if they are challenging enough or have high quality. The samples we obtain in this process are 6260, 4852, 5964 pairs for the alignment, safety, and quality perspective, and 140 groups for the bias perspective.

  • Human verification: Next, we engage human verifiers to evaluate each preference pair, considering both images alongside the corresponding prompt. In this step, the verifiers are tasked not only with confirming the correctness of the pair (e.g., ensuring the chosen image in the alignment subset fully aligns with the prompt) but also with assigning a difficulty rating from 0 to 5. This rating reflects how challenging they perceive the pair to be, based on the premise that the reason for the preference is clear and verifiable. The greater the difficulty for the model to distinguish between the images, the higher the rating. This process results in 2,489, 2,271, and 1,680 validated pairs for the alignment, safety, and quality perspectives, respectively, as well as 105 groups for the bias perspective. All pairs are verified for accuracy by human evaluators, with each accompanied by a difficulty rating.

  • Benchmark Confidence Sampling: Although the current dataset is verified and ready for use, its size poses significant computational and time-related challenges. To address this, we draw inspiration from [8], which suggests that a subset of the benchmark samples is usually sufficient to provide a certified and reliable evaluation for each model. To implement this, we use three surrogate models (MiniGPT4-v1, InternVL-Chat-V1.2, and LLaVA-V1.2) to run inference on the dataset, progressing from higher-difficulty samples to lower-difficulty ones. We then calculate the confidence interval (variance) of each model's performance on the dataset. Using a threshold of 0.1, we ensure that each subset contains enough samples to provide a confident estimate of model performance within this interval. This approach not only ensures that the more diverse and challenging samples are prioritized, but also guarantees an efficient yet sufficient sample size for evaluation while maintaining statistical reliability. As a result, we obtain 724, 574, and 1,121 validated pairs for the alignment, safety, and quality perspectives, respectively, as well as 18 groups for the bias perspective.

We then compile these samples to form the final evaluation set for each perspective in MJ-Bench. This rigorous quality control pipeline ensures that the collected samples and resulting evaluations are reliable, challenging, and efficient.

To demonstrate the quality of our dataset, we fine-tuned a text-to-image model (SD-1.5) directly using the preference pairs from MJ-Bench, showcasing the value of the data samples in our dataset. We compared this model with the SD-1.5 base model and the SD-1.5 model fine-tuned using GPT-4o feedback, with the results presented in Table-r. 4. Based on human judge feedback, the model fine-tuned with MJ-Bench significantly outperforms the one fine-tuned with GPT-4o feedback in alignment, safety, and bias perspectives, while achieving comparable performance in the quality perspective. This demonstrates the high quality and reliability of our dataset. Additional case studies and comparisons of the outputs from the three models are provided in Figures 7, 8, and 9 in the appendix.
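For reference, the sketch below shows one common way to adapt DPO to a diffusion model such as SD-1.5, in the spirit of the Diffusion-DPO objective; it is a simplified illustration, and the timestep weighting, beta value, and surrounding training loop are assumptions rather than our exact fine-tuning code.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_pred_w, eps_pred_l, eps_ref_w, eps_ref_l,
                       eps_true_w, eps_true_l, beta=2500.0):
    """Simplified Diffusion-DPO objective on one batch of preference pairs.

    eps_pred_*: noise predicted by the model being fine-tuned on the noised
    winning (w) / losing (l) latents; eps_ref_*: predictions of the frozen
    reference model; eps_true_*: the actually injected noise. Any timestep
    weighting is folded into `beta` for simplicity."""
    def err(pred, true):
        return ((pred - true) ** 2).mean(dim=(1, 2, 3))  # per-sample denoising error

    margin = (err(eps_pred_w, eps_true_w) - err(eps_ref_w, eps_true_w)) \
           - (err(eps_pred_l, eps_true_l) - err(eps_ref_l, eps_true_l))
    # The fine-tuned model is rewarded for denoising the preferred image better
    # (relative to the reference model) than the rejected one.
    return -F.logsigmoid(-beta * margin).mean()

# Latent shapes (batch, channels, height, width), all noised at shared timesteps.
b, c, h, w = 4, 4, 64, 64
loss = diffusion_dpo_loss(*(torch.randn(b, c, h, w) for _ in range(6)))
```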

(continued in our next response)

Comment

Table-r. 3 Statistics of the data curation procedure and quality control.

| Stage | Alignment | Safety | Quality | Bias (group) |
|---|---|---|---|---|
| Total | 6260 | 4852 | 5964 | 140 |
| Human Selected | 2489 | 2271 | 1680 | 105 |
| Confidence Selected | 724 | 574 | 1121 | 18 |

Table-r. 4 Human evaluation results on the generated images from three models, i.e., the SD-1.5 base model, SD-1.5 fine-tuned with feedback provided by GPT-4o, and SD-1.5 fine-tuned directly on MJ-Bench via DPO. Specifically, we use the average ranking of the images generated by each model as the metric (lower is better). The best performance is in bold.

| Dataset Configuration | Alignment | Safety | Quality | Bias |
|---|---|---|---|---|
| SD-1.5 Base | 2.47 | 2.70 | 2.23 | 2.63 |
| SD-1.5 + GPT-4o | 1.95 | 1.91 | 1.87 | 2.11 |
| SD-1.5 + MJ-Bench | 1.58 | 1.39 | 1.90 | 1.26 |

We hope the above additional experiments and clarifications have better conveyed and demonstrated the quality control procedure of our dataset and addressed the reviewer's concern.

[8] Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., & Yurochkin, M. (2024). tinyBenchmarks: evaluating LLMs with fewer examples. arXiv preprint arXiv:2402.14992.

Q2: Given the criticisms of CLIP-based methods as 'bag-of-words' and the lack of interpretability and reproducibility when directly asking VLMs to output scores, evaluating and comparing alternative evaluation methods is crucial. It's important to include popular alignment approaches that avoid these weaknesses, such as decomposition methods from T2i-compbench and Davidsonian Scene Graph, as well as answer probability methods like VQAScore.

Thank you for raising this question. Following the reviewer's suggestion, we conducted additional experiments to evaluate decomposition-based methods (Davidsonian Scene Graph, T2I-CompBench), a probability-based method (VQAScore), and several recent SOTA MLLMs (e.g., LLaMA-3.2-11B-Vision-Instruct, InternVL2-26B). The results, presented in Table-r. 1 and elaborated upon in our response, show that decomposition methods excel in fine-grained alignment and quality evaluations by breaking prompts into atomic predicates, with logical entailment dependencies further enhancing performance. However, these methods incur higher computational costs and perform poorly in safety evaluations, as decomposing unsafe prompts into parts also makes harmful content less detectable. Additionally, while VQAScore improves consistency for single-image evaluations, it increases tie rates in preference-based tasks, limiting its utility for challenging benchmarks. Recent MLLMs with advanced vision encoders and instruct-tuning demonstrate superior performance across perspectives, particularly in alignment, safety, and bias evaluations. Overall, these experiments enhance the benchmark, providing a more comprehensive, fair, and diverse evaluation framework that addresses previous limitations and establishes a robust foundation for model assessment.

We hope the additional experiments regarding the new baselines and the held-out alignment subset and the expanded explanations regarding our quality control procedure have addressed your concerns. Please don’t hesitate to let us know if there’s anything else you’d like to discuss or if further clarification is needed.

If all your concerns have been resolved, we would sincerely appreciate it if you could consider raising your score to help us share this work with a broader community!

Thank you again for your time and thoughtful feedback!

Best regards,

Submission #9037 Authors

Comment

Dear reviewer qiUB,

As the discussion period nears its end, we wanted to follow up to see if our response—detailing the analysis with 8 recent multimodal judges suggested by the reviewer, the additional evaluation on a clean alignment test, and our more thorough experiment verification and quality control explanation—has addressed your concerns. Please let us know if there are any additional questions or areas we can clarify further.

We really look forward to your feedback and we will try our best to improve our work based on your suggestions. Thank you again!!

Comment

Dear Reviewer qiUB,

Thank you once again for your insightful feedback! We greatly value your guidance and have worked diligently to address your concerns and improve the paper. As the paper revision deadline approaches, we kindly ask if the changes we’ve made resolve your concerns. If you find our revisions satisfactory, we would be truly grateful if you could reconsider your score.

We understand you have a busy schedule, but any additional comments or updated feedback you could provide would mean a great deal to us. Your expertise is invaluable in helping us refine our work further, and we warmly welcome any continued discussion.

Thank you for your time and thoughtful review!

Comment

Dear Reviewer qiUB,

As the discussion period nears its conclusion, we wanted to follow up to ensure that our response above where we provide (1) additional analysis with 8 recent multimodal judges including 4 judges suggested by the reviewer and 4 most recent MLLMs; (2) the additional experiments on a clean alignment test to ensure the fairness of our evaluation; (3) a more comprehensive explanation of our human verification and quality control process, have adequately addressed your concerns.

If there are any remaining questions or areas where further clarification is needed, please don’t hesitate to let us know. We sincerely value your feedback and are committed to improving our work based on your thoughtful suggestions. Thank you again for your time and insights!

Comment

Dear Reviewer qiUB,

Thank you for taking the time and effort to review our submission. We greatly appreciate your thoughtful feedback, which has been instrumental in helping us improve our work!

We have carefully addressed the concerns you raised and provided detailed responses, including conducting additional experiments based on your suggestions. We wanted to kindly follow up to confirm whether our responses have adequately addressed your concerns or if there are any remaining points we could further clarify or improve, especially as tomorrow is the final day for the discussion period.

Thank you once again for your invaluable feedback and contributions to the review process!

Comment

Dear Reviewer qiUB,

Thank you so much for your valuable suggestions to help improve our paper! We deeply appreciate your recognition of the novelty and comprehensiveness of our work! Below, we have provided a detailed, point-by-point response to your questions, and hope these could help address your concerns.

W1: Comprehensiveness and fairness of the evaluation: The paper evaluates models like CLIP, LLaVA, and GPT-4, but lacks some popular alternative alignment evaluation models such as the Decompose method (Davidsonian Scene Graph) and the Answer Probability Method (VQAScore).

Thank you for this insightful suggestion! We totally agree with the reviewer that more comprehensive evaluation results including the decomposition-based approaches and probability-based method will strengthen the scope and fairness of our benchmark. Therefore, we conduct additional experiments to include the two most advanced decomposition-based methods: Davidsonian Scene Graph (DSG) [1], and T2i-compbench [2], as well as a probability-based method VQAScore [3] following the reviewer's suggestion. To keep our benchmark updated with the most advanced methods, we further include four additional SOTA open-sourced MLLMs, i.e. LLaMA-3.2-11B-Vision-Instruct [4], InternVL2-8B [5], InternVL2-26B and MiniCPM-V-2_6 [6], and evaluate their performance as a judge on MJ-Bench. We present the additional evaluation results in Table-r. 1 below, where we also include GPT-4o from the paper as a reference. We provide two parallel evaluations of DSG in terms of whether the logical entailment dependency is available or not (e.g. the proposition “there is a motorcycle” is a parent of “the motorcycle is blue”: this dependent’s truth (i.e. whether the motorcycle is blue) can only be evaluated if the parent is true (i.e. there is a motorcycle at all)).
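To make the entailment dependency concrete, below is a small illustrative sketch of dependency-aware predicate scoring; it is our own simplification rather than the actual DSG implementation, and the VQA judge is a hard-coded stand-in. A dependent question is only credited when its parent question holds; here a failed parent simply marks the dependent predicate as failed.

```python
def dsg_style_score(questions, answer_fn):
    """Dependency-aware aggregation of atomic yes/no questions.

    questions: dict mapping question id -> (question text, parent id or None).
    answer_fn: callable answering a question against the image, e.g. a VQA
    model (a placeholder lambda below). A child is only asked if its parent holds."""
    verdicts = {}

    def resolve(qid):
        if qid in verdicts:
            return verdicts[qid]
        text, parent = questions[qid]
        if parent is not None and not resolve(parent):
            verdicts[qid] = False  # parent predicate failed -> dependent fails too
        else:
            verdicts[qid] = bool(answer_fn(text))
        return verdicts[qid]

    results = [resolve(qid) for qid in questions]
    return sum(results) / len(results)  # fraction of satisfied predicates

# Toy example mirroring the motorcycle case above; the lambda pretends the image
# contains a motorcycle that is not blue.
questions = {
    "q1": ("Is there a motorcycle?", None),
    "q2": ("Is the motorcycle blue?", "q1"),
}
print(dsg_style_score(questions, answer_fn=lambda q: q == "Is there a motorcycle?"))  # 0.5
```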

Specifically, we indeed observe that the methods suggested by the reviewers have outperformed the previous methods on certain perspectives while still struggling with certain tasks. We detail our analysis and findings of the updated results point-by-point.

  • Decomposition significantly boosts judge performance on text-image alignment: Table-r. 1 indicates that decomposition-based methods DSG and T2I-CompBench have superior performance on alignment and image quality perspectives, which require a fine-grained judgment of the elements and their corresponding attributes and relationships in the image. By decomposing a complex prompt into multiple atomic predicates and verifying them individually, these methods are highly interpretable and guaranteed to have better performance in providing feedback for alignment accuracy. Furthermore, we observe that DSG w/ entailment dependency also outperforms the non-dependent case, demonstrating the significance of logical reasoning in providing grounded feedback for text-image alignment. The fact that w/ dependency case has better performance also validates the high-quality and challenging nature of our dataset, as our images are verified by human annotators so that in each image the logical constraints in the prompt are satisfied. However, while their feedback is more accurate for alignment, the decomposition-based methods typically incur higher inference costs as they require extra decomposition and multiple MLLM-prompting, and thus may not be suitable to serve as online judges.

  • Probability-based methods reduce the numerical sensitivity of the feedback score: While VQAScore [3] can effectively address the "bag-of-words" issue in CLIP-based models, our evaluation in Table-r. 1 indicates that the performance w/o tie is much higher than w/ tie, indicating VQAScore results in a much larger tie rate than the other models (we set a threshold of 0.1 to determine tie). This indicates that by formatting the prompt into a Does this figure show {text}? question, while it increases the evaluation consistency of a single image, it inherently reduces the gap between two similar images that are specifically curated to be hard to distinguish in our challenging benchmark, thus resulting in a higher tie rate. These results indicate that VQAScore might not serve as a good judge for preference-based text-to-image post-training but can provide an adequately accurate evaluation of a single image.

(continued in our next response)

Comment

We sincerely thank all reviewers for their recognition of the critical contributions of our work in advancing research on RLAIF for image generation, as well as their valuable feedback and efforts to help us improve the paper. Based on the reviewers' comments, we have made significant improvements to the paper (a revised PDF has been uploaded), with the main changes highlighted in blue text. Below, we summarize these changes:

  1. We evaluated three additional models (two decomposition-based models: DSG and T2I-CompBench, and one probability-based model: VQAScore) as well as four of the most recent MLLMs on MJ-Bench. (Reviewer qiUB)
  2. We conducted an additional experiment where we curated a clean test set for the alignment perspective and benchmarked models on this set to avoid data contamination and ensure fairness in our evaluation. (Reviewer qiUB)
  3. We provided more comprehensive details and statistics about our data curation and quality control procedures. (Reviewers qiUB, eath, Rmzq)
  4. We quantitatively demonstrated the high quality of the test samples in MJ-Bench by fine-tuning a text-to-image model solely using our preference data and comparing it with other models. (Reviewers eath, Rmzq)
  5. We expanded the discussion on related work and highlighted the unique contributions of MJ-Bench. (Reviewer 3DFZ)
  6. We trained a MoE-based reward model using a split from MJ-Bench and showed its superior performance in providing feedback for text-to-image generation. (Reviewers 3DFZ, Rmzq)
  7. We conducted additional human evaluations using the standard win rate metric, comparing more models in an end-to-end manner to mitigate potential bias in human evaluations. (Reviewer eath)

Thank you once again for your thoughtful suggestions and efforts to help us improve this work. We hope our responses and revisions address your concerns, and we look forward to further engaging discussions during the follow-up period.

Best regards,

Submission #9037 Authors

AC Meta-Review

Summary of claims and findings:

The paper introduces a benchmark named MJ-BENCH to evaluate multimodal judges for text-to-image generation models across four critical dimensions: alignment, safety, image quality, and bias. MJ-BENCH includes a comprehensive preference dataset comprising thousands of (win, lose, prompt) triplets. The authors propose two feedback mechanisms for multimodal judges: single-input and multi-input judges, and conduct detailed analyses, such as evaluating the consistency of judges' preferences across different image modes. Several insights are highlighted, including performance comparisons between closed-source and open-source VLMs under various metrics, model behaviors after fine-tuning, and validation through human evaluations. Overall, MJ-BENCH provides a comprehensive and well-rounded approach to evaluating and improving feedback mechanisms for text-to-image generation models, representing a valuable contribution to the community.

Pros:

The reviewers recognized MJ-BENCH as a comprehensive benchmark addressing critical challenges in evaluating text-to-image models. They highlighted its detailed ablation studies, which offer valuable insights, and its dataset, which supports various formats such as ranking and voting, enabling diverse applications in preference modeling. The benchmark's design and methodology were commended for advancing research in evaluating LMMs and fostering progress in Reinforcement Learning from AI Feedback (RLAIF).

Cons:

The original submission lacks several important aspects, with reviewers raising concerns about the novelty, comprehensiveness, fairness, and presentation of the proposed MJ-BENCH benchmark:

  • Novelty: The overall pipeline lacks significant novelty, as it is similar to previous methods.

  • Benchmark: The selection of evaluation dimensions (alignment, safety, image quality, and bias) lacks in-depth analysis. The scale and quality of the dataset were not compared to existing benchmarks, limiting its application scope. The benchmark's quality was not independently evaluated through experiments. Questions were raised about quality control in constructing data across evaluation dimensions, including human verification processes and data filtering proportions. Concerns were also raised about potential biases introduced by uneven question distributions across categories. The paper did not provide visualizations of MJ-BENCH cases to clarify the dataset's structure and effectiveness.

  • Evaluation: The benchmark does not include several important alternative alignment evaluation models, such as the Decompose method (Davidsonian Scene Graph) and the Answer Probability Method (VQAScore). Evaluating PickScore-v1, HPS-v2.1, and ImageReward is deemed unfair, as these models were trained on datasets similar to those used for MJ-BENCH. The situations in which human judges and multimodal large language model (MLLM) judges disagree are not explored, which could provide valuable insights for future work.

  • The paper lacks a comprehensive discussion of related works, such as VisionPrefer, which addresses similar challenges in aligning text-to-image models.

  • Several important details were unclear or missing in the original submission. Some observations were also inconsistent (e.g., discrepancies in Tables 3 and 1).

Rebuttal:

During the rebuttal phase, the authors provided extensive clarifications, explanations, and additional experimental results, including the evaluation of three additional models (DSG, T2I-CompBench, and VQAScore) and four recent MLLMs, conducting an experiment with a clean test set to ensure fairness, performing additional human evaluations, training an MoE-based reward model, expanding the discussion on related work, and offering comprehensive details on data curation and quality control.

Discussion and Recommendation:

The reviewers rated this paper as borderline. While the proposed benchmark has potential and could make an impact on the community, the original submission was weak in several key areas, as outlined above. Important aspects of benchmark construction and evaluation were missing. Although the authors' rebuttal provided extensive clarifications, explanations, and additional experimental results, it also led to substantial changes in the findings and claims presented in the original submission. For instance, the original paper did not consider several important alternative alignment evaluation models, which subsequently outperformed the models initially evaluated. These issues, together with concerns about the positioning of related work, novelty, the incorporation of rebuttal content, and the reorganization of the main paper and appendix, led the ACs to conclude that the paper requires major revisions and another review cycle. The ACs believe that with revision, the paper's impact can be greatly enhanced, and they encouraged the authors to continue this line of research for future submission.

Final Decision

Reject