DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation
This paper introduces DSH-Bench, a comprehensive benchmark for subject-driven text-to-image generation models that enables more rigorous and nuanced assessment of model performance.
Abstract
Reviews and Discussion
This paper introduces DSH-Bench, a new benchmark for subject-driven text-to-image (T2I) generation, addressing critical gaps in diversity and evaluation granularity present in prior benchmarks. The authors propose a hierarchical taxonomy-based approach to dataset construction, ensuring broad and fine-grained subject coverage (58 categories, 459 subjects). Prompts are systematically generated and classified by six scenario types and three subject difficulty levels. A novel, human-aligned evaluation metric—Subject Identity Consistency Score (SICS)—is introduced, leveraging fine-tuned vision-language models for more accurate and efficient subject preservation assessment. Empirical evaluations of 15 SOTA models using DSH-Bench uncover new insights into current limitations and trade-offs in subject-driven T2I generation, especially with respect to subject complexity and prompt scenario robustness. The benchmark, data, code, and metric will be released publicly.
Strengths and Weaknesses
Strengths
- Comprehensive Dataset Construction:
The hierarchical taxonomy ensures much greater subject diversity and coverage than previous works, substantially reducing evaluation bias.
- Granular, Scenario-Aware Evaluation:
Categorizing both subjects and prompts by difficulty and scenario allows for detailed diagnosis of model weaknesses, going beyond simple aggregate metrics.
- Novel, Human-Aligned Metric (SICS):
SICS correlates better with human judgments than existing automated metrics (CLIP, DINO, GPT-4o), while being more efficient and scalable.
- Extensive Empirical Evaluation:
Fifteen leading models are systematically compared, and analysis reveals nuanced failure modes not captured by previous benchmarks.
Weaknesses
- Limited Generalization Evidence for SICS:
While SICS is shown to outperform baselines on the constructed dataset, its robustness to new subject domains or out-of-distribution data is not deeply examined.
- Prompt/Scenario Design:
Although prompt scenarios are well-motivated, the generation process relies heavily on GPT-4o, which may introduce subtle biases or artifacts. More details on prompt validation or diversity analysis would strengthen the work.
Questions
- Generalization of SICS:
How does SICS perform on subject domains outside those used for training (e.g., medical or satellite imagery)? My score would increase if cross-domain generalization is empirically validated.
- Potential single-model bias:
There is potential single-model bias from using GPT-4o alone for classification -- I would at least double-check on a sample (say ~100 images) with a set of candidates at or above the performance tier of 4o: Claude, Gemini 2.5 Pro, etc.
Limitations
Prompts are manually reviewed (Section 3.1.3). Ideally the authors could also confirm that the 459 images are reviewed along the same lines, as this is a reasonably manageable scale for human review.
Formatting Issues
Figure 6 is a crucial result presentation, but it is tiny and barely readable without extreme zooming. If space permits, could some nudging and rewording give it more real estate?
Response
We are very appreciative of your thorough review of our paper and the numerous suggestions you've provided! Now, let us address the specific questions you have raised and provide further clarification:
Q1: Generalization of SICS
A1: We sincerely thank the reviewer for the thorough evaluation of our paper and the valuable suggestions. As introduced in the Introduction (lines 25–27), the primary goal of subject-driven text-to-image generation is to address scenarios such as: "a user might request an image of 'his dog traveling around the world'." Consequently, mainstream methods for subject-driven text-to-image generation [1,2,3] have primarily focused on general-domain subject images. The goal of these methods is to enable users to generate images of common objects or people in specific scenarios.
In the current task setting of subject-driven text-to-image generation, it is typically necessary to specify a clear subject in the input. Upon carefully examining images from the medical domain, such as X-ray images, magnetic resonance images (MRI), and ultrasound images, we believe that these types of images generally do not call for subject-driven generation.
The categories of images in our dataset are highly comprehensive. As shown in Appendix A, the construction of our dataset categories was informed by the categories of existing customized text-to-image datasets as well as the COCO dataset. The COCO dataset, a widely used and comprehensive dataset for object detection, shares a similar requirement for clearly defined subjects, which aligns with the needs of customized T2I generation. As demonstrated in Table 4, the categories in our dataset encompass all categories from COCO. Therefore, our dataset provides highly comprehensive coverage of subject categories in general domains.
To rigorously assess the robustness and generalization capability of our proposed SICS metric, we conducted an evaluation on DreamBench, which serves as an out-of-distribution (OOD) test set. For this experiment, we randomly sampled 100 instances generated by three subject-driven text-to-image models: UNO, BLIP-Diffusion, and IP-Adapter. The experimental comparison methods and the computational metrics used remain consistent with those described in the main text. The results, summarized in Table 1, reveal a consistent and significant advantage of SICS over GPT-4o across nearly all tested models. On average, SICS surpasses GPT-4o by 22.8% in Kendall's τ correlation ((0.425-0.346)/0.346) and by 20.5% in Spearman's ρ correlation ((0.452-0.375)/0.375). This finding provides compelling evidence that our proposed metric more stably and accurately reflects human perception of subject consistency, even on a challenging out-of-distribution dataset.
Table 1: Results of evaluation on an out-of-distribution test set

| Model | Kendall τ - GPT-4o | Kendall τ - SICS (Ours) | Spearman ρ - GPT-4o | Spearman ρ - SICS (Ours) |
|:-|:-:|:-:|:-:|:-:|
| UNO | 0.287 | 0.337 | 0.315 | 0.378 |
| BLIP-Diffusion | 0.323 | 0.505 | 0.346 | 0.527 |
| IP-Adapter | 0.428 | 0.434 | 0.465 | 0.451 |
| Average | 0.346 | 0.425 | 0.375 | 0.452 |
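For readers who want to see how the alignment numbers above are obtained, the following is a minimal sketch of the rank-correlation computation. The variable names and the toy ratings are illustrative assumptions, not the actual benchmark annotations; in the real evaluation each automated score is paired with the corresponding human rating for the same generated image.

```python
# Minimal sketch: correlate automated metric scores with human ratings.
from scipy.stats import kendalltau, spearmanr

def alignment(human_scores, metric_scores):
    """Return (Kendall tau, Spearman rho) between a metric and human ratings."""
    tau, _ = kendalltau(human_scores, metric_scores)
    rho, _ = spearmanr(human_scores, metric_scores)
    return tau, rho

# Toy values for illustration only (not real benchmark data):
human = [3, 5, 2, 4, 1, 4]
sics  = [3, 4, 2, 4, 1, 5]
gpt4o = [4, 4, 3, 4, 2, 4]
print("SICS  :", alignment(human, sics))
print("GPT-4o:", alignment(human, gpt4o))
```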
Q2: Potential single model bias
A2: We appreciate the reviewer’s suggestion to verify potential single-model bias in using GPT-4o. When generating prompts, we compared the prompts produced by different models. Table 2 presents a comparative analysis of prompts generated by three state-of-the-art Vision-Language Models: GPT-4o, Gemini 2.5 Pro, and Claude (due to space constraints, we present only one example; additional examples will be provided in the appendix). The specific instruction used to generate the prompts can be found in Figure 4 of the paper. A qualitative evaluation of these prompts reveals no significant or discernible differences in their content or structure. Given the comparable performance across the models, we selected GPT-4o for all prompt generation tasks in this study. Moreover, the prompts generated by GPT-4o include slightly more complex and diverse descriptions of the environment, which makes the generation task more intricate and challenging. This choice also aligns with the methodology established in the DreamBench++ [1] benchmark, ensuring methodological consistency with prior work.
Table 2: Examples of prompts generated by GPT-4o, Gemini 2.5 Pro, and Claude. The subject in this case is an African drum.
| Prompt Category | Gemini 2.5 Pro | Claude | GPT-4o |
|---|---|---|---|
| Background change | A single african drum resting on the red earth of the African savanna at sunset. | An african drum standing on a sandy beach with waves in the background | A single african drum placed on a sandy beach with gentle waves in the background under a clear sky. |
| Variation in subject viewpoint or size | A low-angle shot of an african drum sitting on a wooden stage, illuminated by a single, warm spotlight from above, with dust motes dancing in the light. | A low-angle shot of an african drum on a stage, with spotlights creating dramatic shadows. | An african drum viewed from a high-angle perspective, placed on a grassy hilltop overlooking a distant mountain range with scattered wildflowers around it. |
| Interaction with other entities | A group of children in a circle, their joyful faces illuminated by the sun, all reaching out to play a large african drum together. | An african drum partially submerged in a clear lake, with its reflection visible on the water's surface | An african drum partially obscured by tall reeds in a dense wetland, with a heron standing nearby and its reflection visible in the water below. |
| Attribute change | An african drum carved entirely from shimmering, translucent crystal, with glowing runes etched onto its surface. | An african drum made entirely of clear, shimmering crystal. | An african drum painted entirely in vibrant rainbow colors and decorated with intricate tribal patterns. |
| Style change | A cubist painting of an african drum, deconstructed into geometric shapes and shown from multiple perspectives simultaneously. | An african drum rendered in the style of Vincent van Gogh's 'Starry Night', with swirling brushstrokes and vibrant colors. | An african drum depicted in the Cubist art style, with fragmented geometric shapes and abstract distortions. |
| Imagination | A fleet of miniature african drums floating through a cosmic nebula, propelled by rhythmic sound waves that ripple through the stardust. | An african drum as a spaceship, with tiny alien creatures using it to explore the galaxy. | An african drum floating in mid-air, surrounded by glowing orbs of light that pulse rhythmically as if responding to the drum's silent beat. |
Q3: Prompts are manually reviewed (Section 3.1.3). Ideally the authors could also confirm that the 459 images are reviewed along the same lines, as this is a reasonably manageable scale for human review.
A3: We thank the reviewer for the valuable suggestions, which will help us further strengthen DSH-Bench and its documentation. We confirm that all 459 subject images in DSH-Bench were individually reviewed by human annotators (see Section 3.1.2). The review process involved filtering out images with multiple subjects, poor quality, or unclear copyright status, followed by final verification by multiple annotators. To ensure high consistency in the review process, any controversial images encountered during the review were discussed and resolved through consensus. We are committed to maintaining strict quality control standards for all benchmark data.
Q4: Figure 6 is a crucial result presentation, but it is tiny and barely readable without extreme zooming. If space permits, could some nudging and rewording give it more real estate?
A4: Thank you for your insightful feedback regarding the readability of Figure 6. We acknowledge its importance in presenting key results and we will adjust the figure to enhance its readability. We will redesign Figure 6 by increasing the font size and splitting it into two separate figures to ensure clarity in the main document without requiring extreme zooming. These adjustments will be updated in both the camera-ready version and the ArXiv submission.
Reference
[1] Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation. In The Thirteenth International Conference on Learning Representations, 2025.
[2] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NAQvF08TcyG.
[3] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023.
Hi reviewer,
Thanks for your hard work in the review phase. We now move to the next step: discussing this paper to decide whether it can be accepted. Please feel free to read the other reviews and actively participate in the discussion if you have a different opinion. Thanks again for your contributions to the NeurIPS community.
Best, AC
Dear Reviewer tQca,
Thank you for your invaluable efforts and constructive feedback on our manuscript.
As the discussion period draws to a close, we eagerly anticipate your thoughts on our response. We sincerely hope that our response meets your expectations. If there are any remaining concerns or aspects that require clarification, we are ready to address them as soon as possible.
Best regards,
The Authors
The paper introduces DSH-Bench, a systematically annotated benchmark for subject-driven text-to-image generation that expands prior datasets from 6–15 to 58 fine-grained categories, assigns each of its 459 reference images a difficulty level (easy/medium/hard) and six scenario tags (background change, viewpoint/size, interaction, attribute, style, imagination), and pairs them with 5,508 GPT-4o-crafted prompts. To evaluate subject preservation efficiently, the authors fine-tune a 7B vision-language model into the Subject Identity Consistency Score (SICS), which aligns more closely with human judgments than GPT-4o while greatly reducing evaluation cost. They benchmark 15 personalization methods across three axes—subject identity, prompt adherence, and image quality—providing diagnostic analyses by category, difficulty, and scenario, and show results on previous open-source pipelines.
Strengths and Weaknesses
Strengths: (1) The scale of the benchmark is large compared with previous benchmarks for image customization like DreamBench and DreamBench++. (2) The authors report performance on various baseline image customization methods, which makes this benchmark paper pretty solid. (3) The authors classify the objects into different difficulty levels and propose a human-aligned and more efficient metric, which contributes to the community.
Weaknesses: The evaluation of the baselines on the multi-subject customization task is missing. A good benchmark dataset should also support evaluation on multi-subject customization. I would like to see the evaluation results in the rebuttal.
Questions
(1) How is the hard level of the subjects divided? Is it divided based on experiments? (2) Could you classify the dataset based on categories? I think it is quite useful for the community.
Limitations
See weakness
Final Justification
The authors have basically addressed my concerns. Therefore I maintain my initial positive rating.
Formatting Issues
NA
Response:
We would like to express our sincere gratitude for your positive feedback on our paper and for recognizing its strengths. Additionally, we are thankful for your observation that we classify the objects into different difficulty levels and propose a human-aligned and more efficient metric, which contributes to the community.
Now, let us address the specific questions you have raised and provide further clarification:
Q1: How is the hard level of the subjects divided? Is it divided based on experiments?
A1: We classify the subject images according to the difficulty the model experiences in preserving details of the reference subject; the levels are not derived from experiments. We input the subjects into GPT-4o to assist in assessing the difficulty of preserving fine-grained details, and the results are thoroughly verified through manual inspection. We define three subject difficulty levels:
- Easy: Subjects characterized by minimal surface complexity and homogeneous textural properties, exemplified by smooth-surfaced objects such as a ceramic mug with uniform coloration. These instances present negligible challenges for detail preservation due to their structural regularity.
- Medium: Subjects containing discernible high-frequency features while maintaining global structural coherence, such as cylindrical containers with legible typographic elements. These cases require intermediate detail preservation capabilities.
- Hard: Subjects exhibiting non-uniform texture distributions and multi-scale geometric details, typified by objects like book covers containing fine-grained calligraphic elements. Such instances expose model limitations in maintaining structural fidelity and textural granularity under complex topological constraints.

We utilize GPT-4o to classify the subject images according to the aforementioned criteria. Subsequently, all images are reviewed and corrected by human annotators to ensure accuracy and consistency. All of the aforementioned content can be found in Section 3.1.2 of the paper.
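To make the GPT-4o-assisted labelling step concrete, here is a hedged sketch using the OpenAI Python SDK. The rubric wording, the `classify_difficulty` helper, and the base64 encoding detail are illustrative assumptions rather than the exact instruction used in the paper, and every label is subsequently verified by human annotators as described above.

```python
# Hedged sketch of GPT-4o-assisted difficulty labelling (illustrative prompt, not
# the paper's exact instruction); all outputs are later checked by human annotators.
import base64
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Classify the subject in this image by how hard its fine-grained details are "
    "to preserve: 'easy' (smooth, uniform surfaces), 'medium' (legible high-frequency "
    "features with coherent structure), or 'hard' (non-uniform textures and multi-scale "
    "details such as fine calligraphy). Answer with one word."
)

def classify_difficulty(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": RUBRIC},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower()
```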
Q2: Could you classify the dataset based on categories? I think it is quite useful for the community.
A2: We fully agree with your suggestion. Therefore, in Section 3.1.1, we provide a detailed explanation of the categories utilized in our benchmark. There are 58 categories in our dataset. Furthermore, Figure 3 offers a comprehensive visualization of the dataset distribution across these categories. We conducted experiments for each category and present an in-depth discussion of the results in the appendix, with the aim of providing valuable insights to advance research in this field. We will state this more clearly in the final version.
Q3: The evaluation of the baselines on the multi-subject customization task is missing.
A3: We sincerely appreciate the valuable suggestions provided by the reviewer. This paper primarily focuses on the domain of single-subject-driven text-to-image generation, as do numerous studies [1,2,3,4,5] that concentrate on this area. Multi-subject-driven generation builds upon this foundation by generating target images for multiple subjects based on prompts. Single-subject-driven text-to-image generation is a fundamental aspect of subject-driven text-to-image generation, and its effective evaluation is crucial. Therefore, our current work is focused solely on single-subject scenarios and does not address multi-subject contexts.
Nevertheless, our work includes many innovative components that can be directly applied to the evaluation of multi-subject customized generation, such as subject difficulty level classification and prompt scenario classification. Since our evaluation metric, SICS, was designed and trained specifically for single-subject text-to-image customization, it cannot be directly used for multi-subject evaluation. However, we believe that this approach can also be beneficial for the evaluation of multi-subject customization. The suggestion raised by the reviewer represents an important direction for future exploration, and we are committed to conducting more in-depth research on this topic in the future. We will provide a detailed explanation of this limitation in the final version of the paper.
References
[1] Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation. In The Thirteenth International Conference on Learning Representations, 2025.
[2] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NAQvF08TcyG.
[3] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023.
[4] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
[5] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098, 2024.
Hi reviewer,
Thanks for your hard work in the review phase. We now move to the next step: discussing this paper to decide whether it can be accepted. Please feel free to read the other reviews and actively participate in the discussion if you have a different opinion. Thanks again for your contributions to the NeurIPS community.
Best, AC
Dear Reviewer AKLi,
Thank you for your invaluable efforts and constructive feedback on our manuscript.
As the discussion period draws to a close, we eagerly anticipate your thoughts on our response. We sincerely hope that our response meets your expectations. If there are any remaining concerns or aspects that require clarification, we are ready to address them as soon as possible.
Best regards,
The Authors
The authors have basically addressed my concerns. Therefore I maintain my initial positive rating.
Dear Reviewer AKLi,
We sincerely appreciate your thoughtful and valuable feedback. Your insights are invaluable, and we will carefully incorporate your suggestions to further improve our work. Thank you for your time, effort, and contributions to the review process.
Best regards,
The Authors
This paper focuses on benchmarking subject-driven text-to-image generation. The authors propose DSH-Bench, addressing key shortcomings of existing benchmarks: 1) limited subject diversity and coarse evaluation granularity, for which DSH-Bench provides a hierarchical subject taxonomy (58 categories); 2) a classification scheme for subject difficulty and prompt type; and 3) a new Subject Identity Consistency Score (SICS) that aligns more closely with human judgments. By evaluating 15 T2I models, the authors reveal limitations in current methods and outline future research directions.
Strengths and Weaknesses
Strengths:
- The authors introduce a new benchmark for subject-driven text-to-image (T2I) generation, addressing a research gap by incorporating a hierarchical evaluation based on subject difficulty levels. This benchmark has the potential to yield valuable insights for the community.
- The paper presents extensive experiments using the proposed benchmark across 15 T2I models. Additional detailed analyses are included in the supplementary material. These thorough evaluations enhance the paper’s credibility and provide meaningful insights.
Weaknesses:
- The clarity of the paper could be improved. A comparison table summarizing the differences between this and existing T2I benchmarks would be helpful—such as comparing the number of images and prompts, and explicitly highlighting the focus on subject difficulty classification. This would better illustrate the specific research gap the benchmark aims to fill.
- Some experimental findings are somewhat expected. For instance, the observation in line 274 that "current subject-driven T2I models exhibit performance degradation on hard-level subjects" is intuitive. The analysis could be deepened by highlighting findings that existing benchmarks cannot reveal—for example, conducting a per-category analysis. Are there specific subjects that consistently perform well or poorly across all difficulty levels? This would provide more nuanced insights into model behavior.
Questions
Please refer to the Weaknesses part.
Limitations
The authors have already discussed some limitations in their paper. Please check my concerns in the Weaknesses part.
Final Justification
Thank you for the rebuttal and clarification. I maintain my positive rating as borderline accept.
Formatting Issues
No major formatting issues.
Response:
We are very appreciative of your thorough review of our paper and the numerous suggestions you've provided! Additionally, we are thankful for your observation that our work addresses a research gap by incorporating a hierarchical evaluation based on subject difficulty levels, which has the potential to yield valuable insights for the community. Now, let us address the specific questions you have raised and provide further clarification:
Q1: The clarity of the paper could be improved.
A1: We express our gratitude to the reviewers for their meticulous examination of our paper and for providing valuable feedback. In the main body of the paper, Figures 2 and 3, along with Table 8 in the appendix, provide a detailed comparison between DSH-Bench and existing benchmarks.
- Figure 2 provides a detailed comparison between DSH-Bench and three other benchmarks: DreamBench, CustomConcept101, and DreamBench++. This comparison focuses on the distribution differences across various prompt scenarios with varying levels of subject difficulty. The results illustrate that our benchmark offers a more comprehensive evaluation.
- Figure 3 provides a detailed comparison of DSH-Bench with three other benchmarks: DreamBench, CustomConcept101, and DreamBench++. The distribution across various categories is illustrated, highlighting the superior diversity of subjects in our benchmark.
- Table 8 in the appendix illustrates the differences in the distribution of primary difficulty across various categories between our benchmark and other benchmarks. This table demonstrates that our sampling in each category has taken into account the distribution of difficulty levels, ensuring that our benchmark provides a more comprehensive and unbiased evaluation of model performance.
Q2: The analysis could be deepened by highlighting findings that existing benchmarks cannot reveal.
A2: We express our sincere gratitude to the reviewer for acknowledging the valuable insights that our benchmark provides. Indeed, DSH-Bench uncovers perspectives that previous benchmarks have not been able to offer. We have elaborated on these insights in detail in Section 5 of the main text and in Appendix D.1:
- The subject-driven T2I capability for different prompt scenarios is not robust (lines 286–295): (1) In BC (Background change), VS (Variation in subject viewpoint or size), and IE (Interaction with other entities) scenarios, the model’s performance consistently declines across all evaluation dimensions. This trend suggests that the difficulty of the scenarios increases progressively from BC to IE. Notably, the finding that the IE scenario is more challenging than the BC scenario aligns with intuitive expectations. Therefore, future research may need to place greater emphasis on the IE scenario. (2) For subject preservation, the model’s average performance across the AC, SC, and IM prompt scenarios remains relatively low. This could be because the generated subjects undergo partial modifications relative to the original subjects in these three scenarios.
- Model robustness varies considerably among categories (lines 265–273, 587–611): Figure 6 provides a detailed comparison of the performance of various methods across different third-level categories. For example, performance in the "artwork" categories (both photorealistic and non-photorealistic) is substantially lower. None of the current models perform well across all categories. We hypothesize that this may be related to the varying complexity of the subjects within different categories. In Appendix D.1 (lines 587–611), we conducted a comprehensive analysis of the performance of various methods across both first-level and second-level categories. The results demonstrate that, irrespective of whether the primary category is realistic or non-realistic, the scores for the subject preservation dimension are consistently lower for the human category across nearly all models. This phenomenon can be attributed to the distribution of difficulty levels within the human category, where the proportions of easy, medium, and hard cases are 1.96%, 50.98%, and 47.06%, respectively (as shown in Table 8).
- Implications for technical approaches (lines 612–632): Our analysis indicates that current encoder-based methods still face challenges in accurately reconstructing subjects with high-frequency details. This limitation may stem from the characteristics of commonly used image encoders, such as CLIP, which tend to prioritize semantic information over fine-grained details. Consequently, the performance of these methods on hard-level subjects is suboptimal. Future research should focus on enhancing the restoration of challenging subject details.
Thank you for your rebuttal and clarification. The authors promise to improve the paper's clarity. I maintain my positive rating as borderline accept.
Dear Reviewer dDpk,
We sincerely appreciate your thoughtful and valuable feedback. Your insights are invaluable, and we will carefully incorporate your suggestions to further improve our work. Thank you for your time, effort, and contributions to the review process.
Best regards,
The Authors
The paper introduces DSH-Bench, a new benchmark for evaluating subject-driven text-to-image (T2I) generation models. The benchmark features a hierarchical taxonomy of subject images, a classification scheme for subject difficulty and prompt scenarios, and a new metric for subject identity consistency (SICS). The authors evaluate 15 subject-driven T2I models on DSH-Bench and claim that it uncovers previously obscured limitations in current approaches.
Strengths and Weaknesses
Strengths
- The hierarchical taxonomy is well-defined and leads to a clearly more diverse dataset than existing T2I evaluation benchmarks.
- The benchmark includes a relatively large and diverse set of subject images and prompts.
- The authors evaluate 15 SOTA T2I methods across multiple dimensions: subject preservation, prompt following, and image quality.
Weaknesses
- The primary contribution of this work is the construction of a more comprehensive benchmark for evaluating subject-driven T2I generation. As such, it is more appropriate for the NeurIPS Datasets and Benchmarks Track rather than the main conference track, which emphasizes algorithmic novelty.
- The benchmark relies heavily on human annotations, yet the paper omits critical details about the subjective evaluation process—such as the participant demographics, evaluation protocol, and statistical reliability (e.g., inter-rater agreement) of the collected annotations.
- The proposed SICS metric is essentially a fine-tuned Qwen2.5-VL-7B model trained on human-annotated labels. This represents more of an engineering effort than a substantive research contribution. While the authors claim a 9.4% improvement over GPT-4o in Kendall’s τ, the paper lacks qualitative examples or error analysis to illustrate where SICS succeeds or fails compared to prior metrics.
- The core components of DSH-Bench—namely the hierarchical taxonomy, difficulty-level classification, and evaluation via fine-tuned LLMs—are incremental extensions of techniques already established in prior benchmarks (e.g., DreamBench++). The novelty in methodology is thus limited.
- The paper does not clearly explain how the SICS metric is computed or how it substantially differs from existing evaluation methods. Furthermore, SICS is trained on a relatively small dataset (5,000 image-text pairs), which may limit its ability to generalize across the full spectrum of subject appearance variations.
- The presentation of experimental results is not intuitive. For instance, from the tables provided, it is difficult to directly discern which T2I methods are top-performing or underperforming across different evaluation dimensions.
Questions
See the Weaknesses part above for questions. The paper contains many typos, e.g., “Meidum” in Figure 6(a) should be corrected to “Medium”.
Limitations
N/A
Final Justification
In the authors' rebuttal, they addressed most of my concerns about the method and the results. I will raise the score to borderline accept.
Formatting Issues
N/A
Response
We are very appreciative of your thorough review of our paper and the numerous suggestions you've provided! We will now clarify each of the concerns you raised regarding the points of confusion in the paper.
Q: SICS represents more of an engineering effort than a substantive research contribution.
**A:** Our proposed metric, SICS, innovatively focuses on subject-level consistency rather than merely relying on embedding comparisons. Furthermore, our metric is an efficient, scalable, low-cost, and fine-grained tool for automatically evaluating subject consistency, making a positive contribution to the community. To elucidate the underlying reasons for this superiority, we present a qualitative analysis. Case studies reveal that SICS's higher alignment with human perception stems from two primary advantages:
- Focus on Core Visual Attributes over High-Level Semantics: We observe that GPT-4o's proficiency in high-level semantic understanding can lead it to overlook critical visual details. In a "backpack" case, despite clear discrepancies in the generated image's shape, color, and logo compared to the reference, GPT-4o assigned a maximum score, suggesting a generalized conceptual match. In stark contrast, SICS, aligning with human annotators, astutely identified these inconsistencies and assigned a low score, demonstrating its capacity to focus on the core visual attributes that define the subject.
- Superior Discriminative Power and Granularity: A key advantage of SICS is its scoring granularity, which mitigates the "score saturation" phenomenon common in GPT-4o's upper-range evaluations. For instance, in a clock case, GPT-4o assigned an identical high score of 4 to two images with varying error severity—one with a minor color deviation and another with significant structural differences. In contrast, SICS, consistent with human ratings, successfully differentiated them with scores of 3 and 2, respectively, accurately reflecting the inconsistency. These cases demonstrate that SICS provides a fine-grained scoring standard aligned with human perception, a capability that is indispensable for researchers needing to precisely evaluate subtle model improvements.
Q: The novelty in methodology is limited.
A: We appreciate the reviewers for pointing out the connection between DSH-Bench and existing benchmarks (e.g., DreamBench++) in terms of design philosophy. However, we believe that this work introduces substantial, multi-dimensional innovations in the construction of benchmark data and evaluation criteria, rather than being a mere incremental extension:
- Systematic and Comprehensive Innovation in Hierarchical Taxonomy (line 47): Benchmarks like DreamBench++ provide only a limited number of categories, which fall short of reflecting the diversity and challenges of the real world. In contrast, DSH-Bench introduces a hierarchical taxonomy comprising 58 major categories and 459 subjects. This significantly enhances both coverage and granularity.
- Scientific Definition and Annotation Process for Subject Difficulty Level Classification and Prompt Scenario Classification (line 56): We propose a scientifically defined grading standard for visual subject complexity and a detailed classification system for prompt scenarios. Experimental results demonstrate substantial performance differences of subject-driven T2I methods across various subject difficulty levels (see Figure 6(a)) and different prompt scenarios (see Figure 6(b)). The inclusion of this dimension directly uncovers model weaknesses that previous benchmarks were unable to identify, offering empirical evidence to guide future research.
- Efficient Automation of Subject Consistency Evaluation with High Alignment to Human Assessment (line 67): DreamBench++ uses the GPT-4o API for evaluation to improve alignment with human assessments. However, it incurs extremely high costs (over 20,000 API calls per model on our dataset, costing more than $400). In contrast, our proposed SICS achieves high alignment with human evaluations while significantly reducing inference costs.
Q: The paper omits critical details about the subjective evaluation process.
A: While the main text includes detailed explanations of the protocols for human annotation, some aspects remain insufficiently addressed. Therefore, we provide the following elaboration on the requirements for annotators and the specific details of the annotation process:
- All annotators involved in this work possess extensive experience in annotation within the relevant domain. They have previously participated in similar tasks and have a thorough understanding of subject-driven T2I generation tasks.
- All annotators hold a bachelor's degree or higher, ensuring their ability to accurately understand and implement the annotation guidelines.
- In Section 4.1, we provide human annotators with sufficient training to ensure they fully understand the subject-driven T2I generation task and can deliver unbiased and accurate evaluations. Detailed annotation guidelines are provided in Figures 4 and 11 and in Section 3.1.2.
- To ensure consensus among annotators regarding the annotation standards, we conducted a calibration process. Annotators were asked to annotate a small pilot dataset, after which we reviewed discrepant cases in the annotation results. Targeted feedback and additional training were provided to the annotators based on these reviews. This iterative process was repeated until all annotators demonstrated a high level of consistency in their understanding of the differentiation criteria for each annotation task.
- In the formal annotation phase, each SICS data sample was independently annotated by five annotators. To construct a high-confidence training dataset, we applied a consensus-based filtering criterion: only samples where at least four annotators (annotation consistency ≥80%) assigned identical labels were retained.
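As a concrete illustration of the consensus-based filtering criterion in the last bullet, here is a minimal sketch. The sample identifiers and integer labels are hypothetical; the real pipeline operates on the annotated subject-consistency scores described above.

```python
# Minimal sketch of the consensus filter: keep a sample only if at least 4 of its
# 5 independent annotations are identical (>= 80% agreement).
from collections import Counter

def consensus_label(labels, min_agreement=4):
    """Return the majority label if enough annotators agree, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None

annotations = {
    "sample_001": [4, 4, 4, 4, 3],   # kept, consensus label 4
    "sample_002": [2, 3, 3, 4, 5],   # discarded, no 4-way agreement
}
training_set = {
    sid: consensus_label(labels)
    for sid, labels in annotations.items()
    if consensus_label(labels) is not None
}
print(training_set)  # {'sample_001': 4}
```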
We will incorporate this content into the camera-ready version in the future.
Q: The paper does not clearly explain the SICS metric.
A:
- About "how the SICS metric is computed"
- Instead of relying on simple embedding distances, SICS utilizes instruction fine-tuning on multi-modal large language models to directly output fine-grained subject consistency scores ranging from 0 to 5, along with explanatory text. These evaluation standards align with human assessments (see Appendix E.2 and Figure 11); a minimal inference sketch is provided at the end of this answer.
- In Figure 11, SICS employs prompts to guide the model's focus specifically on "subject consistency" rather than the overall image semantics during the fine-tuning process. This approach mitigates the influence of background and overall style, which can affect methods like CLIP, thereby ensuring a stronger alignment with the core requirements of subject consistency evaluation.
- About "SICS is trained on a relatively small dataset."
- During the training process, we constructed the training dataset by selecting data that spans all scenarios, difficulty levels, and categories, ensuring diversity in the distribution of the training data.
- To further address the reviewers' concerns, we conducted an evaluation on DreamBench, which serves as an out-of-distribution test set. We randomly sampled 100 instances generated by three models: UNO, BLIP-Diffusion, and IP-Adapter. The evaluation metrics remain consistent with those used in the paper. The experimental results are presented in Table 1. It can be observed that our method maintains higher consistency with human evaluations. This is primarily because our model has been meticulously designed and trained. For details, please refer to our response to Q3.

Table 1: Results of evaluation on an OOD test set

| Model | Kendall τ - GPT-4o | Kendall τ - SICS (Ours) | Spearman ρ - GPT-4o | Spearman ρ - SICS (Ours) |
|:-|:-:|:-:|:-:|:-:|
| UNO | 0.287 | 0.337 | 0.315 | 0.378 |
| BLIP-Diffusion | 0.323 | 0.505 | 0.346 | 0.527 |
| IP-Adapter | 0.428 | 0.434 | 0.465 | 0.451 |
| Average | 0.346 | 0.425 | 0.375 | 0.452 |
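To make the scoring procedure described under "how the SICS metric is computed" concrete, the following is a hedged inference sketch using the standard Qwen2.5-VL chat interface from Hugging Face transformers (with the separate qwen-vl-utils helper package). The checkpoint path, prompt wording, and score-parsing step are illustrative assumptions; the actual SICS instruction and fine-tuned weights are those described in Appendix E.2 and Figure 11 of the paper.

```python
# Hedged sketch of SICS-style scoring with a fine-tuned Qwen2.5-VL-7B checkpoint.
# "path/to/sics-qwen2.5-vl-7b" and the instruction text below are placeholders,
# not the released model or the exact prompt from the paper.
import re
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_PATH = "path/to/sics-qwen2.5-vl-7b"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

def sics_score(reference_image: str, generated_image: str) -> int:
    """Return a 0-5 subject-consistency score parsed from the model's reply."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": reference_image},
            {"type": "image", "image": generated_image},
            {"type": "text", "text": (
                "Rate how well the subject in the second image preserves the "
                "identity of the subject in the first image on a 0-5 scale, "
                "ignoring background and style, then briefly explain. "
                "Start your answer with 'Score: <n>'."
            )},
        ],
    }]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    reply = processor.batch_decode(
        output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )[0]
    match = re.search(r"\d", reply)  # take the first digit in the reply as the score
    return int(match.group()) if match else -1
```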
Q: The presentation of experimental results is not intuitive.
A: In response to the reviewer's comment, we have provided a detailed ranking in Table 3 of the paper. To further address the reviewer's concerns, we have improved the table for greater clarity and effectiveness. The updated table is shown below, and we will incorporate it into the camera-ready version:

| Method | T2I Model | Subject Preservation | Prompt Following | Image Quality |
|:-|:-|:-:|:-:|:-:|
| UNO | FLUX.1-dev | | | |
| RealCustom++ | SDXL | | | |
| MS-Diffusion | SDXL | | | |
| Emu2 | SDXL | | | |
| OminiControl | FLUX.1-schnell | | | |
| ... | | | | |
Q: It is more appropriate for the NeurIPS Datasets and Benchmarks Track.
A:
- Our benchmark has systematically advanced several innovative aspects, including subject difficulty level classification, prompt scenario classification, evaluation across different categories, and the optimization of evaluation metrics. For detailed explanations of the innovations and insights, please refer to our responses to Q3 and Q4.
- In previous instances, benchmark papers with contributions similar to those presented in our study have been accepted to the main track. For example, references [1, 2] demonstrate this trend.
Reference
[1] Yiwei Ma, Jiayi Ji, et al. I2EBench: A comprehensive benchmark for instruction-based image editing. In Advances in Neural Information Processing Systems.
[2] Yinuo Jing, Ruxu Zhang, et al. Animal-Bench: Benchmarking multimodal video models for animal-centric video understanding. In Advances in Neural Information Processing Systems.
Hi reviewer,
Thanks for your hard work in the review phase. We now move to the next step: discussing this paper to decide whether it can be accepted. Please feel free to read the other reviews and actively participate in the discussion if you have a different opinion. Thanks again for your contributions to the NeurIPS community.
Best, AC
Dear Reviewer bbdq,
We sincerely appreciate the time and effort you have devoted to reviewing our work. Your detailed comments and constructive suggestions have been invaluable in improving the clarity and completeness of our work.
In response to your concerns, we have provided detailed replies and conducted additional experiments to address the issues you raised. All these updates will be incorporated into the final version of the paper. We hope that our responses satisfactorily address your concerns.
If you find that our responses have resolved your concerns, we would be grateful if you would consider raising your final rating to a higher score. Your feedback is crucial to the improvement of our work and is greatly appreciated.
Thank you once again for your thoughtful review and support.
Best regards,
Authors
Dear Reviewer bbdq,
Thank you for your invaluable efforts and constructive feedback on our manuscript.
As the discussion period draws to a close, we eagerly anticipate your thoughts on our response. We sincerely hope that our response meets your expectations. If there are any remaining concerns or aspects that require clarification, we are ready to address them as soon as possible.
Best regards,
The Authors
We sincerely thank the reviewers for their valuable feedback and positive evaluation of our work. Their insightful comments have significantly improved the clarity and overall quality of our paper.
We are grateful to Reviewer tQca (5: Accept) for recognizing the merits of our work. They highlighted the substantial progress our study has made compared to prior efforts, benefiting the research community. They acknowledged the adoption of the hierarchical taxonomy for reducing evaluation bias and commended our fine-grained, scenario-aware evaluation on fifteen leading models, which reveals nuanced failure modes not captured by previous benchmarks.
Reviewer AKLi (4: BorderlineAccept) praised the robustness of our work, emphasizing the comprehensive evaluation conducted on fifteen state-of-the-art methods, which makes our contribution solid. They also appreciated our classification of objects into different difficulty levels and the proposal of a human-aligned and more efficient evaluation metric, which has notable value for the community.
Reviewer dDpk (4: BorderlineAccept) recognized that our hierarchical evaluation based on subject difficulty levels has the potential to yield valuable insights for the research community. They also commended the meticulous evaluation performed on a large number of methods, which enhances the credibility of our work and addresses a significant research gap.
Reviewer bbdq (4: BorderlineReject) acknowledged the use of a well-defined hierarchical taxonomy, which ensures that our dataset is more diverse compared to existing benchmarks. They further appreciated the large and diverse set of subject images and prompts included in our benchmark, which are critical for effective and unbiased evaluation.
We sincerely thank the reviewers for identifying these strengths and for their positive feedback on the clarity, novelty, and effectiveness of our proposed methods. Their comments have further motivated us to address the concerns and improve upon the weaknesses identified in their reviews. We are committed to thoroughly addressing these concerns and providing a detailed and constructive response in our rebuttal.
The paper introduces DSH-Bench, a new benchmark for subject-driven text-to-image (T2I) generation. The benchmark addresses long-standing gaps in evaluation by providing (1) a hierarchical taxonomy of 58 categories and 459 subjects, ensuring comprehensive subject diversity, (2) a novel classification scheme for both subject difficulty levels and prompt scenarios, enabling more fine-grained diagnosis of model capabilities, and (3) a new evaluation metric, Subject Identity Consistency Score (SICS), which better aligns with human perception than existing automated metrics while being cost-efficient and scalable. The authors evaluate 15 state-of-the-art personalization models across multiple axes—subject preservation, prompt adherence, and image quality—and uncover nuanced failure modes previously obscured by existing benchmarks.
The strengths of the paper are clear. Reviewers consistently highlighted the completeness and rigor of the benchmark construction, the granularity of evaluation, and the practical utility of SICS. The benchmark provides insights into how models behave under varying subject complexity and prompt scenarios, which is crucial for diagnosing weaknesses in current methods and guiding future research. The authors have also released data, code, and metrics to the community, ensuring strong reproducibility and long-term impact. The empirical evaluation is broad and carefully analyzed, making the paper a valuable resource for both academic and applied research.
The weaknesses raised primarily concerned clarity of presentation, the generalization of SICS, and whether the work’s methodological novelty extends beyond careful engineering. One reviewer suggested that the paper might fit better in the Datasets and Benchmarks track. However, the rebuttal and subsequent discussion convincingly addressed these concerns: the authors clarified the annotation process, provided additional analyses on out-of-distribution datasets to demonstrate generalization, expanded explanations of SICS and its qualitative advantages over GPT-4o, and improved the presentation of results. They also clarified that while benchmark construction involves engineering, their framework introduces systematic methodological innovations—hierarchical sampling, difficulty/scenario classification, and fine-grained automated evaluation—that go beyond prior benchmarks. Most reviewers acknowledged that their concerns were satisfactorily addressed, and several upgraded or maintained positive ratings.
In conclusion, this paper makes a substantial and timely contribution by establishing a comprehensive, scientifically grounded benchmark that will likely become a reference point for evaluating subject-driven T2I generation. The combination of methodological novelty, extensive empirical validation, and clear community benefit elevates it beyond an incremental dataset paper. Given its potential to shape future research directions and improve the robustness of evaluation practices, I recommend acceptance.
"The Program Chairs and DB Chairs decided that the paper cannot be accepted after calibrating its rating with the decision of the DB track papers."