PaperHub
Overall rating: 5.8/10 (Poster; 4 reviewers; min 5, max 6, std 0.4)
Individual ratings: 6, 5, 6, 6
Confidence: 3.8 · Correctness: 2.8 · Contribution: 3.0 · Presentation: 3.3
NeurIPS 2024

Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models

OpenReview · PDF
Submitted: 2024-05-10 · Updated: 2024-11-06
TL;DR

MuDI enables multi-subject personalization by effectively decoupling identities from multiple subjects.

Abstract

Keywords
Text-to-Image Diffusion Models, Multi-subject Personalization

Reviews and Discussion

Review (Rating: 6)

MuDI is a novel framework designed for multi-subject personalization in text-to-image diffusion models. It effectively decouples identities of multiple subjects, using segmented subjects from a foundation model for data augmentation and generation initialization. A new metric is introduced to evaluate multi-subject personalization. Experimental results show MuDI produces high-quality personalized images without identity mixing, outperforming existing methods.

Strengths

  • This paper is well-written and easy to follow.
  • The experimental results are comprehensive and sufficiently support the claims made in the paper.
  • MuDI successfully prevents identity mixing when generating multiple subjects, maintaining clear individual identities.
  • Introduction of a new metric provides a better assessment of multi-subject personalization performance.

Weaknesses

  • Inheriting from DreamBooth, MuDI requires test-time fine-tuning for each set of concepts, which might hinder its practical application. In contrast, some approaches like IP-Adapter can achieve customization in a tuning-free manner.
  • There is a spelling error on line 796. Additionally, there is a miscitation on line 690, where FastComposer[51] is incorrectly referred to as a "single-subject personalization" method; it is actually a zero-shot multi-subject customization method.

Questions

  • During training, MuDI sets the background to white, which might reduce text alignment in the generated images. However, in Figure 6, MuDI does not seem to be affected by this. Could the authors provide a possible explanation for this discrepancy?
  • The proposed strategy of modifying the initial noise during inference appears to be very effective. I am curious whether DreamBooth with Region Control, if applied with the same inference strategy, would achieve better performance than MuDI.

Limitations

No negative societal impact.

Author Response

We sincerely thank you for your time and effort in reviewing our paper. We appreciate your positive comments on the following points:

  • Well-written paper
  • Comprehensive experiments
  • Success in multi-subject personalization
  • Good proposed metric

We address your concerns below.


C1. Test-time fine-tuning method

We agree that test-time fine-tuning methods like DreamBooth may face limitations in practical application. However, existing tuning-free methods have only shown effectiveness within specific domains, such as human subjects [1, 2], and often still require additional fine-tuning to achieve levels of subject personalization comparable to test-time fine-tuning methods [3].

Given that our research aims to mitigate identity mixing across a broad range of subjects while maintaining high subject fidelity, adopting a tuning-free approach would not meet our objectives. This is evidenced by Figure R4 in the uploaded PDF, where FastComposer completely fails to personalize Corgi and Chow Chow (animals). This limitation of FastComposer stems from its design, which specifically targets human subjects.

[1] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
[2] FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
[3] Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models


C2. White background

This is an important question regarding overfitting. As you pointed out, our method does not reduce text alignment, as indicated by the highest text fidelity metrics (ImageReward and CLIP score). This success is due to our use of the prompt “A photo of [V1] and [V2], simple background” during training, with segmented subjects composed on a white background. This approach effectively disentangles the background from the identities through the text “simple background”, preventing overfitting.

For additional supporting results, we have included examples demonstrating that MuDI can generate diverse backgrounds and styled images aligned with the texts in Figure R3 of the uploaded PDF.


C3. Inference initialization for region control

Thank you for the interesting question. To answer your question, we applied our inference strategy to DreamBooth+region control. As shown in Figure R5(Left) of the uploaded PDF, this combination improves the success rate for less similar subjects (e.g., monster toy and can) but still results in mixed identities for similar subjects, such as two teddy bears. In all cases, our MuDI method achieves a higher success rate and greater multi-subject fidelity compared to inference initialization with DreamBooth+region control, as demonstrated in Figure R5(Right) of the uploaded PDF.
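For concreteness, below is a minimal sketch of the kind of mean-shifted noise initialization referred to here (an illustrative simplification, not the exact implementation; the function name, the latent shape, and the `gamma` scale are assumptions): the initial Gaussian noise is shifted toward a latent composite of the segmented subjects so that denoising starts from a layout-aware point.

```python
import numpy as np

def mean_shifted_init(seg_composite_latent, gamma=0.3, seed=None):
    """Minimal sketch: shift standard Gaussian noise toward a latent
    composite of the segmented subjects, so denoising starts from a
    point that already carries a coarse layout/appearance prior.

    seg_composite_latent: latent encoding of the segmented-subject
        composite, zero outside the subject regions (shape illustrative,
        e.g. (4, 128, 128)).
    gamma: assumed shift strength (hyperparameter, not from the paper).
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(seg_composite_latent.shape).astype(np.float32)
    return noise + gamma * seg_composite_latent
```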


C4. Typo

Thank you for pointing this out; we will correct these mistakes in the final revision.

Review (Rating: 5)

This work introduces a training and inference pipeline and an evaluation benchmark for multi-concept customization of text-to-image diffusion models. Specifically, the pipeline comprises a Seg-Mix training stage, which can be viewed as a data augmentation trick to prevent the fine-tuned model from learning mixed attributes, and a mean-shifted noise initialization for the very first step of the denoising process, aiming to inject appearance and layout priors. The evaluation benchmark addresses the shortcomings of previous metrics in evaluating the degree of disentanglement between customized subjects. The benchmark utilizes a Detect-and-Compare workflow that considers both the similarity to the same subject and the dissimilarity between two distinct subjects.

Strengths

  1. The paper is well-written and easy to follow. I didn’t encounter many confusing expressions during my first reading.
  2. The introduced methods are intuitively effective, and the design of the evaluation benchmark is convincing for evaluating the extent of disentanglement.
  3. This work introduces two interesting additional uses: size control and modular customization, which are helpful for improving the application scenarios for customization.

Weaknesses

  1. Though the method is intuitively effective, as I stated in Strength #2, the novelty is limited since some designs have been experimentally verified by previous methods, such as the utilization of descriptive class [1] and a data-augmentation pipeline for multi-concept customization [2].
  2. Though the proposed method seems reasonable for combining distinct subjects with different region locations, like a cat and a dog sitting beside each other, it may be challenged in scenarios where the subjects have rich semantic interactions, like a person wearing glasses. This limitation is induced by the region-control designs, where both augmented training and the initialization during inference pose strong regularization for decoupling two instances. This limitation makes this method difficult for general multi-concept customization.

[1] InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

[2] SVDiff: Compact Parameter Space for Diffusion Fine-Tuning

Questions

Seg-Mix uses a white background and a prompt organized as “…, simple background”. My question is whether this design poses the risk of outputting more images with white backgrounds even when using edited prompts. This seems inevitable, much like how Cut-Mix inherits the stitching artifacts from the training data.

Limitations

See Weakness#2

Author Response

We sincerely thank you for your time and effort in reviewing our paper. We appreciate your positive comments on the following points:

  • Well-written paper
  • Effective methods
  • Convincing evaluation benchmark
  • Interesting applications

We address your concerns below.


C1. Novelty of using descriptive classes

We would like to clarify that our method uses descriptive classes for a different purpose. Our approach utilizes these classes to improve the separation of similar subjects, whereas the prior work employs them for the personalization of rare subjects [1]. We are the first to demonstrate that descriptive classes are crucial for distinguishing between identities of highly similar subjects, which allows for the personalization of 11 dogs and cats using a single trained LoRA, as shown in Figure 35 of the Appendix. Additionally, we utilize LLMs to automatically generate descriptive classes, in contrast to the manual selection used in the prior work.

[1] InstructBooth: Instruction-following Personalized Text-to-Image Generation
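As a purely illustrative sketch of how an LLM could be queried for such descriptive classes (the prompt wording and the `query_llm` helper below are hypothetical placeholders, not the actual pipeline):

```python
def build_descriptive_class_prompt(subject_caption):
    """Hypothetical prompt asking an LLM for a visually descriptive class
    name that helps separate highly similar subjects."""
    return (
        "Given the following description of a subject, return a short, "
        "visually descriptive class name (e.g., 'fluffy white Maltese dog' "
        "rather than just 'dog').\n"
        f"Subject: {subject_caption}\n"
        "Class:"
    )

# Usage (query_llm stands in for any chat-completion API):
# descriptive_class = query_llm(build_descriptive_class_prompt("a small tan Corgi"))
```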


C2. Novelty of Seg-Mix

We would like to note that the proposed Seg-Mix is a different augmentation method from Cut-Mix [2]. Cut-Mix generates augmented images by simply stitching two cut images side by side, which inevitably results in unnatural vertical lines and still suffers from identity mixing (Figure 2 of the main paper).

On the other hand, our Seg-Mix generates augmented images by composing the segmented subjects with the identity-irrelevant information removed. This approach effectively reduces unnatural artifacts and prevents identity mixing even for highly similar subjects.

Specifically, Seg-Mix allows the subjects to overlap, resulting in natural interaction between the subjects, which we experimentally validate in Section B.2 and Figure 18 of the Appendix. Furthermore, our Seg-Mix enables control of the relative size of the subjects (Figures 9(a) and 31 of the main paper) by composing the scaled segmented subjects, which cannot be done with Cut-Mix.

We would also like to emphasize that the novelty of our work is not limited to Seg-Mix, as we also present a novel initialization method, a new metric, and a new dataset.

[2] SVDiff: Compact Parameter Space for Diffusion Fine-Tuning
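To make the composition step concrete, the following is a minimal Seg-Mix-style sketch under simplifying assumptions (RGBA cut-outs, a plain white canvas, and uniform random scales and positions are illustrative choices rather than the exact implementation):

```python
import random
from PIL import Image

def seg_mix(subject_cutouts, canvas_size=(1024, 1024), scale_range=(0.4, 0.8)):
    """Illustrative Seg-Mix-style augmentation: composite RGBA subject
    cut-outs (background already removed) onto a white canvas at random
    scales and positions; overlap between subjects is allowed."""
    canvas = Image.new("RGB", canvas_size, "white")
    for cutout in subject_cutouts:
        scale = random.uniform(*scale_range)
        w, h = int(cutout.width * scale), int(cutout.height * scale)
        resized = cutout.resize((w, h))
        x = random.randint(0, max(0, canvas_size[0] - w))
        y = random.randint(0, max(0, canvas_size[1] - h))
        # The alpha channel keeps only the subject pixels.
        canvas.paste(resized, (x, y), resized)
    return canvas  # paired with a prompt like "A photo of [V1] and [V2], simple background"
```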


C3. Rich semantic interactions

Thank you for the great feedback. To address your concerns about rich semantic interactions, we have included additional experiments in the uploaded PDF. Figure R2 demonstrates that our framework can generate subjects with rich semantic interactions by leveraging a prompt-aligned layout for initialization. Initializing with a layout that aligns with the prompt enables the generation of images such as a teddy bear wearing sunglasses or a toy riding a Corgi, without any identity mixing. We also note that such prompt-aligned layouts can be automatically generated using LLMs (see Appendix B.5 for more details).
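For illustration only, such a prompt-aligned layout could be represented as normalized bounding boxes per subject; the format and values below are assumptions, not the exact representation used in the paper:

```python
# Hypothetical prompt-aligned layout for "a [V1] teddy bear wearing [V2] sunglasses":
# normalized [x0, y0, x1, y1] boxes, with the sunglasses box overlapping the bear's head.
layout = {
    "[V1] teddy bear": [0.25, 0.20, 0.75, 0.95],
    "[V2] sunglasses": [0.38, 0.25, 0.62, 0.40],
}
# Each segmented reference would be resized into its box before being used
# for the layout-aware initialization.
```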


C4. White background

This is an important question regarding overfitting to the training data. Unlike Cut-Mix, our Seg-Mix is not biased toward white backgrounds and is capable of generating diverse backgrounds when using edited prompts. This success is due to our use of the prompt “A photo of [V1] and [V2], simple background” during training, with segmented subjects composed on a white background. This approach effectively disentangles the background from the identities through the text “simple background”, preventing overfitting.

For additional supporting results, we have included examples demonstrating that MuDI can generate diverse backgrounds and styled images aligned with the texts in Figure R3 of the uploaded PDF.

Comment

Thank you for the explanation provided by the authors. I have reviewed the rebuttal materials, and most of my concerns have been addressed. However, I still have reservations regarding the novelty of the descriptive classes, so I have decided to maintain my original rating.

Comment

Thank you for the response; we are happy to hear that most of the concerns have been addressed.

We would like to clarify that leveraging descriptive classes is one of the many contributions of our work, and that Reviewers JDvE, amUB, and ViAz have acknowledged that our work is a novel approach for multi-subject personalization.

We have effectively addressed identity mixing for multi-subject personalization by introducing a new data augmentation method (Seg-Mix) and a novel inference initialization method. Moreover, we propose a new metric for evaluating multi-subject fidelity.

We hope the reviewer would kindly consider a fresh evaluation of the novelty of our work.

Best, Authors

Review (Rating: 6)

The paper introduces MuDI, a novel framework designed to improve multi-subject personalization in text-to-image diffusion models. Unlike current methods that often mix identities and attributes from different subjects when generating multiple subjects simultaneously, MuDI effectively decouples these identities. The framework employs segmented subjects generated by a segmentation foundation model (Segment Anything) for both training and inference: as data augmentation during training and as initialization for the generation process. Additionally, the authors introduce a new metric to better evaluate the performance of their method in multi-subject personalization.

Strengths

  1. The topic is interesting and the work has good novelty.
  2. The presentation is good and the results look promising.

Weaknesses

The dataset is small, and more analysis is needed. Please see the detailed questions below.

Questions

  1. The dataset provided in the paper is small and monolithic in style, which does not adequately illustrate the superiority of the method. Could the authors conduct more experiments to clarify this concern?
  2. I'm concerned about whether image content, including resolution, style, and other factors, impacts performance. Could the authors clarify this point?
  3. Relying solely on the proposed metric cannot demonstrate the model's superiority. Could the authors include comparisons with other existing metrics to further validate the model's performance?
  4. The qualitative comparison in Section 5.2 lacks comprehensiveness. Displaying a few examples is not enough to ensure credibility. It is recommended to include quantitative comparisons, such as the proportion of high-quality generated results and other quantifiable metrics.
  5. The paper lacks detailed explanations of the important evaluation metrics for multi-subject fidelity and text fidelity, making it difficult for readers to understand their specific practical significance and thus grasp the model's advantages. It would be helpful if the authors could provide more detailed explanations.

Limitations

NA

Author Response

We sincerely thank you for your time and effort in reviewing our paper. We appreciate your positive comments on the following points:

  • Interesting topic and novelty
  • Good presentation

We address your concerns below.


C1. Small and monolithic style dataset

To address your concerns regarding style, we have included the results of personalizing cartoon characters in Figure R1 of the uploaded PDF. As demonstrated, our MuDI model successfully generates distinctive characters, whereas DreamBooth does not perform as well. We will include more experiments on diverse styles in the final revision.

We would also like to note that our dataset consists of similar subject combinations from the benchmark datasets, such as DreamBench [1] and CustomConcept101 [2], covering a wide range of categories from animals to objects and scenes. Additionally, we have demonstrated that our method can effectively personalize animation characters together in Figures 1 and 36 of the main paper.

[1] DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
[2] Multi-Concept Customization of Text-to-Image Diffusion


C2. Regarding image content

Thank you for pointing this out. As discussed in Section D of the Appendix, the performance of MuDI is influenced by several factors, including the number and similarity of subjects. For example, we observe a reduced success rate in identity decoupling among highly similar subjects, e.g., the two teddy bears in Figure 37(a) of the Appendix. In the human evaluation, the average success rate of MuDI across all categories in the dataset is 70%, which decreases to 32% for the two teddy bears.

However, we also remark that these challenges are not unique to our approach but are a common issue in prior research as well. Regarding the other factors mentioned, our extensive experiments with diverse image contents and examples of Sora confirm that these factors do not impact our identity decoupling performance. We will include a detailed discussion of other limitations in the final revision.


C3. Comparison with existing metrics

Thank you for your suggestion. In the table below, we provide a quantitative comparison with CLIP-I, DreamSim, and DINOv2, where MuDI outperforms the baselines on all metrics.

Method              CLIP-I   DreamSim   DINOv2
Textual Inversion   0.664    0.403      0.341
DreamBooth (DB)     0.711    0.479      0.434
DB + Region         0.731    0.508      0.462
Cut-Mix             0.724    0.510      0.475
Ours                0.738    0.529      0.486

We would like to note that, unlike existing metrics, which show a low correlation with human evaluations, our metric demonstrates a high correlation, as evidenced in the tables of Figures 4 and 15 of the main paper. Consequently, using our metric provides a more reliable basis for validation.
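For readers unfamiliar with the Detect-and-Compare idea, the sketch below shows one way such a score could combine same-subject similarity with cross-subject dissimilarity; the embedding model, the detector, and the exact combination rule are assumptions and may differ from the actual formulation in the paper.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_and_compare(detected_embs, ref_embs):
    """Sketch of a Detect-and-Compare style multi-subject fidelity score.

    detected_embs: {subject name: embedding of the subject crop detected
        in the generated image} (e.g., features of detector crops).
    ref_embs: {subject name: embedding of that subject's reference image}.
    Rewards similarity to the matching reference and penalizes similarity
    to the other subjects' references (identity mixing).
    """
    scores = []
    for name, emb in detected_embs.items():
        same = cosine(emb, ref_embs[name])
        others = [cosine(emb, ref) for n, ref in ref_embs.items() if n != name]
        dissim = 1.0 - max(others) if others else 1.0
        scores.append(same * dissim)  # assumed combination; the actual metric may differ
    return float(np.mean(scores))
```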


C4. More quantitative comparisons

We would like to clarify that we provide comprehensive quantitative comparisons of our method against the baselines through human evaluation and analysis:

  • Human evaluation in Figure 6 of the main paper was conducted using a total of 2000 generated images. It shows that MuDI achieves significantly higher success rates and is preferred over Cut-Mix in more than 70% of comparisons.
  • Success rates with respect to the number of subjects in Figure 8(b) of the main paper show that the baselines completely fail at personalizing three subjects, while ours shows over 50% success even with four subjects.

We also visualize uncurated samples in Figure 16 of the main paper to ensure credibility.


C5. Explanations of evaluation metrics.

Thank you for your suggestion; we agree that further explanation of the metrics would help readers, and we will add more details in the final revision.

Review (Rating: 6)

The paper proposes MuDI, a novel method for generating images with multiple personalized subjects. By leveraging segmented subjects from reference images for both training and inference, MuDI effectively addresses the challenge of identity mixing in multi-subject image generation. Key contributions include a new data augmentation technique (Seg-Mix) and a new evaluation metric for multi-subject fidelity.

Strengths

  • The paper introduces a novel approach to multi-subject image generation by leveraging segmented subjects for both training and inference, effectively decoupling subject identities. This represents a creative combination of existing techniques in image segmentation and text-to-image generation.
  • The paper is well-structured and clearly presented, with a solid experimental evaluation demonstrating the effectiveness of the proposed method.
  • The authors effectively communicate the problem, the proposed solution, and the experimental results. The paper is well-organized and easy to follow.
  • By addressing the critical challenge of identity mixing in multi-subject image generation, the paper offers a valuable contribution to the field. The proposed method has the potential to significantly impact applications requiring the generation of multiple distinct subjects within a single image.

Weaknesses

  • While the paper presents a comparative analysis with existing methods, a more comprehensive evaluation against a wider range of baselines, including recent advancements in image generation and personalization, would strengthen the paper's claims. It would be essential to compare with methods like PortraitBooth (CVPR 2024) and FastComposer.
  • Additionally, exploring different evaluation metrics beyond the proposed D&C metric could provide a more holistic assessment of the method's performance.
  • The paper lacks sufficient details about the dataset used for training and evaluation. A more in-depth description of the dataset, including its size and diversity, would enhance the reproducibility of the work.
  • Although the paper includes some ablation studies, a more comprehensive analysis of the impact of different components of the proposed method (e.g., Seg-Mix, initialization, descriptive class) on the overall performance would provide deeper insights into the method's effectiveness.
  • While the paper acknowledges the limitations of existing methods, a more thorough discussion of the potential limitations of the proposed MuDI method, such as its sensitivity to image complexity or its performance on highly similar subjects, would strengthen the paper's overall contribution. It would also be useful to study how the size of the objects composed using Seg-Mix during training affects the model, i.e., whether it leads to any size biases in the model.

Questions

  • Could the authors provide more details about the dataset used for training and evaluation, including its size, diversity, and collection process? Additionally, a more in-depth description of the evaluation metrics and experimental setup would enhance reproducibility.
  • How does MuDI compare to other state-of-the-art methods that focus on image composition or layout control for multi-subject image generation, specifically PortraitBooth and FastComposer?

Limitations

  • Could the authors elaborate on the limitations of MuDI, such as its performance on highly similar subjects or complex scenes?
Author Response

We sincerely thank you for your time and effort in reviewing our paper. We appreciate your positive comments on the following points:

  • Novel approach
  • Clear paper with solid experiments
  • Easy to follow
  • Contribution to the field

We address your concerns below.


C1. Comparison with FastComposer

Following your suggestion, we compared our method with FastComposer in Figure R4 (see our uploaded PDF). FastComposer fails to personalize the Corgi and Chow Chow (animals), whereas our method successfully generates distinct objects. This limitation of FastComposer [1] (and similarly, PortraitBooth [2]) stems from their design, which specifically targets human subjects. For our comparisons, we utilized the open-sourced weights from the official implementation of FastComposer. We did not include PortraitBooth in our analysis as its code and weights are not yet available.

[1] FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
[2] PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization


C2. Comparison with layout control

We provided a comparison with the layout conditioning methods in Section B.12 of our Appendix. These methods (Cones 2 [3], Mix-of-Show [4]) often result in missing subjects, identity mixing, and low subject fidelity, as shown in Figure 29 of the Appendix. Furthermore, as shown in Table 3 of the Appendix, our MuDI outperforms the layout conditioning methods on both multi-subject fidelity metrics and text fidelity metrics.

[3] Cones 2: Customizable Image Synthesis with Multiple Subjects
[4] Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models


C3. Different evaluation metrics

Thank you for your suggestion. In the table below, we provide a quantitative comparison with CLIP-I, DreamSim, and DINOv2, where MuDI outperforms the baselines on all metrics.

Method              CLIP-I   DreamSim   DINOv2
Textual Inversion   0.664    0.403      0.341
DreamBooth (DB)     0.711    0.479      0.434
DB + Region         0.731    0.508      0.462
Cut-Mix             0.724    0.510      0.475
Ours                0.738    0.529      0.486

We would like to note that, unlike existing metrics, which show a low correlation with human evaluations, our metric demonstrates a high correlation, as evidenced in the tables of Figures 4 and 15 of the main paper. Consequently, using our metric provides a more reliable basis for validation.


C4. Dataset details

Thank you for your suggestion. For better reproducibility, we will open-source all datasets, training codes, and checkpoints. We provided some details in Appendix A.2 and Figure 11 in the main paper, such as where we collect training images and prompts. We will add more details in the final revision.


C5. More ablation studies

Thank you for your suggestion. We believe that we provided comprehensive ablation studies on each component in Section 5.3, Figure 7, and Table 1 of the main paper, demonstrating the necessity of Seg-Mix, our inference initialization, and descriptive classes for preventing identity mixing. We further conducted an ablation study on the number of subjects in Figure 8 of the main paper, showing that our MuDI can personalize even five subjects while previous methods completely fail. We will try to add ablation studies on all combinations of the components in the final revision.


C6. Potential limitations

Thank you for your suggestion. For MuDI, we observe a reduced success rate in identity decoupling among highly similar subjects, for example, the two teddy bears in Figure 37(a) of the Appendix. In the human evaluation, the average success rate of MuDI across all categories in the dataset is 70%, which decreases to 32% for the two teddy bears. We also remark that the existing baseline methods completely fail in such cases, indicating that this challenge is not unique to our approach but is a common issue in prior research as well. We will include a detailed discussion of other limitations in the final revision.

Author Response (to All Reviewers)

Dear Reviewers,

We sincerely thank you for reviewing our paper and for the insightful comments and valuable feedback. We appreciate the positive comments that emphasize the novelty of our work and the advantages of our proposed method:

  • Novel approach (JDvE, amUB)
  • Well-written and easy to follow (JDvE, amUB, ivq5, ViAz)
  • Convincing evaluation benchmark (ivq5, ViAz)
  • Solid experiments (JDvE, ViAz) and interesting applications (ivq5)

We uploaded a PDF file that includes 5 figures:

  • Figure R1: MuDI with cartoon style character
  • Figure R2: Examples of rich semantic interaction between subjects
  • Figure R3: Examples of diverse backgrounds and styles
  • Figure R4: Results of FastComposer on our dataset
  • Figure R5: Experiments on using our initialization with region-controlled DreamBooth

Thank you again for your thorough review and thoughtful suggestions. We hope our responses and the clarifications have addressed any remaining questions, and we are willing to address any further inquiries you may have.

Yours sincerely,

Authors

Final Decision

Post rebuttal, all reviewers have reached a consensus on accepting the paper, with their primary concerns adequately addressed. After reviewing all the materials, I concur that the paper makes a valuable contribution to multi-subject image generation, and I recommend acceptance. Please ensure that the necessary revisions are incorporated into the final version.