PaperHub
Rating: 5.5 / 10 · Poster · 4 reviewers
Scores: 5, 6, 6, 5 (min 5, max 6, std 0.5)
Confidence: 3.8
Soundness: 2.5 · Contribution: 2.5 · Presentation: 2.3
NeurIPS 2024

Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2025-01-11


Keywords
Layout-to-Image; Diffusion Model; Computer Vision

Reviews and Discussion

Review
Rating: 5

The paper focuses on layout-to-image (L2I) generation in the rich-context scenario, where object descriptions are complex and lengthy. In the framework design, it introduces a novel regional cross-attention module to enhance the representation of layout regions. For the evaluation of open-vocabulary L2I models, the paper proposes two new metrics that assess model performance under rich-context descriptions, validated through a comprehensive user study.

Strengths

--The introduction of a regional cross-attention module is novel, improving the handling of complex layout descriptions compared to traditional self-attention approaches.

--The paper provides rigorous experimental validation, demonstrating that the proposed regional cross-attention module enhances generation performance.

Weaknesses

--This paper presents the GFLOPs of the proposed method. However, the region reorganization and regional cross-attention may affect the real-time throughput. It would be better to also analyze how the proposed module affects the runtime cost.

--This paper targets the scenario of rich-context layout-to-image generation, and the authors collect a rich-context dataset for training. Therefore, I wonder whether the performance gain comes from the constructed rich-context dataset or from the proposed regional cross-attention.

--Meanwhile, the comparison with other baselines may not be fair, since the proposed method uses a different training dataset that is beneficial for rich-context layout-to-image generation. It would be fairer if the compared baselines were also retrained on the same training dataset.

--This paper introduces two new metrics to evaluate object-text alignment and layout fidelity. However, the two metrics are intuitive extensions of existing metrics, which may not be strong enough to claim as a contribution.

--Regarding ablations, it would be better to have visual results to illustrate the effectiveness.

Questions

Please see above weaknesses.

Limitations

The authors have discussed limitations in the supplementary.

Author Response

We thank the reviewer for his/her review. We use W to denote the bullets in the weaknesses section.

Answer to W1: In Rebuttal Section A, we compare the throughput of L2I methods using SD1.5 and SDXL baselines. It is noteworthy that the overall throughput of our method is not significantly hampered. In a typical scenario with 5 objects, the throughput of our method is more than 60% of the throughput of the original backbone model.
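
For reference, a minimal timing harness of the kind used for such a comparison might look as follows (a hypothetical sketch, not the exact script behind Rebuttal Section A; `pipe` and `prompt_kwargs` are placeholders for the model under test):

```python
# Hedged sketch: measure images/second for a diffusion pipeline.
import time
import torch

@torch.no_grad()
def throughput(pipe, prompt_kwargs, n_runs=10, batch_size=1):
    pipe(**prompt_kwargs)                      # warm-up (excludes allocation/compile cost)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        pipe(**prompt_kwargs)
    torch.cuda.synchronize()
    return n_runs * batch_size / (time.time() - start)
```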

Answer to W2 and W3: We argue that conditioning the model on rich-context descriptions requires both a rich-context dataset and a designated conditioning module for complex descriptions. Without a rich-context dataset, the generalizability of the conditioning module from word/phrase-level context to rich-context descriptions can be hindered. Conversely, without a proper conditioning module, the model may perform poorly when handling complex descriptions.

To validate the effectiveness of both the rich-context dataset and the regional cross-attention module, we conducted two additional ablation studies: 1) We retained the regional cross-attention module but replaced the rich-context descriptions with word/phrase-level descriptions obtained using the Recognize Anything Model and GroundingDINO (Ln 213). 2) We replaced the regional cross-attention module with the self-attention modules used in GLIGEN and InstDiff, training them with the rich-context dataset.

The results, presented in Rebuttal Sec D, show that performance drops either when using word/phrase-level descriptions or when removing the regional cross-attention module. This validates the importance of both the rich-context dataset and the regional cross-attention module.

Answer to W4: Our proposed metrics, although they resemble extensions of existing metrics, are appropriately repurposed to offer significant value for the L2I problem: 1) these metrics are specifically designed to quantitatively measure L2I performance in rich-context scenarios, where existing metrics fail to do so (Ln 176-183); 2) in addition to proposing these metrics, our reliability analysis is crucial: it demonstrates the conditions under which these metrics are effective and consistent with human perception. We argue that this analysis should be considered a contribution to the field of evaluation metrics.

Answer to W5: In Rebuttal Sec E, we provide a visual comparison to demonstrate the effectiveness of using region reorganization compared to straightforward feature averaging (the baseline described in Ln 279) when dealing with overlapping objects. The qualitative results validate the importance of our proposed methods, as they produce more accurate generated objects that better align with the designated layouts.

For the ablation study on the box indicator and image resolution, identifying a failure case caused by removing one of these components is difficult, and there may not be a strong sample-wise effect when applying them. Therefore, we recommend referring to the quantitative performance metrics for a more accurate comparison.

Comment

The rebuttal addressed most of my concerns, especially the main concern about the effect of the rich-context dataset and the regional cross-attention module. I would like to increase my recommendation to borderline accept.

Comment

Thank you very much for your updated comment and for considering our rebuttal. We noticed that you mentioned increasing your recommendation to a borderline accept, but it seems the rating has not yet been updated in the system. Could you kindly adjust the score at your convenience?

We greatly appreciate your feedback and support.

Review
Rating: 6

In this work, the authors revisit both the training and evaluation of the layout-to-image synthesis task. They propose regional cross-attention to address the issues of previous works, and they also introduce new evaluation protocols for this task.

Strengths

  • I agree with the authors' discussion on the desired properties for a layout-to-image model and how they achieve these properties. The contribution in terms of evaluation is also acknowledged.
  • Overall, the performance is superior compared to existing models.

Weaknesses

Metrics

  • The authors present a holistic “rethinking” of both training and evaluation; however, the novelty in evaluation seems limited. For instance, using CLIP to compute patchwise similarity is not particularly novel, and although calculating SAM-driven IoU is an interesting approach and might be meaningful in open-vocab-related tasks, it does not feel exceptionally special.

Method

  • I also agree that one critical problem in layout-to-image tasks is the overlapping issue, and I am quite interested in the regional cross-attention. However, despite Table 2 showing improvements with the proposed modules (including regional CA), I find the analysis of these overlapping issues lacking. Similarly, the paper argues for the “desired properties for an effective rich-context layout-conditioning module” (L113) in four items, each of which needs more concrete analysis. Currently, the analysis seems insufficient.
  • The computational requirements for training seem high. How does the performance compare to other methods in terms of training costs, including both resources and data? As it stands, I am not sure whether the improved performance originates from the method itself or from a larger training dataset and greater computational power.

Presentation

  • I recommend that the proposed evaluation methods be described in more detail, possibly in a more formal manner or with additional information, such as in an appendix.
  • Similarly, a more detailed explanation of the regional cross-attention would be beneficial.

Questions

More details of the models and evaluations should be included in the appendix.

Limitations

The authors discussed the limitations, and I agree with them (as I described in the weaknesses as well).

Author Response

We thank the reviewer for his/her review. We use W and Q to denote the bullets in the weaknesses and questions sections.

Answer to W1: Our proposed metrics may not seem exceptionally special; however, we are the first to repurpose these metrics to evaluate L2I performance. They offer significant value for the L2I problem: 1) these metrics are specifically designed to quantitatively measure L2I performance in rich-context scenarios, where existing metrics fail to do so (Ln 176-183); 2) in addition to proposing these metrics, our reliability analysis is crucial: it demonstrates the conditions under which these metrics are effective and consistent with human perception. We argue that this analysis should also be considered a contribution to the field of evaluation metrics.

Answer to W2: In Rebuttal Sec E, we provide a visual comparison to demonstrate the effectiveness of using region reorganization compared to straightforward feature averaging (the baseline described in Ln 279) when dealing with overlapping objects. The qualitative results validate the importance of our proposed methods, as they produce more accurate generated objects that better align with the designated layouts.

Our objective with the four properties outlined in Sec 3.2 is to address the rich-context L2I challenges detailed in Sec 3.1 (Ln 98-110). Specifically, when “Flexibility” is satisfied, the rich-context description can be accurately understood by the model (Ln 99), the “Locality” ensures the objects are positioned correctly within the designated layout box (Ln 103) and “Completeness” guarantees the global consistency in the generated images (Ln 103-104). Finally, “Collectiveness” allows the model to consider and properly represent the interaction of overlapping objects (Ln 109-110).

In practical terms, positioning objects accurately (Locality) and maintaining plausible image quality (Completeness) are fundamental prerequisites for L2I problems. By satisfying the Flexibility, our model can better understand descriptions and generate more satisfactory objects (Figure 1, 5). Additionally, a model that meets Collectiveness can more effectively handle interactions between overlapping objects (Rebuttal Sec E).

We will clarify and elaborate on this information in the revised paper to ensure it is more comprehensible.

Answer to W3: As mentioned in Appendix Section B, our model is trained with an accumulated batch size of 256 for 100K iterations, which is half of the current state-of-the-art (SoTA) L2I method InstDiff, trained with a batch size of 512 for 100K iterations. Additionally, the number of training samples is 2M (after filtering failed downloads in CC3M), which is also half of the number of samples used in InstDiff, which has 5M samples. Overall, the training cost of our work is less than half of the existing L2I methods. Therefore, we believe the performance improvement is not due to larger computational resources or data, but rather from our design. We will include this comparison in the revised appendix.

Answer to W4 and Q: We provide pseudo-code for the proposed two evaluation metrics in Rebuttal Sec C. The CLIP model used is the clip-vit-base-patch32, trained by OpenAI, and the SAM model is the ViT-H checkpoint from the official SAM implementation.
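
As a concrete illustration, a minimal sketch with these checkpoints is given below. It reflects our assumed reading of the two metrics (CropCLIP as the CLIP similarity between each cropped object region and its description; SAMIoU as the IoU between the SAM mask prompted by the layout box and that box); the function names and details are placeholders, and the pseudo-code in Rebuttal Sec C remains the authoritative definition.

```python
# Hedged sketch of the two metrics. Boxes are pixel-space [x0, y0, x1, y1].
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from segment_anything import sam_model_registry, SamPredictor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sam = SamPredictor(sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth"))

def crop_clip(image: Image.Image, box, text):
    # CLIP similarity between the cropped object region and its rich-context description
    inputs = proc(text=[text], images=image.crop(box), return_tensors="pt",
                  padding=True, truncation=True)
    with torch.no_grad():
        out = clip(**inputs)
    i = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((i * t).sum())

def sam_iou(image: Image.Image, box):
    # IoU between the SAM mask prompted with the layout box and the box itself
    sam.set_image(np.array(image.convert("RGB")))
    mask = sam.predict(box=np.array(box), multimask_output=False)[0][0]
    box_mask = np.zeros_like(mask, dtype=bool)
    x0, y0, x1, y1 = map(int, box)
    box_mask[y0:y1, x0:x1] = True
    inter = np.logical_and(mask, box_mask).sum()
    union = np.logical_or(mask, box_mask).sum()
    return inter / max(union, 1)
```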

Answer to W5: For reproducibility, our implementation will be released upon the completion of the submission. In practical terms, we implement region reorganization in the dataloader to minimize computation overhead. This ensures that the reorganized region-text correspondence is shared across all regional cross-attention layers. Consequently, the regional cross-attention layer in the denoising model is reduced to a classical cross-attention operation with a special attention mask that indicates the permissible attention region between visual and textual tokens. Essentially, the attention operation depicted in Figure 2 is conducted by passing a pre-computed attention mask when cross-attending visual and textual tokens. We will enhance the clarity of these details in the revised paper.
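
To make this concrete, the following is a minimal, hypothetical sketch (not the released implementation) of how a pre-computed region mask reduces the regional cross-attention to a standard masked cross-attention between visual and textual tokens; all names and the background handling are placeholders.

```python
# Hypothetical sketch of regional cross-attention via a pre-computed mask.
# boxes: normalized [x0, y0, x1, y1]; tokens_per_obj: text-token count per object.
import torch

def build_region_mask(boxes, tokens_per_obj, h, w):
    # (h*w) x total_text_tokens boolean mask: a visual token may attend to an
    # object's text tokens only if its location falls inside that object's box.
    mask = torch.zeros(h * w, sum(tokens_per_obj), dtype=torch.bool)
    ys = (torch.arange(h).float() + 0.5) / h
    xs = (torch.arange(w).float() + 0.5) / w
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    offset = 0
    for (x0, y0, x1, y1), n in zip(boxes, tokens_per_obj):
        inside = ((gx >= x0) & (gx < x1) & (gy >= y0) & (gy < y1)).flatten()
        mask[inside, offset:offset + n] = True
        offset += n
    return mask

def masked_cross_attention(q, k, v, region_mask):
    # q: B x Nq x d (visual tokens), k/v: B x Nk x d (text tokens)
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~region_mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    # visual tokens covered by no box produce NaN rows; zero them out here
    # (the actual module handles background regions differently)
    return torch.nan_to_num(attn, nan=0.0) @ v
```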

Comment

Thank you for the rebuttal. It addresses most of my concerns so I raise the rating.

Review
Rating: 6

The paper addresses open-set layout-to-image (L2I) generation. It proposes to apply a regional cross-attention module to enrich layout-to-image generation, slightly surpassing existing self-attention-based approaches. The paper also proposes two new metrics to assess L2I performance in open-vocabulary scenarios instead of the previous closed-set settings.

Strengths

  1. Considering the evaluation in open scenarios is meaningful. The proposed two metrics are reasonable, and due to the use of open-source tools such as CLIP and SAM, they are relatively easy to implement.
  2. In the context of rich-context layout-to-image generation, the proposed method shows a notable improvement.

Weaknesses

  1. The proposed regional cross-attention may not be plug-and-play and could be incompatible with existing pre-trained large models.
  2. The related work section on layout-to-image is somewhat brief. Please expand this section to provide a more detailed overview of this field.
  3. In terms of quantitative metrics on COCO, which is relatively closed-set, the improvement of the proposed method over the latest approaches is not particularly significant.
  4. The paper lacks pseudocode for the algorithm.
  5. The paper lacks a diversity analysis of the generated images.

Questions

I will consider increasing the score if the authors can address the following issues.

  1. Since the authors consider the two evaluation metrics as one of the main contributions, detailed information regarding their implementation is necessary to facilitate future work. For example, specifying the model types of SAM and CLIP, among other details, would be helpful. This information can be included in the main text or supplementary materials. Ideally, the evaluation code should be made open source.
  2. The proposed regional cross-attention may not be plug-and-play. Please list the modifications and additional training required to apply it directly to existing pre-trained models. Providing pseudocode for the algorithm would significantly enhance the paper’s reproducibility and reduce the implementation difficulty.
  3. The related work section on layout-to-image is somewhat brief. Please expand this section to provide a more detailed overview of this field. For example, Freestyle Layout-to-Image Synthesis from CVPR 2023 uses attention-based interactions to achieve open-set rich-context layout-to-image generation, which is similar to this paper. Additionally, some closed-set layout-to-image generation methods based on diffusion should also be included. For example, LayoutDiffusion (CVPR 2023) applies more comprehensive evaluations than previous works, using five metrics to assess generation quality, controllability (accuracy), and diversity. As the diversity of generated images is also important, it is recommended to include it.

Limitations

Yes.

Author Response

We thank the reviewer for his/her review. We use W and Q to denote the bullets in the weaknesses and questions sections.

Answer to W1, Q2: Our solution, like GLIGEN and InstDiff, requires inserting additional parameters into the pre-trained model and enhancing L2I ability through training. Indeed, our module is not plug-and-play as it is not training-free. However, it is compatible with existing pre-trained large models. Specifically, our cross-attention module is inserted into the original diffusion model right after each self-attention layer (Ln 204), similar to GLIGEN. This allows our model to be applied to all modern pre-trained diffusion models with similar structures, such as the Stable Diffusion family, DALLE family, DeepFloyd, GLIDE, etc.

For reproducibility, our implementation will be released upon the completion of the submission. In practical terms, we implement region reorganization in the dataloader to minimize computation overhead. This ensures that the reorganized region-text correspondence is shared across all regional cross-attention layers. Consequently, the regional cross-attention layer in the denoising model is reduced to a classical cross-attention operation with a special attention mask that indicates the permissible attention region between visual and textual tokens. Essentially, the attention operation depicted in Figure 2 is conducted by passing a pre-computed attention mask when cross-attending visual and textual tokens. We will enhance the clarity of these details in the revised paper.

Answer to W2, W5, Q3: Per the reviewer's suggestion, we evaluated generation diversity using LPIPS and the Inception Score, as proposed in [A], with results provided in Rebuttal Sec B. Using the same model backbone (SD1.5) as previous works, our method can generate images with greater diversity. We also observed that the diversity of the SDXL-based model is less than that of the SD1.5-based model, which we attribute to the inherent differences between the backbones.
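
A minimal sketch of the LPIPS part of this protocol is shown below (assumed setup: several generations per layout with different seeds, averaged pairwise LPIPS; the Inception Score part is omitted and the names are placeholders).

```python
# Hedged sketch: mean pairwise LPIPS over K generations of the same layout.
import itertools
import lpips
import torch

lpips_fn = lpips.LPIPS(net="alex")  # expects images in [-1, 1], shape B x 3 x H x W

def lpips_diversity(samples: torch.Tensor) -> float:
    # samples: K x 3 x H x W generations for one layout (different random seeds)
    pairs = list(itertools.combinations(range(samples.shape[0]), 2))
    dists = [lpips_fn(samples[i:i + 1], samples[j:j + 1]).item() for i, j in pairs]
    return sum(dists) / len(dists)
```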

Furthermore, while [B] addresses mask-based layouts, our work focuses on box-based layouts. Unlike mask-based layouts, box-based layouts lack dense annotation for every pixel and involve overlapping regions, making it infeasible to directly apply cross-attention-based techniques as in [B]. In the revised paper, we will elaborate on related works and clearly differentiate our approach from previous methods.

Answer to W3: Our method is designed for rich-context L2I, but it does not fall short on simpler description datasets like COCO. Our analysis in Figure 6a indicates that when the description is simple, our performance is on par with existing methods. However, as the description complexity increases, our advantage becomes more pronounced.

Answer to W4, Q1: We provide pseudo-code for the proposed two evaluation metrics in Rebuttal Sec C. The CLIP model used is the clip-vit-base-patch32, trained by OpenAI, and the SAM model is the ViT-H checkpoint from the official SAM implementation.

The above information will be included in the revised version of the paper.

[A] LayoutDiffusion

[B] Freestyle Layout-to-Image Synthesis

Comment
  1. The proposed model can be easily applied to all modern pre-trained diffusion models with similar structures, and the implementation will be released upon the completion of the submission. It is strongly recommended to combine it with popular frameworks such as diffusers.
  2. The rebuttal addressed my main concern about the possible decrease in diversity.

I increased my score from 5 (borderline accept) to 6 (weak accept).

Comment

Thank you for your thoughtful comments and for increasing your score. We greatly appreciate your suggestion regarding the importance of combining our model with popular frameworks such as diffusers. We fully agree with this suggestion and would like to clarify that our implementation is indeed achieved by overriding classes in the diffusers library to ensure compatibility and ease of use with existing pre-trained diffusion models.
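
For illustration, an override of this kind might be sketched as follows. This is a hypothetical, simplified sketch rather than our released code: the processor class, mask shapes, and per-resolution handling are placeholders, and the model id is only an example.

```python
# Hypothetical sketch: hooking a regional mask into a diffusers UNet by
# overriding its attention processors.
import torch
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor2_0

class RegionalAttnProcessor(AttnProcessor2_0):
    def __init__(self, region_mask=None):
        super().__init__()
        self.region_mask = region_mask  # pre-computed in the dataloader

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        # Cross-attention calls receive encoder_hidden_states; substitute the
        # regional mask there (shape handling per resolution is elided).
        if encoder_hidden_states is not None and self.region_mask is not None:
            attention_mask = self.region_mask
        return super().__call__(attn, hidden_states, encoder_hidden_states,
                                attention_mask, **kwargs)

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# In practice, each layer would carry a mask matching its spatial resolution.
pipe.unet.set_attn_processor(RegionalAttnProcessor())
```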

Review
Rating: 5

The paper proposes a layout-to-image generation method based on cross-attention control. It also proposes evaluation metrics for the task.

Strengths

  • The paper highlights the potential effectiveness of cross-attention control and designs a learning framework using this insight.
  • The framework shows some empirical performance gain over prior methods.

Weaknesses

  • The performance improvement compared to baselines is very limited, as shown in Table 1. What is the backbone of the baseline methods? If the backbones of these baselines were swapped to the more powerful SDXL, they could match or surpass the proposed method.
  • Section 3.2 proposes four properties, but there is no discussion of 1) why these properties are a complete description of the desired properties of layout-to-image generation methods, or 2) why they are necessary and, if so, under what scenarios or with what kind of layout or text inputs their absence would lead to undesirable outputs.
  • Lines 125-127 describe a cross-attention-based baseline which could be training-free. The paper claims advantages over this baseline without empirical evidence. Quantitative and qualitative comparisons would help strengthen the claim.

Questions

  • How are "open-set" and "rich-context" defined? In line 95, "open-set" seems to refer to cases where N is not fixed. But even in these cases, an off-the-shelf object detector could be used to compute evaluation metrics. Why would it be impossible to list all classes (line 181)?

Limitations

Limitations are discussed.

Author Response

We thank the reviewer for his/her review. We use W and Q to denote the bullets in the weaknesses and questions sections.

Answer to W1: The backbone of the current SoTA L2I method, InstDiff, is based on SD1.5. Our proposed method has been validated using both SDXL and SD1.5 backbones (Ln 201). Therefore, our comparisons against baselines using the SD1.5 backbone are fair and justified.

While the performance results in Table 1 appear close, it is crucial to consider the following points: 1) Table 1 illustrates the model’s performance across various description complexities, including easy, medium, and hard. However, our model excels when handling complex object descriptions (see Figure 6a), which shows that as the complexity and length of the descriptions increase, our model’s performance advantage over the baselines becomes more pronounced. 2) As noted in lines 290-295, the evaluation configuration in Table 1 does not fully leverage our model’s generation capabilities. Our model’s performance can be further enhanced with higher resolution. We provide the combined results on RC CC3M of Table 1 and Figure 6b in the following table. By examining our model’s performance at higher resolution (Figure 6b), it is clear that our improvements over the baselines are noteworthy.

Methods                 CropCLIP   SAMIoU
GLIGEN (512x512)        25.27      83.64
GLIGEN (768x768)        25.16      83.80
InstDiff (512x512)      28.46      85.59
Ours, SD1.5 (512x512)   28.45      86.04
Ours, SDXL (512x512)    29.42      86.56
Ours, SD1.5 (768x768)   28.94      86.91
Ours, SDXL (768x768)    29.79      88.10
Please note that InstDiff only supports a fixed size of 512 for inference, as we noted in Ln 294-295.

Answer to W2: Our method is not intended to be, as the reviewer suggested, a "complete" L2I solution, and there is potential for future extensions to tackle broader L2I problems. Our objective with the four properties outlined in Sec 3.2 is to address the rich-context L2I challenges detailed in Sec 3.1 (Ln 98-110).

Specifically, when “Flexibility” is satisfied, the rich-context description can be accurately understood by the model (Ln 99), the “Locality” ensures the objects are positioned correctly within the designated layout box (Ln 103) and “Completeness” guarantees the global consistency in the generated images (Ln 103-104). Finally, “Collectiveness” allows the model to consider and properly represent the interaction of overlapping objects (Ln 109-110).

In practical terms, positioning objects accurately (Locality) and maintaining plausible image quality (Completeness) are fundamental prerequisites for L2I problems. By satisfying the Flexibility, our model can better understand descriptions and generate more satisfactory objects (Figure 1, 5). Additionally, a model that meets Collectiveness can more effectively handle interactions between overlapping objects (Rebuttal Sec E).

We will clarify and elaborate on this information in the revised paper to ensure it is more comprehensible.

Answer to W3: As noted in lines 203-204, our method is a training-based approach and is not training-free. We insert the proposed regional cross-attention layers after each self-attention layer in the original diffusion model. Consequently, even if the cross-attention strategy mentioned in Ln 125-127 were applied, the model would still require training because the newly inserted parameters are randomly initialized.

Answer to Q1: “Rich-context” can be considered as an extension of the “open-set” concept. While both “open-set” and “rich-context” scenarios deal with an unlimited number of object classes, the descriptions in the rich-context setting are notably more diverse, complex, and lengthy (Ln 95-97).

We mention "impossible to list all classes" as a limitation of using closed-set detectors for evaluation (see Ln 176-180). In contrast, the limitation of open-set object detectors is that they are designed to handle inputs at the word or phrase level, not the sentence-level descriptions required in the rich-context setting (refer to Ln 182-183).

Comment

I thank the authors for providing high-resolution quantitative results and explaining the performance gap in tasks with different difficulties. The explanation of cross-attention-layer training helps with clarifications. Including training details as suggested by Reviewer 2Dp7 would help further improve paper clarity. I raise my score to borderline acceptance.

Comment

Thank you very much for your updated comment and for considering our rebuttal. We will include more details to improve the clarity of the revised paper.

We noticed that you mentioned increasing your recommendation to a borderline acceptance, but it seems the rating has not yet been updated in the system. Could you kindly adjust the score at your convenience?

We greatly appreciate your feedback and support.

Author Response

The figures, tables, and pseudo-code for the rebuttal are presented in the PDF file. We appreciate the reviewers taking the time to read and consider them.

Final Decision

The paper presents a layout-to-image (L2I) generation method focusing on rich-context scenarios where object descriptions are complex and lengthy. The authors introduce a regional cross-attention module designed to enhance the representation of layout regions, addressing challenges related to overlapping objects and complex descriptions. Additionally, the paper proposes two new evaluation metrics aimed at assessing model performance in open-vocabulary L2I tasks, validated through a comprehensive user study.

The reviewers in general acknowledge the novelty of the proposed method and evaluation metric, and are satisfied with the experimental thoroughness. The reviewers have largely been satisfied with the authors' rebuttals, and the scores have been raised accordingly, indicating that the concerns have been adequately addressed. However, the authors should take into account the reviewers' suggestions to improve the clarity and accessibility of the paper.

The AC agrees with the reviewers to accept the submission.