PaperHub

Overall: 6.0 / 10 · Poster · 3 reviewers (min 5, max 7, std 0.8)
Ratings: 5, 6, 7 · Confidence: 4.7
Correctness: 2.7 · Contribution: 3.0 · Presentation: 2.7
NeurIPS 2024

HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

Keywords
Text to Image · Latent Diffusion Model · Layout to Image

Reviews and Discussion

Official Review
Rating: 5

This work introduces a ControlNet-based conditioning method that enables pre-trained diffusion models to be layout-conditioned. The conditioning model processes each object in parallel and fuses them as residual features for the pre-trained diffusion model. In addition to the commonly used COCO dataset for evaluation, the authors also introduce a new dataset, HiCo-7K, a fine-grained benchmark curated from the GRIT-20M dataset. The generation results look good even when there are many objects, and empirical results suggest that the model is backbone-agnostic.

Justification: Although the method itself does not exhibit a high degree of novelty, there are notable contributions in dataset curation. However, the dataset part is not well elaborated, and the experiments lack comparison with cutting-edge methods.

Strengths

  • The results validate that the proposed method can be applied to multiple pre-trained diffusion models, demonstrating its backbone-agnostic nature.
  • Qualitative results are good, especially for layouts with many objects.
  • The authors propose a new fine-grained dataset named HiCo-7K.

Weaknesses

  • The custom dataset, HiCo-7K, is an important contribution but has not been elaborated on sufficiently. The paper only provides information about the total number of images and the average number of objects. Details such as how the filtering was conducted and the criteria for manual cleaning should be included.
  • Table 1 does not include a comparison with GLIGEN, and the SoTA method InstDiff [A], whose code was released several months before the NeurIPS submission deadline, is also missing. Both methods can be applied to COCO and thus should be compared.
  • The local CLIP score in Table 3 being higher than the ground truth's suggests this metric may be unreliable. (This is a minor issue and my rating is not affected by it; the authors can skip this point during the rebuttal if short on space.)
  • The fuse layer is not explained in enough detail. While the sum and average cases are straightforward, the mask case requires more explanation, particularly regarding how features of overlapping objects are merged.

A. Wang, Xudong, et al. "Instancediffusion: Instance-level control for image generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Questions

My major concerns are Weaknesses 1, 2, and 4, and I would like the authors to address them. My final rating is subject to change based on the authors' feedback.

Limitations

The authors have discussed limitations and social impact.

Author Response

We thank you for your affirmation and constructive comments. We address each of your comments below; additional figures and experimental results can be found in the Author Rebuttal PDF.

Weakness 1.

We have detailed the construction pipeline of the custom dataset HiCo-7K in Fig. 3 of the Rebuttal PDF. We found that GRIT-20M has some issues, such as a low labeling rate for multiple targets sharing the same description, and target descriptions derived solely from the original captions. In contrast to GRIT-20M, the HiCo-7K pipeline is as follows.

1. Noun phrase extraction. We use spaCy to extract noun phrases from the captions and an LLM-based VQA model to remove abstract noun phrases. Meanwhile, we use the GroundingDINO model to extract richer phrase expressions.

2. Noun phrase grounding. We use the GroundingDINO model to obtain the bounding boxes, which we then clean using NMS and CLIP-based filtering.

3. Manual correction. To address missed detections when multiple targets in an image share the same description, manual correction is employed to further improve the labeling rate of similar targets.

4. Multi-caption bounding boxes. We expand the basic text from the original captions and use GPT-4 to re-caption the target regions. HiCo-7K contains 7,000 expression/bounding-box pairs with both referring expressions and GPT-generated expressions.
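For concreteness, the NMS-based bounding-box cleaning in step 2 can be sketched as follows. This is an illustrative re-implementation under our own assumptions (greedy NMS with an assumed IoU threshold of 0.5), not the authors' actual pipeline code.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop any remaining box overlapping it above iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

In the actual pipeline, the surviving boxes would additionally be scored against their phrases with CLIP before final acceptance.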

Weakness 2.

We conducted comparative evaluations of the mentioned methods on COCO-3K and HiCo-7K, with detailed results shown in Tables 1 to 3 of the Rebuttal PDF. We also tested the inference time and GPU resource utilization of the different models under the same conditions, as shown in Fig. 2 of the Rebuttal PDF. Because COCO images and DM-generated images differ in quality and distribution, the grounding model and CLIP are better suited to COCO-3K evaluation.

Table 3 shows that HiCo outperforms methods such as InstanceDiffusion in image quality and layout controllability on HiCo-7K. On the COCO-3K dataset, however, our controllability is somewhat reduced.

The reason is that our model was trained on 1.2M fine-grained long captions, which are out-of-distribution for COCO data. HiCo has significant advantages over the other methods in inference time and GPU usage, with detailed results shown in Fig. 2 of the Rebuttal PDF.

Weakness 3.

On the Local CLIP Score metric, the HiCo model performs slightly better than the ground truth, primarily for the following reasons:

1. HiCo generates images that are clearer and more visually appealing.

2. HiCo produces clearer boundaries when generating overlapping target regions.

3. The ground-truth dataset contains instances of adjacent identical targets that neither the algorithms nor the human annotators have fully resolved, leading to inaccuracies in the Grounding-DINO detection results.

In summary, both the quality of the generated images and the accuracy of the detection model influence the Local CLIP Score.
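For reference, a local CLIP score of the kind discussed here is typically computed by cropping each box region and scoring it against its phrase. The sketch below is purely illustrative: `embed_image` and `embed_text` are hypothetical placeholder functions standing in for a real CLIP encoder, and their outputs are only unit vectors of matching dimension.

```python
import numpy as np

def embed_image(crop):
    # Placeholder for a real CLIP image encoder; returns a unit vector.
    v = np.asarray(crop, dtype=float).mean(axis=(0, 1))
    return v / (np.linalg.norm(v) + 1e-8)

def embed_text(phrase):
    # Placeholder for a real CLIP text encoder (seeded toy embedding).
    rng = np.random.default_rng(abs(hash(phrase)) % (2 ** 32))
    v = rng.standard_normal(3)
    return v / np.linalg.norm(v)

def local_clip_score(image, regions):
    """Mean cosine similarity between each bbox crop and its phrase.
    image: HxWx3 array; regions: list of ((x1, y1, x2, y2), phrase)."""
    sims = []
    for (x1, y1, x2, y2), phrase in regions:
        crop = image[y1:y2, x1:x2]
        sims.append(float(embed_image(crop) @ embed_text(phrase)))
    return sum(sims) / len(sims)
```

The structure makes the discussion's point concrete: if the boxes themselves are inaccurate (point 3 above), the crops, and hence the score, are affected regardless of image quality.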

Weakness 4.

We designed a mask fusion method that not only decouples the different targets from the background, but also provides a prerequisite for further image editing.

The main operation of mask fusion, given in Equation 1 of the Rebuttal PDF, is to multiply the features of each branch with the mask of its corresponding region and then fuse them by summation.

For overlapping or occluded targets, however, the current mask fusion method simply adds the features together. The occlusion order of overlapping targets can only be specified via the global text prompt, e.g., "bowl in front of vase", as illustrated in Fig. 1(c) and Fig. 1(d) of the Rebuttal PDF. The current version of HiCo indeed lacks a more explicit mechanism for controlling occlusion order; we have identified this as future work.
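The described mask fusion (multiply each branch's features by its region mask, then sum, with overlaps adding directly) can be sketched at the tensor level as follows. Shapes and the NumPy formulation are our own assumptions for illustration; the real model operates on UNet feature maps.

```python
import numpy as np

def mask_fuse(branch_feats, masks):
    """Fuse per-branch features by masked summation.

    branch_feats: list of (C, H, W) feature maps, one per branch.
    masks: list of (H, W) binary region masks (the background branch's
           mask may be all ones). Overlapping regions fuse by plain
           summation, as described in the response above.
    """
    fused = np.zeros_like(branch_feats[0])
    for feat, mask in zip(branch_feats, masks):
        fused += feat * mask[None, :, :]  # broadcast mask over channels
    return fused
```

Note how the summation makes overlap handling implicit: where two masks are both 1, the fused value is the sum of both branches' features, with no notion of occlusion order.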

Comment

The reviewer thanks the authors for their feedback and is satisfied with the responses to Weaknesses 1 and 4.

Although the rebuttal results suggest the performance may not be superior to the current SoTA in every scenario, the proposed method does show improvement in certain aspects.

Considering the contribution of the proposed dataset, the reviewer raises the score to 5.

Official Review
Rating: 6

This paper proposes HiCo, a Hierarchical Controllable diffusion model for layout-to-image generation. HiCo-Net is a multi-branch structure introduced to hierarchically generate the global background and the foreground instances for different layout regions. The authors further evaluate multi-object controllable layout generation in natural scenes and introduce the HiCo-7K benchmark.

Strengths

1. The paper is clearly written and easy to follow.

2. The HiCo model achieves spatial disentanglement and can generate more coherent and accurate images in complex scenarios.

3. The HiCo model demonstrates excellent compatibility with rapid-generation plugins such as LoRA.

Weaknesses

1. Generation requires more inference time and computational resources.

2. The bounding-box, per-object generation process cannot handle more complex interactions between entities, such as "A man on the left and his wife on the right are holding their dog in the middle."

3. In bounding-box-based layout-to-image generation, handling overlaps between the bounding boxes of different entities has been a focal point of academic discourse. However, the authors omit any discussion of how such overlaps are managed.

Questions

1. How is the overlapping issue mentioned above managed?

2. How does the runtime compare with previous works?

3. There are similar works such as LMD [1] and SLD [2]; please discuss the differences between HiCo and these two works.

4. Could HiCo-Net be considered a variant of ControlNet within the modality of bounding-box-based layouts?

[1]Lian, Long, et al. "Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models."

[2]Wu, Tsung-Han, et al. "Self-correcting llm-controlled diffusion models."

Limitations

The authors are encouraged to answer the questions and address the weakness above.

Author Response

We really appreciate your detailed review and valuable suggestions. We address each of your comments below; additional figures can be found in the Author Rebuttal PDF.

Question 1 & Weaknesses 2 and 3.

HiCo achieves hierarchical generation by decoupling each object's position and appearance information into a separate branch, while controlling their overall interactions through a background branch with the global prompt and the Fuse Net. The Fuse Net combines features from the foreground and background regions, as well as intermediate features from the side branches, and integrates them during the UNet upsampling stage. As illustrated in Fig. 1(a) and Fig. 1(b) of the Rebuttal PDF, HiCo is capable of handling complex interactions in overlapping regions without difficulty.

The occlusion order of overlapping objects is likewise specified via the global text prompt, e.g., "bowl in front of vase", as illustrated in Fig. 1(c) and Fig. 1(d) of the Rebuttal PDF. But since corresponding occlusion-order training data are lacking, the success rate is far from optimal. The current version of HiCo indeed lacks a more explicit mechanism for controlling occlusion order.

We recognize this problem as future work. In fact, we are already working on occlusion-order data curation, which is quite challenging because it requires reliable depth estimation in addition to object detection bounding boxes. The process deserves a dedicated technical report in the future.

Question 2 & Weakness 1.

For inference runtime and memory usage, we conducted two additional comparisons. The first is horizontal, across six models: GLIGEN, InstanceDiff, MIGC, CAG, and MtDM, as well as our HiCo. Specifically, we evaluated the inference time and GPU memory usage for directly generating 512×512 images on the HiCo-7K test set using a 24 GB 3090 GPU; results in Fig. 2(a) of the Rebuttal PDF show that HiCo has the shortest inference time and the second-lowest GPU memory footprint.

To make the results more complete, the second comparison is vertical: we assessed the inference time and GPU memory usage for generating 512×512 images on the HiCo-7K test set with different numbers of objects. Since each object is processed by a separate branch in HiCo, inference can be accelerated by running all branches in one batch, i.e., in "parallel mode", which, as shown in Fig. 2(b) of the Rebuttal PDF, is much faster than "serial mode", i.e., running the branches one by one.
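The serial vs. parallel modes can be illustrated at the shape level. In this sketch, `branch_forward` is a stand-in (a single shared linear map) for a real side-branch forward pass; the equivalence of the two modes rests on the assumption that the branches share weights.

```python
import numpy as np

def branch_forward(x, w):
    # Stand-in for one side-branch forward pass (a single linear map here).
    return x @ w

def serial_mode(inputs, w):
    # One forward pass per object branch, run one after another.
    return [branch_forward(x, w) for x in inputs]

def parallel_mode(inputs, w):
    # Stack all branch inputs along a batch dimension and run one pass.
    batch = np.stack(inputs)        # (N, ..., D): N branches in one batch
    out = branch_forward(batch, w)  # a single call instead of N calls
    return list(out)
```

Both modes produce identical outputs; the parallel mode simply trades per-branch calls for one batched call, which is why its wall-clock time grows much more slowly with the number of objects.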

Question 3.

Thank you for the valuable references. LMD and SLD are early works that integrate large language models (LLMs) with diffusion models for improved instruction following and controllable image generation and editing.

Unlike HiCo, which is dedicated to layout-controllable image generation and takes the layout and image specification directly from user input, LMD and SLD resort to an LLM to produce the scene description and layout arrangement automatically from text.

For layout control, LMD and SLD adopt a training-free approach, manipulating the latents and cross-attention maps of each object. This solution is economical, but its control is less satisfactory. In contrast, HiCo incorporates a dedicated conditioning network and learns the layout condition from millions of examples, providing comparatively superior control capability.

LMD and SLD, on the other hand, can perform complex instruction understanding, on which HiCo cannot be directly compared. It is worth pointing out that HiCo can be integrated with LMD and SLD by treating HiCo as a replacement for their training-free layout-controllable generation module; we will add this discussion to the main manuscript.

Question 4.

As mentioned in the introduction, adapter models such as ControlNet and IP-Adapter are representative works that introduce additional conditions into the diffusion model by attaching an "adapter network" alongside the frozen pretrained Stable Diffusion model. HiCo is also a kind of adapter model in this respect, introducing the layout condition by incorporating a separate branch network for each object. Unlike ControlNet, however, HiCo's side-branch input during training and inference consists of paired text descriptions and conditional images, and the fusion of the multiple side branches plays an important role in its strong generation performance. We believe HiCo's current design is by no means optimal, and we will explore the model structure further in its next version.

Comment

Thanks for your response. The authors have addressed most of my concerns in the rebuttal. I firmly believe the method still has potential for improvement. Accordingly, I have raised my score from 5 to 6. I hope to see an optimized version in future open-source releases.

Official Review
Rating: 7

This paper studies layout-to-image generation. It proposes HiCo, a diffusion model that supports a complicated, hierarchical set of bounding boxes as the layout condition. The authors also constructed the HiCo-7K benchmark to provide challenging tasks for evaluations. The experiment results show that the proposed method can effectively generate images matching the layouts in various complex scenarios.

Strengths

  • The key insight of disentangling each object with one branch of HiCo-Net, a ControlNet-like conditioning branch, and fusing them into the same image, is novel, interesting, and inspiring.
  • The authors provided various ways to augment the HiCo-Net branches with LoRA, which makes the model more powerful and extensible.
  • The authors constructed their dataset HiCo-7K for evaluation.
  • Abundant ablation studies demonstrate the soundness of each design choice.

Weaknesses

  • The method only supports axis-aligned bounding boxes. It might be more powerful if it also supported more free-form or precise regions, such as rotated boxes or even polygonal bounding regions, which would make it desirable for some 3D generative tasks like layout-guided room generation.
  • Given that each bounding box requires an individual branch to compute intermediate features, the time complexity grows linearly with the number of bounding boxes. This makes the proposed method less efficient than previous methods whose running time is constant.
  • Following the previous point, I wonder if there is a more efficient way to utilize the bounding boxes, e.g., assigning a small number of nearby bounding boxes to one branch, to trade off between efficiency and per-branch task complexity.
  • (Minor) In Fig. 3, "Encoder" and "Decoder" are not accurate terminology for the UNet; they should be "downsampling" and "upsampling" as in Fig. 4.
  • (Minor) Text in math formulas should not be italicized, as that denotes the product of the individual letters. For example, $Instruction$ means $I \times n \times s \times t \times r \times u \times c \times t \times i \times o \times n$ rather than $\mathrm{Instruction}$.
  • (Minor) $\times$ should be used instead of "*" in L130 and L131.

Questions

Please see "Weaknesses".

Limitations

The authors have addressed the limitations and broader impacts.

Author Response

Thanks for your wonderful review and detailed comments. We address each point below; additional figures can be found in the Author Rebuttal PDF.

Weakness 1

We are very grateful to the reviewer for these constructive ideas. Currently, HiCo only supports axis-aligned bounding boxes, mainly from the perspective of general scene applications, where they offer simplicity, ease of use, and excellent performance.

For scenarios involving other controllable conditions, the HiCo framework combined with suitably designed training data and fusion methods can greatly expand the range of applications. For example, in ongoing work we replaced the input layout image with foreground images segmented by a segmentation model and achieved outstanding performance on inpainting and outpainting. We will also explore the scalability of conditional generation in future work.

Weakness 2

For inference runtime and GPU usage, we conducted two additional comparisons. The evaluation environment is a 24 GB 3090 GPU, generating 512×512 images with 50 inference steps on HiCo-7K.

Vertical comparison: we assessed the inference time and GPU memory usage with different numbers of objects. Since each object is processed by a separate branch in HiCo, inference can be accelerated by running all branches in one batch, i.e., in "parallel mode", which, as shown in Fig. 2(b) of the Rebuttal PDF, is much faster than "serial mode", i.e., running the branches one by one.

Horizontal comparison: among six models, GLIGEN, InstanceDiff, MIGC, CAG, and MtDM, as well as our HiCo. Results in Fig. 2(a) of the Rebuttal PDF show that HiCo has the shortest inference time and the second-lowest GPU memory footprint.

Weakness 3

The HiCo model currently does not support merging multiple small, nearby layout boxes into one branch. On one hand, the descriptions and bounding boxes of different targets within a single branch could not be distinguished; on the other hand, HiCo's branches can be inferred in parallel, which alleviates the impact of this issue to some extent.

Of course, we are also actively exploring more concise and efficient structures to further balance inference efficiency and task complexity.
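For illustration only, the reviewer's grouping idea could be realized with a greedy heuristic like the one below. This is a hypothetical sketch, not part of HiCo; `max_dist` (center distance) and `max_group` (branch capacity) are assumed parameters.

```python
def center(b):
    """Center of a box given as (x1, y1, x2, y2)."""
    return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

def group_boxes(boxes, max_dist=50.0, max_group=3):
    """Greedy heuristic: put each box into the first group whose members
    are all within max_dist of it, capped at max_group boxes per branch;
    otherwise open a new group (i.e., a new branch)."""
    groups = []
    for i, b in enumerate(boxes):
        cx, cy = center(b)
        for g in groups:
            if len(g) < max_group and all(
                ((cx - center(boxes[j])[0]) ** 2 +
                 (cy - center(boxes[j])[1]) ** 2) ** 0.5 <= max_dist
                for j in g
            ):
                g.append(i)
                break
        else:
            groups.append([i])
    return groups
```

Such a scheme would bound the number of branches, trading per-branch task complexity for efficiency, but, as noted above, the current model cannot distinguish multiple targets within one branch.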

Weaknesses 4 to 6

We will thoroughly review the writing and formatting issues in the paper, and update it with the new experimental results and conclusions.

Comment

I sincerely thank the authors for their rebuttal. All of my concerns are addressed, and I would like to keep the rating of 7.

I really like the paper's results and ideas. I hope the non-axis-aligned bounding boxes (or masks) can be supported soon so that the work will also benefit 3D and video tasks.

Author Response

We would like to thank all the reviewers for their thoughtful, overall positive, and encouraging feedback. We are particularly pleased that the reviewers find that our method achieves spatial disentanglement by separating each object, making it seamlessly compatible with SD community plugins (Reviewer 72RR); that our method is novel, simple, and effective, with strong model scalability (Reviewer iE3j); and that our work on the custom dataset HiCo-7K makes a significant contribution (Reviewer 6Tmf).

We would also like to thank the reviewers for all their insightful suggestions. We carefully read and analyzed all the weaknesses and questions, which can be summarized into the following aspects.

Inference performance. We tested the inference time and GPU resource utilization of different models under the same conditions, as shown in Fig.2.

Target interaction. Our method utilizes the background branch to implicitly understand and generate complex interactions between targets, as shown in Fig. 1(a) and Fig. 1(b).

Overlapping areas. Our method cannot follow an explicitly specified layer order when generating overlapping areas; it can only rely on the model's prior knowledge to generate multi-target overlapping areas autonomously and plausibly, as shown in Fig. 1(c) and Fig. 1(d).

Fusion Network. Our fusion network decouples different targets and backgrounds, improving the quality of complex layout image generation, and also provides feasibility for subsequent image editing.

Dataset. We have provided a detailed description of the construction procedure for the evaluation dataset HiCo-7K, as shown in Fig. 3.

Comparison with other SOTA methods. We conducted comparative evaluations of different methods on COCO-3K and HiCo-7K; refer to Tables 1 to 3.

Writing format. We will conduct a detailed review and update of equations, symbols, terminology definitions, and method explanations.

Finally, we invite the reviewers to refer to our Rebuttal PDF. We will respond to each question and comment one by one, and we are more than happy to discuss any remaining issues with the reviewers.

Final Decision

The paper received Accept, Weak Accept, and Borderline Accept scores. Overall, the reviewers appreciate the contributions of the technical approach and the new dataset for evaluating layout-to-image generation. The experimental validation is thorough and convincing. The authors' rebuttal clarified several details, including inference performance, the dataset description, and comparisons with the state of the art on the COCO-3K and HiCo-7K datasets. The AC agrees with the reviewers and recommends acceptance.