InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention
Abstract
Reviews and Discussion
This work studies the multimodal layout-to-image generation with the generative diffusion models. The multimodal condition here can be either visual condition or text condition. This work uses a regional cross-attention between the image features (as query) and condition tokens (as key and value) to apply the conditions to the given layout position. The positional information is added to the condition tokens via a layout encoder with “DenseSample”. The DenseSample turns the bounding box to a series of tokens representing position. For evaluation, this works proposed Layout Grounding Score, which uses a visual language model to judge if the generated instance aligns with the given prompt.
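For concreteness, the sketch below shows the generic region-restricted cross-attention pattern this summary describes (image tokens inside a bounding box act as queries, the instance's condition tokens as keys and values); the names, shapes, and write-back step are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of region-restricted cross-attention (not the paper's code).
import torch

def regional_cross_attention(img_tokens, cond_tokens, bbox, hw):
    # img_tokens: (H*W, C) flattened latent tokens; cond_tokens: (M, C)
    # instance-specific condition tokens; bbox: normalized (x0, y0, x1, y1).
    H, W = hw
    x0, y0, x1, y1 = bbox
    ys = (torch.arange(H).float() + 0.5) / H
    xs = (torch.arange(W).float() + 0.5) / W
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    inside = ((gx >= x0) & (gx < x1) & (gy >= y0) & (gy < y1)).flatten()

    q = img_tokens[inside]                      # queries: tokens inside the box
    k = v = cond_tokens                         # keys/values: condition tokens
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    out = img_tokens.clone()
    out[inside] = attn @ v                      # write conditioned features back
    return out
```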
Strengths and Weaknesses
Strengths
- This work proposes an evaluation protocol by using visual LLM to assess the layout-to-image generation.
Weaknesses
- The comparison of the proposed method against most baselines uses different backbones. InstanceDiff, MIGC, and HICO all use SD1.5 or its variants (e.g., 1.4 or Realistic SD 1.5). Though these variants might have different generation capabilities, they still share the same architecture and model capacity (e.g., number of parameters and pretraining data scale). However, using the SD3 model causes a significant change in backbone capacity. If one checks the Layout Grounding Score in Tables 1 and 2, InstanceDiff's performance is very close to this work. Given this observation, it is unclear whether the proposed method would still outperform the SD1.5-based methods if it used the same backbone (SD 1.5).
- The reviewer has some questions regarding the number of additional parameters. In Table 4, the assemble attention and LoRA are in two different columns, so the reviewer assumes they are two different components of the network that can be added individually. In Table 3, the proposed method introduces only 3.46% or 0.84% new parameters. Such ratios align with the reviewer's knowledge of the number of additional parameters introduced by LoRA. However, it seems that these numbers cover the new parameters of both the assemble attention and LoRA. The reviewer wonders how the authors managed to keep the number of new parameters in the assemble attention so small. (When reading Fig. 2, the Assemble-MMDiT module is repeated N times, and I assume N stands for the number of layers in the DiT model. In that case, the total number of parameters in the attention layers should be relatively large.)
- I also have some concerns regarding the technical novelty of the assemble attention layer. This module seems to be the core component that applies the layout condition to the image. The realization in Section 3.2 appears to be a region-wise cross-attention between visual tokens and condition tokens. However, such a strategy has also been used in InstanceDiff (where an attention mask achieves regional cross-attention) and in [11], which applies regional cross-attention for layout conditioning. The only fundamental difference the reviewer can find is that the encoding of the bounding box is different in this work, where the authors use a DenseSample strategy. Nevertheless, this seems to be a nice engineering strategy to improve performance but shows little technical novelty.
Questions
Please see my weaknesses regarding experimental setting and technical novelty.
Limitations
While there is a limitations and broader impacts section, the authors' claims seem fairly generic: e.g., the limitation is that the designed assemble module needs to be called sequentially, and the broader impacts are generic to all generative models. In the reviewer's view, the same limitation and broader impacts apply equally to other layout-to-image methods and are therefore not strongly tied to this work.
Justification for Final Rating
Please see my last comment
Formatting Concerns
N/A
We sincerely thank the reviewer for your thoughtful comments. We greatly appreciate the opportunity to clarify our work and have carefully considered each point raised. Below, we respond to the reviewer's main concerns:
Weakness 1: The comparison of the proposed method against most of baselines are using different backbones. ...
We select InstanceDiff[1], MIGC[2] and HICO[3] for comparison, as they are strong and influential baselines in layout-to-image generation. Our primary focus, however, lies in the comparison with CreatiLayout[4] in Table 1 and Figure 5a, which uses the same SD3-medium based architecture. In this comparison, our method achieves SOTA results, a point further validated by our comparison with Flux (in Figure 5b). Also, the ablation study in Section 4.4 highlights the impact of each component on the overall performance.
As for the LGS in Table 2, our performance is comparable to InstanceDiff, which is noteworthy given that we did not perform any additional COCO-specific training or fine-tuning as InstanceDiff did. We believe this highlights the strong generalization of our approach without dataset-specific optimization. The effectiveness of our method is not solely attributed to the enhanced capabilities of the base model, but also to our novel architectural design, which we elaborate on in our response to Weakness 3.
Weakness 2: The reviewer has some questions regarding the number of additional parameters. ...
We apologize for the ambiguity in our description in Table 4. To clarify, in this table, "Assemble" refers to the architecture design of Assemble-MMDiT, while "LoRA" refers to its training method:
- The rows without "LoRA" indicate full fine-tuning of the Assemble-MMDiT modules.
- The rows with "LoRA" indicate fine-tuning using LoRA.
Therefore, the additional parameters in Table 3 correspond to the final LoRA-based version used in our work. Thank you for pointing this out; we will revise and clarify this point in the revised version of our manuscript.
Weakness 3: I also have some concerns regarding the technical novelty regarding the assemble attention layer. ...
We acknowledge that our approach is indeed a form of region-wise self-attention strategy. However, our novelty lies in a series of design choices that distinguish our work from previous methods. We will now detail the distinctions and re-emphasize our core innovation in model design.
- Layout injection mechanism: Instead of relying on global attention masks (e.g., InstanceDiff [1]), we introduce an instance-wise self-attention mechanism. We argue that the use of global attention masks can lead to semantic leakage in overlapping regions. In contrast, our approach, which involves separating instance injection by processing cropped latent regions followed by a feature fusion step, is specifically designed to better handle dense layout scenarios. This claim is empirically supported by the experiment below, which demonstrates that our method achieves superior instance attribute consistency and a higher VQA score compared to the global attention mask-based approach.
| LayoutSAM-EVAL | spatial | color | texture | shape | VQA |
|---|---|---|---|---|---|
| SD3-medium (base model) | 77.49 | 60.28 | 62.55 | 60.38 | 93.30 |
| attention mask (SD3-M) | 94.11 | 74.28 | 77.58 | 76.54 | 91.53 |
| InstanceAssemble(ours) (SD3-M) | 94.97 | 77.53 | 80.72 | 80.11 | 93.12 |
- Instance token encoding: Unlike the method in [5], which directly adds position and text embeddings, we employ a learnable fusion module to fuse positional and text embeddings for more expressive instance-level representations. This fusion module improves alignment and generalization across diverse layouts.
- Overlapping region handling: Unlike the previous method [5], which handles overlapping regions by isolating them into non-interacting segments, our "Assemble" process directly aggregates features in overlapping regions. The "Assemble" design enables the modeling of interactions between instances, proving highly effective in complex and dense layouts.
Together, our design presents a distinct, efficient, and precise layout injection paradigm, validated by strong empirical results in challenging scenarios. We will sharpen these points in the revised manuscript. Thanks for prompting this important clarification.
Limitations: While there is the limitation and broader impacts section,...
Thank you for your valuable feedback on our limitations. We have since re-examined our method and its boundaries. While our method addresses several practical and performance challenges in layout-conditioned generation, we acknowledge that certain limitations specific to our proposed approach remain open:
- Image fidelity degradation may still occur under extremely dense or complex layouts. Mitigating these effects remains a key direction for future improvement.
- Extending controllability to support additional conditioning signals (e.g., style, segmentation, or scene attributes) within a unified framework is another active research frontier, which we aim to explore in future work.
Once again, we thank the reviewer for the thoughtful and constructive feedback, which has prompted deeper reflection on the scope and impact of our contributions. We hope our responses clarify our design choices, technical novelties, and experimental decisions, and that they have addressed the reviewer’s concerns. We would greatly appreciate it if you would consider these points in your final assessment of our paper.
[1] X. Wang, T. Darrell, S. S. Rambhatla, R. Girdhar, and I. Misra. InstanceDiffusion: Instance-Level Control for Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6232–6242. IEEE, 2024.
[2] D. Zhou, Y. Li, F. Ma, X. Zhang, and Y. Yang. MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6818–6828. IEEE, 2024.
[3] B. Cheng, Y. Ma, L. Wu, S. Liu, A. Ma, X. Wu, D. Leng, and Y. Yin. HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation. In Advances in Neural Information Processing Systems, 2024.
[4] H. Zhang, D. Hong, Y. Wang, J. Shao, X. Wu, Z. Wu, and Y.-G. Jiang. CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation, 2025.
[5] J. Cheng, Z. Zhao, T. He, T. Xiao, Z. Zhang, and Y. Zhou. Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation. In Advances in Neural Information Processing Systems, 2024. ([11] in our paper)
The reviewer appreciates the authors' response, and I have several follow-up questions/concerns.
On weakness 1: Given the authors' explanation of the potential benefit InstanceDiff gains from having been trained on COCO, which may give it an advantage when evaluated on COCO, the reviewer is fine with Table 2 now. However, for Table 1, the LGS score of InstanceDiff on LayoutSAM-Eval is still very close or even better (in the texture column, InstDiff has the better score but the bold font is given to the proposed method, and the performance on shape is the same, where InstDiff should also be bolded). Given the substantial generation ability brought by SD3 (compared to SD1.5), the reviewer wonders if this suggests InstDiff is still a better training strategy, as it achieves close performance with a model of less capacity.
On weakness 2: The reviewer wonders, if the proposed method fine-tunes the whole model, whether Table 3 still provides a meaningful comparison of parameters. I raise this concern because, as far as I know, InstDiff and HICO fine-tune only the additional parameters (this might be true for other approaches as well). However, if this work fine-tunes the whole model, a more reasonable comparison should be either i) total trainable parameters or ii) total model parameters. Counting either of these, it seems that the proposed method may show a much larger number of parameters than the ones in the Table 3 column.
On weakness 3: The reviewer understands that InstDiff uses a global attention mask, but I do not understand the statement that it leaks semantics in the overlapping region. As far as I know about InstDiff, it designs the attention mask to ensure that text only interacts within its object's region. Under such a design, in an overlapping region, the visual token simply attends to the text tokens of all objects it overlaps with, which mathematically is just a weighted average of textual features. In comparison, this work averages textual features with equal weights (i.e., a special case of a weighted average). Moreover, using an averaged feature is a quite straightforward strategy, and previous works like HICO also adopt this approach. Therefore, it is hard for the reviewer to acknowledge the averaging strategy for overlapping regions as a novel contribution.
We thank the reviewer for the valuable comments. Below we address each point in detail.
Response to Weakness 1:
We appreciate the reviewer’s careful observation and agree that the texture column in Table 1 was incorrectly highlighted. We will correct this mistake in the revised version.
While InstanceDiff achieves comparable performance on the relatively simple LayoutSAM-Eval dataset (which contains only 3.8 instances per image), our method significantly outperforms it on the more complex DenseLayout benchmark (with 18.1 instances per image), as shown in Figure 5a and Table 5 in the appendix.
Although InstanceDiff is built upon a backbone with less capacity (SD1.5), it introduces substantially more additional parameters than our method, which we elaborate on in the next response. Considering our superior performance on dense layouts, lower parameter overhead, and better time efficiency, we believe our method demonstrates a stronger and more practical training strategy.
Response to Weakness 2:
We would like to clarify, as mentioned in our initial rebuttal, that our final reported model does not fine-tune the entire base model. Instead, we only train a LoRA module on the base model, along with the LayoutEncoder, resulting in a small number of additional parameters (see Table 3).
We also thank the reviewer for pointing out a miscalculation regarding InstanceDiff's parameter count. We mistakenly overestimated the parameter count because the authors of InstanceDiff stored both the EMA and original models in a single checkpoint file, which led to double-counting during our initial analysis. After carefully rechecking, we found that InstanceDiff introduces approximately 369M additional parameters, which is 43% of the SD 1.5 base model (860M). Even with this corrected number, it remains substantially higher than the parameter addition of our method.
Response to Weakness 3:
In InstanceDiff, all instances are processed simultaneously using self-attention, with an attention mask applied to restrict each instance's tokens from attending to others.
However, self-attention not only enables interactions between image and instance tokens, but also among image tokens themselves. In overlapping regions, let us denote instances A and B with their corresponding image region tokens, and region C as the overlapping part. Under InstanceDiff’s attention design, each instance token attends only to its own image region, as enforced by the attention mask. However, the image tokens in the overlapping region (i.e., region C) are shared and thus attended to by both A and B. As self-attention allows interactions among image tokens, region C acts as a bridge between instances A and B, enabling potential information leakage between them.
Besides applying the attention mask, InstanceDiff also employs an effective strategy called Multi-Instance Sampler to further isolate overlapping instances, but this leads to significant time overhead (see Table 3).
In contrast, our method introduces a novel design that processes each instance separately via self-attention, avoiding cross-instance interactions by construction. Averaging features is sufficient for integration in our case, and as shown in Table 3, our method maintains a relatively lower computational overhead.
In addition, according to the formula of Fuse Net mentioned in HICO paper, all instance regions are simply summed together, whereas in our method, we take the average based on the instance count within each patch.
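A minimal sketch of this count-based averaging over overlapping regions is shown below, assuming per-instance updated feature maps and box masks are already available; it illustrates the averaging idea only and is not the authors' exact "Assemble" implementation.

```python
# Illustrative count-based averaging over overlapping instance regions.
import torch

def assemble_average(base, instance_feats, instance_masks):
    # base: (C, H, W) global latent features
    # instance_feats: list of (C, H, W) per-instance updated features
    # instance_masks: list of (H, W) boolean masks of each instance's box
    acc = torch.zeros_like(base)
    count = torch.zeros_like(base[:1])          # (1, H, W) per-patch instance count
    for feat, mask in zip(instance_feats, instance_masks):
        m = mask.unsqueeze(0).float()           # (1, H, W)
        acc += feat * m
        count += m
    covered = count > 0
    # Overlapping patches receive the mean of the contributing instances
    # (divide by the count), not their sum; uncovered patches keep the base.
    return torch.where(covered, acc / count.clamp(min=1), base)
```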
We emphasize that our architectural contributions lie in: a novel attention mechanism (Assemble-Attn) that enables layout-aware generation, cascaded structure ensuring both high global image quality and precise layout control, multi-conditional generation, and lightweight design.
We hope these clarifications and additional analyses adequately address your concerns. We sincerely appreciate your continued engagement and would be grateful if our efforts could be reflected in your final evaluation.
Dear Reviewer KyXo,
The authors have provided further responses to your questions. Would you please kindly read their responses and see if they have resolved your questions? Thank you.
Your AC.
The reviewer appreciates the authors' response and has the following questions after reading the second rebuttal.
On weakness 1: The reviewer thinks the fairness of the comparison here mostly lies in the capacity of the model. If the model capacity of the proposed method is greater than that of InstDiff, its better performance cannot be attributed to the design. Please see my following response on the model comparison.
On weakness 2: From the authors' second rebuttal, the reviewer is even more confused. In the paper and the first rebuttal, "Assemble-MMDiT" is marked as trainable. If "Assemble-MMDiT" has been trained, what is the reason to report only LoRA's number of parameters? From my reading of the paper, the reviewer believes the Assemble-MMDiT is the vital part for layout conditioning. I feel the current report of the number of parameters in Table 3 is not fair, as the numbers for all other methods are the total number of parameters for layout conditioning, while for the proposed method the parameters for layout conditioning are ignored. Therefore, the reviewer has not been convinced by the current version of the response.
On weakness 3: The reviewer acknowledges that, for the interaction with the overlapping region (region C in the rebuttal), the image tokens can interact with regions A and B. However, the reviewer questions whether this is indeed an operation that causes semantic leakage (that hampers the performance of layout-to-image generation). Here is my thought: the interaction of image tokens in attention is a common practice; even the proposed method's backbone MM-DiT has self-attention between image tokens. Therefore, can I apply the same conclusion from rebuttal 2 to the proposed method as well (that region C has semantic leakage to regions A and B)? If the self-attention in MM-DiT is fine, then why would it cause a semantic leakage problem in InstDiff's design?
Also, regarding the highlighted contributions in rebuttal 2, the reviewer can acknowledge that multi-conditional generation can be a meaningful practice, while the other contributions have similarities to previous work (e.g., regional attention, cascaded structure). And given my concerns on weakness 2, the reviewer has doubts about the "lightweight" claim of the proposed method.
We thank the reviewer for their continued engagement and constructive questions. Below we address each concern in turn.
1. Trainable parameters and model capacity
Our Assemble-MMDiT is initialized from the base model's MMDiT parameters, with only the token interaction pattern modified (Assemble-Attn). In our final reported results, these parameters inherited from the base model's MMDiT are frozen, and we train only (i) the LoRA modules of the MMDiT blocks and (ii) the MLPs in the LayoutEncoder. In Table 3, we report the additional parameters over the base model for each method. For our approach, this includes only the LoRA + LayoutEncoder parameters, as these are the only trainable and additional components. Therefore, the total additional parameters remain substantially lower than those of InstanceDiff.
The "trainable" mark in our figure specifically denotes LoRA fine-tuning; if the notation in Figure 2 caused confusion, we apologize and will clarify this in the revised version.
2. Overlapping regions and semantic leakage
In InstanceDiff, the attention mask prevents instance textual tokens from attending to those of other instances, but shared image tokens in overlapping regions (region C) still link regions A and B within each single self-attention pass. As a result, the instance captions for A and B are effectively injected into the combined A + B area, making it harder to preserve precise per-instance semantics in one self-attention pass.
Our design avoids this by processing each instance separately, ensuring its textual tokens attend only to its corresponding image tokens. This fully isolates instance semantics in the instance–image stage, while still allowing global prompt–image attention earlier in the pipeline to model reasonable cross-instance relationships.
As shown in the ablation study (the table in our first rebuttal), this strategy achieves higher instance semantic accuracy than using an attention mask, without compromising overall image quality (VQA).
3. Novelty of "regional attention" and "cascaded structure"
While certain terms resemble prior work, our implementations are substantially different:
Regional attention (Assemble-Attn in our paper): As detailed in the first rebuttal, our Assemble-Attn processes entire instances independently, including overlaps, rather than attention masking within a single joint self-attention. This enables explicit multi-conditional control while maintaining global coherence, which prior regional attention designs do not achieve.
Cascaded structure: To our knowledge, no prior work employs a two-stage pipeline that first processes the global prompt through the entire MMDiT and then refines with per-instance conditions. This Cascaded design is crucial for achieving both high global image quality and precise layout control.
As shown in Table 4 (first three rows), both innovations consistently enhance layout control and overall image quality, while keeping computational cost low. Moreover, our approach achieves state-of-the-art on complex and dense layout condition (Figure 5a), further demonstrating the effectiveness of our design.
We hope these clarifications resolve the reviewer’s concerns and make the contributions and advantages of our method clearer.
This paper introduces InstanceAssemble, a novel framework for accurate layout-based image generation. By using Assemble-MMDiT architecture, this method can robustly handle each instance, and by using LoRA, it is more training-efficient compared to previous training-based methods that require heavy fine-tuning. Moreover, the authors introduce a new benchmark and evaluation metric to enable more consistent assessment of the generated image with respect to the given layout. As a result, InstanceAssemble achieves state-of-the-art performance for complex layout conditions.
Strengths and Weaknesses
Strengths
- Effective engineering: The paper introduces a well-engineered framework that seems to effectively address the limited robustness of training-free methods and the computational cost of heavy fine-tuning in training-based approaches.
- Strong results: Clear improvements in the quantitative comparisons with the baseline methods (e.g., CreatiLayout [1], HICO [2]).
- New benchmark: The introduction of a new benchmark (DenseLayout) and an evaluation metric (LGS) are solid contributions to the literature on layout-to-image generation.
Weaknesses
- Unclear points on technical novelty: While InstanceAssemble shows strong results, there are some critical questions regarding the main idea of the proposed method, as listed below:
  - Details on Assemble-MMDiT: I find it unclear how Assemble-MMDiT is implemented. Are the attention modules trained from scratch, or are they initialized from the original MMDiT weights? As the full pipeline in Figure 2 also uses the original MMDiT for the global prompt, does it use the same weights and only activate the LoRA parameters during inference? Moreover, to which part of the MMDiT is the LoRA applied? Please correct me if the details are already included and I have missed them.
  - Rationale behind Assembling-Attn: Section 3.2 doesn't seem to sufficiently explain why and how the design choice of Assembling-Attn, which is to apply attention between individual patches and the corresponding instance tokens, would lead to more accurate object locations. Specifically, is it the case that the attention in MMDiT already has such a property and the LoRA training enlarges it? (Similar to the "semantic sharing" property of DiT observed in GrounDiT [3].) Or does the attention module newly learn to locate each object during the LoRA training phase?
  - DenseSample: The design of DenseSample in Section 3.1 is also confusing. It seems that the xy coordinates of each bounding box are encoded into each instance token. However, in Assembling-Attn only a cropped subset of image tokens is attended with each instance token. Given this, why is the xy coordinate information required in the instance token? Is this identical to the grounding token in GLIGEN [4]?
- Although InstanceAssemble seems to produce high-quality outputs, the paper's main contributions - Layout Encoder and Assemble-MMDiT - still contain multiple unclear points regarding their technical details and novelty compared to previous works.
[1] CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation, 2024
[2] HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation, NeurIPS 2024
[3] GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation, NeurIPS 2024
[4] GLIGEN: Open-Set Grounded Text-to-Image Generation, CVPR 2023
Questions
- For the quantitative comparisons in both Table 1 and Figure 5, only CreatiLayout is based on SD3-M, and all other baselines use older text-to-image models. Given that InstanceAssemble uses more recent models including SD3-M and Flux.1, I'm concerned whether the advantage of InstanceAssemble comes from the novel method itself or from the better choice of backbone model.
- Moreover, I'm curious whether previous layout-to-image methods such as GLIGEN [1] and InstanceDiffusion [2] are not applicable to MMDiT architectures, and whether there have been similar discussions on efficient ways to train them. Such discussions would provide a clearer motivation for the need for an efficient training framework like InstanceAssemble.
[1] GLIGEN: Open-Set Grounded Text-to-Image Generation, CVPR 2023
[2] InstanceDiffusion: Instance-level Control for Image Generation, CVPR 2024
Limitations
Yes
Justification for Final Rating
The authors' rebuttal provides a clear clarification on most of the concerns raised regarding the main technical contribution and justification of the method design. Specifically, providing technical details on Assemble-MMDiT, and discussing the exact rationale behind Assembling-Attn were helpful for understanding the significance of the proposed method.
Formatting Concerns
No concerns
We sincerely thank the reviewer for the detailed and constructive feedback. Your questions have helped us clarify and refine several technical aspects of our work. Below, we provide detailed responses to the main concerns.
Weakness 1: Details on Assemble-MMDiT:...
In our final model:
- Initialization and Training: Assemble-MMDiT is initialized from the pretrained MMDiT backbone weights, with only the LoRA modules trained during fine-tuning. The original MMDiT weights are frozen.
- Pipeline:
  - The "Global Prompt → Image" step uses the unmodified, pretrained MMDiT backbone (e.g., SD3-M or FLUX), with no additional LoRA training.
  - The "Instances → Image" step uses the LoRA-based Assemble-MMDiT, which is activated only during the first 30% of denoising steps at inference.
- LoRA Application: In the SD3-based version, we apply LoRA to all Assemble-MMDiT blocks, whereas in the Flux-based version, due to resource constraints, LoRA is applied to only eight blocks (seven double-blocks and one single-block); a generic illustration of attaching such adapters is sketched below.
We will make these implementation details clearer in the revised version to ensure better readability and transparency.
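As referenced above, a generic sketch of attaching a low-rank adapter to a frozen linear projection is shown below; the rank, scaling, and wrapped module are illustrative assumptions rather than the configuration used in the paper.

```python
# Generic LoRA adapter around a frozen linear layer (illustrative only).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # keep the pretrained weight frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # zero-init so the adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Hypothetical usage: wrap the attention projections of selected blocks, e.g.
# block.attn.to_q = LoRALinear(block.attn.to_q, rank=16)
```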
Weakness 2: Rationale behind Assembling-Attn:...
We are grateful for the reviewer's feedback regarding the unclear presentation of our design choices. We will take this opportunity to elaborate on our design rationale and will make sure to articulate it more effectively in the revised version.
Our method directly links each instance token with the corresponding image tokens within its bounding box, using Assemble-Attn. This approach aims to improve object localization accuracy by ensuring that each instance's attention is focused on its specific region. Moreover, as shown in our ablation study (Section 4.4), layout-aware generation is primarily enabled by our architectural design, rather than by any inherent property of the attention in MMDiT (see the comparison between Row 1 and Row 2).
We would also like to acknowledge GrounDiT[1] as a valuable and insightful work that explores important properties in DiT models. However, the design and objectives of GrounDiT differ significantly from ours:
- GrounDiT introduced the concept of semantic sharing: when a smaller noisy patch is jointly denoised with the full noisy image, the two gradually become semantic clones. GrounDiT leverages this property to perform layout-to-image generation by cropping each instance's image tokens and denoising them together with their corresponding instance images.
- In contrast, InstanceAssemble directly crops the noisy latent according to instance bounding boxes, applies instance-wise self-attention, followed by our "Assemble" process to aggregate the features, and reassembles the modified latent back into the full image before proceeding with denoising.
We believe that both approaches are complementary in advancing the understanding and improvement of DiT-based layout-aware generation. We will include a more detailed discussion of GrounDiT in our revised version to highlight these distinctions and further emphasize the uniqueness of our approach.
Weakness 3: DenseSample:...
We thank the reviewer for pointing out this important question regarding DenseSample. We are happy to clarify its design and motivation:
- While each instance token attends only to a cropped subset of image tokens within its bounding box (as part of the Assemble-Attn mechanism), the coordinate embedding remains essential for several reasons. It injects positional awareness into the instance token and helps the model disambiguate between visually similar instances (e.g., two differently colored dogs in nearby but distinct locations).
- As demonstrated in our ablation study (Section 4.4), DenseSample provides richer geometric context, leading to improved performance in layout-aware generation.
- We note that this is not entirely identical to the grounding token in GLIGEN [2]. In GLIGEN, grounding tokens are global prompts that attend over the full image, whereas in our method, instance tokens are localized in both representation and attention scope.
Finally, we have initiated preliminary ablation studies suggesting that removing coordinate encoding significantly degrades layout understanding and instance disambiguation, particularly in dense scenes. However, due to time constraints, we were unable to complete these experiments during the rebuttal period. We plan to include the results in the revised version or discuss them as a direction for future work.
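One plausible reading of a dense bounding-box encoding, sampling a grid of points inside the box, Fourier-encoding their coordinates, and projecting them to position tokens with an MLP, is sketched below; this is a hedged illustration only, and the paper's actual DenseSample may be defined differently.

```python
# Plausible (not authoritative) sketch of a dense bounding-box position encoding.
import math
import torch
import torch.nn as nn

def fourier_features(xy, num_freqs=8):
    # xy: (N, 2) normalized coordinates in [0, 1] -> (N, 4 * num_freqs)
    freqs = 2.0 ** torch.arange(num_freqs)
    ang = xy.unsqueeze(-1) * freqs * math.pi          # (N, 2, num_freqs)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)

def dense_box_tokens(bbox, mlp, grid=4):
    # Sample a grid x grid lattice of points inside the normalized box.
    x0, y0, x1, y1 = bbox
    xs = torch.linspace(x0, x1, grid)
    ys = torch.linspace(y0, y1, grid)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pts = torch.stack([gx, gy], dim=-1).reshape(-1, 2)  # (grid*grid, 2)
    return mlp(fourier_features(pts))                   # one position token per point

mlp = nn.Sequential(nn.Linear(4 * 8, 256), nn.SiLU(), nn.Linear(256, 256))
tokens = dense_box_tokens((0.2, 0.3, 0.6, 0.8), mlp)    # (16, 256)
```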
Q1: For the quantitative comparisons in both Table 1 and Figure 5, ...
We select InstanceDiff[3], MIGC[4] and HICO[5] for comparison, as they are strong and influential baselines in layout-to-image generation. Our primary focus, however, is on the comparison with CreatiLayout[6] in Table 1 and Figure 5a, which uses the same SD3 based architecture. In this comparison, our method achieves SOTA results, a point further validated by our comparison with Flux-based method (in Figure 5b). Also, the ablation study in section 4.4 highlights the impact of each component on the overall performance. We will clarify these points in the revised manuscript and highlight that our contribution lies in enabling effective layout-aware generation for modern DiT-based architectures.
Q2: Moreover, I'm curious whether the previous layout-to-image methods ...
We thank the reviewer for this crucial question, as it allows us to more precisely situate our contribution in the field. The reviewer’s intuition is correct: to our knowledge, classic layout-to-image methods like GLIGEN [2] and InstanceDiffusion [3], designed for U-Net architectures, have not been directly extended to MMDiTs. With recent approaches on MMDiT such as CreatiLayout [6] (based on SD3) and several training-free methods on FLUX [7,8], we provide direct comparisons in Table 1 and Figure 5, and show our SOTA performance.
Due to the lack of discussion on efficient layout control in MMDiT models, our work introduces:
- A novel attention mechanism (Assemble-Attn) that enables layout-aware generation and is fully compatible with DiT-based architectures,
- A cascaded structure that processes the global prompt and instance conditions sequentially, ensuring both high global image quality and precise layout control,
- Instance-wise conditioning strategies that support flexible and explicit multi-conditional generation, and
- A lightweight design that supports efficient training and inference,
thereby enabling an efficient, flexible, and practical layout control method in modern DiT-based models.
We greatly appreciate your thoughtful feedback, which has substantially strengthened the clarity, rigor, and overall structure of our paper. We will carefully incorporate these changes in the final revision and hope that you will take them into consideration during your final review.
[1] Y. Lee, T. Yoon, and M. Sung. GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation. In Advances in Neural Information Processing Systems, 2024.
[2] Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee. GLIGEN: Open-Set Grounded Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521. IEEE, 2023.
[3] X. Wang, T. Darrell, S. S. Rambhatla, R. Girdhar, and I. Misra. InstanceDiffusion: Instance-Level Control for Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6232–6242. IEEE, 2024.
[4] D. Zhou, Y. Li, F. Ma, X. Zhang, and Y. Yang. MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6818–6828. IEEE, 2024.
[5] B. Cheng, Y. Ma, L. Wu, S. Liu, A. Ma, X. Wu, D. Leng, and Y. Yin. HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation. In Advances in Neural Information Processing Systems, 2024.
[6] H. Zhang, D. Hong, Y. Wang, J. Shao, X. Wu, Z. Wu, and Y.-G. Jiang. CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation, 2025.
[7] A. Chen, J. Xu, W. Zheng, G. Dai, Y. Wang, R. Zhang, H. Wang, and S. Zhang. Training-free Regional Prompting for Diffusion Transformers, 2024.
[8] Z. Chen, Y. Li, H. Wang, Z. Chen, Z. Jiang, J. Li, Q. Wang, J. Yang, and Y. Tai. Region300 Aware Text-to-Image Generation via Hard Binding and Soft Refinement, 2024.
I thank the authors for providing detailed clarifications on each of the raised points. Most of my concerns regarding the main idea of InstanceAssemble are now resolved. With this, I would like to raise my initial rating to borderline accept (4).
Finally, I would appreciate it if the authors could further clarify a confusing point in "Weakness 3: DenseSample".
- "It injects positional awareness into the instance token and helps the model disambiguate between visually similar instances."
This description is confusing, because in instance-wise self-attention, the "visually similar instances" would interact separately with their corresponding image tokens cropped from the latent, leaving no need for disambiguation between them. Please correct me if my understanding is wrong.
We sincerely thank the reviewer for the positive feedback and for raising the rating. We truly appreciate your constructive comments and thoughtful engagement.
Regarding the question on Weakness 3: DenseSample, we apologize for the confusion in our wording. For instance, if the global prompt only states "two dogs", while the instance captions specify "a yellow dog" and "a black dog", injecting bbox embeddings helps guide each object to be placed correctly. Furthermore, the ablation study of SD3-based InstanceAssemble model on DenseLayout in the table below demonstrates the necessity of bbox encoding, while DenseSample provides additional performance improvements. (The first row shows an additional ablation requested by the reviewer, finished after the rebuttal deadline due to time constraints. The second and third rows are the same as the last two rows in Table 4 of the main paper.)
| Setting | mIoU ↑ | color ↑ | texture ↑ | shape ↑ | VQA ↑ |
|---|---|---|---|---|---|
| w/o bbox encoding, w/o densesample | 51.22 | 32.15 | 34.04 | 33.53 | 93.30 |
| w/ bbox encoding, w/o densesample | 51.28 | 32.68 | 34.94 | 34.58 | 93.33 |
| w/ bbox encoding, w/ densesample | 52.07 | 33.77 | 36.21 | 35.81 | 93.54 |
We hope this addresses your concern, and thank you again for your helpful feedback.
The paper proposes a new method for layout-to-image (L2I) generation based on a new Instance Assembling Attention mechanism. The proposed mechanism processes layout tokens and the matching image tokens with an additional attention block, and subsequently assembles them back into image and layout tokens. In addition, the paper also introduces a new benchmark for the L2I task based on spatial accuracy, determined by DetectIoU, and semantic accuracy, which is calculated using a VLM-based VQA protocol. Overall, the proposed method outperforms related approaches while being significantly more efficient.
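As a purely illustrative reading of such a protocol, the sketch below combines an IoU-based spatial check with a per-instance VLM yes/no question; the `detect_box` and `vlm_yes_no` callables are hypothetical placeholders and do not reproduce the paper's exact LGS definition.

```python
# Rough schematic of a layout-grounding-style score (illustrative only).
def iou(a, b):
    # a, b: (x0, y0, x1, y1) in pixels
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def layout_grounding_style_score(image, instances, detect_box, vlm_yes_no):
    spatial, semantic = [], []
    for inst in instances:
        pred = detect_box(image, inst["caption"])             # hypothetical detector
        spatial.append(iou(pred, inst["bbox"]))
        x0, y0, x1, y1 = inst["bbox"]
        crop = image[y0:y1, x0:x1]
        question = f"Does this region show {inst['caption']}?"
        semantic.append(1.0 if vlm_yes_no(crop, question) else 0.0)  # hypothetical VLM
    n = max(len(instances), 1)
    return sum(spatial) / n, sum(semantic) / n
```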
Strengths and Weaknesses
Strengths
- Straightforward but effective idea: The authors leverage the attention mechanisms in MMDiT to guide the generation process to place objects accordingly. The integration of layout tokens in a cascaded way through Assemble-MMDiT is straightforward and effective.
- Thorough and convincing ablation study: The authors perform a thorough ablation study that considers multiple design choices, making a compelling case for the proposed pipeline.
- Good results and great improvements in efficiency: Both the quantitative and qualitative results show tremendous improvements in efficiency while also improving on most quality metrics. Table 3, in particular, demonstrates the impressive efficiency improvement clearly.
Weaknesses
- Bounding boxes are rectangular: Due to the rectangular nature of bounding boxes, fine-grained controllability is limited. While having a rough way to control the generation process might be desired in some cases, greater control might be desirable in other settings. Nonetheless, the proposed framework seems to be a good option for bounding box level of control.
- Unclear role of LoRA: The authors mention the use of LoRA but do not specify for which weights LoRA is applied. See below for more specific questions regarding LoRA.
- Extensive additional training: The proposed approach requires training on 8 H800 GPUs for a week. So it seems that the method moves the required computational effort from the inference step to the training step, allowing for the added efficiency. While this is not necessarily a weakness, it might be important to consider depending on the available computational budget.
- Scales with the number of instances, handling overlapping areas: It looks like image tokens might be reused in Assemble-MMDiT if bounding boxes overlap, which could be inefficient. There seems to be an almost 2x increase in efficiency between sparse and dense layouts.
Questions
- Which weights are trained using LoRA? How are these weights initialized? Why would LoRA improve the performance? What is the "base model" referred to in the ablation study section?
- In Figure 2, it looks like there is one large attention block that processes all projected tokens simultaneously, but in the text, it sounds like every instance is processed separately, i.e., image tokens do not attend across instances. Which one is it?
- Why is training performed for all time steps, but during inference, the proposed method is only applied to the first 30% of timesteps?
- Why are the LayoutGrounding scores for DenseLayout so much lower than for LayoutSAM-Eval?
Limitations
Yes
Justification for Final Rating
Given that the rebuttal has addressed all my concerns convincingly, I have decided to keep my score and recommend acceptance.
Formatting Concerns
No concerns
We sincerely thank the reviewer for the positive recognition and thoughtful suggestions. Your encouraging assessment affirms our design choices and motivates future improvements. Below, we provide point-by-point responses to each concern.
Weakness 1: Bounding boxes are rectangular Due to the rectangular nature of bounding boxes, ...
We agree that bounding boxes provide only coarse spatial control. Therefore, we conducted experiments with more precise control conditions such as depth maps, Canny maps, or reference images, which showed better alignment with object shapes (Section 4.3). As for finer control, we consider this an exciting direction for future work.
Weakness 2: Unclear role of LoRA The authors mention the use of LoRA but do not specify for which weights LoRA is applied. ...
Thank you for raising this point. We provide a detailed explanation of how LoRA is applied in Q1 below.
Weakness 3: Extensive additional training The proposed approach requires training on 8 H800 GPUs for a week. ...
The main computational burden comes from our reliance on the base models (e.g., SD3, FLUX). Additionally, training on a dataset of over 200M images was essential for the model to learn effective layout control.
While our approach does require notable training resources to achieve robust layout control, this upfront cost enables fast and flexible inference. Compared to related methods like CreatiLayout[1] (7 days on 8×A800 GPUs) or InstanceDiffusion[2] (64×A100 GPUs), our training cost is comparable.
Weakness 4: Scales with the number of instances, ...
Token reuse in overlapping regions is a deliberate design choice to ensure that each instance’s semantics are properly injected. While this introduces some overhead in dense layouts, it preserves fidelity in layout-grounded generation. We acknowledge the trade-off and will explore optimizations in future work.
Q1: Which weights are trained using LoRA? ...
We apply LoRA to the Assemble-MMDiT module, initialized from the base model (SD3-Medium or Flux.1-dev). The Layout Encoder is trained from scratch. In Table 4, "Assemble" refers to the architecture design of Assemble-MMDiT, while "LoRA" refers to its training method:
- The rows without "LoRA" indicate full fine-tuning of the Assemble-MMDiT modules.
- The rows with "LoRA" indicate fine-tuning using LoRA.
The LoRA module improves performance for two reasons: 1) it retains the base model's capabilities compared to the fully fine-tuned version, and 2) it enables effective layout control with fewer trainable parameters. The base model in the ablation study refers to the SD3-M model.
Q2: In Figure 2, it looks like there is one large attention block that processes all projected tokens simultaneously, ...
Thank you for pointing out the ambiguity in Figure 2. We apologize for this confusion. The reviewer’s understanding is correct: each instance is processed separately, and image tokens only attend to tokens within their corresponding bounding boxes. We will revise the figure to make this clearer.
Q3: Why is training performed for all time steps, ...
Training across all timesteps is crucial for generalization, as the timestep conditions the denoising network. This ensures that the model dynamically adapts its predictions to the specific noise level at each diffusion step, which is fundamental to the noise estimation process. However, during inference, layout guidance is most effective in the first 30% of timesteps, where the global spatial layout is established, as demonstrated in prior works like GLIGEN [3].
As shown in Supplementary A and Figure 10, extending beyond 30% offers no significant gains but increases runtime overhead. Therefore, we choose the 30% trade-off to balance quality and efficiency.
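Schematically, the schedule described above can be read as the loop below, where the layout-conditioned path is used only for the first 30% of sampling steps; the step functions are placeholders, not the actual implementation.

```python
# Schematic of layout guidance applied only in the early denoising steps.
def sample(latent, timesteps, backbone_step, layout_step, layout, ratio=0.3):
    n = len(timesteps)
    for i, t in enumerate(timesteps):
        if i < int(ratio * n):
            # early steps: establish the global spatial layout
            latent = layout_step(latent, t, layout)
        else:
            # later steps: refine details with the plain backbone
            latent = backbone_step(latent, t)
    return latent
```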
Q4: Why are the LayoutGrounding scores for DenseLayout so much lower than for LayoutSAM-Eval?
The difference in LayoutGrounding scores between DenseLayout and LayoutSAM-Eval is primarily due to the data complexity. DenseLayout contains an average of 18.1 instances per case (with at least 15 per image), making it much more challenging than LayoutSAM-Eval, which averages only 3.8 instances. As a result, DenseLayout sets a higher bar for layout adherence, leading to lower scores across all methods. However, our model maintains state-of-the-art performance on both benchmarks.
Once again, we greatly appreciate the reviewers’ insightful feedback. We hope these clarifications address your concerns, and we are committed to advancing layout-controlled synthesis by incorporating the suggested refinements.
[1] H. Zhang, D. Hong, Y. Wang, J. Shao, X. Wu, Z. Wu, and Y.-G. Jiang. CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation, 2025.
[2] X. Wang, T. Darrell, S. S. Rambhatla, R. Girdhar, and I. Misra. InstanceDiffusion: Instance-Level Control for Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6232–6242. IEEE, 2024.
[3] Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee. GLIGEN: Open-Set Grounded Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521. IEEE, 2023.
I want to thank the authors for the detailed response. All my concerns have been convincingly addressed, and all my questions have been answered.
We sincerely thank the reviewer for the encouraging feedback and positive evaluation. Your constructive comments greatly helped us improve the clarity and quality of our work. If you have any further suggestions for how we can improve our work and potentially improve your evaluation, please do let us know!
Best regards! Wishing you a good day.
In this paper, the authors proposed a new model for layout-aware image generation. In particular, they design LoRA modules that can be attached to modern DiT based text-to-image generative models such as Stable Diffusion 3 and Flux.1. They also proposed a new benchmark and a new evaluation metric for the layout-aware image generation task.
Strengths and Weaknesses
Strengths:
- The proposed model has very impressive empirical results
- The experiment section is solid and thorough
- The addition of the new benchmark and metric is a good plus
- All figures in this paper are very high quality
Weaknesses:
- The modeling details are a bit unclear from the writing. For example, what exactly are the network's "image" input and output? I.e., which parameterization did the authors consider? Or is it different for different backbone models?
- Related to weakness 1, if the training objective in this paper is the same as the backbone models', then the description of the training objective is not very accurate. Stable Diffusion 3 and Flux.1-dev both use the flow matching objective, and it should not be described as the same as the one in LDM (i.e., velocity prediction with a conditional optimal transport path is not the same as epsilon prediction with a Gaussian probability path).
- The generated images from InstanceAssemble look very saturated, and with depth and edge inputs this problem is exacerbated.
- The authors should consider including discussions of the following related works:
[1] Zhang, et al. “Adding Conditional Control to Text-to-Image Diffusion Models”. ICCV 2023.
[2] Hertz, et al. “Prompt-to-Prompt Image Editing with Cross-Attention Control”. arXiv 2022.
[3] He, et al. “Localized text-to-image generation for free via cross attention control”. arXiv 2023.
Questions
- During inference the layout condition is only applied in the first 30% of the denoising steps, then is it still essential to train with all time steps? What is going to happen if you apply the layout condition for all time steps at inference time? Will the image quality degrade significantly?
- Related to weakness 1, what exactly is the "image" input to the networks, and what is the output of the network exactly? Are the outputs epsilon predictions, x0 predictions, or v predictions? Is the "image" input the actual denoised latent, the noisy latent, or epsilon?
- Can the authors explain the oversaturation problem?
- Which training dataset did the authors use for the image/depth/edge input?
- Does the Assemble-MMDiT have a similar size as the frozen MMDiT? If so, is the reason why it only adds 8% more inference time the fact that you only apply it in the first 30% of the denoising steps?
- How would the native multimodal image generation models like GPT-4o or Gemini 2.0 Flash compare to InstanceAssemble in this task?
- Does this model only work for bounding boxes? The authors should consider adding (demo) experiments with semantic maps if possible.
Limitations
Yes, but the authors can potentially extend the paragraph to include more discussion on potential negative societal impact.
Justification for Final Rating
The authors have addressed all my concerns and therefore I would like to maintain my rating as 5 accept.
Formatting Concerns
N/A
We sincerely thank the reviewer for your positive feedback and high evaluation of our work. We greatly appreciate your recognition of our contributions, and your thoughtful comments have further enhanced the clarity and quality of our paper. Below, we provide point-by-point responses to each concern.
Weakness 1: The modeling details are a bit unclear from the writing. ...
In Section 4.1 ("Experimental Setup"), we explored two input variants:
- Prompt + layout: The inputs consist of a global prompt describing the overall content and a layout condition consisting of paired bounding boxes and instance captions. No additional visual content is provided.
- + additional visual instance content: This extends the previous variant by optionally incorporating per-instance reference images, edge maps, depth maps, etc.
For the layout encoder, we train an MLP from scratch to encode layout features. In the Assemble-MMDiT module, we fine-tune via LoRA on top of the frozen backbone. Overall, compared with the SD3 or FLUX.1-dev backbones, we add a layout encoder trained from scratch and the Assemble-MMDiT module as a LoRA adaptation.
Weakness 2: Related to weakness 1, if the training objective in this paper is the same as the backbone models, ...
We sincerely thank the reviewer for pointing out this issue. We apologize for the imprecise description of the training objective. To clarify, we use the flow-matching loss, consistent with Stable Diffusion 3 and Flux.1-dev; we will correct this in the revised version to ensure accuracy.
Weakness 3: The generated images from InstanceAssemble look very saturated, ...
We trained two variants of the model:
- The prompt + layout variant shows negligible oversaturation.
- The additional visual instance content variant, however, results in slight oversaturation, especially when conditioning on depth or edge maps. This is likely due to the different data distributions in these modalities, which exceed the representational capacity of a single LoRA.
We acknowledge this limitation and plan to address it in future work by (a) training separate LoRA models for each modality and (b) exploring normalization or modality-specific injection strategies to mitigate oversaturation.
Weakness 4: The authors should consider including the discussions of the following related works:...
We sincerely appreciate the reviewer’s recommendation to include these important related works, all of which contribute valuable insights. We will incorporate discussions of the papers in our final revision.
Q1: During inference the layout condition is only applied in the first 30% of the denoising steps, ...
Training across all timesteps is crucial for generalization, as the timestep conditions the denoising network. This ensures that the model dynamically adapts its predictions to the specific noise level at each diffusion step, which is fundamental to the noise estimation process. During the inference, extending beyond 30% offers no significant gains but increases runtime overhead. Therefore, we choose the 30% trade-off to balance quality and efficiency as shown in Supplementary A and Figure 10.
Q2: Related to weakness 1, ...
Since our backbone uses the flow‑matching formulation (as in SD3 and Flux), our network outputs v‑predictions. During training, inputs are noised latents from the VAE encoder of real images; at inference, inputs are Gaussian-initialized noisy latents.
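For reference, a minimal sketch of a rectified-flow / flow-matching training step of the kind used by SD3-style backbones is given below; the uniform sampling of t and the model interface are simplifying assumptions, not the exact training code.

```python
# Minimal flow-matching (rectified-flow) training step with a v-prediction target.
import torch

def flow_matching_loss(model, x0, cond):
    # x0: clean VAE latents, (B, C, H, W); cond: conditioning inputs
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)            # t ~ U(0, 1)
    noise = torch.randn_like(x0)
    t_ = t.view(b, 1, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * noise             # linear interpolation path
    target_v = noise - x0                          # velocity along the path
    pred_v = model(x_t, t, cond)                   # network outputs a v-prediction
    return torch.mean((pred_v - target_v) ** 2)
```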
Q3: Can the authors explain the oversaturation problem?
We have addressed the oversaturation issue in response to Weakness 3.
Q4: Which training dataset did the authors use for the image/depth/edge input?
We use the LayoutSAM dataset (proposed by CreatiLayout[1]) for training. For additional visual content, we crop the original images and compute depth and Canny edge maps. During training, we randomly select one modality (image, depth, or edge) or none for each instance. We will clarify these details in the revised version.
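A simple sketch of this per-instance condition preparation is shown below; the depth-estimator call is a hypothetical placeholder, and the Canny thresholds and modality sampling are illustrative choices, not the authors' exact pipeline.

```python
# Illustrative per-instance visual-condition preparation.
import random
import cv2
import numpy as np

def instance_condition(image: np.ndarray, bbox, depth_estimator=None):
    x0, y0, x1, y1 = bbox                      # pixel coordinates
    crop = image[y0:y1, x0:x1]
    choice = random.choice(["image", "edge", "depth", "none"])
    if choice == "image":
        return crop
    if choice == "edge":
        gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
        return cv2.Canny(gray, 100, 200)       # Canny edge map of the crop
    if choice == "depth" and depth_estimator is not None:
        return depth_estimator(crop)           # hypothetical depth model
    return None                                # fall back to text-only conditioning
```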
Q5: Does the Assemble-MMDiT have a similar size as the frozen MMDiT? ...
The Assemble-MMDiT has the same block architecture as the original MMDiT backbone, with LoRA adapters (not the full parameters). The approximately 8% additional inference time (reported on the Flux model) is due to applying the LoRA adaptation only during the first 30% of the denoising steps and only on 8 blocks (seven double-blocks and one single-block). We will clarify these details in the revised version.
Q6: How would the native multimodal image generation models like GPT-4o or Gemini 2.0 Flash compare to InstanceAssemble in this task?
We explored layout control in GPT-4o by embedding bounding box positions into the text prompt or by providing a reference image with bounding boxes. While GPT-4o demonstrates some basic layout awareness, the positional accuracy is not satisfactory. We attribute this limitation to its lack of dedicated layout conditioning mechanisms.
Q7: Does this model only work for bounding boxes? ...
We appreciate the suggestion to explore semantic maps. To demonstrate this, we conducted experiments with our model originally trained with bounding boxes. We modified the model at inference time to crop based on semantic maps instead of bounding boxes. Results show that semantic-map-based inference aligns instance shapes more closely with segmentation. However, dedicated training with semantic maps is still necessary to achieve robust results. This is a promising direction for future work.
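A minimal sketch of this inference-time substitution, selecting latent tokens by a segmentation mask instead of a rectangular box, is given below; the names and interpolation choice are illustrative assumptions rather than the actual implementation.

```python
# Illustrative token selection from a semantic mask instead of a bounding box.
import torch
import torch.nn.functional as F

def tokens_from_mask(img_tokens, mask, hw):
    # img_tokens: (H*W, C) latent tokens; mask: (H_img, W_img) binary segmentation
    H, W = hw
    m = F.interpolate(mask[None, None].float(), size=(H, W), mode="nearest")
    inside = m.flatten() > 0.5                 # token-level instance mask
    return img_tokens[inside], inside
```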
Limitations: Yes, but the authors can potentially extend the paragraph to include more discussion on potential negative societal impact.
We agree that a more comprehensive discussion of potential negative impacts is warranted. We will expand the Limitations section to address possible risks such as misuse in generating misleading or deceptive layouts, privacy concerns, and the potential for bias amplification in conditioned generation.
We sincerely thank the reviewer again for the insightful feedback, which we will incorporate into the final revision.
[1] H. Zhang, D. Hong, Y. Wang, J. Shao, X. Wu, Z. Wu, and Y.-G. Jiang. CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation, 2025.
Thank you for your rebuttal. This addresses all my concerns.
This paper introduces an instance-assembling attention mechanism for layout-to-image generation. It also introduces a new benchmark and metric for the task. State-of-the-art performance is reported. The reviewers generally agreed that the proposed solution is effective, the results are strong, and the new benchmark and metric are solid contributions. They raised some questions related to implementation details, technical novelty, design justification, over-saturation of the results, comparison with other models, and scaling issues. The authors addressed these questions adequately in the rebuttal and discussions. The final ratings are 2 "accept" and 2 "borderline accept". It is recommended to accept this paper for its solid technical contributions.