TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision
A new diffusion model fine-tuning paradigm.
Abstract
Reviews and Discussion
This paper, focusing on efficiently fine-tuning diffusion models for vision tasks, proposes a new framework, TSM, which improves LoRA tuning of diffusion models by using different LoRA modules for different timesteps during training. The fine-tuning consists of a fostering stage, where TimeStep LoRA experts are trained on different timestep intervals to capture varying noise levels, and an assembling stage, where an asymmetrical mixture of TimeStep LoRA experts combines these experts at multiple scales. The authors demonstrate the effectiveness of TSM on several tasks and show that TSM outperforms vanilla LoRA methods and achieves state-of-the-art results on multiple tasks.
Strengths
- Rather than applying the vanilla LoRA, TSM introduces an innovative way to address the limitations of LoRA in diffusion models by creating specialized experts for different timesteps.
- TSM achieves SOTA results across multiple tasks on image and video modalities. The results show consistent improvements in model performance and generalization.
Weaknesses
- The two-stage TSM approach relies on a relatively complex design, which requires heavy fine-tuning and hyperparameter optimization, such as the interval selection and router design. This may not generalize easily to tasks outside the evaluated benchmarks.
- Extra parameters limit the scalability of the model. It also inherits the inference procedure of the diffusion process, which is also computationally burdensome.
- Compared to Vanilla LoRA, some improvements, such as in color and shape metrics shown in Table 1, are relatively modest.
Questions
- The ablation study shows that the increase of r does not necessarily increase the performance. Were other values of r tested? How would the performance change w.r.t the dimension r?
- Can this framework be applied to more tasks? For example, image restoration or image segmentation. The reviewer wants to see some simple experiments about this.
- Question: The ablation study shows that the increase of r does not necessarily increase performance.
All experiments presented in our paper were conducted under the constraint of maintaining equal training costs. When we remove this constraint and train each timestep expert with n=8 and 4000 training steps, the results are shown in the table below. The experimental results demonstrate that the model's performance indeed increases with the growth of r.
| Rank | Color | Shape | Texture | Spatial | Non-spatial | Complex |
|---|---|---|---|---|---|---|
| SD1.5 | 36.97 | 36.27 | 41.25 | 11.04 | 31.05 | 30.79 |
| 1 | 54.63 | 44.66 | 55.35 | 13.23 | 31.66 | 31.84 |
| 4 | 56.48 | 45.91 | 57.08 | 18.01 | 31.77 | 32.79 |
| 16 | 57.86 | 45.99 | 58.13 | 14.11 | 31.82 | 33.20 |
| 64 | 59.37 | 46.67 | 58.99 | 15.40 | 31.86 | 33.59 |
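For reference, the reason r directly controls capacity is that the number of trainable LoRA parameters per projection grows linearly with the rank. Below is a minimal sketch with illustrative dimensions (not our exact projection sizes):

```python
# Trainable LoRA parameters for one d_in x d_out projection: r * (d_in + d_out).
# The 768 x 768 size below is illustrative; actual projection sizes depend on the model.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

for r in (1, 4, 16, 64):
    print(f"r={r:>2}: {lora_params(768, 768, r):>6} trainable params per projection")
```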
- Question: Can this framework be applied to more tasks?
We have verified the effectiveness of TSM on three classic architectures (U-Net, DiT, MMDiT), two modalities (image, video), and three tasks (domain adaptation, post-pretraining, model distillation). Due to time constraints, we will complete the comparative experiments of TSM on more downstream tasks in future versions of the paper, such as image and video restoration, image and video segmentation, etc.
Thanks for the reply. After comprehensively considering the rebuttal and the comments from other reviewers, I think this paper's current version is below the ICLR acceptance bar. Hence I keep the original score.
We understand that you have maintained your original score, and we fully respect your decision.
However, we sincerely hope that you can approach this novel optimization direction for diffusion models with an open mind and a degree of receptiveness, giving this validated algorithm a chance to grow and develop. We believe that TSM proposes a completely new direction for optimizing diffusion models, specifically focusing on timestep optimization. To support this, we conducted what we believe to be the most representative experiments, including ablation studies and horizontal comparisons. The results consistently demonstrate the effectiveness and necessity of this optimization direction.
Thank you for your valuable reviews. We will address your questions on these weaknesses one by one.
- Question: The two-stage TSM approach relies on a relatively complex design and may not generalize easily to tasks outside the evaluated benchmarks.
We have verified the effectiveness of TSM on three classic architectures (U-Net, DiT, MMDiT), two modalities (image, video), and three tasks (domain adaptation, post-pretraining, model distillation). We think the generalization ability and versatility of our TSM have been fully verified. Furthermore, most of our hyperparameter selections directly follow the baselines [1, 2]. For the additional hyperparameters governing the core LoRA and context LoRA, there is a very simple choice that already yields SOTA results across our experiments: n=8 for the core LoRA and n=1 for the context LoRA. Beyond these, no further hyperparameters are introduced, so it will be easy to transfer our method to other fields.
[1] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
[2] Tianwei Yin, Michael Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. arXiv preprint arXiv:2405.14867, 2024a.
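To make the core/context pairing above concrete, the sketch below illustrates one way the asymmetric selection can be expressed: the core expert is chosen by the timestep interval (n=8), while a single context expert (n=1) covers the whole range. The gate values here are placeholders for what the learned router would produce; all names and numbers are illustrative rather than our exact implementation.

```python
# Illustrative sketch of the asymmetric core/context selection (placeholder
# gate values; in TSM the mixing weights come from a learned router).
import math

T_MAX = 1000  # assumed number of diffusion timesteps

def core_expert_index(t: int, n_core: int = 8) -> int:
    """Fine-grained core expert: the timestep interval that contains t."""
    return min(t * n_core // T_MAX, n_core - 1)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mix(core_out: float, context_out: float, gate_logits) -> float:
    """Always one core expert plus one context expert, weighted by the gate."""
    w_core, w_ctx = softmax(gate_logits)
    return w_core * core_out + w_ctx * context_out

print("core expert for t=730:", core_expert_index(730))   # -> 5
print("mixed output:", mix(1.0, 0.2, [2.0, 0.5]))
```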
- Question: Extra parameters limit the scalability of the model. It also inherits the inference procedure of the diffusion process, which is also computationally burdensome.
Because the amount of LoRA parameters we introduce is very small, as shown in the table below, the additional time and memory consumption of our method compared to the original model is indeed acceptable. Therefore, TSM does not actually affect model scalability.
| Model | Method | Diffusion Network | Text Encoder | Train Percent (%) |
|---|---|---|---|---|
| SD1.5 | Vanilla LoRA | Wq, Wk, Wv, Wout (0.7972M) | Wq, Wv (0.1475M) | 0.09604 |
| SD1.5 | TSM 1-stage | Wq, Wk, Wv, Wout (0.7972M) | Wq, Wv (0.1475M) | 0.09604 |
| SD1.5 | TSM 2-stage | Rq, Rk, Rv, Rout (0.2275M) | Wq, Wv (0.04243M) | 0.02722 |
| PixArt-α | Vanilla LoRA | Wq, Wk, Wv, Wout (0.2064M) | Wq, Wv (0.1573M) | 0.06764 |
| PixArt-α | TSM 1-stage | Wq, Wk, Wv, Wout (0.2064M) | Wq, Wv (0.1573M) | 0.06764 |
| PixArt-α | TSM 2-stage | Rq, Rk, Rv, Rout (0.0482M) | Wq, Wv (0.02446M) | 0.01344 |
| SD3 | Vanilla LoRA | Wq, Wk, Wv, Wout (1.18M) | Wq, Wk, Wv, Wout (1.606M) | 0.05635 |
| SD3 | TSM 1-stage | Wq, Wk, Wv, Wout (1.18M) | Wq, Wk, Wv, Wout (1.607M) | 0.05635 |
| SD3 | TSM 2-stage | Rq, Rk, Rv, Rout (0.2435M) | Wq, Wv (0.3767M) | 0.01311 |
We have supplemented the following table with a comparison of the time and memory usage required to run the original model and our method on a single A100, where the batch size is set to 1.
| Method | Time (s) | Memory (GB) |
|---|---|---|
| SD1.5 | 8.35 | 2.04 |
| SD1.5+TSM (ours) | 8.52 | 2.04 |
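This kind of measurement can be reproduced with a generic harness such as the one below (assuming the diffusers library, fp16, batch size 1, and the public runwayml/stable-diffusion-v1-5 checkpoint; this is not our exact benchmarking script, and absolute numbers depend on the environment and sampler settings):

```python
# Generic timing / peak-memory harness for a diffusion pipeline on one GPU.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a red book and a yellow vase"
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
image = pipe(prompt, num_inference_steps=50).images[0]
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3

print(f"time: {elapsed:.2f} s, peak memory: {peak_gb:.2f} GB")
```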
- Question: Compared to Vanilla LoRA, some improvements, such as in color and shape metrics shown in Table 1, are relatively modest.
In Table 1, compared to vanilla LoRA, TSM demonstrates notable improvements across all three fundamental architectures (UNet, DiT, and MMDiT). However, the enhancement brought by TSM to SD3 is relatively less pronounced than those observed in SD1.5 and PixArt-α. This is primarily because SD3's performance on T2I-Compbench has already approached saturation levels. Therefore, the fact that TSM achieves approximately 1% improvement in attributes such as color and shape is actually quite remarkable and commendable. Furthermore, as shown in Tables 2 and 3, the advantages of TSM over vanilla LoRA are demonstrated even more comprehensively.
The paper introduces the TimeStep Master (TSM) paradigm for efficiently fine-tuning diffusion models using Low-Rank Adaptation (LoRA). TSM addresses the limitation of applying the same LoRA across all timesteps by introducing two stages: fostering, where different LoRAs are applied to specific timestep intervals, and assembling, which combines these experts asymmetrically with a core-context collaboration. TSM improves diffusion models' ability to handle different noise levels, resulting in better performance across tasks like domain adaptation, post-pretraining, and model distillation, with reduced computational costs.
Strengths
The TimeStep Master (TSM) paradigm addresses the problem of text adherence deterioration in LoRA-tuned diffusion models by introducing different LoRA modules for distinct timesteps, allowing the model to handle varying noise levels more effectively. The use of an asymmetrical mixture of experts, with a core-context collaboration mechanism, helps improve model adaptability to different noise distributions. TSM demonstrates versatility across tasks such as domain adaptation, post-pretraining, and model distillation, achieving strong results on various benchmarks for both images and videos. Its design also maintains relatively low computational costs, making it an efficient option for fine-tuning large models, with results showing broad applicability across different architectures and data modalities.
Weaknesses
- The paper does not clearly compare its method with non-LoRA approaches that address prompt misalignment issues such as attribute disentanglement, spatial relationships (Fig. 1), generation omissions, and distortions (Fig. 4). These include training-free methods like Training-Free Structured Diffusion Guidance, BoxDiff, and Training-Free Layout Control with Cross-Attention Guidance, or training-required methods like LayoutDiffusion. A comparison of computational overhead and training parameters would provide more insight into the advantages of the proposed method.
- Prompt misalignment issues, such as incorrect spatial relationships and attribute binding errors, are common even in pre-trained or fully fine-tuned models. These problems are not exclusive to LoRA-tuned models and may not be solely caused by the different noise levels at various timesteps, as suggested by the authors. More explanation and insight are needed on how the proposed LoRA tuning method resolves these issues. Additionally, to validate the authors' claim that deterioration in LoRA-tuned models is due to using a single LoRA across all timesteps, it might be sufficient to fine-tune only the UNet parameters without modifying the text encoder. Since the text encoder statically computes embeddings and doesn't adjust based on timestep noise levels, any improvements from tuning it may come from upgrading the text encoder itself. More experiments are necessary to confirm this.
- Combining multiple LoRAs can sometimes impair a pre-trained model's inherent generative ability or cause individual LoRA characteristics to be lost. It’s unclear whether the gating-based combination method explored the potential degradation in model performance. Including comparisons with other LoRA combination methods, such as Linear Arithmetic Composition or Reference Tuning-Based Composition, could provide more insights. Moreover, ablation studies on LoRA combination methods (beyond just the gating inputs) would be useful. A related work is Mixture of LoRA Experts.
- As mentioned in point 1, non-LoRA methods can address similar issues, potentially with lower computational cost. Some training-free approaches also achieve good results. It remains unclear if the TSM LoRA-tuned method is necessary or optimal for specific application scenarios. While the method is interesting and simple, its practical use in certain scenarios remains to be seen.
- In Fig. 4, the "in the style of Sci-Fi" prompt does not seem to be well followed, and in Fig. 5, some images (e.g., the black-and-white image) show poor understanding of the prompt.
Questions
As I understand it, compared to traditional LoRA, the proposed method combines multiple different interval LoRAs at the same timestep, with different LoRAs for different timesteps. Does this only result in an additional 1M parameters? Could the authors clarify more specifically where the 1M increase in training parameters comes from? Is it solely due to the added gating mechanism?
- Question: The prompt misalignment problem is not exclusive to LoRA-tuned models and may not be solely caused by the different noise levels at various timesteps.
Thank you for the analysis. There may be various reasons for the degradation of model capabilities caused by LoRA, such as the low-rank structure of the update matrix, insufficient training, etc. However, TSM largely resolves the degradation caused by LoRA while keeping approximately the same training overhead. Therefore, we believe that TSM will replace LoRA as the preferred solution for low-cost fine-tuning of diffusion models.
- Question: It might be sufficient to fine-tune only the UNet parameters without modifying the text encoder.
As shown in Tables 1, 2, 3, 4, 5, and 6, in our comparison experiments between vanilla LoRA and TSM, LoRA is added to the same set of modules in both settings, and in particular it is consistently added to the text encoder. Therefore, our ablation experiments are fair and fully demonstrate that our TSM is significantly better than vanilla LoRA. Besides, most of our hyperparameter selections (including whether to train the text encoder) directly follow the baselines [1, 2].
[1] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
[2] Tianwei Yin, Michael Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. arXiv preprint arXiv:2405.14867, 2024a.
- Question: It’s unclear whether the gating-based combination method explored the potential degradation in model performance.
Table 9 in our paper presents our ablation experiments on the router architecture. Besides, the experimental results in Tables 1, 2, and 3 all show that even such a simple router architecture is enough to make our method outperform existing methods.
- Question: Ablation studies on LoRA combination methods (beyond just the gating inputs) would be useful. A related work is Mixture of LoRA Experts.
In the table below, we further add a comparison between MoE LoRA trained from scratch (concurrent work with ours) and TSM. It can be seen that our TSM still achieves the best results.
| Method | Color | Shape | Texture | Spatial | Non-spatial | Complex |
|---|---|---|---|---|---|---|
| SD1.5 | 36.97 | 36.27 | 41.25 | 11.04 | 31.05 | 30.79 |
| SD+MoE LoRA | 52.26 | 39.83 | 51.38 | 11.71 | 31.34 | 32.00 |
| SD1.5+TSM 1-stage | 56.48 | 45.91 | 57.08 | 18.01 | 31.77 | 32.79 |
| SD1.5+TSM 2-stage | 57.59 | 46.18 | 57.69 | 17.91 | 31.82 | 32.78 |
- Question: Its practical use in certain scenarios remains to be seen.
We have verified the effectiveness of TSM on three classic architectures (U-Net, DiT, MMDiT), two modalities (image, video), and three tasks (domain adaptation, post-pretraining, model distillation). We think the practicality and versatility of our TSM have been fully verified.
- Question: In Fig. 4, the "in the style of Sci-Fi" prompt does not seem to be well followed, and in Fig. 5, some images (e.g., the black-and-white image) show poor understanding of the prompt.
Sci-Fi denotes Science Fiction, and the test prompts are selected from the EvalCrafter benchmark. Note that in the example shown on the left side of Fig. 4, the result generated by VC2 lacks the crucial main character, while the result from VC2 with LoRA is missing a bird. The absence of the main subject is a relatively more significant error. In the example in Fig. 5 (the black-and-white image), SD1.5 generates numerous peculiar elements, and the image produced by SD1.5 with LoRA even includes green vegetation, which clearly deviates from the input prompt. Compared to other methods, our method greatly reduces the mismatch between text and generated content. Although some small mismatches may remain, the improvement is substantial.
- Question: Does this only result in an additional 1M parameters? Could the authors clarify more specifically where the 1M increase in training parameters comes from? Is it solely due to the added gating mechanism?
Thank you for the suggestion. We have added an analysis of the parameters. As shown in the table below, because the structure of the router is very simple, the additional trainable parameters introduced in the second stage amount to only about 1M, and the parameter increase of TSM over the original model is acceptable.
| Model | Method | Diffusion Network | Text Encoder | Train Percent (%) |
|---|---|---|---|---|
| SD1.5 | Vanilla LoRA | Wq, Wk, Wv, Wout (0.7972M) | Wq, Wv (0.1475M) | 0.09604 |
| SD1.5 | TSM 1-stage | Wq, Wk, Wv, Wout (0.7972M) | Wq, Wv (0.1475M) | 0.09604 |
| SD1.5 | TSM 2-stage | Rq, Rk, Rv, Rout (0.2275M) | Wq, Wv (0.04243M) | 0.02722 |
| PixArt-α | Vanilla LoRA | Wq, Wk, Wv, Wout (0.2064M) | Wq, Wv (0.1573M) | 0.06764 |
| PixArt-α | TSM 1-stage | Wq, Wk, Wv, Wout (0.2064M) | Wq, Wv (0.1573M) | 0.06764 |
| PixArt-α | TSM 2-stage | Rq, Rk, Rv, Rout (0.0482M) | Wq, Wv (0.02446M) | 0.01344 |
| SD3 | Vanilla LoRA | Wq, Wk, Wv, Wout (1.18M) | Wq, Wk, Wv, Wout (1.606M) | 0.05635 |
| SD3 | TSM 1-stage | Wq, Wk, Wv, Wout (1.18M) | Wq, Wk, Wv, Wout (1.607M) | 0.05635 |
| SD3 | TSM 2-stage | Rq, Rk, Rv, Rout (0.2435M) | Wq, Wv (0.3767M) | 0.01311 |
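To give an intuition for why the router overhead is so small: a simple per-projection linear gate over two experts has only a few hundred parameters, versus hundreds of thousands for a single base attention projection. The dimensions below are assumed for illustration; the actual router architecture (Rq, Rk, Rv, Rout) is described in the paper.

```python
# Rough parameter-count intuition for a lightweight gate (assumed dimensions).
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

timestep_dim = 320                  # assumed timestep-embedding width
gate = nn.Linear(timestep_dim, 2)   # scores for one core + one context expert
base_proj = nn.Linear(768, 768)     # one attention projection of the base model

print("gate params per projection:", count_params(gate))      # 642
print("base projection params:", count_params(base_proj))     # 590592
```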
Thank you for your valuable reviews. We will address your questions on these weaknesses one by one.
- Question: The paper does not clearly compare its method with non-LoRA approaches that address prompt misalignment issues.
Thanks for the reminder. These methods perform additional training-free operations on the attention module, but their improvement in the essential capabilities of the model is limited. We conducted experiments on SD2 and compared the results with other methods (including training-free methods) as follows. The results demonstrate that TSM still has a large performance advantage compared to training-free methods.
| Method | Color | Shape | Texture | Spatial | Non-spatial | Complex |
|---|---|---|---|---|---|---|
| SD2 [1] | 50.65 | 42.21 | 49.22 | 13.42 | 31.27 | 33.86 |
| SD2 + Composable [2] | 40.63 | 32.99 | 36.45 | 8.00 | 29.80 | 28.98 |
| SD2 + Structured [3] | 49.90 | 42.18 | 49.00 | 13.86 | 31.11 | 33.55 |
| SD2 + Attn Exct [4] | 64.00 | 45.17 | 59.63 | 14.55 | 31.09 | 34.01 |
| SD2 + GORS unbiased [5] | 64.14 | 45.46 | 60.25 | 17.25 | 31.58 | 34.70 |
| SD2 + GORS [5] | 66.03 | 47.85 | 62.87 | 18.15 | 31.93 | 33.28 |
| SD2 + TSM (Ours) | 75.93 | 54.34 | 67.44 | 18.34 | 31.47 | 34.20 |
[1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022a.
[2] N. Liu, S. Li, Y. Du, A. Torralba, and J. B. Tenenbaum, “Compositional visual generation with composable diffusion models,” in ECCV, 2022.
[3] W. Feng, X. He, T.-J. Fu, V. Jampani, A. Akula, P. Narayana, S. Basu, X. E. Wang, and W. Y. Wang, “Training-free structured diffusion guidance for compositional text-to-image synthesis,” in ICLR, 2023.
[4] H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or, “Attend-and-Excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Trans. Graph., 2023.
[5] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
- Question: A comparison of computational overhead and training parameters would provide more insight into the advantages of the proposed method.
Because the amount of LoRA parameters we introduce is very small, as shown in the table below, the additional time and memory consumption of our method compared to the original model is acceptable.
| Model | Method | Diffusion Network | Text Encoder | Train Percent (%) |
|---|---|---|---|---|
| SD1.5 | Vanilla LoRA | Wq, Wk, Wv, Wout (0.7972M) | Wq, Wv (0.1475M) | 0.09604 |
| SD1.5 | TSM 1-stage | Wq, Wk, Wv, Wout (0.7972M) | Wq, Wv (0.1475M) | 0.09604 |
| SD1.5 | TSM 2-stage | Rq, Rk, Rv, Rout (0.2275M) | Wq, Wv (0.04243M) | 0.02722 |
| PixArt-α | Vanilla LoRA | Wq, Wk, Wv, Wout (0.2064M) | Wq, Wv (0.1573M) | 0.06764 |
| PixArt-α | TSM 1-stage | Wq, Wk, Wv, Wout (0.2064M) | Wq, Wv (0.1573M) | 0.06764 |
| PixArt-α | TSM 2-stage | Rq, Rk, Rv, Rout (0.0482M) | Wq, Wv (0.02446M) | 0.01344 |
| SD3 | Vanilla LoRA | Wq, Wk, Wv, Wout (1.18M) | Wq, Wk, Wv, Wout (1.606M) | 0.05635 |
| SD3 | TSM 1-stage | Wq, Wk, Wv, Wout (1.18M) | Wq, Wk, Wv, Wout (1.607M) | 0.05635 |
| SD3 | TSM 2-stage | Rq, Rk, Rv, Rout (0.2435M) | Wq, Wv (0.3767M) | 0.01311 |
We have supplemented the following table with a comparison of the time and memory usage required to run the original model and our method on a single A100, where the batch size is set to 1.
| Method | Time (s) | Memory (GB) |
|---|---|---|
| SD1.5 | 8.35 | 2.04 |
| SD1.5+TSM (ours) | 8.52 | 2.04 |
The paper finds that using the same LoRA for fine-tuning diffusion models at different timesteps has its limitations. To address this, this paper proposes the TSM paradigm, which employs different LoRA adaptations at various timesteps and integrates them through a novel asymmetrical mixture, achieving state-of-the-art performance across multiple tasks and model structures.
Strengths
- This paper identifies a key limitation in using a single LoRA for fine-tuning diffusion models: the significant differences in feature distributions across different timesteps make it challenging for a single LoRA to effectively learn all information. To address the above issue, the paper introduces a two-stage approach that leverages multiple LoRA modules and an ensemble method.
- Comprehensive experiments are conducted across several tasks, such as domain adaptation, post-pretraining, and model distillation, all yielding promising results.
- The paper’s illustrations clearly depict the details of the proposed method.
Weaknesses
- Since TSM uses a router to dynamically assemble experts based on timestep and previous step results, each timestep may involve multiple expert calls. This results in higher memory usage and greater latency during inference.
- Although the assembling stage can be understood from the flowchart, the explanation of Equation (7) may be somewhat confusing, especially regarding the subscript notation of ϵ, which lacks clarity.
- Regarding the results in Table 1, the paper presents various outcomes for SD2 and SD2+X, but the results for SD2 with the proposed method are missing. What could be the reason for this?
- In Table 2, the authors provide results on T2I-CompBench and EvalCrafter, yet FID is also a widely accepted and important metric in pretraining works such as Pixart-alpha. Could an evaluation based on this metric be added?
- Table 6 shows the ablation study for the first stage of TSM, where it’s observed that the overall performance improves with 8 experts. Would further increasing the number of experts theoretically enhance the results even more?
Questions
Please refer to the weaknesses part.
- Question: Could an evaluation based on FID be added?
Thank you for the suggestion. The results of our FID experiment on PixArt-α are shown in the following table. The FID of PixArt-α is measured by ourselves, since the model they published is not the one used to measure FID in their paper.
| Method | FID |
|---|---|
| PixArt-α | 27.26 |
| PixArt-α+TSM | 23.64 |
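For reproducibility, FID between a set of generated images and a reference set can be computed with, for example, the torchmetrics implementation shown below. This is a generic sketch with random placeholder tensors, not our evaluation script; the reference set, resolution, and number of samples all strongly affect the score.

```python
# Generic FID computation with torchmetrics (placeholder tensors; in practice
# use thousands of real and generated images as uint8 tensors of shape (N, 3, H, W)).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=64)

real_images = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())
```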
- Question: Would further increasing the number of experts theoretically enhance the results even more?
Intuitively, it seems this way, but our experiments have shown that this is not the case. The premise that must be ensured in our experiments is that with the same training cost, as the timesteps are further subdivided and the number of experts increases, the number of steps each expert is trained on decreases. The experiments in Table 6 indicate that when n≤4, the model's performance improves as the number of experts increases. However, if n continues to increase, the model's performance may even decline. We supplemented the experimental result of n=16 as follows, which further verified our conjecture.
| n | Color | Shape | Texture | Spatial | Non-spatial | Complex |
|---|---|---|---|---|---|---|
| SD1.5 | 36.97 | 36.27 | 41.25 | 11.04 | 31.05 | 30.79 |
| 1 | 51.86 | 44.74 | 55.74 | 15.70 | 31.70 | 29.84 |
| 2 | 54.3 | 45.25 | 57.26 | 17.05 | 31.79 | 31.50 |
| 4 | 55.85 | 46.45 | 58.06 | 18.32 | 31.77 | 32.95 |
| 8 | 56.48 | 45.91 | 57.08 | 18.01 | 31.77 | 32.79 |
| 16 | 54.20 | 45.61 | 56.39 | 14.56 | 31.71 | 33.16 |
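The accounting behind this trade-off is simple: with a fixed total budget, every further subdivision reduces the number of updates each expert receives. A minimal sketch, assuming the 32,000-step budget implied by 8 experts × 4,000 steps:

```python
# Equal-cost accounting behind the n ablation (total budget assumed from 8 x 4,000 steps).
TOTAL_STEPS = 32000

for n in (1, 2, 4, 8, 16):
    print(f"n={n:>2}: {TOTAL_STEPS // n:>6} training steps per expert")
```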
Thanks for your reply. I keep my scores as I think the method of exploring different LoRA modules for distinct timesteps is new. You can focus on addressing others' opinions to increase the probability of raising the score.
Thank you for your valuable reviews. We will address your questions on these weaknesses one by one.
- Question: This results in higher memory usage and greater latency during inference.
Thank you for your concern. We have added an analysis of the inference memory and time usage. For training, because the amount of LoRA parameters we introduce is very small, as shown in the table below, the additional time and memory consumption of our method compared to the original model is acceptable.
| Model | Method | Diffusion Network | Text Encoder | Train Percent (%) |
|---|---|---|---|---|
| SD1.5 | Vanilla LoRA | Wq, Wk, Wv, Wout (0.7972M) | Wq, Wv (0.1475M) | 0.09604 |
| SD1.5 | TSM 1-stage | Wq, Wk, Wv, Wout (0.7972M) | Wq, Wv (0.1475M) | 0.09604 |
| SD1.5 | TSM 2-stage | Rq, Rk, Rv, Rout (0.2275M) | Wq, Wv (0.04243M) | 0.02722 |
| PixArt-α | Vanilla LoRA | Wq, Wk, Wv, Wout (0.2064M) | Wq, Wv (0.1573M) | 0.06764 |
| PixArt-α | TSM 1-stage | Wq, Wk, Wv, Wout (0.2064M) | Wq, Wv (0.1573M) | 0.06764 |
| PixArt-α | TSM 2-stage | Rq, Rk, Rv, Rout (0.0482M) | Wq, Wv (0.02446M) | 0.01344 |
| SD3 | Vanilla LoRA | Wq, Wk, Wv, Wout (1.18M) | Wq, Wk, Wv, Wout (1.606M) | 0.05635 |
| SD3 | TSM 1-stage | Wq, Wk, Wv, Wout (1.18M) | Wq, Wk, Wv, Wout (1.607M) | 0.05635 |
| SD3 | TSM 2-stage | Rq, Rk, Rv, Rout (0.2435M) | Wq, Wv (0.3767M) | 0.01311 |
For inference, we have supplemented the following table with a comparison of the time and memory usage required to run the original model and our method on a single A100, where the batch size is set to 1.
| Method | Time (s) | Memory (GB) |
|---|---|---|
| SD1.5 | 8.35 | 2.04 |
| SD1.5+TSM (ours) | 8.52 | 2.04 |
- Question: Although the assembling stage can be understood from the flowchart, the explanation of Equation (7) may be somewhat confusing, especially regarding the subscript notation of ϵ, which lacks clarity.
Thank you for your detailed reminder. As stated in lines 139-140 of the article, ϵ is the noise sampled from a Gaussian distribution. We apologize for the confusion.
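For reference, ϵ enters through the standard noise-prediction objective used to train diffusion models, written here in its generic DDPM form (the paper's Eq. (7) may differ in details):

```latex
\mathcal{L} \;=\; \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
\left[ \big\lVert \epsilon - \epsilon_\theta\!\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\big) \big\rVert_2^2 \right]
```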
- Question: The results for SD2 with the proposed method are missing. What could be the reason for this?
In our paper, we chose a UNet-based model (SD1.5), a DiT-based model (PixArt-α), and an MMDiT-based model (SD3) as our pretrained models. The reason we chose SD1.5 instead of SD2 is that SD1.5 is more widely circulated on major platforms. The results show that SD1.5+TSM significantly exceeds SD1.5 and its various variants. To further verify our observation, we add the experimental results of TSM on SD2 in the following table. It can be seen that we achieve state-of-the-art results on SD2 as well.
| Method | Color | Shape | Texture | Spatial | Non-spatial | Complex |
|---|---|---|---|---|---|---|
| SD2 [1] | 50.65 | 42.21 | 49.22 | 13.42 | 31.27 | 33.86 |
| SD2 + Composable [2] | 40.63 | 32.99 | 36.45 | 8.00 | 29.80 | 28.98 |
| SD2 + Structured [3] | 49.90 | 42.18 | 49.00 | 13.86 | 31.11 | 33.55 |
| SD2 + Attn Exct [4] | 64.00 | 45.17 | 59.63 | 14.55 | 31.09 | 34.01 |
| SD2 + GORS unbiased [5] | 64.14 | 45.46 | 60.25 | 17.25 | 31.58 | 34.70 |
| SD2 + GORS [5] | 66.03 | 47.85 | 62.87 | 18.15 | 31.93 | 33.28 |
| SD2 + TSM (Ours) | 75.93 | 54.34 | 67.44 | 18.34 | 31.47 | 34.20 |
[1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022a.
[2] N. Liu, S. Li, Y. Du, A. Torralba, and J. B. Tenenbaum, “Compositional visual generation with composable diffusion models,” in ECCV, 2022.
[3] W. Feng, X. He, T.-J. Fu, V. Jampani, A. Akula, P. Narayana, S. Basu, X. E. Wang, and W. Y. Wang, “Training-free structured diffusion guidance for compositional text-to-image synthesis,” in ICLR, 2023.
[4] H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or, “Attend-and-Excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Trans. Graph., 2023.
[5] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
The authors propose a novel LoRA-based fine-tuning method to efficiently adapt large diffusion-based image generation models. They identify limitations in LoRA-tuned models, noting that diffusion models process noisy inputs differently at various timesteps, which restricts their generative capability. To address this, the paper introduces the TimeStep Master (TSM) paradigm, which incorporates different LoRA matrices for distinct timesteps, enabling the learning of various processing modes. TSM comprises two main stages: a fostering stage, which segments the training process into timestep intervals, each using a different LoRA module, and an assembling stage, which combines the TimeStep LoRA experts, facilitating core-context collaboration. TSM achieves state-of-the-art performance in domain adaptation, post-pretraining, and model distillation tasks. Furthermore, it demonstrates robust generalization across diverse model architectures and visual data modalities.
Strengths
The authors effectively address the limitations of a single shared LoRA across the entire diffusion process by partitioning it into multiple intervals, enhancing adaptability. They recognize the efficiency benefits of shared timestep experts, avoiding the need for individual experts for each timestep and proposing a collaborative learning function to maintain efficiency. The paper is well-structured and provides compelling qualitative and quantitative results.
Weaknesses
The rationale for using different scales of intervals in the assembling stage is not clearly explained. The paper lacks a comparison between the asymmetrical MoE (Mixture of Experts) and the standard MoE.
Some other comments:
In Figure 3, if only a single scale with n=8 intervals were used, would the vanilla MoE already be capable of learning how to combine the context experts with the core expert through top-2 gating?
In addition, the LoRA projection and reconstruction aim to find the best rank for attention. It seems that the rank is chosen as a hyperparameter. It is not clear why such a choice is made if this is supposed to be the real rank of the target matrix. In addition, it is hard to justify why different blocks should share the same rank.
Questions
See weakness
- Question: It is not clear why such a choice is made if this is the real rank of the target matrix. In addition, it is hard to justify why different blocks share the same rank.
The rank of LoRA is not the real rank of the parameter matrix. Instead, LoRA fine-tunes the parameter matrix by approximating its update with a low-rank matrix. Our setting of r directly follows the rank used in previous work [1, 2], which also reflects the generality of our method. In addition, in the table below, we add an ablation on r versus model performance when n=8. It is clear that, as r increases, the performance of the model also increases significantly. For the sake of simplicity, using the same rank in different blocks allows us to quickly start training a new model; searching for the optimal rank for each block would incur a relatively large computational cost.
| Rank | Color | Shape | Texture | Spatial | Non-spatial | Complex |
|---|---|---|---|---|---|---|
| SD1.5 | 36.97 | 36.27 | 41.25 | 11.04 | 31.05 | 30.79 |
| 1 | 54.63 | 44.66 | 55.35 | 13.23 | 31.66 | 31.84 |
| 4 | 56.48 | 45.91 | 57.08 | 18.01 | 31.77 | 32.79 |
| 16 | 57.86 | 45.99 | 58.13 | 14.11 | 31.82 | 33.20 |
| 64 | 59.37 | 46.67 | 58.99 | 15.40 | 31.86 | 33.59 |
[1] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
[2] Tianwei Yin, Michael Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. arXiv preprint arXiv:2405.14867, 2024a.
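To make the distinction concrete, standard LoRA freezes the pretrained weight and learns a low-rank update scaled by alpha/r, so r is a capacity hyperparameter rather than the true rank of the weight matrix. Below is a minimal self-contained sketch with illustrative dimensions (not our exact implementation):

```python
# Minimal LoRA linear layer: frozen base weight + trainable rank-r update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)              # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable, rank r
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(768, 768, r=4)
out = layer(torch.randn(2, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)   # torch.Size([2, 768]) 6144
```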
I first thank the authors for providing the additional information.
However, I am personally not convinced by using Mixture of TimeStep LoRA. While this is an interesting idea, I think the performance largely depends on the choice of rank, which cannot be explained. So, it somehow looks like I have more experts with varying ranges, so I have more parameters for processing, and therefore better results.
I agree with the authors that LoRA only approximates true low-rank space. In this work, using different approximations or making different assumptions on the rank must lead to controversial information competing from one scale to another. However, this field is not discussed or investigated.
Finally, I don't really agree with the authors' statement: "It is obvious that, as r increases, the model's performance also increases significantly". If this is the case, it is again contradictory to the proposed Mixture of TimeStep LoRA Experts with smaller ranks for "fine-grained details".
To conclude, I think this paper investigates the right direction. However, there are still many issues that are not clear in the current version. So I tend to keep my original ranking.
Thank you for your response. In the following reply, I will try to address the concerns you raised regarding our work as clearly as possible.
In your response, you mentioned: “it somehow looks like I have more experts with varying ranges, so I have more parameters for processing, and therefore better results.” However, this is not the case. Please refer to the ablation study on TSM 1-stage in Table 6 of our paper: the premise of this ablation study is that the training cost remains the same, even if the number of parameters differs. For instance, under the setting of n=1, r=4, LoRA is trained for 32,000 steps, whereas under the setting of n=8, r=4, each LoRA is only trained for 4,000 steps, because 8 * 4,000 steps = 32,000 steps. This means that what we have discovered is essentially a better optimization algorithm for diffusion models, specifically optimizing timesteps in segments.
Your statement that we simply have more parameters to process and therefore achieve better results does not hold because our training cost is identical. To further clarify, let me provide an example: the parameter size of GPT-3 is over 100 times larger than GPT-2. However, if GPT-3 were trained with the same computational cost as GPT-2, it would not receive sufficient training, even though it has significantly more parameters. Research into scaling laws focuses on how to optimize large language models under the same training cost to achieve better performance, rather than merely increasing the number of parameters to make the model stronger. Similarly, we believe that the essence of TSM lies in proposing a better optimization algorithm, one that enables us to train superior models while maintaining the same training cost.
Regarding the viewpoint you disagreed with, “It is obvious that, as r increases, the model's performance also increases significantly,” I apologize for not making this point clear in my earlier response. What this experiment aims to demonstrate is that, under the same number of training steps, the model's performance continues to improve as r increases. This does not contradict the explanation we provided above. In this particular experiment, the training cost is not equal, because a larger r results in higher memory consumption and longer training times for the same number of training steps.
Regarding the ablation study on r, we aim to show that the choices of n and r are not contradictory. Fine-grained timestep division can also be combined with larger r values, paired with acceptable increases in computational cost, to further optimize the model's performance.
Additionally, even though our model's parameter size has increased, the inference time and memory consumption have not increased significantly, as shown in the table below:
| Method | Time (s) | Memory (GB) |
|---|---|---|
| SD1.5 | 8.35 | 2.04 |
| SD1.5+TSM (ours) | 8.52 | 2.04 |
We believe that TSM proposes a fine-tuning optimization algorithm that is more suitable for diffusion models. This algorithm achieves better results under a fair comparison of identical computational costs. Furthermore, we conducted extensive experiments to validate the adaptability of this optimization algorithm (across three types of diffusion model architectures, two modalities, and three common tasks). These experiments comprehensively demonstrate the applicability of TSM, which can be seamlessly substituted into any scenario where LoRA is used for diffusion models, without incurring additional costs, thereby improving performance.
Thank you for your valuable reviews. We will address your questions on these weaknesses one by one.
- Question: The reason for using different scales of intervals in the assembling stage is not clearly explained.
According to our exploration during the experiments, the feature distributions inside the diffusion model differ greatly under different timestep conditions. As noted in lines 072-075 of our paper, previous work [1, 2] has also shown substantial variance discrepancies between the model's intermediate features under different timestep conditions. Therefore, we believe it would be challenging to fit such widely divergent distributions using identical parameters. This reasoning led us to propose TSM.
[1] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[2] Xinyin Ma, Gongfan Fang, and Xinchao Wang. DeepCache: Accelerating diffusion models for free. In CVPR, 2024.
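As a quick numerical illustration of this point, the signal-to-noise ratio of the noised input x_t already spans several orders of magnitude across timesteps under the common linear beta schedule (the schedule values below are the usual defaults, assumed rather than taken from the paper):

```python
# SNR of x_t across timesteps under a standard linear beta schedule (assumed defaults).
import torch

betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
snr = alpha_bar / (1.0 - alpha_bar)

for t in (0, 250, 500, 750, 999):
    print(f"t={t:>3}  SNR={snr[t].item():.5f}")
```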
- Question: The paper lacks a comparison between the asymmetrical MoE (Mixture of Experts) and the standard MoE.
In the following table, we add a comparison between MoE LoRA trained from scratch and TSM, where TSM still achieves the best results. Furthermore, in Table 7 of our paper, building on the well-trained TimeStep LoRA experts from TSM 1-stage, we compare the asymmetric MoE with the symmetric MoE under three model architectures. The experiments show that the asymmetric MoE is better than the common symmetric MoE.
| Method | Color | Shape | Texture | Spatial | Non-spatial | Complex |
|---|---|---|---|---|---|---|
| SD1.5 | 36.97 | 36.27 | 41.25 | 11.04 | 31.05 | 30.79 |
| SD+MoE LoRA | 52.26 | 39.83 | 51.38 | 11.71 | 31.34 | 32.00 |
| SD1.5+TSM 1-stage | 56.48 | 45.91 | 57.08 | 18.01 | 31.77 | 32.79 |
| SD1.5+TSM 2-stage (symmetric) | 54.56 | 45.52 | 56.30 | 17.90 | 31.78 | 33.27 |
| SD1.5+TSM 2-stage | 57.59 | 46.18 | 57.69 | 17.91 | 31.82 | 32.78 |
- Question: In Figure 3, if only a single scale with n=8 intervals were used, would the vanilla MoE already be capable of learning how to combine the context experts with the core expert through top-2 gating?
Thank you for your thoughts on our paper and your suggestions for improvements. In fact, we have done this experiment before, and the results are shown in the table below. Our conclusion is that if LoRA trained under the same interval is used as experts in MoE, the performance of the model cannot continue to improve. We speculate that it is because LoRA trained under the same interval can only specialize in the timestep condition within the training range. For example, LoRA trained under large noise levels does not help LoRA trained under small noise levels. Therefore, we use LoRA trained under different intervals as MoE experts in TSM. A large number of experimental results in the paper also prove that our asymmetric design is effective.
| Method | Color | Shape | Texture | Spatial | Non-spatial | Complex |
|---|---|---|---|---|---|---|
| SD1.5+TSM 1-stage | 56.48 | 45.91 | 57.08 | 18.01 | 31.77 | 32.79 |
| SD1.5+MoE (experts initialized from n=8 TSM 1-stage) | 51.64 | 44.24 | 53.08 | 14.31 | 31.72 | 33.42 |
| SD1.5+TSM 2-stage | 57.59 | 46.18 | 57.69 | 17.91 | 31.82 | 32.78 |
| PixArt-α+TSM 2-stage | 52.84 | 43.92 | 54.07 | 25.35 | 31.03 | 35.04 |
| PixArt-α+MoE (experts initialized from n=8 TSM 1-stage) | 45.23 | 42.69 | 52.30 | 24.75 | 30.99 | 34.86 |
| PixArt-α+TSM 1-stage | 54.66 | 44.47 | 57.12 | 25.41 | 31.04 | 34.85 |
We sincerely appreciate the time and effort you have dedicated to reviewing our paper! Your valuable feedback has been carefully considered, and we will provide point-by-point responses to your reviews in the respective replies. We remain open to any additional feedback you may have. If you feel it is appropriate based on our responses, we would be extremely grateful if you could consider raising our score.
The paper introduces TimeStep Master (TSM), a novel paradigm for fine-tuning diffusion models in vision tasks that uses asymmetrical mixtures of timestep-specific LoRA experts to improve versatility and efficiency, achieving state-of-the-art results across various architectures and data modalities.
The reviewers have expressed concerns regarding the rationale behind the choice of different scales of intervals, the comparison with standard MoE, increased memory usage and latency during inference, the need for more detailed explanations, further comparisons with non-LoRA approaches, and relatively modest improvements in some metrics compared to vanilla LoRA. Despite the authors' efforts in the rebuttal, some concerns remain unresolved. Therefore, the final majority of negative ratings leads to a rejection of this submission.
Additional Comments from the Reviewer Discussion
- Reviewer uR7t questions the rationale behind the choice of different scales of intervals and the comparison with standard MoE, despite acknowledging the novelty and effectiveness of TSM in addressing the limitations of a single shared LoRA across the entire diffusion process.
- Reviewer iqHj raises concerns about increased memory usage and latency during inference, along with the need for more detailed explanations of the method, while appreciating the identification of a key limitation in using a single LoRA for fine-tuning diffusion models.
- Reviewer msCv suggests further comparisons with non-LoRA approaches and additional experiments to confirm the method's advantages, recognizing the potential in exploring different LoRA modules for distinct timesteps.
- Reviewer E3zt commends the SOTA results achieved by TSM but expresses reservations about the complexity and scalability of the two-stage approach, as well as relatively modest improvements in some metrics compared to Vanilla LoRA.
After the rebuttal and discussions, all reviewers kept their ratings unchanged.
Reject