PaperHub

Average rating: 5.4 / 10 · Decision: Rejected · 5 reviewers
Ratings: 6, 5, 6, 5, 5 (min 5, max 6, std 0.5) · Average confidence: 3.2
ICLR 2024

Fair Text-to-Image Diffusion via Fair Mapping

OpenReview · PDF
Submitted: 2023-09-20 · Updated: 2024-02-11

Abstract

Keywords
Fairness, Social Good, Text-to-Image Diffusion Models, Bias Mitigation

Reviews and Discussion

Review
Rating: 6

This paper aims to address the demographic bias in existing text-to-image diffusion models. The paper proposes Fair Mapping, a lightweight module that transforms the input prompts to achieve fair image generation. The training process only involves an additional linear mapping network that projects the language embedding into an unbiased space for generation. The method leads to fairer and more diverse generated images.
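To make the described architecture concrete, here is a minimal sketch of what such a lightweight mapping module could look like on top of a frozen text encoder; the framework (PyTorch), names, and dimensions are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of a lightweight mapping module applied to frozen
# text-encoder embeddings before they condition the diffusion model.
# Names, shapes, and the overall setup are illustrative assumptions.
import torch
import torch.nn as nn

class FairMapping(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Only this linear projection is trained; the text encoder and the
        # diffusion U-Net remain frozen.
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: [batch, seq_len, embed_dim] from a frozen text encoder (e.g. CLIP).
        return self.proj(text_emb)

# Usage: remap the frozen encoder's output, then pass it to the diffusion
# model as its conditioning signal in place of the original embedding.
mapper = FairMapping(embed_dim=768)
dummy_emb = torch.randn(2, 77, 768)      # stand-in for CLIP text embeddings
debiased_emb = mapper(dummy_emb)         # same shape; fed to the frozen U-Net
```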

Strengths

  • This work performs investigations on the bias of text-guided diffusion models and concludes that the main bias related to the generation comes from the language embeddings. This observation can potentially inspire future research in evaluating and relieving biases in text-to-image generation.
  • This work proposes a lightweight mapping network that turns the input embeddings that may be biased into unbiased embeddings for generation. This module is simple and easy to understand, yet effective according to the qualitative and quantitative evaluation results.
  • This work also proposes a fairness evaluation metric for text-guided human-related image generation, which allows comparing the bias reduction across different de-biasing methods. This could also be useful for future research.

Weaknesses

  • This work assumes that the bias can be corrected with an appropriate text embedding, which in turn assumes that the diffusion model is able to generate the specified human characteristics with diverse demographic properties. However, it is possible that the bias in the association is so strong (due to insufficient or biased training data) that the model could not generate the correct image, despite being asked for a specific property. For example, if there are no female plumbers in the training set, the model may not be able to generate the specified image even if "female plumbers" are explicitly asked for in the prompt. Since the method only trains a linear projection head, it is unlikely to be able to generate unbiased images.
  • The method relies on a set of predefined sensitive keywords to enable transformation. However, it is hard to make this set of keywords exhaustive, and since this is a matching-based method, the checks may be skipped if some typos are introduced.
  • The fair mapping method may lead to a loss of details from the prompt compared with its baselines; as shown in Sec 4.4, some facial expressions are not generated according to the text prompts.

Questions

The authors are encouraged to respond and address the weaknesses above.

  • If the model itself cannot generate some image characteristics due to insufficient training examples (e.g., "female plumbers", despite being clearly specified in the text), is the method still applicable?
  • How is the set of keywords for activating the linear mapping model defined?
Comment

We would like to express our gratitude for taking the time to review our work.

Reply to W1

We acknowledge that if certain specific data are lacking, our method may become less effective. However, we do not think this is a reason to reject our paper, since every post-processing fairness method (the category our method belongs to) becomes less effective when training data are lacking, even for classical classification tasks. Thus, all current fairness methods assume that the training data is imbalanced or long-tailed; they cannot handle cases like "there are no female plumbers in the training set".

Reply to W2

Yes, you are correct: in our problem we need to pre-define a set of keywords that is used for our training prompts, so our method is tailored to the case where the keyword set is finite. Current research predominantly revolves around designing limited, well-defined benchmarks for evaluation rather than open-world environments.

Regarding typos, there are two cases. (1) If the typo is unintentional, our detector network can often correct it: the detector uses semantic similarity to assess the input, so we can effectively handle the issue you mention as long as semantic errors in the input, such as typos, do not significantly change the output of the original diffusion model. (2) If the typo is intentional, or is generated by adversarial attacks, then it is out of the scope of this paper, as this is a common issue for all text-to-image diffusion models. Notably, altering the text embedding is a widely used way for attackers to manipulate text-to-image diffusion systems, which goes far beyond introducing mere unfairness in the output. Such attacks pose substantial threats to the integrity and reliability of text-to-image models.

Reply to W3

Although the fair mapping method may prioritize reducing biases, it can inadvertently sacrifice some fine-grained details or specific prompt-related features, which is also known as the price of fairness. In our work, for image quality, we show that the alignment between generated images and human-related content is very close to that of the vanilla diffusion models. Future work can explore alternative methods or modifications to the fair mapping approach to mitigate the loss of details while still achieving fairness and unbiased image generation.

Reply to Q1

Please see our response to Weakness 1.

Reply to Q2

In Appendix D.1 of the revised version, we have listed all the keywords used in the paper. These keywords are carefully chosen to capture the bias associated with human-related descriptions in the context under investigation. It is worth noting that, to the best of our knowledge, at the time of our research there was no unified benchmark available for this specific task. Therefore, in our experimental setup, the set of keywords used to activate the linear mapping model for debiasing is defined based on statistics of LAION-5B, drawing inspiration from previous work [1].

References

[1] Friedrich F., Schramowski P., Brack M., et al. Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness. arXiv preprint arXiv:2302.10893, 2023.

Comment

Thanks for the response and updates to the work, which clarifies the work, especially its scope. The reviewer remains positive about this work and still votes for acceptance.

Comment

Dear Reviewer,

Thank you so much for your time and effort in reviewing our paper. We have addressed your comments in detail and are happy to discuss further if there are any additional concerns. We look forward to your feedback and would greatly appreciate it if you would consider raising the score.

Thank you,

Authors

Review
Rating: 5

This paper introduces the concept of "Fair Mapping" to mitigate bias in text-to-image diffusion models. It addresses the issue of language-induced biases in these models, particularly when generating images based on human-related descriptions. To achieve this goal, in addition to maintaining text consistency, the authors also introduce a fairness penalty to encourage unbiased output. Through a set of experiments, the authors show that their approach can significantly reduce the bias in the text embedding space compared with other diffusion techniques.

Strengths

The paper introduces a novel method for integrating fairness into conventional diffusion models, offering a valuable means to mitigate language bias in various contexts. The writing in the paper is clear, accompanied by numerous helpful illustrations, making it an enjoyable read.

Weaknesses

One weakness of the paper is that it doesn't clearly elucidate the significance of its results. For instance, it would be beneficial to provide a more in-depth explanation of why we should be concerned with the specific concept of fairness addressed in this paper and to outline potential practical applications of the findings. Moreover, in comparison to other works on fair data generation, it would be helpful to highlight the primary advancements made in this paper and explain their significance. Although I acknowledge that this paper primarily focuses on methodology, I believe that readers would greatly benefit from a clearer understanding of the motivation behind the research and its possible real-world applications. I should note that I am not an expert in this research area, so I may not fully grasp the significance of this work. It's possible that the contributions are evident to experts in this field, but making them more explicit would enhance the paper's accessibility.

Questions

For the abstract and introduction, it would be helpful to provide more motivations and maybe mention the paper's possible applications.

The related work seems a bit unclear to me. I am not familiar with this literature but I feel more detail is needed, especially about previous works on fair data, to understand the contributions of this paper.

Comment

Thank you for your feedback and suggestions regarding the motivation of our paper.

Reply to Q1

In our work, we address the limitations of existing text-to-image diffusion models in generating demographically fair results when given human-related descriptions. For example, text-to-image diffusion models prefer to generate male images when the input contains a description such as "doctor". We therefore alleviate the implicit bias in language to achieve fair generation. To better address your concerns, we have reorganized the paper in the revised version: (1) We reorganized the introduction to better convey the motivation and significance of our work; please refer to the modified introduction in Section 1 on Page 2. (2) We added one more section on the details of the existence of language bias in text-to-image diffusion models; please see Appendix B on Page 15. We hope this addresses your concerns.

Reply to Q2

Thank you for pointing out the lack of clarity in the related work section. To provide a better understanding of the contributions of the paper, we have revised the related work section to include more details about previous works on fair data. We have provided a comprehensive overview of the pre-processing, intra-processing, and post-processing approaches in the related literature. Furthermore, we have specifically compared our work, which falls under the post-processing category, with the current mainstream post-processing methods for fair data generation. By highlighting these comparisons, we can better illustrate the distinctiveness and advantages of our approach in achieving fair data generation. This should help readers who may not be familiar with this literature to grasp the context and significance of the paper's contributions.

Comment

Dear Reviewer,

Thank you so much for your time and effort in reviewing our paper. We have addressed your comments in detail and are happy to discuss further if there are any additional concerns. We look forward to your feedback and would greatly appreciate it if you would consider raising the score.

Thank you,

Authors

Comment

Dear Reviewer 4gR2,

As the author-reviewer discussion period will end soon, we would appreciate it if you could check our response to your review comments. This way, if you have further questions or comments, we can still reply before the discussion period ends. If our response resolves your concerns, we kindly ask you to consider raising the rating of our work. Thank you very much for your time and effort!

Review
Rating: 6

This paper introduces a method that addresses diversity limitations of text-to-image diffusion models caused by language biases. The proposed approach, Fair Mapping, involves training a mapping network that tries to preserve the original semantics specified in the text while increasing the representation of sensitive groups in the generated images. The mapping network is lightweight as it operates on embeddings from a frozen text encoder. The authors evaluate their method on human-centric generation and demonstrate that Fair Mapping mitigates biases rooted in occupations and emotions.

Strengths

  • The proposed method is simple and easy to understand.
  • The mapping network is trained on top of a frozen text encoder, making it widely applicable to other text-conditioned models.
  • The experiments demonstrate that Fair Mapping reduces language biases and improves the generation of more diverse people while maintaining the semantics outlined in the input prompt.
  • Ablation studies are provided to highlight the significance of both loss terms in the training objective.

Weaknesses

  • In Table 1, the delta in improvement of Fair Mapping over the baselines is relatively small for race. It is difficult to understand what a 0.01 improvement actually looks like in terms of qualitative performance.
  • The authors mention that the value of the loss weight hyperparameter can affect the visual quality. Including image quality metrics like FID would be helpful to quantify how much degradation is introduced because of the debiasing network.
  • Some details that are necessary for clarity and context in the main paper have been moved to the appendix. For example, Human-CLIP is a newly introduced metric but the details are not discussed at all in the main paper. Additionally, no context for the human preference evaluations is provided (e.g., at the minimum, clarify whether higher or lower is better in Table 4).
  • It is unclear why no evaluations are provided in the second row of Table 2 (L_{fair} only).

Questions

  • How is the mapping network initialized? In row 3 of Table 2, it is unclear why L_{text} alone would lead to more "fair" results since it is just trying to minimize the projected embedding to the original. Shouldn't the performance be the same as row 1 if the loss is optimized?
  • Have the authors experimented with just optimizing d(v, v_j) rather than (d(v, v_j) - \bar{d(v, -)}) in Equation 2?
  • For clarity, is there a separate Fair Mapping module for each occupation and emotion, or is there one shared across all?
Comment

Reply to Q1

In this paper, we employ the widely used random initialization. We observe that when optimizing with L_{text} alone, the generated images exhibit a stronger ability to produce face-related content. We believe that minimizing the difference between the projected embeddings and the original embeddings places greater emphasis on maintaining the consistency of inherent attributes while ignoring sensitive attributes, which reduces biases in the generated results. In other words, by minimizing the discrepancy between the projected and original embeddings, the L_{text} loss encourages the model to prioritize the preservation of non-sensitive features during image generation, which helps mitigate over-reliance on sensitive attributes and subsequently reduces the related biases.

Reply to Q2

We have indeed experimented with optimizing only d(v, v_j) in Equation 2, without subtracting \overline{d(v,\cdot)}. However, the observed outcomes of these experiments were significantly adverse, leading to distorted generated images and a notable loss of semantic coherence. This suggests that incorporating \overline{d(v,\cdot)} in Equation 2 plays a crucial role in achieving more favorable results, such as maintaining image quality and preserving semantic consistency.

Reply to Q3

In our experiments, all occupations share one Fair Mapping module and all emotions share another. Different detectors are defined as different agents to serve user inputs: we employ multiple detectors to identify and classify the prompts provided by users, and each detector specializes in detecting specific attributes or features related to occupations or emotions. These detectors play a crucial role in the overall system, as they determine the inputs that are subsequently processed and mapped by the Fair Mapping modules.

Comment

Dear Reviewer,

Thank you so much for your time and effort in reviewing our paper. We have addressed your comments in detail and are happy to discuss further if there are any additional concerns. We look forward to your feedback and would greatly appreciate it if you would consider raising the score.

Thank you,

Authors

Comment

We would like to extend our heartfelt gratitude to you for your valuable feedback and constructive suggestions.

Reply to W1

Compared to the baseline model, our method shows a minimum improvement of 10% and a maximum improvement of 54%. Due to the broad definition of race and the high efficiency of diffusion models in generating faces based on occupations, the model tends to exhibit results that are more fair in terms of race compared to gender. Therefore, achieving a 10% improvement is actually a significant amount.

Reply to W2

The reason we do not use FID is that we lack real images for comparison, and the Inception network is not well suited to our human-related tasks. (1) FID measures the distance between the feature distributions of real and generated images; it is widely used for evaluating generative models whose parameters are all trained on real images, and it therefore requires real images in the training data. In our method, however, we only fine-tune a few parameters with prompts rather than thousands of real images. (2) FID assumes that the extracted image features follow a multivariate normal distribution, but the features of generated images in human-related tasks like ours do not necessarily conform to this assumption.

Therefore, we use the traditional CLIP score to evaluate the alignment between text and generated images. To address the fact that CLIP is not well suited to human-related descriptions, we designed Human-CLIP, which addresses the alignment issues CLIP has in face generation tasks. Although our method shows a slight decrease in CLIP score, we have demonstrated that it performs better in face generation tasks, and Human-CLIP overcomes the limitations of existing evaluation metrics.
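For context, the CLIP alignment referred to here is typically computed as the cosine similarity between CLIP's text and image embeddings. Below is a minimal sketch of that generic metric using the Hugging Face CLIP API (the checkpoint choice is illustrative); it is not the proposed Human-CLIP variant, whose details are in the paper's appendix.

```python
# Generic CLIP text-image alignment score (cosine similarity), as referenced
# above. This is the standard metric, not the proposed Human-CLIP variant.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Normalize, then take the dot product (= cosine similarity).
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum(dim=-1))
```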

Reply to W3

Due to limited space, some details about the image-quality metrics are discussed in the Appendix. Human-CLIP is specifically designed to evaluate the quality of generated images based on human perception, leveraging the knowledge and capabilities of CLIP. The scale for the Human Preference score is provided as a reference for interpreting the scores assigned by human evaluators during preference ranking; the specific details of the scale, including its range and corresponding interpretations, are given in the Appendix. We will also give an easier-to-follow statement of these points in the revised version. Please see D.2 on Page 18 and E.4 on Page 22 in the Appendix. We will move them to the main text in the revised version if additional pages are allowed.

Reply to W4

The reason is stated in the ablation study in Section 4.3, which you can kindly refer to. The overall efficacy of our optimization is compromised in the absence of L_{text}, as the total loss cannot adequately capture the semantic content embedded in the textual representations. Without L_{text}, the diffusion model tends to produce images lacking crucial semantic details, resulting in blurriness and distortion. Consequently, reporting metrics for these experiments would be superfluous under such conditions.

Comment

Thank you for your clarifications of the weaknesses and questions raised in the original review. For the response to W4, I did note the justification in section 4.3 but I felt the metrics could still be provided as a point of reference. Regardless, the original rating will hold.

Comment

Dear Reviewer ujPt,

For the response to W4, I have further clarification to provide. At this stage, when an image is distorted to the extent that it no longer contains recognizable objects or basic semantic information, including gender, it becomes challenging to conduct fairness comparisons. Images consisting of patterns or random noise lack the semantic content necessary for assessing fairness; when images no longer contain this attribute information, fairness comparisons become impractical.

If you have any other questions or need further discussion, please feel free to let me know. Hope you have a nice day!

Review
Rating: 5

This paper addresses human-related bias in text-to-image diffusion models and resolves the issue by proposing a novel fair mapping module which outputs a fair text embedding. The module can be trained on top of a frozen pre-trained text encoder, and inserting it during sampling successfully mitigates textual bias. Training the fair module involves two loss terms: (i) a text consistency loss, which preserves semantic coherence, and (ii) a fair distance penalty, which brings the output embeddings of different sensitive groups close together. Further, the authors propose a novel evaluation metric, FairScore, which aims to capture the conditional independence of the sensitive group information from the text prompt.
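As a rough illustration of how the two loss terms summarized above could be combined during training, consider the following sketch; the specific distance functions, the weighted-sum combination, and the weight `lam` are assumptions, with the fairness penalty following the L_fair formula quoted later in this discussion thread.

```python
# Sketch of a two-term objective matching the description above: a text
# consistency loss plus a fairness distance penalty. The Euclidean distances,
# the weighted sum, and `lam` are illustrative assumptions.
import torch

def text_consistency_loss(v_proj: torch.Tensor, v_orig: torch.Tensor) -> torch.Tensor:
    # v_proj, v_orig: [batch, dim]; keep the mapped embedding close to the original.
    return torch.norm(v_proj - v_orig, dim=-1).mean()

def fair_distance_penalty(v_proj: torch.Tensor, attr_embs: torch.Tensor) -> torch.Tensor:
    # attr_embs: [num_sensitive_attrs, dim] embeddings of the sensitive attributes.
    # Penalize unequal distances from the mapped embedding to each attribute.
    d = torch.cdist(v_proj, attr_embs)                    # [batch, num_attrs]
    dev = d - d.mean(dim=1, keepdim=True)                 # d(v, v_j) - mean_j d(v, .)
    return torch.sqrt((dev ** 2).mean(dim=1)).mean()

def total_loss(v_proj, v_orig, attr_embs, lam: float = 1.0) -> torch.Tensor:
    return text_consistency_loss(v_proj, v_orig) + lam * fair_distance_penalty(v_proj, attr_embs)

# Toy usage with random tensors standing in for embeddings.
v, v0, attrs = torch.randn(4, 768), torch.randn(4, 768), torch.randn(2, 768)
print(total_loss(v, v0, attrs))
```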

Strengths

  • The paper tackles a timely and practically-relevant problem supported by a fair amount of experiments. Building fair diffusion models is an area with limited prior research, making this work particularly valuable.
  • The proposed method is simple yet effective, and pluggable without modifying the pre-trained model.
  • Overall, the paper is clearly written and easy to follow.

Weaknesses

  • Although the paper covers a good amount of relevant previous studies, it lacks baseline experiments. For example, even though [1] focuses on fair guidance while this work focuses on a pluggable mapping module, the authors could compute FairScore for [1] and compare w.r.t. training time, memory overhead, etc.
  • While the unfairness is largely resolved through the proposed mapping module, such a result may not come as a surprise since FairScore and the employed fairness loss term are quite similar.
  • The authors note that a detector network is employed to identify predefined sensitive keywords in the input prompts. There is no additional detailed explanation about the detector network.
  • This method explicitly needs a labeled dataset to mitigate the demographic bias in diffusion models. However, in real-world scenarios, it may be challenging to identify and address all potential types of bias comprehensively. Further, there are remaining questions regarding whether it is feasible to (i) simultaneously eliminate multiple types of bias or (ii) sequentially address multiple biases without negatively impacting performance. If such challenges cannot be properly addressed, it would incur a significant amount of training time to erase all types of biases, and heavy memory cost to save all mapping modules corresponding to each bias type.

Questions

  • How many random seeds are used throughout the experiments?

[1] Friedrich et al., “Instructing Text-to-Image Generation Models on Fairness.” 2023.

Comment

We would like to express our sincere gratitude to you for your valuable feedback and insightful comments on our work.

Reply to W1

We acknowledge the importance of including baseline experiments and performance comparisons. It is important to note that at the time of our study, conducted several months ago, there was a scarcity of publicly available open-source code in the field of fair diffusion model generation, which made it challenging to conduct direct performance comparisons with other approaches. However, we are pleased that the mentioned work, including its code, has recently been made openly accessible. In the revised version of our paper, we provide a comprehensive comparison with that work along with detailed explanations of the comparison methodology. We elucidate its limitations in terms of the degree of human intervention required and demonstrate our method's superiority in generation speed, as shown below.

| Models | Time (seconds) |
| --- | --- |
| Stable Diffusion | 424 |
| Fair Diffusion | 1463 |
| Fair Mapping (our method) | 434 |

Meanwhile, we also evaluate the alignment and diversity of Fair Diffusion. The experimental results are copied from Table 22 in Appendix E.2.

| Models | Occupation CLIP-Score | Occupation Human-CLIP | Occupation Diversity | Emotion CLIP-Score | Emotion Human-CLIP | Emotion Diversity |
| --- | --- | --- | --- | --- | --- | --- |
| Fair Diffusion-Gender | 0.2274 | 0.1348 | 13.87 | 0.1894 | 0.1298 | 1.43 |
| Fair Diffusion-Race | 0.2239 | 0.1292 | 13.79 | 0.1882 | 0.1266 | 1.39 |

Please see Pages 21-22 for more experimental details.

Reply to W2

We kindly cannot agree that FairScore and the fair loss are similar. Note that for any keyword $c_k$, its FairScore is defined as

$$\text{FairScore}(c_k)= \sqrt{\frac{1}{|A|} \sum_{a_i\in A } \left( \mathrm{DBias}_{a_i}(c_k) \right)^2}, \quad \text{where} \quad \mathrm{DBias}_{a_i}(c_k)=P\left(s=a_i \mid c=c_k\right)- \frac{1}{|A|} \sum_{a_j\in A} P\left( s=a_j \mid c=c_k\right).$$

The fair loss is defined by

$$L_{fair} = \sqrt{\frac{1}{|A|} \sum_{a_j \in A} \left( d(v,v_j) - \overline{d(v,\cdot)} \right)^2},$$

where $d(v,v_j)$ represents the Euclidean distance between the native embedding $v$ and the specific sensitive attribute embedding $v_j$, and $\overline{d(v,\cdot)}$ refers to the average distance between the native embedding $v$ and all the sensitive attribute embeddings $v_j$.

We can see that FairScore is defined via conditional probabilities, while the fair loss is defined via Euclidean distances between the embedded vectors after being transformed by our linear network. There is no evidence or theory supporting the equivalence or similarity of these two entirely different quantities (conditional probabilities over generated images vs. Euclidean distances between embedded vectors).
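To make the distinction concrete, here is a small sketch of how FairScore could be computed from the attributes detected in generated images; the attribute-detection step (e.g., a face-attribute classifier) is assumed and left abstract, and only the aggregation follows the formula above.

```python
# Concrete sketch of the FairScore side of this contrast: it aggregates the
# empirical conditional probabilities P(s = a_i | c = c_k) of sensitive
# attributes detected in the generated images, whereas L_fair operates on
# Euclidean distances between embeddings. Attribute detection is assumed.
import math
from collections import Counter

def fair_score(detected_attrs: list, attribute_set: list) -> float:
    counts = Counter(detected_attrs)
    n = len(detected_attrs)
    probs = [counts.get(a, 0) / n for a in attribute_set]     # P(s=a_i | c=c_k)
    mean_p = sum(probs) / len(attribute_set)
    dbias = [p - mean_p for p in probs]                       # DBias_{a_i}(c_k)
    return math.sqrt(sum(b * b for b in dbias) / len(attribute_set))

# Example: 100 images generated for one keyword, 70 detected male, 30 female.
print(fair_score(["male"] * 70 + ["female"] * 30, ["male", "female"]))   # ≈ 0.2
```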

Comment

Reply to W3

We have added detailed content about the detector in the revised version of our paper; please refer to the detailed exposition and algorithm on Page 16 in Appendix C. In that section, we describe the operational principles of our detector. We introduce an additional detector that adapts the user's input prompt by finding the training prompt that is semantically consistent with it. The detector calculates the similarity distance between the input prompt and the predefined prompts using a pre-trained encoder and selects the training prompt with the smallest distance. Meanwhile, the detector checks whether the matched training prompt contains any sensitive attributes. If it does not, the prompt is processed through the fair mapping linear network for debiasing.
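A rough sketch of the matching step described above is given below, assuming a sentence-embedding model for the semantic similarity; the encoder, threshold, prompt list, and keyword list are placeholders rather than the authors' exact choices.

```python
# Sketch of the detector's nearest-prompt matching described above. The
# sentence encoder, threshold, prompt list, and keyword list are placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
training_prompts = ["a photo of a doctor", "a photo of a plumber"]   # illustrative
sensitive_words = {"male", "female", "man", "woman"}                  # illustrative

def route_prompt(user_prompt: str, threshold: float = 0.7):
    """Return (prompt to condition on, whether to apply the fair mapping)."""
    query = encoder.encode(user_prompt, convert_to_tensor=True)
    corpus = encoder.encode(training_prompts, convert_to_tensor=True)
    sims = util.cos_sim(query, corpus)[0]            # semantic similarity scores
    best = int(sims.argmax())
    if float(sims[best]) < threshold:
        return user_prompt, False                    # no close training prompt found
    matched = training_prompts[best]
    # Debias only if the matched prompt does not already name a sensitive attribute.
    has_sensitive = any(w in matched.lower().split() for w in sensitive_words)
    return matched, not has_sensitive
```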

Reply to W4

We do not think this point is a weakness of our paper. First, we agree that "in real-world scenarios, it may be challenging to identify and address all potential types of bias comprehensively." Addressing all potential types of bias is currently impossible even for the classical classification problem, despite many existing studies. However, not addressing all bias issues does not mean those papers are flawed and should be rejected.

As there is only one prior work [1] on fairness in diffusion models, we believe it is valuable to study such a critical problem, and in this paper we provide the first post-processing method. Given the lack of studies on this problem, we obviously cannot address all biases in a single paper; expecting that would be too harsh.

Regarding your two questions: (i) "simultaneously eliminate multiple types of bias": as mentioned, even for classification tasks there is still no work that achieves this. However, since our method is a post-processing approach, it can be directly integrated into any other (future) fair text-to-image diffusion models to further improve fairness. In the revised version, we add more discussion of the strengths of our method in Appendix C on Page 16. (ii) "sequentially address multiple biases without negatively impacting performance": this is also a demanding question; to the best of our knowledge, whether there is a trade-off between utility and fairness is still an open problem even for classification tasks.

Reply to Q1

In our experiments, we use a single seed for all text-to-image diffusion approaches. Following Stable Diffusion, we fixed the seed to 42 throughout our experiments. As a result, the choice of seed has little influence on the comparison among different approaches.

References

[1] Friedrich F., Schramowski P., Brack M., et al. Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness. arXiv preprint arXiv:2302.10893, 2023.

Comment

Dear Reviewer,

Thank you so much for your time and effort in reviewing our paper. We have addressed your comments in detail and are happy to discuss further if there are any additional concerns. We look forward to your feedback and would greatly appreciate it if you would consider raising the score.

Thank you,

Authors

Comment

Thank you for the detailed response, particularly for adding the details regarding the detector network. Of note, I believe Appendix C, which was added during this rebuttal phase, should be included in the main section. Unfortunately, I disagree with the authors on the two following points.

  • While I appreciate the authors' inclusion of [1] as a baseline upon my suggestion, I still believe there are more baselines that should be taken into consideration. The authors argued that there currently exists only a single baseline [1], but I believe debiasing methods for diffusion models should be included as well, e.g., [2, 3, 4, 5]. They do not explicitly target "group fairness" but propose more general methods to debias a diffusion model. For instance, in Figures 5 and 6 of [5], the authors show that their method can also improve racial and gender diversity.
  • Moreover, the authors argued that addressing multiple types of biases at once or sequentially is overly demanding relative to current fairness studies. I disagree, because existing diffusion debiasing methods [2, 3, 4, 5] have already managed such scenarios. For example, [5] tackles the I2P dataset (from [2]), effectively addressing multiple types of biases at once.

Considering the above concerns, I hold the view that the paper, despite its focus on a timely issue and the introduction of a fair algorithm re. group fairness, is not yet ready for publication. Therefore, I will maintain my current rating.

To clarify, I am not suggesting the authors include every recent work on debiasing/concept-erasing as a baseline. However, the current version seems to lack a thorough exploration of prior research. I would be more than happy to hear the other reviewers' opinions on this matter.


References
[1] Friedrich et al., “Instructing Text-to-Image Generation Models on Fairness.” 2023.
[2] Schramowski et al., "Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models." 2023.
[3] Brack et al., "SEGA: Instructing Text-to-Image Models using Semantic Guidance" 2023.
[4] Gandikota et al., "Erasing Concepts from Diffusion Models" 2023.
[5] Gandikota et al., "Unified Concept Editing in Diffusion Models" 2023.

Comment

Dear Reviewer R9zr,

Thanks for your response and the referenced papers; we will include them in the revised version.

We have checked these papers and found that references [2]-[4] do not contain any experiments on enhancing fairness. So we do not think we need to compare these methods as there is no guidance on how to generate diverse human faces in these papers.

For reference [5], yes, it contains a fairness evaluation and is quite close to our paper. However, we should note that the paper was released at the end of August. Based on the ICLR Reviewer Guide (https://iclr.cc/Conferences/2024/ReviewerGuide): "We consider papers contemporaneous if they are published (available in online proceedings) within the last four months. That means, since our full paper deadline is September 28, if a paper was published (i.e., at a peer-reviewed venue) on or after May 28, 2023, authors are not required to compare their own work to that paper". Thus, we do not think the lack of a comparison with [5] is a weakness of our paper.

Thanks.

Comment

Dear Reviewer R9zr,

As the author-reviewer discussion period will end soon, we would appreciate it if you could check our response to your review comments. This way, if you have further questions or comments, we can still reply before the discussion period ends. If our response resolves your concerns, we kindly ask you to consider raising the rating of our work. Thank you very much for your time and effort!

Comment

Dear Reviewer R9zr,

Thank you for bringing up these points. Beyond the fact that the works you mention are concurrent and contemporaneous, we think they do not align with our topic of fair diffusion models.

  1. Regarding the semantic editing methods [3-5] you mention, we think a direct comparison is not appropriate. Semantic editing methods can indeed be integrated into text-to-image approaches through mechanisms such as agents, creating a new debiasing system similar to the work in [1], but that is beyond our scope. These papers are orthogonal to our method and could be integrated with it in future work; such a combination would constitute new research along the lines of [1]. Our goal is to propose an alternative post-processing method that mitigates generation bias by removing linguistic bias. While methods integrating agents have the potential to address bias, we still believe it is not necessary to compare our approach directly with these editing methods.

  2. Safe Latent Diffusion [2] does not explicitly address fairness-related issues but rather focuses on mitigating the generation of inappropriate content. Therefore, it is also beyond the scope of our comparison.

Finally, we kindly hope you will reconsider the research significance of this paper, and we would greatly appreciate it if you could consider raising the score accordingly.

[1] Friedrich et al., “Instructing Text-to-Image Generation Models on Fairness.” 2023.

[2] Schramowski et al., "Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models." 2023.

[3] Brack et al., "SEGA: Instructing Text-to-Image Models using Semantic Guidance" 2023.

[4] Gandikota et al., "Erasing Concepts from Diffusion Models" 2023.

[5] Gandikota et al., "Unified Concept Editing in Diffusion Models" 2023.

Review
Rating: 5

The paper discusses the problem of demographic bias in text-to-image diffusion models, which often produce biased images due to sociocultural biases in language. The authors propose a solution called "Fair Mapping," which is a model-agnostic and efficient method that modifies pre-trained text-to-image models to generate fair images. This is achieved by adding a linear mapping network that updates a small number of parameters, thus reducing computational costs and speeding up the optimization process.

Strengths

  1. The paper is well-written and easy to follow.
  2. The method is intuitive and reasonable.
  3. The experimental results seem promising.

Weaknesses

  1. This paper only considers bias within text embeddings and does not extend to biases that may be inherent in the diffusion model itself. This limitation is significant as it suggests that the system could be susceptible to manipulation if the text embedding model is altered. A more holistic approach that also scrutinizes and corrects for biases within the diffusion model could potentially offer a more robust and less vulnerable solution.
  2. The experiments are limited to biases related to gender and race, omitting other prevalent societal biases such as ageism, socioeconomic status and religious discrimination.

Questions

Refer to the weaknesses mentioned above.

Comment

Thank you for raising an important point regarding the limitations of our experiments.

Reply to W1

We kindly disagree that only considering bias within text embeddings is a limitation. Comprehensively improving fairness in text-to-image diffusion models is a complicated problem, as the bias arises from various sources. If not addressing all potential types of bias in the diffusion model were grounds for rejection, then all previous papers on fairness should be rejected.

Currently, there is only one paper [1] studying this topic, which is based on a post-processing method. Moreover, mitigating inherent bias by fine-tuning all parameters in the diffusion model would necessitate significant time and computational resources. Our method employs a lightweight network structure that incurs minimal additional computational cost when updating the parameters. By examining biases at this level, our intention is to illuminate a critical factor influencing the overall output of the system. Due to its simplicity, our method can be easily integrated into any fair text-to-image diffusion model to further improve fairness. In the revised version, we have added more discussion of the strengths of our method; please refer to Page 16 for an in-depth discussion: 1) Model-agnostic integration: Fair Mapping seamlessly integrates with diverse text-to-image diffusion models without requiring retraining of their parameters. 2) Lightweight implementation: as a post-processing approach, Fair Mapping introduces only a trainable linear map, ensuring computational efficiency and minimal additional time cost. 3) Flexibility: Fair Mapping's simple loss function for each keyword allows for easy substitution with alternative prompts or fairness metrics, enhancing adaptability across different scenarios.

Regarding potential attacks, we do not think this is an issue specific to our method, as it is a common issue for all text-to-image diffusion models. Notably, altering the text embedding is a widely used way for attackers to manipulate text-to-image diffusion systems, which goes far beyond introducing mere unfairness in the output; such attacks pose substantial threats to the integrity and reliability of text-to-image models. This vulnerability is inherent to all attack scenarios on text-to-image models, and our proposed mapping network, which preserves the integrity of the end-to-end system's structure, does not amplify this risk.

Reply to W2

We cannot agree with the reviewer's comments. As we mentioned, there is currently only one previous paper [1] on fair diffusion models, and there is a lack of unified benchmarks for evaluating approaches in this domain. In our research, we focus on investigating gender and race biases due to their well-documented presence and significant impact, as evidenced by numerous studies [1-2]. Additionally, these biases are readily observable in images, making them suitable as typical attributes for our research focus. It is worth noting that our method can be extended to address other sensitive attributes based on user preferences. The flexibility of our approach allows for its deployment and adaptation to various attribute domains, providing a versatile framework for bias mitigation in image generation.

References

[1] Friedrich F., Schramowski P., Brack M., et al. Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness. arXiv preprint arXiv:2302.10893, 2023.

[2] Wang F. E., Wang C. Y., Sun M., et al. MixFairFace: Towards Ultimate Fairness via MixFair Adapter in Face Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 37(12): 14531-14538, 2023.

Comment

Dear Reviewer emxE,

Thank you so much for your time and effort in reviewing our paper. We have addressed your comments in detail and are happy to discuss further if there are any additional concerns. We look forward to your feedback and would greatly appreciate it if you would consider raising the score.

Thank you,

Authors

Comment

Dear Reviewer emxE,

As the author-reviewer discussion period will end soon, we would appreciate it if you could check our response to your review comments. This way, if you have further questions or comments, we can still reply before the discussion period ends. If our response resolves your concerns, we kindly ask you to consider raising the rating of our work. Thank you very much for your time and effort!

Comment

Dear reviewer emxE,

  1. Thank you for your response. Our method addresses the problem from a text-conditioning perspective, aiming to remove bias introduced into the diffusion process through language. As a post-processing method, it offers advantages such as model-agnostic integration, lightweight implementation, and flexibility. By removing inherent biases from the perspective of language bias, as opposed to relying solely on fine-tuning on large-scale ethical datasets, we avoid the effort of training the whole diffusion model while still achieving effective bias mitigation. Additionally, our method can be easily applied to existing models, making it scalable and adaptable to various scenarios, and it incurs negligible time and space costs while maintaining debiasing effectiveness. This allows for integration with different models and tasks, providing a viable solution. Therefore, we believe that our approach is innovative and effective.

  2. Thank you for your advice. The time allocated for our response is limited to 8 hours; to ensure accurate implementation and evaluation, we will consider incorporating the mentioned methods as baselines in future versions, provided they are open source and can be seamlessly implemented on the diffusion model. However, regarding the semantic editing methods you mention, we believe they should not be directly considered in the context of fairness-related issues: even though semantic editing methods may have potential for debiasing, they still require the incorporation of agent behavior to achieve fairness. Therefore, a direct comparison should not be made.

Thank you once again for your comments on our work. If you have any additional questions, please don't hesitate to ask. We look forward to your feedback and would greatly appreciate it if you would consider raising the score. Hope you have a nice day!

Comment

Dear Authors,

Thank you for your comprehensive response.

Regarding W1, your paper addresses the issue of unfairness in text-to-image diffusion models by modifying the text embeddings in the CLIP model. However, it does not address the inherent biases within the diffusion models themselves. Previous studies, such as references [1], [2], [3], and [4], have extensively discussed the biases in the CLIP model. This raises a question about the technical novelty of your approach compared to existing literature. In contrast, addressing the bias of diffusion models themselves is a more worthy area of research, as demonstrated in [4], which tackles biases in both discriminative (like CLIP) and generative models without requiring additional data or training.

[1] Wang J., Liu Y., Wang X. E. Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search. arXiv preprint arXiv:2109.05433, 2021.
[2] Berg H., Hall S. M., Bhalgat Y., et al. A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning. arXiv preprint arXiv:2203.11933, 2022.
[3] Seth A., Hemani M., Agarwal C. DeAR: Debiasing Vision-Language Models with Additive Residuals. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 6820-6829.
[4] Chuang C. Y., Jampani V., Li Y., et al. Debiasing Vision-Language Models via Biased Prompts. arXiv preprint arXiv:2302.00070, 2023.

Regarding W2, while I acknowledge the potential of your method to address various forms of unfairness, I disagree with the claim that there is only one prior work ([5]) on fairness in diffusion models. There are other works on this problem, such as those mentioned in references [4], [6], and [7]. A comparative analysis with these works is also missing in your paper, which is a notable omission.

[5] Friedrich F., Schramowski P., Brack M., et al. Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness. arXiv preprint arXiv:2302.10893, 2023.
[6] Orgad H., Kawar B., Belinkov Y. Editing Implicit Assumptions in Text-to-Image Diffusion Models. arXiv preprint arXiv:2303.08084, 2023.
[7] Gandikota R., Orgad H., Belinkov Y., et al. Unified Concept Editing in Diffusion Models. arXiv preprint arXiv:2308.14761, 2023.

Given these concerns, I maintain my initial rating of the paper.

Comment

Dear Area Chairs and reviewers:

Thank you very much to all the reviewers for your comments and suggestions on our work. We greatly appreciate your feedback, and we have carefully addressed and responded to each comment. In this paper, we propose a novel post-processing debiasing method for diffusion models, Fair Mapping, which is model-agnostic, lightweight, and flexible for customization. In our experiments, the method effectively removes biases in diffusion models while preserving image quality, with little time and space cost. Our method has a positive social impact by enabling fair generation for human-related descriptions.

The revised paper has been uploaded with blue highlighting to indicate modifications. We make the following revisions to the paper:

  1. To further elucidate our motivation for mitigating language bias in diffusion models and to give a comprehensive understanding of our work, the revised manuscript strengthens the Introduction to convey the motivation and significance more clearly (refer to Page 2, Section 1 and Page 3, Section 2). Additionally, a supplementary section has been added to expound on the presence of language bias in text-to-image diffusion models (refer to Page 15, Appendix B).

  2. To facilitate readability and comprehension, we have provided more specific descriptions of the evaluation metrics, including language bias, Human-CLIP, Diversity, and the scale for human preference (refer to Page 18, Section D.2). This helps readers better understand and evaluate our research findings. By offering detailed descriptions of these evaluation metrics, readers will have a clearer understanding of the specific indicators and methods used in our research. Moreover, in Appendix D.1 of the revised version, we have listed all the keywords we used in the paper.

  3. We discuss a lightweight framework that prioritizes computational efficiency and compare it to the vanilla Stable Diffusion model. We add supplementary experiments highlighting the efficient training process of the framework, which completes training in just 50 minutes for 150 occupations in the gender attribute on one V100 GPU. We also compare the time required for image generation, demonstrating that the framework performs well and generates images within a reasonable timeframe. Overall, we emphasize the effectiveness of the framework in expediting the model learning process and generating images while preserving sensitive attributes.

  4. In Appendix C on Page 16, we demonstrate our strengths from three aspects: 1) Model-agnostic integration: Fair Mapping seamlessly integrates with diverse text-to-image diffusion models without requiring retraining of their parameters. 2) Lightweight implementation: as a post-processing approach, Fair Mapping introduces only a trainable linear map, ensuring computational efficiency and minimal additional time costs. 3) Flexibility: Fair Mapping's simple training requirements for each keyword allow for easy substitution with alternative prompts or fairness metrics, enhancing adaptability across different scenarios.

  5. We describe the details of the inference stage in the context of Fair Mapping. It explains how Fair Mapping ensures robustness and addresses possible biases in the generated content. In Appendix C on Page 16, we introduce an additional detector that adapts the user's input prompt to a training prompt with the closest semantic similarity. The similarity distance between the input prompt and each training prompt is calculated using a pre-trained text encoder, and the training prompt with the smallest distance is identified. If the distance falls below a threshold, the input prompt is transformed to match the identified training prompt. Besides, we also highlight the issue of explicit biased information in the input text and the need for the detector to identify the presence of sensitive attributes.

  6. We compare our method with the debiasing method Fair Diffusion [1], evaluating performance in terms of alignment and diversity. Through visual comparison, we found that editing operations have a negative impact on image quality, particularly on facial details. Additionally, our method significantly reduces the time needed to generate 100 images, taking only 434 seconds, which is 1029 seconds (approximately 70.3%) less than the Fair Diffusion model on one V100 GPU. Overall, by avoiding manipulative editing and using a lightweight model structure, our evaluation becomes more reliable and efficient.

Thank you for your valuable feedback. We have carefully considered your comments and addressed them accordingly. If you have any further questions or need clarification, please don't hesitate to reach out to us.

References

[1] Friedrich F., Schramowski P., Brack M., et al. Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness. arXiv preprint arXiv:2302.10893, 2023.

AC Meta-Review

The paper tackles the issue of biases in text-to-image AI models, where these models often reflect sociocultural biases. It introduces a "Fair Mapping" module that can be added to existing models to make the images they produce less biased. This module is efficient, altering only a few parameters to save on computational resources. It works by ensuring the text's meaning stays consistent while reducing the gap between different groups in the images. The paper also proposes a new way to measure how fair the images are, called FairScore, aiming to keep the image generation unbiased regardless of the text prompt's content.

Why Not a Higher Score

The reviewers' main concerns with the paper involve its approach and novelty in addressing unfairness in text-to-image diffusion models. They point out that the paper focuses on modifying text embeddings in the CLIP model without addressing the inherent biases in the diffusion models themselves, a topic already extensively covered in existing literature. The reviewers also note an omission in the comparative analysis, highlighting that the paper overlooks several significant prior works on fairness in diffusion models. Additionally, they criticize the paper for not including a broader range of baselines for debiasing diffusion models, arguing that existing methods have proven capable of addressing multiple types of biases simultaneously.

Why Not a Lower Score

N/A

Final Decision

Reject