PaperHub
Overall score: 6.8/10 · Poster · 4 reviewers
Ratings: 3, 5, 5, 4 (min 3, max 5, std 0.8); mean 4.0
Confidence: 4.0 · Novelty: 3.0 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

U-REPA: Aligning Diffusion U-Nets to ViTs

Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture, which shows faster convergence than DiTs. Adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) the space gap between U-Net and ViT features hinders tokenwise alignment. To address these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows. Firstly, we observe that, due to skip connections, the middle stage of the U-Net is the best alignment option. Secondly, we upsample U-Net features after passing them through MLPs. Thirdly, we observe that tokenwise similarity alignment is difficult, and further introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates convergence. With a CFG guidance interval, U-REPA reaches FID < 1.5 in 200 epochs or 1M iterations on ImageNet 256×256, and needs only half the total epochs to outperform REPA under sd-vae-ft-ema.
Keywords
Representation Alignment · U-Net · Transformer · Diffusion
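To make the two loss components in the abstract concrete, here is a minimal sketch: a tokenwise cosine (REPA-style) term, and a manifold term that matches sample-to-sample similarity structure across the batch. This is an illustrative reconstruction, not the paper's exact formulation; the function names, the flattening choice, and the λ = 0.5 weighting (quoted from the rebuttal) are assumptions.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    # Cosine similarity along the last (feature) axis.
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return (a * b).sum(axis=-1)

def repa_loss(h, z):
    # REPA-style tokenwise term: push each projected diffusion-model
    # token h[i, n] toward the matching ViT token z[i, n].
    # h, z: (batch, tokens, dim)
    return -cosine_sim(h, z).mean()

def manifold_loss(h, z):
    # Manifold term: match the *relative* similarity between samples,
    # i.e. the batch-level Gram matrices of the two feature sets.
    H = h.reshape(h.shape[0], -1)
    Z = z.reshape(z.shape[0], -1)
    H = H / np.linalg.norm(H, axis=1, keepdims=True)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return ((H @ H.T - Z @ Z.T) ** 2).mean()

def u_repa_objective(h, z, lam=0.5):
    # lam = 0.5 follows the REPA hyperparameter quoted in the rebuttal.
    return lam * repa_loss(h, z) + manifold_loss(h, z)
```

Because the manifold term compares batch-level Gram matrices, it can keep regularizing relative similarity between samples even when exact tokenwise alignment saturates.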

Reviews and Discussion

Official Review
Rating: 3

This paper proposes U-REPA, an adaptation of Representation Alignment [1] for U-Net-based diffusion models. The authors mainly focus on how to handle the differences between DiT and U-Net, which pose challenges when directly applying REPA. In my opinion, although this paper finds a better way to align representations in U-Net-based diffusion models, the novelty and contribution are still limited, so I lean toward rejecting this paper.

[1] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Strengths and Weaknesses

Strengths:

The paper explores a novel way to apply REPA to U-Net-based diffusion models, as was previously done for SiTs and DiTs. When directly applying REPA to U-Net, the authors face several challenges and solve them properly:

  1. Divergent block functionalities due to skip connections: aligning intermediate U-Net layers.

  2. Spatial-dimension mismatches from U-Net downsampling: upscaling U-Net features and using a post-MLP to resolve the spatial inconsistency.

  3. Feature space gaps hindering tokenwise alignment: replacing cosine similarity loss with the manifold loss.

Weakness:

I believe the authors have properly solved the issues that arise when applying REPA to U-Net. My main concern is the contribution and novelty of this paper: compared with REPA, U-REPA seems only to change the position of alignment and the alignment loss.

Besides, an interesting point I find is that the authors claim that directly applying REPA does not work well, so they propose a new manifold loss. However, in Eqn. (5), I still find that the whole loss contains the original REPA loss. The authors do not provide any ablation on it.

Besides, the authors do not provide any visualizations of samples generated with REPA and U-REPA.

Most importantly, the comparisons between REPA and U-REPA are unfair. In most of the tables, the U-Net used for U-REPA is better than the backbone used for REPA, which is strange. Why do the authors not put them under the same experimental settings?

Questions

  1. Conduct ablations on the loss terms in Eqn. (5).

  2. Provide visualizations of samples generated by REPA and U-REPA.

  3. Can this method be applied to other U-Net-based diffusion models?

  4. Compare U-REPA and REPA under the same experimental setting, especially with the same trained U-Net.

Limitations

Although the authors propose U-REPA, an adaptation of REPA that can boost REPA's performance on U-Net-based diffusion models, the novelty and contributions are limited. For the rest, please see the weaknesses and questions.

Final Justification

Although the authors have addressed some doubts, given the limited novelty and contribution of this paper, I will revise my score to Borderline Reject.

Formatting Concerns

No paper formatting concerns.

Author Response

We sincerely thank reviewer U3KJ for their constructive comments. Our responses are as follows:

Q1: Contribution & Novelty.

The contribution of this paper goes beyond engineering practice: it rethinks the U-Net, a long-neglected architecture. Its most important contribution is demonstrating that U-Nets perform no worse than the DiT class.

  • To the best of our knowledge, we are the first to adapt REPA to U-Nets via key modifications, including the alignment position, upsampling, and the manifold loss.
  • Some previous works rethink U-Net (as cited in "Related Work"), but they fail to demonstrate U-Nets' best performance (these works usually report >2 FID on ImageNet-256). In contrast, we show that U-Nets' best performance surpasses DiT at only half the training cost. Our work reveals the great potential of U-Net and hints at a better architectural choice for diffusion.

Q2: Ablation on the REPA loss.

Thanks for your suggestion. To clarify, we do not reject the importance of the REPA loss (we will emphasize this point in the next revision); rather, we hold that the REPA loss reaches a bottleneck in the U-Net alignment setting and requires manifold guidance for assistance.

In our setting, we follow the hyperparameter setting from REPA, which sets λ = 0.5. We also enclose an ablation table that reveals the importance of the REPA loss when λ is reduced to 0 while the manifold loss is kept intact. Results indicate that the REPA loss is important for generation quality:

λ       0.5    0.25   0 (no L_REPA)
FID ↓   5.72   6.42   10.91

Q3: Lack of Visualization Comparison.

Due to space limits, we have already put a visualization of our method in the supplementary materials. Since we cannot attach a picture for comparison in the rebuttal, we instead conduct a human preference experiment that performs a one-on-one comparison between REPA and U-REPA under the same sampling seed (seed = 0), as follows.

Our experiment covers the first 100 images and no cherry-picking is conducted. Due to content differences or similar quality, we find it hard to distinguish some image pairs. Results show that the image quality of U-REPA is better than REPA, which coincides with the quantitative results.

Which method is better?   REPA   U-REPA (Ours)   Can't Tell
Counts                    17     38              45

Q4: Compare U-REPA and REPA under the same experiment setting.

To clarify, the whole process starts from scratch (following the recipe of REPA); we are not loading a pretrained U-Net for U-REPA (which might cause unfairness).

While we have demonstrated that U-Net is more performant, that evaluation covers only the initial phase (@ 400K), where we cannot be certain whether U-Net approaches SOTA performance. In our paper, we show that U-Net can also reach SOTA performance comparable to DiT, at a faster convergence speed, under exactly the same, fair training recipe.

Comment

Dear Reviewer U3KJ,

Thanks for reading our rebuttal response. As an update, we have addressed other reviewers' concerns about fairly comparing REPA and U-REPA on the same architecture (SiT\downarrow) in the following table (under the ablation setting). We hope these results better respond to your concerns about a fair comparison between the two methods.

Method                    FID ↓
SiT↓-XL + REPA            9.35
SiT↓-XL + U-REPA (Ours)   5.72

May I kindly ask if we have fully addressed your concerns? If there are any remaining concerns, please feel free to raise them and we will try our best to explain.

Sincerely,

Authors

Comment

Thank you very much for the authors' response to my concerns. I noticed this paper a long time ago, but I still maintain that it falls between borderline reject and borderline accept.

From my perspective, this paper mainly discusses how to apply REPA to U-Net-based diffusion models. The method is not much different from REPA, but rather adapts it to a different network architecture, for example by changing the position of alignment. The REPA loss is not excluded from the final loss.

In addition, although the authors have made some improvements over REPA on U-Net-based diffusion models, the mainstream framework now is the DiT-architecture diffusion model, so I have some doubts about the practicality of this paper.

Comment

Dear Reviewer U3kJ,

Thank you very much for your response! Here are our clarifications:

About the practicality of REPA adapted to U-Net.

We hold that effectively applying REPA to U-Net is important.

Despite the current architectural trend toward DiT, we argue that U-Net (as the original mainstream architecture) is not unimportant, for the following reasons:

  • The DiT paper claims that "the U-Net inductive bias is not crucial to the performance of diffusion models", but no supporting ablations are presented in the DiT paper. The shift from U-Net to DiT took place in academia only recently; U-Nets are neglected in most recent research even though they have not been proven ineffective.

  • In U-REPA, we show that the U-Net architecture's best performance surpasses DiT with only half the training iterations, offering a highly competitive architectural alternative.

  • As the previous mainstream choice, the diffusion U-Net is still widely used. Before conducting this work, we saw requests both in REPA's ICLR'25 reviews and in its GitHub issues asking whether REPA works on U-Net. Moreover, in some downstream tasks such as low-level diffusion tasks, U-Net performs better than DiT [1], hinting at its useful applications.

In a nutshell, while DiT has been the research focus of most recent papers, our U-REPA argues (contrary to most beliefs in academia) that U-Net is a diffusion architecture with great potential: when leveraged well, U-Nets can outperform DiTs in diffusion generation. While the diffusion community is converging on DiT as the sole architectural choice, we think alternative choices should be revisited for the sake of diversity in the community.

[1] Effective Diffusion Transformer Architecture for Image Super-Resolution. AAAI 2025.

About the method difference with REPA.

We hold that U-REPA and REPA are rather different: unlike DiT, which has a homogeneous architecture (where changing the alignment position merely changes a hyperparameter), U-Net has different stages with different feature sizes, thus requiring tailored alignment designs.

We first show that REPA at mid-layers is more optimal for models with skips, but since feature downsampling is involved at mid-layers, the feature-size difference (between the diffusion model and the ViT encoder) has to be effectively addressed. Centering on this problem, we accordingly propose a way to unify feature sizes between the U-Net and the ViT encoder, and introduce a manifold alignment loss to fill the space gap between different-size features.
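As a rough illustration of the feature-size unification step, the sketch below upsamples a coarse mid-stage token grid to the ViT encoder's token count via nearest-neighbor duplication. The MLP projection mentioned in the paper is omitted here; the function name, shapes, and the nearest-neighbor choice are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def upsample_tokens(h, target_hw):
    # h: (batch, n, dim) tokens on a coarse sqrt(n) x sqrt(n) grid
    # (e.g. a mid-stage U-Net feature map, flattened to tokens).
    # Returns (batch, target_hw * target_hw, dim) via nearest-neighbor
    # duplication, so the token count matches the ViT encoder's grid.
    # Assumes target_hw is an integer multiple of the coarse grid size.
    b, n, d = h.shape
    hw = int(round(n ** 0.5))
    scale = target_hw // hw
    grid = h.reshape(b, hw, hw, d)
    grid = grid.repeat(scale, axis=1).repeat(scale, axis=2)
    return grid.reshape(b, target_hw * target_hw, d)
```

After this step, each coarse token covers a block of fine-grid positions, so a tokenwise loss against ViT features becomes shape-compatible.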

Our paper also gives a bunch of interesting discoveries beyond the original REPA:

  • Via empirical analysis of the architecture, we hold that U-Net's feature downsampling is the key to its good performance (but this feature also causes trouble for REPA, and we focus on addressing it in U-REPA).

  • Our U-Net analysis indicates that semantic, outline-related blocks are the key places for conducting representation alignment. Unlike REPA, where the early injection layer is merely a hyperparameter with very limited discussion, we suggest that blocks with semantic functions require more visual-encoder guidance.

  • Aligning high-frequency, detail-rich ViT features with low-frequency, detail-poor U-Net features results in a similarity bottleneck during training (they cannot be completely aligned by REPA's tokenwise objective). Hence, besides the REPA loss, we need other objectives for alignment.

  • U-Net achieves even better performance than DiT and converges much faster with the help of U-REPA. Though the current trend favors DiT, we suggest that U-Net is possibly a better choice for diffusion generation.

We hope these interesting discoveries can contribute to the community's future research.

Comment

I understand that the contributions of this paper are the modifications needed to apply REPA to U-Net-based diffusion models, and I recognize that.

However, I feel that the comparisons in the article lack a certain degree of fairness. If the authors want to compare the performance of U-REPA and REPA, they should be compared in the same experimental setting, rather than using SiT for REPA and SiT\downarrow for U-REPA. The performance of these two backbones is not consistent, so how can you ensure that U-REPA is superior to REPA? Or will you also claim the tricks used to train SiT\downarrow and DiT\downarrow as contributions of this paper?

Prior to the rebuttal, the authors neither demonstrated the performance differences between SiT and SiT\downarrow in the text, nor conducted experiments on SiT\downarrow+REPA and SiT\downarrow+U-REPA. In the paper, Fig. 3, Fig. 4, as well as Tab. 3, Tab. 4, and Tab. 5 all compare REPA and U-REPA under different conditions. Why did the authors only provide such an important experiment during the rebuttal?

Moreover, the paper lacks any visualization of generated images. How can the reviewers compare the differences between samples generated by REPA and U-REPA? This kind of visualization is very common in the AIGC field. Why didn't the authors add it?

I found in the authors' response Q1 to Reviewer #kqTd that the performance of SiT\downarrow is much better than SiT, and the performance gain of SiT+REPA is much greater than that of SiT\downarrow+U-REPA, which raises doubts about the experiments in this paper.

If the authors believe that they need to conduct experiments on U-Net-based diffusion but cannot find a corresponding backbone and can only modify SiT and DiT, why don't they directly conduct experiments on DDPM [1], iDDPM [2], or even EDM [3], instead of modifying the Transformer-based DiT and SiT?

I still insist on my viewpoint that this paper falls between borderline reject and borderline accept.

[1] Denoising Diffusion Probabilistic Models

[2] Improved Denoising Diffusion Probabilistic Models

[3] Elucidating the Design Space of Diffusion-Based Generative Models

Comment

Dear Reviewer U3kJ,

We are sincerely grateful for your recognition of our contribution. After reading your response, we feel there are quite a few misunderstandings, which we hope to clarify here:

C1,2: Fairly comparing between REPA and U-REPA.

  • We ensure U-REPA is superior to REPA by comparing them on the same U-Net architecture (as we responded in "Kindly asking about further concerns"). This comparison is conducted in an entirely fair manner: all training, sampling, and architecture settings are exactly the same. The experimental data are quoted from the original U-REPA paper (we will clarify this issue in the third bullet point). Following previous practice, tricks are added to show the best potential of U-Net; they are not claimed as our contribution.

  • The reason why SiT results were omitted initially: in the original draft, we demonstrated the difference between DiT and DiT\downarrow instead. Since the SiT paper is an improvement on top of the DiT architecture, we expected the trend of SiT / SiT\downarrow to be very similar to that of DiT / DiT\downarrow, which is verified by our experiments (shown in the following table). We agree that showing the SiT performance makes sense and will add SiT and SiT\downarrow in the next revision.

        DiT    DiT↓   SiT    SiT↓
FID ↓   19.5   11.0   17.2   9.2
  • Actually, we provided results for both SiT\downarrow+REPA and SiT\downarrow+U-REPA in the "Ablation" part of the paper; we only cited (not newly "provided") these data during the rebuttal. SiT\downarrow+REPA (FID 9.35) is in line 2 of Table 4 in the ablation study. We did not highlight this pair of comparisons at first because we took it for granted that U-REPA performs much better than REPA on U-Net (otherwise this paper would be meaningless; the gap is indeed large according to the results). It turns out some confusion has arisen, so we will highlight it in the next revision for better understanding.
  • We also hold that comparing SiT+REPA and SiT\downarrow+U-REPA is meaningful because both experiments are fairly conducted from scratch with all settings aligned. Via these experiments, we show, beyond the effectiveness of U-REPA, that U-Net architectures have good potential in both performance and convergence.

C3: About Visualization

We have provided a visualization in the supplementary materials due to space limits. We will move it back to the main paper in the next revision.

Why there is no side-by-side difference comparison between REPA and U-REPA: actually, we are quite surprised by this request. To our knowledge, one-to-one comparison with other methods is quite uncommon in image generation papers (e.g., DiT, SiT, REPA, et cetera), but we are happy to provide one in the next revision. We have inspected the image quality, and U-REPA is clearly better. In other AIGC tasks where an image is input to the model, such as image editing or super-resolution, one-to-one visual comparison is more common.

C4: About Performance Growth

We hold that comparing SiT\downarrow / SiT\downarrow+U-REPA against SiT / SiT+REPA is unfair for the following reasons (we have also responded to Reviewer BL2s in "Responses to 'Still missing comparison'"):

  • When generation performance is stronger, it is also harder to improve further (especially for FID\downarrow).
  • Unlike SiT vs. ViT, SiT\downarrow vs. ViT involves key architectural differences, which makes alignment harder.

Hence, we hold that this comparison is unfair: the methods cannot be judged simply by the amount of metric improvement. Comparing both REPA and U-REPA under SiT\downarrow is fairer.

C5: Why not Using Traditional U-Nets

Our reasons for modifying DiT rather than using [1, 2, 3] are as follows:

  • Firstly, through the modification of DiT, we want to isolate U-Net components to evaluate their individual contributions and demonstrate the importance of downsampling in U-Net. This finding is important to our effort to adapt REPA to U-Net: though downsampling is the key to U-Net's good performance, it is also the key barrier to adapting REPA.
  • Secondly, these early U-Net architectures were designed for earlier, easier diffusion tasks and settings, such as smaller images and fewer object classes (e.g., ImageNet-64, CIFAR-10). [2] includes ImageNet-256, but its single-model performance only achieves FID 31.5 after 2400 epochs of training (far more than the 1400 epochs for DiT/SiT). We doubt whether their capacity enables good generation performance. In contrast, the setting of DiT is closer to current applications (it is more widely used and adapted, e.g., by SiT and REPA). Therefore, we choose DiT as a popular choice for modification.

Again, thank you for your advice and we will take it when revising our paper.

Sincerely,

Authors

Official Review
Rating: 5

This paper introduces U-REPA, an adaptation of the REPA alignment methods specifically for U-Net style architectures. The main contributions include:

  1. This paper suggests three key challenges in aligning features between diffusion U-Nets and ViT encoders: different block functionalities, spatial dimension inconsistencies, and larger feature space gaps.

  2. Through experiments, the authors verify that the fast-convergence advantage of U-Net primarily stems from multi-scale hierarchical modeling rather than skip connections, and they build a series of SiT\downarrow models with U-Net-style modifications.

  3. The proposed U-REPA is a new framework developed to address these challenges through intermediate-layer alignment, feature-size adjustment, and a manifold loss.

  4. Experiments demonstrate that U-REPA can achieve generation quality with FID < 1.5 on the ImageNet 256×256 dataset and outperforms REPA with only half the number of epochs under the same training settings.

Strengths and Weaknesses

Strengths:

The paper proposes U-REPA, an effective representation alignment method for diffusion U-Nets that improves generation quality and training efficiency through intermediate layer alignment, feature upscaling, and manifold loss. The proposed architecture and alignment method can greatly improve the convergence and performance of image generation, supported by the extensive experiments.

Weaknesses:

  1. Though the authors propose U-REPA to align U-Net features to ViT's, the alignment is conducted on a modified architecture, SiT\downarrow, rather than the original convolutional U-Net. Experiments concerning the convergence and performance of the convolutional U-Net are necessary.

  2. Beyond the macro components (skip connections and hierarchical design), the fast convergence of the original U-Net may come from the convolution modules. A discussion of convergence speed between convolutional nets and DiT should be included in this paper.

  3. The writing of the paper should be checked and improved. In L208 of the main body, there is a meaningless sentence: "The similarity gap between SiT and SiT\downarrow". The tables, such as Table 1, are not cited in the main text. The symbol h_t^{[n]} is not specified in the text.

  4. It seems that the experiments are conducted only with global seed = 0. Though this setting ensures reproducibility, the quantitative analyses and proposed methods should be evaluated across at least 3 random seeds to avoid fluctuation and cherry-picking and to demonstrate robustness.

  5. The time-aware MLP may not fit the main story of the paper, and the experiments show that it contributes little to the overall performance. Thus, in my opinion, it should be treated as a trick and removed from the main paper.

  6. The proposed manifold loss is similar to the marginal distance matrix loss in VA-VAE. It would be better to compare the two losses and refer to VA-VAE in Section 3.3, not just in the related-work section.

Questions

My questions are listed in the weaknesses above. If the authors address my concerns during the rebuttal process, I would like to raise the final score.

Limitations

yes

Final Justification

I have read through all the reviews and author responses, discussed with the authors during the discussion phase, and reached a final conclusion based on the rebuttal and discussion. The authors provide comprehensive explanations that have addressed all my main concerns. The new insights and supplementary results reinforce the quality and contribution of the paper. Thus, I raise my final rating from borderline reject to accept, which means the paper is ready for publication at the conference.

Formatting Concerns

I have no formatting concerns about this paper.

Author Response

We sincerely thank reviewer kqTd for their constructive comments. Our responses are as follows:

Q1&2: Regarding the convolutional U-Net.

Unfortunately, we hold that the original convolutional U-Net is rather limited in capacity and not suitable for scaling up. We conducted experiments on SongUNet [1] and found that it is not as performant as SiT\downarrow:

FID ↓                          100K   200K   300K   400K
SiT+REPA* (DiT architecture)   19.7   11.9   9.4    9.0
SiT↓+U-REPA                    12.8   6.6    6.2    5.4
SongUNet+U-REPA                29.0   18.8   14.9   12.9

* Replicated according to the official codebase and settings. In the official paper, SiT+REPA @ 400K is reported as 7.9 FID.

We hold that the capacity and properties of the architecture, rather than the use of convolution alone, are the key to fast convergence. SongUNet is a fused architecture of convolution and self-attention, but it fails to converge faster than full-transformer diffusion models.

[1] Song, Yang, et al. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.

Q3: Writing should be improved.

We apologize for the writing issues. The sentence in L208 should read "The similarity gap between SiT and SiT\downarrow is obvious". Table 1 should be cited in the "Toy experiments on U-Net components" paragraph. h_t^{[n]} refers to the n-th token of an image. We will fix these in the next revision.

Q4: Multiple seed for robustness.

Thanks for your advice. To clarify, we did not cherry-pick seeds; we simply used the identical training setting and codebase of REPA, which adopts seed = 0. We also add experiments with seeds 1 and 2 to verify the robustness of our method.

We extend the training to 600K iterations (due to limited time and resources) and reach the following results:

SiT↓-XL + U-REPA   seed=0 (Ours)   seed=1   seed=2   mean
FID ↓              1.618           1.599    1.588    1.602 ± 0.012

These experiments indicate that the fluctuation is about 0.01 at merely 600K iterations, which is tiny compared to the values themselves.

Q5: Time-aware MLP may not be compatible.

Thanks for your suggestion. We agree that this part is a trick with only minor effect, and we will remove it in the next revision. We do not use these tricks in the main experiments (as noted alongside the ablations).

Q6: Discussion between VA-VAE loss and Manifold Loss

Thanks for your suggestion. The key differences between the marginal distance matrix loss in VA-VAE and our manifold loss are as follows:

  1. Purpose: in VA-VAE, the loss is applied to the VAE for better image compression; our manifold loss is applied to the diffusion model while the VAE is kept fixed (the off-the-shelf sd-vae).

  2. Form: the VA-VAE loss is l1-norm-based while ours is l2-norm-based. We also tried l1 as an alternative but found it slightly harmful in our setting, degrading FID from 6.24 to 6.28.

Comment

Thank you for the detailed responses during the rebuttal phase, which have addressed my concerns. I have read through the author responses and the comments from other reviewers. Before my final rating decision, I still hold the same concern as Reviewer BL2s. From Table 8, we can see the effectiveness of U-REPA compared to the original REPA: SiT↓-XL/2 + REPA reaches 6.25 FID and SiT↓-XL/2 + U-REPA reaches 5.72 FID. However, the raw performance of SiT↓-XL/2 is not reported. For a comprehensive experiment and fair comparison, it should be included in the paper.

Comment

We sincerely thank reviewer kqTd for their constructive responses and suggestions in the discussion phase.

Q1: The performance of SiT\downarrow-XL/2.

We agree that the SiT\downarrow performance is important. We add the raw performance of SiT\downarrow-XL/2 in the following table and will include it in the next revision:

Method                    FID ↓
SiT-XL                    17.2
SiT-XL+REPA               7.9
SiT↓-XL                   9.2
SiT↓-XL+U-REPA (Ours)     5.4

Q2: The advantage of SiT\downarrow+U-REPA over SiT\downarrow+REPA.

We hope to clarify that the FID of SiT\downarrow+REPA is 9.35 in the ablation setting (Table 4; the original REPA matches features of the same size at an early layer). The CFG comparison between SiT\downarrow+U-REPA and SiT\downarrow+REPA is therefore as follows:

Method                    FID ↓
SiT↓-XL+REPA              9.35
SiT↓-XL+U-REPA (Ours)     5.72

where the proposed U-REPA is far more effective for SiT\downarrow.

Comment

Thank you for the subsequent ablation result and the clarification. I missed the point that the original REPA performs matching at an early layer. It would be better to clarify that Table 4 experiments with REPA to choose the optimal alignment layer, and that Table 8 improves the alignment with U-REPA at that optimal layer.

I would like to raise my final rating toward acceptance. Please ensure that the revised version includes rebuttal details and clarifications. Good luck!

Comment

We will include them in the next revision. Thank you so much!

Official Review
Rating: 5

This paper addresses the problem of adopting representation alignment (REPA) from the DiT to the diffusion U-Net. Despite U-Net-based diffusion models being known to converge faster than DiT, the authors found that adopting REPA to the U-Net architecture is not trivial due to (i) different functionalities of U-Net blocks, (ii) spatial-dimension inconsistencies from U-Net, and (iii) space gaps between U-Net and ViT. To address this issue, the authors propose U-REPA that aligns ViT features to the middle stage of U-Net with an upsampling strategy and incorporates a manifold loss that regularizes the relative similarity between samples. Experimental results show that U-REPA consistently and significantly outperforms REPA.

Strengths and Weaknesses

Strengths

  1. The paper is overall easy to follow.
  2. Toy experiments demonstrate the potential of U-Net-based diffusion models, further motivating the necessity of applying REPA for diffusion U-Net.
  3. Comprehensive analysis of adopting REPA on diffusion U-Net.

Weaknesses

  1. Important but missing comparison: there is no comparison between SiT\downarrow and SiT\downarrow + U-REPA. While the proposed alignment scheme for the diffusion U-Net is intriguing, I am concerned whether the alignment loss is indeed beneficial for the diffusion U-Net architecture, given this missing comparison. For instance, DiT-XL/2 already outperforms DiT-XL/2+REPA at 400K iterations (an 11.02 FID score vs. 12.3).
  2. Could the authors provide more analysis of the manifold alignment loss? I wonder whether it affects tokenwise similarities. While SiT achieves larger tokenwise similarity than SiT\downarrow, I believe SiT\downarrow gets better generation quality than SiT. Based solely on these results, I think direct alignment using an L2 loss between the feature vectors is not beneficial for the U-Net.
  3. Because the proposed method includes an upsampling layer for alignment, I think a higher-resolution experiment (e.g., ImageNet 512×512) is necessary.
  4. The font size in Figure 2 is too small to read.

Questions

  1. Please answer the weaknesses.
    1. Could the authors provide the SiT\downarrow-XL/2 results in Table 4?

Limitations

The authors have not included the limitations and potential negative societal impacts of the work, e.g., the societal impact of image generation models (deepfakes, image copyright).

Final Justification

After reading the authors' rebuttal, my concerns are well addressed. Therefore, I would like to raise my final rating to "Accept".

Formatting Concerns

N/A

Author Response

We sincerely thank reviewer BL2s for their constructive comments. Our responses are as follows:

Q1: Important but Missing Comparison.

We hope to clarify that the 11.02 FID value is in fact for DiT\downarrow-XL, the performance of our proposed U-Net-style architecture (rather than a model based on the conventional DiT architecture). Below we provide a table illustrating the comparison.

Model                     FID @ 400K
DiT-XL                    19.47
SiT-XL+REPA               7.9
DiT↓-XL                   11.02
SiT↓-XL+U-REPA (Ours)     5.4

From the table, it is clear that both REPA and U-REPA boost generation performance.

Q2: More analysis of incorporating the manifold alignment loss.

Yes, definitely. The manifold alignment loss imposes only a very small change on tokenwise cosine similarities (unfortunately, we cannot include a plot here to illustrate this). Whether the manifold loss is added or not, the tokenwise cosine similarity always fluctuates around 0.6 (which we believe is the upper bound for alignment between U-Net and ViT encoders). Despite this, the manifold loss adds regularization at the manifold level and thus makes the U-Net more performant.

Q3: Higher-Resolution (ImageNet 512x512) Experiments.

Thanks for your suggestion! We are happy to provide the performance of U-REPA on ImageNet 512x512 in the table below. U-REPA outperforms REPA by a considerable margin, showing that the proposed U-REPA also generalizes to larger images. Due to time and resource constraints (training on ImageNet-512 is very resource-intensive), we have only reached 400K iterations at this time.

| FID↓ / IS↑ | SiT↓-XL/2+REPA | SiT↓-XL/2+U-REPA |
| --- | --- | --- |
| 400K | 2.44 / 247.3 | 2.21 / 274.7 |

Q4: Font size is too small.

We apologize for the overly small font size in the paper. We will enlarge it in the next revision for better readability.

Comment

[Q1] I do not think it is obvious that U-REPA boosts or improves generation quality: what is the performance of SiT↓-XL? Is it worse than SiT↓-XL + U-REPA? DiT vs. SiT is an unfair comparison. Because DiT↓-XL already achieves better performance than DiT-XL + REPA, I am still concerned about whether U-REPA is indeed a good alignment, based on the comparison of DiT / SiT+REPA vs. DiT↓ / SiT↓ + U-REPA.

Comment

Again, we sincerely thank reviewer BL2s for their constructive responses and suggestions in the discussion phase.

Q1[A]: The performance of SiT↓-XL.

Thanks for raising this issue, and we apologize for not providing SiT↓-XL results in the first rebuttal phase (we had very limited resources at that time). From the table below, we can see that U-REPA brings a clear improvement over SiT↓.

| Method | FID↓ |
| --- | --- |
| SiT-XL | 17.2 |
| SiT-XL+REPA | 7.9 |
| SiT↓-XL | 9.2 |
| SiT↓-XL+U-REPA | 5.4 |

Based on this table, we also wish to clarify that both DiT↓-XL and SiT↓-XL perform worse than SiT-XL+REPA (REPA is based on SiT, so there is no DiT+REPA).

Q1[B]: Whether U-REPA is indeed a good alignment.

We think it is quite hard to fairly compare DiT / SiT+REPA vs. DiT↓ / SiT↓ + U-REPA in terms of alignment quality, for the following reasons:

  1. As generation performance gets stronger, it also becomes much harder to improve (especially for FID as it gets lower).

  2. Aligning SiT↓ with ViT is much harder than aligning SiT with ViT, because the backbones of SiT and ViT encoders are very similar. Aligning SiT↓ to a ViT encoder is a special case due to the large architectural difference.

As an alternative, we hold that SiT↓ + U-REPA can be compared with SiT↓ + REPA, because they are performed on the same pair of models. The comparison (under the ablation setting) is as follows:

| Method | FID↓ |
| --- | --- |
| SiT↓-XL+REPA | 9.35 |
| SiT↓-XL+U-REPA (Ours) | 5.72 |

Comment

Thank you for the responses and additional experiments. My concerns are well addressed, and thus I would like to raise my score to "Accept".

Comment

Thank you so much! We will include the experimental results in the next revision.

Review
4

This manuscript aims to develop a representation alignment method for U-Net-based diffusion models. The authors perform experiments and point out that the challenges lie in layer selection, dimension differences, and feature compatibility. To this end, the authors introduce corresponding designs and propose applying a manifold loss for a better alignment effect. Results show that the proposed U-REPA achieves better performance than vanilla REPA for U-Net-based models.

Strengths and Weaknesses

  • Strengths:
  1. The proposed method is well motivated and the analysis is sufficient.
  2. Extensive experiments demonstrate that the proposed U-REPA achieves strong performance.
  3. The analysis and ablation experiments are sufficient and reasonable.
  • Weaknesses:
  1. This work mainly involves engineering analysis and engineering practice. The methods used, such as the manifold loss, have been proposed before.
  2. This work is just an extension of REPA to the U-Net framework with some engineering conclusions, and does not offer new findings that advance the AIGC field.

Questions

The authors are recommended to optimize the layout to improve readability.

Limitations

yes

Justification for Final Rating

The rebuttal has addressed part of my concerns. I am inclined to accept this paper.

Formatting Issues

None

Author Response

We sincerely thank reviewer i6qu for their constructive comments. Our responses are as follows:

Q1&2: Clarifying our core contributions.

Our core contributions are as follows:

  1. While DiT is becoming the mainstream architecture and U-Net is gradually being phased out, prior work has shown that U-Net converges quickly, which is an exploitable advantage. However, its upper bound in terms of generation quality and convergence speed remains unexplored (previous U-Nets report FID>2, still a large gap from the state of the art). We aim to investigate the full potential of U-Net in these aspects (ultimately achieving 1.41 FID at merely 1M iterations).

  2. A plain application of REPA does not work well on U-Net due to the architecture mismatch. We adapt REPA to U-Nets by making key modifications, including the alignment position, upsampling, and the manifold loss.

In a nutshell, beyond proposing a method, this paper shows that U-Net is perhaps a better choice of diffusion architecture in terms of both convergence and generation quality, paving the way for future research.

Q3: Optimizing the layout is recommended.

We are sorry that our layout is too cramped due to space limits. We will move some less important parts to the appendix for better clarity and readability.

Comment

The rebuttal has addressed part of my concerns. I am inclined to accept this paper.

Comment

Thank you so much for your endorsement! We will include the clarification in the next revision.

Comment

Beyond the original REPA, our paper also gives a bunch of interesting discoveries as follows:

  • Through empirical analysis of the architecture, we hold that U-Net's feature downsampling is the key to its good performance (but this feature also causes trouble for REPA, and we focus on addressing it in U-REPA).

  • Our U-Net analysis indicates that the semantic, outline-related blocks are key for conducting representation alignment. Unlike REPA, where the early injection position is merely a hyperparameter with very limited discussion, we suggest that blocks with semantic functions require more visual-encoder guidance.

  • Aligning high-frequency, detail-rich ViT features with low-frequency, detail-poor U-Net features results in a similarity bottleneck during training (they cannot be completely aligned by REPA's token-wise objective). Hence, besides the REPA loss, we need other objectives for alignment.

  • As stressed above, U-Net achieves even better performance than DiT and converges much faster with the help of U-REPA. Though the current trend favors DiT, we suggest that U-Net is possibly a better choice for diffusion generation.

We also hope that these discoveries can contribute to the AIGC community and inspire future research.

Comment

We thank all reviewers for their constructive comments. Here we summarize the rebuttal and discussion with reviewers about key concerns as follows:

What is the SiT↓ performance?

We conducted experiments on SiT↓ as follows:

| | DiT | DiT↓ | SiT | SiT↓ |
| --- | --- | --- | --- | --- |
| FID↓ | 19.5 | 11.0 | 17.2 | 9.2 |

We did not omit this result from the paper draft to conceal anything; rather, we had included DiT/DiT↓ instead of SiT/SiT↓. Following the reviewers' advice, we agree that SiT↓ is important and will include it in the next revision.

How to ensure fair comparison?

We hold that comparing SiT↓+REPA and SiT↓+U-REPA is fair, since they use exactly the same architecture and settings. The results are as follows (all cited from the "Ablation Study" section of the paper; we had not provided these data in the rebuttal):

| Method | FID↓ |
| --- | --- |
| SiT↓-XL+REPA | 9.35 |
| SiT↓-XL+U-REPA (Ours) | 5.72 |

We hold that it is unfair to compare the improvement of SiT↓ / SiT↓ + U-REPA with that of SiT / SiT+REPA because:

  • When generation performance is stronger, it is also harder to improve (especially for FID↓ as it gets lower).

  • As U-Net is significantly different from ViT, the difficulty of aligning U-Net to ViT encoders differs from that of aligning DiT to ViT encoders.

Why not use traditional U-Nets?

Because traditional U-Nets are limited in capacity; they are mainly used for simpler tasks (typically smaller images and fewer classes).

The setting of DiT is more relevant to current applications and is also inherited by REPA. Thus we mainly adopt that setting rather than the traditional U-Net.

Our methodological contribution.

Our method targets REPA's adaptation to U-Nets. In contrast to DiTs, U-Nets have the unique design of a down-sampled stage in the middle of the model; yet the middle of the model is also the best place for representation alignment in models with skip connections. Hence, tailored feature alignment between different sizes is required. We introduce size alignment and a manifold-space alignment loss to align U-Nets with ViT encoders.
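The size-alignment step described above can be illustrated with a minimal numpy sketch (a hypothetical illustration, not the paper's implementation: the MLP projection is omitted, and nearest-neighbor upsampling and toy shapes are assumed). Mid-stage U-Net tokens on a coarse spatial grid are upsampled to match the ViT token grid, after which a REPA-style token-wise cosine similarity can be computed.

```python
import numpy as np

def upsample_token_grid(tokens, h, w, scale):
    """Nearest-neighbor upsampling of an (h*w, dim) token grid by `scale`
    along each spatial axis, yielding (h*scale*w*scale, dim) tokens."""
    dim = tokens.shape[1]
    grid = tokens.reshape(h, w, dim)
    # Duplicate each token scale times along both spatial axes.
    grid = grid.repeat(scale, axis=0).repeat(scale, axis=1)
    return grid.reshape(h * scale * w * scale, dim)

def tokenwise_cosine(a, b):
    """Mean cosine similarity over spatially aligned token pairs."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(np.sum(a * b, axis=1)))
```

For example (illustrative numbers, not the paper's), an 8x8 mid-stage grid upsampled by 2 matches a 16x16 ViT token grid, at which point the token-wise objective and a manifold-style loss can both be applied.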

Our method also yields several interesting discoveries (see the comment above), which we hope can inspire and contribute to the community's future research. In a nutshell, our paper shows that U-Net has great potential in terms of both performance and convergence.

We will carefully consider all advice from reviewers and add clarifications and experiments in the next revision.

Final Decision

Final rating: 5: Accept / 4: Borderline Accept / 5: Accept / 3: Borderline Reject. This paper introduces U-REPA, extending REPA to diffusion U-Nets by addressing block heterogeneity, downsampling mismatches, and ViT–U-Net token gaps. It aligns at the U-Net mid-stage, upsamples MLP-projected features, and complements token-wise similarity with a manifold loss. Experiments show faster convergence and FID < 1.5 on ImageNet 256×256 in 200 epochs (1M iters), outperforming REPA with roughly half the training under sd-vae-ft-ema.

Reviewers acknowledge the strong experiments and ablations but raise concerns about the missing comparison between SiT↓ and SiT↓+U-REPA, the absence of results with a convolutional U-Net, and other missing details and discussion. After the rebuttal, the concerns of reviewers BL2s, i6qu, and kqTd were largely addressed and they now lean toward acceptance; however, reviewer U3kJ still questions the fairness of the key experiments demonstrating U-REPA's contribution, notes missing experiments on SiT↓+REPA and SiT↓+U-REPA, points out the lack of direct visual comparisons between REPA and U-REPA, and requests additional runs on DDPM, iDDPM, or EDM.

The ACs conclude that fairness is ensured by comparing methods on the same U-Net architecture; they also note that the paper already contains the relevant experiments for SiT↓+REPA and SiT↓+U-REPA, judge the absence of visual side-by-sides as non-critical given the strong quantitative results (though encouraged for completeness), and consider the DDPM/iDDPM/EDM request misaligned, since those works concern diffusion sampling procedures whereas reviewer U3kJ's concern is about vision backbone architectures.

Overall, the ACs find the rebuttal satisfactory, view the paper as a strong contribution to feature alignment in U-Nets, highlighting their potential and offering an alternative to the prevailing DiT paradigm, and recommend acceptance.