PaperHub

ICLR 2025 · Decision: Rejected
Overall rating: 5.3 / 10 (3 reviewers; scores 5, 6, 5; min 5, max 6, std 0.5)
Confidence: 3.7 · Correctness: 3.0 · Contribution: 2.3 · Presentation: 2.3

DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving

OpenReview · PDF
Submitted: 2024-09-13 · Updated: 2025-02-05

Abstract

Keywords
4D Gaussian Reconstruction; Autonomous Driving

Reviews and Discussion

Review (Rating: 5)

This paper proposes a feed-forward 4D reconstruction method that generates 4D scenes from surround-view video inputs in a single feed-forward pass. The method involves 3D Position Encoding, Temporal Cross Attention, a Gaussian Adapter, and a Prune and Dilate (PD) Block; together, these modules make up the feed-forward 4D reconstruction pipeline. The PD-Block learns to prune redundant Gaussian points from different views and background regions and to dilate Gaussian points for complex objects, enhancing reconstruction quality. The paper also presents rendering strategies for both static and dynamic components, enabling efficient supervision of rendered images across temporal sequences.

Strengths

1. This paper is the first to explore a feed-forward 4D reconstruction method for surround-view driving scenes, promoting the development of feed-forward techniques in 4D reconstruction.

2. The proposed PD-Block learns to prune and dilate Gaussian points and allows for Gaussian points that are not strictly pixel-aligned, which is innovative.

Weaknesses

1. The training process requires depth-map ground truth, whereas comparison methods such as pixelSplat and MVSplat can be trained without it. This reliance on depth ground truth during training restricts the method's practical applicability.

2. The dynamic objects are decomposed through segmentation and cover only a few categories (vehicles and people). This approach separates dynamic and static pixels based only on semantics, limiting its ability to achieve comprehensive 4D reconstruction of all dynamic objects.

3. Compared to scene-optimized methods, feed-forward reconstruction offers the advantage of generalization, eliminating the need for test-time optimization on each new scene (though it may lead to some decrease in accuracy compared to scene-optimized methods). The papers for the compared methods MVSplat and pixelSplat both report running time and memory consumption, demonstrating the efficiency of their feed-forward approaches. In this paper, however, while the authors claim their method is feed-forward, they do not provide an analysis of its running time and memory usage. I recommend including this efficiency analysis and comparing against other methods to strengthen the evaluation.

Besides, if the authors believe that efficiency is not a concern of this paper, then comparisons with other offline scene-optimized methods (e.g., DrivingGaussian) should be included.

4. If the intended application is to develop real-world simulators for autonomous driving (as mentioned in the abstract), then reconstruction efficiency is not a strict requirement, and existing offline scene-optimized 4D reconstruction methods are also acceptable. However, feed-forward reconstruction does not seem to have an advantage in terms of reconstruction accuracy.

Questions

1. What does “DA-Block” in line 202 refer to? It is not mentioned in the context.

2. Please refer to the questions and suggestions in the Weaknesses part.

Comment

(W4) The Necessity of a 4D Feed-Forward Network

While we acknowledge that the rendering quality of the feedforward network in driving scenes may currently be slightly lower than that of optimization methods, we are confident that feedforward networks will surpass optimization methods in the near future, especially with advancements in generative techniques. We would like to highlight several key points regarding the necessity of feedforward networks:

  1. General Driving Pre-Training Model: The 4D feedforward network serves as a general pre-training model for driving tasks. By learning 4D reconstruction, the network extracts geometric information from the entire scene while simultaneously predicting motion information for moving objects. This pre-training approach captures more geometric and temporal features than existing methods, as verified in Table 6 of our paper. Our model can serve as a pre-training tool to enhance the performance of driving algorithms in multiple tasks such as perception and planning without relying on labels like segmentation or 3D bounding boxes.
  2. Possibility of a 4D World Model: Current driving world models often exist in video form without explicit geometric representation, leading to limitations in perspective and temporal consistency. Upgrading our reconstruction model to support predictable 4D generation could address these issues. A 4D world model could provide explicit geometry to constrain end-to-end planning, representing a significant advancement for the field.
  3. Scaling Capabilities: The 4D feedforward network has the potential for scalability. It can be trained using an abundance of driving videos available on the Internet. This extensive training could greatly enhance reconstruction capabilities, which can then be leveraged for tasks such as perception and planning, either through pre-training or as part of the world model.
  4. Importance of Generalization and Efficiency: Building a 4D scene for a new city using optimization methods is very time-consuming, particularly with long videos capturing thousands of scenarios. For example, a 200-frame video (10 seconds) might take around 6 hours for an optimization-based reconstruction method, which is not feasible in practice.

(Q) Minor Typos

We would like to thank the reviewer for catching the typo regarding “DA-Block” in line 202; it should indeed be “PD-Block.” We will proofread the paper thoroughly to enhance its writing, presentation, and layout.

Thank you for your attention to our responses. If you have any further questions or if anything remains unclear, please don’t hesitate to let us know. We would be more than happy to discuss your concerns in greater detail.

Comment

(W1) The Need for Depth Map Ground Truth

We appreciate the reviewer's insights on the necessity of depth supervision. While depth supervision is not strictly required in our framework, we acknowledge that it can enhance the speed of model convergence and lead to improved final results.

To further clarify our findings, we conducted a comparative experiment to evaluate the impact of depth supervision under the experimental conditions detailed in Table 1. The results are presented below:

| Method         | PSNR  | SSIM | LPIPS |
|----------------|-------|------|-------|
| Ours w/o depth | 22.23 | 0.63 | 0.24  |
| Ours           | 23.70 | 0.68 | 0.17  |

The table indicates that our algorithm can still learn geometric information through temporal consistency, even in the absence of depth constraints. Furthermore, in driving scenarios, depth information from LiDAR is typically readily available, as discussed in other works (e.g., paper [1]).
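For reference, "depth supervision from LiDAR" in this context usually means a masked L1 term over pixels that receive a projected LiDAR return. The sketch below only illustrates that generic formulation; the exact loss form and weighting used in the paper may differ.

```python
# Generic sketch of sparse LiDAR depth supervision as a masked L1 loss.
# This is the standard formulation, not necessarily the paper's exact loss.
import torch

def sparse_depth_loss(pred_depth: torch.Tensor, lidar_depth: torch.Tensor) -> torch.Tensor:
    """pred_depth, lidar_depth: (B, H, W); pixels without a LiDAR return hold 0."""
    valid = lidar_depth > 0                      # supervise only projected LiDAR pixels
    if valid.sum() == 0:
        return pred_depth.new_zeros(())
    return (pred_depth[valid] - lidar_depth[valid]).abs().mean()
```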

(W2) The Limitations of Semantic Segmentation

Although we currently only segment cars, two-wheelers, and pedestrians as dynamic objects, there are simple ways to extend this to all dynamic objects. For instance, optical flow computed from the videos could identify points in dynamic regions, which could then serve as prompts for a segmentation model such as SAM (see the sketch below). This would enable us to segment a broader range of dynamic objects.
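As a rough illustration of this idea (not part of the paper's pipeline), the sketch below thresholds Farneback optical flow from OpenCV and feeds the resulting points to the public `segment_anything` predictor. The checkpoint path, flow threshold, and point sampling are illustrative assumptions, and in practice ego-motion-induced flow would also need to be compensated before thresholding.

```python
# Sketch: use optical-flow magnitude to pick point prompts for SAM so that arbitrary
# dynamic objects (not just cars/pedestrians) can be segmented. Assumes OpenCV and
# the public `segment_anything` package; the threshold and checkpoint are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def dynamic_mask(prev_bgr, curr_bgr, sam_checkpoint="sam_vit_h.pth", flow_thresh=2.0):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=-1)              # per-pixel motion magnitude
    ys, xs = np.where(mag > flow_thresh)             # candidate dynamic pixels
    if len(xs) == 0:
        return np.zeros(prev_gray.shape, dtype=bool)
    idx = np.random.choice(len(xs), size=min(16, len(xs)), replace=False)
    points = np.stack([xs[idx], ys[idx]], axis=1)    # (N, 2) point prompts in (x, y)

    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2RGB))
    masks, _, _ = predictor.predict(point_coords=points,
                                    point_labels=np.ones(len(points)),
                                    multimask_output=False)
    return masks[0]                                  # boolean dynamic-object mask
```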

We believe that this limitation should not overshadow the value of our paper, as many advanced and well-regarded driving reconstruction algorithms share this same challenge [1, 2, 3, 4]. Addressing it is a minor aspect of our framework and does not detract from the core innovations presented in our work.

[1] DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes. CVPR 2024.

[2] HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting. CVPR 2024.

[3] OmniRe: Omni Urban Scene Reconstruction. arXiv preprint arXiv:2408.16760 (2024).

[4] Street Gaussians for Modeling Dynamic Urban Scenes. ECCV 2024.

(W3) Lack of Efficiency Analysis

We appreciate the reviewer pointing out the need for an efficiency analysis. We will include an analysis of the efficiency of our method in the paper. According to the evaluation protocol in Table 1, we compared the speed and PSNR of our method against traditional optimization methods:

| Method          | PSNR  | SSIM | LPIPS | Time Cost |
|-----------------|-------|------|-------|-----------|
| 3D-GS           | 24.91 | 0.71 | 0.16  | 5.5 h     |
| DrivingGaussian | 26.12 | 0.74 | 0.13  | 6.2 h     |
| Ours            | 23.70 | 0.68 | 0.17  | 1.21 s    |

As shown in the table, our algorithm performs comparably to traditional optimization methods in terms of PSNR while significantly reducing the time cost. This experiment has been added to Part C of the supplementary material. To demonstrate the effectiveness of our reconstruction, we have provided reconstruction videos at the following link: https://anonymize58426.github.io/Drive-Recon/. In the videos, you can see the lane lines being translated between 3 and 12 seconds and the viewpoint being rotated between 17 and 26 seconds, which shows that the scenes reconstructed by our algorithm maintain geometric consistency.

Comment

Dear Reviewer EbAv,

I would like to express my sincere gratitude to you for your constructive comments. As the ICLR discussion phase is almost over, I wanted to kindly ask if there are any remaining questions or clarifications needed regarding our responses. Please feel free to reach out at any time.

We would be truly grateful if you could consider raising our rating, as your support is crucial for the potential acceptance of our work.

Best wishes,

Authors

Comment

I appreciate the authors' effort in pioneering the exploration of feed-forward autonomous 4DGS reconstruction and the authors’ rebuttal does solve some of my concerns (W1&W2).

However, I think the performance presented in this paper may not be sufficient to fully support its motivation. This paper requires multiple inputs (depth supervision / semantic segmentation) and a heavy network architecture (image encoder & decoder / TCA), but shows a significant drop in PSNR compared to optimization-based reconstruction methods (e.g., DrivingGaussian), and does not provide efficiency comparisons with other feed-forward reconstruction methods (as the proposed network is more complex than other methods, I guess it will need more time or memory). The aforementioned problems limit the practical use of this paper.

More comments

W3: The original 3DGS is designed for static scenes without depth or segmentation supervision, which makes the comparison to 3DGS less meaningful.

W4-4: The authors claim that a 200-frame video might take around 6 h for optimization-based reconstruction, while I believe the time overhead should not be this significant.

Comment

Dear Reviewer EbAv,

Thank you for carefully reading our response and providing further comments. We hope the further discussion below will address your remaining concerns.

Most importantly, multiple sources of supervision and complex modules are not, by themselves, grounds for rejecting a paper: (1) Optimization-based driving reconstruction approaches also rely on multiple sources of supervision (point-cloud initialization and segmentation of dynamic and static objects) [1, 2, 3, 4]. At inference time, our model does not need depth or segmentation labels. Moreover, our method can indeed be trained without semantic segmentation or 3D boxes: as shown in Table 4(a), we can learn geometry through perspective consistency and point-cloud depth information without Dynamic and Static Rendering (DS-R), and in the W1 response we also explained that our approach can be trained without depth supervision. (2) Learnable driving tasks, such as perception, also involve multiple modules, including a temporal fusion model, a multi-view fusion model, an image encoder, an image decoder, and a detection head [5, 6, 7].

[1] DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes. CVPR 2024.

[2] OmniRe: Omni Urban Scene Reconstruction. arXiv preprint arXiv:2408.16760 (2024).

[3] Street Gaussians for Modeling Dynamic Urban Scenes. ECCV 2024.

[4] UniSim: A Neural Closed-Loop Sensor Simulator. CVPR 2023.

[5] Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection. CVPR 2023.

[6] BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers. ECCV 2022.

[7] PanoOcc: Unified Occupancy Representation for Camera-Based 3D Panoptic Segmentation. CVPR 2024.

Comparison with Existing Feed-Forward Methods

Thanks to the reviewer's reminder; we have now evaluated the efficiency of other SOTA feed-forward generalizable models as well.

| Method     | PSNR  | SSIM | LPIPS | Time Cost | Memory  |
|------------|-------|------|-------|-----------|---------|
| LGM        | 19.52 | 0.52 | 0.32  | 1.82 s    | 21.42 G |
| pixelSplat | 20.54 | 0.58 | 0.28  | 2.44 s    | 19.65 G |
| MVSplat    | 21.33 | 0.64 | 0.24  | 1.64 s    | 15.47 G |
| L4GM       | 20.01 | 0.54 | 0.30  | 1.98 s    | 23.74 G |
| Ours       | 23.70 | 0.68 | 0.17  | 1.21 s    | 11.08 G |

As shown in the table, our method is clearly the best in inference speed and memory usage. For the autonomous driving scene, the efficiency of our method comes from: (1) multi-view fusion better integrates views with small overlap through a range-view representation; (2) temporal fusion operates on highly compressed implicit features, which greatly reduces memory and inference latency; (3) the image encoder and decoder are shared across views and can run in parallel (see the sketch below).
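To illustrate point (3), here is a toy sketch of a single shared encoder applied to all surround views in parallel by folding the view axis into the batch axis. The ResNet backbone and tensor shapes are placeholders chosen for the example, not DrivingRecon's actual architecture.

```python
# Toy sketch: one shared image encoder applied to all surround views in parallel
# by folding the view axis into the batch axis. Backbone and shapes are placeholders.
import torch
import torch.nn as nn
import torchvision

class SharedViewEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc

    def forward(self, images):                 # images: (B, V, 3, H, W) surround views
        b, v, c, h, w = images.shape
        x = images.reshape(b * v, c, h, w)     # fold views into the batch dimension
        feats = self.encoder(x)                # one parallel pass through shared weights
        return feats.reshape(b, v, *feats.shape[1:])  # (B, V, C', H', W')

if __name__ == "__main__":
    enc = SharedViewEncoder()
    out = enc(torch.randn(2, 6, 3, 256, 512))  # e.g. 6 nuScenes-style camera views
    print(out.shape)
```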

Disadvantages of other methods: (1) For multi-view inputs, MVSplat needs to compute a cost volume between every pair of images, which greatly increases memory and inference latency. (2) LGM and L4GM concatenate all images and feed them into a multi-view attention fusion network; sending uncompressed images into the view-fusion network consumes memory and increases inference latency, and the small overlap between different views in driving scenes does not require such a redundant attention mechanism. (3) pixelSplat uses epipolar attention to fuse different views; again, the small overlap between views in driving scenes does not require such a redundant attention mechanism, and in practice a large number of queries are empty.

Limitations of the Original 3DGS

In our efficiency comparison, the original 3DGS also relies heavily on point-cloud supervision to initialize the Gaussian points. Furthermore, a separate 3DGS model needs to be trained at each time step t, which is why reconstructing a 200-frame video with 3DGS takes approximately 5.5 hours. However, the reconstruction tends to overfit, and its ability to synthesize novel views is poor, as shown in the table below.

| Method          | Recon. PSNR | Recon. SSIM | Recon. LPIPS | NVS PSNR | NVS SSIM | NVS LPIPS | Time Cost |
|-----------------|-------------|-------------|--------------|----------|----------|-----------|-----------|
| 3D-GS           | 24.91       | 0.71        | 0.16         | 18.81    | 0.55     | 0.31      | 5.5 h     |
| DrivingGaussian | 26.12       | 0.74        | 0.13         | 22.34    | 0.74     | 0.19      | 6.2 h     |
| Ours            | 23.70       | 0.68        | 0.17         | 20.63    | 0.61     | 0.21      | 1.21 s    |

(Recon. = reconstruction; NVS = novel view synthesis.)

As shown in the table, 3DGS deteriorates significantly in novel view synthesis. At each time step, only a few observed viewpoints are available in driving scenes (e.g., 6 views in nuScenes and 5 in Waymo) to supervise 3DGS. Besides, the overlap between these surround views is minimal, making it challenging for 3DGS to accurately learn the geometry. To address this, DrivingGaussian and our method use segmentation to identify the static background and perform cross-temporal supervision.

If you have any further questions or if anything remains unclear, please don’t hesitate to let us know.

Comment

Thanks to the authors for the response. I am moving the discussion here as I share similar concerns with Reviewer EbAv.

  1. The quality in the video is way too low to call it a reconstruction method.
  2. While the papers you mentioned could be further simplified, I still feel DrivingRecon is more complicated and entangled than necessary, requiring more specialized modules, pretrained models, and priors.
  3. The speed comparison is not solid. In my experience, StreetGaussian, with some hyperparameters tuned and trained for 5-10 minutes, can end up with better results than DrivingRecon. Also, the claimed 1.21 s time cost for DrivingRecon is not the full-log reconstruction cost (the authors don't mention this, but I expect it is per timestamp), so this seems an unfair (or at least unsolid) comparison. In particular, the comparison setup for 3DGS ("At each time step, a 3DGS model needs to be trained.") is problematic, as these methods can be trained on all frames simultaneously to achieve better and more temporally consistent results. Lastly, DrivingRecon requires significant GPU resources for training (24 NVIDIA A100 80GB GPUs, as stated in the paper). With this training cost, we could reconstruct hundreds of logs at higher quality.
Comment

Dear Reviewers,

Thank you for your response. We readily admit that optimization-based methods still achieve better quality than generalizable methods. We are not saying that our model can be simplified; we are saying that we can make better use of some new techniques to further optimize the algorithm. Even five minutes of training for StreetGaussian is still more than 200 times slower than a generalizable method, and iterative optimization does not require less memory than our inference does. Incidentally, with the original StreetGaussian parameters, training takes about 2 hours on a 3090 or A100 and 1.5 hours on a 4090. The original 3DGS cannot be trained across the time series because it has no separation of static and moving objects, and we compare against DrivingGaussian rather than StreetGaussian. We insist that generalizable Gaussian reconstruction is promising, and we feel the first 4D generalizable driving reconstruction paper is being judged too harshly.

Best, Authors

Review (Rating: 6)

This paper proposes a learning-based reconstruction method that works in a feed-forward manner in driving scenarios. It predicts 4D Gaussian primitives from multi-view temporal input. It is very early work exploring learning-based generalizable reconstruction and rendering for autonomous driving. The paper also introduces a couple of downstream applications, such as model pre-training and vehicle adaptation.

Strengths

  • It is a very early work that explores learning-based generalizable reconstruction methods for autonomous driving, demonstrating this paradigm could work in real-world driving scenarios.
  • This paper is comprehensive since it not only develops the methods but also incorporates potential applications such as perception and driving tasks.
  • The self-supervised pretraining task is insightful.

Weaknesses

  • This paper does not demonstrate the model's generalization to different viewpoints. The authors claim the ability of vehicle adaptation; however, only the camera intrinsics are changed. Could the predicted 4D Gaussians produce good rendering quality for viewpoints beyond the driving trajectories (different extrinsics)? A recent work [1] explores this direction.

  • The resolution is relatively low. The produced rendering quality cannot meet the requirements of practical use, such as camera simulation.

  • It would be better to show the inference latency.

  • The authors do not provide video demonstrations of the rendering results. It is hard to get an intuitive understanding of the actual performance.

[1] FreeVS: Generative View Synthesis on Free Driving Trajectory.

Questions

How does the scene editing (Fig. 6) work? This procedure could be described in more detail.

Comment

Thank you for your thoughtful comments and for taking the time to review our work. Your feedback is genuinely appreciated, and I hope to clarify and enhance our responses to your concerns.

(W1 and W4) Generalization to Different Viewpoints

To demonstrate the effectiveness of our method in generating new perspectives, we have provided a series of reconstruction videos available at the following link: Drive-Recon Videos. In these videos, you can observe the translation of lane lines between 3 to 12 seconds and the rotation of viewpoints between 17 to 26 seconds. This effectively illustrates that the scenes reconstructed by our algorithm maintain geometric consistency across varying perspectives.

Additionally, I appreciate your reference to the Freevs method; it is indeed an intriguing approach. I believe we can adapt some of its useful components to enhance our model's ability to generate new viewpoints.

(W2) Limitations of Resolution

Currently, it seems that feedforward networks yield slightly lower rendering quality in driving scenarios compared to traditional optimization methods. However, I am hopeful that advancements in network architectures, training strategies, and generative techniques will soon enable feedforward networks to exceed the capabilities of optimization methods. Our paper serves as a preliminary study in the driving domain and provides a codebase that may accelerate progress in this area.

Moreover, existing state-of-the-art generalizable Gaussian splatting algorithms tend to operate at lower resolutions and realism, particularly in indoor scenes [1, 2, 3]. Therefore, we believe our work is still at the forefront of the field of generalizable Gaussian splatting.

References:

  1. MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images. ECCV 2024.
  2. Large Spatial Model: End-to-End Unposed Images to Semantic 3D. NeurIPS 2024.
  3. FreeSplat: Generalizable 3D Gaussian Splatting Towards Free-View Synthesis of Indoor Scene Reconstruction. NeurIPS 2024.

(W3) Lack of Efficiency Analysis

Thank you for highlighting the need for an efficiency analysis. We will include a thorough evaluation of our method's efficiency in the revised paper. Based on the evaluation protocol outlined in Table 1, we compared the speed and PSNR of our method against traditional optimization methods:

| Method          | PSNR  | SSIM | LPIPS | Time Cost |
|-----------------|-------|------|-------|-----------|
| 3D-GS           | 24.91 | 0.71 | 0.16  | 5.5 h     |
| DrivingGaussian | 26.12 | 0.74 | 0.13  | 6.2 h     |
| Ours            | 23.70 | 0.68 | 0.17  | 1.21 s    |

As indicated in the table, our algorithm performs comparably to traditional optimization methods in terms of PSNR while significantly reducing time costs. This efficiency makes our method more suitable for data-driven applications, such as driving simulators. This experiment has been added to Part C of the supplementary material.

(Q) Details of Scene Editing

To edit scenes, we can utilize existing 3D generation models, such as LGM [1], to create Gaussian representations of arbitrary objects. We then modify the x, y, and z positions of these Gaussians (in world coordinates) to place them appropriately within the driving scenario. Importantly, the Gaussians predicted by our method are also represented in the world coordinate system, so inserted objects can be composited directly (see the sketch below).

Reference:

  1. LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation.
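To make the placement step concrete, here is a minimal sketch under assumed data structures (a simple dict of per-Gaussian arrays). It illustrates the described editing procedure; the layout and function name are hypothetical, not the paper's actual code.

```python
# Minimal sketch: take object Gaussians produced by a generator such as LGM, place
# them at a chosen pose in the world frame, and append them to the scene Gaussians
# predicted by the reconstruction model. The dict layout here is a hypothetical
# illustration, not the paper's actual data structure.
import numpy as np

def insert_object(scene_gs: dict, obj_gs: dict, R: np.ndarray, t: np.ndarray) -> dict:
    """scene_gs / obj_gs: {'xyz': (N, 3), 'opacity': (N, 1), 'scale': (N, 3),
    'rot': (N, 4), 'rgb': (N, 3)} arrays; R (3, 3) and t (3,) place the object in world."""
    placed = {k: v.copy() for k, v in obj_gs.items()}
    placed["xyz"] = obj_gs["xyz"] @ R.T + t   # move Gaussian centers into the scene
    # NOTE: for a full edit, the per-Gaussian orientations ('rot') should also be
    # composed with R; this is omitted to keep the sketch short.
    return {k: np.concatenate([scene_gs[k], placed[k]], axis=0) for k in scene_gs}
```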

Thank you for your attention to our responses. If you have any further questions or if anything remains unclear, please don’t hesitate to let us know. We would be more than happy to discuss your concerns in greater detail.

Comment

Dear Reviewer 3kKi,

I would like to express my sincere gratitude to you for your constructive comments. As the ICLR discussion phase is almost over, I wanted to kindly ask if there are any remaining questions or clarifications needed regarding our responses. Please feel free to reach out at any time.

We would be truly grateful if you could consider raising our rating, as your support is crucial for the potential acceptance of our work.

Best wishes,

Authors

Comment

Thanks for providing the additional experiment. However, I asked for the inference latency (rendering) in my original review. The authors seem to provide the optimization time for 3DGS and DrivingGaussian.

The generalization of new viewpoints seems to have some issues since there are clear unreasonable deformations or artifacts when laterally shifting the viewpoints.

Comment

Dear Reviewer 3kKi,

Thank you for carefully reading our response and providing further comments. Here is our further response:

Comparison with Existing Feed-Forward Methods

We apologize for the misunderstanding. The time reported for our method in the table is inference latency, while the times for the optimization-based methods are optimization times. The inference latency of the rendering step alone is only about 0.04 seconds. In addition to optimization-based methods, we further evaluated the efficiency of other SOTA feed-forward generalizable models.

| Method     | PSNR  | SSIM | LPIPS | Time Cost | Memory  |
|------------|-------|------|-------|-----------|---------|
| LGM        | 19.52 | 0.52 | 0.32  | 1.82 s    | 21.42 G |
| pixelSplat | 20.54 | 0.58 | 0.28  | 2.44 s    | 19.65 G |
| MVSplat    | 21.33 | 0.64 | 0.24  | 1.64 s    | 15.47 G |
| L4GM       | 20.01 | 0.54 | 0.30  | 1.98 s    | 23.74 G |
| Ours       | 23.70 | 0.68 | 0.17  | 1.21 s    | 11.08 G |

As shown in the table, our method is clearly the best in inference speed and memory usage. For the autonomous driving scene, the efficiency of our method comes from: (1) multi-view fusion better integrates views with small overlap through a range-view representation; (2) temporal fusion operates on highly compressed implicit features, which greatly reduces memory and inference latency; (3) the image encoder and decoder are shared across views and can run in parallel.

Disadvantages of other methods: (1) For multi-view inputs, MVSplat needs to compute a cost volume between every pair of images, which greatly increases memory and inference latency. (2) LGM and L4GM concatenate all images and feed them into a multi-view attention fusion network; sending uncompressed images into the view-fusion network consumes memory and increases inference latency, and the small overlap between different views in driving scenes does not require such a redundant attention mechanism. (3) pixelSplat uses epipolar attention to fuse different views; again, the small overlap between views in driving scenes does not require such a redundant attention mechanism, and in practice a large number of queries are empty.

Some issues of novel views

For driving scenes, synthesizing novel views is still a very challenging problem. Even well-known optimization methods, such as DrivingGaussian, do not render novel views well. The FreeVS work you mentioned is a good novel view synthesis solution, even though it is only a very recent arXiv paper. In addition, FreeVS is a generative method for synthesizing novel-view images rather than a 3D/4D reconstruction method that predicts Gaussians; these are two different routes. DrivingRecon makes it easy to incorporate such new methods into a combined synthesis solution. We believe our approach has great potential.

Thank you for your attention to our responses. If you have any further questions or if anything remains unclear, please don’t hesitate to let us know. We would be more than happy to discuss your concerns in greater detail.

Review (Rating: 5)

Unlike previous methods (e.g., 3DGS/NeRF) that require thousands of iterations to reconstruct a scene, this work aims to predict a 3D scene representation using a neural network.

The authors make several design choices to make this pipeline work (PD-block, regularization, 3D encoding, etc.).

Experiments conducted on Waymo demonstrate better performance compared to other generalizable approaches.

Strengths

  1. A generalizable and scalable approach that allows training of large models to learn priors from extensive data, generalizing to novel scenes.
  2. Almost no 3D bounding box labels required for dynamic scenes, enhancing scalability.
  3. Detailed explanations and extensive experiments on cross-data evaluation, downstream applications (data augmentation, pretrained model for perception, scene editing).

Weaknesses

  1. Overcomplicated design: While I appreciate the effort in developing a generalizable model with dynamic-static decomposition, the model seems quite complex, requiring:

    • Multiple modules (image encoder-decoder, temporal cross-attention, Gaussian adapter, PD block, etc.)
    • Numerous regularization terms
    • Several pretrained models (DepthNet, DeepLab, SAM)

    This complexity may hinder downstream applications when used as a pretrained model. For instance, how fast is the model? Is it efficient enough for use in autonomy systems?

  2. The realism is still lower compared to optimization-based approaches (e.g., 3DGS), and the method can only operate at a low resolution (256x512) with a limited number of images.

  3. (Minor point) The writing seems somewhat rushed, lacking thorough proofreading. Some potential issues:

    • L155, "corresponding intrinsic parameter E" should be K
    • L414 "evaluation on NOTA-DS6" should be Diversity-54

Questions

Regarding efficiency and comparison with 3DGS

What is the computational cost to train the model (how many hours on 24 GPUs)? How long does it take to reconstruct a 3D scene representation using your approach during inference? How does the efficiency compare to 3DGS, e.g., StreetGaussian on 256x512?

How does the realism compare to 3DGS (e.g., StreetGaussian at 256 × 512)? It's okay if it's worse; I'm just curious.

On 3D labels: What is the performance without using 3D bounding boxes at all? I note that you use 3D bounding boxes as prompts for SAM. A label-free approach would make this work more impactful.

On downstream applications: How is UniAD implemented in Waymo? Would it be possible to conduct your experiments on nuScenes to follow the setting/implementation of UniAD?

Miscellaneous:

  • How many frames are in the input during training?
  • In Table 4b, what does "Training Num" refer to? Do you mean number of scenes? The PSNR seems quite high compared to Table 3.

Some questions may require additional experiments; please disregard if they're not feasible. However, I'm particularly interested in the efficiency and comparison with 3DGS.

Comment

(W1) Overcomplicated Design

Thank you for your thoughtful comments regarding the complexity of our model. We believe that the design elements are essential for effective driving reconstruction for several reasons:

  1. Necessity of Different Modules: The 4D feed-forward model requires the efficient integration of multiple perspectives and varying time intervals. To achieve this, we utilize temporal cross-attention to merge temporal information and implement the PD-Block for effective multi-view image fusion. Additionally, the GaussianAdapter transfers image features into a Gaussian representation (see the sketch after this list).
  2. Importance of Regularization Terms: At any given moment t, the rendering supervision of the scene is limited by the sparse number of views. Moreover, the presence of multiple dynamic objects complicates supervision at time t. To address this, we decouple dynamic objects from static ones, allowing for better use of the available viewpoints during rendering. Although this decoupling introduces numerous regularization terms, we regard them as necessary for achieving the desired outcomes.
  3. Significance of Pretrained Models: Besides leveraging the SAM and DeepLab models, we acknowledge that more efficient methods exist for decoupling dynamic and static objects, which we discuss in detail in the response to Q3. Importantly, DepthNet is not a pretrained network; it is a trainable part of our model.
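For readers unfamiliar with this kind of adapter, the sketch below shows the typical pattern: per-pixel features are decoded into depth, opacity, scale, rotation, and color, and the depth is unprojected along camera rays to obtain world-frame Gaussian centers. The channel layout and activations are illustrative choices, not the exact design used in the paper.

```python
# Schematic sketch of a Gaussian-adapter head. Channel sizes and activations are
# illustrative, not DrivingRecon's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianAdapterSketch(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # depth(1) + opacity(1) + scale(3) + rotation quaternion(4) + RGB(3) = 12 channels
        self.head = nn.Conv2d(feat_dim, 12, kernel_size=1)

    def forward(self, feats, rays_o, rays_d):
        # feats: (B, C, H, W); rays_o/rays_d: (B, 3, H, W) ray origins/directions in world frame
        out = self.head(feats)
        depth   = F.softplus(out[:, 0:1])            # positive depth along each ray
        opacity = torch.sigmoid(out[:, 1:2])
        scale   = F.softplus(out[:, 2:5])
        rot     = F.normalize(out[:, 5:9], dim=1)    # unit quaternion
        rgb     = torch.sigmoid(out[:, 9:12])
        xyz = rays_o + depth * rays_d                # unproject to world-frame Gaussian centers
        return {"xyz": xyz, "opacity": opacity, "scale": scale, "rot": rot, "rgb": rgb}
```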

Our findings in Table 4(a) demonstrate that these components are both valid and necessary. For downstream tasks, we utilize only the image encoder as the pretrained model, omitting the others to prevent any additional burden on training and inference. The model’s reconstruction inference speed is elaborated upon in (Q1).

Furthermore, even widely recognized optimization-based reconstruction algorithms exhibit significant complexity in driving scenes [1, 2, 3, 4, 5]. They often integrate multiple annotations, regularizers, and neural network architectures.

[1] DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes. CVPR 2024.

[2] HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting. CVPR 2024.

[3] OmniRe: Omni Urban Scene Reconstruction. arXiv preprint arXiv:2408.16760 (2024).

[4] Street Gaussians for Modeling Dynamic Urban Scenes. ECCV 2024.

[5] UniSim: A Neural Closed-Loop Sensor Simulator. CVPR 2023.

(W2) The Limitation of Realism

Currently, it appears that feed-forward networks exhibit slightly lower rendering quality in driving scenarios compared to optimization methods. We have provided reconstruction videos of our method at the following link: https://anonymize58426.github.io/Drive-Recon/. In the videos, you can see the lane lines being translated between 3 and 12 seconds and the viewpoint being rotated between 17 and 26 seconds, demonstrating that the scenes reconstructed by our algorithm maintain geometric consistency. We are confident that, in the near future, advancements in network architectures, training strategies, and generative techniques will enable feed-forward networks to surpass optimization methods. This paper is a preliminary study in the field of driving and provides a code base, which will accelerate the development of the field.

Furthermore, existing state-of-the-art generalizable Gaussian splatting algorithms often operate at relatively low resolutions and realism in indoor scenes [1, 2, 3]. Therefore, our paper is still at an advanced level in the field of generalizable Gaussian splatting.

[1] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images. ECCV 2024.

[2] Large Spatial Model: End-to-End Unposed Images to Semantic 3D. NeurIPS 2024.

[3] FreeSplat: Generalizable 3D Gaussian Splatting Towards Free-View Synthesis of Indoor Scenes. NeurIPS 2024.

Comment

(Q1) Comparison to Optimization-Based Approaches

We appreciate your inquiry regarding efficiency. We require only 20 hours to complete 50,000 iterations on 24 A100 GPUs. According to the protocol outlined in Table 1, we have compared the time cost and PSNR of our method with that of traditional optimization methods.

| Method          | PSNR  | SSIM | LPIPS | Time Cost |
|-----------------|-------|------|-------|-----------|
| 3D-GS           | 24.91 | 0.71 | 0.16  | 5.5 h     |
| DrivingGaussian | 26.12 | 0.74 | 0.13  | 6.2 h     |
| Ours            | 23.70 | 0.68 | 0.17  | 1.21 s    |

As illustrated in the table above, our algorithm is nearly comparable to traditional optimization methods, while inference takes only 1.21 seconds, highlighting our method's substantial speed advantage. This experiment has been added to Part C of the supplementary material.


(Q2) The Need for 3D Boxes

I am grateful for your question about the use of 3D boxes. Our method can indeed be trained without requiring semantic segmentation or 3D boxes. As shown in Table 4(a), we can learn geometry through perspective consistency and point cloud depth information without the need for Dynamic and Static Rendering (DS-R).

Additionally, we can use simple techniques for segmentation without utilizing 3D boxes. For example, we can compute optical flow from videos to identify points in dynamic regions, which can be used as prompts for the SAM to segment dynamic objects, enabling us to segment any type of dynamic object without 3D boxes.


(Q3) Downstream Applications

We train DrivingRecon on the nuScenes training set without relying on 3D boxes or segmentation ($\lambda_{sr}=0$ and $\lambda_{seg}=0$). The pretrained model is then employed as the image encoder for UniAD, and UniAD is fine-tuned entirely using the original UniAD training parameters. The results for UniAD in Table 6 are sourced from the original paper, and our pretrained model significantly enhances performance. We will provide clearer details about the experimental setup in the paper.


(Q4) Miscellaneous

Thank you for your attention to our experimental setup. We use three frames of images as input for all experiments. In Table 4(b), "Training Num" refers to the number of training scenes. Tables 1, 2, 3, and 4(a) use only 64 scenes (NOTA-DS64). Specifically, we trained with 64 scenes (NOTA-DS64) and tested with 54 new scenes (Diversity-54), achieving satisfactory results. This indicates that our algorithm demonstrates good generalization performance even when trained on a small dataset. We will reiterate these details in the appropriate sections of the paper.

Thank you for your attention to our responses. If you have any further questions or if anything remains unclear, please don’t hesitate to let us know. We would be more than happy to discuss your concerns in greater detail.

Comment

Dear Reviewer conn,

I would like to express my sincere gratitude to you for your constructive comments. As the ICLR discussion phase is almost over, I wanted to kindly ask if there are any remaining questions or clarifications needed regarding our responses. Please feel free to reach out at any time.

We would be truly grateful if you could consider raising our rating, as your support is crucial for the potential acceptance of our work.

Best wishes,

Authors

Comment

Dear Reviewer conn,

Thank you for carefully reading our response and providing further comments. This is our further reply:

(P1) Complicated Design

We acknowledge that our approach is complex, but for autonomous driving this is hard to avoid. Learnable driving tasks, such as perception, also involve multiple modules, including a temporal fusion model, a multi-view fusion model, an image encoder, an image decoder, and a detection head [1, 2]. These methods, once considered too complex, have already been deployed on actual vehicles. So this alone is not a reason to reject our paper. Moreover, the two algorithms you mentioned are also complex [3, 4].

[1] Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection. CVPR 2023.

[2] BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers. ECCV 2022.

[3] G3R: Gradient Guided Generalizable Reconstruction. ECCV 2024.

[4] SCube: Instant Large-Scale Scene Reconstruction using VoxSplats. arXiv preprint arXiv:2410.20030 (2024).

(P2) Other Generalizable Approaches

Thank you for mentioning the two new related papers. The quality produced by these two papers is not significantly better than ours: if you zoom in on Fig. 4 of both Paper [3] and Paper [4], you will find that their rendering quality is not good enough. Papers [3] and [4] are also difficult to compare against because they are not open source and have complex designs, and Paper [4] appeared after we submitted our paper. We discuss them further below:

Paper [3] also lacks realism. The images in its paper (Fig. 4) are very small and become very blurry when zoomed in. It reports inference speeds of 31 s and 123 s for a single scene, and its model cannot even predict the 3D Gaussian representation directly; multiple network iterations are needed to obtain it. Besides, the method is not open source and, given its complex algorithm, would be very difficult to reproduce.

Method [4] appeared on arXiv after the ICLR submission deadline. The visualization in Figure 4 of that paper is not much better than ours. Because we work with multiple views at multiple time steps, the resolution of the images we render is limited by GPU memory. Their approach takes only three images as input and splits the model into two stages, which allows them to render higher-resolution images, and it learns geometric features entirely by relying on dense point-cloud supervision obtained through a set of point-cloud densification methods. We believe our approach has more potential with small improvements: (1) splitting images from different time steps across different GPUs and then merging temporal features across GPUs, which is already widely used in video generation; in this way we can render higher-resolution images; (2) our approach can further make good use of pretrained video models, whereas their approach relies on voxel-based models, which severely limits its upper bound; (3) the visual encoder of our method can be used as a pretrained model, while their model cannot be used in this way at all.

We believe that in the near future, advances in network architectures, training strategies, and generative techniques will enable feed-forward networks to surpass optimization methods. For example: (1) use pretrained video encoders and decoders to improve rendering quality; (2) assign images from different time steps to different GPUs and then perform feature fusion across GPUs, which is a common operation in video generation; (3) use more driving datasets to train stronger models.

In addition, the most advanced generalizable Gaussian splatting algorithms available generally operate at relatively low resolution and realism in indoor scenes [5, 6, 7]. Therefore, this paper is still at an advanced level in the field of generalizable Gaussian splatting.

[5] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images. ECCV 2024.

[6] Large Spatial Model: End-to-End Unposed Images to Semantic 3D. NeurIPS 2024.

[7] FreeSplat: Generalizable 3D Gaussian Splatting Towards Free-View Synthesis of Indoor Scenes. NeurIPS 2024.

UniAD Experiments

We sincerely apologize for the inaccuracy in our earlier reply. The original data in that table were copied from Table 10 of ViDAR [8] rather than from UniAD. The experimental settings follow the official ViDAR repository exactly (https://github.com/OpenDriveLab/ViDAR); specifically, we only replaced the image encoder and kept everything else the same.

[8] Visual Point Cloud Forecasting enables Scalable Autonomous Driving. CVPR 2024

Thank you for your attention to our responses. If you have any further questions or if anything remains unclear, please don’t hesitate to let us know. We would be more than happy to discuss your concerns in greater detail.

Comment

We fixed some minor typos and uploaded the paper. If you have any further questions or if anything remains unclear, please don’t hesitate to let us know. We would be more than happy to discuss your concerns in greater detail.

We have made minor revisions to the paper, all of which are summarized as follows:

Efficiency comparison: Our approach is compared with both traditional optimization methods and recent generalizable feed-forward networks in terms of latency and memory usage. Results and discussion are in Section C of the supplementary material.

Detailed experimental process: We have fixed some typos and added more details on the pretrained model and the image editing experiment. We will proofread the paper thoroughly to enhance its writing, presentation, and layout.

Best wishes,

Authors

Comment

Dear Reviewers,

We sincerely apologize for troubling you once again, and we deeply appreciate the time and effort you have put into reviewing our submission. We have responded further and look forward to discussing with you. As the ICLR discussion period has been extended, we still have a few days to continue the discussion. Please let us know if there are any additional points or concerns that you would like us to address before the discussion phase concludes. Your valuable feedback is highly appreciated and will help us further improve our work.

Best regards,

Authors

AC Meta-Review

This paper proposes a driving scene reconstruction model, named DRIVINGRECON, which directly predicts 4D Gaussians from surround-view videos. While the idea is somewhat novel and offers a unique approach to scene reconstruction, the performance falls short compared to existing optimized methods. Additionally, the overall pipeline is excessively complex, which may hinder practical implementation and maintenance. Based on these strengths and weaknesses, the decision is not to recommend acceptance at this time.

Additional Comments from the Reviewer Discussion

This paper was reviewed by three experts in the field and finally received marginal scores of 5, 5, and 6. Two major concerns of the reviewers are:

  1. the proposed method cannot achieve comparable performance relative to optimized methods,
  2. the pipeline is excessively complex.

The authors failed to address these two concerns during the discussion period. I fully agree with these two concerns and, therefore, make the decision to reject the paper.
Final Decision

Reject