Q5: Motivation and clarification of Eqn. (6)(7)(10) in the revised manuscript.

A: For Eqn. (6), the revised manuscript now notes that, as stated in [1], inferring local surface features is underdetermined because the supervision comes only from 2D signals, which leads to unstructured outliers with unreliable local surface descriptors. We therefore model this randomness in CIM by incorporating a variational term into the candidate features, establishing a probabilistic feature residual.

For Eqn. (7), we follow [2] to optimize the evidence lower bound (ELBO) of the variational auto-encoder. In the revised manuscript, we have included the reference to [2] for clarity.
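For reference, the generic form of the ELBO in [2] is sketched below. The notation here is the generic one ($x$ for the observation, $z$ for the latent variable, $q_\phi$ for the approximate posterior, $p_\theta$ for the likelihood, $p(z)$ for the prior) and is an assumption for illustration; it does not necessarily match the symbols of Eqn. (7).

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
$$

Maximizing the right-hand side trades off reconstruction quality (first term) against keeping the approximate posterior close to the prior (second term).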

For Eqn. (10), in the revised manuscript, we have noted that serves as the closure weight parameter which controls the support () or suppression () of . Moreover, we have added an illustration of such support or suppression in Figure 3 of the revised manuscript.

Q6: How the point cloud based methods are used in the experiment.

A: For the point cloud based methods, the original 3D Gaussians {} (generated by 3DGS+SuGAR) were used as the input, where the Gaussian center was regarded as the input point-wise coordinate. were concatenated and regarded as the input point-wise features. Except for the first layer whose input dimension was modified to fit the input Gaussians, we faithfully followed the network and training configurations of the point cloud based methods we compared with. In the revised manuscript, we have added the aforementioned details to Appendix B.1.

Q7: On qualitative examples.

A: Quality of the 3D boxes: Thank you for pointing this out. In the revised manuscript, we have removed the lower-quality first row of Figure 5 and now use a unique color for each bounding box for a more intuitive presentation.

NMS: We clarify that, in the qualitative results, we have faithfully used the official implementations of FCAF3D and NeRF-RPN, both of which apply Non-Maximum Suppression (NMS) as post-processing. NMS only removes a box when its overlap with a higher-scoring box exceeds a preset threshold; boxes that overlap by less than this threshold are retained. The overlapping boxes observed for the competing methods are therefore a consequence of this threshold, which may have caused your concern.
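To illustrate the effect of the threshold, a minimal sketch of greedy NMS follows. It is not the official FCAF3D or NeRF-RPN code: it assumes axis-aligned boxes given as (x1, y1, z1, x2, y2, z2), a hypothetical threshold of 0.5, and illustrative function names `iou_3d` and `nms_3d`.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (x1, y1, z1, x2, y2, z2)."""
    inter = 1.0
    for k in range(3):
        lo = max(a[k], b[k])
        hi = min(a[k + 3], b[k + 3])
        inter *= max(hi - lo, 0.0)  # zero if the boxes are disjoint on this axis
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def nms_3d(boxes, scores, iou_threshold):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already-kept box does not exceed the threshold."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Toy data (hypothetical): box 1 nearly duplicates box 0,
# box 2 only partially overlaps box 0.
boxes = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),
    (0.1, 0.0, 0.0, 2.1, 2.0, 2.0),
    (1.0, 0.0, 0.0, 3.0, 2.0, 2.0),
]
scores = [0.9, 0.8, 0.7]
kept = nms_3d(boxes, scores, iou_threshold=0.5)
print(kept)
```

In this toy case the near-duplicate box is suppressed, while the partially overlapping box (IoU of about 0.33 with the top-scoring box) survives, so visibly overlapping boxes can remain after NMS.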

[1] Guédon, Antoine, and Vincent Lepetit. "SuGaR: Surface-aligned Gaussian splatting for efficient 3D mesh reconstruction and high-quality mesh rendering." CVPR, 2024.

[2] Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." ICLR, 2014.