PaperHub
Average rating: 6.3 / 10 (Rejected; 4 reviewers)
Individual ratings: 5, 6, 6, 8 (min 5, max 8, std 1.1)
Confidence: 4.5 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.3
ICLR 2025

MeshLRM: Large Reconstruction Model for High-Quality Meshes

Submitted: 2024-09-25 · Updated: 2025-02-05

Abstract

Keywords
Sparse-view reconstruction · High-quality mesh · Large Reconstruction Models

Reviews and Discussion

Review
Rating: 5

This manuscript introduces a two-stage feed-forward large reconstruction model that utilizes sparse views as input. In the first stage, NeRF-like volume rendering is employed for supervision, with progressive training regarding resolution. In the second stage, Differentiable Marching Cubes (DiffMC) is leveraged to convert the density field into a mesh, which supports fast rendering and geometry supervision. With the integration of a ray-opacity loss, MeshLRM is capable of generating high-quality meshes. Furthermore, by simplifying the encoder, MeshLRM achieves faster inference speeds.

Strengths

  1. Utilizing DiffMC in LRM is novel. Indeed, meshes offer a superior 3D representation.
  2. The reconstructed mesh appears to be of high quality, featuring smooth geometry.
  3. The proposed ray-opacity loss is technically sound and well-founded.

Weaknesses

  1. The definition of post-optimization in Line 56 is unclear. In traditional multi-view reconstruction tasks with SDF-based volume rendering, meshes can be directly extracted by Marching Cubes (MC) by identifying the zero-level set without any post-optimization.
  2. A discussion is lacking on why converting NeRFs into meshes leads to a significant drop in rendering quality and geometric accuracy, as mentioned in Line 66.
  3. The definitions of the post-processing steps, specifically Marching Cubes in NeRF and Differentiable Marching Cubes in MeshLRM, should be illustrated more clearly. If both processes are considered post-processing, then the disadvantages of NeRF should not include the need for additional post-optimization.
  4. There are some grammatical issues. For example, a comma is missing in Line 76.
  5. I recommend adding an illustration in Line 75 to depict why NeRF-based LRM pretraining is important.
  6. Considering there are two-stage optimizations, the claim of end-to-end optimization in Line 83 is not precise enough.
  7. The assertion in Line 205 is not convincing enough. A comparison with In3D-LRM does not substantiate the claim that DINO cannot handle high-resolution images. There should be an ablation study in MeshLRM to adequately discuss why DINO is not necessary. Since you claim a contradiction in this paragraph, it is crucial to address this.
  8. There is no quantitative comparison between NeRF-based LRM and Mesh-based LRM. Additionally, it appears that NeRF-based LRM achieves better PSNR.
  9. There is a lack of quantitative comparison in the progressive training regarding resolution.
  10. There are too few comparison methods. Only LGM and In3D are compared.

Questions

  1. Why are density fields considered more compatible with the Marching Cubes step in Line 73? In traditional multi-view reconstruction tasks, Signed Distance Functions (SDFs) have proven to be more effective than density fields for mesh reconstruction, as surfaces are clearly defined by the zero-level set. What makes density fields preferable in this context?

  2. I do not fully understand the challenges associated with the second-stage mesh-based LRM optimization. When using FlexiCubes in InstantMesh, there do not appear to be issues related to mesh floaters. Could you elaborate on the advantages of choosing Differentiable Marching Cubes (DiffMC) over FlexiCubes?

Comment

Comparison with newer baseline (W10): We have added a new table to quantitatively compare our method against recent baselines. Given that input settings differ among the baselines, we follow the setup in the concurrent work MeshFormer [4] to evaluate all methods under a unified single-view-to-3D setting. Specifically, for InstantMesh, MeshFormer, and our method, we first used Zero123++ [5] to convert the input single-view image into multi-view images before performing 3D reconstruction. The other baselines follow their original configurations, taking a single-view image directly as input.

The results, where all methods generate a mesh as the final output, are presented below.

Method            GSO                      Omni3D
                  PSNR    SSIM    LPIPS    PSNR    SSIM    LPIPS
TripoSR [1]       19.85   0.753   0.265    17.68   0.745   0.277
LGM [2]           18.52   0.713   0.349    14.75   0.646   0.455
InstantMesh [3]   20.89   0.775   0.218    17.61   0.742   0.275
MeshFormer [4]    21.47   0.793   0.201    18.14   0.755   0.266
Ours              21.31   0.794   0.192    18.10   0.759   0.257

Our method demonstrates a significant advantage over TripoSR, LGM, and InstantMesh across all three metrics on both datasets. Additionally, it is comparable to MeshFormer, with slightly lower PSNR and better SSIM and LPIPS. Note that MeshFormer is specifically designed for the single-image-to-3D task and requires multi-view consistent normals as input during its reconstruction step. This makes it unsuitable for sparse-view reconstruction from captured images, such as OpenIllumination setups, where only RGB images with camera poses are available. In contrast, our method handles this setting well, as shown in Table 6.

We will also expand the discussion on recent works in the related work section and include the new results in the revised version.

What makes density fields preferable in this context? (Q1): Based on studies of per-scene optimization methods, the SDF-based approaches suggested by the reviewer tend to produce over-smoothed geometry and struggle to capture thin structures, as noted by NeRF2Mesh [7], and they also achieve worse rendering quality than density-based methods. In contrast, the strategy of pretraining with density fields followed by fine-tuning with rasterization has been demonstrated to be effective in previous works such as Magic3D [6], NeRF2Mesh [7], and NeuManifold [8]. Therefore, we adopt this approach. We successfully make our mesh rendering quality match the volume rendering quality, which is rare among per-scene optimization methods.

The reason for choosing DiffMC (Q2): Our pipeline is compatible with DMTet and Flexicubes, not limited to DiffMC. However, we chose DiffMC due to its superior processing speed and minimal GPU VRAM consumption (as detailed in the comparison table on its GitHub repository: https://github.com/SarahWeiii/diso?tab=readme-ov-file#speed-comparison). This efficiency is particularly beneficial for large-scale training, where both speed and GPU optimization are crucial.
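
For readers who want to see how such a differentiable iso-extraction step could slot into the second training stage, here is a minimal sketch. The DiffMC constructor and call signature are assumptions based on the diso repository's README, and the grid size, threshold, and variable names are illustrative rather than the values used in the paper.

```python
# Minimal sketch; the diso DiffMC interface is assumed from its README and may
# differ across versions. Grid size and threshold are placeholders.
import torch
from diso import DiffMC  # differentiable Marching Cubes with CUDA kernels

diffmc = DiffMC(dtype=torch.float32).cuda()

# Stand-in for the density grid decoded from the triplane representation.
density = torch.rand(128, 128, 128, device="cuda", requires_grad=True)
threshold = 0.5  # pre-selected iso-level, shared with standard Marching Cubes

# DiffMC extracts the zero level set, so shift the field so the surface sits at
# zero (negative inside, positive outside).
verts, faces = diffmc(threshold - density, None)

# The mesh can then be rasterized (e.g., with nvdiffrast) and supervised with
# image losses; gradients flow through DiffMC back into the density field.
```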

Compared to the concurrent work InstantMesh, there are numerous differences between our pipeline and theirs. Factors such as model architecture, model size, training precision, and other design choices can significantly impact the training difficulty. However, we have carefully addressed the challenges in training our model, resulting in a method that achieves better quality than InstantMesh [3].

[1] Tochilkin, Dmitry, et al. "Triposr: Fast 3d object reconstruction from a single image." arXiv (2024).

[2] Tang, Jiaxiang, et al. "LGM: Large multi-view gaussian model for high-resolution 3d content creation." ECCV, 2024.

[3] Xu, Jiale, et al. "Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models." arXiv (2024).

[4] Liu, Minghua, et al. "Meshformer: High-quality mesh generation with 3d-guided reconstruction model." NeurIPS (2024).

[5] Shi, Ruoxi, et al. "Zero123++: a single image to consistent multi-view diffusion base model." arXiv (2023).

[6] Lin, Chen-Hsuan, et al. "Magic3d: High-resolution text-to-3d content creation." CVPR. 2023.

[7] Tang, Jiaxiang, et al. "Delicate textured mesh recovery from nerf via adaptive surface refinement." ICCV. 2023.

[8] Wei, Xinyue, et al. "Neumanifold: Neural watertight manifold reconstruction with efficient and high-quality rendering support." arXiv (2023).

[9] Munkberg, Jacob, et al. "Extracting triangular 3d models, materials, and lighting from images." CVPR. 2022.

Comment

Since MeshLRM is essentially a reconstruction model, I believe that comparisons based solely on 2D metrics are insufficient. Including 3D metrics such as Chamfer distance and F-score would provide a more comprehensive evaluation.

Furthermore, given the stated limitation that "it is not sufficiently robust when the input images contain complex materials," I think it would be helpful to include a visualization of a failure case to illustrate this limitation.

Comment

We compute the mesh metric for this table using the same setup as in other experiments presented in the paper. The Chamfer distance is calculated using points sampled from both the ground-truth and reconstructed meshes within a [-1, 1]^3 bounding box, measured in units of 10^-3 (a sketch of this computation follows the table below).

Method            GSO                               Omni3D
                  PSNR    SSIM    LPIPS   CD        PSNR    SSIM    LPIPS   CD
TripoSR [1]       19.85   0.753   0.265   27.48     17.68   0.745   0.277   28.69
LGM [2]           18.52   0.713   0.349   44.41     14.75   0.646   0.455   63.38
InstantMesh [3]   20.89   0.775   0.218   20.80     17.61   0.742   0.275   30.20
MeshFormer [4]    21.47   0.793   0.201   16.30     18.14   0.755   0.266   25.91
Ours              21.31   0.794   0.192   15.74     18.10   0.759   0.257   25.51
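
As referenced above, a hedged sketch of how such a Chamfer distance number can be computed is shown here. The sample count, the squared-distance convention, and the use of trimesh and SciPy are illustrative assumptions, not the paper's exact evaluation code.

```python
# Hedged sketch of a symmetric Chamfer distance between two meshes assumed to be
# normalized to a [-1, 1]^3 box; sample count and squared-distance convention are
# assumptions for illustration.
import numpy as np
import trimesh
from scipy.spatial import cKDTree

def chamfer_distance(gt_mesh_path, pred_mesh_path, n_points=100_000):
    gt_pts = trimesh.load(gt_mesh_path, force="mesh").sample(n_points)
    pred_pts = trimesh.load(pred_mesh_path, force="mesh").sample(n_points)

    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # nearest-neighbor distances
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)

    cd = np.mean(d_gt_to_pred ** 2) + np.mean(d_pred_to_gt ** 2)
    return cd * 1e3   # report in units of 10^-3, as in the table above
```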

As for the failure cases, we are happy to add a figure for visualization in the revised paper.

Comment

Thank you for providing the visualization of the failure cases.

I have a minor concern. While trying the online Gradio demo, I noticed that the geometry of the cup shown in the failure case appears nearly perfect. However, this seems inconsistent with the results presented in the revised manuscript, where the cup exhibits noticeable uneven surfaces.

Could you clarify if there are any differences between the unofficial online demo and the original MeshLRM? Could it possibly be an updated version of MeshLRM?

Comment

We tested the unofficial online Gradio demo using the cup model and observed that the rendering effect produced by the Gradio mesh renderer differs from that of Blender, which we used to render images for the paper. Multiple factors may contribute to this difference, such as lighting conditions, shading algorithms, and material properties used in the rendering process. Although the mesh appears smooth in the Gradio visualizer, small uneven artifacts are still visible when the downloaded mesh is viewed in our Blender setup. Additionally, we noticed that the lighting conditions we initially chose in Blender magnified the artifacts, so we updated the figure with improved lighting conditions.

Comment

I still have concerns about the performance difference between the Gradio demo and MeshLRM. In the first version, there is a noticeable concavity on the surfaces, and I don’t think that lighting can significantly alter the geometry.

I believe surface normals provide a more accurate representation of the geometry. Therefore, I recommend using normal visualization for failure cases, rather than relying on rendered RGB images.

Comment

The lighting does not change the geometry but rather affects its appearance (shading and shadowing). See the video demonstrating how lighting influences rendering. In the video, we change the lighting condition from the current one to the previous one and then back again, during which the artifacts gradually appear and then disappear.

We can include the surface normal in the final paper, as suggested by the reviewer (figure). From the surface normal, it can be seen that the artifacts are at similar levels to the image renderings. This is why we stated that the previous lighting condition magnifies the artifacts.

(Click the "video" and "figure" links to view the results in anonymous links.)

Comment

Abbreviations in the answers: W: Weaknesses; Q: Questions

Clarification about post-optimization (W1, W3): Post-optimization in Line 56 refers to techniques applied after training a neural radiance field that require a separate mesh-related optimization stage, such as fine-tuning (NeRF2Mesh [1], NeuManifold [2]) or distillation (NeRFMeshing [3]), to enhance mesh reconstruction quality in the per-scene optimization case. It does not refer to directly applying marching cubes to the density field.

Why converting a NeRF-based method leads to a performance drop (W2): Since volume rendering does not require a strict surface definition, applying marching cubes to extract a mesh surface can easily introduce inaccuracies, such as geometric noise or artifacts (e.g., ringing effects). This is why previous works [1][2][3] attempt to address this in the per-scene optimization setting.
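
To make the contrast with the differentiable pipeline concrete, a standard post-hoc extraction step as described here could look like the following sketch. skimage's marching_cubes is used purely for illustration, and the grid and threshold are placeholders.

```python
# Hedged sketch: standard, non-differentiable Marching Cubes applied post hoc to a
# density grid queried from a trained radiance field; grid and threshold are placeholders.
import numpy as np
from skimage import measure

density = np.random.rand(64, 64, 64)   # stand-in for densities queried on a 3D grid
threshold = 0.5                         # iso-level chosen by hand

verts, faces, normals, _ = measure.marching_cubes(density, level=threshold)
# Nothing constrains the density field to form a clean surface, so the extracted mesh
# can carry noise or ringing artifacts, and no rendering loss can be back-propagated
# through this step to correct them.
```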

Why NeRF-based pretraining is necessary (W5): We found that single-stage training (i.e., without NeRF initialization) fails to produce reasonable results. Directly training with a mesh representation often gets stuck in local minima and fails to converge, whereas neural radiance fields provide a better representation for handling topological changes in a smooth, continuous manner. This issue has also been widely noted and discussed in several per-scene optimization studies, such as Magic3D [1], NeuManifold [2], and NeRF2Mesh [3].

Statement of end-to-end optimization (W6): Although our training process is divided into two stages, the network architecture remains consistent across them. Moreover, our second-stage mesh training optimizes all components jointly, directly regressing the final mesh output from the input images, making it completely end-to-end.

Why DINO is not necessary (W7): In addition to comparing with In3D, we also evaluated the benefits of DINO features and trained a MeshLRM model using the DINO encoder for the same number of steps as our proposed models, which leads to negligible quality differences. More details can be seen in paper line 405.

Quantitative comparison between NeRF-based LRM and Mesh-based LRM (W8): We provide a quantitative comparison in Table 5 between MeshLRM (NeRF) and MeshLRM. Our MeshLRM achieves comparable results to MeshLRM (NeRF), with even better performance on SSIM and LPIPS metrics. This is particularly challenging because, in per-scene optimization studies, mesh-based methods typically exhibit lower quality compared to volumetric-based methods, as shown in nvdiffrec [9] and nerf2mesh [7].

Quantitative comparison in the progressive training regarding resolution (W9): We provide a quantitative comparison in Table 1, contrasting training from scratch on high-resolution data with our low-to-high-resolution training strategy.

Comment

After reading through the responses and review suggestions from other reviewers, I noticed that all reviewers raised concerns about the novelty of this work. Reviewer ftqh specifically pointed out the lack of public availability.

While the authors argue that the simplification of the network architecture and the regularization loss are carefully designed, I believe that without DiffMC, using FlexiCubes as InstantMesh does, the regularization loss would not be needed at all.

Furthermore, although Reviewer 6QBv mentions that "While combining LRM with diffMC is an effective and practical engineering approach, the concept appears somewhat ad hoc. However, this does not constitute a significant reason for rejection, as the paper still meets the typical standards of previously accepted work.", I do not fully agree with this point. I do not think that LRMs or 3D generation should be held to a lower standard for acceptance.

Overall, considering all of these factors, I have decided to revise my rating to a weak reject.

Comment

We are glad to have addressed the earlier concerns raised by the reviewer during the discussion. However, we are very surprised that new concerns, which were not previously mentioned or emphasized, have now been introduced and considered significant enough to lower the rating. We now provide quick responses to address these points.

Public Availability: The reviewer has already discovered and successfully tested a publicly available unofficial demo of our method, which shows similar quality. This clearly demonstrates the reproducibility and availability of our approach. We promise to release a public demo implemented by the authors to ensure availability.

Novelty and Contribution: While we acknowledge varying perspectives on assessing a paper’s contribution, we believe the most critical measure—and the fundamental goal of research—is advancing the state of the art in the field. Our work achieves this by outperforming all baseline methods, including very recent approaches such as LGM (ECCV 2024) and MeshFormer (NeurIPS 2024), both of which were recognized as state of the art and published within the past few months. This highlights the significance of our contribution. We believe that any work that genuinely advances the state of the art with carefully justified experiments deserves to be seen by the community. Furthermore, we strongly disagree with the characterization of our work as "ad hoc" or "trivial engineering." Every design choice in our framework is carefully evaluated and supported by experimental evidence. If any aspect remains unclear, we are happy to provide additional clarifications.

Regarding InstantMesh: InstantMesh is a concurrent work with high-level similarities to ours. In principle, such concurrent works should not influence the evaluation of our paper. Nevertheless, we have already provided extensive discussions and comparisons with InstantMesh, demonstrating that our method achieves superior quality. Specifically, we emphasize that our opacity loss is a novel design in our pipeline, as shown in newly added Figure 5, with experimental evidence in Figure 3 confirming its necessity and effectiveness. Designing a new model that integrates elements from both our work and InstantMesh to achieve high quality while removing the opacity loss goes beyond the scope of this submission.

We believe unbiased reviewing should focus on the demonstrated quality, experiments, and contributions of the submission itself. If there are specific points that remain critical to alter the rating, we kindly ask for clarification so we can address them in detail.

Review
Rating: 6

The paper introduces MeshLRM that outputs a density and color field suitable for mesh extraction using marching cubes. Building on LRM, MeshLRM incorporates a second training stage where NeRF rendering is replaced with differentiable marching cubes (diffMC) rendering. The authors also present comprehensive ablation studies on additional losses, such as Ray Opacity Loss, which enhance the robustness of the second training stage.

Strengths

The paper is clearly written, and the quality of the results is noticeably better than the previous NeRF-based approach, Inst-LRM, in terms of extracted mesh quality. From the perspective of engineering and practicality, the design choices proposed in this paper are sound and seem effective. The ablations in the paper are extensive and seem sound to me.

Weaknesses

First, I have some reservations about the title "MeshLRM" and certain descriptions in the introduction, which may be over-claimed. For instance, in lines 67-68, the statement, "We propose to address this with MeshLRM, a novel transformer-based large reconstruction model, designed to directly output high-fidelity 3D meshes from sparse-view inputs," could be misleading. In my view, the network does not "directly" output a mesh; instead, it produces density and color fields similar to its baseline model, which still require marching cubes to extract the mesh. The phrase "direct output mesh" more accurately describes a different family of models, such as MeshGPT [1].

Second, another concern is the novelty of the paper. While combining LRM with diffMC is an effective and practical engineering approach, the concept appears somewhat ad hoc, mainly merging existing components rather than presenting deep or non-trivial new insights. However, this does not constitute a significant reason for rejection, as the paper still meets the typical standards of previously accepted work.

[1] MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers, CVPR 24.

Questions

Regarding the second stage of training, I am curious if the authors could elaborate on which parts of the network require fine-tuning. Specifically, could the model be effective with fine-tuning only the tiny MLPs or merely the last few layers of the transformer blocks?

Comment

Abbreviations in the answers: W: Weaknesses; Q: Questions

Statement about “directly output meshes” (W1): Our method is indeed different from MeshGPT, which directly predicts triangles. However, our approach incorporates marching cubes within its pipeline and employs an explicit mesh representation during training. Therefore, we consider our method to be directly generating meshes, unlike other LRM methods that rely on implicit representations during training and require additional processing to extract a mesh. We will add further clarification to clearly distinguish our approach from MeshGPT-like methods to make the claim clearer.

Novelty of the paper (W2): The novelty of the paper is not limited to combining LRM and differentiable iso-extraction, but also proposing a simplified version of LRM architecture, which is mentioned to be “important and directly applicable to other LRM-based approaches” by Reviewer apsR. We show that a simple transformer is effective for 3D reconstruction, offering an easier path to building and scaling LRMs. Moreover, combining LRM and DiffMC in an end-to-end training framework encounters new challenges (like unstable training) and requires novel solutions. Our carefully designed training strategies and losses are crucial for ensuring stable training and high performance. We hope our findings can benefit the community in building scalable 3D reconstruction models. We sincerely appreciate the reviewer’s recognition that our paper “meets the typical standards of previously accepted work”!

Ablation study on fine-tuning (Q1): We experimented with fine-tuning only part of the architecture and observed that all the losses converged to higher values compared to fine-tuning the entire network. In particular, the ringing artifacts on the geometry were still difficult to eliminate. Therefore, we choose to fine-tune the entire pipeline.

Review
Rating: 6

This paper presents MeshLRM, a large reconstruction model that enables the generation of high-quality 3D meshes from posed sparse-view images. It incorporates differentiable mesh extraction and rendering within the LRM framework, simplifying the architecture and improving training strategies for faster and better quality reconstruction. The model introduces a ray opacity loss for stable DiffMC-based training and achieves state-of-the-art performance in sparse-view mesh reconstruction, supporting applications like text-to-3D and single-image-to-3D generation.

Strengths

  • The incorporation of differentiable mesh extraction and rendering within the LRM framework is reasonable, enabling end-to-end mesh reconstruction without post-optimization steps.
  • The introduction of the ray opacity loss is a novel contribution that addresses the challenge of stabilizing training for mesh reconstruction.
  • The paper is well-written and easy to follow. The mesh visualization in the teaser is impressive.

Weaknesses

  • Limited evaluation on real-world data.
    The paper primarily evaluates MeshLRM on synthetic datasets. While the performance on these datasets is impressive, the model's ability to generalize to real-world images with complex lighting and materials is less explored. Incorporating more experiments on real-world datasets such as OmniObject3D [1] could provide a clearer picture of the model's practical applicability.

  • Incomplete discussion and comparison with newer baselines.
    The paper compares with Instant3D and LGM; the former is not open-sourced, while the latter is based on Gaussian Splatting and uses a ConvNet instead of a transformer. I think some newer methods should be discussed and compared in the paper, e.g., InstantMesh [2], GeoLRM [3], and MeshFormer [4]. In particular, I believe InstantMesh is a perfect baseline for comparison since it is open-sourced and also adopts the "LRM + differentiable mesh extraction" paradigm.

  • Concerns on technical novelty and training stability.
    Pioneering works like Magic3D [5] and InstantMesh have explored the pipeline of two-stage 3D generation by combining NeRF and mesh representations, so I think the usage of the mesh representation is reasonable but not novel.
    Besides, I have another concern about the training stability of the mesh representation. According to the paper, the training process of MeshLRM is split into two stages, using NeRF and mesh representations respectively, and "Once trained with volume rendering, our model already achieves high-fidelity sparse-view NeRF reconstruction, which can be rendered with ray marching to create realistic images" (Section 3.3, line 272). Why is stage 1 essential if the proposed ray opacity loss can indeed stabilize the training process on the mesh representation? What if we skip stage 1 and directly train the model on the mesh representation from scratch? I think the authors should add some discussion of this issue in the paper.

[1] Wu, Tong, et al. "Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[2] Xu, Jiale, et al. "Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models." arXiv preprint arXiv:2404.07191 (2024).
[3] Zhang, Chubin, et al. "Geolrm: Geometry-aware large reconstruction model for high-quality 3d gaussian generation." arXiv preprint arXiv:2406.15333 (2024).
[4] Liu, Minghua, et al. "Meshformer: High-quality mesh generation with 3d-guided reconstruction model." arXiv preprint arXiv:2408.10198 (2024).
[5] Lin, Chen-Hsuan, et al. "Magic3d: High-resolution text-to-3d content creation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Questions

My questions have been listed in the weakness part, the author's response on them will serve as the main basis for my final rating. Besides, the paper does not mention the future availability of code or pre-trained models, which is crucial for reproducibility and further research by the community. Providing such resources would greatly benefit the field.

Comment

Abbreviations in the answers: W: Weaknesses; Q: Questions

Comparison with newer baseline and on real-world dataset (W1, W2): We have added a new table to quantitatively compare our method against recent baselines both on GSO and OmniObject3D datasets. Given that input settings differ among the baselines, we follow the setup in the concurrent work MeshFormer [4] to evaluate all methods under a unified single-view-to-3D setting. Specifically, for InstantMesh, MeshFormer, and our method, we first used Zero123++ [5] to convert the input single-view image into multi-view images before performing 3D reconstruction. The other baselines follow their original configurations, taking a single-view image directly as input.

The results, where all methods generate a mesh as the final output, are presented below.

Method            GSO                      Omni3D
                  PSNR    SSIM    LPIPS    PSNR    SSIM    LPIPS
TripoSR [1]       19.85   0.753   0.265    17.68   0.745   0.277
LGM [2]           18.52   0.713   0.349    14.75   0.646   0.455
InstantMesh [3]   20.89   0.775   0.218    17.61   0.742   0.275
MeshFormer [4]    21.47   0.793   0.201    18.14   0.755   0.266
Ours              21.31   0.794   0.192    18.10   0.759   0.257

Our method demonstrates a significant advantage over TripoSR, LGM, and InstantMesh across all three metrics on both datasets. Additionally, it is comparable to MeshFormer, with slightly lower PSNR and better SSIM and LPIPS. Note that MeshFormer is specifically designed for the single-image-to-3D task and requires multi-view consistent normals as input during its reconstruction step. This makes it unsuitable for sparse-view reconstruction from captured images, such as OpenIllumination setups, where only RGB images with camera poses are available. In contrast, our method handles this setting well, as shown in Table 6.

We will also expand the discussion on recent works in the related work section and include the new results in the revised version.

Technical novelty (W3): The novelty of our paper extends beyond simply combining LRM with differentiable iso-extraction. We also introduce a simplified version of the LRM architecture, which, as noted by Reviewer apsR, is “important and directly applicable to other LRM-based approaches.” We demonstrate that a straightforward transformer can be highly effective for 3D reconstruction, providing a more accessible path to building and scaling LRMs.

Training 3D LRMs presents significantly more challenges compared to per-scene optimization (e.g., Magic3D). Techniques like differentiable rasterization, when applied in this context, often face new issues such as unstable training, necessitating novel solutions. Our carefully designed training strategies and loss functions are critical for achieving stable training and high performance. We hope our findings will benefit the community in developing scalable 3D reconstruction models.

Furthermore, compared to concurrent work like InstantMesh, our method achieves superior quality as shown in the comparison table provided above.

Training stability (W3): The limitation of directly training on mesh representations is not only due to training stability, but also because mesh representations are more prone to getting stuck in local minima during optimization. In contrast, neural radiance fields are a better choice as they can handle topological changes in a smooth and continuous manner, as noted in studies like Magic3D [6] and NeuManifold [7]. Our experiments also confirm this: single-stage training (i.e., without NeRF initialization) fails to produce reasonable results.

[1] Tochilkin, Dmitry, et al. "Triposr: Fast 3d object reconstruction from a single image." arXiv (2024).

[2] Tang, Jiaxiang, et al. "LGM: Large multi-view gaussian model for high-resolution 3d content creation." ECCV, 2024.

[3] Xu, Jiale, et al. "Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models." arXiv (2024).

[4] Liu, Minghua, et al. "Meshformer: High-quality mesh generation with 3d-guided reconstruction model." NeurIPS (2024).

[5] Shi, Ruoxi, et al. "Zero123++: a single image to consistent multi-view diffusion base model." arXiv (2023).

[6] Lin, Chen-Hsuan, et al. "Magic3d: High-resolution text-to-3d content creation." CVPR. 2023.

[7] Wei, Xinyue, et al. "Neumanifold: Neural watertight manifold reconstruction with efficient and high-quality rendering support." arXiv (2023).

Comment

Thank you for your detailed response. After thoroughly reviewing all reviews and responses, I maintain my original assessment. While the paper seems to be technically sound for me, its limited novelty remains a primary concern. The work makes notable contributions in simplifying LRM architecture through the removal of DINO image encoders and usage of sequential self-attention blocks, while demonstrating performance advantages over the similar concurrent work InstantMesh. However, the paper does not present sufficiently novel technical insights or methodological innovations to warrant a higher rating. Additionally, the lack of public availability raises concerns about reproducibility and potential impact on the 3D AIGC research community.

Comment

Thanks for your feedback!

We appreciate your recognition of the contributions made by our model architecture, particularly the removal of the per-view DINO encoder and the adoption of solely self-attention (across all multi-view image and triplane tokens). These design choices simplify the LRM framework while also improving quality. Notably, a similar critical transition has occurred in the NLP community—from encoder-decoder style transformers with cross-attention (as in the original "Attention Is All You Need") to decoder-only transformers with pure self-attention (as in GPT and other LLMs)—resulting in significantly improved scalability and inference quality. Our work demonstrates that such a transition (from cross-attention to self-attention, and from encoder-decoder to decoder-only architectures) can also benefit 3D large models, enhancing both scalability and quality. We hope these findings will provide valuable insights for the 3D vision and AIGC research communities in the development of future 3D large models.
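
To make this architectural point concrete, the following is an illustrative sketch (not the paper's actual implementation) of the decoder-only-style design described above: multi-view image tokens and triplane tokens are concatenated into a single sequence and processed with plain self-attention, instead of routing triplane queries through cross-attention to a separately encoded image stream. All dimensions, token counts, and the module choice are placeholders.

```python
# Illustrative sketch only: joint self-attention over image and triplane tokens.
# Dimensions, token counts, and nn.TransformerEncoderLayer are placeholders.
import torch
import torch.nn as nn

dim, n_heads = 512, 8
img_tokens = torch.randn(1, 4 * 256, dim)       # patch tokens from 4 posed input views
tri_tokens = torch.randn(1, 3 * 32 * 32, dim)   # learnable triplane tokens

block = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)

# One sequence, one self-attention: every triplane token can attend to every image
# token (and vice versa) without a per-view image encoder or cross-attention layers.
tokens = block(torch.cat([img_tokens, tri_tokens], dim=1))

# Slice the updated triplane tokens back out; a small MLP would decode density/color.
tri_tokens = tokens[:, img_tokens.shape[1]:]
```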

Additionally, we believe our simplified model is easier to implement compared to earlier LRMs. To support reproducibility, we have detailed our multi-stage training strategy, novel losses, and hyper-parameters in the paper. To further enhance accessibility, we will release a public demo of our method for the community to explore.

Review
Rating: 8

The paper presents a transformer-based framework for efficient 3D mesh reconstruction from sparse views. Unlike previous approaches relying solely on NeRF-based volumetric models with marching cubes as a post-hoc mesh extraction solution, MeshLRM incorporates differentiable mesh extraction and rendering directly within the LRM framework. The model undergoes sequential low-to-high-resolution training, followed by DiffMC-aware fine-tuning. Key contributions include DiffMC integration for improved mesh quality, a novel ray opacity loss to eliminate floaters, and training strategies like progressive resolution scaling for improved convergence.

Strengths

The paper contains valuable insights that improve surface reconstruction quality and speed for LRMs. It makes a straightforward yet impactful contribution by integrating a differentiable isosurface extraction method — specifically DiffMC — into the training framework, enabling more precise mesh extraction. The proposed ray opacity loss demonstrates substantial qualitative improvements, effectively resolving the floater issue. Additionally, the insight about MLP over-parameterization leading to rendering speed slowdowns is important and directly applicable to other LRM-based approaches.

The paper is well-written and structured, clearly guiding the reader through model design, training strategy, and key components such as ray opacity loss and progressive resolution scaling, with their necessity well-supported by ablations.

Weaknesses

The paper offers a specific training recipe for LRMs, but key components of this approach appear underexplored:

  • The ray opacity loss is not entirely novel; ray regularization for zero density in empty space is a standard technique in NeRF approaches. While the formulation may be unique, it is unclear how it performs against simpler density-to-zero regularization or background mask losses.
  • The inclusion of DiffMC, while valuable, also lacks thorough evaluation. Alternative isosurface extraction methods, such as DMTet, are available, yet the rationale for choosing DiffMC over these options is not discussed.
  • The paper’s comparative analysis is limited and primarily focuses on a single baseline (Instant3D-LRM). Although MeshLRM demonstrates quantitative improvements, qualitative results are mixed, showing reduced high-frequency details in geometry and texture. For example, textures in most samples in Figure 5 lack sharpness compared to the baseline, and the geometry generally appears to be over-smoothed, e.g. frog’s legs merge with the stand in Sample 4, Fig. 5.
  • The paper does not discuss recent works, such as InstantMesh or TripoSR, suggesting it may not have been updated since its initial publication on arXiv six months before the submission. Additionally, the baselines used for comparison are approximately a year old, which dilutes the paper's relevance to the community given the rapid advancements in the field.

While the addition of isosurface extraction is important, it lacks thorough investigation. Most training strategy improvements are supported by ablations, but these are not evaluated against alternative options. The value of the paper's proposed recipe is further diluted by the existence of other potentially superior reconstruction methods, such as InstantMesh and TripoSR.

Questions

  • Is there evidence supporting DiffMC as a superior differentiable isosurface extraction method for mesh extraction compared to alternatives like DMTet or FlexiCubes?
  • Is there comparative evidence showing that the proposed ray opacity loss formula outperforms existing density regularization techniques?
  • How was the density threshold determined for DiffMC vs. standard Marching Cubes in the ablations? This choice could significantly impact surface quality, particularly in the non-finetuned case.
  • Given the claimed speed improvements, how does MeshLRM's performance compare to other recently published approaches?

Comment

Abbreviations in the answers: W: Weaknesses; Q: Questions

Ray opacity loss vs. other regularization (W1, Q2): The main difference between our ray opacity loss and previous approaches lies in the selection of supervision points. Standard density regularization uniformly supervises all points in space, including those on the mesh surface as well as in empty regions, which can lead to gradients that are less effective at guiding the model toward a clean surface. On the other hand, standard depth loss only supervises points on the mesh surface; once floaters appear during training, its gradients can move these floaters but cannot easily eliminate them. In contrast to these standard losses, our opacity loss supervises the entire space between the surface and the camera, effectively preventing floaters and stabilizing the training process. We will add a figure in the revised paper to further illustrate the difference.
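
To illustrate the distinction, here is a hedged PyTorch sketch of an opacity-style loss that penalizes the opacity accumulated by all samples between the camera and the known surface, rather than supervising only surface points or all points uniformly. The exact formulation in the paper may differ; shapes, names, and the use of a ground-truth depth are illustrative assumptions.

```python
# Hedged sketch of a ray-opacity-style loss; not the paper's exact formulation.
import torch

def ray_opacity_loss(sigma, t_vals, deltas, gt_depth):
    """
    sigma:    (R, S) densities sampled along R rays
    t_vals:   (R, S) sample distances from the camera
    deltas:   (R, S) spacing between consecutive samples
    gt_depth: (R,)   surface depth per ray (e.g., from ground-truth geometry)
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)             # per-sample opacity
    ones = torch.ones_like(alpha[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                               # standard volume-rendering weights

    # Only samples strictly in front of the surface contribute, so any opacity
    # accumulated in the empty camera-to-surface region is driven toward zero.
    in_front = (t_vals < gt_depth[:, None]).float()
    return (weights * in_front).sum(dim=-1).mean()
```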

Evidence of choosing DiffMC (W2, Q1): Our pipeline is compatible with DMTet and Flexicubes, not limited to DiffMC. However, we chose DiffMC due to its superior processing speed and minimal GPU VRAM consumption (as detailed in the comparison table on its GitHub repository: https://github.com/SarahWeiii/diso?tab=readme-ov-file#speed-comparison). This efficiency is particularly beneficial for large-scale training, where both speed and GPU optimization are crucial.

Performance compared to recently published baselines (W3, W4, Q4): We have added a new table to quantitatively compare our method against recent baselines. Given that input settings differ among the baselines, we follow the setup in the concurrent work MeshFormer [4] to evaluate all methods under a unified single-view-to-3D setting. Specifically, for InstantMesh, MeshFormer, and our method, we first use Zero123++ [5] to convert the input single-view image into multi-view images before performing 3D reconstruction. The other baselines follow their original configurations, taking a single-view image directly as input.

The results, where all methods generate a mesh as the final output, are presented below.

Method            GSO                      Omni3D
                  PSNR    SSIM    LPIPS    PSNR    SSIM    LPIPS
TripoSR [1]       19.85   0.753   0.265    17.68   0.745   0.277
LGM [2]           18.52   0.713   0.349    14.75   0.646   0.455
InstantMesh [3]   20.89   0.775   0.218    17.61   0.742   0.275
MeshFormer [4]    21.47   0.793   0.201    18.14   0.755   0.266
Ours              21.31   0.794   0.192    18.10   0.759   0.257

Our method demonstrates a significant advantage over TripoSR, LGM, and InstantMesh across all three metrics on both datasets. Additionally, it is comparable to MeshFormer, with slightly lower PSNR and better SSIM and LPIPS. Note that MeshFormer is specifically designed for the single-image-to-3D task and requires multi-view consistent normals as input during its reconstruction step. This makes it unsuitable for sparse-view reconstruction from captured images, such as OpenIllumination setups, where only RGB images with camera poses are available. In contrast, our method handles this setting well, as shown in Table 6.

We will also expand the discussion on recent works in the related work section and include the new results in the revised version.

Density threshold (Q3): Both DiffMC and standard Marching Cubes use the same density threshold, which is selected based on the overall quality of the Marching Cubes mesh. During stage-two fine-tuning, DiffMC maintains this threshold. In general, we observe that, without DiffMC fine-tuning, the Marching Cubes artifacts shown in the paper are unavoidable across various thresholds. While carefully tuning the threshold for a single scene can reduce the artifacts to some extent, the optimal threshold differs notably across scenes. Our DiffMC fine-tuning effectively resolves these issues, allowing for high-quality mesh reconstruction with a consistent, pre-selected threshold.

[1] Tochilkin, Dmitry, et al. "Triposr: Fast 3d object reconstruction from a single image." arXiv (2024).

[2] Tang, Jiaxiang, et al. "LGM: Large multi-view gaussian model for high-resolution 3d content creation." ECCV, 2024.

[3] Xu, Jiale, et al. "Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models." arXiv (2024).

[4] Liu, Minghua, et al. "Meshformer: High-quality mesh generation with 3d-guided reconstruction model." NeurIPS (2024).

[5] Shi, Ruoxi, et al. "Zero123++: a single image to consistent multi-view diffusion base model." arXiv (2023).

Comment

Thank you for the clarification! Your response addresses my concerns, and I am willing to improve my scores, especially since multiple concerns from other reviewers have also been addressed.

Comment

We are glad to hear that all your concerns have been addressed and thank you for raising the scores!

Comment

Hi everyone,

Thank you for your valuable feedback. We have revised the paper based on your suggestions and highlighted the changes in blue for your convenience. The updated version has been uploaded—please feel free to review it.

Best regards,

The Authors

AC Meta-Review

The paper received mixed reviews, with one accept, two borderline accepts, and one borderline reject. While Reviewer apsR expressed a generally positive view of the paper, all reviewers raised concerns regarding the novelty of the work. The rebuttal emphasized the paper’s contribution as a simplified version of the LRM architecture. However, this argument was not sufficiently convincing to change the reviewers’ perspectives, particularly those of Reviewer ftqh and Reviewer 1jbs.

Reviewer apsR highlighted that the ray opacity loss introduced in the paper is not entirely novel, and ray regularization for zero density in empty space is a standard technique in NeRF approaches. Reviewer 6QBv criticized the method for simply combining LRM with diffMC, describing the approach as somewhat ad hoc (while Reviewer 6QBv did not take this as a significant reason for rejection, the Area Chair agrees with Reviewer 1jbs that 3D generation should not be held to a lower standard for acceptance). Reviewer ftqh noted that earlier works, such as Magic3D and InstantMesh, have already explored two-stage 3D generation pipelines by combining NeRF and mesh representations. While the usage of mesh representations is reasonable, it is not novel in this context. Reviewer ftqh also raised a significant concern regarding the lack of publicly available code, which is crucial for reproducibility and further research by the community. Reviewer 1jbs agreed with the other reviewers on the novelty issue and actively participated in the discussion, demonstrating a strong sense of responsibility in their role as a reviewer. Reviewer 1jbs also highlighted, alongside Reviewer ftqh, the lack of public code availability, noting that the authors only plan to provide a demo in the future. This limitation significantly undermines the impact of this work which is mainly an engineering contribution.

The area chair has thoroughly reviewed the paper, supplementary materials, and the discussions among reviewers. The paper’s lack of novelty significantly limits its contribution to the field, as the proposed method largely combines existing techniques without introducing substantial new insights. While providing open source code is also commendable for impactful engineering contributions, the paper falls short in this regard. The active participation of Reviewer 1jbs in the discussion is particularly valued, as it demonstrated a thoughtful and responsible evaluation of the work, ultimately leaning toward a weak reject. The concerns of Reviewer 1jbs, alongside those of other reviewers, highlight that the technical contributions do not meet the high standards of ICLR.

Additional Comments on the Reviewer Discussion

The authors’ rebuttal emphasizes the contribution of a simplified version of the LRM architecture, but this argument did not sufficiently address the concerns raised by the reviewers. Specifically, Reviewer ftqh and Reviewer 1jbs expressed doubts about the technical contribution of the approach, which remains a major concern despite the rebuttal. Additionally, Reviewer ftqh raised an important issue regarding the availability of code, which the rebuttal did not fully resolve. This issue was also acknowledged by Reviewer 1jbs, who has actively participated in the discussion.

Final Decision

Reject