PaperHub
Overall score: 7.3 / 10
Poster · 4 reviewers
Ratings: 4, 4, 5, 5 (min 4, max 5, std dev 0.5)
Confidence: 4.3
Novelty: 2.8 · Quality: 3.3 · Clarity: 2.5 · Significance: 3.0
NeurIPS 2025

Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

A novel approach that enables training a high-resolution sparse voxel diffusion model for 3D generation.

Abstract

Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9$\times$ speedup in the forward pass and a 9.6$\times$ speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on public datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024³ resolution using only 8 GPUs—a task typically requiring at least 32 GPUs for volumetric representations at $256^3$ resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://www.neural4d.com/research-page/direct3d-s2.
Keywords
AIGC · 3D Generation · Diffusion

Reviews and Discussion

Review
Rating: 4

This work introduces Direct3D-S2, a scalable 3D generation framework based on sparse volumes. At the core of the approach is a novel Spatial Sparse Attention mechanism which significantly speeds up both forward and backward pass, facilitating the scalability and efficiency of sparse structure-based 3D generative models. Experimental results demonstrate the superior performance of the proposed method.

Strengths and Weaknesses

Strengths

  • The proposed Spatial Sparse Attention mechanism, which is an adaptation of Native Sparse Attention to 3D data, effectively reduces the computational cost of sparse structure-based 3D generative models.
  • The qualitative results are compelling, with fine-grained geometry details.

Weaknesses

  • The modifications that the proposed Spatial Sparse Attention makes to Native Sparse Attention are predictable and mostly engineering-oriented, lacking novelty.
  • Lines 229-230 claim that using all regions of the conditional images for cross attention “may adversely affect the alignment between the generated meshes and the conditional images”; however, this claim lacks supporting evidence. It would be useful to provide several examples showcasing such misalignment.
  • The ablation on the proposed Sparse Conditioning mechanism contains only two cases in Fig. 5, which implies potential cherry-picking. Given the randomness of the sampling process, the advantage might not hold under another random seed. More comparison cases or quantitative results should be provided to demonstrate the consistent benefit of the mechanism.

Questions

In line 144 it is mentioned that the VAE decoder outputs both input voxels and additional valid voxels. What does “additional valid voxels” refer to?

Limitations

yes

Final Rating Justification

Thanks to the authors for their response; all my concerns are well addressed. Based on the authors' detailed explanation, I acknowledge that the contribution of the proposed SSA is not trivial: it effectively overcomes the computational bottleneck of sparse structured 3D generative modeling at high resolution. While I maintain the view that the proposed "Sparse Conditioning Mechanism" appears somewhat contrived, overall I believe the practice of this paper can advance the field of 3D generation. I will raise my final rating.

Formatting Concerns

N/A

Author Response

Thank you for your insightful feedback and thorough review of our paper. We carefully respond to each of the concerns and questions below.

[Q1]: Novelty of the proposed SSA.

[A1]: Thank you for your comments. The main contribution of our SSA is to address the scalability and efficiency limitations of training with ultra-long token sequences in 3D data, where the number of tokens increases cubically with resolution. SSA is also well acknowledged by the three other reviewers: Reviewer WHua noted, "the method addresses the core inefficiency in high-resolution 3D generation with a well-motivated and technically solid solution". Reviewer uJi9 said our approach is "a novel framework for scalable 3D generative modeling", giving it a significance score of 4 (excellent). Reviewer y1dk commented that "the implementation of SSA is non-trivial and could benefit the 3D community overall as a general mechanism for harvesting sparsity naturally existing in 3D data".

Although our core idea of accelerating the attention mechanism via block partitioning is inspired by NSA, NSA is only applicable to structurally organized 1D sequences, as it partitions tokens based on their sequential order. This approach loses critical information when applied to unstructured tokens, such as sparse 3D voxels, and suffers from two key limitations. First, even though tokens within the same block may share consecutive indices, they are not necessarily spatially adjacent in 3D space. Second, due to the nature of sparse voxel structures, blocks with the same index across different samples may correspond to entirely different spatial regions. These issues collectively lead to unstable training convergence and a loss of detail in the generated meshes. To address this, we propose a novel method that partitions tokens into blocks based on their 3D spatial locations, followed by blockwise token selection. As shown clearly in Figure 6 of the paper, meshes generated using SSA exhibit smoother surfaces and more fine-grained details. We believe the novelty of SSA is well supported, as also highlighted by Reviewers WHua, uJi9 and y1dk.
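To make the distinction concrete, here is a minimal illustrative sketch (a simplification for exposition, not our actual implementation; `block_size` and the coordinate layout are assumptions) contrasting NSA-style sequential chunking with SSA-style spatial partitioning of sparse voxel tokens:

```python
import numpy as np

def sequential_blocks(num_tokens: int, block_len: int) -> np.ndarray:
    """NSA-style partitioning: tokens are grouped purely by their position in the
    flattened sequence, regardless of where they sit in 3D space."""
    return np.arange(num_tokens) // block_len

def spatial_blocks(coords: np.ndarray, block_size: int, grid_res: int) -> np.ndarray:
    """SSA-style partitioning (simplified): tokens are grouped by the coarse 3D cell
    containing their voxel coordinate, so tokens in a block are spatially adjacent."""
    cells = coords // block_size                              # (N, 3) coarse cell index
    n_cells = grid_res // block_size                          # cells per axis
    return (cells[:, 0] * n_cells + cells[:, 1]) * n_cells + cells[:, 2]

# Toy example: 6 occupied voxels from two far-apart corners of a 64^3 grid,
# stored in an interleaved order as they might appear in a sparse tensor.
coords = np.array([[0, 0, 0], [63, 63, 63], [1, 0, 0],
                   [62, 63, 63], [0, 1, 0], [63, 62, 63]])
print(sequential_blocks(len(coords), block_len=3))        # [0 0 0 1 1 1] mixes both corners
print(spatial_blocks(coords, block_size=8, grid_res=64))  # [0 511 0 511 0 511] keeps the clusters apart
```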

[Q2]: Sparse Conditioning Mechanism for improved alignment.

[A2]: Thanks for your valuable suggestions. We have presented a qualitative comparison in Figure 5 of the supplementary material between results generated using cross-attention over the full input images and those using sparse conditioning. The regions highlighted in red indicate areas where the meshes generated by the former do not align well with the input images. After applying sparse conditioning, the generated results are corrected in these regions. To avoid potential cherry-picking, we perform quantitative comparisons on a subset of the Objaverse dataset comprising 500 meshes disjoint from the training set. Unlike the approach in our paper, which uses metrics such as Uni3D and ULIP-2 to measure semantic similarity between generated meshes and input images, we employ the FID metric to more rigorously evaluate the geometric quality of generated meshes, which requires GT meshes. We render paired normal maps for both the generated meshes and the GT meshes from the same viewpoints, then calculate the Normal-FID between them to assess model performance. The results are as follows:

| Methods | Normal-FID ↓ |
| --- | --- |
| Ours (512³) w/o sparse conditioning | 71.49 |
| Ours (512³) w/ sparse conditioning | 69.31 |

It can be observed that the sparse conditioning module brings a certain level of performance improvement, demonstrating its effectiveness. However, we would like to clarify that the main performance gains and primary contributions of our method come from the proposed fully symmetric sparse SDF VAE and the SSA mechanism, which significantly improve both training and inference efficiency. These advances overcome the resolution limitations of previous methods and enable the generation of 1024³ resolution meshes.
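For completeness, a Normal-FID number of this kind can in principle be reproduced with a generic FID implementation once the paired normal maps have been rendered to two image folders; the sketch below uses the third-party pytorch-fid package and hypothetical folder names (an assumption about tooling, not a description of our actual evaluation code):

```python
# pip install pytorch-fid
from pytorch_fid.fid_score import calculate_fid_given_paths

# Normal maps rendered from matching viewpoints for ground-truth and generated meshes.
fid = calculate_fid_given_paths(
    ["renders/normals_gt", "renders/normals_generated"],  # hypothetical folder names
    batch_size=50,
    device="cuda",
    dims=2048,  # standard InceptionV3 pool3 feature dimension
)
print(f"Normal-FID: {fid:.2f}")
```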

[Q3]: What does "additional valid voxels" in line 144 refer to?

[A3]: Thank you for pointing this out. Our Sparse SDF VAE encodes sparse SDF grids of resolution N³ into sparse latent tokens at a lower resolution of (N/8)³, and then decodes them back to N³ sparse SDF grids. For each sparse latent token, the decoder upsamples it to 8³ grids. However, only a subset of these 8³ grid points correspond to the original input sparse SDF grids, and the remaining grid points fall into the category of extra valid voxels. We apologize for the confusion and misunderstanding this may have caused. In the final version, we will include a clear explanation of extra valid voxels to improve clarity.
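A tiny sketch of this bookkeeping (an illustration with hypothetical coordinates, not our training code): each latent token at coarse coordinate c decodes to the 8×8×8 block of fine coordinates 8c + {0,…,7}³, and any decoded voxel in that block that was not among the encoder's input voxels is what we call an additional valid voxel.

```python
import numpy as np

def expand_latent_to_fine(latent_coords: np.ndarray, up: int = 8) -> np.ndarray:
    """Expand each coarse latent coordinate into its up^3 fine voxel coordinates."""
    offsets = np.stack(np.meshgrid(*[np.arange(up)] * 3, indexing="ij"), -1).reshape(-1, 3)
    return (latent_coords[:, None, :] * up + offsets[None]).reshape(-1, 3)

# Hypothetical example: the input sparse SDF grid occupies only 3 voxels of one 8^3 block.
input_voxels = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}   # fine-resolution coordinates
latent_coords = np.array([[0, 0, 0]])              # the single coarse latent token covering them

decoded = {tuple(v) for v in expand_latent_to_fine(latent_coords)}
additional_valid = decoded - input_voxels          # decoded voxels not present in the input
print(len(decoded), len(additional_valid))         # 512 509
```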

Comment

Thanks to the authors for their response; all my concerns are well addressed. Based on the authors' detailed explanation, I acknowledge that the contribution of the proposed SSA is not trivial. I will raise my final rating.

Comment

Thank you sincerely for your kind follow-up and for raising your rating. We are glad our rebuttal has addressed your concerns, and we appreciate your constructive review.

Review
Rating: 4

The paper proposes a novel pipeline, Direct3D-S2, for 3D mesh generation. At a high level, the pipeline is based on a latent diffusion architecture (VAE + latent DiT). The paper proposes SSA (Spatial Sparse Attention) in the DiT to speed up the attention layers, which otherwise have quadratic complexity in the large number of tokens produced by high-resolution 3D data. The generated meshes are detailed, with bumps and dips and crisp, clear edges on the geometry, showing the effectiveness and advantage of high 3D resolution.

Strengths and Weaknesses

Strengths:

  • The implementation of SSA is non-trivial and could benefit the 3D community overall as a general mechanism for harvesting the sparsity that naturally exists in 3D data.
  • The generated mesh geometry is visually appealing and has high resolution.
  • The paper is well-organized.

Weaknesses:

  • It is unclear in the paper how the sparse structure is generated at inference time. This point confused me throughout my reading of the paper. By "sparse structure" I mean where the non-zero entries are located, so that sparsity can be exploited to speed up processing. There is a separate stage for generating the sparse structure in the previous work TRELLIS. The lack of information about the source of this structure during generation makes the data structure underlying SSA hard to understand.
  • The speed of generation is unclear. Though the attention is sped up substantially by SSA, the huge number of tokens may still limit throughput through the computation of the feed-forward MLPs in the transformer. Generation speed is important for practical applications.
  • The NeurIPS checklist item 8, "Experiments compute resources": I did not find "comprehensive information on the type of computing workers, memory, and time of execution required for each experiment" in Section 5 as claimed by the item. The memory and type of GPU used for training are missing (though I do find that information in the appendix), and there is no information at all about data curation.

Questions

The work is in general good to me, but there are important missing links in the whole pipeline that make it hard to understand. I am glad to increase the score if the points in the weaknesses are clarified.

Limitations

The limitations section in the appendix mainly describes the limitations of SSA, which I think are interesting and significant, but they should not dominate the limitations discussion of the whole paper.

Final Rating Justification

The original concerns are addressed well by the rebuttal. The paper needs revision to include the information in the rebuttal, but with the discussions I am on the positive side.

Formatting Concerns

None

Author Response

Thank you for your insightful feedback and thorough review of our paper. We carefully respond to each of the concerns and questions below.

[Q1]: It is unclear in the paper how the sparse structure is generated at inference time.

[A1]: Thank you for raising this point. Our Direct3D-S2 employs an additional dense 3D VAE and DiT, trained in a manner consistent with Trellis, to predict the indices of the sparse latent tokens within the voxel grid, which constitute the sparse structure. This is noted in Lines 29-30 of the supplementary material. We acknowledge that this may cause confusion while reading the paper, and we sincerely apologize for the oversight. In the final version, we will restructure the method section to include an explicit description of the sparse structure generation process, thereby avoiding any misunderstanding. We deeply appreciate your suggestion.

[Q2]: The speed of generation is unclear.

[A2]: Thanks for your comments. Within the DiT architecture, the self-attention module exhibits O(N²) complexity for N tokens, while other components such as the feed-forward MLPs maintain linear complexity O(N). At high resolutions, where the number of tokens N becomes substantial, self-attention accounts for over 90% of the model's runtime. Therefore, although SSA only accelerates the self-attention module, it still leads to a significant improvement in the overall generation speed. In the table below, we compare the time required to denoise a single step using SSA versus full attention:

| Methods | Average Inference Time (One Step) ↓ |
| --- | --- |
| Ours (512) w/ Full Attention | 2.06 s |
| Ours (512) w/ SSA | 1.13 s |
| Ours (1024) w/ Full Attention | 18.34 s |
| Ours (1024) w/ SSA | 5.14 s |

For 512³ resolution, around 30 denoising steps are needed, while for 1024³ resolution, around 15 steps suffice. Predicting the sparse structure takes approximately 5 seconds. Therefore, excluding post-processing steps such as mesh simplification, generating a 512³ resolution mesh takes about 0.5 minute, and a 1024³ resolution mesh takes around 2 minutes, which meets the requirements for practical applications.
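As a back-of-the-envelope illustration of why accelerating only self-attention moves the overall step time so much (our own rough cost model, not a measurement from the paper): with N tokens and hidden width d, self-attention costs on the order of N²d FLOPs while the feed-forward MLPs cost on the order of Nd², so the attention share of the per-step compute is roughly

$$\frac{c_1 N^2 d}{c_1 N^2 d + c_2 N d^2} \;=\; \frac{N}{N + (c_2/c_1)\,d} \;\longrightarrow\; 1 \quad \text{for } N \gg d,$$

which is why, once N reaches hundreds of thousands of sparse voxel tokens, the SSA speedup on attention translates almost directly into the step-time reductions shown above.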

[Q3]: Experiments compute resources.

[A3]: Thanks for your comments. As documented in the Implementation Details section of the supplementary material, we have already specified the type of GPU and the training durations. For further clarity, we provide additional details on compute resources in the table below:

| Type of GPU | GPU Memory for Training (512³) | GPU Memory for Inference (512³) | GPU Memory for Training (1024³) | GPU Memory for Inference (1024³) |
| --- | --- | --- | --- | --- |
| NVIDIA A100 | ~24G | ~11G | ~40G | ~20G |

Note that we trained our Direct3D-S2 using 8 GPUs, and the table above lists the GPU memory usage per single card for training.

[Q4]: No information about data curation.

[A4]: Thanks for your comments. In fact, we described the data processing pipeline in Section 5.1 in the paper. Here, we provide a more detailed explanation of the data curation process to avoid any misunderstanding:

(1) Data Filtering. We first render all meshes from the ShapeNet, Objaverse, and Objaverse-XL datasets from six fixed viewpoints. We then randomly select several thousand 3D models and manually annotate whether each one is of high quality. For each mesh, we extract CLIP features from its six rendered images and concatenate them. Based on this, we train a lightweight binary classification network. Using this network, we select 452k high-quality meshes from the datasets for training.
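For illustration, a minimal sketch of what such a quality classifier could look like (the CLIP feature dimension, hidden size, and training details here are assumptions for exposition, not our exact configuration):

```python
import torch
import torch.nn as nn

class MeshQualityClassifier(nn.Module):
    """Binary quality classifier over concatenated per-view CLIP image features.
    Assumes 6 rendered views and 768-dim CLIP embeddings (hypothetical choices)."""
    def __init__(self, num_views: int = 6, clip_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(num_views * clip_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 1),  # logit for "high quality"
        )

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, clip_dim) -> concatenate the views
        return self.head(view_feats.flatten(1)).squeeze(-1)

# Toy training step on random tensors standing in for real CLIP features and labels.
model = MeshQualityClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
features = torch.randn(32, 6, 768)
labels = torch.randint(0, 2, (32,)).float()
loss = nn.functional.binary_cross_entropy_with_logits(model(features), labels)
loss.backward()
optimizer.step()
```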

(2) Geometry processing. We follow the same strategy as CLAY [1]: for each non-watertight mesh, we first compute a UDF grid, and then use ray tracing to determine the visibility of each grid point. For invisible interior points, we assign negative UDF values. The resulting mesh is made watertight via marching cubes. We then extract its SDF grid for training.

(3) Rendering conditional images. We rendered 45 RGB images per mesh at 1024x1024 resolution with random camera parameters. The camera configuration space is defined as follows: elevation angles ranging from 10° to 40°, azimuth angles spanning [0°, 180°], and focal lengths varying between 30mm and 100mm.
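For concreteness, a tiny sketch of sampling camera parameters from the ranges stated above (uniform sampling and the dictionary layout are assumptions for illustration):

```python
import numpy as np

def sample_camera(rng: np.random.Generator) -> dict:
    """Sample one random conditional-view camera within the stated ranges."""
    return {
        "elevation_deg": rng.uniform(10.0, 40.0),
        "azimuth_deg": rng.uniform(0.0, 180.0),
        "focal_length_mm": rng.uniform(30.0, 100.0),
        "resolution": (1024, 1024),
    }

rng = np.random.default_rng(0)
cameras = [sample_camera(rng) for _ in range(45)]  # 45 conditional views per mesh
```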

[Q5]: Additional limitations.

[A5]: Thank you for your suggestions. Another limitation of our method is that, since the final output of our model is a sparse SDF grid, we need to assign a default SDF value to the remaining grid points before applying marching cubes. As a result, the extracted mesh is hollow and requires post-processing, similar to Trellis, to remove internal structures. This issue could also be addressed by implementing a sparse marching cubes algorithm, which we leave as future work.
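To illustrate the post-processing described above (a minimal sketch; the default SDF value, data layout, and resolution are assumptions): the sparse predictions are scattered into a dense grid pre-filled with a positive default so that empty space counts as outside, and marching cubes then runs on the dense volume, which is also why internal structures can appear and need removal.

```python
import numpy as np
from skimage import measure

def sparse_sdf_to_mesh(coords: np.ndarray, sdf_values: np.ndarray,
                       resolution: int, default_sdf: float = 1.0):
    """coords: (M, 3) integer voxel indices of the sparse prediction;
    sdf_values: (M,) predicted SDF values. Unoccupied voxels receive a
    positive default so marching cubes treats them as exterior."""
    dense = np.full((resolution,) * 3, default_sdf, dtype=np.float32)
    dense[coords[:, 0], coords[:, 1], coords[:, 2]] = sdf_values
    verts, faces, _, _ = measure.marching_cubes(dense, level=0.0)
    return verts, faces

# Toy usage with a handful of near-surface voxels (values are made up).
coords = np.array([[32, 32, 30], [32, 32, 31], [32, 32, 32], [32, 32, 33]])
values = np.array([0.6, 0.2, -0.2, -0.6], dtype=np.float32)
verts, faces = sparse_sdf_to_mesh(coords, values, resolution=64)
```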

[1] Zhang, Longwen, et al. Clay: A controllable large-scale generative model for creating high-quality 3d assets. TOG 2025.

Comment

The rebuttal addressed my concerns well. Therefore, I have adjusted the rating to the positive side.

Comment

Thank you for your thoughtful feedback, and we are pleased to hear that our rebuttal addressed your concerns well. Your constructive suggestions are very helpful, and we sincerely appreciate your consideration.

Review
Rating: 5

The paper introduces Direct3D-S2, a novel framework for scalable 3D generative modeling. The authors propose a fully end-to-end symmetric sparse SDF VAE that maintains a consistent sparse representation across encoding, latent, and decoding stages using direct 3D supervision. The authors also introduce a new spatial sparse attention mechanism that allows for efficient processing of large-scale 3D data. Experiments on public datasets demonstrate that Direct3D-S2 outperforms state-of-the-art methods in both geometric fidelity and training efficiency, making it promising for scaling 3D generative models.

Strengths and Weaknesses

Strengths:

  • The authors propose Spatial Sparse Attention (SSA) which enables efficient modeling for large-scale 3D generative models.
  • The authors propose a fully end-to-end symmetric sparse SDF VAE that maintains a consistent sparse representation using direct 3D supervision.
  • The generation results are impressive, providing significant detail improvement over the state-of-the-art methods.

Weaknesses:

  • The paper lacks numerical ablation studies across different resolutions in both VAE and generation to demonstrate the model's scalability in terms of resolution.
  • Although the generation results are impressive, some unfaithful details are present in the generated results (e.g., in the first row of Figure 4, the model generates two circles in the zoom-in area, whereas there is only one circle in the ground-truth image).
  • The paper lacks generation results on in-the-wild images.

Questions

I would recommend the authors to consider the following questions:

  1. Could the authors provide more comprehensive numerical ablation results to demonstrate the model's scalability across different resolutions?

  2. Would it be possible to include more generation results on in-the-wild images to better demonstrate the model's practical applicability?

  3. Could the authors include additional discussion regarding the frequency of unfaithful detail generation and potential future solutions to address this issue?

  4. The 256 resolution results in Figure 3 appear to be significantly worse than those of Trellis, which also operates at 256 resolution. Given that the proposed encoding and generation framework is expected to achieve better performance at the same resolution, could the authors provide additional analysis to explain this discrepancy?

Limitations

Yes.

Final Rating Justification

Thanks authors for their rebuttal. Most of my concerns have been addressed! I vote for accept.

Formatting Concerns

None.

Author Response

Thank you for your insightful feedback and thorough review of our paper. We carefully respond to each of the concerns and questions below.

[Q1]: Lacks numerical ablation studies across different resolutions in both VAE and generation.

[A1]: Thanks for your valuable suggestions. We present the numerical ablation studies of the VAE across different resolutions below. The evaluation metrics include Chamfer Distance and F1 score with thresholds of 0.005 and 0.001; the metrics are multiplied by 10,000, 100, and 100, respectively.

| Resolution | Chamfer Distance ↓ | F1 (0.005) ↑ | F1 (0.001) ↑ |
| --- | --- | --- | --- |
| 256 | 30.86 | 85.61 | 9.93 |
| 384 | 20.89 | 98.46 | 19.43 |
| 512 | 13.59 | 99.98 | 41.11 |
| 1024 | 7.91 | 100.00 | 70.19 |

We perform quantitative comparisons of generation across different resolutions on a subset of the Objaverse dataset comprising 500 meshes disjoint from the training set. Unlike the approach in our paper, which uses metrics such as Uni3D and ULIP-2 to measure semantic similarity between generated meshes and input images, we employ the FID metric to more rigorously evaluate the geometric quality of generated meshes, which requires GT meshes. We render paired normal maps for both the generated meshes and the GT meshes from the same viewpoints, then calculate the Normal-FID between them to assess model performance. The results are as follows:

| Resolution | Normal-FID ↓ |
| --- | --- |
| 256 | 132.97 |
| 384 | 87.10 |
| 512 | 69.31 |
| 1024 | 46.44 |

These quantitative results demonstrate the scalability of our Direct3D-S2 in terms of resolution.
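For reference, the Chamfer Distance and F1 numbers in the VAE table above can be computed from surface point samples along the following lines (a generic sketch; the sampling density, normalization, and the exact scaling we apply are not spelled out here):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_f1(pred_pts: np.ndarray, gt_pts: np.ndarray, thresh: float):
    """pred_pts, gt_pts: (N, 3) and (M, 3) points sampled on the reconstructed
    and ground-truth surfaces. Returns (Chamfer distance, F1 at the threshold)."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # nearest GT point per prediction
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # nearest predicted point per GT point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < thresh).mean()
    recall = (d_gt_to_pred < thresh).mean()
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer, f1

# Toy usage on random point clouds standing in for sampled mesh surfaces.
rng = np.random.default_rng(0)
pred, gt = rng.random((2000, 3)), rng.random((2000, 3))
print(chamfer_and_f1(pred, gt, thresh=0.005))
```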

[Q2]: The 256 resolution results in Figure 3 appear to be significantly worse than those of Trellis.

[A2]: Thanks for your comments. We acknowledge that at 256³ resolution, our method performs worse than Trellis, an expected outcome for two reasons:

(1) Trellis' VAE employs 53-dimensional FlexiCubes parameters at 256³, whereas our Sparse SDF VAE uses only a 1-dimensional SDF representation at the same resolution.

(2) Trellis' VAE is supervised by high-resolution normal maps rendered from GT meshes, while we use 256³ SDF grids as supervision signals, which provide significantly inferior geometric fidelity compared to mesh-derived normals.

However, our approach offers critical advantages over Trellis:

(1) Model scalability: Trellis’ VAE requires rendering normal maps via FlexiCubes representations for supervision, demanding prohibitive GPU memory (e.g., out-of-memory errors at 512³). Our Sparse SDF VAE scales seamlessly to 1024³ resolution.

(2) Training efficiency: Trellis’ asymmetric VAE (point cloud → latent → FlexiCubes with normal map supervision) exhibits high training complexity, requiring 64 A100 GPUs for several days to converge. Our symmetric encoder-decoder architecture (SDF grid → latent → SDF grid) reduces optimization difficulty, converging from 256³ to 1024³ in just 2 days on 8 A100 GPUs.

(3) Long-token modeling: Trellis’ DiT uses full attention, becoming prohibitively slow with long tokens. Our proposed SSA substantially improves the training and inference efficiency.

(4) Progressive quality gain: While 256³ SDF grids offer limited quality, this resolution serves merely as a warm-up phase to save training time. Quality progressively improves at 512³, with 1024³ SDF grids approaching GT mesh fidelity, enabling significant performance gains at higher resolutions.

In summary, Trellis achieves strong 256³ reconstruction quality but suffers from inefficiency and 256³ resolution limitations. Our Direct3D-S2 framework is explicitly designed to overcome this, enabling scalable VAE resolution (up to 1024³) and introducing SSA for efficient DiT training/inference with ultra-long tokens.

[Q3]: Some unfaithful details are present in the generated results.

[A3]: We sincerely appreciate you highlighting this observation. The surface details occasionally manifesting in our generated meshes but absent in input images stem from a dataset limitation: a subset of textureless meshes in Objaverse renders geometrically ambiguous conditional images. This creates a supervision mismatch during training, where input images lack discernible details while target meshes retain them, causing the model to potentially generate unfaithful surface details misaligned with the conditioning image. To resolve this, we will filter textureless meshes from the training set and retrain the DiT model. The updated model will also be publicly released. We deeply value your insight, which significantly improves our method’s robustness.

[Q4]: The paper lacks generation results on in-the-wild images.

[A4]: Thank you for your suggestion. Our test cases collected from professional communities encompass diverse image types, including samples generated by text-to-image models, images drawn by artists, and in-the-wild images, on which our method consistently demonstrates robust performance. Due to current conference policy prohibiting PDF attachments or external links during rebuttal, we will include additional results on in-the-wild images in the final version.

Comment

Thanks authors for their rebuttal. Most of my concerns have been addressed!

Comment

Thank you for taking the time and effort to review our response. We are glad that our rebuttal has addressed most of your concerns. If you have any further questions, please let us know and we will respond promptly!

Review
Rating: 5

This paper addresses the challenge of generating high-resolution 3D shapes in an efficient and scalable manner. Prior methods based on explicit voxel representations suffer from significant computational overhead, particularly in the attention modules of Diffusion Transformers (DiT), as the voxel grid resolution increases.

To overcome this limitation, the authors propose Spatial Sparse Attention (SSA), a novel mechanism that significantly reduces computational cost while maintaining high-quality 3D generation. SSA combines three key components: sparse 3D compression, spatial blockwise token selection, and sparse 3D window attention.

Furthermore, the paper introduces Sparse SDF VAE, a unified variational autoencoder architecture where both the input and output are represented as sparse Signed Distance Fields (SDFs).

Experimental results demonstrate that the proposed framework not only achieves superior computational efficiency but also improves the visual quality of image-conditioned 3D generation.

Strengths and Weaknesses

Strengths

  1. The proposed Spatial Sparse Attention (SSA) mechanism presents a novel approach to scaling attention in 3D space. By effectively adapting Local Sparse Attention (LSA) to volumetric data, the method addresses the core inefficiency in high-resolution 3D generation with a well-motivated and technically solid solution.

  2. The architecture successfully handles gigascale 3D generation, and its design remains scalable to even larger scenes due to its efficient sparse formulation.

  3. The qualitative results are visually compelling, demonstrating detailed and structurally coherent 3D outputs that surpass prior methods in terms of geometric complexity and realism.

  4. The paper is clearly written, and the methodological components are effectively illustrated with informative and well-organized figures, which aid in understanding the proposed architecture and its contributions.

Weaknesses

  1. All ablation studies are qualitative. A quantitative analysis of the proposed components would better clarify the contribution and effectiveness of each module.

  2. The Sparse SDF VAE is only evaluated qualitatively. Quantitative comparisons are necessary to demonstrate its advantages in training efficiency and geometric fidelity.

  3. Minor: Redundant mention of ULIP-2 [42], Uni3D [50], and OpenShape [20] in Lines 256 and 264. Please revise to avoid repetition.

Questions

  1. Clarification on performance attribution: It would be helpful to clarify whether the performance gains primarily come from the proposed network architecture or from operating at higher resolutions. For example, Trellis operates at a resolution of 256³, and the results of the proposed method at the same resolution (as shown in Appendix Figure 3) appear relatively less compelling. A direct comparison at equal resolutions, both qualitatively and quantitatively, would help to better isolate the contribution of the proposed design.

  2. SSA component ablation: In the ablation studies of the SSA module, it would be interesting to see results when only cmp or only slc (if available) is used. This could provide further insight into the individual contribution of each component within SSA.

  3. Potential for text-to-3D extension: Given the structured and scalable nature of the proposed method, I am curious whether it could be extended to support text-to-3D asset generation. If applicable, any preliminary results or discussion on this direction would be appreciated.

Limitations

Yes.

Final Rating Justification

I would like to thank the authors for their comprehensive ablation studies and insightful comparisons. The method’s impressive visual quality, coupled with its computational efficiency, makes it a valuable contribution to the community. I firmly believe that this work meets the standards of NeurIPS and deserves acceptance.

Formatting Concerns

The paper is well-formatted, and I did not observe any formatting issues.

Author Response

Thank you for your insightful feedback and thorough review of our paper. We carefully respond to each of the concerns and questions below.

[Q1]: Quantitative results of ablation studies.

[A1]: Thanks for your valuable suggestions. We perform quantitative comparisons on a subset of the Objaverse dataset comprising 500 meshes disjoint from the training set. Unlike the approach in our paper, which uses metrics such as Uni3D and ULIP-2 to measure semantic similarity between generated meshes and input images, we employ the FID metric to more rigorously evaluate the geometric quality of generated meshes, which requires GT meshes. We render paired normal maps for both the generated meshes and the GT meshes from the same viewpoints, then calculate the Normal-FID between them to assess model performance. The results are as follows:

(1) Generation results of our Direct3D-S2 at various resolutions:

| Resolution | Normal-FID ↓ |
| --- | --- |
| 256 | 132.97 |
| 384 | 87.10 |
| 512 | 69.31 |
| 1024 | 46.44 |

(2) Ablation results for different modules of SSA at 512³ resolution:

| Modules | Normal-FID ↓ |
| --- | --- |
| cmp | 304.48 |
| slc | NaN |
| win | 74.10 |
| win+cmp | 77.68 |
| cmp+slc | 70.05 |
| win+cmp+slc | 69.31 |

(3) Comparative results of SSA against Full Attention and NSA at 512³ resolution:

| Methods | Normal-FID ↓ |
| --- | --- |
| Full Attention | 76.58 |
| NSA | 71.93 |
| SSA (Ours) | 69.31 |

(4) Ablation results for sparse conditioning at 512³ resolution:

| Methods | Normal-FID ↓ |
| --- | --- |
| Ours w/o sparse conditioning | 71.49 |
| Ours w/ sparse conditioning | 69.31 |

[Q2]: Results when only cmp or only slc is used in SSA.

[A2]: Thanks for your suggestions. We have added supplementary experiments using only cmp or slc, with results presented in [Q1](2). When using only the cmp module, which provides exclusively coarse-level features, the model's performance significantly degrades. When using only the slc module, the model consistently collapses (with loss diverging to NaN) after several hundred training iterations due to the absence of the cmp module which learns block-level attention scores. Consequently, the slc module must operate in conjunction with the cmp module.

[Q3]: Quantitative comparisons of Sparse SDF VAE.

[A3]: Thanks for your suggestions. We present the quantitative comparisons of the VAE below. The evaluation metrics include Chamfer Distance and F1 score with thresholds of 0.005 and 0.001; the metrics are multiplied by 10,000, 100, and 100, respectively. Regarding training efficiency, Dora requires 32 A100 GPUs for two days of training, while Trellis necessitates 64 A100 GPUs. Other compared methods do not report training time. In contrast, our Sparse SDF VAE requires only 8 A100 GPUs for one day of training at 512³ resolution, and merely an additional day to progress to 1024³ resolution.

| Methods | Chamfer Distance ↓ | F1 (0.005) ↑ | F1 (0.001) ↑ |
| --- | --- | --- | --- |
| Trellis | 28.60 | 88.77 | 11.57 |
| Xcube | 15.42 | 99.79 | 31.85 |
| Dora | 15.12 | 99.98 | 31.86 |
| Ours (512) | 13.59 | 99.98 | 41.11 |
| Ours (1024) | 7.91 | 100.00 | 70.19 |

[Q4]: Clarification on performance attribution and results at 256³ resolution.

[A4]: Thanks for your comments. We acknowledge that at 256³ resolution, our method performs worse than Trellis, an expected outcome for two reasons:

(1) Trellis' VAE employs 53-dimensional FlexiCubes parameters at 256³, whereas our Sparse SDF VAE uses only a 1-dimensional SDF representation at the same resolution.

(2) Trellis' VAE is supervised by high-resolution normal maps rendered from GT meshes, while we use 256³ SDF grids as supervision signals, which provide significantly inferior geometric fidelity compared to mesh-derived normals.

However, our approach offers critical advantages over Trellis:

(1) Model scalability: Trellis’ VAE requires rendering normal maps via FlexiCubes representations for supervision, demanding prohibitive GPU memory (e.g., out-of-memory errors at 512³). Our Sparse SDF VAE scales seamlessly to 1024³ resolution.

(2) Training efficiency: Trellis’ asymmetric VAE (point cloud → latent → FlexiCubes with normal map supervision) exhibits high training complexity, requiring 64 A100 GPUs for several days to converge. Our symmetric encoder-decoder architecture (SDF grid → latent → SDF grid) reduces optimization difficulty, converging from 256³ to 1024³ in just 2 days on 8 A100 GPUs.

(3) Long-token modeling: Trellis’ DiT uses full attention, becoming prohibitively slow with long tokens. Our proposed SSA substantially improves the training and inference efficiency.

(4) Progressive quality gain: While 256³ SDF grids offer limited quality, this resolution serves merely as a warm-up phase to save training time. Quality progressively improves at 512³, with 1024³ SDF grids approaching GT mesh fidelity, enabling significant performance gains at higher resolutions.

In summary, Trellis achieves strong 256³ reconstruction quality but suffers from inefficiency and 256³ resolution limitations. Our Direct3D-S2 framework is explicitly designed to overcome this, enabling scalable VAE resolution (up to 1024³) and introducing SSA for efficient DiT training/inference with ultra-long tokens.

[Q5]: Potential for text-to-3D extension.

[A5]: Thanks for your valuable suggestions. While our framework could technically enable text-to-3D generation by substituting image conditioning with text conditioning during training and inference, this approach faces fundamental data scalability limitations: text-to-image/video models typically require billions of training samples, whereas available 3D mesh datasets are orders of magnitude smaller (~1M total, reduced to ~100k usable samples after quality filtering). This severe data scarcity prevents training sufficiently generalizable text-to-3D models, as evidenced by Trellis' text-to-3D variant, which exhibits significantly inferior stability and generalization compared to its image-to-3D counterpart. Consequently, state-of-the-art 3D generation works (e.g., Hunyuan3D 2.0 [1], Craftsman [2], Triposg [3], Hi3DGen [4]) predominantly adopt text-to-image-to-3D pipelines leveraging pre-trained text-to-image models. Notably, in our evaluation a substantial portion of the community-sourced images were synthesized by text-to-image models, and our Direct3D-S2 achieves compelling results on them, demonstrating seamless compatibility with such text-to-image-to-3D workflows.

[Q6]: Redundant mention of ULIP-2, Uni3D, and OpenShape in Lines 256 and 264.

[A6]: Thanks for your suggestions. We will reorganize these sentences in the final version.

[1] Zhao, Zibo, et al. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation. arXiv 2025.

[2] Li, Weiyu, et al. CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner. CVPR 2025.

[3] Li, Yangguang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. arXiv 2025.

[4] Ye, Chongjie, et al. Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging. ICCV 2025.

Comment

Thank you for addressing my concerns. The additional ablation was helpful, and I hope it will be included in the final version. I’ll keep my positive scores.

Comment

Thanks for your positive feedback and for taking the time to consider our responses. We are pleased that our rebuttal has addressed your concerns, and we will include the additional ablation in the final version.

Final Decision

All four reviewers recommend accepting this paper (2 "Accept" and 2 "Borderline Accept"). The authors have provided a detailed rebuttal that successfully resolves most of the reviewers' concerns. The reviewers note that the pipeline of the proposed 3D mesh generation method is novel, the qualitative results are visually appealing, and the method's computational efficiency is high. The AC agrees with the reviewers that the paper is interesting and deserves to be published at NeurIPS 2025.