PaperHub
Rating: 5.0 / 10 · Poster · 5 reviewers (min 4, max 6, std 0.9)
Individual ratings: 6, 4, 5, 6, 4
Confidence: 4.8 · Correctness: 3.2 · Contribution: 3.0 · Presentation: 3.2
NeurIPS 2024

Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We propose a novel approach for direct 3D shape generation from a single image, bypassing the need for multi-view reconstruction.

Abstract

Keywords
3D Generation, Diffusion Model

Reviews and Discussion

Review
Rating: 6

An image-conditioned 3D generation method with two parts: 1) a 3D triplane VAE that takes a point cloud as input and generates semi-continuous occupancy, comprising a point2latent encoder and a latent2triplane decoder; 2) a direct 3D transformer diffusion model conditioned on a single image that generates tokens of the latent, with two input heads: DINO and CLIP. A wide range of evaluations, with inputs ranging from GSO object renderings to images generated by 2D diffusion models.

Strengths

  1. Well-written manuscript that explains motivations and methods clearly.
  2. A novel 3D diffusion transformer to generate triplane tokens, then decode the semi-continuous occupancy.
  3. Good qualitative results given images rendered from the GSO dataset or produced by a text-conditioned image diffusion model.

Weaknesses

  1. Lack of analysis of your data's contribution to method performance. I'm curious whether your good quality comes from the 500K internal data, since most baselines are trained only on Objaverse. It would be great to see results of baselines trained on your data (if this is too expensive, training your method on a small Objaverse subset would be easier).

  2. Lack of more quantitative results: the only quantitative result is the user study, which may not be enough. It would be natural to compare with other baselines on Chamfer Distance, or on PSNR/SSIM of depth and normal maps from several views, and also to compare on a larger set (not as few as the 30 GSO objects you mentioned).

Questions

See the weaknesses.

Limitations

Yes

Author Response

Thank you for your insightful feedback and thorough review of our paper. We carefully respond to each of the concerns and questions below.

[Q1] Training on Objaverse dataset and quantitative comparison with state-of-the-art methods.

[A1] Thank you for your valuable comment. To ensure a fair comparison with other methods, we re-train our D3D-VAE and D3D-DiT on the Objaverse dataset. Following prior approaches such as InstantMesh and Wonder3D [1], we adopt the Google Scanned Objects dataset for evaluation, and report the Chamfer Distance, Volume IoU and F-Score to compare the quality of the generated meshes. The results are presented in the table below, which illustrates that generative-based methods like Shap-E and Michelangelo exhibit significantly lower accuracy compared to reconstruction-based methods such as InstantMesh. Nonetheless, our Direct3D achieves state-of-the-art performance across all metrics. Integrating our internal data for training further enhances the model’s performance, validating the scalability of our approach.

Methods | Chamfer Distance ↓ | Volume IoU ↑ | F-Score ↑
Shap-E | 0.0585 | 0.2347 | 0.3474
Michelangelo | 0.0441 | 0.1260 | 0.4371
One-2-3-45 | 0.0513 | 0.2868 | 0.3149
InstantMesh | 0.0327 | 0.4105 | 0.5058
Ours (trained on Objaverse) | 0.0296 | 0.4307 | 0.5356
Ours (trained on internal data) | 0.0271 | 0.4323 | 0.5624
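For context, metrics of this kind are commonly computed from sampled surface point clouds (Chamfer Distance, F-Score) and from inside/outside tests on a shared voxel grid (Volume IoU). Below is a minimal sketch, assuming watertight meshes normalized to a common bounding box and using `trimesh`; the sample counts, F-Score threshold `tau`, and grid resolution are illustrative assumptions, not the authors' exact evaluation protocol.

```python
import numpy as np
import trimesh
from scipy.spatial import cKDTree

def sample_surface_points(mesh, n=10000):
    # Uniformly sample points on the mesh surface.
    pts, _ = trimesh.sample.sample_surface(mesh, n)
    return np.asarray(pts)

def chamfer_and_fscore(pred_pts, gt_pts, tau=0.05):
    # Nearest-neighbour distances in both directions.
    d_pred2gt, _ = cKDTree(gt_pts).query(pred_pts)
    d_gt2pred, _ = cKDTree(pred_pts).query(gt_pts)
    chamfer = d_pred2gt.mean() + d_gt2pred.mean()
    precision = (d_pred2gt < tau).mean()
    recall = (d_gt2pred < tau).mean()
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore

def volume_iou(pred_mesh, gt_mesh, resolution=64):
    # Occupancy agreement on a shared voxel grid (requires watertight meshes).
    xs = np.linspace(-0.5, 0.5, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1).reshape(-1, 3)
    occ_pred = pred_mesh.contains(grid)
    occ_gt = gt_mesh.contains(grid)
    return (occ_pred & occ_gt).sum() / max((occ_pred | occ_gt).sum(), 1)
```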

[Q2] Lack of more quantitative results.

[A2] We add more quantitative results in Tables 2-3 of the uploaded PDF.

Reference:

[1]. Long et al., Wonder3D: Single Image to 3D using Cross-Domain Diffusion.

Review
Rating: 4

This paper proposes an image-to-3D generation model without using SDS optimization. It includes a 3D-VAE that encodes 3D shapes into latent space and a 3D DiT that models the latent distributions.

Strengths

  • Single-step 3D generation is a relatively unexplored area.
  • Exploring the DiT model in 3D tasks is interesting.

Weaknesses

  • Lack of novelty. Specifically, 1) this paper mainly uses a 3D-VAE and a 3D-DiT to encode a 3D point cloud into latent space and model the distribution, which has been explored in previous methods like Point-E/Shap-E. This pipeline is widely used in image diffusions, and this paper extends it to 3D, which does not seem novel to me. 2) One novelty the authors mention is the semi-continuous occupancy used for supervision, which, compared with standard 3D supervision, only adds a continuous value between [0,1] when it is close to the boundary. 3) The semantic-level 3D-specific image conditioning is provided by DINO.

  • What is the dataset used for training? It seems that a private dataset is used for training, so I am not sure if the examples shown in the paper were seen during training. In addition, since the paper emphasizes that they can use in-the-wild input images for 3D generation, results on in-the-wild input images would be good to show.

  • There is a paper in CVPR 2024 named "DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data" [1]. They are also doing single-step 3D generation without SDS. This submission uses the same model name as the CVPR 2024 paper. This is very confusing. I think the authors should consider changing the method name. For me, Direct3D or D3D are both not good names. In addition, considering this submission and the CVPR paper are doing the same thing, the authors may also consider discussing the differences.

[1] Liu, Qihao, et al. "DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Questions

Please see the questions in the weaknesses section. I will be willing to raise my rating if these questions are resolved.

Limitations

Yes, it is discussed.

Author Response

Thank you for your insightful feedback and thorough review of our paper. We carefully respond to each of the concerns and questions below.

[Q1]: Similar architectures to image generation and video generation.

[A1]: We appreciate the insightful question raised by Reviewer yUww. Our primary contribution lies in the innovative unification between the latent DiT framework and the 3D content creation task, where previous methods, such as Shap-E and Michelangelo, have struggled to scale effectively for generalized 3D generation. As noted by Reviewer 3ERi, the 'origin, generative-based 3D AIGC pipeline is absolutely missing', and we address this gap by introducing a native 3D generation pipeline leveraging the specially designed 3D VAE (D3D-VAE) and the image-conditioned 3D latent DiT (D3D-DiT). This pipeline not only outperforms existing multi-view diffusion-based methods but also shows promise for scalability with increased 3D data availability.

The innovative nature of our work has been recognized by Reviewer 3ERi, who described it as 'a firm step towards a converged 3D AIGC framework,' and by Reviewer U1iF, who highlighted 'The DiT model is innovatively applied for the first time in conditional 3D generation'. We believe these points underscore the novelty and significance of our contribution to the field of 3D content creation.

[Q2]: It seems that a private dataset is used for training.

[A2]: To ensure a fair comparison with other methods, we re-train our D3D-VAE and D3D-DiT on the Objaverse dataset. Following prior approaches such as InstantMesh and Wonder3D [1], we adopt the Google Scanned Objects dataset for evaluation, and report the Chamfer Distance, Volume IoU and F-Score to compare the quality of the generated meshes. The results are presented in the table below, which illustrates that generative-based methods like Shap-E and Michelangelo exhibit significantly lower accuracy compared to reconstruction-based methods such as InstantMesh. Nonetheless, our Direct3D achieves state-of-the-art performance across all metrics. Integrating our internal data for training further enhances the model’s performance, validating the scalability of our approach. As replied above, we will release the model trained on the Objaverse dataset to further promote openness and reproducibility in our research.

Methods | Chamfer Distance ↓ | Volume IoU ↑ | F-Score ↑
Shap-E | 0.0585 | 0.2347 | 0.3474
Michelangelo | 0.0441 | 0.1260 | 0.4371
One-2-3-45 | 0.0513 | 0.2868 | 0.3149
InstantMesh | 0.0327 | 0.4105 | 0.5058
Ours (trained on Objaverse) | 0.0296 | 0.4307 | 0.5356
Ours (trained on internal data) | 0.0271 | 0.4323 | 0.5624

[Q3]: Results on in-the-wild input images.

[A3]: We present the results using in-the-wild images as conditional inputs in Figure 2 of the uploaded PDF, demonstrating that our method is capable of generating high-quality meshes. We will include more visualization results in the final version of the paper.

[Q4]: Another paper named ’DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data’.

[A4]: Thanks for the reminder. We also noted the naming issue. The paper mentioned was made public on 6 June 2024, after the NeurIPS submission deadline, and we had conducted a thorough search to ensure that the name 'Direct3D' was not already in use before submitting our paper to NeurIPS. We will cite and discuss the paper to help readers distinguish the two works.

Reference:

[1]. Long et al., Wonder3D: Single Image to 3D using Cross-Domain Diffusion.

Comment

Thank you for your response. Unfortunately, it has left me more confused and concerned about the contribution being claimed. You state that your main contribution "lies in the innovative unification between the latent DiT framework and the 3D content creation task, where previous methods, such as Shap-E and Michelangelo, have struggled to scale effectively for generalized 3D generation." However, this argument is problematic and unclear to me for several reasons.

  • Using DiT-like architecture for 3D generation is not new: The use of the DiT framework for 3D generation has already been explored in various papers. More importantly, a key aspect of Direct3D's usage of the DiT framework is its use of tri-plane features, which allow for the representation of 3D objects using 2D features and thus enable using "off-the-shelf 2D generators" for 3D generation [EG3D]. This approach is fairly standard in 3D tasks. What makes you consider this to be novel and worthy of being highlighted as your contribution?

  • Scalability of Shap-E vs. Direct3D: You claim that "previous methods, such as Shap-E and Michelangelo, have struggled to scale effectively for generalized 3D generation." However, Shap-E is trained on several million 3D models [Shap-E, Point-E], while Direct3D is trained on only 660K 3D models. Shap-E has actually demonstrated good scalability, while your work has not provided comparable evidence. Furthermore, many previous methods are trained on similar amounts of data (i.e., hundreds of thousands of models). What specific advantages in scalability does your approach have compared to others, and where is this proven in your work?

  • You also mentioned that "the innovative nature of our work has been recognized by Reviewer 3ERi, ..." Upon reviewing their comments, it appears that Reviewer 3ERi actually stated that "this idea is not new" and that "the proposed DiT pipeline cannot be considered a contribution." It does not seem that Reviewer 3ERi acknowledges the novelty or significance of your contribution.

Comment

Thank you for your feedback. We respectfully disagree with the suggestion to rename our method and the assertion that our work lacks novelty.

We must clarify again that the CVPR 24 paper 'DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data' was not publicly available or officially released by the conference program at the time of our NeurIPS submission deadline. Consequently, our method name was chosen without any awareness of the CVPR paper's title. While we are open to incorporating a discussion of the paper in our revised manuscript, we maintain that the rename suggestion does not align with the principles of fair and professional peer review.

Regarding the methodology, we acknowledge that the building blocks, such as the transformer encoder, triplane representation, and DiT, have been presented in previous literature. However, our innovation is the seamless integration of these elements into a novel framework that delivers scalable and high-fidelity 3D generation, which is achieved by:

  • Efficient 3D shape encoding via triplane latent.
  • Detailed geometry recovery through direct 3D supervision and sampling.
  • Scalable 3D latent generation using DiT.
  • Image-consistent 3D shape generation with semantic and pixel-level image conditions.

Our approach has outperformed leading methods such as InstantMesh and Michelangelo in both quantitative and qualitative assessments, highlighting the significance of the scalable architecture design for native 3D generation.

We believe innovations in scalable architecture design, as seen in works that integrate diffusion with transformers (DiT) or DiT with video data (Sora), are pivotal for progress in the field of generative AI. We feel sorry that Reviewer yUww does not value innovations of this kind, but we are optimistic that our contribution will be favorably recognized by the academic community.

Comment

Thank you very much for your response. I apologize if I wasn't clear in my previous message. Let me clarify my concerns.

I appreciate your results on Objaverse (your A2) and agree with your clarification regarding the naming issue in the CVPR24 paper (your A4). I don't think the naming is an issue here.

However, my primary concern remains with the first question, regarding novelty. I still don't feel that you've answered my questions: Why do you consider the "unification between the latent DiT framework and the 3D content creation task" to be novel, given that many previous papers have explored similar concepts? Additionally, how do you demonstrate scalability, especially when the method you're comparing in your response (Shap-E) uses significantly more data for training?

Regarding your latest response, here are my comments:

  • "Efficient 3D shape encoding via triplane latent": The use of triplane encoding for 3D shapes has been widely adopted since EG3D was published at CVPR 2022.

  • "Detailed geometry recovery through direct 3D supervision and sampling": This approach seems reasonable to me, particularly when compared to others that rely on projected 2D images for supervision. (But not new I think)

  • "Scalable 3D latent generation using DiT": What specific advantages in scalability does your approach have compared to others, and where is this demonstrated in your work? From my perspective, using DiT for 3D generation is not novel, nor does it necessarily suggest good scalability.

  • "Image-consistent 3D shape generation with semantic and pixel-level image conditions": As I mentioned in my first review, i.e., Weaknesses 1.2 and 1.3 (which you did not answer), the semi-continuous occupancy, compared with standard 3D supervision, only adds a continuous value between [0,1] near the boundary. Additionally, the semantic-level 3D-specific image conditioning is provided by DINO. While this is reasonable, it is not novel, as using DINO features for supervision or conditioning in 3D tasks is common in previous works like 3D-Fauna [CVPR24], MagicPony [CVPR23], etc.

Comment

Thank you for your response. We carefully respond to your comments below.

  • Although triplane representations are widely utilized in the field of neural rendering, to the best of our knowledge, our proposed D3D-VAE is the first to introduce triplane latent space for direct 3D supervision (i.e. occupancy). Compared to recent works like 3DShape2VecSet and Michelangelo, which employ implicit 1D latent space, as well as rendering-based methods like LN3Diff, our D3D-VAE achieves superior reconstruction quality.

  • A limitation of using occupancy for direct 3D supervision of meshes is the abrupt gradient variation at query points close to the object surface, which can lead to training instability and the potential for holes in relatively thin objects. Our semi-continuous surface sampling strategy alleviates this issue to some extent, enhancing the reconstruction quality, particularly for thin or intricate meshes. The quantitative results are shown in Table 3 of the uploaded PDF.

  • Regarding our D3D-DiT, we found that solely utilizing the semantic-level conditions provided by CLIP may yield reasonable results; however, the generated meshes cannot align well with the conditional images. Therefore, we introduced pixel-level conditions using DINO, which significantly improves the quality of the generated meshes, allowing for a more detailed and accurate reconstruction of the conditional image input. We believe precise image control is crucial for 3D generation, and our D3D-DiT is specially tailored for integrating the image conditions with the state-of-the-art DiT architecture.

  • From our perspective, the scalability of a model encompasses not only the incorporation of a large dataset for training but also whether the model's performance and generalization capabilities improve when trained on additional data. Although Shap-E has utilized a larger dataset for training, the results presented in the table of A2, as well as Figures 4 and 5 in the original manuscript, indicate that our Direct3D achieves superior generation quality. Furthermore, Shap-E does not generalize well on some straightforward cases (e.g., the dragon and the sunglasses in Figure 5 of the original manuscript). Additionally, the table in A2 illustrates that, compared to training solely on Objaverse, our model shows improved performance when incorporating internal data for training, which validates the scalability of our approach.

Comment

Thank you for your summary. I agree that the proposed method is reasonable. However, as I mentioned earlier, the key components of the proposed method are not new and cannot be seen as a contribution. For instance, using a tri-plane to convert 3D models into 2D-like features and then using a 2D generator for 3D generation has been widely adopted since EG3D. Similarly, training a VAE for 3D supervision (but not in the tri-plane space), or utilizing DINO for improved features, has been already explored in previous papers. I appreciate the authors' efforts during the rebuttal period, but I still lean toward borderline rejection.

Review
Rating: 5

Summary: This manuscript focuses on the task of 3D generation, including image-to-3D and text-to-3D tasks. The framework is structured in two stages. In the first stage, a geometry VAE is trained using 3D ground-truth occupancy as supervision. Triplane representations are leveraged as the explicit 3D format, which are acquired through a Transformer processing dense point clouds. A decoder is then used to map the interpolated features into occupancy fields. In the second stage, conditional images are encoded using CLIP and DINO to provide semantic-level and pixel-level conditions, respectively. DiT blocks are utilized to denoise the latent space of the geometry VAE. Extensive experiments validate the effectiveness of the proposed framework, demonstrating promising capabilities in conditional 3D generation.
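To make the first stage concrete, the sketch below illustrates how a triplane decoder of this kind typically maps a query point to occupancy: the point is projected onto the three axis-aligned planes, features are bilinearly sampled and aggregated, and a small MLP predicts occupancy. This is a generic PyTorch illustration of the idea summarized above, not the authors' D3D-VAE code; the aggregation scheme and layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneOccupancyDecoder(nn.Module):
    """Sample three feature planes at projected query coordinates, then predict occupancy."""
    def __init__(self, channels=32, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * channels, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, planes, xyz):
        # planes: (B, 3, C, H, W) triplane latent; xyz: (B, N, 3) query points in [-1, 1]
        feats = []
        for i, dims in enumerate([(0, 1), (0, 2), (1, 2)]):  # XY, XZ, YZ projections
            coords = xyz[..., dims].unsqueeze(1)                         # (B, 1, N, 2)
            f = F.grid_sample(planes[:, i], coords, align_corners=True)  # (B, C, 1, N)
            feats.append(f.squeeze(2).transpose(1, 2))                   # (B, N, C)
        return torch.sigmoid(self.mlp(torch.cat(feats, dim=-1)))         # (B, N, 1) occupancy
```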

Strengths

1. The proposed Semi-continuous Surface Sampling technique is technically sound.
2. The DiT model is innovatively applied for the first time in conditional 3D generation.
3. The manuscript is well-organized and clearly written, facilitating easy comprehension of the complex concepts discussed.
4. The overall framework exhibits promising capabilities in conditional 3D generation, suggesting strong potential for future applications and research.

Weaknesses

1. The manuscript lacks a robust quantitative evaluation of the geometry VAE and of conditional generation, particularly in comparison with state-of-the-art models and within its own ablation study. For instance, while the effectiveness of the explicit triplane latent representation and the semi-continuous surface sampling strategy is demonstrated, it is shown only through the visualization of three or four examples without any quantitative metrics. However, these results could be quantitatively evaluated using the Chamfer Distance by comparing the reconstructed mesh with the ground-truth (GT) mesh, both of which are readily accessible. Additionally, as other methods also utilize 3D GT for supervision, a quantitative comparison using the Chamfer Distance should be provided to establish the proposed method's competitiveness and efficacy more definitively.

2. The manuscript currently lacks sufficient detail in the "Implementation Details" section, particularly regarding the near-surface sampling strategy. For the sake of reproducibility, it is crucial to provide more information:

   + Sampling strategy: what specific strategy is employed for near-surface sampling?
   + Is the loss weight of the BCE loss the same across near-surface points and uniformly distributed points?
   + How many points are sampled per iteration?

Questions

1. I recommend a detailed discussion of how pixel-level and semantic-level conditions are implemented, particularly since both conditions are introduced at the token level. The distinction between the two is crucial for understanding the underlying mechanisms of the model. Specifically, the manuscript mentions that the CLIP token is integrated using AdaLayerNorm, while the DINO token is incorporated through self- and cross-attention. It would be beneficial for the manuscript to elaborate on why these specific implementations represent pixel-level and semantic-level conditions, respectively. Such a clarification will not only enhance the clarity of the manuscript but also substantiate the design choices made in the architecture, providing readers with a deeper understanding of how each contributes to the model's performance.

2. I would like to see a detailed discussion regarding the differences between the vector set representation used in 3DShape2VecSet and the triplane representation employed as the latent representation within the framework. This discussion should include an analysis of how each representation impacts the training of the geometry VAE and the conditional generation process. Such a comparative analysis would not only clarify the choice of representation in the current framework but also provide valuable insights for future research in 3D object generation. This discussion could significantly strengthen the manuscript by providing a clearer justification for the architectural decisions and their implications on the model's performance.

Limitations

Yes

Author Response

Thank you for your insightful feedback and thorough review of our paper. We carefully respond to each of the concerns and questions below.

[Q1]: Quantitative comparison with state-of-the-art methods.

[A1]: To ensure a fair comparison with other methods, we re-train our D3D-VAE and D3D-DiT on the Objaverse dataset. Following prior approaches such as InstantMesh and Wonder3D [1], we adopt the Google Scanned Objects dataset for evaluation, and report the Chamfer Distance, Volume IoU and F-Score to compare the quality of the generated meshes. The results are presented in the table below, which illustrates that generative-based methods like Shap-E and Michelangelo exhibit significantly lower accuracy compared to reconstruction-based methods such as InstantMesh. Nonetheless, our Direct3D achieves state-of-the-art performance across all metrics. Integrating our internal data for training further enhances the model’s performance, validating the scalability of our approach. As we replied in the global rebuttal part, we will release the model trained on the Objaverse dataset upon paper acceptance to further promote openness and reproducibility in our research.

Methods | Chamfer Distance ↓ | Volume IoU ↑ | F-Score ↑
Shap-E | 0.0585 | 0.2347 | 0.3474
Michelangelo | 0.0441 | 0.1260 | 0.4371
One-2-3-45 | 0.0513 | 0.2868 | 0.3149
InstantMesh | 0.0327 | 0.4105 | 0.5058
Ours (trained on Objaverse) | 0.0296 | 0.4307 | 0.5356
Ours (trained on internal data) | 0.0271 | 0.4323 | 0.5624

[Q2]: Quantitative metrics of the explicit triplane latent representation.

[A2]: To validate the effectiveness of the explicit latent representation we employed, we established a validation set from the Objaverse dataset that does not overlap with the training set. We compared the reconstruction metrics of the VAE using the explicit latent representation against methods such as 3DShape2VecSet that utilize an implicit 1D latent space. As shown in the table below, the explicit latent representation outperforms the implicit 1D latent space across all metrics.

Methods | Chamfer Distance ↓ | Volume IoU ↑ | F-Score ↑
Implicit 1D Space (3DShape2VecSet) | 0.0057 | 0.8794 | 0.9416
Explicit Triplane (ours) | 0.0042 | 0.9409 | 0.9835

[Q3]: Quantitative metrics of the semi-continuous surface sampling strategy.

[A3]: We also conducted ablation experiments of the semi-continuous surface sampling strategy on the Objaverse evaluation set. As shown in the table below, our proposed method demonstrates a notable improvement in the reconstruction quality of VAE.

Methods | Chamfer Distance ↓ | Volume IoU ↑ | F-Score ↑
w/o semi-continuous sampling | 0.0060 | 0.8723 | 0.9192
w/ semi-continuous sampling | 0.0057 | 0.8794 | 0.9416

[Q4]: Implementation details of near-surface sampling strategy.

[A4]: As mentioned in L258 of the original manuscript, during the training of the VAE we sampled 20,480 uniform points and 20,480 near-surface points for supervision. For the near-surface sampling strategy, we first randomly sample points on the surface of the mesh, then add a Gaussian perturbation with a std of 0.02 to each point to obtain the near-surface points. During training, the weights of the BCE loss for both the near-surface points and the uniformly distributed points were set to 1.
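For readers trying to reproduce this, the sampling step might look like the sketch below: uniform points in the normalized bounding volume plus surface samples perturbed by Gaussian noise with std 0.02, labeled with occupancy that is binary away from the surface and continuous inside a thin band. The band width and the exact semi-continuous mapping are placeholders for illustration; the paper, not this thread, defines the precise scheme.

```python
import numpy as np
import trimesh

def sample_training_points(mesh, n_uniform=20480, n_near=20480, noise_std=0.02, band=0.01):
    """Sample query points and (assumed) semi-continuous occupancy labels for VAE supervision."""
    # Uniform points inside the normalized bounding box [-0.5, 0.5]^3.
    uniform = np.random.uniform(-0.5, 0.5, size=(n_uniform, 3))
    # Near-surface points: surface samples plus Gaussian perturbation (std 0.02).
    surface, _ = trimesh.sample.sample_surface(mesh, n_near)
    near = np.asarray(surface) + np.random.normal(scale=noise_std, size=(n_near, 3))

    points = np.concatenate([uniform, near], axis=0)
    sdf = trimesh.proximity.signed_distance(mesh, points)  # positive inside for watertight meshes

    # Binary occupancy away from the surface, linear ramp inside a thin band (assumed form).
    occ = (sdf > 0).astype(np.float32)
    in_band = np.abs(sdf) < band
    occ[in_band] = 0.5 + 0.5 * sdf[in_band] / band
    return points, occ
```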

[Q5]: A detailed discussion of how pixel-level and semantic-level conditions are implemented.

[A5]: We integrated the CLIP condition using cross-attention, a common approach in conditional diffusion models. However, we observed that meshes generated solely based on these semantic-level conditions lacked detailed alignment with the input image. To address this, we introduced additional pixel-level conditions using DINO, which previous research [2] has shown to outperform other pre-trained models in extracting structural information for 3D tasks. Specifically, we concatenated image tokens extracted by DINO with the noisy latent tokens and fed them into the self-attention layer, differing from the CLIP conditioning. This approach significantly enhances generation quality, allowing for a more detailed recovery of the conditioned input. We have conducted a qualitative comparison in Section A.3 and Figure 9 of the appendix.
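To make this wiring concrete, here is a hedged sketch of a single transformer block along the lines described above: DINO patch tokens are concatenated with the noisy latent tokens before self-attention (pixel-level), while CLIP tokens are injected via cross-attention (semantic-level). Timestep/AdaLN modulation is omitted, and the layer sizes, normalization placement, and token shapes are assumptions rather than the released D3D-DiT code.

```python
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    """One block: DINO tokens join self-attention; CLIP tokens enter through cross-attention."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, latent_tokens, dino_tokens, clip_tokens):
        # latent_tokens: (B, N, D) noisy triplane tokens
        # dino_tokens:   (B, M, D) pixel-level condition, concatenated before self-attention
        # clip_tokens:   (B, K, D) semantic-level condition, injected via cross-attention
        x = torch.cat([latent_tokens, dino_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, clip_tokens, clip_tokens, need_weights=False)[0]
        x = x + self.mlp(self.norm3(x))
        return x[:, :latent_tokens.shape[1]]  # drop the appended DINO tokens, keep latent tokens
```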

[Q6]: A detailed discussion regarding the differences between the vector set representation used in 3DShape2VecSet and the triplane representation employed as the latent representation.

[A6]: Thanks for your valuable comment. In [A2], we conducted a quantitative comparison of the VAE reconstruction accuracy between the explicit latent representation and the vector set representation employed by 3DShape2VecSet. The results indicate that our explicit latent representation outperforms the vector set representation across various metrics, including Chamfer Distance, Volume IoU, and F-Score. In addition to geometric quality, another advantage of the explicit latent representation is its significant improvement in training efficiency for diffusion models. A recent work CLAY [3] also utilizes the vector set representation, and the training time of their diffusion model is approximately ten times longer than ours. Given the extensive computational requirements, with training necessitating thousands of GPU hours, it is challenging for us to complete the experiments within such a short timeframe. We will include a detailed discussion and comparative results in the final version of the paper.

Reference:

[1]. Long et al., Wonder3D: Single Image to 3D using Cross-Domain Diffusion.

[2]. Banani et al., Probing the 3d awareness of visual foundation models.

[3]. Zhang et al., CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets.

Comment

Thanks to the authors for the rebuttal, especially for providing additional quantitative comparisons. I feel that some of my comments were addressed, and I have no further questions at the moment.

Comment

Thank you for taking the time and effort to review our response! If you have any further questions, please let us know and we will respond promptly!

Thank you once again for your time and attention.

Review
Rating: 6

This work presents Direct3D, a new approach for scalable image-to-3D generation using a 3D Latent Diffusion Transformer. This method enables the generation of high-quality 3D assets from text and images without the need for complex optimization techniques. Direct3D introduces a native 3D generative model that efficiently produces 3D shapes consistent with conditional image input. By incorporating semantic-level and pixel-level image conditions, Direct3D demonstrates the efficacy of its approach through detailed experimental results and discussions on limitations and theoretical considerations.

Strengths

  • Direct3D can handle in-the-wild input images without the need for multi-view diffusion models or complex optimization techniques.

  • Direct3D surpasses previous image-to-3D approaches in terms of generation quality and generalization ability.

  • Direct3D comprises two primary components - a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). These components work together to encode high-resolution 3D shapes into a compact latent space and model the distribution of encoded 3D latents, respectively.

Weaknesses

  • The model is trained on 500K internal data, which makes comparison with other methods unfair. It's good to see the quality improvement and a very interesting pipeline, but as an academic publication it's not acceptable to me, as a reviewer, that we don't know whether the improvement comes from the algorithm or from the newly added dataset. It's just not scientific.

  • There is very limited quantitative evaluation conducted to demonstrate the effectiveness, e.g., CLIP-score, LPIPS, FID.

  • Eq. (2) lacks the inputs of $L_{BCE}$ and $L_{KL}$, which causes trouble in understanding the pipeline.

  • Typo in Figure 2: "Nosiy Latent".

  • It would be better to denote the shapes of the noisy latent and the output of the DiT in Figure 2 to help readers better understand it.

Questions

What is the reconstruction error of the proposed 3D-VAE?

Limitations

My biggest concern comes from the usage of private data. While the algorithm looks novel and interesting (and I personally like it very much), I cannot recommend acceptance of a paper without a clear justification of where the improvement comes from. I'd love to change my mind if the authors are willing to 1) train Direct3D without internal data and put it in the main manuscript while leaving the internal-data version to the supplementary; or 2) open-source the dataset (including the categories of the dataset, sampled visual examples, and the way it was collected). I guess 1) is easier to achieve.

Author Response

Thank you for your insightful feedback and thorough review of our paper. We carefully respond to each of the concerns and questions below.

[Q1]: Training on Objaverse dataset and quantitative evaluation.

[A1]: Thanks for your comment. To ensure a fair comparison with other methods, we re-train our D3D-VAE and D3D-DiT on the Objaverse dataset. Following prior approaches such as InstantMesh and Wonder3D [1], we adopt the Google Scanned Objects dataset for evaluation, and report the Chamfer Distance, Volume IoU and F-Score to compare the quality of the generated meshes. The results are presented in the table below, which illustrates that generative-based methods like Shap-E and Michelangelo exhibit significantly lower accuracy compared to reconstruction-based methods such as InstantMesh. Nonetheless, our Direct3D achieves state-of-the-art performance across all metrics. Integrating our internal data for training further enhances the model’s performance, validating the scalability of our approach. Finally, we will release the model trained on the Objaverse dataset upon paper acceptance to further promote openness and reproducibility in our research.

Methods | Chamfer Distance ↓ | Volume IoU ↑ | F-Score ↑
Shap-E | 0.0585 | 0.2347 | 0.3474
Michelangelo | 0.0441 | 0.1260 | 0.4371
One-2-3-45 | 0.0513 | 0.2868 | 0.3149
InstantMesh | 0.0327 | 0.4105 | 0.5058
Ours (trained on Objaverse) | 0.0296 | 0.4307 | 0.5356
Ours (trained on internal data) | 0.0271 | 0.4323 | 0.5624

[Q2]: Eq. (2) lacks the inputs of $L_{BCE}$ and $L_{KL}$, which causes trouble in understanding the pipeline.

[A2]: Thanks for pointing this out. $L_{BCE} = \mathrm{BCE}\left(o_{pr}(\mathbf{x}),\, o_{gt}(\mathbf{x})\right)$, where $o_{pr}(\mathbf{x})$ denotes the predicted occupancy of the given points $\mathbf{x}$, which is the output of the decoder, and $o_{gt}(\mathbf{x})$ is the ground-truth semi-continuous occupancy. $L_{KL} = \mathrm{KL}(\mathbf{z})$, where $\mathbf{z}$ denotes the latent representation obtained by the encoder.
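Putting these definitions together, the objective in Eq. (2) presumably takes the standard VAE form shown below; the KL weight $\lambda$ is an assumed placeholder, since the exact weighting is not reproduced in this thread.

```latex
\mathcal{L}_{\text{D3D-VAE}}
  = \underbrace{\mathrm{BCE}\!\left(o_{pr}(\mathbf{x}),\, o_{gt}(\mathbf{x})\right)}_{L_{BCE}}
  + \lambda \,\underbrace{\mathrm{KL}\!\left(q(\mathbf{z}) \,\Vert\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right)}_{L_{KL}}
```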

[Q3]: Better to denote the shapes of the noisy latent and the output of the DiT in Figure 2 to help readers better understand it.

[A3]: Thank you for your valuable comment. We will reorganize Figure 2 in the final version of the paper to enhance its clarity.

[Q4]: What is the reconstruction error of the proposed 3D-VAE?

[A4]: Here we evaluate the reconstruction error of our D3D-VAE. For a given sampled point, we compute the Binary Cross-Entropy (BCE) between the VAE-predicted occupancy and the ground-truth occupancy as the reconstruction error. Our VAE employs an explicit triplane latent representation, and we conducted a quantitative comparison on our split Objaverse evaluation set with methods such as Michelangelo [2] and 3DShape2VecSet [3], which utilize an implicit 1D latent space. As shown in the table below, our VAE demonstrates superior reconstruction quality compared to the implicit 1D latent space.

Methods | Chamfer Distance ↓ | Volume IoU ↑ | F-Score ↑
Implicit 1D Space (3DShape2VecSet) | 0.0057 | 0.8794 | 0.9416
Explicit Triplane (ours) | 0.0042 | 0.9409 | 0.9835

Reference:

[1]. Long et al., Wonder3D: Single Image to 3D using Cross-Domain Diffusion.

[2]. Zhao et al., Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation.

[3]. Zhang et al., 3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models.

Comment

Thanks for the additional experimental results. I think my concerns have been resolved and I've increased my rating. Would love to see your open-sourced model/code soon.

Comment

Thank you for raising your rating! We appreciate your constructive review.

Comment

Dear authors,

This is a reminder about the open-source code/model you promised in the rebuttal. It has been several months since the rebuttal, and I think that should be enough time :)

Best,

Public Comment

Dear Reviewer HuD3,

Thank you for your reminder regarding the code and model release. Due to the company's open-source policy, our work is undergoing an in-depth internal review and code cleanup process to ensure compliance. We are making progress and expect to release the code and model within one month. I sincerely appreciate your patience and understanding.

Best regards,

Shuang Wu

Review
Rating: 4

Generating high-quality 3D assets from text and images has been difficult due to the lack of scalable 3D representations. This paper, Direct3D, addresses this by introducing a native 3D generative model that scales to real-world input images without needing multiview diffusion models or SDS optimization. The method includes two main components:

Direct 3D Variational Auto-Encoder (D3D-VAE): Encodes high-resolution 3D shapes into a compact latent triplane space, supervising the decoded geometry using a semi-continuous surface sampling strategy instead of relying on rendered images.

Direct 3D Diffusion Transformer (D3D-DiT): Models the distribution of encoded 3D latents and fuses positional information from the triplane latent's three feature maps, creating a scalable native 3D generative model for large datasets.

The approach also introduces an innovative image-to-3D generation pipeline that incorporates semantic and pixel-level conditions from images, enabling the generation of 3D shapes consistent with the provided images. Experiments show that Direct3D outperforms previous methods in generation quality and generalization, setting a new benchmark for 3D content creation.

Strengths

  1. The origin, generative-based 3D AIGC pipeline is absolutely missing. Though LRM-based methods have achieved great success in scaling up 3D models, it is still reconstruction-based and heavily relying on the first-stage MV-Diffusion for high-quality 3D objects synthesis. I really like the pipeline proposed here, and believe it is a firm step towards a converged 3D AIGC framework.

  2. The performance is good, the generated meshes are clean, and the user study shows the proposed method is better than Shap-E, InstantMesh, etc.

  3. This pipeline shows that scaling up a 3D DiT is also a feasible way towards high-quality AIGC when sufficient data is available.

Weaknesses

Though performance is good, I still have some concerns regarding the proposed method:

  1. This idea is not new. Shap-E / Meta 3DGen first verified the VAE + LDM pipeline, and recently LN3Diff (ECCV 24') also verified this pipeline on Objaverse with a tri-plane latent space. More differences / discussion with respect to previous methods should be provided; e.g., compared with LN3Diff, are there any differences beyond "combines the advantages of explicit 3D latent representation and direct 3D supervision to achieve high-quality VAE reconstruction"? LN3Diff also has an explicit tri-plane latent space and direct depth supervision.
  2. The proposed VAE requires high-quality 3D point clouds and meshes as the ground truth, and can only model geometry information. Compared against LN3Diff and CLAY, which can also model texture, this is definitely a drawback. Another problem of the proposed design is that Direct3D cannot benchmark its performance on the GSO dataset with PSNR/LPIPS, which is a canonical evaluation pipeline for existing 3D AIGC methods.
  3. No thorough quantitative comparison against existing methods, just a user study. MMD/COV are also reasonable metrics if no texture is available.
  4. The proposed DiT pipeline cannot be seen as a contribution, since it shares a very similar design with DiT-3D / PixArt.

Questions

  1. What data is Direct3D trained on? I do wonder whether the comparison is fair, since the paper did not elaborate on the details of the data filtering pipeline / dataset size. If Direct3D wants to show its advantages against existing methods, it should be trained on the same dataset (e.g., Objaverse) and a thorough comparison should be conducted.

  2. Whether you can include comparison with other 3D origin generative-based methods, e.g., LN3Diff and CraftsMan? Since this line of work is very sparse, I think a more thorough comparison is well required.

  3. Whether the trained models will be released?

Limitations

Overall, though the paper shows great scalability and astounding performance on 3D AIGC using a DiT pipeline, I still have many concerns regarding this paper. I hope the authors can address my concerns during the rebuttal stage, and I am very happy to increase my rating.

Author Response

Thank you for your insightful feedback and thorough review of our paper. We carefully respond to each of the concerns and questions below.

[Q1]: The distinctions between our Direct3D and LN3Diff.

[A1]: We thank reviewer 3ERi for bringing up the concurrent work. While LN3Diff was accepted to ECCV after the NeurIPS submission deadline, we are happy to discuss the work and will add this part to our final version.

Although both LN3Diff and Direct3D utilize the triplane latent, the VAEs of these two methods are fundamentally different. As LN3Diff uses rendered RGB images as input, it is challenging for its VAE to encode high-fidelity geometric information compared to Direct3D, which directly takes dense point clouds with normal information as input. Furthermore, while our D3D-VAE employs direct supervision on the ground-truth mesh, LN3Diff relies on supervision from rendered depth maps, which strictly speaking cannot be considered direct 3D supervision. While this approach may lead to smoother generated mesh surfaces, it still struggles to capture the intrinsic structural information of the underlying 3D data.

[Q2]: Training on Objaverse dataset and quantitative evaluation.

[A2]: We re-train our model on the Objaverse dataset and conduct quantitative comparison with other methods, please refer to the global rebuttal or Table 1 of the uploaded PDF.

[Q3]: The requirement of high-quality 3D point clouds, and the fact that only geometry information can be modeled.

[A3]: Direct3D is specifically focused on geometry generation, and we deliberately decouple geometry from texture to streamline the learning process for better geometry generation. In fact, we regard the exclusion of texture generation not as a drawback but as an advantage of the proposed pipeline: the separation introduces more flexibility in the texture generation part, as we can immediately apply advanced texture generation methods such as SyncMVD to our pipeline. It is noteworthy that CLAY also employs a similar strategy of separating geometry and texture, using a post-processing technique of normal-conditioned MVDream [1] for texture generation. Similar texture generation methods can easily be integrated into our approach if necessary.

[Q4]: The proposed DiT shares very similar design with DiT-3D / PixArt.

[A4]: We apologize for any confusion raised in the manuscript, and we will provide more details in the final version. The proposed DiT is fundamentally different from DiT-3D and PixArt. DiT-3D is specifically designed for 3D voxel data and utilizes 3D attention to capture spatial relationships. In contrast, our DiT is designed for the triplane latent space, which we treat as 2D images, as illustrated in Section 3.2 of the main manuscript, and therefore does not require 3D attention. Regarding PixArt, it is designed for 2D images and combines cross-attention with DiT to handle conditional information. However, we observe that meshes generated solely based on these semantic-level conditions lack detailed alignment with the input image. To address this, we introduced additional pixel-level conditions using DINO, which previous work [2] has shown to outperform other pre-trained vision models in extracting structural information beneficial for 3D tasks. Specifically, we concatenate the image tokens extracted by DINO with the noisy latent tokens and feed them into the self-attention layer. This approach significantly enhances generation quality, allowing for a more detailed recovery of the conditional input. We believe this finding is crucial for advancing 3D generation research, as it explores the integration of image conditions with the current state-of-the-art DiT architecture.

[Q5]: Whether you can include comparison with other 3D origin generative-based methods, e.g., LN3Diff and CraftsMan?

[A5]: As LN3Diff only released the text-to-3D model, it is challenging for us to train and re-implement it from scratch within such a short timeframe. Therefore, we apologize for not being able to present a comparative analysis with LN3Diff and we will make every effort to train LN3Diff and include a comparison in our final version. Additionally, CraftsMan was publicly released after the NeurIPS submission deadline (on 23 May, 2024), yet we still conducted a comparison, with the results presented in Figure 1 of the uploaded PDF. It can be observed that our Direct3D demonstrates superior generalization compared to CraftsMan, and the generated meshes exhibit greater detail. On one hand, the VAE employed by CraftsMan still utilizes the implicit 1D latent space proposed by 3DShape2VecSet [3], resulting in lower reconstruction quality compared to the explicit latent representation that we have adopted. We conduct a quantitative comparison of the VAE with different latent representations in the Objaverse evaluation set, and the results are presented in the table below. On the other hand, CraftsMan relies on the multi-view diffusion model for conditional generation, leading to significant instability in their results.

Methods | Chamfer Distance ↓ | Volume IoU ↑ | F-Score ↑
Implicit 1D Space (3DShape2VecSet) | 0.0057 | 0.8794 | 0.9416
Explicit Triplane (ours) | 0.0042 | 0.9409 | 0.9835

[Q6]: Whether the trained models will be released?

[A6]: Yes, we will release the models trained on the Objaverse dataset upon paper acceptance to promote openness and reproducibility in our research.

Reference:

[1]. Shi et al., MVDream: Multi-view Diffusion for 3D Generation.

[2]. Banani et al., Probing the 3d awareness of visual foundation models.

[3]. Zhang et al., 3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models.

Comment

Thanks to the Direct3D authors for the helpful discussion. Though the proposed method and writing were not sound enough at submission, after adding the promised experiments and clarifications here, the proposed method could be considered a NeurIPS paper. However, since these results are missing from the submission and the Objaverse-version model was trained after the paper submission, I still wonder whether this paper is sound enough to be accepted in this round. I will keep my score as is and leave the decision to the AC; personally, I would prefer to reject this paper now and see a polished submission in the next round with all the promised experiments included.

Comment

Thank you very much for your thoughtful feedback and for acknowledging our efforts during the rebuttal period. We would like to clarify again that LN3Diff (ECCV 24) and our Direct3D are concurrent efforts, and the image-to-3D model of LN3Diff was released after the conclusion of the rebuttal phase. Additionally, the CraftsMan paper (arXiv 24) was made public only after the NeurIPS submission deadline. Therefore, we respectfully disagree with the view that the absence of these experiments constitutes a weakness of our manuscript. We appreciate your consideration of our perspective. Thank you.

Author Response

We thank all the reviewers for their constructive and insightful suggestions. We are encouraged that all reviewers consider the proposed method novel and interesting:

  • Reviewer 3ERi: "It is a firm step towards a converged 3D AIGC framework"

  • Reviewer HuD3: "The algorithm looks novel and interesting"

  • Reviewer U1iF: "The DiT model is innovatively applied for the first time in conditional 3D generation"

  • Reviewer yUww: "Exploring the DiT model in 3D tasks is interesting"

  • Reviewer FNua: "A novel 3D diffusion transformer to generate triplane token"

Meanwhile, we acknowledge the reviewers' primary concern regarding the model training on our private dataset. To address this, we have retrained our model on the Objaverse dataset, which is publicly available and widely adopted in recent generative 3D research. This action ensures our methodology is transparent and reproducible. Following prior approaches such as InstantMesh and Wonder3D [1], we adopt the Google Scanned Objects dataset for evaluation, and report the Chamfer Distance, Volume IoU and F-Score to compare the quality of the generated meshes. The results are presented in the table below, which illustrates that generative-based methods like Shap-E and Michelangelo exhibit significantly lower accuracy compared to reconstruction-based methods such as InstantMesh. Nonetheless, our Direct3D achieves state-of-the-art performance across all metrics. Despite a performance decrease, the Objaverse-trained Direct3D model still significantly surpasses previous SOTA methods, confirming its superior performance and scalability. Additionally, we will release the model trained on the Objaverse dataset upon paper acceptance to further promote openness and reproducibility in our research.

Methods | Chamfer Distance ↓ | Volume IoU ↑ | F-Score ↑
Shap-E | 0.0585 | 0.2347 | 0.3474
Michelangelo | 0.0441 | 0.1260 | 0.4371
One-2-3-45 | 0.0513 | 0.2868 | 0.3149
InstantMesh | 0.0327 | 0.4105 | 0.5058
Ours (trained on Objaverse) | 0.0296 | 0.4307 | 0.5356
Ours (trained on internal data) | 0.0271 | 0.4323 | 0.5624

We have addressed the other concerns raised by each reviewer in the corresponding official comments. Finally, we would like to express our sincere gratitude to the reviewers and remain open to further improvements in all aspects of our work.

Reference:

[1]. Long et al., Wonder3D: Single Image to 3D using Cross-Domain Diffusion.

Final Decision

This paper received two borderline rejects, two weak accepts, and one borderline accept. The AC has carefully read the paper, as well as the reviews, the rebuttal, and the discussion between reviewers and authors.

The biggest concern among reviewers is the reliance on internal data, which prevents others from faithfully assessing the effectiveness and contribution of this paper. This concern has been addressed by providing experimental results on Objaverse.

Another concern is the novelty of the proposed architecture. While reviewer yUww still holds concerns after the discussion, reviewers 3ERi (addressed after discussion), HuD3, U1iF, and FNua agree it is innovative, being an early DiT-like structural exploration in direct 3D generation. The AC agrees with reviewers 3ERi, HuD3, U1iF, and FNua on this point.

There are other concerns regarding insufficient comparisons, evaluation metrics, and ablation studies. The authors have provided corresponding responses, and the AC believes the strengths of this paper outweigh its weaknesses.

Therefore, the AC believes this paper is insightful for the relevant community, and the decision is accept. In the meantime, the authors are required to revise the paper according to the discussion between authors and reviewers, especially the inclusion of results on Objaverse.